# Preprocess the MS MARCO passage dataset (train qrels and queries) for T5-base doc2query training

In [3]:
import os
import pickle

import numpy as np
import pandas as pd

In [12]:
MS_MARCO_PASSAGE_BASE_FOLDER="/mnt/0060f889-4c27-409b-b0de-47f5427515e3/unicamp/ia368v_dd/pyserini/collections/msmarco-passage"

QRELS_TRAIN_FILENAME="qrels.train.tsv"
QUERIES_TRAIN_FILENAME="queries.train.tsv"

COLLECTION_FILENAME="collection.tsv"

In [16]:
pd.set_option('display.max_colwidth', None)

In [7]:
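# qrels rows follow the TREC layout: query id, unused iteration column, passage id, relevance label
qrels_df = pd.read_csv(
    os.path.join(MS_MARCO_PASSAGE_BASE_FOLDER, QRELS_TRAIN_FILENAME),
    sep='\t',
    header=None,
    names=['query_id', 'Q0', 'passage_id', 'relevance'],
)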

In [8]:
qrels_df

Unnamed: 0,query_id,Q0,passage_id,relevance
0,1185869,0,0,1
1,1185868,0,16,1
2,597651,0,49,1
3,403613,0,60,1
4,1183785,0,389,1
...,...,...,...,...
532756,19285,0,8841362,1
532757,558837,0,4989159,1
532758,559149,0,8841547,1
532759,706678,0,8841643,1


In [11]:
qrels_df['relevance'].unique()

array([1])
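
Since `relevance` holds a single value, it carries no information for training (the `Q0` column looks constant in the preview as well). A minimal sketch of how constant columns could be detected and dropped, shown on a toy frame that reuses a few rows from the output above; the names `toy_qrels`, `constant_cols`, and `slim_qrels` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for qrels_df, copied from the first rows shown above
toy_qrels = pd.DataFrame({
    'query_id': [1185869, 1185868, 597651],
    'Q0': [0, 0, 0],
    'passage_id': [0, 16, 49],
    'relevance': [1, 1, 1],
})

# A column with exactly one distinct value is constant
constant_cols = [col for col in toy_qrels.columns if toy_qrels[col].nunique() == 1]

# Keep only the columns that actually vary between rows
slim_qrels = toy_qrels.drop(columns=constant_cols)

print(constant_cols)
print(list(slim_qrels.columns))
```

Whether to drop these columns on the full frame is a design choice; keeping them costs memory but preserves the original qrels layout.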

In [9]:
# Each query row is: query id, query text
queries_df = pd.read_csv(
    os.path.join(MS_MARCO_PASSAGE_BASE_FOLDER, QUERIES_TRAIN_FILENAME),
    sep='\t',
    header=None,
    names=['query_id', 'text'],
)

In [10]:
queries_df

Unnamed: 0,query_id,text
0,121352,define extreme
1,634306,what does chattel mean on credit history
2,920825,what was the great leap forward brainly
3,510633,tattoo fixers how much does it cost
4,737889,what is decentralization process.
...,...,...
808726,633855,what does canada post regulations mean
808727,1059728,wholesale lularoe price
808728,210839,how can i watch the day after
808729,908165,what to use instead of pgp in windows


In [13]:
# Each collection row is: passage id, passage text
collection_df = pd.read_csv(
    os.path.join(MS_MARCO_PASSAGE_BASE_FOLDER, COLLECTION_FILENAME),
    sep='\t',
    header=None,
    names=['passage_id', 'text'],
)

In [17]:
collection_df

Unnamed: 0,passage_id,text
0,0,The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
1,1,The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2,2,Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.
3,3,"The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 â¦ 2-1946 under the control of the U.S. Army Corps of Engineers, under the administration of General Leslie R. Groves."
4,4,"versions of each volume as well as complementary websites. The first websiteâThe Manhattan Project: An Interactive Historyâis available on the Office of History and Heritage Resources website, http://www.cfo. doe.gov/me70/history. The Office of History and Heritage Resources and the National Nuclear Security"
...,...,...
8841818,8841818,"When metal salts emit short wavelengths of visible light in the range of 400 to 500 nanometers, they produce violet and blue colors. When metal salts emit longer wavelengths of visible light in the range of 600 to 700 nanometers, they produce orange and red colors."
8841819,8841819,"Thousands of people across the United States will be celebrating Independence Day on July 4 by attending a fireworks display. The red, orange, yellow, green, blue and purple colors exploding in the night sky during a fireworks display are created by the use of metal salts. Photo credit: Jeff Golden. In chemistry, a salt is defined as an ionic compound that is formed from the reaction of an acid and a base."
8841820,8841820,"The recipe that creates blue, for example, includes varying amounts of copper chloride compounds, while red comes from strontium and lithium salts. Just like paints, secondary colors are made by mixing the ingredients of their primary-color relatives. A mixture of copper (blue) and strontium (red) makes purple."
8841821,8841821,"On Independence Days of yore, old-timey crowds were dazzled by a mere sprinkle or two of off-white lights. Later generations oohed-and-aahed at more colorful displays, as chemical combos were developed that could light up the sky in Technicolor. The march of progress in pyrotechnics didn't stop there, though."
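
One thing that may be worth checking when reading free text with `pd.read_csv`: by default pandas treats `"` as a quote character, so double quotes inside a passage are interpreted rather than kept verbatim. The sketch below uses a made-up two-row TSV (the variable names and the sample text are assumptions for illustration) to show that passing `quoting=csv.QUOTE_NONE` keeps the text literal. Whether this matters for `collection.tsv` is not verified here.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV whose first passage starts with a quoted word
toy_tsv = '0\t"Yowie" is one of several names\n1\tSecond passage\n'

# Default settings: the double quotes are parsed as field quoting
default_df = pd.read_csv(
    io.StringIO(toy_tsv), sep='\t', header=None, names=['passage_id', 'text']
)

# QUOTE_NONE: every character between the tabs is kept as-is
literal_df = pd.read_csv(
    io.StringIO(toy_tsv),
    sep='\t',
    header=None,
    names=['passage_id', 'text'],
    quoting=csv.QUOTE_NONE,
)

print(default_df.loc[0, 'text'])
print(literal_df.loc[0, 'text'])
```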


In [18]:
queries_df[queries_df['query_id'] == 1185869]

Unnamed: 0,query_id,text
486744,1185869,)what was the immediate impact of the success of the manhattan project?


In [19]:
# Attach the query text to every (query, passage) relevance judgment
merged_df = qrels_df.merge(queries_df, on='query_id', how='inner')

In [20]:
merged_df

Unnamed: 0,query_id,Q0,passage_id,relevance,text
0,1185869,0,0,1,)what was the immediate impact of the success of the manhattan project?
1,1185868,0,16,1,"_________ justice is designed to repair the harm to victim, the community and the offender caused by the offender criminal act. question 19 options:"
2,597651,0,49,1,what color is amber urine
3,403613,0,60,1,is autoimmune hepatitis a bile acid synthesis disorder
4,1183785,0,389,1,elegxo meaning
...,...,...,...,...,...
532756,19285,0,8841362,1,anterolisthesis definition
532757,558837,0,4989159,1,what are fishing flies
532758,559149,0,8841547,1,what are fsh levels during perimenopause
532759,706678,0,8841643,1,what is a yowie


In [21]:
# Attach the passage text; the two 'text' columns receive the default
# suffixes: text_x (query) and text_y (passage)
merged_df = merged_df.merge(collection_df, on='passage_id', how='inner')

In [22]:
merged_df

Unnamed: 0,query_id,Q0,passage_id,relevance,text_x,text_y
0,1185869,0,0,1,)what was the immediate impact of the success of the manhattan project?,The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
1,1185868,0,16,1,"_________ justice is designed to repair the harm to victim, the community and the offender caused by the offender criminal act. question 19 options:","The approach is based on a theory of justice that considers crime and wrongdoing to be an offense against an individual or community, rather than the State. Restorative justice that fosters dialogue between victim and offender has shown the highest rates of victim satisfaction and offender accountability."
2,597651,0,49,1,what color is amber urine,"Colorâurine can be a variety of colors, most often shades of yellow, from very pale or colorless to very dark or amber. Unusual or abnormal urine colors can be the result of a disease process, several medications (e.g., multivitamins can turn urine bright yellow), or the result of eating certain foods."
3,403613,0,60,1,is autoimmune hepatitis a bile acid synthesis disorder,Inborn errors of bile acid synthesis can produce life-threatening cholestatic liver disease (which usually presents in infancy) and progressive neurological disease presenting later in childhood or in adult life.he neurological presentation often includes signs of upper motor neurone damage (spastic paraparesis). The most useful screening test for many of these disorders is analysis of urinary cholanoids (bile acids and bile alcohols); this is usually now achieved by electrospray ionisation tandem mass spectrometry.
4,1183785,0,389,1,elegxo meaning,"The word convict here (elegcw /elegxo) means to bring to light or expose error often with the idea of reproving or rebuking. It brings about knowledge of believing or doing something wrong, but it does not mean that the person will respond properly to that knowledge. Our usage of the English word, convict, is similar."
...,...,...,...,...,...,...
532756,562255,0,8841257,1,what are nephridia?,"These nephridia, which are called protonephridia, are branched tubules of ectodermal origin. They are closed at the internal ends by terminal cells, or solenocytes, and open to the exterior by means of excretory pores, or nephridiopores."
532757,19285,0,8841362,1,anterolisthesis definition,What is Anterolisthesis? Anterolisthesis is defined as a forward slippage of the upper vertebral body in relation to the vertebra below. The progression in the displacement of the involved vertebra can potentially pinch the spinal nerves of the vertebra and may also result in damages in the spinal cord.
532758,559149,0,8841547,1,what are fsh levels during perimenopause,"FSH and LH levels in perimenopause are often found to be high in comparison with levels of these hormones in menstruating women. However, they are not as high as in those women who have already reached a menopause."
532759,706678,0,8841643,1,what is a yowie,Yowie is one of several names given to a hominid reputed to live in the Australian wilderness. The creature has its roots in Aboriginal oral history.
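
An inner merge silently drops rows whose key has no match, and it duplicates rows if the lookup table contains repeated keys. A small sketch of how both cases could be checked with the `validate` and `indicator` parameters of `DataFrame.merge`, using toy frames (the query id `12` is deliberately left without a match; all values and names here are illustrative):

```python
import pandas as pd

# Toy stand-ins for qrels_df and queries_df (hypothetical values)
toy_qrels = pd.DataFrame({
    'query_id': [10, 11, 12],
    'passage_id': [0, 5, 7],
    'relevance': [1, 1, 1],
})
toy_queries = pd.DataFrame({
    'query_id': [10, 11],
    'text': ['define extreme', 'what is a yowie'],
})

# validate='many_to_one' raises if query_id is not unique in toy_queries;
# indicator=True adds a '_merge' column that records where each row matched
checked = toy_qrels.merge(
    toy_queries, on='query_id', how='left', validate='many_to_one', indicator=True
)

# Rows that an inner merge would have dropped
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['query_id'].tolist())
```

On the real data, an empty `unmatched` frame would indicate that every judgment found its query.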


In [25]:
# Replace the automatic merge suffixes with descriptive column names
merged_df = merged_df.rename(columns={'text_x': 'query_text', 'text_y': 'passage_text'})

In [23]:
os.getcwd()

'/mnt/0060f889-4c27-409b-b0de-47f5427515e3/unicamp/ia368v_dd/ia368v_dd_class_06'

In [26]:
# Serialize the merged query-passage pairs (built from the train qrels) for later use
with open("ms_marco_passage_dev.pkl", 'wb') as output_file:
    pickle.dump(merged_df, output_file, pickle.HIGHEST_PROTOCOL)
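
For doc2query, the passage is the model input and the query is the generation target. As a rough sketch of how the saved frame might be consumed downstream, the example below round-trips a one-row toy frame through pickle and turns it into source/target pairs. The temporary file name and the `source`/`target` keys are assumptions for illustration, not something the notebook defines.

```python
import os
import pickle
import tempfile

import pandas as pd

# One-row toy frame with the same columns as merged_df after the rename
toy_df = pd.DataFrame({
    'query_id': [706678],
    'passage_id': [8841643],
    'query_text': ['what is a yowie'],
    'passage_text': [
        'Yowie is one of several names given to a hominid reputed to live in '
        'the Australian wilderness. The creature has its roots in Aboriginal '
        'oral history.'
    ],
})

# Write and reload the frame the same way the notebook writes merged_df
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pairs.pkl')  # hypothetical file name
    with open(path, 'wb') as output_file:
        pickle.dump(toy_df, output_file, pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as input_file:
        loaded_df = pickle.load(input_file)

# doc2query direction: passage text in, query text out
pairs = [
    {'source': row.passage_text, 'target': row.query_text}
    for row in loaded_df.itertuples(index=False)
]
print(pairs[0]['target'])
```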