# 2022 basic data prep

This notebook uses the selected topics for 2022 to create training and evaluation query sets, and splits the ground-truth data to match.

Note that we shuffle with Python's Mersenne Twister generator (`random.Random`) and a fixed seed for reproducibility.
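
As a small illustration of why the seed matters, the sketch below shows that two generators built from the same seed shuffle a sorted list identically. The topic names and the `shuffled` helper are made up for this example and are not part of the notebook.

```python
from random import Random

def shuffled(items, seed):
    """Return a sorted-then-shuffled copy of items using a seeded generator."""
    result = sorted(items)        # sort first so the input order does not matter
    Random(seed).shuffle(result)  # seeded Mersenne Twister, shuffles in place
    return result

# Hypothetical topic names, for illustration only
demo = ['Chemistry', 'Art', 'Biology', 'Dance']
first = shuffled(demo, 20220503)
second = shuffled(reversed(demo), 20220503)

# Same seed and same sorted input give the same order
print(first == second)
```

Sorting before shuffling is what makes the result independent of the order in which the topics were read.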

## Setup

Load some modules:

In [1]:
from random import Random
import pandas as pd
import numpy as np

Load the topic list:

In [5]:
with open('data/selected_topics_22.txt') as topic_file:
    topics = [line.strip() for line in topic_file]
topics.sort()
len(topics)

100

Load the WikiProject list:

In [3]:
projects = pd.read_json('data/trec_2022_wikiprojects.json.gz', lines=True)
projects.head()

   id                   title                                           rel_docs
0   1  1000 Women in Religion  [29305, 345779, 391183, 478677, 1607259, 18032...
1   2              A Cappella  [2411, 71307, 152558, 165101, 304559, 375332, ...
2   3  A Song of Ice and Fire  [12300, 12301, 12302, 713577, 713590, 713625, ...
3   4                    AIDS  [1908, 12747, 14573, 23739, 26214, 26436, 2884...
4   5      Abandoned Articles  [41169, 69181, 69748, 180245, 201817, 374129, ...


## Partition

Let's deterministically shuffle the topic list:

In [8]:
rng = Random(20220503)
topics.sort()
rng.shuffle(topics)
topics[:2]

['Classical music', 'Film/American cinema task force']

Make our training and eval query lists:

In [10]:
train_topics = topics[:50]
eval_topics = topics[50:]
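
A quick sanity check on a split like this is to confirm that the two halves are disjoint and together cover every topic. The sketch below uses placeholder topic names rather than the real list, so treat it as an illustration of the check, not of the actual data.

```python
from random import Random

# Hypothetical stand-in for the 100 selected topics
topics = sorted(f'Topic {i:03d}' for i in range(100))
Random(20220503).shuffle(topics)

train_topics = topics[:50]
eval_topics = topics[50:]

# The two halves should not overlap and should cover every topic
overlap = set(train_topics) & set(eval_topics)
covered = set(train_topics) | set(eval_topics)
print(len(train_topics), len(eval_topics), len(overlap))
```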

## Output

Select the WikiProject records for each split, then save them to our output files:

In [12]:
train_projects = projects[projects['title'].isin(train_topics)]
train_projects.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 83 to 2858
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        50 non-null     int64 
 1   title     50 non-null     object
 2   rel_docs  50 non-null     object
dtypes: int64(1), object(2)
memory usage: 1.6+ KB


In [13]:
eval_projects = projects[projects['title'].isin(eval_topics)]
eval_projects.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 186 to 2871
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        50 non-null     int64 
 1   title     50 non-null     object
 2   rel_docs  50 non-null     object
dtypes: int64(1), object(2)
memory usage: 1.6+ KB
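
Both frames report 50 entries, so every topic matched a project title. Because `isin` silently drops topics with no exact match, it can be worth checking for misses explicitly. The sketch below does this on toy data; the frame contents and the unmatched topic are invented for illustration.

```python
import pandas as pd

# Toy data standing in for `projects` and `train_topics`
projects = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['AIDS', 'A Cappella', 'Classical music'],
})
train_topics = ['Classical music', 'AIDS', 'No such project']

matched = projects[projects['title'].isin(train_topics)]

# Topics that did not match any project title exactly
missing = sorted(set(train_topics) - set(matched['title']))
print(missing)
```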


In [16]:
train_projects.to_json('data/trec_2022_train_reldocs.jsonl', orient='records', lines=True)

In [17]:
eval_projects.to_json('data/trec_2022_eval_reldocs.jsonl', orient='records', lines=True)
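
With `orient='records'` and `lines=True`, each line of the output is one self-contained JSON object, so the files can also be read without pandas. The sketch below parses two made-up records in that layout from an in-memory buffer; the values are illustrative and not taken from the real files.

```python
import io
import json

# Two made-up records in the one-object-per-line layout
sample = io.StringIO(
    '{"id":4,"title":"AIDS","rel_docs":[1908,12747]}\n'
    '{"id":2,"title":"A Cappella","rel_docs":[2411,71307,152558]}\n'
)

# Each non-empty line is one complete JSON object
records = [json.loads(line) for line in sample if line.strip()]

# Count the relevant documents listed for each project
doc_counts = {rec['title']: len(rec['rel_docs']) for rec in records}
print(doc_counts)
```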

## Convert Files

We're now going to convert our article JSON files to Parquet for easy further processing.

In [19]:
arts = pd.read_json('data/trec_2022_articles.json.gz', lines=True)

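KeyError: "None of ['id'] are in the columns"

The article data has no `id` column, as the `KeyError` above indicates; articles are keyed by `page_id`, so we index on that:
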
In [21]:
arts.set_index('page_id', inplace=True)
arts.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6460238 entries, 12 to 70194530
Data columns (total 13 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   pred_qual               float64
 1   qual_cat                object 
 2   page_countries          object 
 3   page_subcont_regions    object 
 4   source_countries        object 
 5   source_subcont_regions  object 
 6   gender                  object 
 7   occupations             object 
 8   years                   object 
 9   num_sitelinks           int64  
 10  relative_pageviews      float64
 11  first_letter            object 
 12  creation_date           object 
dtypes: float64(2), int64(1), object(10)
memory usage: 4.5 GB


In [22]:
arts.to_parquet('data/trec_2022_articles.parquet', compression='zstd')

In [23]:
del arts

In [24]:
arts = pd.read_json('data/trec_2022_articles_discrete.json.gz', lines=True)
arts.set_index('page_id', inplace=True)
arts.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6460238 entries, 12 to 70194530
Data columns (total 19 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   pred_qual                    float64
 1   qual_cat                     object 
 2   page_countries               object 
 3   page_subcont_regions         object 
 4   source_countries             object 
 5   source_subcont_regions       object 
 6   gender                       object 
 7   occupations                  object 
 8   years                        object 
 9   num_sitelinks                int64  
 10  relative_pageviews           float64
 11  first_letter                 object 
 12  creation_date                object 
 13  first_letter_category        object 
 14  gender_category              object 
 15  creation_date_category       object 
 16  years_category               object 
 17  relative_pageviews_category  object 
 18  num_sitelinks_category       object 
dtypes: float64(2), int64(1), object(16)
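
The later `info()` output reports the `*_category` columns as `category` dtype, but the cell that performs that conversion is not shown here. One plausible way to do it, sketched on a toy frame with made-up values, is to cast every column whose name ends in `_category`; this is an assumption about the approach, not the notebook's actual code.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the discrete article table
arts = pd.DataFrame({
    'gender': ['female', 'male', 'female'],
    'gender_category': ['female', 'male', 'female'],
    'num_sitelinks_category': ['low', 'high', 'low'],
})

# Assumed approach: cast every column whose name ends in `_category`
cat_cols = [c for c in arts.columns if c.endswith('_category')]
arts = arts.astype({c: 'category' for c in cat_cols})

print(arts.dtypes)
```

Categorical columns store each distinct value once plus a compact integer code per row, which is typically much smaller than repeated Python strings for low-cardinality data.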

Remove the columns that have categorical equivalents:

In [37]:
arts.drop(columns=['first_letter', 'gender', 'creation_date', 'years', 'relative_pageviews', 'num_sitelinks'], inplace=True)

In [38]:
arts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6460238 entries, 12 to 70194530
Data columns (total 13 columns):
 #   Column                       Dtype   
---  ------                       -----   
 0   pred_qual                    float64 
 1   qual_cat                     object  
 2   page_countries               object  
 3   page_subcont_regions         object  
 4   source_countries             object  
 5   source_subcont_regions       object  
 6   occupations                  object  
 7   first_letter_category        category
 8   gender_category              category
 9   creation_date_category       category
 10  years_category               category
 11  relative_pageviews_category  category
 12  num_sitelinks_category       category
dtypes: category(6), float64(1), object(6)
memory usage: 431.3+ MB


Save this all to Parquet:

In [39]:
arts.to_parquet('data/trec_2022_articles_discrete.parquet', compression='zstd')