# Processing the data to train a quantitative field classifier

In this notebook we will pare down the dataset to what we need to train our naive Bayes model. After dropping some irrelevant data, most of the work is in fitting the varied categories into one of the eight fields on the arXiv today: physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
PATH = 'Data/arxivMetadata.json'

In [2]:
df = pd.read_json(PATH, lines=True)

## Dropping irrelevant columns

Let's begin by dropping the columns we do not need.

In [3]:
df.columns

Index(['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'],
      dtype='object')

In [4]:
df.drop(['id', 'submitter', 'authors', 'comments', 'journal-ref',
         'doi', 'report-no', 'license',
         'versions', 'update_date', 'authors_parsed'], axis=1, inplace=True)

In [5]:
df.columns

Index(['title', 'categories', 'abstract'], dtype='object')

## Assigning each paper to a field

### Extracting the primary category

As it stands, the categories column is more fine-grained than we need.

In [6]:
df['categories'].head(10)

0               hep-ph
1        math.CO cs.CG
2       physics.gen-ph
3              math.CO
4      math.CA math.FA
5    cond-mat.mes-hall
6                gr-qc
7    cond-mat.mtrl-sci
8             astro-ph
9              math.CO
Name: categories, dtype: object

After cross-checking against the arXiv listings, we see that when two or more categories are listed, the first one is the primary category. Let's pick out the primary category and add it as a column in our dataframe. The overarching archive label is the part of the category name before the first period (for example, math in math.CO), so we then keep only that part.

In [7]:
# First whitespace-separated token is the primary category
df['primary'] = df['categories'].str.split().str[0]
# Keep only the archive label before the first period
df['primary'] = df['primary'].str.split('.').str[0]
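
To make the extraction rule concrete, here is a minimal sketch on a few category strings taken from the output above. The helper name `primary_archive` is hypothetical and only used for illustration; it is not part of the notebook's pipeline.

```python
def primary_archive(categories: str) -> str:
    """Return the archive-level label of the first listed category."""
    first = categories.split()[0]   # primary category, e.g. 'math.CO'
    return first.split('.')[0]      # text before the first period, e.g. 'math'

# Sample values copied from the head of the categories column.
examples = ['hep-ph', 'math.CO cs.CG', 'physics.gen-ph', 'cond-mat.mes-hall']
labels = [primary_archive(c) for c in examples]
print(labels)
```

Labels without a period, such as hep-ph, pass through unchanged, which is why the legacy archive names still appear in the list of unique values.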

In [8]:
df['primary'].unique()

array(['hep-ph', 'math', 'physics', 'cond-mat', 'gr-qc', 'astro-ph',
       'hep-th', 'hep-ex', 'nlin', 'q-bio', 'quant-ph', 'cs', 'nucl-th',
       'math-ph', 'hep-lat', 'nucl-ex', 'q-fin', 'stat', 'eess', 'econ',
       'acc-phys', 'adap-org', 'alg-geom', 'ao-sci', 'atom-ph',
       'bayes-an', 'chao-dyn', 'chem-ph', 'cmp-lg', 'comp-gas', 'dg-ga',
       'funct-an', 'mtrl-th', 'patt-sol', 'plasm-ph', 'q-alg', 'solv-int',
       'supr-con'], dtype=object)

### Assigning a field to each category

Many of the physics categories end in -ph, so we use a regular expression to replace them with physics. Note that this also folds math-ph into physics.

In [9]:
df['primary'] = df['primary'].replace(r'.*ph$', 'physics', regex=True)

In [10]:
df['primary'].unique()

array(['physics', 'math', 'cond-mat', 'gr-qc', 'hep-th', 'hep-ex', 'nlin',
       'q-bio', 'cs', 'nucl-th', 'hep-lat', 'nucl-ex', 'q-fin', 'stat',
       'eess', 'econ', 'acc-phys', 'adap-org', 'alg-geom', 'ao-sci',
       'bayes-an', 'chao-dyn', 'cmp-lg', 'comp-gas', 'dg-ga', 'funct-an',
       'mtrl-th', 'patt-sol', 'q-alg', 'solv-int', 'supr-con'],
      dtype=object)
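
As a quick check of what the pattern `.*ph$` does and does not match, here is a small sketch on a toy Series. The selection of labels is ours, chosen from the values listed above.

```python
import pandas as pd

# A handful of labels from the unique values above.
labels = pd.Series(['hep-ph', 'math-ph', 'quant-ph', 'hep-th', 'math', 'physics'])

# Any label that ends in 'ph' is replaced wholesale by 'physics'.
collapsed = labels.replace(r'.*ph$', 'physics', regex=True)
print(collapsed.tolist())
```

Labels such as hep-th do not end in ph, so they are left for the manual step that follows.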

There is now a reasonable amount to classify manually. Some judgement calls were made here about which subcategories belong to which overarching fields.

In [12]:
  
# Remaining archive labels, grouped by the field we assign them to.
physicsSubjects = ['cond-mat', 'gr-qc', 'hep-th', 'hep-ex', 'nucl-th', 'hep-lat', 'nucl-ex', 'acc-phys', 'nlin', 'adap-org',
                   'ao-sci', 'comp-gas', 'mtrl-th', 'supr-con']
mathSubjects = ['alg-geom', 'chao-dyn', 'q-alg', 'solv-int', 'funct-an', 'dg-ga']
statSubjects = ['bayes-an', 'patt-sol']
csSubjects = ['cmp-lg']
# Assign the result back rather than calling replace(..., inplace=True) on the
# selected column, which no longer updates df once pandas copy-on-write is enabled.
df['primary'] = df['primary'].replace(physicsSubjects, 'physics')
df['primary'] = df['primary'].replace(mathSubjects, 'math')
df['primary'] = df['primary'].replace(statSubjects, 'stat')
df['primary'] = df['primary'].replace(csSubjects, 'cs')
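
The same manual assignment can also be written as a single dictionary lookup. The sketch below uses a hypothetical, abbreviated mapping named `field_of` on toy data; the full subject lists are the ones in the cell above.

```python
import pandas as pd

# Hypothetical, abbreviated mapping from archive label to field.
field_of = {
    'cond-mat': 'physics', 'gr-qc': 'physics',
    'alg-geom': 'math', 'q-alg': 'math',
    'bayes-an': 'stat',
    'cmp-lg': 'cs',
}

toy = pd.Series(['cond-mat', 'q-alg', 'bayes-an', 'cmp-lg', 'econ'])

# Labels missing from the dictionary (here 'econ') are left untouched.
mapped = toy.replace(field_of)
print(mapped.tolist())
```

Keeping the mapping in one dictionary makes each judgement call visible in a single place.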


As we can see below, every paper now carries one of the desired eight field labels. Let's save this list of labels for later.

In [13]:
categories = df['primary'].unique()
df['primary'].unique()

array(['physics', 'math', 'q-bio', 'cs', 'q-fin', 'stat', 'eess', 'econ'],
      dtype=object)

## Sampling the data

There are currently about 2.5 million entries in our dataset, 1.3 million of which are physics papers. We will cut the total down by more than a factor of 10 and sample each field separately to balance the classes. To do this, we first split the data by field.

In [14]:
listOfFramesBySubject = [df.loc[df['primary'] == category] for category in categories]

Let's take a look at how much of each class we have to work with.

In [15]:
for frame in listOfFramesBySubject:
    print(frame.iloc[0].loc['primary'], frame.shape)


physics (1324943, 4)
math (497592, 4)
q-bio (27899, 4)
cs (504142, 4)
q-fin (10991, 4)
stat (48429, 4)
eess (47427, 4)
econ (6980, 4)
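
The per-class counts can also be read off without splitting the dataframe first. A minimal sketch on toy data, assuming only that the field label lives in a column named `primary` as above:

```python
import pandas as pd

# Toy stand-in for the real dataframe.
toy = pd.DataFrame({'primary': ['physics', 'math', 'physics', 'cs', 'physics', 'math']})

# Number of rows per field, largest first.
counts = toy['primary'].value_counts()
print(counts)
```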


Most of our classes can supply 20,000 samples, so that is our target per class. The two classes that fall short of this, q-fin and econ, contribute every entry they have. This gives about 138,000 papers in total.

In [16]:
sampledLists = [frame.sample(min(20000, frame.shape[0])) for frame in listOfFramesBySubject]
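
The call above draws a different sample on every run. If a repeatable sample is wanted, one option is to pass a fixed `random_state`. The sketch below shows the idea on toy data; the cap of 10 stands in for the 20,000 used above, and the seed value is an arbitrary assumption.

```python
import pandas as pd

# Toy data: two large classes and one small one.
toy = pd.DataFrame({
    'primary': ['physics'] * 50 + ['math'] * 30 + ['econ'] * 5,
    'title': [f'paper {i}' for i in range(85)],
})

CAP = 10   # stands in for the 20,000 used in the notebook
SEED = 0   # assumption: any fixed integer gives a repeatable draw

# Sample at most CAP rows from each class; small classes are taken whole.
sampled = pd.concat(
    [frame.sample(min(CAP, len(frame)), random_state=SEED)
     for _, frame in toy.groupby('primary')]
)
print(sampled['primary'].value_counts())
```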

Finally, we concatenate the sampled frames and save the processed dataset as a pickle file.

In [17]:
totalList = pd.concat(sampledLists, axis=0)
OUTPUT = 'Data/sampledPapers.pickle'
totalList.to_pickle(OUTPUT)
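
A later notebook can load the file back with `pd.read_pickle`. The following sketch checks the round trip on a toy frame written to a temporary directory; the file name `toy.pickle` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'title': ['a', 'b'],
    'abstract': ['x', 'y'],
    'primary': ['math', 'cs'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pickle')  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original exactly.
print(restored.equals(toy))
```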