# Identifying arXiv Article Subject Codes via NLP

The goal of this project is to predict the primary subject code of scientific articles in the arXiv database from the text of their abstracts. This allows for rapid encoding of article subject material, and similar methods may be applicable for identifying key terms in articles submitted for addition to the database.

Without machine learning, indexing articles for a database by subject code, key terms, and other metadata is a labor-intensive process. For some databases, the cost of manually indexing a single article has been estimated at up to 10 dollars. Natural Language Processing offers a way to automate this process; for a dataset similar in size to the one used for this project (about 1.58 million articles), that could save up to roughly 15 million dollars in labor at initial upload. Maintenance may yield further savings, since routine updates to the database architecture or indexing system can be automated instead of requiring large amounts of expensive manual labor.
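The savings figure above follows from simple multiplication. The sketch below reproduces it, assuming the 1,582,241 rows loaded later in this notebook and the upper-bound estimate of 10 dollars per article; the function name is illustrative only.

```python
# Back-of-the-envelope estimate of manual indexing cost.
N_ARTICLES = 1_582_241       # number of rows in the dataset used in this notebook
COST_PER_ARTICLE_USD = 10    # upper-bound estimate quoted in the text


def manual_indexing_cost(n_articles: int, cost_per_article: float) -> float:
    """Return the total labor cost of indexing every article by hand."""
    return n_articles * cost_per_article


total = manual_indexing_cost(N_ARTICLES, COST_PER_ARTICLE_USD)
print(f"Estimated manual indexing cost: ${total:,.0f}")
```

The result is an upper bound, since the per-article cost is itself quoted as "up to" 10 dollars.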

In [1]:
#imports
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

In [2]:
#import data
df = pd.read_csv("./arxiv-oai-af.tsv", delimiter="\t")
df

| | abstract | acm_class | arxiv_id | author_text | categories | comments | created | doi | num_authors | num_categories | primary_cat | title | updated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | If we assume the Thesis that any classical T... | | math/0212388 | Bhupinder Singh Anand | math.GM | 12 pages. Revision 1. Appendix 1 added. An HTM... | 2002-12-31 | | 1 | 1 | math.GM | Is a deterministic universe logically consiste... | 2003-01-02 |
| 1 | We define the Cartesian product, composition... | | 1205.6123 | Muhammad Akram, Wieslaw A. Dudek | cs.DM | | 2012-04-29 | 10.1016/j.camwa.2010.11.004 | 2 | 1 | cs.DM | Interval-valued fuzzy graphs | |
| 2 | We apply algebraic Morse theory to the Taylo... | | 1806.07887 | Robin Frankhuizen | math.AT,math.AC,math.RA | 27 pages; comments welcome. arXiv admin note: ... | 2018-06-20 | | 1 | 3 | math.AT | Massey products and the Golod property for sim... | |
| 3 | Anomalous transport is usually described eit... | | 1007.3022 | Bartlomiej Dybiec, Ewa Gudowska-Nowak | cond-mat.stat-mech,math-ph,math.MP | 10 pages, 7 figures | 2010-07-18 | 10.1063/1.3522761 | 2 | 3 | cond-mat.stat-mech | Subordinated diffusion and CTRW asymptotics | 2010-11-09 |
| 4 | In this paper, an approximate solution to a ... | | 1512.07787 | M. T. Araujo, E. Drigo Filho | cond-mat.stat-mech | 12 pages, 8 figures | 2015-12-24 | 10.5488/CMP.18.43003 | 2 | 1 | cond-mat.stat-mech | Approximate solution for Fokker-Planck equation | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1582236 | 21cm intensity mapping experiments aim to ob... | | 1501.03823 | Laura Wolz, Filipe B. Abdalla, David Alonso, C... | astro-ph.CO | This article is part of the 'SKA Cosmology Cha... | 2015-01-15 | | 10 | 1 | astro-ph.CO | Foreground Subtraction in Intensity Mapping wi... | |
| 1582237 | We show the existence of smooth isolated cur... | | math/0110220 | Andreas Leopold Knutsen | math.AG | 18 pages. The previous version of the preprint... | 2001-10-19 | | 1 | 1 | math.AG | Smooth, isolated curves in families of Calabi-... | 2012-09-05 |
| 1582238 | Sequence alignment is a tool in bioinformati... | | 0907.2187 | S Wolfsheimer, O Melchert, AK Hartmann | cond-mat.stat-mech,cond-mat.dis-nn,q-bio.QM | | 2009-07-13 | 10.1103/PhysRevE.80.061913 | 3 | 3 | cond-mat.stat-mech | Finite-temperature local protein sequence alig... | |
| 1582239 | We suggest that the majority of the "young",... | | astro-ph/0209553 | Valery V. Kravtsov | astro-ph | 7 pages, no figures, accepted for publication ... | 2002-09-26 | 10.1051/0004-6361:20021404 | 1 | 1 | astro-ph | Second Parameter Globulars and Dwarf Spheroida... | |


In [3]:
#find explicit nulls
df.isnull().sum()

abstract                0
acm_class         1560822
arxiv_id                0
author_text             0
categories              0
comments           301829
created                 0
doi                734897
num_authors             0
num_categories          0
primary_cat             0
title                   0
updated            991881
dtype: int64

In [4]:
#drop unnecessary columns 
trimmed_df = df.drop(columns=["acm_class", 
                              "comments", 
                              "created", 
                              "num_authors", 
                              "num_categories", 
                              "updated", 
                              "doi", 
                              "categories", 
                              "author_text", 
                              "title"]
                    )
trimmed_df

| | abstract | arxiv_id | primary_cat |
|---|---|---|---|
| 0 | If we assume the Thesis that any classical T... | math/0212388 | math.GM |
| 1 | We define the Cartesian product, composition... | 1205.6123 | cs.DM |
| 2 | We apply algebraic Morse theory to the Taylo... | 1806.07887 | math.AT |
| 3 | Anomalous transport is usually described eit... | 1007.3022 | cond-mat.stat-mech |
| 4 | In this paper, an approximate solution to a ... | 1512.07787 | cond-mat.stat-mech |
| ... | ... | ... | ... |
| 1582236 | 21cm intensity mapping experiments aim to ob... | 1501.03823 | astro-ph.CO |
| 1582237 | We show the existence of smooth isolated cur... | math/0110220 | math.AG |
| 1582238 | Sequence alignment is a tool in bioinformati... | 0907.2187 | cond-mat.stat-mech |
| 1582239 | We suggest that the majority of the "young",... | astro-ph/0209553 | astro-ph |


In [5]:
#check for duplicate articles
trimmed_df["arxiv_id"].value_counts()

math/0212388    1
1511.00435      1
1003.0352       1
1403.6630       1
1606.07245      1
               ..
0710.1146       1
1508.04795      1
0909.0182       1
1101.0001       1
1309.3564       1
Name: arxiv_id, Length: 1582241, dtype: int64
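Scanning a `value_counts()` listing by eye only works when the output is short. As a hedged alternative, the sketch below checks for repeated IDs directly on a toy `Series`; the IDs are illustrative stand-ins for `trimmed_df["arxiv_id"]`, not a re-run of the cell above.

```python
import pandas as pd

# Toy stand-in for trimmed_df["arxiv_id"]; the repeated ID is deliberate.
ids = pd.Series(["math/0212388", "1205.6123", "1806.07887", "1205.6123"])

has_duplicates = not ids.is_unique            # True if any ID appears more than once
n_duplicates = int(ids.duplicated().sum())    # rows that repeat an earlier ID
print(has_duplicates, n_duplicates)

# Keep only the first occurrence of each ID.
deduplicated = ids.drop_duplicates()
print(len(deduplicated))
```

On the real data, `ids.is_unique` returning `True` would confirm in a single boolean what the all-ones count column suggests.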

In [6]:
#full dataset turned out to be too large for training with available resources
#looking for best training subset based on class balance
trimmed_df["primary_cat"].value_counts()[:50]

hep-ph                107925
astro-ph               94239
hep-th                 86019
quant-ph               71567
cond-mat.mes-hall      46310
gr-qc                  45681
cond-mat.mtrl-sci      39348
cond-mat.str-el        35899
cond-mat.stat-mech     32370
astro-ph.SR            30147
astro-ph.CO            29493
math.AP                28783
math.CO                27755
nucl-th                27524
astro-ph.GA            27103
math.PR                26395
cs.CV                  25941
math-ph                25426
math.AG                25312
cond-mat.supr-con      25137
astro-ph.HE            24153
cs.IT                  23048
math.NT                20834
math.DG                20613
cond-mat.soft          19548
hep-ex                 18774
cs.LG                  18178
physics.optics         17130
hep-lat                15185
math.OC                14899
math.DS                14742
math.NA                13912
math.FA                12971
astro-ph.EP            12844
cond-mat      

In [7]:
#keep only the 50 most frequent subjects
n_subjects = 50
#identify the most frequent subjects
balance = trimmed_df["primary_cat"].value_counts()[:n_subjects]
index = np.array(balance.index)

#drop low frequency subjects
subject = trimmed_df["primary_cat"]
balanced_df = trimmed_df[subject.isin(index)]
balanced_df.shape

(1265014, 3)
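Keeping the 50 most frequent subjects retains 1,265,014 of the 1,582,241 articles, or about 80% of the data. The sketch below wraps the same filtering idea in a small helper and runs it on a toy frame; `keep_top_n` and the toy labels are hypothetical names introduced only for illustration.

```python
import pandas as pd


def keep_top_n(df: pd.DataFrame, column: str, n: int) -> pd.DataFrame:
    """Keep only the rows whose label is among the n most frequent labels."""
    top_labels = df[column].value_counts().nlargest(n).index
    return df[df[column].isin(top_labels)]


# Toy data: three labels with frequencies 3, 2 and 1.
toy = pd.DataFrame(
    {"primary_cat": ["hep-ph", "hep-ph", "hep-ph", "astro-ph", "astro-ph", "math.GM"]}
)

filtered = keep_top_n(toy, "primary_cat", n=2)
retained_fraction = len(filtered) / len(toy)
print(sorted(filtered["primary_cat"].unique()), f"{retained_fraction:.0%} of rows kept")
```

Reporting the retained fraction alongside the filter makes it easier to see how much data each choice of `n` discards.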

### Train-test split

In [8]:
train, test = train_test_split(balanced_df, test_size=0.2, random_state=42)

In [9]:
#save data for later
test.to_csv("./Train&Test/test_split.csv")
train.to_csv("./Train&Test/train_split.csv")
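The split above is purely random, so the class proportions in the test set can drift slightly from those in the full data. One possible refinement, shown here on toy data and not taken from the original notebook, is to pass `stratify` so that each subject keeps the same share in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 abstracts across three subjects (10 / 5 / 5).
toy = pd.DataFrame({
    "abstract": [f"abstract {i}" for i in range(20)],
    "primary_cat": ["hep-ph"] * 10 + ["astro-ph"] * 5 + ["math.GM"] * 5,
})

# stratify preserves the label proportions in both the train and test sets.
train_s, test_s = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["primary_cat"]
)

print(test_s["primary_cat"].value_counts().to_dict())
```

With 50 classes whose sizes differ by roughly an order of magnitude, stratification mainly protects the smaller classes from being under-represented in the held-out set.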

### Model Construction

The general approach is to build a recurrent neural network that encodes each abstract as a sequence of integer tokens, learns from those sequences, and returns the corresponding subject code. Two recurrent architectures are commonly used for this kind of task: LSTM and GRU. The initial model uses a GRU because of computational limitations and the size of the dataset, but both architectures are candidates for this model.
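One way to see why a GRU is the lighter option is to count trainable parameters. The sketch below uses the standard parameter formulas for Keras-style layers, assuming a 32-dimensional input and 32 units (the sizes used in the model in this notebook) and Keras' default `reset_after=True` for the GRU; the helper names are illustrative.

```python
def gru_params(input_dim: int, units: int) -> int:
    """Parameters of a GRU layer with two bias vectors per block (reset_after=True).

    Three weight blocks: update gate, reset gate, candidate state.
    """
    return 3 * (units * (input_dim + units) + 2 * units)


def lstm_params(input_dim: int, units: int) -> int:
    """Parameters of an LSTM layer.

    Four weight blocks: input, forget and output gates plus the candidate cell state.
    """
    return 4 * (units * (input_dim + units) + units)


print("GRU :", gru_params(32, 32))
print("LSTM:", lstm_params(32, 32))
```

The GRU figure matches the 6,336 parameters reported for the `gru` layer in the model summary. At these sizes the recurrent layer is small either way, so the GRU's advantage is mostly in per-step computation over long abstracts.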

In [10]:
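#encode target
#note: FeatureHasher with input_type="string" hashes the characters of each label
#into n_subjects buckets, so the rows are signed counts, not a true one-hot encoding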
FH = FeatureHasher(n_features=n_subjects, input_type="string")
train_target = FH.fit_transform(X=train["primary_cat"])
t_array_target = train_target.toarray()
t_array_target

array([[ 0.,  0.,  0., ...,  0.,  0., -1.],
       [ 0.,  0.,  0., ...,  0.,  0., -4.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 1.,  0.,  0., ...,  0.,  0., -2.],
       [ 2.,  0.,  0., ...,  0.,  0., -3.],
       [ 0.,  0.,  0., ...,  0.,  0., -1.]])

In [11]:
t_array_target.shape

(1012011, 50)
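The target array printed above contains values such as -4 and 2, which a one-hot encoding never produces; each row should hold a single 1 for `categorical_crossentropy` to behave as intended. A minimal sketch of one alternative, using the `OneHotEncoder` already imported at the top of the notebook, is shown below on toy labels. This is a suggested substitute, not the encoding the notebook actually ran.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for train["primary_cat"]; OneHotEncoder expects a 2-D input.
labels = np.array(["hep-ph", "astro-ph", "math.GM", "hep-ph"]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen labels as an all-zero row at transform time.
ohe = OneHotEncoder(handle_unknown="ignore")
one_hot = ohe.fit_transform(labels).toarray()   # default output is sparse

print(ohe.categories_[0])   # column order of the encoding
print(one_hot)

# The fitted encoder can map predictions back to subject codes.
decoded = ohe.inverse_transform(one_hot).ravel()
```

Fitting the encoder on the training labels and reusing it on the test labels keeps the column order consistent between the two sets.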

In [12]:
#creating encoder to clean and encode abstract data
encoder = layers.experimental.preprocessing.TextVectorization(output_mode='int')
#calling adapt builds the vocabulary by indexing every term in the training abstracts
#no max_tokens is set, so the vocabulary (and the embedding layer) grows with the corpus
encoder.adapt(train["abstract"])
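To make the `adapt` step less opaque, the sketch below is a rough pure-Python analogue of what an integer-mode text vectorizer does: lowercase, strip punctuation, split on whitespace, then assign indices by descending frequency, with 0 reserved for padding and 1 for out-of-vocabulary terms. It is an approximation for illustration, not Keras' implementation, and `build_vocabulary` and `encode` are hypothetical helpers.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase, drop punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocabulary(texts, max_tokens=None) -> dict:
    """Map each term to an integer index, most frequent terms first.

    Index 0 is reserved for padding and index 1 for unknown terms, so a
    cap of max_tokens leaves room for max_tokens - 2 real terms.
    """
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    n_terms = None if max_tokens is None else max_tokens - 2
    vocab = ["", "[UNK]"] + [term for term, _ in counts.most_common(n_terms)]
    return {term: index for index, term in enumerate(vocab)}


def encode(text: str, vocab: dict) -> list:
    """Convert text to a list of indices; unseen terms map to 1."""
    return [vocab.get(term, 1) for term in tokenize(text)]


corpus = ["Quantum fields, quantum gravity.", "Quantum gravity waves"]
vocab = build_vocabulary(corpus)
capped_vocab = build_vocabulary(corpus, max_tokens=4)
print(encode("Quantum strings and gravity", vocab))
```

Capping the vocabulary trades coverage of rare terms for a smaller index, which in a network translates directly into a smaller embedding table.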

In [14]:
#initial model
init_model = Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=32,
        # Use masking to handle the variable abstract lengths
        mask_zero=True),
    tf.keras.layers.GRU(32),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(n_subjects, activation='softmax')
])

#compiling the model
init_model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy', 'AUC']
)

#creating a save file for this model
model_checkpoint_callback = ModelCheckpoint(filepath="./Checkpoints/cp.ckpt",
                                           save_weights_only=True,
                                           monitor='val_accuracy',
                                           mode='max',
                                           save_best_only=True)

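#print the model summary, then fit the model
#the checkpoint callback must be passed to fit, otherwise no weights are saved
init_model.summary()
NL1 = init_model.fit(x=train["abstract"], y=t_array_target, epochs=10,
                     batch_size=42, validation_split=0.3,
                     callbacks=[model_checkpoint_callback])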

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, None, 32)          38059168  
                                                                 
 gru (GRU)                   (None, 32)                6336      
                                                                 
 dense (Dense)               (None, 32)                1056      
                                                                 
 dense_1 (Dense)             (None, 50)                1650      
                                                                 
Total params: 38,068,210
Trainable params: 38,068,210
Non-trainable params: 0
____________________________________________

KeyboardInterrupt: 
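The summary above helps explain why training was slow enough to be interrupted: almost all of the 38,068,210 parameters sit in the embedding layer. The sketch below recovers the vocabulary size from the reported parameter count and estimates the effect of a vocabulary cap. The cap of 50,000 is a hypothetical value chosen for illustration, not one used in the notebook.

```python
EMBEDDING_PARAMS = 38_059_168   # embedding layer, from the model summary
TOTAL_PARAMS = 38_068_210       # total, from the model summary
EMBEDDING_DIM = 32              # output_dim of the Embedding layer


def embedding_params(vocab_size: int, dim: int) -> int:
    """An embedding table stores one dim-sized vector per vocabulary entry."""
    return vocab_size * dim


# Vocabulary size implied by the summary.
vocab_size = EMBEDDING_PARAMS // EMBEDDING_DIM
embedding_share = EMBEDDING_PARAMS / TOTAL_PARAMS
print(f"vocabulary size: {vocab_size:,}")
print(f"share of parameters in the embedding: {embedding_share:.2%}")

# Hypothetical cap on the vocabulary (assumption, not from the notebook).
HYPOTHETICAL_MAX_TOKENS = 50_000
capped = embedding_params(HYPOTHETICAL_MAX_TOKENS, EMBEDDING_DIM)
print(f"embedding parameters with a {HYPOTHETICAL_MAX_TOKENS:,}-term cap: {capped:,}")
```

In Keras, a cap of this kind would typically be set through the `max_tokens` argument of `TextVectorization`; whether it costs accuracy on rare technical terms would need to be checked empirically.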

### Model Iteration

### Model Evaluation

# Discussion