# Identifying arXiv Article Subject Codes via NLP

The goal of this project is to predict primary subject codes for scientific articles available in the arXiv database based on the text of their abstract. This allows for rapid encoding of article subject material, and similar methods may be applicable for identifying key terms in articles submitted for addition to the database. 

Without machine learning, indexing articles for a database by subject code, key terms, and other metadata is a labor-intensive process. For some databases, the cost of indexing a single article has been estimated at up to 10 dollars. Natural Language Processing offers a way to automate this process, which could save up to roughly 15 million dollars in labor on initial upload for a dataset similar in size to the one used for this project (about 1.58 million articles). Maintenance may yield further savings, since routine updates to the database architecture or indexing system can be automated instead of requiring large amounts of expensive manual labor.
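The savings figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the 10 dollar upper bound quoted above and the row count reported when the dataset is loaded later in this notebook; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the labor cost estimate.
n_articles = 1_582_241      # number of articles in the dataset used in this notebook
cost_per_article = 10       # assumed upper-bound manual indexing cost, in dollars

upper_bound = n_articles * cost_per_article
print(f"Upper-bound manual indexing cost: ${upper_bound:,}")
```

This is an upper bound, not a measured cost; the real figure depends on how much of the indexing could actually be automated.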

In [1]:
#imports
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

2022-12-22 15:23:04.978109: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-22 15:23:07.534907: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/lib:/usr/lib:/lib:
2022-12-22 15:23:07.535822: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so

In [2]:
#import data
df = pd.read_csv("./arxiv-oai-af.tsv", delimiter="\t")
df

Unnamed: 0,abstract,acm_class,arxiv_id,author_text,categories,comments,created,doi,num_authors,num_categories,primary_cat,title,updated
0,If we assume the Thesis that any classical T...,,math/0212388,Bhupinder Singh Anand,math.GM,12 pages. Revision 1. Appendix 1 added. An HTM...,2002-12-31,,1,1,math.GM,Is a deterministic universe logically consiste...,2003-01-02
1,"We define the Cartesian product, composition...",,1205.6123,"Muhammad Akram, Wieslaw A. Dudek",cs.DM,,2012-04-29,10.1016/j.camwa.2010.11.004,2,1,cs.DM,Interval-valued fuzzy graphs,
2,We apply algebraic Morse theory to the Taylo...,,1806.07887,Robin Frankhuizen,"math.AT,math.AC,math.RA",27 pages; comments welcome. arXiv admin note: ...,2018-06-20,,1,3,math.AT,Massey products and the Golod property for sim...,
3,Anomalous transport is usually described eit...,,1007.3022,"Bartlomiej Dybiec, Ewa Gudowska-Nowak","cond-mat.stat-mech,math-ph,math.MP","10 pages, 7 figures",2010-07-18,10.1063/1.3522761,2,3,cond-mat.stat-mech,Subordinated diffusion and CTRW asymptotics,2010-11-09
4,"In this paper, an approximate solution to a ...",,1512.07787,"M. T. Araujo, E. Drigo Filho",cond-mat.stat-mech,"12 pages, 8 figures",2015-12-24,10.5488/CMP.18.43003,2,1,cond-mat.stat-mech,Approximate solution for Fokker-Planck equation,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1582236,21cm intensity mapping experiments aim to ob...,,1501.03823,"Laura Wolz, Filipe B. Abdalla, David Alonso, C...",astro-ph.CO,This article is part of the 'SKA Cosmology Cha...,2015-01-15,,10,1,astro-ph.CO,Foreground Subtraction in Intensity Mapping wi...,
1582237,We show the existence of smooth isolated cur...,,math/0110220,Andreas Leopold Knutsen,math.AG,18 pages. The previous version of the preprint...,2001-10-19,,1,1,math.AG,"Smooth, isolated curves in families of Calabi-...",2012-09-05
1582238,Sequence alignment is a tool in bioinformati...,,0907.2187,"S Wolfsheimer, O Melchert, AK Hartmann","cond-mat.stat-mech,cond-mat.dis-nn,q-bio.QM",,2009-07-13,10.1103/PhysRevE.80.061913,3,3,cond-mat.stat-mech,Finite-temperature local protein sequence alig...,
1582239,"We suggest that the majority of the ""young"",...",,astro-ph/0209553,Valery V. Kravtsov,astro-ph,"7 pages, no figures, accepted for publication ...",2002-09-26,10.1051/0004-6361:20021404,1,1,astro-ph,Second Parameter Globulars and Dwarf Spheroida...,


In [3]:
#find explicit nulls
df.isnull().sum()

abstract                0
acm_class         1560822
arxiv_id                0
author_text             0
categories              0
comments           301829
created                 0
doi                734897
num_authors             0
num_categories          0
primary_cat             0
title                   0
updated            991881
dtype: int64

In [4]:
#drop unnecessary columns 
trimmed_df = df.drop(columns=["acm_class", 
                              "comments", 
                              "created", 
                              "num_authors", 
                              "num_categories", 
                              "updated", 
                              "doi", 
                              "categories", 
                              "author_text", 
                              "title"]
                    )
trimmed_df

Unnamed: 0,abstract,arxiv_id,primary_cat
0,If we assume the Thesis that any classical T...,math/0212388,math.GM
1,"We define the Cartesian product, composition...",1205.6123,cs.DM
2,We apply algebraic Morse theory to the Taylo...,1806.07887,math.AT
3,Anomalous transport is usually described eit...,1007.3022,cond-mat.stat-mech
4,"In this paper, an approximate solution to a ...",1512.07787,cond-mat.stat-mech
...,...,...,...
1582236,21cm intensity mapping experiments aim to ob...,1501.03823,astro-ph.CO
1582237,We show the existence of smooth isolated cur...,math/0110220,math.AG
1582238,Sequence alignment is a tool in bioinformati...,0907.2187,cond-mat.stat-mech
1582239,"We suggest that the majority of the ""young"",...",astro-ph/0209553,astro-ph


In [5]:
#check for duplicate articles
trimmed_df["arxiv_id"].value_counts()

math/0212388    1
1511.00435      1
1003.0352       1
1403.6630       1
1606.07245      1
               ..
0710.1146       1
1508.04795      1
0909.0182       1
1101.0001       1
1309.3564       1
Name: arxiv_id, Length: 1582241, dtype: int64

In [6]:
#full dataset turned out to be too large for training with available resources
#looking for best training subset based on class balance
trimmed_df["primary_cat"].value_counts()[:50]

hep-ph                107925
astro-ph               94239
hep-th                 86019
quant-ph               71567
cond-mat.mes-hall      46310
gr-qc                  45681
cond-mat.mtrl-sci      39348
cond-mat.str-el        35899
cond-mat.stat-mech     32370
astro-ph.SR            30147
astro-ph.CO            29493
math.AP                28783
math.CO                27755
nucl-th                27524
astro-ph.GA            27103
math.PR                26395
cs.CV                  25941
math-ph                25426
math.AG                25312
cond-mat.supr-con      25137
astro-ph.HE            24153
cs.IT                  23048
math.NT                20834
math.DG                20613
cond-mat.soft          19548
hep-ex                 18774
cs.LG                  18178
physics.optics         17130
hep-lat                15185
math.OC                14899
math.DS                14742
math.NA                13912
math.FA                12971
astro-ph.EP            12844
cond-mat      

In [4]:
#keep only the 10 most frequent subjects
n_subjects = 10
#identify the most frequent subjects
balance = trimmed_df["primary_cat"].value_counts()[:n_subjects]
index = np.array(balance.index)

#drop all lower-frequency subjects
subject = trimmed_df["primary_cat"]
balanced_df = trimmed_df[subject.isin(index)]
balanced_df.shape


### Train-test split

In [8]:
train, test = train_test_split(balanced_df, test_size=0.2, random_state=42)

In [9]:
#save data for later
test.to_csv("./Train&Test/test_split.csv")
train.to_csv("./Train&Test/train_split.csv")
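The ten retained subjects are still not evenly sized, which matters when reading the accuracy metric later on. The sketch below uses the counts copied from the `value_counts()` output above to compute each subject's share and the accuracy a model would reach by always predicting the largest class. The shares of the random 80% training split will differ slightly, and the variable names are illustrative.

```python
# Counts of the ten most frequent primary subjects, copied from the output above.
top10_counts = {
    "hep-ph": 107925, "astro-ph": 94239, "hep-th": 86019, "quant-ph": 71567,
    "cond-mat.mes-hall": 46310, "gr-qc": 45681, "cond-mat.mtrl-sci": 39348,
    "cond-mat.str-el": 35899, "cond-mat.stat-mech": 32370, "astro-ph.SR": 30147,
}

total = sum(top10_counts.values())
shares = {code: n / total for code, n in top10_counts.items()}

# Accuracy obtained by always predicting the most frequent subject.
majority_baseline = max(shares.values())

for code, share in shares.items():
    print(f"{code:<20}{share:6.1%}")
print(f"rows kept: {total:,}  majority-class baseline: {majority_baseline:.1%}")
```

Any trained model should clear the majority-class baseline by a wide margin before its accuracy is considered meaningful.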

### Model Construction

The general approach is to build a neural network that encodes the text of each abstract, processes the resulting token sequence with a recurrent layer, and outputs the corresponding subject code. Two recurrent architectures are commonly used for this: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The initial model uses a GRU layer because of computational limitations and the size of the dataset, but both architectures are candidates for this model.
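One way to see why a GRU is the cheaper choice is to count trainable parameters. The sketch below assumes the standard Keras layouts (a GRU with the default `reset_after=True` has three weight blocks with two bias vectors each; an LSTM has four weight blocks with one bias vector each). The helper names are hypothetical, and the example sizes (32 units, 10-dimensional input) match the initial model defined further down.

```python
def gru_params(input_dim, units):
    """Trainable parameters of a Keras GRU layer (default reset_after=True)."""
    # update gate, reset gate and candidate state: input kernel + recurrent kernel + 2 biases each
    return 3 * (units * (input_dim + units) + 2 * units)

def lstm_params(input_dim, units):
    """Trainable parameters of a Keras LSTM layer."""
    # input, forget and output gates plus cell candidate: input kernel + recurrent kernel + 1 bias each
    return 4 * (units * (input_dim + units) + units)

print("GRU :", gru_params(10, 32))
print("LSTM:", lstm_params(10, 32))
```

For the same width, the LSTM carries roughly a third more recurrent parameters, and correspondingly more computation per time step.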

In [2]:
train = pd.read_csv("./Train&Test/train_split.csv")

In [5]:
#target encoder
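def encode_target(data, features=n_subjects):
    """Takes a set of y values and one-hot encodes them for the neural network output."""
    #OneHotEncoder expects a 2D array and yields one column per subject code,
    #ordered by sorted category name
    ohe = OneHotEncoder()
    target = ohe.fit_transform(np.asarray(data).reshape(-1, 1))
    target_array = target.toarray()
    #the number of columns must match the width of the softmax output layer
    assert target_array.shape[1] == features
    return target_array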

In [6]:
#one hot encoding target for training set
train_target = encode_target(train["primary_cat"])
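A one-hot target has exactly one 1 per row, which also makes it straightforward to map a softmax output back to a subject code with `argmax`. The following is a minimal NumPy sketch of that round trip; the three-code subset, the probabilities, and the helper names `one_hot` and `decode` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Illustrative subset of subject codes; columns follow sorted order.
classes = np.array(sorted(["hep-ph", "astro-ph", "hep-th"]))
class_index = {code: i for i, code in enumerate(classes)}

def one_hot(labels):
    """Return a (n_samples, n_classes) array with a single 1.0 per row."""
    out = np.zeros((len(labels), len(classes)), dtype=np.float32)
    out[np.arange(len(labels)), [class_index[label] for label in labels]] = 1.0
    return out

def decode(probabilities):
    """Map each row of class probabilities to its most likely subject code."""
    return classes[np.argmax(probabilities, axis=1)]

y = one_hot(["hep-th", "astro-ph"])
# Made-up softmax outputs for two abstracts.
probs = np.array([[0.1, 0.2, 0.7],
                  [0.6, 0.3, 0.1]])
decoded = decode(probs)
print(y)
print(decoded)
```

Keeping the column order fixed (here, sorted codes) is what allows the same mapping to be reused when decoding predictions on the test split.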

In [7]:
#creating encoder to clean and encode abstract data
encoder = layers.TextVectorization(output_mode='int')
#calling adapt builds the vocabulary by indexing every term in the training abstracts,
#so that each abstract can be mapped to a sequence of integer token ids
encoder.adapt(np.array(train["abstract"]))

2022-12-22 15:26:56.749914: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-12-22 15:26:56.749966: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-172-31-34-239.us-east-2.compute.internal): /proc/driver/nvidia/version does not exist
2022-12-22 15:26:56.752020: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
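To make the vectorization step concrete, here is a plain-Python sketch that approximates what the text vectorization layer does with `output_mode='int'`: lowercase the text, strip punctuation, split on whitespace, and index terms by descending frequency, with index 0 reserved for padding and index 1 for out-of-vocabulary tokens. The helper names and the two toy documents are hypothetical, and the punctuation handling only approximates the Keras default.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation (approximation of the Keras default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts):
    """Index 0 is the padding token, index 1 the out-of-vocabulary token."""
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]

def to_ids(text, vocab_index):
    """Map a text to integer token ids, using 1 for unknown tokens."""
    return [vocab_index.get(tok, 1) for tok in standardize(text).split()]

docs = ["The Kondo effect.", "The gauge theory of the Kondo lattice"]
toy_vocab = build_vocabulary(docs)
toy_index = {tok: i for i, tok in enumerate(toy_vocab)}
ids = to_ids("the unknown gauge", toy_index)
print(toy_vocab)
print(ids)
```

Because every distinct term receives its own index, the vocabulary size directly determines the number of rows in the embedding matrix of the models below.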


Exploding gradients turned out to be a significant issue when constructing the initial model. To address this, several modifications were made during training, and the number of subject headings was limited to the 10 most common so that a few epochs could be completed and data collected to help diagnose the problem. Gradient clipping was implemented early on, but gradient instability persisted.
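For reference, norm-based gradient clipping rescales a gradient whenever its L2 norm exceeds a threshold and leaves it untouched otherwise. The NumPy sketch below illustrates the operation for a single gradient tensor with a threshold of 1.0, the value passed as `clipnorm` in the models below; `clip_by_norm` is a hypothetical helper, not the optimizer's internal code.

```python
import numpy as np

def clip_by_norm(grad, clipnorm=1.0):
    """Rescale grad so that its L2 norm is at most clipnorm."""
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        return grad * (clipnorm / norm)
    return grad

large = np.array([3.0, 4.0])        # L2 norm 5.0, gets scaled down to norm 1.0
small = np.array([0.3, 0.4])        # L2 norm 0.5, left unchanged

clipped = clip_by_norm(large)
untouched = clip_by_norm(small)
print(clipped, np.linalg.norm(clipped))
print(untouched)
```

Clipping bounds the size of each update but keeps its direction, so it limits the damage of an exploding gradient without removing the underlying cause.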

In [None]:
#initial model
init_model = Sequential([
    #input layer
    tf.keras.Input(shape=(1,), dtype=tf.string),
    #encoder from cell above
    encoder,
    #embedding layer
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary())+1,
        output_dim=10,
        # Use masking to handle the variable abstract lengths
        mask_zero=True),
    #recurrent layer
    tf.keras.layers.GRU(32),
    #decision time
    tf.keras.layers.Dense(32, activation='tanh'),
    tf.keras.layers.Dense(n_subjects, activation='softmax')
])

#compiling the model
init_model.compile(
    loss='categorical_hinge',
    #clipping due to initial issues with exploding gradient
    optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),
    metrics=['accuracy', 
             'AUC', 
             tf.keras.metrics.CategoricalCrossentropy()]
)

#summarize and fit the model
init_model.summary()
NL1 = init_model.fit(x=train["abstract"], y=train_target, epochs=30,
                     batch_size=32, validation_split=0.3)

In [None]:
#save the initial model
tf.saved_model.save(init_model, "./models/")

### Model Iteration

Latent Semantic Analysis (LSA) is applied to the abstracts to estimate a suitable output dimension for the embedding layer:

In [16]:
#imports
from sklearn.decomposition import TruncatedSVD
from nltk.tokenize import word_tokenize
import string
import nltk
from sklearn.feature_extraction import FeatureHasher

nltk.download('punkt')

In [26]:
#get vocab
vocab = encoder.get_vocabulary()
#returns list of strings

#define input data
abstracts = train["abstract"]

In [28]:
#create helper functions to process and then vectorize abstracts in sklearn

def pre_process(data):
    """Function takes in a list of abstracts, turns the whole thing lowercase, strips punctuation, tokenizes each word."""
    
    #preprocessing
    processed_data = []
    for abstract in data:
        
        #make lowercase
        abstract = abstract.lower()
        
        #tokenize words
        tokens = word_tokenize(abstract)
        
        #remove punctuation
        rm_punct = str.maketrans('','', string.punctuation)
        punkt_tokens = [token.translate(rm_punct) for token in tokens]
        final_tokens = [token for token in punkt_tokens if token!='']
        
        #add processed data to list of processed abstracts
        processed_data.append(final_tokens)
    
    return processed_data

def vectorize(data, vocabulary):
    """Maps the words according to the vocabulary list index. The function returns each abstract as a list of integer 
    values just like in the initial model. Unlike vectorizers from sklearn, this function retains the order of the words
    which our model is training on."""
    
    #build the word -> index lookup once; calling list.index() for every word
    #scans the whole vocabulary each time
    lookup = {word: i for i, word in enumerate(vocabulary)}
    
    #words that are not in the vocabulary all share one out-of-vocabulary index
    oov_index = len(vocabulary)+1
    
    #convert each abstract to a list of integers, preserving word order
    return [[lookup.get(word, oov_index) for word in abstract] for abstract in data]


In [29]:
#pre-process abstracts
processed = pre_process(abstracts)
processed

[['we', 'uncover', 'a', 'novel', 'solution', 'of', 'the', 't', 'hooft', 'anomaly', 'matching', 'conditions',
  'for', 'qcd', 'interestingly', 'in', 'the', 'perturbative', 'regime', 'the', 'new', 'gauge', 'theory',
  'if', 'interpreted', 'as', 'a', 'possible', 'qcd', 'dual', 'predicts', 'the', 'critical', 'number', 'of',
  'flavors', 'above', 'which', 'qcd', 'in', 'the', 'nonperturbative', 'regime', 'develops', 'an',
  'infrared', 'stable', 'fixed', 'point', 'remarkably', 'this', 'value', 'is', 'identical', 'to', 'the',
  'maximum', 'bound', 'predicted', 'in', 'the', 'nonpertubative', 'regime', 'via', 'the', 'allorders',
  'conjectured', 'beta', 'function', 'for', 'nonsupersymmetric', 'gauge', 'theories'],
 ['the', 'kondo', 'zero', 'bias', 'anomaly', 'of', 'co', 'adatoms', 'probed', 'by', 'scanning',
  'tunneling', 'microscopy', 'is'

In [30]:
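#vectorize with feature hasher (custom vectorizer took too long to run and I was unable to standardize the array for pca)
#n_features=250 sets the number of hash buckets, so every abstract becomes a fixed-length 250-dimensional vector
#note: input_type="string" treats each sample as an iterable of strings, so a raw abstract string is hashed
#character by character; passing the token lists in `processed` would hash words instead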
pca_encoder = FeatureHasher(n_features=250, input_type="string")
encoded_data = pca_encoder.fit_transform(abstracts)
encoded_data[0]

<1x250 sparse matrix of type '<class 'numpy.float64'>'
	with 35 stored elements in Compressed Sparse Row format>
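The hashing trick behind the feature hasher can be sketched in a few lines: each feature string is hashed to one of `n_features` buckets and the bucket count is incremented. The version below is a simplified illustration that uses `zlib.crc32` as a stand-in hash (scikit-learn uses a signed MurmurHash3 and alternates signs by default), and `hash_tokens` is a hypothetical helper. It also shows why the input type matters: iterating over a bare string yields characters, which may explain why the sparse row above stores only 35 elements.

```python
import zlib
import numpy as np

def hash_tokens(tokens, n_features=250):
    """Count how many tokens fall into each of n_features hash buckets."""
    vec = np.zeros(n_features)
    for tok in tokens:
        bucket = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[bucket] += 1.0
    return vec

tokens = ["the", "kondo", "zero", "bias", "anomaly", "of", "the"]
word_vec = hash_tokens(tokens)          # one increment per word
char_vec = hash_tokens("the kondo")     # a bare string is iterated character by character

print(int(word_vec.sum()), np.count_nonzero(word_vec))
print(int(char_vec.sum()), np.count_nonzero(char_vec))
```

With word-level input, the number of non-zero buckets grows with the number of distinct words; with character-level input, it is capped by the size of the character set.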

In [31]:
#determine ideal output for embedding layer

#decompose the hashed features with truncated SVD (LSA), a PCA-like reduction that works on sparse input
principal = TruncatedSVD(n_components=100)

#fit the decomposition
principal.fit(encoded_data)

#assess best number of dimensions for dense array in embedding layer
print(principal.explained_variance_ratio_)

[8.83138040e-01 1.65579679e-02 1.19518766e-02 8.38526928e-03
 7.80543691e-03 7.15115135e-03 6.79931128e-03 5.55821826e-03
 5.36200892e-03 4.83288647e-03 3.91263836e-03 3.73805650e-03
 3.39671373e-03 3.18125187e-03 2.78851008e-03 2.58118607e-03
 2.34670627e-03 2.03258095e-03 1.79615101e-03 1.67481414e-03
 1.57831391e-03 1.42530501e-03 1.21176539e-03 9.02409700e-04
 7.94811487e-04 7.50060245e-04 6.78605097e-04 6.19758796e-04
 4.77225055e-04 4.60806313e-04 4.19144806e-04 4.06100207e-04
 3.27067561e-04 3.22272494e-04 3.06940906e-04 2.90190288e-04
 2.81236039e-04 2.51625609e-04 2.25457216e-04 2.21441077e-04
 2.16116928e-04 2.10346560e-04 2.01370003e-04 1.96389363e-04
 1.84339810e-04 1.80491004e-04 1.64395452e-04 1.51085012e-04
 1.46628745e-04 1.30557786e-04 1.15881320e-04 1.00615439e-04
 9.63774520e-05 9.20388208e-05 8.73147353e-05 8.55542187e-05
 8.10348732e-05 7.89185988e-05 7.85169815e-05 6.35592411e-05
 6.17005406e-05 5.20590113e-05 4.13493772e-05 3.70116119e-05
 3.19936069e-05 2.532314

In [35]:
sum(principal.explained_variance_ratio_[:24])

0.9901085704670903
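Rather than trying slice lengths by hand, the smallest number of components that reaches a target share of explained variance can be read off the cumulative sum. The sketch below uses the first 24 ratios copied from the output above and an assumed target of 99%; in the notebook, the same two lines could be applied directly to `principal.explained_variance_ratio_`.

```python
import numpy as np

# First 24 explained variance ratios, copied from the output above.
ratios = np.array([
    8.83138040e-01, 1.65579679e-02, 1.19518766e-02, 8.38526928e-03,
    7.80543691e-03, 7.15115135e-03, 6.79931128e-03, 5.55821826e-03,
    5.36200892e-03, 4.83288647e-03, 3.91263836e-03, 3.73805650e-03,
    3.39671373e-03, 3.18125187e-03, 2.78851008e-03, 2.58118607e-03,
    2.34670627e-03, 2.03258095e-03, 1.79615101e-03, 1.67481414e-03,
    1.57831391e-03, 1.42530501e-03, 1.21176539e-03, 9.02409700e-04,
])

target = 0.99                       # assumed threshold for "enough" variance
cumulative = np.cumsum(ratios)

# Index of the first cumulative value >= target, converted to a component count.
# If the target is never reached, this evaluates to len(ratios) + 1.
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[n_components - 1])
```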

Second model - the Embedding layer output dimension is set to 24, the number of components needed to explain about 99% of the variance above, in an attempt to improve accuracy

In [36]:
#second model
sec_model = Sequential([
    #input layer
    tf.keras.Input(shape=(1,), dtype=tf.string),
    #encoder from cell above
    encoder,
    #embedding layer
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary())+1,
        output_dim=24,
        # Use masking to handle the variable abstract lengths
        mask_zero=True),
    #recurrent layer
    tf.keras.layers.GRU(32),
    #decision time
    tf.keras.layers.Dense(32, activation='tanh'),
    tf.keras.layers.Dense(n_subjects, activation='softmax')
])

#compiling the model
sec_model.compile(
    loss='categorical_hinge',
    #clipping due to initial issues with exploding gradient
    optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),
    metrics=['accuracy', 
             'AUC', 
             tf.keras.metrics.CategoricalCrossentropy()]
)

#summarize and fit the model
sec_model.summary()
NL2 = sec_model.fit(x=train["abstract"], y=train_target, epochs=30,
                     batch_size=32, validation_split=0.3)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, None, 24)          13357848  
                                                                 
 gru (GRU)                   (None, 32)                5568      
                                                                 
 dense (Dense)               (None, 32)                1056      
                                                                 
 dense_1 (Dense)             (None, 10)                330       
                                                                 
Total params: 13,364,802
Trainable params: 13,364,802
Non-trainable params: 0
____________________________________________

Epoch 28/30
Epoch 29/30
Epoch 30/30
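The parameter counts in the summary can be reproduced by hand, which shows where the model's capacity sits. The sketch below takes the embedding and GRU counts as reported in the summary, recomputes the two dense layers from their shapes, and derives the vocabulary size implied by the embedding matrix; the variable names are illustrative.

```python
# Figures copied from the model summary above.
embedding_params = 13_357_848
gru_params = 5_568
output_dim = 24                                  # embedding output dimension

# The embedding matrix has input_dim rows and output_dim columns.
input_dim = embedding_params // output_dim
vocab_size = input_dim - 1                       # the model sets input_dim = len(vocabulary) + 1

# Dense layers: weights (inputs x units) plus one bias per unit.
dense_params = 32 * 32 + 32                      # Dense(32) on the 32-unit GRU output
dense_1_params = 32 * 10 + 10                    # Dense(10) softmax output

total = embedding_params + gru_params + dense_params + dense_1_params
embedding_share = embedding_params / total
print(f"vocabulary size: {vocab_size:,}")
print(f"total params: {total:,}  embedding share: {embedding_share:.2%}")
```

Almost all trainable parameters belong to the embedding layer, so the vocabulary size and the embedding dimension dominate memory use and training cost far more than the recurrent or dense layers do.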


In [38]:
#save the second model
tf.saved_model.save(sec_model, "./model2/")



INFO:tensorflow:Assets written to: ./model2/assets




Third model - adds a second dense layer to the dense stack and narrows the GRU and dense layers from 32 to 12 units.

In [None]:
#third model
c_model = Sequential([
    #input layer
    tf.keras.Input(shape=(1,), dtype=tf.string),
    #encoder from cell above
    encoder,
    #embedding layer
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary())+1,
        output_dim=24,
        # Use masking to handle the variable abstract lengths
        mask_zero=True),
    #recurrent layer
    tf.keras.layers.GRU(12),
    #decision time
    tf.keras.layers.Dense(12, activation='tanh'),
    tf.keras.layers.Dense(12, activation='tanh'),
    tf.keras.layers.Dense(n_subjects, activation='softmax')
])

#compiling the model
c_model.compile(
    loss='categorical_hinge',
    #clipping due to initial issues with exploding gradient
    optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),
    metrics=['accuracy', 
             'AUC', 
             tf.keras.metrics.CategoricalCrossentropy()]
)

#summarize and fit the model
c_model.summary()
NL3 = c_model.fit(x=train["abstract"], y=train_target, epochs=30,
                     batch_size=32, validation_split=0.3)

#save the third model
tf.saved_model.save(c_model, "./model3/")

### Model Evaluation

# Discussion