# Identifying arXiv Article Subject Codes via NLP

The goal of this project is to predict the primary subject code of scientific articles in the arXiv database from the text of their abstracts. This allows rapid encoding of article subject material, and similar methods may be applicable to identifying key terms in articles submitted to the database.

Without machine learning, indexing articles for a database by subject codes, key terms, and other metrics is a labor-intensive process. For some databases, the labor cost of indexing has been estimated at up to 10 dollars per article. Natural Language Processing offers a way to automate this process, potentially saving up to roughly 15 million dollars in labor on initial upload for a dataset similar in size to the one used in this project (about 1.58 million articles). Maintenance may yield further savings, since routine updates to the database architecture or indexing system can be automated instead of requiring large amounts of expensive manual labor.
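The savings figure is simple arithmetic. As a rough check, assuming the upper-bound estimate of 10 dollars applies uniformly to every one of the 1,582,241 records in the dataset used here:

```python
# Back-of-the-envelope estimate of the manual indexing cost.
# Assumption: the upper-bound figure of 10 dollars per article applies to every record.
n_articles = 1_582_241      # number of records in the dataset used in this project
cost_per_article = 10       # dollars, upper-bound estimate quoted above

total_cost = n_articles * cost_per_article
print(f"Estimated manual indexing cost: ${total_cost:,}")
# prints: Estimated manual indexing cost: $15,822,410
```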

In [2]:
#imports
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction import FeatureHasher


In [3]:
#import data
df = pd.read_csv("./arxiv-oai-af.tsv", delimiter="\t")
df

Unnamed: 0,abstract,acm_class,arxiv_id,author_text,categories,comments,created,doi,num_authors,num_categories,primary_cat,title,updated
0,If we assume the Thesis that any classical T...,,math/0212388,Bhupinder Singh Anand,math.GM,12 pages. Revision 1. Appendix 1 added. An HTM...,2002-12-31,,1,1,math.GM,Is a deterministic universe logically consiste...,2003-01-02
1,"We define the Cartesian product, composition...",,1205.6123,"Muhammad Akram, Wieslaw A. Dudek",cs.DM,,2012-04-29,10.1016/j.camwa.2010.11.004,2,1,cs.DM,Interval-valued fuzzy graphs,
2,We apply algebraic Morse theory to the Taylo...,,1806.07887,Robin Frankhuizen,"math.AT,math.AC,math.RA",27 pages; comments welcome. arXiv admin note: ...,2018-06-20,,1,3,math.AT,Massey products and the Golod property for sim...,
3,Anomalous transport is usually described eit...,,1007.3022,"Bartlomiej Dybiec, Ewa Gudowska-Nowak","cond-mat.stat-mech,math-ph,math.MP","10 pages, 7 figures",2010-07-18,10.1063/1.3522761,2,3,cond-mat.stat-mech,Subordinated diffusion and CTRW asymptotics,2010-11-09
4,"In this paper, an approximate solution to a ...",,1512.07787,"M. T. Araujo, E. Drigo Filho",cond-mat.stat-mech,"12 pages, 8 figures",2015-12-24,10.5488/CMP.18.43003,2,1,cond-mat.stat-mech,Approximate solution for Fokker-Planck equation,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1582236,21cm intensity mapping experiments aim to ob...,,1501.03823,"Laura Wolz, Filipe B. Abdalla, David Alonso, C...",astro-ph.CO,This article is part of the 'SKA Cosmology Cha...,2015-01-15,,10,1,astro-ph.CO,Foreground Subtraction in Intensity Mapping wi...,
1582237,We show the existence of smooth isolated cur...,,math/0110220,Andreas Leopold Knutsen,math.AG,18 pages. The previous version of the preprint...,2001-10-19,,1,1,math.AG,"Smooth, isolated curves in families of Calabi-...",2012-09-05
1582238,Sequence alignment is a tool in bioinformati...,,0907.2187,"S Wolfsheimer, O Melchert, AK Hartmann","cond-mat.stat-mech,cond-mat.dis-nn,q-bio.QM",,2009-07-13,10.1103/PhysRevE.80.061913,3,3,cond-mat.stat-mech,Finite-temperature local protein sequence alig...,
1582239,"We suggest that the majority of the ""young"",...",,astro-ph/0209553,Valery V. Kravtsov,astro-ph,"7 pages, no figures, accepted for publication ...",2002-09-26,10.1051/0004-6361:20021404,1,1,astro-ph,Second Parameter Globulars and Dwarf Spheroida...,


In [3]:
#find explicit nulls
df.isnull().sum()

abstract                0
acm_class         1560822
arxiv_id                0
author_text             0
categories              0
comments           301829
created                 0
doi                734897
num_authors             0
num_categories          0
primary_cat             0
title                   0
updated            991881
dtype: int64

In [5]:
#drop unnecessary columns 
trimmed_df = df.drop(columns=["acm_class", 
                              "comments", 
                              "created", 
                              "num_authors", 
                              "num_categories", 
                              "updated", 
                              "doi", 
                              "categories", 
                              "author_text", 
                              "title"]
                    )
trimmed_df

Unnamed: 0,abstract,arxiv_id,primary_cat
0,If we assume the Thesis that any classical T...,math/0212388,math.GM
1,"We define the Cartesian product, composition...",1205.6123,cs.DM
2,We apply algebraic Morse theory to the Taylo...,1806.07887,math.AT
3,Anomalous transport is usually described eit...,1007.3022,cond-mat.stat-mech
4,"In this paper, an approximate solution to a ...",1512.07787,cond-mat.stat-mech
...,...,...,...
1582236,21cm intensity mapping experiments aim to ob...,1501.03823,astro-ph.CO
1582237,We show the existence of smooth isolated cur...,math/0110220,math.AG
1582238,Sequence alignment is a tool in bioinformati...,0907.2187,cond-mat.stat-mech
1582239,"We suggest that the majority of the ""young"",...",astro-ph/0209553,astro-ph


In [5]:
#check for duplicate articles
trimmed_df["arxiv_id"].value_counts()

math/0212388    1
1511.00435      1
1003.0352       1
1403.6630       1
1606.07245      1
               ..
0710.1146       1
1508.04795      1
0909.0182       1
1101.0001       1
1309.3564       1
Name: arxiv_id, Length: 1582241, dtype: int64
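Scanning a `value_counts()` listing only works when the output is short. A more direct check is sketched below on a toy frame (the ids are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for trimmed_df; these ids are hypothetical.
toy = pd.DataFrame({"arxiv_id": ["a/1", "b/2", "a/1", "c/3"]})

# True only when every id appears exactly once.
print(toy["arxiv_id"].is_unique)

# keep=False flags every occurrence of a repeated id, not just the later ones.
dupes = toy[toy["arxiv_id"].duplicated(keep=False)]
print(dupes)
```

On the real data, `trimmed_df["arxiv_id"].is_unique` should agree with the listing above, where every id has a count of 1.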

In [9]:
#full dataset turned out to be too large for training with available resources
#Need to create dataset with equal number of samples per category
pd.set_option('display.max_rows', 200)
trimmed_df["primary_cat"].value_counts()

hep-ph                107925
astro-ph               94239
hep-th                 86019
quant-ph               71567
cond-mat.mes-hall      46310
gr-qc                  45681
cond-mat.mtrl-sci      39348
cond-mat.str-el        35899
cond-mat.stat-mech     32370
astro-ph.SR            30147
astro-ph.CO            29493
math.AP                28783
math.CO                27755
nucl-th                27524
astro-ph.GA            27103
math.PR                26395
cs.CV                  25941
math-ph                25426
math.AG                25312
cond-mat.supr-con      25137
astro-ph.HE            24153
cs.IT                  23048
math.NT                20834
math.DG                20613
cond-mat.soft          19548
hep-ex                 18774
cs.LG                  18178
physics.optics         17130
hep-lat                15185
math.OC                14899
math.DS                14742
math.NA                13912
math.FA                12971
astro-ph.EP            12844
cond-mat      

In [11]:
#in order to train properly, we want more than 1000 examples for each category in our dataset

#subset subjects that have more than 1000 examples in the original dataset
prim = trimmed_df["primary_cat"].value_counts()
print(f"total categories: {len(prim)}")
n_subjects = int((prim > 1000).sum())
print(f"number of subjects: {n_subjects}")
#value_counts() sorts in descending order, so the first n_subjects entries
#are exactly the categories above the threshold
balance = prim[:n_subjects]
index = np.array(balance.index)

#drop low frequency subjects
subject = trimmed_df["primary_cat"]
balanced_df = trimmed_df[subject.isin(index)]
balanced_df.shape

total categories: 172
number of subjects: 129


(1560961, 3)
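The same filter can be written with a boolean mask on the counts, which does not depend on the sort order of `value_counts()`. A minimal sketch on toy data, where `min_count` is a hypothetical stand-in for the 1000-example threshold:

```python
import pandas as pd

# Toy data: three categories with 5, 3 and 1 examples.
toy = pd.DataFrame({"primary_cat": ["hep-ph"] * 5 + ["cs.LG"] * 3 + ["math.GM"] * 1})

min_count = 2  # hypothetical threshold, standing in for 1000
counts = toy["primary_cat"].value_counts()

# Keep only the categories whose count exceeds the threshold.
keep = counts[counts > min_count].index
filtered = toy[toy["primary_cat"].isin(keep)]

print(sorted(keep))
print(filtered.shape)
```

Here `math.GM` is dropped because it has only one example, leaving the eight rows for `hep-ph` and `cs.LG`.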

In [13]:
#randomly sample each category to contain 1000 samples from the overall dataset
#final dataset should contain 129,000 entries for train test splitting

#fixed random_state so the same subset is drawn on every run
rand_sam_df = balanced_df.groupby('primary_cat').apply(lambda s: s.sample(1000, random_state=42))

rand_sam_df

Unnamed: 0_level_0,Unnamed: 1_level_0,abstract,arxiv_id,primary_cat
primary_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
alg-geom,540584,It is shown that the canonical ring of a min...,alg-geom/9703007,alg-geom
alg-geom,1476135,In the note we construct a family of \'etale...,alg-geom/9309002,alg-geom
alg-geom,942614,The aim of this paper is to construct an imm...,alg-geom/9706007,alg-geom
alg-geom,849953,Theta functions of level n on the principall...,alg-geom/9602008,alg-geom
alg-geom,1049098,"We piece together ingredients, which are wel...",alg-geom/9505035,alg-geom
...,...,...,...,...
stat.ML,952541,Latent feature modeling allows capturing the...,1706.03779,stat.ML
stat.ML,997009,Capturing the dependence structure of multiv...,1507.05899,stat.ML
stat.ML,1508225,"In a recently published paper [1], it is sho...",1901.02182,stat.ML
stat.ML,1393301,The existence of evasion attacks during the ...,1806.01471,stat.ML
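Note that `groupby(...).apply(...)` adds `primary_cat` as an extra index level alongside the column of the same name, as the output above shows. One possible alternative, sketched here on toy data and assuming pandas 1.1 or later, is `DataFrameGroupBy.sample`, which keeps the original flat index:

```python
import pandas as pd

# Toy data standing in for balanced_df.
toy = pd.DataFrame({
    "primary_cat": ["hep-ph"] * 5 + ["cs.LG"] * 4,
    "abstract": [f"abstract {i}" for i in range(9)],
})

# Draw the same number of rows from every category; random_state makes the draw repeatable.
sampled = toy.groupby("primary_cat").sample(n=3, random_state=42)

print(sampled["primary_cat"].value_counts())
print(sampled.index.nlevels)
```

Each category contributes exactly three rows, and the result has a single-level index, so it can be written to CSV without a duplicated `primary_cat` field.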


In [16]:
#category sanity check
rand_sam_df["primary_cat"].value_counts()

alg-geom              1000
math.CT               1000
physics.acc-ph        1000
nucl-th               1000
nucl-ex               1000
nlin.SI               1000
nlin.PS               1000
nlin.CD               1000
nlin.AO               1000
math.ST               1000
math.SP               1000
math.SG               1000
math.RT               1000
math.RA               1000
math.QA               1000
math.PR               1000
math.OC               1000
math.OA               1000
math.NT               1000
math.NA               1000
math.MG               1000
math.LO               1000
math.KT               1000
math.HO               1000
math.GT               1000
math.GR               1000
math.GN               1000
math.GM               1000
math.FA               1000
math.DS               1000
math.DG               1000
physics.ao-ph         1000
physics.app-ph        1000
physics.atom-ph       1000
physics.space-ph      1000
stat.ME               1000
stat.CO               1000
s

In [17]:
#save data file in csv
rand_sam_df.to_csv("./extended_project_subset_data")

In [18]:
#train test split
train, test = train_test_split(rand_sam_df, test_size=0.2, random_state=42)
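A plain random split does not guarantee that each subject keeps the same share of examples in the train and test sets. If exact class balance matters, `train_test_split` accepts a `stratify` argument. A minimal sketch on toy data (the frame and variable names are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: two categories with ten examples each.
toy = pd.DataFrame({
    "abstract": [f"abstract {i}" for i in range(20)],
    "primary_cat": ["hep-ph"] * 10 + ["cs.LG"] * 10,
})

# stratify preserves the per-category proportions in both splits.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["primary_cat"]
)

print(train_toy["primary_cat"].value_counts())
print(test_toy["primary_cat"].value_counts())
```

With a balanced toy set and a 20 percent test size, each category ends up with eight training rows and two test rows.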

# Clustering Analysis

# Iterative Modeling

In [None]:
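#target encoder
def encode_target(data, features=n_subjects, input_type="string"):
    """Hash each label in `data` into a fixed-width indicator vector for the Neural Network output.

    Note: FeatureHasher is a hashing encoder rather than a true one hot encoder,
    so distinct labels may collide in the same column.
    """
    #alternate_sign=False keeps the entries at 0/1 instead of +1/-1
    FH = FeatureHasher(n_features=features, input_type=input_type, alternate_sign=False)
    #with input_type="string" each sample must be an iterable of strings,
    #so every label is wrapped in a one-element list
    target = FH.fit_transform(X=[[label] for label in data])
    target_array = target.toarray()
    return target_array

#hash encoding target for training set
train_target = encode_target(train["primary_cat"])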
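`FeatureHasher` maps labels into a fixed number of buckets, so two subject codes can land in the same column. A collision-free alternative is scikit-learn's `OneHotEncoder`, which is already imported above. This is a minimal sketch on toy labels, not the encoding used in this notebook:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy labels standing in for train["primary_cat"]; the encoder expects a 2D array.
labels = np.array(["hep-ph", "cs.LG", "math.GM", "cs.LG"]).reshape(-1, 1)

# handle_unknown="ignore" encodes labels unseen during fit as an all-zero row.
ohe = OneHotEncoder(handle_unknown="ignore")
onehot = ohe.fit_transform(labels).toarray()

print(ohe.categories_[0])
print(onehot)
```

Each row contains a single 1 in the column of its category, and the fitted encoder can be reused on the test labels with `ohe.transform(...)` so that the column order stays consistent.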

In [None]:
#creating encoder to clean and tokenize abstract data
encoder = layers.TextVectorization(output_mode='int')
#calling adapt builds the vocabulary: it indexes every term in the training abstracts
#so that the layer can map each token to an integer id
encoder.adapt(np.array(train["abstract"]))
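To make the adapt step concrete, the following pure-Python sketch approximates what the layer does in `int` mode: lowercase the text, strip punctuation, split on whitespace, and assign integer ids by descending frequency, with 0 reserved for padding and 1 for out-of-vocabulary tokens. The helpers `build_vocab` and `encode` are hypothetical and only illustrate the idea; they are not the Keras implementation, and the exact punctuation and tie-breaking rules may differ.

```python
import re
from collections import Counter

def _tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter()
    for text in texts:
        counts.update(_tokenize(text))
    # ids 0 and 1 are reserved for padding and unknown tokens
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common())}

def encode(text, vocab):
    """Replace each token with its id, using 1 for tokens not in the vocabulary."""
    return [vocab.get(tok, 1) for tok in _tokenize(text)]

vocab = build_vocab(["We apply Morse theory.", "We define fuzzy graphs."])
print(encode("We apply fuzzy logic", vocab))
# prints: [2, 3, 7, 1]
```

The word "logic" never appeared in the texts used to build the vocabulary, so it is mapped to the unknown id 1. This is why the vocabulary should be built from the training abstracts only.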