<a href="https://colab.research.google.com/github/auroramugnai/arXiv_classification/blob/main/arXiv_classification/keywords_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clone the GitHub repository and move into its inner directory.

In [1]:
!git clone https://github.com/auroramugnai/arXiv_classification.git
%cd arXiv_classification/arXiv_classification

Cloning into 'arXiv_classification'...
remote: Enumerating objects: 648, done.
remote: Counting objects: 100% (346/346), done.
remote: Compressing objects: 100% (158/158), done.
remote: Total 648 (delta 206), reused 244 (delta 163), pack-reused 302
Receiving objects: 100% (648/648), 14.91 MiB | 17.94 MiB/s, done.
Resolving deltas: 100% (359/359), done.
/content/arXiv_classification/arXiv_classification


# 1) Build the dataset and extract the keywords

In [2]:
import json
import random
import zipfile

import dask.bag as db
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import utils

In [3]:
SEED = 42 # fix random seed for reproducibility

(Alternatively, run the next cell to load from .csv 10k articles whose keywords have already been extracted, then skip to the Classification section.)

In [4]:
# path = f"./kws_cs_10k.csv"
# df = pd.read_csv(path, dtype=str)

## 1.1 Download the dataset
The following command comes from clicking "Copy API command" on https://www.kaggle.com/datasets/Cornell-University/arxiv.

In [5]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content/arXiv_classification/arXiv_classification
 99% 1.27G/1.28G [00:16<00:00, 43.5MB/s]
100% 1.28G/1.28G [00:16<00:00, 82.2MB/s]


Unzip the downloaded file.

In [6]:
with zipfile.ZipFile('./arxiv.zip', 'r') as zip_ref:
    zip_ref.extractall()

Unzipping creates the file "arxiv-metadata-oai-snapshot.json", which stores one JSON record per line. We now create a dask bag out of it.

In [7]:
path = "./arxiv-metadata-oai-snapshot.json"
arxiv_data = db.read_text(path).map(json.loads)
arxiv_data.take(1)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d
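As a rough illustration of what `db.read_text(path).map(json.loads)` does, the sketch below parses a file-like object line by line with the standard library only. The two records are made up for the example, and unlike a dask bag this version is eager rather than lazy.

```python
import io
import json

# Two hypothetical records, one JSON object per line (JSON Lines layout).
raw = io.StringIO(
    '{"id": "a1", "categories": "cs.LG cs.CV"}\n'
    '{"id": "b2", "categories": "hep-ph"}\n'
)

# Parse each line into a dictionary, as json.loads does for each bag element.
records = [json.loads(line) for line in raw]

print(records[0]["id"])
print(len(records))
```

On the real snapshot this eager approach would load every record into memory at once, which is presumably why the notebook uses a dask bag instead.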

## 1.2 Get rid of some unnecessary information

In [8]:
# Get the creation date of the latest version of an article.
get_latest_version = lambda x: x['versions'][-1]['created']

# Only keep articles whose latest version was created after 2020
# (the fourth space-separated field of the date string is the year).
is_after_2020 = lambda x: int(get_latest_version(x).split(' ')[3]) > 2020

# Only keep some information.
cut_info = lambda x: {'id': x['id'],
                      'title': x['title'],
                      'category': x['categories'].split(' '),
                      'abstract': x['abstract']}

# Only keep articles whose categories all belong to Computer Science.
is_only_cs = lambda x: all(s.startswith("cs.") for s in x['categories'].split(' '))

arxiv_data_filtered = (arxiv_data.filter(is_after_2020)
                                 .filter(is_only_cs)
                                 .map(cut_info)
                                 .compute())


# Create a pandas dataframe and save it to csv.
df = pd.DataFrame(arxiv_data_filtered)
df.to_csv("./cs_arxiv_data_filtered.csv", index=False)
df.head()

| | id | title | category | abstract |
|---|---|---|---|---|
| 0 | 710.3901 | A recursive linear time modular decomposition ... | [cs.DM] | A module of a graph G is a set of vertices t... |
| 1 | 711.201 | A Polynomial Time Algorithm for Graph Isomorphism | [cs.CC] | We claimed that there is a polynomial algori... |
| 2 | 802.3414 | A Universal In-Place Reconfiguration Algorithm... | [cs.CG, cs.MA, cs.RO] | In the modular robot reconfiguration problem... |
| 3 | 803.3946 | On the `Semantics' of Differential Privacy: A ... | [cs.CR, cs.DB] | Differential privacy is a definition of "pri... |
| 4 | 805.1877 | Perfect tag identification protocol in RFID ne... | [cs.NI] | Radio Frequency IDentification (RFID) system... |
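To make the two filters concrete, here is a small sketch that applies the same logic to three hand-made records using plain Python. The records, their ids, and the helper names `latest_year` and `is_only_cs` are hypothetical; only the field layout (`versions`, `created`, `categories`) follows the dataset.

```python
# Hypothetical records mimicking the fields used by the filters.
records = [
    {"id": "x1", "categories": "cs.LG cs.CV",
     "versions": [{"created": "Mon, 2 Apr 2007 19:18:42 GMT"},
                  {"created": "Tue, 5 Jan 2021 10:00:00 GMT"}]},
    {"id": "x2", "categories": "cs.LG stat.ML",
     "versions": [{"created": "Thu, 3 Mar 2022 08:30:00 GMT"}]},
    {"id": "x3", "categories": "cs.DM",
     "versions": [{"created": "Fri, 10 Jul 2020 12:00:00 GMT"}]},
]

def latest_year(record):
    """Year in which the latest version of the record was created."""
    created = record["versions"][-1]["created"]
    return int(created.split(" ")[3])

def is_only_cs(record):
    """True if every listed category starts with 'cs.'."""
    return all(c.startswith("cs.") for c in record["categories"].split(" "))

kept = [r["id"] for r in records if latest_year(r) > 2020 and is_only_cs(r)]
print(kept)
```

Only `x1` survives: `x2` is cross-listed outside Computer Science, and the latest version of `x3` dates from 2020, which is not strictly after 2020.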


## 1.3 Get a fixed number of articles
We keep only a fixed number of articles, to speed up computation and avoid a session crash.

In [26]:
df = pd.read_csv("cs_arxiv_data_filtered.csv", dtype=str)

In [27]:
num_data = 10000 # number of articles that we want to keep
print(f"The dataset contains {len(df)} articles.")

# Sample the dataset only if its length exceeds num_data.
# Note: random.seed() does not affect DataFrame.sample; random_state does.
if len(df) > num_data:
    df = df.sample(n=num_data, random_state=SEED)

df.to_csv("./dataset_to_classify.csv", index=False)
print(f"The dataset contains {len(df)} articles.")

The dataset contains 199846 articles.
The dataset contains 10000 articles.


# 2) Text-processing

In [13]:
!pip install -U spacy -q
!python -m spacy download en_core_web_md -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [14]:
import en_core_web_md
import spacy
from tqdm import tqdm

In [28]:
df = pd.read_csv("dataset_to_classify.csv", dtype=str)

Clean the abstracts and titles (this step takes several minutes).

In [29]:
# Remove stop words, punctuation, special characters, numbers.
nlp = spacy.load("en_core_web_md")
tqdm.pandas() # to display progress bar

# First on abstracts.
abs_cleaner = lambda x: utils.text_cleaner(text=x["abstract"], nlp=nlp)
df["clean_abstract"] = df.progress_apply(abs_cleaner, axis=1)

# Then on titles.
tit_cleaner = lambda x: utils.text_cleaner(text=x["title"], nlp=nlp)
df["clean_title"] = df.progress_apply(tit_cleaner, axis=1)

df.tail()

100%|██████████| 10000/10000 [09:30<00:00, 17.54it/s]
100%|██████████| 10000/10000 [01:46<00:00, 94.08it/s]


| | id | title | category | abstract | clean_abstract | clean_title |
|---|---|---|---|---|---|---|
| 9995 | 2402.01927 | Mathemyths: Leveraging Large Language Models t... | ['cs.HC'] | Mathematical language is a cornerstone of a ... | mathematical language cornerstone child mathem... | mathemyth leverage large language model teach ... |
| 9996 | 2404.10957 | Personalized Federated Learning via Stacking | ['cs.LG', 'cs.CR', 'cs.DC'] | Traditional Federated Learning (FL) methods ... | traditional federated learning fl method typic... | personalized federated learning stack |
| 9997 | 2210.04847 | NerfAcc: A General NeRF Acceleration Toolbox | ['cs.CV', 'cs.GR'] | We propose NerfAcc, a toolbox for efficient ... | propose nerfacc toolbox efficient volumetric r... | nerfacc general nerf acceleration toolbox |
| 9998 | 2008.07073 | AlphaNet: Improving Long-Tail Classification B... | ['cs.CV'] | Methods in long-tail learning focus on impro... | method long tail learning focus improve perfor... | alphanet improve long tail classification comb... |
| 9999 | 2105.14467 | Occam Learning Meets Synthesis Through Unifica... | ['cs.PL'] | The generalizability of PBE solvers is the k... | generalizability pbe solver key empirical synt... | occam learning meet synthesis unification |


In [30]:
# Count NaN values in the cleaned titles.
c = df['clean_title'].isna().sum()
print(c)

# Count NaN values in the cleaned abstracts.
c = df['clean_abstract'].isna().sum()
print(c)

0
0


In [32]:
# Join cleaned title and cleaned abstract, separated by a space.
df["clean_text"] = df["clean_title"] + " " + df["clean_abstract"]

# Save to csv
df.to_csv("./processed_dataframe.csv", index=False)
df.head()

| | id | title | category | abstract | clean_abstract | clean_title | clean_text |
|---|---|---|---|---|---|---|---|
| 0 | 2203.13237 | MD-SLAM: Multi-cue Direct SLAM | ['cs.RO'] | Simultaneous Localization and Mapping (SLAM)... | simultaneous localization mapping slam system ... | md slam multi cue direct slam | md slam multi cue direct slam simultaneous loc... |
| 1 | 2402.19226 | Investigating Gender Fairness in Machine Learn... | ['cs.LG', 'cs.CY'] | Chronic pain significantly diminishes the qu... | chronic pain significantly diminish quality li... | investigate gender fairness machine learning d... | investigate gender fairness machine learning d... |
| 2 | 2307.12149 | CorrFL: Correlation-Based Neural Network Archi... | ['cs.LG', 'cs.DC', 'cs.NI'] | The Federated Learning (FL) paradigm faces s... | federated learning fl paradigm face challenge ... | corrfl correlation base neural network archite... | corrfl correlation base neural network archite... |
| 3 | 2302.01791 | DilateFormer: Multi-Scale Dilated Transformer ... | ['cs.CV'] | As a de facto solution, the vanilla Vision T... | de facto solution vanilla vision transformers ... | dilateformer multi scale dilated transformer v... | dilateformer multi scale dilated transformer v... |
| 4 | 2401.03523 | Characterizing Physical Memory Fragmentation | ['cs.OS', 'cs.PF'] | External fragmentation of physical memory oc... | external fragmentation physical memory occur a... | characterize physical memory fragmentation | characterize physical memory fragmentation ext... |


In [22]:
# Quick check of the cleaner on a single string.
# t0 =  "W0_sample = np.random.normal(0,1)?"
# t = utils.text_cleaner(text=t0, nlp=nlp)
# print(t)
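The body of `utils.text_cleaner` is not shown in this notebook. As a loose illustration of the kind of cleaning described (stop words, punctuation, special characters, and numbers removed), the sketch below uses a regular expression and a tiny hard-coded stop-word list. It is a simplified stand-in, not the repository's function: `simple_cleaner` and `STOP_WORDS` are hypothetical names, and no lemmatization is done, whereas the cleaned columns in the tables suggest the spaCy-based cleaner also lemmatizes.

```python
import re

# Tiny illustrative stop-word list; spaCy's own list is much larger.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "is", "to", "we"}

def simple_cleaner(text):
    """Lowercase, keep letters only, and drop stop words (no lemmatization)."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

cleaned = simple_cleaner("NerfAcc: A General NeRF Acceleration Toolbox, 2022")
print(cleaned)
```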




# 3) Keywords extraction

In [33]:
!pip install KeyBERT -q
!pip install keyphrase-vectorizers -q


In [34]:
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer



In [35]:
kw_model = KeyBERT('all-mpnet-base-v2')

# Extract keywords from the cleaned text, passing the words of the
# cleaned title as `seed`.
extraction = lambda x: utils.extract_kws(text=x["clean_text"],
                                         kw_model=kw_model,
                                         seed=x["clean_title"].split(" "))

df["keywords"] = df.progress_apply(extraction, axis=1)

df.to_csv("./keywords.csv", index=False) # Save to csv
df.head()


  0%|          | 2/10000 [00:17<23:57:26,  8.63s/it]


KeyboardInterrupt: 

# 4) Classification
Given an article:

- its feature X will be the cleaned text
- its label y will be the first of its extracted keywords

In [None]:
!pip install scikit-multilearn -q

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [None]:
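# Preparing X (features).
# Note: the dataframe built in section 2 names this column "clean_text";
# use that name if your dataframe has no "text" column.
X = df["text"]

# Preparing y (labels)
y = df['keywords']

# Split data into train/test.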
X_train, X_test, y_train_tot, y_test_tot = train_test_split(X, y,
                                                            test_size=0.5,
                                                            random_state=SEED)

In [None]:
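import ast

# Select only the first keyword for every article. The keyword lists are
# stored as strings, so parse them with ast.literal_eval (safer than eval).
y_train = [ast.literal_eval(x)[0] for x in y_train_tot]
y_test = [ast.literal_eval(x)[0] for x in y_test_tot]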

Do the classification.

In [None]:
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('svm_model', LinearSVC(verbose=1))])

y_pred = utils.run_model(model, X_train, X_test, y_train, y_test,
                         multilabel=False)

print('accuracy: ', accuracy_score(y_test, y_pred))

df_pred = pd.DataFrame({'clean_text': X_test,
                        'true_kws': y_test_tot,
                        'first_true_kw': y_test,
                        'predicted_kw': y_pred})

[LibLinear]accuracy:  0.371


In [None]:
df_pred = df_pred.reset_index(drop=True)
df_pred.head()

| | clean_text | true_kws | first_true_kw | predicted_kw |
|---|---|---|---|---|
| 0 | building defect prediction models by online le... | ['learning', 'predict', 'defect', 'auc'] | learning | prediction |
| 1 | adaptive discretization use voronoi trees for ... | ['pomdp', 'discretization', 'voronoi', 'tree'] | pomdp | action |
| 2 | study the explanation for the automate predict... | ['classification', 'bug', 'shap', 'understand'] | classification | explainability |
| 3 | airtrack onboard deep learning framework for l... | ['aircraft', 'tracking', 'dataset', 'daa'] | aircraft | tracking |
| 4 | query complexity based optimal processing of r... | ['workload', 'query', 'partition', 'dataset'] | workload | rdf |


In [None]:
# Count the predicted kws that appear among the true kws.
# Note: true_kws is still the string form of the list, so `in` is a
# substring test (e.g. "learn" would also match "learning").
is_in_true_kws = lambda x: x.predicted_kw in x.true_kws
num_true = df_pred.apply(is_in_true_kws, axis=1).sum()

# Turn it into a percentage.
print(f"{round((num_true/len(df_pred))*100, 2)}% of predicted kws are true kws")

57.14% of predicted kws are true kws
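Because `true_kws` holds the string representation of each keyword list, the `in` test in the cell above matches substrings rather than whole keywords. The sketch below contrasts the two behaviours on two made-up rows (the rows and the `results` list are hypothetical). Parsing the string with `ast.literal_eval` first gives exact list membership, which may yield a somewhat lower percentage than the one reported.

```python
import ast

# Hypothetical rows: true_kws stored as the string form of a list.
rows = [
    {"predicted_kw": "tracking",
     "true_kws": "['aircraft', 'tracking', 'dataset', 'daa']"},
    {"predicted_kw": "learn",
     "true_kws": "['learning', 'predict', 'defect', 'auc']"},
]

results = []
for row in rows:
    # Substring test against the raw string.
    substring_hit = row["predicted_kw"] in row["true_kws"]
    # Exact membership test against the parsed list.
    exact_hit = row["predicted_kw"] in ast.literal_eval(row["true_kws"])
    results.append((row["predicted_kw"], substring_hit, exact_hit))

for kw, substring_hit, exact_hit in results:
    print(f"{kw}: substring={substring_hit}, exact={exact_hit}")
```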


# 5) Compute the similarity between the true and the predicted keywords

In [None]:
import nltk
import spacy
from gensim.models import Word2Vec

In [None]:
# Lists of the keywords on which we want to compute the similarity.
kws_pred = df_pred['predicted_kw'].values
kws_true = df_pred['first_true_kw'].values

In [None]:
# Create the corpus using our processed texts.
corpus = list(df['text'].values)

# Tokenize the corpus.
nltk.download('punkt')
tokenized_corpus = [nltk.word_tokenize(text.lower()) for text in corpus]

# Train the Word2Vec model on the created corpus.
model = Word2Vec(tokenized_corpus, min_count=1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Compute the semantic similarity between each predicted keyword and the first true keyword.

In [None]:
simil_meaning_list = [] # semantic similarity

for i, (kp, kt) in enumerate(zip(kws_pred, kws_true)):
    # Cosine similarity between the Word2Vec vectors of the two keywords.
    sim = round(float(model.wv.similarity(kp, kt)), 2)
    if i < 5:
        print(f"The similarity between '{kp}' and '{kt}' is: {sim}")
    simil_meaning_list.append(sim)

print(f"\nMEAN OF SIMILARITIES: {np.mean(simil_meaning_list)}")

The similarity between 'prediction' and 'learning' is: 0.22
The similarity between 'action' and 'pomdp' is: 0.16
The similarity between 'explainability' and 'classification' is: 0.26
The similarity between 'tracking' and 'aircraft' is: 0.41
The similarity between 'rdf' and 'workload' is: 0.4

MEAN OF SIMILARITIES: 0.595884
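The score returned by gensim's `similarity` is the cosine similarity between the two word vectors. The sketch below writes that formula out in plain Python on toy vectors; the function name `cosine_similarity` and the vectors are illustrative and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical vectors point in the same direction.
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Perpendicular vectors share no direction.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(round(same, 6), orthogonal)
```

A keyword compared with itself scores 1, so every correctly predicted keyword contributes 1 to the mean. This probably explains why the mean is well above the similarities printed for the first five pairs, all of which are mismatches.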
