<a href="https://colab.research.google.com/github/auroramugnai/arXiv_classification/blob/main/arXiv_classification/keywords_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clone the GitHub repository and move into the inner directory.

In [1]:
!git clone https://github.com/auroramugnai/arXiv_classification.git
%cd arXiv_classification/arXiv_classification

Cloning into 'arXiv_classification'...
remote: Enumerating objects: 192, done.
remote: Counting objects: 100% (62/62), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 192 (delta 52), reused 40 (delta 40), pack-reused 130
Receiving objects: 100% (192/192), 9.32 MiB | 12.72 MiB/s, done.
Resolving deltas: 100% (110/110), done.
/content/arXiv_classification/arXiv_classification


# 1) Build the dataset and extract the keywords

In [29]:
import json
import random
import zipfile

import dask.bag as db
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import utils

In [30]:
SEED = 42 # fix random seed for reproducibility

(Alternatively, run the cell below to load 10k articles with already-extracted keywords from a .csv file, then skip to section 2.)

In [None]:
# path = f"./kws_cs_10k.csv"
# df = pd.read_csv(path, dtype=str)

## Download the dataset
The command below comes from clicking "Copy API command" on https://www.kaggle.com/datasets/Cornell-University/arxiv.

In [31]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content/arXiv_classification/arXiv_classification
 99% 1.26G/1.28G [00:13<00:00, 153MB/s]
100% 1.28G/1.28G [00:13<00:00, 99.4MB/s]


Unzip the downloaded file.

In [32]:
with zipfile.ZipFile('./arxiv.zip', 'r') as zip_ref:
    zip_ref.extractall()

Unzipping creates the file "arxiv-metadata-oai-snapshot.json". We now create a Dask bag from it.

In [33]:
path = "./arxiv-metadata-oai-snapshot.json"
arxiv_data = db.read_text(path).map(json.loads)
arxiv_data.take(1)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d
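
The snapshot is a JSON Lines file: one JSON object per line. As a rough illustration of what `db.read_text(path).map(json.loads)` evaluates lazily, here is a plain-Python sketch that does not use Dask. The helper name `iter_records` and the two records in the temporary file are made up for the example; they are not real arXiv entries.

```python
import json
import tempfile
from itertools import islice


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Tiny stand-in file with two made-up records (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(json.dumps({"id": "0000.0001", "categories": "cs.LG"}) + "\n")
    tmp.write(json.dumps({"id": "0000.0002", "categories": "hep-ph"}) + "\n")
    demo_path = tmp.name

# Rough equivalent of `arxiv_data.take(1)`: read only the first record.
first = list(islice(iter_records(demo_path), 1))
print(first)
```

Like the Dask bag, the generator never holds the whole file in memory, which matters for a snapshot of this size.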

## Get rid of some unnecessary information

In [34]:
# Get the creation date of the latest version of an article.
get_latest_version = lambda x: x['versions'][-1]['created']

# Only keep articles whose latest version was created after 2020.
is_after_2020 = lambda x: int(get_latest_version(x).split(' ')[3]) > 2020

# Only keep the fields we need.
cut_info = lambda x: {'id': x['id'],
                      'title': x['title'],
                      'category': x['categories'].split(' '),
                      'abstract': x['abstract']}

# Only keep articles whose categories all belong to the Computer Science macro-category.
is_only_cs = lambda x: all(s.startswith("cs.") for s in x['categories'].split(' '))

arxiv_data_filtered = (arxiv_data.filter(is_after_2020).filter(is_only_cs).map(cut_info).compute())


# Create a pandas dataframe and save it to csv.
df = pd.DataFrame(arxiv_data_filtered)
df.to_csv("./cs_arxiv_data_filtered.csv", index=False)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,[cs.DM],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,[cs.CC],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"[cs.CG, cs.MA, cs.RO]",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"[cs.CR, cs.DB]","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,[cs.NI],Radio Frequency IDentification (RFID) system...
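
To make the two filters concrete, here is a small standalone sketch applied to three made-up records. It assumes that the `created` field is an RFC 2822-style date string such as `Mon, 2 Apr 2007 19:18:42 GMT`, which is the format the `split(' ')[3]` indexing above relies on. The function `latest_version_year` and the sample records are hypothetical. The year is parsed with the standard library instead of by position.

```python
from email.utils import parsedate_to_datetime


def latest_version_year(record):
    """Year in which the latest version of the article was created."""
    created = record["versions"][-1]["created"]
    return parsedate_to_datetime(created).year


def is_only_cs(record):
    """True if every listed category belongs to the cs.* macro-category."""
    return all(c.startswith("cs.") for c in record["categories"].split(" "))


# Made-up records in the assumed snapshot format (hypothetical data).
records = [
    {"id": "a", "categories": "cs.LG cs.AI",
     "versions": [{"created": "Mon, 2 Apr 2007 19:18:42 GMT"},
                  {"created": "Tue, 5 Jan 2021 10:00:00 GMT"}]},
    {"id": "b", "categories": "cs.LG stat.ML",
     "versions": [{"created": "Fri, 3 Jun 2022 08:30:00 GMT"}]},
    {"id": "c", "categories": "cs.CV",
     "versions": [{"created": "Wed, 9 Dec 2020 12:00:00 GMT"}]},
]

kept = [r["id"] for r in records
        if latest_version_year(r) > 2020 and is_only_cs(r)]
print(kept)
```

Record `b` is dropped because it is cross-listed outside Computer Science, and record `c` because its latest version dates from 2020.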


## Get a fixed number of articles
We keep a fixed number of articles to speed up computation and avoid a session crash.

In [35]:
num_data = 10000 # number of articles that we want to keep
print(f"The dataset contains {len(df)} articles.")

# Sample the dataset only if its length exceeds num_data.
if len(df) > num_data:
    # pandas does not use Python's `random` module, so seed the sampling directly.
    df = df.sample(n=num_data, random_state=SEED)

df.to_csv("./dataset_to_classify.csv", index=False)
print(f"The dataset contains {len(df)} articles.")

The dataset contains 199846 articles.
The dataset contains 10000 articles.


## Texts processing and keywords extraction

In [36]:
!pip install KeyBERT -q
!pip install keyphrase-vectorizers -q
!pip install -U spacy -q # spacy package to preprocess the abstract text
!python -m spacy download en_core_web_md -q


In [37]:
import en_core_web_md
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from tqdm import tqdm



Clean out the strings (this step will take a while).

In [38]:
# Lowercase, lemmatize, and strip punctuation, special characters, and numbers (see utils.remove).
nlp = spacy.load("en_core_web_md")
tqdm.pandas() # to display progress bar

# First on abstracts.
clean_abs = lambda x: utils.remove(text=x["abstract"], nlp=nlp)
df["clean_abstract"] = df.progress_apply(clean_abs, axis=1)

# Then on titles.
clean_tit = lambda x: utils.remove(text=x["title"], nlp=nlp)
df["clean_title"] = df.progress_apply(clean_tit, axis=1)
df.tail()

100%|██████████| 10000/10000 [06:06<00:00, 27.26it/s]
100%|██████████| 10000/10000 [01:07<00:00, 147.31it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title
120984,2303.16898,Bagging by Learning to Singulate Layers Using ...,[cs.RO],Many fabric handling and 2D deformable mater...,many fabric handling and deformable material t...,bag by learn to singulate layer use interactiv...
44200,2109.12421,Integrating Unsupervised Clustering and Label-...,"[cs.LG, cs.AI]",There is often a mixture of very frequent la...,there be often a mixture of very frequent labe...,integrate unsupervised clustering and label sp...
63850,2203.00158,GROW: A Row-Stationary Sparse-Dense GEMM Accel...,"[cs.AR, cs.AI, cs.LG]",Graph convolutional neural networks (GCNs) h...,graph convolutional neural network gcns have e...,grow a row stationary sparse dense gemm accele...
86194,2208.02369,Deep VULMAN: A Deep Reinforcement Learning-Ena...,"[cs.AI, cs.CR, cs.NE]",Cyber vulnerability management is a critical...,cyber vulnerability management be a critical f...,deep vulman a deep reinforcement learning enab...
29197,2106.02282,Decoupled Dialogue Modeling and Semantic Parsi...,[cs.CL],"Recently, Text-to-SQL for multi-turn dialogu...",recently text to sql for multi turn dialogue h...,decouple dialogue modeling and semantic parsin...
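
The source of `utils.remove` is not shown here, so the following is only a rough stand-in that reproduces part of the visible effect with the standard library: lowercasing, replacing punctuation with spaces, and dropping tokens that contain digits. The function `simple_clean` is hypothetical. Unlike the spaCy-based step it does no lemmatization, so it would leave "is" where the output above shows "be".

```python
import re


def simple_clean(text: str) -> str:
    """Lowercase, replace punctuation with spaces, drop tokens containing digits."""
    text = text.lower()
    # Anything that is not a letter, digit, or whitespace becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop tokens that contain a digit (e.g. "2d"); keep everything else.
    tokens = [t for t in text.split() if not any(c.isdigit() for c in t)]
    return " ".join(tokens)


print(simple_clean("Many fabric handling and 2D deformable material tasks!"))
print(simple_clean("Recently, Text-to-SQL for multi-turn dialogue"))
```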


In [39]:
# Add a space to separate title and abstract.
df["clean_title"] = df["clean_title"].astype(str) + " "
df["text"] = df["clean_title"] + df["clean_abstract"]

# Save to csv
df.to_csv("./processed_dataframe.csv", index=False)
df.head()

Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,text
162830,2311.03205,PainSeeker: An Automated Method for Assessing ...,[cs.CV],"In this letter, we aim to investigate whethe...",in this letter we aim to investigate whether l...,painseeker an automated method for assess pain...,painseeker an automated method for assess pain...
190856,2403.12748,Building Brain Tumor Segmentation Networks wit...,"[cs.CV, cs.AI]",Brain tumor image segmentation is a challeng...,brain tumor image segmentation be a challengin...,building brain tumor segmentation networks wit...,building brain tumor segmentation networks wit...
70736,2204.07504,Systematic review of development literature fr...,[cs.DL],The purpose of this systematic review is to ...,the purpose of this systematic review be to id...,systematic review of development literature fr...,systematic review of development literature fr...
52209,2111.12608,"Cerberus Transformer: Joint Semantic, Affordan...",[cs.CV],Multi-task indoor scene understanding is wid...,multi task indoor scene understanding be widel...,cerberus transformer joint semantic affordance...,cerberus transformer joint semantic affordance...
110518,2301.11147,"Train Hard, Fight Easy: Robust Meta Reinforcem...",[cs.LG],A major challenge of reinforcement learning ...,a major challenge of reinforcement learning rl...,train hard fight easy robust meta reinforcemen...,train hard fight easy robust meta reinforcemen...


In [40]:
df2 = df.copy(deep=False)

kw_model = KeyBERT('all-mpnet-base-v2')
extraction = lambda x: utils.extract_kws(TEXT=x["text"],
                                         kw_model=kw_model,
                                         seed=x["clean_title"].split(" "))
df2["keywords"] = df2.progress_apply(extraction, axis=1)
df2.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


100%|██████████| 10000/10000 [15:50<00:00, 10.52it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,text,keywords
162830,2311.03205,PainSeeker: An Automated Method for Assessing ...,[cs.CV],"In this letter, we aim to investigate whethe...",in this letter we aim to investigate whether l...,painseeker an automated method for assess pain...,painseeker an automated method for assess pain...,"[grimace, pain, ratspain, automated]"
190856,2403.12748,Building Brain Tumor Segmentation Networks wit...,"[cs.CV, cs.AI]",Brain tumor image segmentation is a challeng...,brain tumor image segmentation be a challengin...,building brain tumor segmentation networks wit...,building brain tumor segmentation networks wit...,"[convolutional, mri, glioblastoma, su]"
70736,2204.07504,Systematic review of development literature fr...,[cs.DL],The purpose of this systematic review is to ...,the purpose of this systematic review be to id...,systematic review of development literature fr...,systematic review of development literature fr...,"[colombia, growth, thesis, review]"
52209,2111.12608,"Cerberus Transformer: Joint Semantic, Affordan...",[cs.CV],Multi-task indoor scene understanding is wid...,multi task indoor scene understanding be widel...,cerberus transformer joint semantic affordance...,cerberus transformer joint semantic affordance...,"[affordance, parse, transformer, strong]"
110518,2301.11147,"Train Hard, Fight Easy: Robust Meta Reinforcem...",[cs.LG],A major challenge of reinforcement learning ...,a major challenge of reinforcement learning rl...,train hard fight easy robust meta reinforcemen...,train hard fight easy robust meta reinforcemen...,"[mrl, learning, robustness, easy]"


In [41]:
df2.to_csv("./keywords.csv", index=False) # Save to csv

In [52]:
df = pd.read_csv("./keywords.csv", dtype=str)

In [53]:
df.head()

Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,text,keywords
0,2311.03205,PainSeeker: An Automated Method for Assessing ...,['cs.CV'],"In this letter, we aim to investigate whethe...",in this letter we aim to investigate whether l...,painseeker an automated method for assess pain...,painseeker an automated method for assess pain...,"['grimace', 'pain', 'ratspain', 'automated']"
1,2403.12748,Building Brain Tumor Segmentation Networks wit...,"['cs.CV', 'cs.AI']",Brain tumor image segmentation is a challeng...,brain tumor image segmentation be a challengin...,building brain tumor segmentation networks wit...,building brain tumor segmentation networks wit...,"['convolutional', 'mri', 'glioblastoma', 'su']"
2,2204.07504,Systematic review of development literature fr...,['cs.DL'],The purpose of this systematic review is to ...,the purpose of this systematic review be to id...,systematic review of development literature fr...,systematic review of development literature fr...,"['colombia', 'growth', 'thesis', 'review']"
3,2111.12608,"Cerberus Transformer: Joint Semantic, Affordan...",['cs.CV'],Multi-task indoor scene understanding is wid...,multi task indoor scene understanding be widel...,cerberus transformer joint semantic affordance...,cerberus transformer joint semantic affordance...,"['affordance', 'parse', 'transformer', 'strong']"
4,2301.11147,"Train Hard, Fight Easy: Robust Meta Reinforcem...",['cs.LG'],A major challenge of reinforcement learning ...,a major challenge of reinforcement learning rl...,train hard fight easy robust meta reinforcemen...,train hard fight easy robust meta reinforcemen...,"['mrl', 'learning', 'robustness', 'easy']"
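
After the round trip through csv, the `category` and `keywords` columns no longer hold Python lists but their string representations, which is why the entries above now show quotes. The sketch below reproduces the effect with the standard-library `csv` module rather than pandas (an assumption made to keep the example self-contained) and recovers the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# One row in the shape of keywords.csv, with a real list in "keywords".
rows = [{"id": "2311.03205",
         "keywords": ["grimace", "pain", "ratspain", "automated"]}]

# Write to an in-memory csv; non-string values are stringified with str().
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "keywords"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the list comes back as plain text.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["keywords"]
print(type(raw).__name__, raw)

# literal_eval parses the text back into a list without executing arbitrary code.
parsed = ast.literal_eval(raw)
print(parsed[0])
```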


# 2) Single-label classification (X = title + abstract, y = first keyword)

In [5]:
!pip install scikit-multilearn -q

Collecting scikit-multilearn
  Downloading scikit_multilearn-0.2.0-py3-none-any.whl (89 kB)
Installing collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [54]:
# Preparing X (features)
X = df["text"].values

# Preparing y (labels)
y = df['keywords'].values

# Split data into train/test.
X_train, X_test, y_train_tot, y_test_tot = train_test_split(X, y,
                                                            test_size=0.5,
                                                            random_state=SEED)

In [55]:
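import ast

# The keyword lists were read back from csv as strings, so parse them safely
# and select only the first keyword of every article.
y_train = [ast.literal_eval(x)[0] for x in y_train_tot]
y_test = [ast.literal_eval(x)[0] for x in y_test_tot]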

Do the classification.

In [56]:
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('svm_model', LinearSVC(verbose=1))])

y_pred = utils.run_model_one(model, X_train, X_test, y_train, y_test)

print('accuracy: ', accuracy_score(y_test, y_pred))

df_pred = pd.DataFrame({'clean_text': X_test,
                        'true_kws': y_test_tot,
                        'first_true_kw': y_test,
                        'predicted_kw': y_pred})

[LibLinear]accuracy:  0.371


In [57]:
df_pred = df_pred.reset_index(drop=True)
df_pred.head()

Unnamed: 0,clean_text,true_kws,first_true_kw,predicted_kw
0,building defect prediction models by online le...,"['learning', 'predict', 'defect', 'auc']",learning,prediction
1,adaptive discretization use voronoi trees for ...,"['pomdp', 'discretization', 'voronoi', 'tree']",pomdp,action
2,study the explanation for the automate predict...,"['classification', 'bug', 'shap', 'understand']",classification,explainability
3,airtrack onboard deep learning framework for l...,"['aircraft', 'tracking', 'dataset', 'daa']",aircraft,tracking
4,query complexity based optimal processing of r...,"['workload', 'query', 'partition', 'dataset']",workload,rdf


In [58]:
# Get the number of predicted kws that are contained in the list of true kws.
# Note: true_kws is still the string read from the csv, so `in` is a substring
# test on that string (e.g. "learn" would match "['learning', ...]").
is_in_true_kws = lambda x: x.predicted_kw in x.true_kws
num_true = df_pred.apply(is_in_true_kws, axis=1).sum()

# Turn it into a percentage.
print(f"{round((num_true/len(df_pred))*100, 2)}% of predicted kws are true kws")

57.14% of predicted kws are true kws
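
Because `true_kws` holds the string form of each keyword list, the membership test above also counts a prediction that is merely a substring of a true keyword. The sketch below contrasts that behaviour with an exact match on the parsed list. Both function names and the prediction `"learn"` are made up for illustration, and the stricter variant is an alternative, not what produced the figure above.

```python
import ast


def substring_hit(predicted: str, true_kws_repr: str) -> bool:
    """Membership test on the raw string: matches any substring."""
    return predicted in true_kws_repr


def exact_hit(predicted: str, true_kws_repr: str) -> bool:
    """Membership test on the parsed list: matches whole keywords only."""
    return predicted in ast.literal_eval(true_kws_repr)


true_kws_repr = "['learning', 'predict', 'defect', 'auc']"

print(substring_hit("learn", true_kws_repr))   # substring of 'learning'
print(exact_hit("learn", true_kws_repr))       # not a keyword in the list
print(exact_hit("defect", true_kws_repr))      # a keyword in the list
```

With exact matching the reported percentage could only stay the same or decrease.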


# 3) Compute the similarity between the true and the predicted keywords

In [59]:
import nltk
import spacy
from gensim.models import Word2Vec

In [60]:
# Lists of the keywords on which we want to compute the similarity.
kws_pred = df_pred['predicted_kw'].values
kws_true = df_pred['first_true_kw'].values

In [61]:
# Create the corpus using our processed texts.
corpus = list(df['text'].values)

# Tokenize the corpus.
nltk.download('punkt')
tokenized_corpus = [nltk.word_tokenize(text.lower()) for text in corpus]

# Train the Word2Vec model on the created corpus.
model = Word2Vec(tokenized_corpus, min_count=1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Compute the semantic similarity (the cosine similarity between the Word2Vec vectors of the two keywords).

In [62]:
simil_meaning_list = []  # semantic similarities

for i, (kp, kt) in enumerate(zip(kws_pred, kws_true)):
    # Cosine similarity between predicted and true keyword, rounded to 2 decimals.
    sim = round(float(model.wv.similarity(kp, kt)), 2)
    if i < 5:
        print(f"The similarity between '{kp}' and '{kt}' is: {sim}")
    simil_meaning_list.append(sim)

print(f"\nMEAN OF SIMILARITIES: {np.mean(simil_meaning_list)}")

The similarity between 'prediction' and 'learning' is: 0.22
The similarity between 'action' and 'pomdp' is: 0.16
The similarity between 'explainability' and 'classification' is: 0.26
The similarity between 'tracking' and 'aircraft' is: 0.41
The similarity between 'rdf' and 'workload' is: 0.4

MEAN OF SIMILARITIES: 0.595884
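
For reference, `model.wv.similarity` returns the cosine similarity of the two word vectors. A minimal pure-Python sketch of that quantity follows. The toy vectors are made up and far shorter than real Word2Vec embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(same, orthogonal, opposite)
```

Identical keywords have similarity 1, so the 37.1% of exact matches contribute 0.371 to the mean on their own. Read that way, the mismatched pairs average roughly (0.596 - 0.371) / (1 - 0.371) ≈ 0.36, which is closer to the five examples printed above.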
