<a href="https://colab.research.google.com/github/auroramugnai/arXiv_classification/blob/main/arXiv_classification/keywords_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clone the GitHub repository and move into its inner directory.

In [1]:
!git clone https://github.com/auroramugnai/arXiv_classification.git
%cd arXiv_classification/arXiv_classification

Cloning into 'arXiv_classification'...
remote: Enumerating objects: 660, done.
remote: Counting objects: 100% (358/358), done.
remote: Compressing objects: 100% (170/170), done.
remote: Total 660 (delta 214), reused 243 (delta 163), pack-reused 302
Receiving objects: 100% (660/660), 15.48 MiB | 22.11 MiB/s, done.
Resolving deltas: 100% (367/367), done.
/content/arXiv_classification/arXiv_classification


# 1) Build the dataset

In [2]:
import json
import random
import zipfile

import dask.bag as db
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import utils

In [3]:
SEED = 42 # fix random seed for reproducibility

(Alternatively, run the cell below to load 10k articles with already extracted keywords from a .csv file, then skip to the Classification section.)

In [57]:
# path = f"./kws_cs_10k.csv"
# df2 = pd.read_csv(path, dtype=str)
# df2.head()

## 1.1 Download the dataset
The following line of code comes from clicking on "Copy API command" in https://www.kaggle.com/datasets/Cornell-University/arxiv.

In [7]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content/arXiv_classification/arXiv_classification
 99% 1.27G/1.28G [00:16<00:00, 57.0MB/s]
100% 1.28G/1.28G [00:16<00:00, 84.0MB/s]


Unzip the downloaded file.

In [8]:
with zipfile.ZipFile('./arxiv.zip', 'r') as zip_ref:
    zip_ref.extractall()

Unzipping creates the file "arxiv-metadata-oai-snapshot.json", which holds one JSON record per line. We now load it into a dask bag.

In [9]:
path = "./arxiv-metadata-oai-snapshot.json"
arxiv_data = db.read_text(path).map(json.loads)
arxiv_data.take(1)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d
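
Because each line of the snapshot is one JSON record, `json.loads` can simply be mapped over the text lines. As a minimal sketch of the same idea without dask, assuming a tiny in-memory sample in place of the real file (the second record and the helper `iter_records` are made up for illustration):

```python
import io
import json

# Hypothetical two-line sample standing in for arxiv-metadata-oai-snapshot.json.
sample_file = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph"}\n'
    '{"id": "0000.0000", "categories": "cs.LG cs.AI"}\n'
)

def iter_records(lines):
    """Yield one parsed dict per non-empty line (JSON Lines format)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(iter_records(sample_file))
print(records[0]["id"], records[1]["categories"].split(" "))
```

The dask bag does the same parsing lazily and in parallel, which matters for a file of this size.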

## 1.2 Get rid of some unnecessary information

In [10]:
# Get the creation date of the latest version of an article.
get_latest_version = lambda x: x['versions'][-1]['created']

# Only keep articles whose latest version was created after 2020 (2021 onward).
is_after_2020 = lambda x: int(get_latest_version(x).split(' ')[3]) > 2020

# Only keep some information.
cut_info = lambda x: {'id': x['id'],
                      'title': x['title'],
                      'category':x['categories'].split(' '),
                      'abstract':x['abstract'],}

# Only keep Computer Science macro-category.
is_only_cs = lambda x: all([s.startswith("cs.") for s in x['categories'].split(' ')])

arxiv_data_filtered = (arxiv_data.filter(is_after_2020).filter(is_only_cs).map(cut_info).compute())


# Create a pandas dataframe and save it to csv.
df = pd.DataFrame(arxiv_data_filtered)
df.to_csv("./cs_arxiv_data_filtered.csv", index=False)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,[cs.DM],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,[cs.CC],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"[cs.CG, cs.MA, cs.RO]",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"[cs.CR, cs.DB]","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,[cs.NI],Radio Frequency IDentification (RFID) system...
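
To see what the filters above do on a single record, here is a small sketch on a made-up record. It assumes the `created` field has the form `'Mon, 2 Apr 2007 19:18:42 GMT'`, so that the fourth space-separated token is the year; the record and the helper names are hypothetical.

```python
# Hypothetical record with the fields used by the filters.
record = {
    "id": "0000.00000",
    "title": "A toy title",
    "categories": "cs.LG cs.AI",
    "abstract": "A toy abstract.",
    "versions": [
        {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
        {"version": "v2", "created": "Tue, 5 Jan 2021 10:00:00 GMT"},
    ],
}

def latest_year(rec):
    """Year of the most recent version, read from the 'created' string."""
    return int(rec["versions"][-1]["created"].split(" ")[3])

def is_only_cs(rec):
    """True if every listed category belongs to Computer Science."""
    return all(c.startswith("cs.") for c in rec["categories"].split(" "))

print(latest_year(record))         # year of the latest version
print(latest_year(record) > 2020)  # does it pass the date filter?
print(is_only_cs(record))          # does it pass the category filter?
```

An article cross-listed outside Computer Science (for example `"cs.LG stat.ML"`) would be dropped by the category filter.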


## 1.3 Get a fixed number of articles
We keep only a fixed number of articles to speed up computation and avoid a session crash.

In [11]:
path = "./cs_arxiv_data_filtered.csv"
df = pd.read_csv(path, dtype=str)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,['cs.DM'],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,['cs.CC'],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"['cs.CG', 'cs.MA', 'cs.RO']",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"['cs.CR', 'cs.DB']","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,['cs.NI'],Radio Frequency IDentification (RFID) system...
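
Note that after the CSV round trip the `category` column no longer holds Python lists but their string representation, as the quotes in the output above show. If real lists are needed later, one option is `ast.literal_eval`; a minimal sketch on a single value:

```python
import ast

raw = "['cs.CG', 'cs.MA', 'cs.RO']"  # value as read back from the CSV

# Indexing the raw string returns a character, not a category.
print(raw[0])

# literal_eval safely parses the string back into a list.
categories = ast.literal_eval(raw)
print(categories[0])
```

On a whole column this could be applied with `df["category"].apply(ast.literal_eval)`.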


In [12]:
num_data = 10000 # number of articles that we want to keep
print(f"The dataset contains {len(df)} articles.")

# Sample the dataset only if its length exceeds num_data.
# pandas ignores random.seed(), so the seed is passed via random_state.
if len(df) > num_data:
    df = df.sample(n=num_data, random_state=SEED)

print(f"The dataset contains {len(df)} articles.")

The dataset contains 199846 articles.
The dataset contains 10000 articles.
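
`DataFrame.sample` takes its randomness from the `random_state` argument (or from NumPy's global state), not from Python's `random` module. A short sketch on a toy frame, showing that the same `random_state` yields the same rows:

```python
import pandas as pd

SEED = 42
toy = pd.DataFrame({"id": range(100)})

# Same random_state -> same sampled rows, in the same order.
first = toy.sample(n=10, random_state=SEED)
second = toy.sample(n=10, random_state=SEED)
print(first.index.equals(second.index))
```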


# 2) Text-processing

In [13]:
!pip install -U spacy -q
!python -m spacy download en_core_web_md -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [14]:
import en_core_web_md
import spacy
from tqdm import tqdm

Clean the text of abstracts and titles (this step takes several minutes).

In [15]:
# Remove stop words, punctuation, special characters, numbers.
nlp = spacy.load("en_core_web_md")
tqdm.pandas() # to display progress bar

# First on abstracts.
abs_cleaner = lambda x: utils.text_cleaner(text=x["abstract"], nlp=nlp)
df["clean_abstract"] = df.progress_apply(abs_cleaner, axis=1)

# Then on titles.
tit_cleaner = lambda x: utils.text_cleaner(text=x["title"], nlp=nlp)
df["clean_title"] = df.progress_apply(tit_cleaner, axis=1)

df.tail()

100%|██████████| 10000/10000 [06:25<00:00, 25.92it/s]
100%|██████████| 10000/10000 [01:14<00:00, 133.85it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title
73593,2205.03026,Hearing voices at the National Library -- a sp...,['cs.CL'],This paper explains our work in developing n...,paper explain work develop new acoustic model ...,hear voice national library speech corpu acous...
74069,2205.04797,State Encoders in Reinforcement Learning for R...,['cs.IR'],Methods for reinforcement learning for recom...,method reinforcement learning recommendation i...,state encoders reinforcement learning recommen...
108574,2301.02924,Reducing Over-smoothing in Graph Neural Networ...,"['cs.LG', 'cs.AI']",Graph Neural Networks (GNNs) have achieved a...,graph neural networks gnns achieve lot success...,reduce smoothing graph neural networks use rel...
124538,2304.10253,Image retrieval outperforms diffusion models o...,"['cs.CV', 'cs.LG']",Many approaches have been proposed to use di...,approach propose use diffusion model augment t...,image retrieval outperform diffusion model dat...
40973,2109.01394,Topographic VAEs learn Equivariant Capsules,"['cs.LG', 'cs.AI', 'cs.NE']",In this work we seek to bridge the concepts ...,work seek bridge concept topographic organizat...,topographic vaes learn equivariant capsules
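
The implementation of `utils.text_cleaner` is not shown in this notebook. As a rough stand-in, the sketch below removes punctuation, digits and a few stop words using only the standard library. The stop-word set and the name `simple_cleaner` are illustrative assumptions, and unlike the spaCy-based cleaner (which, judging from the output above, also lemmatizes) it leaves word forms unchanged.

```python
import re

# Tiny illustrative stop-word set (spaCy ships a much larger one).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "we", "this", "is"}

def simple_cleaner(text):
    """Lowercase, keep alphabetic tokens only, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(simple_cleaner("On the Geometry of Stable Steiner Tree Instances"))
```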


In [16]:
# Add a space to separate title and abstract.
df["clean_text"] = df["clean_title"] + " " + df["clean_abstract"]
df.head()

Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text
168573,2312.02139,DiffiT: Diffusion Vision Transformers for Imag...,"['cs.CV', 'cs.AI', 'cs.LG']",Diffusion models with their powerful express...,diffusion model powerful expressivity high sam...,diffit diffusion vision transformers image gen...,diffit diffusion vision transformers image gen...
44501,2109.13457,On the Geometry of Stable Steiner Tree Instances,['cs.DS'],In this note we consider the Steiner tree pr...,note consider steiner tree problem bilu linial...,geometry stable steiner tree instance,geometry stable steiner tree instance note con...
157749,2310.08884,Extending Multi-modal Contrastive Representations,['cs.CV'],Multi-modal contrastive representation (MCR)...,multi modal contrastive representation mcr mod...,extend multi modal contrastive representation,extend multi modal contrastive representation ...
56441,2112.13181,DeepMTL Pro: Deep Learning Based MultipleTrans...,['cs.NI'],"In this paper, we address the problem of Mul...",paper address problem multiple transmitter loc...,deepmtl pro deep learning base multipletransmi...,deepmtl pro deep learning base multipletransmi...
58118,2201.0512,SeamlessGAN: Self-Supervised Synthesis of Tile...,"['cs.CV', 'cs.GR', 'cs.LG', 'cs.MM']","We present SeamlessGAN, a method capable of ...",present seamlessgan method capable automatical...,seamlessgan self supervised synthesis tileable...,seamlessgan self supervised synthesis tileable...


# 3) Keyword extraction

In [17]:
!pip install KeyBERT -q
!pip install keyphrase-vectorizers -q


In [18]:
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer



In [19]:
kw_model = KeyBERT('all-mpnet-base-v2')

extraction = lambda x: utils.extract_kws(text=x["clean_text"],
                                         kw_model=kw_model,
                                         seed=x["clean_title"].split(" "),
                                         top_n=4)

df["keywords"] = df.progress_apply(extraction, axis=1)
df.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


100%|██████████| 10000/10000 [15:51<00:00, 10.51it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text,keywords
168573,2312.02139,DiffiT: Diffusion Vision Transformers for Imag...,"['cs.CV', 'cs.AI', 'cs.LG']",Diffusion models with their powerful express...,diffusion model powerful expressivity high sam...,diffit diffusion vision transformers image gen...,diffit diffusion vision transformers image gen...,"[transformer, diffusion, vision, generate]"
44501,2109.13457,On the Geometry of Stable Steiner Tree Instances,['cs.DS'],In this note we consider the Steiner tree pr...,note consider steiner tree problem bilu linial...,geometry stable steiner tree instance,geometry stable steiner tree instance note con...,"[steiner, tree, geometry, stability]"
157749,2310.08884,Extending Multi-modal Contrastive Representations,['cs.CV'],Multi-modal contrastive representation (MCR)...,multi modal contrastive representation mcr mod...,extend multi modal contrastive representation,extend multi modal contrastive representation ...,"[multimodal, contrastive, alignment, learn]"
56441,2112.13181,DeepMTL Pro: Deep Learning Based MultipleTrans...,['cs.NI'],"In this paper, we address the problem of Mul...",paper address problem multiple transmitter loc...,deepmtl pro deep learning base multipletransmi...,deepmtl pro deep learning base multipletransmi...,"[deepmtl, transmitter, alarm, location]"
58118,2201.0512,SeamlessGAN: Self-Supervised Synthesis of Tile...,"['cs.CV', 'cs.GR', 'cs.LG', 'cs.MM']","We present SeamlessGAN, a method capable of ...",present seamlessgan method capable automatical...,seamlessgan self supervised synthesis tileable...,seamlessgan self supervised synthesis tileable...,"[seamlessgan, texture, tileability, generate]"


# 4) Classification
Given an article:

- its feature X is the cleaned text (title and abstract)
- its label y is its first extracted keyword

In [20]:
!pip install scikit-multilearn -q


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [58]:
# Preparing X (features)
X = df["clean_text"]

# Preparing y (labels)
y = df['keywords']

# Split the data into train and test sets (50/50).
X_train, X_test, y_train_tot, y_test_tot = train_test_split(X, y,
                                                            test_size=0.5,
                                                            random_state=SEED)

In [59]:
# Select only the first keyword for every article.
y_train = [x[0] for x in y_train_tot]
y_test = [x[0] for x in y_test_tot]

Train a TF-IDF + linear SVM pipeline and predict the first keyword of each test article.

In [60]:
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('svm_model', LinearSVC(verbose=1))])

y_pred = utils.run_model(model, X_train, X_test, y_train, y_test,
                         multilabel=False)

print('accuracy: ', accuracy_score(y_test, y_pred))

df_pred = pd.DataFrame({'clean_text': X_test,
                        'true_kws': y_test_tot,
                        'first_true_kw': y_test,
                        'predicted_kw': y_pred})

[LibLinear]accuracy:  0.353
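
`utils.run_model` is not shown here; presumably it fits the pipeline on the training set and predicts on the test set. A minimal sketch of that fit/predict cycle on toy data, assuming made-up texts and labels (the name `toy_model` is hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy cleaned texts and single-keyword labels (made up for illustration).
train_texts = [
    "graph neural network node embedding",
    "graph convolution network message passing",
    "image diffusion model generation",
    "diffusion model image synthesis",
]
train_labels = ["graph", "graph", "diffusion", "diffusion"]

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm_model", LinearSVC(random_state=0)),
])

# Fit on the training texts, then predict one label per unseen text.
toy_model.fit(train_texts, train_labels)
pred = toy_model.predict(["graph network embedding", "image diffusion synthesis"])
print(list(pred))
```

With thousands of distinct keywords as classes and only 5000 training articles, many labels appear only once or never in the training set, which helps explain the modest accuracy above.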


In [61]:
df_pred = df_pred.reset_index(drop=True)
df_pred.head()

Unnamed: 0,clean_text,true_kws,first_true_kw,predicted_kw
0,language model evaluation perplexity propose a...,"[lstm, language, nucleus, generate]",lstm,text
1,rwifislam effective wifi range base slam syste...,"[gps, rwifislam, range, indoor]",gps,surveillance
2,evaluate efficacy online assessments higher ed...,"[assessment, internet, student, try]",assessment,exam
3,cluster analysis deep embeddings contrastive l...,"[embedding, cluster, disentangled, dataset]",embedding,embedding
4,nu mcc multiview compressive coding neighborho...,"[compressive, multiview, rgb, udf]",compressive,meshing


In [62]:
# Count the predicted kws that are contained in the list of true kws.
is_in_true_kws = lambda x: x.predicted_kw in x.true_kws
num_true = df_pred.apply(is_in_true_kws, axis=1).sum()

# Turn it into a percentage.
print(f"{round((num_true/len(df_pred))*100, 2)}% of predicted kws are true kws")

53.76% of predicted kws are true kws
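
The two scores differ because accuracy only counts a prediction that equals the first true keyword, while the relaxed score accepts a match with any of the four. A small sketch of both counts: the first five rows are copied from the `df_pred` preview above, and the last row is hypothetical, added to show a prediction that matches a non-first keyword.

```python
true_kws = [
    ["lstm", "language", "nucleus", "generate"],
    ["gps", "rwifislam", "range", "indoor"],
    ["assessment", "internet", "student", "try"],
    ["embedding", "cluster", "disentangled", "dataset"],
    ["compressive", "multiview", "rgb", "udf"],
    ["steiner", "tree", "geometry", "stability"],  # hypothetical row
]
predicted = ["text", "surveillance", "exam", "embedding", "meshing", "tree"]

# Strict: the prediction must equal the first true keyword.
exact_hits = sum(p == t[0] for p, t in zip(predicted, true_kws))

# Relaxed: the prediction may be any of the true keywords.
relaxed_hits = sum(p in t for p, t in zip(predicted, true_kws))

print(exact_hits, relaxed_hits, len(predicted))
```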


# 5) Compute the similarity between the true and the predicted keywords

In [34]:
import nltk
import spacy
from gensim.models import Word2Vec

In [63]:
# Lists of the keywords on which we want to compute the similarity.
kws_pred = df_pred['predicted_kw'].values
kws_true = df_pred['first_true_kw'].values

In [64]:
# Create the corpus using our processed texts.
corpus = list(df['clean_text'].values)

# Tokenize the corpus.
nltk.download('punkt')
tokenized_corpus = [nltk.word_tokenize(text.lower()) for text in corpus]

# Train the Word2Vec model on the created corpus.
model = Word2Vec(tokenized_corpus, min_count=1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


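Compute the semantic similarity, i.e. the cosine similarity between the Word2Vec vectors of each predicted keyword and the corresponding first true keyword.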

In [65]:
simil_meaning_list = []  # semantic similarities

for i, (kp, kt) in enumerate(zip(kws_pred, kws_true)):
    # Cosine similarity between the two keyword vectors, rounded to 2 decimals.
    sim = round(float(model.wv.similarity(kp, kt)), 2)
    if i < 5:
        print(f"The similarity between '{kp}' and '{kt}' is: {sim}")
    simil_meaning_list.append(sim)

print(f"\nMEAN OF SIMILARITIES: {np.mean(simil_meaning_list)}")

The similarity between 'text' and 'lstm' is: 0.15
The similarity between 'surveillance' and 'gps' is: 0.8
The similarity between 'exam' and 'assessment' is: 0.63
The similarity between 'embedding' and 'embedding' is: 1.0
The similarity between 'meshing' and 'compressive' is: 0.58

MEAN OF SIMILARITIES: 0.6442439999999999
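
`model.wv.similarity` returns the cosine similarity between two word vectors, so identical keywords score 1.0 and vectors pointing in unrelated directions score close to 0. A minimal sketch of the formula with made-up vectors (the helper `cosine_similarity` is written here for illustration only):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same, 2), round(orthogonal, 2))
```

Since every correctly predicted keyword contributes a similarity of 1.0, the mean above is pulled upward by the exact matches.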
