<a href="https://colab.research.google.com/github/auroramugnai/arXiv_classification/blob/main/arXiv_classification/keywords_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clone the GitHub repository and move into the inner directory.

In [1]:
!git clone https://github.com/auroramugnai/arXiv_classification.git
%cd arXiv_classification/arXiv_classification

Cloning into 'arXiv_classification'...
remote: Enumerating objects: 1172, done.
remote: Counting objects: 100% (461/461), done.
remote: Compressing objects: 100% (211/211), done.
remote: Total 1172 (delta 291), reused 343 (delta 228), pack-reused 711
Receiving objects: 100% (1172/1172), 17.74 MiB | 20.91 MiB/s, done.
Resolving deltas: 100% (599/599), done.
/content/arXiv_classification/arXiv_classification


# 1) Build the dataset

In [2]:
import json
import random
import zipfile

import dask.bag as db
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import utils

In [3]:
SEED = 42 # fix random seed for reproducibility

## 1.1 Download the dataset
The following line of code comes from clicking on "Copy API command" in https://www.kaggle.com/datasets/Cornell-University/arxiv.

In [4]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content/arXiv_classification/arXiv_classification
 99% 1.27G/1.28G [00:14<00:00, 154MB/s]
100% 1.28G/1.28G [00:14<00:00, 95.6MB/s]


Unzip the downloaded file.

In [5]:
with zipfile.ZipFile('./arxiv.zip', 'r') as zip_ref:
    zip_ref.extractall()

Unzipping creates the file "arxiv-metadata-oai-snapshot.json", which holds one JSON record per line. We now create a Dask bag from it.

In [6]:
path = "./arxiv-metadata-oai-snapshot.json"
arxiv_data = db.read_text(path).map(json.loads)
arxiv_data.take(1)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d
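
`db.read_text(path).map(json.loads)` treats every line of the snapshot as one JSON record. As a rough illustration of the same idea without Dask, the sketch below reads a small JSON Lines file lazily with the standard library. The file contents and the `iter_records` helper are made up for the example.

```python
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Two made-up records standing in for the arXiv snapshot.
sample_lines = [
    json.dumps({"id": "0000.0001", "categories": "cs.LG"}),
    json.dumps({"id": "0000.0002", "categories": "hep-ph"}),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(sample_lines) + "\n")

    # Materialize the records while the temporary file still exists.
    records = list(iter_records(sample_path))

print(records[0]["id"])
```

Dask does the same per-line parsing but splits the file into partitions, so the 1.28 GB snapshot never has to fit in memory at once.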

## 1.2 Get rid of some unnecessary information

In [7]:
# Get the creation date of the latest version of an article.
get_latest_version = lambda x: x['versions'][-1]['created']

# Only keep articles whose latest version was created after 2020.
is_after_2020 = lambda x: int(get_latest_version(x).split(' ')[3]) > 2020

# Only keep some information.
cut_info = lambda x: {'id': x['id'],
                      'title': x['title'],
                      'category':x['categories'].split(' '),
                      'abstract':x['abstract'],}

# Only keep Computer Science macro-category.
is_only_cs = lambda x: all([s.startswith("cs.") for s in x['categories'].split(' ')])

arxiv_data_filtered = (arxiv_data.filter(is_after_2020).filter(is_only_cs).map(cut_info).compute())


# Create a pandas dataframe and save it to csv.
df = pd.DataFrame(arxiv_data_filtered)
df.to_csv("./cs_arxiv_data_filtered.csv", index=False)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,[cs.DM],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,[cs.CC],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"[cs.CG, cs.MA, cs.RO]",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"[cs.CR, cs.DB]","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,[cs.NI],Radio Frequency IDentification (RFID) system...
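
To make the two filters concrete, here is a small sketch on hand-written records. It assumes the `created` field has the form `'Mon, 2 Apr 2007 19:18:42 GMT'` (an assumption about the snapshot, not shown above) and parses it with `email.utils.parsedate_to_datetime` instead of indexing into the split string. The helper names and records are hypothetical.

```python
from email.utils import parsedate_to_datetime


def latest_version_year(record):
    """Return the year in which the last listed version was created."""
    created = record["versions"][-1]["created"]
    return parsedate_to_datetime(created).year


def is_only_cs(record):
    """True when every listed category belongs to the cs.* archive."""
    return all(cat.startswith("cs.") for cat in record["categories"].split(" "))


# Hand-written records for illustration only.
records = [
    {"id": "a", "categories": "cs.LG cs.AI",
     "versions": [{"created": "Mon, 2 Apr 2007 19:18:42 GMT"},
                  {"created": "Tue, 5 Jan 2021 10:00:00 GMT"}]},
    {"id": "b", "categories": "cs.LG stat.ML",
     "versions": [{"created": "Wed, 3 Mar 2021 08:30:00 GMT"}]},
    {"id": "c", "categories": "cs.DM",
     "versions": [{"created": "Fri, 1 Jun 2018 12:00:00 GMT"}]},
]

kept = [r["id"] for r in records
        if latest_version_year(r) > 2020 and is_only_cs(r)]
print(kept)
```

Record "a" is kept because its last version dates from 2021 even though it was first submitted in 2007, "b" is dropped for its `stat.ML` cross-listing, and "c" is dropped for its date.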


## 1.3 Get a fixed number of articles
Keep a fixed-size random sample to speed up computation and avoid a session crash.

In [8]:
path = "./cs_arxiv_data_filtered.csv"
df = pd.read_csv(path, dtype=str)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,['cs.DM'],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,['cs.CC'],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"['cs.CG', 'cs.MA', 'cs.RO']",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"['cs.CR', 'cs.DB']","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,['cs.NI'],Radio Frequency IDentification (RFID) system...


In [9]:
num_data = 10000 # number of articles that we want to keep
print(f"The dataset contains {len(df)} articles.")

# Sample the dataset only if its length exceeds num_data.
# pandas draws from NumPy's random state, so seed it through random_state.
if len(df) > num_data:
    df = df.sample(n=num_data, axis=0, random_state=SEED)

print(f"The dataset contains {len(df)} articles.")

The dataset contains 201390 articles.
The dataset contains 10000 articles.
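
Note that `DataFrame.sample` does not read the state of Python's `random` module, so the sample is only repeatable when `random_state` is passed. A minimal sketch on a toy frame (the frame is made up for illustration):

```python
import pandas as pd

SEED = 42
toy = pd.DataFrame({"id": range(100)})

# Passing the same random_state selects the same rows on every call.
first = toy.sample(n=10, random_state=SEED)
second = toy.sample(n=10, random_state=SEED)

print(first.index.equals(second.index))
```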


# 2) Text-processing

In [10]:
!pip install -U spacy -q
!python -m spacy download en_core_web_md -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [11]:
import en_core_web_md
import spacy
from tqdm import tqdm

Clean the text (this step takes several minutes).

In [12]:
# Remove stop words, punctuation, special characters, numbers.
nlp = spacy.load("en_core_web_md")
tqdm.pandas() # to display progress bar

# First on abstracts.
abs_cleaner = lambda x: utils.text_cleaner(text=x["abstract"], nlp=nlp)
df["clean_abstract"] = df.progress_apply(abs_cleaner, axis=1)

# Then on titles.
tit_cleaner = lambda x: utils.text_cleaner(text=x["title"], nlp=nlp)
df["clean_title"] = df.progress_apply(tit_cleaner, axis=1)

df.tail()

100%|██████████| 10000/10000 [06:55<00:00, 24.06it/s]
100%|██████████| 10000/10000 [01:17<00:00, 129.24it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title
197939,2404.11597,Explainable Artificial Intelligence Techniques...,"['cs.AI', 'cs.LG']",As the manufacturing industry advances with ...,manufacturing industry advance sensor integrat...,explainable artificial intelligence techniques...
111945,2302.02008,Witscript: A System for Generating Improvised ...,"['cs.CL', 'cs.AI']",A chatbot is perceived as more humanlike and...,chatbot perceive humanlike likeable include jo...,witscript system generating improvised jokes c...
38304,2108.05635,Memory-based Semantic Segmentation for Off-roa...,"['cs.CV', 'cs.RO']",With the availability of many datasets tailo...,availability dataset tailor autonomous driving...,memory base semantic segmentation road unstruc...
153437,2309.13802,While Loops in Coq,"['cs.PL', 'cs.LO']",While loops are present in virtually all imp...,loop present virtually imperative programming ...,loops coq
152457,2309.10979,Towards Data-centric Graph Machine Learning: R...,['cs.LG'],"Data-centric AI, with its primary focus on t...",data centric ai primary focus collection manag...,data centric graph machine learning review out...
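
`utils.text_cleaner` lives in the repository and is not shown here. Judging from the comment and the output above, it drops stop words, punctuation and numbers and lemmatizes what remains with spaCy. The sketch below is a much simpler stand-in that uses only the standard library: it lowercases, keeps alphabetic tokens and removes a tiny hand-picked stop list, with no lemmatization. `simple_clean` and `STOP_WORDS` are hypothetical names.

```python
import re

# Tiny illustrative stop list; spaCy ships a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def simple_clean(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


print(simple_clean("While Loops in Coq: 3 proofs, and 2 tactics!"))
```

Because the stop list here is so short, "while" survives, whereas the spaCy-based cleaner above reduces the same title to "loops coq".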


In [13]:
# Add a space to separate title and abstract.
df["clean_text"] = df["clean_title"] + " " + df["clean_abstract"]
df.head()

Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text
177394,2401.12178,In-Context Learning for Extreme Multi-Label Cl...,"['cs.CL', 'cs.AI']",Multi-label classification problems with tho...,multi label classification problem thousand cl...,context learning extreme multi label classific...,context learning extreme multi label classific...
149832,2309.02251,STGIN: Spatial-Temporal Graph Interaction Netw...,['cs.IR'],"In Location-Based Services, Point-Of-Interes...",location base services point recommendation pl...,stgin spatial temporal graph interaction netwo...,stgin spatial temporal graph interaction netwo...
9577,2101.01332,Equality Saturation for Tensor Graph Superopti...,"['cs.AI', 'cs.DC']",One of the major optimizations employed in d...,major optimization employ deep learning framew...,equality saturation tensor graph superoptimiza...,equality saturation tensor graph superoptimiza...
107792,2212.14293,Error syntax aware augmentation of feedback co...,['cs.CL'],This paper presents a solution to the GenCha...,paper present solution genchal share task dedi...,error syntax aware augmentation feedback comme...,error syntax aware augmentation feedback comme...
94936,2210.03682,Novice Type Error Diagnosis with Natural Langu...,"['cs.PL', 'cs.LG']",Strong static type systems help programmers ...,strong static type system help programmer elim...,novice type error diagnosis natural language m...,novice type error diagnosis natural language m...


# 3) Keyword extraction

In [14]:
!pip install KeyBERT -q
!pip install keyphrase-vectorizers -q

  Preparing metadata (setup.py) ... done
  Building wheel for KeyBERT (setup.py) ... done

In [15]:
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer



In [16]:
kw_model = KeyBERT('all-mpnet-base-v2')

extraction = lambda x: utils.extract_kws(text=x["clean_text"],
                                         kw_model=kw_model,
                                         seed=x["clean_title"].split(" "),
                                         top_n=4)

df["keywords"] = df.progress_apply(extraction, axis=1)
df.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



100%|██████████| 10000/10000 [17:06<00:00,  9.74it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text,keywords
177394,2401.12178,In-Context Learning for Extreme Multi-Label Cl...,"['cs.CL', 'cs.AI']",Multi-label classification problems with tho...,multi label classification problem thousand cl...,context learning extreme multi label classific...,context learning extreme multi label classific...,"[classification, program, multi, extreme]"
149832,2309.02251,STGIN: Spatial-Temporal Graph Interaction Netw...,['cs.IR'],"In Location-Based Services, Point-Of-Interes...",location base services point recommendation pl...,stgin spatial temporal graph interaction netwo...,stgin spatial temporal graph interaction netwo...,"[graph, recommendation, temporal, stgin]"
9577,2101.01332,Equality Saturation for Tensor Graph Superopti...,"['cs.AI', 'cs.DC']",One of the major optimizations employed in d...,major optimization employ deep learning framew...,equality saturation tensor graph superoptimiza...,equality saturation tensor graph superoptimiza...,"[superoptimization, tensor, optimize, graph]"
107792,2212.14293,Error syntax aware augmentation of feedback co...,['cs.CL'],This paper presents a solution to the GenCha...,paper present solution genchal share task dedi...,error syntax aware augmentation feedback comme...,error syntax aware augmentation feedback comme...,"[augment, error, writing, dataset]"
94936,2210.03682,Novice Type Error Diagnosis with Natural Langu...,"['cs.PL', 'cs.LG']",Strong static type systems help programmers ...,strong static type system help programmer elim...,novice type error diagnosis natural language m...,novice type error diagnosis natural language m...,"[type, programmer, annotation, diagnose]"


# 4) Classification
Given an article:

- its feature X will be the cleaned text (title plus abstract)
- its label y will be the first of its extracted keywords
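
As a minimal sketch of this setup, the toy example below fits a TF-IDF plus linear SVM pipeline on a few made-up cleaned texts, each labelled with a single keyword. The texts and labels are invented for illustration, and `random_state` is added only to make the run repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up cleaned texts (features) and single keywords (labels).
X_toy = [
    "uav wireless coverage spectrum sharing",
    "uav trajectory wireless relay",
    "graph neural network node embedding",
    "graph convolution spectral clustering",
]
y_toy = ["uav", "uav", "graph", "graph"]

# TF-IDF turns each text into a sparse vector; the SVM learns one
# linear decision function per keyword class.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm_model", LinearSVC(random_state=0)),
])
toy_model.fit(X_toy, y_toy)

toy_pred = toy_model.predict(["wireless uav relay network"])
print(toy_pred)
```

With thousands of distinct keywords as classes, as in the real dataset, each class has very few training examples, which is worth keeping in mind when reading the accuracy.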

In [17]:
!pip install scikit-multilearn -q


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [19]:
# Preparing X (features)
X = df["clean_text"]

# Preparing y (labels)
y = df['keywords']

# Split data into train/test.
X_train, X_test, y_train_tot, y_test_tot = train_test_split(X, y,
                                                            test_size=0.5,
                                                            random_state=SEED)

In [20]:
# Select only the first keyword for every article.
y_train = [x[0] for x in y_train_tot]
y_test = [x[0] for x in y_test_tot]

Train the classifier and predict the first keyword of each test article.

In [21]:
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('svm_model', LinearSVC(verbose=1))])

y_pred = utils.run_model(model, X_train, X_test, y_train, y_test,
                         multilabel=False)

print('accuracy: ', accuracy_score(y_test, y_pred))

df_pred = pd.DataFrame({'clean_text': X_test,
                        'true_kws': y_test_tot,
                        'first_true_kw': y_test,
                        'predicted_kw': y_pred})

[LibLinear]accuracy:  0.3622


In [22]:
df_pred = df_pred.reset_index(drop=True)
df_pred.head()

Unnamed: 0,clean_text,true_kws,first_true_kw,predicted_kw
0,knowledge enhanced multi modal fake news detec...,"[news, fake, classification, subgraph]",news,recommender
1,chard clinical health aware reasoning dimensio...,"[health, aware, textual, generate]",health,demographic
2,machine learning application health coronaviru...,"[coronavirus, learning, health, aircraft]",coronavirus,pandemic
3,text orient modality reinforcement network mul...,"[multimodal, sentiment, fusion, sequences]",multimodal,multimodal
4,performance analysis spectrum sharing uav enab...,"[uav, wireless, coverage, mesh]",uav,uav


In [23]:
# Get the number of predicted kws that are contained in the list of true kws.
is_in_true_kws = lambda x: x.predicted_kw in x.true_kws
num_true = df_pred.apply(is_in_true_kws, axis=1).value_counts().loc[True]

# Turn it to percentage.
print(f"{round((num_true/len(df_pred))*100, 2)}% of predicted kws are true kws")

54.4% of predicted kws are true kws
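
The same hit rate can also be read off as the mean of a boolean mask, which avoids the `value_counts().loc[True]` lookup (that lookup raises a `KeyError` when no prediction is a hit). A small sketch on made-up predictions:

```python
import pandas as pd

# Made-up predictions for illustration.
toy_pred = pd.DataFrame({
    "true_kws": [["news", "fake"], ["uav", "wireless"], ["graph", "tensor"]],
    "predicted_kw": ["recommender", "uav", "tensor"],
})

# True where the predicted keyword appears among the true keywords.
hits = toy_pred.apply(lambda row: row.predicted_kw in row.true_kws, axis=1)

# The mean of a boolean Series is the fraction of True values.
hit_rate = hits.mean()
print(f"{hit_rate * 100:.2f}% of predicted kws are true kws")
```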


# 5) Compute the similarity between the true and the predicted keywords

In [24]:
import nltk
from gensim.models import Word2Vec

In [25]:
# Lists of the keywords on which we want to compute the similarity.
kws_pred = df_pred['predicted_kw'].values
kws_true = df_pred['first_true_kw'].values

In [26]:
# Create the corpus using our processed texts.
corpus = list(df['clean_text'].values)

# Tokenize the corpus.
nltk.download('punkt')
tokenized_corpus = [nltk.word_tokenize(text.lower()) for text in corpus]

# Train the Word2Vec model on the created corpus.
model = Word2Vec(tokenized_corpus, min_count=1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Compute the semantic similarity between each predicted keyword and the first true keyword.

In [27]:
simil_meaning_list = [] # meaning similarity

for i, (kp, kt) in enumerate(zip(kws_pred, kws_true)):
    # Similarity between the two keyword vectors, rounded to 2 decimals.
    sim = round(float(model.wv.similarity(kp, kt)), 2)
    if i < 5:
        print(f"The similarity between '{kp}' and '{kt}' is: {sim}")
    simil_meaning_list.append(sim)

print(f"\nMEAN OF SIMILARITIES: {np.mean(simil_meaning_list)}")

The similarity between 'recommender' and 'news' is: 0.42
The similarity between 'demographic' and 'health' is: 0.59
The similarity between 'pandemic' and 'coronavirus' is: 0.85
The similarity between 'multimodal' and 'multimodal' is: 1.0
The similarity between 'uav' and 'uav' is: 1.0

MEAN OF SIMILARITIES: 0.64258
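
`model.wv.similarity` returns the cosine similarity of the two word vectors, which is why identical keywords score 1.0. A small sketch with made-up vectors (not taken from the trained model):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "word vectors" for illustration.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(round(cosine_similarity(vec_a, vec_a), 2))
print(round(cosine_similarity(vec_a, vec_b), 2))
print(round(cosine_similarity(vec_a, vec_c), 2))
```

Since exact matches contribute 1.0 each, the mean similarity is pulled upward by the share of correctly predicted keywords.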
