<a href="https://colab.research.google.com/github/auroramugnai/ArXivClassification/blob/main/ArXivClassification/keywords_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clone the GitHub repository and move into its inner directory.

In [None]:
!git clone https://github.com/auroramugnai/ArXivClassification.git
%cd ArXivClassification/ArXivClassification

Cloning into 'ArXivClassification'...
remote: Enumerating objects: 2000, done.
remote: Counting objects: 100% (642/642), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 2000 (delta 473), reused 548 (delta 415), pack-reused 1358
Receiving objects: 100% (2000/2000), 72.44 MiB | 26.55 MiB/s, done.
Resolving deltas: 100% (1088/1088), done.
/content/ArXivClassification/ArXivClassification/ArXivClassification/ArXivClassification


# 1) Build the dataset

In [None]:
import json
import random
import zipfile

import dask.bag as db
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import utils

In [None]:
SEED = 42 # fix random seed for reproducibility
random.seed(SEED)

## 1.1 Download the dataset
The following line of code comes from clicking on "Copy API command" in https://www.kaggle.com/datasets/Cornell-University/arxiv.

In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
arxiv.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the downloaded file.

In [None]:
with zipfile.ZipFile('./arxiv.zip', 'r') as zip_ref:
    zip_ref.extractall()

Unzipping creates the file "arxiv-metadata-oai-snapshot.json". We now create a Dask bag from it.

In [None]:
path = "./arxiv-metadata-oai-snapshot.json"
arxiv_data = db.read_text(path).map(json.loads)
arxiv_data.take(1)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d
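
The snapshot stores one JSON record per line, which is why mapping `json.loads` over the lines of text is enough to parse it. A minimal sketch of the same idea without Dask, assuming two made-up records held in an in-memory file (the records and the helper name `iter_records` are illustrative, not part of the notebook):

```python
import io
import json


def iter_records(lines):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for the real snapshot file.
fake_file = io.StringIO(
    '{"id": "0001", "categories": "cs.LG"}\n'
    '{"id": "0002", "categories": "hep-ph"}\n'
)

records = list(iter_records(fake_file))
print(records[0]["id"], len(records))
```

Dask does the same work lazily and in parallel, so the full file never has to fit in memory at once.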

## 1.2 Get rid of some unnecessary information

In [None]:
# Get the creation date of the latest version of an article.
get_latest_version = lambda x: x['versions'][-1]['created']

# Only keep articles whose latest version was created after 2020.
is_after_2020 = lambda x: int(get_latest_version(x).split(' ')[3]) > 2020

# Only keep some information.
cut_info = lambda x: {'id': x['id'],
                      'title': x['title'],
                      'category':x['categories'].split(' '),
                      'abstract':x['abstract'],}

# Only keep articles whose categories all belong to Computer Science.
is_only_cs = lambda x: all([s.startswith("cs.") for s in x['categories'].split(' ')])

arxiv_data_filtered = (arxiv_data.filter(is_after_2020).filter(is_only_cs).map(cut_info).compute())


# Create a pandas dataframe and save it to csv.
df = pd.DataFrame(arxiv_data_filtered)
df.to_csv("./cs_arxiv_data_filtered.csv", index=False)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,[cs.DM],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,[cs.CC],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"[cs.CG, cs.MA, cs.RO]",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"[cs.CR, cs.DB]","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,[cs.NI],Radio Frequency IDentification (RFID) system...
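
To make the two filters concrete, here is a small sketch that applies the same logic to hand-made records. It assumes the `created` field has the form `'Mon, 2 Apr 2007 19:18:42 GMT'`, so that the fourth space-separated token is the year. The records and the function names `latest_year` and `is_only_cs_record` are hypothetical.

```python
def latest_year(record):
    """Return the year in which the latest version of a record was created."""
    created = record["versions"][-1]["created"]  # e.g. 'Mon, 2 Apr 2007 19:18:42 GMT'
    return int(created.split(" ")[3])


def is_only_cs_record(record):
    """True if every category of the record starts with 'cs.'."""
    return all(c.startswith("cs.") for c in record["categories"].split(" "))


# Hand-made records (assumed format, not taken from the dataset).
records = [
    {"id": "a", "categories": "cs.LG cs.CV",
     "versions": [{"created": "Tue, 5 Jan 2021 10:00:00 GMT"}]},
    {"id": "b", "categories": "cs.LG stat.ML",
     "versions": [{"created": "Wed, 3 Mar 2021 09:30:00 GMT"}]},
    {"id": "c", "categories": "cs.DM",
     "versions": [{"created": "Mon, 2 Apr 2007 19:18:42 GMT"}]},
]

kept = [r["id"] for r in records if latest_year(r) > 2020 and is_only_cs_record(r)]
print(kept)  # ['a']
```

Record `b` is dropped because it is cross-listed outside Computer Science, and record `c` because its latest version predates 2021.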


## 1.3 Get a fixed number of articles
We keep only a fixed number of articles to speed up computation and avoid a session crash.

In [None]:
path = "./cs_arxiv_data_filtered.csv"
df = pd.read_csv(path, dtype=str)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,['cs.DM'],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,['cs.CC'],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"['cs.CG', 'cs.MA', 'cs.RO']",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"['cs.CR', 'cs.DB']","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,['cs.NI'],Radio Frequency IDentification (RFID) system...
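
Note that after the CSV round trip the `category` column no longer holds lists: each cell is the string representation of a list (compare `[cs.DM]` before saving with `['cs.DM']` after loading). If real lists are needed later, one option, not used in this notebook, is to parse the strings back with `ast.literal_eval`:

```python
import ast

raw = "['cs.CG', 'cs.MA', 'cs.RO']"  # a cell as read back from the CSV
parsed = ast.literal_eval(raw)       # safely evaluates the Python literal

print(type(raw).__name__, type(parsed).__name__)
print(parsed[0])
```

On the DataFrame this would amount to `df["category"] = df["category"].apply(ast.literal_eval)`.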


In [None]:
num_data = 10000 # number of articles that we want to keep
print(f"The dataset contains {len(df)} articles.")

# Sample the dataset only if its length exceeds num_data.
if len(df) > num_data:
    df = df.sample(n=num_data, axis=0, random_state=SEED)

print(f"The dataset contains {len(df)} articles.")

The dataset contains 202943 articles.
The dataset contains 10000 articles.


# 2) Text processing

In [None]:
!pip install -U spacy -q
!python -m spacy download en_core_web_md -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import en_core_web_md
import spacy
from tqdm import tqdm

Clean the titles and abstracts (this step takes several minutes).

In [None]:
# Remove stop words, punctuation, special characters, numbers.
nlp = spacy.load("en_core_web_md")
tqdm.pandas() # to display progress bar

# First on abstracts.
abs_cleaner = lambda x: utils.text_cleaner(text=x["abstract"], nlp=nlp)
df["clean_abstract"] = df.progress_apply(abs_cleaner, axis=1)

# Then on titles.
tit_cleaner = lambda x: utils.text_cleaner(text=x["title"], nlp=nlp)
df["clean_title"] = df.progress_apply(tit_cleaner, axis=1)

df.tail()

100%|██████████| 10000/10000 [06:29<00:00, 25.65it/s]
100%|██████████| 10000/10000 [01:14<00:00, 134.67it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title
149330,2309.0065,Reducing Errors in Excel Models with Component...,['cs.SE'],Model errors are pervasive and can be catast...,model error pervasive catastrophic reduce mode...,reduce error excel models component base softw...
114394,2302.10287,CertViT: Certified Robustness of Pre-Trained V...,['cs.CV'],Lipschitz bounded neural networks are certif...,lipschitz bound neural network certifiably rob...,certvit certify robustness pre trained vision ...
46303,2110.04854,Identity-guided Face Generation with Multi-mod...,['cs.CV'],Recent face generation methods have tried to...,recent face generation method try synthesize f...,identity guide face generation multi modal con...
111355,2302.00047,Probabilistic Point Cloud Modeling via Self-Or...,"['cs.LG', 'cs.GR', 'cs.RO']",This letter presents a continuous probabilis...,letter present continuous probabilistic modeli...,probabilistic point cloud modeling self organi...
72315,2204.12817,CATrans: Context and Affinity Transformer for ...,['cs.CV'],Few-shot segmentation (FSS) aims to segment ...,shot segmentation fss aim segment novel catego...,catran context affinity transformer shot segme...
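
The body of `utils.text_cleaner` is not shown in this notebook. As a rough, dependency-free stand-in for the kind of cleaning described above, the sketch below lowercases the text, keeps purely alphabetic tokens, and drops stop words. The function name `simple_text_cleaner` and the tiny stop-word list are assumptions made for illustration. Unlike the spaCy-based cleaner, whose output above is lemmatized (e.g. "reduce error"), this version does no lemmatization.

```python
import re

# Tiny illustrative stop-word list; spaCy ships a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
              "the", "to", "with"}


def simple_text_cleaner(text):
    """Lowercase, keep purely alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return " ".join(kept)


cleaned = simple_text_cleaner("Agile Development of Linux Schedulers with Ekiben")
print(cleaned)  # agile development linux schedulers ekiben
```

Tokens that mix letters and digits, such as "3D", are discarded entirely, so `simple_text_cleaner("Analytical Die-to-Die 3D Placement")` gives `"analytical die die placement"`.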


In [None]:
# Add a space to separate title and abstract.
df["clean_text"] = df["clean_title"] + " " + df["clean_abstract"]
df.head()

Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text
157246,2310.07424,Analytical Die-to-Die 3D Placement with Bistra...,['cs.AR'],"In this paper, we present a new analytical 3...",paper present new analytical placement framewo...,analytical die die placement bistratal wirelen...,analytical die die placement bistratal wirelen...
181146,2402.0556,Tight Approximation Bounds on a Simple Algorit...,['cs.DS'],The graph invariant EPT-sum has cropped up i...,graph invariant ept sum crop unrelated field l...,tight approximation bounds simple algorithm mi...,tight approximation bounds simple algorithm mi...
137935,2306.15076,Agile Development of Linux Schedulers with Ekiben,['cs.OS'],Kernel task scheduling is important for appl...,kernel task scheduling important application p...,agile development linux schedulers ekiben,agile development linux schedulers ekiben kern...
105513,2212.05421,Feature-Level Debiased Natural Language Unders...,['cs.CL'],Natural language understanding (NLU) models ...,natural language understanding nlu model rely ...,feature level debiased natural language unders...,feature level debiased natural language unders...
156522,2310.05333,DiffCPS: Diffusion Model based Constrained Pol...,['cs.LG'],Constrained policy search (CPS) is a fundame...,constrained policy search cps fundamental prob...,diffcps diffusion model base constrained polic...,diffcps diffusion model base constrained polic...


# 3) Keyword extraction

In [None]:
!pip install --upgrade KeyBERT -q
!pip install --upgrade keyphrase-vectorizers -q

In [None]:
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
kw_model = KeyBERT('all-mpnet-base-v2')

extraction = lambda x: utils.extract_kws(text=x["clean_text"],
                                         kw_model=kw_model,
                                         seed=x["clean_title"].split(" "),
                                         top_n=4)

df["keywords"] = df.progress_apply(extraction, axis=1)
df.head()

100%|██████████| 10000/10000 [16:47<00:00,  9.92it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text,keywords
157246,2310.07424,Analytical Die-to-Die 3D Placement with Bistra...,['cs.AR'],"In this paper, we present a new analytical 3...",paper present new analytical placement framewo...,analytical die die placement bistratal wirelen...,analytical die die placement bistratal wirelen...,"[wirelength, interconnection, gpu, placement]"
181146,2402.0556,Tight Approximation Bounds on a Simple Algorit...,['cs.DS'],The graph invariant EPT-sum has cropped up i...,graph invariant ept sum crop unrelated field l...,tight approximation bounds simple algorithm mi...,tight approximation bounds simple algorithm mi...,"[tree, clustering, rank, average]"
137935,2306.15076,Agile Development of Linux Schedulers with Ekiben,['cs.OS'],Kernel task scheduling is important for appl...,kernel task scheduling important application p...,agile development linux schedulers ekiben,agile development linux schedulers ekiben kern...,"[scheduler, linux, ekiben, benchmark]"
105513,2212.05421,Feature-Level Debiased Natural Language Unders...,['cs.CL'],Natural language understanding (NLU) models ...,natural language understanding nlu model rely ...,feature level debiased natural language unders...,feature level debiased natural language unders...,"[bias, contrastive, learning, encode]"
156522,2310.05333,DiffCPS: Diffusion Model based Constrained Pol...,['cs.LG'],Constrained policy search (CPS) is a fundame...,constrained policy search cps fundamental prob...,diffcps diffusion model base constrained polic...,diffcps diffusion model base constrained polic...,"[diffusion, diffcps, reinforcement, offline]"


# 4) Classification
Given an article:

- its feature X will be the cleaned text
- its label y will be the first of its extracted keywords

In [None]:
!pip install scikit-multilearn -q

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [None]:
# Preparing X (features)
X = df["clean_text"]

# Preparing y (labels)
y = df['keywords']

# Split data into train/test sets (50/50).
X_train, X_test, y_train_tot, y_test_tot = train_test_split(X, y,
                                                            test_size=0.5,
                                                            random_state=SEED)

In [None]:
# Select only the first keyword for every article.
y_train = [x[0] for x in y_train_tot]
y_test = [x[0] for x in y_test_tot]

Train the classifier on the training set and predict the first keyword of each test article.

In [None]:
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('svm_model', LinearSVC(verbose=1))])

# Fit of the train data using the pipeline.
model.fit(X_train, y_train)
# Prediction on the test data.
y_pred = model.predict(X_test)

# print('accuracy: ', accuracy_score(y_test, y_pred))

df_pred = pd.DataFrame({'clean_text': X_test,
                        'true_kws': y_test_tot,
                        'first_true_kw': y_test,
                        'predicted_kw': y_pred})

[LibLinear]

In [None]:
df_pred = df_pred.reset_index(drop=True)
df_pred.head()

Unnamed: 0,clean_text,true_kws,first_true_kw,predicted_kw
0,explain chest x ray pathologies natural langua...,"[chest, explainability, imaging, dataset]",chest,explainability
1,causal inference natural language processing e...,"[nlp, causality, inference, interpretation]",nlp,nlp
2,define maximum acceptable latency ai enhance c...,"[interpreter, latency, cai, enhance]",interpreter,latency
3,struggle adversarial defense try diffusion adv...,"[adversarial, diffusion, image, train]",adversarial,adversarial
4,detect unknown object detection object detecti...,"[detection, object, annotate, unknown]",detection,segmentation
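
Because the classifier is trained on the first keyword only, a prediction that matches one of the other three extracted keywords still counts as a miss. As a rough illustration (an addition here, not part of the original analysis), the sketch below compares exact-match accuracy on the first keyword with a relaxed "any of the four keywords" hit rate on the five rows shown above. Five rows are far too few to draw conclusions, so this only demonstrates the computation.

```python
# The five rows shown above: (true keywords, predicted keyword).
rows = [
    (["chest", "explainability", "imaging", "dataset"], "explainability"),
    (["nlp", "causality", "inference", "interpretation"], "nlp"),
    (["interpreter", "latency", "cai", "enhance"], "latency"),
    (["adversarial", "diffusion", "image", "train"], "adversarial"),
    (["detection", "object", "annotate", "unknown"], "segmentation"),
]

# Strict: the prediction must equal the first true keyword.
strict_hits = sum(pred == true[0] for true, pred in rows)
# Relaxed: the prediction may match any of the true keywords.
relaxed_hits = sum(pred in true for true, pred in rows)

strict_acc = strict_hits / len(rows)
relaxed_acc = relaxed_hits / len(rows)
print(strict_acc, relaxed_acc)  # 0.4 0.8
```

On the full test set, the same two quantities could be computed from the `first_true_kw`, `true_kws` and `predicted_kw` columns of `df_pred`.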


# 5) Compute the similarity between the true and the predicted keywords

In [None]:
import nltk
from gensim.models import Word2Vec

In [None]:
# Lists of the keywords on which we want to compute the similarity.
kws_pred = df_pred['predicted_kw'].values
kws_true = df_pred['first_true_kw'].values

In [None]:
# Create the corpus using our processed texts.
corpus = list(df['clean_text'].values)

# Tokenize the corpus.
nltk.download('punkt')
tokenized_corpus = [nltk.word_tokenize(text.lower()) for text in corpus]

# Train the Word2Vec model on the created corpus.
model = Word2Vec(tokenized_corpus, min_count=1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Compute the semantic similarity between each predicted keyword and the corresponding first true keyword.

In [None]:
simil_meaning_list = []  # semantic similarities

for i, (kw_pred, kw_true) in enumerate(zip(kws_pred, kws_true)):
    # Cosine similarity between the two Word2Vec vectors, rounded to 2 decimals.
    sim = round(float(model.wv.similarity(kw_pred, kw_true)), 2)
    if i < 5:
        print(f"The similarity between '{kw_pred}' and '{kw_true}' is: {sim}")
    simil_meaning_list.append(sim)

print(f"\nMEAN OF SIMILARITIES: {np.mean(simil_meaning_list)}")

The similarity between 'explainability' and 'chest' is: 0.22
The similarity between 'nlp' and 'nlp' is: 1.0
The similarity between 'latency' and 'interpreter' is: 0.28
The similarity between 'adversarial' and 'adversarial' is: 1.0
The similarity between 'segmentation' and 'detection' is: 0.58

MEAN OF SIMILARITIES: 0.647312
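
gensim's `wv.similarity` returns the cosine similarity between the two word vectors, which is why identical keywords score exactly 1.0. A small dependency-free sketch with made-up 3-dimensional vectors (real Word2Vec vectors have 100 dimensions by default; the vectors and the function name `cosine_similarity` are illustrative assumptions):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for two keyword embeddings.
vec_a = [3.0, 4.0, 0.0]
vec_b = [4.0, 3.0, 0.0]

print(cosine_similarity(vec_a, vec_a))            # 1.0
print(round(cosine_similarity(vec_a, vec_b), 2))  # 0.96
```

Since correctly predicted keywords each contribute exactly 1.0, the mean similarity is pulled upward by the exact matches and should be read together with the fraction of exact matches.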
