<a href="https://colab.research.google.com/github/auroramugnai/ArXivClassification/blob/main/ArXivClassification/keywords_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clone the GitHub repository and move into the inner directory.

In [None]:
!git clone https://github.com/auroramugnai/ArXivClassification.git
%cd ArXivClassification/ArXivClassification

Cloning into 'ArXivClassification'...
remote: Enumerating objects: 1718, done.
remote: Counting objects: 100% (360/360), done.
remote: Compressing objects: 100% (128/128), done.
remote: Total 1718 (delta 285), reused 271 (delta 230), pack-reused 1358
Receiving objects: 100% (1718/1718), 23.17 MiB | 18.06 MiB/s, done.
Resolving deltas: 100% (900/900), done.
/content/ArXivClassification/ArXivClassification


# 1) Build the dataset

In [None]:
import json
import random
import zipfile

import dask.bag as db
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import utils

In [None]:
SEED = 42 # fix random seed for reproducibility
random.seed(SEED)

## 1.1 Download the dataset
The following line of code comes from clicking on "Copy API command" in https://www.kaggle.com/datasets/Cornell-University/arxiv.

In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content/ArXivClassification/ArXivClassification
100% 1.28G/1.28G [00:35<00:00, 40.5MB/s]
100% 1.28G/1.28G [00:35<00:00, 38.6MB/s]


Unzip the downloaded file.

In [None]:
with zipfile.ZipFile('./arxiv.zip', 'r') as zip_ref:
    zip_ref.extractall()

Unzipping creates the file "arxiv-metadata-oai-snapshot.json". We now create a Dask bag from it.

In [None]:
path = "./arxiv-metadata-oai-snapshot.json"
arxiv_data = db.read_text(path).map(json.loads)
arxiv_data.take(1)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d

## 1.2 Get rid of some unnecessary information

In [None]:
# Get the creation date of the latest version of an article.
get_latest_version = lambda x: x['versions'][-1]['created']

# Only keep articles whose latest version was created after 2020.
is_after_2020 = lambda x: int(get_latest_version(x).split(' ')[3]) > 2020

# Only keep some information.
cut_info = lambda x: {'id': x['id'],
                      'title': x['title'],
                      'category':x['categories'].split(' '),
                      'abstract':x['abstract'],}

# Only keep articles whose categories all belong to the Computer Science macro-category.
is_only_cs = lambda x: all(s.startswith("cs.") for s in x['categories'].split(' '))

arxiv_data_filtered = (arxiv_data.filter(is_after_2020).filter(is_only_cs).map(cut_info).compute())


# Create a pandas dataframe and save it to csv.
df = pd.DataFrame(arxiv_data_filtered)
df.to_csv("./cs_arxiv_data_filtered.csv", index=False)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,[cs.DM],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,[cs.CC],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"[cs.CG, cs.MA, cs.RO]",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"[cs.CR, cs.DB]","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,[cs.NI],Radio Frequency IDentification (RFID) system...
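
The year filter above relies on the year being the fourth space-separated token of the `created` string. A minimal sketch of an alternative, assuming the timestamps follow the RFC 2822 style (for example `Mon, 2 Apr 2007 19:18:42 GMT`). The sample records are hypothetical and only carry the fields the two filters read.

```python
from email.utils import parsedate_to_datetime

# Hypothetical records shaped like the snapshot entries (timestamp format is assumed).
sample_records = [
    {"categories": "hep-ph",
     "versions": [{"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"}]},
    {"categories": "cs.LG cs.AI",
     "versions": [{"version": "v1", "created": "Tue, 12 Jul 2022 09:30:00 GMT"}]},
    {"categories": "cs.LG stat.ML",
     "versions": [{"version": "v1", "created": "Wed, 4 Jan 2023 10:00:00 GMT"}]},
]


def latest_year(record):
    """Return the year of the last listed version of a record."""
    created = record["versions"][-1]["created"]
    return parsedate_to_datetime(created).year


def is_only_cs(record):
    """True when every category of the record starts with 'cs.'."""
    return all(cat.startswith("cs.") for cat in record["categories"].split())


# Same two conditions as the dask filters: recent and CS-only.
kept = [r for r in sample_records if latest_year(r) > 2020 and is_only_cs(r)]
print([r["categories"] for r in kept])
```

Only the second record passes both conditions: the first is too old and not in CS, the third is cross-listed outside CS.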


## 1.3 Get a fixed number of articles
We keep only a fixed number of articles to speed up computation and avoid a session crash.

In [None]:
path = "./cs_arxiv_data_filtered.csv"
df = pd.read_csv(path, dtype=str)
df.head()

Unnamed: 0,id,title,category,abstract
0,710.3901,A recursive linear time modular decomposition ...,['cs.DM'],A module of a graph G is a set of vertices t...
1,711.201,A Polynomial Time Algorithm for Graph Isomorphism,['cs.CC'],We claimed that there is a polynomial algori...
2,802.3414,A Universal In-Place Reconfiguration Algorithm...,"['cs.CG', 'cs.MA', 'cs.RO']",In the modular robot reconfiguration problem...
3,803.3946,On the `Semantics' of Differential Privacy: A ...,"['cs.CR', 'cs.DB']","Differential privacy is a definition of ""pri..."
4,805.1877,Perfect tag identification protocol in RFID ne...,['cs.NI'],Radio Frequency IDentification (RFID) system...
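
Note that after the CSV round trip the `category` column holds strings such as `"['cs.CG', 'cs.MA', 'cs.RO']"` rather than Python lists. If lists are needed downstream, one possible way to restore them is `ast.literal_eval`. This is a sketch on hard-coded sample strings, not a step of the original notebook.

```python
import ast

# Strings as they appear in the `category` column after reading the CSV.
raw_categories = ["['cs.DM']", "['cs.CG', 'cs.MA', 'cs.RO']"]

# literal_eval safely parses Python literals (here: lists of strings).
parsed = [ast.literal_eval(s) for s in raw_categories]
print(parsed)
```

On the dataframe itself the same idea would be `df["category"].apply(ast.literal_eval)`.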


In [None]:
num_data = 10000 # number of articles that we want to keep
print(f"The dataset contains {len(df)} articles.")

# Sample the dataset only if its length exceeds num_data.
if len(df) > num_data:
    df = df.sample(n=num_data, axis=0, random_state=SEED)

print(f"The dataset contains {len(df)} articles.")

The dataset contains 201390 articles.
The dataset contains 10000 articles.


# 2) Text-processing

In [None]:
!pip install -U spacy -q
!python -m spacy download en_core_web_md -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import en_core_web_md
import spacy
from tqdm import tqdm

Clean out the strings (this step will take a while).

In [None]:
# Remove stop words, punctuation, special characters, numbers.
nlp = spacy.load("en_core_web_md")
tqdm.pandas() # to display progress bar

# First on abstracts.
abs_cleaner = lambda x: utils.text_cleaner(text=x["abstract"], nlp=nlp)
df["clean_abstract"] = df.progress_apply(abs_cleaner, axis=1)

# Then on titles.
tit_cleaner = lambda x: utils.text_cleaner(text=x["title"], nlp=nlp)
df["clean_title"] = df.progress_apply(tit_cleaner, axis=1)

df.tail()

100%|██████████| 10000/10000 [06:44<00:00, 24.72it/s]
100%|██████████| 10000/10000 [01:17<00:00, 128.29it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title
120147,2303.14329,Edge-Based Video Analytics: A Survey,"['cs.DC', 'cs.CV']",Edge computing has been getting a momentum w...,edge computing momentum increase datum edge ne...,edge base video analytics survey
139871,2307.04054,Deep Unsupervised Learning Using Spike-Timing-...,['cs.CV'],Spike-Timing-Dependent Plasticity (STDP) is ...,spike timing dependent plasticity stdp unsuper...,deep unsupervised learning use spike timing de...
138750,2307.00182,Single-Stage Heavy-Tailed Food Classification,['cs.CV'],Deep learning based food image classificatio...,deep learning base food image classification e...,single stage heavy tailed food classification
33080,2106.1602,Anomaly Detection: How to Artificially Increas...,['cs.LG'],Anomaly detection is a widely explored domai...,anomaly detection widely explore domain machin...,anomaly detection artificially increase score ...
31726,2106.10938,A Game-Theoretic Taxonomy of Visual Concepts i...,"['cs.LG', 'cs.CV']","In this paper, we rethink how a DNN encodes ...",paper rethink dnn encode visual concept differ...,game theoretic taxonomy visual concepts dnns


In [None]:
# Add a space to separate title and abstract.
df["clean_text"] = df["clean_title"] + " " + df["clean_abstract"]
df.head()

Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text
82985,2207.05068,Few-Shot Semantic Relation Prediction across H...,"['cs.LG', 'cs.AI']",Semantic relation prediction aims to mine th...,semantic relation prediction aim implicit rela...,shot semantic relation prediction heterogeneou...,shot semantic relation prediction heterogeneou...
29411,2106.02954,Denoising Word Embeddings by Averaging in a Sh...,"['cs.CL', 'cs.LG']",We introduce a new approach for smoothing an...,introduce new approach smooth improve quality ...,denoise word embeddings average shared space,denoise word embeddings average shared space i...
128648,2305.08492,On the conformance of Android applications wit...,"['cs.CY', 'cs.CR']",With the rapid development of online technol...,rapid development online technology widespread...,conformance android application child data pro...,conformance android application child data pro...
87965,2208.08934,A Hybrid Self-Supervised Learning Framework fo...,['cs.LG'],"Vertical federated learning (VFL), a variant...",vertical federated learning vfl variant federa...,hybrid self supervised learning framework vert...,hybrid self supervised learning framework vert...
75162,2205.08982,Attention-based Multimodal Feature Representat...,['cs.IR'],"In recommender systems, models mostly use a ...",recommender system model use combination embed...,attention base multimodal feature representati...,attention base multimodal feature representati...


# 3) Keywords extraction

In [None]:
!pip install KeyBERT -q
!pip install keyphrase-vectorizers -q


In [None]:
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer



In [None]:
kw_model = KeyBERT('all-mpnet-base-v2')

extraction = lambda x: utils.extract_kws(text=x["clean_text"],
                                         kw_model=kw_model,
                                         seed=x["clean_title"].split(" "))

df["keywords"] = df.progress_apply(extraction, axis=1)
df.head()

100%|██████████| 10000/10000 [17:11<00:00,  9.69it/s]


Unnamed: 0,id,title,category,abstract,clean_abstract,clean_title,clean_text,keywords
82985,2207.05068,Few-Shot Semantic Relation Prediction across H...,"['cs.LG', 'cs.AI']",Semantic relation prediction aims to mine th...,semantic relation prediction aim implicit rela...,shot semantic relation prediction heterogeneou...,shot semantic relation prediction heterogeneou...,"[subgraph, semantic, metags, predict]"
29411,2106.02954,Denoising Word Embeddings by Averaging in a Sh...,"['cs.CL', 'cs.LG']",We introduce a new approach for smoothing an...,introduce new approach smooth improve quality ...,denoise word embeddings average shared space,denoise word embeddings average shared space i...,"[embeddings, word, denoise, average]"
128648,2305.08492,On the conformance of Android applications wit...,"['cs.CY', 'cs.CR']",With the rapid development of online technol...,rapid development online technology widespread...,conformance android application child data pro...,conformance android application child data pro...,"[gdpr, guideline, android, child]"
87965,2208.08934,A Hybrid Self-Supervised Learning Framework fo...,['cs.LG'],"Vertical federated learning (VFL), a variant...",vertical federated learning vfl variant federa...,hybrid self supervised learning framework vert...,hybrid self supervised learning framework vert...,"[federated, learning, privacy, vertical]"
75162,2205.08982,Attention-based Multimodal Feature Representat...,['cs.IR'],"In recommender systems, models mostly use a ...",recommender system model use combination embed...,attention base multimodal feature representati...,attention base multimodal feature representati...,"[multimodal, recommender, attention, embed]"


# 4) Classification
Given an article:

- its feature X will be the cleaned text
- its label y will be its first extracted keyword

In [None]:
!pip install scikit-multilearn -q


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [None]:
# Preparing X (features)
X = df["clean_text"]

# Preparing y (labels)
y = df['keywords']

# Split data into train/test sets.
X_train, X_test, y_train_tot, y_test_tot = train_test_split(X, y,
                                                            test_size=0.5,
                                                            random_state=SEED)

In [None]:
# Select only the first keyword for every article.
y_train = [x[0] for x in y_train_tot]
y_test = [x[0] for x in y_test_tot]

Do the classification.

In [None]:
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('svm_model', LinearSVC(verbose=1))])

y_pred = utils.run_model(model, X_train, X_test, y_train, y_test,
                         multilabel=False)

print('accuracy: ', accuracy_score(y_test, y_pred))

df_pred = pd.DataFrame({'clean_text': X_test,
                        'true_kws': y_test_tot,
                        'first_true_kw': y_test,
                        'predicted_kw': y_pred})

[LibLinear]accuracy:  0.349
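
Because each distinct first keyword is its own class, some test labels may never occur in the training split, and a classifier cannot predict a class it has never seen. A minimal sketch of how one might measure this; the toy label lists are hypothetical stand-ins for `y_train` and `y_test`.

```python
from collections import Counter

# Hypothetical stand-ins for y_train and y_test.
y_train_toy = ["encryption", "neural", "neural", "memory", "federate"]
y_test_toy = ["encryption", "conceptnet", "allocator", "neural"]

# Count how often each label occurs in training.
train_counts = Counter(y_train_toy)

# Test labels that the classifier never saw during training.
unseen = [label for label in y_test_toy if label not in train_counts]
unseen_share = len(unseen) / len(y_test_toy)

print(f"{unseen_share:.0%} of test labels never appear in training: {unseen}")
```

Running the same check on the real splits would give an upper bound on the achievable accuracy for this setup.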


In [None]:
df_pred = df_pred.reset_index(drop=True)
df_pred.head()

Unnamed: 0,clean_text,true_kws,first_true_kw,predicted_kw
0,locally fair partitioning model societal task ...,"[partitioning, district, fairness, cooperative]",partitioning,federate
1,multiple subset problem encryption scheme comm...,"[encryption, subset, problem, mssp]",encryption,encryption
2,co nnect framework reveal commonsense knowledg...,"[conceptnet, classifier, commonsense, text]",conceptnet,commonsense
3,fast bitmap fit cpu cache line friendly memory...,"[allocator, bitmap, memory, randomization]",allocator,memory
4,neuda neural deformable anchor high fidelity i...,"[neural, surface, voxel, casting]",neural,nodes


In [None]:
# Count the predicted kws that are contained in the list of true kws.
is_in_true_kws = lambda x: x.predicted_kw in x.true_kws
num_true = df_pred.apply(is_in_true_kws, axis=1).sum()

# Turn it into a percentage.
print(f"{round((num_true/len(df_pred))*100, 2)}% of predicted kws are true kws")

53.04% of predicted kws are true kws
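
The membership test above assumes that `true_kws` holds Python lists. If the column had instead been reloaded from a CSV file as strings, `in` would perform substring matching and could overcount. A small sketch with hypothetical values:

```python
# Keywords as a real list, and as the string a CSV round trip would give back.
true_kws_list = ["conceptnet", "classifier", "commonsense", "text"]
true_kws_str = str(true_kws_list)

predicted = "concept"  # hypothetical prediction, not one of the true keywords

in_list = predicted in true_kws_list  # element membership
in_str = predicted in true_kws_str    # substring search

print(in_list)  # False
print(in_str)   # True
```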


# 5) Compute the similarity between the true and the predicted keywords

In [None]:
import nltk
from gensim.models import Word2Vec

In [None]:
# Lists of the keywords on which we want to compute the similarity.
kws_pred = df_pred['predicted_kw'].values
kws_true = df_pred['first_true_kw'].values

In [None]:
# Create the corpus using our processed texts.
corpus = list(df['clean_text'].values)

# Tokenize the corpus.
nltk.download('punkt')
tokenized_corpus = [nltk.word_tokenize(text.lower()) for text in corpus]

# Train the Word2Vec model on the created corpus.
model = Word2Vec(tokenized_corpus, min_count=1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Compute the semantic similarity (cosine similarity between the Word2Vec vectors of the two keywords).

In [None]:
simil_meaning_list = [] # meaning similarity

for i, (kp, kt) in enumerate(zip(kws_pred, kws_true)):
    sim = model.wv.similarity(kp, kt)
    sim = float("{0:.2f}".format(sim))  # round to two decimals
    if i < 5:
        print(f"The similarity between '{kp}' and '{kt}' is: {sim}")
    simil_meaning_list.append(sim)

print(f"\nMEAN OF SIMILARITIES: {np.mean(simil_meaning_list)}")

The similarity between 'federate' and 'partitioning' is: 0.37
The similarity between 'encryption' and 'encryption' is: 1.0
The similarity between 'commonsense' and 'conceptnet' is: 0.65
The similarity between 'memory' and 'allocator' is: 0.74
The similarity between 'nodes' and 'neural' is: 0.25

MEAN OF SIMILARITIES: 0.637372
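
`model.wv.similarity` returns the cosine similarity between the two word vectors. A self-contained sketch of that formula on made-up 3-dimensional vectors (gensim's Word2Vec uses 100 dimensions by default):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: vec_b points in the same direction as vec_a,
# vec_c is orthogonal to both.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
vec_c = [0.0, 0.0, 3.0]

sim_parallel = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print(sim_parallel, sim_orthogonal)
```

Parallel vectors score close to 1 and orthogonal ones score 0, which is why an exact match such as 'encryption' vs 'encryption' yields 1.0 above.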
