# Load and clean the data

* Load the metadata prepared in `../00_load_metadata.ipynb`
* Keep only papers on natural language processing: original category 'cs.CL' (Computation and Language)
* Load abstracts prepared in `../00_load_abstracts.ipynb`, merge with metadata dataframe
* Drop entries that have no abstract
* Keep only research papers, i.e. set aside review papers (identified here by the phrase 'systematic literature review' in the abstract)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split


In [2]:
%%time

# Load the metadata downloaded from arXiv
arxiv_metadata = pd.read_csv('../data/arxiv_metadata.csv.zip', index_col=0)



CPU times: user 10.7 s, sys: 1.46 s, total: 12.1 s
Wall time: 13.2 s


In [3]:
# Keep only papers on natural language processing: original category 'cs.CL' (Computation and Language)
nlp_idx = ['cs.CL' in subject for subject in arxiv_metadata['categories']]
arxiv_nlp = arxiv_metadata[nlp_idx]
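
The filter above is a substring test on the whole `categories` string, so it keeps every paper that lists 'cs.CL' anywhere. A minimal sketch of a token-based alternative follows, assuming (this is not confirmed by the notebook) that `categories` holds space-separated codes with the primary category first. The toy dataframe and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for arxiv_metadata; the space-separated layout of
# 'categories' is an assumption about the metadata file.
toy_metadata = pd.DataFrame({
    'id': ['0001', '0002', '0003'],
    'categories': ['cs.CL cs.AI', 'cs.LG cs.CL', 'math.CO'],
})

# Split each string into a list of category codes.
category_lists = toy_metadata['categories'].str.split()

# Papers that list cs.CL anywhere (exact token match, not substring).
listed_in_cl = category_lists.apply(lambda cats: 'cs.CL' in cats)

# Papers whose first (assumed primary) category is cs.CL.
primary_is_cl = category_lists.str[0] == 'cs.CL'

print(toy_metadata[listed_in_cl]['id'].tolist())
print(toy_metadata[primary_is_cl]['id'].tolist())
```

Which of the two masks matches the intended meaning of "original category" depends on how the metadata was exported, so it may be worth inspecting a few rows before choosing.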

In [4]:
%%time

# Load the abstracts extracted in notebook 00_load_abstracts
arxiv_abstracts = pd.read_csv('../data/arxiv_abstracts.csv.zip', index_col=0)

CPU times: user 11.4 s, sys: 1.27 s, total: 12.7 s
Wall time: 14 s




In [5]:
# Keep only the abstracts of NLP papers, then merge them with the metadata dataframe
arxiv_abstracts_nlp = arxiv_abstracts[arxiv_abstracts.id.isin(arxiv_nlp.id)]
arxiv_nlp_merged = pd.merge(arxiv_nlp, arxiv_abstracts_nlp, on='id')
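
`pd.merge` performs an inner join by default, so a paper with no row in the abstracts table disappears without notice. One way to make such cases visible is sketched below on toy data (the frames and values are hypothetical): a left join with `indicator=True` marks metadata rows that found no match, and `validate` raises if an `id` is duplicated on either side.

```python
import pandas as pd

# Hypothetical toy frames standing in for the metadata and abstracts tables.
toy_meta = pd.DataFrame({'id': ['a', 'b', 'c'], 'title': ['T1', 'T2', 'T3']})
toy_abstracts = pd.DataFrame({'id': ['a', 'b', 'z'], 'abstract': ['A1', 'A2', 'A9']})

# A left merge keeps every metadata row; indicator=True adds a '_merge'
# column telling whether a matching abstract row was found.
checked = pd.merge(toy_meta, toy_abstracts, on='id', how='left',
                   validate='one_to_one', indicator=True)

# Ids present in the metadata but missing from the abstracts table.
unmatched_ids = checked.loc[checked['_merge'] == 'left_only', 'id'].tolist()
print(unmatched_ids)
```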

In [6]:
# Drop entries that have no abstract
idx = arxiv_nlp_merged['abstract'].isna()
arxiv_nlp_merged = arxiv_nlp_merged[~idx]
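
It can help to report how many rows are missing an abstract before discarding them. A small sketch on made-up data, using `dropna` with `subset` and resetting the index so that later positional logic sees a contiguous 0..n-1 index:

```python
import pandas as pd

# Toy frame with one missing abstract (values are illustrative only).
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'abstract': ['Some text', None, 'More text'],
})

# Count the missing abstracts before removing them.
n_missing = int(toy['abstract'].isna().sum())
print(f"{n_missing} entries have no abstract")

# Drop those rows and renumber the index from zero.
toy_clean = toy.dropna(subset=['abstract']).reset_index(drop=True)
```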

In [14]:
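# Keep only research papers: set aside review papers, identified here by the phrase
# 'systematic literature review' in the abstract.
# The mask is built from the column itself so that it shares the dataframe's index;
# a fresh pd.Series with a default 0..n-1 index may fail to align if rows were dropped above.
research_paper_idx = ~arxiv_nlp_merged['abstract'].str.contains('systematic literature review', case=False, regex=False)
arxiv_nlp_reviews = arxiv_nlp_merged[~research_paper_idx]
arxiv_nlp_merged = arxiv_nlp_merged[research_paper_idx]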
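
A boolean mask passed as a pandas Series is aligned on index labels, so it is safest to derive it from the column being filtered. The sketch below illustrates this on toy data whose index has gaps, as an earlier row filter would leave behind; the abstracts are invented for the example.

```python
import pandas as pd

# Toy frame with a gapped index, as left behind by an earlier row filter.
toy = pd.DataFrame(
    {'abstract': ['A Systematic Literature Review of X.',
                  'We propose a parser.',
                  'We train a model.']},
    index=[0, 2, 5],
)

# The mask inherits the index of the column it is computed from,
# so it aligns with the frame regardless of gaps.
is_review = toy['abstract'].str.contains(
    'systematic literature review', case=False, regex=False)

toy_reviews = toy[is_review]
toy_research = toy[~is_review]
print(len(toy_reviews), len(toy_research))
```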

In [15]:
print(f"There are {len(arxiv_nlp_reviews)} review papers on NLP with an abstract in the dataset.")
print(f"There are {len(arxiv_nlp_merged)} research papers on NLP with an abstract in the dataset.")

There are 16 review papers on NLP with an abstract in the dataset.
There are 54551 research papers on NLP with an abstract in the dataset.


## List of review papers

In [17]:
arxiv_nlp_reviews[['id', 'title', 'authors']]

Unnamed: 0,id,title,authors
28917,2110.03073,DRAFT-What you always wanted to know but could...,"Mauricio Verano Merino, Jurgen Vinju, and Mark..."
31350,2202.03086,Machine Translation from Signed to Spoken Lang...,"Mathieu De Coster, Dimitar Shterionov, Mieke V..."
31651,2202.12826,A Systematic Literature Review about Idea Mini...,Workneh Y. Ayele and Gustaf Juell-Skielse
35554,2208.01712,"No Pattern, No Recognition: a Survey about Rep...","Mar\'ilia Costa Rosendo Silva, Felipe Alves Si..."
41841,2304.02768,Application of Transformers based methods in E...,"Vitor Alcantara Batista, Alexandre Gon\c{c}alv..."
42230,2304.11065,Conversational Process Modeling: Can Generativ...,"Nataliia Klievtsova, Janik-Vasily Benzin, Timo..."
45548,2306.09079,Web of Things and Trends in Agriculture: A Sys...,"Muhammad Shoaib Farooq, Shamyla Riaz, Atif Alvi"
45883,2306.14905,PRISMA-DFLLM: An Extension of PRISMA for Syste...,Teo Susnjak
46394,2307.06483,Misclassification in Automated Content Analysi...,"Nathan TeBlunthuis, Valerie Hase, Chung-Hong Chan"
47613,2308.1242,Evolution of ESG-focused DLT Research: An NLP ...,"Walter Hernandez, Kamil Tylinski, Alastair Moo..."
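
The last id in the listing above, 2308.1242, has only four digits after the dot, whereas arXiv ids from 2015 onwards have five. This suggests the `id` column was parsed as a float somewhere along the way and lost a trailing zero. A minimal sketch of how reading the column as text avoids this, using an in-memory CSV with an illustrative id and a placeholder title:

```python
import io
import pandas as pd

# In-memory CSV with an illustrative id that ends in zero.
csv_text = "id,title\n2308.12420,Example title\n"

# Default parsing infers a float column and drops the trailing zero.
as_float = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the id exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})

print(as_float['id'].iloc[0])
print(as_text['id'].iloc[0])
```

If the ids are later used as join keys or to build URLs, applying the same `dtype` argument in every `read_csv` call of the pipeline would keep them consistent.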


## Split the data into train / test datasets

In [9]:
# Split the research papers in half; no random_state is set, so the split differs between runs
arxiv_nlp_train, arxiv_nlp_test = train_test_split(arxiv_nlp_merged, test_size=0.5)
print(f"The train dataset has {arxiv_nlp_train.shape[0]} rows, the test dataset {arxiv_nlp_test.shape[0]} rows")

The train dataset has 27275 rows, the test dataset 27276 rows
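
If the train and test files need to be regenerated identically later, `train_test_split` accepts a `random_state` seed. The sketch below uses toy data and an arbitrary seed of 42, which is an illustrative choice and not a value taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for arxiv_nlp_merged.
toy = pd.DataFrame({'id': range(10),
                    'abstract': [f'text {i}' for i in range(10)]})

# random_state=42 is an arbitrary illustrative seed.
train_a, test_a = train_test_split(toy, test_size=0.5, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.5, random_state=42)

# Two calls with the same seed select the same rows in the same order.
print(train_a.index.equals(train_b.index))
```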


## Save

In [10]:
import zipfile as zf

# Write the full cleaned dataset, then the test and train splits, each as a zipped CSV file
with zf.ZipFile('../data/arxiv_nlp.csv.zip', 'w') as ziparchive:
    ziparchive.writestr('arxiv_nlp.csv', arxiv_nlp_merged.to_csv())

with zf.ZipFile('../data/arxiv_nlp_test.csv.zip', 'w') as ziparchive:
    ziparchive.writestr('arxiv_nlp_test.csv', arxiv_nlp_test.to_csv())

with zf.ZipFile('../data/arxiv_nlp_train.csv.zip', 'w') as ziparchive:
    ziparchive.writestr('arxiv_nlp_train.csv', arxiv_nlp_train.to_csv())
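
As an alternative to building the archives by hand, `DataFrame.to_csv` can write a zip file directly when given a `compression` dictionary. A minimal round-trip sketch on toy data, written to a temporary directory so it does not touch `../data`; the file names and values are placeholders:

```python
import os
import tempfile
import pandas as pd

# Toy frame; the ids are kept as strings.
toy = pd.DataFrame({'id': ['2308.12420', '2202.03086'],
                    'abstract': ['x', 'y']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv.zip')
    # 'archive_name' sets the name of the CSV file inside the zip archive.
    toy.to_csv(path, compression={'method': 'zip', 'archive_name': 'toy.csv'})
    # read_csv infers zip compression from the file extension.
    restored = pd.read_csv(path, index_col=0, dtype={'id': str})

print(restored['id'].tolist())
```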
