# Load and clean the data

* Load the metadata prepared in `../00_load_metadata.ipynb`
* Keep only papers on natural language processing: papers whose arXiv categories include 'cs.CL' (Computation and Language)
* Load abstracts prepared in `../00_load_abstracts.ipynb`, merge with metadata dataframe
* Check that all entries have an abstract
* Keep only research papers, defined here as papers whose abstract does not contain the phrase "systematic literature review".

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
import tok
import zipfile as zf
import pickle

In [15]:
%%time

# Load the metadata downloaded from arXiv
arxiv_metadata = pd.read_csv('../data/arxiv_metadata.csv.zip', index_col=0)



CPU times: user 10 s, sys: 907 ms, total: 11 s
Wall time: 11 s


In [16]:
# Keep only papers on natural language processing: papers whose categories include 'cs.CL' (Computation and Language)
nlp_idx = ['cs.CL' in subject for subject in arxiv_metadata['categories']]
arxiv_nlp = arxiv_metadata[nlp_idx]
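
The comprehension above tests for a substring in the whole `categories` string, so it keeps every paper that lists `cs.CL` anywhere, not only papers whose primary category is `cs.CL`. A minimal sketch of both variants on toy data, assuming that `categories` holds space-separated category codes with the primary category first (an assumption about the metadata file, not something this notebook confirms):

```python
import pandas as pd

# Toy stand-in for arxiv_metadata. The space-separated 'categories' format
# with the primary category first is an assumption.
toy_metadata = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'categories': ['cs.CL cs.AI', 'cs.LG cs.CL', 'math.CO'],
})

category_lists = toy_metadata['categories'].str.split()

# Papers that list cs.CL anywhere (token match rather than substring match)
lists_cs_cl = category_lists.apply(lambda cats: 'cs.CL' in cats)

# Papers whose first (assumed primary) category is cs.CL
primary_cs_cl = category_lists.str[0] == 'cs.CL'

print(toy_metadata[lists_cs_cl]['id'].tolist())
print(toy_metadata[primary_cs_cl]['id'].tolist())
```

Which variant is appropriate depends on whether cross-listed papers should count as NLP papers.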

In [17]:
%%time

# Load the abstracts extracted in notebook 00_load_abstracts
arxiv_abstracts = pd.read_csv('../data/arxiv_abstracts.csv.zip', index_col=0)



CPU times: user 10.9 s, sys: 602 ms, total: 11.5 s
Wall time: 11.5 s


In [18]:
# merge with metadata dataframe
arxiv_abstracts_nlp = arxiv_abstracts[arxiv_abstracts.id.isin(arxiv_nlp.id)]
arxiv_nlp_merged = pd.merge(arxiv_nlp, arxiv_abstracts_nlp, on='id')
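
`pd.merge` performs an inner join by default, so a paper with no row in the abstracts table disappears from the result without any warning. One way to make such gaps visible is a left merge with `indicator=True`. The sketch below uses hypothetical toy frames and also assumes that `id` is unique in both tables, which `validate='one_to_one'` checks:

```python
import pandas as pd

# Toy stand-ins for arxiv_nlp and arxiv_abstracts (hypothetical data)
toy_nlp = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_abstracts = pd.DataFrame({'id': [1, 3, 4],
                              'abstract': ['text a', 'text c', 'text d']})

# The left merge keeps every metadata row; the '_merge' column added by
# indicator=True records whether a matching abstract row was found.
checked = pd.merge(toy_nlp, toy_abstracts, on='id', how='left',
                   validate='one_to_one', indicator=True)

missing = checked[checked['_merge'] == 'left_only']
print(f"{len(missing)} of {len(checked)} papers have no abstract row")
```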

In [19]:
# Drop entries that have no abstract
idx = arxiv_nlp_merged['abstract'].isna()
arxiv_nlp_merged = arxiv_nlp_merged[~idx]

In [20]:
# Keep only research papers: abstracts that do not mention 'systematic literature review'.
# The mask is built from the column itself, so it shares the dataframe's index
# even after rows have been dropped.
is_review = arxiv_nlp_merged['abstract'].str.lower().str.contains('systematic literature review', regex=False)
arxiv_nlp_reviews = arxiv_nlp_merged[is_review]
arxiv_nlp_merged = arxiv_nlp_merged[~is_review]

In [21]:
print(f"There are {len(arxiv_nlp_reviews)} review papers on NLP with an abstract in the dataset.")
print(f"There are {len(arxiv_nlp_merged)} research papers on NLP with an abstract in the dataset.")

There are 16 review papers on NLP with an abstract in the dataset.
There are 54551 research papers on NLP with an abstract in the dataset.


## Split the data into train / validate / test datasets

In [22]:

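# First split: 50% train, 50% held out
arxiv_nlp_train, arxiv_nlp_holdout = train_test_split(arxiv_nlp_merged, test_size=0.5)
# Second split: halve the held-out set into validate and test (25% of the data each)
arxiv_nlp_validate, arxiv_nlp_test = train_test_split(arxiv_nlp_holdout, test_size=0.5)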
print(f"The train dataset has {arxiv_nlp_train.shape[0]} rows, the validate dataset {arxiv_nlp_validate.shape[0]} rows, the test dataset {arxiv_nlp_test.shape[0]} rows")

The train dataset has 27275 rows, the validate dataset 13638 rows, the test dataset 13638 rows
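
The split above is random and uses no fixed seed, so rerunning the notebook assigns papers to different sets. A minimal sketch of a reproducible 50/25/25 split on toy data, where the seed value is arbitrary and chosen only for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for arxiv_nlp_merged (hypothetical data)
toy = pd.DataFrame({'abstract': [f'abstract {i}' for i in range(100)]})

SEED = 42  # arbitrary; any fixed integer makes the split repeatable

# First split: 50% train, 50% held out
train, holdout = train_test_split(toy, test_size=0.5, random_state=SEED)
# Second split: halve the held-out rows into validate and test
validate, test = train_test_split(holdout, test_size=0.5, random_state=SEED)

print(len(train), len(validate), len(test))
```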


### Save raw data

In [23]:


with zf.ZipFile('../data/arxiv_nlp.csv.zip', 'w') as ziparchive:
    ziparchive.writestr('arxiv_nlp.csv', arxiv_nlp_merged.to_csv())

with zf.ZipFile('../data/arxiv_nlp_test.csv.zip', 'w') as ziparchive:
    ziparchive.writestr('arxiv_nlp_test.csv', arxiv_nlp_test.to_csv())

with zf.ZipFile('../data/arxiv_nlp_validate.csv.zip', 'w') as ziparchive:
    ziparchive.writestr('arxiv_nlp_validate.csv', arxiv_nlp_validate.to_csv())

with zf.ZipFile('../data/arxiv_nlp_train.csv.zip', 'w') as ziparchive:
    ziparchive.writestr('arxiv_nlp_train.csv', arxiv_nlp_train.to_csv())
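
As an alternative to building each archive by hand, `DataFrame.to_csv` can write a zipped CSV directly when given a `compression` dictionary. A sketch on toy data, using a temporary directory and hypothetical file names:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one of the dataframes saved above (hypothetical data)
toy = pd.DataFrame({'id': [1, 2], 'abstract': ['first', 'second']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv.zip')
    # 'archive_name' sets the name of the CSV file inside the zip archive
    toy.to_csv(path, compression={'method': 'zip', 'archive_name': 'toy.csv'})
    # read_csv infers zip compression from the '.zip' extension
    restored = pd.read_csv(path, index_col=0)

print(restored)
```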


## Tokenize

Apply pre-processing filters: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short

Apply lemmatization to the list of words.

see: https://github.com/piskvorky/gensim/blob/develop/gensim/parsing/preprocessing.py

"corpus_train"
> The 50% of the texts reserved for fitting the model.

"corpus_validate"
> The 25% of the texts reserved for computing perplexity when choosing the model's k-parameter and searching for the best parameters.

"corpus_test"
> The 25% of the texts reserved for testing hypotheses.

In [24]:
# make a dictionary with all the words in the complete dataset
texts = tok.clean(arxiv_nlp_merged['abstract'])
dictionary = tok.make_dictionary(texts)

# tokenize
texts_train = tok.clean(arxiv_nlp_train['abstract'])
corpus_train = tok.make_corpus(dictionary, texts_train)

texts_validate = tok.clean(arxiv_nlp_validate['abstract'])
corpus_validate = tok.make_corpus(dictionary, texts_validate)

texts_test = tok.clean(arxiv_nlp_test['abstract'])
corpus_test = tok.make_corpus(dictionary, texts_test)


### Save tokenized data

In [26]:
with open('../data/dictionary.pickle', 'wb') as handle:
    pickle.dump(dictionary, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('../data/corpus_train.pickle', 'wb') as handle:
    pickle.dump(corpus_train, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('../data/corpus_validate.pickle', 'wb') as handle:
    pickle.dump(corpus_validate, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('../data/corpus_test.pickle', 'wb') as handle:
    pickle.dump(corpus_test, handle, protocol=pickle.HIGHEST_PROTOCOL)
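
Later notebooks can read these files back with `pickle.load`. A minimal round-trip sketch on a toy corpus, written to a temporary directory with a hypothetical file name:

```python
import os
import pickle
import tempfile

# Toy bag-of-words corpus: one list of (token_id, count) pairs per text
toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 3)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_corpus.pickle')

    with open(path, 'wb') as handle:
        pickle.dump(toy_corpus, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Only unpickle files from a trusted source: loading can run arbitrary code
    with open(path, 'rb') as handle:
        restored_corpus = pickle.load(handle)

print(restored_corpus == toy_corpus)
```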