In [1]:
import pandas as pd
import numpy as np

import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import multiprocessing

from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec, KeyedVectors

import tensorflow as tf

from sklearn.preprocessing import LabelBinarizer

import abs_tag_lib as at

# Abstract classification via word2vec embedding

In this notebook, we attempt to classify abstracts of scientific papers. To do so, we proceed in two main steps:

- We train a word2vec embedder using ~20,000 abstracts collected via the Nature metadata dataset, filtered by the keyword *field theory*.

- We consider a restricted dataset, produced in a different project (see the [link]()), where the abstracts are labelled with one of 9 possible classes, and train different neural network architectures for classification.

### Goal

The aim of this study is to understand whether a word embedding coupled with a neural network classifier can achieve better performance than simpler representations, such as a word counter or a tf-idf vectorizer.

### Methods

We use a dataset that we have extracted, analysed, and cleaned in a different project (see the [link]()). The dataset is obtained via the Nature metadata API.

In the first part of the project, we train the gensim word2vec model to embed words commonly appearing in scientific papers, and particularly in papers concerning *quantum field theory*.

The second stage of this project consists of training different neural network architectures to classify the abstracts:

- We train a CNN using the procedure described in Y. Kim, "Convolutional Neural Networks for Sentence Classification" [arXiv:1408.5882](https://arxiv.org/pdf/1408.5882.pdf)

- We train an RNN with an architecture similar to the one above, mainly to compare the performance of these two settings in TensorFlow.

### Results

We judge the word2vec embedding we trained by querying the words most similar to a set of test words that we consider meaningful for the task. While the model is by no means general enough to be used in production, it seems to be sufficient for our purposes: the word associations made by the model seem reasonable.

For the nn...

## 1. Word2vec Embedding

Let's start by training the gensim word2vec model for embedding words in a real vector space. Before training the embedder, we create a vocabulary using the abstracts of Nature papers with keyword *field theory*. Our vocabulary does not solely include single words, but we additionally search for common bigrams using the gensim Phrases class. The word2vec model is trained via the continuous bag-of-words method (CBOW).

### Clean the abstract and title

Let's load the database of abstracts and titles we collected from the Nature metadata corpus,

In [2]:
df = pd.read_pickle('./datasets/dataset_field_theory.pkl')
df.drop(columns='keywords',inplace=True)

df.tail()

Unnamed: 0,title,abstract
17835,On the connection between the LSZ and Wightman...,The LSZ asymptotic condition and the Yang-Feld...
17836,The ground state of the Bose gas,The mathematical formalism describing the Bose...
17837,On the vacuum state in quantum field theory. II,"We want to construct, for every local irreduci..."
17838,A theorem concerning the positive metric,It is proved that if the n -point correlation ...
17839,Geometry of Electromagnetic null field,Electromagnetic tensor field can be divided in...


Firstly, we check which abstracts are not in English (there are a few that are in multiple languages, generally English/Italian/Russian). We'll get rid of them.

Notice that we already know that all the items have non-null abstract and title field, since we checked before storing the database.

In [3]:
abstract_foreign = df['abstract'].apply(at.is_foreign).to_numpy()

number_foreign = np.sum(abstract_foreign)
print('There are, potentially, {} foreign abstracts.'.format(number_foreign))

There are, potentially, 284 foreign abstracts.


Let's drop the foreign abstracts,

In [4]:
foreign_indices = df[abstract_foreign].index
df.drop(foreign_indices,inplace=True)
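
The helper `at.is_foreign` lives in the project library and its implementation is not shown in this notebook. As a rough illustration of how such a check could work, here is a minimal sketch that flags a text when too few of its tokens are common English function words. The function name `looks_foreign`, the word list, and the `min_ratio` threshold are all hypothetical choices made for this example.

```python
import re

# Small illustrative set of English function words
# (an assumption for this sketch, not the list used by abs_tag_lib)
ENGLISH_HINTS = {"the", "of", "and", "in", "a", "to", "is", "that",
                 "for", "we", "with", "on", "are", "by", "this"}

def looks_foreign(text, min_ratio=0.15):
    """Return True when fewer than `min_ratio` of the tokens are common English words."""
    # Keep runs of letters only (works for accented characters as well)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return False
    hits = sum(token in ENGLISH_HINTS for token in tokens)
    return hits / len(tokens) < min_ratio

print(looks_foreign("The ground state of the Bose gas is studied in this paper"))
print(looks_foreign("Si dimostra che la funzione di correlazione soddisfa una condizione asintotica"))
```

A heuristic of this kind can only flag *potentially* foreign abstracts, which is why the count printed above is qualified in the same way.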

Now, let's clean the abstract and title, and map them into a list of tokens

In [5]:
# Select the lemmatizer and the corpus of stopwords
wnl = WordNetLemmatizer()
stp_en = stopwords.words('english')

# clean the abstract and title
df['clean abs'] = df['abstract'].apply(at.clean_text_lemmatize,args=(wnl,stp_en))
df['clean title'] = df['title'].apply(at.clean_text_lemmatize,args=(wnl,stp_en))
df.tail()

Unnamed: 0,title,abstract,clean abs,clean title
17835,On the connection between the LSZ and Wightman...,The LSZ asymptotic condition and the Yang-Feld...,"[lsz, asymptotic, condition, yang, feldman, eq...","[connection, lsz, wightman, quantum, field, th..."
17836,The ground state of the Bose gas,The mathematical formalism describing the Bose...,"[mathematical, formalism, describing, bose, ga...","[ground, state, bose, gas]"
17837,On the vacuum state in quantum field theory. II,"We want to construct, for every local irreduci...","[want, construct, every, local, irreducible, q...","[vacuum, state, quantum, field, theory, ii]"
17838,A theorem concerning the positive metric,It is proved that if the n -point correlation ...,"[proved, n, point, correlation, function, syst...","[theorem, concerning, positive, metric]"
17839,Geometry of Electromagnetic null field,Electromagnetic tensor field can be divided in...,"[electromagnetic, tensor, field, divided, thre...","[geometry, electromagnetic, null, field]"
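
The function `at.clean_text_lemmatize` is part of the project library and is not reproduced here. Judging from the output above, it lowercases the text, removes punctuation and stop words, and lemmatizes the remaining tokens. A minimal sketch of such a pipeline could look as follows; the name `clean_text`, the regular expression, and the toy lemma and stop-word tables are assumptions made for illustration.

```python
import re

def clean_text(text, lemmatize=lambda word: word, stop_words=frozenset()):
    """Lowercase, keep alphanumeric tokens, drop stop words, then lemmatize each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [lemmatize(token) for token in tokens if token not in stop_words]

# Toy stand-ins for the lemmatizer and the stop-word list
toy_stop_words = {"the", "of", "a", "on", "in"}
toy_lemmas = {"states": "state", "gases": "gas"}

tokens = clean_text("The ground states of the Bose gases",
                    lemmatize=lambda w: toy_lemmas.get(w, w),
                    stop_words=toy_stop_words)
print(tokens)
```

With NLTK available, one could pass `WordNetLemmatizer().lemmatize` and `set(stopwords.words('english'))` in place of the toy stand-ins.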


### Create a vocabulary with bigrams

We now use the gensim `Phrases` class to identify the most common and meaningful bigrams in the abstracts and titles, and create a vocabulary of 1- and 2-grams.

In [6]:
sentences = df['clean abs'].to_list() + df['clean title'].to_list()

Let's find the most common bigrams in the text and add them to the vocabulary,

In [16]:
# Find all the bigram that appear at least 20 times and have a score above 55
phrases = Phrases(sentences, min_count=20,threshold=55)

# Since we do not plan to train phrases again, we can reduce its memory footprint with the following
trained_phrases = Phraser(phrases)

# Examples of bigram found
list_bigrams = list(trained_phrases.phrasegrams.keys())
list_bigrams[:10]

[(b'long', b'range'),
 (b'shed', b'light'),
 (b'schwinger', b'keldysh'),
 (b'renormalization', b'group'),
 (b'brane', b'web'),
 (b'condensed', b'matter'),
 (b'degree', b'freedom'),
 (b'boundary', b'condition'),
 (b'previous', b'work'),
 (b'spin', b'chain')]
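
To make the role of `min_count` and `threshold` more concrete, the sketch below reproduces what is, to our understanding, the default scoring rule of `Phrases` (the one introduced by Mikolov et al. for word2vec): a bigram is kept when its score exceeds `threshold`. The function name and the counts are illustrative assumptions, not values taken from our corpus.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the bigram (a, b): (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Toy counts, chosen only for illustration
score = default_phrase_score(count_a=150, count_b=200, count_ab=120,
                             vocab_size=30000, min_count=20)

# The bigram is accepted when the score is above the threshold
print(score, score > 55)
```

The score grows when the two words appear together much more often than their individual frequencies would suggest, which is why pairs such as `renormalization group` are fused while frequent but unrelated neighbours are not.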

Save the trained phrases,

In [None]:
trained_phrases.save('./vocabulary/phrases')

### Training word2vec

We now train the word2vec embedder on the titles and abstracts of every paper obtained when querying the Nature metadata database with the keyword *field theory*.

First we construct the sentences with common bigrams fused together,

In [17]:
bi_sentences = trained_phrases[sentences]

Now we can train the model; we keep the default gensim settings, apart from the number of epochs, as we found this to work best,

In [18]:
n_cpu = multiprocessing.cpu_count()

# Initialize model (keep at least one worker on single-core machines)
w2v_model = Word2Vec(workers=max(1, n_cpu-1))

# build the vocabulary
w2v_model.build_vocab(bi_sentences)

# train the model
w2v_model.train(bi_sentences,
                total_examples=w2v_model.corpus_count,
                epochs=30
               )

(32629752, 35945940)

We can test the model on some words that we expect to be important in distinguishing between abstracts,

In [19]:
test_words = ["computer","quantum","field","electron","proton","ad_cft","eth","gauge"]

for word in test_words:

    print('Test word is "{}"'.format(word))
    
    print('Similar words:')
    for similar, score in w2v_model.wv.most_similar(positive=[word],topn=5):
        print('- "{}" : {:.2f}'.format(similar, score))
        
    print('---')

Test word is "computer"
Similar words:
- "automated" : 0.56
- "neuroimaging" : 0.54
- "code" : 0.53
- "practical" : 0.52
- "deterministic" : 0.52
---
Test word is "quantum"
Similar words:
- "axiomatic" : 0.53
- "quantum_mechanical" : 0.50
- "classical" : 0.49
- "attribute" : 0.47
- "bohmian" : 0.45
---
Test word is "field"
Similar words:
- "electrodynamics" : 0.48
- "maxwell" : 0.46
- "perturbation" : 0.43
- "gravitation" : 0.42
- "interacting" : 0.42
---
Test word is "electron"
Similar words:
- "atom" : 0.62
- "exciton" : 0.55
- "allowance" : 0.54
- "carrier" : 0.54
- "phonon" : 0.52
---
Test word is "proton"
Similar words:
- "antiproton" : 0.69
- "nucleon" : 0.64
- "neutron" : 0.61
- "nucleus" : 0.61
- "280_gev" : 0.59
---
Test word is "ad_cft"
Similar words:
- "holography" : 0.58
- "holographic" : 0.54
- "cft" : 0.52
- "agt" : 0.52
- "fzz" : 0.51
---
Test word is "eth"
Similar words:
- "eigenstate" : 0.69
- "thermalization" : 0.65
- "monotonicity" : 0.54
- "relative_entropy" : 0.50


The associations above look reasonable; among the training settings we tried, this is the one that produced the most sensible results. We can now save the model for later use.

In [85]:
w2v_model.wv.save('./vocabulary/FTword2vec')
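
The scores printed by `most_similar` in the test above are cosine similarities between word vectors. The following sketch shows the computation on a toy vocabulary; the three-dimensional vectors and the helper names are made up for illustration and have nothing to do with the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy embedding, for illustration only
toy_vectors = {
    "proton": [0.9, 0.1, 0.0],
    "neutron": [0.8, 0.2, 0.1],
    "computer": [0.0, 0.3, 0.9],
}

for similar, score in most_similar("proton", toy_vectors):
    print('- "{}" : {:.2f}'.format(similar, score))
```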

## 2. CNN for abstract classification

We follow the procedure described in Y. Kim, "Convolutional Neural Networks for Sentence Classification" [arXiv:1408.5882](https://arxiv.org/pdf/1408.5882.pdf), although we change the architecture slightly. We go into more detail about the architecture later in the notebook.

### Cleaning of the data

Let's first load our labelled data and clean title and abstract

In [2]:
df = pd.read_pickle('./datasets/dataset_ft_cleaned.pkl')
df.drop(columns=['clean abs','clean title','number of eqs'],inplace=True)

df.tail()

Unnamed: 0,title,abstract,keywords
4681,Some remarks about the localization of states ...,For the case of a field theory with a nuclear ...,statist physic
4682,A proof of the crossing property for two-parti...,"In the framework of the ℒ. l . Z . formalism, ...",statist physic
4683,On the connection between the LSZ and Wightman...,The LSZ asymptotic condition and the Yang-Feld...,statist physic
4684,On the vacuum state in quantum field theory. II,"We want to construct, for every local irreduci...",statist physic
4685,A theorem concerning the positive metric,It is proved that if the n -point correlation ...,statist physic


Load the previously trained `Phraser` object,

In [3]:
# The object saved in Section 1 is a Phraser, so we load it with the same class
trained_phrases = Phraser.load('./vocabulary/phrases')

Clean the data and convert bigrams,

In [5]:
# Select the lemmatizer and the corpus of stopwords
wnl = WordNetLemmatizer()
stp_en = stopwords.words('english')

# Let's clean the abstract and find bigrams using the trained Phrases object
cleaned_abstracts = df['abstract'].apply(at.clean_text_lemmatize,args=(wnl,stp_en))
df['clean abs'] = np.array(trained_phrases[cleaned_abstracts],dtype=object)

# and we do the same for the title
cleaned_title = df['title'].apply(at.clean_text_lemmatize,args=(wnl,stp_en))
df['clean title'] = np.array(trained_phrases[cleaned_title],dtype=object)

Let's see the tokenized version of title and abstracts,

In [6]:
df.tail()

Unnamed: 0,title,abstract,keywords,clean abs,clean title
4681,Some remarks about the localization of states ...,For the case of a field theory with a nuclear ...,statist physic,"[case, field, theory, nuclear, space, test, fu...","[remark, localization, state, quantum, field, ..."
4682,A proof of the crossing property for two-parti...,"In the framework of the ℒ. l . Z . formalism, ...",statist physic,"[framework, l, z, formalism, crossing, propert...","[proof, crossing, property, two, particle, amp..."
4683,On the connection between the LSZ and Wightman...,The LSZ asymptotic condition and the Yang-Feld...,statist physic,"[lsz, asymptotic, condition, yang, feldman, eq...","[connection, lsz, wightman, quantum, field, th..."
4684,On the vacuum state in quantum field theory. II,"We want to construct, for every local irreduci...",statist physic,"[want, construct, every, local, irreducible, q...","[vacuum, state, quantum, field, theory, ii]"
4685,A theorem concerning the positive metric,It is proved that if the n -point correlation ...,statist physic,"[proved, n, point, correlation, function, syst...","[theorem, concerning, positive, metric]"


In [7]:
labels = np.unique(df['keywords'])

### Pre-processing

We can now apply the word2vec embedding previously trained to our title+abstracts, to obtain the input tensor for the CNN.

First, let's collect the pre-trained word2vec embedding model,

In [8]:
FTwv = KeyedVectors.load('./vocabulary/FTword2vec')

Now, we map each token of a sequence to the corresponding index in the vocabulary of the word2vec model previously trained. We additionally pad each sequence at the end, so that all inputs have the same length.

In [11]:
# Fuse title and abstract together
full_text = df['clean title']+df['clean abs']

# Hash the text and pad the resulting sequence
df['hashed text'] = at.hashed_padded_sequences(full_text,FTwv)
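
The helper `at.hashed_padded_sequences` is defined in the project library. A minimal sketch of the idea is given below, under two assumptions that we cannot confirm from this notebook alone: tokens missing from the vocabulary are dropped, and indices are shifted by one so that `0` is reserved for padding (which would match the all-zero row prepended to the embedding weights further below). The function name and the toy vocabulary are hypothetical.

```python
def hash_and_pad(token_lists, vocab_index, pad_value=0):
    """Map tokens to 1-based vocabulary indices and right-pad to the longest sequence."""
    hashed = [[vocab_index[t] + 1 for t in tokens if t in vocab_index]
              for tokens in token_lists]
    max_len = max(len(seq) for seq in hashed)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in hashed]

# Toy vocabulary and texts, for illustration only
toy_vocab = {"quantum": 0, "field": 1, "theory": 2, "gauge": 3}
toy_texts = [["quantum", "field", "theory"],
             ["gauge", "unknown_token", "theory"],
             ["field"]]

padded = hash_and_pad(toy_texts, toy_vocab)
print(padded)
```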

### Test and training split

Let's divide the dataset into test and training sets, as we have done in the other [project](),

In [12]:
# just so we are sure we can reproduce the splitting
seed = 1863

# split into training and test
df_train = df.sample(frac=0.7,random_state=seed)
df_test = df.drop(df_train.index)

The training and test sets are created, and the labels are mapped into dummy variables,

In [13]:
# The predictors for training and test are the following
X_train = np.vstack(df_train['hashed text'].to_numpy())
X_test =  np.vstack(df_test['hashed text'].to_numpy())

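# The labels are mapped into a dummy vector
dummy_encoder = LabelBinarizer()

# Fit the encoder on the training labels only, and reuse it for the test labels
# so that both sets share the same column order
y_train = dummy_encoder.fit_transform(df_train['keywords'].to_numpy())
y_test = dummy_encoder.transform(df_test['keywords'].to_numpy())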

print('The training dataset has {} items of length {}'.format(*X_train.shape))
print('The test datatset has {} items'.format(X_test.shape[0]))

The training dataset has 3270 items of length 278
The test datatset has 1402 items


We store the training set on disk, so that we can perform CV on a different machine,

In [33]:
# Save training set
np.save('./datasets/vectorized/X_train',X_train)
np.save('./datasets/vectorized/y_train',y_train)

# Save the feature names
np.save('./datasets/vectorized/labels',labels)

Additionally, we store the embedding weights of the word2vec model we trained before,

In [34]:
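# The weights, with an all-zero row prepended at index 0
# (np.float was removed from recent NumPy releases; the builtin float is equivalent)
vocab_size, k = FTwv.vectors.shape
embedding_weigths = np.vstack((np.zeros(k,dtype=float),FTwv.vectors))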

np.save('./datasets/vectorized/embedding_weights',embedding_weigths)
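
To see how the index sequences and the embedding weights fit together, the sketch below performs the embedding lookup with plain NumPy indexing. The toy sizes and random vectors are assumptions for illustration; in a network, the same lookup would typically be done by an embedding layer initialized with these weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 4, 3

# Toy word vectors; an all-zero row is prepended at index 0
word_vectors = rng.normal(size=(vocab_size, k))
embedding = np.vstack((np.zeros(k), word_vectors))

# Two padded index sequences of length n = 3 (0 marks padding)
X = np.array([[1, 2, 3],
              [4, 3, 0]])

# Each index is replaced by its row of the embedding matrix
embedded = embedding[X]
print(embedded.shape)  # (2, 3, 3), i.e. (batch, n, k)
```

Each text thus becomes an $n \times k$ matrix, and padded positions map to the zero vector.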

## Convolutional Neural Network (CNN)

We follow the CNN-singlechannel construction of the paper [arXiv:1408.5882](https://arxiv.org/pdf/1408.5882.pdf), where

- The input layer consists of a matrix $n \times k$ where $n$ is the number of words in the text (in this case the title+abstract) and $k$ is the dimension of the vector space where we perform the word2vec embedding (in our case, it is $k=100$).


- The second layer is a convolutional layer with $m$ feature maps, each with a window of length $h$. In the paper, multiple window lengths are used at the same time (e.g. $h = 2,3,4$), but we prefer using a single window. We additionally regularize the weights with an L1 penalty, to encourage an effective variable window. The output of the feature maps is passed through the ReLU function.


- After the convolution we apply a global max-pooling layer.


- Finally, a fully connected layer with dropout and a softmax output.
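
The layers listed above can be sketched as a single forward pass in NumPy. This is only meant to show the shapes and operations involved, not the TensorFlow model we train: the weights are random, the number of feature maps `m = 8` is a toy value, and dropout and the L1 penalty are omitted since they only matter during training.

```python
import numpy as np

def cnn_forward(x, conv_w, conv_b, dense_w, dense_b):
    """Forward pass for one text: x is (n, k), conv_w is (h, k, m), dense_w is (m, n_classes)."""
    n, k = x.shape
    h, _, m = conv_w.shape
    # Convolution over all windows of h consecutive words, followed by ReLU
    windows = np.stack([x[i:i + h] for i in range(n - h + 1)])   # (n - h + 1, h, k)
    feature_maps = np.tensordot(windows, conv_w, axes=([1, 2], [0, 1])) + conv_b
    feature_maps = np.maximum(feature_maps, 0.0)                 # (n - h + 1, m)
    # Global max-pooling keeps the largest activation of each feature map
    pooled = feature_maps.max(axis=0)                            # (m,)
    # Fully connected layer with softmax output
    logits = pooled @ dense_w + dense_b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
n, k, h, m, n_classes = 278, 100, 3, 8, 9
probs = cnn_forward(rng.normal(size=(n, k)),
                    rng.normal(size=(h, k, m)) * 0.01, np.zeros(m),
                    rng.normal(size=(m, n_classes)), np.zeros(n_classes))
print(probs.shape)
```

Because of the global max-pooling, the output size does not depend on the text length $n$, only on the number of feature maps.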

### Hyper-parameters selection

We run K-fold cross-validation on a different machine, with the following range of parameters,

- `n_filters` = $[100,200,300,400,500]$
- `window_size` = $[2,3,4,5]$
- `dropout_prob` = $[0.25,0.5]$
- `l1_param` = $[10^{-2},10^{-3},10^{-4},10^{-5}]$
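
As a rough guide to the size of this search, the sketch below enumerates the parameter combinations and builds the folds, assuming an exhaustive grid search. The number of folds is not specified in this notebook, so `n_folds=5` is a placeholder, and `k_fold_indices` is a hypothetical helper rather than the code we ran.

```python
from itertools import product

param_grid = {
    "n_filters": [100, 200, 300, 400, 500],
    "window_size": [2, 3, 4, 5],
    "dropout_prob": [0.25, 0.5],
    "l1_param": [1e-2, 1e-3, 1e-4, 1e-5],
}

# Every combination of the values above
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
print(len(combinations))  # 160

def k_fold_indices(n_items, n_folds):
    """Yield (train_indices, validation_indices) for each of the n_folds folds."""
    indices = list(range(n_items))
    fold_sizes = [n_items // n_folds + (1 if i < n_items % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 3270 is the size of our training set; n_folds=5 is a placeholder
folds = list(k_fold_indices(3270, n_folds=5))
```

Each combination is trained once per fold, so the total number of training runs is the number of combinations times the number of folds.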

The results are shown below,

In [10]:
CV_df = pd.read_pickle('./cross-val/cnn/cross_val_results.pkl')