# Embedding

This notebook evaluates methods for embedded document representation using the [academia.stackexchange.com](https://academia.stackexchange.com/) data dump.

## Table of Contents
* [Data import](#data_import)
* [Data preparation](#prep)
* [Embedding](#embedding)
* [Experiments](#experiments)
* [Evaluation](#evaluation)
* [Dimension Reduction](#dim_reduce)

In [1]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
from joblib import dump, load
from academia_tag_recommender.definitions import MODELS_PATH

<a id='data_import'/>

## Data Import

In [2]:
from academia_tag_recommender.data import documents
texts = [document.text for document in documents]

<a id='prep'/>

## Data Preparation

In [3]:
from academia_tag_recommender.test_train_data import get_y, get_test_train_data, get_all_labels

X = np.vstack([document.text for document in documents])
y = get_y()
labels = get_all_labels(reduced=False)
X_train, X_test, y_train, y_test = get_test_train_data(X, y, scale=False)
print('Train set with shape ', X_train.shape)
print('Test set with shape', X_test.shape)



Train set with shape  (24812, 1)
Test set with shape (8270, 1)


<a id='embedding'/>

## Embedding

> Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. [Wikipedia contributors (2021)][1]

The following word embedding models will be used for this approach:
- Word2Vec
- Doc2Vec
- FastText


**Word2Vec**
> The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications. [Google (2013)][2]


**Doc2Vec**
> Le and Mikolov in 2014 introduced the Doc2Vec algorithm, which usually outperforms such simple-averaging of Word2Vec vectors. The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. [Radim Řehůřek (2020)][3]


**FastText**
> The main principle behind [F]astText is that the morphological structure of a word carries important information about the meaning of the word. Such structure is not taken into account by traditional word embeddings like Word2Vec, which train a unique word embedding for every individual word. [F]astText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams. [Radim Řehůřek (2020)][4]

[1]: https://en.wikipedia.org/wiki/Word_embedding
[2]: https://code.google.com/archive/p/word2vec/
[3]: https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py
[4]: https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#sphx-glr-auto-examples-tutorials-run-fasttext-py

<a id='experiments'/>

## Experiments

Before preprocessing, a document is still a raw string consisting of the question title and body, including punctuation and HTML tags.

In [4]:
print(X_train[0])

["Where can I find the Impact Factor for a given journal? <p>As from title. Not all journals provide the impact factor on their homepage. For those who don't where can I find their impact factor ?</p>\n"]


**Word2Vec**

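First, the string sentences are tokenized into arrays of lowercase strings representing the words. Punctuation and HTML tags are removed and, as the output below shows, common stop words such as `where`, `the` and `for` are dropped.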

In [5]:
from academia_tag_recommender.embedded_data import Word2Tok
sentences = Word2Tok(X_train)

print(list(sentences)[0])

['can', 'find', 'impact', 'factor', 'given', 'journal', 'title', 'journals', 'provide', 'impact', 'factor', 'homepage', 'don', 'can', 'find', 'impact', 'factor']
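The implementation of `Word2Tok` is not shown in this notebook. As a rough illustration of the preprocessing described above, the following is a minimal sketch that strips HTML tags and punctuation, lowercases the text and drops stop words. The function name `tokenize` and the tiny `STOP_WORDS` set are assumptions made for this example only and are not part of `academia_tag_recommender`.

```python
import re

# Hypothetical stop word list, chosen only for this example.
STOP_WORDS = {"where", "i", "the", "not", "all", "it"}


def tokenize(text):
    """Lowercase the text, strip HTML tags and punctuation, drop stop words."""
    without_tags = re.sub(r"<[^>]+>", " ", text)          # remove tags such as <p> and </p>
    words = re.findall(r"[a-z]+", without_tags.lower())   # keep alphabetic runs only
    return [word for word in words if word not in STOP_WORDS]


tokens = tokenize("Where can I find the Impact Factor? <p>Not all journals provide it.</p>")
print(tokens)  # ['can', 'find', 'impact', 'factor', 'journals', 'provide']
```

Splitting on alphabetic runs also explains tokens like `don` in the output above: the apostrophe in `don't` acts as a separator.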


A `Word2Vec` model is trained using the tokenized sentences.

In [6]:
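from gensim.models import Word2Vec
# Train with gensim's default hyperparameters (this notebook uses the gensim 3.x API).
model = Word2Vec(sentences=sentences)
# Keep only the word vectors; the full model is not needed for lookups.
wv = model.wv
del model
# L2-normalize the vectors in place; the unnormalized vectors are discarded.
wv.init_sims(replace=True)
# wv.vocab is the gensim 3.x vocabulary mapping.
print('Training the model on the documents results in {} words in the vocabulary.'.format(len(wv.vocab)))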

Training the model on the documents results in 12894 words in the vocabulary.
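The vocabulary is smaller than the set of all distinct tokens because `Word2Vec` ignores rare words: with gensim's default `min_count=5`, a word must occur at least five times in the corpus to be kept. The following sketch shows that filtering step in isolation. The helper `build_vocabulary` and the toy sentences are assumptions for illustration and do not reproduce gensim's internals.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=5):
    """Return the set of tokens that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}


# Toy corpus; a lower threshold is used so that the effect is visible.
toy_sentences = [
    ["impact", "factor", "journal"],
    ["impact", "factor", "homepage"],
    ["impact", "journal"],
]
vocabulary = build_vocabulary(toy_sentences, min_count=2)
print(sorted(vocabulary))  # ['factor', 'impact', 'journal']
```

Here `homepage` occurs only once and is therefore excluded.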


The `Word2Vec` model can now be used to look up a vector for each word in the vocabulary. With the default settings, each vector has 100 dimensions.

In [7]:
wv['academic']

array([ 0.21820764,  0.00911908,  0.04361239,  0.06080858, -0.07216044,
       -0.11958563, -0.13394633, -0.04513133,  0.14242645,  0.09354336,
       -0.10020303, -0.04972231,  0.17514479, -0.0194051 ,  0.05589714,
        0.06976553,  0.05294935, -0.06892349,  0.10797131,  0.10748126,
        0.00488774, -0.0973985 , -0.09980931, -0.05023666, -0.11743355,
       -0.0603033 ,  0.07642061, -0.05857554, -0.04397227,  0.12190309,
       -0.02885075, -0.07793532,  0.1043091 ,  0.10896602,  0.19491932,
       -0.04871687, -0.12617314,  0.16792402,  0.01798099, -0.00246853,
        0.05683499, -0.05630356, -0.07650578, -0.01946483,  0.05033841,
       -0.23604739, -0.04106637, -0.08350789,  0.06483133,  0.10227551,
        0.21002582, -0.02293057,  0.12989593,  0.14919503,  0.3716091 ,
       -0.0404757 , -0.16821301,  0.06272127,  0.10951851, -0.01312884,
        0.01802835,  0.04073705,  0.0381859 ,  0.08215148,  0.05578558,
       -0.11915142, -0.22632273, -0.05271645, -0.03897511,  0.02

Based on these vectors, words can be compared to each other by cosine similarity. The 10 most similar words to `academic` are:

In [8]:
wv.most_similar('academic')

[('scientific', 0.5422207117080688),
 ('professional', 0.5357193350791931),
 ('existent', 0.4653562903404236),
 ('academia', 0.447778582572937),
 ('prospects', 0.434415727853775),
 ('academics', 0.43376612663269043),
 ('permanent', 0.4279685616493225),
 ('traditional', 0.4223255515098572),
 ('future', 0.4221450090408325),
 ('educational', 0.4168570041656494)]
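The scores above are cosine similarities between word vectors. A minimal sketch of such a ranking in plain Python is shown below. The two-dimensional `toy_vectors` are made up for illustration and are not taken from the trained model, and the local `most_similar` function only mimics the idea behind the gensim method of the same name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, best first."""
    scores = [
        (other, cosine_similarity(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only.
toy_vectors = {
    "academic": [1.0, 0.0],
    "scientific": [0.8, 0.6],
    "homepage": [0.0, 1.0],
}
ranking = most_similar("academic", toy_vectors)
print(ranking)  # 'scientific' ranks first, 'homepage' second
```

For vectors that are already normalized to unit length, the cosine similarity reduces to a plain dot product.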

**Combining Word2Vec with multiword n-grams**

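Instead of using only unigrams, it is possible to include bigrams in the vocabulary. Frequently co-occurring word pairs are merged into single tokens such as `job_market` before training.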

In [9]:
from academia_tag_recommender.embedded_data import Word2Tok
sentences = Word2Tok(X_train)

print(list(sentences)[0])

['can', 'find', 'impact', 'factor', 'given', 'journal', 'title', 'journals', 'provide', 'impact', 'factor', 'homepage', 'don', 'can', 'find', 'impact', 'factor']


In [10]:
from gensim.models.phrases import Phrases

bigram_transformer = Phrases(sentences, min_count=1)

In [11]:
from gensim.models import Word2Vec
model = Word2Vec(sentences=bigram_transformer[sentences])
wv = model.wv
del model
wv.init_sims(replace=True)
print('Training the model on the documents results in {} words in the vocabulary.'.format(len(wv.vocab)))

Training the model on the documents results in 20947 words in the vocabulary.


Including bigrams increases the vocabulary from 12,894 to 20,947 entries, a factor of about 1.6.
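The idea behind the phrase detection can be sketched with simple counts: a word pair is merged when it occurs more often than its individual word frequencies would suggest. The code below is a simplified, count-based score in that spirit. The helper names, the scaling by the unigram vocabulary size and the threshold value are assumptions for this example, and gensim's `Phrases` may differ in its details.

```python
from collections import Counter


def find_phrases(sentences, min_count=1, threshold=1.0):
    """Return word pairs whose co-occurrence score exceeds `threshold`."""
    unigrams = Counter(token for sentence in sentences for token in sentence)
    bigrams = Counter(pair for sentence in sentences for pair in zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (first, second), count in bigrams.items():
        # Discounted pair count relative to the product of the word counts.
        score = (count - min_count) * vocab_size / (unigrams[first] * unigrams[second])
        if score > threshold:
            phrases.add((first, second))
    return phrases


def merge_phrases(sentence, phrases):
    """Join detected pairs with an underscore, scanning left to right."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


toy_sentences = [
    ["job", "market", "is", "tough"],
    ["academic", "job", "market"],
    ["tough", "job", "market"],
    ["academic", "career"],
]
phrases = find_phrases(toy_sentences)
print(merge_phrases(["academic", "job", "market"], phrases))  # ['academic', 'job_market']
```

Each merged pair becomes an additional vocabulary entry, which is why the vocabulary grows once bigrams are included.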

In [12]:
wv['academic']

array([ 0.14137234,  0.00499952,  0.05469143, -0.03796696, -0.07623484,
        0.1899787 , -0.15662064,  0.00240947,  0.05156227,  0.14496706,
        0.05527455,  0.01662143,  0.03960598, -0.08055399,  0.01805119,
        0.08793066, -0.04509585, -0.06955174,  0.12589985, -0.05598166,
       -0.03167972, -0.05352683,  0.03945947, -0.06907973, -0.084567  ,
        0.01734045,  0.14473717, -0.17430982, -0.23063143,  0.09975462,
       -0.06112049,  0.00801671,  0.03474763,  0.1771201 ,  0.05513798,
       -0.02076746, -0.05675825,  0.1186769 , -0.02672635, -0.00430646,
        0.03073694,  0.06659339, -0.08348006, -0.1133301 ,  0.02486548,
       -0.16794343,  0.03303879, -0.2016477 ,  0.10804214,  0.03066116,
        0.11334135, -0.01516772, -0.00446554,  0.04920239,  0.18825074,
       -0.04130374, -0.12770376,  0.03981686,  0.11411712,  0.05911706,
        0.03632458, -0.15409419,  0.22848092,  0.03293614,  0.05623715,
       -0.1329605 ,  0.01865764,  0.0248922 ,  0.02707001, -0.01

In [13]:
wv.most_similar('academic')

[('non_academic', 0.7226566076278687),
 ('early_career', 0.6944350004196167),
 ('professional', 0.6883115768432617),
 ('permanent', 0.6243336200714111),
 ('listing', 0.6010950207710266),
 ('soft_money', 0.5930420160293579),
 ('experiences', 0.5891789197921753),
 ('indigenous', 0.586772084236145),
 ('job_market', 0.5856984853744507),
 ('industrial', 0.582744836807251)]

Looking at the similarity of words, there are now many bigrams that are similar to `academic`.

**Doc2Vec**

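In addition to the word vectors, `Doc2Vec` learns one vector per document tag. Therefore all documents first need to be tokenized and wrapped in a `TaggedDocument`. In the output below, the tag is the index of the document in the training set, not one of its Stack Exchange labels.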

In [14]:
from academia_tag_recommender.embedded_data import Doc2Tagged

tokens = Doc2Tagged(X_train, tag=True)

token_list = list(tokens)
token_list[0]

TaggedDocument(words=['can', 'find', 'impact', 'factor', 'given', 'journal', 'title', 'journals', 'provide', 'impact', 'factor', 'homepage', 'don', 'can', 'find', 'impact', 'factor'], tags=[0])
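The implementation of `Doc2Tagged` is not shown here. A minimal sketch of the tagging step, assuming each document is tagged with its position in the corpus as in the output above, could look as follows. The local `TaggedDocument` is a stand-in for gensim's class of the same name, which is likewise a named tuple with the fields `words` and `tags`. The helper `tag_documents` is hypothetical.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument, so the example has no dependencies.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tag_documents(tokenized_documents):
    """Attach the corpus index as the only tag of each tokenized document."""
    return [
        TaggedDocument(words=words, tags=[index])
        for index, words in enumerate(tokenized_documents)
    ]


tagged = tag_documents([["impact", "factor"], ["academic", "prizes"]])
print(tagged[1])  # TaggedDocument(words=['academic', 'prizes'], tags=[1])
```

Using the index as the tag makes it possible to map a similar document vector back to its original text by position.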

In [15]:
from gensim.models import Doc2Vec
model = Doc2Vec()
model.build_vocab(token_list)
model.train(token_list, total_examples=model.corpus_count, epochs=model.epochs)
print('Training the model on the documents results in {} words out of {} documents in the vocabulary.'.format(len(model.wv.vocab), len(model.docvecs)))

Training the model on the documents results in 12894 words out of 24812 documents in the vocabulary.


In [16]:
vector = model.infer_vector(['academic'])
vector

array([ 0.01438004, -0.01514934, -0.01238022,  0.04506659, -0.00378427,
        0.00355258, -0.05222021,  0.02039241, -0.0093652 ,  0.01598098,
       -0.01317822, -0.00817952, -0.0455081 ,  0.01327964,  0.02089488,
        0.00648432, -0.0039745 , -0.02725293,  0.00987052, -0.00451604,
       -0.01583723,  0.01115874, -0.04272858, -0.03579678,  0.04536224,
       -0.03366986, -0.00314346, -0.0288485 , -0.04700197, -0.00745611,
        0.01262811, -0.02878716, -0.01851433, -0.00694802, -0.02387403,
        0.01044197, -0.03539443,  0.03339054,  0.00021077,  0.02185567,
        0.02020962, -0.00454351,  0.04668414, -0.05026869, -0.0239145 ,
       -0.02871663,  0.02794001,  0.00298674,  0.00213374,  0.06018677,
       -0.01920959,  0.01200431,  0.01375066,  0.01035046,  0.01845757,
        0.0004169 , -0.05598499, -0.00935971,  0.05087708,  0.08599964,
       -0.01151139, -0.05269783,  0.01051827, -0.02546464,  0.02787663,
        0.04050624,  0.02498539,  0.03593248,  0.02138111, -0.01

With the `Doc2Vec` model one can examine documents similar to the keyword `academic` instead of similar words.

In [17]:
similar_docs = model.docvecs.most_similar([vector])
print(similar_docs[0], token_list[similar_docs[0][0]].words)
print(similar_docs[1], token_list[similar_docs[1][0]].words)

(17937, 0.8849472403526306) ['academic', 'prizes', 'awards', 'vs', 'academic', 'achievements', 'phd', 'scholarship', 'application', 'forms', 'mean', 'details', 'academic', 'prizes', 'awards', 'details', 'academic', 'achievements', 'academic', 'prizes', 'academic', 'achievements', 'things']
(2684, 0.8624860644340515) ['applying', 'masters', 'degree', 'academic', 'integrity', 'violation', 'record', 'junior', 'college', 'student', 'goes', 'top', 'cs', 'school', 'will', 'graduating', 'year', 'early', 'starting', 'think', 'whether', 'want', 'get', 'master', 'degree', 'however', 'egregious', 'academic', 'integrity', 'violation', 'first', 'university', 'expelled', 'top', 'cs', 'school', 'first', 'semester', 'since', 'learned', 'incident', 'transferred', 'current', 'college', 'studying', 'years', 'decide', 'get', 'masters', 'degree', 'either', 'plan', 'pursuing', 'current', 'school', 'believe', 'will', 'accepted', 'due', 'academic', 'performance', 'relationships', 'professors', 'fact', 'depart
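The Doc2Vec description quoted in the Embedding section contrasts doc-vectors with simple averaging of word vectors. For comparison, that averaging baseline can be sketched as follows. The helper `average_vector` and the two-dimensional `toy_word_vectors` are assumptions for illustration and are not taken from this notebook's models.

```python
def average_vector(tokens, word_vectors):
    """Average the vectors of all tokens that are in the vocabulary.

    Returns None if no token is known.
    """
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return None
    dimension = len(known[0])
    return [sum(vector[i] for vector in known) / len(known) for i in range(dimension)]


# Made-up word vectors, for illustration only.
toy_word_vectors = {
    "academic": [1.0, 0.0],
    "prizes": [0.0, 1.0],
    "awards": [1.0, 1.0],
}
doc_vector = average_vector(["academic", "prizes", "unknown"], toy_word_vectors)
print(doc_vector)  # [0.5, 0.5]
```

Unlike this baseline, `Doc2Vec` trains the document vector jointly with the word vectors rather than deriving it afterwards.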

**FastText**

In [18]:
from academia_tag_recommender.embedded_data import Word2Tok
sentences = Word2Tok(X_train)

print(list(sentences)[0])

['can', 'find', 'impact', 'factor', 'given', 'journal', 'title', 'journals', 'provide', 'impact', 'factor', 'homepage', 'don', 'can', 'find', 'impact', 'factor']


In [19]:
from gensim.models import FastText
model = FastText(window=3, min_count=2)
model.build_vocab(sentences=sentences)
model.train(sentences=sentences, total_examples=model.corpus_count, epochs=20)
wv = model.wv
del model
wv.init_sims(replace=True)
print('Training the model on the documents results in {} words in the vocabulary.'.format(len(wv.vocab)))

Training the model on the documents results in 22407 words in the vocabulary.


In [20]:
wv['academic']

array([-0.07453362,  0.06943832,  0.07596593,  0.01869199, -0.01562418,
        0.00698225,  0.09509411,  0.08842903, -0.14629164,  0.00929409,
        0.05402608, -0.04557607, -0.05447316, -0.03358596,  0.00929349,
       -0.12571113,  0.04440993,  0.00276269,  0.16809362,  0.06980025,
       -0.15784076, -0.04907261,  0.04472274,  0.02718009,  0.01598027,
       -0.12894505,  0.04531686, -0.00085054,  0.16506658, -0.22018442,
       -0.11810841, -0.05850428,  0.0758732 , -0.02374969, -0.11128239,
       -0.05519569,  0.12765019, -0.10749408,  0.11275966,  0.14490195,
        0.27204397,  0.08476449, -0.06741492, -0.04653229, -0.0737948 ,
       -0.05993982,  0.0353999 , -0.06280465,  0.0188218 , -0.04606069,
        0.01873511, -0.03340943,  0.05184827,  0.0497258 , -0.05961163,
        0.17493825, -0.15894082,  0.04596837,  0.02607274,  0.12520662,
        0.02846302,  0.03607515,  0.07605084, -0.05812393, -0.07155358,
       -0.16323806, -0.09102985, -0.03477538, -0.11656777,  0.13

In [21]:
wv.most_similar('academic')

[('unacademic', 0.9385606050491333),
 ('nonacademic', 0.9301602840423584),
 ('academical', 0.8922649621963501),
 ('academy', 0.8589457273483276),
 ('academe', 0.8575608134269714),
 ('academies', 0.8407630920410156),
 ('academician', 0.8213838338851929),
 ('acad', 0.8002569675445557),
 ('academically', 0.7822352647781372),
 ('academiase', 0.778761625289917)]

Since `FastText` builds word vectors from character n-grams, the most similar words are now mostly words that share a large part of their spelling with `academic`.
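The effect can be illustrated by extracting the character n-grams directly. The sketch below wraps each word in the boundary markers `<` and `>` and collects all n-grams of length 3 to 6, which are gensim's default `min_n` and `max_n` values. The helpers `char_ngrams` and `ngram_overlap` are hypothetical, and the hashing of n-grams into buckets that FastText performs is omitted.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word, including boundary markers."""
    wrapped = "<" + word + ">"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def ngram_overlap(first, second):
    """Jaccard overlap of the character n-gram sets of two words."""
    grams_first, grams_second = char_ngrams(first), char_ngrams(second)
    return len(grams_first & grams_second) / len(grams_first | grams_second)


print(sorted(char_ngrams("cat", min_n=3, max_n=3)))  # ['<ca', 'at>', 'cat']
print(ngram_overlap("academic", "unacademic") > ngram_overlap("academic", "homepage"))  # True
```

Words with a high n-gram overlap share many of the component vectors that are summed into their word vectors, which pulls them close together in the vector space.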

**Combining FastText with multiword n-grams**

`FastText` can be extended with bigrams too.

In [22]:
from academia_tag_recommender.embedded_data import Word2Tok
sentences = Word2Tok(X_train)

print(list(sentences)[0])

['can', 'find', 'impact', 'factor', 'given', 'journal', 'title', 'journals', 'provide', 'impact', 'factor', 'homepage', 'don', 'can', 'find', 'impact', 'factor']


In [23]:
from gensim.models.phrases import Phrases

bigram_transformer = Phrases(sentences, min_count=1)

In [24]:
from gensim.models import FastText
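model = FastText(window=3, min_count=2)
# The vocabulary is built from the bigram-merged token stream ...
model.build_vocab(sentences=bigram_transformer[sentences])
# ... while training iterates over the plain unigram `sentences`, so merged tokens
# such as `non_academic` are not seen as tokens during training.
# Pass `bigram_transformer[sentences]` here to train on the merged tokens as well.
model.train(sentences=sentences, total_examples=model.corpus_count, epochs=20)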
wv = model.wv
del model
wv.init_sims(replace=True)
print('Training the model on the documents results in {} words in the vocabulary.'.format(len(wv.vocab)))

Training the model on the documents results in 58252 words in the vocabulary.


In [25]:
wv['academic']

array([ 0.08802791, -0.06200763,  0.1152458 ,  0.06426968,  0.09748365,
        0.05013251, -0.1129626 , -0.09028978,  0.04577056,  0.0607812 ,
        0.15716203, -0.03857641, -0.09675916,  0.03481515,  0.22809464,
        0.08557537,  0.0505088 ,  0.161692  , -0.12159577, -0.01974001,
        0.0351983 ,  0.01888811, -0.02160684,  0.14407273, -0.1854414 ,
        0.00957741, -0.11276507, -0.02271997,  0.14445224, -0.00840628,
       -0.03509223,  0.10303397, -0.00474818, -0.02849258, -0.00179971,
        0.03541456, -0.03873183,  0.16289444,  0.09119155,  0.05133548,
       -0.01153626,  0.10501225,  0.08953188, -0.18804191, -0.07482777,
        0.0823573 , -0.21999559, -0.09153958,  0.10757563,  0.04247302,
       -0.03051135, -0.05816598, -0.01022058,  0.06951102, -0.05417137,
       -0.00557057,  0.08584974, -0.00350851, -0.01247351, -0.11682458,
        0.18128015, -0.09628124, -0.06755148,  0.08739156, -0.12089624,
       -0.10498443, -0.13962221, -0.04237365,  0.06679508, -0.13

In [26]:
wv.most_similar('academic')

[('vp_academic', 0.9920320510864258),
 ('non_academic', 0.9616369605064392),
 ('climb_academic', 0.960375189781189),
 ('academic_cvs', 0.9587602615356445),
 ('climbs_academic', 0.9541186690330505),
 ('unacademic', 0.952389121055603),
 ('nonacademic', 0.9518829584121704),
 ('lambert_academic', 0.9488770961761475),
 ('bielefeld_academic', 0.9447492361068726),
 ('academic_theft', 0.9447307586669922)]

Words similar to `academic` now include bigrams such as `non_academic` alongside unigrams such as `nonacademic`.