# Using NTEE for Text Embedding


## Setup

<p>The following commands install our code and its required libraries:</p>
<pre><code>% pip install Cython
% pip install -r requirements.txt
% python setup.py develop
</code></pre>

## Training the embeddings

The following embeddings, produced by the training phase described below, are available for download: <br>
<ul>
<li><a href="https://s3-ap-northeast-1.amazonaws.com/ntee/pub/models/ntee_300_sentence.joblib.gz" rel="nofollow">ntee_300_sentence.joblib.gz</a> (300d vectors, 1.8GB, trained on Wikipedia sentences)</li>
<li><a href="https://s3-ap-northeast-1.amazonaws.com/ntee/pub/models/ntee_300_paragraph.joblib.gz" rel="nofollow">ntee_300_paragraph.joblib.gz</a> (300d vectors, 1.8GB, trained on Wikipedia paragraphs)</li>
</ul>
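
The files above are gzip archives, while the prediction cells later in this notebook load `ntee_300_sentence.joblib` without the `.gz` suffix. The sketch below assumes the archive therefore needs to be decompressed first. It uses only the Python standard library, and the helper name `gunzip` is hypothetical, not part of NTEE.

```python
import gzip
import shutil


def gunzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a .gz file to dst_path, streaming in chunks.

    Streaming keeps memory use low, which matters for a 1.8GB archive.
    """
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Example (assumes the archive was downloaded to the working directory):
# gunzip('ntee_300_sentence.joblib.gz', 'ntee_300_sentence.joblib')
```

The command-line equivalent would be `gunzip ntee_300_sentence.joblib.gz`.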

### How to train

NOTE: Training the model with the commands below requires a lot of disk space and time.
<p><strong>(1) Building Databases</strong></p>
First, we need to download several files and build the databases from them. <br>

<div class="highlight highlight-source-shell"><pre>% ntee download_dbpedia_abstract_files <span class="pl-c1">.</span>
% wget https://s3-ap-northeast-1.amazonaws.com/ntee/pub/enwiki-20160601-pages-articles.xml.bz2
% ntee build_abstract_db <span class="pl-c1">.</span> dbpedia_abstract.db
% ntee build_entity_db enwiki-20160601-pages-articles.xml.bz2 entity_db
% ntee build_vocab dbpedia_abstract.db entity_db vocab</pre></div>


In [None]:
!ntee download_dbpedia_abstract_files .

In [None]:
!wget https://s3-ap-northeast-1.amazonaws.com/ntee/pub/enwiki-20160601-pages-articles.xml.bz2

In [None]:
!ntee build_abstract_db . dbpedia_abstract.db

In [None]:
!ntee build_entity_db enwiki-20160601-pages-articles.xml.bz2 entity_db

In [None]:
!ntee build_vocab dbpedia_abstract.db entity_db vocab

<p><strong>(2) Building Pre-trained Embeddings</strong></p>
<p>The pre-trained embeddings can be built using the following two commands:</p>
<div class="highlight highlight-source-shell"><pre>% ntee word2vec generate_corpus enwiki-20160601-pages-articles.xml.bz2 entity_db word2vec_corpus.txt.bz2
% ntee word2vec train word2vec_corpus.txt.bz2 word2vec_sg_300.joblib</pre></div>

In [None]:
!ntee word2vec generate_corpus enwiki-20160601-pages-articles.xml.bz2 entity_db word2vec_corpus.txt.bz2

In [None]:
!ntee word2vec train word2vec_corpus.txt.bz2 word2vec_sg_300.joblib

<p><strong>(3) Training NTEE</strong></p>
<p>Now we can train the NTEE embeddings.
Training takes approximately six days on an NVIDIA K80 GPU.</p>
<div class="highlight highlight-source-shell"><pre>% ntee train_model dbpedia_abstract.db entity_db vocab --word2vec=word2vec_sg_300.joblib ntee_paragraph.joblib</pre></div>

In [None]:
!ntee train_model dbpedia_abstract.db entity_db vocab --word2vec=word2vec_sg_300.joblib ntee_paragraph.joblib
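
Because the three steps take days in total, it can help to run them from a single script that stops at the first failure. The following is a minimal sketch, not part of NTEE: the command strings are copied from the steps above, while `PIPELINE` and `run_pipeline` are hypothetical names. It defaults to a dry run that only prints the plan.

```python
import shlex
import subprocess

# Commands copied from steps (1) to (3) above, in order.
PIPELINE = [
    'ntee download_dbpedia_abstract_files .',
    'wget https://s3-ap-northeast-1.amazonaws.com/ntee/pub/enwiki-20160601-pages-articles.xml.bz2',
    'ntee build_abstract_db . dbpedia_abstract.db',
    'ntee build_entity_db enwiki-20160601-pages-articles.xml.bz2 entity_db',
    'ntee build_vocab dbpedia_abstract.db entity_db vocab',
    'ntee word2vec generate_corpus enwiki-20160601-pages-articles.xml.bz2 entity_db word2vec_corpus.txt.bz2',
    'ntee word2vec train word2vec_corpus.txt.bz2 word2vec_sg_300.joblib',
    'ntee train_model dbpedia_abstract.db entity_db vocab --word2vec=word2vec_sg_300.joblib ntee_paragraph.joblib',
]


def run_pipeline(commands, dry_run=True):
    """Run commands in order and stop at the first failure.

    With dry_run=True nothing is executed; the parsed argument lists
    are returned so the plan can be checked first.
    """
    planned = [shlex.split(cmd) for cmd in commands]
    if not dry_run:
        for args in planned:
            # check_call raises CalledProcessError on a non-zero exit code
            subprocess.check_call(args)
    return planned


plan = run_pipeline(PIPELINE, dry_run=True)
for args in plan:
    print(' '.join(args))
```

Calling `run_pipeline(PIPELINE, dry_run=False)` would execute the commands for real.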

## Predictions

The embeddings are saved in .joblib format, with separate models for sentences and paragraphs. These trained models can now be used to obtain the embedding of a word, an entity, or a piece of text (a sentence or a paragraph). The cells below show how.

In [2]:
from ntee.model_reader import ModelReader
import numpy as np
model = ModelReader('ntee_300_sentence.joblib')
model

<ntee.model_reader.ModelReader at 0x1425b0110>

In [3]:
x = model.get_word_vector(u'what')
x = np.array(x)
print(x)

[-0.21120638  0.04047474 -0.16536635  0.1001524  -0.109033    0.3818673
 -0.08441699 -0.03870775  0.20319948 -0.37235212  0.09496183  0.06271978
  0.00835711 -0.01277019  0.3202388  -0.0359035  -0.14234203  0.13743044
  0.2572372  -0.04544935  0.06145024  0.04705551 -0.19117215  0.07081793
 -0.08167226  0.0929535  -0.05513426 -0.27027178  0.03960169 -0.1607706
 -0.23215023 -0.08492615 -0.12072089  0.09866042 -0.04247769  0.21680275
  0.27483287 -0.00169709 -0.00208295  0.00609226 -0.10546565  0.07659734
  0.0976496  -0.15255755  0.04180073 -0.12609237 -0.0305967   0.05354677
 -0.01734309 -0.03181839 -0.10106437 -0.05512101 -0.04015365  0.2282004
  0.05182777  0.30967396  0.02015228 -0.0815163   0.0373378   0.04531371
 -0.03674673  0.08414508 -0.12649088 -0.00771928  0.05199901 -0.07265674
 -0.31960648 -0.04007003  0.09335911  0.14678808 -0.10527917 -0.08316969
  0.12356181  0.2481978  -0.15837465  0.10614112  0.1672042   0.04376265
 -0.14222287  0.16230647 -0.20489246  0.19911943 -0.00

In [4]:
y = model.get_entity_vector(u'Apple Inc.')
y = np.array(y)
print(y)

[-2.48675242e-01 -1.21547781e-01 -1.57411948e-01 -1.69242024e-01
  3.46656404e-02 -2.03787461e-02  6.19790815e-02 -5.87919831e-01
 -1.26004443e-01 -4.76014078e-01 -2.54268646e-02  1.14136867e-01
 -1.58809960e-01 -3.69221091e-01 -1.91863775e-01 -1.44232929e-01
 -1.21458106e-01 -1.41607314e-01 -2.23456562e-01 -7.70449638e-02
  4.63574156e-02 -2.56028503e-01 -8.69428515e-02 -4.68864381e-01
 -7.16716588e-01 -5.88682711e-01 -3.35128069e-01  1.22047782e-01
 -3.10974658e-01 -2.66667634e-01 -2.84564763e-01 -8.46317112e-02
  2.44741783e-01  2.15186719e-02  9.29940999e-01 -7.97939599e-01
  1.20348394e+00  7.51004100e-01  3.11858416e-01 -6.20878041e-01
 -2.15197448e-03  3.78250033e-01  4.27506149e-01  1.67753562e-01
 -6.36079371e-01  8.64252090e-01 -7.36296475e-01 -5.69970943e-02
  3.79695743e-01 -2.35817969e-01 -4.26611379e-02 -7.36753523e-01
  9.14451957e-01  2.65661240e-01  2.63470858e-01 -2.42097259e-01
 -2.07495168e-02 -4.69534397e-01  5.55198669e-01  2.48489618e-01
 -5.94658375e-01 -2.31407

In [5]:
print(np.array(model.get_text_vector(u'How are you?')))

[ 0.01716073 -0.00861112  0.12271859 -0.02660117 -0.01703857  0.07788829
  0.02764174 -0.0660186  -0.09395936  0.03216327 -0.04406688  0.03708095
 -0.02140836 -0.03929055  0.02133072 -0.06911854  0.04313761  0.06295715
  0.03437731  0.01601282  0.01020962 -0.03772455 -0.03748902  0.00240602
  0.08563309  0.03772465  0.01692889 -0.1403138   0.0286436   0.01942853
  0.00315892  0.01492574  0.06441022  0.00846453  0.02011049  0.08688619
 -0.02787115 -0.01842279 -0.08625918 -0.00829534 -0.05809093  0.05077755
  0.08411048  0.02548055 -0.00868656  0.16125798  0.07904027 -0.01314069
 -0.03315594  0.0445268  -0.04644902  0.02630384 -0.00975984  0.03547678
  0.11696669  0.080023    0.03732325 -0.04561122 -0.05601991 -0.03025973
  0.0336674   0.03761502 -0.11781113  0.05360794 -0.06688837  0.06622493
  0.0177456  -0.05817244 -0.12104942  0.01105875  0.06662479  0.08832858
  0.05577114 -0.01847336  0.04676725  0.07071267  0.07134936  0.05341795
  0.01721229  0.02005112 -0.01573792 -0.05332713 -0
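
The word, entity, and text vectors returned above all have the same dimensionality (300 for this model), so a natural next step is to compare them. The sketch below uses cosine similarity, a common choice for embeddings. Whether NTEE itself scores pairs this way is an assumption here. The vectors are small stand-ins so the block runs without the model file.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return float(np.dot(a, b) / denom)


# Stand-in vectors; with the model loaded one would instead pass e.g.
# model.get_text_vector(u'How are you?') and
# model.get_entity_vector(u'Apple Inc.')
text_vec = np.array([1.0, 2.0, 3.0])
entity_vec = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(text_vec, entity_vec))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.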

In [6]:
import joblib
model_obj = joblib.load('ntee_300_sentence.joblib')
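# Look the matrix up by key rather than by position in values(),
# since dict ordering is not guaranteed
print(model_obj['entity_embedding'])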

[[-0.36159527 -0.11532898 -0.15386112 ...  0.58709633 -0.7524103
   0.09077099]
 [-0.55682015 -0.12421971 -0.21646227 ...  0.50956655 -0.5558761
  -0.35185504]
 [-0.30983728 -0.33736834  0.16777039 ...  0.6361284  -0.8122711
  -0.18458259]
 ...
 [ 0.08482471  0.0817304  -0.46426383 ...  0.4461696  -0.66930896
   0.09255198]
 [ 0.16807598  0.0129167  -0.37892318 ...  0.4783491  -0.70507866
   0.10885348]
 [ 0.22759005  0.14634937 -0.58914167 ...  0.41399944 -0.6932886
   0.04753299]]


In [7]:
print(model_obj.keys())

['word_embedding', 'vocab', 'b', 'W', 'entity_embedding']
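
To see what each entry holds without printing whole matrices, one can summarize the loaded object by shape. This is a minimal sketch assuming the object is a plain dict, as the output above suggests. `describe_arrays` is a hypothetical helper, and the stand-in dict reuses the key names shown above with made-up sizes.

```python
import numpy as np


def describe_arrays(obj):
    """Map each key to the shape of its value, or to the type name
    when the value is not a NumPy array."""
    summary = {}
    for key, value in obj.items():
        if isinstance(value, np.ndarray):
            summary[key] = value.shape
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in with the same keys as the output above; all sizes are made up.
fake_model = {
    'word_embedding': np.zeros((5, 300)),
    'entity_embedding': np.zeros((7, 300)),
    'W': np.zeros((300, 300)),
    'b': np.zeros(300),
    'vocab': object(),
}
for key, info in sorted(describe_arrays(fake_model).items()):
    print(key, info)
```

With the real model, `describe_arrays(model_obj)` would report the actual vocabulary sizes and embedding dimensions.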


### Evaluation on three benchmark datasets

Here we evaluate the model on the three benchmark datasets listed below, using three metrics: Pearson correlation, Spearman correlation, and mean squared error.
 - The SICK dataset 
 - The SEMEVAL dataset
 - The STS 2014 dataset

In [8]:
!ntee evaluate_sick ntee_300_sentence.joblib SICK.txt

Using TensorFlow backend.
0.7144 (pearson) 0.6046 (spearman) 8.6439 (Mean Squared Error)
Exception TypeError: TypeError("'NoneType' object is not callable",) in <bound method _NonClosingTextIOWrapper.__del__ of <_io.TextIOWrapper encoding='UTF-8'>> ignored


In [9]:
!ntee evaluate_semeval ntee_300_sentence.joblib sts-test.txt

Using TensorFlow backend.
0.6949 (pearson) 0.6716 (spearman) 5.6852 (Mean Squared Error)
Exception TypeError: TypeError("'NoneType' object is not callable",) in <bound method _NonClosingTextIOWrapper.__del__ of <_io.TextIOWrapper encoding='UTF-8'>> ignored


In [10]:
!ntee evaluate_sts ntee_300_sentence.joblib sts-en-test-gs-2014

Using TensorFlow backend.
OnWN: 0.7204 (pearson) 0.7443 (spearman) 7.2890 (Mean Squared Error)
deft-forum: 0.5643 (pearson) 0.5491 (spearman) 5.4748 (Mean Squared Error)
deft-news: 0.7436 (pearson) 0.6775 (spearman) 6.4844 (Mean Squared Error)
headlines: 0.6876 (pearson) 0.6246 (spearman) 5.8117 (Mean Squared Error)
images: 0.8204 (pearson) 0.7671 (spearman) 5.7951 (Mean Squared Error)
tweet-news: 0.7467 (pearson) 0.6592 (spearman) 7.4580 (Mean Squared Error)
Exception TypeError: TypeError("'NoneType' object is not callable",) in <bound method _NonClosingTextIOWrapper.__del__ of <_io.TextIOWrapper encoding='UTF-8'>> ignored
