# Training Supervised Embedding
We have seen many embedding techniques so far, and training embeddings in an unsupervised way on a domain-specific corpus is the usual approach. What is learned is then passed down to the supervised task as a dense representation of words or sentences. In contrast to all the techniques covered previously, InferSent is a supervised method for learning sentence-level embeddings. InferSent was developed by the Facebook AI Research team and published in the paper "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data". Conneau et al. noted that models trained on ImageNet in a supervised way do a great job on downstream tasks. Building on this observation, they trained a sentence encoder in a supervised manner on natural language inference data; the result is known as InferSent.



# Installation

In [1]:
import numpy as np
from random import randint
import sys
import torch
import nltk
import os
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/sunil/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Pre-requisite
1. Cloning InferSent and adding it to the system path
2. Downloading required dataset by InferSent
3. Downloading GloVe and FastText vectors
4. Downloading InferSent pretrained models

In [None]:
# Cloning the git repository
!git clone https://github.com/facebookresearch/InferSent.git
# git clone already creates InferSent/, so calling os.mkdir on it would fail;
# only append the cloned directory to the Python path
sys.path.append("InferSent/")

In [None]:
# Downloading the datasets required by InferSent
!bash InferSent/dataset/get_data.bash

In [None]:
# Downloading GloVe and FastText vectors
!mkdir InferSent/dataset/GloVe
!curl -Lo InferSent/dataset/GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
!unzip InferSent/dataset/GloVe/glove.840B.300d.zip -d InferSent/dataset/GloVe/
!mkdir InferSent/dataset/fastText
!curl -Lo InferSent/dataset/fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
!unzip InferSent/dataset/fastText/crawl-300d-2M.vec.zip -d InferSent/dataset/fastText/

In [None]:
# Downloading InferSent pretrained models
!mkdir encoder
!curl -Lo encoder/infersent1.pickle https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
!curl -Lo encoder/infersent2.pickle https://dl.fbaipublicfiles.com/infersent/infersent2.pkl


# Fine Tuning
1. Loading the pretrained InferSent model
2. Providing FastText vectors to the model
3. Building the vocabulary
4. Fine-tuning the model on a given small corpus

In [None]:
from models import InferSent
V = 2 
MODEL_PATH = 'encoder/infersent2.pickle' 
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

In [None]:
W2V_PATH = 'InferSent/dataset/fastText/crawl-300d-2M-subword.vec'
infersent.set_w2v_path(W2V_PATH)

In [None]:
infersent.build_vocab_k_words(K=100000)

In [None]:
sentences = ['Everyone really likes the newest benefits',
 'The Government Executive articles housed on the website are not able to be searched .',
 'I like him for the most part , but would still enjoy seeing someone beat him .',
 'My favorite restaurants are always at least a hundred miles away from my house .',
 'I know exactly .']

In [None]:
with open("dataset_for_infersent.txt") as f:
    sentences = f.read().splitlines()[:10000]

In [None]:
embeddings = infersent.encode(sentences, bsize=64, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(embeddings)))

# Inference
A function to calculate the cosine similarity between two sentence embeddings

In [None]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
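
As a quick sanity check of `cosine`, the sketch below applies the same formula to small hand-made vectors (toy values, not InferSent embeddings). Vectors pointing in the same direction should score 1, orthogonal vectors 0, and opposite vectors -1.

```python
import numpy as np

def cosine(u, v):
    # Same formula as above: dot product divided by the product of the norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors chosen for illustration only
a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                       # same direction as a, different length
c = np.array([3.0, 0.0, -1.0])    # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

sim_parallel = cosine(a, b)
sim_orthogonal = cosine(a, c)
sim_opposite = cosine(a, -a)
print(sim_parallel, sim_orthogonal, sim_opposite)
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.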

Calculating the cosine similarity between two sentences

In [None]:
cosine(infersent.encode(['the cat eats.'])[0], infersent.encode(['the cat drinks.'])[0])
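
The same measure extends from a single pair to a whole corpus. The sketch below ranks the rows of an embedding matrix by cosine similarity to a query vector. The helper name `rank_by_cosine` and the random stand-in vectors are assumptions for illustration; with real data, the matrix would be the array returned by `infersent.encode(sentences)` and the query the embedding of a single sentence.

```python
import numpy as np

def rank_by_cosine(query, matrix, top_k=3):
    """Return (indices, scores) of the top_k rows of `matrix` most similar to `query`."""
    # Normalise once so one matrix-vector product yields every cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]

# Stand-in data: 5 random "sentence embeddings" of dimension 8
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 8))
query = corpus[2] + 0.01 * rng.normal(size=8)   # slightly perturbed copy of row 2

indices, scores = rank_by_cosine(query, corpus)
print(indices, scores)
```

Normalising the matrix up front avoids recomputing norms for every pair, which matters once the corpus holds thousands of sentences.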

InferSent also provides the importance of each token in the sentence, as shown below:

![](figures/InferSent.png)

Figure: Word importance plotted from the vector generated by InferSent

Here the importance of the padding tokens appears high because we have not finished training on a sufficiently large corpus. Once you fine-tune the model on sufficiently large data, the importance of padding and stop words will go down and the importance of the other words will increase.

In [None]:
my_sent = 'Obama is the former president of the US'
_, _ = infersent.visualize(my_sent)
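
To make the idea of word importance concrete: with `pool_type` set to `max`, each dimension of the sentence embedding is taken from whichever token has the largest hidden-state value in that dimension, so a token can be scored by the share of dimensions it supplies. The sketch below reproduces that counting on random stand-in hidden states. The function name `token_importance`, the toy tokens, and the shapes are assumptions, and the actual logic inside `visualize` may differ in detail.

```python
import numpy as np

def token_importance(hidden_states):
    """hidden_states: array (num_tokens, dim) of per-token encoder outputs.
    Returns the max-pooled sentence vector and, for each token, the
    percentage of dimensions in which that token holds the maximum."""
    sentence_vector = hidden_states.max(axis=0)   # max pooling over tokens
    winners = hidden_states.argmax(axis=0)        # winning token index per dimension
    counts = np.bincount(winners, minlength=hidden_states.shape[0])
    return sentence_vector, 100.0 * counts / counts.sum()

# Toy tokens, including assumed boundary markers at both ends
tokens = ['<p>', 'the', 'cat', 'drinks', '</p>']
rng = np.random.default_rng(0)
states = rng.normal(size=(len(tokens), 16))       # stand-in for encoder outputs
states[2] += 1.0                                  # artificially boost 'cat'

vector, importance = token_importance(states)
for tok, pct in zip(tokens, importance):
    print(f'{tok:>8s}: {pct:5.1f}%')
```

Under this view, a high score for boundary or padding tokens simply means they win the maximum in many dimensions.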

In [None]:
embeddings = infersent.encode(["The cat is drinking milk."], tokenize=True)

In [None]:
print("Shape of the embedding : ", embeddings.shape)
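
The encoder is bidirectional, so with `enc_lstm_dim` set to 2048 the forward and backward states are concatenated and each sentence vector should have 2 × 2048 = 4096 dimensions. The sketch below turns that expectation into a small check. The helper `check_embeddings` is hypothetical and the zero array is only a stand-in for the output of `infersent.encode`.

```python
import numpy as np

# Value copied from params_model above
enc_lstm_dim = 2048
expected_dim = 2 * enc_lstm_dim   # forward + backward LSTM states concatenated

def check_embeddings(embeddings, num_sentences, dim=expected_dim):
    """Hypothetical helper: validate the array returned by an encode call."""
    embeddings = np.asarray(embeddings)
    ok_shape = embeddings.shape == (num_sentences, dim)
    ok_values = bool(np.isfinite(embeddings).all())   # no NaN or inf entries
    return ok_shape and ok_values

# Stand-in array with the shape expected for one encoded sentence
fake_embeddings = np.zeros((1, expected_dim), dtype=np.float32)
print(expected_dim, check_embeddings(fake_embeddings, num_sentences=1))
```

Running such a check right after encoding catches shape mismatches before the vectors reach a downstream model.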