# Smooth Inverse Frequency Weighting for Sentence Embeddings
## CHRIMNI Walid

Explanation and implementation of the paper [A Simple but Tough-to-Beat Baseline for Sentence Embeddings](https://openreview.net/pdf?id=SyK00v5xx).

My report is available in this repo.

## A Classification Performance Experiment

In this notebook, I implement the Smooth Inverse Frequency (SIF) sentence embedding used in the paper and compare it to two other sentence embeddings: an unweighted sum (here, the mean) of word embeddings and a BERT embedding obtained with the `sentence-transformers` library. Note that I only implement the SIF weighting, not the common component removal.

I then compare the three embeddings using a very simple model on a sentiment analysis task over the IMDb-derived NLTK Movie Reviews dataset.

#### Loading the NLTK Movie Reviews Dataset

In [None]:
!pip install gensim==4.1.2


In [None]:
import nltk
from nltk.corpus import movie_reviews
import random

In [None]:
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
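# Keep only the first sentence of each review, paired with its label ('pos' or 'neg')
sentences = [(movie_reviews.sents(fileid)[0], category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Total number of tokens in the corpus, used later to turn counts into probabilities
corpus_size = len(movie_reviews.words())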
corpus_size

1583820

The Movie Reviews dataset contains a total of `1583820` tokens (words and punctuation marks). Because of computing limitations, and because roughly 1.5 million tokens cover a decent share of the language used in this context, we use the corpus derived from the dataset directly to compute the SIF weights.

Let $s(w)$ be the SIF weight of the word $w$, given by the expression
$s(w) = \frac{a}{a+p(w)}$, where $p(w)$ is the estimated probability (relative frequency) of $w$ in the corpus and $a$ is a smoothing hyperparameter
(following the paper *A Simple but Tough-to-Beat Baseline for Sentence Embeddings* by Arora et al.).
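
To get a feel for the formula, here is a minimal sketch that evaluates $s(w)$ for a very frequent word and for a rarer one. The helper name `sif_weight` is hypothetical (it is not part of the notebook), $a = 10^{-3}$ is assumed, and the two probabilities are the values of $p(w)$ that this notebook obtains for 'the' and 'computer' on the Movie Reviews corpus.

```python
def sif_weight(p_w: float, a: float = 1e-3) -> float:
    """Return the SIF weight a / (a + p(w)) for a word of probability p_w."""
    return a / (a + p_w)


# p(w) values taken from the Movie Reviews corpus statistics in this notebook
p_the = 0.048319253450518365         # very frequent word
p_computer = 0.00017236807212940865  # much rarer word

w_the = sif_weight(p_the)
w_computer = sif_weight(p_computer)
print(f"s('the')      = {w_the:.4f}")
print(f"s('computer') = {w_computer:.4f}")
```

The frequent word receives a weight close to 0, while the rarer word keeps a weight close to 1. A smaller $a$ would penalize frequent words even more strongly.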

In [None]:
a_hp = 1e-3  # smoothing hyperparameter a; a good default value, as advised in the paper

### Computing $p(w)$ for each word contained in the Movie Reviews Corpus

In [None]:
corpus_word_freq = nltk.FreqDist(word.lower() for word in movie_reviews.words())

In [None]:
corpus_word_freq = {k:v/corpus_size for (k,v) in corpus_word_freq.items()}
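
The two cells above amount to counting each token and dividing by the total number of tokens. A minimal standalone sketch of the same estimate, using `collections.Counter` on a made-up token list (the toy data and the name `unigram_probabilities` are illustrative assumptions, not part of the notebook):

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Estimate p(w) as count(w) / total number of tokens (case-insensitive)."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy token list, punctuation included, as in the NLTK corpus
toy_tokens = ["The", "plot", "of", "the", "film", "is", "the", "problem", "."]
p = unigram_probabilities(toy_tokens)
print(p["the"])   # 3 of the 9 tokens
print(p["plot"])  # 1 of the 9 tokens
```

By construction the estimated probabilities sum to 1 over the vocabulary.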

In [None]:
corpus_word_freq

{'plot': 0.0009552853228270889,
 ':': 0.0019206728037276964,
 'two': 0.0012065765049058606,
 'teen': 9.533911681883042e-05,
 'couples': 1.7047391749062393e-05,
 'go': 0.0007027313709891276,
 'to': 0.02016453889962243,
 'a': 0.024059552221843392,
 'church': 4.356555669204834e-05,
 'party': 0.00011554343296586733,
 ',': 0.049069338687477114,
 'drink': 2.020431614703691e-05,
 'and': 0.022462148476468286,
 'then': 0.0008990920685431425,
 'drive': 6.629541235746487e-05,
 '.': 0.041593110328193864,
 'they': 0.003046432044045409,
 'get': 0.0012305691303304668,
 'into': 0.0016561225391774318,
 'an': 0.0036266747483931256,
 'accident': 6.566402747786996e-05,
 'one': 0.003694864315389375,
 'of': 0.02154474624641689,
 'the': 0.048319253450518365,
 'guys': 0.00016921114773143413,
 'dies': 6.566402747786996e-05,
 'but': 0.005451377050422397,
 'his': 0.006053086840676339,
 'girlfriend': 0.00013764190375168894,
 'continues': 5.55618694043515e-05,
 'see': 0.0011042921544114862,
 'him': 0.0016624363879

In [None]:
corpus_word_freq['the'] #high p(w) for a stopword

0.048319253450518365

In [None]:
corpus_word_freq['computer'] #lower p(w) for a word carrying meaning in a more specific context

0.00017236807212940865

### Computing $s(w)$ for each word contained in the Movie Reviews Corpus

In [None]:
corpus_sif_weighting = {k:a_hp/(a_hp+p) for (k,p) in corpus_word_freq.items()}

In [None]:
corpus_sif_weighting

{'plot': 0.5114343100341641,
 ':': 0.342386863302074,
 'two': 0.45319072226895807,
 'teen': 0.9129592695495786,
 'couples': 0.9832383506537045,
 'go': 0.5872916991122877,
 'to': 0.047248844151187235,
 'a': 0.03990494287955954,
 'church': 0.9582531673140451,
 'party': 0.8964240839474311,
 ',': 0.019972302934572427,
 'drink': 0.980195813890161,
 'and': 0.042621842624641346,
 'then': 0.5265674142734605,
 'drive': 0.9378264113404625,
 '.': 0.023477975482294498,
 'they': 0.24713129718107235,
 'get': 0.44831607610917057,
 'into': 0.3764886541378048,
 'an': 0.21613795098678731,
 'accident': 0.938382054958467,
 'one': 0.21299870088302297,
 'of': 0.044356232226784684,
 'the': 0.020276057118408988,
 'guys': 0.8552775107731853,
 'dies': 0.938382054958467,
 'but': 0.1550056665707558,
 'his': 0.1417818924662648,
 'girlfriend': 0.8790112219866579,
 'continues': 0.9473627543635081,
 'see': 0.4752191837542982,
 'him': 0.3755958281358939,
 'in': 0.06766778519188817,
 'her': 0.2593951344782519,
 'life':

In [None]:
corpus_sif_weighting['the'] # low SIF weight for a common stopword

0.020276057118408988

In [None]:
corpus_sif_weighting['computer'] #high SIF for a word carrying meaning in a more specific context

0.8529744401719068

In [None]:
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Raw string so that the backslash in \w+ is kept literally (avoids an invalid-escape warning)
text = r'''Note that if you use RegexpTokenizer option, you lose 
natural language features special to word_tokenize 
like splitting apart contractions. You can naively 
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common

[('note', 1),
 ('use', 1),
 ('regexptokenizer', 1),
 ('option', 1),
 ('lose', 1),
 ('natural', 1),
 ('language', 1),
 ('features', 1),
 ('special', 1),
 ('word', 1),
 ('tokenize', 1),
 ('like', 1),
 ('splitting', 1),
 ('apart', 1),
 ('contractions', 1),
 ('naively', 1),
 ('split', 1),
 ('regex', 1),
 ('without', 1),
 ('need', 1)]

In [None]:
words

['note',
 'use',
 'regexptokenizer',
 'option',
 'lose',
 'natural',
 'language',
 'features',
 'special',
 'word',
 'tokenize',
 'like',
 'splitting',
 'apart',
 'contractions',
 'naively',
 'split',
 'regex',
 'without',
 'need',
 'nltk']

### Training Word2Vec embeddings on the corpus

In [None]:
import gensim

In [None]:
gensim.__version__

'4.1.2'

In [None]:
movie_reviews.sents()

[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]

In [None]:
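# Train 150-dimensional Word2Vec vectors on the tokenized sentences of the corpus.
# min_count=0 keeps every token in the vocabulary, so each word of the corpus
# has a vector available for the sentence embeddings below.
w2v_model = gensim.models.Word2Vec(sentences=movie_reviews.sents(),
                                   vector_size=150,
                                   window=5,
                                   workers=4,
                                   min_count=0,
                                   epochs=8)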

In [None]:
w2v_model.wv['love'].shape

(150,)

# Sentence Embeddings

## Approach 1 : Unweighted Sum of word embeddings

In [None]:
import numpy as np

In [None]:
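def sent_emb_1(sentence: list):
    """Unweighted baseline: mean of the word vectors (sum divided by sentence length)."""
    emb = w2v_model.wv[sentence]  # array of shape (n_tokens, vector_size)
    return np.mean(emb, axis=0)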

In [None]:
sent_emb_1(sentences[0][0])

array([ 0.04841578, -0.23187251, -0.02132036,  0.40737662,  0.06070386,
        0.20279986, -0.00813549,  0.21838196, -0.41697463,  0.21819595,
        0.09217939,  0.40294555, -0.04848393,  0.59762585,  0.0021255 ,
        0.2399333 ,  0.63461375, -0.61282927,  0.10416698,  0.5019918 ,
       -0.07852356,  0.6542678 ,  0.31906894, -0.08222269,  0.3880289 ,
        0.16355842, -0.8003601 , -0.08823899,  0.6252954 , -0.16038217,
        0.35031775,  0.11183237, -0.27829462, -0.09100681,  0.02570677,
        0.42795777,  0.06055289,  0.5669999 , -0.22074568, -0.3855428 ,
       -0.12497441,  0.28189543, -0.2623843 , -0.29549196,  0.03417622,
        0.29309455, -0.48708475,  0.32358903, -0.16807798,  0.0562325 ,
       -0.22501743,  0.07685026, -0.39640513, -0.6732754 ,  0.03780558,
       -0.16952832, -0.04054368, -0.20196743, -0.11393835, -0.51330626,
       -0.3353771 , -0.5406283 ,  0.03023254, -0.19013628,  0.51542807,
        0.17119089, -0.27238953, -0.22630805,  0.30839062, -0.54

## Approach 2 : SIF Weighted Sum of word embeddings

In [None]:
def sent_emb_2(sentence: list, emb_dim: int):
    """SIF-weighted average: each word vector is scaled by s(w) = a / (a + p(w))."""
    n = len(sentence)
    emb = np.zeros(emb_dim)
    for word in sentence:
        emb += w2v_model.wv[word] * corpus_sif_weighting[word]
    return emb / n
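
To make the effect of the weighting concrete, here is a small pure-Python sketch that compares the plain average with the SIF-weighted average on two made-up 2-dimensional "embeddings". Everything in it is an illustrative assumption: the function name `weighted_average`, the toy vectors, and the toy weights. Unlike `sent_emb_2`, it skips words that have no vector and divides by the number of words actually used, a choice made for this sketch only.

```python
def weighted_average(sentence, vectors, weights=None, default_weight=1.0):
    """Average the word vectors of `sentence`, optionally scaled by per-word weights.

    Words missing from `vectors` are skipped; words missing from `weights`
    receive `default_weight` (both are assumptions of this sketch).
    """
    known = [word for word in sentence if word in vectors]
    if not known:
        raise ValueError("no word of the sentence has a vector")
    dim = len(vectors[known[0]])
    total = [0.0] * dim
    for word in known:
        weight = 1.0 if weights is None else weights.get(word, default_weight)
        for i, value in enumerate(vectors[word]):
            total[i] += weight * value
    return [value / len(known) for value in total]


# Toy 2-d "embeddings" and SIF weights (made-up numbers)
toy_vectors = {"the": [1.0, 0.0], "computer": [0.0, 1.0]}
toy_sif = {"the": 0.02, "computer": 0.85}

plain = weighted_average(["the", "computer"], toy_vectors)
sif = weighted_average(["the", "computer"], toy_vectors, toy_sif)
print(plain)
print(sif)
```

In the plain average both words contribute equally, whereas in the weighted version the direction of the frequent word 'the' is almost entirely suppressed.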

In [None]:
sent_emb_2(sentences[0][0],150)

array([-0.11116752, -0.13005966,  0.04650587,  0.21323752, -0.00276145,
        0.14860074, -0.0262641 ,  0.10307227, -0.17657994,  0.20726014,
        0.05644629,  0.26154762, -0.04759888,  0.28380906, -0.00648723,
        0.16147982,  0.18875704, -0.3250262 ,  0.01452463,  0.18531756,
        0.01456033,  0.16492823,  0.28582147, -0.03731463,  0.12768277,
       -0.03805458, -0.23749751, -0.11485145,  0.22288583, -0.15585675,
        0.14707802,  0.03415579, -0.14801836, -0.01798894,  0.09993909,
        0.08985393,  0.02311353,  0.23461261, -0.04593433, -0.24576969,
        0.00184033,  0.10539486, -0.01499137, -0.13208754,  0.06274415,
        0.1838043 , -0.25663814,  0.00054816, -0.10715505, -0.02699988,
       -0.02646498,  0.05996319, -0.13035522, -0.31924159,  0.00091247,
       -0.09107593, -0.0335299 , -0.04674318, -0.03349305, -0.18617075,
       -0.08737971, -0.27196264, -0.03222106,  0.00761876,  0.21628145,
        0.09080885,  0.00533347, -0.14849763,  0.13214848, -0.21

## Approach 3 : Embeddings with BERT transformers




In [None]:
pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('bert-base-nli-mean-tokens')


In [None]:
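def sent_emb_3(sentence: list):
    # model.encode expects a string or a list of strings; given a list of tokens,
    # it embeds each token separately and returns one vector per token
    emb = model.encode(sentence)
    return emb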
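
Since `SentenceTransformer.encode` works on raw strings, one way to obtain a single vector per sentence is to join the tokens back into a string before encoding. A minimal sketch under that assumption: the helper name `encode_token_lists` is hypothetical, and the encoder is passed in as a callable (for instance `model.encode`) so that the snippet runs without downloading a model. The stand-in encoder `fake_encode` exists only for the demonstration.

```python
def encode_token_lists(token_lists, encode):
    """Join each token list into one string, then encode all strings in one batch."""
    texts = [" ".join(tokens) for tokens in token_lists]
    return encode(texts)


def fake_encode(texts):
    """Stand-in for a real sentence encoder: one tiny 'vector' per input string."""
    return [[len(text), text.count(" ") + 1] for text in texts]


toy_sentences = [["they", "get", "into", "an", "accident", "."],
                 ["plot", ":", "two", "teen", "couples"]]
vectors = encode_token_lists(toy_sentences, fake_encode)
print(len(vectors))  # one vector per sentence, not one per token
```

With the objects of this notebook, the call would look like `encode_token_lists(X_train, model.encode)`. Joining with spaces leaves a space before punctuation marks, which is a simplification.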

In [None]:
sent_emb_3(sentences[0][0])

array([[-0.31022093, -0.8895407 ,  2.4462802 , ..., -0.47177592,
        -0.23932761,  0.39813098],
       [-0.0190731 , -0.2524138 ,  2.4880831 , ...,  0.24632902,
         0.21949482,  0.3175112 ],
       [ 0.12544073, -0.38628218,  1.7481121 , ...,  0.44302788,
         0.45639274, -0.16370691],
       ...,
       [-0.14682347, -0.44106725,  2.4785135 , ...,  0.12852906,
        -0.12413316, -0.0459558 ],
       [-0.1805218 ,  0.05617952,  2.3374186 , ..., -0.6111302 ,
        -0.22054918,  0.01559925],
       [ 0.04505367, -0.13585904,  2.5101764 , ...,  0.4691844 ,
         0.19670881,  0.08466091]], dtype=float32)

# Comparing the three approaches on a sentiment analysis task on the Movie Reviews dataset

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
# 80/20 train/test split of the (first sentence, label) pairs
X_train, X_test, y_train, y_test = train_test_split(
    [sentence[0] for sentence in sentences],
    [sentence[1] for sentence in sentences],
    test_size=0.2, random_state=0)

In [None]:
# Encode the labels as 1 (positive) and 0 (negative)
y_train = [1 if y == 'pos' else 0 for y in y_train]
y_test = [1 if y == 'pos' else 0 for y in y_test]

## Approach 1 : Unweighted Sum of word embeddings :

In [None]:
# Embed every training and test sentence with the unweighted baseline
X_train_1 = [sent_emb_1(sentence) for sentence in X_train]
X_test_1 = [sent_emb_1(sentence) for sentence in X_test]

In [None]:
clf = LogisticRegression(solver='sag', random_state=0)
clf.fit(X_train_1, y_train)
clf.score(X_test_1, y_test)

0.5125

## Approach 2 : SIF Weighted Sum of word embeddings

In [None]:
# Embed every training and test sentence with the SIF-weighted average
X_train_2 = [sent_emb_2(sentence, 150) for sentence in X_train]
X_test_2 = [sent_emb_2(sentence, 150) for sentence in X_test]

In [None]:
clf = LogisticRegression(solver='sag', random_state=0)
clf.fit(X_train_2, y_train)
clf.score(X_test_2, y_test)

0.535

## Approach 3 : Embeddings with BERT transformers

In [None]:
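# Caution: X_train and X_test are lists of token lists, whereas model.encode is
# documented for strings; joining the tokens of each sentence first is the safer input format
X_train_3, X_test_3 = sent_emb_3(X_train), sent_emb_3(X_test)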

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_3 = scaler.fit_transform(X_train_3)
X_test_3 = scaler.transform(X_test_3)

In [None]:
clf = LogisticRegression(solver='sag', random_state=0)
clf.fit(X_train_3, y_train)
clf.score(X_test_3, y_test)



0.56

# Conclusion

As expected, the SIF-weighted sum of word embeddings (accuracy 0.535) outperforms the unweighted sum (0.5125).
The BERT embeddings (0.56) outperform both the SIF-weighted and the unweighted sum.
This seems logical, as BERT is the most recent of the three models we tested.

Note that the accuracies obtained are not very high. This is to be expected: we used a very simple model for a complicated task, and each review is represented only by its first sentence.
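
To put the three accuracies in perspective, it can help to estimate how much an accuracy measured on a small test set fluctuates. The sketch below uses the normal approximation to the binomial distribution. It assumes that the test set holds 400 sentences (20% of the 2,000 reviews in the NLTK corpus), and the helper name `accuracy_margin` is illustrative.

```python
import math


def accuracy_margin(accuracy: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for an accuracy."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


n_test = 400  # assumed test-set size: 20% of 2,000 reviews
results = {"unweighted": 0.5125, "SIF": 0.535, "BERT": 0.56}
margins = {name: accuracy_margin(acc, n_test) for name, acc in results.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.4f} +/- {margins[name]:.3f}")
```

Under these assumptions each margin is close to 0.05, so the gaps between the three approaches are of the same order as the sampling noise. Repeating the experiment over several random splits would give a firmer comparison.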