# <center>Natural Language Processing Hands-on # 1</center>
<center><span style="font-weight: bold; font-size: 1.8rem;">Representing words and sentences</span></center>

Since most algorithms are designed to handle numerical data, they can hardly be applied to raw text. However, it is definitely possible to convert a text to a numerical representation.

Ideal representations should handle **semantics**, **polysemy**, **irony** and many other specificities of texts. Over the decades, many text representations have been introduced to handle as many of them as possible.

In this hands-on, you will have to convert a given corpus of texts to various representations and highlight their pros / cons.

# Installation of required packages

The packages listed below should be installed. Using a virtual environment is highly recommended but not mandatory -- that is just good practice.

In [36]:
import itertools
import string
import random

from collections import Counter
from pprint import pprint as pp
from IPython.display import display, Markdown, HTML

import gensim
import nltk
import numpy as np
import pandas as pd
import scipy as sp

import gensim.downloader as gensim_api
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Uncomment the following line to download the Reuters dataset
# nltk.download('reuters')
from nltk.corpus import reuters

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)

**Note**: In NLP, we often add `<START>` and `<END>` tokens to represent the beginning and end of sentences, paragraphs or documents.

# Part 0 - Exploring the dataset

The Reuters Corpus that we will use contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 categories.

Before diving into word representations, let's explore it a little bit and simply preprocess its texts to make it more suitable.

---

We will need to standardize all texts before converting anything to a numerical representation, since it will reduce the vocabulary size. Modify the following function to:

* Add the `START_TOKEN` and `END_TOKEN` at the beginning and end of each document
* Lowercase every word
* Remove the punctuation from each document

In [2]:
def read_corpus(category="tea", add_tokens=True):
    """ Read files from the specified Reuters category.
        Params:
            category (string): category name
            add_tokens (boolean): whether to insert START_TOKEN and END_TOKEN in each document
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    
    # Convert all words to lowercase and remove punctuation
    corpus = [[w.lower().translate(str.maketrans('', '', string.punctuation)) for w in list(reuters.words(f))] for f in files]
    corpus = [[word for word in doc if word and not word.isnumeric()] for doc in corpus]
    # Add token if necessary
    if add_tokens:
        corpus = [[START_TOKEN] + doc + [END_TOKEN] for doc in corpus if doc]
    return corpus

In [3]:
corpus = read_corpus(add_tokens=True)

In [4]:
pp(corpus[:2], compact=True)

[['<START>', 'pakistan', 'confirms', 'kenya', 'tea', 'import', 'investigation',
  'pakistan', 's', 'corporate', 'law', 'authority', 'cla', 'has', 'begun', 'an',
  'enquiry', 'into', 'imports', 'of', 'tea', 'from', 'kenya', 'and', 'the',
  'trade', 'imbalance', 'between', 'the', 'two', 'countries', 'cla', 'chairman',
  'irtiza', 'husain', 'confirmed', 'he', 'told', 'reuters', 'by', 'telephone',
  'that', 'importers', 'liptons', 'and', 'brooke', 'bond', 'had', 'been',
  'asked', 'to', 'supply', 'data', 'to', 'the', 'authority', 'and', 'a',
  'hearing', 'would', 'be', 'held', 'the', 'cla', 'would', 'then', 'report',
  'back', 'to', 'the', 'commerce', 'ministry', 'which', 'had', 'requested',
  'the', 'enquiry', 'husain', 'said', 'no', 'date', 'had', 'yet', 'been', 'set',
  'for', 'the', 'hearing', 'and', 'declined', 'to', 'give', 'further',
  'details', 'of', 'the', 'matter', 'industry', 'sources', 'told', 'reuters',
  'reports', 'that', 'the', 'companies', 'tea', 'import', 'licences', 'ha

# Part I - Word representations

Now that we have preprocessed our texts we can represent them using vectors, also called embeddings in this case.

*Note: the preprocessing done here is basic. We will see in another hands-on different preprocessing steps, including some suitable for frequentist approaches.*

---

Each word representation has its pros and cons: understanding them will help you in finding the best representation that suits your use case. 

As a result, you will have to implement / load and analyze the behaviour of word vectors coming from:

* Dummy encoding
* Co-occurrence matrix encoding
* Pretrained GloVe encoding

## Dummy encoding

The dummy encoding consists in encoding each individual word of our corpus into a vector filled with 0 except at a specific position where the value is 1 (equivalent to the encoding of categorical variables).

As discussed during the course, those embeddings are kind of pointless since they don't handle a single element of the ideal word representation besides being actual vectors. They are however a good starting point to play around with texts.

---

Define a function converting the words of a corpus to a set of dummy encoded vectors. Do not forget to sort your vocabulary before assigning the vectors!

In [6]:
def dummy_encode(corpus):
    """One-hot encoding of a set of texts."""
    
    words = sorted(list(set(itertools.chain.from_iterable(corpus))))
    embeddings = np.eye(len(words))
    
    return {word: embeddings[i] for i, word in enumerate(words)}

In [7]:
pp(dummy_encode(corpus), compact=True)

{'1960s': array([1., 0., 0., ..., 0., 0., 0.]),
 '<END>': array([0., 1., 0., ..., 0., 0., 0.]),
 '<START>': array([0., 0., 1., ..., 0., 0., 0.]),
 'a': array([0., 0., 0., ..., 0., 0., 0.]),
 'abnormal': array([0., 0., 0., ..., 0., 0., 0.]),
 'abnormally': array([0., 0., 0., ..., 0., 0., 0.]),
 'about': array([0., 0., 0., ..., 0., 0., 0.]),
 'abroad': array([0., 0., 0., ..., 0., 0., 0.]),
 'absence': array([0., 0., 0., ..., 0., 0., 0.]),
 'accept': array([0., 0., 0., ..., 0., 0., 0.]),
 'accident': array([0., 0., 0., ..., 0., 0., 0.]),
 'according': array([0., 0., 0., ..., 0., 0., 0.]),
 'account': array([0., 0., 0., ..., 0., 0., 0.]),
 'accounted': array([0., 0., 0., ..., 0., 0., 0.]),
 'accu': array([0., 0., 0., ..., 0., 0., 0.]),
 'across': array([0., 0., 0., ..., 0., 0., 0.]),
 'action': array([0., 0., 0., ..., 0., 0., 0.]),
 'added': array([0., 0., 0., ..., 0., 0., 0.]),
 'adding': array([0., 0., 0., ..., 0., 0., 0.]),
 'adequate': array([0., 0., 0., ..., 0., 0., 0.]),
 'advised': 

 'thousands': array([0., 0., 0., ..., 0., 0., 0.]),
 'threat': array([0., 0., 0., ..., 0., 0., 0.]),
 'threatened': array([0., 0., 0., ..., 0., 0., 0.]),
 'threatens': array([0., 0., 0., ..., 0., 0., 0.]),
 'three': array([0., 0., 0., ..., 0., 0., 0.]),
 'through': array([0., 0., 0., ..., 0., 0., 0.]),
 'tight': array([0., 0., 0., ..., 0., 0., 0.]),
 'tightened': array([0., 0., 0., ..., 0., 0., 0.]),
 'time': array([0., 0., 0., ..., 0., 0., 0.]),
 'to': array([0., 0., 0., ..., 0., 0., 0.]),
 'tobacco': array([0., 0., 0., ..., 0., 0., 0.]),
 'today': array([0., 0., 0., ..., 0., 0., 0.]),
 'told': array([0., 0., 0., ..., 0., 0., 0.]),
 'tonnes': array([0., 0., 0., ..., 0., 0., 0.]),
 'took': array([0., 0., 0., ..., 0., 0., 0.]),
 'torn': array([0., 0., 0., ..., 0., 0., 0.]),
 'total': array([0., 0., 0., ..., 0., 0., 0.]),
 'trade': array([0., 0., 0., ..., 0., 0., 0.]),
 'traders': array([0., 0., 0., ..., 0., 0., 0.]),
 'trading': array([0., 0., 0., ..., 0., 0., 0.]),
 'trend': array([0.,

If you still do not believe that this representation is pointless, try finding the most similar word to "cat" using it.
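To make the point concrete, here is a minimal sketch on a hypothetical four-word vocabulary (the words are illustrative and are not read from the corpus). Every pair of distinct one-hot vectors is orthogonal, so the cosine similarity between two different words is always 0 and no word is closer to "cat" than any other.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted as in dummy_encode
toy_words = sorted(["cat", "dog", "tea", "kenya"])
toy_vectors = np.eye(len(toy_words))

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cat = toy_vectors[toy_words.index("cat")]

# Similarity of "cat" with every word of the toy vocabulary
one_hot_similarities = {
    word: cosine(cat, toy_vectors[i]) for i, word in enumerate(toy_words)
}
print(one_hot_similarities)
```

Only the similarity of "cat" with itself is non-zero: the ranking of all other words is arbitrary.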

## Co-occurrence matrix encoding

*This section comes from Stanford's NLP hands-on*

A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the context window surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We build a co-occurrence matrix $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.

**Example: Co-Occurrence with Fixed Window of n=1:**

* Document 1: "all that glitters is not gold"

* Document 2: "all is well that ends well"

|         	| START 	| all 	| that 	| glitters 	| is 	| not 	| gold 	| well 	| ends 	| END 	|
|---------:	|--------:	|----:	|-----:	|---------:	|---:	|----:	|-----:	|-----:	|-----:	|------:	|
|  START 	|       0 	|   2 	|    0 	|        0 	|  0 	|   0 	|    0 	|    0 	|    0 	|     0 	|
|      all 	|       2 	|   0 	|    1 	|        0 	|  1 	|   0 	|    0 	|    0 	|    0 	|     0 	|
|     that 	|       0 	|   1 	|    0 	|        1 	|  0 	|   0 	|    0 	|    1 	|    1 	|     0 	|
| glitters 	|       0 	|   0 	|    1 	|        0 	|  1 	|   0 	|    0 	|    0 	|    0 	|     0 	|
|       is 	|       0 	|   1 	|    0 	|        1 	|  0 	|   1 	|    0 	|    1 	|    0 	|     0 	|
|      not 	|       0 	|   0 	|    0 	|        0 	|  1 	|   0 	|    1 	|    0 	|    0 	|     0 	|
|     gold 	|       0 	|   0 	|    0 	|        0 	|  0 	|   1 	|    0 	|    0 	|    0 	|     1 	|
|     well 	|       0 	|   0 	|    1 	|        0 	|  1 	|   0 	|    0 	|    0 	|    1 	|     1 	|
|     ends 	|       0 	|   0 	|    1 	|        0 	|  0 	|   0 	|    0 	|    1 	|    0 	|     0 	|
|    END 	|       0 	|   0 	|    0 	|        0 	|  0 	|   0 	|    1 	|    1 	|    0 	|     0 	|

The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run dimensionality reduction. In particular, we will run *SVD* (Singular Value Decomposition), which is a kind of generalized *PCA* (Principal Components Analysis) to select the top $k$ principal components.

Reducing the dimensionality of such vectors largely preserves the semantic relationships between words. Hence, *movie* should still be closer to *theater* than to *airplane*.
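As a rough illustration of what the reduction does, the sketch below applies NumPy's full SVD to a small hypothetical symmetric count matrix and keeps the top $k$ components. The matrix values are made up for the example. The product $U_k S_k$ is the kind of matrix a truncated SVD returns as reduced embeddings (up to sign and numerical precision).

```python
import numpy as np

# Hypothetical 4x4 symmetric co-occurrence counts (made-up values)
M_toy = np.array([
    [0., 2., 1., 0.],
    [2., 0., 3., 1.],
    [1., 3., 0., 2.],
    [0., 1., 2., 0.],
])

k = 2
U, S, Vt = np.linalg.svd(M_toy)      # M_toy = U @ diag(S) @ Vt
embeddings_k = U[:, :k] * S[:k]      # one k-dimensional row per word

# Rank-k approximation of the original matrix and its Frobenius error
M_approx = embeddings_k @ Vt[:k, :]
approx_error = np.linalg.norm(M_toy - M_approx)

print(embeddings_k.shape)
print(approx_error)
```

The approximation error only depends on the discarded singular values, which is why keeping the largest ones loses as little information as possible for a given $k$.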

### Identify distinct words

Define a function that will return a list of unique words of the corpus as well as its size.

In [8]:
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = sorted(list(set(itertools.chain.from_iterable(corpus))))

    return corpus_words, len(corpus_words)

In [9]:
distincts, num_words = distinct_words(corpus)

### Compute the co-occurrence matrix

Write a method that constructs a co-occurrence matrix for a certain window-size  $n$  (with a default of $4$), considering words  $n$  before and  $n$  after the word in the center of the window.
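Before counting anything, it can help to look at a single context window in isolation. The helper below is a hypothetical illustration (its name and signature are not part of the hands-on): it returns the words located at most $n$ positions before and after a given position, relying on the fact that Python slices are clipped at the end of the list.

```python
def context_window(tokens, i, n):
    """Return the words at most n positions before and after tokens[i]."""
    left = tokens[max(0, i - n):i]    # max() avoids negative slice starts
    right = tokens[i + 1:i + 1 + n]   # slices past the end are clipped
    return left + right

doc = "<START> all that glitters is not gold <END>".split()
window_for_all = context_window(doc, doc.index("all"), n=4)
print(window_for_all)
# ['<START>', 'that', 'glitters', 'is', 'not']
```

Words near the edges of a document simply get a shorter window, as for "all" here, which only has one word on its left.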

In [11]:
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = np.zeros((num_words, num_words))
    word2ind = {w: i for i, w in enumerate(words)}

    for text in corpus:
        for i, central_word in enumerate(text):
            # Up to window_size words on each side of the central word (slices are clipped at the edges)
            window = text[max(0, i - window_size):i] + text[i + 1:i + window_size + 1]
            for context_word in window:
                M[word2ind[central_word], word2ind[context_word]] += 1

    return M, word2ind

In [12]:
co_matrix, word_idx = compute_co_occurrence_matrix(corpus, window_size=4)

In [13]:
docs_check = [
    "<START> all that glitters is not gold <END>".split(" "),
    "<START> all is well that ends well <END>".split(" ")
]

co_matrix_test, word_idx_test = compute_co_occurrence_matrix(docs_check, window_size=1)

# Display to match the above matrix in the example
words = word_idx_test.keys()
co_matrix_test = pd.DataFrame(co_matrix_test, columns=words, index=words)
co_matrix_test = co_matrix_test.reindex(["<START>", "all" , "that", "glitters", "is", "not", "gold", "well", "ends", "<END>"])
co_matrix_test = co_matrix_test[["<START>", "all" , "that", "glitters", "is", "not", "gold", "well", "ends", "<END>"]]
display(co_matrix_test.astype(int))

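|            | `<START>` | all | that | glitters | is | not | gold | well | ends | `<END>` |
|-----------:|----------:|----:|-----:|---------:|---:|----:|-----:|-----:|-----:|--------:|
|  `<START>` |         0 |   2 |    0 |        0 |  0 |   0 |    0 |    0 |    0 |       0 |
|        all |         2 |   0 |    1 |        0 |  1 |   0 |    0 |    0 |    0 |       0 |
|       that |         0 |   1 |    0 |        1 |  0 |   0 |    0 |    1 |    1 |       0 |
|   glitters |         0 |   0 |    1 |        0 |  1 |   0 |    0 |    0 |    0 |       0 |
|         is |         0 |   1 |    0 |        1 |  0 |   1 |    0 |    1 |    0 |       0 |
|        not |         0 |   0 |    0 |        0 |  1 |   0 |    1 |    0 |    0 |       0 |
|       gold |         0 |   0 |    0 |        0 |  0 |   1 |    0 |    0 |    0 |       1 |
|       well |         0 |   0 |    1 |        0 |  1 |   0 |    0 |    0 |    1 |       1 |
|       ends |         0 |   0 |    1 |        0 |  0 |   0 |    0 |    1 |    0 |       0 |
|    `<END>` |         0 |   0 |    0 |        0 |  0 |   0 |    1 |    1 |    0 |       0 |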


### Reduce the dimensionality of the co-occurrence matrix

Construct a method that performs dimensionality reduction on the matrix to produce $k$-dimensional embeddings. Use *SVD* to take the top $k$ components and produce a new matrix of $k$-dimensional embeddings.

In our case, we will set $k=2$.

In [14]:
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    svd = TruncatedSVD(n_components = k, n_iter = n_iters)
    M_reduced = svd.fit_transform(M)

    print("Done.")
    return M_reduced

In [15]:
reduced_co_matrix = reduce_to_k_dim(co_matrix)

Running Truncated SVD over 1140 words...
Done.


In [16]:
reduced_co_matrix

array([[ 0.80484324, -0.5139259 ],
       [ 5.19968129, -0.6825013 ],
       [ 2.37395942,  0.36485983],
       ...,
       [ 0.62509759,  0.2798576 ],
       [ 0.91626467,  0.53682652],
       [ 2.53154089,  0.81019438]])

---

Great! You now have fixed-size vectors that represent each word of your corpus. Let's normalize our matrix to compare our vectors easily.

In [17]:
# Rescale (normalize) the rows to make them each of unit-length
co_matrix_lengths = np.linalg.norm(reduced_co_matrix, axis=1)
co_matrix_normalized = reduced_co_matrix / co_matrix_lengths[:, np.newaxis] # broadcasting

Since we are working with vectors, we can easily measure the similarity between them using the dot product. Hence, given a specific word and its related word embedding, we can easily identify its most similar words contained in the corpus!

*Note: you can either use a dot product or a cosine similarity here. The dot product depends on both the angle between the vectors and their magnitudes, while the cosine similarity only depends on their angle.*
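The difference between the two measures can be checked on two hypothetical vectors that point in the same direction but have different lengths. The sketch below also shows why the choice does not matter here: once the rows have been rescaled to unit length, as done for `co_matrix_normalized`, the dot product and the cosine similarity coincide.

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])   # same direction as u, twice as long

# The dot product grows with the magnitudes, the cosine similarity does not
dot = float(u @ v)
cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# After rescaling to unit length, the dot product equals the cosine similarity
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
dot_normalized = float(u_hat @ v_hat)

print(dot, cos, dot_normalized)
```

The raw dot product is 50 while the cosine similarity is 1 (up to floating point precision), since both vectors share the same direction.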

----

Let's create a dictionary that maps each word to its vector representation:

In [18]:
svd_word_vectors = {word: co_matrix_normalized[i] for i, word in enumerate(word_idx.keys())}

In [19]:
svd_word_vectors

{'1960s': array([ 0.84282903, -0.53818141]),
 '<END>': array([ 0.99149537, -0.13014199]),
 '<START>': array([0.98839451, 0.15190885]),
 'a': array([0.97393753, 0.22681641]),
 'abnormal': array([ 0.95730113, -0.28909264]),
 'abnormally': array([0.80698539, 0.5905714 ]),
 'about': array([0.98288048, 0.18424432]),
 'abroad': array([ 0.8397594, -0.5429587]),
 'absence': array([0.99900643, 0.04456618]),
 'accept': array([0.82489571, 0.56528495]),
 'accident': array([ 0.96867693, -0.2483244 ]),
 'according': array([0.99652457, 0.08329932]),
 'account': array([0.94803702, 0.31816002]),
 'accounted': array([ 0.95682546, -0.29066311]),
 'accu': array([0.99277676, 0.11997625]),
 'across': array([ 0.93951921, -0.34249622]),
 'action': array([ 0.98241391, -0.18671611]),
 'added': array([0.9933149 , 0.11543613]),
 'adding': array([0.95983326, 0.28057104]),
 'adequate': array([0.99388343, 0.11043423]),
 'advised': array([0.88969081, 0.45656354]),
 'affect': array([ 0.7476838 , -0.66405492]),
 'affec

Define a function that, given a word $w$, identifies its most similar words in your embedding space.

In [20]:
def most_similar(query_word, word_matrix, word_indices, topn=10):
    """Return the topn words whose embeddings are the closest (cosine similarity) to the queried word."""
    
    ind2word = {idx: word for word, idx in word_indices.items()}
    
    query_word_idx = word_indices[query_word]
    query_word_embedding = word_matrix[query_word_idx].reshape(1, -1)
    
    # Compute all similarities in a single call instead of one call per row
    scores = cosine_similarity(query_word_embedding, word_matrix)[0]
    similarities = [(ind2word[i], score) for i, score in enumerate(scores) if i != query_word_idx]
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:topn]

In [21]:
pp(most_similar("tea", co_matrix_normalized, word_idx), compact=True)

[('congress', 0.9999999917994921), ('enquiry', 0.9999996608568608),
 ('dramatic', 0.9999993214106657), ('any', 0.9999991556091371),
 ('citizenship', 0.9999988594142523), ('cocoa', 0.9999984966604832),
 ('and', 0.9999933066559741), ('produce', 0.999992547807904),
 ('stopped', 0.9999924678157199), ('had', 0.9999872126469822)]


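The above similarities are not really convincing. The corpus we have been using so far only contains 13 documents, which is far too small to learn meaningful embeddings. In addition, keeping only $k=2$ dimensions leaves very little room to separate words, so many pairs end up with a cosine similarity close to 1. For good results, we should instead use a much larger corpus, at least on the order of 100k documents.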

You can try it out yourself with another corpus of text without any problem -- all those functions are totally reusable!

----

Now, let's have a look at a model that has already been pretrained on millions of texts.

## GloVe encoding

Word2Vec models are predictive by nature: they are neural networks trained to predict a word from its context (or the context from a word). However, this is not the only way to learn geometrical encodings (vectors) of words from their co-occurrence information (how frequently they appear together in large text corpora).

GloVe is a count-based model: it learns its vectors by factorizing the (log) co-occurrence counts matrix, which essentially amounts to a dimensionality reduction. Does it remind you of something? Yes, that's very close in spirit to what you did above.
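For reference, the objective minimised by GloVe, as given in the original GloVe paper (Pennington et al., 2014), is a weighted least-squares fit of word-vector dot products to log co-occurrence counts. The symbols below follow that paper and are not defined elsewhere in this hands-on: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size and $f$ is a weighting function that limits the influence of very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

Compared with the plain SVD used earlier, the factorization is applied to log counts and each pair is weighted, but the input remains a co-occurrence matrix.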

Building models is time consuming. Hence, GloVe / Word2Vec models already trained on standard corpora (Wikipedia, news, etc.) are publicly shared so that they can be reused easily.

The code below loads a GloVe model trained on Wikipedia and Gigaword and lets us easily inspect the properties of its embeddings.

In [23]:
# List all pretrained models available on gensim
pp(list(gensim_api.info()["models"].keys()))

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']


In [24]:
model = gensim_api.load('glove-wiki-gigaword-50')

You can get the most similar embeddings to those of a given set of words. 
Here, we retrieve the most similar words to fox, rabbit and cat.

In [25]:
pp(model.most_similar(positive=["fox", "rabbit", "cat"]), compact=True)

[('dog', 0.8569531440734863), ('mouse', 0.7859790921211243),
 ('monster', 0.7710846066474915), ('wolf', 0.7690606713294983),
 ('bunny', 0.765525221824646), ('spider', 0.7395666241645813),
 ('duck', 0.7366620898246765), ('rat', 0.7366542220115662),
 ('beast', 0.7319128513336182), ('cartoon', 0.724099338054657)]


You can also perform concept additions / subtractions. For instance, you can compute $king - man + woman$:

In [26]:
model.most_similar(positive=["king", "woman"], negative=["man"])

[('queen', 0.8523604273796082),
 ('throne', 0.7664334177970886),
 ('prince', 0.7592144012451172),
 ('daughter', 0.7473883628845215),
 ('elizabeth', 0.7460219860076904),
 ('princess', 0.7424570322036743),
 ('kingdom', 0.7337412238121033),
 ('monarch', 0.721449077129364),
 ('eldest', 0.7184861898422241),
 ('widow', 0.7099431157112122)]
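Roughly speaking, such a query combines the vectors of the positive words, subtracts those of the negative words, and ranks the remaining vocabulary by cosine similarity to the result. The sketch below reproduces this idea on hypothetical hand-made 2-D embeddings (the values are invented so that one axis loosely encodes royalty and the other gender). It is a simplified illustration and not gensim's actual implementation.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_embeddings = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.1]),
}

def unit(v):
    """Rescale a vector to unit length."""
    return v / np.linalg.norm(v)

def toy_most_similar(positive, negative, embeddings):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(embeddings[w]) for w in positive) - sum(unit(embeddings[w]) for w in negative)
    # The query words themselves are excluded from the candidates
    candidates = [w for w in embeddings if w not in positive + negative]
    scores = {w: float(unit(query) @ unit(embeddings[w])) for w in candidates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = toy_most_similar(["king", "woman"], ["man"], toy_embeddings)
print(ranking[0][0])
```

With these toy values the best candidate is "queen", because subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis untouched.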

Explore those embeddings and comment on how they help to identify / handle:

* Synonyms
* Antonyms
* Grammatical errors
* Polysemy
* *~~[Irony]~~ -> this was removed and should be instead done on sentence embeddings*
* *~~[Analogies]~~ -> this was removed and should be instead done on sentence embeddings*

In [27]:
def display_analogies(model, words):
    for word in words:
        results = model.most_similar(positive=[word], topn=10)
        results = [f'{r[0]} (*{np.round(r[1], 2)}*)' for r in results]
        display(Markdown(f"Words that are most similar to <b>{word}</b> are:"))
        display(Markdown(f"* {', '.join(results)}"))

### Synonyms & antonyms

First, let's have a look at simple nouns to see if we find their synonyms within their most similar embeddings.

#### Nouns

In [28]:
display_analogies(model, ["petrol", "meal", "data", "cat"])

Words that are most similar to <b>petrol</b> are:

* gasoline (*0.84*), diesel (*0.79*), kerosene (*0.78*), lpg (*0.77*), fuel (*0.75*), propane (*0.75*), liter (*0.74*), litres (*0.7*), litre (*0.7*), liters (*0.7*)

Words that are most similar to <b>meal</b> are:

* meals (*0.86*), snack (*0.78*), bread (*0.78*), eat (*0.77*), lunch (*0.77*), breakfast (*0.77*), ate (*0.76*), eating (*0.76*), dessert (*0.75*), dinner (*0.74*)

Words that are most similar to <b>data</b> are:

* information (*0.83*), tracking (*0.81*), database (*0.81*), analysis (*0.8*), applications (*0.79*), indicate (*0.77*), indicates (*0.76*), computer (*0.76*), indicating (*0.76*), user (*0.76*)

Words that are most similar to <b>cat</b> are:

* dog (*0.92*), rabbit (*0.85*), monkey (*0.8*), rat (*0.79*), cats (*0.79*), snake (*0.78*), dogs (*0.78*), pet (*0.78*), mouse (*0.77*), bite (*0.77*)

It appears here that for nouns, the similarities work pretty well: the closest words are mostly synonyms and the following ones seem to be related ideas.

We can also note that some verbs are returned as well: *meal* -> *eat*.

----

#### Adjectives

Now, let's look at some adjectives to see if we can observe the same thing.

In [29]:
display_analogies(model, ["awesome", "great", "awful"])

Words that are most similar to <b>awesome</b> are:

* unbelievable (*0.86*), amazing (*0.86*), incredible (*0.85*), fantastic (*0.81*), marvelous (*0.79*), terrific (*0.78*), phenomenal (*0.74*), truly (*0.74*), luck (*0.72*), damn (*0.71*)

Words that are most similar to <b>great</b> are:

* greatest (*0.82*), good (*0.8*), perhaps (*0.79*), life (*0.78*), well (*0.78*), little (*0.78*), much (*0.78*), inspiration (*0.78*), luck (*0.78*), experience (*0.77*)

Words that are most similar to <b>awful</b> are:

* horrible (*0.93*), terrible (*0.89*), dreadful (*0.87*), unbelievable (*0.85*), scary (*0.83*), weird (*0.82*), sadly (*0.81*), sad (*0.81*), frightening (*0.81*), thing (*0.8*)

We can observe here that the relationships between adjectives are almost as good as for the nouns, but that:

* Words carrying opposite ideas can end up close to each other (*unbelievable* is among the closest neighbours of both *awesome* and *awful*)
* It doesn't necessarily output only adjectives (*great* -> *life*)

### Grammatical errors

In [30]:
display_analogies(model, ["awsome"])

KeyError: "Key 'awsome' not present"

In [32]:
display_analogies(model, ["helo"])

Words that are most similar to <b>helo</b> are:

* alcmene (*0.79*), vondas (*0.75*), kurama (*0.74*), selvi (*0.74*), annu (*0.74*), gorath (*0.72*), drey (*0.72*), siya (*0.72*), malli (*0.72*), nanu (*0.72*)

In [31]:
display_analogies(model, ["maket"])

Words that are most similar to <b>maket</b> are:

* +8.00 (*0.8*), gaint (*0.79*), +7.50 (*0.78*), +11.00 (*0.77*), telecommunciations (*0.76*), -9.00 (*0.76*), +1.75 (*0.76*), +1.25 (*0.76*), +7.00 (*0.76*), -6.50 (*0.74*)

We can see here that grammatical errors (here, misspellings) are a real problem for such embeddings. If they are not found in the vocabulary, you will have to discard them. However, for *maket* and *helo*, another problem occurs: they do exist in the vocabulary but are related to another concept or another language, which could lead to serious problems when building downstream classification systems.

How can we solve that? By modifying the **tokenizer** that we are using.

A tokenizer is in charge of preparing the inputs for a model, i.e. of splitting the text into relevant tokens. For instance, a tokenizer can represent *reading* as two tokens, helping the system to identify the root word and its inflection:
* reading -> *["read", "-ing"]*

If you want to know more about tokenizers, you can head here: [Tokenizer introduction](https://huggingface.co/transformers/tokenizer_summary.html)
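To give an intuition of how a subword tokenizer can split an unknown or misspelled word into known pieces, here is a minimal sketch of a greedy longest-match strategy, in the spirit of WordPiece. The function name, the tiny vocabulary and the `[UNK]` marker are assumptions made for the example. Real tokenizers learn their vocabulary from a large corpus. The `##` prefix marks a piece that continues a word.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest known pieces, scanning left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary of whole words and continuation pieces
toy_vocab = {"read", "##ing", "aw", "awe", "##some", "mark", "##et"}

print(subword_tokenize("reading", toy_vocab))
print(subword_tokenize("awsome", toy_vocab))
```

With this toy vocabulary, the misspelled *awsome* is split into `aw` and `##some`: it still shares a piece with *awesome* instead of being dropped or mapped to an unrelated vector.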

---

### Polysemy

Let's have a closer look at an obvious polysemy example and check how the model handles it.

In [33]:
display_analogies(model, ["mouse"])

Words that are most similar to <b>mouse</b> are:

* monkey (*0.8*), bugs (*0.78*), cat (*0.77*), rabbit (*0.76*), worm (*0.75*), clone (*0.73*), robot (*0.73*), spider (*0.72*), bug (*0.71*), frog (*0.7*)

We can see here that the mouse used alongside your computer has been totally left out. 

If we were able to indicate that we want to obtain the word as a peripheral, we might be able to search differently in the embedding space. You see where I'm heading, right?

When searching for the vector that is the most similar to something, we can define that something as a word (that is what we have been doing so far) or as a concept $(king - man + woman)$.

Let's try it out for mouse:

In [34]:
model.most_similar(positive=["mouse", "peripheral"])

[('brain', 0.7404634356498718),
 ('neural', 0.7243039608001709),
 ('brains', 0.7190447449684143),
 ('neurons', 0.7073276042938232),
 ('interface', 0.7037647366523743),
 ('uses', 0.6876873970031738),
 ('circuitry', 0.6868135333061218),
 ('mimic', 0.6858026385307312),
 ('bugs', 0.6839889287948608),
 ('devices', 0.6759170889854431)]

The results are okay here (we see bugs, devices), but we also have some elements related to the neurobiology field. Let's specify that we do not want the animal component of the mouse:

In [35]:
model.most_similar(positive=["mouse", "peripheral"], negative=["animal"])

[('adapter', 0.715379536151886),
 ('usb', 0.7070850729942322),
 ('switches', 0.7027341723442078),
 ('connectors', 0.7009559273719788),
 ('plugs', 0.6908823847770691),
 ('socket', 0.6843112707138062),
 ('circuitry', 0.6815988421440125),
 ('neuropathy', 0.6803703308105469),
 ('sockets', 0.6705275177955627),
 ('adapters', 0.6666727066040039)]

It looks better now!

When using such embeddings in downstream models, it is hard to do what we've done here since you do not necessarily have a knowledge graph at hand. Instead, consider using context-aware word embeddings such as ELMo or BERT (to be added on top of your model). You can also have a look at [sentence-transformers](https://github.com/UKPLab/sentence-transformers) that output multilingual context-aware embeddings.

### Recap - Where do you think those behaviours come from?

Since GloVe embeddings are trained based on co-occurrence, two words will have a high similarity if they can be used in the **same context**. Hence, if two words are easily interchangeable, they should have a high similarity as well. This is the reason why adjectives might have a high similarity with their antonyms.

Regarding polysemy, the most represented concept in the embedding corresponds to the most represented concept in the corpus the model has been trained on.

Regarding grammatical errors, the sourcing of the training data might cause the problem. When crawling data on the web, one might catch other languages by mistake and not detect it easily. Because of that, some grammatical errors are treated as distinct words with a totally different meaning, which can be dangerous in some use cases. Try picking a tokenizer that has been designed to handle grammatical errors if this applies to your use case.

# Part II - Sentence representations

## TF-IDF

TF-IDF is a method that aims at finding the most significant words in each document by comparing their in-document frequency to their frequency across the whole corpus.
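The weighting can be worked out by hand on a tiny hypothetical corpus. The sketch below uses raw counts for the term frequency, the smoothed inverse document frequency $\ln\frac{1 + n}{1 + df(w)} + 1$ and a final rescaling of each document to unit length, which are the conventions scikit-learn's `TfidfVectorizer` applies by default. The toy documents and variable names are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
toy_docs = [
    "tea exports rise".split(),
    "tea imports fall".split(),
    "cocoa exports rise".split(),
]

vocab = sorted({w for doc in toy_docs for w in doc})
n_docs = len(toy_docs)

# Number of documents containing each word
doc_freq = Counter(w for doc in toy_docs for w in set(doc))

# Smoothed inverse document frequency: rare words get a higher weight
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    """TF-IDF weights of one document, rescaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

tfidf_rows = [tfidf_vector(doc) for doc in toy_docs]
second_doc = dict(zip(vocab, tfidf_rows[1]))
print(second_doc)
```

In the second document, *imports* and *fall* receive a higher weight than *tea* because they appear in a single document, whereas *tea* appears in two.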

---

Convert the reuters corpus (at least one category) to its TF-IDF representation using [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Beware of all the parameters of the methods! They might have significant impact on what you're doing.

In [37]:
vectorizer = TfidfVectorizer()

# Converting the corpus as full strings instead of words
corpus_str = [' '.join(doc) for doc in corpus]
tfidf_corpus = vectorizer.fit_transform(corpus_str)

With only three lines of code, you obtain TF-IDF embeddings for your $n$ documents, with all words scored each time:

In [38]:
tfidf_corpus.shape

(13, 1134)

You can access the vocabulary words by using the below method:

In [39]:
vectorizer.get_feature_names_out()

array(['1960s', 'abnormal', 'abnormally', ..., 'zero', 'zimbabwe',
       'zones'], dtype=object)

---

Once you have converted your corpus using the TF-IDF methodology, create a function identifying the most relevant documents given a search query.

In [40]:
def search_corpus(corpus, search_query, topn=10):
    """Retrieve the top n documents matching a search query within a list of texts.
    """
    corpus_str = [' '.join(doc) for doc in corpus]
    
    vectorizer = TfidfVectorizer()
    tfidf_corpus = vectorizer.fit_transform(corpus_str)

    query_vector = vectorizer.transform([search_query])[0]
    
    scores = [(corpus_str[i], cosine_similarity(query_vector, text_vector)[0][0]) for i, text_vector in enumerate(tfidf_corpus)]
    
    return pd.DataFrame(sorted(scores, reverse=True, key=lambda x: x[1])[:topn], columns=["document", "similarity"])

In [41]:
search_corpus(corpus, search_query="tea regulation")

|   | document | similarity |
|--:|:---------|-----------:|
| 0 | `<START>` pakistani decision will hurt kenyan te... | 0.197751 |
| 1 | `<START>` state to control pct of pakistan tea i... | 0.143906 |
| 2 | `<START>` pakistan confirms kenya tea import inv... | 0.102723 |
| 3 | `<START>` abnormal radiation found in soviet tea... | 0.097015 |
| 4 | `<START>` indonesian tea cocoa exports seen up c... | 0.089477 |
| 5 | `<START>` sri lankan tea workers launch one day ... | 0.071925 |
| 6 | `<START>` vietnam to resettle on state farms in ... | 0.035178 |
| 7 | `<START>` soviet paper details georgian flood da... | 0.026356 |
| 8 | `<START>` india steps up countertrade deals indi... | 0.021954 |
| 9 | `<START>` india relaxes rules for export promoti... | 0.020039 |


Congrats! You have created your first text based search engine. Again, the results are not impressive here: you should pick a bigger dataset so that the embeddings are more reliable.

## Unsupervised Random Walk Sentence Embeddings

This approach was presented by Kawin Ethayarajh in 2018. The key idea behind this methodology is to take a weighted average of previously trained word embeddings and modify it with SVD (Singular Value Decomposition, a kind of generalization of PCA).

By having a look at [this implementation](https://github.com/kawine/usif), try to compute the uSIF embeddings of our corpus and compare their properties to the TF-IDF / Averaged Word2Vec ones. 

Consider digging in the [related paper](https://aclanthology.org/W18-3012.pdf) if you want to know more about the methodology.
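The general recipe (weighted average, then SVD-based correction) can be sketched as follows. This is a deliberately simplified, SIF-style illustration and not the uSIF algorithm itself: the word vectors are random placeholders, the constant `a` is an assumed value, and only the first component is removed, whereas the paper defines its own weights and component removal.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

toy_sentences = [
    "tea exports rise".split(),
    "tea imports fall".split(),
    "cocoa exports rise sharply".split(),
]

# Hypothetical word vectors; in practice they would come from a pretrained model
vocab = sorted({w for s in toy_sentences for w in s})
word_vectors = {w: rng.normal(size=5) for w in vocab}

# Unigram probabilities estimated on the toy sentences themselves
counts = Counter(w for s in toy_sentences for w in s)
total = sum(counts.values())
a = 1e-3  # assumed smoothing constant: frequent words get smaller weights
weights = {w: a / (a + counts[w] / total) for w in vocab}

# Step 1: weighted average of the word vectors of each sentence
sentence_matrix = np.array([
    np.mean([weights[w] * word_vectors[w] for w in s], axis=0)
    for s in toy_sentences
])

# Step 2: remove the projection on the first right singular vector
_, _, Vt = np.linalg.svd(sentence_matrix, full_matrices=False)
first_component = Vt[0]
sentence_embeddings = sentence_matrix - np.outer(
    sentence_matrix @ first_component, first_component
)

print(sentence_embeddings.shape)
```

Removing the dominant direction discards what all sentences have in common, so that the remaining coordinates focus on what distinguishes them.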