# Word Embeddings (static word embeddings)
---
The bag of words (BOW) representation is very useful for representing documents as vectors of numbers, but it is not without drawbacks. It is usually memory-hungry: the length of each vector equals the size of the entire vocabulary (the vocabulary is the list of all seen tokens: words, numbers, punctuation). We can reduce memory consumption by limiting the dictionary to the most important tokens from the corpus, but then we lose the information carried by the deleted tokens, because they are ignored. <br/>
Another problem with this representation is its inability to encode word similarity information (as will be shown in subsequent sections). <br/> There is therefore a need to look for alternative representations to overcome the above-mentioned problems. One of them is Word Embeddings.
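The memory trade-off described above can be sketched in plain Python. This is only an illustration, not the `CountVectorizer` used later in this notebook; the helper names `build_vocabulary` and `to_bow` and the toy documents are assumptions made up for this example.

```python
from collections import Counter

def build_vocabulary(docs, max_size=None):
    """Return the tokens to keep, most frequent first (hypothetical helper)."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    return [token for token, _ in counts.most_common(max_size)]

def to_bow(doc, vocabulary):
    """Count how often each vocabulary token occurs in doc; other tokens are ignored."""
    counts = Counter(doc.lower().split())
    return [counts[token] for token in vocabulary]

# Toy corpus, made up for illustration.
docs = ["the cat sat", "the cat slept", "the cat sat again"]

full_vocab = build_vocabulary(docs)               # one vector position per distinct token
small_vocab = build_vocabulary(docs, max_size=2)  # keep only the 2 most frequent tokens

print(len(full_vocab), len(small_vocab))
print(to_bow(docs[0], full_vocab), to_bow(docs[1], full_vocab))    # vectors differ
print(to_bow(docs[0], small_vocab), to_bow(docs[1], small_vocab))  # vectors are identical
```

With the truncated vocabulary the vectors are shorter, but the first two documents become indistinguishable, because the tokens that told them apart were dropped.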

<span style="color:red">
We will need external resources (embeddings) to complete the tasks. Go to: http://nlp.stanford.edu/data/glove.6B.zip, unpack the package, and then move the file: "glove.6B.50d.txt" to the folder where this notebook is located.</span>

# Similarity between documents
We often need to evaluate the similarity of two documents. When we represent documents as vectors of equal length (and with BOW all vectors have equal length), we can use cosine similarity to measure how similar the vectors are.

$similarity = cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^na_i b_i}{\sqrt{\sum_{i=1}^{n}a_i^2}\sqrt{\sum_{i=1}^{n}b_i^2}}$
<br/>
Vector $\vec{a}$ represents the first, and vector $\vec{b}$ the second document.
<br/>
Below you can find an implementation of cosine similarity using numpy.

---
**Done by:** Sofya Aksenyuk, 150284.

---

In [1]:
import numpy as np   

def cosine(v1, v2):
    v1, v2 = np.array(v1), np.array(v2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


print(cosine([1.0, 2.0, 3.0], [1.5, -0.7, -20]))
print(cosine([-10.0, 17.0, 2.0], [5.3, 12.0, -20]))
print(cosine([1.0, 2.0, 3.0], [1, -3000, 184]))
print(cosine([1.0, 2.0, 3.0], [1, 2, 3]))


-0.7977198918178166
0.23409628697705598
-0.48434715333575534
1.0


We can generate a Bag of Words representation using the `CountVectorizer` object provided by `sklearn`. Let's use it to transform each document into a vector counting how many times a given word occurs in the document. Then we can measure the similarity between the vectors using the `cosine` function defined above.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

doc1 = "Ala has a cat"
doc2 = "Ala has a beautiful fluffy cat"

docs = [doc1, doc2]

X_train_counts = count_vect.fit_transform(docs).todense()

# Note: CountVectorizer's default tokenization drops one-character tokens such as "a", hence 5 columns
print("Documents represented using BOW. Rows represent documents, columns represent tokens (there are 5 distinct tokens (words) in these documents)")
print("The cell at row x and column y holds how many times the token assigned to position y occurs in document x")
print(X_train_counts)
print("\n\nDocument similarity:")
print(cosine(X_train_counts[0].tolist()[0], X_train_counts[1].tolist()[0])) # tolist()[0] turns a 1xn matrix into a flat list of n elements


Documents represented using BOW. Rows represent documents, columns represent tokens (there are 5 distinct tokens (words) in these documents)
The cell at row x and column y holds how many times the token assigned to position y occurs in document x
[[1 0 1 0 1]
 [1 1 1 1 1]]


Document similarity:
0.7745966692414834
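
The value above can be checked by hand. Taking $\vec{a} = [1, 0, 1, 0, 1]$ and $\vec{b} = [1, 1, 1, 1, 1]$ (the two rows of the printed matrix) and substituting them into the cosine formula gives:

$cos(\vec{a}, \vec{b}) = \frac{1 \cdot 1 + 0 \cdot 1 + 1 \cdot 1 + 0 \cdot 1 + 1 \cdot 1}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 1^2}\sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2}} = \frac{3}{\sqrt{3}\sqrt{5}} = \frac{3}{\sqrt{15}} \approx 0.7746$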


When documents share words, the cosine similarity is non-zero. However, what happens when we have two very similar documents expressed using synonymous words?



In [3]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

doc1 = "cat"
doc2 = "kitten"
docs = [doc1, doc2]


print("Documents represented using BOW. Rows represent documents, columns represent tokens (there are 2 distinct tokens (words) in these documents)")
X_train_counts = count_vect.fit_transform(docs).todense()
print(X_train_counts)

print("\n\nDocument similarity:")
print(cosine(X_train_counts[0].tolist()[0], X_train_counts[1].tolist()[0]))

Documents represented using BOW. Rows represent documents, columns represent tokens (there are 2 distinct tokens (words) in these documents)
[[1 0]
 [0 1]]


Document similarity:
0.0


Using BOW there is no way of spotting that a `kitten` is semantically related to a `cat` (at least more related than a `chair` or a `car`)!

---
Embeddings are nothing more than vector representations of the meaning of words (tokens) in an n-dimensional space, built so that similar words lie close to each other in that space. We can create them ourselves from a large body of text (which can be time-consuming) using packages such as gensim (https://radimrehurek.com/gensim/), but we can also use "pretrained" vectors already created on some corpus, available e.g. at https://nlp.stanford.edu/projects/glove/. We will choose the second option: using existing vectors. <br/>

Embeddings provided by the Stanford team are text files represented as a set of lines: <br/>
word [SPACE] vector_of_numbers_separated_by_spaces_representing_words_meaning <br/>

The function to load embeddings has already been created. <br/> **Run the following code so that we can use this function in subsequent cells and evaluate the similarity of words.** <span style = "color: red"> Note: the `mapping` variable will be used in subsequent cells, so we have to run that code to make it visible for subsequent cells. </span>


In [9]:
import numpy as np

def load_embeddings(path):
    mapping = dict()
    
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if len(line) == 0:
                continue
            splitted = line.split(" ")
            mapping[splitted[0]] = np.array(splitted[1:], dtype=float)
    return mapping

mapping = load_embeddings('glove.6B.50d.txt') # load embeddings into a dict mapping words into vectors

cat = mapping['cat']
kitten = mapping['kitten']
chair = mapping['chair']

print("Similarity between a cat and a kitten:")
print(cosine(cat, kitten))

print("Similarity between a cat and a chair")
print(cosine(cat, chair))

Similarity between a cat and a kitten:
0.6386305647068642
Similarity between a cat and a chair
0.29425297716624566


As can be seen, we can represent WORDS (TOKENS) as vectors instead of whole documents. As a result, we can measure the similarity between words.

# Embedding space
Embeddings are representations of the meaning of words in n-dimensional space. The website: http://projector.tensorflow.org tries to visualize this space by projecting the pretrained vectors onto 3-dimensional space.

To complete this section, open the above page and work through the two tasks below. <br/>
**Task**: List 5 closest words to the word "data" analyzing embeddings loaded by default by the website (We can type the word whose neighbors we want to locate in the "Search" field in the upper right part of the screen).
Let's use cosine distance to measure distance between embeddings.
<br/>
Note - this page uses cosine distance instead of similarity. The relationship between the two measures is very simple: distance = 1 - similarity
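
The relationship distance = 1 - similarity can be sketched in a few lines. The function names `cosine_similarity` and `cosine_distance` are assumptions chosen for this example; the similarity is redefined here only so that the snippet runs on its own.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between v1 and v2."""
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def cosine_distance(v1, v2):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(v1, v2)

same_direction = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Since similarity ranges from -1 to 1, the distance ranges from 0 (same direction) to 2 (opposite direction), so smaller values mean closer words.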

In [23]:
# 1. Word: information        Distance: 0.435
# 2. Word: instructions       Distance: 0.506
# 3. Word: files              Distance: 0.522
# 4. Word: file               Distance: 0.542
# 5. Word: register           Distance: 0.547

As we can see, the words closest to the word "data" are synonymous in this case. <br/>
**Task**: Enter 5 closest words to the word "red". Are we still dealing with synonyms?
Answer the question: how can you interpret the most similar words (in what aspect are they similar), since, as you can see, they are not synonyms (recall the principle of embedding)? Share your answers in the comments below

In [21]:
# 1. Word: blue       Distance: 0.333
# 2. Word: yellow     Distance: 0.380
# 3. Word: white      Distance: 0.391
# 4. Word: green      Distance: 0.396
# 5. Word: black      Distance: 0.489

# Interpretation: They all belong to one group - colors.
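
The same kind of lookup can be done locally instead of on the website. Below is a minimal sketch; `nearest_neighbors` is a hypothetical helper written for this example, and the 2-dimensional `toy` vectors are made up (they are not GloVe values).

```python
import numpy as np

def nearest_neighbors(word, embeddings, k=5):
    """Return the k (word, cosine distance) pairs closest to `word` (hypothetical helper)."""
    query = embeddings[word]
    distances = []
    for other, vec in embeddings.items():
        if other == word:
            continue  # skip the query word itself
        similarity = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        distances.append((other, 1.0 - similarity))
    return sorted(distances, key=lambda pair: pair[1])[:k]

# Toy 2-dimensional vectors, made up for illustration only.
toy = {
    "red":   np.array([1.0, 0.1]),
    "blue":  np.array([0.9, 0.2]),
    "green": np.array([0.8, 0.3]),
    "chair": np.array([0.1, 1.0]),
}

for neighbor, distance in nearest_neighbors("red", toy, k=2):
    print(f"{neighbor}: {distance:.3f}")
```

Passing the `mapping` dictionary loaded earlier instead of `toy` should give a comparable list for GloVe, although the exact words and distances may differ from the website, which loads its own embedding set.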

# Embeddings model relations between words

Embeddings contain information about the meaning. What's more - they are vectors, so we can perform operations on them (addition, subtraction, ...). Let's check what effects we get by performing operations on these vectors.
<br/>
We will be interested in the effect of the operation: $\vec{italy} - \vec {rome} + \vec{warsaw}$. What will the vector defined in this way indicate? <br/>
Since the result of this operation will be a new vector, let's write a function that checks which existing word is closest to this vector.

**Task:** Complete the `get_most_similar(...)` function so that, for the given vector `vec1`, it returns the word whose vector (embedding) is the most similar to `vec1` (to evaluate the similarity, use the `cosine` function created at the beginning of this notebook). The `embeddings` parameter is a dictionary that maps a word to the corresponding vector. <br/>
What word was determined closest to the calculated point?

In [26]:
def get_most_similar(vec1, embeddings):
    # pick the word whose embedding has the highest cosine similarity to vec1;
    # max() avoids sorting the whole vocabulary just to read the first element
    return max(embeddings, key=lambda word: cosine(vec1, embeddings[word]))

new_point = mapping['italy'] - mapping['rome'] + mapping['warsaw']
print(get_most_similar(new_point, mapping))


poland


We can see that performing arithmetic on embeddings gives very interesting results. If we subtract the vector of the capital ("rome") from the vector of "italy" and add the vector of the Polish capital ("warsaw"), we land closest to "poland". In other words, we answer the question: which word stands in the same relation to Warsaw as Italy does to Rome?
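
A common refinement of this kind of query is to skip the input words when searching, because the computed point often stays close to one of them. The sketch below shows the idea; `analogy` is a hypothetical helper written for this example, and the 2-dimensional `toy` vectors are made up so that the arithmetic is easy to follow (they are not GloVe values).

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping a, b and c."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    best_word, best_similarity = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue  # never return one of the query words
        similarity = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if similarity > best_similarity:
            best_word, best_similarity = word, similarity
    return best_word

# Toy vectors, made up for illustration: first axis ~ "which country", second ~ "is a country".
toy = {
    "italy":  np.array([1.0, 1.0]),
    "rome":   np.array([1.0, 0.0]),
    "warsaw": np.array([3.0, 0.0]),
    "poland": np.array([3.0, 1.0]),
    "chair":  np.array([0.0, -1.0]),
}

print(analogy("italy", "rome", "warsaw", toy))
```

Here the target is $[1, 1] - [1, 0] + [3, 0] = [3, 1]$, which coincides with the toy vector for "poland".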

# Word embeddings for document classification: a simple but powerful heuristic

It also turns out that embeddings are useful in classification: they effectively reduce the number of features and solve the problem of the sparse representation introduced by BOW (most BOW vectors have >50% of elements set to 0). Let's imagine a spam classification task. In order to decide whether we are dealing with spam or ham, we would like to use the SVC classifier with embeddings as features. <br/>

However, embeddings describe individual words as n-dimensional vectors, while in the e-mail classification problem we have to represent entire documents as vectors. We therefore need to aggregate the information about all words of a document into one feature vector.
<br/>

One method that works surprisingly well is to represent the entire document as a vector that is the "center of gravity" of the words it is made of. The resulting vector is n-dimensional (as is the vector of each of the "component" words), and its $i$-th position holds the arithmetic mean of the $i$-th positions of the word vectors from the given document. NOTE: a word from the document that we want to represent may be missing from the pre-trained embeddings. In such a situation, ignore this word completely (assume that it does not exist).
<br/>
<br/>
**TASK**: Implement the function `documents_to_ave_embeddings(docs, embeddings)` taking two parameters:
<ol>
    <li> docs - list of documents in text form (list of strings) to be transformed into vectors </li>
    <li> embeddings - word mapping -> vector (embedding) from an existing model </li>
</ol>
The function should return one variable: a list of document vectors, where each document is represented by the "average vector" of the vectors of the words it is made of (as described in the paragraph above the task). Use the NLTK tokenizer (`word_tokenize`) and convert the document text to lowercase before tokenizing.

Consider using `numpy.mean` with the appropriate axis parameter value to calculate the mean (it will be easier to use a ready-made function than to implement it manually).
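
To illustrate the hint about the axis parameter, here is a small sketch on made-up data. The 3-dimensional `toy` vectors and the token list are assumptions for this example only.

```python
import numpy as np

# Toy 3-dimensional "embeddings", made up for illustration only.
toy = {
    "free":  np.array([1.0, 0.0, 2.0]),
    "money": np.array([3.0, 2.0, 0.0]),
}

tokens = ["free", "money", "zzzunknown"]          # the last token has no embedding
vectors = [toy[t] for t in tokens if t in toy]    # unknown tokens are skipped

# axis=0 averages position by position across the words: one value per dimension
document_vector = np.mean(vectors, axis=0)
print(document_vector)   # → [2. 1. 1.]

# axis=1 would instead average inside each word vector: one value per word
per_word_means = np.mean(vectors, axis=1)
print(per_word_means.shape)
```

The document vector keeps the dimensionality of the word vectors (3 here), regardless of how many words the document contains.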

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas
import numpy as np
from nltk import word_tokenize
import nltk
from sklearn.metrics import classification_report

# ------------------- LOAD DATA -----------

full_dataset = pandas.read_csv('spam_emails.csv', encoding='utf-8')      # read the CSV file
full_dataset['label_num'] = full_dataset.label.map({'ham':0, 'spam':1})  # map string labels "ham"/"spam", into numbers so that they can be processed by sklearn

np.random.seed(0)                                       # set seed = 0 to ensure reproducibility of experiments
train_indices = np.random.rand(len(full_dataset)) < 0.7 # choose random 70% of rows as a trainset
train = full_dataset[train_indices] # create trainset (70%)
test = full_dataset[~train_indices] # create testset (remaining - 30%)


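def documents_to_ave_embeddings(docs, embeddings):
    dim = len(next(iter(embeddings.values())))  # dimensionality of the embedding vectors

    def avg_document(doc):
        # keep only tokens that have a pretrained vector; unknown tokens are ignored
        to_avg = [embeddings[x] for x in word_tokenize(doc.lower()) if x in embeddings]
        if not to_avg:
            # no known tokens at all: return a zero vector instead of the NaN given by np.mean([])
            return np.zeros(dim)
        return np.mean(to_avg, axis=0)  # average position by position across the words

    return [avg_document(doc) for doc in docs]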
        

# ------------------- VECTORIZE -----------
 
classifier = SVC(C=1.0)

train_transformed = documents_to_ave_embeddings(train['text'], mapping)
test_transformed = documents_to_ave_embeddings(test['text'], mapping)

# ------------------- TRAIN CLASSIFIER -----------

classifier.fit(train_transformed, train['label_num']) 

# ------------------- EVALUATE -----------
accuracy = classifier.score(test_transformed, test['label_num'])
print("Accuracy: {n}%".format(n=100.*accuracy))
print(classification_report(test['label_num'], classifier.predict(test_transformed))) 

Accuracy: 91.81446111869032%
              precision    recall  f1-score   support

           0       0.93      0.95      0.94       517
           1       0.88      0.84      0.86       216

    accuracy                           0.92       733
   macro avg       0.91      0.89      0.90       733
weighted avg       0.92      0.92      0.92       733



# Training your own vectors
Training vectors can be time-consuming and requires a large corpus to capture the right contexts of words, so we did not do it in the lab. If you are interested in creating your own vectors, I recommend the article https://machinelearningmastery.com/develop-word-embeddings-python-gensim/, which describes how to train embeddings using Python and the `gensim` package.