# Programming assignment. Week 2. Vector models.

In this programming assignment you will work with different embeddings. You will use the `gensim` library, which provides convenient access to word2vec and fastText models. You will check how the vector embeddings perform on a sentiment classification task.
Good luck!

In [2]:
# First, load the data for sentiment task and prepare the data.
import pandas as pd
import numpy as np

data = pd.read_csv('https://github.com/mbburova/MDS/raw/main/sentiment.csv', index_col=0)
data.head()

Unnamed: 0,sentiment,review
0,1,With all this stuff going down at the moment w...
1,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,0,The film starts with a manager (Nicholas Bell)...
3,0,It must be assumed that those who praised this...
4,1,Superbly trashy and wondrously unpretentious 8...


In [3]:
import re
tag_regexp = re.compile("<[^>]*>")
regex = re.compile("[A-Za-z-]+")

def words_only(text, regex=regex):
    text = re.sub(tag_regexp, '', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\\','', text)
    text = text.lower().strip()
    try:
        return " ".join(regex.findall(text))
    except:
        return ""


data['cleaned_review'] = data['review'].apply(words_only)
data['tokenized'] = data['cleaned_review'].apply(lambda x: x.split())
data.head()

Unnamed: 0,sentiment,review,cleaned_review,tokenized
0,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,..."
1,1,"\The Classic War of the Worlds\"" by Timothy Hi...",the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot..."
2,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas..."
3,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai..."
4,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious s...,"[superbly, trashy, and, wondrously, unpretenti..."


In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split
from sklearn.metrics import *

X_train, X_test, y_train, y_test = train_test_split(data.tokenized,data.sentiment, test_size=0.2, random_state = 5)
#X_train[:5]

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jupyter-
[nltk_data]     smalkina@edu.hse.r-b86b3/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
X_train

7751    [back, when, alec, baldwin, and, kim, basinger...
4154    [i, too, was, quite, astonished, to, see, how,...
3881    [i, saw, this, film, for, the, very, first, ti...
9238    [i, think, a, great, many, viewers, missed, en...
5210    [this, is, a, taut, suspenseful, masterpiece, ...
                              ...                        
3046    [this, film, is, about, british, prisoners, of...
9917    [but, i, enjoyed, this, show, anyway, i, ve, b...
4079    [i, got, this, in, the, dvd, pack, curse, of, ...
2254    [i, was, intrigued, by, the, title, so, during...
2915    [every, one, should, see, this, movie, because...
Name: tokenized, Length: 8000, dtype: object

## BoW

#### Bag of words

Implement the *bag-of-words* representation. To create this transformation, follow these steps:
1. Find the *N* most popular words in the train corpus and enumerate them. Do not count words that are in STOPWORDS. This gives a dictionary of the most popular words.
2. For each review in the corpus, create a zero vector of dimension *N*.
3. For each text in the corpus, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.
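The three steps can be tried out on a toy corpus before running them on the reviews. The sketch below is only an illustration: the corpus, the stop-word set, the dictionary size, and the helper name `bag_of_words` are all made up for this example and are not part of the assignment data.

```python
from collections import Counter

# Toy tokenized corpus and stop-word set (hypothetical, for illustration only).
corpus = [
    ["the", "movie", "was", "great", "great", "fun"],
    ["the", "plot", "was", "dull"],
    ["great", "plot", "great", "cast"],
]
stopwords = {"the", "was"}
n = 2  # dictionary size N

# Step 1: count non-stop-words over the corpus and keep the N most frequent.
counts = Counter(w for doc in corpus for w in doc if w not in stopwords)
word_to_index = {w: i for i, (w, _) in enumerate(counts.most_common(n))}


def bag_of_words(tokens, word_to_index):
    """Steps 2-3: start from zeros, add 1 for every in-dictionary token."""
    vector = [0] * len(word_to_index)
    for token in tokens:
        if token in word_to_index:
            vector[word_to_index[token]] += 1
    return vector


vectors = [bag_of_words(doc, word_to_index) for doc in corpus]
print(word_to_index)
print(vectors)
```

Words outside the dictionary (here `movie`, `fun`, `dull`, `cast`) simply do not contribute to the vector.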

In [None]:
from collections import Counter

all_words = dict()

def word_counter(text):
    
    for line in text:
        counts = Counter(list(line))
        
        for key, value in counts.items():
            if key not in STOPWORDS:
                if key not in all_words:
                    all_words[key] = value
                else:
                    all_words[key] = all_words[key] + value
    return

word_counter(X_train)
            

In [None]:
DICT_SIZE = 500
words_counts =  sorted(all_words.items(), key=lambda item: item[1], reverse=True)
WORDS_TO_INDEX = [x[0] for x in words_counts][0:DICT_SIZE]
#print(len(WORDS_TO_INDEX), len(words_counts), WORDS_TO_INDEX[0], words_counts[0])


def BoW(words, words_to_index, dict_size):
    """
        words: a list of words
        words_to_index: the dictionary words, ordered by index
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'words'
    """
    counts = Counter(words)
    # Counter returns 0 for missing words, so no membership check is needed
    return [counts[word] for word in words_to_index[:dict_size]]

In [None]:
from scipy import sparse as sp_sparse
X_train_bow = sp_sparse.vstack([sp_sparse.csr_matrix(BoW(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_bow = sp_sparse.vstack([sp_sparse.csr_matrix(BoW(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])
print('X_train shape ', X_train_bow.shape)
print('X_test shape ', X_test_bow.shape)

X_train shape  (8000, 500)
X_test shape  (2000, 500)


In [None]:
import copy

count = 0
a = copy.deepcopy(X_train_bow.todense()[4])
a = np.squeeze(np.asarray(a))
for i in a:
    if i !=0:
        count += 1
print(count)


57


**Task 1.1 (0.5 points).** For the 5th row in *X_train_bow* find how many non-zero elements it has. 

**Hint** Do not forget that indexes start with 0 and the first row has index 0.

In [None]:
q1 = 57

**Task 1.2 (0.5 points)** 
Train (on train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300` and `random_state=5` and `max_depth = 5`. What is the accuracy score on test set? 

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators = 300, max_depth = 5, random_state=5)
clf = clf.fit(X_train_bow, y_train)

pred = clf.predict(X_test_bow)

accuracy = accuracy_score(y_test, pred)
print(accuracy)

0.7715


In [None]:
## GRADED PART, DO NOT CHANGE!
q2 = accuracy

## TF-IDF
Now vectorize your texts using `TfidfVectorizer` from `sklearn.feature_extraction.text`.
Pass `STOPWORDS` as `stop_words` and set `max_features = 500`.

**Task 2 (0.5 points).** Train (on train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300`, `random_state=5`, and `max_depth = 5` on tf-idf embeddings. What is the accuracy score on test set? 
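To see what the tf-idf weights look like, here is a minimal pure-Python sketch on a toy corpus. It assumes the defaults of `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization); the documents and the helper name `tfidf` are hypothetical, and the sketch is not a replacement for the library.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical).
docs = [["good", "film"], ["bad", "film"], ["good", "good", "cast"]]
vocab = sorted({w for d in docs for w in d})  # ['bad', 'cast', 'film', 'good']
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = Counter(w for d in docs for w in set(d))

# Smoothed inverse document frequency (assumed TfidfVectorizer default).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf(tokens):
    """Term count times IDF, followed by L2 normalization of the row."""
    tf = Counter(tokens)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


rows = [tfidf(d) for d in docs]
for doc, row in zip(docs, rows):
    print(doc, [round(x, 3) for x in row])
```

Rare words such as `bad` get a larger IDF than words that occur in several documents such as `film`, which is the main difference from plain bag-of-words counts.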

In [None]:
#X_train.tolist()


In [None]:
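from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects raw strings; passing token lists raises
# AttributeError: 'list' object has no attribute 'lower', so join the tokens first
train_texts = X_train.apply(' '.join)
test_texts = X_test.apply(' '.join)

# stop_words is passed as a list, which scikit-learn accepts (newer versions may reject a set)
vectorizer = TfidfVectorizer(max_features=500, stop_words=list(STOPWORDS))

train_X = vectorizer.fit_transform(train_texts)
test_X = vectorizer.transform(test_texts)

clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=5)
clf = clf.fit(train_X, y_train)

pred = clf.predict(test_X)

accuracy = accuracy_score(y_test, pred)
print(accuracy)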

In [None]:
## GRADED PART, DO NOT CHANGE!
q3 = accuracy

## Distributional embeddings

Let us use a few pre-trained distributional embedding models to analyze the performance of the classifier:

*   ```word2vec```
*   ```fastText```

We will use the [```Gensim```](https://radimrehurek.com/gensim_3.8.3/) library for Python, which provides a wide range of useful functions and pre-trained embedding models. 

### Vector models. ```word2vec```



In this assignment you are going to work with a pretrained model. The file `GoogleNews-vectors-negative300.bin` **is already located in the root directory**. 

You may also download it using the code below, or you may directly download it from [here](https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz).

In [None]:
# CODE FOR DOWNLOADING PRETRAINED MODEL
!brew install wget
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gzip -d GoogleNews-vectors-negative300.bin.gz

**Task 3.1 (0.5 points)**

Load the word2vec model using `gensim.models.KeyedVectors.load_word2vec_format`. How many words in the model vocabulary start with `a` or `A`? (As the answer, write the total number of vocabulary words that start with `a` or with `A`.)



In [None]:
from gensim import models
## YOUR CODE HERE


In [None]:

## GRADED PART, PUT YOUR ANSWER HERE
q4 =  ### YOUR SOLUTION

**Task 3.2 (0.5 points)**
Select all words from the vocabulary that start with `z` or `Z`. Which of these words is the most similar to the word `park`?

Use `sklearn.metrics.pairwise.cosine_similarity` as a similarity measure.
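As a reminder of what cosine similarity measures, here is a small pure-Python sketch of the same filter-then-rank idea on made-up 3-dimensional vectors. The words, the vectors, and the `query` vector standing in for `park` are all hypothetical; on the real model you would use the model's vectors and `cosine_similarity` from `sklearn`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not taken from any real model.
toy_vectors = {
    "zoo": [0.9, 0.1, 0.0],
    "zebra": [0.5, 0.5, 0.5],
    "Zurich": [0.0, 0.2, 0.9],
    "apple": [1.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.0]  # stands in for the vector of the query word

# Keep only words starting with 'z' or 'Z', then rank them by similarity.
candidates = [w for w in toy_vectors if w.lower().startswith("z")]
best = max(candidates, key=lambda w: cosine(toy_vectors[w], query))
print(best, round(cosine(toy_vectors[best], query), 3))
```

Note that `apple` would score highest overall but is excluded by the prefix filter, so the filter has to be applied before taking the maximum.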


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
## YOUR CODE HERE
words_counts = ## YOUR CODE HERE
words_counts.most_common(5)

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q5 =  ### YOUR SOLUTION

**Task 3.3 (0.5 points)** 
Compute top-5 most frequent words in `X_train` without counting words from `STOPWORDS`.
In answer write top-5 words in one string separating words by spaces.

**Example answer:** `"cat dom apple home house"`.

In [None]:
## YOUR CODE HERE
words_counts = ## YOUR CODE HERE
answer = ' '.join([x[0] for x in words_counts.most_common(5)])
answer

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q6 =  ### YOUR SOLUTION

**Task 3.4 (0.5 points)**. With word2vec embeddings you can perform different linear operations. For example, you can check which word does not belong in a sequence (does not match), or which vector you get if you sum the `king` and `woman` vectors and subtract the `man` vector. For the model above, do the following operations: 

1) what word from these: `London Warsaw Peru Kiev`
does not match; 

2) take the word from step 1 and make linear operation `Moscow + <your word> - Russia`; 

3) for the result of step 2, write the similarity score of the most similar word. Round the answer to 2 decimal places.
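Both kinds of operation can be illustrated on hand-made vectors. The sketch below is a simplified picture under stated assumptions: the toy embeddings are invented, the odd-one-out rule (lowest cosine to the mean of the unit vectors) is one common approach, and the analogy uses raw rather than normalized vectors, so `gensim` may differ in details and in scores.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the unit vectors."""
    units = [unit(vectors[w]) for w in words]
    centroid = [sum(col) / len(units) for col in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], centroid))


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, find the nearest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive + negative]
    best = max(candidates, key=lambda w: cosine(vectors[w], target))
    return best, cosine(vectors[best], target)


# Hypothetical 2-d vectors; axes loosely mean (animal, vehicle).
things = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "horse": [0.85, 0.3], "car": [0.1, 0.9]}
# Hypothetical 3-d vectors; axes loosely mean (royalty, male, female).
royals = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

print(odd_one_out(["cat", "dog", "horse", "car"], things))
word, score = analogy(["king", "woman"], ["man"], royals)
print(word, round(score, 2))
```

The input words are excluded from the analogy candidates; otherwise one of them is often the nearest neighbour of the combined vector.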

In [None]:
### YOUR SOLUTION


In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q7 =  ### YOUR SOLUTION

**Task 3.5 (0.5 points)**. Add information to the pre-trained word2vec model. We already have a pre-trained word2vec model, but what if we want to add new information to it and increase the number of occurrences of some words? You can update the parameters using the `gensim` interface; you can find more information [here](https://radimrehurek.com/gensim/models/keyedvectors.html). Note that this does not work with `KeyedVectors`: only the full model can be updated.  

First, take the tokenized sentences from the sentiment data and train a Word2Vec model. Set the parameters `vector_size=100, window=2, min_count=1, workers=1`. Write the word most similar to `woman` for the trained model and its similarity score (round the answer to 4 decimal places).

In [None]:
sentences = [line for line in data.tokenized]


In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q8 =  ### YOUR SOLUTION

**Task 3.6 (0.5 point)**. Next, we will learn how to update parameters of the model with new sentences. 


Let's take some data from Alice’s Adventures in Wonderland by Lewis Carroll and add the new texts to the model. First, load and prepare the data:

In [None]:
import nltk
import re
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

with open("alice.txt", 'r', encoding='utf-8') as f:
    text = f.read()

text = re.sub('\n', ' ', text)
sents = nltk.sent_tokenize(text.lower())

punct = r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~„“«»†*—/\-‘’'
clean_sents = []

for sent in sents:
    s = [w.lower().strip(punct) for w in sent.split()]
    clean_sents.append(s)

print(clean_sents[:2])

You need to save the current model in the right format, load the full model again, and add the new texts to it. Train for 5 epochs with default parameters. Then check the most similar words for `woman` again and write the word and its score. Round the answer to 4 decimal places.

In [None]:
from gensim.models import word2vec
### YOUR SOLUTION

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q9 =  ### YOUR SOLUTION

### Pooling methods

Once we have a word embedding model, how can we get from word embeddings to document embeddings?

For this, there are several *pooling* strategies that define the way to aggregate the embeddings of each word in a document by taking an element-wise **average**, **minimum**, **maximum** or **sum**:

*   **Mean-pooling** (**average-pooling**)
*   **Min-pooling**
*   **Max-pooling**
*   **Sum-pooling**

It is very common in practice to use mean-pooled document embeddings for the downstream task.
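The four pooling strategies differ only in the element-wise reduction applied across the word vectors of a document. Here is a minimal pure-Python sketch on made-up 3-dimensional vectors; the toy embeddings, the `pool` helper, and the choice of zero vectors for unknown words are assumptions made for this illustration.

```python
# Hypothetical toy word vectors, not taken from any real model.
toy = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.7, -0.4, 0.5],
    "sat": [-0.2, 0.6, 0.1],
}
DIM = 3


def pool(tokens, vectors, how="mean"):
    """Aggregate the word vectors of one document element-wise."""
    # Out-of-vocabulary words are represented by zero vectors (an assumption).
    rows = [vectors.get(t, [0.0] * DIM) for t in tokens]
    if not rows:
        return [0.0] * DIM  # empty document
    columns = list(zip(*rows))
    if how == "mean":
        return [sum(c) / len(c) for c in columns]
    if how == "min":
        return [min(c) for c in columns]
    if how == "max":
        return [max(c) for c in columns]
    if how == "sum":
        return [sum(c) for c in columns]
    raise ValueError(f"unknown pooling type: {how}")


tokens = ["the", "cat", "sat", "mat"]  # 'mat' is out of vocabulary here
for how in ["mean", "min", "max", "sum"]:
    print(how, [round(x, 3) for x in pool(tokens, toy, how)])
```

Whatever the document length, every pooling type returns a vector of the same dimensionality as the word vectors, which is what makes the result usable as classifier input.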

**Task 4.1 (1 point).**
Using the model, build a sentence embedder that computes the sentence vector as the mean of its word vectors (**mean-pooling**). Use zero vectors for out-of-vocabulary words.

What is the embedding for the sentence `'the cat sat on the mat'`?
Tokenize the sentence splitting it by spaces.

As the answer, write the mean of its components.

In [None]:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = ## YOUR CODE HERE

    def fit(self, X, y):
        return self

    def transform(self, X):
        # X is a list of tokenized sentences 
        # example:
        # X = [['the','cat','sat','on','the','mat'],
        #      ['the','dog','lies','on','the','sofa']]
        
        ### YOUR CODE HERE

In [None]:
vectorizer = MeanEmbeddingVectorizer(model)
cat_vector = ### YOUR CODE HERE
print(cat_vector.shape)

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q10 =  ### YOUR SOLUTION

**Task 4.2 (1 point).** Using `MeanEmbeddingVectorizer`, vectorize the tokenized reviews. Then train (on the train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300` and `random_state=5`.
What is the accuracy score on test set? 

In [None]:
### YOUR CODE HERE

pred = ### YOUR SOLUTION

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q11 =  ### YOUR SOLUTION

**Task 4.3 (1 point)**
Transform `MeanEmbeddingVectorizer` into `MinEmbeddingVectorizer`, 
`MaxEmbeddingVectorizer`, and `SumEmbeddingVectorizer`. 

For each of the vectorizers:


1) vectorize the data


2) train (on train set) `RandomForestClassifier` from `sklearn.ensemble` with `n_estimators = 300` and `random_state=5`.


3) compute accuracy score on test set

Which of the three methods is the best? The answer should be one of "min", "max", "sum".



In [None]:
### YOUR SOLUTION

In [None]:
class EmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = word2vec[word2vec.index_to_key[0]].shape[0]

    def fit(self, X, y):
        return self

    def transform(self, X, type_ = 'min'):
        # X is a list of tokenized sentences 
        # example:
        # X = [['the','cat','sat','on','the','mat'],
        #      ['the','dog','lies','on','the','sofa']]
        
        ### YOUR CODE HERE

In [None]:
results = []
for type_ in ['min', 'max', 'sum']:
  ### YOUR CODE HERE
  results.append([type_, accuracy])
    
results

In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q12 =  ### YOUR SOLUTION

**Task 4.4 (0.5 points)** What is the accuracy score of the best method from Task 4.3?


In [None]:
## GRADED PART, PUT YOUR ANSWER HERE
q13 =  ### YOUR SOLUTION

## Text classification with ```fastText```

Let us have a look at how we can use the ```fastText``` library for text classification. Specifically, the [library](https://github.com/facebookresearch/fastText#text-classification) can be used to train supervised text classifiers, for example for sentiment analysis, in a single command:

```./fasttext supervised -input train.txt -output model```, where ```train.txt``` is the train set with annotated examples for the task, and ```model``` is a name of your model that will be saved in 2 files: ```model.bin``` and ```model.vec```.

**Data format**: 
The train file should be in a specified format, containing a training sentence or document per line along with the labels.

In [None]:
## Install fasttext if you work in Colab (otherwise it is already installed)

# !wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
# !unzip v0.2.0.zip
# %cd fastText-0.2.0
# !make
# !pip install .

Let us prepare the data in the required format as follows.

In [None]:
X = [sentence.strip() for sentence in data.cleaned_review]
y = [str(label) for label in data.sentiment]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 5)

with open('ft_train.txt', 'w') as outfile:
    for i in range(len(X_train)):
        outfile.write('__label__' + y_train[i] + ' '+ X_train[i] + '\n')
    
with open('ft_test.txt', 'w') as outfile:
    for i in range(len(X_test)):
        outfile.write('__label__' + y_test[i] + ' ' + X_test[i] + '\n')

print("Files are written")

In [None]:
!head -n 5 ft_train.txt

### ```fastText```

Let us have a look at how we can apply a pre-trained fastText model that was trained over different text corpora: Wikipedia, UMBC Webbase corpus and statmt.org news dataset (16B tokens in total!).

**Task 5.1 (1 point).** Train a fastText model. Use the default parameters, but train for 20 epochs.

How can we evaluate the model now? Write the code to test the model on the ```ft_test.txt``` file with the help of the ```fastText``` library. Report the precision (P@1) and recall (R@1) **scores**. Round the answers to 3 decimal places. 
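For intuition about the two metrics: P@1 is the share of top-1 predictions that are correct, and R@1 is the share of gold labels that are recovered by the top-1 predictions. The pure-Python sketch below uses invented labels and a hypothetical helper `precision_recall_at_1`; it only illustrates the definitions and does not call the ```fastText``` library.

```python
def precision_recall_at_1(gold_labels, predicted_labels):
    """gold_labels: one set of true labels per example;
    predicted_labels: the single top-ranked label per example."""
    correct = sum(
        1 for gold, pred in zip(gold_labels, predicted_labels) if pred in gold
    )
    n_predictions = len(predicted_labels)
    n_gold = sum(len(gold) for gold in gold_labels)
    return correct / n_predictions, correct / n_gold


# Hypothetical single-label data, as in the sentiment task.
gold = [{"__label__1"}, {"__label__0"}, {"__label__1"}, {"__label__0"}]
pred = ["__label__1", "__label__1", "__label__1", "__label__0"]
p, r = precision_recall_at_1(gold, pred)
print(p, r)

# Hypothetical multi-label data, where the two metrics diverge.
gold_multi = [{"a", "b"}, {"a"}]
pred_multi = ["a", "a"]
p_multi, r_multi = precision_recall_at_1(gold_multi, pred_multi)
print(p_multi, round(r_multi, 3))
```

With exactly one gold label per example, as in `ft_test.txt`, P@1 and R@1 coincide and both equal the accuracy.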

In [None]:
### YOUR SOLUTION

In [None]:
## GRADED PART,PUT YOUR  PRECISION VALUE HERE
q14 =  ### YOUR SOLUTION

In [None]:
## GRADED PART, PUT YOUR RECALL VALUE HERE
q15 =  ### YOUR SOLUTION

We can also load the model and use the python code to make predictions.

In [None]:
import fasttext

## save trained model
classifier.save_model("sentiment_model.bin")
## load our trained classifier
sentiment_ft = fasttext.load_model("sentiment_model.bin")

**Task 5.2 (1 point).** Write code to make a prediction with the classifier. As the answer, write the label (1 or 0) and the confidence score (round it to 3 decimal places).

In [None]:
review = "This was such a great film! I am so lucky to watch it."

# <YOUR CODE HERE>

In [None]:
## GRADED PART,PUT YOUR LABEL HERE
q16 =  ### YOUR SOLUTION

In [None]:
## GRADED PART, PUT YOUR CONFIDENCE SCORE HERE
q17 =  ### YOUR SOLUTION