# DSCI 575 - Advanced Machine Learning

# Lab 1: Word embeddings

In [None]:
import os, sys
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from scipy.sparse import coo_matrix, csr_matrix

import re
from collections import defaultdict
from collections import Counter

from tqdm import tqdm 
import random

import time

# pip install ipython-autotime
# The extension is activated with a magic command; a plain import does not enable cell timing.
%load_ext autotime

from gensim.models import Word2Vec,FastText

from preprocessing import MyPreprocessor

## Table of contents
- [Submission guidelines](#sg)
- [Learning outcomes](#lo)
- [Exercise 0: Warm up](#0)
- [Exercise 1: Word meaning representation using co-occurrence matrix](#1)
- [Exercise 2: Word embeddings (dense word representations)](#2)
- [Exercise 3: Pre-trained word embeddings](#3)
- [Exercise 4: Product recommendation using Word2Vec](#4)

## Submission guidelines <a name="sg"></a>

#### Tidy submission
rubric={mechanics:3}
- To submit this assignment, submit this jupyter notebook with your answers embedded.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- Use proper English, spelling, and grammar throughout your submission.

#### Code quality and writing
- These rubrics will be assessed on a question-by-question basis and are included in individual question rubrics below where appropriate.
- See the [quality rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_quality.md) and [writing rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_writing.md) as a guide to what we are looking for.
- Refer to [Python PEP 8 Style Guide](https://www.python.org/dev/peps/pep-0008/) for coding style.

## Learning outcomes <a name="lo"></a>

After working on this lab, you will be able to

- Find cosine similarity between words using sparse word representation
- Train your own dense embeddings using Word2Vec and fastText algorithms
- Explain how unknown words are handled in Word2Vec vs. fastText
- Use pre-trained word embeddings
- Use the Word2Vec algorithm for product recommendation


**Note that this lab involves loading pre-trained models, which may take a long time depending on your machine. Please start early and do not leave this lab until the last minute.**

You will use a subset of the good old [IMDB movie review data set](https://www.kaggle.com/utathya/imdb-review-dataset) for the first two exercises. Below I am providing starter code to create a sub-corpus from this corpus. Replace the CSV path with your download path.

In [None]:
### BEGIN STARTER CODE
# Data loading and preprocessing
imdb_df = pd.read_csv('data/imdb_master.csv', encoding = "ISO-8859-1")
imdb_df.head()
### END STARTER CODE

In [None]:
### BEGIN STARTER CODE
imdb_df['label'].value_counts()
### END STARTER CODE

In [None]:
### BEGIN STARTER CODE
SUBSET_SIZE = 5000

# A list of all reviews
imdb_all_corpus = imdb_df['review'].tolist()

# Shuffle reviews 
random.shuffle(imdb_all_corpus)

# Create a small subset of the corpus
imdb_subset = imdb_all_corpus[:SUBSET_SIZE]
### END STARTER CODE

## Exercise 0: Warm up <a name="0"></a>

Typically, text data needs to be "normalized" before we do anything with it. I am providing you with the `MyPreprocessor` class in the file `preprocessing.py`, which carries out basic preprocessing. Throughout this lab, you will use this class for preprocessing, and in this particular exercise you'll apply it to a toy corpus.

### 0(a) Preprocessing 

rubric={accuracy:2,reasoning:2}

Your tasks: 

1. Preprocess the corpus below (`corpus`) using the `preprocess_corpus` method of the `MyPreprocessor` class. Print the preprocessed corpus. 
2. Write your observations about the preprocessed corpus. What do you think is the purpose of preprocessing text data? 
3. Now create a preprocessed corpus for the `imdb_subset` and store it into a variable called `pp_imdb_subset`. 

In [None]:
### BEGIN STARTER CODE 
corpus = ["""The 21 Lessons for the 21st Century focuses on 
             current affairs and on the more immediate future 
             of humankind. In a world deluged by irrelevant 
             information clarity is power. Censorship works 
             not by blocking the flow of information, but rather 
             by flooding people with disinformation and distractions. 
             So what is really happening right now? 
             What are today’s greatest challenges and choices? 
             What should we pay attention to?
         """,
         """
         The Python Data Science Handbook provides a reference 
         to the breadth of computational and statistical 
         methods that are central to data-intensive science, 
         research, and discovery. 
         """
         ]
### END STARTER CODE 

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

## Exercise 1:  Word meaning representation using co-occurrence matrix <a name="1"></a>

In this exercise you'll build sparse representations of words using a term-term co-occurrence matrix and find cosine similarity scores between a set of word pairs.

### 1(a) Build and visualize term-term co-occurrence matrix
rubric={accuracy:2,viz:2}

Below we are providing you some starter code for the class `CooccurrenceMatrix`. Read the docstrings of the class methods. 

Your tasks:
1. Create a term-term co-occurrence matrix for the `pp_imdb_subset` with `window_size` 3. 
2. Show the first few rows of the co-occurrence matrix as a pandas DataFrame. Show the appropriate column and row labels (words associated with the indices) so that your co-occurrence matrix is interpretable. 
3. Get the word vector for the word _cat_ using the `get_word_vector` method of the `CooccurrenceMatrix` class. What is the size of the vector? How many values are non-zero?
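
To make the role of the context window concrete, here is a minimal sketch that counts co-occurrences on a single toy sentence using plain Python. It mirrors the windowing idea of the `CooccurrenceMatrix` starter class but is not a replacement for it; the function name `count_cooccurrences` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def count_cooccurrences(tokens, window_size=3):
    """Count (target, context) pairs within a symmetric context window."""
    counts = Counter()
    for target_index, target in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        start = max(0, target_index - window_size)
        end = min(len(tokens), target_index + window_size + 1)
        for context_index in range(start, end):
            # Skip the target position itself.
            if context_index == target_index:
                continue
            context = tokens[context_index]
            # Skip repeats of the same word so the "diagonal" stays at 0.
            if context == target:
                continue
            counts[(target, context)] += 1
    return counts


# Hypothetical toy sentence, already tokenized.
toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]
toy_counts = count_cooccurrences(toy_tokens, window_size=1)
print(toy_counts[("cat", "sat")])
print(toy_counts[("cat", "mat")])
```

With a window of 1, only adjacent words are counted, so `cat` co-occurs with `sat` but not with `mat`. Widening the window makes the counts denser and captures looser, more topical associations.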

In [None]:
### BEGIN STARTER CODE
class CooccurrenceMatrix:
    def __init__(self, corpus, 
                       tokenizer = word_tokenize, 
                       window_size = 3):
        self.corpus = corpus
        self.tokenizer = tokenizer
        self.window_size = window_size
        self.vocab = {}
        self.cooccurrence_matrix = None    
        
    def fit_transform(self):
        """
        Creates a co-occurrence matrix. 
        
        Parameters
        ----------
        None
        
        Returns
        ----------
        dict, scipy.sparse.csr_matrix
            Returns the vocabulary and a sparse cooccurrence matrix
        """
        data=[]
        row=[]
        col=[]
        for tokens in self.corpus:
            for target_index, token in enumerate(tokens):
                # Get the index of the word in the vocabulary. If the word is not in the vocabulary, 
                # set the index to the size of the vocabulary. 
                i = self.vocab.setdefault(token, len(self.vocab))
                
                # Consider the context words depending upon the context window 
                start = max(0, target_index - self.window_size)
                end = min(len(tokens), target_index + self.window_size + 1)
                
                for context_index in range(start, end):
                    # Do not consider the target word.  
                    if target_index == context_index: 
                        continue                        
                    j = self.vocab.setdefault(tokens[context_index], len(self.vocab))
                    # Set diagonal to 0
                    if i == j:
                        continue
                    data.append(1.0)
                    row.append(i)
                    col.append(j)
        self.cooccurrence_matrix = csr_matrix((data,(row,col)))
        return self.vocab, self.cooccurrence_matrix
            
    def get_word_vector(self, word):
        """
        Given a word returns the word vector associated with it from the co-occurrence matrix. 

        Parameters
        ----------
        word : str 
            the word to look up in the vocab.
        """
        if word in self.vocab: 
            return self.cooccurrence_matrix[self.vocab[word]]
        else:
            print('The word not present in the vocab')

### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 1(b) Cosine similarity between sparse word vectors
rubric={accuracy:2,reasoning:2}

1. Now get word vectors for `word_pairs` shown below. It is not required but feel free to add more word pairs if you like.  
2. Calculate cosine similarity between the word pairs using [`scikit-learn`'s cosine similarity function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).
3. Discuss your results. Do these similarity scores make sense to you? 
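
As a reminder of what is being measured, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The sketch below computes it for sparse vectors stored as dictionaries. It is an illustration only, not the `scikit-learn` implementation; the name `cosine_similarity_sparse`, the toy vectors, and the choice to return 0.0 for an all-zero vector are assumptions.

```python
import math


def cosine_similarity_sparse(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence rows: vec_b is vec_a scaled by 2.
vec_a = {0: 2.0, 3: 1.0}
vec_b = {0: 4.0, 3: 2.0}
vec_c = {1: 5.0}

print(cosine_similarity_sparse(vec_a, vec_b))
print(cosine_similarity_sparse(vec_a, vec_c))
```

Because `vec_b` points in the same direction as `vec_a`, their similarity is 1 up to floating-point error even though the counts differ, while `vec_c` shares no context with `vec_a` and scores 0. This is why a frequent word and a rare word can still come out as similar when they occur in the same contexts.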

In [None]:
### BEGIN STARTER CODE
word_pairs = [('coast','shore'), 
              ('clothes', 'closet'), 
              ('old', 'new'), 
              ('smart', 'intelligent'), 
              ('dog', 'cat'),
              ('orange', 'lawyer')
             ]
### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

## Exercise 2: Word embeddings (dense word representations) <a name="2"></a>

In Exercise 1, you created and worked with sparse word representations, where each word vector was of size $1 \times V$ ($V$ = size of the vocabulary). In this exercise, you will create short and dense word representations using the Word2Vec algorithm. Before training word embedding models, we need to convert the data into a suitable format, which we have already done above. 

In this exercise, you will
 - Train Word2Vec and fastText algorithms on `pp_imdb_subset`
 - Calculate cosine similarity between word pairs with dense vectors
 - Use pre-trained word embeddings
 
You will need to [install `gensim`](https://radimrehurek.com/gensim/index.html) for this exercise. 

###  2(a) Training `Word2Vec` and `fastText`
rubric={accuracy:4,quality:2,reasoning:1}

In this exercise, you will train two models, `Word2Vec` and `fastText`, on the preprocessed IMDB subset `pp_imdb_subset` to get dense word representations.

Your tasks: 

1. Train a [Word2Vec model](https://radimrehurek.com/gensim/models/word2vec.html) on `pp_imdb_subset` with the following hyperparameters. (This might take some time, so I recommend saving the model for later use.)
    * size=100 (this parameter is named `vector_size` in gensim 4.0 and later)
    * window=5
2. Train a [fastText model](https://radimrehurek.com/gensim/models/fasttext.html) on the tokenized corpus with the same set of hyperparameters. (This might take some time, so I recommend saving the model for later use.)

3. What is the vocabulary size in each model? 

Note that the word embeddings will be of better quality if we use the full IMDB corpus instead of the subset. We are using a subset in this exercise to save some time. On my MacBook Air it took 204.8 s to train Word2Vec on the full IMDB corpus and 376.3 s to train fastText on the full IMDB corpus. If you are feeling adventurous, you are welcome to train the models on the full corpus.

**Please do not submit your saved models.**

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 2(b) Unknown words 
rubric={accuracy:2,reasoning:2}

1. Is the word _appendicitis_ present in the vocabulary of the two models? You may try other words which are unlikely to occur in the IMDB dataset. 
2. Now look at the vectors for the word _appendicitis_ for both models. 
3. Note your observations. 
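
A helpful piece of background for this question: fastText represents a word as a bag of character n-grams (by default of length 3 to 6) taken from the word wrapped in the boundary symbols `<` and `>`, whereas Word2Vec stores one vector per whole word seen in training. The sketch below only extracts such n-grams; the function name `char_ngrams` is an assumption, and the real library additionally hashes n-grams into buckets and learns a vector for each.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word padded with '<' and '>'."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


print(char_ngrams("cat", min_n=3, max_n=4))

# N-grams shared between an unseen word and a word that might be in the vocabulary.
shared = set(char_ngrams("appendicitis")) & set(char_ngrams("appendix"))
print(sorted(shared))
```

The first call prints `['<ca', 'cat', 'at>', '<cat', 'cat>']`. The second shows that two related words share many n-grams, which is what lets a subword model assemble a vector for a word it never saw as a whole.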

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE    

In [None]:
### YOUR ANSWER HERE    

### YOUR ANSWER HERE    

###  2(c) Cosine similarity with dense vectors
rubric={accuracy:2,reasoning:1}

- Calculate cosine similarity between the word pairs (`word_pairs`) from Exercise 1(b) using the [model.similarity](https://radimrehurek.com/gensim/models/word2vec.html) method.
- Comment on the quality of word embeddings. 

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

## Exercise 3: Pre-trained word embeddings <a name="3"></a>

Training word embeddings on a large corpus is resource intensive, and we might not be able to train good-quality embeddings on our laptops. (You might have noticed your laptop making noises if you tried the full IMDB data set in the previous exercise.)

Using pre-trained word embeddings is very common in NLP. These embeddings are created by training a model like Word2Vec or fastText on a huge corpus of text, such as a Wikipedia dump or a web crawl dump. It has been shown that pre-trained word embeddings [work well on a variety of text classification tasks](http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf). The rationale is that such corpora are representative of many different corpora you might be using in your specific domain (e.g., the Twitter domain or the news domain).

A number of pre-trained word embeddings are available. The most popular ones are:  

- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using the fastText algorithm
    * published by Facebook
    
In this exercise, you will download and use GloVe pre-trained embeddings trained on Twitter data.

### 3(a) Load GloVe twitter embeddings
rubric={mechanics:3,accuracy:2,reasoning:1}

In this exercise we will explore the [GloVe](https://nlp.stanford.edu/projects/glove/) model trained on Twitter data.

Your tasks are:
1. Download [GloVe embeddings for Twitter](http://nlp.stanford.edu/data/glove.twitter.27B.zip). This is a large file (the compressed file is ~1.42 GB ). **Please do not submit it.** 
2. Unzip the downloaded file. For this exercise we'll be using `glove.twitter.27B/glove.twitter.27B.100d.txt`. The file has words and their corresponding pre-trained embeddings.
3. Convert the GloVe embeddings to the Word2Vec format using the following command. More details [here](https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/scripts/glove2word2vec/index.html).

> python -m gensim.scripts.glove2word2vec -i "glove.twitter.27B.100d.txt" -o "glove.twitter.27B.100d.w2v.txt"

4. Load the `glove_twitter_model` using the following starter code.
5. Compare the vocabulary size of `glove_twitter_model` with that of the two models you trained in 2(a).
6. Is the word *appendicitis* present in the vocabulary of `glove_twitter_model`?


**Note that the GloVe model is case sensitive and only has representations for lower-case words.**

In [None]:
### BEGIN STARTER CODE
from gensim.models import KeyedVectors
glove_twitter_model = KeyedVectors.load_word2vec_format('<YOUR_PATH>/glove.twitter.27B.100d.w2v.txt', binary=False)  # C text format
### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 3(b) Word similarity using pre-trained embeddings
rubric={accuracy:1,reasoning:1}

- Calculate cosine similarity between the word pairs (`word_pairs`) above using the `similarity` method of `glove_twitter_model`. Compare your results with the similarity results in 2(c).

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 3(c) Analogies 
rubric={accuracy:2,reasoning:2}

- Try out four pairs of analogies (similar to how we did in class with English GoogleNews pre-trained word embeddings) with `glove_twitter_model`. 
- Recall that we noticed gender stereotypes when we used English GoogleNews pre-trained word embeddings. Do you see similar stereotypes with `glove_twitter_model`? 
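
The idea behind an analogy query `word1 : word2 :: word3 : ?` is vector arithmetic: compute `word2 - word1 + word3` and return the word whose vector is closest to the result. The toy sketch below illustrates this with hand-made 2-D vectors. The vectors, the function names `cosine` and `toy_analogy`, and the use of raw (unnormalized) vectors are assumptions for illustration; gensim's `most_similar` differs in details such as normalization.

```python
import math

# Hypothetical 2-D embeddings: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def toy_analogy(word1, word2, word3, vectors=toy_vectors):
    """Solve word1 : word2 :: word3 : ? by vector arithmetic."""
    target = tuple(
        b - a + c
        for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    )
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(toy_analogy("man", "king", "woman"))
```

Here `king - man + woman` lands exactly on the toy vector for `queen`. With real embeddings the result is only the nearest neighbour of that point, which is one reason biases in the training text surface in analogy results.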

In [None]:
### BEGIN STARTER CODE
def analogy(word1, word2, word3, model=glove_twitter_model):
    '''    
    Returns analogy word using the given model. 
    
    Parameters
    --------------
    word1 : (str) 
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation    
    word3 : (str)
        word3 in the analogy relation         
    model : 
        word embedding model
    
    Returns
    ---------------
        pd.DataFrame
    '''
    print('%s : %s :: %s : ?' %(word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=['Analogy word', 'Score'])
### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 3(d) Building your own embeddings vs. using pre-trained embeddings
rubric={reasoning:2}

Give example scenarios when you would train your own embeddings and when you would use pre-trained embeddings.   

### (optional) 3(e) Find the odd one out 
rubric={reasoning:1}

Other than finding word similarity and analogies, we can also use word embeddings for finding an odd word out using the [`doesnt_match` method](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.doesnt_match.html?highlight=doesnt_match) of the model. 

- Try an example to find an odd word out using the pre-trained embeddings and examine whether the odd one out given by the algorithm makes sense or not.  

In [None]:
### YOUR ANSWER HERE

### (optional) 3(f) Distance between sentences
rubric={reasoning:1}

In addition, you can use word embeddings to find the distance between sentences using the [`wmdistance` method](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.wmdistance.html) of the model. Find the distance between two similar sentences (with non-overlapping words) and between two completely unrelated sentences. Do the distances make sense?

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### YOUR ANSWER HERE

## Exercise 4: Product recommendation using Word2Vec <a name="4"></a>

The Word2Vec algorithm can also be used in tasks beyond text and word similarity. In this exercise we will explore using it for product recommendations. We will build a Word2Vec model so that similar products (products occurring in similar contexts) occur close together in the vector space. The context of a product can be determined from the purchase histories of customers. Once we have a reasonable representation of products in the vector space, we can recommend products to customers that are "similar" (as determined by the algorithm) to their previously purchased items or to items in their cart.

For this exercise, we will be using the [Online Retail Data Set from UCI ML repo](https://www.kaggle.com/jihyeseo/online-retail-data-set-from-uci-ml-repo#__sid=js0). The starter code below reads the data as a pandas dataframe `df`. 

Download the data and save it under the `data` folder in your lab's directory. **Please do not push the data to your repository.**

In [None]:
### BEGIN STARTER CODE
### Read the data. Takes a while to read the data.
### Change the path below to your download path
df = pd.read_excel('data/Online_Retail.xlsx')
### END STARTER CODE

In [None]:
### BEGIN STARTER CODE
print("Data frame shape: ", df.shape)
df.head(20)
### END STARTER CODE

### 4(a): Preprocessing data
rubric={accuracy:2,reasoning:2}

1. Carry out any necessary preprocessing (e.g., getting rid of NaN values, datatype conversions).
2. How many unique customers and unique products are there? 

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 4(b): Prepare data for Word2Vec
rubric={accuracy:8,quality:4}

1. Split the customers into train (90%) and validation (10%) sets.
2. For the train and validation customers, create purchase histories in the following format, where each inner list corresponds to the purchase history of a unique customer. Each item in an inner list is a `StockCode` from that customer's purchase history, ordered by time of purchase.

```
[[CustomerID1_StockCode1, CustomerID1_StockCode2, ....], 
 [CustomerID2_StockCode10, CustomerID2_StockCode1, ....], 
 ...
 [CustomerID1000_StockCode99, CustomerID1000_StockCode10, ....],
 ...
 ]
 
```
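
To illustrate the target format on something small, the sketch below splits customers (not individual rows) into train and validation sets and builds one time-ordered list of stock codes per customer. It uses plain Python on a handful of made-up records; the record layout, the names `toy_purchases` and `build_histories`, and the fixed random seed are all assumptions, and the real data would come from the `df` dataframe instead.

```python
import random
from collections import defaultdict

# Hypothetical (customer_id, stock_code, purchase_time) records.
toy_purchases = [
    ("C2", "S10", 3),
    ("C1", "S1", 1),
    ("C1", "S2", 5),
    ("C2", "S1", 7),
    ("C3", "S99", 2),
]


def build_histories(purchases, customer_ids):
    """Return one time-ordered list of stock codes per customer."""
    histories = defaultdict(list)
    # Sorting by purchase time first keeps each history in chronological order.
    for customer_id, stock_code, _ in sorted(purchases, key=lambda r: r[2]):
        if customer_id in customer_ids:
            histories[customer_id].append(stock_code)
    return [histories[c] for c in customer_ids if c in histories]


# Split at the customer level so a customer's history never straddles both sets.
customers = sorted({record[0] for record in toy_purchases})
random.Random(0).shuffle(customers)
n_valid = max(1, round(0.1 * len(customers)))
valid_customers = customers[:n_valid]
train_customers = customers[n_valid:]

train_histories = build_histories(toy_purchases, train_customers)
valid_histories = build_histories(toy_purchases, valid_customers)
print(train_histories, valid_histories)
```

Each inner list plays the role of a "sentence" and each stock code the role of a "word", which is the input shape Word2Vec expects.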

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE    

In [None]:
### YOUR ANSWER HERE    

### 4(c): Train `Word2Vec` model
rubric={accuracy:3}

1. Now that your data is in a format suitable for training a Word2Vec model, train a `Word2Vec` model on the train split. Time the training and report how long it took.

In [None]:
### YOUR ANSWER HERE 

### 4(d): Examine product similarity
rubric={accuracy:2,reasoning:4}

Read the starter code below for the `get_most_similar` function. 

1. Get similar products for the following products. 
    - 'SAVE THE PLANET MUG'
    - 'POLKADOT RAIN HAT'    
2. Now pick 4 product descriptions from the validation set. Call `get_most_similar` for these product descriptions and examine similar products returned by the function.
3. Discuss your observations. 
4. If a product does not appear in the train set but appears in the validation set, would the model return a list of similar products for this product? Does it make sense to use the `fastText` algorithm in this case instead of Word2Vec? 

In [None]:
### BEGIN STARTER CODE 
# Create products id_name and name_id dictionaries
products_id_name_dict = pd.Series(df.Description.str.strip().values,index=df.StockCode).to_dict()
products_name_id_dict = pd.Series(df.StockCode.values,index=df.Description.str.strip()).to_dict()
### END STARTER CODE 

In [None]:
### BEGIN STARTER CODE 
def get_most_similar(prod_desc, n = 10, model = model):
    """   
    Given product description, prod_desc, return the most similar 
    products  

    Arguments
    ---------     
    prod_desc -- str
        Product description     

    Keyword arguments
    ---------     
    n -- integer
        the number of similar items to return 

    model -- gensim Word2Vec model
        trained gensim word2vec model on customer purchase histories
        
    Returns
    -------
    pandas.DataFrame
        A pandas dataframe containing n names of similar products 
        and their similarity scores with the input product 
        with description prod_desc.
    
    """
    stock_id = products_name_id_dict[prod_desc]
    try:
        similar_stock_ids = model.wv.most_similar(stock_id, topn = n)
    except KeyError:  # raised when stock_id is not in the model's vocabulary
        print('The product %s is not in the vocabulary'%(prod_desc))    
        return    

    similar_prods = []
        
    for (sim_stock_id, score) in similar_stock_ids:
        similar_prods.append((products_id_name_dict[sim_stock_id], score))
    return pd.DataFrame(similar_prods, columns=['Product description', 'Similarity score'])
### END STARTER CODE  

In [None]:
### YOUR ANSWER HERE 

In [None]:
### YOUR ANSWER HERE 

In [None]:
### YOUR ANSWER HERE    

In [None]:
### YOUR ANSWER HERE    

In [None]:
### YOUR ANSWER HERE        

In [None]:
### YOUR ANSWER HERE 

### YOUR ANSWER HERE