# DSCI 563 - Unsupervised Learning

# Lab 3: Word Embeddings, t-SNE, and Product Similarity using Word2Vec

## Submission instructions <a name="si"></a>
<hr>
rubric={mechanics:2}

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **Please add a link to your GitHub repository here: LINK TO YOUR  REPO**
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- Make at least three commits in your lab's GitHub repository.
- Push the final .ipynb file with your solutions to your GitHub repository for this lab.
- Upload the .ipynb file to Gradescope.
- If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a pdf or html in addition to the .ipynb. 
- Make sure that your plots/output are rendered properly in Gradescope.

> [Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

> As usual, do not push the data to the repository. 

**At this point in the program, even if it's not asked explicitly in the instructions, it's always expected that you provide a brief justification or explanation when you make some non-obvious choices (e.g., hyperparameter choices) or present a bunch of plots. If you don't do it, the reader doesn't know the rationale behind your decisions and what they are supposed to look for in your visuals.**

<br><br><br><br>

## Imports <a name="im"></a>

In [None]:
import os

%matplotlib inline
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

pd.set_option("display.max_colwidth", 0)

<br><br><br><br>

> **This lab is short compared to other labs because it's quiz week. But note that it involves loading pre-trained models, which may take a while depending on your machine. So please start early and do not leave this lab until the last minute.**

<br><br>

## Exercise 1: Exploring pre-trained word embeddings
<hr>

The idea of word embeddings is to represent words with short (~100 to 1000 dimensions) and dense representations such that related words are close together in the vector space. One of the most popular algorithms to create word embeddings is word2vec.  

We can either create word embeddings on our own by training a model like word2vec on a large corpus or use pre-trained word embeddings, which is a more common approach. A number of pre-trained word embeddings are available out there. In the next few exercises, you will be exploring pre-trained embeddings trained on Wikipedia using an algorithm called [GloVe](https://nlp.stanford.edu/pubs/glove.pdf), which is similar to the word2vec algorithm we saw in class.

In this lab, you will explore embeddings to find word relatedness and analogies. In DSCI 575 in the next block, we will use them for transfer learning. 
    
The code below loads pre-trained word vectors trained on Wikipedia. The original source of these word vectors is [here](https://nlp.stanford.edu/projects/glove/). You can also conveniently download them using `gensim`, as shown below. 

To run the code, you'll need the `gensim` package in your 563 conda environment. If you have installed the course conda environment successfully, you'll already have this package. If not, you can install it as follows.

```
> conda activate 563
> conda install -c anaconda gensim
```

In [None]:
import gensim
import gensim.downloader

print(list(gensim.downloader.info()["models"].keys()))

In [None]:
import gensim.downloader as api

glove_wiki_vectors = api.load(
    "glove-wiki-gigaword-100"
)  # This will take a while to run, especially when you run it for the first time.

In [None]:
len(glove_wiki_vectors)

There are 400,000 word vectors in this pre-trained model.

In [None]:
glove_wiki_vectors["learning"].shape

Each vector is 100-dimensional. Below are the words most similar to the word _learning_. See [here](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html) for how these similar words are found.

In [None]:
glove_wiki_vectors.most_similar("learning")
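
To make the idea of "most similar" concrete, here is a minimal sketch that ranks words by cosine similarity. The 3-dimensional vectors and the helper `toy_most_similar` are made up for illustration; they are not the GloVe vectors, and this is not `gensim`'s implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings" (assumption: values invented for this example).
toy_vectors = {
    "learning": [0.9, 0.1, 0.0],
    "teaching": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = toy_most_similar("learning", toy_vectors)
print(ranked)
```

In this toy space, _teaching_ points in nearly the same direction as _learning_, so it ranks first, while _banana_ is almost orthogonal and ranks last.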

<br><br>

### 1.1 Word relatedness
rubric={accuracy:2,reasoning:2}

Now that we have GloVe Wiki vectors loaded in `glove_wiki_vectors`, let's explore them. 

**Your tasks:**

1. Calculate cosine similarity for the following word pairs (`word_pairs`) using the [`similarity`](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) method of the model.
2. Comment on your observations.  

In [None]:
word_pairs = [
    ("coast", "shore"),
    ("clothes", "closet"),
    ("old", "new"),
    ("smart", "intelligent"),
    ("dog", "cat"),
    ("tree", "lawyer"),
]

<div class="alert alert-warning">

Solution_1_1_1
    
</div>

<div class="alert alert-warning">

Solution_1_1_2
    
</div>

<br><br>

### (optional) 1.2 Finding an odd man out
rubric={reasoning:1}

Another thing you could do with word vectors is find the word in a given list that does not go with the other words in the group. The method `doesnt_match` finds the word whose word vector is furthest away from the mean of all word vectors in the group.

In the example below, the word vector of _cereal_ is further away from the mean than the word vectors of the other words in the group.

**Your tasks:**

1. Try 2 to 4 lists of words of your choice with an odd word in each of them, and try the `doesnt_match` method on these lists. Comment on your results.

In [None]:
glove_wiki_vectors.doesnt_match("breakfast cereal dinner lunch".split())
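
The "furthest from the mean" idea can be sketched as follows. This is a simplified illustration under stated assumptions: the 2-dimensional vectors are invented, `toy_doesnt_match` is a hypothetical helper, and the real `gensim` method may differ in its details.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_doesnt_match(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(normed[words[0]])
    mean = [sum(normed[w][i] for w in words) / len(words) for i in range(dim)]
    # A lower dot product with the mean vector means "further from the group".
    scores = {w: sum(a * b for a, b in zip(normed[w], mean)) for w in words}
    return min(scores, key=scores.get)


# Hypothetical toy vectors (assumption: values invented for this example).
toy_vectors = {
    "breakfast": [0.9, 0.2],
    "lunch": [0.8, 0.3],
    "dinner": [0.85, 0.25],
    "cereal": [0.1, 0.9],
}

odd = toy_doesnt_match(["breakfast", "cereal", "dinner", "lunch"], toy_vectors)
print(odd)
```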

<div class="alert alert-warning">

Solution_1_2_1
    
</div>

<br><br>

### 1.3 Representation of all words in English
rubric={accuracy:1,reasoning:2}

**Your tasks:**

1. The vocabulary size of Wikipedia embeddings is quite large. The `test_words` list below contains a few new words (called neologisms) and biomedical domain-specific abbreviations. Write code to check whether `glove_wiki_vectors` has representations for these words or not.
2. Give example corpora (collection of texts) that you would use to train word2vec models so that you have representations for these words.   

> If a given word `word` is in the vocabulary, `word in glove_wiki_vectors` will return True. 

In [None]:
test_words = [
    "covididiot",
    "fomo",
    "frenemies",
    "anthropause",
    "photobomb",
    "selfie",
    "pxg",  # Abbreviation for pseudoexfoliative glaucoma
    "pacg",  # Abbreviation for primary angle closure glaucoma
    "cct",  # Abbreviation for central corneal thickness
    "escc",  # Abbreviation for esophageal squamous cell carcinoma
]

<div class="alert alert-warning">

Solution_1_3_1
    
</div>

<div class="alert alert-warning">

Solution_1_3_2
    
</div>

<br><br><br>

### 1.4 Visualizing similar words
rubric={viz:5,reasoning:2}

Let's examine the quality of embeddings by visualizing whether similar words are close together in the vector space or not. 
Our word vectors are 100-dimensional, so to visualize them we need to reduce the dimensionality to 2 or 3 dimensions. [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) is the simplest approach for this. For better visualization, we can also use non-linear dimensionality reduction techniques such as [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) or [UMAP](https://umap-learn.readthedocs.io/en/latest/). In this exercise, you'll use PCA and t-SNE to visualize a sample of word embeddings and compare the visualizations.

The code below extracts word embeddings for a set of 66 words from 6 categories and stores them in the dataframe `embeddings_df`, where indices are words. 

> Feel free to experiment with the categories but in your final submission keep these categories so that it's easier for the TAs to grade your work.  

**Your tasks:**

1. Apply PCA to `embeddings_df` below to reduce dimensionality to 2 dimensions and show a scatter plot of reduced dimensions. Add labels (words) to the points in the plot.  
2. Apply t-SNE to `embeddings_df` below to reduce dimensionality to 2 dimensions and show a scatter plot of reduced dimensions. Add labels (words) to the points in the plot. 
3. Compare the scatter plots created by PCA and t-SNE and briefly discuss your observations. 

> For t-SNE, you might have to tune some hyperparameters. Show your work or briefly justify your choices. 

> Feel free to use code from lecture notes with appropriate attributions. 
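
As a reminder of what PCA does under the hood, here is a minimal NumPy sketch that projects data onto its top two principal components via SVD. It runs on random data standing in for word vectors (an assumption made so the example is self-contained), and `pca_2d` is a hypothetical helper, not a replacement for the scikit-learn `PCA` class linked above.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the two directions of largest variance."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Random stand-in for 12 words with 100-dimensional embeddings.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(12, 100))

toy_2d = pca_2d(toy_embeddings)
print(toy_2d.shape)
```

Each row of `toy_2d` is a point you could plot and label with its word. t-SNE has no such closed-form projection, which is one reason its plots depend on hyperparameters and random initialization.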

In [None]:
# Create words and labels.

categories = ["english", "apple", "intelligence", "hockey", "cobain", "pca"]
subset_words = []

labels = []
j = 0
for cat in categories:
    subset_words.append(cat)
    labels.append(j)
    for similar_word, _ in glove_wiki_vectors.most_similar(cat, topn=10):
        subset_words.append(similar_word)
        labels.append(j)
    j += 1

In [None]:
embeddings_df = pd.DataFrame(data=glove_wiki_vectors[subset_words], index=subset_words)
embeddings_df.head()

In [None]:
embeddings_df.shape

<div class="alert alert-warning">

Solution_1_4_1
    
</div>

<div class="alert alert-warning">

Solution_1_4_2
    
</div>

<div class="alert alert-warning">

Solution_1_4_3
    
</div>

<br><br>

<br><br><br><br>

## Exercise 2: Stereotypes and biases in word embeddings
<hr>

### 2.1 Potential effect of stereotypes and biases in embeddings
rubric={reasoning:2}

Word vectors contain lots of useful information. But they also contain stereotypes and biases of the texts they were trained on. In the lecture, we saw that our pre-trained word embedding model output an analogy that reinforced a gender stereotype.

**Your tasks:**

1. Give an example of how using such a model could cause harm in the real world.

<div class="alert alert-warning">

Solution_2_1_1
    
</div>

<br><br>

### 2.2 Stereotypes and biases in embeddings
rubric={accuracy:2,reasoning:4}

**Your tasks:**

1. Here we are using pre-trained embeddings which are built using Wikipedia data. Explore whether there are any worrisome biases or stereotypes present in these embeddings or not by trying out at least 4 examples. You can use the following two methods or other methods of your choice to explore what kind of stereotypes and biases are encoded in these embeddings. 
    - use the `analogy` function below which gives word analogies (an example shown below)
    - use [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) or [distance](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=distance#gensim.models.keyedvectors.KeyedVectors.distances) methods (an example is shown below.)   
2. Discuss your observations.

> Note that most of the recent embeddings are de-biased. But you might still observe some biases in them. Also, not all stereotypes present in pre-trained embeddings are necessarily bad. But you should be aware of them when you use them in your models. 

An example of using word analogies to explore biases and stereotypes.

In [None]:
def analogy(word1, word2, word3, model=glove_wiki_vectors):
    """
    Returns analogy word using the given model.

    Parameters
    --------------
    word1 : (str)
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation
    word3 : (str)
        word3 in the analogy relation
    model :
        word embedding model

    Returns
    ---------------
        pd.DataFrame
    """
    print("%s : %s :: %s : ?" % (word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])

In [None]:
analogy("man", "doctor", "woman")
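
The `most_similar(positive=..., negative=...)` call above is based on vector arithmetic: roughly, it looks for words close to `word2 - word1 + word3`. Here is a minimal sketch of that idea with invented 3-dimensional vectors. The helper `toy_analogy` is hypothetical and works on raw vectors, which is a simplification of what `gensim` does internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (assumption: values invented for this example).
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.3, 0.0],
}


def toy_analogy(word1, word2, word3, vectors):
    """Answer word1 : word2 :: word3 : ? by vector arithmetic."""
    target = [
        b - a + c
        for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    ]
    # Exclude the query words themselves, then pick the closest remaining word.
    candidates = {
        w: cosine_similarity(target, vec)
        for w, vec in vectors.items()
        if w not in (word1, word2, word3)
    }
    return max(candidates, key=candidates.get)


answer = toy_analogy("man", "king", "woman", toy_vectors)
print(answer)
```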

An example of using similarity between words to explore biases and stereotypes.

In [None]:
glove_wiki_vectors.similarity("white", "poor")

In [None]:
glove_wiki_vectors.similarity("black", "poor")

<div class="alert alert-warning">

Solution_2_2_1
    
</div>

<div class="alert alert-warning">

Solution_2_2_2
    
</div>

<br><br>

### (optional) 2.3 Exploring stereotypes using WEAT 
rubric={reasoning:1}

The standard way to identify embedding bias is using WEAT (Word Embedding Association Test), which comes from a [Science paper](https://purehost.bath.ac.uk/ws/portalfiles/portal/168480066/CaliskanEtAl_authors_full.pdf) from a few years ago. It is adapted from psychological tests for detecting implicit bias. 

The basic idea is that you take some representative target words (e.g., `men_words` and `women_words`) and attribute words (e.g., `high_pay_jobs_words`, `low_pay_jobs_words`), as shown below. Then we calculate a normalized, z-score-like effect size, which is positive if the bias is in the expected direction, and a p-value based on trying all groupings of the words in the targets.

If you're interested in details, here is the [paper](https://purehost.bath.ac.uk/ws/portalfiles/portal/168480066/CaliskanEtAl_authors_full.pdf) and [here](https://github.com/kmccurdy/w2v-gender/tags) is the Python implementation of WEAT. 

**Your tasks**

1. Explore embedding biases using WEAT for target and attribute words of your choosing using the Python implementation [here](https://github.com/kmccurdy/w2v-gender/tags). 

> You are likely to discuss this more in your ethics course next block.   
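
To give a feel for the effect-size part of WEAT, here is a minimal sketch as I understand it from the paper: each target word gets an association score (mean cosine similarity to one attribute set minus mean cosine similarity to the other), and the effect size is the difference of the mean scores of the two target sets divided by the standard deviation of all scores. The vectors are invented, the function names are hypothetical, the use of the sample standard deviation is an assumption, and the p-value is not computed. For real experiments, use the implementation linked above.

```python
import math
from statistics import mean, stdev


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much more strongly w associates with attribute set A than with B."""
    sim_a = [cosine_similarity(w, a) for a in attrs_a]
    sim_b = [cosine_similarity(w, b) for b in attrs_b]
    return mean(sim_a) - mean(sim_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Normalized difference in association between the two target sets."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Hypothetical 2-d vectors standing in for target and attribute words.
targets_x = [[1.0, 0.1], [0.9, 0.2]]
targets_y = [[0.1, 1.0], [0.2, 0.9]]
attrs_a = [[1.0, 0.0], [0.9, 0.1]]
attrs_b = [[0.0, 1.0], [0.1, 0.9]]

effect = weat_effect_size(targets_x, targets_y, attrs_a, attrs_b)
print(effect)
```

Swapping the two target sets flips the sign of the effect, which is why the sign is read as "bias in the expected direction or not".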

In [None]:
men_words = {"male", "man", "boy", "he", "him", "his"}

In [None]:
women_words = {"female", "woman", "girl", "she", "her", "hers"}

In [None]:
high_pay_jobs_words = {"doctor", "lawyer", "programmer", "surgeon", "executive"}

In [None]:
low_pay_jobs_words = {"nurse", "janitor", "cashier", "driver"}

<div class="alert alert-warning">

Solution_2_3_1
    
</div>

<br><br><br><br>

## Exercise 3: Building your own embeddings <a name="2"></a>
<hr>

When you work in specific domains, you might need to train your own word embeddings. In this exercise, you will train your own embeddings on a biomedical corpus using [`gensim`](https://radimrehurek.com/gensim/). 

We'll use a small subset of a corpus of [biomedical abstracts downloaded from PMC](https://www.kaggle.com/cvltmao/pmc-articles?select=a_b.csv). The original corpus is large and to get meaningful embeddings, we would ideally use the full corpus. But for the purpose of this assignment, we will only work on a sample for speed. 

**Your tasks:**

- Download `a_b.csv` from [kaggle](https://www.kaggle.com/cvltmao/pmc-articles?select=a_b.csv), and put it in the lab folder. 
- Run the code below which reads the CSV and extracts a sample of the CSV. 

In [None]:
df = pd.read_csv("a_b.csv")
df = df.dropna()
df_subset = df.sample(5000, random_state=42)

In [None]:
df_subset.head()

Word2Vec requires data to be in a specific format, as shown below.

```
[[sent1word1, sent1word2, ...], 
 [sent2word1, sent2word2, ...], 
 ...
 [sent1000word1, sent1000word2, ...],
 ...
 ]
 
```

`Gensim`, the package we are using to train Word2Vec, only requires that the input provides sentences sequentially, when iterated over. There is no need to keep everything in RAM. So we can provide one sentence, process it, forget it, load another sentence.

The `preprocessing.py` file has a class `MyPreprocessor` which preprocesses a given list of documents and **yields a memory-friendly iterator** for text which you can pass to the Word2Vec model. The preprocessing carries out the following steps:

- sentence segmentation
- tokenization
- lowercasing the text
- removing stopwords

The purpose of preprocessing is to "normalize" the text so that equivalent things (e.g., _Data_ and _data_) with respect to your task match with each other. 
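
To illustrate what a memory-friendly iterator can look like, here is a minimal sketch. `ToySentenceStream` is a hypothetical stand-in, not the actual `MyPreprocessor` code: it splits sentences on periods and tokens on whitespace, and its stopword list is made up. The key design point is that it is a class with `__iter__` rather than a one-shot generator, because `gensim` iterates over the corpus more than once (to build the vocabulary and then for each training epoch).

```python
class ToySentenceStream:
    """Hypothetical restartable iterator over preprocessed sentences."""

    def __init__(self, documents, stopwords=frozenset({"the", "a", "of", "is"})):
        self.documents = documents
        self.stopwords = stopwords

    def __iter__(self):
        # Each call starts a fresh pass, so the stream can be consumed repeatedly.
        for doc in self.documents:
            for sentence in doc.split("."):  # naive sentence segmentation
                tokens = [
                    tok
                    for tok in sentence.lower().split()  # lowercase + tokenize
                    if tok not in self.stopwords  # drop stopwords
                ]
                if tokens:
                    yield tokens  # one sentence at a time


docs = [
    "The cornea is thin. Glaucoma damages the optic nerve.",
    "Data is data.",
]
stream = ToySentenceStream(docs)
first_pass = list(stream)
second_pass = list(stream)
print(first_pass)
```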

Run the code below to carry out preprocessing of the corpus. 

In [None]:
from preprocessing import MyPreprocessor

corpus = df_subset["abstract"].tolist()
sentences = MyPreprocessor(corpus)  # memory friendly iterator

<br><br>

###  3.1 Training `Word2Vec` and `fastText`
rubric={accuracy:3,reasoning:2}

Now that we have an iterator of the data in the expected format, let's train our own word embeddings. In this exercise, you will train `Word2Vec` and `fastText` models on the `sentences` iterator above.

**Your tasks:** 

1. Train a [Word2Vec model](https://radimrehurek.com/gensim/models/word2vec.html) on `sentences` with the following hyperparameters. (This might take some time so I recommend saving the model with `model.save` for later use. See usage example [here](https://radimrehurek.com/gensim/models/word2vec.html#usage-examples).)
    * `vector_size=100`
    * `window=5`
    * `min_count=2`
2. Train [fastText model](https://radimrehurek.com/gensim/models/fasttext.html) on `sentences` with the same hyperparameters above. (This might take some time so I recommend saving the model for later use.)


> Note that the word embeddings will be better quality if we use the full corpus instead of the subset. We are using a subset in this exercise to save time. On my iMac it took ~60 s to train Word2Vec and ~65 s to train fastText on the sample above. If you are feeling adventurous and if your computer can handle it, you are welcome to train it on the full corpus. If your computer is struggling to create embeddings with 5000 documents, reduce the sample size.    

> **Please do not submit your saved models.**
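
One way fastText differs from Word2Vec is that it also learns vectors for character n-grams of each word, with `<` and `>` marking the word boundaries. The sketch below only enumerates those n-grams; `char_ngrams` is a hypothetical helper written for illustration, with defaults mirroring the `min_n=3`, `max_n=6` defaults of the `gensim` `FastText` class.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i : i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


grams = char_ngrams("pacg", min_n=3, max_n=4)
print(grams)

# Two related words share many n-grams, so their vectors share components.
shared = set(char_ngrams("selfie")) & set(char_ngrams("selfies"))
print(sorted(shared))
```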

In [None]:
from gensim.models import FastText, Word2Vec

<div class="alert alert-warning">

Solution_3_1_1
    
</div>

<div class="alert alert-warning">

Solution_3_1_2
    
</div>

### 3.2 
rubric={accuracy:1,reasoning:2}

**Your tasks:**
1. What is the vocabulary size in each of the models above? 
2. Give one or two example scenarios when you would train your own embeddings vs. when you would use pre-trained embeddings.   

> You might have to access the vocabulary size as `len(model.wv)` and word vectors as `model.wv[word]`, if `model` is your trained model. 

<div class="alert alert-warning">

Solution_3_2_1
    
</div>

<div class="alert alert-warning">

Solution_3_2_2
    
</div>

<br><br>

### 3.3 Unknown words 
rubric={accuracy:2,reasoning:2}

1. Below are the test words we tried before. Write code to check which of these words have representations in our trained embeddings. Try both the Word2Vec and fastText models you have trained above.
2. Discuss your observations. 

> Note that you might have to access word vector for a word `word` as `model.wv['word']` if `model` is your trained model. 

In [None]:
test_words = [
    "covididiot",
    "fomo",
    "frenemies",
    "anthropause",
    "photobomb",
    "selfie",
    "pxg",  # Abbreviation for pseudoexfoliative glaucoma
    "pacg",  # Abbreviation for primary angle closure glaucoma
    "cct",  # Abbreviation for central corneal thickness
    "escc",  # Abbreviation for esophageal squamous cell carcinoma
]

<div class="alert alert-warning">

Solution_3_3_1
    
</div>

<div class="alert alert-warning">

Solution_3_3_2
    
</div>

<br><br><br><br>

## Exercise 4: Product recommendation using Word2Vec
<hr>

The Word2Vec algorithm can also be used in tasks beyond text and word similarity. In this exercise we will explore using it for product recommendations. We will build a Word2Vec model so that similar products (products occurring in similar contexts) occur close together in the vector space. The context of products can be determined by the purchase histories of customers. Once we have a reasonable representation of products in the vector space, we can recommend products to customers that are "similar" (as determined by the algorithm) to their previously purchased items.

For this exercise, we will be using the [Online Retail Data Set from UCI ML repo](https://www.kaggle.com/jihyeseo/online-retail-data-set-from-uci-ml-repo#__sid=js0). The starter code below reads the data as a pandas dataframe `df`. 

> You might have to install `openpyxl` in your `conda` environment to open the `xlsx` file. 

```
conda install openpyxl
```
Download the data and save it in your lab directory. **Please do not push the data to your repository.** 

Run the code below which reads the data and carries out basic preprocessing. 

In [None]:
df = pd.read_excel("Online_Retail.xlsx")  # Takes a while to read the data.

In [None]:
print("Data frame shape: ", df.shape)
df.head()

In [None]:
df.dropna(inplace=True)
print("Shape after dropping rows with NaNs: ", df.shape)

# Convert StockCode and CustomerID columns to strings
df["StockCode"] = df["StockCode"].astype(str)
df["CustomerID"] = df["CustomerID"].astype(str)

<br><br>

### 4.1 Prepare data for Word2Vec
rubric={accuracy:4,quality:2}

Remember that word2vec requires data in the following form. 

```
[[sent1word1, sent1word2, ...], 
 [sent2word1, sent2word2, ...], 
 ...
 [sent1000word1, sent1000word2, ...],
 ...
 ]
 
```
In this context, customer purchase histories for unique customers are equivalent to sentences and stock codes are equivalent to words.   

**Your tasks:**
1. How many unique customers and unique products are present in the data above? 
2. For all unique customers, create purchasing histories for them in the following format, where each inner list corresponds to the purchase history of a unique customer. Each item in the list is a `StockCode` in the purchase history of that customer, ordered by the time of purchase. 

```
[[StockCode1_of_CustomerID1, StockCode2_of_CustomerID1, ....], 
 [StockCode1_of_CustomerID2, StockCode2_of_CustomerID2, ....], 
 ...
 [StockCode1_of_CustomerID1000, StockCode2_of_CustomerID1000, ....],
 ...
 ]
 
```
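
The target format can be illustrated on a few made-up rows. This is only a sketch on toy data: the tuples, the stock codes, and the helper `build_histories` are invented for illustration, and with the real data you would work with the `df` dataframe and its columns instead.

```python
from collections import defaultdict

# Hypothetical (customer_id, timestamp, stock_code) rows, deliberately unordered.
toy_rows = [
    ("C1", "2011-01-03 10:00", "85123A"),
    ("C2", "2011-01-01 09:30", "22423"),
    ("C1", "2011-01-01 08:15", "71053"),
    ("C2", "2011-01-02 14:45", "85123A"),
]


def build_histories(rows):
    """Group stock codes by customer, ordered by time of purchase."""
    histories = defaultdict(list)
    # ISO-style timestamp strings sort chronologically as plain strings.
    for customer_id, _, stock_code in sorted(rows, key=lambda r: (r[0], r[1])):
        histories[customer_id].append(stock_code)
    return list(histories.values())


histories = build_histories(toy_rows)
print(histories)
```

Each inner list plays the role of a sentence, and each stock code plays the role of a word.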

<div class="alert alert-warning">

Solution_4_1_1
    
</div>

<div class="alert alert-warning">

Solution_4_1_2
    
</div>

<br><br>

### 4.2 Train `Word2Vec` model 
rubric={accuracy:3}

**Your tasks:**
1. Now that your data is in a format suitable for training a Word2Vec model, train a `Word2Vec` model with the following hyperparameters:
    - `window=10` 
    - `negative=10` (for negative sampling)
    - `seed=8` 
    - `min_count=1`

<div class="alert alert-warning">

Solution_4_2_1
    
</div>

<br><br>

### 4.3 Examine product similarity 
rubric={accuracy:3,reasoning:2}

Given a word2vec model trained on purchase history data and a product description, the function `get_most_similar` below returns descriptions of the top `n` most similar products.

**Your tasks:**
1. Get similar products for the following products. 
    - 'SWIRLY CIRCULAR RUBBERS IN BAG'
    - 'POLKADOT RAIN HAT'    
2. Now pick 4 product descriptions of your choice from the data. Call `get_most_similar` for these product descriptions and examine similar products returned by the function.
3. Do the recommendations given by the model make sense? Discuss your observations. 

In [None]:
# Create products id_name and name_id dictionaries
products_id_name_dict = pd.Series(
    df.Description.str.strip().values, index=df.StockCode
).to_dict()
products_name_id_dict = pd.Series(
    df.StockCode.values, index=df.Description.str.strip()
).to_dict()

In [None]:
def get_most_similar(prod_desc, n=10, model=model):
    """
    Given product description, prod_desc, return the most similar
    products

    Parameters
    ---------
    prod_desc : str
        Product description

    n : integer
        the number of similar items to return

    model : gensim Word2Vec model
        trained gensim word2vec model on customer purchase histories

    Returns
    -------
    pandas.DataFrame
        A pandas dataframe containing n names of similar products
        and their similarity scores with the input product
        with description prod_desc.

    """
    stock_id = products_name_id_dict[prod_desc]
    try:
        similar_stock_ids = model.wv.most_similar(stock_id, topn=n)
    except KeyError:  # the stock code is missing from the model's vocabulary
        print("The product %s is not in the vocabulary" % (prod_desc))
        return

    similar_prods = []

    for (sim_stock_id, score) in similar_stock_ids:
        similar_prods.append((products_id_name_dict[sim_stock_id], score))
    return pd.DataFrame(
        similar_prods, columns=["Product description", "Similarity score"]
    )

<div class="alert alert-warning">

Solution_4_3_1
    
</div>

<div class="alert alert-warning">

Solution_4_3_2
    
</div>

<div class="alert alert-warning">

Solution_4_3_3
    
</div>

<br><br>

### (optional) 4.4 
rubric={reasoning:1}

**Your tasks:**

1. Suppose you get a purchase history for a new customer which has `StockCode` of a new product, which was not present in the training data. Would your Word2Vec model be able to provide recommendations for this product? What about fastText? Does it make sense to use the `fastText` algorithm in this case instead of Word2Vec? What would be a reasonable recommendation strategy for new products? 

<div class="alert alert-warning">

Solution_4_4
    
</div>

<br><br><br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Push all your work to your GitHub lab repository. 
4. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
5. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

Well done!! Congratulations on finishing the lab and have a restful weekend! 

![](eva-resting.png)