In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab3.ipynb")

![](img/563_lab_banner.png)

# Lab 3: Word embeddings, t-SNE, and product similarity using Word2Vec

## Imports

In [None]:
from hashlib import sha1

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

from sklearn.decomposition import PCA

<br><br><br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## Submission instructions <a name="si"></a>
rubric={mechanics}

You will receive marks for correctly submitting this assignment by following the instructions below:
    
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- [Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.
- Make at least three commits in your lab's GitHub repository.    
- Push the final .ipynb file with your solutions to your GitHub repository for this lab.        
- Before submitting your lab, run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).     
- Make sure to enroll in Gradescope via [Canvas](https://canvas.ubc.ca/courses/106525).
- Upload the .ipynb file to Gradescope.
- Make sure that your plots/output are rendered properly in Gradescope.    
- If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a PDF (preferably WebPDF) or HTML export of the .ipynb file with your solutions so that TAs can view your submission on Gradescope. 
- The data you download for this lab <b>SHOULD NOT BE PUSHED TO YOUR REPOSITORY</b> (there is also a `.gitignore` in the repo to prevent this).
- Include a clickable link to your GitHub repo for the lab just below this cell.
</div>    

_Points:_ 2

YOUR REPO LINK GOES HERE

<!-- END QUESTION -->

> **This lab is short compared to other labs because it's quiz week. Note, however, that it involves loading pre-trained models, which may take a while depending on your machine. So please start early and do not leave this lab to the last minute.**

<br><br>

## Exercise 1: Exploring pre-trained word embeddings
<hr>

The idea of word embeddings is to represent words with short (~100 to 1000 dimensions) and dense representations such that related words are close together in the vector space. One of the most popular algorithms to create word embeddings is word2vec.  

We can either create word embeddings on our own by training a model like word2vec on a large corpus or use pre-trained word embeddings, which is a more common approach. A number of pre-trained word embeddings are available out there. In the next few exercises, you will be exploring pre-trained embeddings trained on Wikipedia using an algorithm called [GloVe](https://nlp.stanford.edu/pubs/glove.pdf), which is similar to the word2vec algorithm we saw in class.

In this lab, you will explore embeddings to find word relatedness and analogies. In DSCI 575 in Block 6, we will use them for transfer learning. 
    
The code below loads pre-trained word vectors trained on Wikipedia. The original source of these word vectors is [here](https://nlp.stanford.edu/projects/glove/). You can also conveniently download them using `gensim`, as shown below. 

In this lab, you'll be using the `gensim` package. You can install it in the course conda environment as follows. 

```
> conda activate 563
> conda install -c anaconda gensim
```

In [None]:
import gensim
import gensim.downloader

print(list(gensim.downloader.info()["models"].keys()))

In [None]:
import gensim.downloader as api

glove_wiki_vectors = api.load(
    "glove-wiki-gigaword-100"
)  # This will take a while to run, especially when you run it for the first time.

In [None]:
len(glove_wiki_vectors)

There are 400,000 word vectors in this pre-trained model. 

In [None]:
glove_wiki_vectors["learning"].shape

Each vector is 100-dimensional. Below are the words most similar to the word _learning_. See [here](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html) for how similar words are found in `gensim`.

In [None]:
glove_wiki_vectors.most_similar("learning")
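To make the idea of "most similar" concrete, here is a minimal sketch that ranks words by cosine similarity, which is the measure `gensim` uses for this ranking. It is written in plain Python on a hypothetical 3-dimensional vocabulary (`toy_vectors`, `toy_most_similar`, and the numbers are made up for illustration), so it runs without loading the pre-trained vectors. It is not `gensim`'s implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings"; the GloVe vectors above have 100 dimensions.
toy_vectors = {
    "learning": [0.9, 0.8, 0.1],
    "teaching": [0.8, 0.9, 0.2],
    "education": [0.7, 0.7, 0.3],
    "banana": [0.1, 0.0, 0.9],
}


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


toy_neighbours = toy_most_similar("learning", toy_vectors)
print(toy_neighbours)
```

In this toy vocabulary, "teaching" and "education" point in nearly the same direction as "learning", so they rank above "banana".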

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 Word similarity using pre-trained embeddings
rubric={accuracy}

**Your tasks:**

- Come up with a list of 4 words of your choice and find similar words to these words using `glove_wiki_vectors` embeddings. 

<div class="alert alert-warning">

Solution_1_1
    
</div>

_Points:_ 2

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Word similarity using pre-trained embeddings
rubric={accuracy}

**Your tasks:**

1. Calculate cosine similarity for the following word pairs (`word_pairs`) using the [`similarity`](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) method of `glove_wiki_vectors`.

In [None]:
word_pairs = [
    ("coast", "shore"),
    ("clothes", "closet"),
    ("old", "new"),
    ("smart", "intelligent"),
    ("dog", "cat"),
    ("tree", "lawyer"),
]

<div class="alert alert-warning">

Solution_1_2
    
</div>

_Points:_ 2

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Representation of all words in English
rubric={accuracy}

**Your tasks:**

1. The vocabulary of the Wikipedia embeddings is quite large. The `test_words` list below contains a few new words (called neologisms) and biomedical domain-specific abbreviations. Write code to check whether `glove_wiki_vectors` has representations for these words. 
> If a given word `word` is in the vocabulary, `word in glove_wiki_vectors` will return `True`. 

In [None]:
test_words = [
    "covididiot",
    "fomo",
    "frenemies",
    "anthropause",
    "photobomb",
    "selfie",
    "pxg",  # Abbreviation for pseudoexfoliative glaucoma
    "pacg",  # Abbreviation for primary angle closure glaucoma
    "cct",  # Abbreviation for central corneal thickness
    "escc",  # Abbreviation for esophageal squamous cell carcinoma
]

<div class="alert alert-warning">

Solution_1_3
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

<br><br><br>

<!-- BEGIN QUESTION -->

### 1.4 Visualizing similar words
rubric={viz,reasoning}

Let's examine the quality of embeddings by visualizing whether similar words are close together in the vector space or not. 
Our word vectors are 100-dimensional, so to visualize them we need to reduce the dimensionality to 2 or 3 dimensions. [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) would be the simplest approach for this. For better visualization, we can also use non-linear dimensionality reduction techniques such as [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) or [UMAP](https://umap-learn.readthedocs.io/en/latest/). In this exercise, you'll use PCA and t-SNE to visualize a sample of word embeddings and compare the visualizations.  

The code below extracts word embeddings for a set of 66 words from 6 categories and stores them in the dataframe `embeddings_df`, where indices are words. 

> Feel free to experiment with the categories but in your final submission keep these categories so that it's easier for the TAs to grade your work.  

**Your tasks:**

1. Apply PCA to `embeddings_df` below to reduce dimensionality to 2 dimensions and show a scatter plot of reduced dimensions. Add labels (words) to the points in the plot.  
2. Apply t-SNE to `embeddings_df` below to reduce dimensionality to 2 dimensions and show a scatter plot of reduced dimensions. Add labels (words) to the points in the plot. 
3. Compare the scatter plots created by PCA and t-SNE and briefly discuss your observations. 

> For t-SNE, you might have to tune some hyperparameters. Show your work or briefly justify your choices. 

> Feel free to use code from lecture notes with appropriate attributions. 

In [None]:
# Create words and labels.

categories = ["french", "apple", "intelligence", "hockey", "cobain", "pca"]
subset_words = []

labels = []
j = 0
for cat in categories:
    subset_words.append(cat)
    labels.append(j)
    for similar_word, _ in glove_wiki_vectors.most_similar(cat, topn=10):
        subset_words.append(similar_word)
        labels.append(j)
    j += 1

In [None]:
embeddings_df = pd.DataFrame(data=glove_wiki_vectors[subset_words], index=subset_words)
embeddings_df.head()

In [None]:
embeddings_df.shape

<div class="alert alert-warning">

Solution_1_4
    
</div>

_Points:_ 6

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Stereotypes and biases in word embeddings
<hr>

<!-- BEGIN QUESTION -->

### 2.1 Stereotypes and biases in embeddings
rubric={reasoning}

Word vectors contain lots of useful information. But they also contain stereotypes and biases of the texts they were trained on. In the lecture, we saw an example of gender bias in Google News word embeddings. Here we are using pre-trained embeddings trained on Wikipedia data. 

**Your tasks:**

1. Explore whether there are any worrisome biases or stereotypes present in these embeddings by trying out at least 4 examples. You can use the following two methods or other methods of your choice to explore this. 
    - the `analogy` function below which gives word analogies (an example shown below)
    - [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) or [distance](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=distance#gensim.models.keyedvectors.KeyedVectors.distances) methods (an example is shown below)

> Note that most of the recent embeddings are de-biased. But you might still observe some biases in them. Also, not all stereotypes present in pre-trained embeddings are necessarily bad. But you should be aware of them when you use them in your models. 

In [None]:
def analogy(word1, word2, word3, model=glove_wiki_vectors):
    """
    Returns analogy word using the given model.

    Parameters
    --------------
    word1 : (str)
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation
    word3 : (str)
        word3 in the analogy relation
    model :
        word embedding model

    Returns
    ---------------
        pd.DataFrame
    """
    print("%s : %s :: %s : ?" % (word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])
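The call `most_similar(positive=[word3, word2], negative=[word1])` can be read as vector arithmetic: roughly, find the word closest to `word2 - word1 + word3`, excluding the three input words. The sketch below illustrates this on a hypothetical 2-dimensional vocabulary (the vectors, `unit`, and `toy_analogy` are invented for illustration). It approximates the idea and is not `gensim`'s actual implementation.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_analogy(word1, word2, word3, vectors):
    """Return the word whose vector is closest to unit(word2) - unit(word1) + unit(word3)."""
    query = unit(
        [
            b - a + c
            for a, b, c in zip(
                unit(vectors[word1]), unit(vectors[word2]), unit(vectors[word3])
            )
        ]
    )
    # Cosine similarity with each candidate, skipping the three input words.
    candidates = {
        w: sum(q * x for q, x in zip(query, unit(vec)))
        for w, vec in vectors.items()
        if w not in {word1, word2, word3}
    }
    return max(candidates, key=candidates.get)


# Hypothetical vectors: first axis ~ "gender", second axis ~ "royalty".
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}

toy_answer = toy_analogy("man", "king", "woman", toy_vectors)
print("man : king :: woman :", toy_answer)
```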

Examples of using analogy to explore biases and stereotypes.  

In [None]:
analogy("man", "doctor", "woman")

In [None]:
glove_wiki_vectors.similarity("aboriginal", "success")

In [None]:
glove_wiki_vectors.similarity("white", "success")

<div class="alert alert-warning">

Solution_2_1
    
</div>

_Points:_ 4

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Discussion
rubric={reasoning}

**Your tasks:**
1. Discuss your observations from 2.1. Are there any worrisome biases in these embeddings trained on Wikipedia?   
2. Give an example of how using embeddings with biases could cause harm in the real world.

<div class="alert alert-warning">

Solution_2_2
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 3: Building your own embeddings <a name="2"></a>
<hr>

When you work in specific domains, you might need to train your own word embeddings. In this exercise, you will train your own embeddings on a biomedical corpus using [`gensim`](https://radimrehurek.com/gensim/). 

We'll use a small subset of a corpus of [biomedical abstracts downloaded from PMC](https://www.kaggle.com/cvltmao/pmc-articles?select=a_b.csv). The original corpus is large and to get meaningful embeddings, we would ideally use the full corpus. But training meaningful word embeddings is resource intensive and not suitable for an assignment. The main purpose here is to familiarize yourself with the process of training your own embeddings. 

**Your tasks:**

- Download `a_b.csv` from [kaggle](https://www.kaggle.com/cvltmao/pmc-articles?select=a_b.csv), and put it under the data directory in the lab folder. 
- Run the code below which reads the CSV and extracts a sample of the CSV. 

In [None]:
df = pd.read_csv("data/a_b.csv")
df = df.dropna()
df_subset = df.sample(5000, random_state=42)

In [None]:
df_subset.head()

Word2Vec requires data to be in a specific format, as shown below.

```
[[sent1word1, sent1word2, ...], 
 [sent2word1, sent2word2, ...], 
 ...
 [sent1000word1, sent1000word2, ...],
 ...
 ]
 
```

`Gensim`, the package we are using to train Word2Vec, only requires that the input provides sentences sequentially when iterated over. There is no need to keep everything in RAM: we can provide one sentence, process it, forget it, and then load the next sentence.

The `preprocessing.py` file defines the class `MyPreprocessor`, which preprocesses a given list of documents and **yields a memory-friendly iterator** over the text that you can pass to the Word2Vec model. The preprocessing carries out the following steps:

- sentence segmentation
- tokenization
- turning the text into lowercase
- removing stopwords

The purpose of preprocessing is to "normalize" the text so that equivalent things (e.g., _Data_ and _data_) match with each other in the context of your task. 
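The idea of a memory-friendly iterator can be sketched as follows. This is a hypothetical, much-simplified stand-in (`ToySentenceIterator`, `TOY_STOPWORDS`, and the regular expressions are assumptions for illustration) and not the `MyPreprocessor` class from `preprocessing.py`. The point it shows is that the object implements `__iter__`, so it can be iterated over more than once, which matters because Word2Vec training makes several passes over the data.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
TOY_STOPWORDS = {"the", "is", "are", "a", "an", "of", "and", "in"}


class ToySentenceIterator:
    """Yield one preprocessed sentence (a list of tokens) at a time."""

    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        for doc in self.documents:
            # Naive sentence segmentation: split after ., ! or ? followed by whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", doc):
                # Lowercase, then keep runs of letters and digits as tokens.
                tokens = re.findall(r"[a-z0-9]+", sentence.lower())
                # Remove stopwords.
                tokens = [t for t in tokens if t not in TOY_STOPWORDS]
                if tokens:
                    yield tokens


toy_docs = ["Data is great. The data are big!", "Glaucoma damages the optic nerve."]
toy_sentences = ToySentenceIterator(toy_docs)

first_pass = list(toy_sentences)
second_pass = list(toy_sentences)  # works again because __iter__ restarts from the top
print(first_pass)
```

A plain generator would be exhausted after the first pass, which is why a class with `__iter__` is used here.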

Run the code below to carry out preprocessing of the corpus. 

In [None]:
from preprocessing import MyPreprocessor

corpus = df_subset["abstract"].tolist()
sentences = MyPreprocessor(corpus)  # memory friendly iterator

<br><br>

<!-- BEGIN QUESTION -->

### 3.1 Training `Word2Vec` and `fastText`
rubric={accuracy,reasoning}

Now that we have an iterator of the data in the expected format, let's train our own word embeddings. In this exercise, you will train `Word2Vec` and `fastText` models on the `sentences` iterator above. 

**Your tasks:** 

1. Train a [Word2Vec model](https://radimrehurek.com/gensim/models/word2vec.html) on `sentences` with the following hyperparameters. (This might take some time, so I recommend saving the model with `model.save` for later use. See the usage example [here](https://radimrehurek.com/gensim/models/word2vec.html#usage-examples).)
    * `vector_size=100`
    * `window=5`
    * `min_count=2`
2. Train a [fastText model](https://radimrehurek.com/gensim/models/fasttext.html) on `sentences` with the same hyperparameters as above. (This might take some time, so I recommend saving the model for later use.)


> Note that the word embeddings will be of better quality if we use the full corpus instead of the subset. We are using a subset in this exercise to save time. On my iMac it took ~50s to train Word2Vec and ~52s to train fastText on the sample above. If you are feeling adventurous and your computer can handle it, you are welcome to train on the full corpus. If your computer is struggling to create embeddings with 5000 documents, reduce the sample size.    

> **Please do not submit your saved models.**

<div class="alert alert-warning">

Solution_3_1
    
</div>

_Points:_ 4

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 3.2 Representation of domain-specific words
rubric={accuracy,reasoning}

**Your tasks:**
1. What is the vocabulary size in each model above? 
2. Below are the test words we tried before. Write code to check which of these words have representations in the word2vec and fastText models you trained above. 

> If `model` is your trained model, you can access the vocabulary size with `len(model.wv)` and the word vector for a word `word` with `model.wv[word]`. 

In [None]:
test_words = [
    "covididiot",
    "fomo",
    "frenemies",
    "anthropause",
    "photobomb",
    "selfie",
    "pxg",  # Abbreviation for pseudoexfoliative glaucoma
    "pacg",  # Abbreviation for primary angle closure glaucoma
    "cct",  # Abbreviation for central corneal thickness
    "escc",  # Abbreviation for esophageal squamous cell carcinoma
]

<div class="alert alert-warning">

Solution_3_2
    
</div>

_Points:_ 3

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 3.3 Discussion
rubric={reasoning}

**Your tasks:**
1. Discuss your observations from 3.2. 
2. Give example scenarios of when you would train your own embeddings vs. when you would use pre-trained embeddings.

<div class="alert alert-warning">

Solution_3_3
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 4: Product recommendation using Word2Vec
<hr>

The word2vec algorithm can also be used in tasks beyond text and word similarity. In this exercise, you will explore using it for product recommendations. We will build a word2vec model so that similar products (products occurring in similar contexts) occur close together in the vector space. The context of a product can be determined from the purchase histories of customers. Once we have a reasonable representation of products in the vector space, we can recommend products to customers that are "similar" (as determined by the model) to their previously purchased items. 
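To see what "occurring in similar contexts" means for products, the sketch below lists the (target, context) pairs that fall within a fixed window of a single purchase history. The function `context_pairs` and the toy history are hypothetical, and this is a simplification: actual word2vec training in `gensim` also involves details such as negative sampling and randomly shrunk windows, which are omitted here.

```python
def context_pairs(sequence, window=2):
    """Return (target, context) pairs for items at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs


# Hypothetical purchase history of one customer, in order of purchase.
toy_history = ["umbrella", "rain hat", "wellies", "mug"]
toy_pairs = context_pairs(toy_history, window=1)

for target, context in toy_pairs:
    print(target, "->", context)
```

Products that repeatedly share context items across many customers end up with similar vectors, which is the basis for the recommendations in this exercise.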

For this exercise, we will be using the [Online Retail Data Set from UCI ML repo](https://www.kaggle.com/jihyeseo/online-retail-data-set-from-uci-ml-repo#__sid=js0). The starter code below reads the data as a pandas dataframe `df`. 

> You might have to install `openpyxl` in your `conda` environment to open the `xlsx` file. 

```
conda install -c anaconda openpyxl
```
Download the data and save it under the data directory in your lab folder. **Please do not push the data to your repository.** 

Run the code below which reads the data and carries out basic preprocessing. 

In [None]:
df = pd.read_excel("data/Online_Retail.xlsx")  # Takes a while to read the data.

In [None]:
print("Data frame shape: ", df.shape)
df.head()

In [None]:
df.dropna(inplace=True)
print("Shape after dropping rows with NaNs: ", df.shape)

# Convert StockCode and CustomerID columns to strings
df["StockCode"] = df["StockCode"].astype(str)
df["CustomerID"] = df["CustomerID"].astype(str)

In [None]:
df["StockCode"].sort_values().unique()

<br><br>

<!-- BEGIN QUESTION -->

### 4.1 Prepare data for Word2Vec
rubric={accuracy,quality}

We will be training word2vec on customer purchase histories. But before training the model, we need to get the data into the appropriate format. Remember that word2vec requires data in the following form. 

```
[[sent1word1, sent1word2, ...], 
 [sent2word1, sent2word2, ...], 
 ...
 [sent1000word1, sent1000word2, ...],
 ...
 ]
 
```
In this context, sentences are the purchase histories of unique customers, and words are the stock codes in those purchase histories, representing the items purchased by the customers. 
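As a small illustration of this mapping, the sketch below turns a handful of hypothetical transactions into one "sentence" of stock codes per customer, ordered by time. The tuples, customer IDs, and timestamps are invented, and it uses plain Python rather than the lab's dataframe `df`, so treat it as a picture of the target format rather than a solution.

```python
from collections import defaultdict

# Hypothetical toy transactions: (customer_id, timestamp, stock_code).
toy_transactions = [
    ("c2", 3, "22423"),
    ("c1", 1, "85123A"),
    ("c1", 2, "71053"),
    ("c2", 1, "84406B"),
    ("c1", 3, "85123A"),
]

histories = defaultdict(list)
# Sort by timestamp first so that each history preserves purchase order.
for customer_id, _, stock_code in sorted(toy_transactions, key=lambda row: row[1]):
    histories[customer_id].append(stock_code)

# One inner list ("sentence") per customer; each item ("word") is a stock code.
toy_purchase_sentences = list(histories.values())
print(toy_purchase_sentences)
```

Note that a repeated purchase (here `85123A` for customer `c1`) stays in the history, just as a repeated word stays in a sentence.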

**Your tasks:**
1. How many unique customers and unique products are present in the data above? 
2. For each unique customer, create a purchase history in the following format, where each inner list corresponds to the purchase history of a unique customer. Each item in the list is a `StockCode` in the purchase history of that customer, ordered by the time of purchase. 

```
[[StockCode1_of_CustomerID1, StockCode2_of_CustomerID1, ....], 
 [StockCode1_of_CustomerID2, StockCode2_of_CustomerID2, ....], 
 ...
 [StockCode1_of_CustomerID1000, StockCode2_of_CustomerID1000, ....],
 ...
 ]
 
```

<div class="alert alert-warning">

Solution_4_1
    
</div>

_Points:_ 6

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 4.2 Train `Word2Vec` model 
rubric={accuracy}

**Your tasks:**
1. Now that your data is in a suitable format, train a `Word2Vec` model with the following hyperparameters:
    - `window=10` 
    - `negative=10` (for negative sampling)
    - `seed=8` 
    - `min_count=1`

<div class="alert alert-warning">

Solution_4_2
    
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 4.3 Examine product similarity 
rubric={accuracy,reasoning}

Given a word2vec model trained on purchase history data and a product description, the function `get_most_similar` below returns the descriptions of the top `n` most similar products. 

**Your tasks:**
1. Get similar products for the following products. 
    - 'SWIRLY CIRCULAR RUBBERS IN BAG'
    - 'POLKADOT RAIN HAT'    
2. Now pick 4 product descriptions of your choice from the data. Call `get_most_similar` for these product descriptions and examine similar products returned by the function.
3. Do the recommendations given by the model make sense? Discuss your observations. 

In [None]:
# Create products id_name and name_id dictionaries
products_id_name_dict = pd.Series(
    df.Description.str.strip().values, index=df.StockCode
).to_dict()
products_name_id_dict = pd.Series(
    df.StockCode.values, index=df.Description.str.strip()
).to_dict()

In [None]:
def get_most_similar(prod_desc, n=10, model=model):
    """
    Given product description, prod_desc, return the most similar
    products

    Parameters
    ---------
    prod_desc : str
        Product description

    n : integer
        the number of similar items to return

    model : gensim Word2Vec model
        trained gensim word2vec model on customer purchase histories

    Returns
    -------
    pandas.DataFrame
        A pandas dataframe containing n names of similar products
        and their similarity scores with the input product
        with description prod_desc.

    """
    try:
        # Both lookups raise KeyError for an unknown description or stock code.
        stock_id = products_name_id_dict[prod_desc]
        similar_stock_ids = model.wv.most_similar(stock_id, topn=n)
    except KeyError:
        print("The product %s is not in the vocabulary" % (prod_desc))
        return

    similar_prods = []

    for (sim_stock_id, score) in similar_stock_ids:
        similar_prods.append((products_id_name_dict[sim_stock_id], score))
    return pd.DataFrame(
        similar_prods, columns=["Product description", "Similarity score"]
    )

<div class="alert alert-warning">

Solution_4_3
    
</div>

_Points:_ 5

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 4.4 
rubric={reasoning}

**Your tasks:**

1. Suppose you get a purchase history for a new customer which has `StockCode` of a new product, which was not present in the training data. Would your word2vec model be able to provide recommendations for this product? What about fastText? Does it make sense to use the `fastText` algorithm in this case instead of word2vec? What would be a reasonable recommendation strategy for new products? 

<div class="alert alert-warning">

Solution_4_4
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 5: Food for thought
<hr>

Each lab will have a few challenging questions. These are usually low-risk questions and will contribute at most 5% of the lab grade. The main purpose here is to challenge yourself, dig deeper into a particular area, and go beyond what we explicitly discussed in class. When you start working on labs, attempt all other questions before moving to these challenging questions. If you are running out of time, please skip the challenging questions. 

![](img/eva-game-on.png)

<br><br>

<!-- BEGIN QUESTION -->

### (Challenging) 5.1 Exploring stereotypes using WEAT 
rubric={reasoning}

The standard way to identify embedding bias is using WEAT (Word Embedding Association Test), which comes from a [Science paper](https://purehost.bath.ac.uk/ws/portalfiles/portal/168480066/CaliskanEtAl_authors_full.pdf) from a few years ago. It is adapted from psychological tests for detecting implicit bias. 

The basic idea is that you take some representative target words (e.g., `men_words` and `women_words`) and attribute words (e.g., `high_pay_jobs_words`, `low_pay_jobs_words`), as shown below. Then we calculate a normalized, z-score-like effect size, which is positive if the bias is in the expected direction, along with a p-value based on trying all groupings of the words in the targets. 
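The effect-size part of this idea can be sketched in a few lines. The code below follows the definition in the paper as I understand it: each target word gets an association score (its mean cosine similarity to one attribute set minus its mean cosine similarity to the other), and the effect size is the difference between the mean scores of the two target sets divided by the standard deviation of the scores over all target words. The vectors are hypothetical 2-dimensional stand-ins, the choice of population standard deviation (`pstdev`) is an assumption, and the permutation-based p-value is omitted. This is not the linked implementation.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def association(w, attrs_a, attrs_b):
    """s(w, A, B): how much closer w is to attribute set A than to attribute set B."""
    return mean([cosine(w, a) for a in attrs_a]) - mean([cosine(w, b) for b in attrs_b])


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference in mean association of X and Y, scaled by the spread over X and Y."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / pstdev(s_x + s_y)


# Hypothetical 2-dimensional vectors standing in for real embeddings.
toy_men = [(1.0, 0.1), (0.9, 0.0)]
toy_women = [(-1.0, 0.1), (-0.9, 0.0)]
toy_high_pay = [(0.8, 0.6)]
toy_low_pay = [(-0.8, 0.6)]

toy_effect = weat_effect_size(toy_men, toy_women, toy_high_pay, toy_low_pay)
print(round(toy_effect, 3))
```

With real embeddings, each toy list would be replaced by the vectors of the corresponding word set, for example `[glove_wiki_vectors[w] for w in men_words]`.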

If you're interested in details, here is the [paper](https://purehost.bath.ac.uk/ws/portalfiles/portal/168480066/CaliskanEtAl_authors_full.pdf) and [here](https://github.com/kmccurdy/w2v-gender/tags) is the Python implementation of WEAT. 

**Your tasks**

1. Explore embedding biases using WEAT for target and attribute words of your choosing using the Python implementation [here](https://github.com/kmccurdy/w2v-gender/tags). 
2. Based on your exploration of the paper above, broadly explain how embeddings are de-biased. 

> You are likely to discuss this more in your ethics course next block.   

In [None]:
men_words = {"male", "man", "boy", "he", "him", "his"}

In [None]:
women_words = {"female", "woman", "girl", "she", "her", "hers"}

In [None]:
high_pay_jobs_words = {"doctor", "lawyer", "programmer", "surgeon", "executive"}

In [None]:
low_pay_jobs_words = {"nurse", "janitor", "cashier", "driver"}

<div class="alert alert-warning">

Solution_5_1
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br><br><br>

Before submitting your assignment, please make sure you have followed all the instructions in the Submission Instructions section at the top. 

Well done!! Congratulations on finishing the lab and have a restful weekend! 

In [None]:
from IPython.display import Image

Image("img/eva-resting.png")