# Applied Machine Learning - Word Embeddings

In [1]:
import os

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import DBSCAN, KMeans
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

pd.set_option("display.max_colwidth", 0)

<br><br><br><br>

## Exercise 1:  Exploring pre-trained word embeddings <a name="1"></a>
<hr>

In lecture 16, we talked about natural language processing (NLP). Using pre-trained word embeddings is very common in NLP, and they have been shown to [work well on a variety of text classification tasks](http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf). These embeddings are created by training a model such as Word2Vec on a huge text corpus, for example a Wikipedia dump or a web crawl.

A number of pre-trained word embeddings are available out there. Some popular ones are: 

- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using the fastText algorithm
    * published by Facebook
    
In this exercise, you will explore GloVe embeddings pre-trained on Wikipedia. The code below loads these word vectors, which were created using the GloVe algorithm. To run the code, you'll need the `gensim` package in your cpsc330 conda environment, which you can install as follows.

```
> conda activate cpsc330
> conda install -c anaconda gensim
```

In [2]:
import gensim
import gensim.downloader

print(list(gensim.downloader.info()["models"].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [3]:
# This will take a while to run when you run it for the first time.
import gensim.downloader as api

glove_wiki_vectors = api.load("glove-wiki-gigaword-100")

In [4]:
# `index2word` is the gensim 3.x attribute name; gensim 4.0+ renamed it to `index_to_key`.
len(glove_wiki_vectors.index2word)

400000

There are 400,000 word vectors in this pre-trained model.

<br><br>

### 1.1 Word similarity using pre-trained embeddings

Now that we have GloVe Wiki vectors (`glove_wiki_vectors`) loaded, let's explore the word vectors. 

**Your tasks:**

1. Calculate cosine similarity for the following word pairs (`word_pairs`) using the [`similarity`](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) method of the model.
2. Do the similarities make sense? 

In [5]:
word_pairs = [
    ("coast", "shore"),
    ("clothes", "closet"),
    ("old", "new"),
    ("smart", "intelligent"),
    ("dog", "cat"),
    ("tree", "lawyer"),
]

In [6]:
# Unpack each pair directly; this avoids shadowing the built-in `tuple`.
for word1, word2 in word_pairs:
    similarity = glove_wiki_vectors.similarity(word1, word2)
    print("The similarity between", word1, "and", word2, "is", similarity)

The similarity between coast and shore is 0.70002717
The similarity between clothes and closet is 0.54627603
The similarity between old and new is 0.6432488
The similarity between smart and intelligent is 0.7552733
The similarity between dog and cat is 0.8798075
The similarity between tree and lawyer is 0.076719455


The similarities make sense. The lowest value is for "tree" and "lawyer", because most of the time they don't have much to do with each other. Dogs and cats are the most common domestic pets, so it makes sense that "dog" and "cat" have the highest similarity. The remaining pairs also have fairly high similarity, which is reasonable: "smart" and "intelligent" are synonyms, a coast and a shore are usually right next to each other, and "old" and "new" are opposites, which tend to appear in similar contexts.
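
As a reminder of what the `similarity` method computes, here is a minimal sketch of cosine similarity with NumPy. The helper name `cosine_similarity` and the three-dimensional toy vectors are made up for illustration; they are not taken from the GloVe model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy vectors (not real GloVe vectors).
toy_a = [1.0, 2.0, 0.0]
toy_b = [2.0, 4.0, 0.0]  # points in the same direction as toy_a
toy_c = [0.0, 0.0, 3.0]  # orthogonal to toy_a

print(cosine_similarity(toy_a, toy_b))  # close to 1: same direction
print(cosine_similarity(toy_a, toy_c))  # 0: no shared direction
```

Because the measure depends only on direction, not length, two words can score highly whenever their vectors point the same way, regardless of how large the individual vectors are.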

<br><br>

### 1.2 Bias in embeddings

**Your tasks:**
1. A pre-trained word embedding model may output an analogy that reinforces a gender stereotype. Give an example of how using such a model could cause harm in the real world.
2. Here we are using pre-trained embeddings which are built using Wikipedia data. Explore whether any worrisome biases are present in these embeddings by trying out some examples. You can use the following two methods or other methods of your choice to explore what kind of stereotypes and biases are encoded in these embeddings. 
    - You can use the `analogy` function below, which returns word analogies. 
    - You can also use [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) or [distance](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=distance#gensim.models.keyedvectors.KeyedVectors.distances) methods. (An example is shown below.)   
3. Discuss your observations. Do you observe the gender stereotype we observed in class in these embeddings?

> Note that most of the recent embeddings are de-biased. But you might still observe some biases in them. Also, not all stereotypes present in pre-trained embeddings are necessarily bad. But you should be aware of them when you use them in your models. 

In [7]:
def analogy(word1, word2, word3, model=glove_wiki_vectors):
    """
    Returns analogy word using the given model.

    Parameters
    --------------
    word1 : (str)
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation
    word3 : (str)
        word3 in the analogy relation
    model :
        word embedding model

    Returns
    ---------------
        pd.DataFrame of candidate analogy words and their scores
    """
    print("%s : %s :: %s : ?" % (word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])

A model that encodes gender stereotypes can discriminate against a specific gender. For example, computer science is generally considered a male-dominated field. If a model that has absorbed this association were used to help decide whether someone gets accepted into a course, it could lower a woman's chances of getting into the field and so reinforce the gender stereotype.

In [8]:
glove_wiki_vectors.similarity("man", "rich")

0.46977922

In [9]:
glove_wiki_vectors.similarity("woman", "rich")

0.3507729

The similarity between "woman" and "rich" is about 25% lower than the similarity between "man" and "rich". While this isn't desirable, it is consistent with the gender pay gap: according to the [Canadian Women's Foundation](https://canadianwomen.org/the-facts/the-gender-pay-gap/), women make 71% of what men make.
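
The "about 25% lower" figure can be checked directly from the two similarity values printed above. The variable names below are only for illustration, and a single pair of similarities is weak evidence on its own.

```python
sim_man_rich = 0.46977922   # similarity("man", "rich") from the output above
sim_woman_rich = 0.3507729  # similarity("woman", "rich") from the output above

# Gap expressed relative to the "man" similarity.
relative_gap = (sim_man_rich - sim_woman_rich) / sim_man_rich
print(f"Relative gap: {relative_gap:.1%}")
```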

In [10]:
analogy("man", "househusband", "woman", glove_wiki_vectors)

man : househusband :: woman : ?


| | Analogy word | Score |
|---|---|---|
| 0 | gold-digger | 0.817556 |
| 1 | tough-as-nails | 0.783552 |
| 2 | chiropodist | 0.772597 |
| 3 | love-struck | 0.761878 |
| 4 | wet-nurse | 0.758541 |
| 5 | golddigger | 0.7579 |
| 6 | strong-minded | 0.757849 |
| 7 | self-possessed | 0.756513 |
| 8 | elisaveta | 0.751576 |
| 9 | scotswoman | 0.750467 |


In [11]:
analogy("woman", "housewife", "man", glove_wiki_vectors)

woman : housewife :: man : ?


| | Analogy word | Score |
|---|---|---|
| 0 | homemaker | 0.618261 |
| 1 | loner | 0.581096 |
| 2 | schoolteacher | 0.579345 |
| 3 | slacker | 0.566048 |
| 4 | dad | 0.550712 |
| 5 | thug | 0.535593 |
| 6 | shopkeeper | 0.526345 |
| 7 | workaholic | 0.524418 |
| 8 | mom | 0.523697 |
| 9 | old | 0.518858 |


The second analogy is reasonable: `woman : housewife :: man : ?` returns "homemaker" as the top answer. The first one is worrisome: `man : househusband :: woman : ?` returns "gold-digger" with a score of about 0.82, which means the association is fairly strong. So these embeddings do encode some gender stereotypes, but in my exploration I had to look for them actively to find one; most of the queries I tried did not show glaring flaws.
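
Analogy queries like the ones above rest on vector arithmetic: roughly, the answer to `word1 : word2 :: word3 : ?` is the word whose vector is closest (by cosine similarity) to `word2 - word1 + word3`, excluding the query words. The sketch below illustrates that idea with hypothetical two-dimensional vectors. The function `toy_analogy` and the vectors are invented for illustration, and this is a simplification rather than gensim's actual implementation of `most_similar`.

```python
import numpy as np

# Hypothetical 2-D toy embeddings: axis 0 ~ "gender", axis 1 ~ "royalty".
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.0, -1.0]),
}


def toy_analogy(word1, word2, word3, vectors):
    """Rank candidate answers to word1 : word2 :: word3 : ? by cosine similarity."""
    target = vectors[word2] - vectors[word1] + vectors[word3]
    scores = {}
    for word, vec in vectors.items():
        if word in (word1, word2, word3):
            continue  # skip the query words themselves
        cosine = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        scores[word] = float(cosine)
    # Highest similarity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked_answers = toy_analogy("man", "king", "woman", toy_vectors)
print(ranked_answers)  # "queen" ranks first, "apple" last
```

Since the answer is whatever happens to lie nearest to the target point, any association the training corpus places near that point, stereotyped or not, can surface as the top result.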

<br><br>

### 1.3 Representation of all words in English

**Your tasks:**
1. The vocabulary size of Wikipedia embeddings is quite large. Do you think it contains **all** words in the English language? What would happen if you tried to get a word vector that's unlikely to be present in the vocabulary (e.g., the word "cpsc330")?

No, it probably doesn't. Slang, onomatopoeia, and other informal words are likely to be missing from the Wikipedia embeddings. If you try to look up a word that is not in the vocabulary, the lookup fails (gensim raises a `KeyError`), because the model has no vector stored for that word.
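
One way to handle out-of-vocabulary lookups defensively is to catch the `KeyError` raised when a key is missing. The helper `get_vector_or_none` below is hypothetical, and a plain dictionary stands in for the loaded model so that the sketch runs without downloading any vectors.

```python
def get_vector_or_none(model, word):
    """Return the vector for `word`, or None if the word is out of vocabulary."""
    try:
        return model[word]
    except KeyError:
        return None


# Hypothetical stand-in for a loaded embedding model (a plain dict here).
toy_model = {"coast": [0.1, 0.2], "shore": [0.1, 0.3]}

print(get_vector_or_none(toy_model, "coast"))    # the stored vector
print(get_vector_or_none(toy_model, "cpsc330"))  # None, since the key is missing
```

With the real model, a membership check such as `"cpsc330" in glove_wiki_vectors` should serve the same purpose before indexing.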

<br><br>

### 1.4 Classification with pre-trained embeddings

We saw that you can conveniently get word vectors with `spaCy` using the `en_core_web_md` model. In this exercise, you'll use word embeddings in a multi-class text classification task. We will use the [HappyDB](https://www.kaggle.com/ritresearch/happydb) corpus, which contains about 100,000 happy moments classified into 7 categories: *affection, exercise, bonding, nature, leisure, achievement, enjoy_the_moment*. The data was crowd-sourced via [Amazon Mechanical Turk](https://www.mturk.com/). The ground truth label is not available for all examples, and in this lab, we'll only use the examples where ground truth is available (~15,000 examples). 

- Download the data from [here](https://www.kaggle.com/ritresearch/happydb).
- Unzip the file and copy it into the lab directory.

We will be using spaCy in this exercise. If you do not have spaCy in your course environment, here is how you can install it.  

```
> conda activate cpsc330
> conda install -c conda-forge spacy
```

- You also need to download the spaCy language model, which includes the pre-trained word vectors. To do so, run the following in your course `conda` environment. 
```
python -m spacy download en_core_web_md
```

The code below reads the data CSV (assuming that it's present in the current directory as *cleaned_hm.csv*), cleans it up a bit, and splits it into train and test sets. 

**Your tasks:**

1. Train logistic regression with bag-of-words features and show the classification report on the test set. 
2. Train logistic regression with the average embedding representation extracted using spaCy and show the classification report on the test set.  
3. Discuss your results. Which model performs better? Which model would be more interpretable?  
4. Are you observing any benefits of transfer learning here? Briefly discuss. 



In [13]:
df = pd.read_csv("cleaned_hm.csv", index_col=0)
sample_df = df.dropna()
sample_df.head()

| hmid | wid | reflection_period | original_hm | cleaned_hm | modified | num_sentence | ground_truth_category | predicted_category |
|---|---|---|---|---|---|---|---|---|
| 27676 | 206 | 24h | We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out. | We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out. | True | 2 | bonding | bonding |
| 27678 | 45 | 24h | I meditated last night. | I meditated last night. | True | 1 | leisure | leisure |
| 27697 | 498 | 24h | My grandmother start to walk from the bed after a long time. | My grandmother start to walk from the bed after a long time. | True | 1 | affection | affection |
| 27705 | 5732 | 24h | I picked my daughter up from the airport and we have a fun and good conversation on the way home. | I picked my daughter up from the airport and we have a fun and good conversation on the way home. | True | 1 | bonding | affection |
| 27715 | 2272 | 24h | when i received flowers from my best friend | when i received flowers from my best friend | True | 1 | bonding | bonding |


In [14]:
sample_df = sample_df.rename(
    columns={"cleaned_hm": "moment", "ground_truth_category": "target"}
)

In [15]:
train_df, test_df = train_test_split(sample_df, test_size=0.3, random_state=123)
X_train, y_train = train_df["moment"], train_df["target"]
X_test, y_test = test_df["moment"], test_df["target"]

In [16]:
import spacy
nlp = spacy.load("en_core_web_md")

In [17]:
from sklearn.metrics import classification_report

# Bag-of-words features followed by logistic regression.
pipe = make_pipeline(
    CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000)
)
# Fit the vectorizer on its own first, just to inspect the feature-matrix shape.
X_train_transformed = pipe.named_steps["countvectorizer"].fit_transform(X_train)
print("Data matrix shape:", X_train_transformed.shape)

Data matrix shape: (9887, 8060)


In [18]:
# Fit the full pipeline and evaluate on the held-out test set.
pipe.fit(X_train, y_train)
lr_prediction = pipe.predict(X_test)
print(classification_report(y_test, lr_prediction))

                  precision    recall  f1-score   support

     achievement       0.79      0.87      0.83      1302
       affection       0.90      0.91      0.91      1423
         bonding       0.91      0.85      0.88       492
enjoy_the_moment       0.60      0.54      0.57       469
        exercise       0.91      0.57      0.70        74
         leisure       0.73      0.70      0.71       407
          nature       0.73      0.46      0.57        71

        accuracy                           0.82      4238
       macro avg       0.80      0.70      0.74      4238
    weighted avg       0.82      0.82      0.81      4238
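
The `macro avg` and `weighted avg` rows of the bag-of-words report above can be approximately reproduced from the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. The sketch below uses the rounded f1-scores copied from the report, so the recomputed values may differ from the report in the last digit.

```python
# Per-class (f1-score, support), copied from the bag-of-words report above.
report_rows = {
    "achievement": (0.83, 1302),
    "affection": (0.91, 1423),
    "bonding": (0.88, 492),
    "enjoy_the_moment": (0.57, 469),
    "exercise": (0.70, 74),
    "leisure": (0.71, 407),
    "nature": (0.57, 71),
}

total_support = sum(support for _, support in report_rows.values())

# Macro average: every class counts equally, however few examples it has.
macro_f1 = sum(f1 for f1, _ in report_rows.values()) / len(report_rows)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report_rows.values()) / total_support

print(f"total support: {total_support}")
print(f"macro avg f1: {macro_f1:.2f}")
print(f"weighted avg f1 (from rounded inputs): {weighted_f1:.3f}")
```

The macro average is lower than the weighted average here because the small classes (*exercise*, *nature*) and *enjoy_the_moment* have the weakest scores, and the macro average does not discount them.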



In [19]:
X_train_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_train)])
X_test_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_test)])
X_train_embeddings.shape

(9887, 300)
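
Each of the 300 columns above comes from spaCy's `Doc.vector`, which by default is the average of the vectors of the tokens in the document. A minimal sketch of that averaging step follows; the three-dimensional token vectors are hypothetical and are not real `en_core_web_md` vectors.

```python
import numpy as np

# Hypothetical 3-D token vectors for the moment "I meditated last night".
token_vectors = np.array(
    [
        [0.0, 1.0, 2.0],  # "I"
        [4.0, 1.0, 0.0],  # "meditated"
        [2.0, 3.0, 2.0],  # "last"
        [2.0, 3.0, 0.0],  # "night"
    ]
)

# Averaging over tokens gives one fixed-length vector per document,
# regardless of how many tokens the document has.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # [2. 2. 1.]
```

This is why every moment, long or short, ends up as a row of the same width, which is what logistic regression needs as input.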

In [20]:
# Same classifier, now trained on the 300-dimensional average embeddings.
lgr = LogisticRegression(max_iter=1000)
lgr.fit(X_train_embeddings, y_train)
lr_prediction = lgr.predict(X_test_embeddings)
print(classification_report(y_test, lr_prediction))

                  precision    recall  f1-score   support

     achievement       0.80      0.86      0.83      1302
       affection       0.87      0.93      0.90      1423
         bonding       0.84      0.76      0.80       492
enjoy_the_moment       0.60      0.54      0.57       469
        exercise       0.84      0.66      0.74        74
         leisure       0.81      0.69      0.75       407
          nature       0.80      0.68      0.73        71

        accuracy                           0.81      4238
       macro avg       0.80      0.73      0.76      4238
    weighted avg       0.81      0.81      0.81      4238



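In this case, there is not much overall difference between the bag-of-words representation and the average embedding representation with logistic regression: accuracy is 0.82 versus 0.81, and macro-averaged f1 is 0.74 versus 0.76. Most of the per-class precision, recall, and f1-score values differ by less than 0.10. The biggest difference by far is the recall for *nature*, where the average embedding is better by 0.22 (0.68 versus 0.46). Recall for *exercise* also improves (0.66 versus 0.57), while *bonding* gets worse on both precision and recall.

Transfer learning does seem to help here. With only 300 features taken from a pre-trained model, instead of 8,060 bag-of-words features built from this dataset alone, the classifier matches the bag-of-words model overall and does better on the two smallest classes. The bag-of-words model is the more interpretable of the two, because each logistic regression coefficient corresponds to a specific word, whereas the individual embedding dimensions have no direct meaning.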
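
To make the interpretability comparison concrete, here is a minimal sketch of reading word-level coefficients from a bag-of-words logistic regression. The four toy moments and their labels are invented for illustration and are not from HappyDB. With two classes, `coef_` has a single row whose positive values push towards the second class in `classes_`; in the seven-class problem above there would be one row per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data (not from HappyDB).
toy_moments = [
    "went running today",
    "running felt great",
    "forest walk today",
    "forest felt peaceful",
]
toy_labels = ["exercise", "exercise", "nature", "nature"]

vec = CountVectorizer()
X_toy = vec.fit_transform(toy_moments)
clf = LogisticRegression(max_iter=1000).fit(X_toy, toy_labels)

# Recover column names in column order from the fitted vocabulary.
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Sort words from most "exercise"-like (negative) to most "nature"-like (positive).
ranked = sorted(zip(feature_names, clf.coef_[0]), key=lambda pair: pair[1])
most_exercise_word = ranked[0][0]
most_nature_word = ranked[-1][0]

print("classes:", list(clf.classes_))
print("strongest word for exercise:", most_exercise_word)
print("strongest word for nature:", most_nature_word)
```

The same inspection on the embedding-based model would only yield weights for anonymous dimensions, which is why the bag-of-words model is easier to explain.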

<br><br><br><br>