# CS4765/6765 NLP Assignment 3: Word vectors

**Due 7 November at 23:59**

In this three-part assignment you will first examine and interact with word vectors. (This part of the assignment is adapted from a CS224N assignment at Stanford.) You will then implement a new approach to sentiment analysis. Finally, you will consider possible harms of classification.

In this assignment we will use [gensim](https://radimrehurek.com/gensim/) to access and interact with word embeddings. In gensim we’ll be working with a KeyedVectors object which represents word embeddings. [Documentation for KeyedVectors is available.](https://radimrehurek.com/gensim/models/keyedvectors.html) However, this assignment description and the sample code in it might be sufficient to show you how to use a KeyedVectors object. We will use word embeddings from Google that were pretrained on about 100B words from a Google News dataset.


In [1]:
import gensim.downloader
model = gensim.downloader.load('word2vec-google-news-300')

# Part 1: Examining word vectors (6 marks)

## Polysemy and homonymy

Polysemy and homonymy are the phenomena of words having multiple meanings/senses. The nearest neighbours (under cosine similarity) for a given word can indicate whether it has multiple senses.
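As a reminder of what cosine similarity and nearest neighbours mean here, the following is a minimal sketch in plain Python. The three-dimensional vectors and the helper names (`cosine_similarity`, `nearest_neighbours`, `toy_vectors`) are made up for illustration and are not taken from the word2vec model, which uses 300-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, for illustration only)
toy_vectors = {
    "mouse": [0.9, 0.8, 0.1],
    "mice": [0.8, 0.9, 0.0],
    "keyboard": [0.7, 0.0, 0.6],
}

def nearest_neighbours(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

neighbours = nearest_neighbours("mouse", toy_vectors)
print(neighbours)
```

With these toy values, *mice* ranks above *keyboard* because its vector points in nearly the same direction as the vector for *mouse*.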

Consider the following example which shows the top-10 most similar words for *mouse*. The "input device" and "animal" senses of *mouse* are clearly visible from the top-10 most similar words. 


In [2]:
# Find the words most similar to "mouse" using cosine similarity.
# restrict_vocab=100000 limits the results to the 100000 most
# frequent words. This avoids rare words in the output. For this
# assignment, whenever you call most_similar, also pass
# restrict_vocab=100000.
model.most_similar('mouse', restrict_vocab=100000)

[('mice', 0.5896884799003601),
 ('cursor', 0.5472042560577393),
 ('joystick', 0.525871753692627),
 ('Wiimote', 0.502321720123291),
 ('Mouse', 0.4949929118156433),
 ('stylus', 0.4937674403190613),
 ('hamster', 0.4880635142326355),
 ('keyboard', 0.4737585783004761),
 ('hamsters', 0.4688059687614441),
 ('trackpad', 0.4672960937023163)]

*cursor*, *joystick*, *Wiimote*, *stylus*, *keyboard*, and *trackpad* correspond to the input device sense. *hamster* and *hamsters* correspond to the animal sense. (You can observe something similar for the different senses of the word *leaves*.)

Find a new example that exhibits polysemy/homonymy, show its top-10 most similar words, and explain why they show that this word has multiple senses. Write your answer in the code and text boxes below.

In [12]:
# TODO Write your code here
model.most_similar('cloud', restrict_vocab=100000)

[('clouds', 0.7632127404212952),
 ('Cloud', 0.6046801805496216),
 ('cloud_computing', 0.5641706585884094),
 ('dark_clouds', 0.5420233011245728),
 ('Cloud_computing', 0.5265730023384094),
 ('Clouds', 0.5223086476325989),
 ('pall', 0.5148372054100037),
 ('SaaS', 0.499646931886673),
 ('Amazon_EC2', 0.4842623770236969),
 ('virtualized', 0.4706108272075653)]

The word *cloud* appears to have multiple senses: its nearest neighbours split between terms related to weather and terms related to computing. (The capitalized forms 'Cloud' and 'Clouds' do not clearly belong to either sense.)

### 1. Weather Terms
    ('clouds', 0.7632127404212952),
    ('dark_clouds', 0.5420233011245728),
    ('pall', 0.5148372054100037)


### 2. Computing Terms
    ('cloud_computing', 0.5641706585884094),
    ('Cloud_computing', 0.5265730023384094),
    ('SaaS', 0.499646931886673),
    ('Amazon_EC2', 0.4842623770236969),
    ('virtualized', 0.4706108272075653)


## Synonyms and antonyms

The words *fast* and *speedy* are (near) synonyms (i.e., they have roughly the same meaning) and the words *fast* and *slow* are antonyms (i.e., they have opposite meanings). Note, however, that the similarity between *fast* and *slow* is higher than the similarity between *fast* and *speedy*. This should be counter to your expectations, because synonyms (which mean roughly the same thing) would be expected to be more similar than antonyms (which have opposite meanings).

In [4]:
# Find the cosine similarity between "fast" and "speedy"
model.similarity('fast', 'speedy')

np.float32(0.47171298)

In [5]:
# and between "fast" and "slow".
model.similarity('fast', 'slow')

np.float32(0.5313692)

Find another example that shows this. I.e., find three words (w1, w2, w3) such that w1 and w2 are (near) synonyms (i.e., have roughly the same meaning), and w1 and w3 are antonyms (i.e., have opposite meanings), but the similarity between w1 and w3 is higher than the similarity between w1 and w2. Explain why you think this unexpected situation might have occurred.

In [15]:
# TODO Write your code here
model.similarity('big', 'fat')


np.float32(0.2534442)

In [17]:

model.similarity('big', 'small')

np.float32(0.49586785)

TODO Write your answer here

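The antonym pair ('big', 'small') has a higher cosine similarity (0.496) than the near-synonym pair ('big', 'fat') (0.253). This is likely because word2vec learns from the contexts in which words occur, and 'big' and 'small' tend to appear in more similar contexts in the corpus (e.g., describing the same kinds of nouns) than 'big' and 'fat' do.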

## Analogies

Analogies such as *man* is to *king* as *woman* is to *X* can be solved using word embeddings. This analogy can be expressed as *X* = *woman* + *king* − *man*. The following code snippet shows how to solve this analogy with gensim. Notice that the model gets it correct! I.e., *queen* is the most similar word.

In [8]:
# Find the model's predictions for the solution to the analogy
# "man" is to "king" as "woman" is to X
model.most_similar(positive=['woman', 'king'],
                   negative=['man'],
                   restrict_vocab=100000)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006),
 ('royal_palace', 0.5087166428565979)]
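The vector arithmetic behind this analogy can be sketched in plain Python. This is a simplified illustration with made-up two-dimensional vectors (the first axis loosely stands for "royalty" and the second for "gender"); `solve_analogy` and `toy_vectors` are hypothetical names. Like gensim's `most_similar`, the sketch excludes the input words from the candidates, but gensim also normalizes the input vectors before combining them, so its scores will not match this sketch exactly.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors (hypothetical values, for illustration only)
toy_vectors = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [0.8, 0.9],
}

def solve_analogy(a, b, c, vectors):
    """Solve "a is to b as c is to X": return the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidate answers
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)
```

Here the target vector *king* − *man* + *woman* lands on the toy vector for *queen*, so *queen* is returned.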

Find a new analogy that the model is able to answer correctly (i.e., the most-similar word is the solution to the analogy). Explain briefly why the analogy holds. For the above example, this explanation would be something along the lines of a king is a ruler who is a man and a queen is a ruler who is a woman.


In [21]:
# TODO Write your code here
model.most_similar(positive=['apple', 'vegetable'],
                   negative=['fruit'],
                   restrict_vocab=100000)

[('potato', 0.5865278244018555),
 ('onion', 0.5616337656974792),
 ('veggie', 0.5172390937805176),
 ('tomato', 0.4943786561489105),
 ('sweet_potato', 0.49368950724601746),
 ('Vegetable', 0.4932582974433899),
 ('cauliflower', 0.482282817363739),
 ('edible', 0.4821529984474182),
 ('pumpkin', 0.4816454350948334),
 ('sunflower', 0.4751611351966858)]

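The analogy here is 'fruit' is to 'apple' as 'vegetable' is to X, and the model's top answer is 'potato' (followed by 'onion'). The analogy holds because an apple is a typical fruit and a potato is a typical vegetable. This likely works because, in the training corpus, 'apple' relates to 'fruit' in much the same way that 'potato' relates to 'vegetable', so the difference vector between 'apple' and 'fruit' is similar to the difference vector between 'potato' and 'vegetable'.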

## Bias

Consider the examples below. The first shows the words that are most similar to *man* and *worker* and least similar to *woman*. The second shows the words that are most similar to *woman* and *worker* and least similar to *man*.

In [10]:
# Find the words that are most similar to "man" and "worker" and
# least similar to "woman".
model.most_similar(positive=['man', 'worker'],
                   negative=['woman'],
                   restrict_vocab=100000)

[('workers', 0.5590359568595886),
 ('laborer', 0.54481041431427),
 ('foreman', 0.5192232131958008),
 ('Worker', 0.5161596536636353),
 ('employee', 0.5094279646873474),
 ('electrician', 0.49481216073036194),
 ('janitor', 0.48718902468681335),
 ('bricklayer', 0.48253133893013),
 ('carpenter', 0.47499001026153564),
 ('workman', 0.46425172686576843)]

In [11]:
# Find the words that are most similar to "woman" and "worker" and
# least similar to "man".
model.most_similar(positive=['woman', 'worker'],
                   negative=['man'],
                   restrict_vocab=100000)

[('workers', 0.6582455039024353),
 ('employee', 0.5805293917655945),
 ('nurse', 0.5249921679496765),
 ('receptionist', 0.5142489671707153),
 ('migrant_worker', 0.5001609325408936),
 ('Worker', 0.4979270100593567),
 ('housewife', 0.48609837889671326),
 ('registered_nurse', 0.4846191108226776),
 ('laborer', 0.48437267541885376),
 ('coworker', 0.48212409019470215)]

The output shows that *man* is associated with some stereotypically male jobs (e.g., *electrician*, *janitor*, *bricklayer*, *carpenter*) while *woman* is associated with some stereotypically female jobs (e.g., *nurse*, *receptionist*, *housewife*, *registered_nurse*). This indicates that there is gender bias in the word embeddings.

Find a new example, using the same approach as above, that indicates that there is bias in the word embeddings. Briefly explain how the model output indicates that there is bias in the word embeddings. (You are by no means restricted to considering gender bias here. You are encouraged to explore other ways that embeddings might indicate bias.)

In [28]:

model.most_similar(positive=['Afghanistan', 'country'],
                   negative=['Canada'],
                   restrict_vocab=100000)

[('Iraq', 0.5811731219291687),
 ('strife_torn', 0.500225305557251),
 ('insurgency', 0.4947287142276764),
 ('troops', 0.48856207728385925),
 ('Taliban_insurgency', 0.4836787283420563),
 ('Afghan', 0.4569079577922821),
 ('tribal_regions', 0.4540356397628784),
 ('counterinsurgency', 0.4527475833892822),
 ('Somalia', 0.45025157928466797),
 ('Afghans', 0.4482972323894501)]

In [29]:
# TODO Write your code here
model.most_similar(positive=['Canada', 'country'],
                   negative=['Afghanistan'],
                   restrict_vocab=100000)

[('world', 0.5181863903999329),
 ('United_States', 0.5135688781738281),
 ('nation', 0.509271502494812),
 ('Unites_States', 0.48146817088127136),
 ('Canadian', 0.47444069385528564),
 ('Nova_Scotia', 0.4674588739871979),
 ('America', 0.45913150906562805),
 ('continent', 0.45745521783828735),
 ('Europe', 0.4195738434791565),
 ('Québec', 0.41737625002861023)]

This output suggests that the word2vec embeddings encode some bias: 'Afghanistan' is associated with conflict-related words such as 'insurgency', 'strife_torn', 'troops', and 'tribal_regions', while 'Canada' is associated with neutral words such as 'world', 'United_States', 'nation', and other regions. These associations likely arise because the news text used for training mostly discusses Afghanistan in the context of conflict.


# Part 2: Sentiment Analysis (2 marks)

## Background and data

In this part of the assignment you will revisit sentiment analysis from assignment 2. You will need the data provided for that assignment.


## Approach

We will consider sentiment analysis using an average of word embeddings document representation and a multinomial logistic regression classifier. We will compare this approach to the approach using a bag-of-words document representation and logistic regression from assignment 2.
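For contrast with the embedding-based representation, here is a rough sketch of a word-count (bag-of-words) representation in plain Python. The four-word vocabulary and the `bag_of_words` helper are hypothetical; in the assignment code, `CountVectorizer` builds the vocabulary from the training texts instead.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count vector over a fixed vocabulary; tokens outside it are dropped."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy vocabulary (hypothetical, for illustration only)
vocabulary = ["climate", "fast", "speedy", "change"]

doc_a = bag_of_words(["fast", "climate", "change"], vocabulary)
doc_b = bag_of_words(["speedy", "climate", "change"], vocabulary)
print(doc_a)
print(doc_b)
```

In this representation *fast* and *speedy* occupy separate dimensions, so the two documents differ in two positions even though they mean roughly the same thing. With word embeddings, similar words have similar vectors, so the averaged document vectors would be expected to be closer.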

Complete the function `vec_for_doc` below. (You should not modify other parts of the
code.) This function takes a list consisting of the tokens in a document $d$. It then returns a vector $\vec{v}$ representing the document as the average of the embeddings for the words in the document as follows:

\begin{equation}
d = w_1, w_2, \ldots, w_n
\end{equation}
\begin{equation}
\vec{v} = \dfrac{\vec{w_1} + \vec{w_2} + \cdots + \vec{w_n}}{n}
\end{equation}

If a word in a document does not occur in the word embedding model, you can simply ignore it. As such, $n$ above is the number of token instances in document $d$ that also occur in the embedding model (as opposed to simply the number of token instances in document $d$). (Note that we would normally need to deal with the case of a document that consists entirely of words that don't occur in the embedding model, but for this dataset and embedding model, that situation does not occur, and so for now we don't worry about it.)
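The averaging described above can be illustrated with a small self-contained sketch in plain Python. The two-dimensional embedding table and the `average_vector` helper are hypothetical and stand in for the real model. Returning a zero vector for a document with no known words is an assumption of this sketch, since the assignment does not require handling that case.

```python
# Toy embedding table (hypothetical values); "zzz" is deliberately missing
toy_embeddings = {
    "climate": [1.0, 0.0],
    "reports": [0.0, 1.0],
    "good": [1.0, 1.0],
}

def average_vector(tokens, embeddings, dim=2):
    """Average the vectors of the tokens that occur in `embeddings`.

    Out-of-vocabulary tokens are skipped, so n counts only known tokens.
    The zero-vector fallback for an all-unknown document is an assumption
    of this sketch, not part of the assignment.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    n = len(known)
    # zip(*known) groups the vectors component by component
    return [sum(component) / n for component in zip(*known)]

doc_vector = average_vector(["climate", "zzz", "reports", "climate"],
                            toy_embeddings)
print(doc_vector)
```

Here $n = 3$ rather than 4, because "zzz" is not in the toy embedding table and is ignored.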

In [24]:
import math
def normalize(v):
    return v/math.sqrt(sum(v**2))

# TODO: Implement this function. tokenized_doc is a list of tokens in
# a document. Return a vector representation of the document as
# described above.
# Hints: 
# -You can get the vector for a word w using model[w] or
#  model.get_vector(w)
# -You can add vectors using + and sum, e.g.,
#  model['cat'] + model['dog']
#  sum([model['cat'], model['dog']])
# -You can see the shape of a vector using model['cat'].shape
# -The vector you return should have the same shape as a word vector 
# -This should be a very short function. If you're writing lots of
#  code, you are likely off track.
def vec_for_doc(tokenized_doc):

    # TODO: Add your code here
    valid_tokens = [w for w in tokenized_doc if w in model.key_to_index]


    vec = sum([model[w] for w in valid_tokens]) / len(valid_tokens)
    return vec


Once you've completed `vec_for_doc` above, run the code below to train logistic regression on the training data and evaluate on the dev data.

In [25]:
# This code is taken from the assignment 2 starter code, except where noted below
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import re
import csv

def get_texts_and_labels(fname):
    texts = []
    labels = []
    # Added argument for encoding; the with block closes the file when done
    with open(fname, encoding='utf8', newline='') as f:
        csv_reader = csv.reader(f)
        # Ignore header row
        next(csv_reader)
        for _, text, label in csv_reader:
            texts.append(text)
            labels.append(int(label))
    return texts, labels

word_tokenize_pattern = re.compile(r"(?u)\b\w\w+\b")
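def word_tokenize(s, apply_case_folding=True):
    tokens = word_tokenize_pattern.findall(s)
    # Lowercase only when case folding is requested (the default)
    return [x.lower() for x in tokens] if apply_case_folding else tokens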

def print_results(gold_labels, predicted_labels):
    p,r,f,_ = precision_recall_fscore_support(gold_labels, 
                                              predicted_labels, 
                                              average='macro', 
                                              zero_division=0)
    acc = accuracy_score(gold_labels, predicted_labels)

    print("Precision: ", p)
    print("Recall: ", r)
    print("F1: ", f)
    print("Accuracy: ", acc)
    print()

train_texts, train_labels = get_texts_and_labels('data/train-sample.csv')
dev_texts, dev_labels = get_texts_and_labels('data/dev.csv')

# train_vecs and dev_vecs are lists; each element is a vector
# representing a (train or dev) document
# (These two lines are new, i.e., not from the assignment 2 starter code)
train_vecs = [vec_for_doc(word_tokenize(x)) for x in train_texts]
dev_vecs = [vec_for_doc(word_tokenize(x)) for x in dev_texts]

# Train logistic regression, same as A2
lr = LogisticRegression(max_iter=500,
                        random_state=0)
clf = lr.fit(train_vecs, train_labels)
dev_predictions = clf.predict(dev_vecs)

print_results(dev_labels, dev_predictions)

Precision:  0.7867410845859122
Recall:  0.7844246902741197
F1:  0.7854856706600893
Accuracy:  0.785



Finally, evaluate on the test data

In [26]:
test_texts,test_labels = get_texts_and_labels('data/test.csv')
test_vecs = [vec_for_doc(word_tokenize(x)) for x in test_texts]
test_predictions = clf.predict(test_vecs)
print_results(test_labels, test_predictions)

Precision:  0.7230555881089105
Recall:  0.7271777098384208
F1:  0.7250353885574011
Accuracy:  0.7625



Run the code below to replicate the test results for logistic regression from A2. (This is just for convenience so we have the numbers here. We could go look them up from A2...)

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(analyzer=word_tokenize)
train_counts = count_vectorizer.fit_transform(train_texts)
dev_counts = count_vectorizer.transform(dev_texts)
test_counts = count_vectorizer.transform(test_texts)

lr_A2 = LogisticRegression(max_iter=500,
                        random_state=0)

clf_A2 = lr_A2.fit(train_counts, train_labels)

A2_dev_predictions = clf_A2.predict(dev_counts)
print_results(dev_labels, A2_dev_predictions)

A2_test_predictions = clf_A2.predict(test_counts)
print_results(test_labels, A2_test_predictions)

Precision:  0.7900114973711068
Recall:  0.7734038142620232
F1:  0.7801965694519533
Accuracy:  0.78

Precision:  0.6405949791868577
Recall:  0.6391179383355766
F1:  0.6397733310398869
Accuracy:  0.684375



Compare the results on the test data here to the results using logistic regression on the test data for assignment 2. The difference between these two approaches is the document representation. In this assignment we used a document representation based on average of word embeddings. In assignment 2 we used a document representation based on word counts. Which method performs better?

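On the test data, the average-of-word-embeddings representation performs better: accuracy is 0.7625 versus 0.684 for word counts, and macro F1 is 0.725 versus 0.640. (On the dev data the two are nearly tied, with accuracy 0.785 versus 0.78.) A likely explanation is that word embeddings capture **semantic similarity** between words, while word counts only track frequency. This lets the embedding-based model **generalize better** to words that are rare or unseen in the training data, including **synonyms** of training words. Both approaches use the same logistic regression classifier, so the difference comes from the document representation; a small training set probably hurts the sparse word-count features more than the dense embedding features.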


# Part 3: Considering harms of classification (2 marks)

In this part we will consider the potential for harms from the classifiers that we have built.

Consider the following two sentences, which we will treat as short documents:

* Betsy is discouraged by climate reports.
* Jasmine is discouraged by climate reports.

These are made-up sentences inspired by Kiritchenko and Mohammad (2018), discussed in lecture 8, slide 29. Note that these sentences differ only in the first word, which is a name, and that, according to Kiritchenko and Mohammad, "Betsy" is a common European American female name and "Jasmine" is a common African American female name. The sentences are intended to be like those from the assignment 2 data in that they are climate related.

Run the code cells below to get class predictions for these documents from the classifiers considered in Part 2.

Svetlana Kiritchenko and Saif Mohammad. 2018. [Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems](https://aclanthology.org/S18-2005/). In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguistics.



In [28]:
# Get the prediction for a single document using a document representation based on average of word embeddings 
def get_prediction(doc):
    vec = vec_for_doc(word_tokenize(doc))
    prediction = clf.predict([vec])
    return prediction[0]

# Get the prediction for a single document using a document representation based on word counts
# (I.e., the logistic regression approach from assignment 2)
def get_prediction_a2(doc):
    counts = count_vectorizer.transform([doc])
    prediction = clf_A2.predict(counts)
    return prediction[0]

In [29]:
d_betsy = "Betsy is discouraged by climate reports."
print(d_betsy)
print("Word embeddings:", get_prediction(d_betsy))
print("Word counts (A2):", get_prediction_a2(d_betsy))
print()

d_jasmine = "Jasmine is discouraged by climate reports."
print(d_jasmine)
print("Word embeddings:", get_prediction(d_jasmine))
print("Word counts (A2):", get_prediction_a2(d_jasmine))

Betsy is discouraged by climate reports.
Word embeddings: 1
Word counts (A2): 1

Jasmine is discouraged by climate reports.
Word embeddings: 0
Word counts (A2): 1


Based on the behaviour you observe above, write a brief discussion below about potential harms of classification from these classifiers. In your discussion, you might consider some of the following points:
* Do the classifiers make the same predictions?
* Is a given classifier consistent in its predictions for the two sentences?
* If you observe differences in predictions, what do you believe causes this?
* What do you think the gold-standard class for these sentences should be?


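The two classifiers do not make the same predictions. The word-count classifier from assignment 2 is consistent: it predicts class 1 for both sentences. The word-embedding classifier is not: it predicts 1 for the "Betsy" sentence and 0 for the "Jasmine" sentence. Because the two sentences differ only in the name, the difference must come from the embedding of the name, which likely reflects associations that the pretrained embeddings learned from their training corpus. The word-count classifier is presumably consistent because the names carry little or no weight in that model. Both sentences express the same sentiment, so they should receive the same gold-standard label. A classifier whose output changes with a name that is associated with a particular group could treat people differently depending on who they are, which is a potential harm.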
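The kind of check performed above, where only the name changes between otherwise identical sentences, can be sketched as a small helper. Everything here is illustrative: `check_name_invariance` is a hypothetical name, and `toy_predict` is a stand-in classifier that deliberately keys on the name so that the check has something to catch.

```python
def check_name_invariance(template, names, predict):
    """Return (is_consistent, predictions) for `template` filled with each name.

    `predict` is any function that maps a document string to a class label.
    """
    predictions = {name: predict(template.format(name=name)) for name in names}
    # Consistent means every name received the same predicted label
    is_consistent = len(set(predictions.values())) <= 1
    return is_consistent, predictions

# Stand-in classifier (hypothetical): its output depends only on the name
def toy_predict(doc):
    return 0 if doc.startswith("Jasmine") else 1

consistent, preds = check_name_invariance(
    "{name} is discouraged by climate reports.",
    ["Betsy", "Jasmine"],
    toy_predict,
)
print(consistent, preds)
```

In the notebook, `get_prediction` or `get_prediction_a2` could be passed as `predict` in place of `toy_predict`, and a longer list of names would give a broader picture than a single pair.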

# Submitting your work

When you're done, submit a3.ipynb to the assignment 3 folder on D2L.