# Assignment 2: Word embedding

In the second assignment, we are going to experiment with GloVe embeddings.
1. Load and study the GloVe vectors (15') 
  - Nearest neighbors.  
  - Word analogy.  
  - Bias in vectors.  
2. Train an embedding-based classifier. (10')  
  - Use GloVe  
  - Use Cohere embedding vectors and compare the performance to GloVe.   

## 1. Load and study the GloVe vectors
First, download `glove.6B.zip` from [the website](https://nlp.stanford.edu/projects/glove/).  
Then, load the vectors using the following script.

In [1]:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load GloVe vectors
def load_glove_vectors(file_path):
    word_vectors = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            word_vectors[word] = vector
    return word_vectors

glove_vectors = load_glove_vectors('Desktop/glove.6B/glove.6B.300d.txt')

### Nearest neighbors
Implement a "find nearest neighbors" function. Look for the 5 nearest neighbors of each of the following words. Print out the found neighbors as well as the similarities. Comment on the findings.  
`asdfg, king, queen, book, pen, raise, draft, kellogg`  
Note that some of these words might not occur in the vocabulary.  

In [3]:
# Nearest neighbor function
from sklearn.metrics.pairwise import cosine_similarity

def find_nearest_neighbors(word, word_vectors, n=5):
    if word not in word_vectors:
        return f"'{word}' not found in vocabulary"
    
    word_vector = word_vectors[word].reshape(1,-1)
    sims = {}
    for other_word, vector in word_vectors.items():
        if other_word != word:
            sim = cosine_similarity(word_vector, vector.reshape(1,-1))[0][0]
            sims[other_word] = sim
            
    nearest_neighbors = sorted(sims.items(), key=lambda x: x[1], reverse=True)[:n]
    return nearest_neighbors


for w in ["asdfg", "king", "queen", "book", "pen", "raise", "draft", "kellogg"]:
    print(find_nearest_neighbors(w, glove_vectors))

'asdfg' not found in vocabulary
[('queen', 0.6336469), ('prince', 0.61966234), ('monarch', 0.58996207), ('kingdom', 0.57912666), ('throne', 0.5606488)]
[('elizabeth', 0.67714477), ('princess', 0.6356763), ('king', 0.6336469), ('monarch', 0.58141875), ('royal', 0.5430526)]
[('books', 0.79862493), ('author', 0.71234983), ('published', 0.6973031), ('novel', 0.69667095), ('memoir', 0.64656407)]
[('ballpoint', 0.54257977), ('pens', 0.54234904), ('pencil', 0.46788222), ('ink', 0.45545986), ('le', 0.42909268)]
[('raising', 0.78430694), ('raised', 0.7381018), ('raises', 0.7100081), ('increase', 0.61106485), ('interest', 0.5880811)]
[('drafted', 0.8233013), ('drafts', 0.608817), ('drafting', 0.5927947), ('pick', 0.53310865), ('proposal', 0.52689034)]
[('cereal', 0.46913424), ('halliburton', 0.44785973), ('w.k.', 0.44425425), ('kbr', 0.42964092), ('kellog', 0.40224463)]


The nearest neighbors are closely related to the query word. For example, the neighbors of `king` are `queen` and other royalty terms (`prince`, `monarch`, `kingdom`, `throne`), and `queen` likewise maps to royalty terms and to `elizabeth`. Many of the top neighbors are inflected forms of the query: `books` for `book`; `raising`, `raised`, and `raises` for `raise`; and `drafted`, `drafts`, and `drafting` for `draft`. Some words have much weaker neighbors: the best matches for `pen` and `kellogg` have similarities of only about 0.54 and 0.47, and `asdfg` is not in the vocabulary at all.
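
The loop above calls `cosine_similarity` once per vocabulary word, which is slow for 400K vectors. A minimal sketch of a vectorized alternative follows, assuming the vectors fit in memory as one matrix. The names `build_index` and `nearest_neighbors_fast` are hypothetical helpers, and the toy vectors are made up for illustration rather than taken from GloVe.

```python
import numpy as np

def build_index(word_vectors):
    """Stack all vectors into one L2-normalized matrix (hypothetical helper)."""
    words = list(word_vectors)
    matrix = np.stack([np.asarray(word_vectors[w], dtype=np.float32) for w in words])
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    matrix = matrix / np.maximum(norms, 1e-12)  # guard against zero vectors
    word_to_row = {w: i for i, w in enumerate(words)}
    return words, matrix, word_to_row

def nearest_neighbors_fast(word, index, n=5):
    """Return the n most cosine-similar words to `word`, excluding the word itself."""
    words, matrix, word_to_row = index
    if word not in word_to_row:
        return f"'{word}' not found in vocabulary"
    row = word_to_row[word]
    sims = matrix @ matrix[row]  # rows are unit length, so this is cosine similarity
    sims[row] = -np.inf          # exclude the query word
    top = np.argsort(-sims)[:n]
    return [(words[i], float(sims[i])) for i in top]

# Toy vectors for illustration only, not real GloVe values
toy_vectors = {
    "king": np.array([1.0, 0.9, 0.0]),
    "queen": np.array([0.9, 1.0, 0.1]),
    "book": np.array([0.0, 0.1, 1.0]),
}
toy_index = build_index(toy_vectors)
print(nearest_neighbors_fast("king", toy_index, n=2))
```

With the real data, the index would be built once with `build_index(glove_vectors)` and reused for every query word.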

### Word analogy
What are the 5 words whose vectors are closest to the vector $v(king)-v(man)+v(woman)$?  
Modify the nearest neighbor function you implemented above. Find the words and the corresponding similarities. Comment on the findings.

In [4]:
def find_word_analogy(word, word_vectors, pair=("man", "woman"), n=5):
    if word not in word_vectors or pair[0] not in word_vectors or pair[1] not in word_vectors:
        return "One of the words not found in GloVe vocabulary"
    analogy = word_vectors[word] - word_vectors[pair[0]] + word_vectors[pair[1]]
    sims = {}
    
    for other_word, vector in word_vectors.items():
        sim = cosine_similarity(analogy.reshape(1,-1), vector.reshape(1,-1))[0][0]
        sims[other_word] = sim
        
    nearest_neighbors = sorted(sims.items(), key=lambda x: x[1], reverse=True)[:n]
    return nearest_neighbors
        

find_word_analogy("king", glove_vectors)

[('king', 0.8065859),
 ('queen', 0.6896163),
 ('monarch', 0.5575491),
 ('throne', 0.5565375),
 ('princess', 0.5518684)]

The findings make sense. `king` ranks first (0.81), most likely because the query vector is still dominated by $v(king)$ and the function does not exclude the input words. `queen` comes second (0.69), which reflects the offset between `man` and `woman` applied to `king`. The next result, `monarch`, is a gender-neutral royalty term. `throne` and `princess` are also royalty terms; of the two, only `princess` carries a female association.
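
Analogy evaluations commonly skip the input words so that the query word cannot return itself. The sketch below shows that variant. The function name `analogy_excluding_inputs` is hypothetical, and the toy vectors are hand-built so that $v(king)-v(man)+v(woman)$ equals $v(queen)$ exactly; real GloVe vectors are not this clean.

```python
import numpy as np

def analogy_excluding_inputs(a, b, c, word_vectors, n=5):
    """Rank words by cosine similarity to v(a) - v(b) + v(c), skipping a, b, and c."""
    for w in (a, b, c):
        if w not in word_vectors:
            return f"'{w}' not found in vocabulary"
    target = word_vectors[a] - word_vectors[b] + word_vectors[c]
    target = target / np.linalg.norm(target)
    sims = {}
    for other, vec in word_vectors.items():
        if other in (a, b, c):
            continue  # the input words are not valid answers
        sims[other] = float(vec @ target / np.linalg.norm(vec))
    return sorted(sims.items(), key=lambda x: x[1], reverse=True)[:n]

# Toy vectors for illustration only: axes are (royalty, male, female)
toy_vectors = {
    "king": np.array([1.0, 1.0, 0.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "book": np.array([0.2, 0.5, 0.1]),
}
print(analogy_excluding_inputs("king", "man", "woman", toy_vectors))
```

On the real vectors the call would be `analogy_excluding_inputs("king", "man", "woman", glove_vectors)`.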

### Bias in word vectors
Now we will follow the procedure of [Caliskan et al (2018)](https://purehost.bath.ac.uk/ws/portalfiles/portal/168480066/CaliskanEtAl_authors_full.pdf) and compute the following statistic to assess the bias of a word $w$:  
$$s(w, A, B) = \frac{\textrm{mean}_{a\in A} \textrm{cos}(w,a) - \textrm{mean}_{b\in B}\textrm{cos}(w,b)}{\textrm{stdev}_{x\in A\cup B} \textrm{cos}(w,x)}$$
where:  
- $A$ and $B$ are the "attribute words" of two categories. In this assignment, let's use the following collection:  
  `A:` male, man, boy, brother, he, him, his, son  
  `B:` female, woman, girl, sister, she, her, hers, daughter    
- The operation $\textrm{cos}(\cdot, \cdot)$ computes the cosine similarities between the embedding vectors of the two words.  

Compute the statistic for the following occupations. What does the statistic reveal? Comment on the results.  
`technician, accountant, supervisor, engineer, worker, doctor, physician, nurse, teacher`

In [5]:
def analyze_bias(w, A, B, word_vectors):
    if w not in word_vectors:
        return f"'{w}' not found in vocabulary"

    def cos(u, v):
        # Cosine similarity between the embeddings of words u and v
        return cosine_similarity(word_vectors[u].reshape(1,-1), word_vectors[v].reshape(1,-1))[0][0]

    sims_A = [cos(w, a) for a in A]
    sims_B = [cos(w, b) for b in B]
    # Difference of the mean similarities, divided by the std over A and B together
    return (np.mean(sims_A) - np.mean(sims_B)) / np.std(sims_A + sims_B)

masculine_words = ["male", "man", "boy", "brother", "he", "him", "his", "son"]
feminine_words = ["female", "woman", "girl", "sister", "she", "her", "hers", "daughter"]

for w in ["technician", "accountant", "supervisor", "engineer", "worker", "doctor", "physician", "nurse", "teacher"]:
    s = analyze_bias(w, masculine_words, feminine_words, glove_vectors)
    print(f"{w}\t {s:.4f}")

technician	 0.1471
accountant	 -0.3167
supervisor	 -0.8852
engineer	 0.9844
worker	 -0.5381
doctor	 0.2340
physician	 0.5511
nurse	 -1.4563
teacher	 -0.6045


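Words with a positive value are more strongly associated with the male attribute words: technician, engineer, doctor, and physician. Words with a negative value are more strongly associated with the female attribute words: accountant, supervisor, worker, nurse, and teacher. The strongest associations are `nurse` (-1.4563) toward the female set and `engineer` (0.9844) toward the male set, while `technician` (0.1471) and `doctor` (0.2340) are close to neutral. This suggests that the GloVe vectors encode gender stereotypes about occupations.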
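
To make the sign convention of $s(w, A, B)$ concrete, here is a small sketch on made-up 2-D vectors. The helper names `cosine` and `bias_statistic` and all vectors are assumptions for illustration; none of the values come from GloVe.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def bias_statistic(w_vec, A_vecs, B_vecs):
    """s(w, A, B): difference of mean similarities over the std across A and B together."""
    sims_A = [cosine(w_vec, a) for a in A_vecs]
    sims_B = [cosine(w_vec, b) for b in B_vecs]
    return (np.mean(sims_A) - np.mean(sims_B)) / np.std(sims_A + sims_B)

# Toy attribute sets: A points along the first axis, B along the second
A_vecs = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
B_vecs = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]

w_near_A = np.array([0.8, 0.2])  # closer to the A vectors
w_near_B = np.array([0.2, 0.8])  # closer to the B vectors

s_near_A = bias_statistic(w_near_A, A_vecs, B_vecs)
s_near_B = bias_statistic(w_near_B, A_vecs, B_vecs)
print(s_near_A > 0, s_near_B < 0)
```

A word closer to set $A$ gets a positive score and a word closer to set $B$ gets a negative one, which matches the reading of the occupation scores above.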

## 2. Train embedding-based classifiers

### Train a GloVe-based classifier
Train a binary classifier using the GloVe embeddings. This classifier takes the average of the embedded words in a sentence, followed by an MLP with a hidden layer of 100 units. The classifier can be implemented with scikit-learn or pytorch.  
Report the classification accuracy on the *validation* set.

In [6]:
from datasets import load_dataset  # This dataset is loaded in the same way as Assignment 1.
import pandas as pd

ds = load_dataset("glue", "sst2")
train_data = pd.DataFrame(ds["train"])
X_train_text = train_data["sentence"]
Y_train = train_data["label"]

val_data = pd.DataFrame(ds["validation"])
X_val_text = val_data["sentence"]
Y_val = val_data["label"]

test_data = pd.DataFrame(ds["test"])
X_test_text = test_data["sentence"]
Y_test = test_data["label"]

In [16]:
from sklearn.neural_network import MLPClassifier 
from sklearn.metrics import accuracy_score

def convert_sentence(sentence, word_vectors):
    words = sentence.split()
    valid_words = [word_vectors[word] for word in words if word in word_vectors]
    if len(valid_words) > 0:
        return np.mean(valid_words, axis=0)
    else:
        return np.zeros(300) 

def train_eval_model(data, glove_vectors):
    X_train_text, Y_train, X_val_text, Y_val, X_test_text, Y_test = data 
    
    X_train = np.array([convert_sentence(sentence, glove_vectors) for sentence in X_train_text])
    X_val = np.array([convert_sentence(sentence, glove_vectors) for sentence in X_val_text])
    X_test = np.array([convert_sentence(sentence, glove_vectors) for sentence in X_test_text])
        
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
    mlp.fit(X_train, Y_train)
    
    Y_val_pred = mlp.predict(X_val)
    val_accuracy = accuracy_score(Y_val, Y_val_pred)
    print(f"Validation Accuracy using GloVe embeddings: {val_accuracy:.4f}")


train_eval_model((X_train_text, Y_train, X_val_text, Y_val, X_test_text, Y_test), glove_vectors)

Validation Accuracy using GloVe embeddings: 0.7263
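
Because `convert_sentence` silently skips tokens that are missing from the vocabulary, it can help to check how many tokens are actually covered. The sketch below measures that with the same whitespace tokenization; `vocabulary_coverage` is a hypothetical helper and the sentences are toy examples.

```python
def vocabulary_coverage(sentences, vocabulary):
    """Fraction of whitespace-separated tokens that appear in `vocabulary`."""
    total = 0
    found = 0
    for sentence in sentences:
        for token in sentence.split():
            total += 1
            if token in vocabulary:
                found += 1
    return found / total if total else 0.0

# Toy data for illustration only
toy_vocabulary = {"a", "great", "movie", "not"}
toy_sentences = ["a great movie", "not great at all"]
coverage = vocabulary_coverage(toy_sentences, toy_vocabulary)
print(f"{coverage:.2f}")
```

On the assignment data this would be called as `vocabulary_coverage(X_val_text, glove_vectors)`. A low value would suggest that the tokenization (casing, punctuation) does not match the GloVe vocabulary.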


### Train a Cohere-based classifier
Repeat the previous step, with the very important difference of using [Cohere's embedding](https://docs.cohere.com/reference/embed) instead of GloVe.  
Report the classification accuracy on the *validation* set. 

In [15]:
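import os
import cohere

# Read the API key from the environment instead of hard-coding it in the notebook.
# The variable name COHERE_API_KEY is an assumption; set it before running this cell.
co = cohere.ClientV2(os.environ["COHERE_API_KEY"])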

#X_train_text = X_train_text[:95]
#X_test_text = X_test_text[:95]
#X_val_text = X_val_text[:95]
#Y_train = Y_train[:95]
#Y_test = Y_test[:95]
#Y_val = Y_val[:95]


def get_cohere_embeddings(sentences, batch_size=95):
    embeddings = []
    # Cohere API call to get embeddings for a batch of sentences
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        response = co.embed(texts=batch, model="embed-english-v3.0", input_type='classification', embedding_types=['float'] )
        embeddings.extend(response.embeddings.float_)
    return embeddings

def train_eval_model_cohere(data):
    X_train_text, Y_train, X_val_text, Y_val, X_test_text = data
    
    # Get Cohere embeddings for train, validation, and test sets
    X_train = get_cohere_embeddings(X_train_text.tolist())
    X_val = get_cohere_embeddings(X_val_text.tolist())
    X_test = get_cohere_embeddings(X_test_text.tolist())
    
    # Train the MLP classifier
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
    mlp.fit(X_train, Y_train)
    
    # Validate the model
    Y_val_pred = mlp.predict(X_val)
    val_accuracy = accuracy_score(Y_val, Y_val_pred)
    print(f"Validation Accuracy using Cohere embeddings: {val_accuracy:.4f}")

# Train and evaluate the model
train_eval_model_cohere((X_train_text, Y_train, X_val_text, Y_val, X_test_text))

Validation Accuracy using Cohere embeddings: 0.9263
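
Every run of the cell above sends the whole dataset to the API again. One possible way to avoid that is to cache the embeddings on disk, sketched below. `cached_embeddings` is a hypothetical helper, and `fake_embed` is a stand-in so the example runs without an API key; in the notebook, `get_cohere_embeddings` would be passed instead.

```python
import os
import tempfile
import numpy as np

def cached_embeddings(sentences, embed_fn, cache_path):
    """Load embeddings from cache_path if it exists, else compute and save them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(embed_fn(sentences), dtype=np.float32)
    np.save(cache_path, embeddings)
    return embeddings

# Stand-in for an API call; it records how many times it is invoked
calls = []
def fake_embed(sentences):
    calls.append(len(sentences))
    return [[float(len(s)), 1.0] for s in sentences]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "val_embeddings.npy")
    first = cached_embeddings(["good", "bad film"], fake_embed, path)
    second = cached_embeddings(["good", "bad film"], fake_embed, path)

print(len(calls))  # the embedding function ran only once
```

Note that this simple cache is keyed on the file path only, so it must be deleted by hand if the input sentences change.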


### GloVe vs Cohere
Compare the performances of the two classifiers, and comment on your observations.

The validation accuracy with GloVe embeddings is 0.7263, while the validation accuracy with Cohere embeddings is 0.9263. This suggests that Cohere embeddings capture the context of words in a sentence better than averaged GloVe vectors. Averaging discards word order, so a construction such as negation is hard to represent.