# Word Vectors Gone Wrong: Fixing Gender Stereotypes in Language Models

## Problem Description


Language models process words as arrays of numbers, called word vectors (or word embeddings). These vectors are created based on how words are used in context, so they capture the distributional properties of words. Word vectors can be thought of as coordinates in a multi-dimensional space, with the distance between them capturing the semantic and syntactic relations between words.

In a seminal [article](https://aclanthology.org/P16-1158/), Ekaterina Vylomova and colleagues show that word vectors trained on English data exhibit a curious property: the spatial difference between the vectors of 'king' and 'queen' is approximately the same as the difference between the vectors of 'man' and 'woman'. This difference essentially captures **gender**. Similarly, the difference between 'king' and 'man' is approximately the same as that between 'queen' and 'woman', capturing the notion of royalty.

The way gender is reflected in word vectors has received special attention in NLP, because while sometimes word vectors capture true gender roles (e.g. a king is by definition male), other times they capture undesirable societal biases, e.g. they place 'engineer' and 'man' in the same relationship as 'housekeeper' and 'woman'. This does not seem fair, given that professions such as engineer or housekeeper should be non-gender specific.

![](https://i.ibb.co/RNjF8MH/Screenshot-2023-11-22-at-16-01-27.png)

We don't want to have models that promote stereotypes about which jobs are suitable for men or women, so we should find a way to fix this problem. The tasks presented in this notebook will guide you to one possible solution.

## Technical Specifications

All team solutions should be submitted as a modified and compiled copy of this base notebook. You also need to provide a file of the word vectors you created.

The notebook contains specific tasks you need to accomplish and provides code when necessary. Cells marked with the `###DO NOT CHANGE THIS CELL###` comment have to remain as they are. Other cells can be changed; in particular, the ones marked `###YOUR CODE GOES HERE###` must be edited to complete the tasks.


Your goal is to get familiar with word vectors and the problem of bias, which is a common issue in Artificial Intelligence applications.

## Resources

You can read more on gender bias in word vectors in the paper [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://proceedings.neurips.cc/paper_files/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf) by Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. Proceedings of NIPS 2016.

There are some articles/tutorials online that explain the main concepts of the paper (neutralization and equalization of word vectors) such as [Debiasing Word Embeddings with Geometry](https://medium.com/@mihird97/debiasing-word-embeddings-with-geometry-d2c471ab4ae6).




## Task 1: Creating Word Vectors

---



One popular method for obtaining word vectors is to use a pre-trained model such as Word2Vec or GloVe (Global Vectors for Word Representation).

🎯 The goal is to get familiar with GloVe, a pre-trained model used to create word vectors.

Deliverables: Extract vectors for the example words provided below and then save them in a txt file. You should deliver (1) the txt file with the words and their corresponding vectors (2) a read_glove_vecs python function that reads the words and vectors from your .txt file like:

```
words, word_to_vec_map = read_glove_vecs('w2v_gnews_small.txt')
```






In [None]:
pip install --upgrade gensim




In [None]:
import numpy as np
import gensim.downloader as api
from scipy import spatial

# Download pre-trained GloVe word vectors
glove_vectors = api.load("glove-wiki-gigaword-100")
print(glove_vectors)


KeyedVectors<vector_size=100, 400000 keys>


In [None]:
api

<module 'gensim.downloader' from '/usr/local/lib/python3.11/dist-packages/gensim/downloader.py'>

In [None]:
words = glove_vectors.index_to_key

In [None]:
"meow" in words
glove_vectors["meow"]

array([ 0.51933 ,  0.34896 ,  0.67159 , -0.65048 , -0.20647 ,  1.1391  ,
        0.72832 ,  0.33286 ,  0.60227 , -0.036923, -0.63565 ,  0.78525 ,
        0.071586, -0.5137  , -0.090836,  0.20202 ,  0.50261 , -0.12442 ,
        0.77483 ,  0.16876 ,  0.21771 ,  0.14017 ,  0.057398, -1.1714  ,
       -0.18049 ,  1.2412  ,  0.014619,  1.3285  , -0.38214 , -0.53554 ,
       -0.023013, -0.18581 ,  1.0577  ,  0.088003,  0.12549 ,  0.42036 ,
        0.49486 ,  0.44585 ,  0.057202, -1.4088  ,  0.31676 , -0.19655 ,
       -1.02    , -0.51207 , -0.23792 ,  0.59425 , -0.66428 , -0.17406 ,
        0.19677 , -0.21779 , -1.1929  ,  1.0944  ,  0.16869 ,  0.35365 ,
       -0.65335 , -0.11559 ,  0.92395 ,  0.5306  , -0.96811 , -0.6783  ,
        0.09162 , -0.21073 ,  0.20328 ,  0.34977 ,  0.62548 , -0.06314 ,
       -0.2925  ,  0.24248 ,  0.84212 ,  0.077467, -0.63196 , -0.14174 ,
       -0.27297 ,  0.20881 , -0.38597 , -0.47118 , -1.0894  , -0.1466  ,
        0.43221 , -0.024394,  0.33551 , -1.0582  , 

In [None]:
# PyTorch also provides its own cosine-similarity module
import torch
input1 = torch.randn(100, 128)
input2 = torch.randn(100, 128)
# dim=1 compares the two inputs row by row; each result is dot(a, b) / (norm(a) * norm(b))
cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
print(cos(input1, input2))

tensor([ 0.0857,  0.0413,  0.0212, -0.0101, -0.0430, -0.0251, -0.0421,  0.0664,
        -0.0908, -0.0658,  0.0400, -0.0038,  0.0530,  0.0741, -0.1391, -0.1479,
         0.0696, -0.0213,  0.0647, -0.0564, -0.1831, -0.0147,  0.0858, -0.0625,
        -0.0970,  0.0509,  0.0304,  0.0263,  0.0311,  0.0948,  0.0310,  0.0179,
         0.0226,  0.0363, -0.0463,  0.1682,  0.0811,  0.0062,  0.1188,  0.0571,
         0.0382, -0.0917,  0.0051,  0.0557,  0.0191,  0.1321,  0.0848,  0.0218,
        -0.0830,  0.1028,  0.0740, -0.0493, -0.1332, -0.0140,  0.0016, -0.0380,
        -0.0752, -0.0021,  0.0128, -0.0346,  0.0291, -0.0081, -0.0262, -0.0257,
         0.0180,  0.1144,  0.0373, -0.1039,  0.0696, -0.0106, -0.0365,  0.1770,
        -0.0805,  0.0961, -0.0833,  0.0130,  0.0095, -0.0688,  0.0077, -0.1905,
         0.0642,  0.0245,  0.0162, -0.0288, -0.0267,  0.0321, -0.0333, -0.0290,
        -0.0118,  0.1062, -0.1264, -0.0390,  0.0952,  0.0005, -0.0229, -0.0888,
        -0.0436, -0.0005,  0.0197, -0.01

In [None]:
# Get the word vectors
man_vector = glove_vectors['man']
woman_vector = glove_vectors['woman']
king_vector = glove_vectors['king']
queen_vector = glove_vectors['queen']

# Calculate cosine similarities
man_woman_sim = 1 - spatial.distance.cosine(man_vector, woman_vector)
king_queen_sim = 1 - spatial.distance.cosine(king_vector, queen_vector)
king_man_sim = 1 - spatial.distance.cosine(king_vector, man_vector)
queen_woman_sim = 1 - spatial.distance.cosine(queen_vector, woman_vector)
# woman_housekeeper = 1 - spatial.distance.cosine(woman_vector, housekeeper_vector)
# man_housekeeper = 1 - spatial.distance.cosine(man_vector, housekeeper_vector)

print("Similarity between man and woman", man_woman_sim)
print("Similarity between king and queen", king_queen_sim)
print("Similarity between king and man", king_man_sim)
print("Similarity between queen and woman", queen_woman_sim)

Similarity between man and woman 0.8323494204780473
Similarity between king and queen 0.7507690628448102
Similarity between king and man 0.5118681465892055
Similarity between queen and woman 0.5095153918332322


In [None]:
# Get the word vectors
man_vector = glove_vectors['man']
woman_vector = glove_vectors['woman']
engineer_vector = glove_vectors['engineer']
housekeeper_vector = glove_vectors['housekeeper']

# Calculate cosine similarities
man_woman_sim = 1 - spatial.distance.cosine(man_vector, woman_vector)
engineer_housekeeper_sim = 1 - spatial.distance.cosine(engineer_vector, housekeeper_vector)
woman_engineer_sim = 1 - spatial.distance.cosine(woman_vector, engineer_vector)
man_engineer_sim = 1 - spatial.distance.cosine(man_vector, engineer_vector)
woman_housekeeper = 1 - spatial.distance.cosine(woman_vector, housekeeper_vector)
man_housekeeper = 1 - spatial.distance.cosine(man_vector, housekeeper_vector)

print("Similarity between man and woman", man_woman_sim)
print("Similarity between engineer and housekeeper", engineer_housekeeper_sim)
print("Similarity between man and engineer", man_engineer_sim)
print("Similarity between woman and engineer", woman_engineer_sim)
print("Similarity between man and housekeeper", man_housekeeper)
print("Similarity between woman and housekeeper", woman_housekeeper)

Similarity between man and woman 0.8323494204780473
Similarity between engineer and housekeeper 0.12854650133677126
Similarity between man and engineer 0.42998508908380906
Similarity between woman and engineer 0.3340311155167156
Similarity between man and housekeeper 0.31202090264686877
Similarity between woman and housekeeper 0.45585394751213504


In [None]:
# Get the word vectors
man_vector = glove_vectors['man']
woman_vector = glove_vectors['woman']
bricklayer_vector = glove_vectors['bricklayer']
nurse_vector = glove_vectors['nurse']

# Calculate cosine similarities
man_woman_sim = 1 - spatial.distance.cosine(man_vector, woman_vector)
woman_bricklayer_sim = 1 - spatial.distance.cosine(woman_vector, bricklayer_vector)
man_bricklayer_sim = 1 - spatial.distance.cosine(man_vector, bricklayer_vector)
woman_nurse = 1 - spatial.distance.cosine(woman_vector, nurse_vector)
man_nurse = 1 - spatial.distance.cosine(man_vector, nurse_vector)

print("Similarity between man and woman", man_woman_sim)
print("Similarity between man and bricklayer", man_bricklayer_sim)
print("Similarity between woman and bricklayer", woman_bricklayer_sim)
print("Similarity between man and nurse", man_nurse)
print("Similarity between woman and nurse", woman_nurse)

Similarity between man and woman 0.8323494204780473
Similarity between man and bricklayer 0.15589203673110918
Similarity between woman and bricklayer 0.13727423033481156
Similarity between man and nurse 0.45623883283417743
Similarity between woman and nurse 0.6139442959795541


In [None]:
import numpy as np
from scipy.spatial.distance import cosine

def cosine_similarity_manual(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# NOTE: these are made-up 3-dimensional toy vectors. Assigning them to
# glove_vectors rebinds the name, so the pre-trained GloVe model loaded
# earlier is no longer reachable through it in later cells.
glove_vectors = {
    "man": np.array([0.2, 0.3, 0.5]),
    "woman": np.array([0.25, 0.35, 0.45]),
    "engineer": np.array([0.6, 0.1, 0.3]),
    "housekeeper": np.array([0.1, 0.5, 0.2])
}

man_vector = glove_vectors["man"]
woman_vector = glove_vectors["woman"]
engineer_vector = glove_vectors["engineer"]
housekeeper_vector = glove_vectors["housekeeper"]

pairs = [("man", "woman"), ("engineer", "housekeeper"), ("man", "engineer"),
         ("woman", "engineer"), ("man", "housekeeper"), ("woman", "housekeeper")]

# Compare the manual formula with SciPy's (1 - cosine distance) for each pair
for word1, word2 in pairs:
    vec1, vec2 = glove_vectors[word1], glove_vectors[word2]
    sim_manual = cosine_similarity_manual(vec1, vec2)
    sim_scipy = 1 - cosine(vec1, vec2)

    print(f"Similarity between {word1} and {word2}: ")
    print(f" - Manual calculation: {sim_manual:.6f}")
    print(f" - SciPy calculation: {sim_scipy:.6f}")

Similarity between man and woman: 
 - Manual calculation: 0.990275
 - SciPy calculation: 0.990275
Similarity between engineer and housekeeper: 
 - Manual calculation: 0.457625
 - SciPy calculation: 0.457625
Similarity between man and engineer: 
 - Manual calculation: 0.717547
 - SciPy calculation: 0.717547
Similarity between woman and engineer: 
 - Manual calculation: 0.757941
 - SciPy calculation: 0.757941
Similarity between man and housekeeper: 
 - Manual calculation: 0.799671
 - SciPy calculation: 0.799671
Similarity between woman and housekeeper: 
 - Manual calculation: 0.850553
 - SciPy calculation: 0.850553


You should extract word vectors from the following lists. Make sure you save them in a .txt file with a name of your choice. The file should contain just the words and their corresponding vectors, separated by spaces, with each word starting on a new line.

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Load GPT-2 model and tokenizer
model_name = 'gpt2'  # you can use 'gpt2-medium', 'gpt2-large' for larger models
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2Model.from_pretrained(model_name, output_hidden_states=True)


# List of words to generate embeddings for
names = ['father', 'mother', 'man', 'woman', 'doctor', 'lawyer', 'engineer', 'nurse', 'teacher', 'accountant', 'architect', 'artist', 'writer', 'chef', 'designer', 'dentist', 'entrepreneur', 'firefighter', 'journalist', 'mechanic', 'musician', 'paramedic', 'photographer', 'psychologist', 'scientist', 'soldier', 'surgeon', 'vet', 'receptionist', 'reading', 'writing', 'painting', 'singing', 'cooking', 'traveling', 'volunteering', 'meditating', 'shopping', 'phone', 'computer', 'car', 'house', 'job', 'school', 'family', 'friends', 'food', 'drink', 'toys', 'books', 'movies', 'concerts', 'sports', 'electronics', 'furniture', 'clothing']
vectors = []

# Generate an embedding for each word
for name in names:
    inputs = tokenizer(name, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[-1] is the final layer's output, shape (batch, sequence_length, hidden_size)
    last_hidden_state = outputs.hidden_states[-1]
    # Keep the vector of the first token only; GPT-2 may split a word into several
    # sub-word tokens, in which case the remaining tokens are ignored
    name_vector = last_hidden_state[0, 0, :].numpy()  # shape (hidden_size,)
    vectors.append(name_vector)

# Save the vectors to a file
with open('name_vectors_gpt2.txt', 'w') as f:
    for i, name in enumerate(names):
        vector_str = ' '.join([str(x) for x in vectors[i]])
        f.write(f"{name} {vector_str}\n")
        print(f"{name} {vector_str}\n")
        glove_vectors[name] = vectors[i]


father -0.16904345 -0.076640666 -0.268132 -0.032958534 0.038636852 -0.2549488 0.4501507 0.024729611 -0.16980512 0.09495477 0.3383247 -0.020645788 -0.0068515064 -0.012634398 -0.08434438 0.012953392 -0.30108404 -0.18575735 0.11076983 -0.09188268 0.1570076 -0.07632807 -0.2659368 -0.06988114 -0.09676209 0.21555771 -0.16784403 -0.08781791 -0.016249288 -0.24946532 -0.008001332 -0.12982647 -0.068034016 -0.27249536 -0.042475305 0.24461205 21.625277 -0.0808906 -0.034573577 -0.04835114 -0.009224845 0.020522824 -0.09778191 -0.3651877 -0.015622924 -0.011982934 0.054027684 0.0072883945 -0.24056557 -0.0061953072 -0.011623199 0.099542305 -0.18097185 -0.050250635 0.080911234 0.36733097 0.04419425 -0.23946099 -0.17413326 -0.05696226 0.00067796157 -0.013297756 -0.12565242 -0.06814677 -1.040149 -0.060504735 -0.11686072 -0.15991086 -0.17050892 -0.21182784 -0.2891842 -0.2623622 -0.16103487 -0.13097379 -0.14103098 -0.25730518 -0.21206716 -0.60284865 0.06290615 -0.05713592 -0.34462157 -0.0066091553 0.0372399

In [None]:
# Lists of words whose vectors you should extract from GloVe; combine all lists into one file

sample = ["father", "mother", "man", "woman"]

professions = ["doctor", "lawyer", "engineer", "nurse", "teacher", "accountant", "architect", "artist", "writer", "chef", "designer", "dentist", "entrepreneur", "firefighter", "journalist", "mechanic", "musician", "paramedic", "photographer", "psychologist", "scientist", "soldier", "surgeon", "vet", "receptionist"]
activities = ["reading", "writing", "painting", "singing", "cooking", "traveling", "volunteering", "meditating", "shopping"]
items = ["phone", "computer", "car", "house", "job", "school", "family", "friends", "food", "drink", "toys", "books", "movies", "concerts", "sports", "electronics", "furniture", "clothing"]
names = ["Alex", "Charlotte", "David", "Emma", "Ethan", "Isabella", "Lily", "Oliver", "Sophia", "William", 'john', 'anna', 'sophie', 'ronaldo', 'shakira', 'mario', 'maria', 'tom', 'katy']

words = sample + professions + activities + items
print(words)
with open('words.txt', 'w') as f:
  for w in words:
    try:
      vector = glove_vectors[w]
    except KeyError as e:
      # Skip words without a vector so no line contains a word with no numbers
      print("No vector found for", e)
      continue
    f.write(w + " " + " ".join(map(str, vector)) + "\n")

['father', 'mother', 'man', 'woman', 'doctor', 'lawyer', 'engineer', 'nurse', 'teacher', 'accountant', 'architect', 'artist', 'writer', 'chef', 'designer', 'dentist', 'entrepreneur', 'firefighter', 'journalist', 'mechanic', 'musician', 'paramedic', 'photographer', 'psychologist', 'scientist', 'soldier', 'surgeon', 'vet', 'receptionist', 'reading', 'writing', 'painting', 'singing', 'cooking', 'traveling', 'volunteering', 'meditating', 'shopping', 'phone', 'computer', 'car', 'house', 'job', 'school', 'family', 'friends', 'food', 'drink', 'toys', 'books', 'movies', 'concerts', 'sports', 'electronics', 'furniture', 'clothing']


In [None]:
def read_glove_vecs(glove_file):
    with open(glove_file, 'r') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map

In [None]:
words, word_to_vec_map = read_glove_vecs('words.txt')
print(word_to_vec_map)
print(words)

{'father': array([-1.69043450e-01, -7.66406660e-02, -2.68132000e-01, -3.29585340e-02,
        3.86368520e-02, -2.54948800e-01,  4.50150700e-01,  2.47296110e-02,
       -1.69805120e-01,  9.49547700e-02,  3.38324700e-01, -2.06457880e-02,
       -6.85150640e-03, -1.26343980e-02, -8.43443800e-02,  1.29533920e-02,
       -3.01084040e-01, -1.85757350e-01,  1.10769830e-01, -9.18826800e-02,
        1.57007600e-01, -7.63280700e-02, -2.65936800e-01, -6.98811400e-02,
       -9.67620900e-02,  2.15557710e-01, -1.67844030e-01, -8.78179100e-02,
       -1.62492880e-02, -2.49465320e-01, -8.00133200e-03, -1.29826470e-01,
       -6.80340160e-02, -2.72495360e-01, -4.24753050e-02,  2.44612050e-01,
        2.16252770e+01, -8.08906000e-02, -3.45735770e-02, -4.83511400e-02,
       -9.22484500e-03,  2.05228240e-02, -9.77819100e-02, -3.65187700e-01,
       -1.56229240e-02, -1.19829340e-02,  5.40276840e-02,  7.28839450e-03,
       -2.40565570e-01, -6.19530720e-03, -1.16231990e-02,  9.95423050e-02,
       -1.8097

It is common practice to save the word embeddings into a .txt format file and then load them with a function like:

`words, word_to_vec_map = read_glove_vecs('w2v_gnews_small.txt')`

You should create a function named 'read_glove_vecs' to open and read the .txt file with the word vectors.
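A quick way to check such a reader is a round trip: write a few vectors to a file, read them back, and compare. The sketch below is only an illustration under stated assumptions: `save_vecs`, `read_vecs`, the toy words, and the file name `toy_vectors.txt` are hypothetical, and `read_vecs` simply repeats the parsing logic of `read_glove_vecs` so that the example runs on its own.

```python
import os
import tempfile

import numpy as np


def save_vecs(path, word_to_vec_map):
    """Hypothetical helper: write one 'word v1 v2 ...' line per word."""
    with open(path, "w", encoding="utf-8") as f:
        for word, vec in word_to_vec_map.items():
            # repr(float) keeps enough digits to reload the exact same value
            f.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")


def read_vecs(path):
    """Same parsing idea as read_glove_vecs, repeated so the sketch is self-contained."""
    words, word_to_vec_map = set(), {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if not parts:  # tolerate blank lines
                continue
            words.add(parts[0])
            word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)
    return words, word_to_vec_map


# Made-up 3-dimensional vectors, used only for the round-trip check
toy = {"cat": np.array([0.1, -0.2, 0.3]), "dog": np.array([0.4, 0.5, -0.6])}
path = os.path.join(tempfile.mkdtemp(), "toy_vectors.txt")
save_vecs(path, toy)
loaded_words, loaded_map = read_vecs(path)
round_trip_ok = all(np.array_equal(toy[w], loaded_map[w]) for w in toy)
print(loaded_words == set(toy), round_trip_ok)
```

If the round trip fails on your own file, the usual suspects are extra separators in the vector string or a word that was written without its numbers.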

## Task 2 - Implement Cosine Similarity

We can measure how similar two words are using cosine similarity. We would expect non-gender-specific words to be equally distant from gender-specific words.

🎯 The goal is to get familiar with calculating cosine similarity in Python and to find word pairs that illustrate biased and unbiased vectors.

Deliverables: Provide code for implementing cosine similarity in Python. Run the example words, and try measuring the similarity of different words. Can you find a biased and an unbiased example?


To calculate cosine similarity, we need to take the cosine of the angle between these two vectors. Here are the steps:

1. Calculate the dot product of A and B
   - Multiply each element in A with the corresponding element in B
   - Sum all those products
   - Call this dot_product

2. Calculate the magnitudes (or lengths) of A and B
   - Square each element in A, sum them, and take the square root. Let's call this mag_A.
   - Do the same for B. Let's call this mag_B.

3. Compute cosine similarity:
   cosine_similarity = dot_product / (mag_A * mag_B)

The closer this value is to 1, the smaller the angle and the more similar vector A is to vector B.

Thanks to Python, you do not need to do these time-consuming calculations manually! Especially for steps 1 and 2, the 'numpy' library provides functions that can help you implement cosine similarity in Python!
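The three steps above can be sketched on tiny hand-picked vectors, which makes the range of the measure easy to see. The function name `cosine_sim` and the example vectors are illustrative assumptions, not part of the assignment.

```python
import numpy as np


def cosine_sim(a, b):
    dot_product = np.dot(a, b)             # step 1: dot product
    mag_a = np.sqrt(np.sum(np.square(a)))  # step 2: magnitude of a
    mag_b = np.sqrt(np.sum(np.square(b)))  #         magnitude of b
    return dot_product / (mag_a * mag_b)   # step 3: cosine of the angle


a = np.array([3.0, 4.0])
same_direction = cosine_sim(a, 2 * a)              # parallel vectors
orthogonal = cosine_sim(a, np.array([-4.0, 3.0]))  # vectors at a right angle
opposite = cosine_sim(a, -a)                       # vectors pointing opposite ways
print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change the result, which is why cosine similarity compares directions rather than lengths.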

In [None]:
def spatial_distance_cosine(u, v):  # SciPy and PyTorch also provide ready-made cosine functions
  dot_product = np.dot(u, v)
  squared_norm_u = np.dot(u, u)
  squared_norm_v = np.dot(v, v)
  similarity = dot_product / (np.sqrt(squared_norm_u) * np.sqrt(squared_norm_v))
  return 1 - similarity  # cosine distance = 1 - cosine similarity

In [None]:
spatial_distance_cosine(glove_vectors['man'], glove_vectors['woman'])

0.002620220184326172

In [None]:
# similarity between the two words
1-spatial_distance_cosine(glove_vectors['man'], glove_vectors['woman'])

0.9973797798156738

In [None]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """


    multipliedArrays = np.multiply(u, v)
    dot_product = np.sum(multipliedArrays)

    squared_array_u = np.square(u)
    summed_array_u = np.sum(squared_array_u)
    mag_A = np.sqrt(summed_array_u)

    squared_array_v = np.square(v)
    summed_array_v = np.sum(squared_array_v)
    mag_B = np.sqrt(summed_array_v)


    return dot_product / (mag_A * mag_B)


u = glove_vectors["artist"]
v = glove_vectors["painting"]

print(cosine_similarity(u,v))

def wordToGPT2Vector(word):
  inputs = tokenizer(word, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)
  # Get the last hidden state for the first token (a word may be split into several sub-word tokens)
  last_hidden_state = outputs.hidden_states[-1]
  word_vector = last_hidden_state[0, 0, :].numpy()
  return word_vector

u = wordToGPT2Vector("egg")
v = wordToGPT2Vector("teacher")

print(cosine_similarity(u,v))


0.9941591
0.99763554


In [None]:
father = wordToGPT2Vector("father")
mother = wordToGPT2Vector("mother")
woman = wordToGPT2Vector("woman")
man = wordToGPT2Vector("man")

print(father)
print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(woman, man) = ",cosine_similarity(woman, man))
print("cosine_similarity(mother - woman, father - man) = ",cosine_similarity(mother - woman, father - man))

[-1.69043452e-01 -7.66406655e-02 -2.68132001e-01 -3.29585336e-02
  3.86368521e-02 -2.54948795e-01  4.50150698e-01  2.47296114e-02
 -1.69805124e-01  9.49547663e-02  3.38324696e-01 -2.06457879e-02
 -6.85150642e-03 -1.26343984e-02 -8.43443796e-02  1.29533922e-02
 -3.01084042e-01 -1.85757354e-01  1.10769831e-01 -9.18826833e-02
  1.57007605e-01 -7.63280690e-02 -2.65936792e-01 -6.98811412e-02
 -9.67620909e-02  2.15557709e-01 -1.67844027e-01 -8.78179073e-02
 -1.62492879e-02 -2.49465317e-01 -8.00133217e-03 -1.29826471e-01
 -6.80340156e-02 -2.72495359e-01 -4.24753055e-02  2.44612053e-01
  2.16252766e+01 -8.08906034e-02 -3.45735773e-02 -4.83511388e-02
 -9.22484510e-03  2.05228236e-02 -9.77819115e-02 -3.65187705e-01
 -1.56229241e-02 -1.19829336e-02  5.40276840e-02  7.28839450e-03
 -2.40565568e-01 -6.19530724e-03 -1.16231991e-02  9.95423049e-02
 -1.80971846e-01 -5.02506346e-02  8.09112340e-02  3.67330968e-01
  4.41942513e-02 -2.39460990e-01 -1.74133256e-01 -5.69622591e-02
  6.77961565e-04 -1.32977

In [None]:
father = wordToGPT2Vector("father")
mother = wordToGPT2Vector("mother")
doctor = wordToGPT2Vector("doctor")
lawyer = wordToGPT2Vector("lawyer")


print("cosine_similarity(father, lawyer) = ", cosine_similarity(father, lawyer))
print("cosine_similarity(mother, doctor) = ",cosine_similarity(mother, doctor))


cosine_similarity(father, lawyer) =  0.99734956
cosine_similarity(mother, doctor) =  0.9977454


In [None]:
man = wordToGPT2Vector("man")
woman = wordToGPT2Vector("woman")
nurse = wordToGPT2Vector("nurse")
doctor = wordToGPT2Vector("doctor")


print("cosine_similarity(man, doctor) = ", cosine_similarity(man, doctor))
print("cosine_similarity(woman, doctor) = ", cosine_similarity(woman, doctor))
print("cosine_similarity(woman, nurse) = ",cosine_similarity(woman, nurse))
print("cosine_similarity(man, nurse) = ",cosine_similarity(man, nurse))
print("cosine_similarity(nurse, doctor) = ",cosine_similarity(nurse, doctor))



cosine_similarity(man, doctor) =  0.9972978
cosine_similarity(woman, doctor) =  0.99623847
cosine_similarity(woman, nurse) =  0.98748803
cosine_similarity(nurse, doctor) =  0.99390817


Given three words (word_a, word_b, word_c), for example ('man', 'father', 'woman'), a word analogy is completed by finding the word whose vector relates to word_c the way word_b relates to word_a. In this example the word we expect is 'mother'. The cell below checks how well this relation holds for our vectors by comparing the two difference vectors directly.

In [None]:
print("cosine_similarity(mother - woman, father - man) = ",cosine_similarity(mother - woman, father - man))

cosine_similarity(mother - woman, father - man) =  0.7022184
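One common way to complete an analogy, sketched here under assumptions, is to search the vocabulary for the word d whose offset from word_c is most similar to the offset between word_a and word_b. The function name `complete_analogy` and the 3-dimensional toy vectors are hypothetical and chosen so that the answer is easy to verify by hand; real embeddings will give noisier results.

```python
import numpy as np


def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Return the word d that maximizes cos(e_b - e_a, e_d - e_c)."""
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    best_word, best_sim = None, -np.inf
    for candidate, e_d in word_to_vec_map.items():
        if candidate in (word_a, word_b, word_c):
            continue  # the input words themselves are not valid answers
        sim = cosine_sim(e_b - e_a, e_d - e_c)
        if sim > best_sim:
            best_word, best_sim = candidate, sim
    return best_word


# Toy vectors (assumption): axis 0 ~ gender, axis 1 ~ parenthood, axis 2 ~ unrelated
toy_map = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([-1.0, 0.0, 0.0]),
    "father": np.array([1.0, 1.0, 0.0]),
    "mother": np.array([-1.0, 1.0, 0.0]),
    "tree": np.array([0.0, 0.0, 1.0]),
}
answer = complete_analogy("man", "father", "woman", toy_map)
print(answer)
```

Excluding the three input words matters in practice, because otherwise one of them is often returned as the nearest match.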


## Task 3: Remove bias from word vectors


After getting familiar with all the tools we need, now it's time to actually solve the problem of bias in word vectors.

🎯 The goal is to implement the neutralize and equalize Python functions to remove the bias from the word vectors.

Deliverables: (1) Complete the python code for the neutralize and equalize python functions following [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://proceedings.neurips.cc/paper_files/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf) by Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. Proceedings of NIPS 2016. The code should run without any errors. (2) Provide examples before and after removing bias.


To remove the gender bias from non-gender-specific word vectors, we need to represent the semantic concept of gender as a vector. We can approximate that vector by subtracting a male word vector from a female one: we compute 'vgender = v1 - v2', where 'v1' is the word vector for 'woman' and 'v2' is the word vector for 'man'. The resulting vector roughly encodes the concept of "gender".

In [None]:
vgender = word_to_vec_map['woman'] - word_to_vec_map['man']
print(vgender.shape)

(768,)


*Now*, you will consider the cosine similarity of different words with vgender. A positive value of similarity means that the words are closer to 'woman' and a negative cosine similarity means the words are closer to 'man'.
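The sign convention can be illustrated with a small sketch. All vectors below are made-up 2-dimensional examples (an assumption for illustration only); axis 0 plays the role of gender, with 'woman' on the negative side and 'man' on the positive side.

```python
import numpy as np


def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Hypothetical toy vectors, not taken from any trained model
man = np.array([1.0, 0.5])
woman = np.array([-1.0, 0.5])
toy_vgender = woman - man  # points from 'man' towards 'woman'

leans_woman = np.array([-0.8, 0.6])
leans_man = np.array([0.6, 0.8])
neutral = np.array([0.0, 1.0])

sim_woman_side = cosine_sim(leans_woman, toy_vgender)  # positive
sim_man_side = cosine_sim(leans_man, toy_vgender)      # negative
sim_neutral = cosine_sim(neutral, toy_vgender)         # zero: no gender component
print(sim_woman_side, sim_man_side, sim_neutral)
```

A word with no component along the gender direction gets a similarity of zero, which is the state neutralization aims for.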

In [None]:
print('List of names and their similarities with constructed vector:')

# girls' and boys' names
name_list = ['john', 'anna', 'sophie', 'ronaldo', 'shakira', 'mario', 'maria', 'tom', 'katy']

for w in name_list:
  print(w, cosine_similarity(glove_vectors[w], vgender))

List of names and their similarities with constructed vector:


KeyError: 'john'

As you can see, female first names tend to have a positive cosine similarity with our constructed vector, while male first names tend to have a negative cosine similarity. This is not surprising, and the result seems acceptable.

But let's try with some other words.

In [None]:
print('Other words and their similarities:')
word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior','doctor', 'tree', 'receptionist',
             'technology',  'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print (w, cosine_similarity(glove_vectors[w], vgender))

Other words and their similarities:


KeyError: 'lipstick'

In [None]:
cosine_similarity(neutralize("computer", vgender, word_to_vec_map), vgender)

NameError: name 'neutralize' is not defined

Do you notice anything surprising? It is astonishing how these results reflect certain unhealthy gender stereotypes. For example, "computer" is closer to "man" while "literature" is closer to "woman". Ouch!

We'll see below how to reduce the bias of these vectors, using an algorithm due to Bolukbasi et al., 2016. Note that some word pairs such as "actor"/"actress" or "grandmother"/"grandfather" should remain gender specific, while other words such as "receptionist" or "technology" should be neutralized, i.e. not be gender-related. You will have to treat these two types of words differently when debiasing.


One approach to removing the bias is to neutralize the non-gender-specific words and equalize the gender-specific pairs, following Bolukbasi et al., 2016.
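For the neutralization step, the computation can be written as the two formulas below. They are reconstructed from the `neutralize` function in this notebook rather than quoted from the paper, with $e$ the word vector and $g$ the bias axis:

$$e^{bias\_component} = \frac{e \cdot g}{\lVert g \rVert_2^2}\, g$$

$$e^{debiased} = e - e^{bias\_component}$$

The first formula is the projection of $e$ onto $g$; subtracting it leaves a vector with $e^{debiased} \cdot g = 0$, i.e. no component along the bias axis.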

> Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). Curran Associates Inc., Red Hook, NY, USA, 4356-4364.

In [None]:
def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis.
    This function ensures that gender neutral words are zero in the gender subspace.

    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (n,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.
    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """

    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]



    # Compute e_biascomponent, the projection of e onto the bias axis g. (≈ 1 line)
    e_biascomponent = np.dot(e,g) / np.dot(g,g) * g

    # Neutralize e by subtracting e_biascomponent from it
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###

    return e_debiased

In [None]:
e = "receptionist"
print("cosine similarity between " + e + " and vgender, before neutralizing: ", cosine_similarity(word_to_vec_map["receptionist"], vgender))

e_debiased = neutralize("receptionist", vgender, word_to_vec_map)
print("cosine similarity between " + e + " and vgender, after neutralizing: ", cosine_similarity(e_debiased, vgender))
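A small sanity check for neutralization, sketched on made-up vectors: after removing the component along the bias axis, the dot product with that axis should be zero. The helper `neutralize_vector` is a hypothetical variant that takes a vector directly instead of looking up a word, so the example runs without any embedding file.

```python
import numpy as np


def neutralize_vector(e, g):
    """Remove from e its projection onto the bias axis g."""
    e_biascomponent = np.dot(e, g) / np.dot(g, g) * g
    return e - e_biascomponent


g = np.array([2.0, 0.0, 0.0])   # toy bias axis (need not be unit length)
e = np.array([0.5, 1.0, -1.0])  # toy word vector with a bias component of 0.5
e_debiased = neutralize_vector(e, g)
residual = np.dot(e_debiased, g)  # should be 0 after neutralizing
print(e_debiased, residual)
```

With real embeddings the residual is typically a very small number rather than exactly zero, because of floating-point rounding.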

Next, let's see how debiasing can also be applied to word pairs such as "actress" and "actor." Equalization is applied to pairs of words that you might want to have differ only through the gender property. As a concrete example, suppose that "actress" is closer to "babysit" than "actor." By applying neutralizing to "babysit" we can reduce the gender-stereotype associated with babysitting. But this still does not guarantee that "actor" and "actress" are equidistant from "babysit." The equalization algorithm takes care of this.

The key idea behind equalization is to make sure that the two words of a pair are equidistant from every neutralized word, so that they differ only along the bias axis.
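The equalization steps can be summarized by the equations below. They are reconstructed from the `equalize` implementation in this notebook, and the numbers (7) to (10) follow the references in its code comments; treat them as a reading aid rather than a quotation from the paper. Here $b$ is the bias axis, and the square-root term assumes word vectors of approximately unit length.

$$\mu = \frac{e_{w1} + e_{w2}}{2}$$

$$\mu_{B} = \frac{\mu \cdot b}{\lVert b \rVert_2^2}\, b, \qquad \mu_{\perp} = \mu - \mu_{B}$$

$$e_{w1B} = \frac{e_{w1} \cdot b}{\lVert b \rVert_2^2}\, b \tag{7}$$

$$e_{w2B} = \frac{e_{w2} \cdot b}{\lVert b \rVert_2^2}\, b \tag{8}$$

$$e_{w1B}^{corrected} = \sqrt{\left|1 - \lVert \mu_{\perp} \rVert_2^2\right|}\; \frac{e_{w1B} - \mu_{B}}{\lVert (e_{w1} - \mu_{\perp}) - \mu_{B} \rVert_2} \tag{9}$$

$$e_{w2B}^{corrected} = \sqrt{\left|1 - \lVert \mu_{\perp} \rVert_2^2\right|}\; \frac{e_{w2B} - \mu_{B}}{\lVert (e_{w2} - \mu_{\perp}) - \mu_{B} \rVert_2} \tag{10}$$

$$e_{1} = e_{w1B}^{corrected} + \mu_{\perp}, \qquad e_{2} = e_{w2B}^{corrected} + \mu_{\perp}$$

Both results share the same orthogonal part $\mu_{\perp}$ and have bias components of equal size and opposite sign.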

In [None]:
def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.

    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor")
    bias_axis -- numpy-array of shape (100,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors

    Returns:
    e1 -- equalized word vector corresponding to the first word
    e2 -- equalized word vector corresponding to the second word
    """

    ### START CODE HERE ###
    # Step 1: Select word vector representation of "word". Use word_to_vec_map. (≈ 2 lines)

    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]

    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.dot(mu,bias_axis) / (np.linalg.norm(bias_axis))**2 * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B (≈2 lines)
    e_w1B = np.dot(e_w1,bias_axis) / (np.linalg.norm(bias_axis))**2 * bias_axis
    e_w2B = np.dot(e_w2,bias_axis) / (np.linalg.norm(bias_axis))**2 * bias_axis

    # Step 5: Adjust the Bias part of e_w1B and e_w2B using the formulas (9) and (10) given above (≈2 lines)
    # The scaling factor is a scalar, so np.abs expresses the intent more directly than np.linalg.norm
    scale = np.sqrt(np.abs(1 - np.linalg.norm(mu_orth)**2))
    corrected_e_w1B = scale * (e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    corrected_e_w2B = scale * (e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections (≈2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth

    ### END CODE HERE ###

    return e1, e2

In [None]:
def equalize(pair, bias_axis, word_to_vec_map):

    """
    Debias gender specific words by following the equalize method described in the figure above.

    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor")
    bias_axis -- numpy-array of shape (100,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors

    Returns:
    e1 -- equalized word vector corresponding to the first word
    e2 -- equalized word vector corresponding to the second word
    """

    ### START CODE HERE ###


    def project(vec, bias_axis):
        """Project vec onto the line spanned by bias_axis."""
        return bias_axis * (np.dot(vec, bias_axis) / np.linalg.norm(bias_axis)**2)

    # Step 1: Select word vector representation of "word". Use word_to_vec_map. (≈ 2 lines)

    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]


    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2)/2

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_b = project(mu, bias_axis)
    v = mu - mu_b


    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B (≈2 lines)
    e_w1B = project(e_w1, bias_axis)
    e_w2B = project(e_w2, bias_axis)

    # Step 5: Adjust the Bias part of e_w1B and e_w2B using the formulas (9) and (10) given above (≈2 lines)


    scale = np.sqrt(np.abs(1 - np.dot(v, v)))
    e1_new_b = scale * (e_w1B - mu_b) / np.linalg.norm(e_w1B - mu_b)
    e2_new_b = scale * (e_w2B - mu_b) / np.linalg.norm(e_w2B - mu_b)

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections (≈2 lines)
    e1 = v + e1_new_b
    e2 = v + e2_new_b

    ### END CODE HERE ###

    return e1, e2

In [None]:
print("cosine similarities before equalizing:")
print("cosine_similarity(word_to_vec_map[\"man\"], gender) = ", cosine_similarity(word_to_vec_map["man"], vgender))
print("cosine_similarity(word_to_vec_map[\"woman\"], gender) = ", cosine_similarity(word_to_vec_map["woman"], vgender))
print()
e1, e2 = equalize(("man", "woman"), vgender, word_to_vec_map)

print("cosine similarities after equalizing:")
print("cosine_similarity(e1, gender) = ", cosine_similarity(e1, vgender))
print("cosine_similarity(e2, gender) = ", cosine_similarity(e2, vgender))

In [None]:
# prompt: create a list called words_to_graph; in this list I want the following words and their comparative and superlative: slow, short, strong, loud, clear, soft

words_to_graph = [
    "slow", "slower", "slowest",
    "short", "shorter", "shortest",
    "strong", "stronger", "strongest",
    "loud", "louder", "loudest",
    "clear", "clearer", "clearest",
    "soft", "softer", "softest"
]

In [None]:
# prompt: add the words_to_graph to the vocabulary

import numpy as np
# Assuming 'glove_vectors' (word -> vector lookup) and 'words' (a set of
# vocabulary words) are defined and populated earlier in your environment

# Add words_to_graph to the vocabulary (if not already present)
for word in words_to_graph:
    if word not in word_to_vec_map:
        try:
            # Copy the embedding over if the source vectors contain the word
            word_to_vec_map[word] = glove_vectors[word]
            words.add(word)
            print(f"Added {word}")
        except KeyError:
            # The word has no pretrained embedding. Possible ways to handle this:
            # 1. Assign a random vector of the right size (not meaningful for similarity):
            #    vec_size = len(next(iter(word_to_vec_map.values())))
            #    word_to_vec_map[word] = np.random.randn(vec_size)
            # 2. Skip the word (done here).
            # 3. Use a different embedding source (e.g., FastText) if available.
            print(f"Skipping {word}: no embedding found")
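
Before plotting, it can help to know which of the requested words actually have a vector. The following is a minimal sketch: `split_by_vocabulary` is a hypothetical helper, and the tiny dictionary only stands in for `word_to_vec_map`.

```python
def split_by_vocabulary(candidates, vectors):
    """Split candidates into words that have a vector and words that do not."""
    found = [w for w in candidates if w in vectors]
    missing = [w for w in candidates if w not in vectors]
    return found, missing

# Stand-in for word_to_vec_map; real values would be numpy arrays
toy_vectors = {"slow": [0.1, 0.2], "slower": [0.2, 0.1]}

found, missing = split_by_vocabulary(["slow", "slower", "slowest"], toy_vectors)
print("found:", found)
print("missing:", missing)
```

Reporting the missing words up front makes it easier to decide whether to skip them or look for another embedding source.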

In [None]:
# prompt: I want you to graph the similarity between the word and its comparative and superlative

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

# Assuming 'word_to_vec_map' is already defined and populated from the previous code

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    multipliedArrays = np.multiply(u, v)
    dot_product = np.sum(multipliedArrays)

    squared_array_u = np.square(u)
    summed_array_u = np.sum(squared_array_u)
    mag_A = np.sqrt(summed_array_u)

    squared_array_v = np.square(v)
    summed_array_v = np.sum(squared_array_v)
    mag_B = np.sqrt(summed_array_v)

    cos_similarity = dot_product / (mag_A * mag_B)
    return cos_similarity


# Calculate similarity scores
similarity_matrix = {}
for i in range(0, len(words_to_graph), 3):
    word = words_to_graph[i]
    comparative = words_to_graph[i+1]
    superlative = words_to_graph[i+2]
    try:
        similarity_matrix[(word, comparative)] = cosine_similarity(word_to_vec_map[word], word_to_vec_map[comparative])
        similarity_matrix[(word, superlative)] = cosine_similarity(word_to_vec_map[word], word_to_vec_map[superlative])
        similarity_matrix[(comparative, superlative)] = cosine_similarity(word_to_vec_map[comparative], word_to_vec_map[superlative])
    except KeyError:
        print(f"Warning: Word '{word}', '{comparative}' or '{superlative}' not found in vocabulary.")


# Create a graph
graph = nx.Graph()
for (word1, word2), similarity in similarity_matrix.items():
    graph.add_edge(word1, word2, weight=similarity)

# Draw the graph
pos = nx.spring_layout(graph)  # You can experiment with different layout algorithms
nx.draw(graph, pos, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
edge_labels = nx.get_edge_attributes(graph, 'weight')
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph")
plt.show()
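
Edge labels taken directly from the raw cosine values are printed with many decimals, which clutters the figure. A possible way to shorten them is sketched here; `format_edge_labels` is a hypothetical helper and the similarity values are made up.

```python
def format_edge_labels(similarities, digits=2):
    """Map each (word1, word2) edge to its similarity as a short string."""
    return {edge: f"{value:.{digits}f}" for edge, value in similarities.items()}

# Made-up similarity values, keyed by edge like similarity_matrix
toy_similarities = {("slow", "slower"): 0.73456, ("slow", "slowest"): 0.5}

labels = format_edge_labels(toy_similarities)
print(labels)
```

The resulting dictionary can be passed as the `edge_labels` argument of `nx.draw_networkx_edge_labels` in place of the raw weights.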

In [None]:
# prompt: show in a better way the relationship between the directions of the word and its comparative and superlative

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

# Assuming 'word_to_vec_map' is defined and populated from the previous code
# and that it contains word embeddings as a dictionary

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    multipliedArrays = np.multiply(u, v)
    dot_product = np.sum(multipliedArrays)

    squared_array_u = np.square(u)
    summed_array_u = np.sum(squared_array_u)
    mag_A = np.sqrt(summed_array_u)

    squared_array_v = np.square(v)
    summed_array_v = np.sum(squared_array_v)
    mag_B = np.sqrt(summed_array_v)

    cos_similarity = dot_product / (mag_A * mag_B)
    return cos_similarity

words_to_graph = [
    "slow", "slower", "slowest",
    "short", "shorter", "shortest",
    "strong", "stronger", "strongest",
    "loud", "louder", "loudest",
    "clear", "clearer", "clearest",
    "soft", "softer", "softest"
]

# Use the word_to_vec_map loaded earlier in the notebook. Overwriting it here with
# random placeholder vectors would make every similarity below meaningless.


similarity_matrix = {}
for i in range(0, len(words_to_graph), 3):
    word = words_to_graph[i]
    comparative = words_to_graph[i+1]
    superlative = words_to_graph[i+2]
    try:
        similarity_matrix[(word, comparative)] = cosine_similarity(word_to_vec_map[word], word_to_vec_map[comparative])
        similarity_matrix[(word, superlative)] = cosine_similarity(word_to_vec_map[word], word_to_vec_map[superlative])
        similarity_matrix[(comparative, superlative)] = cosine_similarity(word_to_vec_map[comparative], word_to_vec_map[superlative])
    except KeyError as e:
        print(f"Warning: {e}")

graph = nx.Graph()
for (word1, word2), similarity in similarity_matrix.items():
    graph.add_edge(word1, word2, weight=similarity)

pos = nx.spring_layout(graph)
nx.draw(graph, pos, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
edge_labels = nx.get_edge_attributes(graph, 'weight')
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph")
plt.show()
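
The graph compares words pairwise, but the "direction" from a base form to its comparative can also be examined directly: take the offset vector (comparative minus base) for each word family and compare the offsets with cosine similarity. This is a minimal sketch with invented 3-d vectors, constructed so that the two offsets coincide; the helper names `cosine` and `offset` are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def offset(vectors, base, derived):
    """Vector pointing from the base form to the derived form."""
    return vectors[derived] - vectors[base]

# Invented vectors, not real embeddings
toy_vectors = {
    "slow": np.array([1.0, 0.0, 0.0]),
    "slower": np.array([1.0, 1.0, 0.0]),
    "loud": np.array([0.0, 0.0, 1.0]),
    "louder": np.array([0.0, 1.0, 1.0]),
}

slow_dir = offset(toy_vectors, "slow", "slower")
loud_dir = offset(toy_vectors, "loud", "louder")
print("similarity of the two comparative directions:", cosine(slow_dir, loud_dir))
```

With real embeddings, a value close to 1 would suggest that the comparative is encoded as a roughly shared direction across adjectives, while values near 0 would suggest no common direction.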

In [None]:
# prompt: only graph soft and its comparative and superlative

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

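# Sample 3-d vectors for illustration only (replace with your actual word vectors).
# Note: this rebinds word_to_vec_map, so the embeddings loaded earlier are no longer reachable under that name.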
word_to_vec_map = {
    "soft": np.array([0.1, 0.2, 0.3]),
    "softer": np.array([0.15, 0.25, 0.35]),
    "softest": np.array([0.2, 0.3, 0.4]),
    # ... other words
}

def cosine_similarity(u, v):
    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0
    return dot_product / (norm_u * norm_v)

words_to_graph = [
    "soft", "softer", "softest"
]

similarity_matrix = {}
for i in range(0, len(words_to_graph)):
    word1 = words_to_graph[i]
    for j in range(i + 1, len(words_to_graph)):
        word2 = words_to_graph[j]
        similarity_matrix[(word1, word2)] = cosine_similarity(word_to_vec_map[word1], word_to_vec_map[word2])

graph = nx.Graph()
for (word1, word2), similarity in similarity_matrix.items():
    graph.add_edge(word1, word2, weight=similarity)

pos = nx.spring_layout(graph)
nx.draw(graph, pos, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
edge_labels = nx.get_edge_attributes(graph, 'weight')
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph (Soft)")
plt.show()

In [None]:
# prompt: in the graph above, I want you to make the x axis a bit wider, making the distance between soft and the others a bit bigger on this axis

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

# ... (Your existing code for word embeddings, cosine similarity, etc.)

# Calculate similarity scores (your existing code)
# ...

# Create a graph (your existing code)
# ...

# Draw the graph with adjusted layout
pos = nx.spring_layout(graph, k=0.5)  # Adjust the k parameter to control spacing
# k: Optimal distance between nodes.  Increase k to increase spacing

nx.draw(graph, pos, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
edge_labels = nx.get_edge_attributes(graph, 'weight')
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph")
plt.show()

In [None]:
# prompt: only graph soft

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

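# Sample 3-d vectors for illustration only (replace with your actual word vectors).
# Note: this rebinds word_to_vec_map, so the embeddings loaded earlier are no longer reachable under that name.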
word_to_vec_map = {
    "soft": np.array([0.1, 0.2, 0.3]),
    "softer": np.array([0.15, 0.25, 0.35]),
    "softest": np.array([0.2, 0.3, 0.4]),
    # ... other words
}

def cosine_similarity(u, v):
    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0
    return dot_product / (norm_u * norm_v)

words_to_graph = [
    "soft", "softer", "softest"
]

similarity_matrix = {}
for i in range(0, len(words_to_graph)):
    word1 = words_to_graph[i]
    for j in range(i + 1, len(words_to_graph)):
        word2 = words_to_graph[j]
        similarity_matrix[(word1, word2)] = cosine_similarity(word_to_vec_map[word1], word_to_vec_map[word2])

graph = nx.Graph()
for (word1, word2), similarity in similarity_matrix.items():
    graph.add_edge(word1, word2, weight=similarity)

pos = nx.spring_layout(graph)
nx.draw(graph, pos, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
edge_labels = nx.get_edge_attributes(graph, 'weight')
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph (Soft)")
plt.show()

In [None]:
# prompt: want you to mirror the above graph in the x and y axes

import matplotlib.pyplot as plt
import networkx as nx  # needed for nx.draw below
import numpy as np

# Assuming 'graph', 'pos', 'edge_labels' are defined from your previous code

# ... (Your existing code for creating the graph and layout)

# Mirror the x-coordinates (flip left-right)
mirrored_pos_x = {node: (-x, y) for node, (x, y) in pos.items()}

# Mirror the y-coordinates (flip top-bottom)
mirrored_pos_y = {node: (x, -y) for node, (x, y) in pos.items()}

# Mirror both coordinates (equivalent to rotating the layout by 180 degrees)
mirrored_pos_both = {node: (-x, -y) for node, (x, y) in pos.items()}

# Draw the mirrored graph in x-axis
plt.figure(figsize=(8, 6))  # Adjust figure size as needed
nx.draw(graph, mirrored_pos_x, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
nx.draw_networkx_edge_labels(graph, mirrored_pos_x, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph (Mirrored X-axis)")
plt.show()

# Draw the mirrored graph in y-axis
plt.figure(figsize=(8, 6))  # Adjust figure size as needed
nx.draw(graph, mirrored_pos_y, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
nx.draw_networkx_edge_labels(graph, mirrored_pos_y, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph (Mirrored Y-axis)")
plt.show()

# Draw the mirrored graph in both x and y axes
plt.figure(figsize=(8, 6))  # Adjust figure size as needed
nx.draw(graph, mirrored_pos_both, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
nx.draw_networkx_edge_labels(graph, mirrored_pos_both, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph (Mirrored Both Axes)")
plt.show()

In [None]:
# prompt: I want to have soft as the word that is closer to the left

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

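# Sample 3-d vectors for illustration only (replace with your actual word vectors).
# Note: this rebinds word_to_vec_map, so the embeddings loaded earlier are no longer reachable under that name.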
word_to_vec_map = {
    "soft": np.array([0.1, 0.2, 0.3]),
    "softer": np.array([0.15, 0.25, 0.35]),
    "softest": np.array([0.2, 0.3, 0.4]),
    # ... other words
}

def cosine_similarity(u, v):
    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0
    return dot_product / (norm_u * norm_v)

words_to_graph = [
    "soft", "softer", "softest"
]

similarity_matrix = {}
for i in range(0, len(words_to_graph)):
    word1 = words_to_graph[i]
    for j in range(i + 1, len(words_to_graph)):
        word2 = words_to_graph[j]
        similarity_matrix[(word1, word2)] = cosine_similarity(word_to_vec_map[word1], word_to_vec_map[word2])

graph = nx.Graph()
for (word1, word2), similarity in similarity_matrix.items():
    graph.add_edge(word1, word2, weight=similarity)

# Calculate positions with adjusted spacing
pos = nx.spring_layout(graph, k=0.9)  # Adjust k for spacing

# Draw the graph
nx.draw(graph, pos, with_labels=True, node_size=500, node_color='skyblue', font_size=10)
edge_labels = nx.get_edge_attributes(graph, 'weight')
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=8)
plt.title("Word Similarity Graph (Soft)")
plt.show()
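
`nx.spring_layout` starts from random positions unless a seed or initial positions are supplied, so it gives no control over which word ends up on the left. One way to guarantee that "soft" is leftmost is to set the positions by hand. The sketch below is an assumption about what is wanted, and `left_to_right_layout` is a hypothetical helper; it places the nodes from left to right in the given order and alternates their height so that edges do not overlap.

```python
def left_to_right_layout(ordered_nodes, spacing=1.0, zigzag=0.5):
    """Place nodes left to right in the given order, alternating their height."""
    return {
        node: (i * spacing, zigzag * (i % 2))
        for i, node in enumerate(ordered_nodes)
    }

manual_pos = left_to_right_layout(["soft", "softer", "softest"])
print(manual_pos)
```

The dictionary maps each node to an `(x, y)` pair, so it can be passed as `pos` to `nx.draw` and `nx.draw_networkx_edge_labels` instead of the result of `nx.spring_layout`.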