# AI for Social Good: Lecture 1 Exercises
Today we will explore gender bias in natural language processing by building our first models to probe gender bias in word vectors. As a reminder, a word vector is a machine's numerical representation of a word, learned by reading a large corpus of text so that it captures the contexts in which the word is used. For example, since the words "good" and "great" are used in similar contexts, they have similar word vectors!
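
To make the "similar contexts, similar vectors" idea concrete, here is a minimal sketch that builds crude count-based vectors from a tiny made-up corpus. It is an illustration only and not how GloVe vectors are actually trained; the corpus, the `cooccurrence_vectors` helper, and the choice of the whole sentence as the context window are all assumptions made for brevity.

```python
import numpy as np

# Tiny made-up corpus (hypothetical, for illustration only).
corpus = [
    "the movie was good",
    "the movie was great",
    "the food was good",
    "the food was great",
    "the cat sat outside",
    "the cat ran outside",
]

def cooccurrence_vectors(sentences):
    """Map each word to a vector counting the words it shares a sentence with."""
    tokenized = [s.split() for s in sentences]
    words = sorted({w for sent in tokenized for w in sent})
    index = {w: i for i, w in enumerate(words)}
    vectors = {w: np.zeros(len(words)) for w in words}
    for sent in tokenized:
        for i, word in enumerate(sent):
            for j, context in enumerate(sent):
                if i != j:  # every other word in the sentence counts as context
                    vectors[word][index[context]] += 1
    return vectors

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

count_vectors = cooccurrence_vectors(corpus)
good_great = cosine(count_vectors["good"], count_vectors["great"])
good_cat = cosine(count_vectors["good"], count_vectors["cat"])
print("good-great:", good_great)
print("good-cat:  ", good_cat)
```

In this toy corpus "good" and "great" appear in identical contexts, so their similarity is about 1.0, while "good" and "cat" share only the context word "the", giving about 0.4.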

These kinds of word vectors are used in everything from Google Search to Spotify recommendations, so any bias in the vectors can carry over into those systems, which is a major problem.

Today we will be using GloVe vectors, which are a standard type of word vector used in a variety of real-world applications. These word vectors were trained on 6 billion word tokens, sourced from Wikipedia 2014 and Gigaword 5. If you're interested, you can find more information on the [GloVe project page](https://nlp.stanford.edu/projects/glove/).

Run the cell below by selecting it and pressing Shift+Enter. This will import the required packages and download the GloVe vectors, which will take a few minutes.

In [1]:
import torchtext.vocab as vocab
import numpy as np
import requests
import zipfile
import io

np.random.seed(42)
# Download class resources...
r = requests.get("http://web.stanford.edu/class/cs21si/resources/unit1_resources.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

VEC_SIZE = 300
glove = vocab.GloVe(name='6B', dim=VEC_SIZE)

.vector_cache/glove.6B.zip: 862MB [02:41, 5.33MB/s]                           
100%|█████████▉| 399999/400000 [00:46<00:00, 8537.43it/s]


## Part 1: Word Vector Exploration

Below, we use the word vectors for 'good' and 'great' to compute the cosine similarity between them. We do the same for 'good' and 'human' (two words that are less similar) and for 'good' and 'bad'. Feel free to play around and compute more similarities! Note: we have included a short helper function that retrieves the word vector for a given word.

In [4]:
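def get_word_vector(word):
    # Look up the word's row index in the GloVe vocabulary and return its vector as a NumPy array.
    # Raises a KeyError if the word is not in the vocabulary.
    return glove.vectors[glove.stoi[word]].numpy()

def compute_cosine_similarity(word_a, word_b):
    # Cosine similarity: dot product of the two vectors divided by the product of their lengths.
    a, b = get_word_vector(word_a), get_word_vector(word_b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))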

print("good-great similarity %f" % compute_cosine_similarity('good', 'great'))
print("good-human similarity %f" % compute_cosine_similarity('good', 'human'))
print("good-bad similarity %f" % compute_cosine_similarity('good', 'bad'))

good-great similarity 0.641005
good-human similarity 0.313640
good-bad similarity 0.644522
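
For intuition about what these numbers mean: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them and ranges from -1 to 1. The following is a minimal sketch on hand-made vectors; the vectors and the `cosine_similarity` name are assumptions for illustration and are separate from the GloVe helpers above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2 * a                   # same direction, different length
opposite = -a                            # exactly the opposite direction
orthogonal = np.array([3.0, 0.0, -1.0])  # dot product with a: 3 + 0 - 3 = 0

print(cosine_similarity(a, same_direction))  # close to 1
print(cosine_similarity(a, opposite))        # close to -1
print(cosine_similarity(a, orthogonal))      # close to 0
```

Note that in the output above, 'good' and 'bad' score about as high as 'good' and 'great'. One plausible reading is that antonyms tend to appear in similar contexts, so a high cosine similarity reflects shared usage rather than shared meaning.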


## Part 2: Computing Logistic Regression

Recall that a linear regression prediction is computed as:

$$\hat{y} = wx + b$$
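
Because the input here is a word vector rather than a single number, the product $wx$ can be read as a dot product. Assuming the weights $w$ and the word vector $x$ are both $d$-dimensional (with $d = 300$ for the vectors loaded above) and the bias $b$ is a scalar, the prediction expands to:

$$\hat{y} = w \cdot x + b = \sum_{i=1}^{d} w_i x_i + b$$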

With that in mind, we will write the code for the linear regression prediction together.

In [5]:
def compute_linear_regression(word, weights, bias):
    # YOUR CODE HERE (~2 lines)
    x = get_word_vector(word)
    return np.dot(weights, x) + bias
    # END CODE
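
To see the same computation on numbers small enough to check by hand, here is a minimal sketch that uses a made-up 3-dimensional vector in place of a GloVe vector. The `toy_word_vectors` table and the `toy_linear_regression` function are hypothetical stand-ins for `get_word_vector` and `compute_linear_regression`.

```python
import numpy as np

# Hypothetical 3-dimensional "word vector" (the GloVe vectors used here have 300 dimensions).
toy_word_vectors = {"good": np.array([0.5, -1.0, 2.0])}

def toy_linear_regression(word, weights, bias):
    """Dot product of the weights with the word's vector, plus the bias."""
    x = toy_word_vectors[word]
    return np.dot(weights, x) + bias

toy_weights = np.array([1.0, 2.0, 3.0])
toy_bias = 0.1

# By hand: 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0 + 0.1 = 0.5 - 2.0 + 6.0 + 0.1 = 4.6
toy_prediction = toy_linear_regression("good", toy_weights, toy_bias)
print(toy_prediction)
```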

Let's test our code by running the following hard-coded tests:

In [14]:
weights, bias = np.arange(VEC_SIZE) / 100., 0.
threshold = 1e-10
tests = [('good', -9.327852969836677), ('great', -2.7949289857037374), ('bad', -10.208886105293644), ('human', -1.0023533709946784)]
for word, pred in tests:
    diff = abs(pred - compute_linear_regression(word, weights, bias))
    assert diff <= threshold, "Implementation incorrect for word '%s'" % word