# 1 - Natural Language Processing
<hr>

In language modeling, we have seen that we can represent words from a dictionary as vectors using one-hot encoding where all components are zero except for one.

<br>

<div style="text-align:center">
    <img src="images/1-hot-vector.png" width=450>
    <caption><center><font color="purple"><b>Figure 1:</b> One-hot vector representations</font></center></caption>
</div>

Word embeddings offer a nuanced approach to representing words in vector space, contrasting with simpler methods like one-hot encoding. While one-hot encoding provides an easy-to-understand representation (where each word corresponds to a unique vector in a high-dimensional space), it lacks the ability to capture semantic relationships between words.

### 1.1 - Word Embeddings

Word embeddings address this limitation by representing words as vectors in a lower-dimensional space, where each component of the vector captures a different semantic attribute or feature of the word (e.g., age, sex, food/non-food, word type). These features are not explicitly labeled as such but are learned from the context in which words appear. In word embeddings:
- Each dimension represents a latent feature that may correspond to aspects of the word's meaning.
- Semantically similar words have vectors that are close to each other in this space. Unlike one-hot vectors, embedding vectors are dense: most of their components are non-zero.

### 1.2 - Dimensionality Reduction for Visualization

To visualize these relationships, dimensionality reduction techniques like [t-SNE](https://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) (t-distributed Stochastic Neighbor Embedding) can be applied. t-SNE reduces the dimensions of the embedding space to two or three, allowing us to visualize how words cluster together. This often reveals that words with similar meanings are located in proximity to each other in the reduced vector space.

If $w_i$ represents the word embedding for word $i$, then words $i$ and $j$ with similar meanings will have embeddings $w_i$ and $w_j$ such that the distance between them (e.g., Euclidean or cosine) is small.

<br>

<div style="text-align:center">
    <img src="images/vector-space.png" width=300>
    <caption><center><font color="purple"><b>Figure 2:</b> Words in a vector space</font></center></caption>
</div>

# 2 - Properties of Word Embeddings
<hr>

Word embeddings represent words in a continuous vector space where semantically similar words are mapped to nearby points. They are crucial in natural language processing (NLP) tasks like Named Entity Recognition (NER) and can be fine-tuned for specific tasks through transfer learning. Because pre-trained embeddings already encode much of this information, fine-tuning needs a much smaller labeled training set than training from scratch, and the dense vectors have far fewer dimensions than one-hot vectors.

### 2.1 - Word Embeddings versus One-Hot Vectors

One-hot vectors represent words as vectors where each word corresponds to a unique position in a high-dimensional space. However, they fail to capture the level of similarity between words as every one-hot vector is equidistant from any other.
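The equidistance claim can be checked numerically. The sketch below uses a small made-up vocabulary (an assumption for illustration) and shows that every pair of distinct one-hot vectors has the same Euclidean distance and a dot product of zero:

```python
import numpy as np

# Toy vocabulary, chosen only for illustration
vocab = ["man", "woman", "king", "queen", "orange"]

# Row i of the identity matrix is the one-hot vector of vocab[i]
one_hot = np.eye(len(vocab))

# Pairwise Euclidean distances between all one-hot vectors
diff = one_hot[:, None, :] - one_hot[None, :, :]
distances = np.linalg.norm(diff, axis=-1)

# Pairwise dot products: 1 on the diagonal, 0 everywhere else
dots = one_hot @ one_hot.T

print(distances)
print(dots)
```

Every off-diagonal distance equals $\sqrt{2}$, so "king" is exactly as far from "queen" as it is from "orange".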

In contrast, embedding vectors, such as **GloVe vectors,** encapsulate more meaningful information about word semantics. They enable the modeling of relationships and analogies between words, as shown by the following equation:

$$e_{man} - e_{woman} \approx e_{king} - e_{queen}$$

This equation illustrates that the relationship between "man" and "woman" is similar to that between "king" and "queen" in the embedding space.

<br>

<div style="text-align:center">
    <img src="images/word-embeddings.png" width=900>
</div>

### 2.2 - Embedding Matrix

The embeddings of words in a vocabulary are stored in an embedding matrix $E$. This matrix allows efficient retrieval of word embeddings:

$$e_j = E \cdot O_j$$

For a vocabulary of size 10,000 and 300-dimensional embeddings, $E$ is a $300 \times 10,000$ matrix. Multiplying $E$ by a one-hot vector $O_j$ retrieves the corresponding 300-dimensional embedding.

For instance, given $E$ and the one-hot encoding $O_{6257}$, we can compute $e_{6257}$ by multiplying $E$ with $O_{6257}$ as:

$$
e_{6257} = E \cdot O_{6257} = 
\begin{bmatrix}
e_{11} & e_{12} & \cdots & e_{1n} \\
e_{21} & e_{22} & \cdots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \cdots & e_{mn}
\end{bmatrix} 
\cdot
\begin{bmatrix}
0 \\
0 \\
\vdots \\
1 \\
\vdots \\
0
\end{bmatrix}
=
\begin{bmatrix}
e_{1,6257} \\
e_{2,6257} \\
\vdots \\
e_{m,6257}
\end{bmatrix}
$$

With a vocabulary of 10,000 words and 300-dimensional embeddings, $E$ has shape $300 \times 10,000$. Multiplying it by the 10,000-dimensional one-hot vector $O_{6257}$ yields the $300 \times 1$ embedding $e_{6257}$, which is simply column 6257 of $E$:

$$
e_{6257} = E \cdot O_{6257} = 
\begin{bmatrix}
\text{a} & \text{aaron} & \cdots & \text{orange}_{6257} & \cdots & \text{zulu} \\
\text{a} & \text{aaron} & \cdots & \text{orange}_{6257} & \cdots & \text{zulu} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\end{bmatrix}_{ \ 300 \times 10,000} 
\cdot
\begin{bmatrix}
0 \\
0 \\
\vdots \\
1_{6257} \\
\vdots \\
0
\end{bmatrix}_{ \ 10,000 \times 1}
=
\begin{bmatrix}
e_{1,6257} \\
e_{2,6257} \\
\vdots \\
e_{m,6257}
\end{bmatrix}_{ \ 300 \times 1}
$$

Since almost all of these multiplications are by zero, in practice we use a specialized lookup function to retrieve an embedding directly.
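As a minimal sketch of why a lookup is enough, the snippet below compares the matrix product with plain column indexing. The matrix `E` is filled with random numbers as a stand-in for learned embeddings, and the index 6257 is treated as a 0-based column index; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, vocab_size = 300, 10_000

# Random stand-in for a learned embedding matrix (one column per word)
E = rng.standard_normal((embedding_dim, vocab_size))

j = 6257  # column index of the word, 0-based here

# One-hot vector O_j: all zeros except a 1 at position j
o_j = np.zeros(vocab_size)
o_j[j] = 1.0

e_by_matmul = E @ o_j   # 300 x 10,000 multiply-adds, almost all by zero
e_by_lookup = E[:, j]   # direct column lookup, no arithmetic

print(np.allclose(e_by_matmul, e_by_lookup))
```

Both expressions give the same 300-dimensional vector, but the lookup skips the multiplication entirely.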

### 2.3 - Cosine Similarity

To measure the similarity between two words, we need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows: 

$$\text{CosineSimilarity}(u, v) = \frac {u \cdot v} {||u||_2 \, ||v||_2} = \cos(\theta)$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$. 
* The cosine similarity depends on the angle between $u$ and $v$. 
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value, down to -1 for vectors pointing in opposite directions.

<img src="images/cosine_sim.png" style="width:800px;">
<caption><center><font color='purple'><b>Figure 3</b>: The cosine of the angle between two vectors is a measure of their similarity</font></center></caption>

In [14]:
def cosine_similarity(u, v):
    """Cosine similarity reflects the degree of similarity between u and v 
    Arguments:
        u -- a word vector of shape (n,)          
        v -- a word vector of shape (n,)
    Returns:
        cosine_similarity -- the cosine similarity between u and v.
    """
    
    # Special case. Consider the case u = [0, 0], v=[0, 0]
    if np.all(u == v):
        return 1
    
    # Compute the dot product between u and v
    dot = np.dot(u, v) 
    
    # Compute the L2 norm of u
    norm_u = np.sqrt(np.sum(u ** 2))
    
    # Compute the L2 norm of v
    norm_v = np.sqrt(np.sum(v ** 2))
    
    # Avoid division by 0
    if np.isclose(norm_u * norm_v, 0, atol=1e-32):
        return 0
    
    # Compute the cosine similarity
    cosine_similarity = dot / (norm_u * norm_v)
    
    return cosine_similarity

# 3 - Learning Word Embeddings
<hr>

### 3.1 - Word2Vec

Word2Vec (W2V) is a prevalent model for learning word embeddings, offering two distinct approaches:

1. Skip-Gram
2. CBOW (Continuous Bag Of Words)

#### <font color="purple">Skip-Gram</font>

The Skip-Gram model operates on the principle of predicting surrounding words given a specific word in a sentence. Here, we differentiate between two types of words:

- **Context Word:** The word based on which predictions are made.
- **Target Words:** The surrounding words within a specified window that the model attempts to predict.

For a given context word, the Skip-Gram model aims to predict the target words within a defined window size. For instance, if the window size is 5, the model predicts the 5 words before and after the context word.
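The windowing step can be sketched as follows. The helper `skipgram_pairs` and the example sentence are hypothetical, introduced only for illustration. The sketch enumerates every (context, target) pair inside the window, whereas training would typically sample from these pairs.

```python
def skipgram_pairs(tokens, window_size):
    """Return all (context, target) pairs within window_size positions."""
    pairs = []
    for i, context in enumerate(tokens):
        # Clip the window at the sentence boundaries
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own target
                pairs.append((context, tokens[j]))
    return pairs


sentence = "i want a glass of orange juice to go along with my cereal".split()
pairs = skipgram_pairs(sentence, window_size=2)

# Pairs whose context word is "orange"
orange_pairs = [p for p in pairs if p[0] == "orange"]
print(orange_pairs)
# [('orange', 'glass'), ('orange', 'of'), ('orange', 'juice'), ('orange', 'to')]
```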

Consider a vocabulary of 10,000 words. For training, we select a pair of words: a context word "orange" and a target word "juice". The embeddings of these words, denoted as $e_c$ (for "orange") and $e_t$ (for "juice"), are derived using the embedding matrix $E$ and one-hot encoding $O_j$:

$$e_j = E \cdot O_j$$

In the case of predicting the target word "juice" from the context word "orange", the embedding $e_c$ is input into a softmax unit to compute the probability of each word in the vocabulary being the target word:

$$p(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000} e^{\theta_j^T e_c}}$$

Here, $\theta_t$ represents the parameters of the softmax function for the target word.

The output $\hat{y}$ is a probability distribution across the entire vocabulary. The training objective is to adjust the parameters (the embedding matrix $E$ and the softmax parameters $\theta$) to maximize the likelihood of the actual target word. The loss function, based on negative log-likelihood, is defined as:

$$\mathcal{L}(\hat{y}, y) = - \sum_{i=1}^{10,000} y_i \log{\hat{y}_i}$$

where $y$ is the one-hot encoding of the actual target word.
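To make the two formulas concrete, here is a minimal NumPy sketch of one forward pass. The tiny vocabulary size, the random values of `E` and `Theta`, and the indices `c` and `t` are all assumptions for illustration; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4   # tiny sizes instead of 10,000 and 300

E = rng.standard_normal((embedding_dim, vocab_size))      # column j is e_j
Theta = rng.standard_normal((vocab_size, embedding_dim))  # row j is theta_j


def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


c, t = 3, 7                    # hypothetical indices of context and target words
e_c = E[:, c]                  # embedding lookup for the context word

# p(t | c) for every candidate target word in the vocabulary
y_hat = softmax(Theta @ e_c)

# One-hot encoding of the actual target word
y = np.zeros(vocab_size)
y[t] = 1.0

# Negative log-likelihood; only the term for the true target survives
loss = -np.sum(y * np.log(y_hat))
print(loss, -np.log(y_hat[t]))
```

Because $y$ is one-hot, the sum collapses to $-\log \hat{y}_t$, which is why the two printed values agree. The denominator of the softmax is what becomes expensive with 10,000 or more words.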

While effective, the Skip-Gram model faces computational challenges, particularly with the softmax function over large vocabularies, because the denominator sums over every word. To mitigate this, hierarchical softmax can be used. This method arranges the vocabulary in a binary tree, so computing a word's probability takes a number of steps roughly logarithmic in the vocabulary size rather than a sum over all words.

### 3.2 - Negative Sampling

Negative Sampling offers a more efficient approach to compute word embeddings compared to Skip-Gram with a full softmax, particularly in the context of large vocabularies. It reformulates the learning problem by generating a training set comprising both valid (positive) and artificially generated invalid (negative) context-target pairs.

For each valid context-target word pair like (orange, juice), we generate $k$ negative samples, leading to a training set of $k+1$ samples. That is, the samples might look like this:

| context | word | target? |
| ---- | ---- | ---- |
| orange | juice | 1 |
| orange | king | 0 |
| orange | book | 0 |
| orange | the | 0 |
| orange | of | 0 |

Here's how it works:

- **Positive Sample Generation:** For a given context word, a target word is sampled from its surrounding window, creating a valid pair. This pair is labeled as "1", indicating a genuine context-target relationship like (orange, juice).
- **Negative Sample Generation:** Next, $k$ additional target words are randomly chosen from the entire vocabulary, irrespective of their actual context. These pairs are labeled as "0", representing false context-target pairs. The typical range for $k$ is 5-20 for smaller datasets and 2-5 for larger ones.

The resultant training set consists of $k+1$ word pairs, turning the learning problem into binary classification. In each iteration, a neural network is trained on these $k+1$ pairs rather than the entire vocabulary as in Skip-Gram.
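A minimal sketch of building the $k+1$ examples for one positive pair is shown below. The helper name `make_training_examples` and the toy vocabulary are hypothetical. Negatives are drawn uniformly here as a simplifying assumption; the original Word2Vec work instead samples from a smoothed word-frequency distribution.

```python
import random


def make_training_examples(context, target, vocab, k, rng):
    """Return one positive pair plus k randomly drawn negative pairs."""
    examples = [(context, target, 1)]            # the genuine pair
    for _ in range(k):
        negative = rng.choice(vocab)             # drawn regardless of context
        examples.append((context, negative, 0))  # labeled as a false pair
    return examples


vocab = ["orange", "juice", "king", "book", "the", "of", "glass"]
examples = make_training_examples(
    "orange", "juice", vocab, k=4, rng=random.Random(0)
)

for row in examples:
    print(row)
```

Since negatives are drawn irrespective of context, a drawn word can occasionally be a genuine neighbor of the context word; with a large vocabulary this is rare.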

The probability of a target word $t'$ occurring in the context of word $c$ is modeled as:

$$P(y=1 \mid c, t') = \sigma \left( \theta^{T}_{t'} e_c \right)$$

Here, $\sigma$ denotes the sigmoid function, $\theta_{t'}$ the parameter vector for the target word $t'$, and $e_c$ the embedding of the context word $c$.

The objective is to adjust $\theta_{t'}$ and $e_c$ to minimize the cost function, so that $P(y=1 \mid c, t')$ is close to 1 for true pairs and close to 0 for negative samples.
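The binary classification view can be sketched in a few lines of NumPy. The parameter values are random and the word indices are made up, so this only illustrates the shape of the computation: $k+1$ sigmoid units are evaluated instead of a softmax over the whole vocabulary. The summed binary cross-entropy used as the cost is a common choice and an assumption here, since the text above does not spell out the cost function.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4

E = rng.standard_normal((embedding_dim, vocab_size))      # column j is e_j
Theta = rng.standard_normal((vocab_size, embedding_dim))  # row j is theta_j

c = 3                                    # hypothetical context word index
candidates = np.array([7, 1, 5, 8, 2])   # one true target followed by k = 4 negatives
labels = np.array([1, 0, 0, 0, 0])

e_c = E[:, c]

# P(y = 1 | c, t') for each of the k + 1 candidate words only
probs = sigmoid(Theta[candidates] @ e_c)

# Binary cross-entropy summed over the k + 1 pairs
loss = -np.sum(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(probs, loss)
```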

### 3.3 - GloVe

Global Vectors for Word Representation (GloVe) is an alternative word embedding method to Word2Vec (W2V), notable for its simplicity and effectiveness. Unlike W2V, which relies on local context information, GloVe focuses on word co-occurrence statistics across the entire corpus.

<b><font color="purple">Co-Occurrence Matrix</font></b>

GloVe constructs a co-occurrence matrix where, for a given word $i$, it counts the occurrences of every other word $j$ within a defined context (like a window around word $i$). This generates a co-occurrence value $x_{ij}$ for each word pair.
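A small sketch of the counting step, assuming plain unweighted counts in a symmetric window. The helper name `cooccurrence_matrix` and the example sentence are hypothetical.

```python
import numpy as np


def cooccurrence_matrix(tokens, window_size):
    """Count how often each word pair appears within window_size positions."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))

    for pos, word in enumerate(tokens):
        lo = max(0, pos - window_size)
        hi = min(len(tokens), pos + window_size + 1)
        for other in range(lo, hi):
            if other != pos:
                X[index[word], index[tokens[other]]] += 1
    return vocab, X


tokens = "the cat sat on the mat".split()
vocab, X = cooccurrence_matrix(tokens, window_size=1)

print(vocab)
print(X)
```

With a symmetric window the matrix is symmetric, i.e. $x_{ij} = x_{ji}$.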

<b><font color="purple">Objective Function</font></b>

GloVe aims to minimize the following cost function, for a vocabulary of 10,000 words:

$$\text{minimize} \sum_{i=1}^{10,000} \sum_{j=1}^{10,000} f(x_{ij}) \left( \theta^T_i e_j + b_i + b'_j - \log{x_{ij}} \right)^2$$

Here, $\theta_i$ and $e_j$ are the word vectors for words $i$ and $j$, respectively, while $b_i$ and $b'_j$ are scalar biases for these words.

<b><font color="purple">Weighting Function $f(x_{ij})$</font></b>

The function $f(x_{ij})$ is a weighting term with specific characteristics:

- It is zero if $x_{ij}=0$, effectively filtering out pairs of words that never co-occur and avoiding undefined logarithms.
- It assigns higher weights to more frequent words, but not disproportionately high to prevent common words (like stop words) from dominating.
- It gives some weight to less frequent words, ensuring that they still contribute meaningfully to the embeddings.
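The text above does not give a formula for $f$. The sketch below uses the form proposed in the GloVe paper, $f(x) = (x / x_{max})^{\alpha}$ capped at 1, with $x_{max} = 100$ and $\alpha = 0.75$, and plugs it into the cost function. The co-occurrence counts and parameters are random placeholders, and the function names are hypothetical.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting term: 0 at x = 0, grows sub-linearly, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_cost(X, Theta, E, b, b_prime):
    """Weighted squared error between theta_i . e_j + biases and log x_ij."""
    mask = X > 0
    # Replace zero counts by 1 so the log is defined; their weight is 0 anyway
    log_X = np.log(np.where(mask, X, 1.0))
    residual = Theta @ E.T + b[:, None] + b_prime[None, :] - log_X
    return np.sum(glove_weight(X) * mask * residual ** 2)


rng = np.random.default_rng(0)
V, d = 6, 3                                        # tiny vocabulary and embedding size
X = rng.integers(0, 5, size=(V, V)).astype(float)  # placeholder co-occurrence counts
Theta = rng.standard_normal((V, d))                # row i is theta_i
E = rng.standard_normal((V, d))                    # row j is e_j
b, b_prime = np.zeros(V), np.zeros(V)

print(glove_weight(np.array([0.0, 1.0, 10.0, 100.0, 1000.0])))
print(glove_cost(X, Theta, E, b, b_prime))
```

Pairs that never co-occur get weight zero, rare pairs get a small but non-zero weight, and very frequent pairs are capped so they cannot dominate the sum.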

<b><font color="purple">Symmetry in Word Vectors</font></b>

In GloVe, $\theta_i$ and $e_j$ play symmetric roles in the learning objective. Consequently, the final word vector for a word $w$ can be obtained by averaging $e_w$ and $\theta_w$.

Despite its simplicity, GloVe effectively captures complex patterns and relationships in language. Its global perspective on word co-occurrence is a distinctive feature compared to local context-focused methods like W2V. This broader view often makes GloVe a preferred choice for researchers seeking a balance between computational efficiency and representational power in word embeddings.

In [3]:
import numpy as np
from utils.w2v_utils import *

#### Load the Word Vectors

Use 50-dimensional GloVe vectors to represent words.

In [5]:
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
            
    return words, word_to_vec_map

In [6]:
words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

In [8]:
len(words), len(word_to_vec_map)

(400000, 400000)

In [13]:
w = list(words)[17]
print(f"word: {w}")
print(f"vector: {word_to_vec_map[w]}")

word: teambuilding
vector: [ 0.2316    -0.60879   -0.67305    0.54132   -0.40751   -0.02131
  0.96881    0.65686    0.092394   0.17566   -0.28613    0.10204
  0.29729    0.56033   -0.86261   -0.73788   -0.67957    0.80383
  0.66244   -0.27007    0.30373    0.0090044 -0.30944   -0.12451
 -0.23122    1.1985     0.24719   -0.49688    0.6335     0.76954
 -1.0342     0.86625    0.34631   -0.25434   -0.10282    0.61619
 -0.72374   -0.17461   -1.0606     0.25897   -0.229     -0.52017
 -0.54502    0.4919     0.76831   -0.4819     0.39363    0.97521
 -0.81254    1.1206   ]


#### Word Analogy Task

* In the word analogy task, complete this sentence:  
    <font color='brown'>"*a* is to *b* as *c* is to **____**"</font>. 

* An example is:  
    <font color='brown'> '*man* is to *woman* as *king* is to *queen*' </font>. 

* We're trying to find a word *d*, such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner:   
    $e_b - e_a \approx e_d - e_c$
* Measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity. 

In [15]:
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Performs the word analogy task as explained above: a is to b as c is to ____. 
    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        word_to_vec_map -- dictionary that maps words to their corresponding vectors.     
    Returns:
    best_word --  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    
    # Convert words to lowercase
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c] 
    
    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output
    
    # Loop over the whole word vector set
    for w in words:   
        # To avoid best_word being one of the input words, skip word_c
        if w == word_c:
            continue
        
        # Compute cosine similarity between the vector (e_b - e_a) 
        # and the vector ((w's vector representation) - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        
    return best_word

In [16]:
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), 
                 ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]

for triad in triads_to_try:
    print ('{} -> {} :: {} -> {}'.format( *triad, complete_analogy(*triad, word_to_vec_map)))

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> smaller


# 4 - Sentiment Classification
<hr>

Sentiment classification (SC) is the process of deciding from a text whether the writer likes or dislikes something. This is needed, for example, to map textual reviews to star ratings (1 star = bad, 5 stars = great).

The learning problem in SC is to learn a function which maps an input $x$ (e.g. a restaurant review) to a discrete output $y$ (e.g. a star-rating). Therefore the learning problem is a multinomial classification problem where the predicted class is the number of stars. For such learning problems, however, training data is usually sparse.

A simple classifier could consist of calculating the word embeddings for each word in the review and calculating their average. This average vector could then be fed into a softmax classifier which calculates the probability for each of the target classes. This also works for long reviews because the average vector will always have the same dimensionality. However, by averaging the word vectors, we lose information about the order of the words. This is important because the word sequence "not good" should have a negative impact on the star-rating whereas when calculating the average of "not" and "good" individually, the negative meaning is lost and the star-rating would be influenced positively. An alternative would therefore be to train an RNN with the word embeddings.


<br>

<div style="text-align:center">
    <img src="images/rnn-for-sentiment-classification.png" width=600>
    <caption><center><font color='purple'><b>Figure 4</b>: Sentiment Classification with an RNN</font></center></caption>
</div>
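For comparison with the RNN, the simple averaging classifier described earlier can be sketched as below. The toy embeddings and the softmax weights `W` and `b` are random and untrained, and the helper names are hypothetical, so the predicted probabilities carry no meaning. The point is the shape of the computation and the loss of word order.

```python
import numpy as np


def average_embedding(review, word_to_vec, dim):
    """Average the embeddings of all known words in the review."""
    vectors = [word_to_vec[w] for w in review.lower().split() if w in word_to_vec]
    if not vectors:
        return np.zeros(dim)   # no known words: fall back to the zero vector
    return np.mean(vectors, axis=0)


def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


rng = np.random.default_rng(0)
dim, n_classes = 4, 5          # 5 classes for 1 to 5 stars

# Random stand-ins for pre-trained embeddings and classifier weights
toy_vocab = ["the", "food", "was", "not", "good"]
word_to_vec = {w: rng.standard_normal(dim) for w in toy_vocab}
W = rng.standard_normal((n_classes, dim))
b = np.zeros(n_classes)

avg_a = average_embedding("not good", word_to_vec, dim)
avg_b = average_embedding("good not", word_to_vec, dim)

probs = softmax(W @ avg_a + b)   # one probability per star rating
print(probs)
print(np.allclose(avg_a, avg_b))
```

The two reviews contain the same words in a different order and produce the same average vector, which is exactly the information an RNN can keep and the averaging classifier cannot.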

# 5 - Debiasing Word Embeddings
<hr>

Word embeddings can suffer from bias depending on the training data used. Here, bias refers to stereotypes related to gender, race, age, etc. Such bias is a problem because the embeddings can reinforce stereotypes by learning inappropriate relationships between words (e.g. man is to computer programmer as woman is to homemaker).

In [18]:
print ('List of names and their similarities with constructed vector:')

g = word_to_vec_map['woman'] - word_to_vec_map['man']

# Girls' and boys' names
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

List of names and their similarities with constructed vector:
john -0.23163356145973724
marie 0.315597935396073
sophie 0.3186878985941878
ronaldo -0.31244796850329437
priya 0.17632041839009402
rahul -0.16915471039231722
danielle 0.24393299216283895
reza -0.07930429672199553
katy 0.2831068659572615
yasmin 0.23313857767928753


As we can see, female first names tend to have a positive cosine similarity with our constructed vector $g$, while male first names tend to have a negative cosine similarity. This is not surprising, and the result seems acceptable. 

Now try with some other words:

In [19]:
print('Other words and their similarities:')
word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior','doctor', 'tree', 'receptionist', 
             'technology',  'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

Other words and their similarities:
lipstick 0.27691916256382665
guns -0.1888485567898898
science -0.06082906540929699
arts 0.008189312385880344
literature 0.0647250443345993
warrior -0.20920164641125288
doctor 0.11895289410935045
tree -0.07089399175478092
receptionist 0.3307794175059374
technology -0.13193732447554293
fashion 0.035638946257727
teacher 0.1792092343182567
engineer -0.08039280494524072
pilot 0.0010764498991917074
computer -0.10330358873850498
singer 0.18500518136496297


Do you notice anything surprising? It is astonishing how these results reflect certain unhealthy gender stereotypes. For example, we see "computer" is negative and is closer in value to male first names, while "literature" is positive and is closer to female first names. Ouch!

To reduce the bias of these vectors, we'll use an algorithm due to [Bolukbasi et al., 2016](https://arxiv.org/abs/1607.06520). Note that some word pairs such as "actor"/"actress" or "grandmother"/"grandfather" should remain gender-specific, while other words such as "receptionist" or "technology" should be neutralized, i.e. not be gender-related.

### 5.1 - Neutralize Bias for Non-Gender Specific Words 

The figure below should help visualize what neutralizing does. If we're using a 50-dimensional word embedding, the 50 dimensional space can be split into two parts: The bias-direction $g$, and the remaining 49 dimensions, which is called $g_{\perp}$ here. In linear algebra, we say that the 49-dimensional $g_{\perp}$ is perpendicular (or "orthogonal") to $g$, meaning it is at 90 degrees to $g$. The neutralization step takes a vector such as $e_{receptionist}$ and zeros out the component in the direction of $g$, giving us $e_{receptionist}^{debiased}$. 

Even though $g_{\perp}$ is 49-dimensional, given the limitations of what we can draw on a 2D screen, it's illustrated using a 1-dimensional axis below. 

<img src="images/neutralize.png" style="width:800px;">
<caption><center><font color='purple'><b>Figure 5</b>: The word vector for "receptionist" represented before and after applying the neutralize operation.</font> </center></caption>

Given an input embedding $e$, we can use the following formulas to compute $e^{debiased}$: 

$$e^{bias\_component} = \frac{e \cdot g}{||g||_2^2} * g$$
$$e^{debiased} = e - e^{bias\_component}$$

**Note:** The [paper](https://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf) that introduces the debiasing algorithm assumes all word vectors have unit L2 norm, so we normalize the vectors first:

In [20]:
# The paper assumes all word vectors have unit L2 norm, so normalize each vector first

from tqdm import tqdm
word_to_vec_map_unit_vectors = {
    word: embedding / np.linalg.norm(embedding)
    for word, embedding in tqdm(word_to_vec_map.items())
}

g_unit = word_to_vec_map_unit_vectors['woman'] - word_to_vec_map_unit_vectors['man']

100%|██████████| 400000/400000 [00:01<00:00, 228011.37it/s]


In [21]:
def neutralize(word, g, word_to_vec_map):
    """Removes the bias of "word" by projecting it on the space orthogonal to the bias axis. 
    This function ensures that gender neutral words are zero in the gender subspace.
    
    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.
    
    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """
    
    # Select word vector representation of "word"
    e = word_to_vec_map[word]
    
    # Compute e_biascomponent using the formula given above
    e_biascomponent = np.dot(e, g) * g / np.linalg.norm(g)**2
 
    # Neutralize e by subtracting e_biascomponent from it 
    e_debiased = e - e_biascomponent
    
    return e_debiased

In [23]:
word = "receptionist"
print("cosine similarity between " + word + " and g, before neutralizing: ", cosine_similarity(word_to_vec_map[word], g))

e_debiased = neutralize(word, g_unit, word_to_vec_map_unit_vectors)
print("cosine similarity between " + word + " and g_unit, after neutralizing: ", cosine_similarity(e_debiased, g_unit))

cosine similarity between receptionist and g, before neutralizing:  0.3307794175059374
cosine similarity between receptionist and g_unit, after neutralizing:  2.0560779843378157e-17


### 5.2 - Equalization Algorithm for Gender-Specific Words

Next, let's see how debiasing can also be applied to word pairs such as "actress" and "actor." Equalization is applied to pairs of words that we might want to have differ only through the gender property. As a concrete example, suppose that "actress" is closer to "babysit" than "actor." By applying neutralization to "babysit," we can reduce the gender stereotype associated with babysitting. But this still does not guarantee that "actor" and "actress" are equidistant from "babysit." The equalization algorithm takes care of this. 

The key idea behind equalization is to make sure that a particular pair of words are equidistant from the 49-dimensional $g_\perp$. The equalization step also ensures that the two equalized words are now the same distance from $e_{receptionist}^{debiased}$, or from any other word that has been neutralized. Visually, this is how equalization works: 

<img src="images/equalize.png" style="width:800px;height:400px;">

The derivation of the linear algebra to do this is a bit more complex. (See Bolukbasi et al., 2016 in the References for details.) Here are the key equations: 

$$ \mu = \frac{e_{w1} + e_{w2}}{2}$$ 

$$ \mu_{B} = \frac {\mu \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} *\text{bias_axis}$$ 

$$\mu_{\perp} = \mu - \mu_{B}$$

$$ e_{w1B} = \frac {e_{w1} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} *\text{bias_axis}$$ 

$$ e_{w2B} = \frac {e_{w2} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} *\text{bias_axis}$$


$$e_{w1B}^{corrected} = \sqrt{{1 - ||\mu_{\perp} ||^2_2}} * \frac{e_{\text{w1B}} - \mu_B} {||e_{w1B} - \mu_B||_2}$$

$$e_{w2B}^{corrected} = \sqrt{{1 - ||\mu_{\perp} ||^2_2}} * \frac{e_{\text{w2B}} - \mu_B} {||e_{w2B} - \mu_B||_2}$$

$$e_1 = e_{w1B}^{corrected} + \mu_{\perp}$$

$$e_2 = e_{w2B}^{corrected} + \mu_{\perp}$$

In [22]:
def equalize(pair, bias_axis, word_to_vec_map):
    """Debias gender specific words by following the equalize method described in the figure above.
    
    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor") 
    bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors
    
    Returns
    e_1 -- word vector corresponding to the first word
    e_2 -- word vector corresponding to the second word
    """
    
    # Step 1: Select word vector representation of "word"
    w1, w2 = pair[0], pair[1]
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]
    
    # Step 2: Compute the mean of e_w1 and e_w2
    mu = (e_w1 + e_w2) / 2

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis
    mu_B = np.dot(mu, bias_axis) * bias_axis / np.linalg.norm(bias_axis)**2
    mu_orth = mu - mu_B

    # Step 4: Compute e_w1B and e_w2B
    e_w1B = np.dot(e_w1, bias_axis) * bias_axis / np.linalg.norm(bias_axis)**2
    e_w2B = np.dot(e_w2, bias_axis) * bias_axis / np.linalg.norm(bias_axis)**2
        
    # Step 5: Adjust the Bias part of e_w1B and e_w2B
    corrected_e_w1B = np.sqrt(1 - np.linalg.norm(mu_orth)**2) * (e_w1B - mu_B) / np.linalg.norm(e_w1B - mu_B)
    corrected_e_w2B = np.sqrt(1 - np.linalg.norm(mu_orth)**2) * (e_w2B - mu_B) / np.linalg.norm(e_w2B - mu_B)

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth
    
    return e1, e2

In [24]:
print("cosine similarities before equalizing:")
print("cosine_similarity(word_to_vec_map[\"man\"], gender) = ", cosine_similarity(word_to_vec_map["man"], g))
print("cosine_similarity(word_to_vec_map[\"woman\"], gender) = ", cosine_similarity(word_to_vec_map["woman"], g))
print()
e1, e2 = equalize(("man", "woman"), g_unit, word_to_vec_map_unit_vectors)
print("cosine similarities after equalizing:")
print("cosine_similarity(e1, gender) = ", cosine_similarity(e1, g_unit))
print("cosine_similarity(e2, gender) = ", cosine_similarity(e2, g_unit))

cosine similarities before equalizing:
cosine_similarity(word_to_vec_map["man"], gender) =  -0.11711095765336832
cosine_similarity(word_to_vec_map["woman"], gender) =  0.35666618846270376

cosine similarities after equalizing:
cosine_similarity(e1, gender) =  -0.23871136142883811
cosine_similarity(e2, gender) =  0.23871136142883814
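The equidistance property can also be checked on a toy example. The sketch below re-implements the equalization equations on raw vectors so that it is self-contained. The 3-dimensional vectors and the helper names `project`, `unit` and `equalize_vectors` are made up for illustration and are not the GloVe vectors used above.

```python
import numpy as np


def project(v, axis):
    """Projection of v onto the direction given by axis."""
    return np.dot(v, axis) * axis / np.dot(axis, axis)


def unit(v):
    return v / np.linalg.norm(v)


def equalize_vectors(e_w1, e_w2, bias_axis):
    """Apply the equalization equations to two unit-norm vectors."""
    mu = (e_w1 + e_w2) / 2
    mu_B = project(mu, bias_axis)
    mu_orth = mu - mu_B
    scale = np.sqrt(1 - np.linalg.norm(mu_orth) ** 2)
    e_w1B, e_w2B = project(e_w1, bias_axis), project(e_w2, bias_axis)
    e1 = scale * (e_w1B - mu_B) / np.linalg.norm(e_w1B - mu_B) + mu_orth
    e2 = scale * (e_w2B - mu_B) / np.linalg.norm(e_w2B - mu_B) + mu_orth
    return e1, e2


bias_axis = np.array([1.0, 0.0, 0.0])        # toy bias direction
e_w1 = unit(np.array([0.6, 0.5, 0.2]))       # made-up unit vectors
e_w2 = unit(np.array([-0.3, 0.6, 0.1]))

e1, e2 = equalize_vectors(e_w1, e_w2, bias_axis)

# A vector with no component along the bias axis, as after neutralization
neutral = np.array([0.0, 0.8, 0.6])

print(np.linalg.norm(e1 - neutral), np.linalg.norm(e2 - neutral))
print(np.dot(e1, bias_axis), np.dot(e2, bias_axis))
```

After equalization the two vectors share the same component orthogonal to the bias axis and have opposite components along it, so they sit at the same distance from any neutralized vector.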
