# FACT-UVA: Man is to Programmer as Woman is to Homemaker?

## Links

Debiaswe: https://github.com/tolga-b/debiaswe
Lipstick: https://github.com/gonenhila/gender_bias_lipstick

### How to get the GoogleNews word2vec embeddings:
Download them directly from the official [website](https://code.google.com/archive/p/word2vec/) or clone [this GitHub repo](https://github.com/mmihaltz/word2vec-GoogleNews-vectors). Place the downloaded **.bin** file in the embeddings folder.

### How to get the GloVe embeddings:
Go to the official [website](https://nlp.stanford.edu/projects/glove/). Download **glove.840B.300d.zip**, unzip it, and place the extracted **.txt** file in the embeddings folder.

## Debiasing Word Embeddings

### Word2vec

The code block below runs the main debiasing function on the word2vec Google News embeddings. The function also takes as arguments several JSON files containing the definitional pairs, gender-specific words, and equalize pairs described in the original paper. It writes two files, **bias_word2vec.bin** and **debiased_word2vec.bin**, which hold the embeddings before and after debiasing.

In [5]:
# Debias word2vec embeddings
!python3 code/main.py --debias_o_em=embeddings/debiased_word2vec.bin --i_em=embeddings/GoogleNews-vectors-negative300.bin --bias_o_em=embeddings/bias_word2vec.bin --def_fn=data/definitional_pairs.json --g_words_fn=data/gender_specific_full.json --eq_fn=data/equalize_pairs.json

Namespace(bias_o_em='embeddings/bias_word2vec.bin', debias_o_em='embeddings/debiased_word2vec.bin', def_fn='data/definitional_pairs.json', em_limit=50000, eq_fn='data/equalize_pairs.json', g_words_fn='data/gender_specific_full.json', i_em='embeddings/GoogleNews-vectors-negative300.bin', o_ext='bin')
*** Reading data from embeddings/GoogleNews-vectors-negative300.bin
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Number of words:  26391
Saving biased vectors to file...
Debiasing...
Saving to file...


Done!
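
To give an intuition for what the debiasing step does to a gender-neutral word, here is a minimal sketch of the "neutralize" idea from the original paper: estimate a gender direction from definitional pairs, then remove each neutral word's component along it. This is not the code in `code/main.py`. The helper names (`unit`, `bias_direction`, `neutralize`) and the 3-dimensional toy vectors are hypothetical, and the equalize step is left out.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def bias_direction(vectors, pairs):
    """Estimate a bias direction as the top principal component of the
    offsets of each definitional word from its pair's midpoint."""
    rows = []
    for a, b in pairs:
        center = (vectors[a] + vectors[b]) / 2
        rows.append(vectors[a] - center)
        rows.append(vectors[b] - center)
    # The first right-singular vector is the top principal component.
    _, _, vt = np.linalg.svd(np.array(rows), full_matrices=False)
    return unit(vt[0])


def neutralize(v, g):
    """Remove the component of v along the unit vector g, then renormalize."""
    return unit(v - np.dot(v, g) * g)


# Hypothetical toy embeddings, for illustration only.
vectors = {
    "he": unit(np.array([1.0, 0.2, 0.0])),
    "she": unit(np.array([-1.0, 0.2, 0.0])),
    "man": unit(np.array([0.9, 0.0, 0.3])),
    "woman": unit(np.array([-0.9, 0.0, 0.3])),
    "programmer": unit(np.array([0.4, 0.5, 0.6])),
}

g = bias_direction(vectors, [("he", "she"), ("man", "woman")])
debiased = neutralize(vectors["programmer"], g)

before = float(np.dot(vectors["programmer"], g))
after = float(np.dot(debiased, g))
print(f"projection on bias direction before: {before:.3f}, after: {after:.3f}")
```

After neutralizing, the word's projection onto the estimated direction is zero up to floating-point error, while the vector keeps unit length.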



### GloVe

The only difference between the two formats (word2vec and GloVe) is that the first line of a word2vec file contains the number of words and the vector size, while a GloVe file has no such line. To simplify things and shorten the code, we convert one format to the other so that the code has to support only one. The code block below converts the GloVe embeddings to the word2vec format. It needs to be executed only once.

In [2]:
# convert glove to word2vec format
!code/scripts/gloveToW2V.sh embeddings/glove.840B.300d.txt embeddings/glove.formatted.txt

extracting number of vectors
there are 2196017 lines
extracting vector dimension
cat: write error: Broken pipe
vectors have size 300
creating word2vec format file
done
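
For readers without a shell, the same conversion can be sketched in Python. This is an illustrative equivalent written from the format description above, not a port of `gloveToW2V.sh`. The function name `glove_to_word2vec` and the toy file names are hypothetical, and the vector size is inferred from the first line on the assumption that tokens contain no spaces.

```python
import os
import tempfile


def glove_to_word2vec(src_path, dst_path):
    """Copy a GloVe text file to dst_path, prepending the
    '<number of words> <vector size>' header used by the word2vec format."""
    # First pass: read the dimension from the first line and count the lines.
    with open(src_path, encoding="utf-8") as src:
        first = src.readline()
        if not first:
            raise ValueError("input file is empty")
        # Assumes the first token contains no spaces.
        dim = len(first.rstrip("\n").split(" ")) - 1
        count = 1 + sum(1 for _ in src)

    # Second pass: write the header, then stream the vectors unchanged.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            dst.write(line)
    return count, dim


# Toy demonstration on a two-word, 3-dimensional file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy_glove.txt")
    dst = os.path.join(tmp, "toy_w2v.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("he 0.1 0.2 0.3\nshe 0.4 0.5 0.6\n")
    count, dim = glove_to_word2vec(src, dst)
    with open(dst, encoding="utf-8") as f:
        header = f.readline().strip()
        body = f.read()

print(header)
```

The file is streamed line by line rather than loaded into memory, which matters for an input with about 2.2 million lines such as the one above.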


After converting the GloVe embeddings to the word2vec format, we can rerun the previous experiment, this time using the GloVe embeddings. The function again generates two files, **bias_glove.bin** and **debiased_glove.bin**, holding the embeddings before and after debiasing, respectively.

In [3]:
# Debias glove embeddings
!python3 code/main.py --debias_o_em=embeddings/debiased_glove.bin --i_em=embeddings/glove.formatted.txt --bias_o_em=embeddings/bias_glove.bin --def_fn=data/definitional_pairs.json --g_words_fn=data/gender_specific_full.json --eq_fn=data/equalize_pairs.json

Namespace(bias_o_em='embeddings/bias_glove.bin', debias_o_em='embeddings/debiased_glove.bin', def_fn='data/definitional_pairs.json', em_limit=50000, eq_fn='data/equalize_pairs.json', g_words_fn='data/gender_specific_full.json', i_em='embeddings/glove.formatted.txt', o_ext='bin')
*** Reading data from embeddings/glove.formatted.txt
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Number of words:  23177
Saving biased vectors to file...
Debiasing...
Saving to file...


Done!



### Benchmark debiased embeddings

After generating the four embedding files (biased and debiased, for both word2vec and GloVe), we can run the benchmark tests on them to determine whether removing the bias led to any deterioration. The benchmark results also show whether the original results are replicated with the GloVe embeddings. The `run_test.sh` script evaluates each of the four embeddings on all of the benchmark tests.
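
The similarity benchmarks in the log report a Spearman correlation between the embedding's similarity scores and human judgements. The sketch below shows one way such a score can be computed, under the assumption that cosine similarity is the model score. It is not the benchmark package's implementation. The toy vectors, word pairs, and human scores are hypothetical, and the simple ranking helper assumes there are no tied values.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def ranks(values):
    """Rank positions 0..n-1 of the values (assumes no ties)."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(len(values))
    return r


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])


# Hypothetical toy embeddings and human similarity judgements.
vectors = {
    "cat": np.array([1.0, 0.1, 0.0]),
    "dog": np.array([0.9, 0.3, 0.0]),
    "car": np.array([0.0, 1.0, 0.2]),
    "bus": np.array([0.1, 0.9, 0.4]),
}
benchmark = [
    ("cat", "dog", 9.0),
    ("car", "bus", 8.0),
    ("cat", "car", 3.0),
    ("dog", "bus", 2.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in benchmark]
human_scores = [score for _, _, score in benchmark]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because only the ordering of the pairs matters, a debiased embedding keeps its score on such a benchmark as long as debiasing does not change which pairs it considers more similar than others.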

In [9]:
!cd code/benchmark/scripts/ && ./run_test.sh

05:00:29 INFO:loading projection weights from /mnt/windows_drive_d/Amsterdam University/Year_1/FACT/embeddings/bias_word2vec.bin
05:00:29 INFO:Loading #26391 words with 300 dim
05:00:30 INFO:Transformed 26391 into 26391 words
05:00:30 INFO:Calculating similarity benchmarks
05:00:30 INFO:Spearman correlation of scores on WS353 0.6488884094043214
05:00:30 INFO:Spearman correlation of scores on MTurk 0.513563416944097
05:00:30 INFO:Spearman correlation of scores on WS353S 0.7197011080164891
05:00:31 INFO:Spearman correlation of scores on SimLex999 0.43510114901981567
05:00:31 INFO:Spearman correlation of scores on WS353R 0.5805558093795756
05:00:31 INFO:Spearman correlation of scores on MEN 0.7042259430744064
05:00:31 INFO:Spearman correlation of scores on RG65 0.6937745222655595
05:00:31 INFO:Spearman correlation of scores on RW 0.27659989138184415
05:00:31 INFO:Calculating analogy benchmarks
05:00:31 INFO:Processing 1/196 batch
05:00:32 INFO:Processing 20/196 batch
05:00:33 INFO:Process

05:01:29 INFO:Analogy prediction accuracy on SemEval2012 0.20300340542239614
05:01:29 INFO:Calculating categorization benchmarks
05:01:29 DEBUG:Purity=0.750 using affinity=euclidean linkage=ward
05:01:29 DEBUG:Purity=0.800 using affinity=cosine linkage=average
05:01:29 DEBUG:Purity=0.450 using affinity=cosine linkage=complete
05:01:29 DEBUG:Purity=0.800 using affinity=euclidean linkage=average
05:01:29 DEBUG:Purity=0.450 using affinity=euclidean linkage=complete
05:01:29 DEBUG:Purity=0.750 using KMeans
05:01:29 INFO:Cluster purity on ESSLI_2b 0.8
05:01:30 DEBUG:Purity=0.557 using affinity=euclidean linkage=ward
05:01:30 DEBUG:Purity=0.356 using affinity=cosine linkage=average
05:01:30 DEBUG:Purity=0.448 using affinity=cosine linkage=complete
05:01:30 DEBUG:Purity=0.164 using affinity=euclidean linkage=average
05:01:30 DEBUG:Purity=0.448 using affinity=euclidean linkage=complete
05:01:30 DEBUG:Purity=0.418 using KMeans
05:01:30 INFO:Cluster purity on AP 0.5572139303482587
05:01:30 DEBUG

05:02:35 DEBUG:Purity=0.727 using affinity=euclidean linkage=ward
05:02:35 DEBUG:Purity=0.614 using affinity=cosine linkage=average
05:02:35 DEBUG:Purity=0.727 using affinity=cosine linkage=complete
05:02:35 DEBUG:Purity=0.614 using affinity=euclidean linkage=average
05:02:35 DEBUG:Purity=0.727 using affinity=euclidean linkage=complete
05:02:35 DEBUG:Purity=0.727 using KMeans
05:02:35 INFO:Cluster purity on ESSLI_1a 0.7272727272727273
05:02:35 INFO:Saving results...
         AP  BLESS    Battig      ...          Google       MSR  SemEval2012_2
0  0.532338  0.755  0.272032      ...        0.386154  0.550125       0.183059

[1 rows x 17 columns]
05:02:36 INFO:loading projection weights from /mnt/windows_drive_d/Amsterdam University/Year_1/FACT/embeddings/debiased_glove.bin
05:02:36 INFO:Loading #23177 words with 300 dim
05:02:36 INFO:Transformed 23177 into 23177 words
05:02:36 INFO:Calculating similarity benchmarks
05:02:36 INFO:Spearman correlation of scores on SimLex999 0.4008367026997