# Bias in word embeddings
Word embeddings are a technique in NLP and text mining for representing words as vectors so that words can be compared by similarity. The embeddings are trained on large datasets of text written by humans, so the biases we hold and express in our writing are transferred to the embeddings. This project examines which biases are present in the Swedish word embeddings "Swectors", which are based on text from Göteborgsposten. They will also be compared with another set of word embeddings trained on another dataset.

In [1]:
import bz2
import pandas as pd
import numpy as np
import time

## Importing datasets
Importing the large set of vectors may take a little while; please be patient.

In [2]:
colnames = ["word"] + ["dim" + str(x) for x in range(1,301)]
with bz2.open("swectors-300dim.txt.bz2") as source:
    swectors = pd.read_csv(source, header=None, names=colnames, delimiter=" ", skiprows=[0])
    
swectors.tail()

Unnamed: 0,word,dim1,dim2,dim3,dim4,dim5,dim6,dim7,dim8,dim9,...,dim291,dim292,dim293,dim294,dim295,dim296,dim297,dim298,dim299,dim300
192245,senarelagts,0.230236,-0.698184,0.796791,0.478522,0.247992,-0.511549,-0.529833,0.038788,-0.330778,...,0.89395,-0.354877,0.102388,-0.242362,0.834064,-0.650772,0.483784,-0.057586,0.092397,-0.053338
192246,hopfogade,-0.855621,-0.809333,0.492892,0.301056,-0.827752,0.726347,-1.175709,-0.37215,1.453347,...,0.219323,-1.010406,-0.082386,-0.681065,-0.762676,0.490875,0.080074,1.547569,-0.741131,0.440568
192247,brandsäkert,0.043847,0.629073,-0.972256,-0.300409,-0.430899,-0.148622,0.368104,0.218571,-0.299612,...,0.078883,-0.424826,-0.201831,0.397869,-0.591887,0.093042,0.194235,-0.501115,-0.199177,-0.689233
192248,skummisar,-0.982998,-1.007951,0.482135,-0.333532,-0.336754,0.055291,0.106165,-0.276979,0.521596,...,-0.276437,-0.745135,-0.275002,-0.083945,0.377782,-0.000553,-0.011056,-0.213323,0.626405,0.18865
192249,parkeringsgaraget,-0.202439,-0.70252,-0.334053,0.112022,-0.489995,0.157127,-0.645399,0.038916,-0.655055,...,-0.28111,-0.880703,0.430193,0.522002,0.702857,-0.245853,1.137718,0.190806,0.104953,0.717571


### Importing the medium-sized dataset

In [3]:
colnames = ["word"] + ["dim" + str(x) for x in range(1,301)]
with bz2.open("swectors_medium-300dim.txt.bz2") as source:
    swectors_medium = pd.read_csv(source, header=None, names=colnames, delimiter=" ", skiprows=[0])
    
swectors_medium.tail()

Unnamed: 0,word,dim1,dim2,dim3,dim4,dim5,dim6,dim7,dim8,dim9,...,dim291,dim292,dim293,dim294,dim295,dim296,dim297,dim298,dim299,dim300
21417,azalea,2.112017,-0.533726,0.053609,-0.786255,0.689719,1.012232,-0.350125,-1.63128,-2.606261,...,-1.741931,-2.952205,-0.110983,0.456692,-0.544853,-1.828536,1.221711,1.027984,-1.07711,0.473681
21418,näve,1.345923,0.597424,2.513506,3.45716,2.651895,6.182822,-0.368426,-3.371817,2.64359,...,-0.438095,-0.042643,-1.613785,-5.32829,-1.340255,0.338083,-0.596297,-2.04311,-1.599239,-0.994915
21419,gråtande,1.130774,-1.903109,-1.494416,0.140581,0.953516,0.785574,-1.491892,-3.746919,0.615522,...,-2.713697,-1.247861,1.611524,-2.873954,-0.594923,0.62634,-1.108217,1.135933,0.584222,2.188613
21420,landström,-0.660288,1.06114,0.185531,-0.807531,1.036153,0.982675,0.793358,-4.035374,0.350284,...,-0.311126,-4.305467,-1.73259,2.748172,-1.667064,-2.170431,1.775902,-4.257471,-1.002027,1.401197
21421,swärd,-1.00977,-0.66999,-0.76619,-2.407949,3.631123,0.811528,-1.738418,-1.147215,0.467119,...,-2.448882,-3.163466,-2.989345,-1.085497,-2.759893,-4.205324,-0.240915,-1.984612,-0.711889,1.204459


### Importing the small dataset

In [4]:
colnames = ["word"] + ["dim" + str(x) for x in range(1,301)]
with bz2.open("swectors_short-300dim.txt.bz2") as source:
    swectors_short = pd.read_csv(source, header=None, names=colnames, delimiter=" ", skiprows=[0])
    
swectors_short.tail()

Unnamed: 0,word,dim1,dim2,dim3,dim4,dim5,dim6,dim7,dim8,dim9,...,dim291,dim292,dim293,dim294,dim295,dim296,dim297,dim298,dim299,dim300
232,visar,-1.476303,-0.164548,0.837403,3.601232,-3.16256,1.402999,2.718013,-2.645194,0.60941,...,-0.856245,4.448714,-0.576104,1.739378,0.391575,-0.12585,2.015083,1.718248,-0.844881,2.715925
233,polisen,7.009651,7.01369,-1.444328,-4.925603,3.559211,-3.249944,2.048355,-6.019977,-3.143787,...,-0.770776,-0.047948,2.189141,-1.423542,-0.780415,-0.023532,3.538834,0.580358,-0.135183,3.497905
234,ur,-1.7414,-0.374864,3.145575,2.075605,-0.671171,6.320302,-0.073202,-4.386775,1.14845,...,3.385483,0.690599,5.339314,-2.97969,-4.16792,-0.03133,-4.535151,-1.007191,-2.10649,1.962901
235,...,-0.908397,0.421404,-1.230688,-2.445108,-1.759891,-0.355466,0.230951,1.841953,-0.037035,...,-0.285616,-0.468213,0.04761,0.137823,-0.127526,0.699059,-2.317676,-0.820054,2.093197,0.324723
236,tiden,-0.111988,-1.90059,-1.660171,1.816599,-2.846941,3.079602,0.178024,-1.241285,-4.933645,...,-3.361414,-1.638938,-2.823144,0.061135,-0.974601,-1.929044,-0.087526,-3.543884,-1.879851,2.901012
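
The three import cells above repeat the same steps, so they could be wrapped in one function. The following is only a sketch: `load_vectors` and its `dims` parameter are hypothetical names that do not appear in the notebook, and the file is assumed to be a space-separated text file with one header line, as in the cells above.

```python
import bz2
import pandas as pd

def load_vectors(path, dims=300):
    """Hypothetical helper: read a bz2-compressed vector text file into a DataFrame."""
    colnames = ["word"] + ["dim" + str(i) for i in range(1, dims + 1)]
    with bz2.open(path) as source:
        # skiprows=[0] drops the first line of the file, as in the cells above
        return pd.read_csv(source, header=None, names=colnames,
                           delimiter=" ", skiprows=[0])
```

With such a helper, each dataset would be loaded with a single call, for example `swectors_short = load_vectors("swectors_short-300dim.txt.bz2")`.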


Extract the vector for the word 'kvinna' so that we can look at similar words when measuring bias.
Select the 300 dimension columns from the dataframe and convert them to a NumPy array, which has the format `[[dim1 dim2 ... dim299 dim300]]`. Take the first element to get a single vector, and save it in a tuple with the word first and the vector second.

In [5]:
kvinna = ('kvinna', swectors.loc[swectors['word'] == 'kvinna'].loc[:, 'dim1':'dim300'].to_numpy()[0])
man = ('man', swectors.loc[swectors['word'] == 'man'].loc[:, 'dim1':'dim300'].to_numpy()[0])

In [6]:
def cosine_similarity(word1, word2):
    # Takes two (word, vector) tuples and calculates the cosine similarity
    # between their vectors. @ is the dot product.
    v1 = word1[1]
    v2 = word2[1]
    return (v1 @ v2) / (np.linalg.norm(v1)*np.linalg.norm(v2))

In [7]:
cosine_similarity(kvinna, man)

0.3581906572935491
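
To make the value above easier to interpret, here is a small sanity check of the same formula on hand-picked two-dimensional vectors. The function `cosine` is a hypothetical variant of `cosine_similarity` that takes bare vectors instead of (word, vector) tuples, and the vectors are toy data, not taken from Swectors.

```python
import numpy as np

def cosine(v1, v2):
    # Same formula as cosine_similarity above, applied to bare vectors
    return (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 0.0])
same_direction = cosine(a, np.array([2.0, 0.0]))   # parallel vectors
orthogonal = cosine(a, np.array([0.0, 3.0]))       # vectors at a right angle
opposite = cosine(a, np.array([-1.0, 0.0]))        # vectors pointing apart
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths. A value of about 0.36 between 'kvinna' and 'man' therefore indicates a moderate positive similarity.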

## Iteration over dataframe
This iteration over the whole dataframe takes a long time, but it works.

In [8]:
import heapq

def n_most_similar_iter(word, n):
    # Initiate to -2 since cosine similarity is [-1, 1]
    best = [(-2, 0) for x in range(0,n)]
    heapq.heapify(best)
    for index, row in swectors.iterrows():
        word2 = (row['word'], row.loc['dim1':'dim300'].to_numpy())
        if word[0] != word2[0]:
            heapq.heappushpop(best, (cosine_similarity(word, word2), word2[0]))
    return best
            

In [9]:
start = time.time()

#kvinna_similar = n_most_similar_iter(kvinna, 10)
end = time.time()
print("Time elapsed: ", end - start)

#kvinna_similar.sort(reverse=True, key=lambda x: x[0])
#print(kvinna_similar)

Time elapsed:  0.0
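
The heap logic in `n_most_similar_iter` can be illustrated on its own with a few made-up scores. This is a minimal sketch: `top_n` is a hypothetical name, and the scores are toy values rather than real similarities.

```python
import heapq

def top_n(scored_items, n):
    """Keep the n highest-scoring (score, label) pairs with a fixed-size min-heap."""
    # Placeholder scores of -2.0 lie below any possible cosine similarity
    best = [(-2.0, "") for _ in range(n)]
    heapq.heapify(best)
    for item in scored_items:
        # Push the new item, then discard whichever entry is now the smallest
        heapq.heappushpop(best, item)
    return sorted(best, reverse=True)

scores = [(0.2, "a"), (0.9, "b"), (-0.5, "c"), (0.7, "d")]
print(top_n(scores, 2))
```

Because the heap never grows beyond `n` entries, the memory use stays constant. The slow part of the cell above is therefore more likely to be the row-by-row `iterrows` loop than the heap itself.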


## Applying function to whole dataframe
This approach applies a function to each row of the dataframe and returns a Series of the same length containing all the results.

In [10]:
def n_most_similar(n, word):
    
    word_vec = (word, swectors.loc[swectors['word'] == word].loc[:, 'dim1':'dim300'].to_numpy()[0])
    
    def similarity(row):
        row_vec = (row['word'], row.loc['dim1':'dim300'].to_numpy())
        return cosine_similarity(word_vec, row_vec)

    start = time.time()
    similarities = swectors.apply(similarity, axis=1)

    # Concatenate the top n words (plus the word itself) with their similarity values.
    # Also set the correct column name.
    top = similarities.nlargest(n+1)
    s1 = swectors.loc[top.index, 'word']
    similars = pd.concat([s1, top], axis=1).rename(columns={0: "similarity"})
    end = time.time()
    print("Time elapsed: ", end - start)
    return similars

In [11]:
similars_kvinna = n_most_similar(10, 'kvinna')
print(similars_kvinna)

Time elapsed:  177.92815494537354
               word  similarity
542          kvinna    1.000000
2269         flicka    0.778604
646         kvinnan    0.687048
2957          pojke    0.676086
2972           tjej    0.673264
497          person    0.671946
58745  tonårsflicka    0.645879
16134       yngling    0.606193
10872       väninna    0.603767
3471            dam    0.596047
11426      polisman    0.593694


In [12]:
similars_män = n_most_similar(10, 'män')
print(similars_män)

Time elapsed:  100.98912358283997
                word  similarity
417              män    1.000000
322          kvinnor    0.804870
1177          männen    0.771062
1867         flickor    0.727492
2288          pojkar    0.692823
1626       kvinnorna    0.690541
1793          killar    0.656109
186         personer    0.655852
33699  tonårsflickor    0.654169
6182      tonåringar    0.639736
1477          tjejer    0.609658
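
Both runs above take a minute or more because `apply` calls a Python function once per row. As a possible alternative, the same similarities could be computed with a single matrix-vector product in NumPy. The function below is a sketch under assumptions: `n_most_similar_vectorized` is a hypothetical name, it takes the dataframe as an argument instead of using the global `swectors`, and it has not been timed on the real data.

```python
import numpy as np
import pandas as pd

def n_most_similar_vectorized(df, word, n):
    """Sketch: cosine similarity of `word` against every row in one matrix product."""
    # All columns except 'word' are treated as vector dimensions
    matrix = df.loc[:, df.columns != "word"].to_numpy(dtype=float)
    target = matrix[(df["word"] == word).to_numpy()][0]
    # Dot product of every row with the target, divided by the product of norms
    sims = (matrix @ target) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(target))
    result = pd.DataFrame({"word": df["word"], "similarity": sims}, index=df.index)
    # n + 1 rows, since the word itself is included with similarity 1
    return result.nlargest(n + 1, "similarity")

# Toy data, not taken from Swectors
toy = pd.DataFrame({"word": ["a", "b", "c"],
                    "dim1": [1.0, 2.0, 0.0],
                    "dim2": [0.0, 0.1, 1.0]})
toy_result = n_most_similar_vectorized(toy, "a", 1)
print(toy_result)
```

A row whose vector is all zeros would produce a division by zero here, just as it would in `cosine_similarity`, so such rows would need to be handled separately if they occur.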


## Filtering out all adjectives in the word embeddings
To see bias in adjectives, the Swectors dataframe is filtered to keep only the words that also appear in the dataframe of adjectives from Språkrådet.

In [13]:
with bz2.open("adjektiv.txt.bz2") as source:
    adjectives = pd.read_csv(source)
adjectives.head()

Unnamed: 0,Word
0,Chicagobaserad
1,Chicagobaserade
2,Chicagobaserades
3,Chicagobaserads
4,Chicagobaserat
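
The filtering described above could be sketched with `Series.isin`. This is an illustration on toy data, not necessarily how the notebook does it: the small `vectors` and `adjective_list` dataframes and their contents are made up, and only the column names `word` and `Word` follow the dataframes shown above.

```python
import pandas as pd

# Toy stand-ins for the swectors and adjectives dataframes
vectors = pd.DataFrame({"word": ["glad", "hus", "stor"],
                        "dim1": [0.1, 0.2, 0.3]})
adjective_list = pd.DataFrame({"Word": ["glad", "stor", "Chicagobaserad"]})

# Keep only the rows whose word appears in the adjective list
adjective_vectors = vectors[vectors["word"].isin(adjective_list["Word"])]
print(adjective_vectors["word"].tolist())
```

Note that `isin` matches strings exactly, so differences in capitalization between the two sources would cause words to be missed unless both columns are normalized first.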
