## Training and evaluating FastText embeddings
This notebook provides a quick example for training and evaluating FastText embeddings.

### Training FastText embeddings
**Note: This is only a quick example; real use calls for more care in collecting good training data!**

Also note that FastText initializes its parameters randomly on every run, so results can vary from run to run.


#### Load and prepare datasets

In [1]:
import pandas as pd
bug_balanced = pd.read_csv("../data/datasets/balanced_BUG.csv")
bug_balanced

FileNotFoundError: [Errno 2] No such file or directory: '../data/datasets/balanced_BUG.csv'

In [None]:
bug_unbalanced = pd.read_csv("../data/datasets/full_BUG.csv", nrows=len(bug_balanced))
bug_unbalanced

Unnamed: 0.1,Unnamed: 0,sentence_text,tokens,profession,g,profession_first_index,g_first_index,predicted gender,stereotype,distance,num_of_pronouns,corpus,data_index
0,0,Patient number 2 was isolated with his wife th...,"['Patient', 'number', '2', 'was', 'isolated', ...",Patient,his,0,6,male,0,6,1,covid19,1
1,1,"Five days post admission to the CCU , the pati...","['Five', 'days', 'post', 'admission', 'to', 't...",patient,his,9,14,male,0,5,1,covid19,1
2,2,One patient whose fascial layers were closed i...,"['One', 'patient', 'whose', 'fascial', 'layers...",patient,her,1,15,female,0,14,1,covid19,1
3,3,The patient was discharged 18 days after his a...,"['The', 'patient', 'was', 'discharged', '18', ...",patient,his,1,7,male,0,6,1,covid19,1
4,4,PATIENT CONCERNS A 24 year-old male was referr...,"['PATIENT', 'CONCERNS', 'A', '24', 'year', '-'...",PATIENT,his,0,12,male,0,10,1,covid19,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25499,25499,The patient subsequently admitted that he had ...,"['The', 'patient', 'subsequently', 'admitted',...",patient,he,1,5,male,0,4,2,pubmed,4
25500,25500,One patient recalled that he had felt pain and...,"['One', 'patient', 'recalled', 'that', 'he', '...",patient,he,1,4,male,0,3,1,pubmed,4
25501,25501,Although healing of the periapical lesion occu...,"['Although', 'healing', 'of', 'the', 'periapic...",patient,he,9,12,male,0,3,1,pubmed,4
25502,25502,"This month , farmer Joe Stanley describes why ...","['This', 'month', ',', 'farmer', 'Joe', 'Stanl...",farmer,he,3,8,male,1,5,2,pubmed,4


In [None]:
from datasets import load_dataset

# Load the first 25,504 BookCorpus sentences as a baseline corpus
book_dataset = load_dataset("bookcorpus", split="train[:25504]")
book_text = list(book_dataset["text"])
book_dataset.to_pandas().head()

Found cached dataset bookcorpus (/Users/oskarvanderwal/.cache/huggingface/datasets/bookcorpus/plain_text/1.0.0/eddee3cae1cc263a431aa98207d4d27fd8a73b0a9742f692af0e6c65afa4d75f)


Unnamed: 0,text
0,"usually , he would be tearing around the livin..."
1,but just one look at a minion sent him practic...
2,that had been megan 's plan when she got him d...
3,"he 'd seen the movie almost by mistake , consi..."
4,she liked to think being surrounded by adults ...


In [None]:
# Write each corpus to its own text file, one sentence per line, as training data

with open("book.txt", 'w', encoding="utf-8") as f:
    f.write("\n".join(book_text))

with open("unbalanced.txt", 'w', encoding="utf-8") as f:
    f.write("\n".join(bug_unbalanced["sentence_text"].tolist()))

with open("balanced.txt", 'w', encoding="utf-8") as f:
    f.write("\n".join(bug_balanced["sentence_text"].tolist()))
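FastText treats every newline in the input file as a sentence boundary (the `</s>` token), so a sentence containing an embedded newline, or a missing value in the column, would distort the training file. The following is a minimal sketch of a more defensive writer; the helper name `write_corpus` is hypothetical and not part of the original notebook.

```python
import math
from pathlib import Path


def write_corpus(sentences, path):
    """Write one whitespace-normalized sentence per line; return the number written.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    cleaned = []
    for sentence in sentences:
        # Skip missing values (None, or the float NaN pandas uses for empty cells)
        if sentence is None or (isinstance(sentence, float) and math.isnan(sentence)):
            continue
        # Collapse every run of whitespace, including embedded newlines, to one space
        text = " ".join(str(sentence).split())
        if text:
            cleaned.append(text)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)
```

With this helper, a call such as `write_corpus(bug_balanced["sentence_text"], "balanced.txt")` could stand in for the corresponding `with open(...)` block.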

#### Train and save FastText embeddings

In [None]:
import fasttext

model_book = fasttext.train_unsupervised('book.txt', model='skipgram')
model_balanced = fasttext.train_unsupervised('balanced.txt', model='skipgram')
model_unbalanced = fasttext.train_unsupervised('unbalanced.txt', model='skipgram')

Read 0M words
Number of words:  3834
Number of labels: 0
Progress: 100.0% words/sec/thread:  285872 lr:  0.000000 avg.loss:  2.545298 ETA:   0h 0m 0s
Read 0M words
Number of words:  11010
Number of labels: 0
Progress: 100.0% words/sec/thread:  139062 lr:  0.000000 avg.loss:  2.409050 ETA:   0h 0m 0s
Read 0M words
Number of words:  8819
Number of labels: 0
Progress: 100.0% words/sec/thread:  141782 lr:  0.000000 avg.loss:  2.383835 ETA:   0h 0m 0s


In [None]:
model_balanced.words[:10]

[',', 'the', '</s>', '.', 'and', 'of', 'a', '"', 'in', 'to']

In [None]:
model_balanced.save_model("fasttext_balanced.bin")
model_unbalanced.save_model("fasttext_unbalanced.bin")
model_book.save_model("fasttext_book.bin")

### Evaluating FastText embeddings

In [None]:
from biasbarometer.models import FastTextEmbeddingsModel
from biasbarometer.barometers import AutoBarometer

# Operationalize the barometer with man/woman word pairs and a list of occupation target words
barometer = AutoBarometer.from_spec(
    "direction",
    wordpairs="../data/wordlists/man_vs_woman.csv",
    target="../data/wordlists/occupations.txt",
)

results = {}

for m in ["balanced", "unbalanced", "book"]:
    embeddings = FastTextEmbeddingsModel(f"fasttext_{m}.bin", device="cpu").embeddings

    # Run the bias evaluation
    barometer.evaluate(embeddings)

    results[m] = barometer.results
    print(m, "score: ", barometer.results["score"])

balanced score:  0.29308185810040926
unbalanced score:  0.46710792957696246
book score:  0.27703325134969253
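The internals of the `direction` barometer are not shown in this notebook. As a rough illustration of the general idea behind direction-based bias measures, the sketch below assumes that a bias direction is the average of normalized difference vectors between paired words, and that a target word is scored by its cosine similarity with that direction. This is an assumption for illustration, not necessarily what `biasbarometer` computes; the function names and the toy 3-dimensional vectors are made up.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def bias_direction(pairs, vectors):
    """Average the normalized difference vectors of each (a, b) word pair."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for a, b in pairs:
        diff = normalize([x - y for x, y in zip(vectors[a], vectors[b])])
        total = [t + d for t, d in zip(total, diff)]
    return normalize(total)


def projection_score(word, direction, vectors):
    """Cosine similarity between a word vector and the (unit-length) direction."""
    unit = normalize(vectors[word])
    return sum(u * d for u, d in zip(unit, direction))


# Toy vectors, invented purely for illustration (not real FastText embeddings)
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [-1.0, 0.0, 0.2],
    "he": [0.8, 0.1, 0.0],
    "she": [-0.8, 0.1, 0.0],
    "engineer": [0.5, 0.5, 0.1],
    "nurse": [-0.5, 0.5, 0.1],
}

direction = bias_direction([("man", "woman"), ("he", "she")], toy_vectors)
for word in ["engineer", "nurse"]:
    print(word, round(projection_score(word, direction, toy_vectors), 3))
```

Under this assumed definition, the sign of a score says which side of each word pair the target leans toward, so it depends on the order in which the pairs are listed.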


#### Balanced

In [None]:
df = results["balanced"]["bias_df"]
df[df["category"] == "target"]

Unnamed: 0,word,score,category
0,manager,0.72859,target
1,clerk,0.621153,target
2,officer,0.378479,target
3,secretary,0.347692,target
4,administrator,0.334812,target
5,architect,0.326884,target
6,engineer,0.26392,target
7,bartender,0.262998,target
8,broker,0.212344,target
9,cashier,0.208459,target


#### Unbalanced

In [None]:
df = results["unbalanced"]["bias_df"]
df[df["category"] == "target"]

Unnamed: 0,word,score,category
0,surgeon,0.342226,target
1,scientist,0.256169,target
2,pathologist,0.155013,target
3,auditor,-0.000578,target
11,investigator,-0.080511,target
12,librarian,-0.08259,target
14,clerk,-0.152537,target
15,architect,-0.155174,target
16,bartender,-0.204451,target
17,broker,-0.219188,target
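
To see how individual words shift between the two models, their per-word scores can be compared directly. The sketch below hard-codes the four occupations that are visible in both truncated tables above; the dictionary names are illustrative, and in the notebook the values would come from `results[...]["bias_df"]` instead.

```python
# Scores copied from the truncated "Balanced" and "Unbalanced" tables above
balanced_scores = {
    "clerk": 0.621153,
    "architect": 0.326884,
    "bartender": 0.262998,
    "broker": 0.212344,
}
unbalanced_scores = {
    "clerk": -0.152537,
    "architect": -0.155174,
    "bartender": -0.204451,
    "broker": -0.219188,
}

# Difference (balanced minus unbalanced) for every word present in both tables
shifts = {
    word: balanced_scores[word] - unbalanced_scores[word]
    for word in balanced_scores
    if word in unbalanced_scores
}

# List the words with the largest absolute shift first
ranked = sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True)
for word, shift in ranked:
    print(f"{word:10s} {shift:+.6f}")
```

All four words move in the same direction here, but because FastText training is randomly initialized and the tables are truncated, this comparison is only a starting point, not a conclusion about the datasets.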
