GitHub - andrinelo/norwegian-nlp

Detecting and Measuring Experiments

Count pronouns

The corpus files are not included in the repository because of their size and because they are readily available online. We collected this data on the 20th of January.

To count the number of pronouns in Norsk Aviskorpus:

Download Norsk Aviskorpus
Unzip .tar.gz and .gz files
Replace the value of the rootdir variable in main() with the path to your Aviskorpus data
Run experiments/pronoun_count/pronoun_count_norsk_aviskorpus.py
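As a rough illustration of what such a count involves, the sketch below walks a directory tree and tallies pronoun tokens. It is not the repository's script: the function name count_pronouns_in_dir, the encoding parameter, and the choice of "han" and "hun" as the counted pronouns are assumptions made for this example.

```python
import os
import re
from collections import Counter

# Assumption: the pronouns of interest are "han" (he) and "hun" (she).
PRONOUNS = ("han", "hun")
TOKEN_RE = re.compile(r"\w+")


def count_pronouns_in_dir(rootdir, encoding="utf-8"):
    """Walk rootdir recursively and count pronoun tokens in every file."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on badly encoded files.
            with open(path, encoding=encoding, errors="replace") as handle:
                for line in handle:
                    # Lowercase first so "Han" and "han" are counted together.
                    for token in TOKEN_RE.findall(line.lower()):
                        if token in PRONOUNS:
                            counts[token] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical path; point this at your unpacked Aviskorpus data.
    print(count_pronouns_in_dir("path/to/aviskorpus"))
```

Matching whole tokens rather than substrings avoids counting words such as "hans" or "hunden" as pronouns.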

To count the number of pronouns in Wikipedia:

Download Bokmål Wikipedia and Nynorsk Wikipedia dumps with segment wiki
Replace the argument in experiments/pronoun_count/pronoun_count_in_wikipedia.py with the path to your wiki dump JSON file
Run experiments/pronoun_count/pronoun_count_in_wikipedia.py
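The sketch below shows one way to count pronouns in such a dump. It assumes the dump is an uncompressed JSON-lines file in which each article is an object with a "section_texts" list, which is the layout produced by gensim's segment_wiki script; if your dump differs, adjust the field name. The function name and the pronoun list are likewise assumptions for illustration, not taken from the repository.

```python
import json
import re
from collections import Counter

# Assumption: the pronouns of interest are "han" (he) and "hun" (she).
PRONOUNS = ("han", "hun")
TOKEN_RE = re.compile(r"\w+")


def count_pronouns_in_wiki_dump(json_path):
    """Count pronoun tokens across all sections of all articles."""
    counts = Counter()
    with open(json_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            article = json.loads(line)
            # Assumed field: a list with one text string per section.
            for section in article.get("section_texts", []):
                tokens = TOKEN_RE.findall(section.lower())
                counts.update(t for t in tokens if t in PRONOUNS)
    return counts
```

Reading the file line by line keeps memory use flat even for large dumps.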

To count the number of pronouns in Norwegian Colossal Corpus (NCC):

Clone the training set: git clone https://huggingface.co/datasets/NbAiLab/NCC
Create one large training file from all shards without unpacking: cat NCC/data/train*.gz > onefile.json.gz
Unpack it: gzip -d onefile.json.gz
Replace the argument in experiments/pronoun_count/pronount_count_in_norwegian_colossal_corpus.py with the path to your JSON file
Run experiments/pronoun_count/pronount_count_in_norwegian_colossal_corpus.py
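For a file of this size it helps to stream documents rather than load everything at once. The sketch below assumes the unpacked file is JSON lines with the document body in a "text" field; both helper names (iter_texts, pronoun_shares) and the "han"/"hun" pronoun set are assumptions made for this example and are not taken from the repository's script.

```python
import json
from collections import Counter

# Assumption: the pronouns of interest are "han" (he) and "hun" (she).
PRONOUNS = {"han", "hun"}
PUNCTUATION = ".,;:!?\"'()[]«»"


def iter_texts(jsonl_path, field="text"):
    """Yield the text of one document at a time from a JSON-lines file."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line).get(field, "")


def pronoun_shares(texts):
    """Return raw pronoun counts and each pronoun's share of the total."""
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            # Strip surrounding punctuation so "han," still counts as "han".
            token = token.strip(PUNCTUATION)
            if token in PRONOUNS:
                counts[token] += 1
    total = sum(counts.values())
    shares = {p: (counts[p] / total if total else 0.0) for p in sorted(PRONOUNS)}
    return counts, shares


if __name__ == "__main__":
    # Hypothetical path to the unpacked file from the steps above.
    print(pronoun_shares(iter_texts("onefile.json")))
```

Reporting shares alongside raw counts makes corpora of very different sizes easier to compare.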

All results are written to the terminal.

Embeddings: Masked language modelling

First, the most biased adjectives are predicted for all models:

1. Run experiments/masked_adjectives/extract_top_adjectives.py to get files with the top adjectives for each model. The predicted adjectives are stored in experiments/masked_adjectives/data/...

Next, the results are collected by calculating aggregated bias scores and plotting the most biased adjectives for all models:

2. Run experiments/masked_adjectives/get_prediction_scores.py to get aggregated prediction scores for all adjectives per model.
3. Run experiments/masked_adjectives/plot_adjectives.py to get a word cloud of the top adjectives for all models.

Both results are stored in experiments/masked_adjectives/results/...
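How the scores are aggregated is not spelled out here, so the following is only one plausible reading: sum the masked-token prediction scores per adjective and pronoun, then rank adjectives by the difference between the two pronouns. The function names, the (pronoun, adjective, score) input layout, and the sample scores are all made up for illustration.

```python
from collections import defaultdict


def aggregate_scores(predictions):
    """Sum prediction scores per adjective and pronoun.

    predictions: iterable of (pronoun, adjective, score) tuples, e.g. one
    tuple per adjective predicted for a masked sentence template.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for pronoun, adjective, score in predictions:
        totals[adjective][pronoun] += score
    return {adj: dict(by_pronoun) for adj, by_pronoun in totals.items()}


def top_biased(aggregated, pronoun, other, k=3):
    """Return the k adjectives scored highest for pronoun relative to other."""
    diffs = {
        adj: scores.get(pronoun, 0.0) - scores.get(other, 0.0)
        for adj, scores in aggregated.items()
    }
    return sorted(diffs, key=diffs.get, reverse=True)[:k]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    sample = [
        ("hun", "vakker", 0.30), ("han", "vakker", 0.10),
        ("han", "sterk", 0.25), ("hun", "sterk", 0.05),
        ("hun", "snill", 0.20), ("han", "snill", 0.20),
    ]
    aggregated = aggregate_scores(sample)
    print(top_biased(aggregated, "hun", "han", k=1))
    print(top_biased(aggregated, "han", "hun", k=1))
```

A difference of sums is the simplest choice; ratios or log-odds would weight rare adjectives differently.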

Downstream Task: Hanna and Hans

First, the embeddings to be used in the experiment are extracted.

1. Run experiments/hanna_og_hans/extract_embeddings_hans_hanna.py for all three models. Set the True/False input variable of run() in main to switch between sentence embeddings (SA) and han/hun embeddings (TWA) for the texts. Embeddings are stored in experiments/hanna_og_hans/data/...

Next, the difference in distance between the Hanna and Hans embeddings is calculated:

2. Run experiments/hanna_og_hans/embedding_distance.py. The results are stored in experiments/hanna_og_hans/results/...
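To make the idea of a distance difference concrete, here is a minimal sketch using cosine distance with NumPy. Whether embedding_distance.py uses cosine distance, and what it compares the Hanna and Hans embeddings against, is not stated here; the metric, the function names, and the notion of a shared target vector are assumptions for this example.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def distance_gap(hanna_vec, hans_vec, target_vec):
    """Distance from Hanna to the target minus distance from Hans to it.

    A negative value means the Hanna embedding lies closer to the target.
    """
    return cosine_distance(hanna_vec, target_vec) - cosine_distance(hans_vec, target_vec)
```

A gap near zero would indicate that the model places the two otherwise identical texts at a similar distance from the target.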

Debiasing Experiments

Debiasing of language models by removing gender subspace

First, the embeddings to be used in the experiment are extracted.

Run experiments/hanna_og_hans/extract_embeddings_hans_hanna.py for all three models. Set the True/False input variable of run() in main to switch between sentence embeddings (SA) and han/hun embeddings (TWA) for the texts.
Run debiasing/remove_gender_subspace/extract_embeddings_for_pca.py for all three models. Fill in the desired variables in the main function before extracting. Both sets of embeddings are stored in debiasing/remove_gender_subspace/data/...

Next, the embeddings are debiased by removing the gender subspace, and the new distances between the Hanna and Hans descriptions and the survey questions are calculated.

Run debiasing/remove_gender_subspace/remove_subspace.py. The results are stored in debiasing/remove_gender_subspace/results/...
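A common way to remove a gender subspace is to estimate it with PCA over embeddings of gendered pairs and then project it out of every embedding. The sketch below follows that general recipe with NumPy; it is an assumption that remove_subspace.py works along these lines, and the function names, the pair layout, and the default of one component are chosen here for illustration.

```python
import numpy as np


def gender_subspace(pairs, k=1):
    """Estimate a k-dimensional gender subspace from embedding pairs.

    pairs: array of shape (n_pairs, 2, dim), e.g. one female and one male
    embedding per pair. Returns an orthonormal basis of shape (k, dim).
    """
    pairs = np.asarray(pairs, dtype=float)
    # Centre each pair on its own mean so only the within-pair
    # (gender) difference remains.
    centers = pairs.mean(axis=1, keepdims=True)
    centered = (pairs - centers).reshape(-1, pairs.shape[-1])
    # The rows of vt are the principal directions, strongest first.
    _u, _s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def remove_subspace(embeddings, basis):
    """Subtract the projection of each embedding onto the subspace."""
    embeddings = np.asarray(embeddings, dtype=float)
    projection = embeddings @ basis.T @ basis
    return embeddings - projection
```

After the projection is removed, each debiased embedding is orthogonal to every basis vector, so distances no longer depend on the estimated gender directions.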

Debiasing of language models through retraining on female corpus

This experiment requires the capacity to store large datasets and train large language models.

First, the NCC corpus is gender swapped:

Run debiasing/gender_swap/gender_swap_NCC.py.
Fine-tune NB-BERT on the gender-swapped corpus. In this thesis, both steps were carried out by the National Library of Norway.
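Gender swapping can be sketched as a dictionary-based replacement of gendered words. The word list below is a small illustrative assumption and not the list used in gender_swap_NCC.py; the function name gender_swap is likewise hypothetical.

```python
import re

# Illustrative mapping only; a real swap list would be much longer.
SWAP = {
    "han": "hun", "hun": "han",
    "hans": "hennes", "hennes": "hans",
    "ham": "henne", "henne": "ham",
}

# Match whole words only, so "hunden" is left untouched.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE
)


def gender_swap(text):
    """Replace each gendered word with its counterpart, keeping case."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(gender_swap("Han ga henne boken hans."))
```

A purely lexical swap like this cannot tell the possessive "Hans" from the given name Hans, so a real pipeline would need to handle names separately.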

Further, both measuring experiments for embeddings are redone. For masked adjectives:

Run debiasing/gender_swap/masked_adjectives/extract_top_adjectives.py to get files with the top adjectives for the new model. The predicted adjectives are stored in debiasing/gender_swap/masked_adjectives/data/...
Run debiasing/gender_swap/masked_adjectives/get_prediction_scores.py to get aggregated prediction scores for all adjectives per model.
Run debiasing/gender_swap/masked_adjectives/plot_adjectives.py to get a word cloud of the top adjectives for all models. Both results are stored in debiasing/gender_swap/masked_adjectives/results/...

For Hanna and Hans:

Run debiasing/gender_swap/hanna_og_hans/extract_embeddings_hans_hanna.py for both models. Set the True/False input variable of run() in main to switch between sentence embeddings (SA) and han/hun embeddings (TWA) for the texts. Embeddings are stored in debiasing/gender_swap/hanna_og_hans/data/...
Run debiasing/gender_swap/hanna_og_hans/embedding_distance.py. The results are stored in debiasing/gender_swap/hanna_og_hans/results/...

