# **Fine-Tuning BERT with OntoNotes - Extracting Embeddings.**

Code based on GitHub repo found here: https://github.com/12kleingordon34/NLP_masters_project  (unless otherwise specified) 


## **1. Process OntoNotes 5.0 Data** 

### 1.1 Clone the repository containing the male <--> female word mapping

In [None]:
!git clone https://github.com/uclanlp/gn_glove.git

Cloning into 'gn_glove'...
remote: Enumerating objects: 199, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 199 (delta 2), reused 0 (delta 0), pack-reused 193
Receiving objects: 100% (199/199), 67.78 KiB | 6.78 MiB/s, done.
Resolving deltas: 100% (88/88), done.


### 1.2 Run a bash command to convert the .gold_conll files to CSV


Converting the OntoNotes `.gold_conll` files to CSV with this bash command is much quicker than doing the conversion in Python.
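
For reference, the transformation performed by the `sed` command used in this section can be sketched in Python. This is only an illustration on a single made-up line (not taken from OntoNotes), and the helper name `conll_line_to_csv` is hypothetical: runs of spaces become commas, and double quotes are doubled.

```python
import re

def conll_line_to_csv(line):
    """Mirror sed 's/  */,/g; s/"/""/g' for a single line."""
    # One or more spaces -> a single comma.
    line = re.sub(r' +', ',', line)
    # Double every double quote, as CSV escaping expects.
    return line.replace('"', '""')

# Made-up sample line in the whitespace-separated CoNLL layout.
sample = 'doc_0   0    1    said    VBD'
print(conll_line_to_csv(sample))
```

The bash version remains the one used in the notebook; the sketch is only meant to show what ends up in each CSV row.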



In [None]:
# mounting drive

from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp "/content/drive/MyDrive/Colab Notebooks/conll-formatted-ontonotes-5.0/test.english.v4_gold_conll" "/content/test.v4_gold_conll"
!cp "/content/drive/MyDrive/Colab Notebooks/conll-formatted-ontonotes-5.0/train.english.v4_gold_conll" "/content/train.v4_gold_conll"
!cp "/content/drive/MyDrive/Colab Notebooks/conll-formatted-ontonotes-5.0/dev.english.v4_gold_conll" "/content/dev.v4_gold_conll"

In [None]:
!ls "/content/"

dev.v4_gold_conll  gn_glove	test.v4_gold_conll
drive		   sample_data	train.v4_gold_conll


In [None]:
!pwd

/content


In [None]:
# Convert to CSV. The search is limited to the top level of /content so that it does not
# descend into the mounted Drive, whose path contains a space and breaks the word-split loop.
!for path in $(find "/content" -maxdepth 1 -name "*.v4_gold_conll"); do sed 's/  */,/g; s/"/""/g' "${path}" > "${path}.csv"; done


### 1.3 Import Packages

In [None]:
import csv
from glob import glob
import os
from tqdm import tqdm

import nltk
nltk.download('names')
from nltk.corpus import names

import pandas as pd

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


### 1.4 Codebase 

Functions for processing data. 

In [None]:
def load_data(path):
    """
    Load content from a CSV as a list of lists, with each sublist
    corresponding to a line in the CSV. Comment lines (starting with '#')
    are skipped and blank lines are kept as empty lists.
    """
    content = []
    with open(path) as f:
        reader = csv.reader(f, delimiter=",")
        for line in reader:
            if len(line) > 0:
                # Skip the '#begin document' / '#end document' marker lines.
                if not line[0].startswith('#'):
                    content.append(line)
            else:
                content.append([])
    return content

In [None]:
def generate_pronoun_map():
    """
    Create pronoun mapping to switch possessive
    and personal pronouns to their opposite gender
    """
    pronoun_map_df = pd.DataFrame([
        ['he', '[she]', 'PRP'],
        ['she', '[he]', 'PRP'],
        ['his', '[her]', 'PRP$'],   
        ['his', '[hers]', 'PRP'],
        ['hers', '[his]', 'PRP'], # Added to counter line 5026 in 'bc/phoenix/00/phoenix_0000.gold_conll.csv'
        ['her', '[his]', 'PRP$'],   
        ['him', '[her]', 'PRP'],
        ['her', '[him]', 'PRP'],
    ])
    pronoun_map_df.columns = ['word', 'flipped_pronoun', 'pos_0']
    return pronoun_map_df
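
The part-of-speech column is what lets the map tell possessive `her` (PRP$) from object `her` (PRP). A small sketch of the lookup on made-up tokens, using a two-row excerpt of the map and the same kind of left merge on `word` and `pos_0` that the processing code uses later:

```python
import pandas as pd

# Two-row excerpt of the pronoun map: the same word flips differently per POS tag.
pronoun_map = pd.DataFrame(
    [['her', '[his]', 'PRP$'], ['her', '[him]', 'PRP']],
    columns=['word', 'flipped_pronoun', 'pos_0'],
)

# Made-up tokens with their POS tags.
tokens = pd.DataFrame({
    'word': ['i', 'saw', 'her', 'with', 'her', 'dog'],
    'pos_0': ['PRP', 'VBD', 'PRP', 'IN', 'PRP$', 'NN'],
})

# A left merge keeps every token; tokens without a match get NaN.
flipped = pd.merge(tokens, pronoun_map, on=['word', 'pos_0'], how='left')
print(flipped)
```

Tokens that are not in the map, such as `i`, keep a missing value in `flipped_pronoun` and are therefore left unchanged downstream.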

In [None]:
def preprocess_content(data):
    """
    Select "word" and Part of Speech column ("pos_0") from data.
    Sub in all missing values with a new line, and return this as
    a pandas dataframe.
    """
    df = pd.DataFrame(data)
    df = df.loc[:, [3,4]]
    df.columns = ['word', 'pos_0']
    df['word'] = df['word'].str.replace('""', '"')
    df['word'] = df['word'].str.strip()
    for col in ['word', 'pos_0']:
        df.loc[df[col].isnull(), col] = '\n'
    return df

In [None]:
def generate_name_maps():
    """
    Create mapping of male/female names to anonymised entities.
    
    We add other male/female names to the list if they are not found
    in the nltk.names corpus.
    """
    male_names = [name for name in names.words('male.txt')] + ['Saddam', 'Mao']
    female_names = [name for name in names.words('female.txt')] + ['Gong']
    # Sort so that the entity IDs are reproducible across runs (set order is not).
    full_names = sorted(set(male_names + female_names))
                    
    full_name_pairs = [[name, 'E'+str(i)] for i, name in enumerate(full_names)]
    
    return pd.DataFrame(full_name_pairs, columns=['word', 'entity'])

In [None]:
def flipped_gendered_words_map(path):
    """
    Load male/female word files from gn_glove.git file downloaded above
    and create a pd.DataFrame which maps words to their equivalents in
    the opposite gender.
    
    We note that there are words which are mapped to multiple others.
    We manually select which pairings we want (stored in `manual_additions`)
    and add this to the mapping to the deduplicated original dataframe.
    """
    male_words = []
    female_words = []
    with open(os.path.join(path, 'male_word_file.txt')) as f:
        for line in f:
            male_words.append(line.strip('\n'))    
    with open(os.path.join(path, 'female_word_file.txt')) as f:
        for line in f:
            female_words.append(line.strip('\n'))
              
    # Manually add words not in Zhao's subset
    male_words = male_words + ['kingdom']
    female_words = female_words + ['queendom']      
              
    full_mapping = [[m, w] for m, w in zip(male_words, female_words)] + \
        [[w, m] for m, w in zip(male_words, female_words)]
        
              
    full_mapping_df = pd.DataFrame(full_mapping, columns=['word', 'flipped_gender_word'])
    
    # Remove gendered pronoun words
    full_mapping_df = full_mapping_df.loc[~full_mapping_df['word'].str.contains('^he$|^she$|^her$|^his$|^him$')]
    
    # Remove all duplicated 'word' entries and manually re-add those which make most sense
    removed_words = set(
        full_mapping_df.loc[full_mapping_df['word'].duplicated(keep=False), 'word']
    )
    full_mapping_df = full_mapping_df.drop_duplicates(subset='word', keep=False)
       
    manual_additions = pd.DataFrame([
        ['bachelor', 'maiden'],
        ['bride' , 'bridegroom'],
        ['brides' , 'bridegrooms'],
        ['dude', 'chick'],
        ['dudes', 'chicks'],    
        ['gal', 'guy'],
        ['gals', 'guys'],
        ['god', 'goddess'],
        ['grooms', 'brides'],
        ['hostess', 'host'],
        ['ladies', 'gentlemen'],
        ['lady', 'gentleman'],
        ['lass', 'lad'],
        ['manservant', 'maid'],
        ['mare', 'stallion'],
        ['maternity', 'paternity'],
        ['paternity', 'maternity'],
        ['penis', 'vagina'],
        ['mistress', 'master'],
        ['nun', 'priest'],
        ['nuns', 'priests'],   
        ['priest', 'priestess'],
        ['priests', 'priestesses'],  
        ['prostatic_utricle', 'womb'],
        ['sir', 'madam'],
        ['wife', 'husband']
    ], columns=['word', 'flipped_gender_word'])
    
    # Ensure all duplicated words are accounted for
    assert set(manual_additions['word']) == removed_words
    
    full_mapping_df = pd.concat([full_mapping_df, manual_additions], axis=0)
    
    return full_mapping_df
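
The deduplication step can be illustrated on a toy mapping (the word pairs below are chosen for the example, not read from the word lists): `duplicated(keep=False)` flags every row whose `word` occurs more than once, and `drop_duplicates(keep=False)` removes all of them so that the preferred pairing can be added back by hand.

```python
import pandas as pd

# Toy mapping in which 'priest' is mapped to two different words.
toy_mapping = pd.DataFrame(
    [['king', 'queen'], ['priest', 'nun'], ['priest', 'priestess']],
    columns=['word', 'flipped_gender_word'],
)

# Words that occur more than once on the 'word' side.
ambiguous = set(toy_mapping.loc[toy_mapping['word'].duplicated(keep=False), 'word'])

# Drop every ambiguous row, then re-add the pairing chosen manually.
unambiguous = toy_mapping.drop_duplicates(subset='word', keep=False)
manual = pd.DataFrame([['priest', 'priestess']], columns=['word', 'flipped_gender_word'])
resolved = pd.concat([unambiguous, manual], axis=0)

print(ambiguous)
print(resolved)
```

Comparing the set of ambiguous words with the manually added ones, as the assertion in the function does, guards against silently losing a word.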


In [None]:
def unify_full_string_cols(d):
    """
    Unify all anonymised entities, gender flipped words and ungendered
    words into a single column.
    """
    d['original_str'] = d['word']
    d.loc[d['entity'].notnull(), 'original_str'] = d['entity']
    d.loc[d['orig_pronoun'].notnull(), 'original_str'] = d['orig_pronoun']
    
    d['flipped_str'] = d['word']
    d.loc[d['flipped_entity'].notnull(), 'flipped_str'] = d['flipped_entity']
    d.loc[d['flipped_pronoun'].notnull(), 'flipped_str'] = d['flipped_pronoun']
    d.loc[d['flipped_gender_word'].notnull(), 'flipped_str'] = d['flipped_gender_word']

    return d

In [None]:
def process_ontonotes_file(path):
    """
    Load the OntoNotes file located at `path` and process it as a dataframe.
    Flip gendered words and anonymise entities. Concatenate all words in
    `original_str` and `flipped_str` to create the full original and flipped
    strings, which are returned as output.
    """
    data = load_data(path)
    df = preprocess_content(data)
    
    pronoun_map = generate_pronoun_map()
    name_map = generate_name_maps()
    flipped_map = flipped_gendered_words_map('gn_glove/wordlist/')
    
    df_2 = pd.merge(df, name_map, on='word', how='left')

    df_2['word'] = df_2['word'].str.lower()
    df_2['flipped_entity'] = 'FL_' + df_2['entity'].str[1:]
        
    df_2['orig_pronoun'] = '[' + df_2.loc[
        (df_2.loc[:, 'pos_0'].astype(str).str.contains('PRP')) &
        (df_2.loc[:, 'word'].astype(str).str.contains('^he$|^she$|^her$|^his$|^him$')),
        'word'
    ].astype(str) + ']'
    
    df_3 = pd.merge(df_2, pronoun_map, on=['word', 'pos_0'], how='left')

    df_4 = pd.merge(df_3, flipped_map, on='word', how='left')
    
    df_5 = unify_full_string_cols(df_4)

    original_string = df_5['original_str'].str.cat(sep=' ')
    flipped_string = df_5['flipped_str'].str.cat(sep=' ')    

    return original_string, flipped_string

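### 1.5 Process the data

Only a single split is processed here, because processing all of the files was too expensive. The original run used the test data; `EXT` below selects the split and is currently set to the train file.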

In [None]:
PATH = "/content/"
EXT = "train.v4_gold_conll.csv"
all_csv_files = [file
                 for path, subdir, files in os.walk(PATH)
                 for file in glob(os.path.join(path, EXT))]

original_strings = []
flipped_strings = []
erroneous_paths = []
for path in tqdm(all_csv_files):
    try:
        original, flipped = process_ontonotes_file(path)
        original_strings.append(original)
        flipped_strings.append(flipped)
    except Exception:  # record the failing file and carry on, without swallowing KeyboardInterrupt
        erroneous_paths.append(path)

100%|██████████| 1/1 [00:23<00:00, 24.00s/it]


In [None]:
print(original_strings[0][:200])  # first 200 characters of the first processed file



### 1.6 Combine sentences until a gendered pronoun appears

In [None]:
import re

def compile_pronoun_strings(corpus):
    """
    BERT cannot process sequences longer than 512 tokens. The code enforces
    this with a stricter character-based limit: the joined sentences must not
    exceed 512 characters.

    We attempt to add as many sentences as possible to the training example
    to provide maximal context to the BERT masked language model: first all
    sentences accumulated since the last example, then only the last 8, 7, ..., 1.
    We also add `[CLS]` and `[SEP]` tokens to our training strings.
    """
    stored_full_strings = []
    temp_storage = []
    pronoun_regex = r'\[his\]|\[her\]|\[him\]|\[she\]|\[he\]|\[hers\]'

    num_corpi_too_long = 0
    for strings in corpus:
        for string in strings.split('\n'):
            temp_storage.append(string)
            if not re.search(pronoun_regex, string):
                continue
            # Candidate contexts, from the longest to the shortest.
            candidates = [temp_storage] + [temp_storage[-n:] for n in range(8, 0, -1)]
            for candidate in candidates:
                joined = ' [SEP] '.join(candidate)
                if len(joined) <= 512:
                    stored_full_strings.append('[CLS] ' + joined + ' [SEP]')
                    break
            else:
                # Even the pronoun sentence on its own exceeds the limit.
                num_corpi_too_long += 1
                stored_full_strings.append('___TEXT-TO-LONG___')
            temp_storage = []
    print("Number of text corpuses which are too long for BERT: {} / {}".format(num_corpi_too_long, len(corpus)))
    return stored_full_strings

In [None]:
%%time
original_pronoun_strings = compile_pronoun_strings(original_strings)

Number of text corpuses which are too long for BERT: 25 / 1
CPU times: user 97.7 ms, sys: 6.87 ms, total: 105 ms
Wall time: 104 ms


In [None]:
print(original_pronoun_strings[:5000])



In [None]:
print(len(original_pronoun_strings))

orig = [s for strings in original_strings for s in strings.split('\n')]

print(len(orig))


12896
75188


### 1.7 Create data in CSV format

In [None]:
def generate_training_data(strings):
    """
    Identify whether a string contains gendered pronouns.
    If so, replace every pronoun occurrence with `[MASK]` and keep
    the pronoun as the predictive target.

    One training example is created per pronoun occurrence. Because all
    pronouns are masked each time, examples from the same string share the
    same text and differ only in their target pronoun.
    """
    data_list = []
    regex = r'\[his\]|\[her\]|\[him\]|\[she\]|\[he\]|\[hers\]'

    for string in strings:
        # Strings without a gendered pronoun produce no training examples.
        for pronoun in re.findall(regex, string):
            temp_str = re.sub(regex, '[MASK]', string)
            temp_str = re.sub(r'\s+', ' ', temp_str)
            # Strip the surrounding brackets from the target, e.g. '[he]' -> 'he'.
            data_list.append([temp_str, pronoun[1:-1]])
    return data_list
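
The function above masks every pronoun in a string at once, so a string with several pronouns ends up with several `[MASK]` tokens per example. As a hedged alternative, not the method used in this notebook, the sketch below masks one occurrence at a time and writes the other pronouns back as plain words. The helper name `one_mask_per_example` and the sample sentence are made up for illustration.

```python
import re

# Bracketed pronouns as produced by the preprocessing step.
PRONOUN_RE = re.compile(r'\[(his|hers|her|him|she|he)\]')

def one_mask_per_example(text):
    """Return one (masked_text, target) pair per pronoun occurrence."""
    examples = []
    for target in PRONOUN_RE.finditer(text):
        pieces = []
        last = 0
        for match in PRONOUN_RE.finditer(text):
            pieces.append(text[last:match.start()])
            if match.start() == target.start():
                # This occurrence is the prediction target.
                pieces.append('[MASK]')
            else:
                # Other pronouns are restored without brackets.
                pieces.append(match.group(1))
            last = match.end()
        pieces.append(text[last:])
        masked = re.sub(r'\s+', ' ', ''.join(pieces))
        examples.append((masked, target.group(1)))
    return examples

for example in one_mask_per_example('[CLS] [he] saw [her] dog . [SEP]'):
    print(example)
```

With this variant every example has exactly one `[MASK]`, so the number of mask tokens always matches the number of labels.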

In [None]:
original_data = generate_training_data(original_pronoun_strings)

In [None]:
original_data[0]

['[CLS] welcome both of you to the studio to participate in our program . [SEP] well i especially want to know ha how the two of you found out the news on the day of the accident ? [SEP] ah about 11:00 m. yesterday ah i happened to find out through an sms when i was outside . [SEP] uh-huh . [SEP] uh-huh . [SEP] it happened that i was going to have lunch with a friend um at noon . [SEP] and then the friend first sent me an sms uh-huh . saying [MASK] would come pick me up to go together . [SEP]',
 'he']

In [None]:
original_df = pd.DataFrame(original_data, columns=['text', 'pronouns'])

In [None]:
print(original_df.shape)

(2086, 2)


In [None]:
# remove duplicate rows 
original_dropped_df = original_df.drop_duplicates(keep='first')
original_dropped_df.shape

(15384, 2)

In [None]:
# save data to csv 
original_dropped_df.to_csv('original_data.csv', index=False)


In [None]:
# check data has been saved 
pd.read_csv('original_data.csv')

Unnamed: 0,text,pronouns
0,[CLS] it was an arduous battle . [SEP] at 10:0...,he
1,[CLS] [MASK] himself would bring this group of...,he
2,[CLS] with a wave of [MASK] hand peng dehuai s...,his
3,[CLS] i was in charge of this er and -- [SEP] ...,his
4,[CLS] while destroying roads there was clear -...,his
...,...,...
15379,[CLS] [MASK] face was flooding sweat [SEP],his
15380,[CLS] the veins on [MASK] forehead were bulgin...,his
15381,[CLS] and [MASK] eyes were shot with blood and...,his
15382,[CLS] [MASK] was happy [SEP],he


**Editing dataframe to work for AllenNLP MLM**

In [None]:
# making sure duplicated lines are deleted so that number of mask tokens match the labels (for allennlp)
import pandas as pd
df = pd.read_csv ('original_data_sentences.csv')
print(df)

FileNotFoundError: ignored

In [None]:
original_drop_df = df.drop_duplicates(keep='first')
original_drop_df.shape

(1487, 1)

In [None]:
# save data to csv 
original_drop_df.to_csv('original_data_sentences_dropped.csv', index=False)

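Remove rows whose text appears more than once. These come from sentences with more than one `[MASK]`, where the same masked text is paired with different pronouns.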

In [None]:
# importing dataset with both sentences and labels in 
import pandas as pd
df = pd.read_csv('original_data.csv')
print(df)

                                                    text pronouns
0      [CLS] it was an arduous battle . [SEP] at 10:0...       he
1      [CLS] [MASK] himself would bring this group of...       he
2      [CLS] with a wave of [MASK] hand peng dehuai s...      his
3      [CLS] i was in charge of this er and -- [SEP] ...      his
4      [CLS] while destroying roads there was clear -...      his
...                                                  ...      ...
15379         [CLS] [MASK] face was flooding sweat [SEP]      his
15380  [CLS] the veins on [MASK] forehead were bulgin...      his
15381  [CLS] and [MASK] eyes were shot with blood and...      his
15382                       [CLS] [MASK] was happy [SEP]       he
15383     [CLS] and [MASK] began to laugh in joy . [SEP]       he

[15384 rows x 2 columns]


In [None]:
df.text.duplicated()
df.loc[df.text.duplicated(keep=False), :]

Unnamed: 0,text,pronouns
14,[CLS] the two met . [SEP] well from the inform...,he
15,[CLS] the two met . [SEP] well from the inform...,his
16,[CLS] that is through multilateral meetings [M...,he
17,[CLS] that is through multilateral meetings [M...,his
21,[CLS] uh-huh . [SEP] all these must be properl...,he
...,...,...
15359,[CLS] i remember while howling back from a lon...,her
15365,[CLS] then [MASK] noticed we were parked at th...,he
15366,[CLS] then [MASK] noticed we were parked at th...,his
15367,[CLS] though in the end [MASK] proved not to b...,he


In [None]:
df = df.drop_duplicates(keep=False, subset=['text'])
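
Dropping duplicated text removes sentences whose masks stand for different pronouns, but a sentence in which the same pronoun is masked several times has only one row after the earlier deduplication and so survives. A more direct check, sketched here on made-up rows, counts the `[MASK]` tokens in each text:

```python
import pandas as pd

# Made-up rows: the second text contains two masks for the same pronoun.
toy = pd.DataFrame({
    'text': ['[CLS] [MASK] was happy [SEP]',
             '[CLS] [MASK] said [MASK] would come [SEP]'],
    'pronouns': ['he', 'he'],
})

# Count the literal '[MASK]' tokens in each text (the pattern is a regex, hence the escapes).
mask_counts = toy['text'].str.count(r'\[MASK\]')

# Keep only rows with exactly one mask.
one_mask = toy.loc[mask_counts == 1]
print(one_mask)
```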

In [None]:
df.dtypes

text        object
pronouns    object
dtype: object

In [None]:
!pip install allennlp
# The pinned wheel torch-0.2.0.post3 (cp36, CUDA 7.5) is not a supported wheel on this
# platform, so it is not installed here; allennlp installs a compatible torch as a dependency.


In [None]:
# testing 'read' code
#!pip install allennlp
#!pip install allennlp-models
# from transformers import PreTrainedTokenizer

# from allennlp.common.util import sanitize_wordpiece
# from allennlp.data.tokenizers.token_class import Token
# from allennlp.data.tokenizers.tokenizer import Tokenizer

# from allennlp.dataset_readers.dataset_reader import DatasetReader
# from allennlp.tokenizers import PretrainedTransformerTokenizer

# _tokenizer = PretrainedTransformerTokenizer


# The 'text' column holds the masked sentence and 'pronouns' holds the target pronoun.
sentences = df['text'].tolist()
targets = df['pronouns'].tolist()
for sentence, target in zip(sentences, targets):
    # t = Token("[MASK]")
    print(target, sentence)

he [CLS] welcome both of you to the studio to participate in our program . [SEP] well i especially want to know ha how the two of you found out the news on the day of the accident ? [SEP] ah about 11:00 m. yesterday ah i happened to find out through an sms when i was outside . [SEP] uh-huh . [SEP] uh-huh . [SEP] it happened that i was going to have lunch with a friend um at noon . [SEP] and then the friend first sent me an sms uh-huh . saying [MASK] would come pick me up to go together . [SEP]
he [CLS] well lots of barricade tape has been strung up on the side road in the north - south direction of the accident scene beneath the jingguang bridge at east third ring road . [SEP] all personnel responsible for the emergency repair of underground sewage pipes ah have arrived at their designated locations . [SEP] E5763 through this footage we see ah this -- [SEP] uh-huh . [SEP] okay ah this emergency repair worker said that [MASK] was there at 4 o'clock . [SEP]
he [CLS] however the affected 

In [None]:
# save data to csv 
df.to_csv('data_one_mask.csv', index=False)

## **2. Fine-tuning BERT MLM** 

### 2.1 Define a function for saving the model and tokenizer after training

In [None]:
import os
import torch

# Saving best practices: if you use the default names for the model, you can reload it using from_pretrained()
def save_model(processed_model, epoch, lr, eps):
  # Note: relies on the global `tokenizer` that is loaded later in the notebook.
  output_dir = './drive/My Drive/playground/model_save/debias/full/lr_{}_eps_{}/epoch_{}/'.format(lr, eps, epoch)

  # Create output directory if needed
  os.makedirs(output_dir, exist_ok=True)

  print("Saving model to %s" % output_dir)

  # Save a trained model, configuration and tokenizer using `save_pretrained()`.
  # They can then be reloaded using `from_pretrained()`
  model_to_save = processed_model.module if hasattr(processed_model, 'module') else processed_model  # Take care of distributed/parallel training
  model_to_save.save_pretrained(output_dir)
  tokenizer.save_pretrained(output_dir)

  # Good practice: save your training arguments together with the trained model
  torch.save([epoch, lr, eps], os.path.join(output_dir, 'training_args.bin'))

### 2.2 Using Colab GPU for Training

In [None]:
# check GPU is activated 

import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


Identifying and specifying the GPU as the device. It will later be incorporated into the training loop. 

In [None]:

import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.   
    torch.cuda.empty_cache()
    device = torch.device("cuda")
    torch.cuda.empty_cache()


    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


### 2.3 Installing / Importing Libraries 

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
  

### 2.4 Load the data 

In [None]:
import pandas as pd

df = pd.read_csv('./original_data.csv')
df['gender'] = df['pronouns'].str.contains('^he$|^his$|^him$').astype(int)
# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))
df.sample(10)

Number of training sentences: 1,811



Unnamed: 0,text,pronouns,gender
250,[CLS] they said that when [MASK] father was ho...,him,1
1438,[CLS] %um and i asked [MASK] [SEP],her,0
454,[CLS] actor drugewbo unitich tells how the sho...,he,1
206,[CLS] what [MASK] felt was that i go out to wo...,he,1
1748,[CLS] i think ' what the fuck ? ' [SEP] but i ...,him,1
1248,[CLS] the men who did this work made a lot of ...,he,1
1131,[CLS] [MASK] also saw the follower [MASK] love...,he,1
574,[CLS] consensus on the one china question is a...,him,1
548,[CLS] victory to the strivers [SEP] although t...,she,0
1268,[CLS] here is what god said in that promise : ...,he,1


### 2.5 Extract sentences and labels of our data as ndarrays 

In [None]:
# Get the lists of sentences and their labels.
sentences = df.text.values
labels = df.pronouns.values

## **3. Tokenization & Input Formatting**


`[CLS]` and `[SEP]` were already added during preprocessing, so we do not add these special tokens again here.

### 3.1 Tokenization


In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...



In [None]:
# Print the original sentence.
print(' Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

 Original:  [CLS] welcome both of you to the studio to participate in our program . [SEP] well i especially want to know ha how the two of you found out the news on the day of the accident ? [SEP] ah about 11:00 m. yesterday ah i happened to find out through an sms when i was outside . [SEP] uh-huh . [SEP] uh-huh . [SEP] it happened that i was going to have lunch with a friend um at noon . [SEP] and then the friend first sent me an sms uh-huh . saying [MASK] would come pick me up to go together . [SEP]
Tokenized:  ['[CLS]', 'welcome', 'both', 'of', 'you', 'to', 'the', 'studio', 'to', 'participate', 'in', 'our', 'program', '.', '[SEP]', 'well', 'i', 'especially', 'want', 'to', 'know', 'ha', 'how', 'the', 'two', 'of', 'you', 'found', 'out', 'the', 'news', 'on', 'the', 'day', 'of', 'the', 'accident', '?', '[SEP]', 'ah', 'about', '11', ':', '00', 'm', '.', 'yesterday', 'ah', 'i', 'happened', 'to', 'find', 'out', 'through', 'an', 'sms', 'when', 'i', 'was', 'outside', '.', '[SEP]', 'uh', '-'

### 3.2 Mask Tokens

Create a label mask that hides every token that does not correspond to the pronoun we want to predict, so that only predictions at the `[MASK]` positions contribute to the training loss. Positions to be ignored are given the label -100, which is the default `ignore_index` of PyTorch's cross-entropy loss.
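
To make the role of -100 concrete, here is a small pure-Python sketch of a loss average that skips ignored positions. It only illustrates the idea and is not the PyTorch implementation; the helper name `masked_nll`, the probabilities and the two token ids are made up for the example.

```python
import math

IGNORE_INDEX = -100  # label value marking positions excluded from the loss

def masked_nll(probabilities, labels):
    """Average negative log-likelihood over positions whose label is not ignored.

    probabilities[i] maps candidate token ids to the model's probability at position i.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(probabilities, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)

# Three positions; only the middle one carries a pronoun label.
probabilities = [
    {2002: 0.1, 2016: 0.9},
    {2002: 0.5, 2016: 0.5},
    {2002: 0.9, 2016: 0.1},
]
labels = [IGNORE_INDEX, 2002, IGNORE_INDEX]

loss = masked_nll(probabilities, labels)
print(loss)
```

Only the middle position contributes, so whatever the model predicts at the other two positions has no effect on the loss.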

In [None]:
mask_id = tokenizer.convert_tokens_to_ids('[MASK]')
mask_id

103

In [None]:
%%time
masked_lm_labels = []
for sentence, label in zip(sentences, labels):
  sentence_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence))
  label_id = tokenizer.convert_tokens_to_ids(label)
  # Keep the pronoun id at every [MASK] position and ignore all other positions.
  masked_lm_labels.append([label_id if token_id == mask_id else -100 for token_id in sentence_ids])

CPU times: user 1.6 s, sys: 0 ns, total: 1.6 s
Wall time: 1.6 s


In [None]:
masked_lm_labels[0]

[-100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 2002,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100]

### 3.3 Encoding Sentences as Token IDs


The tokenizer.encode function combines multiple steps for us:

1. Split the sentence into tokens.
2. Map the tokens to their IDs.

In [None]:
# Get word count
word_count = [len(s.split()) for s in list(sentences)]
sum(word_count)

78294

In [None]:

%%time
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# For every sentence...
for sent in list(sentences):
    # `encode` will:
    #   (1) Tokenize the sentence.
    #   (2) Map tokens to their IDs.
    # `[CLS]` and `[SEP]` are already present in the text, so `encode` is told not to add them.
    encoded_sent = tokenizer.encode(
                        sent,                      # Sentence to encode.
                        add_special_tokens = False, # Adds '[CLS]' and '[SEP]' if True

                        # This function also supports truncation and conversion
                        # to pytorch tensors, but the sequences are padded in a
                        # later step, so these features are not used here.
                        #max_length = 128,          # Truncate all sentences.
                        #return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.
    input_ids.append(encoded_sent)

# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

Original:  [CLS] welcome both of you to the studio to participate in our program . [SEP] well i especially want to know ha how the two of you found out the news on the day of the accident ? [SEP] ah about 11:00 m. yesterday ah i happened to find out through an sms when i was outside . [SEP] uh-huh . [SEP] uh-huh . [SEP] it happened that i was going to have lunch with a friend um at noon . [SEP] and then the friend first sent me an sms uh-huh . saying [MASK] would come pick me up to go together . [SEP]
Token IDs: [101, 6160, 2119, 1997, 2017, 2000, 1996, 2996, 2000, 5589, 1999, 2256, 2565, 1012, 102, 2092, 1045, 2926, 2215, 2000, 2113, 5292, 2129, 1996, 2048, 1997, 2017, 2179, 2041, 1996, 2739, 2006, 1996, 2154, 1997, 1996, 4926, 1029, 102, 6289, 2055, 2340, 1024, 4002, 1049, 1012, 7483, 6289, 1045, 3047, 2000, 2424, 2041, 2083, 2019, 22434, 2043, 1045, 2001, 2648, 1012, 102, 7910, 1011, 9616, 1012, 102, 7910, 1011, 9616, 1012, 102, 2009, 3047, 2008, 1045, 2001, 2183, 2000, 2031, 6265, 

### 3.4 Padding and Truncation

All sentences must be padded or truncated to a single, fixed length. The maximum length BERT can take is 512 tokens.
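
As a minimal sketch of what padding and truncating at the end of a sequence means, assuming a made-up helper `pad_or_truncate` and toy ids rather than real tokenizer output: input ids are padded with the `[PAD]` id 0, while label sequences would be padded with -100 so that padded positions stay out of the loss.

```python
def pad_or_truncate(sequence, max_len, pad_value):
    """Cut `sequence` to `max_len` items, or fill it up at the end with `pad_value`."""
    trimmed = list(sequence[:max_len])
    return trimmed + [pad_value] * (max_len - len(trimmed))

# Toy example: five input ids and their labels (one labelled position).
toy_ids = [101, 103, 2001, 3407, 102]
toy_labels = [-100, 2002, -100, -100, -100]

print(pad_or_truncate(toy_ids, 8, pad_value=0))        # padded at the end with 0
print(pad_or_truncate(toy_labels, 8, pad_value=-100))  # padded at the end with -100
print(pad_or_truncate(toy_ids, 3, pad_value=0))        # truncated at the end
```

Note that truncating at the end can cut off a `[MASK]` that sits late in a long sequence, which is worth keeping in mind when choosing the maximum length.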

In [None]:
# finding max length
print('Max sentence length: ', max([len(sen) for sen in input_ids]))

Max sentence length:  148


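Set the maximum length to 140. This is shorter than the longest sequence (148 tokens), so the longest sequences are truncated at the end.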

In [None]:
MAX_LEN = 140

# We'll borrow the `pad_sequences` utility function to do this.
from keras.preprocessing.sequence import pad_sequences

# MAX_LEN is set to 140, slightly below the maximum training
# sequence length of 148, so the longest sequences lose their final tokens.

print('\nPadding/truncating all sentences to %d values...' % MAX_LEN)

print('\nPadding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))

# Pad our input tokens with value 0.
# "post" indicates that we want to pad and truncate at the end of the sequence,
# as opposed to the beginning.
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", 
                          value=0, truncating="post", padding="post")
# Pad the labels with -100 rather than 0 so that padded positions are ignored by the loss.
masked_lm_labels = pad_sequences(masked_lm_labels, maxlen=MAX_LEN, dtype="long", 
                          value=-100, truncating="post", padding="post")
print('\nDone.')



Padding/truncating all sentences to 140 values...

Padding token: "[PAD]", ID: 0

Done.
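To make "post" padding and truncation concrete, here is a minimal pure-Python sketch of what happens to a single sequence. The helper name `pad_or_truncate` is hypothetical (it is not part of Keras), and a padding ID of 0 is assumed, matching the `[PAD]` ID printed above.

```python
def pad_or_truncate(seq, max_len, pad_id=0):
    """Return `seq` cut or right-padded to exactly `max_len` items."""
    # "post" truncation: keep the first max_len items and drop the tail.
    clipped = list(seq[:max_len])
    # "post" padding: append pad_id until the target length is reached.
    return clipped + [pad_id] * (max_len - len(clipped))

short = pad_or_truncate([101, 6160, 102], max_len=5)
long = pad_or_truncate([101, 6160, 2119, 1997, 2017, 102], max_len=5)
print(short)
print(long)
```

Note that post-truncation drops the trailing `[SEP]` token (ID 102) of any sequence longer than the limit, as the second example shows.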


### 3.4 Attention Mask 

The attention mask tells BERT which positions hold real tokens (1) and which hold padding (0), so that the padding is ignored by the attention layers.


In [None]:
# Create attention masks
attention_masks = []

# For each sentence...
for sent in input_ids:
    
    # Create the attention mask.
    #   - If a token ID is 0, then it's padding, set the mask to 0.
    #   - If a token ID is > 0, then it's a real token, set the mask to 1.
    att_mask = [int(token_id > 0) for token_id in sent]
    
    # Store the attention mask for this sentence.
    attention_masks.append(att_mask)
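The same mask can also be built in one vectorized step. This is only a sketch on a toy batch; the variable names are illustrative, and it assumes, as above, that 0 is the `[PAD]` ID.

```python
import numpy as np

# Toy batch of padded token IDs; 0 is assumed to be the [PAD] ID.
toy_input_ids = np.array([
    [101, 6160, 2119, 102, 0, 0],
    [101, 7910, 102, 0, 0, 0],
])

# 1 where there is a real token, 0 where there is padding.
toy_attention_masks = (toy_input_ids > 0).astype(int)
print(toy_attention_masks)

# Row sums recover the unpadded length of each sequence.
print(toy_attention_masks.sum(axis=1))
```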

### 3.5 Training and Validation Split

Divide up our training set to use 80% for training and 20% for validation

In [None]:
# Use train_test_split to split our data into train and validation sets for
# training
import numpy as np
from sklearn.model_selection import train_test_split

# Use 80% for training and 20% for validation.
train_inputs, validation_inputs, train_lm_labels, validation_lm_labels = train_test_split(input_ids, masked_lm_labels,
                                                            random_state=2018, test_size=0.2)
# Do the same for the masks.
train_masks, validation_masks, _, _ = train_test_split(attention_masks, masked_lm_labels,
                                             random_state=2018, test_size=0.2)
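The two `train_test_split` calls above only stay aligned because they share the same `random_state` and `test_size`. The sketch below shows the underlying idea with a single shared shuffle. `split_aligned` is a hypothetical helper written for illustration; it does not reproduce the exact rows scikit-learn would pick.

```python
import numpy as np

def split_aligned(arrays, test_size=0.2, seed=2018):
    """Split several equally long arrays using one shared shuffle."""
    n = len(arrays[0])
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_val = int(round(n * test_size))
    val_idx, train_idx = order[:n_val], order[n_val:]
    # Every array is indexed with the same rows, so they stay aligned.
    return [(np.asarray(a)[train_idx], np.asarray(a)[val_idx]) for a in arrays]

toy_ids = np.arange(10).reshape(10, 1) + 100   # 10 fake "sentences"
toy_masks = np.arange(10).reshape(10, 1)       # mask row i belongs to sentence i
(train_ids, val_ids), (train_masks, val_masks) = split_aligned([toy_ids, toy_masks])
print(len(train_ids), len(val_ids))
```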

### 3.6 Converting to PyTorch Data Types

In [None]:
# Convert all inputs and labels into torch tensors, the required datatype 
# for our model.
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)

train_lm_labels = torch.tensor(train_lm_labels)
validation_lm_labels = torch.tensor(validation_lm_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

We'll also create an iterator for our dataset using the torch DataLoader class. This helps save on memory during training because, unlike a for loop, with an iterator the entire dataset does not need to be loaded into memory.

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it 
# here.
# For fine-tuning BERT on a specific task, the authors recommend a batch size of
# 16 or 32.

batch_size = 16

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_lm_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_lm_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
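As a rough mental model of what the DataLoader and its sampler do, the pure-Python sketch below yields one batch at a time from a list of examples. `iter_batches` is a hypothetical stand-in, not the PyTorch implementation.

```python
import math
import random

def iter_batches(examples, batch_size, shuffle=False, seed=0):
    """Yield lists of at most `batch_size` examples, one batch at a time."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)   # plays the role of RandomSampler
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

toy_examples = list(range(50))
batches = list(iter_batches(toy_examples, batch_size=16, shuffle=True))
print(len(batches), [len(b) for b in batches])
print(math.ceil(len(toy_examples) / 16))
```

The number of batches is the dataset size divided by the batch size, rounded up; the last batch may be smaller than the others.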

## 4. **Training the Model**

Fine-tuning the Model on the Task of MLM - producing a **BIASED** BERT


### 4.1  Load pre-trained model


In [None]:
# Load pre-trained model (weights)
from transformers import BertModel, BertConfig, BertForMaskedLM, AdamW

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = BertForMaskedLM.from_pretrained('bert-base-uncased',
                                  config=config, # Whether the model returns all hidden-states.
                                  )

# Move the model to the GPU and put it in "evaluation" mode for now.
# The training loop below switches to training mode with `model.train()`.
model.cuda()
model.eval()

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

### 4.2 Optimizer and Learning Rate Scheduler 

Now that we have our model loaded we need to grab the training hyperparameters from within the stored model.

For the purposes of fine-tuning, the authors recommend choosing from the following values:

- Batch size: 16, 32 (we chose 16 when creating our DataLoaders).
- Learning rate (Adam): 5e-5, 3e-5, 2e-5 (we cross-validated over these values).
- Number of epochs: 2, 3, 4 (we used 8 and selected the epoch that performs best on the validation dataset).

The epsilon parameter eps = 1e-8 is "a very small number to prevent any division by zero in the implementation".

The creation of the AdamW optimizer can be found in run_glue.py.

In [None]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
# I believe the 'W' stands for 'Weight Decay fix'
lr = 2e-5
eps = 1e-8
optimizer = AdamW(model.parameters(),
                  lr = lr, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = eps # args.adam_epsilon  - default is 1e-8.
                )

In [None]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs (authors recommend between 2 and 4)
epochs = 8

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
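To see what a linear schedule with warmup does to the learning rate, here is a small sketch of the multiplier as I understand it: a linear ramp during warmup, then a linear decay to zero. `linear_schedule_factor` is a hypothetical helper, not the library function, and the step count assumes 91 batches per epoch (the figure shown in the training log further down) times 8 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
toy_total_steps = 91 * 8   # assumed: 91 batches per epoch times 8 epochs
for step in (0, toy_total_steps // 2, toy_total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, toy_total_steps))
```

With zero warmup steps, the learning rate starts at its full value, is halved by the midpoint, and reaches zero at the final step.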

### 4.3 Training Loop 

Below is our training loop. There's a lot going on, but fundamentally for each pass in our loop we have a training phase and a validation phase. At each pass we need to:

Training loop:

- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Clear out the gradients calculated in the previous pass. In PyTorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out.
- Forward pass (feed input data through the network)
- Backward pass (backpropagation)
- Tell the network to update parameters with optimizer.step()
- Track variables for monitoring progress

Evaluation loop:

- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Forward pass (feed input data through the network)
- Compute loss on our validation data and track variables for monitoring progress

So please read carefully through the comments to get an understanding of what's happening. If you're unfamiliar with PyTorch, a quick look at some of their beginner tutorials will help show you that training loops really involve only a few simple steps; the rest is usually just decoration and logging.
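The skeleton of the training phase can be seen in isolation on a toy problem. The sketch below is not BERT and not PyTorch: it fits a three-parameter linear model with NumPy and hand-written gradients, purely to show the order of the steps listed above (unpack, clear gradients, forward, backward, clip, update, track). All names and values in it are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(64, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)                        # the "model parameters"
lr, batch_size, max_norm = 0.1, 16, 1.0
epoch_losses = []

for epoch in range(50):
    total_loss, n_batches = 0.0, 0
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]                # unpack the batch
        grad = np.zeros_like(w)                # clear the old gradients
        err = xb @ w - yb                      # forward pass
        loss = float(np.mean(err ** 2))
        grad += 2 * xb.T @ err / len(idx)      # backward pass
        norm = np.linalg.norm(grad)
        if norm > max_norm:                    # clip the gradient norm
            grad *= max_norm / norm
        w -= lr * grad                         # parameter update
        total_loss += loss                     # track progress
        n_batches += 1
    epoch_losses.append(total_loss / n_batches)

print(epoch_losses[0], epoch_losses[-1])
```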



#### 4.3.1 Define a helper function for calculating accuracy.

In [None]:
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    labels_flat_filtered = (labels_flat != 0) * (labels_flat != -100) * labels_flat

    return np.sum((pred_flat == labels_flat_filtered) * (labels_flat_filtered != 0)) / sum(labels_flat_filtered != 0)
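A tiny worked example helps check what this accuracy counts: only positions whose label is neither 0 nor -100 are scored. The sketch below re-expresses the same rule with a boolean mask; `masked_accuracy` and the toy arrays are illustrative, not part of the notebook.

```python
import numpy as np

def masked_accuracy(logits, labels):
    """Accuracy over positions whose label is neither 0 nor -100."""
    preds = np.argmax(logits, axis=2)
    scored = (labels != 0) & (labels != -100)
    return float((preds[scored] == labels[scored]).mean())

# One sentence, four positions, vocabulary of five token IDs.
toy_logits = np.zeros((1, 4, 5))
toy_logits[0, 0, 3] = 1.0   # predicts ID 3
toy_logits[0, 1, 2] = 1.0   # predicts ID 2
toy_logits[0, 2, 4] = 1.0   # predicts ID 4
toy_logits[0, 3, 1] = 1.0   # predicts ID 1
toy_labels = np.array([[3, 0, 1, -100]])   # only positions 0 and 2 are scored

print(masked_accuracy(toy_logits, toy_labels))
```

Position 0 is predicted correctly and position 2 is not, so the result is 0.5; positions 1 and 3 are ignored.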

#### 4.3.2 Helper function for formatting elapsed times.

In [None]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
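A quick usage check of the helper, repeated here so the snippet runs on its own:

```python
import datetime

def format_time(elapsed):
    """Same helper as above: seconds -> 'h:mm:ss' string."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))

print(format_time(47.4))
print(format_time(3725))
```

The hours field is not zero-padded, which matches the `0:00:47` style seen in the training log.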

#### 4.3.3 now start training 

In [None]:
import random
from tqdm import tqdm

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128


# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
training_loss_values = []
eval_loss_values = []

# For each epoch...
for epoch_i in range(2, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time as h:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('Batch {:>5,} of {:>5,}.  Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)


        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad() 

        # Perform a forward pass (evaluate the model on this training batch).
        # Because we have provided the `labels`, this returns the loss
        # followed by the logits.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(b_input_ids, 
                    # token_type_ids=b_segments,
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        # The call to `model` always returns a tuple, so we need to pull the 
        # loss value out of the tuple.
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Store the loss value for plotting the learning curve.
    training_loss_values.append(avg_train_loss)
    save_model(model, epoch_i, lr, eps)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass. Because we provide `labels`, this returns the
            # loss followed by the logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(b_input_ids, 
                            # token_type_ids=b_segments,
                            attention_mask=b_input_mask,
                            labels=b_labels)
        
        # Get the validation loss
        loss = outputs[0]
        eval_loss += loss.item()

        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[1]        

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Calculate the average loss over the validation data.
    avg_eval_loss = eval_loss / len(validation_dataloader)            
    eval_loss_values.append(avg_eval_loss)


    # Report the final accuracy for this validation run.
    print("  Average evaluation loss: {0:.2f}".format(avg_eval_loss))
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")



Training...
Batch    40 of    91.  Elapsed: 0:00:20.
Batch    80 of    91.  Elapsed: 0:00:40.
Saving model to ./drive/My Drive/playground/model_save/debias/full/lr_2e-05_eps_1e-08/epoch_2/

  Average training loss: 0.02
  Training epcoh took: 0:00:47

Running Validation...
  Average evaluation loss: 0.03
  Accuracy: 0.53
  Validation took: 0:00:07

Training...
Batch    40 of    91.  Elapsed: 0:00:21.
Batch    80 of    91.  Elapsed: 0:00:42.
Saving model to ./drive/My Drive/playground/model_save/debias/full/lr_2e-05_eps_1e-08/epoch_3/

  Average training loss: 0.02
  Training epcoh took: 0:00:48

Running Validation...
  Average evaluation loss: 0.02
  Accuracy: 0.54
  Validation took: 0:00:08

Training...
Batch    40 of    91.  Elapsed: 0:00:21.
Batch    80 of    91.  Elapsed: 0:00:43.
Saving model to ./drive/My Drive/playground/model_save/debias/full/lr_2e-05_eps_1e-08/epoch_4/

  Average training loss: 0.01
  Training epcoh took: 0:00:49

Running Validation...
  Average evaluation lo

#### 4.2.1.1 Extracting word embeddings (extracting word vectors)

In [None]:
#Install libraries
!pip install transformers
!pip install plotly==4.9.0
!pip install wmd

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninsta

In [None]:
#imports
import torch
from transformers import BertTokenizer, BertModel  #RobertaModel, RobertaTokenizer 
import sys
import re
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances
from scipy.spatial.distance import euclidean, pdist, squareform
from sklearn import manifold          #use this for MDS computation
import pandas as pd
import numpy as np

#visualization libs
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline

#Used to calculate the word mover's distance between sentences
from collections import Counter

#Library to calculate Relaxed-Word Movers distance
from wmd import WMD
from wmd import libwmdrelax

  defaults = yaml.load(f)


In [None]:
#Define some constants
PRETRAINED_MODEL = 'bert-base-uncased' 
MAX_LEN = 15

Data and words of interest. When replicating this, the list could be built from gendered words (Dana's: find gendered words) or possibly from occupational words.

In [None]:
#this defines what I would like highlighted when I visualize the word vectors
WORDS_OF_INTEREST = ['woman', 'man']

In [None]:
#call the model on the sentences
outputs = model(input_ids, attention_masks) #(tokenized_tensor, sent_tensor)
hidden_states = outputs[2]

print("Total hidden layers:", len(hidden_states))
print("First layer : hidden_states[0].shape ", hidden_states[0].shape)     # [batch_size x seq_length x vector_dim]

NameError: ignored

Experimenting with how to get tensors from different layers and stack them as needed

In [None]:
#get last 4 layers
torch.stack(hidden_states[-4:]).shape

#concatenate last 4 layer outputs
torch.cat(hidden_states[-4:], dim=2).shape

#avg last 4 layer outputs
torch.stack(hidden_states[-4:]).mean(0).shape

#sum across the 4 layers, and swap the batch_size and seq_len dim to access any token
torch.stack(hidden_states[-4:]).sum(0).permute(1,0,2).shape

torch.Size([15, 2, 1024])
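The shapes produced by each way of combining layers are easier to compare side by side. The sketch below uses NumPy arrays as stand-ins for the hidden-state tensors. The sizes are toy assumptions (2 sentences, 15 tokens, 768 features as in BERT-base); the recorded output above shows a feature size of 1024, so that run presumably used a larger model.

```python
import numpy as np

batch_size, seq_len, dim = 2, 15, 768   # toy sizes; dim=768 assumes a BERT-base model
# Stand-in for `hidden_states`: 13 arrays, one per layer (embeddings + 12 layers).
toy_hidden_states = tuple(np.random.rand(batch_size, seq_len, dim) for _ in range(13))
last4 = toy_hidden_states[-4:]

stacked = np.stack(last4)                      # new leading "layer" axis
concatenated = np.concatenate(last4, axis=2)   # features laid side by side
averaged = np.stack(last4).mean(axis=0)        # layer axis averaged away
token_major = np.stack(last4).sum(axis=0).transpose(1, 0, 2)  # like permute(1,0,2)

for name, arr in [("stack", stacked), ("concat", concatenated),
                  ("mean", averaged), ("sum + transpose", token_major)]:
    print(name, arr.shape)
```

Concatenation multiplies the feature size by the number of layers, while averaging and summing keep it unchanged.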

In [None]:
def get_vector(hidden_layers_form_arch, token_index=0, mode='average', top_n_layers=4):
  '''
  Retrieve the vectors for a token_index from the top n layers and return a concatenated, averaged or summed vector.
  hidden_layers_form_arch: tuple of hidden states returned by the transformers library
  token_index: index of the token position for which vectors are desired
  mode=
        'average' : avg last n layers
        'concat': concatenate last n layers
        'sum' : sum last n layers
        'last': return embeddings only from last layer
        'second_last': return embeddings only from second last layer

  top_n_layers: number of top layers to concatenate/ average / sum

  Every branch builds a [batch_size x seq_len x dim] tensor and then calls
  permute(1,0,2) to swap the batch and seq_len dims, so indexing with
  token_index returns a [batch_size x dim] tensor holding that position's
  vector for every sentence. Returns None for an unknown mode.
  '''
  if mode == 'concat':
    #concatenate the last n layer outputs along the feature dim
    return torch.cat(hidden_layers_form_arch[-top_n_layers:], dim=2).permute(1,0,2)[token_index]
  
  if mode == 'average':
    #average the last n layer outputs
    return torch.stack(hidden_layers_form_arch[-top_n_layers:]).mean(0).permute(1,0,2)[token_index]

  if mode == 'sum':
    #sum the last n layer outputs
    return torch.stack(hidden_layers_form_arch[-top_n_layers:]).sum(0).permute(1,0,2)[token_index]

  if mode == 'last':
    #last layer output
    return hidden_layers_form_arch[-1].permute(1,0,2)[token_index]

  if mode == 'second_last':
    #second last layer output
    return hidden_layers_form_arch[-2].permute(1,0,2)[token_index]

  return None

In [None]:
get_vector(hidden_states, token_index=0, mode='concat', top_n_layers=4).shape
get_vector(hidden_states, token_index=0, mode='sum', top_n_layers=4).shape

torch.Size([2, 1024])

In [None]:
#Lengths of each sentence
sent_lengths = attention_masks.sum(1).tolist()
sent_lengths

[10, 9]

In [None]:
#get the tokenized version of each sentence (text form, to label things in the plot)
tokenized_sents = [tokenizer.convert_ids_to_tokens(i) for i in input_ids]
tokenized_sents[0]

['[CLS]',
 'joe',
 'took',
 'alexandria',
 'out',
 'on',
 'a',
 'date',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

In [None]:
def plt_dists(dists, labels, dims=2, words_of_interest=[], title=""):
  '''
  Plot distances using MDS in 2D/3D 
  dists: precomputed distance matrix
  labels: labels to display on the plot
  dims: 2/3 for 2 or 3 dimensional plot, defaults to 2 for any other value passed
  words_of_interest: list of words to highlight with a different color
  title: title for the plot
  '''
  cnt_dict = dict()
  color = list()

  #separate colors for words that are in words_of_interest vs other
  #each word will have a _SentenceNumber at the end to differentiate the words coming in from different sentences
  for v in labels:
    found = False
    for wrd_int in words_of_interest:
      if wrd_int in v:
        found = True
        break
      
    if found:
      color.append(1)
    else:
      color.append(0)

  #https://community.plotly.com/t/plotly-colours-list/11730/6
  colorscale = [[0, 'darkcyan'], [1, 'white']]

  #dists is a precomputed cosine distance matrix passed in by the caller
  #calculate MDS with number of dims passed
  mds = manifold.MDS(n_components=dims, dissimilarity="precomputed", random_state=60, max_iter=90000)
  results = mds.fit(dists)

  print(results)

  #get coordinates for each point
  coords = results.embedding_
  
  #plot
  if dims == 3:
    fig = go.Figure(data=[go.Scatter3d(
        x=coords[:, 0],
        y=coords[:, 1],
        z=coords[:, 2],
        mode='markers+text',
        textposition="top center",
        text=labels,
        marker=dict(
            size=10,
            color=color,
            colorscale=colorscale,
            opacity=0.8,
            
        )
    )])
  else:
    fig = go.Figure(data=[go.Scatter(
        x=coords[:, 0],
        y=coords[:, 1],
        mode='markers+text',
        text=labels,
        textposition="top center",
        marker=dict(
            size=12,
            color=color,
            colorscale=colorscale,
            opacity=0.8,
            
        )
    )])

  fig.update_layout(template="plotly_dark")
  if title!="":
    fig.update_layout(title_text=title)
  fig.show()

In [None]:
def eval_vecs(input_hidden_states, input_tokenized_sents, mode='concat', top_n_layers=4, viz_dims=2, words_with_diff_color=WORDS_OF_INTEREST):
  '''
  Get a vector for each word in each sentence, add the sentence number to the end of each word,
  calculate the cosine distance between each pair of words and then pass it to the visualization function

  inputs:
  input_hidden_states: hidden states retrieved from a BERT-like model
  input_tokenized_sents: tokenized sentences, used to assign labels for each point on the plot
  mode:   'average' : avg last n layers
          'concat': concatenate last n layers
          'sum' : sum last n layers
          'last':  embeddings only from last layer
          'second_last':  embeddings only from second last layer
  top_n_layers: top n layers to use for concat/sum etc.
  viz_dims: 2/3 for 2D or 3D plot
  words_with_diff_color: words that should be highlighted with a different color on the plot
  '''
  vecs = list()
  labels = list()
  for token_ind in range(MAX_LEN):
    if token_ind == 0:
      #ignore CLS
      continue
    vectors = get_vector(input_hidden_states, token_index=token_ind, mode=mode, top_n_layers=top_n_layers)
    for sent_ind, sent_len in enumerate(sent_lengths):
      if token_ind < sent_len-1:
        #ignore SEP which will be at the last index of each sentence
        vecs.append(vectors[sent_ind])
        labels.append(input_tokenized_sents[sent_ind][token_ind]+"_"+str(sent_ind))
    
  #create a numpy matrix to pass to cosine distance
  mat = torch.stack(vecs).detach().numpy()
  #call the plot function on the cosine distance matrix
  plt_dists(cosine_distances(mat), labels=labels, dims=viz_dims, words_of_interest=words_with_diff_color, title='Method: {}'.format(mode))
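The distance matrix passed to the plot is a pairwise cosine distance, that is, one minus the cosine similarity of each pair of rows. A minimal NumPy sketch of that computation follows; `cosine_distance_matrix` is an illustrative helper, not the scikit-learn function used above.

```python
import numpy as np

def cosine_distance_matrix(mat):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    unit = mat / norms
    return 1.0 - unit @ unit.T

toy_vecs = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0
    [0.0, 3.0],   # orthogonal to row 0
    [-1.0, 0.0],  # opposite to row 0
])
toy_dists = cosine_distance_matrix(toy_vecs)
print(np.round(toy_dists, 3))
```

Vectors pointing the same way are at distance 0, orthogonal vectors at 1, and opposite vectors at 2, regardless of their lengths.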

In [None]:
#check if sum and average are the same
sm = get_vector(hidden_states, token_index=0, mode='sum', top_n_layers=4)
av = get_vector(hidden_states, token_index=0, mode='average', top_n_layers=4)

torch.eq(sm, av)

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]])
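The element-wise comparison is False everywhere because the sum over n layers is n times the average. Since `eval_vecs` compares vectors by cosine distance, which ignores scale, the 'sum' and 'average' modes should nonetheless give the same distances up to floating-point error. The sketch below checks this on random toy vectors; all names and sizes in it are illustrative.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two 1-D vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
layers_a = rng.normal(size=(4, 8))   # 4 toy layer vectors for token A
layers_b = rng.normal(size=(4, 8))   # 4 toy layer vectors for token B

sum_a, avg_a = layers_a.sum(axis=0), layers_a.mean(axis=0)
sum_b, avg_b = layers_b.sum(axis=0), layers_b.mean(axis=0)

print(np.allclose(sum_a, 4 * avg_a))
print(cosine_distance(sum_a, sum_b), cosine_distance(avg_a, avg_b))
```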

Looking at word vectors 

In [None]:
MODE = 'concat'
eval_vecs(hidden_states, tokenized_sents, mode=MODE)

MDS(dissimilarity='precomputed', eps=0.001, max_iter=90000, metric=True,
    n_components=2, n_init=4, n_jobs=None, random_state=60, verbose=0)


In [None]:
#we can look at this using a 3D plot too
eval_vecs(hidden_states, tokenized_sents, mode='concat', viz_dims=3)

MDS(dissimilarity='precomputed', eps=0.001, max_iter=90000, metric=True,
    n_components=3, n_init=4, n_jobs=None, random_state=60, verbose=0)


In [None]:
MODE = 'sum'
eval_vecs(hidden_states, tokenized_sents, mode=MODE)

MDS(dissimilarity='precomputed', eps=0.001, max_iter=90000, metric=True,
    n_components=2, n_init=4, n_jobs=None, random_state=60, verbose=0)


In [None]:
MODE = 'average'
eval_vecs(hidden_states, tokenized_sents, mode=MODE)

MDS(dissimilarity='precomputed', eps=0.001, max_iter=90000, metric=True,
    n_components=2, n_init=4, n_jobs=None, random_state=60, verbose=0)


In [None]:
MODE = 'last'
eval_vecs(hidden_states, tokenized_sents, mode=MODE)

MDS(dissimilarity='precomputed', eps=0.001, max_iter=90000, metric=True,
    n_components=2, n_init=4, n_jobs=None, random_state=60, verbose=0)


In [None]:
MODE = 'second_last'
eval_vecs(hidden_states, tokenized_sents, mode=MODE)

MDS(dissimilarity='precomputed', eps=0.001, max_iter=90000, metric=True,
    n_components=2, n_init=4, n_jobs=None, random_state=60, verbose=0)


#### 4.2.1.2 Extracting word embeddings (from hidden states) 

from McCormick: https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=G4Qa5KkkM2Aq 

In [None]:
# FROM MCCORMICK (extracting embeddings) 

# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers. 
with torch.no_grad():

    outputs = model(input_ids)

    # Evaluating the model will return a different number of objects based on 
    # how it's configured in the `from_pretrained` call earlier. In this case, 
    # because we set `output_hidden_states = True`, the third item will be the 
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]

The full set of hidden states for this model, stored in the object hidden_states, is a little dizzying. This object has four dimensions, in the following order:

1. The layer number (13 layers)
2. The batch number (1 sentence)
3. The word / token number (22 tokens in our sentence)
4. The hidden unit / feature number (768 features)

Wait, 13 layers? Doesn't BERT only have 12? It's 13 because the first element is the input embeddings; the rest are the outputs of each of BERT's 12 layers.

That’s 219,648 unique values just to represent our one sentence!

The second dimension, the batch size, is used when submitting multiple sentences to the model at once; here, though, we just have one example sentence.
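The count quoted above follows directly from multiplying the four dimensions given in the list:

```python
# Sizes taken from the list above: layers, batches, tokens, features.
n_layers, n_batches, n_tokens, n_features = 13, 1, 22, 768
total_values = n_layers * n_batches * n_tokens * n_features
print(total_values)
```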

In [None]:
print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
layer_i = 0

print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0

print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0

print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

Let's take a quick look at the range of values for a given layer and token.

You'll find that the range is fairly similar for all layers and tokens, with the majority of values falling between [-2, 2], and a small smattering of values around -10.

In [None]:
# For the 5th token in our sentence, select its feature values from layer 5.
token_i = 5
layer_i = 5
vec = hidden_states[layer_i][batch_i][token_i]

# Plot the values as a histogram to show their distribution.
plt.figure(figsize=(10,10))
plt.hist(vec, bins=200)
plt.show()

Grouping the values by layer makes sense for the model, but for our purposes we want it grouped by token.

Current dimensions:

[# layers, # batches, # tokens, # features]

Desired dimensions:

[# tokens, # layers, # features]

Luckily, PyTorch includes the permute function for easily rearranging the dimensions of a tensor.

However, the first dimension is currently a Python tuple!

In [None]:
# `hidden_states` is a Python tuple.
print('      Type of hidden_states: ', type(hidden_states))

# Each layer in the list is a torch tensor.
print('Tensor shape for each layer: ', hidden_states[0].size())

Combine the layers into one big tensor.



In [None]:
# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)

token_embeddings.size()

Get rid of the "batches" dimension, since we don't need it.

In [None]:
# Remove dimension 1, the "batches".
token_embeddings = torch.squeeze(token_embeddings, dim=1)

token_embeddings.size()
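The remaining step toward the desired `[# tokens, # layers, # features]` layout is the `permute` call mentioned earlier. A minimal sketch of the whole stack, squeeze, permute sequence, using random stand-in tensors (`dummy_hidden_states` is an assumption for illustration, not the model output) so that only the shapes matter:

```python
import torch

# Stand-in for BERT's output: 13 layers, each [batch=1, tokens=22, features=768].
# Random values are used purely to illustrate the shapes.
dummy_hidden_states = tuple(torch.randn(1, 22, 768) for _ in range(13))

stacked = torch.stack(dummy_hidden_states, dim=0)  # [13, 1, 22, 768]
no_batch = torch.squeeze(stacked, dim=1)           # [13, 22, 768]
by_token = no_batch.permute(1, 0, 2)               # [22, 13, 768]

print(by_token.size())
```

After the permute, `by_token[i]` holds all 13 layer vectors for token `i`.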

### 4.2.1.3 Extracting Embeddings 

SimpleRepresentations library (https://pypi.org/project/simplerepresentations/) 

Tutorial on YouTube: https://pypi.org/project/simplerepresentations/ 


In [None]:
!pip install torch
!pip install simplerepresentations

Collecting simplerepresentations
  Downloading simplerepresentations-0.0.4.tar.gz (7.3 kB)
Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Building wheels for collected packages: simplerepresentations
  Buil

In [None]:
# model
from simplerepresentations import RepresentationModel

model_type = 'bert'
model_name = 'bert-base-uncased'

representation_model = RepresentationModel(
    model_type=model_type,
    model_name=model_name,
    batch_size=128,
    max_seq_length=128, # truncate sentences to be less than or equal to 128 tokens
    combination_method='sum', # sum the last `last_hidden_to_use` hidden states
    last_hidden_to_use=2, # use the last 2 hidden states to build token representations
    verbose=0
)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# data
import datasets
text_sample = datasets.load_dataset('bookcorpus', split='train[50%:51%]')

In [None]:

all_sentences_representations, all_tokens_representations = representation_model(text_sample['text'])

AttributeError: ignored

#### 4.2.1.4 Extracting BERT Embeddings 

Using the transformers library

https://towardsdatascience.com/word-embeddings-in-2020-review-with-code-examples-11eb39a1ee6d


In [None]:
!pip install transformers



Import PyTorch, the pretrained BERT model, and the BERT tokenizer.

In [None]:
import torch
torch.manual_seed(0)
from transformers import BertTokenizer, BertModel
import logging
import matplotlib.pyplot as plt
%matplotlib inline
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Import the data and tokenize it.

In [None]:
import datasets
text_sample = datasets.load_dataset('bookcorpus', split='train[50%:51%]')

In [None]:
# Create a function to tokenize a set of texts
def preprocessing_for_bert(data, tokenizer_obj):
    """Perform required preprocessing steps for pretrained BERT.
    Relies on the global `MAX_LEN` being defined before the function is called.
    @param    data (iterable of str): Texts to be processed.
    @param    tokenizer_obj (BertTokenizer): Tokenizer used to encode the texts.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    @return   attention_masks_without_special_tok (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model excluding the special tokens (CLS/SEP)
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer_obj.encode_plus(
            text=sent,  # Preprocess sentence
            add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            max_length=MAX_LEN,             # Max length to truncate/pad (global)
            padding='max_length',           # Pad sentence to max length
            truncation=True,                # Truncate longer sequences to max length
            return_attention_mask=True      # Return attention mask
            )
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    
    # Create a second mask for averaging word vectors later: we want to
    # average over all word vectors in a sentence, excluding [CLS] and [SEP].
    # Start from a copy of the attention mask.
    attention_masks_without_special_tok = attention_masks.clone().detach()
    
    #set the CLS token index to 0 for all sentences 
    attention_masks_without_special_tok[:,0] = 0

    #get sentence lengths and use that to set those indices to 0 for each length
    #essentially, the last index for each sentence, which is the SEP token
    sent_len = attention_masks_without_special_tok.sum(1).tolist()

    #column indices to set to zero
    col_idx = torch.LongTensor(sent_len)
    #row indices for all rows
    row_idx = torch.arange(attention_masks.size(0)).long()
    
    #set the SEP indices for each sentence token to zero
    attention_masks_without_special_tok[row_idx, col_idx] = 0

    return input_ids, attention_masks, attention_masks_without_special_tok

In [None]:
#run data through the tokenizer
MAX_LEN = 15
input_ids, attention_masks, attention_masks_without_special_tok = preprocessing_for_bert(text_sample['text'], tokenizer)
print(len(input_ids))



740042
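The third return value, `attention_masks_without_special_tok`, is intended for averaging word vectors while skipping `[CLS]`, `[SEP]` and padding. A minimal sketch of such a masked average on a toy tensor follows; `masked_mean` is a hypothetical helper, not part of the notebook or of any library:

```python
import torch

def masked_mean(token_vectors, mask):
    """Average token vectors over positions where mask == 1.

    token_vectors: [sentences, tokens, features]
    mask:          [sentences, tokens], 1 for tokens to keep, 0 otherwise
    """
    mask = mask.unsqueeze(-1).to(token_vectors.dtype)  # [sentences, tokens, 1]
    summed = (token_vectors * mask).sum(dim=1)          # [sentences, features]
    counts = mask.sum(dim=1).clamp(min=1.0)             # avoid division by zero
    return summed / counts

# Toy example: 1 sentence, 5 positions ([CLS], two words, [SEP], [PAD]), 2 features.
vectors = torch.tensor([[[9., 9.], [1., 2.], [3., 4.], [9., 9.], [9., 9.]]])
mask = torch.tensor([[0, 1, 1, 0, 0]])
sentence_vector = masked_mean(vectors, mask)
print(sentence_vector)  # tensor([[2., 3.]])
```

Only the two word positions contribute, so the values at the special and padded positions have no effect on the result.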


"Segment ID. BERT is trained on and expects sentence pairs using 1s and 0s to distinguish between the two sentences. We will encode each sentence separately so we will just mark each token in each sentence with 1."

In [None]:
segments_ids = torch.ones_like(input_ids)
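For comparison, this is how segment IDs are laid out when BERT does receive a sentence pair in the form `[CLS] A [SEP] B [SEP]`. The helper below is a hypothetical illustration, not a library function. Note that Hugging Face tokenizers assign segment 0 to a single sentence by default, whereas the tutorial quoted above marks single sentences with 1.

```python
def segment_ids_for_pair(len_a, len_b):
    """Segment IDs for `[CLS] A [SEP] B [SEP]`.

    len_a, len_b: number of word-piece tokens in sentences A and B,
    excluding the special tokens.
    """
    first = [0] * (len_a + 2)   # [CLS] + A + [SEP]
    second = [1] * (len_b + 1)  # B + [SEP]
    return first + second

pair_ids = segment_ids_for_pair(3, 2)
print(pair_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```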

Call the BERT model and collect the hidden states from which we will build word embeddings.

In [None]:
from transformers import BertModel, BertConfig

# Ask the model to return the hidden states of every layer.
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)

# Put the model in evaluation mode (disables dropout); we only need a
# forward pass through the architecture for this task.
model.eval()

with torch.no_grad():
    # Pass the masks by keyword: the second positional argument of
    # `BertModel.forward` is `attention_mask`, not the segment IDs.
    outputs = model(input_ids,
                    attention_mask=attention_masks,
                    token_type_ids=segments_ids)
    # Third element: hidden states of all layers.
    hidden_states = outputs[2]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
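With several hundred thousand sentences, a single forward pass over all of `input_ids` is unlikely to fit in memory. One possible approach is to run the model in mini-batches and concatenate the results. The sketch below uses a toy stand-in for the model; `run_in_batches` and `toy_forward` are hypothetical names. With the real model, the callable could be something like `lambda ids, mask: model(ids, attention_mask=mask).last_hidden_state`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def run_in_batches(forward_fn, input_ids, attention_masks, batch_size=32):
    """Apply `forward_fn` to mini-batches and concatenate the outputs.

    forward_fn is a hypothetical callable taking (input_ids, attention_mask)
    and returning a tensor whose first dimension is the batch.
    """
    loader = DataLoader(TensorDataset(input_ids, attention_masks),
                        batch_size=batch_size)
    chunks = []
    with torch.no_grad():
        for ids, mask in loader:
            chunks.append(forward_fn(ids, mask).cpu())
    return torch.cat(chunks, dim=0)

# Toy stand-in for the model: an embedding lookup zeroed at masked positions.
toy_embedding = torch.nn.Embedding(100, 8)

def toy_forward(ids, mask):
    return toy_embedding(ids) * mask.unsqueeze(-1)

toy_ids = torch.randint(0, 100, (10, 15))
toy_mask = torch.ones(10, 15, dtype=torch.long)
result = run_in_batches(toy_forward, toy_ids, toy_mask, batch_size=4)
print(result.shape)
```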


Examine the model output.

In [None]:
print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
print ("Number of batches:", len(hidden_states[0]))
print ("Number of tokens:", len(hidden_states[0][0]))
print ("Number of hidden units:", len(hidden_states[0][0][0]))

In [None]:
# Concatenate the tensors for all layers. 
token_embeddings = torch.stack(hidden_states, dim=0)
# Swap dimensions, so we get tensors in format: [sentence, tokens, hidden layers, features]
token_embeddings = token_embeddings.permute(1,2,0,3)

Use the last 4 hidden layers to create each word embedding.

In [None]:
processed_embeddings = token_embeddings[:, :, 9:, :]

Concatenate four layers for each token to create embeddings

In [None]:
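# Flatten the layer and feature dimensions together, keeping the sentence
# and token dimensions as they are: [sentences, tokens, 4 * 768].
embeddings = torch.reshape(
    processed_embeddings,
    (processed_embeddings.size(0), processed_embeddings.size(1), -1))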
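To make the shapes concrete, here is a minimal sketch on a random stand-in tensor (`dummy_token_embeddings` is an assumption for illustration). It shows concatenation of the last four layers and, as a common alternative, summing them:

```python
import torch

# Stand-in for `token_embeddings`: [sentences=2, tokens=15, layers=13, features=768].
dummy_token_embeddings = torch.randn(2, 15, 13, 768)

last_four = dummy_token_embeddings[:, :, -4:, :]  # [2, 15, 4, 768]

# Concatenate the four layers: flatten layer and feature dimensions together.
concatenated = last_four.reshape(last_four.size(0), last_four.size(1), -1)  # [2, 15, 3072]

# Alternative: sum the four layers, keeping the feature size at 768.
summed = last_four.sum(dim=2)  # [2, 15, 768]

print(concatenated.shape, summed.shape)
```

Concatenation keeps each layer's values separate at the cost of a 3,072-dimensional vector per token, while summing keeps the original 768 dimensions.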

*continue following the tutorial!!*

## **5. Extracting BERT Embeddings to Identify Biased Words and Gender Subspace**

This is necessary to then define the protected attribute (gender) that will be used as the input to the adversary when fine-tuning BERT. 

Code for extracting BERT embeddings from: https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=UCIGe0AXfg4Z 

## **6. Training (fine-tuning) the Model with an Adversary**

### 6.1 Create Relevant Classes (one for the Adversary, one for the Masked Language Model (BERT)). 

(inspired from: https://github.com/choprashweta/Adversarial-Debiasing/blob/master/Debiased_Classifier.ipynb) 

In [None]:
!pip install constant
!pip install bert-pytorch
!pip install pytorch-pretrained-bert pytorch-nlp
!pip install -U -q PyDrive

Collecting constant
  Downloading constant-0.0.4.zip (63 kB)
Building wheels for collected packages: constant
  Building wheel for constant (setup.py) ... [?25l[?25hdone
  Created wheel for constant: filename=constant-0.0.4-py3-none-any.whl size=74356 sha256=5836e68094c83a0dbf88b173e9d46f216809467866f834a7503752751fb6e219
  Stored in directory: /root/.cache/pip/wheels/66/c7/15/b36373f806bcade834c10be9a6f559d63d9be96fa905b5cd45
Successfully built constant
Installing collected packages: constant
Success

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, recall_score, precision_score
from sklearn.utils.class_weight import compute_class_weight
from keras.layers import Input, Dense, Dropout
from keras.models import Model
import pandas as pd
import numpy as np
import os, sys
from google.colab import drive

sys.path.append(os.path.join(os.path.dirname(sys.path[0]), 'analysis'))
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(sys.path[0])), 'configs' ))

import constant

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import numpy as np
import pandas as pd
from tqdm import tqdm
import os
import warnings
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification, BertAdam, BertModel
from pytorch_pretrained_bert import BertConfig

# specify GPU device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


'Tesla P100-PCIE-16GB'