#### Before you run this notebook, make sure you have created a new conda environment.
Run the following commands on your command line:
<pre><code>
conda create -n deepmoji anaconda
source activate deepmoji
## add to jupyter notebook list
python -m ipykernel install --user --name deepmoji --display-name "Python (deepmoji)"</code></pre>

### Or, preferably, load this notebook into [Google Colab](https://colab.research.google.com)   


## Transfer learning and bias detection
Today we will practice adapting pretrained language models to power our classifiers. 
Using a pretrained model as input can mean a large gain in performance, but it can also introduce problematic bias. 

Since you do not control the population or dataset from which the pretrained model learned, it is hard to guarantee that the model does not come with certain biases built in. 

As the pretrained models come "free", you should instead spend resources on investigating and potentially eliminating biases (bias correction). Today you will practice investigating the biases. 

We will do this using two datasets: 
1. The Equity Evaluation Corpus from the paper "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems" by Kiritchenko & Mohammad (2018): [data](https://saifmohammad.com/WebDocs/EEC/Equity-Evaluation-Corpus.zip)

2. Kaggle Toxicity Classification: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
Follow the URL, sign in, and download the zip file.








In [1]:
## Download the equity evaluation corpus
import requests
response = requests.get('https://saifmohammad.com/WebDocs/EEC/Equity-Evaluation-Corpus.zip')

with open('Equity-Evaluation-Corpus.zip','wb') as f:
    f.write(response.content)

import zipfile
import os

directory_to_extract_to = 'bias_dataset'
# create the target directory if needed and unpack the archive into it
os.makedirs(directory_to_extract_to, exist_ok=True)
with zipfile.ZipFile('Equity-Evaluation-Corpus.zip', 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

In [2]:
import pandas as pd
directory_to_extract_to = 'bias_dataset'
bias_df = pd.read_csv(directory_to_extract_to+'/Equity-Evaluation-Corpus.csv')

In [3]:

bias_df.head()

Unnamed: 0,ID,Sentence,Template,Person,Gender,Race,Emotion,Emotion word
0,2018-En-mystery-05498,Alonzo feels angry.,<person subject> feels <emotion word>.,Alonzo,male,African-American,anger,angry
1,2018-En-mystery-11722,Alonzo feels furious.,<person subject> feels <emotion word>.,Alonzo,male,African-American,anger,furious
2,2018-En-mystery-11364,Alonzo feels irritated.,<person subject> feels <emotion word>.,Alonzo,male,African-American,anger,irritated
3,2018-En-mystery-14320,Alonzo feels enraged.,<person subject> feels <emotion word>.,Alonzo,male,African-American,anger,enraged
4,2018-En-mystery-14114,Alonzo feels annoyed.,<person subject> feels <emotion word>.,Alonzo,male,African-American,anger,annoyed


The dataset contains short sentences that each express a simple sentiment, while the person mentioned varies so that the name or noun phrase connotes a different gender or ethnicity. This allows you to test whether your classifier treats these groups differently.
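
One way to see this design is to count the sentences per group. The sketch below uses a tiny hand-made dataframe with the same column names as the corpus; the rows are made up for illustration, and on the real data you would pass <code>bias_df</code> instead.

```python
import pandas as pd

# Toy rows mimicking the corpus layout; the values are illustrative only.
toy_df = pd.DataFrame({
    'Sentence': ['Alonzo feels angry.', 'Adam feels angry.',
                 'Nichelle feels angry.', 'Amanda feels angry.'],
    'Gender': ['male', 'male', 'female', 'female'],
    'Race': ['African-American', 'European', 'African-American', 'European'],
    'Emotion': ['anger', 'anger', 'anger', 'anger'],
})

# Count sentences for each combination of gender and race.
group_counts = pd.crosstab(toy_df['Gender'], toy_df['Race'])
print(group_counts)
```

If the groups are balanced, differences in the classifier output between groups cannot be explained by one group simply having more sentences.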

Today we will test two types of classifiers.

- A baseline classifier that you train yourself on a given dataset:
    - either fastText,
    - or NBLOG (Naive Bayes features fed into a logistic regression).
    
- The DeepMoji classifier.


## Setting up the DeepMoji encoder
First we shall see what biases the [DeepMoji](https://arxiv.org/pdf/1708.00524.pdf) encoder has out of the box.

In this way we get to practice loading and interacting with a pretrained model.

DeepMoji was originally implemented in [Keras](https://github.com/bfelbo/DeepMoji), but since you are used to PyTorch we shall use the [TorchMoji](https://github.com/huggingface/torchMoji) implementation.

Loading it is straightforward using git.

In [4]:
## clone the repository
#! git clone https://github.com/huggingface/torchMoji.git

In [5]:
## download the pretrained model's weights using their script
## (run it through python: executing the file directly fails with "Permission denied" when it is not marked executable)
! python torchMoji/scripts/download_weights.py


In [10]:
# navigate to the torchmoji folder
import os
#os.chdir('torchMoji')
## install dependencies
! pip install -e .


Obtaining file:///home/kristian/Documents/github/tsds/material/13_text3/torchMoji
Collecting emoji==0.4.5 (from torchmoji==1.0)
  Downloading https://files.pythonhosted.org/packages/7e/0c/c3d24c913986271484fe85446a158ab7b5ff068daa5c2e0ba8793116eed6/emoji-0.4.5.tar.gz
Collecting numpy==1.13.1 (from torchmoji==1.0)
  Downloading https://files.pythonhosted.org/packages/59/e2/57c1a6af4ff0ac095dd68b12bf07771813dbf401faf1b97f5fc0cb963647/numpy-1.13.1-cp36-cp36m-manylinux1_x86_64.whl (17.0MB)
Collecting scipy==0.19.1 (from torchmoji==1.0)
  Downloading https://files.pythonhosted.org/packages/0e/46/da8d7166102d29695330f7c0b912955498542988542c0d2ae3ea0389c68d/scipy-0.19.1-cp36-cp36m-manylinux1_x86_64.whl (48.2MB)
Collecting scikit-learn==0.19.0 (from torchmoji==1.0)

If you have already downloaded torchMoji elsewhere, add that directory to <code>sys.path</code> so Python can import it.

In [8]:
# add to sys.path
import sys
base_path = '' # change if you have downloaded folder elsewhere.
#base_path = '/mnt/b0c8e396-e5ba-4614-be6f-146c4c861ab3/torchMoji/' ## path to the torchmoji directory
#sys.path.insert(0, base_path)


In [6]:
## Load model and tokenizer
from torchmoji.sentence_tokenizer import SentenceTokenizer
# load the DeepMoji model that maps text to emoji probabilities.
from torchmoji.model_def import torchmoji_emojis
from torchmoji.global_variables import PRETRAINED_PATH, VOCAB_PATH
import json,csv, numpy as np
import warnings; warnings.simplefilter('ignore')


## set the max context length
max_token = 30 ## This will not work for longer texts,
################# here you should consider splitting each text into smaller segments.

# Load vocab (i.e. the index of each word in the vector representation)
with open(VOCAB_PATH, 'r') as f:
    vocabulary = json.load(f)

# initialize tokenizer
sentence_tokenizer = SentenceTokenizer(vocabulary, max_token)
# load model
model = torchmoji_emojis(PRETRAINED_PATH)

The model outputs a vector of length 64 holding the predicted probability of each of 64 emojis.

We can map each index to an emoji using the descriptions stored in the data folder.
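
To make the output format concrete, the sketch below ranks a probability vector and returns the most likely labels. It uses a made-up vector with four entries; the helper name <code>top_k</code> and the toy values are assumptions for illustration, not part of torchMoji.

```python
def top_k(prob_vector, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, prob_vector), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical 4-emoji example; the real model returns 64 probabilities per text.
toy_labels = [':joy:', ':cry:', ':rage:', ':heart:']
toy_probs = [0.1, 0.2, 0.6, 0.1]

top_two = top_k(toy_probs, toy_labels, k=2)
print(top_two)
```

The same helper works on one row of the model output together with the list of emoji descriptions.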


In [9]:
with open(base_path+'data/emoji_codes.json') as f:
    emoji_desc = json.load(f)
list(emoji_desc.items())[0:10]

FileNotFoundError: [Errno 2] No such file or directory: 'data/emoji_codes.json'

We now use this index and the emoji package to translate each index to an emoji.

In [139]:
import emoji
def translate_emoji(emoji_descr):
    if emoji_descr in emoji.unicode_codes.EMOJI_ALIAS_UNICODE:
        return emoji.unicode_codes.EMOJI_ALIAS_UNICODE[emoji_descr]
    if emoji_descr in emoji.unicode_codes.EMOJI_UNICODE:
        return emoji.unicode_codes.EMOJI_UNICODE[emoji_descr]
    return emoji_descr
to_emoji = [translate_emoji(desc) for i,desc in sorted(emoji_desc.items(),key=lambda x: int(x[0]))]
to_emoji_desc = [desc for i,desc in sorted(emoji_desc.items(),key=lambda x: int(x[0]))]

## index 
to_emoji[0],to_emoji_desc[0]

('😂', ':joy:')

We are now ready to encode the text as emojis

**Note that we are not using it for transfer learning**, but simply as a pretrained classifier.


### Exercise 13.1.1
Use the <code>sentence_tokenizer</code> defined above to tokenize the documents.

See the [encode_texts.py example](https://github.com/huggingface/torchMoji/blob/master/examples/encode_texts.py) in the torchMoji examples folder for help.

Inspect the tokenized documents to see the format. Try to convert them back to words using the <code>vocabulary</code> variable defined earlier.

**- Hint: this means reversing the vocabulary dictionary.**
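
Reversing a dictionary can be sketched as follows, assuming the vocabulary maps each word to a unique integer id. The toy vocabulary here is made up; the real one is loaded from <code>VOCAB_PATH</code>.

```python
# Toy vocabulary mapping words to integer ids (illustrative values only).
toy_vocabulary = {'CUSTOM_MASK': 0, 'feels': 1, 'angry': 2, '.': 3}

# Reverse the mapping so that ids point back to words.
id_to_word = {index: word for word, index in toy_vocabulary.items()}

toy_tokenized = [1, 2, 3, 0, 0]  # a sentence padded with the id 0
decoded = [id_to_word[i] for i in toy_tokenized]
print(decoded)
```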


In [None]:
# [Answer to ex. 13.1.1. here]

In [141]:
docs = bias_df.Sentence.values
%time tokenized, _, _ = sentence_tokenizer.tokenize_sentences(docs)

CPU times: user 1.09 s, sys: 4 ms, total: 1.09 s
Wall time: 1.09 s


In [144]:
tokenized[0]

array([33306,  1459,  1740,    11,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0], dtype=uint16)

In [150]:
vocab = sorted(vocabulary,key=lambda x: vocabulary[x])
[vocab[i] for i in tokenized[0]]

['alonzo',
 'feels',
 'angry',
 '.',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK',
 'CUSTOM_MASK']

### Exercise 13.1.2
Encode the tokenized sentences and wrap the encoding in a function.
- Hint: Do a forward pass of the model on the tokenized data.
    - check [here](https://github.com/huggingface/torchMoji/blob/master/examples/encode_texts.py) for help 

For larger datasets and longer sentences, encoding everything in one pass becomes problematic (e.g. in terms of memory), so it should be done in batches. 

Write a for loop that takes only 256 tokenized documents at a time and concatenates the results into a dataframe at the end.

Use the <code>to_emoji</code> list as columns.
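
The batching idea can be sketched on its own, independent of the model. The snippet below is a minimal illustration, assuming a fixed batch size of 256; <code>iter_batches</code> and <code>fake_model</code> are hypothetical names, and <code>fake_model</code> only stands in for the real forward pass.

```python
def iter_batches(items, batch_size=256):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_model(batch):
    # Hypothetical stand-in for the forward pass: one output row per document.
    return [[len(doc)] for doc in batch]

documents = [[1, 2, 3]] * 600  # 600 toy "tokenized documents"
outputs = []
for batch in iter_batches(documents, batch_size=256):
    outputs.extend(fake_model(batch))

batch_sizes = [len(b) for b in iter_batches(documents, batch_size=256)]
print(batch_sizes)
```

Stepping through the data with <code>range(0, len(items), batch_size)</code> handles the last, smaller batch without a separate special case.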


In [None]:
# [Answer to ex 13.1.2 here]

In [169]:
import tqdm
import pandas as pd

n_batch = 256                   # number of batches (note: not the batch size)
bs = len(tokenized)//n_batch    # documents per batch
enc = []

def emoji_encode(tokenized,to_df=False):
    # forward pass: one row of 64 emoji probabilities per document
    probs = model(tokenized)
    if to_df:
        return pd.DataFrame(probs,columns=to_emoji_desc)
    return probs

for i in tqdm.tqdm(list(range(n_batch))):
    batch = tokenized[i*bs:(i+1)*bs]
    enc.append(emoji_encode(batch,to_df=True))
# encode the documents left over after the last full batch, if any
batch = tokenized[n_batch*bs:]
if len(batch)>0:
    enc.append(emoji_encode(batch,to_df=True))

100%|██████████| 256/256 [00:12<00:00, 22.83it/s]


In [229]:
emoji_df = pd.concat(enc, ignore_index=True)  # fresh 0..n-1 index so rows line up with bias_df
emoji_df.columns = to_emoji

### Exercise 13.1.3
- Join the output of DeepMoji with the bias dataframe columns (Race, Gender and Emotion).
    - Make sure the Race and Gender counts are the same as before the join.

Investigate whether there are significant differences in relation to **Race** (Race column).

- See which emojis differ the most.


- Make a dictionary mapping emojis to different classes, e.g. happy, sad, angry.
- HINT: The fastest way is to loop through the list of emojis <code>to_emoji</code> and use <code>input()</code> to enter the class.
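
As a rough sketch of how the class dictionary and the join fit together, the example below sums emoji probabilities per class and joins the result with a metadata frame on the row index. All values and the <code>emoji_class</code> mapping are made up for illustration.

```python
import pandas as pd

# Hypothetical emoji-to-class mapping; in the exercise you build this yourself.
emoji_class = {':joy:': 'happy', ':heart:': 'happy', ':cry:': 'sad', ':rage:': 'angry'}

# Toy emoji probabilities for three sentences (each row sums to 1).
toy_emoji_df = pd.DataFrame(
    [[0.1, 0.1, 0.2, 0.6],
     [0.5, 0.3, 0.1, 0.1],
     [0.2, 0.2, 0.5, 0.1]],
    columns=[':joy:', ':heart:', ':cry:', ':rage:'],
)
toy_meta = pd.DataFrame({'Race': ['African-American', 'European', 'European']})

# Sum the probabilities of all emojis that belong to the same class.
class_df = toy_emoji_df.T.groupby(emoji_class).sum().T

# Join on the row index; this only lines up if both frames share the same index.
joined = toy_meta.join(class_df)
assert len(joined) == len(toy_meta)  # the join must not add or drop rows
print(joined.groupby('Race')[['angry', 'happy', 'sad']].mean())
```

Checking the row count right after the join is a cheap way to catch a misaligned or duplicated index.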
        
    

In [None]:
# [Answer to ex. 13.1.3 here]

In [None]:
# [This question is in assignment 4]

### Exercise 13.1.4 - See which emotions are most biased

- Group by Emotion and Race and calculate the absolute difference in emoji encoding. 
    - Hint: first group by emotion and race, calculate the mean, then take the difference between races, then the absolute value, and then the sum.


In [None]:
# [Answer to ex. 13.1.4 here]

In [299]:
emoji_df.groupby(['Emotion','Race']).mean().groupby('Emotion').diff().apply(abs).sum(axis=1)

Emotion  Race            
anger    African-American    0.000000
         European            0.096118
fear     African-American    0.000000
         European            0.105285
joy      African-American    0.000000
         European            0.092333
sadness  African-American    0.000000
         European            0.086262
dtype: float32

In [None]:
## Encode 
def deepmoji_encode(tokenized):
    ## let the forward pass end before the softmax layer.
    model.feature_output = True
    ### Do forward pass: returns the feature vectors, not emoji probabilities.
    features = model(tokenized)
    ## set the model back to default emoji output.
    model.feature_output = False
    return features


## Hostility and Minority Dataset (Kaggle Toxicity Classification)
**Context**
All outcome and minority variables are crowdsourced annotations, expressed as the fraction of annotators marking the category. This means that to create categorical outcomes we should apply a cutoff. The dataset provider suggests 0.5. 
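
Applying the cutoff can be sketched as below. The toy values are made up, and treating a fraction exactly at the cutoff as positive (<code>>=</code>) is an assumption you may want to change.

```python
import numpy as np
import pandas as pd

cutoff = 0.5  # threshold suggested by the dataset provider

# Toy annotation fractions; NaN means the comment has no annotation for that column.
toy_df = pd.DataFrame({
    'target': [0.0, 0.893617, 0.5, 0.3],
    'female': [np.nan, 0.0, 1.0, 0.6],
})

# A fraction at or above the cutoff becomes True; NaN compares as False.
binary_df = toy_df >= cutoff
print(binary_df)
```

Note that missing annotations silently become False here, so keep the original columns if you need to tell "not annotated" apart from "annotated as absent".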

**Ex.13.2.1:** 
- Define a variable <code>minority_cols</code> as a list of column names of the minorities.
- Define a variable <code>outcome_cols</code> as a list of column names of the outcomes.
- Create a categorical version of each variable in the outcome cols and minority cols.


**Ex. 13.2.2:** The dataset is fairly large, so subsampling is a good idea (e.g. 25000 samples), with minorities upsampled. Draw a subsample of 1000 for each minority, including a none category. 

**Ex. 13.2.3:** Train one of the baseline classifiers (see lecture 13 for the bag-of-words based one and fastText) on the hostility and minority dataset.

**Ex.13.2.4:** Investigate biases in relation to minority groups.
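
One possible way to investigate such biases is to compare error rates inside and outside each group. The sketch below computes accuracy and the false positive rate (non-toxic comments flagged as toxic) on made-up predictions; the column names and the helper <code>group_metrics</code> are assumptions for illustration.

```python
import pandas as pd

# Toy test set: true label, predicted label and one binarized identity column.
toy_test = pd.DataFrame({
    'y_true': [0, 0, 1, 1, 0, 0, 1, 0],
    'y_pred': [0, 1, 1, 0, 0, 0, 1, 1],
    'muslim': [True, True, True, True, False, False, False, False],
})

def group_metrics(df, group_col):
    """Accuracy and false positive rate inside and outside a group."""
    rows = {}
    for in_group, part in df.groupby(group_col):
        negatives = part[part['y_true'] == 0]
        rows['in group' if in_group else 'rest'] = {
            'n': len(part),
            'accuracy': (part['y_true'] == part['y_pred']).mean(),
            'false_positive_rate': (negatives['y_pred'] == 1).mean(),
        }
    return pd.DataFrame(rows).T

metrics = group_metrics(toy_test, 'muslim')
print(metrics)
```

A higher false positive rate inside a group than outside it would suggest that merely mentioning the group pushes the classifier towards the toxic label.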


In [None]:
# [Answers to ex 13.2.x here]

## Bow baseline

In [370]:
import requests
with open('get_bow_baseline.py','w') as f:
    f.write(requests.get('https://raw.githubusercontent.com/snorreralund/test_tokenization/master/get_bow_baseline.py').text)
import get_bow_baseline

In [4]:
toxicity_df = pd.read_csv('/home/snorre/Dropbox/Forskning/PhD/undervisning/train.csv')

In [5]:
toxicity_df.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


In [7]:
toxicity_df.columns

Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count'],
      dtype='object')

In [10]:
minority_columns = ['asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white'
                    #, 'other_disability',
       #'other_gender', 'other_race_or_ethnicity', 'other_religion',
       #'other_sexual_orientation',
                   ]
toxicity_df[minority_columns].sum()

asian                                   4846.792005
atheist                                 1298.449450
bisexual                                 763.380519
black                                  13933.484260
buddhist                                 571.434540
christian                              38595.950842
female                                 51723.057378
heterosexual                            1311.421852
hindu                                    590.427372
homosexual_gay_or_lesbian              10375.613491
intellectual_or_learning_disability      440.822098
jewish                                  7236.672289
latino                                  2482.062856
male                                   44032.257970
muslim                                 20037.556177
physical_disability                      549.380996
psychiatric_or_mental_illness           4895.233197
transgender                             2723.960538
white                                  23072.311802
dtype: float64

In [25]:
toxicity_df[minority_columns].head()

Unnamed: 0,asian,atheist,bisexual,black,buddhist,christian,female,heterosexual,hindu,homosexual_gay_or_lesbian,intellectual_or_learning_disability,jewish,latino,male,muslim,physical_disability,psychiatric_or_mental_illness,transgender,white
0,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
## downsample the data
samples = []
for i in minority_columns:
    sub = toxicity_df[toxicity_df[i]>0.5]
    samples.append(sub.sample(min(len(sub),2500)))
# "none" category: comments without any identity annotation (all minority columns are NaN)
samples.append(toxicity_df[toxicity_df[minority_columns].isna().all(axis=1)].sample(10000))
subsample = pd.concat(samples)

In [48]:
subsample[minority_columns].sum(axis=0)

asian                                  2592.598986
atheist                                1276.641091
bisexual                                365.102540
black                                  4353.551327
buddhist                                665.379257
christian                              5829.283233
female                                 6547.743259
heterosexual                           1199.917006
hindu                                   657.067270
homosexual_gay_or_lesbian              4178.218788
intellectual_or_learning_disability      77.428542
jewish                                 3346.776099
latino                                 1634.154292
male                                   6490.994750
muslim                                 4595.055830
physical_disability                     113.655244
psychiatric_or_mental_illness          2284.221359
transgender                            2393.904706
white                                  5083.138263
dtype: float64

In [310]:
#(toxicity_df[outcome_columns]>0).sum(axis=1).value_counts()

array(['0.0', '0.0', '0.3', ..., '0.2', '0.732394366197183', '0.0'],
      dtype=object)

In [32]:
#toxicity_df[outcome_columns].astype(str)
#(toxicity_df[outcome_columns]>0.5).sum(axis=1).value_counts()

In [336]:
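import numpy as np
# outcome_columns is assumed to hold the list of outcome column names from Ex. 13.2.1
outcome_columns = np.array(outcome_columns)
cutoff = 0.5
# binary toxicity label: as a string for fastText ('y') and as an integer for the BoW baseline ('y2')
subsample['y'] = ((subsample['target']>=cutoff)*1).astype(str)
subsample['y2'] = ((subsample['target']>=cutoff)*1)
#subsample['y'] = subsample[outcome_columns].apply(lambda x: list(outcome_columns[x>=cutoff]) if sum(x>=cutoff)>0 else ['none'],axis=1)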

In [305]:
subsample = subsample.reset_index(drop=True)

In [314]:
# make train test split

p = 0.5 # split 50-50 because we are equally interested in the test
n = int(len(subsample)*p)
idx = np.random.permutation(np.arange(len(subsample)))
train,test = idx[0:n],idx[n:]

In [337]:
train_df = subsample.iloc[train]
test_df = subsample.iloc[test]

In [335]:
train_df.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count,y
22241,5366022,0.0,This man who is the supposed Governor can not ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,approved,0,0,0,6,1,0.0,10,4,0
10163,6312161,0.0,Easy enough to google. \n\n\nOnce all the wome...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,approved,0,0,0,2,0,0.0,4,4,0
1206,5188598,0.0,Two pictures seen in G&M articles about campus...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,approved,0,0,0,5,1,0.0,4,4,0
27386,356884,0.0,And there is recourse to those harmed by the h...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,approved,0,0,0,3,0,0.0,4,4,0
27863,5690588,0.0,A guy from my hometown served in the military ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,approved,0,0,0,0,0,0.0,4,4,0


In [None]:
! cp /home/snorre/Dropbox/Forskning/PhD/logbog/get_bow_baseline.py ./
%load_ext autoreload
%autoreload 2


In [355]:
%aimport get_bow_baseline
import get_bow_baseline
import nltk
tokenizer = nltk.tokenize.TweetTokenizer().tokenize
baseline = get_bow_baseline.TokenizationTest(train_df,test_df,text_col='comment_text',y_col='y2',MAX_EVALS=50)

In [356]:
baseline.evaluate('nltk_tweet',tokenizer)

100%|██████████| 50/50 [03:18<00:00,  4.59s/it, best loss: 0.27868571792003527]
Final accuracy and roc_auc score of tokenizer (nltk_tweet) + nb_log: 0.862 and 0.811


In [360]:
test_df['true_pred'] = baseline.clf.predict(baseline.x_test)==baseline.y_test

In [369]:
for minority in minority_columns:
    idx = test_df[minority]>=0.5
    
    print(minority,sum(idx),sum(test_df[idx]['true_pred'])/sum(idx))
    

asian 1504 0.18617021276595744
atheist 765 0.1738562091503268
bisexual 152 0.21052631578947367
black 2363 0.32035548032162503
buddhist 401 0.09725685785536159
christian 3236 0.1572929542645241
female 3447 0.20336524514070206
heterosexual 729 0.24828532235939643
hindu 370 0.10810810810810811
homosexual_gay_or_lesbian 2338 0.3109495295124038
intellectual_or_learning_disability 30 0.13333333333333333
jewish 1890 0.20317460317460317
latino 925 0.2064864864864865
male 3490 0.22578796561604583
muslim 2516 0.2484101748807631
physical_disability 43 0.09302325581395349
psychiatric_or_mental_illness 1354 0.2651403249630724
transgender 1284 0.24065420560747663
white 2840 0.2992957746478873


## Fasttext baseline

In [316]:
import nltk
# Fasttext needs a format where each label is in the beginning of the row and marked with :
## __label__{name}
def get_labels(val):
    if type(val)==str:
        return '__label__%s'%val
    else:
        labels = []
        for i in val:
            labels.append('__label__%s'%i)
        return ' '.join(labels)
def make_fasttext_format(df,y_col,text_col,outfile,tokenizer=nltk.tokenize.TweetTokenizer().tokenize):
    docs = df[text_col].values
    # tokenize
    tokenized = [' '.join(tokenizer(doc)) for doc in docs]
    # lower
    tokenized = [doc.lower().replace('\n',' __newline__ ') for doc in tokenized]
    if type(df[y_col].values[0])==str:
        fasttext_labels = ['__label__%s'%val for val in df[y_col]]
    else:
        fasttext_labels = [get_labels(vals) for vals in df[y_col]]
    fast_docs = [' '.join([fasttext_labels[i],tokenized[i]]) for i in range(len(df))]
    with open(outfile,'w') as f:
        f.write('\n'.join(fast_docs))

y_col =  'y' 
text_col = 'comment_text'
trainfile = 'hostility.train'
testfile = 'hostility.test'
make_fasttext_format(train_df,
                     y_col=y_col,
                     text_col=text_col
                     ,outfile=trainfile)

make_fasttext_format(test_df,
                     y_col=y_col,
                     text_col=text_col
                     ,outfile=testfile)
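
For reference, a single line in the resulting file has the label prefix first and the text after it. The sketch below builds one such line by hand; it skips the tokenization step, and the helper name <code>to_fasttext_line</code> is hypothetical.

```python
def to_fasttext_line(label, text):
    """Build one training line: the prefixed label followed by the lowercased text."""
    cleaned = text.lower().replace('\n', ' __newline__ ')
    return '__label__%s %s' % (label, cleaned)

line = to_fasttext_line('1', 'You guys are\nLOSERS')
print(line)
```

Each document must end up on exactly one line, which is why line breaks inside a comment are replaced with a placeholder token.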

## Fit the model and test it

In [317]:
fast_path = '/mnt/b0c8e396-e5ba-4614-be6f-146c4c861ab3/fastText-0.2.0'
! {fast_path}/./fasttext supervised -input hostility.train -output model_fast -lr 0.5 -epoch 100 -wordNgrams 2 -dim 10 -ws 5
#! {fast_path}/./fasttext supervised -input fasttext.train -output model_fast -lr 0.5 -epoch 200 -wordNgrams 3 -dim 100 -ws 5
! {fast_path}/./fasttext test model_fast.bin hostility.test

Read 1M words
Number of words:  48311
Number of labels: 2
Progress: 100.0% words/sec/thread: 1935159 lr:  0.000000 loss:  0.024559 ETA:   0h 0m
N	21058
P@1	0.856
R@1	0.856


In [100]:
nb_classes = len(outcome_columns)+1

In [110]:
! {fast_path}/./fasttext predict-prob model_fast.bin hostility.test {nb_classes}  > predict_proba.txt

In [113]:
! {fast_path}/./fasttext predict model_fast.bin hostility.test 1  > predict.txt

In [115]:
! tail predict.txt

__label__none
__label__none
__label__none
__label__none
__label__insult
__label__identity_attack
__label__none
__label__none
__label__insult
__label__none


In [117]:
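def read_fasttext_row(row):
    # split a prediction row into its labels and mark each predicted label with 1
    labels = [label.strip() for label in row.split('__label__')[1:]]
    return dict(zip(labels,[1 for i in range(len(labels))]))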


In [118]:
def read_fasttext_row_prob(row):
    preds = {}
    for pair in row.split('__label__')[1:]:
        i,j = pair.split()
        preds[i.strip()] = float(j.strip())
        
    return preds
    
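def load_fasttext_proba(predict_file):
    with open(predict_file,'r') as f:
        # skip empty lines (e.g. after a trailing newline) so they do not become all-NaN rows
        l = [row for row in f.read().split('\n') if row.strip()]
    return pd.DataFrame([read_fasttext_row_prob(i) for i in l])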
#! tail predict_proba.txt

In [119]:
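def load_fasttext_predict(predict_file):
    with open(predict_file,'r') as f:
        # skip empty lines (e.g. after a trailing newline) so they do not become all-NaN rows
        l = [row for row in f.read().split('\n') if row.strip()]
    return pd.DataFrame([read_fasttext_row(i) for i in l])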


In [121]:
predict = load_fasttext_predict('predict.txt')
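
To check the parsing logic without running fastText, you can feed the parser a hand-written row. The row below is made up and simply follows the label-then-probability layout that the functions above assume; <code>parse_prob_row</code> is a standalone copy of that logic.

```python
def parse_prob_row(row):
    """Turn '__label__a 0.9 __label__b 0.1' into {'a': 0.9, 'b': 0.1}."""
    preds = {}
    for pair in row.split('__label__')[1:]:
        label, prob = pair.split()
        preds[label] = float(prob)
    return preds

# Hypothetical output row in the assumed layout.
parsed = parse_prob_row('__label__0 0.91 __label__1 0.09')
predicted = max(parsed, key=parsed.get)  # label with the highest probability
print(parsed, predicted)
```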

## Bonus 
So far we have not done any real transfer learning. For this exercise you should visit some of the major models and investigate how to adapt the model to your own dataset.

BERT - [COLAB Example](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)

DeepMoji:
Try the DeepMoji finetuning example [here](https://colab.research.google.com/drive/1IsV5a_tr2c5OVdKnX_PyGjoRWW8DUG0S)
- HINT: Inspect the load_benchmark data to see how to make your own dataset conform.

In [None]:
# [Answer to bonus question here]