## 0. Introduction
In an [earlier notebook](https://www.kaggle.com/anirbansen3027/jtcc-word2vec) we used Gensim to get pretrained Word2Vec embedding vectors for the words in each sentence, mapped them against the output variables "toxic", "severe_toxic", "obscene", "threat", "insult" and "identity_hate", and used the Multi-Output Classifier wrapper from sklearn to fit a Logistic Regression model for each of the 6 output variables.

In this one, we will use the fastText library both to generate embeddings for the sentences and to classify the text. In fact, it gives us the option of doing both in one go. The intuition part is going to be shorter than for other methods, as the documentation and reading material on fastText are limited.

**What is fastText?**

In 2016, Facebook AI Research (FAIR) open-sourced fastText, a library designed to help build scalable solutions for text representation and classification. fastText takes the idea of word embeddings in Word2Vec a step further: it learns representations for character n-grams and represents each word as the sum of its n-gram vectors. Taking the word "where" and n = 3 as an example, the word is represented by the character n-grams `<wh`, `whe`, `her`, `ere`, `re>`, where `<` and `>` mark the start and end of the word. Apart from text representations in the form of sub-word embeddings, it also provides an off-the-shelf classification model that is optimized to work with these embeddings and give fast results.
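
A minimal sketch of how the n-grams in the "where" example can be extracted. `char_ngrams` is a hypothetical helper written for illustration, not part of the fastText API, and it handles a single value of n only.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, with < and > marking its boundaries."""
    padded = f"<{word}>"
    # Slide a window of width n over the padded word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("where", 3))
```

The boundary markers are what keep the n-gram `her` inside "where" distinct from the whole word "her", which would be padded to `<her>`.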

**Why is fastText required?**

fastText has 2 benefits over regular Word2Vec embeddings:

*1. fastText helps in dealing with the Out of Vocabulary (OOV) problem:*

Word2Vec suffers from the Out of Vocabulary problem. Let's say we train a Word2Vec model from scratch: its vocabulary consists of all the words in the training data. If the test data then contains a new word for which we need an embedding, that word is OOV. In the Word2Vec notebook we simply ignored such words. By using sub-word embeddings, fastText can still build an embedding for an OOV word from its character n-grams.

*2. By using a distinct vector representation for each word, the Word2Vec model ignores the internal structure of words:*

In Word2Vec, each word gets its own vector, learned only from the contexts it appears in. For example, "boxer" and "boxing" are used in different contexts, so nothing ties their vectors to the stem they share. Breaking words down into character n-grams helps, because words with common sub-strings then share some of their n-gram vectors.
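
Both benefits can be illustrated with a toy example. This is a sketch under stated assumptions: the helpers `char_ngrams`, `build_ngram_table` and `embed` are hypothetical, the vectors are random stand-ins for learned weights, and only n = 3 is used.

```python
import random


def char_ngrams(word, n=3):
    """Character n-grams of `word`, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_table(vocabulary, dim=4, seed=0):
    """Give every n-gram seen in the vocabulary a random toy vector (a stand-in for learned weights)."""
    rng = random.Random(seed)
    table = {}
    for word in vocabulary:
        for gram in char_ngrams(word):
            if gram not in table:
                table[gram] = [rng.uniform(-1, 1) for _ in range(dim)]
    return table


def embed(word, table, dim=4):
    """Sum the vectors of the n-grams we know; return None when none of them are known."""
    known = [table[gram] for gram in char_ngrams(word) if gram in table]
    if not known:
        return None
    return [sum(vec[k] for vec in known) for k in range(dim)]


vocabulary = ["boxer", "boxing"]
table = build_ngram_table(vocabulary)

# "boxer" and "boxing" share sub-word units, so part of their vectors is shared
shared = set(char_ngrams("boxer")) & set(char_ngrams("boxing"))
print(sorted(shared))

# "boxers" is not in the vocabulary, but four of its n-grams are
print(embed("boxers", table))
```

A word-level lookup table would have nothing to return for "boxers", while the n-gram table still produces a vector from the pieces it has already seen.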

**What are the additional benefits of fastText?**

Although fastText goes beyond the word level to the character n-gram level, it is extremely fast (hence the name). Experiments show that fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. It can be trained on more than one billion words in less than ten minutes using a standard multicore CPU, and can classify half a million sentences among 312K classes in less than a minute.

Additionally, fastText provides pretrained word vectors for 157 languages, trained on Common Crawl and Wikipedia (which is amazing).

**How does fastText work?**

*1. Creation of word embeddings*

The subword model is based on the skipgram model from Word2Vec, but instead of using one vector per word, it uses the sum of the vector representations of the word's character n-grams. Everything else is quite similar to the skipgram model.
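
For reference, the scoring function of the subword model can be written as below. The notation follows the fastText subword paper ("Enriching Word Vectors with Subword Information") rather than anything defined in this notebook: $\mathcal{G}_w$ is the set of character n-grams of word $w$, $\mathbf{z}_g$ is the vector of n-gram $g$, and $\mathbf{v}_c$ is the vector of the context word $c$.

$$
s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c
$$

Plain skipgram uses a single word vector $\mathbf{u}_w$ in place of the sum, i.e. $s(w, c) = \mathbf{u}_w^{\top} \mathbf{v}_c$.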

*2. Text Classification*
<img src="https://i.imgur.com/Gl1PiFO.png" title="source: imgur.com" width = 300/> 
The image above is taken from the paper "Bag of Tricks for Efficient Text Classification", where fastText for classification was introduced. fastText uses a shallow neural network similar to the Word2Vec networks: the word representations are averaged into a text representation, which is then fed to a linear classifier. The softmax function f computes the probability distribution over the predefined classes.
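
The architecture can be sketched in a few lines of plain Python. This is only an illustration of the idea (average the embeddings, apply a linear layer, take a softmax); the function names, the 2-dimensional embeddings and the weights are made up and have nothing to do with a trained fastText model.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, weights):
    """Average the token embeddings, apply a linear layer, then softmax over the classes."""
    dim = len(weights[0])
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        hidden = [sum(vec[k] for vec in known) / len(known) for k in range(dim)]
    else:
        hidden = [0.0] * dim
    # One score per class: dot product of that class's weight row with the hidden vector
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)


# Toy 2-dimensional embeddings and one weight row per class (made-up numbers)
embeddings = {"you": [0.1, 0.3], "are": [0.0, 0.2], "awful": [0.9, -0.4]}
weights = [[-1.0, 1.0],   # class 0
           [1.0, -1.0]]   # class 1

probs = classify("you are awful".split(), embeddings, weights)
print(probs)
```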

In fact, the paper uses a hierarchical softmax based on the Huffman coding tree, in which, in short, the most frequent classes get the shortest codes. Each node of the tree carries the probability of the path from the root to that node, so the probability of a node is always lower than that of its parent. This helps both when we have a large number of classes and at test time, when we are searching for the most likely class.
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20200921040227/treec.png" width = 300/> 
For example, if Travel, Food and Indian Cuisines are 3 classes, and the Travel branch already has a higher probability than the branch leading to Food and Indian Cuisines, we don't even need to calculate the probabilities for Food and Indian Cuisines. This reduces the complexity.
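
The pruning idea can be sketched as follows. The tree layout, the branch scores and the function names are assumptions made for this illustration; fastText builds its own tree from class frequencies and learns the branch scores during training.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# A leaf is a class name; an internal node is (score, left, right),
# where sigmoid(score) is the probability of taking the left branch.
tree = (1.5, "Travel", (0.2, "Food", "Indian Cuisines"))


def leaf_probabilities(node, prob=1.0):
    """Probability of every class: the product of branch probabilities along its path."""
    if isinstance(node, str):
        return {node: prob}
    score, left, right = node
    p_left = sigmoid(score)
    out = leaf_probabilities(left, prob * p_left)
    out.update(leaf_probabilities(right, prob * (1.0 - p_left)))
    return out


def most_likely(node, prob=1.0, best=(None, 0.0), visited=None):
    """Depth-first search that skips any branch already less probable than the best leaf."""
    if prob <= best[1]:
        return best  # no leaf below this node can beat the current best
    if isinstance(node, str):
        if visited is not None:
            visited.append(node)  # record which leaves were actually evaluated
        return (node, prob)
    score, left, right = node
    p_left = sigmoid(score)
    branches = [(prob * p_left, left), (prob * (1.0 - p_left), right)]
    branches.sort(key=lambda branch: branch[0], reverse=True)  # likelier branch first
    for branch_prob, child in branches:
        best = most_likely(child, branch_prob, best, visited)
    return best


visited = []
best_class, best_prob = most_likely(tree, visited=visited)
print(best_class, round(best_prob, 3), visited)
```

With these made-up scores the search reaches Travel first and never descends into the branch that holds Food and Indian Cuisines.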

Additionally, it uses word n-grams as extra features alongside the embeddings to capture some partial information about the local word order, which a plain Bag of Words ignores and which is otherwise computationally very expensive to capture.
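
As an illustration, this is what the word n-gram features look like for n = 2, which is the value passed as `wordNgrams=2` in the training call later in this notebook. `word_ngrams` is a hypothetical helper; fastText additionally hashes these n-grams into a fixed number of buckets, which is left out here.

```python
def word_ngrams(tokens, n=2):
    """Return the tokens plus every run of 2..n consecutive tokens joined by a space."""
    features = list(tokens)
    for size in range(2, n + 1):
        for i in range(len(tokens) - size + 1):
            features.append(" ".join(tokens[i:i + size]))
    return features


print(word_ngrams("you are my hero".split(), n=2))
```

The bigram "my hero" is a feature that a plain bag of the four single words cannot express.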

Let's dive into the code then 

### Table of Contents:

[1. Importing Libraries](#1)

[2. Reading Dataset](#2)

[3. Splitting Dataset into training and validation sets](#3)

[4. Basic Preprocessing](#4)

[5. Training and Validating fastText Classifier](#5)

[6. Predicting and Submitting for Test Data](#6)

[7. TODOs](#7)

## 1. Importing Libraries <a class="anchor" id="1"></a>

In [1]:
import pandas as pd
import numpy as np
from statistics import mean

from fasttext import train_supervised

#Sklearn Library
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from tqdm.notebook import tqdm

## 2. Reading Dataset <a class="anchor" id="2"></a>
All the datasets are provided as zipped files. First we have to unzip them and then read them into dataframes.

In [2]:
#Unzip all the zip files into /kaggle/working and send the output to /dev/null to keep it quiet
# -o overwrites existing files, -d sets the destination directory for the unzipped files
!unzip -o '/kaggle/input/jigsaw-toxic-comment-classification-challenge/*.zip' -d /kaggle/working > /dev/null


4 archives were successfully processed.


In [3]:
#Reading input csv files
train_text = pd.read_csv("train.csv")
test_text = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")

print(train_text.shape, test_text.shape, sample_submission.shape)
train_text.head()

(159571, 8) (153164, 2) (153164, 7)


| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


## 3. Splitting Dataset into training and validation sets <a class="anchor" id="3"></a>

In [4]:
y_cols = ["toxic","severe_toxic","obscene","threat","insult","identity_hate"]
X = train_text.comment_text
y = train_text[y_cols]

train, val = train_test_split(train_text, shuffle = True, random_state = 123)

## 4. Basic Preprocessing <a class="anchor" id="4"></a>

In terms of preprocessing we are doing the following steps:

1. We add spaces around some punctuation marks such as ?, ., ), (, ! and ', and remove others (commas, double quotes, colons and semicolons) to make the text cleaner. The text is also converted to lowercase

2. We replace `\n` with a space, as we can see a lot of it in the texts

3. We can apply unicode normalization in order to handle unicode issues that might arise (this is optional and switched off by default in `clean_df`)

4. This one being the most important: we convert the binary labels for all output variables from 0 and 1 to `__class__0` and `__class__1`, as the fastText classifier needs the labels to carry a prefix. We only need to do this for the training data, as the classifier will not look at the real labels of the validation data or the test data

5. Shuffling the dataset to introduce some randomness and remove ordering (if present)
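
The label conversion in step 4 boils down to producing lines of the form "prefixed label, space, text". A minimal sketch, assuming a hypothetical helper `to_fasttext_line`; fastText's default label prefix is `__label__`, while this notebook uses `__class__` and passes `label="__class__"` to `train_supervised`.

```python
def to_fasttext_line(label, text, label_prefix="__class__"):
    """Build one training line: the prefixed label, a space, then the text on a single line."""
    single_line = " ".join(str(text).split())  # collapse newlines and repeated whitespace
    return f"{label_prefix}{label} {single_line}"


print(to_fasttext_line(1, "You are\nawful"))
print(to_fasttext_line(0, "Thank you for your   answer"))
```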

In [5]:
import unicodedata

# Let's do some cleaning of this text
def clean_it(text,normalize=True):
    # Replacing possible issues with data. We can add or reduce the replacement in this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    s = s.replace("\n"," ")
    
    # normalizing / encoding the text: decompose unicode characters and drop anything that is not ASCII
    if normalize:
        s = unicodedata.normalize('NFKD', s).encode('ascii','ignore').decode('utf-8')
    
    return s

# Now lets define a small function where we can use above cleaning on datasets
def clean_df(data, cleanit= False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['comment_text']].copy(deep=True)
    for col in y_cols:
        df[col] = label_prefix + data[col].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['comment_text'] = df['comment_text'].apply(lambda x: clean_it(x,encodeit))
    
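    # shuffling it
    # NOTE: sample() returns a new dataframe and the result is not assigned here,
    # so the row order of df is left unchanged
    if shuffleit:
        df.sample(frac=1).reset_index(drop=True)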
            
    return df

# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(train, True, True)
df_val_cleaned = clean_df(val, True, True, label_prefix='')

In [6]:
df_train_cleaned.head()

| | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 141599 | a quote  many psychometricians and behavio... | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` |
| 86692 | and right at the time of making such allegati... | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` |
| 28863 | formally not de facto  yes .  and in the end ... | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` |
| 14048 | religious groups  why is the only religion is... | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` |
| 78005 | the doctor suggests it ' s probably something... | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` | `__class__0` |


In [7]:
df_val_cleaned.head()

| | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 50446 | and redirect the other names to it | 0 | 0 | 0 | 0 | 0 | 0 |
| 81571 | sinebot1 please read the above comments . | 0 | 0 | 0 | 0 | 0 | 0 |
| 25983 | thank you for your very good answer .  i just... | 0 | 0 | 0 | 0 | 0 | 0 |
| 39022 | i think we need to ask who is likely to be v... | 0 | 0 | 0 | 0 | 0 | 0 |
| 49431 | orangemonster2k1\|svrtvdude]] ( vt )  23 26  3... | 0 | 0 | 0 | 0 | 0 | 0 |


So, as mentioned earlier, we have changed the label classes for the training dataset but not for the validation set, as the validation labels are never shown to the fastText classifier.

## 5. Training and Validating fastText Classifier <a class="anchor" id="5"></a>

* Since the fastText classifier takes as input a file in which each line holds a class label followed by the text, we can't use the Multi-Output Classifier wrapper we used in earlier notebooks. So we run a for loop to train a separate model for each output variable and store the predictions for each output variable on the validation set.

* `train_supervised` is the function used for fastText classification. We can tune its learning parameters to improve the model.

* We need probabilities because the performance metric is ROC-AUC, so class predictions alone are not enough. Here we get them by calling `predict` on one sentence at a time: we run a for loop over the validation sentences and store the probabilities in a list.
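
One detail worth keeping in mind: for a single text, `predict` returns a pair of labels and probabilities, with the labels ordered from most to least probable, so a fixed position does not always belong to the same class. A hedged sketch of looking the probability up by label name instead; `positive_probability` is a hypothetical helper, and the two example pairs are hand-written stand-ins for what `model.predict(text, k=2)` returns, not real model output.

```python
def positive_probability(labels, probs, positive_label="__class__1"):
    """Return the probability attached to `positive_label`, or 0.0 if it was not returned."""
    return float(dict(zip(labels, probs)).get(positive_label, 0.0))


# Hand-written stand-ins for the (labels, probabilities) pair of a single text
clean_example = (("__class__0", "__class__1"), (0.97, 0.03))
toxic_example = (("__class__1", "__class__0"), (0.88, 0.12))

print(positive_probability(*clean_example))
print(positive_probability(*toxic_example))
```

In the second example the positive class is ranked first, so reading the entry at position 1 would return 0.12, the probability of `__class__0`.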

In [8]:
#Will contain all the predictions for validation set for all the output variables
all_preds = []
#Iterating over all output variables to create separate models
for col in tqdm(y_cols):
    #Path for saving the training dataset
    train_file = '/kaggle/working/final_train.csv'
    #Saving the Output Variable and the text data to a csv
    df_train_cleaned[[col, "comment_text"]].to_csv(train_file, header=None, index=False, columns=[col, "comment_text"]) 
    #Training the model
    model = train_supervised(input=train_file, label="__class__", lr=1.0, epoch=2, loss='ova', wordNgrams=2, dim=200, thread=2, verbose=100)
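    #Predictions for validation sets for that output variable
    col_preds = []
    #Iterating over each sentence in the validation set
    for text in df_val_cleaned["comment_text"].values:
        #Take the probability of the second-ranked label; predict returns labels sorted by
        #probability, so this is the class 1 probability only when __class__0 is ranked first
        pred = model.predict(text, k = 2)[1][1]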
        #Append the prediction to the list of predictions for that output variable
        col_preds.append(pred)
    #Append the list of predictions for a output variable to the overall set of predictions for all columns
    all_preds.append(col_preds)





Now that we have the predictions from the model on the validation set, let's look at the performance. Since the competition uses mean ROC-AUC as the evaluation metric, we will use the same in the notebook.

In [9]:
#Calculates the ROC-AUC of each target column from the actual binary labels and the predicted probabilities
#(despite its name, this returns a list of ROC-AUC values, not accuracy)
def accuracy(y_test, y_pred):
    aucs = []
    #Calculate the ROC-AUC for each of the target columns
    for col in range(y_test.shape[1]):
        aucs.append(roc_auc_score(y_test[:,col],y_pred[:,col]))
    return aucs

In [10]:
#Actual Labels
y_val_actuals = df_val_cleaned[y_cols].astype("int").to_numpy()
#Transpose the predictions so that rows are comments and columns are output variables
all_preds_array = np.transpose(np.array(all_preds))
#Calculate the mean of the ROC-AUC over the output variables
mean_auc = mean(accuracy(y_val_actuals,all_preds_array))
mean_auc

0.764607758647555

ROC-AUC on the validation set tends to be around 0.77, which is much better than the Word2Vec model. Also, these are early results with just 2 epochs and untuned parameters; I think we can get even better results with better tuning. The training and predictions were also pretty quick. The Logistic Regression model that we used with BagOfWords took more time to converge during training than fastText.

## 6. Predicting and Submitting for Test Data <a class="anchor" id="6"></a>

In [11]:
# Merging the test dataset with sample_submission to have all the columns:
#id,text_data and the target variables in one dataframe
df_test = pd.merge(test_text, sample_submission, on = "id")
# Preprocessing the test dataset as well (no shuffling, so the rows stay aligned with df_test)
df_test_cleaned = clean_df(df_test, True, False, label_prefix='')
#Path for saving the training dataset
train_file = '/kaggle/working/final_train.csv'
#Will contain all the predictions for test set for all the output variables
all_test_preds = []
for col in tqdm(y_cols):
    #Retrain the model for this output variable; otherwise `model` would still be
    #the last model trained in section 5 and every column would get its predictions
    df_train_cleaned[[col, "comment_text"]].to_csv(train_file, header=None, index=False, columns=[col, "comment_text"])
    model = train_supervised(input=train_file, label="__class__", lr=1.0, epoch=2, loss='ova', wordNgrams=2, dim=200, thread=2, verbose=100)
    #Predictions for test sets for that output variable
    col_preds = []
    #Iterating over each sentence in the test set
    for text in df_test_cleaned["comment_text"].values:
        #predict returns labels sorted by probability, so look up class 1 by name
        labels, probs = model.predict(text, k = 2)
        pred = dict(zip(labels, probs)).get("__class__1", 0.0)
        #Append the prediction to the list of predictions for that output variable
        col_preds.append(pred)
    #Append the list of predictions for a output variable to the overall set of predictions for all columns
    all_test_preds.append(col_preds)
#Transpose the predictions so that rows are comments and columns are output variables
all_test_preds_array = np.transpose(np.array(all_test_preds))
#Assign the predictions by the model in the final test dataset
df_test[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]] = all_test_preds_array
#Drop Comment Text as the sample submission doesn't have it and it wouldn't be expected
df_test.drop(["comment_text"], axis = 1, inplace = True)
#Save the dataset as a csv to submit it
df_test.to_csv("sample_submission.csv", index = False)





## 7. TODOs <a class="anchor" id="7"></a>
* Better text preprocessing (typo correction etc.) can be done to further improve the model
* Try tuning the hyperparameters to get better results

***Do upvote if you find it helpful 😁***