## 0. Introduction
In earlier notebook(https://www.kaggle.com/anirbansen3027/jtcc-bag-of-words) we used CountVectorizer (an sklearn implementation of Bag-of-Words) model to convert the texts to a numerical dataset, mapped against the output variables "toxic","severe_toxic","obscene","threat","insult","identity_hate" and used Multi Output Classifier wrapper from sklearn to create Logistic Regression models for all the 6 output variables.

In this one, we will be replacing the first part with a Word2Vec model to create an embedding instead of the BagOfWords vector and then input that to a Logistic Regression Model (Any ML/DL model can be built on top of the Word2Vec embedding).

### Brief Intuition:
#### What is Word Embedding?
Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify).
<img src="https://www.researchgate.net/profile/Ali_Basirat/publication/327074728/figure/fig1/AS:678946643386368@1538884902625/A-two-dimensional-representation-of-word-embeddings-Words-with-similar-meanings-are.png" width="500">

Above is a 2-dimensional word embedding where Sunday has more similar values to other weekdays than members of a family

#### What is Word2Vec?
Word2Vec is one of the oldest methods to create/learn this embeddings. Word2Vec is not a singular algorithm, rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through Word2Vec have proven to be successful on a variety of downstream natural language processing tasks like Text Classification, Question Answering. The papers proposed two methods for learning representations of words:

**Continuous Bag-of-Words Model** which predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.

**Continuous Skip-gram Model** which predict words within a certain range before and after the current word in the same sentence.

Architecture Diagrams:
<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Word2Vec-Training-Models.png" width="500">

Examples:
![](https://1.bp.blogspot.com/-Vz5pLuZ49K8/XV0ErlMtdDI/AAAAAAAAB0A/FIM74z__LAUkCqpW12ViAnGX8Br56W2PQCEwYBhgL/s1600/image001.png)
In CBOW, given the words (the quick brown box, over the lazy log), we would want to predict jump.
In Skipgram just the opposite given the word jump, we would want to predict (the quick brown box, over the lazy log)

#### But how does the models learn?
I tried to make an image as that would be easier to grasp. It might look scary at first but I will try my best to explain.
Lets's start with CBOW, we take the sentence "Natural Language Processing" where "Natural" and "Processing" are context words and "Language" is the target word. We have a shallow network as shown above with a single hidden layer. 

So the input is a one-hot encoded vector of V terms, V being the size of vocabulary (total number of unique words) with only single 1. So let's say we have only 5 words in vocabulary (Natural, Language, Processing, is, great). The vector for Natural will be [1, 0, 0, 0, 0]. Similarly for Processing it will be [0, 0, 1, 0, 0].
Now, we have a randomly initialised Embedding vector(E) with size V * D where D is the dimension size of the vector which you can choose on. This is the weight matrix for the input layer.
So, we multiply the input one-hot encoded vector with the weights/embedding vector. This gives the embedding vectors for the context words of size 1 * D.

Now in the hidden layer, we average the emedding vectors for the context words which forms the input for this layer of size 1* D. This is multiplied by another Vector called Context Vector (E') with size D * V. This gives us a vector of 1 * V which is then passed through a sigmoid function to get the final output.

The final output is compared with the one-hot encoded vector of Language (the middle word) [0, 1, 0, 0, 0] and loss function is calculated. This loss is back propogated and the model is trained using Gradient Descent

The final o
<img src="https://i.imgur.com/JsCPzSX.png" title="source: imgur.com" width="1000"/>

#### How will be get the embeddings?
Gensim library enables us to develop word embeddings.Gensim gives you an option to choose the either CBOW or Skipgram while training your own embeddings.(Default is CBOW). Along with it, Gensim also has a directory of pretrained embeddings which are trained on several documents like wiki pages, google news, twitter tweets etc. In this example, we will be using a pretrained embedding based on Google News corpus (3 billion running words) word vector model (3 million 300-dimension English word vectors). 

Ok enough of definitions. Let's dive into the code

### Table of Contents:
[1. Importing Libraries](#1)

[2. Reading Dataset](#2)

[3. Basic preprocessing](#3)

[4. Load pretrained embeddings](#4)

[5. Convert text inputs to embeddings using pretrained models](#5)

[6. Train and Validate a Multi-Output Classifier](#6)

[7. Predicting and Submitting for Test Data](#7)

[8. TODOs](#8)

**N.B.: I haven't covered Logistic Regression and Feature Importance/ Model Intrepretation in this notebook as I have covered it in the last notebook : https://www.kaggle.com/anirbansen3027/jtcc-bag-of-words**

## 1. Importing Libraries <a class="anchor" id="1"></a>

In [1]:
!pip install ipython-autotime 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from gensim.models import Word2Vec, KeyedVectors

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from tqdm.notebook import tqdm

from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from statistics import mean
%load_ext autotime

Collecting ipython-autotime
  Downloading ipython_autotime-0.3.0-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.0
You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.[0m
time: 347 µs (started: 2020-12-30 21:51:00 +00:00)


## 2.Reading Dataset <a class="anchor" id="2"></a>
All the datasets are provided as zipped files. First we will have to unzip them and then read them into dataframes

In [2]:
#unzipping all the zip folders and saving it /kaggle/working and saving the verbose in /dev/null to keep it quiet
# -o for overwrite -d for destination directory of unzipped file
!unzip -o '/kaggle/input/jigsaw-toxic-comment-classification-challenge/*.zip' -d /kaggle/working > /dev/null
#Reading input csv files
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")
print(df_train.shape, df_test.shape, sample_submission.shape)
df_train.head()


4 archives were successfully processed.
(159571, 8) (153164, 2) (153164, 7)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


time: 4.47 s (started: 2020-12-30 21:51:00 +00:00)


In [3]:
train_texts = list(df_train["comment_text"].values)
train_labels = df_train[['toxic', 'severe_toxic', 'obscene', 'threat','insult', 'identity_hate']].to_numpy()
test_texts = list(df_test["comment_text"].values)
print("Example Training Text:\n\n",train_texts[0])

Example Training Text:

 Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
time: 33.6 ms (started: 2020-12-30 21:51:04 +00:00)


We have around 160k training texts and about 153k test texts

## 3. Basic preprocessing <a class="anchor" id="3"></a>
In this case, we remove stopwords and digits, lowercase all the texts and tokenize(break into individual tokens/words) the texts using word_tokenize from NLTK library

In [4]:
def preprocess_corpus(texts):
    #importing stop words like in, the, of so that these can be removed from texts
    #as these words dont help in determining the classes(Whether a sentence is toxic or not)
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        #Nested function that lowercases, removes stopwords and digits from a list of tokens
        return [token.lower() for token in tokens if token not in mystopwords and not token.isdigit()
               and token not in punctuation]
    #This return statement below uses the above function and tokenizes output further. 
    return [remove_stops_digits(word_tokenize(text)) for text in tqdm(texts)]

#Preprocess both for training and test data
train_texts_processed = preprocess_corpus(train_texts)
test_texts_processed = preprocess_corpus(test_texts)
print("Example Training Prepocessed Text\n\n", train_texts_processed[0])

HBox(children=(FloatProgress(value=0.0, max=159571.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=153164.0), HTML(value='')))


Example Training Prepocessed Text

 ['explanation', 'why', 'edits', 'made', 'username', 'hardcore', 'metallica', 'fan', 'reverted', 'they', "n't", 'vandalisms', 'closure', 'gas', 'i', 'voted', 'new', 'york', 'dolls', 'fac', 'and', 'please', "n't", 'remove', 'template', 'talk', 'page', 'since', 'i', "'m", 'retired', 'now.89.205.38.27']
time: 6min 9s (started: 2020-12-30 21:51:04 +00:00)


## 4. Load pretrained embeddings <a class="anchor" id="4"></a>
We use Gensim Library to load pretrained embeddings for words trained on Google News dataset. The Google New model/ embedding vector is of 300 dimension

In [5]:
#Path for the models/ embedding vector
google_news_model = '../input/gensim-embeddings-dataset/GoogleNews-vectors-negative300.gensim'
#Loading the models/ embedding vector using KeyedVectors.load function from gensim
w2v_google_news = KeyedVectors.load(google_news_model)
#Print lengths/number of words in the embedding
print(len(w2v_google_news.vocab))

3000000
time: 44.4 s (started: 2020-12-30 21:57:14 +00:00)


Goggle News model/ embedding vector has about 3 M words. Let's have a look at an example of an embedding which is essentialy a dictionary where the key is the word and value is the embedding vector for that word.

In [6]:
#Print Shape of the embedding
print("Shape of embedding vector", w2v_google_news["Natural"].shape)
#Let's print first 20 dimensions rather than all 300
print("First 20 numbers in the embedding of the word Natural\n\n", w2v_google_news["Natural"][:20])

Shape of embedding vector (300,)
First 20 numbers in the embedding of the word Natural

 [-0.22753906 -0.07617188 -0.06787109 -0.1015625   0.20214844  0.12890625
  0.1796875  -0.11035156  0.01123047  0.01794434  0.12402344  0.11132812
 -0.3359375  -0.01104736 -0.16015625 -0.16113281 -0.13769531  0.4296875
 -0.03979492  0.05297852]
time: 1.74 ms (started: 2020-12-30 21:57:58 +00:00)


This is how the embedding for the word "Natural" looks like.

## 5. Convert text inputs to embeddings using pretrained models <a class="anchor" id="5"></a>

Here we take the input tokenized texts from earlier and get the embeddings for each word in texts from the pretrained embedding vector. This will give us the final input dataset in form of an embedding per sentence which can be used to train along with the output variables.

In [7]:
#Function that takes in the input text dataset in form of list of lists where each sentence is a list of words all the sentences are 
#inside a list
def embedding_feats(list_of_lists, DIMENSION, w2v_model):
    zeros_vector = np.zeros(DIMENSION)
    feats = []
    missing = set()
    missing_sentences = set()
    #Traverse over each sentence
    for tokens in tqdm(list_of_lists):
        # Initially assign zeroes as the embedding vector for the sentence
        feat_for_this = zeros_vector
        #Count the number of words in the embedding for this sentence
        count_for_this = 0
        #Traverse over each word of a sentence
        for token in tokens:
            #Check if the word is in the embedding vector
            if token in w2v_model:
                #Add the vector of the word to vector for the sentence
                feat_for_this += w2v_model[token]
                count_for_this +=1
            #Else assign the missing word to missing set just to have a look at it
            else:
                missing.add(token)
        #If no words are found in the embedding for the sentence
        if count_for_this == 0:
            #Assign all zeroes vector for that sentence
            feats.append(feat_for_this)
            #Assign the missing sentence to missing_sentences just to have a look at it
            missing_sentences.add(' '.join(tokens))
        #Else take average of the values of the embedding for each word to get the embedding of the sentence
        else:
            feats.append(feat_for_this/count_for_this)
    return feats, missing, missing_sentences

time: 2.44 ms (started: 2020-12-30 21:57:58 +00:00)


In [8]:
#Embeddings for the train dataset
train_vectors, missing, missing_sentences = embedding_feats(train_texts_processed, 300, w2v_google_news)

HBox(children=(FloatProgress(value=0.0, max=159571.0), HTML(value='')))


time: 37.8 s (started: 2020-12-30 21:57:58 +00:00)


In [9]:
print("Shape of the final embeddings for the sentences", np.array(train_vectors).shape)
print("First 20 numbers in the embedding of the first train sentence\n\n", np.array(train_vectors)[0][:20])

Shape of the final embeddings for the sentences (159571, 300)
First 20 numbers in the embedding of the first train sentence

 [ 0.01282094  0.07879842  0.02510173  0.08058929 -0.05929769 -0.0129928
 -0.01726329 -0.09076614  0.07128092  0.0426658  -0.0269399  -0.1129715
 -0.05300395  0.01384761 -0.08168869  0.10279338  0.01549072  0.05962321
  0.03122152 -0.08815613]
time: 476 ms (started: 2020-12-30 21:58:36 +00:00)


To summarize, each sentence will have one 300 dimensional embedding vector which will be an average of the word embeddings present in that sentence. The word embeddings are taken from the pretrained word embeddings that was trained on google news to find the embedding.

## 6. Train and Validate a Multi-Output Classifier <a class="anchor" id="6"></a>
Since we need to classify each sentence as toxic or not, severe_toxic or not, obscene or not, threat or not, insult or not and identity_hate or not, we need to classify the sentence against 6 output variables (This is called Multi-Label Classification which is different from mult-class classification where a target variable has more than 2 options e.g. a sentence can be positive, negative and neutral)

For the same, we will be using MultiOutputClassifier from sklearn which as mentioned earlier is a wrapper.This strategy consists of fitting one classifier per target.
So, this segment will deal with 5 things

1. Getting the embedding vector for the training dataset
2. Split the embedding vector and output variables into train and validation set
3. Fit a Logistic Regression model on training embedding vector and output variables
*(I have covered Logistic Regression in the previous notebook https://www.kaggle.com/anirbansen3027/jtcc-bag-of-words)*
4. Make predictions on the validation embedding vectors
5. Measure performance in the terms of ROC-AUC

Since, the competition uses mean ROC-AUC as the evaluation metric, we will be using the same in the notebook. We will compare the mean ROC-AUC across all the 3 models we have trained. We will be using predict_proba function of models instead of predict which gives us the probability scores instead of predicted value based on a threshold of 0.5, as it is used by the roc_auc_measure.

In [10]:
#Function for calculating roc auc with given actual binary values across target variables
#and the probability score made by the model
def accuracy(y_test, y_pred):
    aucs = []
    #Calculate the ROC-AUC for each of the target column
    for col in range(y_test.shape[1]):
        aucs.append(roc_auc_score(y_test[:,col],y_pred[:,col]))
    return aucs

time: 1.17 ms (started: 2020-12-30 21:58:37 +00:00)


In [11]:
def train_model(DIMENSION, model):
    #Get the embedding vector for the training data
    train_vectors, missing, missing_sentences = embedding_feats(train_texts_processed, DIMENSION, model)
    
    #Split the embedding vector for the training data along with the output variables
    #into train and validation sets
    train_data, val_data, train_cats, val_cats = train_test_split(train_vectors, train_labels)
    
    #Logistic Regression Model (As we have unbalanced dataset, we use class_weight which will use inverse
    #of counts of that class. It penalizes mistakes in samples of class[i] with class_weight[i] instead of 1)
    lr = MultiOutputClassifier(LogisticRegression(class_weight='balanced', max_iter=3000)).fit(train_data, train_cats)
    
    #Actuals for the validation data
    y_vals = val_cats
    #Prediction probability for the validation dataset by the model for class 1
    y_preds = np.transpose(np.array(lr.predict_proba(val_data))[:,:,1])
    #Calculate the Mean ROC_AUC 
    mean_auc = mean(accuracy(y_vals,y_preds))
    return mean_auc, lr

time: 1.88 ms (started: 2020-12-30 21:58:37 +00:00)


In [12]:
mean_auc, lr = train_model(300, w2v_google_news)
print(mean_auc)

HBox(children=(FloatProgress(value=0.0, max=159571.0), HTML(value='')))


0.5712985193905036
time: 1min 16s (started: 2020-12-30 21:58:37 +00:00)


This model turns out to be pretty moderate. This might be because of the pretrained embeddings not correctly capturing the details. We could instead train an embedding of our own using Word2Vec.

## 7. Predicting and Submitting for Test Data <a class="anchor" id="7"></a>

In [13]:
# Merging the test dataset with sample_submission to have all the columns:
#id,text_data and the target variables in one dataframe
df_test = pd.merge(df_test, sample_submission, on = "id")
#Getting the embedding matrix for test texts 
test_vectors, _, _ = embedding_feats(test_texts_processed, 300, w2v_google_news)
#Use the Logistic Regression model to output probabilities and take the probability for class 1
y_preds = np.transpose(np.array(lr.predict_proba(test_vectors))[:,:,1])
#Assign the predictions by the model in the final test dataset
df_test[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]] = y_preds
#Drop Comment Text as the sample submission doesnt have it and wouldnt be expected
df_test.drop(["comment_text"], axis = 1, inplace = True)
#Save the dataset as a csv to submit it
df_test.to_csv("sample_submission.csv", index = False)

HBox(children=(FloatProgress(value=0.0, max=153164.0), HTML(value='')))


time: 38.8 s (started: 2020-12-30 21:59:54 +00:00)


## 8. TODOs <a class="anchor" id="8"></a>
1. Train a Word2Vec model from scratch 
2. Try ensemble models instead of Vanilla ML models Bagging and Boosting models give better results than classic ML techniques in most cases
3. Better Text Preprocessing Typo correction etc can be done to further improve the model

***Do upvote if you find it helpful 😁***