# Pre-trained Vectors - Sentence - Toxic Comments

A corpus of manually labeled comments, each classified by its type of toxicity, is available on Kaggle. We aim to do a binary classification of whether a comment is toxic or not.

This notebook uses **pre-trained sentence embeddings** from spaCy to do the task.

In [23]:
import numpy as np
import pandas as pd
import keras
import matplotlib.pyplot as plt
%matplotlib inline
import vis

### Get the Data

Uncomment these shell lines to download the data.

In [24]:
#!wget http://bit.do/deep_toxic_train -P data/
#!mv data/deep_toxic_train data/train.zip

In [25]:
df = pd.read_csv("data/train.zip")

In [26]:
df.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


### Create the Input & Output Data

In [6]:
train_sentences = df["comment_text"]
train_sentences.head()

0    Explanation\nWhy the edits made under my usern...
1    D'aww! He matches this background colour I'm s...
2    Hey man, I'm really not trying to edit war. It...
3    "\nMore\nI can't make any real suggestions on ...
4    You, sir, are my hero. Any chance you remember...
Name: comment_text, dtype: object

In [28]:
# Select the binary `toxic` label by name rather than by position
labels = df["toxic"].values

In [29]:
from keras.utils import to_categorical
y = to_categorical(labels)
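
`to_categorical` turns integer class labels into one-hot rows, so a 0/1 label column becomes an array of shape `(n, 2)`. The same idea can be sketched in plain NumPy; the labels below are made-up toy values, not rows from the dataset.

```python
import numpy as np

# Toy labels standing in for the 0/1 `toxic` column (illustrative values only).
toy_labels = np.array([0, 1, 0, 0, 1])

# Indexing an identity matrix by label picks the matching one-hot row.
num_classes = 2
toy_one_hot = np.eye(num_classes, dtype="float32")[toy_labels]

print(toy_one_hot.shape)  # (5, 2)
print(toy_one_hot[1])     # [0. 1.]
```

Column 0 marks "not toxic" and column 1 marks "toxic", which is the order assumed in the rest of the notebook.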

### Creating the Sentence Embedding for Learning

In [30]:
import spacy
#!python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

In [31]:
doc = nlp("This is some text")
from spacy import displacy
displacy.render(doc, jupyter=True)

In [32]:
# Word Vector
doc[3].vector.shape

(300,)

In [33]:
# Sentence Vector
doc.vector.shape

(300,)
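
The word vector and the sentence vector have the same size because spaCy's `doc.vector` defaults to the average of the token vectors. Below is a rough sketch of that averaging with hypothetical 4-dimensional toy vectors (the real `en_core_web_md` vectors have 300 dimensions).

```python
import numpy as np

# Hypothetical 4-dimensional token vectors for a 3-token sentence.
toy_token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Averaging over the token axis gives one fixed-size vector per sentence,
# regardless of how many tokens the sentence has.
toy_sentence_vector = toy_token_vectors.mean(axis=0)

print(toy_sentence_vector)  # [2. 2. 1. 2.]
```

Because every comment maps to a vector of the same length, the result can be fed straight into a dense network without padding or truncation.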

In [34]:
def get_vector(sentence):
    vector = nlp(sentence).vector
    return vector

In [35]:
from tqdm import tqdm
# tqdm.pandas registers `progress_apply` on pandas objects
tqdm.pandas(desc="progress")

In [None]:
# Stack the per-comment vectors into a 2-D array of shape (n_comments, 300)
X = np.vstack(train_sentences.progress_apply(get_vector).tolist())

progress:  48%|████▊     | 75826/159571 [23:02<28:26, 49.07it/s]

In [None]:
X.shape
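
Applying a function that returns an array to a pandas Series yields a Series of arrays with shape `(n,)`, whereas a Keras `Dense` layer expects a 2-D array with one row per sample. Here is a small sketch with toy data; `toy_vector` is a hypothetical stand-in for `get_vector` and returns 3 dimensions instead of 300.

```python
import numpy as np
import pandas as pd

def toy_vector(sentence):
    # Hypothetical stand-in for get_vector: returns a fixed-size 3-d vector.
    return np.array(
        [len(sentence), sentence.count(" "), sentence.count("!")],
        dtype="float32",
    )

toy_sentences = pd.Series(["Hey man", "You, sir, are my hero!", "More"])

as_series = toy_sentences.apply(toy_vector)   # Series holding one array per row
as_matrix = np.vstack(as_series.tolist())     # 2-D array, one row per sentence

print(as_series.shape)  # (3,)
print(as_matrix.shape)  # (3, 3)
```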

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Step 2: Create the Model Architecture

In [34]:
from keras.models import Sequential
from keras.layers import Dense, Dropout

In [36]:
model = Sequential()
model.add(Dense(128, input_dim = 300, activation = "relu"))
model.add(Dropout(0.25))
model.add(Dense(64, activation = "relu"))
model.add(Dropout(0.25))
# Two output units, matching the two one-hot columns produced by to_categorical
model.add(Dense(2, activation="softmax"))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 300)          300000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 60000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                1920032   
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66        
Total params: 2,220,098
Trainable params: 2,220,098
Non-trainable params: 0
_________________________________________________________________
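
The `Param #` column can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. Below is a short sketch (the `dense_params` helper is illustrative, not part of Keras), applied to the Dense rows of the summary shown and to the hidden layers of the `Sequential` model defined in the code cell.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

# Dense rows of the summary shown above.
print(dense_params(60000, 32))  # 1920032
print(dense_params(32, 2))      # 66

# Hidden layers of the Sequential model in the code cell (300 -> 128 -> 64).
print(dense_params(300, 128))   # 38528
print(dense_params(128, 64))    # 8256
```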


### Step 3: Compile the Model & Fit on the Data

In [38]:
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
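
The `binary_crossentropy` loss penalises each output independently with `-[y log p + (1 - y) log(1 - p)]` and averages the result. Here is a NumPy sketch of that formula; it is an illustration on hypothetical toy values, not Keras's actual implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 so the logarithms stay finite.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Mean over all label entries of -[y log p + (1 - y) log(1 - p)].
    per_entry = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(per_entry))

toy_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets
toy_pred = np.array([[0.9, 0.1], [0.2, 0.8]])   # hypothetical model outputs

print(round(binary_crossentropy(toy_true, toy_pred), 4))
```

Confident, correct predictions drive the loss towards zero, while confident, wrong predictions are penalised heavily.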

In [None]:
output = model.fit(X_train, y_train, batch_size=128, epochs=5, validation_split=0.2)

### Step 4: Evaluate the Model

In [None]:
vis.metrics(output.history)

In [None]:
score = model.evaluate(X_test, y_test, verbose=1)

In [None]:
print('Test loss:', score[0])
print('Test accuracy:', score[1])

### Step 5: Visualise evaluation & Make a prediction

In [None]:
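# Sequential.predict_classes is not available in recent Keras releases;
# taking the argmax of the predicted scores gives the same class index
predict_classes = np.argmax(model.predict(X_test), axis=1)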

In [None]:
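# y_test is one-hot encoded, so convert it back to class indices first
pd.crosstab(y_test.argmax(axis=1), predict_classes, rownames=["actual"], colnames=["predicted"])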
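
`pd.crosstab` expects 1-D inputs, so one-hot labels have to be collapsed to class indices before they can be tabulated against the predictions. A minimal sketch with made-up toy labels and predictions:

```python
import numpy as np
import pandas as pd

# Toy one-hot labels and predicted class indices (illustrative values only).
toy_y_test = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]])
toy_predicted = np.array([0, 1, 1, 1, 0])

# argmax converts each one-hot row back to a class index (0 or 1).
toy_actual = toy_y_test.argmax(axis=1)

confusion = pd.crosstab(
    toy_actual, toy_predicted, rownames=["actual"], colnames=["predicted"]
)
print(confusion)
```

Rows are the true classes and columns the predicted ones, so the off-diagonal cells count the misclassified comments.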

In [102]:
doc = nlp("temperature is quite hot")
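
To classify a new sentence, its vector needs a leading batch axis before it is passed to the model. Below is one possible helper, assuming the model outputs one score per class in the order produced by `to_categorical` (column 0 = not toxic, column 1 = toxic); `predict_toxic` is a hypothetical name, not part of spaCy or Keras.

```python
import numpy as np

def predict_toxic(sentence, nlp, model):
    # Embed the sentence and add a batch axis: shape (1, vector_size).
    features = nlp(sentence).vector.reshape(1, -1)
    # model.predict returns one row of class scores per input row.
    scores = model.predict(features)
    # Index of the highest score: 0 = not toxic, 1 = toxic (assumed order).
    return int(np.argmax(scores, axis=1)[0])

# Usage with the objects defined in this notebook:
# predict_toxic("temperature is quite hot", nlp, model)
```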