# Toxic Comment Classification using Natural Language Processing

# Data Overview

Source - [Toxic Comment Classification Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.

File descriptions:
- train.csv - the training set, contains comments with their binary labels
- test.csv - the test set; you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
- sample_submission.csv - a sample submission file in the correct format

# Importing Libraries

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import keras
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

# Loading Data

In [8]:
train = pd.read_csv('../input/toxicity/train.csv')
test = pd.read_csv('../input/toxicity/test.csv')

In [9]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


# Data Pre-Processing

In [10]:
# Number of columns that contain at least one missing value
print(train.isnull().any().sum())
print(test.isnull().any().sum())

0
0


In [11]:
class_list = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[class_list].values
train_sentences = train["comment_text"]
test_sentences = test["comment_text"]

In [12]:
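max_features = 20000  # vocabulary size cap passed to the tokenizer
tokenizer = Tokenizer(num_words=max_features)
# Build the word index from the training comments only
tokenizer.fit_on_texts(list(train_sentences))
# Convert each comment into a list of integer word indices
train_tokenized = tokenizer.texts_to_sequences(train_sentences)
test_tokenized = tokenizer.texts_to_sequences(test_sentences)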
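
The tokenization step above can be illustrated without Keras. The following is a simplified sketch, not the Keras implementation: it assumes words are lowercased, split on whitespace, ranked by descending frequency starting at index 1, and that only words whose index is below `num_words` survive. The helper names `fit_word_index` and `texts_to_sequences` and the toy sentences are hypothetical, and punctuation filtering is omitted.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by descending frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each known word by its index; drop unknown and rare words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

# Toy corpus (hypothetical, not taken from train.csv)
toy_texts = ["you are my hero", "you are not my hero", "you rock"]
word_index = fit_word_index(toy_texts)

# With num_words=4 only indices 1..3 are kept, so "hero" (index 4) is dropped
sequences = texts_to_sequences(["you are a hero"], word_index, num_words=4)
print(word_index)
print(sequences)
```

Under these assumptions, words outside the capped vocabulary are silently dropped rather than mapped to a placeholder, so a comment made up only of rare words would become an empty sequence.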

In [13]:
max_length = 200
# Pad or truncate every comment to exactly max_length token indices
X_train = pad_sequences(train_tokenized, maxlen=max_length)
X_test = pad_sequences(test_tokenized, maxlen=max_length)
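
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings, assuming the defaults are zero-padding at the front and truncation from the front. The function name `pad_pre` is hypothetical and `maxlen` is assumed to be positive.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded

# One short and one long toy sequence, padded to length 4
toy_padded = pad_pre([[5, 3], [7, 1, 4, 9, 2, 8]], maxlen=4)
print(toy_padded)
```

With front truncation, a comment longer than `max_length` keeps its final tokens and loses its opening ones.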

# Modelling

In [14]:
inp = Input(shape=(max_length, ))

In [15]:
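embed_size = 128
# Map each token index to a trainable 128-dimensional vector
x = Embedding(max_features, embed_size)(inp)
# Return the hidden state at every timestep so it can be pooled
x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
# Keep the maximum over time for each of the 60 LSTM features
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
# One independent sigmoid per toxicity label (multi-label output)
x = Dense(6, activation="sigmoid")(x)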
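
The pooling layer in this stack reduces the LSTM output from one vector per timestep to a single vector per comment. A minimal sketch of that reduction in plain Python follows; the function name `global_max_pool_1d` and the toy values are hypothetical.

```python
def global_max_pool_1d(sequence_outputs):
    """Take the maximum over timesteps for each feature.

    sequence_outputs: list of timesteps, each a list of feature values.
    """
    return [max(feature_over_time) for feature_over_time in zip(*sequence_outputs)]

# Three timesteps, two features each (toy values)
toy_outputs = [
    [0.1, 0.9],
    [0.5, 0.2],
    [0.3, 0.4],
]
pooled = global_max_pool_1d(toy_outputs)
print(pooled)
```

In the model above the same operation would turn a `(200, 60)` sequence into a vector of 60 values, regardless of where in the comment each maximum occurred.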

In [16]:
final_model = Model(inputs=inp, outputs=x)
final_model.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])
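
Because each of the six outputs is an independent sigmoid, binary cross-entropy treats every label as its own yes/no problem and averages the per-label losses. The sketch below computes that average for a single comment; the function name, the clipping constant `eps`, and the example values are illustrative assumptions rather than the exact Keras internals.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the labels of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# One comment labeled obscene and insult (hypothetical predictions)
labels = [0, 0, 1, 0, 1, 0]
predictions = [0.1, 0.1, 0.8, 0.1, 0.6, 0.1]
loss = binary_crossentropy(labels, predictions)
print(loss)
```

A prediction of 0.5 for every label would give a loss of `log(2)`, which is a useful reference point when reading the training log.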

In [18]:
batch_size = 32
epochs = 2
final_model.fit(X_train, y, batch_size=batch_size, epochs=epochs, validation_split=0.1)
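
As a rough guide to what `validation_split=0.1` and `batch_size=32` imply, the sketch below computes the split sizes and the number of batches per epoch. It assumes, as the Keras documentation describes, that the validation samples are taken from the end of the arrays before shuffling. The function name `split_sizes` and the sample count of 1000 are hypothetical; the real number of training rows is not stated in this notebook.

```python
import math

def split_sizes(n_samples, validation_split, batch_size):
    """Return (training samples, validation samples, batches per epoch)."""
    n_train = int(math.floor(n_samples * (1.0 - validation_split)))
    n_val = n_samples - n_train  # held out from the end of the data
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch

# Hypothetical dataset of 1000 comments
n_train, n_val, steps = split_sizes(1000, validation_split=0.1, batch_size=32)
print(n_train, n_val, steps)
```

If the rows of `train.csv` were ordered in some meaningful way, a split taken from the tail might not be representative, so shuffling the data before calling `fit` could be worth considering.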

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f93cc69dd50>

# Results

In [19]:
final_model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 128)          2560000   
_________________________________________________________________
lstm_layer (LSTM)            (None, 200, 60)           45360     
_________________________________________________________________
global_max_pooling1d (Global (None, 60)                0         
_________________________________________________________________
dropout (Dropout)            (None, 60)                0         
_________________________________________________________________
dense (Dense)                (None, 50)                3050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0
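
The parameter counts in the summary can be reproduced by hand from the layer sizes. The helper functions below are a hypothetical sketch using the standard formulas for embedding, LSTM, and dense layers with biases.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

print(embedding_params(20000, 128))  # embedding layer
print(lstm_params(128, 60))          # lstm_layer
print(dense_params(60, 50))          # first dense layer
```

These reproduce the 2560000, 45360, and 3050 shown above. The final `Dense(6)` layer does not appear in the summary as printed here; the same formula would give `dense_params(50, 6)`, that is 306 parameters.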

In [None]:
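# Predicted probabilities, one column per label in class_list
y_pred = final_model.predict(X_test)

sample_submission = pd.read_csv("../input/sample-toxic/sample_submission.csv")

# Assumes sample_submission rows are in the same order as test.csv
sample_submission[class_list] = y_pred

sample_submission.to_csv("toxicity.csv", index=False)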
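
The notebook tracks accuracy during training, but with labels as rare as `threat` a model can score high accuracy while ranking comments poorly. To the best of my knowledge the Kaggle competition scores submissions by mean column-wise ROC AUC, so a held-out check along those lines may be more informative. The sketch below is a pure-Python illustration with hypothetical function names and toy data; it is not part of the original notebook, and in practice a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count as half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_columnwise_auc(y_true_rows, y_pred_rows):
    """Average the per-label AUC over all label columns."""
    true_cols = list(zip(*y_true_rows))
    pred_cols = list(zip(*y_pred_rows))
    aucs = [roc_auc(t, p) for t, p in zip(true_cols, pred_cols)]
    return sum(aucs) / len(aucs)

# Toy validation set: 4 comments, 2 labels (hypothetical values)
y_true_toy = [[0, 1], [0, 0], [1, 0], [1, 1]]
y_pred_toy = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.3], [0.8, 0.7]]
score = mean_columnwise_auc(y_true_toy, y_pred_toy)
print(score)
```

The pairwise count is quadratic in the number of comments, so this form is only suitable for small validation samples.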