# RNN Training
This notebook trains an RNN for text classification and generates predictions for the Kaggle competition [Quora Insincere Questions Classification](https://www.kaggle.com/c/quora-insincere-questions-classification).

The notebook uses Keras to tokenize and pad the questions and GloVe to supply pretrained word embeddings. Keras with the TensorFlow backend is then used to train a deep RNN.

In [1]:
import pandas as pd
import numpy as np
import os

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
%load_ext autoreload
%autoreload 2

Using TensorFlow backend.


In [2]:
# Read in training and testing data
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')
train_df.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


In [3]:
# Print a few examples of insincere questions
toxic_df = train_df[train_df['target']==1]
for i in range(5):
    print(toxic_df.iloc[i]['question_text'])

Has the United States become the largest dictatorship in the world?
Which babies are more sweeter to their parents? Dark skin babies or light skin babies?
If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican?
I am gay boy and I love my cousin (boy). He is sexy, but I dont know what to do. He is hot, and I want to see his di**. What should I do?
Which races have the smallest penis?


In [4]:
# Check for rows containing null values (an empty result means there is nothing to drop)
train_df[train_df.isnull().any(axis=1)].shape

(0, 3)

In [5]:
# Extract the training data and corresponding labels
text = train_df['question_text'].values
labels = train_df['target'].values

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(text, labels,
                                                  test_size=0.2)
X_test = test_df['question_text']
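
The split above is random and unseeded, so the validation set changes on every run. If the classes turn out to be imbalanced, a seeded, stratified split keeps the label ratio the same in both parts. The sketch below uses toy data; the arrays `toy_text` and `toy_labels` and the seed value are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the question text and labels (10% positive class)
toy_text = np.array(["q{}".format(i) for i in range(100)])
toy_labels = np.array([1] * 10 + [0] * 90)

# stratify keeps the class ratio equal in both splits;
# random_state makes the split reproducible
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_text, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

print(y_tr.mean(), y_va.mean())
```

Whether stratification matters here depends on how skewed `target` is, which can be checked with `train_df['target'].value_counts()`.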

In [6]:
embed_size = 300 # how big is each word vector
max_words = 30000 # intended cap on unique words (i.e., number of rows in the embedding matrix)
# NOTE: max_words is not passed to Tokenizer below, so the full vocabulary is used
maxlen = 100 # max number of words in a question to use
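
One way to sanity-check a value such as `maxlen = 100` is to look at the distribution of question lengths. The sketch below is a rough illustration on toy strings: it counts whitespace-separated words, which only approximates what the Keras tokenizer produces, and `toy_questions` is a made-up stand-in for the real text column.

```python
import numpy as np

# Toy stand-in for train_df['question_text']
toy_questions = [
    "why is the sky blue",
    "how do planes fly so high in the air",
    "what",
]

# Approximate token counts by splitting on whitespace
lengths = np.array([len(q.split()) for q in toy_questions])

# High percentiles show how many questions a given maxlen would truncate
print("max length:", lengths.max())
print("99th percentile:", np.percentile(lengths, 99))
```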

In [7]:
## Tokenize the sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(X_train))
X_train = tokenizer.texts_to_sequences(X_train)
X_val = tokenizer.texts_to_sequences(X_val)
X_test = tokenizer.texts_to_sequences(X_test)

## Pad or truncate each sequence to maxlen tokens
X_train = pad_sequences(X_train, maxlen=maxlen)
X_val = pad_sequences(X_val, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
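
By default, `pad_sequences` pads with zeros at the front and truncates from the front, so short questions end up right-aligned. The NumPy sketch below mimics that default behaviour on toy input; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper mimicking 'pre' padding and 'pre' truncation."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # leave an all-padding row
        trunc = seq[-maxlen:]            # keep only the last maxlen tokens
        out[row, -len(trunc):] = trunc   # right-align; zeros stay on the left
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

The first toy row is left-padded to `[0, 0, 5, 6]`, and the second loses its first two tokens to become `[3, 4, 5, 6]`.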

In [8]:
word_index = tokenizer.word_index
print('The word index consists of {} unique tokens.'.format(len(word_index)))

The word index consists of 196342 unique tokens.


In [9]:
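# Load the pretrained GloVe vectors into a {token: vector} dictionary
embedding_dict = {}
glove_path = os.path.join('./embeddings/', 'glove.840B.300d/glove.840B.300d.txt')
with open(glove_path, encoding='utf-8') as f:
    for line in f:
        fields = line.split()
        token = fields[0]
        try:
            coefs = np.asarray(fields[1:], dtype='float32')
            embedding_dict[token] = coefs
        except ValueError:
            # Skip lines whose remaining fields cannot be parsed as floats
            pass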
print('The embedding dictionary has {} items'.format(len(embedding_dict)))

The embedding dictionary has 2195884 items
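
The loader above skips any line that fails to parse. One likely cause is a token that itself contains spaces, which `split()` breaks into several fields. A possible alternative, sketched below on made-up 3-dimensional lines, is to split from the right so that the last `dim` fields are always the vector. `parse_glove_line` is a hypothetical helper, and whether the real file contains such tokens is an assumption.

```python
import io
import numpy as np

def parse_glove_line(line, dim):
    """Hypothetical helper: treat the last `dim` fields as the vector."""
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        return None  # malformed line: too few fields
    token, values = parts[0], parts[1:]
    return token, np.asarray(values, dtype="float32")

# Toy file contents: the second token (". . .") contains spaces
sample = io.StringIO("hello 0.1 0.2 0.3\n. . . 0.4 0.5 0.6\n")

parsed = {}
for line in sample:
    result = parse_glove_line(line, dim=3)
    if result is not None:
        token, vector = result
        parsed[token] = vector

print(sorted(parsed))
```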


In [10]:
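# Build the embedding matrix: row idx holds the GloVe vector for the word with index idx.
# Row 0 (the padding index) and words missing from GloVe stay all-zero.
embed_mat = np.zeros(shape=[len(word_index)+1, embed_size])
for word, idx in word_index.items():
    vector = embedding_dict.get(word)
    if vector is not None:
        embed_mat[idx] = vector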
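
Words with no GloVe vector keep an all-zero row, so it can be useful to measure how much of the vocabulary is covered. The sketch below computes that fraction on toy dictionaries; `embedding_coverage`, `toy_index` and `toy_vectors` are illustrative names, not part of the notebook. Note that the Keras `Tokenizer` lowercases text by default while the 840B GloVe vocabulary is cased, so a token that GloVe stores only in capitalized form would count as missing.

```python
def embedding_coverage(word_index, embedding_dict):
    """Fraction of vocabulary words that have a pretrained vector."""
    found = sum(1 for word in word_index if word in embedding_dict)
    return found / max(len(word_index), 1)

# Toy vocabulary and vectors: "zzzq" has no vector, "The" never matches "the"
toy_index = {"the": 1, "cat": 2, "zzzq": 3}
toy_vectors = {"the": [0.1], "cat": [0.2], "The": [0.3]}

coverage = embedding_coverage(toy_index, toy_vectors)
print("Coverage: {:.1%}".format(coverage))
```

In the notebook this would be called as `embedding_coverage(word_index, embedding_dict)`.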

In [18]:
print(embed_mat.shape)
print(X_train[5])

(196343, 300)
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0    26   461   122
 33298   262    77    58     4   413     7   127  2057  3225    77    26
  1550     1    34  1550 64429    58   106  1136    13   176  5438   503
    10     3     4  4140]
