# ConvNet Training
This notebook trains a CNN for text classification and generates predictions for the Kaggle competition [Quora Insincere Questions Classification](https://www.kaggle.com/c/quora-insincere-questions-classification).

The notebook uses Keras to tokenize and pad the questions and GloVe word embeddings to initialize the embedding layer. Keras with a TensorFlow backend is then used to train the CNN. Feel free to fork!

### Acknowledgements
* [This blog post](https://richliao.github.io/supervised/classification/2016/11/26/textclassifier-convolutional/) for the CNN starter code
* [This notebook](https://www.kaggle.com/yekenot/2dcnn-textclassifier) for the F1 Score calculation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from keras.callbacks import Callback
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, Embedding, Dropout
from keras.layers import Conv2D, MaxPool2D, Flatten, Reshape, Concatenate
from keras.models import Model

%load_ext autoreload
%autoreload 2

Using TensorFlow backend.


In [4]:
# Load in training and testing data
train_df = pd.read_csv('./input/train.csv')
train_df.head()

|   | qid | question_text | target |
|---|-----|---------------|--------|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


In [5]:
# Extract the training data and corresponding labels
text = train_df['question_text'].fillna('unk').values
labels = train_df['target'].values

# Split into training and validation sets by making use of the scikit-learn
# function train_test_split
X_train, X_val, y_train, y_val = train_test_split(text, labels,
                                                  test_size=0.2)

In [6]:
embed_size = 300  # Size of each word vector
max_words = 50000  # Number of unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100  # Maximum number of words to use from each question

In [7]:
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(list(X_train))

# The tokenizer will assign an integer value to each word in the dictionary
# and then convert each string of words into a list of integer values
X_train = tokenizer.texts_to_sequences(X_train)
X_val = tokenizer.texts_to_sequences(X_val)

word_index = tokenizer.word_index
print('The word index consists of {} unique tokens.'.format(len(word_index)))

## Pad the sentences 
X_train = pad_sequences(X_train, maxlen=maxlen)
X_val = pad_sequences(X_val, maxlen=maxlen)

The word index consists of 196290 unique tokens.
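To make the tokenize-and-pad step concrete, here is a minimal pure-Python sketch of what happens to each question. It is a simplification, not the Keras implementation: it splits on whitespace only, and the helper names (`fit_word_index`, `texts_to_sequences`, `pad_pre`) and the alphabetical tie-breaking are assumptions made for illustration. It does mirror three behaviours of the Keras defaults: indices start at 1 for the most frequent word, only indices below `num_words` are kept, and sequences are padded and truncated at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank, 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties are broken alphabetically here for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping unknown words and indices >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(seq, maxlen):
    """Left-pad with zeros and keep only the last maxlen entries."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["why does velocity affect time", "does time affect velocity", "why"]
toy_index = fit_word_index(toy_texts)
toy_seqs = texts_to_sequences(toy_texts, toy_index, num_words=4)
toy_padded = [pad_pre(s, maxlen=4) for s in toy_seqs]
print(toy_index)
print(toy_padded)
```

Note that a question made up entirely of out-of-vocabulary words becomes a row of zeros, so the network sees only padding for it.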


In [9]:
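# Create the embedding layer weight matrix.
# NOTE: embedding_dict (a mapping from word to its GloVe vector) is not defined
# in the cells shown here; it must be built before this cell runs.
embed_mat = np.zeros(shape=[max_words, embed_size])
for word, idx in word_index.items():
    # Word index is ordered from most frequent to least frequent,
    # so indices at or above max_words belong to rarer words that we skip
    if idx >= max_words:
        continue
    vector = embedding_dict.get(word)
    if vector is not None:
        embed_mat[idx] = vector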
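The cell above reads from `embedding_dict`, which none of the cells shown here define. The following is a sketch of one way such a dictionary could be built from a GloVe text file, where each line holds a token followed by its vector components. The function name `load_glove` and the file path in the commented usage are assumptions, not taken from the notebook.

```python
import io
import numpy as np

def load_glove(lines, embed_size=300):
    """Parse GloVe text lines ('word v1 v2 ... vN') into a {word: vector} dict."""
    embedding_dict = {}
    for line in lines:
        # Split from the right so tokens that contain spaces stay intact
        parts = line.rstrip().rsplit(' ', embed_size)
        if len(parts) != embed_size + 1:
            continue  # skip blank or malformed lines
        embedding_dict[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embedding_dict

# Tiny in-memory stand-in for the real file, using 3-dimensional vectors
toy_file = io.StringIO("the 0.1 0.2 0.3\nquora -0.5 0.0 0.5\n")
toy_dict = load_glove(toy_file, embed_size=3)
print(toy_dict['quora'])

# Hypothetical usage against a downloaded file (the path is an assumption):
# with open('glove.840B.300d.txt', encoding='utf8') as f:
#     embedding_dict = load_glove(f)
```

Words that are missing from the dictionary keep an all-zero row in `embed_mat`, which is the behaviour of the `if vector is not None` check.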

In [10]:
def create_cnn(filter_sizes, num_filters):
    
    sequence_input = Input(shape=(maxlen,), dtype='int32')
    x = Embedding(max_words, embed_size, weights=[embed_mat], trainable=False)(sequence_input)
    x = Reshape((maxlen, embed_size, 1))(x)
    
    conv_layers = []
    maxpool_layers = []
    for i in range(len(filter_sizes)):
        # Each branch uses its own filter height and pools over the full conv output
        conv_layers.append(Conv2D(num_filters, kernel_size=(filter_sizes[i], embed_size),
                                 kernel_initializer='he_normal', activation='relu')(x))
        maxpool_layers.append(MaxPool2D(pool_size=(maxlen - filter_sizes[i] + 1, 1))(conv_layers[i]))
    
    
    z = Concatenate(axis=1)(maxpool_layers)   
    z = Flatten()(z)
    z = Dropout(0.1)(z)
    z = Dense(64, activation='relu')(z)
    preds = Dense(1, activation='sigmoid')(z)

    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])
    model.summary()
    
    return model
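The shape bookkeeping in `create_cnn` can be checked by hand. The sketch below assumes the Keras defaults of `'valid'` padding and stride 1; the helper name `conv_branch_shapes` is made up for illustration. A filter of height `k` spanning the full embedding width slides over `maxlen - k + 1` positions, and pooling over all of them leaves one value per filter.

```python
def conv_branch_shapes(maxlen, embed_size, filter_size, num_filters):
    """Shapes and parameter count for one Conv2D + MaxPool2D branch ('valid' padding, stride 1)."""
    conv_out = (maxlen - filter_size + 1, 1, num_filters)
    pool_out = (1, 1, num_filters)  # the pool window spans the full conv height
    # One input channel: filter_size * embed_size weights plus one bias per filter
    params = (filter_size * embed_size * 1 + 1) * num_filters
    return conv_out, pool_out, params

for k in [2, 3, 5]:
    print(k, conv_branch_shapes(maxlen=100, embed_size=300, filter_size=k, num_filters=64))

# Flattened feature size after concatenating one pooled branch per filter size
flat_features = len([2, 3, 5]) * 64
print(flat_features)
```

As a sanity check, a filter height of 2 with 20 filters gives an output of `(99, 1, 20)` and 12020 parameters, which matches the `conv2d_5` row in the model summary printed further down.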

In [11]:
threshold = 0.5
class F1Evaluation(Callback):
    def __init__(self, validation_data=(), interval=1):
        super(F1Evaluation, self).__init__()

        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            y_pred = (y_pred > threshold).astype(int)
            score = f1_score(self.y_val, y_pred)
            print("\n F1 Score - epoch: %d - score: %.6f \n" % (epoch+1, score))
            
F1_Score = F1Evaluation(validation_data=(X_val, y_val), interval=1)
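For reference, the quantity the callback prints can be reproduced without scikit-learn. This is a NumPy sketch of binary F1 for the positive class after thresholding; the function name `f1_from_probs` and the toy data are assumptions, and the early return of 0.0 when there are no true positives is a choice made here for simplicity.

```python
import numpy as np

def f1_from_probs(y_true, y_prob, threshold=0.5):
    """Binary F1 for the positive class after thresholding predicted probabilities."""
    y_true = np.asarray(y_true).ravel()
    y_pred = (np.asarray(y_prob).ravel() > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_true = [0, 0, 1, 1, 0, 1]
toy_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]
print(f1_from_probs(toy_true, toy_prob))
```

Because F1 ignores true negatives, it is a more informative metric than accuracy when most questions belong to the negative class.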

In [13]:
# Create and train network
filter_sizes = [2,3,5]
num_filters = 64

cnn = create_cnn(filter_sizes, num_filters)
history = cnn.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=3, batch_size=512,
                  callbacks=[F1_Score])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 100, 300)     15000000    input_2[0][0]                    
__________________________________________________________________________________________________
reshape_2 (Reshape)             (None, 100, 300, 1)  0           embedding_2[0][0]                
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 99, 1, 20)    12020       reshape_2[0][0]                  
__________________________________________________________________________________________________
conv2d_6 (

Train on 1044897 samples, validate on 261225 samples
Epoch 1/3


    512/1044897 [..............................] - ETA: 2:44:24 - loss: 0.6931 - acc: 0.9297
   1024/1044897 [..............................] - ETA: 1:35:17 - loss: 0.6929 - acc: 0.9326
   1536/1044897 [..............................] - ETA: 1:11:56 - loss: 0.6927 - acc: 0.9336
   2048/1044897 [..............................] - ETA: 1:00:14 - loss: 0.6925 - acc: 0.9331
   2560/1044897 [..............................] - ETA: 53:13 - loss: 0.6923 - acc: 0.9332
   3072/1044897 [..............................] - ETA: 48:43 - loss: 0.6921 - acc: 0.9300
   3584/1044897 [..............................] - ETA: 45:23 - loss: 0.6919 - acc: 0.9328
   4096/1044897 [..............................] - ETA: 42:55 - loss: 0.6916 - acc: 0.9348
   4608/1044897 [..............................] - ETA: 40:56 - loss: 0.6914 - acc: 0.9340
   5120/1044897 [..............................] - ETA: 39:21 - loss: 0.6912 - acc: 0.9346
   5632/1044897 [..............................] - ETA: 38:03 - loss: 0.6910 - acc: 0.9341
   6144/1044897 [..............................] - ETA: 36:57 - loss: 0.6907 - acc: 0.9359
   6656/1044897 [..............................] - ETA: 36:03 - loss: 0.6905 - acc: 0.9370
   7168/1044897 [..............................] - ETA: 35:19 - loss: 0.6903 - acc: 0.9374
   7680/1044897 [..............................] - ETA: 35:04 - loss: 0.6901 - acc: 0.9375
   8192/1044897 [..............................] - ETA: 34:40 - loss: 0.6898 - acc: 0.9391
   8704/1044897 [..............................] - ETA: 34:15 - loss: 0.6896 - acc: 0.9391
   9216/1044897 [..............................] - ETA: 33:47 - loss: 0.6894 - acc: 0.9383
   9728/1044897 [..............................] - ETA: 33:24 - loss: 0.6892 - acc: 0.9380
  10240/1044897 [..............................] - ETA: 33:02 - loss: 0.6890 - acc: 0.9377
  10752/1044897 [..............................] - ETA: 32:39 - loss: 0.6888 - acc: 0.9370
  11264/1044897 [..............................] - ETA: 32:20 - loss: 0.6886 - acc: 0.9378
  11776/1044897 [..............................] - ETA: 32:13 - loss: 0.6883 - acc: 0.9375
  12288/1044897 [..............................] - ETA: 31:55 - loss: 0.6881 - acc: 0.9373
  12800/1044897 [..............................] - ETA: 31:54 - loss: 0.6879 - acc: 0.9372
  13312/1044897 [..............................] - ETA: 32:03 - loss: 0.6877 - acc: 0.9373
  13824/1044897 [..............................] - ETA: 31:57 - loss: 0.6875 - acc: 0.9371

KeyboardInterrupt: 

In [17]:
thresholds = np.arange(0.1, 1, 0.1)

# Predict once; the probabilities do not depend on the threshold
y_prob = cnn.predict(X_val, verbose=0)

best_thresh = None
best_score = 0.
for thresh in thresholds:
    y_pred = (y_prob > thresh).astype(int)
    score = f1_score(y_val, y_pred)
    print('F1 Score for threshold {}: {}'.format(thresh, score))
    # Keep the threshold with the highest F1 score seen so far
    if best_thresh is None or score > best_score:
        best_thresh = thresh
        best_score = score


F1 Score for threshold 0.1: 0.11535492435446984
F1 Score for threshold 0.2: 0.11535492435446984
F1 Score for threshold 0.30000000000000004: 0.11535492435446984
F1 Score for threshold 0.4: 0.11535492435446984
F1 Score for threshold 0.5: 0.0
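The scores above are identical for thresholds 0.1 through 0.4 and drop to zero at 0.5. One plausible reading, which the notebook itself does not confirm, is that after the interrupted training run every predicted probability lies between 0.4 and 0.5, so the low thresholds label every question insincere and the 0.5 threshold labels none. When everything is predicted positive, precision equals the positive rate p and recall is 1, giving F1 = 2p / (1 + p). The sketch below inverts that formula for the reported score; the function names are made up for illustration.

```python
import math

def all_positive_f1(pos_rate):
    """F1 when every example is predicted positive: precision = pos_rate, recall = 1."""
    return 2 * pos_rate / (1 + pos_rate)

def implied_pos_rate(f1):
    """Invert the formula above to recover the positive rate from an all-positive F1."""
    return f1 / (2 - f1)

reported_f1 = 0.11535492435446984  # value printed for thresholds 0.1 to 0.4
pos_rate = implied_pos_rate(reported_f1)
print(round(pos_rate, 4))

# Binary cross-entropy when every output is exactly 0.5, for comparison
# with the loss values in the training log
print(round(math.log(2), 4))
```

Under this reading, roughly 6% of the validation questions are insincere, which is in line with the training accuracy of about 0.93 to 0.94 that an always-negative prediction would produce. The thresholds are therefore not meaningful until the network has trained for longer.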


# 4. Predictions
The remainder of this notebook generates predictions for the test set and writes them to a submission CSV file for the Kaggle competition.

In [12]:
test_df = pd.read_csv('../input/test.csv')
X_test = test_df['question_text'].fillna('unk').values

# Perform the same preprocessing as was done on the training set
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=maxlen)

# Make predictions, ensure that predictions are in integer form
preds = np.rint(cnn.predict([X_test], batch_size=1024, verbose=1)).astype('int')
test_df['prediction'] = preds
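Rounding with `np.rint` amounts to a fixed cut-off at 0.5. If a different threshold scores better on the validation set, it can be applied directly instead. The sketch below contrasts the two on made-up probabilities; the threshold of 0.3 is an arbitrary illustrative value, not one the notebook derives.

```python
import numpy as np

# Shape (n, 1), like the output of model.predict for a single sigmoid unit
toy_probs = np.array([[0.05], [0.35], [0.5], [0.62], [0.97]])

# np.rint rounds to the nearest integer; exact halves go to the nearest even integer
rounded = np.rint(toy_probs).astype('int').ravel()

# Applying an explicit threshold instead (0.3 is an assumption for illustration)
tuned_threshold = 0.3
thresholded = (toy_probs > tuned_threshold).astype('int').ravel()

print(rounded)
print(thresholded)
```

Calling `ravel()` also flattens the `(n, 1)` prediction array to one dimension before it is stored as a DataFrame column.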



Let's examine a few questions predicted as sincere and a few predicted as insincere. Judging from this small sample, the network appears to be making meaningful predictions.

In [13]:
n=5
sin_sample = test_df.loc[test_df['prediction'] == 0]['question_text'].head(n)
print('Sincere Samples:')
for idx, row in enumerate(sin_sample):
    print('{}'.format(idx+1), row)

print('\n')
print('Insincere Samples:')
insin_sample = test_df.loc[test_df['prediction'] == 1]['question_text'].head(n)
for idx, row in enumerate(insin_sample):
    print('{}'.format(idx+1), row)

Sincere Samples:
1 My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type?
2 How much does a tutor earn in Bangalore?
3 What are the best made pocket knives under $200-300?
4 Why would they add a hypothetical scenario that’s impossible to happen in the link below? It shows what 800 meters rise in sea level would look like.
5 What is the dresscode for Techmahindra freshers?


Insincere Samples:
1 Are the BJP bhakts satisfied that prices of petrol and diesel were slashed by 1 paisa? Do you think criticizing this would be unpatriotic?
2 Why does Quora send me a notice because I told a guy from England that he wasn’t American so he shouldn’t worry about our gun laws?
3 Can a bleeding heart liberal be happily married to a militant Republican, when they fundamentally disagree on everything? I'm an optimistic feminist who believes in hope, and he's a die hard gun enthusiast who borders on misogyny and racism.
4 Why do these Sikhs

In [14]:
test_df = test_df.drop('question_text', axis=1)
test_df.to_csv('submission.csv', index=False)