# ___Quora Insincere Questions Classification___

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep its platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

# Link: https://www.kaggle.com/c/quora-insincere-questions-classification

# Imports

In [1]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import os
import numpy as np 
import pandas as pd
#from tqdm.tqdm import tqdm
import math
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [2]:
from keras.models import Sequential
from keras.layers import Dense,Activation, Flatten
from keras.layers import LSTM
from keras.layers import Dropout

In [3]:
print(os.listdir())

['.ipynb_checkpoints', 'embeddings', 'LSTM.ipynb', 'LSTM_Code.rar', 'model.h5', 'model.json', 'sample_submission.csv', 'submission.csv', 'test.csv', 'train.csv']


# Load Training and Test Data

In [4]:
train_df  = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print("Train Data Shape: ", train_df.shape)
print("Test Data Shape: ", test_df.shape)

Train Data Shape:  (1306122, 3)
Test Data Shape:  (56370, 2)


In [5]:
train_df.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


# Train/Validation Split

In [6]:
train_df, val_df = train_test_split(train_df, test_size=0.1)

# Word Embedding using GloVe

GloVe (Global Vectors) is a count-based model that learns vectors for words from their co-occurrence information, i.e. how frequently they appear together in large text corpora.

Count-based models learn vectors by performing dimensionality reduction on a matrix of co-occurrence counts. First they construct a large matrix that records how frequently each “word” (stored in rows) is seen in some “context” (the columns). The number of “contexts” needs to be large, since it is essentially combinatorial in size. They then factorize this matrix to yield a lower-dimensional matrix of words and features, in which each row is the vector representation of one word. The factorization minimizes a “reconstruction loss”, which looks for lower-dimensional representations that can explain the variance in the high-dimensional data.

In the case of GloVe, the counts matrix is preprocessed by normalizing the counts and log-smoothing them. Compared to word2vec, GloVe allows for a parallel implementation, which makes it easier to train on more data. GloVe is believed to combine the benefits of the word2vec skip-gram model on word analogy tasks with those of matrix factorization methods that exploit global statistical information.
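The count-based recipe described above can be sketched on a toy corpus. This is only an illustration under stated assumptions: the three sentences, the window size, and the use of a truncated SVD as the factorization step are all made up for the example. GloVe itself fits a weighted least-squares objective rather than an SVD, so this is not GloVe's training procedure.

```python
import numpy as np

# Toy corpus; the sentences and the window size are illustrative assumptions.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]
window = 2

vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}

# Step 1: build the word-by-context co-occurrence count matrix.
counts = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
for line in corpus:
    tokens = line.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[word], index[tokens[j]]] += 1.0

# Step 2: log-smooth the counts (log(1 + x) keeps zero counts at zero).
smoothed = np.log1p(counts)

# Step 3: factorize and keep the top-k components as word vectors.
# Truncated SVD is a stand-in here, not GloVe's actual objective.
k = 3
u, s, _ = np.linalg.svd(smoothed, full_matrices=False)
word_vectors = u[:, :k] * s[:k]

print(word_vectors.shape)  # one k-dimensional row per vocabulary word
```

Each row of `word_vectors` is a low-dimensional representation of one word, which is the role the 300-dimensional pretrained GloVe vectors play in the rest of this notebook.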

Reference:

https://www.kdnuggets.com/2018/08/word-vectors-nlp-glove.html

https://nlp.stanford.edu/projects/glove/

# Word Embedding using GloVe: Dictionary of Words and Their Coefficients

In [7]:
EMBEDDING_FILE = "embeddings/glove.840B.300d/glove.840B.300d.txt"
embeddings_index = {} # Dictionary of word and its coefficients

In [8]:
# Each line holds a word followed by its 300 coefficients, separated by spaces.
with open(EMBEDDING_FILE, encoding="utf8") as f:
    for line in f:
        values = line.split(" ")
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 2196016 word vectors.
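Once the dictionary is loaded, it can be useful to check how many of the tokens in the questions actually have a vector. The helper below is a minimal sketch, not part of the original notebook: `vocab_coverage`, `toy_embeddings`, and `toy_questions` are hypothetical names, and the toy dictionary uses 3-dimensional vectors in place of the real 300-dimensional ones.

```python
import numpy as np

def vocab_coverage(texts, embeddings):
    """Return the fraction of distinct whitespace tokens that have a vector."""
    tokens = {tok for text in texts for tok in text.split()}
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in embeddings)
    return known / len(tokens)

# Hypothetical stand-in for embeddings_index (3-d vectors instead of 300-d).
toy_embeddings = {
    "How": np.ones(3, dtype="float32"),
    "does": np.ones(3, dtype="float32"),
    "time": np.ones(3, dtype="float32"),
}
toy_questions = ["How does velocity affect time?"]

coverage = vocab_coverage(toy_questions, toy_embeddings)
print(coverage)  # "time?" misses because the "?" is still attached
```

With the real data, the same call would be `vocab_coverage(train_df["question_text"], embeddings_index)`. Tokens with attached punctuation count as unknown because the text is only split on whitespace.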


# Convert values to embeddings

In [10]:
# Convert values to embeddings
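def text_to_array(text):
    empty_emb = np.zeros(300)  # zero vector for unknown words and padding
    # Drop the final character (presumably the trailing "?") and keep at most 30 tokens.
    text = text[:-1].split()[:30]
    embeds = [embeddings_index.get(x, empty_emb) for x in text]
    embeds += [empty_emb] * (30 - len(embeds))  # right-pad to 30 tokens
    return np.array(embeds)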

val_vects = np.array([text_to_array(X_text) for X_text in val_df["question_text"][:3000]])
val_y = np.array(val_df["target"][:3000])
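The truncate-and-pad behaviour of `text_to_array` is easier to see at a small scale. The sketch below mirrors its logic with toy sizes; `EMB_DIM`, `MAX_LEN`, `toy_index`, and `toy_text_to_array` are hypothetical names introduced only for this illustration (the notebook uses 300 dimensions and 30 tokens).

```python
import numpy as np

EMB_DIM = 4   # toy size; the notebook uses 300
MAX_LEN = 5   # toy size; the notebook uses 30

# Hypothetical stand-in for embeddings_index.
toy_index = {"why": np.full(EMB_DIM, 1.0), "sky": np.full(EMB_DIM, 2.0)}

def toy_text_to_array(text):
    empty_emb = np.zeros(EMB_DIM)
    tokens = text[:-1].split()[:MAX_LEN]             # drop last char, truncate
    embeds = [toy_index.get(tok, empty_emb) for tok in tokens]
    embeds += [empty_emb] * (MAX_LEN - len(embeds))  # right-pad with zeros
    return np.array(embeds)

short = toy_text_to_array("why sky?")                      # 2 tokens, padded to 5
long = toy_text_to_array("why why why why why why why?")   # 7 tokens, cut to 5
print(short.shape, long.shape)
```

Both calls return an array of shape `(MAX_LEN, EMB_DIM)`, so every question ends up with the same fixed shape regardless of its length.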

# Data provider

In [11]:
batch_size = 128

def batch_gen(train_df):
    n_batches = math.ceil(len(train_df) / batch_size)
    while True: 
        train_df = train_df.sample(frac=1.)  # Shuffle the data.
        for i in range(n_batches):
            texts = train_df.iloc[i*batch_size:(i+1)*batch_size, 1]
            text_arr = np.array([text_to_array(text) for text in texts])
            yield text_arr, np.array(train_df["target"][i*batch_size:(i+1)*batch_size])
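The generator above never terminates: it reshuffles the data and starts a new pass each time the batches run out, and the last batch of a pass may be smaller than `batch_size`. The following sketch shows the same pattern on plain NumPy arrays; `toy_batch_gen` and its `seed` parameter are hypothetical and exist only for this illustration.

```python
import math
import numpy as np

def toy_batch_gen(features, labels, batch_size, seed=0):
    """Yield (features, labels) batches forever, reshuffling on every pass."""
    rng = np.random.default_rng(seed)
    n_batches = math.ceil(len(features) / batch_size)
    while True:
        order = rng.permutation(len(features))  # new shuffle for each pass
        for i in range(n_batches):
            idx = order[i * batch_size:(i + 1) * batch_size]
            yield features[idx], labels[idx]

features = np.arange(10).reshape(10, 1)
labels = np.arange(10) % 2

gen = toy_batch_gen(features, labels, batch_size=4)
sizes = [len(next(gen)[0]) for _ in range(6)]
print(sizes)  # [4, 4, 2, 4, 4, 2]
```

Because the generator is infinite, the number of batches consumed per epoch has to be set by the caller, which is what `steps_per_epoch` does in the training call.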

# Training

# RNN Architecture:

__Sequential()__ : Initializes the model as a linear stack of layers.

1) __Add 4 LSTM layers, with 100 units in each layer.__

2) __units__ : the number of memory units (LSTM cells) in the layer.

3) __return_sequences__ : set to True when another LSTM layer follows, because a stacked RNN needs the full output sequence of the layer below. For a final LSTM layer that should emit only its last output, set __return_sequences = False__. In the code below, the fourth layer also keeps __return_sequences = True__, and __Flatten()__ reshapes its (30, 100) output before the dense layer.

4) __input_shape__ : the shape of one training sample. Only the timesteps (2nd dimension) and features (3rd dimension) are given, here (30, 300); the number of observations (1st dimension) is inferred automatically.
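As a rough size check for the architecture described above, the trainable parameter counts can be worked out by hand. The sketch assumes the default Keras LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and that the fourth LSTM layer's (30, 100) output is flattened to 3000 values before the dense layer, as in the model code of this notebook. `lstm_params` and `dense_params` are hypothetical helpers, not Keras functions.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input per unit, plus one bias per unit.
    return input_dim * units + units

timesteps, features, units = 30, 300, 100

layer_params = [
    lstm_params(features, units),        # first LSTM sees the 300-d embeddings
    lstm_params(units, units),           # layers 2-4 see the 100-d LSTM output
    lstm_params(units, units),
    lstm_params(units, units),
    dense_params(timesteps * units, 1),  # Flatten turns (30, 100) into 3000
]
total = sum(layer_params)
print(layer_params, total)  # [160400, 80400, 80400, 80400, 3001] 404601
```

Dropout layers add no parameters. Calling `model.summary()` on the compiled model should report per-layer counts that can be compared against these numbers.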


In [14]:
model = Sequential()
# First Layer
model.add(LSTM(units=100, return_sequences=True, input_shape=(30, 300)))
model.add(Dropout(rate=0.2))

#2nd Layer
model.add(LSTM(units=100, return_sequences=True))
model.add(Dropout(rate=0.2))

#3rd Layer
model.add(LSTM(units=100, return_sequences=True))
model.add(Dropout(rate=0.2))

#4th Layer
model.add(LSTM(units=100, return_sequences=True))
model.add(Dropout(rate=0.2))

model.add(Flatten())

# Output layer
model.add(Dense(units=1, activation="sigmoid"))

# Compile RNN
model.compile(optimizer= 'adam',loss='binary_crossentropy', metrics=['accuracy'])

# Fit model

In [15]:
mg = batch_gen(train_df)
model.fit_generator(mg, epochs=20,steps_per_epoch=1000,validation_data=(val_vects, val_y),verbose=True)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0xa389f13940>
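With `steps_per_epoch=1000` and a batch size of 128, one "epoch" in the call above does not cover the whole training set. The arithmetic below is a back-of-the-envelope sketch; it assumes the row count printed when the data was loaded and assumes that the 10% validation split is rounded up.

```python
import math

total_rows = 1306122                    # rows in train.csv, from the printed shape
val_rows = math.ceil(total_rows * 0.1)  # assumption: the 10% split is rounded up
train_rows = total_rows - val_rows

batch_size = 128
steps_per_epoch = 1000
epochs = 20

batches_per_pass = math.ceil(train_rows / batch_size)    # batches in one full pass
samples_per_epoch = steps_per_epoch * batch_size         # samples seen per epoch
passes = epochs * steps_per_epoch / batches_per_pass     # full passes over the data

print(train_rows, batches_per_pass, samples_per_epoch, round(passes, 2))
```

Under these assumptions each epoch sees 128,000 samples, roughly 11% of the training rows, and the 20 epochs together amount to a little over two full passes over the data.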

# Save model to disk

Keras provides the ability to describe any model using JSON format with a __to_json()__ function. This can be saved to file and later loaded via the __model_from_json()__ function that will create a new model from the JSON specification.

The weights are saved directly from the model using the __save_weights()__ function and later loaded using the symmetrical __load_weights()__ function.

The model is then converted to JSON format and written to __model.json__ in the local directory. The network weights are written to __model.h5__ in the local directory.

# Serialize model to JSON

In [16]:
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

# Serialize weights to HDF5

In [17]:
model.save_weights("model.h5")
print("Model Saved to disk")

Model Saved to disk


# Load json and create model

In the previous step the model and its weights were saved. To reuse the saved model without training again, load it and compile it. To do so, place __model.json__ and __model.h5__ in the same folder as this notebook, then uncomment and run the code below.



In [18]:
#from keras.models import model_from_json
#json_file = open('model.json', 'r')
#loaded_model_json = json_file.read()
#json_file.close()
#loaded_model = model_from_json(loaded_model_json)
# Load saved weights
#loaded_model.load_weights("model.h5")
#model = loaded_model  # so the cells below use the loaded model
#print("Loaded model from disk")

# Compile model after loading from disk

__The model and weight data are loaded from the saved files and a new model is created. Compile the loaded model before using it, so that the loss, optimizer, and metrics are set up on the Keras backend. This is required for evaluation or further training; plain prediction with predict() generally works without it.__

In [19]:
#model.compile(optimizer= 'adam',loss='binary_crossentropy', metrics=['accuracy'])

# Prediction

In [20]:
batch_size = 256

# Separate generator for inference: one ordered pass over the test set, no labels.
# It has its own name so that it does not shadow the training batch_gen.
def test_batch_gen(test_df):
    n_batches = math.ceil(len(test_df) / batch_size)
    for i in range(n_batches):
        texts = test_df.iloc[i*batch_size:(i+1)*batch_size, 1]
        text_arr = np.array([text_to_array(text) for text in texts])
        yield text_arr

all_preds = []
for x in test_batch_gen(test_df):
    all_preds.extend(model.predict(x).flatten())

In [21]:
y_test = (np.array(all_preds) > 0.5).astype(int)  # np.int was removed from recent NumPy releases

submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_test})
submit_df.to_csv("Output.csv", index=False)  # Output.csv will have prediction of test data
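The 0.5 cut-off used above is a choice, not something the model fixes. The sketch below shows how the threshold turns sigmoid outputs into labels; the probabilities are made up for illustration and `to_labels` is a hypothetical helper.

```python
import numpy as np

probs = np.array([0.05, 0.35, 0.50, 0.51, 0.90])  # made-up sigmoid outputs

def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels; values equal to the threshold map to 0."""
    return (np.asarray(probabilities) > threshold).astype(int)

default_labels = to_labels(probs)
lower_labels = to_labels(probs, threshold=0.3)
print(default_labels)  # [0 0 0 1 1]
print(lower_labels)    # [0 1 1 1 1]
```

Lowering the threshold flags more questions as insincere. A held-out set such as `val_vects` and `val_y` could be used to compare candidate thresholds before writing the submission file.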

# ___END___