# **Spam Filter for Quora Questions**

**GOAL : To build a model for identifying if a Question on Quora is Spam.**

In [1]:
import tensorflow as tf

In [2]:
print(tf.__version__)

2.14.0


In [3]:
!pip show keras

Name: keras
Version: 2.14.0
Summary: Deep learning for humans.
Home-page: https://keras.io/
Author: Keras team
Author-email: keras-users@googlegroups.com
License: Apache 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: tensorflow


In [4]:
!pip install keras==2.14.0     #Installing keras



**Importing Libraries**

In [5]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm  # progress bar for tracking long-running loops

import math

from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation,  Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from tensorflow.compat.v1.keras.layers import CuDNNGRU

**Reading the dataset train**

In [7]:
train_df = pd.read_csv("/content/train.csv")
print("Train shape : ",train_df.shape)

Train shape :  (55265, 3)


In [8]:
target_types = train_df.groupby('target').agg('count')
target_types

Unnamed: 0_level_0,qid,question_text
target,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,51868,51868
1.0,3396,3396
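
The counts above show a strong class imbalance. As a rough sketch, the snippet below uses the counts printed above (full dataset, before the train/val split) to compute the spam share and the accuracy of a trivial classifier that always predicts "not spam". The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the groupby output above (full dataset, before the split)
n_not_spam = 51868
n_spam = 3396
total = n_not_spam + n_spam

# Share of questions labelled as spam
spam_rate = n_spam / total

# Accuracy of a trivial model that always predicts the majority class (0 = not spam)
majority_baseline_accuracy = n_not_spam / total

print(f"Spam share: {spam_rate:.4f}")
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.4f}")
```

Because a constant prediction already reaches roughly 94% accuracy on data with this class balance, the F1 score on the spam class is the more informative metric in the cells that follow.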


In [9]:
# Count how many questions fall into each class and keep the class labels
target_counts = train_df.target.value_counts().sort_index()
target_labels = target_counts.index

In [10]:
import re
import nltk  # toolkit built for working with NLP

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

Stopwords are a set of very common words in a language, for example 'a', 'the', 'is' and 'and' in English.
Removing them discards words that carry little meaning, so the model can focus on the more informative words.

In [11]:
eng_stopwords = stopwords.words('english')
eng_stopwords.remove('not') # keep 'not' in the text: it carries negation, so drop it from the stopword list
eng_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

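Lemmatization is a text pre-processing technique used in NLP to reduce a word to its dictionary form (lemma), so that different inflections of the same word are treated alike. For example, running becomes run and caring becomes care. Note that WordNetLemmatizer treats every word as a noun by default, so verb forms like these are only reduced when a part-of-speech tag is passed.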

In [12]:
lemmatizer = WordNetLemmatizer()

In [13]:
def data_preprocessing(questions):

    # Data cleaning: strip HTML-like tags, then replace runs of non-alphanumeric characters with a space
    questions = re.sub(re.compile('<.*?>'),'',questions)
    questions = re.sub('[^A-Za-z0-9]+',' ',questions)

    # Lowercasing: convert every word to lowercase
    questions = questions.lower()

    # Tokenization: break the text into smaller pieces called tokens
    tokens = nltk.word_tokenize(questions)

    # Stopword removal ('not' is kept because it was removed from eng_stopwords)
    questions = [word for word in tokens if word not in eng_stopwords]

    # Lemmatization: reduce each token to its dictionary form
    questions = [lemmatizer.lemmatize(word) for word in questions]

    # Join the tokens back into a single preprocessed string
    questions = ' '.join(questions)

    return questions

In [14]:
train_df['preprocessed_question_text'] = train_df["question_text"].apply(data_preprocessing)
train_df.head()

Unnamed: 0,qid,question_text,target,preprocessed_question_text
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0.0,quebec nationalist see province nation 1960s
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0.0,adopted dog would encourage people adopt not shop
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0.0,velocity affect time velocity affect space geo...
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0.0,otto von guericke used magdeburg hemisphere
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0.0,convert montra helicon mountain bike changing ...


In [15]:
## split to train and val

train_df, val_df = train_test_split(train_df, test_size=0.3, random_state=2018)

In [16]:
## fill up the missing values in the question_text with "_na_"

train_df["question_text"] = train_df["question_text"].fillna("_na_").values
val_df["question_text"] = val_df["question_text"].fillna("_na_").values

In [17]:
## Get the target values

train_y = train_df['target'].values
val_y = val_df['target'].values

In [18]:
## Importing CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer transforms a collection of texts into vectors based on the frequency (count) of each word or n-gram in each text.
Machine learning models cannot work on raw text directly; they only accept numbers, so the text needs to be vectorized first.
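
To make the idea concrete, here is a minimal pure-Python sketch of word counting on a toy corpus. It is only an illustration: the helper names `build_vocabulary` and `count_vectorize` are hypothetical, and the real CountVectorizer additionally handles n-grams, the `min_df` cut-off and sparse output.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct whitespace-separated token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vectorize(doc, vocabulary):
    """Return a dense count vector for one document; tokens outside the vocabulary are ignored."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

# Toy documents, for illustration only
toy_docs = ["dog adopt dog", "adopt not shop"]
vocab = build_vocabulary(toy_docs)
vectors = [count_vectorize(d, vocab) for d in toy_docs]
print(vocab)
print(vectors)
```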

In [19]:
vect= CountVectorizer(dtype=np.float32,strip_accents='unicode',
                      analyzer='word',token_pattern=r'\w{1,}',
                      ngram_range=(1,3), min_df = 3)
X_train = vect.fit_transform(list(train_df['preprocessed_question_text'].values))
X_val = vect.transform(val_df['preprocessed_question_text'].values)

**Using Naive Bayes Classifier**
- It is a supervised machine learning algorithm used for classification tasks such as text classification.
- It belongs to the family of generative learning algorithms, meaning that it models the distribution of the inputs for a given class or category.
- Here we will use three types of Naive Bayes classifier: Multinomial, Gaussian and Bernoulli.

In [20]:
## Importing MultinomialNB, GaussianNB and BernoulliNB

from sklearn.naive_bayes import MultinomialNB,GaussianNB,BernoulliNB
from sklearn.metrics import accuracy_score,f1_score

Multinomial Naive Bayes: It is suitable for classification with discrete features (e.g. word counts for text classification). The multinomial distribution normally requires integer feature counts.
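
The following is a small from-scratch sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing, written only to illustrate the idea on a toy corpus. The function names and the toy data are assumptions made for this example; the notebook itself relies on scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)  # label -> token -> count
    vocabulary = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocabulary.update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values())
        denom = total + alpha * len(vocabulary)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / denom) for tok in vocabulary
        }
    return log_prior, log_likelihood

def predict_multinomial_nb(tokens, log_prior, log_likelihood):
    """Return the class with the highest posterior score; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in tokens if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy corpus: label 1 = spam, label 0 = not spam (illustrative data only)
toy_docs = [["free", "money", "now"], ["win", "free", "prize"],
            ["how", "learn", "python"], ["best", "way", "learn", "math"]]
toy_labels = [1, 1, 0, 0]
prior, likelihood = train_multinomial_nb(toy_docs, toy_labels)
pred_spam = predict_multinomial_nb(["free", "prize", "money"], prior, likelihood)
pred_ham = predict_multinomial_nb(["learn", "python"], prior, likelihood)
print(pred_spam, pred_ham)
```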

In [None]:
clf=MultinomialNB()
clf.fit(X_train,train_y)

In [None]:
## Printing the Validation Accuracy and Validation f1_score

y_val = clf.predict(X_val)
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.9270640598003762
Validation f1_score:  0.5412606943931684


In [None]:
del X_train,vect,X_val
import gc; gc.collect()
time.sleep(10)

In [None]:
## Importing TFIDF Vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

Term Frequency Inverse Document Frequency (TFIDF) shows how important a word is to a document in a collection or corpus. The TFIDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
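
As a worked illustration, the sketch below computes TFIDF weights for a toy corpus in pure Python. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, sublinear term frequency `1 + ln(tf)` and L2 row normalisation, which is how scikit-learn documents the behaviour for `smooth_idf=True` and `sublinear_tf=True`. The function name `tfidf` and the toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs, sublinear_tf=True):
    """Compute L2-normalised TF-IDF weights for whitespace-tokenized documents."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents containing each token
    df = Counter(tok for tokens in tokenized for tok in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}

    rows = []
    for tokens in tokenized:
        weights = {}
        for tok, count in Counter(tokens).items():
            tf = 1 + math.log(count) if sublinear_tf else count
            weights[tok] = tf * idf[tok]
        # L2-normalise the row; an empty document keeps an empty row
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

# Toy documents, for illustration only
toy_docs = ["quora question spam", "quora question answer", "spam spam offer"]
rows = tfidf(toy_docs)
for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

In the second toy document, "answer" occurs in only one document and therefore receives a higher weight than "quora", which occurs in two.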

**Why TFIDF?**
- TFIDF is often preferred over CountVectorizer because it accounts not only for how frequently a word occurs but also for how informative it is: words that appear in many documents are down-weighted.

In [None]:
tfidfvec= TfidfVectorizer(dtype=np.float32,strip_accents='unicode',
                      analyzer='word',token_pattern=r'\w{1,}',
                      ngram_range=(1,3), min_df = 3,
                      max_features=None,use_idf=True,smooth_idf=True,sublinear_tf=True,stop_words='english')
X_train_tfidf = tfidfvec.fit_transform(list(train_df['preprocessed_question_text'].values) )
X_val_tfidf = tfidfvec.transform(val_df['preprocessed_question_text'].values)

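The Bernoulli Naive Bayes classifier is based on the Bernoulli distribution and models binary features, i.e. 0 or 1, where each feature indicates whether a term is present in or absent from a document. scikit-learn's BernoulliNB binarizes its input by default (values greater than 0 become 1), so the TFIDF weights below are reduced to presence/absence indicators.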
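
A minimal sketch of the two ingredients described above: binarizing a feature vector and scoring it with a Bernoulli likelihood, in which absent features also contribute. The weights and probabilities are hypothetical numbers chosen for illustration, and the helper names are not part of any library.

```python
import math

def binarize(vector, threshold=0.0):
    """Turn any numeric feature vector into presence/absence indicators."""
    return [1 if value > threshold else 0 for value in vector]

def bernoulli_log_likelihood(binary_vector, feature_probs):
    """Sum log P(x_i | class) over all features, counting absent features too."""
    total = 0.0
    for x, p in zip(binary_vector, feature_probs):
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

tfidf_row = [0.0, 0.41, 0.0, 0.87]          # hypothetical TF-IDF weights for one question
spam_feature_probs = [0.1, 0.6, 0.2, 0.7]   # hypothetical P(feature present | spam)

x = binarize(tfidf_row)
log_lik = bernoulli_log_likelihood(x, spam_feature_probs)
print(x)
print(log_lik)
```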

In [None]:
clf=BernoulliNB()
clf.fit(X_train_tfidf,train_y)

BernoulliNB()

In [None]:
## Printing the Validation accuracy and Validation f1_score

y_val = clf.predict(X_val_tfidf)
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.9379461357656372
Validation f1_score:  0.5113643214565623


In [None]:
del X_train_tfidf,tfidfvec,X_val_tfidf
import gc; gc.collect()
time.sleep(10)

In [None]:
## Importing HashingVectorizer

from sklearn.feature_extraction.text import HashingVectorizer

Hashing Vectorizer is based on feature hashing, also known as the hashing trick, and is a memory-efficient technique. It maintains no vocabulary; instead, it determines the index of a word in a fixed-size array by hashing it. Words not seen during training (including misspellings) therefore still map to a column, at the cost of occasional collisions where different words share an index.
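
The hashing trick can be sketched in a few lines of pure Python. This example uses MD5 purely as a stable stand-in hash, which is an assumption made for illustration; scikit-learn's HashingVectorizer uses a different hash function and, by default, signed values. The helper names are hypothetical.

```python
import hashlib

def hashed_index(token, n_features):
    """Map a token to a column index with a stable hash (MD5 here, chosen only for illustration)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hashing_vectorize(doc, n_features=2**10):
    """Count tokens into a fixed-size vector without storing any vocabulary."""
    vector = [0] * n_features
    for token in doc.split():
        vector[hashed_index(token, n_features)] += 1
    return vector

vec = hashing_vectorize("adopt dog not shop dog")
print(len(vec), sum(vec))
```

The vector length is fixed by `n_features` regardless of how many distinct words occur, which is what keeps memory use bounded.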

In [None]:
hashvec= HashingVectorizer(dtype=np.float32,strip_accents='unicode',
                      analyzer='word',token_pattern=r'\w{1,}',
                      ngram_range=(1,3),n_features = 2**10)
X_train_hashvec = hashvec.fit_transform(list(train_df['preprocessed_question_text'].values))
X_val_hashvec = hashvec.transform(val_df['preprocessed_question_text'].values)

In [None]:
## Using GaussianNB

clf=GaussianNB()
clf.fit(X_train_hashvec.toarray(),train_y)

GaussianNB()

Gaussian Naive Bayes is a probabilistic classifier that assumes each feature follows a Gaussian (normal) distribution within each class. It requires dense input, which is why the sparse matrices are converted with .toarray().
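
As a rough single-feature sketch of the Gaussian assumption, the snippet below fits a mean and variance per class and compares the log density of a new value under each class. All numbers and helper names are hypothetical and only meant to illustrate the idea.

```python
import math

def fit_gaussian(values):
    """Estimate mean and variance of one feature within one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var + 1e-9  # small constant keeps the variance strictly positive

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical values of a single feature observed in each class
spam_values = [0.9, 1.1, 1.0]
ham_values = [0.0, 0.2, 0.1]

spam_mean, spam_var = fit_gaussian(spam_values)
ham_mean, ham_var = fit_gaussian(ham_values)

x = 0.95  # new observation
log_p_spam = gaussian_log_pdf(x, spam_mean, spam_var)
log_p_ham = gaussian_log_pdf(x, ham_mean, ham_var)
print(log_p_spam > log_p_ham)
```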

In [None]:
y_val = clf.predict(X_val_hashvec.toarray())
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.7236019058945429
Validation f1_score:  0.22872647253615908


In [None]:
clf=BernoulliNB()
clf.fit(X_train_hashvec,train_y)

BernoulliNB()

In [None]:
y_val = clf.predict(X_val_hashvec)
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.8969163197962418
Validation f1_score:  0.26971614536250227


In [None]:
del X_train_hashvec,hashvec,X_val_hashvec
import gc; gc.collect()
time.sleep(10)

**Next steps are as follows:**
1. Splitting the training dataset into train and val samples. Cross validation is a time-consuming process, so we use a simple train/val split (already done above).
2. Filling up the missing values in the text column with '_na_' (already done above).
3. Tokenizing the text column and converting it to sequences of integer indices.
4. Padding the sequences as needed.
  - If the number of words in the text is greater than 'max_len', truncate it to 'max_len'.
  - If the number of words in the text is less than 'max_len', add zeros for the remaining values.
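
The padding rule in step 4 can be sketched for a single sequence as follows. The helper `pad_or_truncate` is a hypothetical name; it mimics the default behaviour of Keras `pad_sequences`, which pads and truncates at the front of the sequence.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Mimic the default 'pre' padding and 'pre' truncation of Keras pad_sequences for one sequence."""
    if len(sequence) >= max_len:
        # Keep the last max_len tokens (truncate from the front)
        return sequence[-max_len:]
    # Left-pad with zeros so that the real tokens sit at the end
    return [pad_value] * (max_len - len(sequence)) + sequence

short = pad_or_truncate([5, 8, 2], max_len=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)
print(short)
print(long)
```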

**Using Embeddings**

In [None]:
## some config values
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100 # max number of words in a question to use

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_df["question_text"]))
train_X = tokenizer.texts_to_sequences(train_df["question_text"])
val_X = tokenizer.texts_to_sequences(val_df["question_text"])

## Pad the sentences
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)

Now that we are done with all the necessary preprocessing steps, we can first train a Bidirectional GRU model. We will not use any pre-trained word embeddings for this model; the embeddings will be learned from scratch.

In [None]:
## Bidirectional GRU model

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
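# CuDNNGRU is not exposed under tf.keras.layers in TensorFlow 2.x; use the class imported above from tensorflow.compat.v1 (GPU only)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)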
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()  # summary() prints directly; wrapping it in print() would also print None

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_12 (InputLayer)       [(None, 100)]             0         
                                                                 
 embedding_11 (Embedding)    (None, 100, 300)          15000000  
                                                                 
 bidirectional (Bidirectiona  (None, 100, 128)         186880    
 l)                                                              
                                                                 
 global_max_pooling1d (Globa  (None, 128)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 16)                2064      
                                                                 
 dropout (Dropout)           (None, 16)                0     

Gated Recurrent Unit (GRU) is a gating mechanism in Recurrent Neural Networks (RNNs), similar to a Long Short-Term Memory (LSTM) unit but without an output gate. GRUs try to mitigate the vanishing gradient problem that can occur with standard recurrent neural networks.

**WHY GRU?**
- A GRU has fewer gates and fewer parameters than an LSTM, which makes it simpler and faster to train, although it can be less expressive on some tasks.
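
To make the parameter comparison concrete, the sketch below counts the trainable weights of a recurrent layer with 64 units on 300-dimensional inputs. The formulas assume the usual Keras weight layout (one bias vector per LSTM gate, two per GRU gate in the `reset_after` variant); the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gate blocks, each with input weights, recurrent weights and one bias vector."""
    return 4 * (input_dim * units + units * units + units)

def gru_params(input_dim, units):
    """Three gate blocks; two bias vectors per block are assumed (the reset_after layout)."""
    return 3 * (input_dim * units + units * units + 2 * units)

embed_size, units = 300, 64

# A Bidirectional wrapper holds one forward and one backward copy of the layer
bi_lstm = 2 * lstm_params(embed_size, units)
bi_gru = 2 * gru_params(embed_size, units)
print("Bidirectional LSTM:", bi_lstm)
print("Bidirectional GRU: ", bi_gru)
```

Under these assumptions the GRU layer needs about a quarter fewer parameters. The 186,880 reported for the bidirectional layer in the model summary above equals the LSTM count, so that output may have come from a run that used an LSTM layer.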

In [None]:
## Train the model
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fb6c20bc7f0>

In [None]:
pred_noemb_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_noemb_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5757985325757158
F1 score at threshold 0.11 is 0.5839391330124359
F1 score at threshold 0.12 is 0.5915996425379804
F1 score at threshold 0.13 is 0.5975710168793742
F1 score at threshold 0.14 is 0.6033352294841473
F1 score at threshold 0.15 is 0.6083885209713024
F1 score at threshold 0.16 is 0.6131122693598527
F1 score at threshold 0.17 is 0.6171556002261448
F1 score at threshold 0.18 is 0.6204918163418172
F1 score at threshold 0.19 is 0.6239513795723083
F1 score at threshold 0.2 is 0.6270892049551026
F1 score at threshold 0.21 is 0.6303925636982323
F1 score at threshold 0.22 is 0.6330881981724247
F1 score at threshold 0.23 is 0.6339850341759422
F1 score at threshold 0.24 is 0.6356139806093226
F1 score at threshold 0.25 is 0.6365603406156208
F1 score at threshold 0.26 is 0.6382166955170099
F1 score at threshold 0.27 is 0.6399699527829447
F1 score at threshold 0.28 is 0.6414316702819957
F1 score at threshold 0.29 is 0.6427239147130893
F1 score at threshold 
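
Rather than reading the best threshold off the printed list, it can be selected programmatically. The sketch below is a pure-Python illustration on made-up labels and probabilities; `f1_score_binary` and `best_threshold` are hypothetical helper names, not part of scikit-learn.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probabilities, thresholds):
    """Return (threshold, f1) for the threshold with the highest F1."""
    scored = []
    for thresh in thresholds:
        y_pred = [1 if p > thresh else 0 for p in probabilities]
        scored.append((thresh, f1_score_binary(y_true, y_pred)))
    return max(scored, key=lambda pair: pair[1])

# Hypothetical labels and predicted probabilities, for illustration only
toy_y = [0, 0, 0, 0, 1, 1, 0, 1]
toy_prob = [0.05, 0.12, 0.30, 0.08, 0.45, 0.22, 0.18, 0.70]
grid = [round(0.10 + 0.01 * i, 2) for i in range(41)]  # 0.10, 0.11, ..., 0.50

thresh, score = best_threshold(toy_y, toy_prob, grid)
print(f"Best threshold {thresh} with F1 {score:.3f}")
```

A threshold tuned on the validation set tends to look slightly optimistic on that same set, so it is worth confirming it on data that was not used for the search.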

Now that our model building is done, it might be a good idea to clean up some memory before we go to the next step.

In [None]:
del model, inp, x
import gc; gc.collect()
time.sleep(10)

**Using GloVe embeddings to rebuild GRU model**
- GloVe stands for Global Vectors for Word Representation.
- It is an unsupervised learning algorithm that generates word embeddings from a global word-word co-occurrence matrix aggregated over a corpus.
- The primary idea behind GloVe is to derive relationships between words from these co-occurrence statistics.
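
The co-occurrence statistics mentioned above can be illustrated with a small counting sketch. It covers only the counting step on a toy corpus; GloVe itself goes further and fits word vectors to these statistics. The function name and the window size are assumptions made for the example.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, context) pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus, for illustration only
toy_corpus = ["ice is cold", "steam is hot", "ice is solid"]
counts = cooccurrence_counts(toy_corpus)
print(counts[("ice", "is")], counts[("ice", "cold")], counts[("ice", "hot")])
```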

In [None]:
!wget 'https://nlp.stanford.edu/data/glove.840B.300d.zip'

--2023-02-23 10:25:42--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2023-02-23 10:25:43--  https://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: ‘glove.840B.300d.zip’


2023-02-23 10:32:33 (5.06 MB/s) - ‘glove.840B.300d.zip’ saved [2176768927/2176768927]



We now have a baseline GRU model without pre-trained embeddings. Let us use pre-trained embeddings and rebuild the model to compare performance.

There are four different pre-trained embeddings to choose from; this notebook uses the second and the fourth.
1. GoogleNews-vectors-negative300
2. glove.840B.300d
3. paragram_300_sl999
4. wiki-news-300d-1M

In [None]:
!unzip glove.840B.300d.zip

Archive:  glove.840B.300d.zip
  inflating: glove.840B.300d.txt     


In [None]:
!rm glove.840B.300d.zip

In [None]:
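EMBEDDING_FILE = 'glove.840B.300d.txt'
def get_coefs(word, *arr):
    # Split one line of the embedding file into (word, vector)
    return word, np.asarray(arr, dtype='float32')
# Read every line into a {word: vector} dictionary; the file is closed on exit
with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)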

In [None]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]


In [None]:
del all_embs

In [None]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index) + 1)  # +1 because Tokenizer indices start at 1
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

In [None]:
inp = Input(shape=(maxlen,))
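x = Embedding(nb_words, embed_size, weights=[embedding_matrix])(inp)  # input size must match the rows of embedding_matrix
# CuDNNGRU is not exposed under tf.keras.layers in TensorFlow 2.x; use the class imported above from tensorflow.compat.v1 (GPU only)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)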
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints directly; wrapping it in print() would also print None

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_14 (InputLayer)       [(None, 100)]             0         
                                                                 
 embedding_13 (Embedding)    (None, 100, 300)          15000000  
                                                                 
 bidirectional_1 (Bidirectio  (None, 100, 128)         186880    
 nal)                                                            
                                                                 
 global_max_pooling1d_1 (Glo  (None, 128)              0         
 balMaxPooling1D)                                                
                                                                 
 dense_2 (Dense)             (None, 16)                2064      
                                                                 
 dropout_1 (Dropout)         (None, 16)                0   

In [None]:
model.fit(train_X, train_y, batch_size=512, epochs=1, validation_data=(val_X, val_y))



<keras.callbacks.History at 0x7fb6c7b01100>

In [None]:
pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5865264487431184
F1 score at threshold 0.11 is 0.5954688005400313
F1 score at threshold 0.12 is 0.6031107728101797
F1 score at threshold 0.13 is 0.6096690175880133
F1 score at threshold 0.14 is 0.6160973647752834
F1 score at threshold 0.15 is 0.6214168838252228
F1 score at threshold 0.16 is 0.6265731253269331
F1 score at threshold 0.17 is 0.6308969995941178
F1 score at threshold 0.18 is 0.6352061823018941
F1 score at threshold 0.19 is 0.6388447653429603
F1 score at threshold 0.2 is 0.6424821623027288
F1 score at threshold 0.21 is 0.6451039747301922
F1 score at threshold 0.22 is 0.6477846046256056
F1 score at threshold 0.23 is 0.6501481281982225
F1 score at threshold 0.24 is 0.6521887010645896
F1 score at threshold 0.25 is 0.6538507832247193
F1 score at threshold 0.26 is 0.6546945013720519
F1 score at threshold 0.27 is 0.6564748832800926
F1 score at threshold 0.28 is 0.6578033042615047
F1 score at threshold 0.29 is 0.659257668492072
F1 score at threshold 0

In [None]:
del word_index, embeddings_index, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)

**Using FastText embeddings trained on Wiki News corpus in place of Glove embeddings and rebuilding the model.**

**FastText Embeddings**
- FastText is a free, open-source library from Facebook AI Research (FAIR) for learning word embeddings and text classification.
- It supports both supervised and unsupervised learning of vector representations for words.
- FastText breaks words into several character n-grams (sub-words). For instance, the tri-grams for the word apple are app, ppl, and ple (ignoring the word-boundary markers). The word embedding vector for apple is the sum of the vectors of all these n-grams.
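
The n-gram splitting described above can be sketched in a couple of lines. The helper `char_ngrams` is a hypothetical name for illustration; the optional `<` and `>` markers correspond to the word boundaries that the example in the text ignores.

```python
def char_ngrams(word, n=3, add_boundaries=False):
    """Return the character n-grams of a word, optionally wrapped in '<' and '>' markers."""
    text = f"<{word}>" if add_boundaries else word
    return [text[i:i + n] for i in range(len(text) - n + 1)]

plain = char_ngrams("apple")
bounded = char_ngrams("apple", add_boundaries=True)
print(plain)
print(bounded)
```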

In [None]:
!wget 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip'

--2023-02-23 12:11:08--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2023-02-23 12:11:23 (45.2 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]



In [None]:
!unzip wiki-news-300d-1M.vec.zip

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


In [None]:
!rm wiki-news-300d-1M.vec.zip

In [None]:
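EMBEDDING_FILE2 = 'wiki-news-300d-1M.vec'
def get_coefs(word, *arr):
    # Split one line of the embedding file into (word, vector)
    return word, np.asarray(arr, dtype='float32')
# len(o) > 100 skips the short header line at the top of the .vec file
with open(EMBEDDING_FILE2, encoding='utf8') as f:
    embeddings_index2 = dict(get_coefs(*o.split(" ")) for o in f if len(o) > 100)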

In [None]:
all_embs2 = np.stack(list(embeddings_index2.values()))  # np.stack needs a sequence, not a dict view
emb_mean2,emb_std2 = all_embs2.mean(), all_embs2.std()
embed_size2 = all_embs2.shape[1]


In [None]:
del all_embs2

In [None]:
word_index2 = tokenizer.word_index
nb_words2 = min(max_features, len(word_index2) + 1)  # +1 because Tokenizer indices start at 1
embedding_matrix2 = np.random.normal(emb_mean2, emb_std2, (nb_words2, embed_size2))
for word, i in word_index2.items():
    if i >= nb_words2: continue
    embedding_vector2 = embeddings_index2.get(word)
    if embedding_vector2 is not None: embedding_matrix2[i] = embedding_vector2

In [None]:
inp = Input(shape=(maxlen,))
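x = Embedding(nb_words2, embed_size2, weights=[embedding_matrix2])(inp)  # input size must match the rows of embedding_matrix2
# Note: this model uses an LSTM layer, whereas the two models above use a GRU
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(x)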
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.fit(train_X, train_y, batch_size=512, epochs=1, validation_data=(val_X, val_y))



<keras.callbacks.History at 0x7fb6c24e1910>

In [None]:
pred_fasttext_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_fasttext_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.59587444978279
F1 score at threshold 0.11 is 0.603775812951127
F1 score at threshold 0.12 is 0.6108686789375934
F1 score at threshold 0.13 is 0.617299181149475
F1 score at threshold 0.14 is 0.622688388476538
F1 score at threshold 0.15 is 0.6276231057572166
F1 score at threshold 0.16 is 0.6319718355771424
F1 score at threshold 0.17 is 0.6363982580762672
F1 score at threshold 0.18 is 0.6401358485490433
F1 score at threshold 0.19 is 0.6441673783091375
F1 score at threshold 0.2 is 0.646633740577073
F1 score at threshold 0.21 is 0.6486287179127015
F1 score at threshold 0.22 is 0.6502250929731846
F1 score at threshold 0.23 is 0.6514067371987815
F1 score at threshold 0.24 is 0.6526611666788348
F1 score at threshold 0.25 is 0.653877400295421
F1 score at threshold 0.26 is 0.6546045261035177
F1 score at threshold 0.27 is 0.6546622579121398
F1 score at threshold 0.28 is 0.6547352721849368
F1 score at threshold 0.29 is 0.6546065982489708
F1 score at threshold 0.3 is 

In [None]:
del word_index2, embeddings_index2,  embedding_matrix2, model, inp, x
import gc; gc.collect()
time.sleep(10)

**Observations:**



* Overall, pretrained embeddings seem to give better results compared to the non-pretrained model.

* The performance of the different pretrained embeddings is similar.



**FINAL BLEND**
- Though the results of the models with different pre-trained embeddings are similar, there is a good chance that they capture different types of information from the data.
- So let us blend these two models by taking a weighted average of their predictions (0.70 for GloVe and 0.30 for FastText).

In [None]:
pred_val_y = 0.70*pred_glove_val_y + 0.30*pred_fasttext_val_y
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5896088087201904
F1 score at threshold 0.11 is 0.5987329752405788
F1 score at threshold 0.12 is 0.6069245255685933
F1 score at threshold 0.13 is 0.6144754531889431
F1 score at threshold 0.14 is 0.6201228323699421
F1 score at threshold 0.15 is 0.6255799353840971
F1 score at threshold 0.16 is 0.6310821755653854
F1 score at threshold 0.17 is 0.6355270231807976
F1 score at threshold 0.18 is 0.6396636389896332
F1 score at threshold 0.19 is 0.6425342309818844
F1 score at threshold 0.2 is 0.6459393999307993
F1 score at threshold 0.21 is 0.649735696776668
F1 score at threshold 0.22 is 0.652322242864381
F1 score at threshold 0.23 is 0.6545448331254166
F1 score at threshold 0.24 is 0.6553088552915767
F1 score at threshold 0.25 is 0.6572896281800391
F1 score at threshold 0.26 is 0.6585633016501183
F1 score at threshold 0.27 is 0.6601393603716277
F1 score at threshold 0.28 is 0.6621043318105486
F1 score at threshold 0.29 is 0.6626393882430237
F1 score at threshold 0.