# "Quora Questions Pairs using BERT"
> "Task: Identify whether two questions have a similar context/meaning or not"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

### Quora Questions Pairs using BERT : Overview

**Task: Identify whether two questions have a similar context/meaning or not**<br>
[Kaggle competition](https://www.kaggle.com/c/quora-question-pairs/overview)
<br>
I tried two different approaches to this problem:

1.   Using a Naive Bayes classifier
2.   Using BERT

### Naive Bayes Classifier



In [1]:
import pandas as pd
import numpy as np
import os
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
path = "quora-question-pairs"
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')


[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
#load Data
train = pd.read_csv(path+"/train.csv")
print("Total samples:",len(train))
train.head(10)

Total samples: 404290


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [3]:
#dropping null values
print(train.isnull().sum(axis=0))
train.dropna(axis=0,inplace=True)


id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64


#### Text Preprocessing

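*   Replace non-alphanumeric characters with spaces and lowercase the text
*   Remove stopwords
*   Lemmatize the remaining words (reduce inflected forms to a common base form)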



In [4]:
#preprocessing
lemm = WordNetLemmatizer()
#build the stopword set once; set lookup is faster than scanning a list on every call
stpwords = set(stopwords.words('english'))

def preprocess(text):
  #replace characters other than ASCII letters & digits with spaces, then lowercase and split
  words = re.sub("[^A-Za-z0-9]"," ",text).lower().split()

  #drop stopwords and lemmatize the remaining words
  lemmatized = [lemm.lemmatize(word) for word in words if word not in stpwords]
  return ' '.join(lemmatized)
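
One side effect of this cleaning is visible in row 8 of the preprocessed table: both questions collapse to the same string. The sketch below reproduces that with the standard library only. It skips lemmatization, and `TOY_STOPWORDS` is a small hypothetical stand-in for NLTK's English stopword list, so treat it as an illustration rather than the notebook's exact pipeline.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook uses NLTK's full English list instead.
TOY_STOPWORDS = {"when", "do", "you", "of", "and", "the", "is", "a"}


def toy_preprocess(text):
    """Lowercase, keep only ASCII letters/digits, and drop toy stopwords."""
    words = re.sub(r"[^A-Za-z0-9]", " ", text).lower().split()
    return " ".join(w for w in words if w not in TOY_STOPWORDS)


q1 = "When do you use シ instead of し?"
q2 = 'When do you use "&" instead of "and"?'
cleaned_q1 = toy_preprocess(q1)
cleaned_q2 = toy_preprocess(q2)
print(cleaned_q1)
print(cleaned_q2)
```

The characters that distinguished the two questions are exactly the ones the regex removes, so a bag-of-words model can no longer tell this non-duplicate pair apart.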

In [5]:
#Apply preprocessing
train['question1'] =train['question1'].apply(preprocess)
train['question2'] =train['question2'].apply(preprocess)

In [6]:
#concatenate Question 1 & Question 2
train['combine'] = train['question1'] + " " + train['question2']
train.head(10)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,combine
0,0,1,2,step step guide invest share market india,step step guide invest share market,0,step step guide invest share market india step...
1,1,3,4,story kohinoor koh noor diamond,would happen indian government stole kohinoor ...,0,story kohinoor koh noor diamond would happen i...
2,2,5,6,increase speed internet connection using vpn,internet speed increased hacking dns,0,increase speed internet connection using vpn i...
3,3,7,8,mentally lonely solve,find remainder math 23 24 math divided 24 23,0,mentally lonely solve find remainder math 23 2...
4,4,9,10,one dissolve water quikly sugar salt methane c...,fish would survive salt water,0,one dissolve water quikly sugar salt methane c...
5,5,11,12,astrology capricorn sun cap moon cap rising say,triple capricorn sun moon ascendant capricorn say,1,astrology capricorn sun cap moon cap rising sa...
6,6,13,14,buy tiago,keep childern active far phone video game,0,buy tiago keep childern active far phone video...
7,7,15,16,good geologist,great geologist,1,good geologist great geologist
8,8,17,18,use instead,use instead,0,use instead use instead
9,9,19,20,motorola company hack charter motorolla dcx3400,hack motorola dcx3400 free internet,0,motorola company hack charter motorolla dcx340...


#### Convert Words into Vector

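TF-IDF weights each word by how often it appears in a question pair relative to how common it is across all pairs, so rare, distinctive words get a higher weight than frequent ones.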
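
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of `TfidfVectorizer`'s default settings. The three toy documents are taken from the preprocessed examples above.

```python
import math
from collections import Counter

docs = [
    "increase speed internet connection using vpn",
    "internet speed increased hacking dns",
    "good geologist great geologist",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    tf = Counter(tokens)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:<12}{weight:.3f}")
```

In the first document, "internet" and "speed" also occur in the second document, so they end up with a lower weight than "vpn", which occurs only once in the toy corpus.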


In [7]:
cv = TfidfVectorizer(max_features=50000)#Word to Vectors using Tf-Idf

#Take combine questions data as X
X = cv.fit_transform(train['combine'])
y = np.array(train['is_duplicate'])
print(X.shape)

#Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.05)
print(X_train.shape,X_test.shape)

(404287, 50000)
(384072, 50000) (20215, 50000)


In [8]:
nb_model = MultinomialNB()#Training
nb_model.fit(X_train,y_train)

#Predictions
y_pred_train = nb_model.predict(X_train)
y_pred_test = nb_model.predict(X_test) 
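
For intuition about what the classifier does with these word vectors, below is a toy multinomial Naive Bayes with add-one (Laplace) smoothing. It is a sketch of the general idea, not scikit-learn's implementation, and the four training "pairs" and their labels are made up from words seen in the dataset.

```python
import math
from collections import Counter

# Toy training data: (tokens of a combined question pair, label). Labels are illustrative.
train_docs = [
    ("good geologist great geologist".split(), 1),
    ("expect cognizant mail expect cognizant mail".split(), 1),
    ("buy tiago keep child active".split(), 0),
    ("mentally lonely solve find remainder".split(), 0),
]

vocab = {t for tokens, _ in train_docs for t in tokens}
class_docs = Counter(label for _, label in train_docs)
term_counts = {c: Counter() for c in class_docs}
for tokens, label in train_docs:
    term_counts[label].update(tokens)


def log_posterior(tokens, c, alpha=1.0):
    """Unnormalised log P(c | tokens) with add-alpha smoothing."""
    total = sum(term_counts[c].values())
    score = math.log(class_docs[c] / len(train_docs))  # log prior
    for t in tokens:
        if t in vocab:  # ignore words never seen in training
            score += math.log((term_counts[c][t] + alpha) / (total + alpha * len(vocab)))
    return score


def predict(tokens):
    """Pick the class with the highest log posterior."""
    return max(class_docs, key=lambda c: log_posterior(tokens, c))


prediction = predict("great geologist mail".split())
print(prediction)
```

The score depends only on which words occur and how often, not on which question each word came from, which is one plausible reason this approach struggles on paraphrase detection.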

In [9]:
 
#accuracy = fraction of predictions that match the true label
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
print(accuracy_train,accuracy_test)

0.7518225749338666 0.7419243136284936


We got about 74% accuracy on the held-out 5% split (75% on the training data), which is not very good for this task.

### BERT
I adapted the Keras "Semantic Similarity with BERT" example to solve this problem.<br>
reference : https://keras.io/examples/nlp/semantic_similarity_with_bert/

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
!pip install transformers==2.11.0
import transformers

Collecting transformers==2.11.0
  Using cached transformers-2.11.0-py3-none-any.whl (674 kB)
Collecting tokenizers==0.7.0
  Using cached tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.5.2
    Uninstalling tokenizers-0.5.2:
      Successfully uninstalled tokenizers-0.5.2
  Attempting uninstall: transformers
    Found existing installation: transformers 2.8.0
    Uninstalling transformers-2.8.0:
      Successfully uninstalled transformers-2.8.0
Successfully installed tokenizers-0.7.0 transformers-2.11.0


In [5]:
max_length = 128  # Maximum length of input sentence to the model.
batch_size = 4
epochs = 2
path = "quora-question-pairs"
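# Codes returned by check_similarity, indexed by the model's output class.
# The one-hot targets put is_duplicate=0 at index 0 and is_duplicate=1 at index 1,
# so these codes are inverted relative to the dataset's is_duplicate column.
labels = [1,0]
#1 : Non Duplicate (returned when output class 0 wins)
#0 : Duplicate (returned when output class 1 wins)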
df = pd.read_csv(path+"/train.csv")

#### Preprocessing

In [6]:
#Dropping Null values
print(df.isnull().sum(axis=0))
df.dropna(axis=0,inplace=True)

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64


In [7]:
#create mask for train-test distribution
mask = np.random.rand(len(df)) < 0.7
train_df = df[mask]
not_train = df[~mask]

#create mask for val-test distribution
mask = np.random.rand(len(not_train)) < 0.5
test_df = not_train[mask]
val_df = not_train[~mask]
val_df.head(5)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0
17,17,35,36,Why do girls want to be friends with the guy t...,How do guys feel after rejecting a girl?,0
19,19,39,40,Which is the best digital marketing institutio...,Which is the best digital marketing institute ...,0
36,36,73,74,I'm a 19-year-old. How can I improve my skills...,I am a 19 year old guy. How can I become a bil...,0
41,41,83,84,When can I expect my Cognizant confirmation mail?,When can I expect Cognizant confirmation mail?,0


In [8]:
# Shape of the data
print(f"Total train samples : {train_df.shape[0]}")
print(f"Total validation samples: {val_df.shape[0]}")
print(f"Total test samples: {test_df.shape[0]}")

Total train samples : 282884
Total validation samples: 60802
Total test samples: 60601
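
As a quick sanity check on the two random masks, the sketch below compares the printed split sizes with the fractions the masks aim for: 0.7 for training, then half of the remaining 0.3 each for validation and test. It only uses the counts shown above.

```python
# Sample counts printed by the notebook for the three splits.
split_sizes = {"train": 282884, "validation": 60802, "test": 60601}
total = sum(split_sizes.values())

# Target fractions implied by the masks: 0.7, then 0.3 * 0.5 for each remaining half.
expected = {"train": 0.7, "validation": 0.15, "test": 0.15}

fractions = {name: size / total for name, size in split_sizes.items()}
for name, frac in fractions.items():
    print(f"{name:<11}{frac:.4f} (target {expected[name]:.2f})")
```

Because the masks are drawn with `np.random.rand` and no seed, the exact sizes will differ slightly on every run.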


In [9]:
print("Train Target Distribution")
print(train_df.is_duplicate.value_counts())

print("Validation Target Distribution")
print(val_df.is_duplicate.value_counts())

Train Target Distribution
0    178251
1    104633
Name: is_duplicate, dtype: int64
Validation Target Distribution
0    38343
1    22459
Name: is_duplicate, dtype: int64
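
The classes are imbalanced, which matters when reading accuracy numbers. The sketch below computes the majority-class baseline from the counts printed above, i.e. the accuracy of a model that always answers "not duplicate".

```python
# Class counts printed above (0 = not duplicate, 1 = duplicate).
train_counts = {0: 178251, 1: 104633}
val_counts = {0: 38343, 1: 22459}


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


train_baseline = majority_baseline(train_counts)
val_baseline = majority_baseline(val_counts)
print(f"train baseline:      {train_baseline:.3f}")
print(f"validation baseline: {val_baseline:.3f}")
```

Both baselines come out at roughly 63%, so any model's accuracy is best judged against that figure rather than against 50%.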


In [10]:
#One hot encoding 
y_train = tf.keras.utils.to_categorical(train_df.is_duplicate, num_classes=2)
print(f"y_train.shape:{y_train.shape}")

y_val = tf.keras.utils.to_categorical(val_df.is_duplicate, num_classes=2)
print(f"y_val.shape:{y_val.shape}")

y_test = tf.keras.utils.to_categorical(test_df.is_duplicate, num_classes=2)
print(f"y_test.shape:{y_test.shape}")

y_train.shape:(282884, 2)
y_val.shape:(60802, 2)
y_test.shape:(60601, 2)
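
For reference, this is what the one-hot encoding amounts to for two classes. The helper `one_hot` is a hypothetical pure-Python stand-in for `to_categorical`, written only to show which output index corresponds to which label.

```python
def one_hot(labels, num_classes=2):
    """Return one row per label with 1.0 at the label's index and 0.0 elsewhere."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)] for label in labels]


is_duplicate = [0, 1, 1, 0]
encoded = one_hot(is_duplicate)
for label, row in zip(is_duplicate, encoded):
    print(label, row)
```

So index 0 of the model's softmax output corresponds to "not duplicate" and index 1 to "duplicate".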


#### Custom Data Generator

In [11]:
class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    """Generates batches of data.

    Args:
        sentence_pairs: Array of question pairs (question1, question2).
        labels: Array of labels.
        batch_size: Integer batch size.
        shuffle: boolean, whether to shuffle the data.
        include_targets: boolean, whether to include the labels.

    Returns:
        Tuples `([input_ids, attention_mask, token_type_ids], labels)`
        (or just `[input_ids, attention_mask, token_type_ids]`
         if `include_targets=False`)
    """

    def __init__(
        self,
        sentence_pairs,
        labels,
        batch_size=batch_size,
        shuffle=True,
        include_targets=True,
    ):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        
        # Load our BERT Tokenizer to encode the text.
        # We will use the bert-base-uncased pretrained model.
        self.tokenizer = transformers.BertTokenizer.from_pretrained(
            "bert-base-uncased", do_lower_case=True
        )
        self.indexes = np.arange(len(self.sentence_pairs))
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch.
        return len(self.sentence_pairs) // self.batch_size

    def __getitem__(self, idx):
        # Retrieves the batch of index.
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pairs = self.sentence_pairs[indexes]

        # The BERT tokenizer's batch_encode_plus encodes both questions of each pair
        # together, separated by the [SEP] token.
        encoded = self.tokenizer.batch_encode_plus(
            sentence_pairs.tolist(),
            add_special_tokens=True,
            max_length=max_length,
            return_attention_mask=True,
            return_token_type_ids=True,
            pad_to_max_length=True,
            return_tensors="tf",
        )

        # Convert batch of encoded features to numpy array.
        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")

        # Set to true if data generator is used for training/validation.
        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="int32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]

    def on_epoch_end(self):
        # Shuffle indexes after each epoch if shuffle is set to True.
        if self.shuffle:
            np.random.RandomState(42).shuffle(self.indexes)
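
To show what the three arrays returned by the generator represent, here is a toy sketch of the sentence-pair layout. It splits on whitespace instead of using WordPiece and keeps tokens as strings instead of ids, so `toy_encode_pair` is a hypothetical illustration of the layout only, not a replacement for the real tokenizer. It also assumes the pair fits within `max_length`.

```python
def toy_encode_pair(q1, q2, max_length=24):
    """Mimic the layout of a BERT sentence-pair encoding with whitespace tokens."""
    tokens_a = q1.lower().split()
    tokens_b = q2.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair does not fit; truncation is not covered by this sketch")

    # Segment 0: [CLS], first question and its [SEP]. Segment 1: second question and final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(tokens)

    pad = max_length - len(tokens)
    return (
        tokens + ["[PAD]"] * pad,
        token_type_ids + [0] * pad,
        attention_mask + [0] * pad,
    )


tokens, token_type_ids, attention_mask = toy_encode_pair(
    "How can I be a good geologist?",
    "What should I do to be a great geologist?",
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

The attention mask tells the model to ignore padding, and the token type ids tell it where the first question ends and the second begins.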

#### Build The Model

In [12]:
# Create the model under a distribution strategy scope.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Encoded token ids from BERT tokenizer.
    input_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="input_ids"
    )
    # Attention masks indicate to the model which tokens should be attended to.
    attention_masks = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="attention_masks"
    )
    # Token type ids are binary masks identifying different sequences in the model.
    token_type_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="token_type_ids"
    )
    # Loading pretrained BERT model.
    bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
    # Freeze the BERT model to reuse the pretrained features without modifying them.
    bert_model.trainable = False

    sequence_output, pooled_output = bert_model(
        input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids
    )
    # Add trainable layers on top of frozen layers to adapt the pretrained features on the new data.
    bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(sequence_output)
    
    # Applying hybrid pooling approach to bi_lstm sequence output.
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
    concat = tf.keras.layers.concatenate([avg_pool, max_pool])
    dropout = tf.keras.layers.Dropout(0.3)(concat)
    output = tf.keras.layers.Dense(2, activation="softmax")(dropout)
    model = tf.keras.models.Model(
        inputs=[input_ids, attention_masks, token_type_ids], outputs=output
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["acc"],
    )


print(f"Strategy: {strategy}")
model.summary()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f6742d8cf98>
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
_______
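
The "hybrid pooling" step in the model can be pictured with a small pure-Python sketch: average pooling and max pooling each reduce the sequence over the time axis, and their results are concatenated. The helper names are hypothetical and there is no masking here, so this is only an illustration of the idea.

```python
def global_average_pool(sequence):
    """Average each feature over the time axis."""
    steps = len(sequence)
    return [sum(step[i] for step in sequence) / steps for i in range(len(sequence[0]))]


def global_max_pool(sequence):
    """Take the maximum of each feature over the time axis."""
    return [max(step[i] for step in sequence) for i in range(len(sequence[0]))]


# Toy sequence: 3 time steps with 2 features each. In the notebook the Bi-LSTM
# output has 128 time steps and 2 * 64 = 128 features per step.
sequence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
pooled = global_average_pool(sequence) + global_max_pool(sequence)
print(pooled)
```

With the notebook's sizes, the concatenated vector that feeds the dropout and dense layers would have 128 + 128 = 256 features.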

#### Create train and validation data generators

In [13]:
train_data = BertSemanticDataGenerator(
    train_df[["question1", "question2"]].values.astype("str"),
    y_train,
    batch_size=batch_size,
    shuffle=True,
)
val_data = BertSemanticDataGenerator(
    val_df[["question1", "question2"]].values.astype("str"),
    y_val,
    batch_size=batch_size,
    shuffle=False,
)
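
Since `__len__` uses integer division, an incomplete last batch is never served. The sketch below works out the number of batches per split from the sample counts printed earlier. The train and validation figures match the "Train for 70721 steps, validate for 15200 steps" line in the training log.

```python
batch_size = 4
split_sizes = {"train": 282884, "validation": 60802, "test": 60601}

batches = {}
for name, n_samples in split_sizes.items():
    n_batches = n_samples // batch_size           # same formula as __len__
    skipped = n_samples - n_batches * batch_size  # samples left in the incomplete last batch
    batches[name] = (n_batches, skipped)
    print(f"{name}: {n_batches} batches, {skipped} samples skipped per epoch")
```

With `shuffle=False` the skipped samples are always the last ones, so the test evaluation later in the post leaves out one test pair.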

#### Train the model
BERT is frozen at this stage, so only the top layers (the Bi-LSTM and the dense output layer) are trained. This is "feature extraction": the new layers learn to use the pretrained model's representations without changing them.

In [14]:
history = model.fit(
    train_data,
    validation_data=val_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,
)

Train for 70721 steps, validate for 15200 steps
Epoch 1/2
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Epoch 2/2


#### Fine Tuning

Now that the top layers have been trained on the frozen BERT features, we can unfreeze the pretrained BERT weights and retrain the whole model with a very low learning rate (1e-5) to adapt it to the actual task.

In [15]:
bert_model.trainable = True
# Recompile the model to make the change effective.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     ((None, 128, 768), ( 109482240   input_ids[0][0]                  
______________________________________________________________________________________________

In [16]:
import gc
gc.collect()  # free unused memory before the second training run
history = model.fit(
    train_data,
    validation_data=val_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,
)

Train for 70721 steps, validate for 15200 steps
Epoch 1/2
Epoch 2/2


Training took 4-5 hours.

#### Evaluate on Test Dataset

In [17]:
test_data = BertSemanticDataGenerator(
    test_df[["question1", "question2"]].values.astype("str"),
    y_test,
    batch_size=batch_size,
    shuffle=False,
)
model.evaluate(test_data, verbose=1)



[0.24381618442934172, 0.8969142]

We got about 90% accuracy (0.897) on the test dataset, which is far better than the roughly 74% from Naive Bayes.

In [19]:
def check_similarity(sentence1, sentence2):
  sentence_pairs = np.array([[str(sentence1), str(sentence2)]])
  test_data = BertSemanticDataGenerator(
      sentence_pairs, labels=None, batch_size=1, shuffle=False, include_targets=False,
  )

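  proba = model.predict(test_data)[0]
  idx = np.argmax(proba)
  # Note: the value is a probability in [0, 1] (0.80 means 80%), despite the "%" suffix.
  proba = f"{proba[idx]: .2f}%"
  # labels = [1,0], so 0 is returned for duplicates and 1 for non-duplicates.
  pred = labels[idx]
  return pred, proba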
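
If a more readable result is preferred, the mapping from softmax output to answer could look like the sketch below. `CLASS_NAMES` and `describe_prediction` are hypothetical names that are not part of the notebook. The index order follows the one-hot targets, where index 1 means `is_duplicate = 1`.

```python
CLASS_NAMES = ["not duplicate", "duplicate"]  # hypothetical; index = is_duplicate value


def describe_prediction(proba):
    """Return the predicted class name and its probability formatted as a percentage."""
    idx = max(range(len(proba)), key=lambda i: proba[i])  # pure-Python argmax
    return CLASS_NAMES[idx], f"{proba[idx]:.2%}"


result = describe_prediction([0.2, 0.8])
print(result)
```

The `:.2%` format specifier multiplies by 100 before adding the percent sign, so a probability of 0.8 is shown as 80.00%.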

#### Try sample questions from the test set

In [20]:
ind = np.random.randint(0,500) 
#Duplicate Questions
q1 = test_df[test_df["is_duplicate"] == 1].iloc[ind]['question1']
q2 = test_df[test_df["is_duplicate"] == 1].iloc[ind]['question2']
print(q1+"\n"+q2)
check_similarity(q1,q2)

What is wrong with today's education system?
What are the things going wrong in our education system?


(0, ' 0.80%')

In [21]:
ind = np.random.randint(0,500)
#Non-Duplicate Questions
q1 = test_df[test_df["is_duplicate"] == 0].iloc[ind]['question1']
q2 = test_df[test_df["is_duplicate"] == 0].iloc[ind]['question2']
print(q1+"\n"+q2)
check_similarity(q1,q2)

What do you think of this blog?
What do you think about this blog forfishingvideos.blogspot.com?


(1, ' 1.00%')