# GloVe embedding for PAWS and MRCP

#### In this notebook we are implmenting glove embedding on PAWS and MRCP dataset, detailed instructions on how to run this notebook is provided either in terms of markdown or comments, all the datasets are present in the data folder

## Importing all the necessary libraries for this notebook

In [1]:
## Data Processing imports
import nltk
import string
import re
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

## Model building imports
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# SKlearn imports
from sklearn.model_selection import train_test_split

In [2]:
# Downloads for string cleaning
wn = nltk.WordNetLemmatizer()
nltk.download('stopwords')
nltk.download('wordnet')

# Cleaning function for the strings
def clean_string(input_str):
    
    # Lowercase the input_string
    input_str = input_str.lower()
    
    # Remove URLs, links
    input_str = re.sub(r"http\S+", "", input_str)
    input_str = re.sub(r"www.\S+", "", input_str)
    input_str = re.sub(r"\S+@\S+", "", input_str)
    
    # Remove punctuations
    input_str_punc = "".join(char for char in input_str if char not in string.punctuation)

    # Remove stopwords
    stopword = nltk.corpus.stopwords.words('english')
    input_str_stopwords = " ".join([word for word in re.split('\W+', input_str_punc) if word not in stopword])
    
    # Lemmatization
    input_str_cleaned = " ".join([wn.lemmatize(word,'n') for word in re.split('\W+', input_str_stopwords)])

    return input_str_cleaned

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ameyagidh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ameyagidh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# This function is used to load either paws or mrcp dataset, by passing the argument `paws` or `mrcp` to the funtion
def load_dataset_for_glove(data):
    if data == "paws":
        df = pd.read_csv('../data/paws/train.csv')
        test_data = pd.read_csv('../data/paws/test.csv')
    elif data == 'mrcp':
        df = pd.read_csv('../data/mrcp-data/msr_paraphrase_train.csv')
        test_data = pd.read_csv('../data/mrcp-data/msr_paraphrase_test.csv')
    return df, test_data, data

In [4]:
df, test_data, dataset = load_dataset_for_glove('paws') # or load_dataset_for_glove('mrcp')

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,sentence1,sentence2,label
0,0,"In Paris , in October 1560 , he secretly met t...","In October 1560 , he secretly met with the Eng...",0
1,1,The NBA season of 1975 -- 76 was the 30th seas...,The 1975 -- 76 season of the National Basketba...,1
2,2,"There are also specific discussions , public p...","There are also public discussions , profile sp...",0
3,3,When comparable rates of flow can be maintaine...,The results are high when comparable flow rate...,1
4,4,It is the seat of Zerendi District in Akmola R...,It is the seat of the district of Zerendi in A...,1


##### Cleaning the sentences for train data

In [6]:
df.sentence2 = df['sentence2'].astype('str')
df.sentence1 = df.sentence1.apply(lambda x: clean_string(x))
df.sentence2 = df.sentence2.apply(lambda x: clean_string(x))

In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,sentence1,sentence2,label
0,0,paris october 1560 secretly met english ambass...,october 1560 secretly met english ambassador n...,0
1,1,nba season 1975 76 30th season national basket...,1975 76 season national basketball association...,1
2,2,also specific discussion public profile debate...,also public discussion profile specific discus...,0
3,3,comparable rate flow maintained result high,result high comparable flow rate maintained,1
4,4,seat zerendi district akmola region,seat district zerendi akmola region,1


In [8]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,sentence1,sentence2,label
0,0,This was a series of nested angular standards ...,"This was a series of nested polar scales , so ...",0
1,1,His father emigrated to Missouri in 1868 but r...,"His father emigrated to America in 1868 , but ...",0
2,2,"In January 2011 , the Deputy Secretary General...","In January 2011 , FIBA Asia deputy secretary g...",1
3,3,"Steiner argued that , in the right circumstanc...",Steiner held that the spiritual world can be r...,0
4,4,"Luciano Williames Dias ( born July 25 , 1970 )...",Luciano Williames Dias ( born 25 July 1970 ) i...,0


##### Cleaning data for the test data

In [9]:
test_data.sentence2 = test_data['sentence2'].astype('str')
test_data.sentence1 = test_data.sentence1.apply(lambda x: clean_string(x))
test_data.sentence2 = test_data.sentence2.apply(lambda x: clean_string(x))

In [10]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,sentence1,sentence2,label
0,0,series nested angular standard measurement azi...,series nested polar scale measurement azimuth ...,0
1,1,father emigrated missouri 1868 returned wife b...,father emigrated america 1868 returned wife be...,0
2,2,january 2011 deputy secretary general fiba asi...,january 2011 fiba asia deputy secretary genera...,1
3,3,steiner argued right circumstance spiritual wo...,steiner held spiritual world researched right ...,0
4,4,luciano williames dia born july 25 1970 brazil...,luciano williames dia born 25 july 1970 former...,0


#### Checking the max number of words any sentence has, which will be used while converting text to sequences and then to pad, so that length of all the input sentences are equal

In [11]:
s1 = test_data['sentence1']
s2 = test_data['sentence2']
max_len = 0
for i, j in zip(s1, s2):
    max_len = max(max_len, max(len(i.split()), len(j.split())))

In [12]:
max_len

24

In [13]:
MAX_NB_WORDS = 20000
tokenizer = Tokenizer(num_words = MAX_NB_WORDS)
tokenizer.fit_on_texts(list(df['sentence1'].values.astype(str))+list(df['sentence2'].values.astype(str)))


In [14]:
# Tokenize the text in the 'sentence1' column of the dataframe
X_train_q1 = tokenizer.texts_to_sequences(df['sentence1'].values.astype(str))

# Pad the sequences in X_train_q1 with zeros to a maximum length of 25
# The padding is done after the sequence
X_train_q1 = pad_sequences(X_train_q1, maxlen=25, padding='post')

# Tokenize the text in the 'sentence2' column of the dataframe
X_train_q2 = tokenizer.texts_to_sequences(df['sentence2'].values.astype(str))

# Pad the sequences in X_train_q2 with zeros to a maximum length of 25
# The padding is done after the sequence
X_train_q2 = pad_sequences(X_train_q2, maxlen=25, padding='post')


In [15]:
# Get the text data from the 'sentence1' column of the test_data dataframe
X_testq1 = test_data['sentence1']

# Get the text data from the 'sentence2' column of the test_data dataframe
X_testq2 = test_data['sentence2']

# Convert the text in X_testq1 to numerical sequences using the tokenizer object
X_test_q1 = tokenizer.texts_to_sequences(X_testq1.ravel())

# Pad the sequences in X_test_q1 with zeros to a maximum length of 25
# The padding is done after the sequence
X_test_q1 = pad_sequences(X_test_q1, maxlen=25, padding='post')

# Convert the text in X_testq2 to numerical sequences using the tokenizer object
X_test_q2 = tokenizer.texts_to_sequences(X_testq2.astype(str).ravel())

# Pad the sequences in X_test_q2 with zeros to a maximum length of 25
# The padding is done after the sequence
X_test_q2 = pad_sequences(X_test_q2, maxlen=25, padding='post')


## Loading GloVe Embedding

In [16]:
word_index = tokenizer.word_index

In [17]:
import codecs
embedding_index = {}
with codecs.open('../data/glove.6B.200d.txt',encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_index[word] = vectors
    f.close()

### Creating our Embedding matrix

In [18]:
embedding_matrix = np.random.random((len(word_index)+1, 200))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [19]:
y = to_categorical(df['label'])

## Model Building

### Model 1 for sentence1

In [20]:
# Define a sequential model for Q1
model_q1 = tf.keras.Sequential()

# Add an Embedding layer to the model with the specified input dimension, output dimension, 
# weights, and input length
model_q1.add(Embedding(input_dim=len(word_index) + 1,
                       output_dim=200,
                       weights=[embedding_matrix],
                       input_length=25))

# Add an LSTM layer with 128 units, 'relu' activation function, and return sequences flag set to True
model_q1.add(LSTM(128, activation='relu', return_sequences=True))

# Add a dropout layer with a rate of 0.2
model_q1.add(Dropout(0.2))

# Add another LSTM layer with 128 units, 'relu' activation function, and return sequences flag set to True
model_q1.add(LSTM(128, activation='relu', return_sequences=True))

# Add another dropout layer with a rate of 0.2
model_q1.add(Dropout(0.2))

# Add a dense layer with 64 units and 'relu' activation function
model_q1.add(Dense(64, activation='relu'))

# Add another dropout layer with a rate of 0.2
model_q1.add(Dropout(0.2))

# Add a dense layer with 2 units and 'sigmoid' activation function
model_q1.add(Dense(2, activation='sigmoid'))


### Model 2 for sentence2

In [21]:
# Define a sequential model for Q2
model_q2 = tf.keras.Sequential()

# Add an Embedding layer to the model with the specified input dimension, output dimension, 
# weights, and input length
model_q2.add(Embedding(input_dim=len(word_index) + 1,
                       output_dim=200,
                       weights=[embedding_matrix],
                       input_length=25))

# Add an LSTM layer with 128 units, 'relu' activation function, and return sequences flag set to True
model_q2.add(LSTM(128, activation='relu', return_sequences=True))

# Add a dropout layer with a rate of 0.2
model_q2.add(Dropout(0.2))

# Add another LSTM layer with 128 units, 'relu' activation function, and return sequences flag set to True
model_q2.add(LSTM(128, activation='relu', return_sequences=True))

# Add another dropout layer with a rate of 0.2
model_q2.add(Dropout(0.2))

# Add a dense layer with 64 units and 'relu' activation function
model_q2.add(Dense(64, activation='relu'))

# Add another dropout layer with a rate of 0.2
model_q2.add(Dropout(0.2))

# Add a dense layer with 2 units and 'sigmoid' activation function
model_q2.add(Dense(2, activation='sigmoid'))


### Merging both the models

In [22]:
# Merging the output of the two models, i.e., model_q1 and model_q2
# Multiply the output tensors element-wise
mergedOut = Multiply()([model_q1.output, model_q2.output])

# Flatten the output tensor
mergedOut = Flatten()(mergedOut)

# Add a fully connected layer with 128 units and ReLU activation
mergedOut = Dense(128, activation='relu')(mergedOut)

# Apply a dropout of 20% to the previous layer
mergedOut = Dropout(0.2)(mergedOut)

# Add another fully connected layer with 32 units and ReLU activation
mergedOut = Dense(32, activation='relu')(mergedOut)

# Apply another dropout of 20% to the previous layer
mergedOut = Dropout(0.2)(mergedOut)

# Add the final fully connected layer with 2 units and sigmoid activation
mergedOut = Dense(2, activation='sigmoid')(mergedOut)


In [23]:
new_model = Model([model_q1.input, model_q2.input], mergedOut)

### Implementing Kfold validation

In [24]:
from sklearn.model_selection import KFold
from keras.callbacks import EarlyStopping

# Define number of folds
num_folds = 5

# Initialize k-fold cross validator
kfold = KFold(n_splits=num_folds, shuffle=True)
validation_scores = []

In [25]:
for fold, (train_indices, val_indices) in enumerate(kfold.split(X_train_q1, y)):
    
    print(f"Fold {fold+1}:")
    
    # Split data into training and validation sets
    X1_train, X1_val = X_train_q1[train_indices], X_train_q1[val_indices]
    X2_train, X2_val = X_train_q2[train_indices], X_train_q2[val_indices]
    Y_train, Y_val = y[train_indices], y[val_indices]
    
    new_model.compile(optimizer = 'adam', loss = 'binary_crossentropy',
                 metrics = ['accuracy'])
    
    new_model.fit([X1_train,X2_train],Y_train, batch_size = 64 if dataset == "mrcp" else 2000, epochs = 5, verbose=1,
                  callbacks=[EarlyStopping(patience=2)])
    
    # Evaluate model on validation set and store accuracy
    scores = new_model.evaluate([X1_val, X2_val], Y_val, verbose=1)
    validation_scores.append(scores[1])
    
    print(f"Validation accuracy: {scores[1]}")
    
mean_accuracy = np.mean(validation_scores)
print(f"\nMean validation accuracy: {mean_accuracy}")

Fold 1:
Epoch 1/5


2024-03-14 06:31:00.974644: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Validation accuracy: 0.5635057091712952
Fold 2:
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Validation accuracy: 0.6041498184204102
Fold 3:
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Validation accuracy: 0.616295576095581
Fold 4:
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Validation accuracy: 0.5946356058120728
Fold 5:
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Validation accuracy: 0.623481810092926

Mean validation accuracy: 0.600413703918457


### Testing Accuracy

In [26]:
loss, accuracy = new_model.evaluate([X_test_q1, X_test_q2], to_categorical(test_data['label'].values))

print('Test accuracy:', accuracy)

Test accuracy: 0.5578749775886536


### Classification Report

In [27]:
from sklearn.metrics import classification_report
y_pred = new_model.predict([X_test_q1,X_test_q1])

y_pred_final = []
for x in y_pred:
    if x[0] > x[1]:
        y_pred_final.append(0)
    else:
        y_pred_final.append(1)

print(classification_report(test_data['label'].values, y_pred_final))

              precision    recall  f1-score   support

           0       0.56      0.84      0.67      4464
           1       0.47      0.18      0.26      3536

    accuracy                           0.55      8000
   macro avg       0.52      0.51      0.47      8000
weighted avg       0.52      0.55      0.49      8000



### Note: This reslts are for paws, but if you just load the data for mrcp as mentioned in the beggining this will give us 63% as testing and 99% as training which again tells us that our data is overfitting

## Conclusion:

Based on the evaluation results i.e. 56% for paws and 63% for mrcp, it appears that the model is overfitting the training data, as indicated by the significantly higher training accuracy compared to the test accuracy. This means that the model is performing well on the training data, but is not generalizing well to new, unseen data.

Overfitting occurs when the model is too complex relative to the amount of training data available, or when the model is trained for too many epochs. In this case, it may be necessary to simplify the model architecture, reduce the number of training epochs, or increase the amount of training data.

Overall, it is important to take measures to prevent overfitting in order to ensure that the model generalizes well to new data and performs well in real-world scenarios.