# Important note before you read on🤚

I have used this notebook simply as a learning thing, so I won't be using the test data of the quora dataset for now. Maybe I will upload that one day, who knows 🤷‍♂️. I am also using only a small portion of the data since using even just 10,000 rows (let alone the whole dataset) was causing memory overuse in kaggle's environment.


#### That said, I of course appreciate any recommendations to add to this notebook since it is fun to just discuss and learn some cool new NLP stuff. Alright folks!😎 Hope you find this interesting!🤩

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/questions.csv


In [2]:
from transformers import AutoTokenizer, AutoModel # for tokenizing and word embeddings
import torch # for tensor processing

# for data preprocessing
import nltk
from nltk.tokenize import word_tokenize
import string
import re

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## Let's bring in the data

In [3]:
df = pd.read_csv('/kaggle/input/questions.csv')
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [4]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [5]:
df.dropna(inplace=True)

In [6]:
df = df.sample(1200, random_state=42)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1200 entries, 371032 to 91783
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1200 non-null   int64 
 1   qid1          1200 non-null   int64 
 2   qid2          1200 non-null   int64 
 3   question1     1200 non-null   object
 4   question2     1200 non-null   object
 5   is_duplicate  1200 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 65.6+ KB


In [8]:
df.index = range(len(df))
df.drop(columns=df.columns[[0, 1, 2]], axis=1, inplace=True)

In [9]:
df.head()

Unnamed: 0,question1,question2,is_duplicate
0,Do people realize that you can send marijuana ...,How do you send weed through the mail?,0
1,How can rock music be brought back?,What would it take for rock music to make a co...,1
2,Why does one feel relaxed after smoking a join...,How do I sober up quickly after smoking weed/m...,0
3,How to gain weight ?,How do I gain weight fast but still be healthy?,1
4,Is porn bad for men?,Can I become a porn fan without getting addicted?,0


# Adding Bag Of Words

One very important step is to make explicit the vocabulary that we are going to use. Unfortunately our vocabulary probably won't be too large since our dataset is barely a few thousand compared to the 400K data size for reasons I mentioned at the start, but this is a sacrifice we will make for now until I find something clever to do about it.

In [10]:
class BOWAdder:
    def __init__(self):
        self.q1 = None
        self.q2 = None
        self.vocab_ = None
        return None
    
    def fit(self, X, y=None):
        self.q1 = X.iloc[:, 0]
        self.q2 = X.iloc[:, 1]
        return self
    
    def transform(self, X, y=None):
        cv = CountVectorizer(max_features=10000, ngram_range=(1,1))
        
        questions = list(self.q1) + list(self.q2)
        q1_vecs, q2_vecs = np.vsplit(cv.fit_transform(questions).toarray(), 2)
        q1_vecs_df, q2_vecs_df = pd.DataFrame(q1_vecs), pd.DataFrame(q2_vecs)
        
        self.vocab_ = cv.vocabulary_
        
        X_cpy = X.copy(deep=True)
        X_cpy.reset_index(inplace=True)
        
        X_new = pd.concat([X_cpy, q1_vecs_df, q2_vecs_df], axis=1)
        X_new.drop(X_new.columns[0], axis=1, inplace=True)
        return X_new

In [11]:
bow = BOWAdder().fit(df)
df = bow.transform(df)
df

Unnamed: 0,question1,question2,is_duplicate,0,1,2,3,4,5,6,...,4348,4349,4350,4351,4352,4353,4354,4355,4356,4357
0,Do people realize that you can send marijuana ...,How do you send weed through the mail?,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,How can rock music be brought back?,What would it take for rock music to make a co...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Why does one feel relaxed after smoking a join...,How do I sober up quickly after smoking weed/m...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,How to gain weight ?,How do I gain weight fast but still be healthy?,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Is porn bad for men?,Can I become a porn fan without getting addicted?,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,Who has the best midrange game in the NBA?,What are good reasons for NBA teams to still u...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1196,What's the best site to learn German?,What are the best online resources to learn th...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1197,What is the average height of females in the U.S?,What is the average height for males and femal...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1198,How do I unlock my iPhone with another carrier?,How do I unlock iPhone from carrier?,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
len(bow.vocab_) # this will come in handy later

4358

We have about 4.4K words in our vocabulary, which isn't a lot compared to the actual no. of words a state of the art model would have, but it should do for now.

In [13]:
sorted(bow.vocab_.items(), key=lambda x: x[1], reverse=True)

[('出している', 4357),
 ('出してある', 4356),
 ('你对中国的印象如何', 4355),
 ('zr', 4354),
 ('zone', 4353),
 ('zits', 4352),
 ('zinc', 4351),
 ('zero', 4350),
 ('zener', 4349),
 ('zencart', 4348),
 ('zeljko', 4347),
 ('yukon', 4346),
 ('youtubers', 4345),
 ('youtube', 4344),
 ('yourself', 4343),
 ('your', 4342),
 ('young', 4341),
 ('you', 4340),
 ('yorker', 4339),
 ('yoda', 4338),
 ('yo', 4337),
 ('yet', 4336),
 ('yes', 4335),
 ('years', 4334),
 ('year', 4333),
 ('yamaha', 4332),
 ('yahoo', 4331),
 ('yaga', 4330),
 ('yacht', 4329),
 ('xx', 4328),
 ('xperia', 4327),
 ('xlri', 4326),
 ('xii', 4325),
 ('xi', 4324),
 ('wwii', 4323),
 ('wwdc', 4322),
 ('wrong', 4321),
 ('written', 4320),
 ('writing', 4319),
 ('writer', 4318),
 ('write', 4317),
 ('would', 4316),
 ('worthwhile', 4315),
 ('worth', 4314),
 ('worst', 4313),
 ('worse', 4312),
 ('worrying', 4311),
 ('worried', 4310),
 ('world', 4309),
 ('works', 4308),
 ('workout', 4307),
 ('workload', 4306),
 ('working', 4305),
 ('workers', 4304),
 ('work', 4303),

In [14]:
from sklearn.model_selection import train_test_split
X, y = df.drop('is_duplicate', axis=1), df['is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The Pipeline
What we'll try now is we'll use inputs from the MiniLM model, the jaccard similarity index features as well as BOW vectorizations and put it into a machine learning model in order to check if the accuracy is increased or not.

Here's what our preprocessing pipeline for training will look like:  
1. Preprocesser
2. MiniLM model cosine similarities  
3. Jaccard index features  

After training and testing, we will create a new pipeline which will be the same except for an additional count vectorizer at the beginning for creating bag of words for the input. This is where we will reuse the vocabulary that we have found so far.

## 1. Preprocessor

Basically removes any special characters and lowercases the entire thing.

In [15]:
class Preprocesser():
    def __init__(self):
        self.q1 = None
        self.q2 = None
        return None

    def preprocess(self, doc):
        doc = doc.lower()
        doc = re.sub('[^a-zA-Z1-9]', ' ', doc)
        return doc

    def process_docs(self, documents):
        new_docs = [self.preprocess(doc) for doc in documents]
        return new_docs
    
    # The 2 main functions
    def fit(self, X, y=None):
        self.q1 = X.iloc[:, 0]
        self.q2 = X.iloc[:, 1]
        return self
    
    def transform(self, X, y=None):
        X_new = pd.DataFrame(np.zeros(shape=X.shape))
        
        newq1 = self.process_docs(self.q1)
        newq2 = self.process_docs(self.q2)
        
        X_new.iloc[:, 0] = newq1
        X_new.iloc[:, 1] = newq2
        return X_new

In [16]:
p = Preprocesser().fit(X_train)
df1 = p.transform(X_train)
df1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8708,8709,8710,8711,8712,8713,8714,8715,8716,8717
0,what is the best laptop in price range 5 ...,what is the best gaming laptop i can buy under...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,i have scored 215 in neet 2 16 i am sc candid...,i have scored 215 in neet 2 16 i am sc candid...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,why do you want to join banks,why do you want to join our bank,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,can i say instead of here,what can i say instead of how are you,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,do you need friends,why do we need friends,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
955,what is the life of a web developer,what is the future of web development,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
956,is nyfa sydney good for film making,is it possible for lnow status about cybercrim...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
957,how do i track a cell phone by number for free,how can i locate my cell phone with the phone ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
958,what is the worst experience that you have had...,what are worst experiences a student has had w...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2. MiniLM model similarities

In [17]:
class MiniLMModel():
    def __init__(self):
        self.q1 = None
        self.q2 = None
        return None

    def get_embeddings(self, docs):
        model_name = 'sentence-transformers/multi-qa-MiniLM-L6-cos-v1'
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name)
        
        tokens = {
            'input_ids' : [],
            'attention_mask' : []
        }

        for doc in docs:
            new_tokens = tokenizer.encode_plus(doc, max_length=128, truncation=True, padding='max_length', return_tensors='pt')
            tokens['input_ids'].append(new_tokens['input_ids'][0])
            tokens['attention_mask'].append(new_tokens['attention_mask'][0])

        tokens['input_ids'] = torch.stack(tokens['input_ids'])
        tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

        outputs = model(**tokens)

        embeddings = outputs['last_hidden_state']
        attention = tokens['attention_mask']
        mask = attention.unsqueeze(-1).expand(embeddings.shape).float()

        # important line, helps to get only the embeddings for the useful words
        mask_embeddings = embeddings * mask

        summed = torch.sum(mask_embeddings, 1)
        counts = torch.clamp(mask.sum(1), min=1e-9)
        mean_pooled = summed / counts

        # using this line since i want to get a numpy array for calculations, not a tensor
        mean_pooled = mean_pooled.detach().numpy()
        return mean_pooled

    def similar(self, docs1, docs2):
        q1_emb = self.get_embeddings(list(docs1))
        q2_emb = self.get_embeddings(list(docs2))
        
        sim = np.array([cosine_similarity([q1_emb[i], q2_emb[i]])[0, 1] for i in range(len(q1_emb))])
        return sim
    
    def fit(self, X, y=None):
        self.q1 = X.iloc[:, 0]
        self.q2 = X.iloc[:, 1]
        return self
    
    def transform(self, X, y=None):
        X_new = pd.DataFrame(np.zeros(shape=X.shape[0]), columns=['semnatic_similarity'])
        X_new.iloc[:, 0] = self.similar(self.q1, self.q2)
        
        X_cpy = X.copy(deep=True)
        X_cpy.reset_index(inplace=True)
        
        X_new2 = pd.concat([X_cpy, X_new], axis=1)
        X_new2.drop(X_new2.columns[0], axis=1, inplace=True)
        return X_new2

In [18]:
mlm = MiniLMModel().fit(df1)
df2 = mlm.transform(df1)
df2

Downloading (…)okenizer_config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8709,8710,8711,8712,8713,8714,8715,8716,8717,semnatic_similarity
0,what is the best laptop in price range 5 ...,what is the best gaming laptop i can buy under...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.690774
1,i have scored 215 in neet 2 16 i am sc candid...,i have scored 215 in neet 2 16 i am sc candid...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.845588
2,why do you want to join banks,why do you want to join our bank,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.945438
3,can i say instead of here,what can i say instead of how are you,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.577379
4,do you need friends,why do we need friends,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.829165
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
955,what is the life of a web developer,what is the future of web development,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.743196
956,is nyfa sydney good for film making,is it possible for lnow status about cybercrim...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102225
957,how do i track a cell phone by number for free,how can i locate my cell phone with the phone ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.713356
958,what is the worst experience that you have had...,what are worst experiences a student has had w...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.923860


In [19]:
class JacIdx():
    def __init__(self):
        self.q1 = None
        self.q2 = None
        return None
    
    def fit(self, X, y=None):
        self.q1 = X.iloc[:, 0]
        self.q2 = X.iloc[:, 1]
        return self
    
    def transform(self, X, y=None):
        jacc = []
        docs1, docs2 = self.q1, self.q2
        for i in range(len(docs1)):
        #     print(ok3.iloc[i, 0], ok3.iloc[i, 1])
            s1 = set(word_tokenize(docs1.iloc[i]))
            s2 = set(word_tokenize(docs2.iloc[i]))
            jacsimilar = len(s1.intersection(s2)) / len(s1.union(s2))
            jacc.append(jacsimilar)
        jacc = pd.DataFrame(jacc, columns=['jaccard_similarity'])
        
        X_new = pd.concat([X, jacc], axis=1)
        return X_new

In [20]:
j = JacIdx().fit(df2)
df3 = j.transform(df2)
df3

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8710,8711,8712,8713,8714,8715,8716,8717,semnatic_similarity,jaccard_similarity
0,what is the best laptop in price range 5 ...,what is the best gaming laptop i can buy under...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.690774,0.368421
1,i have scored 215 in neet 2 16 i am sc candid...,i have scored 215 in neet 2 16 i am sc candid...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.845588,0.761905
2,why do you want to join banks,why do you want to join our bank,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.945438,0.666667
3,can i say instead of here,what can i say instead of how are you,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.577379,0.500000
4,do you need friends,why do we need friends,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.829165,0.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
955,what is the life of a web developer,what is the future of web development,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.743196,0.500000
956,is nyfa sydney good for film making,is it possible for lnow status about cybercrim...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102225,0.142857
957,how do i track a cell phone by number for free,how can i locate my cell phone with the phone ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.713356,0.312500
958,what is the worst experience that you have had...,what are worst experiences a student has had w...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.923860,0.350000


### Finalizing the data

In [21]:
traindata = df3.iloc[:, 2:]
traindata

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,8710,8711,8712,8713,8714,8715,8716,8717,semnatic_similarity,jaccard_similarity
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.690774,0.368421
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.845588,0.761905
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.945438,0.666667
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.577379,0.500000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.829165,0.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
955,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.743196,0.500000
956,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102225,0.142857
957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.713356,0.312500
958,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.923860,0.350000


In [22]:
p2 = Preprocesser().fit(X_test)
df4 = p2.transform(X_test)

mlm2 = MiniLMModel().fit(df4)
df5 = mlm2.transform(df4)

j2 = JacIdx().fit(df5)
df6 = j2.transform(df5)

testdata = df6.iloc[:, 2:]
testdata

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,8710,8711,8712,8713,8714,8715,8716,8717,semnatic_similarity,jaccard_similarity
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.832594,0.357143
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.464905,0.083333
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.887286,0.545455
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069391,0.269231
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.686486,0.300000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.993269,1.000000
236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.937554,0.500000
237,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.655941,0.307692
238,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.546200,0.111111


In [23]:
traindata.columns = traindata.columns.astype(str)
testdata.columns = testdata.columns.astype(str)

traindata = traindata.values
testdata = testdata.values

In [24]:
traindata.shape, testdata.shape

((960, 8718), (240, 8718))

### Alright, it's time for the models to walk on the red carpet 🤟

In [25]:
from sklearn.model_selection import cross_val_score
all_questions = np.concatenate([traindata, testdata], axis=0)
all_y = np.concatenate([y_train, y_test], axis=0)

In [26]:
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(traindata, y_train)
y_pred = clf.predict(testdata)
accuracy_score(y_pred, y_test), np.mean(cross_val_score(RandomForestClassifier(), all_questions, all_y, cv=10, scoring='accuracy'))

(0.7166666666666667, 0.75)

In [27]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(traindata, y_train)
y_pred2 = lr.predict(testdata)
accuracy_score(y_pred2, y_test), np.mean(cross_val_score(LogisticRegression(), all_questions, all_y, cv=10, scoring='accuracy'))

(0.75, 0.7683333333333333)

In [28]:
from xgboost import XGBClassifier
xg = XGBClassifier().fit(traindata, y_train)
y_pred3 = xg.predict(testdata)
accuracy_score(y_pred3, y_test), np.mean(cross_val_score(XGBClassifier(), all_questions, all_y, cv=10, scoring='accuracy'))

(0.725, 0.7408333333333333)

In [29]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(traindata, y_train)
y_pred4 = gnb.predict(testdata)
accuracy_score(y_pred4, y_test), np.mean(cross_val_score(GaussianNB(), all_questions, all_y, cv=10, scoring='accuracy'))

(0.7541666666666667, 0.7441666666666666)

In [30]:
from sklearn.svm import SVC
svc = SVC(kernel='linear', C=1, degree=1).fit(traindata, y_train)
y_pred5 = svc.predict(testdata)
accuracy_score(y_pred5, y_test), np.mean(cross_val_score(SVC(kernel='linear', C=1, degree=1), all_questions, all_y, cv=10, scoring='accuracy'))

(0.7708333333333334, 0.7674999999999998)

In [31]:
class ExtraProcessor:
    def __init__(self):
        self.data = None
        return None
    
    def fit(self, X, y=None):
        self.data = X
        return self
    
    def transform(self, X, y=None):
        X_new = self.data.iloc[:, 2:]
        X_new.columns = X_new.columns.astype(str)
        return X_new

In [32]:
pipe = make_pipeline(Preprocesser(), MiniLMModel(), JacIdx(), ExtraProcessor())

In [33]:
pipe.fit_transform(X_train)

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,8710,8711,8712,8713,8714,8715,8716,8717,semnatic_similarity,jaccard_similarity
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.690774,0.368421
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.845588,0.761905
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.945438,0.666667
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.577379,0.500000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.829165,0.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
955,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.743196,0.500000
956,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102225,0.142857
957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.713356,0.312500
958,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.923860,0.350000


### Alright, the pipeline works!

In [34]:
# print('Please enter a minimum of 10 words in each sentence!\n')
# s1 = input('Enter a sentence:')
# s2 = input('Enter another sentence:')

In [35]:
# test_df = pd.DataFrame([[s1, s2]])
# test_df

In [36]:
# pipe.fit_transform(test_df)

### Let's use Logistic Regression to predict since it was the one with the best cross validation score

In [37]:
class BOWAdder2:
    def __init__(self, vocab_):
        self.q1 = None
        self.q2 = None
        self.vocab_ = vocab_
        return None
    
    def fit(self, X, y=None):
        self.q1 = X.iloc[:, 0]
        self.q2 = X.iloc[:, 1]
        return self
    
    def transform(self, X, y=None):
        cv = CountVectorizer(vocabulary=self.vocab_, max_features=10000, ngram_range=(1,1))
        
        questions = list(self.q1) + list(self.q2)
        q1_vecs, q2_vecs = np.vsplit(cv.fit_transform(questions).toarray(), 2)
        q1_vecs_df, q2_vecs_df = pd.DataFrame(q1_vecs), pd.DataFrame(q2_vecs)
        
        self.vocab_ = cv.vocabulary_
        
        X_cpy = X.copy(deep=True)
        X_cpy.reset_index(inplace=True)
        
        X_new = pd.concat([X_cpy, q1_vecs_df, q2_vecs_df], axis=1)
        X_new.drop(X_new.columns[0], axis=1, inplace=True)
        return X_new

In [38]:
preprocess_pipe = make_pipeline(BOWAdder2(bow.vocab_), pipe)

In [39]:
# lr.predict(preprocess_pipe.fit_transform(test_df))

I think that worked nice. Let's see how this works for other pairs.

In [40]:
def sentence_similarity(s1=None, s2=None):
    if s1 is None:
        s1 = input('Sentence 1:')
    if s2 is None:
        s2 = input('Sentence 2:')
    
    test_df = pd.DataFrame([[s1, s2]])
    res = 'Similar' if (lr.predict(preprocess_pipe.fit_transform(test_df))[0] == 1) else 'Not Similar'
    print(res)

In [41]:
docs = [
    "this is a great day",
    "this is an awesome day",
    
    "The kitchen was filled with the enticing scent of bread straight from the oven, evoking an irresistible craving in everyone present.",
    "The smell of fresh bread spread through the kitchen, watering everyone's mouth.",
    
    "Climate change is an urgent global crisis that demands immediate action. Rising temperatures, melting ice caps, and extreme weather events are clear indicators of the planet's deteriorating health. In the face of this challenge, nations must come together to reduce greenhouse gas emissions and transition to sustainable energy sources. Failure to address climate change effectively will result in irreversible damage to our environment, impacting not only current generations but also the well-being of future ones.",
    "The ramifications of climate change are becoming increasingly evident on a global scale. As the Earth's temperature continues to rise, polar ice is rapidly vanishing, and catastrophic weather events are becoming more frequent. To combat this crisis, international cooperation is paramount. It is crucial for countries to commit to substantial reductions in carbon emissions and to embrace renewable energy solutions. Failing to take decisive action on climate change will have far-reaching consequences, jeopardizing the quality of life for present and future generations alike."
]

In [42]:
sentence_similarity(docs[-1], docs[-2]), sentence_similarity(docs[0], docs[1]), sentence_similarity(docs[1], docs[2])



Similar




Similar
Not Similar




(None, None, None)

In [43]:
# Inaccurate result
sentence_similarity(docs[2], docs[3])

Not Similar




In [44]:
import pickle
pickle.dump(preprocess_pipe, open('preprocess_pipe.pkl', 'wb'))
pickle.dump(lr, open('model.pkl', 'wb'))

#### As is clear, the model does still give inaccurate results so I will try to add more meaningful features and preprocessing in each version.