Loading all the required packages.

In [3]:
import pandas as pd
import numpy as np

import string
from string import digits
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


The dataset has two paragraphs per sample; the objective here is to score the similarity between the two.

In [0]:
data = pd.read_csv("gdrive/My Drive/datasets/Text_Similarity_Dataset.csv")

In [5]:
data.head(10)

Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail to spot ads internet sear...,newcastle 2-1 bolton kieron dyer smashed home ...
1,1,millions to miss out on the net by 2025 40% o...,nasdaq planning $100m share sale the owner of ...
2,2,young debut cut short by ginepri fifteen-year-...,ruddock backs yapp s credentials wales coach m...
3,3,diageo to buy us wine firm diageo the world s...,mci shares climb on takeover bid shares in us ...
4,4,be careful how you code a new european directi...,media gadgets get moving pocket-sized devices ...
5,5,india seeks to boost construction india has cl...,music mogul fuller sells company pop idol supr...
6,6,podcasters look to net money nasa is doing it...,ukip outspent labour on eu poll the uk indepen...
7,7,row over police power for csos the police fe...,ban on hunting comes into force fox hunting wi...
8,8,election could be terror target terrorists m...,nhs waiting time target is cut hospital waitin...
9,9,japan economy slides to recession the japanese...,optimism remains over uk housing the uk proper...


In [0]:
text1 = data.text1.values
text2 = data.text2.values

Preprocessing the paragraphs. Very basic; it removes:
* digits
* special characters and punctuation
* words of one or two characters
* stop words

***Also, let's not lemmatize or stem the words: the stemmed form of a word may be missing from the GloVe vocabulary.***

In [0]:
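def process_text(txt):
    # Build the translation tables once instead of on every iteration.
    remove_digits = str.maketrans('', '', digits)
    remove_punct = str.maketrans('', '', string.punctuation)
    for i in range(len(txt)):  # iterate over the argument, not the global text1
        txt[i] = re.sub(r'\W+', ' ', txt[i])          # non-word characters -> space
        txt[i] = txt[i].translate(remove_digits)
        txt[i] = txt[i].translate(remove_punct)
        txt[i] = re.sub(r'\b\w{1,2}\b', '', txt[i])   # drop 1-2 character words
        txt[i] = [word for word in txt[i].lower().split() if word not in stop]
    return txt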

text1 = process_text(text1)
text2 = process_text(text2)

data.text1 = text1
data.text2 = text2 
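
To see what these cleaning steps do to one string, here is a minimal, self-contained sketch. The helper name `clean_paragraph` and the tiny `STOP_WORDS` set are assumptions made for illustration; the notebook itself uses NLTK's English stop-word list.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration only.
STOP_WORDS = {"the", "to", "on", "of", "by", "out"}

def clean_paragraph(text, stop_words=STOP_WORDS):
    """Return the lowercase tokens that survive the cleaning steps."""
    text = re.sub(r"\W+", " ", text)  # non-word characters -> space
    # Drop digits and any remaining punctuation (e.g. underscores).
    text = text.translate(str.maketrans("", "", string.digits + string.punctuation))
    text = re.sub(r"\b\w{1,2}\b", "", text)  # drop 1-2 character words
    return [tok for tok in text.lower().split() if tok not in stop_words]

tokens = clean_paragraph("Millions to miss out on the net by 2025: 40% of the UK!")
print(tokens)  # ['millions', 'miss', 'net']
```

Note that the short-word filter also removes abbreviations such as "UK", which may or may not be desirable for news text.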

In [8]:
data.head(10)

Unnamed: 0,Unique_ID,text1,text2
0,0,"[savvy, searchers, fail, spot, ads, internet, ...","[newcastle, bolton, kieron, dyer, smashed, hom..."
1,1,"[millions, miss, net, population, still, witho...","[nasdaq, planning, share, sale, owner, technol..."
2,2,"[young, debut, cut, short, ginepri, fifteen, y...","[ruddock, backs, yapp, credentials, wales, coa..."
3,3,"[diageo, buy, wine, firm, diageo, world, bigge...","[mci, shares, climb, takeover, bid, shares, ph..."
4,4,"[careful, code, new, european, directive, coul...","[media, gadgets, get, moving, pocket, sized, d..."
5,5,"[india, seeks, boost, construction, india, cle...","[music, mogul, fuller, sells, company, pop, id..."
6,6,"[podcasters, look, net, money, nasa, year, old...","[ukip, outspent, labour, poll, independence, p..."
7,7,"[row, police, power, csos, police, federation,...","[ban, hunting, comes, force, fox, hunting, dog..."
8,8,"[election, could, terror, target, terrorists, ...","[nhs, waiting, time, target, cut, hospital, wa..."
9,9,"[japan, economy, slides, recession, japanese, ...","[optimism, remains, housing, property, market,..."


Let's see which words occur most often in our corpus. This function returns the words and their counts.

In [0]:
def count_most_common(col):
  full_list = []  
  for elmnt in data[col]:  
      full_list += elmnt
  val_counts = pd.Series(full_list).value_counts() 
  return val_counts
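
As an alternative sketch, the same counts can be obtained with `collections.Counter`, which avoids building one large intermediate list. The function name `most_common_words` and the toy column are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def most_common_words(token_lists, top_n=3):
    """Count tokens across all paragraphs and return the top_n (word, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(top_n)

# Toy stand-in for a column of tokenized paragraphs.
toy_column = [["said", "would", "said"], ["year", "said", "would"]]
top_words = most_common_words(toy_column)
print(top_words)  # [('said', 3), ('would', 2), ('year', 1)]
```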

Most repeated words of our first column (text1):

In [10]:
val_counts_text1 = count_most_common("text1")
print(val_counts_text1[:10])

said      13072
would      4657
year       4192
also       3924
people     3693
new        3589
one        3454
could      2731
last       2498
first      2486
dtype: int64


Most repeated words of our second column (text2):

***Looking at the most common words alone, the two columns (or rather the two collections of paragraphs) clearly share much of their vocabulary.***

In [11]:
val_counts_text2 = count_most_common("text2")
print(val_counts_text2[:10])

said      13041
would      4680
year       4162
also       3883
people     3665
new        3590
one        3454
could      2735
last       2532
first      2454
dtype: int64


Let's get the two columns as Python lists, to be worked on next.

In [0]:
text1 = data.text1.values.tolist()
text2 = data.text2.values.tolist()

In [13]:
len(text1), len(text2)

(4023, 4023)

Let's join the two lists.

In [0]:
joined_list = text1 + text2


* Going forward, I tried using an LSTM network on the tokenized data. The results were very bad. The LSTM network did not learn, for many possible reasons, one of which is that the data has a straightforward correlation.
***To avoid that, let's introduce noise (the `noice` variables in the code).***

<ol>
  <li> The t1_freq and t2_freq columns hold the number of words in each paragraph of a sample. </li>
  <li> t1_t2_intersect, as the name suggests, holds the number of distinct words common to the two paragraphs of a sample. </li>
</ol>



In [0]:

from collections import defaultdict

d_dict = defaultdict(set)
for i in range(data.shape[0]):
  d_dict[" ".join(data.text1[i])].add(" ".join(data.text2[i]))
  d_dict[" ".join(data.text2[i])].add(" ".join(data.text1[i]))


def t1_freq(row):
    return(len(row['text1']))
    
def t2_freq(row):
    return(len(row['text2']))
    
def t1_t2_intersect(row):
    # print(row['text1'])
    return len(list(set(row['text1']) & set(row['text2'])))


# raw=True would pass each row as a bare ndarray, which cannot be indexed by column name.
data['t1_t2_intersect'] = data.apply(t1_t2_intersect, axis=1)
data['t1_freq'] = data.apply(t1_freq, axis=1)
data['t2_freq'] = data.apply(t2_freq, axis=1)
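
The three features can be checked by hand on one toy pair. This is a minimal sketch; `pair_features` is a hypothetical helper that mirrors the three functions above without needing a DataFrame.

```python
def pair_features(tokens_a, tokens_b):
    """Length of each paragraph and the number of distinct shared words."""
    return {
        "t1_t2_intersect": len(set(tokens_a) & set(tokens_b)),
        "t1_freq": len(tokens_a),
        "t2_freq": len(tokens_b),
    }

features = pair_features(
    ["diageo", "buy", "wine", "firm", "diageo"],
    ["mci", "shares", "climb", "firm", "shares", "wine"],
)
print(features)  # {'t1_t2_intersect': 2, 't1_freq': 5, 't2_freq': 6}
```

The lengths count repeated words, while the intersection counts each shared word once.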



In [16]:
data

Unnamed: 0,Unique_ID,text1,text2,t1_t2_intersect,t1_freq,t2_freq
0,0,"[savvy, searchers, fail, spot, ads, internet, ...","[newcastle, bolton, kieron, dyer, smashed, hom...",13,266,362
1,1,"[millions, miss, net, population, still, witho...","[nasdaq, planning, share, sale, owner, technol...",8,276,109
2,2,"[young, debut, cut, short, ginepri, fifteen, y...","[ruddock, backs, yapp, credentials, wales, coa...",11,181,202
3,3,"[diageo, buy, wine, firm, diageo, world, bigge...","[mci, shares, climb, takeover, bid, shares, ph...",12,89,187
4,4,"[careful, code, new, european, directive, coul...","[media, gadgets, get, moving, pocket, sized, d...",44,534,393
...,...,...,...,...,...,...
4018,4018,"[labour, plans, maternity, pay, rise, maternit...","[seasonal, lift, house, market, swathe, figure...",14,253,250
4019,4019,"[high, fuel, costs, hit, airlines, two, larges...","[new, media, battle, bafta, awards, bbc, leads...",7,173,156
4020,4020,"[britons, growing, digitally, obese, gadget, l...","[film, star, fox, behind, theatre, bid, leadin...",19,295,237
4021,4021,"[holmes, hit, hamstring, injury, kelly, holmes...","[tsunami, hit, sri, lanka, banks, sri, lanka, ...",5,120,128


Let's scale our noise-related data.

In [0]:
noice = data[['t1_t2_intersect', 't1_freq', 't2_freq']]

In [0]:
from sklearn.preprocessing import StandardScaler


ss = StandardScaler()

ss.fit(noice)
noice = ss.transform(noice)
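
For reference, standard scaling subtracts the column mean and divides by the population standard deviation. The sketch below reproduces that arithmetic for a single column in plain Python; `standardize` is a hypothetical helper, and the input values are the first five `t1_freq` entries shown in the table above.

```python
import statistics

def standardize(column):
    """Zero-mean, unit-variance scaling of one column (population std, ddof=0)."""
    mean = sum(column) / len(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]

t1_freq_sample = [266, 276, 181, 89, 534]
scaled = standardize(t1_freq_sample)
print([round(v, 2) for v in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, so the three noise features reach the Dense layer on a comparable scale.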

In [20]:
joined_list[-1][:2],joined_list[0][:2],joined_list[4023][:2],joined_list[4022][:2]

(['factor', 'show'],
 ['savvy', 'searchers'],
 ['newcastle', 'bolton'],
 ['nuclear', 'dumpsite'])

The tokenize function returns the tokenized, padded sentences.

In [21]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def tokenize(num, list_arr):
    # num is the maximum vocabulary size kept by the tokenizer
    name_tokenizer = Tokenizer(num_words=num)
    name_tokenizer.fit_on_texts(list(list_arr))
    name_tokenized_list = name_tokenizer.texts_to_sequences(list(list_arr))
    
    if len(name_tokenizer.word_index) < num:
        num = len(name_tokenizer.word_index)
      
    print('Number of words:', num)
    pad_name = len(max(name_tokenized_list,key=len))
    final_array = pad_sequences(name_tokenized_list, maxlen=pad_name, padding='post')

    return num, name_tokenizer, name_tokenized_list, pad_name, final_array

max_num_word_text, tokenizer_text, list_tokenized_text, pad_max_text, text_array = tokenize(100000, joined_list)

Using TensorFlow backend.


Number of words: 27545
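
To make the tokenize-and-pad step concrete, here is a plain-Python sketch of the idea: words get integer indices by descending frequency starting at 1, and shorter sequences are filled with 0 at the end (`padding='post'`). The helpers `build_word_index` and `encode_and_pad` are assumptions for illustration and only mimic what the Keras utilities do on a toy corpus.

```python
from collections import Counter

def build_word_index(token_lists):
    """Map each word to an index, most frequent word first, starting at 1."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(token_lists, word_index):
    """Replace words by indices and right-pad every sequence with 0."""
    sequences = [[word_index[tok] for tok in tokens if tok in word_index]
                 for tokens in token_lists]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

toy_corpus = [["net", "money", "net"], ["net", "poll"]]
word_index = build_word_index(toy_corpus)
padded = encode_and_pad(toy_corpus, word_index)
print(word_index)  # {'net': 1, 'money': 2, 'poll': 3}
print(padded)      # [[1, 2, 1], [1, 3, 0]]
```

Because every sequence is padded to the length of the longest paragraph, one very long paragraph determines the input width for the whole corpus.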


Let's use the 300-dimensional GloVe word embedding weights.

In [0]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-02-05 09:54:30--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-02-05 09:54:30--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-02-05 09:54:31--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [0]:
!unzip glove*.zip


Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [0]:
!mv "glove.6B.300d.txt" "gdrive/My Drive/datasets/"

mv: cannot stat 'glove.6B.300d.txt': No such file or directory


The file with embedding dimension 300 is read from the text file into a dictionary that maps each word to a numpy array.

In [22]:
embeddings_index = {}
# The context manager closes the file even if parsing a line fails.
with open('gdrive/My Drive/datasets/glove.6B.300d.txt', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))


Found 400000 word vectors.


Let's create an embedding matrix. It is built from the pre-trained word vectors; words not found in them are left as all-zero rows.


In [0]:
embedding_matrix = np.zeros((max_num_word_text + 1, 300))
for word, i in tokenizer_text.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
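
The same construction on toy data, as a plain-Python sketch: row 0 stays reserved for padding, known words get their vector, and out-of-vocabulary words keep a zero row. The names `build_embedding_matrix`, `toy_vectors` and `toy_index` are hypothetical and exist only for this illustration.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """One row per word index (plus row 0 for padding); unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = list(vec)
    return matrix

toy_vectors = {"net": [0.1, 0.2], "money": [0.3, 0.4]}  # stand-in for GloVe
toy_index = {"net": 1, "money": 2, "csos": 3}           # stand-in for word_index
matrix = build_embedding_matrix(toy_index, toy_vectors, dim=2)
missing = [w for w in toy_index if w not in toy_vectors]
print(matrix)   # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
print(missing)  # ['csos']
```

Listing the missing words this way is a quick check of how much of the corpus vocabulary the pre-trained vectors actually cover.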

Let's split the tokenized version of the entire corpus (the joined list) back into its two halves.

In [24]:
text_array1 = text_array[:4023]
text_array2 = text_array[4023:]

len(text_array1), len(text_array2)

(4023, 4023)

Splitting data.

In [0]:
from sklearn.model_selection import train_test_split

(text_array1_train, text_array1_test,
 text_array2_train, text_array2_test,
 noice_train, noice_test,
 text1_train, text1_test,
 text2_train, text2_test) = train_test_split(
    text_array1, text_array2, noice, text1, text2,
    test_size=0.3, shuffle=True, random_state=2)
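
Passing several arrays to one split call keeps their rows aligned, because a single shuffled index order is applied to all of them. The sketch below shows that idea in plain Python; `aligned_split` is a hypothetical helper, and its rounding of the test size is a simplification rather than scikit-learn's exact rule.

```python
import random

def aligned_split(*arrays, test_size=0.3, seed=2):
    """Shuffle row indices once and apply the same order to every array."""
    n_rows = len(arrays[0])
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)  # simplified rounding, an assumption
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    result = []
    for arr in arrays:
        result.append([arr[i] for i in train_idx])
        result.append([arr[i] for i in test_idx])
    return result

ids = list(range(10))
labels = [f"row{i}" for i in ids]
ids_train, ids_test, labels_train, labels_test = aligned_split(ids, labels)
# Every label still sits next to the id it belongs to.
print(all(lab == f"row{i}" for i, lab in zip(ids_train, labels_train)))  # True
```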

# The approach from here on:
<ol>
<li> For the training data, let's calculate a very straightforward Manhattan-distance score (using scipy's spatial module). These will be our labels. Let's call them dummy labels. </li>
<li> Our model will try to learn from these labels and possibly learn to give better similarity scores. </li>
</ol>

In [0]:
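def manhattan_distance_wordembedding_method(s1, s2):
    # Import the function directly; a bare "import scipy" does not always load scipy.spatial.
    from scipy.spatial.distance import cityblock
    # Average the GloVe vectors of the in-vocabulary words of each paragraph.
    vector_1 = np.mean([embeddings_index.get(word) for word in s1 if word in embeddings_index],axis=0)
    vector_2 = np.mean([embeddings_index.get(word) for word in s2 if word in embeddings_index],axis=0)
    # 1 - distance: a larger (less negative) value means the paragraphs are closer.
    return round((1 - cityblock(vector_1, vector_2)), 2)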
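
A worked example of the score on two-dimensional toy vectors, written in plain Python so the arithmetic is visible. The helpers `mean_vector` and `manhattan_score` and the toy vectors are assumptions for illustration; the notebook's function uses numpy, scipy and the GloVe vectors instead.

```python
def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have one (assumes at least one does)."""
    found = [vectors[tok] for tok in tokens if tok in vectors]
    dim = len(found[0])
    return [sum(v[k] for v in found) / len(found) for k in range(dim)]

def manhattan_score(tokens_a, tokens_b, vectors):
    """1 minus the sum of absolute coordinate differences of the mean vectors."""
    a = mean_vector(tokens_a, vectors)
    b = mean_vector(tokens_b, vectors)
    distance = sum(abs(x - y) for x, y in zip(a, b))
    return 1 - distance

toy_vectors = {"bank": [1.0, 0.0], "loan": [0.0, 1.0], "goal": [2.0, 2.0]}
# Means: [0.5, 0.5] and [2.0, 2.0]; distance = 1.5 + 1.5 = 3.0
score = manhattan_score(["bank", "loan"], ["goal", "unknown"], toy_vectors)
print(score)  # -2.0
```

Whenever the distance exceeds 1, the score is negative, which is consistent with the all-negative dummy targets reported below.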

In [0]:

dummy_targets = []
for i in range(len(text1_train)):
  dummy_targets.append(manhattan_distance_wordembedding_method(text1_train[i],text2_train[i]))


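Let's normalize our dummy targets (the scores were all negative). Dividing by the minimum, i.e. the most negative score, maps them into (0, 1]. Note that this also reverses the ordering: the pair with the largest Manhattan distance ends up with 1.0.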

In [29]:
dummy_targets = np.asarray(dummy_targets) / min(dummy_targets)  # explicit array, so the division is element-wise
dummy_targets[:30]

array([0.7791289 , 0.72319257, 0.44417422, 0.29206279, 0.6743312 ,
       0.41101039, 0.54123369, 0.43643599, 0.50210038, 0.65774928,
       0.40128233, 0.43842582, 0.52907362, 0.48640283, 0.47932788,
       0.5693124 , 0.67322574, 0.54012823, 0.60225514, 0.63431351,
       0.45699757, 0.56776476, 0.66393986, 0.42073845, 0.56157418,
       0.52221977, 0.48817157, 0.6460314 , 0.56245855, 0.42383374])
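
A small numeric sketch of this normalization on made-up scores shows what the division does to the ordering. The `flipped` variant at the end is only a hypothetical option for a target where higher means more similar; it is not what the notebook does.

```python
# Made-up raw scores (1 - distance); less negative means closer paragraphs.
raw_scores = [-1.0, -3.0, -4.0]

most_negative = min(raw_scores)
normalized = [s / most_negative for s in raw_scores]
print(normalized)  # [0.25, 0.75, 1.0]

# The closest pair (-1.0) now has the smallest value and the farthest pair has 1.0.
# Hypothetical alternative that keeps "higher = more similar":
flipped = [1 - n for n in normalized]
print(flipped)  # [0.75, 0.25, 0.0]
```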

In [0]:
from tensorflow.python.keras.layers import Layer
from tensorflow.python.keras import backend as K

This is a Siamese LSTM model, wherein:
<ol>
<li> The inputs share the embedding weights. </li>
<li> The outputs of the embedding layer pass through a common LSTM layer (weights once again shared). </li>
<li> The noise values have a separate Dense layer. </li>
<li> The outputs of the shared LSTM and the Dense layer are then concatenated. </li>
<li> Batch normalization in between for better performance and speed. </li>
<li> Final Dense layer with sigmoid activation (0-1 similarity score). </li>
</ol>

In [0]:
import keras
from keras.layers import Concatenate, Dense, Dropout, Flatten, LSTM, BatchNormalization
from keras.models import Model
from keras.layers import Input, Reshape, Dot
from keras.layers.embeddings import Embedding
from keras.optimizers import Adadelta, Adam, RMSprop, Adagrad
from keras.regularizers import l2
from keras.layers import Lambda
from keras.layers import Add, Activation, Lambda, Conv1D, MaxPool1D, concatenate
from keras.layers import Bidirectional, SpatialDropout1D, CuDNNLSTM

from keras.callbacks import EarlyStopping

def similarity_score():

    embedding_layer = Embedding(max_num_word_text + 1,
                                   300,
                                   weights=[embedding_matrix],
                                   input_length=pad_max_text,
                                   trainable=False)
    
    left_input = Input(shape=(pad_max_text,), dtype='int32')
    right_input = Input(shape=(pad_max_text,), dtype='int32')
    
    encoded_left = embedding_layer(left_input)
    encoded_right = embedding_layer(right_input)
    
    shared_lstm = CuDNNLSTM(200)
    
    left_output = shared_lstm(encoded_left)
    right_output = shared_lstm(encoded_right)

    noice_input = Input(shape=(noice.shape[1],))
    noice_dense = Dense(50, activation="relu")(noice_input)

    merged = concatenate([left_output, right_output, noice_dense])
    merged = BatchNormalization()(merged)
    merged = Dropout(0.4)(merged)

    merged = Dense(100, activation="relu")(merged)
    merged = BatchNormalization()(merged)
    merged = Dropout(0.4)(merged)

    preds = Dense(1, activation='sigmoid')(merged)
    
    model = Model(inputs=[left_input, right_input, noice_input], outputs=preds)
    # The string "nadam" selects the Nadam optimizer with its default settings.
    model.compile(loss='binary_crossentropy', optimizer="nadam")
    return model
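
The key idea of the Siamese layout is that both branches go through the same parameters before being concatenated with the noise features. The toy sketch below illustrates only that weight sharing; `encode` is a hypothetical mean-pool-plus-tanh stand-in, not an LSTM, and all numbers are made up.

```python
import math

def encode(token_vectors, shared_weights):
    """Toy encoder: mean-pool the token vectors, then one tanh unit per weight row."""
    dim = len(token_vectors[0])
    pooled = [sum(v[k] for v in token_vectors) / len(token_vectors) for k in range(dim)]
    return [math.tanh(sum(w * x for w, x in zip(row, pooled))) for row in shared_weights]

shared_weights = [[0.5, -0.5], [1.0, 1.0]]  # one parameter set used by both branches

left = encode([[1.0, 0.0], [0.0, 1.0]], shared_weights)
right = encode([[0.0, 1.0], [1.0, 0.0]], shared_weights)
noise_features = [0.3, -1.2, 0.8]  # stand-in for the three scaled noise values

merged = left + right + noise_features  # concatenation fed to the dense head
print(left == right)  # True: same content, same weights, same encoding
print(len(merged))    # 7
```

With separate weights per branch, two paragraphs with identical content could receive different encodings; sharing the weights rules that out.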


In [49]:

model = similarity_score()
model.summary()

Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_19 (InputLayer)           (None, 2137)         0                                            
__________________________________________________________________________________________________
input_20 (InputLayer)           (None, 2137)         0                                            
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 2137, 300)    8263800     input_19[0][0]                   
                                                                 input_20[0][0]                   
__________________________________________________________________________________________________
input_21 (InputLayer)           (None, 3)            0                                      
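
The parameter count of the embedding layer in the summary can be verified by hand: it is the number of rows (vocabulary size plus the padding row) times the embedding dimension. A short check, using the vocabulary size printed by `tokenize` above:

```python
vocab_size = 27545   # "Number of words" printed by tokenize
embedding_dim = 300  # GloVe dimension used in this notebook

embedding_params = (vocab_size + 1) * embedding_dim  # +1 for the padding index 0
print(embedding_params)  # 8263800
```

These weights are frozen (`trainable=False`), so they do not count toward the trainable parameters.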

In [50]:
model.fit(x = [text_array1_train, text_array2_train,noice_train], y =dummy_targets, batch_size=64, epochs=12,verbose=1)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7f9918847400>

Let's predict for the entire dataset (training and test rows alike).

In [0]:
preds = model.predict([text_array1,text_array2,noice])

Let's round the scores up to the second decimal place (math.ceil rounds upward, not to the nearest value).

In [0]:
import math
flat_list = [item for sublist in preds for item in sublist]

flat_list = [math.ceil(i*100.0)/100.0 for i in flat_list]


[0.54, 0.57, 0.53, 0.49, 0.33, 0.49, 0.57, 0.41, 0.48, 0.36]
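
The difference between rounding up with `math.ceil` and rounding to the nearest value with `round` shows on a small example. The helper `ceil_2dp` and the sample values are assumptions for illustration.

```python
import math

def ceil_2dp(x):
    """Round x up to two decimal places, as done for the predictions above."""
    return math.ceil(x * 100.0) / 100.0

examples = [0.531, 0.539]
ceiled = [ceil_2dp(x) for x in examples]
rounded = [round(x, 2) for x in examples]
print(ceiled)   # [0.54, 0.54]
print(rounded)  # [0.53, 0.54]
```

Rounding up shifts every score that is not already on a two-decimal value slightly upward, by at most 0.01.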

Final Submission dataframe.

In [0]:
final_df = pd.DataFrame(
    {'Unique_ID': data.iloc[:,0],
     'Similarity_Score': flat_list
    })


In [0]:
final_df

Unnamed: 0,Unique_ID,Similarity_Score
0,0,0.54
1,1,0.57
2,2,0.53
3,3,0.49
4,4,0.33
...,...,...
4018,4018,0.51
4019,4019,0.58
4020,4020,0.46
4021,4021,0.60


In [0]:
final_df.to_csv('gdrive/My Drive/datasets/bhuvSubmission.csv',index=False)