<a href="https://colab.research.google.com/github/amritakesh/doc_similarity/blob/main/docsimile.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#importing all dependencies and modules
import pandas as pd
import numpy as np
from nltk.corpus import stopwords 
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

In [None]:
# Sample corpus (will change to larger dataset)
documents = ['Machine learning is the study of computer algorithms that improve automatically through experience.\
Machine learning algorithms build a mathematical model based on sample data, known as training data.\
The discipline of machine learning employs various approaches to teach computers to accomplish tasks \
where no fully satisfactory algorithm is available.',
'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.\
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.',
'Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. \
It involves computers learning from data provided so that they carry out certain tasks.',
'Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"\
or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement',
'Software engineering is the systematic application of engineering approaches to the development of software.\
Software engineering is a computing discipline.',
'A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned\
about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.\
Developing a machine learning application is more iterative and explorative process than software engineering.'
]

In [None]:
pd.set_option('display.max_colwidth', 0)
pd.set_option('display.max_columns', 0)
df=pd.DataFrame(documents,columns=['documents'])

In [None]:
df.head()

Unnamed: 0,documents
0,"Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available."
1,"Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning."
2,Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
3,"Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the ""signal""or ""feedback"" available to the learning system: Supervised, Unsupervised and Reinforcement"
4,Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.


In [None]:
# downloading filler words(stop words)
nltk.download('stopwords')
stop_words=stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
#replacing non-letter characters with spaces, lowercasing, and dropping stop words
df['documents_cleaned']=df.documents.apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() for w in x.split() if re.sub(r'[^a-zA-Z]',' ',w).lower() not in stop_words) )
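To see what the cleaning step does to a single document, the lambda can be unrolled into a small function. This is a minimal sketch rather than the notebook's exact cell: the input sentence and the tiny `stop_words` set are made up for illustration (the notebook uses NLTK's English list). Note that punctuation is replaced by a space before the stop-word lookup, so a token such as `so.` becomes `'so '` and is not filtered out.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list.
stop_words = {"is", "the", "of", "to", "so"}

def clean(text):
    """Same logic as the one-line lambda, written out step by step."""
    kept = []
    for w in text.split():
        # Replace every non-letter with a space, then lowercase.
        token = re.sub(r'[^a-zA-Z]', ' ', w).lower()
        # The lookup uses the token including any space left by punctuation.
        if token not in stop_words:
            kept.append(token)
    return " ".join(kept)

cleaned = clean("Machine learning is the study of algorithms, to do so.")
print(repr(cleaned))
print(cleaned.split())
```

Stripping the token before the lookup (for example with `token.strip()`) would make the filter catch stop words that carry trailing punctuation.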


APPLYING TF-IDF VECTORIZER

In [None]:
tfidfvectoriser=TfidfVectorizer(max_features=64) #keeps the 64 most frequent terms, so each document becomes a 64-dimensional vector
tfidfvectoriser.fit(df.documents_cleaned)
tfidf_vectors=tfidfvectoriser.transform(df.documents_cleaned)
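The weights produced above can be reproduced by hand on a toy corpus. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is scaled to unit length. The three short documents and the whitespace tokenisation are simplifications chosen for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: raw term counts x smoothed idf, then L2-normalised rows."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

docs = ["machine learning data", "machine learning model", "software engineering"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:12s} {weight:.4f}")
```

In the first document, `data` (which occurs in one document) ends up with a larger weight than `machine` (which occurs in two), which is the behaviour TF-IDF is meant to provide.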

In [None]:
df.shape

(6, 2)

In [None]:
tfidf_vectors.shape

(6, 64)

In [None]:
tfidf_vectors=tfidf_vectors.toarray()
print (tfidf_vectors[0])

[0.20860612 0.41721224 0.         0.14442061 0.17106    0.17106
 0.         0.         0.         0.         0.         0.
 0.17106    0.14442061 0.         0.         0.         0.
 0.28884121 0.17106    0.         0.         0.         0.
 0.32062347 0.32062347 0.         0.17106    0.         0.20860612
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.20860612 0.20860612 0.         0.         0.         0.
 0.         0.17106    0.         0.         0.         0.17106
 0.20860612 0.17106    0.         0.         0.         0.20860612
 0.         0.         0.         0.        ]


In [None]:
#TfidfVectorizer L2-normalises each row by default, so the dot product of two rows equals their cosine similarity
pair_sim=np.dot(tfidf_vectors,tfidf_vectors.T) #pairwise similarities
pair_diff=euclidean_distances(tfidf_vectors) #pairwise distances

In [None]:
# print (tfidf_vectors[0])
print('Similarities')
print(pair_sim[0][:])
print('Differences')
print (pair_diff[0][:])

Similarities
[1.         0.30335642 0.29899126 0.20763548 0.06056832 0.16004863]
Differences
[0.         1.18037585 1.18406819 1.25886021 1.37071637 1.29611062]
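Because the TF-IDF rows have unit length, the two measures printed above carry the same information: for unit vectors, the Euclidean distance is d = sqrt(2 - 2·cos). The short check below recomputes the distances from the similarities in the output above; the numbers are copied from that output, and the tolerance is an arbitrary choice.

```python
import math

# Values copied from the printed output for document 0.
similarities = [1.0, 0.30335642, 0.29899126, 0.20763548, 0.06056832, 0.16004863]
recorded_distances = [0.0, 1.18037585, 1.18406819, 1.25886021, 1.37071637, 1.29611062]

# For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
derived = [math.sqrt(max(0.0, 2 - 2 * s)) for s in similarities]

for d_rec, d_new in zip(recorded_distances, derived):
    print(f"{d_rec:.6f}  {d_new:.6f}")
```

This also means that, for these vectors, ranking by descending cosine similarity and ranking by ascending Euclidean distance give the same order.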


In [None]:
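def most_similar(doc_id,similarity_matrix,matrix):
    #matrix names the metric stored in similarity_matrix: 'Cosine Similarity' or 'Euclidean Distance'
    print (f'Document: {df.iloc[doc_id]["documents"]}')
    print ('\n')
    print (f'Similar Documents using {matrix}:')
    if matrix=='Cosine Similarity':
        #higher similarity means closer, so sort in descending order
        similar_ix=np.argsort(similarity_matrix[doc_id])[::-1]
    elif matrix=='Euclidean Distance':
        #smaller distance means closer, so sort in ascending order
        similar_ix=np.argsort(similarity_matrix[doc_id])
    else:
        raise ValueError(f'Unknown metric: {matrix}')
    for ix in similar_ix:
        if ix==doc_id:
            continue #skip the query document itself
        print('\n')
        print (f'Document: {df.iloc[ix]["documents"]}')
        print (f'{matrix} : {similarity_matrix[doc_id][ix]}')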
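The ranking logic inside `most_similar` can also be written so that it returns the neighbours instead of printing them, which is easier to inspect and reuse. This is a minimal sketch; `rank_neighbours` and its `higher_is_closer` flag are hypothetical names, and the scores are the values printed earlier for document 0.

```python
def rank_neighbours(scores, doc_id, higher_is_closer=True):
    """Return (index, score) pairs for all other documents, closest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_closer)
    return [(i, scores[i]) for i in order if i != doc_id]

# Row 0 of the similarity and distance matrices, copied from the output above.
sim_row = [1.0, 0.30335642, 0.29899126, 0.20763548, 0.06056832, 0.16004863]
dist_row = [0.0, 1.18037585, 1.18406819, 1.25886021, 1.37071637, 1.29611062]

by_similarity = rank_neighbours(sim_row, doc_id=0, higher_is_closer=True)
by_distance = rank_neighbours(dist_row, doc_id=0, higher_is_closer=False)
print([i for i, _ in by_similarity])
print([i for i, _ in by_distance])
```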

In [None]:
most_similar(0,pair_sim,'Cosine Similarity')

Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.


Similar Documents using Cosine Similarity:


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity : 0.30335642341823865


Document: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
Cosine Similarity : 0.29899125782686603


Document: Machine learning approaches a

WORD2VEC IMPLEMENTATION

In [None]:
#importing modules and dependencies
from keras.preprocessing.text import Tokenizer
import gensim
from keras.preprocessing.sequence import pad_sequences
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt') #Punkt tokenizer models, required by word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
#tokenizing every document and zero-padding it at the end to a fixed length of 64
tokenizer=Tokenizer()
tokenizer.fit_on_texts(df.documents_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(df.documents_cleaned)
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
vocab_size=len(tokenizer.word_index)+1
print(tokenized_paded_documents)

[[ 2  1 10 11 12 20 21 22  2  1 12 23 13 24 14 25  4 26 27  4 15 16  2  1
  28 29  7 30  5 31  8 32 33 34 17  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1 35 36 37 38 39 40 41 42  5 15 10 13 43 44 45 46  9 47 48  2  1  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1 18  5 49 50  8 51 52 53 54 18  5  1  4 55 56 57  8  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1  7 58 59 60 61 62 63 64 65 66 67 17  1 68 69 70 71  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  6 72  9  6  7 73  3  3  6 74 16  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3 

In [None]:
print(tokenized_paded_documents.shape)

(6, 64)
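The integer matrix above comes from two steps: mapping each word to an index (more frequent words get smaller indices, with 0 reserved for padding) and then padding every sequence to the same length. The sketch below imitates that behaviour with the standard library only. `fit_word_index` and `pad_post` are hypothetical helpers, the three short texts are made up, and the truncation rule (keeping the last `maxlen` tokens) is meant to mirror the Keras default rather than reproduce its code.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def pad_post(sequence, maxlen):
    """Keep at most the last maxlen tokens, then append zeros up to maxlen."""
    sequence = sequence[-maxlen:]
    return sequence + [0] * (maxlen - len(sequence))

texts = ["machine learning study", "machine learning data", "software engineering"]
word_index = fit_word_index(texts)
sequences = [[word_index[w] for w in t.split()] for t in texts]
padded = [pad_post(s, maxlen=4) for s in sequences]
vocab_size = len(word_index) + 1  # +1 because index 0 is the padding slot

print(word_index)
print(padded)
print(vocab_size)
```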


In [None]:
from gensim import models
#loading the pretrained Google News word2vec vectors (300 dimensions)
#source: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
#the file is a binary word2vec model, not a CSV, so it is loaded with gensim rather than pd.read_csv
#the path below requires Google Drive to be mounted at /content/drive first
w2v_path = '/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz'
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(w2v_path, binary=True)

In [None]:
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]



embedding_matrix[0]


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [None]:
embedding_matrix[tokenizer.word_index['machine']]

array([ 2.55859375e-01, -2.20947266e-02,  2.90527344e-02,  5.44433594e-02,
       -7.42187500e-02,  3.53515625e-01, -6.34765625e-02,  1.44531250e-01,
        7.22656250e-02,  1.00097656e-01, -1.82617188e-01, -2.28515625e-01,
        2.20947266e-02, -2.20703125e-01,  1.91406250e-01,  1.91406250e-01,
       -1.66992188e-01,  1.67968750e-01,  2.94921875e-01, -1.80664062e-01,
       -1.45263672e-02,  1.07421875e-01, -1.65039062e-01,  2.98828125e-01,
        1.29882812e-01, -1.17187500e-01, -1.67968750e-01,  1.01562500e-01,
        4.49218750e-02, -1.20605469e-01,  7.47070312e-02, -3.47656250e-01,
       -1.01074219e-01,  3.80859375e-01, -2.06054688e-01, -7.47070312e-02,
       -1.08398438e-01,  1.86523438e-01,  2.01171875e-01, -2.12402344e-02,
        2.85156250e-01, -9.27734375e-02,  1.39648438e-01,  5.78613281e-02,
        2.67578125e-01, -1.50390625e-01, -8.54492188e-02,  1.92382812e-01,
        8.00781250e-02,  6.39648438e-02, -7.47070312e-02, -8.59375000e-02,
        5.10253906e-02, -

In [None]:
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))

for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]

In [None]:
document_word_embeddings.shape


(6, 64, 300)
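The two cells above amount to a table lookup: each row of the embedding matrix holds the vector for one word index, and every padded document is turned into a stack of those rows. A minimal sketch with plain lists follows. The 3-dimensional `pretrained` vectors and the small `word_index` are invented stand-ins for word2vec and the fitted tokenizer.

```python
# Hypothetical 3-dimensional "pretrained" vectors standing in for word2vec.
pretrained = {"machine": [0.1, 0.2, 0.3], "learning": [0.4, 0.5, 0.6]}
# Hypothetical tokenizer index; index 0 is reserved for padding.
word_index = {"learning": 1, "machine": 2, "explorative": 3}
dim = 3

# One row per index. Rows stay all-zero for padding and for words
# that have no pretrained vector ("explorative" in this toy example).
embedding_matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    if word in pretrained:
        embedding_matrix[i] = list(pretrained[word])

# "machine learning explorative" followed by two padding positions.
padded_doc = [2, 1, 3, 0, 0]
doc_word_embeddings = [embedding_matrix[idx] for idx in padded_doc]

print(len(doc_word_embeddings), len(doc_word_embeddings[0]))
print(doc_word_embeddings)
```

Out-of-vocabulary words and padding both map to zero vectors, so they contribute nothing when the word vectors are later summed.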

In [None]:
document_embeddings=np.zeros((len(tokenized_paded_documents),300))
words=tfidfvectoriser.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2

for i in range(len(document_word_embeddings)):
    for j in range(len(words)):
        document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_vectors[i][j]
        
document_embeddings=document_embeddings/np.sum(tfidf_vectors,axis=1).reshape(-1,1)
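The cell above builds each document vector as a TF-IDF weighted average of word vectors: sum the word vectors scaled by their TF-IDF weights, then divide by the sum of the weights. A minimal sketch of that computation for one document is shown below. `weighted_average` is a hypothetical helper, and the 2-dimensional vectors and weights are made up.

```python
def weighted_average(vectors, weights):
    """Average word vectors, weighting each word by its TF-IDF score.

    Words without a vector add nothing to the sum, but their weight still
    counts in the denominator, matching the zero-row behaviour above.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word, w in weights.items():
        vec = vectors.get(word)
        if vec is None:
            continue
        for k in range(dim):
            total[k] += w * vec[k]
    denom = sum(weights.values())
    return [x / denom for x in total] if denom else total

vectors = {"machine": [1.0, 0.0], "learning": [0.0, 1.0]}
weights = {"machine": 0.75, "learning": 0.25}
doc_vector = weighted_average(vectors, weights)
print(doc_vector)
```

Words with high TF-IDF weight pull the document vector toward their own embedding, so frequent but uninformative words have less influence than in a plain average.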



In [None]:
pair_sim=cosine_similarity(document_embeddings)
pair_diff=euclidean_distances(document_embeddings)

In [None]:
most_similar(0,pair_sim,'Cosine Similarity')


Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.


Similar Documents using Cosine Similarity:


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity : 0.8397965442591562


Document: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
Cosine Similarity : 0.7794695868906796


Document: A software engineer creates pro

BERT

In [None]:
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer



In [None]:
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')


In [None]:
document_embeddings = sbert_model.encode(df['documents_cleaned'])
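The `mean-tokens` part of the model name refers to mean pooling: the sentence vector is the average of the token vectors, with padding positions left out. The sketch below illustrates only that pooling step with made-up 2-dimensional token vectors and a made-up attention mask. It is not the library's implementation, and `masked_mean` is a hypothetical helper.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    # Guard against an all-padding input.
    return [x / count for x in total] if count else total

token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
attention_mask = [1, 1, 0]
sentence_vector = masked_mean(token_vectors, attention_mask)
print(sentence_vector)
```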

In [None]:
pair_sim=cosine_similarity(document_embeddings)
pair_diff=euclidean_distances(document_embeddings)

In [None]:
print('Similarities')
#calling the most_similar function defined in tf-idf section but passing different parameters
most_similar(0,pair_sim,'Cosine Similarity') 


Similarities
Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.


Similar Documents using Cosine Similarity:


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity : 0.8365410566329956


Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine l

GLOVE

In [None]:
!pip install kaggle



In [None]:
!mkdir ~/.kaggle

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'

In [None]:
!kaggle datasets download -d danielwillgeorge/glove6b100dtxt

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.7/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.7/dist-packages/kaggle/api/kaggle_api_extended.py", line 166, in authenticate
    self.config_file, self.config_dir))
OSError: Could not find kaggle.json. Make sure it's located in /content. Or use the environment method.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:

embeddings_index = dict()

with open('/content/drive/MyDrive/glove.6B.100d.txt', encoding='utf-8') as file:
    for line in file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
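The loop above relies on the GloVe text format: one word per line, followed by its vector components separated by spaces. The same parsing can be tried without the large download by reading from an in-memory string. The two sample lines below are invented and use 3 dimensions instead of 100.

```python
import io

# Two made-up lines in the same "word v1 v2 ..." layout as glove.6B.100d.txt.
sample = io.StringIO("the 0.1 0.2 0.3\nmachine -0.4 0.5 0.6\n")

embeddings_index = {}
for line in sample:
    values = line.split()
    word = values[0]
    # Everything after the first field is a vector component.
    embeddings_index[word] = [float(v) for v in values[1:]]

print(len(embeddings_index))
print(embeddings_index["machine"])
```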

In [None]:

embedding_matrix=np.zeros((vocab_size,100))

for word,i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
document_embeddings=np.zeros((len(tokenized_paded_documents),100))
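words=tfidfvectoriser.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2

for i in range(df.shape[0]):
    for j in range(len(words)):
        document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_vectors[i][j]

#normalise once, after all weighted vectors have been summed; dividing inside the loop would shrink the embeddings toward zero
document_embeddings = document_embeddings/np.sum(tfidf_vectors,axis=1).reshape(-1,1)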
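The normalisation by the summed TF-IDF weights needs to happen exactly once per document. If the division runs on every pass of the inner loop instead, each row is divided hundreds of times and the values shrink until their squares underflow to zero, which is one plausible way for cosine similarities to collapse to 0.0. A minimal sketch of that effect, assuming a weight sum of 4.0 and 6 × 64 passes (both illustrative numbers):

```python
weight_sum = 4.0   # assumed sum of TF-IDF weights for one document
passes = 6 * 64    # documents x features, as in the nested loops

# Dividing a single time keeps a usable magnitude.
once = 1.0 / weight_sum

# Dividing on every pass of the loop shrinks the value geometrically.
repeated = 1.0
for _ in range(passes):
    repeated /= weight_sum

print(once)
print(repeated)
# The value itself is still positive, but its square underflows to 0.0,
# so a vector norm computed from squared components becomes zero.
print(repeated > 0.0, repeated * repeated == 0.0)
```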



In [None]:
document_embeddings.shape

(6, 100)

In [None]:

pair_sim=cosine_similarity(document_embeddings)
pair_diff=euclidean_distances(document_embeddings)

In [None]:
most_similar(0,pair_sim,'Cosine Similarity')

Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.


Similar Documents using Cosine Similarity:


Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine learning application is more iterative and explorative process than software engineering.
Cosine Similarity : 0.0


Document: Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.
Cosine Similarity : 0.

DOC2VEC

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [None]:
tagged_data = [TaggedDocument(words=word_tokenize(doc), tags=[i]) for i, doc in enumerate(df.documents_cleaned)]

In [None]:


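#epochs is set on the model and train() is called once; calling train() repeatedly in a manual loop restarts the learning-rate schedule on every call
model_d2v = Doc2Vec(vector_size=100,alpha=0.025, min_count=1, epochs=100)
model_d2v.build_vocab(tagged_data)
model_d2v.train(tagged_data,
            total_examples=model_d2v.corpus_count,
            epochs=model_d2v.epochs)
#collecting the learned vector for each document tag
document_embeddings=np.zeros((df.shape[0],100))
for i in range(len(document_embeddings)):
    document_embeddings[i]=model_d2v.docvecs[i] #docvecs was renamed to dv in gensim 4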

In [None]:
pair_sim=cosine_similarity(document_embeddings)
pair_diff=euclidean_distances(document_embeddings)

In [None]:
most_similar(0,pair_sim,'Cosine Similarity')

Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.


Similar Documents using Cosine Similarity:


Document: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
Cosine Similarity : 0.48570503236941825


Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity : 0.3589481260731879


Document: Machine learning approaches ar

Universal Sentence Encoder


In [None]:

import tensorflow as tf
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
model = hub.load(module_url)