### Tf-idf (term frequency - inverse document frequency)
#### In this section the documents will be compared using the tf-idf method

In [59]:
import pandas as pd
import numpy as np
import nltk, re
nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [60]:
# Sample documents (corpus) assuming that the query is the first item
query = ['A software engineer creates programs based on logic for the computer to execute.\
A software engineer has to be more concernedabout the correctness of the program in all the cases.']

documents = ['Machine learning is the study of computer algorithms that improve automatically through experience.\
Machine learning algorithms build a mathematical model based on sample data, known as training data.\
The discipline of machine learning employs various approaches to teach computers to accomplish tasks \
where no fully satisfactory algorithm is available.',
'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.\
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.',
'Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. \
It involves computers learning from data provided so that they carry out certain tasks.',
'Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"\
or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement',
'Software engineering is the systematic application of engineering approaches to the development of software.\
Software engineering is a computing discipline.',
'A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned\
about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.\
Developing a machine learning application is more iterative and explorative process than software engineering.']

# create dataframe
docs_df = pd.DataFrame(documents, columns=['documents'])
# insert the query as the first item
# only insert when the query list is non-empty and its text is not null
if query and pd.notnull(query[0]):
    docs_df.loc[-1] = query
    docs_df.index = docs_df.index + 1
    docs_df = docs_df.sort_index()

In [61]:
# remove special characters and stop words
stop_words_l = set(stopwords.words('english'))
def clean_text(text):
    # replace non-letters with spaces, lowercase each token, then drop stop words
    tokens = (re.sub(r'[^a-zA-Z]', ' ', w).lower() for w in text.split())
    return " ".join(t for t in tokens if t not in stop_words_l)
docs_df['documents_cleaned'] = docs_df.documents.apply(clean_text)


In [106]:
#Tf-Idf Vectorization
tfidfvectorizer = TfidfVectorizer(max_features=64)
tfidfvectorizer.fit(docs_df.documents_cleaned)
tfidf_docs_vectors = tfidfvectorizer.transform(docs_df.documents_cleaned)
tfidf_docs_vectors=tfidf_docs_vectors.toarray()

tfidf_docs_vectors.shape

(7, 64)
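
The weighting behind these vectors can be reproduced by hand on a toy corpus. The sketch below assumes scikit-learn's documented defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows); `tfidf_vectors` and `toy_docs` are illustrative names that do not appear in the notebook, and the sketch skips the tokenization and `max_features` pruning that `TfidfVectorizer` performs.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Tf-idf with smoothed idf and L2-normalised rows (hand-rolled sketch)."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

toy_docs = [["machine", "learning", "data"],
            ["software", "engineering", "software"],
            ["machine", "software"]]
vocab, vectors = tfidf_vectors(toy_docs)
for row in vectors:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents receive a larger idf, so within the first toy document "learning" outweighs "machine".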

In [102]:
pairwiseCosSimilarities = cosine_similarity(tfidf_docs_vectors)
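
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; `cosine`, `pairwise_cosine` and `toy` are illustrative names, and returning 0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # sketch choice: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def pairwise_cosine(rows):
    """Square matrix of similarities between every pair of rows."""
    return [[cosine(u, v) for v in rows] for u in rows]

toy = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
sims = pairwise_cosine(toy)
for row in sims:
    print([round(s, 3) for s in row])
```

The measure ignores vector length: the first two toy rows point in the same direction and score 1, while the third is orthogonal to both and scores 0.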

In [103]:
# function that prints the first element (the query) and similar documents 
# sorting by the cosine similarity
def most_similar(doc_id, docs_df, similarity_matrix):
    print (f'Document: {docs_df.iloc[doc_id]["documents"]}')
    print ('\n')
    print ('Similar Documents:')
    # argsort returns the indices in ascending order; [::-1] reverses them to descending
    similar_ix=np.argsort(similarity_matrix[doc_id])[::-1]
    for ix in similar_ix:
        # skip the query itself
        if ix==doc_id:
            continue
        print (f'Document: {docs_df.iloc[ix]["documents"]}')
        print (f'Cosine Similarity: {similarity_matrix[doc_id][ix]}')
        print('\n')
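
When the ranking is needed as data rather than printed text, the same logic can return a list. This is a sketch only: `rank_similar`, its `top_n` parameter and `toy_similarities` are hypothetical names that are not part of the notebook.

```python
def rank_similar(doc_id, similarity_matrix, top_n=None):
    """Return (index, score) pairs for all other documents, best match first."""
    scores = similarity_matrix[doc_id]
    # every index except the query itself, ordered by descending similarity
    order = sorted((ix for ix in range(len(scores)) if ix != doc_id),
                   key=lambda ix: scores[ix], reverse=True)
    if top_n is not None:
        order = order[:top_n]
    return [(ix, scores[ix]) for ix in order]

toy_similarities = [[1.0, 0.2, 0.7],
                    [0.2, 1.0, 0.4],
                    [0.7, 0.4, 1.0]]
ranked = rank_similar(0, toy_similarities)
print(ranked)
```

The function only indexes rows and elements, so it should accept a nested list as well as a NumPy array such as the similarity matrix computed above.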

In [104]:
most_similar(0, docs_df, pairwiseCosSimilarities)

Document: A software engineer creates programs based on logic for the computer to execute.A software engineer has to be more concernedabout the correctness of the program in all the cases.


Similar Documents:
Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine learning application is more iterative and explorative process than software engineering.
Cosine Similarity: 0.7616797122187707


Document: Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.
Cosine Similarity: 0.2288787293216722


Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model bas

### Word2Vec
#### In this section, the Word2Vec method is used with the 'Google' and 'Stanford GloVe' pre-trained models

In [66]:
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

# tokenize and pad documents/query to make them of the same size
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs_df.documents_cleaned)
# tokenization
tokenized_docs = tokenizer.texts_to_sequences(docs_df.documents_cleaned)
# padded tokenized docs
tokenized_docs_padded = pad_sequences(tokenized_docs, maxlen=64, padding='post')
vocab_size = len(tokenizer.word_index)+1

print(tokenized_docs_padded[5])

[ 2  7 81 12  7 10 82  2  2  7 83 25  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
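
The effect of `padding='post'` can be sketched in a few lines of plain Python. `pad_post` is an illustrative helper, not a Keras function; it assumes `maxlen >= 1` and mirrors the documented default `truncating='pre'`, which keeps the last `maxlen` tokens of an over-long sequence.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; over-long ones keep their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front when too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=4))
```

Index 0 is reserved for padding, which is why the vocabulary size above is `len(tokenizer.word_index) + 1`.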


In [9]:
# load the Google pre-trained model, which represents each word as a 300-dimensional vector
import gensim.downloader as api
word2vec_model = api.load('word2vec-google-news-300')

#word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [67]:
# create the embedding matrix (each word represented by 300-d vector)
# every row is a vector representation from the vocabulary indexed by the tokenizer index
embedding_matrix = np.zeros((vocab_size, 300))

for word,i in tokenizer.word_index.items():
    if word in word2vec_model:
        embedding_matrix[i]=word2vec_model[word]

# building the document-word embeddings
docs_word_embeddings = np.zeros((len(tokenized_docs_padded),64,300))
for i in range(len(tokenized_docs_padded)):
    for j in range(len(tokenized_docs_padded[0])):
        docs_word_embeddings[i][j]=embedding_matrix[tokenized_docs_padded[i][j]]

docs_word_embeddings.shape

(7, 64, 300)
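
The lookup above amounts to two steps: fill one row per vocabulary index, then replace each token id with its row. A toy sketch with 2-dimensional vectors follows; all names and values in it are made up for illustration.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """One row per token id; row 0 (padding) and unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

def lookup(padded_doc, matrix):
    """Replace every token id in a padded document with its embedding row."""
    return [matrix[token_id] for token_id in padded_doc]

toy_word_index = {"software": 1, "engineer": 2, "concernedabout": 3}
toy_vectors = {"software": [1.0, 0.0], "engineer": [0.0, 1.0]}  # no entry for id 3
toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
print(lookup([1, 3, 2, 0], toy_matrix))
```

With NumPy arrays, the nested loop in the cell above should be equivalent to the single indexing expression `embedding_matrix[tokenized_docs_padded]`, which yields the same `(7, 64, 300)` shape.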

In [117]:
# calculating average of word vectors of a document weighted by tf-idf
document_embeddings=np.zeros((len(tokenized_docs_padded),300))
words=tfidfvectorizer.get_feature_names_out()
# building document embedding using weighted average (tf-idf)
for i in range(len(docs_word_embeddings)):
    for j in range(len(words)):
        document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_docs_vectors[i][j]
document_embeddings=document_embeddings/np.sum(tfidf_docs_vectors,axis=1).reshape(-1,1)

# calculating similarities between the first item (the query) and the rest of the documents
word2vec_pairwise_similarities=cosine_similarity(document_embeddings)
most_similar(0, docs_df, word2vec_pairwise_similarities)

Document: A software engineer creates programs based on logic for the computer to execute.A software engineer has to be more concernedabout the correctness of the program in all the cases.


Similar Documents:
Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine learning application is more iterative and explorative process than software engineering.
Cosine Similarity: 0.904882730205429


Document: Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.
Cosine Similarity: 0.7727723931801744


Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model base
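
The document embedding used above is a weighted mean: each word vector is scaled by the word's tf-idf weight, the results are summed, and the sum is divided by the total weight. A minimal sketch with made-up weights and 2-dimensional vectors (all names are illustrative):

```python
def tfidf_weighted_embedding(weights, word_vectors, dim):
    """Average word vectors using tf-idf weights; unknown words count as zero vectors."""
    total = [0.0] * dim
    for word, weight in weights.items():
        vector = word_vectors.get(word, [0.0] * dim)
        for k in range(dim):
            total[k] += weight * vector[k]
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        return total  # no weighted terms: leave the embedding at zero
    return [v / weight_sum for v in total]

toy_weights = {"software": 0.6, "engineer": 0.2, "concernedabout": 0.2}
toy_vectors = {"software": [1.0, 0.0], "engineer": [0.0, 1.0]}
doc_vector = tfidf_weighted_embedding(toy_weights, toy_vectors, dim=2)
print([round(v, 3) for v in doc_vector])
```

Because cosine similarity ignores vector length, dividing by the total weight changes the scale of each embedding but not the similarity scores or the ranking.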

#### Using GloVe pre-trained word embedding model (Stanford)

In [119]:
# load the GloVe pre-trained model, which represents each word as a 300-dimensional vector
import gensim.downloader as api
word2vec_glove_model = api.load('glove-wiki-gigaword-300')



In [120]:
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

# tokenize and pad documents/query to make them of the same size
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs_df.documents_cleaned)
# tokenization
tokenized_docs = tokenizer.texts_to_sequences(docs_df.documents_cleaned)
# padded tokenized docs
tokenized_docs_padded = pad_sequences(tokenized_docs, maxlen=64, padding='post')
vocab_size = len(tokenizer.word_index)+1

In [121]:
# create the embedding matrix (each word represented by 300-d vector)
# every row is a vector representation from the vocabulary indexed by the tokenizer index
embedding_matrix = np.zeros((vocab_size, 300))

for word,i in tokenizer.word_index.items():
    if word in word2vec_glove_model:
        embedding_matrix[i]=word2vec_glove_model[word]

# building the document-word embeddings
docs_word_embeddings = np.zeros((len(tokenized_docs_padded),64,300))
for i in range(len(tokenized_docs_padded)):
    for j in range(len(tokenized_docs_padded[0])):
        docs_word_embeddings[i][j]=embedding_matrix[tokenized_docs_padded[i][j]]

docs_word_embeddings.shape

(7, 64, 300)

In [122]:
# calculating average of word vectors of a document weighted by tf-idf
document_embeddings=np.zeros((len(tokenized_docs_padded),300))
words=tfidfvectorizer.get_feature_names_out()
# building document embedding using weighted average (tf-idf)
for i in range(len(docs_word_embeddings)):
    for j in range(len(words)):
        document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_docs_vectors[i][j]
document_embeddings=document_embeddings/np.sum(tfidf_docs_vectors,axis=1).reshape(-1,1)

# calculating similarities between the first item (the query) and the rest of the documents
word2vec_pairwise_similarities=cosine_similarity(document_embeddings)
most_similar(0, docs_df, word2vec_pairwise_similarities)

Document: A software engineer creates programs based on logic for the computer to execute.A software engineer has to be more concernedabout the correctness of the program in all the cases.


Similar Documents:
Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine learning application is more iterative and explorative process than software engineering.
Cosine Similarity: 0.9425996236392878


Document: Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.
Cosine Similarity: 0.8099935039689231


Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model bas

### BERT
#### In this section, a Transformer-based method is used. A pre-trained 'BERT' model encodes the content of the query and the documents

In [128]:
#!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

bertModel = SentenceTransformer('bert-base-nli-mean-tokens')

In [129]:
# pass a plain list of strings so encoding does not depend on the DataFrame index
docs_embeddings = bertModel.encode(docs_df['documents_cleaned'].tolist())
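
As the model name `bert-base-nli-mean-tokens` suggests, the sentence vector is assumed here to be the mean of the token vectors, with padding positions excluded. The sketch below illustrates only that pooling step on made-up 2-dimensional token vectors; `mean_pool` is an illustrative name, not part of the sentence-transformers API.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; masked positions are ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vector[k]
    return [v / max(count, 1) for v in total]

toy_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last position is padding
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_tokens, toy_mask)
print(sentence_vector)
```

Unlike the Word2Vec and GloVe sections, no tf-idf weighting is involved: every unmasked token contributes equally to the average.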

In [130]:
# calculating similarities between the first item (the query) and the rest of the documents
bert_pairwise_similarities=cosine_similarity(docs_embeddings)
most_similar(0, docs_df, bert_pairwise_similarities)

Document: A software engineer creates programs based on logic for the computer to execute.A software engineer has to be more concernedabout the correctness of the program in all the cases.


Similar Documents:
Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine learning application is more iterative and explorative process than software engineering.
Cosine Similarity: 0.890613317489624


Document: Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.
Cosine Similarity: 0.8488336801528931


Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model base