# Introduction
This notebook walks through an example of semantic similarity analysis. Semantic similarity means comparing not just the text itself, but what the text is supposed to mean. Even people can disagree about this, so computing something that abstract is a complex task.

A concept of vital importance is text embedding. Text is relatively hard to compare: comparing numbers is significantly faster than comparing strings, which may require checking each character in turn. Transforming the text into a numerical representation is therefore an efficient and important first step. There are multiple ways to go about this.

The simplest way is called <i>bag of words</i>. Every word in the vocabulary of the text collection we want to compare is given an index in a simple lookup table. A sentence can then be vectorized: each vector has the same length as the lookup table, and each position holds the number of times the word with that index appears in the sentence. Comparing these vectors essentially measures the overlap between the words used in the sentences, on the assumption that the same words lead to the same meaning.
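As a minimal sketch of the lookup table and count vectors described above, the snippet below builds both for two toy sentences. The function names and the example sentences are made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Assign each distinct word an index, in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def bag_of_words(sentence, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector


# Toy corpus, made up for illustration only.
corpus = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(corpus)
first_vector = bag_of_words(corpus[0], vocabulary)
second_vector = bag_of_words(corpus[1], vocabulary)

print(vocabulary)     # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5}
print(first_vector)   # [2, 1, 1, 1, 1, 0]
print(second_vector)  # [1, 0, 1, 0, 0, 1]
```

Both vectors have the length of the vocabulary, so they can be compared position by position even though the sentences differ in length.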

Of course, as the 'bag' implies, we lose word order, a very important part of semantic meaning. Additionally, words may be functionally similar yet counted separately; two sentences that use different synonyms would look very different under this approach. 'Greater' could be stemmed to 'great', but the same is harder for 'better' and 'good'. What about 'terrific' and 'great'? We could look up synonyms for each word, but this quickly becomes very inefficient. Then there's the question of how relevant a word is: 'I have a good book' is very different from 'I have a good car', after all, even though the two sentences share four of their five words and are thus 80% "similar".

This relevancy problem can be reduced using TF-IDF, or term frequency–inverse document frequency, which introduces a weighting. The more often a term shows up in a document (which can be any arbitrary amount of text), the more relevant it is to that document. The more documents the term shows up in, however, the less relevant it is. Depending on the exact implementation, the sentences above may be reduced to 'good book' and 'good car', for example.
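The weighting can be sketched in a few lines. This is one common variant, assuming a weight of (term count / document length) × log(number of documents / documents containing the term); real implementations differ in smoothing and normalisation. The third document is an assumption added here so that 'good' does not appear in every document.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# The third sentence is hypothetical, added for illustration only.
documents = ["I have a good book", "I have a good car", "I have a pen"]
weights = tf_idf(documents)

# Keep only the words that received a non-zero weight in each document.
relevant = [[word for word, weight in doc.items() if weight > 0] for doc in weights]
print(relevant)  # [['good', 'book'], ['good', 'car'], ['pen']]
```

Words that occur in every document ('i', 'have', 'a') get a weight of exactly zero, which is how the first two sentences end up reduced to 'good book' and 'good car' in this toy corpus.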

A more complex solution is to use a continuous bag of words, as described in the Word2Vec model originally patented by Google, an implementation of which exists in the open-source Gensim package. The concept is similar to a normal bag of words, but instead of looking at each word in isolation, we also look at the surrounding words. Each word is paired with its surroundings, creating a structure that allows for the prediction of words: using the above sentences again, 'good' can predict 'a', 'book', or 'car'. TF-IDF could then be applied to determine the odds of predicting each word, or these weights can be determined more accurately using a neural network, as described in the original article. The Gensim package uses the latter method. The (abstracted) weights can be vectorized, and the resulting vectors are then used much like the simpler embeddings.
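The pairing of each word with its surroundings can be sketched as follows. The function name and the `window` parameter are hypothetical; the snippet only produces the training pairs and does not train anything. In the continuous bag-of-words setting the surrounding words are the input and the centre word is the prediction target.

```python
def context_target_pairs(tokens, window=2):
    """Pair each word with the words at most `window` positions away."""
    pairs = []
    for position, target in enumerate(tokens):
        start = max(0, position - window)
        before = tokens[start:position]
        after = tokens[position + 1:position + 1 + window]
        pairs.append((before + after, target))
    return pairs


tokens = "i have a good book".split()
pairs = context_target_pairs(tokens, window=1)
for context, target in pairs:
    print(context, "->", target)
# ['have'] -> i
# ['i', 'a'] -> have
# ['have', 'good'] -> a
# ['a', 'book'] -> good
# ['good'] -> book
```

A larger `window` gives each word more context at the cost of more, and noisier, training pairs.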

To compare sentences, the embedded word vector is looked up for each word in the sentence, and these vectors are averaged. Two averaged vectors can then be compared to determine how similar the sentences are. This addresses the relevancy and synonym problems, but does not preserve order; some context from sentence structure is lost. An extension of the Word2Vec method, the Doc2Vec method, developed in part by the same authors, proposes to solve this by taking the word vectors in a document and adding a 'phantom word'; the resulting so-called document vector is then unique to that document, in an attempt to remember, or at least account for, the entire document context. Gensim also includes an implementation of this method.
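The averaging step might look like the sketch below. The three-dimensional toy embeddings are made up, and skipping words without an embedding is an assumption; the helper used later in this notebook may handle unknown words differently.

```python
def average_vector(tokens, word_vectors):
    """Average the vectors of all tokens that have an embedding."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        raise ValueError("none of the tokens has an embedding")
    size = len(known[0])
    return [sum(vector[i] for vector in known) / len(known) for i in range(size)]


# Toy 3-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "book": [3.0, 2.0, 0.0],
}

sentence_vector = average_vector(["a", "good", "book"], toy_vectors)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

Because the average ignores position, 'good book' and 'book good' produce the same sentence vector, which is the loss of order mentioned above.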

The length of the resulting vector depends on the implementation. The Gensim implementations use a vector size of 100 by default. Larger vectors can capture more detail, but make computation more expensive. In our case, 100 was an apt choice.

In the next cell, we load versions of these three models trained on the data provided to us, to show their capability. Additionally, a pretrained Google model known as the Universal Sentence Encoder will be shown.

# Embedding Models

In [1]:
from gensim.models import word2vec, doc2vec
from embed import universal_sentence_encoder as use_model


# Load the pretrained models
w2v_model = word2vec.Word2Vec.load("models/w2vmodel.mod")
d2v_model = doc2vec.Doc2Vec.load("models/doc2vec.model")
# tfidf model?


# Classification

The models and explanation above clarify how to embed text, producing vectors for the relevant documents. Comparing these vectors was mentioned, but not exactly how it is done. There are, again, multiple ways to do this. The simplest way is to calculate the cosine similarity between them: we treat them as position vectors and check the angle between them. This is relatively easy to compute, intuitive, and gives a result between -1 and 1 (between 0 and 1 when all vector components are non-negative, as with word counts), making it very easy to use: set or train a suitable threshold, and if the cosine similarity is above that threshold, the sentences are taken to carry the same semantic intent.
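A minimal sketch of the cosine comparison, written out in plain Python to make the arithmetic explicit. The function names are hypothetical, and the default threshold of 0.98 simply mirrors the value used in the cells further down. The example vectors are word counts for 'I have a good book' and 'I have a good car' over the vocabulary (i, have, a, good, book, car).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def same_intent(a, b, threshold=0.98):
    """Decide whether two vectors are similar enough to count as the same."""
    return cosine_similarity(a, b) > threshold


book_vector = [1, 1, 1, 1, 1, 0]
car_vector = [1, 1, 1, 1, 0, 1]

print(round(cosine_similarity(book_vector, car_vector), 3))  # 0.8
print(same_intent(book_vector, car_vector))                  # False
print(same_intent(book_vector, book_vector))                 # True
```

With plain word counts the two sentences score 0.8, which matches the 80% overlap noted earlier and falls below the strict threshold.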

A more complex but significantly more accurate way of comparing is through the use of a neural network. Starting with more-or-less random predictions, the correct answers are used to calculate how wrong the network's prediction was, which is then used to update it, improving the next prediction. Through repetition, the network thus learns how to tell which vectors are supposed to be 'similar', learning to classify them. We implemented a relatively simple neural network that takes two concatenated vectors as input: the two sentence vectors we want to compare. It has one hidden layer, equal in size to the input layer, with the rectified linear unit (ReLU) as its activation function, and an output layer of size 1 with a sigmoid activation. The sigmoid produces a value between 0 and 1, which is thresholded at 0.5 to give a 0 or 1, signifying whether the sentences are the same or not.
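To make the layer arithmetic concrete, here is a rough sketch of a single forward pass through such a network in plain Python. It is not the trained Keras model used in this notebook: the weights are random and untrained, the vector size is shrunk to 4 for readability, and all function names are hypothetical, so the resulting prediction carries no meaning.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for every output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def random_layer(n_inputs, n_outputs, rng):
    """Small random weights and zero biases, standing in for trained values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_outputs)]
    return weights, [0.0] * n_outputs


def predict_same(vector_a, vector_b, hidden, output, threshold=0.5):
    pair = vector_a + vector_b                          # concatenate the two sentence vectors
    hidden_out = dense(pair, *hidden, relu)             # hidden layer, same size as the input
    probability = dense(hidden_out, *output, sigmoid)[0]  # single sigmoid output unit
    return probability, probability > threshold


rng = random.Random(0)
size = 4  # toy sentence-vector size; the notebook's models use 100
hidden = random_layer(2 * size, 2 * size, rng)
output = random_layer(2 * size, 1, rng)

probability, is_same = predict_same([0.1] * size, [0.2] * size, hidden, output)
print(probability, is_same)
```

Training would repeatedly adjust the weights so that `probability` moves towards 1 for pairs labelled as the same and towards 0 for the rest.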

Training this second method is rather computationally expensive, though training doesn't necessarily have to happen more than once. However, it also requires a large amount of annotated data: a large list of sentence pairs for which it is known whether they are semantically the same or not. Additionally, this data should be very varied, as scenarios not adequately represented in the training data will be hard to predict. The accuracy of the annotations is also of vital importance for the accuracy of the network, meaning the annotation will have to be done, or at least verified, by hand. Our data consisted of about 400,000 pairs of questions, meaning the network is likely to be better at classifying questions than other sentence structures. While this kind of machine learning is not applicable in every situation, these prerequisites are something to keep in mind.

Below we load our pretrained models, after which the requirements for analysis are present, and some examples can be run.

# Neural Network

In [23]:
from tensorflow import keras
# Load the pretrained models
w2v_network = keras.models.load_model('neuralnets/w2v.h5')
d2v_network = keras.models.load_model('neuralnets/d2v.h5')
use_network = keras.models.load_model('neuralnets/use.h5')

In [46]:
from classifiers import binaryclassification as bc
from embed import doc2vec
from embed import w2vec
import numpy as np
from nltk.tokenize import RegexpTokenizer
from scipy import spatial
import tensorflow as tf
tokenizer = RegexpTokenizer(r'\w+\'*\w*')

sentence1 = "Is it possible for a meme to beat two memes?"
sentence2 = "Is it possible for a meme to beat two memes?"

tokens1 = tokenizer.tokenize(sentence1.lower())
tokens2 = tokenizer.tokenize(sentence2.lower())

w2v_vector = np.array(bc.vectorize_w2v(w2v_model, tokens1, tokens2)).reshape((1, 200))
d2v_vector = np.array(doc2vec.doc2vec(d2v_model, tokens1, tokens2)).reshape((1, 200))
use_vector = tf.convert_to_tensor(np.array(tf.concat([use_model.encode(sentence1), use_model.encode(sentence2)], axis=0)).reshape((1, 1024)))

In [47]:
# Threshold each network's sigmoid output at 0.5 to get a yes/no answer;
# the cosine similarity is compared against a stricter threshold of 0.98.
w2vnet_ans = (w2v_network.predict(w2v_vector) > 0.5).astype("int32")[0][0] == 1
w2vcos_ans = w2vec.string_similarity(w2v_model, tokens1, tokens2) > 0.98

d2vnet_ans = (d2v_network.predict(d2v_vector) > 0.5).astype("int32")[0][0] == 1
d2vcos_ans = (1 - spatial.distance.cosine(d2v_vector[0][:100], d2v_vector[0][100:])) > 0.98

usenet_ans = (use_network.predict(use_vector) > 0.5).astype("int32")[0][0] == 1
usecos_ans = (1 - spatial.distance.cosine(use_vector[0][:512], use_vector[0][512:])) > 0.98

print("W2Vnet says:", w2vnet_ans, "\tW2V Cosine:", w2vcos_ans, "\tconclusion:", w2vcos_ans or w2vnet_ans)
print("D2Vnet says:", d2vnet_ans, "\tD2V Cosine:", d2vcos_ans, "\tconclusion:", d2vnet_ans or d2vcos_ans)
# print("TFIDF says: ", model_tfidf.predict(tfidf_vector))
print("USEnet says:", usenet_ans, "\tUSE Cosine:", usecos_ans, "\tconclusion:", usenet_ans or usecos_ans)

W2Vnet says: False 	W2V Cosine: True 	conclusion: True
D2Vnet says: False 	D2V Cosine: False 	conclusion: False
USEnet says: False 	USE Cosine: True 	conclusion: True
