# Introduction
This notebook walks through an example analysis of semantic similarity. Semantic similarity means comparing not just the text itself, but what the text is supposed to mean. Even people can disagree about meaning, so computing something this abstract is a complex task.

A concept of vital importance is text embedding. Text is relatively hard to compare: comparing numbers is significantly faster than comparing strings, which may require iterating over every character. Transforming text into a numerical representation is therefore an important first step, and there are multiple ways to go about it.

The simplest way is called *bag of words*. Every word in the vocabulary of the text collection we want to compare is assigned an index in a simple lookup table. A sentence can then be vectorized: each vector has the same length as the lookup table, and each position holds the number of times the word with that index appears in the sentence. Comparing these vectors essentially measures the overlap between the words used in the sentences, on the assumption that the same words lead to the same meaning.
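The lookup table and count vectors described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used later in this notebook; the two example sentences and the function names `build_vocabulary` and `vectorize` are made up for the occasion, and tokenization is simply lowercasing and splitting on whitespace.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Assign every distinct word an index in a lookup table."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


# Made-up example sentences, for illustration only.
sentences = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(sentences)
vectors = [vectorize(sentence, vocabulary) for sentence in sentences]

print(vocabulary)
print(vectors)
```

Every vector has one position per vocabulary word, so the vectors grow with the vocabulary and are mostly zeros for short sentences.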

Of course, as the 'bag' implies, we lose word order, a very important part of semantic meaning. Additionally, words that are functionally similar are still counted separately; two sentences that use different synonyms would look very different under this scheme. 'Greater' could be stemmed to 'great', but the same is harder for 'better' and 'good'. What about 'terrific' and 'great'? We could look up synonyms for each word, but this quickly becomes very inefficient. Then there is the question of how relevant a word is: 'I have a good book' is very different from 'I have a good car', after all, despite the two sharing four of their five words (80% "similar").
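To make the 80% figure concrete, the sketch below compares the word counts of the two sentences. It assumes cosine similarity as the comparison measure, which is a choice made here for illustration and not something the text prescribes; for these two sentences it happens to coincide with the fraction of shared words.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count mappings."""
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    return dot / (norm_a * norm_b)


book = Counter("i have a good book".split())
car = Counter("i have a good car".split())

similarity = cosine_similarity(book, car)
print(round(similarity, 2))
```

The score is high even though the one word that differs carries most of the meaning, which is exactly the relevancy problem.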

This relevancy problem can be reduced using TF-IDF, or term frequency–inverse document frequency, which introduces a weighting. The more often a term shows up in a document (a document being any arbitrary amount of text), the more relevant it is to that document. The more documents the term shows up in, however, the less relevant it is. Depending on the exact implementation, the sentences above may be reduced to 'good book' and 'good car', for example.
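A minimal sketch of this weighting, assuming the plain unsmoothed variant in which the weight is the term frequency multiplied by the logarithm of (number of documents / number of documents containing the term). Library implementations often use smoothed or normalized variants, so their numbers will differ. The third sentence, 'I have a bike', is a made-up addition so that 'good' does not occur in every document.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} mapping per document."""
    tokenized = [document.lower().split() for document in documents]
    n_documents = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    document_frequency = Counter(
        word for tokens in tokenized for word in set(tokens)
    )
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens))
            * math.log(n_documents / document_frequency[word])
            for word, count in counts.items()
        })
    return weights


# The third document is hypothetical, added for illustration only.
documents = ["I have a good book", "I have a good car", "I have a bike"]
weights = tf_idf(documents)

# Words that occur in every document get weight 0 and drop out.
kept = [sorted(word for word, weight in doc.items() if weight > 0) for doc in weights]
print(kept)
```

With this small corpus, 'i', 'have' and 'a' occur in every document and receive a weight of zero, leaving 'good' and 'book' for the first sentence and 'good' and 'car' for the second, with 'book' and 'car' weighted more heavily than 'good'.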

A more complex solution is a continuous bag of words (CBOW), as described for the Word2Vec model introduced by researchers at Google; an implementation exists in the open-source Gensim package. The concept is similar to a normal bag of words, but instead of looking at each word on its own, we also look at the words surrounding it. Each word is paired with its surroundings, creating a structure that allows words to be predicted: in CBOW the surrounding words predict the word between them, so in the sentences above 'a' together with 'book' or 'car' can predict 'good'. TF-IDF could then be applied to estimate the odds of predicting each word, or these weights can be determined more accurately using a neural network, as described in the original article. The Gensim package uses the latter method. The (abstracted) weights can be vectorized and then used much like the simpler embedding.
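The pairing of each word with its surroundings can be sketched as follows. This only builds the (context, word) pairs that such a model would learn from; the neural network that turns them into weights is not shown. The function name `context_target_pairs` and the window size of one word on either side are assumptions made for illustration.

```python
def context_target_pairs(tokens, window=1):
    """Pair each word with the words at most `window` positions away."""
    pairs = []
    for index, target in enumerate(tokens):
        left = tokens[max(0, index - window):index]
        right = tokens[index + 1:index + 1 + window]
        pairs.append((left + right, target))
    return pairs


pairs = []
for sentence in ["i have a good book", "i have a good car"]:
    pairs.extend(context_target_pairs(sentence.split(), window=1))

# Every context in which the word 'good' was seen.
contexts_of_good = [context for context, target in pairs if target == "good"]
print(contexts_of_good)
```

Both 'book' and 'car' show up next to 'good' with the same left neighbour, which is the kind of regularity a model trained on these pairs can pick up.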

To compare sentences, we can take the embedded vector of each word in the sentence and average them. Two averaged vectors can then be compared to determine how similar the sentences are. This mitigates the relevancy and synonym problems but does not preserve order; some context from sentence structure is lost. Doc2Vec, an extension of Word2Vec co-authored by one of its original authors, proposes to solve this by taking the word vectors in a document and adding a 'phantom word'; the resulting so-called document vector is then unique to that document, in an attempt to remember, or at least account for, the entire document context.
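The averaging step can be illustrated with a toy example. The three-dimensional word vectors below are hand-picked, hypothetical values chosen so that 'good' and 'great' lie close together; real embeddings are learned from data and have far more dimensions. Cosine similarity is again assumed as the comparison measure.

```python
import math

# Hypothetical word vectors, hand-picked for illustration only.
word_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.0],
    "book": [0.0, 0.9, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def sentence_vector(sentence):
    """Average the vectors of all known words in the sentence."""
    vectors = [
        word_vectors[word]
        for word in sentence.lower().split()
        if word in word_vectors
    ]
    if not vectors:
        raise ValueError("sentence contains no known words")
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


synonym_score = cosine_similarity(
    sentence_vector("good book"), sentence_vector("great book")
)
topic_score = cosine_similarity(
    sentence_vector("good book"), sentence_vector("good car")
)

print(synonym_score > topic_score)
# Averaging ignores word order: both orderings give the same vector.
print(sentence_vector("good book") == sentence_vector("book good"))
```

With these toy vectors, swapping 'good' for 'great' changes the sentence vector far less than swapping 'book' for 'car', while reordering the words changes nothing at all.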

In the next cell, we load versions of these three models trained on the data provided to us, to show their capability.

# Models