## Question 1.1
Introduced in 1950 by Richard Hamming, the Hamming code is a means of counteracting errors in the transfer of binary data. In Hamming's own case, the errors originated from a punch card reader at his workplace, but the method can in principle be used for error correction in data transfer regardless of the medium.
The Hamming code combines the dot product from linear algebra with parity bits. The idea is this: take a 4-bit piece of data that you wish to transfer. This can be seen as a 4x1 vector. Multiply this vector by a 7x4 so-called _"generator matrix"_ (with all arithmetic done modulo 2) to create a 7-bit _code word_. This 7-bit code word contains the original 4 bits of data along with 3 parity bits that can be used for error correction once the data has been transferred.
The code below contains a method that, when supplied with a 4-bit data string, constructs a 7-bit code word using a Hamming code generator matrix (named `G_matrix` in the code). The details of the method will be explained below.


In [3]:
# Imports
import random

# Method to convert a 4-bit message into a 7-bit code word,
# adding 3 parity bits in the process
def encoder(message):
    
    #This is the generator/encoding matrix
    G_matrix = [[1,0,1,1],
                [1,1,0,1],
                [0,0,0,1],
                [1,1,1,0],
                [0,0,1,0],
                [0,1,0,0],
                [1,0,0,0]]
    
    # Variable to hold the 7-bit codeword
    code_word = []
    
    # Loop over the rows of the matrix
    for i in range(len(G_matrix)):
        # Variable to hold the dot-product
        count = 0
        # Nested for-loop to calculate the dot-product of the current row and the 4-bit message
        for x in range(len(message)):
            count += message[x] * G_matrix[i][x]
        # Every dot-product is reduced modulo 2 and appended to code_word, creating the 7-bit code word.
        code_word.append(count%2)
    
    return(code_word)

print('7-bit code_word',encoder([1,0,1,0]))


7-bit code_word [0, 1, 0, 0, 1, 0, 1]


When running the encoder on the 4 bits __[1,0,1,0]__, the resulting 7-bit code word is __[0,1,0,0,1,0,1]__. To understand the concept of Hamming codes, and why the 7-bit code word looks the way it does, it is necessary to understand parity bits.
### Parity bits
These bits do not contain parts of the original data, but rather metadata in the form of _"data about the data"_. A parity bit indicates whether the number of 1's in a piece of data is even or odd. Working with even parity means that a parity bit will be 1 if the number of 1's in the piece of data is odd, as this makes the total number of 1's even (odd number + 1 = even number).
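
As a minimal sketch of the even-parity rule described above (the helper name `even_parity_bit` is hypothetical and not one of the methods in this notebook), the parity bit is simply the sum of the watched bits modulo 2:

```python
def even_parity_bit(bits):
    """Return the bit that makes the total number of 1's even."""
    return sum(bits) % 2

print(even_parity_bit([1, 0, 1]))  # two 1's (already even) -> 0
print(even_parity_bit([1, 1, 1]))  # three 1's (odd) -> 1
```
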
In the code above, 3 parity bits are introduced in positions 1, 2 and 4 of the 7-bit code word. This can be seen from rows 1, 2 and 4 of the G_matrix, which each have three 1's, but in different positions. This means that each parity bit looks at 3 positions in the original 4 bits of data and checks whether the number of 1's in them is even or odd. As an example, the first parity bit (first row of the G-matrix) has the form __[1,0,1,1]__, which means that it "looks" for 1's in positions 1, 3 and 4. Our data is __[1,0,1,0]__, which means we get a dot product of 2 in this instance, as both vectors have a 1 in positions 1 and 3. And so the parity bit in position 1 is a zero, as the number of 1's in the bits looked at is already even. In the code this check for even/odd is done by using the _modulus operator_ with 2 (%2).
The following table illustrates which parity bits (P) are looking at which data bits (D), and which positions the different bits are placed in, in the final 7-bit code word:

<img src="attachment:image.png" width="400">

The 4 remaining bits of the 7-bit code word, in positions 3, 5, 6 and 7, are simply copies of the original 4 bits of data. It is worth noting (as is also illustrated in the above table) that the G-matrix used in this case reverses the order of the original 4 bits of data. This can be seen from the fact that the 7th row of the G-matrix has a 1 in its first position, which means that it "looks" at the first position of the data. The 6th row has a 1 in its 2nd position, the 5th row in its 3rd position, and the 3rd row in its 4th position.
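
The layout described above can also be written out bit by bit. The following is a sketch only, assuming the same bit order as the `G_matrix` used earlier. The function name `encode_by_parity` is hypothetical, and XOR (`^`) stands in for addition modulo 2:

```python
def encode_by_parity(message):
    d1, d2, d3, d4 = message
    p1 = d1 ^ d3 ^ d4  # row 1 of G_matrix: [1,0,1,1]
    p2 = d1 ^ d2 ^ d4  # row 2 of G_matrix: [1,1,0,1]
    p3 = d1 ^ d2 ^ d3  # row 4 of G_matrix: [1,1,1,0]
    # Positions 1, 2 and 4 hold the parity bits;
    # positions 3, 5, 6 and 7 hold the data bits in reverse order
    return [p1, p2, d4, p3, d3, d2, d1]

print(encode_by_parity([1, 0, 1, 0]))  # [0, 1, 0, 0, 1, 0, 1]
```

For the example message this gives the same code word as the matrix-based `encoder`.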

Encoding 4 bits of data into a 7-bit code word is only really a clever thing to do if the data is at risk of incurring errors in the form of flipped bits. To demonstrate how a Hamming code can effectively find and correct such a flipped-bit error, the following code is a method that introduces a random flipped bit to our code word __[0, 1, 0, 0, 1, 0, 1]__: 


In [4]:
# Method introducing a random 1-bit error (bit-flip) to the code_word,
# imitating a noisy data-transfer
def noisy_channel (code_word):
    
    # Variable to hold the code word with the 1-bit error
    err_code_word = code_word[:]
    # Variable to hold a random location of a 1-bit error
    bit_flip_location = random.randint(0,len(code_word)-1)
    
    # Flip the bit at the random location of the error
    if code_word[bit_flip_location] == 1:
        err_code_word[bit_flip_location] = 0
    else:
        err_code_word[bit_flip_location] = 1
    
    return (err_code_word)
print('Original code word:', encoder([1,0,1,0]))
err_code_word = noisy_channel(encoder([1,0,1,0]))
print('Code word w. error:', err_code_word)



Original code word: [0, 1, 0, 0, 1, 0, 1]
Code word w. error: [0, 1, 0, 1, 1, 0, 1]


When running the above code, it should be apparent that a bit in one of the 7 positions has been flipped. This is obviously a problem, as the code word no longer represents the original data. There is of course a chance that the flipped bit is one of the parity bits, in which case the original 4 bits of data are still correct. But had we sent the 4 bits of data without the 3 parity bits, a flipped bit would inevitably have been a bit of the original data, and we would have had no means of finding and correcting the error.
The following code introduces a method that looks for errors and corrects them, if any are present. This is done by making use of a so-called _parity check matrix_ (named H_Matrix in the code), which checks the 3 parity bits for information on whether an error is present in the data and, if so, in which location it is, so that the bit in this position can be switched back.
The details of the code are explained below.

In [5]:

# Method for correcting the 1-bit error introduced to the original 7-bit code word
def error_correction (err_code_word):
    
    # This is the parity-check matrix
    H_Matrix = [[1,0,1,0,1,0,1],
                [0,1,1,0,0,1,1],
                [0,0,0,1,1,1,1]]
    
    # Variable to hold the location of the flipped bit (bit error)
    error_location = 0
    # Variable to hold the code word with the erroneous bit flipped back
    corrected_code_word = err_code_word[:]
    
    # Loop over the rows of the matrix
    for i in range(len(H_Matrix)):
        # Variable to hold the dot-product
        count = 0
        # Nested for-loop to calculate the dot-product of the current row and the 7-bit code word
        for x in range(len(err_code_word)):
            if err_code_word[x] * H_Matrix[i][x] == 1:
                count += 1
        # Test whether the dot-product is even or odd. 
        # With an odd dot-product, the binary value of the parity check (2**i) is added to the error location
        if count%2 == 1:
            error_location += 2**i
    
    # Test whether the error location is 0 (no error).
    # If it is not, the bit at the error location is flipped back
    if error_location != 0:
        if err_code_word[error_location-1] == 1:
            corrected_code_word[error_location-1] = 0
        else: 
            corrected_code_word[error_location-1] = 1
    
    return (corrected_code_word)

print('Original code word:           ', encoder([1,0,1,0]))
print('Code word w. error:           ', err_code_word)
print('Code word w. error correction:', error_correction(err_code_word))

Original code word:            [0, 1, 0, 0, 1, 0, 1]
Code word w. error:            [0, 1, 0, 1, 1, 0, 1]
Code word w. error correction: [0, 1, 0, 0, 1, 0, 1]


The above code, which locates and corrects the random bit error, works in principle just like the encoding of the 4-bit message. The H-matrix, being a 3x7 matrix, when multiplied by our 7x1 code word vector, results in a 3x1 vector representing the results of the 3 parity checks on the code word. If we look at the first row of the H-matrix, __[1,0,1,0,1,0,1]__, it has 1's in positions 1, 3, 5 and 7. Position 1 is the position of the parity bit itself, and if we look back at the table above (subsection on __parity bits__), we see that the first parity bit does indeed look at the data bits in positions 3, 5 and 7. As the Hamming code works with even parity, the dot product over these 4 positions should be an even number (once again the _modulus operator_ is used with 2 (%2) to make this check in the code). If the sum is even, the modulus operation results in a 0, indicating that no single bit has been flipped in any of the 4 positions 1, 3, 5, 7, as the dot product would otherwise have been odd.
This means that if there are no errors in the data, the result of multiplying the H-matrix by the code word will be a [0,0,0] vector. In our case an error has been introduced, as 1 bit has been flipped. And this is where the Hamming code gets really clever. If the flipped bit is one of the parity bits, exactly one parity check fails, as that parity bit now gives the wrong information (saying even when the number of 1's being looked at is odd, or the other way around). If, on the other hand, the bit error is in a bit of the original data, every data bit is "covered" by at least 2 parity bits (3 parity bits for data-bit 1). This means that when a data bit is flipped, at least 2 parity checks fail. Knowing which parity checks fail, we also know which data bit has been flipped, as it must be the one shared by exactly those parity bits that are "lying". In the code, each failing check `i` adds `2**i` to the error location, so the failing checks together spell out the position of the flipped bit in binary. This information is then used to flip back the bit in the corresponding position (a parity bit if 1 parity check fails, and a data bit if 2 or 3 parity checks fail).
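
The way the three parity checks point at the flipped bit can be sketched as follows. The vector of check results is commonly called the syndrome. The helper names `syndrome` and `error_position` are hypothetical and only illustrate the idea behind `error_correction`, using the same parity check matrix:

```python
H_MATRIX = [[1, 0, 1, 0, 1, 0, 1],
            [0, 1, 1, 0, 0, 1, 1],
            [0, 0, 0, 1, 1, 1, 1]]

def syndrome(word):
    """Results of the three parity checks (0 = passed, 1 = failed)."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H_MATRIX]

def error_position(word):
    """1-based position of a single flipped bit, or 0 if all checks pass."""
    s = syndrome(word)
    return s[0] * 1 + s[1] * 2 + s[2] * 4

code_word = [0, 1, 0, 0, 1, 0, 1]
for position in range(1, 8):
    damaged = code_word[:]
    damaged[position - 1] ^= 1  # flip exactly one bit
    print(position, syndrome(damaged), error_position(damaged))
```

For every single-bit flip, the computed error position equals the position that was flipped, because column `j` of the matrix is the binary representation of `j`.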
The following code completes the last step of utilizing the Hamming code, by decoding the error corrected 7-bit code word back into the original 4-bit data-string by utilizing a _decoding matrix_ (named R_matrix in the code). The details of the code are explained below:



## Question 1.2


In [6]:
# Method for converting a 7-bit code word back into a 4-bit message,
# removing 3 parity bits in the process
def decoder(code_word):
    
    # This is the decoding matrix
    R_matrix = [[0,0,0,0,0,0,1],
                [0,0,0,0,0,1,0],
                [0,0,0,0,1,0,0],
                [0,0,1,0,0,0,0]]
    
    # Variable to hold the decoded 4-bit message
    message = []
    
    # Loop over the rows of the matrix
    for i in range(len(R_matrix)):
        # Variable to hold the dot-product
        count = 0
        # Nested for-loop to calculate the dot-product of the current row and the 7-bit code_word
        for x in range(len(code_word)):
            if code_word[x] * R_matrix[i][x] == 1:
                count += 1
        # Every dot-product is appended to the message variable, 
        # recreating the original 4-bit message
        message.append(count)
    
    return(message)

print('Original 4-bit message:        [1, 0, 1, 0]')
print('Original code word:           ', encoder([1,0,1,0]))
print('Code word w. error:           ', err_code_word)
print('Code word w. error correction:', error_correction(err_code_word))
print('Decoded code word:            ', decoder(error_correction(err_code_word)))



Original 4-bit message:        [1, 0, 1, 0]
Original code word:            [0, 1, 0, 0, 1, 0, 1]
Code word w. error:            [0, 1, 0, 1, 1, 0, 1]
Code word w. error correction: [0, 1, 0, 0, 1, 0, 1]
Decoded code word:             [1, 0, 1, 0]


The decoding process basically performs the opposite of the encoding process. This time we are multiplying the 4x7 R_matrix by the 7x1 code word. This results in a 4x1 vector correctly representing the original data string. Looking at the 4 rows of the R-matrix, it becomes quite clear that each one is looking at just 1 position in the 7-bit code word. The first row is only looking at the 7th position, which is where the 1st bit of the original 4-bit data string was placed (as explained, the data string was reversed due to the way the Hamming code used in this paper is constructed). By designing the R-matrix in this way, the data is turned back around, and the 3 parity bits are filtered out. What we are left with is the original 4 bits of data, all correct, even though they passed through a noisy channel that introduced an error to one of the bits.
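
Since each row of the decoding matrix picks out exactly one position, the same result can be sketched with plain indexing. This is an illustration under that assumption, not the notebook's `decoder`. The names `DATA_INDICES` and `decode_by_index` are hypothetical:

```python
# 0-based indices of positions 7, 6, 5 and 3 in the code word
DATA_INDICES = (6, 5, 4, 2)

def decode_by_index(code_word):
    return [code_word[i] for i in DATA_INDICES]

print(decode_by_index([0, 1, 0, 0, 1, 0, 1]))  # [1, 0, 1, 0]
```
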

It is worth noting that the (7,4) Hamming code is only effective at correcting 1-bit errors. The result of the parity checks does not distinguish between 1-bit and 2-bit errors, and with 2 flipped bits it is not possible to deduce their positions. As such, any attempt at correcting a 2-bit error will yield a wrong result.
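
The 2-bit limitation can be illustrated with a small self-contained sketch. The helpers `correct` and `decode` are hypothetical, compact stand-ins for `error_correction` and `decoder`, using the same parity check matrix and bit layout:

```python
H_MATRIX = [[1, 0, 1, 0, 1, 0, 1],
            [0, 1, 1, 0, 0, 1, 1],
            [0, 0, 0, 1, 1, 1, 1]]

def correct(word):
    # Each failing parity check i contributes 2**i to the error position
    checks = [sum(h * w for h, w in zip(row, word)) % 2 for row in H_MATRIX]
    position = checks[0] + 2 * checks[1] + 4 * checks[2]
    fixed = word[:]
    if position:
        fixed[position - 1] ^= 1
    return fixed

def decode(word):
    # Data bits sit in positions 7, 6, 5 and 3
    return [word[6], word[5], word[4], word[2]]

sent = [0, 1, 0, 0, 1, 0, 1]  # code word for the message [1, 0, 1, 0]
received = sent[:]
received[0] ^= 1  # flip the bit in position 1
received[1] ^= 1  # flip the bit in position 2

repaired = correct(received)
print(repaired)          # [1, 0, 1, 0, 1, 0, 1]
print(decode(repaired))  # [1, 0, 1, 1]
```

The two failing checks point at position 3, so the "correction" flips a third bit and the decoded message differs from the original.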


## Question 1.3


The following code illustrates the overall functionality of all methods with 4 different 4-bit messages:


In [9]:
messages = [[0,0,0,1], [0,1,1,0], [1,1,1,0], [1,1,1,1]]

for message in messages:
    # Encode once and pass the code word through the noisy channel once,
    # so that every printed line refers to the same corrupted code word
    code_word = encoder(message)
    noisy_code_word = noisy_channel(code_word)
    corrected_code_word = error_correction(noisy_code_word)
    print('\nThe original message:                            ', message)
    print('The 7-bit code_word of the message:              ', code_word)
    print('With a random bit flip the code word becomes:    ', noisy_code_word)
    print('Code word after error detection and correction:  ', corrected_code_word)
    print('The decoded 4 bit message after error-correction:', decoder(corrected_code_word))


The original message:                             [0, 0, 0, 1]
The 7-bit code_word of the message:               [1, 1, 1, 0, 0, 0, 0]
With a random bit flip the code word becomes:     [1, 1, 0, 0, 0, 0, 0]
Code word after error detection and correction:   [1, 1, 1, 0, 0, 0, 0]
The decoded 4 bit message after error-correction: [0, 0, 0, 1]

The original message:                             [0, 1, 1, 0]
The 7-bit code_word of the message:               [1, 1, 0, 0, 1, 1, 0]
With a random bit flip the code word becomes:     [1, 1, 0, 1, 1, 1, 0]
Code word after error detection and correction:   [1, 1, 0, 0, 1, 1, 0]
The decoded 4 bit message after error-correction: [0, 1, 1, 0]

The original message:                             [1, 1, 1, 0]
The 7-bit code_word of the message:               [0, 0, 0, 1, 1, 1, 1]
With a random bit flip the code word becomes:     [0, 0, 0, 1, 1, 0, 1]
Code word after error detection and correction:   [0, 0, 0, 1, 1, 1, 1]
The decoded 4 bit message after er

# Question 2

## Numpy 
Numpy (Numerical Python) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. Numpy is heavily used within the field of machine learning due to its mathematical and logical operations on arrays, and it provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python. 

## Implementation 

The objective of Question 2 is to create our own implementation of a few functionalities supported by the Numpy library. We will call our implementation Snumpy, and create a dedicated class for it (referenced ‘snp’).

Our implementation looks as follows:


In [8]:
# Creating the class Snumpy
class Snumpy():
    
    #Create ones function
    def ones(self, length):
        #Create list to hold the values
        result = []
        #Loop length times and append 1
        for i in range(length):
            result.append(1)
        return result
    

    #Create zeros function
    def zeros(self, length):
        result = []
        for i in range(length):
            result.append(0)
        return result
    
    
    #Create reshape function
    def reshape(self, array, dims):
        #Create matrix list
        matrix = []
        #Number of columns in each row
        columns = dims[1]
        #Loop over the rows
        for i in range(dims[0]):
            #Slice the next row of length dims[1] out of the array
            split = array[columns-dims[1]:columns]
            matrix.append(split)
            columns = columns + dims[1]
        return matrix
    
    
    #Create shape function
    def shape(self, array):
        rows = len(array)
        #Check whether the array is a matrix (list of lists) or a vector
        if isinstance(array[0], list):
            columns = len(array[0])
        else:
            columns = 0
        return (rows, columns)
    
    
    def append(self, array1, array2):
        result = []
        if self.shape(array1) == self.shape(array2):
            result.append(array1)
            result.append(array2)
            return result
        else:
            return "Those arrays can not be appended."
    
    
    def get(self, array, index):
        return array[index[0]][index[1]]
    
    
    def dotproduct(self, array1, array2):
        if self.shape(array1) == self.shape(array2):
            result = 0
            for i in range(len(array1)):
                result += array1[i] * array2[i]
            return result
        else:
            return "You can not apply the dot product to these arrays."

In [10]:
snp = Snumpy()

# Question 1
print(snp.ones(5))

# Question 2
print(snp.zeros(5))

# Question 3
array = [1, 2, 3, 4, 5, 6]
dims = (3, 2)
print(snp.reshape(array, dims))

# Question 4
array1 = [[4,2],[3,1],[2,5],[1,6]]
print(snp.shape(array1))

# Question 5
array1 = [1, 2, 3, 4, 5, 6]
array2 = [7, 8, 9, 10, 11, 12]
print(snp.append(array1, array2))

array3 = [[1, 2, 3],[4, 5, 6]]
print(snp.append(array2, array3))

# Question 6
print(snp.get(array3, (1,2)))

# Question 7
print(snp.dotproduct(array1, array2))


[1, 1, 1, 1, 1]
[0, 0, 0, 0, 0]
[[1, 2], [3, 4], [5, 6]]
(4, 2)
[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
Those arrays can not be appended.
6
217



__1__

The first question deals with creating a function within our Snumpy class called ‘ones(Int)’, which takes an integer as argument. This function returns an array of length Int containing only ones. 

First we define the function and its parameter. Inside the function we first create a list to hold the values. Then we loop as many times as the integer specifies, and append the number 1 to the list in every iteration of the loop. Finally we return the result.
In Numpy, you could simply have done: `np.ones(Int)`


__2__

Similarly, in question 2 we have to return an array of zeros. 

It is the same principle as the function ‘ones’ above; this time we just append 0’s to the array instead of 1’s.

__3__

In question three we have to make our own implementation of Numpy’s reshape function, which takes an array and converts it into the dimensions specified by a tuple (row, column). 

First we create the matrix variable and read the number of columns from the tuple. Then we loop once per row defined in the tuple, slicing the array into even chunks with the size of the columns defined in the tuple, and append each chunk to the matrix.
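
A more compact way to express the same slicing is sketched below. It is a stand-alone function rather than a method of the `Snumpy` class, and the size check with `ValueError` is an added assumption that the implementation above does not make:

```python
def reshape(array, dims):
    rows, columns = dims
    # Assumption: refuse shapes that do not use every element exactly once
    if rows * columns != len(array):
        raise ValueError(f"cannot reshape {len(array)} elements into shape {dims}")
    # Row r is the slice starting at r * columns, with length columns
    return [array[r * columns:(r + 1) * columns] for r in range(rows)]

print(reshape([1, 2, 3, 4, 5, 6], (3, 2)))  # [[1, 2], [3, 4], [5, 6]]
```
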

__4__ 

Question 4 is about returning a tuple with the dimensions of a matrix or vector.

Initially, we find the number of rows in the array, and then do a quick check of the dimensions of the array. Then we return the result as a tuple.

__5__

Then we had to create an ‘append(array1, array2)’ function, that takes two input vectors / matrices, and returns a new vector/matrix that is the combination of the two. 

First we define the result variable and then we check whether the vectors/matrices are in fact the same shape. If they are, both are appended to the result, which is returned. If not, we return a message saying they cannot be appended.

__6__

Question 6 deals with creating a ‘get(array, (row, column))’ function, that returns the value specified by the coordinate point of the array provided. 


__7__

The last question is about creating a function that computes the dot product of two arrays.
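
The dot product can be sketched with `zip`, which pairs up the elements of the two vectors. This is a stand-alone illustration, not the `Snumpy` method itself. Raising a `ValueError` for vectors of different length is an assumption made here, instead of returning a message string:

```python
def dot(vector1, vector2):
    if len(vector1) != len(vector2):
        raise ValueError("vectors must have the same length")
    # Multiply element by element and add up the products
    return sum(a * b for a, b in zip(vector1, vector2))

print(dot([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]))  # 217
```
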

# Question 3

## Implementation

In [48]:
#Importing libraries
import string
import math
import pandas as pd
import numpy as np


corpus = ["this is the first document.", 
          "This is the second document.", 
          "this was the third document",
          "document document document document"]

In [49]:
def build_vectors(corpus: list, searchstring: str):
    
    #Build word counters for textcorpus 
    counters = []
    
    for document in corpus:
        counter = {}
        
        #Split each document into a list with one lower-cased word per element.
        document = [word.strip(string.punctuation).lower() 
                    for word in document.split()]
        
        # Loop counting words for every document in corpus. 
        for word in document:
            if word not in counter:
                counter[word] = 1
            else:
                counter[word] += 1  
        #Dictionary is appended to list
        counters.append(counter)
    
    #Build word counter for searchstring 
    searchstring_counter = {}
    searchstring = [word.strip(string.punctuation).lower() 
                    for word in searchstring.split()]
    
    for word in searchstring:
        if word not in searchstring_counter:
            searchstring_counter[word] = 1
        else:
            searchstring_counter[word] += 1
    
    #Set searchstring as last element in counters 
    counters.append(searchstring_counter)

    #Build combined vocabulary
    #Taking the union of the keys of all counters gives the set of unique words
    combined_dict = set().union(*counters)
    
    #Build vectors
    #One vector per counter, with one entry per word in the combined vocabulary
    vector_list = []
    for counter in counters:
        #For each word in the vocabulary, use its count in this counter,
        #or 0 if the word does not occur.
        vector = [counter.get(word, 0) for word in combined_dict]
        vector_list.append(vector)

    return counters, combined_dict, vector_list
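
The word counting in `build_vectors` can also be sketched with `collections.Counter` from the standard library. This swaps in a different technique from the hand-written counting loop, and the helper names `count_words` and `to_vector` are hypothetical:

```python
import string
from collections import Counter

def count_words(text):
    # Same tokenisation as in build_vectors: split on whitespace,
    # strip punctuation and lower-case every word
    words = [word.strip(string.punctuation).lower() for word in text.split()]
    return Counter(words)

def to_vector(counter, vocabulary):
    # A Counter returns 0 for words it has not seen
    return [counter[word] for word in vocabulary]

doc = count_words("This is the first document.")
query = count_words("this was the fourth document")
# Sorting the vocabulary is a design choice that fixes the column order
vocabulary = sorted(set(doc) | set(query))

print(vocabulary)
print(to_vector(doc, vocabulary))
print(to_vector(query, vocabulary))
```
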




In [50]:
#Function for finding dot product, taking vector list as parameter
def dotproduct(vl):
    dp_dict = {}
    #enumerate numbers the documents, so identical vectors still get separate entries
    for number, vector in enumerate(vl[:-1], start=1):
        doc = 'Doc' + str(number)
        dot_product = sum(n1 * n2 for n1, n2 in zip(vector, vl[-1]))
        dp_dict[doc] = dot_product
    return dp_dict


This function takes a list of vectors and computes the dot product between the search document vector and each of the corpus document vectors. 
$\newline$
$D$ = document vector with components $d_1, d_2 ... d_n$, $S$ = search document vector with components $s_1, s_2 ... s_n$. Algebraic definition:

$$ D \cdot S = \displaystyle\sum_{i=1}^{n} d_i s_i = d_1 s_1 + d_2 s_2 + ... + d_n s_n $$


In [51]:
def euclideandistance(vl):
    ed_dict = {}
    #enumerate numbers the documents, so identical vectors still get separate entries
    for number, vector in enumerate(vl[:-1], start=1):
        doc = 'Doc' + str(number)
        euclidean_distance = math.sqrt(sum(((n1 - n2)**2) 
                                           for n1, n2 in zip(vector, vl[-1])))
        ed_dict[doc] = euclidean_distance
    return ed_dict

The Euclidean distance function takes the list of vectors and computes the distance between the endpoint of each corpus document vector and the endpoint of the search document vector. 
$\newline$
$d$ = document vector components, $s$ = search document vector components. Mathematical definition:


$$ distance(d,s) = \sqrt{(d_1-s_1)^2+(d_2-s_2)^2+ \dots +(d_n-s_n)^2} $$

In [52]:
def cosinesimilarity(vl):
    cs_dict = {}
    #enumerate numbers the documents, so identical vectors still get separate entries
    for number, vector in enumerate(vl[:-1], start=1):
        doc = 'Doc' + str(number)
        dot_product = sum(n1 * n2 for n1, n2 in zip(vector, vl[-1]))
        vector_norm = math.sqrt(sum(n**2 for n in vector))
        sstr_norm = math.sqrt(sum(n**2 for n in vl[-1]))
        product_norms = vector_norm * sstr_norm
        cosine_similarity = dot_product/product_norms
        cs_dict[doc] = cosine_similarity
    return cs_dict

Cosine similarity takes two vectors and computes the cosine of the angle between them. A value of 1 means that the vectors point in the same direction, i.e. lie on top of each other, whereas a value of 0 means that the vectors are orthogonal.



$D$ = document vectors 

$S$ = searchdocument vector



Algebraic definition:

$$ \cos(\theta)=\frac{D \cdot S}{\| D \| \| S \|} =\frac{\displaystyle\sum_{i=1}^{n} d_i s_i}{\sqrt{\displaystyle\sum_{i=1}^{n} d_i^2}\sqrt{\displaystyle\sum_{i=1}^{n} s_i^2}}$$


In [61]:
#The following code cells display the results
# Set search string
sstr="this was the fourth document"
counters, combined_dict, vector_list = build_vectors(corpus,sstr)

#Visualize data
combined_dict_list = [combined_dict]

df_vectors = pd.DataFrame(vector_list)
df_counters = pd.DataFrame(counters)
df_combined = pd.DataFrame(combined_dict_list)

In [54]:
df_counters

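| | document | first | fourth | is | second | the | third | this | was |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | NaN | 1.0 | NaN | 1.0 | NaN | 1.0 | NaN |
| 1 | 1 | NaN | NaN | 1.0 | 1.0 | 1.0 | NaN | 1.0 | NaN |
| 2 | 1 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | NaN | 1.0 | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 |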


In [55]:
df_combined

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | the | third | was | document | second | first | this | fourth | is |


In [56]:
df_vectors

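| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |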


In [57]:
df_dotproduct = pd.DataFrame([dotproduct(vector_list)])
df_dotproduct
#df_dotproduct.sort_values(by)

| | Doc1 | Doc2 | Doc3 | Doc4 |
|---|---|---|---|---|
| 0 | 3 | 3 | 4 | 4 |


In [58]:
df_euclideandistance = pd.DataFrame([euclideandistance(vector_list)])
df_euclideandistance

| | Doc1 | Doc2 | Doc3 | Doc4 |
|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 1.414214 | 3.605551 |


In [59]:
df_cosinesimilarity = pd.DataFrame([cosinesimilarity(vector_list)])
df_cosinesimilarity

| | Doc1 | Doc2 | Doc3 | Doc4 |
|---|---|---|---|---|
| 0 | 0.6 | 0.6 | 0.8 | 0.447214 |


In [60]:
#validate with numpy
doc1 = np.array(vector_list[0])
doc2 = np.array(vector_list[1])
doc3 = np.array(vector_list[2])
doc4 = np.array(vector_list[3])
search_vector = np.array(vector_list[4])

#Doc 3 vs. search document
#The results get their own names, so that the functions dotproduct,
#euclideandistance and cosinesimilarity defined above are not overwritten
np_dotproduct = np.dot(doc3, search_vector)
np_euclideandistance = np.linalg.norm(doc3 - search_vector)
np_cosinesimilarity = np.dot(doc3, search_vector) / (np.linalg.norm(doc3) * np.linalg.norm(search_vector))

print(np_euclideandistance)
print(np_dotproduct)
print(np_cosinesimilarity)

1.4142135623730951
4
0.7999999999999998


## Discussions of formulas and measurements 

### Dot product
In text similarity the dot product is a useful way to determine whether there is any similarity between two documents. A positive dot product means that the documents share at least one word, i.e. that they are similar in some way. However, the dot product itself does not indicate in what way the documents are similar. For example, suppose document A consists of the same words as document B, but each word occurs 10 times as often in A. In this case theta would be 0° ( cos(0) = 1 ), but the dot product would still be high purely because of the large number of word occurrences in A. Therefore, when using the dot product as a similarity measure, it is important to bring in additional measures in order to establish the type of similarity.


$$ A \cdot B =\| A \| \| B \| \cos(\theta) $$

 
By this definition we know that a dot product equal to zero implies two orthogonal vectors, which means that theta is 90 degrees. We also know that the greater theta is, the smaller the cosine of theta, and thus the lower the similarity between the two documents.

### Euclidean distance
The Euclidean distance is the length between the endpoints of the vectors, which gives an indication of how far apart the word counts of two documents are. It is important to note that if one term occurs many times in a document while the rest of the words match the search document, the distance will go up, which could give a false indication of dissimilarity. Therefore, when using the Euclidean measure, it is important to also look at the cosine similarity in order to assess whether this is the case. 

$$ \| q - p \| = \sqrt{(q - p) \cdot (q - p)} = \sqrt{\displaystyle\sum_{i=1}^{n} (q_i - p_i)^2} $$
 
### Cosine similarity
This brings us to cosine similarity, which tells us how great the angle between two vectors is. We can then say how similar or how far apart two documents are. A cosine of theta = 1 indicates 100% similarity between two documents. This happens only if the documents contain the same words in the same proportions.  

$$ \cos(\theta) = \frac{A \cdot B} {\| A \| \| B \|} $$
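
The scenario from the dot product discussion, where one document contains the same words as another but 10 times as often, can be sketched with made-up count vectors. The vectors `doc_a` and `doc_b` and the helper functions are assumptions for illustration and do not come from the corpus above:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

doc_b = [1, 1, 1, 0]                     # hypothetical word counts
doc_a = [10 * count for count in doc_b]  # same words, 10 times as often

print(dot(doc_a, doc_b))                 # 30
print(euclidean_distance(doc_a, doc_b))  # about 15.59
print(cosine_similarity(doc_a, doc_b))   # about 1.0
```

The dot product and the Euclidean distance both grow with the length of `doc_a`, while the cosine similarity stays at 1 because the two vectors point in the same direction.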
 

### Assessment
On the basis of the above, it is clear that several measures are needed in order to arrive at an accurate similarity grade. By computing both Euclidean distance and cosine similarity and assessing the similarity on both measures, we get a proper indication of document similarities.  
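
As a small sketch of such a combined assessment, the documents can be ranked on both measures using the values from the result tables above. The dictionary names are hypothetical, and the numbers are copied from the printed results rather than recomputed:

```python
cosine = {'Doc1': 0.6, 'Doc2': 0.6, 'Doc3': 0.8, 'Doc4': 0.447214}
euclidean = {'Doc1': 2.0, 'Doc2': 2.0, 'Doc3': 1.414214, 'Doc4': 3.605551}

# Higher cosine similarity means more similar; lower distance means more similar
by_cosine = sorted(cosine, key=cosine.get, reverse=True)
by_distance = sorted(euclidean, key=euclidean.get)

print(by_cosine)    # ['Doc3', 'Doc1', 'Doc2', 'Doc4']
print(by_distance)  # ['Doc3', 'Doc1', 'Doc2', 'Doc4']
```

Both measures rank Doc3 as the closest match to the search string and Doc4 as the most distant.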
