# Distance Calculation

## Euclidean Distance

$$\|p\| = \sqrt{p_{1}^2 + p_{2}^2 + \cdots + p_{n}^2} = \sqrt{p \cdot p}$$

### Calculating with one vector:
$$\langle a, b, c \rangle \to \sqrt{a^2 + b^2 + c^2}$$

$$\langle 3, 4, 5 \rangle \to \sqrt{3^2 + 4^2 + 5^2}$$

### Calculating with two vectors:
$${\langle a_{x}, a_{y}, a_{z} \rangle, \langle b_{x}, b_{y}, b_{z} \rangle} \to \sqrt{(a_{x}-b_{x})^2 + (a_{y}-b_{y})^2 + (a_{z}-b_{z})^2} $$
$$\langle 2, 5, 18 \rangle, \langle 7, 31, 43\rangle \to \sqrt{(2-7)^2 + (5-31)^2 + (18 - 43)^2}$$

Because every component difference is squared, the distance is never negative; it is zero only when the two vectors are identical.
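
As a quick check, the two numeric examples above work out as follows (values rounded to two decimals):

$$\sqrt{3^2 + 4^2 + 5^2} = \sqrt{50} \approx 7.07$$

$$\sqrt{(2-7)^2 + (5-31)^2 + (18-43)^2} = \sqrt{25 + 676 + 625} = \sqrt{1326} \approx 36.41$$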

In [49]:
from __future__ import division, print_function  # must come before any other import
from math import sqrt
import numpy as np
import nltk

In [50]:
### Example Euclidean distance without numpy

def euclidian_non_np(vector_a, vector_b=None):  # by default we assume only one vector
    if vector_b is not None:  # we have two vectors
        # zip pairs up the elements of two lists, like a zipper on your coat
        distance_sums = sum((a - b)**2 for a, b in zip(vector_a, vector_b))

        return sqrt(distance_sums)

    return sqrt(sum(a**2 for a in vector_a))  # length of a single vector

## Example Euclidean distance with one or two vectors using numpy

def euclidian_np(a, b=None):  # by default we assume only one vector
    # "if b:" would raise a ValueError when b is a numpy array,
    # so we compare against None explicitly
    if b is not None:  # we have two vectors
        # with numpy arrays we can use the minus operator to
        # calculate the difference between each pair of elements;
        # the norm of that difference is the distance
        return np.linalg.norm(np.array(a) - np.array(b))

    return np.linalg.norm(a)  # we have only one vector

print('Non numpy distance [3,4,5]          :', euclidian_non_np([3,4,5]))
print('Non numpy distance [3,4,5], [4,5,6] :', euclidian_non_np([3,4,5],[4,5,6]))
print('With numpy distance [3,4,5]         :', euclidian_np([3,4,5]))
print('With numpy distance [3,4,5], [4,5,6]:', euclidian_np([3,4,5],[4,5,6]))

Non numpy distance [3,4,5]          : 7.07106781187
Non numpy distance [3,4,5], [4,5,6] : 1.73205080757
With numpy distance [3,4,5]         : 7.07106781187
With numpy distance [3,4,5], [4,5,6]: 1.73205080757
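
As a cross-check on the numbers above, the same two quantities can be obtained from the standard library alone. This is only a sketch and it assumes Python 3.8 or later, where `math.hypot` accepts any number of coordinates and `math.dist` is available; the variable names are illustrative.

```python
import math

# Length of a single vector: hypot takes the coordinates as separate arguments.
length = math.hypot(3, 4, 5)

# Distance between two points given as sequences of equal length.
distance = math.dist([3, 4, 5], [4, 5, 6])

print(length)
print(distance)
```

Both values should agree with the numpy and non-numpy results up to floating-point rounding.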


# Similarity Calculation
## Cosine Similarity

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

The numerator is the dot product of the two vectors, that is, the sum of the pairwise products of their elements.

$$
\langle 2, 5, 18 \rangle \cdot
\langle 7, 31, 43 \rangle \to 2 \cdot 7 + 5 \cdot 31 + 18 \cdot 43
$$

The denominator is the product of the Euclidean lengths (norms) of the two vectors.

$$\langle 2, 5, 18 \rangle, \langle 7, 31, 43\rangle \to \sqrt{2^2 + 5^2 + 18^2} \cdot \sqrt{7^2 + 31^2 + 43^2}$$

Which results in:
$$\frac{2 \cdot 7 + 5 \cdot 31 + 18 \cdot 43}{\sqrt{2^2 + 5^2 + 18^2} \cdot \sqrt{7^2 + 31^2 + 43^2}}$$

In [51]:
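def similarity(vector_a, vector_b):
    # teller = numerator: dot product of the two vectors
    teller = np.vdot(np.array(vector_a), np.array(vector_b))
    # noemer = denominator: product of the two vector lengths,
    # reusing the Euclidean distance function with a single argument
    noemer = euclidian_np(vector_a) * euclidian_np(vector_b)
    return teller / noemer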
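
To see the formula at work on the example vectors, here is a minimal, self-contained sketch that uses only the standard library. The function name `cosine_similarity` and the extra test vectors are illustrative assumptions, and the function assumes non-zero vectors of equal length.

```python
from math import sqrt

def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = sqrt(sum(a * a for a in vector_a))
    norm_b = sqrt(sum(b * b for b in vector_b))
    return dot / (norm_a * norm_b)

example = cosine_similarity([2, 5, 18], [7, 31, 43])      # worked example from the text
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors

print(round(example, 4))
print(round(same_direction, 4))
print(round(orthogonal, 4))
```

The worked example comes out at roughly 0.94, parallel vectors score 1 (up to rounding) and perpendicular vectors score 0, which is why the measure is read as "how much do the two vectors point in the same direction", independent of their lengths.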

# Intrinsic and extrinsic search methods

## Intrinsic
Intrinsic methods judge the worth of a document for a given query using only the content of the document itself; the context surrounding the document is not needed to say something about its value. In this approach every document is regarded as a set of words, each with its frequency, and the collection of documents as a set of such sets.

Examples of such a collection are a set of news articles or a set of pages from Wikipedia.

One way to define the value of a document with regard to a search query Q is to multiply the number of times a word appears in the document (TF) by the inverse document frequency (IDF) of that word.

$$\begin{aligned}
TF &= \textit{number of times a word appears in a document}\\
N &= \textit{number of documents in the collection}\\
n &= \textit{number of documents the word appears in}\\
IDF &= \log\frac{N}{n}
\end{aligned}$$
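
To make the definitions concrete, here is a minimal sketch on a made-up collection of four tiny documents. The document names, the token lists and the helper `tf_idf` are illustrative assumptions, the natural logarithm is assumed for IDF, and the code targets Python 3.

```python
import math

# Toy collection (hypothetical): each document is already a list of tokens.
documents = {
    'doc1': ['news', 'cat', 'sat', 'cat'],
    'doc2': ['news', 'dog', 'sat'],
    'doc3': ['news', 'cat', 'dog'],
    'doc4': ['news', 'log', 'log'],
}

def tf_idf(word, doc_name):
    """TF * IDF of `word` in one document of the toy collection."""
    tf = documents[doc_name].count(word)   # TF: occurrences in this document
    N = len(documents)                     # N: number of documents
    n = sum(1 for tokens in documents.values() if word in tokens)  # n
    if n == 0:
        return 0.0  # the word occurs nowhere in the collection
    return tf * math.log(N / n)

cat_score = tf_idf('cat', 'doc1')      # TF = 2, n = 2 -> 2 * log(4 / 2)
rare_score = tf_idf('log', 'doc4')     # TF = 2, n = 1 -> 2 * log(4 / 1)
common_score = tf_idf('news', 'doc1')  # n = N, so IDF = log(1) = 0

print(cat_score, rare_score, common_score)
```

A word that occurs in every document gets an IDF of zero and therefore never contributes to a score, while a word confined to a single document is weighted most heavily.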

### Indexing
Recalculating TF \* IDF by processing every file whenever a query is issued becomes very time consuming as the collection of documents grows. It is therefore much better to store an index of the collection that contains the information needed to calculate TF \* IDF. The index has to be rebuilt from time to time, so we want to extract as much information as possible during indexing, as this reduces the calculations necessary at query time. Below is a sample index layout based on a Python dictionary:

```
index = { __filenames: [list of filenames or document names],
                  __N: number of documents in the collection,
                word1: { doc1 : TF,
                         doc2 : TF,
                         doc3 : TF,
                           df : number of documents word1 appears in / N},
                word2: { doc1 : TF,
                         doc2 : TF,
                         doc3 : TF,
                           df : number of documents word2 appears in / N}
        }
```
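
For illustration, a filled-in index for a hypothetical two-document collection could look like the sketch below. The file names and counts are made up, and `df` is taken here to be the fraction of documents that contain the word, which is how the implementation further down fills it in.

```python
# Hypothetical two-document collection, already tokenised and counted.
index = {
    '__filenames': ['a.txt', 'b.txt'],
    '__N': 2,
    'cat': {'a.txt': 3, 'df': 0.5},              # occurs only in a.txt
    'dog': {'a.txt': 1, 'b.txt': 2, 'df': 1.0},  # occurs in both documents
}

# Term frequency lookups; a missing document key means the word
# does not occur in that document.
tf_cat_a = index['cat'].get('a.txt', 0)
tf_cat_b = index['cat'].get('b.txt', 0)

# Number of documents containing a word, recovered from df.
docs_with_dog = round(index['dog']['df'] * index['__N'])

print(tf_cat_a, tf_cat_b, docs_with_dog)
```

Keeping the bookkeeping entries under names that start with two underscores makes them easy to tell apart from ordinary word entries.
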
In a real-world system the index would of course be stored in a database to provide stability and redundancy. RAM size also limits how large an in-memory index can grow.

Below is a sample implementation that will read any text file in a given folder.

In [52]:
import os

def index_collection(folder):
    '''
    Returns an index dictionary of all tokens in all ".txt" files
    in the given folder.
    >>> index = index_collection('my_folder')
    '''
    index = {}
    folder = os.path.abspath(folder)

    # create list of file names and only use the files ending in .txt
    # add the absolute path to the filename so we can use the index everywhere
    files = [os.path.join(folder, f) for f in os.listdir(folder)
             if f.endswith('.txt')]

    # setting up the index
    N = len(files)
    index['__filenames'] = files
    index['__N'] = N

    # the NLTK stop word list and tokenizer data must be downloaded beforehand
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # loop over every file to analyze the document.
    for filename in files:
        with open(filename) as f:
            text = f.read()
        tokens = [word for word in nltk.tokenize.word_tokenize(text.lower())
                  if word not in stop_words]
        fd = nltk.FreqDist(tokens)  # Generate a frequency distribution
                                    # {word: frequency}
        # loop over every word
        for word, freq in fd.items():
            idx_word = index.get(word)  # dict.get() returns None if the key
                                        # is not in the dictionary
            if idx_word:  # it exists
                idx_word[filename] = freq  # add the filename to the index
                idx_word['df'] = (idx_word['df'] * N + 1) / N  # recalculate the DF
            else:  # the word is not in the index
                index[word] = {filename: freq, 'df': 1 / N}  # insert the word
    return index

### Querying and scoring
A query is a string of words. For every document in the collection, the TF\*IDF scores of the individual query words are summed, and the documents are ranked by that sum.

In [None]:
import math

def score(word, doc, index):
    '''
    Returns the TFxIDF score for a given word, document and index.
    '''
    entry = index.get(word)  # again we use the get function to prevent KeyErrors
    if entry:
        TF = entry.get(doc, 0)  # like index[word][doc], but 0 if the word is not in doc
        IDF = math.log(1 / entry.get('df', 1))  # df = n / N, so 1 / df = N / n
        return TF * IDF
    return 0  # the word is not in the index, so we return a score of 0


def query(index, qry):
    qry = nltk.tokenize.word_tokenize(qry.lower())
    # create a list of tuples (score sum, document)
    score_list = [(sum(score(word, doc, index) for word in qry), doc)
                   for doc in index['__filenames']]
    # sort the documents in descending order based on the scores
    return sorted(score_list, reverse=True)
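
The ranking step can be tried out without any files on disk. The sketch below uses a small hand-built index and plain whitespace splitting instead of the NLTK tokenizer; the index contents and the helper names `tf_idf` and `rank` are illustrative assumptions, and the code targets Python 3.

```python
import math

# Hypothetical hand-built index for three documents; df = n / N.
index = {
    '__filenames': ['a.txt', 'b.txt', 'c.txt'],
    '__N': 3,
    'cat':  {'a.txt': 2, 'b.txt': 1, 'df': 2 / 3},
    'dog':  {'b.txt': 3, 'df': 1 / 3},
    'news': {'a.txt': 1, 'b.txt': 1, 'c.txt': 1, 'df': 1.0},
}

def tf_idf(word, doc):
    """TF * IDF of one word in one document; 0 for unknown words."""
    entry = index.get(word)
    if not isinstance(entry, dict):
        return 0.0  # unknown word, or one of the bookkeeping keys
    return entry.get(doc, 0) * math.log(1 / entry['df'])

def rank(query_string):
    """Return (score, document) pairs, best match first."""
    words = query_string.lower().split()  # assumption: whitespace tokenisation
    scores = [(sum(tf_idf(w, doc) for w in words), doc)
              for doc in index['__filenames']]
    return sorted(scores, reverse=True)

ranking = rank('cat dog')
for total, doc in ranking:
    print(doc, round(total, 3))
```

Here `b.txt` ranks first because it contains the rarer word "dog" three times, `a.txt` follows on the strength of "cat" alone, and `c.txt` scores zero because it contains neither query word.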