 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
  <h1 style="color: white;"> *Vector representation of text documents* </h1>
 </div>

<div style="background-color: #99CD4E; padding:5px 0;"> 
  <h2 style="color: white;"> *Bag of words & Vector Space Model* </h2>
 </div>
- *BAG OF WORDS*: The exact ordering of the terms in a document is ignored, but the number of occurrences of each term matters.
- *VECTOR SPACE MODEL*: A set of documents is represented as vectors in a common vector space.
    - The dimensions of the vector space are the words in the dictionary, and the value for each dimension is a weight that indicates the occurrence of the word in the document.
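
As a minimal sketch of these two ideas, the snippet below builds a bag of words with `collections.Counter` and turns it into a count vector over a shared vocabulary. The helper `bag_of_words`, the whitespace tokenizer, and the two toy sentences are illustrative assumptions and are not part of the collection used later.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Word order is discarded, so both sentences produce the same bag.
same_bag = bag_a == bag_b

# A shared, sorted vocabulary gives every document the same dimensions.
vocabulary = sorted(set(bag_a) | set(bag_b))
vector_a = [bag_a[term] for term in vocabulary]

print(vocabulary)  # the dimensions of the vector space
print(vector_a)    # the count of each vocabulary term in doc_a
print(same_bag)
```

The two sentences mean different things but map to the same vector, which is exactly the information a bag-of-words model gives up.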




<div style="background-color: #99CD4E; padding:5px 0;"> 
  <h2 style="color: white;"> *Term frequency-inverse document frequency (Tf-idf)* </h2>
 </div>

_**Term frequency (tf<sub>t,d</sub>)**_ Number of occurrences of a term (word) t in a document d.

_**Inverse document frequency (idf<sub>t</sub>)**_ The document frequency df<sub>t</sub> of a term t is the number of documents that contain t. Let N be the number of documents in the collection. The inverse document frequency is $$ idf_{t} = \log\left(\frac{N}{df_{t}}\right)$$

_**Tf<sub>t,d</sub>-idf<sub>t</sub>**_ of a term is the product of its term frequency (tf<sub>t,d</sub>) and its inverse document frequency (idf<sub>t</sub>). $$ \text{tf-idf}_{t,d} = tf_{t,d} \cdot idf_{t} $$ 

It has the following properties:

    1. higher when the term occurs many times within a small number of documents
    2. lower when the term occurs fewer times in a document, or occurs in many documents
    3. lowest when the term occurs in virtually all documents
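
The three properties can be checked on a tiny, made-up collection. The sketch below applies the formulas exactly as written above; the three pre-tokenized documents and the helper names `tf`, `df`, `idf` and `tf_idf` are assumptions for illustration, and the natural logarithm is assumed because the formula does not fix a base.

```python
import math

# Toy collection, already tokenized (an assumption for illustration).
docs = [
    ["dog", "dog", "man"],
    ["dog", "man"],
    ["man", "yard"],
]
N = len(docs)

def tf(term, doc):
    """Number of occurrences of term in doc."""
    return doc.count(term)

def df(term):
    """Number of documents that contain term."""
    return sum(1 for doc in docs if term in doc)

def idf(term):
    """Inverse document frequency, log(N / df)."""
    return math.log(N / df(term))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "man" occurs in every document, so its idf and tf-idf are zero.
print(tf_idf("man", docs[0]))
# "dog" occurs twice in document 0 and in 2 of 3 documents overall.
print(tf_idf("dog", docs[0]))
# "yard" occurs in a single document, so it has the largest idf.
print(tf_idf("yard", docs[2]))
```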




# Create a collection of documents

In [1]:
document_collection = [ 'A group of kids is playing in a yard and an old man is standing in the background',
                'A group of children is playing in the house and there is no man standing in the background',
                'The young boys are playing outdoors and the man is smiling nearby',
                'The kids are playing outdoors near a man with a smile',
                'There is no boy playing outdoors and there is no man smiling',
                'A group of boys in a yard is playing and a man is standing in the background',
                'A brown dog is attacking another animal in front of the tall man in pants',
                'A brown dog is attacking another dog in front of the man in pants',
                'Two dogs are fighting',
                'Two dogs are wrestling and hugging']

# Count based vectorizer

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(stop_words='english')
count_vector_space_model = count_vectorizer.fit_transform(document_collection)

In [3]:
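# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = count_vectorizer.get_feature_names_out().tolist()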
print(terms)

['animal', 'attacking', 'background', 'boy', 'boys', 'brown', 'children', 'dog', 'dogs', 'fighting', 'group', 'house', 'hugging', 'kids', 'man', 'near', 'nearby', 'old', 'outdoors', 'pants', 'playing', 'smile', 'smiling', 'standing', 'tall', 'wrestling', 'yard', 'young']


In [4]:
print(len(terms))

28


In [5]:
import numpy as np
print(np.shape(count_vector_space_model))

(10, 28)


In [6]:
print(type(count_vector_space_model))

<class 'scipy.sparse.csr.csr_matrix'>


In [7]:
import pandas as pd

df_count_vsm = pd.DataFrame(count_vector_space_model.toarray(), columns= terms)

In [8]:
pd.set_option('display.max_columns', None)
df_count_vsm

Unnamed: 0,animal,attacking,background,boy,boys,brown,children,dog,dogs,fighting,group,house,hugging,kids,man,near,nearby,old,outdoors,pants,playing,smile,smiling,standing,tall,wrestling,yard,young
0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0
1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,0,1,1,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,0
5,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0
6,1,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0
7,0,1,0,0,0,1,0,2,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


# Tf-idf vectorizer

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_vector_space_model = tfidf_vectorizer.fit_transform(document_collection)

In [29]:
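# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = tfidf_vectorizer.get_feature_names_out().tolist()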
print(terms)

['animal', 'attacking', 'background', 'boy', 'boys', 'brown', 'children', 'dog', 'dogs', 'fighting', 'group', 'house', 'hugging', 'kids', 'man', 'near', 'nearby', 'old', 'outdoors', 'pants', 'playing', 'smile', 'smiling', 'standing', 'tall', 'wrestling', 'yard', 'young']


In [30]:
print(len(terms))

28


In [39]:
import numpy as np
print(np.shape(tfidf_vector_space_model))

(10, 28)


In [40]:
import pandas as pd

tfidf_vsm = pd.DataFrame(tfidf_vector_space_model.toarray())

In [41]:
tfidf_vsm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27
0,0.0,0.0,0.347145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.347145,0.0,0.0,0.396791,0.207202,0.0,0.0,0.466762,0.0,0.0,0.250571,0.0,0.0,0.347145,0.0,0.0,0.396791,0.0
1,0.0,0.0,0.365318,0.0,0.0,0.0,0.491198,0.0,0.0,0.0,0.365318,0.491198,0.0,0.0,0.218049,0.0,0.0,0.0,0.0,0.0,0.263689,0.0,0.0,0.365318,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.401465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.209643,0.0,0.472261,0.0,0.351235,0.0,0.253523,0.0,0.401465,0.0,0.0,0.0,0.0,0.472261
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.438341,0.228899,0.51564,0.0,0.0,0.383497,0.0,0.27681,0.51564,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.601817,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.267154,0.0,0.0,0.0,0.447589,0.0,0.323072,0.0,0.511599,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.392528,0.0,0.448664,0.0,0.0,0.0,0.0,0.0,0.392528,0.0,0.0,0.0,0.234289,0.0,0.0,0.0,0.0,0.0,0.283329,0.0,0.0,0.392528,0.0,0.0,0.448664,0.0
6,0.443343,0.376882,0.0,0.0,0.0,0.376882,0.0,0.376882,0.0,0.0,0.0,0.0,0.0,0.0,0.196805,0.0,0.0,0.0,0.0,0.376882,0.0,0.0,0.0,0.0,0.443343,0.0,0.0,0.0
7,0.0,0.370811,0.0,0.0,0.0,0.370811,0.0,0.741622,0.0,0.0,0.0,0.0,0.0,0.0,0.193635,0.0,0.0,0.0,0.0,0.370811,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.647689,0.761905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.515192,0.0,0.0,0.0,0.606043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.606043,0.0,0.0
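
The weights in this table do not come from the plain formula log(N / df). Assuming the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), scikit-learn uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below recomputes row 8 ("Two dogs are fighting") by hand; the term counts and document frequencies are read off the count matrix shown earlier, and the helper name `smoothed_idf` is hypothetical.

```python
import math

N = 10  # number of documents in the collection

# Document 8 after stop-word removal: one "dogs", one "fighting".
counts = {"dogs": 1, "fighting": 1}
# Document frequencies from the count matrix:
# "dogs" appears in documents 8 and 9, "fighting" only in document 8.
doc_freq = {"dogs": 2, "fighting": 1}

def smoothed_idf(df, n_docs):
    """Smoothed idf as used by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Unnormalized tf-idf weights.
raw = {t: counts[t] * smoothed_idf(doc_freq[t], N) for t in counts}

# L2 normalization: divide by the Euclidean length of the row.
length = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: w / length for t, w in raw.items()}

print(weights)
```

The two values agree with the non-zero entries of row 8 (columns 8 and 9) to the six decimals shown.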


<div style="background-color: #99CD4E; padding:5px 0;"> 
  <h2 style="color: white;"> *Similarity Metric* </h2>
 </div>
Given two vectors _*x*_ and _*y*_ with _*p*_ dimensions, the distance or similarity between them can be measured as follows:

_**Euclidean Distance**_

$$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2  + \ldots + (x_p - y_p)^2  }$$

_**Cosine Similarity**_

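$$\frac{\sum\limits_{k = 1}^{p} x_k y_k }{ \sqrt{\sum\limits_{k = 1}^{p} x_k^2} \sqrt{\sum\limits_{k = 1}^{p} y_k^2} }  $$

Cosine similarity is 1 when the two vectors point in the same direction, 0 when they are orthogonal (for documents: no terms in common), and -1 when they point in opposite directions. Because tf-idf weights are non-negative, the similarity between two tf-idf vectors always lies between 0 and 1.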

_**Jaccard Distance**_

$$1 - \frac{| x \cap y |}{| x \cup y |}$$

Here _*x*_ and _*y*_ are treated as the sets of terms that occur in each document.
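
The three measures can be written out directly from their definitions. This is a minimal pure-Python sketch; the function names, the example vectors, and the two term sets (the terms of documents 8 and 9 after stop-word removal) are chosen here for illustration. It assumes both vectors have the same length and, for the cosine, that neither is all zeros.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_sim(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard_distance(x, y):
    """One minus the share of terms common to both sets."""
    x, y = set(x), set(y)
    return 1 - len(x & y) / len(x | y)

x = [1, 0, 1]
y = [1, 1, 0]
print(euclidean_distance(x, y))
print(cosine_sim(x, y))

terms_8 = {"dogs", "fighting"}
terms_9 = {"dogs", "wrestling", "hugging"}
print(jaccard_distance(terms_8, terms_9))
```

Euclidean and Jaccard are distances (0 means identical), while cosine is a similarity (1 means identical direction), so the two kinds of score rank pairs in opposite directions.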


In [46]:
from sklearn.metrics.pairwise import cosine_similarity
dist_matrix = cosine_similarity(tfidf_vsm.iloc[[0,3]])
dist_matrix

array([[ 1.        ,  0.29071844],
       [ 0.29071844,  1.        ]])

In [50]:
dist_matrix = cosine_similarity(tfidf_vsm)
pd.DataFrame(dist_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.491708,0.106964,0.290718,0.136307,0.706357,0.040778,0.040121,0.0,0.0
1,0.491708,1.0,0.112564,0.122903,0.143443,0.55599,0.042913,0.042222,0.0,0.0
2,0.106964,0.112564,1.0,0.252862,0.500511,0.30107,0.041259,0.040594,0.0,0.0
3,0.290718,0.122903,0.252862,1.0,0.32223,0.132057,0.045049,0.044323,0.0,0.0
4,0.136307,0.143443,0.500511,0.32223,1.0,0.154127,0.052577,0.05173,0.0,0.0
5,0.706357,0.55599,0.30107,0.132057,0.154127,1.0,0.046109,0.045367,0.0,0.0
6,0.040778,0.042913,0.041259,0.045049,0.052577,0.046109,1.0,0.736869,0.0,0.0
7,0.040121,0.042222,0.040594,0.044323,0.05173,0.045367,0.736869,1.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.333684
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333684,1.0
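
One entry of this matrix can be checked by hand. Documents 8 and 9 share only the term "dogs", so the numerator of their cosine similarity reduces to the product of their two "dogs" weights. The sketch below uses the rounded weights printed in the tf-idf table earlier; the dictionaries and the helper `norm` are illustrative.

```python
import math

# Non-zero tf-idf weights copied from rows 8 and 9 of the tf-idf table.
doc_8 = {"dogs": 0.647689, "fighting": 0.761905}
doc_9 = {"dogs": 0.515192, "hugging": 0.606043, "wrestling": 0.606043}

def norm(vec):
    """Euclidean length of a sparse vector stored as a dict."""
    return math.sqrt(sum(w * w for w in vec.values()))

# Only terms present in both documents contribute to the dot product.
shared = set(doc_8) & set(doc_9)
dot = sum(doc_8[t] * doc_9[t] for t in shared)

similarity = dot / (norm(doc_8) * norm(doc_9))
print(sorted(shared), similarity)
```

The result agrees with the 0.333684 entry for the pair (8, 9) up to rounding of the inputs. The same reasoning explains the zeros: documents 8 and 9 share no term with documents 0 to 7, so every one of those dot products is zero.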


 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
  <h1 style="color: white;"> *The End* </h1>
 </div>