 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
  <h1 style="color: white;"> *Vector representation of text documents* </h1>
 </div>

<div style="background-color: #99CD4E; padding:5px 0;"> 
  <h2 style="color: white;"> Why do we need this? </h2>
 </div>
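- Machine learning algorithms work with numbers, so text must first be converted into a numeric form
- Once documents are represented as vectors, other algorithms can be applied to them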

See the article [An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)


<div style="background-color: #99CD4E; padding:5px 0;"> 
  <h2 style="color: white;"> *Bag of words & Vector Space Model* </h2>
 </div>
- *BAG OF WORDS*: The exact ordering of the terms in a document is ignored, but the number of occurrences of each term matters

- *VECTOR SPACE MODEL*: Representation of a set of documents as vectors in a common vector space
    - The dimensions of the vector space are the words in the dictionary, and the value for each dimension is a weight that indicates how strongly the word occurs in the document.
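
The two ideas above can be sketched with the standard library alone. This is a minimal illustration, assuming whitespace tokenization and raw counts as weights. The two sentences come from the collection used later in this notebook, and `cosine_similarity` is a helper written for this example, not a library function.

```python
from collections import Counter
import math

docs = [
    "two dogs are fighting",
    "two dogs are wrestling and hugging",
]

# Bag of words: keep the count of each term, discard the word order.
bags = [Counter(doc.split()) for doc in docs]

# Common vector space: one dimension per distinct word, in sorted order.
vocabulary = sorted(set().union(*bags))

# Each document becomes a vector of raw counts over the shared vocabulary.
vectors = [[bag[term] for term in vocabulary] for bag in bags]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(vocabulary)
print(vectors)
print(round(cosine_similarity(vectors[0], vectors[1]), 3))
```

Because both documents live in the same space, they can be compared directly, for example by the cosine of the angle between their vectors.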




<div style="background-color: #99CD4E; padding:5px 0;"> 
  <h2 style="color: white;"> *Term frequency-inverse document frequency (Tf-idf)* </h2>
</div>

_**Term frequency (tf<sub>t,d</sub>)**_ Number of occurrences of a term (word) t in a document d.

_**Inverse document frequency (idf<sub>t</sub>)**_ The document frequency df<sub>t</sub> of a term t is the number of documents that contain t. Let N be the number of documents in the collection. The inverse document frequency is $$ idf_{t} = \log\left(\frac{N}{df_{t}}\right)$$

_**Tf<sub>t,d</sub>-idf<sub>t</sub>**_ of a term is the product of its term frequency (tf<sub>t,d</sub>) and its inverse document frequency (idf<sub>t</sub>). $$ \text{tf-idf}_{t,d} = tf_{t,d} \times idf_{t} $$

The tf-idf weight of a term in a document is:

1. highest when the term occurs many times within a small number of documents
2. lower when the term occurs fewer times in a document, or occurs in many documents
3. lowest when the term occurs in virtually all documents
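
The definitions above can be worked through on a tiny corpus. This is a sketch under stated assumptions: the three toy sentences and the helper `tf_idf` are invented for illustration, and the natural logarithm is used (a different base only rescales every weight by a constant).

```python
import math
from collections import Counter

# Toy corpus, already tokenized (illustrative sentences, not the notebook's collection).
docs = [
    "the dog chased the dog".split(),
    "the man smiled".split(),
    "the man walked the dog".split(),
]
N = len(docs)

# Document frequency df_t: number of documents that contain term t.
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(term, doc):
    """tf_{t,d} * log(N / df_t), following the definitions above."""
    tf = doc.count(term)
    idf = math.log(N / df[term])
    return tf * idf


# 'the' occurs in every document, so idf = log(3/3) = 0 and the weight vanishes.
print(tf_idf("the", docs[0]))
# 'dog' occurs twice in this document and in 2 of 3 documents overall.
print(tf_idf("dog", docs[0]))
# 'smiled' occurs once, but in only 1 of 3 documents.
print(tf_idf("smiled", docs[1]))
```

The first call illustrates the third property: a term that appears in every document gets weight zero regardless of how often it occurs.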




# Create a collection of documents

In [2]:
document_collection = [ 'A group of kids is playing in a yard and an old man is standing in the background',
                'A group of children is playing in the house and there is no man standing in the background',
                'The young boys are playing outdoors and the man is smiling nearby',
                'The kids are playing outdoors near a man with a smile',
                'There is no boy playing outdoors and there is no man smiling',
                'A group of boys in a yard is playing and a man is standing in the background',
                'A brown dog is attacking another animal in front of the tall man in pants',
                'A brown dog is attacking another dog in front of the man in pants',
                'Two dogs are fighting',
                'Two dogs are wrestling and hugging']

# Simple descriptive statistics

In [54]:
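import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's tokenizer data, downloaded once via nltk.download()
tokens = []

# Tokenize every document and pool the tokens (case is preserved, so 'A' and 'a' differ)
for sent in document_collection:
    tokens.extend(word_tokenize(sent))

# Frequency distribution over all tokens in the collection
fd = nltk.FreqDist(tokens)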

In [59]:
fd.most_common(5)

[('is', 11), ('in', 10), ('man', 8), ('the', 7), ('and', 6)]

# Exercise

[spaCy](https://spacy.io) is another popular Python natural language processing package. Write code to count word frequencies using spaCy.

# Term Frequency<sub>t,d</sub> 

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

term_frequency_vectorizer = CountVectorizer(stop_words='english')
term_frequency_model = term_frequency_vectorizer.fit_transform(document_collection)

In [8]:
term_frequency_vectorizer.vocabulary_

{'animal': 0,
 'attacking': 1,
 'background': 2,
 'boy': 3,
 'boys': 4,
 'brown': 5,
 'children': 6,
 'dog': 7,
 'dogs': 8,
 'fighting': 9,
 'group': 10,
 'house': 11,
 'hugging': 12,
 'kids': 13,
 'man': 14,
 'near': 15,
 'nearby': 16,
 'old': 17,
 'outdoors': 18,
 'pants': 19,
 'playing': 20,
 'smile': 21,
 'smiling': 22,
 'standing': 23,
 'tall': 24,
 'wrestling': 25,
 'yard': 26,
 'young': 27}

In [9]:
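# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
terms = term_frequency_vectorizer.get_feature_names_out().tolist()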
print(terms)

['animal', 'attacking', 'background', 'boy', 'boys', 'brown', 'children', 'dog', 'dogs', 'fighting', 'group', 'house', 'hugging', 'kids', 'man', 'near', 'nearby', 'old', 'outdoors', 'pants', 'playing', 'smile', 'smiling', 'standing', 'tall', 'wrestling', 'yard', 'young']


In [10]:
print(len(terms))

28


In [12]:
import numpy as np
print(np.shape(term_frequency_model))

(10, 28)


In [17]:
print(type(term_frequency_model))

<class 'scipy.sparse.csr.csr_matrix'>


In [19]:
import pandas as pd

term_frequency_model_as_dataframe = pd.DataFrame(term_frequency_model.toarray(), columns= terms)

In [20]:
#pd.set_option('display.max_columns', None)
term_frequency_model_as_dataframe.head()

Unnamed: 0,animal,attacking,background,boy,boys,brown,children,dog,dogs,fighting,group,house,hugging,kids,man,near,nearby,old,outdoors,pants,playing,smile,smiling,standing,tall,wrestling,yard,young
0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0
1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,0,1,1,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,0


# Tf-idf vectorizer

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_model = tfidf_vectorizer.fit_transform(document_collection)

In [29]:
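# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
terms = tfidf_vectorizer.get_feature_names_out().tolist()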
print(terms)

['animal', 'attacking', 'background', 'boy', 'boys', 'brown', 'children', 'dog', 'dogs', 'fighting', 'group', 'house', 'hugging', 'kids', 'man', 'near', 'nearby', 'old', 'outdoors', 'pants', 'playing', 'smile', 'smiling', 'standing', 'tall', 'wrestling', 'yard', 'young']


In [22]:
print(len(terms))

28


In [23]:
import numpy as np
print(np.shape(tfidf_model))

(10, 28)


In [25]:
import pandas as pd

tfidf_model_as_dataframe = pd.DataFrame(tfidf_model.toarray(), columns=terms)

In [26]:
tfidf_model_as_dataframe

Unnamed: 0,animal,attacking,background,boy,boys,brown,children,dog,dogs,fighting,group,house,hugging,kids,man,near,nearby,old,outdoors,pants,playing,smile,smiling,standing,tall,wrestling,yard,young
0,0.0,0.0,0.347145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.347145,0.0,0.0,0.396791,0.207202,0.0,0.0,0.466762,0.0,0.0,0.250571,0.0,0.0,0.347145,0.0,0.0,0.396791,0.0
1,0.0,0.0,0.365318,0.0,0.0,0.0,0.491198,0.0,0.0,0.0,0.365318,0.491198,0.0,0.0,0.218049,0.0,0.0,0.0,0.0,0.0,0.263689,0.0,0.0,0.365318,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.401465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.209643,0.0,0.472261,0.0,0.351235,0.0,0.253523,0.0,0.401465,0.0,0.0,0.0,0.0,0.472261
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.438341,0.228899,0.51564,0.0,0.0,0.383497,0.0,0.27681,0.51564,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.601817,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.267154,0.0,0.0,0.0,0.447589,0.0,0.323072,0.0,0.511599,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.392528,0.0,0.448664,0.0,0.0,0.0,0.0,0.0,0.392528,0.0,0.0,0.0,0.234289,0.0,0.0,0.0,0.0,0.0,0.283329,0.0,0.0,0.392528,0.0,0.0,0.448664,0.0
6,0.443343,0.376882,0.0,0.0,0.0,0.376882,0.0,0.376882,0.0,0.0,0.0,0.0,0.0,0.0,0.196805,0.0,0.0,0.0,0.0,0.376882,0.0,0.0,0.0,0.0,0.443343,0.0,0.0,0.0
7,0.0,0.370811,0.0,0.0,0.0,0.370811,0.0,0.741622,0.0,0.0,0.0,0.0,0.0,0.0,0.193635,0.0,0.0,0.0,0.0,0.370811,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.647689,0.761905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.515192,0.0,0.0,0.0,0.606043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.606043,0.0,0.0
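
The weights in the table above are not the plain tf × log(N/df) values. To my understanding, `TfidfVectorizer` by default uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit Euclidean length. The sketch below assumes those defaults and recomputes row 8 ('Two dogs are fighting') by hand. The document frequencies are read off the collection after stop-word removal.

```python
import math

N = 10                               # documents in document_collection
df = {"dogs": 2, "fighting": 1}      # documents containing each term
tf = {"dogs": 1, "fighting": 1}      # counts in 'Two dogs are fighting'

# Smoothed idf (assumed scikit-learn default): ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in df}

# Unnormalized tf-idf weights
raw = {t: tf[t] * idf[t] for t in tf}

# L2 normalization (assumed default): divide by the Euclidean length of the row
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: w / norm for t, w in raw.items()}

for term, weight in weights.items():
    print(term, round(weight, 6))
```

Under these assumptions the two values agree with the `dogs` and `fighting` entries of row 8 in the table.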


## Exercise 

- Given a new sentence, "Two dogs are playing in the yard", write code in the cell below to get its tf-idf representation.

In [27]:
## Hint use the function transform() on the tfidf vectorizer
## Your code here

- Does the code above convert the text to lower case? If not, how can you do so?

<div style="background-color: #99CD4E; padding:5px 0;"> 
  <h2 style="color: white;"> Stemming and lemmatization </h2>
</div>

Refer to [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

A word can have multiple forms. For example:
- organize, organizes, and organizing (different forms of one word)
- democracy, democratic, and democratization (related words with different meanings)

Both stemming and lemmatization reduce a word to a common base form.  

~~~~
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 
~~~~

~~~~
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma 
~~~~
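
The contrast between the two approaches can be made concrete with a deliberately simplified sketch. Everything here is hypothetical: `SUFFIXES`, `LEMMAS`, `toy_stem`, and `toy_lemmatize` are written for this illustration only. Real stemmers and lemmatizers use much richer rules and vocabularies.

```python
# Toy illustration only: a rule-based suffix chopper versus a vocabulary lookup.

SUFFIXES = ("izing", "izes", "ize", "ing", "es", "s")  # hypothetical rule list


def toy_stem(word):
    """Chop off the first matching suffix, with no vocabulary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lookup table standing in for a real vocabulary.
LEMMAS = {
    "organize": "organize",
    "organizes": "organize",
    "organizing": "organize",
}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)


for w in ["organize", "organizes", "organizing"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer maps all three forms to a truncated string that need not be a word with the same meaning, while the lookup returns the dictionary form.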

In [30]:
word_list1 = ['organize', 'organizes', 'organizing']
word_list2 = ['democracy', 'democratic', 'democratization']

In [1]:
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data (nltk.download('wordnet')). Run the cell above first:
# it defines word_list1 and word_list2, and skipping it raises a NameError here.
wordnet_lemmatizer = WordNetLemmatizer()

print('Processing word list 1')
for word in word_list1:
    # Lemmatize as a verb
    print(wordnet_lemmatizer.lemmatize(word, 'v'))

print('Processing word list 2')
for word in word_list2:
    # Lemmatize first as a noun, then as a verb
    print(wordnet_lemmatizer.lemmatize(word, 'n'))
    print(wordnet_lemmatizer.lemmatize(word, 'v'))

In [37]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

print('Processing word list 1')
for word in word_list1:
    print(stemmer.stem(word))

print('Processing word list 2')
for word in word_list2:
    print(stemmer.stem(word))


Processing word list 1
organ
organ
organ
Processing word list 2
democraci
democrat
democrat


# Exercise 

- How can you incorporate lemmatization into the process of generating the tf-idf representation? Hint: Refer to [Document Clustering with Python](http://brandonrose.org/clustering)

 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
  <h1 style="color: white;"> *The End* </h1>
 </div>