# Word Encodings

Let us say we have collected movie reviews and labeled the sentiment of each one.

Our goal is to build a model that predicts the sentiment of a review from its text.

### Setting up data 

#### Creating a data frame

In [None]:
reviews = ['excellent excellent movie','disgusting!','time pass movie','among the all time greats','worst movie. time pass']
reviews

['excellent excellent movie',
 'disgusting!',
 'time pass movie',
 'among the all time greats',
 'worst movie. time pass']

In [None]:
doc_id = ['doc' + str(i + 1) for i in range(len(reviews))]
doc_id

['doc1', 'doc2', 'doc3', 'doc4', 'doc5']

In [None]:
sentiment=[1,0,0,1,0]

In [None]:
ds_dict={'doc_id': doc_id, 'reviews': reviews, 'sentiment': sentiment}

In [None]:
import pandas as pd

In [None]:
ds = pd.DataFrame(ds_dict)
ds

Unnamed: 0,doc_id,reviews,sentiment
0,doc1,excellent excellent movie,1
1,doc2,disgusting!,0
2,doc3,time pass movie,0
3,doc4,among the all time greats,1
4,doc5,worst movie. time pass,0


#### Creating Vocabulary

In [None]:
import nltk

nltk.download('punkt')  
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
lst_words = []

In [None]:
for review in reviews:
    for token in word_tokenize(review):
        lst_words.append(token)

# Keep only the unique tokens; a set is unordered, so sorting beforehand has no effect
vocabulary = set(lst_words)
print(vocabulary)

{'.', 'the', 'greats', 'worst', 'disgusting', 'among', 'excellent', '!', 'time', 'pass', 'movie', 'all'}


### Question: How to represent the textual data?

Doc4 = Among the all time greats. 

This document or review is an **ordered sequence of words**, and through tokenization we can break it into individual words.

Doc4 = ['Among', 'the', 'all', 'time', 'greats']

But all our data is text, and most machine learning algorithms need numeric input.

**So how do we convert the text data to numeric data?**

In other words, we need each word to have a numeric encoding:

'Among' --> x0 (numeric encoding) 

'the' --> x1 (numeric encoding) 

...

But what is that numeric representation?

## One-hot Encoding

In [None]:
columns=['excellent', '.', 'greats', 'the', 'pass', 'worst', 'time', 'disgusting', 'all', '!', 'movie', 'among']

features=[[1,0,0,0,0,0,0,0,0,0,1,0],
          [0,0,0,0,0,0,0,1,0,1,0,0], 
          [0,0,0,0,1,0,1,0,0,0,1,0],
          [0,0,1,1,0,0,1,0,1,0,0,1],
          [0,1,0,0,1,1,1,0,0,0,1,0]]

pd.DataFrame(features, columns=columns, index=doc_id)

Unnamed: 0,excellent,.,greats,the,pass,worst,time,disgusting,all,!,movie,among
doc1,1,0,0,0,0,0,0,0,0,0,1,0
doc2,0,0,0,0,0,0,0,1,0,1,0,0
doc3,0,0,0,0,1,0,1,0,0,0,1,0
doc4,0,0,1,1,0,0,1,0,1,0,0,1
doc5,0,1,0,0,1,1,1,0,0,0,1,0
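The matrix above was typed by hand. As a minimal sketch, the same rows can be derived from the reviews: each vocabulary word gets its own column, and a document row marks with a 1 every word that is present. The regex-based `tokenize` helper is an assumption that stands in for `word_tokenize` so the example needs only the standard library, and `computed_features` is a name introduced here for illustration.

```python
import re

reviews = ['excellent excellent movie', 'disgusting!', 'time pass movie',
           'among the all time greats', 'worst movie. time pass']

columns = ['excellent', '.', 'greats', 'the', 'pass', 'worst', 'time',
           'disgusting', 'all', '!', 'movie', 'among']

def tokenize(text):
    # Assumed stand-in for nltk's word_tokenize: words and punctuation
    # marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

computed_features = []
for review in reviews:
    present = set(tokenize(review))  # presence only, so repeats collapse
    computed_features.append([1 if word in present else 0 for word in columns])

for row in computed_features:
    print(row)
```

Because only presence is recorded, the repeated 'excellent' in doc1 still produces a single 1.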


### Issues with this encoding

- We lose the order/sequence and hence the context.
- Here the vocabulary has only 12 tokens, but imagine a vocabulary of 1 million words. We would end up with a massive number of features/dimensions.
- With a bigger vocabulary, we end up with high sparsity, i.e. most of the cells are 0. For instance, doc2 would have non-zero values in only 2 columns out of 1 million.
- Since we only capture the presence of a word, we lose the frequency information: even if a word is repeated multiple times, it is recorded only once. Doc1 has 'excellent' twice but the value shown is 1.
- This does not capture any meaning or relationship between the words.
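The last point can be checked numerically. In the sketch below, the one-hot vectors of any two different words are orthogonal, so their cosine similarity is 0 regardless of meaning. The helper names `one_hot` and `cosine_similarity` are introduced here for illustration only.

```python
import math

columns = ['excellent', '.', 'greats', 'the', 'pass', 'worst', 'time',
           'disgusting', 'all', '!', 'movie', 'among']

def one_hot(word):
    # A single 1 in the column of the word, 0 everywhere else.
    return [1 if word == col else 0 for col in columns]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two positive-sounding words and two opposite-sounding words
sim_positive_pair = cosine_similarity(one_hot('excellent'), one_hot('greats'))
sim_opposite_pair = cosine_similarity(one_hot('excellent'), one_hot('worst'))
print(sim_positive_pair, sim_opposite_pair)
```

Both similarities come out as 0, so the encoding cannot tell that 'excellent' is closer in meaning to 'greats' than to 'worst'.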


## Count Vectorizer

Count Vectorizer encodes each word by its frequency, i.e. how often the word is used in the document.

In [None]:
restaurant_reviews = ['the food was very bad','the place was very bad, the food was bad and the service was very bad as well']
restaurant_reviews

['the food was very bad',
 'the place was very bad, the food was bad and the service was very bad as well']

In [None]:
rest_review_id = ['doc' + str(i + 1) for i in range(len(restaurant_reviews))]
rest_review_id

['doc1', 'doc2']

In [None]:
restaurant_ds=pd.DataFrame({'reviews': restaurant_reviews}, index=rest_review_id)
restaurant_ds

Unnamed: 0,reviews
doc1,the food was very bad
doc2,"the place was very bad, the food was bad and t..."


In [None]:
lst_words=[]
for review in restaurant_reviews:
    for token in word_tokenize(review):
        lst_words.append(token)

# Keep only the unique tokens; a set is unordered, so sorting beforehand has no effect
rest_vocab = set(lst_words)
print(rest_vocab)

{'as', 'the', 'service', 'bad', 'and', 'well', 'food', ',', 'was', 'very', 'place'}


In [None]:
rest_columns=['the', 'well', 'place', 'bad', 'and', 'as', 'very', 'food', 'service', ',', 'was']

rest_features=[[1,0,0,1,0,0,1,1,0,0,1],
              [3,1,1,3,1,1,2,1,1,1,3]]

pd.DataFrame(rest_features, columns=rest_columns, index=rest_review_id)

Unnamed: 0,the,well,place,bad,and,as,very,food,service,",",was
doc1,1,0,0,1,0,0,1,1,0,0,1
doc2,3,1,1,3,1,1,2,1,1,1,3
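The counts above can also be computed rather than typed in. This is a minimal sketch using `collections.Counter`. The regex-based `tokenize` helper is an assumption standing in for `word_tokenize`, and `computed_rest_features` is a name introduced here for illustration.

```python
import re
from collections import Counter

restaurant_reviews = [
    'the food was very bad',
    'the place was very bad, the food was bad and the service was very bad as well',
]

rest_columns = ['the', 'well', 'place', 'bad', 'and', 'as', 'very',
                'food', 'service', ',', 'was']

def tokenize(text):
    # Assumed stand-in for nltk's word_tokenize: words and punctuation
    # marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

computed_rest_features = []
for review in restaurant_reviews:
    counts = Counter(tokenize(review))  # token -> number of occurrences
    # A Counter returns 0 for tokens that do not occur in the review
    computed_rest_features.append([counts[word] for word in rest_columns])

for row in computed_rest_features:
    print(row)
```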


### Issues with this encoding

This encoding addresses the frequency issue of OHE but still has the following issues:

- We lose the order/sequence and hence the context.
- Here the vocabulary has only 11 tokens, but imagine a vocabulary of 1 million words. We would end up with a massive number of features/dimensions.
- With a bigger vocabulary, we end up with high sparsity, i.e. most of the cells are 0. For instance, a short review such as doc1 would have non-zero values in only 5 columns out of 1 million.
- This does not capture any meaning or relationship between the words.
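The loss of word order can be illustrated with a small sketch. The two reviews below are hypothetical examples made up for this illustration. They use the same words in a different order and carry opposite sentiment, yet their count vectors are identical.

```python
from collections import Counter

# Hypothetical reviews, not part of the dataset above
review_a = 'the food was good not bad'
review_b = 'the food was bad not good'

counts_a = Counter(review_a.split())
counts_b = Counter(review_b.split())

# Same words with the same counts, so the two encodings are equal
print(counts_a == counts_b)
```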

## Term Frequency - Inverse Document Frequency (TF-IDF)

In [None]:
stage_reviews = ['the play was good', 'the end was good', 'the cast was brilliant','the ultimate show']
stage_reviews

['the play was good',
 'the end was good',
 'the cast was brilliant',
 'the ultimate show']

In [None]:
stage_review_id = ['doc' + str(i + 1) for i in range(len(stage_reviews))]
stage_review_id

['doc1', 'doc2', 'doc3', 'doc4']

In [None]:
stage_ds=pd.DataFrame({'reviews': stage_reviews}, index=stage_review_id)
stage_ds

Unnamed: 0,reviews
doc1,the play was good
doc2,the end was good
doc3,the cast was brilliant
doc4,the ultimate show


#### Term-Frequency

The term frequency (tf) vector holds the frequency of each token in the document.

Term Frequency ($tf_{t,d}$) = number of occurrences of term t in document d, i.e. count(t,d)

or, to dampen the effect of large counts,

$tf_{t,d} = log_{10}(count(t,d) + 1)$

So, if the count is 1, then tf is $log_{10}(1 + 1) \approx 0.3$

For instance, tf for doc1 is [0.3, 0.3, 0.3, 0.3] for the words [the, play, was, good] respectively.
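As a quick check of the log-scaled formula, the sketch below computes tf for doc1. The function name `term_frequency` is introduced here for illustration, and whitespace splitting is assumed to be sufficient because these reviews contain no punctuation.

```python
import math
from collections import Counter

def term_frequency(tokens):
    # tf = log10(count + 1) for every term that occurs in the document
    counts = Counter(tokens)
    return {term: math.log10(count + 1) for term, count in counts.items()}

tf_doc1 = term_frequency('the play was good'.split())
print({term: round(value, 1) for term, value in tf_doc1.items()})
```

Each of the four words occurs once, so each one gets a tf of about 0.3.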


In [None]:
lst_words=[]
for review in stage_reviews:
    for token in word_tokenize(review):
        lst_words.append(token)

# Keep only the unique tokens; a set is unordered, so sorting beforehand has no effect
stage_vocab = set(lst_words)
print(stage_vocab)

{'brilliant', 'show', 'the', 'end', 'was', 'cast', 'ultimate', 'good', 'play'}


In [None]:
stage_columns= ['end', 'show', 'good', 'the', 'ultimate', 'play', 'brilliant', 'cast', 'was']

stage_features_tf=[[0,0,0.3,0.3,0,0.3,0,0,0.3],
                   [0.3,0,0.3,0.3,0,0,0,0,0.3],
                   [0,0,0,0.3,0,0,0.3,0.3,0.3],
                   [0,0.3,0,0.3,0.3,0,0,0,0]]

pd.DataFrame(stage_features_tf, columns=stage_columns, index=stage_review_id)

Unnamed: 0,end,show,good,the,ultimate,play,brilliant,cast,was
doc1,0.0,0.0,0.3,0.3,0.0,0.3,0.0,0.0,0.3
doc2,0.3,0.0,0.3,0.3,0.0,0.0,0.0,0.0,0.3
doc3,0.0,0.0,0.0,0.3,0.0,0.0,0.3,0.3,0.3
doc4,0.0,0.3,0.0,0.3,0.3,0.0,0.0,0.0,0.0


#### Inverse-Document Frequency

Inverse document frequency measures how informative a term is across the corpus. Since tf treats all terms as equally important, term frequencies alone cannot determine the weight of a term in a document. Terms such as “the”, “a”, and “was” appear many times and can overshadow the more informative words. We therefore need to reduce the weight of these frequent terms while increasing the weight of rare ones.

Term frequency is the occurrence count of a term in one particular document, while document frequency is the number of different documents the term appears in, so it depends on the whole corpus. The idf of a term is the number of documents in the corpus divided by the document frequency of the term.


<center>$idf(t) = \frac{N}{N(t)}$</center>

where,

N(t) = Number of documents containing the term t

N is the number of documents.

A term that appears in more documents is expected to be less important. We take the log of the inverse document frequency to reduce its magnitude, as the raw ratio can be very large.

<center>$idf(t) =log_{10}(\frac{N}{N(t)})$</center>



For instance, idf('the') = log10(4/4) = log10(1) = 0 and idf('brilliant') = log10(4/1) ≈ 0.6

We can see that a word used in every document is reduced to zero, while a rare word in the corpus such as 'brilliant' gets the higher weight of 0.6.

In [None]:
idf_words = [[0.6, 0.6, 0.3, 0, 0.6, 0.6, 0.6, 0.6, 0.125]]

pd.DataFrame(idf_words, columns=stage_columns)

Unnamed: 0,end,show,good,the,ultimate,play,brilliant,cast,was
0,0.6,0.6,0.3,0,0.6,0.6,0.6,0.6,0.125
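The idf values in the table can be reproduced with a short sketch. The names `doc_freq` and `idf` are introduced here for illustration, and whitespace splitting is assumed to be sufficient for these punctuation-free reviews.

```python
import math

stage_reviews = ['the play was good', 'the end was good',
                 'the cast was brilliant', 'the ultimate show']

tokenized = [review.split() for review in stage_reviews]
n_docs = len(tokenized)
vocab = sorted({token for doc in tokenized for token in doc})

# Document frequency: in how many documents does each term appear?
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}

# idf(t) = log10(N / N(t))
idf = {term: math.log10(n_docs / doc_freq[term]) for term in vocab}

for term in vocab:
    print(term, doc_freq[term], round(idf[term], 3))
```

'the' appears in all 4 documents and gets an idf of 0, 'was' appears in 3 and gets about 0.125, and words that appear in a single document get about 0.6.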


#### TF-IDF

In [None]:
stage_features_tf_idf=[[0,0,0.09,0,0,0.18,0,0,0.0375], 
                       [0.18,0,0.09,0,0,0,0,0,0.0375], 
                       [0,0,0,0,0,0,0.18,0.18,0.0375], 
                       [0,0.18,0,0,0.18,0,0,0,0]]

pd.DataFrame(stage_features_tf_idf, columns=stage_columns, index=stage_review_id)

Unnamed: 0,end,show,good,the,ultimate,play,brilliant,cast,was
doc1,0.0,0.0,0.09,0,0.0,0.18,0.0,0.0,0.0375
doc2,0.18,0.0,0.09,0,0.0,0.0,0.0,0.0,0.0375
doc3,0.0,0.0,0.0,0,0.0,0.0,0.18,0.18,0.0375
doc4,0.0,0.18,0.0,0,0.18,0.0,0.0,0.0,0.0


We observe that the weights are scaled down and that the weight of the word 'the' is 0, precisely because it is present in all documents.
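Putting both parts together, the following sketch computes tf-idf for every document as $log_{10}(count(t,d) + 1) \times log_{10}(\frac{N}{N(t)})$. The helper name `idf` and the variable `computed_tf_idf` are introduced here for illustration. The values are computed without intermediate rounding, so they differ slightly from the hand-rounded table (for example about 0.091 instead of 0.09 for 'good').

```python
import math
from collections import Counter

stage_reviews = ['the play was good', 'the end was good',
                 'the cast was brilliant', 'the ultimate show']

stage_columns = ['end', 'show', 'good', 'the', 'ultimate', 'play',
                 'brilliant', 'cast', 'was']

tokenized = [review.split() for review in stage_reviews]
n_docs = len(tokenized)

def idf(term):
    # Number of documents divided by the number of documents containing the term
    doc_freq = sum(term in doc for doc in tokenized)
    return math.log10(n_docs / doc_freq)

computed_tf_idf = []
for doc in tokenized:
    counts = Counter(doc)
    row = [math.log10(counts[term] + 1) * idf(term) for term in stage_columns]
    computed_tf_idf.append([round(value, 3) for value in row])

for row in computed_tf_idf:
    print(row)
```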

### Issues with TF-IDF encoding

This encoding addresses the frequency issue of OHE and, unlike Count Vectorizer, also captures the relative importance of each token, but it still has the following issues:

- We lose the order/sequence and hence the context.
- Here the vocabulary has only 9 words, but imagine a vocabulary of 1 million words. We would end up with a massive number of features/dimensions.
- With a bigger vocabulary, we end up with high sparsity, i.e. most of the cells are 0. For instance, a short review such as doc4 would have non-zero values in only 2 columns out of 1 million.
- This does not capture any meaning or relationship between the words.

## Word Embeddings

We saw that each of the techniques above has major shortcomings.

Researchers needed embeddings that address all of the following at once:

- Low-dimensional representation
- High density and low sparsity
- Capture the context
- Capture semantic information about words and the relations between them

