# Bag of Words and TF-IDF

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

In [47]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [41]:
corpus = [
    "I sat on a plane and sat on a chair"
]

In [42]:
vectorizer = CountVectorizer() 
X = vectorizer.fit_transform(corpus) 
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,and,chair,on,plane,sat
0,1,1,2,1,2
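
The output above has no column for "I" or "a". A minimal sketch of why, assuming scikit-learn's documented defaults for `CountVectorizer` (lowercasing, plus the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters). The helper name `count_tokens` is hypothetical and only mimics those defaults; it is not the library's implementation.

```python
import re
from collections import Counter

# Default token pattern documented for CountVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_tokens(text):
    """Lowercase the text, tokenize it, and count each token (hypothetical helper)."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

counts = count_tokens("I sat on a plane and sat on a chair")
print(sorted(counts.items()))
# [('and', 1), ('chair', 1), ('on', 2), ('plane', 1), ('sat', 2)]
```

The single-character tokens "I" and "a" never match the pattern, so they are dropped before counting.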


In [43]:
corpus = [
    "Tune a hyperparameter.",
    "You can't tune a fish like a piano.",
]

In [37]:
vectorizer = CountVectorizer(stop_words='english') 
X = vectorizer.fit_transform(corpus) 
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,fish,hyperparameter,like,piano,tune
0,0,1,0,0,1
1,1,0,1,1,1


Next, we switch from raw counts to proportions. Each value becomes the share of the document's terms (after stop-word removal) accounted for by that word:

`value = (number of times the word appears in the document) / (number of terms kept in the document)`

 - `'l2'`: the sum of squares of the vector elements is 1. Once the L2 norm has been applied, the cosine similarity between two vectors is their dot product.
 - `'l1'`: the sum of absolute values of the vector elements is 1.

>Note: We get proportions by passing `norm="l1"`. The default is the L2 (Euclidean) norm, which is generally the better choice, but the L1 norm (term counts divided by the total number of terms) is easier to follow here.

In [16]:
vectorizer = TfidfVectorizer(stop_words='english', use_idf=False, norm='l1') 
X = vectorizer.fit_transform(corpus) 
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,fish,hyperparameter,like,piano,tune
0,0.0,0.5,0.0,0.0,0.5
1,0.25,0.0,0.25,0.25,0.25
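
To make the two norms concrete, here is a small sketch that normalizes the count rows of this two-document corpus by hand. The function names `l1_normalize` and `l2_normalize` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
import math

def l1_normalize(vec):
    """Scale the vector so its absolute values sum to 1."""
    total = sum(abs(v) for v in vec)
    return [v / total for v in vec] if total else list(vec)

def l2_normalize(vec):
    """Scale the vector so its squared values sum to 1."""
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec] if length else list(vec)

# Count rows (columns: fish, hyperparameter, like, piano, tune).
doc0 = [0, 1, 0, 0, 1]
doc1 = [1, 0, 1, 1, 1]

print(l1_normalize(doc0))  # [0.0, 0.5, 0.0, 0.0, 0.5]
print(l1_normalize(doc1))  # [0.25, 0.0, 0.25, 0.25, 0.25]
print(l2_normalize(doc1))  # [0.5, 0.0, 0.5, 0.5, 0.5]
```

The L1 results match the table above. The L2 result shows how the same row would look under the default norm.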


This is inverse document frequency: the more documents a term appears in, the less weight it gets in our matrix. With the default settings, `TfidfVectorizer` applies IDF weighting and then L2 normalization.

In [17]:
vectorizer = TfidfVectorizer(stop_words='english') 
X = vectorizer.fit_transform(corpus) 
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,fish,hyperparameter,like,piano,tune
0,0.0,0.814802,0.0,0.0,0.579739
1,0.534046,0.0,0.534046,0.534046,0.379978
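
The values in the table above can be reproduced by hand. The sketch below assumes scikit-learn's documented default of smoothed IDF, `idf = ln((1 + N) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper `tfidf_row` and the hard-coded document frequencies are illustrative, not part of the library.

```python
import math

n_docs = 2
# Number of documents each term appears in, after stop-word removal.
doc_freq = {"fish": 1, "hyperparameter": 1, "like": 1, "piano": 1, "tune": 2}

# Smoothed IDF, as documented for TfidfVectorizer(smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_row(counts):
    """Weight raw counts by IDF, then L2-normalize the row (hypothetical helper)."""
    weighted = {term: count * idf[term] for term, count in counts.items()}
    length = math.sqrt(sum(w * w for w in weighted.values()))
    return {term: round(w / length, 6) for term, w in weighted.items()}

row0 = tfidf_row({"hyperparameter": 1, "tune": 1})
row1 = tfidf_row({"fish": 1, "like": 1, "piano": 1, "tune": 1})
print(row0)
print(row1)
```

Because "tune" appears in both documents, its IDF is exactly 1, the smallest possible value, so it ends up with the lowest weight in each row.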


In [75]:
corpus = [
    "Tune a hyperparameter.",
    "You can tune a piano but you can't tune a fish.",
    "Fish who eat fish, catch fish.",
    "People can tune a fish or a hyperparameter.",
    "It is hard to catch fish and tune it.",
]

In [76]:
vectorizer = CountVectorizer(stop_words='english') 
X = vectorizer.fit_transform(corpus) 
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,catch,eat,fish,hard,hyperparameter,people,piano,tune
0,0,0,0,0,1,0,0,1
1,0,0,1,0,0,0,1,2
2,1,1,3,0,0,0,0,0
3,0,0,1,0,1,1,0,1
4,1,0,1,1,0,0,0,1


In [90]:
vectorizer = TfidfVectorizer(stop_words='english', use_idf=False, norm='l2') 
X = vectorizer.fit_transform(corpus) 
df = pd.DataFrame(np.round(X.toarray(), 3), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,catch,eat,fish,hard,hyperparameter,people,piano,tune
0,0.0,0.0,0.0,0.0,0.707,0.0,0.0,0.707
1,0.0,0.0,0.408,0.0,0.0,0.0,0.408,0.816
2,0.302,0.302,0.905,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.5,0.0,0.5,0.5,0.0,0.5
4,0.5,0.0,0.5,0.5,0.0,0.0,0.0,0.5


In [87]:
vectorizer = TfidfVectorizer(stop_words='english') 
X = vectorizer.fit_transform(corpus) 
df = pd.DataFrame(np.round(X.toarray(), 3), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,catch,eat,fish,hard,hyperparameter,people,piano,tune
0,0.0,0.0,0.0,0.0,0.82,0.0,0.0,0.573
1,0.0,0.0,0.35,0.0,0.0,0.0,0.622,0.701
2,0.38,0.471,0.796,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.373,0.0,0.534,0.661,0.0,0.373
4,0.534,0.0,0.373,0.661,0.0,0.0,0.0,0.373


# Notes

## Basic Text preprocessing steps
 - Removing noise: dropping anything that isn't a standard number or letter.
 - Removing stop words: very common words that add little value to the analysis are removed from the vocabulary.
 - Stemming: reducing inflected (or derived) words to their stem, base, or root form.
 - Lemmatization: similar to stemming, but stemming can often produce non-words, whereas lemmas are actual words.
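
A minimal sketch of the first two steps, using only the standard library. The stop-word list here is a deliberately tiny, hypothetical one for illustration; real lists (such as the one behind `stop_words='english'`) are much longer. Stemming and lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "it", "the", "to"}

def preprocess(text):
    """Lowercase, strip noise characters, and drop stop words."""
    # Remove noise: replace anything that is not a letter, digit, or whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Remove stop words.
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

tokens = preprocess("It is hard to catch fish and tune it.")
print(tokens)  # ['hard', 'catch', 'fish', 'tune']
```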

## Bag of Words (BoW) Model
After preprocessing, text needs to be transformed into meaningful numeric vectors for use in ML algorithms. The BoW model represents text as a matrix of word counts, with one row per document. It's called a "bag of words" because information about the order or structure of words is discarded. The model only cares whether (and how often) the known words occur in the document, not where they occur. Intuitively, documents are similar if they have similar content.
It involves:
 - a vocabulary of known words
 - a measure of the presence of known words

For example, given a dictionary containing {Learning, is, the, not, great}, we can vectorize the text "Learning is great".
Its vector representation would be $(1, 1, 0, 0, 1)$, where each number is the count of the corresponding dictionary word.
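
The example above can be sketched in a few lines. The function name `bow_vector` is hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

vocabulary = ["Learning", "is", "the", "not", "great"]

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

print(bow_vector("Learning is great", vocabulary))  # [1, 1, 0, 0, 1]
print(bow_vector("great is Learning", vocabulary))  # [1, 1, 0, 0, 1]
```

Both sentences map to the same vector, which is exactly the loss of word order that the "bag" in the name refers to.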

## TF-IDF
With BoW, highly frequent words start to dominate the document, but such words may not contain much informational content. It also gives more weight to longer documents than shorter documents.  

One approach is to rescale the frequency of each word by how often it appears across all documents, so that words that are frequent in a document but also frequent everywhere else are penalized. This scoring is called Term Frequency-Inverse Document Frequency, where
 - Term Frequency: a scoring of the frequency of the word in the current document
    - TF = (Number of times term t appears in a document)/(Number of terms in the document)

 - Inverse Document Frequency: a scoring of how rare a word is across documents.
    - IDF = $1+\log(N/n)$, where $N$ is the number of documents and $n$ is the number of documents in which term $t$ appears. (By default, scikit-learn uses the smoothed variant $\ln\frac{1+N}{1+n}+1$.)
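
A short sketch of the two formulas above, using the unsmoothed IDF $1+\log(N/n)$ with the natural logarithm. The three tokenized documents and the function names `tf`, `idf`, and `tfidf` are made up for illustration, and `idf` assumes the term occurs in at least one document.

```python
import math

# Hypothetical, already-tokenized documents.
docs = [
    ["tune", "hyperparameter"],
    ["fish", "eat", "fish", "catch", "fish"],
    ["hard", "catch", "fish", "tune"],
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: 1 + log(N / n); assumes n > 0."""
    n = sum(1 for doc in docs if term in doc)
    return 1 + math.log(len(docs) / n)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("fish", docs[1]))            # 0.6
print(idf("fish", docs) < idf("eat", docs))  # True: "eat" is rarer, so it scores higher
print(round(tfidf("fish", docs[1], docs), 3))
```

Multiplying the two scores rewards terms that are frequent within a document but rare across the collection.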

## Cosine Similarity
A measure of similarity between two non-zero vectors of an inner product space, given by the cosine of the angle between them.
 - The TF-IDF weight is often used in information retrieval (IR) and text mining.
 - It is a statistical measure of how important a word is to a document in a collection or corpus, so documents are commonly compared via the cosine similarity of their TF-IDF vectors:
     - Cosine Similarity $(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $d_1, d_2$ are two non-zero vectors.
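
As an illustration of the formula, the sketch below compares the last two count rows of the five-document corpus. The function `cosine_similarity` here is a hand-written helper, not the scikit-learn function of the same name.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of two non-zero vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Count rows 3 and 4 of the five-document corpus
# (columns: catch, eat, fish, hard, hyperparameter, people, piano, tune).
d3 = [0, 0, 1, 0, 1, 1, 0, 1]
d4 = [1, 0, 1, 1, 0, 0, 0, 1]
print(cosine_similarity(d3, d4))  # 0.5
```

The two documents share "fish" and "tune", giving a dot product of 2, and each vector has length 2. The same 0.5 comes out as a plain dot product of the corresponding L2-normalized rows, which is why L2 normalization is convenient.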
     
**Reference:**  
 1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
 1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
 1. https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
 1. http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/