# TF-IDF

## Why?

What's wrong with the count vectorizer?

Stopwords?
- Unlikely to be helpful: they're everywhere, so they can't be used for search, sentiment analysis, etc.
- How do we know our stopwords are correct?
- We don't, they're context specific
- Can we automatically determine which words are important?

## The main idea of TF-IDF

- Words we want to ignore appear in many different documents
- They won't help us differentiate between documents
- We want to somehow scale down those word counts
- Term Frequency - Inverse Document Frequency
- TF-IDF = term frequency / document frequency (intuition only; the actual formula is TF × IDF)
- TF (term frequency) is just the count vectorizer result
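- IDF (inverse document frequency) = log(N / N(t)), where N is the number of documents and N(t) is the number of documents containing term t
- Why take the log?
  - It is monotonically increasing, so rarer terms still get larger weights
  - It squashes down large values
  - There are deeper reasons from information theory; see the extra reading if you want more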
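
As a rough illustration of the idea above, the sketch below computes idf(t) = log(N / N(t)) by hand and multiplies it by the raw counts. The three-document corpus and the whitespace tokenization are assumptions made up for this example.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]  # naive whitespace tokenization
N = len(tokenized)

# N(t): number of documents that contain term t at least once
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# idf(t) = log(N / N(t))
idf = {term: math.log(N / n_t) for term, n_t in doc_freq.items()}


def tfidf(tokens):
    """Raw term count times IDF for each term in one document."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items()}


weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:>4}  {weight:.3f}")
```

Here "the" occurs in all three documents, so its IDF is log(3/3) = 0 and it drops out even though it has the highest raw count. A word found in only one document, such as "mat", gets the largest IDF, log(3).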

In [4]:
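from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["hello"]  # placeholder training documents
test_texts = ["hello"]  # placeholder test documents
tfidf = TfidfVectorizer()
Xtrain = tfidf.fit_transform(train_texts)  # learn the vocabulary and IDF weights from the training set
Xtest = tfidf.transform(test_texts)  # reuse them on the test set, do not refit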

NOTE: TfidfVectorizer takes arguments for stop words (stop_words), a custom tokenizer (tokenizer), accent stripping (strip_accents), etc.
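
A minimal sketch of two of those arguments, assuming scikit-learn is installed. The two example sentences are made up, and `stop_words="english"` uses scikit-learn's built-in English list, which may not suit every corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: same word with and without an accent
texts = ["The café menu", "the cafe prices"]

tfidf = TfidfVectorizer(
    stop_words="english",     # drop words from the built-in English stop word list
    strip_accents="unicode",  # map accented characters to plain ones before tokenizing
)
X = tfidf.fit_transform(texts)

# vocabulary_ maps each kept term to its column index in X
print(sorted(tfidf.vocabulary_))
print(X.shape)
```

With these settings both spellings of "café" should map to the same feature, and "the" is removed as a stop word.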

## Variations

### Term Frequency

- Binary: 1 if the word appears in the document, 0 otherwise
- Normalize the count, e.g. divide by the total number of words in the document (sometimes this is the default)
- Take the log of 1 + count, reduces the influence of extreme values
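
The three variants above can be sketched in plain Python as follows. The example document is made up, and the count is normalized by document length, which is one common choice and an assumption here. In scikit-learn, `TfidfVectorizer(binary=True)` gives the binary variant, and `sublinear_tf=True` applies a related log scaling, 1 + log(tf), rather than log(1 + tf).

```python
import math
from collections import Counter

# Made-up document with one heavily repeated word
tokens = "spam spam spam spam ham eggs".split()
counts = Counter(tokens)

# Binary: 1 for every word that appears (absent words are implicitly 0)
binary_tf = {term: 1 for term in counts}

# Normalized: count divided by the document length
normalized_tf = {term: count / len(tokens) for term, count in counts.items()}

# Log-scaled: log(1 + count) dampens extreme counts
log_tf = {term: math.log(1 + count) for term, count in counts.items()}

for term in counts:
    print(term, counts[term], binary_tf[term],
          round(normalized_tf[term], 3), round(log_tf[term], 3))
```

With raw counts "spam" outweighs "ham" by a factor of 4. After log scaling the ratio is log(5) / log(2), a little over 2.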

### Inverse Document Frequency Variations

- Smooth IDF => idf(t) = log(N / (N(t) + 1)) + 1
  - The + 1 inside the log avoids dividing by zero when N(t) = 0; the + 1 outside prevents an IDF of 0 (log(1) = 0), which prevents a divide by zero later
- IDF Max => use the largest N(t') over the terms t' in the document in place of N
- Probabilistic IDF => idf(t) = log((N - N(t)) / N(t)), the log-odds (logit) of a document not containing t
- Trivial IDF => idf(t) = 1, which leaves just the term frequency
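
To compare these variants numerically, here is a small sketch that writes the standard, smooth, and probabilistic formulas as plain functions. The function names and the corpus size N = 1000 are made up for illustration. Note that scikit-learn's `smooth_idf=True` uses a slightly different form, log((1 + N) / (1 + N(t))) + 1.

```python
import math


def idf_standard(N, n_t):
    """log(N / N(t)); equals 0 when the term is in every document."""
    return math.log(N / n_t)


def idf_smooth(N, n_t):
    """log(N / (N(t) + 1)) + 1; defined even when N(t) = 0."""
    return math.log(N / (n_t + 1)) + 1


def idf_probabilistic(N, n_t):
    """log((N - N(t)) / N(t)); undefined when N(t) = N, negative when N(t) > N / 2."""
    return math.log((N - n_t) / n_t)


N = 1000  # hypothetical corpus size
for n_t in (1, 10, 500, 999):
    print(n_t,
          round(idf_standard(N, n_t), 3),
          round(idf_smooth(N, n_t), 3),
          round(idf_probabilistic(N, n_t), 3))
```

The probabilistic variant crosses zero at N(t) = N / 2, so terms that appear in more than half of the documents receive negative weights. The smooth variant stays positive even for a term that appears in every document.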

## Normalizing the TF-IDF

- Recall the relationship between Euclidean distance and cosine distance
- Unlike CountVectorizer, TfidfVectorizer supports normalization directly
- TfidfVectorizer(norm="l2") # (or "l1", but "l2" is the default)

NOTE: L2 normalization is unit length normalization: each document vector is scaled to have Euclidean norm 1
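
The relationship recalled above can be checked with a short sketch: for L2-normalized vectors, the squared Euclidean distance equals 2 × (1 - cosine similarity), so ranking by either measure gives the same order. The two vectors are made-up stand-ins for TF-IDF rows.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors standing in for two TF-IDF rows
a = l2_normalize([3.0, 0.0, 4.0])
b = l2_normalize([1.0, 2.0, 2.0])

# For unit vectors the dot product is the cosine similarity
cosine_similarity = dot(a, b)
squared_euclidean = sum((x - y) ** 2 for x, y in zip(a, b))

print(squared_euclidean, 2 * (1 - cosine_similarity))
```

The two printed values should agree up to floating-point error.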