## Document Classification (IMDB Movie Reviews)
![image.png](attachment:image.png)

## Word to Matrix / Vectors

Documents have different lengths and consist of sequences of words. How do we create features $X$ to characterize a document?

### Bag-of-Words
* From a dictionary, identify the 10K most frequently occurring words.
* Create a binary vector of length $p=10K$ for each document, with a 1 in every position whose corresponding word occurs in the document.
  * With $n$ documents, we now have an $n \times p$ **sparse** feature matrix $X$.
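
The two steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the helper names `build_vocabulary` and `binary_bow` are hypothetical, the vocabulary is assumed to come from the corpus itself, and tokenization is assumed to be lowercase whitespace splitting.

```python
from collections import Counter

def build_vocabulary(documents, max_words=10_000):
    """Map the max_words most frequent words to column indices."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}

def binary_bow(document, vocabulary):
    """Return a binary vector with a 1 wherever a vocabulary word occurs."""
    vector = [0] * len(vocabulary)
    for word in set(document.lower().split()):
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector

# Toy corpus (made up for illustration): n = 3 documents, p = 6 distinct words.
docs = ["a great movie", "a terrible movie", "great acting great plot"]
vocab = build_vocabulary(docs)
X = [binary_bow(doc, vocab) for doc in docs]  # n x p feature matrix as a list of rows
```

Because the vector is binary, "great" contributes a single 1 to the third row even though it occurs twice. With a real 10K-word vocabulary most entries are 0, which is why a sparse matrix format is normally used instead of dense lists.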
  
#### Drawbacks
* `Bag-of-Words (BoW)` **does not consider the sequence of words** in the text. It treats each word independently; these single words are called `unigrams`. We can instead use `bigrams` (occurrences of adjacent word pairs), and in general `n-grams`.

![image.png](attachment:image.png)

### N-grams
N-grams are contiguous sequences of n items (words, characters, or tokens) extracted from a text corpus. In the context of natural language processing (NLP), N-grams are commonly used to capture patterns and dependencies between words in a sequence of text. 
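
As a minimal sketch of word-level n-gram extraction, the snippet below slides a window of size `n` over the tokens. The function name `ngrams` and the lowercase whitespace tokenization are assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the contiguous word n-grams of text as a list of tuples."""
    tokens = text.lower().split()
    # A text with fewer than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the movie was not good"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
# bigrams == [("the", "movie"), ("movie", "was"), ("was", "not"), ("not", "good")]
```

The bigram `("not", "good")` keeps the negation attached to the word it modifies, which is exactly the information a unigram model discards.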

#### Advantages
* **Capturing Context**: N-grams preserve some level of word order and context, allowing models to capture dependencies between adjacent words.
* **Flexibility**: N-grams can be adjusted to different granularities (`unigrams`, `bigrams`, `trigrams`, etc.), providing flexibility in capturing different levels of context.

#### Drawbacks
* **Data Sparsity:** As n increases, the number of possible n-grams grows exponentially (up to $V^n$ for a vocabulary of $V$ words), so most n-grams occur rarely or never in the data. This leads to sparsity issues, especially with smaller datasets.
* **Lack of Generalization**: N-grams may **overfit** to specific patterns present in the training data, making them less generalizable to unseen data.

#### Mitigation Strategies
* **Pruning**: Limit the vocabulary size or discard low-frequency n-grams to reduce computational complexity.
* **Smoothing**: Address data sparsity issues by smoothing probabilities of unseen n-grams.
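
Both strategies can be illustrated on a toy corpus. The sketch below assumes a minimum-count threshold for pruning and add-one (Laplace) smoothing for bigram probabilities; these are just one common choice for each strategy, and the corpus, `min_count`, and `smoothed_bigram_prob` are made up for illustration.

```python
from collections import Counter

corpus = ["the movie was good", "the movie was bad", "the plot was good"]
tokenized = [doc.split() for doc in corpus]

unigram_counts = Counter(tok for toks in tokenized for tok in toks)
bigram_counts = Counter(bg for toks in tokenized for bg in zip(toks, toks[1:]))

# Pruning: keep only bigrams seen at least min_count times.
min_count = 2
pruned = {bg: c for bg, c in bigram_counts.items() if c >= min_count}

# Add-one smoothing: every bigram, seen or unseen, gets a non-zero probability.
vocab_size = len(unigram_counts)

def smoothed_bigram_prob(prev, word):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

p_seen = smoothed_bigram_prob("was", "good")    # (2 + 1) / (3 + 6)
p_unseen = smoothed_bigram_prob("was", "plot")  # (0 + 1) / (3 + 6)
```

Pruning shrinks the feature space from six bigrams to three here, while smoothing ensures the unseen pair "was plot" is assigned a small probability instead of zero.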
  
![image.png](attachment:image.png)

## TF-IDF
* [Reference](https://www.kdnuggets.com/2022/09/convert-text-documents-tfidf-matrix-tfidfvectorizer.html)
$$
\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)
$$

### Term Frequency (TF)
Measures how frequently a term (word) occurs in a document. 
*  It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in the document.
$$
\text{TF}(t,d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}
$$

### Inverse Document Frequency (IDF)
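Measures **how informative a term is** across the entire collection of documents: terms that appear in few documents receive a high weight, and terms that appear in almost every document receive a weight near zero.
*   It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. Implementations often add a smoothing term to avoid division by zero for terms that appear in no document; the formula here is the unsmoothed form.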
$$
\text{IDF}(t,D) = \log\left(\frac{\text{Number of documents}}{\text{Number of documents containing term } t}\right)
$$


Example documents:

* **Text A**: Jupiter is the largest planet
* **Text B**: Mars is the fourth planet from the sun
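
To make the formulas concrete, here is a minimal sketch that applies the unsmoothed TF and IDF definitions above to Text A and Text B. The function names, the lowercase whitespace tokenization, and the use of the natural logarithm are assumptions. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ from these.

```python
import math

def term_frequency(term, tokens):
    """TF: occurrences of term divided by the total number of terms."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus_tokens):
    """Unsmoothed IDF; term must appear in at least one document."""
    containing = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / containing)

def tf_idf(term, tokens, corpus_tokens):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus_tokens)

texts = {
    "A": "Jupiter is the largest planet",
    "B": "Mars is the fourth planet from the sun",
}
tokens_by_doc = {name: text.lower().split() for name, text in texts.items()}
all_docs = list(tokens_by_doc.values())

scores = {
    name: {term: tf_idf(term, tokens, all_docs) for term in sorted(set(tokens))}
    for name, tokens in tokens_by_doc.items()
}
# scores["A"]["jupiter"] is (1/5) * ln(2), roughly 0.1386
# scores["A"]["is"] is 0.0 because "is" appears in both documents
```

Words shared by both texts ("is", "the", "planet") score zero, while words unique to one text ("jupiter", "largest", "mars", "fourth", "from", "sun") carry all of the weight.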

![image.png](attachment:image.png)