# 03. Text Representation

Machine learning models operate on numbers, not raw text, so text must first be converted into numerical vectors.

Techniques covered:
1. **One-Hot Encoding**: Simple binary representation.
2. **Bag of Words (BoW)**: Counting word frequencies.
3. **N-Grams**: Capturing context with sequences of words.
4. **TF-IDF**: Weighing words by importance.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample corpus
corpus = [
    "I love NLP.",
    "I love machine learning.",
    "NLP is the future of AI."
]
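
One-hot encoding is listed above but has no cell of its own, so here is a minimal sketch of the idea. It assumes plain whitespace tokenization and a made-up toy sentence (both are illustrative choices, not part of the original notebook): every vocabulary word gets its own column, and each token becomes a vector with a single 1.

```python
import numpy as np

# Hypothetical toy sentence; whitespace tokenization is an assumption.
tokens = "i love machine learning".split()

# Build a sorted vocabulary and map each word to a column index.
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}

# One row per token, one column per vocabulary word.
one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, index[word]] = 1

print(vocab)
print(one_hot)
```

Each row contains exactly one 1, so the vectors grow with the vocabulary size and carry no notion of similarity between words.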

## 1. Bag of Words (BoW)
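Represents each document as a vector of word counts over the corpus vocabulary, ignoring word order. With its default settings, `CountVectorizer` lowercases the text and drops single-character tokens, so "I" does not appear as a feature.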

In [None]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform
X = vectorizer.fit_transform(corpus)

# Convert to DataFrame for visualization
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print("Bag of Words Representation:")
display(bow_df)

## 2. N-Grams
An n-gram is a sequence of N consecutive words (a bigram is 2 words). Counting n-grams instead of single words retains some local word order and context.

In [None]:
# Bigrams (ngram_range=(2, 2))
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)

bigram_df = pd.DataFrame(X_bigram.toarray(), columns=bigram_vectorizer.get_feature_names_out())
print("Bigram Representation:")
display(bigram_df)

## 3. TF-IDF (Term Frequency - Inverse Document Frequency)
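Weights each term by how often it occurs in a document (TF), scaled down by how many documents contain it (IDF). Words that occur in most documents (like 'the', 'is') receive low weights, while terms concentrated in a few documents are highlighted.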

In [None]:
# Initialize TfidfVectorizer (by default, each row is L2-normalized)
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Representation:")
display(tfidf_df)
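
To see where these numbers come from, the sketch below recomputes the weights by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings: raw counts as TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I love NLP.",
    "I love machine learning.",
    "NLP is the future of AI.",
]

# Term frequencies: raw counts per document.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply TF by IDF, then L2-normalize each document row.
manual_tfidf = counts * idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print("Matches TfidfVectorizer:", np.allclose(manual_tfidf, reference))
```

Textbook definitions of IDF vary (for example `log(n / df)` without smoothing), so values from other sources may differ from scikit-learn's.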