# Numerical representation of text

### Summary

This section introduces text vectorization, a crucial step in preparing text data for machine learning algorithms. It covers two primary methods: the Bag of Words model and TF-IDF (Term Frequency-Inverse Document Frequency), highlighting their differences and applications.

### Highlights

- 🔢 Text vectorization converts text into a numerical format for machine learning.
- 📝 The Bag of Words model counts how often each word occurs, discarding word order and context.
- ⚖️ TF-IDF assesses word importance within documents, considering overall document frequency.
- 📊 Understanding these methods is essential for effective NLP.
- 📚 Bag of Words is simple and easy to understand.
- 📈 TF-IDF adds information that raw counts lack: how distinctive each word is across the whole collection of documents.
- 💡 The next lesson will cover the specific calculation of TF-IDF.
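
To make the idea of vectorization concrete, here is a minimal pure-Python sketch of a Bag of Words. The helper `bag_of_words` and the two toy sentences are illustrative assumptions, not part of the lesson's code. It also shows what "losing context" means in practice: two sentences with opposite meanings get identical vectors.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of strings.

    Hypothetical helper for illustration: it lowercases each document
    and splits on whitespace, which is simpler than a real tokenizer.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives every word a fixed column position
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["dog bites man", "man bites dog"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)  # ['bites', 'dog', 'man']
print(vectors)     # [[1, 1, 1], [1, 1, 1]]
```

Both sentences map to the same row because only the counts survive, not the order of the words.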

# Bag of Words model

### Summary

This section demonstrates how to create a Bag of Words model using `CountVectorizer` from the scikit-learn library (imported as `sklearn`). It outlines the process of initializing the vectorizer, fitting and transforming the text data, and displaying the resulting matrix as a pandas DataFrame.

### Highlights

- 🐍 Utilizing `CountVectorizer` from `sklearn.feature_extraction.text` for Bag of Words.
- 🐼 Employing pandas for data manipulation and DataFrame creation.
- 🔢 Transforming text into a matrix of token counts.
- 📊 Each row represents a document, and each column represents a word.
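- 0️⃣ 1️⃣ Each cell holds the number of times the word occurs in the document. With short sentences most cells are 0 or 1, but the values are counts, not a binary presence flag.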
- 📝 Simple implementation and easy interpretation of results.
- 📉 Loss of context is a limitation of the Bag of Words model.

### Code Examples

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
data = [
    "Most shark attacks occur about ten feet from the beach, since that is where the people are.",
    "The efficiency with which he paired the socks in the drawer was quite admirable.",
    # ... more sentences
]

# Initialize CountVectorizer
count_vec = CountVectorizer()

# Fit and transform the data
count_vec_fit = count_vec.fit_transform(data)

# Create a pandas DataFrame
bag_of_words = pd.DataFrame(count_vec_fit.toarray(), columns=count_vec.get_feature_names_out())

# Print the resulting Bag of Words
print(bag_of_words)
```

# TF-IDF

### Summary

This section explains TF-IDF (Term Frequency-Inverse Document Frequency) as a method for text vectorization, highlighting that it weights each word by how informative it is rather than by its raw count, as the Bag of Words model does. It details the calculation of term frequency and inverse document frequency and demonstrates how to implement TF-IDF vectorization using scikit-learn's `TfidfVectorizer`.

### Highlights

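- 📈 TF-IDF carries more information than the Bag of Words model because it also accounts for how common a word is across the collection.
- 🧮 Term Frequency (TF) measures how often a word occurs within a document.
- ⚖️ Inverse Document Frequency (IDF) down-weights words that appear in many documents and up-weights words that appear in few.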
- 💡 TF-IDF assigns higher scores to less common, more significant words.
- 🐍 Scikit-learn's `TfidfVectorizer` simplifies TF-IDF calculation.
- 📊 The output matrix reflects word importance with varying numerical values.
- 🧠 Down-weighting ubiquitous words gives a machine learning model more informative features to work with.

### Code Examples

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data (same as before)
data = [
    "Most shark attacks occur about ten feet from the beach, since that is where the people are.",
    "The efficiency with which he paired the socks in the drawer was quite admirable.",
    # ... more sentences
]

# Initialize TfidfVectorizer
tfidf_vec = TfidfVectorizer()

# Fit and transform the data
tfidf_vec_fit = tfidf_vec.fit_transform(data)

# Create a pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_vec_fit.toarray(), columns=tfidf_vec.get_feature_names_out())

# Print the resulting TF-IDF DataFrame
print(tfidf_df)
```
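
To show what happens inside the vectorizer, the sketch below computes TF-IDF by hand in pure Python. It assumes the smoothed variant that scikit-learn documents as its default: `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The helper `tfidf`, the whitespace tokenization, and the toy sentences are illustrative assumptions, so the numbers will not match `TfidfVectorizer` on arbitrary text.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, idf, rows) for a list of strings.

    Hypothetical helper: smoothed IDF plus L2-normalized rows.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights = [tf[t] * idf[t] for t in vocabulary]
        # L2 normalization so every row has unit length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, idf, rows


docs = ["the cat sat", "the dog barked"]
vocabulary, idf, rows = tfidf(docs)

# Weights for the first document
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "the" receives a lower weight than "cat" or "sat" because it appears in both documents, while absent words such as "dog" stay at zero.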