# Module: Text Representation and Vectorization

## Module Overview

This module introduces students to the fundamental techniques for converting text data into numerical representations suitable for machine learning and NLP applications. Students will explore the evolution from traditional sparse representations to modern dense embeddings, gaining both theoretical understanding and practical implementation skills through hands-on exercises with real-world examples.

### Module Objectives

By the end of this module, students will be able to:

1. **Understand Text Vectorization Fundamentals**: Grasp core concepts of transforming text into numerical representations and their importance in NLP
2. **Master Basic Vectorization Techniques**: Implement one-hot encoding, bag-of-words, and TF-IDF vectorization methods
3. **Apply Advanced Text Representation**: Use n-gram models and understand their impact on capturing text context
4. **Work with Word Embeddings**: Understand and implement Word2Vec (CBOW and Skip-Gram) models for dense text representations
5. **Leverage Contextual Embeddings**: Apply BERT for context-aware word representations and compare with static embeddings

### Module Components

#### Theoretical Foundation
- Evolution of text representation methods from sparse to dense vectors
- Mathematical foundations of vectorization techniques (TF-IDF, cosine similarity)
- Distributional vs. distributed representations in NLP
- Understanding the transformer architecture and self-attention mechanisms
- Contextual vs. static embeddings and their applications

#### Practical Skills
- Implementation of one-hot encoding and bag-of-words models
- TF-IDF vectorization with parameter tuning and optimization
- Training Word2Vec models using the Gensim library
- Working with pre-trained BERT models using the Hugging Face Transformers library
- Comparative analysis of different vectorization approaches
- Performance evaluation using similarity metrics

---

## Module Content

### Lecture Materials
- **[Text Representation: Basic Vectorization Approaches (PDF)]** - Comprehensive overview of traditional text vectorization methods
- **[Word Embeddings (PDF)]** - Deep dive into distributed representations and Word2Vec
- **[BERT (PDF)]** - Understanding contextual embeddings and transformer-based models

<div class="alert alert-block alert-info">

For practice notebooks 201-204, please use the "NLP" Container.

</div>

### Practical Sessions (practices/)
- **[One-Hot Encoding Basics](practices/201_text_processing_one_hot_encoding.ipynb)**
  - Understanding categorical data representation
  - Manual implementation vs. scikit-learn CountVectorizer
  - Advantages and limitations of sparse representations
  - Practical applications and use cases
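
As a rough preview of the manual approach, the sketch below one-hot encodes the tokens of a single sentence in plain Python. The helper names `build_vocabulary` and `one_hot` are illustrative assumptions and are not taken from the notebook.

```python
def build_vocabulary(tokens):
    """Map each distinct token to an index (sorted for reproducibility)."""
    return {token: index for index, token in enumerate(sorted(set(tokens)))}


def one_hot(token, vocabulary):
    """Return a vector with a single 1 at the token's index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[token]] = 1
    return vector


tokens = "the cat sat on the mat".split()
vocabulary = build_vocabulary(tokens)
encoded = [one_hot(token, vocabulary) for token in tokens]

for token, vector in zip(tokens, encoded):
    print(f"{token:>4} -> {vector}")
```

Each vector is as long as the vocabulary and any two distinct words are orthogonal, which is the main limitation of this sparse representation: it carries no notion of similarity between words.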

- **[Bag-of-Words and Text Processing](practices/202_text_processing_bag_of_words.ipynb)**
  - Document tokenization and vocabulary building
  - Implementing BoW with CountVectorizer
  - Stopword removal and frequency analysis
  - Working with repeating words and document matrices
  - Enhancement techniques for improved performance
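
The steps above (tokenization, vocabulary building, stopword removal, counting) can be sketched without any library as follows. This is a minimal illustration, assuming whitespace tokenization and a tiny hand-written stopword list; `bag_of_words` and `STOPWORDS` are hypothetical names, and the notebook itself uses `CountVectorizer`.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = frozenset({"the", "a", "on", "is"})


def tokenize(document):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return document.lower().split()


def bag_of_words(documents, stopwords=frozenset()):
    """Return (vocabulary, document-term count matrix) for the documents."""
    tokenized = [
        [token for token in tokenize(doc) if token not in stopwords]
        for doc in documents
    ]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # repeated words raise the count for that column
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


documents = [
    "the dog chased the cat",
    "the cat sat on the mat",
    "dog dog dog",
]
vocabulary, matrix = bag_of_words(documents, STOPWORDS)

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded: each row only records how often each vocabulary term occurs in the document.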

- **[TF-IDF Vectorization](practices/203_text_processing_tfidf.ipynb)**
  - Understanding Term Frequency and Inverse Document Frequency
  - Mathematical foundations and normalization techniques
  - Implementing TF-IDF with and without smoothing
  - N-gram integration and parameter optimization
  - Comparative analysis with basic BoW approaches
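
To make the smoothing distinction concrete, here is a small from-scratch sketch. It assumes the IDF variants documented for scikit-learn's `TfidfTransformer`, namely `ln((1 + n) / (1 + df)) + 1` with smoothing and `ln(n / df) + 1` without, followed by L2 normalization of each row. The function name `tfidf` and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(documents, smooth=True):
    """Return (vocabulary, idf weights, L2-normalized TF-IDF rows)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(token for doc in tokenized for token in set(doc))

    idf = {}
    for term in vocabulary:
        if smooth:
            # Acts as if one extra document contained every term.
            idf[term] = math.log((1 + n_docs) / (1 + df[term])) + 1
        else:
            idf[term] = math.log(n_docs / df[term]) + 1

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, idf, rows


documents = ["the cat sat", "the dog sat", "the dog barked"]
vocabulary, idf_smooth, rows_smooth = tfidf(documents, smooth=True)
_, idf_raw, _ = tfidf(documents, smooth=False)

for term in vocabulary:
    print(f"{term:>7}  smooth={idf_smooth[term]:.3f}  raw={idf_raw[term]:.3f}")
```

A term that appears in every document (here `the`) gets the lowest IDF, while rarer terms such as `cat` are weighted up.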

- **[Word Embeddings with Word2Vec](practices/204_text_processing_word_embeddings.ipynb)**
  - CBOW vs. Skip-Gram model architectures
  - Training Word2Vec models on custom datasets
  - Semantic similarity analysis using cosine similarity
  - Vector arithmetic and analogy tasks
  - Evaluation and interpretation of embedding quality
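
The similarity and analogy ideas can be illustrated without training a model. The sketch below uses hand-picked two-dimensional toy vectors, which are an assumption made for readability and not real Word2Vec output; in the notebook the vectors come from a trained Gensim model instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (NOT trained): axis 0 loosely encodes "royalty", axis 1 "gender".
embeddings = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? using vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, as is common in analogy evaluation.
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine_similarity(vectors[word], target))


result = analogy("man", "king", "woman", embeddings)
print(result)
print(round(cosine_similarity(embeddings["king"], embeddings["queen"]), 3))
```

With real embeddings the same arithmetic is only approximately right, so analogy results should be treated as a qualitative check rather than a guarantee.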

- **[BERT Contextual Embeddings](practices/205_text_processing_bert.ipynb)**
  - Understanding bidirectional context processing
  - Working with pre-trained BERT models
  - Tokenization with WordPiece and special tokens
  - Contextual vs. static embedding comparisons
  - Practical applications in text understanding tasks
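
To show what WordPiece tokenization and the special tokens look like, the following sketch implements greedy longest-match-first subword splitting over a toy vocabulary. The vocabulary `TOY_VOCAB` and the helpers `wordpiece_tokenize` and `encode` are assumptions for illustration; a real BERT tokenizer uses a vocabulary of roughly 30,000 entries and also handles punctuation and other normalization not covered here.

```python
def wordpiece_tokenize(word, vocabulary, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocabulary:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(sentence, vocabulary):
    """Lowercase, split on whitespace, and wrap pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocabulary))
    tokens.append("[SEP]")
    return tokens


# Toy vocabulary chosen for this example only.
TOY_VOCAB = {
    "[CLS]", "[SEP]", "[UNK]",
    "play", "##ing", "##ed", "un", "##believ", "##able", "the", "game",
}

tokens = encode("Playing the unbelievable game", TOY_VOCAB)
print(tokens)
```

Because rare words are broken into known pieces, the model rarely has to fall back to `[UNK]`, and each piece still receives its own context-dependent vector.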

---

## Assignments

### Assignment 1: Text Processing with Encoding Techniques
**File:** [Assignment_201.ipynb](assignments/Assignment_201.ipynb)  
**Points:** 10  
**Focus:** Implementing and comparing basic vectorization methods (one-hot encoding, bag-of-words, and TF-IDF)

**Overview:** This assignment explores three fundamental encoding techniques for NLP by implementing them on real text data and analyzing their differences. Students will work with manual implementations and scikit-learn tools to understand the progression from sparse to weighted representations.

**Learning Outcomes:** Understanding of basic text vectorization, hands-on experience with scikit-learn, and critical analysis of method trade-offs.

### Assignment 2: Advanced Text Embeddings Analysis
**File:** [Assignment_202.ipynb](assignments/Assignment_202.ipynb)  
**Points:** 10  
**Focus:** Training Word2Vec models and implementing BERT for contextual embeddings, with semantic relationship analysis

**Overview:** Building on basic vectorization techniques, this assignment explores advanced embedding methods by comparing static Word2Vec embeddings with contextual BERT representations. Students will work with domain-specific text to understand how context affects semantic capture.

**Learning Outcomes:** Practical experience with advanced embeddings, understanding of contextual vs. static representations, and awareness of ethical considerations in NLP.

---

## Learning Path

### Beginner Level
1. Start with **Basic Vectorization Approaches (PDF)** to understand foundational concepts
2. Practice **One-Hot Encoding** for simple categorical representation
3. Work through **Bag-of-Words** for document-level vectorization

### Intermediate Level
4. Master **TF-IDF Vectorization** for weighted term importance
5. Explore **Word Embeddings (PDF)** to understand distributed representations
6. Implement **Word2Vec** models for semantic similarity tasks

### Advanced Level
7. Study **BERT (PDF)** for contextual embedding theory
8. Practice **BERT implementation** for context-aware representations
9. Compare all methods for comprehensive understanding and appropriate use case selection

---

## Prerequisites

### Technical Requirements
- Solid understanding of Python programming and data structures
- Familiarity with NumPy arrays and basic linear algebra operations
- Basic knowledge of machine learning concepts (vectors, similarity measures)
- Understanding of basic NLP preprocessing from Module 1

### Libraries to Install (Local Machines Only)

```bash
# Core NLP and ML libraries
pip install nltk spacy gensim

# Transformers and deep learning
pip install transformers torch

# Traditional ML and vectorization
pip install scikit-learn pandas numpy

# Data visualization
pip install matplotlib seaborn

# Download language models
python -m spacy download en_core_web_sm
```
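
After installing, a quick check like the one below can confirm which packages are visible to your Python environment. It is a minimal sketch that relies only on the standard-library `importlib.metadata` module (Python 3.8+); the `REQUIRED` list mirrors the `pip install` commands above, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as used with pip in the commands above.
REQUIRED = [
    "nltk", "spacy", "gensim", "transformers", "torch",
    "scikit-learn", "pandas", "numpy", "matplotlib", "seaborn",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:15s} {version or 'MISSING'}")
```

Any package reported as `MISSING` can be installed with the corresponding `pip install` command.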

---

## Recommended Background
- Completion of Module 1 (NLP Pipeline and Preprocessing)
- Basic understanding of linear algebra and vector operations
- Familiarity with machine learning concepts (training, evaluation)
- Knowledge of Python libraries like pandas and numpy

---

## Additional Resources

### Documentation and References
- [Scikit-learn Feature Extraction Documentation](https://scikit-learn.org/stable/modules/feature_extraction.html)
- [Gensim Word2Vec Documentation](https://radimrehurek.com/gensim/models/word2vec.html)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [BERT Paper - Original Research](https://arxiv.org/abs/1810.04805)

### Recommended Reading
- Vajjala, Sowmya, et al. *Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems*. O'Reilly Media, 2020. (Chapters 3-4)
- Tunstall, Lewis, et al. *Natural Language Processing with Transformers*. O'Reilly Media, 2022. (Chapters 1-3)
- Jurafsky, Daniel, and James H. Martin. *Speech and Language Processing*. 3rd edition. (Chapters 6-7)

### Online Resources
- [Word2Vec Tutorial](https://www.tensorflow.org/tutorials/text/word2vec)
- [Understanding BERT](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
- [TF-IDF from Scratch](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)

---

## Getting Started

1. **Review the theoretical foundations** with the provided PDF slides
2. **Set up your environment** with the required libraries listed above
3. **Work through the notebooks sequentially** - each builds upon previous concepts
4. **Complete practical exercises** in each notebook to reinforce learning
5. **Experiment with different parameters** to understand their impact on results
6. **Apply techniques to your own text data** to see real-world applications

### Support and Questions
- Review the comprehensive examples and explanations in each notebook
- Refer to the documentation links for detailed API references and advanced usage
- Practice with different text datasets to understand method strengths and limitations
- Experiment with hyperparameter tuning to optimize performance for specific tasks
- Consider the computational trade-offs between different approaches for your use cases

### Key Success Metrics
- Ability to select an appropriate vectorization method based on task requirements
- Understanding of when to use sparse vs. dense representations
- Competency in implementing and evaluating different text representation techniques
- Knowledge of modern contextual embeddings and their advantages over traditional methods