# NLP Workflow Summary: Sentiment Analysis Example

---

## Introduction
- The video continues the discussion of NLP topics such as stemming, lemmatization, and stopwords, using NLTK and Python.
- The focus is on revising these concepts through a **sentiment analysis** problem statement.

---

## Key NLP Concepts and Pipeline

### 1. Text and Corpus
- **Documents (d1, d2, d3, d4)**: Individual units of text, such as single sentences or paragraphs.
- **Corpus**: The full collection of documents (sentences or paragraphs).
- **Vocabulary**: Set of unique words in the corpus.
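
As a minimal sketch of these three terms, the snippet below builds a vocabulary from a four-document corpus. The sentences are made up for illustration, and splitting on whitespace stands in for proper tokenization.

```python
# Four toy documents (d1..d4); the sentences are invented for illustration.
corpus = [
    "the food is good",     # d1
    "the food is bad",      # d2
    "pizza is amazing",     # d3
    "the service is bad",   # d4
]

# The vocabulary is the set of unique words across every document.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

print(len(corpus))       # number of documents in the corpus
print(len(vocabulary))   # number of unique words
print(vocabulary)
```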

### 2. NLP Project Lifecycle for Sentiment Analysis
- **Dataset**: Start with a dataset containing text data.
- **Text Pre-processing**:
  - **Part 1**:
    - **Tokenization**: Split text into sentences or words.
    - **Lowercasing**: Convert all text to lowercase to unify words like "Though" and "though".
    - **Regular Expressions (Regex)**: Clean text by removing special characters or unwanted patterns.
  - **Part 2**:
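    - **Stemming & Lemmatization**: Reduce words to base/root forms. Stemming strips suffixes by rule and may produce non-words; lemmatization maps each word to a valid dictionary form.
    - **Stopwords Removal**: Remove very common words (such as "the" or "is") that add little meaning.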
    
  These steps clean and normalize the raw text.
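
The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `STOPWORDS` is a hypothetical mini list and `toy_stem` is a crude stand-in for a real stemmer; in practice NLTK's tokenizers, stopword list, and stemmers or lemmatizer would take their place.

```python
import re

# Hypothetical mini stopword list; NLTK ships a much fuller English list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "it", "was", "i"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.split()                                   # tokenization (whitespace only)
    tokens = [t.lower() for t in tokens]                    # lowercasing
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]     # regex cleaning: keep letters
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # stopword removal
    return [toy_stem(t) for t in tokens]                    # stemming

print(preprocess("The movie was AMAZING, and I loved the acting!!!"))
# ['movie', 'amaz', 'lov', 'act']
```

Note that the stemmed output contains non-words such as `amaz`; this is typical of stemming and is one reason lemmatization is sometimes preferred.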

### 3. Text to Vectors (Numerical Representation)
- After cleaning, text is converted into numerical vectors for ML models.
- Techniques include:
  - **One-Hot Encoding**: Simple, sparse vector; rarely used now for text.
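  - **Bag of Words (BoW)**: Represents each document by how often each vocabulary word occurs in it.
  - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weights each word by its frequency within a document, scaled down for words that appear in many documents.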
  - **Word2Vec**: Generates dense word embeddings capturing semantic meaning.
  - **Average Word2Vec**: Averages embeddings for document-level representation.
- These techniques are the foundation to understand before moving on to deep learning methods such as **transformers** (for example, **BERT**).
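
To make BoW and TF-IDF concrete, here is a small sketch in plain Python. The documents are made up, and the TF-IDF formula shown (term frequency times `log(N / df)`) is one common unsmoothed variant; libraries such as scikit-learn use a smoothed formula, so their numbers will differ.

```python
import math
from collections import Counter

# Toy pre-processed documents (already tokenized); contents are invented.
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "acting"],
]

# Fixed word order so every document maps to a vector of the same length.
vocab = sorted({w for d in docs for w in d})  # ['acting', 'bad', 'good', 'movie']

def bag_of_words(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tf_idf(doc):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc)
    vector = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for d in docs if w in d)
        vector.append(tf * math.log(len(docs) / df))
    return vector

bow = [bag_of_words(d) for d in docs]
tfidf = [tf_idf(d) for d in docs]

print(bow)    # [[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 2, 0]]
print(tfidf)
```

In this toy corpus, "bad" and "acting" each appear in only one document, so they receive a higher IDF than "good" or "movie", which appear in two.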

### 4. Model Training
- Use the numerical vectors to train **machine learning algorithms** for classification tasks like sentiment analysis.
- After training, evaluate the model by its prediction accuracy, ideally on data it did not see during training.
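
As a rough sketch of training and evaluation, the snippet below fits a tiny multinomial Naive Bayes classifier with add-one smoothing and measures accuracy on two held-out examples. Naive Bayes is chosen here only as an example; the notes do not name a specific algorithm. All data is made up, and the per-class word counts used here are equivalent to summing Bag of Words vectors per class.

```python
import math
from collections import Counter, defaultdict

# Invented labelled examples (tokens, label); 1 = positive, 0 = negative.
train = [
    (["good", "movie"], 1),
    (["great", "acting"], 1),
    (["bad", "movie"], 0),
    (["terrible", "acting"], 0),
]
held_out = [(["good", "acting"], 1), (["bad", "acting"], 0)]

def fit(examples):
    """Collect class counts and per-class word counts."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in examples for w in tokens}
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the class with the highest log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)                        # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            if w in vocab:                                 # skip unseen words
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
predictions = [predict(model, tokens) for tokens, _ in held_out]
accuracy = sum(p == y for p, (_, y) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

With a real dataset, a library implementation and a proper train/test split would replace this hand-rolled version.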

---

## Additional Notes
- The **Gensim** library is recommended for Word2Vec; it supports both loading pre-trained models and training custom ones.
- Word embeddings such as Word2Vec are learned with a shallow neural network, so that words appearing in similar contexts receive similar vectors.
- Understanding vectorization techniques is critical before moving to advanced NLP methods.
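
The Average Word2Vec idea can be sketched without any library, assuming a lookup table from words to vectors is already available. The `embeddings` values below are hypothetical placeholders, not real Word2Vec output; with Gensim, the lookup would instead come from a trained model's `wv` attribute.

```python
# Hypothetical 3-dimensional embeddings; real Word2Vec vectors are learned
# from data and typically have a hundred or more dimensions.
embeddings = {
    "good":   [0.9, 0.1, 0.0],
    "movie":  [0.1, 0.8, 0.3],
    "acting": [0.3, 0.6, 0.9],
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim          # no known words: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# "unknownword" has no embedding, so only "good" and "movie" are averaged.
doc_vector = average_vector(["good", "movie", "unknownword"], embeddings)
print(doc_vector)
```

The result is one fixed-length vector per document, which can then be passed to a classifier in the same way as a BoW or TF-IDF vector.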

---

## Summary of Workflow

```plaintext
Dataset
   ↓
Text Pre-processing
   ├─ Tokenization
   ├─ Lowercasing
   ├─ Regex Cleaning
   ├─ Stemming/Lemmatization
   └─ Stopwords Removal
   ↓
Text to Vectors (One-Hot, BoW, TF-IDF, Word2Vec)
   ↓
Train ML Model (e.g. classification for sentiment analysis)
   ↓
Prediction & Evaluation
```