# Early Approaches in NLP: Counting Words
Before neural networks became popular, NLP (Natural Language Processing) relied on simpler techniques that focused on counting words. Two main methods were:

## Count Vectors
- Imagine you have a document, and you simply count how often each word appears.
- **Example**: In the sentence "I like cats and I like dogs", the word "like" appears 2 times, "I" appears 2 times, and the rest appear once.
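
The counting step above can be sketched in a few lines of Python. This is a minimal illustration, assuming simple whitespace tokenization; the variable names are chosen here for illustration only.

```python
from collections import Counter

sentence = "I like cats and I like dogs"

# Split on whitespace; a real tokenizer would also handle punctuation and case.
tokens = sentence.split()
counts = Counter(tokens)

# Fix a vocabulary order so the same position always refers to the same word.
vocabulary = sorted(counts)
count_vector = [counts[word] for word in vocabulary]

print(vocabulary)    # ['I', 'and', 'cats', 'dogs', 'like']
print(count_vector)  # [2, 1, 1, 1, 2]
```

The list `count_vector` is the count vector for this one sentence. With several documents, each document would get its own vector over a shared vocabulary.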

## TF-IDF (Term Frequency-Inverse Document Frequency)
- TF-IDF goes one step beyond raw counts by giving each word a weight.
- The weight combines two factors:
  - **Term frequency (TF)**: words that appear often in a document are important for that document.
  - **Inverse document frequency (IDF)**: words that appear in many documents (like "the" or "is") are down-weighted, because they say little about any one document.
- **Example**: If the word "panda" appears frequently in one document but rarely in others, TF-IDF will assign it high importance for that specific document.
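
To make the weighting concrete, here is a small sketch that computes TF-IDF by hand. It assumes one common textbook variant (TF as relative frequency, IDF as the natural log of total documents over documents containing the term); libraries often use smoothed variants, so their numbers can differ. The three documents are made up for illustration.

```python
import math

# Three tiny hypothetical documents, already tokenized.
documents = [
    ["the", "panda", "eats", "bamboo", "the", "panda", "sleeps"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "eats"],
]

def tf(term, doc):
    """Relative frequency of `term` inside one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("panda", documents[0], documents))  # about 0.31
print(tf_idf("the", documents[0], documents))    # 0.0, "the" is in every document
```

Both "panda" and "the" appear twice in the first document, yet only "panda" gets a non-zero score, because "the" occurs in every document.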

These methods were good at categorizing and searching through text, but they had a problem:
- They could not capture the meaning of words or the relationships between them. For example, "happy" and "joyful" are similar in meaning, but these methods give each word its own independent count, so the two are treated as completely different words.

# Neural Networks Enter the Scene
To address these limitations, researchers began applying neural networks (NNs) to NLP. A major breakthrough came in 2003, when Yoshua Bengio and his colleagues introduced a neural probabilistic language model, often called the Neural Network Language Model (NNLM).

## How NNLM Worked
- It predicted the next word in a sentence by looking at the previous words.
  - **Example**: If the input is "I like", the NNLM would try to predict that the next word could be "cats" or "dogs".

## Word Embeddings
- The NNLM did not just treat words as isolated counts. It learned **word embeddings**: compact vectors of numbers that capture aspects of what each word means.
  - **Example**: Words like "happy" and "joyful" would have similar word embeddings, showing they are related in meaning.
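
The two ideas above (look up an embedding for each previous word, then predict the next word) can be sketched as a single forward pass. This is a simplified, untrained layout in the spirit of the NNLM, not a reproduction of the original model; the vocabulary, layer sizes, and parameter names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["I", "like", "cats", "dogs", "and"]
word_to_id = {w: i for i, w in enumerate(vocab)}

embed_dim, context_size, hidden_dim = 4, 2, 8

# Randomly initialised parameters; training would adjust all of them.
E = rng.normal(size=(len(vocab), embed_dim))                   # embedding table
W_h = rng.normal(size=(context_size * embed_dim, hidden_dim))  # hidden layer
b_h = np.zeros(hidden_dim)
W_o = rng.normal(size=(hidden_dim, len(vocab)))                # output layer
b_o = np.zeros(len(vocab))

def next_word_probs(context):
    """Return a probability for every vocabulary word given the context words."""
    ids = [word_to_id[w] for w in context]
    x = E[ids].reshape(-1)               # look up and concatenate the embeddings
    h = np.tanh(x @ W_h + b_h)           # hidden representation of the context
    logits = h @ W_o + b_o               # one score per vocabulary word
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = next_word_probs(["I", "like"])
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Because the weights are random, the printed probabilities are meaningless until the model is trained. The point is the data flow: the embedding table `E` is just another set of parameters, so the word vectors are learned together with the prediction task.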

## Limitations of the NNLM
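- It only looked at a fixed number of previous words, so it could not use context from earlier in a long sentence.
- It was expensive to train with large vocabularies (lots of words), because every prediction required a score for each word in the vocabulary.
- It was not perfect, but it was a major improvement over counting words because its embeddings captured similarity in meaning.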

# Summary: From Counting to Understanding
- **Old methods** (Count Vectors, TF-IDF) were good at counting words but couldn’t understand meaning or context.
- **Neural Networks** (like NNLM) introduced word embeddings that helped capture semantic relationships (like "happy" and "joyful" being similar).
- Though the NNLM had some problems with long texts and large vocabularies, it opened the door to better neural models in the future.

# What Are Word Embeddings?
Think of word embeddings as a way to represent words using numbers so that a computer can understand their meaning and relationships.

## Imagine a World Without Embeddings
- Before embeddings, computers saw words as just plain text with no connection to each other.
- **Example**:
  - "dog" and "cat" were treated like completely different words, even though they both refer to animals.
  - Similarly, "happy" and "joyful" had no relation, even though they mean almost the same thing.

## Enter Word Embeddings
- A word embedding is a set of numbers (a vector) that represents a word in a way that captures its meaning and relationships with other words.
  - **Example**:
    - "dog" might be represented as [0.5, 0.8, -0.2].
    - "cat" might be [0.6, 0.7, -0.1].
    - These two vectors are close to each other, meaning the computer understands that "dog" and "cat" are related.

## How Word Embeddings Work in Practice
- The key idea is that the closer two word embeddings are in the vector space, the more similar the words tend to be in meaning.
  - **Example**:
    - "happy" might have the vector [0.3, 0.9, -0.4].
    - "joyful" might have [0.3, 0.8, -0.3].
    - These vectors are close, meaning the computer knows they are similar in meaning.
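
One common way to measure how "close" two embeddings are is cosine similarity, which compares the directions of two vectors. The sketch below uses the example vectors from above; the `unrelated` vector is a made-up addition for contrast.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

happy = [0.3, 0.9, -0.4]
joyful = [0.3, 0.8, -0.3]
unrelated = [-0.7, 0.1, 0.6]  # hypothetical vector for an unrelated word

print(cosine_similarity(happy, joyful))     # close to 1
print(cosine_similarity(happy, unrelated))  # negative
```

Euclidean distance would also work for this toy case. Cosine similarity is often preferred for embeddings because it ignores vector length and looks only at direction.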

## Analogy: Coordinates on a Map
- Think of word embeddings as placing words on a map:
  - Words like "king" and "queen" will be close to each other.
  - Words like "apple" and "banana" will also be close to each other, but far from "king" and "queen" because they belong to different categories.

# Distributed Representations: Making Word Meanings Understandable for Computers
In natural language processing (NLP), distributed representations are a smart way to represent words so computers can understand their meaning and relationships.
- Instead of treating words as isolated symbols, we use **vectors** (lists of numbers) to describe them. Each number contributes a little to describing the word, so the meaning is spread (distributed) across the whole vector.

# Two Key Techniques for Creating Word Vectors
## Word2Vec
- **How it works**: Word2Vec trains a small neural network on a prediction task: given a word, predict the words that appear around it (the skip-gram variant), or given the surrounding words, predict the word in the middle (the CBOW variant).
  - **Example**: If your sentence is “The cat is sleeping,” the skip-gram variant will try to predict the surrounding words for “cat”.
  - Solving this task over a large text pushes words that appear in similar contexts toward similar vectors.
- **Result**: Each word gets a vector (a set of numbers) that reflects how it is used in context. Words with similar meanings will have vectors that are close together.
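
The training data for the "predict the surrounding words" task can be sketched as a list of (center word, context word) pairs. The helper below is a hypothetical illustration of how such pairs are generated, assuming a window of one word on each side; it does not train any vectors.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat is sleeping".split()
pairs = skipgram_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

For "cat" this produces the pairs ("cat", "the") and ("cat", "is"). A larger `window` would also pair "cat" with "sleeping".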

## GloVe (Global Vectors)
- **How it works**: GloVe takes a different approach by analyzing how often words appear together in a large collection of text (called co-occurrence).
  - **Example**: If it notices that “coffee” often appears with “hot”, and “ice cream” with “cold”, it will create vectors that reflect these relationships.
    - Coffee: [1, 0]
    - Hot: [0.9, 0]
    - Ice Cream: [0, 1]
    - Cold: [0, 0.9]
- **Result**: The vectors for “coffee” and “hot” will be close together, and similarly for “ice cream” and “cold.” This reflects the fact that they are related concepts.
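
The co-occurrence statistics that GloVe starts from can be sketched as simple pair counts. The snippet below only builds the counts, on a made-up three-sentence corpus and with an assumed window of two words; GloVe itself then fits vectors to these counts, which is not shown here.

```python
from collections import Counter

# Tiny hypothetical corpus.
corpus = [
    "hot coffee is good",
    "i drink hot coffee",
    "cold ice cream is good",
]

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                pair = tuple(sorted((word, tokens[j])))  # order-independent key
                counts[pair] += 1
    return counts

counts = cooccurrence_counts(corpus)
print(counts[("coffee", "hot")])   # 2
print(counts[("coffee", "cold")])  # 0
```

Pairs that co-occur often, such as "coffee" and "hot", are the ones whose vectors end up close together after fitting.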

# What is Transfer Learning in NLP?
Transfer learning allows a model to re-use knowledge it gained from one task and apply it to another, saving time and effort.

- **Example**:
  - Let’s say we train a model to understand sentiments (positive or negative) in movie reviews.
  - It learns that words like "amazing" are positive and "terrible" are negative.
  - Now, we want to apply this to product reviews. Instead of training a new model from scratch, we re-use the existing knowledge.
  - The model can now understand product-related sentiments faster, even with less data.

# How Transfer Learning Works in NLP
- **Pre-trained Models**:
  - Models like GloVe create word vectors by analyzing huge collections of text. These vectors capture the meaning and relationships between words.
  - This base knowledge can be re-used for other tasks without starting from scratch.
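
Re-using pre-trained vectors for a new task can be sketched as follows. Everything here is a toy assumption: the `pretrained` dictionary stands in for vectors that would normally be loaded from a published embedding file, and a nearest-centroid rule stands in for a real classifier.

```python
import numpy as np

# Hypothetical "pre-trained" vectors, written by hand for illustration.
pretrained = {
    "amazing":  np.array([0.9, 0.1]),
    "great":    np.array([0.8, 0.2]),
    "terrible": np.array([-0.9, 0.1]),
    "awful":    np.array([-0.8, 0.3]),
    "phone":    np.array([0.0, 0.9]),
    "battery":  np.array([0.1, 0.8]),
}

def featurize(text):
    """Average the frozen vectors of known words; unknown words are skipped."""
    vectors = [pretrained[w] for w in text.lower().split() if w in pretrained]
    return np.mean(vectors, axis=0) if vectors else np.zeros(2)

# A very small labelled set for the new task (1 = positive, 0 = negative).
train = [("amazing phone", 1), ("terrible battery", 0)]
X = np.array([featurize(text) for text, _ in train])
y = np.array([label for _, label in train])

# Only these two centroids are learned on the new task; the vectors stay frozen.
centroids = {label: X[y == label].mean(axis=0) for label in (0, 1)}

def predict(text):
    x = featurize(text)
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

print(predict("great battery"))  # 1
print(predict("awful phone"))    # 0
```

The words "great" and "awful" never appear in the two training examples, yet the predictions come out right, because the re-used vectors already place them near "amazing" and "terrible".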

# How NNs Improved NLP with Transfer Learning
- **Automatic Learning of Patterns**:
  - Neural networks don’t need hand-made rules. They can learn complex patterns (like context and relationships) directly from data.
- **Richer Representations with NNs**:
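  - Deeper neural models (such as LSTM-based or Transformer-based models) produce contextual word vectors: the vector for a word changes with the sentence around it, which the fixed vectors from GloVe or Word2Vec cannot do.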
- **Examples of NN Architectures**:
  - CNNs (Convolutional Neural Networks): Great at tasks like text classification.
  - RNNs (Recurrent Neural Networks): Useful for sequences, like predicting the next word in a sentence.
  - Transformers (like BERT and GPT): A game-changer in NLP because self-attention lets each word draw directly on other words in the input, however far apart they are, so context is captured better than in earlier models.

# Summary
- **Old Methods** (Count Vectors, TF-IDF) were good at counting words but couldn’t understand meaning.
- **Word Embeddings** (Word2Vec, GloVe) made it possible for computers to understand relationships between words.
- **Neural Networks & Transfer Learning** brought richer ways to understand language, allowing models to apply knowledge across different tasks.


# Language Modeling with RNNs: Making Sense of Sequences

Think of RNNs (Recurrent Neural Networks) as a way for computers to process sequences like text. Unlike plain feed-forward networks, RNNs carry information from earlier in a sentence forward to make sense of what comes next.

## How RNNs Work

### Sequence Processing:
- RNNs read text one word at a time, passing information from one step to the next. This way, each word in a sentence influences the understanding of the next word.
- **Example:** In the phrase "I love", an RNN can predict that the next word might be "chocolate" or "music" based on what it has already read.

### Internal Memory:
- RNNs have an internal state that helps them remember what came earlier in the sequence, like how we remember the subject of a sentence to make sense of its ending.
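
The step-by-step processing and the internal state can be sketched with a plain (Elman-style) RNN in NumPy. The weights are random and untrained, and the sizes and names are assumptions for illustration; each input row stands in for one word embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4

# Random, untrained parameters.
W_xh = rng.normal(scale=0.5, size=(input_dim, hidden_dim))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    """Process the sequence one step at a time, carrying the state h forward."""
    h = np.zeros(hidden_dim)  # empty memory before the first word
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

sequence = rng.normal(size=(5, input_dim))  # five stand-in "word" vectors
states = rnn_forward(sequence)
print(states.shape)  # (5, 4)
```

The loop is the reason RNNs are inherently sequential: step `t` cannot start until step `t - 1` has produced its state.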

## The Problem with RNNs

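Standard RNNs have trouble retaining information over long sequences. For example:
- In a long sentence like "The cat, which already ate a lot, was not hungry," a standard RNN may have lost track of "The cat" by the time it reaches "was not hungry".
- The underlying cause is the **vanishing gradient problem**: during training, the error signal is repeatedly multiplied by factors that are often smaller than one as it flows backward through the time steps, so it shrinks toward zero and the network fails to learn dependencies between distant words.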
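
The shrinking effect behind the vanishing gradient problem can be seen numerically in a one-unit RNN. This is a toy sketch, assuming the recurrence `h_t = tanh(w * h_{t-1} + x_t)` with a recurrent weight `w` below one; the function name and inputs are illustrative.

```python
import numpy as np

def gradient_through_time(w, inputs):
    """Track |d h_t / d h_0| for h_t = tanh(w * h_{t-1} + x_t) at every step t."""
    h, grad, history = 0.0, 1.0, []
    for x in inputs:
        h = np.tanh(w * h + x)
        # Chain rule: each step multiplies by w times the derivative of tanh.
        grad *= w * (1.0 - h ** 2)
        history.append(abs(grad))
    return history

rng = np.random.default_rng(0)
history = gradient_through_time(w=0.9, inputs=rng.normal(size=50))

print(history[0], history[9], history[49])
```

Each factor has magnitude at most 0.9 here, so after 50 steps the gradient can be no larger than 0.9 to the power 50, roughly 0.005. The first word then has almost no influence on how the weights are updated.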

## LSTMs: A Better RNN for Long Sequences

To solve this, LSTMs (Long Short-Term Memory Networks) were introduced.

### How LSTMs Work:
- LSTMs use a system of **gates** to decide:
  - What information to keep.
  - What information to forget.
  - What new information to add.
- These gates help LSTMs maintain important information over long sequences.
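
The three decisions listed above map onto the forget, input, and output gates of a standard LSTM cell. The sketch below implements one step of that standard formulation in NumPy, with random untrained weights; packing all four weight blocks into a single matrix `W` is a layout choice made here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (input_dim + hidden_dim, 4 * hidden_dim)."""
    n = h_prev.shape[0]
    z = np.concatenate([x, h_prev]) @ W + b
    f = sigmoid(z[0 * n:1 * n])  # forget gate: what to drop from memory
    i = sigmoid(z[1 * n:2 * n])  # input gate: how much new information to add
    o = sigmoid(z[2 * n:3 * n])  # output gate: what to expose as the new state
    g = np.tanh(z[3 * n:4 * n])  # candidate new information
    c = f * c_prev + i * g       # updated long-term memory (cell state)
    h = o * np.tanh(c)           # updated short-term state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.5, size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(6, input_dim)):  # six stand-in "word" vectors
    h, c = lstm_step(x, h, c, W, b)

print(h.shape, c.shape)  # (4,) (4,)
```

The cell state `c` is updated additively (`f * c_prev + i * g`). When the forget gate stays close to one, old information passes through largely unchanged, which is what helps the LSTM keep it over long sequences.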

### Example of LSTM in Action:
- In the sentence "The cat, which already ate a lot, was not hungry," the LSTM remembers:
  - The subject "The cat" from the start.
  - The recent information that the cat ate a lot, helping it predict the next word "hungry".
- LSTMs excel at both short-term and long-term memory, making them great for language modeling and text classification tasks.

## Rise of CNNs in NLP

While RNNs are good at understanding sequences, they can be slow because they process text one step at a time. **CNNs (Convolutional Neural Networks)**, originally designed for images, offer a faster alternative for NLP.

### How CNNs Work in NLP:
- CNNs use **filters** to scan through text, identifying patterns in n-grams (short word sequences).
- **Example:** A CNN can identify phrases like "great service" or "terrible experience".
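
Scanning text with filters can be sketched as a one-dimensional convolution over word embeddings, followed by max pooling. The embeddings and filters below are random and untrained, and the sentence, sizes, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = "the food was great service was slow".split()
embed_dim, n_filters, width = 4, 3, 2  # width=2: each filter looks at bigrams

# Hypothetical random embeddings and filters; training would learn both.
embeddings = {w: rng.normal(size=embed_dim) for w in dict.fromkeys(tokens)}
filters = rng.normal(size=(n_filters, width, embed_dim))

X = np.stack([embeddings[w] for w in tokens])  # (seq_len, embed_dim)

# Collect every window of `width` consecutive words.
windows = np.stack([X[i:i + width] for i in range(len(tokens) - width + 1)])

# Apply every filter to every window (a dot product per pair), then ReLU.
feature_map = np.einsum("nwd,fwd->nf", windows, filters)  # (n_windows, n_filters)
feature_map = np.maximum(feature_map, 0.0)

# Max-over-time pooling keeps each filter's strongest response.
pooled = feature_map.max(axis=0)  # (n_filters,)
print(feature_map.shape, pooled.shape)  # (6, 3) (3,)
```

All windows are processed in a single array operation, with no dependence between positions. After training, a filter that responds strongly to a bigram such as "great service" would show up as a large value in `pooled`.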

### Feature Extraction in CNNs:
- **Lower Layers:** Tend to detect simple word patterns like common phrases.
- **Middle Layers:** Can combine these patterns into larger ones, such as negations (e.g., "not good").
- **Higher Layers:** Can capture more abstract signals like the overall sentiment of a review.

### Advantages of CNNs:
- **Parallel Processing:** CNNs process text in parallel (faster than RNNs, which process word by word).
- **Hierarchical Features:** They extract features at different levels, improving the understanding of text.

## Limitations of CNNs and the Rise of Self-Attention

While CNNs are fast, each filter only sees **local information** (a small window of nearby words), so relationships between distant words in a long sequence are easily missed. This led to the development of **attention-augmented networks**, which combine CNNs with self-attention to capture relationships across the entire input.
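
The self-attention part can be sketched on its own as scaled dot-product attention. This snippet shows only the attention step, not a full attention-augmented CNN; the projection matrices are random and untrained, and the sizes and names are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of every position pair
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 4
X = rng.normal(size=(seq_len, d_model))      # six stand-in "word" vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)  # (6, 4) (6, 6)
```

The `weights` matrix has one row per position and one column per position, so the first word can attend to the last word in a single step. A convolution filter with a small window cannot do that.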