### Deep Learning-Based Language Models

Modern deep learning-based language models can learn from much larger datasets and capture more complex language structures than traditional statistical and probabilistic models. These models have revolutionized natural language processing (NLP) applications.

!["deep-learning-based-language-models"](../images/4/4-deep-learning-based-language-models.png)
<br>
<br>

---

#### 1. Word Embeddings

Word embeddings are techniques that transform words into fixed-dimensional dense vectors. These vectors mathematically represent the semantic similarities and relationships between words.
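
As a rough illustration of how vectors can encode similarity, the sketch below compares words with cosine similarity. The three-dimensional vectors are made-up values chosen for the example, not output from Word2Vec, GloVe, or FastText; real embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings (illustrative values only).
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])

print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these toy values, "king" scores closer to "queen" than to "apple", which is the kind of relationship a trained embedding model is expected to capture.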

##### Important Word Embedding Models

- Word2Vec (CBOW & Skip-gram)
- GloVe (Global Vectors)
- FastText (Uses subword information)

Advantages:

- Captures semantic relationships between words.
- Learns from the contexts in which words co-occur, so related words end up with similar vectors.
- Efficient and fast, forming the foundation of many NLP models.

Disadvantages:

- Produces fixed-length vectors, which cannot fully model context.
- Struggles with homonymy (same spelling, different meanings) and polysemy (words with multiple meanings).
- Learned vectors are static, meaning they cannot differentiate word meanings based on context.
  <br>
  <br>

---

#### 2. Recurrent Neural Networks (RNN)

RNNs are neural networks designed to process sequential data. They are widely used in NLP and time-series forecasting.

##### How it Works

- Stores previous time-step information to influence future predictions.
- Uses a hidden state vector to model past information.

$$
h_t = f(W \cdot x_t + U \cdot h_{t-1} + b)
$$

Where:

- \( x_t \) → Current input
- \( h\_{t-1} \) → Previous hidden state
- \( W, U, b \) → Learnable parameters
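
The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming `tanh` as the activation `f` and small hand-picked values for `W`, `U`, and `b`; the helper names `matvec` and `rnn_step` are hypothetical and not taken from any library.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W, U, b):
    """One recurrence step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [math.tanh(p + q + r) for p, q, r in zip(wx, uh, b)]

# Hypothetical parameters: input size 2, hidden size 2.
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0]  # initial hidden state
hidden_states = []
for x_t in sequence:
    # Each step needs the previous hidden state, so steps run one after another.
    h = rnn_step(x_t, h, W, U, b)
    hidden_states.append(h)
    print([round(v, 4) for v in h])
```

The loop makes the sequential dependency visible: step `t` cannot start until step `t-1` has produced its hidden state.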

Advantages:

- Suitable for sequential data.
- Can learn context and determine a word’s meaning based on its position in a sentence.

Disadvantages:

- The **Vanishing Gradient Problem** makes modeling long-range dependencies difficult.
- Poor parallelization since computations must be done sequentially.
  <br>
  <br>

---

#### 3. Long Short-Term Memory (LSTM)

LSTM is an improved version of RNNs and is much better at modeling long-term dependencies.

##### How it Works

- Uses an **input gate**, **forget gate**, and **output gate** to control information flow.
- The memory cell forgets irrelevant information and retains important data.

$$
c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t
$$

Where:

- \( c_t \) → Memory cell
- \( f_t \) → Forget gate
- \( i_t \) → Input gate
- \( \tilde{c}\_t \) → New memory state
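
To see how the gates trade off old and new information, the sketch below applies the cell-state update elementwise. The gate values are set by hand here for illustration; in an actual LSTM they are computed from the current input and the previous hidden state. The function name `update_cell_state` is hypothetical.

```python
def update_cell_state(c_prev, f_t, i_t, c_candidate):
    """Elementwise update: c_t = f_t * c_{t-1} + i_t * c~_t."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_candidate)]

c_prev = [2.0, -1.0]       # previous memory cell
c_candidate = [0.5, 0.5]   # new candidate memory

# Forget gate fully open, input gate closed: the old memory is kept.
kept = update_cell_state(c_prev, [1.0, 1.0], [0.0, 0.0], c_candidate)

# Forget gate closed, input gate fully open: the old memory is overwritten.
overwritten = update_cell_state(c_prev, [0.0, 0.0], [1.0, 1.0], c_candidate)

# Both gates half open: the result blends old and new memory.
mixed = update_cell_state(c_prev, [0.5, 0.5], [0.5, 0.5], c_candidate)

print(kept)
print(overwritten)
print(mixed)
```

Because the old memory passes through an additive update instead of repeated matrix multiplications, information can survive across many time steps when the forget gate stays close to 1.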

Advantages:

- Can learn long-term dependencies.
- Largely mitigates the **Vanishing Gradient Problem**.

Disadvantages:

- High computational cost.
- Poor parallelization since previous steps must be computed first.
  <br>
  <br>

---

#### 4. Transformer Models

The Transformer architecture has revolutionized NLP and is the foundation of modern large language models. Because it processes all positions of a sequence in **parallel** rather than one step at a time, it trains much more efficiently than RNNs and LSTMs on modern hardware.

##### Self-Attention Mechanism

- Computes relationships between all words in a sentence.
- Does not rely only on previous words but sees the entire context.
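
A minimal sketch of scaled dot-product self-attention is shown below. It makes a simplifying assumption: the queries, keys, and values are all the raw input vectors, whereas a real Transformer first applies learned projection matrices. The input vectors and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    weights = []
    for q in X:
        # Score this token against every token in the sentence, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights.append(softmax(scores))
    # Each output is a weighted average of all token vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, X)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs

# Hypothetical 2-dimensional vectors for a three-token sentence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` has a nonzero entry for every token, including tokens that come later in the sentence, and the rows can be computed independently of one another, which is what makes the mechanism parallelizable.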

Advantages:

- Excellent at capturing long-term dependencies.
- Highly parallelizable.
- Performs better on large datasets.
  <br>
  <br>

##### **Important Transformer Models:**

#### **4.1** BERT (Bidirectional Encoder Representations from Transformers)

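- Uses **bidirectional attention**: each word attends to the words on both its left and its right, so the whole sentence is processed at once.
- Is **pretrained** on large unlabeled text corpora, mainly by predicting masked-out words, and is then fine-tuned for specific tasks.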
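
BERT's pretraining is commonly described as masked language modeling: some input tokens are hidden and the model must recover them from the surrounding words. The sketch below only builds such a training example; it does not train anything. The masked position is chosen by hand here, whereas the actual procedure selects tokens at random, and the helper `mask_tokens` is hypothetical.

```python
MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, positions):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(tokens, positions=[2])

print(masked)   # model input, with one token hidden
print(targets)  # what the model is asked to predict
```

To fill in the hidden position, the model can use words on both sides of the mask, which is where bidirectional attention matters.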

##### Use Cases:

- Text classification
- Named Entity Recognition (NER)
- Question answering systems
  <br>
  <br>

#### **4.2** GPT (Generative Pre-trained Transformer)

- Uses **unidirectional** modeling (predicts words based on previous words).
- Great for **text generation**.
- Forms the basis of **large-scale language models (LLMs)**.
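
Unidirectional modeling is usually implemented with a causal mask on the attention scores, so that position `i` can only look at positions `0..i`. The sketch below applies such a mask to a hypothetical score matrix of all zeros, which makes the effect of the mask easy to read off; the function name is an assumption for illustration.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask, then softmax, to each row of a square score matrix."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to itself and earlier positions.
        visible = scores[i][: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights

# Hypothetical raw scores for a three-token sequence (all equal).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
causal_weights = causal_attention_weights(scores)
for row in causal_weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second splits its attention over the first two tokens, and only the last token sees the full sequence. This matches how text is generated, one token at a time from left to right.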

##### Use Cases:

- Chatbots
- Text generation
- Summarization
  <br>
  <br>

#### **4.3** LLaMA (Large Language Model Meta AI)

- Developed by **Meta (Facebook)**.
- Similar to GPT in architecture, but designed to reach strong performance with **fewer parameters**.
- Popular in the **open-source** community.

##### Use Cases:

- Research and Academic Studies
- Custom LLM Applications
- Lightweight Language Models
- Open-Source Development
  <br>
  <br>

---

## **Comparative Summary Table**

| Model           | Long Dependency Learning | Parallelization | Special Use Case          |
| --------------- | ------------------------ | --------------- | ------------------------- |
| Word Embeddings | ❌ No                    | ✅ Yes          | Word semantic relations   |
| RNN             | ❌ Weak                  | ❌ No           | Sequence-based processing |
| LSTM            | ✅ Yes                   | ❌ No           | Long dependencies         |
| Transformer     | ✅ Excellent             | ✅ Yes          | NLP, LLMs                 |
| BERT            | ✅ Bidirectional         | ✅ Yes          | NLP understanding         |
| GPT             | ✅ Unidirectional        | ✅ Yes          | Text generation           |
| LLaMA           | ✅ Optimized             | ✅ Yes          | Lightweight LLM           |

## **Conclusion**

- **Word Embeddings** transform words into vectors, capturing semantic relationships.
- **RNN & LSTM** are good for sequential data, with LSTM being better at long dependencies.
- **Transformer models (BERT, GPT, LLaMA)** provide the best performance in modern NLP and form the foundation of **large-scale language models**.

Today, the field is shaped overwhelmingly by **Transformer-based models**!
