![nlp.jfif](attachment:nlp.jfif)

Natural Language Processing (NLP) is a field at the intersection of artificial intelligence, linguistics, and computer science. It focuses on enabling machines to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP encompasses a wide range of tasks, including language translation, sentiment analysis, text summarization, and more.

## What is NLP?

###  Overview

NLP involves the interaction between computers and human languages, enabling machines to process and analyze large amounts of natural language data. This involves tasks like:

- **Text Classification**: Assigning categories to text (e.g., spam detection).
- **Named Entity Recognition (NER)**: Identifying entities like names, dates, and locations in text.
- **Part-of-Speech Tagging**: Assigning parts of speech (noun, verb, etc.) to each word in a sentence.
- **Sentiment Analysis**: Determining the sentiment or emotional tone of a piece of text.
- **Machine Translation**: Automatically translating text from one language to another.
- **Question Answering**: Building systems that can answer questions posed in natural language.


![digital-twins-and-knowledge-graphs-1280x640.png](attachment:digital-twins-and-knowledge-graphs-1280x640.png)

##  Processing Natural Language with Neural Networks
<div style="text-align: center;">
    <img src="./files/Traditional Feedforward Neural Networks (FFNNs).JPG" width="50%"/>
</div>


Neural networks have become the cornerstone of modern NLP, significantly improving the performance of NLP tasks. Below is a breakdown of how different types of neural networks, from traditional feedforward neural networks to transformers, are used in NLP.

###  Traditional Feedforward Neural Networks (FFNNs)

**Basic Concept**: In NLP, a feedforward neural network can be used for simple tasks like text classification. However, an FFNN has no built-in notion of sequence: it sees its input as a single fixed-size vector and has no mechanism for carrying context from one token to the next.

**Limitations**:

- **Lack of Context Awareness**: FFNNs do not maintain context across words or sentences, making them inadequate for tasks where the order of words matters (e.g., sentiment analysis or language modeling).
- **Fixed Input Size**: FFNNs generally require a fixed-size input, which is problematic for variable-length text sequences.

Despite these limitations, FFNNs can be combined with other techniques, such as n-grams, to capture some local context, but they still fall short in handling long-term dependencies.



Traditional feedforward neural networks (FFNNs) are the most basic type of artificial neural networks. They are called "feedforward" because the information in these networks moves in one direction—from the input layer, through the hidden layers (if any), to the output layer. There are no cycles or loops in the network, distinguishing them from recurrent neural networks (RNNs). Let's delve into the details of FFNNs, starting from the basics and moving toward more complex concepts.

###  Basic Structure of Feedforward Neural Networks

####  Neurons and Layers

- **Neuron**: The fundamental unit of a neural network. Each neuron receives input, processes it (using a weighted sum and an activation function), and passes the result to the next layer.

- **Layers**:

  - **Input Layer**: The first layer in the network, which receives the input data. The number of neurons in this layer equals the number of features in the input data.
  
  - **Hidden Layers**: Intermediate layers between the input and output layers. These layers perform computations on the input data and extract relevant features. A network can have one or multiple hidden layers.
  
  - **Output Layer**: The final layer, which produces the output. The number of neurons in this layer depends on the task (e.g., for binary classification, there would typically be one output neuron).

###  Forward Pass

In an FFNN, data moves in one direction: forward through the network. During the forward pass:

- Each neuron in the hidden layers computes a weighted sum of its inputs.
- The weighted sum is passed through an activation function to produce the neuron's output.
- The output from one layer serves as the input to the next layer.
- Finally, the output layer produces the final predictions.

###  Mathematical Formulation

####  Weight and Bias

- **Weights (𝑊)**: Each connection between neurons in adjacent layers has an associated weight. These weights determine the strength and direction (positive or negative) of the influence that one neuron's output has on another neuron's input.
- **Bias (𝑏)**: A bias term is added to the weighted sum of inputs to allow the activation function to shift left or right. This provides the model with additional flexibility.
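The weighted sum, bias, and activation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the layer sizes, the random weights, and the function name `forward` are assumptions chosen for the example, and the ReLU and sigmoid activations it uses are described in the next subsection.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One hidden layer: x -> ReLU(W1 x + b1) -> sigmoid(W2 h + b2)."""
    h = relu(W1 @ x + b1)        # weighted sum + bias, then activation
    return sigmoid(W2 @ h + b2)  # output layer for binary classification

# Hypothetical toy sizes: 3 input features, 4 hidden units, 1 output unit.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.0, 2.0])
y_hat = forward(x, W1, b1, W2, b2)
print(y_hat.shape)  # (1,)
```

Each row of `W1` holds the weights of one hidden neuron, so `W1 @ x + b1` computes every neuron's weighted sum in a single matrix product.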

####  Activation Functions

The output of each neuron is passed through an activation function, which introduces non-linearity into the model. Common activation functions include:


 - **Sigmoid**:

$$
\sigma(x) = \frac{1}{1+e^{-x}}
$$

 Maps the input to a value between 0 and 1, as shown in the figure below:

<div style="text-align: center;">
    <img src="./files/OIP.jfif" width="50%"/>
</div>        





 - **Tanh**:

$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

 Maps the input to a value between -1 and 1, as shown in the figure below:

<div style="text-align: center;">
    <img src="./files/tanh-fig.jfif" width="50%"/>
</div>


 -  **ReLU (Rectified Linear Unit)**:

$$ \text{ReLU}(x) = \max(0, x)$$

 Maps all negative values to 0 and passes positive values through unchanged. Because every negative input becomes exactly zero, ReLU also introduces sparsity in the activations.

 <div style="text-align: center;">
    <img src="./files/relu-fig.png" width="50%" />
</div>
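As a quick check of the three formulas above, the sketch below evaluates each activation on a few sample inputs with NumPy. The helper names `sigmoid` and `relu` are defined here for illustration only; `np.tanh` is NumPy's built-in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # values strictly between 0 and 1, with sigmoid(0) = 0.5
print(np.tanh(x))  # values strictly between -1 and 1, with tanh(0) = 0
print(relu(x))     # [0. 0. 2.]
```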


In [1]:
pip install tensorflow


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import imdb

# Set parameters
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Cut texts after this number of words (among top max_features most common words)
batch_size = 32

# Load the data (IMDb movie reviews)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure all input data has the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

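# Build the model
model = Sequential()
model.add(tf.keras.Input(shape=(maxlen,)))  # Each review is a sequence of maxlen word indices
model.add(tf.keras.layers.Embedding(max_features, 128))  # Map each word index to a 128-dimensional vector
model.add(Flatten())  # Flatten the (maxlen, 128) embeddings into one vector for the feedforward network
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))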

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')

## Explanation of the NLP Implementation with Traditional Feedforward Neural Networks (FFNNs)

### 1. Dataset:

We use the IMDb movie reviews dataset, with reviews labeled as positive or negative.

### 2. Preprocessing:

- **Tokenization and Padding**: The reviews are already tokenized into integers, and we pad them to ensure they have a fixed length of 500 words.

### 3. Model Architecture:

- **Embedding Layer**: Converts the integer tokens into dense vectors of fixed size (128-dimensional in this case).
- **Flatten Layer**: The output from the Embedding layer is a 2D tensor (sequence length x embedding size). The Flatten layer converts this 2D tensor into a 1D tensor, making it suitable for a feedforward network.
- **Dense Layers**:
  - The first Dense layer has 64 units with a ReLU activation function, which introduces non-linearity.
  - The second Dense layer has 1 unit with a sigmoid activation function, which outputs a probability for binary classification (positive or negative sentiment).

### 4. Training:

- **Binary Crossentropy**: The loss function used for binary classification.
- **Adam Optimizer**: Used for optimizing the model parameters.
- **Validation Split**: 20% of the training data is used for validation during training.

### 5. Evaluation:

After training, the model's accuracy is evaluated on the test dataset.
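To make the effect of the Flatten layer concrete, the arithmetic below counts the trainable parameters implied by the architecture described above (10,000-word vocabulary, 500-token reviews, 128-dimensional embeddings, 64 hidden units). It is a back-of-the-envelope sketch; the variable names are chosen for this example.

```python
max_features, maxlen, embed_dim, hidden = 10000, 500, 128, 64

embedding_params = max_features * embed_dim   # one 128-d vector per vocabulary entry
flat_size = maxlen * embed_dim                # length of the vector produced by Flatten
dense1_params = flat_size * hidden + hidden   # weights + biases of the 64-unit layer
dense2_params = hidden * 1 + 1                # weights + bias of the output unit

total = embedding_params + dense1_params + dense2_params
print(flat_size)  # 64000
print(total)      # 5376129
```

Most of the parameters sit in the first Dense layer, because flattening ties a separate set of weights to every position in the review.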


##  Recurrent Neural Networks (RNNs) for NLP
<div style="text-align: center;">
    <img src="./files/RNN.png" width="50%"/>
</div>

###  Introduction to RNNs


- **Context Awareness**: Unlike FFNNs, RNNs are designed to handle sequential data, making them well-suited for NLP tasks where context matters. RNNs process input sequences one element at a time, maintaining a hidden state that captures information about previous elements in the sequence.



### Mathematical Formulation

At each time step $t$, the hidden state $h_t$ is updated based on the current input $x_t$ and the previous hidden state $h_{t-1}$:


$$
h_t = \sigma(W_{hx}x_t + W_{hh}h_{t-1} + b_h)
$$

where $W_{hx}$ and $W_{hh}$ are weight matrices, $b_h$ is a bias term, and $\sigma$ is an activation function (typically tanh or ReLU).

The output $y_t$ at each time step can be computed as:

$$
y_t = \sigma(W_{hy}h_t + b_y)
$$

# More About Recurrent Neural Networks (RNNs) in NLP

Recurrent Neural Networks (RNNs) are a class of neural networks particularly well-suited for processing sequential data, making them a popular choice for natural language processing (NLP) tasks. Here's a detailed step-by-step explanation of how RNNs process data in the context of NLP:

### 1. **Input Representation**
- **Tokenization**: The first step in processing text data is to break down the input text into tokens (words, subwords, or characters).
  

  - ### Example: Tokenization of a Sentence

    **Input Sentence:**  
        "The cat sat on the mat."

    **Tokenization Process:**  
    - **Word-Level Tokenization:**  
    The sentence is broken down into individual words (tokens):
    ```
    ["The", "cat", "sat", "on", "the", "mat", "."]
    ```

    - **Subword-Level Tokenization (using Byte Pair Encoding or similar):**  
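     Each word may be broken down into smaller subword units. The split below is purely illustrative; a trained subword vocabulary would typically keep frequent words such as "the" and "cat" whole: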
    ```
    ["Th", "e", " ca", "t", " sa", "t", " on", " the", " ma", "t", "."]
    ```

    - **Character-Level Tokenization:**  
    The sentence is split into individual characters:
    ```
    ["T", "h", "e", " ", "c", "a", "t", " ", "s", "a", "t", " ", "o", "n", " ", "t", "h", "e", " ", "m", "a", "t", "."]
    ```

    **Explanation:**

    - **Word-Level Tokenization**: This breaks the text into words, making it easy to analyze word-level features.
    - **Subword-Level Tokenization**: Useful when dealing with rare words or morphological variations, as it breaks words into more frequent subword units.
    - **Character-Level Tokenization**: Useful for tasks where the exact form of text is important, such as in languages with rich morphology or in spell-checking systems.

    This example shows how tokenization can vary in granularity depending on the specific needs of the NLP task.
- **Embedding**: Each token is then converted into a vector representation, often using pre-trained static embeddings like Word2Vec or GloVe, or contextual representations from a model like BERT. The embeddings capture semantic meaning and replace sparse, vocabulary-sized inputs with much lower-dimensional dense vectors.
    - ### Example: Word Embedding

        **Context:**  
        Imagine you have the following sentence:
        ```
        "The cat sat on the mat."
        ```

        **Step 1: Tokenization**  
        The sentence is first broken down into tokens (words):
        ```
        ["The", "cat", "sat", "on", "the", "mat", "."]
        ```

        **Step 2: Word Embedding**  
        Each token (word) is then converted into a vector representation. Let's say we use a pre-trained embedding like Word2Vec, which converts each word into a vector of numbers. 

        For simplicity, let's assume the embeddings map each word to a 3-dimensional vector:

        - "The" → [0.2, 0.1, 0.9]
        - "cat" → [0.8, 0.6, 0.4]
        - "sat" → [0.7, 0.5, 0.3]
        - "on" → [0.4, 0.4, 0.2]
        - "the" → [0.2, 0.1, 0.9] (same as "The")
        - "mat" → [0.9, 0.8, 0.7]
        - "." → [0.1, 0.2, 0.3]
        
        **Output:**  
        The sentence is now represented as a sequence of vectors:
        ```
        [
        [0.2, 0.1, 0.9],  // "The"
        [0.8, 0.6, 0.4],  // "cat"
        [0.7, 0.5, 0.3],  // "sat"
        [0.4, 0.4, 0.2],  // "on"
        [0.2, 0.1, 0.9],  // "the"
        [0.9, 0.8, 0.7],  // "mat"
        [0.1, 0.2, 0.3]   // "."
        ]
        ```

        **Explanation:**

        - **Semantic Meaning**: The vectors are designed to capture the semantic meaning of words. For example, words with similar meanings will have similar vectors. If you had another sentence like "The dog sat on the mat," the word "dog" might have a vector similar to "cat."
  
        - **Dimensionality Reduction**: Instead of representing words as large sparse vectors (e.g., one-hot encoding where each word is a huge vector of zeros and a single one), embeddings reduce the dimensionality while preserving meaning. This makes it easier for the model to process and understand the text.

        This example shows how embedding transforms words into vectors that a machine learning model can work with, while also preserving the meaning of the words.
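The tokenization and embedding steps above can be sketched with the standard library and NumPy. This is a toy illustration under stated assumptions: the regular expression is one simple way to split words from punctuation, tokens are lowercased so that "The" and "the" share a vector, and the embedding table is filled with random numbers rather than pre-trained Word2Vec or GloVe values.

```python
import re
import numpy as np

sentence = "The cat sat on the mat."

# Word-level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)
# Character-level: every character, including spaces.
char_tokens = list(sentence)
print(word_tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']

# Assign an integer id to each distinct lowercased token.
vocab = {}
for tok in word_tokens:
    vocab.setdefault(tok.lower(), len(vocab))

# Hypothetical 3-dimensional embedding table with random values.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 3))

ids = [vocab[tok.lower()] for tok in word_tokens]
vectors = embedding_table[ids]  # embedding lookup is just row indexing
print(ids)            # [0, 1, 2, 3, 0, 4, 5]
print(vectors.shape)  # (7, 3)
```

In a trained model the rows of `embedding_table` would be learned or loaded from pre-trained vectors, so that related words end up close together.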


### 2. **Sequential Data Input**
- **Sequence Formation**: The input tokens are arranged in a sequence. For example, a sentence "The cat sat on the mat" would be represented as a sequence of vectors corresponding to each word.
- **Time Steps**: Each token in the sequence is processed at a different time step in the RNN. The RNN processes the sequence one element at a time, maintaining a hidden state that is updated with each time step.

### 3. **RNN Cell Operation**
- **Hidden State Initialization**: The RNN starts with an initial hidden state, usually initialized to zero or small random values. This hidden state is updated as the network processes each token in the sequence.
- **Processing Each Time Step**:
    1. **Current Input**: At time step $t$, the RNN receives the vector representation of the current token $x_t$.
    2. **Previous Hidden State**: The hidden state from the previous time step, $h_{t-1}$, is also fed into the RNN cell.
    3. **Hidden State Update**: The RNN cell combines the current input $x_t$ and the previous hidden state $h_{t-1}$ to compute the new hidden state $h_t$. The update is typically performed using a non-linear function like a tanh or ReLU activation.
       $$ h_t = \tanh\left(W_{hx} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h\right) $$

       where $W_{hx}$ and $W_{hh}$ are weight matrices, and $b_h$ is a bias term.
    4. **Output Generation**: Depending on the task, the RNN might also produce an output $y_t$ at each time step, which is a function of the current hidden state:
       $$
       y_t = \text{softmax}(W_{hy} \cdot h_t + b_y)
       $$
       where $W_{hy}$ is the weight matrix for the output layer, and $b_y$ is a bias term.
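The per-step update above can be written as a short NumPy loop. This is a sketch under stated assumptions: `W_hx` is the input-to-hidden matrix, `W_hh` the hidden-to-hidden matrix, the hidden state starts at zero, and the dimensions and random weights are placeholders rather than trained values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W_hx, W_hh, b_h, W_hy, b_y):
    """Run a simple RNN over a sequence; return hidden states and outputs."""
    h = np.zeros(W_hh.shape[0])                    # initial hidden state h_0
    hs, ys = [], []
    for x_t in xs:                                 # one time step per token
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)   # hidden state update
        ys.append(softmax(W_hy @ h + b_y))         # output distribution
        hs.append(h)
    return np.array(hs), np.array(ys)

# Hypothetical sizes: 3-d inputs, 4 hidden units, 5 output classes, 6 steps.
rng = np.random.default_rng(0)
input_dim, hidden_dim, vocab_size, T = 3, 4, 5, 6
W_hx = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(vocab_size, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(vocab_size)

xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, W_hx, W_hh, b_h, W_hy, b_y)
print(hs.shape, ys.shape)  # (6, 4) (6, 5)
```

The same three weight matrices are reused at every step; only the hidden state `h` changes as the sequence is read.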

### 4. **Handling Long Sequences**
- **Vanishing/Exploding Gradient Problem**: When processing long sequences, RNNs can suffer from vanishing or exploding gradients during backpropagation. This means that the network either stops learning (gradients become too small) or becomes unstable (gradients become too large).
- **Solutions**: 
    - **Long Short-Term Memory (LSTM)** and **Gated Recurrent Unit (GRU)** networks are specialized types of RNNs that mitigate these issues by incorporating gates that control the flow of information.
    - **Gradient Clipping**: A technique where gradients are capped at a maximum value during training to prevent them from exploding.

- ### More on Handling Long Sequences in RNNs

    When working with Recurrent Neural Networks (RNNs), handling long sequences of data can be challenging due to the **vanishing** and **exploding gradient problems**. Here’s a simple explanation of these problems and their solutions:

    #### 1. **Vanishing/Exploding Gradient Problem**

    - **Vanishing Gradients**: 
      - **What Happens**: When training an RNN on long sequences, the gradients (which are used to update the model’s weights) can become very small as they are propagated back through time. As a result, the model stops learning because the updates to the weights become insignificant.
      - **Example**: Imagine trying to remember the first word in a long sentence after reading the entire sentence. As you go further into the sentence, your memory of the first word fades away. Similarly, in RNNs, the influence of earlier time steps diminishes as the sequence length increases, making it difficult for the network to learn long-term dependencies.

    - **Exploding Gradients**:
      - **What Happens**: Conversely, the gradients can become excessively large during backpropagation, causing the model’s weights to change drastically. This leads to an unstable model that may fail to converge during training.
      - **Example**: Think of trying to adjust a radio volume knob. If the knob is too sensitive, a slight touch could make the volume too loud, making the control very difficult. Similarly, in RNNs, large gradients can cause massive weight updates, making the model behave unpredictably.

    #### 2. **Solutions to These Problems**

    - **Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)**:
      - **What They Do**: LSTMs and GRUs are special types of RNNs designed to address the vanishing gradient problem. They include mechanisms called "gates" that control the flow of information, allowing the model to keep or forget information as needed over long sequences.
      - **Example**: Think of an LSTM as having a memory cell with a gate that controls what information should be remembered and what should be forgotten. This allows the network to remember important information from earlier in the sequence, even if the sequence is long.

    - **Gradient Clipping**:
      - **What It Does**: Gradient clipping is a technique used to prevent exploding gradients. It sets a maximum threshold for the gradients during backpropagation. If the gradients exceed this threshold, they are scaled down to stay within the limit.
      - **Example**: Imagine a speed limit on a highway. If a car tries to go faster than the speed limit, it is forced to slow down to avoid accidents. Similarly, gradient clipping limits the speed (magnitude) of the gradient updates to keep the training process stable.

    ### Summary

    Handling long sequences in RNNs can be tricky due to the vanishing and exploding gradient problems. LSTMs and GRUs mitigate the vanishing gradient issue by using gates to manage information flow, while gradient clipping prevents exploding gradients by capping their magnitude. These techniques allow recurrent networks to learn more effectively from long sequences.
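Both problems, and the clipping remedy, can be illustrated numerically. The sketch below rests on a simplifying assumption: the gradient is treated as if it were multiplied by the same scalar factor at every time step, which is only a rough picture of what happens inside a real RNN. The helper `clip_by_norm` is written for this example; in Keras, optimizers accept `clipnorm` and `clipvalue` arguments for the same purpose.

```python
import numpy as np

# Rough picture: a factor below 1 shrinks the gradient, a factor above 1 blows it up.
steps = 50
print(0.5 ** steps)  # shrinks toward zero (vanishing)
print(1.5 ** steps)  # grows very large (exploding)

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])            # L2 norm is 5
print(clip_by_norm(g, 1.0))         # [0.6 0.8]
print(clip_by_norm(g, 10.0))        # unchanged, already within the limit
```

Clipping by norm keeps the direction of the update and only limits its size, which is why it stabilizes training without changing what the gradient points toward.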





### 5. **Backpropagation Through Time (BPTT)**
- **Unrolling the RNN**: To compute the gradients for training, the RNN is "unrolled" across time steps. This means the sequence of computations is treated as a feedforward network where each time step corresponds to a layer.
- **Loss Computation**: The loss is computed for each output at each time step, depending on the task. For example, in language modeling, the loss might be the cross-entropy loss between the predicted word and the actual next word in the sequence.
- **Gradient Calculation**: Gradients are calculated for each weight matrix by backpropagating the error through the unrolled network, a process called Backpropagation Through Time (BPTT).
- **Parameter Update**: Using the calculated gradients, the model's parameters (weights and biases) are updated using an optimization algorithm like Stochastic Gradient Descent (SGD) or Adam.
    - ### Understanding Backpropagation Through Time (BPTT) in RNNs

    Backpropagation Through Time (BPTT) is the process used to train Recurrent Neural Networks (RNNs). Here's a simple explanation of how BPTT works, step by step, with an example for clarity:

    #### 1. **Unrolling the RNN**

    - **What It Means**: 
      - RNNs process sequences one step at a time, reusing the same weights at each step. To compute gradients for training, the RNN is "unrolled" across time steps, meaning we treat each time step as a separate layer in a deep network.
      - **Example**: Imagine you have a sentence, "The cat sat," and the RNN processes one word at a time. Unrolling the RNN would look like a feedforward neural network with three layers, each corresponding to one word in the sentence ("The," "cat," and "sat").

        $$ \text{Layer 1: "The"} \rightarrow \text{Layer 2: "cat"} \rightarrow \text{Layer 3: "sat"} $$


    #### 2. **Loss Computation**

    - **What It Means**:
      - The RNN makes predictions at each time step (e.g., predicting the next word in a sequence). The loss is calculated based on how far off the predictions are from the actual values.
      - **Example**: If the task is to predict the next word in the sentence, the RNN might predict "dog" instead of "cat" after "The". The loss function (like cross-entropy loss) measures the difference between the predicted word ("dog") and the actual word ("cat").

    #### 3. **Gradient Calculation**

    - **What It Means**:
      - Gradients represent how much the model's weights need to change to reduce the loss. BPTT calculates these gradients by backpropagating the error through the unrolled network.
      - **Example**: After unrolling, we go backward from "sat" to "The," calculating the gradients at each layer (time step). Because every time step reuses the same weights, the error at a given step contributes to the gradient through that step and through all earlier steps, and these contributions are summed into a single update.

    #### 4. **Parameter Update**

    - **What It Means**:
      - Once the gradients are calculated, they are used to update the model’s weights to minimize the loss. This is done using optimization algorithms like Stochastic Gradient Descent (SGD) or Adam.
      - **Example**: After computing the gradients, the model updates its weights so that next time it encounters a similar sequence, it might predict "cat" instead of "dog" after "The".

    ### Summary with a Simple Example

    Imagine teaching a child to complete the sentence "The cat sat on the ...". Initially, the child might guess wrong ("dog" instead of "mat"). You correct them, and they adjust their understanding.

    In RNN training:
    1. **Unrolling**: The sentence is split into steps ("The," "cat," "sat," "on," "the," "mat").
    2. **Loss Computation**: The model guesses the next word at each step. It compares its guesses with the actual words and calculates the loss.
    3. **Gradient Calculation**: The model backtracks through the sentence, figuring out how to adjust its guesses to be more accurate.
    4. **Parameter Update**: It tweaks its "understanding" (weights) to do better next time.

    BPTT is like teaching the model by repeatedly correcting its mistakes until it gets better at making predictions for sequences of words.
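The four steps above can be traced on a deliberately tiny model. The sketch below assumes a scalar RNN, $h_t = \tanh(w \, h_{t-1} + u \, x_t)$ with $h_0 = 0$, and a squared-error loss on the final hidden state only; real language models compute a loss at every step and use matrices instead of scalars. The names `forward`, `loss_fn`, and `bptt` are chosen for this example, and the analytic gradient is compared against a finite-difference estimate.

```python
import numpy as np

def forward(w, u, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def loss_fn(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    hs = forward(w, u, xs)                 # unroll the network over time
    dw, du = 0.0, 0.0
    dh = hs[-1] - target                   # dL/dh_T
    for t in range(len(xs), 0, -1):        # walk backward through time
        da = dh * (1.0 - hs[t] ** 2)       # back through tanh
        dw += da * hs[t - 1]               # shared weight: contributions add up
        du += da * xs[t - 1]
        dh = da * w                        # pass the error on to h_{t-1}
    return dw, du

w, u = 0.5, 0.8
xs = [1.0, -0.5, 0.25]
target = 0.3

dw, du = bptt(w, u, xs, target)
eps = 1e-6
dw_num = (loss_fn(w + eps, u, xs, target) - loss_fn(w - eps, u, xs, target)) / (2 * eps)
du_num = (loss_fn(w, u + eps, xs, target) - loss_fn(w, u - eps, xs, target)) / (2 * eps)
print(dw, dw_num)  # analytic and numerical gradients for w
print(du, du_num)  # analytic and numerical gradients for u
```

The line `dh = da * w` is where vanishing and exploding gradients come from: the error is multiplied by the recurrent weight (and a tanh derivative) once per time step.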

### 6. **Task-Specific Adjustments**
- **Sequence-to-Sequence Tasks**: In tasks like machine translation, the RNN is often used in an encoder-decoder architecture. The encoder processes the input sequence, and the decoder generates the output sequence.
- **Attention Mechanism**: In more advanced models, attention mechanisms are incorporated to allow the RNN to focus on specific parts of the input sequence when generating each output.

### 7. **Inference**
- **Sequence Generation**: After training, the RNN can generate sequences of text. In a sequence generation task, the model might generate the next word in a sentence by sampling from the predicted probability distribution over the vocabulary.
- **Handling Variable-Length Sequences**: RNNs are capable of processing input sequences of variable length, making them flexible for a wide range of NLP tasks.
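Choosing the next word from a predicted distribution can be sketched as follows. The vocabulary and probabilities here are made up for illustration; in practice the probabilities would come from the softmax output of the trained model.

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat", "."]       # hypothetical toy vocabulary
probs = np.array([0.1, 0.2, 0.1, 0.5, 0.1])     # hypothetical model output

rng = np.random.default_rng(0)
sampled_id = int(rng.choice(len(vocab), p=probs))  # sample in proportion to probs
greedy_id = int(np.argmax(probs))                  # always pick the most likely word

print(vocab[sampled_id])
print(vocab[greedy_id])  # mat
```

Sampling produces more varied text than always taking the most likely word, at the cost of occasionally picking an unlikely one.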

### 8. **Evaluation**
- **Perplexity**: In language modeling, the performance of an RNN is often evaluated using perplexity, which measures how well the model predicts the next word in a sequence.
- **Accuracy and F1 Score**: For classification tasks like sentiment analysis, metrics like accuracy, precision, recall, and F1 score are commonly used.
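Perplexity is the exponential of the average negative log-probability the model assigns to the true next tokens. The sketch below computes it from a list of such probabilities; the function name and the sample values are assumptions made for this example.

```python
import numpy as np

def perplexity(true_token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    p = np.asarray(true_token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(p))))

# A model that gives every true token probability 1/4 is as uncertain
# as a uniform guess among four words, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# More confident correct predictions give a lower perplexity.
print(perplexity([0.9, 0.8, 0.95]))
```

Lower is better: a perplexity of 1 would mean the model assigned probability 1 to every true token.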

### 9. **Common Applications in NLP**
- **Language Modeling**: Predicting the next word in a sentence.
- **Machine Translation**: Translating text from one language to another.
- **Sentiment Analysis**: Classifying the sentiment of a piece of text.
- **Speech Recognition**: Converting spoken language into text.
- **Named Entity Recognition (NER)**: Identifying and classifying entities in text.

This detailed process shows how RNNs handle sequential data in NLP tasks, from the initial input representation to the final output and evaluation. The adaptability of RNNs to different tasks and their ability to process variable-length sequences make them a foundational model in NLP.



###  Challenges with RNNs

- **Vanishing Gradient Problem**: As the length of the input sequence increases, the gradients used to update the network's weights during backpropagation can become very small, making it difficult to learn long-term dependencies.
- **Exploding Gradient Problem**: Conversely, the gradients can also become excessively large, leading to unstable training.


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.datasets import imdb

# Set parameters
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Cut texts after this number of words (among top max_features most common words)
batch_size = 32

# Load the data (IMDb movie reviews)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure all input data has the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

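# Build the model
model = Sequential()
model.add(tf.keras.Input(shape=(maxlen,)))  # Each review is a sequence of maxlen word indices
model.add(Embedding(max_features, 128))  # Map each word index to a 128-dimensional vector
model.add(SimpleRNN(128, return_sequences=False))  # Keep only the final hidden state
model.add(Dense(1, activation='sigmoid'))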

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')

# Explanation of the RNN Implementation

1. **Dataset:**

    **IMDB Dataset**: We use the IMDB movie reviews dataset, which contains 50,000 reviews labeled as positive or negative. We use 25,000 reviews for training and 25,000 for testing.

2. **Preprocessing:**

    **Tokenization**: The reviews are already tokenized as integers where each integer represents a word in a dictionary.
    
    **Padding**: An RNN can process sequences of any length, but training in batches requires every sequence in a batch to have the same shape, so we pad (or truncate) the reviews to a common length of 500 tokens.

3. **Model Architecture:**

    **Embedding Layer**: This layer turns positive integers (representing words) into dense vectors of fixed size. It helps in capturing semantic meanings of the words.
    
    **SimpleRNN Layer**: A basic RNN layer that processes the sequence of word embeddings and maintains a hidden state that captures information about the sequence.
    
    **Dense Layer**: A fully connected layer with a sigmoid activation function to output a probability value for binary classification (positive or negative sentiment).

4. **Training:**

    **Binary Crossentropy**: Used as the loss function because this is a binary classification problem.
    
    **Adam Optimizer**: A popular optimizer that adapts the learning rate during training.
    
    **Validation Split**: We use 20% of the training data for validation during training.

5. **Evaluation:**

    After training, the model is evaluated on the test dataset to measure its accuracy in predicting sentiment.


# Long Short-Term Memory (LSTM) Networks

<div style="text-align: center;">
    <img src="./files/LSTM.png" width="50%"/>
</div>

-  **Introduction to LSTMs**

    **Motivation**: LSTM networks were designed to address the vanishing gradient problem in RNNs. They introduce a memory cell that can maintain its state over long periods, along with gates that regulate the flow of information into and out of the cell.

    **Components**:

    - **Cell State**: The cell state acts as a memory, carrying information across time steps.

    - **Forget Gate**: Decides what information to discard from the cell state.

      $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

    - **Input Gate**: Decides what new information to add to the cell state.

      $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

    - **Output Gate**: Decides what to output based on the cell state.

      $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

    - The **cell state** $C_t$ is updated as follows:

      $$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$

      where $*$ denotes element-wise multiplication and $\tilde{C}_t$ is the candidate cell state, typically computed using a tanh activation function:

      $$ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) $$

    - The **hidden state** $h_t$ is updated as:

      $$ h_t = o_t * \tanh(C_t) $$
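The gate equations above translate almost line for line into NumPy. This is a minimal sketch of a single time step, not a trained layer: the parameter dictionary, its key names (including `W_C` and `b_C` for the candidate cell state), and the toy dimensions are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    c_tilde = np.tanh(p["W_C"] @ z + p["b_C"])  # candidate cell state
    c = f * c_prev + i * c_tilde                # new cell state
    h = o * np.tanh(c)                          # new hidden state
    return h, c

# Hypothetical sizes: 3-d input, 4 hidden units, random untrained weights.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {}
for name in ("f", "i", "o", "C"):
    params[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):     # run over a 5-step sequence
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

The cell state `c` is updated only by element-wise scaling and addition, which is what lets information (and gradients) survive across many time steps.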

# Explanation of Custom LSTM Cell Implementation

**LSTM Gates:**

- **Forget Gate**: Determines what part of the previous cell state should be forgotten. It's calculated using a sigmoid activation function. The output of the forget gate is multiplied by the previous cell state to "forget" unimportant parts.

- **Input Gate**: Controls what new information should be added to the cell state. It has two parts:
    - **The input_gate**: Decides which parts of the input are important using a sigmoid function.
    - **The input_value**: Determines the potential new values to add to the cell state using a tanh function.

- **Cell State Update**: The forget gate's output and the input gate's output are combined to update the cell state.

- **Output Gate**: Determines the next hidden state (which is also the output) based on the updated cell state and the current input.


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dense, Lambda
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import concatenate, Multiply, Add, Activation

# Set parameters
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Cut texts after this number of words (among top max_features most common words)
embedding_size = 128
units = 128  # Size of the hidden state and cell state
batch_size = 32

# Load the data (IMDb movie reviews)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure all input data has the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# Create each gate's Dense layer once so the same weights are reused at every time step
forget_dense = Dense(units, activation='sigmoid')
input_dense = Dense(units, activation='sigmoid')
candidate_dense = Dense(units, activation='tanh')
output_dense = Dense(units, activation='sigmoid')

# LSTM custom cell implementation
def lstm_cell(x_t, prev_hidden_state, prev_cell_state):
    combined = concatenate([x_t, prev_hidden_state])

    # Forget gate: decide how much of the previous cell state to keep
    forget_gate = forget_dense(combined)
    kept_state = Multiply()([forget_gate, prev_cell_state])

    # Input gate: decide how much of the candidate values to add
    input_gate = input_dense(combined)
    input_value = candidate_dense(combined)
    new_information = Multiply()([input_gate, input_value])

    # Cell state
    cell_state = Add()([kept_state, new_information])

    # Output gate
    output_gate = output_dense(combined)
    hidden_state = Multiply()([output_gate, Activation('tanh')(cell_state)])

    return hidden_state, cell_state

# Build the LSTM model
inputs = Input(shape=(maxlen,))
embedding = Embedding(max_features, embedding_size)(inputs)

# Initialize zero states whose batch dimension follows the actual input batch,
# so the model also works when the last batch is smaller than batch_size
zero_state = Lambda(lambda x: tf.zeros((tf.shape(x)[0], units)), output_shape=(units,))
hidden_state = zero_state(embedding)
cell_state = zero_state(embedding)

# Process each time step through the LSTM cell
# (unrolling maxlen steps builds a very large graph; lower maxlen when experimenting)
for t in range(maxlen):
    current_input = embedding[:, t, :]
    hidden_state, cell_state = lstm_cell(current_input, hidden_state, cell_state)

# Final Dense layer for output
output = Dense(1, activation='sigmoid')(hidden_state)

# Create model
model = Model(inputs=inputs, outputs=output)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.datasets import imdb

# Set parameters
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Cut texts after this number of words (among top max_features most common words)
batch_size = 32

# Load the data (IMDb movie reviews)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure all input data has the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# Build the LSTM model
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(LSTM(128))  # LSTM layer with 128 units
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')

## LSTM Applications in NLP

- **Machine Translation**: Translating text from one language to another by capturing long-term dependencies between words.

- **Text Summarization**: Summarizing a long piece of text by understanding the overall context.

- **Speech Recognition**: Converting spoken language into text by processing sequences of audio frames.


# Gated Recurrent Unit (GRU) Networks

<div style="text-align: center;">
    <img src="./files/RNN,LSTM,GRU.jfif" width="50%"/>
</div>

 **Introduction to GRUs**

   **Simplified Architecture**: GRUs are a variant of LSTMs that simplify the architecture by combining the forget and input gates into a single update gate and by merging the cell state with the hidden state. This reduces the number of parameters while maintaining comparable performance on many tasks.

   **GRU Equations**:

   - **Update Gate**:
      $$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$


- **Reset Gate**:
 $$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

- **Candidate Hidden State**:
 $$\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t] + b_h)$$


- **Final Hidden State**:
 $$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$
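
As a minimal sketch of the GRU equations above, the following NumPy function performs a single GRU time step. The dictionary keys (`W_z`, `W_r`, `W_h` and their biases), the small dimensions, and the random inputs are assumptions for illustration only, not the internals of any library layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step following the equations above."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])       # update gate
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])       # reset gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])          # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(params["W_h"] @ concat_reset + params["b_h"])  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # final hidden state

# Hypothetical sizes, chosen only to keep the example small
input_dim, units = 3, 4
rng = np.random.default_rng(0)
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(units, units + input_dim))
    params[f"b_{gate}"] = np.zeros(units)

# Run the same cell (same weights) over a short random sequence
h = np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, params)

print("hidden state shape:", h.shape)
```

When $z_t$ is close to 0 the previous hidden state is carried over almost unchanged, and when it is close to 1 the state is replaced by the candidate.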

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dense, Lambda
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import concatenate, Multiply, Add

# Set parameters
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Cut texts after this number of words (among top max_features most common words)
embedding_size = 128
units = 128  # Size of the hidden state
batch_size = 32

# Load the data (IMDb movie reviews)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure all input data has the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# Create each gate's Dense layer once so the same weights are reused at every time step
update_dense = Dense(units, activation='sigmoid')
reset_dense = Dense(units, activation='sigmoid')
candidate_dense = Dense(units, activation='tanh')
one_minus = Lambda(lambda z: 1.0 - z)  # Computes (1 - z_t)

# GRU custom cell implementation
def gru_cell(x_t, prev_hidden_state):
    combined = concatenate([x_t, prev_hidden_state])

    # Update gate (z_t)
    update_gate = update_dense(combined)

    # Reset gate (r_t)
    reset_gate = reset_dense(combined)

    # Candidate hidden state (h~), computed from the reset-scaled previous state
    reset_hidden = Multiply()([reset_gate, prev_hidden_state])
    candidate_hidden_state = candidate_dense(concatenate([x_t, reset_hidden]))

    # Final hidden state: h_t = (1 - z_t) * h_{t-1} + z_t * h~
    hidden_state = Add()([
        Multiply()([one_minus(update_gate), prev_hidden_state]),
        Multiply()([update_gate, candidate_hidden_state])
    ])

    return hidden_state

# Build the GRU model
inputs = Input(shape=(maxlen,))
embedding = Embedding(max_features, embedding_size)(inputs)

# Initialize a zero state whose batch dimension follows the actual input batch,
# so the model also works when the last batch is smaller than batch_size
zero_state = Lambda(lambda x: tf.zeros((tf.shape(x)[0], units)), output_shape=(units,))
hidden_state = zero_state(embedding)

# Process each time step through the GRU cell
# (unrolling maxlen steps builds a very large graph; lower maxlen when experimenting)
for t in range(maxlen):
    current_input = embedding[:, t, :]
    hidden_state = gru_cell(current_input, hidden_state)

# Final Dense layer for output
output = Dense(1, activation='sigmoid')(hidden_state)

# Create model
model = Model(inputs=inputs, outputs=output)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.datasets import imdb

# Set parameters
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Cut texts after this number of words (among top max_features most common words)
batch_size = 32

# Load the data (IMDb movie reviews)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure all input data has the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# Build the GRU model
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(GRU(128))  # GRU layer with 128 units
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')


##  GRU Applications in NLP

GRUs are often used in the same contexts as LSTMs, such as:

- **Machine Translation**: Translating text from one language to another.

- **Text Generation**: Generating coherent and contextually relevant text.

- **Speech Recognition**: Converting spoken language into text.

They often achieve comparable performance with fewer parameters (three gate-sized weight blocks instead of the LSTM's four), which typically makes them faster to train.
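
The parameter saving can be checked with a little arithmetic. The sketch below counts weights and biases directly from the gate equations in this notebook, assuming one weight matrix and one bias vector per gate; the helper names `lstm_param_count` and `gru_param_count` are made up for this example. Library implementations may count biases differently (for example, the Keras `GRU` layer uses an extra set of recurrent biases by default), so `model.summary()` can report a slightly different number.

```python
def lstm_param_count(input_dim, units):
    # Four blocks (forget, input, output, candidate), each with a weight
    # matrix acting on [h_{t-1}, x_t] plus a bias vector
    return 4 * (units * (units + input_dim) + units)

def gru_param_count(input_dim, units):
    # Three blocks (update, reset, candidate) of the same shape
    return 3 * (units * (units + input_dim) + units)

# Same sizes as the models in this notebook: 128-dim embeddings, 128 units
input_dim, units = 128, 128
lstm_params = lstm_param_count(input_dim, units)
gru_params = gru_param_count(input_dim, units)

print("LSTM parameters:", lstm_params)                 # 131584
print("GRU parameters: ", gru_params)                  # 98688
print("GRU / LSTM ratio:", gru_params / lstm_params)   # 0.75
```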


# Transformers: The Modern Approach

-  **Introduction to Transformers**

    - **Revolutionizing NLP**: Transformers have become the standard architecture for many NLP tasks. Unlike RNNs and their variants, transformers do not process data sequentially. Instead, they rely on a mechanism called self-attention to process all elements of the input sequence simultaneously, allowing them to capture long-range dependencies more effectively.

    - **Self-Attention Mechanism**:

     - **Attention Scores**: Each word in a sequence is assigned a score based on its relevance to other words in the sequence. This is achieved using queries ($Q$), keys ($K$), and values ($V$), which are matrices derived from the input:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the keys.

![attention.JPG](attachment:attention.JPG)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

     - **Multi-Head Attention**: The self-attention mechanism is applied multiple times in parallel, with different weight matrices, to capture different aspects of the input. The outputs are concatenated and linearly transformed.
 
     - **Positional Encoding**: Since transformers do not inherently process data sequentially, positional encoding is added to the input embeddings to give the model information about the position of each word in the sequence.

-  **Transformer Architecture**

    - **Encoder-Decoder Structure**: The original transformer model consists of an encoder (which processes the input sequence) and a decoder (which generates the output sequence).

    - **Encoder**: A stack of identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feedforward network.
    
    - **Decoder**: Similar to the encoder but with an additional multi-head attention mechanism that attends to the encoder's output.

- **Applications**:

  - **BERT (Bidirectional Encoder Representations from Transformers)**: A transformer-based model pre-trained on a large corpus of text in a bidirectional manner, making it highly effective for a variety of NLP tasks such as question answering and named entity recognition.

  - **GPT (Generative Pretrained Transformer)**: A transformer model designed for text generation. It is trained in an autoregressive manner, predicting the next word in a sequence.

  - **T5 (Text-To-Text Transfer Transformer)**: A model that treats every NLP problem as a text-to-text problem, enabling it to perform a wide range of tasks, from translation to summarization.
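
To illustrate the self-attention mechanism described in this section, here is a minimal NumPy sketch of single-head scaled dot-product attention, $\text{softmax}(QK^T/\sqrt{d_k})V$. The projection matrices `W_q`, `W_k`, `W_v`, the tiny dimensions, and the random input are assumptions for illustration; real transformer layers add multiple heads, masking, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # relevance of every token to every other token
    weights = softmax(scores, axis=-1)     # each row becomes a probability distribution
    return weights @ V, weights            # weighted sum of the value vectors

# Hypothetical sizes: 4 tokens, 8-dimensional embeddings and keys
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))    # stand-in for token embeddings

# Stand-ins for learned projection matrices
W_q = rng.normal(scale=0.1, size=(d_model, d_k))
W_k = rng.normal(scale=0.1, size=(d_model, d_k))
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print("output shape:", output.shape)
print("attention weights per token sum to:", weights.sum(axis=1))
```

Because every row of `weights` is computed at once from matrix products, all positions are processed in parallel rather than one step at a time as in an RNN.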


## Getting Started with NLTK

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. Let's start with a simple example using NLTK:

In [2]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize
from nltk import pos_tag

text = "Natural language processing is a subfield of artificial intelligence."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Part-of-Speech Tagging
pos_tags = pos_tag(tokens)
print("POS Tags:", pos_tags)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\didgostar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\didgostar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


Tokens: ['Natural', 'language', 'processing', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '.']
POS Tags: [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.')]


This example demonstrates basic tokenization and part-of-speech tagging using NLTK. In the upcoming sections, we'll dive deeper into these concepts and explore more advanced NLP techniques.

# Text Preprocessing in NLP

Text preprocessing is a crucial step in NLP that involves cleaning and transforming raw text data into a format suitable for analysis. This process helps to reduce noise in the text and improve the performance of NLP models.

 **Common Preprocessing Steps**


1. **Lowercasing**: Converting all text to lowercase to ensure consistency.

2. **Removing punctuation**: Eliminating punctuation marks that may not contribute to the meaning.

3. **Removing numbers**: Removing numerical digits if they're not relevant to the analysis.

4. **Removing whitespace**: Stripping extra spaces, tabs, and newlines.

5. **Removing stop words**: Eliminating common words that don't carry much meaning (e.g., "the", "is", "at").

6. **Stemming**: Reducing words to their root form (e.g., "running" to "run").

7. **Lemmatization**: Similar to stemming, but ensures the root word is a valid word (e.g., "better" to "good").

8. **Handling contractions**: Expanding contractions to their full form (e.g., "don't" to "do not").

9. **Removing HTML tags**: Cleaning text scraped from websites.

10. **Handling emojis and special characters**: Deciding whether to remove, replace, or keep these elements.
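
Several of these steps need nothing beyond the Python standard library. The sketch below chains steps 1 to 5, 8, and 9 on one sentence. The function name `basic_preprocess`, the tiny `STOP_WORDS` set, and the `CONTRACTIONS` table are assumptions made up for this example; real projects would rely on fuller resources such as the NLTK stop word list.

```python
import re
import string

# Deliberately tiny, illustrative resources
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "it", "i"}
CONTRACTIONS = {"don't": "do not", "it's": "it is", "they've": "they have"}

def basic_preprocess(text):
    text = text.lower()                                    # 1. lowercasing
    text = re.sub(r"<[^>]+>", " ", text)                   # 9. remove HTML tags
    for short, full in CONTRACTIONS.items():               # 8. expand contractions
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. punctuation
    text = re.sub(r"\d+", " ", text)                       # 3. numbers
    tokens = text.split()                                  # 4. split() collapses extra whitespace
    return [tok for tok in tokens if tok not in STOP_WORDS]  # 5. stop words

sample = "<p>The movie is   GREAT, and I don't regret watching it 2 times!</p>"
print(basic_preprocess(sample))
# ['movie', 'great', 'do', 'not', 'regret', 'watching', 'times']
```

The order matters: contractions are expanded before punctuation is stripped, otherwise "don't" would turn into "dont" and no longer match the table.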



## Preprocessing with NLTK and spaCy

We'll demonstrate text preprocessing using both NLTK and spaCy, two popular NLP libraries in Python.

### NLTK Example

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text_nltk(text):
    # Lowercase
    text = text.lower()
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove punctuation and numbers
    tokens = [token for token in tokens if token not in string.punctuation and not token.isdigit()]
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return ' '.join(tokens)

# Example usage
text = "The quick brown foxes are jumping over the lazy dogs! They've been doing this for 123 days."
preprocessed_text = preprocess_text_nltk(text)
print(preprocessed_text)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\didgostar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\didgostar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\didgostar\AppData\Roaming\nltk_data...


quick brown fox jump lazi dog 've day


In [8]:
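# Requires spaCy and its small English model:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")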

def preprocess_text_spacy(text):
    doc = nlp(text)
    
    # Tokenize and lemmatize
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_digit]
    
    return ' '.join(tokens)

# Example usage
text = "The quick brown foxes are jumping over the lazy dogs! They've been doing this for 123 days."
preprocessed_text = preprocess_text_spacy(text)
print(preprocessed_text)


In the next sections, we'll explore how to use these preprocessed texts for various NLP tasks such as feature extraction and text classification.

# Feature Extraction in NLP

Feature extraction is the process of transforming raw text data into numerical features that can be used by machine learning algorithms. This step is crucial in NLP as it bridges the gap between human-readable text and machine-understandable input.

**Common Feature Extraction Techniques**



1. **Bag of Words (BoW)**: Represents text as a multiset of words, disregarding grammar and word order.

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**: Weights each word by how often it occurs in a document, scaled down by how many documents in the collection contain it, so that words common to every document count for less.

3. **Word Embeddings**: Dense vector representations of words that capture semantic meanings.

4. **N-grams**: Contiguous sequences of n items from a given text.

5. **Part-of-Speech (POS) Features**: Grammatical features based on the role of words in sentences.

6. **Named Entity Recognition (NER) Features**: Features based on identified named entities in the text.

7. **Syntactic Features**: Based on the syntactic structure of sentences (e.g., dependency parsing).
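
N-grams are simple enough to build by hand. As a small illustration, the hypothetical helper `ngrams` below slides a window of size `n` over a token list; the sample sentence is made up for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()

print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

print(ngrams(tokens, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

Unlike single words, bigrams and trigrams keep some local word order, which a plain Bag of Words representation discards.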



## Implementing Feature Extraction

We'll demonstrate how to implement Bag of Words, TF-IDF, and Word Embeddings using popular Python libraries.


Bag of Words (BoW) with scikit-learn:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample texts
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps all day.",
    "The quick brown fox is quick."
]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit the vectorizer to the corpus and transform the texts
X = vectorizer.fit_transform(corpus)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the BoW representation
print("Bag of Words representation:")
print(X.toarray())
print("Feature names:", feature_names)
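
To make the counting explicit, here is a minimal pure-Python sketch that builds the same kind of matrix by hand for the corpus above. It assumes the default `CountVectorizer` behaviour of lowercasing and keeping tokens of two or more word characters; the helper `tokenize` is made up for this example.

```python
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps all day.",
    "The quick brown fox is quick."
]

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens in the corpus
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Each document becomes a vector of counts, one entry per vocabulary term
bow = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
# [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 2]
# [1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1]
# [0, 1, 0, 0, 1, 1, 0, 0, 0, 2, 0, 1]
```

Each column corresponds to one vocabulary word, so "the" (last column) is counted twice in the first sentence and "quick" twice in the third.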

TF-IDF with scikit-learn:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample texts (same as before)
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps all day.",
    "The quick brown fox is quick."
]

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the corpus and transform the texts
X = vectorizer.fit_transform(corpus)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF representation
print("TF-IDF representation:")
print(X.toarray())
print("Feature names:", feature_names)
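
The weighting itself can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency of $\ln\frac{1 + n}{1 + \text{df}(t)} + 1$, and L2 normalization of each document vector. The helper names `tokenize` and `tfidf_vector` are made up for this example.

```python
import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps all day.",
    "The quick brown fox is quick."
]

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Smoothed inverse document frequency: rarer terms get larger weights
idf = {}
for term in vocabulary:
    df = sum(1 for doc in docs if term in doc)   # number of documents containing the term
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(counts):
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]       # L2-normalize the document vector

tfidf = [tfidf_vector(doc) for doc in docs]

print("idf('the')   =", round(idf["the"], 4))    # appears in every document
print("idf('jumps') =", round(idf["jumps"], 4))  # appears in one document
```

A word such as "the" that occurs in every document receives the minimum weight of 1, while a word found in a single document is weighted more heavily.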

Word Embeddings with Gensim (Word2Vec):

In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')

# Sample texts (same as before)
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps all day.",
    "The quick brown fox is quick."
]

# Tokenize the texts
tokenized_corpus = [word_tokenize(text.lower()) for text in corpus]

# Train a Word2Vec model
# (three sentences are far too few for meaningful vectors; this only demonstrates the API)
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for a specific word
print("Vector for 'fox':", model.wv['fox'])

# Find similar words
print("Words similar to 'quick':", model.wv.most_similar('quick'))
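
The similarity ranking returned by `most_similar` is based on cosine similarity between word vectors. Here is a minimal pure-Python sketch of that measure; the three-dimensional vectors and the word "fast" are invented for illustration and are not output from the model above.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings
vectors = {
    "quick": [0.9, 0.1, 0.0],
    "fast":  [0.8, 0.2, 0.1],
    "dog":   [0.0, 0.3, 0.9],
}

# Rank every other word by its similarity to the query word
query = "quick"
ranked = sorted(
    ((word, cosine_similarity(vectors[query], vec))
     for word, vec in vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)

for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

Cosine similarity compares the direction of two vectors and ignores their length, so values near 1 indicate words used in similar contexts.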

These examples demonstrate how to extract features from text data using different techniques. In the next sections, we'll explore how to use these features for various NLP tasks such as text classification and clustering.