
#### **2.1. Preparing the Data**

In most machine learning tasks, data preparation is a critical first step. For Recurrent Neural Networks (RNNs), this is especially important because they work on sequential data such as text, speech, or time series. Here we are dealing with text data, and the goal is to classify names by their language of origin.

**Steps to Prepare the Data:**
1. **Dataset Selection**:
   - The example dataset consists of names from 18 languages, stored in separate text files named `[Language].txt`. Each file lists one name per line.

2. **Download and Organize the Data**:
   - Download the dataset files.
   - Place them in a directory structure that is easy to access during processing.

3. **Preprocessing the Data**:
   - Convert the names from Unicode to ASCII so that every character falls within the fixed set of letters the model can encode.
   - Organize the data by creating a dictionary, where each key represents a language and each value is a list of names corresponding to that language.

**Code to Prepare the Data:**


In [8]:
import glob  # Module to retrieve file paths using patterns
import os    # Module to interact with the operating system (e.g., file paths)
import unicodedata  # Provides access to Unicode character properties
import string  # Contains common string constants (like ASCII letters)

# Define all valid characters that we will allow in the processed names.
# This includes all ASCII letters (both uppercase and lowercase), plus the space and four punctuation characters: " .,;'"
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # Total number of valid characters

# Function to convert a Unicode string into plain ASCII.
# It first normalizes the string to 'NFD' form, which splits each accented character into a base letter plus combining marks.
# Then it drops every character in the 'Mn' category (nonspacing combining marks, such as accents),
# and finally keeps only the characters that appear in the allowed list (all_letters).
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'  # Exclude diacritic marks
        and c in all_letters  # Only keep characters that are in the predefined list (ASCII letters + punctuation)
    )

# Example usage of the unicodeToAscii function to convert a name with accented characters
# This will remove the accents and convert the name to a plain ASCII representation.
print(unicodeToAscii("Ślusàrski"))  # Output: Slusarski

# Dictionary to store names, categorized by language.
# Keys are language names (like 'Arabic', 'English', etc.), and values are lists of names from that language.
category_lines = {}

# List to store all the categories (languages) found in the dataset.
all_categories = []

# Function to read lines from a file, process each line by converting it to ASCII, and return the processed lines.
# The file is opened with UTF-8 encoding, and each line is stripped of leading/trailing whitespace.
# It returns a list of names, where each name has been converted to ASCII using unicodeToAscii.
def readLines(filename):
    with open(filename, encoding='utf-8') as f:
        return [unicodeToAscii(line.strip()) for line in f]

# Use glob to find all text files in the 'data/names/' directory that match the pattern '*.txt'.
# Each file contains names for a particular language, and the language is inferred from the filename.
# For each file:
# - Extract the category (language) from the filename.
# - Append the category to all_categories.
# - Read the lines (names) from the file, process them with readLines, and store them in the category_lines dictionary.
for filename in glob.glob('data/names/*.txt'):
    # Extract the category name by stripping the file extension from the filename
    # Example: 'data/names/Arabic.txt' -> 'Arabic'
    category = os.path.splitext(os.path.basename(filename))[0]

    # Add the category to the list of all categories
    all_categories.append(category)

    # Read the lines from the file, convert them to ASCII, and store them in the dictionary
    lines = readLines(filename)
    category_lines[category] = lines

# Get the number of categories (languages) we have loaded
n_categories = len(all_categories)

# Output the list of all categories (languages)
print(all_categories)  # Output: the 18 language names, in whatever order glob lists the files
print(n_categories)  # Output: 18
print(category_lines['Arabic'][:5])  # Output: ['Khoury', 'Nahas', 'Daher', 'Gerges', 'Nazari']


Slusarski
['Polish', 'English', 'Italian', 'German', 'Russian', 'Czech', 'Portuguese', 'Chinese', 'Arabic', 'Spanish', 'Scottish', 'Korean', 'Greek', 'Japanese', 'French', 'Dutch', 'Irish', 'Vietnamese']
18
['Khoury', 'Nahas', 'Daher', 'Gerges', 'Nazari']



**Explanation:**
- The `unicodeToAscii` function strips accents and drops any character that is not in `all_letters`. This matters because names from different languages often contain special characters, and the model can only encode the 57 characters in `all_letters`.
- The `category_lines` dictionary stores the names categorized by language, which will be used later to create tensors for training the model.
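
To make the accent-stripping step more concrete, the sketch below shows what NFD normalization does to the example name before the `'Mn'` filter is applied. It uses only the standard library; the variable names are illustrative and not part of the notebook.

```python
import unicodedata

# The example name "Ślusàrski", written with escapes so the input is unambiguous.
name = "\u015alus\u00e0rski"

# NFD splits each accented letter into a base letter followed by a combining mark.
decomposed = unicodedata.normalize("NFD", name)
for ch in decomposed[:2]:
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch))

# Dropping the 'Mn' (nonspacing mark) characters leaves only the base letters.
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(len(name), len(decomposed), stripped)

# Some letters have no decomposition at all, e.g. U+0142 (l with stroke).
# NFD leaves them unchanged, so only the all_letters filter removes them.
unchanged = unicodedata.normalize("NFD", "\u0142") == "\u0142"
print(unchanged)
```

One consequence worth keeping in mind: characters without a decomposition are removed entirely rather than mapped to a similar ASCII letter, so some names may come out shorter than the original.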

---



#### **2.2. Turning Names into Tensors**

Before feeding the data into the RNN, we need to convert the names (which are sequences of characters) into tensors that the RNN can process. This is done using **one-hot encoding**.

**One-hot Encoding Explanation:**
- A one-hot vector is a representation where all elements are zero, except for one element that is set to 1.
- Each letter in a name is represented by a one-hot vector of size `<1 x n_letters>`, where `n_letters` is the total number of possible characters (here 57: the 52 ASCII letters plus the space and four punctuation characters).
- A name is represented as a sequence of one-hot vectors.

**Code for One-hot Encoding and Tensor Conversion:**


In [11]:
import torch  # PyTorch library for creating and manipulating tensors

# Function to convert a single letter into a one-hot encoded tensor
# Each letter is represented by a tensor of size (1, n_letters), where n_letters is the total number of possible letters.
# A one-hot tensor means all elements are zero except for a single element that is 1, which corresponds to the letter's position.
def letterToTensor(letter):
    # Initialize a tensor of zeros with shape (1, n_letters).
    # This will represent the one-hot encoding of the letter.
    tensor = torch.zeros(1, n_letters)

    # Find the index of the letter in the 'all_letters' string.
    # Then, set the corresponding position in the tensor to 1.
    # This converts the letter into a one-hot encoded vector.
    tensor[0][all_letters.find(letter)] = 1
    return tensor

# Function to convert a name (a sequence of letters) into a 3D tensor.
# The resulting tensor will have shape (len(name), 1, n_letters), where:
# - len(name) is the number of characters in the name (time steps),
# - 1 is the batch size (each name is treated as a separate sequence),
# - n_letters is the size of the one-hot encoded vector for each character.
def nameToTensor(name):
    # Initialize a tensor of zeros with shape (len(name), 1, n_letters).
    # This tensor will store the one-hot encoded vectors for each letter in the name.
    tensor = torch.zeros(len(name), 1, n_letters)

    # Loop through each letter in the name and one-hot encode it.
    # 'enumerate(name)' gives us both the index (i) and the letter itself.
    for i, letter in enumerate(name):
        # Find the index of the letter in 'all_letters' and set the corresponding position in the tensor to 1.
        tensor[i][0][all_letters.find(letter)] = 1

    # Return the 3D tensor representing the name
    return tensor


# Example usage of the letterToTensor function
# This prints the one-hot encoded tensor for the letter 'J'.
# The output will be a tensor with a single row (1x57), where the 1 sits at index 35, the position of 'J' in all_letters.
print(letterToTensor('J'))  # Output: a 1x57 tensor of zeros with a single 1 at index 35

# Example usage of the nameToTensor function
# This converts the name 'Jones' into a 3D tensor where each letter is one-hot encoded.
# The resulting tensor will have shape (5, 1, 57) because 'Jones' has 5 letters, each represented by a one-hot encoded vector of size 57.
print(nameToTensor('Jones').size())  # Output: torch.Size([5, 1, 57])
print(nameToTensor('Jones'))


tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0.,


**Explanation:**
- The `letterToTensor` function converts a single letter into a one-hot encoded tensor.
- The `nameToTensor` function converts a sequence of letters (i.e., a name) into a tensor of shape `(len(name), 1, n_letters)`, holding one one-hot encoded vector per letter.
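
As a possible alternative to the explicit loop, the same encoding can be sketched with `torch.nn.functional.one_hot`. The helper name `name_to_tensor_vectorized` is hypothetical and not part of the notebook, and `all_letters` is redefined here only so the block runs on its own.

```python
import string

import torch
import torch.nn.functional as F

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def name_to_tensor_vectorized(name):
    # Hypothetical helper: look up every letter's index first.
    # str.index raises ValueError for characters outside all_letters,
    # whereas str.find returns -1, which would silently mark the last position.
    indices = torch.tensor([all_letters.index(ch) for ch in name])
    # one_hot returns integer tensors, so convert to float for the linear layers.
    one_hot = F.one_hot(indices, num_classes=n_letters).float()
    # Insert the batch dimension: (len(name), n_letters) -> (len(name), 1, n_letters).
    return one_hot.unsqueeze(1)

t = name_to_tensor_vectorized("Jones")
print(t.shape)
print(t.argmax(dim=2).squeeze(1).tolist())  # index of the 1 in each row
```

Because the names are already filtered by `unicodeToAscii`, the stricter `index` lookup should never fail on the dataset itself; it mainly guards against unfiltered input at prediction time.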

---



#### **2.3. Creating the RNN Model**

Now that the data is prepared, we can focus on building the RNN model. In PyTorch, we define an RNN by creating a class that inherits from `torch.nn.Module`.

**Model Architecture:**
- The RNN takes an input (a one-hot vector representing a character) and a hidden state from the previous time step.
- The hidden state gets updated with each time step based on the input and the previous hidden state.
- At each time step, the RNN also outputs a score for every language category; the output after the last character serves as the prediction.

**Code for Defining the RNN Model:**


In [14]:
import torch.nn as nn
import torch.nn.functional as F
import torch

# Define a simple RNN class that extends nn.Module
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        # Store the hidden layer size
        self.hidden_size = hidden_size

        # Define the linear layer that computes the next hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)

        # Define the linear layer that computes the output
        self.i2o = nn.Linear(input_size + hidden_size, output_size)

        # Define the LogSoftmax layer to convert outputs to log-probabilities
        self.softmax = nn.LogSoftmax(dim=1)

    # Define the forward pass through the RNN
    def forward(self, input, hidden):
        print(f"Input size: {input.size()}")  # Display size of input
        print(f"Hidden size before update: {hidden.size()}")  # Display size of hidden state

        # Concatenate the input and hidden state
        combined = torch.cat((input, hidden), 1)
        print(f"Combined input-hidden size: {combined.size()}")  # Display size after concatenation

        # Compute the next hidden state
        hidden = self.i2h(combined)
        print(f"Hidden state after update: {hidden.size()}")  # Display size of updated hidden state

        # Compute the output
        output = self.i2o(combined)
        print(f"Output before softmax: {output}")  # Display raw output (logits)

        # Apply log-softmax to turn the raw outputs (logits) into log-probabilities
        output = self.softmax(output)
        print(f"Output after softmax: {output}")  # Display log-probabilities

        return output, hidden

    # Initialize the hidden state with zeros
    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)


In [17]:
# n_letters is the number of allowed characters and n_categories is the number of output classes (both computed in 2.1)
n_letters = 57  # 52 ASCII letters (uppercase and lowercase) plus the 5 characters in " .,;'"
n_categories = 18  # Number of output categories (languages)

# Example usage of the RNN class
n_hidden = 128  # Hidden state size

# Create an instance of the RNN
rnn = RNN(n_letters, n_hidden, n_categories)

# Create a dummy input tensor for the letter 'A'
input_tensor = letterToTensor('A')
print(f"Input tensor for 'A': {input_tensor}")

# Initialize the hidden state to zeros
hidden_tensor = rnn.init_hidden()
print(f"Initial hidden state: {hidden_tensor}")

# Pass the input tensor and the initial hidden state through the RNN
print("\n--- Forward pass ---")
output, next_hidden = rnn(input_tensor, hidden_tensor)

# Final output
print(f"\nFinal Output: {output}")
print(f"Final Output Shape: {output.shape}")

# Final hidden state

print(f"Next Hidden State: {next_hidden}")
print(f"Next Hidden State Shape: {next_hidden.shape}")


Input tensor for 'A': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
Initial hidden state: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.]])

--- Forward pass ---
Input size: torch.Size([1, 57])
Hidden size before update: torch.Size([1, 128])
Combined input-hidde


**Explanation:**
- The `RNN` class defines an RNN model with two fully connected layers: one for updating the hidden state (`i2h`) and one for generating the output (`i2o`).
- The `forward` method takes an input tensor and a hidden state, concatenates them, and passes them through the network to produce an output and an updated hidden state.
- The `init_hidden` method initializes the hidden state to zeros before the first time step.
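
The forward pass above can also be written as a recurrence. The symbols below are notation introduced here for illustration (they do not appear in the code): $x_t$ is the one-hot input at step $t$, $h_{t-1}$ the previous hidden state, $c_t$ their concatenation, and $W$, $b$ the weights and biases of the two linear layers.

$$
\begin{aligned}
c_t &= [\,x_t \,;\, h_{t-1}\,] \\
h_t &= W_{i2h}\, c_t + b_{i2h} \\
o_t &= \log \operatorname{softmax}\!\left(W_{i2o}\, c_t + b_{i2o}\right)
\end{aligned}
$$

With the sizes used in this section, $x_t$ has 57 entries, $h_t$ has 128, $c_t$ has $57 + 128 = 185$, and $o_t$ has 18. Note that, as written, `forward` applies no nonlinearity (such as $\tanh$) to $h_t$, so the hidden-state update is linear in $c_t$.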

**Components of the Model**:
- **Input-to-Hidden Layer (`i2h`)**: Maps the concatenated input and previous hidden state to the new hidden state.
- **Input-to-Output Layer (`i2o`)**: Maps the same concatenated input and previous hidden state to one raw score (logit) per language category.
- **LogSoftmax Layer**: Converts the logits into log-probabilities, indicating how likely the input is to belong to each category.
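
Since the last layer is `nn.LogSoftmax` rather than a plain softmax, its outputs are log-probabilities. The short sketch below, using random logits in place of real model output, shows how they relate to ordinary probabilities.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the illustration repeatable

# Stand-in for the raw scores of one input over 18 categories.
logits = torch.randn(1, 18)

log_probs = nn.LogSoftmax(dim=1)(logits)  # what the RNN's forward returns
probs = log_probs.exp()                   # exponentiate to recover probabilities

print(probs.sum().item())  # approximately 1.0
# Log-softmax is monotonic, so the most likely category is unchanged.
same_top = log_probs.argmax(dim=1).item() == logits.argmax(dim=1).item()
print(same_top)
```

Log-probabilities are all less than or equal to zero, and they are the input format that `nn.NLLLoss` expects, which is one common reason to pair the two.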

---



#### **2.4. Understanding the Flow of Data**

**Data Flow in the RNN**:
1. **Input Processing**:
   - A name is broken down into its constituent letters.
   - Each letter is converted into a one-hot vector.
2. **Sequential Processing**:
   - The RNN processes each letter in the name one by one.
   - At each time step, the RNN takes the current letter and the hidden state from the previous step to compute a new hidden state and an output.
3. **Final Prediction**:
   - After processing the last letter in the sequence, the final output is taken as the prediction of the language category.

**Explanation of the Flow**:
- Each letter of the name is processed sequentially.
- The hidden state is updated at each time step, allowing the RNN to "remember" information from earlier in the sequence.
- The output at the final time step, which depends on the hidden state accumulated over the whole name, is used to classify the name based on the patterns the model has learned during training.
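
The flow described above can be sketched end to end as follows. This is a minimal, self-contained illustration: `TinyRNN` mirrors the `RNN` class from 2.3 without the debug prints, `name_to_tensor` mirrors `nameToTensor`, and the three-entry `categories` list is a placeholder. The model is untrained, so the predicted category is arbitrary.

```python
import string

import torch
import torch.nn as nn

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

class TinyRNN(nn.Module):
    """Same structure as the RNN class in 2.3, minus the print statements."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, hidden):
        combined = torch.cat((x, hidden), 1)
        return self.softmax(self.i2o(combined)), self.i2h(combined)

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

def name_to_tensor(name):
    # One-hot encode each letter: shape (len(name), 1, n_letters).
    tensor = torch.zeros(len(name), 1, n_letters)
    for i, letter in enumerate(name):
        tensor[i][0][all_letters.find(letter)] = 1
    return tensor

categories = ["Arabic", "Chinese", "Czech"]  # placeholder subset for illustration
rnn = TinyRNN(n_letters, 128, len(categories))

name_tensor = name_to_tensor("Jones")
hidden = rnn.init_hidden()           # zeros before the first letter
with torch.no_grad():                # inference only, no gradients needed
    for t in range(name_tensor.size(0)):
        # Each step consumes one letter and the hidden state from the previous step.
        output, hidden = rnn(name_tensor[t], hidden)

# Only the output after the last letter is used as the prediction.
best = output.argmax(dim=1).item()
print(categories[best])
```

The intermediate outputs are computed at every step but discarded; only the hidden state carries information forward from one letter to the next.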
