####  IMDB_Dense_Torch_0000

1. Run this notebook in a PyTorch virtual environment.

2. Examine the data structures that PyTorch dense models need in order to process data similar to the IMDB data.

3. Produce fake data that mimics the IMDB data, and process that data with a PyTorch dense model.

4. Because the structure of the artificial data is known, the processing results should be consistent with that data.


In [1]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split


In [2]:
# CUDA_LAUNCH_BLOCKING=1
# torch.backends.cudnn.deterministic = True


In [3]:
# 1. Create artificial data
#
# Mimic the IMDB data, which is typically a list of (review, label) tuples.
# Each review is a list of word indices (1...10000), ranked by word frequency
# in the original dataset: a lower index means a more frequent word.
# The labels are 0 or 1, indicating a bad or good review, respectively.

# The data is created by hand to check that the data preprocessing and the model work well.
# There are only three distinct reviews in this data, so a model should learn them and predict
# with high accuracy. Low accuracy would reveal problems in the data preprocessing,
# the model configuration, or both.

imdb_fake_data = [
    ([1, 2, 3, 4, 5], 1),  # Review: [word1, word2, ...], Label: 1
    ([6, 7, 8], 0),        # Review: [word1, word2, ...], Label: 0
    ([1, 2, 3, 4, 5], 1),
    ([1, 2, 3, 4, 5], 1),
    ([1, 2, 3, 4, 5], 1),
    ([1, 2, 3, 4, 5], 1),
    ([1, 2, 3, 4, 5], 1),
    ([1, 2, 3, 4, 5], 1),
    ([6, 7, 8], 0),
    ([6, 7, 8], 0),
    ([6, 7, 8], 0),
    ([6, 7, 8], 0),
    ([6, 7, 8], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    ([15, 16], 0),
    # ... more data
]
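
Before training, it can help to check how the labels are distributed, since that determines what accuracy a trivial model would reach. The sketch below is a minimal check, assuming the same three review patterns as the cell above; the name `fake_data` and the list-multiplication shorthand are only for illustration and are not how the notebook builds its data.

```python
from collections import Counter

# Compact rebuild of the fake data above: 7 positive reviews,
# then 6 + 9 negative reviews (the * shorthand is illustrative only).
fake_data = (
    [([1, 2, 3, 4, 5], 1)] * 7
    + [([6, 7, 8], 0)] * 6
    + [([15, 16], 0)] * 9
)

# Count reviews per label and per sequence length.
label_counts = Counter(label for _, label in fake_data)
length_counts = Counter(len(review) for review, _ in fake_data)

print(label_counts)
print(length_counts)
```

With 15 negative and 7 positive reviews, always predicting 0 would already be right on about 68% of the full set, so only accuracy well above that suggests the model has learned the patterns.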



### IMDBDataset class

Define class IMDBDataset for PyTorch's DataLoader.

The class wraps the movie reviews and the sentiment label (0 or 1) associated with each review.

1. The class inherits from PyTorch's Dataset class.

2. ```__init__(self, data)``` is the class constructor, which is called when a class instance is created.

3. ```self``` is a reference to the instance on which the method is called. ```self``` is used to access and modify instance variables.

4. ```data``` is the argument passed when the class is instantiated. It is a Python list of tuples, each holding a review and its label.

5. ```self.data = data``` stores the argument in the instance attribute ```self.data```. This makes the data accessible from all of the class's methods.

6. ```def __len__(self)``` returns the number of items in the dataset. PyTorch's DataLoader relies on it with its default sampling options, so a map-style dataset should define it.

7. ```def __getitem__(self, idx)``` is the method that every map-style PyTorch Dataset **must** implement. Given an index, it fetches and returns a review and its label.

8. ```idx``` is the index of the data item to retrieve.

9. ```return torch.tensor(review), torch.tensor(label)``` converts the review and the label into PyTorch tensors, the form that the padding, stacking, and model code downstream expect.


In [4]:

#
# 2. IMDBDataset(Dataset) is a class derived from torch.utils.data.Dataset.
# Understanding how to specialize this class from the base class is therefore crucial.
# 

class IMDBDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        review, label = self.data[idx]
        return torch.tensor(review), torch.tensor(label)  # Convert to tensors
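
To see what ```len(dataset)``` and ```dataset[idx]``` return, a dataset of this kind can be exercised directly. The sketch below repeats the class under the hypothetical name `TinyReviewDataset` so that it runs on its own; the two-item list is made up for illustration.

```python
import torch
from torch.utils.data import Dataset


class TinyReviewDataset(Dataset):  # hypothetical name, mirrors IMDBDataset
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        review, label = self.data[idx]
        return torch.tensor(review), torch.tensor(label)


dataset = TinyReviewDataset([([1, 2, 3, 4, 5], 1), ([6, 7, 8], 0)])

n_items = len(dataset)      # calls __len__
review, label = dataset[1]  # calls __getitem__(1)

print(n_items)                     # 2
print(review.tolist(), review.dtype)  # [6, 7, 8] torch.int64
print(label.item(), label.dim())      # 0 0  (the label is a 0-dimensional tensor)
```

Note that each review keeps its own length at this stage; making the lengths uniform is left to the collate function.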


### collate_fn

+ The function helps PyTorch's DataLoader handle batches of variable-length reviews.

+ It pads the sequences in a batch to a uniform length, which is needed to hold the batch in a single tensor.


1. ```def collate_fn(batch)``` defines the function, which receives a batch as input.

2. A batch is a list of data points. Each data point is whatever the ```__getitem__``` method of IMDBDataset returns. In this case, each data point is a (review, label) tuple of tensors.

3. The line ```reviews, labels = zip(*batch)``` unpacks and transposes the batch. If the batch is of the form ```[(review1, label1), (review2, label2), ...]``` then ```zip(*batch)``` yields the two tuples ```(review1, review2, ...)``` and ```(label1, label2, ...)```. Therefore ```reviews``` becomes a tuple containing all the reviews in the batch, and ```labels``` becomes a tuple containing all the labels.

4. The line ```review_lengths = torch.tensor([len(r) for r in reviews])``` records the length of each review in the batch in a PyTorch tensor called ```review_lengths```. After padding, all reviews in the batch have the same length, so the original lengths would otherwise be lost. The model in this notebook uses them to average over real tokens only; other models (e.g., recurrent neural networks) use them to avoid processing the padding.

5. The line ```padded_reviews = pad_sequence(reviews, batch_first=True, padding_value=0)``` performs the padding. ```pad_sequence``` is a PyTorch utility that takes a list of sequences (reviews in this case) and pads them to the length of the longest sequence in the batch.

6. ```reviews``` holds the reviews to pad.

7. ```batch_first=True``` tells pad_sequence to return the padded sequences with shape ```(batch_size, max_sequence_length)```. With the default, ```batch_first=False```, the shape would be ```(max_sequence_length, batch_size)```. The batch-first layout is the one the model in this notebook expects.

8. ```padding_value=0``` specifies the value used for padding. Index 0 is therefore reserved for padding and should not be used for a real word.

9. The line ```labels = torch.stack(labels)``` converts the tuple of labels into a single PyTorch tensor.

10. ```torch.stack``` joins the label tensors along a new dimension. Since each label is a scalar tensor (0 or 1), the result has shape ```(batch_size,)```.
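
The steps above can be traced on a small batch. This is a minimal sketch, assuming a made-up batch of three (review, label) pairs; the variable `time_major` is added only to contrast the two ```batch_first``` settings and is not part of the notebook's function.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# A made-up batch of three (review, label) pairs, as __getitem__ would return them.
batch = [
    (torch.tensor([1, 2, 3, 4, 5]), torch.tensor(1)),
    (torch.tensor([6, 7, 8]), torch.tensor(0)),
    (torch.tensor([15, 16]), torch.tensor(0)),
]

reviews, labels = zip(*batch)  # two tuples: all reviews, all labels
review_lengths = torch.tensor([len(r) for r in reviews])

padded = pad_sequence(list(reviews), batch_first=True, padding_value=0)
time_major = pad_sequence(list(reviews), batch_first=False, padding_value=0)
labels = torch.stack(labels)

print(padded.tolist())
# [[1, 2, 3, 4, 5], [6, 7, 8, 0, 0], [15, 16, 0, 0, 0]]
print(tuple(padded.shape), tuple(time_major.shape))  # (3, 5) (5, 3)
print(review_lengths.tolist())                       # [5, 3, 2]
print(tuple(labels.shape))                           # (3,)
```

The padded length is set by the longest review in each batch, so different batches may have different widths.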




In [5]:
# 3. Data Loading with Padding
def collate_fn(batch):
    reviews, labels = zip(*batch)
    review_lengths = torch.tensor([len(r) for r in reviews])  # Store original lengths
    padded_reviews = pad_sequence(reviews, batch_first=True, padding_value=0) # Pad to max length in batch
    labels = torch.stack(labels) # Stack labels
    return padded_reviews, labels, review_lengths


In [6]:
# Split training and test data. Training gets 80% of the data
train_data, test_data = train_test_split(imdb_fake_data, test_size=0.2, random_state=42)

train_dataset = IMDBDataset(train_data)
test_dataset = IMDBDataset(test_data)

## train_loader will be used during the training loops to load train data into the model
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

## test_loader will be used during the evaluation loops to load test data into the model
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)


In [7]:
# 4. Define a PyTorch Dense Model

class DenseModel(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super(DenseModel, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)  # Word embeddings
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, lengths):
        embedded = self.embedding(x)  # (batch_size, seq_len, embedding_dim)

        # Build the padding mask from the input indices (index 0 is padding)
        mask = x != 0  # (batch_size, seq_len) - True for non-padding, False for padding
        masked_embedded = embedded * mask.unsqueeze(-1).float() # Apply mask to embeddings

        # Average pooling over sequence length (handles variable lengths)
        lengths = lengths.unsqueeze(1).float()  # (batch_size, 1)
        pooled = masked_embedded.sum(dim=1) / lengths  # Average pool

        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.fc2(x)
        return x
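
The masked average in ```forward``` is what makes the result independent of how much padding a batch happens to contain. The sketch below illustrates this under stated assumptions: a small randomly initialized embedding table and a hypothetical helper `masked_mean` that repeats the pooling steps outside the model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small embedding table for illustration only (sizes are arbitrary).
embedding = nn.Embedding(num_embeddings=20, embedding_dim=4)

# The same review of length 3, once padded to length 5 and once unpadded.
padded = torch.tensor([[6, 7, 8, 0, 0]])
unpadded = torch.tensor([[6, 7, 8]])
lengths = torch.tensor([3])


def masked_mean(x, lengths):
    embedded = embedding(x)                       # (batch, seq_len, emb_dim)
    mask = (x != 0).unsqueeze(-1).float()         # (batch, seq_len, 1)
    summed = (embedded * mask).sum(dim=1)         # padding positions contribute 0
    return summed / lengths.unsqueeze(1).float()  # divide by the true length


same = torch.allclose(masked_mean(padded, lengths), masked_mean(unpadded, lengths))
print(same)  # True

# A plain mean over the sequence dimension would also average the padding vectors.
naive_mean = embedding(padded).mean(dim=1)
print(tuple(naive_mean.shape))  # (1, 4)
```

An alternative worth considering is ```nn.Embedding(..., padding_idx=0)```, which keeps the vector for index 0 at zero and excludes it from gradient updates; the explicit mask used in this notebook achieves the same effect on the pooled output.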
    

In [8]:

# 5. Training and Validation Loops
# input_dim = 10000  # Vocabulary size to use with the real IMDB data
input_dim = 100  # Vocabulary size for the fake data; must exceed the largest word index used
embedding_dim = 128  # Size of word embeddings
hidden_dim = 256
output_dim = 2  # Binary classification (0 or 1)

model = DenseModel(input_dim, embedding_dim, hidden_dim, output_dim)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available, otherwise CPU
model.to(device) # move model to the selected device

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
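
With ```output_dim = 2```, the model emits two raw scores (logits) per review, and ```nn.CrossEntropyLoss``` applies log-softmax internally, which is why the model has no softmax layer. The sketch below checks this by hand on made-up logits; the names `logits`, `manual`, and `predicted` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

# Made-up logits for a batch of 3 reviews and 2 classes.
logits = torch.tensor([[2.0, -1.0], [0.2, 0.9], [-1.0, 3.0]])
labels = torch.tensor([0, 1, 1])  # class indices (int64), shape (batch_size,)

loss = criterion(logits, labels)

# The same quantity by hand: log-softmax over classes, then the mean
# negative log-probability assigned to the correct class.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(3), labels].mean()

predicted = logits.argmax(dim=1)  # class with the larger logit

print(torch.allclose(loss, manual))  # True
print(predicted.tolist())            # [0, 1, 1]
```

The labels are passed as class indices, not one-hot vectors, which matches the ```(batch_size,)``` tensor produced by the collate function.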


In [9]:
# Change the number of epochs as necessary with the real IMDB data.
# 10 may be enough.
num_epochs = 10

for epoch in range(num_epochs):
    # Training
    model.train()  # Set model to training mode
    for padded_reviews, labels, lengths in train_loader:
        padded_reviews = padded_reviews.to(device)  # move data to the selected device
        labels = labels.to(device)
        lengths = lengths.to(device)
        optimizer.zero_grad()
        outputs = model(padded_reviews, lengths)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()  # Set model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():  # Disable gradients during validation
        for padded_reviews, labels, lengths in test_loader:
            padded_reviews = padded_reviews.to(device)  # move data to the selected device
            labels = labels.to(device)
            lengths = lengths.to(device)
            outputs = model(padded_reviews, lengths)
            _, predicted = torch.max(outputs, 1)  # index of the larger logit is the predicted class
            total += labels.size(0)
            correct += (predicted == labels).sum().item()  # count correct predictions

    accuracy = 100 * correct / total
    # Note: the loss printed here is that of the last training batch, not an epoch average
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Accuracy: {accuracy:.2f}%")

    

Epoch [1/10], Loss: 0.7322, Accuracy: 100.00%
Epoch [2/10], Loss: 0.5877, Accuracy: 100.00%
Epoch [3/10], Loss: 0.4747, Accuracy: 100.00%
Epoch [4/10], Loss: 0.3853, Accuracy: 100.00%
Epoch [5/10], Loss: 0.3146, Accuracy: 100.00%
Epoch [6/10], Loss: 0.2578, Accuracy: 100.00%
Epoch [7/10], Loss: 0.2118, Accuracy: 100.00%
Epoch [8/10], Loss: 0.1740, Accuracy: 100.00%
Epoch [9/10], Loss: 0.1428, Accuracy: 100.00%
Epoch [10/10], Loss: 0.1171, Accuracy: 100.00%


In [10]:
# 6. Evaluation (After Training) - More Detailed
model.eval()  # Set to evaluation mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use gpu if available
model.to(device) # move model to gpu

correct = 0
total = 0
all_predictions = []
all_labels = []

with torch.no_grad():
    for padded_reviews, labels, lengths in test_loader:
        padded_reviews = padded_reviews.to(device) # move data to gpu
        labels = labels.to(device) # move data to gpu
        lengths = lengths.to(device) # move data to gpu
        outputs = model(padded_reviews, lengths)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

        all_predictions.extend(predicted.cpu().numpy())  # Store predictions for later analysis
        all_labels.extend(labels.cpu().numpy())      # Store true labels

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")

# 7. Additional Evaluation Metrics (using scikit-learn)
from sklearn.metrics import classification_report, confusion_matrix

print("\nClassification Report:")
print(classification_report(all_labels, all_predictions))

print("\nConfusion Matrix:")
print(confusion_matrix(all_labels, all_predictions))


# Example of getting predictions for a single review
def predict(review_indices):
    model.eval()
    review_tensor = torch.tensor([review_indices]).to(device)  # Add batch dimension
    review_length = torch.tensor([len(review_indices)]).to(device)
    with torch.no_grad():
        output = model(review_tensor, review_length)
        _, predicted = torch.max(output, 1)
    return predicted.item()

# Example usage:
example_review = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # Replace with actual review indices
prediction = predict(example_review)
print(f"Prediction for example review: {prediction}")


Test Accuracy: 100.00%

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         1

    accuracy                           1.00         5
   macro avg       1.00      1.00      1.00         5
weighted avg       1.00      1.00      1.00         5


Confusion Matrix:
[[4 0]
 [0 1]]
Prediction for example review: 1
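
The confusion matrix above follows scikit-learn's convention: rows are true classes and columns are predicted classes. The sketch below rebuilds such a matrix in plain Python. The label lists are made up to match the support counts in the report (four reviews of class 0 and one of class 1), and `confusion` is a hypothetical helper, not the scikit-learn function.

```python
# Made-up labels matching the report above: four class-0 reviews and
# one class-1 review, all predicted correctly.
true_labels = [0, 0, 0, 0, 1]
predicted_labels = [0, 0, 0, 0, 1]


def confusion(true, pred, num_classes=2):
    # matrix[i][j] counts items whose true class is i and predicted class is j
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true, pred):
        matrix[t][p] += 1
    return matrix


matrix = confusion(true_labels, predicted_labels)
accuracy = sum(matrix[i][i] for i in range(2)) / len(true_labels)

print(matrix)    # [[4, 0], [0, 1]]
print(accuracy)  # 1.0
```

Because the test split holds only five reviews, one of them positive, a perfect score here mainly confirms that the preprocessing and the model are wired together correctly.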
