### Importing necessary libraries

In [1]:
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning import Trainer
from datamodule import TextDataModule
from pl_model import ModelLSTM



# Create data loader

## TextDataModule

The `TextDataModule` is a custom PyTorch Lightning data module that owns the dataset and the tokenizer used in this project. It loads, preprocesses, and batches the data for model training, validation, and testing.

### Arguments

- `config_path`: Path to a JSON configuration file for the dataset. It holds settings such as data paths and preprocessing steps needed to set up the dataset.
  - Example: `./configs/dataset_config.json`

- `tokenizer_path`: Path to the tokenizer file, which is used to convert text into a format the model can consume. Here the tokenizer is character-level and stored as a JSON file.
  - Example: `character_level_tokenizer.json`

- `batch_size`: Number of samples in each batch fed to the model. Adjust it to fit memory constraints or the size of your dataset.
  - Example: `512`
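
To make the idea of a character-level tokenizer concrete, here is a minimal pure-Python sketch. It is not the project's tokenizer: the class name `CharTokenizerSketch`, the `<pad>` and `<unk>` tokens, and the `encode` signature are assumptions made for illustration. Only `get_vocab_size()` mirrors a method name used later in this notebook.

```python
class CharTokenizerSketch:
    """Hypothetical character-level tokenizer, for illustration only."""

    def __init__(self, corpus, pad_token="<pad>", unk_token="<unk>"):
        # Every distinct character in the corpus becomes one token
        chars = sorted(set("".join(corpus)))
        self.id_to_token = [pad_token, unk_token] + chars
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}
        self.pad_id = 0
        self.unk_id = 1

    def get_vocab_size(self):
        # Number of distinct ids, including the two special tokens
        return len(self.id_to_token)

    def encode(self, text, max_len):
        # Truncate to max_len, map unknown characters to unk_id, then pad
        ids = [self.token_to_id.get(ch, self.unk_id) for ch in text[:max_len]]
        return ids + [self.pad_id] * (max_len - len(ids))


tokenizer = CharTokenizerSketch(["abc", "cab"])
encoded = tokenizer.encode("abz", max_len=5)
print(tokenizer.get_vocab_size())  # 5
print(encoded)  # [2, 3, 1, 0, 0]
```

The vocabulary size reported by such a tokenizer is the kind of value a model with an embedding layer needs as its input dimension.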

### Methods

- **`setup()`**: Prepares the data for use: it loads the dataset, applies the necessary preprocessing, and tokenizes the data. The Lightning `Trainer` calls `setup()` itself when training starts; calling it manually, as done here, makes the prepared data available before the model is built.

### Example Usage

Here’s an example of how you can instantiate and use the `TextDataModule`:

```python
# Initialize the data module with a dataset configuration, tokenizer, and batch size
data_module = TextDataModule(
    config_path="./configs/dataset_config.json",  # Path to dataset config
    tokenizer_path='character_level_tokenizer.json',  # Path to tokenizer
    batch_size=512  # Set batch size
)

# Setup the data module (e.g., load and process dataset)
data_module.setup()

# You can now access data loaders using:
train_loader = data_module.train_dataloader()
val_loader = data_module.val_dataloader()
test_loader = data_module.test_dataloader()
```

In [2]:
data_module = TextDataModule(
    config_path="./configs/dataset_config.json",  # Path to dataset config
    tokenizer_path='character_level_tokenizer.json',  # Path to tokenizer
    batch_size=512  # Customize batch size if needed
)
data_module.setup()

In [5]:
input_dim = data_module.tokenizer.get_vocab_size()

# Training the BiLSTM model

## Running the ModelLSTM with PyTorch Lightning

This section describes how to set up and train the `ModelLSTM` using PyTorch Lightning. We will define the model, set up callbacks, initialize the trainer, and then train the model.

### Step 1: Model Initialization

We initialize the `ModelLSTM` with the following parameters:
- `in_dim`: The input dimension. The executed cell below passes the tokenizer's vocabulary size (`input_dim`); the value 40 in the snippet is only an example.
- `embedding_dim`: The size of the embedding layer. Here, it is set to 64.
- `hidden_dim`: The number of hidden units in the LSTM. Set to 64.
- `out_dim`: The output dimension. For binary classification, this is 1.
- `max_len`: The maximum sequence length, retrieved from the data module.
- `lr`: The learning rate, set to 0.002.
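
The source of `ModelLSTM` is not shown in this notebook, so the following is only a rough sketch of how a bidirectional LSTM classifier could use `in_dim`, `embedding_dim`, `hidden_dim`, and `out_dim`. The class name `BiLSTMClassifierSketch`, the layer layout, and the sigmoid output are assumptions; `max_len` and `lr` are left out because they do not affect the forward pass in this sketch.

```python
import torch
from torch import nn


class BiLSTMClassifierSketch(nn.Module):
    """Hypothetical stand-in for ModelLSTM; the real architecture may differ."""

    def __init__(self, in_dim, embedding_dim, hidden_dim, out_dim):
        super().__init__()
        # in_dim is treated as the number of distinct token ids (an assumption)
        self.embedding = nn.Embedding(in_dim, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim, hidden_dim, batch_first=True, bidirectional=True
        )
        # Forward and backward final states are concatenated: 2 * hidden_dim
        self.head = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (2, batch, hidden_dim)
        final = torch.cat([h_n[0], h_n[1]], dim=1)  # (batch, 2 * hidden_dim)
        return torch.sigmoid(self.head(final))      # values between 0 and 1


sketch = BiLSTMClassifierSketch(in_dim=40, embedding_dim=64, hidden_dim=64, out_dim=1)
dummy_batch = torch.randint(0, 40, (8, 32))  # 8 sequences of 32 token ids
probs = sketch(dummy_batch)
print(probs.shape)  # torch.Size([8, 1])
```

With `out_dim=1` and a sigmoid, each sequence gets a single score that can be thresholded at 0.5 to obtain a binary label.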

```python
model = ModelLSTM(
    in_dim=40,
    embedding_dim=64,  # Embedding dimension
    hidden_dim=64,  # Hidden dimension for LSTM
    out_dim=1,  # Output dimension: a single unit for binary classification
    max_len=data_module.max_len[0],  # Max sequence length from the data module
    lr=0.002,  # Learning rate
)
```


In [None]:
model = ModelLSTM(
    in_dim=input_dim,
    embedding_dim=64,  # Embedding dimension
    hidden_dim=64,  # Hidden dimension for LSTM
    out_dim=1,  # Output dimension: a single unit for binary classification
    max_len=data_module.max_len[0],
    lr=0.002,
)

# Step 2: Create callbacks
# Keep only the checkpoint with the lowest validation loss, saved as ./models/model.ckpt
checkpoint_callback = ModelCheckpoint(
    dirpath="./models",
    save_top_k=1,
    monitor="val_loss",
    filename="model",
)

# Step 3: Initialize the Trainer
trainer = Trainer(
    max_epochs=3,  # Number of epochs
    callbacks=[checkpoint_callback],
)

# Step 4: Train the model
trainer.fit(model, data_module)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
d:\anaconda\envs\peds\lib\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA GeForce RTX 3060 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_ma

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

d:\anaconda\envs\peds\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  6.73it/s]Confusion Matrix:
tensor([[  0, 512],
        [  0, 512]], device='cuda:0')
                                                                           

d:\anaconda\envs\peds\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


Epoch 0: 100%|██████████| 1381/1381 [02:13<00:00, 10.37it/s, v_num=43, train_loss=5.99e-5, train_acc=1.000, train_f1=1.000] Confusion Matrix:
tensor([[74982,    14],
        [    1, 76514]], device='cuda:0')
Epoch 1:  11%|█▏        | 158/1381 [00:18<02:21,  8.62it/s, v_num=43, train_loss=2.84e-5, train_acc=1.000, train_f1=1.000, val_loss=0.000281, val_acc=1.000, val_f1=1.000] 
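
The `ModelCheckpoint` callback above is configured with `save_top_k=1` and `monitor="val_loss"`, and its default mode is `"min"`, so only the checkpoint with the lowest validation loss is kept. The selection rule can be sketched in plain Python as follows. The helper `best_checkpoint` is hypothetical, and only the first loss value comes from the log above; the other two are made up for illustration.

```python
def best_checkpoint(val_losses):
    """Return (epoch, loss) for the lowest monitored value, as with save_top_k=1."""
    best_epoch = min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])
    return best_epoch, val_losses[best_epoch]


# Epoch 0 is the val_loss from the log above; epochs 1 and 2 are made-up values
losses = [0.000281, 0.000350, 0.000190]
epoch, loss = best_checkpoint(losses)
print(f"best epoch: {epoch}, val_loss: {loss:.6f}")  # best epoch: 2, val_loss: 0.000190
```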

In [4]:
inference_model = ModelLSTM.load_from_checkpoint("models/model.ckpt").to("cuda")

In [5]:
inference_model.device

device(type='cuda', index=0)

### Evaluation

In [6]:
import torchmetrics

# Initialize metrics for binary classification
accuracy_metric = torchmetrics.Accuracy(task="binary")
precision_metric = torchmetrics.Precision(task="binary")
recall_metric = torchmetrics.Recall(task="binary")
f1_metric = torchmetrics.F1Score(task="binary")
confusion_matrix = torchmetrics.ConfusionMatrix(task="binary", num_classes=2)

# Get the test dataloader from the data module
test_dataloader = data_module.test_dataloader()

# Iterate over the test dataset and calculate metrics
for batch in test_dataloader:
    # Get the inputs and labels
    labels = batch['label']
    
    # Make predictions using the model
    preds = trainer.model.predict_step(batch)

    # Threshold the model outputs at 0.5 to get class labels (0 or 1)
    preds = (preds > 0.5).long().flatten()

    # Update metrics with predictions and true labels
    accuracy_metric.update(preds, labels)
    precision_metric.update(preds, labels)
    recall_metric.update(preds, labels)
    f1_metric.update(preds, labels)
    confusion_matrix.update(preds, labels)

# Compute final metrics
accuracy = accuracy_metric.compute()
precision = precision_metric.compute()
recall = recall_metric.compute()
f1 = f1_metric.compute()
cm = confusion_matrix.compute()

# Print the results
print(f"Confusion Matrix:\n{cm}")
print(f"Test Accuracy: {accuracy.item():.4f}")
print(f"Test Precision: {precision.item():.4f}")
print(f"Test Recall: {recall.item():.4f}")
print(f"Test F1 Score: {f1.item():.4f}")

# Reset metrics for potential future use
accuracy_metric.reset()
precision_metric.reset()
recall_metric.reset()
f1_metric.reset()
confusion_matrix.reset()


Confusion Matrix:
tensor([[74988,     9],
        [    1, 76514]])
Test Accuracy: 0.9999
Test Precision: 0.9999
Test Recall: 1.0000
Test F1 Score: 0.9999
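
As a cross-check, the reported scores can be recomputed by hand from the confusion matrix above. The sketch below assumes the usual binary layout with rows as true labels and columns as predicted labels, i.e. `[[TN, FP], [FN, TP]]`; the helper name `binary_metrics` is made up for illustration.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the test confusion matrix printed above
metrics = binary_metrics([[74988, 9], [1, 76514]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9999
# precision: 0.9999
# recall: 1.0000
# f1: 0.9999
```

The recomputed values agree with the torchmetrics output: 9 false positives and 1 false negative out of 151,512 test samples.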
