# Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

### Requirements

Minimum requirements to run Colossal-AI:

- PyTorch >= 1.11 and PyTorch <= 2.1
- Python >= 3.7
- CUDA >= 11.0
- NVIDIA GPU Compute Capability >= 7.0 (V100/RTX20 and higher)
- Linux OS

**Now, let's dive into the code and walk through each step of the implementation.**

### Import Required Packages

In this section, we import the packages and libraries that the code relies on:

- `os`: Allows interaction with the operating system, such as setting environment variables which will be used for ColossalAI.
- `time`: Provides functions for working with time-related tasks, which may be useful for timing experiments.
- `warnings`: Enables handling of warning messages to control their display.
- `torch`: The core library for all tensor computations and neural network operations in PyTorch.
- `torch.distributed`: Facilitates distributed computing, crucial for parallel training across multiple GPUs.
- `torch.nn`: Contains neural network modules, such as layers, activations, loss functions, etc.
- `torchvision`: Offers popular datasets, model architectures, and image transformations for computer vision tasks.
- `torch.optim`: Provides optimization algorithms, such as Adam, SGD, etc., for training neural networks.
- `torch.nn.functional`: Includes functional interface for operations like convolution, activation functions, etc.
- `torchvision.transforms`: Contains image transformation functions for data augmentation and preprocessing.
- `torch.optim.Optimizer`: Base class for all optimizers in PyTorch, used for defining custom optimizers.
- `torch.optim.lr_scheduler.MultiStepLR`: Scheduler for learning rate updates based on predefined milestones.
- `torch.utils.data.DataLoader`: Allows efficient loading of data in batches, essential for training neural networks.
- `colossalai`: The core library for Colossal-AI, providing tools and components for large-scale parallel training.
- `colossalai.accelerator`: Provides functions for interacting with the hardware accelerator, such as GPUs.
- `colossalai.booster`: Provides the `Booster` class, the main entry point for applying Colossal-AI features such as distributed data parallelism to a model, optimizer, and data loader.
- `colossalai.booster.plugin.TorchDDPPlugin`: Plugin for integrating Colossal-AI with PyTorch's Distributed Data Parallel (DDP).
- `colossalai.cluster.DistCoordinator`: Coordinates distributed computations across multiple nodes.
- `colossalai.nn.optimizer.HybridAdam`: Colossal-AI's Adam implementation, which can update parameters that reside on either the GPU or the CPU.
- `matplotlib.pyplot`: Library for creating static, interactive, and animated visualizations in Python.
- `seaborn`: Provides high-level interface for drawing attractive and informative statistical graphics.
- `numpy`: Fundamental package for scientific computing with Python, used for numerical operations.
- `torchmetrics.classification.MulticlassROC`: Metric for computing ROC curves in multi-class classification tasks.
- `sklearn.metrics`: Offers functions for evaluating model performance, such as precision, recall, etc.

We also use `warnings.filterwarnings("ignore")` to suppress warning messages for cleaner output during execution.


### Set up distributed training environment variables

In this section, we set up environment variables required for distributed training:

- `os.environ["RANK"] = "0"`: Specifies the rank of the current process in the distributed environment. Here, we set it to 0, indicating it as the first process.
- `os.environ["LOCAL_RANK"] = "0"`: Defines the local rank of the current process, which is also set to 0 in this case.
- `os.environ["WORLD_SIZE"] = "1"`: Specifies the total number of processes participating in the distributed training. Since we're not distributing the training in this demo, the world size is set to 1.
- `os.environ["MASTER_ADDR"] = "localhost"`: Sets the address of the master node, which is the local machine in this case, indicated by "localhost".
- `os.environ["MASTER_PORT"] = "1234"`: Specifies the port for communication between different processes in the distributed environment, which is set to 1234 for this demo.
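
The settings listed above can be grouped into a small helper. This is a minimal sketch: the function name `configure_single_process_env` is hypothetical, while the variable names and values are the ones given in the list.

```python
import os


def configure_single_process_env(port: str = "1234") -> dict:
    """Set the variables that a one-process distributed run reads at start-up."""
    settings = {
        "RANK": "0",              # global rank of this process
        "LOCAL_RANK": "0",        # rank of this process on its own machine
        "WORLD_SIZE": "1",        # total number of participating processes
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": port,
    }
    os.environ.update(settings)
    return settings


env = configure_single_process_env()
print(env["WORLD_SIZE"])  # 1
```

When a script is started with `torchrun`, these variables are set by the launcher, so setting them by hand is only needed in a notebook or a plain `python` session.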

### Set device

In this section, we determine the device on which our computations will be performed:

- `torch.device("cuda" if torch.cuda.is_available() else "cpu")`: Creates a PyTorch device object based on the availability of CUDA. If CUDA is available, the device is set to the GPU (`"cuda"`); otherwise, it falls back to the CPU (`"cpu"`). The device string must be lowercase, because `torch.device("CPU")` raises an error.

This step lets the same code run on a GPU when one is present and on the CPU otherwise.
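
As a minimal sketch of the device selection described above (the tensor `x` is only an illustration and is not part of the original code):

```python
import torch

# torch.device expects lowercase type strings such as "cuda" or "cpu".
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors and models are moved to the selected device with .to(device).
x = torch.zeros(2, 3).to(device)
print(device.type, x.device.type)
```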

### Define Hyperparameters

In this section, we define the hyperparameters required for training our neural network model:

- `BATCH_SIZE`: Specifies the number of samples in each batch during training. A larger batch size processes more samples per step and can shorten each epoch, but it requires more memory.
- `NUM_EPOCHS`: Indicates the number of times the entire dataset will be passed forward and backward through the neural network during training. Each epoch consists of multiple iterations (batches).
- `LEARNING_RATE`: Defines the rate at which the model's parameters are updated during optimization. It determines the step size for gradient descent and impacts the convergence speed and final performance of the model.
- `GAMMA`: Represents the factor by which to reduce the learning rate at each milestone specified in the learning rate scheduler. It is used to decay the learning rate over time, enabling finer control over the optimization process.

These hyperparameters strongly affect how quickly and how well the model converges, so they usually need tuning for each dataset and model.


### Initialize ColossalAI, Plugin, and Booster

In this section, we initialize ColossalAI, along with its plugin and booster components:

- `colossalai.launch_from_torch(config={})`: Initializes Colossal-AI's distributed environment from the environment variables set earlier (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`). Comment out this line if Colossal-AI has already been launched in the current session.
- `coordinator = DistCoordinator()`: Creates a distributed coordinator object responsible for coordinating distributed computations across multiple nodes. It manages communication between processes during parallel training.
- `plugin = TorchDDPPlugin()`: Initializes a plugin for integrating ColossalAI with PyTorch's Distributed Data Parallel (DDP) module. This plugin facilitates efficient distributed training by managing data parallelism across multiple GPUs.
- `booster = Booster(plugin=plugin)`: Initializes a booster object, which serves as the main interface for utilizing ColossalAI's capabilities. It is configured with the previously created plugin to leverage distributed data parallelism for enhanced model training.

These initializations lay the foundation for leveraging ColossalAI's features, enabling efficient large-scale parallel training of neural network models.
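
Put together, the four calls might look like the sketch below. The wrapper name `init_colossalai` is hypothetical, and the Colossal-AI imports sit inside the function only so that the snippet can be loaded on a machine where Colossal-AI is not installed. The `config={}` argument follows the call shown above; it is an assumption that the installed release still accepts it.

```python
def init_colossalai():
    """Launch Colossal-AI and build the coordinator, plugin and booster."""
    # Imported lazily so this module still loads where Colossal-AI is absent.
    import colossalai
    from colossalai.booster import Booster
    from colossalai.booster.plugin import TorchDDPPlugin
    from colossalai.cluster import DistCoordinator

    # Reads RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    # from the environment, so those must be set before this call.
    colossalai.launch_from_torch(config={})

    coordinator = DistCoordinator()   # rank and world-size helper
    plugin = TorchDDPPlugin()         # data parallelism via PyTorch DDP
    booster = Booster(plugin=plugin)  # entry point used later for boost()
    return coordinator, plugin, booster
```

On a machine that meets the requirements, `coordinator, plugin, booster = init_colossalai()` would be called once, after the environment variables are in place.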


### Transformation and Data Loaders

In this section, we define transformations, download the dataset, and create data loaders:

#### Transformations:
- `transform_train`: Composes a series of transformations for the training dataset, including converting images to tensors and normalizing pixel values.
- `transform_test`: Composes transformations for the testing dataset, similar to the training set.

#### MNIST Dataset:
- We download the MNIST dataset provided by the torchvision library and apply the defined transformations. The dataset consists of handwritten digits from 0 to 9, with separate training and testing splits. We then load it with PyTorch's `DataLoader`, making it ready for model training.

#### Data Loaders:
- `train_dataloader`: Constructs a data loader for the training dataset, which iterates over batches of data during training. It shuffles the data and drops the last incomplete batch.
- `test_dataloader`: Creates a data loader for the testing dataset, ensuring that batches are not shuffled or dropped.

#### MNIST Dataset for ColossalAI:
- To demonstrate ColossalAI's compatibility, we prepare another set of MNIST data specifically for use with ColossalAI. This involves downloading the dataset and applying transformations within the ColossalAI framework.

#### Data Loader for ColossalAI:
- `train_dataloader_Col`: Prepares a data loader for the ColossalAI-compatible training dataset, utilizing ColossalAI's data preparation capabilities to ensure efficient parallel training.
- `test_dataloader_Col`: Constructs a data loader for the testing dataset compatible with ColossalAI.

These transformations and data loaders are essential components for efficiently handling and processing our dataset during training and evaluation.
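
The effect of the shuffle and drop-last settings can be seen on a small synthetic dataset. This is a sketch under stated assumptions: the random tensors stand in for MNIST so that nothing is downloaded, and `BATCH_SIZE = 4` is a hypothetical value chosen to make the batch counts easy to read.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 4  # hypothetical value for this illustration

# Ten fake 1x28x28 "images" with integer labels stand in for MNIST.
images = torch.randn(10, 1, 28, 28)
labels = torch.randint(0, 10, (10,))
dataset = TensorDataset(images, labels)

# Training loader: shuffled, and the final partial batch (2 samples) is dropped.
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
# Test loader: fixed order, every sample is kept.
test_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False, drop_last=False)

print(len(train_loader), len(test_loader))  # 2 3
```

For the Colossal-AI loaders, the ResNet example listed in the references builds them with `plugin.prepare_dataloader(train_dataset, batch_size=..., shuffle=True, drop_last=True)`; whether this demo passes the same arguments is an assumption.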


### Grayscale or RGB Image Check

In this section, we perform a check to determine if the images in our dataset are grayscale or RGB.

If the number of channels is 1, the images are grayscale.<br/>
If the number of channels is 3, the images are RGB.<br/>
This check helps ensure that our model architecture and data preprocessing are aligned with the expected input format.
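
The check can be sketched as a function of the batch shape. The name `describe_channels` is hypothetical, and the sketch assumes batches are laid out as `(N, C, H, W)`, which is what a PyTorch image `DataLoader` normally yields.

```python
def describe_channels(batch_shape):
    """Classify a batch shaped (N, C, H, W) by its channel count C."""
    channels = batch_shape[1]
    if channels == 1:
        return "grayscale"
    if channels == 3:
        return "RGB"
    return f"unexpected channel count: {channels}"


print(describe_channels((64, 1, 28, 28)))  # grayscale
print(describe_channels((64, 3, 32, 32)))  # RGB
```

With a real batch, the shape would come from something like `tuple(images.shape)`.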


### Sample Images

In this section, we visualize a batch of sample images from the dataset:

- `classes = train_dataset.classes`: Retrieves the list of classes (labels) present in the dataset.
- `imshow`: Defines a function to display images.
- `dataiter = iter(train_dataloader)`: Creates an iterator over the training data loader.
- `images, labels = next(dataiter)`: Retrieves the next batch of images and labels from the iterator.
- `imshow(torchvision.utils.make_grid(images))`: Displays the batch of images in a grid format using the `imshow` function.
- `print(' '.join(f'{classes[labels[j]]:5s}' for j in range(BATCH_SIZE)))`: Prints the corresponding labels for each image in the batch.


### Define Model Architecture

In this section, we define the architecture of the VGG13 neural network model:

- `class VGG13(nn.Module)`: Defines a class named VGG13 that inherits from the `nn.Module` class, making it a PyTorch neural network module.
- `def __init__(self, dropoutAdd=False)`: Initializes the VGG13 model with optional dropout layers.
- `Convolutional layers`: Defines five sets of convolutional layers followed by ReLU activation functions and batch normalization.
- `Fully-connected layers`: Defines two fully-connected layers with ReLU activation and optional dropout layers.
- `def forward(self, x)`: Implements the forward pass of the model, defining the sequence of operations applied to input data `x`.
- `modelVGG = VGG13(dropoutAdd=True).to(device)`: Creates an instance of the VGG13 model with dropout layers enabled, and moves it to the specified device (GPU or CPU).
- `print("\nModel Architecture:\n")`: Prints a message indicating the display of the model architecture.
- `print(modelVGG)`: Prints the architecture of the VGG13 model, displaying the sequential arrangement of layers and their parameters.
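
A compact sketch of such a network is shown below. It is not the notebook's exact model: the class name `VGG13Sketch`, the channel widths, the hidden size of 256, the Conv → BatchNorm → ReLU ordering, and the 32x32 input size are all assumptions made for illustration. With five 2x2 pooling stages, a 32x32 input shrinks to 1x1, whereas a 28x28 input would shrink to zero at the fifth stage.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each with batch norm and ReLU, then 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )


class VGG13Sketch(nn.Module):
    def __init__(self, dropoutAdd=False, in_channels=1, num_classes=10):
        super().__init__()
        widths = [64, 128, 256, 512, 512]  # assumed channel widths
        blocks, prev = [], in_channels
        for width in widths:
            blocks.append(conv_block(prev, width))
            prev = width
        self.features = nn.Sequential(*blocks)

        # Two fully-connected layers, with dropout only when requested.
        head = [nn.Flatten(), nn.Linear(512, 256), nn.ReLU(inplace=True)]
        if dropoutAdd:
            head.append(nn.Dropout(p=0.5))
        head.append(nn.Linear(256, num_classes))
        self.classifier = nn.Sequential(*head)

    def forward(self, x):
        return self.classifier(self.features(x))


model = VGG13Sketch(dropoutAdd=True).eval()
with torch.no_grad():
    logits = model(torch.randn(2, 1, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```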


### Prepare Criterion, Optimizer, Learning Rate Scheduler

In this section, we prepare the criterion, optimizer, and learning rate scheduler for model training:

- `criterion = nn.CrossEntropyLoss()`: Defines the criterion for calculating the loss, which is the cross-entropy loss suitable for classification tasks.
- `optimizer = optim.Adam(modelVGG.parameters(), lr=LEARNING_RATE)`: Initializes the Adam optimizer with the VGG13 model parameters and the specified learning rate.
- `optimizer_Col = HybridAdam(modelVGG.parameters(), lr=LEARNING_RATE)`: Initializes the ColossalAI-specific HybridAdam optimizer with the VGG13 model parameters and the specified learning rate.
- `lr_scheduler = MultiStepLR(optimizer, milestones=[20, 40, 60, 80], gamma=GAMMA)`: Initializes a multi-step learning rate scheduler, which adjusts the learning rate at predefined milestones during training, with a decay factor defined by `GAMMA`.
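
To make the milestone behaviour concrete, the schedule can be written in closed form: the learning rate at a given epoch is the base rate multiplied by `gamma` once for every milestone already reached. The sketch below is plain Python rather than a call into PyTorch, and the base rate `0.001` and `gamma` of `0.1` are hypothetical values; only the milestones come from the text above.

```python
from bisect import bisect_right


def multistep_lr(base_lr, gamma, milestones, epoch):
    """Learning rate in effect at `epoch` under a multi-step decay schedule."""
    # Count the milestones already reached; a milestone equal to `epoch` counts.
    reached = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** reached


MILESTONES = [20, 40, 60, 80]
for epoch in (0, 19, 20, 45, 80):
    print(epoch, multistep_lr(0.001, 0.1, MILESTONES, epoch))
```

If the scheduler is stepped once per epoch and training runs for 20 epochs, the first milestone is only reached at the very end, so the learning rate stays at its initial value for practically the whole run.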


### Define Model Training

The `modelTraining` function trains the neural network model for a specified number of epochs. It performs both training and testing phases, computes and accumulates the losses, and adjusts the model parameters through backpropagation and optimization. Here's a breakdown of its components:

- The function takes input parameters such as the number of epochs, the model architecture, optimizer, criterion (loss function), learning rate scheduler, data loaders for training and testing, as well as ColossalAI-specific objects if applicable.
- It initializes arrays to store training and testing loss values throughout the epochs.
- For each epoch, it iterates through the training data, computes the loss, and updates the model parameters based on the optimizer.
- It then evaluates the model on the testing data to compute the testing loss.
- The function also prints log information such as the current epoch, training loss, and testing loss every two epochs.
- Finally, it returns the arrays containing the training and testing loss values and prints the total time taken for training.
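
The steps above can be sketched as follows. This is an illustration rather than the notebook's `modelTraining` function: the helper names, the tiny linear model, and the synthetic batches are assumptions, and the `booster` argument is expected to be a Colossal-AI `Booster` when one is passed.

```python
import time

import torch
import torch.nn as nn


def train_one_epoch(model, loader, criterion, optimizer, device, booster=None):
    """Run one pass over `loader` and return the mean training loss."""
    model.train()
    running = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        if booster is not None:
            # Colossal-AI path: the booster drives the backward pass.
            booster.backward(loss, optimizer)
        else:
            loss.backward()
        optimizer.step()
        running += loss.item()
    return running / len(loader)


@torch.no_grad()
def evaluate_loss(model, loader, criterion, device):
    """Return the mean loss over `loader` without updating the model."""
    model.eval()
    total = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        total += criterion(model(inputs), targets).item()
    return total / len(loader)


# Tiny synthetic setup so the loop can run anywhere.
torch.manual_seed(0)
device = torch.device("cpu")
batches = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(5)]
model = nn.Linear(4, 3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

start = time.time()
train_losses, test_losses = [], []
for epoch in range(4):
    train_losses.append(train_one_epoch(model, batches, criterion, optimizer, device))
    test_losses.append(evaluate_loss(model, batches, criterion, device))
    if (epoch + 1) % 2 == 0:  # log every two epochs
        print(f"epoch {epoch + 1}: train {train_losses[-1]:.4f}, test {test_losses[-1]:.4f}")
print(f"training took {time.time() - start:.2f} s")
```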


### Plot Losses

This function generates a plot displaying the training and testing losses across epochs.


### Define Model Testing

This section includes functions for evaluating the model's performance on the test dataset, including accuracy, precision, recall, F-score, confusion matrix, and ROC curve.

#### `loadersAccuracy(model, loader, colossalAI)`

This function calculates the accuracy of the model on the given dataloader.

- It iterates through the data loader, computes the model's predictions, and compares them with the ground truth labels to calculate the number of correct predictions.
- For Colossal-AI runs, it aggregates the correct and total counts across all processes.
- It also collects true labels and predicted labels for further evaluation.
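
A minimal sketch of this logic is given below. The name `accuracy_sketch` and the `distributed` flag are hypothetical stand-ins for the notebook's function and its `colossalAI` argument. The counts are summed across processes with `torch.distributed.all_reduce` only when a process group has actually been initialized.

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def accuracy_sketch(model, loader, device, distributed=False):
    """Return (accuracy, true_labels, predicted_labels) for `loader`."""
    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    y_true, y_pred = [], []
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum()
        total += targets.numel()
        y_true.extend(targets.tolist())
        y_pred.extend(preds.tolist())
    if distributed and dist.is_available() and dist.is_initialized():
        # Sum the per-process counts so every rank reports the global accuracy.
        dist.all_reduce(correct)
        dist.all_reduce(total)
    return (correct / total).item(), y_true, y_pred


# Identity "model": the prediction is the index of the largest input value.
demo_loader = [(torch.eye(3), torch.tensor([0, 1, 1]))]
acc, y_true, y_pred = accuracy_sketch(torch.nn.Identity(), demo_loader, torch.device("cpu"))
print(y_true, y_pred)  # [0, 1, 1] [0, 1, 2]
```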

#### `perfEvaluation(model, train_dataloader, test_dataloader, colossalAI = False)`

This function evaluates the model's performance on the training and testing datasets.

- It computes the accuracy, precision, recall, and F-score on the testing dataset and prints the results.
- It also plots the accuracies on both training and testing datasets using a bar chart.
- Additionally, it generates a confusion matrix to visualize the model's performance across different classes.
- Finally, it plots the ROC curve to assess the model's ability to discriminate between different classes.
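
To show what the reported numbers mean, the sketch below computes a confusion matrix and the precision, recall, and F-score in plain Python. Macro averaging is an assumption, since the averaging mode is not stated here. In practice, `sklearn.metrics` provides `confusion_matrix` and `precision_recall_fscore_support` for the same purpose.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_scores(matrix):
    """Macro-averaged precision, recall and F1 from a confusion matrix."""
    n = len(matrix)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
precision, recall, f1 = macro_scores(cm)
```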


### Model Evaluation without Colossal AI

This section evaluates the model's performance without using Colossal AI for distributed training.

- The `modelTraining` function is called to train the VGG model for the specified number of epochs without Colossal-AI. The training and testing losses are then plotted using the `LossesPlot` function.

- Finally, the `perfEvaluation` function evaluates the model's performance on both the training and testing datasets without Colossal AI. It prints the accuracy, precision, recall, and F-score, and generates visualizations including a bar chart of accuracies, a confusion matrix, and an ROC curve.

### Boost with ColossalAI

This step involves boosting the model with ColossalAI for distributed training.

- The `booster.boost` function is called to boost the VGG model with ColossalAI. It returns the boosted model (`model_Col`), optimizer (`optimizer_Col`), criterion (`criterion_Col`), and learning rate scheduler (`lr_scheduler_Col`); its return value also has a slot for a data loader.

- The boosted model (`model_Col`) is trained using ColossalAI for the specified number of epochs.

- The boosted model's performance is evaluated using the `perfEvaluation` function, which calculates metrics such as accuracy, precision, recall, and F-score, and generates visualizations including a bar chart of accuracies, a confusion matrix, and an ROC curve.
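
The `booster.boost` call described above can be sketched as a small wrapper. The name `boost_all` is hypothetical. The sketch assumes that `booster.boost` returns a five-element tuple ordered as model, optimizer, criterion, data loader, and learning rate scheduler, and that no data loader is passed in.

```python
def boost_all(booster, model, optimizer, criterion, lr_scheduler):
    """Wrap the training objects with the booster and return the wrapped versions."""
    # No dataloader is passed, so the fourth slot of the result is ignored.
    model_Col, optimizer_Col, criterion_Col, _, lr_scheduler_Col = booster.boost(
        model, optimizer, criterion=criterion, lr_scheduler=lr_scheduler
    )
    return model_Col, optimizer_Col, criterion_Col, lr_scheduler_Col
```

After this call, the training loop should use the returned objects instead of the originals.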

### Model Evaluation with Colossal AI

This section involves evaluating the model trained with ColossalAI for distributed training.

- The boosted model (`model_Col`) is trained with ColossalAI for the specified number of epochs and then evaluated.

- The `modelTraining` function is called with the ColossalAI flag set to `True`, indicating that the training is performed using ColossalAI.

- The training and testing losses are plotted using the `LossesPlot` function.

- The performance of the model is evaluated using the `perfEvaluation` function, which calculates metrics such as accuracy, precision, recall, and F-score, and generates visualizations including a bar chart of accuracies, a confusion matrix, and an ROC curve.

### Time to Train

Below are the times taken to train the model with and without ColossalAI.

- `Without ColossalAI`: 8 minutes and 21 seconds.
- `With ColossalAI`: 8 minutes and 4 seconds.
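
The difference between the two timings works out as follows. The helper name `to_seconds` is hypothetical; the durations are the ones listed above.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds duration to seconds."""
    return minutes * 60 + seconds


baseline = to_seconds(8, 21)  # without ColossalAI
boosted = to_seconds(8, 4)    # with ColossalAI
saved = baseline - boosted
print(f"saved {saved} s ({saved / baseline:.1%} of the baseline)")
# saved 17 s (3.4% of the baseline)
```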



### Observations

`Efficiency Gains Over Multiple Epochs`:
Training the model with Colossal-AI was slightly faster than training without it over 20 epochs (8 minutes 4 seconds versus 8 minutes 21 seconds, a saving of 17 seconds, or about 3%). This suggests that Colossal-AI can increase training speed, although the gain in this setup was small.

`Impact of Dataset Simplicity on Parallel Training`:
Using VGG13 with the simple MNIST dataset didn't fully demonstrate the benefits of parallel training, which typically become more apparent with larger or more complex datasets such as CIFAR-100 and with models that require more computing power.

`Hardware Limitations and Parallelization Efficiency`:
Since only two GPUs were used, we couldn't fully leverage the advantages of Colossal-AI's parallel training features. Colossal-AI is more effective on larger GPU setups; with limited hardware, the setup and synchronization overhead can offset the gains from parallel processing.

### References

Below are the references used throughout the code demo.

1. https://numpy.org/doc/
1. https://matplotlib.org/stable/index.html
1. https://scikit-learn.org/stable/
1. https://seaborn.pydata.org/
1. https://pytorch.org/tutorials/
1. https://pytorch.org/vision/main/models/vision_transformer.html
1. https://colossalai.org/docs/get_started/installation/
1. https://github.com/hpcaitech/ColossalAI
1. https://github.com/hpcaitech/ColossalAI/blob/main/examples/images/resnet/train.py
1. Data loading, VGG13 model architecture, and model training and testing are based on the CSE 676 Deep Learning Assignment 1 Part 1 and Bonus submission by Nikhil Gupta.

### Task

For the task, your challenge is to:

- Modify the code to use the CIFAR-10 dataset instead of MNIST.
- Update the VGG-13 model architecture and hyperparameters according to CIFAR-10 requirements.
- After completing the task, share your experiences and insights in the notebook file itself. 
- You can document any challenges faced, improvements made, or interesting observations during the process. 

Feel free to ask questions or discuss your findings in the Piazza discussion forum. Happy coding!