Road Follower - Train Model Using a Host PC

In this notebook we will train a neural network to take an input image, and output a set of x, y values corresponding to a target.

We will use the PyTorch deep learning framework to train a neural network model, e.g. ResNet-18, for the road follower application.

Before running this notebook to train PyTorch models, install Python 3 and the following packages:
1. Install Python 3.8 or above (the PyTorch 2.4.1 build used in step 3 does not support older Python versions), and set the installed Python as the PyCharm project interpreter.
     
2. Make sure NVIDIA CUDA is installed on your PC. To install it, select a CUDA version and follow the instructions on the NVIDIA website: https://developer.nvidia.com/cuda-toolkit-archive

3. Install the PyTorch and torchvision packages built for your NVIDIA CUDA version (e.g., version 12.4.1) on the host PC by running the command below in a Windows PowerShell window. PyTorch 2.4.1+cu124 with torchvision 0.19.1+cu124 works with Python 3.8. For other combinations, refer to: https://pytorch.org/get-started/previous-versions/
>       pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124

4. Install python packages jupyter and jupyterlab:
>       pip install jupyter jupyterlab

5. Then run the command `jupyter lab` in a command window to start the training.
6. During training, you may get an `ImportError` for a Python package that is not yet installed; install any such package as required.
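
Step 6 can be checked before training starts. The sketch below uses only the standard library to report which packages the current interpreter cannot find. The helper name `missing_packages` and the package list are illustrative assumptions and are not part of the JetBot utilities.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that the interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level packages imported by the cells of this notebook.
# The import name "PIL" is provided by the "pillow" distribution.
required = ["torch", "torchvision", "numpy", "PIL", "tqdm", "pandas"]

missing = missing_packages(required)
if missing:
    print("Install these packages before training:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Anything reported as missing can be installed with `pip install <package>` before the training cells are run.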

In [None]:
import torch
import torch.optim as optim
import torch.nn.functional as F

import torchvision.transforms as transforms
import glob
import PIL.Image
import os
import numpy as np

### The following PyTorch models may be used for image classification.
### The "mobilenet_X" models are not fully supported when converting them to a TensorRT engine with torch2trt,
### so mobilenet is only applicable for PyTorch simulation, not for TensorRT.

# vgg11
# googlenet
# inception_v3
# resnet18, resnet34, resnet50, resnet101; the number is the number of layers in the network
# densenet121, densenet161, densenet169, densenet201; the number is the number of layers in the network
# shufflenet_v2_x2_0, shufflenet_v2_x1_5, shufflenet_v2_x1_0, shufflenet_v2_x0_5; xN_N: factor that determines the depth of the tensor (no. of channels or filters in each stage)
# mnasnet1_3, mnasnet1_0, mnasnet0_75, mnasnet0_5; N_N: depth multiplier of the MnasNet-A1 model, determines the depth of the tensor (no. of channels or filters)
# mobilenet, mobilenet_v2, mobilenet_v3_large, mobilenet_v3_small
# efficientnet_b0~b5; bN: each uses a different compound coefficient phi
# vit_x_xx; x: b, l, h; xx: 16, 32, 14 (14 for h only)

TRAIN_MODEL = 'resnet18'

### Define Neural Network Model 

1. In a process called $transfer$  $learning$, we can repurpose a pre-trained model (trained on millions of images) for a new task that has possibly much less data available.
More Details on Transfer Learning: https://www.youtube.com/watch?v=yofjFQddwHE 

2. The $transfer$  $learning$ will be used for training the model for road following  simulation.

3. You can use the models (the parameter setting TRAIN_MODEL above) available in PyTorch TorchVision package, such as resnet18, resnet34, resnet50, resnet101, mobilenet_v2, vgg11, mobilenet_v3_large, inception_v3, efficientnet_b4, googlenet, vit, densenet, shufflenet..., etc.

4. Before you use a pre-trained PyTorch model for transfer learning, you should modify the output (classifier) layer of the selected network architecture.

5. Make this modification in the function `load_tune_pth_model` in `model_selection.py`, which is located in the directory jetbot/utils/.
> e.g., the ResNet-18 model has a final fully connected (fc) layer with 512 in_features; because we are training for regression on (x, y), out_features is set to 2.

6. Please refer to the following websites for information on modifying PyTorch pre-trained models:
* classifier : https://pytorch.org/vision/0.10/models.html#classification
* github : https://github.com/pytorch/vision/tree/release/0.11/torchvision/models
* More details on ResNet-18 and other variants: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py

7. After modifying the classifier, load the model below:


In [None]:

%cd "../../jetbot/utils"
from model_selection import load_tune_pth_model
model, model_type, preprocess_wrap = load_tune_pth_model(pth_model_name=TRAIN_MODEL, pretrained=True) 
# preprocess_wrap = [preprocess, classifier_preprocess] or None   # refer to load_tune_pth_model in model_selection.py


### Download and extract data

1. Before you start, you need to make sure the zip data file for training has been created by using ``data_collection.ipynb`` notebook and stored in the jetbot.

2. Upload the zip data file for training to the PC directory set by `dir_depo` in the cell below. In that cell `dir_depo` is set to `D:\\Temp\\Cuterbot_Repo`, and all training files are kept under this directory. To use another directory, change this variable.

> If you're training on the JetBot that you collected data on, you can skip this step!

3. After training is finished, the trained model will be stored in the directory `dir_depo`.

4. The dataset file name (variable `training_datafile`) should be set to exactly the file name produced by ``data_collection.ipynb``, which has the format `road_following_{DATASET_DIR}_{timestr()}.zip`. In this example, `training_datafile` is `road_following_dataset_xy_2024-12-22_13-18-05.zip`.

5. Then extract the dataset for training by running the cell below:

In [None]:
# !unzip -q road_following.zip
from zipfile import ZipFile

# Set dir_depo to the directory you want to use:
dir_depo = 'D:\\Temp\\Cuterbot_Repo'
# dir_depo = 'D:\\AI_Lecture_Demos\\Data_Repo\\Cuterbot_2004_Repo'
os.makedirs(dir_depo, exist_ok=True)
# dir_depo = os.getcwd()
training_datafile = 'road_following_dataset_xy_2024-12-22_13-18-05.zip'  # check the data file is loaded to dir_depo

with ZipFile(os.path.join(dir_depo, training_datafile), 'r') as zObject:
    zObject.extractall(path=dir_depo)


You should see a folder named ``dataset_xy`` appear in the file directory as set in variable `dir_depo`.
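
The check described above can also be done in code. The following is a minimal sketch, assuming the archive contains a top-level folder named `dataset_xy`; the helper `extract_dataset` and the demo file names are hypothetical and only illustrate the idea with a throwaway archive.

```python
import os
import tempfile
from zipfile import ZipFile

def extract_dataset(zip_path, dest_dir, expected_folder="dataset_xy"):
    """Extract zip_path into dest_dir and return the dataset folder path.

    Raises FileNotFoundError if the expected folder is missing afterwards.
    """
    with ZipFile(zip_path, "r") as archive:
        archive.extractall(path=dest_dir)
    dataset_dir = os.path.join(dest_dir, expected_folder)
    if not os.path.isdir(dataset_dir):
        raise FileNotFoundError(
            f"'{expected_folder}' not found after extracting {zip_path}")
    return dataset_dir

# Demonstration with a temporary archive (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with ZipFile(demo_zip, "w") as archive:
        archive.writestr("dataset_xy/xy_112_056_demo.jpg", b"")
    folder = extract_dataset(demo_zip, tmp)
    n_images = len([f for f in os.listdir(folder) if f.endswith(".jpg")])
    print(n_images)
```

Failing early with a clear message is easier to diagnose than an empty dataset later in the notebook.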

### Create Dataset Instance

Here we create a custom ``torch.utils.data.Dataset`` implementation, which implements the ``__len__`` and ``__getitem__`` functions.  This class
is responsible for loading images and parsing the x, y values from the image filenames.  Because we implement the ``torch.utils.data.Dataset`` class,
we can use all of the torch data utilities :)

We hard coded some transformations (like color jitter) into our dataset.  We made random horizontal flips optional (in case you want to follow a non-symmetric path, like a road
where we need to 'stay right').  If it doesn't matter whether your robot follows some convention, you could enable flips to augment the dataset.
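
To make the label handling concrete, here is a small standalone sketch of how a target is recovered from a file name and how a horizontal flip changes it. It assumes file names of the form `xy_<x>_<y>_<id>.jpg` with pixel coordinates in the second and third fields, which matches the `split("_")` indexing used in the cell below; the function names and the example file name are illustrative.

```python
def normalized_xy(filename, width, height):
    """Map the pixel coordinates encoded in the file name to the range [-1, 1]."""
    parts = filename.split("_")
    x = (int(parts[1]) - width / 2) / (width / 2)
    y = (int(parts[2]) - height / 2) / (height / 2)
    return x, y

def hflip_label(x, y):
    """A horizontal image flip mirrors the target, so only x changes sign."""
    return -x, y

x, y = normalized_xy("xy_168_056_example.jpg", width=224, height=224)
print(x, y)               # 0.5 -0.5
print(hflip_label(x, y))  # (-0.5, -0.5)
```

A target at the image centre maps to (0, 0), which is why the flipped label only needs `x = -x`.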

In [None]:
def get_x(path, width):
    """Gets the x value from the image filename"""
    # print(path.split("_")[1])
    return (float(int(path.split("_")[1])) - width/2) / (width/2)

def get_y(path, height):
    """Gets the y value from the image filename"""
    # print(path.split("_")[2])
    return (float(int(path.split("_")[2])) - height/2) / (height/2)

class XYDataset(torch.utils.data.Dataset):
    
    def __init__(self, directory, preprocess, random_hflips=False):
        self.directory = directory
        self.random_hflips = random_hflips
        self.preprocess = preprocess     # e.g. weights.transforms() for torchvision >= 0.13
        self.image_paths = glob.glob(os.path.join(self.directory, '*.jpg'))
        self.color_jitter = transforms.ColorJitter(0.3, 0.3, 0.3, 0.3)
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        image_path = self.image_paths[idx]
        
        image = PIL.Image.open(image_path)
        width, height = image.size
        # print(width, height)
        x = float(get_x(os.path.basename(image_path), width))
        y = float(get_y(os.path.basename(image_path), height))
        # print(x, y)
        if self.random_hflips and np.random.rand() > 0.5:  # flip only when the option is enabled
            image = transforms.functional.hflip(image)
            x = -x
        
        image = self.color_jitter(image)
        image = self.preprocess(image)
            
        '''
        if TRAIN_MODEL == 'inception_v3':
            image = transforms.functional.resize(image, (299, 299))
        else:
            image = transforms.functional.resize(image, (224, 224))
        image = transforms.functional.to_tensor(image)
        image = image.numpy()[::-1].copy()
        image = torch.from_numpy(image)
        image = transforms.functional.normalize(image, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        '''
        return image, torch.tensor([x, y]).float()
    
dataset = XYDataset(os.path.join(dir_depo, 'dataset_xy'), preprocess_wrap[0], random_hflips=False)  # preprocess_wrap[0] : preprocess for model

In [None]:
dataset[0]

### Split dataset into train and test sets
Once the dataset is read, we split it into train and test sets. In this example the split is 90% train and 10% test. The test set is used to verify the accuracy of the trained model.
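
The size computation behind the split can be illustrated without PyTorch. This sketch shuffles sample indices with the standard library and cuts them at the same `int(test_percent * n)` boundary used in the cell below; the dataset size of 205 and the fixed seed are assumptions chosen for the example, and the notebook itself relies on `torch.utils.data.random_split` for the shuffling.

```python
import random

def split_indices(num_samples, test_percent, seed=0):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_test = int(test_percent * num_samples)  # rounds down, as in the notebook
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(205, 0.1)
print(len(train_idx), len(test_idx))  # 185 20
```

Because the test size is rounded down, very small datasets can end up with an empty test set, so it is worth checking `num_test` before training.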

In [None]:
test_percent = 0.1
num_test = int(test_percent * len(dataset))
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [len(dataset) - num_test, num_test])

### Create data loaders to load data in batches

We use the ``DataLoader`` class to load data in batches, shuffle the data, and optionally load it with multiple worker subprocesses. In this example we use a batch size of 8. Choose the batch size based on the memory available on your GPU; it can also affect the accuracy of the model.
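
The batch size also determines how many optimizer steps make up one epoch, which is the value `len(train_loader)` returns in the training cell. A minimal sketch of that relationship, assuming a training set of 185 samples (an illustrative number):

```python
import math

def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields per epoch for the given settings."""
    if drop_last:
        return num_samples // batch_size      # the incomplete final batch is skipped
    return math.ceil(num_samples / batch_size)  # the final batch may be smaller

print(batches_per_epoch(185, 8))   # 24
print(batches_per_epoch(185, 64))  # 3
```

A larger batch size means fewer, more memory-hungry steps per epoch.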

In [None]:
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=0
)

test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=0
)

### Processor used for training
1. You may use the CPU or the GPU for training by setting the `processor` variable in the cell below.
2. If you use the GPU, the model selected in `TRAIN_MODEL` is transferred to the GPU for execution.

In [None]:
# ** you may use CPU or GPU for training
processor = 'GPU'
if processor == 'GPU':
    device = torch.device('cuda')
    print("torch cuda version : ", torch.version.cuda)
    print("cuda is available for pytorch: ", torch.cuda.is_available())    
elif processor == 'CPU':
    device = torch.device('cpu')
model = model.float()
model = model.to(device, dtype=torch.float)


### Define Learning Algorithm for training Neural Network Model:
1. You can use any of the following learning algorithms (set by `TRAIN_METHOD` in the cell below) to train a model:
"Adam", "SGD", "ASGD", "Adadelta", "RAdam". For SGD, the parameters lr=0.01 and momentum=0.92 are needed.
* Reference web site : https://pytorch.org/docs/stable/optim.html#algorithms
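
Since only SGD needs extra parameters, the choice can be kept in one place. The sketch below is a hypothetical helper (`optimizer_kwargs` is not part of the JetBot utilities) that returns the keyword arguments for the selected method; the values for SGD are the ones stated above.

```python
def optimizer_kwargs(train_method):
    """Return keyword arguments for the chosen torch.optim class."""
    if train_method == "SGD":
        return {"lr": 0.01, "momentum": 0.92}
    return {}  # rely on the PyTorch defaults for the other methods

for name in ["Adam", "SGD", "ASGD", "Adadelta", "RAdam"]:
    print(name, optimizer_kwargs(name))

# With PyTorch available, the result could be used as:
#   optimizer = getattr(torch.optim, TRAIN_METHOD)(
#       model.parameters(), **optimizer_kwargs(TRAIN_METHOD))
```

This keeps the training cell unchanged when switching between methods.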

In [None]:
# *** reference : https://pytorch.org/docs/stable/optim.html#algorithms
# use one of the following learning algorithms for evaluation
# "Adam", "SGD", "ASGD", "Adadelta", "RAdam"; the parameters lr=0.01, momentum=0.92 are needed for SGD
TRAIN_METHOD = 'Adam'

### Train Regression:

1. The training records are stored in the directory set by the variable `dir_training_records` in the cell below.
2. The trained model for road following is saved to the file `BEST_MODEL_PATH` in the directory `DIR_RC_MODEL_REPO` (both set in the cell below), and `torch_model_tbl.csv` in the directory `DIR_MODEL_REPO` is updated accordingly.
3. We train for 70 epochs and save the model whenever the test loss improves on the best value so far.
4. You may modify the parameters and variables as you want.
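
The "save the best model" rule in step 3 can be traced on its own. This is a minimal sketch with made-up test losses; in the notebook the save itself is a `torch.save` call, which is only indicated by a comment here.

```python
def epochs_that_save(test_losses):
    """Return the epochs at which a checkpoint is written, and the best loss."""
    best_loss = None
    saved = []
    for epoch, test_loss in enumerate(test_losses):
        if best_loss is None or test_loss < best_loss:
            best_loss = test_loss
            saved.append(epoch)  # the notebook calls torch.save(...) at this point
    return saved, best_loss

saved, best = epochs_that_save([0.50, 0.30, 0.40, 0.20, 0.25])
print(saved, best)  # [0, 1, 3] 0.2
```

Epochs whose test loss does not beat the best value so far leave the saved file untouched.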

In [None]:
%cd "../../jetbot/utils"
import time
from tqdm.notebook import tqdm
from training_profile import * 

dir_training_records = os.path.join(dir_depo, processor, 'training records', "road_following", TRAIN_MODEL)
os.makedirs(dir_training_records, exist_ok=True)

DIR_MODEL_REPO = os.path.join(dir_depo, processor, 'model_repo')
os.makedirs(DIR_MODEL_REPO, exist_ok=True)
DIR_RC_MODEL_REPO = os.path.join(DIR_MODEL_REPO, 'road_following')
os.makedirs(DIR_RC_MODEL_REPO, exist_ok=True)

# BEST_MODEL_PATH = 'best_steering_model_xy.pth'
BEST_MODEL_NAME = "best_steering_model_xy_" + TRAIN_MODEL
BEST_MODEL_PATH = os.path.join(DIR_RC_MODEL_REPO, BEST_MODEL_NAME + ".pth")
MODEL_PREPROCESS_PATH = os.path.join(DIR_RC_MODEL_REPO, BEST_MODEL_NAME + "_preprocess.pth")

NUM_EPOCHS = 70
best_loss = 1e9

optimizer = getattr(optim, TRAIN_METHOD)(model.parameters(), weight_decay=0)
# optimizer = getattr(optim, TRAIN_METHOD)(model.parameters(), lr=0.01, momentum=0.95)

loss_data = []
lt_epoch = []  # learning time per epoch
lt_sample = []  # learning time per training step (one batch)

print("start training model ----- %s -----" % TRAIN_MODEL)

batch_size = len(train_loader)  # note: this is the number of batches per epoch, not the samples per batch
pbar_overall_format = "{desc} {percentage:.2f}% | {bar} | elapsed: {elapsed}; estimated to finish: {remaining}"
pbar_overall = tqdm(total=100, bar_format = pbar_overall_format)
show_batch_progress = False   # Set True if need to show the batch learning progress in an epoch
show_training_plot = False  # Set True if need to show the convergent profile during training

best_loss = None

for epoch in range(NUM_EPOCHS):
    start_epoch = time.time()
    
    model.train()
    train_loss = 0.0
    
    if show_batch_progress:
        pbar_batch = tqdm(train_loader, total = batch_size)
    else:
        pbar_batch = iter(train_loader)
    for index, (images, labels) in enumerate(pbar_batch):
        start_sample = time.time()
        
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        if TRAIN_MODEL == 'inception_v3':
            outputs = model(images)
            loss_main = F.mse_loss(outputs.logits, labels, reduction='mean')
            loss_aux = F.mse_loss(outputs.aux_logits, labels, reduction='mean')
            loss = loss_main + 0.3 * loss_aux  # weighting aux with 0.3 is the same as that stated in paper
            
        elif TRAIN_MODEL == 'googlenet':
            outputs = model(images)
            loss_main = F.mse_loss(outputs.logits, labels, reduction='mean')
            loss_aux1 = F.mse_loss(outputs.aux_logits1, labels, reduction='mean')
            loss_aux2 = F.mse_loss(outputs.aux_logits2, labels, reduction='mean')
            loss = loss_main + 0.3 * loss_aux1 + 0.3 * loss_aux2  # weighting aux with 0.3 is the same as that stated in paper
            
        else:
            outputs = model(images)
            loss = F.mse_loss(outputs, labels, reduction='mean')
            
        train_loss += float(loss)
        loss.backward()
        optimizer.step()
        
        end_sample = time.time()
        lt_sample.append(end_sample - start_sample)
        
        pbar_overall.update(round(100/(NUM_EPOCHS*batch_size), 2))
        pbar_overall.set_description(desc = f'Overall progress - Epoch [{epoch+1}/{NUM_EPOCHS}]')
        pbar_overall.set_postfix(best_loss = best_loss, train_loss = train_loss/(index+1))
               
        if show_batch_progress:
            pbar_batch.set_description(desc = f'Progress in the epoch {epoch+1} ')
            pbar_batch.set_postfix(mean_batch_loss = train_loss/(index+1))
    
    train_loss /= len(train_loader)
    
    model.eval()
    test_loss = 0.0
    with torch.no_grad():  # gradients are not needed for evaluation
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            loss = F.mse_loss(outputs, labels)
            test_loss += float(loss)
    test_loss /= len(test_loader)

    end_epoch = time.time()
    lt_epoch.append(end_epoch - start_epoch)
      
    if best_loss is None or test_loss < best_loss:
        torch.save(model.state_dict(), BEST_MODEL_PATH)
        best_loss = test_loss
    
    loss_data.append([train_loss, test_loss])
    
    # function plot_loss(loss_data, best_loss, no_epoch, dir_training_records, train_model, train_method) is in jetbot.utils
    if epoch == NUM_EPOCHS - 1:  # last epoch; range(NUM_EPOCHS) never reaches NUM_EPOCHS
        show_training_plot = True
    plot_loss(loss_data=loss_data, best_loss=best_loss, no_epoch=NUM_EPOCHS,
              dir_training_records=dir_training_records, # the directory where training records are stored
              train_model=TRAIN_MODEL, train_method=TRAIN_METHOD, processor=processor, # these parameters are for the plot title only
              show_training_plot=show_training_plot)

overall_time = pbar_overall.format_dict["elapsed"]
# function lt_plot(lt_epoch, lt_sample, dir_training_records, train_model, train_method) is in jetbot.utils
lt_plot(lt_epoch=lt_epoch, lt_sample=lt_sample, overall_time=overall_time,
        dir_training_records=dir_training_records, # the directory stored training records
        train_model=TRAIN_MODEL, train_method=TRAIN_METHOD, processor=processor) # these parameters are for the plot title only

# save the preprocess module (preprocess_wrap[1]) created from model_selection.py module
torch.save(preprocess_wrap[1].state_dict(), MODEL_PREPROCESS_PATH)

### Save the file path of the trained model
The file path of the trained model (model_path) and the associated preprocess module (preprocess_path) are stored in `torch_model_tbl.csv` under the directory `DIR_MODEL_REPO`.
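
The table update amounts to "append the row unless it is already there". The sketch below shows that logic with the standard `csv` module instead of pandas, purely as an illustration; the helper `register_model` and the `model_type` value `"resnet"` are assumptions, and the demo writes to a temporary directory.

```python
import csv
import os
import tempfile

def register_model(table_path, row):
    """Append row to the headerless CSV table unless an identical row exists."""
    rows = []
    if os.path.isfile(table_path):
        with open(table_path, newline="") as f:
            rows = list(csv.reader(f))
    if row not in rows:
        rows.append(row)
    with open(table_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows

with tempfile.TemporaryDirectory() as tmp:
    table = os.path.join(tmp, "torch_model_tbl.csv")
    entry = ["classifier", "resnet",
             "./road_following/best_steering_model_xy_resnet18.pth",
             "./road_following/best_steering_model_xy_resnet18_preprocess.pth"]
    register_model(table, entry)
    rows = register_model(table, entry)  # the second call must not add a duplicate
    print(len(rows))  # 1
```

Re-running the training for the same model therefore leaves a single entry in the table.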

In [None]:
import pandas as pd

df_file = os.path.join(DIR_MODEL_REPO, 'torch_model_tbl.csv')
model_path = "./road_following/" + BEST_MODEL_NAME + ".pth"
preprocess_path = "./road_following/" + BEST_MODEL_NAME + "_preprocess.pth"
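# Build the new row, then append it with the public pd.concat API
# (DataFrame._append is a private pandas method and may change without notice).
new_row = pd.DataFrame([["classifier", model_type, model_path, preprocess_path]])
if os.path.isfile(df_file):
    df = pd.concat([pd.read_csv(df_file, header=None), new_row], ignore_index=True)
else:
    df = new_row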
df = df.drop_duplicates()
df.to_csv(df_file, header=False, index=False)

In [None]:
print('Training is completed!\nYou can close the figures by restarting the kernel.')

1. Once the model is trained, the notebook generates a PyTorch model file ``best_steering_model_xy_<TRAIN_MODEL>.pth`` in the PC directory `DIR_RC_MODEL_REPO` and updates the file `torch_model_tbl.csv` in the PC directory `DIR_MODEL_REPO`. Copy these 2 files to the `jetbot` directories `model_repo\road_following` and `model_repo`, respectively. You can then use them for inference by starting Jupyter Lab on the JetBot and running `live_demo_light.ipynb`.
2. To run the model with NVIDIA TensorRT support, first create a TensorRT engine with `live_demo_build_trt.ipynb`, which converts the generated PyTorch model file to a TensorRT engine; then run `live_demo_light_trt.ipynb` with the created engine.
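
The copy in step 1 can be scripted when the JetBot's `model_repo` directory is reachable as an ordinary path (for example through a mounted share). That reachability, the helper name `stage_model_files`, and the demo paths are all assumptions of this sketch; copying over the network with another tool works just as well.

```python
import os
import shutil
import tempfile

def stage_model_files(pc_model_repo, jetbot_model_repo, model_file):
    """Copy the trained model and the model table into the JetBot layout."""
    dst_dir = os.path.join(jetbot_model_repo, "road_following")
    os.makedirs(dst_dir, exist_ok=True)
    return [
        shutil.copy2(os.path.join(pc_model_repo, "road_following", model_file), dst_dir),
        shutil.copy2(os.path.join(pc_model_repo, "torch_model_tbl.csv"), jetbot_model_repo),
    ]

# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pc_repo = os.path.join(tmp, "pc", "model_repo")
    os.makedirs(os.path.join(pc_repo, "road_following"))
    model_name = "best_steering_model_xy_resnet18.pth"
    for path in [os.path.join(pc_repo, "road_following", model_name),
                 os.path.join(pc_repo, "torch_model_tbl.csv")]:
        open(path, "w").close()
    copied = stage_model_files(pc_repo, os.path.join(tmp, "jetbot", "model_repo"), model_name)
    copied_names = [os.path.basename(p) for p in copied]
    print(copied_names)
```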
