# Training

## Clean XML files

The command below runs the xmlconversion.py script. It creates a new directory called cleaned inside the images directory, copies all the XML files into it, and cleans them by removing empty spaces. It then copies the cleaned files back to the images directory, overwriting the original XML files.

In [1]:
!python xmlconversion.py --verbose

images/TurdusMerula407.xml
images/ErithacusRubecula491.xml
images/PeriparusAter167.xml
images/ErithacusRubecula744.xml
images/ErithacusRubecula393.xml
images/PeriparusAter445.xml
images/ErithacusRubecula396.xml
images/TurdusMerula741.xml
images/PeriparusAter396.xml
images/TurdusMerula22.xml
images/PeriparusAter171.xml
images/PeriparusAter225.xml
images/PeriparusAter434.xml
images/ErithacusRubecula279.xml
images/TurdusMerula131.xml
images/ErithacusRubecula511.xml
images/PeriparusAter644.xml
images/ErithacusRubecula152.xml
images/TurdusMerula565.xml
images/TurdusMerula647.xml
images/PeriparusAter253.xml
images/TurdusMerula245.xml
images/TurdusMerula632.xml
images/ErithacusRubecula632.xml
images/PeriparusAter389.xml
images/TurdusMerula606.xml
images/ErithacusRubecula451.xml
images/TurdusMerula145.xml
images/ErithacusRubecula385.xml
images/ErithacusRubecula2.xml
images/TurdusMerula661.xml
images/TurdusMerula24.xml
images/PeriparusAter588.xml
images/TurdusMerula480.xml
images/TurdusMerula83
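
The source of xmlconversion.py is not shown here, so the following is only a minimal sketch of what such a cleaning step might look like. It assumes that "removing empty spaces" means stripping whitespace around the text of each XML element; the function name `clean_xml_text` and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET


def clean_xml_text(xml_text: str) -> str:
    """Return xml_text with whitespace stripped around every text node."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Strip whitespace around the element's own text, e.g. " robin " -> "robin".
        if element.text is not None:
            element.text = element.text.strip()
        # Strip the text that follows the closing tag (usually indentation).
        if element.tail is not None:
            element.tail = element.tail.strip()
    return ET.tostring(root, encoding="unicode")


# Hypothetical annotation with stray spaces, for illustration only.
sample = (
    "<annotation>\n"
    "  <filename> TurdusMerula22.jpg </filename>\n"
    "  <object><name> TurdusMerula </name></object>\n"
    "</annotation>"
)
cleaned = clean_xml_text(sample)
print(cleaned)
```

Stray spaces inside a `<name>` element matter because the class name read from the XML has to match the name in the label map exactly.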

## Partition the train / test split 90/10

To evaluate the model, it is best practice to divide the dataset into a training set and a test set. The model is trained on the train set and then evaluated on the test set. For this dataset, 90% is used for the train set and 10% for the test set. The command below performs this partition; the -r 0.1 argument corresponds to the 10% test share.

In [3]:
!python partition_dataset.py -x -i ./images -r 0.1
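
As a rough illustration of what a 90/10 partition does, the sketch below shuffles a list of file names and cuts off the first 10% as the test set. It is not the code of partition_dataset.py; the function name `partition`, the fixed seed and the file names are assumptions made for the example.

```python
import random


def partition(filenames, test_ratio=0.1, seed=0):
    """Split filenames into (train, test); test gets about test_ratio of them."""
    shuffled = sorted(filenames)           # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, independent of global state
    num_test = round(len(shuffled) * test_ratio)
    return shuffled[num_test:], shuffled[:num_test]


# Hypothetical file names, for illustration only.
names = [f"image{i}.jpg" for i in range(50)]
train, test = partition(names, test_ratio=0.1)
print(len(train), len(test))  # 45 5
```

Each image's XML annotation has to go to the same side of the split as the image itself, which is presumably what the -x flag in the command above takes care of.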

## Create the TF Record

The TensorFlow Object Detection API requires the input data to be in the TFRecord format, a binary file format that takes less space on disk and can speed up training.

### Update the .PBTXT file

A .pbtxt file is a simple text file that maps each label to an integer value. The TensorFlow Object Detection API requires this file for both training and detection. The command below opens the .pbtxt file so that it can be modified. For this project, the file contains three classes, ErithacusRubecula, PeriparusAter and TurdusMerula, with the respective values 1, 2 and 3.

In [5]:
!code './data/label_map.pbtxt'
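
For reference, the label map described above can also be generated rather than typed by hand. The sketch below builds the text in the `item { id: ... name: ... }` layout that the Object Detection API label maps use; the helper name `build_label_map` is hypothetical, and writing the result to ./data/label_map.pbtxt is left to the reader.

```python
def build_label_map(class_names):
    """Return label-map text assigning ids 1, 2, 3, ... in the given order."""
    entries = []
    for class_id, name in enumerate(class_names, start=1):
        # Ids start at 1; 0 is conventionally reserved for the background class.
        entries.append(f"item {{\n  id: {class_id}\n  name: '{name}'\n}}\n")
    return "\n".join(entries)


label_map_text = build_label_map(
    ["ErithacusRubecula", "PeriparusAter", "TurdusMerula"]
)
print(label_map_text)
```

The class names must match the names used in the XML annotations character for character.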

### Create the TF Record (Train)

The command below creates a TFRecord file for the train set.

In [1]:
!python generate_tfrecord.py -x images/train -l data/label_map.pbtxt -o data/train.record

Successfully created the TFRecord file: data/train.record


### Create the TF Record (Test)

The command below creates a TFRecord file for the test set.

In [2]:
!python generate_tfrecord.py -x images/test -l data/label_map.pbtxt -o data/test.record

Successfully created the TFRecord file: data/test.record


## Faster R-CNN

Faster R-CNN is a deep convolutional neural network used for object detection. It evolved from its predecessors, R-CNN and Fast R-CNN. Faster R-CNN replaces their separate region proposal step with a learned Region Proposal Network, which improves performance, and it is one of the best architectures for object detection.

### Set the model path

In [3]:
PATH_TO_MODEL = "faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8"

### Hyperparameters and Configuration of the config file

The TensorFlow Object Detection API uses a config file for hyperparameters. This file is where you modify all hyperparameters for your object detection task. The command below opens a code editor with the config file. For this project, I will modify the following fields.

* num_classes: The number of classes that you want the model to be trained on. It will be set to 3 because we are training the model on 3 classes.
* batch_size: Number of samples the network processes per training step. It can be increased when more GPU memory is available. Our images are high resolution and Faster R-CNN is computationally demanding, so I will keep this number at 1.
* num_steps: Number of steps to train the model for. I will start the training with 12000 steps and analyse the loss during training using TensorBoard to decide whether the model needs further training.
* fine_tune_checkpoint: Path of the checkpoint file of the pretrained model.
* fine_tune_checkpoint_type: This should be detection.
* label_map_path: Path of label_map.pbtxt file.
* input_path: Path of the train TFRecord file.

In [4]:
!code './training/'{PATH_TO_MODEL}'/pipeline.config'
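
The fields listed above can also be set programmatically instead of in an editor. The following is a minimal sketch that rewrites `field: value` lines with a regular expression. The helper `set_field`, the shortened sample config and the checkpoint path are assumptions for illustration; a real pipeline.config is much longer and also contains an eval_input_reader with its own label_map_path and input_path, which this naive replacement would overwrite as well.

```python
import re


def set_field(config_text, field, value):
    """Replace the value on every `field: ...` line of config_text."""
    pattern = rf"^([ \t]*{re.escape(field)}:[ \t]*).*$"
    # A function replacement avoids backslash surprises in the new value.
    return re.sub(pattern, lambda m: m.group(1) + value, config_text, flags=re.MULTILINE)


# Shortened, hypothetical stand-in for a pipeline.config file.
sample_config = """model {
  faster_rcnn {
    num_classes: 90
  }
}
train_config {
  batch_size: 64
  num_steps: 25000
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
train_input_reader {
  label_map_path: "PATH_TO_BE_CONFIGURED"
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED"
  }
}
"""

updates = {
    "num_classes": "3",
    "batch_size": "1",
    "num_steps": "12000",
    "fine_tune_checkpoint": '"pretrained/checkpoint/ckpt-0"',  # hypothetical path
    "fine_tune_checkpoint_type": '"detection"',
    "label_map_path": '"data/label_map.pbtxt"',
    "input_path": '"data/train.record"',
}

edited = sample_config
for field, value in updates.items():
    edited = set_field(edited, field, value)
print(edited)
```

Because the pattern requires a colon directly after the field name, updating `fine_tune_checkpoint` does not touch the `fine_tune_checkpoint_type` line.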

### Training the model

The command below runs the model_main_tf2.py file from the TensorFlow Object Detection API and starts the training process. The parameters that should be provided are as follows.

* model_dir: Path where the checkpoints will be saved.
* pipeline_config_path: Path of the model config file.
* num_train_steps: Number of steps to train.

In [7]:
!python model_main_tf2.py --model_dir=training/TF2/training/{PATH_TO_MODEL} --pipeline_config_path=training/TF2/training/{PATH_TO_MODEL}/pipeline.config --num_train_steps=24000 --alsologtostderr

2022-01-04 11:53:26.007516: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-04 11:53:27.613176: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-04 11:53:27.653199: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-04 11:53:27.653797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.755GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-01-04 11:53:27.653820: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-04 11:53:27.655790: I tensorflow/stream_executor/platfor
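
When the same training command is reused for several models, it can help to assemble it from its parts. The sketch below builds the argument list for the flags described above; the helper `build_train_command` and its `root` parameter are hypothetical, and the command is only printed here, not executed.

```python
import shlex


def build_train_command(model_name, num_train_steps, root="training"):
    """Return the argument list for a model_main_tf2.py training run."""
    model_dir = f"{root}/{model_name}"
    return [
        "python",
        "model_main_tf2.py",
        f"--model_dir={model_dir}",  # where checkpoints are written
        f"--pipeline_config_path={model_dir}/pipeline.config",
        f"--num_train_steps={num_train_steps}",
        "--alsologtostderr",
    ]


command = build_train_command(
    "faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8",
    24000,
    root="training/TF2/training",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from a script. Keeping model_dir and pipeline_config_path derived from one variable makes it harder for the two paths to drift apart.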

### Exporting a Trained Inference Graph

After the model has been trained, we should export the trained inference graph. The command below does this. The input_type parameter should be set to image_tensor. The pipeline_config_path should be the same as in the training process. The trained_checkpoint_dir should be the path where the checkpoints were saved during training. The output_directory should be the path where you want the inference graph to be saved.

In [8]:
!python exporter_main_v2.py --input_type image_tensor --pipeline_config_path ./training/{PATH_TO_MODEL}/pipeline.config --trained_checkpoint_dir ./training/{PATH_TO_MODEL}/ --output_directory ./training/{PATH_TO_MODEL}/saved_model/

2022-01-04 20:18:54.123004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-04 20:18:55.515426: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-04 20:18:55.539534: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-04 20:18:55.540130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.755GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-01-04 20:18:55.540150: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-04 20:18:55.542080: I tensorflow/stream_executor/platfor

## SSD ResNet50

### Set the model path

In [9]:
PATH_TO_MODEL = "ssd_resnet50_v1_fpn_640x640_coco17_tpu-8"

### Hyperparameters and Configuration of the config file

The batch size will be changed from 1 to 2. SSD models are computationally less demanding. This system has 25 GB of GPU memory, which is enough to train SSD models with a larger batch size. The fine-tune checkpoint path will be changed to this model's pretrained weights. I will start the training with 10000 steps and analyse the loss during training using TensorBoard to decide whether the model needs further training.

In [4]:
!code './training/TF2/training/'{PATH_TO_MODEL}'/pipeline.config'
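
To put batch size and step count in relation, it can help to convert them into passes over the train set. The sketch below does this arithmetic; the train set size of 2000 images is a made-up figure for illustration, since the actual number is not stated here.

```python
def epochs_seen(num_steps, batch_size, num_train_images):
    """Return how many passes over the train set a run amounts to."""
    # Each step processes batch_size images.
    return num_steps * batch_size / num_train_images


NUM_TRAIN_IMAGES = 2000  # hypothetical train set size

print(epochs_seen(12000, 1, NUM_TRAIN_IMAGES))  # 6.0
print(epochs_seen(10000, 2, NUM_TRAIN_IMAGES))  # 10.0
```

With a batch size of 2, fewer steps are needed to show the model the same number of images as with a batch size of 1.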

In [8]:
!python model_main_tf2.py --model_dir=training/{PATH_TO_MODEL} --pipeline_config_path=training/{PATH_TO_MODEL}/pipeline.config --num_train_steps=30000 --alsologtostderr

2022-01-10 00:20:36.346908: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-10 00:20:38.010022: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-10 00:20:38.031607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-10 00:20:38.032097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.755GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-01-10 00:20:38.032115: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-10 00:20:38.033992: I tensorflow/stream_executor/platfor

In [10]:
!python exporter_main_v2.py --input_type image_tensor --pipeline_config_path ./training/{PATH_TO_MODEL}/pipeline.config --trained_checkpoint_dir ./training/{PATH_TO_MODEL}/ --output_directory ./training/{PATH_TO_MODEL}/saved_model/

2022-01-10 08:28:44.269296: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-10 08:28:45.697709: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-10 08:28:45.719458: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-10 08:28:45.719947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.755GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-01-10 08:28:45.719965: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-10 08:28:45.723672: I tensorflow/stream_executor/platfor