# 7144COMP/CW2: Bird Multiple Object Detection Using SSD
## Part 2. Training
### Overview
In this notebook, I will train an object detection model using the pre-processed data from the previous notebook. 

- Download the object detection model from the [TensorFlow 2 Detection Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md).
- The model's hyperparameters and configuration are set in the ```ssd_config.config``` file. 
- The model is trained through this notebook using ```model_main_tf2.py``` with the relevant arguments.


#### Prerequisites
- Environment Setup (see Part 0)
- Data preprocessing (see Part 1)

## 1. Download the model from TensorFlow 2 Detection Model Zoo 
#### Import the necessary packages

In [35]:
import os
import re #<- regular expressions
import tensorflow as tf

#### Setup

In [40]:
# Define constants
RANDOM_SEED = 99 #<- ensures the reproducibility of the training results
BATCH_SIZE = 1
NUM_STEPS = 28000 
NUM_EVAL_STEPS = 1000
EPOCHS = 1

# Current directory
current_dir = os.getcwd()

#### Download Pre-trained ```SSD ResNet101``` from TensorFlow 2 Detection Model Zoo 

**Why SSD**?

Single Shot MultiBox Detector (SSD) is a fast object detection algorithm that uses a single deep neural network to predict object classes and locations in one pass. It was introduced by Liu et al. in the paper "SSD: Single Shot MultiBox Detector", first released in 2015.

One advantage of SSD is speed: it uses a **single feedforward convolutional neural network (CNN)** to make predictions, which makes it faster than two-stage detectors and, with a lightweight backbone, able to run in real time on suitable hardware.

Another advantage of SSD is that it is relatively simple to implement and train, as it does not require a separate region proposal stage. Instead, it predicts class scores and box offsets directly for a fixed set of default (anchor) boxes tiled over several feature maps.

A disadvantage of SSD is that it may not be as accurate as other object detection algorithms, such as Faster R-CNN, which uses a two-stage approach to object detection. 

However, later variants improve accuracy: the model used here combines a ```ResNet101``` feature extractor with a feature pyramid network (FPN) and focal loss, a configuration also known as RetinaNet.

In [37]:
# Download SSD ResNet101 if it doesn't exist locally
if not os.path.isdir('ssd_resnet101_v1_fpn_640x640_coco17_tpu-8'):
    !wget http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet101_v1_fpn_640x640_coco17_tpu-8.tar.gz
    # Decompress the archive
    !tar -xf ssd_resnet101_v1_fpn_640x640_coco17_tpu-8.tar.gz
    # Cleanup
    !rm ssd_resnet101_v1_fpn_640x640_coco17_tpu-8.tar.gz
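
The shell commands above depend on `wget`, `tar`, and `rm` being available. A minimal sketch of the same three steps using only the Python standard library is shown here; `download_and_extract` is a hypothetical helper name, and the URL is the one used in the cell above.

```python
import os
import tarfile
import urllib.request


def download_and_extract(url, dest_dir="."):
    """Download a .tar.gz archive, unpack it into dest_dir, then delete the archive."""
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)  # fetch the archive (like wget)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)                   # unpack it (like tar -xf)
    os.remove(archive_path)                        # remove the archive (like rm)


MODEL_NAME = "ssd_resnet101_v1_fpn_640x640_coco17_tpu-8"
MODEL_URL = ("http://download.tensorflow.org/models/object_detection/tf2/"
             "20200711/" + MODEL_NAME + ".tar.gz")
```

Calling `download_and_extract(MODEL_URL)` only when `os.path.isdir(MODEL_NAME)` is false would mirror the guard in the cell above.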

#### Load Train, Test, Valid TFRecords, labelmap

In [38]:
# Train, Test, Valid TFRecord files
train_record_path = os.path.join(current_dir, 'Birds', 'train', 'birds.tfrecord')
test_record_path = os.path.join(current_dir, 'Birds', 'test', 'birds.tfrecord')
valid_record_path = os.path.join(current_dir, 'Birds', 'valid', 'birds.tfrecord')

# Labelmap
labelmap_path = os.path.join(current_dir, 'Birds', 'train', 'birds_label_map.pbtxt')
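
A mistyped path may only surface once training has started, so a quick existence check before continuing can save time. The sketch below assumes nothing beyond the standard library; `find_missing` is a hypothetical helper, not part of the TensorFlow Object Detection API.

```python
import os


def find_missing(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]
```

For example, `find_missing([train_record_path, test_record_path, valid_record_path, labelmap_path])` should return an empty list when Part 1 has produced all the expected files.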

## 2. Model's Config files, Checkpoints and Hyperparameters

In [41]:
# Path to the pre-trained checkpoint used as the starting point for fine-tuning
fine_tune_checkpoint_ssd = 'ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0'
print('Checkpoint Dir:', fine_tune_checkpoint_ssd)

Checkpoint Dir: ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0


In [42]:
# config files can be edited and updated on ayoubbensakhria/TensorFlowOD repository
if os.path.isfile('ssd_pipeline.config'):
    !rm 'ssd_pipeline.config'

# Download the latest base pipeline config file
!wget https://raw.githubusercontent.com/ayoubbensakhria/TensorFlowOD/master/7144COMP/training/ssd_pipeline.config

# The data_augmentation_options section has been removed because augmentation was already applied in Roboflow
base_config_path_ssd = 'ssd_pipeline.config'

--2023-01-03 21:24:59--  https://raw.githubusercontent.com/ayoubbensakhria/TensorFlowOD/master/7144COMP/training/ssd_pipeline.config
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4921 (4.8K) [text/plain]
Saving to: ‘ssd_pipeline.config’


2023-01-03 21:24:59 (114 MB/s) - ‘ssd_pipeline.config’ saved [4921/4921]



#### Hyperparameters

```num_epochs```: an epoch is one forward and one backward pass over all the training examples. One step takes about 2 seconds on average and an epoch consists of 28000 steps (batch_size=1), so one epoch takes approximately 15.5 hours. 

Setting the number of epochs to ```1``` means that the model makes only one pass through the training dataset, which is enough to quickly test the model on a small training set. For most real-world datasets, however, a single epoch is generally not sufficient to reach good performance, as the model sees each example only once.

```batch_size```: the number of training examples in one forward/backward pass (one step). The higher the batch size, the more memory is needed. Here the available memory allows a maximum batch size of 1.

A batch size of ```1``` means that the model will process and update its weights based on a single example at a time. This can be useful in our case where we are trying to train on a very small dataset, as it allows the model to make more frequent weight updates. However, it can also be inefficient and slow, as the model must process and update the weights for each example individually.

```num_steps```: the number of training iterations, where each step is a single update of the model weights.

The number of steps in one epoch equals the number of training examples divided by the batch size. The dataset size is given here as 4000 examples, whereas the epoch estimate above assumes 28000 steps per epoch at a batch size of 1. Under the first figure, ```28000``` steps correspond to about 7 passes over the data, and under the second to exactly one. In general, the number of steps should be derived from the desired number of epochs and the batch size, rather than set to the number of examples in the dataset.
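
The arithmetic linking examples, batch size, steps, and wall-clock time can be sketched as follows. This is a rough estimate that uses the figures quoted above (2 seconds per step, batch size 1, 28000 steps per epoch); the count of 28000 training examples is an assumption implied by those figures, and `training_budget` is a hypothetical helper.

```python
import math


def training_budget(num_examples, batch_size, epochs, seconds_per_step):
    """Return (steps per epoch, total steps, estimated hours) for a training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    hours = total_steps * seconds_per_step / 3600
    return steps_per_epoch, total_steps, hours


steps_per_epoch, total_steps, hours = training_budget(
    num_examples=28000,   # assumed: implied by 28000 steps per epoch at batch size 1
    batch_size=1,
    epochs=1,
    seconds_per_step=2.0,
)
print(steps_per_epoch, total_steps, round(hours, 1))  # 28000 28000 15.6
```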

```fixed_shape_resizer```: a fixed resolution of ```640x640 px``` is useful for ensuring that all input images have the same size, which can make them easier to process and may improve the performance of the model.

```grid_anchor_generator```: anchor boxes are used to identify potential object locations within the image. In configurations that use a grid anchor generator (typically Faster R-CNN), performance can be affected by parameters such as ```scales```, ```aspect_ratios```, ```height_stride```, and ```width_stride```. The SSD config printed below uses a ```multiscale_anchor_generator``` instead, with parameters such as ```min_level``` and ```max_level```. In either case, it may be necessary to experiment with different values to find the best-performing configuration.
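
To make the role of scales, aspect ratios, and strides concrete, here is a minimal pure-Python sketch of tiling anchor boxes over a feature-map grid. It is an illustration only, not the TensorFlow Object Detection API implementation; the function name, the `base_size` parameter, and all default values are assumptions.

```python
import math


def grid_anchors(grid_height, grid_width, scales, aspect_ratios,
                 base_size=256, height_stride=16, width_stride=16):
    """Return anchors as (ymin, xmin, ymax, xmax) tuples in pixel coordinates."""
    anchors = []
    for row in range(grid_height):
        for col in range(grid_width):
            # Anchor centre for this grid cell, spaced by the strides.
            cy = row * height_stride
            cx = col * width_stride
            for scale in scales:
                for ratio in aspect_ratios:  # ratio = width / height
                    h = base_size * scale / math.sqrt(ratio)
                    w = base_size * scale * math.sqrt(ratio)
                    anchors.append((cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2))
    return anchors


anchors = grid_anchors(2, 2, scales=[0.5, 1.0], aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 2 x 2 grid cells * 2 scales * 3 aspect ratios = 24
```

Each grid cell receives one anchor per (scale, aspect ratio) pair, so adding scales or ratios increases the number of candidate boxes the model has to score.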

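```second_stage_post_processing```: in two-stage Faster R-CNN configs, this block takes the output of the second stage (the box classifier that follows the region proposal network) and generates the final set of object detections. A single-stage SSD config has no second stage; its counterpart is the ```post_processing``` block. The specific parameters used can have a significant impact on the model's performance.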
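
Post-processing in detection pipelines of this kind typically combines a score threshold with non-maximum suppression (NMS), which discards boxes that overlap a higher-scoring box too much. The following is a minimal pure-Python sketch of that idea, not the TensorFlow implementation; the threshold values and function names are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two (ymin, xmin, ymax, xmax) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.3, iou_threshold=0.5):
    """Return indices of the kept boxes, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 0.9),
         (2.0, 2.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
scores = [0.9, 0.8, 0.7, 0.1]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Here the second box is suppressed because its IoU with the first is 0.9, and the last box is dropped by the score threshold.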

 
Overall, these hyperparameters may be suitable for quickly testing the performance of a model on our dataset, but they may not be optimal for training a model to good performance on a real-world dataset.

In [43]:
# Function that edits the model pipeline config
def edit_config(model_name, base_config_path, fine_tune_checkpoint):
  with open(base_config_path) as f:
    config = f.read()

  with open('{model}_config.config'.format(model=model_name), 'w') as f:

    # Set labelmap path
    config = re.sub('label_map_path: ".*?"', 
              'label_map_path: "{}"'.format(labelmap_path), config)
    
    # Set fine_tune_checkpoint path
    config = re.sub('fine_tune_checkpoint: ".*?"',
                    'fine_tune_checkpoint: "{}"'.format(fine_tune_checkpoint), config)

    # Set train tf-record file path
    config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/train)(.*?")', 
                    'input_path: "{}"'.format(train_record_path), config)

    # Set test tf-record file path
    config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/val)(.*?")', 
                    'input_path: "{}"'.format(test_record_path), config)

    # Set number of classes.
    config = re.sub('num_classes: [0-9]+',
                    'num_classes: {}'.format(4), config)

    # Set batch size
    config = re.sub('batch_size: [0-9]+',
                    'batch_size: {}'.format(BATCH_SIZE), config)

    # Set training steps
    config = re.sub('num_steps: [0-9]+',
                    'num_steps: {}'.format(NUM_STEPS), config)

    # Set fine-tune checkpoint type to detection
    config = re.sub('fine_tune_checkpoint_type: "classification"', 
              'fine_tune_checkpoint_type: "{}"'.format('detection'), config)

    f.write(config)
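
To see how these substitutions behave, the same patterns can be applied to a small config fragment. The fragment and the replacement paths below are made up for illustration and are not the contents of the real ```ssd_pipeline.config```.

```python
import re

# Hypothetical fragment in the style of a pipeline config; not the real file.
sample = '''
num_classes: 90
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
train_input_reader {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/train2017-?????-of-00256.tfrecord"
  }
}
'''

# Replace the class count.
edited = re.sub('num_classes: [0-9]+', 'num_classes: {}'.format(4), sample)

# The non-greedy ".*?" stops at the first closing quote.
edited = re.sub('fine_tune_checkpoint: ".*?"',
                'fine_tune_checkpoint: "{}"'.format('model/checkpoint/ckpt-0'), edited)

# Only input_path entries that contain the train placeholder are rewritten.
edited = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/train)(.*?")',
                'input_path: "{}"'.format('/data/train/birds.tfrecord'), edited)

print(edited)
```

Because ```re.sub``` replaces every match, a pattern such as ```batch_size: [0-9]+``` also rewrites any later occurrence of the same key, for example in the evaluation section of the config.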

In [44]:
# Edit config SSD
edit_config('ssd', base_config_path_ssd, fine_tune_checkpoint_ssd)

# Clean up
!rm 'ssd_pipeline.config'

# Print config pipeline
%cat 'ssd_config.config'

# SSD with Resnet 101 v1 FPN feature extractor, shared box predictor and focal
# loss (a.k.a Retinanet).
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint
# Train on TPU-8
#
# Achieves 35.4 mAP on COCO17 Val

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 4
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7


## 3. Train SSD ResNet101 Object Detector

In [45]:
# Model training directory and config pipeline
model_dir = os.path.join(current_dir, 'training')
pipeline_config_path = 'ssd_config.config'

# Test training params
print (pipeline_config_path, model_dir, NUM_STEPS)

ssd_config.config /home/msc1/Desktop/7144COMP/Models/ssd_resnet101/training 28000


In [None]:
# Execute training
!python $current_dir/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=$pipeline_config_path \
    --model_dir=$model_dir \
    --alsologtostderr \
    --num_train_steps=$NUM_STEPS \
    --run_once=False \
    --sample_1_of_n_eval_examples=1 \
    --eval_interval_secs=600 \
    --num_eval_steps=$NUM_EVAL_STEPS

2023-01-03 21:25:23.738115: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-03 21:25:24.771567: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-01-03 21:25:24.771629: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-01-03 21:25:26.605340: E tensorflow/compiler/xla/stream_executor/cuda/c

The script above was used to train an object detection model using ```TensorFlow 2```. It takes in a pipeline configuration file, which specifies the model and training configuration, and a set of training and evaluation data.

The script has several flags that can be used to control the training process. 

- The ```--pipeline_config_path``` flag specifies the path to the pipeline configuration file, which defines the model architecture and training parameters. 

- The ```--model_dir``` flag specifies the directory where the trained model and training logs should be saved.

- The ```--alsologtostderr``` flag causes the training logs to be written to both the log file and the console.

- The ```--num_train_steps``` flag specifies the number of training steps to run.

- The ```--num_eval_steps``` flag specifies the number of evaluation steps to run. 

- The ```--run_once``` flag tells the script to run evaluation once at the end of training if set to ```True```, or to run evaluation at regular intervals during training if set to ```False```. 

- The ```--sample_1_of_n_eval_examples``` flag sets the sampling rate for evaluation: one out of every ```n``` evaluation examples is used, so a value of ```1``` evaluates on every example.
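
Outside a notebook, the same command can be assembled as an argument list and passed to ```subprocess.run```. The sketch below covers only a subset of the flags described above; `build_train_command` is a hypothetical helper and the script path is the one assumed by this notebook's directory layout.

```python
import shlex


def build_train_command(script_path, pipeline_config_path, model_dir, num_train_steps):
    """Return the training command as a list of arguments."""
    return [
        "python", script_path,
        "--pipeline_config_path={}".format(pipeline_config_path),
        "--model_dir={}".format(model_dir),
        "--num_train_steps={}".format(num_train_steps),
        "--alsologtostderr",
    ]


cmd = build_train_command(
    "models/research/object_detection/model_main_tf2.py",
    "ssd_config.config",
    "training",
    28000,
)
# Show the command as it would be typed in a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Running it would then be a matter of `subprocess.run(cmd, check=True)`; passing a list avoids shell quoting problems when paths contain spaces.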

### Export our OD inference graph

Graphs are data structures that contain a set of ```tf.Operation``` objects, which represent units of computation; and ```tf.Tensor``` objects, which represent the units of data that flow between operations. 

Here we will save our object detection inference graph files in ```ssd_inference_graph/saved_model```.

The following cell uses the ```exporter_main_v2.py``` script from the TensorFlow Object Detection API to export the trained model. The script restores the model from the specified checkpoint directory, rebuilds its architecture from the pipeline configuration file, and saves the result in the specified output directory.

The exported model is a copy of the trained model in the SavedModel format, which is suitable for inference and serving. The exporter also writes a checkpoint alongside it, which can be used for further training.

In [34]:
# Define the output directory
output_directory = 'ssd_inference_graph'

# Export OD inference graph
!python $current_dir/models/research/object_detection/exporter_main_v2.py \
    --trained_checkpoint_dir $model_dir \
    --output_directory $output_directory \
    --pipeline_config_path $pipeline_config_path

2022-12-22 13:32:16.408113: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-22 13:32:17.489770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2022-12-22 13:32:17.489831: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2022-12-22 13:32:19.437146: E tensorflow/compiler/xla/stream_executor/cuda/c

- ```trained_checkpoint_dir```: Directory containing the trained model checkpoints.
- ```output_directory```: Directory where the exported model will be saved.
- ```pipeline_config_path```: Path to the pipeline configuration file, which specifies the model architecture and other options.
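
Once exported, the SavedModel can be loaded back for inference. The sketch below assumes the export directory used above and the default image-tensor input; `load_detector` and `filter_detections` are hypothetical helper names, and the example values are made up.

```python
def load_detector(saved_model_dir="ssd_inference_graph/saved_model"):
    """Load the exported SavedModel; TensorFlow is imported only when this is called."""
    import tensorflow as tf
    return tf.saved_model.load(saved_model_dir)


def filter_detections(boxes, scores, classes, min_score=0.5):
    """Keep (box, score, class) triples whose score is at least min_score."""
    return [(b, s, int(c)) for b, s, c in zip(boxes, scores, classes) if s >= min_score]


# Made-up values shaped like the outputs for a single image.
demo = filter_detections(
    boxes=[[0.1, 0.1, 0.5, 0.5], [0.2, 0.2, 0.9, 0.9]],
    scores=[0.92, 0.30],
    classes=[1.0, 3.0],
)
print(demo)  # [([0.1, 0.1, 0.5, 0.5], 0.92, 1)]
```

With a real image, the detector returned by `load_detector()` is typically called on a `uint8` tensor of shape `[1, height, width, 3]` and returns a dictionary whose `detection_boxes`, `detection_scores`, and `detection_classes` entries can be passed, for the first image in the batch, to `filter_detections`.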

## Next Steps
- Evaluate the trained model using TensorBoard.