# Training your TFOD model using Colab

This notebook shows how to do limited training of your TFOD model using Colab (subject to the GPU time limit imposed on free Colab accounts).

***WARNING*** 
Google Colab appears to detect long-running scripts on the Linux instance and may block you from using the GPU in the future. 
Use this notebook at your own risk!

### Mount your Google Drive and bind-mount it at `/drive`

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import time
import datetime
import sys

!mkdir -p /drive
!mount --bind /content/drive/MyDrive /drive
!mkdir -p ~/.ssh

# Note: every `!` or system_raw() call runs in its own shell, so a shell alias
# defined here would not persist to later commands and is therefore not set up.

### Install Tensorflow object detection API

We will clone the TFOD API from GitHub and install it.


In [None]:
%cd /content/
!git clone --depth 1 https://github.com/tensorflow/models

In [None]:
%%bash
# Install the Object Detection API (the %%bash magic must be the first line of the cell)
cd /content/models/research/
protoc object_detection/protos/*.proto --python_out=.
cp object_detection/packages/tf2/setup.py .
python -m pip install .

### Training Object Detection model

Make sure you have created the project folder (with all the necessary subfolders) in your Google Drive, e.g. `/drive/balloon_project`.

### Copy repo files to drive

In [None]:
%%bash 
cd /content
git clone https://github.com/arifhamed/it3103-ari /content/git/it3103
cp -r /content/git/it3103/week5-colab/balloon_project /content/drive/MyDrive/
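
Once the project folder is in place, a quick check like the one below can confirm that the subfolders used by the dataset step exist. This is only a sketch: the helper name `missing_subfolders` is made up for illustration, and the list of required folders is an assumption based on the paths used in this notebook (`models` and `pretrained_models` are created by later cells, so they are not checked).

```python
from pathlib import Path

# Subfolders that the dataset cell copies files into (assumed from this notebook's paths).
REQUIRED_SUBFOLDERS = [
    "data/images",
    "data/annotations",
]

def missing_subfolders(project_root, required=REQUIRED_SUBFOLDERS):
    """Return the required subfolders that do not exist under project_root."""
    root = Path(project_root)
    return [sub for sub in required if not (root / sub).is_dir()]

# Example (path assumed from this notebook):
# print(missing_subfolders("/content/drive/MyDrive/balloon_project"))
```

An empty list means the folders are there; anything else can be created with `mkdir -p` before continuing.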

### Download dataset 

We download the dataset to the `/content` folder and unzip it there (`/content` is temporary storage). We then copy the images and annotations to the project folder.
Note: you only need to do this the first time. You can skip the following cell if you are resuming your training.

In [None]:
%%bash
cd /content
wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/balloon_dataset_pascalvoc.zip -q
unzip -q balloon_dataset_pascalvoc.zip -d /content/balloon_dataset
cp  /content/balloon_dataset/*.jpg  /content/drive/MyDrive/balloon_project/data/images/
cp /content/balloon_dataset/*.xml  /content/drive/MyDrive/balloon_project/data/annotations/
rm -rf balloon_dataset
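
After copying, it can be worth checking that every image has an annotation and vice versa. The sketch below assumes the usual Pascal VOC convention that a `.jpg` and its `.xml` share the same base name; the function name `unmatched_stems` is hypothetical.

```python
from pathlib import Path

def unmatched_stems(images_dir, annotations_dir):
    """Return (images without an .xml, annotations without a .jpg), matched by base name."""
    image_stems = {p.stem for p in Path(images_dir).glob("*.jpg")}
    xml_stems = {p.stem for p in Path(annotations_dir).glob("*.xml")}
    return sorted(image_stems - xml_stems), sorted(xml_stems - image_stems)

# Example (paths assumed from this notebook):
# no_xml, no_jpg = unmatched_stems(
#     "/content/drive/MyDrive/balloon_project/data/images",
#     "/content/drive/MyDrive/balloon_project/data/annotations",
# )
```

Both lists being empty suggests the copy completed and the pairs line up.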

### Create the Label Map

You only need to do this the first time. 

In [None]:
%%writefile /content/drive/MyDrive/balloon_project/data/label_map.pbtxt
item {
    id: 1
    name: 'balloon'
}
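
For projects with more than one class, the label map can also be generated from a list of names. The following is a minimal sketch, not part of the original workflow; `build_label_map` is an illustrative name. Ids start at 1 because id 0 is reserved for the background class.

```python
def build_label_map(class_names):
    """Return label map text with ids starting at 1 (id 0 is reserved for background)."""
    items = []
    for idx, name in enumerate(class_names, start=1):
        items.append(f"item {{\n    id: {idx}\n    name: '{name}'\n}}\n")
    return "\n".join(items)

print(build_label_map(["balloon"]))
```

The returned text could then be written to `label_map.pbtxt` in the project's `data` folder.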

### Create the TFRecords 

You only need to do this the first time.

In [None]:
%%bash
cd /content/drive/MyDrive/balloon_project/
bash /content/drive/MyDrive/balloon_project/create_tf_voc.sh 

### Download Pretrained Model

You only need to do this the first time.

In [None]:
%%bash
cd /content
wget http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_320x320_coco17_tpu-8.tar.gz
mkdir -p /content/drive/MyDrive/balloon_project/pretrained_models/
tar xzvf /content/ssd_mobilenet_v2_320x320_coco17_tpu-8.tar.gz -C /content/drive/MyDrive/balloon_project/pretrained_models/

Copy the pipeline config file to your model experiment directory

In [None]:
%%bash 
# create the target directory first 
mkdir -p /content/drive/MyDrive/balloon_project/models/ssd_mobilenet_v2_320x320_coco17_tpu-8/run1
# copy the pipeline.config file to the target folder
cp /content/drive/MyDrive/balloon_project/pretrained_models/ssd_mobilenet_v2_320x320_coco17_tpu-8/pipeline.config /content/drive/MyDrive/balloon_project/models/ssd_mobilenet_v2_320x320_coco17_tpu-8/run1/pipeline.config


### Configure your pipeline.config file 

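Now double-click the pipeline config file at `/drive/balloon_project/models/ssd_mobilenet_v2_320x320_coco17_tpu-8/run1/pipeline.config` to open it for editing. Fields that typically need updating for a new dataset include `num_classes`, `batch_size`, `fine_tune_checkpoint`, `fine_tune_checkpoint_type`, `label_map_path` and `input_path`.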

You only need to do this the first time.
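
If you prefer to script the edit rather than use the editor, a plain-text substitution is one option. The sketch below is an assumption-laden illustration, not the TFOD API: `set_config_field` is a hypothetical helper that replaces every occurrence of a `field: value` line, so it suits fields such as `num_classes` or `fine_tune_checkpoint_type`. Fields that appear in both the train and eval sections with different values (such as `input_path`) are better edited by hand.

```python
import re

def set_config_field(config_text, field, value):
    """Replace the value on every `field: ...` line of a pipeline.config text.

    Strings are written in double quotes; other values are written as-is.
    Raises KeyError if the field does not appear in the text.
    """
    rendered = f'"{value}"' if isinstance(value, str) else str(value)
    pattern = re.compile(rf"^([ \t]*{re.escape(field)}:[ \t]*).*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + rendered, config_text)
    if count == 0:
        raise KeyError(field)
    return new_text

# A shortened stand-in for the real file, for illustration only.
sample = '''model {
  ssd {
    num_classes: 90
  }
}
train_config {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
'''
updated = set_config_field(sample, "num_classes", 1)
updated = set_config_field(updated, "fine_tune_checkpoint_type", "detection")
print(updated)
```

To apply it to the real file, read `pipeline.config` into a string, call the helper for each field, and write the result back.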

### Start the training 

You can start or resume your training by running train.sh. Standard output is redirected to train.log and errors to train_err.log, so check train_err.log first if train.sh does not appear to be running.

In [None]:
%cd /content/drive/MyDrive/balloon_project
get_ipython().system_raw('bash train.sh 1>train.log 2>train_err.log &')

In [None]:
# make sure the process is running 
# the [t] bracket keeps grep from matching its own command line
!ps aux | grep "[t]rain.sh"

In [None]:
# Show the last lines of the training log; re-run this cell to refresh.
# (A background `tail -f` started with system_raw() would print nothing in the notebook.)
!tail -n 20 /content/drive/MyDrive/balloon_project/train.log
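
The same check can be done from Python, which is handy for looking at both log files in one cell. This is a small sketch; `tail_lines` is an illustrative helper, and the log paths in the comment are the ones used by the training cell above.

```python
from collections import deque

def tail_lines(path, n=20):
    """Return the last n lines of a text file (fewer if the file is shorter)."""
    with open(path, "r", errors="replace") as f:
        # deque with maxlen keeps only the most recent n lines while reading
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]

# Example (paths assumed from this notebook):
# for name in ("train.log", "train_err.log"):
#     print(f"--- {name} ---")
#     print("\n".join(tail_lines(f"/content/drive/MyDrive/balloon_project/{name}")))
```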

### Start Evaluation

Let's start our eval script. Standard output is redirected to eval.log and errors to eval_err.log, so check eval_err.log if the script does not appear to be running. 

In [None]:
%cd /content/drive/MyDrive/balloon_project
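# Hide the GPU from the eval process so it runs on the CPU. The variable must be set in the
# same command as eval.sh, because an `export` in a separate system_raw() call does not persist.
get_ipython().system_raw('CUDA_VISIBLE_DEVICES=-1 bash eval.sh 1>eval.log 2>eval_err.log &')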

In [None]:
# make sure the process is running 
# the [e] bracket keeps grep from matching its own command line
!ps aux | grep "[e]val.sh"

### Visualize using Tensorboard


In [None]:
%reload_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/balloon_project/models/ssd_mobilenet_v2_320x320_coco17_tpu-8/run1

In [None]:
# Keep this cell busy for about 3 hours (10800 s) while training runs, showing the elapsed time
training_start = time.time()
wait_seconds = 10800
while time.time() - training_start < wait_seconds:
  time.sleep(1)
  elapsed = int(time.time() - training_start)
  sys.stdout.write("\r" + str(datetime.timedelta(seconds=elapsed)))
  sys.stdout.flush()

### Stop training and evaluation

To stop training and evaluation, find the process IDs of train.sh and eval.sh and kill them. 

In [None]:
# Each `!` command runs in a fresh shell, so a shell alias defined in an earlier cell
# is not available here. Look up the PIDs explicitly and kill them instead.
# `xargs -r` skips the kill when no matching process is found.
!ps aux | grep -i train.sh | grep -v grep | awk '{print $2}' | xargs -r kill -9
!ps aux | grep -i eval.sh | grep -v grep | awk '{print $2}' | xargs -r kill -9

### Export your model


In [None]:
%cd /content/drive/MyDrive/balloon_project
get_ipython().system_raw('bash export.sh 1>export.log 2>export_err.log &')