# (after) Train Model - Convert to frozen graph & tflite
## XPS 8100
#### tf115_p36 environment

This notebook takes the trained model checkpoint (model.ckpt-*) and converts it to:
- frozen graph  
- tflite
in an environment compatible with 

In [1]:
# CUDA toolkit version (the output below reports release 10.1)
! nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105


In [2]:
import os
import tensorflow as tf

In [3]:
print(tf.__version__)

1.15.2


### nvidia-smi
This shows how much memory is available on the GPU, which matters if you start getting OOM (out of memory) errors.

SageMaker p2.xlarge == 10+ GB  
Note what is available.

You can run (at a terminal)    
  $ nvidia-smi -l 1   
to watch the GPU being used during training. On SageMaker, you'll see the GPU is about 50% busy.
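
To check the free memory from a notebook cell instead of reading it by eye, the query mode of nvidia-smi can be parsed. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, not part of this project.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage in MiB (needs an NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_memory(out)


# Example input: a GPU with 4013 MiB used out of 4038 MiB
for used, total in parse_gpu_memory("4013, 4038\n"):
    print("free MiB:", total - used)
```

Calling `query_gpu_memory()` before and after a test run is one way to see whether the GPU memory was actually released.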


In [4]:
! nvidia-smi
# note the memory usage
# - i.e. you can't run inferences while the GPU memory is in use like this!

Fri Apr  3 10:21:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   40C    P0    N/A /  75W |   4013MiB /  4038MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

### Test your GPU
This should verify that your GPU is set up correctly.

## WARNING
This is a good test, but...  
If you run it, it may not release the GPU memory. I didn't figure this out fully. When I ran it, I got an OOM error when the model started the training cycle, even with a very small batch size. So something is up here. You could try stopping the notebook, then check nvidia-smi to verify that it released the GPU RAM.

In [None]:
# build a small matrix multiply pinned to the first GPU
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# run the graph; the result is a 2x2 matrix
with tf.Session() as sess:
    print(sess.run(c))


## MobileNet Model
Why use a MobileNet model? Because the end objective is a lightweight model, one that will run on a Google Coral TPU. This requires a quantized model (int8, not float32), and you get there from a TensorFlow Lite model. The recommended path is to start with a model structure that you know is compatible (MobileNet), then retrain on top of it.  
1. We pull the MobileNet v1 (there is a v2 that we aren't using) trained on COCO images
2. We train on top of it (transfer learning) with our CFA Products
3. That generates a TensorFlow Lite model (.tflite)
4. We will later convert the .tflite to an Edge TPU model
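
To make the int8 vs. float32 point concrete: quantization maps each float to an 8-bit integer through a scale and a zero point. The sketch below only illustrates that affine mapping. The `scale` and `zero_point` values are made up, the function names are hypothetical, and the real conversion is done by the TensorFlow Lite tooling, not by this code.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an 8-bit integer: q = round(x / scale) + zero_point, clamped."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float: x ~ scale * (q - zero_point)."""
    return scale * (q - zero_point)


scale, zero_point = 0.5, 0  # illustrative values only
q = quantize(3.2, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

With these values, 3.2 is stored as 6 and comes back as 3.0, which shows the precision given up in exchange for a smaller, faster model.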

## Global Constants

In [5]:
# Security - camera-api
# 8100

S3_MODEL_PATH = "s3://jmduff.security-system/model/"
# base model - starting point that we train on top of
# BASE_MODEL_FOLDER = "20180718_coco14_mobilenet_v1_ssd300_quantized"

# project directories
PROJECT = os.getcwd()

TASKS = os.path.join(PROJECT, "tasks")
MODEL_OUTPUT = os.path.join(PROJECT, 'model')
MODEL_DOWNLOAD = os.path.join(PROJECT, "trained_model_artifacts")

MODEL_DATE = '20200402'

# Link to Security Project
CAMERA_API = os.path.abspath(os.path.join(PROJECT, '..', 'camera-api'))
CAMERA_API_MODEL = os.path.join(CAMERA_API, 'model')

# Get (ssd-dag) Trained Model from S3


In [15]:
! rm -f {MODEL_DOWNLOAD}/*.*
! aws s3 cp {S3_MODEL_PATH}{MODEL_DATE}/ {MODEL_DOWNLOAD} --exclude='*.*' --include='*.*' --recursive --profile=jmduff

download: s3://jmduff.security-system/model/20200402/checkpoint to trained_model_artifacts/checkpoint
download: s3://jmduff.security-system/model/20200402/model.ckpt-170000.index to trained_model_artifacts/model.ckpt-170000.index
download: s3://jmduff.security-system/model/20200402/model.ckpt.index to trained_model_artifacts/model.ckpt.index
download: s3://jmduff.security-system/model/20200402/model.ckpt.meta to trained_model_artifacts/model.ckpt.meta
download: s3://jmduff.security-system/model/20200402/pipeline.config to trained_model_artifacts/pipeline.config
download: s3://jmduff.security-system/model/20200402/output_tflite_graph.tflite to trained_model_artifacts/output_tflite_graph.tflite
download: s3://jmduff.security-system/model/20200402/model.ckpt-170000.meta to trained_model_artifacts/model.ckpt-170000.meta
download: s3://jmduff.security-system/model/20200402/model.ckpt.data-00000-of-00001 to trained_model_artifacts/model.ckpt.data-00000-of-00001
download: s3://jmduff.security

# Trained Model Output -- IMPORTANT
Where did it go? THERE IS A BIG DIFFERENCE BETWEEN LOCAL TRAIN AND HOSTED TRAIN -- important !!

train*.py puts its output in code/model. This is true for both local and SageMaker hosted training. In this case, you trained locally, so the output is in code/model -- end of story.


When you train with a SageMaker hosted training job, the output still goes to code/model. HOWEVER, that is inside a docker image (that you will never see). It then gets copied to S3. Then the notebook (TrainModel_Step3_TrainingJob) pulls the model output from S3 and extracts the tarball to {PROJECT}/trained_model. SO AT THIS POINT THE OUTPUT IS IN A DIFFERENT LOCATION !!

The convert graph script pulls from {PROJECT}/trained_model (not the native code/model location). The easiest solution (you will see below) is to copy the desired checkpoint graph to the {PROJECT}/trained_model location.
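
The copy step described above can also be done in Python. This is a minimal sketch, not the project's own code: `copy_checkpoint` is a hypothetical helper, and its directory arguments stand in for code/model and {PROJECT}/trained_model.

```python
import glob
import os
import shutil


def copy_checkpoint(src_dir, dst_dir, step):
    """Copy every model.ckpt-<step>.* file from src_dir to dst_dir.

    Returns the sorted list of copied file names.
    """
    os.makedirs(dst_dir, exist_ok=True)
    pattern = os.path.join(src_dir, "model.ckpt-{}.*".format(step))
    copied = []
    for path in glob.glob(pattern):
        shutil.copy2(path, dst_dir)
        copied.append(os.path.basename(path))
    if not copied:
        # fail loudly instead of letting the convert script hit a missing file
        raise FileNotFoundError("no checkpoint files match " + pattern)
    return sorted(copied)
```

Matching on `model.ckpt-<step>.` is a little stricter than a `*<step>*` wildcard, so step 500 does not also pick up files from step 5000.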

In [16]:
! ls -la  {MODEL_DOWNLOAD}

total 288048
drwxr-xr-x  3 jay jay      4096 Apr  3 10:26 .
drwxr-xr-x 14 jay jay      4096 Apr  3 10:26 ..
-rw-r--r--  1 jay jay        77 Apr  3 09:40 checkpoint
-rw-r--r--  1 jay jay  29536515 Apr  3 09:40 frozen_inference_graph.pb
-rw-r--r--  1 jay jay 109220320 Apr  3 09:40 model.ckpt-170000.data-00000-of-00001
-rw-r--r--  1 jay jay     42388 Apr  3 09:40 model.ckpt-170000.index
-rw-r--r--  1 jay jay  11279875 Apr  3 09:40 model.ckpt-170000.meta
-rw-r--r--  1 jay jay  27381492 Apr  3 09:40 model.ckpt.data-00000-of-00001
-rw-r--r--  1 jay jay     14948 Apr  3 09:40 model.ckpt.index
-rw-r--r--  1 jay jay   3500465 Apr  3 09:40 model.ckpt.meta
-rw-r--r--  1 jay jay   6898968 Apr  3 09:40 output_tflite_graph.tflite
-rw-r--r--  1 jay jay      5103 Apr  3 09:40 pipeline.config
drwxr-xr-x  2 jay jay      4096 Apr  3 10:26 saved_model
-rw-r--r--  1 jay jay  27693983 Apr  3 09:40 tflite_graph.pb
-rw-r--r--  1 jay jay  79346065 Apr  3 09:40 tflite_graph.pbtxt


## Troubleshooting

1. If you run for 500 steps and then rerun the exact same process, it restores /ckpt/checkpoints (ckpt-500) and then thinks it is done. So it basically does nothing.
2. Don't delete ckpt/ (rm ckpt/*.*) WITHOUT also removing ckpt/checkpoints/. The program always checks that checkpoints subdirectory and tries to restore from it. For example, if you delete ckpt/ but leave ckpt/checkpoints, it finds a reference to ckpt-500, which you just deleted, so it aborts.
3. Always check your files & paths carefully. The error messages thrown for a missing file are not always clear and may send you on a wild goose chase when in reality it was just a missing file.
4. Can't import nets: this is a PATH problem (models/research/slim needs to be in your path). In the train.py program, it's added programmatically.
5. OOM when allocating tensor of shape [32,19,19,512] and type float
	 [[{{node gradients/zeros_97}}]] -- go to the config file and make the batch size smaller (e.g. 16)
6. AttributeError: 'ParallelInterleaveDataset' object has no attribute '_flat_structure' --- check your directories, like something didn't get installed correctly (base model? models/research stuff? training data?). This seems to be a problem with a TF build from scratch; with a pip install it went away.
7. If you are mixing local ops and Docker runs, you may have messed up the ownership of the output files and checkpoints. Try deleting everything and doing a new pull.
8. Trains, then errors with: TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
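
For item 5, it helps to see how the size of the failing tensor scales with the batch size (the first dimension of the shape). The arithmetic below assumes 4 bytes per float32 element, and `tensor_mib` is a hypothetical helper. This is only one tensor out of many held during training, so it illustrates the scaling rather than the total memory needed.

```python
from functools import reduce
from operator import mul


def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB (float32 = 4 bytes per element)."""
    n_elements = reduce(mul, shape, 1)
    return n_elements * bytes_per_element / (1024 ** 2)


print(tensor_mib([32, 19, 19, 512]))  # batch size 32, the shape from the error
print(tensor_mib([16, 19, 19, 512]))  # batch size 16: half the size
```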


## Making a usable model
At this point you have checkpoint files. You need models (graphs). There are many flavors:
- saved graph
- frozen graph
- TensorFlow Lite
- TensorRT
- EdgeTPU

The notebook TrainingJob_Step3_TrainingJob shows how to convert a checkpoint file to a graph (frozen graph & tflite). There is a bash file to do this.
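
The bash file itself is not reproduced in this notebook. As rough orientation only, and assuming it follows the usual Object Detection API route for a quantized SSD MobileNet, the conversion comes down to two commands: `export_tflite_ssd_graph.py` to write a TFLite-compatible frozen graph, then `tflite_convert` to produce the .tflite file. The sketch below only builds those command lines and runs nothing. The function name and paths are placeholders, and the array names, input shape and mean/std values are assumptions, so check them against your own script.

```python
def build_conversion_commands(pipeline_config, checkpoint_prefix, output_dir):
    """Return (export_cmd, convert_cmd) as argument lists; nothing is executed."""
    export_cmd = [
        "python", "object_detection/export_tflite_ssd_graph.py",
        "--pipeline_config_path=" + pipeline_config,
        "--trained_checkpoint_prefix=" + checkpoint_prefix,
        "--output_directory=" + output_dir,
        "--add_postprocessing_op=true",
    ]
    convert_cmd = [
        "tflite_convert",
        "--graph_def_file=" + output_dir + "/tflite_graph.pb",
        "--output_file=" + output_dir + "/output_tflite_graph.tflite",
        "--inference_type=QUANTIZED_UINT8",
        "--input_arrays=normalized_input_image_tensor",   # assumed input name
        "--output_arrays=TFLite_Detection_PostProcess,"   # assumed output names
        "TFLite_Detection_PostProcess:1,"
        "TFLite_Detection_PostProcess:2,"
        "TFLite_Detection_PostProcess:3",
        "--input_shapes=1,300,300,3",                     # assumed 300x300 SSD input
        "--mean_values=128",
        "--std_dev_values=128",
        "--allow_custom_ops",
    ]
    return export_cmd, convert_cmd


export_cmd, convert_cmd = build_conversion_commands(
    "pipeline.config", "trained_model/model.ckpt-170000", "out")
print(" ".join(export_cmd))
print(" ".join(convert_cmd))
```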
    

In [18]:
# try a copy of what you already converted
! ls {MODEL_OUTPUT} -l
! cp {MODEL_DOWNLOAD}/frozen_inference_graph.pb {MODEL_OUTPUT}
! ls {MODEL_OUTPUT} -l


total 153324
-rw-rw-r-- 1 jay jay  29536515 Mar 29 13:16 frozen_inference_graph.pb
-rw-rw-r-- 1 jay jay 109220320 Mar 27 09:51 model.ckpt-70000.data-00000-of-00001
-rw-rw-r-- 1 jay jay     42388 Mar 27 09:51 model.ckpt-70000.index
-rw-rw-r-- 1 jay jay  11279875 Mar 27 09:52 model.ckpt-70000.meta
-rw-r--r-- 1 jay jay      5056 Mar 26 14:09 mscoco_label_map.pbtxt
-rw-r--r-- 1 jay jay   6898968 Mar 29 13:15 output_tflite_graph.tflite
total 153324
-rw-rw-r-- 1 jay jay  29536515 Apr  3 10:30 frozen_inference_graph.pb
-rw-rw-r-- 1 jay jay 109220320 Mar 27 09:51 model.ckpt-70000.data-00000-of-00001
-rw-rw-r-- 1 jay jay     42388 Mar 27 09:51 model.ckpt-70000.index
-rw-rw-r-- 1 jay jay  11279875 Mar 27 09:52 model.ckpt-70000.meta
-rw-r--r-- 1 jay jay      5056 Mar 26 14:09 mscoco_label_map.pbtxt
-rw-r--r-- 1 jay jay   6898968 Mar 29 13:15 output_tflite_graph.tflite


In [None]:
# WAKE UP - make sure NUM_TRAINING_STEPS = the max number in the checkpoint files you listed above
#  e.g. 
# ls model
# -rw-rw-r--  1 ec2-user ec2-user 41116528 Jan 28 15:16 model.ckpt-6000.data-00000-of-00001
# -rw-rw-r--  1 ec2-user ec2-user    27275 Jan 28 15:16 model.ckpt-6000.index
# -rw-rw-r--  1 ec2-user ec2-user  6987305 Jan 28 15:16 model.ckpt-6000.meta
NUM_TRAINING_STEPS = 170000
# CODE is not set in Global Constants above; ASSUMPTION: it is the code/ directory under PROJECT
CODE = os.path.join(PROJECT, 'code')
! cp {CODE}/model/*{NUM_TRAINING_STEPS}* {PROJECT}/trained_model
! ls {PROJECT}/trained_model/*{NUM_TRAINING_STEPS}*

# get the config from the train*.py parameters above
PIPELINE_CONFIG = 'local_mobilenet_v1_ssd_security_retrain.config'
# PIPELINE_CONFIG = 'local_mobilenet_v1_ssd_retrain.config'
! ls {CODE}/{PIPELINE_CONFIG}

# if you don't see your checkpoint in */trained_model/  STOP - and fix it
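
Rather than reading the largest step number off the listing by eye, it can be extracted from the file names. This is a minimal sketch: `latest_checkpoint_step` is a hypothetical helper, and it assumes the `model.ckpt-<step>.<suffix>` naming seen in the listings above.

```python
import re


def latest_checkpoint_step(filenames):
    """Return the largest <step> among names like model.ckpt-<step>.index, or None."""
    steps = []
    for name in filenames:
        match = re.match(r"model\.ckpt-(\d+)\.", name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


names = ["checkpoint", "model.ckpt-170000.index", "model.ckpt-170000.meta",
         "model.ckpt.index", "pipeline.config"]
print(latest_checkpoint_step(names))
```

In the notebook, the list of names could come from `os.listdir()` on the model directory, and the result could be used to set NUM_TRAINING_STEPS.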

In [None]:
# convert checkpoint is a task script - located in the tasks/ directory
os.chdir(TASKS)  
! ./convert_checkpoint_to_edgetpu_tflite.sh --checkpoint_num {NUM_TRAINING_STEPS} --pipeline_config {PIPELINE_CONFIG}

In [None]:
# Tensorflow FROZEN GRAPH
! ls {PROJECT}/tensorflow_model -l

In [None]:
# Tensorflow Lite model
! ls {PROJECT}/tflite_model -l

### Security
If you are working on the security project, you need to  
put the output_tflite_graph.tflite file in: camera-api/model/  


In [None]:
# copy the tflite model over to camera-api/model
! cp  {PROJECT}/tflite_model/output_tflite_graph.tflite {CAMERA_API_MODEL}

In [None]:
# just checking ...
! ls -ls {CODE}/ckpt

In [None]:
# move the (converted?  frozen?) ckpt to the starting point
# NOW you can re-train on top of it
! cp {PROJECT}/tensorflow_model/model.ckpt.* {CODE}/ckpt

In [None]:
# backup
! aws s3 ls --profile=jmduff

In [None]:
MODEL_DATE = '20200326'
! aws s3 cp {PROJECT}/tensorflow_model s3://jmduff.security-system/model/{MODEL_DATE}/ --exclude='*.*' --include='*.*' --recursive --profile=jmduff
! aws s3 cp {PROJECT}/tflite_model s3://jmduff.security-system/model/{MODEL_DATE}/ --exclude='*.*' --include='*.*' --recursive --profile=jmduff
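
Because MODEL_DATE is typed by hand in more than one cell, it is easy to back up to a stale date folder. One option is to derive the destination from a date object. This is a minimal sketch: `model_s3_uri` is a hypothetical helper, and the prefix is the S3_MODEL_PATH value from Global Constants.

```python
import datetime


def model_s3_uri(base_uri, date):
    """Build <base_uri>/<YYYYMMDD>/ for a given datetime.date."""
    return "{}/{}/".format(base_uri.rstrip("/"), date.strftime("%Y%m%d"))


print(model_s3_uri("s3://jmduff.security-system/model/", datetime.date(2020, 4, 2)))
```

Passing `datetime.date.today()` would stamp the backup with the day it is actually made.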

In [None]:
os.chdir('/media/home/jay/projects/ssd-dag')
! pwd