# Train Model
#### tensorflow_p36 environment

There are several ways to run this code:
- on a SageMaker notebook (the original intent)
- on a physical machine with a well-configured dev environment
- on a physical machine using a Docker container (grilledclub/cuda-100-tf114:*)

## Step 1 - Develop a train.py script

This is SageMaker Script Mode.   It is relatively new and much easier than the original SageMaker design.   You need to develop a train.py program that you will:
1. run locally - that means it runs on the local machine's resources
2. then test locally inside a Docker container

If it passes both tests, it should run fine when you create a SageMaker Training job.   THIS IS THE CORRECT WAY TO USE SAGEMAKER.   Don't get confused - running jobs on the local SageMaker notebook server isn't really what SageMaker was designed for.  It is designed to take your program and send it to outside resources (using a Docker container).


In [35]:
# SageMaker is at 1.15
# - kernel = conda_python3
# ! pip install tensorflow-gpu==1.14
#
# - kernel = conda_tensorflow_p36
#   1.15

In [36]:
# currently CUDA 10.0
! nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


In [37]:
import os
import tensorflow as tf

In [38]:
print (tf.__version__)

1.15.0


### nvidia-smi
This shows how much memory is available on the GPU.   This is important if you start getting OOM (out of memory) errors.

SageMaker p2.xlarge == 10+ GB  
Note what is available.

You can run (at a terminal)    
  $ nvidia-smi -l 1   
to watch GPU usage during training.  On SageMaker, you'll see the GPU is about 50% busy.
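
If you want the memory numbers inside the notebook rather than eyeballing the table, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and uses its CSV query mode; the helper names (`parse_gpu_memory`, `query_gpu_memory`) are made up for this example and are not part of the project.

```python
import subprocess

def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one line per GPU) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return gpus

def query_gpu_memory():
    """Ask nvidia-smi for used/total memory; needs an NVIDIA driver to be installed."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return parse_gpu_memory(result.stdout)

# Parsing only, using the numbers from the GTX 1080 output in this notebook
print(parse_gpu_memory("1042, 8117"))
```

Calling `query_gpu_memory()` before and after a training run is one way to check whether the GPU RAM was actually released.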


In [39]:
! nvidia-smi

Thu Mar 26 11:50:40 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1080    On   | 00000000:01:00.0  On |                  N/A |
| 35%   55C    P0    55W / 180W |   1042MiB /  8117MiB |     30%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

### Test your GPU
This should verify that TensorFlow can use your GPU.

## WARNING
This is a good test but...  
If you run it, it may not release the GPU memory.   I didn't figure this out fully.   When I ran it, I would get an OOM error when the model started the training cycle - even with a super small batch size.   So, something is up here.   You could try stopping the notebook, then check nvidia-smi to verify that it released the GPU RAM.

In [None]:
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print (sess.run(c))
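
TensorFlow 1.x normally holds on to GPU memory until the process that allocated it exits. One possible workaround (not verified in this notebook) is to run the GPU check in a short-lived child process instead of the notebook kernel. The sketch below is only that idea; `run_isolated` and `GPU_CHECK` are hypothetical names.

```python
import subprocess
import sys

def run_isolated(code):
    """Run a Python snippet in a child process and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, check=True,
    )
    return result.stdout

# The same matmul check as the cell above, as a string for the child process
GPU_CHECK = """
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c = tf.matmul(a, b)
with tf.Session() as sess:
    print(sess.run(c))
"""

# Uncomment on a machine with TensorFlow 1.x and a GPU:
# print(run_isolated(GPU_CHECK))
```

When the child process exits, any GPU memory it held goes with it, so the notebook kernel never owns the allocation.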


## MobileNet Model
Why use a MobileNet model?  Because the end objective is a lightweight model - one that will run on a Google Coral TPU.    This requires a quantized model (int8 - not float32), and you get there from a TensorFlow Lite model.  The recommended path is to start with a model structure that you know is compatible (MobileNet), then retrain on top of it.  
1. We pull MobileNet v1 (there is a v2 that we aren't using) trained on COCO images
2. We train on top of it (transfer learning) with our CFA Products
3. That generates a TensorFlow Lite model (.tflite)
4. We will later convert the .tflite to an Edge TPU model
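
For a rough feel of what "quantized" means: TensorFlow Lite stores a tensor as small integers plus a scale and a zero point, with real_value = scale * (q - zero_point). The sketch below does only that arithmetic in plain Python. The scale and zero point values are made up for illustration and are not taken from this model.

```python
def quantize(value, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to the nearest representable integer, clamped to [qmin, qmax]."""
    q = int(round(value / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Recover the (approximate) float value from its integer code."""
    return scale * (q - zero_point)

scale, zero_point = 0.5, 0          # illustrative values only
q = quantize(3.2, scale, zero_point)
print(q, dequantize(q, scale, zero_point))   # the round trip loses a little precision
```

The precision lost in the round trip is the price paid for a model that is about a quarter of the float32 size and runs on integer-only hardware.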

## Global Constants

In [40]:
# S3_TFRECORDS_PATH = "s3://cfa-eadatasciencesb-sagemaker/datasets/cfa_products/tfrecords/"
# TFRECORDS_TARBALL = "20190718_tfrecords.tar.gz"
S3_TFRECORDS_PATH = "s3://cfa-eadatasciencesb-sagemaker/datasets/security/tfrecords/"
TFRECORDS_TARBALL = "20200323_tfrecords.tar.gz"


S3_MODEL_PATH = "s3://cfa-eadatasciencesb-sagemaker/trained-models/tensorflow_mobilenet/"
# base model - starting point that we train on top of
BASE_MODEL_FOLDER = "20180718_coco14_mobilenet_v1_ssd300_quantized"

# our CFA model
# note the COINCIDENCE - 2018-0718 vs 2019-0718, don't let this confuse you!
CFA_MODEL_FOLDER = "20190718_cfa_prod_mobilenet_v1_ssd300/"

# project directories
PROJECT = os.getcwd()
CODE = os.path.join(PROJECT, "code")
TASKS = os.path.join(PROJECT, "tasks")
MODEL_OUTPUT = os.path.join(CODE, 'model')

print ("project directory:", PROJECT)
print ("code directory:", CODE)
print ("task directory:", TASKS)

# Link to Security Project
CAMERA_API = os.path.abspath(os.path.join(PROJECT, '..', 'camera-api'))
CAMERA_API_MODEL = os.path.join(CAMERA_API, 'model')

project directory: /media/home/jay/projects/ssd-dag
code directory: /media/home/jay/projects/ssd-dag/code
task directory: /media/home/jay/projects/ssd-dag/tasks


## Data - 1x only

Get the data from S3.  You only need to pull the data once - unless of course you update it.  You'll need to pass a directory into the training job.

### NOTE
It is still unclear whether the data lives in the Docker image or is passed in with the SageMaker job.  
TODO - figure this out. It's faster to NOT put it in the Docker image (code/tfrecords); that just makes the Docker step slower.   The AWS fetch when the Docker container starts is much faster.

In [None]:
# Physical or Docker
# you can run the script
# $ cd /task

# check the Globals values in the script
# $ bash local_get_s3_files.sh

In [None]:
# SAGEMAKER
#  you're in top project directory
s3_tfrecords = os.path.join(S3_TFRECORDS_PATH, TFRECORDS_TARBALL)
print (s3_tfrecords)
! aws s3 cp $s3_tfrecords code/tfrecords  

# tarball is now in code/tfrecords
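
One caveat with building S3 URIs via `os.path.join`: on Windows it joins with backslashes, which is not a valid S3 key separator. If you run this notebook on a Windows machine, a safer sketch is to use `posixpath.join`, which always uses `/`. The constants are repeated here only so the snippet stands alone.

```python
import posixpath

S3_TFRECORDS_PATH = "s3://cfa-eadatasciencesb-sagemaker/datasets/security/tfrecords/"
TFRECORDS_TARBALL = "20200323_tfrecords.tar.gz"

# posixpath.join always uses '/', whatever the host operating system is
s3_tfrecords = posixpath.join(S3_TFRECORDS_PATH, TFRECORDS_TARBALL)
print(s3_tfrecords)
```

The same applies to the model folder URI built from S3_MODEL_PATH and BASE_MODEL_FOLDER.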

In [None]:
! tar -xvf code/tfrecords/$TFRECORDS_TARBALL --strip=1 -C code/tfrecords

# tfrecords are all in the tfrecords/ directory
# SageMaker likes train/test subdirectories
# - warning - confusion with 'test' vs 'eval'
#      I feel eval is the post-train loop that evaluates the training loop - thus called val(idation)
#         and test is to test a model with random real-world data
#      SageMaker calls what I call val == test
! pwd
! rm code/tfrecords/train/*.tfrecord* -f
! rm code/tfrecords/val/*.tfrecord*   -f
! rm code/tfrecords/test/*.tfrecord* -f

! mv code/tfrecords/train*.* code/tfrecords/train
! mv code/tfrecords/val*.* code/tfrecords/val
! mv code/tfrecords/test*.* code/tfrecords/test

! rm code/tfrecords/$TFRECORDS_TARBALL

# tarball is gone, tfrecord files are in code/tfrecords/{train,val,test}
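
The shell cell above assumes the train/, val/ and test/ subdirectories already exist. Here is a rough Python equivalent that creates them if needed, which may be easier to run on a machine without bash. `sort_tfrecords` is a made-up helper name, and the `<split>*.*` file naming is taken from the `mv` commands above.

```python
import shutil
from pathlib import Path

def sort_tfrecords(tfrecords_dir, splits=("train", "val", "test")):
    """Move <split>*.* files into a <split>/ subdirectory; return counts per split."""
    root = Path(tfrecords_dir)
    moved = {}
    for split in splits:
        target = root / split
        target.mkdir(exist_ok=True)
        # remove shards left over from an earlier pull
        for old in list(target.glob("*.tfrecord*")):
            old.unlink()
        moved[split] = 0
        for src in sorted(root.glob(split + "*.*")):
            if src.is_file():
                shutil.move(str(src), str(target / src.name))
                moved[split] += 1
    return moved

# Hypothetical usage from the project directory:
# print(sort_tfrecords("code/tfrecords"))
```

A count of 0 for any split is a quick hint that the tarball did not contain what you expected.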

## Get Model - 1x only

You only have to pull the model once.  This exercise will RETRAIN an existing model, so you need the starting point.  In this example, we are training on top of the BASE == MobileNet V1 that was trained with COCO images.   You could train on top of a CFA model - just make sure you configure everything properly.

Copy the model from S3.    You are copying a model from an S3 folder.  There may be a label map and config file there - that makes sense so you can reproduce that model.   However, if you are training on top of this model, those files aren't useful - MAKE SURE YOU UNDERSTAND THIS.   

So when you pull the model from the folder, make sure you understand whether you are re-using those meta files (e.g. reproducing a model) or need something new (transfer learning).  The training process will NOT read the config from this download.  The training program reads the config from code/ to help avoid this confusion.

#### CKPT
When you retrain, the config file has a train_config / fine_tune_checkpoint attribute.  You are going to download this BASE model and put it in the code/ckpt/ directory.   The training job will start with the checkpoint file you specify.   For example:

fine_tune_checkpoint: "ckpt/model.ckpt"

#### WARNING code/ckpt/checkpoints
When you run training, it will checkpoint to code/ckpt/checkpoints.  
- If you train for 5000 steps, then repeat, it will basically do nothing because it just reloads the 5000 checkpoint file.
- Then you'll think you're smart and you'll remove the 5000 checkpoint file.  Not so fast bucko!
- There is a pointer in checkpoints/ that tells the system the 5000 checkpoint exists - but now it doesn't because you just wiped it - so you'll get an error (that's difficult to figure out).

Just delete the whole checkpoints directory.
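
The "pointer" is most likely the small text file named `checkpoint` that TensorFlow writes next to the checkpoint files. It lists entries like `model_checkpoint_path: "model.ckpt-5000"`. As a sketch, assuming that file format, you can check for dangling references before starting a run. `missing_checkpoints` is a hypothetical helper, not part of the project.

```python
import re
from pathlib import Path

def missing_checkpoints(ckpt_dir):
    """Return checkpoint prefixes named in the 'checkpoint' file that have no .index file."""
    ckpt_dir = Path(ckpt_dir)
    state_file = ckpt_dir / "checkpoint"
    if not state_file.exists():
        return []
    missing = []
    pattern = r'model_checkpoint_paths?:\s*"([^"]+)"'
    for prefix in re.findall(pattern, state_file.read_text()):
        path = Path(prefix)
        if not path.is_absolute():
            path = ckpt_dir / path
        # every saved checkpoint has a <prefix>.index file
        if not Path(str(path) + ".index").exists() and prefix not in missing:
            missing.append(prefix)
    return missing

# Hypothetical usage:
# print(missing_checkpoints("code/ckpt/checkpoints"))
```

An empty list means every referenced checkpoint is still on disk. Anything else is a candidate for the hard-to-read restore error described above.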


In [None]:
# Physical or Docker
# - you may have to delete stuff first
# $ cd code
# $ rm -rf models

# $ cd ../tasks
# $ bash install_tf_models.sh

In [None]:
# SageMaker (& Local?)
# -- warning - something not right here
#    I think you have to do this local or SageMaker (gotta have a base model)
s3_model_folder = os.path.join(S3_MODEL_PATH, BASE_MODEL_FOLDER)
! aws s3 cp $s3_model_folder code/ckpt --recursive

# code/ckpt now has model.ckpt.* files
# there is also a pipeline.config file (this one was configured for the Google Coral - you don't want it)
# there are also some tflite files - we don't want them either

#### github tensorflow/models - 1x Only
Manually git clone the FIRST TIME.   The official TensorFlow GitHub organization has a related repo with a bunch of models, tutorials, utilities etc.   We are using them, so clone them to this machine.   In a subsequent step, we'll get the files we need from this local copy.

!! - hold it -  
!! this doesn't make sense, try not doing this - I don't think you need to git clone  
!! doesn't install_tf_models.sh do all of this?  
!! I think we no longer copy, the script just clones to code/models  
!! thus, you don't need this manual git clone, just run install_tf_models.sh in the next cell


PHYSICAL COMPUTER  
`cd ~/projects`  
SAGEMAKER  
`you should be in the SageMaker directory`  

#### this will put /models into ~/projects  (you'll have ~/projects/models)
`git clone https://github.com/tensorflow/models.git`

In [None]:
# 1 time only

# get the latest software
# - git clone (to <project>/code/models)
# - get the protobuf compiler
# - compile the protobufs
# - clean up
os.chdir(TASKS)
! ./install_tf_models.sh

## Local (Script) Mode Training

#### -> if you know what you're doing, (you have a working SageMaker HOSTED training job) - you can jump out here!

see the AWS SageMaker tutorials notably:  
https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-eager-script-mode/tf-eager-sm-scriptmode.ipynb

The point here is, you can develop a training script locally, then know (have a high degree of confidence) it will run as a SageMaker training job.   (This is relatively new, the old way was more difficult and cumbersome.)

### What is Local?
- local on THIS SageMaker Notebook (EC2) server
  - p2.xlarge - no problem
  - t2.medium - probably not (I think this is the same footprint as the feeble Workspace)
- A desktop computer.
  - works great on an Ubuntu laptop with GPU
  - should work on a Windows laptop if you have a python environment set up
- An AWS Workspace - not enough memory, you'll get a memory error.   The code runs - but fails on a memory allocation error.

## Do you have a training script that will run locally - without Docker?

Considering what is coming up, you want all code needed to train in one directory (in this example, the code/ directory). That directory will be included in the Docker image.    

This is going to get a little more cumbersome because we took a bunch of stuff from the (official) GitHub tensorflow/models project: we are using the MobileNet model and a BUNCH of utilities.    To make sure we keep up to date, we get all of this programmatically, i.e. clone the most recent version.

### Model Training Configuration

trained-models/ may have a config file and a label map in its directories.  You can start with one of these, BUT there may be environment-specific values that you don't want - and you don't want the S3 pull operation to keep overwriting your config.   So, pull a model from S3 and review the .config and label map files, BUT !!! put YOUR config & label map file in the code/ directory.

#### .config file
See the config file for all parameters. The IN USE .config file is in the code/ directory. You DEFINITELY need to look at these:
- num_classes = should be consistent with labels.txt & label map
- label_map_path (train & eval)
    - there may be one in the model/ (that you pulled from S3)
    - but move your desired label map to code/
- inputs (train & eval) - not sure, SageMaker is passing that in
- check all of the path statements 
- fine_tune_checkpoint - make sure you are fine tuning the correct file
    - don't cross a _v1 with a _v2 - that definitely won't work
   
#### label map .pbtxt
- classes start with 1 (not 0 based)
- make sure your label map class count matches the config file
- and it should match the label 
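
A quick consistency check can save you from a confusing error later. The sketch below uses regular expressions on the text of the two files rather than the protobuf parser, so it assumes the usual `item { id: N ... }` label map layout and a `num_classes: N` line in the config. The helper names and the sample class names are invented for this example.

```python
import re

def label_map_ids(label_map_text):
    """Return the integer ids found in a label map .pbtxt."""
    return [int(i) for i in re.findall(r"\bid\s*:\s*(\d+)", label_map_text)]

def read_num_classes(config_text):
    """Return num_classes from a pipeline config, or None if it is absent."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", config_text)
    return int(match.group(1)) if match else None

def check_consistency(config_text, label_map_text):
    """Return a list of human-readable problems (empty list means it looks OK)."""
    problems = []
    ids = label_map_ids(label_map_text)
    num_classes = read_num_classes(config_text)
    if num_classes is None:
        problems.append("num_classes not found in config")
    elif num_classes != len(ids):
        problems.append("num_classes=%d but label map has %d items" % (num_classes, len(ids)))
    if ids and min(ids) < 1:
        problems.append("label map ids must start at 1, not 0")
    return problems

# Made-up sample content, just to show the call
sample_config = "model { ssd { num_classes: 2 } }"
sample_label_map = "item { id: 1 name: 'person' }\nitem { id: 2 name: 'car' }"
print(check_consistency(sample_config, sample_label_map))
```

In practice you would read the real files from code/ with `open(...).read()` and pass the text in.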

#### NOTE - a missing file will generate a complex error message.  NOT something as simple as file not found. 

#### NOTE - --model_dir parameter: 
- local mode, it needs to be model
- SageMaker HOST mode, it needs to be /opt/ml/model

--num_train_steps  
   500 very quick test  
   5000 more like it  
--num_eval_steps  
   10 very quick test  
   100 more like it

Beware of batch size: if you run out of GPU memory, see batch_size in the config file (batch_size: 32). You may need to decrease it if you have a small GPU.

GPU should be 95% utilized (watch it with `nvidia-smi -l 1`).  
CPU will be about 30%.  

In [None]:
os.chdir(CODE)   # this will be the training directory

In [None]:
! rm -r model
! mkdir model

In [None]:
# !!! Warning !!!
# I changed the pipeline_config_path = local*.config
# this local version expects the data to be in code/tfrecords

# sagemaker*.config
#  uses S3 to move the data

# !!! I haven't tested !!!

# 20200122 - physical computer (Inspiron)
#  using Jupyter (below) error: ModuleNotFoundError: No module named 'absl'
#  but, ran fine from terminal
#
#  nvidia-smi
#      NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
#      Make sure that the latest NVIDIA driver is installed and running.
#  but training ran fine, so CUDA was good

In [None]:
# These parameters can be set; if omitted, takes values from SM_CHANNEL_ {_MODEL_DIR, _TRAIN, _VAL}
# --model_dir
# --train
# --val
# note the config file

! python train115.py \
  --pipeline_config_path="local_mobilenet_v1_ssd_security_retrain.config" \
  --num_train_steps="10000" \
  --num_eval_steps="1000"  \
  --model_dir='model' \
  --train='tfrecords/train/train.tfrecord' \
  --val='tfrecords/val/val.tfrecord'
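
train115.py itself is not shown in this notebook, so the following is only a sketch of how a script-mode entry point could fall back to environment variables when a flag is omitted. The variable names SM_MODEL_DIR, SM_CHANNEL_TRAIN and SM_CHANNEL_VAL are what SageMaker conventionally sets for the model directory and for channels named train and val. Treat them, the defaults, and the `parse_args` helper as assumptions.

```python
import argparse
import os

def parse_args(argv=None, env=None):
    """Parse training flags, using SageMaker-style environment variables as defaults."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--pipeline_config_path", required=True)
    parser.add_argument("--num_train_steps", type=int, default=500)
    parser.add_argument("--num_eval_steps", type=int, default=10)
    # local runs write to ./model; hosted runs get /opt/ml/model from the environment
    parser.add_argument("--model_dir", default=env.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=env.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--val", default=env.get("SM_CHANNEL_VAL"))
    return parser.parse_args(argv)

# Local-style call, mirroring the flags used in the cell above
args = parse_args(
    ["--pipeline_config_path", "local_mobilenet_v1_ssd_security_retrain.config",
     "--num_train_steps", "10000", "--model_dir", "model"],
    env={},
)
print(args.model_dir, args.num_train_steps)
```

Passing `env` explicitly makes it easy to simulate a hosted run without touching the real environment.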

# Trained Model Output -- IMPORTANT
Where did it go? - THERE IS A BIG DIFFERENCE BETWEEN LOCAL TRAIN AND HOSTED TRAIN -- important !!

train*.py will put the output in code/model.    This is true for local or SageMaker hosted training.   In this case, you trained locally, so the output is in code/model  -- end of story.


When you train with a SageMaker Hosted training job, the output still goes to code/model -- HOWEVER - that is inside a Docker container (that you will never see).  Then it gets copied to S3.   Then the notebook (TrainModel_Step3_TrainingJob) pulls the model output from S3 and extracts the tarball to {PROJECT}/trained_model.   SO AT THIS POINT THE OUTPUT IS IN A DIFFERENT LOCATION !!

The convert graph script pulls from {PROJECT}/trained_model (not the native code/model location).    The easiest solution (you will see below) is to copy the desired checkpoint to the {PROJECT}/trained_model location.

In [41]:
os.chdir(CODE)   # this will be the training directory
! ls -la  {MODEL_OUTPUT}

total 730720
drwxr-xr-x  4 jay  jay       4096 Mar 26 10:37 .
drwxr-xr-x 10 jay  jay       4096 Mar 26 06:09 ..
-rw-r--r--  1 root root       277 Mar 26 10:33 checkpoint
drwxr-xr-x  2 root root      4096 Mar 26 08:01 eval_0
-rw-r--r--  1 root root  41422773 Mar 26 06:10 events.out.tfevents.1585175542.bf060c2b92f1
-rw-r--r--  1 root root  41104985 Mar 26 07:46 events.out.tfevents.1585217492.bf060c2b92f1
-rw-r--r--  1 root root  41122061 Mar 26 10:33 events.out.tfevents.1585223240.bf060c2b92f1
drwxr-xr-x  3 root root      4096 Mar 26 10:37 export
-rw-r--r--  1 root root  21828152 Mar 26 07:47 graph.pbtxt
-rw-r--r--  1 root root 109220320 Mar 26 09:57 model.ckpt-68159.data-00000-of-00001
-rw-r--r--  1 root root     42388 Mar 26 09:57 model.ckpt-68159.index
-rw-r--r--  1 root root  11279875 Mar 26 09:57 model.ckpt-68159.meta
-rw-r--r--  1 root root 109220320 Mar 26 10:07 model.ckpt-68516.data-00000-of-00001
-rw-r--r--  1 root root     42388 Mar 26 10:07 model.ckpt-68516.index

## Troubleshooting

1. If you run for 500 steps, then rerun the exact process, it restores from ckpt/checkpoints (ckpt-500) and then thinks it is done.  So it basically does nothing.
2. Don't delete ckpt/  (rm ckpt/*.*) WITHOUT also removing ckpt/checkpoints/.   The program always checks that checkpoints subdirectory and tries to restore.  For example, if you delete the files in ckpt/ but leave ckpt/checkpoints, it finds a reference to ckpt-500 - which you just deleted - so it aborts.
3. Always check your files & paths carefully. The error messages thrown for a missing file are not always clear and may send you on a wild goose chase when in reality it was just a missing file.
4. can't import nets - this is a PATH problem (models/research/slim needs to be in your path). In the train.py program, it's added programmatically.
5. OOM when allocating tensor of shape [32,19,19,512] and type float
	 [[{{node gradients/zeros_97}}]] -- go to the config file and make the batch size smaller (e.g. 16)
6. AttributeError: 'ParallelInterleaveDataset' object has no attribute '_flat_structure' --- check your directories, like something didn't get installed correctly (base model?  models/research stuff?  training data?) -- this seems to be a problem with a TF build from scratch;   with a pip install it went away
7. If you are mixing local ops and Docker runs, you may have messed up the ownership of output files and checkpoints - try deleting everything and doing a new pull.
8. trains - then error:  TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
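
For item 4, the train script's own path handling is not shown here, so this is only a sketch of what "programmatically added" can look like. It assumes the tensorflow/models clone lives under code/models, as set up by install_tf_models.sh. `add_models_to_path` is a hypothetical name.

```python
import os
import sys

def add_models_to_path(code_dir):
    """Put models/research and models/research/slim on sys.path; return what was added."""
    research = os.path.join(code_dir, "models", "research")
    added = []
    for path in (research, os.path.join(research, "slim")):
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added

# Hypothetical usage, before importing object_detection or nets:
# add_models_to_path(CODE)
```

Calling it twice is harmless. The second call adds nothing.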


## Making a useable model
At this point you have checkpoint files.   You need models (graphs).   There are many flavors:
    - saved graph
    - frozen graph
    - TensorFlow Lite
    - TensorRT
    - EdgeTPU
    
The notebook:  TrainingJob_Step3_TrainingJob will show you how to convert a checkpoint file to a graph (frozen graph & tflite).   There is a bash file to do this.
    

In [42]:
# WAKE UP - make sure NUM_TRAINING_STEPS = the max number in the checkpoint files you listed above
#  e.g. 
# ls model
# -rw-rw-r--  1 ec2-user ec2-user 41116528 Jan 28 15:16 model.ckpt-6000.data-00000-of-00001
# -rw-rw-r--  1 ec2-user ec2-user    27275 Jan 28 15:16 model.ckpt-6000.index
# -rw-rw-r--  1 ec2-user ec2-user  6987305 Jan 28 15:16 model.ckpt-6000.meta
NUM_TRAINING_STEPS = 70000
! cp {CODE}/model/*{NUM_TRAINING_STEPS}* {PROJECT}/trained_model
! ls {PROJECT}/trained_model/*{NUM_TRAINING_STEPS}*

# get the config from the train*.py parameters above
PIPELINE_CONFIG = 'local_mobilenet_v1_ssd_security_retrain.config'
# PIPELINE_CONFIG = 'local_mobilenet_v1_ssd_retrain.config'
! ls {CODE}/{PIPELINE_CONFIG}

# if you don't see your checkpoint in */trained_model/  STOP - and fix it

/media/home/jay/projects/ssd-dag/trained_model/model.ckpt-70000.data-00000-of-00001
/media/home/jay/projects/ssd-dag/trained_model/model.ckpt-70000.index
/media/home/jay/projects/ssd-dag/trained_model/model.ckpt-70000.meta
/media/home/jay/projects/ssd-dag/code/local_mobilenet_v1_ssd_security_retrain.config
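
Rather than reading the step number off a directory listing by eye, you can pull it from the file names. This is a minimal sketch, assuming the `model.ckpt-<step>.index` naming seen in the listings in this notebook. `checkpoint_steps` is a made-up helper.

```python
import re

def checkpoint_steps(filenames):
    """Return the sorted step numbers of all model.ckpt-<step>.index files."""
    steps = set()
    for name in filenames:
        match = re.search(r"model\.ckpt-(\d+)\.index$", name)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)

# File names copied from the code/model listing earlier in this notebook
names = ["checkpoint", "graph.pbtxt",
         "model.ckpt-68159.index", "model.ckpt-68159.meta",
         "model.ckpt-68516.index"]
print(max(checkpoint_steps(names)))
```

With a real directory you would pass `os.listdir(MODEL_OUTPUT)` and assign the maximum to NUM_TRAINING_STEPS.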


In [43]:
# convert checkpoint is a task script - located in the tasks/ directory
os.chdir(TASKS)  
! ./convert_checkpoint_to_edgetpu_tflite.sh --checkpoint_num {NUM_TRAINING_STEPS} --pipeline_config {PIPELINE_CONFIG}

TASKS_DIR=/media/home/jay/projects/ssd-dag/tasks
***
/media/home/jay/projects/ssd-dag/tflite_model
:/media/home/jay/projects/ssd-dag/code/models/research/slim:/media/home/jay/projects/ssd-dag/code/models/research
+ ckpt_number=0
+ [[ 4 -gt 0 ]]
+ case "$1" in
+ ckpt_number=70000
+ shift 2
+ [[ 2 -gt 0 ]]
+ case "$1" in
+ pipeline_config=local_mobilenet_v1_ssd_security_retrain.config
+ shift 2
+ [[ 0 -gt 0 ]]
+ rm /media/home/jay/projects/ssd-dag/tensorflow_model -rf
+ rm /media/home/jay/projects/ssd-dag/tflite_model -rf
+ echo '-- check for model checkpoint (the raw graph):,' , 70000
-- check for model checkpoint (the raw graph):, , 70000
+ ls /media/home/jay/projects/ssd-dag/trained_model/model.ckpt-70000.data-00000-of-00001 /media/home/jay/projects/ssd-dag/trained_model/model.ckpt-70000.index /media/home/jay/projects/ssd-dag/trained_model/model.ckpt-70000.meta
/media/home/jay/projects/ssd-dag/trained_model/model.ckpt-70000.data-00000-of-00001
/media/home/jay/projects/ssd-dag/trained_

INFO:tensorflow:Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/add_fold
I0326 11:54:45.761101 140246823876416 quantize.py:299] Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/add_fold
INFO:tensorflow:Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/add_fold
I0326 11:54:45.761289 140246823876416 quantize.py:299] Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/add_fold
INFO:tensorflow:Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/add_fold
I0326 11:54:45.761404 140246823876416 quantize.py:299] Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/add_fold
INFO:tensorflow:Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_depthwise/add_fold
I0326 11:54:45.761516 140246823876416 quantize.py:299] Skipping quant after FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_depthwise/add_fold
INFO:ten

356 ops no flops stats due to incomplete shapes.
Parsing Inputs...
Incomplete shape.

-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              0
-min_occurrence             0
-step                       -1
-order_by                   name
-account_type_regexes       _trainable_variables
-start_name_regexes         .*
-trim_name_regexes          .*BatchNorm.*
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     params
-output                     stdout:

Incomplete shape.

Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
param: Number of parameters (in the Variable).

Profile:
node name | # parameters
_TFProfRoot (--/6.

356 ops no flops stats due to incomplete shapes.
Parsing Inputs...
Incomplete shape.

-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-trim_name_regexes          .*BatchNorm.*,.*Initializer.*,.*Regularizer.*,.*BiasAdd.*
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:

Incomplete shape.

Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implement


W0326 11:54:47.639828 140246823876416 module_wrapper.py:139] From /media/home/jay/projects/ssd-dag/code/models/research/object_detection/exporter.py:432: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.


W0326 11:54:48.762277 140246823876416 module_wrapper.py:139] From /media/home/jay/projects/ssd-dag/code/models/research/object_detection/exporter.py:342: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-03-26 11:54:48.785697: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-26 11:54:48.809000: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 11:54:48.809319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryCl

2020-03-26 11:55:00.722531: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 11:55:00.722740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
2020-03-26 11:55:00.722769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-03-26 11:55:00.722779: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-03-26 11:55:00.722786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-03-26 11:55:00.722794: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libc

INFO:tensorflow:SavedModel written to: /media/home/jay/projects/ssd-dag/tensorflow_model/saved_model/saved_model.pb
I0326 11:55:03.709092 140246823876416 builder_impl.py:425] SavedModel written to: /media/home/jay/projects/ssd-dag/tensorflow_model/saved_model/saved_model.pb

W0326 11:55:03.755578 140246823876416 module_wrapper.py:139] From /media/home/jay/projects/ssd-dag/code/models/research/object_detection/utils/config_util.py:188: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

INFO:tensorflow:Writing pipeline config file to /media/home/jay/projects/ssd-dag/tensorflow_model/pipeline.config
I0326 11:55:03.755696 140246823876416 config_util.py:190] Writing pipeline config file to /media/home/jay/projects/ssd-dag/tensorflow_model/pipeline.config
+ echo ' - - - CKPT ==> tflite frozen graph - - -'
 - - - CKPT ==> tflite frozen graph - - -
+ python /media/home/jay/projects/ssd-dag/code/models/research/object_detection/export_tflite_ssd_graph.py --pipeline_con

2020-03-26 11:55:07.543449: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 11:55:07.543818: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56345de42660 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-26 11:55:07.543832: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-03-26 11:55:07.543944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 11:55:07.544235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
2020-03-

Instructions for updating:
Use standard file APIs to check for files with this prefix.
W0326 11:55:09.049654 140289241147200 deprecation.py:323] From /media/home/jay/anaconda3/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/tools/freeze_graph.py:127: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-03-26 11:55:09.375775: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 11:55:09.375986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
2020-03-26 11:55:09.376016: I tensorflow/stream_executor/platform/default/dso_loader.cc:

2020-03-26 11:55:11.918409: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
2020-03-26 11:55:11.919275: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d35bb6b320 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-26 11:55:11.919288: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-03-26 11:55:11.974538: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 11:55:11.975061: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d35bbfdc00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-26 11:55:11.975074: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
20

In [44]:
# Tensorflow FROZEN GRAPH
! ls {PROJECT}/tensorflow_model -l

total 59040
-rw-r--r-- 1 jay jay       77 Mar 26 11:54 checkpoint
-rw-r--r-- 1 jay jay 29536515 Mar 26 11:55 frozen_inference_graph.pb
-rw-r--r-- 1 jay jay 27381492 Mar 26 11:54 model.ckpt.data-00000-of-00001
-rw-r--r-- 1 jay jay    14948 Mar 26 11:54 model.ckpt.index
-rw-r--r-- 1 jay jay  3500465 Mar 26 11:54 model.ckpt.meta
-rw-r--r-- 1 jay jay     5103 Mar 26 11:55 pipeline.config
drwxr-xr-x 3 jay jay     4096 Mar 26 11:55 saved_model


In [45]:
# Tensorflow Lite model
! ls {PROJECT}/tflite_model -l

total 111216
-rw-r--r-- 1 jay jay  6898968 Mar 26 11:55 output_tflite_graph.tflite
-rw-r--r-- 1 jay jay 27693983 Mar 26 11:55 tflite_graph.pb
-rw-r--r-- 1 jay jay 79284971 Mar 26 11:55 tflite_graph.pbtxt


### Security
If you are working on the security project, you need to  
put the output_tflite_graph.tflite file in camera-api/model/  


In [46]:
# copy the tflite model over to camera-api/model
! cp  {PROJECT}/tflite_model/output_tflite_graph.tflite {CAMERA_API_MODEL}

In [47]:
# just checking ...
! ls -ls {CODE}/ckpt

total 134004
26740 -rw-r--r-- 1 jay jay 27381492 Mar 24 15:43 model.ckpt.data-00000-of-00001
   16 -rw-r--r-- 1 jay jay    14948 Mar 24 15:43 model.ckpt.index
 3420 -rw-r--r-- 1 jay jay  3500465 Mar 24 15:43 model.ckpt.meta
    8 -rw-r--r-- 1 jay jay     4469 Oct  4 09:22 pipeline.config
27044 -rw-r--r-- 1 jay jay 27692743 Oct  4 09:22 tflite_graph.pb
76776 -rw-r--r-- 1 jay jay 78617899 Oct  4 09:22 tflite_graph.pbtxt


In [48]:
# move the (converted?  frozen?) ckpt to the starting point
# NOW you can re-train on top of it
! cp {PROJECT}/tensorflow_model/model.ckpt.* {CODE}/ckpt

In [49]:
# backup
! aws s3 ls --profile=jmduff

2020-02-29 19:45:19 jmduff.data
2018-08-23 21:11:49 jmduff.glacier
2020-03-24 15:53:56 jmduff.security-system
2020-01-20 15:49:21 jmduff.software
2018-04-19 20:22:47 jmduff.xps14z


In [50]:
MODEL_DATE = '20200326'
! aws s3 cp {PROJECT}/tensorflow_model s3://jmduff.security-system/model/{MODEL_DATE}/ --exclude='*.*' --include='*.*' --recursive --profile=jmduff
! aws s3 cp {PROJECT}/tflite_model s3://jmduff.security-system/model/{MODEL_DATE}/ --exclude='*.*' --include='*.*' --recursive --profile=jmduff

upload: ../tensorflow_model/checkpoint to s3://jmduff.security-system/model/20200326/checkpoint
upload: ../tensorflow_model/model.ckpt.index to s3://jmduff.security-system/model/20200326/model.ckpt.index
upload: ../tensorflow_model/pipeline.config to s3://jmduff.security-system/model/20200326/pipeline.config
upload: ../tensorflow_model/model.ckpt.meta to s3://jmduff.security-system/model/20200326/model.ckpt.meta
upload: ../tensorflow_model/frozen_inference_graph.pb to s3://jmduff.security-system/model/20200326/frozen_inference_graph.pb
upload: ../tensorflow_model/model.ckpt.data-00000-of-00001 to s3://jmduff.security-system/model/20200326/model.ckpt.data-00000-of-00001
upload: ../tensorflow_model/saved_model/saved_model.pb to s3://jmduff.security-system/model/20200326/saved_model/saved_model.pb
upload: ../tflite_model/output_tflite_graph.tflite to s3://jmduff.security-system/model/20200326/output_tflite_graph.tflite
upload: ../tflite_model/tflite_graph.pb to s3://jmduff.security-system
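
MODEL_DATE is typed by hand in the cell above. If you would rather derive the backup prefix from a date, here is a small sketch. The bucket name and the model/<date>/ layout are copied from the upload commands. `backup_prefix` is a hypothetical helper.

```python
import datetime

def backup_prefix(bucket, day):
    """Build the s3://<bucket>/model/<YYYYMMDD>/ prefix used for model backups."""
    return "s3://{}/model/{}/".format(bucket, day.strftime("%Y%m%d"))

# Reproduces the prefix used in the upload above
print(backup_prefix("jmduff.security-system", datetime.date(2020, 3, 26)))

# For a backup made today, pass datetime.date.today() instead
```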

In [51]:
os.chdir('/media/home/jay/projects/ssd-dag')
! pwd

/media/home/jay/projects/ssd-dag
