# Train Model
#### tensorflow_p36 environment

There are several ways to run this code:
- on a SageMaker notebook (the original intent)
- on a physical machine with a well-configured dev environment
- on a physical machine using a Docker image (grilledclub/cuda-100-tf114:*)

## Step 1 - Develop a train.py script

This is SageMaker Script Mode.   It is relatively new and much easier than the original SageMaker design.   You need to develop a train.py program that you will:
1. run locally, meaning it runs on the local machine's resources
2. then test locally inside a Docker container

If it passes both tests, it should run fine when you create a SageMaker Training job.   THIS IS THE CORRECT WAY TO USE SAGEMAKER.   Don't get confused: running jobs on the local SageMaker notebook server isn't really what SageMaker was designed for.  It is designed to take your program and send it to outside resources (using a Docker container).
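
A minimal sketch of how a Script Mode entry point can take its paths either from the command line (local runs) or from the environment (hosted runs). The flag names mirror the ones passed to the training script later in this notebook. `SM_MODEL_DIR` and `SM_CHANNEL_<NAME>` are the variables SageMaker sets inside its training containers. The fallback defaults and the `parse_args` helper itself are assumptions for illustration, not the actual contents of train.py.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training flags, falling back to SageMaker env vars, then to local defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pipeline_config_path", type=str, default=None)
    parser.add_argument("--num_train_steps", type=int, default=500)
    parser.add_argument("--num_eval_steps", type=int, default=10)
    # Hosted jobs set SM_MODEL_DIR (/opt/ml/model); local runs fall back to ./model.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    # Hosted jobs set one SM_CHANNEL_<NAME> variable per input channel.
    # The local fallbacks below are assumed paths, not taken from train.py.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "tfrecords/train"))
    parser.add_argument("--val", default=os.environ.get("SM_CHANNEL_VAL", "tfrecords/val"))
    # SageMaker may pass extra hyperparameters; ignore the ones we do not know.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Because the defaults are read from the environment, the same script can run unchanged on a laptop, in a local Docker test, and in a hosted training job.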


In [1]:
# SageMaker is at 1.15
# - kernel = conda_python3
# ! pip install tensorflow-gpu==1.14
#
# - kernel = conda_tensorflow_p36
#   1.15

In [2]:
# currently CUDA 10.0
! nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


In [3]:
import os
import tensorflow as tf

In [4]:
print (tf.__version__)

1.15.0


### nvidia-smi
This shows how much memory is available on the GPU.   This is important if you start getting OOM (out of memory) errors.

SageMaker p2.xlarge == 10+ GB  
Note what is available.

You can run (at a terminal)    
  $ nvidia-smi -l 1   
to see the GPU being used during training.  On SageMaker, you'll see the GPU is about 50% busy.
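
If you want the free-memory number inside the notebook rather than reading the table by eye, something like the following may help. It is a sketch that relies on nvidia-smi's `--query-gpu` CSV output; the helper names `parse_gpu_memory` and `gpu_memory` are made up for this example.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, one CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Turn 'used, total' lines (MiB) into a list of (used, total, free) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows


def gpu_memory():
    """Run nvidia-smi and return one (used, total, free) tuple per GPU."""
    proc = subprocess.run(QUERY, stdout=subprocess.PIPE,
                          universal_newlines=True, check=True)
    return parse_gpu_memory(proc.stdout)

# Usage (needs an NVIDIA driver on the machine):
# for used, total, free in gpu_memory():
#     print("used %d MiB of %d MiB, %d MiB free" % (used, total, free))
```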


In [5]:
! nvidia-smi
! nvcc --version

Sun Apr 26 21:27:28 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8     1W / 260W |    251MiB / 11011MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  

### Test your GPU
This should verify that TensorFlow can use your GPU.

## WARNING
This is a good test but...  
If you run it, it may not release the GPU memory.   I didn't figure this out fully.   When I ran it, I would get an OOM error when the model started the training cycle, even with a super small batch size.   So, something is up here.   You could try stopping the notebook kernel, then check nvidia-smi to verify it released the GPU RAM.

In [None]:
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print (sess.run(c))
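
One possible workaround, assuming the leftover memory is being held by the notebook kernel's own TensorFlow session (this was not confirmed here): run the check in a child process. GPU memory owned by a process is returned when that process exits, so the kernel never holds it. `run_isolated` is a hypothetical helper written for this sketch.

```python
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run a Python snippet in a child process; return (exit code, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# The same matmul check as the cell above, as a string for the child process.
GPU_TEST = """
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
with tf.Session() as sess:
    print(sess.run(c))
"""

# Usage (needs TensorFlow 1.x and a GPU):
# code, out, err = run_isolated(GPU_TEST)
# print(out if code == 0 else err)
```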


## MobileNet Model
Why use a MobileNet Model?  Because the end objective is a lightweight model, one that will run on a Google Coral TPU.    This requires a quantized model (int8, not float32).  And you get there from a TensorFlow Lite model.  The recommended path is to start with a model structure that you know is compatible (MobileNet), then retrain on top of it.  
1. We pull the MobileNet v1 (there is a v2 that we aren't using) trained on COCO images
2. We train on top of it (transfer learning) with our CFA Products
3. That generates a TensorFlow Lite model (.tflite)
4. We will later convert the .tflite to an Edge TPU model
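
To make "quantized (int8, not float32)" concrete, here is a small illustration of the affine mapping that TensorFlow Lite documents for quantized tensors: real_value = (quantized_value - zero_point) * scale. The scale and zero point in the usage note are made-up numbers for illustration, not values from this model.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value: round(x / scale) + zero_point, clipped to [qmin, qmax]."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to a float: (q - zero_point) * scale."""
    return (q - zero_point) * scale
```

With `scale=0.5` and `zero_point=-10`, `quantize(3.2, 0.5, -10)` gives `-4`, and `dequantize(-4, 0.5, -10)` gives `3.0`. The 0.2 lost in the round trip is the precision traded for a smaller, faster model.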

## Global Constants

In [9]:
# -- original EA DataScience SB
# S3_TFRECORDS_PATH = "s3://cfa-eadatasciencesb-sagemaker/datasets/cfa_products/tfrecords/"
# TFRECORDS_TARBALL = "20190718_tfrecords.tar.gz"
# S3_TFRECORDS_PATH = "s3://cfa-eadatasciencesb-sagemaker/datasets/security/tfrecords/"
# TFRECORDS_TARBALL = "20200323_tfrecords.tar.gz"

# Security - Local using jmduff AWS
S3_TFRECORDS_PATH = "s3://jmduff.security-system/tfrecords/"
TFRECORDS_TARBALL = "20200426_tfrecords.tar.gz"

S3_MODEL_PATH = "s3://jmduff.security-system/model/base_mobilenet/"
# base model - starting point that we train on top of
BASE_MODEL_FOLDER = "20180718_coco14_mobilenet_v1_ssd300_quantized"

# our CFA model
# note the COINCIDENCE - 2018-0718 vs 2019-0718, don't let this confuse you!
CFA_MODEL_FOLDER = "20190718_cfa_prod_mobilenet_v1_ssd300/"

# project directories
PROJECT = os.getcwd()
CODE = os.path.join(PROJECT, "code")
TASKS = os.path.join(PROJECT, "tasks")
MODEL_OUTPUT = os.path.join(CODE, 'model')

print ("project directory:", PROJECT)
print ("code directory:", CODE)
print ("task directory:", TASKS)

# Link to Security Project
CAMERA_API = os.path.abspath(os.path.join(PROJECT, '..', 'camera-api'))
CAMERA_API_MODEL = os.path.join(CAMERA_API, 'model')

MODEL_DATE = '20200426'
USER = 'train'  # linux user

project directory: /home/train/projects/ssd-dag
code directory: /home/train/projects/ssd-dag/code
task directory: /home/train/projects/ssd-dag/tasks


In [7]:
# get the updated label map file from camera-api project
! cp {CAMERA_API_MODEL}/security_label_map.pbtxt {CODE}

## Data - 1x only

Get the data from S3.  You only need to pull the data once, unless of course you update it.  You'll need to pass a directory into the training job.

## Options: S3 or USB drive

### Option 1: S3 & Sagemaker

####  NOTE
It is still unclear whether the data is in the Docker image or passed in with the SageMaker job.  
TODO - figure this out. It's faster to NOT put it in the Docker image (code/tfrecords), since that just makes the Docker step slower.   The AWS fetch when the Docker container starts is much faster.

### Option 2:  USB Drive
Mount the drive first from the command line. Here are the commands:

`sudo fdisk -l | grep dev/sd`  
/dev/sdb1        2048 976770112 976768065 465.8G 83 Linux  

`sudo mount /dev/sdb1 /media/train/ssd-usb0`  
`sudo chown train:train /media/train/ssd-usb0`  


In [None]:
# Physical or Docker
# you can run the script
# $ cd /task

# check the Globals values in the script
# $ bash local_get_s3_files.sh

In [None]:
# Option 1 - from s3 - kinda slow
# SAGEMAKER
#  you're in top project directory
s3_tfrecords = os.path.join(S3_TFRECORDS_PATH, TFRECORDS_TARBALL)
print (s3_tfrecords)
! aws s3 cp $s3_tfrecords code/tfrecords  

# tarball is now in code/tfrecords

In [8]:
# Option 2 - from ssd-usb0 (sneaker net)
! ls /media/$USER/ssd-usb0/ -l

total 23272796
-rw-r--r-- 1 root   root   7764419626 Apr 26 18:43 20200425_tfrecords.tar.gz
-rw-r--r-- 1 root   root      3340715 Apr 26 21:24 20200426_annotation.tar.gz
-rw-r--r-- 1 root   root   7824595037 Apr 26 21:24 20200426_jpeg_images.tar.gz
-rw-r--r-- 1 root   root   7840680784 Apr 26 21:18 20200426_tfrecords.tar.gz
drwx------ 2 root   root        16384 Apr 25 17:38 lost+found
-rw-r--r-- 1 jayson jayson  398262635 Apr 25 17:12 snapshot_20200425b_8100.tar.gz
drwxrwxrwx 5 root   root         4096 Apr 26 17:45 tfrecord


In [12]:
print (TFRECORDS_TARBALL)
! ls /media/$USER/ssd-usb0/$TFRECORDS_TARBALL
! cp /media/$USER/ssd-usb0/$TFRECORDS_TARBALL code/tfrecords

20200426_tfrecords.tar.gz
/media/train/ssd-usb0/20200426_tfrecords.tar.gz


### Regardless - process the tarball
Pick up here, regardless of which option you chose.

In [13]:
! tar -xvf code/tfrecords/$TFRECORDS_TARBALL --strip=6 -C code/tfrecords/tarball_extract

# tfrecords are all in the tfrecords/ directory
# SageMaker likes train/test subdirectories
# - warning - confusion with 'test' vs 'eval'
#      I feel eval is the post-train loop to evaluate the training loop - thus called val(idation)
#         and test is to test a model with random real-world data
#      SageMaker calls what I call val == test
! pwd
! rm code/tfrecords/train/*.*record* -f
! rm code/tfrecords/val/*.*record*   -f
! rm code/tfrecords/test/*.*record* -f

! mv code/tfrecords/tarball_extract/train*.* code/tfrecords/train
! mv code/tfrecords/tarball_extract/val*.* code/tfrecords/val
! mv code/tfrecords/tarball_extract/test*.* code/tfrecords/test

# ! rm code/tfrecords/$TFRECORDS_TARBALL

# the tarball is removed only if you uncomment the rm above; tfrecord files are now in code/tfrecords/{train,val,test}

media/home/jay/projects/camera-api/20200426_tfrecords/train.record-00001-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/val.record-00008-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/val.record-00009-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/val.record-00000-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/test.record-00005-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/test.record-00010-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/train.record-00008-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/val.record-00005-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/test.record-00002-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/test.record-00012-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/train.record-00015-of-00016
media/home/jay/projects/camera-api/20200426_tfrecords/test.record-00009-of-00016
media/home/jay/projects/camer
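
The rm/mv shell lines in the cell above can also be written in Python, which avoids shell globbing surprises and reports how many shards landed in each split. This is a sketch: `sort_shards` is a hypothetical helper, and it assumes shard filenames start with the split name (train, val, test) as in the listing above.

```python
import glob
import os
import shutil


def sort_shards(extract_dir, tfrecords_dir, splits=("train", "val", "test")):
    """Move <split>*.* files from extract_dir into tfrecords_dir/<split>/.

    Shards already sitting in the destination are deleted first.
    Returns a dict of split name -> number of shards moved.
    """
    moved = {}
    for split in splits:
        dest = os.path.join(tfrecords_dir, split)
        os.makedirs(dest, exist_ok=True)
        # Remove shards left over from an earlier dataset (the rm lines).
        for old in glob.glob(os.path.join(dest, "*.*record*")):
            os.remove(old)
        # Move this split's shards out of the extraction directory (the mv lines).
        shards = sorted(glob.glob(os.path.join(extract_dir, split + "*.*")))
        for shard in shards:
            shutil.move(shard, dest)
        moved[split] = len(shards)
    return moved

# Usage from the project directory:
# print(sort_shards("code/tfrecords/tarball_extract", "code/tfrecords"))
```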

## Get Model - 1x only

You only have to pull the model once.  This exercise will RETRAIN an existing model, so you need the starting point.  In this example, we are training on top of the BASE == MobileNet V1 that was trained with COCO images.   You could train on top of a CFA model instead; just make sure you configure everything properly.

Copy the model from S3.    You are copying a model from an S3 folder.  There may be a label map and config file there, which makes sense so you can reproduce that model.   However, if you are training on top of this model, those files aren't useful - MAKE SURE YOU UNDERSTAND THIS.   

So when you pull the model from the folder, make sure you understand whether you are re-using those meta files (e.g. reproducing a model) or need something new (transfer learning).  The training process will NOT read them from this download.  The training program reads the config from code/, to help avoid this confusion.

#### CKPT
When you retrain, the config file has a train_config / fine_tune_checkpoint attribute.  You are going to download this BASE model and put it in the code/ckpt/ directory.   The training job will start with the checkpoint file you specify.   For example:

fine_tune_checkpoint: "ckpt/model.ckpt"

#### WARNING code/ckpt/checkpoints
When you run training, it will checkpoint to code/ckpt/checkpoints.  
- if you train for 5000 steps, then repeat, it will basically do nothing because it will just reload the 5000 checkpoint file.
- then you'll think you're smart and you'll remove the 5000 checkpoint file.  Not so fast bucko!
- because then you'll discover there is some pointer in checkpoints/ that told the system the 5000 checkpoint exists, but now it doesn't because you just wiped it, so you'll get an error (that's difficult to figure out)

Just delete the whole checkpoints directory.
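
A quick way to catch the dangling-pointer situation before training starts. This sketch assumes the "pointer" is TensorFlow's plain-text index file named `checkpoint`, which lists entries such as `model_checkpoint_path: "model.ckpt-5000"`. The helper name `dangling_checkpoints` is made up for this example.

```python
import glob
import os
import re


def dangling_checkpoints(ckpt_dir):
    """Return checkpoint prefixes listed in the 'checkpoint' index file that have no files on disk."""
    index_path = os.path.join(ckpt_dir, "checkpoint")
    if not os.path.exists(index_path):
        return []
    with open(index_path) as f:
        text = f.read()
    # Matches both model_checkpoint_path and all_model_checkpoint_paths entries.
    prefixes = re.findall(r'model_checkpoint_paths?:\s*"([^"]+)"', text)
    missing = []
    for prefix in dict.fromkeys(prefixes):  # de-duplicate, keep order
        path = prefix if os.path.isabs(prefix) else os.path.join(ckpt_dir, prefix)
        # A real checkpoint has files like <prefix>.index and <prefix>.data-*.
        if not glob.glob(path + ".*"):
            missing.append(prefix)
    return missing

# Usage:
# print(dangling_checkpoints("code/ckpt/checkpoints"))
```

An empty list means every checkpoint the index mentions is still on disk. Anything else is a reference to a file you deleted, and removing the whole directory is the simple fix.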


In [None]:
# Physical or Docker
# - you may have to delete stuff first
# $ cd code
# $ rm -rf models

# $ cd ../tasks
# $ bash install_tf_models.sh

In [None]:
# SageMaker (& Local?)
# -- warning - something not right here
#    I think you have to do this local or SageMaker (gotta have a base model)
s3_model_folder = os.path.join(S3_MODEL_PATH, BASE_MODEL_FOLDER)
! aws s3 cp $s3_model_folder code/ckpt --recursive

# code/ckpt now has model.ckpt.* files
# there is also a pipeline.config file (this one was configured for the Google Coral - you don't want it)
# there are also some tflite files - we don't want them either

#### github tensorflow/models - 1x Only
Manually git clone the FIRST TIME.   The official TensorFlow GitHub organization has a related repo with a bunch of models, tutorials, utilities, etc.   We are using them, so clone them to this machine.   In a subsequent step, we'll get the files we need from this local copy.

!! - hold it -  
!! this doesn't make sense, try not doing this - I don't think you need to git clone  
!! doesn't the install_tf_models.sh do all of this?  
!! I think we no longer copy, set just clone to code/models
!! thus, you don't need this manual git clone, just run install_tf_models.sh in the next cell


PHYSICAL COMPUTER  
`cd ~/projects`  
SAGEMAKER  
`you should be in the SageMaker directory`  

#### this will put /models into ~/projects  (you'll have ~/projects/models)
`git clone https://github.com/tensorflow/models.git`

In [None]:
# 1 time only

# get the latest software
# - git clone (to <project>/code/models)
# - get the protobuf compiler
# - compile the protobufs
# - clean up
os.chdir(TASKS)
! ./install_tf_models.sh

## Local (Script) Mode Training

#### -> if you know what you're doing, (you have a working SageMaker HOSTED training job) - you can jump out here!

see the AWS SageMaker tutorials notably:  
https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-eager-script-mode/tf-eager-sm-scriptmode.ipynb

The point here is, you can develop a training script locally, then know (have a high degree of confidence) it will run as a SageMaker training job.   (This is relatively new, the old way was more difficult and cumbersome.)

### What is Local?
- local on THIS SageMaker Notebook (EC2) server
  - p2.xlarge - no problem
  - t2.medium - probably not (I think this is the same footprint as the feeble Workspace)
- A desktop computer.
  - works great on an Ubuntu laptop with GPU
  - should work on a Windows laptop if you have a python environment set up
- An AWS Workspace - not enough memory, you'll get a memory error.   The code runs - but fails on a memory allocation error.

## Do you have a training script that will run locally - without Docker?

Considering what is coming up, you want all code needed to train in one directory (in this example, the code/ directory). That directory will be included in the Docker image.    

This gets a little more cumbersome because we take a bunch of stuff from the (official) github tensorflow/models project: we are using the MobileNet model and a BUNCH of utilities.    To make sure we keep up to date, we get all of this programmatically, i.e. clone the most recent version.

### Model Training Configuration

trained-models/ may have a config file and a label map in the directories.  You can start with one of these, BUT there may be environmental variable values that you don't want, and you don't want the s3 pull operation to keep overwriting your config.   So, you can pull a model from s3 and review the .config and label map files, BUT !!! put YOUR config & label map file in the code/ directory.

#### .config file
See the config file for all parameters. The IN USE .config file is in the code/ directory, but you DEFINITELY need to look at these:
- num_classes = should be consistent with labels.txt & label map
- label_map_path (train & eval)
    - there may be one in the model/ (that you pulled from s3)
    - but move your desired label map to code/
- inputs (train & eval) - not sure, SageMaker is passing that in
- check all of the path statements
- fine_tune_checkpoint - make sure you are fine tuning the correct file
    - don't cross a _v1 with a _v2 - that definitely won't work
   
#### label map .pbtxt
- classes start with 1 (not 0 based)
- make sure your label map class count matches the config file
- and it should match the label 
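
The label map rules above can be checked mechanically before a long training run. The sketch below uses plain regular expressions rather than the protobuf parser, so it only understands simple `id: N` and `num_classes: N` entries. The function names are hypothetical.

```python
import re


def label_map_ids(pbtxt_text):
    """Return the sorted class ids found in a label map (.pbtxt) text."""
    return sorted(int(i) for i in re.findall(r"\bid:\s*(\d+)", pbtxt_text))


def config_num_classes(config_text):
    """Return the first num_classes value in a pipeline .config text, or None."""
    match = re.search(r"\bnum_classes:\s*(\d+)", config_text)
    return int(match.group(1)) if match else None


def check_label_map(pbtxt_text, config_text):
    """Return a list of problems; an empty list means the two files agree."""
    ids = label_map_ids(pbtxt_text)
    num_classes = config_num_classes(config_text)
    problems = []
    if 0 in ids:
        problems.append("label map uses id 0; ids must start at 1")
    if num_classes is None:
        problems.append("no num_classes found in config")
    elif len(ids) != num_classes:
        problems.append("label map has %d classes, config says %d"
                        % (len(ids), num_classes))
    return problems

# Usage (run from code/, reading the files this notebook uses):
# problems = check_label_map(open("security_label_map.pbtxt").read(),
#                            open("local_mobilenet_v1_ssd_security_scratch_v6.config").read())
```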

#### NOTE - a missing file will generate a complex error message.  NOT something as simple as file not found. 

#### NOTE - --model_dir parameter: 
- local mode, it needs to be model
- SageMaker HOST mode, it needs to be /opt/ml/model

--num_train_steps  
   500 very quick test  
   5000 more like it  
--num_eval_steps  
   10 very quick test  
   100 more like it
   
Beware of batch size. If you run out of GPU memory, see the config file (batch_size: 32); you may need to decrease it if you have a small GPU.
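
If you find yourself changing the batch size often, a tiny helper can rewrite the config text for you. This sketch changes only the first `batch_size:` entry it finds, which is assumed here to be the one under train_config. Check your own file before relying on that. `set_batch_size` is a hypothetical name.

```python
import re


def set_batch_size(config_text, new_size):
    """Return config_text with the first 'batch_size: N' entry changed to new_size."""
    new_text, count = re.subn(r"(\bbatch_size:\s*)\d+",
                              r"\g<1>%d" % new_size, config_text, count=1)
    if count == 0:
        raise ValueError("no batch_size entry found")
    return new_text

# Usage: read the .config, call set_batch_size(text, 16), write the result to a new file.
```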

GPU should be 95% utilized.  
`nvidia-smi -l 1`
CPU will be about 30%  

In [14]:
os.chdir(CODE)   # this will be the training directory

In [None]:
! rm -r model
! mkdir model

In [None]:
# !!! Warning !!!
# I changed the pipeline_config_path = local*.config
# this local version expects the data to be in code/tfrecords

# sagemaker*.config
#  uses S3 to move the data

# !!! I haven't tested !!!

# 20200122 - physical computer (Inspiron)
#  using Jupyter (below) error: ModuleNotFoundError: No module named 'absl'
#  but, ran fine from terminal
#
#  nvidia-smi
#      NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
#      Make sure that the latest NVIDIA driver is installed and running.
#  but training ran fine, so CUDA was good

In [15]:
# These parameters can be set; if omitted, values come from the SageMaker environment variables SM_MODEL_DIR, SM_CHANNEL_TRAIN, SM_CHANNEL_VAL
# --model_dir
# --train
# --val
# note the config file

! python train115.py \
  --pipeline_config_path="local_mobilenet_v1_ssd_security_scratch_v6.config" \
  --num_train_steps="100000" \
  --num_eval_steps="17500"  \
  --model_dir='model' \
  --train='tfrecords/train/train.*' \
  --val='tfrecords/val/val.*'

--> installing: cython
--> installing: pycocotools
--> installing: matplotlib
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

*** train.py/main()
*** FLAGS ***
pipeline_config_path: local_mobilenet_v1_ssd_security_scratch_v6.config
config exists: True
file: display.py
file: security_label_map.pbtxt
file: __pycache__
file: utils
file: ssd_mobilenet_v1_0.75_depth_quantized_300x300_pets_sync.config
file: local_mobilenet_v1_ssd_security_scratch_v2.config
file: ssd_mobilenet_v1_focal_loss_pets.config
file: requirements.txt
file: tflite_interpreter.py
file: local_mobilenet_v1_ssd_security_scratch_v5.config
file: local_mobilenet_v1_ssd_security_scratch_v1.config
file: ssd_mobilenet_

2020-04-26 21:40:07.219516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-26 21:40:08.398668: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-26 21:50:03.942433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-26 21:50:03.942618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:01:00.0
2020-04-26 21:50:03.942647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-26 21:50:03.942656: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcu

## Tensorboard

`ssh -L 8010:localhost:6006 train@192.168.1.120`  
`conda activate tf115`  
`cd projects/ssd-dag/code`  
`tensorboard --logdir=./model`  
`localhost:8010`  

# Trained Model Output -- IMPORTANT
Where did it go? - THERE IS A BIG DIFFERENCE BETWEEN LOCAL TRAIN AND HOSTED TRAIN -- important !!

train*.py will put the output in code/model.    This is true for local or SageMaker hosted training.   In this case, you trained locally, so the output is in code/model  -- end of story.


When you train with a SageMaker Hosted training job, the output still goes to code/model -- HOWEVER - that is inside a Docker container (that you will never see).  Then it gets copied to S3.   Then the notebook (TrainModel_Step3_TrainingJob) pulls the model output from S3 and extracts the tarball to {PROJECT}/trained_model.   SO AT THIS POINT THE OUTPUT IS IN A DIFFERENT LOCATION !!

The convert graph script pulls from {PROJECT}/trained_model (not the native code/model location).    The easiest solution (you will see below) is to copy the desired checkpoint graph to the {PROJECT}/trained_model location.

In [None]:
os.chdir(CODE)   # this will be the training directory
! ls -la  {MODEL_OUTPUT}

## Troubleshooting

1. If you run for 500 steps, then rerun the exact process, it is going to restore /ckpt/checkpoints (ckpt-500) and then think it is done.  So it basically does nothing.
2. Don't delete ckpt/  (rm ckpt/*.*) WITHOUT removing ckpt/checkpoints/.   The program is always checking that checkpoints subdirectory and trying to restore.  For example, if you delete ckpt/ but leave ckpt/checkpoints, it finds a reference to ckpt-500 but you just deleted it, so it aborts.
3. Always check your files & paths carefully. The error messages thrown for a missing file are not always clear and may send you on a wild goose chase when in reality it was just a missing file.
4. can't import nets - this is a PATH problem (models/research/slim needs to be in your path). In the train.py program, it's added programmatically.
5. OOM when allocating tensor of shape [32,19,19,512] and type float
	 [[{{node gradients/zeros_97}}]] -- go to the config file and change batch size to be smaller (e.g. 16)
6. AttributeError: 'ParallelInterleaveDataset' object has no attribute '_flat_structure' --- check your directories, like something didn't get installed correctly (base model?  models/research stuff?  training data) -- seems to be a problem with the TF build from scratch;   use a pip install and this went away
7. If you are mixing local ops and Docker runs, you may have messed up the ownership of file outputs and checkpoints. Try deleting everything and doing a new pull.
8. trains - then error:  TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
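
For item 4, this is roughly what "programmatically added" can look like. It is a sketch, not the actual code in train.py. It assumes the tensorflow/models clone lives in a `models` directory next to the script, and `add_tf_models_to_path` is a hypothetical helper.

```python
import os
import sys


def add_tf_models_to_path(models_dir):
    """Prepend <models_dir>/research and <models_dir>/research/slim to sys.path."""
    research = os.path.join(models_dir, "research")
    slim = os.path.join(research, "slim")
    # Insert slim first so that research ends up ahead of it on the path.
    for path in (slim, research):
        if path not in sys.path:
            sys.path.insert(0, path)
    return research, slim

# Usage near the top of a training script, before importing object_detection or nets:
# add_tf_models_to_path(os.path.join(os.getcwd(), "models"))
```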


## Making a useable model
At this point you have checkpoint files.   You need models (graphs).   There are many flavors:
- saved graph
- frozen graph
- TensorFlow Lite
- TensorRT
- EdgeTPU

The notebook:  TrainingJob_Step3_TrainingJob will show you how to convert a checkpoint file to a graph (frozen graph & tflite).   There is a bash file to do this.
    

In [None]:
# WAKE UP - make sure NUM_TRAINING_STEPS = the max number in the checkpoint files you listed above
#  e.g. 
# ls model
# -rw-rw-r--  1 ec2-user ec2-user 41116528 Jan 28 15:16 model.ckpt-6000.data-00000-of-00001
# -rw-rw-r--  1 ec2-user ec2-user    27275 Jan 28 15:16 model.ckpt-6000.index
# -rw-rw-r--  1 ec2-user ec2-user  6987305 Jan 28 15:16 model.ckpt-6000.meta
NUM_TRAINING_STEPS = 219267
! cp {CODE}/model/*{NUM_TRAINING_STEPS}* {PROJECT}/trained_model
! ls {PROJECT}/trained_model/*{NUM_TRAINING_STEPS}*

# get the config from the train*.py parameters above
PIPELINE_CONFIG = 'local_mobilenet_v1_ssd_security_retrain.config'
# PIPELINE_CONFIG = 'local_mobilenet_v1_ssd_retrain.config'
! ls {CODE}/{PIPELINE_CONFIG}

# if you don't see your checkpoint in */trained_model/  STOP - and fix it
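
Rather than typing NUM_TRAINING_STEPS by hand, the highest step can be read from the checkpoint filenames. This sketch assumes the `model.ckpt-<step>.index` naming shown in the listing above. `latest_checkpoint_step` is a hypothetical helper.

```python
import os
import re


def latest_checkpoint_step(model_dir):
    """Return the highest step N among model.ckpt-N.index files in model_dir, or None."""
    steps = []
    for name in os.listdir(model_dir):
        match = re.match(r"model\.ckpt-(\d+)\.index$", name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None

# Usage:
# NUM_TRAINING_STEPS = latest_checkpoint_step(MODEL_OUTPUT)
```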

In [None]:
# convert checkpoint is a task script - located in the tasks/ directory
os.chdir(TASKS)  
! ./convert_checkpoint_to_edgetpu_tflite.sh --checkpoint_num {NUM_TRAINING_STEPS} --pipeline_config {PIPELINE_CONFIG}

In [None]:
# Tensorflow FROZEN GRAPH
! ls {PROJECT}/tensorflow_model -l

In [None]:
# Tensorflow Lite model
! ls {PROJECT}/tflite_model -l

### Security
If you are working on the security project, you need to  
put the output_tflite_graph.tflite file in:  camera-api/model/  


In [None]:
# copy the tflite model over to camera-api/model
! cp  {PROJECT}/tflite_model/output_tflite_graph.tflite {CAMERA_API_MODEL}

In [None]:
# just checking ...
! ls -ls {CODE}/ckpt

In [None]:
# move the (converted?  frozen?) ckpt to the starting point
# NOW you can re-train on top of it
! cp {PROJECT}/tensorflow_model/model.ckpt.* {CODE}/ckpt

In [None]:
# backup
! aws s3 ls --profile=jmduff

In [None]:
include_parameter = '*{}*'.format(NUM_TRAINING_STEPS)
! ls {MODEL_OUTPUT}/{include_parameter}
! aws s3 cp {MODEL_OUTPUT} s3://jmduff.security-system/model/{MODEL_DATE}/ --exclude='*.*' --include={include_parameter} --recursive --profile=jmduff
! aws s3 cp {PROJECT}/tensorflow_model s3://jmduff.security-system/model/{MODEL_DATE}/ --exclude='*.*' --include='*.*' --recursive --profile=jmduff
! aws s3 cp {PROJECT}/tflite_model s3://jmduff.security-system/model/{MODEL_DATE}/ --exclude='*.*' --include='*.*' --recursive --profile=jmduff

In [None]:
os.chdir('/media/home/jay/projects/ssd-dag')
! pwd