# Train Model
## Step 1 - Develop a train.py script

This is SageMaker Script Mode. It is relatively new and much easier than the original SageMaker design. You need to develop a train.py program that you will:
1. run locally, meaning it runs on the local machine's resources
2. then test locally inside a Docker container

If it passes both tests, it should run fine when you create a SageMaker Training job. This is the intended way to use SageMaker. Don't get confused: running jobs on the local SageMaker server isn't really what it was designed for. It is designed to take your program and send it to outside resources (using a Docker container).
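To make the "runs locally, then in Docker, then as a Training job" idea concrete, here is a minimal sketch of how a Script Mode entry point can read its settings. It is not the project's actual train.py. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are environment variables SageMaker sets inside the training container (the channel name `train` is an assumption), and the fallback paths `output` and `tfrecords` are hypothetical local defaults.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse arguments the way a Script Mode entry point typically does."""
    parser = argparse.ArgumentParser()
    # Inside the container SageMaker sets SM_MODEL_DIR and SM_CHANNEL_<NAME>.
    # The fallbacks ("output", "tfrecords") are hypothetical local paths.
    parser.add_argument("--model_dir",
                        default=os.environ.get("SM_MODEL_DIR", "output"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "tfrecords"))
    parser.add_argument("--num_train_steps", type=int, default=500)
    # parse_known_args tolerates extra arguments the platform may append
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Because every value falls back to a local default, the same script can be started with plain `python train.py` on your machine and still pick up the container's settings when SageMaker launches it.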


In [None]:
import os
import tensorflow as tf

In [None]:
# pre-TensorFlow 2.0
# set up eager execution
tf.enable_eager_execution()
tf.set_random_seed(0)
tf.logging.set_verbosity(tf.logging.ERROR)

## MobileNet Model
Why use a MobileNet model? Because the end objective is a lightweight model, one that will run on a Google Coral TPU. This requires a quantized model (int8, not float32), and you get there from a TensorFlow Lite model. The recommended path is to start with a model structure that you know is compatible (MobileNet), then retrain on top of it.
1. We pull MobileNet v1 (there is a v2 that we aren't using) trained on COCO images
2. We train on top of it (transfer learning) with our CFA Products
3. That generates a TensorFlow Lite model (.tflite)
4. We will later convert the .tflite model to an Edge TPU model
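If "quantized (int8, not float32)" is unfamiliar, the sketch below shows the arithmetic in plain Python. It uses the affine scheme that TensorFlow Lite documents, real_value = scale * (q - zero_point). The weights, `scale`, and `zero_point` are made-up illustration values, not values from our model.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map float values to int8 using q = round(x / scale) + zero_point."""
    quantized = []
    for x in values:
        q = int(round(x / scale)) + zero_point
        quantized.append(max(qmin, min(qmax, q)))  # clamp to the int8 range
    return quantized


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats using x = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]   # hypothetical float32 weights
scale = 1.0 / 128                         # hypothetical step size
zero_point = 0                            # hypothetical: 0.0 maps to 0

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q)         # [-128, -64, 0, 32, 127]
print(restored)  # [-1.0, -0.5, 0.0, 0.25, 0.9921875]
```

Note the last value: 1.0 does not fit in the int8 range at this scale, so it is clamped to 127 and comes back as 0.9921875. That small loss of precision is the price of the smaller, faster model.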

## Global Constants

In [None]:
S3_TFRECORDS_PATH = "s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/tfrecords/"
TFRECORDS_TARBALL = "20190718_tfrecords.tar.gz"


S3_MODEL_PATH = "s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/"
# base model - starting point that we train on top of
BASE_MODEL_FOLDER = "20180718_coco14_mobilenet_v1_ssd300_quantized"

# our CFA model
# note the COINCIDENCE - 2018-0718 vs 2019-0718, don't let this confuse you!
CFA_MODEL_FOLDER = "20190718_cfa_prod_mobilenet_v1_ssd300/"

# project directories
PROJECT = os.getcwd()
CODE = os.path.join(PROJECT, "code")
TASKS = os.path.join(PROJECT, "tasks")

print ("project directory:", PROJECT)
print ("code directory:", CODE)
print ("task directory:", TASKS)

## Data

Get the data from S3. You'll need to pass a directory into the training job.
### NOTE
It is still unclear whether the data should be baked into the Docker image or passed in with the SageMaker job.  
TODO - figure this out. It is faster NOT to put the data in the Docker image (code/tfrecords), because that just makes the Docker build step slower. The AWS fetch when the container starts is much faster.

In [None]:
# you're in top project directory
s3_tfrecords = os.path.join(S3_TFRECORDS_PATH, TFRECORDS_TARBALL)
print (s3_tfrecords)
! aws s3 cp $s3_tfrecords code/tfrecords  

# tarball is now in code/tfrecords

In [None]:
! tar -xvf code/tfrecords/$TFRECORDS_TARBALL --strip=1 -C code/tfrecords
! rm code/tfrecords/$TFRECORDS_TARBALL

# tarball is gone, tfrecord files are in code/tfrecords

## Get Model
This exercise will RETRAIN an existing model, so you need the starting point. In this example, we are training on top of the BASE model: MobileNet V1 trained with COCO images. You could train on top of a CFA model instead; just make sure you configure everything properly.

Copy the model from S3. You are copying a model from an S3 folder. There may be a label map and a config file in that folder, which makes sense if you want to reproduce that model. However, if you are training on top of this model, those files aren't useful. MAKE SURE YOU UNDERSTAND THIS.

So when you pull the model from the folder, make sure you understand whether you are re-using those meta files (e.g. reproducing a model) or whether you need something new (transfer learning). The training process will NOT read them from this download. To help avoid this confusion, the training program reads the config from code/.

#### CKPT
When you retrain, the config file has a train_config / fine_tune_checkpoint attribute. Download the BASE model and put it in the code/ckpt/ directory. The training job will start from the checkpoint you specify. For example:

fine_tune_checkpoint: "ckpt/model.ckpt"
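One thing that trips people up: "ckpt/model.ckpt" is a checkpoint prefix, not a single file. TensorFlow stores the checkpoint as several files such as model.ckpt.index and model.ckpt.data-*. Here is a small sketch for checking that the prefix actually resolves before you start training. The helper names are hypothetical, not part of TensorFlow.

```python
import glob
import os


def checkpoint_files(prefix):
    """Return the files that belong to a TensorFlow checkpoint prefix."""
    # The prefix itself is not a file; the real files share it as a stem.
    return sorted(glob.glob(prefix + ".*"))


def check_fine_tune_checkpoint(prefix):
    """Raise early, with a clear message, if the checkpoint is missing."""
    files = checkpoint_files(prefix)
    if not any(name.endswith(".index") for name in files):
        raise FileNotFoundError("no checkpoint found for prefix: " + prefix)
    return files
```

Run from the code/ directory, `check_fine_tune_checkpoint(os.path.join("ckpt", "model.ckpt"))` should list the model.ckpt.* files once the download is done.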

#### WARNING code/ckpt/checkpoints
When you run training, it writes checkpoints to code/ckpt/checkpoints.
- If you train for 5000 steps and then repeat the run, it will do basically nothing, because it just reloads the 5000-step checkpoint.
- You might then try removing the 5000-step checkpoint file. Not so fast!
- There is a pointer in checkpoints/ that tells the system the 5000-step checkpoint exists. Once you have wiped the file, the pointer is stale, and you get an error that is difficult to figure out.

To start over, just delete the whole checkpoints directory.
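A minimal sketch of that cleanup in Python, so you can run it from a notebook cell. The function name is hypothetical. It removes the directory and everything in it, including the pointer, which is the whole idea.

```python
import os
import shutil


def reset_checkpoints(checkpoint_dir):
    """Delete the whole checkpoints directory so the next run starts clean."""
    if os.path.isdir(checkpoint_dir):
        shutil.rmtree(checkpoint_dir)
        return True   # something was removed
    return False      # nothing to clean up
```

From the code/ directory you would call `reset_checkpoints(os.path.join("ckpt", "checkpoints"))` before retraining.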


In [None]:
s3_model_folder = os.path.join(S3_MODEL_PATH, BASE_MODEL_FOLDER)
! aws s3 cp $s3_model_folder code/ckpt --recursive

# code/ckpt now has model.ckpt.* files
# there is also a pipeline.config file (this one was configured for the Google Coral - you don't want it)
# there are also some tflite files - we don't want them either

## Local (Script) Mode Training

See the AWS SageMaker tutorials, notably:  
https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-eager-script-mode/tf-eager-sm-scriptmode.ipynb

The point here is that you can develop a training script locally and then have a high degree of confidence it will run as a SageMaker training job. (This is relatively new; the old way was more difficult and cumbersome.)

## Do you have a training script that will run locally - without Docker?

Considering what is coming up, you want all the code needed for training in one directory (in this example, the code/ directory). That directory will be included in the Docker image.

This gets a little more cumbersome because we took a lot of code from the (official) GitHub tensorflow/models project: we are using the MobileNet model and a BUNCH of utilities. To stay up to date, we get all of this programmatically, i.e. clone the most recent version.

#### github tensorflow/models
Manually git clone the FIRST TIME. The official TensorFlow GitHub organization has a related repo with a bunch of models, tutorials, utilities, etc. We are using them, so clone the repo to this machine. In a subsequent step, we'll get the files we need from this local copy.

git clone https://github.com/tensorflow/models.git

In [None]:
# get the latest software
# - git clone
# - get the protobuf compiler
# - compile
# - clean up
os.chdir(TASKS)
! ./install_tf_models.sh

In [None]:
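# NOTE: TF_MODEL is not defined in the Global Constants cell above;
# set it to your local tensorflow/models clone before running this cell.
os.chdir(TF_MODEL)
# use the install script (see the previous cell)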

#### Python Packages not in the tensorflow_p36 conda environment
so add them

In [None]:
# python packages that are required
! pip install pycocotools

### Model Training Configuration

The trained-models/ folders on S3 may include a config file and a label map, and you can start from one of these. BUT there may be environment-specific values in them that you don't want, and you don't want the S3 pull operation to keep overwriting your config. So pull a model from S3 and review its .config and label map files, BUT put YOUR config and label map files in the code/ directory.

#### .config file
See the config file for all parameters. The IN USE .config file is in the code/ directory. You DEFINITELY need to look at these:
- num_classes = should be consistent with labels.txt & label map
- label_map_path (train & eval)
    - there may be one in the model/ (that you pulled from s3)
    - but move your desired label map to code/
- inputs (train & eval) - not sure, SageMaker is passing that in
- check all of the path statements 
- fine_tune_checkpoint - make sure you are fine tuning the correct file
    - don't cross a _v1 with a _v2 - that definitely won't work
   
#### label map .pbtxt
- classes start with 1 (not 0 based)
- make sure your label map class count matches the config file
- and it should match the label 

NOTE - a missing file will generate a complex error message.  NOT something as simple as file not found. 
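Since the real error messages are hard to read, it can help to check the two rules above yourself before training. This is a rough sketch that uses regular expressions instead of the real protobuf parser, so it only handles simple, conventional files. The class names and the one-line config in the example are hypothetical.

```python
import re


def label_map_ids(pbtxt_text):
    """Extract the integer ids from the text of a label map .pbtxt file."""
    return [int(m) for m in re.findall(r"\bid\s*:\s*(\d+)", pbtxt_text)]


def config_num_classes(config_text):
    """Return num_classes from pipeline config text, or None if absent."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", config_text)
    return int(match.group(1)) if match else None


def check_label_map(pbtxt_text, config_text):
    """Return a list of problems; an empty list means the checks passed."""
    ids = label_map_ids(pbtxt_text)
    problems = []
    if 0 in ids:
        problems.append("label map ids must start at 1, not 0")
    if config_num_classes(config_text) != len(ids):
        problems.append("num_classes does not match the label map")
    return problems


# hypothetical label map and config fragment, for illustration only
LABEL_MAP = """
item {
  id: 1
  name: 'sandwich'
}
item {
  id: 2
  name: 'fries'
}
"""
CONFIG = "model { ssd { num_classes: 2 } }"

print(check_label_map(LABEL_MAP, CONFIG))  # []
```

In practice you would read the two texts from the label map and .config files in code/ and stop if the returned list is not empty.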

In [None]:
os.chdir(CODE)   # this will be the training directory

In [None]:
! python train.py \
  --pipeline_config_path="sagemaker_mobilenet_v1_ssd_retrain.config" \
  --model_dir="output" \
  --num_train_steps="500" \
  --num_eval_steps="10"

## Troubleshooting

1. If you run for 500 steps and then rerun the exact same process, it restores from ckpt/checkpoints (ckpt-500) and thinks it is done, so it basically does nothing.
2. Don't delete the files in ckpt/ (rm ckpt/*.*) WITHOUT also removing ckpt/checkpoints/. The program always checks that checkpoints subdirectory and tries to restore. For example, if you delete the files in ckpt/ but leave ckpt/checkpoints, it finds a reference to ckpt-500 that you just deleted, so it aborts.
3. Always check your files and paths carefully. The error messages thrown for a missing file are not always clear and may send you on a wild goose chase when in reality it was just a missing file.
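For item 2, here is a sketch of how you might detect a stale pointer before training aborts. It assumes the training code uses TensorFlow's standard state file, a text file named `checkpoint` with a `model_checkpoint_path: "..."` line. The function name is hypothetical, and the exact layout of your checkpoints directory may differ.

```python
import glob
import os
import re


def stale_checkpoint_pointer(checkpoint_dir):
    """Return the checkpoint name the state file points to if its files are gone."""
    state_file = os.path.join(checkpoint_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None  # no pointer, so nothing will be restored
    with open(state_file) as handle:
        match = re.search(r'^model_checkpoint_path:\s*"([^"]+)"',
                          handle.read(), re.M)
    if match is None:
        return None
    prefix = os.path.join(checkpoint_dir, match.group(1))
    # a healthy checkpoint has files such as <prefix>.index on disk
    return None if glob.glob(prefix + ".*") else match.group(1)
```

If this returns a name such as `ckpt-500`, the pointer refers to files that no longer exist, and the simplest fix is to delete the whole checkpoints directory.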

## Step 2 - Train locally inside a Docker container

In [None]:
!/bin/bash ./sagemaker_docker_setup.sh

### create a local SageMaker estimator

code/train_model - this entire directory goes to the Docker image


In [None]:
# an Estimator is a SageMaker class
# and, you're using the tensorflow flavor

import sagemaker
from sagemaker.tensorflow import TensorFlow

model_dir = '/opt/ml/model'     # this is related to how it gets deployed in the Docker
train_instance_type = 'local'   # local vs another server
hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}
local_estimator = TensorFlow(entry_point='train.py',
                       source_dir='train_model',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-eager-scriptmode-bostonhousing',
                       framework_version='1.13',
                       py_version='py3',
                       script_mode=True)