# Train Model
## SageMaker Script Mode


In [None]:
import os
import tensorflow as tf

In [None]:
# pre-TensorFlow 2.0
# set up eager execution
tf.enable_eager_execution()
tf.set_random_seed(0)
tf.logging.set_verbosity(tf.logging.ERROR)

## Global Constants

In [None]:
S3_TFRECORDS_PATH = "s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/tfrecords/"
TFRECORDS_TARBALL = "20190718_tfrecords.tar.gz"

S3_MODEL_PATH = "s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/"
# base model
BASE_MODEL_FOLDER = "20180718_coco14_mobilenet_v1_ssd300_quantized"

# our CFA model
CFA_MODEL_FOLDER = "20190718_cfa_prod_mobilenet_v1_ssd300/"

## Data

Get the data from S3. You will need to pass a directory into the training job.

In [None]:
# you're in /code
s3_tfrecords = os.path.join(S3_TFRECORDS_PATH, TFRECORDS_TARBALL)
print(s3_tfrecords)
# trailing slash: copy into the tfrecords/ directory rather than to a file named tfrecords
! aws s3 cp $s3_tfrecords tfrecords/

In [None]:
! tar -xvf tfrecords/$TFRECORDS_TARBALL --strip=1 -C tfrecords
! rm tfrecords/$TFRECORDS_TARBALL
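
The extraction step above can also be sketched in pure Python, which may be handy when the same logic has to run inside a script rather than a notebook cell. This is a minimal sketch, not the notebook's method: `extract_stripped` is a hypothetical helper name, and it assumes the tarball has a single top-level folder to drop, mirroring `--strip=1`.

```python
import os
import tarfile


def extract_stripped(tarball_path, dest_dir, strip=1):
    """Extract a tarball into dest_dir, dropping the first `strip` path components."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar.getmembers():
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if len(parts) <= strip or ".." in parts:
                continue  # nothing left after stripping, or an unsafe path
            if not (member.isfile() or member.isdir()):
                continue  # skip links and special files in this sketch
            member.name = "/".join(parts[strip:])
            tar.extract(member, dest_dir)
            extracted.append(member.name)
    return extracted


# Hypothetical usage, assuming the tarball was already copied into tfrecords/:
# extract_stripped("tfrecords/20190718_tfrecords.tar.gz", "tfrecords")
```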

## Get Model
This exercise will RETRAIN an existing model, so you need the starting point.

Copy the model from S3. The files will include a label map. This will be put into the training directory (the directory that is passed into the training job) so everything is together.
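
The S3 locations in this notebook are built with `os.path.join`. That works on Linux because the separator is `/`, but it discards the base if a later piece starts with `/` and would insert backslashes on Windows. A small helper along these lines avoids both; `s3_join` is a hypothetical name, not part of any AWS library.

```python
def s3_join(base, *parts):
    """Join S3 URI pieces with single forward slashes, regardless of OS."""
    pieces = [base.rstrip("/")]
    for part in parts:
        cleaned = part.strip("/")
        if cleaned:  # ignore empty pieces such as "" or "/"
            pieces.append(cleaned)
    return "/".join(pieces)


# Values copied from the Global Constants cell of this notebook.
S3_MODEL_PATH = "s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/"
BASE_MODEL_FOLDER = "20180718_coco14_mobilenet_v1_ssd300_quantized"

print(s3_join(S3_MODEL_PATH, BASE_MODEL_FOLDER))
```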


In [None]:
s3_model_folder = os.path.join(S3_MODEL_PATH, BASE_MODEL_FOLDER)
! aws s3 cp $s3_model_folder model --recursive

In [None]:
! pwd

## Local (Script) Mode Training

See the AWS SageMaker tutorials, notably:  
https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-eager-script-mode/tf-eager-sm-scriptmode.ipynb

The point here is that you can develop a training script locally and then have a high degree of confidence that it will run as a SageMaker training job. (This is relatively new; the old way was more difficult and cumbersome.)
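
What makes the same script usable in both places is mostly how it reads its arguments: in script mode, SageMaker passes hyperparameters as command-line flags and exposes locations such as the model directory through `SM_*` environment variables. The sketch below shows one way an entry point might handle that. It is not the notebook's actual train.py (which is not shown); the flag names mirror the local command used later in this notebook, and the channel name `train` and the fallback paths are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training flags, falling back to SageMaker env vars, then local defaults."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags in script mode.
    parser.add_argument("--pipeline_config_path", type=str,
                        default="sagemaker_mobilenet_v1_ssd_retrain.config")
    parser.add_argument("--num_train_steps", type=int, default=500)
    parser.add_argument("--num_eval_steps", type=int, default=10)
    # Inside a SageMaker container these env vars are set; locally they are not.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "output"))
    # Assumes the input channel is named "train".
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "tfrecords"))
    # Tolerate extra flags that the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--num_train_steps", "20"])
print(args.num_train_steps, args.model_dir)
```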

### Step 1.  Do you have a training script that will run locally - without Docker?

Considering what is coming up, you want all code needed to train in one directory. That directory will be included in the Docker image.

This is going to get cumbersome because we took a number of files from the (real) tensorflow/models project. To make sure we keep up to date, build the model from a current copy of that project.

#### github tensorflow/models
Manually git clone the repo the FIRST TIME. The official TensorFlow GitHub organization has a related repo with a collection of models, tutorials, utilities, etc. We are using them, so clone the repo to this machine. In a subsequent step, we'll get the files we need from this local copy.

git clone https://github.com/tensorflow/models.git
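
Because the clone is only needed the first time, a guard like the following can make the step safe to rerun. This is a sketch under assumptions: `clone_if_missing` is a hypothetical helper, and it treats the presence of a `.git` folder as proof that the clone already happened.

```python
import os
import subprocess


def clone_if_missing(repo_url, dest):
    """Clone repo_url into dest unless dest already holds a git checkout."""
    if os.path.isdir(os.path.join(dest, ".git")):
        return False  # already cloned; nothing to do
    subprocess.run(["git", "clone", repo_url, dest], check=True)
    return True


# Hypothetical usage with the paths from this notebook (left commented out
# so that running this cell does not start a network clone):
# clone_if_missing("https://github.com/tensorflow/models.git",
#                  "/home/ec2-user/SageMaker/models")
```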

In [None]:
# get the latest software
TF_MODEL = "/home/ec2-user/SageMaker/models"                # this is the TensorFlow repo for models
OUR_CODE = "/home/ec2-user/SageMaker/ssd-dag/code"          # this is OUR directory for code

In [None]:
os.chdir(TF_MODEL)
# use the install script

#### Python Packages not in the tensorflow_p36 conda environment
These are missing from the environment, so add them.

In [None]:
# python packages that are required
! pip install pycocotools

### Model Training Configuration

trained-models/ may have a config file and a label map in its directories, and you can start with one of these. BUT there may be environment variable values that you don't want, and you don't want the S3 pull operation to keep overwriting your config. So pull a model from S3 and review its .config and label map files, BUT put YOUR config and label map files in the code/ directory.

#### .config file
See the config file for all parameters. The IN USE .config file is in the code/ directory. You DEFINITELY need to look at these:
- num_classes = should be consistent with labels.txt & label map
- label_map_path (train & eval)
    - there may be one in the model/ (that you pulled from s3)
    - but move your desired label map to code/
- inputs (train & eval) - not sure, SageMaker is passing that in
- check all of the path statements 
- fine_tune_checkpoint - make sure you are fine tuning the correct file
    - don't cross a _v1 with a _v2 - that definitely won't work
   
#### label map .pbtxt
- classes start with 1 (not 0 based)
- make sure your label map class count matches the config file
- and it should match labels.txt

NOTE - a missing file will generate a complex error message, NOT something as simple as "file not found".
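
The checks listed above can be partly automated before launching a run. The following is a minimal sketch that scans the files as plain text with regular expressions rather than parsing them as protobufs, so it only catches the simple mismatches described here. All function names are hypothetical, and it assumes the usual `num_classes: N` and `id: N` spellings in the .config and .pbtxt files.

```python
import os
import re


def read_num_classes(config_text):
    """Return the num_classes value found in pipeline .config text, or None."""
    match = re.search(r"num_classes\s*:\s*(\d+)", config_text)
    return int(match.group(1)) if match else None


def read_label_ids(label_map_text):
    """Return the ids declared in a label map .pbtxt, in file order."""
    return [int(n) for n in re.findall(r"\bid\s*:\s*(\d+)", label_map_text)]


def check_consistency(config_text, label_map_text):
    """Return a list of human-readable problems (empty means consistent)."""
    problems = []
    num_classes = read_num_classes(config_text)
    ids = read_label_ids(label_map_text)
    if num_classes is None:
        problems.append("num_classes not found in config")
    elif num_classes != len(ids):
        problems.append(
            f"config has num_classes={num_classes}, label map has {len(ids)} items")
    if ids and min(ids) < 1:
        problems.append("label map ids must start at 1, not 0")
    return problems


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]
```

Running `missing_files` on every path named in the .config before training gives a plain list of what is absent, which is easier to read than the error the training job would raise.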

In [None]:
os.chdir(OUR_CODE)   # this will be the training directory

In [None]:
! python train.py \
  --pipeline_config_path="sagemaker_mobilenet_v1_ssd_retrain.config" \
  --model_dir="output" \
  --num_train_steps="500" \
  --num_eval_steps="10"

## Troubleshooting

1. If you run for 500 steps and then rerun the exact same process, it restores ckpt/checkpoints (ckpt-500) and then thinks it is done, so it basically does nothing.
2. Don't delete the files in ckpt/ (rm ckpt/*.*) WITHOUT also removing ckpt/checkpoints/. The program always checks that checkpoints subdirectory and tries to restore from it. For example, if you delete the files in ckpt/ but leave ckpt/checkpoints, it finds a reference to ckpt-500, which you just deleted, so it aborts.
3. Always check your files and paths carefully. The error messages thrown for a missing file are not always clear and may send you on a wild goose chase when, in reality, it was just a missing file.
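
One way to avoid the half-deleted state described in item 2 is to remove the whole checkpoint directory tree in one step and recreate it empty. This is a minimal sketch; `reset_dir` is a hypothetical helper, and it deletes all saved training progress under the given path, so use it only when you really want to start over.

```python
import os
import shutil


def reset_dir(path):
    """Delete path and everything under it, then recreate it as an empty directory."""
    shutil.rmtree(path, ignore_errors=True)  # also removes subdirectories
    os.makedirs(path)


# Hypothetical usage, assuming checkpoints are written under ckpt/:
# reset_dir("ckpt")
```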

### Step 2.  Now try training locally - but your training goes into a Docker container

In [None]:
!/bin/bash ./sagemaker_docker_setup.sh

### create a local SageMaker estimator

code/train_model - this entire directory goes to the Docker image


In [None]:
# an Estimator is a SageMaker class
# and, you're using the tensorflow flavor

import sagemaker
from sagemaker.tensorflow import TensorFlow

model_dir = '/opt/ml/model'     # where the model is saved inside the Docker container
train_instance_type = 'local'   # 'local' runs on this machine instead of a remote instance
hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}
local_estimator = TensorFlow(entry_point='train.py',
                       source_dir='train_model',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-eager-scriptmode-bostonhousing',
                       framework_version='1.13',
                       py_version='py3',
                       script_mode=True)
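
In local mode the training data can be handed to the estimator as `file://` locations instead of S3 URIs, as the linked AWS tutorial does. The sketch below only builds that inputs dictionary. The channel name `train` and the `tfrecords` directory are assumptions based on earlier cells, and `local_channel` is a hypothetical helper.

```python
import os


def local_channel(path):
    """Return a file:// URI for a local directory, for use as a local-mode input."""
    return "file://" + os.path.abspath(path)


# Assumes the TFRecords were extracted into ./tfrecords earlier in the notebook.
inputs = {"train": local_channel("tfrecords")}
print(inputs)

# With the estimator defined above, training would then be started with:
# local_estimator.fit(inputs)
```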