# Train Model
## SageMaker Script Mode


In [1]:
import os
import tensorflow as tf

In [2]:
# pre-TensorFlow 2.0
# set up eager execution
tf.enable_eager_execution()
tf.set_random_seed(0)
tf.logging.set_verbosity(tf.logging.ERROR)

## Global Constants

In [48]:
S3_TFRECORDS_PATH = "s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/tfrecords/"
TFRECORDS_TARBALL = "20190718_tfrecords.tar.gz"

S3_MODEL_PATH = "s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/"
MODEL_FOLDER = "20190718_cfa_prod_mobilenet_v1_ssd300/"

## Data

Get the data from S3. You'll need to pass a directory into the training job.

In [4]:
s3_tfrecords = os.path.join(S3_TFRECORDS_PATH, TFRECORDS_TARBALL)
print(s3_tfrecords)
! aws s3 cp $s3_tfrecords data

s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/tfrecords/20190718_tfrecords.tar.gz
download: s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/tfrecords/20190718_tfrecords.tar.gz to data/20190718_tfrecords.tar.gz
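
The cell above builds the S3 URI with `os.path.join`, which works here because the notebook runs on Linux. A minimal sketch of a platform-independent alternative, assuming the same two constants (the helper name `s3_join` is hypothetical, not part of this project):

```python
import posixpath

S3_TFRECORDS_PATH = "s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/tfrecords/"
TFRECORDS_TARBALL = "20190718_tfrecords.tar.gz"


def s3_join(prefix, *parts):
    """Join an S3 prefix and key parts with forward slashes on any OS."""
    return posixpath.join(prefix, *parts)


s3_tfrecords = s3_join(S3_TFRECORDS_PATH, TFRECORDS_TARBALL)
print(s3_tfrecords)
```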


In [5]:
! mkdir -p data/tfrecords
! tar -xvf data/$TFRECORDS_TARBALL --strip-components=1 -C data/tfrecords
! rm data/$TFRECORDS_TARBALL

20190718_tfrecords/test.tfrecord
20190718_tfrecords/train.tfrecord
20190718_tfrecords/val.tfrecord
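
The same extraction can be done without shelling out. The sketch below uses the standard-library `tarfile` module to drop the leading folder, much as stripping one path component does in the `tar` command above. The function name `extract_stripped` is hypothetical. It only handles regular files and does no path sanitising, so treat it as an illustration for trusted archives.

```python
import os
import tarfile


def extract_stripped(tar_path, dest_dir, strip=1):
    """Extract tar_path into dest_dir, dropping the first `strip` path components."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")[strip:]
            if not parts or not member.isfile():
                continue  # skip the top-level folder entry and non-regular files
            member.name = "/".join(parts)  # rewrite the path before extracting
            tar.extract(member, dest_dir)
            extracted.append(member.name)
    return extracted


# Example call (paths taken from the cells above):
# extract_stripped("data/20190718_tfrecords.tar.gz", "data/tfrecords")
```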


## Get Model
This exercise RETRAINS an existing model, so you need the starting point.

Copy the model from S3. The files include a label map. Everything is put into the training directory (the directory that is passed into the training job) so it is all together.


In [53]:
s3_model_folder = os.path.join(S3_MODEL_PATH, MODEL_FOLDER)
! aws s3 cp $s3_model_folder model --recursive

download: s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/20190718_cfa_prod_mobilenet_v1_ssd300/cfa_prod_label_map.pbtxt to model/cfa_prod_label_map.pbtxt
download: s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/20190718_cfa_prod_mobilenet_v1_ssd300/labels.txt to model/labels.txt
download: s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/20190718_cfa_prod_mobilenet_v1_ssd300/sagemaker_mobilenet_v1_ssd_retrain.config to model/sagemaker_mobilenet_v1_ssd_retrain.config
download: s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/20190718_cfa_prod_mobilenet_v1_ssd300/output_tflite_graph.tflite to model/output_tflite_graph.tflite
download: s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/20190718_cfa_prod_mobilenet_v1_ssd300/tflite_graph.pb to model/tflite_graph.pb
download: s3://cfaanalyticsresearch-sagemaker/trained-models/tensorflow_mobilenet/20190718_cfa_prod_mobilenet_v1_

In [51]:
! pwd

/home/ec2-user/SageMaker/ssd-dag/code


## Local (Script) Mode Training

See the AWS SageMaker tutorials, notably:  
https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-eager-script-mode/tf-eager-sm-scriptmode.ipynb

The point here is that you can develop a training script locally and then have a high degree of confidence that it will run as a SageMaker training job. (This is relatively new; the old way was more difficult and cumbersome.)

### Step 1.  Do you have a training script that will run locally - without Docker?

Considering what is coming up, you want all the code needed for training in one directory. That directory will be included in the Docker image.

This is going to get cumbersome because we took a lot of code from the (real) tensorflow/models project. To stay up to date, pull the latest version of that repo and copy the needed files from it, as in the cells that follow.

#### GitHub tensorflow/models
Manually git clone the FIRST TIME. The official TensorFlow GitHub organization has a related repo with models, tutorials, utilities, etc. We are using them, so clone that repo to this machine. In a subsequent step, we'll get the files we need from this local copy.

git clone https://github.com/tensorflow/models.git
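
The "clone the first time, pull afterwards" rule can be folded into one helper. This is a sketch only: the function names are hypothetical, and it assumes `git` is on the PATH.

```python
import os
import subprocess

REPO_URL = "https://github.com/tensorflow/models.git"


def sync_command(repo_dir, url=REPO_URL):
    """Return the git command that brings repo_dir up to date."""
    if os.path.isdir(os.path.join(repo_dir, ".git")):
        # Already cloned: just fetch and merge the latest changes.
        return ["git", "-C", repo_dir, "pull"]
    # First time: clone into repo_dir.
    return ["git", "clone", url, repo_dir]


def sync_repo(repo_dir, url=REPO_URL):
    """Run the clone or pull; raises CalledProcessError if git fails."""
    subprocess.run(sync_command(repo_dir, url), check=True)


# Example call (path taken from the TF_MODEL constant defined in the next cell):
# sync_repo("/home/ec2-user/SageMaker/models")
```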

In [29]:
# get the latest software
TF_MODEL = "/home/ec2-user/SageMaker/models"                # this is the TensorFlow repo for models
OUR_CODE = "/home/ec2-user/SageMaker/ssd-dag/code"          # this is OUR directory for code

In [30]:
os.chdir(TF_MODEL)
! git pull

Already up-to-date.


In [32]:
# copy necessary programs/scripts to OUR train_model directory
! echo "--- TF / Model ---"
! ls $TF_MODEL/research/object_detection 


# copy from the tensorflow repo to our repo
! cp $TF_MODEL/research/object_detection $OUR_CODE -r

! echo "--- our code ---"
! ls $OUR_CODE

--- TF / Model ---
anchor_generators		     matchers
box_coders			     meta_architectures
builders			     metrics
CONTRIBUTING.md			     model_hparams.py
core				     model_lib.py
data				     model_lib_test.py
data_decoders			     model_lib_v2.py
dataset_tools			     model_lib_v2_test.py
dockerfiles			     model_main.py
eval_util.py			     models
eval_util_test.py		     model_tpu_main.py
exporter.py			     object_detection_tutorial.ipynb
exporter_test.py		     predictors
export_inference_graph.py	     protos
export_tflite_ssd_graph_lib.py	     __pycache__
export_tflite_ssd_graph_lib_test.py  README.md
export_tflite_ssd_graph.py	     samples
g3doc				     test_ckpt
inference			     test_data
__init__.py			     test_images
inputs.py			     tpu_exporters
inputs_test.py			     utils
legacy
--- our code ---
annotation.py  missing_pb2.tar.gz   requirements.txt	   utils
detect.py      move_missing_pb2.sh  tflite_interpreter.py
display.py     object_detection     train.py


#### Missing *_pb2.py scripts
This is evidently related to different versions of protobufs. I never fully figured it out.
- these were in the original Coral tutorial and model training
- but NOT in the tensorflow/models repo

I pulled them from the original Coral TPU tutorial and put them in a tarball. This script moves them to the correct place, and then you can forget about it.
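
For context, the tensorflow/models installation instructions generate the `*_pb2.py` modules by running `protoc` over `object_detection/protos/*.proto` rather than committing them, which may be why a fresh clone lacks them. The sketch below only reports which `.proto` files still have no compiled module. The function name is hypothetical, and the directory layout is an assumption based on the listing above.

```python
from pathlib import Path


def protos_missing_pb2(protos_dir):
    """Return names of .proto files in protos_dir with no matching *_pb2.py."""
    protos_dir = Path(protos_dir)
    return sorted(
        proto.name
        for proto in protos_dir.glob("*.proto")
        if not (protos_dir / f"{proto.stem}_pb2.py").exists()
    )


# Example call, run from the code directory:
# print(protos_missing_pb2("object_detection/protos"))
```

An empty list after running `move_missing_pb2.sh` would suggest that every proto has its generated module in place.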

In [36]:
# unpack the missing *_pb2.py files and move them into place
os.chdir(OUR_CODE)
!/bin/bash ./move_missing_pb2.sh

missing_pb2/anchor_generator_pb2.py
missing_pb2/argmax_matcher_pb2.py
missing_pb2/bipartite_matcher_pb2.py
missing_pb2/box_coder_pb2.py
missing_pb2/box_predictor_pb2.py
missing_pb2/calibration_pb2.py
missing_pb2/eval_pb2.py
missing_pb2/faster_rcnn_box_coder_pb2.py
missing_pb2/faster_rcnn_pb2.py
missing_pb2/graph_rewriter_pb2.py
missing_pb2/grid_anchor_generator_pb2.py
missing_pb2/hyperparams_pb2.py
missing_pb2/image_resizer_pb2.py
missing_pb2/input_reader_pb2.py
missing_pb2/keypoint_box_coder_pb2.py
missing_pb2/losses_pb2.py
missing_pb2/matcher_pb2.py
missing_pb2/mean_stddev_box_coder_pb2.py
missing_pb2/model_pb2.py
missing_pb2/multiscale_anchor_generator_pb2.py
missing_pb2/optimizer_pb2.py
missing_pb2/pipeline_pb2.py
missing_pb2/post_processing_pb2.py
missing_pb2/preprocessor_pb2.py
missing_pb2/region_similarity_calculator_pb2.py
missing_pb2/square_box_coder_pb2.py
missing_pb2/ssd_anchor_generator_pb2.py
missing_pb2/ssd_pb2.py
missing_pb2/string_int_label_m

#### Python packages not in the tensorflow_p36 conda environment
So add them.

In [None]:
# python packages that are required
! pip install pycocotools

### Model Training Configuration
See the config file for all parameters. The .config file is in the model/ directory. But you DEFINITELY need to look at these!
- num_classes: should be consistent with labels.txt and the label map
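
The consistency check on `num_classes` can be automated roughly as follows. This is a sketch under assumptions: the label map uses the usual `item { id: ... name: ... }` entries, the pipeline config contains a `num_classes: N` line, and labels.txt holds one label per line. All function names are hypothetical.

```python
import re


def count_label_map_items(pbtxt_text):
    """Count `item {` entries in a label map .pbtxt."""
    return len(re.findall(r"\bitem\s*\{", pbtxt_text))


def read_num_classes(config_text):
    """Return the first num_classes value in a pipeline .config, or None."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", config_text)
    return int(match.group(1)) if match else None


def count_labels(labels_text):
    """Count non-empty lines in labels.txt (assumes one label per line)."""
    return sum(1 for line in labels_text.splitlines() if line.strip())


def class_counts(config_text, pbtxt_text, labels_text):
    """Collect the three counts so they can be compared side by side."""
    return {
        "num_classes": read_num_classes(config_text),
        "label_map_items": count_label_map_items(pbtxt_text),
        "labels_txt_lines": count_labels(labels_text),
    }
```

If labels.txt carries an extra background entry, its count would be expected to differ by one, so compare the numbers by eye instead of asserting strict equality.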

In [39]:
os.chdir(OUR_CODE)   # this will be the training directory

In [42]:
! python train.py \
  --pipeline_config_path="sagemaker_mobilenet_v1_ssd_retrain.config" \
  --model_dir="${TRAIN_DIR}" \
  --num_train_steps="${num_training_steps}" \
  --num_eval_steps="${num_eval_steps}"


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Traceback (most recent call last):
  File "train.py", line 26, in <module>
    from object_detection import model_lib
  File "/home/ec2-user/SageMaker/ssd-dag/code/object_detection/model_lib.py", line 28, in <module>
    from object_detection import exporter as exporter_lib
  File "/home/ec2-user/SageMaker/ssd-dag/code/object_detection/exporter.py", line 24, in <module>
    from object_detection.builders import model_builder
  File "/home/ec2-user/SageMaker/ssd-dag/code/object_detection/builders/model_builder.py", line 20, in <module>
    from object_detection.builders import anchor_generator_builder
  File "/home/ec2-user/SageMaker/ssd-dag/code/object_detection/builders/anchor_generator_builder.py", line 22, in <module>
    from object_d
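
The training command above relies on `TRAIN_DIR`, `num_training_steps`, and `num_eval_steps`, which are not set anywhere in this notebook. One way to make the values explicit is to build the argument list in Python. This is a sketch: the flag names come from the cell above, while the output directory and step counts are placeholder values, not the project's real settings.

```python
import subprocess
import sys


def build_train_command(config_path, train_dir, num_train_steps, num_eval_steps):
    """Build the argument list for train.py using the flags shown above."""
    return [
        sys.executable, "train.py",
        f"--pipeline_config_path={config_path}",
        f"--model_dir={train_dir}",
        f"--num_train_steps={num_train_steps}",
        f"--num_eval_steps={num_eval_steps}",
    ]


# Placeholder values: "trained_model", 500 and 100 are assumptions for illustration.
cmd = build_train_command(
    "sagemaker_mobilenet_v1_ssd_retrain.config", "trained_model", 500, 100
)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to launch training
```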

### Step 2.  Now try training locally - but your training goes into a Docker container

In [None]:
!/bin/bash ./sagemaker_docker_setup.sh

### Create a local SageMaker estimator

code/train_model - this entire directory goes into the Docker image


In [None]:
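# An Estimator is a SageMaker class; here you're using the TensorFlow flavor.
# The train_instance_* argument names are those of version 1 of the SageMaker Python SDK.

import sagemaker
from sagemaker.tensorflow import TensorFlow

model_dir = '/opt/ml/model'     # where the model is written inside the Docker container
train_instance_type = 'local'   # 'local' runs in Docker on this machine vs. another server
# NOTE: these hyperparameters and the base_job_name below appear to be carried over
# from the AWS tutorial; adjust them to match the arguments train.py expects.
hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}
local_estimator = TensorFlow(
    entry_point='train.py',            # script run inside the container
    source_dir='train_model',          # directory copied into the Docker image
    model_dir=model_dir,
    train_instance_type=train_instance_type,
    train_instance_count=1,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    base_job_name='tf-eager-scriptmode-bostonhousing',
    framework_version='1.13',
    py_version='py3',
    script_mode=True,
)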
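
In script mode, as far as I understand it, the entries in `hyperparameters` (and `model_dir`) reach the entry-point script as command-line arguments such as `--epochs 5`. A minimal sketch of how a training script could read them with `argparse` follows. The default values are assumptions, and this is not the project's actual train.py, which takes different flags.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed to the script as --name value pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--model_dir", type=str, default="/opt/ml/model")
    # Ignore any extra arguments instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments the estimator above would pass.
args = parse_args([
    "--epochs", "5",
    "--batch_size", "128",
    "--learning_rate", "0.01",
    "--model_dir", "/opt/ml/model",
])
print(args.epochs, args.batch_size, args.learning_rate, args.model_dir)
```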