# Train Model
#### tensorflow_p36 environment

ref:  https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-eager-script-mode/tf-eager-sm-scriptmode.ipynb

Note:  AWS tutorials tend to name the data used for evaluation after training 'test'.   Most books call it 'val' (validation) or 'eval' (model evaluation).   I named it 'val', so wherever the AWS example says 'test', this notebook says 'val'.

## Step 3 - SageMaker HOSTED Training
At this point, you know you have a working training script (train.py), so you can have SageMaker run it on hosted (not local) resources.

### Output
After training in the HOSTED SageMaker environment, the model is pushed to S3.  This notebook pulls that newly trained model checkpoint to this (SageMaker) computer.   This notebook will then convert that checkpoint to a tflite model. 

In [None]:
import os
import time
import sagemaker
from sagemaker.tensorflow import TensorFlow

In [None]:
PROJECT_DIR = os.getcwd()

## Data
SageMaker will pull the data from S3.    This is much faster than putting it in your Docker image.   However, it is somewhat confusing because the MobileNet software (and utilities) look for the data path in the config file.  We need to merge the two approaches:
- allow SageMaker to pull from S3
- AND, we want to continue leveraging the config design pattern

The other challenge is working with tarballs versus tfrecord files.
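
One possible way to merge the two approaches, as a sketch only: in script mode, SageMaker exposes each input channel to the training script through an environment variable named `SM_CHANNEL_<NAME>` (so `SM_CHANNEL_TRAIN` and `SM_CHANNEL_VAL` for the `train` and `val` channels used in this notebook). The training script could read those and substitute them into the pipeline config text before training starts. The `{TRAIN_DIR}` / `{VAL_DIR}` placeholders, the function name, and the local fallback paths are hypothetical and are not part of the original project.

```python
import os


def resolve_data_paths(config_text, environ=None):
    """Replace hypothetical placeholders in a pipeline config with channel dirs."""
    environ = os.environ if environ is None else environ
    # SageMaker script mode sets SM_CHANNEL_<NAME> for each key in `inputs`.
    # The fallbacks are assumed local paths for runs outside SageMaker.
    train_dir = environ.get("SM_CHANNEL_TRAIN", "./code/tfrecords/train")
    val_dir = environ.get("SM_CHANNEL_VAL", "./code/tfrecords/val")
    return (config_text
            .replace("{TRAIN_DIR}", train_dir)
            .replace("{VAL_DIR}", val_dir))
```

With this shape the config file stays the single place where paths are declared, and only the directory part changes between a local run and a hosted run.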

In [None]:
s3_prefix = 'cfaanalyticsresearch-sagemaker'

traindata_s3_prefix = '{}/datasets/cfa_products/train'.format(s3_prefix)
valdata_s3_prefix = '{}/datasets/cfa_products/val'.format(s3_prefix)
print(traindata_s3_prefix)
print(valdata_s3_prefix)

## TIP
You would be wise to test that your path is good before continuing!!  
Cut/paste the printed values into the following form.   You can run these AWS CLI commands in a new cell.  

! aws s3 ls s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/train/  
! aws s3 ls s3://cfaanalyticsresearch-sagemaker/datasets/cfa_products/val/

### Copy data from local (SageMaker instance) to S3
If you ran the TrainModel_Step1 notebook, the data was moved to:
- code/tfrecords/train 
- code/tfrecords/val

In [None]:
! pwd
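# train_s3 == a full s3 URL, note that it is a folder, not a file
# no bucket argument is given, so upload_data uses the session's default bucket
#     and treats key_prefix as the key prefix inside that bucket;
#     the print(inputs) below shows the exact URLs
# this operation may take a few seconds (depending on data size) - it is silently copying
#     data from local drive on SageMaker to s3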
train_s3 = sagemaker.Session().upload_data(path='./code/tfrecords/train/', key_prefix=traindata_s3_prefix)
val_s3 = sagemaker.Session().upload_data(path='./code/tfrecords/val/', key_prefix=valdata_s3_prefix)

inputs = {'train':train_s3, 'val': val_s3}

print(inputs)

In [None]:
model_dir = '/opt/ml/model'     # this is related to how it gets deployed in the Docker
                                # this is a SAGEMAKER thing - don't confuse with the model_dir 
                                # that we have inside our code
# p2.xlarge == $1/hr
# p3.2xlarge = $3/hr
# this is a very controlled train & quick so the better server makes sense
# if you are developing - use the p2
train_instance_type = 'ml.p3.2xlarge'   
# TODO
#  o  try a different config for p3.2xlarge that has more images in the batch size to 
#     take advantage of the GPU memory
#  o  still have to figure out the data
#     - including the data under code/ directory means it is in the tarball
#     - but that means the inputs is a wasted step (and it takes longer to create the Docker image)
hyperparameters = {'pipeline_config_path' : 'sagemaker_mobilenet_v1_ssd_retrain.config',
                   'num_train_steps' : '5000',
                   'num_eval_steps' : '1000'
                  }

# SageMaker Execution Role
role = sagemaker.get_execution_role()
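
In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a command-line argument (for example `--num_train_steps 5000`). The real `train.py` is not shown in this notebook, so the following is only a sketch of how such a script might read them; the function name and the default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters this notebook sends to the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pipeline_config_path", type=str, required=True)
    parser.add_argument("--num_train_steps", type=int, default=5000)
    parser.add_argument("--num_eval_steps", type=int, default=1000)
    # Assumed default; matches the model_dir value set in this notebook.
    parser.add_argument("--model_dir", type=str, default="/opt/ml/model")
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Note that the values arrive as strings on the command line, which is why the notebook can pass `'5000'` and the script can still treat it as an integer.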

In [None]:
estimator = TensorFlow(entry_point='train.py',
                       source_dir='code',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       base_job_name='cfa-products-mobilenet-v1-SSD',
                       framework_version='1.13',
                       py_version='py3',
                       script_mode=True)

In [None]:
start_time = time.time()
print ("start time: {:.4f}".format(start_time))

# this will create the training job
# you can also see this on the SageMaker Training Job console
# 3 minute overhead prepping the servers, downloading data
# 5000 steps on a p3.2xlarge == 18 min (training time)
# overall ~ 20 minutes
estimator.fit(inputs)

# show the time
finish_time = time.time()
minutes = (finish_time - start_time) / 60
print("time spent: {:.4f}".format(finish_time - start_time))
print("in minutes: {:.4f}".format(minutes))

## Retrieving the Trained Model
SageMaker created a Docker job to train our model and sent it off to external resources (external meaning not this computer).   Now we need to fetch the result, because it is not on this computer.

- ./trained_model:  this local directory is (should be) empty before the copy
- the trained model is on s3; the next cell copies the result to ./trained_model
- after the copy you'll see the tarball (model.tar.gz) there

./trained_model is NOT under the code/ directory, primarily because there is no reason to include it inside the Docker training job (in the event you re-run).  That would just carry extra baggage around for no reason.

In [None]:
# current directory is still the top project directory - NOT the code directory
!aws s3 cp {estimator.model_data} ./trained_model/model.tar.gz

### model.ckpt-XXXX
Make note of the checkpoint files.   For example, if you asked for 5000 steps, there should be a set of checkpoint files named:  
model.ckpt-5000*

This is the checkpoint that will be converted to:
- frozen graph
- tflite model

In [None]:
!tar -xvzf ./trained_model/model.tar.gz -C ./trained_model
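
After extraction, it can help to confirm which checkpoint step is actually present before running the conversion, and to compare it with `hyperparameters['num_train_steps']`. The helper below is a sketch: its name is hypothetical, and it assumes the usual TensorFlow 1.x naming where each checkpoint has a `model.ckpt-<step>.index` file.

```python
import os
import re


def latest_checkpoint_step(directory):
    """Return the highest N among model.ckpt-N.index files, or None if absent."""
    pattern = re.compile(r"^model\.ckpt-(\d+)\.index$")
    steps = []
    # Walk subdirectories too, since the tarball layout is not assumed here.
    for _root, _dirs, files in os.walk(directory):
        for name in files:
            match = pattern.match(name)
            if match:
                steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

For example, `latest_checkpoint_step('./trained_model')` could be compared against `int(hyperparameters['num_train_steps'])` before passing `--checkpoint_num` to the conversion script.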

## Convert (Trained) Model Checkpoint to a tflite Model

WARNING: labels.txt is not included.  We don't think it is needed, though it was in the Coral project.  We think the label names come from the label map (*.pbtxt) file.

ALSO NOTE:  the script is named convert_checkpoint_to_edgetpu_tflite.sh  
Well... the name is no longer totally accurate
- I took this from the original Coral TPU tutorial
- Another step is required for compiling for the EdgeTPU (not really relevant here since we are confined to AWS where there is no TPU -- so we skip that stuff)
- And, I added a step that converts the checkpoint to a TENSORFLOW frozen graph 
  - note that this generates a frozen_inference_graph.pb
  - it ALSO generates a saved model graph.pb  
  THESE (frozen graph & saved model) are NOT the SAME!!  
  https://stackoverflow.com/questions/46547319/error-when-parsing-graph-def-from-string

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md

In [None]:
# WAKE UP - make sure the checkpoint num == hyperparameters/num_train_steps
# convert checkpoint is a task script - located in the tasks/ directory
os.chdir("tasks")  
! ./convert_checkpoint_to_edgetpu_tflite.sh --checkpoint_num 5000 --pipeline_config sagemaker_mobilenet_v1_ssd_retrain.config


In [None]:
# set directory back to the project directory
# see the tflite model artifacts
os.chdir(PROJECT_DIR)
! ls tflite_model -l
! ls tensorflow_model -l

## Add the converted model artifacts to S3 (SageMaker Training Job Data)
Add these artifacts to the SageMaker S3 folder that holds all of the training job artifacts, so everything is together:
- frozen model graph
- tflite model

In [None]:
# estimator.model_data == the s3 url & file for the model output tarball
# we need the s3 url only; so you have to extract it; and, you need a / at the end
s3_model_artifacts = os.path.dirname(estimator.model_data) + '/'

# now you can copy these converted files up to s3
!aws s3 cp  ./tensorflow_model {s3_model_artifacts}tensorflow_model --recursive
!aws s3 cp  ./tflite_model {s3_model_artifacts}tflite --recursive
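
To make the prefix extraction above concrete, here is a small sketch that runs without AWS. The URL is a made-up example, not a real job output, and the function name is hypothetical. It uses `posixpath` rather than `os.path` so the forward-slash handling does not depend on the operating system; on the SageMaker (Linux) notebook the two behave the same.

```python
import posixpath


def artifact_prefix(model_data_url):
    """Strip the tarball name from an S3 URL and keep a trailing slash."""
    return posixpath.dirname(model_data_url) + "/"


# Hypothetical URL, for illustration only.
example = "s3://example-bucket/example-job/output/model.tar.gz"
print(artifact_prefix(example))
```

The trailing slash matters because the copy commands append `tensorflow_model` and `tflite` directly to the prefix.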

# Deploy
## Optional
This is not required.  Your model artifacts are completely usable as they are:
- tensorflow frozen graph
- tensorflow Lite frozen graph
- tensorflow Lite model file

Note that TensorFlow is quite different from TensorFlow Lite (which in turn is quite different from the EdgeTPU model).  A TensorRT variant is missing, by the way.   You can test these models locally or on a different machine.   You don't have to deploy (but if you don't, you'll lose the estimator object).

At this moment, you have a SageMaker Estimator, which holds a bunch of information about the trained model.   At this time, you can't recreate an Estimator from a file (i.e. restore/create from file).   So if you're going to deploy it, now is the time!

### $$
When you deploy, you are paying for the endpoint server!! For example, deploy your model to a p2.xlarge (which costs $1.26/hr) and you pay that rate whether you use the endpoint or not!
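
To put a number on that, here is a back-of-the-envelope sketch using the $1.26/hr figure quoted above. The function name is hypothetical and the rate is only the one mentioned in this notebook; check current pricing for your region before relying on it.

```python
def endpoint_cost(hourly_rate_usd, hours, instance_count=1):
    """Estimated cost of keeping an endpoint up, rounded to cents."""
    return round(hourly_rate_usd * hours * instance_count, 2)


# Rate taken from the text above; actual pricing may differ.
print(endpoint_cost(1.26, 24))        # left running for one day
print(endpoint_cost(1.26, 24 * 30))   # left running for thirty days
```

An endpoint forgotten over a weekend costs more than the whole training run on the p3.2xlarge.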

### DON'T LEAVE YOUR ENDPOINT RUNNING

In [None]:
# note, 
#  - I'm using a p2.2xlarge for SageMaker server
#  - We trained on a p3.2xlarge (sent the training job)
#  - now we can deploy to yet a 3rd machine - in this case, I'm selecting a p2 because it's the cheapest GPU

# THIS WILL TAKE A FEW MINUTES - Dockerizing (and it's probably carrying your data around if you put it in the code directory)
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.p2.xlarge')

## Testing the Endpoint
Jump to the notebook:  DetectModel_Step2


### Delete the Endpoint

In [None]:
sagemaker.Session().delete_endpoint(predictor.endpoint)
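
If you ever test the endpoint from within a single notebook session, one way to make sure the delete always happens is to wrap the work in a context manager. This is a sketch, not part of the original workflow: `temporary_endpoint` is a hypothetical helper, and the delete function is passed in so the pattern can be tried without AWS.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor, delete_fn):
    """Yield the predictor, then always call delete_fn on its endpoint name."""
    try:
        yield predictor
    finally:
        # Runs even if the body raises, so the endpoint is not left running.
        delete_fn(predictor.endpoint)
```

In this notebook it might be used as `with temporary_endpoint(predictor, sagemaker.Session().delete_endpoint) as p: ...`, with the test calls placed inside the `with` block.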