# Train Model
#### tensorflow_p36 environment

## Prerequisites (done by Step 1)
- Global Constants: not used or needed by this notebook
- Data:
  - data was pulled from S3 and processed in Step 1
  - TODO: decide whether to pull from S3 inside the Docker container. Known issue: multiple tarballs
- Code: Step 1 builds the code (including pulling tensorflow/models and compiling the protobufs)

## Step 2 - Test your train.py with Docker

USE SAGEMAKER. Don't get confused: running jobs on the local SageMaker server isn't really what SageMaker was designed for. It is designed to take your program and send it to outside resources (using a Docker container).

In [None]:
import os
import tensorflow as tf

## Docker
Set up Docker. If this instance has a GPU, the setup script configures Docker accordingly.

In [None]:
!/bin/bash ./sagemaker_docker_setup.sh

### create a local SageMaker estimator

code/train_model - this entire directory goes to the Docker image

In [None]:
# Estimator is a SageMaker class;
# here you're using the TensorFlow flavor

import sagemaker
from sagemaker.tensorflow import TensorFlow

In [None]:
model_dir = '/opt/ml/model'     # this is related to how it gets deployed in the Docker
                                # this is a SAGEMAKER thing - don't confuse with the model_dir 
                                # that we have inside our code
    
train_instance_type = 'local'   # local vs another server

# these typical parameters are in the config file
#hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}

hyperparameters = {'pipeline_config_path' : 'sagemaker_mobilenet_v1_ssd_retrain.config',
                   'num_train_steps' : '502',
                   'num_eval_steps' : '10'
                  }

# SageMaker Execution Role
role = sagemaker.get_execution_role()

In [None]:
# 'test' changed to 'val'
inputs = {'train': 'file://code/tfrecords/train', 'val': 'file://code/tfrecords/val'}

In [None]:
local_estimator = TensorFlow(entry_point='train.py',
                       source_dir='code',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       base_job_name='cfa_products_mobilenet_v1_SSD',
                       framework_version='1.13',
                       py_version='py3',
                       script_mode=True)
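
In script mode, the entries of the hyperparameters dictionary reach train.py as command-line arguments of the form --key value. The real train.py is not shown in this notebook, so the following is only a minimal sketch of how it might read them. The function name parse_args is hypothetical, and the --model_dir argument with its SM_MODEL_DIR fallback is an assumption about what the container supplies.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed on the command line as --key value pairs."""
    parser = argparse.ArgumentParser()
    # These names mirror the keys of the hyperparameters dict in this notebook.
    parser.add_argument('--pipeline_config_path', type=str, required=True)
    # Values arrive as strings; argparse converts them to int here.
    parser.add_argument('--num_train_steps', type=int, default=None)
    parser.add_argument('--num_eval_steps', type=int, default=None)
    # Assumption: the container passes --model_dir and/or sets SM_MODEL_DIR.
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # parse_known_args tolerates extra arguments we did not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    # Simulate the arguments that the hyperparameters dict would produce.
    demo = parse_args(['--pipeline_config_path',
                       'sagemaker_mobilenet_v1_ssd_retrain.config',
                       '--num_train_steps', '502',
                       '--num_eval_steps', '10'])
    print(demo)
```

Using parse_known_args instead of parse_args keeps the script from exiting if the container adds arguments that the script does not declare.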

### Debugging
Debugging once the model is in a container can be difficult because there are so many nested operations. The error message will be in there somewhere, but it's tough to find.

Write your train.py with plenty of environment logging and verification to make this easier. Display the environment variables and verify that everything exists. Print results to the console and they will show up in the log. If the model trained locally and fails in this configuration, the cause is probably the setup.
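
As a rough illustration of that kind of logging, the sketch below prints every SM_* environment variable and checks that a few expected paths exist. The helper name log_environment is hypothetical. The variable names SM_MODEL_DIR, SM_CHANNEL_TRAIN and SM_CHANNEL_VAL are assumptions based on the 'train' and 'val' channels used in this notebook, so adjust them to whatever your container actually sets.

```python
import os


def log_environment(required_vars=('SM_MODEL_DIR',
                                   'SM_CHANNEL_TRAIN',
                                   'SM_CHANNEL_VAL')):
    """Print SageMaker-related environment variables and verify their paths.

    Returns a list of problem descriptions (empty if everything checks out).
    """
    # Show every SM_* variable so it ends up in the training log.
    for name in sorted(os.environ):
        if name.startswith('SM_'):
            print('{}={}'.format(name, os.environ[name]))

    problems = []
    for name in required_vars:
        path = os.environ.get(name)
        if path is None:
            problems.append('{} is not set'.format(name))
        elif not os.path.exists(path):
            problems.append('{} points to missing path: {}'.format(name, path))
        elif os.path.isdir(path):
            # List a few entries so you can confirm the data really arrived.
            print('{} -> {}: {}'.format(name, path, sorted(os.listdir(path))[:10]))

    for problem in problems:
        print('PROBLEM:', problem)
    return problems
```

Calling log_environment() at the top of train.py, before any TensorFlow work starts, puts the setup information near the beginning of the log where it is easy to find.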

You can open a terminal tab and run top to see that the job is running. You can also monitor the notebook instance from the console.
- Training jobs: scroll down to Monitor
  - you'll see resources
  - you can also go to instance monitoring, which takes you to CloudWatch; hit GraphSearch

It's CPU bound: no GPU is being used, so it's pretty slow. You'll appreciate the GPU next time.
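
One possible reason the GPU went unused, offered as an assumption rather than a confirmed diagnosis: the SageMaker Python SDK distinguishes the instance type 'local' (CPU container) from 'local_gpu' (GPU container), and this notebook uses 'local'. The sketch below picks between the two by checking whether nvidia-smi is on the PATH. The helper name pick_local_instance_type is hypothetical, and finding nvidia-smi is only a rough hint that a usable GPU and driver are present.

```python
import shutil


def pick_local_instance_type():
    """Return 'local_gpu' if nvidia-smi is found on PATH, otherwise 'local'."""
    # shutil.which returns None when the executable is not found.
    return 'local_gpu' if shutil.which('nvidia-smi') else 'local'


train_instance_type = pick_local_instance_type()
print(train_instance_type)
```

The result could then be passed as train_instance_type when constructing the estimator. Whether the GPU is actually used still depends on the Docker setup and the container image.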

In [None]:
# this took nearly 90 minutes - running locally 
# - you'll see in the log it did NOT use a GPU (not sure why because there was one on the server)
local_estimator.fit(inputs)