<div class="alert alert-block alert-info">
    
### Try out the XGBoost algo with different hyperparameters

#### 0. Set up the imports and bucket reference 
#### 1. Get the image URI for the XGBoost algo
#### 2. Set up the Input and Output data locations
#### 3. Execute a training job
#### 4. Manually tune the model
    
</div>

<code style="background:yellow;color:black">You MUST change the S3 bucket name in the next cell</code>

In [18]:
import boto3
import sagemaker
from sagemaker import get_execution_role

# CHANGE the bucket name
bucket = "rajeev-6510"

# Folder (key prefix) from which the data is read and under which the model is written
prefix = "sagemaker/churn-analysis"


# Get the role - ignore the warning
role = get_execution_role()

<div class="alert alert-block alert-success">
    
### 1. Get the SageMaker container implementation for the algo
* SageMaker implements the ML algos in Docker containers (images)
* The images are maintained in *Elastic Container Registry (ECR)*
* The API *get_image_uri* returns a reference to the container image for the specified algo

</div>

In [19]:
from sagemaker.amazon.amazon_estimator import get_image_uri

# Latest version may be different across regions
# repo_version = '0.90-2'  # Ohio (us-east-2)
repo_version = '1.0-1'     # Northern Virginia (us-east-1)

# Ignore the warning
container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version=repo_version)

<div class="alert alert-block alert-success">

### 2. Set up the Input and Output data locations
    
##### Input
* Ensure that the bucket and prefix are correct
* Notice the content_type='csv'; the data may be specified in other formats, depending on the algo
* Notice that we have specified the folder, NOT the file, as the data may be spread across multiple files

##### Output
* Training job writes the model artefacts to S3
* A new S3 folder is created with the name of the training job
    
</div>

In [20]:
# These will be passed to the algo as its input channels
# Each location is a folder, so every path ends with a trailing slash
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')
s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/test/'.format(bucket, prefix), content_type='csv')

# we will write all outputs under the folder output
s3_output_model='s3://{}/{}/output'.format(bucket, prefix)
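
The channel locations and the output location all follow the same `s3://bucket/prefix/folder` pattern. The following is a minimal sketch of a helper that builds them consistently. `s3_uri`, `demo_bucket`, `demo_prefix`, `channel_uris` and `output_uri` are hypothetical names used only for illustration; they are not part of the SageMaker SDK.

```python
def s3_uri(bucket, *parts, folder=True):
    """Join a bucket and key parts into an s3:// URI (hypothetical helper)."""
    # Strip stray slashes so 'a/', '/b' and 'c' all join cleanly
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    uri = "s3://" + "/".join([bucket.strip("/")] + cleaned)
    # A trailing slash makes it explicit that the URI names a folder
    return uri + "/" if folder else uri


demo_bucket = "rajeev-6510"               # CHANGE to your bucket name
demo_prefix = "sagemaker/churn-analysis"

# One URI per data channel, plus the output location
channel_uris = {name: s3_uri(demo_bucket, demo_prefix, name)
                for name in ("train", "validation", "test")}
output_uri = s3_uri(demo_bucket, demo_prefix, "output", folder=False)

for name, uri in channel_uris.items():
    print(name, uri)
print("output", output_uri)
```

Building every location through one function avoids small inconsistencies, such as a missing trailing slash on one of the channels.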

<div class="alert alert-block alert-success">

### 3. Execute a training job
    
* Provide a job name - if not provided, a default name is created: *[Algo Name]-[Timestamp]*
* The job name MUST be unique every time you execute a training job; 
    <code style="background:yellow;color:black">otherwise the training job will fail to launch!!!</code>
* You may check out the training jobs using:
    * Console
    * API
    
   
##### Estimator parameters
* container - image for the algo
* role - determines the permissions that the training job will have; fix this for permission errors
* train_instance_type and train_instance_count - depend on the job; a smaller machine means a longer runtime or even a failed job
    
##### Starting the training job

* The algo.<code style="background:yellow;color:black">fit()</code> method starts the execution of the job
* You MUST provide the data locations as channels:
    * a channel refers to the type of data; train channel = training data, validation channel = validation data
    * the available channels depend on the algo
    
</div>
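
The uniqueness requirement above is usually met by appending a timestamp to a base name. The following is a minimal sketch of that idea, not the SDK's own implementation. `unique_job_name` and `JOB_NAME_PATTERN` are hypothetical names, and the 63-character limit and the allowed-character pattern are stated here as assumptions about SageMaker's naming rules.

```python
import re
import time


def unique_job_name(base, now=None):
    """Append a UTC timestamp to `base` (hypothetical helper)."""
    # `now` is seconds since the epoch; None means the current time
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    # Assumed limit: job names may be at most 63 characters,
    # so trim the base to leave room for '-' plus the timestamp
    max_base = 63 - len(stamp) - 1
    return "{}-{}".format(base[:max_base].rstrip("-"), stamp)


# Assumed rule: letters, digits and hyphens, starting and ending
# with a letter or digit
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")

name = unique_job_name("xgboost-my-training-job-101")
print(name, bool(JOB_NAME_PATTERN.match(name)))
```

A name built this way could be passed explicitly with `fit(..., job_name=name)`. When only `base_job_name` is given, as in this notebook, the SDK appends a timestamp for you.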

In [21]:
# The training job name - change the number to 102, 103 ... a timestamp is appended to it automatically
job_name = "xgboost-my-training-job-101"

# Session is used for connecting to the Sagemaker service
sess = sagemaker.Session()


# Instance of the algo
# You may even use spot instances, which will translate into savings of up to 80%
#        train_use_spot_instances=True
# You can limit runaway jobs (both values are in seconds)
#        train_max_run=300
#        train_max_wait=600
# SageMaker Debugger can help you inspect what the algo is doing during training
#        debugger_hook_config=DebuggerHookConfig(...)
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.2xlarge',
                                    output_path=s3_output_model,
                                    sagemaker_session=sess,
                                    base_job_name=job_name)

# Setup the hyperparameters; objective is to predict Yes | No so 'binary:logistic'
objective = 'binary:logistic'

# Tuning involves adjusting the values of the Hyperparameters
#     The parameter eval_metric sets the metric used to evaluate the job.
#     By default it is the validation:error
#     You may try eval_metric='auc' to use 'Area Under the Curve'
# ADJUST the parameters to tune - DO NOT CHANGE THE 'objective'
xgb.set_hyperparameters(objective=objective,
                        max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=3,
                        subsample=0.8,
                        silent=0,
                        num_round=100)

# Start the training job - provide the training and validation data locations
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2020-05-24 14:37:59 Starting - Starting the training job...
2020-05-24 14:38:01 Starting - Launching requested ML instances......
2020-05-24 14:39:14 Starting - Preparing the instances for training...
2020-05-24 14:39:56 Downloading - Downloading input data...
2020-05-24 14:40:14 Training - Downloading the training image...
2020-05-24 14:40:55 Uploading - Uploading generated training model.
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[14:40:51] 

<div class="alert alert-block alert-info">

### 1. Check out the logs generated

* The idea is to MINIMIZE the train-error and the validation-error
* The algorithm iterates through the boosting rounds to drive the error toward its minimum
* Check the error values in the last 10 to 15 log lines - are they still changing? If NOT, we are getting there :)
* The training job's cost is calculated from the number of seconds it runs - how much time did your job take?
    
</div>
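
The "is it still changing?" check can also be done in code. The following is a minimal sketch, assuming the log lines contain entries of the form `train-error:<value>` and `validation-error:<value>`. The function names, the window of 10 rounds, the tolerance, the sample log lines and the hourly rate are all illustrative assumptions, not values from a real run or from AWS pricing.

```python
import re

# Matches entries such as 'validation-error:0.0661' in a log line
METRIC_RE = re.compile(
    r"(train|validation)-error:(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_errors(log_lines):
    """Return {'train': [...], 'validation': [...]} in log order."""
    errors = {"train": [], "validation": []}
    for line in log_lines:
        for name, value in METRIC_RE.findall(line):
            errors[name].append(float(value))
    return errors


def has_plateaued(values, window=10, tolerance=1e-3):
    """True if the last `window` values vary by at most `tolerance`."""
    if len(values) < window:
        return False
    tail = values[-window:]
    return max(tail) - min(tail) <= tolerance


def estimated_cost(billable_seconds, hourly_rate, instance_count=1):
    """Per-second billing: seconds x instances x (hourly rate / 3600)."""
    return billable_seconds * instance_count * hourly_rate / 3600.0


# Made-up log lines for illustration only (not from a real run)
demo_errors = [(0.10, 0.12), (0.08, 0.10), (0.06, 0.09)] + [(0.05, 0.08)] * 10
demo_log = ["[{}]\ttrain-error:{}\tvalidation-error:{}".format(i, t, v)
            for i, (t, v) in enumerate(demo_errors)]

errors = parse_errors(demo_log)
print("validation plateaued:", has_plateaued(errors["validation"]))
# The hourly rate below is a placeholder, not an actual price
print("cost for 180 s:", estimated_cost(180, hourly_rate=0.5))
```

If the validation error has plateaued well before the last round, a smaller `num_round` would likely give a similar model for less training time.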

<div class="alert alert-block alert-info">

### 2. Check out the Model artefacts generated in the S3 bucket

* Use the console 
* You may download the artefacts and look into the contents if you are interested (optional)
    * Using the console download the file from
        * s3://Your-Bucket-Name/sagemaker/churn-analysis/output/[Training job name]/output/model.tar.gz

    
</div>
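
Once `model.tar.gz` has been downloaded, its contents can be listed with Python's standard `tarfile` module. The following is a minimal sketch that assumes the archive sits in the current working directory. `list_model_artifacts` is a hypothetical helper name.

```python
import os
import tarfile


def list_model_artifacts(archive_path):
    """Return (name, size in bytes) for every file in the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [(member.name, member.size)
                for member in archive.getmembers() if member.isfile()]


# Only runs if the archive has been downloaded next to the notebook
if os.path.exists("model.tar.gz"):
    for name, size in list_model_artifacts("model.tar.gz"):
        print(name, size)
```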

<div class="alert alert-block alert-success">
    
### 4. Manually tuning the model
* Based on the run you may decide to adjust the Hyperparameters and run the job again
* Check out the Hyperparameter definitions below; these apply to XGBoost. Each algo has its own set of Hyperparameters
* The process of tuning is iterative, so you may end up running the job 10, 20, 30 ... times!
    
**Go ahead and run the job a couple of times <code style="background:yellow;color:black">with different Hyperparameter values</code>.**

1. Change the Hyperparameters. <code style="background:yellow;color:black">Do NOT change the objective!!!</code>
2. You may change the job name and run the cells (you may Run All Cells)
3. Check the console > Training Jobs to review the results and select the best job: the one with the lowest validation error.

    * For each run the model will be generated in S3 under a different folder.

    
### Hyperparameters for XGBoost
[Read the details](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html)
    
<dl>
    <dt>max_depth</dt>
    <dd>controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.</dd>
    <dt>subsample</dt>
    <dd>controls sampling of the training data. This technique can help reduce overfitting, but setting it too low can also starve the model of data.</dd>
    <dt>num_round</dt> 
    <dd>controls the number of boosting rounds. Each round trains a further model on the residuals of the previous rounds. Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.</dd>
    <dt>eta</dt> 
    <dd>controls how aggressive each round of boosting is (the step size shrinkage). Smaller values lead to more conservative boosting.</dd>
    <dt>gamma</dt>
    <dd>controls how aggressively trees are grown. Larger values lead to more conservative models.</dd>
    
</dl>
    
</div>
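
When trying several Hyperparameter values by hand, it helps to write the candidate settings down first so that no combination is repeated or forgotten. The following is a minimal sketch that enumerates combinations with `itertools.product`. The candidate values are illustrative choices, not recommendations, and `FIXED`, `SEARCH_SPACE` and `candidate_settings` are hypothetical names.

```python
import itertools

# Fixed for this exercise - do NOT change the objective
FIXED = {"objective": "binary:logistic", "num_round": 100}

# Candidate values to try (illustrative choices, not recommendations)
SEARCH_SPACE = {
    "max_depth": [3, 5, 7],
    "eta": [0.1, 0.2],
    "subsample": [0.7, 0.8],
}


def candidate_settings(fixed, space):
    """Yield one hyperparameter dict per combination in `space`."""
    names = sorted(space)
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(fixed)               # copy so FIXED is never mutated
        params.update(zip(names, values))
        yield params


settings = list(candidate_settings(FIXED, SEARCH_SPACE))
print(len(settings), "combinations; first:", settings[0])
```

Each dictionary could then be applied with `xgb.set_hyperparameters(**params)` before calling `fit`. Keep in mind that every combination is a separate, billed training job.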

<div class="alert alert-block alert-info">

### 3. Check out the status of the training job in the console or with the CLI

* The console lists all of the training jobs; you may use the details to compare model performance
* In the details you will find the Hyperparameters used for the training job
<br>
* You may also get the training job details using the AWS CLI | API
    * Open the terminal and run the command
    * **aws sagemaker list-training-jobs | grep TrainingJobName**
    * **aws sagemaker describe-training-job --training-job-name [[Copy and paste the full job name from the previous command]]**
        * Each job has the metrics that determine how good it is
        * ```
          "FinalMetricDataList": [
              {
                  "MetricName": "train:error",
                  "Value": 0.028289999812841415,
                  "Timestamp": 1589925.549
              },
              {
                  "MetricName": "validation:error",
                  "Value": 0.06606999784708023,
                  "Timestamp": 1589925.549
              }
          ]
          ```
        * Compare the **validation:error** for each job and choose the BEST one (lowest error)
    * <code style="background:yellow;color:black">Copy the job name for the BEST model to a temporary file</code> - Notepad on Windows, TextEdit on Mac - you will need it later :)
</div>
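
The comparison of **validation:error** across jobs can be sketched in code as well. The example below works on plain dictionaries shaped like the `describe-training-job` output shown above. `validation_error` and `best_job` are hypothetical helper names. The job names and the second error value in the demo data are made up for illustration.

```python
def validation_error(description):
    """Pull validation:error out of a describe-training-job style dict."""
    for metric in description.get("FinalMetricDataList", []):
        if metric["MetricName"] == "validation:error":
            return metric["Value"]
    return None  # job has no final validation metric (e.g. it failed)


def best_job(descriptions):
    """Return (job name, error) for the lowest validation:error, or None."""
    scored = [(validation_error(d), d["TrainingJobName"]) for d in descriptions]
    scored = [(err, name) for err, name in scored if err is not None]
    if not scored:
        return None
    err, name = min(scored)
    return name, err


# Demo data: job names and the 0.071 value are made up for illustration
demo_jobs = [
    {"TrainingJobName": "job-a", "FinalMetricDataList": [
        {"MetricName": "train:error", "Value": 0.028289999812841415},
        {"MetricName": "validation:error", "Value": 0.06606999784708023}]},
    {"TrainingJobName": "job-b", "FinalMetricDataList": [
        {"MetricName": "validation:error", "Value": 0.071}]},
    {"TrainingJobName": "job-c"},  # no metrics recorded
]

print(best_job(demo_jobs))
```

In a real session the dictionaries could come from `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`, called once per job name.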

In [24]:
# !aws sagemaker list-training-jobs | grep TrainingJobName