<div class="alert alert-block alert-info">
    
### Try out the XGBoost algo with different hyperparameters

#### 0. Set up the imports and bucket reference
#### 1. Get the image URI for the XGBoost algo
#### 2. Set up the Input and Output data locations
#### 3. Execute a training job
#### 4. Manually tuning the model
    
</div>

In [22]:
import boto3
import sagemaker
from sagemaker import get_execution_role

# CHANGE the bucket name
bucket = "awsrajeev"

# This will be a folder created under your bucket 
prefix = "sagemaker/churn-analysis"


# Get the role - ignore the warning
role = get_execution_role()



<div class="alert alert-block alert-success">
    
### 1. Get the SageMaker container implementation for the algo
* SageMaker implements the ML algos in Docker containers (images)
* The images are maintained in *Elastic Container Registry (ECR)*
* The API *get_image_uri* returns a reference (URI) to the specified algo container image

</div>

In [23]:
from sagemaker.amazon.amazon_estimator import get_image_uri

# Ignore the warning
container = get_image_uri(boto3.Session().region_name, 'xgboost',repo_version='0.90-2')



<div class="alert alert-block alert-success">

### 2. Set up the Input and Output data locations
    
##### Input
* Ensure that the bucket and prefix are correct
* Notice that content_type='csv'; the data may be specified in other formats, depending on the algo
* Notice that we have specified the folder, NOT the file, as the data may be spread across multiple files

##### Output
* Training job writes the model artefacts to S3
* A new S3 folder is created with the name of the training job
    
</div>

In [24]:
# These will be passed as the parameters to the algo
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')
s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/test/'.format(bucket, prefix), content_type='csv')

# we will write all outputs under the folder output
s3_output_model='s3://{}/{}/output'.format(bucket, prefix)

<div class="alert alert-block alert-success">

### 3. Execute a training job
    
* Provide a job name; if none is provided, a default name of the form *[Algo Name]-[Timestamp]* is created
* The job name MUST be unique every time you execute a training job 
    <code style="background:yellow;color:black">otherwise the training job will fail to launch!</code>
* You may check out the training jobs using:
    * Console
    * API
    
   
##### Estimator parameters
* container - image for the algo
* role - determines the permissions that the training job will have; fix this for permission errors
* train_instance_count, train_instance_type - depend on the job; a smaller machine means a longer runtime or even a failed job
    
##### Starting the training job

* The algo.<code style="background:yellow;color:black">fit()</code> method starts the execution of the job
* MUST provide the data locations; 
    * channel refers to the type of data; train channel = training data, validation channel = validation data
    * channels depend on the algo
    
</div>
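The uniqueness requirement above can be illustrated with a small sketch. `unique_job_name` is a hypothetical helper, not part of the SageMaker SDK, and the naming rule it checks (letters, digits and hyphens, at most 63 characters) is an assumption about the service's constraints.

```python
import re
import time

# Assumption: training job names may contain only letters, digits and hyphens,
# and are limited to 63 characters.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LEN = 63


def unique_job_name(base_name, now=None):
    """Return base_name with a UTC timestamp suffix (hypothetical helper)."""
    now = time.gmtime() if now is None else now
    suffix = time.strftime("%Y-%m-%d-%H-%M-%S", now)
    # Trim the base so that base + "-" + suffix stays within the length limit
    max_base_len = MAX_JOB_NAME_LEN - len(suffix) - 1
    name = "{}-{}".format(base_name[:max_base_len].rstrip("-"), suffix)
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("Invalid training job name: {}".format(name))
    return name


print(unique_job_name("xgboost-my-training-job-101"))
```

The suffix has one-second resolution, so two calls within the same second return the same name. In this notebook the name is passed as `base_job_name`, so the SDK appends its own timestamp; a helper like this is only useful if you pass an explicit `job_name` to `fit()` yourself.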

In [25]:
# Base name for the training job - change the number to 102, 103, ...; a timestamp is appended to it automatically
job_name = "xgboost-my-training-job-101"

# Session is used for connecting to the Sagemaker service
sess = sagemaker.Session()


# Instance of the algo
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=s3_output_model,
                                    sagemaker_session=sess,
                                    base_job_name=job_name)

# Setup the hyperparameters; objective is to predict Yes | No so 'binary:logistic'
objective = 'binary:logistic'

# Tuning involves adjusting the values of Hyperparameters
xgb.set_hyperparameters(objective=objective,
                        max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        num_round=100)

# Start the training job - provide the training and validation data locations
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2020-05-14 10:42:07 Starting - Starting the training job...
2020-05-14 10:42:09 Starting - Launching requested ML instances......
2020-05-14 10:43:18 Starting - Preparing the instances for training......
2020-05-14 10:44:31 Downloading - Downloading input data
2020-05-14 10:44:31 Training - Downloading the training image...
2020-05-14 10:45:08 Uploading - Uploading generated training model
2020-05-14 10:45:08 Completed - Training job completed
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determ

<div class="alert alert-block alert-info">

### 1. Check out the logs generated

* The idea is to MINIMIZE the train-error | validation-error
* The algorithm iterates through the boosting rounds to drive the error towards its minimum
* Check out the error values in the last 10 to 15 log lines - are they still changing? If NO, then we are getting there :)
    
</div>
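The check described above (are the last 10 to 15 error values still changing?) can be sketched in a few lines. The log line format is an assumption, since the output above is cut off before the per-round lines appear, and `parse_errors` and `has_plateaued` are hypothetical helpers with an arbitrary tolerance.

```python
import re

# Assumption: each boosting round logs a line such as
#   [12]#011train-error:0.0514#011validation-error:0.0690
# where "#011" is how a tab character shows up in the captured logs.
ROUND_PATTERN = re.compile(
    r"\[(\d+)\](?:#011|\s)+train-error:([0-9.]+)(?:#011|\s)+validation-error:([0-9.]+)"
)


def parse_errors(log_lines):
    """Extract (round, train_error, validation_error) tuples from log lines."""
    rows = []
    for line in log_lines:
        match = ROUND_PATTERN.search(line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), float(match.group(3))))
    return rows


def has_plateaued(rows, last_n=10, tolerance=0.001):
    """True when validation-error varies by at most `tolerance` over the last rounds."""
    recent = [validation for _, _, validation in rows[-last_n:]]
    if len(recent) < last_n:
        return False
    return max(recent) - min(recent) <= tolerance


# Made-up log lines, for illustration only
sample = ["[{}]#011train-error:0.05#011validation-error:0.07".format(i) for i in range(10)]
rows = parse_errors(sample)
print(rows[-1], has_plateaued(rows))
```

If the validation-error has flattened while the train-error keeps falling, further rounds are probably fitting noise rather than improving the model.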

<div class="alert alert-block alert-info">

### 2. Check out the status of the training job in the console

* You may also get the training job details using the AWS CLI | API
* All of the training jobs are listed; you may use the details to compare model performance
* In the details you will find the Hyperparameters used for the training job
    
</div>
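As a sketch of the API route, the function below reads the status and hyperparameters of one job through the `describe_training_job` call of a boto3 SageMaker client. `summarize_training_job` is a hypothetical helper, and the client is passed in so the function can be tried without AWS access.

```python
def summarize_training_job(sm_client, job_name):
    """Return status, hyperparameters and artefact location for one training job.

    sm_client is expected to be a boto3 SageMaker client, for example
    boto3.client("sagemaker"); this helper is hypothetical, not an SDK function.
    """
    response = sm_client.describe_training_job(TrainingJobName=job_name)
    return {
        "name": response["TrainingJobName"],
        "status": response["TrainingJobStatus"],
        "hyperparameters": response.get("HyperParameters", {}),
        # .get() is used defensively in case the field is missing
        "model_artifacts": response.get("ModelArtifacts", {}).get("S3ModelArtifacts"),
    }


# Usage inside the notebook (requires AWS credentials, so it is left commented out);
# the job name is the one the estimator reports after fit():
# import boto3
# print(summarize_training_job(boto3.client("sagemaker"), xgb.latest_training_job.name))
```

Collecting these summaries for several runs gives a simple table for comparing hyperparameter settings side by side.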

<div class="alert alert-block alert-info">

### 3. Check out the Model artefacts generated in the S3 bucket

* Use the console 
* You may download the artefacts and look into the contents if you are interested
    
</div>
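To know where to look in the bucket, the following sketch builds the expected location of the model archive. It assumes the job writes `model.tar.gz` to `<output path>/<training job name>/output/`; `model_artifact_uri` is a hypothetical helper and the job name in the example is illustrative.

```python
def model_artifact_uri(output_path, training_job_name):
    """Build the S3 URI where the model archive is expected (hypothetical helper).

    Assumption: the job writes <output_path>/<training_job_name>/output/model.tar.gz.
    """
    return "{}/{}/output/model.tar.gz".format(output_path.rstrip("/"), training_job_name)


# Illustrative job name; the real one carries the timestamp appended at launch
print(model_artifact_uri(
    "s3://awsrajeev/sagemaker/churn-analysis/output",
    "xgboost-my-training-job-101-2020-05-14-10-42-07-000",
))
```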

<div class="alert alert-block alert-success">
    
### 4. Manually tuning the model
* Based on the run you may decide to adjust the Hyperparameters and run the job again
* Check out the Hyperparameter definitions below; these apply to XGBoost. Each algo has its own set of Hyperparameters
* The process of tuning is iterative, so you may end up running the job 10, 20, 30 or more times!
    
**Go ahead and run the job a couple of times <code style="background:yellow;color:black">with different Hyperparameter values</code>. For each run the model will be generated in S3 under a different folder.**

    
### Set up the hyperparameters
    
<dl>
    <dt>max_depth</dt>
    <dd>controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.</dd>
    <dt>subsample</dt>
    <dd>controls sampling of the training data. This technique can help reduce overfitting, but setting it too low can also starve the model of data.</dd>
    <dt>num_round</dt> 
    <dd>controls the number of boosting rounds. Each round trains an additional model on the residuals of the previous iterations. More rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.</dd>
    <dt>eta</dt> 
    <dd>controls how aggressive each round of boosting is (the step size, or learning rate). Smaller values lead to more conservative boosting.</dd>
    <dt>gamma</dt>
    <dd>controls how aggressively trees are grown. Larger values lead to more conservative models.</dd>
    
</dl>
    
</div>
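One way to organise the manual runs is to list the combinations up front. The sketch below keeps the values from the run above as a baseline and varies two hyperparameters. The candidate values are made up for illustration, and `hyperparameter_runs` is a hypothetical helper.

```python
from itertools import product

# Values from the run above, used as the fixed baseline
baseline = {
    "objective": "binary:logistic",
    "max_depth": 5,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "subsample": 0.8,
    "silent": 0,
    "num_round": 100,
}

# Hypothetical candidate values to try; adjust to your own data
candidates = {
    "max_depth": [3, 5, 7],
    "eta": [0.1, 0.2],
}


def hyperparameter_runs(baseline, candidates):
    """Yield one full hyperparameter dict per combination of candidate values."""
    names = sorted(candidates)
    for values in product(*(candidates[name] for name in names)):
        params = dict(baseline)  # copy, so the baseline itself is never modified
        params.update(zip(names, values))
        yield params


for run_number, params in enumerate(hyperparameter_runs(baseline, candidates), start=101):
    print(run_number, params["max_depth"], params["eta"])
    # In the notebook, each run would then be launched with, for example:
    # xgb.set_hyperparameters(**params)
    # xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})
```

Changing only one or two hyperparameters per batch of runs makes it easier to attribute a change in validation-error to a specific setting.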