# Use SageMaker XGBoost for classification or regression

This notebook shows how to use the SageMaker built-in XGBoost algorithm to solve a classification or regression problem.

## Create working folder

First, we will create a working folder called **MY_PROJECT** and copy your source code from step 2 into the working folder.

%%sh

DIRECTORY=MY_PROJECT

target_dir=$DIRECTORY
if [ ! -d "$target_dir" ]; then
    mkdir $target_dir
fi


SOURCE_DIR=MY_PROJECT

cp -a ../step-2/$SOURCE_DIR/* ./$DIRECTORY/

## Modify Job_Launcher.ipynb

Next, let's modify the job_launcher.ipynb file in the working folder to use the SageMaker XGBoost algorithm for training and hosting by following the steps below.

### 1. Modify training dataset

SageMaker XGBoost takes data in CSV or libsvm format for training and inference. For CSV format, the algorithm requires the label to be in the first column.

If your current training dataset does not have the label in the first column, you can use the following code sample to reorder the columns in your local data directory and save the file back. You can also use your own code to make this change. Note that the dataset modification code must run before the section called **Upload data to S3**, so add a new section/cell for it ahead of that section.

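```python
import pandas

train_file_name = '<name of training data file>'
local_training_data_path = "./data/train/" + train_file_name
label_column = '<name of label column>'

train_df = pandas.read_csv(local_training_data_path)

# Move the label column to the front; keep the other columns in their original order
feature_columns = [col for col in train_df.columns if col != label_column]
train_df_xg = train_df[[label_column] + feature_columns]
train_df_xg.to_csv(local_training_data_path, index=False, header=False)
```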
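As a self-contained illustration of the reordering step, the sketch below moves a label column to the front of a small in-memory DataFrame. The helper name `move_label_first` and the sample column names are hypothetical and chosen only for this example.

```python
import pandas as pd


def move_label_first(df, label_col):
    """Return a copy of df with label_col as the first column."""
    if label_col not in df.columns:
        raise KeyError(f"label column {label_col!r} not found")
    # Keep the remaining columns in their original order.
    feature_cols = [col for col in df.columns if col != label_col]
    return df[[label_col] + feature_cols].copy()


# Toy data: the label ("target") starts out as the last column.
sample_df = pd.DataFrame({"age": [25, 40], "income": [3.5, 7.1], "target": [0, 1]})
reordered_df = move_label_first(sample_df, "target")
print(list(reordered_df.columns))
```

Selecting columns by name rather than by position means the code works wherever the label column sits in the original file.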

SageMaker takes validation/test data in a separate file. If your data is in a single file, split it into a training data file and a validation data file. You can use the following code sample to split the dataset, or use your own code.

```python
split_ratio = 0.7
train_data_size = int(train_df_xg.shape[0] * split_ratio)

train_data = train_df_xg.iloc[:train_data_size, :]
validation_data = train_df_xg.iloc[train_data_size:, :]
```
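The split above takes the first 70% of rows as they appear in the file. If the rows are sorted (for example by label or by date), it may help to shuffle before splitting. The sketch below is one possible way to do that; the helper name `shuffled_split`, the fixed seed, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def shuffled_split(df, split_ratio=0.7, seed=42):
    """Shuffle the rows, then split into (train, validation) by split_ratio."""
    # sample(frac=1.0) returns all rows in random order; the seed makes it repeatable.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled.iloc[:train_size], shuffled.iloc[train_size:]


demo_df = pd.DataFrame({"label": [0, 1] * 5, "feature": range(10)})
demo_train, demo_validation = shuffled_split(demo_df, split_ratio=0.7)
print(len(demo_train), len(demo_validation))
```

For time-ordered data a chronological split like the original one is usually the safer choice, so treat the shuffle as optional.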

Finally, we want to save the modified dataset back to the data directory. SageMaker XGBoost does not accept index or header columns, so we exclude both when saving the data.


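```python
import os
validation_file_name = '<name of the validation data file>'
local_validation_data_path = "./data/validation/" + validation_file_name

# Create the validation directory if needed, then write both files without index or header
os.makedirs("./data/validation", exist_ok=True)
train_data.to_csv(local_training_data_path, index=False, header=False)
validation_data.to_csv(local_validation_data_path, index=False, header=False)
```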
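To see what a file written with `index=False, header=False` looks like, the following sketch serializes a toy DataFrame to an in-memory buffer and inspects the first row. The column names and values are made up for the example.

```python
import io

import pandas as pd

frame = pd.DataFrame({"label": [1, 0], "f1": [0.5, 1.5], "f2": [3, 4]})

# Serialize the same way as the cell above: no index column, no header row.
buffer = io.StringIO()
frame.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

# The first line is a data row whose first field is the label value.
first_row = csv_text.splitlines()[0].split(",")
print(first_row)
```

A quick check like this on the first line of the real training file can catch a stray header or index column before the data is uploaded.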

### 2. Replace training code with XGBoost Estimator

Now that the data is ready, we can use the XGBoost Estimator to kick off training. Before we start, we need to clean up some cells that are no longer needed. Remove all cells starting with the section called **Configure hyperparameters**.

After the cells are removed, follow the steps below to continue.

1. First, we need to find the container for the XGBoost algorithm. Copy and paste the following code into a new cell.


```python
# Get the container image for the XGBoost algorithm
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')
```

2. XGBoost supports libsvm (the default) and CSV format. If you use CSV format, you need to specify the content type explicitly. Add the following code to set up the data channels.

```python
# Declare both channels as CSV so XGBoost does not assume libsvm
s3_inp_train = sagemaker.session.s3_input(s3_data=s3_input_train, content_type="text/csv")
s3_inp_validation = sagemaker.session.s3_input(s3_data=s3_input_validation, content_type="text/csv")

data_channel = {"train": s3_inp_train, "validation": s3_inp_validation}
```

3. Next, we configure hyperparameters for the XGBoost algorithm and kick off the training job. Copy and paste the following code into a new cell. Change any hyperparameters and other configuration as needed. It will take a few minutes to spin up the training instances and start training. Set `<objective value>` according to the type of problem. Some common values are listed below.

    `reg:linear`: linear regression

    `reg:logistic`: logistic regression

    `binary:logistic`: logistic regression for binary classification, output probability

    You can see a complete list of hyperparameters [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).


```python
# Setting up estimator and start the training job
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=output_s3,
                                    sagemaker_session=sagemaker_session)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='<objective value>',
                        num_round=100)

xgb.fit(data_channel)
```
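If it is unclear which objective fits a dataset, a rough heuristic is to look at the distinct label values. The sketch below is only an illustration: the helper name `suggest_objective` is hypothetical, it considers only two of the objectives listed above, and it should not replace a deliberate choice based on the problem.

```python
import pandas as pd


def suggest_objective(labels):
    """Suggest an objective string from the distinct label values (heuristic)."""
    distinct = set(pd.Series(labels).dropna().unique().tolist())
    if not distinct:
        raise ValueError("no label values provided")
    # Labels limited to 0 and 1 suggest binary classification.
    if distinct <= {0, 1}:
        return "binary:logistic"
    # Anything else is treated as a regression target in this sketch.
    return "reg:linear"


print(suggest_objective([0, 1, 1, 0]))
print(suggest_objective([1.5, 2.0, 3.7]))
```

The returned string can be passed as the `objective` argument to `set_hyperparameters`.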

4. After training completes, let's host the trained model and use it for online prediction. Copy and paste the following code into a new cell. It will take a few minutes to spin up the hosting instance.

```python
# Deploy the trained model to an endpoint
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')
```

5. After the hosting environment is ready, we can make inference calls. Copy and paste the following code into a new cell and modify it as needed. Make sure the test data does not contain the label column.

```python
# Make an inference call against the newly created endpoint
import pandas
from sagemaker.predictor import csv_serializer

xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None  # return the raw response body

test_data_path = '<test data path>'
# The test file has no header row and no label column
test_data = pandas.read_csv(test_data_path, header=None)

predictions = xgb_predictor.predict(test_data.values)
```
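With no deserializer set, `predictions` holds the raw response body. The sketch below shows one way to turn such a body into numbers, assuming the endpoint returns scores separated by commas or newlines; the helper names, the sample response, and the 0.5 threshold are illustrative assumptions, not output from a real endpoint.

```python
def parse_predictions(raw_response):
    """Convert a CSV-style response body (bytes or str) into a list of floats."""
    if isinstance(raw_response, bytes):
        text = raw_response.decode("utf-8")
    else:
        text = raw_response
    # Accept either comma-separated or newline-separated scores.
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


def to_class_labels(scores, threshold=0.5):
    """Map probabilities (e.g. from binary:logistic) to 0/1 labels."""
    return [1 if score >= threshold else 0 for score in scores]


# Made-up response body used only to exercise the helpers.
example_response = b"0.12,0.87,0.5"
scores = parse_predictions(example_response)
labels = to_class_labels(scores)
print(scores, labels)
```

Thresholding applies only when the model was trained with a probability-producing objective such as `binary:logistic`; for regression objectives the parsed scores are the predictions themselves.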


