# XGBoost - Gradient Boosted Trees

## Algorithms Overview

The XGBoost documentation on Read the Docs is [here](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)

Linear model

* Pros
    * Simple and easy to understand
    * Performs surprisingly well for a variety of problems
* Cons
    * Difficulty handling non-linear datasets
    * Features need to be on a similar scale, categorical values need to be one-hot encoded, complex feature engineering may be required
    
Decision Tree

* Separates data into different groups using a series of questions
* During training the algorithm selects the questions to build the tree
* For decisions, the algorithm walks the tree and returns the class of the leaf node

* Pros
    * Handle non-linear datasets
    * Can handle features on different scales
* Cons
    * Can overfit if the tree gets too deep. Limiting the depth of the tree can avoid overfitting, but this can cause underfitting as the tree may not learn the patterns.
    
Ensemble Methods
   
* Use multiple trees and combine them to achieve better results
* Two methods used to come up with the set of decision trees: bagging and boosting

* Bagging
    * Training algorithm uses a random sample of training data at each step to form a tree
    * Can sample features, observations, or a combination of these
* Boosting
    * Algorithm starts with a simple tree
    * Tree is evaluated - some predictions from that tree will be right, some wrong
    * Incorrect predictions are given a higher weight
    * Next, a second tree is built that focuses on the incorrect predictions. The process repeats until there are no more improvements or the limit on the number of trees has been reached
    
    
* Pros
    * Not very sensitive to data distribution
    * Can easily handle features on different scales
    * Handles categorical data without one hot encoding
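A minimal sketch of the boosting loop, using only the standard library. Note the swap: instead of re-weighting incorrect predictions as described above, this version fits each new tree to the residuals of the ensemble so far, which is the gradient boosting flavor for squared error. The one-split "stumps", the toy data, and all function names are assumptions made for illustration, not XGBoost's implementation.

```python
# Boosting with one-split trees ("stumps") on 1-D data.
# Each round fits a stump to the residuals left by the ensemble so far.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    # Skip the largest x so the right side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict(stumps, learning_rate, base, x):
    """Sum the base prediction and the scaled contribution of every stump."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - predict(stumps, learning_rate, base, x)
                     for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


# Toy non-linear data (hypothetical): y = x^2
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
base, stumps = boost(xs, ys)


def mse(n_trees):
    """Training error when only the first n_trees stumps are used."""
    return sum((y - predict(stumps[:n_trees], 0.3, base, x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)
```

Comparing `mse(0)`, `mse(1)`, and `mse(50)` shows the training error dropping as trees are added, which is also why too many trees can overfit.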
    

Lab

* [linear regression data prep](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/LinearAndQuadraticFunctionRegression/linear_data_preparation.ipynb)

Once the data has been prepared, use [this](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/LinearAndQuadraticFunctionRegression/linear_xgboost_localmode.ipynb) notebook.

Faster way to install xgboost:

```console
!pip install xgboost==0.90
```

Tree-based algorithms make branch-based decisions using the data seen in training. Linear regression captures the relationship between inputs and output using weights, which means it can extrapolate. Conversely, tree-based methods are limited to the range of target values seen in training: for inputs outside that range, the prediction comes from the nearest leaf.
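The extrapolation difference can be sketched with the standard library alone. The toy data and the `tree_like_predict` stand-in are assumptions: a fully grown 1-D regression tree with one training point per leaf behaves roughly like a nearest-training-point lookup, so that is what is used here in place of a real tree.

```python
# Toy training data (hypothetical): y = 2x + 1 for x in 0..9
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

# Closed-form simple linear regression (least squares).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def linear_predict(x):
    """Weights apply to any x, so the line keeps going outside the training range."""
    return slope * x + intercept


def tree_like_predict(x):
    """Stand-in for a fully grown tree: return the target of the nearest training point."""
    nearest = min(range(n), key=lambda i: abs(xs[i] - x))
    return ys[nearest]


# x = 20 is well outside the training range of 0..9.
linear_at_20 = linear_predict(20.0)   # follows the learned line
tree_at_20 = tree_like_predict(20.0)  # stuck at the largest training target
```

The tree-like model can never return more than `max(ys)`, no matter how large the input gets.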

Lab

* [Non-linear data prep](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/LinearAndQuadraticFunctionRegression/quadratic_data_preparation.ipynb)

* [Quadratic regression dataset - linear regression vs xgboost](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/LinearAndQuadraticFunctionRegression/quadratic_xgboost_localmode.ipynb)

Lab solution: add a quadratic feature to model the, um... quadratic equation.

## Lab - Bike Sharing Kaggle Challenge 

Forecast hourly demand

* Kaggle info [here](https://www.kaggle.com/c/bike-sharing-demand/data)
* How to [download datasets from kaggle](https://freddiek.github.io/2018/06/10/accessing-Kaggle-from-SageMaker-instance.html)

Download the data using the kaggle command, e.g.

```console
kaggle competitions download -c bike-sharing-demand
```

Data prep - rev1 workbook

* xgboost works on numeric inputs, so categorical values must be represented as numbers
* need to break up the time stamp into year, month, day, day of week, hour
* workbook shows several ways to explore the dataset
* For this lab we have a single model to predict total rentals; could have made two different models to predict casual and registered rentals, then add them up.
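A small sketch of splitting the timestamp into separate numeric features, using only the standard library. The `split_timestamp` helper and the feature names are hypothetical, and the timestamp format is assumed to be `YYYY-MM-DD HH:MM:SS`; the workbook may well use pandas instead.

```python
from datetime import datetime


def split_timestamp(timestamp):
    """Break a timestamp string into numeric features a tree model can use."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "dayofweek": dt.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
    }


features = split_timestamp("2011-01-01 05:00:00")
```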

Train regression model

* bike rental xgboost localmode rev1
* hyper parameters - depth of 5, 150 trees max
* Can examine feature importance - hour and humidity are most influential
* Something funky - regression model predicting negative rentals
    * For plotting purposes set the negative predictions to 0
* Kaggle uses RMSLE (root mean squared logarithmic error) - the relative (percentage) difference matters, not the absolute magnitude of the difference
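A minimal sketch of RMSLE with the standard library, assuming the usual `log(x + 1)` form of the metric. The `rmsle` helper and the sample numbers are made up for illustration.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x) so zero counts are allowed."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Both predictions are off by 10 rentals, but the relative error is very different.
small_count_error = rmsle([10], [20])      # predicted double the actual
large_count_error = rmsle([1000], [1010])  # predicted 1% over the actual
```

Because the logarithm is undefined for values at or below -1 here, `math.log1p` raises a `ValueError` for such predictions, which is one more reason the negative predictions have to be dealt with before scoring.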

Optimization Technique

* When your model needs to predict a positive integer such as a count, you can apply a log transformation to the target, e.g. log(count)
* To get the predicted count, apply the inverse transform to the predicted value, e.g. exp(predicted_value)
* Smooths the effect of seasonality and trend, and brings counts to a similar scale
* See notebook bike rental data prep rev 3
* And bikerental xgboost localmode rev3
* This optimization performs much better than rev 1
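A sketch of the transform and its inverse with the standard library. It uses `log1p`/`expm1` rather than plain `log`/`exp` so that a count of zero is handled; that choice is an assumption, and plain `log`/`exp` works the same way when every count is at least 1. The sample counts and the fake "predictions" are made up.

```python
import math

# Hypothetical hourly rental counts spanning several orders of magnitude.
counts = [0, 1, 16, 250, 977]

# Train on the log-transformed target instead of the raw count.
log_targets = [math.log1p(c) for c in counts]

# Stand-in for model output in log space (here: the targets plus a small error).
predicted_log = [t + 0.1 for t in log_targets]

# Inverse transform to get back to the count scale.
predicted_counts = [math.expm1(p) for p in predicted_log]

# Round trip without any prediction error recovers the original counts.
recovered = [round(math.expm1(t)) for t in log_targets]
```

With plain `log`/`exp` the inverse transform is always positive, which removes the negative-rentals problem seen in rev 1.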



## Training A Model Using SageMaker's XGBoost

Four steps:

* Upload training and validation files to S3
* Specify algorithm and hyperparameters
* Configure type of server and number of servers to use for training
* Create a real-time endpoint for interactive use case

Lab:

* Open xgboost cloud training template notebook for the lab
* Note that `get_image_uri` now returns the image location in ECR
* You can see the training job running in the SageMaker console

How to Connect to an Existing SageMaker Endpoint

* Use the [cloud prediction notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/BikeSharingRegression/xgboost_cloud_prediction_template.ipynb)
* Prediction endpoints accept batches of values and can return an array of results.


## Model Hosting

Single Instance Hosting - single point of failure.

* SageMaker integrates with CloudWatch
* SageMaker can also be integrated with auto scaling
* Configure an endpoint with multiple instances, so requests are distributed across multiple instances in multiple AZs for HA
* Scale based on workload as well
* SageMakerVariantInvocationsPerInstance metric = average number of requests per minute per instance
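A rough sketch of the arithmetic behind scaling on this metric. Both helper functions and the numbers are hypothetical, and the real auto scaling policy also involves cooldowns and minimum/maximum capacity limits that are ignored here.

```python
import math


def invocations_per_instance(total_invocations_per_minute, instance_count):
    """Average requests per minute handled by each instance behind the variant."""
    return total_invocations_per_minute / instance_count


def instances_needed(total_invocations_per_minute, target_per_instance):
    """Instances required to keep the per-instance average at or below the target."""
    return max(1, math.ceil(total_invocations_per_minute / target_per_instance))


# 1200 requests/minute spread over 2 instances, with a target of 500 per instance.
current_average = invocations_per_instance(1200, 2)
desired_instances = instances_needed(1200, 500)
```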

Multiple variants (for example, different model versions) can be deployed to the same endpoint

* Good for testing new versions of models


## Multiclass Classification

Lab

* Iris Data Classification - [Data Prep](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/IrisClassification/iris_data_preparation.ipynb)
    * String class labels must be encoded as integers
    * Use sklearn label encoder
* Iris Classification - [Train the Model](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/IrisClassification/iris_xgboost_localmode.ipynb)
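A standard-library sketch of what label encoding does. It mirrors the behavior of assigning integers to the sorted class names; in the lab the sklearn label encoder does this for you. The sample labels are illustrative.

```python
# String class labels as they might appear in the raw data (illustrative sample).
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

# Assign each distinct class an integer, in sorted order.
classes = sorted(set(labels))
class_to_id = {name: index for index, name in enumerate(classes)}

# Encode for training, and decode predictions back to readable names.
encoded = [class_to_id[label] for label in labels]
decoded = [classes[index] for index in encoded]
```

Keeping `classes` around matters: the model only returns integers, so the same mapping is needed to turn predictions back into species names.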



## Binary Classification

Lab

* Diabetes Data Classification - [Data Prep](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/DiabetesClassification/diabetes_data_preparation.ipynb)
* Diabetes Classification - [Model Training](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/xgboost/DiabetesClassification/diabetes_xgboost_localmode.ipynb)

Performance on the original data was poor, probably because of problems with the data itself, such as zero values in columns where zero cannot be a valid measurement. Debug the data before debugging the model.

Solution: replace the missing values with the mean of the non-missing values. Note that the 'missing values' are stored as actual numbers (zeros), so be sure they are not included in the calculation of the means. One way to do this is to replace them with np.nan first.
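A minimal sketch of the idea using only the standard library (`math.nan` plays the role of `np.nan`). The column values are made up; the notebook itself presumably works on a pandas DataFrame.

```python
import math

# Hypothetical measurements where 0 really means "not recorded".
glucose = [148.0, 0.0, 183.0, 89.0, 0.0, 116.0]

# Step 1: mark the placeholder zeros as missing.
with_nan = [math.nan if value == 0 else value for value in glucose]

# Step 2: compute the mean from the observed values only.
observed = [value for value in with_nan if not math.isnan(value)]
mean_observed = sum(observed) / len(observed)

# Step 3: fill the missing entries with that mean.
imputed = [mean_observed if math.isnan(value) else value for value in with_nan]

# For comparison: the mean is dragged down if the zeros are left in.
naive_mean = sum(glucose) / len(glucose)
```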

## Hyperparameters

* Fine tune the learning process based on the complexity of the data set.
* Algorithms usually use a sensible set of default values.

objective hyperparameter - specifies the learning task and the corresponding learning objective

* regression - "reg:linear" (deprecated in newer XGBoost releases in favor of "reg:squarederror")
* binary classification - "binary:logistic"
* multiclass classification - "multi:softmax"

Other important parameters:

* num_round - number of boosting rounds, i.e. trees (`n_estimators` in the scikit-learn wrapper); too high can overfit
* early_stopping_rounds - stop the training when the validation score stops improving, helps avoid overfitting
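The early stopping rule can be sketched without any library. The function name and the validation error values are hypothetical; the logic is a simplified version of "stop when the validation score has not improved for `early_stopping_rounds` rounds".

```python
def best_round_with_early_stopping(validation_errors, early_stopping_rounds):
    """Return (best_round, stopped_at) for a non-empty sequence of validation errors."""
    best_error = float("inf")
    best_round = -1
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            # New best score: remember it and keep training.
            best_error, best_round = error, round_index
        elif round_index - best_round >= early_stopping_rounds:
            # No improvement for the allowed number of rounds: stop.
            break
    return best_round, round_index


# Validation error improves, then drifts upward as the model starts to overfit.
errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60, 0.65]
best_round, stopped_at = best_round_with_early_stopping(errors, early_stopping_rounds=3)
```

Training stops well before the last round, and the model from `best_round` is the one worth keeping.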

Bias and Variance

* Biased models - do not match reality
* Variance
    * How well the model generalizes for unseen data.
    * Difference between validation error and training error

High Bias

* Model is not learning from the data
* Translates to large training and validation errors
* Underfitting

High Variance

* Validation error is high, but training error is low.
* Overfitting

Handling High Bias

* Add relevant features
* Combine features
* Create higher order features
* Train longer (more iterations)
* Decrease regularization

Handling High Variance

* Use fewer features
* Use straightforward features instead of higher order features
* Reduce training iterations
* Increase regularization

Regularization

* Many features are equally good at predicting outcome
* Which combination of features is the model going to use?
* Feature selection depends on algorithm and regularization parameters.

Regularization tones down overdependence on specific features.

Analogy: google maps on the phone vs garmin gps vs paper map

* Google maps has most up to date context (road closure, traffic, accidents), but has many failure scenarios that can leave you with no information.
* GPS lacks most up to date road closures, traffic conditions, etc. but is less prone to failures - balances risk against performance

L1 Regularization - Aggressively eliminates features that are not important

* Example:
    * Phone GPS - substantial weight
    * Standalone GPS - zero weight
    * Paper map - zero weight
* Useful in large dimension dataset - reduce the number of features

L2 Regularization - Simply reduces weight of some features

* Allows other features to influence outcome
* L2 regularization is a good starting point
* Example:
    * Phone GPS - larger weight
    * Standalone GPS - medium weight
    * Paper map - smaller weight
    
XGBoost Regularization

* alpha - L1 regularization, default is 0.
* lambda - L2 regularization, default is 1.
* XGBoost tuning guide - see [here](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html)
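A sketch of how the two parameters act on a single leaf weight, assuming the usual regularized form: without L1 the optimal weight is `-G / (H + lambda)`, and L1 is applied as a soft threshold on the gradient sum `G`. The function name and the sample gradient and hessian sums are assumptions for illustration, not code taken from XGBoost.

```python
def leaf_weight(gradient_sum, hessian_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight under L1 (alpha) and L2 (lambda) regularization."""
    # L1: soft-threshold the gradient sum; small sums collapse to exactly zero.
    if gradient_sum > reg_alpha:
        shrunk = gradient_sum - reg_alpha
    elif gradient_sum < -reg_alpha:
        shrunk = gradient_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2: a larger lambda inflates the denominator and shrinks the weight.
    return -shrunk / (hessian_sum + reg_lambda)


default_weight = leaf_weight(-4.0, 3.0)                  # alpha=0, lambda=1
l2_weight = leaf_weight(-4.0, 3.0, reg_lambda=5.0)       # shrunk, but not zero
l1_weight = leaf_weight(-4.0, 3.0, reg_alpha=5.0)        # eliminated entirely
```

This matches the analogy above: L2 turns every weight down a little, while L1 switches weak contributors off completely.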

Hyperparameter tuning is dependent on the dataset, and some hyperparameters are sensitive to the settings of other hyperparameters. Therefore, it is recommended to use automated tuning methods.

SKLearn Automatic Tuning

* GridSearch (`GridSearchCV`) - exhaustive search over every combination of the specified parameter values
* RandomSearch (`RandomizedSearchCV`) - tries a fixed number of random parameter combinations drawn from the specified ranges
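The difference between the two strategies, sketched with the standard library only. The search space, bounds, and candidate counts are made up; a real search would train and score a model for each candidate.

```python
import itertools
import random

# Hypothetical search space for two XGBoost hyperparameters.
search_space = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
}

# Grid search: every combination of the listed values (3 x 3 candidates).
grid_candidates = [
    dict(zip(search_space, combination))
    for combination in itertools.product(*search_space.values())
]

# Random search: a fixed budget of candidates sampled between the bounds.
rng = random.Random(0)  # seeded so the sketch is repeatable
random_candidates = [
    {
        "max_depth": rng.randint(3, 7),
        "learning_rate": rng.uniform(0.05, 0.3),
    }
    for _ in range(4)
]
```

Grid search cost grows multiplicatively with every added parameter, while the random search budget stays whatever you set it to.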

SageMaker - Automatic Tuning

* Bayesian Search - smart search. Treats hyperparameter tuning as a machine learning problem. Often converges faster.
* Random search - random search of parameters from specified lower and upper bound, similar to SKLearn random.
