# Python Machine Learning library: XGBoost

## 0. Some recap


Supervised learning problems: classification, regression, ranking, recommendation.  
  




## 1. XGBoost Introduction

XGBoost is an optimised gradient-boosting machine learning library.  
It has APIs in several languages, including Python and R.  
Speed and performance are its key strengths.  
The core XGBoost algorithm is parallelisable, so it can harness all of the processing power of modern multi-core computers.  
It consistently outperforms single-algorithm methods in ML competitions and has been shown to achieve state-of-the-art performance on a variety of benchmark ML datasets.  

When to use XGBoost?     

A supervised ML task that fits the following:  
1) a large number of training examples (more than 1000 training samples and fewer than 100 features)  
2) number of features < number of training samples  
3) a mixture of categorical and numerical features, or just numerical features.  

When NOT to use XGBoost?    

1) Image recognition, computer vision or NLP (use deep learning instead)  
2) a small training set (< 100 samples)  
3) number of training samples << number of features  


In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
# Quick example, DON'T RUN

class_data = pd.read_csv("classification_data.csv")
# Features are all columns except the last; the target is the last column
X, y = class_data.iloc[:, :-1], class_data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBoost classifier (scikit-learn API)
xg_cl = xgb.XGBClassifier(objective="binary:logistic", n_estimators=10, random_state=123)

xg_cl.fit(X_train, y_train)

preds = xg_cl.predict(X_test)

# Accuracy of the trained model on the held-out test set
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]

## 2. XGBoost for Classification

Classification problem:     

For binary classification problems, the AUC (area under the receiver operating characteristic (ROC) curve) is the most versatile and common evaluation metric for judging the quality of a model. It is simply the probability that a randomly chosen positive data point is ranked higher than a randomly chosen negative data point. A higher AUC means a better-performing model that separates the two classes more reliably.  
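The probabilistic reading of AUC can be checked directly by counting pairs. A minimal sketch, assuming binary labels coded as 0/1 and counting ties as half a win; `pairwise_auc` and the toy data are hypothetical names and values for illustration, not part of XGBoost:

```python
from itertools import product

def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
```

This brute-force version is quadratic in the number of samples, so it is only meant to show the definition; a library routine is the better choice on real data.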

For multi-class classification problems, it is common to use the accuracy score (higher is better) or to look at the overall confusion matrix to evaluate the quality of a model.  
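Both multi-class metrics are simple counts. A small sketch in plain Python, with made-up labels; `confusion_counts` is a hypothetical helper that stores the confusion matrix as a mapping from (true, predicted) pairs to counts:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical true and predicted labels for a 3-class problem
y_true = ["cat", "dog", "bird", "cat", "dog", "bird"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird"]

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
counts = confusion_counts(y_true, y_pred)

print("accuracy: %f" % accuracy)
for (true_label, pred_label), n in sorted(counts.items()):
    print("true=%s predicted=%s count=%d" % (true_label, pred_label, n))
```

The off-diagonal entries (true label differs from predicted label) show which classes get confused with each other, which a single accuracy number hides.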

### 2.1 Decision Tree (for classification problem here)

XGBoost is usually used with trees as base learners.  
At each node, a question is asked.  
At the bottom, every possible decision eventually leads to a choice, with some paths taking far fewer questions to get there than others.  

Base learner: any individual learning algorithm in an ensemble algorithm.  

Decision trees are constructed iteratively (one decision at a time),  
until a stopping criterion is met (e.g. the depth of the tree reaches some pre-defined value).  
During construction, the tree is built one split at a time. The way a split is selected (that is, which feature to split on and where in the feature's range of values to split) can vary, but it always involves a strategy that segregates the target values better, putting each target category into buckets that are increasingly dominated by just one category, until nearly all values within a given split are exclusively of one category or another.  

Each leaf will have a single category in the majority, or should be exclusively of one category.     
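One common split-selection strategy is to pick the threshold that minimises the weighted Gini impurity of the two resulting buckets. The sketch below does this for a single numerical feature; it illustrates the general idea only and is not the criterion XGBoost itself uses. `gini`, `best_split` and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when all labels agree, larger when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split `value <= threshold`.

    The threshold is None if no split improves on the unsplit impurity.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical single feature and binary target
feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
target = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(feature, target)
print(threshold, impurity)  # 6.5 0.0: both buckets are exclusively one category
```

A full tree builder would repeat this search over every feature and then recurse into each bucket until the stopping criterion is met.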

Individual decision trees in general are low-bias, high-variance learning models (fairly accurate on average, but not precise), i.e. they are good at learning relationships within any data we train them on, but they tend to overfit that data and generalise poorly to new data.  

XGBoost uses CART (classification and regression trees) as its base learner. In contrast to the decision trees described above, whose leaves contain categories, CART trees contain a real-valued score in each leaf, regardless of whether they are used for classification or regression. The real-valued scores can then be thresholded to convert them into categories for classification problems if necessary.
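To make the thresholding step concrete, here is a rough sketch of how real-valued leaf scores could be turned into a class. It assumes a logistic transform of the summed scores (as with a logistic objective) and ignores details such as the learning rate and base score; the leaf scores are made up.

```python
import math

def sigmoid(score):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical leaf scores that three trees assign to one sample
leaf_scores = [0.4, -0.1, 0.6]

raw_score = sum(leaf_scores)               # real-valued scores are added together
probability = sigmoid(raw_score)           # assumed logistic transform
predicted_class = int(probability >= 0.5)  # threshold to get a category

print(raw_score, probability, predicted_class)
```

Thresholding the probability at 0.5 is the same as thresholding the raw score at 0, so the category can be read off either value.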

### 2.2 Boosting

Boosting is not a specific ML algorithm, but a concept that can be applied to a set of machine learning models (a meta-algorithm).  

Specifically, it is an ensemble meta-algorithm primarily used to reduce the bias (and also the variance) of any single learner and to convert many weak learners into an arbitrarily strong learner.  

Weak learner: any ML algorithm that is slightly better than chance (e.g. a decision stump, i.e. a decision tree of depth 1, whose predictions are slightly better than 50% accurate).  

Strong learner: any algorithm that can be tuned to achieve arbitrarily good performance for some supervised learning problem.     

How boosting is accomplished:  
1) Iteratively learn a set of weak models on subsets of the data.  
2) Weight each of their predictions according to each weak learner's performance.  
3) Combine the weighted predictions to obtain a single weighted prediction.  
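The three steps can be sketched with AdaBoost, one classic boosting algorithm, using decision stumps on a single feature. Note the assumptions: this is AdaBoost rather than the gradient boosting XGBoost uses, it reweights the training examples instead of drawing subsets, labels are coded as -1/+1, and all function names and the toy data are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` if x <= threshold, else the opposite."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        # 1) learn a weak model on the current weighting of the data
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        # 2) weight the weak learner by its performance (lower error, larger alpha)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # increase the weight of misclassified examples, then renormalise
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # 3) combine the weighted predictions into a single prediction
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single stump can classify perfectly
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_stump = adaboost(xs, ys, rounds=1)
three_stumps = adaboost(xs, ys, rounds=3)
errors_one = sum(predict(one_stump, x) != y for x, y in zip(xs, ys))
errors_three = sum(predict(three_stumps, x) != y for x, y in zip(xs, ys))
print(errors_one, errors_three)  # 3 0
```

On this toy set a single stump misclassifies three points, while the weighted vote of three stumps classifies all ten correctly, which is the weak-to-strong conversion described above.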



## 3. DMatrix (and CV in XGBoost)

In XGBoost, the dataset is converted into an optimised data structure, the DMatrix, for performance and efficiency.  

Normally, the input datasets are converted into a DMatrix on the fly.  

However, when we use the XGBoost cv function (for cross-validation), we first have to explicitly convert our data into a DMatrix.

In [None]:
# DON'T RUN
# DMatrix and CV in XGBoost

class_data = pd.read_csv("classification_data.csv")
# Convert the data into a DMatrix
# month_5_still_here is the binary target, the last column in the dataset.
churn_dmatrix = xgb.DMatrix(data=class_data.iloc[:, :-1], label=class_data.month_5_still_here)

params = {"objective": "binary:logistic", "max_depth": 4}

# num_boost_round: how many trees we want to build
# as_pandas=True: return the results as a pandas DataFrame
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4, num_boost_round=10, metrics="error", as_pandas=True)

print("Accuracy: %f" %((1-cv_results["test-error-mean"]).iloc[-1]))

## 4. XGBoost for Regression

In most cases, the root mean squared error (RMSE) or the mean absolute error (MAE) is used to evaluate the quality of a regression model.  

RMSE treats positive and negative errors equally but punishes larger differences between predicted and actual values much more than smaller ones.  

MAE simply sums the absolute differences between predicted and actual values across all samples and then takes the mean. MAE is not affected by large differences as much as RMSE is, but it lacks some nice mathematical properties (for example, it is not differentiable at zero) and is used much less often as an evaluation metric.  
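A short sketch of the difference between the two metrics, using made-up predictions. Both prediction sets below have the same total absolute error, but one concentrates it in a single large miss:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and two sets of predictions
y_true = [10.0, 10.0, 10.0, 10.0]
small_errors = [9.0, 11.0, 9.0, 11.0]   # every prediction is off by 1
one_outlier = [10.0, 10.0, 10.0, 14.0]  # a single prediction is off by 4

print(mae(y_true, small_errors), rmse(y_true, small_errors))  # 1.0 1.0
print(mae(y_true, one_outlier), rmse(y_true, one_outlier))    # 1.0 2.0
```

MAE cannot tell the two cases apart, while RMSE doubles for the set with the single large error.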

Common regression algorithms: Linear regression, Decision trees (CART).

### 4.1 Objective (loss) functions and base learners

A loss function quantifies how far off a prediction is from the actual result for a given data point.  
It maps the difference between estimated and true values for some collection of data to a single number.  
Goal: find the model that yields the minimum value of the loss function.  
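Two typical per-point loss functions, written out as a minimal sketch: squared error for regression and log loss for binary classification with labels coded as 0/1. The function names and the clipping constant `eps` are illustrative choices, not XGBoost internals.

```python
import math

def squared_error(y_true, y_pred):
    """Regression loss for one data point."""
    return (y_true - y_pred) ** 2

def log_loss(y_true, p, eps=1e-15):
    """Binary classification loss for one data point.

    y_true is 0 or 1, p is the predicted probability of class 1.
    p is clipped to [eps, 1 - eps] so the logarithm stays finite.
    """
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(squared_error(3.0, 2.5))  # 0.25
print(log_loss(1, 0.9))         # confident and correct: small loss
print(log_loss(1, 0.1))         # confident and wrong: large loss
```

Averaging such per-point values over the training data gives the quantity the model is trained to minimise.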


Loss functions have specific naming conventions in XGBoost:  
For regression models:  
reg:linear (renamed reg:squarederror in newer XGBoost releases)  

For binary classification:      
reg:logistic (logistic regression)  
binary:logistic (logistic regression for binary classification; outputs the predicted probability, which can be thresholded when only a decision is needed)  

XGBoost wants base learners (i.e. the individual models in the ensemble) that, when combined, create a final prediction that is non-linear.  

Each base learner should be good at distinguishing or predicting different parts of the dataset.      

Two kinds of base learner: tree and linear


In [None]:
# tree as base learner: scikit-learn API