## XGBoost 


### Boosting


* Not a specific machine learning algorithm
* A concept that can be applied to a set of machine learning models (a "meta-algorithm")
* An ensemble meta-algorithm used to convert many weak learners into a strong learner

**Weak learner** - a model that performs only slightly better than random guessing, e.g. a decision tree with accuracy just above 50% on a two-class problem

**How does boosting work?**

* Iteratively learn a set of weak models on subsets of the data
* Weight each weak prediction according to that weak learner's performance
* Combine the weighted predictions to obtain a single weighted prediction

... that is much better than the individual predictions themselves!
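
The three steps above can be sketched with a tiny AdaBoost-style loop. This is an illustration only, not how XGBoost works internally (XGBoost uses gradient boosting), and it reweights the training samples each round instead of drawing subsets. The function names and the ten-point toy dataset are made up for this sketch.

```python
import numpy as np

def stump_predict(X, feature, threshold, sign):
    """Predict `sign` where X[:, feature] <= threshold, and `-sign` elsewhere."""
    return np.where(X[:, feature] <= threshold, sign, -sign)

def fit_stump(X, y, weights):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for sign in (1, -1):
                pred = stump_predict(X, feature, threshold, sign)
                error = weights[pred != y].sum()
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best

def adaboost_fit(X, y, n_rounds):
    """Fit n_rounds weighted stumps; labels y must be -1 or +1."""
    weights = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(n_rounds):
        error, feature, threshold, sign = fit_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * np.log((1 - error) / error)  # better stump -> larger vote
        pred = stump_predict(X, feature, threshold, sign)
        weights = weights * np.exp(-alpha * y * pred)  # up-weight the mistakes
        weights = weights / weights.sum()
        ensemble.append((alpha, feature, threshold, sign))
    return ensemble

def adaboost_predict(ensemble, X):
    """Combine the weak predictions into one weighted vote."""
    score = sum(alpha * stump_predict(X, f, t, s) for alpha, f, t, s in ensemble)
    return np.sign(score)

# Toy 1-D dataset that no single threshold can separate
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

single = adaboost_fit(X, y, n_rounds=1)
boosted = adaboost_fit(X, y, n_rounds=3)
stump_acc = np.mean(adaboost_predict(single, X) == y)
train_acc = np.mean(adaboost_predict(boosted, X) == y)
print("single stump accuracy:", stump_acc)
print("boosted accuracy:", train_acc)
```

Each stump on its own misclassifies several points, but the weighted vote of three stumps fits this toy training set better than the best single stump.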







What's the difference from Random Forest?





In [2]:
#how to install 
#pip install xgboost



In [5]:
import xgboost as xgb
import pandas as pd
import numpy as np 
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [6]:
# load the dataset 
iris = datasets.load_iris()

X,y = iris.data , iris.target

# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
# Iris has three classes, so use a multiclass objective rather than 'binary:logistic'
xg_cl = xgb.XGBClassifier(objective='multi:softprob', n_estimators=10, random_state=123)

# Fit the classifier to the training set
xg_cl.fit(X_train,y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy:  ", (accuracy))

accuracy:   1.0


**When NOT to use XGBoost?**

* Image recognition
* Computer vision
* Natural language processing and understanding problems
* When the number of training samples is significantly smaller than the number of features

### Objective Functions and Base Learners

**Objective Functions**

* Quantifies how far off a prediction is from the actual result
* Measures the difference between estimated and true values for some collection of data
* Goal: Find the model that yields the minimum value of the loss function
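
As a small illustration of "how far off a prediction is", the sketch below computes two common losses by hand: mean squared error for regression and log loss for binary classification. The helper names and the sample numbers are invented for this example; XGBoost ships its own implementations.

```python
import math

def mean_squared_error_loss(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Regression: predictions close to the truth give a smaller loss
y_true = [3.0, 5.0, 2.0]
close_preds = [2.9, 5.1, 2.2]
far_preds = [1.0, 8.0, 4.0]
print(mean_squared_error_loss(y_true, close_preds))
print(mean_squared_error_loss(y_true, far_preds))

# Classification: confident and correct probabilities give a smaller loss
labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]
unsure = [0.5, 0.5, 0.5]
print(log_loss(labels, confident))
print(log_loss(labels, unsure))
```

In both cases the model with the smaller loss value is the one the training procedure would prefer.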

#### Common Loss Functions and XGBoost
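Objective (loss function) names in XGBoost:
* **reg:linear** - squared-error loss for regression problems (renamed **reg:squarederror** in newer XGBoost releases, where the old name is deprecated)
* **reg:logistic** - logistic regression; the prediction is a value between 0 and 1
* **binary:logistic** - logistic regression for binary classification; the raw output is the predicted probability of the positive class, which you threshold (e.g. at 0.5) to get a decision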
    

**Base Learners and Why We Need Them?**

* XGBoost involves creating a meta-model that is composed of many individual models that combine to give a final prediction
* Individual models = base learners
* We want base learners that, when combined, produce a final prediction that is non-linear
* Each base learner should be good at distinguishing or predicting different parts of the dataset
* Two kinds of base learners: **tree** and **linear**

#### Trees as Base Learners in the scikit-learn API

In [7]:
# Load the housing data; the last column is the target
boston_data = pd.read_csv("../datasets/boston_housing.csv")
X, y = boston_data.iloc[:, :-1], boston_data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=123)

# Trees are the default base learner.
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective.
xg_reg = xgb.XGBRegressor(objective='reg:squarederror',
                          n_estimators=10, random_state=123)

xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 173308.244360


####  Linear Base Learner in XGBoost API

In [10]:
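boston_data = pd.read_csv("../datasets/boston_housing.csv")
X, y = boston_data.iloc[:, :-1], boston_data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=123)

# The native XGBoost API works on DMatrix objects
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

# booster="gblinear" selects linear base learners instead of trees.
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective.
params = {"booster": "gblinear", "objective": "reg:squarederror"}
xg_reg = xgb.train(params=params, dtrain=DM_train,
                   num_boost_round=10)
preds = xg_reg.predict(DM_test)
# mean_squared_error was imported in the previous cell
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))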

RMSE: 98285.815883


### Tuning the Model

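Why tune? Default hyperparameters are rarely the best choice for a given dataset; tuning them can lower the cross-validated error, as the untuned and tuned runs below show.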

**Untuned Model**


In [12]:

housing_data = pd.read_csv("../datasets/ames_housing_trimmed_processed.csv")

# All columns except the last are features; the last column is the target
X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]

housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Default hyperparameters; 'reg:squarederror' is the current name of the deprecated 'reg:linear'
untuned_params = {"objective": "reg:squarederror"}

untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=untuned_params, nfold=4, num_boost_round=200, metrics="rmse", as_pandas=True, seed=123)

print(type(untuned_cv_results_rmse))

# Test RMSE after the final boosting round
print("Untuned rmse: %f" % untuned_cv_results_rmse["test-rmse-mean"].iloc[-1])

<class 'pandas.core.frame.DataFrame'>
Untuned rmse: 33288.916504


In [13]:
housing_data = pd.read_csv("../datasets/ames_housing_trimmed_processed.csv")
X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Same data and cross-validation setup, but with three hyperparameters set by hand
tuned_params = {"objective": "reg:squarederror", "colsample_bytree": 0.3, "learning_rate": 0.1, "max_depth": 5}
tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=tuned_params, nfold=4, num_boost_round=200, metrics="rmse", as_pandas=True, seed=123)
# Test RMSE after the final boosting round
print("Tuned rmse: %f" % tuned_cv_results_rmse["test-rmse-mean"].iloc[-1])

Tuned rmse: 31003.937989


#### Common tree tunable parameters

* **learning_rate (eta):** step size that shrinks each new tree's contribution
* **gamma:** minimum loss reduction required to make a further split on a leaf node
* **lambda:** L2 regularization on leaf weights
* **alpha:** L1 regularization on leaf weights
* **max_depth:** maximum depth of each tree
* **subsample:** fraction of training samples used per tree
* **colsample_bytree:** fraction of features used per tree

## Assignment

1. Load the kidney disease dataset 
2. Try to predict the disease given blood values 

**Main challenge: many missing values**

References:

* https://xgboost.readthedocs.io/en/latest/model.html
* https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/