# XG Boost

## Classification with XG Boost

* A classification problem involves predicting the category a given data point belongs to out of a finite set of possible categories. Depending on how many possible categories there are to predict, a classification problem can be either binary or multi-class. 
* XG Boost is an ensemble learning method
* XG Boost: "The hottest library in supervised ML"
* XG Boost was originally written in C++, but it has APIs in several other languages:
    * Python
    * R
    * Scala
    * Julia
    * Java
* What makes XG Boost so popular?
    * Its speed and performance
    * Core algorithm is parallelizable 
    * Also parallelizable to GPUs and networks of computers
    * Consistently outperforms single-algorithm models
    * Achieves state of the art performance in many ML tasks 
* Remember: always evaluate an ML model on held-out data, for example by splitting the data with `train_test_split`
* You can use the scikit-learn `.fit()` / `.predict()` paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!
* A typical setup for a churn prediction problem: Use the first month's worth of data to predict whether an app's users will remain users of the service at the 5-month mark.

```
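#Import xgboost, numpy, and train_test_split
#(churn_data is assumed to be a pandas DataFrame that is already loaded)
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split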
#Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
#Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)
#Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
#Fit the classifier to the training set
xg_cl.fit(X_train, y_train)
#Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)
#Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
```

### Decision trees as "base learners"
* __Base learner:__ any individual learning algorithm in an ensemble algorithm
* Composed of a series of binary questions
* Predictions happen at the "leaves" of the trees. 
### Decision trees and CART
* Decision trees are constructed iteratively (one decision at a time) until some stopping criterion is met (for example, the depth reaches some pre-defined maximum value)
* Individual decision trees tend to overfit
    * Low bias
    * High variance
* XG Boost uses a slightly different kind of decision tree called a __CART__ (Classification And Regression Tree):
    * The leaves of an ordinary decision tree always contain decision values.
    * CART leaves __always__ contain a real-valued score, regardless of whether they are used for classification or regression.
    * The real-valued scores can then be thresholded to convert them into categories for classification problems if necessary.
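
To illustrate the thresholding idea above, here is a minimal sketch. The leaf scores are made-up numbers, and the logistic squashing plus the 0.5 cut-off are assumptions chosen to mirror a binary logistic objective; details such as the base score and learning rate are ignored.

```python
import numpy as np

# Hypothetical leaf scores: one row per sample, one column per tree
leaf_scores = np.array([
    [0.4, 0.3, 0.2],
    [-0.5, -0.1, 0.1],
    [0.1, -0.2, 0.05],
])

# Each sample's raw score is the sum of the leaf scores it lands in
raw_scores = leaf_scores.sum(axis=1)

# Squash the raw scores into (0, 1) with the logistic function
probabilities = 1.0 / (1.0 + np.exp(-raw_scores))

# Threshold at 0.5 to turn the real-valued scores into class labels
labels = (probabilities > 0.5).astype(int)
print(labels)
```

Only the first sample has a positive summed score, so it is the only one assigned to class 1.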

```
#Import the necessary modules
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
#Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
#Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth= 4)
#Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)
#Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)
#Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)
```

## Boosting
* At bottom, boosting isn't a specific ML algorithm but rather a concept that can be applied to a set of ML models
    * "Meta-algorithm"
* Ensemble meta-algorithm used to convert many weak learners into a strong learner. 
* __Weak learner:__ ML algorithm that is slightly better than chance
    * Example: Decision tree whose predictions are slightly better than 50%
* Boosting converts a collection of weak learners into a strong learner
* __Strong learner:__ Any algorithm that can be tuned to achieve good performance
* How Boosting is accomplished:
    * Iteratively learn a set of weak models on subsets of the data
    * Weight each weak prediction according to each weak learner's performance
    * Combine the weighted predictions to form a single weighted prediction
* __Model evaluation using cross-validation with XG Boost's Learning API:__
    * This is different from the scikit-learn compatible API used in the first DataCamp examples above.
    * Cross-validation capabilities are baked in: `xgb.cv` generates many non-overlapping train/test splits on the training data and reports the average test set performance across all data splits.
    * The Learning API needs the data to be converted to a DMatrix.
    * For the previous `.fit()` / `.predict()` calls, the data was automatically transformed into the required DMatrix.
    * For the following steps, the data MUST be explicitly transformed to a DMatrix.
* XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets called a __DMatrix__.
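
Going back to the "How Boosting is accomplished" steps, the sketch below shows one way the weak-to-strong idea can play out. It is an AdaBoost-style illustration, not XGBoost's gradient boosting: it reweights all samples each round instead of drawing subsets, the data is synthetic, and `fit_stump` / `stump_predict` are hypothetical helpers written for this example.

```python
import numpy as np

rng = np.random.default_rng(123)
# Synthetic data (an assumption): +1 inside a disc, -1 outside,
# so no single threshold on one feature separates the classes
X_demo = rng.uniform(-1, 1, size=(400, 2))
y_demo = np.where((X_demo ** 2).sum(axis=1) < 0.5, 1, -1)

def stump_predict(X, feature, threshold, sign):
    """One binary question: predict `sign` where X[:, feature] <= threshold."""
    return np.where(X[:, feature] <= threshold, sign, -sign)

def fit_stump(X, y, w):
    """Exhaustively pick the one-split rule with the lowest weighted error."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for sign in (1, -1):
                pred = stump_predict(X, feature, threshold, sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

weights = np.full(len(y_demo), 1.0 / len(y_demo))
ensemble = []  # entries are (alpha, feature, threshold, sign)

for _ in range(20):
    err, feature, threshold, sign = fit_stump(X_demo, y_demo, weights)
    err = np.clip(err, 1e-10, 1 - 1e-10)
    # Weak learners with lower weighted error get a larger say
    alpha = 0.5 * np.log((1 - err) / err)
    pred = stump_predict(X_demo, feature, threshold, sign)
    # Upweight the samples this learner got wrong, then renormalize
    weights = weights * np.exp(-alpha * y_demo * pred)
    weights /= weights.sum()
    ensemble.append((alpha, feature, threshold, sign))

# Strong learner: sign of the weighted sum of the weak predictions
score = sum(a * stump_predict(X_demo, f, t, s) for a, f, t, s in ensemble)
ensemble_acc = np.mean(np.sign(score) == y_demo)
first_acc = np.mean(stump_predict(X_demo, *ensemble[0][1:]) == y_demo)
print("first weak learner accuracy:", first_acc)
print("boosted ensemble accuracy:", ensemble_acc)
```

Both accuracies are measured on the training data, so they only show that the weighted combination fits better than a single stump, not how well it generalizes.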

***
```
#Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
#Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=churn_data.iloc[:,:-1], label=churn_data.month_5_still_here)
#Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}
#Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", as_pandas=True, seed=123)
#Print cv_results
print(cv_results)
#Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))
```

***
```
#Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="auc", as_pandas=True, seed=123)
#Print cv_results
print(cv_results)
#Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])
```
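
To make the idea behind `xgb.cv` more concrete, here is a hand-rolled sketch of k-fold evaluation. It does not use XGBoost: the labels are random and the "model" is a stand-in that predicts the majority class, both of which are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
y_all = rng.integers(0, 2, size=12)      # made-up binary labels

# Shuffle the row indices once and cut them into 3 non-overlapping folds
indices = rng.permutation(len(y_all))
folds = np.array_split(indices, 3)

fold_errors = []
for k, test_idx in enumerate(folds):
    # Train on every fold except the k-th, test on the k-th
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    # Stand-in "model": always predict the majority class of the training rows
    majority = int(y_all[train_idx].mean() >= 0.5)
    fold_errors.append(np.mean(y_all[test_idx] != majority))

# The reported figure is the average test error across the folds
mean_error = np.mean(fold_errors)
print(fold_errors, mean_error)
```

Every row is used for testing exactly once, which is what "non-overlapping splits" refers to.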

When to use XG Boost: \
For any supervised machine learning task that fits the following criteria: \
* You have a large number of training samples and few features
    * More than 1,000 training samples and fewer than 100 features
    * As long as the number of features < the number of training samples
* XG Boost tends to do well when you have a mixture of categorical and numeric features 
    * Or just numeric features 

When to NOT use XG Boost: \
XGBoost is a powerful library that scales very well to many samples and works for a variety of supervised learning problems. That said, as Sergey described in the video, you shouldn't always pick it as your default machine learning library when starting a new project, since there are some situations in which it is not the best option. \
* Not ideally suited for image recognition
* Computer vision
* Natural Language Processing and understanding problems
* (Deep learning tends to do much better on all of the above)
* When the number of training samples is significantly smaller than the number of features (for example, fewer than 100 training samples)

## Regression with XG Boost

### Objective functions and why we use them:
* Quantifies how far off a prediction is from the actual result
* Measures the distance between estimated and true values for some collection of data
* Goal: Find the model that yields the minimum value of the loss function
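
As a small worked example of "how far off a prediction is", the sketch below computes a few common loss values by hand with NumPy. All numbers are made up for illustration.

```python
import numpy as np

# Regression: made-up true values and predictions
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])

mse = np.mean((y_true - y_pred) ** 2)     # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))    # mean absolute error

# Binary classification: made-up labels and predicted probabilities
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])

# Log loss punishes confident wrong probabilities most heavily
log_loss = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print("RMSE:", rmse, "MAE:", mae, "log loss:", log_loss)
```

A model that predicted every value perfectly would drive all three quantities to zero, which is the minimum the training procedure is searching for.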

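### Common loss functions and XGB:
* For regression models:
    * `reg:linear` (squared error; newer XGBoost releases call this objective `reg:squarederror` and deprecate the `reg:linear` name)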
* For binary classification:
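    * `reg:logistic`: logistic regression; the raw model output is passed through the logistic function, so it lies between 0 and 1.
    * `binary:logistic`: logistic regression for binary classification; outputs a predicted probability, which you threshold when you want just the decision.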
* linear base learners: learning API only (no sklearn API)

```
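#Import the helpers used below
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
#Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
#Instantiate the XGBRegressor: xg_reg ("reg:squarederror" was named "reg:linear" in older versions)
xg_reg = xgb.XGBRegressor(objective="reg:squarederror", seed=123, n_estimators=10)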
#Fit the regressor to the training set
xg_reg.fit(X_train, y_train)
#Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)
#Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
```

```
#Convert the training and testing sets into DMatrix objects: DM_train, DM_test
DM_train = xgb.DMatrix(X_train, y_train)
DM_test = xgb.DMatrix(X_test, y_test)
#Create the parameter dictionary: params ("reg:squarederror" was named "reg:linear" in older versions)
params = {"booster":"gblinear", "objective":"reg:squarederror"}
#Train the model: xg_reg
xg_reg = xgb.train(dtrain = DM_train, params=params, num_boost_round=5)
#Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)
#Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
```

```
#Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
#Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":4}
#Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
#Print cv_results
print(cv_results)
#Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))
```

```
#Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
#Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":4}
#Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="mae", as_pandas=True, seed=123)
#Print cv_results
print(cv_results)
#Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))
```

## Regularization in XG Boost
* Loss functions in XG Boost also take into account how complex the model is.
* __Regularization:__ Idea of penalizing models as they become more complex
* __Regularization parameters in XG Boost:__
    * __gamma:__ a parameter for tree-based learners; the minimum loss reduction required for a split to occur, so higher values lead to fewer splits
    * __alpha (L1):__ alpha is another name for L1 regularization; __L1 = Lasso__; it acts on the leaf weights, and higher alpha values lead to stronger L1 regularization, which causes many leaf weights in the base learners to go to zero.
    * __lambda (L2):__ another name for L2 regularization; __L2 = Ridge__; a much smoother penalty than L1 that causes leaf weights to decrease smoothly instead of enforcing strong sparsity constraints on the leaf weights.
* *See DataCamp's Supervised Learning with Scikit Learn*
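
The sketch below illustrates how the three parameters could act on a single tree. It follows the leaf-weight and split-gain formulas from the XGBoost paper and documentation as I understand them, so treat the exact formulas as an assumption; `leaf_weight` and `split_gain` are hypothetical helper names, and the gradient and Hessian sums are made-up numbers.

```python
import numpy as np

def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight for the summed gradients G and Hessians H of one leaf."""
    # L1 (alpha): soft-threshold G toward zero; small |G| collapses to exactly 0
    shrunk = np.sign(grad_sum) * np.maximum(np.abs(grad_sum) - reg_alpha, 0.0)
    # L2 (lambda): a larger denominator shrinks every weight smoothly
    return -shrunk / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the gamma penalty for the extra leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

grad_sums = np.array([-8.0, -0.5, 0.3, 4.0])   # hypothetical per-leaf G
hess_sums = np.array([4.0, 4.0, 4.0, 4.0])     # hypothetical per-leaf H

no_reg = leaf_weight(grad_sums, hess_sums, reg_lambda=0.0)
with_l2 = leaf_weight(grad_sums, hess_sums, reg_lambda=4.0)
with_l1 = leaf_weight(grad_sums, hess_sums, reg_lambda=0.0, reg_alpha=1.0)
print("no regularization:", no_reg)
print("lambda=4:         ", with_l2)
print("alpha=1:          ", with_l1)

# The same candidate split is kept with gamma=0 and rejected with gamma=5
gain_free = split_gain(-4.0, 2.0, 3.0, 2.0, gamma=0.0)
gain_penalized = split_gain(-4.0, 2.0, 3.0, 2.0, gamma=5.0)
print("gain:", gain_free, "gain with gamma=5:", gain_penalized)
```

With these numbers, `lambda=4` halves every weight, while `alpha=1` sets the two smallest weights to zero and leaves the larger ones only mildly reduced.
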
### Base Learners in XG Boost:
* __Linear Base Learner:__
    * sum of linear terms
    * When you combine many linear base learners into a boosted model, you get a weighted sum of linear models, which is itself linear.
    * Ensembles of linear base learners are rarely used: you cannot get any non-linear combination of features in the final model, so you can get identical performance from a regularized linear model.
* __Tree Base Learner:__
    * Decision Tree
    * Boosted model is weighted sum of decision trees (non-linear)
    * Almost exclusively used in XG Boost
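
The claim that a boosted sum of linear base learners is itself linear can be checked numerically. In this sketch the feature matrix, coefficients, and learner weights are all random or made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
X_demo = rng.normal(size=(6, 3))   # made-up feature matrix

# Four hypothetical linear base learners: (coefficients, intercept)
learners = [(rng.normal(size=3), rng.normal()) for _ in range(4)]
learner_weights = np.array([0.3, 0.3, 0.3, 0.3])   # e.g. a shared learning rate

# Boosted prediction: weighted sum of the individual linear predictions
ensemble_pred = sum(w * (X_demo @ coef + b)
                    for w, (coef, b) in zip(learner_weights, learners))

# The same prediction from ONE linear model with combined parameters
combined_coef = sum(w * coef for w, (coef, _) in zip(learner_weights, learners))
combined_intercept = sum(w * b for w, (_, b) in zip(learner_weights, learners))
single_pred = X_demo @ combined_coef + combined_intercept

print(np.allclose(ensemble_pred, single_pred))
```

Because the two predictions match, the ensemble adds no expressive power over a single linear model, which is why tree base learners are the usual choice.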

### Creating DataFrames from multiple equal-length lists:
* Recap of `zip()` and `list()`
* `zip()` creates a lazy iterator of parallel values
* `list(zip([1,2,3],["a","b","c"]))` = `[(1, "a"), (2, "b"), (3, "c")]`
* The iterator needs to be completely instantiated before it can be used in a DataFrame object.
* `list()` instantiates the full iterator, and passing the resulting list into the DataFrame constructor converts the whole expression.
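
A minimal sketch of the pattern, using made-up values for the two lists:

```python
import pandas as pd

reg_values = [1, 10, 100]
scores = [0.5, 0.4, 0.45]          # made-up numbers for illustration

pairs = zip(reg_values, scores)    # lazy: nothing has been paired up yet
rows = list(pairs)                 # materialize into a list of tuples
print(rows)

# Each tuple becomes one row; the column names are supplied separately
df = pd.DataFrame(rows, columns=["l2", "rmse"])
print(df)

# The zip object is exhausted after one pass
leftover = list(pairs)
print(leftover)
```

The last line prints an empty list, which is why the result of `zip()` should be materialized once and reused rather than iterated twice.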

```
#Import pandas for the summary table at the end
import pandas as pd
#Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
reg_params = [1, 10, 100]
#Create the initial parameter dictionary for varying l2 strength: params
params = {"objective":"reg:squarederror","max_depth":3}
#Create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []
#Iterate over reg_params
for reg in reg_params:
    #Update l2 strength
    params["lambda"] = reg
    #Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)    
    #Append best rmse (final round) to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])
# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2","rmse"]))
```

```
#Import matplotlib for displaying the plots
import matplotlib.pyplot as plt
#Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
#Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":2}
#Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)
#Plot the first tree
xgb.plot_tree(xg_reg, num_trees= 0)
plt.show()
#Plot the fifth tree
xgb.plot_tree(xg_reg, num_trees= 4)
plt.show()
#Plot the last tree sideways
xgb.plot_tree(xg_reg, num_trees= 9, rankdir= "LR")
plt.show()
```

### Visualizing feature importances: What features are most important in my dataset:
Another way to visualize your XGBoost models is to examine the importance of each feature column in the original dataset within the model.

One simple way of doing this involves counting the number of times each feature is split on across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the features ordered according to how many times they appear. XGBoost has a `plot_importance()` function that allows you to do exactly this.
```
#Import matplotlib for displaying the plot
import matplotlib.pyplot as plt
#Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(X, y)
#Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":4}
#Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix ,num_boost_round=10)
#Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()
```
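
The split-counting idea described above can be sketched without XGBoost. The feature names and the per-tree split lists below are hypothetical; in a real model they would come from the trained trees.

```python
from collections import Counter

# Hypothetical record of which feature each split uses, one list per tree
splits_per_tree = [
    ["rooms", "distance", "rooms"],
    ["rooms", "age"],
    ["distance", "rooms", "age", "rooms"],
]

# Count how often each feature is split on across all trees
split_counts = Counter(f for tree in splits_per_tree for f in tree)

# Order the features from most to least frequently used
ranked = split_counts.most_common()
for name, count in ranked:
    print(f"{name:<10}{count}")
```

Split counts are only one notion of importance: a feature that is split on often is not necessarily the one that reduces the loss the most.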
