# First Notebook: Basic Walkthrough of XGBoost
@dzhang203 // init 2019-05-21, updated 2019-06-12

Resources:

* [Official XGBoost python introduction](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)
* [XGBoost python demos](https://github.com/dmlc/xgboost/tree/master/demo/guide-python)

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
# import scipy.sparse
import xgboost as xgb

In [None]:
os.getcwd()

In [None]:
PATH_DATA = '../data/'

# Load data
Data for XGBoost models are stored in the [DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix) data structure. Unlike the pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), DMatrix is optimized for memory efficiency and training speed rather than ad hoc inspection of the data.

XGBoost can [read data](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface) from many formats, including csv files, pandas DataFrames, and more.

One possible data workflow for an XGBoost project:
1. Query the data from source tables.
2. Inspect a sample of the raw data using pandas, matplotlib, and seaborn.
3. Clean the raw data (using pandas for small data, or some more scalable solution for big data).
4. Prepare the raw data (using natural language processing, [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for categorical data, etc.).
5. Train your boosting model.
6. Evaluate performance, troubleshoot, understand, and visualize.

In [None]:
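# load data from text files (LibSVM format)
# note: recent XGBoost releases may require appending '?format=libsvm' to each path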
dtrain = xgb.DMatrix(PATH_DATA + 'agaricus.txt.train')
dtest = xgb.DMatrix(PATH_DATA + 'agaricus.txt.test')
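
The agaricus files are plain text in LibSVM format: each line holds a label followed by `index:value` pairs for the nonzero features. The parser below is a rough pure-Python illustration of that layout, not how XGBoost reads the files internally. The function name and the example line are made up.

```python
def parse_libsvm_line(line):
    """Split one LibSVM-format line into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(':')
        features[int(index)] = float(value)
    return label, features

# Made-up example line: label 1, with features 3, 10, and 20 set to 1.
label, features = parse_libsvm_line('1 3:1 10:1 20:1')
print(label, features)
```

Features that do not appear on a line are treated as missing or zero, which keeps the files compact when most features are zero.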

In [None]:
dtrain.num_col()

In [None]:
dtrain.num_row()

In [None]:
print(dtrain.feature_names)

In [None]:
# specify parameters via map
param2 = {
    'max_depth': 2,
    'eta': 1,
    'verbosity': 0,  # suppress library messages (older releases used 'silent': 1)
    'objective': 'binary:logistic'
}

param3 = {
    'max_depth': 3,
    'eta': 1,
    'verbosity': 0,
    'objective': 'binary:logistic'
}

In [None]:
# specify validations to watch performance
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# Train model

In [None]:
# train the depth-2 model, reporting metrics on the watchlist after each round
bst2 = xgb.train(param2,
                 dtrain,
                 num_boost_round=num_round,
                 evals=watchlist)

In [None]:

# train the depth-3 model with the same data and number of rounds
bst3 = xgb.train(param3,
                 dtrain,
                 num_boost_round=num_round,
                 evals=watchlist)

# Predict on the test set

In [None]:
preds = bst3.predict(dtest)
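
With the `binary:logistic` objective, `predict` returns probabilities rather than class labels. Roughly, each tree contributes a leaf score, the scores are summed into a raw margin, and a sigmoid maps that margin to a probability. The sketch below uses made-up leaf scores and assumes a starting margin of zero; it illustrates the idea and is not a reproduction of XGBoost's internals.

```python
import math

def sigmoid(margin):
    """Map a raw margin (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical leaf scores for one sample, one per boosting round.
leaf_scores = [1.2, 0.7]

margin = sum(leaf_scores)        # raw score before the sigmoid
probability = sigmoid(margin)    # what predict would report for this sample
print(round(probability, 3))
```

A margin of zero maps to a probability of 0.5, so thresholding probabilities at 0.5 is the same as checking the sign of the margin.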

In [None]:
labels = dtest.get_label()
# classify as positive when the predicted probability exceeds 0.5
pred_labels = (preds > 0.5).astype(int)
error = np.mean(pred_labels != labels)
print('error=%f' % error)

# Understanding our results

## Feature importance
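Per this [Medium article](https://medium.com/@srnghn/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3) by Stacey Ronaghan, **feature importance** (as computed for scikit-learn decision trees) is the total decrease in node impurity over the nodes that split on that feature, with each node weighted by the probability of reaching it. The higher the value, the more important the feature. Note that `xgb.plot_importance` defaults to `importance_type='weight'`, which counts how many times a feature is used in a split, so the plots later in this section show split counts rather than this impurity-based measure.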

We can define **node impurity** as:
$$\text{Gini impurity} \equiv \sum_{i=1}^{C} f_i(1 - f_i),$$
where $f_i$ is the frequency of label $i$ at a node and $C$ is the number of distinct labels. Intuitively, impurity is greatest when the labels at a node are evenly mixed, and lowest (zero) when every sample at the node has the same label.
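
A small sketch of the Gini impurity in its usual non-negative form, $\sum_i f_i(1 - f_i)$, evaluated on a few made-up label frequencies. The function name is just for illustration.

```python
def gini_impurity(frequencies):
    """Gini impurity for label frequencies that sum to 1."""
    return sum(f * (1.0 - f) for f in frequencies)

pure = gini_impurity([1.0, 0.0])     # every sample has the same label
mixed = gini_impurity([0.5, 0.5])    # two labels, evenly mixed
skewed = gini_impurity([0.9, 0.1])   # mostly one label
print(pure, mixed, skewed)
```

For two labels the impurity peaks at 0.5 when the node is split evenly and falls toward zero as one label dominates.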

If a split cleanly divides the samples into their label categories, then it has contributed greatly to the reduction in impurity of the labels.

If a large proportion of the samples pass through a split, then that split contributes heavily to the overall impurity decrease achieved by the complete tree.

And, finally, if a feature is used repeatedly for splits that greatly decrease node impurity for a large proportion of samples, then that feature is important for the overall predictive power of the tree.
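
The three points above can be put into numbers with a minimal sketch. It assumes the weighted impurity decrease used for scikit-learn style importances: the drop in Gini impurity at a split, multiplied by the share of all samples that reach the split. The label counts and function names are made up for illustration.

```python
def gini_from_counts(counts):
    """Gini impurity of a node given the number of samples per label."""
    total = sum(counts)
    return sum((c / total) * (1.0 - c / total) for c in counts)

def split_contribution(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching it."""
    n_parent = sum(parent)
    n_left, n_right = sum(left), sum(right)
    decrease = (gini_from_counts(parent)
                - (n_left / n_parent) * gini_from_counts(left)
                - (n_right / n_parent) * gini_from_counts(right))
    return (n_parent / n_total) * decrease

# Root split over all 100 samples that separates the two labels perfectly.
clean_root = split_contribution([50, 50], [50, 0], [0, 50], n_total=100)
# Equally clean split deeper in the tree, reached by only 20 of the 100 samples.
clean_deep = split_contribution([10, 10], [10, 0], [0, 10], n_total=100)
# Root split that barely changes the label mix.
weak_root = split_contribution([50, 50], [26, 24], [24, 26], n_total=100)
print(clean_root, clean_deep, weak_root)
```

Summing these contributions over every split that uses a given feature gives that feature's importance under this definition.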

**TODO:** What does feature importance look like if we have **highly correlated (multicollinear) features**? In regression, multicollinearity results in (1) unreliable point estimates and (2) inflated standard errors. It seems like including the same feature twice would mask the importance of either copy. In this situation it would be useful to fit a simple linear regression alongside the tree-based model to get insight into these relationships within the data.

**TODO:** See [Interpretable Machine Learning with XGBoost](https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27), a Medium article by Scott Lundberg, for a discussion of different ways to compute and visualize feature importance for tree-based models. His GitHub repo for [SHAP](https://github.com/slundberg/shap) (SHapley Additive exPlanations) contains practical examples of his unified approach to explaining the output of any machine learning model.

In [None]:
xgb.plot_importance(bst2)

In [None]:
xgb.plot_importance(bst3)

One way to normalize feature importances is to divide each by the sum of all feature importances. This is not very informative for comparing across models, though, because normalized importances naturally shrink as the number of features grows. What matters is the relative importance of features within a model; the absolute scale does not.
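
A minimal sketch of that normalization, using made-up scores in the dict shape that `Booster.get_score` returns. It also checks that the ranking of features is unchanged by the rescaling.

```python
# Hypothetical raw importance scores, e.g. split counts per feature.
raw_scores = {'f1': 6, 'f2': 3, 'f3': 1}

# Divide each score by the total so the normalized scores sum to 1.
total = sum(raw_scores.values())
normalized = {name: score / total for name, score in raw_scores.items()}

# The ranking of features is the same before and after normalizing.
rank_raw = sorted(raw_scores, key=raw_scores.get, reverse=True)
rank_normalized = sorted(normalized, key=normalized.get, reverse=True)
print(normalized)
print(rank_raw == rank_normalized)
```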

## Plotting a tree

In [None]:
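# The original call likely failed for two reasons: `bst` was never defined,
# and with num_round = 2 the only valid tree indices are 0 and 1.
# plot_tree also needs the graphviz package to be installed.
xgb.plot_tree(bst3, num_trees=1)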