# First Notebook: Basic Walkthrough of XGBoost
@dzhang203 // init 2019-05-21, updated 2019-06-12

Resources:

* [Official XGBoost python introduction](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)
* [XGBoost python demos](https://github.com/dmlc/xgboost/tree/master/demo/guide-python)

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt  # plt is the conventional alias for the pyplot submodule
# import scipy.sparse
import xgboost as xgb

In [None]:
os.getcwd()

In [None]:
PATH_DATA = '../data/'

# Load data
Data for XGBoost models are stored in the [DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix) data structure. Unlike the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) structure from pandas, DMatrix is optimized for memory efficiency and training speed rather than ad hoc inspection of the data.

XGBoost can [read data](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface) from many formats, including LibSVM text files, CSV files, NumPy arrays, SciPy sparse matrices, and pandas DataFrames.
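The `agaricus.txt.train` and `agaricus.txt.test` files used in this notebook are LibSVM text files: each line holds a label followed by sparse `index:value` pairs. To make that layout concrete, here is a minimal sketch of reading one such line in plain Python. It is not how XGBoost parses the file internally; `parse_libsvm_line` is a hypothetical helper and the sample line is made up.

```python
def parse_libsvm_line(line):
    """Parse one LibSVM-format line into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(':')
        features[int(index)] = float(value)
    return label, features

# Made-up sample line: label 1, with features 3, 10 and 21 set to 1.
label, features = parse_libsvm_line('1 3:1 10:1 21:1')
print(label, features)
```

Features that do not appear on a line are simply absent, which is why this format suits sparse data.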

One possible data workflow for an XGBoost project:
1. Query the data from source tables.
2. Inspect a sample of the raw data using pandas, matplotlib, and seaborn.
3. Clean the raw data (using pandas for small data, or some more scalable solution for big data).
4. Prepare the raw data (using natural language processing, [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for categorical data, etc.).
5. Train your boosting model.
6. Evaluate performance, troubleshoot, understand, and visualize.
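As an illustration of the one-hot encoding mentioned in step 4, here is a minimal sketch in plain Python. In practice you would likely reach for the scikit-learn `OneHotEncoder` linked above; `one_hot_encode` is a hypothetical helper and the column values are made up.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a 0/1 list with one slot per category."""
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1  # mark the slot belonging to this value's category
        rows.append(row)
    return categories, rows

# Hypothetical categorical column (values are made up for illustration).
colors = ['brown', 'white', 'brown', 'yellow']
categories, encoded = one_hot_encode(colors)
print(categories)
print(encoded)
```

Each original column becomes one binary column per category, which is the kind of numeric input a boosting model can consume.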

In [None]:
# load data from text files
dtrain = xgb.DMatrix(PATH_DATA + 'agaricus.txt.train')
dtest = xgb.DMatrix(PATH_DATA + 'agaricus.txt.test')

In [None]:
dtrain.num_col()

In [None]:
dtrain.num_row()

In [None]:
print(dtrain.feature_names)

In [None]:
# specify parameters via a dict
param = {
    'max_depth': 2,
    'eta': 1,
    'verbosity': 0,  # replaces the deprecated 'silent' parameter
    'objective': 'binary:logistic'
}
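With `'objective': 'binary:logistic'`, the model outputs a probability: the leaf values selected in each tree are summed into a raw margin, which is then passed through the logistic function. `eta` shrinks each new tree's contribution during training, so the stored leaf values already include it. The sketch below mimics that final step in plain Python; `predict_proba`, the `base_margin` default, and the leaf values are assumptions for illustration, not XGBoost internals.

```python
import math

def sigmoid(margin):
    """Logistic function: maps a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def predict_proba(leaf_values, base_margin=0.0):
    """Sum per-tree leaf values into a raw margin, then squash it to a probability."""
    return sigmoid(base_margin + sum(leaf_values))

# Made-up leaf values for a two-round model (one value per tree).
prob = predict_proba([1.2, -0.4])
print(prob)
```

A positive total margin gives a probability above 0.5, and a negative one gives a probability below it.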

In [None]:
# datasets to evaluate after each boosting round
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# Train model

In [None]:
# train model
bst = xgb.train(param,
                dtrain,
                num_boost_round=num_round,
                evals=watchlist)  # keyword arguments keep the call explicit

# Predict on the test set

In [None]:
preds = bst.predict(dtest)

In [None]:
labels = dtest.get_label()
# classify as positive when the predicted probability exceeds 0.5
print('error=%f' % np.mean((preds > 0.5) != labels))
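The error metric above is the fraction of test rows whose thresholded prediction disagrees with the label. Here is a minimal sketch of the same calculation on a small made-up example; `error_rate` is a hypothetical helper and the 0.5 threshold matches the cell above.

```python
def error_rate(preds, labels, threshold=0.5):
    """Fraction of predictions whose thresholded class differs from the label."""
    wrong = sum(1 for p, y in zip(preds, labels) if int(p > threshold) != int(y))
    return wrong / len(preds)

# Made-up probabilities and labels: the last two rows are misclassified.
preds_demo = [0.9, 0.2, 0.6, 0.4]
labels_demo = [1, 0, 0, 1]
print(error_rate(preds_demo, labels_demo))  # 0.5
```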

# Understanding our results

In [None]:
xgb.plot_importance(bst)

In [None]:
# trees are zero-indexed; with num_round = 2 the valid indices are 0 and 1
xgb.plot_tree(bst, num_trees=1)