# Training on mortgage data
#### (c) Voltron Data, Field Engineering

In this notebook, we will walk through a typical workflow for training an XGBoost model.

<a id="libraries"></a>
## Load Libraries

Let's load the libraries this notebook relies on.

In [11]:
import numpy as np
import pandas as pd
import sklearn
import xgboost as xgb
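To also see which versions are installed, one option is to query the package metadata. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is our own and not part of any of these libraries.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# these are distribution names; note that scikit-learn is imported as `sklearn`
versions = installed_versions(["numpy", "pandas", "scikit-learn", "xgboost"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

Looking up the metadata does not import the packages, so this works even when one of them is missing.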

<a id="load"></a>
### Load Data

We can load the data using `pandas.read_csv`. We've provided a helper function `load_data` that loads data from a CSV file and returns it as a `float32` NumPy array. It reads at most `n_rows` rows; if `n_rows` is 1 billion or more, it skips the cap and reads the whole file.

In [3]:
# helper function for loading data
def load_data(filename, n_rows):
    if n_rows >= 1e9:
        df = pd.read_csv(filename)
    else:
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)
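To see the row cap in action without the mortgage file, here is a small sketch that runs the same helper on an in-memory CSV. The CSV contents and column names are made up for illustration.

```python
import io

import numpy as np
import pandas as pd


def load_data(filename, n_rows):
    # same logic as the helper above
    if n_rows >= 1e9:
        df = pd.read_csv(filename)
    else:
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)


# hypothetical 5-row CSV: a label column followed by two feature columns
csv_text = "label,f0,f1\n0,1.5,2.0\n1,0.5,3.0\n0,2.5,1.0\n1,4.0,0.0\n0,3.5,2.5\n"

capped = load_data(io.StringIO(csv_text), n_rows=3)        # reads 3 rows
full = load_data(io.StringIO(csv_text), n_rows=int(1e9))   # reads everything

print(capped.shape, capped.dtype)  # (3, 3) float32
print(full.shape, full.dtype)      # (5, 3) float32
```

Note that `read_csv` treats the first line as a header by default, so `n_rows` counts data rows only.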

In [4]:
# settings
classification = True  # change this to False to use regression
n_rows = int(1e6)  # we'll read at most 1 million rows
n_categories = 2

dataset = load_data('../data/mortgage.csv', n_rows)

<a id="split"></a>
### Split Data

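We'll split our dataset into an 80% training set and a 20% validation set. The first column holds the label and the remaining columns hold the features. The split is positional: the first 80% of rows go to training and the rest to validation, with no shuffling.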

In [5]:
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

# split validation data
X_validation, y_validation = X[train_index:, :], y[train_index:]
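The positional split above is fine when row order carries no information. If the rows might be sorted (for example by date), a shuffled split is a common alternative. The sketch below is one way to do it with NumPy; the function name `shuffled_split` and the synthetic arrays are our own, not part of the original workflow.

```python
import numpy as np


def shuffled_split(X, y, train_size=0.80, seed=0):
    """Shuffle row indices, then split into train and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    train_index = int(X.shape[0] * train_size)
    train_rows, validation_rows = order[:train_index], order[train_index:]
    return X[train_rows], y[train_rows], X[validation_rows], y[validation_rows]


# small synthetic stand-in for the mortgage arrays
X_demo = np.arange(20, dtype=np.float32).reshape(10, 2)
y_demo = np.arange(10, dtype=np.float32)

Xd_train, yd_train, Xd_validation, yd_validation = shuffled_split(X_demo, y_demo)
print(Xd_train.shape, Xd_validation.shape)  # (8, 2) (2, 2)
```

Indexing with an array of row numbers copies the data, so on a large dataset this uses more memory than the slice-based split, which returns views.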

<a id="check"></a>
### Check Dimensions

We can check the dimensions and proportions of our training and validation datasets.

In [6]:
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_validation', X_validation.shape, X_validation.dtype, 'y_validation: ', y_validation.shape, y_validation.dtype)

# check the proportions
total = X_train.shape[0] + X_validation.shape[0]
print('X_train proportion:', X_train.shape[0] / total)
print('X_validation proportion:', X_validation.shape[0] / total)

X_train:  (639999, 541) float32 y_train:  (639999,) float32
X_validation (160000, 541) float32 y_validation:  (160000,) float32
X_train proportion: 0.7999997499996875
X_validation proportion: 0.2000002500003125


<a id="convert"></a>
## Convert NumPy data to DMatrix format

With our data loaded and formatted as NumPy arrays, our next step is to convert it to a `DMatrix` object that XGBoost can work with. We can instantiate an `xgboost.DMatrix` by passing in the feature matrix as the first argument, followed by the label vector using the `label=` keyword argument. To learn more about XGBoost's support for data structures other than NumPy arrays, see the documentation for the Data Interface:


https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface

In [7]:
%%time

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_validation, label=y_validation)

CPU times: user 3.49 s, sys: 2 s, total: 5.49 s
Wall time: 2.09 s


<a id="parameters"></a>
## Set Parameters

XGBoost takes three groups of parameters, which we set before training:

* General parameters select the booster, commonly a tree or linear model.
* Booster parameters depend on which booster you have chosen.
* Learning task parameters define the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

For more information on the configurable parameters within the XGBoost module, see the documentation here:


https://xgboost.readthedocs.io/en/latest/parameter.html

In [12]:
# instantiate params
params = {}

# learning task params
learning_task_params = {}
if classification:
    learning_task_params['eval_metric'] = 'auc'
    learning_task_params['objective'] = 'binary:logistic'
else:
    learning_task_params['eval_metric'] = 'rmse'
    learning_task_params['objective'] = 'reg:squarederror'
params.update(learning_task_params)
print(params)

{'eval_metric': 'auc', 'objective': 'binary:logistic'}
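The cell above sets only learning task parameters and leaves everything else at its default. As a rough sketch of how all three groups fit into one dictionary, the hypothetical helper `build_params` below adds a general parameter and two booster parameters. The values shown are illustrative (they match XGBoost's documented defaults), not tuned for the mortgage data.

```python
def build_params(classification=True):
    """Assemble an XGBoost parameter dict from the three parameter groups."""
    # general parameters: which booster to use
    general_params = {"booster": "gbtree"}

    # booster parameters: specific to the tree booster (illustrative values)
    booster_params = {"max_depth": 6, "eta": 0.3}

    # learning task parameters: objective and evaluation metric
    if classification:
        learning_task_params = {"objective": "binary:logistic", "eval_metric": "auc"}
    else:
        learning_task_params = {"objective": "reg:squarederror", "eval_metric": "rmse"}

    params = {}
    for group in (general_params, booster_params, learning_task_params):
        params.update(group)
    return params


print(build_params(classification=True))
print(build_params(classification=False))
```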


<a id="train"></a>
## Train Model

Now it's time to train our model! We can use the `xgb.train` function and pass in the parameters, training dataset, the number of boosting iterations, and the list of items to be evaluated during training. For more information on the parameters that can be passed into `xgb.train`, check out the documentation:


https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train

In [9]:
# model training settings
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
num_round = 100

In [10]:
%%time

bst = xgb.train(params, dtrain, num_round, evallist)

Parameters: { "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation-auc:0.53224	train-auc:0.53945
[1]	validation-auc:0.53329	train-auc:0.54234
[2]	validation-auc:0.53372	train-auc:0.54559
[3]	validation-auc:0.53421	train-auc:0.54707
[4]	validation-auc:0.53424	train-auc:0.54801
[5]	validation-auc:0.53431	train-auc:0.54956
[6]	validation-auc:0.53430	train-auc:0.55029
[7]	validation-auc:0.53415	train-auc:0.55161
[8]	validation-auc:0.53416	train-auc:0.55203
[9]	validation-auc:0.53351	train-auc:0.55418
[10]	validation-auc:0.53363	train-auc:0.55535
[11]	validation-auc:0.53319	train-auc:0.55646
[12]	validation-auc:0.53260	train-auc:0.55752
[13]	validation-auc:0.53226	train-auc:0.55822
[14]	validation-auc:0.53221	train-auc:0.55845
[15]	va