# XGBoost with Python

### Bagging vs. Boosting

Random Forest and XGBoost are two common approaches to supervised ML problems. Their differences can be understood through the bias-variance decomposition of the expected prediction error:

**error = bias² + variance + irreducible noise**

Random Forest is a bagging algorithm while XGBoost is a boosting algorithm. 
 
- Bagging is an ensemble technique that creates several subsets of the training data by sampling with replacement. Each subset is used to train a decision tree, and the predictions of the individual trees are averaged (or put to a majority vote for classification).
- Boosting is an ensemble technique that adds new models to correct the errors made by the existing models. Models are thus added sequentially.

In terms of reducing error, Random Forest mainly reduces variance while boosting mainly reduces bias. Random Forest uses fully grown decision trees (low bias, high variance) and lowers the variance by averaging many trees that are decorrelated through bootstrap sampling and random feature selection. Boosting works on shallow trees (high bias, low variance). In contrast to Random Forest, it reduces bias by adding up the outputs of many such trees, each fitted to the errors left by the previous ones.
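
To make the two definitions concrete, here is a minimal NumPy-only sketch. It is not how Random Forest or XGBoost are implemented; it assumes a toy one-dimensional regression problem and uses depth-1 trees (stumps) as the base model. The helper names `fit_stump`, `bagging_predict`, and `boosting_predict` are hypothetical and introduced only for illustration.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree and return it as a prediction function."""
    best = None
    for threshold in np.unique(x)[:-1]:  # every split that leaves both sides non-empty
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:  # all x identical: fall back to a constant prediction
        mean = y.mean()
        return lambda q: np.full(len(q), mean)
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)

def bagging_predict(x, y, x_new, n_models=100, seed=0):
    """Bagging: train each stump on a bootstrap sample, then average."""
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        predictions.append(fit_stump(x[idx], y[idx])(x_new))
    return np.mean(predictions, axis=0)

def boosting_predict(x, y, x_new, n_models=100, learning_rate=0.1):
    """Boosting: each new stump is fitted to the residuals of the current ensemble."""
    fitted = np.full(len(y), y.mean())
    prediction = np.full(len(x_new), y.mean())
    for _ in range(n_models):
        stump = fit_stump(x, y - fitted)  # model what is still unexplained
        fitted += learning_rate * stump(x)
        prediction += learning_rate * stump(x_new)
    return prediction

# Toy data (an assumption for this sketch): a noisy sine curve
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 6, 100))
y = np.sin(x) + rng.normal(0, 0.2, 100)

single = fit_stump(x, y)(x)
bagged = bagging_predict(x, y, x)
boosted = boosting_predict(x, y, x)

single_mse = np.mean((single - y) ** 2)
bagged_mse = np.mean((bagged - y) ** 2)
boosted_mse = np.mean((boosted - y) ** 2)
print(single_mse, bagged_mse, boosted_mse)
```

On this toy data the boosted stumps should follow the sine curve more closely than a single stump, because every round removes part of the remaining residual (a bias reduction). Bagging the same stump mostly smooths the position of the step, which is the variance-averaging effect described above.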


## Load data and libraries

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [40]:
train_values = pd.read_csv('cleanTrain.csv')

In [45]:
X = train_values.loc[:, train_values.columns != 'status_group']
y = train_values.loc[:, train_values.columns == 'status_group']

X = pd.get_dummies(X)

`status_group` is a categorical variable with the values 'functional', 'functional needs repair', and 'non functional'. The model cannot work with these string labels directly, so we first convert them to integer class codes.

In [49]:
# Map each status label to an integer class code
label_to_code = {'non functional': 0, 'functional needs repair': 1, 'functional': 2}
# Labels not listed above fall back to 2
num = [label_to_code.get(label, 2) for label in y['status_group']]

In [50]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
y = y.assign(status_group=num)


## Define fit XGBoost function

In [9]:
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.pipeline import make_pipeline
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

In [7]:
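def fit_xgb_randomcv(X, y, param_grid):
    # Hold out 30% of the data, stratified by class, for the final check
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
    # Scaling is part of the pipeline, so it is re-fitted inside each CV fold
    pipe = make_pipeline(StandardScaler(),
                         XGBClassifier(random_state=1))
    # Randomised search: at most 10 sampled settings, 3-fold cross-validation
    gs = RandomizedSearchCV(pipe, param_grid, n_iter=10, cv=3, random_state=1)
    fitted_model = gs.fit(X_train, y_train)
    print(gs.best_params_)
    # Micro-averaged F1 of the refitted best model on the held-out 30%
    y_pred = gs.predict(X_test)
    print(f1_score(y_test, y_pred, average='micro'))
    return fitted_model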
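
The function reports a micro-averaged F1 score. For a single-label multiclass target such as `status_group`, micro-averaged F1 equals plain accuracy, because every misclassified sample counts once as a false positive and once as a false negative. The sketch below checks this on made-up labels with a hand-written `micro_f1` helper (a hypothetical name, not the scikit-learn function).

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = fp = fn = 0
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp += np.sum((y_pred == label) & (y_true == label))
        fp += np.sum((y_pred == label) & (y_true != label))
        fn += np.sum((y_pred != label) & (y_true == label))
    return 2 * tp / (2 * tp + fp + fn)

# Made-up class codes, only for illustration
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 2])
y_pred = np.array([0, 2, 2, 2, 1, 1, 2, 0])

accuracy = np.mean(y_true == y_pred)
print(micro_f1(y_true, y_pred), accuracy)  # both are 0.625
```

This also means that the default accuracy scoring used during the search and the micro-F1 printed afterwards rank models in the same way.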

## Train and fine-tune parameters

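### 1. Fix parameters

A relatively large learning_rate (0.1) is chosen and n_estimators is fixed at 1000, while max_depth and min_child_weight are searched over a coarse grid.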

In [51]:
param_grid1 = {'xgbclassifier__learning_rate':[0.1],
              'xgbclassifier__n_estimators':[1000],
              'xgbclassifier__max_depth':range(3,10,2),
             'xgbclassifier__min_child_weight':range(1,6,2)}
fitted_model1 = fit_xgb_randomcv(X, y.values.ravel(), param_grid1)

{'xgbclassifier__n_estimators': 1000, 'xgbclassifier__min_child_weight': 5, 'xgbclassifier__max_depth': 9, 'xgbclassifier__learning_rate': 0.1}
0.7945005611672277
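
Because `fit_xgb_randomcv` uses a randomised search with `n_iter = 10`, not every combination in `param_grid1` is necessarily evaluated. The sketch below counts the combinations using only the standard library; the pipeline prefixes are dropped from the keys for brevity.

```python
from itertools import product

# Same candidate values as param_grid1 (pipeline prefixes dropped for brevity)
grid = {
    'learning_rate': [0.1],
    'n_estimators': [1000],
    'max_depth': range(3, 10, 2),        # 3, 5, 7, 9
    'min_child_weight': range(1, 6, 2),  # 1, 3, 5
}

combinations = list(product(*grid.values()))
n_iter = 10  # value passed to RandomizedSearchCV in fit_xgb_randomcv

print(len(combinations))               # 12 combinations in the grid
print(min(n_iter, len(combinations)))  # 10 of them are sampled
```

With 12 combinations and 10 draws, two settings are never tried, so the reported optimum is the best of the sampled settings and not necessarily the best of the whole grid. Raising `n_iter` to 12, or switching to the already imported `GridSearchCV`, would make the search exhaustive.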


### 2. Tune max_depth and min_child_weight

Since step 1 selected min_child_weight = 5 and max_depth = 9, we search one value below and one value above each of them for fine-tuning.

In [142]:
param_grid2 = {'xgbclassifier__learning_rate':[0.1],
              'xgbclassifier__n_estimators':[1000],
              'xgbclassifier__max_depth':[8, 9, 10],
             'xgbclassifier__min_child_weight':[4, 5, 6]}
fitted_model2 = fit_xgb_randomcv(X, y.values.ravel(), param_grid2)

{'xgbclassifier__n_estimators': 1000, 'xgbclassifier__min_child_weight': 6, 'xgbclassifier__max_depth': 10, 'xgbclassifier__learning_rate': 0.1}
0.7975869809203142
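
The refinement used in this step, one value on either side of the previous optimum, can also be generated programmatically. This is a small sketch; `neighbours` is a hypothetical helper, not part of scikit-learn or XGBoost.

```python
def neighbours(best, step=1, lower=0):
    """Return [best - step, best, best + step], dropping values below `lower`."""
    return [value for value in (best - step, best, best + step) if value >= lower]

best_params = {'max_depth': 9, 'min_child_weight': 5}  # optimum reported in step 1
refined_grid = {name: neighbours(value) for name, value in best_params.items()}
print(refined_grid)
# {'max_depth': [8, 9, 10], 'min_child_weight': [4, 5, 6]}
```

The optimum found in step 2 (max_depth = 10, min_child_weight = 6) again sits on the upper edge of the grid, so the same helper could be applied once more to check whether even larger values help.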


### 3. Tune gamma

In [148]:
param_grid3 = {'xgbclassifier__learning_rate':[0.1],
              'xgbclassifier__n_estimators':[1000],
              'xgbclassifier__max_depth':[10],
               'xgbclassifier__min_child_weight':[6],
              'xgbclassifier__gamma':[i/10.0 for i in range(0,5)]}
fitted_model3 = fit_xgb_randomcv(X, y.values.ravel(), param_grid3)

{'xgbclassifier__n_estimators': 1000, 'xgbclassifier__min_child_weight': 6, 'xgbclassifier__max_depth': 10, 'xgbclassifier__learning_rate': 0.1, 'xgbclassifier__gamma': 0.2}
0.8002244668911336


### 4. Reduce learning rate and increase number of trees

In [75]:
pipe4 = make_pipeline(StandardScaler(), 
                 XGBClassifier(random_state=1,
                              learning_rate=0.01,
                              n_estimators=5000,
                              max_depth=10,
                              min_child_weight=6,
                              gamma=0.2))
pipe4.fit(X, y.values.ravel())

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('xgbclassifier',
                 XGBClassifier(base_score=None, booster=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None, gamma=0.2, gpu_id=None,
                               importance_type='gain',
                               interaction_constraints=None, learning_rate=0.01,
                               max_delta_step=None, max_depth=10,
                               min_child_weight=6, missing=nan,
                               monotone_constraints=None, n_estimators=5000,
                               n_jobs=None, num_parallel_tree=None,
                               objective='binary:logistic', random_state=1,
                               reg_alpha=None, reg_lambda=None,
                               scale_pos_weight=None, subsample=

In [13]:
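# Note: X is the data pipe4 was trained on, so this is a training score,
# not an estimate of performance on unseen data
y_pred4 = pipe4.predict(X)
f1_score(y, y_pred4, average='micro')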

0.8985858585858586

## Submission

In [63]:
test_values = pd.read_csv('cleanTest.csv')

In [64]:
test_values = pd.get_dummies(test_values)
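
One-hot encoding the training and test files separately only yields matching columns if both contain exactly the same categories. Whether that holds for `cleanTrain.csv` and `cleanTest.csv` is not shown here, so the following is a hedged sketch on made-up toy frames of how the test columns could be aligned to the training layout with `DataFrame.reindex`.

```python
import pandas as pd

# Toy frames standing in for the real train/test files (hypothetical data)
train = pd.DataFrame({'amount': [10, 0, 5], 'source': ['river', 'spring', 'lake']})
test = pd.DataFrame({'amount': [3, 7], 'source': ['spring', 'dam']})

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Reindex the test columns to the training layout: categories unseen in
# training ('dam') are dropped, categories missing from test are filled with 0.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_test_aligned.columns))
```

In this notebook the equivalent call would be `test_values.reindex(columns=X.columns, fill_value=0)`, assuming `X` still holds the encoded training features.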

In [66]:
predictions = pipe4.predict(test_values)

In [67]:
submission_format = pd.read_csv('dataset/SubmissionFormat.csv', index_col='id')

In [68]:
my_submission = pd.DataFrame(data=predictions,
                             columns=submission_format.columns,
                             index=submission_format.index)

In [69]:
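# Map integer class codes back to the original status labels
code_to_label = {0: 'non functional', 1: 'functional needs repair', 2: 'functional'}
# Codes not listed above fall back to 'functional'
obj = [code_to_label.get(int(code), 'functional') for code in my_submission.values.ravel()]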

In [70]:
my_submission.status_group = obj

In [71]:
my_submission.to_csv('submission.csv')