# How Gradient Boosting Works

## 1. A Loss Function

Gradient boosting (GB) requires a differentiable loss function. At each step, a weak learner is trained on the negative gradient of the loss with respect to the predictions of the current cumulative model. For squared-error loss, this gradient is simply the residual of each prediction made by the stack of weak learners so far. More generally, it measures how much the loss would fall if a data point's predicted value moved one unit closer to its true value.

Regression usually uses squared error (MSE or RMSE); classification usually uses logarithmic loss.
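
To make the link between gradients and residuals concrete, the sketch below computes the negative gradient of the two losses mentioned above with respect to the current prediction. It assumes the squared-error loss is written as 0.5 * (y - f)^2 and that the classifier's raw score is passed through a sigmoid before the log loss is applied. The function names are illustrative and do not come from any library.

```python
import math

def neg_gradient_squared_error(y_true, y_pred):
    """Negative gradient of 0.5 * (y - f)^2 with respect to f: the residual y - f."""
    return [y - f for y, f in zip(y_true, y_pred)]

def neg_gradient_log_loss(y_true, raw_score):
    """Negative gradient of log loss with respect to the raw (pre-sigmoid) score: y - p."""
    probs = [1.0 / (1.0 + math.exp(-s)) for s in raw_score]
    return [y - p for y, p in zip(y_true, probs)]

residuals = neg_gradient_squared_error([3.0, 5.0], [2.5, 6.0])
pseudo_residuals = neg_gradient_log_loss([1, 0], [0.0, 0.0])
print(residuals)         # [0.5, -1.0]
print(pseudo_residuals)  # [0.5, -0.5]
```

In both cases the next weak learner is fit to "true value minus current prediction", which is why the gradient is often described as a residual.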

## 2. Weak Learner

Regression trees are used to predict the residuals of the previous models. Because each tree predicts a correction rather than the target itself, the outputs of all stacked models can be added together, which gives a more accurate prediction for harder-to-predict cases.

Each step learns to adjust the predicted values by grouping the hard-to-predict cases. If certain values are consistently mispredicted, the next weak learner will tend to group them together and apply a larger correction to them.

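Weak learners can be constrained by the maximum number of levels, nodes, splits, or leaf nodes. They must remain weak: if any single tree fits the data too closely, later trees have little left to correct and the ensemble tends to overfit.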

## 3. Additive Model

Weak learners are added one at a time, each trained on the loss gradient of the combined output of all previous learners. Existing learners are left unchanged when a new one is added.
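
The three ingredients above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how XGBoost is implemented: it uses squared-error loss, one-dimensional inputs, and depth-1 trees (stumps) as weak learners. The names `fit_stump`, `boost`, and `mse` and the toy data are made up for the example.

```python
def fit_stump(x, targets):
    """Fit a depth-1 regression tree: one threshold, one constant per side."""
    best = None
    for threshold in sorted(set(x))[:-1]:       # skip the max so the right side is never empty
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared error with stumps as weak learners."""
    base = sum(y) / len(y)                      # initial constant prediction
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        # Negative gradient of squared error = residuals of the current model
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Add the new learner's shrunken output to the running prediction
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Toy data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
model = boost(x, y)
mse_before = mse(y, [sum(y) / len(y)] * len(y))
mse_after = mse(y, [model(xi) for xi in x])
print(mse_before, mse_after)
```

Each stump on its own is a poor model, but because every round fits the residuals left by the previous rounds, the training error of the summed model falls below that of the initial constant prediction.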



# Using XGBoost in Python



Efficiency gains come from the DMatrix data structure, an internal container that XGBoost optimizes for memory use and training speed.

### Hyperparameters
- learning_rate: step-size shrinkage applied to each new tree added to the model, in the range [0, 1]
- max_depth: maximum number of levels in each individual weak learner
- subsample: fraction of observations sampled to train each tree
- colsample_bytree: fraction of columns sampled when training each tree
- n_estimators: number of trees to add to the model
- objective: loss function to be minimized


- gamma: minimum loss reduction required to split a node. Higher gamma -> fewer splits
- alpha (reg_alpha): L1 regularization on leaf weights
- lambda (reg_lambda): L2 regularization on leaf weights
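
As a rough illustration of how gamma and lambda interact, the sketch below evaluates the split-gain expression from the XGBoost "Introduction to Boosted Trees" tutorial, where G and H are the sums of first and second derivatives of the loss over the examples in a node. It ignores the L1 term (alpha) for simplicity, and the function names and input numbers are hypothetical.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, gamma=0.0, reg_lambda=1.0):
    """Loss reduction from splitting a leaf into two children, minus the gamma penalty."""
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma

# Same candidate split, evaluated with a low and a high gamma
gain_low_gamma = split_gain(-4.0, 5.0, 3.0, 5.0, gamma=0.0)
gain_high_gamma = split_gain(-4.0, 5.0, 3.0, 5.0, gamma=5.0)
print(gain_low_gamma, gain_high_gamma)
```

A split is only worth keeping when its gain is positive, so the same candidate split that is accepted with gamma=0 is rejected with gamma=5. A larger lambda shrinks every leaf score and therefore makes splits less attractive as well.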

In [10]:
import xgboost as xgb

clf = xgb.XGBClassifier()
print(clf)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)


In [14]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

categories = ['alt.atheism', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='train',categories = categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

vect = TfidfVectorizer()

X_train = vect.fit_transform(newsgroups_train.data)
y_train = newsgroups_train.target
X_test = vect.transform(newsgroups_test.data)
y_test = newsgroups_test.target

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = np.asarray(vect.get_feature_names_out())

In [15]:
from sklearn.metrics import classification_report
import time

In [13]:
start = time.time()
clf.fit(X_train,y_train)
print(time.time()-start)

KeyboardInterrupt: 

In [None]:
dmatrix = xgb.DMatrix(data=X_train,label=y_train)

In [None]:
start = time.time()
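# Without an explicit objective, xgb.train falls back to a regression loss
params = {'colsample_bytree': 1, 'learning_rate': 0.1, 'max_depth': 5,
          'objective': 'binary:logistic'}
clf = xgb.train(params, dtrain=dmatrix)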
print(time.time()-start)

## Feature Importances

By default, plot_importance counts the number of times a feature is split on across all boosting rounds (importance_type='weight').

In [None]:
import matplotlib.pyplot as plt

# Set the figure size before plotting so that it applies to the new figure
plt.rcParams['figure.figsize'] = [12, 12]
xgb.plot_importance(clf, max_num_features=20)
plt.show()

### Need to figure out how to sort feature importances and compare them to the Vectorizer feature_names object.
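
One possible way to do the sorting and name lookup noted above is sketched here. It assumes the importances come from `Booster.get_score()`, which returns a dict keyed by `'f<column index>'` when the DMatrix was built without feature names. The helper `rank_importances` and the example scores and names are hypothetical.

```python
def rank_importances(scores, feature_names, top_n=20):
    """Map 'f<index>' keys to names and sort by score, highest first."""
    named = [(feature_names[int(key[1:])], score) for key, score in scores.items()]
    return sorted(named, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical scores in the shape returned by Booster.get_score()
example_scores = {'f2': 7, 'f0': 3, 'f3': 12}
example_names = ['atheism', 'bible', 'god', 'religion']
ranked = rank_importances(example_scores, example_names)
print(ranked)  # [('religion', 12), ('god', 7), ('atheism', 3)]
```

With the objects defined earlier in this notebook, the call would look like `rank_importances(clf.get_score(importance_type='weight'), feature_names)`. Features that were never split on do not appear in the dict, so they are absent from the ranking.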

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Hyperparameters to sample from during the search
cv_params = {'subsample': [0.3, 0.5, 1], 'learning_rate': [0.1, 0.5],
             'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
# Fixed parameters; subsample is left out here because it is searched above
ind_params = {'n_estimators': 1000, 'seed': 0, 'colsample_bytree': 0.8,
              'objective': 'binary:logistic'}

RS = RandomizedSearchCV(xgb.XGBClassifier(**ind_params),
                        cv_params,
                        scoring='accuracy',
                        cv=5,
                        n_jobs=-1)

RS.fit(X_train,y_train)

# Getting Started With XGBoost

https://xgboost.readthedocs.io/en/latest/get_started.html

In [18]:
dtrain = xgb.DMatrix(data=X_train, label=y_train)
# dval = xgb.DMatrix(data=X_val, label=y_val)  # X_val / y_val are not defined in this notebook; dval is unused below
dtest = xgb.DMatrix(data=X_test, label=y_test)
param = {'max_depth': 2,
         'eta': 1,
         'verbosity': 0,  # replaces the deprecated 'silent' flag
         'objective': 'binary:logistic'}

# specify the evaluation sets to watch performance during training
watchlist = [(dtest, 'eval'), (dtrain, 'train')]

bst = xgb.train(param,dtrain,
                evals=watchlist,
                num_boost_round=8)

[0]	eval-error:0.387719	train-error:0.31972
[1]	eval-error:0.336842	train-error:0.224037
[2]	eval-error:0.314035	train-error:0.164527
[3]	eval-error:0.3	train-error:0.144691
[4]	eval-error:0.27193	train-error:0.11902
[5]	eval-error:0.252632	train-error:0.10035
[6]	eval-error:0.25614	train-error:0.101517
[7]	eval-error:0.268421	train-error:0.089848
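
In the log above, eval-error reaches its minimum at round 5 and then rises while train-error keeps falling, which is the usual sign of overfitting. The sketch below shows the idea behind a patience-based early-stopping rule applied to those logged values. It is a simplified stand-in for the `early_stopping_rounds` option of `xgb.train`, not XGBoost's actual implementation, and `best_round` is a hypothetical helper.

```python
def best_round(eval_errors, patience):
    """Return (best_index, stop_index) for a patience-based early-stopping rule."""
    best_index = 0
    for i, err in enumerate(eval_errors):
        if err < eval_errors[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i  # no improvement for `patience` rounds
    return best_index, len(eval_errors) - 1

# eval-error values copied from the training log above
eval_errors = [0.387719, 0.336842, 0.314035, 0.3, 0.27193, 0.252632, 0.25614, 0.268421]
best, stopped = best_round(eval_errors, patience=2)
print(best, stopped)  # 5 7
```

Because the watchlist here monitors the test set, choosing the number of rounds this way would leak test information into the model. A separate validation split is the safer choice for that purpose.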


In [21]:
preds = bst.predict(dtest)
bst.save_model('model0001.model')
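
With the 'binary:logistic' objective, `bst.predict` returns the predicted probability of the positive class rather than a class label. The sketch below shows one way to turn such probabilities into labels and compute the same kind of error rate reported in the training log, assuming the usual 0.5 threshold. The helper names and the example values are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities of the positive class to 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def error_rate(y_true, y_pred):
    """Fraction of misclassified examples."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Hypothetical probabilities and labels, for illustration only
example_probs = [0.9, 0.2, 0.6, 0.4]
example_truth = [1, 0, 0, 0]
example_labels = to_labels(example_probs)
print(example_labels)                             # [1, 0, 1, 0]
print(error_rate(example_truth, example_labels))  # 0.25
```

Applied to this notebook, `to_labels(preds)` could be compared against `y_test`, or passed to the `classification_report` function imported earlier.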

In [23]:
bst2 = xgb.Booster(model_file='model0001.model')

In [25]:
preds2 = bst2.predict(dtest)

In [32]:
import pickle
with open('xgb.pkl','wb') as outfile:
    pickle.dump(bst2,outfile)
    

In [37]:
res = xgb.cv(param, dtrain, num_boost_round=10,nfold=5)

"""
params : dict
    Booster params.
dtrain : DMatrix 
    Data to be trained.
num_boost_round : int
    Number of boosting iterations.
nfold : int
    Number of folds in CV.
stratified : bool
    Perform stratified sampling.
folds : a KFold or StratifiedKFold instance or list of fold indices
    Sklearn KFolds or StratifiedKFolds object.
    Alternatively may explicitly pass sample indices for each fold.
    For ``n`` folds, **folds** should be a length ``n`` list of tuples.
    Each tuple is ``(in,out)`` where ``in`` is a list of indices to be used
    as the training samples for the ``n`` th fold and ``out`` is a list of
    indices to be used as the testing samples for the ``n`` th fold.
metrics : string or list of strings
    Evaluation metrics to be watched in CV.
obj : function
    Custom objective function.
feval : function
    Custom evaluation function.
maximize : bool
    Whether to maximize feval.
early_stopping_rounds : int
    Activates early stopping. CV error needs to decrease at least
    every <early_stopping_rounds> round(s) to continue.
    Last entry in evaluation history is the one from best iteration.
fpreproc : function
    Preprocessing function that takes (dtrain, dtest, param) and returns
    transformed versions of those.
as_pandas : bool, default True
    Return pd.DataFrame when pandas is installed.
    If False or pandas is not installed, return np.ndarray
verbose_eval : bool, int, or None, default None
    Whether to display the progress. If None, progress will be displayed
    when np.ndarray is returned. If True, progress will be displayed at
    boosting stage. If an integer is given, progress will be displayed
    at every given `verbose_eval` boosting stage.
show_stdv : bool, default True
    Whether to display the standard deviation in progress.
    Results are not affected, and always contains std.
seed : int
    Seed used to generate the folds (passed to numpy.random.seed).
callbacks : list of callback functions
    List of callback functions that are applied at end of each iteration.
    It is possible to use predefined callbacks from the Callback API.
"""

print("")




In [38]:
print(res)

   train-error-mean  train-error-std  test-error-mean  test-error-std
0          0.319135         0.006207         0.324371        0.025551
1          0.226660         0.007444         0.275303        0.038737
2          0.179700         0.022443         0.240303        0.041859
3          0.154903         0.025005         0.213471        0.040183
4          0.130982         0.023309         0.202972        0.039993
5          0.110263         0.016434         0.180851        0.036152
6          0.098018         0.011061         0.184285        0.038796
7          0.088099         0.010866         0.176139        0.035412
8          0.079349         0.012557         0.166837        0.028136
9          0.069428         0.008737         0.155195        0.027340


In [40]:
label = dtrain.get_label()
ratio = float(np.sum(label == 0)) / np.sum(label == 1)
ratio

1.273209549071618

In [41]:
param['scale_pos_weight']=ratio
xgb.cv(param, dtrain, 2, nfold=5,
       metrics={'auc'}, seed=0)

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.715500       0.005518       0.708998      0.024066
1        0.817305       0.013760       0.767994      0.027433
