# Using XGBoost to infer links between telemetry

XGBoost stands for eXtreme Gradient BOOSTing.
It is an optimized, parallel implementation of gradient-boosted decision trees (GBDT). Like random forest, XGBoost is an ensemble of decision trees, but it combines them by boosting rather than bagging: it starts from weak learners that have high bias (HB) and low variance (LV), and each boosting round reduces the inference error mainly by lowering the bias (LB) while keeping the variance low.
Random forest, by contrast, uses fully grown decision trees with low bias (LB) and high variance (HV), and ensembles them mainly to reduce the variance.

XGBoost -> Ensemble(HB, LV) = (LB, LV)
Random Forest -> Ensemble(LB, HV) = (LB, LV)

XGBoost builds its model less randomly than random forest: each new tree is fitted using the errors left by the not-fully-grown trees built before it. Individual trees are also cheaper to build, since they are not grown completely.
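
To make the bias-reduction idea concrete, here is a minimal sketch of boosting for squared error in plain Python. It is a toy and not XGBoost's actual algorithm (no regularization, no second-order terms); the synthetic data and the helper names `fit_stump`, `predict_stump` and `boost` are made up for illustration.

```python
# Toy gradient boosting for squared error with depth-1 trees (stumps).
# Illustration only: this is not how XGBoost is implemented internally.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another on the residuals; return MSE per round."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    history = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return history


xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]  # synthetic target, chosen only for illustration
mse_history = boost(xs, ys)
print(f"MSE after 1 round:   {mse_history[0]:.3f}")
print(f"MSE after 50 rounds: {mse_history[-1]:.3f}")
```

Each stump on its own is a poor, high-bias predictor; the training error shrinks because every round corrects part of what the previous rounds left unexplained.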

## Importing xgboost

In [1]:
import xgboost as xgb

Choosing the XGBoost parameters is the tricky part. Since we have data to train and test on, we could search for the best parameters. But the following values will do fine for what we want: an estimation of dependencies, not an exact prediction.

Prediction errors can be reduced by tuning these parameters but, usually, the importance (impact) order of the predictors, i.e. the telemetry parameters used to predict another one, does not change significantly.
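
One way to check that claim on your own data is to compare the importance rankings obtained with two parameter settings. The sketch below computes a Spearman rank correlation in plain Python. The telemetry names and importance values are invented placeholders, not measured results; in practice the dictionaries could be built from a fitted model's `feature_importances_` attribute or from `Booster.get_score()`.

```python
def ranking(importances):
    """Predictor names sorted from most to least important."""
    return sorted(importances, key=importances.get, reverse=True)


def spearman(importances_a, importances_b):
    """Spearman rank correlation between two importance dicts.

    Assumes both dicts have the same keys and no tied values.
    """
    rank_a = {name: i for i, name in enumerate(ranking(importances_a))}
    rank_b = {name: i for i, name in enumerate(ranking(importances_b))}
    n = len(rank_a)
    squared_diff = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * squared_diff / (n * (n * n - 1))


# Placeholder values for two hypothetical parameter settings.
run_a = {"bus_voltage": 0.52, "battery_temp": 0.27,
         "solar_current": 0.15, "wheel_speed": 0.06}
run_b = {"bus_voltage": 0.47, "battery_temp": 0.18,
         "solar_current": 0.29, "wheel_speed": 0.06}

print(ranking(run_a))
print(ranking(run_b))
print(f"rank correlation: {spearman(run_a, run_b):.2f}")
```

A value close to 1 means the two settings order the predictors almost identically, even if the raw importance values differ.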

More info on parameters here: https://xgboost.readthedocs.io/en/latest/parameter.html

In [2]:
params = {
    'learning_rate': 0.1,   # epsilon on slides
    'gamma': 0,             # minimum loss reduction to make a split 
    'max_depth': 10,        # maximum depth of the tree
    'n_estimators': 50,     # number of boosted trees to fit
    
    'base_score': 0.5,      # the initial prediction score of all instances, global bias.
    'colsample_bylevel': 1, # subsample ratio of columns for each split, in each level
    'colsample_bytree': 1,  # subsample ratio of columns when constructing each tree.
    'max_delta_step': 0,    # maximum delta step we allow each tree's weight estimation to be    
    'min_child_weight': 1,  # minimum sum of instance weight(hessian) needed in a child.
    'missing': None,        # how to represent missing value, defaults to np.nan
    'nthread': -1,          # number of parallel threads to run xgboost
    'objective': "reg:linear", # squared-error regression (renamed "reg:squarederror" in newer releases); or 'binary:logistic'
    'reg_alpha': 0,         # L1 regularization term on weights
    'reg_lambda': 1,        # L2 regularization term on weights
    'scale_pos_weight': 1,  # balance of positive and negative weights
    'seed': 0,              # random seed, for reproducible runs
    'silent': True,         # suppress log messages
    'subsample': 1,         # subsample ratio of the training instance.
    
    # I do not really have a GPU on my laptop so I'll wait warming my hands with the CPU :)
    'predictor': "cpu_predictor", # or 'gpu_predictor'
    'tree_method': "auto"         # or "approx" or "(gpu_)hist" or "(gpu_)exact"
}
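
Several of these names have been deprecated in later XGBoost releases. The helper below is a small sketch of how an old-style dictionary could be translated. The function name `modernize` is hypothetical, and the mapping covers only the renames I am sure of (`nthread`, `seed`, `silent`, `reg:linear`); check the release notes of your installed version before relying on it.

```python
# Old keyword -> new keyword in the scikit-learn wrapper.
RENAMED_KEYS = {"nthread": "n_jobs", "seed": "random_state"}
# Old objective name -> new objective name.
RENAMED_OBJECTIVES = {"reg:linear": "reg:squarederror"}


def modernize(old_params):
    """Return a copy of old_params using the newer parameter names."""
    new_params = {}
    for key, value in old_params.items():
        if key == "silent":
            # 'silent' was replaced by 'verbosity' (0 = silent, 1 = warnings).
            new_params["verbosity"] = 0 if value else 1
        elif key == "objective":
            new_params[key] = RENAMED_OBJECTIVES.get(value, value)
        else:
            new_params[RENAMED_KEYS.get(key, key)] = value
    return new_params


# A small old-style dictionary, used here only to show the translation.
legacy = {
    "learning_rate": 0.1,
    "nthread": -1,
    "objective": "reg:linear",
    "seed": 0,
    "silent": True,
}
modern = modernize(legacy)
print(modern)
```

Keys that are not in the mapping are passed through unchanged, so the helper is safe to apply to a full parameter dictionary.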

## Creating the model object to be trained

We use the *XGBClassifier* class, which gives a model that follows the scikit-learn design.
This lets us use pandas DataFrames with it, along with the standard scikit-learn interfaces .fit(), .predict() and .score().

In [3]:
# scikit-learn is needed for this form of model
model = xgb.XGBClassifier(**params)