## XGBOOST

### Introduction

XGBoost is based on GBM (Gradient Boosting Machine) and addresses two of its weaknesses: long training time and the absence of regularization.  
Pros of XGBoost:  
1. Supports parallel training across multiple CPU threads.  
2. High predictive performance on both classification and regression.  
3. Shorter training time than GBM.  
4. Built-in regularization to limit overfitting.  
5. Tree pruning.  
6. Built-in cross validation (plus early stopping).  
7. Built-in handling of missing (null) values.

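NOTE: XGBoost has two Python interfaces: the native Python wrapper module and the scikit-learn wrapper. With the native Python wrapper, the scikit-learn estimator methods such as .fit() and .predict() are not available; training goes through xgb.train() on a DMatrix instead.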

In [1]:
import xgboost as xgb
from xgboost import XGBClassifier



### Python Wrapper XGBoost module

General hyperparameters: number of threads, silent mode, etc. These are usually left at their default values.  
1. booster [Default = gbtree]: Select gbtree (tree-based model) or gblinear (linear model).  
2. silent [Default = 0]: Set to 1 to suppress output messages. (Recent XGBoost releases deprecate silent in favor of verbosity.)  
3. nthread [Default = all available CPU threads]: Number of CPU threads used for execution.

Booster hyperparameters: tree optimization, boosting, regularization, etc. Most hyperparameters fall into this group.  
1. eta [Default = 0.3]: Same as the learning rate in GBM. Takes a value in [0, 1]. (In the scikit-learn wrapper it is called learning_rate; older releases of that wrapper default to 0.1.)  
2. num_boost_round: Number of boosting rounds; same as n_estimators in GBM.  
3. min_child_weight [Default = 1]: Minimum sum of instance weights required in a child node for a further split. The larger min_child_weight, the fewer splits. Used to control overfitting.  
4. gamma [Default = 0]: Also called min_split_loss. Minimum loss reduction required to split a leaf node further. A leaf is split only if the loss reduction is larger than gamma, so a larger gamma means less overfitting.  
5. max_depth [Default = 6]: Same as max_depth in decision trees.  
6. subsample [Default = 1]: Same as subsample in GBM. Fraction of the training rows sampled for each tree, used to keep trees from growing too large and overfitting.  
7. colsample_bytree [Default = 1]: Similar to max_features in GBM. Fraction of features (columns) sampled for each tree; useful when there are so many features that the model overfits.  
8. lambda [Default = 1]: L2 regularization strength. Consider it when there are many features. The larger lambda, the less overfitting.  
9. scale_pos_weight [Default = 1]: Balances positive and negative weights for datasets with an imbalanced class distribution.  
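
How gamma, lambda, and min_child_weight interact can be sketched with the split-gain formula from the XGBoost documentation, where G and H are the sums of first- and second-order gradients in a node. This is a simplified illustration, not XGBoost's actual implementation; the function names and the gradient statistics are hypothetical.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score of a leaf with gradient sum G and hessian sum H."""
    return G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, penalised by gamma."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def accept_split(G_left, H_left, G_right, H_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep the split only if both children are heavy enough and the gain is positive."""
    if H_left < min_child_weight or H_right < min_child_weight:
        return False
    return split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma) > 0.0


# Hypothetical gradient/hessian sums for one candidate split.
stats = dict(G_left=-4.0, H_left=3.0, G_right=5.0, H_right=4.0)

print(accept_split(**stats))                        # True: gain is positive
print(accept_split(**stats, gamma=10.0))            # False: gamma exceeds the raw gain
print(accept_split(**stats, min_child_weight=3.5))  # False: left child is too light
```

Raising gamma or min_child_weight rejects splits that the default settings would accept, and a larger lambda shrinks every leaf score, which is why all three act as brakes on tree growth.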

Learning task parameters: objective function, evaluation metrics, etc.  
1. objective: Defines the loss function. Common values:  
- binary:logistic: Binary classification.  
- multi:softmax: Multiclass classification; returns the predicted class. Requires the num_class parameter.  
- multi:softprob: Similar to multi:softmax, but returns the predicted probability of each class.  
2. eval_metric [Default = rmse (regression), error (classification)]: Defines the validation metric. (From XGBoost 1.3 the default for binary:logistic is logloss rather than error.)  
- rmse: Root Mean Square Error  
- mae: Mean Absolute Error  
- logloss: Negative log-likelihood  
- error: Binary classification error rate (0.5 threshold)  
- merror: Multiclass classification error rate  
- mlogloss: Multiclass logloss  
- auc: Area under the curve
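
To make two of the metrics above concrete, here is a plain-Python sketch of `error` and `logloss` for binary classification. The helper names and the sample values are illustrative assumptions; XGBoost computes these metrics internally.

```python
import math


def error_rate(y_true, y_prob, threshold=0.5):
    """Fraction of misclassified samples; probabilities above the threshold count as class 1."""
    wrong = sum(int(p > threshold) != y for y, p in zip(y_true, y_prob))
    return wrong / len(y_true)


def logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 0.7]

print(error_rate(y_true, y_prob))  # 0.25: only the third sample is misclassified
print(logloss(y_true, y_prob))
```

`error` only looks at which side of the threshold a prediction falls on, while `logloss` also penalizes predictions that are correct but not confident.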

Example: when the model overfits significantly, consider tuning the hyperparameters as follows.  
1. Reduce eta (learning_rate). When eta is reduced, n_estimators (num_boost_round) has to be raised to compensate.  
2. Reduce max_depth.  
3. Raise min_child_weight.  
4. Raise gamma.  
5. Adjust (typically lower) subsample and colsample_bytree -> simpler trees -> less overfitting.
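
The five adjustments can be expressed as a pair of parameter dictionaries for the native wrapper. All concrete values here are hypothetical starting points chosen only to show the direction of each change, not recommended settings.

```python
# Baseline: XGBoost defaults for the parameters discussed above.
baseline = {
    "objective": "binary:logistic",
    "eta": 0.3,
    "max_depth": 6,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
baseline_rounds = 100

# More conservative variant: each override follows one suggestion from the list.
conservative = {
    **baseline,
    "eta": 0.05,              # 1. smaller learning rate ...
    "max_depth": 3,           # 2. shallower trees
    "min_child_weight": 5,    # 3. heavier children required
    "gamma": 1.0,             # 4. larger minimum loss reduction
    "subsample": 0.8,         # 5. sample rows ...
    "colsample_bytree": 0.8,  #    ... and columns per tree
}
conservative_rounds = 600     # 1. ... compensated by more boosting rounds
```

Either dictionary can be passed as the `params` argument of `xgb.train()`, with the matching round count as `num_boost_round`.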

Early Stopping  
- If the prediction error stops improving, boosting can stop before reaching n_estimators rounds.
- This reduces execution time.
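
The stopping rule can be sketched in plain Python. The function name, the `patience` argument, and the error values are illustrative assumptions; in the native wrapper the equivalent behavior is requested through the `early_stopping_rounds` argument of `xgb.train()` together with an `evals` list.

```python
def run_with_early_stopping(val_errors, patience):
    """Return (stopped_at, best_round) for a sequence of per-round validation errors.

    Boosting stops once the error has failed to improve for `patience` consecutive rounds.
    """
    best_error = float("inf")
    best_round = -1
    for round_idx, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_round = err, round_idx
        elif round_idx - best_round >= patience:
            return round_idx, best_round
    return len(val_errors) - 1, best_round


# Hypothetical validation errors for nine boosting rounds.
errors = [0.30, 0.22, 0.18, 0.17, 0.171, 0.175, 0.172, 0.169, 0.16]
stopped_at, best_round = run_with_early_stopping(errors, patience=3)
print(stopped_at, best_round)  # 6 3
```

In this example the last two rounds are never run, which saves time but also shows the risk of a small patience value: a later improvement can be missed.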

### Wisconsin Breast Cancer Prediction by Python Wrapper XGBoost Example

In [2]:
xgb.__version__

'1.5.0'

In [8]:
from xgboost import plot_importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

dataset = load_breast_cancer()
X_features = dataset.data
print(dataset.feature_names)

y_label = dataset.target
print(dataset.target_names)

cancer_df = pd.DataFrame(data=X_features, columns=dataset.feature_names)
cancer_df['target'] = y_label
cancer_df.head()

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [4]:
cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [9]:
cancer_df['target'].value_counts()

1    357
0    212
Name: target, dtype: int64

In [None]:
# Split features and labels (not the whole DataFrame twice) into 80% train / 20% test.
X_train, X_test, y_train, y_test = train_test_split(X_features, y_label, test_size=0.2, random_state=156)
