# Extreme Gradient Boosting

1. Regularization:
    - Standard GBM implementations have no regularization, whereas XGBoost adds L1 and L2 penalty terms to its objective, which helps reduce overfitting.
    - For this reason XGBoost is also known as a "regularized boosting" technique.
2. Parallel Processing:
    - XGBoost implements parallel processing and is typically much faster than a standard GBM.
    - Boosting is a sequential process, so how can it be parallelized? Each tree can be built only after the previous one, but nothing stops us from using all cores to build a single tree: XGBoost parallelizes the search for the best split across features within each tree.
3. High Flexibility
    - XGBoost allows users to define custom optimization objectives and evaluation criteria.
    - This makes it possible to tailor the model to problem-specific losses and metrics rather than being limited to the built-in ones.
4. Handling Missing Values
    - XGBoost has a built-in routine to handle missing values.
    - The user marks missing entries with a value that does not occur in the other observations and passes that value as a parameter (`missing`). At each node XGBoost tries sending the missing values down each branch and learns which path to take for missing values in the future.
5. Tree Pruning:
    - A GBM stops splitting a node as soon as a split yields a negative gain, so it is more of a greedy algorithm.
    - XGBoost, on the other hand, makes splits up to the specified `max_depth` and then prunes the tree backwards, removing splits beyond which there is no positive gain. This way a split with a small negative gain can survive if a split below it more than makes up for it.
6. Built-in Cross-Validation
    - XGBoost allows the user to run cross-validation at each iteration of the boosting process, so it is easy to find a good number of boosting iterations in a single run.
    - This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
7. Continue on Existing Model
    - The user can resume training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain applications.
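
To make the regularization and pruning points a little more concrete, here is a minimal sketch of the regularized split gain from the XGBoost paper, where `lam` is the L2 penalty on leaf weights and `gamma` is the per-split complexity penalty. The gradient statistics below are made-up numbers for illustration only, and the function names are hypothetical, not part of the library.

```python
def leaf_score(G, H, lam):
    """Score of a leaf with gradient sum G and hessian sum H: G^2 / (H + lambda)."""
    return G * G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Regularized gain of splitting a node into a left and a right child."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient statistics for one candidate split
stats = dict(G_left=-4.0, H_left=5.0, G_right=3.0, H_right=4.0)

gain_unregularized = split_gain(**stats, lam=0.0, gamma=0.0)
gain_regularized = split_gain(**stats, lam=1.0, gamma=0.0)
gain_with_penalty = split_gain(**stats, lam=1.0, gamma=3.0)

# A split whose gain is not positive is a candidate for pruning
keep_split = gain_with_penalty > 0

print("gain, no regularization:", round(gain_unregularized, 4))
print("gain, lambda=1:         ", round(gain_regularized, 4))
print("gain, lambda=1, gamma=3:", round(gain_with_penalty, 4))
print("keep split:", keep_split)
```

A larger `lam` shrinks the gain of every split, and a larger `gamma` raises the bar a split has to clear, so both push the model towards smaller trees.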

### Not Scaled

import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Import metrics that show how accurate the predictions are
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


# Load data
data = pd.read_csv('Data/11-diabetes.csv')

# Split data into X and y
X = data.iloc[:, 0:8]
y = data.iloc[:,8]

# Split data into train and test sets
seed = 100
test_size = 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# Fit model
model = XGBClassifier()
model.fit(X_train, y_train)

# Make predictions for test data (predict() already returns class labels)
y_pred = model.predict(X_test)

# Evaluate
print('CM:')
print(confusion_matrix(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Evaluation Metrics:')
print('Accuracy Score: {}'.format(round(accuracy_score(y_test, y_pred), 4)))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

CM:
[[104  23]
 [ 29  36]]
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       127
           1       0.61      0.55      0.58        65

   micro avg       0.73      0.73      0.73       192
   macro avg       0.70      0.69      0.69       192
weighted avg       0.72      0.73      0.73       192

Evaluation Metrics:
Accuracy Score: 0.7292
Precision: 0.6101694915254238
Recall: 0.5538461538461539
F1 Score: 0.5806451612903227
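
As a sanity check, the metrics above can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix from the unscaled run: rows = true class, columns = predicted class
tn, fp = 104, 23
fn, tp = 29, 36

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: ", round(accuracy, 4))
print("Precision:", round(precision, 4))
print("Recall:   ", round(recall, 4))
print("F1 Score: ", round(f1, 4))
```

The results agree with the reported values up to rounding, which confirms that precision, recall and F1 refer to the positive class (label 1).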


### Scaled

from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
# scaler = MinMaxScaler()
# scaler = RobustScaler()
scaler = StandardScaler()

# Split data into train and test sets
seed = 150
test_size = 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# Transform the variables to be on the same scale
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit model
model = XGBClassifier()
model.fit(X_train, y_train)

# Make predictions for test data (predict() already returns class labels)
y_pred = model.predict(X_test)

# Evaluate
print('CM:')
print(confusion_matrix(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Evaluation Metrics:')
print('Accuracy Score: {}'.format(round(accuracy_score(y_test, y_pred), 4)))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

CM:
[[106  23]
 [ 26  37]]
Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.82      0.81       129
           1       0.62      0.59      0.60        63

   micro avg       0.74      0.74      0.74       192
   macro avg       0.71      0.70      0.71       192
weighted avg       0.74      0.74      0.74       192

Evaluation Metrics:
Accuracy Score: 0.7448
Precision: 0.6166666666666667
Recall: 0.5873015873015873
F1 Score: 0.6016260162601625


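The scaled run shows a small improvement on paper (accuracy 0.7292 to 0.7448, F1 0.5806 to 0.6016). Note, however, that the two runs use different split seeds (100 vs. 150), so they are evaluated on different test sets (class supports 127/65 vs. 129/63). The difference therefore cannot be attributed to scaling alone. Tree-based models such as XGBoost split on feature thresholds and are largely insensitive to monotonic rescaling of the inputs.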
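
As a rough illustration of why feature scaling is not expected to change much for tree-based models, the sketch below finds the best single-threshold split on one feature before and after standardization and checks that both send the same samples to the left branch. It is a simplification: it uses Gini impurity rather than XGBoost's gradient-based gain, and the feature values and labels are made up.

```python
from statistics import mean, pstdev


def gini(labels):
    """Gini impurity of a list of binary labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_partition(values, labels):
    """Indices sent to the left branch by the threshold split with the lowest weighted Gini."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = None, None
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical feature values and binary labels
raw = [85, 168, 99, 137, 116, 183, 110, 150]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Standardize the feature: zero mean, unit variance
mu, sigma = mean(raw), pstdev(raw)
scaled = [(v - mu) / sigma for v in raw]

same_split = best_partition(raw, labels) == best_partition(scaled, labels)
print("Same samples go left before and after scaling:", same_split)
```

Standardization preserves the ordering of the values, so every candidate threshold partitions the samples in the same way. A fair comparison of the scaled and unscaled models would reuse the same `random_state` for both splits.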