## XGBoost
XGBoost builds its model out of decision trees, each of which works like a flow chart. We fit a first decision tree to the training data, look at where its predictions are wrong, and fit a second tree that corrects those errors. We keep adding trees this way until the loss is low enough or we reach a set number of trees. The final prediction combines the outputs of all the trees, not just the last one.\
How wrong the model is gets measured by a loss function. In the case of classification, this is usually the Cross-Entropy Loss Function (CELF), which penalizes confident wrong predictions the most. In the case of regression, it would most likely be the Mean Squared Error Loss Function (MSLF). Accuracy is not defined for regression problems, so there we report the error directly. (There are other loss functions, but these are the most common.)\
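How does XGBoost know what the next tree should correct? That is where we use gradients. If we graph the loss function on the y-axis and the predicted outcome on the x-axis, we can see a minimum value for the loss function, which is where we want to be, since we want to minimize loss. The gradient is the change in the loss function over the change in the predicted outcome. It tells us which direction to move each prediction to reduce the loss, and each new tree is fit to move the predictions that way. We stop once we reach a chosen number of trees or the loss on held-out data stops improving.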
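
The idea of fitting each new tree to the errors left by the trees before it can be sketched in plain Python. This is a minimal illustration of gradient boosting with squared-error loss and one-split trees ("stumps"), not XGBoost's actual algorithm, which also uses second derivatives and regularization. The helper names (`fit_stump`, `boost`, and so on), the toy data, and the learning rate are all assumptions made for the example.

```python
# Minimal gradient-boosting sketch for squared-error loss (not XGBoost itself).
# All helper names and the toy data below are hypothetical, for illustration only.

def fit_stump(xs, targets):
    """Find the single threshold on xs whose two leaf means best fit targets."""
    best = None
    # Skip the largest x as a threshold so the right-hand leaf is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    history = [mse(ys, preds)]        # training loss after each round
    for _ in range(n_rounds):
        # For L = 0.5 * (y - pred)^2 the negative gradient with respect to
        # pred is the residual y - pred, so each stump is fit to the residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return base, stumps, history


def predict(base, stumps, x, learning_rate=0.3):
    # The final prediction adds up every stump's output, not just the last one.
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy 1-D regression problem: y = (x - 2)^2 on 40 evenly spaced points.
xs = [i / 10 for i in range(40)]
ys = [(x - 2.0) ** 2 for x in xs]

base, stumps, history = boost(xs, ys)
print("training MSE before boosting:", history[0])
print("training MSE after boosting: ", history[-1])
```

The `history` list records the training loss after every round, which makes it easy to see the loss shrink as stumps are added. The same loop works for other losses by swapping the residual for that loss's negative gradient.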

In [2]:
#Regression
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

# Grab the California housing dataset.
housing = fetch_california_housing()

# Separate the features from the target.
X = housing.data
y = housing.target

# Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Since this is a regression problem, a squared error objective makes sense here.
xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)

# Fit the model to our training data.
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)

# For regression, we report an error metric, not the accuracy.
# Here that is the RMSE, the square root of the MSE.
mse = mean_squared_error(y_test, y_pred)
print("Root Mean Squared Error:", np.sqrt(mse))

Root Mean Squared Error: 0.47390177839101205
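
To make the relationship between MSE and RMSE concrete, here is a small hand computation in plain Python. The function name `mean_squared_error_manual` and the four sample values are hypothetical and are not taken from the housing run above. The RMSE is in the same units as the target, which for the California housing data is median house value in units of $100,000, so an RMSE near 0.47 corresponds to roughly $47,000.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values chosen for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

mse_manual = mean_squared_error_manual(y_true, y_pred)
rmse_manual = math.sqrt(mse_manual)  # back in the units of the target

print("MSE: ", mse_manual)
print("RMSE:", rmse_manual)
```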


In [4]:
#Classification
from sklearn.metrics import confusion_matrix

# Grab the data
cancer = load_breast_cancer()

# Separate the data
X = cancer.data
y = cancer.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

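# Fit the model on the training split only, so the test rows stay unseen.
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on the held-out test set.
y_pred = xgb_model.predict(X_test)

# The confusion matrix shows false positives and false negatives separately,
# which a single accuracy number hides.
# Note: the output recorded below came from a run that fit on all of X and y,
# so the test rows were seen during training and the perfect matrix is optimistic.
print(confusion_matrix(y_test, y_pred))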

[[43  0]
 [ 0 71]]
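
A confusion matrix can be turned into the usual summary metrics with a few lines of plain Python. The sketch below assumes scikit-learn's layout for a binary problem, where rows are true labels and columns are predicted labels, giving `[[TN, FP], [FN, TP]]`. The function name `binary_metrics` and the example counts are hypothetical and are not results from the model above. In the breast cancer dataset label 1 is benign, so precision and recall computed this way describe the benign class.

```python
def binary_metrics(matrix):
    """Compute accuracy, precision and recall from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Of everything predicted positive, how much really was positive?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything really positive, how much did we find?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for illustration, not output from the notebook.
example_matrix = [[40, 3], [2, 69]]
metrics = binary_metrics(example_matrix)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Two models with the same accuracy can have very different precision and recall, which is why the full matrix is worth inspecting.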
