# XGBoost with Scikit-Learn Pipeline & GridSearchCV

XGBoost provides a wrapper interface that lets you use the model as if it were any other Scikit-Learn estimator [(more info in the documentation)](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn).

In this notebook we show an example of how to use XGBoost with Pipelines and GridSearchCV, just like any other Scikit-Learn model.

In [None]:
import pandas as pd
import xgboost as xgb

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Dataset

For this example we'll use a simple dataset: [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

In [None]:
df = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")
X = df.drop(columns=["id", "Unnamed: 32", "diagnosis"])
y = df["diagnosis"].map({'B': 0, 'M': 1})

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=8)

## Define the Pipeline and GridSearch

The `XGBClassifier` class implements the Scikit-Learn interface for using XGBoost for classification. That means that it has the familiar `fit` method as well as `predict`, `score` and so on.

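The preprocessing steps used in the pipeline and the parameters being tuned were chosen only for the sake of the example. Each key in `param_grid` has the form `<step name>__<parameter>`, which tells `GridSearchCV` which pipeline step the parameter belongs to.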

In [None]:
model = xgb.XGBClassifier()

pipeline = Pipeline([
    ('standard_scaler', StandardScaler()), 
    ('pca', PCA()), 
    ('model', model)
])

param_grid = {
    'pca__n_components': [5, 10, 15, 20, 25, 30],
    'model__max_depth': [2, 3, 5, 7, 10],
    'model__n_estimators': [10, 100, 500],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='roc_auc')

In [None]:
%%time

grid.fit(X_train, y_train)

## CV results

Here are the results for the parameter combination that gave the best mean score in the k-fold cross-validation:

In [None]:
mean_score = grid.cv_results_["mean_test_score"][grid.best_index_]
std_score = grid.cv_results_["std_test_score"][grid.best_index_]

print(f"Best parameters: {grid.best_params_}")
print(f"Mean CV score: {mean_score:.6f}")
print(f"Standard deviation of CV score: {std_score:.6f}")

Feel free to ask questions or correct me if I made a mistake.

Hope this was helpful, have a nice day 🙂