## Beginner-Friendly XGBoost Tutorial for Classification

This tutorial introduces XGBoost through a small real-world classification problem: scikit-learn's breast cancer dataset. We load and explore the data, split it into training and testing sets to mimic a realistic machine-learning workflow, and then build and train an XGBoost classifier step by step. Along the way we show how class predictions are generated and how the model is evaluated with standard classification metrics: accuracy, precision, recall, and F1-score.

Throughout, the focus is on understanding model behavior rather than on accuracy alone. Feature importance shows which input variables contribute most to the classification decisions, and comparing training and testing performance checks for overfitting. By the end, you should have a clear intuition for how XGBoost learns classification patterns, how to interpret its results, and how the same workflow can later be applied to problems such as airport congestion level prediction.

### Import required libraries

In [76]:
# Data handling
import pandas as pd
import numpy as np

# Dataset and splitting
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Evaluation metrics for classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# XGBoost classifier
from xgboost import XGBClassifier

### Load the dataset

In [78]:
# Load breast cancer dataset
data = load_breast_cancer(as_frame=True)

# Features and target
X = data.data
y = data.target  # 0 = malignant, 1 = benign

# Preview the data
X.head()

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


### Understand the target variable

In [80]:
# Check class distribution
y.value_counts()

target
1    357
0    212
Name: count, dtype: int64
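
The raw counts are easier to read next to the class names and their shares. The sketch below is one possible way to build such a summary; the `summary` table and its column names are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
y = data.target

# Count each label, ordered by label value (0 first, then 1)
counts = y.value_counts().sort_index()

# Attach the human-readable class name and the share of each class
summary = pd.DataFrame({
    "label": counts.index,
    "name": [data.target_names[i] for i in counts.index],
    "count": counts.to_numpy(),
    "share": (counts / counts.sum()).to_numpy(),
})
print(summary)
```

Benign samples make up the larger class, which is worth keeping in mind when reading accuracy later: a model that always predicted benign would already be right for well over half of the samples.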

### Train-test split

In [82]:
# Split data into training and testing sets
# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check shapes
X_train.shape, X_test.shape

((455, 30), (114, 30))
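
Because the split uses `stratify=y`, the class proportions in the training and testing sets should stay close to those of the full dataset. A quick check, sketched here with an illustrative `shares` table that is not part of the original notebook, might look like this:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same split settings as in the tutorial
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of each class in the full data, the training set, and the testing set
shares = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
}).sort_index()
print(shares)
```

If the three columns differ noticeably, the split is probably not stratified, and test metrics may reflect a different class balance than the one the model was trained on.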

### Initialize XGBoost classifier

In [84]:
# Create the XGBoost classifier
# n_estimators: number of boosted trees
# max_depth: maximum depth of each tree (controls model complexity)
# learning_rate: shrinkage applied to each tree's contribution
# eval_metric: metric computed on evaluation data, if any is supplied
xgb_clf = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.05,
    random_state=42,
    eval_metric="logloss"
)

### Train the XGBoost model

In [86]:
# Train the classifier
xgb_clf.fit(X_train, y_train)


### Make predictions

In [88]:
# Predict class labels on test data
y_pred = xgb_clf.predict(X_test)

# Compare predictions with actual labels
pd.DataFrame({
    "Actual": y_test[:10].values,
    "Predicted": y_pred[:10]
})

|   | Actual | Predicted |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 0 |
| 5 | 1 | 1 |
| 6 | 1 | 1 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |


### Evaluate model performance

In [90]:
# Calculate evaluation metrics
# Precision, recall, and F1 use scikit-learn's default positive class: label 1 (benign)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Accuracy :", accuracy)
print("Precision:", precision)
print("Recall   :", recall)
print("F1 Score :", f1)

Accuracy : 0.9473684210526315
Precision: 0.9459459459459459
Recall   : 0.9722222222222222
F1 Score : 0.958904109589041


### Confusion matrix (error analysis)

In [92]:
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Convert to DataFrame for readability
pd.DataFrame(
    cm,
    index=["Actual 0", "Actual 1"],
    columns=["Predicted 0", "Predicted 1"]
)

|   | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 38 | 4 |
| Actual 1 | 2 | 70 |
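
The metrics reported earlier can be recomputed by hand from these four counts, which makes their definitions concrete. The sketch below copies the counts from the matrix above and treats label 1 (benign) as the positive class, as scikit-learn does by default. The malignant-as-positive figures at the end are an added illustration, not values reported in the original notebook.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted)
tn, fp = 38, 4   # actual 0 (malignant): predicted 0, predicted 1
fn, tp = 2, 70   # actual 1 (benign):    predicted 0, predicted 1

total = tn + fp + fn + tp

# Benign (label 1) as the positive class
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of samples predicted benign, how many are benign
recall = tp / (tp + fn)      # of benign samples, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy :", accuracy)
print("Precision:", precision)
print("Recall   :", recall)
print("F1 Score :", f1)

# Malignant (label 0) as the positive class: the roles of the cells swap
precision_malignant = tn / (tn + fn)
recall_malignant = tn / (tn + fp)
print("Malignant precision:", precision_malignant)
print("Malignant recall   :", recall_malignant)
```

With scikit-learn, the malignant-as-positive view should be obtainable by passing `pos_label=0` to `precision_score` and `recall_score`. Recall on the malignant class is lower than recall on the benign class here, because 4 of the 42 malignant test samples were predicted as benign.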


### Feature importance (model interpretability)

In [94]:
# Get feature importance from XGBoost
importance = xgb_clf.feature_importances_

# Create readable table
feature_importance = pd.DataFrame({
    "Feature": X.columns,
    "Importance": importance
}).sort_values(by="Importance", ascending=False)

feature_importance.head(10)

|   | Feature | Importance |
|---|---|---|
| 22 | worst perimeter | 0.440799 |
| 20 | worst radius | 0.20076 |
| 7 | mean concave points | 0.061267 |
| 27 | worst concave points | 0.057798 |
| 25 | worst compactness | 0.033651 |
| 23 | worst area | 0.02534 |
| 11 | texture error | 0.022951 |
| 21 | worst texture | 0.017368 |
| 3 | mean area | 0.016482 |
| 15 | compactness error | 0.014153 |


### Overfitting check (train vs test)

In [96]:
# Predictions on training data
train_pred = xgb_clf.predict(X_train)

# Training accuracy
train_accuracy = accuracy_score(y_train, train_pred)

# Testing accuracy
test_accuracy = accuracy_score(y_test, y_pred)

print("Training Accuracy:", train_accuracy)
print("Testing Accuracy :", test_accuracy)

Training Accuracy: 1.0
Testing Accuracy : 0.9473684210526315


### Key takeaways

In [98]:
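# Key ideas:
# - XGBoost builds decision trees sequentially
# - Each new tree is fit to correct the errors of the trees built before it
# - Feature importance indicates which inputs the model relies on most
# - Perfect training accuracy (1.0) against about 0.947 testing accuracy points to some overfitting
# - The same workflow applies to other real-world classification problems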
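
The idea that trees are built sequentially, each one correcting earlier errors, can be illustrated with a deliberately simplified boosting loop. This is a sketch and not XGBoost's actual algorithm: it uses scikit-learn's `DecisionTreeRegressor` and plain first-order gradients of the log loss, whereas XGBoost also uses second-order information and regularization. The names `raw_score`, `n_rounds`, and `losses` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeRegressor

X, y = load_breast_cancer(return_X_y=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

learning_rate = 0.1
n_rounds = 20

# Start from a raw score (log-odds) of 0, i.e. a probability of 0.5 for every sample
raw_score = np.zeros(len(y))
losses = [log_loss(y, sigmoid(raw_score))]

for _ in range(n_rounds):
    # Negative gradient of the log loss w.r.t. the raw score:
    # how far each current probability is from the true label
    residual = y - sigmoid(raw_score)

    # Fit a small tree to those errors and add a shrunken copy of its output
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residual)
    raw_score += learning_rate * tree.predict(X)

    losses.append(log_loss(y, sigmoid(raw_score)))

print("Log loss before boosting:", losses[0])
print("Log loss after boosting :", losses[-1])
```

Each round lowers the training loss a little, and `learning_rate` controls how much of each tree's correction is applied. This is the same role the `learning_rate` and `n_estimators` parameters play in the classifier above.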