# HiggsML Classifier – CERN Open Data Project

This notebook implements a binary classification model to separate signal and background events using the Higgs Boson Challenge dataset.

Author: Ahmet Can Çömez

## Importing Dependencies

This cell loads all required libraries for data handling, preprocessing, model training, evaluation, and persistence:

- `pandas`: for data manipulation and numerical operations
- `matplotlib.pyplot`, `seaborn`: for plotting and visualization
- `scikit-learn`: for preprocessing, model building, and evaluation
- `joblib`: to save the trained model
- `os`: for handling file paths



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score

## Loading the Dataset

The dataset is loaded from a local CSV file.
We first inspect the shape of the data to understand the number of samples and features, and then display the first few rows using `head()` to preview the structure.


In [None]:
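# Resolve the project root (one level above the notebook's working directory).
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
training_path = os.path.join(project_root, "data", "training.csv")
df = pd.read_csv(training_path)
print(df.shape)  # (number of events, number of columns)
df.head()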

## Target Variable Encoding

The categorical target column `Label` is converted to binary numerical values:
- `'s'` (signal) → 1
- `'b'` (background) → 0

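scikit-learn classifiers can be trained on string labels, but a numeric encoding makes the positive class (signal) explicit for the ROC and confusion-matrix evaluation later in the notebook.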
The updated distribution is reviewed to verify label encoding and assess class balance.


In [None]:
df['Label'] = df['Label'].map({'s': 1, 'b': 0})
df['Label'].value_counts()  # review the encoded class distribution
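
One caveat with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes `NaN`; this also happens if the encoding cell is run a second time on already-encoded labels. The following is a minimal sketch of a sanity check, using a small toy series rather than the real `df` (the `"?"` entry and the variable names are illustrative assumptions):

```python
import pandas as pd

# Toy labels standing in for df['Label']; "?" is a deliberately invalid entry.
labels = pd.Series(["s", "b", "b", "s", "b", "?"])
encoded = labels.map({"s": 1, "b": 0})

# Values not present in the mapping end up as NaN.
n_unmapped = int(encoded.isna().sum())

# Relative class frequencies among the successfully encoded labels.
class_share = encoded.dropna().astype(int).value_counts(normalize=True)

print(f"Unmapped labels: {n_unmapped}")
print(class_share)
```

On the real data, a non-zero unmapped count would suggest the column contained unexpected values or had already been encoded.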

## Dataset Overview

A concise summary of the dataset is obtained using `df.info()`.
This includes column data types, non-null counts, and memory usage.
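The output is useful for spotting null entries and verifying dataset structure prior to preprocessing.
Note that the Higgs Boson Challenge data marks undefined feature values with the placeholder `-999.0` rather than with nulls, so `df.info()` alone will not reveal them.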


In [None]:
df.info()
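
Since `df.info()` only reports true nulls, placeholder values need a separate check. The sketch below counts a sentinel value per column on a toy frame. It assumes the sentinel is `-999.0`, and the two column names are used for illustration only:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "DER_mass_MMC": [125.3, -999.0, 98.1, -999.0],
    "PRI_tau_pt": [32.6, 41.0, 27.9, 55.2],
})

SENTINEL = -999.0  # assumed placeholder for undefined values

# Count sentinel entries per column.
sentinel_counts = (toy == SENTINEL).sum()

# Optionally convert sentinels to NaN so pandas treats them as missing.
toy_nan = toy.replace(SENTINEL, np.nan)
null_counts = toy_nan.isna().sum()

print(sentinel_counts)
```

Whether to convert the sentinels to `NaN` depends on the estimator: some models require imputation first, while tree-based models can often work with the placeholder as-is.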

## Visualizing Class Distribution

This count plot displays the distribution of the target classes:
Signal (`1`) and Background (`0`).
It helps identify any class imbalance that may affect model performance.


In [None]:
sns.countplot(x='Label', data=df)
plt.title("Signal (1) vs Background (0) Distribution")
plt.xlabel("Label")
plt.ylabel("Count")
plt.grid(True)
plt.show()


## Feature and Target Definition

The dataset is split into input features (`X`) and target labels (`Y`).
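The `Label`, `Weight`, and `EventId` columns are excluded from the feature matrix: `Label` is the target, `EventId` is only an identifier, and `Weight` is a per-event weight that is not a physical observable and could leak information about the label if used as a feature.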
This step structures the data appropriately for supervised learning.


In [None]:
X = df.drop(['Label', 'Weight', 'EventId'], axis=1)
Y = df['Label']

## Train-Test Split and Feature Scaling

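The dataset is divided into training and testing subsets using an 80-20 split with a fixed `random_state` for reproducibility.
Feature standardization is then applied using `StandardScaler`, which is fitted on the training subset only and then applied to both subsets, so that no information from the test set leaks into preprocessing.
Standardization matters most for scale-sensitive models such as linear models, SVMs, and neural networks. Tree-based models like gradient boosting split on feature thresholds and are largely unaffected by rescaling, so the step is not strictly required here, but it does no harm.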


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
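
The split above is purely random. If the class imbalance turns out to be noticeable, one option is a stratified split, which keeps the signal fraction the same in both subsets. This is a sketch on synthetic labels, not the notebook's actual configuration; the 34% signal share is an assumed figure for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 340 signal and 660 background events.
y_toy = np.array([1] * 340 + [0] * 660)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Signal fraction: train={train_rate:.3f}, test={test_rate:.3f}")
```

In the notebook this would amount to passing `stratify=Y` to the existing `train_test_split` call.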

## Gradient Boosting Classifier Training

A `GradientBoostingClassifier` is instantiated with the following hyperparameters:
- `learning_rate = 0.1`
- `max_depth = 5`
- `n_estimators = 200`
- `subsample = 0.8`

The model is fitted on the standardized training data.
After training, class predictions (`y_pred`) and predicted probabilities (`y_score`) are computed on the test set for subsequent evaluation.


In [None]:
model = GradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=5,
    n_estimators=200,
    subsample=0.8,
    random_state=42
)

model.fit(X_train_scaled, Y_train)
y_pred = model.predict(X_test_scaled)
y_score = model.predict_proba(X_test_scaled)[:, 1]

## ROC Curve and AUC Score Evaluation

The Receiver Operating Characteristic (ROC) curve is generated to assess the model's classification performance.
The Area Under the Curve (AUC) is calculated to quantify the classifier's ability to distinguish between signal and background classes.
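An AUC of 0.5 corresponds to random ranking (the dashed diagonal in the plot), while values approaching 1.0 indicate that signal events are consistently scored above background events.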


In [None]:
fpr, tpr, thresholds = roc_curve(Y_test, y_score)
auc_score = roc_auc_score(Y_test, y_score)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"AUC = {auc_score:.3f}")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()
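
One way to read the AUC is as the probability that a randomly chosen signal event receives a higher score than a randomly chosen background event. The toy example below checks this interpretation by counting pairs by hand and comparing against `roc_auc_score`. The labels and scores are made up for illustration:

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and scores (not taken from the notebook's data).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Count (signal, background) pairs ranked correctly; ties count as half.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
pairwise_auc = wins / (len(pos) * len(neg))

sk_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sk_auc)
```

Three of the four pairs are ordered correctly here, so both computations give 0.75.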

## Confusion Matrix and Accuracy Score

The confusion matrix is computed to provide a detailed breakdown of prediction results, showing true positives, false positives, true negatives, and false negatives.
The accuracy score is also calculated to measure the overall proportion of correctly classified samples.
Accuracy on its own can look favorable when one class dominates, so reading it alongside the confusion matrix gives a fuller picture of where the model makes its errors.


In [None]:
cm = confusion_matrix(Y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Background (0)", "Signal (1)"])
disp.plot(cmap='Blues')

acc = accuracy_score(Y_test, y_pred)
print(f"Accuracy: {acc:.4f}")
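
The four cells of a binary confusion matrix are enough to derive precision and recall for the signal class. The following sketch uses toy predictions rather than the notebook's `y_pred`; with labels `0` and `1`, `confusion_matrix(...).ravel()` returns the counts in the order `tn, fp, fn, tp`:

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (illustrative values only).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted signal that is true signal
recall = tp / (tp + fn)      # share of true signal that was found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```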


## Saving the Trained Model

The trained `GradientBoostingClassifier` is serialized and saved using the `joblib` library.
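This allows the model to be reloaded later for inference or integration into other systems without retraining.
Because the model was trained on standardized features, the fitted `StandardScaler` must be applied to any new data before prediction, so it needs to be saved or refitted alongside the model.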


In [None]:
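project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
model_path = os.path.join(project_root, "models", "higgs_classifier.joblib")
# Create the target directory if it does not exist yet, so the dump cannot fail on a missing folder.
os.makedirs(os.path.dirname(model_path), exist_ok=True)
print(f"Saving model to {model_path}")
joblib.dump(model, model_path)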
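
An alternative to saving the model on its own is to wrap the scaler and classifier in a scikit-learn `Pipeline` and persist that single object, so preprocessing and prediction cannot drift apart at inference time. This is a sketch on synthetic data, not the notebook's approach; the file name `higgs_pipeline.joblib`, the toy data, and the reduced `n_estimators` are assumptions chosen to keep the example small:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Scaler and classifier are fitted and stored together.
pipeline = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(n_estimators=20, random_state=42),
)
pipeline.fit(X_toy, y_toy)

# Round-trip through joblib in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "higgs_pipeline.joblib")  # hypothetical name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw (unscaled) features directly.
original_scores = pipeline.predict_proba(X_toy)[:, 1]
restored_scores = restored.predict_proba(X_toy)[:, 1]
print(np.allclose(original_scores, restored_scores))
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.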