# **TD Detecting Higgs bosons**

## Purpose of the Lab and Methodological Overview

The purpose of this laboratory session is to apply and critically evaluate **advanced machine learning paradigms** on the challenging problem of Higgs Boson particle detection—a classification task characterized by **high dimensionality, noise, and complex feature interactions**. Students are guided through a structured experimental workflow designed to progressively incorporate increasingly sophisticated learning strategies.

In the first stage, students perform **data preparation and preprocessing**, addressing issues such as normalization, handling missing values, and feature extraction to ensure data suitability for machine learning. In the second stage, they implement baseline classifiers (Decision Trees, Logistic Regression, and Bayesian models) to establish reference performance metrics without additional learning paradigms. The third stage introduces **ensemble learning methods** (Bagging and Boosting), emphasizing how the aggregation of weak learners can enhance predictive stability and reduce variance. In the fourth stage, students integrate the **active learning paradigm** using the **modAL** Python library, enabling the model to iteratively query the most informative samples to reduce labeling effort while improving accuracy. Finally, in the fifth stage, they develop a **hybrid approach that combines ensemble and active learning**, exploring how sampling-based paradigms can complement each other to handle the Higgs dataset’s inherent complexity.

Through this progressive design, the lab aims to foster a deep understanding of sampling-based paradigms as robust strategies for learning from difficult, imbalanced, or noisy data—illustrating their capacity to improve model generalization and efficiency in high-stakes scientific contexts such as particle physics.

## I- Pre-processing

### 1- Importing libraries

In [None]:
import numpy as np
import pandas as pd
import csv
import time
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 2- Reading CSV files
The dataset is available at http://opendata.cern.ch/record/328

**If the dataset is very large, you can select a sub-sample to work with.**

In [None]:

dataFilename = '/content/drive/MyDrive/Data/atlas-higgs.csv'
data_complet = pd.read_csv(dataFilename)
#select only a sample
data = data_complet.sample(frac=0.1, random_state=42)
print(data.shape)
print(data.dtypes)
data.head()

### 3- Delete superfluous columns
Use the `del` statement to delete the 'EventId', 'Weight', 'KaggleSet', and 'KaggleWeight' columns, then transform the label column into 0/1 binary values.

In [None]:
del(data['EventId'])
#.............
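
As a hedged illustration of this step, the sketch below works on a small invented frame rather than the real file. It assumes the target column is named `Label` and holds `'s'` (signal) and `'b'` (background); check `data.dtypes` to confirm the actual name and values before reusing it. `DataFrame.drop` is used here as an alternative to repeated `del` statements.

```python
import pandas as pd

# Toy stand-in for the Higgs data (all values invented for illustration)
toy = pd.DataFrame({
    "EventId": [100000, 100001, 100002],
    "DER_mass_MMC": [138.47, 160.94, -999.0],
    "Weight": [0.002, 2.23, 2.35],
    "Label": ["s", "b", "b"],
    "KaggleSet": ["t", "t", "t"],
    "KaggleWeight": [0.002, 2.23, 2.35],
})

# Drop the bookkeeping columns in one call (equivalent to repeated `del`)
toy = toy.drop(columns=["EventId", "Weight", "KaggleSet", "KaggleWeight"])

# Assumed encoding: 's' (signal) -> 1, 'b' (background) -> 0
toy["Label"] = (toy["Label"] == "s").astype(int)
print(toy)
```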

### 4- Check Outliers
Check the dataset for outliers by displaying the boxplots of all columns.

In [None]:
#boxplot display for different attributes

### 5- Check Imbalance Ratio
Display the histogram of the target column to check for an imbalance between the positive and negative classes.
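
Besides the histogram, the imbalance can be summarized by a single number. The sketch below uses an invented target vector and defines the ratio as majority count over minority count, which is one common convention among several.

```python
import pandas as pd

# Toy binary target (invented): 1 = signal, 0 = background
y_toy = pd.Series([0, 0, 0, 0, 1, 1])

counts = y_toy.value_counts()                   # number of events per class
imbalance_ratio = counts.max() / counts.min()   # majority / minority

print(counts.to_dict(), imbalance_ratio)
# counts.plot(kind="bar") would draw the corresponding bar chart
```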

### 6- Split data into train and test sets
Separate *input* data from *target* and split the dataset into train and test (30% for test).

In [None]:
#separate the output (target) from the inputs

#subdivision of test sample data = 30%.


print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)
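
A possible shape for this split, shown on invented data. The `stratify` argument is an addition not requested above: it keeps the class proportions identical in both parts, which can help when the classes are imbalanced. The names `X_demo`, `y_demo`, `Xtr`, and `Xte` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))          # 100 invented events, 4 features
y_demo = np.array([0] * 70 + [1] * 30)      # 70/30 class mix

# test_size=0.3 keeps 30% for testing; stratify preserves the class mix
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
print(Xtr.shape, Xte.shape, ytr.mean(), yte.mean())
```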


### 7 - Dealing with missing data
Missing values in this file are encoded as -999. To replace them, use the `SimpleImputer` class.

In [None]:
#Handling missing data designated by -999
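
A minimal sketch of the imputation on an invented array. The `"mean"` strategy is an assumption made for the example; `"median"` or `"most_frequent"` are equally valid choices to compare.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_demo = np.array([[1.0, -999.0],
                   [3.0, 4.0],
                   [-999.0, 8.0]])

# Treat -999 as the missing-value marker; "mean" is one possible strategy
imputer = SimpleImputer(missing_values=-999.0, strategy="mean")
X_filled = imputer.fit_transform(X_demo)   # in the lab: fit on training data only
print(X_filled)
```

Each -999 is replaced by the mean of the remaining values in its column, so the marker never enters the statistics.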


### 8- Data scaling with *MinMaxScaler* from scikit-learn
Scale all columns with `MinMaxScaler`.

In [None]:
from sklearn.preprocessing import MinMaxScaler

#...


### 9- Apply a dimensionality reduction technique
Apply PCA to reduce the dimensionality of the training data while retaining 95% of the variance.

In [None]:
#import PCA class

#Apply decomposition with 95% variance

#Transforming learning data
#.....
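
To illustrate the 95% criterion, the sketch below builds invented data with three underlying factors and lets PCA choose the number of components. Passing a float between 0 and 1 as `n_components` asks scikit-learn to keep just enough components to reach that share of explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Invented data: 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# A float in (0, 1) keeps enough components for that variance share
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```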

### 10- Create a preprocessing pipeline
Add all the previous preprocessing steps to a pipeline so that they can be reused for the test data.

In [None]:
from sklearn.pipeline import Pipeline

# Create a pipeline with the preprocessing steps
preprocessing_pipeline = Pipeline([....
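
One possible arrangement of such a pipeline, shown on invented data. The step names (`"impute"`, `"scale"`, `"pca"`) and the variable `demo_pipeline` are hypothetical; the point is that parameters are learned with `fit_transform` on the training set and then applied unchanged to the test set with `transform`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_tr = rng.normal(size=(80, 6))
X_te = rng.normal(size=(20, 6))
X_tr[::10, 0] = -999.0   # inject the missing-value marker
X_te[::5, 0] = -999.0

demo_pipeline = Pipeline([
    ("impute", SimpleImputer(missing_values=-999.0, strategy="mean")),
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=0.95)),
])

X_tr_p = demo_pipeline.fit_transform(X_tr)  # learn parameters on train only
X_te_p = demo_pipeline.transform(X_te)      # reuse them unchanged on test
print(X_tr_p.shape, X_te_p.shape)
```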

## II- Learning I: Models Only (No Additional Paradigms)

Create lists to store the performance measures used for model comparison.

In [None]:
#for saving results
list_accuracies=[]
list_F1=[]
list_times=[]

### 1- Decision Tree

In [None]:
#first preprocess the test set


In [None]:
from sklearn.tree import DecisionTreeClassifier
#create an instance of a DecisionTree classifier
dt_class= .....
deb = time.time()
#If training fails, rerun the code
dt_class.fit(x_train, y_train)
fin=time.time()
res_dt = dt_class.predict(x_test)
print("time", (fin-deb))
list_times.append(fin-deb)
#scikit-learn metrics expect (y_true, y_pred)
acc=accuracy_score(y_test,res_dt)
F1=f1_score(y_test,res_dt)
print("accuracy : ",acc)
print("F1 Score : ", F1)
list_accuracies.append(acc)
list_F1.append(F1)

classes=['Signal','Background']
cm = confusion_matrix(y_test, res_dt, labels=dt_class.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=dt_class.classes_)
disp.plot()
plt.show()
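
Because the same fit, time, and score sequence is repeated for every model in the lab, it can be wrapped in a helper. The function `evaluate` below is a hypothetical convenience, not part of the lab template, and it is demonstrated on synthetic data rather than the Higgs sample.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit `model`, then return (accuracy, F1, training time in seconds)."""
    start = time.time()
    model.fit(X_tr, y_tr)
    elapsed = time.time() - start
    pred = model.predict(X_te)
    return accuracy_score(y_te, pred), f1_score(y_te, pred), elapsed

# Synthetic stand-in data, not the Higgs sample
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

acc, f1, secs = evaluate(DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te)
print(acc, f1, secs)
```

The three returned values can be appended to `list_accuracies`, `list_F1`, and `list_times` in the same order for each model.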

### 2- Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lg_clf = LogisticRegression(solver='sag')
#The default solver for logistic regression is "lbfgs" (since scikit-learn 0.22; it was "liblinear" before)
#"sag" (Stochastic Average Gradient) is a solver based on stochastic optimization.
#It is suitable for large problems and converges fastest on features with similar scales.
deb = time.time()
#....

### 3- Bayesian (Gaussian Naive Bayes)

In [None]:
from sklearn.naive_bayes import GaussianNB
gn_clf=GaussianNB()
deb = time.time()
#......

### Plotting results

In [None]:
x = ['DT', 'LR', 'Gauss']
plt.grid()
plt.plot(x,list_accuracies)
plt.plot(x,list_F1,'r')
#add a legend
#....
plt.show()

In [None]:
#plot of computation times
plt.plot(x,list_times, label='Execution Time')
#add a legend
plt.legend()
plt.show()

## III- Learning II: Ensemble Learning

## III-1 Bagging

### Random Forest
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

In [None]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
deb = time.time()
#....

### Bagging classifier with Logistic Regression
class sklearn.ensemble.BaggingClassifier(estimator=None, n_estimators=10, *, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0, base_estimator='deprecated')

*   max_samples: number (int) or fraction (float) of training samples drawn for each base estimator
*   max_features: number (int) or fraction (float) of features drawn for each base estimator
*   bootstrap: Whether samples are drawn with replacement. If False, sampling is performed without replacement.
*   bootstrap_features: Whether features are drawn with replacement.
*   n_jobs: The number of jobs to run in parallel for fitting and predicting. -1 uses all available processors.

In [None]:
from sklearn.ensemble import BaggingClassifier
#start with max_samples=1.0 and max_features=1.0 (floats are fractions; the integer 1 would mean a single sample or feature)
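
A hedged sketch of bagging with a logistic regression base learner, run on synthetic data. The base learner is passed positionally so that the call does not depend on whether the installed scikit-learn names the parameter `estimator` or `base_estimator`. The fractions tried for `max_samples` are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for frac in (0.2, 0.5, 1.0):
    bag = BaggingClassifier(
        LogisticRegression(max_iter=1000),  # base learner, passed positionally
        n_estimators=10,
        max_samples=frac,     # float: fraction of training rows per learner
        max_features=1.0,     # float: fraction of columns per learner
        random_state=0,
    )
    bag.fit(X_tr, y_tr)
    scores[frac] = bag.score(X_te, y_te)   # test accuracy for this setting
print(scores)
```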


### Plotting results

In [None]:
x = ['DT', 'LR', 'Gauss', "RF", "Bag LR"]
plt.grid()
#plt.plot(...)
#..........
plt.show()

In [None]:
#plot of computation times
plt.grid()
plt.plot(x,list_times)

### Discussion

..................

### Exercise
Vary the values of the 'max_samples' and 'max_features' parameters to visualize the effect of sampling in ensemble learning on classifier quality.


## III-2 Boosting

### AdaBoost
class sklearn.ensemble.AdaBoostClassifier(estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None, base_estimator='deprecated')

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
deb = time.time()
#....
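
To make the reweighting idea concrete, the sketch below performs a single boosting round by hand on synthetic data, using the discrete AdaBoost (SAMME, binary case) update. It is a simplified illustration of the principle, not the internal code of `AdaBoostClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)

w = np.full(len(y), 1.0 / len(y))            # start from uniform weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=w)

miss = stump.predict(X) != y                 # True where the stump is wrong
err = np.clip(np.sum(w[miss]) / np.sum(w), 1e-10, 1 - 1e-10)
alpha = np.log((1 - err) / err)              # vote weight of this learner

w_new = w * np.exp(alpha * miss)             # boost only the misclassified rows
w_new /= w_new.sum()                         # renormalize to sum to 1
print(err, alpha, w_new[miss].mean() / w_new[~miss].mean())
```

After the update, each misclassified sample weighs `(1 - err) / err` times as much as a correctly classified one, which is what pushes the next weak learner toward the difficult cases.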

### Gradient Boosting (Gradient Boosted Decision Trees)

This algorithm builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the loss function, e.g. binary or multiclass log loss. Binary classification is a special case where only a single regression tree is induced.

class sklearn.ensemble.GradientBoostingClassifier(*, loss='log_loss', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)

In [None]:

from sklearn.ensemble import GradientBoostingClassifier


### Plotting new results

In [None]:
x = ['DT', 'LR', 'Gauss', "RF", "Bag LR", "AdaBoost", "GdBoost" ]
plt.grid()
plt.plot(x,list_accuracies)
plt.plot(x,list_F1,'r')
plt.show()

## IV- Imbalanced Learning (optional)

### Oversampling

Add at least one oversampling technique to the preprocessing pipeline and re-test the different models:
*   Random oversampling (OS)
*   SMOTE (synthetic minority oversampling)
*   BSMOTE (Borderline-SMOTE)



In [None]:
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE

#Over Sampling

#SMOTE or BSMOTE (computationally expensive)
sm = SMOTE(random_state=1)
#....
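
The principle of random oversampling can be shown with NumPy alone, as in the sketch below on invented data. This is a hand-written stand-in: in the lab, imbalanced-learn samplers such as `RandomOverSampler` or `SMOTE` do the resampling through `fit_resample(X, y)`. Two points are worth keeping in mind: resample the training split only, and note that samplers are not valid steps of a scikit-learn `Pipeline`, so `imblearn.pipeline.Pipeline` is the class to use when chaining them.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 3))
y_tr = np.array([0] * 90 + [1] * 10)   # invented 9:1 imbalance

minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)

# Draw minority rows with replacement until both classes have equal counts
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])

X_bal, y_bal = X_tr[keep], y_tr[keep]
print(np.bincount(y_bal))
```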

### New results after balancing

## V- Active Learning Integration

Use modAL to actively select the most informative samples from the training pool.

**Steps**:

*   Start with a small labeled subset (e.g., 1% of training data)
*   Use uncertainty sampling to query new samples
*   Train a classifier (e.g. LogisticRegression)
*   Track performance after each query round

In [None]:
#Install modAL
!pip install modAL-python

In [None]:
#import libraries
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
import warnings
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    module='sklearn'
)

In [None]:
#Step 1: split an initial labeled set from the pool (training set)

#Step 2: initialize the ActiveLearner with LogisticRegression and uncertainty_sampling

#Step 3: active learning loop
#record accuracy after each teaching (training) step

#Step 4: plot the evolution of accuracy
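
The four steps can be sketched without modAL, using only scikit-learn, to show what the loop does under the hood. This is a hand-rolled stand-in on synthetic data, with a seed set of 20 samples, 10 rounds, and 10 queries per round chosen arbitrarily. With modAL, the same logic is expressed by creating an `ActiveLearner(estimator=..., query_strategy=uncertainty_sampling, X_training=..., y_training=...)` and alternating `learner.query(...)` and `learner.teach(...)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: small stratified seed set; the rest stays in the unlabeled pool
idx = np.arange(len(y_pool))
seed_idx, pool_idx = train_test_split(idx, train_size=20, stratify=y_pool, random_state=0)
labeled = list(seed_idx)
pool = list(pool_idx)

model = LogisticRegression(max_iter=1000)         # Step 2: the learner
history = []
for _ in range(10):                               # Step 3: query rounds
    model.fit(X_pool[labeled], y_pool[labeled])
    history.append(model.score(X_te, y_te))       # accuracy after this round
    proba = model.predict_proba(X_pool[pool])
    uncertainty = 1.0 - proba.max(axis=1)         # least-confidence score
    picks = np.argsort(uncertainty)[-10:]         # 10 most uncertain samples
    for p in sorted(picks, reverse=True):         # pop highest positions first
        labeled.append(pool.pop(p))
print(history)                                    # Step 4: values to plot
```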



## VI- Hybrid Active Learning + Ensemble

Propose a hybrid classification model using modAL and `RandomForestClassifier`.
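
One way to exploit the ensemble inside the query step is to measure how much the individual trees disagree. The sketch below computes a vote-entropy score from a random forest on synthetic data and selects the most contested samples. It is only one candidate design, written without modAL; since modAL's `ActiveLearner` accepts scikit-learn estimators, a `RandomForestClassifier` can also be passed to it directly with `uncertainty_sampling`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_lab, y_lab = X[:40], y[:40]        # invented labeled seed set
X_unl = X[40:]                       # pool treated as unlabeled

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_lab, y_lab)

# Each tree casts one vote per pool sample (labels are 0/1 here)
votes = np.stack([tree.predict(X_unl) for tree in forest.estimators_])
p1 = votes.mean(axis=0)                          # share of trees voting class 1

eps = 1e-12                                      # avoids log(0)
vote_entropy = -(p1 * np.log(p1 + eps) + (1 - p1) * np.log(1 - p1 + eps))

query_idx = np.argsort(vote_entropy)[-10:]       # 10 most contested samples
print(query_idx, vote_entropy[query_idx])
```

The selected rows would then be labeled, moved to the training set, and the forest refitted, exactly as in the active learning loop of section V.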

## VII- Conclusion and Discussion