# ML4SCI at GSoC 2025 – ML4DQM Evaluation Test
## Data Understanding and Exploration

This notebook focuses on understanding the HCAL DigiOccupancy datasets and visualizing key features that will be important for the classification task.

## Task Overview

We are provided with two synthetic datasets containing DigiOccupancy values from the Hadronic Calorimeter (HCAL) at the CMS detector:
- Run355456_Dataset.npy
- Run357479_Dataset.npy

Our objective is to develop a Vision Transformer (ViT) model to classify these "images" according to which run they originated from.

## Dataset Description

The datasets represent DigiOccupancy (hit multiplicity) values for the Hadronic Calorimeter (HCAL) at the CMS detector. Each dataset has the shape (10000, 64, 72), where:

- 10,000 refers to the number of luminosity sections (LS)
- 64 refers to the number of iEta cells (pseudorapidity)
- 72 refers to the number of iPhi cells (azimuthal angle)

Each value in the array represents the number of particle hits detected in a specific cell during a specific luminosity section.
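
As a quick illustration of this layout, the sketch below builds two small stand-in arrays with the same axis order (LS, iEta, iPhi) and stacks them with run labels. The reduced number of luminosity sections, the Poisson rate, and the names `fake_run1`, `fake_run2`, `X`, and `y` are placeholders chosen for illustration, not values taken from the real files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real arrays: 5 luminosity sections instead of 10,000
fake_run1 = rng.poisson(lam=2.0, size=(5, 64, 72))
fake_run2 = rng.poisson(lam=2.0, size=(5, 64, 72))

# Axis 0 indexes the luminosity section, axis 1 iEta, axis 2 iPhi
n_ls, n_ieta, n_iphi = fake_run1.shape

# Stack both runs into one array and label each image by its run of origin
X = np.concatenate([fake_run1, fake_run2], axis=0)
y = np.concatenate([np.zeros(len(fake_run1), dtype=int),
                    np.ones(len(fake_run2), dtype=int)])

print(X.shape, y.shape)  # (10, 64, 72) (10,)
```

Label 0 stands for the first run and label 1 for the second; any consistent encoding would work equally well.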

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import seaborn as sns
import os

# Set the style for plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook", font_scale=1.2)

## 1. Loading the Datasets

First, let's download the datasets into the working directory if they are not already there, and then load them.

In [None]:
# Download a dataset if it is not already present in the working directory
import urllib.request

def download_dataset(url, filename):
    """Fetch `url` into `filename` unless the file already exists."""
    if os.path.exists(filename):
        print(f"{filename} already exists.")
        return
    print(f"Downloading {filename}...")
    try:
        urllib.request.urlretrieve(url, filename)
        print(f"Downloaded {filename} successfully.")
    except OSError as e:  # URLError and HTTPError are subclasses of OSError
        print(f"Error downloading {filename}: {e}")
        print(f"Please download {filename} manually from the provided URL.")

# URLs for the datasets
url1 = "https://cernbox.cern.ch/s/cDOFb5myDHGqRfc"
url2 = "https://cernbox.cern.ch/s/n8NvyK2ldUPUxa9"

# Download the datasets
download_dataset(url1, "Run355456_Dataset.npy")
download_dataset(url2, "Run357479_Dataset.npy")
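
A download can appear to succeed and still leave something other than a NumPy array on disk, for example if a share link serves an HTML landing page instead of the file (an assumption about the server, not something verified here). A minimal sketch of a sanity check, assuming only that valid `.npy` files start with the magic bytes `\x93NUMPY`; the helper name `looks_like_npy` is hypothetical:

```python
import os
import tempfile

import numpy as np

NPY_MAGIC = b"\x93NUMPY"  # first six bytes of every .npy file

def looks_like_npy(path):
    """Return True if the file at `path` starts with the .npy magic string."""
    if not os.path.exists(path):
        return False
    with open(path, "rb") as f:
        return f.read(len(NPY_MAGIC)) == NPY_MAGIC

# Demonstrate on two temporary files: a real array and an HTML page saved as .npy
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.npy")
    bad = os.path.join(tmp, "bad.npy")
    np.save(good, np.zeros((2, 3)))
    with open(bad, "w") as f:
        f.write("<html>share page</html>")
    good_ok = looks_like_npy(good)
    bad_ok = looks_like_npy(bad)

print(good_ok, bad_ok)  # True False
```

Calling `looks_like_npy("Run355456_Dataset.npy")` before `np.load` would give a clearer error message than a failed parse.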

In [None]:
# Load the datasets; stop here if they are missing, since every later cell needs them
try:
    run1 = np.load("Run355456_Dataset.npy")
    run2 = np.load("Run357479_Dataset.npy")
except FileNotFoundError:
    print("Error: Dataset files not found. Please ensure they are in the current directory.")
    raise
print(f"Run1 shape: {run1.shape}")
print(f"Run2 shape: {run2.shape}")

## 2. Understanding HCAL Coordinates

Before diving into the data analysis, let's understand what iEta and iPhi represent in the HCAL detector:

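### iEta (η) - Pseudorapidity
- iEta is the integer cell index along the pseudorapidity η, which is related to the polar angle θ by η = -ln(tan(θ/2))
- Measures the angle of a cell relative to the beam axis: η = 0 is perpendicular to the beam, and |η| grows toward the beam line
- Differences in η are invariant under Lorentz boosts along the beam axis for massless particles, for which η coincides with the rapidity
- In our dataset: 64 discrete iEta values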
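
To make the formula concrete, here is a short sketch that evaluates η for a few polar angles. The angles are arbitrary examples, and the function works with the continuous η rather than the integer iEta index, whose exact mapping is not given in this notebook.

```python
import numpy as np

def pseudorapidity(theta):
    """eta = -ln(tan(theta / 2)) for a polar angle theta in radians."""
    return -np.log(np.tan(np.asarray(theta) / 2.0))

# Example polar angles in degrees, measured from the beam axis
thetas_deg = np.array([90.0, 45.0, 10.0])
etas = pseudorapidity(np.deg2rad(thetas_deg))

for t, e in zip(thetas_deg, etas):
    print(f"theta = {t:5.1f} deg -> eta = {e:.3f}")
```

The three angles give η of approximately 0, 0.88, and 2.44, which shows how quickly η grows as the direction approaches the beam line.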

### iPhi (φ) - Azimuthal Angle
- Measures angular position around the beam pipe
- Covers the full 360° around the cylindrical detector
- In our dataset: 72 discrete iPhi values

The following schematic shows how the two indices span the 2D grid of each image:

```
                      ^ iEta
                      |
                      |
    +----------------+----------------+
    |                |                |
    |                |                |
    |                |                |
    |                |                |
    |                |                |
    +----------------+----------------+ --> iPhi
    |                |                |
    |                |                |
    |                |                |
    |                |                |
    |                |                |
    +----------------+----------------+
                      |
                      |
```

Each cell in this grid represents a detector cell, and the DigiOccupancy value is the number of particle hits in that cell during a luminosity section.

## 3. Basic Data Statistics

Let's calculate some basic statistics about the datasets to understand their characteristics:

In [None]:
def calculate_statistics(data, name):
    """Calculate and print basic statistics for a dataset"""
    print(f"\n{name} Statistics:")
    print(f"Min value: {np.min(data)}")
    print(f"Max value: {np.max(data)}")
    print(f"Mean value: {np.mean(data):.2f}")
    print(f"Median value: {np.median(data):.2f}")
    print(f"Standard deviation: {np.std(data):.2f}")
    
    # Calculate the percentage of zero values
    zero_percentage = (data == 0).sum() / data.size * 100
    print(f"Percentage of zero values: {zero_percentage:.2f}%")
    
    # Calculate sparsity of each image (percentage of zero cells), vectorized over the image axes
    sparsity_per_image = (data == 0).mean(axis=(1, 2)) * 100
    print(f"Average image sparsity: {np.mean(sparsity_per_image):.2f}%")
    print(f"Min image sparsity: {np.min(sparsity_per_image):.2f}%")
    print(f"Max image sparsity: {np.max(sparsity_per_image):.2f}%")

# Calculate statistics for both runs
calculate_statistics(run1, "Run 355456")
calculate_statistics(run2, "Run 357479")

## 4. Key Visualizations

Now let's create some visualizations to better understand the data:

### 4.1 Average Occupancy Maps

First, let's look at the average DigiOccupancy for each cell across all luminosity sections. This will show us which regions of the detector are typically active.

In [None]:
# Calculate average maps
avg_run1 = np.mean(run1, axis=0)
avg_run2 = np.mean(run2, axis=0)

# Create the figure; constrained_layout leaves room for the shared colorbar
fig, axes = plt.subplots(1, 2, figsize=(16, 7), constrained_layout=True)

# Shared log color scale: LogNorm needs vmin > 0, so take the smallest positive
# average across both runs and floor it at 0.1
vmin = min(np.min(avg_run1[avg_run1 > 0]), np.min(avg_run2[avg_run2 > 0]))
vmax = max(np.max(avg_run1), np.max(avg_run2))
norm = LogNorm(vmin=max(0.1, vmin), vmax=vmax)

# Plot both average maps with the same normalization
for ax, avg, run_name in zip(axes, (avg_run1, avg_run2), ("355456", "357479")):
    im = ax.imshow(avg, cmap='viridis', norm=norm)
    ax.set_title(f"Average DigiOccupancy - Run {run_name}", fontsize=14)
    ax.set_xlabel("iPhi (azimuthal angle)", fontsize=12)
    ax.set_ylabel("iEta (pseudorapidity)", fontsize=12)

# Add a single colorbar shared by both panels
cbar = fig.colorbar(im, ax=axes, label='Average DigiOccupancy (log scale)')

plt.savefig('average_occupancy_maps.png', dpi=300, bbox_inches='tight')
plt.show()
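
The reason zeros are excluded when choosing the lower color limit is that a logarithmic scale is undefined at zero. The sketch below shows the same idea on stand-in data: it averages over the luminosity-section axis, picks the smallest strictly positive average as the lower limit, and masks the cells that are always empty. The Poisson rate and the choice to blank the first 10 rows are assumptions made only for this toy example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: 100 luminosity sections on the 64 x 72 grid
fake_run = rng.poisson(lam=3.0, size=(100, 64, 72)).astype(float)
fake_run[:, :10, :] = 0.0  # pretend the first 10 iEta rows never register hits

# Per-cell average over the luminosity-section axis
avg_map = fake_run.mean(axis=0)

# Smallest strictly positive average: a valid lower limit for a log color scale
positive = avg_map[avg_map > 0]
vmin = positive.min()
vmax = avg_map.max()

# Mask the always-empty cells explicitly so they can be styled separately in a plot
masked_avg = np.ma.masked_less_equal(avg_map, 0.0)

print(avg_map.shape, int(masked_avg.mask.sum()))  # (64, 72) 720
```

Passing a masked array to `imshow` draws the masked cells in the colormap's "bad" color, which makes empty regions easy to tell apart from low but non-zero occupancy.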

### 4.2 Difference Map

Next, let's visualize the difference between the average occupancy maps to see which regions differ most between the two runs. This will highlight the distinguishing features for our classification task.

In [None]:
# Calculate difference
diff = avg_run2 - avg_run1

# Create figure
plt.figure(figsize=(12, 9))

# Set symmetric color limits so that zero difference maps to the center of the colormap
vmax = np.max(np.abs(diff))

# Plot difference map with diverging color scale
im = plt.imshow(diff, cmap='RdBu_r', vmin=-vmax, vmax=vmax)
plt.title("Difference in DigiOccupancy (Run 357479 - Run 355456)", fontsize=16)
plt.xlabel("iPhi (azimuthal angle)", fontsize=14)
plt.ylabel("iEta (pseudorapidity)", fontsize=14)

# Add colorbar
cbar = plt.colorbar(im, label='Difference in DigiOccupancy')

# Add annotations for positive and negative regions
plt.text(5, 5, "Blue: Higher in Run 355456", color='blue', fontsize=12, 
         bbox={"facecolor":"white", "alpha":0.8, "pad":5})
plt.text(5, 10, "Red: Higher in Run 357479", color='red', fontsize=12, 
         bbox={"facecolor":"white", "alpha":0.8, "pad":5})

plt.tight_layout()
plt.savefig('difference_map.png', dpi=300, bbox_inches='tight')
plt.show()
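
A difference map is easier to act on when the most discriminating cells are also listed numerically. The following is a minimal sketch of one way to rank cells by absolute difference; the helper `top_k_cells` and the toy map with two planted values are hypothetical, and the returned row and column numbers are array indices, not physical iEta and iPhi values.

```python
import numpy as np

def top_k_cells(diff_map, k=5):
    """Return (row, col, value) for the k cells with the largest |difference|."""
    # Sort all cells by absolute value, largest first, and keep the first k
    flat_order = np.argsort(np.abs(diff_map), axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, diff_map.shape)
    return [(int(r), int(c), float(diff_map[r, c])) for r, c in zip(rows, cols)]

# Hypothetical 64 x 72 difference map with two planted hot spots
toy_diff = np.zeros((64, 72))
toy_diff[10, 20] = 4.0
toy_diff[40, 5] = -7.5

ranked = top_k_cells(toy_diff, k=2)
print(ranked)  # [(40, 5, -7.5), (10, 20, 4.0)]
```

Applied to the real data, `top_k_cells(diff, k=10)` would list the ten cells where the two runs differ most on average.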

### 4.3 Sample Images from Each Run

Let's look at a few random samples from each run to get a sense of what individual images look like:

In [None]:
# Set up a figure to display random samples; constrained_layout leaves room for the shared colorbar
num_samples = 3
fig, axes = plt.subplots(num_samples, 2, figsize=(14, 4 * num_samples),
                         constrained_layout=True)

# Select random indices
np.random.seed(42)  # For reproducibility
run1_indices = np.random.choice(run1.shape[0], num_samples, replace=False)
run2_indices = np.random.choice(run2.shape[0], num_samples, replace=False)

# Shared log color scale across all panels (LogNorm needs vmin > 0)
vmin = min(np.min(run1[run1 > 0]), np.min(run2[run2 > 0]))
vmax = max(np.max(run1), np.max(run2))
norm = LogNorm(vmin=max(0.1, vmin), vmax=vmax)

# Left column: Run 355456, right column: Run 357479
runs = ((run1, run1_indices, "355456"), (run2, run2_indices, "357479"))
for i in range(num_samples):
    for j, (run, indices, run_name) in enumerate(runs):
        im = axes[i, j].imshow(run[indices[i]], cmap='viridis', norm=norm)
        axes[i, j].set_title(f"Run {run_name} - Sample {indices[i]}")
        axes[i, j].set_xlabel("iPhi")
        axes[i, j].set_ylabel("iEta")

# Add a single colorbar shared by all panels
fig.colorbar(im, ax=axes, label='DigiOccupancy Value (log scale)')

plt.savefig('sample_images.png', dpi=300, bbox_inches='tight')
plt.show()

## 5. Sparsity Analysis

The dataset description mentions that there will be many zero-valued entries. Let's analyze the sparsity patterns to understand where hits typically occur in the detector.

In [None]:
# Per-cell fraction of luminosity sections in which the value is non-zero
mask_run1 = np.mean(run1 > 0, axis=0)
mask_run2 = np.mean(run2 > 0, axis=0)

# Difference in occurrence frequency
diff_mask = mask_run2 - mask_run1

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot masks
im1 = axes[0].imshow(mask_run1, cmap='Blues', vmin=0, vmax=1)
axes[0].set_title("Non-Zero Frequency - Run 355456")
axes[0].set_xlabel("iPhi")
axes[0].set_ylabel("iEta")
fig.colorbar(im1, ax=axes[0], label='Frequency of Non-Zero Values')

im2 = axes[1].imshow(mask_run2, cmap='Blues', vmin=0, vmax=1)
axes[1].set_title("Non-Zero Frequency - Run 357479") 
axes[1].set_xlabel("iPhi")
axes[1].set_ylabel("iEta")
fig.colorbar(im2, ax=axes[1], label='Frequency of Non-Zero Values')

# Plot difference
vmax_diff = max(abs(np.min(diff_mask)), abs(np.max(diff_mask)))
im3 = axes[2].imshow(diff_mask, cmap='RdBu_r', vmin=-vmax_diff, vmax=vmax_diff)
axes[2].set_title("Difference in Non-Zero Frequency") 
axes[2].set_xlabel("iPhi")
axes[2].set_ylabel("iEta")
fig.colorbar(im3, ax=axes[2], label='Frequency Difference')

plt.tight_layout()
plt.savefig('sparsity_pattern.png', dpi=300, bbox_inches='tight')
plt.show()
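
The non-zero frequency is simply the mean of a boolean array along the luminosity-section axis. A tiny worked example, using a made-up stack of three 2×2 "images", shows the numbers this produces and how cells that are empty in every image can be flagged:

```python
import numpy as np

# Three tiny 2 x 2 "images" standing in for luminosity sections
toy = np.array([
    [[0, 1], [0, 5]],
    [[0, 2], [3, 0]],
    [[0, 4], [0, 0]],
])

# Fraction of images in which each cell is non-zero
nonzero_freq = np.mean(toy > 0, axis=0)
print(nonzero_freq)

# Cells with frequency 0 are empty in every image
always_empty = nonzero_freq == 0
print(always_empty)
```

Here the top-left cell has frequency 0, the top-right cell has frequency 1, and both bottom cells have frequency 1/3. On the real data, `mask_run1 == 0` would identify the cells that never register a hit in Run 355456.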

## 6. Value Distribution Analysis

Let's look at the distribution of non-zero DigiOccupancy values to understand the range and frequency of hit counts.

In [None]:
plt.figure(figsize=(12, 6))

# Collect the non-zero values (boolean indexing already returns a 1D array)
run1_flat = run1[run1 > 0]
run2_flat = run2[run2 > 0]

# If datasets are large, sample for better performance
sample_size = 100000
rng = np.random.default_rng(42)  # seeded for reproducibility
if len(run1_flat) > sample_size:
    run1_flat = rng.choice(run1_flat, sample_size, replace=False)
if len(run2_flat) > sample_size:
    run2_flat = rng.choice(run2_flat, sample_size, replace=False)

# Use the same bin edges for both runs so the histograms are directly comparable
bins = np.histogram_bin_edges(np.concatenate([run1_flat, run2_flat]), bins=50)

# Create histograms with log scale on y-axis
sns.histplot(run1_flat, color='blue', label='Run 355456', alpha=0.5, 
             log_scale=(False, True), stat='density', bins=bins)
sns.histplot(run2_flat, color='orange', label='Run 357479', alpha=0.5, 
             log_scale=(False, True), stat='density', bins=bins)

plt.xlabel('DigiOccupancy Value')
plt.ylabel('Density (log scale)')
plt.title('Distribution of Non-Zero DigiOccupancy Values')
plt.legend()
plt.tight_layout()
plt.savefig('value_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Summary of Findings

Based on our exploratory data analysis, we can summarize our findings about the HCAL DigiOccupancy datasets:

1. **Data Structure**: Each dataset contains 10,000 images of size 64×72 representing hits in the HCAL detector.

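2. **Sparsity**: The data is sparse, with a large share of zero entries (the exact percentage is printed by `calculate_statistics` in Section 3). The non-zero frequency maps in Section 5 show which cells are empty in every luminosity section and which only occasionally register hits.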

3. **Value Distribution**: Non-zero values have a wide dynamic range, with most values being low but some high-activity regions showing much larger counts.

4. **Spatial Patterns**: The average occupancy maps show clear spatial patterns of activity in the detector, with certain regions consistently showing higher hit counts.

5. **Distinguishing Features**: The difference map reveals regions where the two runs differ significantly, which will be important for our classification task.

6. **Classification Challenge**: Our task is to develop a Vision Transformer model that can identify which run a given DigiOccupancy image came from based on these patterns.

## 8. Next Steps

With this understanding of the data, we can now proceed to develop our Vision Transformer model. In the next notebook, we will:

1. Preprocess the data for model training
2. Implement a Vision Transformer architecture
3. Train the model on our classification task
4. Evaluate its performance using accuracy, ROC curves, and AUC
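
As a preview of the first two steps, here is a minimal NumPy sketch of one possible preprocessing path: compressing the dynamic range with `log1p` and cutting each 64×72 image into non-overlapping patches, which is the token format a Vision Transformer consumes. The `log1p` transform, the 8×8 patch size, and the helper names `log_scale` and `patchify` are assumptions made for illustration, not choices fixed by this notebook.

```python
import numpy as np

def log_scale(images):
    """Compress the dynamic range with log(1 + x); zeros stay exactly zero."""
    return np.log1p(images.astype(np.float32))

def patchify(images, patch_h=8, patch_w=8):
    """Split (N, H, W) images into (N, num_patches, patch_h * patch_w) tokens."""
    n, h, w = images.shape
    assert h % patch_h == 0 and w % patch_w == 0, "patch size must divide the image"
    gh, gw = h // patch_h, w // patch_w
    x = images.reshape(n, gh, patch_h, gw, patch_w)
    x = x.transpose(0, 1, 3, 2, 4)  # (N, gh, gw, patch_h, patch_w)
    return x.reshape(n, gh * gw, patch_h * patch_w)

rng = np.random.default_rng(3)
fake_batch = rng.poisson(lam=2.0, size=(4, 64, 72))  # hypothetical stand-in data

tokens = patchify(log_scale(fake_batch))
print(tokens.shape)  # (4, 72, 64)
```

With 8×8 patches, the 64×72 grid splits into 8 × 9 = 72 patches of 64 values each. Other patch sizes that divide both 64 and 72, such as 4×4, would work the same way and trade sequence length against patch detail.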

The key insights from this exploratory analysis will inform our modeling approach, particularly with respect to handling the sparse nature of the data and focusing on the differentiating features between the two runs.