# Exploratory Data Analysis

This notebook contains initial exploratory data analysis (EDA) code. EDA is typically the first step in a data analysis pipeline, and it is essential to:
 - get acquainted with the use-case
 - understand the nature and format of the data
 - explore the information and noise contained in the data
 - gain insight into the challenges we may face when training

## Problem

This dataset contains a Monte Carlo simulation of $\rho^{\pm} \rightarrow \pi^{\pm} + \pi^0$ decays and the corresponding detector response. Specifically, the data report the measured response of the **i) tracker** and **ii) calorimeter**, along with the true physical quantities that generated those measurements.

<div class="alert alert-block alert-info">
This means that we expect one track per event (from the charged pion, the only charged particle in the final state), with mainly two energy blobs (clusters of cells) in the calorimeter.
</div>

The final **goal** is to associate the cell signals observed in the calorimeter to the track that caused those energy deposits.

## Method

The idea is to leverage a **point cloud** data representation to combine tracker and calorimeter information, in order to associate cell hits with the corresponding track. We will use a [**PointNet**](https://openaccess.thecvf.com/content_cvpr_2017/papers/Qi_PointNet_Deep_Learning_CVPR_2017_paper.pdf) model, which can handle this type of data, and frame the problem as **semantic segmentation**. More precisely, this means that:
- we represent each hit in the detector as a point in the point cloud: x, y, z coordinates + additional features ("3+"-dimensional point)
- the **learning task** will be binary classification at hit level: for each cell the model learns whether its energy comes mostly from the track (class 1) or not (class 0)
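
As a rough illustration of the two points above, the sketch below builds a "3+"-dimensional point array and a binary label per hit from a toy structured array. The toy values are made up, and the labelling rule (focal energy fraction above 0.5) is an assumption used to express "mostly from the track"; the actual target definition used for training may differ.

```python
import numpy as np

# Hypothetical toy sample with a handful of hits (values are made up).
toy_dtype = [
    ("x", "<f4"), ("y", "<f4"), ("z", "<f4"),
    ("cell_E", "<f4"),
    ("truth_cell_focal_fraction_energy", "<f4"),
]
sample = np.array(
    [
        (0.1, 0.2, 0.3, 1.5, 0.9),
        (0.4, 0.5, 0.6, 0.7, 0.2),
        (0.7, 0.8, 0.9, 2.1, 0.6),
    ],
    dtype=toy_dtype,
)

# Each hit becomes one point: 3 coordinates + 1 additional feature.
feature_names = ["x", "y", "z", "cell_E"]
points = np.stack([sample[name] for name in feature_names], axis=-1)

# Assumed labelling rule: class 1 if more than half of the cell energy
# comes from the focal track, class 0 otherwise.
labels = (sample["truth_cell_focal_fraction_energy"] > 0.5).astype(np.int64)

print(points.shape)  # (n_hits, n_features)
print(labels)
```

Stacking the selected columns along the last axis gives the `(n_hits, n_features)` layout that point-cloud models typically consume.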

## Data structure

<div class="alert alert-block alert-info">

This dataset is organized as follows:
 - for each event, we create a **sample** (i.e. point cloud)
 - each sample contains all hits in a cone around a track of the event, called **focal track**
     - the cone includes all hits within some $\Delta R$ distance of the track
     - if an event has multiple tracks, it yields more than one sample
     - since different samples may contain different numbers of hits, **we pad all point clouds to the same size** (the model requires fixed-size inputs)

</div>
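
The padding step can be sketched as follows. This is a minimal illustration, not the code used to produce the dataset: the helper `pad_point_cloud`, the padding value of `0.0`, and the boolean validity mask are all assumptions.

```python
import numpy as np


def pad_point_cloud(points, n_points, pad_value=0.0):
    """Pad (or truncate) an (n_hits, n_features) array to (n_points, n_features).

    Hypothetical helper: returns the padded array and a boolean mask that is
    True for real hits and False for padding rows.
    """
    n_hits, n_features = points.shape
    padded = np.full((n_points, n_features), pad_value, dtype=points.dtype)
    n_keep = min(n_hits, n_points)
    padded[:n_keep] = points[:n_keep]
    mask = np.zeros(n_points, dtype=bool)
    mask[:n_keep] = True
    return padded, mask


# Two toy point clouds with different numbers of hits (3 and 5).
clouds = [np.ones((3, 4), dtype=np.float32), np.ones((5, 4), dtype=np.float32)]
padded_clouds = [pad_point_cloud(cloud, n_points=6) for cloud in clouds]

# After padding, the samples can be stacked into one fixed-size batch.
batch = np.stack([padded for padded, _ in padded_clouds])
masks = np.stack([mask for _, mask in padded_clouds])
print(batch.shape, masks.sum(axis=1))
```

Keeping a mask alongside the padded array is one way to exclude padding rows from statistics and from the loss later on.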

In [1]:
import numpy as np
from pathlib import Path

# The notebook is assumed to run from a subdirectory of the repository root
REPO_BASEPATH = Path.cwd().parent
DATA_PATH = REPO_BASEPATH / "pnet_data/raw/rho_small.npz"

events = np.load(DATA_PATH)["feats"]

In [2]:
### dataset content and types

# n_samples and points_per_sample
print("Data format and shape:")
print(f"{type(events)}\t{events.shape=}\n\n")
# Note: structured numpy array --> columns accessible by name

# dataset columns
print("Column\tdtype")
for colname, coltype in events.dtype.descr:
    print(f"{colname}\t:{coltype}")

Data format and shape:
<class 'numpy.ndarray'>	events.shape=(325, 800)


Column	dtype
event_number	:<i4
cell_ID	:<i4
track_ID	:<i4
delta_R	:<f4
truth_cell_focal_fraction_energy	:<f4
truth_cell_non_focal_fraction_energy	:<f4
truth_cell_neutral_fraction_energy	:<f4
truth_cell_total_energy	:<f4
category	:|i1
track_num	:<i4
x	:<f4
y	:<f4
z	:<f4
distance	:<f4
normalized_x	:<f4
normalized_y	:<f4
normalized_z	:<f4
normalized_distance	:<f4
cell_sigma	:<f4
track_chi2_dof	:<f4
track_chi2_dof_cell_sigma	:<f4
cell_E	:<f4
normalized_cell_E	:<f4
track_pt	:<f4
normalized_track_pt	:<f4
track_pt_cell_E	:<f4
normalized_track_pt_cell_E	:<f4
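
Since `events` is a structured NumPy array, each column can be pulled out by name while keeping the `(n_samples, n_points_per_sample)` layout. The sketch below shows this on a small toy array; the array and its values are made up, and only the column names are borrowed from the listing above.

```python
import numpy as np

# Hypothetical toy array: 2 samples x 3 points, with three of the columns above.
toy_dtype = [("event_number", "<i4"), ("x", "<f4"), ("cell_E", "<f4")]
toy_events = np.zeros((2, 3), dtype=toy_dtype)
toy_events["event_number"] = [[7], [8]]  # one event number per sample, broadcast over points
toy_events["cell_E"] = np.arange(6, dtype=np.float32).reshape(2, 3)

# Indexing by field name returns a plain array with the same 2D shape.
cell_energy = toy_events["cell_E"]

# Indexing by position returns one sample (all of its points, all columns).
first_sample = toy_events[0]

# Indexing with a list of names returns a structured view with only those columns.
coords_and_energy = toy_events[["x", "cell_E"]]

print(cell_energy.shape, first_sample.shape, coords_and_energy.dtype.names)
```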


In [3]:
# dataset statistics

n_samples, n_points_per_sample = events.shape

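# event number of each sample, read from its first point
# (assumes the event number is constant within a sample)
sample_event_ids = events["event_number"][:, 0]
# unique events and the number of samples each one contributes
event_ids, event_ids_count = np.unique(sample_event_ids, return_counts=True)
n_events = len(event_ids)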

print(f"{n_events=}")
print(f"{n_samples=}")
print(f"{n_points_per_sample=}")

n_events=314
n_samples=325
n_points_per_sample=800
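
With 325 samples from 314 events, some events must contribute more than one sample. The counts returned by `np.unique` can be used to find them. The sketch below does this on a toy list of event numbers (the values are made up); applying the same lines to `sample_event_ids` would give the corresponding breakdown for the real data.

```python
import numpy as np

# Toy stand-in for `sample_event_ids`: one event number per sample.
toy_sample_event_ids = np.array([10, 11, 11, 12, 13, 13, 13])

toy_event_ids, toy_counts = np.unique(toy_sample_event_ids, return_counts=True)

# Events that contribute more than one sample.
multi_sample_events = toy_event_ids[toy_counts > 1]

# Map "samples per event" to "number of events with that many samples".
samples_per_event, n_events_with = np.unique(toy_counts, return_counts=True)
histogram = dict(zip(samples_per_event.tolist(), n_events_with.tolist()))

print(multi_sample_events)
print(histogram)
```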
