<a href="https://colab.research.google.com/github/cabbagecongee/Particle_Transformer_Fine_Tunning/blob/main/JetClass2_data_explore.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [18]:
!pip install weaver-core
!pip install onnxruntime





In [19]:
! wget --no-verbose https://github.com/jet-universe/sophon/raw/main/notebooks/JetClassII_example.parquet

2025-07-17 18:16:24 URL:https://raw.githubusercontent.com/jet-universe/sophon/main/notebooks/JetClassII_example.parquet [447746/447746] -> "JetClassII_example.parquet.1" [1]


In [20]:
import awkward as ak
from typing import List, Tuple, Dict
import numpy as np

arrays = ak.from_parquet("JetClassII_example.parquet")
arrays

Each jet entry includes
1. `jet_*` jet level features
2. `part_*` jet constituent features
3. `genjet_*` matched GEN-jet features. only filled in if the jet was matched to a GEN-jet during simulation
4. `genpart_*` GEN-jet constituent features.
5. `aux_genpart_*` selected truth level particles

Basic layout from what I know so far:
* arrays -> shape: (n_jets, )

each element in arrays is a jet, and that jet has mutiple fields
* `arrays["jet_pt"][0]` scalar: pt of jet 0
* `arrays["part_px][0]` list: px of jet 0's particles
* `arrays["aux_genpart_pid"][0]` list: truth particles for jet 0



In [21]:
print(arrays.fields)

['part_px', 'part_py', 'part_pz', 'part_energy', 'part_deta', 'part_dphi', 'part_d0val', 'part_d0err', 'part_dzval', 'part_dzerr', 'part_charge', 'part_isElectron', 'part_isMuon', 'part_isPhoton', 'part_isChargedHadron', 'part_isNeutralHadron', 'jet_pt', 'jet_eta', 'jet_phi', 'jet_energy', 'jet_sdmass', 'jet_nparticles', 'jet_tau1', 'jet_tau2', 'jet_tau3', 'jet_tau4', 'jet_label', 'genpart_px', 'genpart_py', 'genpart_pz', 'genpart_energy', 'genpart_jet_deta', 'genpart_jet_dphi', 'genpart_x', 'genpart_y', 'genpart_z', 'genpart_t', 'genpart_pid', 'genjet_pt', 'genjet_eta', 'genjet_phi', 'genjet_energy', 'genjet_sdmass', 'genjet_nparticles', 'aux_genpart_pt', 'aux_genpart_eta', 'aux_genpart_phi', 'aux_genpart_mass', 'aux_genpart_pid', 'aux_genpart_isResX', 'aux_genpart_isResY', 'aux_genpart_isResDecayProd', 'aux_genpart_isTauDecayProd', 'aux_genpart_isQcdParton']


In [22]:
print(f'jet level features: {arrays["jet_pt"]}') #each index gives the scalar value for one jet, shape (n_jets, )

#constituent level features
print(f' constitutent level features: {arrays["part_px"]}') #shape (n_jets, n_particles_per_jet)
arrays["part_isElectron"][0]

jet level features: [717, 258, 210, 290, 202, 1.34e+03, ..., 518, 499, 1.29e+03, 1.64e+03, 385, 450]
 constitutent level features: [[-202, -140, -76.9, -49.3, -44.9, ..., -0.866, -0.597, -0.563, -0.491], ...]


The goal is to:
1. Extract each jet's constitutents (`part_*`) and format it as a tensor input for the Particle Transformer.
2. Extract each jet's high level features (`jet_*`)
3. Format everything properly for later integration into the full architecture.

In [30]:
import torch
def extract_constituents(jet, keys: List):
  feats = []
  for k in keys:
    data = jet[k]
    if isinstance(data[0], (bool, np.bool_)):
      data = np.array(data, dtype = np.float32)
    feats.append(data)
    return torch.tensor(np.stack(feats, axis=1), dtype=torch.float32)

In [38]:
constituent_keys = [
    "part_px", "part_py", "part_pz", "part_energy",
    "part_deta", "part_dphi",
    "part_d0val", "part_d0err", "part_dzval", "part_dzerr",
    "part_charge",
    "part_isElectron", "part_isMuon", "part_isPhoton",
    "part_isChargedHadron", "part_isNeutralHadron"
]

hlf_keys = ["jet_pt", "jet_eta", "jet_phi", "jet_energy", "jet_sdmass"]
label_key = "jet_label"

In [32]:
jet = arrays[0]
jet_constituents = extract_constituents(jet, constituent_keys)
jet_constituents

tensor([[-202.0545],
        [-140.0157],
        [ -76.9225],
        [ -49.3323],
        [ -44.9454],
        [ -42.0783],
        [ -34.7558],
        [ -21.3512],
        [ -11.2199],
        [ -11.4482],
        [  -8.7899],
        [  -7.8863],
        [  -6.3248],
        [  -5.8400],
        [  -4.5917],
        [  -3.3347],
        [  -3.8522],
        [  -3.2000],
        [  -2.1997],
        [  -2.5759],
        [  -2.4835],
        [  -2.4848],
        [  -2.2239],
        [  -2.2774],
        [  -2.2122],
        [  -2.0639],
        [  -1.6403],
        [  -1.6803],
        [  -1.5309],
        [  -1.3263],
        [  -1.3407],
        [  -0.9704],
        [  -0.8136],
        [  -1.0400],
        [  -1.0182],
        [  -0.9781],
        [  -0.9044],
        [  -0.8993],
        [  -0.8657],
        [  -0.5966],
        [  -0.5629],
        [  -0.4911]])

In [36]:
def extract_hlfs(jet, keys):
  feats = []
  for k in keys:
    data = jet[k]
    feats.append(torch.tensor(data, dtype=torch.float32))
  return torch.stack(feats, dim=0)

In [39]:
extract_hlfs(jet, hlf_keys)

tensor([ 7.1733e+02, -4.4632e-01, -3.0333e+00,  8.3698e+02,  2.7475e+02])