# Trigger Efficiency Measurement for Dilepton H→WW Analysis

## Overview
This notebook measures the efficiency of the electron-muon dilepton trigger paths used in the H→WW analysis.
The trigger efficiency is the fraction of events passing the offline physics selection that also fire at least one of the required High-Level Trigger (HLT) paths.

## Physics Context
In high-energy physics experiments, triggers are essential hardware/software filters that select interesting collision events from the billions of interactions per second. 
For the H→WW→e-μ analysis, we rely on two HLT paths with complementary thresholds, requiring either:
- A muon with pT > 12 GeV and an electron with pT > 23 GeV (the Mu12/Ele23 path), OR
- A muon with pT > 23 GeV and an electron with pT > 12 GeV (the Mu23/Ele12 path)

The trigger efficiency describes how reliably these paths select events that meet our physics criteria.

## Purpose
Understanding trigger efficiency is critical because:
1. **Correction Factor**: We need to correct Monte Carlo simulations to match real detector performance
2. **Event Yield**: Determines how many events we retain for analysis
3. **Systematic Uncertainty**: Differences between data and MC efficiency contribute to analysis uncertainty

## Inputs
- **Data Files**: 48 NanoAOD ROOT files from 2016 Run periods G and H (MuonEG dataset)
- **Golden JSON**: Certification file that marks good data-taking periods as valid
- **Event Selection**: Collision events containing exactly 2 leptons (e-μ pairs) passing quality criteria

## Outputs
- **Trigger Efficiency**: A single scalar value (~91.29%) representing the fraction of selected events that pass HLT
- **Detailed Breakdown**: File and sample-level statistics for validation

## Analysis Workflow
1. **Load Data**: Read NanoAOD files with distributed processing (Dask)
2. **Apply Data Quality**: Filter events using golden JSON certification
3. **Lepton Selection**: Identify tight electrons and muons passing isolation criteria
4. **Event Selection**: Apply kinematic cuts to build a pure dilepton sample
5. **Trigger Requirement**: Count how many selected events fire at least one of the required HLT paths
6. **Efficiency Calculation**: Compute the ratio of triggered events to selected events

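**Efficiency Formula**: $\varepsilon_{\mathrm{trig}} = N_{\mathrm{sel+HLT}} / N_{\mathrm{sel}}$, where $N_{\mathrm{sel}}$ counts events passing the offline selection and $N_{\mathrm{sel+HLT}}$ counts those that also fire the trigger.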

In [1]:
import os
import sys
import time
import gc 
import psutil
import json
from pathlib import Path

import uproot
import awkward as ak
import numpy as np

import vector
vector.register_awkward()

import dask
from dask.distributed import Client

print("All imports added")

All imports added


## Setup and Configuration

### Required Libraries
- **uproot**: Reading ROOT files in pure Python, here streamed over XRootD (`root://` URLs)
- **awkward**: Efficient array operations on nested data structures
- **numpy**: Numerical computations
- **dask**: Distributed parallel processing across multiple workers
- **vector**: 4-vector physics calculations (energy-momentum)

### Distributed Processing
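This notebook uses Dask to distribute the 48 files as independent tasks, so runtime shrinks as workers are added.
A Client connection sends tasks to the available workers, each processing one file at a time; the cluster summary printed below reports a single worker with one thread at connection time.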

In [2]:
client = Client("tls://localhost:8786")
client

Client
  Connection method: Direct
  Dashboard: /user/anujraghav.physics@gmail.com/proxy/8787/status

Scheduler
  Comm: tls://192.168.202.5:8786
  Dashboard: /user/anujraghav.physics@gmail.com/proxy/8787/status
  Workers: 1
  Total threads: 1
  Total memory: 2.89 GiB
  Started: Just now

Worker
  Comm: tls://129.93.182.107:44115
  Dashboard: /user/anujraghav.physics@gmail.com/proxy/33599/status
  Nanny: tls://172.19.0.3:37857
  Total threads: 1
  Memory: 2.89 GiB
  Local directory: /var/lib/condor/execute/dir_3267622/dask-scratch-space/worker-uzhh8npa
  CPU usage: 2.0%
  Last seen: Just now
  Memory usage: 171.64 MiB
  Spilled bytes: 0 B
  Read bytes: 462.8115232872504 B
  Write bytes: 1.62 kiB


In [3]:
HOME_DIR = Path(os.environ.get("HOME", "/home/cms-jovyan"))
PROJECT_NAME = "H-to-WW-NanoAOD-analysis"

PROJECT_DIR = HOME_DIR / PROJECT_NAME
DATASETS_DIR = PROJECT_DIR / "Datasets"
DATA_DIR = DATASETS_DIR / "DATA"
AUX_DIR = PROJECT_DIR / "Auxillary_files"

GOLDEN_JSON_PATH = AUX_DIR / "Cert_271036-284044_13TeV_Legacy2016_Collisions16_JSON.txt"

RUN_PERIODS_2016 = {
    "Run2016G": {"run_min": 278820, "run_max": 280385},
    "Run2016H": {"run_min": 280919, "run_max": 284044}
}

print(f"HOME_DIR:         {HOME_DIR}")
print(f"PROJECT_DIR:     {PROJECT_DIR}")
print(f"DATA_DIR:        {DATA_DIR}")
print(f"AUX_DIR:         {AUX_DIR}")
print(f"GOLDEN_JSON:      {GOLDEN_JSON_PATH}")
print(f"JSON exists:     {GOLDEN_JSON_PATH.exists()}")


HOME_DIR:         /home/cms-jovyan
PROJECT_DIR:     /home/cms-jovyan/H-to-WW-NanoAOD-analysis
DATA_DIR:        /home/cms-jovyan/H-to-WW-NanoAOD-analysis/Datasets/DATA
AUX_DIR:         /home/cms-jovyan/H-to-WW-NanoAOD-analysis/Auxillary_files
GOLDEN_JSON:      /home/cms-jovyan/H-to-WW-NanoAOD-analysis/Auxillary_files/Cert_271036-284044_13TeV_Legacy2016_Collisions16_JSON.txt
JSON exists:     True


In [44]:
SAMPLE_MAPPING = {
    'data' : "Data",
}

def load_urls_from_files(filepath, max_files = None):
    urls = []

    if not os.path.exists(filepath):
        return urls

    with open(filepath, 'r') as f:
        for line in f:
            line = line.strip()
            if line and line.startswith('root://'):
                urls.append(line)
                if max_files and len(urls) >= max_files:
                    break
    return urls

def load_all_files(data_dir, max_per_sample = None):

    files_dict = {}

    if not os.path.exists(data_dir):
        print(f"Directory not found: {data_dir}")
        return files_dict

    # Loop directly over files in the single data_dir
    for filename in os.listdir(data_dir):
        if not filename.endswith(".txt"):
            continue

        filepath = os.path.join(data_dir, filename)
        filename_lower = filename.lower().replace('.txt', '')

        label = None

        for pattern, sample_label in SAMPLE_MAPPING.items():
            if pattern in filename_lower:
                label = sample_label
                break

        if not label:
            print(f" unknown file: {filename}- skipping")
            continue

        urls = load_urls_from_files(filepath, max_per_sample)

        if urls: 
            if label in files_dict:
                files_dict[label].extend(urls)
            else:
                files_dict[label] = urls

    return files_dict

# files = load_all_files(DATA_DIR, max_per_sample=1)
files = load_all_files(DATA_DIR)

print("\n" + "="*70)
print("FILES TO PROCESS")
print("="*70)
total = 0
for label, urls in files.items():
    print(f"{label:20s}: {len(urls):4d} files")
    total += len(urls)
print("_"*70)
print(f"{'TOTAL':20s}: {total:4d} files")
print("="*70)


FILES TO PROCESS
Data                :   48 files
______________________________________________________________________
TOTAL               :   48 files


## Step 1: Data Quality and Certification

### Golden JSON Validation
Not all recorded data is suitable for physics analysis. Detector malfunctions, subsystem outages, or beam problems 
can occur during data-taking. The "golden JSON" file lists the certified luminosity blocks (short sub-intervals of a run) 
for which all detector subsystems were operational and data quality was verified.

**What we do**: Filter events to only use data from certified runs and luminosity blocks
**Why**: Ensures we only analyze good-quality data with reliable detector performance
**How**: For each event, check if its run number and luminosity block appear in the golden JSON list
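
To make the run/luminosity-block check concrete, here is a minimal sketch on toy inputs. The certification dictionary, its run numbers and lumi ranges, and the helper name `is_certified` are all invented for illustration; they are not taken from the real golden JSON or from this notebook's functions.

```python
import numpy as np

# Toy golden JSON: run number (as a string) -> list of inclusive [first, last] lumi ranges.
# These values are made up for illustration only.
toy_golden = {"278820": [[1, 50], [60, 100]], "280919": [[10, 20]]}

def is_certified(runs, lumis, golden):
    """Return a boolean mask that is True where (run, lumi) lies in a certified range."""
    runs = np.asarray(runs)
    lumis = np.asarray(lumis)
    mask = np.zeros(len(runs), dtype=bool)
    for run_str, ranges in golden.items():
        in_run = runs == int(run_str)
        for first, last in ranges:
            # Both range boundaries are inclusive
            mask |= in_run & (lumis >= first) & (lumis <= last)
    return mask

runs = [278820, 278820, 280919, 999999]
lumis = [55, 60, 15, 1]
certified = is_certified(runs, lumis, toy_golden)
print(certified)
```

Only the second and third toy events survive: the first falls in the gap between two certified ranges, and the last belongs to a run that is not listed at all.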

In [45]:
def load_golden_json(json_input, run_periods=None):
    """
    Load golden JSON from either a file path (str) or a dict.
    """
    
    if isinstance(json_input, (str, os.PathLike)):
        # Accept plain strings as well as pathlib.Path objects such as GOLDEN_JSON_PATH
        with open(json_input, 'r') as f:
            golden_json = json.load(f)
    elif isinstance(json_input, dict):
        golden_json = json_input
    else:
        raise TypeError(f"Expected str, path-like, or dict, got {type(json_input)}")
    
    valid_lumis = {}
    for run_str, lumi_ranges in golden_json.items():
        run = int(run_str)
        
        # Filter by run periods 
        if run_periods is not None: 
            in_period = any(
                period['run_min'] <= run <= period['run_max']
                for period in run_periods.values()
            )
            if not in_period:
                continue
        
        valid_lumis[run] = [tuple(lr) for lr in lumi_ranges]
    
    return valid_lumis


def apply_json_mask(arrays, json_input, run_periods=None):

    valid_lumis = load_golden_json(json_input, run_periods)
    
    runs = ak.to_numpy(arrays.run)
    lumis = ak.to_numpy(arrays.luminosityBlock)
    
    mask = np.zeros(len(runs), dtype=bool)
    
    for run, lumi_ranges in valid_lumis.items():
        run_mask = (runs == run)
        
        if not np.any(run_mask):
            continue
        
        # Check lumi sections 
        run_lumis = lumis[run_mask]
        run_lumi_mask = np.zeros(len(run_lumis), dtype=bool)
        
        for lumi_start, lumi_end in lumi_ranges: 
            run_lumi_mask |= (run_lumis >= lumi_start) & (run_lumis <= lumi_end)
        
        mask[run_mask] = run_lumi_mask
    
    return ak.Array(mask)

## Step 2: Event Loading and Reconstruction

### Reading NanoAOD Files
NanoAOD (Nano Analysis Object Data) is a compressed ROOT format containing the essential physics objects 
for analysis: electrons, muons, jets, and missing energy.

**What we read**:
- **Leptons**: Electron and muon 4-momenta (pT, η, φ, mass) and quality flags
- **Isolation**: Relative isolation variables (how much nearby activity surrounds the lepton)
- **Triggers**: Boolean flags indicating which HLT paths fired
- **Missing Energy**: PuppiMET (missing transverse momentum computed with the PUPPI pileup-mitigation algorithm)
- **Metadata**: Run and luminosity block numbers for data quality

**Why batching**: Processing files in chunks (~1.25M events) manages memory efficiently while maintaining performance
**How**: uproot's `tree.iterate` reads the ROOT tree in batches of configurable size; `load_events` wraps it in a retry loop to recover from network timeouts

In [46]:
Batch_size = 1_250_000

def load_events(file_url, batch_size=1_250_000, timeout=600, max_retries=3, retry_wait=10, is_data=False):
    columns = [
        "Electron_pt", "Electron_eta", "Electron_phi", "Electron_mass", 
        "Electron_mvaFall17V2Iso_WP90", "Electron_charge",
        
        "Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass", 
        "Muon_tightId", "Muon_charge", "Muon_pfRelIso04_all",
        "PuppiMET_pt", "PuppiMET_phi",
        
        "Jet_pt", "Jet_eta", "Jet_phi", "Jet_mass",
        "Jet_btagDeepFlavB", "nJet", "Jet_jetId", "Jet_puId",

        "HLT_Mu12_TrkIsoVVL_Ele23_CaloIdL_TrackIdL_IsoVL_DZ",
        "HLT_Mu23_TrkIsoVVL_Ele12_CaloIdL_TrackIdL_IsoVL_DZ"
    ]

    columns.extend(["run", "luminosityBlock"])
        
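    # Entries already handed to the caller; a retry resumes from here so that
    # batches yielded before a failure are not yielded (and counted) twice
    entries_done = 0

    for attempt in range(max_retries):
        try:
            with uproot.open(file_url, timeout=timeout) as f:
                tree = f['Events']
                
                for arrays in tree.iterate(columns, step_size=batch_size, library="ak",
                                           entry_start=entries_done):
                    yield arrays
                    entries_done += len(arrays)
                
                return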
                
        except (TimeoutError, OSError, IOError, ConnectionError) as e:
            error_type = type(e).__name__
            file_name = file_url.split('/')[-1]
            
            if attempt < max_retries - 1:
                print(f"      {error_type} on {file_name}")
                print(f"       Retry {attempt+1}/{max_retries-1} in {retry_wait}s...")
                time.sleep(retry_wait)
            else:
                print(f"     FAILED after {max_retries} attempts: {file_name}")
                print(f"       Error: {str(e)[:100]}")
                raise
                
        except Exception as e:
            file_name = file_url.split('/')[-1]
            print(f"     Unexpected error on {file_name}: {str(e)[:100]}")
            raise

## Step 3: Lepton Selection (Object Identification)

### Tight Lepton Criteria
We select "tight" leptons that pass stringent quality criteria, ensuring they are real particles from collisions 
and not detector noise or misidentified background.

**Tight Electrons**:
- MVA ID (Multivariate Analysis): A machine learning score indicating electron-like properties
- We use the "Fall17V2Iso_WP90" working point, a boolean flag marking electrons that pass the MVA threshold tuned for roughly 90% signal efficiency (not a score cut at 0.9)

**Tight Muons**:
- Tight ID flag: A set of quality cuts on track fit and chamber hits
- Isolation requirement: Relative isolation < 0.15 (pfRelIso04_all)
  - This ensures little hadronic/electromagnetic activity near the muon direction
  - Suppresses non-prompt muons produced inside jets, such as those from heavy-flavor hadron decays

**What we do**: Create a combined lepton collection containing both tight electrons and tight muons
**Why**: Tight selection reduces background contamination and ensures we study real leptons
**How**: Apply boolean masks to filter lepton arrays, then concatenate electron and muon collections
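
The masking-and-merging logic can be illustrated on a single toy event in plain Python. This is a simplified sketch, not the notebook's awkward-array implementation: the event contents are invented, and `tight_leptons_in_event` is a hypothetical helper that mirrors the same cuts (WP90 flag for electrons, tight ID plus relative isolation below 0.15 for muons).

```python
# One toy event with per-object lists, mimicking NanoAOD's variable-length layout.
# All values are invented for illustration.
event = {
    "Electron_pt": [35.0, 18.0],
    "Electron_mvaFall17V2Iso_WP90": [True, False],
    "Muon_pt": [28.0, 12.0],
    "Muon_tightId": [True, True],
    "Muon_pfRelIso04_all": [0.05, 0.40],
}

def tight_leptons_in_event(ev, max_rel_iso=0.15):
    """Return the tight electrons followed by the tight muons of one event."""
    electrons = [
        {"pt": pt, "flavor": 11}
        for pt, passes_id in zip(ev["Electron_pt"], ev["Electron_mvaFall17V2Iso_WP90"])
        if passes_id
    ]
    muons = [
        {"pt": pt, "flavor": 13}
        for pt, tight, iso in zip(ev["Muon_pt"], ev["Muon_tightId"], ev["Muon_pfRelIso04_all"])
        if tight and iso < max_rel_iso
    ]
    # Concatenate the two collections into one lepton list, as the notebook does per event
    return electrons + muons

leptons = tight_leptons_in_event(event)
print(leptons)
```

Here the second electron fails the ID flag and the second muon fails isolation, leaving one electron and one muon in the combined collection.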

In [47]:
def select_tight_leptons(arrays):
    tight_electron_mask = arrays.Electron_mvaFall17V2Iso_WP90 == 1
    tight_muon_mask = (arrays.Muon_tightId == 1) & (arrays.Muon_pfRelIso04_all < 0.15)
    
    tight_electrons = ak.zip({
        "pt": arrays.Electron_pt[tight_electron_mask],
        "eta": arrays.Electron_eta[tight_electron_mask],
        "phi": arrays.Electron_phi[tight_electron_mask],
        "mass": arrays.Electron_mass[tight_electron_mask],
        "charge": arrays.Electron_charge[tight_electron_mask],
        "flavor": ak.values_astype(ak.ones_like(arrays.Electron_pt[tight_electron_mask]) * 11, "int32")
    })
    
    tight_muons = ak.zip({
        "pt": arrays.Muon_pt[tight_muon_mask],
        "eta": arrays.Muon_eta[tight_muon_mask],
        "phi": arrays.Muon_phi[tight_muon_mask],
        "mass": arrays.Muon_mass[tight_muon_mask],
        "charge": arrays.Muon_charge[tight_muon_mask],
        "flavor": ak.values_astype(ak.ones_like(arrays.Muon_pt[tight_muon_mask]) * 13, "int32")
    })
    
    tight_leptons = ak.concatenate([tight_electrons, tight_muons], axis=1)
    return tight_leptons
    

## Step 4: Kinematic Variable Calculation

### Physics Variables for Event Selection
To ensure we're selecting genuine H→WW→e-μ events, we calculate several kinematic variables from the lepton and MET 4-momenta.

**Key Variables**:
- **M_ll (Dilepton Mass)**: Invariant mass of the two leptons
  - H→WW events show a characteristic mass distribution
  - We cut at M_ll > 12 GeV to reject low-mass backgrounds (e.g., low-mass resonances and QCD multijet events)

- **pT_ll (Dilepton Momentum)**: Combined transverse momentum of both leptons
  - A boosted dilepton system helps suppress Drell-Yan events (e.g., Z→ττ→eμ), where the two leptons tend to be back-to-back
  - Cut at pT_ll > 30 GeV

- **MT_Higgs (Higgs Transverse Mass)**: Constructed from both leptons and missing energy
  - Accounts for the undetected neutrinos from W decays
  - Cut at MT_Higgs > 60 GeV to suppress background

- **MT_L2 (Subleading Lepton Transverse Mass)**: Between the lower-pT lepton and MET
  - Sensitive to W decay kinematics
  - Cut at MT_L2 > 30 GeV

- **ΔΦ (Azimuthal angle difference)**: Angle between the two leptons in the transverse plane
  - Wrapped to the range [-π, π); computed for reference, with no cut applied on it in this notebook

**Why these cuts**: Each variable helps distinguish H→WW signal from background processes
**How**: Construct 4-vectors from lepton properties, use vector addition/subtraction for combined quantities, apply trigonometry for angular calculations
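
The transverse-mass definitions can be checked by hand on scalar inputs. The sketch below uses NumPy only and hypothetical helper names; the formulas follow the same form as this notebook's code, with MT_L2 = sqrt(2 · pT · MET · (1 − cos Δφ)) and MT_Higgs = sqrt(M_ll² + 2 · (ET_ll · MET − pT_ll · MET · cos Δφ)), where ET_ll = sqrt(pT_ll² + M_ll²). The input values are toy numbers chosen to give round results.

```python
import numpy as np

def wrap_to_pi(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + np.pi) % (2 * np.pi) - np.pi

def transverse_mass(pt, met, dphi):
    """Transverse mass of a (massless) lepton and the missing transverse momentum."""
    return np.sqrt(2.0 * pt * met * (1.0 - np.cos(dphi)))

def higgs_transverse_mass(ptll, mll, met, dphi_ll_met):
    """Transverse mass built from the dilepton system and MET."""
    et_ll = np.sqrt(ptll**2 + mll**2)
    return np.sqrt(mll**2 + 2.0 * (et_ll * met - ptll * met * np.cos(dphi_ll_met)))

# Toy values in GeV and radians, chosen so the results are easy to verify by hand.
dphi = wrap_to_pi(1.5 * np.pi)                          # 270 degrees wraps to -90 degrees
mt_l2 = transverse_mass(40.0, 40.0, np.pi)              # back-to-back: sqrt(2*40*40*2) = 80
mt_h = higgs_transverse_mass(30.0, 40.0, 50.0, np.pi)   # ET_ll = 50, so sqrt(1600 + 8000)
print(dphi, mt_l2, mt_h)
```

With these inputs both transverse masses exceed the thresholds used in the selection (30 GeV and 60 GeV respectively), so this toy configuration would pass those two cuts.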

In [48]:
def wrap_angle_to_pi(angle):
    
    return (angle + np.pi) % (2 * np.pi) - np.pi

def create_lepton_vector(lepton):
    """Create 4-vector from lepton properties """
    return vector.array({
        "pt": lepton.pt,
        "eta": lepton.eta,
        "phi": lepton.phi,
        "mass": lepton.mass
    })

def cal_kinematic_var(leading, subleading, met):

    # Create vectors
    lepton_1 = create_lepton_vector(leading)
    lepton_2 = create_lepton_vector(subleading)


    dilepton = lepton_1 + lepton_2
    
    #  Basic Variables
    mll = dilepton.mass
    ptll = dilepton.pt
    dphi = wrap_angle_to_pi(leading.phi - subleading.phi)

    # Higgs Transverse Mass 
    dll_et = np.sqrt(dilepton.pt**2 + dilepton.mass**2)
    mt_higgs_dphi = wrap_angle_to_pi(dilepton.phi - met.phi)
    term_1 = mll**2
    term_2 = 2 * (dll_et * met.pt - dilepton.pt * met.pt * np.cos(mt_higgs_dphi))
    
    mt_higgs = np.sqrt(term_1 + term_2)
    

    # Lepton 2 Transverse Mass
    mt_l2_met_dphi = wrap_angle_to_pi(subleading.phi - met.phi)
    mt_l2_met = np.sqrt(2 * subleading.pt * met.pt * (1 - np.cos(mt_l2_met_dphi)))


    return mll, ptll, dphi, mt_higgs, mt_l2_met


## Step 5: Event Selection (Kinematic Cuts & Trigger)

### Electron-Muon Pair Selection
We select events containing exactly one electron and one muon (dilepton events) passing multiple criteria.

**Selection Criteria Applied**:

1. **Multiplicity**: Exactly 2 tight leptons (e-μ pair)

2. **Flavor Requirement**: One electron (PDG ID = 11) and one muon (PDG ID = 13)
   - Rejects e-e or μ-μ pairs which come from different processes

3. **Charge Requirement**: Opposite sign (charge product < 0)
   - Rejects same-sign pairs which are rare in signal but common in backgrounds

4. **Transverse Momentum (pT)**:
   - Leading lepton: pT > 25 GeV
   - Subleading lepton: pT > 15 GeV
   - Keeps both leptons above the HLT thresholds (23 and 12 GeV), so the selection sits near the trigger efficiency plateau

5. **Pseudorapidity (η)**:
   - Electrons: |η| < 2.5 (detector coverage)
   - Muons: |η| < 2.4 (detector coverage)
   - Ensures leptons are in well-instrumented detector regions

6. **Kinematic Variables** (from Step 4):
   - M_ll > 12 GeV, pT_ll > 30 GeV
   - MT_Higgs > 60 GeV, MT_L2 > 30 GeV
   - MET > 20 GeV (indicating W decays with neutrinos)

7. **Trigger Requirement** (for efficiency calculation):
   - Event must fire one of two HLT paths:
     - `HLT_Mu12_TrkIsoVVL_Ele23_CaloIdL_TrackIdL_IsoVL_DZ`: Muon pT > 12 & Electron pT > 23
     - `HLT_Mu23_TrkIsoVVL_Ele12_CaloIdL_TrackIdL_IsoVL_DZ`: Muon pT > 23 & Electron pT > 12
   - "DZ" in the trigger name denotes a requirement that the two leptons be close in z (|Δz| below a threshold), i.e. consistent with a common vertex

**Efficiency Definition**:
- **Denominator**: Events passing all kinematic cuts (but not necessarily HLT)
- **Numerator**: Events passing kinematics AND HLT
- **Result**: Ratio tells us what fraction of signal-like events are retained by the trigger

**What we do**: Count events in two categories: (1) kinematics only, (2) kinematics + HLT
**Why**: Enables direct calculation of trigger efficiency as a ratio
**How**: Apply boolean masks sequentially, count surviving events at each stage
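
As a quick numerical check of the efficiency definition, the sketch below computes the ratio from the counts reported in this notebook's results table (121,224 events passing kinematics and HLT out of 132,784 passing kinematics). The statistical uncertainty is an addition that the notebook itself does not compute: it uses the simple normal-approximation binomial error, which is an assumption of this example, and `trigger_efficiency` is a hypothetical helper name.

```python
import math

def trigger_efficiency(n_pass, n_total):
    """Return (efficiency, binomial standard error); (0.0, 0.0) for an empty denominator."""
    if n_total == 0:
        return 0.0, 0.0
    eff = n_pass / n_total
    # Normal-approximation binomial error; an assumption of this sketch
    err = math.sqrt(eff * (1.0 - eff) / n_total)
    return eff, err

# Counts taken from the results table of this notebook
eff, err = trigger_efficiency(121_224, 132_784)
print(f"efficiency = {100 * eff:.2f}% +/- {100 * err:.2f}%")
```

For efficiencies very close to 0 or 1, or for small samples, an interval such as Clopper-Pearson is usually preferred over this approximation.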

In [55]:
def select_emu_events(tight_leptons, arrays, met):
    # Sort leptons
    sorted_leptons = tight_leptons[ak.argsort(tight_leptons.pt, ascending=False)]

    mask_2lep = ak.num(sorted_leptons) == 2
    
    events_2lep = sorted_leptons[mask_2lep]
    arrays_2lep = arrays[mask_2lep]  # Keeps HLT branches aligned
    met_2lep = met[mask_2lep]

    if len(events_2lep) == 0:
        return 0, 0, None

    # Kinematic Cuts 
    leading = events_2lep[:, 0]
    subleading = events_2lep[:, 1]

        
    # calculating variables 
    mll, ptll, dphi, mt_higgs, mt_l2_met = cal_kinematic_var(leading, subleading, met_2lep)

    # CUTS
    mask_flavor = ((leading.flavor == 13) & (subleading.flavor == 11)) | \
                  ((leading.flavor == 11) & (subleading.flavor == 13))
    mask_charge = leading.charge * subleading.charge < 0
    mask_pt = (leading.pt > 25) & (subleading.pt > 15)

    pass_eta_leading = ((leading.flavor == 11) & (abs(leading.eta) < 2.5)) | \
                       ((leading.flavor == 13) & (abs(leading.eta) < 2.4))
                       
    pass_eta_subleading = ((subleading.flavor == 11) & (abs(subleading.eta) < 2.5)) | \
                          ((subleading.flavor == 13) & (abs(subleading.eta) < 2.4))
    
    mask_eta = pass_eta_leading & pass_eta_subleading
    other_masks = ((mll > 12) &
                  (mt_higgs > 60) &
                  (mt_l2_met > 30) &
                  (ptll > 30) &
                  (met_2lep.pt > 20))
    # Full e-mu selection mask: defines the denominator of the efficiency
    mask_emu_kinematics = mask_flavor & mask_charge & mask_pt & mask_eta & other_masks

    # Trigger (HLT) Cut
    mask_hlt = (arrays_2lep.HLT_Mu12_TrkIsoVVL_Ele23_CaloIdL_TrackIdL_IsoVL_DZ == 1) | \
               (arrays_2lep.HLT_Mu23_TrkIsoVVL_Ele12_CaloIdL_TrackIdL_IsoVL_DZ == 1)

    #  Final Masks
    # Events passing ONLY kinematics
    events_passing_emu = events_2lep[mask_emu_kinematics]
    
    # Events passing Kinematics AND HLT
    final_mask = mask_emu_kinematics & mask_hlt
    events_passing_all = events_2lep[final_mask]

    # Return counts and the final objects
    n_emu = len(events_passing_emu)
    n_final = len(events_passing_all)
    
    return n_emu, n_final, events_passing_all

## Step 6: Distributed Processing Framework

### Parallel File Processing
With 48 data files, we use Dask to distribute the work: each file becomes an independent task that any available worker can pick up.
In the run recorded in this notebook, all 48 files were processed in 184.6 seconds.

**Processing Strategy**:
1. Create a "processor function" with the golden JSON and run periods baked in (closure pattern)
2. Submit all 48 files as independent tasks to Dask workers
3. Each worker:
   - Loads events from one file in batches
   - Applies JSON validation
   - Performs lepton and event selection
   - Counts kinematic and HLT events
4. Collect results from all workers and aggregate statistics

**Error Handling**:
- Network timeouts on ROOT file access: Automatic retry (up to 3 attempts)
- Processing errors: Caught and logged, file marked as failed
- Allows analysis to complete even if a few files become unavailable

**What happens**: Each file is processed independently and in parallel.
**Why**: Files do not depend on each other, so wall-clock time shrinks as more workers become available.
**How**: Dask Client maps the processor function across all file URLs.
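
The closure-plus-map pattern can be tried without a cluster. In the sketch below, `concurrent.futures.ThreadPoolExecutor.map` from the standard library is swapped in for Dask's `client.map` purely so the example is self-contained; the toy "files", the threshold cut, and the names `make_counter` and `count_file` are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def make_counter(threshold):
    """Factory returning a worker function with `threshold` captured in its closure."""
    def count_file(label, values):
        n_total = len(values)
        n_pass = sum(v > threshold for v in values)
        # Same (label, denominator, numerator, error) shape as the notebook's worker
        return label, n_total, n_pass, None
    return count_file

worker = make_counter(threshold=10)

# Toy "files": each list of numbers stands in for the events of one file.
labels = ["Data", "Data", "Data"]
toy_files = [[5, 12, 30], [8, 9], [11, 40, 2, 50]]

# Stand-in for client.map: one task per file, results returned in submission order
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(worker, labels, toy_files))

# Aggregate per-sample totals, skipping any task that reported an error
totals = defaultdict(lambda: [0, 0])
for label, n_total, n_pass, error in results:
    if error is None:
        totals[label][0] += n_total
        totals[label][1] += n_pass

print(dict(totals))
```

Baking the configuration into the closure means each task only needs its per-file arguments, which keeps the mapped call signature small.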

In [56]:
import time
import awkward as ak
import numpy as np

def make_processor(golden_json_data, run_periods):
    """
    Factory function that returns a worker function with 
    JSON data and Run Periods baked in (Closure Pattern).
    """

    def processing_file(label, file_url, file_idx):
        
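        file_name = file_url.split('/')[-1] 
        is_data = True
        
        max_file_retries = 3

        for file_attempt in range(max_file_retries):
            # Reset counters on every attempt so that a retry after a partial read
            # does not double-count batches from the failed attempt
            count_emu_kinematics = 0
            count_emu_hlt = 0

            try: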
                # Load Events
                for arrays in load_events(file_url, batch_size=1_250_000, is_data=is_data):
                    
                    #  Apply JSON Mask to Data
                    if is_data and golden_json_data is not None:
                        try:
                            json_mask = apply_json_mask(arrays, golden_json_data, run_periods=run_periods)
                            if np.sum(json_mask) == 0: continue
                            arrays = arrays[json_mask]
                        except Exception as e: 
                            print(f"Warning: JSON mask failed for {file_name}: {e}")
                            continue
                    
                    # Object Selection
                    tight_leptons = select_tight_leptons(arrays)

                    met = ak.zip({"pt": arrays.PuppiMET_pt, "phi": arrays.PuppiMET_phi})
                    
                    # Event Selection (Kinematics + HLT)
                    n_emu, n_hlt, _ = select_emu_events(tight_leptons, arrays, met)
                    
                    #  Accumulate
                    count_emu_kinematics += n_emu
                    count_emu_hlt += n_hlt
                
                # Success
                return label, count_emu_kinematics, count_emu_hlt, None

            except (OSError, IOError, ValueError) as e:
                if file_attempt < max_file_retries - 1:
                    time.sleep(3)
                    continue
                else: 
                    return label, 0, 0, f"{file_name}: Failed after retries - {str(e)[:100]}"
            
            except Exception as e:
                return label, 0, 0, f"{file_name}: Unexpected error - {str(e)[:100]}"

        return label, 0, 0, "Unknown loop exit"

    # Return the inner function
    return processing_file

## Step 7: Main Analysis - Computing Trigger Efficiency

### Execution Summary
The code below:
1. **Loads the golden JSON** into memory (contains ~20k valid luminosity blocks)
2. **Creates file lists** from text files in the data directory
3. **Submits 48 files** to the Dask cluster for parallel processing
4. **Monitors progress** (bar shows completion percentage)
5. **Aggregates results** from all workers
6. **Calculates efficiency** as: (HLT events) / (Kinematic events) × 100%
7. **Reports statistics** with error handling

### Expected Output
The table shows:
- **SAMPLE**: Data or MC (here, only Data)
- **FILES**: Number of files processed for this sample
- **KINEMATICS**: Events passing physics selection criteria
- **HLT PASS**: Events that also fired the required trigger
- **TRIG EFF**: Percentage of selected events that pass the trigger

For this 2016 data sample: **~91.3% of selected e-μ events fire the HLT paths**

This efficiency will be used as a correction factor applied to simulated events.

### Performance
- **Total Time**: 184.6 seconds to process 48 files
- **Average Time per File**: 3.85 seconds (includes I/O, network, processing)
- **Scaling**: Expected to grow roughly linearly with the number of files for a fixed number of workers (not measured here)

In [57]:
# %%
# MAIN PROCESSING (Trigger Efficiency)

import time
import json
from collections import defaultdict
from dask.distributed import progress

print(f"\n{'='*70}\nTRIGGER EFFICIENCY PROCESSING START\n{'='*70}")

golden_json_data = None
if GOLDEN_JSON_PATH.exists():
    # print(f"Reading Golden JSON: {GOLDEN_JSON_PATH.name}")
    with open(GOLDEN_JSON_PATH, 'r') as f:
        golden_json_data = json.load(f)
    # print(f"  Loaded {len(golden_json_data)} runs into memory\n")
else:
    print(f"WARNING: Golden JSON not found at {GOLDEN_JSON_PATH}")

processing_task = make_processor(
    golden_json_data=golden_json_data,
    run_periods=RUN_PERIODS_2016
)

arg_labels = []
arg_urls = []
arg_indices = []

print("Preparing file lists...")

for label, urls in files.items():
    is_data = (label == 'Data')
    
    if is_data:
        if golden_json_data is not None:
             print(f"  {label}: Validation enabled ({len(urls)} files)")
    
    for file_idx, file_url in enumerate(urls):
        arg_labels.append(label)
        arg_urls.append(str(file_url))
        arg_indices.append(file_idx)

start_time = time.perf_counter()

print(f"\nSubmitting {len(arg_urls)} files to the cluster...")

futures = client.map(
    processing_task,    
    arg_labels,
    arg_urls,
    arg_indices,
    retries=1
)

progress(futures)
results = client.gather(futures)
elapsed = time.perf_counter() - start_time

final_stats = defaultdict(lambda: [0, 0, 0]) 
errors = []

for label, n_kinematics, n_hlt, error in results:
    if error:
        errors.append((label, error))
    else:
        stats = final_stats[label]
        stats[0] += n_kinematics # Denominator (Events passing cuts)
        stats[1] += n_hlt        # Numerator (Events passing cuts + HLT)
        stats[2] += 1            # File count

print(f"\n{'='*70}")
print(f"{'SAMPLE':<20} | {'FILES':<8} | {'KINEMATICS':>14} | {'HLT PASS':>12} | {'TRIG EFF':>10}")
print("="*70)

tot_kinematics = tot_hlt = tot_files = 0

for label, (n_kinematics, n_hlt, n_files) in sorted(final_stats.items()):
    eff = (n_hlt / n_kinematics * 100) if n_kinematics > 0 else 0.0
    
    print(f"{label:<20} | {n_files:<8} | {n_kinematics:>14,} | {n_hlt:>12,} | {eff:>9.2f}%")
    
    tot_kinematics += n_kinematics
    tot_hlt += n_hlt
    tot_files += n_files

print("_"*70)
tot_eff = (tot_hlt / tot_kinematics * 100) if tot_kinematics > 0 else 0.0
print(f"{'TOTAL':<20} | {tot_files:<8} | {tot_kinematics:>14,} | {tot_hlt:>12,} | {tot_eff:>9.2f}%")
print(f"{'='*70}")

if errors:
    print(f"\n[!] Encountered {len(errors)} errors:")
    for label, err in errors[:5]: print(f"  - {label}: {err}")
    if len(errors) > 5: print(f"  ... and {len(errors)-5} more.")

print(f"\nDone in {elapsed:.1f}s ({elapsed/len(arg_urls):.2f}s/file)")


TRIGGER EFFICIENCY PROCESSING START
Preparing file lists...
  Data: Validation enabled (48 files)

Submitting 48 files to the cluster...

SAMPLE               | FILES    |     KINEMATICS |     HLT PASS |   TRIG EFF
Data                 | 48       |        132,784 |      121,224 |     91.29%
______________________________________________________________________
TOTAL                | 48       |        132,784 |      121,224 |     91.29%

Done in 184.6s (3.85s/file)


In [28]:
#get name of the branch required for trigger efficiency 

# DATA
root_file_name = "root://eospublic.cern.ch//eos/opendata/cms/Run2016G/MuonEG/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/2ADBED61-A06A-D64B-BE90-E9B267D15700.root"

with uproot.open(root_file_name) as file:
    # Access the Events tree
    if "Events" not in file:
        print("Error: 'Events' tree not found in file.")
    else:
        tree = file["Events"]
        branches = tree.keys()

        print("\nConnection Successful!")
        print(f"Total Branches found: {len(branches)}")
        print("=" * 60)

        # Print only the branches matching the two e-mu HLT paths, in alphabetical order
        hlt_paths = (
            "HLT_Mu12_TrkIsoVVL_Ele23_CaloIdL_TrackIdL_IsoVL_DZ",
            "HLT_Mu23_TrkIsoVVL_Ele12_CaloIdL_TrackIdL_IsoVL_DZ",
        )
        for branch in sorted(branches):
            if any(path in branch for path in hlt_paths):
                print(branch)



Connection Successful!
Total Branches found: 1380
HLT_Mu12_TrkIsoVVL_Ele23_CaloIdL_TrackIdL_IsoVL_DZ
HLT_Mu23_TrkIsoVVL_Ele12_CaloIdL_TrackIdL_IsoVL_DZ
