# Branch Inspection for ATLAS Open Data Releases

This notebook inspects the branch names in each ATLAS Open Data release to help configure the per-release schemas.

Each release may have different branch naming conventions (e.g., `AnalysisElectronsAuxDyn` vs `ElectronAuxDyn`).
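
As a rough illustration of what "naming convention" means here, the sketch below splits a branch name of the form `<prefix><object><suffix>.<field>` into its parts. The helper `split_branch_name` is hypothetical and is not part of `atlasopenmagic`, `uproot`, or this repository. It matches the object name case-sensitively, which is an assumption about how the names are written.

```python
def split_branch_name(branch, obj_name):
    """Split '<prefix><obj_name><suffix>.<field>' into (prefix, suffix, field).

    Returns None when the branch has no '.field' part or does not contain
    obj_name (case-sensitive).
    """
    base, sep, field = branch.partition(".")
    idx = base.find(obj_name)
    if not sep or idx < 0:
        return None
    return base[:idx], base[idx + len(obj_name):], field


print(split_branch_name("AnalysisElectronsAuxDyn.pt", "Electrons"))
# ('Analysis', 'AuxDyn', 'pt')
print(split_branch_name("ElectronAuxDyn.pt", "Electrons"))
# None (singular 'Electron' does not contain 'Electrons')
```

The second call shows why a single hard-coded object name may not cover every release: a convention that uses the singular form would need its own entry.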


In [52]:
import atlasopenmagic as atom
import uproot
from collections import defaultdict
import json
from pprint import pprint
import sys
import os

# Add parent directory to path if needed
sys.path.insert(0, os.path.abspath('..'))

from src.parse_atlas import schemas


## Get Available Releases


In [53]:
# Get all available releases
available_releases = atom.available_releases()
print("Available releases:")
for release_id, release_info in available_releases.items():
    print(f"  {release_id}: {release_info}")


Available releases:
2016e-8tev           2016 Open Data for education release of 8 TeV proton-proton collisions (https://opendata.cern.ch/record/3860).
2020e-13tev          2020 Open Data for education release of 13 TeV proton-proton collisions (https://cern.ch/2r7xt).
2024r-pp             2024 Open Data for research release for proton-proton collisions (https://opendata.cern.ch/record/80020).
2024r-hi             2024 Open Data for research release for heavy-ion collisions (https://opendata.cern.ch/record/80035).
2025e-13tev-beta     2025 Open Data for education and outreach beta release for 13 TeV proton-proton collisions (https://opendata.cern.ch/record/93910).
2025r-evgen-13tev    2025 Open Data for research release for event generation at 13 TeV (https://opendata.cern.ch/record/160000).
2025r-evgen-13p6tev  2025 Open Data for research release for event generation at 13.6 TeV (https://opendata.cern.ch/record/160000).
Available releases:
  2016e-8tev: 2016 Open Data for education relea

## Inspect Branches for Each Release


In [54]:
# Objects we're interested in
TARGET_OBJECTS = {"Electrons", "Muons", "Jets", "Photons"}
REQUIRED_FIELDS = {"pt", "eta", "phi"}
OPTIONAL_FIELDS = {"mass"}

# Store results
branch_inspection_results = {}

# Limit files per release for inspection (to save time)
MAX_FILES_PER_RELEASE = 3


In [55]:
# Raw branch names per release (or the marker string "gzipped" when a compressed file was skipped)
release_branches = {}

In [56]:
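def inspect_branches_for_release(release_id, max_files=MAX_FILES_PER_RELEASE):
    """
    Inspect branch names for a specific release.

    Args:
        release_id: Release identifier, e.g. "2024r-pp".
        max_files: Maximum number of sample files to open.

    Returns:
        dict with detected objects, branch patterns and a schema suggestion,
        or None if no datasets or files could be found.
    """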
    print(f"\n{'='*80}")
    print(f"Inspecting release: {release_id}")
    print(f"{'='*80}")
    
    try:
        # Set the release
        atom.set_release(release_id)
        
        # Get datasets
        datasets = atom.available_datasets()
        print(f"Found {len(datasets)} datasets")
        
        if not datasets:
            return None
        
        # Get URLs for first few datasets
        sample_files = []
        for dataset_id in list(datasets)[:max_files]:
            try:
                urls = atom.get_urls(dataset_id)
                if urls:
                    sample_files.extend(urls[:1])  # Take first file from each dataset
                    if len(sample_files) >= max_files:
                        break
            except Exception as e:
                print(f"  Warning: Could not get URLs for dataset {dataset_id}: {e}")
                continue
        
        if not sample_files:
            print(f"  No files found for {release_id}")
            return None
        
        print(f"  Inspecting {len(sample_files)} sample files...")
        
        # Inspect branches from sample files
        all_branches = set()
        branch_patterns = defaultdict(set)
        detected_objects = {}
        for file_url in sample_files[:max_files]:
            try:
                print(f"    Opening: {file_url[:80]}...")
                if 'gz' in file_url:
                    release_branches[release_id] = "gzipped"
                    continue
                with uproot.open(file_url) as root_file:
                    # Find the data tree
                    tree_name = None
                    for key in root_file.keys():
                        if "CollectionTree" in key or "mini" in key or "analysis" in key:
                            tree_name = key.split(";")[0]  # Remove ';1' suffix
                            break
                    
                    if tree_name is None:
                        print(f"      Warning: No suitable tree found")
                        continue
                    
                    tree = root_file[tree_name]
                    branches = set(tree.keys())
                    all_branches.update(branches)
                    release_branches[release_id] = list(branches)
                    
                    # Analyze branch patterns
                    for branch in branches:
                        # Look for object-related branches
                        branch_lower = branch.lower()
                        
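                        # Check for each target object.
                        # Note: this is a case-insensitive substring match, so "muons"
                        # also matches names such as "MuonSpectrometer...", and the first
                        # matching branch seen determines the reported base branch.
                        for obj_name in TARGET_OBJECTS:
                            obj_lower = obj_name.lower()
                            if obj_lower in branch_lower: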
                                # Extract base branch name (before the field)
                                if "." in branch:
                                    base_branch = branch.split(".")[0]
                                    field = branch.split(".")[1]
                                    
                                    if obj_name not in detected_objects:
                                        detected_objects[obj_name] = {
                                            "base_branch": base_branch,
                                            "fields": set(),
                                            "all_branches": []
                                        }
                                    
                                    detected_objects[obj_name]["fields"].add(field)
                                    detected_objects[obj_name]["all_branches"].append(branch)
                                    
                                    # Store pattern
                                    branch_patterns[obj_name].add(base_branch)
                                break
                    
                    print(f"      Found {len(branches)} branches in tree '{tree_name}'")
                    
            except Exception as e:
                print(f"      Error inspecting file: {e}")
                continue
        
        # Analyze results
        result = {
            "release_id": release_id,
            "total_branches": len(all_branches),
            "detected_objects": {},
            "branch_patterns": {},
            "schema_suggestion": {}
        }
        
        for obj_name, obj_data in detected_objects.items():
            base_branch = obj_data["base_branch"]
            fields = sorted(obj_data["fields"])
            
            # Determine prefix and suffix
            # Try to extract pattern: prefix + obj_name + suffix
            obj_lower = obj_name.lower()
            base_lower = base_branch.lower()
            
            if obj_lower in base_lower:
                idx = base_lower.find(obj_lower)
                prefix = base_branch[:idx]
                suffix_start = idx + len(obj_name)
                suffix = base_branch[suffix_start:] if suffix_start < len(base_branch) else ""
            else:
                prefix = ""
                suffix = ""
            
            result["detected_objects"][obj_name] = {
                "base_branch": base_branch,
                "prefix": prefix,
                "suffix": suffix,
                "available_fields": fields,
                "has_required_fields": REQUIRED_FIELDS.issubset(set(fields)),
                "sample_branches": obj_data["all_branches"][:5]  # First 5 as examples
            }
            
            result["branch_patterns"][obj_name] = list(branch_patterns[obj_name])
        
        # Generate schema suggestion
        if detected_objects:
            # Find common prefix and suffix
            prefixes = [obj["prefix"] for obj in result["detected_objects"].values()]
            suffixes = [obj["suffix"] for obj in result["detected_objects"].values()]
            
            common_prefix = prefixes[0] if len(set(prefixes)) == 1 else "<varies>"
            common_suffix = suffixes[0] if len(set(suffixes)) == 1 else "<varies>"
            
            result["schema_suggestion"] = {
                "branch_prefix": common_prefix,
                "branch_suffix": common_suffix,
                "objects": {}
            }
            
            for obj_name, obj_data in result["detected_objects"].items():
                if obj_data["has_required_fields"]:
                    # Include required fields (sorted for a stable order) plus any optional fields present
                    fields_to_include = sorted(REQUIRED_FIELDS)
                    fields_to_include += sorted(OPTIONAL_FIELDS & set(obj_data["available_fields"]))
                    result["schema_suggestion"]["objects"][obj_name] = fields_to_include
        
        return result
        
    except Exception as e:
        print(f"  Error inspecting {release_id}: {e}")
        import traceback
        traceback.print_exc()
        return None


In [57]:
# Inspect all releases
for release_id in available_releases.keys():
    result = inspect_branches_for_release(release_id, max_files=MAX_FILES_PER_RELEASE)
    if result:
        branch_inspection_results[release_id] = result

# Overwrite rather than append: appending a second JSON document would make the file unparseable
with open("branches.json", "w") as f:
    json.dump(release_branches, f, indent=2)



Inspecting release: 2016e-8tev
Fetching and caching all metadata for release: 2016e-8tev...
Successfully cached 43 datasets.
Active release: 2016e-8tev. (Datasets path: REMOTE)
Found 43 datasets
  Inspecting 3 sample files...
    Opening: root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2016-07-29/MC/mc_1...
      Found 46 branches in tree 'mini'
    Opening: root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2016-07-29/MC/mc_1...
      Found 46 branches in tree 'mini'
    Opening: root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2016-07-29/MC/mc_1...
      Found 46 branches in tree 'mini'

Inspecting release: 2020e-13tev
Fetching and caching all metadata for release: 2020e-13tev...
Successfully cached 229 datasets.
Active release: 2020e-13tev. (Datasets path: REMOTE)
Found 229 datasets
  No files found for 2020e-13tev

Inspecting release: 2024r-pp
Fetching and caching all metadata for release: 2024r-pp...
Successfully cached 374 datasets.
Active r

## Display Results


In [None]:
# Print summary for each release
print("\n" + "="*80)
print("BRANCH INSPECTION SUMMARY")
print("="*80)

for release_id, result in branch_inspection_results.items():
    print(f"\n{release_id}:")
    print(f"  Total branches found: {result['total_branches']}")
    print(f"  Detected objects: {list(result['detected_objects'].keys())}")
    
    if result['schema_suggestion']:
        print(f"  Suggested prefix: '{result['schema_suggestion']['branch_prefix']}'")
        print(f"  Suggested suffix: '{result['schema_suggestion']['branch_suffix']}'")
        
        print(f"  Objects and fields:")
        for obj_name, fields in result['schema_suggestion']['objects'].items():
            print(f"    {obj_name}: {fields}")



BRANCH INSPECTION SUMMARY

2016e-8tev:
  Total branches found: 46
  Detected objects: []

2024r-pp:
  Total branches found: 1563
  Detected objects: ['Electrons', 'Jets', 'Muons', 'Photons']
  Suggested prefix: '<varies>'
  Suggested suffix: 'AuxDyn'
  Objects and fields:
    Electrons: ['eta', 'phi', 'pt']
    Jets: ['eta', 'phi', 'pt']
    Muons: ['eta', 'phi', 'pt']
    Photons: ['eta', 'phi', 'pt']

2024r-hi:
  Total branches found: 281
  Detected objects: ['Muons']
  Suggested prefix: ''
  Suggested suffix: 'pectrometerTrackParticlesAuxDyn'
  Objects and fields:
    Muons: ['eta', 'phi', 'pt']

2025e-13tev-beta:
  Total branches found: 116
  Detected objects: []

2025r-evgen-13tev:
  Total branches found: 0
  Detected objects: []

2025r-evgen-13p6tev:
  Total branches found: 0
  Detected objects: []
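
The `2024r-hi` suffix `'pectrometerTrackParticlesAuxDyn'` in the summary above looks like a side effect of the case-insensitive substring match rather than a real naming convention. The sketch below reproduces it on a single name. The base name `MuonSpectrometerTrackParticlesAuxDyn` is inferred from the reported suffix and is an assumption, since it was not read back from the file here.

```python
# Base name inferred from the reported suffix; an assumption, not read from the file.
base_branch = "MuonSpectrometerTrackParticlesAuxDyn"
obj_name = "Muons"

# Case-insensitive match, mirroring the logic in inspect_branches_for_release
idx = base_branch.lower().find(obj_name.lower())
loose_suffix = base_branch[idx + len(obj_name):]
print(repr(loose_suffix))  # 'pectrometerTrackParticlesAuxDyn'

# A case-sensitive match rejects this branch ("MuonS" != "Muons")
strict_match = obj_name in base_branch
print(strict_match)  # False
```

Matching case-sensitively would probably avoid this particular false positive, at the cost of missing releases that write object names in a different case.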


In [None]:
# Detailed view for each release
for release_id, result in branch_inspection_results.items():
    print(f"\n{'='*80}")
    print(f"DETAILED RESULTS FOR {release_id}")
    print(f"{'='*80}")
    
    for obj_name, obj_data in result['detected_objects'].items():
        print(f"\n{obj_name}:")
        print(f"  Base branch: {obj_data['base_branch']}")
        print(f"  Prefix: '{obj_data['prefix']}'")
        print(f"  Suffix: '{obj_data['suffix']}'")
        print(f"  Available fields: {obj_data['available_fields']}")
        print(f"  Has required fields: {obj_data['has_required_fields']}")
        print(f"  Sample branches:")
        for branch in obj_data['sample_branches'][:3]:
            print(f"    - {branch}")



DETAILED RESULTS FOR 2016e-8tev

DETAILED RESULTS FOR 2024r-pp

Electrons:
  Base branch: TruthElectronsAuxDyn
  Prefix: 'Truth'
  Suffix: 'AuxDyn'
  Available fields: ['', 'Classification', 'DFCommonElectronsECIDS', 'DFCommonElectronsECIDSResult', 'DFCommonElectronsLHLoose', 'DFCommonElectronsLHLooseBL', 'DFCommonElectronsLHLooseBLIsEMValue', 'DFCommonElectronsLHLooseIsEMValue', 'DFCommonElectronsLHMedium', 'DFCommonElectronsLHMediumIsEMValue', 'DFCommonElectronsLHTight', 'DFCommonElectronsLHTightIsEMValue', 'DFCommonElectronsLHVeryLoose', 'DFCommonElectronsLHVeryLooseIsEMValue', 'OQ', 'TruthLink', 'TruthLink/AnalysisElectronsAuxDyn', 'ambiguityLink', 'ambiguityLink/AnalysisElectronsAuxDyn', 'ambiguityType', 'author', 'barcode', 'caloClusterLinks', 'charge', 'childLinks', 'clEta', 'clPhi', 'classifierParticleOrigin', 'classifierParticleOutCome', 'classifierParticleType', 'd0Normalized', 'decayVtxLink', 'decayVtxLink/TruthElectronsAuxDyn', 'e', 'e_dressed', 'eta', 'eta_dressed', 'etco
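
The Electrons entry above reports `TruthElectronsAuxDyn` as the base branch, yet its field list also contains entries that refer to `AnalysisElectronsAuxDyn`. That suggests fields from several collections are being merged under whichever base branch happened to be seen first. One possible way to keep them apart is to group fields per base branch, as in the sketch below. The helper `group_fields_by_base` is hypothetical and the sample branch names are illustrative, not a listing from a real file.

```python
from collections import defaultdict


def group_fields_by_base(branches, obj_name):
    """Map each base branch containing obj_name to its sorted field names."""
    grouped = defaultdict(set)
    for branch in branches:
        base, sep, field = branch.partition(".")
        if sep and obj_name in base:  # case-sensitive match (assumption)
            grouped[base].add(field)
    return {base: sorted(fields) for base, fields in grouped.items()}


# Illustrative names only
sample = [
    "AnalysisElectronsAuxDyn.pt",
    "AnalysisElectronsAuxDyn.eta",
    "TruthElectronsAuxDyn.barcode",
    "AnalysisJetsAuxDyn.pt",
]
print(group_fields_by_base(sample, "Electrons"))
# {'AnalysisElectronsAuxDyn': ['eta', 'pt'], 'TruthElectronsAuxDyn': ['barcode']}
```

With the fields grouped this way, the required-field check could be applied to each base branch separately, and the schema could pick the collection that actually carries `pt`, `eta` and `phi`.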

## Generate Schema Configuration


In [None]:
# Generate Python code for schema configuration
print("\n" + "="*80)
print("SCHEMA CONFIGURATION CODE")
print("="*80)
print("\n# Add this to schemas.py RELEASE_SCHEMAS dictionary:\n")

for release_id, result in branch_inspection_results.items():
    if result['schema_suggestion']:
        schema = result['schema_suggestion']
        
        print(f"    \"{release_id}\": {{")
        print(f"        \"branch_prefix\": \"{schema['branch_prefix']}\",")
        print(f"        \"branch_suffix\": \"{schema['branch_suffix']}\",")
        print(f"        \"objects\": {{")
        
        for obj_name, fields in schema['objects'].items():
            fields_str = ', '.join([f'"{f}"' for f in fields])
            print(f"            \"{obj_name}\": [{fields_str}],")
        
        print(f"        }}")
        print(f"    }},")
        print()



SCHEMA CONFIGURATION CODE

# Add this to schemas.py RELEASE_SCHEMAS dictionary:

    "2024r-pp": {
        "branch_prefix": "<varies>",
        "branch_suffix": "AuxDyn",
        "objects": {
            "Electrons": ["eta", "phi", "pt"],
            "Jets": ["eta", "phi", "pt"],
            "Muons": ["eta", "phi", "pt"],
            "Photons": ["eta", "phi", "pt"],
        }
    },

    "2024r-hi": {
        "branch_prefix": "",
        "branch_suffix": "pectrometerTrackParticlesAuxDyn",
        "objects": {
            "Muons": ["eta", "phi", "pt"],
        }
    },



In [None]:
# Save results to JSON file
output_file = "branch_inspection_results.json"
with open(output_file, 'w') as f:
    json.dump(branch_inspection_results, f, indent=2)

print(f"\nResults saved to {output_file}")



Results saved to branch_inspection_results.json


## Compare with Current Schema


In [None]:
# Compare detected patterns with current schema
print("\n" + "="*80)
print("COMPARISON WITH CURRENT SCHEMAS")
print("="*80)

for release_id, result in branch_inspection_results.items():
    print(f"\n{release_id}:")
    
    try:
        current_schema = schemas.get_schema_for_release(release_id)
        print(f"  Current schema exists: ✓")
        print(f"    Prefix: '{current_schema['branch_prefix']}'")
        print(f"    Suffix: '{current_schema['branch_suffix']}'")
        
        if result['schema_suggestion']:
            detected_prefix = result['schema_suggestion']['branch_prefix']
            detected_suffix = result['schema_suggestion']['branch_suffix']
            
            prefix_match = current_schema['branch_prefix'] == detected_prefix
            suffix_match = current_schema['branch_suffix'] == detected_suffix
            
            if not prefix_match:
                print(f"    ⚠ Prefix mismatch: current='{current_schema['branch_prefix']}', detected='{detected_prefix}'")
            if not suffix_match:
                print(f"    ⚠ Suffix mismatch: current='{current_schema['branch_suffix']}', detected='{detected_suffix}'")
            
            if prefix_match and suffix_match:
                print(f"    ✓ Schema matches detected pattern")
    
    except KeyError:
        print(f"  Current schema: ✗ (not found)")
        if result['schema_suggestion']:
            print(f"  → Use detected pattern: prefix='{result['schema_suggestion']['branch_prefix']}', suffix='{result['schema_suggestion']['branch_suffix']}'")



COMPARISON WITH CURRENT SCHEMAS

2016e-8tev:
  Current schema exists: ✓
    Prefix: 'Analysis'
    Suffix: 'AuxDyn'

2024r-pp:
  Current schema exists: ✓
    Prefix: 'Analysis'
    Suffix: 'AuxDyn'
    ⚠ Prefix mismatch: current='Analysis', detected='<varies>'

2024r-hi:
  Current schema exists: ✓
    Prefix: 'Analysis'
    Suffix: 'AuxDyn'
    ⚠ Prefix mismatch: current='Analysis', detected=''
    ⚠ Suffix mismatch: current='AuxDyn', detected='pectrometerTrackParticlesAuxDyn'

2025e-13tev-beta:
  Current schema exists: ✓
    Prefix: 'Analysis'
    Suffix: 'AuxDyn'

2025r-evgen-13tev:
  Current schema exists: ✓
    Prefix: 'Analysis'
    Suffix: 'AuxDyn'

2025r-evgen-13p6tev:
  Current schema exists: ✓
    Prefix: 'Analysis'
    Suffix: 'AuxDyn'
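
In the comparison above, `2024r-pp` is flagged as a prefix mismatch only because the detected value is the `'<varies>'` placeholder, which can never equal a real prefix. A minimal sketch of a three-way comparison that reports such cases as inconclusive is shown below. The helper `compare_affix` is hypothetical and is not part of `schemas.py`.

```python
VARIES = "<varies>"  # placeholder emitted when detected prefixes/suffixes disagree


def compare_affix(current, detected):
    """Return 'match', 'mismatch', or 'inconclusive' for one prefix or suffix."""
    if detected == VARIES:
        return "inconclusive"
    return "match" if current == detected else "mismatch"


print(compare_affix("Analysis", "<varies>"))  # inconclusive
print(compare_affix("AuxDyn", "AuxDyn"))      # match
print(compare_affix("Analysis", ""))          # mismatch
```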
