# Notebook 01: Data Fabric & Graph Construction

## From Tabular to Graphs: Representing Nuclear Topology

**Learning Objective:** Understand how to represent the Chart of Nuclides as a graph for GNN training using real EXFOR data.

**Focus:** See how U-235 and Cl-35 are positioned in the nuclear landscape and how GNNs can learn relationships between them.

### Why Graphs?

Nuclear data has inherent structure:
- **Isotopes** are related by reactions: U-235 + n → U-236 (capture)
- **Neighborhoods matter**: Nuclides with similar Z and A tend to show similar physics
- **Topology encodes physics**: Chart of Nuclides = Graph!
- **Transfer learning**: U-235 (data-rich) can inform Cl-35 (data-sparse) through graph structure

This structure is **invisible** to classical ML models that treat each measurement as an independent table row, but it is the **natural** input for GNNs.

In [None]:
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from pathlib import Path

from nucml_next.data import NucmlDataset
from nucml_next.utils import ChartOfNuclides

# Verify EXFOR data exists
exfor_path = Path('../data/exfor_processed.parquet')
if not exfor_path.exists():
    raise FileNotFoundError(
        f"EXFOR data not found at {exfor_path}\n"
        "Please run: python scripts/ingest_exfor.py --exfor-root <path> --output data/exfor_processed.parquet"
    )

print("✓ Imports successful")
print("✓ EXFOR data found")

### Step 1: Build the Nuclear Graph

In [None]:
# Load real EXFOR data in GRAPH mode.
# OPTION 1 (commented out) uses the default DataSelection (reactor physics, neutrons, essential reactions).
# OPTION 2 (active) passes an explicit DataSelection with every option documented inline.

from nucml_next.data import NucmlDataset, DataSelection

# OPTION 1: Use default selection (recommended for most users)
# Uncomment this to use defaults:
# dataset = NucmlDataset(
#     data_path='../data/exfor_processed.parquet',
#     mode='graph'
#     # selection parameter omitted → uses default_selection()
# )

# OPTION 2: Explicit DataSelection with all options documented
dataset = NucmlDataset(
    data_path='../data/exfor_processed.parquet',
    mode='graph',  # Graph mode for GNN training
    selection=DataSelection(
        # ========================================================================
        # PROJECTILE SELECTION
        # ========================================================================
        projectile='neutron',      # Options: 'neutron' | 'all'
        
        # ========================================================================
        # ENERGY RANGE (eV)
        # ========================================================================
        energy_min=1e-5,           # 1e-5 eV, well below thermal (0.0253 eV)
        energy_max=2e7,            # 20 MeV (reactor physics upper bound)
        
        # ========================================================================
        # REACTION (MT) MODE
        # ========================================================================
        mt_mode='all_physical',    # Options:
                                   # 'reactor_core'   → MT 2, 4, 16, 18, 102, 103, 107
                                   # 'threshold_only' → MT 16, 17, 103-107
                                   # 'fission_details'→ MT 18, 19, 20, 21, 38
                                   # 'all_physical'   → All codes < 9000
                                   # 'custom'         → Use custom_mt_codes
        
        custom_mt_codes=None,      # Example: [2, 18, 102] when mt_mode='custom'
        
        # ========================================================================
        # EXCLUSION & VALIDITY
        # ========================================================================
        exclude_bookkeeping=True,  # Exclude MT 0, 1, and MT >= 9000
        drop_invalid=True,         # Drop NaN or non-positive cross-sections
        
        # ========================================================================
        # HOLDOUT ISOTOPES
        # ========================================================================
        holdout_isotopes=None      # Example: [(92, 235), (17, 35)]
    )
)

# Get global graph representation
graph = dataset.graph_builder.build_global_graph()

print("Graph Statistics:")
print(f"  Nodes (isotopes): {graph.num_nodes}")
print(f"  Edges (reactions): {graph.num_edges}")
print(f"  Node features: {graph.x.shape[1]} (includes AME2020 enrichment)")
print(f"  Edge features: {graph.edge_attr.shape[1]}")

print("\n💡 Data Selection Applied:")
print("   - Projectile: neutron (reactor physics focus)")
print("   - Energy: 1e-5 to 2e7 eV (sub-thermal to fast neutrons)")
print("   - MT codes: all physical reactions (excludes bookkeeping)")
print("   - Invalid data: dropped (NaN or non-positive cross-sections)")
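
The filtering rules described in the `DataSelection` comments can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's implementation: the `select` helper, the record layout, the inclusive energy bounds, and the toy values (which are not EXFOR data) are all assumptions.

```python
import math

def select(records, energy_min=1e-5, energy_max=2e7,
           exclude_bookkeeping=True, drop_invalid=True):
    """Keep records that pass the energy, MT, and validity checks."""
    kept = []
    for r in records:
        # Energy window in eV (bounds assumed inclusive)
        if not (energy_min <= r["energy"] <= energy_max):
            continue
        # Bookkeeping codes: MT 0, MT 1, and MT >= 9000
        if exclude_bookkeeping and (r["MT"] in (0, 1) or r["MT"] >= 9000):
            continue
        # Invalid cross-sections: NaN or non-positive
        if drop_invalid and (math.isnan(r["xs"]) or r["xs"] <= 0):
            continue
        kept.append(r)
    return kept

# Toy records with made-up values, for illustration only
toy_records = [
    {"MT": 18, "energy": 0.0253, "xs": 1.0},         # kept
    {"MT": 1, "energy": 1.0, "xs": 2.0},             # bookkeeping
    {"MT": 102, "energy": 5e7, "xs": 0.5},           # above energy_max
    {"MT": 2, "energy": 1e3, "xs": float("nan")},    # NaN cross-section
    {"MT": 103, "energy": 1e6, "xs": -0.1},          # non-positive
    {"MT": 9001, "energy": 1e6, "xs": 1.0},          # bookkeeping
    {"MT": 102, "energy": 1e-5, "xs": 3.0},          # kept (on the lower bound)
]

kept = select(toy_records)
print([r["MT"] for r in kept])
```

Only the MT 18 and the last MT 102 records survive; each of the others trips exactly one of the three checks.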

### Step 2: Visualize the Chart of Nuclides

In [None]:
# Plot Chart of Nuclides with U-235 and Cl-35 highlighted
chart = ChartOfNuclides()

# Highlight our focus isotopes
fig, ax = chart.plot_chart(
    dataset.df[['Z', 'N']].drop_duplicates(),
    highlight_isotopes=[
        (92, 235),  # U-235 (well-understood, data-rich)
        (17, 35)    # Cl-35 (research interest, data-sparse)
    ]
)
plt.show()

print("\n🔍 Observation: The chart shows nuclear topology!")
print("   • U-235 (Z=92): Heavy actinide region (highlighted)")
print("     - Fissile isotope, critical for reactors")
print("     - Well-characterized, extensive EXFOR data")
print()
print("   • Cl-35 (Z=17): Light element region (highlighted)")
print("     - (n,p) reaction produces S-35 (radiotracer used in biomedical research)")
print("     - Limited experimental data, active research")
print()
print("   • GNNs can learn this structure and TRANSFER knowledge:")
print("     - Physics principles learned from U-235 (data-rich)")
print("     - Applied to Cl-35 predictions (data-sparse)")
print("     - Graph connections enable information flow!")
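
One detail in the plotting cell is worth spelling out: the highlighted isotopes are written as `(Z, A)` pairs, while the DataFrame passed to `plot_chart` carries `Z` and `N` columns. Whether `plot_chart` converts between the two internally is not shown here. If you need chart coordinates yourself, a minimal conversion (with `za_to_zn` as a hypothetical helper name) uses N = A - Z:

```python
def za_to_zn(isotopes):
    """Convert (Z, A) pairs to (Z, N) chart coordinates using N = A - Z."""
    return [(z, a - z) for z, a in isotopes]

highlight_za = [(92, 235), (17, 35)]  # U-235 and Cl-35 as (Z, A)
highlight_zn = za_to_zn(highlight_za)
print(highlight_zn)  # [(92, 143), (17, 18)]
```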

### 🎓 Key Takeaway

> The Chart of Nuclides is a **graph**, not a table!
>
> **U-235** and **Cl-35** are connected through:
> - Shared nuclear physics (Q-values, thresholds, conservation laws)
> - Graph topology (neighboring isotopes, reaction channels)
> - **GNN message passing**: Information flows through the graph!
>
> With **8D node features** (including AME2020 mass excess and binding energy), 
> the graph encodes rich nuclear physics for the GNN to learn from.
>
> **Key Advantage**: Models trained on data-rich isotopes (U-235) can improve 
> predictions for data-sparse isotopes (Cl-35) through graph structure!
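
The message-passing idea can be illustrated without any GNN library. The sketch below is a plain mean aggregation over neighbors, not a trained layer and not the model used in this series; the four-node chain, the node names, and the `mean_message_pass` helper are illustrative assumptions. Only node `a` starts with a signal, and each round spreads it one hop further.

```python
def mean_message_pass(features, neighbors):
    """One round: each node's new value is the mean of itself and its neighbors."""
    updated = {}
    for node, value in features.items():
        pool = [value] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = sum(pool) / len(pool)
    return updated

# Toy chain a - b - c - d; only "a" carries information at the start
toy_neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_features = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}

reach_history = []
for step in range(1, 4):
    toy_features = mean_message_pass(toy_features, toy_neighbors)
    reached = [node for node, value in toy_features.items() if value > 0]
    reach_history.append(reached)
    print(f"round {step}: signal has reached {reached}")
```

Each round extends the signal by exactly one hop, so in this simplified picture the number of message-passing layers sets how far information can travel through the graph.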

Continue to `02_GNN_Transformer_Training.ipynb` →