# Data Ingestion and Access via Tiled

This notebook demonstrates the VDP data workflow, from inspecting raw HDF5 files to ingesting data and accessing it via the Tiled server.  

We will cover:

1. Inspect the data files and their schema
2. Generate manifests
3. Register data via two workflows:
   - **Server Registration (Visualizer mode)**
   - **Bulk Registration (fast, SQLite only)**
4. Retrieve the data

## 1. Data Overview
We are currently working with a set of HDF5 files containing RIXS and XPS data.

In [9]:
from pathlib import Path

data_path = Path("/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data")

# List HDF5 files
h5_files = sorted(data_path.glob("**/*.h5"))
print(f"We have {len(h5_files)} HDF5 files")
print("First 5 files:")
for f in h5_files[:5]:
    print(f)

We have 32 HDF5 files
First 5 files:
/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/NiPS3_random_rank0000.h5
/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/NiPS3_random_rank0001.h5
/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/NiPS3_random_rank0002.h5
/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/NiPS3_random_rank0003.h5
/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/NiPS3_random_rank0004.h5


In [10]:
import subprocess
# Show the HDF5 schema for the first file
example_file = h5_files[0]
print(f"Inspecting schema of: {example_file}\n")

subprocess.run(["h5ls", "-r", str(example_file)])

Inspecting schema of: /sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/NiPS3_random_rank0000.h5

/                        Group
/RIXS                    Dataset {3125, 151, 40}
/XPS                     Dataset {3125, 5000}
/ein_RIXS                Dataset {40}
/eloss_RIXS              Dataset {151}
/params                  Group
/params/F2_dd            Dataset {3125}
/params/F2_dp            Dataset {3125}
/params/F4_dd            Dataset {3125}
/params/G1_dp            Dataset {3125}
/params/G3_dp            Dataset {3125}
/params/Gam_c            Dataset {3125}
/params/sigma            Dataset {3125}
/params/soc_c            Dataset {3125}
/params/soc_v_i          Dataset {3125}
/params/soc_v_n          Dataset {3125}
/params/tenDq            Dataset {3125}
/params/xoffset          Dataset {3125}
/w_xps                   Dataset {5000}


CompletedProcess(args=['h5ls', '-r', '/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/NiPS3_random_rank0000.h5'], returncode=0)

From this schema, we can see that each file stores the RIXS and XPS spectra and the parameters of 3,125 Hamiltonians, stacked along the first axis of each dataset. Across the 32 files, this gives 32 × 3,125 = 100,000 Hamiltonians.
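The count above can be checked with a few lines of arithmetic. This is a minimal sketch: the shapes are copied from the `h5ls -r` output of the first file, and it assumes (without checking) that the other 31 files have the same layout.

```python
# Shapes copied from the `h5ls -r` output above (first file only).
dataset_shapes = {
    "/RIXS": (3125, 151, 40),
    "/XPS": (3125, 5000),
    "/params/tenDq": (3125,),
}
n_files = 32  # number of HDF5 files found by the glob above

# Every per-Hamiltonian dataset should share the same first dimension.
leading_dims = {shape[0] for shape in dataset_shapes.values()}
assert len(leading_dims) == 1, "datasets disagree on the number of Hamiltonians"

hamiltonians_per_file = leading_dims.pop()
# Assumption: all files hold the same number of Hamiltonians as the first one.
total_hamiltonians = n_files * hamiltonians_per_file
print(total_hamiltonians)  # 100000
```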

## 2. Generate Manifests

This step scans all HDF5 files and creates two structured metadata tables stored as **Parquet files**.

### 1. Hamiltonian Manifest (`manifest_hamiltonians.parquet`)

For each Hamiltonian inside each HDF5 file:

- Extract all physical parameters from `/params`
- Assign a unique identifier (`huid`)
- Store one row per Hamiltonian


### 2. Artifact Manifest (`manifest_artifacts.parquet`)

For each Hamiltonian and each available artifact (`/RIXS`, `/XPS`):

- Link the artifact to its `huid`
- Store:
  - File path
  - Data shape
  - Data type
  - Axis names

This creates a mapping layer between:
- A Hamiltonian
- Its spectral outputs
- The exact location of the data in the HDF5 file
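The two manifests described above could be assembled roughly as follows. This is a sketch under stated assumptions, not the contents of `create_manifest.py`: the function name `build_manifest_rows` is hypothetical, the `huid` scheme (`<file stem>_<4-digit index>`) is inferred from identifiers such as `NiPS3_random_rank0020_0000`, and the shape and dtype columns are left out for brevity. The function accepts any mapping laid out like the file; an open `h5py.File` is expected to behave the same way, but that is not exercised here.

```python
from pathlib import Path

# Artifact kinds and their axis datasets, taken from the schema shown earlier.
ARTIFACTS = {
    "rixs": {"dataset": "RIXS", "axis_x_name": "/eloss_RIXS", "axis_y_name": "/ein_RIXS"},
    "xps": {"dataset": "XPS", "axis_x_name": "/w_xps", "axis_y_name": None},
}


def build_manifest_rows(file_path, h5):
    """Return (hamiltonian_rows, artifact_rows) for one file-like mapping."""
    stem = Path(file_path).stem
    params = h5["params"]
    n_hamiltonians = len(next(iter(params.values())))

    hamiltonian_rows, artifact_rows = [], []
    for i in range(n_hamiltonians):
        huid = f"{stem}_{i:04d}"  # assumed naming scheme
        row = {"huid": huid}
        row.update({name: float(values[i]) for name, values in params.items()})
        hamiltonian_rows.append(row)

        for kind, spec in ARTIFACTS.items():
            if spec["dataset"] not in h5:  # only record artifacts that exist
                continue
            artifact_rows.append({
                "huid": huid,
                "type": kind,
                "path_rel": str(file_path),
                "axis_x_name": spec["axis_x_name"],
                "axis_y_name": spec["axis_y_name"],
            })
    return hamiltonian_rows, artifact_rows


# Tiny in-memory stand-in with two Hamiltonians (all values are made up).
demo_file = {
    "params": {"tenDq": [2.2, 2.7], "sigma": [0.03, 0.01]},
    "RIXS": [[[0.0]], [[0.0]]],
    "XPS": [[0.0], [0.0]],
}
h_rows, a_rows = build_manifest_rows("demo_rank0000.h5", demo_file)
print(len(h_rows), len(a_rows))  # 2 4
print(h_rows[0]["huid"])         # demo_rank0000_0000
```

Each Hamiltonian yields one parameter row and one artifact row per available dataset, which is why the artifact manifest is twice as long as the Hamiltonian manifest when both RIXS and XPS are present.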

### Why does it matter?

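Instead of scanning large HDF5 files during registration, we precompute structured metadata once and store it in Parquet, a compact columnar format. Both registration workflows (bulk and HTTP-based) can then read the manifests directly, which keeps catalog registration efficient.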

In [11]:
# Example of generating a manifest

path = Path("/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp")

manifest_script = path / "data" / "scripts" / "create_manifest.py"

# Command example (adjust as needed)
cmd = f"uv run --with pandas --with pyarrow python {manifest_script}"
print(f"Command to run:\n{cmd}")

Command to run:
uv run --with pandas --with pyarrow python /sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/scripts/create_manifest.py


In [12]:
import pandas as pd
from pathlib import Path

manifest_h_path = Path("/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/manifests/manifest_hamiltonians.parquet")
manifest_artifact_path = Path("/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/data/manifests/manifest_artifacts.parquet")
df_h = pd.read_parquet(manifest_h_path)
df_a = pd.read_parquet(manifest_artifact_path)

print("Shape:", df_h.shape)
display(df_h.head())

print("Shape:", df_a.shape)
display(df_a.head())

Shape: (100000, 13)


Unnamed: 0,huid,F2_dd,F2_dp,F4_dd,G1_dp,G3_dp,Gam_c,sigma,soc_c,soc_v_i,soc_v_n,tenDq,xoffset
0,NiPS3_random_rank0020_0000,14.167416,2.161438,8.021071,4.386197,2.015,0.717656,0.028074,9.77267,0.173338,0.080417,2.238504,7.40824
1,NiPS3_random_rank0020_0001,12.128871,4.134814,3.25585,1.868437,1.937351,0.498172,0.014648,12.853813,0.135301,0.10973,2.677842,11.243425
2,NiPS3_random_rank0020_0002,13.577481,6.526595,3.796882,3.459782,3.812318,0.422061,0.0231,4.876471,0.151471,0.101204,1.292423,-1.567886
3,NiPS3_random_rank0020_0003,3.982542,1.280815,8.829851,3.181704,2.866715,0.584292,0.005265,1.399733,0.19175,0.019449,3.274356,9.621646
4,NiPS3_random_rank0020_0004,10.289201,2.711531,4.042017,1.006256,1.050068,0.392571,0.018624,4.848176,0.174947,0.040957,1.760266,2.573255


Shape: (200000, 7)


Unnamed: 0,huid,type,path_rel,shape,dtype,axis_x_name,axis_y_name
0,NiPS3_random_rank0020_0000,rixs,/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfi...,"[151, 40]",float64,/eloss_RIXS,/ein_RIXS
1,NiPS3_random_rank0020_0000,xps,/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfi...,[5000],float64,/w_xps,
2,NiPS3_random_rank0020_0001,rixs,/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfi...,"[151, 40]",float64,/eloss_RIXS,/ein_RIXS
3,NiPS3_random_rank0020_0001,xps,/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfi...,[5000],float64,/w_xps,
4,NiPS3_random_rank0020_0002,rixs,/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfi...,"[151, 40]",float64,/eloss_RIXS,/ein_RIXS
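To illustrate how the artifact manifest links spectra back to a Hamiltonian, the sketch below groups a few rows by `huid`. The rows are copied from the preview above with the truncated file paths left out; the grouping helper is illustrative only and uses plain Python rather than pandas.

```python
from collections import defaultdict

# Rows copied from the artifact manifest preview (file paths omitted).
artifact_rows = [
    {"huid": "NiPS3_random_rank0020_0000", "type": "rixs", "shape": [151, 40]},
    {"huid": "NiPS3_random_rank0020_0000", "type": "xps", "shape": [5000]},
    {"huid": "NiPS3_random_rank0020_0001", "type": "rixs", "shape": [151, 40]},
    {"huid": "NiPS3_random_rank0020_0001", "type": "xps", "shape": [5000]},
    {"huid": "NiPS3_random_rank0020_0002", "type": "rixs", "shape": [151, 40]},
]

# Group artifacts by Hamiltonian so each huid maps to its spectra.
by_huid = defaultdict(dict)
for row in artifact_rows:
    by_huid[row["huid"]][row["type"]] = row

print(sorted(by_huid["NiPS3_random_rank0020_0000"]))          # ['rixs', 'xps']
print(by_huid["NiPS3_random_rank0020_0001"]["xps"]["shape"])  # [5000]

# Two artifacts per Hamiltonian is consistent with the manifest sizes above.
expected_artifact_rows = 2 * 100_000
print(expected_artifact_rows)  # 200000
```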


## 3. Registering Data

We have **two registration workflows**:

### Option A: Server Registration (Visualizer mode)
- Server must be running
- Registration communicates via HTTP API
- Data is accessible through Tiled for visualization and analysis

### Option B: Bulk Registration (fast)
- No server needed
- Writes directly to SQLite (`catalog-bulk.db`)
- Very fast for large datasets 



### Option A: Server Registration

1. Start the Tiled server:

```bash
cd $path/tiled_poc
uv run --with 'tiled[server]' tiled serve config config.yml --api-key secret
```
2. Open another terminal in the same directory and register the data:
```bash
# Register 10,000 Hamiltonians
VDP_MAX_HAMILTONIANS=10000 uv run --with 'tiled[server]' --with pandas --with pyarrow --with h5py --with ruamel.yaml \
python scripts/register_catalog.py
```

### Option B: Bulk Registration

This script bypasses the HTTP server and writes directly to the SQLite database.  

1. Generate the SQLite database:

```bash
cd $path/tiled_poc
uv run --with pandas --with pyarrow --with h5py --with canonicaljson --with ruamel.yaml \
python scripts/bulk_register.py -n 10000 -o catalog-bulk.db
```

In [13]:
import sqlite3
import pandas as pd

# Path to bulk catalog database
path_to_catalog=Path("/sdf/data/lcls/ds/prj/prjmaiqmag01/results/cfitussi/proj-vdp/tiled_poc")
catalog_db =  path_to_catalog / "catalog-bulk.db"

# Connect and preview first rows
if catalog_db.exists():
    conn = sqlite3.connect(catalog_db)
    df = pd.read_sql_query("SELECT * FROM nodes LIMIT 10;", conn)
    display(df)
else:
    print("catalog-bulk.db does not exist. Run bulk_register.py first.")

Unnamed: 0,id,parent,key,structure_family,metadata,specs,access_blob,time_created,time_updated
0,0,,,container,{},[],{},2026-02-12 05:30:14,2026-02-12 05:30:14
1,1,0.0,H_NiPS3_random_rank0020_0000,container,"{""huid"": ""NiPS3_random_rank0020_0000"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
2,2,0.0,H_NiPS3_random_rank0020_0001,container,"{""huid"": ""NiPS3_random_rank0020_0001"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
3,3,0.0,H_NiPS3_random_rank0020_0002,container,"{""huid"": ""NiPS3_random_rank0020_0002"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
4,4,0.0,H_NiPS3_random_rank0020_0003,container,"{""huid"": ""NiPS3_random_rank0020_0003"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
5,5,0.0,H_NiPS3_random_rank0020_0004,container,"{""huid"": ""NiPS3_random_rank0020_0004"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
6,6,0.0,H_NiPS3_random_rank0020_0005,container,"{""huid"": ""NiPS3_random_rank0020_0005"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
7,7,0.0,H_NiPS3_random_rank0020_0006,container,"{""huid"": ""NiPS3_random_rank0020_0006"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
8,8,0.0,H_NiPS3_random_rank0020_0007,container,"{""huid"": ""NiPS3_random_rank0020_0007"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
9,9,0.0,H_NiPS3_random_rank0020_0008,container,"{""huid"": ""NiPS3_random_rank0020_0008"", ""F2_dd""...",[],{},2026-02-12 05:33:34,2026-02-12 05:33:34
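Because the Hamiltonian parameters are stored as JSON in the `metadata` column, the catalog can also be filtered directly with `sqlite3`. The sketch below is illustrative only: it builds a small in-memory stand-in that mirrors a few of the columns shown above (`id`, `parent`, `key`, `structure_family`, `metadata`) with made-up values, and the helper `hamiltonians_where` is a hypothetical name. The real `nodes` table has more columns and may differ in detail.

```python
import json
import sqlite3

# In-memory stand-in for catalog-bulk.db (subset of columns, made-up values).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent INTEGER, key TEXT, "
    "structure_family TEXT, metadata TEXT)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
    [
        (0, None, "", "container", "{}"),  # root node
        (1, 0, "H_demo_0000", "container", json.dumps({"huid": "demo_0000", "tenDq": 2.24})),
        (2, 0, "H_demo_0001", "container", json.dumps({"huid": "demo_0001", "tenDq": 1.29})),
    ],
)


def hamiltonians_where(conn, param, lo, hi):
    """Return keys of top-level nodes whose metadata[param] lies in [lo, hi]."""
    matches = []
    query = "SELECT key, metadata FROM nodes WHERE parent = 0 ORDER BY id"
    for key, metadata in conn.execute(query):
        value = json.loads(metadata).get(param)
        if value is not None and lo <= value <= hi:
            matches.append(key)
    return matches


print(hamiltonians_where(conn, "tenDq", 2.0, 3.0))  # ['H_demo_0000']
```

Filtering in Python after `json.loads` keeps the sketch independent of SQLite's optional JSON functions.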


2. Run the Tiled server:
```bash
uv run --with 'tiled[server]' tiled serve config config.yml --api-key secret
```

## 4. Access the Data

Once registration is done (via either workflow) and the Tiled server is running, we can retrieve the data through the Tiled client.

In [None]:
from tiled.client import from_uri

# Connect to running Tiled server
client = from_uri("http://localhost:8005", api_key="secret")

print("Top-level keys (Hamiltonians):")
print(list(client.keys())[:5])

# Access first Hamiltonian
first_key = list(client.keys())[0]
h = client[first_key]

print("\nArtifacts available:")
print(list(h.keys()))

# Load the RIXS spectrum (index 0 selects the first entry of the stacked dataset)
rixs = h["rixs"][0][:]
print("\nRIXS shape:", rixs.shape)

# Load the XPS spectrum (same indexing)
xps = h["xps"][0][:]
print("XPS shape:", xps.shape)


In [19]:
# After starting the server and running the cell above, we obtain:

print("Top-level keys (Hamiltonians):")
print([
    'H_NiPS3_random_rank0020_0000',
    'H_NiPS3_random_rank0020_0001',
    'H_NiPS3_random_rank0020_0002',
    'H_NiPS3_random_rank0020_0003',
    'H_NiPS3_random_rank0020_0004'
])

print("Keys :", ['rixs', 'xps'])

rixs_arr = [
    [1.53975164e-096, 1.29003630e-096, 1.08012952e-096, "...", 6.89870155e-098, 7.26642481e-098],
    [2.13658548e-100, 1.79007624e-100, 1.49880602e-100, "...", 9.57275518e-102, 1.00830142e-101],
    [2.48656128e-104, 2.08329332e-104, 1.74431262e-104, "...", 1.11407864e-105, 1.17346266e-105],
    "...",
    [1.71538485e-008, 1.62213596e-008, 1.47870886e-008, "...", 2.70834471e-010, 2.40853177e-010],
    [1.77472893e-007, 1.67825408e-007, 1.52986509e-007, "...", 2.80204044e-009, 2.49185541e-009]
]
print('rixs_arr:', rixs_arr)
print("Shape : (151, 40)")

xps_arr = [0.00448695, 0.00449262, 0.0044983, "...", 0.00192027, 0.00191865]
print('xps_arr', xps_arr)
print("Shape : (5000,)")

Top-level keys (Hamiltonians):
['H_NiPS3_random_rank0020_0000', 'H_NiPS3_random_rank0020_0001', 'H_NiPS3_random_rank0020_0002', 'H_NiPS3_random_rank0020_0003', 'H_NiPS3_random_rank0020_0004']
Keys : ['rixs', 'xps']
rixs_arr: [[1.53975164e-96, 1.2900363e-96, 1.08012952e-96, '...', 6.89870155e-98, 7.26642481e-98], [2.13658548e-100, 1.79007624e-100, 1.49880602e-100, '...', 9.57275518e-102, 1.00830142e-101], [2.48656128e-104, 2.08329332e-104, 1.74431262e-104, '...', 1.11407864e-105, 1.17346266e-105], '...', [1.71538485e-08, 1.62213596e-08, 1.47870886e-08, '...', 2.70834471e-10, 2.40853177e-10], [1.77472893e-07, 1.67825408e-07, 1.52986509e-07, '...', 2.80204044e-09, 2.49185541e-09]]
Shape : (151, 40)
xps_arr [0.00448695, 0.00449262, 0.0044983, '...', 0.00192027, 0.00191865]
Shape : (5000,)


## Work in Progress: Server-Side Hamiltonian Slicing

In the current implementation, the Tiled server exposes the full RIXS and XPS datasets as they are stored in the HDF5 files.

Each HDF5 file contains multiple Hamiltonians stacked along the first axis:

- `RIXS` shape: `(N_hamiltonians, energy_loss, energy_in)`, i.e. `(3125, 151, 40)`
- `XPS` shape: `(N_hamiltonians, energy)`, i.e. `(3125, 5000)`

When accessing data through Tiled, the client currently sees the full stacked dataset and has to index into it manually:

```python
rixs = h["rixs"][0][:]
xps  = h["xps"][0][:]
```

### Proposed Improvement

A more intuitive design would be to adapt the Tiled server so that:
- Each Hamiltonian node directly corresponds to its own RIXS/XPS slice
- The slicing is handled server-side
- The client receives a ready-to-use array without manual indexing

### Ongoing Improvement

To move the slicing logic to the server side, I implemented:
- A custom `$path/tiled_poc/slicer.py` module that:
    - Takes an HDF5 file path
    - Takes an index `i`
    - Returns only the `i`-th slice, corresponding to the `i`-th Hamiltonian
- An additional index field (`i`) stored in the manifest
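The core of such a slicer can be sketched as a plain function. This is not the actual `slicer.py` or its `SlicedHDF5Adapter`; the function name `read_hamiltonian_slice` and the `opener` parameter are hypothetical, introduced so the logic can be exercised without a real HDF5 file. When no opener is passed, the sketch assumes `h5py` is installed and opens the file read-only.

```python
from contextlib import contextmanager


def read_hamiltonian_slice(path, dataset, i, opener=None):
    """Return only the i-th entry along the first axis of `dataset`."""
    if opener is None:
        import h5py  # imported lazily so the sketch loads without h5py installed

        def opener(p):
            return h5py.File(p, "r")

    with opener(path) as f:
        data = f[dataset]
        if not 0 <= i < len(data):
            raise IndexError(f"slice index {i} out of range for {dataset}")
        # With h5py, indexing a dataset reads just the requested slice from disk.
        return data[i]


@contextmanager
def fake_opener(path):
    """In-memory stand-in for an HDF5 file holding three stacked 'spectra'."""
    yield {"/XPS": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}


print(read_hamiltonian_slice("demo.h5", "/XPS", 1, opener=fake_opener))  # [3.0, 4.0]
```

Validating the index before reading gives a clear error when a manifest row points past the end of a file.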

The following changes are still needed:
- In `config.yml`:
```yaml
media_types:
  application/x-sliced-hdf5:
    adapter: "slicer:SlicedHDF5Adapter"  # exposes the stacked array as one slice per Hamiltonian
```
- In `bulk_register.py`, the artifact registration must be modified so that:
    - The custom mimetype (`application/x-sliced-hdf5`) is specified.
    - The dataset path and slice index (`i`) are stored as parameters.
    - Each Hamiltonian row in the manifest includes its corresponding slice index, which is passed to the artifact source configuration.

The goal is for each Hamiltonian node to know:
- Which HDF5 file it belongs to
- Which slice index corresponds to it

The server then uses this index internally to return only the relevant slice.
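As a rough illustration of the registration side, the sketch below turns one artifact manifest row into a description carrying the custom mimetype, the dataset path, and the slice index. The dictionary layout, the `index` column name, and the helper `describe_sliced_source` are all hypothetical; they are not Tiled's API and not the actual code in `bulk_register.py`.

```python
SLICED_MIMETYPE = "application/x-sliced-hdf5"  # custom mimetype from config.yml
DATASET_BY_TYPE = {"rixs": "/RIXS", "xps": "/XPS"}  # dataset paths from the schema


def describe_sliced_source(artifact_row):
    """Build a plain-dict description of one sliced artifact (hypothetical layout)."""
    return {
        "mimetype": SLICED_MIMETYPE,
        "file": artifact_row["path_rel"],
        "parameters": {
            "dataset": DATASET_BY_TYPE[artifact_row["type"]],
            "slice_index": int(artifact_row["index"]),
        },
    }


# Made-up manifest row; "index" stands for the slice index field described above.
row = {"huid": "demo_rank0000_0003", "type": "rixs", "path_rel": "demo_rank0000.h5", "index": 3}
source = describe_sliced_source(row)
print(source["parameters"])  # {'dataset': '/RIXS', 'slice_index': 3}
```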

