# RPT-Style SOH@600 Transfer Demo (BatteryML)

This notebook demonstrates:
- training SOH@600 predictors on **public datasets** (RWTH / CALCE)
- using **capacity-normalized** early-cycle discharge features (cycles 0, 100, 200)
- simulating sparse RPT checkpoints and predicting **SOH at cycle 600**

Assumptions:
- You run Jupyter from the **repo root**.
- You have installed `batteryml` in editable mode (`pip install -e .`).

Downloading and preprocessing the public data is optional; the commands are included below.

In [1]:
from pathlib import Path
import os
import pandas as pd

from batteryml.pipeline import Pipeline
from batteryml import BatteryData, CycleData

print('cwd:', os.getcwd())

cwd: c:\Users\FanWang\Documents\GitHub\BatteryML


## (Optional) Download + preprocess public datasets

RWTH (graphite/NMC, narrow voltage window) and CALCE (graphite/LCO, wider voltage window) are both downloadable via CLI.

If you already have processed data under `data/processed/<DATASET>`, you can skip this section.

In [2]:
# Uncomment to download + preprocess.
# !batteryml download RWTH data/raw/RWTH
# !batteryml preprocess RWTH data/raw/RWTH data/processed/RWTH -q

# !batteryml download CALCE data/raw/CALCE
# !batteryml preprocess CALCE data/raw/CALCE data/processed/CALCE -q

print('RWTH processed exists:', Path('data/processed/RWTH').exists())
print('CALCE processed exists:', Path('data/processed/CALCE').exists())

RWTH processed exists: True
CALCE processed exists: True


## Train + evaluate multiple model configs

We use configs under `configs/soh/transfer/`:
- Feature: `NormalizedVoltageCapacityMatrixFeatureExtractor` (capacity-normalized Q(V) diff curves)
- Label: `SOHAtCycleNumberLabelAnnotator` for SOH@600
- Models: Ridge / XGBoost / RandomForest

Note: the configs use a `RandomTrainTestSplitter`, so scores depend on the seed and the resulting split.
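
The implementation of `NormalizedVoltageCapacityMatrixFeatureExtractor` is not shown in this notebook. As a rough illustration of what a capacity-normalized Q(V) difference feature can look like, the sketch below interpolates each discharge curve onto a shared voltage grid, divides by a reference capacity, and subtracts the baseline cycle. The function names, the `ref_capacity` argument, and the choice of normalizer are assumptions made for illustration, not the extractor's actual API.

```python
import numpy as np

def normalized_qv(voltage, capacity, v_grid, ref_capacity):
    """Discharge capacity on a fixed voltage grid, divided by ref_capacity."""
    v = np.asarray(voltage, dtype=float)
    q = np.asarray(capacity, dtype=float)
    # np.interp expects increasing x; voltage falls during discharge, so sort first.
    order = np.argsort(v)
    return np.interp(v_grid, v[order], q[order]) / ref_capacity

def qv_diff_features(curves, v_grid, ref_capacity, baseline=0):
    """Stack normalized Q(V) curves minus the baseline-cycle curve.

    curves maps cycle number -> (voltage array, capacity array).
    Returns one row per non-baseline cycle, in increasing cycle order.
    """
    base = normalized_qv(*curves[baseline], v_grid, ref_capacity)
    later = [n for n in sorted(curves) if n != baseline]
    return np.stack(
        [normalized_qv(*curves[n], v_grid, ref_capacity) - base for n in later]
    )

# Toy data (hypothetical): linear curves, 10% less capacity at cycle 100.
v = np.linspace(4.2, 2.7, 50)
curves = {
    0: (v, np.linspace(0.0, 1.0, 50)),
    100: (v, np.linspace(0.0, 0.9, 50)),
}
feats = qv_diff_features(curves, np.linspace(3.0, 4.0, 5), ref_capacity=1.0)
print(feats.shape)  # (1, 5)
```

Dividing by a per-cell reference capacity (for example the nominal capacity) is what lets cells of different sizes share one feature scale, which matters when the training cells and the target pouch cells differ in capacity.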

In [3]:
CONFIGS = [
    # 'configs/soh/transfer/rwth_soh600_rptnorm_ridge.yaml',
    # 'configs/soh/transfer/rwth_soh600_rptnorm_xgb.yaml',
    # 'configs/soh/transfer/rwth_soh600_rptnorm_rf.yaml',
    'configs/soh/transfer/calce_soh600_rptnorm_ridge.yaml',
    'configs/soh/transfer/calce_soh600_rptnorm_xgb.yaml',
    'configs/soh/transfer/calce_soh600_rptnorm_rf.yaml',
]

rows = []
for cfg in CONFIGS:
    # One workspace per config so outputs from different runs do not collide.
    ws = Path('workspaces/rpt_soh600_demo') / Path(cfg).stem
    pipe = Pipeline(config_path=cfg, workspace=str(ws))
    model, dataset = pipe.train(device='cpu', skip_if_executed=False)
    pipe.evaluate(
        seed=0, device='cpu', metric=['RMSE', 'MAE'],
        model=model, dataset=dataset, skip_if_executed=False,
    )
    # Recompute the test metrics here so they can be collected into one table.
    pred = model.predict(dataset, data_type='test')
    rmse = dataset.evaluate(pred, 'RMSE', data_type='test')
    mae = dataset.evaluate(pred, 'MAE', data_type='test')
    rows.append({'config': cfg, 'rmse': rmse, 'mae': mae})

# Best (lowest RMSE) config first.
df = pd.DataFrame(rows).sort_values('rmse')
df

workspaces\rpt_soh600_demo\calce_soh600_rptnorm_ridge
Seed is set to 0.


Reading train data: 100%|██████████| 10/10 [00:00<00:00, 18.17it/s]
Reading test data: 100%|██████████| 3/3 [00:00<00:00, 16.87it/s]
Extracting features: 100%|██████████| 10/10 [00:00<00:00, 1010.75it/s]
Extracting features: 100%|██████████| 3/3 [00:00<00:00, 1084.55it/s]


ValueError: No training samples after filtering NaN labels / invalid features. Check your label cycle_number, required feature cycles, and dataset coverage.
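
The error message points at cycle coverage: a cell can only contribute a training sample if it has the feature cycles and the label cycle. A quick way to see which cells fall short is to compare each cell's recorded cycle numbers with the required ones. The helper below is a hypothetical diagnostic, not part of BatteryML, and its `max_gap` tolerance may differ from what the feature extractor and label annotator accept.

```python
def missing_cycles(cycle_numbers, required=(0, 100, 200, 600), max_gap=0):
    """Return the required cycles with no recorded cycle within max_gap of them."""
    numbers = [int(n) for n in cycle_numbers]
    return [r for r in required if not any(abs(n - r) <= max_gap for n in numbers)]

# Toy example with made-up cells: one starts at cycle 1 and stops before 600.
coverage = {
    'cell_a': range(1, 450),
    'cell_b': range(0, 900),
}
for cell_id, numbers in coverage.items():
    print(cell_id, 'missing:', missing_cycles(numbers))
# cell_a missing: [0, 600]
# cell_b missing: []

# On processed data the cycle numbers could be read the same way the
# simulation cell in this notebook does, e.g.:
#   cell = BatteryData.load(path)
#   missing_cycles([c.cycle_number for c in cell.cycle_data])
```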

## Simulate sparse RPT data from ONE public cell

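Your pouch data has sparse checkpoints (cycles 0, 100, 200, ...) and only discharge curves.
Here we simulate that by taking one public cell and keeping only cycles 0, 100, 200, and 600. When an exact cycle is missing, the nearest one (at most 2 cycles away) is used and relabeled with the target number.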

We will store the simulated sparse cell under `data/processed/SIM_RPT/`.

In [None]:
def pick_cycle_by_number(cell: BatteryData, target: int, max_gap: int = 2):
    """Return the cycle numbered `target`, or the nearest one within `max_gap` cycles."""
    cycles = list(cell.cycle_data)
    if not cycles:
        raise ValueError(f'cycle {target} not found (best diff=None)')
    # min() keeps the first cycle with the smallest distance, so an exact match wins.
    best = min(cycles, key=lambda c: abs(int(c.cycle_number) - int(target)))
    best_diff = abs(int(best.cycle_number) - int(target))
    if best_diff > max_gap:
        raise ValueError(f'cycle {target} not found (best diff={best_diff})')
    return best

def clone_cycle_with_number(c: CycleData, cycle_number: int) -> CycleData:
    """Copy a cycle and relabel it so the sparse cell has exact checkpoint numbers."""
    d = c.to_dict()
    d['cycle_number'] = int(cycle_number)
    return CycleData(**d)

src_dir = Path('data/processed/RWTH') if Path('data/processed/RWTH').exists() else Path('data/processed/CALCE')
assert src_dir.exists(), 'Need at least one processed dataset to simulate from.'

# Sort so the chosen cell is reproducible; glob order depends on the filesystem.
src_cell_path = sorted(src_dir.glob('*.pkl'))[0]
src_cell = BatteryData.load(src_cell_path)
print('Source cell:', src_cell.cell_id, 'cycles:', len(src_cell.cycle_data))

targets = [0, 100, 200, 600]
picked = [clone_cycle_with_number(pick_cycle_by_number(src_cell, t), t) for t in targets]

sim_dir = Path('data/processed/SIM_RPT')
sim_dir.mkdir(parents=True, exist_ok=True)
sim_cell = BatteryData(
    cell_id=f'RPT_SIM_{src_cell.cell_id}',
    cycle_data=picked,
    form_factor='pouch',
    anode_material='graphite',
    cathode_material=getattr(src_cell, 'cathode_material', None),
    nominal_capacity_in_Ah=getattr(src_cell, 'nominal_capacity_in_Ah', None),
    min_voltage_limit_in_V=getattr(src_cell, 'min_voltage_limit_in_V', None),
    max_voltage_limit_in_V=getattr(src_cell, 'max_voltage_limit_in_V', None),
)
sim_path = sim_dir / f'{sim_cell.cell_id}.pkl'
sim_cell.dump(sim_path)
print('Wrote simulated sparse cell:', sim_path)

Source cell: RWTH_002 cycles: 2393
Wrote simulated sparse cell: data\processed\SIM_RPT\RPT_SIM_RWTH_002.pkl


In [None]:
from pprint import pprint
# pprint(vars(sim_cell))
# pprint(vars(sim_cell.cycle_data[2]))

## Train on public data, predict SOH@600 on the simulated sparse cell

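We use a dedicated split: train = public dataset dir, test = `SIM_RPT` dir.

The simulated cell is cut from a cell in the training directory. Unless that source cell is excluded from training, the result below shows that the pipeline runs end to end; it is not an estimate of transfer accuracy.

This is the same pattern you will use for your own 3 pouch cells after preprocessing them into `data/processed/MY_POUCH/*.pkl`.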

In [None]:
import yaml

from batteryml.task import Task
from batteryml.builders import MODELS

best_cfg = df.iloc[0]['config']
print('Using best config:', best_cfg)

# Load the YAML config as a plain dict
with open(best_cfg, 'r', encoding='utf-8') as f:
    cfg = yaml.safe_load(f)

# Override split to: train on src_dir, test on SIM_RPT
cfg['train_test_split'] = {
    'name': 'CellIdPrefixTrainTestSplitter',
    'cell_data_path': [str(src_dir), 'data/processed/SIM_RPT'],
    'test_prefixes': ['RPT_SIM_'],
}

task = Task(
    train_test_splitter=cfg['train_test_split'],
    feature_extractor=cfg['feature'],
    label_annotator=cfg['label'],
    feature_transformation=cfg.get('feature_transformation'),
    label_transformation=cfg.get('label_transformation'),
)
dataset = task.build().to('cpu')
model = MODELS.build(cfg['model'])
model.fit(dataset)

pred = model.predict(dataset, data_type='test')
true = dataset.test_data.label
# Inverse-transform SOH if label transformation exists
if dataset.label_transformation is not None:
    pred = dataset.label_transformation.inverse_transform(pred)
    true = dataset.label_transformation.inverse_transform(true)

print('Pred SOH@600:', pred.tolist())
print('True SOH@600:', true.tolist())

Using best config: configs/soh/transfer/rwth_soh600_rptnorm_rf.yaml


Reading train data: 100%|██████████| 48/48 [00:19<00:00,  2.47it/s]
Reading test data: 100%|██████████| 1/1 [00:00<00:00, 117.58it/s]
Extracting features: 100%|██████████| 48/48 [00:00<00:00, 557.71it/s]
Extracting features: 100%|██████████| 1/1 [00:00<?, ?it/s]


Pred SOH@600: [0.8461166670703973]
True SOH@600: [0.8485139012336731]
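
To put the two printed numbers on a common footing, the gap can be expressed in SOH percentage points and as a relative error. This is plain arithmetic on the values above. With a single test cell it is one sample, not a summary statistic.

```python
# Values copied from the output above.
pred_soh = 0.8461166670703973
true_soh = 0.8485139012336731

abs_err = abs(pred_soh - true_soh)   # in SOH fraction
rel_err = abs_err / true_soh         # relative to the true value

print(f'abs error: {abs_err * 100:.2f} SOH points, relative: {rel_err:.2%}')
# abs error: 0.24 SOH points, relative: 0.28%
```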


## How to run on your own 3 pouch cells

1) Preprocess your RPT files into BatteryML `.pkl`:

```bash
batteryml preprocess RPT --config <your_rpt_config.yaml> <raw_dir> data/processed/MY_POUCH
```

2) Replace `data/processed/SIM_RPT` above with `data/processed/MY_POUCH` and set `test_prefixes` to match your generated `cell_id` prefix (default `RPT_`).

3) For better transfer, ensure your feature extractor uses the same voltage window as the training dataset.
   - If training on RWTH: use a 3.5 to 3.9 V window
   - If training on CALCE: use a 2.7 to 4.2 V window

If your pouch RPT is full-range (e.g. 2.7 to 4.2 V), training on CALCE may align better on the voltage axis, while RWTH aligns better on chemistry (NMC).
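
As a sketch of what matching the voltage window means in practice, the snippet below restricts a discharge curve to the window of the chosen training dataset before any features are computed. The window values come from the list above; `clip_to_window` and the rebasing of capacity to zero at the window start are illustrative assumptions, not BatteryML behavior.

```python
import numpy as np

# Voltage windows (in volts) as listed above.
VOLTAGE_WINDOWS = {'RWTH': (3.5, 3.9), 'CALCE': (2.7, 4.2)}

def clip_to_window(voltage, capacity, window):
    """Keep samples whose voltage lies inside `window` (inclusive).

    Arrays are assumed to be in time order. Capacity is rebased so the
    first kept sample has zero discharged capacity.
    """
    v = np.asarray(voltage, dtype=float)
    q = np.asarray(capacity, dtype=float)
    lo, hi = window
    mask = (v >= lo) & (v <= hi)
    if not mask.any():
        raise ValueError(f'no samples inside {lo}-{hi} V')
    q_win = q[mask]
    return v[mask], q_win - q_win[0]

# Toy full-range discharge curve (hypothetical values).
v = np.array([4.2, 4.0, 3.8, 3.6, 3.4, 3.0])
q = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
v_win, q_win = clip_to_window(v, q, VOLTAGE_WINDOWS['RWTH'])
print(v_win, q_win)
```

Clipping a full-range pouch curve to the RWTH window discards most of the curve, which is the trade-off behind choosing between voltage-axis alignment (CALCE) and chemistry alignment (RWTH).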