# Tutorial

This notebook is a guided tutorial for running and understanding a full kMCpy workflow.

What you will learn:

1. Which input artifacts are required and why.
2. How to run a reproducible NASICON simulation through the Python API.
3. How to inspect and interpret the resulting transport metrics.
4. How to build an LCE model and prepare fitting matrices for your own dataset.
5. How to fit LCE parameters.


## 0) Prerequisites

From the repository root:

```bash
uv sync --extra doc --extra dev
```

The notebook is stored in `docs` and rendered with `nbsphinx`.
By default, the docs build does not execute cells (`nbsphinx_execute = 'never'`).


## 1) Understand the required inputs

For a production kMC run, kMCpy needs these categories of files:

- **Structure**: a CIF file defining the host lattice.
- **Event library**: `events.json` + `event_dependencies.csv` describing allowed hops and their dependency graph.
- **Model definition**: `lce.json` and `lce_site.json`, containing the local cluster expansion structure.
- **Fitted parameters**: `fitting_results.json` and `fitting_results_site.json`, containing the trained coefficients.
- **Initial state**: `initial_state.json`, the occupation vector at simulation start.

In this tutorial we use the repository's validated NASICON regression dataset under `tests/files`.
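
Before launching a run, it can help to confirm that the files listed above are actually present. The following is a minimal sketch, not part of kMCpy: the helper name `find_missing_inputs` is hypothetical, and the path list simply mirrors the files used in this tutorial (`event_dependencies.csv` is left out because its location is not shown here).

```python
from pathlib import Path

# Paths relative to the repository root, mirroring the inputs used in this tutorial.
REQUIRED_INPUTS = [
    "tests/files/EntryWithCollCode15546_Na4Zr2Si3O12_573K.cif",
    "tests/files/input/events.json",
    "tests/files/input/lce.json",
    "tests/files/input/lce_site.json",
    "tests/files/input/fitting_results.json",
    "tests/files/input/fitting_results_site.json",
    "tests/files/input/initial_state.json",
]


def find_missing_inputs(repo_root, relative_paths=REQUIRED_INPUTS):
    """Return the relative paths that are not regular files under repo_root.

    Hypothetical helper for illustration; kMCpy does not provide it.
    """
    root = Path(repo_root)
    return [p for p in relative_paths if not (root / p).is_file()]


missing = find_missing_inputs(".")
if missing:
    print("Missing input files:")
    for path in missing:
        print("  -", path)
else:
    print("All required input files are present.")
```

Run it from the repository root; any path it reports should be resolved before building a `SimulationConfig`.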


### Input snapshot (used in this tutorial)

Below is the exact `SimulationConfig` payload (modern format), shown as JSON:

```json
{
  "structure_file": "tests/files/EntryWithCollCode15546_Na4Zr2Si3O12_573K.cif",
  "cluster_expansion_file": "tests/files/input/lce.json",
  "fitting_results_file": "tests/files/input/fitting_results.json",
  "cluster_expansion_site_file": "tests/files/input/lce_site.json",
  "fitting_results_site_file": "tests/files/input/fitting_results_site.json",
  "event_file": "tests/files/input/events.json",
  "initial_state_file": "tests/files/input/initial_state.json",
  "mobile_ion_specie": "Na",
  "temperature": 298,
  "attempt_frequency": 5e12,
  "equilibration_passes": 1,
  "kmc_passes": 100,
  "supercell_shape": [2, 1, 1],
  "immutable_sites": ["Zr", "O", "Zr4+", "O2-"],
  "convert_to_primitive_cell": true,
  "elementary_hop_distance": 3.47782,
  "random_seed": 12345,
  "name": "NASICON_Regression"
}
```

This configuration corresponds to the regression setup validated in `tests/test_nasicon_bulk.py`.
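
If you keep such a payload in a JSON file, a quick sanity check before constructing the configuration can catch typos early. The sketch below is only an illustration: `check_payload` is a hypothetical helper, and the rules it applies are plausible assumptions rather than kMCpy's own validation.

```python
import json

# A subset of the payload above; file paths are omitted to keep the sketch short.
payload_text = """
{
  "temperature": 298,
  "attempt_frequency": 5e12,
  "equilibration_passes": 1,
  "kmc_passes": 100,
  "supercell_shape": [2, 1, 1],
  "immutable_sites": ["Zr", "O", "Zr4+", "O2-"],
  "random_seed": 12345
}
"""


def check_payload(payload):
    """Return a list of problems found in a payload dict (hypothetical helper)."""
    problems = []
    if payload["temperature"] <= 0:
        problems.append("temperature must be positive")
    if payload["attempt_frequency"] <= 0:
        problems.append("attempt_frequency must be positive")
    if payload["equilibration_passes"] < 0:
        problems.append("equilibration_passes must not be negative")
    if payload["kmc_passes"] < 1:
        problems.append("kmc_passes must be at least 1")
    shape = payload["supercell_shape"]
    if len(shape) != 3 or any(not isinstance(n, int) or n < 1 for n in shape):
        problems.append("supercell_shape must be three positive integers")
    return problems


payload = json.loads(payload_text)
print("Problems found:", check_payload(payload))

# JSON has no tuple type, so convert the list if a tuple is preferred downstream.
supercell_shape = tuple(payload["supercell_shape"])
```

An empty problem list only means the values pass these illustrative checks; it does not guarantee that kMCpy accepts the configuration.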


### Raw input data excerpts

These are **actual excerpts** from the input dataset used in this tutorial.

**A) `tests/files/input/initial_state.json` (occupation preview)**

```json
{
  "occupation_preview_first_24": [
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    1,
    0,
    0,
    0,
    0,
    0,
    0,
    1,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0
  ],
  "occupation_length": 84
}
```

**B) `tests/files/input/events.json` (first event)**

```json
{
  "mobile_ion_indices": [
    0,
    8
  ],
  "local_env_indices": [
    8,
    10,
    14,
    4,
    6,
    12,
    22,
    16,
    20,
    24,
    26,
    18
  ],
  "local_env_indices_site": [
    8,
    10,
    14,
    4,
    6,
    12,
    22,
    16,
    20,
    24,
    26,
    18
  ]
}
```
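
To see how excerpts A and B relate, the sketch below looks up the occupation values at the indices named by the first event. It uses only the data shown above; the interpretation of `0` and `1` is not stated in these excerpts, so the code reports raw values without assigning them a meaning. Indices beyond the 24-entry preview are skipped.

```python
# First 24 entries of the occupation vector, copied from excerpt A.
occupation_preview = [
    0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0, 0, 0,
]

# First event, copied from excerpt B.
first_event = {
    "mobile_ion_indices": [0, 8],
    "local_env_indices": [8, 10, 14, 4, 6, 12, 22, 16, 20, 24, 26, 18],
    "local_env_indices_site": [8, 10, 14, 4, 6, 12, 22, 16, 20, 24, 26, 18],
}

# Occupation values at the two sites listed in mobile_ion_indices.
site_a, site_b = first_event["mobile_ion_indices"]
endpoint_values = {i: occupation_preview[i] for i in (site_a, site_b)}
print("Occupation at mobile-ion indices:", endpoint_values)

# Occupation values of the local environment, limited to indices inside the preview.
env = first_event["local_env_indices"]
in_preview = {i: occupation_preview[i] for i in env if i < len(occupation_preview)}
print("Local-environment indices inside the preview:", in_preview)

# In this excerpt both environment lists are identical.
same_env = first_event["local_env_indices"] == first_event["local_env_indices_site"]
print("KRA and site environments identical:", same_env)
```

The same lookup applies to the full 84-entry vector once `initial_state.json` is loaded.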

**C) `tests/files/input/fitting_results.json` (latest fitted record preview)**

```json
{
  "alpha": 1.5,
  "empty_cluster": 339.1083600548,
  "rmse": 29.1495050791,
  "loocv": 149.8671381416,
  "keci_preview": [
    10.5029085379,
    4.0803809727,
    0.0,
    0.0,
    -11.2639080302,
    0.0,
    -4.0578126572,
    0.0
  ]
}
```
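
Several entries in `keci_preview` are exactly zero. A short sketch for counting the active terms, using the same `1e-8` threshold that the fitting cell later in this tutorial applies to `model_params.keci`:

```python
# KECI preview copied from excerpt C.
keci_preview = [
    10.5029085379,
    4.0803809727,
    0.0,
    0.0,
    -11.2639080302,
    0.0,
    -4.0578126572,
    0.0,
]

TOLERANCE = 1e-8  # values with magnitude below this are treated as zero

# Keep (index, value) pairs for terms that are effectively non-zero.
nonzero_terms = [(i, v) for i, v in enumerate(keci_preview) if abs(v) > TOLERANCE]

print(f"{len(nonzero_terms)} of {len(keci_preview)} previewed KECI terms are non-zero")
for index, value in nonzero_terms:
    print(f"  term {index}: {value:+.4f}")
```

This only covers the previewed entries; the full file may contain more terms.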

**D) `NASICON cif file (ICSD-15546)` (header excerpt)**



## 2) Build a local cluster expansion (LCE) model

Use this when starting from a new crystal system and you need model topology (`lce.json`) before fitting.

In [None]:
from pathlib import Path
from kmcpy.external.structure import StructureKMCpy
from kmcpy.models.local_cluster_expansion import LocalClusterExpansion
from kmcpy.structure.local_lattice_structure import LocalLatticeStructure

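# Directory that will hold the generated model definition.
out_dir = Path('example/output/tutorial/model')
out_dir.mkdir(parents=True, exist_ok=True)

# Replace 'nasicon.cif' with the path to your own structure file.
structure = StructureKMCpy.from_cif('nasicon.cif', primitive=True)

# Local environment centered on site index 0, with a cutoff of 4.0.
local_lattice = LocalLatticeStructure(
    template_structure=structure,
    center=0,
    cutoff=4.0,
    specie_site_mapping={
        'Na': ['Na', 'X'],
        'Zr': 'Zr',
        'Si': ['Si', 'P'],
        'O': 'O',
    },
    basis_type='chebyshev',
    exclude_species=['O2-', 'O', 'Zr4+', 'Zr'],
)

# Build the cluster topology and save it under out_dir,
# which is where the fitting-data cell in section 3 loads it from.
lce = LocalClusterExpansion()
lce.build(local_lattice_structure=local_lattice, cutoff_cluster=[6, 6, 0])
lce.to_json(str(out_dir / 'lce.json'))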


## 3) Generate correlation matrix

For your own training dataset (for example NEB barriers), build a list of structures + target values and convert them into fitting matrices.
If you leave the training lists empty, this tutorial uses bundled fitting matrices so the workflow still runs end-to-end.


### Fitting data format (what each file contains)

When preparing fitting inputs, `LocalClusterExpansion.fit(...)` expects:

- `correlation_matrix.txt`: shape `(n_samples, n_orbits)`, one row of correlations per training sample
- `e_kra.txt`: shape `(n_samples,)`, target barrier values in meV
- `weight.txt`: shape `(n_samples,)`, sample weights

Minimal conceptual example:

```text
correlation_matrix.txt
0.12  0.00 -0.34 ...
0.10  0.05 -0.29 ...
...

e_kra.txt
120.5
140.1
...

weight.txt
1.0
1.0
...
```
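
The shape requirements can be checked before fitting. The sketch below parses the whitespace-delimited text format with the standard library and verifies that the three inputs agree on `n_samples`. The helper names are hypothetical, and the numbers are the toy values from the conceptual example, truncated to three columns.

```python
def parse_rows(text):
    """Parse whitespace-delimited numeric text into a list of rows of floats."""
    return [
        [float(token) for token in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


def check_fitting_shapes(corr_rows, e_kra, weight):
    """Return (n_samples, n_orbits) or raise ValueError if the inputs disagree."""
    n_samples = len(corr_rows)
    if n_samples == 0:
        raise ValueError("correlation matrix is empty")
    n_orbits = len(corr_rows[0])
    if any(len(row) != n_orbits for row in corr_rows):
        raise ValueError("correlation matrix rows have different lengths")
    if len(e_kra) != n_samples or len(weight) != n_samples:
        raise ValueError("e_kra and weight need exactly one value per sample")
    return n_samples, n_orbits


# Toy data: two samples, three orbits (illustrative values only).
corr_rows = parse_rows("0.12 0.00 -0.34\n0.10 0.05 -0.29\n")
e_kra = [row[0] for row in parse_rows("120.5\n140.1\n")]
weight = [row[0] for row in parse_rows("1.0\n1.0\n")]

shape = check_fitting_shapes(corr_rows, e_kra, weight)
print("(n_samples, n_orbits) =", shape)
```

For real data, the same check can be applied to the contents of the three text files before calling the fitting routine.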


In [None]:
from pathlib import Path
import shutil
import numpy as np

from kmcpy.external.structure import StructureKMCpy
from kmcpy.io import NEBDataLoader, NEBEntry
from kmcpy.structure.local_lattice_structure import LocalLatticeStructure
from kmcpy.models.local_cluster_expansion import LocalClusterExpansion

repo = Path('.').resolve()
model = LocalClusterExpansion.from_json(str(repo / 'example/output/tutorial/model/lce.json'))
fit_dir = repo / 'example/output/tutorial/fit_data'
fit_dir.mkdir(parents=True, exist_ok=True)

training_structures = [
    # Replace with your local-environment structure paths
    # 'path/to/neb_local_env_0001.cif',
    # 'path/to/neb_local_env_0002.cif',
]
target_ekra = [
    # Replace with matching barrier values in meV
    # 120.5,
    # 140.1,
]

if not training_structures and not target_ekra:
    reference_dir = repo / 'tests' / 'files' / 'fitting' / 'local_cluster_expansion'
    for name in ('correlation_matrix.txt', 'e_kra.txt', 'weight.txt'):
        shutil.copy(reference_dir / name, fit_dir / name)
    print('No custom NEB data supplied; copied bundled tutorial fitting data from', reference_dir)
else:
    if len(training_structures) != len(target_ekra):
        raise ValueError('training_structures and target_ekra must have the same length')

    loader = NEBDataLoader()
    for cif_file, ekra in zip(training_structures, target_ekra):
        structure = StructureKMCpy.from_file(cif_file)
        local_lattice = LocalLatticeStructure(
            template_structure=structure,
            center=0,
            cutoff=4.0,
            specie_site_mapping={
                'Na': ['Na', 'X'],
                'Zr': 'Zr',
                'Si': ['Si', 'P'],
                'O': 'O',
            },
            basis_type='chebyshev',
            exclude_species=['O2-', 'O', 'Zr4+', 'Zr'],
        )
        loader.add(
            NEBEntry(local_lattice_structure=local_lattice, property_value=float(ekra)),
            model=model,
            validate_structure=False,
        )

    if len(loader) == 0:
        raise ValueError('No samples were added. Provide at least one (structure, e_kra) pair.')

    np.savetxt(fit_dir / 'correlation_matrix.txt', loader.get_correlation_matrix())
    np.savetxt(fit_dir / 'e_kra.txt', loader.get_properties())
    np.savetxt(fit_dir / 'weight.txt', np.ones(len(loader)))
    print('Generated fitting data from custom NEB structures in', fit_dir)

corr = np.atleast_2d(np.loadtxt(fit_dir / 'correlation_matrix.txt'))
e_kra = np.atleast_1d(np.loadtxt(fit_dir / 'e_kra.txt'))
weight = np.atleast_1d(np.loadtxt(fit_dir / 'weight.txt'))
print('correlation_matrix shape:', corr.shape)
print('e_kra shape:', e_kra.shape)
print('weight shape:', weight.shape)


## 4) Fit LCE parameters

Fit using generated training files. This writes `keci.txt` and `fitting_results.json` for later model loading:


In [None]:
from pathlib import Path
from kmcpy.models.local_cluster_expansion import LocalClusterExpansion

repo = Path('.').resolve()
fit_dir = repo / 'example/output/tutorial/fit_data'

lce = LocalClusterExpansion()
model_params, y_pred, y_true = lce.fit(
    alpha=1.5,
    max_iter=1_000_000,
    ekra_fname=str(fit_dir / 'e_kra.txt'),
    keci_fname=str(fit_dir / 'keci.txt'),
    weight_fname=str(fit_dir / 'weight.txt'),
    corr_fname=str(fit_dir / 'correlation_matrix.txt'),
    fit_results_fname=str(fit_dir / 'fitting_results.json'),
    lce_params_fname=None,
    lce_params_history_fname=None,
)

print('Non-zero KECI terms:', sum(abs(x) > 1e-8 for x in model_params.keci))
print('RMSE (meV):', round(float(model_params.rmse), 4))
print('Saved fitting outputs in', fit_dir)
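
The fit also returns predicted and reference values, which can be compared directly. The following sketch computes a plain, unweighted root-mean-square error, assuming `y_true` and `y_pred` are equal-length sequences of barrier values in meV. Whether the `rmse` reported by kMCpy applies the sample weights is not stated here, so the two numbers may differ. The values below are toy data, not fit output.

```python
import math


def rmse(y_true, y_pred):
    """Unweighted root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values for illustration only.
y_true_example = [120.5, 140.1]
y_pred_example = [118.5, 143.1]

print("RMSE (meV):", round(rmse(y_true_example, y_pred_example), 4))
```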


## 5) Construct a `CompositeLCEModel`

The simulator consumes a composite model with both barrier (`kra_model`) and site-energy (`site_model`) contributions.
This cell loads both models and their latest fitted parameters from JSON files.


In [None]:
from pathlib import Path
from kmcpy.models.composite_lce_model import CompositeLCEModel

root = Path('.').resolve()
input_dir = root / 'tests' / 'files' / 'input'

composite_lce_model = CompositeLCEModel.from_json(
    lce_fname=str(input_dir / 'lce.json'),
    fitting_results=str(input_dir / 'fitting_results.json'),
    lce_site_fname=str(input_dir / 'lce_site.json'),
    fitting_results_site=str(input_dir / 'fitting_results_site.json'),
)

print(type(composite_lce_model).__name__)
print('Has site model:', composite_lce_model.site_model is not None)
print('Has KRA model:', composite_lce_model.kra_model is not None)


## 6) Run a reproducible NASICON simulation (API path)

This section runs the same setup used by the regression test suite.
It is a good sanity check that your environment and dependencies are working.


In [None]:
from pathlib import Path
from kmcpy.simulator.config import SimulationConfig
from kmcpy.simulator.kmc import KMC

root = Path('.').resolve()
file_path = root / 'tests' / 'files'
# Model, event, and initial-state files live in the 'input' subdirectory,
# matching the paths in the input snapshot above.
input_dir = file_path / 'input'

config = SimulationConfig.create(
    structure_file=file_path / 'EntryWithCollCode15546_Na4Zr2Si3O12_573K.cif',
    cluster_expansion_file=input_dir / 'lce.json',
    fitting_results_file=input_dir / 'fitting_results.json',
    cluster_expansion_site_file=input_dir / 'lce_site.json',
    fitting_results_site_file=input_dir / 'fitting_results_site.json',
    event_file=input_dir / 'events.json',
    initial_state_file=input_dir / 'initial_state.json',
    mobile_ion_specie='Na',
    temperature=298,
    attempt_frequency=5e12,
    equilibration_passes=1,
    kmc_passes=100,
    supercell_shape=(2, 1, 1),
    immutable_sites=('Zr', 'O', 'Zr4+', 'O2-'),
    convert_to_primitive_cell=True,
    elementary_hop_distance=3.47782,
    random_seed=12345,
    name='NASICON_Regression',
)

kmc = KMC.from_config(config)
tracker = kmc.run(config)
tracker.return_current_info()


(1.1193006038758543e-06, np.float64(307.3744449426362), np.float64(1.4630573145769372e-08), np.float64(4.576882562174338e-09), np.float64(1.1823906621661553), np.float64(0.31283002494661705), np.float64(0.21998150220477225))

## 7) Results (reference + run output)

For the input snapshot above, the expected final metrics are:

| Metric | Value |
|---|---:|
| time | 1.1193006038758543e-06 |
| msd | 307.37444494263616 |
| D_J | 1.4630573145769372e-08 |
| D_tracer | 4.5768825621743376e-09 |
| conductivity | 1.1823906621661553 |
| H_R | 0.312830024946617 |
| f | 0.21998150220477225 |

These values are intentionally embedded here so the tutorial remains informative even when cells are not executed during docs build.


### Metric definitions

`tracker.return_current_info()` returns:

`(time, msd, D_J, D_tracer, conductivity, H_R, f)`

- `time`: simulated physical time (s)
- `msd`: mean squared displacement
- `D_J`: jump diffusivity
- `D_tracer`: tracer diffusivity
- `conductivity`: ionic conductivity
- `H_R`: Haven ratio
- `f`: correlation factor
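
The reference values are internally consistent, which gives a quick way to sanity-check a run. The sketch below recomputes the Haven ratio as `D_tracer / D_J` and the tracer diffusivity as `msd / (6 * time)`. The unit assumption (MSD in Å², diffusivities in cm²/s, three-dimensional diffusion) is inferred from the fact that the numbers agree, not taken from kMCpy documentation.

```python
# Reference values copied from the run output shown in this tutorial.
time_s = 1.1193006038758543e-06
msd = 307.3744449426362
d_j = 1.4630573145769372e-08
d_tracer = 4.576882562174338e-09
h_r = 0.31283002494661705

# Haven ratio recomputed as tracer diffusivity over jump diffusivity.
h_r_check = d_tracer / d_j

# Tracer diffusivity recomputed from the mean squared displacement.
# Assumption: msd is in Å^2 and D in cm^2/s, so 1 Å^2 = 1e-16 cm^2,
# and the factor 6 corresponds to diffusion in three dimensions.
ANGSTROM2_TO_CM2 = 1e-16
d_tracer_check = msd * ANGSTROM2_TO_CM2 / (6 * time_s)

print("H_R reported / recomputed:     ", h_r, h_r_check)
print("D_tracer reported / recomputed:", d_tracer, d_tracer_check)
```

If either recomputed value deviates noticeably from the reported one in your own run, check the units and the dimensionality assumed for your system.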

### Output files

After the run, kMCpy writes compressed CSV artifacts in the current working directory:

- `results_NASICON_Regression.csv.gz`
- `displacement_NASICON_Regression_<pass>.csv.gz`
- `hop_counter_NASICON_Regression_<pass>.csv.gz`
- `current_occ_NASICON_Regression_<pass>.csv.gz`

In [2]:
from pathlib import Path
import pandas as pd

results_file = Path('results_NASICON_Regression.csv.gz')
df = pd.read_csv(results_file)
print(df.tail(1).to_string(index=False))


    time          D_J     D_tracer  conductivity        f     H_R        msd
0.000001 1.463057e-08 4.576883e-09      1.182391 0.219982 0.31283 307.374445


## 8) Next steps

- Replace the NASICON regression inputs with your own CIF, event, and model files.
- Run temperature sweeps by varying `temperature` in `SimulationConfig`.
- Track convergence by increasing `kmc_passes` and comparing `D_J`, `D_tracer`, and conductivity.
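
A temperature sweep can be prepared by deriving one payload per temperature from a base payload. The sketch below is a hypothetical helper, not a kMCpy function. It gives each run a distinct `name`, since the output file names listed earlier include the run name and would otherwise collide. The `K` suffix assumes temperatures in kelvin, and the base payload is reduced to a few keys for brevity.

```python
# Reduced base payload; a real one would also carry the file paths and other settings.
base_payload = {
    "temperature": 298,
    "kmc_passes": 100,
    "random_seed": 12345,
    "name": "NASICON_Regression",
}


def sweep_payloads(base, temperatures):
    """Return one payload per temperature, each with a distinct run name.

    Hypothetical helper for illustration; the base payload is not modified.
    """
    return [
        {**base, "temperature": t, "name": f"{base['name']}_{t}K"}
        for t in temperatures
    ]


payloads = sweep_payloads(base_payload, [300, 400, 500])
for payload in payloads:
    print(payload["name"], payload["temperature"])
```

Each resulting dictionary could then be expanded into keyword arguments for `SimulationConfig.create(...)`, as in the simulation cell above.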
