(sec-guide-preprocessing)=
# Preprocessing Data

Raw dynamical systems data often need to be lightly preprocessed before use in Operator Inference.
This page introduces tools in the {mod}`opinf.pre` submodule for preprocessing data.
We show examples of
- centering or shifting to account for boundary conditions, and
- scaling / nondimensionalizing the variables represented in the state.

In this guide, we use $\mathbf{s}(t)$ to denote the unprocessed state variable for which we have data $\mathbf{s}_{j} = \mathbf{s}(t_{j})$, $j=0, \ldots, k-1$. We use $\mathbf{q}(t)$ to denote the processed state variable which we will use for Operator Inference.

## Shifting / Centering

A common first preprocessing step is to shift the training snapshots by some reference snapshot $\bar{\mathbf{s}}$, i.e.,

$$
    \mathbf{q}(t) = \mathbf{s}(t) - \bar{\mathbf{s}}.
$$

For example, the reference snapshot could be chosen to be the average of the training snapshots:

$$
    \bar{\mathbf{s}}
    := \frac{1}{k}\sum_{j=0}^{k-1}\mathbf{s}_{j}.
$$

In this case, the transformed snapshots $\mathbf{q}_{j} = \mathbf{s}_{j} - \bar{\mathbf{s}}$ are centered around $\mathbf{0}$.
This type of transformation can be accomplished using {func}`opinf.pre.shift` or the class {class}`opinf.pre.SnapshotTransformer`.
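The centering formula above can be sketched directly in NumPy. This is a minimal illustration, not the implementation behind {func}`opinf.pre.shift`; the small snapshot matrix `S` is made up for the example, with rows as state entries and columns as snapshots.

```python
import numpy as np

# Hypothetical snapshot matrix: 3 state entries (rows), k = 4 snapshots (columns).
S = np.array(
    [
        [1.0, 2.0, 3.0, 4.0],
        [10.0, 10.0, 12.0, 12.0],
        [-1.0, 0.0, 0.0, 1.0],
    ]
)

# Reference snapshot: the average of the columns.
s_bar = S.mean(axis=1)

# Shift every snapshot by the reference snapshot (broadcast over columns).
Q = S - s_bar[:, np.newaxis]

# Each row of the shifted data now averages to zero.
row_means = Q.mean(axis=1)
print(s_bar)
print(row_means)
```

Because the reference snapshot is stored separately, the original data can always be recovered by adding it back to the shifted snapshots.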

::::{note}
The following code uses data pulled from the combustion problem described in {cite}`swischuk2020combustion`. The data consists of nine variables recorded at 100 points in time.

:::{dropdown} Snapshot Variables

- Pressure $p$
- $x$-velocity $v_{x}$
- $y$-velocity $v_{y}$
- Temperature $T$
- Specific volume $\xi$
- Chemical species molar concentrations for CH$_{4}$, O$_{2}$, CO$_{2}$, and H$_{2}$O.

The dimension of the spatial discretization in the full example in {cite}`swischuk2020combustion` is $38{,}523$ per variable, so that $\mathbf{s}(t)$ has dimension $346{,}707$. Here we have downsampled the state dimension to $535$ for each variable for demonstration purposes.
:::

You can [download the data here](https://github.com/Willcox-Research-Group/rom-operator-inference-Python3/raw/data/data_scaling_example.npy) to repeat the experiments.
::::

In [None]:
import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt

import opinf

In [None]:
# Matplotlib customizations.
plt.rc("axes.spines", right=False, top=False)
plt.rc("figure", dpi=300, figsize=(9, 3))
plt.rc("font", family="serif")
plt.rc("legend", edgecolor="none", frameon=False)
plt.rc("text", usetex=True)

In [None]:
# Load snapshot data and extract just the pressure variable.
snapshots = np.load("data_scaling_example.npy")
pressure = np.split(snapshots, 9, axis=0)[0]

# Shift the pressure snapshots by the average pressure snapshot.
pressure_shifted, reference_snapshot = opinf.pre.shift(pressure)

In [None]:
# Average pressure value.
np.mean(pressure)

In [None]:
# Average shifted pressure value.
np.mean(pressure_shifted)

In [None]:
# Plot the distribution of the entries of the raw and processed states.
fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].hist(pressure.flatten(), bins=40)
axes[1].hist(pressure_shifted.flatten(), bins=40)
axes[0].set_ylabel("Frequency")
axes[0].set_xlabel("Pressure")
axes[1].set_xlabel("Shifted pressure")
fig.tight_layout()
plt.show()

The reference snapshot may also represent boundary conditions or other physical constraints. See {cite}`swischuk2019physicsml` for examples.

## Scaling

Many engineering problems feature multiple variables whose ranges span different scales. In such cases, it is often beneficial to scale the variables to similar ranges so that no single variable overwhelms the others in the operator learning.

A simple scaling is given by

$$
    \mathbf{q}(t) = \frac{1}{\alpha}\mathbf{s}(t),
$$

where $\alpha$ is chosen by examining the range of the training data. For example, after centering the data, a scaling to $[-1, 1]$ is given by

$$
    \mathbf{q}(t)
    = \frac{1}{\alpha}\big(\mathbf{s}(t) - \bar{\mathbf{s}}\big),
    \qquad
    \alpha = \max_{i,j}|\tilde{s}_{ij}|,
$$

where $\tilde{s}_{ij}$ is the $i$th entry of $\mathbf{s}_{j} - \bar{\mathbf{s}}$.
Use {func}`opinf.pre.scale` or the class {class}`opinf.pre.SnapshotTransformer` for this type of transformation.
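As a rough sketch of the centered scaling to $[-1, 1]$, the following NumPy snippet computes $\alpha$ as the largest absolute entry of the centered data and divides by it. The snapshot matrix is made up for illustration, and this is not how {func}`opinf.pre.scale` is implemented.

```python
import numpy as np

# Hypothetical snapshot matrix (rows: state entries, columns: snapshots).
S = np.array(
    [
        [100.0, 104.0, 96.0],
        [98.0, 100.0, 102.0],
    ]
)

# Center the snapshots around the mean snapshot.
s_bar = S.mean(axis=1, keepdims=True)
S_tilde = S - s_bar

# alpha is the largest absolute entry of the centered data.
alpha = np.abs(S_tilde).max()

# The scaled data lies in [-1, 1].
Q = S_tilde / alpha
print(alpha)
print(Q.min(), Q.max())
```

Dividing by a single scalar preserves the relative sizes of the entries, which is why the bound is attained only by the entry with the largest magnitude.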

In [None]:
# Extract the H2O molar concentration.
water = np.split(snapshots, 9, axis=0)[-1]

# Compare the scales of the variables.
print(f"Pressure range (raw):\t\t[{pressure.min():.2e}, {pressure.max():.2e}]")
print(f"Pressure range (shifted):\t[{pressure_shifted.min():.2e}, {pressure_shifted.max():.2e}]")
print(f"Water range:\t\t\t[{water.min():.2e}, {water.max():.2e}]")

In [None]:
# Apply a min-max scaling to the shifted pressure snapshots, mapping them to [0, 0.01].
pressure_scaled, pscale1, pscale2 = opinf.pre.scale(
    pressure_shifted,
    (0, 1e-2),
)

In [None]:
# Compare the scales of the variables.
print(f"Pressure range (raw):\t\t[{pressure.min():.2e}, {pressure.max():.2e}]")
print(f"Pressure range (shifted):\t[{pressure_shifted.min():.2e}, {pressure_shifted.max():.2e}]")
print(f"Pressure range (scaled):\t[{pressure_scaled.min():.2e}, {pressure_scaled.max():.2e}]")
print(f"Water range:\t\t\t[{water.min():.2e}, {water.max():.2e}]")

## Transformer Classes

{class}`opinf.pre.SnapshotTransformer` bundles shifting and scaling transformations and their inverses.

In [None]:
st = opinf.pre.SnapshotTransformer(center=True, scaling="standard", verbose=True)
pressure_preprocessed = st.fit_transform(pressure)

In [None]:
st = opinf.pre.SnapshotTransformer(scaling="maxabssym", byrow=True, verbose=True).fit(pressure)

The constructor accepts arguments to set the type of shifting / scaling transformation.
The {meth}`opinf.pre.SnapshotTransformer.fit` and {meth}`opinf.pre.SnapshotTransformer.fit_transform` methods learn the particular transformation, while {meth}`opinf.pre.SnapshotTransformer.transform` and {meth}`opinf.pre.SnapshotTransformer.fit_transform` apply the learned transformation.
Finally, {meth}`opinf.pre.SnapshotTransformer.inverse_transform` applies the inverse of the learned transformation.
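To make the fit / transform / inverse pattern concrete, here is a minimal NumPy sketch. The class `ShiftScaleTransformer` is hypothetical and written only for this illustration; it is not part of `opinf`, and it hard-codes one choice (mean centering followed by a max-absolute scaling) rather than offering the options of {class}`opinf.pre.SnapshotTransformer`.

```python
import numpy as np


class ShiftScaleTransformer:
    """Hypothetical stand-in illustrating the fit / transform / inverse pattern."""

    def fit(self, states):
        # Learn the mean snapshot and a max-absolute scaling factor.
        self.mean_ = states.mean(axis=1, keepdims=True)
        self.alpha_ = np.abs(states - self.mean_).max()
        return self

    def transform(self, states):
        # Apply the learned shift and scaling.
        return (states - self.mean_) / self.alpha_

    def fit_transform(self, states):
        # Learn the transformation and apply it in one step.
        return self.fit(states).transform(states)

    def inverse_transform(self, states_transformed):
        # Undo the scaling, then undo the shift.
        return states_transformed * self.alpha_ + self.mean_


# Synthetic data with a large offset, loosely mimicking a pressure variable.
rng = np.random.default_rng(0)
states = 1e5 + 1e3 * rng.standard_normal((5, 20))

transformer = ShiftScaleTransformer()
processed = transformer.fit_transform(states)
recovered = transformer.inverse_transform(processed)
```

Keeping the learned quantities on the object is what makes the inverse available later, for example when mapping reduced-order model predictions back to the original variables.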

## Multivariable Data

For systems where the full state consists of several variables (pressure, velocity, temperature, etc.), it may not be appropriate to apply the same scaling to each variable.
The {class}`opinf.pre.SnapshotTransformerMulti` class handles multivariable data by constructing a separate {class}`opinf.pre.SnapshotTransformer` instance for each variable.
The constructor accepts the number of snapshot variables and the same parameters as the constructor of {class}`opinf.pre.SnapshotTransformer`.
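The idea of treating each variable separately can be sketched in plain NumPy, assuming the variables are stacked as equally sized row blocks (the same layout the earlier `np.split()` calls rely on). The two synthetic variables and their magnitudes below are invented for illustration; this is not the internal logic of {class}`opinf.pre.SnapshotTransformerMulti`.

```python
import numpy as np

# Hypothetical state with 2 variables, 3 spatial points each, 4 snapshots.
num_variables = 2
rng = np.random.default_rng(1)
pressure_like = 1e6 + 1e4 * rng.standard_normal((3, 4))
concentration_like = 1e-3 * rng.random((3, 4))
states = np.concatenate([pressure_like, concentration_like], axis=0)

# Split the rows into one block per variable.
blocks = np.split(states, num_variables, axis=0)

# Scale each block by its own largest absolute value.
scales = [np.abs(block).max() for block in blocks]
processed = np.concatenate(
    [block / scale for block, scale in zip(blocks, scales)],
    axis=0,
)

# After scaling, every variable has maximum absolute value 1.
for block in np.split(processed, num_variables, axis=0):
    print(np.abs(block).max())
```

A single global scaling would leave the concentration-like block many orders of magnitude smaller than the pressure-like block, which is the situation the per-variable treatment is meant to avoid.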

In [None]:
# Learn the variable transformation used in the paper.
stm = opinf.pre.SnapshotTransformerMulti(
    9,
    center=(True, False, False, True, False, False, False, False, False),
    scaling=(
        "maxabs",
        "maxabs",
        "maxabs",
        "maxabs",
        None,
        None,
        None,
        None,
        None,
    ),
    variable_names=["p", "vx", "vy", "T", "xi", "CH4", "O2", "CO2", "H2O"],
    verbose=True,
)

snapshots_preprocessed = stm.fit_transform(snapshots)

Choosing an advantageous preprocessing $\mathbf{s}(t) \mapsto \mathbf{q}(t)$ is highly problem-dependent, and the tools shown here are not the only ways to preprocess snapshot data. See, for example, {cite}`issan2022shiftedopinf` for an application of Operator Inference to solar wind streams in which preprocessing plays a vital role.