# Data structures: PCManifold and TSCDataFrame

This tutorial introduces the two **datafold** data structures (located in the package `pcfold`):  

* `PCManifold` - point clouds on a manifold  
* `TSCDataFrame` - time series collection  

Both data structures are used internally in models and algorithms, but can also be used on their own. `TSCDataFrame` is also the required input type for models that are built from time series data. Because both classes derive from base classes that are widely used in scientific Python (`numpy.ndarray` and `pandas.DataFrame`), handling them is straightforward. We therefore refer to the documentation of the original packages and only highlight the contexts in which the two data structures are useful.  

In [None]:
import pandas as pd
import numpy as np
import scipy.linalg  # makes scipy.linalg.expm and fractional_matrix_power available
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3

from sklearn.datasets import make_swiss_roll
from scipy.sparse.linalg import lsqr


# NOTE: make sure "path/to/datafold" is in sys.path or PYTHONPATH if not installed
from datafold.pcfold import PCManifold, TSCDataFrame 
from datafold.pcfold.kernels import GaussianKernel

## 1. Point cloud on manifold (`PCManifold`)
`PCManifold` subclasses `numpy.ndarray` and therefore inherits the rich functionality of the quasi-standard representation of numerical data in Python. `PCManifold` restricts the general-purpose base class to a specific case:

* A technical requirement is that the point cloud must be numeric (i.e. no `object`, `str` etc.) and two-dimensional with samples in rows and features in columns. 
* A non-technical requirement is that the point cloud is (believed to be) sampled from a manifold. This means it has some geometrical structure and is not completely random.

To showcase some of the functionality, we first generate a dataset on a "swiss-roll manifold" using a data generator from scikit-learn. Once we have the swiss-roll point cloud, we create an instance of `PCManifold`, which attaches the following as new attributes to the array: 

1. A kernel (here `GaussianKernel`) that describes the locality between points. 
2. An (optional) `cut_off` distance value that sets a threshold: all kernel values are set to zero if the corresponding distance exceeds the cut-off. The parameter allows us to promote sparsity (and scale to larger problems) by restricting the "sphere of influence" with respect to the metric (here the Euclidean distance in `GaussianKernel`).
3. A distance backend to select an algorithm for computing the distance matrix with the specified metric in the kernel. The distance backend has to support the metric that is required in the kernel. 
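
To make the effect of the cut-off concrete, the following is a minimal NumPy-only sketch that does not use datafold. It assumes a Gaussian kernel of the form `exp(-d**2 / (2 * epsilon))` with Euclidean distance `d`; the exact scaling in `GaussianKernel` may differ, and the function name `gaussian_kernel_with_cut_off` is purely illustrative.

```python
import numpy as np


def gaussian_kernel_with_cut_off(points, epsilon, cut_off):
    """Dense Gaussian kernel matrix with entries zeroed beyond ``cut_off``.

    Assumes the form exp(-d^2 / (2 * epsilon)) with Euclidean distance d;
    the scaling used by datafold's GaussianKernel may differ.
    """
    sq_norms = np.sum(points**2, axis=1)
    # pairwise squared Euclidean distances via the binomial expansion
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negative round-off values
    kernel = np.exp(-sq_dist / (2.0 * epsilon))
    kernel[sq_dist > cut_off**2] = 0.0  # restrict the "sphere of influence"
    return kernel


rng = np.random.default_rng(0)
demo_points = rng.normal(scale=5.0, size=(200, 3))
demo_kernel = gaussian_kernel_with_cut_off(demo_points, epsilon=4.0, cut_off=6.0)
zero_fraction = np.mean(demo_kernel == 0.0)
print(f"fraction of zero entries: {zero_fraction:.2f}")
```

The zeroed entries are what make a sparse matrix representation worthwhile. A dense matrix is used here only to keep the sketch short.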

In [None]:
X, color = make_swiss_roll(n_samples=2000)

pcm = PCManifold(X, kernel=GaussianKernel(epsilon=4), cut_off=6, dist_backend="guess_optimal")

# plot the swiss roll dataset
fig = plt.figure(figsize=[7, 7])
ax = fig.add_subplot(1, 1, 1, projection="3d")
ax.scatter(*X.T, c=color, cmap=plt.cm.Spectral)
ax.set_title("Swiss roll: sampled manifold point cloud");

print(f"isinstance(pcm, np.ndarray)={isinstance(pcm, np.ndarray)}" )
pcm  # displays the data

### Showcase: Radial basis interpolation of swiss-roll with color as function target 

We can now use the `PCManifold` object to evaluate the attached kernel and compute the kernel matrix for the actual point cloud. Kernel matrices are used in many algorithms with a "manifold assumption", because the kernel describes the local information of a point with respect to its neighborhood. We showcase this by creating a radial basis function (RBF) interpolation that uses the (extended) functionality of `PCManifold`. For simplicity we take the (pseudo-)color values of the swiss-roll data generator as the function target values that we want to interpolate. 

In the first step we compute the pairwise kernel matrix. With the kernel matrix and the known target values we compute the RBF weights by using a sparse least squares solver from the scipy package.

In [None]:
# use PCManifold to evaluate specified kernel on point cloud
kernel_matrix = pcm.compute_kernel_matrix()  # returns a scipy.sparse.csr_matrix

# compute RBF interpolation weights
weights = lsqr(kernel_matrix, color)[0]
color_rbf_centers = kernel_matrix @ weights

# plotting:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(1, 1, 1, projection="3d")
ax.scatter(*X.T, c=color_rbf_centers, cmap=plt.cm.Spectral)
ax.set_title("RBF interpolation at training points");

The computed weights allow us to interpolate out-of-sample points with the RBF model. To actually interpolate points we generate a new set of points on the swiss-roll manifold, interpolate the color values and (visually) compare it with the true color information.  

The out-of-sample point cloud is now a reference point cloud for the existing `PCManifold`. This means we now compute the kernel matrix component-wise, i.e. between the new points and the points in `pcm`. Because we treat the points independently for interpolation, we do not need to make the new point cloud a `PCManifold`. 
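
The two steps (fit the weights on the training points, then evaluate a kernel between new points and training points) can be sketched on a one-dimensional toy problem. This sketch uses plain NumPy instead of datafold, a dense least squares solver instead of `lsqr`, and an assumed kernel form `exp(-d**2 / (2 * epsilon))`. All names prefixed with `toy_` are illustrative only.

```python
import numpy as np


def toy_gaussian_kernel(points_a, points_b, epsilon):
    """Kernel matrix of shape (len(points_a), len(points_b)) for 1-dim. points."""
    sq_dist = (points_a[:, None] - points_b[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * epsilon))


toy_centers = np.linspace(0.0, 2.0 * np.pi, 40)  # training points
toy_targets = np.sin(toy_centers)  # known function values

# fit: solve K w = f in the least squares sense
toy_kernel_train = toy_gaussian_kernel(toy_centers, toy_centers, epsilon=0.1)
toy_weights, *_ = np.linalg.lstsq(toy_kernel_train, toy_targets, rcond=None)

# evaluate: one kernel row per new point, one column per training point
toy_new_points = np.linspace(0.5, 5.5, 101)
toy_kernel_new = toy_gaussian_kernel(toy_new_points, toy_centers, epsilon=0.1)
toy_interpolated = toy_kernel_new @ toy_weights

toy_max_error = np.max(np.abs(toy_interpolated - np.sin(toy_new_points)))
print(toy_kernel_new.shape, toy_max_error)
```

The out-of-sample kernel matrix is rectangular, so the same weight vector can be applied to any number of new points.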

In [None]:
# create many out-of-sample points
X_interp, true_color = make_swiss_roll(20000)

# interpolate points with RBF model
kernel_matrix_interp = pcm.compute_kernel_matrix(Y=X_interp)  # component wise if Y is not None
color_rbf_interp = kernel_matrix_interp @ weights

# plotting:
fig = plt.figure(figsize=(16,9))

ax = fig.add_subplot(1, 2, 1, projection="3d")
ax.scatter(*X_interp.T, c=true_color, cmap=plt.cm.Spectral)
ax.set_title("True color values from swiss roll")

ax = fig.add_subplot(1, 2, 2, projection="3d")
ax.scatter(*X_interp.T, c=color_rbf_interp, cmap=plt.cm.Spectral)
ax.set_title("Interpolated color at interpolated points");

### Summary

In effectively 4 lines of code we created an RBF interpolation by using the `PCManifold` object. We can now easily exchange a kernel, compute a kernel matrix with varying degree of sparsity, and choose a distance algorithm (usually the computationally most expensive part). The data structure makes kernel based algorithms much easier to write and improves code readability.

The RBF showcase can be improved by 

* properly optimizing the kernel parameters (see e.g. `PCManifold.optimize_parameters()`, or use cross validation)
* choosing another interpolation method (e.g. `GeometricHarmonicsInterpolator`), because for RBF interpolation the target values quickly decrease to zero in regions with low sampling density.
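
The decay to zero away from the samples follows directly from the shape of the Gaussian kernel. As a small illustration with hypothetical centers and weights (NumPy only, assumed kernel form `exp(-d**2 / (2 * epsilon))`):

```python
import numpy as np

toy_rbf_centers = np.array([0.0, 0.5, 1.0])  # hypothetical training points
toy_rbf_weights = np.ones(3)  # hypothetical RBF weights
toy_epsilon = 0.1  # kernel scale


def toy_rbf_model(x):
    """Evaluate the weighted sum of Gaussian kernels at a scalar position x."""
    kernel_row = np.exp(-((x - toy_rbf_centers) ** 2) / (2.0 * toy_epsilon))
    return float(kernel_row @ toy_rbf_weights)


inside_value = toy_rbf_model(0.5)  # within the sampled region
far_value = toy_rbf_model(5.0)  # far away from all training points
print(inside_value, far_value)
```

Every kernel term vanishes far from the centers, so the model output tends to zero there regardless of the true function value.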

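Because `PCManifold` inherits from `numpy.ndarray`, we can apply NumPy's functionality directly to it. For example, we can compute a singular value decomposition of a `PCManifold` (note that `np.linalg.eig` requires a square matrix, which the point cloud of shape `(2000, 3)` is not) with 

```
np.linalg.svd(pcm, full_matrices=False)
```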

## 2. Time series collection (`TSCDataFrame`)

`TSCDataFrame` adds time context to data sampled from dynamical systems (e.g. simulated from an ODE/PDE system or measured with a sensor). Data-driven models that aim to learn a dynamical system from data (also known as ["system identification"](https://en.wikipedia.org/wiki/System_identification)) often assume, as with `PCManifold`, that the system's phase space lies on a manifold. However, in contrast to an unordered point cloud, time series data have an inherent temporal order and may, moreover, come from different time series with different initial conditions. These "time issues" often require separate handling. For example, if we randomly subsample a time series without taking the time values into account, the desired property of evenly sampled time values is destroyed.  
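
The subsampling issue can be seen with a few lines of NumPy (a sketch independent of datafold):

```python
import numpy as np

time_values = np.arange(0.0, 10.0, 0.5)  # evenly sampled, time step 0.5
rng = np.random.default_rng(1)

# random subsample that ignores the time structure
random_idx = np.sort(rng.choice(time_values.size, size=8, replace=False))
random_deltas = np.diff(time_values[random_idx])

# subsample that keeps every second sample
regular_deltas = np.diff(time_values[::2])

print("random subsample time steps: ", random_deltas)
print("regular subsample time steps:", regular_deltas)
```

Taking every second sample preserves a constant time step, whereas the time steps of a random subsample generally vary.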

To address the special handling of time series collection data, we introduce the data structure `TSCDataFrame`. It subclasses `pandas.DataFrame` and therefore inherits rich functionality from another popular Python package. 


To showcase `TSCDataFrame` we define a simple linear system to generate (single) time series data as a `pandas.DataFrame`. Note that the data are the spatial positions and the index contains the time information. We can give the features (columns) names, in this case `x1` and `x2`.

In [None]:
np.random.seed(1)
def get_data(t, x0) -> pd.DataFrame:
    r"""Evaluate time series of randomly created linear system.
    
    Solves:
    
    .. code-block::
        
        dx/dt = A x
        
    where `A` is a random matrix. 
    
    Parameters
    ----------
    t 
        time values to evaluate
    
    x0
        initial state (2-dimensional)
    
    Returns
    -------
    pandas.DataFrame
        time series with shape `(n_time_values, 2)`
    """
    
    A = np.random.randn(2,2)
    
    expA = scipy.linalg.expm(A)
    # np.vstack replaces np.row_stack, which is deprecated in recent NumPy versions
    states = np.vstack([scipy.linalg.fractional_matrix_power(expA, ti) @ x0 for ti in t])
    
    return pd.DataFrame(data=np.real(states), index=t, columns=['x1','x2'])

### Create a TSCDataFrame

Now that we have a way to generate individual time series, let us collect two of them into a `TSCDataFrame`. 

In general, we can create a new instance of `TSCDataFrame` in the same way as its superclass `pandas.DataFrame`:

```
TSCDataFrame(data, index, columns, **kwargs)
```

However, at this stage the requirements on the frame format of a `TSCDataFrame` must be fulfilled already, which are:

* Two index levels, where the first level is the time series ID and the second level contains the time values.
* One column level to index the features.
* The time series IDs must be non-negative integers (the examples in this tutorial use the IDs 0 and 1), and the time values must be numeric. 
* Each time series must have at least two time values.
* No duplicated indices are allowed (neither in rows nor in columns).

Note that the data orientation is the same as in `PCManifold` (samples in rows, features in columns). 
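
To illustrate the format, the following sketch builds a plain `pandas.DataFrame` with such a two-level index and checks the listed requirements by hand. It does not use datafold: the helper `looks_like_valid_collection` is hypothetical and not the validation that `TSCDataFrame` performs internally. The level name `"time"` is an assumption, and IDs are taken as non-negative because this tutorial uses ID 0.

```python
import numpy as np
import pandas as pd

# two time series (IDs 0 and 1) with three time values each
toy_index = pd.MultiIndex.from_product([[0, 1], [0.0, 0.1, 0.2]], names=["ID", "time"])
toy_frame = pd.DataFrame(
    np.arange(12.0).reshape(6, 2), index=toy_index, columns=["x1", "x2"]
)


def looks_like_valid_collection(df):
    """Check the listed format requirements on a plain pandas.DataFrame."""
    if df.index.nlevels != 2 or df.columns.nlevels != 1:
        return False
    ids = df.index.get_level_values(0)
    times = df.index.get_level_values(1)
    return bool(
        pd.api.types.is_integer_dtype(ids)
        and (ids >= 0).all()
        and pd.api.types.is_numeric_dtype(times)
        and not df.index.duplicated().any()
        and not df.columns.duplicated().any()
        and (df.groupby(level=0).size() >= 2).all()  # at least two samples each
    )


print(looks_like_valid_collection(toy_frame))
print(looks_like_valid_collection(toy_frame.iloc[[0, 3]]))  # one sample per series
```

A frame that keeps only a single sample per time series fails the check, which mirrors the requirement of at least two time values.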

For easier instantiation, there are also class methods named `TSCDataFrame.from_X`, where `X` stands for the kind of input (for example `from_single_timeseries` or `from_frame_list`).  

Here, we use `TSCDataFrame.from_single_timeseries`, where we only need to pass a single `pandas.DataFrame(data, index=time, columns=feature_names)`. After the initial construction we can iteratively add new time series with `tsc.insert_ts()`.   

In [None]:
# create a single time series as pandas data frame with time as index
x0 = np.random.randn(2,)
x1 = np.random.randn(2,)
data1 = get_data(np.arange(0, 5), x0)
data2 = get_data(np.arange(0, 5), x1)

# convert it to a "time series collection" (TSC) data frame
tsc_regular = TSCDataFrame.from_single_timeseries(data1)
tsc_regular = tsc_regular.insert_ts(data2)  # a loop could insert more time series here


print('delta_time:', tsc_regular.delta_time)
print('n_timesteps:', tsc_regular.n_timesteps)
print('is_const_delta_time:', tsc_regular.is_const_delta_time())
print('is_equal_length:', tsc_regular.is_equal_length())
print('is_same_time_values:', tsc_regular.is_same_time_values())

tsc_regular

We now create a second `TSCDataFrame`, in which the time series do not share the same time values. For instantiation we use `TSCDataFrame.from_frame_list`, which accepts a list of single time series (each a `pandas.DataFrame`). 

We see that `delta_time` and `n_timesteps` can no longer give a "global" value for the entire time series collection. Instead, the attributes list the value for each time series and are of type `pandas.Series`.
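
Why a per-series value is needed can be reproduced with plain pandas. The sketch below does not call datafold: it mimics the two time series described above (time values 0 to 4 with step 1, and 5, 7, 9 with step 2) and computes the number of time steps and the mean time step per ID. The level name `"time"` is an assumption.

```python
import numpy as np
import pandas as pd

toy_irregular_index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 7), (1, 9)],
    names=["ID", "time"],
)

# time values as a Series that carries the (ID, time) index for grouping
toy_times = pd.Series(
    toy_irregular_index.get_level_values("time").to_numpy(), index=toy_irregular_index
)

per_series_steps = toy_times.groupby(level="ID").size()
per_series_delta = toy_times.groupby(level="ID").apply(
    lambda t: float(np.diff(t.to_numpy()).mean())
)

print(per_series_steps)
print(per_series_delta)
```

Both results are indexed by the time series ID, because no single number describes the whole collection.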

In [None]:
df1 = get_data(np.arange(0, 5), np.random.randn(2,))
df2 = get_data(np.arange(5, 10, 2), np.random.randn(2,))

tsc_irregular = TSCDataFrame.from_frame_list([df1, df2])

print('delta_time:', tsc_irregular.delta_time)
print('')
print('n_timesteps:', tsc_irregular.n_timesteps)
print('')
print('is_const_delta_time:', tsc_irregular.is_const_delta_time())
print('is_equal_length:', tsc_irregular.is_equal_length())
print('is_same_time_values:', tsc_irregular.is_same_time_values())

# print the time series. It now has two series in it, with IDs 0 and 1.
tsc_irregular

### Accessing data

Because `TSCDataFrame` is a `pandas.DataFrame` most of the data access and functions work in the same way. However, there are a few things to consider:

* The `TSCDataFrame` type is kept as long as the accessed data slice is still valid (i.e. fulfills the format requirements). This is also true if the sliced data would actually be a `Series` (but note the last point).
* If a slice leads to an invalid `TSCDataFrame`, the general fallback type is `pandas.DataFrame` or `pandas.Series` (e.g. accessing a single row leads to an invalid time series collection, as more than one sample is required).
* There are currently inconsistencies with pandas, because there is no "`TSCSeries`". This is most notable for `.iloc` slicing, which returns a `pandas.Series` even if the result would be a valid `TSCDataFrame` (with one column). A simple type conversion `TSCDataFrame(slice_result)` is the current workaround.

In the following we look at some examples to slice data from the constructed `tsc_regular` and `tsc_irregular`.

#### Access an individual feature from the collection

Note that the type of the result is `TSCDataFrame` and not `pandas.Series`.

In [None]:
slice_result = tsc_regular["x1"]

print(type(slice_result))
slice_result

It is always possible to convert the object to a `pandas.DataFrame` beforehand and then access the data in the usual pandas way.   

In [None]:
slice_result = pd.DataFrame(tsc_regular)["x1"]

print(type(slice_result))
slice_result

The inconsistency with `.iloc` manifests as follows:

In [None]:
slice_result = tsc_regular.iloc[:, 0]  # access the 0-th column

print(type(slice_result))
slice_result

If we want to recover the `TSCDataFrame` type, we can use the following workaround:

In [None]:
slice_result = TSCDataFrame(tsc_regular.iloc[:, 0])

print(type(slice_result))
slice_result

#### Access a single time series

A `TSCDataFrame` has a two-level index: the first level is the ID and the second is the time. When we access a single ID, the now constant ID level is dropped. This means the returned slice is no longer a valid `TSCDataFrame` and the type falls back to `pandas.DataFrame`.
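
Dropping the index level is standard pandas behavior for a two-level index, as the following plain pandas sketch shows (no datafold involved; the level name `"time"` is an assumption). How `TSCDataFrame` treats the list-based selection is not covered by this sketch.

```python
import numpy as np
import pandas as pd

toy_two_level = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["ID", "time"])
toy_collection = pd.DataFrame(
    np.arange(12.0).reshape(6, 2), index=toy_two_level, columns=["x1", "x2"]
)

single_series = toy_collection.loc[0]  # scalar key: the "ID" level is dropped
kept_levels = toy_collection.loc[[0]]  # list key: both index levels are kept

print(single_series.index.names, kept_levels.index.names)
```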

In [None]:
slice_result = tsc_regular.loc[0]

print(type(slice_result))
slice_result

#### Select specific time values

The minimum length of a time series is two. If we request specific time values and every time series has at least two matching samples, the type remains `TSCDataFrame`.   

Also note that the inherited rules for accessing data from a pandas DataFrame hold. In the following example, not every requested time value has to exist in every time series (time value 99 does not exist in *any* of them). A `KeyError` exception is only raised if *no* time value matches. 

In [None]:
slice_result = tsc_irregular.select_time_values([3, 4, 5, 7, 99])
print(type(slice_result))
slice_result

In [None]:
slice_result = tsc_irregular.select_time_values(1)
print(type(slice_result))
slice_result

#### Extracting initial states

Initial states are required for a dynamic model to make predictions and evolve the system forward in time. An initial condition can be a single state (typed as a `pandas.DataFrame`) or a time series itself (e.g. the current state together with past samples). Extracting initial states works with the usual pandas slicing. Here we use the `groupby` function to take the first sample of each time series:

In [None]:
slice_result = tsc_regular.groupby('ID').head(1)
print(type(slice_result))
slice_result

But there is also a convenience function which improves code readability:

In [None]:
slice_result = tsc_regular.initial_states()
print(type(slice_result))
slice_result

We can also take the first two samples of each time series (note that the time values differ between the series):

In [None]:
slice_result = tsc_irregular.initial_states(2)
print(type(slice_result))
slice_result

There is also a dedicated class `InitialCondition` that provides methods and validation for initial conditions. 

For example, we want to address different situations:

* In the case where time series have the same time values, we want to group and evaluate these initial conditions for these time values together.

* If time series have different time values, we want to treat them separately and make separate predictions with the model.
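
The grouping idea behind these two cases can be sketched in plain pandas. This is an illustration of the concept only, not the implementation of `InitialCondition`. The level name `"time"` and all variable names are assumptions.

```python
import numpy as np
import pandas as pd

toy_ic_index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 5), (2, 7)], names=["ID", "time"]
)
toy_ic_frame = pd.DataFrame(
    np.arange(12.0).reshape(6, 2), index=toy_ic_index, columns=["x1", "x2"]
)

# map each tuple of time values to the IDs of the series that share it
time_groups = {}
for series_id, series in toy_ic_frame.groupby(level="ID"):
    time_key = tuple(series.index.get_level_values("time"))
    time_groups.setdefault(time_key, []).append(series_id)

for time_key, ids in time_groups.items():
    # first sample of every series in the group serves as its initial state
    group_initial_states = toy_ic_frame.loc[ids].groupby(level="ID").head(1)
    print(f"time values {time_key}: IDs {ids}")
    print(group_initial_states)
```

Series 0 and 1 share their time values and end up in one group, while series 2 forms a group of its own.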

This functionality is very useful, for example, when we want to reconstruct time series data with a model. For this we use the iterator method `InitialCondition.iter_reconstruct_ic`:

(Note that `InitialCondition.validate(ic)` can be used to check if the initial condition is valid)

In [None]:
from datafold.pcfold import InitialCondition

print("REGULAR CASE (groups time series together)")
print("------------------------------------------\n")

for ic, time_values in InitialCondition.iter_reconstruct_ic(tsc_regular):
    print(f"Initial condition \n")
    print(ic)
    assert InitialCondition.validate(ic)
    print(f"with corresponding time values {time_values}")

    
print("\n\n==========================================================================\n\n")
print("IRREGULAR CASE (separates initial conditions):")
print("----------------------------------------------")

for ic, time_values in InitialCondition.iter_reconstruct_ic(tsc_irregular):
    print(f"Initial condition \n")
    print(ic)
    assert InitialCondition.validate(ic)
    print(f"with corresponding time values {time_values}\n\n")

### Plot time series data

`TSCDataFrame` provides basic plotting facility: 

In [None]:
tsc_regular.plot(figsize=(7,7))

We can also use the iterator `TSCDataFrame.itertimeseries`, which allows us to access the time series separately and create a plot for each of them. 

In [None]:
f, ax = plt.subplots(1, len(tsc_regular.ids),figsize=(15,7),sharey=True)

for _id, time_series in tsc_regular.itertimeseries():
    ts_axis = time_series.plot(ax=ax[_id])
    ts_axis.set_title(f'time series ID={_id}')
    if _id == 0:
        ts_axis.set_ylabel('quantity of interest')