# Data structures: PCManifold and TSCDataFrame

This tutorial introduces the two data structures in `datafold.pcfold`:

* `PCManifold` - point clouds on a manifold  
* `TSCDataFrame` - time series collection  

Both data structures are used internally by other classes, but they can also be used on their own and are sometimes even required as the input format (e.g. for time series predictions). Because both classes derive from widely used scientific computing base classes (`numpy.ndarray` and `pandas.DataFrame`), handling them is straightforward and we can also refer to the existing documentation of these base classes.

In [None]:
import pandas as pd
import numpy as np
import scipy.linalg  # explicit submodule import; scipy.linalg.expm is used below
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3

# NOTE: make sure "path/to/datafold" is in sys.path or PYTHONPATH if not installed
from datafold.pcfold import PCManifold, TSCDataFrame 
from datafold.pcfold.kernels import GaussianKernel
from sklearn.datasets import make_swiss_roll

from scipy.sparse.linalg import lsqr

## 1. Point clouds on manifold (`PCManifold`)
`PCManifold` subclasses `numpy.ndarray` and therefore inherits a rich set of functionality. A technical requirement to reduce the more general scope of the base class is that the point cloud (data array) must be numeric and two-dimensional with samples in rows and features in columns. The other non-technical requirement is that the point cloud is sampled from a manifold (i.e. has some structure and is not completely random).   

To showcase some of the functionality, we first generate a point cloud dataset on a "swiss-roll manifold" using a dataset generator from scikit-learn. 

We create an instance of `PCManifold` by attaching

1. a kernel (`GaussianKernel`) that describes locality on the manifold
2. an optional `cut_off`, which restricts the "sphere of influence" of each point with respect to the metric (here the Euclidean distance in `GaussianKernel`). This promotes sparsity in kernel evaluations and allows us to scale to larger problems.
3. a distance backend to select the algorithm for computing a distance matrix with the specified metric in the kernel 

to the data.
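To make the roles of the kernel and the cut-off concrete, here is a minimal NumPy-only sketch. It does not use datafold. The helper `gaussian_kernel_with_cutoff`, the kernel form `exp(-d^2 / (2 * epsilon))` and the rule that entries with a distance above `cut_off` are set to zero are assumptions made for illustration, and may differ from what `GaussianKernel` and `PCManifold` do internally.

```python
import numpy as np


def gaussian_kernel_with_cutoff(X, epsilon, cut_off=None):
    """Hypothetical helper: dense Gaussian kernel matrix with optional cut-off."""
    # pairwise squared Euclidean distances via broadcasting
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    sq_dist = np.sum(diff**2, axis=-1)

    # assumed kernel form; the library's scaling convention may differ
    kernel = np.exp(-sq_dist / (2.0 * epsilon))

    if cut_off is not None:
        # points further apart than cut_off do not influence each other
        kernel[sq_dist > cut_off**2] = 0.0
    return kernel


rng = np.random.default_rng(0)
X_demo = rng.uniform(-10, 10, size=(200, 3))

K_full = gaussian_kernel_with_cutoff(X_demo, epsilon=4)
K_cut = gaussian_kernel_with_cutoff(X_demo, epsilon=4, cut_off=6)

fill = np.count_nonzero(K_cut) / K_cut.size
print(f"fraction of non-zero entries with cut-off: {fill:.1%}")
```

With a cut-off, most entries are exactly zero, so the matrix could be stored in a sparse format. This is the effect that makes larger point clouds tractable.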

In [None]:
X, color = make_swiss_roll(n_samples=2000)

pcm = PCManifold(X, kernel=GaussianKernel(epsilon=4), cut_off=6, dist_backend="guess_optimal")

fig = plt.figure(figsize=[7, 7])
ax = fig.add_subplot(1, 1, 1, projection="3d")
ax.scatter(*X.T, c=color, cmap=plt.cm.Spectral)
ax.set_title("Swiss roll: sampled manifold point cloud");

print(f"isinstance(pcm, np.ndarray)={isinstance(pcm, np.ndarray)}" )
pcm  # displays the data

### Showcase: Radial basis interpolation with targets=color

We can now use the `PCManifold` object to evaluate the kernel and compute the kernel matrix, which consists of locality information of data points. Kernel matrices are used in many algorithms with "manifold assumption".

We showcase this by building a radial basis function (RBF) interpolation that uses the extended functionality of a `PCManifold`. For simplicity, we take the (pseudo-)color values of the data generator as the function target values. 

In the first step we compute the pairwise kernel matrix. With the kernel matrix and the known target values we then compute the RBF weights using a sparse least squares solver.

In [None]:
# use PCManifold to evaluate specified kernel on point cloud
kernel_matrix = pcm.compute_kernel_matrix()

# compute RBF interpolation weights
weights = lsqr(kernel_matrix, color)[0]
color_rbf_centers = kernel_matrix @ weights

# plotting:
fig = plt.figure(figsize=(7, 7))

ax = fig.add_subplot(1, 1, 1, projection="3d")
ax.scatter(*X.T, c=color_rbf_centers, cmap=plt.cm.Spectral)
ax.set_title("RBF interpolation model evaluated at known points");

The computed weights allow us to interpolate out-of-sample points with the RBF model. For this, we generate more data on the swiss-roll manifold as an out-of-sample dataset and visually compare the interpolated values with the true color information. 

The existing `PCManifold` provides the reference points for the out-of-sample point cloud. So we evaluate the same kernel again, but instead of a pairwise (square) kernel matrix, we compute the kernel between the out-of-sample points and the points of the `PCManifold`. 
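The difference between the square (pairwise) kernel matrix and the rectangular out-of-sample kernel matrix can be sketched with plain NumPy on a one-dimensional toy problem. The helper `gaussian_kernel`, the kernel scaling and the target function `sin` are assumptions chosen for illustration; the sketch does not reproduce datafold's implementation.

```python
import numpy as np


def gaussian_kernel(X, Y, epsilon):
    """Hypothetical helper: kernel between rows of X and rows of Y, shape (len(X), len(Y))."""
    sq_dist = np.sum((X[:, np.newaxis, :] - Y[np.newaxis, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * epsilon))


# training data: samples of sin on a regular grid
X_train = np.linspace(0, 2 * np.pi, 40)[:, np.newaxis]
y_train = np.sin(X_train[:, 0])

# pairwise kernel matrix (square) and least squares fit of the RBF weights
K_train = gaussian_kernel(X_train, X_train, epsilon=0.5)
weights_demo, *_ = np.linalg.lstsq(K_train, y_train, rcond=None)

# out-of-sample kernel matrix (rectangular): new points in rows, training points in columns
X_new = np.linspace(0.5, 5.5, 100)[:, np.newaxis]
K_new = gaussian_kernel(X_new, X_train, epsilon=0.5)
y_new = K_new @ weights_demo

print(K_train.shape, K_new.shape)
```

Each row of the rectangular matrix weights the training points by their proximity to one new point, which is why the same `weights` vector can be reused for the out-of-sample evaluation.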

In [None]:
# create out-of-sample points
X_interp, true_color = make_swiss_roll(20000)

# interpolate points with RBF model
kernel_matrix_interp = pcm.compute_kernel_matrix(Y=X_interp)
color_rbf_interp = kernel_matrix_interp @ weights

# plotting:
fig = plt.figure(figsize=(16,9))

ax = fig.add_subplot(1, 2, 1, projection="3d")
ax.scatter(*X_interp.T, c=true_color, cmap=plt.cm.Spectral)
ax.set_title("True color values from swiss roll")

ax = fig.add_subplot(1, 2, 2, projection="3d")
ax.scatter(*X_interp.T, c=color_rbf_interp, cmap=plt.cm.Spectral)
ax.set_title("Interpolated color at interpolated points");

The showcase can be improved by 

* properly optimizing the kernel parameters (see e.g. `PCManifold.optimize_parameters()` or via cross validation)
* choosing another interpolation method (e.g. `GeometricHarmonicsInterpolator`), because with RBF interpolation the target values quickly decrease to zero in regions with low sampling density.

The key point is that `PCManifold` provides the ingredients to define a locality measure on manifolds via a kernel. We can simply evaluate a kernel matrix (dense or sparse) on the existing point cloud or with respect to a reference point cloud.    

Because `PCManifold` inherits from `numpy.ndarray`, we can apply NumPy functionality directly to it. For example, we can compute the singular value decomposition of a `PCManifold` with 

```
np.linalg.svd(pcm)
```
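The subclassing idea can be illustrated with a small stand-in class. `PointCloud` below is hypothetical and only mimics the "two-dimensional numeric array" requirement; it carries no kernel and is not how `PCManifold` is implemented.

```python
import numpy as np


class PointCloud(np.ndarray):
    """Hypothetical stand-in: a 2-D numeric array with samples in rows."""

    def __new__(cls, data):
        # view the data as the subclass, no copy of the array logic is needed
        obj = np.asarray(data, dtype=float).view(cls)
        if obj.ndim != 2:
            raise ValueError("point cloud must be two-dimensional")
        return obj


pc = PointCloud([[0.0, 1.0], [2.0, 3.0], [4.0, 8.0]])

# inherited NumPy functionality works directly on the subclass
centered = pc - pc.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

print(type(centered).__name__, centered.shape)
```

Arithmetic on the subclass keeps the subclass type, so attributes attached to the array can travel along with the data.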

## 2. Time series collection (`TSCDataFrame`)

This data structure adds time context to the collected data, specifically data coming from dynamical systems (either simulated or measured with a sensor). When learning a dynamical system from data, which is better known as ["system identification"](https://en.wikipedia.org/wiki/System_identification), algorithms also often assume that the system's phase space lies on a manifold. In contrast to `PCManifold`, which looks at unordered point clouds, we now want to account for the temporal ordering. We also account for the fact that one or many time series may be recorded, which need to be handled separately. These aspects also rule out some operations that are valid on a `PCManifold`. For example, if we subsample a time series without taking care of the time information, the desired property of evenly sampled time values is destroyed.  

To address the special handling of time series data, we introduce the data structure `TSCDataFrame`. It subclasses `pandas.DataFrame` and therefore inherits a different set of methods from another popular Python package. 

To showcase `TSCDataFrame` we define a simple linear system to generate (single) time series data as a `pandas.DataFrame`. Note that the data are the spatial positions and the index contains the time information. We can give the features (columns) names, in this case `x1` and `x2`.

In [None]:
def get_data(t, x0) -> pd.DataFrame:
    r"""Evaluate time series of randomly created linear system.
    
    Solves:
    
    .. code-block::
        
        dx/dt = A x
        
    where `A` is a random matrix. 
    
    Parameters
    ----------
    t 
        time values to evaluate
    
    x0
        initial state (2-dimensional)
    
    Returns
    -------
    pandas.DataFrame
        time series with shape `(n_time_values, 2)`
    """
    
    A = np.random.randn(2,2)
    
    expA = scipy.linalg.expm(A)
    # np.vstack replaces np.row_stack, which is deprecated in recent NumPy versions
    states = np.vstack([scipy.linalg.fractional_matrix_power(expA, ti) @ x0 for ti in t])
    
    return pd.DataFrame(data=np.real(states), index=t, columns=['x1','x2'])
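The data generator relies on the fact that the solution of `dx/dt = A x` is `x(t) = exp(A t) x0`, and that `exp(A t)` equals `exp(A)` raised to the power `t`. The following NumPy-only sketch checks this for integer time values. The matrix `A_demo`, the initial state `x0_demo` and the helper `flow_map` (which assumes a diagonalizable matrix) are chosen for illustration and are not part of the tutorial's code.

```python
import numpy as np

# fixed example system (damped rotation) instead of a random matrix
A_demo = np.array([[-0.1, 1.0], [-1.0, -0.1]])
x0_demo = np.array([1.0, 0.0])


def flow_map(A, t):
    """exp(A t) via eigendecomposition (assumes A is diagonalizable)."""
    eigvals, eigvecs = np.linalg.eig(A)
    exp_At = eigvecs @ np.diag(np.exp(eigvals * t)) @ np.linalg.inv(eigvecs)
    # imaginary parts cancel up to round-off for a real matrix A
    return np.real(exp_At)


t_demo = np.arange(0, 5)
exp_A = flow_map(A_demo, 1.0)

# variant 1: evaluate exp(A t) for every time value
states_direct = np.vstack([flow_map(A_demo, ti) @ x0_demo for ti in t_demo])

# variant 2: raise exp(A) to the power t (integer t only in this sketch)
states_power = np.vstack(
    [np.linalg.matrix_power(exp_A, int(ti)) @ x0_demo for ti in t_demo]
)

print(np.allclose(states_direct, states_power))
```

For this particular system the analytic solution is `x1(t) = exp(-0.1 t) cos(t)` and `x2(t) = -exp(-0.1 t) sin(t)`, which both variants reproduce.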

### Create a TSCDataFrame

Now that we have a way to generate individual time series, let us collect two of them into a time series collection (`TSCDataFrame`). 

In general, we can use the usual constructor `pandas.DataFrame(data, index, columns, **kwargs)`; however, the data must then already fulfill the frame format requirements of a `TSCDataFrame`. For easier instantiation, there are classmethods `TSCDataFrame.from_X` that allow construction in common situations.

Here, we use `TSCDataFrame.from_single_timeseries`, where we only need to insert a single `pandas.DataFrame(data, index=time, columns=feature_names)`. After the initial construction we can iteratively add new time series with `insert_ts()`.   
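To get a feeling for the frame format, the following sketch builds a collection of two time series with plain pandas. It assumes that the format is a two-level row index with the series ID as the first level and the time as the second level; the level names `"ID"` and `"time"` and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
time_values = np.arange(0, 5)

# two single time series: time in the index, features in the columns
ts_a = pd.DataFrame(rng.normal(size=(5, 2)), index=time_values, columns=["x1", "x2"])
ts_b = pd.DataFrame(rng.normal(size=(5, 2)), index=time_values, columns=["x1", "x2"])

# stack them and add the series ID as an outer index level
collection = pd.concat([ts_a, ts_b], keys=[0, 1], names=["ID", "time"])

print(collection.index.names)
print(collection.shape)
```

The classmethods take care of this bookkeeping and additionally validate the result, which the plain pandas frame above does not do.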

In [None]:
# create a single time series as pandas data frame with time as index
x0 = np.random.randn(2,)
x1 = np.random.randn(2,)
data1 = get_data(np.arange(0, 5), x0)
data2 = get_data(np.arange(0, 5), x1)

# convert it to a "time series collection" (TSC) data frame
tsc_regular = TSCDataFrame.from_single_timeseries(data1)
tsc_regular = tsc_regular.insert_ts(data2)


print('delta_time:', tsc_regular.delta_time)
print('n_timesteps:', tsc_regular.n_timesteps)
print('is_const_delta_time:', tsc_regular.is_const_delta_time())
print('is_equal_length:', tsc_regular.is_equal_length())
print('is_same_time_values:', tsc_regular.is_same_time_values())

tsc_regular

We now create another `TSCDataFrame` in which the time series do not share the same time values. This time we use `TSCDataFrame.from_frame_list`, which allows us to insert a list of single time series, each given as a data frame. 

We see that `delta_time` and `n_timesteps` can no longer give a single "global" value for the entire time series collection. Instead, the attributes list the value for each time series ID in a `pandas.Series`.
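What such per-series values look like can be sketched with plain pandas on a frame with an `(ID, time)` index. The helper names and the `groupby` approach are assumptions for illustration and do not reproduce how `TSCDataFrame` computes its attributes.

```python
import numpy as np
import pandas as pd


def make_series(time_values):
    """Hypothetical helper: a dummy time series with the given time values."""
    data = np.zeros((len(time_values), 2))
    return pd.DataFrame(data, index=time_values, columns=["x1", "x2"])


frame = pd.concat(
    [make_series(np.arange(0, 5)), make_series(np.arange(5, 10, 2))],
    keys=[0, 1],
    names=["ID", "time"],
)


def constant_step(t):
    """Return the sampling interval if it is constant, otherwise NaN."""
    steps = np.diff(t.to_numpy())
    return steps[0] if np.all(steps == steps[0]) else np.nan


time_index = frame.index.to_frame(index=False)  # columns "ID" and "time"
n_timesteps = time_index.groupby("ID")["time"].count()
delta_time = time_index.groupby("ID")["time"].agg(constant_step)

print(n_timesteps)
print(delta_time)
```

Both results are indexed by the time series ID, because the two series differ in length and in sampling interval.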

In [None]:
df1 = get_data(np.arange(0, 5), np.random.randn(2,))
df2 = get_data(np.arange(5, 10, 2), np.random.randn(2,))

tsc_irregular = TSCDataFrame.from_frame_list([df1, df2])

print('delta_time:', tsc_irregular.delta_time)
print('')
print('n_timesteps:', tsc_irregular.n_timesteps)
print('')
print('is_const_delta_time:', tsc_irregular.is_const_delta_time())
print('is_equal_length:', tsc_irregular.is_equal_length())
print('is_same_time_values:', tsc_irregular.is_same_time_values())

# print the time series. It now has two series in it, with IDs 0 and 1.
tsc_irregular

### Accessing time series data

Because `TSCDataFrame` is a subclass of `pandas.DataFrame`, many functions and attributes are inherited and can generally be used as on a normal `pandas.DataFrame`. However, there are things to consider:

* The `TSCDataFrame` type is kept as long as the result passes the validation of the frame format. 
* If a slice leads to an invalid `TSCDataFrame` (e.g. taking a single row, as a time series must have more than one point), then the general fallback type is `pandas.DataFrame`.
* Currently, there are inconsistencies with pandas, because there is no "`TSCSeries`". This is most notable for `.iloc` slicing, which returns a `pandas.Series` even if the result would be a valid `TSCDataFrame` (with one column). A simple type conversion `TSCDataFrame(slice_result)` is the current workaround.

The following examples access data using the constructed `tsc_regular` and `tsc_irregular`.

#### Access an individual coordinate with the times and IDs

Note that the type is now a `TSCDataFrame` and not a `pandas.Series`.

In [None]:
slice_result = tsc_regular["x1"]

print(type(slice_result))
slice_result

It is also always possible to convert the object to a `pandas.DataFrame` beforehand and access the data in the usual pandas way.   

In [None]:
slice_result = pd.DataFrame(tsc_regular)["x1"]

print(type(slice_result))
slice_result

The inconsistency with `.iloc` looks as follows:

In [None]:
slice_result = tsc_regular.iloc[:, 0]

print(type(slice_result))
slice_result

If we require the type to be a `TSCDataFrame`, we can work around this with

In [None]:
slice_result = TSCDataFrame(tsc_regular.iloc[:, 0])

print(type(slice_result))
slice_result

#### Access a single time series

A `TSCDataFrame` has a two-level index: the first level is the ID and the second level is the time. When we access a single ID, pandas drops the constant ID level. Because of this, a single accessed time series is no longer a valid `TSCDataFrame` and falls back to `pandas.DataFrame`.
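The dropped index level is standard pandas behavior and can be reproduced without datafold. The sketch below uses a plain `pandas.DataFrame` with an assumed `(ID, time)` index; it makes no claim about which type a `TSCDataFrame` returns for the list-based selection.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[0, 1], np.arange(5)], names=["ID", "time"])
frame = pd.DataFrame(
    np.arange(20.0).reshape(10, 2), index=index, columns=["x1", "x2"]
)

# selecting a single ID with a scalar drops the ID level
single = frame.loc[0]
print(single.index.name, single.index.nlevels)

# selecting with a list keeps both index levels
kept = frame.loc[[0]]
print(kept.index.names, kept.index.nlevels)
```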

In [None]:
slice_result = tsc_regular.loc[0]

print(type(slice_result))
slice_result

#### Select specific time values

The minimum length of a time series is two. If we access specific time values, and a time series has only one matching time value the return type will be a `pandas.DataFrame`. 

In general, the typical rules of accessing data from a frame hold. In the following example, not all requested time values have to exist in all time series (not even in *any*, as indicated with time value 99). A `KeyError` exception is only raised if *no* time value matches. 
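The selection rule described above can be mimicked with plain pandas. The helper `select_times` is hypothetical and only illustrates the rule "keep every row whose time value is requested, raise if nothing matches"; it is not the implementation of `select_time_values`.

```python
import numpy as np
import pandas as pd

# two series with different time values: ID 0 -> 0..4, ID 1 -> 5, 7, 9
tuples = [(0, t) for t in range(5)] + [(1, t) for t in (5, 7, 9)]
index = pd.MultiIndex.from_tuples(tuples, names=["ID", "time"])
frame = pd.DataFrame(np.zeros((8, 2)), index=index, columns=["x1", "x2"])


def select_times(df, time_values):
    """Hypothetical helper: keep rows whose time value is in time_values."""
    mask = df.index.get_level_values("time").isin(np.atleast_1d(time_values))
    if not mask.any():
        raise KeyError(f"no time value matches {time_values}")
    return df.loc[mask]


selected = select_times(frame, [3, 4, 5, 7, 99])
print(list(selected.index))
```

The requested value 99 is silently ignored because other values match, whereas `select_times(frame, [99])` raises a `KeyError`.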

In [None]:
slice_result = tsc_irregular.select_time_values([3, 4, 5, 7, 99])
print(type(slice_result))
slice_result

In [None]:
slice_result = tsc_irregular.select_time_values(1)
print(type(slice_result))
slice_result

#### Extracting initial states

Initial states are required for a dynamic model to make predictions and evolve the system forward in time. An initial condition can be either a single state (typed as a `pandas.DataFrame`) or a time series itself (e.g. the current state and the past samples). Extracting initial states works with the usual pandas slicing. Here we take the first sample of each time series:

In [None]:
slice_result = tsc_regular.groupby('ID').head(1)
print(type(slice_result))
slice_result

... but there is also a convenience function that improves the readability of the code:

In [None]:
slice_result = tsc_regular.initial_states(1)
print(type(slice_result))
slice_result

We can also take the first two samples of each time series (note that the time values of the two series do not match).

In [None]:
slice_result = tsc_irregular.initial_states(2)
print(type(slice_result))
slice_result

There is also an extra class `InitialCondition` that provides methods and validation for initial conditions. 

For example, we want to address different situations:

* If time series share the same time values, we want to group them together. We can then give a model a set of initial conditions and evaluate each of them for the same time values. 

* If time series have different time values, we want to treat them separately and make separate model requests.
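The grouping logic of the two cases can be sketched with plain pandas: series IDs that share exactly the same time values end up in one group, all others are kept separate. The helper names are hypothetical and the sketch only illustrates the idea; it is not the implementation used by `InitialCondition`.

```python
import numpy as np
import pandas as pd


def make_collection(list_of_time_values):
    """Hypothetical helper: dummy collection with an (ID, time) index."""
    frames = [
        pd.DataFrame(np.zeros((len(t), 2)), index=t, columns=["x1", "x2"])
        for t in list_of_time_values
    ]
    return pd.concat(frames, keys=range(len(frames)), names=["ID", "time"])


def group_ids_by_time_values(df):
    """Map each distinct tuple of time values to the IDs that share it."""
    groups = {}
    for ts_id, series in df.groupby(level="ID"):
        key = tuple(series.index.get_level_values("time"))
        groups.setdefault(key, []).append(ts_id)
    return groups


regular = make_collection([np.arange(0, 5), np.arange(0, 5)])
irregular = make_collection([np.arange(0, 5), np.arange(5, 10, 2)])

groups_regular = group_ids_by_time_values(regular)
groups_irregular = group_ids_by_time_values(irregular)

print(list(groups_regular.values()))
print(list(groups_irregular.values()))
```

In the regular case both IDs fall into a single group, so one model request can evaluate both initial conditions; in the irregular case each ID forms its own group.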

This functionality is very useful if we want to reconstruct time series data with a model. For this we can create an iterator with the `InitialCondition.iter_reconstruct_ic` method.

Note that `InitialCondition.validate(ic)` can be used to check if the initial condition is valid.

In [None]:
from datafold.pcfold import InitialCondition

print("REGULAR CASE (groups time series together)")
print("------------------------------------------\n")

for ic, time_values in InitialCondition.iter_reconstruct_ic(tsc_regular):
    print(f"Initial condition \n")
    print(ic)
    assert InitialCondition.validate(ic)
    print(f"with corresponding time values {time_values}")

    
print("\n\n==========================================================================\n\n")
print("IRREGULAR CASE (separates initial conditions):")
print("----------------------------------------------")

for ic, time_values in InitialCondition.iter_reconstruct_ic(tsc_irregular):
    print(f"Initial condition \n")
    print(ic)
    assert InitialCondition.validate(ic)
    print(f"with corresponding time values {time_values}\n\n")

## 3. Plotting time series data

`TSCDataFrame` provides basic plotting functionality: 

In [None]:
tsc_regular.plot(figsize=(7,7))

The iterator `TSCDataFrame.itertimeseries` allows us to create separate plots for each time series. 

In [None]:
f, ax = plt.subplots(1, len(tsc_regular.ids),figsize=(15,7),sharey=True)

for _id, time_series in tsc_regular.itertimeseries():
    ts_axis = time_series.plot(ax=ax[_id])
    ts_axis.set_title(f'time series ID={_id}')
    if _id == 0:
        ts_axis.set_ylabel('quantity of interest')