# Data structure for time series collections `TSCDataFrame` 

In this tutorial, we will introduce the `TSCDataFrame` data structure, which is designed to handle time series collection data. This is particularly useful when dealing with datasets that contain multiple time series. For instance, a system might have been sampled multiple times with different initial conditions, or there could be missing samples.

The `TSCDataFrame` is primarily used for input/output specification of data-driven methods in the *datafold* package, and its implementation can be found in the `pcfold` sub-package.

`TSCDataFrame` is a subclass of the popular DataFrame data structure from the pandas project. This means that it inherits all the rich functionality of `pandas`, making it easy for users familiar with `pandas` to specify their collected data. However, `TSCDataFrame` restricts the more generic `DataFrame` to a structure with certain guarantees to organize the time series collection.

The key advantage of using `TSCDataFrame` is that it makes it easy to work with collections of time series data. Users can easily manipulate and analyze their data using the powerful tools provided by *pandas*. For those who are new to working with `DataFrame`s, we refer to the main *pandas* [documentation](https://pandas.pydata.org/) for a broader introduction. In this tutorial, we will focus on the specific context in which `TSCDataFrame` is useful.

In [None]:
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3
import numpy as np
import pandas as pd
import scipy
from scipy.sparse.linalg import lsqr
from sklearn.datasets import make_swiss_roll

from datafold import TSCDataFrame

To showcase the `TSCDataFrame`, we first define a simple two-dimensional linear system and generate two separate time series. At this point, we keep each time series in its own pandas `DataFrame`. Like `TSCDataFrame` (see below), the data is oriented such that the columns contain spatial information (states) -- `x1` and `x2` -- and the row index contains time information.

In [None]:
def generate_data(t, x0) -> pd.DataFrame:
    r"""Evaluate time series of randomly created linear system.

    Solves:

    .. code-block::

        dx/dt = A x

    where `A` is a random matrix.

    Parameters
    ----------
    t
        time values to evaluate

    x0
        initial state (2-dimensional)

    Returns
    -------
    pandas.DataFrame
        time series with shape `(n_time_values, 2)`
    """

    A = np.random.default_rng(1).standard_normal(size=(2, 2))

    expA = scipy.linalg.expm(A)
    states = np.vstack(  # np.row_stack is a deprecated alias of np.vstack
        [scipy.linalg.fractional_matrix_power(expA, ti) @ x0 for ti in t]
    )

    return pd.DataFrame(data=np.real(states), index=t, columns=["x1", "x2"])


# create two time series as pandas data frames with time as index
x0 = np.random.randn(2)
x1 = np.random.randn(2)
time_series1 = generate_data(np.arange(0, 5), x0)
time_series2 = generate_data(np.arange(0, 5), x1)
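
The function above evaluates the closed-form solution $x(t) = \exp(A t)\, x_0$ by raising $\exp(A)$ to the power $t$. The following NumPy-only sketch checks this identity for integer time values. The helper `expm_via_eig` and the chosen initial state are assumptions made for illustration (the helper assumes that `A` is diagonalizable); they are not part of *datafold*.

```python
import numpy as np


def expm_via_eig(A: np.ndarray, t: float) -> np.ndarray:
    """Compute exp(A t) via eigendecomposition (hypothetical helper, assumes A is diagonalizable)."""
    eigvals, eigvecs = np.linalg.eig(A)
    # exp(A t) is real for real A; .real only discards numerical noise
    return (eigvecs @ np.diag(np.exp(eigvals * t)) @ np.linalg.inv(eigvecs)).real


A = np.random.default_rng(1).standard_normal(size=(2, 2))
x0 = np.array([1.0, -0.5])  # illustrative initial state

expA = expm_via_eig(A, 1.0)

# direct evaluation of x(t) = exp(A t) x0
states_direct = np.vstack([expm_via_eig(A, t) @ x0 for t in range(5)])
# evaluation via integer powers of exp(A), as in generate_data
states_power = np.vstack([np.linalg.matrix_power(expA, t) @ x0 for t in range(5)])

max_abs_diff = np.max(np.abs(states_direct - states_power))
print("maximum absolute difference:", max_abs_diff)
```

Both evaluations should agree up to floating point error, and the first row equals the initial state because $\exp(0) = I$.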

### Create a `TSCDataFrame`

#### Specification

Now that we have a way to generate time series, let us collect two of them into a `TSCDataFrame` object. In general, we can create a new instance of `TSCDataFrame` in the same way as we would instantiate the superclass:

```
TSCDataFrame(data, index, columns, **kwargs)
```

However, when initializing a `TSCDataFrame`, the `data`, `index` and `columns` arguments must fulfill specific requirements to successfully create the new object. Otherwise, an `AttributeError` is raised. The special requirements of `TSCDataFrame` are:


Requirements on `data`:

* Only numeric data is allowed (e.g. no strings or other objects). 

Requirements on `index`:

* The row-index must have two levels (i.e. be a `MultiIndex`). The first index level is for the time series ID and the second level for the time values within each time series.
* The time values must be sorted per time series.
* The time series IDs must be non-negative integers, and the time values must be non-negative numerical values.
* No duplicate index entries are allowed.

Requirements on `columns`:

* The column index only supports a single level.
* The column names must be unique.

Note that, for practical reasons, a time series that consists of only a single sample is still valid in a `TSCDataFrame`.
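
To make the requirements more tangible, the following plain-pandas sketch checks some of them on an ordinary `DataFrame`. The helper `violated_requirements` is hypothetical and written only for this illustration; it is not the validation that `TSCDataFrame` performs internally and it covers only a subset of the rules listed above.

```python
import numpy as np
import pandas as pd


def violated_requirements(df: pd.DataFrame) -> list:
    """Return descriptions of violated requirements (hypothetical helper, not datafold's validation)."""
    problems = []
    if not all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes):
        problems.append("data must be numeric")
    if df.index.nlevels != 2:
        problems.append("row index must have two levels (ID, time)")
    if df.index.has_duplicates:
        problems.append("row index must not contain duplicates")
    if df.columns.nlevels != 1:
        problems.append("columns must have a single level")
    if df.columns.has_duplicates:
        problems.append("column names must be unique")
    return problems


# a frame with a two-level (ID, time) index and numeric data
index = pd.MultiIndex.from_arrays(
    [[0, 0, 0, 1, 1], [0.0, 1.0, 2.0, 0.0, 1.0]], names=["ID", "time"]
)
valid = pd.DataFrame(np.arange(10.0).reshape(5, 2), index=index, columns=["x1", "x2"])

# a flat row index and a string column both violate the requirements
invalid = pd.DataFrame({"x1": [1.0, 2.0], "x2": ["a", "b"]})

problems_valid = violated_requirements(valid)
problems_invalid = violated_requirements(invalid)
print(problems_valid)
print(problems_invalid)
```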


#### Data orientation

The data orientation follows the format used in scikit-learn. This means that each row contains a single sample (state) of the system and each column corresponds to a feature.


To ease the construction of `TSCDataFrame`, there exist class methods `TSCDataFrame.from_X` (e.g. `from_tensor`, `from_single_timeseries`, `from_array`, `from_csv`, ...).

Here, we use `TSCDataFrame.from_frame_list`, where we can insert the two generated `DataFrame`s from above.

After this we print some specific attributes of `TSCDataFrame` to describe the time series collection.

In [None]:
# convert both time series to a single "time series collection" (TSCDataFrame)
tsc_regular = TSCDataFrame.from_frame_list([time_series1, time_series2])

print("delta_time:", tsc_regular.delta_time)
print("n_timesteps:", tsc_regular.n_timesteps)
print("is_const_delta_time:", tsc_regular.is_const_delta_time())
print("is_equal_length:", tsc_regular.is_equal_length())
print("is_same_time_values:", tsc_regular.is_same_time_values())

tsc_regular
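
Conceptually, combining a list of frames into a collection amounts to stacking them and adding an ID level to the row index. The following plain-pandas sketch shows this idea with `pd.concat`. It is only an analogy under that assumption; how *datafold* builds the collection internally may differ, and the toy frames `ts1` and `ts2` are made up for illustration.

```python
import numpy as np
import pandas as pd

# two toy time series with time as row index (illustrative data)
ts1 = pd.DataFrame(np.ones((3, 2)), index=[0, 1, 2], columns=["x1", "x2"])
ts2 = pd.DataFrame(np.zeros((3, 2)), index=[0, 1, 2], columns=["x1", "x2"])

# stack the frames; the keys become the first index level (the time series ID)
collection = pd.concat([ts1, ts2], keys=[0, 1])
collection.index.names = ["ID", "time"]

print(collection)
```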

We now create a second `TSCDataFrame`, in which the time series are not sharing the same time values. We see that `delta_time` and `n_timesteps` cannot give a single value that is true for the entire time series collection anymore. Instead, the attributes list the value for each time series separately (the list is of type `pandas.Series`).

In [None]:
df1 = generate_data(
    np.arange(0, 5),  # sampling 1: time values 0, 1, 2, 3, 4
    np.random.randn(2),
)
df2 = generate_data(
    np.arange(5, 10, 2),  # sampling 2: time values 5, 7, 9
    np.random.randn(2),
)

tsc_irregular = TSCDataFrame.from_frame_list([df1, df2])

print("delta_time: \n", tsc_irregular.delta_time)
print("")
print("n_timesteps: \n", tsc_irregular.n_timesteps)
print("")
print("is_const_delta_time:", tsc_irregular.is_const_delta_time())
print("is_equal_length:", tsc_irregular.is_equal_length())
print("is_same_time_values:", tsc_irregular.is_same_time_values())

# print the time series. It now has two series in it, with IDs 0 and 1.
tsc_irregular
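
To illustrate what the per-series quantities mean, the following plain-pandas sketch computes the number of samples and the time step for each ID of a frame with the same index layout. The helper `constant_step` is hypothetical; the sketch mimics the meaning of the attributes and is not *datafold*'s implementation.

```python
import numpy as np
import pandas as pd

# same index layout as the irregular collection: ID 0 has times 0..4, ID 1 has times 5, 7, 9
index = pd.MultiIndex.from_arrays(
    [[0] * 5 + [1] * 3, [0, 1, 2, 3, 4, 5, 7, 9]], names=["ID", "time"]
)
df = pd.DataFrame(np.zeros((8, 2)), index=index, columns=["x1", "x2"])

time_values = df.index.to_frame(index=False)


def constant_step(times: pd.Series) -> float:
    """Return the time step of one series, or NaN if the step is not constant (hypothetical helper)."""
    steps = np.diff(times.to_numpy())
    return float(steps[0]) if np.all(steps == steps[0]) else float("nan")


n_timesteps = time_values.groupby("ID")["time"].size()
delta_time = time_values.groupby("ID")["time"].apply(constant_step)

print(n_timesteps)
print(delta_time)
```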

### Accessing data

Because `TSCDataFrame` *is a* `pandas.DataFrame`, most of the data indexing and functionality is inherited. However, there are a few things to consider when slicing data (i.e. extracting only part of the table):

* The `TSCDataFrame` type is maintained if a slice of the object is still a valid `TSCDataFrame`. This is also true if the sliced data would actually be a `Series` (but also note the last point in this list).
* If a slice leads to an invalid `TSCDataFrame`  then the general fallback type is `pandas.DataFrame` or `pandas.Series`.
* Currently, there are inconsistencies with `pandas.DataFrame`, because there is no "`TSCSeries`" yet. This is most noticeable for `.iloc` slicing which returns `pandas.Series` even if the slice is a valid `TSCDataFrame` (with one column). A simple type conversion `TSCDataFrame(slice_result)` is a current workaround.

We now look at some examples to slice data from the constructed `tsc_regular` and `tsc_irregular` from above.

#### Access an individual feature from the collection

In [None]:
slice_result = tsc_regular["x1"]

print(type(slice_result))
slice_result

Note that the type is now a `TSCDataFrame` and not a `pandas.Series`.

It is also possible to turn the object to a `DataFrame` beforehand. The returned value is now a `Series` and not a `TSCDataFrame`.

In [None]:
slice_result = pd.DataFrame(tsc_regular)["x1"]

print(type(slice_result))
slice_result

The inconsistency with `.iloc` slicing manifests as follows:

In [None]:
slice_result = tsc_regular.iloc[:, 0]  # access the 0-th column

print(type(slice_result))
slice_result

Instead of having a `TSCDataFrame` as expected, we got a `Series`. In order to obtain a `TSCDataFrame` type we can simply insert brackets around the index `[0]`.

In [None]:
slice_result = tsc_regular.iloc[:, [0]]

print(type(slice_result))
slice_result
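
The difference between the two `.iloc` calls comes from plain pandas: a scalar column position reduces the dimension, whereas a list of positions keeps a two-dimensional frame. A minimal sketch on an ordinary `DataFrame` with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})

as_series = df.iloc[:, 0]  # scalar column position: the result is a pandas.Series
as_frame = df.iloc[:, [0]]  # list of positions: the result is a one-column pandas.DataFrame

print(type(as_series))
print(type(as_frame))
```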

#### Access a single time series

A `TSCDataFrame` has a two-level index (ID and time values). When we access a single time series by its ID, the ID index level is dropped. Because the returned frame no longer has two index levels, it is not a valid `TSCDataFrame` anymore. The fallback type is then a `pandas.DataFrame`.

In [None]:
slice_result = tsc_regular.loc[0]

print(type(slice_result))
slice_result
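
The dropped index level is inherited pandas behavior. The following sketch on an ordinary `DataFrame` (made-up data) shows that a scalar label removes the ID level, whereas a list of labels keeps both levels. Whether the second form is returned as a `TSCDataFrame` in *datafold* is not verified here.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["ID", "time"])
df = pd.DataFrame(np.arange(12.0).reshape(6, 2), index=index, columns=["x1", "x2"])

single = df.loc[0]  # scalar label: the "ID" level is dropped
kept = df.loc[[0]]  # list of labels: both index levels are kept

print(single.index.nlevels, kept.index.nlevels)
```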

#### Select specific time values

A proper time series has at least two samples. However, `TSCDataFrame` also supports time series with only a single sample and describes them as "degenerated time series". The advantage is better interoperability with the superclass.

In the next step, we select certain time values and get the samples from each time series with a match. Note that the inherited rules of accessing data from a `pandas.DataFrame` hold. This means, in the example, not all requested time values have to exist in a time series (the time value 99 does not have a match with any time series). Only if *no* time value matches, a `KeyError` exception is raised.

In [None]:
slice_result = tsc_irregular.select_time_values([3, 4, 5, 7, 99])
print(type(slice_result))
slice_result
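
The matching logic can be pictured as a boolean mask on the time level of the index. The plain-pandas sketch below is an analogy on made-up data, not the implementation of `select_time_values`. Note one difference: if no requested time value matches, the mask simply yields an empty frame and no exception is raised.

```python
import numpy as np
import pandas as pd

# ID 0 has times 0..4, ID 1 has times 5, 7, 9
index = pd.MultiIndex.from_arrays(
    [[0] * 5 + [1] * 3, [0, 1, 2, 3, 4, 5, 7, 9]], names=["ID", "time"]
)
df = pd.DataFrame(np.arange(16.0).reshape(8, 2), index=index, columns=["x1", "x2"])

requested = [3, 4, 5, 7, 99]  # 99 does not exist in any time series
mask = df.index.get_level_values("time").isin(requested)
selected = df[mask]
matched_times = selected.index.get_level_values("time").tolist()

print(selected)
```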

Now, we only select a single time value, which has only one match. This is of course not a legal time series anymore, but the `TSCDataFrame` is still maintained. We can access all "degenerated" time series IDs with a `TSCDataFrame` method. 

In [None]:
slice_result = tsc_irregular.select_time_values(1)
print(type(slice_result), "\n")
print("Degenerated IDs:", slice_result.degenerate_ids())
slice_result

#### Extract initial states

Initial states are required for a dynamical model to make predictions and evolve the system forward in time. An initial condition can be a single state, but it can also be a time series itself. The latter case occurs if the initial condition consists of the current and past samples. Extracting initial states can be achieved with the usual slicing of a `DataFrame`. Here we use the `groupby` function to take the first sample of each time series:

In [None]:
slice_result = tsc_regular.groupby("ID").head(1)
print(type(slice_result))
slice_result

The `TSCDataFrame` data structure also provides convenience methods:

In [None]:
slice_result = tsc_regular.initial_states()
print(type(slice_result))
slice_result

The method also allows us to conveniently extract the first two samples of each time series. Note, however, that the time values mismatch:

In [None]:
slice_result = tsc_irregular.initial_states(2)
print(type(slice_result))
slice_result

There is an extra class `InitialCondition` that provides methods and validation for initial conditions.

For example, we want to address different situations:

* In the case where time series in a collection share the same time values, we can group them and evaluate these initial conditions together.

* If time series have different time values, we want to treat them separately and make separate predictions with the model.

This grouping functionality is very useful when we want to reconstruct time series data with a model. We use the iterator `InitialCondition.iter_reconstruct_ic` method:

(In the cell we also use `InitialCondition.validate(ic)` to check that the initial condition is valid.)

In [None]:
from datafold.pcfold import InitialCondition

print("REGULAR CASE (groups time series together)")
print("------------------------------------------\n")

for ic, time_values in InitialCondition.iter_reconstruct_ic(tsc_regular):
    print("Initial condition \n")
    print(ic)
    assert InitialCondition.validate(ic)
    print(f"with corresponding time values {time_values}")


print(
    "\n\n==========================================================================\n\n"
)
print("IRREGULAR CASE (separates initial conditions):")
print("----------------------------------------------")

for ic, time_values in InitialCondition.iter_reconstruct_ic(tsc_irregular):
    print("Initial condition \n")
    print(ic)
    assert InitialCondition.validate(ic)
    print(f"with corresponding time values {time_values}\n\n")
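
The grouping idea itself can be sketched in plain pandas: time series that share exactly the same time values end up in one group, all others are treated separately. The helper `group_ids_by_time_values` and the toy frames are hypothetical and only illustrate the concept; the actual logic of `InitialCondition.iter_reconstruct_ic` may differ.

```python
import numpy as np
import pandas as pd


def group_ids_by_time_values(df: pd.DataFrame) -> dict:
    """Map each distinct tuple of time values to the IDs that share it (hypothetical helper)."""
    groups = {}
    for ts_id, series in df.groupby(level="ID"):
        key = tuple(series.index.get_level_values("time").tolist())
        groups.setdefault(key, []).append(int(ts_id))
    return groups


# regular case: both time series share the time values 0..4
idx_regular = pd.MultiIndex.from_product(
    [[0, 1], [0, 1, 2, 3, 4]], names=["ID", "time"]
)
regular = pd.DataFrame(np.zeros((10, 2)), index=idx_regular, columns=["x1", "x2"])

# irregular case: the time values differ between the two time series
idx_irregular = pd.MultiIndex.from_arrays(
    [[0] * 5 + [1] * 3, [0, 1, 2, 3, 4, 5, 7, 9]], names=["ID", "time"]
)
irregular = pd.DataFrame(np.zeros((8, 2)), index=idx_irregular, columns=["x1", "x2"])

groups_regular = group_ids_by_time_values(regular)
groups_irregular = group_ids_by_time_values(irregular)

print(groups_regular)
print(groups_irregular)
```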

### Plot time series data

`TSCDataFrame` provides a basic plotting facility:

In [None]:
tsc_regular.plot(figsize=(7, 7))

We can also use the iterator `TSCDataFrame.itertimeseries`, which allows us to access the time series separately and create a plot for each time series.

In [None]:
f, ax = plt.subplots(1, len(tsc_regular.ids), figsize=(15, 7), sharey=True)

for _id, time_series in tsc_regular.itertimeseries():
    ts_axis = time_series.plot(ax=ax[_id])
    ts_axis.set_title(f"time series ID={_id}")
    if _id == 0:
        ts_axis.set_ylabel("quantity of interest")
