# Dataset

Sample code used to generate the datasets used in the experiments.

Three datasets are generated; they are also available for direct download from the `data/csv` directory.

In [None]:
import sys

sys.path.append("..")  # we run from subdirectory, so to access sources append repo root to path

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from numpy.typing import NDArray
from scipy import integrate, special, stats

In [None]:
sns.set()

# Sine Function

The dataset is a simple static nonlinear function, given by `f` below.

In [None]:
def f(x: NDArray, freq: float, gamma: float) -> NDArray:
    return np.exp(-gamma*x) * np.sin(freq*x)

In [None]:
# sample 1000 points to plot
x = np.linspace(0, 12 * np.pi, 1000)
y = f(x, freq=1, gamma=0.1)

In [None]:
_ = plt.figure(figsize=[8, 4])
_ = plt.plot(x, y)

In [None]:
# generate 2000 random points for the benchmark
# they will be split into two sets (train and test), 1000 points each
x = stats.uniform.rvs(size=2000, scale=12*np.pi)
y = f(x, freq=1, gamma=0.1) + stats.norm.rvs(size=2000, loc=0, scale=0.05)

In [None]:
_ = plt.figure(figsize=[8, 4])
_ = plt.scatter(x, y, s=1)

We create a `DataFrame` from the generated dataset, but do not save it in this notebook. The random noise is not seeded in this example, so it may differ from the data used in the experiments. We generated the dataset once and stored it, so all experiments used exactly the same points.

In [None]:
dataset = pd.DataFrame.from_records(np.column_stack([x, y]), columns=["x", "y"])
dataset.head(3)
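For readers who want repeatable output from this notebook, the sketch below shows one way to seed the generation and split the points into train and test halves. The seed value and the positional split are assumptions made for illustration; they are not the settings used to produce the stored dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def f(x, freq, gamma):
    # damped sine, same form as the function defined earlier in this notebook
    return np.exp(-gamma * x) * np.sin(freq * x)


# hypothetical seed, not the one used to generate the stored datasets
rng = np.random.default_rng(42)

n = 2000
x = stats.uniform.rvs(size=n, scale=12 * np.pi, random_state=rng)
noise = stats.norm.rvs(size=n, loc=0, scale=0.05, random_state=rng)
y = f(x, freq=1, gamma=0.1) + noise

dataset = pd.DataFrame({"x": x, "y": y})

# the points are already in random order, so a positional split is enough
train, test = dataset.iloc[: n // 2], dataset.iloc[n // 2 :]
```

Passing the same generator to both `rvs` calls keeps the inputs and the noise tied to a single seed.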

# Sum of Gaussians

This dataset represents a high-dimensional static system mapping R^8 to R^1. The function is defined as a sum of 8 Gaussian density
functions (the explicit PDF values, not random draws from the distribution), each with a different coefficient C (the values used in the experiments are given in the README). The inputs are sampled uniformly
from the range [-1, 1]. L is 1/4 for every Gaussian function, and therefore for their sum as well. Noise was added with sigma = 0.1.
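Written out, the function implemented by the code in this section can be transcribed as follows. The symbol `\varphi` for the standard normal PDF is notation assumed here, and the factor 1/8 is taken from the code rather than from the description above.

```latex
f(x) = \frac{1}{8} \sum_{i=1}^{8} C_i \, \varphi(x_i),
\qquad
\varphi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2 / 2},
\qquad
x \in [-1, 1]^8
```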

In [None]:
ndim = 8
samples = 20_000

In [None]:
x = stats.uniform.rvs(size=ndim * samples, loc=-1, scale=2)
x = x.reshape([samples, ndim])
x.shape

In [None]:
# randomly generated coefficients C (exact values for C used in experiments are given in README)
coefficients = stats.norm.rvs(size=ndim)
coefficients

In [None]:
def f(x):
    return 1 / 8 * np.sum([coefficients[dim] * stats.norm.pdf(x[:, dim]) for dim in range(x.shape[-1])], axis=0)
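The list comprehension in `f` can also be written as a single matrix-vector product. The sketch below compares the two forms on a small random input; the seed and the coefficient values are stand-ins for illustration, not the values from the README.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # hypothetical seed, for the illustration only
ndim = 8
x = rng.uniform(-1, 1, size=(5, ndim))
coefficients = rng.normal(size=ndim)  # stand-in values, not those from the README


def f_loop(x):
    # same computation as `f` above: weighted sum of per-dimension standard normal PDFs
    return 1 / 8 * np.sum([coefficients[dim] * stats.norm.pdf(x[:, dim]) for dim in range(x.shape[-1])], axis=0)


def f_vectorized(x):
    # evaluate the PDF on the whole array, then take the weighted sum over dimensions
    return stats.norm.pdf(x) @ coefficients / 8


assert np.allclose(f_loop(x), f_vectorized(x))
```

Both forms return one value per sample; the vectorized one avoids building a Python list of per-dimension arrays.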

In [None]:
y = f(x)
y = y + stats.norm.rvs(size=samples, loc=0, scale=0.01)

In [None]:
dataset = pd.DataFrame.from_records(np.column_stack([x, y]), columns=[f"x{n + 1}" for n in range(ndim)] + ["y"])
dataset.head(3)

# Nonlinear Dynamics

This dataset represents a low-dimensional dynamic system, R^1 to R^1, with nonlinear dynamics. The system is defined as
`x' = -k \sigma(x-1) u(t)`, where `\sigma` is the sigmoid function, `k` is a constant equal to 0.9 and `u(t)` is the input signal,
which was generated as a sine wave. The system was numerically integrated and is stored in CSV files with 3 columns:
"t" for time, "u" for inputs and "y" for noisy outputs. The initial condition was 1, time spans 100 000 samples from 0
to 100 seconds and the forcing was given as `sin(pi/5 t)`.

In [None]:
samples = 100_000

In [None]:
# generate forcing signal dependent on time
def forcing(t):
    return np.sin(np.pi / 5 * t)

In [None]:
t = np.linspace(0, 100, samples)

In [None]:
_ = plt.figure(figsize=[8, 4])
_ = plt.plot(t, forcing(t))

Define the dynamical system with `k` constant and initial condition `y0`.

In [None]:
y0 = [1.0]
k = 0.9

def system(x: NDArray, t: float) -> NDArray:
    dxdt = -k * special.expit(x - 1) * forcing(t)
    return dxdt

In [None]:
# solve the system with odeint from scipy
solution = integrate.odeint(system, y0, t)
y = solution.flatten() + stats.norm.rvs(loc=0.0, scale=0.05, size=samples) 
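As a sanity check on the integration, the noise-free solution can be compared against a closed-form relation. The equation is separable, and with `x(0) = 1` separating variables gives `x - exp(1 - x) = k * 5/pi * (cos(pi/5 t) - 1)`. The sketch below is not part of the original pipeline; it uses fewer samples than the notebook to keep the check fast.

```python
import numpy as np
from scipy import integrate, special

k = 0.9


def forcing(t):
    return np.sin(np.pi / 5 * t)


def system(x, t):
    return -k * special.expit(x - 1) * forcing(t)


# fewer samples than in the notebook, to keep the check fast
t = np.linspace(0, 100, 1_001)
x = integrate.odeint(system, [1.0], t).flatten()

# separating variables gives x - exp(1 - x) = k * 5/pi * (cos(pi/5 t) - 1) for x(0) = 1
lhs = x - np.exp(1 - x)
rhs = k * 5 / np.pi * (np.cos(np.pi / 5 * t) - 1)
max_residual = np.max(np.abs(lhs - rhs))
print(f"max residual: {max_residual:.2e}")
```

A residual close to the solver tolerance suggests that the integration is accurate before noise is added.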

In [None]:
dataset = pd.DataFrame.from_records(np.column_stack([t, forcing(t), y]), columns=["t", "x", "y"])
dataset.head(3)