In this notebook we look at the ability of `twinLab` to model correlated variables.

First we import the necessary libraries. `numpy` allows us to efficiently manipulate arrays of numerical data. `pandas` provides the `DataFrame`, a way of storing tabular data in Python and the format used by `twinLab`. `matplotlib.pyplot` is used for plotting. `twinlab` is the machine-learning library we are using. Some of the libraries are renamed using `as` for convenience.

In [None]:
# Third-party imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Project imports
import twinlab as tl

In this cell we create a dataset and upload it to the `twinLab` cloud.
- The only input is $X$, which is an array of `n` linearly spaced values between 0 and 1.
- The first output, $y_1$, is equal to $X$ plus a small amount of Gaussian noise (standard deviation 0.05).
- The second output, $y_2$, is an array of `n` random numbers drawn uniformly between 0 and 1, independent of $X$.
- The third output, $y_3$, is defined element by element as $y_1$ divided by $y_2$, so it is fully determined by the other two outputs.

In [None]:
dataset_id = "degeneracy"
campaign_id = dataset_id
random_seed = 43

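# Create training data (seed the generator so the random draws are reproducible)
np.random.seed(random_seed)
n = 101
X = np.linspace(0, 1, n)
y1 = X + np.random.normal(0, 0.05, n)  # linear trend plus Gaussian noise
y2 = np.random.rand(n)  # uniform random numbers in [0, 1)
y3 = y1 / y2  # element-wise ratio of the first two outputs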
train_data = pd.DataFrame({"X": X, "y1": y1, "y2": y2, "y3": y3})
display(train_data)

# Upload training data to twinLab
tl.upload_dataset(train_data, dataset_id=dataset_id, verbose=True)
tl.list_datasets(verbose=True)
tl.query_dataset(dataset_id, verbose=True)

In this cell we set the parameters used to train the model, and then train it.

In [None]:
# Training parameters
prediction_params = {
    "filename": dataset_id,
    "inputs": ["X"],
    "outputs": ["y1", "y2", "y3"],
    "test_train_ratio": 1.,
}

# Training
tl.train_campaign(prediction_params, campaign_id, verbose=True)
tl.list_campaigns(verbose=True)
tl.query_campaign(campaign_id, verbose=True)

This cell creates the inputs at which we want predictions, and then asks the trained model to predict the outputs at those inputs.

In [None]:
num_predictions = 1001
input_dict = {
    "X": np.linspace(0., 1., num_predictions).tolist(),
}

prediction_inputs = pd.DataFrame(input_dict)
display(prediction_inputs)

df_mean, df_std = tl.predict_campaign(prediction_inputs, campaign_id, verbose=True)
display(df_mean)
display(df_std)

The cell above gives these inputs to the model, which returns its prediction for each of the three outputs. `df_mean` is the predicted value. `df_std` is the standard deviation that expresses how uncertain the model is about that value.

Now we plot the results on three graphs: one of $y_1$ against $X$, one of $y_2$ against $X$, and one of $y_3$ against $X$.
- The black dots are the training data we gave the model.
- The blue line is the predicted mean, `df_mean`.
- The shaded blue bands either side show the uncertainty in the prediction: one and two standard deviations (`df_std`) around the mean, as set by `nsigs`.
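The shaded bands described above can be sketched in isolation. The helper name `uncertainty_bands` and the example numbers are hypothetical and only stand in for one column of `df_mean` and `df_std`; the arithmetic mirrors the `mean - nsig*err` and `mean + nsig*err` expressions in the plotting cell.

```python
import numpy as np


def uncertainty_bands(mean, std, nsigs=(1, 2)):
    """Return {nsig: (lower, upper)} with the band edges for each width."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    return {nsig: (mean - nsig * std, mean + nsig * std) for nsig in nsigs}


# Made-up numbers standing in for one output column (not real model output)
example_mean = np.array([0.0, 0.5, 1.0])
example_std = np.array([0.1, 0.2, 0.1])

bands = uncertainty_bands(example_mean, example_std)
lower_1, upper_1 = bands[1]
lower_2, upper_2 = bands[2]
print("1-sigma band:", lower_1, upper_1)
print("2-sigma band:", lower_2, upper_2)
```

If the predictive distribution is Gaussian (an assumption here, not something this notebook establishes), the one and two standard deviation bands would contain roughly 68% and 95% of the probability respectively.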

The first graph shows the model is good at predicting $y_1$, because the training data closely follow the $X$ value it is given.
$y_2$ is random with no dependence on $X$, so the prediction settles to around a value of 0.5, the average of the training data.
$y_3$ has some enormously high values, which occur whenever $y_2$ is a very small number, so the result of the division is very high.
The average of $y_3$ is currently around 2, but it is dominated by these extreme values and would increase with more data.
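To see why the ratio column behaves so erratically, the sketch below regenerates data with the same recipe as the data-creation cell (`y1 / y2`, with a uniformly random denominator) at several sample sizes. The function name `ratio_sample`, the seed, and the sample sizes are illustrative choices, and it uses `np.random.default_rng` rather than the notebook's global `np.random` functions. It does not involve `twinLab` at all.

```python
import numpy as np

rng = np.random.default_rng(43)  # seed chosen only for illustration


def ratio_sample(n, rng):
    """Draw n points following the recipe in the data-creation cell."""
    X = np.linspace(0, 1, n)
    numerator = X + rng.normal(0, 0.05, n)  # plays the role of y1
    denominator = rng.random(n)  # plays the role of y2, uniform in [0, 1)
    return numerator / denominator, denominator


for n in (101, 10_001, 1_000_001):
    ratio, denominator = ratio_sample(n, rng)
    print(
        f"n={n:>9,d}  mean={np.mean(ratio):10.3f}  "
        f"median={np.median(ratio):6.3f}  "
        f"max|ratio|={np.max(np.abs(ratio)):12.1f}  "
        f"min denominator={denominator.min():.2e}"
    )
```

The median stays stable while the mean and the largest value are driven by the few points where the denominator is close to zero. For a uniform denominator on $(0, 1)$ the expectation of its reciprocal is infinite, so the sample mean of the ratio has no finite value to converge to as more data are drawn.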

In [None]:
# Plot parameters
nsigs = [1, 2]
color = "blue"
alpha = 0.5
plot_training_data = True
plot_model_mean = True
plot_model_bands = True

# Plot results
for Y, Ylabel in zip(["y1", "y2", "y3"], ["$y_1$", "$y_2$", "$y_3$"]):
    grid = prediction_inputs["X"]
    mean = df_mean[Y]
    err = df_std[Y]
    if plot_model_bands:
        label = "Model prediction"
        plt.fill_between(grid, np.nan, np.nan, lw=0, color=color, alpha=alpha, label=label)
        for isig, nsig in enumerate(nsigs):
            plt.fill_between(grid, mean-nsig*err, mean+nsig*err, lw=0, color=color, alpha=alpha/(isig+1))
    if plot_model_mean:
        label = "Model prediction" if not plot_model_bands else None
        plt.plot(grid, mean, color=color, alpha=alpha, label=label)
    if plot_training_data:
        plt.plot(train_data["X"], train_data[Y], ".", color="black", label="Training data")
    plt.xlim((0., 1.))
    plt.xlabel("$X$")
    plt.ylabel(Ylabel)
    plt.legend()
    plt.show()

Finally, we can remove our dataset and trained model from the `twinLab` cloud.

In [None]:
# Delete campaign and dataset (if desired)
tl.delete_campaign(campaign_id, verbose=True)
tl.delete_dataset(dataset_id, verbose=True)