This notebook shows how a model degrades when the training data includes a feature that carries no information useful for prediction. In the training data, $y=\sin(2\pi X_1)$, with some added noise. An additional feature, $X_2$, is just noise and has no correlation with $y$ whatsoever.

First we import the necessary libraries: `numpy` allows us to efficiently manipulate arrays of numerical data; `pandas` provides the `DataFrame`, a way of storing tabular data in Python and the format used by `twinLab`; `matplotlib.pyplot` is used for plotting; and `twinlab` is the machine-learning library we are using. Some of the libraries are renamed using `as` for convenience.

In [None]:
# Third-party imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Project imports
import twinlab as tl

In this cell we define the name of our dataset and of our model (the campaign). Because we are using random numbers, we also choose a seed for the random-number generator so that our results are reproducible.

In [None]:
dataset_id = "noise"
campaign_id = dataset_id

random_seed = 43

Now we create some data:
- $X_1$ and $X_2$ are both arrays of 10 random values between 0 and 1.
- $y$ is $\sin(2\pi X_1)$ plus Gaussian noise with a standard deviation of 0.05 and, crucially, has no dependency on $X_2$ whatsoever.

We then put these arrays into a `pandas` `DataFrame` with the corresponding column headings and upload it as a dataset.

In [None]:
# Seed the random-number generator
np.random.seed(random_seed)

# Training data
X1 = np.random.rand(10)
X2 = np.random.rand(10)
y = np.sin(X1*2.*np.pi) + np.random.normal(0, 0.05, 10)

train_data = pd.DataFrame({'X1': X1, 'X2': X2, 'y': y})
display(train_data)

tl.upload_dataset(train_data, dataset_id, verbose=True)
tl.list_datasets(verbose=True)
tl.query_dataset(dataset_id, verbose=True)
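
The claim that $X_2$ tells us nothing about $y$ can be checked locally with a correlation coefficient. The following is a minimal sketch, not part of the original notebook: it assumes a much larger sample (10,000 points rather than 10) so that the estimate is stable, and it uses `numpy`'s newer `default_rng` generator instead of `np.random.seed`. Note that the Pearson coefficient only measures linear association, so it is a rough indicator rather than a proof of independence.

```python
import numpy as np

# Assumption: a larger sample than the notebook's 10 points, so that the
# sample correlation is a stable estimate of the true correlation.
rng = np.random.default_rng(43)
n = 10_000

X1 = rng.random(n)
X2 = rng.random(n)
y = np.sin(2.0 * np.pi * X1) + rng.normal(0.0, 0.05, n)

# Pearson correlation of each input with the output
corr_X1_y = np.corrcoef(X1, y)[0, 1]
corr_X2_y = np.corrcoef(X2, y)[0, 1]

print(f"corr(X1, y) = {corr_X1_y:+.3f}")
print(f"corr(X2, y) = {corr_X2_y:+.3f}")
```

With only 10 training points, as in the cell above, the sample correlation between $X_2$ and $y$ can be noticeably different from zero by chance, which is part of why the model cannot easily tell that $X_2$ is irrelevant.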

In this cell we set the parameters used to train the machine-learning model. By default, all the data is used in training and a Gaussian process is trained. We need to provide the ID of the dataset on the `twinLab` cloud and the columns of that dataset that should be taken as inputs (`X1`, `X2`) and outputs (`y`).

In [None]:
# Define the parameters used to train the model
prediction_params = {
    "filename": dataset_id,
    "inputs" : ["X1", "X2"],
    "outputs": ["y"],
    "test_train_ratio": 1.,
}

tl.train_campaign(prediction_params, campaign_id, verbose=True)
tl.list_campaigns(verbose=True)
tl.query_campaign(campaign_id, verbose=True)

Now we create the inputs for which the model will predict outputs. Both $X_1$ and $X_2$ are 101 linearly spaced numbers between 0 and 1, so the evaluation points lie along the line $X_1 = X_2$.

We put these values into a `pandas` `DataFrame` and use it for model evaluation/prediction.

In [None]:
input_dict = {
    "X1": np.linspace(0, 1, 101),
    "X2": np.linspace(0, 1, 101),
}
prediction_inputs = pd.DataFrame(input_dict)
display(prediction_inputs)

df_mean, df_std = tl.predict_campaign(prediction_inputs, campaign_id)
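
Because both columns above use the same `linspace`, every prediction point has $X_1 = X_2$. As an illustration only, the sketch below builds an alternative set of inputs in which $X_1$ varies while $X_2$ is held at a fixed value. The names `slice_inputs` and `x2_fixed`, and the value 0.5, are hypothetical choices and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

n_points = 101
x2_fixed = 0.5  # hypothetical choice: any value in [0, 1] would do

# X1 sweeps across its range while X2 stays constant
slice_inputs = pd.DataFrame({
    "X1": np.linspace(0, 1, n_points),
    "X2": np.full(n_points, x2_fixed),
})

print(slice_inputs.head())
```

A frame like this has the same columns as `prediction_inputs`, so it could presumably be passed to `tl.predict_campaign` in the same way to look at the dependence on $X_1$ alone.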

Now we plot $y$ against $X_1$, and then $y$ against $X_2$.
- The black dots on the graph are the training data we gave the model.
- The darkest blue line in the graph is the `df_mean` value.
- The blue bands either side represent the uncertainty in the `df_mean` value, drawn at one and two standard deviations (`df_std`).

On the first graph ($y$ against $X_1$), the model has become more uncertain about its predictions of $y$ because of the introduction of $X_2$.

On the second graph, we can see that there is no correlation between $X_2$ and $y$.

In [None]:
# Plot parameters
nsigs = [1, 2]
color = "blue"
alpha = 0.5
plot_training_data = True
plot_model_mean = True
plot_model_bands = True

for X, Xlabel in zip(["X1", "X2"], ["$X_1$", "$X_2$"]):
    # Plot results
    grid = prediction_inputs[X]
    mean = df_mean["y"]
    err = df_std["y"]
    if plot_model_bands:
        label = "Model prediction"
        plt.fill_between(grid, np.nan, np.nan, lw=0, color=color, alpha=alpha, label=label)
        for isig, nsig in enumerate(nsigs):
            plt.fill_between(grid, mean-nsig*err, mean+nsig*err, lw=0, color=color, alpha=alpha/(isig+1))
    if plot_model_mean:
        label = "Model prediction" if not plot_model_bands else None
        plt.plot(grid, mean, color=color, alpha=alpha, label=label)
    if plot_training_data:
        plt.plot(train_data[X], train_data["y"], ".", color="black", label="Training data")
    plt.xlim((0., 1.))
    plt.xlabel(Xlabel)
    plt.ylabel("$y$")
    plt.legend()
    plt.show()
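
The widening of the uncertainty bands can be reproduced locally with a very small Gaussian-process sketch written in plain `numpy`. This is not how `twinLab` trains its model: the squared-exponential kernel, the fixed length scale of 0.2 and the noise level of 0.05 are all assumptions made for illustration, whereas a real training run would fit such hyperparameters to the data. With fixed hyperparameters the predictive standard deviation depends only on the input locations, so $y$ is not needed here.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale):
    """Squared-exponential kernel with unit signal variance."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_predictive_std(X_train, X_test, lengthscale=0.2, noise_std=0.05):
    """Posterior standard deviation of a zero-mean GP with fixed hyperparameters."""
    n_train = len(X_train)
    K = rbf_kernel(X_train, X_train, lengthscale) + noise_std**2 * np.eye(n_train)
    K_star = rbf_kernel(X_train, X_test, lengthscale)
    # Posterior variance: k(x*, x*) - k*^T K^{-1} k*, with k(x*, x*) = 1
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star), axis=0)
    return np.sqrt(np.clip(var, 0.0, None))

rng = np.random.default_rng(43)
X1 = rng.random(10)
X2 = rng.random(10)  # the irrelevant feature
grid = np.linspace(0, 1, 101)

# Model that sees X1 only
std_1d = gp_predictive_std(X1[:, None], grid[:, None])

# Model that sees X1 and X2, evaluated along X1 = X2 as in the notebook
std_2d = gp_predictive_std(np.column_stack([X1, X2]),
                           np.column_stack([grid, grid]))

print(f"mean predictive std, X1 only:   {std_1d.mean():.3f}")
print(f"mean predictive std, X1 and X2: {std_2d.mean():.3f}")
```

The extra dimension pushes the training points further away from each prediction point, so each observation constrains the prediction less and the bands grow. A model that learns a separate, long length scale for $X_2$ can recover some of this loss, but only if there is enough data to identify that $X_2$ is irrelevant.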

Now we can clean up by deleting the campaign and dataset (if desired).

In [None]:
tl.delete_campaign(campaign_id, verbose=True)
tl.delete_dataset(dataset_id, verbose=True)