## Qualitative Study of GPR Model Performance using Increasing Dataset Sizes

### [Alessio Tamburro](https://alessiot.github.io/dsprojects/)
<br>

**The source code and the full notebook (including all functions) for this article are available [here](https://github.com/alessiot/polygp-sklearn).**
<br>

*Note: the HTML version of this notebook was created with [nbconvert](https://nbconvert.readthedocs.io/en/latest/config_options.html) by running the command `jupyter nbconvert --to html --TagRemovePreprocessor.remove_input_tags="notvisible" notebook.ipynb`. The tag "notvisible" was added to the cells that are not displayed in the rendered HTML.*
<br>

This study follows the post and notebook published earlier [here](https://alessiot.github.io/dsprojects/posts/01_01_2024_polygp_sklearn.html). In that post, we used Optuna to search the space of potential polynomial kernels with varying complexity and find an optimal kernel (automated kernel engineering) for the Gaussian Process model. In this post, we will use the same approach to train Gaussian Process regression models for a set of 24 benchmark functions, which are used to generate the training data. We will observe how increasing dataset sizes affects the performance of the trained models. The study focuses on 1D functions. The plan is to extend this study to more input dimensions.

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
import plotly.io as pio
pio.renderers.default = "notebook"

In [49]:
import numpy as np
import pandas as pd

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared

from sklearn.metrics import r2_score, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

import plotly.graph_objects as go
import plotly.io as pio

from scipy.spatial import distance

from contextlib import redirect_stdout
from io import StringIO
from IPython.display import clear_output, display, HTML

from typing import Callable

try:
    import polygp
    from polygp import train_gp, _calculate_bic, _calculate_num_params, build_poly_kernel
except ImportError as e:
    # this is needed if you decide not to install polygp 
    import sys
    sys.path.append('../')    
    from polygp.src.polygp import train_gp, _calculate_bic, _calculate_num_params, build_poly_kernel


In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# raise error if convergence warning is issued so that we can try/except it
warnings.filterwarnings("error", category = ConvergenceWarning)
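
The effect of promoting a warning category to an error can be seen in isolation with the standard library alone. The sketch below is only an illustration: `DemoConvergenceWarning` and `noisy_fit` are hypothetical stand-ins for scikit-learn's `ConvergenceWarning` and a model fit, and the filter is applied inside a `warnings.catch_warnings()` context so that it does not leak into the rest of the session.

```python
import warnings

class DemoConvergenceWarning(UserWarning):
    """Hypothetical stand-in for sklearn.exceptions.ConvergenceWarning."""

def noisy_fit():
    # Emits a warning the way an optimizer might when it does not converge
    warnings.warn("optimizer did not converge", DemoConvergenceWarning)
    return "fitted"

def fit_and_report():
    # Inside this context the warning category is raised as an exception,
    # so a plain try/except can react to it (for example by retrying)
    with warnings.catch_warnings():
        warnings.filterwarnings("error", category=DemoConvergenceWarning)
        try:
            return noisy_fit(), False
        except DemoConvergenceWarning as exc:
            return str(exc), True

result, was_caught = fit_and_report()
print(result, was_caught)
```

Without the `"error"` filter the function would simply return `"fitted"` and the warning would only be printed.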

## Summary of the Results

Before digging deeper into the details, we show the summary of this study below.

In [107]:
df_1D = pd.read_csv('../data/opti_functions_1D.csv')

save_path = '../data/imgs_1D'

for i, fi in enumerate(df_1D.index):

    f_str = str(df_1D.loc[fi,'Function Name'])
    f_str = "_".join(f_str.replace("'",'').replace("-","_").replace(".","").replace(",","").replace("(","").replace(")","").split())

    graph_name = f"{save_path}/graph_{str(f_str)}.png"
    movie_name = f"{save_path}/movie_{str(f_str)}.gif"
    #print(f'{fi} out of {max(df_1D.index)}, {movie_name}')

    html_code = f"""
        <div style="display: flex; justify-content: space-between;">
            <img src="{movie_name}" alt="GIF Video" style="width: 48%;">
            <img src="{graph_name}" alt="PNG Image" style="width: 48%;">
        </div>
    """

    # Display HTML
    display(HTML(html_code))

## Data Generation using Benchmark Functions

In this section, we show how we generate data based on an example benchmark function, the dropwave function. This function, along with several others, is typically used to benchmark optimization algorithms. We used some of the benchmark functions that are available from [SFU](https://www.sfu.ca/~ssurjano/optimization.html).
These functions will be considered our true generative process model.

In [7]:
def generate_grid(exp_space: dict = {}, 
                  num_samples: int = 10,
                  random_state: int = 1) -> np.ndarray:
    rng = np.random.RandomState(random_state)    
    rnd_samples = []
    while len(rnd_samples) < num_samples:
        # Generate random values
        rnd_dict = {
            c_i: rng.uniform(exp_space[c_i]['values'][0],exp_space[c_i]['values'][1])\
                if exp_space[c_i]['val_type']=='float'\
                else rng.choice(exp_space[c_i]['values'])\
                for c_i in exp_space if exp_space[c_i]['cl']=='inp'
        }
        rnd_samples.append(rnd_dict)

    # 2D array
    rnd_samples = pd.DataFrame(rnd_samples).values

    return rnd_samples

def generate_data(exp_space, generating_fun, num_samples = 1_000, random_state=1):

    # Sampling the process space
    X = generate_grid(exp_space, num_samples = num_samples,
                      random_state = random_state) 
    # True generative process function
    Z = generating_fun(X)

    return X, Z

def select_training_data(X, Z, exp_space, generating_fun, is_error=False, training_size=10, random_state=1):

    rng = np.random.RandomState(random_state) 

    # Sample training data indices
    training_indices = rng.choice(np.arange(Z.size), size=training_size, replace=False)
    X_train, Z_train = X[training_indices], Z[training_indices]
    X_noise_std, Z_noise_std = None, None
    if is_error:
        ## we observe at the location identified by the training indices and measure the X_train values with uncertainty values X_noise_std
        X_train = X[training_indices]
        X_noise_std = np.array([exp_space['X1']['err']]*len(X_train)).reshape(-1,1) #* abs(X_train)
        ## ...but due to the uncertainty in X, Z = f(X+X_noise)
        X_noise = [rng.normal(loc=0.0, scale=xn, size=1) for xn in X_noise_std]
        Z_train = generating_fun(X_train+X_noise)
        ## and on top of it we have uncertainty in the measurement of Z
        Z_noise_std = np.array([exp_space['Z']['err']]*len(Z_train)).reshape(-1,1) #* abs(Z_train)
        Z_noise = [rng.normal(loc=0.0, scale=zn, size=1) for zn in Z_noise_std]
        Z_train += Z_noise
    
    return X_train, Z_train, X_noise_std, Z_noise_std 



The function `generate_data` samples input values at random within the ranges specified for the input variable _X1_ in the design space dictionary and evaluates the dropwave function, which is defined as a lambda function, at those values. The design space can also specify whether we want to generate the data with noise. We will come back to this point later in this notebook. For now, we set the noise ('err') to 0.
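
The noise model that `select_training_data` applies when `is_error=True` is Z = f(X + eps_x) + eps_z, with Gaussian eps_x and eps_z whose standard deviations come from the 'err' entries of the design space. The following is a minimal vectorized sketch of that model; `noisy_observations` is a hypothetical helper, and its random draws will not match those of the notebook function, which draws the noise row by row.

```python
import numpy as np

def noisy_observations(f, X, x_err, z_err, seed=1):
    """Return Z = f(X + eps_x) + eps_z with Gaussian noise (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    eps_x = rng.normal(0.0, x_err, size=X.shape)  # uncertainty on the inputs
    Z = f(X + eps_x)                              # function evaluated at the perturbed inputs
    eps_z = rng.normal(0.0, z_err, size=Z.shape)  # uncertainty on the measured output
    return Z + eps_z

# Sphere function, used here only as a simple example
f = lambda x: np.square(x).sum(axis=1).reshape(-1, 1)
X = np.array([[-1.0], [0.5], [2.0]])

Z_clean = noisy_observations(f, X, x_err=0.0, z_err=0.0)   # 'err' set to 0: no noise
Z_noisy = noisy_observations(f, X, x_err=0.05, z_err=0.1)
print(Z_clean.ravel(), Z_noisy.ravel())
```

With both error levels set to 0, as in this section, the observations coincide with the true function values.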

In [69]:
# Design space
exp_space_dict = {'X1' : {'values' : [-2,2], 'val_type': 'float', 'cl': 'inp', 'err': 0}, 
                  'Z': {'values': [0, 1], 'val_type': 'float', 'cl': 'out', 'err': 0}} 

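# True generative process function
#sphere_b = lambda x_in: np.square(x_in).sum(axis=1).reshape(-1,1)
# Note: the factor 1.0/len(x_in) depends on the number of rows passed in, so the same x gives a
# different value when evaluated alone or in a batch (the SFU drop-wave definition uses 0.5 instead)
dropwave_b = lambda x_in: (-(1.0 + np.cos(12 * np.sqrt(np.square(x_in).sum(axis=1))))/((1.0/len(x_in))*np.square(x_in).sum(axis=1) + 2)).reshape(-1,1)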

X, Z = generate_data(exp_space_dict, dropwave_b, num_samples = 1_000, random_state=1)

X_train, Z_train, X_noise_std, Z_noise_std = select_training_data(X, Z, exp_space_dict, dropwave_b, is_error=False, training_size=10, random_state=1)
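
One property of `dropwave_b` as written is worth checking: the coefficient `1.0/len(x_in)` is one over the number of rows in the input array, so the value returned for a given point depends on how many points are evaluated together. The sketch below compares the expression as written with a variant that uses a constant coefficient. The value 0.5 is the one in the SFU drop-wave definition; using it here is an assumption of this sketch and would change the numbers shown in this notebook.

```python
import numpy as np

def dropwave_as_written(x_in):
    # Same expression as dropwave_b: the coefficient is 1/(number of rows)
    sq = np.square(x_in).sum(axis=1)
    return (-(1.0 + np.cos(12 * np.sqrt(sq))) / ((1.0 / len(x_in)) * sq + 2)).reshape(-1, 1)

def dropwave_fixed_coeff(x_in, coeff=0.5):
    # Assumed variant with a constant coefficient (0.5 in the SFU definition)
    sq = np.square(x_in).sum(axis=1)
    return (-(1.0 + np.cos(12 * np.sqrt(sq))) / (coeff * sq + 2)).reshape(-1, 1)

X_demo = np.linspace(-2, 2, 1000).reshape(-1, 1)
row = X_demo[:1]  # the first point evaluated on its own

# Does the value at the first point change between batch and single-row evaluation?
batch_dependent = not np.isclose(dropwave_as_written(X_demo)[0, 0],
                                 dropwave_as_written(row)[0, 0])
batch_independent = np.isclose(dropwave_fixed_coeff(X_demo)[0, 0],
                               dropwave_fixed_coeff(row)[0, 0])
print(batch_dependent, batch_independent)
```

This matters whenever new points are evaluated separately from the 1000-row generated set, because the two evaluations then use different denominators.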


The selected training dataset has 10 data rows.

In [70]:
# The dataframe with the measured observations, which will be used for training a model
train_df = pd.DataFrame(np.concatenate([X_train,Z_train], axis=1), columns=['X1','Z'])
print(f'Shape of the training set: {train_df.shape}')
train_df.head(3)

Shape of the training set: (10, 2)


| | X1 | Z |
|---|---|---|
| 0 | -1.259478 | -0.08564 |
| 1 | 0.867083 | -0.221497 |
| 2 | 1.623238 | -0.90303 |


The full generated dataset has 1000 data rows.

In [71]:
# This is the larger data frame holding the 1000 sampled observations
generated_df = pd.DataFrame(np.concatenate([X,Z], axis=1), columns=['X1','Z']) 
print(f'Shape of the generated set: {generated_df.shape}')
generated_df.head(3)

Shape of the generated set: (1000, 2)


| | X1 | Z |
|---|---|---|
| 0 | -0.331912 | -0.166763 |
| 1 | 0.881298 | -0.296005 |
| 2 | -1.999543 | -0.708185 |


## Gaussian Process Regression Model Training

We can now fit a Gaussian Process regression model to the training data generated above using the [gaussian_process](https://github.com/scikit-learn/scikit-learn/tree/3f89022fa04d293152f1d32fbc2a5bdaaf2df364/sklearn/gaussian_process) module of *scikit-learn*. We let the optimizer find the best parameter values for the kernel of the Gaussian Process regression model; the optimizer adapts the parameters to maximize the log marginal likelihood. We selected the _RBF_ kernel for this exercise.
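
To make the optimized quantity concrete, the log marginal likelihood of a zero-mean Gaussian Process with an RBF kernel can be computed with NumPy alone. This is a minimal sketch, not the scikit-learn implementation: the helper names, the toy data, and the small grid of length scales are assumptions for illustration, and scikit-learn uses a gradient-based optimizer rather than a grid.

```python
import numpy as np

def rbf_kernel(A, B, amplitude=1.0, length_scale=1.0):
    # amplitude**2 * exp(-||a - b||^2 / (2 * length_scale**2))
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def log_marginal_likelihood(X, y, amplitude, length_scale, jitter=1e-10):
    # log p(y | X) = -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2 pi), via a Cholesky factor
    K = rbf_kernel(X, X, amplitude, length_scale) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha
                 - np.log(np.diag(L)).sum()
                 - 0.5 * len(X) * np.log(2 * np.pi))

# Toy data (hypothetical, not the notebook's training set)
X_toy = np.array([[-1.5], [-0.5], [0.3], [1.2]])
y_toy = np.sin(3 * X_toy).ravel()

candidates = [0.05, 0.5, 5.0]
scores = {ls: log_marginal_likelihood(X_toy, y_toy, 1.0, ls) for ls in candidates}
best_length_scale = max(scores, key=scores.get)
print(scores, best_length_scale)
```

The kernel hyperparameters with the highest score are the ones the optimizer would prefer among these candidates.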


In [16]:
def fit_model(train_df, 
              num_lst,
              cat_lst,
              inp_lst,
              out_lst,
              kernel, 
              optimizer, 
              max_retries = 20):

    # preprocessing pipeline
    ppl_pre = []
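    # Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4;
    # on recent versions use sparse_output=False instead
    cat_proc = OneHotEncoder(handle_unknown="ignore", 
                            sparse=False, 
                            drop='if_binary') 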
    num_proc = StandardScaler()#PowerTransformer('yeo-johnson', standardize=True)
    ppl_pre.append(
        ColumnTransformer(
            transformers=[
                ('one-hot-encoder', cat_proc, cat_lst), 
                ('scaler', num_proc, num_lst), 
            ],
            remainder='passthrough', # selection of inputs happen during fit anyway
        )
    )  

    # random state: should we fix it as well, maybe with a different number?
    gpr = GaussianProcessRegressor(
        optimizer=optimizer, 
        n_restarts_optimizer = 10,
        kernel=kernel,
    )

    # append to pipeline
    ppl_pre.append(gpr)      
    # create pipeline
    ppl = make_pipeline(*ppl_pre)

    opti_msg = 'Training kernel - kernel will change' if optimizer else 'Refitting - kernel will not change'
    print(f"Fitting with kernel: {kernel}. {opti_msg}")
    ## recover numerical issues
    for r_i in range(1, max_retries+1):
        try:
            ppl.fit(train_df[inp_lst], train_df[out_lst])
            break
        except Exception as e:
            # Handle the exception 
            print(f"retry: {r_i} - Operation failed with warning: {e}")
    else:
        # This block runs if the loop completes without a successful operation
        print("Maximum retries reached, operation failed.")
    ###
    llh_current = ppl[-1].log_marginal_likelihood_value_
    kernel_current = ppl[-1].kernel_

    return ppl, llh_current, kernel_current



In [72]:
init_kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-5, 1e5))
gaussian_process, llh, kernel = fit_model(train_df, ['X1'], [], ['X1'], ['Z'], init_kernel, optimizer="fmin_l_bfgs_b")

# kernel optimized with LLH
print(f"kernel: {kernel}")
print(f"llh: {llh}")
# predictions for the whole set
mean_prediction, std_prediction = gaussian_process.predict(generated_df[['X1']], return_std=True)
mean_prediction = mean_prediction.reshape(-1,1)
std_prediction = std_prediction.reshape(-1,1)

mse_all = mean_squared_error(generated_df[['Z']], mean_prediction)
r2_all = r2_score(generated_df[['Z']], mean_prediction)

print(f'r2: {r2_all}')   
print(f'mse: {mse_all}')  

Fitting with kernel: 1**2 * RBF(length_scale=1). Training kernel - kernel will change
kernel: 0.624**2 * RBF(length_scale=0.0998)
llh: -8.899513414028853
r2: -0.13454580527073468
mse: 0.13579160384538438


As we can see from the model performance indicators and the graph below, 10 data points do not yield a satisfactory model: the R² score on the full generated dataset is negative. Later, we will show how the performance changes as we iteratively increase the dataset size.
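
A negative R² means that the model predicts worse than a constant equal to the mean of the true values. The short sketch below computes R² and MSE by hand on toy numbers (hypothetical values, unrelated to the dropwave data) to show where the sign comes from; `r2_and_mse` is an illustrative helper that follows the same definitions as `r2_score` and `mean_squared_error`.

```python
import numpy as np

def r2_and_mse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot, ss_res / y_true.size

y_true = np.array([0.0, 1.0, 2.0, 3.0])

# Predicting the mean everywhere gives R2 = 0
r2_mean, mse_mean = r2_and_mse(y_true, np.full(4, y_true.mean()))
# Predicting the values in reverse order is worse than the mean: R2 < 0
r2_bad, mse_bad = r2_and_mse(y_true, y_true[::-1])
print(r2_mean, mse_mean, r2_bad, mse_bad)
```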

In [67]:
def create_gp_regression_plot(X_true, Z_true, X_obs, Z_obs, dX_obs, dZ_obs, Z_true_preds, Z_std_preds, text_display = '', ix=0, zx=0):    
    '''Plot the true values, the observed values, and the predicted values using Plotly
    '''

    # Make sure values are sorted before plotting
    sorted_indices = np.argsort(X_true[:,ix])

    fig = go.Figure()

    # Plot the true function
    fig.add_trace(go.Scatter(
        name="True function", 
        showlegend=True,
        x=X_true[sorted_indices,ix], y=Z_true[sorted_indices,zx],
        mode='lines', 
        line=dict(dash='dash'),
        line_color='rgba(0, 0, 255, 1)',
        #error_x=dict(type='data', array=error_x, visible=True),
        #error_y=dict(type='data', array=noise_level*df[sq.out_lst[0]], visible=True if noise_level>0 else False),    
    ))

    # Plot the observed values
    fig.add_trace(go.Scatter(
        name="Observations", 
        x=X_obs[:,ix], y=Z_obs[:,zx],
        error_y=dict(
            type='data',  # Error type (data or constant)
            array=dZ_obs[:, zx],  # Array of error values
            visible=True,
            color='rgba(255, 165, 0)',  # Color of error bars
        ) if isinstance(dZ_obs, np.ndarray) else None,
        error_x=dict(
            type='data',  # Error type (data or constant)
            array=dX_obs[:,ix],  # Array of error values
            visible=True,
            color='rgba(255, 165, 0)',  # Color of error bars
        ) if isinstance(dX_obs, np.ndarray) else None,
        mode='markers', 
        marker=dict(size=8),
        marker_color = 'rgba(255, 165, 0)',
        #error_x=dict(type='data', array=error_x, visible=True),
        #error_y=dict(type='data', array=noise_level*df[sq.out_lst[0]], visible=True if noise_level>0 else False),    
        showlegend=True,
    ))

    # Plot the mean prediction
    fig.add_trace(go.Scatter(
        name="Mean prediction",
        showlegend=True,
        x=X_true[sorted_indices,ix], y=Z_true_preds[sorted_indices,zx],
        mode='lines', 
        line_color='rgba(0, 128, 0, 1)',
        #line=dict(dash='dash'),
        #error_x=dict(type='data', array=error_x, visible=True),
        #error_y=dict(type='data', array=noise_level*df[sq.out_lst[0]], visible=True if noise_level>0 else False),    
    ))

    # Create confidence interval shading
    upper_bound = Z_true_preds[sorted_indices,zx] + 1.96 * Z_std_preds[sorted_indices,zx]
    lower_bound = Z_true_preds[sorted_indices,zx] - 1.96 * Z_std_preds[sorted_indices,zx]
    fig.add_trace(go.Scatter(
        name="95% C.I. Lower Bound",
        x=X_true[sorted_indices,ix],
        y=lower_bound,
        line_color='rgba(0, 128, 0, 1)',
        #line=dict(width=1, shape='spline'),
    ))
    fig.add_trace(go.Scatter(
        name="95% C.I. Upper Bound",
        x=X_true[sorted_indices,ix],
        y=upper_bound,
        fill="tonexty",
        fillcolor='rgba(0, 128, 0, 0.2)',
        line_color='rgba(0, 128, 0, 1)',
        #line=dict(width=1, shape='spline'),
    ))

    # Update layout
    fig.update_layout(
        #title="Gaussian process regression",
        xaxis_title=f"X({ix})" if X_true.shape[1]>1 else 'X',
        yaxis_title=f"Y({zx})" if Z_true.shape[1]>1 else 'Z',
        showlegend=True,
        plot_bgcolor='white', 
        paper_bgcolor='white',
        yaxis_range = [np.min(Z_true[:,zx]), np.max(Z_true[:,zx])],   
        xaxis_range = [np.min(X_true[:,ix]), np.max(X_true[:,ix])]    
    )

    # Add a text box below the legend
    fig.update_layout(
        annotations=[
            dict(
                x=0,
                y=max(Z_true),
                xref="paper",
                yref="paper",
                text=text_display,
                showarrow=False,
                font=dict(size=15),
            )
        ]
    )

    return fig


In [73]:
gp_plot = create_gp_regression_plot(X, Z, X_train, Z_train, X_noise_std, Z_noise_std, mean_prediction, std_prediction)

gp_plot
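
The shaded band in the plot is the mean prediction plus or minus 1.96 predicted standard deviations. A complementary numeric check is the fraction of true values that fall inside that band. The helper below, `interval_coverage`, is a hypothetical addition and is shown on toy numbers rather than on the notebook's predictions.

```python
import numpy as np

def interval_coverage(y_true, mean, std, z=1.96):
    """Fraction of true values inside mean +/- z * std (hypothetical helper)."""
    y_true, mean, std = (np.asarray(a, dtype=float).ravel() for a in (y_true, mean, std))
    inside = np.abs(y_true - mean) <= z * std
    return inside.mean()

# Toy values: the last point lies far outside its interval
y_true = np.array([0.0, 1.0, 2.0, 3.0])
mean = np.array([0.1, 1.0, 2.5, 5.0])
std = np.array([0.1, 0.1, 0.3, 0.5])

coverage = interval_coverage(y_true, mean, std)
print(coverage)
```

With the objects defined in this notebook, the call would be `interval_coverage(Z, mean_prediction, std_prediction)`. A value well below 0.95 would suggest that the predicted uncertainty is too narrow.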

### Adding more points

Given the 10 training points, we generate additional points on the input grid (one in the example below) that satisfy the _maxmin_ property: among many random candidates, we pick those whose distance to the nearest existing training point is largest. This approach samples the true generating function where we lack data. The function __generate_samples__ generates many random data points on the specified grid before applying the maxmin condition to select the additional data points.

In [99]:
def generate_samples(X, exp_space, num_samples=1000, k_points=1, random_state = 1):
    """Given an existing 2D array of N m-dimensional points (N, m), return k additional points 
    that satisfy the maxmin condition 
    """
    print(f'Generating {num_samples} in {exp_space}')
    # generate random samples to select from based on minmax later
    rnd_samples = generate_grid(exp_space, num_samples, random_state)

    #sel_rnd_samples = []
    #for _ in range(k_points):
    #_, sel_rnd_samples_tmp = get_minmax(rnd_samples, X, 1)
    #sel_rnd_samples.append(sel_rnd_samples_tmp)
    #sel_rnd_samples = np.concatenate(sel_rnd_samples, axis=0)
    _, sel_rnd_samples = get_minmax(rnd_samples, X, k_points)
    #print(f'Selected points have coordinates {sel_rnd_samples}')

    return sel_rnd_samples

def get_minmax(X_rnd, X, k_points):
    """Given an existing 2D array X of N m-dimensional points (N, m), and an array X_rnd of S randomly sampled
    m-dimensional points (S, m), select k_points from X_rnd that satisfy the maxmin property
    """
    
    # Calculate distances between each random additional point in X_rnd and each existing point in X
    cdistances = distance.cdist(X_rnd, X, metric='euclidean') # dist(u=XA[i], v=XB[j])

    # For each random point, take the min distance to the existing points
    cdistances_min = cdistances.min(axis=1) 

    # Indices of the k_points random points with the largest min distance (in no particular order)
    cdistances_idx_max = np.argpartition(cdistances_min, -k_points)[-k_points:]

    return cdistances_idx_max, X_rnd[cdistances_idx_max]
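
The selection rule in `get_minmax` takes the `k_points` candidates with the largest nearest-neighbour distance in a single pass. For `k_points > 1`, the selected candidates can therefore lie close to each other, since they are not compared with one another. The NumPy-only sketch below contrasts this one-shot rule with a greedy variant that adds one point at a time. `greedy_maxmin` is a hypothetical alternative for illustration, not part of the notebook.

```python
import numpy as np

def min_dist_to_set(candidates, existing):
    # Euclidean distance from each candidate to its nearest existing point
    diff = candidates[:, None, :] - existing[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def one_shot_maxmin(candidates, existing, k):
    # Same rule as get_minmax: top-k nearest-neighbour distances, computed once
    d = min_dist_to_set(candidates, existing)
    return candidates[np.argsort(d)[-k:][::-1]]

def greedy_maxmin(candidates, existing, k):
    # Hypothetical variant: after each pick, the pick joins the existing set
    existing = existing.copy()
    picked = []
    for _ in range(k):
        best = int(np.argmax(min_dist_to_set(candidates, existing)))
        picked.append(candidates[best])
        existing = np.vstack([existing, candidates[best]])
    return np.array(picked)

existing = np.array([[0.0], [4.0]])
candidates = np.array([[1.0], [1.9], [2.0], [3.2]])

one_shot = one_shot_maxmin(candidates, existing, k=2).ravel().tolist()
greedy = greedy_maxmin(candidates, existing, k=2).ravel().tolist()
print(one_shot, greedy)
```

In this toy case the one-shot rule picks 2.0 and 1.9, which are almost the same location, while the greedy variant picks 2.0 and then 1.0. With `k_points=1` the two rules coincide.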



In [102]:
# Generate new points from the true function based on maxmin
X_new = generate_samples(X_train, exp_space_dict, k_points=1)
Z_new = dropwave_b(X_new)
X_new = np.concatenate([X_train,X_new], axis=0)
Z_new = np.concatenate([Z_train,Z_new], axis=0)
train_df = pd.DataFrame(np.concatenate([X_new,Z_new], axis=1), columns=['X1','Z'])

# Refit starting from the previously optimized kernel; because an optimizer is passed, the kernel hyperparameters are optimized again
init_kernel = gaussian_process[-1].kernel_
gaussian_process, llh, kernel = fit_model(train_df, ['X1'], [], ['X1'], ['Z'], init_kernel, optimizer="fmin_l_bfgs_b")

# kernel optimized with LLH
print(f"kernel: {kernel}")
print(f"llh: {llh}")
# predictions for the whole set
mean_prediction, std_prediction = gaussian_process.predict(generated_df[['X1']], return_std=True)
mean_prediction = mean_prediction.reshape(-1,1)
std_prediction = std_prediction.reshape(-1,1)

mse_all = mean_squared_error(generated_df[['Z']], mean_prediction)
r2_all = r2_score(generated_df[['Z']], mean_prediction)

print(f'r2: {r2_all}')   
print(f'mse: {mse_all}')  

Generating 1000 in {'X1': {'values': [-2, 2], 'val_type': 'float', 'cl': 'inp', 'err': 0}, 'Z': {'values': [0, 1], 'val_type': 'float', 'cl': 'out', 'err': 0}}
Fitting with kernel: 0.652**2 * RBF(length_scale=0.189). Training kernel - kernel will change
kernel: 0.618**2 * RBF(length_scale=0.104)
llh: -9.73794702296226
r2: 0.05962798788282475
mse: 0.11255131625666795


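As we can see, adding one point improves the model performance: the R² score on the full generated dataset rises from about -0.13 to about 0.06, although it remains far from satisfactory. We can also notice that the point is added between -1 and 0, a region that was not sampled before.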

In [103]:
gp_plot = create_gp_regression_plot(X, Z, X_new, Z_new, X_noise_std, Z_noise_std, mean_prediction, std_prediction)

gp_plot

## Sequential Training of 1D Benchmark Functions

To qualitatively assess the performance of trained Gaussian Process regression models with increasing dataset sizes, we set up sequential training based on different benchmark functions. For each benchmark function, the polynomial kernel for the GPR model is selected using automated kernel engineering as described in a previous post [here](https://alessiot.github.io/dsprojects/posts/01_01_2024_polygp_sklearn.html) (the notebook is available [here](https://github.com/alessiot/polygp-sklearn) under the notebook folder). 

### 1D Benchmark Functions

We prepared a dataset of benchmark functions that also contains the associated lambda functions we need to generate our training observations.

In [104]:
df_1D = pd.read_csv('../data/opti_functions_1D.csv')
df_1D = df_1D[df_1D.Select==1]
print(f'Shape of the benchmark function dataset: {df_1D.shape}')
df_1D.head(2)

Shape of the benchmark function dataset: (25, 9)


| | Function Name | link | lambda | xrange | zmin | zmax | markdown | Select | kernel |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bird-Like Function | /benchmarks/unconstrained/1-dimension/221-bird... | `f = lambda x: (2*x**4 + x**2 + 2) / (x**4 + 1)` | [-4, 4] | [2] | [-1, 1] | $f(x) = \frac{2x^4 + x^2 + 2}{x^4 + 1}$ | 1 | `0.191**2 * RBF(length_scale=0.269) + WhiteKern...` |
| 1 | Gramacy-Lee's Function No.01 | /benchmarks/unconstrained/1-dimension/258-gram... | `f = lambda x: np.sin(10 * np.pi * x) / (2 * x)...` | [0.5, 2.5] | [0.548563444114526] | [] | $f(x) = \frac{\sin(10 \pi x)}{2x} + (x - 1)^4$ | 1 | `5.6**2 * RBF(length_scale=1.92) + WhiteKernel(...` |


In [37]:
df_1D.loc[12]

Function Name    Problem No.09 (Timonov's Function No.03 or Zil...
link             /benchmarks/unconstrained/1-dimension/37-zilin...
lambda                    f = lambda x: -np.sin(x) - np.sin(2*x/3)
xrange                                                 [3.1, 20.4]
zmin                                                            []
zmax                                          [17.039198942112002]
markdown         $f(x) = -\sin(x) - \sin\left(\frac{2x}{3}\right)$
Select                                                           1
kernel           1.43**2 * RBF(length_scale=0.608) + WhiteKerne...
Name: 12, dtype: object
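
The `lambda` and `xrange` columns are stored as strings, so they have to be turned into Python objects before use. A minimal sketch of one way to do this is shown below. `build_function` is a hypothetical helper, and because it relies on `exec` and `eval` it should only be applied to a trusted file. It also assumes that ranges may be written with a bare `pi` or with `np.pi`.

```python
import numpy as np

def build_function(lambda_field, xrange_field):
    """Turn the CSV strings into a callable and a [low, high] range (hypothetical helper)."""
    scope = {"np": np}
    # Keep only the lambda expression that follows 'f = '
    lambda_src = lambda_field.split("f = ", 1)[1]
    exec(f"f = {lambda_src}", scope)
    # Evaluate the range with only 'pi' and 'np' available
    low, high = eval(xrange_field, {"__builtins__": {}}, {"pi": np.pi, "np": np})
    return scope["f"], [float(low), float(high)]

# Strings copied from the row displayed above
f, xrange = build_function("f = lambda x: -np.sin(x) - np.sin(2*x/3)", "[3.1, 20.4]")
value_at_zero = float(f(np.array([0.0]))[0])
print(xrange, value_at_zero)
```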

In [38]:
df_1D['kernel'] = ['']*len(df_1D)
df_1D.head(2)

| | Function Name | link | lambda | xrange | zmin | zmax | markdown | Select | kernel |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bird-Like Function | /benchmarks/unconstrained/1-dimension/221-bird... | `f = lambda x: (2*x**4 + x**2 + 2) / (x**4 + 1)` | [-4, 4] | [2] | [-1, 1] | $f(x) = \frac{2x^4 + x^2 + 2}{x^4 + 1}$ | 1 | |
| 1 | Gramacy-Lee's Function No.01 | /benchmarks/unconstrained/1-dimension/258-gram... | `f = lambda x: np.sin(10 * np.pi * x) / (2 * x)...` | [0.5, 2.5] | [0.548563444114526] | [] | $f(x) = \frac{\sin(10 \pi x)}{2x} + (x - 1)^4$ | 1 | |


### Sequential Training

Sequential training starts with 10 initial generated data points and proceeds until it reaches 40 data points. If the performance of the model (measured with the _R squared_) is at least 0.98 for 5 consecutive iterations, the iterative loop stops early.
For the polynomial kernel, we use the kernel functions that are available in sklearn: _RBF, RationalQuadratic, ExpSineSquared, Matern, DotProduct_. The polynomial kernel will have no more than 3 additive terms and each additive term will have no more than 2 product terms. The output of this cell is omitted.
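
To get a feeling for the size of this kernel search space, the kernel structures can be enumerated with `itertools`. The sketch below rests on an assumption: that both the product terms and the sums are built from distinct elements, which is how the `'worepl'` ("without replacement") setting is read here. The actual enumeration in polygp may differ, and the count ignores the continuous hyperparameters of each base kernel.

```python
from itertools import combinations

base_kernels = ["RBF", "RationalQuadratic", "ExpSineSquared", "Matern", "DotProduct"]
max_n_prod, max_n_sum = 2, 3

# Product terms: 1 or 2 distinct base kernels multiplied together
product_terms = [c for n in range(1, max_n_prod + 1)
                 for c in combinations(base_kernels, n)]

# Additive kernels: 1 to 3 distinct product terms summed
kernel_structures = [s for n in range(1, max_n_sum + 1)
                     for s in combinations(product_terms, n)]

def as_text(structure):
    # Render a structure such as (('RBF',), ('Matern', 'DotProduct'))
    return " + ".join(" * ".join(term) for term in structure)

example = as_text(kernel_structures[-1])
print(len(product_terms), len(kernel_structures))
print(example)
```

Under these assumptions there are 15 product terms and 575 kernel structures, which gives an idea of why a guided search is preferable to fitting every candidate.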

In [105]:
def sequential_training(exp_space: dict = {},
                        use_error: bool = False,
                        generating_func: Callable = lambda x: 1,
                        init_training_size: int = 10, 
                        add_training_size: int = 1,
                        end_training_size: int = 30,
                        random_state: int = 1, 
                        graph_path: str = '') -> dict:  # type: ignore
    """Perform sequential training of a Gaussian Process model.

    inputs:

        exp_space:
            design space dictionary describing the input and output variables
        use_error:
            if True, noise is added to the inputs and outputs using the 'err' values in exp_space
        generating_func:
            true generative process function used to generate the data
        init_training_size:
            initial number of data rows
        add_training_size:
            additional number of data rows at each iteration
        end_training_size:
            final number of data rows
        random_state:
            seed for data generation and training data sampling
        graph_path:
            if provided, graphs will be saved at this location

    outputs:
        dictionary storing the results at each iteration
    """

    base_kernels = [
        'RBF()',
        'RationalQuadratic()',
        'ExpSineSquared()',
        'Matern()',
        'DotProduct()'
    ]

    results_dict = {}

    # Randomly sample true function in design space and select training samples
    training_size = init_training_size
    X, Z = generate_data(exp_space, generating_func, num_samples = 1_000, random_state=random_state)
    X_train, Z_train, X_noise_std, Z_noise_std = select_training_data(X, Z, exp_space, 
                                                                      generating_func, 
                                                                      is_error=use_error, 
                                                                      training_size=training_size, 
                                                                      random_state=random_state)
    # create data frame to handle pipelines
    inp_lst = [ii for ii, jj in exp_space.items() if jj['cl'] == 'inp'] 
    out_lst = [ii for ii, jj in exp_space.items() if jj['cl'] == 'out'] 
    num_lst = [ii for ii in inp_lst if exp_space[ii]['val_type'] == 'float'] 
    cat_lst = [ii for ii in inp_lst if exp_space[ii]['val_type'] == 'object'] 
    train_df = pd.DataFrame(np.concatenate([X_train,Z_train], axis=1), columns=inp_lst+out_lst)
    generated_df = pd.DataFrame(np.concatenate([X,Z], axis=1), columns=inp_lst+out_lst) 
    # fit initial model - find best kernel - llh is bic (the smaller the better)
    gaussian_process, llh_current, kernel_current, study = train_gp(train_df, inp_lst, out_lst,
                                                             cat_lst, num_lst,
                                                             max_evals=1000, early_stopping_rounds=200,
                                                             model_complexity = {'max_n_prod': 2, 
                                                                                 'max_n_sum': 3,
                                                                                 'comb_type': 'worepl'},
                                                             base_kernels = base_kernels
                                                            )
    # predictions for the whole set
    mean_prediction, std_prediction = gaussian_process.predict(generated_df[inp_lst], return_std=True)
    mean_prediction = mean_prediction.reshape(-1,1)
    std_prediction = std_prediction.reshape(-1,1)
    mse_all = mean_squared_error(generated_df[out_lst], mean_prediction)
    r2_all = r2_score(generated_df[out_lst], mean_prediction)
    print(f"Initial GPR --- training size: {init_training_size}, llh: {llh_current}, r2: {r2_all}, mse: {mse_all}, kernel: {kernel_current}")
    # store results
    results_dict['training_size'] = [init_training_size]
    results_dict['llh'] = [llh_current]
    results_dict['r2'] = [r2_all]
    results_dict['mse'] = [mse_all]
    results_dict['kernel'] = [kernel_current]

    if graph_path!='':
        cnt = 0
        gp_plot = create_gp_regression_plot(X, Z, X_train, Z_train, X_noise_std, Z_noise_std,
                                            mean_prediction, std_prediction, 
                                            f"training size: {training_size}; llh: {np.round(llh_current,3)}; mse: {np.round(mse_all,3)}; r2: {np.round(r2_all,3)}")
        pio.write_image(gp_plot, f"{graph_path}/graph_{str(cnt)}.png")

    rng = np.random.RandomState(random_state) # for error

    # increase data
    n_iter = int((end_training_size - init_training_size)/add_training_size)
    above_thr = 0
    print(f"Running {n_iter} iterations with {add_training_size} additional points")
    for cnt in range(1, n_iter+1):
        # Sample new training points 
        X_idx_new, X_new = get_minmax(X, X_train, add_training_size)
        Z_new = Z[X_idx_new,:]
        # add error if requested
        X_new_noise_std, Z_new_noise_std = None, None
        if use_error:
            X_new_noise_std = np.array([exp_space['X1']['err']]*len(X_new)).reshape(-1,1) 
            ## ...but due to the uncertainty in X, Z = f(X+X_noise)
            X_new_noise = [rng.normal(loc=0.0, scale=xn, size=1) for xn in X_new_noise_std]
            Z_new = generating_func(X_new+X_new_noise)
            ## and on top of it we have uncertainty in the measurement of Z
            Z_new_noise_std = np.array([exp_space['Z']['err']]*len(Z_new)).reshape(-1,1) 
            Z_new_noise = [rng.normal(loc=0.0, scale=zn, size=1) for zn in Z_new_noise_std]
            Z_new += Z_new_noise
            X_noise_std = np.concatenate([X_noise_std,X_new_noise_std], axis=0)
            Z_noise_std = np.concatenate([Z_noise_std,Z_new_noise_std], axis=0)

        X_train = np.concatenate([X_train,X_new], axis=0)
        Z_train = np.concatenate([Z_train,Z_new], axis=0)
        # Fit with initial history
        train_df = pd.DataFrame(np.concatenate([X_train,Z_train], axis=1), columns=inp_lst+out_lst)
        gaussian_process, llh_current, kernel_current, study = train_gp(train_df, inp_lst, out_lst,
                                                        cat_lst, num_lst,
                                                        max_evals=1000, early_stopping_rounds=200, 
                                                        model_complexity = {'max_n_prod': 2, 
                                                                            'max_n_sum': 3,
                                                                            'comb_type': 'worepl'},
                                                        base_kernels = base_kernels,
                                                        study_init=study
                                                    )
        # predictions for the whole set
        mean_prediction, std_prediction = gaussian_process.predict(generated_df[inp_lst], return_std=True)
        mean_prediction = mean_prediction.reshape(-1,1)
        std_prediction = std_prediction.reshape(-1,1)
        mse_all = mean_squared_error(generated_df[out_lst], mean_prediction)
        r2_all = r2_score(generated_df[out_lst], mean_prediction)
        
        training_size = len(X_train)

        print(f"GPR --- training size: {training_size}, llh: {llh_current}, r2: {r2_all}, mse: {mse_all}, kernel: {kernel_current}")

        results_dict['training_size'].append(training_size)
        results_dict['llh'].append(llh_current)
        results_dict['r2'].append(r2_all)
        results_dict['mse'].append(mse_all)
        results_dict['kernel'].append(kernel_current)

        if graph_path!='':
            gp_plot = create_gp_regression_plot(X, Z, X_train, Z_train, X_noise_std, Z_noise_std,
                                                mean_prediction, std_prediction, f"training size: {training_size}; llh: {np.round(llh_current,3)}; mse: {np.round(mse_all,3)}; r2: {np.round(r2_all,3)}"
                                                )
            pio.write_image(gp_plot, f"{graph_path}/graph_{str(cnt)}.png")

        if r2_all >= 0.98:
            above_thr += 1
        else:
            above_thr = 0
        
        # if R2 is at least 0.98 for 5 iterations in a row, break
        if above_thr>=5:
            break

    return results_dict

def plot_results(res_dict, text_display=''):

    fig = go.Figure()

    fig.add_trace(go.Scatter(
        name="Log-likelihood", 
        x=res_dict['training_size'], 
        y=res_dict['llh'],
        yaxis='y',
        mode='lines', 
        line=dict(dash='dash'),
        line_color='rgba(0, 0, 255, 1)',
        showlegend=True,
    ))

    fig.add_trace(go.Scatter(
        name="R2", 
        x=res_dict['training_size'], 
        y=res_dict['r2'],
        yaxis='y2',
        mode='lines', 
        marker=dict(size=8),
        line_color = 'rgba(255, 165, 0)',
        showlegend=True,
    ))

    #fig.add_trace(go.Scatter(
    #    name="MSE", 
    #    x=res_dict['training_size'], 
    #    y=res_dict['mse'],
    #    yaxis='y2',
    #    mode='lines', 
    #    marker=dict(size=8),
    #    line_color='rgba(0, 128, 0, 1)',
    #    showlegend=True,
    #))

    # Update layout
    fig.update_layout(
        #title=f'Model Performance {text_display}',
        #title_font=dict(size=30, ),
        xaxis_title="Training size",
        yaxis_title="Performance (on true unknown function)",
        showlegend=True,
        plot_bgcolor='white', 
        paper_bgcolor='white',
        yaxis=dict(title='Log-likelihood', side='left', showgrid=False),
        yaxis2=dict(title='R2', side='right', overlaying='y', showgrid=False),
        #yaxis_range = [min(Z_true[:,zx]), max(Z_true[:,zx])],   
        #xaxis_range = [min(X_true[:,ix]), max(X_true[:,ix])]    
    )

    # Add a text box below the legend
    #fig.update_layout(
    #    annotations=[
    #        dict(
    #            x=np.mean(res_dict['training_size']),
    #            y=np.min(res_dict['llh']),
    #            xref="x",
    #            yref="y",
    #            text=f"Final Kernel: {str(res_dict['kernel'][-1])}",
    #            showarrow=False,
    #            font=dict(size=10),
    #        )
    #    ]
    #)



    return fig
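
The stopping rule at the end of `sequential_training` can be isolated as a small helper. This is only a sketch: the name `stop_index` and its arguments are hypothetical, and the defaults mirror the values used in the loop above (an R2 threshold of 0.98 and 5 consecutive iterations).

```python
def stop_index(r2_history, threshold=0.98, patience=5):
    """Return the iteration index at which the loop would break, or None.

    The streak counter is reset to zero whenever r2 drops below the
    threshold, so only *consecutive* good iterations count.
    """
    streak = 0
    for i, r2 in enumerate(r2_history):
        streak = streak + 1 if r2 >= threshold else 0
        if streak >= patience:
            return i
    return None


# One dip at index 1 resets the streak, so the loop stops at index 6.
history = [0.99, 0.97, 0.99, 0.99, 0.99, 0.99, 0.99, 0.995]
print(stop_index(history))
```

Resetting the counter on every miss keeps a single lucky iteration from ending the training early.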


In [106]:
save_path = '../data/imgs_1D'

output_catcher = StringIO()

for fi in df_1D.index:
    
    if fi==5:
        continue

    #if fi>=2:
    #    continue

    exec_scope = {}
    f_str = str(df_1D.loc[fi,'Function Name'])
    f_str = "_".join(f_str.replace("'",'').replace("-","_").replace(".","").replace(",","").replace("(","").replace(")","").split())
    print(f'{fi} out of {max(df_1D.index)}, {f_str}')
    lambda_str = str(df_1D.loc[fi,'lambda']).split('f = ')[1]
    xrange_str = str(df_1D.loc[fi,'xrange']).replace('pi', 'np.pi')
    code_to_exec = f'''import numpy as np\nf = {lambda_str}\nxrange = {xrange_str}'''
    try:
        with redirect_stdout(output_catcher):
            exec(code_to_exec, exec_scope)
    except Exception as e:
        # f and xrange were not defined, so skip this benchmark function
        print(f"An error occurred: {e}")
        continue
    f = exec_scope['f']
    xrange = exec_scope['xrange']
    benchmark_dict = {'X1' : {'values': xrange, 'val_type': 'float', 'cl': 'inp', 'err': 0.05},
                      'Z' : {'val_type': 'float', 'cl': 'out', 'err': 0.05}}

    !rm ../images/*

    res_dict = sequential_training(exp_space = benchmark_dict,
                                use_error = False,
                                generating_func=f, 
                                init_training_size=10, 
                                add_training_size = 1, 
                                end_training_size = 40, 
                                random_state = 1,
                                graph_path='../images')

    res_plot = plot_results(res_dict)
    pio.write_image(res_plot, f"{save_path}/graph_{str(f_str)}.png")

    df_1D.loc[fi, 'kernel'] = str(res_dict['kernel'][-1])

    !ffmpeg -framerate 2 -i ../images/graph_%d.png -vf "palettegen" ../images/palette.png
    !ffmpeg -framerate 2 -i ../images/graph_%d.png -i ../images/palette.png -lavfi "paletteuse" ../images/movie.gif
    !mv ../images/movie.gif {save_path}/movie_{str(f_str)}.gif

    # Clear the output
    clear_output(wait=True)

# remove output before converting to html

24 out of 24, Strongins_Function


2024-01-09 06:46:34,186 - POLYGP DEBUG - train_polygp - Running kernel optimization
2024-01-09 06:46:34,187 - POLYGP DEBUG - train_polygp - 575 available polynomials


  0%|          | 0/1000 [00:00<?, ?it/s]

2024-01-09 06:46:48,519 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 7}, return 18.50857774202804
2024-01-09 06:46:52,792 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 351}, return 34.15937778209566
2024-01-09 06:47:12,210 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 7}, return 18.50857774202804
2024-01-09 06:47:12,210 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 7}, return 18.508577741436273
2024-01-09 06:47:29,526 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 316}, return 36.46170000869736
2024-01-09 06:47:36,182 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 450}, return 32.26921414970452
2024-01-09 06:47:45,722 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 188}, return 31.856542326616335
2024-01-09 06:47:48,243 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 138}, return 21.232864277246726
2024-01-09 06:47:55,857 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel':

Initial GPR --- training size: 10, llh: 12.76479918193347, r2: 0.9999999876642903, 'mse: 1.2199094658042641e-08, kernel: 0.00316**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-05) + 45.3**2 * ExpSineSquared(length_scale=89.2, periodicity=1.82) + 0.00316**2 * DotProduct(sigma_0=0.799) + 12.2**2 * ExpSineSquared(length_scale=17.6, periodicity=3.64) * Matern(length_scale=1e+05, nu=1.5),
Running 30 iterations with 1 additional points


2024-01-09 06:50:20,124 - POLYGP DEBUG - run_optimization - 223 trials with valid kernel available
2024-01-09 06:50:20,125 - POLYGP DEBUG - run_optimization - 193 unique trials
2024-01-09 06:50:20,140 - POLYGP DEBUG - run_optimization - top trial: 0.00316**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-05) + 45.3**2 * ExpSineSquared(length_scale=89.2, periodicity=1.82) + 0.00316**2 * DotProduct(sigma_0=0.799) + 12.2**2 * ExpSineSquared(length_scale=17.6, periodicity=3.64) * Matern(length_scale=1e+05, nu=1.5)
2024-01-09 06:50:20,155 - POLYGP DEBUG - run_optimization - using previous study: 30 trials added. Total: 30


  0%|          | 0/1030 [00:00<?, ?it/s]

2024-01-09 06:51:01,271 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 278}, return 32.002940508146615
2024-01-09 06:51:07,930 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 138}, return 21.232864276378585
2024-01-09 06:51:10,685 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 445}, return 29.329412103908012
2024-01-09 06:51:19,881 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 51}, return 22.62224704973064
2024-01-09 06:51:21,124 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 1}, return 16.206232386434053
2024-01-09 06:51:27,309 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 141}, return 22.669542339309647
2024-01-09 06:51:34,033 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 163}, return 24.527053417652304
2024-01-09 06:51:36,848 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 445}, return 29.329412103908012
2024-01-09 06:51:36,849 - POLYGP DEBUG - objective - Duplicated trial: {'pol

GPR --- training size: 11, llh: 6.650902591381083, r2: 0.9999999986683623, 'mse: 1.316890123041646e-09, kernel: 0.00316**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-05) + 54.4**2 * ExpSineSquared(length_scale=107, periodicity=1.91) + 0.00316**2 * RBF(length_scale=1.35) * DotProduct(sigma_0=0.00111) + 0.00316**2 * ExpSineSquared(length_scale=20.6, periodicity=3.81) * DotProduct(sigma_0=4.51e+03),


2024-01-09 07:12:22,843 - POLYGP DEBUG - run_optimization - 981 trials with valid kernel available
2024-01-09 07:12:22,855 - POLYGP DEBUG - run_optimization - 611 unique trials
2024-01-09 07:12:22,871 - POLYGP DEBUG - run_optimization - top trial: 0.00316**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-05) + 54.4**2 * ExpSineSquared(length_scale=107, periodicity=1.91) + 0.00316**2 * RBF(length_scale=1.35) * DotProduct(sigma_0=0.00111) + 0.00316**2 * ExpSineSquared(length_scale=20.6, periodicity=3.81) * DotProduct(sigma_0=4.51e+03)
2024-01-09 07:12:22,885 - POLYGP DEBUG - run_optimization - using previous study: 30 trials added. Total: 30


  0%|          | 0/1030 [00:00<?, ?it/s]

2024-01-09 07:12:28,591 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 52}, return 14.230114944235613
2024-01-09 07:12:28,591 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 52}, return 15.101603608201966
2024-01-09 07:13:13,174 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 53}, return 13.581762650474577
2024-01-09 07:13:15,556 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 339}, return 20.607485718617017
2024-01-09 07:13:16,496 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 342}, return 35.264169057549736
2024-01-09 07:13:24,122 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 299}, return 51.37118518999567
2024-01-09 07:13:28,881 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 338}, return 6.650902591381083
2024-01-09 07:13:29,834 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 29}, return 17.868603308191908
2024-01-09 07:13:36,536 - POLYGP DEBUG - objective - Duplicated trial: {'poly_

GPR --- training size: 12, llh: 0.31518168657909484, r2: 0.9999999996335132, 'mse: 3.624280361608509e-10, kernel: 0.00316**2 * RBF(length_scale=0.0004) + WhiteKernel(noise_level=1e-05) + 0.00316**2 * ExpSineSquared(length_scale=4.57, periodicity=3.42) * DotProduct(sigma_0=4.84e+03),


2024-01-09 07:24:18,388 - POLYGP DEBUG - run_optimization - 601 trials with valid kernel available
2024-01-09 07:24:18,393 - POLYGP DEBUG - run_optimization - 429 unique trials
2024-01-09 07:24:18,410 - POLYGP DEBUG - run_optimization - top trial: 0.00316**2 * RBF(length_scale=0.0004) + WhiteKernel(noise_level=1e-05) + 0.00316**2 * ExpSineSquared(length_scale=4.57, periodicity=3.42) * DotProduct(sigma_0=4.84e+03)
2024-01-09 07:24:18,426 - POLYGP DEBUG - run_optimization - using previous study: 30 trials added. Total: 30


  0%|          | 0/1030 [00:00<?, ?it/s]

2024-01-09 07:24:28,054 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 432}, return 21.67911789590979
2024-01-09 07:24:36,024 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 85}, return 11.634403072449377
2024-01-09 07:24:38,836 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 184}, return 24.244049034538776
2024-01-09 07:24:50,971 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 36}, return 9.641760029280032
2024-01-09 07:25:06,301 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 9}, return 9.400066800353432
2024-01-09 07:25:24,331 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 515}, return 26.80939984890167
2024-01-09 07:25:26,033 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 407}, return 15.307085228616259
2024-01-09 07:25:46,881 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 395}, return 29.37394878170248
2024-01-09 07:25:50,263 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel

GPR --- training size: 13, llh: -5.163494317746103, r2: 0.9999999996734059, 'mse: 3.229771813283888e-10, kernel: 0.00316**2 * RBF(length_scale=0.000472) + WhiteKernel(noise_level=1e-05) + 18.6**2 * RationalQuadratic(alpha=0.451, length_scale=1e+05) * ExpSineSquared(length_scale=5.06, periodicity=3.39),


2024-01-09 07:31:57,590 - POLYGP DEBUG - run_optimization - 430 trials with valid kernel available
2024-01-09 07:31:57,592 - POLYGP DEBUG - run_optimization - 335 unique trials
2024-01-09 07:31:57,607 - POLYGP DEBUG - run_optimization - top trial: 0.00316**2 * RBF(length_scale=0.000472) + WhiteKernel(noise_level=1e-05) + 18.6**2 * RationalQuadratic(alpha=0.451, length_scale=1e+05) * ExpSineSquared(length_scale=5.06, periodicity=3.39)
2024-01-09 07:31:57,621 - POLYGP DEBUG - run_optimization - using previous study: 30 trials added. Total: 30


  0%|          | 0/1030 [00:00<?, ?it/s]

2024-01-09 07:31:59,506 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 73}, return 5.657944837642965
2024-01-09 07:32:08,789 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 106}, return 5.725008634412831
2024-01-09 07:32:19,130 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 422}, return 17.15476632248148
2024-01-09 07:32:20,461 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 338}, return 6.650902591381083
2024-01-09 07:32:26,223 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 136}, return 7.148455010079424
2024-01-09 07:32:38,759 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 138}, return 3.8884019608916454
2024-01-09 07:32:45,061 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 315}, return 17.15411117243024
2024-01-09 07:32:49,105 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 7}, return 6.28940238289567
2024-01-09 07:32:49,356 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel':

GPR --- training size: 14, llh: -9.409659241991562, r2: 0.9999999997481432, 'mse: 2.490675410229408e-10, kernel: 0.00316**2 * RBF(length_scale=3.69e-05) + WhiteKernel(noise_level=1e-05) + 0.00316**2 * DotProduct(sigma_0=4.68e-05) + 19.4**2 * RBF(length_scale=1e+05) * ExpSineSquared(length_scale=5.16, periodicity=3.49),


2024-01-09 07:38:10,979 - POLYGP DEBUG - run_optimization - 373 trials with valid kernel available
2024-01-09 07:38:10,981 - POLYGP DEBUG - run_optimization - 310 unique trials
2024-01-09 07:38:10,996 - POLYGP DEBUG - run_optimization - top trial: 0.00316**2 * RBF(length_scale=3.69e-05) + WhiteKernel(noise_level=1e-05) + 0.00316**2 * DotProduct(sigma_0=4.68e-05) + 19.4**2 * RBF(length_scale=1e+05) * ExpSineSquared(length_scale=5.16, periodicity=3.49)
2024-01-09 07:38:11,010 - POLYGP DEBUG - run_optimization - using previous study: 30 trials added. Total: 30


  0%|          | 0/1030 [00:00<?, ?it/s]

2024-01-09 07:38:20,924 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 211}, return 0.2836424276814924
2024-01-09 07:39:07,792 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 124}, return 2.3003584682216527
2024-01-09 07:39:09,865 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 139}, return 0.4708532787002113
2024-01-09 07:39:09,866 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 139}, return 2.105718486426074
2024-01-09 07:39:10,769 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 52}, return -5.21778084717803
2024-01-09 07:39:14,453 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 572}, return 15.273185571746577
2024-01-09 07:39:21,671 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 397}, return 15.272404062056534
2024-01-09 07:39:29,091 - POLYGP DEBUG - objective - Duplicated trial: {'poly_sel': 545}, return 12.563954952989743
2024-01-09 07:39:42,254 - POLYGP DEBUG - objective - Duplicated trial: {'po

GPR --- training size: 15, llh: -25.89776632309024, r2: 0.9999999998290501, 'mse: 1.69056658211551e-10, kernel: 0.00316**2 * RBF(length_scale=5.96e-05) + WhiteKernel(noise_level=1e-05) + 22.1**2 * ExpSineSquared(length_scale=5.52, periodicity=3.61),
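
The loop above builds each benchmark function from the strings stored in `df_1D`. The sketch below shows the same idea in isolation, using `eval` on the expression instead of `exec` on a full snippet. The helper name `build_benchmark` is hypothetical, and like the original approach it assumes the CSV content is trusted, since the strings are executed as code.

```python
import numpy as np


def build_benchmark(lambda_field, xrange_field):
    """Turn the 'lambda' and 'xrange' strings into a callable and a range."""
    # "f = lambda x: ..." -> "lambda x: ..."
    lambda_src = lambda_field.split("f = ", 1)[1]
    # ranges may be written with a bare "pi", as in the loop above
    xrange_src = xrange_field.replace("pi", "np.pi")
    namespace = {"np": np}
    f = eval(lambda_src, namespace)
    xrange = eval(xrange_src, namespace)
    return f, xrange


# Strings taken from the Bird-Like Function row of df_1D
f, xrange = build_benchmark(
    "f = lambda x: (2*x**4 + x**2 + 2) / (x**4 + 1)", "[-4, 4]"
)
x = np.linspace(xrange[0], xrange[1], 5)
print(f(x))
```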

In [47]:
# saving dataframe back with optimized kernels
df_1D.to_csv('./data/opti_functions_1D.csv', index=False)

In [50]:
df_1D.iloc[0]

Function Name                                   Bird-Like Function
link             /benchmarks/unconstrained/1-dimension/221-bird...
lambda              f = lambda x: (2*x**4 + x**2 + 2) / (x**4 + 1)
xrange                                                     [-4, 4]
zmin                                                           [2]
zmax                                                       [-1, 1]
markdown                   $f(x) = \frac{2x^4 + x^2 + 2}{x^4 + 1}$
Select                                                           1
kernel           0.00316**2 * RBF(length_scale=0.00114) + White...
Name: 0, dtype: object

## Uncertainty on X and Y

We end this notebook by showing how uncertainty in the measured input and target values can affect the performance of the GPR model. Depending on the magnitude of the uncertainty, reaching reasonable predictive performance may require more iterations of sequential training. This is, in fact, the most common situation in scientific experimentation.


In [54]:
# absolute (not relative) uncertainties are used so that the error stays well defined when a value is zero
exp_space_dict = {'X1' : {'values' : [-2,2], 'val_type': 'float', 'cl': 'inp', 'err': 0.05}, 
                  'Z': {'values': [0, 1], 'val_type': 'float', 'cl': 'out', 'err': 0.05}} 

X, Z = generate_data(exp_space_dict, dropwave_b, num_samples = 1_000, random_state=1)
X_train, Z_train, X_noise_std, Z_noise_std = select_training_data(X, Z, exp_space_dict, 
                                                                    dropwave_b, 
                                                                    is_error=True, 
                                                                    training_size=10, 
                                                                    random_state=1)

train_df = pd.DataFrame(np.concatenate([X_train,Z_train], axis=1), columns=['X1','Z'])
generated_df = pd.DataFrame(np.concatenate([X,Z], axis=1), columns=['X1','Z']) # full generated dataset (training points are not removed here)
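
The `err` entries in `exp_space_dict` are absolute uncertainties. The internals of `select_training_data` are not shown here, so the following is only a sketch of one plausible way to apply them, assuming zero-mean Gaussian noise with standard deviation 0.05 on both axes. `true_func` is a hypothetical stand-in for `dropwave_b`.

```python
import numpy as np

rng = np.random.default_rng(1)
x_err, z_err = 0.05, 0.05  # absolute standard deviations, as in exp_space_dict


def true_func(x):
    # hypothetical stand-in for dropwave_b, not the function used in this notebook
    return np.cos(3 * x) * np.exp(-x**2)


# "true" inputs drawn from the X1 range [-2, 2]
x_true = rng.uniform(-2, 2, size=(10, 1))

# measured values = true values + Gaussian noise (assumption)
x_meas = x_true + rng.normal(0.0, x_err, size=x_true.shape)
z_meas = true_func(x_true) + rng.normal(0.0, z_err, size=x_true.shape)

# per-point uncertainties, e.g. for drawing error bars
x_noise_std = np.full_like(x_true, x_err)
z_noise_std = np.full_like(x_true, z_err)

print(np.hstack([x_meas, z_meas])[:3])
```

Because the noise is absolute, a point whose true value is zero still receives the same 0.05 spread as any other point.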

In the example, we add an uncertainty of 0.05 to both the input _X1_ and the output _Z_ when generating the data. As we can see, ...

In [55]:
# fit initial model
init_kernel = 1 * RBF() * ExpSineSquared()
gaussian_process, llh, kernel = fit_model(train_df, ['X1'], [], ['X1'], ['Z'], init_kernel, optimizer="fmin_l_bfgs_b")

# kernel optimized with LLH
print(f"kernel: {kernel}")
print(f"llh: {llh}")
# predictions for the whole set
mean_prediction, std_prediction = gaussian_process.predict(generated_df[['X1']], return_std=True)
mean_prediction = mean_prediction.reshape(-1,1)
std_prediction = std_prediction.reshape(-1,1)

mse_all = mean_squared_error(generated_df[['Z']], mean_prediction)
r2_all = r2_score(generated_df[['Z']], mean_prediction)

print(f'r2: {r2_all}')   
print(f'mse: {mse_all}')  

gp_plot = create_gp_regression_plot(X, Z, X_train, Z_train, X_noise_std, Z_noise_std, mean_prediction, std_prediction)
gp_plot


Fitting with kernel: 1**2 * RBF(length_scale=1) * ExpSineSquared(length_scale=1, periodicity=1). Training kernel - kernel will change
kernel: 0.646**2 * RBF(length_scale=3.94) * ExpSineSquared(length_scale=0.631, periodicity=0.974)
llh: -5.718850298614291
r2: 0.05888873710504161
mse: 0.11263979575947229
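
To put these scores in context, an R2 of about 0.06 means the model explains little more than a constant prediction equal to the mean of `Z`. The sketch below re-implements both metrics for a single output (the helper names `mse` and `r2` are ours, written to match the standard definitions) and uses the identity R2 = 1 - MSE / Var(Z) to recover the variance of `Z` implied by the numbers printed above.

```python
import numpy as np


def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# A constant predictor at the mean scores R2 = 0 by construction.
y = np.array([0.0, 1.0, 2.0, 3.0])
baseline = np.full_like(y, y.mean())
print(r2(y, baseline), mse(y, baseline))

# Scores printed by the cell above
r2_all, mse_all = 0.05888873710504161, 0.11263979575947229
implied_variance = mse_all / (1.0 - r2_all)  # variance of Z over the generated set
print(implied_variance)
```

The implied variance is roughly 0.12, so the reported MSE of about 0.113 is only slightly below what the mean predictor would achieve.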


In [None]:
#jupyter nbconvert --to html --TagRemovePreprocessor.remove_input_tags="notvisible" experimenting_with_gpr.ipynb --output 01_13_2024_polygp_sklearn_2.html


Copyright (c) 2024 Alessio Tamburro

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


[MIT license](https://choosealicense.com/licenses/mit/)