# 121-dataset tabular benchmark

**Hyperparameter Tuning:** We first describe the experiments for the 120 of the 121 datasets that have fewer than 130,000 examples. The largest dataset is handled separately because we used EigenPro [52] to train kernels on it.

For all kernel methods (RFMs, Laplace kernel and NTK), we grid search over the ridge regularization parameter in the set $\{10, 1, 0.1, 0.01, 0.001\}$. For RFMs, we grid search over the number of iterations (up to 5), and we use a bandwidth of $L=10$ for all Laplace kernels.
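As a minimal sketch of such a ridge grid search (not the code used for these experiments), the snippet below selects the ridge value by validation accuracy using scikit-learn's `KernelRidge` with a Laplacian kernel. The synthetic data, the single validation split, the 0.5 decision threshold, and the mapping `gamma = 1 / L` from the bandwidth to scikit-learn's parameter are assumptions made for illustration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split


def ridge_grid_search(X_tr, y_tr, X_val, y_val,
                      ridges=(10, 1, 0.1, 0.01, 0.001), L=10.0):
    """Return (best_ridge, best_val_accuracy) for a Laplacian-kernel ridge model."""
    best_ridge, best_acc = None, -np.inf
    for ridge in ridges:
        # gamma = 1 / L is an assumed mapping from bandwidth to sklearn's gamma
        model = KernelRidge(alpha=ridge, kernel="laplacian", gamma=1.0 / L)
        model.fit(X_tr, y_tr)
        # threshold the real-valued regression output to get class labels
        acc = np.mean((model.predict(X_val) > 0.5) == (y_val > 0.5))
        if acc > best_acc:
            best_ridge, best_acc = ridge, acc
    return best_ridge, best_acc


# synthetic binary classification problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_ridge, best_acc = ridge_grid_search(X_tr, y_tr, X_val, y_val)
print("best ridge:", best_ridge, "validation accuracy:", round(best_acc, 3))
```

The same loop extends to the other grid dimensions (for example, the number of RFM iterations) by nesting or by enumerating parameter combinations.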

For NTK ridge regression experiments, we grid search over NTKs corresponding to ReLU networks with between 1 and 5 hidden layers. 

For the dataset with 130,000 samples, we use EigenPro to train all kernel methods and RFMs. We run EigenPro for at most 50 iterations and report test accuracy at the iteration with the best validation accuracy. For small datasets (i.e., those with fewer than 5,000 samples), we grid search over updating only the diagonal of $M$ versus updating the entire matrix $M$. Lastly, for all kernel methods and RFMs, we grid search over whether to normalize the data to the unit sphere. One dataset (balance-scale) contains a data point with norm 0, so we did not grid search over normalization for this dataset.
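The zero-norm caveat can be made concrete with a small sketch of unit-sphere normalization. The helper name `normalize_to_unit_sphere` is hypothetical, and raising an error on a zero-norm row is one possible policy, chosen here only to illustrate why such a dataset has to skip this step.

```python
import numpy as np


def normalize_to_unit_sphere(X):
    """Scale each row of X to unit Euclidean norm.

    Raises ValueError if any row has norm 0, since such a row
    cannot be projected onto the unit sphere.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("found a zero-norm row; skip normalization for this dataset")
    return X / norms


X_raw = np.array([[3.0, 4.0], [0.0, 2.0]])
X_unit = normalize_to_unit_sphere(X_raw)
print(np.linalg.norm(X_unit, axis=1))
```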

**Evaluation Scores:**
<!-- 
| Dataset name | size | features | description |
|--------------|------|----------|-------------|
|california    | 20634 | 8       | https://www.dcc.fc.up.pt/ltorgo/Regression/cal_housing.html |
|Diabetes130US | 71090 | 7       | https://www.openml.org/d/4541 |
| jannis       | 57580 | 54      | https://www.openml.org/search?type=data&sort=runs&id=41168&status=active |
| covertype    | 566602 | 10     | https://www.openml.org/search?type=data&sort=runs&id=293&status=active |
| Higgs	       | 940160 | 24     | https://www.openml.org/search?type=data&sort=runs&id=42769&status=active | -->

In [1]:
# install using `conda install -c conda-forge line_profiler`
%load_ext line_profiler
%load_ext autoreload
%autoreload 2

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score
import numpy as np
import pandas as pd
from copy import deepcopy

# utils for plotting
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

# utils for kernel ridge regression
from goodpoints.krr.util_estimators import get_estimator, get_sigma_heuristic
# utils for evaluating kernels
from goodpoints.krr.util_k_mmd import kernel_eval, to_regression_kernel
# utils for generate samples from the data distribution
from goodpoints.krr.util_sample import get_Xy #, ToyData , get_toy_dataset, logistic
from goodpoints.krr.util_load_data import get_real_dataset
# utils for dataset thinning
from goodpoints.krr.util_thin import sd_thin, kt_thin2

In [3]:
# add this to be able to render plotly plots in non-vscode notebooks
import plotly.io as pio
pio.renderers.default = "notebook_connected"

In [4]:
# helper functions
def sample(arr, n=1000):
    return arr[np.random.choice(len(arr), n, replace=True)]
def histogram(arr, height=400, width=600):
    return px.histogram(arr, width=width, height=height)

def classification_accuracy(labels, pred):
    decision = pred.copy()
    # implement classification rule
    decision[decision > 0.5] = 1
    decision[decision <= 0.5] = 0
    return accuracy_score(labels, decision)

## Set hyperparameters

In [5]:
### Regression parameters

kernel = 'laplace'  # ['gauss', 'laplace']
sigma = 10
alpha = 1e-3 # 1.0

### RFM parameters
rfm_iters = 2

### Experiment parameters

k_fold = 5      # k >= 2
n_repeats = 10
use_cross_validation = False

n_jobs = 2 # -1 = use all CPUs
save = False

### Thinning parameters

m = None # Thinned dataset will have size n/2**m

In [6]:
# Determine auxiliary parameters

task = 'classification'
refit = 'accuracy'
postprocess = 'threshold'
ydim = 1

Kernels:
- RBF:
$$\mathbf{k}(x, y) = \exp(-\gamma ||x-y||_2^2)$$
- Laplacian:
$$\mathbf{k}(x, y) = \exp(-\gamma ||x-y||_1)$$

Median heuristic for choosing the bandwidth parameter, i.e., the median of the (squared) pairwise distances:
- For Gaussian data, we can compute this approximately in closed form. Assume $X_1, X_2\sim \mathcal{N}(0,\sigma^2 I_d)$ independently, so that $X_1-X_2\sim \mathcal{N}(0,2\sigma^2 I_d)$. For the RBF kernel, $||X_1-X_2||_2^2/(2\sigma^2)$ follows a chi-squared distribution with $d$ degrees of freedom, so $||X_1-X_2||_2^2$ has mean $2\sigma^2 d$ and median roughly $2\sigma^2 d(1-\frac{2}{9d})^3$. For the Laplacian kernel, each coordinate of $|X_1-X_2|$ follows a folded normal distribution (https://en.wikipedia.org/wiki/Folded_normal_distribution) with mean $2\sigma/\sqrt{\pi}$, so $||X_1-X_2||_1$ has mean $2\sigma d/\sqrt{\pi}$, which also approximates its median for large $d$.
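As a quick numerical sanity check of the Gaussian case (a sketch, not part of the original experiments), the snippet below draws independent pairs from $\mathcal{N}(0,\sigma^2 I_d)$ and compares the empirical median of $||X_1-X_2||_2^2$ with the Wilson-Hilferty approximation $2\sigma^2 d(1-\frac{2}{9d})^3$, which follows from $||X_1-X_2||_2^2/(2\sigma^2)$ being chi-squared with $d$ degrees of freedom. The values of `d`, `sigma_demo`, and the sample size are arbitrary choices.

```python
import numpy as np


def median_sq_dist_approx(d, sigma):
    """Wilson-Hilferty approximation to the median of ||X1 - X2||_2^2.

    Uses the fact that ||X1 - X2||_2^2 / (2 sigma^2) is chi-squared
    with d degrees of freedom when X1, X2 ~ N(0, sigma^2 I_d).
    """
    return 2.0 * sigma**2 * d * (1.0 - 2.0 / (9.0 * d)) ** 3


rng = np.random.default_rng(0)
d, sigma_demo, n_pairs = 8, 1.5, 200_000
X1 = rng.normal(scale=sigma_demo, size=(n_pairs, d))
X2 = rng.normal(scale=sigma_demo, size=(n_pairs, d))

empirical_median = np.median(np.sum((X1 - X2) ** 2, axis=1))
approx_median = median_sq_dist_approx(d, sigma_demo)
print("empirical:", empirical_median, "approximation:", approx_median)
```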

Available kernels in sklearn: 
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise

## Get dataset

For the list of available subsets and their OpenML links:
https://huggingface.co/datasets/inria-soda/tabular-benchmark

```
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103185/credit.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361055
Task URL.............: https://www.openml.org/t/361055
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: SeriousDlqin2yrs
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103245/electricity.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361060
Task URL.............: https://www.openml.org/t/361060
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103246/covertype.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361061
Task URL.............: https://www.openml.org/t/361061
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: Y
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103247/pol.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361062
Task URL.............: https://www.openml.org/t/361062
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: binaryClass
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103248/house_16H.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361063
Task URL.............: https://www.openml.org/t/361063
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: binaryClass
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103250/MagicTelescope.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361065
Task URL.............: https://www.openml.org/t/361065
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103251/bank-marketing.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361066
Task URL.............: https://www.openml.org/t/361066
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: Class
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103253/MiniBooNE.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361068
Task URL.............: https://www.openml.org/t/361068
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: signal
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103254/Higgs.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361069
Task URL.............: https://www.openml.org/t/361069
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: target
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22103255/eye_movements.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361070
Task URL.............: https://www.openml.org/t/361070
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: label
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22111908/Diabetes130US.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361273
Task URL.............: https://www.openml.org/t/361273
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: readmitted
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22111907/jannis.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361274
Task URL.............: https://www.openml.org/t/361274
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22111906/default-of-credit-card-clients.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361275
Task URL.............: https://www.openml.org/t/361275
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: y
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22111905/Bioresponse.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361276
Task URL.............: https://www.openml.org/t/361276
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: target
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22111914/california.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361277
Task URL.............: https://www.openml.org/t/361277
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: price_above_median
# of Classes.........: 2
Cost Matrix..........: Available
WARNING:root:Received uncompressed content from OpenML for https://api.openml.org/data/v1/download/22111912/heloc.arff.
OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 361278
Task URL.............: https://www.openml.org/t/361278
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: RiskPerformance
# of Classes.........: 2
Cost Matrix..........: Available

```

In [7]:
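import os
import openml
# Read the OpenML API key from the environment instead of hard-coding it;
# `OPENML_APIKEY` is an illustrative variable name, not an OpenML convention
openml.config.apikey = os.environ.get('OPENML_APIKEY', '')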

# # SUITE_ID = 336 # Regression on numerical features
# SUITE_ID = 337 # Classification on numerical features
# #SUITE_ID = 335 # Regression on numerical and categorical features
# #SUITE_ID = 334 # Classification on numerical and categorical features
# benchmark_suite = openml.study.get_suite(SUITE_ID)  # obtain the benchmark suite
# for task_id in benchmark_suite.tasks:  # iterate over all tasks
#     task = openml.tasks.get_task(task_id)  # download the OpenML task
#     dataset = task.get_dataset()
#     X, y, categorical_indicator, attribute_names = dataset.get_data(
#         dataset_format="dataframe", target=dataset.default_target_attribute
#     )

dataset = openml.tasks.get_task(361277).get_dataset() # download the OpenML dataset
print(task)  # NOTE: `task` is the string set in the cell above ('classification'), not an OpenML task object

classification


In [8]:
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
print('converting types')
X = X.astype(np.float64)
y = y.astype(np.float64)

print('normalizing X')
X_mean = X.mean(0, keepdims=True)
X_std = X.std(0, keepdims=True)
X -= X_mean
X /= X_std

converting types
normalizing X
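The normalization above divides by the per-feature standard deviation, which would produce NaN or inf values for a constant column. A defensive variant is sketched below with a hypothetical helper `standardize`; leaving constant columns centered but unscaled is an assumption about the desired behavior, not something the notebook does.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit variance.

    Constant columns (standard deviation 0) are centered but not scaled,
    which avoids division by zero.
    """
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    safe_std = np.where(std > 0, std, 1.0)
    return (X - mean) / safe_std


# second column is constant on purpose
X_demo = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
X_demo_std = standardize(X_demo)
print(X_demo_std)
```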


In [9]:
X.shape, y.shape

((20634, 8), (20634,))

In [10]:
X[:10]

array([[-8.88588377e-01, -2.09816541e-01, -3.66863611e-01,
        -3.69682934e-01, -9.89654814e-01, -8.58627479e-02,
         2.06878536e+00, -1.26304969e+00],
       [-4.18631666e-01,  2.66967114e-01, -3.25786175e-01,
        -2.39017596e-01,  2.15756595e+00,  1.47184040e-01,
        -1.33995416e+00,  1.25266373e+00],
       [-1.07766586e+00,  9.02678654e-01, -2.94229551e-01,
         6.28993468e-02, -4.39354916e-01,  4.67020988e-02,
         9.91848597e-01, -1.29300125e+00],
       [-1.26769093e+00,  6.64286826e-01, -4.77432337e-01,
         1.89394221e-02,  2.18708364e-01,  5.15203017e-02,
        -7.68708685e-01,  6.43700852e-01],
       [-1.03408122e+00, -1.24284779e+00, -7.98617715e-01,
        -2.36861393e-01, -1.00643764e+00,  1.34617040e-01,
        -7.78073570e-01,  7.03596342e-01],
       [ 2.21325638e-03,  1.61785414e+00, -2.00243811e-01,
        -2.50522920e-01, -8.45675868e-01, -4.66739360e-02,
         1.00589503e+00, -1.29799000e+00],
       [-1.19678682e+00,  4.258949

In [11]:
y[:10]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=(k_fold-1)/k_fold, 
                                                    shuffle=True, random_state=42)

In [13]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(16507, 8) (16507,)
(4127, 8) (4127,)


In [14]:
histogram(np.linalg.norm(X_train, axis=1, ord=2))

In [15]:
heur_sigma, distances = get_sigma_heuristic(X_train, sample_size=200, return_dist=True)
print('heuristic bandwidth:', heur_sigma)

distances: (200, 200)
heuristic bandwidth: 3.0185240415192833


In [16]:
histogram(sample(distances, 10000))

In [17]:
if ydim == 1:
    fig = histogram(y_train)
else:
    fig = histogram(np.argmax(y_train, axis=-1))
fig.show()

### Standard Thinning (ST)

In [18]:
%%time
sd_coreset = sd_thin(X_train, m=m)
print('sd coreset:', len(sd_coreset))
X_train_sd_thin, y_train_sd_thin = X_train[sd_coreset], y_train[sd_coreset]

sd coreset: 128
CPU times: user 306 µs, sys: 87 µs, total: 393 µs
Wall time: 331 µs
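For intuition, standard thinning is commonly understood as keeping evenly spaced points from the input sequence. The sketch below illustrates that idea with a hypothetical helper `standard_thin`; it is not the implementation of `sd_thin`, and the rule `sd_thin` uses to pick the output size when `m=None` is not shown in this notebook, so the target size is passed explicitly here.

```python
import numpy as np


def standard_thin(n, out_size):
    """Return indices of `out_size` evenly spaced points among range(n)."""
    if not 0 < out_size <= n:
        raise ValueError("out_size must satisfy 0 < out_size <= n")
    step = n // out_size
    # keep the last point of each consecutive block of `step` points
    return np.arange(step - 1, step * out_size, step)


idx = standard_thin(16507, 128)
print(len(idx), idx[:3], idx[-1])
```

Because no kernel evaluations are involved, this kind of selection costs almost nothing, which is consistent with the sub-millisecond timing recorded above.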


### Kernel Thinning (KT)

In [19]:
from functools import partial

# KERNEL THINNING

# Define kernel params
d = X_train.shape[-1]
var_k = sigma**2
params_k_swap = {"name": kernel, "var": var_k, "d": int(d)}
params_k_split = {"name": kernel, "var": var_k, "d": int(d)}

split_kernel = partial(kernel_eval, params_k=params_k_split)
swap_kernel = partial(kernel_eval, params_k=params_k_swap)

regression_split_kernel = to_regression_kernel(split_kernel, ydim=ydim)
regression_swap_kernel = to_regression_kernel(swap_kernel, ydim=ydim)

In [20]:
Xy_train = get_Xy(X_train, y_train)
print(Xy_train.shape)


(16507, 9)


In [21]:
# %lprun -f kt_thin3 kt_coreset = kt_thin3(X_train, split_kernel, swap_kernel, m=m)

In [22]:
# from goodpoints.compress import compress_gsn_kt
# X_intermediate = compress_gsn_kt(X_train)

In [23]:
# from goodpoints import compress
# %lprun -f compress.compresspp ktr_coreset = kt_thin2(Xy_train, regression_split_kernel, regression_swap_kernel, m=m, store_K=True)

| n | 5,000 | 20,000 |
| -------- | -------- | -------- |
| store_K=True | 7.9s | 46.9s |
| store_K=False | 20.8s | 1m59s |

In [24]:
# X_train_ktr_thin, y_train_ktr_thin = X_train[ktr_coreset], y_train[ktr_coreset]

In [25]:
# X_train_ktr_thin.shape

In [26]:
# print('n:', len(Xy_train))
# log2n = int(np.log2(len(Xy_train)))
# log4n = int(np.log2(len(Xy_train)) / 2)
# print('log2n:', log2n)
# print('log4n:', log4n) 

# print('2^log2n:', 2**log2n)
# print('4^log4n:', 4**log4n)

# for i in range(log2n // 2 + 1):
#     with TicToc():
#         print(i, kt_thin2(Xy_train, regression_split_kernel, regression_swap_kernel, m=i).shape[0])
#         print(i, sd_thin(X_train, m=i).shape[0])

## KRR (Full)

In [27]:
krr_full = get_estimator(
    'regression',
    'full', 
    alpha=alpha, 
    kernel=kernel, 
    sigma=sigma, 
    postprocess=None, # no postprocessing so that we can compute the MSE
)

In [28]:
krr_full

In [29]:
%%time
K_full = krr_full.fit(X_train, y_train)

distances: (16507, 16507)
CPU times: user 1min 49s, sys: 6.46 s, total: 1min 55s
Wall time: 17.9 s
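The cost of the full fit comes from solving a linear system in the $n \times n$ kernel matrix. A minimal sketch of that computation follows, assuming the standard kernel ridge objective whose solution is $(K + \alpha I)^{-1} y$ (the convention scikit-learn's `KernelRidge` uses). Whether `get_estimator` scales `alpha` the same way is an assumption, and the synthetic data and the names `alpha_demo` and `gamma_demo` are illustrative.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import laplacian_kernel

rng = np.random.default_rng(0)
X_tr_demo = rng.normal(size=(200, 8))
X_te_demo = rng.normal(size=(50, 8))
y_tr_demo = (X_tr_demo[:, 0] > 0).astype(float)
alpha_demo, gamma_demo = 1e-3, 1.0 / 10.0

# closed-form solution: coefficients solve (K + alpha I) c = y
K_demo = laplacian_kernel(X_tr_demo, X_tr_demo, gamma=gamma_demo)
coef = np.linalg.solve(K_demo + alpha_demo * np.eye(len(X_tr_demo)), y_tr_demo)
pred_manual = laplacian_kernel(X_te_demo, X_tr_demo, gamma=gamma_demo) @ coef

# reference: scikit-learn's KernelRidge with the same kernel and alpha
reference = KernelRidge(alpha=alpha_demo, kernel="laplacian", gamma=gamma_demo)
pred_sklearn = reference.fit(X_tr_demo, y_tr_demo).predict(X_te_demo)
print("max abs difference:", np.max(np.abs(pred_manual - pred_sklearn)))
```

The solve scales cubically in the number of training points, which is the motivation for the thinned variants in the following sections.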


In [30]:
histogram(sample(K_full.flatten(), n=10000))

In [31]:
pred_full = krr_full.predict(X_test)
train_pred_full = krr_full.predict(X_train)

distances: (16507, 4127)
distances: (16507, 16507)


In [32]:
fig = make_subplots(rows=2, cols=1, subplot_titles=['train', 'test'])

fig.add_trace(go.Histogram(x=train_pred_full.flatten(), name='train', opacity=0.5), row=1, col=1)
fig.add_trace(go.Histogram(x=y_train.flatten(), name='ground truth', opacity=0.5, legendgroup=1), row=1, col=1)

fig.add_trace(go.Histogram(x=pred_full.flatten(), name='test', opacity=0.5), row=2, col=1)
fig.add_trace(go.Histogram(x=y_test.flatten(), name='ground truth', opacity=0.5, legendgroup=1), row=2, col=1)
fig.show()

In [33]:
print('Train acc:', classification_accuracy(y_train, train_pred_full))
print('acc:', classification_accuracy(y_test, pred_full))
print()
print('Train MSE:', mean_squared_error(y_train, train_pred_full))
print('MSE:', mean_squared_error(y_test, pred_full))

Train acc: 1.0
acc: 0.8684274291252726

Train MSE: 0.0001666926275040645
MSE: 0.09648699338000194


In [34]:
histogram(krr_full.sol_)

In [35]:
len(krr_full.sol_)

16507

## KRR + ST

In [36]:
krr_sd_thin = get_estimator(
    'regression', 
    'st', 
    alpha=alpha / np.power(len(X_train), 1/4), 
    kernel=kernel, 
    sigma=sigma, 
    m=m, 
    postprocess=None
)

In [37]:
%%time
krr_sd_thin.fit(X_train, y_train)

distances: (128, 128)
CPU times: user 222 ms, sys: 53.8 ms, total: 276 ms
Wall time: 38 ms


In [38]:
krr_sd_thin.X_fit_.shape

(128, 8)

In [39]:
%%time
pred_sd = krr_sd_thin.predict(X_test)
train_pred_sd = krr_sd_thin.predict(X_train)


distances: (128, 4127)
distances: (128, 16507)
CPU times: user 509 ms, sys: 215 ms, total: 725 ms
Wall time: 103 ms


In [40]:
print('Train acc:', classification_accuracy(y_train, train_pred_sd))
print('acc:', classification_accuracy(y_test, pred_sd))
print()
print('train MSE:', mean_squared_error(y_train, krr_sd_thin.predict(X_train)))
print('MSE:', mean_squared_error(y_test, pred_sd))

Train acc: 0.7982674017083662
acc: 0.7908892658105161

distances: (128, 16507)
train MSE: 0.1391909775594185
MSE: 0.14206295876010291


## KRR + KT

In [41]:
krr_kt_thin = get_estimator(
    'regression',
    'kt', 
    kernel=kernel, 
    alpha=alpha / np.power(len(X_train), 1/4), 
    sigma=sigma, 
    m=m, 
    postprocess=None,
    ydim=ydim,
)

In [42]:
%%time
krr_kt_thin.fit(X_train, y_train)

# To run line profiler, uncomment the next line
# %lprun -f krr_kt_thin.fit krr_kt_thin.fit(X_train, y_train)

distances: (128, 128)
CPU times: user 1.17 s, sys: 155 ms, total: 1.33 s
Wall time: 768 ms


In [43]:
krr_kt_thin.X_fit_.shape

(128, 8)

In [44]:
%%time
pred_kt = krr_kt_thin.predict(X_test)
train_pred_kt = krr_kt_thin.predict(X_train)


distances: (128, 4127)
distances: (128, 16507)
CPU times: user 471 ms, sys: 70.1 ms, total: 541 ms
Wall time: 79.4 ms


In [45]:
print('Train acc:', classification_accuracy(y_train, train_pred_kt))
print('acc:', classification_accuracy(y_test, pred_kt))
print()
print('train MSE:', mean_squared_error(y_train, krr_kt_thin.predict(X_train)))
print('MSE:', mean_squared_error(y_test, pred_kt))

Train acc: 0.8100805718785969
acc: 0.803489217349164

distances: (128, 16507)
train MSE: 0.13401787964721784
MSE: 0.1387261863358149


## RFM

Note: changing the bandwidth for RFM makes little difference to the results, since increasing the bandwidth simply leads to larger weight values. It does, however, make a big difference to numerical stability, so it is better to keep the default bandwidth $L=10$.
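One possible way to see why the bandwidth is largely interchangeable with scale: assuming the RFM kernel has the Mahalanobis Laplace form $K_M(x,z)=\exp(-||x-z||_M/L)$ with $||v||_M=\sqrt{v^\top M v}$ (an assumption about the estimator used here), bandwidth $L$ with matrix $M$ gives exactly the same kernel as bandwidth 1 with matrix $M/L^2$. The helper `mahalanobis_laplace_kernel` below is hypothetical and only checks this identity numerically.

```python
import numpy as np


def mahalanobis_laplace_kernel(X, Z, M, L):
    """K[i, j] = exp(-sqrt((x_i - z_j)^T M (x_i - z_j)) / L) for a PSD matrix M."""
    diff = X[:, None, :] - Z[None, :, :]
    sq = np.einsum("ijk,kl,ijl->ij", diff, M, diff)
    # clip tiny negative values caused by floating-point round-off
    return np.exp(-np.sqrt(np.clip(sq, 0.0, None)) / L)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 4))
A = rng.normal(size=(4, 4))
M_demo = A @ A.T  # random positive semi-definite matrix

K_a = mahalanobis_laplace_kernel(X_demo, X_demo, M_demo, L=10.0)
K_b = mahalanobis_laplace_kernel(X_demo, X_demo, M_demo / 10.0**2, L=1.0)
print("max abs difference:", np.max(np.abs(K_a - K_b)))
```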

In [46]:
rfm = get_estimator(
    'regression', 
    'rfm', 
    alpha=alpha, 
    kernel=kernel, 
    sigma=sigma,
    iters=rfm_iters,
    ydim=ydim,
)

In [48]:
rfm

In [49]:
Ms, mses, preds = rfm.fit(
    X_train, y_train, 
    val_data=(X_test, y_test),
)

Round 0, Test MSE: 0.0965
Using batch size of 4032


Round 1, Test MSE: 0.0853
Using batch size of 4032


Final MSE: 0.0812


In [50]:
# plot the feature matrices Ms, one subplot per iteration
fig = make_subplots(rows=1, cols=len(Ms), subplot_titles=[f'iter {i}' for i in range(len(Ms))])
for i, M in enumerate(Ms):
    # add each matrix as a heatmap
    fig.add_trace(go.Heatmap(z=M, showlegend=False), row=1, col=i+1)
# set the layout once, after all traces have been added
fig.update_layout(height=400, width=1000, title_text="Feature matrix per iteration")
fig.show()

In [51]:
histogram(rfm._model.weights)

In [52]:
%%time
pred_rfm = rfm.predict(X_test)
train_pred_rfm = rfm.predict(X_train)

CPU times: user 2.47 s, sys: 273 ms, total: 2.74 s
Wall time: 643 ms


In [53]:
pred_rfm

array([[ 0.03534168],
       [-0.01080523],
       [ 0.33392749],
       ...,
       [ 0.0506293 ],
       [ 0.75887035],
       [ 0.80956088]])

In [54]:
print('Train acc:', classification_accuracy(y_train, train_pred_rfm))
print('acc:', classification_accuracy(y_test, pred_rfm))
print()
print('train MSE:', mean_squared_error(y_train, train_pred_rfm))
print('MSE:', mean_squared_error(y_test, pred_rfm))

Train acc: 1.0
acc: 0.8878119699539617

train MSE: 0.0004889650541787932
MSE: 0.08122475537500312


## KRR + KT + Feature Learning

In [55]:
krr_kf_thin = get_estimator(
    'regression',
    'kf', 
    kernel=kernel, 
    alpha=alpha, # / np.power(len(X_train), 1/4), 
    sigma=10, 
    m=m, 
    postprocess=None,
    ydim=ydim,
    rfm_iters=rfm_iters,
)

In [56]:
krr_kf_thin

In [57]:
%%time
K = krr_kf_thin.fit(X_train, y_train)

learning feature matrix...
Round 0, Test MSE: 0.0002
Using batch size of 4032


Round 1, Test MSE: 0.0004
Using batch size of 4032


Final MSE: 0.0005
CPU times: user 2min 8s, sys: 13.5 s, total: 2min 22s
Wall time: 28.6 s


In [58]:
fig = go.Figure(data=[go.Heatmap(z=krr_kf_thin.M)])
fig.update_layout(height=400, width=400, title_text="Feature matrix")
fig.show()

In [59]:
krr_kf_thin.X_fit_.shape

(128, 8)

In [60]:
K.shape

(128, 128)

In [61]:
histogram(K.flatten())

In [62]:
histogram(krr_kf_thin.sol_)

In [63]:
%%time
pred_kf = krr_kf_thin.predict(X_test)
train_pred_kf = krr_kf_thin.predict(X_train)

CPU times: user 99.9 ms, sys: 110 ms, total: 209 ms
Wall time: 25 ms


In [64]:
print('Train acc:', classification_accuracy(y_train, train_pred_kf))
print('acc:', classification_accuracy(y_test, pred_kf))
print()
print('train MSE:', mean_squared_error(y_train, train_pred_kf))
print('MSE:', mean_squared_error(y_test, pred_kf))

Train acc: 0.8432786090749379
acc: 0.8429852192876182

train MSE: 0.11314729753739665
MSE: 0.11393526899400219


## RFM-Thin

In [65]:
rfm_thin = get_estimator(
    'regression', 
    'rfm', 
    alpha=alpha, 
    kernel=kernel, 
    sigma=sigma,
    iters=rfm_iters,
    ydim=ydim,
    use_kt = True,
)

In [66]:
Ms, mses, preds = rfm_thin.fit(
    X_train, y_train, 
    val_data=(X_test, y_test),
)

Using kernel thinning to select centers...
Round 0, Test MSE: 0.1342
Using batch size of 4032


Using kernel thinning to select centers...
Round 1, Test MSE: 0.1204
Using batch size of 4032


Using kernel thinning to select centers...
Final MSE: 0.1346


In [67]:
%%time
pred_rfm_thin = rfm_thin.predict(X_test)
train_pred_rfm_thin = rfm_thin.predict(X_train)

CPU times: user 20.3 ms, sys: 4.65 ms, total: 25 ms
Wall time: 7.67 ms


In [68]:
print('Train acc:', classification_accuracy(y_train, train_pred_rfm_thin))
print('acc:', classification_accuracy(y_test, pred_rfm_thin))
print()
print('train MSE:', mean_squared_error(y_train, train_pred_rfm_thin))
print('MSE:', mean_squared_error(y_test, pred_rfm_thin))

Train acc: 0.8123220451929485
acc: 0.8017930700266538

train MSE: 0.13094962295346568
MSE: 0.13463501100479455


## FALKON

In [69]:
krr_falkon = get_estimator(
    task,
    'falkon',
    kernel=kernel,
    sigma=sigma,
    alpha=alpha,
    m=m,
    postprocess=postprocess,
)

No module named 'falkon'


In [70]:
%%time
if krr_falkon:
    krr_falkon.fit(X_train, y_train)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 2.86 µs


In [71]:
%%time
if krr_falkon:
    pred_falkon = krr_falkon.predict(X_test)
    train_pred_falkon = krr_falkon.predict(X_train)

    print('Train acc:', classification_accuracy(y_train, train_pred_falkon))
    print('acc:', classification_accuracy(y_test, pred_falkon))
    print()
    print('train MSE:', mean_squared_error(y_train, train_pred_falkon))
    print('MSE:', mean_squared_error(y_test, pred_falkon))

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 2.86 µs


## FALKON + KT

In [72]:
# krr_falkon_kt = get_estimator(
#     task,
#     'falkon+kt',
#     kernel=kernel,
#     sigma=sigma,
#     alpha=alpha,
#     m=m,
#     postprocess=postprocess,
#     ydim=ydim,
# )

In [73]:
# %lprun -f krr_falkon_kt.fit krr_falkon_kt.fit(X_train, y_train)

In [74]:
# %%time
# if krr_falkon_kt:
#     pred_falkon_kt = krr_falkon_kt.predict(X_test)
#     print('Score:', accuracy_score(y_test, pred_falkon_kt))
#     print('RMSE:', np.sqrt(mean_squared_error(y_test, pred_falkon_kt)))

## Run experiment

We now run a full grid search with cross-validation across datasets of different sizes.

Reference: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_hist_grad_boosting_comparison.html#sphx-glr-auto-examples-ensemble-plot-forest-hist-grad-boosting-comparison-py
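The experiment loop further down repeats the *same* k-fold split several times, so that any variation across repeats comes from the estimator and not from the data split. A minimal sketch of that pattern with scikit-learn is shown below. The synthetic data, the `KernelRidge` estimator, the parameter grid, and the names `k_fold_demo` and `trials_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=120)

k_fold_demo, trials_demo = 5, 3
# materialize the folds once, then repeat the same list `trials_demo` times
splits = list(KFold(n_splits=k_fold_demo).split(X_demo)) * trials_demo

search = GridSearchCV(
    estimator=KernelRidge(kernel="laplacian"),
    param_grid={"alpha": [1e-3, 1e-1], "gamma": [0.1, 1.0]},
    cv=splits,
    scoring="neg_mean_squared_error",
    refit=False,
).fit(X_demo, y_demo)

# one score column per (fold, repeat) pair
n_split_columns = sum(
    key.startswith("split") and key.endswith("_test_score")
    for key in search.cv_results_
)
print(n_split_columns, search.best_params_)
```

With a deterministic estimator such as the one in this sketch, the repeated folds give identical scores. The repeats only carry information when the estimator has internal randomness, as the thinning-based models do.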

In [75]:
from sklearn.model_selection import GridSearchCV, KFold

In [76]:
# NOTE: these will only be applied if `use_cross_validation` is True
# Default param grid to search for each model
default_param_grid = {
    "sigma" :   [1,2,5,10,20], 
    "alpha" :   [1e-3,1e-4,1e-5],
}
# falkon_param_grid = {
#     "sigma" :   [0.05,0.1, 0.2, 0.5], 
#     "alpha" :   [1e-3,1e-4,1e-5], # Falkon requires smaller alpha
# }
falkon_param_grid = default_param_grid

rfm_param_grid = {
    "sigma" :   [10,], 
    "alpha" :   [1e-3,1e-4,1e-5],
    # "iters" :   [1,2,3],
}

In [77]:
# The different values will correspond to different columns in the final plots
varying_variable = 'kernel'
varying_variable_values = ['gauss', 'laplace',]
datasets = ['housing',]

In [78]:
# Model constructors and data size for each model
# We allow for different data sizes to avoid running Full KR on large datasets
model_configs = {
    'full' : {
        'dataset' : datasets,
        'kwargs': {
            'postprocess' : postprocess
        },
        'param_grid' : default_param_grid
    },
}

# for m in [None,]:
model_configs['st'] = {
    'dataset' : datasets,
    'kwargs' : {
        'm' : m,
        'postprocess' : postprocess
    },
    'param_grid' : default_param_grid
}

model_configs['kt'] = {
    'dataset' : datasets,
    'kwargs' : {
        'm' : m,
        'postprocess' : postprocess,
        'ydim' : ydim,
    },
    'param_grid' : default_param_grid
}

model_configs['falkon'] = {
    'dataset' : datasets,
    'kwargs' : {
        'm' : m,
        'postprocess' : postprocess,
    },
    'param_grid' : falkon_param_grid
}

# model_configs[f'falkon+kt_{m}'] = {
#     'dataset' : datasets,
#     'kwargs' : {
#         'm' : m,
#         'postprocess' : postprocess,
#         'ydim' : ydim,
#     },
#     'param_grid' : falkon_param_grid
# }
model_configs['rfm'] = {
    'dataset' : datasets,
    'kwargs' : {
        'iters' : rfm_iters,
        'postprocess' : postprocess,
    },
    'param_grid' : rfm_param_grid
}

model_configs['kf'] = {
    'dataset' : datasets,
    'kwargs' : {
        'm' : m,
        'postprocess' : postprocess,
        'ydim' : ydim,
        'rfm_iters' : rfm_iters,
    },
    'param_grid' : rfm_param_grid
}

model_configs['rfm-thin'] = {
    'dataset' : datasets,
    'kwargs' : {
        'iters' : rfm_iters,
        'use_kt' : True,
        'postprocess' : postprocess,
    },
    'param_grid' : rfm_param_grid
}

In [79]:
model_configs

{'full': {'dataset': ['housing'],
  'kwargs': {'postprocess': 'threshold'},
  'param_grid': {'sigma': [1, 2, 5, 10, 20], 'alpha': [0.001, 0.0001, 1e-05]}},
 'st': {'dataset': ['housing'],
  'kwargs': {'m': None, 'postprocess': 'threshold'},
  'param_grid': {'sigma': [1, 2, 5, 10, 20], 'alpha': [0.001, 0.0001, 1e-05]}},
 'kt': {'dataset': ['housing'],
  'kwargs': {'m': None, 'postprocess': 'threshold', 'ydim': 1},
  'param_grid': {'sigma': [1, 2, 5, 10, 20], 'alpha': [0.001, 0.0001, 1e-05]}},
 'falkon': {'dataset': ['housing'],
  'kwargs': {'m': None, 'postprocess': 'threshold'},
  'param_grid': {'sigma': [1, 2, 5, 10, 20], 'alpha': [0.001, 0.0001, 1e-05]}},
 'rfm': {'dataset': ['housing'],
  'kwargs': {'iters': 2, 'postprocess': 'threshold'},
  'param_grid': {'sigma': [10], 'alpha': [0.001, 0.0001, 1e-05]}},
 'kf': {'dataset': ['housing'],
  'kwargs': {'m': None, 'postprocess': 'threshold', 'ydim': 1, 'rfm_iters': 2},
  'param_grid': {'sigma': [10], 'alpha': [0.001, 0.0001, 1e-05]}},
 

In [80]:
use_cross_validation

False

In [81]:
# Run experiment (depending on experiment_type)

results = []

count = 0
for name, config in model_configs.items():
    for dataset in config['dataset']:

        for v in varying_variable_values:
            kwargs = deepcopy(config['kwargs'])
            kwargs[varying_variable] = v
            model_name = f"{name}_{v}"
            # NOTE: full and rfm are deterministic, so we only need to run them once
            trials = (1 if name in ['full', 'rfm'] else n_repeats)

            # STEP 1: Get data
            # use X_train, y_train, X_test, y_test from above
            
            if 'kernel' not in kwargs:
                kwargs['kernel'] = kernel

            model = get_estimator(task, name=name, **kwargs)
            if model is None: continue
            print(f'i={count+1}: dataset={dataset}, model={model}')

            # STEP 2: Get optimal parameters through grid search
            # NOTE: we want to get rid of randomness in the Kernel Thinning (or Standard Thinning) routine
            # so we do k-fold cross validation `trials` times using the *same* split.
            # This is different from sklearn's repeated k-fold implementation which uses a 
            # different random split each time.            

            if use_cross_validation:
                split = list(KFold(n_splits=k_fold).split(X_train)) * trials
                grid_search = GridSearchCV(
                    estimator=model,
                    param_grid=config['param_grid'],
                    return_train_score=True,
                    cv=split,
                    scoring=refit,
                    refit=False,
                    n_jobs=n_jobs,
                ).fit(X_train, y_train)
                # get validation scores
                cv_results = pd.DataFrame(grid_search.cv_results_)
                val_scores = []
                for i in range(trials):
                    val_scores.append( cv_results.iloc[grid_search.best_index_][f'split{i}_test_score'] )
            
                # get optimal parameters
                best_params = grid_search.best_params_
            
            else:
                # Dummy values
                val_scores = [1,] * trials
                
                best_params = {
                    'sigma' : sigma,
                    'alpha' : alpha, # * (len(X_train)**(1/4) if name in ['st', 'kt'] else 1),
                }
            print(f"best params: {best_params}")
            best_model = get_estimator(task, name=name, 
                                       sigma=best_params['sigma'],
                                       alpha=best_params['alpha'],
                                       **kwargs)
            print(best_model)

            # STEP 3: Estimate train and test scores with the selected parameters
            train_scores = []
            test_scores = []
            for _ in range(trials):
                best_model.fit(X_train, y_train)

                # compute train score
                train_pred = best_model.predict(X_train).squeeze()
                # compute test score
                test_pred = best_model.predict(X_test).squeeze()

                # scores are misclassification rates (1 - accuracy), so lower is better
                train_score = 1 - classification_accuracy(y_train, train_pred)
                test_score = 1 - classification_accuracy(y_test, test_pred)
                
                train_scores.append( train_score )
                test_scores.append( test_score )

            results.append({
                "dataset": dataset, 
                "model": model_name, 
                "cv_results": pd.DataFrame(grid_search.cv_results_) if use_cross_validation else None,
                "best_index_" : grid_search.best_index_ if use_cross_validation else 0,
                "best_params_" : best_params,
                "val_scores" : val_scores,
                "train_scores" : train_scores,
                "test_scores" : test_scores,
            })

            count += 1

i=1: dataset=housing, model=KernelRidgeClassifier(kernel='gauss', postprocess='threshold')
best params: {'sigma': 10, 'alpha': 0.001}
KernelRidgeClassifier(alpha=0.001, kernel='gauss', postprocess='threshold',
                      sigma=10)
distances: (16507, 16507)
distances: (16507, 16507)
distances: (16507, 4127)
i=2: dataset=housing, model=KernelRidgeClassifier(postprocess='threshold')
best params: {'sigma': 10, 'alpha': 0.001}
KernelRidgeClassifier(alpha=0.001, postprocess='threshold', sigma=10)
distances: (16507, 16507)
distances: (16507, 16507)
distances: (16507, 4127)
i=3: dataset=housing, model=KernelRidgeSTClassifier(kernel='gauss', postprocess='threshold')
best params: {'sigma': 10, 'alpha': 0.001}
KernelRidgeSTClassifier(alpha=0.001, kernel='gauss', postprocess='threshold',
                        sigma=10)
distances: (128, 128)
distances: (128, 16507)
distances: (128, 4127)
distances: (128, 128)
distances: (128, 16507)
distances: (128, 4127)
distances: (128, 128)
distance

  0%|          | 0/5 [00:00<?, ?it/s]

Round 1, Test MSE: 0.1251
Using batch size of 4032


  0%|          | 0/5 [00:00<?, ?it/s]

Final MSE: 0.1272
i=8: dataset=housing, model=RFMClassifier(alpha=0.001, iters=2, kernel='laplace', postprocess='threshold',
              sigma=10)
best params: {'sigma': 10, 'alpha': 0.001}
RFMClassifier(alpha=0.001, iters=2, kernel='laplace', postprocess='threshold',
              sigma=10)
Round 0, Test MSE: 0.0002
Using batch size of 4032


  0%|          | 0/5 [00:00<?, ?it/s]

Round 1, Test MSE: 0.0004
Using batch size of 4032


  0%|          | 0/5 [00:00<?, ?it/s]

Final MSE: 0.0005
Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/Users/ag2435/anaconda3/envs/goodpoints/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/g7/tv_m7tdj25q_nq7w910g22qw0000gp/T/ipykernel_90352/3161747072.py", line 22, in <module>
    model = get_estimator(task, name=name, **kwargs)
  File "/Users/ag2435/repos/goodpoints/goodpoints/krr/util_estimators.py", line 361, in get_estimator
    return get_classifier(name, kernel, **kwargs)
  File "/Users/ag2435/repos/goodpoints/goodpoints/krr/util_estimators.py", line 341, in get_classifier
    return KernelRidgeKTFeatureClassifier(kernel='M_' + kernel, **kwargs)
TypeError: KernelRidgeKT.__init__() got an unexpected keyword argument 'rfm_iters'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/ag2435/anaconda3/envs/goodpoints/lib/python3.10/site-packages/IPython/core/intera
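
The repeated-split trick described in STEP 2 of the cell above can be checked in isolation. The following is a minimal sketch, assuming scikit-learn's `KFold` without shuffling; `k_fold`, `trials`, and the toy `X` are illustrative values, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative values (assumptions, not the notebook's configuration)
k_fold, trials = 5, 3
X = np.arange(20).reshape(10, 2)

# KFold without shuffling is deterministic, so repeating the list of
# (train_idx, val_idx) pairs re-evaluates on exactly the same folds.
base_split = list(KFold(n_splits=k_fold).split(X))
split = base_split * trials

# Passing `split` as `cv` yields k_fold * trials splits in total
n_splits_total = len(split)
same_folds = all(
    np.array_equal(split[j][1], split[j % k_fold][1]) for j in range(n_splits_total)
)
print(n_splits_total, same_folds)
```

Because the list contains `k_fold * trials` entries, `cv_results_` will contain that many `split{j}_test_score` columns. Indexing `split{i}_test_score` for `i in range(trials)` therefore reads individual folds rather than per-repeat averages, which may or may not be the intended behaviour.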

In [89]:
results

[{'dataset': 'housing',
  'model': 'full_gauss',
  'cv_results': None,
  'best_index_': 0,
  'best_params_': {'sigma': 10, 'alpha': 0.001},
  'val_scores': [1],
  'train_scores': [0.14430241715635794],
  'test_scores': [0.14320329537194088]},
 {'dataset': 'housing',
  'model': 'full_laplace',
  'cv_results': None,
  'best_index_': 0,
  'best_params_': {'sigma': 10, 'alpha': 0.001},
  'val_scores': [1],
  'train_scores': [0.0],
  'test_scores': [0.1315725708747274]},
 {'dataset': 'housing',
  'model': 'st_gauss',
  'cv_results': None,
  'best_index_': 0,
  'best_params_': {'sigma': 10, 'alpha': 0.001},
  'val_scores': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  'train_scores': [0.2107590719088871,
   0.18258920457987515,
   0.19004058884109776,
   0.1850729993336161,
   0.1963409462652208,
   0.1853153207730054,
   0.17192706124674384,
   0.18519416005331069,
   0.1987641606591143,
   0.18870782092445626],
  'test_scores': [0.2188030046038284,
   0.18875696631936034,
   0.1906954204022292,
   0.1
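
Since each entry of `results` stores a list of per-trial scores, a compact mean and standard-deviation table is often easier to read than the raw dump. A minimal sketch using only the standard library follows; the `toy_results` values are invented placeholders with the same keys as above, not outputs of this notebook, and `summarize` is a hypothetical helper.

```python
from statistics import mean, pstdev

# Placeholder entries mimicking the structure of `results` (values are invented)
toy_results = [
    {"dataset": "toy", "model": "full_gauss", "test_scores": [0.25]},
    {"dataset": "toy", "model": "st_gauss", "test_scores": [0.5, 0.25, 0.75]},
]

def summarize(results):
    """Return one row per result entry with the mean and std of its test scores."""
    rows = []
    for r in results:
        scores = r["test_scores"]
        rows.append({
            "dataset": r["dataset"],
            "model": r["model"],
            "n_trials": len(scores),
            "test_mean": mean(scores),
            "test_std": pstdev(scores),  # population std; 0.0 for a single trial
        })
    return rows

summary = summarize(toy_results)
for row in summary:
    print(row)
```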

In [90]:
# Save results with pickle
if save:
    import pickle
    filename = dataset + ('_cv' if use_cross_validation else '') # '_'.join(['toy', housing])
    pickle_file = filename + '.p'
    print(pickle_file)

    with open(pickle_file, 'wb') as f:
        pickle.dump(results, f)
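
To reuse saved results in a later session, the pickle file can be read back with `pickle.load`. A minimal round-trip sketch follows; the temporary file and the `toy_results` list are placeholders, and pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the notebook's `results` list
toy_results = [{"dataset": "toy", "model": "full_gauss", "test_scores": [0.25]}]

with tempfile.TemporaryDirectory() as tmpdir:
    pickle_file = os.path.join(tmpdir, "toy.p")  # placeholder path
    # Write the results in binary mode, as in the cell above
    with open(pickle_file, "wb") as f:
        pickle.dump(toy_results, f)
    # Read them back, also in binary mode
    with open(pickle_file, "rb") as f:
        loaded = pickle.load(f)

print(loaded == toy_results)
```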

## Plot Results

In [91]:
import plotly.colors as colors
import seaborn as sns

from functools import reduce
from operator import concat

### Varying variable (e.g., kernel choice)

In [92]:
def plot_results(varying_variable, varying_variable_values, scale='linear'):
    # row_subplot_titles = ["Test score vs n"], #, "Test Neg-MSE vs n"] #, "Train time vs n", "Predict time vs n"]
    row_subplot_titles = ["Test Score", "Val Score", "Train Score"]

    fig = make_subplots(
        rows=len(row_subplot_titles),
        cols=len(varying_variable_values),
        shared_yaxes=True,
        subplot_titles=reduce(concat, [[f'{varying_variable}={v}' for v in varying_variable_values] for _ in row_subplot_titles]),
        vertical_spacing=0.1,
    )
    model_names = [model_name.split('_')[0] for model_name in model_configs.keys()]
    colors_list = colors.qualitative.Plotly * (
        len(model_names) // len(colors.qualitative.Plotly) + 1
    )
    colors_used = set()

    def plot_vs_n(print_name, attr_name, vvv, r, c, is_better='higher', scale='log2'):
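        """
        Add one box trace per entry of `results` whose varying-variable value equals `vvv`.

        Args:
        - print_name: label used in the y-axis title
        - attr_name: key of the `results` entries to plot (e.g., "test_scores")
        - vvv: varying variable value (as a string)
        - r, c: 1-indexed subplot row and column
        - is_better: 'higher' or 'lower', shown in the y-axis title
        - scale: 'log2' or 'linear'
        """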
        
        for result in results:
            model_name = result["model"]
            name_components = model_name.split('_') # E.g., Kernel-Thin_rbf -> Kernel-Thin, rbf
            if len(name_components) == 2:
                model_name_prefix, vv_name = name_components
                m = '0'
            else:
                model_name_prefix, m, vv_name = name_components        
            best_params = result["best_params_"]

            if vv_name != vvv:
                continue

            color = colors_list[model_names.index(model_name_prefix)]

            if scale == 'log2':
                y = np.log2(np.abs(result[attr_name]))
            elif scale == 'linear':
                y = np.abs(result[attr_name])
            else:
                raise ValueError(f"unknown scale: {scale!r}")

            trace = go.Box(
                x=[result['dataset']]*len(result[attr_name]),
                y=y,
                name=model_name_prefix,
                # opacity=0.5,
                legendgroup=model_name_prefix,
                line_color=color,
                offsetgroup=model_name_prefix,
                showlegend=color not in colors_used,
                boxmean=True,
            )

            fig.add_trace(trace, row=r, col=c)
            colors_used.add(color)

        if c == 1: fig.update_yaxes(title_text=f"{scale}({print_name}) - {is_better} is better", row=r, col=c)
        fig.update_xaxes(title_text="dataset", row=r, col=c)
        fig.update_layout(boxmode='group')

    # NOTE: train/test scores are misclassification rates (1 - accuracy), not MSE
    def plot_test_score_vs_n(vvv, r, c, scale):
        plot_vs_n("Test error", "test_scores", vvv, r, c, is_better='lower', scale=scale)

    def plot_val_score_vs_n(vvv, r, c, scale):
        plot_vs_n("Val score", "val_scores", vvv, r, c, is_better='lower', scale=scale)

    def plot_train_score_vs_n(vvv, r, c, scale):
        plot_vs_n("Train error", "train_scores", vvv, r, c, is_better='lower', scale=scale)

    for c, vvv in enumerate(varying_variable_values):
        plot_test_score_vs_n(str(vvv), 1, c+1, scale=scale)
        plot_val_score_vs_n(str(vvv), 2, c+1, scale=scale)
        plot_train_score_vs_n(str(vvv), 3, c+1, scale=scale)

    return fig
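
The name parsing inside `plot_vs_n` relies on the `{name}_{v}` convention used when `model_name` is built in the experiment loop. A minimal sketch of that convention as a standalone helper follows; `parse_model_name` is a hypothetical name, not a function from the notebook, and `"kt_4_laplace"` is an invented three-part example.

```python
def parse_model_name(model_name):
    """Split '<prefix>_<value>' or '<prefix>_<m>_<value>' into (prefix, m, value)."""
    parts = model_name.split("_")
    if len(parts) == 2:
        prefix, value = parts
        m = "0"  # same default the plotting code uses when no m is encoded
    elif len(parts) == 3:
        prefix, m, value = parts
    else:
        raise ValueError(f"unexpected model name: {model_name!r}")
    return prefix, m, value

two_part = parse_model_name("rfm-thin_gauss")
three_part = parse_model_name("kt_4_laplace")
print(two_part, three_part)
```

One consequence of this convention is that a varying-variable value containing an underscore would be split into extra components, so such values would need a different separator.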

In [93]:
fig = plot_results(varying_variable, varying_variable_values, scale='linear')
fig.update_layout(
    legend=dict(traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text=f"Evaluation for {varying_variable} in {varying_variable_values}"), # \
            #    f"sigma {param_grid['sigma']} / alpha {param_grid['alpha']}"),
    width=800,
    height=1000,
)
fig.show()
if save:
    fig_file = filename + '.png'
    print(fig_file)
    fig.write_image(fig_file)

In [94]:
fig = plot_results(varying_variable, varying_variable_values, scale='log2')
fig.update_layout(
    legend=dict(traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text=f"Evaluation for {varying_variable} in {varying_variable_values}"), # \
            #    f"sigma {param_grid['sigma']} / alpha {param_grid['alpha']}"),
    width=800,
    height=1000,
)
fig.show()
if save:
    fig_file = filename + '_log2.png'
    print(fig_file)
    fig.write_image(fig_file)


divide by zero encountered in log2
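
The warning above is raised because at least one score is exactly zero (for example, the `full_laplace` training score of 0.0), and `np.log2(0)` evaluates to `-inf`. One possible workaround, sketched here under the assumption that flooring at a small positive constant is acceptable for plotting, is to clip before taking the logarithm; `safe_log2` and `eps` are hypothetical names and values, not part of the notebook.

```python
import numpy as np

def safe_log2(values, eps=1e-12):
    """log2 of |values|, with magnitudes floored at eps to avoid log2(0) = -inf."""
    magnitudes = np.abs(np.asarray(values, dtype=float))
    return np.log2(np.clip(magnitudes, eps, None))

y = safe_log2([0.0, 0.25, 1.0])
print(y)
```

The choice of `eps` determines where zero scores appear on the log axis, so it should be stated in the figure caption if this approach is used.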



### Overfitting

In [95]:
def plot_results_overfitting(varying_variable, varying_variable_values, scale='linear'):
    col_subplot_titles = ["Test Score", "Val Score", "Train Score", ]

    fig = make_subplots(
        rows=len(varying_variable_values),
        cols=len(col_subplot_titles),
        shared_yaxes=True,
        subplot_titles=col_subplot_titles + [None,] * len(varying_variable_values),
        vertical_spacing=0.1,
    )
    model_names = [model_name.split('_')[0] for model_name in model_configs.keys()]
    colors_list = colors.qualitative.Plotly * (
        len(model_names) // len(colors.qualitative.Plotly) + 1
    )
    colors_used = set()

    def plot(print_name, attr_name, vvv, r, c, is_better='higher', scale='log2'):
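        """
        Add one box trace per entry of `results` whose varying-variable value equals `vvv`.

        Args:
        - print_name, is_better: currently unused (the y-axis title shows the varying variable)
        - attr_name: key of the `results` entries to plot (e.g., "test_scores")
        - vvv: varying variable value (as a string)
        - r, c: 1-indexed subplot row and column
        - scale: 'log2' or 'linear'
        """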
        
        for result in results:
            model_name = result["model"]
            name_components = model_name.split('_') # E.g., Kernel-Thin_rbf -> Kernel-Thin, rbf
            if len(name_components) == 2:
                model_name_prefix, vv_name = name_components
                m = '0'
            else:
                model_name_prefix, m, vv_name = name_components        
            best_params = result["best_params_"]

            if vv_name != vvv:
                continue

            color = colors_list[model_names.index(model_name_prefix)]

            if scale == 'log2':
                y = np.log2(np.abs(result[attr_name]))
            elif scale == 'linear':
                y = np.abs(result[attr_name])
            else:
                raise ValueError(f"unknown scale: {scale!r}")

            trace = go.Box(
                x=[result['dataset']]*len(result[attr_name]),
                y=y,
                name=model_name_prefix,
                # opacity=0.5,
                legendgroup=model_name_prefix,
                line_color=color,
                offsetgroup=model_name_prefix,
                showlegend=color not in colors_used,
                boxmean=True,
            )

            fig.add_trace(trace, row=r, col=c)
            colors_used.add(color)

        if c == 1: fig.update_yaxes(title_text=f"{varying_variable}={vvv}", row=r, col=c)
        fig.update_xaxes(title_text="dataset", row=r, col=c)
        fig.update_layout(boxmode='group')

    # NOTE: train/test scores are misclassification rates (1 - accuracy), not MSE
    def plot_test_score(vvv, r, c, scale):
        plot("Test error", "test_scores", vvv, r, c, is_better='lower', scale=scale)

    def plot_val_score(vvv, r, c, scale):
        plot("Val score", "val_scores", vvv, r, c, is_better='lower', scale=scale)

    def plot_train_score(vvv, r, c, scale):
        plot("Train error", "train_scores", vvv, r, c, is_better='lower', scale=scale)

    for r, vvv in enumerate(varying_variable_values):
        plot_test_score(str(vvv), r+1, 1, scale=scale)
        plot_val_score(str(vvv), r+1, 2, scale=scale)
        plot_train_score(str(vvv), r+1, 3, scale=scale)

    return fig

In [96]:
fig = plot_results_overfitting(varying_variable, varying_variable_values, scale='linear')
fig.update_layout(
    legend=dict(traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text=f"Evaluation for {varying_variable} in {varying_variable_values}" \
            #    f"sigma {param_grid['sigma']} / alpha {param_grid['alpha']}"
               "<br>scale: linear"
               ),
    width=1000,
    height=600,
)
fig.show()

In [97]:
fig = plot_results_overfitting(varying_variable, varying_variable_values, scale='log2')
fig.update_layout(
    legend=dict(traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text=f"Evaluation for {varying_variable} in {varying_variable_values}" \
            #    f"sigma {param_grid['sigma']} / alpha {param_grid['alpha']}"
               "<br>scale: log2"
               ),
    width=1000,
    height=600,
)
fig.show()


divide by zero encountered in log2

