# 2022 Flatiron Machine Learning x Science Summer School

## Step 16: Implement convolutional symbolic discriminator

* Define input data on a grid `X11`

* Function library `F11_v1`

* Implement CNN SD (check https://production-media.paperswithcode.com/methods/vgg_7mT4DML.png)
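
For orientation, the VGG idea carries over to 1D curves sampled on a grid: small 3-wide kernels, ReLU, and pooling that halves the resolution while the channel count grows. The NumPy sketch below shows only a forward pass under these assumptions. All names (`conv1d_same`, `max_pool2`, `init_params`, `csd_forward`) and the channel widths are hypothetical and are not taken from `CSDNet`.

```python
import numpy as np

def conv1d_same(x, w, b):
    """Length-preserving 1D convolution (cross-correlation, zero padding).

    x: (batch, c_in, length), w: (c_out, c_in, k) with odd k, b: (c_out,)
    """
    k = w.shape[-1]
    xp = np.pad(x, ((0, 0), (0, 0), (k // 2, k // 2)))
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=-1)
    return np.einsum("bclk,ock->bol", windows, w) + b[None, :, None]

def max_pool2(x):
    """Non-overlapping max pooling with window 2 along the last axis."""
    length = x.shape[-1] // 2 * 2   # drop a trailing sample if length is odd
    return x[..., :length].reshape(*x.shape[:-1], length // 2, 2).max(axis=-1)

def init_params(channels=(1, 8, 16, 32), kernel=3, seed=0):
    """Random weights; channel widths are arbitrary illustrative choices."""
    rng = np.random.default_rng(seed)
    conv = [(rng.normal(scale=(c_in * kernel) ** -0.5, size=(c_out, c_in, kernel)),
             np.zeros(c_out))
            for c_in, c_out in zip(channels[:-1], channels[1:])]
    return {"conv": conv,
            "w_out": rng.normal(scale=channels[-1] ** -0.5, size=channels[-1]),
            "b_out": 0.0}

def csd_forward(curves, params):
    """Map curves of shape (batch, length) to one logit per curve."""
    h = curves[:, None, :]                        # add a channel axis
    for w, b in params["conv"]:
        h = max_pool2(np.maximum(conv1d_same(h, w, b), 0.0))  # conv -> ReLU -> pool
    features = h.mean(axis=-1)                    # global average pooling
    return features @ params["w_out"] + params["b_out"]

# Four example curves evaluated on a 700-point grid over [-3, 3).
grid = np.linspace(-3, 3, 700, endpoint=False)
curves = np.stack([np.sin(grid), grid**2, np.exp(-grid**2), np.tanh(grid)])
logits = csd_forward(curves, init_params())
print(logits.shape)
```

Because the convolution slides along the grid axis, this kind of model relies on the inputs being ordered and evenly spaced, which is why the input data in this step is defined on a grid.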

In [1]:
%matplotlib widget
%load_ext autoreload
%autoreload 2

import os
import numpy as np
import matplotlib.pyplot as plt
import joblib

import torch
import torch.nn as nn
import wandb

from srnet import SRNet, SRData
from sdnet import SDNet, SDData
from csdnet import CSDNet
import srnet_utils as ut

### Step 16.1: Define input data on grid

* Input data range: -3 to 3, which covers about 99.7% of the probability mass of N(0,1) samples

* No mask is applied, since the grid structure must be kept intact

* Create `F11_v1.lib`

In [2]:
np.random.seed(0)

In [3]:
x_min = -3
x_max = 3

data_size = 700
data_name = "X11"

data = np.linspace(x_min, x_max, data_size, endpoint=False)

In [4]:
data_path = "data_1k"
data_ext = ".gz"
update = False

# create data folder
os.makedirs(data_path, exist_ok=True)

# save input data
if update or data_name + data_ext not in os.listdir(data_path):
    np.savetxt(os.path.join(data_path, data_name + data_ext), data)
    print(f"Saved {data_name} data.")

In [5]:
# load data
data_path = "data_1k"
in_var = "X11"
lat_var = None
target_var = None

# mask_ext = ".mask"
# masks = joblib.load(os.path.join(data_path, in_var + mask_ext))

train_data = SRData(data_path, in_var, data_mask=None)

In [6]:
# load function library
fun_path = "funs/F11_v1.lib"
shuffle = True
iter_sample = False
disc_data = SDData(fun_path, in_var, shuffle=shuffle, iter_sample=iter_sample)

In [7]:
hp = {
    "arch": {
        "in_size": 1,
        "out_size": 1,
        "hid_num": (2,0),
        "hid_size": 32, 
        "hid_type": ("MLP", "MLP"),
        "hid_kwargs": {
            "alpha": None,
            "norm": None,
            "prune": None,
            },
        "lat_size": 1,
    },
}

In [8]:
fig, ax = plt.subplots()

num_samp = 10
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

for _ in range(num_samp):
        
    ax.plot(disc_data.get(1, train_data.in_data)[0,0,:,0], color=colors[0], alpha=0.5)
    
    model = SRNet(**hp['arch'])
    with torch.no_grad():
        preds, acts = model(train_data.in_data, get_lat=True)
    
    ax.plot(acts[:,0], color=colors[1], alpha=0.5)
    
plt.show()


In [9]:
fig, ax = plt.subplots()

num_samp = 10
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

for _ in range(num_samp):
    
    ax.scatter(train_data.in_data[:,0], disc_data.get(1, train_data.in_data)[0,0,:,0], color=colors[0], alpha=0.5)
    
    model = SRNet(**hp['arch'])
    with torch.no_grad():
        preds, acts = model(train_data.in_data, get_lat=True)
    
    ax.scatter(train_data.in_data[:,0], acts[:,0], color=colors[1], alpha=0.5)
    
plt.show()


Comparison to `X10`:

In [10]:
# load data
data_path = "data_1k"
in_var = "X10"
lat_var = None
target_var = None

mask_ext = ".mask"
masks = joblib.load(os.path.join(data_path, in_var + mask_ext))

train_data10 = SRData(data_path, in_var, data_mask=masks['train'])

In [11]:
# keep only the first input dimension so that X10 is comparable to the 1D X11 grid
train_data10.in_data = train_data10.in_data[:,:1]

In [12]:
train_data10.in_data.shape

torch.Size([700, 1])

In [13]:
train_data10.in_data.min()

tensor(-3.3096)

In [14]:
train_data10.in_data.max()

tensor(2.9148)

In [15]:
train_data10.in_data.mean()

tensor(0.0395)

In [16]:
train_data10.in_data.std()

tensor(0.9967)

In [17]:
train_data.in_data.shape

torch.Size([700, 1])

In [18]:
train_data.in_data.min()

tensor(-3.)

In [19]:
train_data.in_data.max()

tensor(2.9914)

In [20]:
train_data.in_data.mean()

tensor(-0.0043)

In [21]:
train_data.in_data.std()

tensor(1.7333)
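
The grid statistics above can be reproduced directly: points spread evenly over an interval of width 6 have a standard deviation close to 6/sqrt(12) = sqrt(3) ≈ 1.732, well above the ≈1.0 of the N(0,1) samples in `X10`. The sketch below rebuilds the grid from the Step 16.1 settings instead of loading the saved file. It uses `ddof=1` on the assumption that `torch.std` applies Bessel's correction by default.

```python
import numpy as np

# Rebuild the X11 grid from the Step 16.1 settings (not loaded from disk).
x_min, x_max, data_size = -3, 3, 700
grid = np.linspace(x_min, x_max, data_size, endpoint=False)

grid_mean = grid.mean()
grid_std = grid.std(ddof=1)                   # unbiased estimate
uniform_std = (x_max - x_min) / np.sqrt(12)   # continuous uniform on [-3, 3)

print(f"max  = {grid.max():.4f}")
print(f"mean = {grid_mean:.4f}")
print(f"std  = {grid_std:.4f} (continuous uniform: {uniform_std:.4f})")
```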

In [22]:
in_var = "X11"

In [23]:
fig, ax = plt.subplots()

num_samp = 5
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

for _ in range(num_samp):
    
    x_data = train_data10.in_data
    ax.scatter(x_data[:,0], disc_data.get(1, x_data)[0,0,:,0], color=colors[0], alpha=0.5)
    
    model = SRNet(**hp['arch'])
    with torch.no_grad():
        preds, acts = model(x_data, get_lat=True)
    
    ax.scatter(x_data[:,0], acts[:,0], color=colors[1], alpha=0.5)
    
    x_data = train_data.in_data
    ax.scatter(x_data[:,0], disc_data.get(1, x_data)[0,0,:,0], color=colors[2], alpha=0.5)
    
    model = SRNet(**hp['arch'])
    with torch.no_grad():
        preds, acts = model(x_data, get_lat=True)
    
    ax.scatter(x_data[:,0], acts[:,0], color=colors[3], alpha=0.5)
    
plt.show()


In [24]:
fig, ax = plt.subplots()

ax.hist(train_data10.in_data[:,0].numpy(), 50, density=True, facecolor='g', alpha=0.5)
ax.hist(train_data.in_data[:,0].numpy(), 50, density=True, facecolor='y', alpha=0.5)

plt.show()


Check gradient calculation:

In [25]:
x_data = train_data.in_data

In [26]:
model = SRNet(**hp['arch'])
with torch.no_grad():
    preds, acts = model(x_data, get_lat=True)

In [27]:
# transpose so that grads[0, idx, 0] is the derivative of latent 0 w.r.t. input 0 at sample idx
grads = model.jacobian(x_data, get_lat=True).transpose(0,1)

In [28]:
fig, ax = plt.subplots()

dx = 0.1
grad_idx = [100, 200, 300, 400, 500, 600]

ax.scatter(x_data[:,0], acts[:,0])

for idx in grad_idx:
    xl = x_data[idx,0] - dx
    xh = x_data[idx,0] + dx
    yl = acts[idx,0] - dx * grads[0,idx,0]
    yh = acts[idx,0] + dx * grads[0,idx,0]

    ax.scatter([x_data[idx,0].item()], [acts[idx,0].item()], color='k')
    ax.plot([xl, xh], [yl, yh], 'k')

plt.show()
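
The tangent segments give a visual check. A numerical complement is to compare an analytic derivative against central finite differences. The sketch below does this for a small stand-alone tanh network in NumPy. The network, its weights, and the helper names are hypothetical stand-ins; it does not use `SRNet` or `model.jacobian`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1 -> 32 -> 1 tanh network standing in for the latent map.
w1, b1 = rng.normal(size=(1, 32)), rng.normal(size=32)
w2, b2 = rng.normal(size=(32, 1)) / np.sqrt(32), rng.normal(size=1)

def forward(x):
    """Map inputs of shape (n, 1) to outputs of shape (n, 1)."""
    return np.tanh(x @ w1 + b1) @ w2 + b2

def analytic_grad(x):
    """d forward / d x via the chain rule, shape (n, 1)."""
    h = np.tanh(x @ w1 + b1)            # (n, 32)
    return ((1.0 - h**2) * w1) @ w2     # (n, 32) @ (32, 1)

def finite_diff_grad(x, eps=1e-5):
    """Central difference approximation of the same derivative."""
    return (forward(x + eps) - forward(x - eps)) / (2.0 * eps)

x = np.linspace(-3, 3, 700, endpoint=False)[:, None]
max_err = np.abs(analytic_grad(x) - finite_diff_grad(x)).max()
print(f"max abs difference: {max_err:.2e}")
```

The same comparison could be applied to the Jacobian returned by the model by swapping in its forward pass, provided the input is perturbed in double precision or with a larger `eps`.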

Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

### Step 16.2: Embed gradient data in SD

Note: Here, the input data still consists of 1000 (instead of 700) data points uniformly spaced between -3 and 3.

In [29]:
# set wandb project
wandb_project = "162-ext2-study-F11_v1"

In [30]:
# hyperparams = {
#     "arch": {
#         "in_size": train_data.in_data.shape[1],
#         "out_size": 1,
#         "hid_num": (2,0),
#         "hid_size": 32, 
#         "hid_type": ("MLP", "MLP"),
#         "hid_kwargs": {
#             "alpha": None,
#             "norm": None,
#             "prune": None,
#             },
#         "lat_size": 3,
#     },
#     "epochs": 100000,
#     "runtime": None,
#     "batch_size": train_data.in_data.shape[0],
#     "ext": ["grad"],
#     "ext_type": "embed",
#     "ext_size": 1,
#     "disc": {
#         "hid_num": 2,
#         "hid_size": 64,
#         "lr": 1e-4,
#         "wd": 1e-7,
#         "betas": (0.9,0.999),
#         "iters": 5,
#         "gp": 1e-5,
#     },
# }

In [31]:
# define hyperparameter study
hp_study = {
    "method": "random",
    "parameters": {
        "arch": {
            "parameters": {
                "in_size": {
                    "values": [1]
                },
                "out_size": {
                    "values": [1]
                },
                "hid_num": {
                    "values": [(2,0)]
                },
                "hid_size": {
                    "values": [32]
                },
                "lat_size": {
                    "values": [1, 3, 5]
                },
            }
        },
        "disc": {
            "parameters": {
                "hid_num": {
                    "values": [(2,2), (2,4), (2,8)]
                },
                "hid_size": {
                    "values": [(64,64), (64,128), (64,256), (64,512)]
                },
                "lr": {
                    "values": [1e-5, 1e-4, 1e-3]
                },
                "iters": {
                    "values": [1, 3, 5]
                },
                "gp": {
                    "values": [1e-6, 1e-5, 1e-4]
                },
            }
        }
    }
}

In [None]:
# create sweep
sweep_id = wandb.sweep(hp_study, project=wandb_project)

<img src="results/162-ext2-study-F11_v1_conv.png">

**Sanity check**:

Let's select the hyperparameters of the best `155-ext2-study-F10_v1` run:

* 700 N(0,1) data points: works

* 700 sorted N(0,1) data points: works

* 1000 unmasked N(0,1) data points: works

* 100 unmasked N(0,1) data points: works

* 1000 grid points on \[-3,3): does not work

* 700 grid points on \[-3,3): does not work

* 100 grid points on \[-3,3): does not work

* 10000 grid points on \[-3,3): does not work

* 700 grid points on \[-1,1): does not work

* 100 grid points on \[-2,2), saved as `X10`: does not work

### Step 16.3: Compare input data distributions

At what point between normally distributed and uniformly spaced input data does the previous approach of embedding gradients start failing?

In [32]:
sample_size = 1000000
data_size = 700

in_min = -3
in_max = 3

stds = [1, 1.25, 2, 3, 10]

In [33]:
in_dict = {}

for std in stds:
    print(f"Standard deviation: {std}")
    
    # sample points
    in_data = np.random.normal(loc=0, scale=std, size=(sample_size))
    print(f"  Number of points in {in_min} to {in_max} range: {in_data[(in_min <= in_data) & (in_data <= in_max)].shape[0]}")
    
    # select range
    in_dict[std] = in_data[(in_min <= in_data) & (in_data <= in_max)]
    
    # reduce size
    in_dict[std] = np.sort(in_dict[std])[::(len(in_dict[std])//data_size)][:data_size]
    print(f"  Number of selected points: {in_dict[std].shape[0]}")

Standard deviation: 1
  Number of points in -3 to 3 range: 997331
  Number of selected points: 700
Standard deviation: 1.25
  Number of points in -3 to 3 range: 983847
  Number of selected points: 700
Standard deviation: 2
  Number of points in -3 to 3 range: 865855
  Number of selected points: 700
Standard deviation: 3
  Number of points in -3 to 3 range: 682837
  Number of selected points: 700
Standard deviation: 10
  Number of points in -3 to 3 range: 236018
  Number of selected points: 700
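
The in-range counts agree with the closed-form probability mass of N(0, std²) on \[-3, 3\], namely P(|X| ≤ 3) = erf(3 / (std · sqrt(2))). A short check using only the standard library (the helper name `mass_in_range` is illustrative):

```python
import math

in_max = 3
stds = [1, 1.25, 2, 3, 10]

def mass_in_range(std, bound=in_max):
    """Probability that a N(0, std^2) sample lies in [-bound, bound]."""
    return math.erf(bound / (std * math.sqrt(2)))

expected = {std: mass_in_range(std) for std in stds}
for std, p in expected.items():
    print(f"std={std:>5}: expected fraction in range = {p:.4f}")
```

Multiplied by the 1,000,000 samples drawn, these fractions match the printed counts up to sampling noise. For `std=10` only about a quarter of the samples fall inside the range, and the density varies by only a few percent across it, so the retained points are close to uniformly distributed.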


In [34]:
fig, ax = plt.subplots()

for std in in_dict:
    
    in_data = in_dict[std]
    
    ax.hist(in_data, 50, density=True, alpha=0.5)

plt.show()


In [35]:
data_path = "data_1k"
data_ext = ".gz"
update = False

# create data folder
os.makedirs(data_path, exist_ok=True)

# save input data
for std in in_dict:
    data_name = f"X11_std{int(std*100):04d}"
    if update or data_name + data_ext not in os.listdir(data_path):
        np.savetxt(os.path.join(data_path, data_name + data_ext), in_dict[std])
        print(f"Saved {data_name} data.")

<img src="results/163-input-check-F11_v1_conv.png">

### Step 16.4: Compare standard classifiers

Training on 10 different instantiations of a fixed dataset of size 1000 using 100 N(0,1) input data:

```
SD : 0.616+-0.054
MLP: 0.933+-0.008
KNN: 0.823+-0.006
SVC: 0.708+-0.015
GP : 0.746+-0.012
DT : 1.000+-0.000
RF : 1.000+-0.000
ADA: 0.693+-0.012
GNB: 0.526+-0.014
QDA: 0.888+-0.011
```

The `SD` performs far worse than the standard `sklearn` classifiers: only `GNB` scores lower.

How does changing from N(0,1) to \[-3,3) impact the results?

Training on 10 different instantiations of a fixed dataset of size 1000 using 100 \[-3,3) input data:

```
SD : 0.707+-0.033
MLP: 0.960+-0.007
KNN: 0.865+-0.004
SVC: 0.792+-0.010
GP : 0.837+-0.008
DT : 1.000+-0.000
RF : 1.000+-0.000
ADA: 0.754+-0.013
GNB: 0.571+-0.009
QDA: 0.996+-0.003
```

In this setting (fixed dataset, input size 100), training on uniformly distributed input data seems to be beneficial: every classifier except the already saturated `DT` and `RF` scores higher than with N(0,1) inputs.

### Step 16.5: Analyze loss components and gradients

More importantly, why is there such a big difference in performance between `SD` and `MLP`?

`SD`:

* 1000 epochs with 5 iterations per epoch

* 2 hidden layers with 64 nodes

* Adam optimizer with `1e-3` learning rate (no weight decay)

* Batch size: 1000

* WGAN objective: minimize the discriminator output on fake data, maximize it on real data

* No gradient penalty

`MLP`: (`sklearn` default settings)

* 200 epochs

* 1 hidden layer with 100 nodes

* Adam optimizer with `1e-3` learning rate (with `1e-4` weight decay)

* Batch size: 200 with shuffling

* Log-loss objective

Well, the difference can only really result from the objective function, right?
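
To make the comparison concrete, here is a minimal NumPy sketch of the two objectives written as losses on raw discriminator scores. The function names and the synthetic scores are illustrative assumptions, not the `SDNet` code. It also shows one property that separates the two: without a constraint such as a gradient penalty or weight clipping, the WGAN critic loss can be lowered indefinitely just by scaling the scores, whereas the log-loss is bounded below by zero and penalizes confidently wrong scores.

```python
import numpy as np

def wgan_critic_loss(score_real, score_fake):
    """WGAN critic loss: push scores on fake data down and on real data up."""
    return score_fake.mean() - score_real.mean()

def log_loss(score_real, score_fake):
    """Binary cross entropy on logits with labels real=1, fake=0."""
    # softplus(z) = log(1 + exp(z)), computed stably with logaddexp
    return 0.5 * (np.logaddexp(0.0, -score_real).mean()
                  + np.logaddexp(0.0, score_fake).mean())

rng = np.random.default_rng(0)
score_real = rng.normal(loc=1.0, scale=1.0, size=1000)   # synthetic scores
score_fake = rng.normal(loc=-1.0, scale=1.0, size=1000)

for scale in (1.0, 10.0, 100.0):
    w = wgan_critic_loss(scale * score_real, scale * score_fake)
    b = log_loss(scale * score_real, scale * score_fake)
    print(f"scale={scale:6.1f}  wgan={w:10.2f}  log-loss={b:8.4f}")
```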

Let's try changing the `SD` to a log-loss objective:

`X10_100`: Epoch: 100%|█████| 10000/10000 [01:17<00:00, 129.23it/s, acc=0.90, avg_acc=0.95]

`X11_100`: Epoch: 100%|█████| 10000/10000 [01:18<00:00, 126.96it/s, acc=0.90, avg_acc=0.98]

What's the reason that the WGAN objective is not working?

Let's analyze the propagated gradients on `X10_100`:

In [36]:
ut.plot_disc_accuracies("disc_model_F11_v1_grad_check_", "models", excl_names=[], avg_hor=500, uncertainty=False);


Low `gp` values:

In [37]:
ut.plot_disc_losses("disc_model_F11_v1_grad_check_", "models", excl_names=["1e0", "1e1", "1e2"], avg_hor=50, uncertainty=False, summation=False);


In [38]:
ut.plot_disc_gradients("disc_model_F11_v1_grad_check_", "models", excl_names=["1e0", "1e1", "1e2"], avg_hor=50, uncertainty=False);


High `gp` values:

In [39]:
ut.plot_disc_losses("disc_model_F11_v1_grad_check_", "models", excl_names=["0e0", "1e-2", "1e-4"], avg_hor=50, uncertainty=False, summation=False);


In [40]:
ut.plot_disc_gradients("disc_model_F11_v1_grad_check_", "models", excl_names=["0e0", "1e-2", "1e-4"], avg_hor=50, uncertainty=False);


No WGANs please!

What is the performance when including gradients or convolution on `X11`?

`none`: Epoch: 100%|██████| 10000/10000 \[02:00<00:00, 83.16it/s, acc=1.00, avg_acc=0.99\]

`stack`: Epoch: 100%|██████| 10000/10000 \[02:32<00:00, 65.58it/s, acc=1.00, avg_acc=1.00\]

`embed`: Epoch: 100%|██████| 10000/10000 \[08:12<00:00, 20.32it/s, acc=1.00, avg_acc=1.00\]

`conv`: Epoch: 100%|██████| 10000/10000 \[09:33<00:00, 17.42it/s, acc=1.00, avg_acc=0.97\]

In [41]:
ut.plot_disc_accuracies("disc_model_F11_v1_grad_check2", "models", excl_names=[], avg_hor=50, uncertainty=False);


* All approaches trained on `X11` with an infinite function library `F11_v1` achieve nearly 100% accuracy with the binary cross entropy loss (actually `BCEWithLogitsLoss`, i.e. sigmoid and `BCELoss`)

* Providing gradient information to the discriminator speeds up convergence significantly, and embedding appears to be slightly more effective than stacking (although stacking runs roughly 3x as many epochs per second: 65.58 vs. 20.32 it/s)

* The convolutional symbolic discriminator (CSD) does not appear to be more effective than the regular SD. However, we tested only a single architecture and set of hyperparameters

* Identifying a better CSD architecture and set of hyperparameters might be more insightful on a more difficult use case
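
As a reminder of what the loss in the first bullet computes, the NumPy sketch below compares the naive composition of a sigmoid followed by binary cross entropy with the fused, numerically stable form max(z, 0) - z·y + log(1 + exp(-|z|)). The helper names are illustrative and this is not the PyTorch source. The two agree for moderate logits, and the fused form stays finite for extreme logits where the naive version would evaluate log(0).

```python
import numpy as np

def bce_naive(logits, targets):
    """Sigmoid followed by binary cross entropy (mean over samples)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p)).mean()

def bce_with_logits(logits, targets):
    """Fused, numerically stable form of the same loss."""
    return (np.maximum(logits, 0.0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits)))).mean()

rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=1000)
targets = rng.integers(0, 2, size=1000).astype(float)

print(bce_naive(logits, targets), bce_with_logits(logits, targets))

# An extreme, confidently wrong logit: the fused form stays finite.
extreme = bce_with_logits(np.array([800.0]), np.array([0.0]))
print(extreme)
```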

Let's train the standard SD until convergence and utilize it to regularize `DSN`.

In [42]:
ut.plot_disc_accuracies("disc_model_F11_v1_fixed_BCE", "models", excl_names=[], avg_hor=50, uncertainty=True);
