# 2022 Flatiron Machine Learning x Science Summer School

## Step 6: Train DSN with $L_1$ regularization on latent features using GhostAdam

Based on Sparse Relaxed Regularized Regression ([SR3](https://arxiv.org/abs/1807.05411)).

[GhostAdam](https://github.com/MilesCranmer/sr3_sjnn) attempts to minimize prediction errors and regularization penalties separately. 
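To make the live/ghost idea concrete, here is a minimal NumPy sketch of the SR3 relaxation on a toy sparse linear regression. It is not the GhostAdam implementation: the names `sr3_lasso`, `kappa` and `lam` are illustrative, and it assumes that the ghost coefficient $g_c$ plays a role similar to SR3's coupling weight $\kappa$. The live weights `w` fit the data, the ghost weights `z` carry the $L_1$ penalty, and a quadratic term ties the two together.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage toward 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sr3_lasso(A, b, lam=0.5, kappa=1.0, n_iter=100):
    """Alternating minimization of
    0.5 * ||A w - b||^2 + lam * ||z||_1 + 0.5 * kappa * ||w - z||^2.
    """
    n_features = A.shape[1]
    z = np.zeros(n_features)
    # The w-update solves a ridge-like linear system whose matrix is fixed.
    lhs = A.T @ A + kappa * np.eye(n_features)
    Atb = A.T @ b
    for _ in range(n_iter):
        w = np.linalg.solve(lhs, Atb + kappa * z)  # "live": fits the data
        z = soft_threshold(w, lam / kappa)         # "ghost": carries the L1 penalty
    return w, z


rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[[0, 3, 7]] = [2.0, -3.0, 1.5]
b = A @ w_true + 0.01 * rng.normal(size=200)

w, z = sr3_lasso(A, b)
print(np.flatnonzero(z))  # indices of the ghost weights that stay non-zero
```

In this toy setting the live weights stay close to the least-squares fit, while only the ghost weights become exactly sparse.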

Questions:

* For the DSN, do we consider $a_1$ and $a_2$ as one regularization penalty?

Let's explore GhostAdam with simple sweeps.

**TODO**: Read the `GhostWrapper` and `GhostAdam` implementations.

Sanity check: set `requires_grad` to `False`.

Do we make sure that unregularized parameters are not affected by our trick?
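One way to run the gradient-disabling sanity check, sketched with a plain `torch.optim.Adam` rather than GhostAdam (whose internals are not shown here): freeze one parameter, take a few optimizer steps, and compare every parameter against a snapshot. The tiny model and the choice of which parameter to freeze are purely illustrative.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(torch.nn.Linear(2, 4), torch.nn.Linear(4, 1))

# Freeze the first layer's weight; it must stay bit-identical after training.
model[0].weight.requires_grad = False
snapshot = {name: p.detach().clone() for name, p in model.named_parameters()}

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)
x, y = torch.randn(64, 2), torch.randn(64, 1)

for _ in range(5):
    optimizer.zero_grad()
    loss = (model(x) - y).pow(2).mean()
    loss.backward()
    optimizer.step()

# Report which parameters moved; the frozen one should not be among them.
changed = {
    name: not torch.equal(p, snapshot[name])
    for name, p in model.named_parameters()
}
print(changed)
```

The same snapshot-and-compare pattern could be applied to any parameter group that the ghost trick is supposed to leave alone.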




We need to make sure that the ghost parameters change together with the live parameters:

* Set the differences to 0 and set `p.ghost` to `p.live`?

* Is the following zero-valued term necessary?

`all_p.append(p.live.sum() * 0.0 + p.ghost.sum() * 0.0)`


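Is the hack slowing us down? The two ghost L1 activation runs reach about 12.4 to 12.6 it/s, the other two about 17 to 18 it/s: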

ghost L1 activation:
13:28<00:00, 12.36it/s, train_loss=2.09e-06, val_loss=1.20e-03

vs. regular L1 loss:
09:43<00:00, 17.15it/s, train_loss=3.34e-04, val_loss=1.83e-03

vs. no ghost L1 activation:
09:08<00:00, 18.23it/s, train_loss=1.90e-03, val_loss=1.92e-03

vs. ghost L1 activation via 0 loss:
13:12<00:00, 12.63it/s, train_loss=2.09e-06, val_loss=1.20e-03


Do we even want to create new GhostTuples all the time?

py37: ghost_class
01:36<00:00, 10.38it/s, train_loss=2.36e-04, val_loss=3.72e-03

py38: ghost_class
01:23<00:00, 11.92it/s, train_loss=2.36e-04, val_loss=3.72e-03

py38: ghost_tuple
01:25<00:00, 11.67it/s, train_loss=2.36e-04, val_loss=3.72e-03

### Step 6.1: Determine optimal ghost coefficient $g_c$

In [1]:
%matplotlib widget
%load_ext autoreload
%autoreload 2

import os
import numpy as np
import matplotlib.pyplot as plt
import joblib

import torch
import wandb

from srnet import SRNet, SRData
import srnet_utils as ut

In [2]:
# set wandb project
wandb_project = "61-l1-gc-study-F00"

In [3]:
# hyperparams = {
#     "arch": {
#         "in_size": train_data.in_data.shape[1],
#         "out_size": train_data.target_data.shape[1],
#         "hid_num": (2,0),
#         "hid_size": 32, 
#         "hid_type": ("DSN", "MLP"),
#         "lat_size": 16,
#         },
#     "epochs": 10000,
#     "runtime": None,
#     "batch_size": 64,
#     "lr": 1e-4,
#     "wd": 1e-4,
#     "l1": 0.0,
#     "a1": 0.0,
#     "a2": 0.0,
#     "gc": 0.0,
#     "shuffle": True,
# }

# define hyperparameters
# hyperparams = {
#     "arch": {
#         "in_size": train_data.in_data.shape[1],
#         "out_size": train_data.target_data.shape[1],
#         "hid_num": (2,0),
#         "hid_size": 32, 
#         "hid_type": "MLP",
#         "lat_size": 16,
#         },
#     "epochs": 10000,
#     "runtime": None,
#     "batch_size": 64,
#     "lr": 1e-4,
#     "wd": 1e-4,
#     "l1": 0.0,
#     "shuffle": True,
# }

In [4]:
# define hyperparameter study
hp_study = {
    "method": "grid", # random, bayesian
    #"metric": {
    #    "name": "val_loss",
    #    "goal": "minimize",
    #},
    "parameters": {
        #"l1": {
        #    "values": [1e-4, 1e-3, 1e-2]
        #},
        "gc": {
            "values": [1e3, 1e2, 1e1, 1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
        }
    }
}

In [None]:
# create sweep
sweep_id = wandb.sweep(hp_study, project=wandb_project)

In [None]:
# download data from wandb
file_ext = ".pkl"

api = wandb.Api()

runs = api.runs(wandb_project)
for run in runs:
    for f in run.files():
        if f.name.endswith(file_ext) and not os.path.isfile(f.name):
            print(f"Downloading {os.path.basename(f.name)}.")
            run.file(f.name).download()

In [6]:
# plot losses
save_names = ["F00_l1_1e-03_gc_1e"]
save_path = "models"
model_names = ut.plot_losses(save_names, save_path=save_path)


In [7]:
# print losses
save_file = "models/srnet_model_F00_l1_1e-03_gc_{gc:.0e}.pkl"
for gc in hp_study['parameters']['gc']['values'][::-1]:
    file_name = save_file.format(gc=gc)
    state = joblib.load(file_name)
    print(f"{gc:.0e}:\t{state['total_train_loss']:.3e} {state['total_val_loss']:.3e}")

1e-05:	1.518e-04 2.118e-03
1e-04:	1.155e-04 4.540e-03
1e-03:	1.402e-04 1.522e-03
1e-02:	1.272e-04 2.846e-03
1e-01:	1.295e-04 7.527e-04
1e+00:	1.486e-04 6.193e-04
1e+01:	1.184e-04 3.138e-04
1e+02:	9.637e-05 5.301e-04
1e+03:	2.427e-04 1.153e-03


Notes:
    
* Validation errors are lowest for $g_c$ between `1e-1` and `1e+2`, with the minimum (`3.138e-04`) at `1e+01`
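The ranking behind this note can be reproduced from the printed table alone. The sketch below copies the losses from the printout above into a dictionary (the variable names are illustrative) and sorts the ghost coefficients by validation loss.

```python
# (train_loss, val_loss) per ghost coefficient, copied from the printout above
losses = {
    1e-05: (1.518e-04, 2.118e-03),
    1e-04: (1.155e-04, 4.540e-03),
    1e-03: (1.402e-04, 1.522e-03),
    1e-02: (1.272e-04, 2.846e-03),
    1e-01: (1.295e-04, 7.527e-04),
    1e+00: (1.486e-04, 6.193e-04),
    1e+01: (1.184e-04, 3.138e-04),
    1e+02: (9.637e-05, 5.301e-04),
    1e+03: (2.427e-04, 1.153e-03),
}

# Rank ghost coefficients by validation loss (ascending).
ranked = sorted(losses, key=lambda gc: losses[gc][1])
best_gc = ranked[0]

print(f"best gc: {best_gc:.0e} (val_loss={losses[best_gc][1]:.3e})")
print("top 4:", [f"{gc:.0e}" for gc in ranked[:4]])
```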

In [8]:
# load data
data_path = "data_1k"

in_var = "X00"
lat_var = "G00"
target_var = "F00"

mask_ext = ".mask"
masks = joblib.load(os.path.join(data_path, in_var + mask_ext))     # TODO: create mask if file does not exist

train_data = SRData(data_path, in_var, lat_var, target_var, masks["train"])
val_data = SRData(data_path, in_var, lat_var, target_var, masks["val"])

In [9]:
# overview latent feature variance and alpha matrix
model_path = "models"
model_ext = ".pkl"

models = [
    "srnet_model_F00_l1_1e-03",
    "srnet_model_F00_l1_1e-03_gc_1e-05",
    "srnet_model_F00_l1_1e-03_gc_1e-04",
    "srnet_model_F00_l1_1e-03_gc_1e-03",
    "srnet_model_F00_l1_1e-03_gc_1e-02",
    "srnet_model_F00_l1_1e-03_gc_1e-01",
    "srnet_model_F00_l1_1e-03_gc_1e+00",
    "srnet_model_F00_l1_1e-03_gc_1e+01",
    "srnet_model_F00_l1_1e-03_gc_1e+02",
    "srnet_model_F00_l1_1e-03_gc_1e+03",
]

for model_name in models:
    print(model_name)
    model = ut.load_model(model_name + model_ext, model_path, SRNet, model_type="live")
    
    with torch.no_grad():
        preds, acts = model(train_data.in_data, get_lat=True)
        
    ut.get_node_order(acts, show=True)
     
    ut.get_param_diff(model_name + model_ext, model_path, show=True)
    
    print("")

srnet_model_F00_l1_1e-03
[0.1238546, 0.08738362, 0.01592117, 0.00017046831, 1.50748965e-05, 1.2677921e-07, 1.1294611e-07, 4.7373774e-08, 4.062948e-08, 4.3782458e-09, 4.0273433e-09, 3.2008614e-09, 2.0540378e-09, 1.9341104e-09, 9.1654095e-10, 9.0483304e-10]
[15, 0, 8, 10, 1, 5, 14, 3, 6, 9, 12, 7, 13, 11, 4, 2]
0.0

srnet_model_F00_l1_1e-03_gc_1e-05
[0.40937722, 0.3279128, 0.2996628, 0.25111094, 0.24648619, 0.24086009, 0.19747312, 0.19093783, 0.17695394, 0.15664122, 0.14782391, 0.14141595, 0.13983782, 0.13594861, 0.121987805, 0.09133091]
[10, 1, 12, 7, 0, 11, 5, 14, 4, 9, 15, 8, 6, 13, 2, 3]
0.13792159669562234

srnet_model_F00_l1_1e-03_gc_1e-04
[0.3860327, 0.3000263, 0.2857832, 0.23882972, 0.23224738, 0.22723271, 0.18922329, 0.18449567, 0.15582484, 0.14754942, 0.1405187, 0.13627873, 0.1342742, 0.12352268, 0.11439613, 0.090995476]
[10, 1, 12, 0, 7, 11, 5, 14, 4, 9, 15, 8, 6, 13, 2, 3]
0.13385708226822865

srnet_model_F00_l1_1e-03_gc_1e-03
[0.293643, 0.25060013, 0.21002333, 0.20667155, 0.

Notes:

* For $g_c$ below `1e+01`, the `live` and `ghost` parameters differ too much and no regularization takes effect

In [10]:
model_name = "srnet_model_F00_l1_1e-03_gc_1e+02"

model = ut.load_model(model_name + model_ext, model_path, SRNet)

with torch.no_grad():
    preds, acts = model(train_data.in_data, get_lat=True)
    
all_nodes = ut.get_node_order(acts, show=True)

[0.1355422, 0.05743135, 0.056502942, 0.03167708, 0.002887015, 0.00059130846, 0.0005267492, 0.0003297159, 0.00032601, 0.00014168238, 0.00011782035, 8.344469e-05, 4.0464194e-05, 3.9992006e-05, 3.485184e-05, 2.331977e-05]
[15, 10, 0, 1, 5, 8, 7, 14, 12, 6, 3, 13, 4, 9, 11, 2]
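`ut.get_node_order` is not shown in this notebook. Judging from the two printed lists (values in descending order, followed by node indices), a plausible reading is that it ranks latent nodes by the variance of their activations. A minimal sketch under that assumption, with the hypothetical name `node_order_by_variance`:

```python
import numpy as np


def node_order_by_variance(acts):
    """Return (sorted variances, node indices), highest variance first.

    acts: array of shape (n_samples, n_latent_nodes).
    """
    variances = np.var(acts, axis=0)
    order = np.argsort(variances)[::-1]  # descending
    return variances[order], order


rng = np.random.default_rng(0)
# Toy activations: node 2 is very active, node 0 weakly active, node 1 constant.
acts_toy = np.stack(
    [0.1 * rng.normal(size=500), np.full(500, 3.0), 2.0 * rng.normal(size=500)],
    axis=1,
)
variances, order = node_order_by_variance(acts_toy)
print(order)
```

Nodes whose variance is close to zero carry essentially no information about the inputs, which is why only the leading entries are kept for plotting.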


In [11]:
nodes = all_nodes[:4]

In [12]:
# select plotting data
x_data = train_data.in_data[:,0]
y_data = train_data.in_data[:,1]
z_data = [
    ("target", train_data.target_data),
    #("x**2", x_data**2), 
    #("cos(y)", np.cos(y_data)), 
    #("x*y", x_data * y_data),
]
plot_size = train_data.target_data.shape[0]

In [13]:
ut.plot_acts(x_data, y_data, z_data, acts=acts, nodes=nodes, model=model, bias=True, nonzero=False, agg=False, plot_size=plot_size)


In [15]:
corr_data = [
    ("x**2", x_data**2), 
    ("cos(y)", np.cos(y_data)), 
    ("x*y", x_data * y_data),
    ("x", x_data),
    ("y", y_data),
]

In [16]:
ut.node_correlations(acts, nodes, corr_data, nonzero=True)


Node 15
corr(n15, x**2): 0.8429/0.8429
corr(n15, cos(y)): 0.1551/0.1551
corr(n15, x*y): 0.6328/0.6328
corr(n15, x): 0.2401/0.2401
corr(n15, y): 0.0414/0.0414

Node 10
corr(n10, x**2): 0.8424/0.8424
corr(n10, cos(y)): 0.1972/0.1972
corr(n10, x*y): 0.6142/0.6142
corr(n10, x): 0.3282/0.3282
corr(n10, y): 0.0358/0.0358

Node 0
corr(n0, x**2): -0.8306/-0.8306
corr(n0, cos(y)): -0.1962/-0.1962
corr(n0, x*y): -0.6365/-0.6365
corr(n0, x): -0.2645/-0.2645
corr(n0, y): -0.0246/-0.0246

Node 1
corr(n1, x**2): 0.7807/0.7807
corr(n1, cos(y)): 0.2097/0.2097
corr(n1, x*y): 0.6917/0.6917
corr(n1, x): 0.2753/0.2753
corr(n1, y): 0.0683/0.0683


Notes:

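* The high-variance latent features still do not split into the desired latent functions: all four nodes correlate most strongly with `x**2` (|corr| between 0.78 and 0.84) and second most strongly with `x*y`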
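`ut.node_correlations` is likewise not shown here. Assuming it reports Pearson correlations between a node's activations and each candidate function, the computation can be sketched as follows. The helper `pearson` and the toy `node` are hypothetical, and the pair of values printed per line in the output above is not reproduced.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = rng.uniform(-1.0, 1.0, size=1000)

candidates = {"x**2": x**2, "cos(y)": np.cos(y), "x*y": x * y, "x": x, "y": y}

# Toy latent node that mixes x**2 and x*y, similar to the entangled nodes above.
node = 1.0 * x**2 + 0.5 * x * y

for name, values in candidates.items():
    print(f"corr(node, {name}): {pearson(node, values):.4f}")
```

A node that had cleanly isolated one latent function would show a correlation near ±1 with that function and much smaller values for the others.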

### Step 6.2: Check $a_1$, $a_2$ and $g_c$ parameters

In [17]:
# set wandb project
wandb_project = "62-a1-a2-gc-study-F00"

In [18]:
# define hyperparameters
# hyperparams = {
#     "arch": {
#         "in_size": train_data.in_data.shape[1],
#         "out_size": train_data.target_data.shape[1],
#         "hid_num": (2,0),
#         "hid_size": 32, 
#         "hid_type": ("DSN", "MLP"),
#         "lat_size": 16,
#         },
#     "epochs": 10000,
#     "runtime": None,
#     "batch_size": 64,
#     "lr": 1e-4,
#     "wd": 1e-4,
#     "l1": 0.0,
#     "a1": 0.0,
#     "a2": 0.0,
#     "gc": 0.0,
#     "shuffle": True,
# }

In [19]:
# define hyperparameter study
hp_study = {
    "method": "grid", # random, bayesian
    #"metric": {
    #    "name": "val_loss",
    #    "goal": "minimize",
    #},
    "parameters": {
        "a1": {
            "values": [1e-5, 1e-3]
        },
        "a2": {
            "values": [1e-5, 1e-3]
        },
        "gc": {
            "values": [1e0, 1e+1, 1e+2]
        }
    }
}
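For reference, a grid sweep tries every combination of the listed values. The following sketch expands the grid locally with `itertools.product`; it does not call wandb, and the literal dictionary simply repeats the `parameters` block above.

```python
from itertools import product

parameters = {
    "a1": {"values": [1e-5, 1e-3]},
    "a2": {"values": [1e-5, 1e-3]},
    "gc": {"values": [1e0, 1e+1, 1e+2]},
}

names = list(parameters)
grid = [
    dict(zip(names, combo))
    for combo in product(*(parameters[name]["values"] for name in names))
]

print(len(grid))  # number of runs in the full grid
for config in grid[:3]:
    print(config)
```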

In [None]:
# create sweep
sweep_id = wandb.sweep(hp_study, project=wandb_project)

In [None]:
# download data from wandb
file_ext = ".pkl"

api = wandb.Api()

runs = api.runs(wandb_project)
for run in runs:
    for f in run.files():
        if f.name.endswith(file_ext) and not os.path.isfile(f.name):
            print(f"Downloading {os.path.basename(f.name)}.")
            run.file(f.name).download()

In [20]:
# plot losses
save_names = ["a2_1e-03_gc", "a2_1e-05_gc"]
save_path = "models"
models = ut.plot_losses(save_names, save_path=save_path)


In [21]:
# overview latent feature variance and alpha matrix
model_path = "models"
model_ext = ".pkl"

alpha_eps = 1e-4

for model_name in models:
    print(model_name)
    model = ut.load_model(model_name + model_ext, model_path, SRNet)
    
    with torch.no_grad():
        preds, acts = model(train_data.in_data, get_lat=True)
        
    all_nodes = ut.get_node_order(acts, show=True)
    
    ut.get_param_diff(model_name + model_ext, model_path, show=True)
    
    if alpha_eps:
        alpha = model.layers1.alpha.detach().cpu().numpy()[all_nodes]
        print(alpha[np.abs(alpha).sum(axis=1) > alpha_eps])
    
    print("")

srnet_model_F00_a1_1e-03_a2_1e-03_gc_1e+00
[1.7262817e-09, 7.0760814e-10, 5.343229e-10, 2.220446e-16, 5.551115e-17, 5.551115e-17, 5.551115e-17, 5.551115e-17, 1.3877788e-17, 1.3877788e-17, 1.3877788e-17, 8.271806e-25, 1.2924697e-26, 0.0, 0.0, 0.0]
[15, 1, 0, 9, 4, 8, 10, 11, 2, 5, 12, 6, 3, 7, 13, 14]
0.00013756640482292357
[]

srnet_model_F00_a1_1e-03_a2_1e-03_gc_1e+01
[0.5570939, 0.10172816, 0.060101792, 0.0026873527, 5.1735126e-14, 5.551115e-17, 5.551115e-17, 1.3877788e-17, 1.3877788e-17, 1.3877788e-17, 1.3877788e-17, 1.3877788e-17, 1.3877788e-17, 3.3881318e-21, 0.0, 0.0]
[15, 0, 1, 9, 7, 2, 8, 4, 5, 10, 11, 12, 14, 3, 6, 13]
8.396129152650397e-06
[[ 3.9434361e-01 -2.4572837e-01]
 [-2.6216727e-01 -1.6837510e-01]
 [-2.2929659e-01 -1.4543431e-01]
 [ 1.9853187e-01  2.5451674e-05]]

srnet_model_F00_a1_1e-03_a2_1e-03_gc_1e+02
[0.54178536, 0.078142114, 0.027209772, 8.022135e-15, 5.435781e-16, 1.5042531e-17, 3.469447e-18, 3.469447e-18, 8.6736174e-19, 8.6736174e-19, 5.421011e-20, 4.9563527e-

Notes:

* It seems that, even with GhostAdam, sparsity w.r.t. the input features is not achieved (e.g. for `a1_1e-03_a2_1e-03_gc_1e+01`, three of the four printed $\alpha$ rows weight both inputs)

In [22]:
model_name = "srnet_model_F00_a1_1e-03_a2_1e-05_gc_1e+01"

model = ut.load_model(model_name + model_ext, model_path, SRNet)

with torch.no_grad():
    preds, acts = model(train_data.in_data, get_lat=True)
    
all_nodes = ut.get_node_order(acts, show=True)

[0.47209367, 0.29460657, 0.042899564, 0.004746877, 9.885634e-14, 5.551115e-17, 5.551115e-17, 5.551115e-17, 5.551115e-17, 1.3877788e-17, 1.3877788e-17, 1.3877788e-17, 2.1684043e-19, 1.3552527e-20, 0.0, 0.0]
[15, 0, 1, 9, 7, 4, 10, 11, 13, 2, 5, 14, 6, 3, 8, 12]


In [23]:
nodes = all_nodes[:4]

In [24]:
# select plotting data
x_data = train_data.in_data[:,0]
y_data = train_data.in_data[:,1]
z_data = [
    ("target", train_data.target_data),
    #("x**2", x_data**2), 
    #("cos(y)", np.cos(y_data)), 
    #("x*y", x_data * y_data),
]
plot_size = train_data.target_data.shape[0]

In [25]:
ut.plot_acts(x_data, y_data, z_data, acts=acts, nodes=nodes, model=model, bias=True, nonzero=False, agg=False, plot_size=plot_size)


In [26]:
corr_data = [
    ("x**2", x_data**2), 
    ("cos(y)", np.cos(y_data)), 
    ("x*y", x_data * y_data),
    ("x", x_data),
    ("y", y_data),
]

In [27]:
ut.node_correlations(acts, nodes, corr_data, nonzero=True)


Node 15
corr(n15, x**2): 0.7102/0.7102
corr(n15, cos(y)): 0.3716/0.3716
corr(n15, x*y): 0.6367/0.6367
corr(n15, x): 0.2395/0.2395
corr(n15, y): -0.1341/-0.1341

Node 0
corr(n0, x**2): -0.7903/-0.7903
corr(n0, cos(y)): -0.0722/-0.0722
corr(n0, x*y): -0.6560/-0.6560
corr(n0, x): -0.4129/-0.4129
corr(n0, y): -0.2661/-0.2661

Node 1
corr(n1, x**2): -0.7922/-0.7922
corr(n1, cos(y)): -0.0190/-0.0190
corr(n1, x*y): -0.4364/-0.4364
corr(n1, x): 0.0874/0.0874
corr(n1, y): 0.1378/0.1378

Node 9
corr(n9, x**2): -0.7686/-0.7686
corr(n9, cos(y)): -0.0127/-0.0127
corr(n9, x*y): -0.1170/-0.1170
corr(n9, x): 0.1052/0.1052
corr(n9, y): -0.0324/-0.0324


### Step 6.3: Train DSN for longer

Models:

* <s>`srnet_model_F00_a1_1e-05_a2_1e-03_gc_1e+01_max`</s> (training very slowly)
    
* `srnet_model_F00_a1_1e-03_a2_1e-05_gc_1e+01_max`
    
* `srnet_model_F00_a1_1e-03_a2_1e-03_gc_1e+01_max`

* `srnet_model_F00_a1_1e-05_a2_1e-05_gc_1e+01_max`

In [79]:
# set wandb project
wandb_project = "62-a1-a2-gc-study-F00"

In [80]:
# define hyperparameters
# hyperparams = {
#     "arch": {
#         "in_size": train_data.in_data.shape[1],
#         "out_size": train_data.target_data.shape[1],
#         "hid_num": (2,0),
#         "hid_size": 32, 
#         "hid_type": ("DSN", "MLP"),
#         "lat_size": 16,
#         },
#     "epochs": 100000,
#     "runtime": None,
#     "batch_size": 64,
#     "lr": 1e-4,
#     "wd": 1e-4,
#     "l1": 0.0,
#     "a1": 0.0,
#     "a2": 0.0,
#     "gc": 10.0,
#     "shuffle": True,
# }

In [81]:
# download data from wandb
file_ext = ".pkl"

api = wandb.Api()

runs = api.runs(wandb_project)
for run in runs:
    for f in run.files():
        if f.name.endswith(file_ext) and not os.path.isfile(f.name):
            print(f"Downloading {os.path.basename(f.name)}.")
            run.file(f.name).download()

Downloading srnet_model_F00_a1_1e-05_a2_1e-05_gc_1e+01_max.pkl.


In [82]:
# plot losses
save_names = ["gc_1e+01_max"]
save_path = "models"
models = ut.plot_losses(save_names, save_path=save_path)


In [83]:
# overview latent feature variance and alpha matrix
model_path = "models"
model_ext = ".pkl"

alpha_eps = 1e-4

for model_name in models:
    print(model_name)
    model = ut.load_model(model_name + model_ext, model_path, SRNet)
    
    with torch.no_grad():
        preds, acts = model(train_data.in_data, get_lat=True)
        
    all_nodes = ut.get_node_order(acts, show=True)
    
    ut.get_param_diff(model_name + model_ext, model_path, show=True)
    
    if alpha_eps:
        alpha = model.layers1.alpha.detach().cpu().numpy()[all_nodes]
        print(alpha[np.abs(alpha).sum(axis=1) > alpha_eps])
    
    print("")

srnet_model_F00_a1_1e-03_a2_1e-03_gc_1e+01_max
[0.20942572, 8.271806e-25, 2.0679515e-25, 2.0679515e-25, 5.169879e-26, 5.169879e-26, 5.169879e-26, 1.2924697e-26, 1.2924697e-26, 8.077936e-28, 8.077936e-28, 8.077936e-28, 5.04871e-29, 0.0, 0.0, 0.0]
[15, 8, 10, 13, 1, 2, 3, 7, 11, 0, 6, 9, 14, 4, 5, 12]
9.46639256872379e-06
[[ 0.21940252 -0.16427106]]

srnet_model_F00_a1_1e-03_a2_1e-05_gc_1e+01_max
[0.3616124, 2.1175824e-22, 2.0679515e-25, 1.2924697e-26, 3.2311743e-27, 3.2311743e-27, 8.077936e-28, 2.019484e-28, 2.019484e-28, 5.04871e-29, 1.2621775e-29, 0.0, 0.0, 0.0, 0.0, 0.0]
[15, 0, 7, 2, 5, 6, 9, 12, 14, 11, 8, 1, 3, 4, 10, 13]
2.8323291445613374e-06
[[ 0.3638257  -0.26250717]]

srnet_model_F00_a1_1e-05_a2_1e-05_gc_1e+01_max
[0.59949, 7.270142e-27, 3.2311743e-27, 3.2311743e-27, 8.077936e-28, 2.019484e-28, 2.019484e-28, 5.04871e-29, 5.04871e-29, 5.04871e-29, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[9, 4, 2, 10, 13, 0, 1, 3, 5, 8, 6, 7, 11, 12, 14, 15]
1.1918593743582121e-06
[[1.1544445 1.0789212]]

Notes:

* When training for 100000 epochs, the applied regularization seems to be too strong: in all three models only one latent node retains non-negligible variance, so all information is pushed into that node

**TODO**:

* Train `max` with less regularization?

* Determine ratios from hardcoded network?

### Step 6.4: Increase $a_2$ further

Models:

* `srnet_model_F00_a1_1e-05_a2_1e-02_gc_1e+01`
    
* `srnet_model_F00_a1_1e-05_a2_1e-01_gc_1e+01`

* `srnet_model_F00_a1_1e-05_a2_1e+00_gc_1e+01`

In [32]:
# set wandb project
wandb_project = "62-a1-a2-gc-study-F00"

In [33]:
# define hyperparameters
# hyperparams = {
#     "arch": {
#         "in_size": train_data.in_data.shape[1],
#         "out_size": train_data.target_data.shape[1],
#         "hid_num": (2,0),
#         "hid_size": 32, 
#         "hid_type": ("DSN", "MLP"),
#         "lat_size": 16,
#         },
#     "epochs": 10000,
#     "runtime": None,
#     "batch_size": 64,
#     "lr": 1e-4,
#     "wd": 1e-4,
#     "l1": 0.0,
#     "a1": 0.0,
#     "a2": 0.0,
#     "gc": 10.0,
#     "shuffle": True,
# }

In [34]:
# download data from wandb
file_ext = ".pkl"

api = wandb.Api()

runs = api.runs(wandb_project)
for run in runs:
    for f in run.files():
        if f.name.endswith(file_ext) and not os.path.isfile(f.name):
            print(f"Downloading {os.path.basename(f.name)}.")
            run.file(f.name).download()

In [36]:
# plot losses
save_names = ["a2_1e-02_gc_1e+01", "a2_1e-01_gc_1e+01", "a2_1e+00_gc_1e+01"]
save_path = "models"
models = ut.plot_losses(save_names, save_path=save_path)


In [39]:
# overview latent feature variance and alpha matrix
model_path = "models"
model_ext = ".pkl"

alpha_eps = 1e-17

for model_name in models:
    print(model_name)
    model = ut.load_model(model_name + model_ext, model_path, SRNet)
    
    with torch.no_grad():
        preds, acts = model(train_data.in_data, get_lat=True)
        
    all_nodes = ut.get_node_order(acts, show=True)
    
    ut.get_param_diff(model_name + model_ext, model_path, show=True)
    
    if alpha_eps:
        alpha = model.layers1.alpha.detach().cpu().numpy()[all_nodes]
        print(alpha[np.abs(alpha).sum(axis=1) > alpha_eps])
    
    print("")

srnet_model_F00_a1_1e-05_a2_1e-02_gc_1e+01
[3.1589572e-08, 3.8916146e-09, 1.2906931e-09, 2.750616e-12, 3.469447e-18, 3.469447e-18, 8.6736174e-19, 8.6736174e-19, 4.6713624e-19, 2.1684043e-19, 1.323489e-23, 0.0, 0.0, 0.0, 0.0, 0.0]
[1, 9, 0, 15, 10, 13, 2, 12, 7, 4, 3, 5, 6, 8, 11, 14]
2.496143645252759e-05
[[-6.1761035e-05 -2.4773335e-04]
 [ 1.7401895e-04 -6.8831014e-06]
 [-1.7340572e-05 -6.0306789e-05]
 [ 4.3078588e-05  1.5295534e-05]
 [-1.2220856e-06 -1.3482542e-06]
 [ 3.5727801e-06 -1.6661281e-05]
 [-5.0514473e-07  1.3857356e-06]
 [ 3.8097505e-06  2.4295218e-06]
 [ 2.5103918e-06 -1.5627547e-06]
 [ 4.9164951e-06  3.3473179e-06]
 [-3.8375717e-07  1.4726976e-07]
 [-6.4794472e-06  1.2070394e-06]
 [ 7.2593241e-07 -1.1647048e-06]
 [-1.0579968e-06 -5.6599947e-07]
 [ 5.0744279e-07 -5.8805830e-07]
 [-4.9762889e-08  8.6289265e-07]]

srnet_model_F00_a1_1e-05_a2_1e-01_gc_1e+01
[1.1470445e-10, 1.4095036e-12, 1.2422269e-12, 7.92382e-16, 7.5970974e-17, 5.551115e-17, 1.3877788e-17, 1.3877788e-17, 7.

Notes:

* There is still no clear dependence on only one input feature

### Step 6.5: Apply hardcoded $\alpha$ mask

In [40]:
from srnet_hard import SRNet as SRNetH

In [85]:
model_name = "srnet_model_F00_hard"

model = ut.load_model(model_name + model_ext, model_path, SRNetH)

with torch.no_grad():
    preds, acts = model(train_data.in_data, get_lat=True)
    
all_nodes = ut.get_node_order(acts, show=True)

[1.6012026, 1.4713281, 0.00053509773, 0.0]
[0, 2, 1, 3]


In [86]:
nodes = all_nodes[:3]

In [87]:
# select plotting data
x_data = train_data.in_data[:,0]
y_data = train_data.in_data[:,1]
z_data = [
    ("target", train_data.target_data),
    #("x**2", x_data**2), 
    #("cos(y)", np.cos(y_data)), 
    #("x*y", x_data * y_data),
]
plot_size = train_data.target_data.shape[0]

In [88]:
ut.plot_acts(x_data, y_data, z_data, acts=acts, nodes=nodes, model=model, bias=True, nonzero=False, agg=False, plot_size=plot_size)


In [119]:
fig, ax = plt.subplots()

ax.scatter(x_data, x_data**2)
ax.scatter(x_data, acts[:,0])
# ax.scatter(x_data, -1.1*acts[:,0] - 5)

plt.show()


In [109]:
fig, ax = plt.subplots()

ax.scatter(y_data, np.cos(y_data))
ax.scatter(y_data, acts[:,1])

plt.show()


In [43]:
# select plotting data
x_data = train_data.in_data[:,0]
y_data = train_data.in_data[:,1]
z_data = [
    #("target", train_data.target_data),
    #("x**2", x_data**2), 
    #("cos(y)", np.cos(y_data)), 
    ("x*y", x_data * y_data),
]
plot_size = train_data.target_data.shape[0]

In [52]:
ut.plot_acts(x_data, y_data, z_data, acts=acts, nodes=[2], model=model, bias=True, nonzero=False, agg=False, plot_size=plot_size)


In [47]:
corr_data = [
    ("x**2", x_data**2), 
    ("cos(y)", np.cos(y_data)), 
    ("x*y", x_data * y_data),
    ("x", x_data),
    ("y", y_data),
]

In [48]:
ut.node_correlations(acts, nodes, corr_data, nonzero=True)


Node 0
corr(n0, x**2): -0.9974/-0.9974
corr(n0, cos(y)): 0.0179/0.0179
corr(n0, x*y): -0.1693/-0.1693
corr(n0, x): -0.2781/-0.2781
corr(n0, y): -0.0388/-0.0388

Node 2
corr(n2, x**2): 0.5474/0.5474
corr(n2, cos(y)): 0.3007/0.3007
corr(n2, x*y): 0.8389/0.8389
corr(n2, x): 0.1973/0.1973
corr(n2, y): 0.0243/0.0243

Node 1
corr(n1, x**2): 0.0225/0.0225
corr(n1, cos(y)): -0.9074/-0.9074
corr(n1, x*y): 0.0765/0.0765
corr(n1, x): 0.0105/0.0105
corr(n1, y): 0.4290/0.4290


What are the individual error terms?

In [53]:
model

SRNet(
  (layers1): SparseJacobianNN(
    (w): ParameterList(
        (0): Parameter containing: [torch.FloatTensor of size 4x2x32]
        (1): Parameter containing: [torch.FloatTensor of size 4x32x32]
        (2): Parameter containing: [torch.FloatTensor of size 4x32x1]
    )
    (b): ParameterList(
        (0): Parameter containing: [torch.FloatTensor of size 4x32]
        (1): Parameter containing: [torch.FloatTensor of size 4x32]
        (2): Parameter containing: [torch.FloatTensor of size 4x1]
    )
  )
  (layers2): Sequential(
    (0): Linear(in_features=4, out_features=1, bias=True)
  )
)

In [59]:
(preds - train_data.target_data).pow(2).mean().item()

0.0002780820650514215

In [61]:
model.layers1.alpha

Parameter containing:
tensor([[1., 0.],
        [0., 1.],
        [1., 1.],
        [0., 0.]], requires_grad=True)

In [64]:
model.layers1.alpha.requires_grad = False

In [66]:
a = model.layers1.alpha

In [68]:
a.abs().sum().item()

4.0

In [70]:
a.abs().sum(dim=1).pow(2).sum()

tensor(6.)
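Written out for the hardcoded mask, the two quantities computed above are the entry-wise $L_1$ norm and the sum of squared row-wise $L_1$ norms:

$$
\alpha = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 0 & 0 \end{pmatrix}, \qquad
\sum_{i,j} |\alpha_{ij}| = 1 + 1 + 2 + 0 = 4, \qquad
\sum_i \Big( \sum_j |\alpha_{ij}| \Big)^2 = 1^2 + 1^2 + 2^2 + 0^2 = 6.
$$

The second quantity weights a row that depends on both inputs more heavily (4 instead of 2) than the plain $L_1$ norm does.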

Is this helpful? Not really.

One concern is that `few_latents` and `few_dependencies` are too similar: before `few_dependencies` is reduced significantly, the latent space collapses into a single node.

Takeaways from discussion with Roy:

* We probably want to row-wise normalize $\alpha$, e.g. via `softmax`

* Regularization via $L_1$ can be done before normalization, regularization via entropy can be done after normalization

* Check Gumbel transformation

* Fix target latent space size and try to enforce `few_dependencies`
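The first two takeaways can be sketched in NumPy as follows. This is only an illustration of the idea, not code from `srnet`: `row_softmax`, `row_entropy` and the values in `alpha_raw` are hypothetical.

```python
import numpy as np


def row_softmax(logits):
    """Row-wise softmax; each row becomes a distribution over the inputs."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def row_entropy(probs, eps=1e-12):
    """Shannon entropy of each row (natural log); 0 means a one-hot row."""
    return -(probs * np.log(probs + eps)).sum(axis=1)


# Unnormalized alpha: row 0 is undecided, row 1 clearly prefers input 0.
alpha_raw = np.array([[0.0, 0.0], [4.0, -4.0]])

l1_penalty = np.abs(alpha_raw).sum()       # L1 acts on the raw values
alpha_norm = row_softmax(alpha_raw)        # rows now sum to 1
entropy_penalty = row_entropy(alpha_norm)  # entropy acts on the normalized rows

print(alpha_norm)
print(l1_penalty, entropy_penalty)
```

Note that under a softmax, raw values of zero correspond to a uniform row rather than a sparse one, so the two penalties pull in different directions in this toy example.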

$L_1$: sparsity (see https://scikit-learn.org/stable/modules/linear_model.html#lasso)

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_lasso_lars_001.png" width="500">

$L_2$: low magnitudes (see https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification)

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_ridge_path_001.png" width="500">
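The difference can be seen in the simplest setting, an orthonormal design ($X^\top X = I$), where both estimators have closed forms: for the objective $\tfrac{1}{2}\|y - Xw\|^2 + \lambda\|w\|_1$ the lasso soft-thresholds the least-squares coefficients, while for $\tfrac{1}{2}\|y - Xw\|^2 + \tfrac{\lambda}{2}\|w\|_2^2$ ridge rescales them. The coefficient values and `lam` below are arbitrary illustrations.

```python
import numpy as np

w_ls = np.array([3.0, -0.4, 0.2, -1.5, 0.05])  # least-squares coefficients
lam = 0.5

# L1 (lasso): soft-thresholding sets small coefficients exactly to zero.
w_l1 = np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam, 0.0)

# L2 (ridge): uniform rescaling shrinks magnitudes but keeps every coefficient.
w_l2 = w_ls / (1.0 + lam)

print("L1:", w_l1)
print("L2:", w_l2)
```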

Notes:

* Adaptive regularization, e.g. an $a_2$ ramp?

* Do we want to apply $L_1$ to the latent features or $\alpha$?

    * $\alpha$ essentially determines how much of each input is passed through to the output

    * Reducing $\alpha$ lets less of the input reach the outputs

    * However, the network could still amplify the signal again

    * What we really want is no activation at the outputs, so why not regularize the activations directly?
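The amplification concern can be illustrated with a toy linear example (all names and numbers below are hypothetical): scaling $\alpha$ down by a factor and scaling the following weight up by the same factor leaves the output unchanged while shrinking the $L_1$ penalty on $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))   # inputs
alpha = np.array([[1.0, 0.5]])  # one latent node reading both inputs
w = np.array([2.0])             # downstream weight on that node


def forward(x, alpha, w):
    """Latent node = w * sum_j alpha_j * x_j (purely linear toy model)."""
    return w * (x * alpha).sum(axis=1)


scale = 100.0
out_original = forward(x, alpha, w)
out_rescaled = forward(x, alpha / scale, w * scale)

print(np.abs(alpha).sum(), np.abs(alpha / scale).sum())  # L1 penalty on alpha
print(np.allclose(out_original, out_rescaled))           # same output
```

In a nonlinear network the compensation is not exact, but the same mechanism can weaken a penalty that acts on $\alpha$ alone.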

$\alpha$ layout (rows: outputs, columns: inputs):

```
 In1ToOut1 In2ToOut1
 In1ToOut2 In2ToOut2
 In1ToOut3 In2ToOut3
 In1ToOut4 In2ToOut4
 In1ToOut5 In2ToOut5
 ...
```

In [71]:
alpha1 = np.array([
    [0.5, 0.5], 
    [0.4, 0.6], 
    [0.45, 0.55]
])

alpha2 = np.array([
    [0.9, 0.1], 
    [0.85, 0.15], 
    [0.8, 0.2]
])

alpha3 = np.array([
    [0.95, 0.05], 
    [0.1, 0.9], 
    [0.3, 0.7]])

In [73]:
# sparsity across all values
print(np.sum(np.abs(alpha1)))
print(np.sum(np.abs(alpha2)))
print(np.sum(np.abs(alpha3)))

3.0
3.0
3.0


In [74]:
# sparsity across rows (full: np.sum(np.sum(np.abs(alpha), axis=1)**2)**0.5)
print(np.sum(np.abs(alpha1), axis=1))
print(np.sum(np.abs(alpha2), axis=1))
print(np.sum(np.abs(alpha3), axis=1))

[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]


In [77]:
# sparsity across columns (full: np.sum(np.sum(np.abs(alpha), axis=0)**2)**0.5)
print(np.sum(np.abs(alpha1), axis=0))
print(np.sum(np.abs(alpha2), axis=0))
print(np.sum(np.abs(alpha3), axis=0))

[1.35 1.65]
[2.55 0.45]
[1.35 1.65]
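The "full" expressions mentioned in the comments of the two cells above can be evaluated directly, using `**` for powers. The helper name `l2_of_l1` is hypothetical; the matrices repeat the three examples defined earlier.

```python
import numpy as np

alphas = {
    "alpha1": np.array([[0.5, 0.5], [0.4, 0.6], [0.45, 0.55]]),
    "alpha2": np.array([[0.9, 0.1], [0.85, 0.15], [0.8, 0.2]]),
    "alpha3": np.array([[0.95, 0.05], [0.1, 0.9], [0.3, 0.7]]),
}


def l2_of_l1(alpha, axis):
    """L2 norm of the L1 norms taken along `axis` (1: per row, 0: per column)."""
    return float(np.sqrt(np.sum(np.sum(np.abs(alpha), axis=axis) ** 2)))


for name, alpha in alphas.items():
    print(name, round(l2_of_l1(alpha, axis=1), 4), round(l2_of_l1(alpha, axis=0), 4))
```

Because every row sums to 1, the row-wise value is identical for all three matrices; the column-wise value separates `alpha2` from the other two but cannot distinguish `alpha1` from `alpha3`.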


In [120]:
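# new objective (reciprocal of the sum of squares)
# NOTE: the values printed below match np.sum(alpha**2), i.e. without the reciprocal
print(np.sum(alpha1**2)**-1)
print(np.sum(alpha2**2)**-1)
print(np.sum(alpha3**2)**-1)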

1.5250000000000001
2.245
2.305
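For comparison, this sketch evaluates both the sum of squares and its reciprocal for the same three matrices. The reciprocal decreases as the rows become more one-hot, which is the direction a minimized objective would need. The dictionary names are illustrative.

```python
import numpy as np

alphas = {
    "alpha1": np.array([[0.5, 0.5], [0.4, 0.6], [0.45, 0.55]]),
    "alpha2": np.array([[0.9, 0.1], [0.85, 0.15], [0.8, 0.2]]),
    "alpha3": np.array([[0.95, 0.05], [0.1, 0.9], [0.3, 0.7]]),
}

results = {}
for name, alpha in alphas.items():
    sum_sq = float(np.sum(alpha**2))
    results[name] = (sum_sq, 1.0 / sum_sq)
    print(f"{name}: sum of squares = {sum_sq:.3f}, reciprocal = {1.0 / sum_sq:.3f}")
```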
