# 2022 Flatiron Machine Learning x Science Summer School

This notebook trains SRNets on Colab and explains the hyperparameters.

1. Clone the `symrep` repo and install the missing libraries:

In [None]:
!git clone https://github.com/fabxy/symrep.git

In [None]:
%cd symrep

In [None]:
!pip install wandb einops --upgrade

2. If you want to use `wandb`, log in with your API key:

In [None]:
!wandb login

In [None]:
import os
import torch
import joblib
from srnet import SRNet, SRData, run_training
from sdnet import SDNet, SDData
import wandb

3. If you want to use `wandb`, specify the project name and, if applicable, the sweep ID and the number of sweep runs:

In [None]:
# set device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# set wandb options
wandb_project = "132-bn-mask-DSN-sd-study-F00_v5"
sweep_id = None
sweep_num = None
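
As a rough illustration of where a sweep ID could come from, the sketch below builds a grid sweep configuration in the `wandb` format. The swept keys (`lr`, `sd`) and their values are hypothetical, and it is an assumption that `run_training` picks up swept values under these names. The `wandb.sweep` call is commented out because it needs a logged-in `wandb` session.

```python
import math

# Hypothetical grid sweep over two of the hyperparameters described below.
sweep_config = {
    "method": "grid",
    "parameters": {
        "lr": {"values": [1e-4, 1e-3]},
        "sd": {"values": [1e-7, 1e-6, 1e-5]},
    },
}

# A grid sweep visits every combination once, so this is a natural run count.
num_runs = math.prod(
    len(param["values"]) for param in sweep_config["parameters"].values()
)

# Requires `wandb login`; returns the ID that would be assigned to sweep_id:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project=wandb_project)
# sweep_num = num_runs
```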

4. Load the training and validation data. New data must be in the output format of `np.savetxt`, i.e. a space-separated text file without headers. Additionally, a pickled dictionary is loaded, which contains masks (`np.array`) that select the training, validation and testing data from the complete data file, e.g. `F00.gz`. Alternatively, you can save the training, validation and testing data into separate folders and not pass masks to `SRData`:

In [None]:
# load data
data_path = "data_1k"

in_var = "X00"
lat_var = None         # the true latent feature values, not required for training
target_var = "F00"

mask_ext = ".mask"
masks = joblib.load(os.path.join(data_path, in_var + mask_ext))

train_data = SRData(data_path, in_var, lat_var, target_var, masks["train"], device=device)
val_data = SRData(data_path, in_var, lat_var, target_var, masks["val"], device=device)
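
If you need to create such files yourself, the following sketch may help. It writes toy data with `np.savetxt` and builds split masks. The toy target function, the 70/15/15 split, the use of boolean masks, the `"test"` key and the `X00.gz` file name are assumptions made for illustration; only `F00.gz`, `X00.mask` and the `"train"` and `"val"` keys appear in this notebook.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Hypothetical toy data: two input columns and one target column.
X = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
F = (X[:, 0] ** 2 + X[:, 1]).reshape(-1, 1)

# np.savetxt writes space-separated text without headers and
# compresses automatically when the file name ends in ".gz".
data_dir = tempfile.mkdtemp()
np.savetxt(os.path.join(data_dir, "X00.gz"), X)
np.savetxt(os.path.join(data_dir, "F00.gz"), F)

# Assumed mask layout: one boolean array per split over all rows.
perm = rng.permutation(n_samples)
masks = {name: np.zeros(n_samples, dtype=bool) for name in ("train", "val", "test")}
masks["train"][perm[:700]] = True
masks["val"][perm[700:850]] = True
masks["test"][perm[850:]] = True

# The notebook loads the masks with joblib, so they could be stored with:
# joblib.dump(masks, os.path.join(data_dir, "X00.mask"))

# Read the file back to check the round trip.
X_loaded = np.loadtxt(os.path.join(data_dir, "X00.gz"))
```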

5. Load the symbolic discriminator (SD) function library. The `shuffle` flag determines whether the order of the loaded functions is shuffled; it should be `True` when the library contains more functions than the SD loads per iteration. If the input data, i.e. `train_data.in_data`, is passed to `SDData`, all function values are computed before training; otherwise, they are computed during training. Furthermore, during evaluation each character `U` and `N` in a function string is replaced with a scalar sampled from a uniform distribution (between 0 and 1) and a normal distribution (mean 0 and variance 1), respectively:

In [None]:
# create discriminator data
fun_path = "funs/F00_v5.lib"
shuffle = True

if fun_path:
    disc_data = SDData(fun_path, in_var, shuffle=shuffle)
else:
    disc_data = None
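
The placeholder replacement described above can be pictured with the following minimal sketch. It is not the `SDData` implementation: the helper `fill_placeholders` and the example function string are hypothetical, and the actual library syntax in `funs/F00_v5.lib` may differ.

```python
import random


def fill_placeholders(fun_str, rng):
    """Replace each 'U' with a U(0, 1) sample and each 'N' with an N(0, 1) sample."""
    parts = []
    for ch in fun_str:
        if ch == "U":
            value = rng.uniform(0.0, 1.0)
        elif ch == "N":
            value = rng.gauss(0.0, 1.0)  # mean 0, standard deviation 1
        else:
            parts.append(ch)
            continue
        # Parentheses keep negative samples valid in any position.
        parts.append(f"({value!r})")
    return "".join(parts)


rng = random.Random(0)

# Hypothetical function string; every placeholder gets its own sample.
template = "U * X00[0] + N * X00[1] + U"
filled = fill_placeholders(template, rng)

# Evaluate the filled string for one hypothetical input point.
result = eval(filled, {"X00": [2.0, 3.0]})
```

Because every occurrence is sampled independently, the two `U` placeholders in the example receive different values.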

6. Define the load and save `pkl` file names and the logging frequency. Setting a load file allows restarting the training. If a `wandb` project is defined, the save file is also saved to `wandb`. The logging frequency determines how often `tqdm` and `wandb` are updated, as well as the size of the save file:

In [None]:
# set load and save file
load_file = None
save_file = "models/srnet_model_F00_v5_bn_mask_sd_study.pkl"
log_freq = 25

7. Set the hyperparameters which define the `SRNet` and `SDNet` architectures and the training.

* `arch` defines the `SRNet` architecture:

    * `in_size` and `out_size` are determined by the training data
    
    * `hid_num` and `hid_size` define the hidden layers:
    
        * Note that the general structure of the network is `input -> layers1 -> latent space -> layers2 -> output`
        
        * If `hid_num` or `hid_size` are scalar values, they are applied to both `layers1` and `layers2`
        
        * If `hid_num` or `hid_size` are tuples, the first values are applied to `layers1` and the second values to `layers2`
        
        * Let's call this tuple definition logic (TDL)
        
    * `hid_type` selects between a disentangled sparsity network (`DSN`) and a multi-layer perceptron (`MLP`) (TDL applies)
    
    * `hid_kwargs` allows passing additional keyword arguments to `layers1` and `layers2` (TDL applies):
    
        * `alpha` controls the input mask of the DSN. If `None`, the parameters are learned. If `-1`, the parameters are randomly initialized but not learned (useful when using a pre-trained network). If specific values are defined, these are set and no learning occurs. Since we have previously shown that learning the correct input dependencies is feasible, we fix the input mask to simplify the learning problem.
        
        * `norm` is the identity function for `None` and applies softmax for `"softmax"`
        
        * `prune` sets input mask parameters that are below the defined pruning value to zero
        
    * `lat_size` defines the number of latent space nodes. The value is set to the target value here, as it was previously shown that it can be determined correctly.
    
* `epochs` defines the number of epochs

* `runtime` allows defining a maximum runtime in seconds, after which the training is stopped

* `batch_size` determines the batch size; the current value uses all training data points in one batch

* `shuffle` determines whether the training data points are shuffled during training

* `lr` denotes the learning rate of `SRNet`
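
The tuple definition logic (TDL) described above can be sketched as follows. The helper name `expand_tdl` is hypothetical and is not part of `srnet`; it only illustrates how a scalar or a tuple maps onto `layers1` and `layers2`.

```python
def expand_tdl(value):
    """Return a (layers1, layers2) pair for a scalar or a 2-tuple setting."""
    if isinstance(value, tuple):
        if len(value) != 2:
            raise ValueError("expected exactly two values: (layers1, layers2)")
        return value
    # A scalar applies to both parts of the network.
    return (value, value)


# Values taken from the hyperparameter cell in this notebook.
arch = {"hid_num": (2, 0), "hid_size": 32, "hid_type": ("DSN", "MLP")}
expanded = {key: expand_tdl(val) for key, val in arch.items()}

for key, (first, second) in expanded.items():
    print(f"{key}: layers1={first}, layers2={second}")
```

With the values used in this notebook, `layers1` consists of two DSN hidden layers of size 32, while `layers2` has no hidden layers.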

The following parameters define various types of regularization:

* `wd` denotes the weight decay

* `l1` applies L1 regularization on the latent feature activations

* `a1` and `a2` regularize the DSN input mask according to https://astroautomata.com/data/sjnn_paper.pdf

* `e1` regularizes the row-wise entropy of the normalized DSN input mask

* `e2` is an experimental approach to regularize a combination of entropy and variance

* `e3` is an experimental approach to regularize correlations between latent features and SD library functions

* `sd` weights the real vs. fake predictions of the SD for the latent feature activations

* `disc` describes the SD architecture:

    * `hid_num` and `hid_size` define the hidden layers of the MLP
    
    * If `emb_size` is not `None`, a second MLP is created which embeds multiple dimensions, e.g. the input dimensions and the target value, into a single dimension, which is then passed to the actual SD. If any other information should be embedded, please modify the `run_training` code. In this case, TDL applies to `hid_num` and `hid_size`.
    
    * `lr` defines the SD learning rate
    
    * `wd` defines the SD weight decay
    
    * `iters` defines the number of iterations of SD training per `SRNet` ("generator") iteration
    
    * `gp` defines the level of gradient penalty
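
To make the input mask options (`norm`, `prune`) and the `e1` penalty more concrete, here is a small pure-Python sketch. It assumes that each row of `alpha` belongs to one latent feature, that pruning is a simple threshold on the raw parameters, and that the entropy is the Shannon entropy of each softmax-normalized row. The exact order and reduction used inside `SRNet` are not specified in this notebook.

```python
import math


def softmax(row):
    """Normalize one mask row so that its entries are positive and sum to 1."""
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def prune(row, threshold):
    """Set mask parameters below the pruning value to zero."""
    return [0.0 if v < threshold else v for v in row]


def row_entropy(probs):
    """Shannon entropy of one normalized mask row."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Mask values from the hyperparameter cell: one row per latent feature.
alpha = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

normalized = [softmax(row) for row in alpha]
entropies = [row_entropy(row) for row in normalized]
pruned = [prune(row, 0.5) for row in alpha]

# Assumed reduction: the penalty is the weight times the mean row entropy.
e1 = 0.0
penalty = e1 * sum(entropies) / len(entropies)
```

A row that weights both inputs equally has the maximum entropy `log(2)`, so an entropy penalty pushes each latent feature towards depending on fewer inputs.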

In [None]:
# define hyperparameters
hyperparams = {
    "arch": {
        "in_size": train_data.in_data.shape[1],
        "out_size": train_data.target_data.shape[1],
        "hid_num": (2,0),
        "hid_size": 32, 
        "hid_type": ("DSN", "MLP"),
        "hid_kwargs": {
            "alpha": [[1,0],[0,1],[1,1]],
            "norm": None,
            "prune": None,
            },
        "lat_size": 3,
        },
    "epochs": 30000,
    "runtime": None,
    "batch_size": train_data.in_data.shape[0],
    "shuffle": False,
    "lr": 1e-4,
    "wd": 1e-6,
    "l1": 0.0,
    "a1": 0.0,
    "a2": 0.0,
    "e1": 0.0,
    "e2": 0.0,
    "e3": 0.0,
    "gc": 0.0,
    "sd": 1e-6,
    "disc": {
        "hid_num": 6,
        "hid_size": 128,
        "emb_size": None,
        "lr": 1e-3,
        "wd": 1e-4,
        "iters": 5,
        "gp": 1e-5,
    },
}

8. Start a single training run, or a sweep if `sweep_id` is defined:

In [None]:
def train():
    run_training(
        SRNet, hyperparams, train_data, val_data, SDNet, disc_data,
        load_file=load_file, save_file=save_file, log_freq=log_freq,
        device=device, wandb_project=wandb_project,
    )

In [None]:
# hyperparameter study
if sweep_id:
    wandb.agent(sweep_id, train, count=sweep_num, project=wandb_project)

# one training run
else:
    train()