In [1]:
import time
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
from IPython import display

import utils
import functions as F
import optimizers as Opt
  

# Goal for this Cross Validation task: Channels
Even with such a simple model and a simple learning task, the parameter space has enough dimensions that finding the optimal value for any one parameter is difficult.

The apparent fitness of any given setting might, in actuality, be due to effects that are wholly different from and uncorrelated with that setting. Maybe it was just a lucky seed, which is probably the most common fool's gold found in new local minima. Or maybe it was a particular choice of activation, for this subset of network layers, with that particular sequence of starting batches.

So any given search requires robust evaluation, and a clearly defined set of fixed parameters and search targets.

I've decided to focus on finding a performant set of network channels.

The network channels determine both the layer sizes (the "connectivity" strength or intensity of a layer with respect to its input) and the network depth. They are arguably the most complex hyperparameter of the network, considering the possible permutations alone. They also directly affect the number of *other* parameters within the network, such as those in the dense layers.

## "Fixed" parameters
The current fixed set of parameters is:
 - **RNG Seeds: [3310, 99467, 27189, 77771]** 
   
   Four seeds were selected at random. Seeds are the outermost ("starting") loop variable. Every channel sample, and every other fixed variable in rotation against a channel sample, is evaluated against each RNG seed, and the final scores are averaged over the four seeds to get a more robust, generalized metric of performance.
 
 
 - **Loss function: [Cross Entropy (with Softmax)]** 
 
 Cross entropy is well suited for this task, and out of the two implementations available (the other being logistic CE with sigmoid activation), the softmax cross entropy seems to perform better.
 
 
 - **Optimizer: [SGD]** 
 
 Only two optimizers are available at the moment, SGD and Adam, and Adam is significantly more powerful than SGD. Anything SGD does well, Adam will likely do as well or better, so SGD is the more conservative baseline for the search. That made this a simple choice.
 
 
 
 - **Activations: [`Sigmoid`, `SeLU`, `Swish`]** 
   - `sigmoid` is a classic activation with decent, mostly consistent performance on this task. It represents the "logistic" type of nonlinearity, making it an ideal choice for benchmarking.
   
   - `selu` was chosen as a very different type of activation, a representative of the rectifiers, and as the best performing of all the activation functions on this task for the "default" channels [4, 64, 3]. Many times, `selu` will do well when `sigmoid` does not, so I'm hoping to exploit that phenomenon during the search (i.e., by noting whether both do well, both do poorly, or one does well where the other does not).
   
   - `swish`, the "best discovered activation function", was selected for being the only activation currently implemented with learnable parameters. The motivation here is to see whether kernel sizes are at all correlated with `swish`'s performance as an activation, since greater layer sizes also increase `swish` connectivity. Swish already does pretty well on this task, though `selu` generally does better.
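
To make the cost of this rotation concrete, here is a minimal sketch of how the fixed parameters multiply against the channel samples. The activation names are plain strings standing in for the actual classes, and `channels_samples` is a hypothetical two-entry list used only for illustration.

```python
from itertools import product

# Values copied from the fixed-parameter list above.
SEEDS = [3310, 99467, 27189, 77771]
ACTIVATIONS = ["Sigmoid", "SeLU", "Swish"]  # string stand-ins for the real classes

# Hypothetical channel samples, for illustration only.
channels_samples = [[4, 64, 3], [4, 128, 32, 3]]

# Every (seed, activation, channels) combination is one full training run.
runs = list(product(SEEDS, ACTIVATIONS, range(len(channels_samples))))
print(len(runs))  # 24


def num_runs(n_channels, n_seeds=len(SEEDS), n_acts=len(ACTIVATIONS)):
    """Number of training runs for a given number of channel samples."""
    return n_seeds * n_acts * n_channels
```

With the 700 generated channel samples used later in this notebook, `num_runs(700)` gives 4 × 3 × 700 = 8400 training runs.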

* * *


### Secondary search
While channels are my target parameter, the search process should also give us a good deal of information on optimal values for the fixed parameters.

For instance, if a certain activation function within the fixed rotation has consistently poorer performance than the other activations, across the many different combinations of channels and seeds tested, then we have some idea about which activations do not work well for this task.

I intend to capture **all** data during the search within a multi-dimensional array, so we can analyze it in depth.
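
As a rough sketch of what that analysis could look like, the snippet below reduces a loss array indexed by (seed, activation, channels) along its axes. The data here is synthetic and the reduced shape is an assumption for illustration; the real arrays collected by the search carry extra trailing dimensions that would first need to be reduced to one score per run.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed for the synthetic data

# Synthetic stand-in: one final test loss per (seed, activation, channels) run.
n_seeds, n_acts, n_chans = 4, 3, 5
final_loss = rng.uniform(0.1, 1.0, size=(n_seeds, n_acts, n_chans))

# Average over the seed axis to get the seed-robust score per run.
mean_over_seeds = final_loss.mean(axis=0)          # shape (n_acts, n_chans)

# Secondary search: overall mean loss per activation.
act_mean = mean_over_seeds.mean(axis=1)            # shape (n_acts,)

# Primary search: best channel sample (lowest loss) for each activation.
best_chan = mean_over_seeds.argmin(axis=1)         # shape (n_acts,)

# How many channel samples each activation "wins".
wins = np.bincount(mean_over_seeds.argmin(axis=0), minlength=n_acts)

print(act_mean, best_chan, wins)
```

Keeping the seed axis in the stored array, rather than averaging during the search, also leaves room to look at the variance across seeds afterwards.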

In [11]:
#==============================================================================
# CV Setup
#==============================================================================
# Model config
#------------------
SEEDS = [3310, 99467, 27189, 77771]
optimizer   = [Opt.SGD]
objective   = [F.SoftmaxCrossEntropy]
activations = [F.Sigmoid, F.SeLU, F.Swish]

# Target variable: channels
#--------------------------------------------------------------
K_IN  = 4
K_OUT = 3
MAX_DEPTH = 3 # 5 total kernels in channels, including input/outputs
MAX_SIZE = 700 # limit on sum kernels for any given channels sample
kernels = np.array([  31,  41,  32, 151,  16,  53,  78, 157,  47, 149, 144, 121,  11,
                     256,  13, 163,  55, 103, 128, 124,   4,  59, 131,  97,   8,  61,
                      29, 101,  83,  17, 113, 167, 127, 101, 130,  37, 139,   7,  64,
                      79, 109,  71,  43,  80,  67, 101,  19,  23,  74,  73, 107,  89,
                      173, 137])
# Dataset
#-------------------
#X_train = np.load(utils.IRIS_TRAIN)

# Code used in cross-validation search
Some of the code used for the search was only used for this purpose, and does not feature in the normal repo codebase.

That code is reproduced here for reference. 

The **full cross-validation script is featured in the cell just below this one**.

* * *

### Splitting the dataset to the default x_train and x_test
Normally, the full Iris dataset (150 samples) is loaded during training, and split into `x_train` and `x_test` sets then. 

So long as you are using the same random seed for splitting the dataset, you should not *need* to have two separate files for the training and test sets, considering how small the dataset is.

However, for cross-validation we needed to make sure we were only working with the normal `x_train` data, so it became necessary to split the files so that only the training set would be accessed.
```python
# Check if train/test dataset has been
##  created yet in user project
if not os.path.exists(utils.IRIS_TRAIN):
    iris_dataset = utils.IrisDataset()
    _X_train, _X_test = iris_dataset.X_train, iris_dataset.X_test
    #==== confirm evenly distributed class split
    _, counts = np.unique(_X_test[:,-1], return_counts=True)
    print(f'counts: {counts}')
    assert np.all(counts == (_X_test.shape[0] // 3))
    #==== save files
    np.save(utils.IRIS_TRAIN, _X_train)
    np.save(utils.IRIS_TEST, _X_test)
    assert os.path.exists(utils.IRIS_TRAIN) # sanity check
```

* * *

### Generating the kernel sizes used for channel search
Selecting candidate channel sizes is a non-trivial NAS (neural architecture search) task in its own right, but the selection method used here is very simple.

The list holds 54 integers: the 4th through 40th primes (7 to 173), the powers of 2 from 4 to 256, and ten random integers in [50, 150].
```python
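import numpy as np
import primesieve

# list() guards against n_primes returning a non-list sequence type
kernels = list(primesieve.n_primes(40))[3:]  #====== Primes (skip 2, 3, 5)
kernels.extend([2**i for i in range(2, 9)]) #===== Powers of 2 (4..256)
kernels.extend(np.random.randint(50, 151, 10)) #==== Random ints in [50, 150]
#==== Shuffle list
kernels = np.random.permutation(kernels)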
```
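
If `primesieve` is not installed, the same list can be approximated with a small NumPy sieve. This is a sketch, not the code used for the search: `first_n_primes` is a hypothetical helper, and the fixed seed is arbitrary, so the ten random integers will differ from the ones in the `kernels` array above.

```python
import numpy as np


def first_n_primes(n, limit=1000):
    """Return the first n primes using a Sieve of Eratosthenes up to `limit`."""
    is_prime = np.ones(limit + 1, dtype=bool)
    is_prime[:2] = False                      # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            is_prime[p * p::p] = False        # strike out multiples of p
    primes = np.flatnonzero(is_prime)
    if len(primes) < n:
        raise ValueError("limit too small for the requested number of primes")
    return primes[:n].tolist()


rng = np.random.RandomState(0)                # arbitrary seed, for illustration

kernels = first_n_primes(40)[3:]              # primes 7..173 (skip 2, 3, 5)
kernels.extend(2 ** i for i in range(2, 9))   # powers of 2: 4..256
kernels.extend(rng.randint(50, 151, 10).tolist())  # ten random ints in [50, 150]
kernels = rng.permutation(kernels)            # shuffle

print(len(kernels))  # 54
```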

* * *

## Generating the channels
After the list of kernel sizes has been created, we create the full list of sample channels to be evaluated in cross-validation.

Channel sizes are randomly sampled from the kernels, for a randomly generated depth (the length of the channels list). Both the depth and the sum total of the channel sizes are capped.

```python
def generate_channels(kernels, num_samples, mdepth=MAX_DEPTH, 
                      msize=MAX_SIZE, replace=True):
    """ Randomly generate num_samples channels from kernels 
    * Given kernel list, randomly select kernels of randomly generated depth
    * Convert to list, insert K_IN (input channel) at 0, and K_OUT at end
    Params
    ------
    kernels : ndarray.int
        randomly generated list of eligible channel sizes from which channels
        are drawn
    num_samples : int
        number of channels samples desired
    mdepth : int, default=MAX_DEPTH (3)
        maximum number of hidden layers (excluding input/output);
        possible depths range [1, mdepth]
    msize : int, default=MAX_SIZE (700)
        limit on sum-total kernels in a sample
    replace : bool
        whether kernels are randomly sampled with or without replacement
        
    Returns
    -------
    channels_list : list(list(int))
        list of channels samples, of length num_samples
    """
    CHANNELS_LIST = []
    # Helpers
    #-----------------
    #==== Random ops : depth, sampling
    get_depth   = lambda: np.random.choice(mdepth) + 1  # use the mdepth argument, not the global
    get_kernels = lambda depth: np.random.choice(kernels, size=depth, replace=replace)
    #==== resample when sum kernel sizes exceed max-size
    def _resample_bigboys(sample):
        sample_sum = np.sum(sample)
        if sample_sum <= msize: return sample
        depth_in = len(sample)
        while np.sum(sample) > msize:
            sample = get_kernels(depth_in)
        return sample
    #==== type conversion, inserting input/output dims
    def _format_sample(arr):
        lst = list(arr)
        lst.insert(0, K_IN)
        lst.append(K_OUT)
        return lst
    
    # Generate channels
    #-----------------------
    for _ in range(num_samples):
        depth = get_depth()
        candidate_sample = get_kernels(depth)
        sample = _resample_bigboys(candidate_sample)        
        channels = _format_sample(sample)
        CHANNELS_LIST.append(channels)
    return CHANNELS_LIST

# Channels generated from the following call:
#-----> CHANNELS = generate_channels(K, 700)
```
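
A quick way to confirm that a generated list respects both caps is a small validator like the one below. `check_channels` is a hypothetical helper, not part of the repo, and the sample lists are made up for illustration. Note that, as in `generate_channels`, the size cap applies to the hidden kernels only, before `K_IN` and `K_OUT` are added.

```python
def check_channels(channels_list, k_in=4, k_out=3, mdepth=3, msize=700):
    """Return the indices of channel samples that violate any constraint."""
    bad = []
    for i, chan in enumerate(channels_list):
        hidden = chan[1:-1]  # strip the input and output dims
        ok = (
            chan[0] == k_in
            and chan[-1] == k_out
            and 1 <= len(hidden) <= mdepth   # depth cap
            and sum(hidden) <= msize         # sum-of-kernels cap
        )
        if not ok:
            bad.append(i)
    return bad


# Made-up samples: the third exceeds the size cap, the fourth has no hidden layer.
samples = [[4, 64, 3], [4, 256, 173, 167, 3], [4, 256, 256, 256, 3], [4, 3]]
print(check_channels(samples))  # [2, 3]
```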

# Cross validation script
Reproduced here from its originally separate .py script is the cross-validation code.

It essentially specifies the constants and settings described above, loads the prepared generated data, and performs the CV loops.

This style of CV trains for a fixed number of iterations rather than in epochs. As such, the "K-fold" structure is only approximate: for each fold, the model is trained for 1500 iterations on the training portion and then evaluated against the held-out portion, which serves as the "test" set.

New "folds" are randomly selected, but the split percentage is equivalent to 5-fold CV.
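
The random fold selection can be sketched as a seeded permutation followed by an 80/20 cut. This is an assumption about how the split behaves, not the actual `utils.IrisDataset` implementation; `random_fold` and the stand-in data are hypothetical.

```python
import numpy as np


def random_fold(data, train_frac=0.8, seed=None):
    """Shuffle rows with a seeded RNG and cut them into train/test portions."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(data))
    n_train = int(round(train_frac * len(data)))
    return data[idx[:n_train]], data[idx[n_train:]]


# Stand-in for the 120-row training file (assumed: 4 features + 1 label column).
data = np.arange(120 * 5).reshape(120, 5)

train, test = random_fold(data, seed=0)
print(train.shape, test.shape)  # (96, 5) (24, 5)
```

Unlike true 5-fold CV, independently drawn folds are not guaranteed to use every sample as test data exactly once; held-out sets from different draws can overlap.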

```python
"""
Cross validation for parameter search script
"""
import numpy as np

import utils
import layers
import functions as F
import optimizers as Opt

#==================================================
#                   CV Setup                      #
#==================================================
# Model config
#------------------
SEEDS = [3310, 99467, 27189, 77771]
optimizer   = [Opt.SGD]
objective   = [F.SoftmaxCrossEntropy]
activations = [F.Sigmoid, F.SeLU, layers.Swish]

# Training config
#------------------
NUM_ITERS = 1500
batch_size = 6

# Dataset
#-------------------
"""
80/20 train/test split gives us an even 5-fold for cross val
Per fold:
 * 24 test samples
 * 96 train samples
"""
_X_train = np.load(utils.IRIS_TRAIN)
split_size = .8 # already default in dataset

# Target variable: channels
#--------------------------------------------------------------
K_IN  = 4
K_OUT = 3
MAX_DEPTH = 3
MAX_SIZE = 700 # limit on sum kernels for any given channels sample
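# allow_pickle=True is needed on newer NumPy if the ragged channel lists
# were saved as an object array
CHANNELS = list(np.load('GEN_700_CHANNELS.npy', allow_pickle=True))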

# Helpers
#--------------------------------------------------------------
def generate_dataset():
    x_copy = np.copy(_X_train)
    split_seed = np.random.choice(15000)
    _dataset = utils.IrisDataset(x_copy, split_size, split_seed)
    return _dataset

def init_trainer(chan, act, dset, seed):
    return utils.Trainer(chan, Opt.SGD, F.SoftmaxCrossEntropy,
                         act, dataset=dset, steps=NUM_ITERS,
                         batch_size=batch_size, rng_seed=seed)

# Loss tracker
#--------------------------------------------------------------
cv_dims = (len(SEEDS), len(activations), len(CHANNELS), NUM_ITERS, 2)
cv_test_dims = cv_dims[:-2] + (24, 2)
#==== collections
CV_TRAIN_LOSS = np.zeros(cv_dims,      np.float32)
CV_TEST_LOSS  = np.zeros(cv_test_dims, np.float32)


#==================================================
#                    CV Train                     #
#==================================================

for idx_seed, seed in enumerate(SEEDS):
    for idx_act, act in enumerate(activations):
        dataset = generate_dataset()
        for idx_chan, channels in enumerate(CHANNELS):
            # copy data for safety
            dataset.X_train = np.copy(dataset.X_train)
            dataset.X_test  = np.copy(dataset.X_test)

            # make trainer and train
            trainer = init_trainer(channels, act, dataset, seed)
            trainer()

            # Store loss
            lh_train, lh_test = trainer.get_loss_histories()
            CV_TRAIN_LOSS[idx_seed, idx_act, idx_chan] = lh_train
            CV_TEST_LOSS[ idx_seed, idx_act, idx_chan] = lh_test

# Save results
#--------------------------------------------------------------
np.save('CV_train_loss', CV_TRAIN_LOSS) # WARNING: large file!
np.save('CV_test_loss',  CV_TEST_LOSS)


```
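
The "large file" warning in the script can be checked with some quick arithmetic on the tracker shapes. This sketch assumes the channels file holds 700 samples, as its name suggests; the variable names are otherwise illustrative.

```python
import numpy as np

# Dimensions taken from the script above (700 channel samples is an assumption).
n_seeds, n_acts, n_chans, n_iters = 4, 3, 700, 1500

cv_dims = (n_seeds, n_acts, n_chans, n_iters, 2)
cv_test_dims = cv_dims[:-2] + (24, 2)

itemsize = np.dtype(np.float32).itemsize  # 4 bytes per float32

train_bytes = int(np.prod(cv_dims, dtype=np.int64)) * itemsize
test_bytes = int(np.prod(cv_test_dims, dtype=np.int64)) * itemsize

print(train_bytes, test_bytes)  # 100800000 1612800
```

So the training-loss array is roughly 100 MB in memory and on disk, while the test-loss array is under 2 MB.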


