<a href="https://colab.research.google.com/github/cu-applied-math/SciML-Class/blob/main/Labs/lab05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 5 for SciML
Why do we have validation and testing sets?

In [None]:
import numpy as np

Two-sided [Hoeffding inequality](https://en.wikipedia.org/wiki/Hoeffding%27s_inequality#Special_Case:_Bernoulli_RVs) for Bernoulli random variables:
$$
\mathbb{P}\left[ \left| \frac{1}{n}S_n - \mu  \right| \ge t\right] \le 2 \exp\left( -2nt^2\right)
$$
where $S_n=Z_1+Z_2+\ldots+Z_n$ is the sum of $n$ **independent** Bernoulli random variables and $\mu = \mathbb{E}\left[\frac{1}{n}S_n\right]$.
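The bound can be inverted to give a sample size. If the right-hand side is required to be at most some failure probability $\alpha$, then $2\exp(-2nt^2)\le\alpha$ rearranges to $n \ge \ln(2/\alpha)/(2t^2)$. A minimal sketch of that inversion is below; the function name `hoeffding_sample_size` is hypothetical and not part of the lab code.

```python
import math

def hoeffding_sample_size(t, alpha):
    """Smallest integer n such that 2*exp(-2*n*t**2) <= alpha.

    t     : half-width of the desired accuracy interval
    alpha : allowed failure probability (confidence is 1 - alpha)
    """
    return math.ceil(math.log(2.0 / alpha) / (2.0 * t**2))

# Example call with one (t, alpha) pair
n_needed = hoeffding_sample_size(t=0.05, alpha=0.1)
print(n_needed)
```

Rounding up with `math.ceil` keeps the guarantee valid, since any smaller integer would leave the bound above $\alpha$.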

## Part 1
Let $\mu$ be the true risk of a binary classifier $f$ under the 0-1 loss $\ell$. Suppose we have $n$ independent samples that are also independent of $f$, and let $\frac{1}{n}S_n$ be the empirical risk (again under the 0-1 loss) on these samples. Think of these as validation or testing samples.

**Q**: If we want to estimate $\mu$ to an accuracy of $\pm t$, with a confidence of $p=1-\alpha$, how many samples $n$ do we need? Do this for $t=0.001, 0.01, 0.02$ and $0.05$, and for $\alpha$ being $10^{-1},10^{-2},10^{-3},10^{-4},10^{-5}$. Make a table of the resulting $n$ values.

Comment on whether $p$ or $t$ has a "bigger effect".

**A**:

In [None]:
sep=""  # for printing to screen
# sep="|" # for copying to markdown


pList = np.logspace(-5,-1,5)  # these are the failure probabilities alpha = 1 - p
print(end=sep)
print("        ",end=sep)
for p in pList:
    print(f"{p:10.0e}",end=sep)  # header row: one column per alpha
print(flush=True)
for t in [0.001, 0.01, 0.02, 0.05]:
    nValues = []
    print(end=sep)
    print(f"{t:.3f}",end="\t")
    print(end=sep)
    for p in pList:

        n = ???????

        print(f"{int(n):-10d}",end=sep)
    print(flush=True)

## Utility code (for parts 2--4)

In [None]:
import torch
from torch.utils.data import Dataset
import torch.nn as nn
import torch.nn.functional as F
from matplotlib import pyplot as plt

In [None]:
torch.manual_seed(101)
d = 1  # dimension

def get_feature_samples(n):
    return torch.rand( (n,d) )-0.5 # uniform in [-.5,.5]^d

def f(x):
    """ the *true* function that we're trying to learn
    Should be 0 and 1 outputs
    """
    y = (torch.sign( torch.sin( 15*x + 50*x**2 ) ) + 1 )/2
    return y.long() # the datatype torch expects for classification

# if d == 1:
#     # Easy to plot
#     X_plot = torch.linspace(-.5,.5,300)
#     y_plot = f(X_plot)
#     plt.plot( X_plot, y_plot )
#     plt.show()


n_test      = int(1e5) # Pick enough to make this the "truth"
X_test      = get_feature_samples(n_test)
y_test      = f(X_test) # can use torch.vmap if needed
testset     = torch.utils.data.TensorDataset(X_test,  y_test)
testloader  = torch.utils.data.DataLoader(testset,batch_size=1000)

vals,bins   = torch.histogram(y_test.float(),bins=2)
print(vals/n_test) # check if the data is approximately well-balanced
# print(bins)


class Net( nn.Module ):
    """ A simple fully-connected neural net, all hidden layers the same size
    This does binary classification by default (outputDim=2), setup for cross-entropy loss
    """
    def __init__(self, nHiddenLayers=5, width=100, activation = nn.ReLU(), outputDim=2, inputDim = 1):
        super().__init__()
        # Construct a net as instructed (all hidden layers of the same size)
        self.net = []
        self.net.append(nn.Linear(inputDim, width))
        self.net.append( activation )
        for i in range(nHiddenLayers):
            self.net.append(nn.Linear(width, width))
            self.net.append( activation )
        self.net.append(nn.Linear(width, outputDim))
        self.net = nn.Sequential(*self.net)
    def forward(self, x):
        return self.net(x)

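def empiricalRisk( model, dataloader ):
    """ Returns the accuracy (fraction of correct predictions), i.e.,
    one minus the empirical risk under the 0-1 loss.
    We need a dataloader, not a dataset
    (otherwise the size of the tensor isn't correct)
    Assumes the true labels are [0,1,...,outputdim-1]
    """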
    model.eval()
    with torch.no_grad():
        total   = 0
        correct = 0
        for data in dataloader:
            inputs, trueLabels = data
            outputs = model(inputs)
            predicted = torch.argmax(outputs.data, 1)
            total   += trueLabels.size(0)
            # Careful: if predicted is of size (batchSize,1)
            #   and trueLabels are of size (batchSize,) then using "=="
            #   will implicitly broadcast them to the wrong size!
            correct += (predicted.ravel() == trueLabels.ravel() ).sum().item()
            # We did .sum() in case batch size > 1
    return correct/total

def train_model( trainloader, optimizer, epochs = 30, criterion = nn.CrossEntropyLoss()  ):
    """
    The model is not passed in explicitly: this function uses the global
    variable `model`, which must be the model whose parameters were given
    to `optimizer`.
    """
    for t in range(epochs):
        runningLoss = 0.
        for data in trainloader:
            inputs, trueLabels = data
            optimizer.zero_grad()
            outputs = model(inputs)
            loss    = criterion(outputs,trueLabels.flatten() )
            loss.backward()
            optimizer.step()
            runningLoss += loss.item()
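The broadcasting caveat mentioned in the comments of `empiricalRisk` can be reproduced on a tiny example. The sketch below uses NumPy arrays with made-up values (the names `predicted` and `labels` are illustrative); PyTorch tensors follow the same broadcasting rules for `==`.

```python
import numpy as np

predicted = np.array([[0], [1], [1]])  # shape (3, 1), e.g. a column of predictions
labels    = np.array([0, 1, 0])        # shape (3,)

# Comparing (3, 1) with (3,) broadcasts to a (3, 3) matrix of all pairs
wrong = (predicted == labels)
print(wrong.shape, wrong.sum())

# Flattening both sides first gives the intended elementwise comparison
right = (predicted.ravel() == labels.ravel())
print(right.shape, right.sum())
```

The first comparison counts matches over every pair of prediction and label, so the count of "correct" predictions can exceed the batch size.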

## Part 2: for a fixed model, look at validation risk

Fix some classifier $h$ (i.e., model), and for each data point $(X_i,y_i)$ for $i=1,\ldots,n_\text{validation}$, we set the random variable
$$
Z_i = \begin{cases} 1 & y_i = h(X_i) \\ 0 & y_i \neq h(X_i)\end{cases}
$$
so $\frac{1}{n}S_n$ is the **accuracy** of our model $h$ on this data (i.e., one minus the empirical risk under the 0-1 loss).

We'll estimate $\mu$ by evaluating accuracy on a huge number of testing points.

Then, for many **experiments**, we draw a new validation set (of size $n_\text{validation}=1000$) and compare its estimate $\frac{1}{n}S_n$ with $\mu$.  We do many experiments so that we can estimate the probability of $\left|\frac{1}{n}S_n - \mu\right| > t$  (we can choose $t$ *after* we run the experiments).
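The same experiment can be rehearsed without any neural net by simulating the $Z_i$ directly. The sketch below is a simplification, not the lab's intended solution: it assumes a made-up true accuracy `mu_assumed`, draws the number of correct predictions in each validation set from a binomial distribution, and compares the observed frequency of large deviations with the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_assumed    = 0.5    # hypothetical true accuracy of the fixed classifier
n_valid_sim   = 1000   # validation set size, as in the text
n_experiments = 1000   # number of independent validation sets
t             = 0.05   # deviation threshold (can be chosen afterwards)

# Each entry is the accuracy on one simulated validation set:
# the number of correct predictions is Binomial(n, mu), divided by n
accuracies = rng.binomial(n_valid_sim, mu_assumed, size=n_experiments) / n_valid_sim

# Fraction of experiments whose estimate misses mu by more than t
empirical_freq  = np.mean(np.abs(accuracies - mu_assumed) > t)
hoeffding_bound = 2 * np.exp(-2 * n_valid_sim * t**2)

print(f"observed frequency: {empirical_freq:.4f}")
print(f"Hoeffding bound:    {hoeffding_bound:.4f}")
```

Hoeffding is a worst-case bound, so the observed frequency is typically well below it.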


In [None]:
model = Net(nHiddenLayers=3, width=20 )
trueRisk    = empiricalRisk(model, testloader)
print(f"True risk: {trueRisk}")

# Use many validation sets
n_valid       = 1000  # size of each validation set
n_experiments = 1000
for i in range(n_experiments):
    X_valid     = get_feature_samples(n_valid)
    y_valid     = f(X_valid)
    validset    = torch.utils.data.TensorDataset(X_valid, y_valid)
    validloader = torch.utils.data.DataLoader(validset,batch_size=1000)

    # .... do something, record something, ...


# ... do some analysis... compare with Hoeffding bound

## Part 3: choose the best from among many models
Now fix a single validation set, and run many experiments, each with a different set of models.

That is, in each experiment we take a set of 10 models and choose the best one (in terms of validation accuracy; higher is better).
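Why the best-of-10 accuracy tends to be optimistic can be seen in a simulation. The sketch below is an idealization under stated assumptions: all 10 models share the same made-up true accuracy `mu_assumed`, and their validation accuracies are drawn independently (on a single shared validation set they would generally be correlated). All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_assumed    = 0.5    # hypothetical true accuracy, identical for every model
n_valid_sim   = 1000   # validation set size
n_models      = 10     # models compared in each experiment
n_experiments = 2000   # number of repeated selections

# Validation accuracy of every model in every experiment,
# shape (n_experiments, n_models)
acc = rng.binomial(n_valid_sim, mu_assumed,
                   size=(n_experiments, n_models)) / n_valid_sim

best        = acc.max(axis=1)    # accuracy reported for the selected model
mean_single = acc[:, 0].mean()   # average estimate for one fixed model
mean_best   = best.mean()        # average estimate for the selected model

print(f"fixed model:    {mean_single:.4f}")
print(f"best of {n_models}:     {mean_best:.4f}")
```

For a fixed model the validation accuracy is an unbiased estimate of the true accuracy, but the maximum over several models is biased upward. For $M$ models that are fixed independently of the validation data, a union bound over the Hoeffding inequality gives $\mathbb{P}\left[\max_k \left|\frac{1}{n}S_n^{(k)} - \mu_k\right| \ge t\right] \le 2M\exp(-2nt^2)$.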


In [None]:
# Use a single validation set...
X_valid     = get_feature_samples(n_valid)
y_valid     = f(X_valid)
validset    = torch.utils.data.TensorDataset(X_valid, y_valid)
validloader = torch.utils.data.DataLoader(validset,batch_size=1000)

# ....

## Part 4: training and choosing best hyperparameters

Same as Part 3, but now we'll train the models a little first.

In [None]:
n_train     = int(1e3) # not too large, in order to keep this fast
X_train     = get_feature_samples(n_train)
y_train     = f(X_train)
trainset    = torch.utils.data.TensorDataset(X_train, y_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=100, shuffle=True)
criterion   = nn.CrossEntropyLoss()

# Use a single validation set...
X_valid     = get_feature_samples(n_valid)
y_valid     = f(X_valid)
validset    = torch.utils.data.TensorDataset(X_valid, y_valid)
validloader = torch.utils.data.DataLoader(validset,batch_size=1000)

# Hyperparameters:
nHiddenLayers = 3
width         = 20
learning_rate = 0.01
epochs        = 1  # let's make it fast!


...

    model = Net(nHiddenLayers=nHiddenLayers, width=width )
    model.train()
    optimizer     = torch.optim.SGD(model.parameters(),lr=learning_rate)

    # Train it
    train_model( trainloader, optimizer, epochs=epochs )


...
