# Analysis of the loss landscape and dynamics of optimization for training a neural network
- author: hayley song
- date: 2022-02-11 (sat)
- context: hw1 for cs669 2022sp

## Goals
- visualize a 2d projection of the loss landscape and of the optimization trajectory while training a neural network.
- experiment with model/training hyperparams (e.g. model architecture, dataset, training protocols, choice of optimizer) and report their effects on the loss landscape or the trajectory during loss optimization

## Deliverables
- A write-up: an abstract, a description of the experimental set-up, results, and a short discussion. Must contain at minimum:
  - [ ] a plot of train and test loss
  - [ ] a 2d contour plot of the loss landscape around the optimum
  - [ ] a plot of the parameter dynamics in the same 2d projection
  - [ ] do the three above for at least two settings: (null, variation)
- Code
  - well-organized, commented, and
  - reproducible
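
One way to produce the 2d contour plot of the loss landscape around the optimum is to evaluate the loss on a grid spanned by two direction vectors. The sketch below is a minimal NumPy illustration, not the homework's actual code: `loss_grid`, `loss_fn`, `theta_star`, `u1`, `u2`, `radius`, and `n` are hypothetical names, and a toy quadratic stands in for the network loss.

```python
import numpy as np

def loss_grid(loss_fn, theta_star, u1, u2, radius=1.0, n=25):
    """Evaluate loss_fn at theta_star + a*u1 + b*u2 on an n x n grid."""
    alphas = np.linspace(-radius, radius, n)  # coordinates along u1 (x-axis)
    betas = np.linspace(-radius, radius, n)   # coordinates along u2 (y-axis)
    Z = np.empty((n, n))
    for i, b in enumerate(betas):             # rows of Z follow u2
        for j, a in enumerate(alphas):        # columns of Z follow u1
            Z[i, j] = loss_fn(theta_star + a * u1 + b * u2)
    return alphas, betas, Z

# Toy stand-in for a network loss (assumption): a quadratic bowl in 5-D.
H = np.diag([1.0, 4.0, 0.5, 2.0, 3.0])
toy_loss = lambda theta: 0.5 * theta @ H @ theta
theta_star = np.zeros(5)                      # the optimum of the toy loss
u1, u2 = np.eye(5)[0], np.eye(5)[1]           # two orthonormal directions
alphas, betas, Z = loss_grid(toy_loss, theta_star, u1, u2, radius=1.0, n=21)
```

`Z` is laid out with rows along `u2` and columns along `u1`, so `plt.contour(alphas, betas, Z)` would draw the landscape with `u1` on the x-axis and `u2` on the y-axis.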
  

## Notes:
- #[[Qs]] can actually measure the effect of a change in the parameters as the size of the gradient
- #[[Qs]] can visualize the effect of a change in the parameters on the output, but this is done at inference time, i.e. on an already-trained model. So it is not about the learning dynamics, but about the impact of each model weight on the output of a fixed, trained model

## Action items:
- [ ] fix a design of the model and training protocol
  - As a starting point:
    - turn off normalization layers, and
    - use a smooth activation function that is not ReLU
    
    
    
    
    
- [ ] build the visualization module
  - use FD from Umag's repo
  - at the end of a forward step, call FD -> get two vectors, each of length = number of all params
  





  
- [ ] make a reference section and put the BibTeX entries there
- [ ] table of the parameters: categorized into three broad groups,
  - data type
  - NN architecture choices
    - discrete
      - normalization layer type:
      - etc...
    - continuous
      - number of layers:
      - etc...
      
  - Training objective
    - type of regularization: none, l1 (TV),  l2, ...?
    
    

1. Visualize our loss function (i.e. the objective function of the optimization 
problem used for training a NN) over the course of the optimization steps.
The domain of this loss function is the space of all tweaks involved in the training
whose effect on the training process we are interested in studying; they are listed below.

I will refer to these tweaks as the parameters of the loss function. I group 
them into the following categories, based on the literature on Neural Architecture Search (NAS) [todo: cites] and AutoML [todo: cites]. I will closely follow the 
categorization in [reverse-engineering] for the model hyperparameters and training type.


- Model 
- $\theta$: weights of the neural network
- hyperparameters of the neural network architecture
  - discrete variables:
    - type of the layers: fully-connected (FC) vs. convolutional (Conv)
    - number of layers: 
    - 
- Types of loss function (? not sure if this is relevant here)


- b

For visualization of the learning dynamics: 
- project the parameters into a 2D space using the frequent directions method (Ghashami et al, 2016)
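
The projection step can be sketched as follows, assuming the two directions are orthonormal and the trajectory is stored as one flattened parameter vector per optimization step. `project_trajectory`, `thetas`, and `origin` are hypothetical names used only for illustration, and how the directions are obtained is left out here.

```python
import numpy as np

def project_trajectory(thetas, u1, u2, origin=None):
    """Map a (T, d) parameter trajectory to (T, 2) coordinates in span{u1, u2}.

    Assumes u1 and u2 are orthonormal. By default the last point of the
    trajectory (the final weights) is used as the origin of the 2-D plot.
    """
    thetas = np.asarray(thetas, dtype=float)
    if origin is None:
        origin = thetas[-1]
    basis = np.stack([u1, u2])                # shape (2, d)
    return (thetas - origin) @ basis.T        # shape (T, 2)

# Toy trajectory (assumption): points placed at known 2-D coordinates
# inside a 6-D parameter space, ending at the origin point.
d = 6
u1, u2 = np.eye(d)[0], np.eye(d)[3]
final = np.full(d, 0.5)
known_coords = np.array([[2.0, -1.0], [1.0, -0.5], [0.0, 0.0]])
thetas = final + known_coords[:, :1] * u1 + known_coords[:, 1:] * u2
coords = project_trajectory(thetas, u1, u2)
```

Plotting `coords[:, 0]` against `coords[:, 1]` on top of a contour plot computed in the same two directions would show the parameter dynamics and the loss landscape in one picture.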
  
What does the FD algorithm do?
- computes (approximately) the top singular vectors of the SVD of the gradients, i.e. 
  $\mathrm{SVD}(\nabla_{\theta} J^{(t)})$,
  where $J$ is the loss function/training objective, and $t$ is the index of the 
  optimization step
- at each call, we get two vectors $\vec{u}_1, \vec{u}_2$ that can be viewed as 
the top 2 main directions of change of the loss function 
  - these two vectors are conceptually similar to the top 2 principal axes, which 
  are computationally expensive/infeasible to compute exactly in the high-dim space 
  (the domain contains the model weights and the variables for architectural and training
  choices, but the model weights contribute the most to the dimensionality of the 
  domain)
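
To make the idea concrete, here is a minimal NumPy sketch of the basic Frequent Directions update described by Ghashami et al. (2016), applied to a stream of gradient-like row vectors. It is an illustration under stated assumptions, not the implementation in the repo mentioned above: `frequent_directions`, `top_directions`, and the sketch size `ell` are hypothetical names, and random vectors with two planted dominant directions stand in for real gradients.

```python
import numpy as np

def frequent_directions(rows, ell):
    """Stream over rows (n x d); return an ell x d sketch B with B^T B ~ A^T A."""
    rows = np.asarray(rows, dtype=float)
    d = rows.shape[1]
    assert 2 <= ell <= d, "this sketch assumes 2 <= ell <= d"
    B = np.zeros((ell, d))
    for a in rows:
        B[-1] = a                                  # the last row is zero before insertion
        _, s, vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2                         # smallest squared singular value
        shrunk = np.sqrt(np.maximum(s**2 - delta, 0.0))
        B = shrunk[:, None] * vt                   # shrinking zeroes the last row again
    return B

def top_directions(B, k=2):
    """Top-k right singular vectors of the sketch, as orthonormal rows."""
    _, _, vt = np.linalg.svd(B, full_matrices=False)
    return vt[:k]

# Stand-in "gradients" (assumption): 200 vectors in 50-D, two strong directions plus noise.
rng = np.random.default_rng(0)
d = 50
basis = np.linalg.qr(rng.normal(size=(d, 2)))[0].T  # planted directions, shape (2, d)
coeffs = rng.normal(size=(200, 2)) * np.array([5.0, 3.0])
grads = coeffs @ basis + 0.1 * rng.normal(size=(200, d))

B = frequent_directions(grads, ell=8)
u1, u2 = top_directions(B, k=2)
```

This simple variant runs one SVD of an `ell x d` matrix per incoming vector, which keeps memory at `ell * d` numbers instead of storing every gradient. The sketch satisfies the known bound that the spectral norm of `A^T A - B^T B` is at most `||A||_F^2 / ell`, where `A` is the matrix of all streamed rows.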
  
  
Steps:
- use 

## Load libraries

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os,sys
from datetime import datetime
import time
from collections import OrderedDict

sys.dont_write_bytecode = True
from IPython.core.debugger import set_trace as breakpoint

In [3]:
import pandas as pd
import joblib
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path
from typing import Any,List, Set, Dict, Tuple, Optional, Iterable, Mapping, Union, Callable, TypeVar

from pprint import pprint

In [4]:
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
import torchvision
from torchvision import datasets, transforms

import pytorch_lightning as pl
from pytorch_lightning import LightningModule
from pytorch_lightning import loggers as pl_loggers
from pytorch_lightning.callbacks import Callback

# Select Visible GPU
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" 
os.environ["CUDA_VISIBLE_DEVICES"]="0"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


## Import ReprLearn and TileMani packages

In [5]:
import reprlearn as rl

In [6]:
from reprlearn.visualize.utils import show_timg, show_timgs, show_batch, make_grid_from_tensors
from reprlearn.utils.misc import info, now2str, today2str, get_next_version_path, n_iter_per_epoch
