# Exploratory Analysis On Optimizers

The problem is pretty straightforward. As "A Brief Tour of Deep Learning from a Statistical Perspective" (Nalisnick et al.) puts it:

```
While DNNs have been shown to be universal approximators for some time, these results do not guarantee anything about the class of functions that can be reached by SGD. Thus, there has been much interest in studying the optimization landscape of these models. For many years, it was thought that NN optimization would be hopelessly plagued by local minima (Cheng & Titterington 1994). However, this concern has been alleviated, to a degree, with more recent conjectures that it is not local minima but saddle points that comprise many of the loss surface’s critical points (Dauphin et al. 2014, Kawaguchi 2016). The intuition is that it is unlikely that the optimization surface will be going the same direction in every dimension, as is necessary to build a local minimum. In consequence, much attention has been given to escaping saddle points efficiently (Jin et al. 2017). In addition to classifying critical points, the qualities of the minima are also of interest. In particular, whether minima are wide and flat versus narrow and sharp has been of keen interest (Hochreiter & Schmidhuber 1997a, Keskar et al. 2017). The intuition is that wide minima are likely to generalize to never-before-seen data since there is a neighborhood of parameters that represent roughly equivalent solutions.
```
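
To make the saddle-point intuition in the quote concrete, here is a minimal NumPy sketch (it is not taken from the paper). It classifies a critical point by the signs of its Hessian eigenvalues, then uses random symmetric matrices as a crude, assumed stand-in for Hessians to show how rarely every direction curves upward as the dimension grows. The helper names `classify_critical_point` and `fraction_all_positive` are hypothetical.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)
    if np.all(eigvals > tol):
        return "local minimum"   # curves upward in every direction
    if np.all(eigvals < -tol):
        return "local maximum"   # curves downward in every direction
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"    # mixed curvature
    return "degenerate"          # some eigenvalue is (numerically) zero

# f(x, y) = x^2 + y^2 has Hessian diag(2, 2) at the origin
bowl = classify_critical_point(np.diag([2.0, 2.0]))
# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin
saddle = classify_critical_point(np.diag([2.0, -2.0]))
print(bowl, "|", saddle)

rng = np.random.default_rng(0)

def fraction_all_positive(dim, trials=2000):
    """Fraction of random symmetric matrices whose eigenvalues are all positive.

    Assumption: random symmetric Gaussian matrices are only a toy model here,
    not a claim about the Hessians of real networks.
    """
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        h = (a + a.T) / 2.0
        if np.all(np.linalg.eigvalsh(h) > 0):
            count += 1
    return count / trials

for dim in (1, 2, 4, 8):
    print(dim, fraction_all_positive(dim))
```

In this toy model the fraction of "all directions curve upward" cases drops quickly with dimension, which is the sense in which a random critical point in a high-dimensional space is much more likely to be a saddle than a minimum.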

## Setting up our Basic Network

Here we set up a basic network using Xavier initialization.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
from matplotlib import pyplot as plt

In [2]:
import plotly.graph_objs as go
import plotly.offline as pyo

In [3]:
import tensorflow as tf

### Sequential Class [Harness]

In [4]:
class Sequential(tf.Module):
    """
    Follows a format similar to the Keras implementation of Sequential.
    It's very basic, but good enough for many testing and benchmarking purposes.
    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._layers = []
        self._optimizer = None
        self._loss_func = None
        self.lr = None # Custom learning rate; unused when a TensorFlow optimizer is supplied
    
    def _compile(self, 
                 optimizer = None,
                 loss_fn = None,
                ):
        self._optimizer = optimizer
        self._loss_func = loss_fn
    
    def _predict(self, x0): # Forward propagation
        for layer in self._layers:
            x0 = layer(x0)
        return x0
    
    @tf.function # Graph-mode execution; x and y are converted to tensors automatically
    @tf.autograph.experimental.do_not_convert # Suppress AutoGraph conversion
    def _train_step(self, x, y):
        # A non-persistent tape is enough because gradient() is called only once
        with tf.GradientTape() as t:
            current_loss = self._loss_func(y, self._predict(x))

        gradients = t.gradient(current_loss, self.trainable_variables)
        self._optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        # Manual SGD update, for use with a custom optimizer instead of apply_gradients:
        # for i_trainable_variable, i_gradient in zip(self.trainable_variables, gradients):
        #     i_trainable_variable.assign_sub(self.lr * i_gradient)

        return current_loss # Returns loss tensor
        
    def add(self, layer):
        self._layers.append(layer)
    
    def fit(self, x, y, n_epochs, batch_size):
        # Conversions (assume float32 network)
        x = tf.cast(x, tf.float32)
        y = tf.cast(y, tf.float32)
            
        for _epoch in range(n_epochs):
            # Generate a shared shuffle index
            indices = tf.range(start=0, limit=x.shape[0], dtype=tf.int32)
            shuffled_indices = tf.random.shuffle(indices)
            
            # Shuffle data and labels using the shared shuffle index
            shuffled_x = tf.gather(x, shuffled_indices)
            shuffled_y = tf.gather(y, shuffled_indices)
            
            # Batch loop
            for step in range(0, shuffled_x.shape[0], batch_size):
                batch_x = shuffled_x[step : step + batch_size]
                batch_y = shuffled_y[step : step + batch_size]
                
                loss = self._train_step(batch_x, batch_y)
                print("Training loss in epoch {0} = {1}".format(_epoch, loss))
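
The commented-out manual update in `_train_step` (`assign_sub(self.lr * gradient)`) is plain stochastic gradient descent. The sketch below reproduces that update, together with the shuffle-then-batch loop used in `fit`, in NumPy on a toy linear regression. The data, `true_w`, and the hyperparameters are made up for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem (assumed data, for illustration only)
x = rng.standard_normal((256, 2))
true_w = np.array([[2.0], [-3.0]])
y = x @ true_w

w = np.zeros((2, 1))   # parameters to learn
lr = 0.1               # plays the role of self.lr
batch_size = 16

def mse(weights):
    return float(np.mean((x @ weights - y) ** 2))

initial_loss = mse(w)

for epoch in range(20):
    # Shared shuffle index, as in fit()
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        # Gradient of the mean squared error with respect to w
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)
        # Same rule as variable.assign_sub(lr * gradient)
        w -= lr * grad

final_loss = mse(w)
print(initial_loss, final_loss)
```

An optimizer such as Adam replaces only the `w -= lr * grad` line with a more elaborate update; the surrounding epoch and batch loops stay the same.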

### Dense Layer 

* Xavier Initialization **by default**
* Linear Activation Function **by default**
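
As a quick sanity check on the weight scheme, the following NumPy sketch draws weights the same way the layer does (standard normal samples scaled by `sqrt(2 / (input_size + output_size))`) and compares the empirical standard deviation with the target. The function name `xavier_normal` is hypothetical, and NumPy is used here instead of TensorFlow only to keep the example small.

```python
import numpy as np

def xavier_normal(input_size, output_size, rng):
    """Xavier (Glorot) normal initialization: std = sqrt(2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (input_size + output_size))
    return rng.standard_normal((input_size, output_size)) * std

rng = np.random.default_rng(0)
w = xavier_normal(64, 32, rng)
target_std = np.sqrt(2.0 / (64 + 32))

# The empirical std should be close to the target, up to sampling noise
print(w.shape, float(w.std()), float(target_std))
```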

In [5]:
class Dense_Layer(tf.Module):
    """
    Standard dense (fully connected) layer found in most neural networks
    """
    def __init__(
        self, 
        input_size, 
        output_size, 
        output_layer=False, 
        activation_function="linear",
        **kwargs
    ):
        super().__init__(**kwargs)
        
        self.input_size = input_size
        self.output_size = output_size
        self.output_layer = output_layer
        self.activation_function = activation_function

        # Weight Scheme (Xavier normal: std = sqrt(2 / (fan_in + fan_out)))
        self.w = tf.Variable(
            tf.random.normal([input_size, output_size]) * tf.sqrt(2 / (input_size + output_size)),
            name='w'
        )
        
        # Bias Scheme: one zero-initialized bias per output unit
        self.b = tf.Variable(tf.zeros([output_size]), name='b')
        
    def __call__(self, x):
        match self.activation_function: # Works with Python 3.10 and above
            case "leaky_relu":
                result = tf.nn.leaky_relu(x @ self.w + self.b)
            case "relu":
                result = tf.nn.relu(x @ self.w + self.b)
            case "softmax":
                result = tf.nn.softmax(x @ self.w + self.b)
            case "sigmoid":
                result = tf.nn.sigmoid(x @ self.w + self.b)
            case _:
                result = (x @ self.w + self.b) # Linear (identity) activation
                
        return result

## Gaussian Clusters on a 2D Plane

In [6]:
cluster_amnt = 5000

In [7]:
# Easy Linear Separation
cluster_1 = np.random.multivariate_normal([10, 10], [[1, 0], [0, 5]], cluster_amnt)
cluster_2 = np.random.multivariate_normal([-10, 10], [[1, 0], [0, 5]], cluster_amnt)
cluster_3 = np.random.multivariate_normal([10, -10], [[1, 0], [0, 5]], cluster_amnt)
cluster_4 = np.random.multivariate_normal([-10, -10], [[1, 0], [0, 5]], cluster_amnt)

# No Clear Linear Separation
# cluster_1 = np.random.multivariate_normal([10, 10], [[1, 0], [0, 30]], cluster_amnt)
# cluster_2 = np.random.multivariate_normal([-10, 10], [[1, 0], [0, 30]], cluster_amnt)
# cluster_3 = np.random.multivariate_normal([10, -10], [[1, 0], [0, 30]], cluster_amnt)
# cluster_4 = np.random.multivariate_normal([-10, -10], [[1, 0], [0, 30]], cluster_amnt)

### Pack into a Pandas DataFrame for training

In [9]:
# One constant label array per cluster (float labels 0.0 to 3.0)
l1 = np.full(cluster_amnt, 0.0)
l2 = np.full(cluster_amnt, 1.0)
l3 = np.full(cluster_amnt, 2.0)
l4 = np.full(cluster_amnt, 3.0)

labels = np.concatenate((l1, l2, l3, l4), axis=0)

In [10]:
x = np.concatenate((cluster_1.T[0], cluster_2.T[0], cluster_3.T[0], cluster_4.T[0]), axis=0)
y = np.concatenate((cluster_1.T[1], cluster_2.T[1], cluster_3.T[1], cluster_4.T[1]), axis=0)

data = {
    "f1": x,
    "f2": y,
    "Y": labels
}

df = pd.DataFrame(data)
print(df)

              f1         f2    Y
0       9.455424   9.082024  0.0
1       9.191688   9.775133  0.0
2      11.179970  10.030100  0.0
3       8.708052  11.417686  0.0
4       9.710806  11.759120  0.0
...          ...        ...  ...
19995 -10.530834 -12.085361  3.0
19996  -9.404026 -10.263830  3.0
19997  -9.606129  -8.205653  3.0
19998  -8.660109  -9.715495  3.0
19999  -8.984113 -10.369897  3.0

[20000 rows x 3 columns]


In [17]:
fig = px.scatter(df, x='f1', y='f2', color='Y', title='Gaussian Clusters')
fig.show(renderer="browser")
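
The `fit` call at the end of this notebook takes `X_train` and `y_train`, and `CategoricalCrossentropy` expects one-hot labels. One possible way to build such arrays is sketched below. The helper `make_training_arrays`, the 80/20 split, and the small stand-in data are assumptions for illustration, not the notebook's actual preprocessing.

```python
import numpy as np

def make_training_arrays(features, labels, n_classes=4, train_fraction=0.8, seed=0):
    """Shuffle, one-hot encode the labels, and split into train and test arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))

    features = np.asarray(features, dtype=np.float32)[order]
    labels = np.asarray(labels).astype(int)[order]

    # Row i of the identity matrix is the one-hot vector for class i
    one_hot = np.eye(n_classes, dtype=np.float32)[labels]

    n_train = int(train_fraction * len(features))
    return (features[:n_train], one_hot[:n_train],
            features[n_train:], one_hot[n_train:])

# Small stand-in for the two feature columns and the label column
demo_rng = np.random.default_rng(1)
demo_features = demo_rng.normal(size=(20, 2))
demo_labels = np.repeat([0.0, 1.0, 2.0, 3.0], 5)

X_train, y_train, X_test, y_test = make_training_arrays(demo_features, demo_labels)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

With the DataFrame built above, the same helper could be called as `make_training_arrays(df[["f1", "f2"]].to_numpy(), df["Y"].to_numpy())`.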

### Custom Dense Network

In [None]:
def Custom_DNN():
    model = Sequential() 
    
    model.add(Dense_Layer(2, 64)) # 2 -> 64 node layer
    model.add(Dense_Layer(64, 32)) # 64 -> 32 node layer
    model.add(Dense_Layer(32, 16)) # 32 -> 16 node layer
    model.add(Dense_Layer(16, 4, activation_function="softmax")) # 16 -> 4 Output layer
    
    model._compile(
        optimizer = tf.keras.optimizers.legacy.Adam(learning_rate = 0.001),
        loss_fn = tf.keras.losses.CategoricalCrossentropy()
    )
    
    return model
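
One property of this architecture is worth checking: the three hidden layers use the default linear activation, so together they compose into a single affine map. The NumPy sketch below verifies this for random weights with the same layer sizes. The weights and biases are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the 2 -> 64 -> 32 -> 16 linear layers
w1, b1 = rng.standard_normal((2, 64)), rng.standard_normal(64)
w2, b2 = rng.standard_normal((64, 32)), rng.standard_normal(32)
w3, b3 = rng.standard_normal((32, 16)), rng.standard_normal(16)

x = rng.standard_normal((5, 2))

# Forward pass through the three stacked linear layers
stacked = ((x @ w1 + b1) @ w2 + b2) @ w3 + b3

# Equivalent single layer: one weight matrix and one bias vector
w_eq = w1 @ w2 @ w3
b_eq = (b1 @ w2 + b2) @ w3 + b3
single = x @ w_eq + b_eq

print(np.allclose(stacked, single))
```

If the hidden layers are meant to add expressive power, passing a nonlinear option such as `activation_function="relu"` to them would be one way to do it. On the easily separable clusters used here, a linear model followed by softmax may already be sufficient.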

In [None]:
Custom_Classifier = Custom_DNN()

In [None]:
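# X_train: float features of shape (n, 2); y_train: one-hot labels of shape (n, 4), as CategoricalCrossentropy expects
Custom_Classifier.fit(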
    x = X_train,
    y = y_train,
    n_epochs = 100,
    batch_size = 16
)