In [1]:
import torch

In [2]:
# Quick trial with pandas
import pandas as pd

In [9]:
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 3], "y": ["a", "b", "c", "d", "e", "e"]})

In [12]:
pd.get_dummies(df, columns=["x", "y"])  # One-hot encode: each distinct value in a categorical column becomes its own 0/1 indicator column
# Each row becomes a mostly-zero vector with a 1 in the column matching the value that row holds

   x_1  x_2  x_3  x_4  x_5  y_a  y_b  y_c  y_d  y_e
0    1    0    0    0    0    1    0    0    0    0
1    0    1    0    0    0    0    1    0    0    0
2    0    0    1    0    0    0    0    1    0    0
3    0    0    0    1    0    0    0    0    1    0
4    0    0    0    0    1    0    0    0    0    1
5    0    0    1    0    0    0    0    0    0    1


## Vanishing and Exploding Gradients

In [13]:
# Backpropagation chains together many matrix multiplications, and the derivatives
# involved often scale with the weights themselves
# Weights too large: repeated multiplication by values with magnitude > 1 => exploding gradient => updates overshoot and training diverges
# Weights too small: repeated multiplication by values with magnitude < 1 => gradient shrinks toward 0 => effectively no learning
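The effect of repeated multiplication can be seen with a small numerical sketch. It is only an illustration: the matrices are random Gaussian stand-ins for the per-layer factors in backpropagation, and the helper name `product_norm`, the depth of 50 and the 4x4 size are arbitrary assumptions, not part of any real network.

```python
import torch

torch.manual_seed(0)

def product_norm(scale, depth=50, dim=4):
    """Frobenius norm of a product of `depth` random (dim x dim) matrices.

    Each matrix has i.i.d. Gaussian entries with standard deviation `scale`
    (a hypothetical stand-in for one layer's contribution to the gradient).
    """
    m = torch.eye(dim, dtype=torch.float64)
    for _ in range(depth):
        m = (scale * torch.randn(dim, dim, dtype=torch.float64)) @ m
    return m.norm().item()

large = product_norm(scale=2.0)   # entries are typically larger than 1 in magnitude
small = product_norm(scale=0.1)   # entries are almost always smaller than 1 in magnitude

print(f"scale 2.0 -> norm {large:.3e}")   # grows by many orders of magnitude
print(f"scale 0.1 -> norm {small:.3e}")   # shrinks toward 0
```

With the larger scale the product blows up, and with the smaller scale it collapses toward zero, which is the same mechanism behind exploding and vanishing gradients.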

In [14]:
# Sigmoid models a neuron firing / not firing (with only a narrow transition region in between)
    # However, its gradient is close to 0 over most of its domain (and never exceeds 0.25), so a wide range of inputs
    # produces tiny gradients that vanish when multiplied together across layers
# ReLU has become the default choice for practitioners because it sustains gradients (slope 1 for positive inputs), even though it is a less direct model of a neuron
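To make the comparison concrete, the sketch below uses autograd to evaluate the derivative of sigmoid and ReLU at a few sample points. The sample grid and the depth of 10 are arbitrary choices for illustration.

```python
import torch

# Sample points -8, -6, ..., 6, 8
x = torch.linspace(-8.0, 8.0, 9, requires_grad=True)

# Summing the outputs lets autograd return the elementwise derivative at each point
sig_grad, = torch.autograd.grad(torch.sigmoid(x).sum(), x)
relu_grad, = torch.autograd.grad(torch.relu(x).sum(), x)

print(sig_grad)    # peaks at 0.25 near x = 0 and is tiny in the saturated tails
print(relu_grad)   # 1 for every positive input, 0 for negative inputs

# Even in the best case, a chain of 10 sigmoids scales the gradient by at most 0.25 ** 10
depth = 10
best_case_chain = sig_grad.max().item() ** depth
print(best_case_chain)   # roughly 9.5e-7
```

The ReLU derivative stays at 1 wherever the unit is active, so stacking layers does not shrink the gradient by itself the way stacked sigmoids do.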

**Symmetry** - We want to break symmetry between the neurons in a layer. If every neuron starts with the same weights, each one receives the same gradient at every step, so the weights stay identical across neurons and the layer has the expressive power of only a single neuron (combining several copies of the same function yields no new features or weighting information, so only one feature's worth of information is produced).
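The symmetry problem can be reproduced with a toy network. This is a minimal sketch, assuming a small two-layer MLP, random data, and a constant initial value of 0.5; none of these choices come from the notes above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy network: 3 inputs -> 4 hidden units -> 1 output
net = nn.Sequential(nn.Linear(3, 4), nn.Tanh(), nn.Linear(4, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)   # every weight and bias starts at the same value

x = torch.randn(16, 3)
y = torch.randn(16, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

for _ in range(20):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

hidden_w = net[0].weight.detach()   # shape (4, 3): one row per hidden unit
rows_identical = torch.allclose(hidden_w, hidden_w[0].expand_as(hidden_w))
moved = not torch.allclose(hidden_w, torch.full_like(hidden_w, 0.5))

print(rows_identical)   # the hidden units still share the same weights
print(moved)            # even though training did change those weights
```

Training updates the weights, but all four hidden units move in lockstep, so the hidden layer behaves like a single unit. Random initialization is what breaks this tie.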

In [15]:
# Explicit assumption: weights and inputs are independent with mean 0. For an output o = sum_j w_j * x_j:
# E(o) = sum_j E(w_j) * E(x_j) = 0
# Var(o) = E(o^2) - E(o)^2 = E(o^2) = sum_j E(w_j^2) * E(x_j^2) = nin * sigma^2 * delta^2
# (cross terms vanish by independence, so this is n input features * weight variance sigma^2 * input variance delta^2)
# If we fix nin * sigma^2 = 1, the output variance equals the input variance delta^2,
# so the variance of the signal is maintained from the layer's input to its output
# By the same logic in backprop, where the layer's output side plays the role of the input, fixing nout * sigma^2 = 1 keeps the gradient variance constant
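The variance formula can be checked empirically. The sketch below assumes Gaussian weights and inputs (the derivation only needs zero mean and independence), and the values of `n_in`, `sigma` and `delta` are arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)

n_in, n_samples = 256, 20_000
sigma, delta = 0.05, 2.0   # std of the weights, std of the inputs (illustrative values)

x = delta * torch.randn(n_samples, n_in)   # zero-mean inputs with variance delta^2
w = sigma * torch.randn(n_samples, n_in)   # independent zero-mean weights with variance sigma^2
o = (w * x).sum(dim=1)                     # one output o = sum_j w_j * x_j per sample

predicted = n_in * sigma**2 * delta**2     # nin * sigma^2 * delta^2 = 2.56
empirical = o.var().item()
print(predicted, empirical)

# With sigma^2 = 1 / nin the output variance should match the input variance delta^2
w_scaled = torch.randn(n_samples, n_in) / n_in**0.5
o_scaled = (w_scaled * x).sum(dim=1)
preserved = o_scaled.var().item()
print(delta**2, preserved)
```

The empirical variances should land close to the predicted values, up to sampling noise.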

**Xavier Initialization** - The idea is to keep the variance of the signal constant during both forward propagation and backpropagation so that the gradient neither vanishes nor explodes. In other words, the variance of a layer's outputs (and of the gradients flowing back through it) should match the variance of what the layer received, being neither boosted nor squashed. If it is boosted, training could diverge. If it is squashed, the signal could go to 0, as could the weight updates. The derivation implicitly assumes that weights and inputs are independent and have mean 0, an assumption that rarely holds exactly in practice.

Math: By linearity of expectation, the expectation of an output is the sum over inputs of E(weight * input), and by independence each term factors into E(weight) * E(input). Both of these expectations are 0 by the implicit assumptions, hence the output has expectation 0. The variance of the output is then just its second moment, which is the sum across inputs of E(weight^2) * E(input^2). Since both weights and inputs are assumed to be drawn from distributions with mean 0, these second moments are equal to the variances, and we take the variance of the input as given (say 1) since we cannot change it.

We end up with n-inputs * variance of weights = 1, so that the variance is held constant in the forward pass. We want the same property in backpropagation: there the transposed weight matrix is just the reverse mapping, and the input to backprop is the gradient at the layer's outputs, so by independence the variance should also satisfy n-outputs * variance of weights = 1. Both conditions cannot hold at once unless n-inputs = n-outputs, so Xavier initialization averages them: (n-inputs + n-outputs) / 2 * variance of weights = 1, hence variance of weights = 2 / (n-inputs + n-outputs). We can use this to initialize a uniform distribution, whose variance is (b-a)^2 / 12. Since we want the uniform to have mean 0, we sample from U(-a, a), whose variance is (2a)^2 / 12 = a^2 / 3. Equating this with 2 / (n-inputs + n-outputs) gives a = sqrt(6 / (n-inputs + n-outputs)), so the bounds are -a and its absolute value.
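The bound derived above can be turned into a short sketch. The helper `xavier_uniform_bound` and the layer sizes are hypothetical names and values chosen for illustration; `nn.init.xavier_uniform_` is PyTorch's built-in initializer and, with its default gain of 1, should use the same bound.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

def xavier_uniform_bound(n_in, n_out):
    """Bound a of U(-a, a) whose variance a^2 / 3 equals 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))

n_in, n_out = 300, 500                           # illustrative layer sizes
a = xavier_uniform_bound(n_in, n_out)
target_var = 2.0 / (n_in + n_out)

# Manual initialization: sample every weight from U(-a, a)
w = torch.empty(n_out, n_in).uniform_(-a, a)
print(target_var, w.var().item())                # the two values should be close

# Built-in initializer for comparison (default gain of 1)
w_builtin = torch.empty(n_out, n_in)
nn.init.xavier_uniform_(w_builtin)
print(a, w_builtin.abs().max().item())           # largest weight stays within the bound
```

The sample variance of both tensors should sit near 2 / (n_in + n_out), and no weight should fall outside [-a, a].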

Xavier initialization uses these properties of the network to sample from a distribution that, under the simplifying assumptions, should preserve variance in both forward propagation and backward propagation, with the scale of the weights keeping the gradient adjustment from becoming too large or too small.

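**Note**: The factor n (the number of elements) comes from summing the per-term second moments. Only after that sum do we replace each second moment with a variance, a step that is valid because the means are 0.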