## Normalization Functions

Previously, we covered what a single neuron looks like and how, when it "activates", it produces a "squashing" effect that forces the output into some bounded range.

## Simple Statistics

Let's warm up with some simple calculations of mean, variance and standard deviation. 


## (Arithmetic) Mean
The (arithmetic) mean of a list is the sum of its elements divided by the number of elements:

<img src="images/arithmetic-mean.png">

In [1]:
import numpy as np

In [2]:
x = np.array([1,2,3])
m = x.sum() / len(x) # x_bar
print 'Mean =', m

Mean = 2
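
One caveat worth a quick sketch: the `print` statements suggest this notebook runs on Python 2, where dividing one integer by another truncates the result. The list `[1, 2, 3]` happens to have a whole-number mean, so nothing goes wrong above, but a list such as `[1, 2, 4]` (an input chosen here purely for illustration) would lose its fractional part. Casting to `float` or using `np.mean` should sidestep this on either Python version:

```python
import numpy as np

x = np.array([1, 2, 4])  # illustrative input, not from the notebook

# Force floating-point division so the result is the same on Python 2 and 3.
m_manual = float(x.sum()) / len(x)

# np.mean returns a float for integer input.
m_numpy = np.mean(x)

print("Manual mean = %.4f" % m_manual)
print("NumPy mean  = %.4f" % m_numpy)
```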


## Variance 

The variance of a list of real numbers (historically known as the "population" in statistics) measures how spread out the elements within the list are. Note that the code below divides by `n-1`, which gives the sample variance.

<img src="images/variance.png">

In [3]:
x = np.array([1,2,3])
n = len(x)
m = x.sum() / n
s2 = sum((x - m)**2) / (n-1)
print 'Variance =', s2

Variance = 1


## Standard Deviation

The standard deviation of a list of real numbers is simply the square root of the variance. It quantifies how spread out the elements in the list are, in the same units as the elements themselves.

<img src='images/standard-deviation.png'>

In [4]:
x = np.array([1,2,3])
n = float(len(x))
m = x.sum() / n
s2 = sum((x - m)**2) / (n-1.) 
s = s2**0.5
print 'Mean =', m
print 'Variance =', s2
print 'Standard deviation =', s

Mean = 2.0
Variance = 1.0
Standard deviation = 1.0


## Turning our statistics "grabber" into a Python function

In [5]:
def get_statistics_old(x):
    n = float(len(x))
    m = x.sum() / n
    s2 = sum((x - m)**2) / (n-1.) 
    s = s2**0.5
    return m, s2, s

def get_statistics(x):
    return np.mean(x), np.var(x), np.std(x)

x = np.array([1,2,3])
m, s2, s = get_statistics_old(x)
print 'Old statistics functions'
print 'Mean =', m
print 'Variance =', s2
print 'Standard deviation =', s
print 
m, s2, s = get_statistics(x)
print 'With numpy'
print 'Mean =', m
print 'Variance =', s2
print 'Standard deviation =', s

Old statistics functions
Mean = 2.0
Variance = 1.0
Standard deviation = 1.0

With numpy
Mean = 2.0
Variance = 0.666666666667
Standard deviation = 0.816496580928


In [6]:
print np.var.func_doc


    Compute the variance along the specified axis.

    Returns the variance of the array elements, a measure of the spread of a
    distribution.  The variance is computed for the flattened array by
    default, otherwise over the specified axis.

    Parameters
    ----------
    a : array_like
        Array containing numbers whose variance is desired.  If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the variance is computed.  The default is to
        compute the variance of the flattened array.

        .. versionadded: 1.7.0

        If this is a tuple of ints, a variance is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the variance.  For arrays of integer type
        the default is `float32`; for arrays of float types it is the same as
        the array type.
    out : ndarray,

In [7]:
def get_statistics_ddf0(x):
    return np.mean(x), np.var(x), np.std(x)

def get_statistics_ddf1(x):
    return np.mean(x), np.var(x, ddof=1), np.std(x, ddof=1)


m, s2, s = get_statistics(x)
print 'With numpy, default parameters'
print 'Mean =', m
print 'Variance =', s2
print 'Standard deviation =', s
print 
m, s2, s = get_statistics_ddf0(x)
print 'With numpy, degree of freedom=0'
print 'Mean =', m
print 'Variance =', s2
print 'Standard deviation =', s
print 
m, s2, s = get_statistics_ddf1(x)
print 'With numpy, degree of freedom=1'
print 'Mean =', m
print 'Variance =', s2
print 'Standard deviation =', s

get_statistics = get_statistics_ddf1

With numpy, default parameters
Mean = 2.0
Variance = 0.666666666667
Standard deviation = 0.816496580928

With numpy, degree of freedom=0
Mean = 2.0
Variance = 0.666666666667
Standard deviation = 0.816496580928

With numpy, degree of freedom=1
Mean = 2.0
Variance = 1.0
Standard deviation = 1.0
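
The difference between the two results comes from the `ddof` ("delta degrees of freedom") argument: NumPy divides the sum of squared deviations by `N - ddof`, and `ddof` defaults to 0. A small sketch, reusing the same `[1, 2, 3]` input, that reproduces both numbers by hand:

```python
import numpy as np

x = np.array([1, 2, 3])
n = len(x)
sq_dev = ((x - np.mean(x)) ** 2).sum()  # sum of squared deviations

pop_var = sq_dev / float(n)         # divide by N     (ddof=0, the NumPy default)
sample_var = sq_dev / float(n - 1)  # divide by N - 1 (ddof=1, the hand-rolled version)

print(np.isclose(pop_var, np.var(x)))             # True
print(np.isclose(sample_var, np.var(x, ddof=1)))  # True
```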


## Remember the Sigmoid?

The sigmoid function has similar "squashing" abilities, and in fact the softmax and the sigmoid functions are related (softmax is covered further below). Previously, we saw how a single neuron would "activate" given an input and spit out a "squashed" output.

The standard sigmoid activation function is:

<img src="images/">

Now, we'll introduce the notion of a normalized sigmoid, which is defined as:

<img src="images/normalized-sigmoid.png">

We see that it subtracts the mean from each `x_i` value and divides the result by the standard deviation before applying the sigmoid.

In [16]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def norm_sigmoid(x):
    m = np.mean(x)
    s = np.std(x)
    return 1 / (1 + np.exp(-(x-m)/s))

In [17]:
x = np.array([1,2,3])
print x
print sigmoid(x)
print norm_sigmoid(x)

[1 2 3]
[ 0.73105858  0.88079708  0.95257413]
[ 0.22710252  0.5         0.77289748]
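
A property that follows from the definition, sketched here as a quick check: because the input is standardized first, shifting it or rescaling it by a positive factor should leave the normalized sigmoid unchanged, whereas the standard sigmoid changes. The scale `5` and shift `100` below are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def norm_sigmoid(x):
    # Standardize (subtract mean, divide by std) before squashing.
    return sigmoid((x - np.mean(x)) / np.std(x))

x = np.array([1.0, 2.0, 3.0])
shifted_scaled = 5 * x + 100  # arbitrary positive scale and shift

print(np.allclose(norm_sigmoid(x), norm_sigmoid(shifted_scaled)))  # True
print(np.allclose(sigmoid(x), sigmoid(shifted_scaled)))            # False
```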


## Standard vs Normalized Sigmoid

There seems to be quite a difference between the transformations that the standard and the normalized sigmoid functions apply to the inputs.

Let's look at it from another perspective, where the inputs are highly skewed and contain outliers:

In [18]:
x = np.array([0.5, 1, 10, 15, 20.3245, 500, 50000])

In [20]:
sigmoid(x) # Saturates the outliers: every large value maps to ~1

array([ 0.62245933,  0.73105858,  0.9999546 ,  0.99999969,  1.        ,
        1.        ,  1.        ])

In [22]:
norm_sigmoid(x) # Compresses the "inliers": only the extreme outlier stands out

array([ 0.39809267,  0.39809953,  0.39822302,  0.39829162,  0.39836469,
        0.40496512,  0.9205157 ])
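
To see why the small values collapse together, it can help to look at the standardized values before the sigmoid is applied. In this sketch (same `x` as above), the single huge outlier inflates both the mean and the standard deviation, so the remaining small elements end up with nearly the same z-score:

```python
import numpy as np

x = np.array([0.5, 1, 10, 15, 20.3245, 500, 50000])

# z-scores: what norm_sigmoid feeds into the sigmoid.
z = (x - np.mean(x)) / np.std(x)
print(z)

# Spread of the five smallest inputs after standardization.
print(z[:5].max() - z[:5].min())
```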

## Softmax function

A softmax function also squeezes real-valued inputs into the 0 to 1 range; in addition, its special property is that the elements of the output list sum to 1. In fact, it's a generalization of the logistic (sigmoid) function used in logistic regression to more than two classes.

<img src='images/softmax.png'>


In [13]:
def softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

In [14]:
x = np.random.rand(3)
print x
print softmax(x)
print softmax(x).sum()

[ 0.95656121  0.40232443  0.73815292]
[ 0.42046703  0.24156275  0.33797022]
1.0


In [15]:
x = np.array([1,2,3])
print softmax(x)
print softmax(x).sum()

[ 0.09003057  0.24472847  0.66524096]
1.0


## The relation between Softmax and Sigmoid

Essentially, the softmax function is the generalized version of the sigmoid function. If we differentiate the sigmoid function, we get:

<img src='images/differentiate-sigmoid.png'>

Now, we love / like this function, especially when we need the derivative for optimization / back-propagation later ;P
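
The derivative is convenient because it can be written purely in terms of the output. Assuming the image above shows the usual result `dy/dx = y * (1 - y)`, a quick numerical sketch can check it against a central finite difference (the step size `h` is an arbitrary choice):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
y = sigmoid(x)

analytic = y * (1 - y)  # same as y - y^2
h = 1e-5                # finite-difference step (assumed value)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.allclose(analytic, numeric))  # True
```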

This gives us a relation between the derivative and the 1st and 2nd order terms (`y` and `y^2`), which is a *Bernoulli differential equation*, so the function `f(x)` has a solution:

<img src='images/bernoulli-solution.png'>

And in the case of sigmoid, it happens to be `C=1`.

<img src='images/bernoulli-solution-sigmoid.png'>
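
As a rough check of the statement above, and assuming the solution shown in the images is the family `1 / (1 + C * exp(-x))`, the sketch below verifies numerically that each member satisfies `y' = y - y^2` and that `C=1` recovers the sigmoid. The helper name `bernoulli_solution` and the sample values of `C` are made up for this example.

```python
import numpy as np

def bernoulli_solution(x, C):
    # Hypothetical helper: assumed general solution of y' = y - y^2.
    return 1 / (1 + C * np.exp(-x))

x = np.linspace(-3, 3, 7)
h = 1e-5  # finite-difference step (assumed value)

for C in (0.5, 1.0, 2.0):
    y = bernoulli_solution(x, C)
    dy = (bernoulli_solution(x + h, C) - bernoulli_solution(x - h, C)) / (2 * h)
    print("C = %.1f satisfies y' = y - y^2: %s" % (C, np.allclose(dy, y - y ** 2)))
```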

In [23]:
x = np.array([1,2,3])
print sigmoid(x), softmax(x)

[ 0.73105858  0.88079708  0.95257413] [ 0.09003057  0.24472847  0.66524096]
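
The two outputs differ because `sigmoid` squashes each element independently while `softmax` normalizes across the whole list. One concrete way to see the link, as a sketch: for a two-element list `[z, 0]`, the first softmax component works out to `e^z / (e^z + 1)`, which is exactly `sigmoid(z)`. The value `z = 1.5` is an arbitrary illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

z = 1.5  # arbitrary score for the first class; the second class is fixed at 0
two_class = softmax(np.array([z, 0.0]))

print(np.isclose(two_class[0], sigmoid(z)))      # True
print(np.isclose(two_class[1], 1 - sigmoid(z)))  # True
```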


In [25]:
# Do note that this naive softmax breaks down on highly skewed inputs (exp(5000) overflows to inf), e.g.
x = np.array([0.21, 0.99, 5000, 499, 500])
print 'Sigmoid(x):', sigmoid(x)
print 'NormalizeSigmoid(x):', norm_sigmoid(x)
print 'Softmax(x):', softmax(x)

Sigmoid(x): [ 0.55230791  0.72908792  1.          1.          1.        ]
NormalizeSigmoid(x): [ 0.34814876  0.3482413   0.87935657  0.40939807  0.40952447]
Softmax(x): [  0.   0.  nan   0.   0.]




If you need to pass highly skewed input into a `softmax()` function and still receive an output vector that sums to 1, you can try squashing the input first, e.g. `softmax(sigmoid(x))` or `softmax(norm_sigmoid(x))`:

In [26]:
softmax(sigmoid(x))

array([ 0.14519143,  0.17326688,  0.22718056,  0.22718056,  0.22718056])

In [27]:
softmax(norm_sigmoid(x))

array([ 0.17159898,  0.17161486,  0.29188739,  0.18243785,  0.18246091])

**Note:** the normalized sigmoid usually highlights the "most prominent" class. This is useful when it is placed before a final output layer, where we want to do classification and require one class to "stand out".
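
Besides pre-squashing with a sigmoid, another option for skewed input (not part of the notebook above, but a common trick) is to subtract the maximum element before exponentiating. Softmax is unchanged by adding a constant to every input, so the result is mathematically the same, but the largest exponent becomes `exp(0) = 1` and nothing overflows. A minimal sketch, where `stable_softmax` is a name made up for this example:

```python
import numpy as np

def stable_softmax(x):
    # Shift so the largest element is 0; exp() can then no longer overflow.
    shifted = x - np.max(x)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum()

x = np.array([0.21, 0.99, 5000, 499, 500])
p = stable_softmax(x)
print(p)
print(p.sum())
```

Unlike the pre-squashing approach, this keeps the original gaps between the inputs, so here practically all of the probability mass lands on the largest element.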


## OPTIONAL MATERIAL!!

## Tanh Function

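The `tanh()` activation function is also a squashing function that looks very similar to the `sigmoid()`.
Imagine you double the "steepness" of the slope of the sigmoid curve and then stretch and shift the curve so that its output runs from -1 to 1 instead of 0 to 1; that is, `tanh(x) = 2 * sigmoid(2x) - 1`.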

(I would like to think it's named after me, but it isn't ;P)
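
The resemblance between `tanh()` and `sigmoid()` can be checked numerically. The sketch below uses the standard identity `tanh(x) = 2 * sigmoid(2x) - 1` and compares it against NumPy's `np.tanh` on a few sample points (the points themselves are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 7)

# Sigmoid with a doubled input, stretched to (-1, 1) and centered at 0.
rescaled = 2 * sigmoid(2 * x) - 1

print(np.allclose(np.tanh(x), rescaled))  # True
```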


