## Finite Notebook Fun for Lessons 1/2

Note: the session notebooks are not reviewed and serve only as in-class illustrations.

In [1]:
from random import random
import numpy as np

import tensorflow as tf
from tensorflow.keras import layers

### 1. Checking a few things about cross-entropy

Imagine a probability distribution $p$ across 5 outcomes:

In [2]:
p = np.array([0.3,0.2,0.1,0.25,0.15])
print(p)

[0.3  0.2  0.1  0.25 0.15]


Calculate the entropy:

In [3]:
S = -(0.3 * np.log2(0.3) + 0.2 * np.log2(0.2) + 0.1 * np.log2(0.1) + 0.25 * np.log2(0.25) + 0.15 * np.log2(0.15))
print(S)

2.228212945841001


Explicit, but tedious and silly. Better way? Use vectors:

In [4]:
S = - np.dot(p, np.log2(p))
print(S)

2.228212945841001


Same result. Good. 

Now imagine a second distribution $q$ that is close to $p$ but differs slightly.
Construct it by adding a small vector $\delta$ with zero mean, whose scale is set by a small number $\epsilon$:

In [5]:
def create_q(p0, epsilon=0.03):

    delta = epsilon * np.random.random(5)
    delta = delta - np.mean(delta)
    
    return (p0 + delta)

q = create_q(p)
print('q: ', q)

q:  [0.29531147 0.20553361 0.09956972 0.24076992 0.15881527]
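
A quick sanity check, sketched under the assumption that `create_q` is defined as above: because `delta` has zero mean, its entries sum to zero, so `q` should still sum to 1. With `epsilon=0.03` and the smallest entry of `p` being 0.1, every entry of `q` also stays positive. The names `q_sum` and `q_all_positive` are illustrative.

```python
import numpy as np

p = np.array([0.3, 0.2, 0.1, 0.25, 0.15])

def create_q(p0, epsilon=0.03):
    # Same construction as in the notebook cell above
    delta = epsilon * np.random.random(5)
    delta = delta - np.mean(delta)
    return p0 + delta

q = create_q(p)

q_sum = np.sum(q)                        # should be 1 up to rounding
q_all_positive = bool(np.all(q > 0))     # each shift is smaller than 0.03
print('sum of q: ', q_sum)
print('all entries positive? ', q_all_positive)
```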


What is the cross entropy?

In [6]:
ce = -np.dot(p, np.log2(q))
print(ce)

2.2289878623320147


Close! Smaller or bigger than entropy?

In [7]:
print((ce - S) > 0)

True


Let's try that for many q:

In [8]:
for attempt in range(5):
    q = create_q(p)
    ce = -np.dot(p, np.log2(q))
    #print(q)
    print('ce > S? ', (ce-S)>0)
    

ce > S?  True
ce > S?  True
ce > S?  True
ce > S?  True
ce > S?  True


Indeed, $ce$ comes out larger than $S$ in every trial. This is only 'experimentally verified' here; the general statement $ce \geq S$ is Gibbs' inequality and should be proven. Since $ce = S$ exactly when $q = p$, minimizing $ce$ drives $q \rightarrow p$.
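
The gap between cross-entropy and entropy is the Kullback-Leibler divergence, $ce - S = \sum_i p_i \log_2(p_i/q_i)$, which is non-negative and vanishes only when $q = p$. The sketch below checks this identity numerically; the helper name `kl_divergence` and the example `q` are illustrative and not part of the original notebook.

```python
import numpy as np

p = np.array([0.3, 0.2, 0.1, 0.25, 0.15])
q = np.array([0.25, 0.25, 0.1, 0.2, 0.2])   # illustrative second distribution (sums to 1)

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits (hypothetical helper)."""
    return np.dot(p, np.log2(p / q))

S = -np.dot(p, np.log2(p))     # entropy of p
ce = -np.dot(p, np.log2(q))    # cross-entropy of q relative to p
kl = kl_divergence(p, q)

print('ce - S:   ', ce - S)
print('KL(p, q): ', kl)
print('KL(p, p): ', kl_divergence(p, p))
```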

### 2. A few basic words on vector and matrix calculations 

Imagine a neural net where the incoming layer has dimension $4$ and the next layer has dimension $3$. Let $x$ be the input. What are the dimensions of the weight matrix $W$ and the bias vector $b$ in $$h = f(x W + b)?$$

$W$ needs to 'translate' a 4-d vector into a 3-d vector. Hence dim$(W) = 4 \times 3$.

Each output neuron has its own bias value, dim$(b) = 3$.

Example:

In [9]:
x = np.array([1,1,2,2])

W = np.array([[-1,-3,-2],[-2,3,-6],[4,-2,3], [-1, 5, 1]]) 


b = np.array([-1, -2, -3])

In [10]:
print('x shape: ', x.shape)
print('W shape: ', W.shape)
print('b shape: ', b.shape)

x shape:  (4,)
W shape:  (4, 3)
b shape:  (3,)


In [11]:
z = x.dot(W) + b
z

array([ 2,  4, -3])
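
The same expression extends to a batch of inputs. As a sketch, assume a hypothetical batch `X` with two rows (the second row is made up for illustration): `X.dot(W)` then has shape `(2, 3)`, and NumPy broadcasting adds `b` to every row.

```python
import numpy as np

W = np.array([[-1, -3, -2], [-2, 3, -6], [4, -2, 3], [-1, 5, 1]])
b = np.array([-1, -2, -3])

# Two input vectors stacked as rows: the first is x from above,
# the second is an arbitrary extra example.
X = np.array([[1, 1, 2, 2],
              [0, 1, 0, 1]])

Z = X.dot(W) + b   # b is broadcast across the batch dimension
print('X shape: ', X.shape)
print('Z shape: ', Z.shape)
print(Z)
```

The first row of `Z` reproduces the single-vector result above.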

Now let's add the non-linearity; sigmoid, to be concrete. The non-linearity is applied element-wise. (The final output layer is different, however.)

In [12]:
def sigmoid(y):
    return 1/(1 + np.exp(-y))

In [13]:
h = sigmoid(z)
h

array([0.88079708, 0.98201379, 0.04742587])

### 3. Familiarization with the Output Layer and Softmax operation

Imagine your output layer (which does not have a non-linearity) returns these 5 numbers:

In [14]:
o = 10 * np.random.random(5) - 5
print('o: ', o)

o:  [ 4.23717353 -0.1165268  -4.15928131 -4.72762245 -0.39322557]


In a classification problem, these 5 numbers would be the output for each class. They can be large, small, positive, negative, etc. 

We want to model probabilities though. Hence, we need to convert the output to 5 values that are **1) positive** and **2) sum to 1**.

Let's first exponentiate all values:

In [15]:
exp_o = np.exp(o)
print('outputs: ', exp_o)

outputs:  [6.92119495e+01 8.90006248e-01 1.56187790e-02 8.84748135e-03
 6.74876498e-01]


Great. All numbers are positive. But they don't sum to $1$. Simple solution: divide each value by the sum of all values:

In [16]:
sum_exp_o = np.sum(exp_o)
print('sum of all output values: ', sum_exp_o)

sum of all output values:  70.80129855521079


In [17]:
p = exp_o/sum_exp_o
print('model probabilities p: ', p)

model probabilities p:  [9.77551979e-01 1.25704792e-02 2.20600177e-04 1.24962134e-04
 9.53197910e-03]


In [19]:
np.sum(p)

1.0

Cool! Sums to 1 as desired. (Not a surprise.)

What happens if I add a constant to each value in $o$? How is softmax affected?

In [20]:
o_2 = o + 1

In [21]:
exp_o_2 = np.exp(o_2)
print('outputs: ', exp_o_2)

outputs:  [1.88137585e+02 2.41928781e+00 4.24562432e-02 2.40499478e-02
 1.83450452e+00]


In [22]:
sum_exp_o_2 = np.sum(exp_o_2)

In [23]:
q = exp_o_2/sum_exp_o_2
print('new probabilities q_2: ', q)

new probabilities q_2:  [9.77551979e-01 1.25704792e-02 2.20600177e-04 1.24962134e-04
 9.53197910e-03]


Same!
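
The reason: $e^{o_i + c} = e^{c} e^{o_i}$, so the factor $e^{c}$ appears in both the numerator and the denominator and cancels. A common use of this invariance, sketched below with a hypothetical `softmax` helper and illustrative values, is to subtract the maximum before exponentiating so that large outputs don't overflow.

```python
import numpy as np

def softmax(o):
    """Softmax of a 1-d array, shifted by max(o) for numerical stability."""
    shifted = o - np.max(o)        # a constant shift does not change the result
    exp_o = np.exp(shifted)
    return exp_o / np.sum(exp_o)

o = np.array([4.2, -0.1, -4.2, -4.7, -0.4])   # illustrative outputs
p_a = softmax(o)
p_b = softmax(o + 1)
p_big = softmax(o + 1000)   # a naive np.exp(o + 1000) would overflow to inf

print(np.allclose(p_a, p_b), np.allclose(p_a, p_big))
```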

### 4. Most Basic Keras Intro

Simple 3-class classification model. This is only intended to illustrate the TensorFlow/Keras concepts. The data is randomly generated. See: https://www.tensorflow.org/guide/keras

There are two ways to define a model: 1) the **Sequential** API, and 2) the **Functional API**. The former is a little simpler, but it only supports (as the name says) sequential architectures. The Functional API is extremely flexible and will be introduced later.


Setting up a Sequential model:

In [24]:
# Define a 'sequential' model (vs. 'functional'... we'll discuss later.)
model = tf.keras.Sequential([

    # Adds a densely-connected layer with 8 units to the model:
    layers.Dense(8, activation='relu',
                 kernel_initializer=tf.keras.initializers.glorot_normal,
                 input_shape=(4,)),         # '4' is the number of features of the input

    # Add another, now with 15 neurons:
    layers.Dense(15, activation='relu'),

    # Add another, with 5 neurons:
    layers.Dense(5, activation='relu'),

    # Create a softmax layer with 3 output units, as required by the
    # 3-dimensional one-hot labels representing 3 classes:
    layers.Dense(3, activation='softmax')])
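
As a rough check on the architecture, the number of trainable parameters can be counted by hand: each dense layer contributes (inputs × units) weights plus one bias per unit. The sketch below does the arithmetic in plain Python; the variable names are illustrative, and the total should agree with what `model.summary()` reports for a stack of dense layers with biases.

```python
# Layer widths: 4 input features, then Dense(8), Dense(15), Dense(5), Dense(3)
layer_sizes = [4, 8, 15, 5, 3]

params_per_layer = [
    n_in * n_out + n_out          # weight matrix entries + bias vector entries
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print('parameters per layer: ', params_per_layer)
print('total: ', total_params)
```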


Compiling the model, i.e., adding the optimizer, loss, and metrics:

In [25]:
# Configure a model for categorical classification.
model.compile(optimizer=tf.keras.optimizers.RMSprop(0.01),
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.categorical_accuracy])


Create some fake input data:

In [26]:
def random_one_hot_labels(shape):
    n, n_class = shape
    classes = np.random.randint(0, n_class, n)
    labels = np.zeros((n, n_class))
    labels[np.arange(n), classes] = 1
    return labels

data = np.random.random((1000, 4))
labels = random_one_hot_labels((1000, 3))


In [27]:
data[:3]

array([[0.15491145, 0.1624717 , 0.77169716, 0.86330761],
       [0.32929663, 0.90288679, 0.73958897, 0.45347842],
       [0.99768954, 0.32197616, 0.38582041, 0.67262834]])

In [28]:
labels[:3]

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
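
An equivalent, more compact way to build one-hot labels is to index the rows of an identity matrix. This is a sketch with a hypothetical helper name, `one_hot_via_eye`, and made-up class indices; it is not the notebook's own function.

```python
import numpy as np

def one_hot_via_eye(classes, n_class):
    """Row i of the identity matrix is the one-hot vector for class i."""
    return np.eye(n_class)[classes]

example_classes = np.array([0, 2, 1, 2])          # illustrative class indices
example_labels = one_hot_via_eye(example_classes, 3)
print(example_labels)
```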

And then train your model. (Well... there is nothing to learn here, as by construction the data contain no patterns.)

In [29]:
model.fit(data, labels,   epochs=1, batch_size=1000, verbose=2)  # verbose=2: only show last step 
model.fit(data, labels,   epochs=1000, batch_size=50, verbose=0) # verbose=0: silent execution
model.fit(data, labels,   epochs=1, batch_size=1000, verbose=2)  

Train on 1000 samples
1000/1000 - 0s - loss: 1.0993 - categorical_accuracy: 0.3450
Train on 1000 samples
1000/1000 - 0s - loss: 0.9331 - categorical_accuracy: 0.5350


<tensorflow.python.keras.callbacks.History at 0x13aec7150>

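Is the value of the initial cost function surprising? It shouldn't be. With 3 equally likely classes, an untrained model should give a loss near $-\ln(1/3) \approx 1.0986$; the cell below uses $0.33$ as a rough stand-in for $1/3$: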

In [30]:
-np.log(0.33)

1.1086626245216111

... and we don't expect much training for random data. Any improvement here is over-training, i.e., the model memorizing noise.
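
The initial loss can also be reproduced by hand. As a sketch, assume an untrained model that predicts probability $1/3$ for every class, and assume the reported loss is the natural-log cross-entropy averaged over samples; the arrays below are illustrative.

```python
import numpy as np

# Illustrative one-hot labels for 4 samples and 3 classes
y_true = np.array([[1., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.],
                   [1., 0., 0.]])

# Assumed uniform predictions: probability 1/3 for every class
y_pred = np.full_like(y_true, 1.0 / 3.0)

# Per-sample cross-entropy, then the mean over the batch
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(loss)
```

Under uniform predictions the result does not depend on which labels are drawn, which is why the loss starts near this value even for random data.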