# DSCI 572 "lecture" 6

The plan

- T/F (15 min)
- Quiz recap (10 min)
- Lab 3 recap (10 min)
- T/F (15 min)
- Break (5 min)

- T/F (15 min)
- Neural network as a feature extractor + linear model (10 min)

In [1]:
from keras.datasets import mnist
from keras.layers import Dense, Dropout, Flatten, Activation, Conv2D, MaxPooling2D
from keras.models import Sequential, Model, clone_model
from keras.utils import np_utils

from sklearn.linear_model import LogisticRegression

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


In [2]:
%autosave 0

Autosave disabled


## More true/false - left over from last time (15 min)

1. Each iteration of stochastic gradient is faster than each iteration of gradient descent, but we likely need more iterations in total.
2. One epoch of stochastic gradient takes about the same amount of time as one iteration of gradient descent.
3. Stochastic gradient with a minibatch size of $n$ is the same thing as gradient descent.
4. In terms of the fundamental tradeoff, more layers and larger layers both lead to lower training error in a neural network.

<br><br><br><br><br><br><br><br>

Extra note about SGD: 

- We don't use the same termination condition as gradient descent, for several reasons
  - Slow to check the full gradient
  - Function is non-convex
- So we just want to specify a number of iterations
- Is 1000 iterations a good number? It depends on the data set size and minibatch size.
- Hence we measure in epochs instead of iterations.
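
To make the last two points concrete, here is a minimal sketch of how the same "1000 iterations" corresponds to very different amounts of training depending on the minibatch size. The helper name `iterations_per_epoch` is made up for illustration, and $n=60000$ is just the MNIST training set size used later in these notes.

```python
import math

def iterations_per_epoch(n_examples, batch_size):
    """Number of minibatch updates needed to see every example once."""
    return math.ceil(n_examples / batch_size)

n = 60000  # size of the MNIST training set
for batch_size in [1, 32, 1000, n]:
    per_epoch = iterations_per_epoch(n, batch_size)
    epochs = 1000 / per_epoch  # how many passes over the data 1000 iterations gives
    print(f"batch_size={batch_size:>5}: {per_epoch:>5} iterations/epoch, "
          f"1000 iterations = {epochs:.3f} epochs")
```

With a minibatch size of $n$ there is one iteration per epoch, which is the gradient descent case.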

## Quiz recap (10 min)

See [quiz solutions](https://github.ubc.ca/MDS-2018-19/DSCI_572_sup-learn-2_students/blob/master/solutions/quiz1/quiz1.md)




## Lab 3 Exercise 0 recap (5 min)

- _go through lab exercise briefly_

Extra note about optimization for a neural net:

- From the optimization perspective, we should think of _flattening_ all the parameters into one big vector.
- For example if the parameters are


$$\begin{align}W^{(0)} &= \begin{bmatrix}-2 & 2 & -1\\-1 & -2 & 0\end{bmatrix},  &b^{(0)}&=\begin{bmatrix}2 \\ 0\end{bmatrix} \\ W^{(1)} &= \begin{bmatrix}3 & 1\end{bmatrix},  &b^{(1)}&=-10\end{align}$$

then we can flatten the whole thing down into

$$w_\text{all} = \begin{bmatrix}W^{(0)}_\text{flattened}\\b^{(0)}_\text{flattened}\\ W^{(1)}_\text{flattened}\\b^{(1)}_\text{flattened}\end{bmatrix} = \begin{bmatrix}-2\\-1\\2\\-2\\-1\\0\\2\\0\\3\\1\\-10\end{bmatrix}$$

- And then we consider $f(w_\text{all})$, and $\nabla f(w_\text{all})$, etc.
- An implementation may or may not actually do this, but you can think of it this way conceptually.
- If you wanted to use an optimizer like `scipy.optimize.minimize` (not recommended for neural nets!) that expects vectors, then you'd need to implement that flattening.
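
A minimal sketch of that flattening with NumPy, using the example parameters above. The helper names `flatten` and `unflatten` are hypothetical. The column-major order (`order="F"`) is chosen only because it reproduces the ordering shown in $w_\text{all}$; any fixed ordering would work as long as the inverse uses the same one.

```python
import numpy as np

W0 = np.array([[-2., 2., -1.], [-1., -2., 0.]])
b0 = np.array([2., 0.])
W1 = np.array([[3., 1.]])
b1 = np.array([-10.])
params = [W0, b0, W1, b1]

def flatten(params):
    # column-major ravel of each array, then stack them end to end
    return np.concatenate([p.ravel(order="F") for p in params])

def unflatten(w_all, shapes):
    # inverse of flatten: cut w_all into pieces and restore each shape
    out, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(w_all[start:start + size].reshape(shape, order="F"))
        start += size
    return out

w_all = flatten(params)
shapes = [p.shape for p in params]
restored = unflatten(w_all, shapes)
print(w_all)
```

An optimizer that expects a vector would then work with `w_all`, and the loss function would call `unflatten` internally before doing the forward pass.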

## True/false - convolutions (15 min)



1. Convolving with the filter $w=[0.2, 0.2, 0.2, 0.2, 0.2]$ performs a local averaging operation.
2. For a convolution to be valid, the values in the filter must add up to $1$ (e.g. $w=[0,1,0], w=\left[\frac13,\frac13,\frac13\right]$, etc).
3. 1D convolution is just a special case of 2D convolution where one of the dimensions of $w$ and $x$ happens to be 1.
4. The convolution of a filter $w$ and signal $x$ is equivalent to the matrix multiplication $w^Tx$.

<br><br><br><br><br><br><br><br>

## Break (5 min)

## Convolution notes (0 min)

- What is a convolution?
- What is a linear operator/transformation?
  - in the discrete world, anything that can be represented as a matrix multiplication
  - we have linear operators in the continuous world (like sum, derivative; hence the trick in the last lecture), but that is out of scope here
  - finite differencing is a linear operator. PCA is linear. 
- 1-D convolutions
- 2-D and higher-D convolutions
- boundary conditions: the output might be slightly bigger or slightly smaller, or the same
  - these are the options in [scipy.signal.convolve2d](https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.signal.convolve2d.html)
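
A small 1-D sketch of the three boundary options, using `np.convolve` (which takes the same `mode` names, `"full"`, `"same"` and `"valid"`, as `scipy.signal.convolve2d`). The signal and filter values are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])  # signal, n = 8
w = np.ones(3) / 3                                # averaging filter, m = 3

lengths = {}
for mode in ["full", "same", "valid"]:
    y = np.convolve(x, w, mode=mode)
    lengths[mode] = len(y)
    print(f"{mode:>5}: length {len(y)}")

# "valid" keeps only positions where the filter fully overlaps the signal
y_valid = np.convolve(x, w, mode="valid")
print(y_valid)
```

For a signal of length $n$ and a filter of length $m$, the output lengths are $n+m-1$ (full), $n$ (same) and $n-m+1$ (valid).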

- (optional) all of this extends to continuous scenarios, where convolution is an integral instead of a sum
- convolution is often written as $x \ast y$ 
- convolution is commutative: $x \ast y = y \ast x$
  - Nonetheless, we often have a "signal" and a "filter" and they have different interpretations
  - The filter is often small and has an interpretation like "highpass" or "lowpass"
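
A quick numerical check of commutativity, plus a rough illustration of the "lowpass" and "highpass" interpretations. The filters here are simple illustrative choices (a 3-point average and a first difference), not the only options.

```python
import numpy as np

x = np.array([0., 0., 0., 1., 1., 1., 0., 0., 0.])  # signal with two "edges"
lowpass = np.ones(3) / 3          # local average: smooths the signal
highpass = np.array([1., -1.])    # finite difference: responds to changes

# Convolution is commutative: the order of the arguments does not matter
commutes = np.allclose(np.convolve(x, lowpass), np.convolve(lowpass, x))

smoothed = np.convolve(x, lowpass, mode="same")
edges = np.convolve(x, highpass, mode="valid")  # x[i] - x[i-1]
print(commutes)
print(smoothed)
print(edges)
```

The highpass output is zero wherever the signal is flat and nonzero only at the two jumps.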

- Note (for completeness, not required): there is a fast implementation of convolution using the FFT (fast Fourier transform). For an $n\times n$ image and $m\times m$ filter (with $m \le n$) this takes roughly $O(n^2\log n)$ time instead of $O(n^2m^2)$. For sufficiently big images and filters this is a big win, BUT there's a catch: the equivalence holds directly only for periodic boundary conditions (other boundary conditions require zero-padding first).

- Convolutions appear all over the field of _signal processing_, which is traditionally a discipline within electrical engineering. But they also show up in math, physics, CS, etc. A lot of communications theory is based on this stuff.
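
For the curious, here is a minimal 1-D sketch of the FFT note above, using `np.fft`. It assumes a random signal and a small smoothing filter purely for illustration. Multiplying the transforms without padding gives the periodic (circular) convolution; zero-padding both inputs to the "full" length first recovers the ordinary convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = np.array([0.25, 0.5, 0.25])
y_direct = np.convolve(x, w, mode="full")   # length 16 + 3 - 1 = 18

# Zero-pad both inputs to the full output length so nothing wraps around
n_full = len(x) + len(w) - 1
y_fft = np.fft.irfft(np.fft.rfft(x, n_full) * np.fft.rfft(w, n_full), n_full)

# No padding beyond len(x): the tail of the result wraps around to the start
n = len(x)
y_circ = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(w, n), n)

print(np.allclose(y_fft, y_direct))
print(np.allclose(y_circ[:2], y_direct[:2] + y_direct[n:]))
```

Away from the boundary the circular and ordinary results agree; they differ only in the first `len(w) - 1` entries, where the wrapped-around tail gets added.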



## True/false - CNNs (15 min)

1. Convolutional neural network = convolutional network = convolutional net = convnet = CNN
2. Convolutional neural networks are a special case of fully-connected ("regular") neural networks where some of the weights are fixed at 0 and some of the weights are "tied" (fixed to be the same value).
3. The term "convolutional neural network" refers specifically to using **2D** convolutions, and the main application is image data.
4. A convnet applied to image recognition would typically have fewer parameters than a fully-connected net applied to the same problem.

<br><br><br><br><br><br><br><br>

## CNNs notes (0 min)

- Imagine a neural net with images as inputs. A $1000\times 1000$ image has 1 million features.
- Now say the next layer is 10% or even 1% of that size. Even at 1% ($10^4$ units), the weight matrix has $10^6\times10^4=10^{10}$ entries. That is not going to happen.
- Key insight: things happen "locally" in images. The top-left matters and the bottom-right matters, but they don't necessarily need to interact right away.
  - so we do some "local processing" on the different parts of the image, and then "report back" and start merging the information when we've reduced the dimension
  - this is the promise/dream of "deep learning": hierarchical abstractions like edges, curves, objects, higher and higher level "understanding"
- Key idea: use layers that are not fully connected (this was called "Dense" in Keras). Instead, have units in layer 2 that only get input from some _nearby units_ (pixels) in layer 1. 
- The above notion (with the same weights reused at every location) is precisely a convolution. Thus people talk about convolutions, but keep in mind it's just a not-fully-connected neural network. This means everything from before (gradients, tricks) carries over nicely.
- But for computational reasons we don't form those giant matrices full of zeros! We just do convolutions. 
- The parameters (weights) are now the filters themselves. So we can interpret it as "learning the filters". It's all the same stuff. 
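
To see the "giant matrix full of zeros" concretely, here is a small 1-D sketch. The helper `conv_matrix` is hypothetical; it builds the matrix that a fully-connected layer would need in order to compute the same thing as a "valid" convolution. Every row holds the same (tied) filter weights, shifted by one position, and everything else is zero.

```python
import numpy as np

def conv_matrix(w, n):
    """Matrix A such that A @ x equals np.convolve(x, w, mode="valid")."""
    m = len(w)
    A = np.zeros((n - m + 1, n))
    for i in range(n - m + 1):
        A[i, i:i + m] = w[::-1]   # same filter weights in every row
    return A

x = np.arange(8, dtype=float)
w = np.array([1., 2., 3.])
A = conv_matrix(w, len(x))

print(A)
print(np.allclose(A @ x, np.convolve(x, w, mode="valid")))
print("entries in A:", A.size, "| nonzero:", np.count_nonzero(A),
      "| free parameters:", len(w))
```

Even in this tiny case the matrix has 48 entries but only 3 distinct learnable values, which is why we compute the convolution directly instead of forming the matrix.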

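## Neural network as a feature extractor (10 min)

We can use a trained neural network as a feature extractor: chop off the last layer, then fit a linear model on the outputs of the second-to-last layer.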


In [3]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')

# normalize inputs from 0-255 to 0-1
X_train = X_train / 255.
X_test = X_test / 255.

# one hot encode outputs
Y_train = np_utils.to_categorical(y_train)
Y_test = np_utils.to_categorical(y_test)

In [4]:
cnn = Sequential()
cnn.add(Conv2D(10, (5, 5), input_shape=(28, 28, 1), 
               activation='relu'))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Conv2D(10, (5, 5), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Flatten())
cnn.add(Dense(50, activation='relu'))
cnn.add(Dense(10, activation='softmax'))

cnn.compile(loss='categorical_crossentropy', 
            optimizer='adam', metrics=['accuracy'])
cnn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 24, 24, 10)        260       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 10)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 8, 10)          2510      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 10)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                8050      
_________________________________________________________________
dense_2 (Dense)              (None, 10)                510       
Total para
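
The output shapes and per-layer parameter counts in the summary above can be checked by hand. The sketch below assumes the Keras defaults for this model: "valid" convolutions with stride 1, and 2x2 max pooling that halves each spatial dimension. The helper names are made up for illustration.

```python
def conv_out(size, kernel):
    # "valid" convolution with stride 1 shrinks each side by kernel - 1
    return size - kernel + 1

def conv_params(kernel, c_in, c_out):
    # one kernel x kernel x c_in filter per output channel, plus one bias each
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

s = conv_out(28, 5)   # 24
s = s // 2            # 12 after 2x2 max pooling
s = conv_out(s, 5)    # 8
s = s // 2            # 4
flat = s * s * 10     # 4 * 4 * 10 channels = 160

counts = [conv_params(5, 1, 10), conv_params(5, 10, 10),
          dense_params(flat, 50), dense_params(50, 10)]
print(flat, counts)
```

Note that most of the parameters sit in the first dense layer, not in the convolutional layers.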

In [5]:
cnn.fit(X_train, Y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1a1dc291d0>

In [13]:
feature_extractor = Model(cnn.input, cnn.layers[-2].output)

In [14]:
feature_extractor.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1_input (InputLayer)  (None, 28, 28, 1)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 24, 10)        260       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 10)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 8, 10)          2510      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 10)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                8050      
Total para

In [30]:
Z_train = feature_extractor.predict(X_train)
Z_test  = feature_extractor.predict(X_test)

In [16]:
Z_train.shape

(60000, 50)

In [17]:
lr = LogisticRegression(multi_class="multinomial", solver="sag", C=1000)
lr.fit(Z_train, y_train)



LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='sag',
          tol=0.0001, verbose=0, warm_start=False)

In [18]:
cnn.evaluate(X_train, Y_train)



[0.03379892049527843, 0.9889166666666667]

In [19]:
lr.score(Z_train, y_train)

0.99455

In [29]:
np.argmax(cnn.predict(X_test[:15]), axis=1)

array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1])

In [31]:
lr.predict(Z_test[:15])

array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1], dtype=uint8)

- As we can see, the predictions are the same. 
- The probabilities and coefficients aren't exactly the same (it's hard to match everything up perfectly)
- If we want to make them exactly the same, we can set the coefficients:

In [41]:
W = cnn.layers[-1].get_weights()[0]
W.shape

(50, 10)

In [42]:
b = cnn.layers[-1].get_weights()[1]
b.shape

(10,)

In [39]:
lr.coef_.shape

(10, 50)

In [40]:
lr.intercept_.shape

(10,)

In [43]:
lr.coef_ = W.T
lr.intercept_ = b

In [44]:
cnn.predict(X_test[:1])

array([[1.2994968e-08, 1.8788144e-06, 1.1874030e-07, 1.2116589e-05,
        2.8262336e-12, 4.1235131e-07, 2.3576339e-13, 9.9996102e-01,
        5.2370819e-08, 2.4371320e-05]], dtype=float32)

In [46]:
lr.predict_proba(Z_test[:1])

array([[1.29949935e-08, 1.87881642e-06, 1.18740431e-07, 1.21165895e-05,
        2.82624466e-12, 4.12351739e-07, 2.35763422e-13, 9.99961138e-01,
        5.23707229e-08, 2.43713221e-05]], dtype=float32)

Bam!
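
Why this works: the last `Dense` layer with a softmax activation computes the same function as multinomial logistic regression on the extracted features. The only difference is the shape convention, which is why the weights are transposed above. Here is a NumPy-only sketch with random stand-in values for `Z`, `W` and `b` (not the trained ones), assuming the linear model applies a softmax to its linear scores.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 50))    # stand-in for extracted features
W = rng.standard_normal((50, 10))   # stand-in for the last layer's weights
b = rng.standard_normal(10)         # stand-in for the last layer's biases

# Dense + softmax convention: W has shape (features, classes)
p_dense = softmax(Z @ W + b)

# Logistic regression convention: coef has shape (classes, features)
coef, intercept = W.T, b
p_logreg = softmax(Z @ coef.T + intercept)

print(coef.shape, np.allclose(p_dense, p_logreg))
```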