<a href="https://colab.research.google.com/github/fuanonemus/cop4630spring2020/blob/master/hw5_4630.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
from tensorflow.keras import models
from tensorflow.keras import layers

<h1>Terminology</h1>

Artificial Intelligence - when a computer can do things that usually require human intelligence, ex. vision, recognition, and decision making

Machine learning is a subset of AI and deep-learning is a subset of machine learning.

Machine Learning - when the program adjusts itself in response to the data it's exposed to without human intervention. You train a model to make useful predictions using a data set.

Two types of problems:
* Supervised learning
  * algorithm is given labeled training data
  * during training, the algorithm determines the relationship between the features and the labels
* Unsupervised learning
  * algorithm is given unlabeled data and has to find structure on its own, like groupings
  * goal is to find meaningful patterns in the data

Reinforcement learning - you give the algorithm a goal, like winning a game, and it takes actions until it succeeds

Label - what we want to predict, like a type or a value

Feature - input variable used in the prediction, like an image or a value

Example - instance of data given to the model

Training - creating or learning the relationship between features and labels, building of the model

Inference/Testing - applying the trained model to unlabeled examples to make predictions

Regression Model - predicts continuous values like prices or probabilities

Classification Model - predicts discrete values like types

<h1>Linear Regression</h1>
For modeling continuous values linearly.

Equation: $y' = b + w_1 x_1$
* $y'$ is the predicted label or desired output
* $b$ is the bias
* $w_1$ is the weight of the feature, how much impact it has
* $x_1$ is the feature

You can have multiple features with multiple $wx$ terms.
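With several features the prediction becomes $y' = b + w_1 x_1 + ... + w_n x_n$, which is a dot product plus the bias. Here is a minimal sketch of that; the function name `predict` and the numbers are made up for illustration.

```python
import numpy as np

def predict(x, w, b):
    """Return y' = b + w_1*x_1 + ... + w_n*x_n for a single example."""
    return b + np.dot(w, x)

# illustrative values only: two features, two weights, one bias
w = np.array([3.0, -1.0])
b = 4.0
x = np.array([2.0, 5.0])

y_pred = predict(x, w, b)  # 4 + 3*2 + (-1)*5
```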

Training a model means finding good values for all the weights and the bias.

Empirical Risk Minimization - in supervised learning, you iterate over lots of examples to find a model that minimizes loss

Loss - value indicating how bad the model's prediction was on a single example. If you graphed all the data and your model line, the loss for one example comes from the vertical distance between its point and your line.

Mean Square Error (MSE) - average squared loss per example over the whole data set
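As a rough sketch, MSE can be computed in NumPy like this. The helper name `mean_squared_error` and the sample values are assumptions for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# squared errors are 0, 1, and 4, so the mean is 5/3
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 3.0, 5.0])
```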

Iterative learning - finding the model that minimizes loss by:
1. pick some random starting values for w and b
2. calculate loss
3. update the values to reduce the loss
4. repeat 2-3 until the model converges (loss stops changing)

<h2>Gradient Descent</h2>
If you graph the loss against the weight value, linear regression with squared loss gives you a convex, bowl-shaped curve (a parabola for a single weight). The minimum of that curve is the weight value where the model converges.

The gradient of loss is the derivative of the loss-weight curve. It has a direction and a magnitude. 

If your model has multiple features, thus multiple weights, the gradient is a vector of the partial derivatives with respect to the weights.

To calculate the next weight value, subtract some fraction of the gradient from the old weight value. This moves the weight in the direction that reduces the loss.

Learning Rate - the step size used to determine the new weight value
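Here is a small sketch of the update rule on a made-up one-weight loss, $(w - 3)^2$, whose minimum is at $w = 3$. The loss function, starting point, and learning rate are all assumptions chosen so the behavior is easy to follow.

```python
def loss(w):
    # toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # derivative of the toy loss with respect to w
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0  # starting guess

for step in range(100):
    # step against the gradient, scaled by the learning rate
    w = w - learning_rate * gradient(w)
```

After the loop, `w` should sit very close to 3 and `loss(w)` very close to 0.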

<h3>Stochastic Gradient Descent (SGD)</h3>
A batch is the set of examples used to calculate the gradient in a single iteration, and the batch size is the number of examples in it. Redundancy grows with batch size. Some redundancy helps smooth out noisy gradients, but very large batches are more likely to slow down the process.

SGD takes a single random example from the data set for each iteration. This requires a large number of iterations, and can be very noisy.

Mini-batch SGD is when you choose a relatively small batch size. More efficient than full-batch and less noisy than SGD.

Problems that might happen with gradient descent:
* your learning rate is too high and you step past the minimum
* your learning rate is too low and it takes forever
* your loss curve has multiple minima and you get stuck in a local minimum
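The first two problems can be seen on a toy loss. This sketch reuses a made-up loss $(w - 3)^2$ with three hand-picked learning rates; the function name and every number here are assumptions for illustration.

```python
def run_gradient_descent(learning_rate, steps=50, start=0.0):
    """Run gradient descent on the toy loss (w - 3)^2 and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return w

too_high = run_gradient_descent(1.1)    # overshoots further on every step
too_low = run_gradient_descent(0.001)   # barely moves in 50 steps
reasonable = run_gradient_descent(0.1)  # ends up close to the minimum at 3
```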

<h2>Example of Linear Regression</h2>

In [0]:
# first we need a data set
xs = 2 * np.random.rand(100, 1)
ys = 4 + 3 * xs + np.random.rand(100, 1)

# then we split into training and testing
trainin = xs[:80]
trainout = ys[:80]
testin = xs[80:]
testout = ys[80:]

# next we set the learning rate and the number of epochs (passes over the training set)
lr = 0.01
epochs = 10

# then we pick a random weight and a zero bias to start with
w = np.random.randn(1)
b = np.zeros(1)

# next we do iterative gradient descent
for epoch in range(epochs):
  for i in range(80):
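    # calculate the prediction for this example
    y = w * trainin[i] + b

    # calculate the gradient of the loss 0.5 * (y - label)^2
    # with respect to w and b
    gradientW = (y - trainout[i]) * trainin[i]
    gradientB = (y - trainout[i])

    # step against the gradient, scaled by the learning rate
    w -= lr * gradientW
    b -= lr * gradientB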

# finally we evaluate our model with test
loss = 0
for i in range(20):
  loss += 0.5 * (w * testin[i] + b - testout[i]) ** 2
loss /= 20

<h2>Example with Mini-Batch Stochastic Gradient Descent</h2>

In [0]:
# use the same data set, divisions, lr, epochs as before
# for weight and bias in this example, i'll use vectorized code
# this makes it easier to add features, b = w_0 and x_0 = 1
w_vector = np.random.randn(2,1)

# because we're vectorizing, we need to add a column of ones to x for bias
trainin_vector = np.column_stack([np.ones((len(trainin), 1)), trainin])
testin_vector = np.column_stack([np.ones((len(testin), 1)), testin])

batch_size = 4
for epoch in range(epochs):
  # we need to come up with a random set of 4 values from the training set
  # so start by shuffling the data so every time we pull a sample it
  # will be random
  indices = np.random.permutation(80)
  shuffledxs = trainin_vector[indices]
  shuffledys = trainout[indices]

  for i in range(0, 80, batch_size):
    # get sample from shuffled training set
    xi = shuffledxs[i:i+batch_size]
    yi = shuffledys[i:i+batch_size]

    # calculate gradient and calculate new weights
    gradient = 1 / batch_size * xi.T.dot(xi.dot(w_vector) - yi)
    w_vector = w_vector - lr * gradient

# evaluate the model
loss = 0
for i in range(20):
  loss += 0.5 * (testin_vector[i].dot(w_vector) - testout[i]) ** 2
loss /= 20

<h1>Models</h1>
Network of layers. A directed and acyclic graph. The topology of the model defines the hypothesis space, the set of possible functions that the weights are constrained to.

Topologies include: linear stack (single input to single output), two-branch, multi-head, and inception block

After you define the architecture you pick the loss function and the optimizer.

Loss/objective function - the quantity minimized during training.

Optimizer - how the network is updated based on the loss function, some variant of stochastic gradient descent.

Problem Type | Last-layer Activation | Loss Function
--- | --- | ---
Binary classification | sigmoid | binary_crossentropy
Multiclass, single-label classification | softmax | categorical_crossentropy
Multiclass, multi-label classification | sigmoid | binary_crossentropy
Regression to arbitrary values | none | mse
Regression to values in [0, 1] | sigmoid | mse or binary_crossentropy
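As a sketch of how a few rows of the table translate to Keras, the helper below builds a small model with a configurable last layer and loss. The helper name `build_model`, the hidden layer size, and the feature and class counts are assumptions for illustration.

```python
from tensorflow.keras import layers, models

def build_model(num_features, output_units, last_activation, loss):
    """Small dense network whose last layer and loss follow the table."""
    model = models.Sequential([
        layers.Input(shape=(num_features,)),
        layers.Dense(16, activation='relu'),  # hidden size is arbitrary
        layers.Dense(output_units, activation=last_activation),
    ])
    model.compile(optimizer='sgd', loss=loss)
    return model

# binary classification: one sigmoid unit
binary_model = build_model(10, 1, 'sigmoid', 'binary_crossentropy')

# multiclass, single-label classification: one softmax unit per class
multiclass_model = build_model(10, 5, 'softmax', 'categorical_crossentropy')

# regression to arbitrary values: no activation on the last layer
regression_model = build_model(10, 1, None, 'mse')
```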

<h2>Keras</h2>
Deep-learning framework for Python with methods for defining and training deep-learning models

Process:
1. Define your training data: input tensors and target tensors
2. Define a network of layers that map your inputs to your targets
3. Configure the learning process by choosing a loss function, an optimizer, and some metrics to monitor
4. Iterate on the training data with the fit() method

<h2>Layers</h2>
The fundamental data structure of a neural network. A data processing module. Input and output are tensors. Some layers are stateless, but most have weights learned through stochastic gradient descent.

Different layers work for different formats:
* vector data in 2D tensors is processed with densely/fully connected layers
* sequence data in 3D tensors is processed with recurrent layers, ex. long short-term memory (LSTM) layers
* image data in 4D tensors is processed with 2D convolutional layers

Each layer has certain shape limits for input and output. In Keras, you only specify the input shape for the first layer. Each layer after that automatically infers its input shape from the output of the layer before it.

<h2>Data Sets</h2>
In order to evaluate your model, you need a test set.

Typically you split your data set into a training set and a testing set. The test set should be large enough to produce statistically significant results and it should be representative of the data set as a whole.

The goal is to do as well on the testing set as you do on the training set. Never train on the test data or you'll ruin your evaluation metrics.

One method for dividing the data set is Simple Hold-Out Validation. It's the simplest evaluation protocol, but if your data set is small, the test set might not be big enough to be statistically significant. To identify this problem, shuffle the data before splitting and repeat for several rounds. If the metrics vary a lot between rounds, your test set is too small.

To fix this problem use k-fold validation or iterated k-fold validation (for especially small data sets). Split the data into k partitions of equal size. For each partition, train the model on the remaining k-1 partitions and evaluate it on the held-out partition. Average your metrics across the k runs.
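Here is a rough NumPy sketch of k-fold validation. The helper names `k_fold_indices` and `fit_and_score` are made up, the data is synthetic, and the "model" is just a least-squares line fit with `np.polyfit`, standing in for whatever model you actually train.

```python
import numpy as np

def k_fold_indices(num_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_examples)  # shuffle before splitting
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def fit_and_score(x_train, y_train, x_val, y_val):
    # stand-in model: fit a line, then score it with MSE on the held-out fold
    w, b = np.polyfit(x_train, y_train, 1)
    return np.mean((w * x_val + b - y_val) ** 2)

# synthetic data, similar in spirit to y = 4 + 3x plus noise
rng = np.random.default_rng(1)
xs = 2 * rng.random(100)
ys = 4 + 3 * xs + rng.random(100)

scores = [fit_and_score(xs[tr], ys[tr], xs[va], ys[va])
          for tr, va in k_fold_indices(len(xs), k=5)]
average_mse = np.mean(scores)  # the averaged metric across the 5 folds
```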

If you divide your data set into three parts instead of two (training, validation, and testing), you'll further prevent overfitting. Use the validation set as a gateway to the test set. Use it to fine tune the model and only use the test set if your model has passed the validation set. 

<h2>Overfitting</h2>
Occurs when a model adapts to the peculiarities of the training data in an effort to produce a low loss, but when given the test set produces a high loss. Caused by making a model more complex than necessary. 

Your model should be good at generalizing. Make sure:
* to draw examples independently and identically at random from the dataset so they don't influence each other
* the distribution of the data set is consistent (stationarity)
* to draw examples from partitions of the same distribution

If you violate these principles, pay careful attention to your evaluation metrics.

<h2>Example of Linear Regression with Keras Model</h2>

In [0]:
# for this example, we'll use the same data as before
# first build the model by specifying the architecture
# we only need one layer for one feature
model = models.Sequential()
model.add(layers.Dense(1, input_shape=(1,)))

# next specify the loss function and the optimizer
model.compile(optimizer='sgd',loss='mse',metrics=['mse'])

# train the model
model.fit(trainin, trainout, epochs=epochs, batch_size=batch_size)

# evaluate the model; the second value is the 'mse' metric, not an accuracy
loss, mse = model.evaluate(testin, testout)

<h1>Convolutional Neural Networks</h1>
Used for image classification. Progressively extracts higher and higher level representations of an image. Takes the image's pixel data as input and learns to extract features and their significance.

Input: 3D tensor (feature map) of shape height x width x 3, where 3 is the number of RGB color channels

Anatomy:


*   Stack of modules which extract features, perform the following operations:
  *  Convolution
  *  ReLU
  *  Pooling
*   Fully Connected Classification Layers

<h2>Convolution</h2>
Operation performed on feature maps to produce a new feature map that highlights meaningful features.

You take a filter and slide it over the input to every possible position and convolve the filter with the portion of overlapping input.

Convolving is like flattening both matrices into vectors and taking the dot product. The output is smaller than the input, but you can add padding to the input to increase the output size.

Through training, the model finds optimal filter values for meaningful features. Increasing the number of filters will increase the number of features, but it will also increase the training time and each added filter will produce less information.
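A minimal NumPy sketch of the sliding-filter idea for a single-channel input. The function name `convolve2d_valid` and the example values are assumptions. Like deep-learning layers, it does not flip the filter before sliding it.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over every position where it fully overlaps the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            # flatten both and take the dot product
            output[r, c] = np.dot(patch.ravel(), kernel.ravel())
    return output

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

# 4x4 input with a 2x2 filter gives a 3x3 output
feature_map = convolve2d_valid(image, kernel)

# zero-padding the input by 1 keeps a 3x3 filter's output at 4x4
padded_map = convolve2d_valid(np.pad(image, 1), np.ones((3, 3)))
```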

<h2>ReLU</h2>
A transformation on the map to add nonlinearity by getting rid of negative values.

ReLU(x) = max(0, x)
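In NumPy this is a one-liner. The sample feature map below is made up for illustration.

```python
import numpy as np

def relu(x):
    """Replace negative values with 0 and leave the rest unchanged."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])
activated = relu(feature_map)
```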

<h2>Pooling</h2>
Downsamples the map to save time. One method is Max Pooling: you lay a grid over the map and only keep the largest value in each square. It takes a pool size, which defines the size of the squares of the grid, and a stride, which defines how far the grid moves between squares.
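Here is a sketch of max pooling in NumPy, assuming non-overlapping squares (the stride equals the side of each square). The function name `max_pool2d`, the `pool_size` parameter, and the sample values are assumptions for illustration.

```python
import numpy as np

def max_pool2d(feature_map, pool_size=2):
    """Keep the largest value in each pool_size x pool_size square."""
    h, w = feature_map.shape
    out_h, out_w = h // pool_size, w // pool_size
    # drop any rows/columns that don't fill a whole square
    trimmed = feature_map[:out_h * pool_size, :out_w * pool_size]
    # group the map into squares, then take the max inside each one
    blocks = trimmed.reshape(out_h, pool_size, out_w, pool_size)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 9, 8],
                        [2, 6, 7, 3]], dtype=float)
pooled = max_pool2d(feature_map)  # 4x4 map becomes 2x2
```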

<h2>Fully Connected Layers</h2>
Used for classification. Determine labels based on the feature data. 

Last layer is typically softmax, which outputs a probability value from 0 to 1 for each label, with the values summing to 1.
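A small sketch of softmax in NumPy. The input scores are made up, and subtracting the max first is a common trick to avoid overflow that doesn't change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# one raw score per label; the largest score gets the largest probability
probs = softmax(np.array([2.0, 1.0, 0.1]))
```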



In [0]:
# Here's how to build a convnet from scratch:

model = models.Sequential()

# add convolutional layers with pooling
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D(2, 2))

model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D(2, 2))

model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D(2, 2))

model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D(2, 2))

# add final fully connected layers for classification
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# alternatively you can use a pretrained convnet from keras
# all you need to do is import the desired convnet
from tensorflow.keras.applications import VGG16

# then you call its constructor
base = VGG16(weights='imagenet',include_top=False,input_shape=(150,150,3))

# and add the base to a new model
model = models.Sequential()
model.add(base)
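# then add in a dense classifier (the layer sizes here are only an example)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# you can also train only certain parts of the model by freezing layers
# do this before compiling so the change takes effect
base.trainable = False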


<h1>Fine Tuning Issues</h1>

Regularization - a technique for improving generalization in a model to prevent or lessen the effects of overfitting by adding a term to the loss that penalizes large weights

You can add a regularizer to a layer in keras by adding the kernel_regularizer parameter to a layer.
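For example, an L2 penalty on a dense layer might look like the sketch below. The layer sizes, input shape, and the 0.001 penalty factor are arbitrary choices for illustration.

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(20,)),
    # L2 regularization adds 0.001 * sum(weights^2) for this layer to the loss
    layers.Dense(16, activation='relu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy')
```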

Dropout - another technique for preventing overfitting by randomly dropping (zeroing out) a fraction of a layer's inputs during training

You can add a dropout layer in keras
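Here is a quick sketch of what a `Dropout` layer does when called directly. The rate of 0.5 and the all-ones input are assumptions picked to make the effect visible.

```python
import numpy as np
from tensorflow.keras import layers

dropout = layers.Dropout(0.5)  # drop about half of the inputs
inputs = np.ones((1, 1000), dtype="float32")

# during training, random inputs are zeroed and the rest are scaled up
train_out = np.array(dropout(inputs, training=True))

# during inference, dropout does nothing
infer_out = np.array(dropout(inputs, training=False))
```

In a model you would normally just add `layers.Dropout(rate)` between other layers and let Keras handle the training flag.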

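Data Augmentation - a technique used for training a large model if you don't have a lot of data. It generates extra training examples by applying random transformations, like flips and rotations, to the examples you already have

You can use the Keras ImageDataGenerator class to do this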

To fine tune a model try retraining its layers by freezing some and unfreezing others.
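A sketch of freezing and unfreezing with a tiny stand-in base model, so nothing needs to be downloaded. The layer names and sizes are hypothetical; with a real pretrained base the same `trainable` flags apply. Recompile after changing them so the change takes effect.

```python
from tensorflow.keras import layers, models

# small stand-in for a pretrained base
base = models.Sequential([
    layers.Input(shape=(10,)),
    layers.Dense(8, activation='relu', name='base_dense_1'),
    layers.Dense(8, activation='relu', name='base_dense_2'),
])
model = models.Sequential([
    layers.Input(shape=(10,)),
    base,
    layers.Dense(1, activation='sigmoid', name='classifier'),
])

# phase 1: freeze the whole base and train only the classifier
base.trainable = False
model.compile(optimizer='sgd', loss='binary_crossentropy')
frozen_count = len(model.trainable_weights)

# phase 2: unfreeze the base, then re-freeze its first layer
base.trainable = True
base.get_layer('base_dense_1').trainable = False
model.compile(optimizer='sgd', loss='binary_crossentropy')
partial_count = len(model.trainable_weights)
```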