# Homework 5




## Problem 1

Summarize and describe the different concepts/methods/algorithms that you have learned in this course.

Use a Colab notebook. Make sure that you organize the material logically by using sections/subsections. Also, use code cells to include code snippets.

I suggest that you group everything into six categories:

- General concepts (for instance, what is artificial intelligence, machine learning, deep learning)

- Basic concepts (for instance, here you can talk about linear regression, logistic regression, gradients, gradient descent)

- Building a model (for instance, here you can talk about the structure of a convnet, what its components are, etc.)

- Compiling a model (for instance, you can talk here about optimizers, learning rate, etc.)

- Training a model (for instance, you can talk about overfitting/underfitting)

- Finetuning a pretrained model (describe how you proceed)

## General Concepts

### Artificial Intelligence

- Umbrella term for any computer program that does something smart
- "The science and engineering of making intelligent machines"
- Deals with the simulation of intelligent behavior in computers
- Uses input and rules to produce an output

#### Machine Learning

- Field of study that gives computers the ability to learn without being explicitly programmed

- Machine learning models adjust themselves in response to the data they are exposed to

- Does not require human intervention to make certain changes
- Uses inputs and expected outputs to produce its own rules

      A computer program is said to learn from experience E with respect to
      some class of tasks T and performance measure P if its performance at
      tasks in T, as measured by P, improves with experience E.

##### Deep Learning

 - Subset of machine learning that uses multi-layer neural networks to extract progressively higher-level features from the input

## Basic Concepts

### Linear Regression


Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables.

A basic model can be defined with the following equation:

$\quad \quad \hat{y} = b + w_1x_1$

where:

- $\hat{y}$ is the predicted label (the model's output)

- $b$ is the bias (the y-intercept), also known as $w_0$

- $w_1$ is the weight of feature 1. It plays the same role as the slope $m$ of a line in slope-intercept form

- $x_1$ is a feature (the input)

It can be extended to multiple inputs (features):

$\quad \quad \hat{y} = b + \sum\limits_{j=1}^n w_jx_j$

where $n$ is the number of features.
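
As a minimal sketch of the multi-feature equation above, the prediction is the bias plus a weighted sum of the features. The function name `predict` and the sample numbers are illustrative assumptions, not part of the course material.

```python
def predict(features, weights, bias):
    """Return y_hat = b + sum_j w_j * x_j for a single example."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Illustrative values: two features, two weights, one bias.
y_hat = predict(features=[1.0, 2.0], weights=[3.0, 4.0], bias=0.5)
print(y_hat)  # 0.5 + 3*1 + 4*2 = 11.5
```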

### Loss

- Loss is the penalty for a bad prediction. It quantifies how bad the model's prediction was on a single example.

- A perfect prediction yields a loss of zero; otherwise the loss is greater than zero.

- Training a model means examining the examples and adjusting the weights and bias so that the loss is minimized.

- The goal of training is to find a set of weights and biases that have low loss across all examples. This process is called empirical risk minimization.

Mean Squared Error (MSE) is the average squared loss per example over the whole data set:

$\quad \quad MSE(w) = \frac{1}{m} \sum\limits_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$

where:
  - $m$ is the number of examples
  - $x^{(i)}$ and $y^{(i)}$ are the features and the label of the $i$-th example
  - $\hat{y}^{(i)}$ is the prediction of the model
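
The MSE formula above can be sketched in a few lines of plain Python. The function name and the sample labels and predictions are made up for illustration.

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    m = len(labels)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)) / m


# Three made-up examples: two perfect predictions and one that is off by 2.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(mse)  # (0 + 0 + 4) / 3
```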

### Gradient Descent

Gradient Descent is how the model updates the learnable parameters during training.

- It calculates the gradient of the loss function at some starting point.
  - The gradient always points in the direction of the steepest increase in the loss function.

- It takes a step in the direction of the negative gradient to reduce the loss.

- It then updates each weight such that:

  $\quad w = w - \alpha \nabla \mathcal{L}$

  where $\alpha$ is the learning rate and $\mathcal{L}$ is the loss function.
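
The update rule can be tried out on a one-feature linear regression with MSE loss. This is a minimal sketch: the function name, the data (points on the line $y = 2x + 1$), the learning rate, and the step count are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y_hat = b + w * x by full-batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Prediction error for every example with the current parameters.
        errors = [(b + w * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / m) * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # approximately 2.0 and 1.0
```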

Gradient Descent can be done in batches of examples. Larger batches may cause a single iteration to take a very long time to compute. Smaller batches tend to have noisier gradients.

- Stochastic gradient descent (SGD) uses a batch size of 1: each iteration uses a single example chosen at random.

- Mini-batch stochastic gradient descent typically chooses between 10 and 10,000 examples at random for each batch.
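
One common way to form random batches is to shuffle the data once per pass and slice it into chunks. The sketch below assumes that variant (sampling without replacement); the helper name `minibatches` and the fixed seed are illustrative.

```python
import random


def minibatches(examples, batch_size, seed=0):
    """Yield the examples in random order, batch_size at a time."""
    rng = random.Random(seed)  # fixed seed only to make the demo repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


batches = list(minibatches(range(10), batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Setting `batch_size=1` gives the stochastic gradient descent case, and setting it to the data set size gives full-batch gradient descent.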



### Logistic Regression

Logistic regression is used for binary classification problems.

Binary classification means there are only two possible classes, positive or negative.

The model has a single output neuron, whose activation indicates the probability of class 1.

Logistic regression applies the sigmoid function to the output neuron to map its value to the range between 0 and 1.

$\quad \sigma (z) = \frac{1}{1+e^{-z}}$

Either squared error loss or binary cross-entropy loss can be used. Binary cross-entropy usually speeds up training, because its gradient does not vanish when the sigmoid saturates.
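
The sigmoid and the binary cross-entropy loss for one example can be sketched as follows. The function names are illustrative, and the code makes no attempt at numerical robustness (very large negative `z` or probabilities of exactly 0 or 1 would fail).

```python
import math


def sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(label, probability):
    """Loss for one example; label is 0 or 1, probability is P(class 1)."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))


p = sigmoid(0.0)
print(p)  # 0.5
print(binary_cross_entropy(1, p))  # -log(0.5), about 0.693
```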

## Building a Model


- A neural network has three main kinds of layers: an input layer, one or more hidden layers, and an output layer.

- Each layer has an output size, an input shape, and an activation function.

- Keras infers the input shape of subsequent layers, so only the input shape of the first layer has to be specified.

### The Input Layer

The input layer receives the input for the model. The following is an example for adding an input layer to a model in Keras.

    model.add(tf.keras.layers.Dense(1024, activation='relu', input_shape=(128*128,)))

where 1024 is the output size, the activation function is ReLU, and the input shape is a flattened 128-by-128 image (16,384 values).
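
To make the sizes concrete, the number of learnable parameters in a fully connected layer is one weight per input-output pair plus one bias per output unit. The helper name below is an assumption for illustration.

```python
def dense_param_count(input_size, units):
    """Weights (input_size * units) plus one bias per unit."""
    return input_size * units + units


print(dense_param_count(128 * 128, 1024))  # 16778240
```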

### The Hidden Layer

The hidden layers are all of the layers between the input layer and the output layer. A model can have several of them.

These additional layers increase model complexity and help the model better extract features.
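
As a rough sketch of what stacking layers means, each fully connected layer computes a weighted sum per unit, adds a bias, and applies an activation; the result feeds the next layer. The weights below are hand-picked for illustration rather than learned, and the helper names are assumptions.

```python
def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases, activation):
    """weights[k] holds the weights feeding output unit k."""
    pre_activations = [
        b + sum(w * x for w, x in zip(unit_weights, inputs))
        for unit_weights, b in zip(weights, biases)
    ]
    return activation(pre_activations)


x = [1.0, -2.0]
# Hidden layer with two units, followed by a layer with one unit.
hidden = dense(x, weights=[[1.0, 1.0], [2.0, -1.0]], biases=[0.0, 0.5], activation=relu)
output = dense(hidden, weights=[[1.0, 1.0]], biases=[-1.0], activation=relu)
print(hidden, output)  # [0.0, 4.5] [3.5]
```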

### The Output Layer

The output layer yields the prediction that the model makes. 

In classification problems, one-hot encoding is used to show which label the example belongs to.

In binary classification problems, the sigmoid activation function may be used, as it maps the output between 0 and 1.

    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

The output size is 1 because it only needs to show the probability of one label in binary classification.

In classification problems with more labels, the output size must increase to accommodate each label. The following example can be used as the output layer for a model trying to recognize the digits 0 through 9.

    model.add(tf.keras.layers.Dense(10, activation='softmax'))

There are 10 different labels for the example to map to, so the output is a vector of 10 values, one probability per label.
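
One-hot encoding and softmax can be sketched in plain Python. The function names and the sample scores are illustrative assumptions; Keras provides its own implementations.

```python
import math


def one_hot(label, num_classes):
    """Vector of zeros with a single 1.0 at the index of the label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


target = one_hot(3, 10)          # label "3" out of the digits 0-9
probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
print(target)
print(probs.index(max(probs)))    # 0, the class with the highest score
```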

### Convolutional Neural Networks



- A breakthrough in building models for image classification
- Can be used to progressively extract higher and higher level representations of image content
- A CNN takes an image's raw pixel data as input, learns how to extract features from it, and then infers what object those features constitute
- The input has 3 dimensions: height, width, and color channels
- The input is convolved with a set of filters, one for each feature, and the resulting feature maps are passed to the next layer
- Layer operations include convolution, activation, and pooling
- The filters used for convolution are the learnable parameters of the model, and are updated after each iteration
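
The convolution and pooling steps can be sketched on a tiny single-channel image. This is a simplified illustration under stated assumptions: no padding, stride 1, one grayscale channel, and a hand-picked filter rather than a learned one. The function names are made up.

```python
def convolve2d(image, kernel):
    """'Valid' 2-D cross-correlation (what deep learning libraries call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness rises left to right

features = convolve2d(image, vertical_edge)
pooled = max_pool2x2(features)
print(features[0])  # [0, 2, 0, 0] -> strong response at the edge
print(pooled)       # [[2, 0], [2, 0]]
```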

## Compiling a Model

### Optimizers

Optimizers update the model in response to the output of the loss function by modifying the weights of the model. 

Examples include SGD (Stochastic Gradient Descent) and RMSprop.

### Learning Rate

The learning rate $\alpha$ is a hyperparameter that controls how quickly the model learns by scaling the size of each gradient descent step.

The gradient descent algorithm multiplies the gradient by the learning rate and then updates the weight such that:

$\quad w = w - \alpha \nabla \mathcal{L}$
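
The effect of the learning rate can be seen on a toy loss $\mathcal{L}(w) = w^2$, whose gradient is $2w$. The function name, starting point, and the two learning rates below are assumptions picked to show one run that converges and one that diverges.

```python
def minimize_square(learning_rate, start=1.0, steps=20):
    """Gradient descent on L(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w


small = minimize_square(0.1)  # each step multiplies w by 0.8
large = minimize_square(1.1)  # each step multiplies w by -1.2
print(abs(small))  # shrinks toward the minimum at 0
print(abs(large))  # grows, because every step overshoots the minimum
```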

### Keras Implementation

    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

Note that the optimizer is RMSprop and the loss function is categorical cross-entropy.
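
For a single example, categorical cross-entropy reduces to the negative log of the probability assigned to the correct class. The sketch below is an illustration with made-up values, not the Keras implementation.

```python
import math


def categorical_cross_entropy(one_hot_label, probabilities):
    """-sum_k t_k * log(p_k); only the true class contributes."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_label, probabilities) if t > 0)


# True class is index 1; the model assigns it probability 0.8.
loss = categorical_cross_entropy([0.0, 1.0, 0.0], [0.1, 0.8, 0.1])
print(loss)  # -log(0.8), about 0.223
```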

## Training a Model

### Overfitting and Underfitting

Overfitting occurs when a model has low loss on the training data but high loss on the test data. It is typically caused by making the model more complex than necessary.

Underfitting occurs when the model is not complex enough for the problem at hand, and will thus have high loss for both the training data and test data. 

The fundamental tension of machine learning is between fitting the training data well and fitting it as simply as possible.

To check whether a model overfits, it must be evaluated on previously unseen data. One way to obtain such data is to divide the existing data set into two parts:

- A training set
- A test set

Given a large enough test set, good performance on the test set is a useful indicator of good performance on new data.
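
A random split into a training set and a test set can be sketched as follows. The helper name, the 80/20 ratio, and the fixed seed are assumptions for illustration.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    rng = random.Random(seed)  # fixed seed only to make the demo repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the data is ordered (for example, sorted by label), since otherwise the test set may not resemble the training set.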


### Epochs

- The number of epochs is the number of complete passes through the training dataset that the learning algorithm makes.

### Implementation

In this example, the model is trained on 60,000 samples and validated on 10,000 samples.

    epochs = 10
    history = model.fit(train_images,
                        train_labels,
                        epochs=epochs,
                        batch_size=128,
                        validation_data=(test_images, test_labels))

**Example Output:**

    Epoch 1/10
    60000/60000 [==============================] - 3s 47us/sample - loss: 0.2553 - accuracy: 0.9267 - val_loss: 0.1188 - val_accuracy: 0.9659
    ...
    Epoch 10/10
    60000/60000 [==============================] - 2s 33us/sample - loss: 0.0100 - accuracy: 0.9973 - val_loss: 0.0690 - val_accuracy: 0.9803
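
The run above uses 60,000 training samples, a batch size of 128, and 10 epochs. Assuming one weight update per batch and a smaller final batch when the sizes do not divide evenly, the number of updates can be worked out as in this small sketch (the helper name is illustrative).

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


steps = updates_per_epoch(60000, 128)
print(steps)       # 469
print(steps * 10)  # 4690 updates over 10 epochs
```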

## Finetuning a Model

Fine-tuning allows us to further tune a pre-trained model to our specific problem or data set. 

Existing models available for fine-tuning include VGG16, ResNet50, and Xception.



### Finetuning By Adding Additional Layers

As seen in the example below, layers can be added to an existing ResNet model.

This allows us to potentially improve performance by adding layers that are more helpful/specific to our problem. 

An output layer of size 1 using the sigmoid function was added below because the model was being used for binary classification.

    from keras.applications import ResNet50
    from keras import layers
    from keras import models
    from keras import optimizers

    # Load ResNet50 with ImageNet weights, without its original classifier head.
    conv_base = ResNet50(
        weights='imagenet',
        include_top=False,
        input_shape=(150, 150, 3))

    # Stack a new classifier on top of the pretrained convolutional base.
    model = models.Sequential()
    model.add(conv_base)
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation='relu'))
    # Single sigmoid unit for binary classification.
    model.add(layers.Dense(1, activation='sigmoid'))

### Finetuning By Tuning the Pre-Made Model

We can fine-tune the layers of the pre-made model by freezing or unfreezing certain layers in the model.

Unfreezing a layer in the pre-made model allows training to modify that layer's weights; frozen layers keep their pretrained weights.

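    conv_base.trainable = True

    # Freeze every layer before the named one and leave the rest trainable.
    # Note: 'block5_conv1' is a VGG16 layer name. For the ResNet50 base built
    # above, substitute a name listed by conv_base.summary(); otherwise the
    # name never matches and every layer stays frozen.
    set_trainable = False
    for layer in conv_base.layers:
        if layer.name == 'block5_conv1':
            set_trainable = True
        layer.trainable = set_trainable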
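
The freezing logic in the loop above can be checked without downloading a model by running it on stand-in objects. Everything here is hypothetical: the layer names are made up, and `SimpleNamespace` objects only mimic the `.name` and `.trainable` attributes that Keras layers expose.

```python
from types import SimpleNamespace


def freeze_before(layers, first_trainable_name):
    """Freeze every layer that comes before first_trainable_name."""
    trainable = False
    for layer in layers:
        if layer.name == first_trainable_name:
            trainable = True
        layer.trainable = trainable
    return [layer.name for layer in layers if layer.trainable]


# Stand-in layers with made-up names, not a real Keras model.
fake_layers = [SimpleNamespace(name=n, trainable=True)
               for n in ["stem", "block_a", "block_b", "head"]]

unfrozen = freeze_before(fake_layers, "block_b")
print(unfrozen)  # ['block_b', 'head']

nothing = freeze_before(fake_layers, "no_such_layer")
print(nothing)   # [] -> a name that never matches leaves every layer frozen
```

The second call shows why the layer name has to match the model actually in use: a mismatch silently freezes the whole base.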