<a href="https://colab.research.google.com/github/brettfazio/AI-Homework/blob/master/HW_5/Homework5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Homework 5

**Problem 1**

Summarize and describe the different concepts/methods/algorithms that you have learned in this course.

Use a Colab notebook. Make sure that you organize the material logically by using sections/subsections. Also, use code cells to include code snippets.

I suggest that you group everything into six categories:

1. General concepts (for instance, what is artificial intelligence, machine learning, deep learning)

2. Basic concepts (for instance, here you can talk about linear regression, logistic regression, gradients, gradient descent)

3. Building a model (for instance, here you can talk about the structure of a convnet, what its components are etc.)

4. Compiling a model (for instance, you can talk here about optimizers, learning rate etc.)

5. Training a model (for instance, you can talk about overfitting/underfitting)

6. Finetuning a pretrained model (describe how you proceed)

Take this homework *very seriously*.  You have the opportunity to make up for lost points on previous homework assignments.  

Some quotations taken from https://github.com/schneider128k/machine_learning_course

----

---

## Section 1 General Concepts



### Artificial Intelligence 
  - "science and engineering of making intelligent machines" - John McCarthy
  - Symbolic AI - methods in AI research that are human readable aka "symbolic." GOFAI falls under this.
  - **Input & Rules produce output.**

### Machine Learning 
  - "field of study that gives computers the ability to learn without explicitly being programmed" - Arthur Samuel
  - "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" - Tom Mitchell.
  - **Input & output produce rules.**
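The contrast between "input & rules produce output" and "input & output produce rules" can be sketched in a few lines. This is only an illustration and not from the course materials: the Celsius-to-Fahrenheit task, the function name `fahrenheit_by_rule`, and the data points are assumptions chosen for simplicity.

```python
import numpy as np

# Symbolic AI style: a human writes the rule, the program applies it.
def fahrenheit_by_rule(celsius):
    return 1.8 * celsius + 32.0

# Machine learning style: the program is given inputs and outputs
# and recovers the rule (here, a slope and an intercept) itself.
celsius = np.array([-40.0, 0.0, 20.0, 37.0, 100.0])
fahrenheit = np.array([-40.0, 32.0, 68.0, 98.6, 212.0])

# np.polyfit with degree 1 fits a straight line by least squares
# and returns the coefficients from highest power to lowest.
slope, intercept = np.polyfit(celsius, fahrenheit, 1)
print(slope, intercept)  # close to 1.8 and 32
```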

### Deep Learning 
  - Subset of ML that stacks many layers to learn increasingly abstract representations of data, in either a supervised or an unsupervised manner.
  - Deep neural networks are the standard example of this.


## Section 2 Basic Concepts

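At a basic level, regression predicts continuous values (such as a price or a temperature), while classification predicts discrete ones (yes or no, 1 or 2 or 3, etc.). Despite its name, logistic regression is typically used for classification.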

## Linear Regression

Linear Regression - A linear approach to modeling the relationship between an output and one or more input variables.

At a basic level your linear regression function could look like this:

$$ \hat y = b + w_1 x_1 $$

In the above $ \hat y $ is the output you are trying to predict. $ b $ is your bias, $ x_1 $ is your feature, and $ w_1 $ is your weight on $ x_1 $.

We could extend this to several variables by just expanding the function like so:

$$ \hat y = b + w_1 x_1 + w_2 x_2 + ... + w_n x_n $$

Where the above would have $ n $ features.
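As a minimal sketch, the $ n $-feature formula is just a dot product plus the bias. The function name `predict` and the numbers below are made up for illustration.

```python
import numpy as np

def predict(x, w, b):
    """Return b + w_1*x_1 + ... + w_n*x_n for one example x."""
    return b + np.dot(w, x)

# Illustrative values (made up): three features, three weights.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
b = 4.0

y_hat = predict(x, w, b)
print(y_hat)  # 4 + 0.5 - 2 + 6 = 8.5
```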

## Logistic Regression

Logistic Regression is used to predict a value between 0 and 1. As such it is often used for classification problems, as we could have 0 represent "dog" and 1 represent "cat." In that example, the closer the output is to 0 the more likely it is a dog (according to the model), and the closer to 1 the more likely it is a cat.

Logistic regression uses the *sigmoid function* to map these values between 0 and 1.

$$ \sigma (z) = \frac{1}{1 + e^{-z}}$$
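To connect the sigmoid to the dog/cat example, here is a rough sketch of a logistic regression prediction. The helper `classify`, the 0.5 threshold, and the weights are assumptions for illustration; in practice the weights would be learned from data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, w, b, threshold=0.5):
    """Return the predicted probability of "cat" and the chosen label."""
    p_cat = sigmoid(b + np.dot(w, x))
    label = "cat" if p_cat >= threshold else "dog"
    return p_cat, label

# Illustrative (made-up) weights and inputs.
w = np.array([2.0, -1.0])
b = 0.0
print(classify(np.array([3.0, 1.0]), w, b))  # z = 5, probability near 1
print(classify(np.array([0.0, 4.0]), w, b))  # z = -4, probability near 0
```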

## Gradients

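The gradient generalizes the slope of a function to several variables: it points in the direction of steepest increase, and its magnitude says how steep that increase is.

For a function $ f $ of $ n $ variables, the gradient at a point $ p $ is defined as:

$$ \nabla f(p) = \left[ \frac{\partial f}{\partial x_1}(p), \dots, \frac{\partial f}{\partial x_n}(p) \right] $$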

We use gradients to make our "steps" when performing gradient descent, as gradients have both magnitude and direction.
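One way to get a feel for the definition is to approximate each partial derivative numerically. The sketch below uses central differences; the function name `numerical_gradient`, the step size `h`, and the example function are my own choices for illustration.

```python
import numpy as np

def numerical_gradient(f, p, h=1e-6):
    """Approximate the gradient of f at point p with central differences."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        # Partial derivative with respect to the i-th variable.
        grad[i] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

# f(x1, x2) = x1^2 + 3*x2, whose exact gradient is [2*x1, 3].
f = lambda p: p[0] ** 2 + 3 * p[1]
grad = numerical_gradient(f, [2.0, 1.0])
print(grad)  # approximately [4, 3]
```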

## Gradient Descent

Gradient descent is an iterative process. First a starting value is chosen, and then the gradient is used to take steps that (hopefully) lead to a good value. I say hopefully because gradient descent needs to be tuned: a step that is too large can jump past the minimum entirely.

To be precise, if we are updating a weight $ w $ to reduce a loss $ L $, each step is:

$$ w = w - \alpha \nabla_w L $$

In the above $ \alpha $ is the learning rate that determines how big we are going to step. The learning rate is considered a *hyperparameter*  ("a parameter external to the model").

We want $ \alpha $ to be just the right size so we can efficiently get to our answer.
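Here is a minimal sketch of the update rule applied to a one-feature linear regression with a mean squared error loss. The data, the starting values, the learning rate of 0.05, and the 2000 steps are all assumptions picked so the example converges.

```python
import numpy as np

# Made-up data generated from y = 2x + 1 (no noise), so the answer is known.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # starting values
alpha = 0.05      # learning rate (a hyperparameter)

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # The update rule: step against the gradient.
    w = w - alpha * grad_w
    b = b - alpha * grad_b

print(w, b)  # should approach 2 and 1
```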

In [0]:
import numpy as np

In [0]:
# Implementing the sigmoid function using numpy


def sigmoid(z):
  return 1 / (1 + np.exp(-1*z))

## Section 3 Building a Model

### Overview

A basic neural network typically consists of three kinds of layers: input, hidden, and output.

#### Input

The input layer is the layer that accepts the input data (i.e. the data that we are trying to learn from). If we were accepting the MNIST dataset, our input layer could look like this:

```
network.add(tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
```

With the input shape being `28*28` because each MNIST image is 28 by 28 pixels, flattened into a vector of 784 values.

#### Hidden

The hidden layers are the layers in the middle of our model (all the layers that are not the input or output). These layers help to add complexity to our model to extract deeper insights.

Adding more hidden layers does not always help: a deeper model is harder to train and more prone to overfitting, so it can end up less accurate on new data.

#### Output

The output layer is the layer that gives us the resulting information we want. Often we want to select a specific activation function based on the type of problem we are presented with. With the MNIST example again, here is an example output layer:

```
network.add(tf.keras.layers.Dense(10, activation='softmax'))
```

In the above, the `softmax` activation function is used. It turns the 10 outputs into probabilities that sum to 1, one for each digit class.
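For reference, a small numpy sketch of what `softmax` computes. The three scores are made up, and subtracting the maximum is a common numerical-stability trick rather than part of the mathematical definition.

```python
import numpy as np

def softmax(z):
    """Map a vector of raw scores to probabilities that sum to 1."""
    shifted = z - np.max(z)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([1.0, 2.0, 3.0])  # made-up scores for three classes
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score always gets the largest probability, which is why the predicted class is usually read off with `np.argmax`.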

### ConvNets

A convnet is a subclass of neural networks often used for images because of how it can learn to extract higher level features from an image.

### Structure of a convnet

Convnets usually consist of a series of "modules" in which `conv2d`, `ReLU`, and `maxpool` operations are performed. These are performed for feature extraction purposes.

After the "modules" there is a classification layer. This is a series of dense layers that turns the information into some sort of classification (this will of course depend on the problem at hand).
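To make one such "module" concrete, here is a plain numpy sketch of the three operations on a single-channel image. This is a simplified illustration, not how Keras implements them: the function names, the 6x6 image, and the edge-detecting kernel are all made up, and real convnets learn their kernels during training.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution as used in convnets (no kernel flipping)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2  # drop an odd edge
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# A made-up 6x6 image with a vertical edge, and a vertical-edge kernel.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

features = maxpool2x2(relu(conv2d(image, kernel)))
print(features.shape)  # (2, 2)
```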


## Section 4 Compiling a Model

### Loss Function

The loss function is the quantity that will be minimized during training. In the example below, `categorical_crossentropy` is the loss function used; it is commonly used in multi-class classification problems, where each example belongs to exactly one of several classes. If one wanted to do binary classification, for example, a common choice is `binary_crossentropy`.

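MSE (Mean Squared Error) is also a popular loss function.

For a single example it is based on the squared error, defined as follows:

$$
L_{SE} = \frac{1}{2} (a-y)^2
$$

Here $ a $ is the model's prediction and $ y $ is the true label. MSE averages the squared differences $ (a-y)^2 $ over all examples; the factor of $ \frac{1}{2} $ is a convenience that simplifies the derivative.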
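A short numpy sketch of both quantities, with made-up numbers. The function names are my own; note that the batch mean below omits the $ \frac{1}{2} $ factor, as the usual definition of MSE does.

```python
import numpy as np

def squared_error(pred, y):
    """Squared error for a single prediction, with the 1/2 factor."""
    return 0.5 * (pred - y) ** 2

def mean_squared_error(preds, ys):
    """Average of the plain squared differences over all examples."""
    preds = np.asarray(preds, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return np.mean((preds - ys) ** 2)

print(squared_error(3.0, 1.0))                                # 0.5 * 2^2 = 2.0
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))   # (0 + 0 + 4) / 3
```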

### Optimizers

The optimizer determines how the model will be updated based on the loss function. For example, in the implementation section below you can see that the optimizer used is `rmsprop`. 

There are many different optimizers you can use in keras as listed [here](https://keras.io/optimizers/).

### Learning Rate

The learning rate, sometimes denoted $\alpha$, is a *hyperparameter* (external to the model). It controls how large each update step is and therefore how quickly the model learns. If it is too small, the model will take a very long time to fit; if it is too large, the updates can jump past the minimum and produce inaccurate results or fail to converge.
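The effect of the learning rate can be seen on a toy problem. The sketch below minimizes $ L(w) = w^2 $ (minimum at $ w = 0 $) with three made-up learning rates; the function name `descend` and the 20 steps are assumptions for illustration.

```python
def descend(alpha, steps=20, w=1.0):
    """Run gradient descent on L(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

for alpha in (0.001, 0.1, 1.1):
    print(alpha, descend(alpha))
```

With the smallest rate `w` has barely moved from its starting value after 20 steps, with the middle rate it is close to 0, and with the largest rate each step overshoots so `w` grows in magnitude instead of shrinking.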

### Implementation

If we build a model called `network` using `tf.keras.models.Sequential()`, we can compile it like so using the `compile` function:

```
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```

In [0]:
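# Implementing binary cross entropy in numpy

def binary_cross_entropy(y, a):
  # y is the true label (0 or 1); a is the predicted probability, strictly between 0 and 1.
  # Cross entropy is conventionally defined with the natural log (np.log), not log base 10.
  return -y * np.log(a) - (1-y) * np.log(1-a)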

Here is an example of a function that computes loss using the `binary_cross_entropy` and `sigmoid` functions I defined earlier.

In [0]:
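def compute_loss(A, B, manual_weights, manual_bias):
  # A holds the data points (two features each) and B the matching 0/1 labels.
  # The weights and bias are passed in rather than read from globals.
  loss = 0
  for data_val, label in zip(A, B):
    # Linear score for this data point, squashed to (0, 1) by the sigmoid.
    pred = np.dot(np.reshape(manual_weights, (2,)), data_val) + manual_bias
    loss += binary_cross_entropy(label, sigmoid(pred))
  # Average over the examples (assumption: the original `partial` was this count).
  return loss / len(A)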

## Section 5 Training a Model

### Training

At a high level training means iteratively adjusting the weights and the bias in an attempt to minimize the loss and thus increase accuracy of the model.

### Loss

At a high level, loss is simply the penalty for making an incorrect prediction. A perfect prediction has a loss of 0; anything else has a loss higher than 0. I gave an example of computing loss above with `compute_loss`, `binary_cross_entropy`, and `sigmoid`.

Combining that definition of training and loss, we get a process called **empirical risk minimization** (source: [here](https://github.com/schneider128k/machine_learning_course/blob/master/slides/2_c_slides.pdf)).

### Implementation

To train a model using Keras/Tensorflow you use the `.fit` function.

From the provided [MNIST example notebook](https://colab.research.google.com/drive/144nj1SRtSjpIcKZgH6-GPdA9bWkg68nh):

We built and compiled a model `network`, and then on `network` we call `.fit`:

```
epochs = 10
history = network.fit(train_images, 
                      train_labels, 
                      epochs=epochs, 
                      batch_size=128, 
                      validation_data=(test_images, test_labels))
```

The `.fit` call will train our model with the provided training set, number of epochs, desired batch size, and a validation set.
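The object returned by `.fit` has a `history` attribute, a dictionary that maps each metric name to a list with one value per epoch. Below is a rough sketch of reading it. The helper name `best_epoch` is hypothetical, the numbers are invented stand-ins rather than real training output, and the key names assume a TensorFlow 2 model compiled with a validation set as above.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return the 1-based epoch with the lowest value for `key`."""
    values = history_dict[key]
    return min(range(len(values)), key=values.__getitem__) + 1

# Stand-in for `history.history`; these numbers are invented for illustration.
fake_history = {
    "loss":     [0.60, 0.35, 0.22, 0.15, 0.10],
    "val_loss": [0.55, 0.40, 0.33, 0.36, 0.41],
}

print(best_epoch(fake_history))  # 3
```

A validation loss that starts rising while the training loss keeps falling, as in the invented numbers here, is a common sign of overfitting.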


### Potential Problems


#### Overfitting

Overfitting is the problem that arises when your model is too complex for the task at hand: it fits the noise and quirks of the training data rather than the underlying pattern.

Often what happens is that the model becomes very accurate on the training data, but when it is run on any other data set (like a validation or test set) it is considerably less accurate.

There are many ways to reduce overfitting. One is to use a model better suited to the task: if your model is too complicated, a simpler model may work better. You may also want to modify your training data, possibly removing outliers and getting rid of meaningless features.

#### Underfitting

Underfitting is the opposite problem. Where an overfit model is too complex and becomes too accurate on the training data, an underfit model is not complex enough to capture the pattern in the data. As a result, accuracy will be low on both the training set and the validation set.

A simple way to fix this is to increase the complexity of your model, or possibly use a complex pre-made model.
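Both problems can be illustrated without a neural network by fitting polynomials of different degrees to noisy data. This is a sketch under my own assumptions: the sine-shaped data, the noise level, the random seed, and the degrees 1, 3, and 10 are arbitrary choices, and model complexity here simply means polynomial degree.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)  # curve plus noise
    return x, y

x_train, y_train = make_data(15)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

train_err, val_err = {}, {}
for degree in (1, 3, 10):
    # Least-squares polynomial fit on the training data only.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    val_err[degree] = mse(coeffs, x_val, y_val)
    print(degree, train_err[degree], val_err[degree])
```

The training error can only go down as the degree grows. The thing to watch is the gap between training and validation error: a straight line tends to do poorly on both (underfitting), while a high-degree polynomial tends to do well on the training points and worse on the validation points (overfitting).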


Source/idea for the code [here](https://colab.research.google.com/drive/144nj1SRtSjpIcKZgH6-GPdA9bWkg68nh#scrollTo=mHp0sz7cYPsK&line=1&uniqifier=1).

In [0]:
# Putting building, compiling, and training together to do MNIST learning.

%tensorflow_version 2.x
import tensorflow as tf

# Set up data

mnist = tf.keras.datasets.mnist

train_data, test_data = mnist.load_data()

train_images_original, train_labels_original = train_data
test_images_original, test_labels_original = test_data

train_images = train_images_original.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images_original.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

# Label data

train_labels = tf.keras.utils.to_categorical(train_labels_original)
test_labels = tf.keras.utils.to_categorical(test_labels_original)



In [0]:
# BUILD

# define sequential model
network = tf.keras.models.Sequential()
# add two layers to model
network.add(tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(tf.keras.layers.Dense(10, activation='softmax'))

In [0]:
# COMPILE

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

In [0]:
# TRAIN

epochs = 5
history = network.fit(train_images, 
                      train_labels, 
                      epochs=epochs, 
                      batch_size=128, 
                      validation_data=(test_images, test_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Section 6 Finetuning a Model

Fine-tuning is the process of taking a pretrained model (such as [these](https://keras.io/applications/)) and adjusting it to increase its accuracy and usability on your specific data set or use case.

If you look at the Keras applications there are lots of pretrained models, such as Xception, VGG16, and VGG19. We can fine-tune these models to work on our specific data set.

*Using the provided notes from class ([here](https://colab.research.google.com/drive/1uVLIUWdT7--b59vM7NaSHkB-qFcu30jU#scrollTo=dI5rmt4UBwXs)) and my homework 4 I will give some specific examples.*

There are two popular ways to fine tune a model.

------------

### First is to add additional layers on the model.

Looking at my [homework 4 problem 3](https://colab.research.google.com/drive/1O1hvDELqlPM3Ybg7LYfHcS2M73zI-FPg):


```
from keras import layers, models
from keras.applications import Xception

conv_base = Xception(
    weights='imagenet', 
    include_top=False, 
    input_shape=(150, 150, 3))

model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

You can see I used a pretrained model from keras, in this situation it was `Xception`.

When I create my sequential model, I add my `Xception` base, then a `Flatten` layer, a `Dense` layer with `relu` activation, and a final `Dense` layer with `sigmoid` activation.

While this does not directly change the `conv_base` pretrained model it does help to improve our performance as we can add layers specific to our use case - whether we're doing binary classification or a different classification problem.

-------------

### Second is to fine tune the premade model itself.

We can actually change the weights of the top layers of the pretrained model by "unfreezing" every layer from a chosen layer onward.

From my homework 4 again:

```
conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
  if layer.name == 'conv2d_3':
    set_trainable = True
  if set_trainable:
    layer.trainable = True
  else:
    layer.trainable = False
```

You can see that I set `conv2d_3` and all layers after it to trainable. That means when I train the model, those layers will actually adjust their weights, while the earlier layers stay frozen.

From the provided in-class notes that I referenced earlier, there are two important notes about this:
  1. "Fine-tuning should only be attempted after you have trained the top-level classifier with the pretrained model set to non-trainable."
  2. "We fine-tune only the top layers of the pre-trained model rather than all layers of the pretrained model."

Keeping those concepts in mind, fine-tuning a pretrained model should improve performance, since that is the goal. If performance degrades instead, it is possible that too many layers were unfrozen, so that training disturbed the general features the pretrained model had already learned.
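The freeze/unfreeze loop can be wrapped in a small helper. This is a sketch, not part of Keras: `unfreeze_from` is a hypothetical name, and it relies only on each layer having `name` and `trainable` attributes (which Keras layers do), so it is demonstrated here on stand-in objects with made-up names.

```python
from types import SimpleNamespace

def unfreeze_from(layers, first_trainable_name):
    """Freeze every layer before `first_trainable_name`, unfreeze the rest.

    Returns the names of the layers left trainable.
    """
    trainable = False
    for layer in layers:
        if layer.name == first_trainable_name:
            trainable = True
        layer.trainable = trainable
    return [layer.name for layer in layers if layer.trainable]

# Stand-in layers with made-up names, so the sketch runs without Keras.
fake_layers = [SimpleNamespace(name=n, trainable=True)
               for n in ("block1", "block2", "block3", "block4")]

print(unfreeze_from(fake_layers, "block3"))  # ['block3', 'block4']
```

With a real pretrained base the call would look like `unfreeze_from(conv_base.layers, 'conv2d_3')`. If the name does not match any layer, every layer ends up frozen, so it is worth checking the returned list.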



In [0]:
# Building a model from a pre-existing model and then adding extra layers to fine tune it.

from keras import layers
from keras import models
from keras import optimizers


from keras.applications import Xception

conv_base = Xception(
    weights='imagenet', 
    include_top=False, 
    input_shape=(150, 150, 3))

model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
# Extra layers to fine tune.
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
xception (Model)             (None, 5, 5, 2048)        20861480  
_________________________________________________________________
flatten_3 (Flatten)          (None, 51200)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               13107456  
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 257       
Total params: 33,969,193
Trainable params: 33,914,665
Non-trainable params: 54,528
_________________________________________________________________
