
# Homework 5
### CAP 4630 Artificial Intelligence
#### Clyde Bujari

## 1. General Concepts

**Artificial Intelligence** has been defined in many different ways over the years. One of the most widely cited definitions is John McCarthy's: "the science and engineering of making intelligent machines".

In other words, the study of AI is the study of designing computer systems that can perform tasks such as visual perception, speech recognition, and decision making: tasks that human intelligence is naturally good at, but that are very difficult to express as explicit, hand-written rules in traditional computer programming.

**Machine Learning** is a subset of AI. Arthur Samuel defined it as a "field of study that gives computers the ability to learn without being explicitly programmed". More specifically, the goal of ML is to create programs that adjust their behavior in response to the data they are given, much as humans learn from experience. Because these programs adapt to data instead of following fixed, hand-written rules, they are more flexible and offer much more feasible solutions to the problems AI seeks to solve.

**Deep Learning** is a subset of ML that specifically uses models built from multiple successive layers (hence the term "deep"). Each layer handles a different part of the process: in image processing, for example, a lower layer may handle edge identification, while a higher layer identifies which combinations of these edges correspond to a face or a digit.

## 2. Basic Concepts

**Linear regression** is a statistical tool used to model the relationship between a dependent variable and one or more independent variables.
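
As a minimal sketch of the single-variable case, the snippet below fits a line by ordinary least squares using only the standard library. The helper name `fit_line` and the data points are made up for illustration; in practice a library such as NumPy or scikit-learn would normally be used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Here `x` plays the role of the independent variable and `y` the dependent variable; the fit recovers the slope and intercept of the line that generated the data.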

**Logistic regression** is a statistical tool used to model the probability that an input belongs to a certain class or that a certain event occurs - pass/fail, win/lose, etc. It can also be extended beyond two classes, such as to determine what animal an image contains (assigning each possible class a probability between 0 and 1).
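
The sketch below illustrates how such probabilities can be produced: the sigmoid function squashes a linear score into a probability for the two-class case, and softmax does the same across several classes. The weight, bias, and scores are hypothetical values chosen only for this example, not fitted parameters.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max improves numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two classes (pass/fail): hypothetical weight and bias for "hours studied"
w, b = 1.5, -6.0
hours = 5.0
p_pass = sigmoid(w * hours + b)

# More than two classes: hypothetical scores for cat, dog, bird
probs = softmax([2.0, 1.0, 0.1])

print(round(p_pass, 3), [round(p, 3) for p in probs])
```

The class with the highest score always receives the highest probability, and the multi-class probabilities sum to 1.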

The **gradient** of a multivariate function f is a vector field whose value at a given point is the vector of partial derivatives of f at that point. It is the multivariate equivalent of the derivative: the gradient vector's direction is the direction of fastest increase of f, and its magnitude is the rate of increase in that direction.
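
Written out, the definition above is the following (the function used in the worked example is an arbitrary choice for illustration):

$$\nabla f(x_1, \dots, x_n) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$$

For example, for $f(x, y) = x^2 + 3xy$:

$$\nabla f(x, y) = (2x + 3y,\; 3x)$$

At the point $(1, 2)$ this gives $\nabla f(1, 2) = (8, 3)$, so f increases fastest in the direction $(8, 3)$, at a rate of $\sqrt{8^2 + 3^2} = \sqrt{73} \approx 8.54$.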

**Gradient descent** (also known as *steepest descent*) is an iterative algorithm for finding a local minimum of a differentiable function. In each iteration, it updates its current estimate of the minimum by moving against the gradient: it subtracts a step, where the step = the gradient at the current point $\times$ a scalar learning rate.
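
A minimal sketch of this update rule, assuming a simple bowl-shaped function whose minimum is known to be at (3, -1). The function, starting point, learning rate, and step count are all arbitrary choices for illustration.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = (x - 3)^2 + (y + 1)^2."""
    return 2 * (x - 3), 2 * (y + 1)


def gradient_descent(x, y, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from (x, y)."""
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        # step = learning rate * gradient; subtract it to move downhill
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


x_min, y_min = gradient_descent(0.0, 0.0)
print(x_min, y_min)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so the estimate ends up very close to (3, -1). A learning rate that is too large would instead overshoot and diverge.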

Gradient descent can be computationally expensive, since every iteration computes the gradient over the entire dataset. To mitigate this, we can break our dataset into much smaller batches and update on each batch, approaching the same accuracy over many iterations with much less computation per iteration. **Stochastic gradient descent (SGD)** is an extreme example of this: it uses a single sample from the dataset in each iteration, which makes each iteration far cheaper, but at the cost of noisy updates.

A middle ground between this and computing the full batch at once is **mini-batch stochastic gradient descent (mini-batch SGD)**, which typically uses batch sizes of 10 to 1000. This produces much less noise than SGD, while still being more efficient than full-batch gradient descent.
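
The following is a rough sketch of mini-batch SGD applied to a one-variable linear model with a mean squared error loss. The function name `minibatch_sgd`, the hyperparameters, and the noise-free data are assumptions made for this example. Setting `batch_size=1` would give plain SGD, and setting it to the dataset size would give full-batch gradient descent.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.01, epochs=2000, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error, averaged over this batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Made-up, noise-free data from y = 2x + 1
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)
print(round(w, 3), round(b, 3))
```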

## 3. Building a model

A typical Keras model generally consists of several common types of layers appropriate for different use cases:
* **Fully connected layers (`keras.layers.Dense`):**

These layers process simple vector data stored in 2D tensors of shape (samples, features). Also called *densely connected layers*, each one holds a set of weights and biases connecting every input to every output, and they form the foundation of most models.
* **Recurrent layers:**

These layers process sequence data stored in 3D tensors of shape (samples, timesteps, features). In other words, they use the *order in which data occurs* in addition to the data itself, which is very useful for tasks such as speech recognition.
* **Convolutional layers (`keras.layers.Conv2D`):**

These layers typically process image data, stored in 4D tensors of shape (samples, height, width, channels). They perform a convolution operation on this data, sliding small filters over each image to identify image features. The output of these layers must be flattened before being passed to different types of layers, such as the fully connected layers that almost always follow them.
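
To make the flattening step concrete, the sketch below computes the output shape of a 2D convolution without padding (the default `padding='valid'` behavior in Keras) and the length of the vector produced by flattening it. The helper name and the example sizes (28x28 input, 3x3 kernel, 32 filters) are assumptions chosen for illustration.

```python
def conv2d_output_shape(height, width, kernel_size, filters, stride=1):
    """Output (height, width, channels) of a 2D convolution with no padding."""
    out_h = (height - kernel_size) // stride + 1
    out_w = (width - kernel_size) // stride + 1
    # Each filter produces one output channel
    return out_h, out_w, filters


# A 28x28 image passed through 32 filters of size 3x3
h, w, c = conv2d_output_shape(28, 28, kernel_size=3, filters=32)

# Flattening turns the (h, w, c) feature maps into one vector per sample
flattened = h * w * c
print((h, w, c), flattened)
```

The flattened length is the number of input features the following fully connected layer receives for each sample.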

The most common topology of a Keras model is a sequential stack of these types of layers, which maps one input to one output.

In [0]:
# Example of a simple network architecture, with a convolutional layer
# and two fully connected layers
from keras import layers
from keras import models

# Creating the model
model = models.Sequential()
# Adding a convolutional layer and flattening its output.
# The filter count, kernel size, and input shape here are illustrative
# assumptions (28x28 grayscale images); adjust them to your data.
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.Flatten())
# Adding a fully connected layer with output of 256 and ReLU activation function
model.add(layers.Dense(256, activation='relu'))
# Adding a fully connected layer with output of 10 and sigmoid activation function
model.add(layers.Dense(10, activation='sigmoid'))

## 4. Compiling a model

After defining the network architecture of your model, you must choose a loss function and optimizer for use during training.

* The **loss function** is the function the model seeks to minimize during training, measuring the difference between values the model predicts and the expected target values for a given training dataset.
* The **optimizer** handles the actual updating of the model in order to minimize the loss function. It implements a chosen variant of SGD.
* We also choose what metrics (such as accuracy, loss) should be calculated so that we can monitor the performance of our model.

In [0]:
model.compile(
    loss='binary_crossentropy', 
    optimizer='rmsprop', 
    metrics=['acc'])

Different loss functions are appropriate for different types of problems. The following table gives some appropriate loss functions for common ML problems.

| Problem type              | Last layer activation  | Loss function              | 
|:-:                        |:-:                     |:-:                         |
| Binary classification     | sigmoid                | binary_crossentropy        |
| Multiclass, single-label  | softmax                | categorical_crossentropy   |
| Multiclass, multi-label   | sigmoid                | binary_crossentropy        |
| Regression to real values | none                   | mse                        |
| Regression to \[0,1\]     | sigmoid                | mse or binary_crossentropy |
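
To show what the loss functions in the table measure, here is a plain-Python sketch of the three formulas. These are simplified illustrations and not the Keras implementations; the `eps` clipping value and the sample targets and predictions are assumptions for this example.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error, used for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average cross-entropy over independent 0/1 targets (sigmoid outputs)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one sample with a one-hot target (softmax outputs)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


reg_loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
bin_loss = binary_crossentropy([1, 0], [0.9, 0.1])
cat_loss = categorical_crossentropy([0, 1, 0], [0.2, 0.7, 0.1])
print(reg_loss, bin_loss, cat_loss)
```

In every case the loss shrinks toward zero as the predictions approach the targets, which is what the optimizer exploits during training.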

## 5. Training a model

After compiling a model, we train it with the `model.fit()` or `model.fit_generator()` functions, passing in the training data, validation data, number of epochs, steps per epoch, and other parameters. (In recent versions of Keras, `model.fit()` accepts generators directly and `fit_generator()` is deprecated.) Both functions return a `History` object, which should be stored in a variable (such as `history`) so that we can examine the loss/accuracy curves once training is complete.
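
The recorded curves are available as a dictionary of per-epoch lists in `history.history`. The sketch below uses a hand-written dictionary as a hypothetical stand-in for that attribute, so the numbers are invented; the exact keys present depend on the metrics chosen when compiling.

```python
# Hypothetical stand-in for `history.history` after five epochs of training
history_dict = {
    "loss": [0.69, 0.52, 0.41, 0.33, 0.27],
    "val_loss": [0.68, 0.55, 0.49, 0.51, 0.56],
}


def best_epoch(history_dict, metric="val_loss"):
    """Index of the epoch where the given metric was lowest."""
    values = history_dict[metric]
    return min(range(len(values)), key=lambda i: values[i])


best = best_epoch(history_dict)
# A widening gap between validation and training loss suggests overfitting
gap = history_dict["val_loss"][-1] - history_dict["loss"][-1]
print(best, round(gap, 2))
```

In these invented numbers the validation loss bottoms out at epoch index 2 while the training loss keeps falling, which is the typical signature of a model starting to overfit.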

* **Overfitting:**
 * Overfitting is a common problem in training where a model makes accurate predictions only on the training dataset, and is unable to generalize to data outside of that set.
 * Methods to mitigate overfitting:
   * Early stopping - Detect when overfitting is occurring (by monitoring performance on a separate validation dataset) and stop training before it can worsen
   * Dropout - Randomly disable a fraction of the nodes within hidden layers during training, so the model cannot rely too heavily on any single node
   * Data augmentation - Generate more data (through slight modifications to the original training set), since overfitting happens less easily with large, varied datasets

* **Underfitting:**
 * Underfitting is essentially when a model is unable to fit even the training dataset well. This typically occurs when a model is not complex enough for the dataset it is analyzing.
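
The early stopping rule mentioned above can be sketched in a few lines of plain Python. This is a simplified illustration of the "patience" idea and not the Keras implementation; the function name and validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stalls after the third epoch
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.61, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

A larger `patience` tolerates longer plateaus before stopping, at the risk of training further into the overfitting regime.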

In [0]:
# Early stopping:
# Instantiate an EarlyStopping callback.
# Inputs: metric to monitor, mode (whether that metric should be 'min', 'max',
#         or 'auto'), verbosity, and patience (number of epochs with no
#         improvement before stopping)
# Since this is a callback, it must be added to a list with any others,
# then passed as an argument to model.fit()
from keras.callbacks import EarlyStopping

es = EarlyStopping(monitor='val_loss', mode='min', patience=20)
cb_list = [es]  # add any other callbacks to this list

In [0]:
# Dropout layer: during training, randomly sets a fraction of its inputs
# (here 30%) to zero to reduce overfitting.
model.add(layers.Dropout(0.3))

In [0]:
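# Data augmentation as seen in HW 4:
# each training image is randomly rotated, shifted, sheared, zoomed, and flipped
# by slight amounts to lessen the chance of overfitting
from keras.preprocessing.image import ImageDataGenerator
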
train_datagen = ImageDataGenerator(
    rescale=1./255, 
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)
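# Validation data should not be augmented (it is not used for training),
# so the validation generator only rescales pixel values.
validation_datagen = ImageDataGenerator(rescale=1./255)

# flow_from_directory reads images from a directory and yields batches of
# (augmented) images and labels that Keras can train on.
# train_dir and validation_dir are assumed to be dataset paths set up as in HW 4.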
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

validation_generator = validation_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

In [0]:
# Training the model using datasets/callback list created in prior steps:
history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=30,
    validation_data=validation_generator,
    validation_steps=50,
    callbacks=cb_list
)

# If we do not do any data augmentation and the data is already in arrays,
# we can instead use the simpler:
history = model.fit(
    training_data,
    training_labels,
    epochs=30,
    batch_size=128,
    validation_data=(test_data, test_labels)
)

## 6. Finetuning a pretrained model

We can use pretrained models, especially convolutional models trained on specific datasets and tailored to different kinds of work, by adding them to our network as we would any layers. Prior to completing the training steps from the previous section, we should freeze the pretrained model and only train what we add onto it.

In [0]:
from keras.applications import VGG16
from keras import layers
from keras import models
from keras import optimizers

# Example of a pretrained VGG16 CNN
conv_base = VGG16(
    weights='imagenet',
    include_top=False,
    input_shape=(150, 150, 3))

# Set the pretrained model as untrainable
conv_base.trainable = False

# Create the model's architecture, inserting the pretrained model as well as any
# other layers we choose to add.
model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# Continue to training (as seen in part 5)...

After training this model, we can unfreeze some of the layers towards the end of the pretrained model and train them together with the layers we added, which can further increase the accuracy of the model.

In [0]:
# (This method of unfreezing works for VGG16; some pretrained CNNs name their
#  layers differently and require adjustment)
conv_base.trainable = True

# Freeze every layer before 'block5_conv1'; unfreeze it and all layers after it
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    layer.trainable = set_trainable

We can then re-compile and re-train the model. We use a much smaller learning rate this time, since we are only fine-tuning and want to avoid large updates that would destroy what the pretrained layers have already learned.

In [0]:
model.compile(
    loss='binary_crossentropy',
    # Using a much smaller learning rate
    # (older Keras versions name this argument `lr`)
    optimizer=optimizers.RMSprop(learning_rate=1e-5),
    metrics=['acc'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=100,
    validation_data=validation_generator,
    validation_steps=50)

Sources:

1. General Concepts
 * Professor's [Introductory slides](https://github.com/schneider128k/machine_learning_course/blob/master/slides/1_a_slides.pdf)

2. Basic Concepts
 * [Wikipedia for Linear Regression](https://en.wikipedia.org/wiki/Linear_regression)
 * [Wikipedia for Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression)
 * [Wikipedia for Gradient Descent](https://en.wikipedia.org/wiki/Gradient_descent)
 * Professor's [slides on Gradient Descent](https://github.com/schneider128k/machine_learning_course/blob/master/slides/2_e_slides.pdf)

3. Building a Model
 * Professor's [Keras Basics](https://github.com/schneider128k/machine_learning_course/blob/master/keras_basics.md)
 * [Explanation of Recurrent Networks](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
 * [StackExchange answer with a very helpful animation of 2D convolution](https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cnn/188216#188216)

4. Compiling a Model
 * Professor's [Keras Basics](https://github.com/schneider128k/machine_learning_course/blob/master/keras_basics.md) (Table of loss functions is directly pulled from here)

5. Training a Model
 * [Implementation of Early Stopping in Keras](https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/)
 * Professor's [Pretrained Convnet Finetuning Example](https://colab.research.google.com/drive/1F-RWvoxH8MmT7c1UmNy41iuOp-ejiLoF#scrollTo=Fh6gZSeAjF7c)

6. Fine-tuning a Model
 * Professor's [Pretrained Convnet Finetuning Example](https://colab.research.google.com/drive/1F-RWvoxH8MmT7c1UmNy41iuOp-ejiLoF#scrollTo=Fh6gZSeAjF7c)