<a href="https://colab.research.google.com/github/diego6289/CAP4630/blob/master/HW_5/HW_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##General Concepts


**What is Artificial Intelligence?**

Artificial intelligence refers to the simulation of human intelligence in machines that are programmed 
to think like humans and mimic their actions. The term may also be applied to any machine that exhibits 
traits associated with a human mind such as learning and problem-solving. Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the simplest to the most complex.


**What is Machine Learning?**

Machine learning is an application of artificial intelligence that provides systems the ability to 
automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on 
the development of computer programs that can access data and use it to learn for themselves.

**What is Deep Learning?**

Deep learning is a subset of machine learning that uses neural networks with many layers, loosely inspired by 
the workings of the human brain, to process data and find patterns for use in decision making. These networks 
can learn from labeled data, and they are also capable of learning unsupervised from data that is unstructured 
or unlabeled. 

##Basic Concepts

**Linear Regression:**

Linear regression attempts to model the relationship between a target variable and one or more predictors by fitting a linear equation to observed data. The core idea is to obtain the line that best fits the data, that is, the line for which the total prediction error across all data points is as small as possible.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable, Y is the dependent variable, b is the slope of the line, and a is the intercept (the value of Y when X = 0).
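
As a minimal sketch of fitting Y = a + bX, the snippet below uses NumPy's `np.polyfit` to find the least-squares line. The data values are made up for illustration and are not taken from any homework.

```python
import numpy as np

# Made-up data that roughly follows Y = 1 + 2X (illustrative only).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
b, a = np.polyfit(X, Y, 1)

# Evaluate the fitted line and measure how far it is from the observations.
predictions = a + b * X
mean_squared_error = np.mean((Y - predictions) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, MSE = {mean_squared_error:.4f}")
```

The fitted `a` and `b` are the intercept and slope that make the mean squared error as small as possible for this data.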


**Logistic Regression:**

Logistic Regression is a probability model similar to linear regression, except that the output of the linear function is passed through a squashing function such as the sigmoid, which maps it to a value between 0 and 1 that can be read as a probability. It is typically trained with a cross-entropy cost function rather than squared error.
Logistic Regression is used when the dependent variable (target) is categorical. For example, we can determine whether an email is spam (0) or not (1).
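
A minimal sketch of how a logistic regression model could turn input features into a class prediction. The weights and bias here are hand-picked, hypothetical values rather than learned ones, and the 0.5 threshold is a common default, not a requirement.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of class 1 for one example: sigmoid of a linear combination."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical, hand-picked parameters (not learned from data).
weights = [2.0, -1.0]
bias = 0.5

p = predict_probability([1.0, 0.5], weights, bias)
label = 1 if p >= 0.5 else 0
print(f"P(class 1) = {p:.3f}, predicted label = {label}")
```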

**Gradient:**

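The gradient generalizes the derivative to functions of several variables. It's a vector of partial derivatives that points in the direction of greatest increase of a function. By taking steps along the gradient (to maximize) or against it (to minimize, as in gradient descent), we hope to reach an optimal solution.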
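
To make the idea concrete, here is a small sketch that approximates a gradient numerically with central differences. The function `f` is an arbitrary example chosen for illustration.

```python
def f(point):
    """Example function f(x, y) = x^2 + 3y^2; its analytic gradient is (2x, 6y)."""
    x, y = point
    return x ** 2 + 3 * y ** 2

def numerical_gradient(func, point, eps=1e-6):
    """Approximate the gradient with central differences, one coordinate at a time."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += eps
        backward[i] -= eps
        grad.append((func(forward) - func(backward)) / (2 * eps))
    return grad

grad = numerical_gradient(f, [1.0, 2.0])
print(grad)  # close to the analytic gradient (2, 12) at the point (1, 2)
```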



**Gradient Descent:**

Gradient descent is an optimization algorithm used for finding the weights or coefficients of machine learning algorithms. It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error. The goal of the algorithm is to find model parameters that minimize the error of the model on the training dataset. The following pseudocode summarizes the gradient descent algorithm:

In [0]:
model = initialization(...)
n_epochs = ...
train_data = ...
for i in range(n_epochs):
	train_data = shuffle(train_data)
	X, y = split(train_data)
	predictions = predict(X, model)
	error = calculate_error(y, predictions)
	model = update_model(model, error)
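
The pseudocode above can be turned into a runnable sketch for one concrete case. The version below assumes a simple linear model `a + b * X`, a mean squared error, synthetic data, and a hand-chosen learning rate; none of these choices come from the original homework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data following y = 1 + 2x plus a little noise (illustrative).
X = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 + 2.0 * X + rng.normal(0.0, 0.05, size=100)

a, b = 0.0, 0.0          # model parameters: intercept and slope
learning_rate = 0.1
n_epochs = 500

for _ in range(n_epochs):
    predictions = a + b * X
    error = predictions - y
    # Gradients of the mean squared error with respect to a and b.
    grad_a = 2.0 * error.mean()
    grad_b = 2.0 * (error * X).mean()
    # Step against the gradient to reduce the error.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(f"a = {a:.2f}, b = {b:.2f}")
```

This sketch uses the full dataset in every epoch, so the shuffle step of the pseudocode is not needed here; it matters when the data is processed in mini-batches.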

##Building a Model

###Convolutional Neural Network

A convolutional neural network (CNN) is a deep learning model that takes in an input image, assigns importance (learnable weights and biases) to various aspects/objects in the image, and is able to differentiate one from the other. There are three basic layers that compose a CNN.

**Convolutional Layer**:

Convolution is the first layer to extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter (or kernel).
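
A minimal NumPy sketch of what a single convolution does, assuming stride 1 and no padding. The image and the filter values are made up for illustration; in a real CNN the filter values are learned during training.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products.

    Like most deep learning libraries, this does not flip the kernel.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

# A 4x4 image with a vertical edge between the dark left and bright right half.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hand-written vertical-edge filter (illustrative values).
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

The resulting feature map is non-zero only in the column where the filter straddles the edge, which is how a filter "detects" a feature.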

**Activation Function:**
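An activation function is applied to the weighted sum computed by a neuron to determine that neuron's output, for example by squashing it into the range between 0 and 1. One activation function we used was the Sigmoid. *Example below.*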

**Strides and Padding**:

Stride is the number of pixels by which the filter shifts over the input matrix at each step. Padding is the number of pixels added around the border of an image when it is being processed by the kernel of a CNN. Together, stride and padding determine the size of the output, and padding can be used to limit how much the output shrinks.
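
The effect of stride and padding on the output size can be sketched with the standard formula floor((n + 2p - k) / s) + 1 along each dimension. The 28-pixel input and 3-pixel kernel below are arbitrary example values.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, 3))                        # no padding: 26
print(conv_output_size(28, 3, padding=1))             # padding of 1 keeps the size: 28
print(conv_output_size(28, 3, stride=2, padding=1))   # stride 2 halves the size: 14
```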

**Pooling Layer**:

It is responsible for reducing the spatial size of the convolved feature. Pooling combines the outputs of a cluster of neurons at one layer into a single neuron in the next layer.
One method of doing this is Max Pooling, where we use the maximum value of each cluster of neurons at the prior layer. It helps reduce overfitting and aids efficiency.
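
A small NumPy sketch of max pooling, assuming a 2x2 window with stride 2 and an input whose height and width are even. The feature map values are made up.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with stride 2 to a 2-D array with even height and width."""
    h, w = feature_map.shape
    # Group the array into non-overlapping 2x2 blocks, then take each block's maximum.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each 2x2 block of the input is replaced by its largest value, so a 4x4 feature map becomes 2x2.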

**Fully Connected Layer:**

A Fully Connected Layer is simply a feed-forward neural network layer in which every input is connected to every output. The input to the fully connected layer is the output from the final Pooling or Convolutional Layer, which is flattened and then fed into the fully connected layer.



In [0]:
import numpy as np

def sigmoid_function(z):             # Example of Activation Function
  return 1 / (1 + np.exp(-z))

##Compiling a Model

**Optimizers:**

Optimizers tie together the loss function and model parameters by updating the model in response to the output of the loss function. The loss function is the guide to the terrain, telling the optimizer when it’s moving in the right or wrong direction.

**Learning Rate:**

The learning rate controls how much we change our weights at each update step. A good value changes them at the right pace, not making any changes that are too big or too small.
Changing our weights too fast by adding or subtracting too much can hinder our ability to minimize the loss function. We don’t want to make a jump so large that we skip over the optimal value for a given weight. Similarly, we don’t want to take steps that are too small, because then training may take far too long to reach good values for our weights.
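
The effect of the learning rate can be seen on a toy problem. The sketch below runs gradient descent on f(w) = w**2 with three hand-picked learning rates; the function, starting point, and rates are illustrative assumptions.

```python
def minimize_quadratic(learning_rate, steps=20, start=5.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w and minimum is at w = 0."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {minimize_quadratic(lr):.4f}")
```

With the smallest rate `w` barely moves from its starting value, with the middle rate it gets close to 0, and with the largest rate each step overshoots the minimum by more than the last, so `w` moves away from 0.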

**Loss Function:**

The loss function is a method of evaluating how well the algorithm models the dataset. If the predictions are totally off, the loss function will output a higher number. If they’re good, it’ll output a lower number. The loss function will let us know how the model is doing. This is an example of the loss function from HW 3.

In [0]:
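def loss_computation(A, B):
  # m, weights_manually_calculated, bias_term and entropy_func are not defined here;
  # they are expected to come from earlier cells of HW 3.
  partial = m * .80   # 80% of m, used as the number of examples to average over
  loss_val = 0
  for data_val, label in zip(A, B):
    # linear combination of the two input features, plus the bias
    predictor = np.dot(np.reshape(weights_manually_calculated, (2, )), data_val) + bias_term
    # per-example loss between the label and the sigmoid of the linear output
    bce = entropy_func(label, sigmoid_function(predictor))
    loss_val += bce
  loss_val /= (partial)   # turn the sum into an average
  return loss_val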
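
Because the HW 3 function relies on variables defined elsewhere, here is a self-contained sketch of the same idea: a mean binary cross-entropy over a handful of predictions. The helper name, the clipping constant `eps`, and the example values are assumptions made for illustration.

```python
import math

def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 1, 0]
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
bad = binary_cross_entropy(labels, [0.2, 0.9, 0.3, 0.7])
print(f"good predictions: {good:.3f}, bad predictions: {bad:.3f}")
```

Predictions that agree with the labels produce a lower loss than predictions that contradict them.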

##Training a Model

We use the .fit() method to train our model. This is an example used in the previous homework.

In [0]:
# importing tensorflow package 
%tensorflow_version 2.x
import tensorflow as tf

def keras_model():
  glm_model = tf.keras.models.Sequential()

  # adding layers and activation function
  glm_model.add(tf.keras.layers.Dense(1, activation='sigmoid', input_shape = (2, )))

  glm_model.compile(optimizer = tf.keras.optimizers.RMSprop(learning_rate = 0.001), loss = 'binary_crossentropy', metrics = ['accuracy'])
  return glm_model

# setting the number of epochs
epochs_count = 300

my_keras_model = keras_model()

history_of_model = my_keras_model.fit(training_set, training_set_label, epochs = epochs_count, batch_size = 512, validation_data = (testing_set, testing_set_label))

**Overfitting:**

Overfitting happens when a model fits its training data too closely, for example because the model is too complex for the amount of data or is trained for too long. The model then starts learning from the noise and inaccurate data entries in our data set, so it does not categorize new data correctly. Cross-validation helps detect overfitting, and techniques such as gathering more data, regularization, or stopping training early help avoid it.

**Underfitting:**

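Underfitting occurs when a model cannot capture the underlying trend of the data, usually because it is too simple or has not been trained enough. Underfitting destroys the accuracy of the model on both training data and new data. It can be avoided by using a more expressive model, adding informative features, or training for longer.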
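
One way to see both effects is to fit polynomials of different degrees to a small noisy dataset and compare the error on the training points with the error on held-out points. The sketch below uses synthetic data drawn from a sine curve; the degrees, noise level, and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(2 * np.pi * x)

# Small noisy training set and a clean held-out set (synthetic, for illustration).
x_train = np.linspace(0.0, 1.0, 15)
y_train = true_function(x_train) + rng.normal(0.0, 0.1, size=x_train.size)
x_test = np.linspace(0.0, 1.0, 100)
y_test = true_function(x_test)

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree.
    coefficients = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = np.mean((np.polyval(coefficients, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefficients, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, test MSE = {test_mse[degree]:.4f}")
```

The straight line (degree 1) is too simple for a sine curve, so both errors stay high, which is underfitting. A higher degree can only lower the training error, but a lower training error does not guarantee a lower held-out error; a gap between the two is the usual sign of overfitting.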

##Finetuning a Pretrained Model

**Finetuning a pretrained model:**

Fine-tuning a network is basically optimising the parameters and hyperparameters of an already trained network to adapt it to the new task. We only do it if our current dataset is not drastically different from the dataset on which the network was trained. Another concern to watch out for is that if our dataset is small, fine-tuning can lead to overfitting, especially in the case of VGG, since its last few layers are fully connected layers. If neither concern applies, we can proceed to fine-tune the network.

First, we truncate the last layer (softmax in classification problems) of the pre-trained network and replace it with a softmax layer applicable to our data. For instance, if the pretrained model had a softmax with 100 categories but we are working with 5 categories, the new softmax layer has 5 outputs. In addition, we can also use different classifiers which might be suitable to our application and evaluate which one fits best.

Then, we freeze the weights of the initial layers of the pre-trained model and train only the last few layers because the first few layers capture the general/universal features (like edges and curves in a dog/cat problem) of our classification problem.

We use a smaller learning rate for training because we don't want to completely undo the pre-training of the weights; we just want to tune them without distorting them too fast or too much.
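
The steps above can be sketched in Keras as follows. To keep the example self-contained, `pretrained` is a tiny untrained stand-in for a real pretrained network; the layer sizes, input shape, and learning rate are all assumptions chosen for illustration.

```python
import tensorflow as tf

# Stand-in for a pretrained network that ends in a 100-class softmax.
# In practice this would be a model loaded with trained weights.
pretrained = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(100, activation="softmax"),
])

# Step 1: drop the old 100-class softmax layer and keep everything before it.
kept_layers = pretrained.layers[:-1]

# Step 2: freeze the early layers; leave the last kept layer trainable.
for layer in kept_layers[:-1]:
    layer.trainable = False

# Rebuild the network with a new 5-class softmax on top of the kept layers.
inputs = tf.keras.Input(shape=(32, 32, 3))
x = inputs
for layer in kept_layers:
    x = layer(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# Step 3: compile with a small learning rate so the reused weights are only nudged.
# sparse_categorical_crossentropy assumes integer class labels (0 to 4).
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

Calling `model.fit(...)` on the new dataset would then update only the unfrozen layers.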