## Image Preprocessing

Libraries such as Scikit-Image, Pillow, and OpenCV can be used to make patterns in the images stand out more.

We will now initialize the parameters of the model, perform forward propagation, and calculate the current loss. We then perform backward propagation (essentially calculating the current gradient) and update the parameters with gradient descent.

MLPs take their input as vectors, not matrices or tensors. If the images were all different sizes, we would have a more significant problem on our hands, because each image would have to be reshaped into a vector of exactly the same size as our input layer.

X_train /= 255
X_test /= 255
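
As a minimal sketch of the two steps above (flattening each image into a vector and scaling pixel values to the 0 to 1 range), the following uses a small random array as a hypothetical stand-in for the real images. The array name `images` and its 8x8 size are assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-in for a batch of 4 grayscale images, 8x8 pixels each
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 8, 8), dtype=np.uint8)

# Flatten each image into one row vector: (n_samples, height * width)
flat = images.reshape(images.shape[0], -1)

# Cast to float before scaling; an in-place `/=` on an integer array
# raises a casting error in NumPy
scaled = flat.astype("float32") / 255

print(flat.shape)
print(scaled.min(), scaled.max())
```

Casting to a float type first is the reason the in-place form only works when the arrays are already floating point.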

## PCA for Compression

After dimensionality reduction, the training set takes up less space. Applying this to our images, we can preserve 95% of the variance while the dataset shrinks to less than 20% of its original size. This is a reasonable compression ratio, and the size reduction can speed up a classification algorithm (such as a Support Vector Machine) considerably. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.

In [None]:
# The following code compresses the xray images
# and then decompresses back to the original dimensions.
# There will be some image quality loss, but not so much as
# to be unusable.
from sklearn.decomposition import PCA

pca = PCA(n_components=154)
xreduced = pca.fit_transform(xtrain)
xrecovered = pca.inverse_transform(xreduced)
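
To make the 95% variance threshold and the reconstruction error concrete, here is a rough NumPy-only sketch of the same compress-then-decompress idea. It computes the principal components with an SVD rather than with scikit-learn, and the synthetic matrix `X` is a hypothetical stand-in for the unrowed images.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 50 features with low-rank structure plus noise
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Center the data, then take the SVD; the rows of Vt are the principal axes
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of the total variance explained by each component
explained = S ** 2 / np.sum(S ** 2)
# Smallest number of components whose cumulative variance reaches 95%
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)

# Compress to k dimensions, then decompress back to the original 50
X_reduced = Xc @ Vt[:k].T
X_recovered = X_reduced @ Vt[:k] + mean

# Reconstruction error: mean squared distance between original and recovered rows
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
print(k, X_reduced.shape, reconstruction_error)
```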

## Logistic Regression in a Convolutional Neural Network

The purpose of a neuron in a neural network is 2-fold:

1. Transforms the inputs using a linear transformation (generally wx + b).
2. Applies an activation function (generally sigmoid).
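
The two steps can be sketched for a single neuron as follows. The input, weight, and bias values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    z = np.dot(w, x) + b   # step 1: linear transformation wx + b
    return sigmoid(z)      # step 2: activation function

# Hypothetical inputs, weights, and bias
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05

a = neuron(x, w, b)
print(a)
```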

Logistic regression can be viewed as a 1-layer neural network whose output is squashed to a value between 0 and 1 and then read as class 0 or 1. The input layer is not counted as a layer: no transformations happen there; it is just where the inputs enter the network. In each node of a hidden layer, a linear transformation takes place, followed by a transformation from an activation function.

Activation functions determine the output of a node from a given set of inputs. Because most NNs are optimized using some form of gradient descent, activation functions need to be differentiable (or differentiable almost everywhere... see ReLU).

Since we have a binary classification task, the output layer should be a dense layer with a single neuron, and the activation set to 'sigmoid'.

model_1.add(Dense(1, activation='sigmoid'))
model_1.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
results_1 = model_1.fit(scaled_data, labels, epochs=25, batch_size=1, validation_split=0.2)

## The Hyperbolic Tangent (tanh) function

The hyperbolic tangent function ranges between -1 and 1 and is a scaled and shifted version of the sigmoid function. For intermediate layers, the tanh function generally performs pretty well because, with values between -1 and 1, the means of the activations coming out are closer to zero.

The disadvantage of both tanh and sigmoid activation functions is that when z gets very large or very small, the slope (derivative) of these functions becomes very small, on the order of 0.0001 or less. This will slow down gradient descent.
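
A quick numerical check illustrates this saturation. The sketch below evaluates the derivatives of sigmoid and tanh at a few hand-picked values of z and also confirms the relationship tanh(z) = 2 * sigmoid(2z) - 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative of the sigmoid

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # derivative of tanh

z = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(z))  # largest at z = 0, shrinking quickly as z grows
print(tanh_grad(z))

# tanh is a scaled and shifted sigmoid
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))
```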

## The Inverse Tangent (arctan) function

The inverse tangent has a lot of the same qualities that tanh has, but the range roughly goes from -1.6 to 1.6 and the slope is more gentle than the one we saw with tanh.

## Rectified Linear Unit function

This is probably the most popular activation function, along with tanh. The fact that the activation is exactly 0 when z < 0 is slightly cumbersome when taking derivatives, though.

## The Leaky Rectified Linear Unit function

The leaky ReLU solves the derivatives issue by allowing the activation to be slightly negative when z < 0.
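
A small sketch of both functions and their derivatives follows. The slope `alpha=0.01` for the negative side is a commonly used value and is an assumption here, as is the convention chosen for the derivative at exactly z = 0.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 0 for z <= 0 (by convention at z = 0) and 1 for z > 0
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # A small non-zero slope keeps gradients flowing when z < 0
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), relu_grad(z))
print(leaky_relu(z), leaky_relu_grad(z))
```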

## Forward Propagation

Obtaining the expression for the cost function and the loss function is called forward propagation. The cost function takes a convex form. The idea is that we start with some initial values of w and b, and then gradient descent takes a step in the steepest direction downhill. Taking the derivatives to calculate the difference between the desired and calculated outcome, and repeating these steps until you get the lowest possible cost value, is called backpropagation.

Neural networks work like this: We take the dot product of the input data with the first hidden neuron's set of weights and add the bias to obtain a result z. We feed z into an activation function f(z) to get the output of the hidden layer's first neuron. f() need not be a step function. We repeat this with the same input data and a different set of weights for the second hidden neuron, which gives the output of the second hidden neuron, and so on until the last neuron in the hidden layer. Once the hidden layer outputs are calculated, they are used as inputs to calculate the final output. For the final output, we take the dot product of the hidden layer outputs, which we now treat as inputs, and the weights that connect the hidden layer to the output. Since we only have 1 output and n hidden layer outputs, this set of weights consists of n weights. Then we add the bias to the dot product to obtain a result z, and feed z into an activation function f(z) to get the output of the output layer.

If the activation function is linear, then you can stack as many hidden layers in the neural network as you wish, and the final output is still a linear combination of the original input data. With a step function as the activation function, a small change in any weight in the input layer of our perceptron network could cause one neuron to suddenly flip from 0 to 1, which could in turn affect the hidden layer's behavior and then the final outcome. The sigmoid neuron, which uses the sigmoid function, offers another activation function. The sigmoid function produces results similar to the step function in that the output is between 0 and 1, but the curve is smooth and differentiable everywhere, with a nice and simple derivative. Sigmoid introduces non-linearity into our neural network model. The output layer of a neural node represents a combinatorial charge.
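
The claim about linear activations can be checked numerically. In this sketch, with randomly chosen weights of hypothetical sizes, two stacked layers with no non-linearity give exactly the same output as one equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4,))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2,))

# Two stacked layers whose activation is the identity (i.e. linear)
stacked = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into a single linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
single = W_eq @ x + b_eq

print(np.allclose(stacked, single))
```

This is why a non-linear activation such as sigmoid is needed for extra layers to add expressive power.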

Neural networks become better by repeatedly training themselves on data so they can adjust each layer of the network to get the actual output closer to the desired output. When we first train this neural network with all the training examples, we don't know what weights we should assign to each of the layers, so we just ask the computer to assign random weights in each layer.

The concept of randomly initializing weights is important because each time you train a deep learning neural network, you are initializing the weights to different numbers, and you have no clue what's going on in the network until after the network is trained. A trained neural network has weights which are optimized at certain values that make the best prediction or classification for our problem. It's a black box, and each time the network will have a different set of weights.

In multi layer neural networks, the first hidden layer will be able to learn some very simple patterns. Each additional hidden layer will somehow be able to learn progressively more complicated patterns.

In forward propagation, you get x and you compute y_hat; after, you calculate the cost function.

Here is how forward propagation is performed/calculated: Z1 is the output of the linear transformation of the initial input, A1 (the observations). In successive layers, A1 is the output from the previous hidden layer. In all of these cases, W1 is a matrix of weights to be optimized to minimize the cost function. B1 is also optimized but is a vector as opposed to a matrix.

G1 is the activation function, which takes the output of this linear transformation and yields the input to the next hidden layer.
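
As a rough illustration of these quantities, the sketch below runs forward propagation through one hidden layer and one output layer and then evaluates a cost. The layer sizes, the choice of tanh and sigmoid as activations, the binary cross-entropy cost, and the naming (X for the input, A1 for the first layer's output) are all assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """X has shape (n_features, n_samples). Returns y_hat and the cached values."""
    Z1 = params["W1"] @ X + params["b1"]    # linear transformation, layer 1
    A1 = np.tanh(Z1)                        # activation, layer 1
    Z2 = params["W2"] @ A1 + params["b2"]   # linear transformation, layer 2
    A2 = sigmoid(Z2)                        # activation, layer 2 (y_hat)
    return A2, {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

def cost(y_hat, y):
    """Binary cross-entropy averaged over the samples."""
    m = y.shape[1]
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m)

rng = np.random.default_rng(0)
n_x, n_h, m = 4, 3, 5   # hypothetical sizes: features, hidden units, samples
params = {
    "W1": rng.normal(size=(n_h, n_x)) * 0.01, "b1": np.zeros((n_h, 1)),
    "W2": rng.normal(size=(1, n_h)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.normal(size=(n_x, m))
y = np.array([[0, 1, 1, 0, 1]])

y_hat, cache = forward(X, params)
print(y_hat.shape, cost(y_hat, y))
```

With small random weights the predictions start close to 0.5, so the initial cost is close to ln(2).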

## Backward Propagation

Once an output for the NN given the current parameter weights has been calculated, we must back propagate to calculate the gradients of the cost function with respect to the layer parameters. This allows us to apply an optimization algorithm such as gradient descent to make small adjustments to the parameters in order to minimize our cost (and improve our predictions).

To summarize the process once more, we begin by defining a model architecture, which includes the number of hidden layers, the activation functions, and the number of units in each layer. Then we initialize parameters for each of these layers (typically randomly). After the initial parameters are set, forward propagation evaluates the model to give a prediction, which is then used to evaluate a cost function. Forward propagation involves evaluating each layer and then piping its output into the next layer.

Each layer consists of a linear transformation and an activation function. The parameters for the linear transformation in each layer include W1 and B1. The output of this linear transformation is represented by Z1. This is then fed through the activation function (again, for each layer), giving us an output A, which is the input for the next layer of the model.

After forward propagation is completed and the cost function is evaluated, back propagation is used to calculate gradients of the cost function with respect to the parameters. Finally, these gradients are used by an optimization algorithm, such as gradient descent, to make small adjustments to the parameters, and the entire process of forward propagation, back propagation, and parameter adjustment is repeated until the modeller is satisfied with the results.
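
The whole cycle can be sketched in NumPy for a network with one hidden layer. The toy dataset, the layer sizes, the learning rate, and the number of steps are hypothetical choices made only so that the example runs quickly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy data (hypothetical): 2 features, 200 samples, label 1 when x0 + x1 > 0
X = rng.normal(size=(2, 200))
y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)
m = X.shape[1]

# 1. Initialize parameters (small random weights, zero biases)
W1, b1 = rng.normal(size=(4, 2)) * 0.1, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.1, np.zeros((1, 1))
learning_rate = 0.5
costs = []

for step in range(500):
    # 2. Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # 3. Cost (binary cross-entropy); clipping avoids log(0)
    A2c = np.clip(A2, 1e-12, 1 - 1e-12)
    costs.append(-np.mean(y * np.log(A2c) + (1 - y) * np.log(1 - A2c)))

    # 4. Backward propagation: gradients of the cost w.r.t. each parameter
    dZ2 = A2 - y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.mean(axis=1, keepdims=True)

    # 5. Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

accuracy = float(((A2 > 0.5) == y).mean())
print(costs[0], costs[-1], accuracy)
```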

We use batches because if we were to push all of our samples through at once, we would have to wait until everything is processed and could only start backpropagating then. With batches, backward propagation can happen after each batch has completed a forward propagation step. In essence, this is 'mini-batch' gradient descent. A batch generally approximates the distribution of the input data better than a single sample. The larger the batch, the better the approximation. However, a larger batch also takes longer to process and still results in only one update. For inference (to evaluate or predict), it is recommended to pick a batch size that is as large as you can afford without running out of memory.

Epochs are an arbitrary cutoff, generally defined as one pass over the entire dataset, used to separate training into distinct phases, which is useful for logging and periodic evaluation. When using validation_data or validation_split with the .fit() method of Keras, evaluation will be run at the end of every epoch. Within Keras, there is the ability to add callbacks specifically designed to be run at the end of an epoch. Examples of these are learning rate changes and model checkpointing (saving).
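
The relationship between samples, batches, updates, and epochs can be sketched with a plain loop. The helper name `iterate_minibatches` and the tiny dataset are hypothetical, and the forward and backward passes are left as a comment.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs covering the data once (one epoch)."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)

n_epochs, batch_size = 2, 4
updates = 0
batch_sizes = []
for epoch in range(n_epochs):
    for X_batch, y_batch in iterate_minibatches(X, y, batch_size, rng):
        # A forward pass, a backward pass, and one parameter update go here
        updates += 1
        batch_sizes.append(len(X_batch))
    # End of epoch: this is where validation metrics would be evaluated

print(updates)       # 3 batches per epoch (4 + 4 + 2 samples), so 6 updates
print(batch_sizes)
```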

A sample is one element of the dataset, e.g. a single image.

history.history will return a dictionary of the metrics we indicated when compiling the model. By default, the loss will always be included as well. With accuracy as the only metric, this dictionary will have two keys, one for the loss and one for the accuracy (plus a validation counterpart of each when validation data is supplied). If you want to plot learning curves for the loss or accuracy versus the epochs, you can then simply retrieve these lists.

Making predictions from a trained model is straightforward. y_hat = model.predict(x).

We use the .evaluate() method to compute the loss and other specified metrics for our trained model. model.evaluate(xtrain, xtrain_labels).

In [None]:
# Once we have initialized a network object, we can then add layers,
# specifying the number of units in each layer as well as the
# activation function we want to use. We can use sigmoid, ReLU, etc.
# The Dense() class indicates that this layer will be fully connected.
# The input_shape parameter is optional. In successive layers, Keras
# infers the required shape of the layer to be added based on the shape
# of the previous layer.
# model.add(layers.Dense(units, activation, input_shape))

# Train the model
# history = model.fit(xtrain,
#                     ytrain, epochs=20,
#                     batch_size=512,
#                     validation_data=(xval, yval))

We can detect overfitting if the model's training performance steadily improves long after the validation performance plateaus. By adding another hidden layer, we can give the model the ability to capture more high-level abstraction in the data. However, increasing the depth of the model also increases the amount of data the model needs to converge to an answer, because a more complex model comes with the 'Curse of Dimensionality'.

Even with a good validation score, a model is not acceptable if the training and validation accuracies do not converge. We can remedy this by decreasing the size of the network, or by increasing the size of our training data.

Deep representations are really good at automating what used to be a tedious process of feature engineering. For example, deep layers of a neural network for a computer might look like this: 

1. First layer detects edges in pictures.
2. Second layer groups edges together and starts to detect different parts.
3. Further layers group even bigger parts together.

The general idea is shallow networks detect "simple" things and the deeper you go, the more complex things that can be detected.

## Keras

A scalar is a 0D tensor, a vector is a 1D tensor, a matrix is a 2D tensor, and then there are 3D (and higher) tensors. A tensor is defined by three key attributes: the rank (number of axes), the shape, and the data type.

Unrowing matrices is important for images: each image is flattened into a vector. We can also increase the rank of a tensor by reshaping it, e.g. a vector with np.shape() (790,) -> np.reshape(vector, (1, 790)).

We can slice tensors using the usual syntax of start index : end index.

The dot product is the sum of the element-wise products.
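
These tensor operations can be tried out directly in NumPy. The array sizes below, including the batch of 64x64 RGB images, are hypothetical.

```python
import numpy as np

scalar = np.array(5.0)                              # 0D tensor
vector = np.arange(790)                             # 1D tensor, shape (790,)
matrix = np.ones((3, 4))                            # 2D tensor
images = np.zeros((10, 64, 64, 3), dtype=np.uint8)  # 4D tensor: a batch of images

# The three key attributes: rank (ndim), shape, and data type
for t in (scalar, vector, matrix, images):
    print(t.ndim, t.shape, t.dtype)

# Increase the rank of a vector from 1 to 2
row = np.reshape(vector, (1, 790))

# Unrow each image into a single vector of 64 * 64 * 3 values
unrowed = images.reshape(images.shape[0], -1)

# Slice the first three images with start:end syntax
first_three = images[0:3]

# Dot product: the sum of the element-wise products
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b), np.sum(a * b))
```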

## Tuning Hyperparameters with Regularization

There are many hyperparameters you can tune. These include: 

1. number of hidden units
2. number of layers
3. learning rate (alpha)
4. activation function

The question then becomes: how do you choose these parameters? One primary method is to develop validation sets to strike a balance between specificity and generalization.

When tuning neural networks, it typically helps to split the data into three distinct partitions as follows:

1. Train algorithms on the training set.
2. Use a validation set to decide which one will be your final model after parameter tuning.
3. After having chosen the final model, and having evaluated long enough, you'll use the test set to get an unbiased estimate of the classification performance (or whatever your evaluation metric will be).

Remember that it is very important to make sure that the holdout validation set and the test samples come from the same distribution, e.g. the same resolution of santa pictures.

High bias means underfitting, high variance means overfitting, and a good fit lies somewhere in between.

We use regularization when the model overfits to the data. Two common forms are L1 and L2 regularization.
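
As a minimal sketch of what these penalties add to the loss, L1 sums the absolute values of the weights and L2 sums their squares, each scaled by a regularization strength. The weight matrix, the strength `lam`, and the unregularized `base_loss` below are made-up values.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * np.sum(weights ** 2)

W = np.array([[0.5, -1.0], [2.0, 0.0]])  # hypothetical layer weights
lam = 0.005                              # hypothetical regularization strength
base_loss = 0.30                         # hypothetical unregularized loss

l1_total = base_loss + l1_penalty(W, lam)
l2_total = base_loss + l2_penalty(W, lam)
print(l1_total, l2_total)
```

Because large weights are penalized, the optimizer is pushed toward smaller weights, which tends to reduce overfitting.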

## Dropout Regularization

When you apply the Dropout technique, a random subset of nodes (also called units) in a layer is ignored (their outputs are set to zero) during each phase of training. This lets us train the network on different parts of the data each time, which helps ensure that our model is not overly sensitive to noise in the data.

In Keras, you specify dropout using the Dropout layer, which can be applied to input and hidden layers. The Dropout layer requires one argument, rate, which specifies the fraction of units to drop, usually between 0.2 and 0.5.

In [None]:
# Dropout
from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(5, activation='relu', input_shape=(500,)))
# Dropout applied to the outputs of the first hidden layer
model.add(layers.Dropout(0.3))
model.add(layers.Dense(5, activation='relu'))
# Dropout applied to the outputs of the second hidden layer
model.add(layers.Dropout(0.3))
model.add(layers.Dense(1, activation='sigmoid'))
# In different iterations through the training set, different nodes
# will be zeroed out.
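
What a dropout layer does to a batch of activations during training can be sketched in NumPy. This is a simplified illustration of the common "inverted dropout" formulation, in which the surviving units are scaled up by 1 / (1 - rate); it is not the Keras implementation, and the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations            # at inference time nothing is dropped
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Scaling by 1 / keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 1000))

dropped = dropout(a, rate=0.3, rng=rng)
print((dropped == 0).mean())   # fraction of zeroed units, close to 0.3
print(dropped.mean())          # average activation, close to 1.0
```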

In [None]:
from sklearn.model_selection import train_test_split

# Downsample the data to test how the model reacts to different sizes
df_sample = df.sample(10000, random_state=123)
# Split the data into x and y
y = df_sample['Product']
x = df_sample['Consumer complaint narrative']
# Train Test Split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=1500, random_state=42)

In [None]:
# Validation set
xtrain_final, xval, ytrain_final, yval = train_test_split(xtrain, ytrain, test_size=1000, random_state=42)
# Good practice to set aside a validation set, which is then used during
# hyperparameter tuning. Afterwards, the test set can be used to determine
# the unbiased performance of the model.

## Early Stopping

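Overfitting: it's not possible to know in advance how many epochs you need to train your model for. Running the model multiple times with varying numbers of epochs may be helpful, but it is a time-consuming process. Early stopping instead halts training once a monitored quantity, such as the validation loss, has stopped improving for a set number of epochs (the patience).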
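
The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified, hypothetical illustration of patience-based stopping on a list of validation losses; it is not the Keras callback, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it (this is where a checkpoint would be saved)
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improvement stops after epoch 3
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70, 0.72]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)   # 6 3
```

Training stops at epoch 6, three epochs after the best validation loss at epoch 3, which is why saving the best model alongside early stopping is useful.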

In [None]:
# Import EarlyStopping and ModelCheckpoint
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Define the callbacks
early_stopping = [EarlyStopping(monitor='val_loss', patience=10), 
                  ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]

In [None]:
model_2_val = model_2.fit(X_train_tokens, 
                          y_train_lb, 
                          epochs=150, 
                          callbacks=early_stopping, 
                          batch_size=256, 
                          validation_data=(X_val_tokens, y_val_lb))

## L2 Regularization

In [None]:
# Import regularizers
import random
from keras import models, layers, regularizers

# Seeds Python's built-in random module
random.seed(123)
L2_model = models.Sequential()

# Add the input and first hidden layer
L2_model.add(layers.Dense(50, activation='relu', kernel_regularizer=regularizers.l2(0.005), input_shape=(2000,)))

# Add another hidden layer
L2_model.add(layers.Dense(25, kernel_regularizer=regularizers.l2(0.005), activation='relu'))

# Add an output layer
L2_model.add(layers.Dense(7, activation='softmax'))

# Compile the model
L2_model.compile(optimizer='SGD', 
                 loss='categorical_crossentropy', 
                 metrics=['accuracy'])

# Train the model 
L2_model_val = L2_model.fit(X_train_tokens, 
                            y_train_lb, 
                            epochs=150, 
                            batch_size=256, 
                            validation_data=(X_val_tokens, y_val_lb))

In [None]:
import matplotlib.pyplot as plt

# L2 model details
# Note: the history keys are 'acc'/'val_acc' in older Keras releases and
# 'accuracy'/'val_accuracy' in newer ones; check L2_model_dict.keys().
L2_model_dict = L2_model_val.history
L2_acc_values = L2_model_dict['acc']
L2_val_acc_values = L2_model_dict['val_acc']

# Baseline model
baseline_model_acc = baseline_model_val_dict['acc']
baseline_model_val_acc = baseline_model_val_dict['val_acc']

# Plot the accuracy for these models
fig, ax = plt.subplots(figsize=(12, 8))
epochs = range(1, len(L2_acc_values) + 1)
ax.plot(epochs, L2_acc_values, label='Training acc (L2)')
ax.plot(epochs, L2_val_acc_values, label='Validation acc (L2)')
ax.plot(epochs, baseline_model_acc, label='Training acc (Baseline)')
ax.plot(epochs, baseline_model_val_acc, label='Validation acc (Baseline)')
ax.set_title('Training & validation accuracy L2 vs regular')
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
ax.legend();

## L1 Regularization

In [None]:
random.seed(123)
L1_model = models.Sequential()

# Add the input and first hidden layer
L1_model.add(layers.Dense(50, activation='relu', kernel_regularizer=regularizers.l1(0.005), input_shape=(2000,)))

# Add a hidden layer
L1_model.add(layers.Dense(25, kernel_regularizer=regularizers.l1(0.005), activation='relu'))

# Add an output layer
L1_model.add(layers.Dense(7, activation='softmax'))

# Compile the model
L1_model.compile(optimizer='SGD', 
                 loss='categorical_crossentropy', 
                 metrics=['accuracy'])

# Train the model 
L1_model_val = L1_model.fit(X_train_tokens, 
                            y_train_lb, 
                            epochs=150, 
                            batch_size=256, 
                            validation_data=(X_val_tokens, y_val_lb))

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

L1_model_dict = L1_model_val.history

acc_values = L1_model_dict['acc'] 
val_acc_values = L1_model_dict['val_acc']

epochs = range(1, len(acc_values) + 1)
ax.plot(epochs, acc_values, label='Training acc L1')
ax.plot(epochs, val_acc_values, label='Validation acc L1')
ax.set_title('Training & validation accuracy with L1 regularization')
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
ax.legend();

## Dropout Regularization

In [None]:
# ⏰ This cell may take about a minute to run
random.seed(123)
dropout_model = models.Sequential()

# Implement dropout to the input layer
# NOTE: This is where you define the number of units in the input layer
dropout_model.add(layers.Dropout(0.3, input_shape=(2000,)))

# Add the first hidden layer
dropout_model.add(layers.Dense(50, activation='relu'))

# Implement dropout to the first hidden layer 
dropout_model.add(layers.Dropout(0.3))

# Add the second hidden layer
dropout_model.add(layers.Dense(25, activation='relu'))

# Implement dropout to the second hidden layer 
dropout_model.add(layers.Dropout(0.3))

# Add the output layer
dropout_model.add(layers.Dense(7, activation='softmax'))


# Compile the model
dropout_model.compile(optimizer='SGD', 
                      loss='categorical_crossentropy', 
                      metrics=['accuracy'])

# Train the model
dropout_model_val = dropout_model.fit(X_train_tokens, 
                                      y_train_lb, 
                                      epochs=150, 
                                      batch_size=256, 
                                      validation_data=(X_val_tokens, y_val_lb))

In [None]:


results_train = dropout_model.evaluate(X_train_tokens, y_train_lb)
print(f'Training Loss: {results_train[0]:.3} \nTraining Accuracy: {results_train[1]:.3}')

print('----------')

results_test = dropout_model.evaluate(X_test_tokens, y_test_lb)
print(f'Test Loss: {results_test[0]:.3} \nTest Accuracy: {results_test[1]:.3}')   

## Tuning Neural Networks with Normalization

One way to speed up training of your neural networks is to normalize the input. In fact, even if training time were not a concern, normalization to a consistent scale (0 to 1) across features should be used to ensure that the process converges to a stable solution.

Not only will normalizing your inputs speed up training, it can also mitigate other risks inherent in training neural networks. For example, having inputs of widely varying ranges can lead to difficult numerical problems when the algorithm computes gradients during forward and back propagation. This can lead to untenable solutions and prevent the algorithm from converging. Activations can also explode when there are many layers in the network.

Aside from normalizing the data, you can also investigate the impact of changing the initialization parameters when you first launch the gradient descent algorithm.

In addition, you could even use an alternative convergence algorithm instead of gradient descent. One issue with gradient descent is that it oscillates to a fairly big extent, because the derivative is bigger in the vertical direction.

## Gradient Descent with Momentum

Compute an exponentially weighted average of the gradients and use that average instead of the raw gradient. The intuitive interpretation is that this will successively dampen oscillations, improving convergence. Generally, β = 0.9 is a good hyperparameter value.
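
A minimal sketch of this update, assuming a simple quadratic cost that is much steeper in one direction than the other, looks like this. The cost function, the learning rate `alpha`, the starting point, and the number of steps are hypothetical choices.

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2), steep along w[1]
    return np.array([1.0, 25.0]) * w

def momentum_descent(w0, alpha=0.05, beta=0.9, n_steps=300):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        # Exponentially weighted average of the gradients
        v = beta * v + (1 - beta) * grad(w)
        # Step along the averaged gradient instead of the raw one
        w = w - alpha * v
    return w

w_final = momentum_descent([10.0, 1.0])
print(w_final)   # close to the minimum at [0, 0]
```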

## RMSProp

RMSProp stands for 'root mean square' prop. It slows down learning in one direction and speeds it up in another. On each iteration, it uses an exponentially weighted average of the squares of the derivatives, S, and divides the update by the square root of S. In the direction where we want to learn fast, the corresponding S will be small, so we divide by a small number and the update is larger. On the other hand, in the direction where we want to learn slowly, the corresponding S will be relatively large, and updates will be smaller. Often, a small ε is added to the denominator to make sure that you don't end up dividing by 0.
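
The update can be sketched as follows on a simple quadratic cost that is much steeper in one direction. The cost function and all hyperparameter values are assumptions for illustration. With a constant learning rate the iterates settle into a small region around the minimum rather than converging exactly.

```python
import numpy as np

def cost(w):
    return 0.5 * (w[0] ** 2 + 25 * w[1] ** 2)

def grad(w):
    # Gradient of the cost above, much larger along w[1]
    return np.array([1.0, 25.0]) * w

def rmsprop_descent(w0, alpha=0.01, beta=0.9, eps=1e-8, n_steps=500):
    w = np.array(w0, dtype=float)
    s = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        # Exponentially weighted average of the squared derivatives
        s = beta * s + (1 - beta) * g ** 2
        # Large S shrinks the step, small S enlarges it; eps avoids division by 0
        w = w - alpha * g / (np.sqrt(s) + eps)
    return w

w_start = np.array([1.0, 1.0])
w_final = rmsprop_descent(w_start)
print(w_final, cost(w_final))
```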

## Adam Optimization Algorithm

Adam stands for 'Adaptive Moment Estimation'; it uses estimates of the first and second moments of the gradients. It essentially puts momentum and RMSProp together. Generally, only alpha gets tuned.

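A rough sketch of the Adam update on a simple quadratic cost is shown below. It combines a momentum-style first moment with an RMSProp-style second moment and applies the usual bias correction. The cost function, the starting point, and the value of alpha are hypothetical; the remaining defaults (0.9, 0.999, 1e-8) are commonly used values.

```python
import numpy as np

def cost(w):
    return 0.5 * (w[0] ** 2 + 25 * w[1] ** 2)

def grad(w):
    return np.array([1.0, 25.0]) * w

def adam_descent(w0, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)   # first moment (momentum term)
    s = np.zeros_like(w)   # second moment (RMSProp term)
    for t in range(1, n_steps + 1):
        g = grad(w)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g ** 2
        # Bias correction compensates for the zero initialization of v and s
        v_hat = v / (1 - beta1 ** t)
        s_hat = s / (1 - beta2 ** t)
        w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w

w_start = np.array([1.0, 1.0])
w_final = adam_descent(w_start)
print(w_final, cost(w_final))
```
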
## Learning Rate Decay

Learning rate decreases across epochs.

$$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_nb}} \, \alpha_0$$
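
Reading the rule as alpha0 divided by (1 + decay_rate * epoch_nb), which is an interpretation of the formula above, the schedule can be sketched as a small function. The starting rate and decay rate below are made-up values.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_nb):
    """Learning rate after `epoch_nb` epochs of inverse-time decay."""
    return alpha0 / (1 + decay_rate * epoch_nb)

alpha0, decay_rate = 0.2, 1.0   # hypothetical values
schedule = [decayed_learning_rate(alpha0, decay_rate, e) for e in range(4)]
print(schedule)   # the rate shrinks each epoch: 0.2, 0.1, about 0.067, 0.05
```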

Note: If all the values for training and validation loss are 'nan', this indicates that the algorithm did not converge. The first solution to this is to normalize the input. From there, if convergence is still not achieved, normalizing the output may also be required.

In [None]:
## Creating the datasets

In [None]:
print(np.shape(train_images))
print(np.shape(train_labels))
print(np.shape(val_images))
print(np.shape(val_labels))
print(np.shape(test_images))
print(np.shape(test_labels))

In [None]:
## Unrowing the data
# Reshape the train images
train_img_unrow = train_images.reshape(5216, -1).T

In [None]:
# Preview the shape of train_img_unrow
np.shape(train_img_unrow)

In [None]:
# Reshape the val images
val_img_unrow = val_images.reshape(16, -1).T

In [None]:
# Preview the shape of val_img_unrow
np.shape(val_img_unrow)

In [None]:
# Reshape the test images
test_img_unrow = test_images.reshape(624, -1).T

In [None]:
# Preview the shape of test_img_unrow
np.shape(test_img_unrow)

In [None]:
## Confirming classes
# Confirm that all sets contain the same classes
train_generator.class_indices

In [None]:
val_generator.class_indices
test_generator.class_indices
train_labels

In [None]:
## Reshaping labels
# Reshape labels for every set
train_labels_final = train_labels.T[[1]]

In [None]:
np.shape(train_labels_final)
val_labels_final = val_labels.T[[1]]
np.shape(val_labels_final)
test_labels_final = test_labels.T[[1]]
np.shape(test_labels_final)

In [None]:
## Ensuring class identification
# Quality Assurance -- Make sure classes are identified correctly.
array_to_img(train_images[10])
# The label arrays have shape (1, n_samples), so slice along the second axis
print(train_labels_final[:, :10])
print(val_labels_final[:, :10])

In [None]:
## Normalizing the data
# Each RGB pixel in an image takes a value between 0 and 255. We need
# to normalize our data by dividing by 255. This will make sure each
# pixel value is between 0 and 1.
train_img_final = train_img_unrow/255
val_img_final = val_img_unrow/255
test_img_final = test_img_unrow/255

In [None]:
## Baseline Sequential Model
# Build a baseline neural network model using Keras
baseline_model = models.Sequential()
baseline_model.add(layers.Dense(50, activation='relu'))
baseline_model.add(layers.Dense(25, activation='relu'))
baseline_model.add(layers.Dense(1, activation='sigmoid'))
# Compile the model
baseline_model.compile(optimizer='adam', 
                       loss='binary_crossentropy', 
                       metrics=['accuracy'])

# Train the model
baseline_model_val = baseline_model.fit(train_generator,
                                        epochs=5, 
                                        validation_data=val_generator)

In [None]:
# The attribute .history, stored as a dictionary, now contains four
# entries: one per metric that was being monitored during training
# and validation. Print the keys of the dictionary for confirmation.
# Access the history attribute and store the dictionary
baseline_model_val_dict = baseline_model_val.history

# Print the keys
baseline_model_val_dict.keys()

In [None]:
# Evaluate this model on the training data
results_train = baseline_model.evaluate(X_train_tokens, y_train_lb)
print('----------')
print(f'Training Loss: {results_train[0]:.3} \nTraining Accuracy: {results_train[1]:.3}')

In [None]:
# Evaluate on the test data
results_test = baseline_model.evaluate(X_test_tokens, y_test_lb)
print('----------')
print(f'Test Loss: {results_test[0]:.3} \nTest Accuracy: {results_test[1]:.3}')

In [None]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))

loss_values = baseline_model_val_dict['loss']
val_loss_values = baseline_model_val_dict['val_loss']

epochs = range(1, len(loss_values) + 1)
ax.plot(epochs, loss_values, label='Training loss')
ax.plot(epochs, val_loss_values, label='Validation loss')

ax.set_title('Training & validation loss')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend();

In [None]:
# Plot again, comparing the training and validation accuracy
# to the number of epochs
fig, ax = plt.subplots(figsize=(12, 8))

acc_values = baseline_model_val_dict['acc'] 
val_acc_values = baseline_model_val_dict['val_acc']

ax.plot(epochs, acc_values, label='Training acc')
ax.plot(epochs, val_acc_values, label='Validation acc')
ax.set_title('Training & validation accuracy')
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
ax.legend();

In [None]:
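from keras.models import Sequential
from keras.layers import Dense, Activation

# Layers can also be passed to the constructor as a list
model = Sequential([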
        Dense(32),
        Activation('relu'),
        Dense(10),
        Activation('softmax'),
])

Before training a model, we need to configure the learning process, which is done with the compile method. It receives three arguments: an optimizer, a loss function, and a list of metrics.

In [None]:
# Optimization for a binary classification problem
model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])