## Analysis of Fine Tuning Runs

Analysis and charts to interpret the output from the second run for 


In [1]:
###  Add mathematical libraries
import numpy as np
import pandas as pd

# Graphical libraries and items.
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots, draw


import re
# import json
# import datetime
# import string


In [3]:
### Read file into Pandas Data array

file_loc = "./runs3.log"

df = pd.read_csv(file_loc, sep='|', header=(0))

df

ParserError: Error tokenizing data. C error: Expected 34 fields in line 160, saw 38
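
The `ParserError` above says that line 160 of the log has 38 `|`-separated fields where 34 were expected. One way to locate such rows before loading is sketched below. It assumes plain-text lines with no quoted separators, and `find_ragged_lines` is a hypothetical helper name, not part of pandas.

```python
def find_ragged_lines(lines, sep="|"):
    """Return (line_number, field_count) for rows whose field count differs from the header."""
    lines = list(lines)
    if not lines:
        return []
    # The first line is treated as the header and defines the expected width.
    expected = len(lines[0].rstrip("\n").split(sep))
    ragged = []
    for number, line in enumerate(lines[1:], start=2):
        count = len(line.rstrip("\n").split(sep))
        if count != expected:
            ragged.append((number, count))
    return ragged


# Made-up sample: the third line has one field too many.
sample = ["a|b|c\n", "1|2|3\n", "4|5|6|7\n", "8|9|10\n"]
bad_rows = find_ragged_lines(sample)
print(bad_rows)
```

On the real log this could be called as `find_ragged_lines(open(file_loc))`. Alternatively, pandas 1.3 and later accept `on_bad_lines="skip"` in `read_csv`, which drops the offending rows at the cost of silently losing those runs.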


In [None]:
### Find max accuracy and min loss on training and
### validation sets, and which epoch it was achieved on
### This will let us see both the most accurate runs
### and let us detect if (a) convergence has occurred
### and (b) whether we have overfit the model

epochs = 30  #output columns are counted from 0

### Each metric is a tuple of the metric name and a sign:
### +1 if higher values are better, -1 if lower values are better

for metric in [ ('loss', -1), ('accuracy',1), ('val_loss',-1), ('val_accuracy',1)]:
    best_val     = f"{metric[0]}-best"
    best_epc = f"{metric[0]}-epoch"
    met_sign     = metric[1]
    
    # Create list of the column names we want to check for the metric
    metric_cols  = [ f"{metric[0]}-{epoch}" for epoch in range(0,epochs) ]
   


    # Find the best value for each metric, as well as the epoch in which it occurred.
    #
    # idxmax(axis=1) returns the name of the column holding the row maximum;
    # idxmin(axis=1) does the same for the row minimum.
    #
    # str.extract() then pulls the run of digits (the epoch number) out of that
    # column name. The metric names themselves contain no digits, so the only
    # match is the epoch suffix.
    
    if met_sign == 1:
        df[best_val] = df[metric_cols].max(axis=1)
        best_cols = df[metric_cols].idxmax(axis=1)
    else:
        # Lower is better, so both the value and the epoch come from the minimum
        df[best_val] = df[metric_cols].min(axis=1)
        best_cols = df[metric_cols].idxmin(axis=1)

    df[best_epc] = best_cols.str.extract(r'(\d+)', expand=False).astype(int)
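
A minimal sketch of the best-value and best-epoch lookup on a toy frame may make the cell above easier to check by eye. The loss values are made up for illustration; only the column naming scheme (`val_loss-<epoch>`) is taken from the notebook.

```python
import pandas as pd

# Two hypothetical runs with three epochs of validation loss each.
toy = pd.DataFrame({
    "val_loss-0": [0.9, 0.7],
    "val_loss-1": [0.5, 0.8],
    "val_loss-2": [0.6, 0.4],
})
cols = [f"val_loss-{epoch}" for epoch in range(3)]

# For a loss, lower is better: take the row minimum and the column it came from.
toy["val_loss-best"] = toy[cols].min(axis=1)
best_cols = toy[cols].idxmin(axis=1)

# Pull the epoch number out of the winning column name.
toy["val_loss-epoch"] = best_cols.str.extract(r"(\d+)", expand=False).astype(int)

print(toy[["val_loss-best", "val_loss-epoch"]])
```

Passing `expand=False` makes `str.extract` return a Series rather than a one-column DataFrame, which keeps the column assignment unambiguous.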
        

    
    

In [None]:
temp_cols = metric_cols.copy()

temp_cols.append(best_val)
temp_cols.append(best_epc)
# df[temp_cols]
# type(best_val)

print(list(df.columns))

## Review Results by each of the hyperparameters we are varying

Unless otherwise stated, we will be measuring loss and accuracy for the validation data set.


### trainable = [True | False] - Whether we allow the embeddings to be trained as well

First, do we get better accuracy by allowing the embeddings to be trained?

In [None]:
### Check best accuracy

boxplot = df.boxplot(column=["val_accuracy-best"], by=['train_embeds'])

In [None]:
### Check when the best accuracy is found in the epochs

boxplot = df.boxplot(column=["val_accuracy-epoch"], by=['train_embeds'])

**From the above, we obtain better accuracy when we allow the embeddings to be refined during training, and the model converges at least 1 epoch sooner on average.**
The additional computational work to backpropagate into the embeddings is more than offset by the fewer epochs required to reach optimum accuracy.
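
The box plots can be backed up with numbers. The sketch below shows one way to tabulate the per-group medians. It uses made-up toy values rather than the real runs; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical results: two runs with trainable embeddings, two without.
toy_runs = pd.DataFrame({
    "train_embeds": [True, True, False, False],
    "val_accuracy-best": [0.90, 0.88, 0.85, 0.87],
    "val_accuracy-epoch": [4, 6, 7, 5],
})

# Median best accuracy and median best epoch for each setting of train_embeds.
summary = toy_runs.groupby("train_embeds")[
    ["val_accuracy-best", "val_accuracy-epoch"]
].median()

print(summary)
```

Replacing `toy_runs` with `df` would give the same table for the actual runs.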

In [None]:
### Refine to only runs where we train the embeddings

df1 = df[df['train_embeds'] == True]

### Look at the effect of the convolutional filters

In [None]:
boxplot = df1.boxplot(column=["val_accuracy-best"], by=['kernel_sizes'], figsize=(20,10))

In [None]:
boxplot = df1.boxplot(column=["val_accuracy-epoch"], by=['kernel_sizes'], figsize=(20,10))


In [None]:
# look at each set of kernel sizes by their counts

kernels = df1.kernel_sizes.unique()

# create 2-wide subplots to show
fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(20, 30), sharey=True)

for i in range(0,len(kernels)):
      
    x = i // 2
    y = i % 2
    
    axis = ax[x,y]
    
    axis.set_ylabel("val_accuracy-best")

    # plt.xticks(rotation = 45) # Rotates X-Axis Ticks by 45-degrees
    boxplot = df1[df1.kernel_sizes == kernels[i]].boxplot(column=["val_accuracy-best"],
                                                          by=['num_filters'],
                                                          ax=axis,
                                                          figsize=(20,10))
    axis.title.set_text(f"Kernel Sizes: {kernels[i]}")
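
The row and column arithmetic in the cell above hard-codes a grid of 5 rows by 2 columns. A small sketch of deriving the grid from the number of panels instead is shown here; `grid_shape` and `panel_position` are hypothetical helper names.

```python
import math


def grid_shape(n_panels, ncols=2):
    """Smallest (nrows, ncols) grid that holds n_panels subplots."""
    nrows = math.ceil(n_panels / ncols)
    return nrows, ncols


def panel_position(i, ncols=2):
    """(row, column) of the i-th panel when filling the grid row by row."""
    return divmod(i, ncols)


print(grid_shape(9))       # grid needed for nine kernel-size sets
print(panel_position(5))   # where the sixth panel goes
```

The result of `grid_shape(len(kernels))` could then be passed as `nrows` and `ncols` to `plt.subplots`. Note that `plt.subplots` only returns a 2-D array of axes when both dimensions exceed 1, unless `squeeze=False` is given.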


In [None]:
# look at each set of filter counts by their kernel sizes

filters = df1.num_filters.unique()

# create 2-wide subplots to show
# fig, ax = plt.subplots(nrows=8, ncols=2, figsize=(20, 50), sharey=True)
fig, ax = plt.subplots(nrows=8, ncols=2, figsize=(20, 50), sharey=False)

for i in range(0,len(filters)):
      
    x = i // 2
    y = i % 2
    
    axis = ax[x,y]
    
    axis.set_ylabel("val_accuracy-best")

    # plt.xticks(rotation = 45) # Rotates X-Axis Ticks by 45-degrees
    boxplot = df1[df1.num_filters == filters[i]].boxplot(column=["val_accuracy-best"],
                                                          by=['kernel_sizes'],
                                                          ax=axis,
                                                          figsize=(20,10))
    axis.title.set_text(f"Filter counts: {filters[i]}")


**In every case we tried, the best accuracy came from the highest filter counts for each set of kernels, regardless of kernel size:** ```[16,32]``` for the two-filter convolutions and ```[8,16,32]``` for the three-filter ones. We should therefore test even higher counts to see whether they give any marginal improvement, including ```[32,64]``` for the two-filter convolutions and ```[16,32,64]``` for the three-filter ones.

**Equally, the largest kernel sizes generally produce the best results,** though there seems to be some fall-off between ```[4,8,12]``` and ```[8,12,16]```, suggesting that a 4-word kernel does have value. We should also test ```[4,8,16]```, ```[4,8,16,32]``` and similar combinations to see if we can improve further.
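
The follow-up combinations suggested above could be enumerated roughly as follows. This is a sketch only: it assumes that each kernel size needs exactly one filter count, so only lists of equal length are paired, and the candidate lists are examples rather than a final search plan.

```python
from itertools import product

# Candidate kernel-size sets and filter-count sets (illustrative selection).
kernel_candidates = [[4, 8, 12], [4, 8, 16], [8, 12, 16]]
filter_candidates = [[16, 32], [32, 64], [8, 16, 32], [16, 32, 64]]

# Assumption: a configuration is valid only when there is one filter count
# per kernel size, so the two lists must have the same length.
next_wave = [
    (kernels, filters)
    for kernels, filters in product(kernel_candidates, filter_candidates)
    if len(kernels) == len(filters)
]

for kernels, filters in next_wave:
    print(kernels, filters)
```

A four-kernel set such as `[4,8,16,32]` would need matching four-element filter lists before it could be added to this grid.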

## Compare dropout effect on best performing convolution
### Kernel sizes ```[4,8,12]``` with filter counts ```[8,16,32]```

In [None]:
# Combine both conditions into one boolean mask rather than chaining two filters
df2 = df1[(df1["kernel_sizes"] == "[4, 8, 12]") & (df1["num_filters"] == "[8, 16, 32]")]

In [None]:
# look at accuracy by dropout
boxplot = df2.boxplot(column=["val_accuracy-best"], by=['dropout_rate'], figsize=(20,10))



**The best dropout rate in the sample is 0.2.** In wave 2 we suggest searching in smaller steps between 15% and 40% dropout to pin down the best value.
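
A finer dropout grid over that range could be generated as below. The 0.05 step is an assumption chosen for illustration, and `dropout_grid` is a hypothetical helper.

```python
def dropout_grid(start=0.15, stop=0.40, step=0.05):
    """Evenly spaced dropout rates from start to stop inclusive."""
    n_steps = round((stop - start) / step)
    # Round each value to hide floating-point noise such as 0.30000000000000004.
    return [round(start + i * step, 2) for i in range(n_steps + 1)]


rates = dropout_grid()
print(rates)
```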

## Evaluate the Dense Layers
Look at differences in the fully connected layers within the best selections so far.

In [None]:
df3 = df2[df2["dropout_rate"] == 0.2]
df3

In [None]:
# look at accuracy by dense layers

boxplot = df3.plot.bar(x="dense_layer_dims", y='val_accuracy-best', figsize=(20,10),ylim=(0.88,0.9))



**After selecting the best values for the other hyperparameters, the dense layers have less of an effect (all within 1% accuracy): a single layer of only 8 nodes appears to be sufficient to give a good result.**

Because the training set is relatively small (fewer than 7000 records), larger dense-layer models may not be able to converge effectively.

We may also want to look at F1 instead of accuracy to confirm this ranking.
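
To illustrate why F1 can tell a different story from accuracy, here is a small sketch that computes both from confusion-matrix counts. The counts are made up to represent an imbalanced data set and are not taken from these runs.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced case: 20 positives out of 1000 records.
tp, fp, fn, tn = 5, 5, 15, 975
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = f1_from_counts(tp, fp, fn)

print(f"accuracy={accuracy:.3f}  f1={f1:.3f}")
```

In this toy case accuracy is high because the negatives dominate, while F1 is low because most positives are missed. Whether that applies here depends on the class balance of the validation set.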

Round 1 testing covered 8280 tests with different hyperparameters in a total run time of 49h40m, an average of one test every 21.6s.
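
The per-test average can be checked directly from the two figures quoted above:

```python
from datetime import timedelta

total_runtime = timedelta(hours=49, minutes=40)
n_tests = 8280

# Average wall-clock seconds spent on each hyperparameter combination.
seconds_per_test = total_runtime.total_seconds() / n_tests

print(f"{seconds_per_test:.1f} s per test")
```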


In [None]:
best_test = df3[ df3["dense_layer_dims"] == "[8]"].to_dict(orient='records')[0]


In [None]:
## print out best version:

print(f"Best validation accuracy in first hyperparameter tuning run is {best_test['val_accuracy-best']:.5f}, in epoch {best_test['val_accuracy-epoch']},  Run at: {best_test['timestamp']}")

print(f"Model:  num_filters: {best_test['num_filters']}, kernel_sizes: {best_test['kernel_sizes']},  dense_layer_dims: {best_test['dense_layer_dims']},  dropout_rate: {best_test['dropout_rate']}")

## TensorBoard Output

Looking at the TensorBoard output, although the highest accuracy was attained at epoch 9, there was clear overfitting: the validation loss started to increase after epoch 4, and the peak accuracy appears to be an outlier relative to the smoothed curve.

**For future tests, we will only go to 5 epochs** unless we see evidence that more epochs might improve accuracy without overfitting.
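
Instead of a fixed epoch cap, the stopping point could be chosen from the validation loss itself. The sketch below is a framework-independent illustration of patience-based early stopping. The loss values are made up, and `early_stop_epoch` is a hypothetical helper, not something used in these runs.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never ran out of patience: training used every epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical curve that bottoms out at epoch 4 and then rises.
losses = [0.60, 0.45, 0.38, 0.35, 0.33, 0.34, 0.36, 0.39]
best_epoch, stop_epoch = early_stop_epoch(losses, patience=2)

print(best_epoch, stop_epoch)
```

With a curve shaped like this one, the rule stops two epochs after the minimum and keeps the epoch-4 weights as the best checkpoint.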

![Key](Key.png)
![Accuracy](Accuracy.png)
![Loss](Loss.png)