<table align="center">
  <td align="center"><a target="_blank" href="https://colab.research.google.com/github/andrew-nash/CS6421-labs/blob/main/Lab3.ipynb">
        <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/andrew-nash/CS6421-labs/blob/main/Lab3.ipynb">
        <img src="https://i.ibb.co/xfJbPmL/github.png"  height="70px" style="padding-bottom:5px;"  />View Source on GitHub</a></td>
</table>


# Continued Model Optimization with TensorFlow

In the last lab, we looked at a very basic end-to-end example of how models can be trained in TensorFlow. We saw how plugins such as TensorBoard and Weights and Biases are used to visualise training performance and to compare the effect of different hyper-parameter choices. In this lab we will continue this process in more depth: we will consider the required data processing more carefully, develop a simple (albeit non-trivial) model, and perform basic but effective hyper-parameter tuning.

Content based on: https://www.justintodata.com/hyperparameter-tuning-with-python-keras-guide/

In [None]:
import numpy as np
import pandas as pd

# A Brief Sidebar on Numpy and Pandas

We are going to be using a real-world dataset, taken from an older Kaggle competition (based on the Russian housing market from 2011-2015).

In order to prepare this raw data in a format appropriate for use with Deep Learning models, we need to perform some *pre-processing*. There are a number of potential tools for this job; in this particular case we will focus on Numpy and Pandas.

## What is Numpy?

https://pub.towardsai.net/numpy-guide-super-simple-way-to-learn-it-in-10-minutes-d382ff45e215

Short for Numerical Python, it is intended for use in high-performance computation on multi-dimensional arrays.

### Creating numpy arrays

Numpy arrays can be created very simply from python lists:

In [None]:
l = [1,2,3,4,5]
a = np.array(l)
print(a)

[1 2 3 4 5]


In [None]:
l = [[1,2,3,4,5],[6,7,8,9,10]]
a = np.array(l)
print(a)

In [None]:
l = [
      [ [1,2,3],    [4,5,6] ],
      [ [7,8,9],    [10,11,12] ],
      [ [13,14,15], [16,17,18] ],
      [ [19,20,21], [22,23,24] ]
    ]
a = np.array(l)
print(a)

Numpy arrays have some very useful attributes - particularly size, shape and dtype

1. *Size* tracks the number of scalar values contained within the array (and any sub-arrays)
2. *Shape* contains the size of each dimension of the array - e.g. shape=(3,4) corresponds to a 3x4 matrix
3. *Dtype* tracks the primitive type of the scalars in the array - such as np.float32, np.int32, np.int64 etc
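
The dtype can also be chosen explicitly when an array is created, or converted afterwards. A small sketch with arbitrary values:

```python
import numpy as np

# Request a specific dtype at creation time
ints = np.array([1, 2, 3], dtype=np.int32)

# Convert an existing array to another dtype (astype returns a new array)
floats = ints.astype(np.float32)

print(ints.dtype, floats.dtype)    # int32 float32
print(floats.size, floats.shape)   # 3 (3,)
```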

In [None]:
l = [1,2,3,4,5]
a = np.array(l)
print(a.size, a.shape, a.dtype)

Before running the following code, try to predict the size, shape and dtype that will be printed

In [None]:
l = [[1.2,2.5,3.5,4.5,5.5],[6.1,7.0,8.4184,9.1,10.14]]
a = np.array(l)
print(a.size, a.shape, a.dtype)

In [None]:
l = [
      [ [1,2,3],    [4,5,6] ],
      [ [7,8,9],    [10,11,12] ],
      [ [13,14,15], [16,17,18] ],
      [ [19,20,21], [22,23,24] ]
    ]
a = np.array(l)
print(a.size, a.shape, a.dtype)

### Basic Array Operations

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

A huge benefit of numpy is that it allows simple arithmetic and linear algebraic operations to be applied directly to arrays.

For example, basic mathematical operations can be applied between two arrays (of the same shape). The operations will be applied element-by-element

In [None]:
a+b

In [None]:
a/b

In [None]:
a*b

Given two arrays that are **compatible** for matrix multiplication, this can be performed with:


In [None]:
M = np.array([
    [1,2,3],
    [3,2,1]
])

v = np.array([2,4])

In [None]:
v@M

The `@` operator also works on higher-rank arrays: it multiplies over the last two dimensions and treats any leading dimensions as batch dimensions.
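
As a rough illustration of how `@` behaves on higher-rank arrays, the sketch below uses arrays of ones with arbitrarily chosen shapes. It also shows `np.tensordot`, which contracts explicitly chosen axes and so gives a different result shape:

```python
import numpy as np

# A stack of 4 matrices of shape 2x3, and a stack of 4 matrices of shape 3x5
A = np.ones((4, 2, 3))
B = np.ones((4, 3, 5))

# @ multiplies the last two dimensions and treats the leading one as a batch
C = A @ B
print(C.shape)   # (4, 2, 5)

# tensordot contracts the chosen axes and keeps all remaining ones
T = np.tensordot(A, B, axes=([2], [1]))
print(T.shape)   # (4, 2, 4, 5)
```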

### Other important numpy commands

#### Creating arrays

In [None]:
# There are some other options for creating numpy arrays, such as:

# create a 5x4 matrix of zeros
zA = np.zeros(shape=(5,4), dtype=np.float64)

# a 5x4 matrix of standard normal realizations
nA = np.random.normal(0,1, (5,4))

#### Indexing and slicing
Numpy arrays can be indexed and sliced similarly to regular Python lists


In [None]:
a = np.random.normal(0,1,(5,4))
a

In [None]:
a[0,3]

In [None]:
a[1:,]

In [None]:
a[0,1:]

In [None]:
a = np.zeros(shape=(5,4,3))
a = a+1

In [None]:
a[0,:,:]

### Reshaping

Reshaping is an important operation in numpy - it allows us to change the shape of an array while maintaining the same scalar elements. It allows us to increase and decrease the dimension of our data without changing values

In [None]:
a = np.random.normal(0,1,(5,4))
# rounding is possible in numpy, done here to make the array
# easier to look at
a = np.round(a, 2)
a

In [None]:
# convert this matrix into a Rank-3 tensor, with an outer dimension of 1
a.reshape(1,5,4)

In [None]:
# convert this matrix into a Rank-1 tensor (a vector)
a.reshape(20)

In [None]:
# the sizes of the dimensions can also be swapped!
# here the 5x4 matrix is reshaped into a 4x5 matrix
a.reshape(4,5)

**Important** - the last reshape operation shown here will **NOT** compute the transpose of a matrix or tensor

In [None]:
# correctly computing the Transpose
a.T

Key takeaway: only use .reshape() for increasing and decreasing the rank of a tensor, not performing transposes!
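
The difference is easiest to see on a small matrix. In this sketch (values chosen arbitrarily), both results have shape (3, 2) but the elements land in different places:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])       # shape (2, 3)

reshaped = m.reshape(3, 2)      # same elements, read off in row-major order
transposed = m.T                # rows and columns swapped

print(reshaped)
# [[1 2]
#  [3 4]
#  [5 6]]
print(transposed)
# [[1 4]
#  [2 5]
#  [3 6]]
print(np.array_equal(reshaped, transposed))   # False
```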

There is much more to numpy - vectorised operations, computing statistics such as max, min, mean, variance, etc. For now, the above knowledge will be more than sufficient for the purpose of this lab.
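
As a brief taste of these statistics, the sketch below uses a small made-up array. The `axis` argument controls which dimension is collapsed: `axis=0` gives one value per column, `axis=1` one value per row.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(x.mean())         # 3.5, computed over all elements
print(x.max(axis=0))    # [4. 5. 6.], one value per column
print(x.min(axis=1))    # [1. 4.], one value per row
print(x.var(axis=0))    # [2.25 2.25 2.25], population variance per column
```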

## What is Pandas

https://pandas.pydata.org/docs/user_guide/10min.html

*pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.*

Pandas provides two types of classes for handling data: Series and DataFrame. We will only look at DataFrame here.
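
To see how the two classes relate, the sketch below builds a tiny DataFrame from made-up values (not taken from the housing dataset). Selecting a single column from it returns a Series:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "rooms": [1, 2, 3],
    "price": [100.0, 150.0, 210.0],
})

col = df["price"]             # a single column is a Series
print(type(df).__name__)      # DataFrame
print(type(col).__name__)     # Series
print(col.max())              # 210.0
```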

### Download the data

In [None]:
!wget https://github.com/AdmiralWen/Sberbank-Russian-Housing-Market/raw/master/Data/train.csv

### Load the data into pandas

In [None]:
# reading a csv is simple in pandas!
dataset = pd.read_csv("train.csv")

## Exploring the data

In [None]:
dataset

In [None]:
dataset.columns

In [None]:
# computing summary statistics
dataset.describe()

### Basic Dataframe Selection

Pandas allows us to select particular rows and/or columns out of the whole dataframe

In [None]:
# accessing a particular column
dataset["timestamp"]

In [None]:
dataset.loc[2, "timestamp"]

In [None]:
# accessing a particular row
dataset.loc[2, :]

In [None]:
dataset[["id", "timestamp"]]

It is also possible to perform **conditional** selection
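
On a toy frame (made-up values, with column names borrowed from the dataset), a comparison produces a boolean mask, and indexing with that mask keeps only the matching rows. Several conditions can be combined with `&` and `|`:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "build_year": [1990, 2013, 2015, 2008],
    "full_sq": [40, 55, 80, 62],
})

mask = toy["build_year"] > 2012    # boolean Series, one entry per row
print(mask.tolist())               # [False, True, True, False]

# Each condition needs its own parentheses when combined with & or |
recent_and_large = toy[(toy["build_year"] > 2012) & (toy["full_sq"] > 60)]
print(recent_and_large["build_year"].tolist())   # [2015]
```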

In [None]:
dataset['build_year']>2012

In [None]:
dataset[dataset['build_year']>2012]

# Data Pre-Processing

In [None]:
selected_columns = ['build_year','full_sq', 'life_sq', 'floor','product_type','area_m', "price_doc"]

In [None]:
sub_dataset = dataset[selected_columns]
sub_dataset

### Data pre-processing issue #1 - missing data

In this case we will drop rows where any data is missing. There are more sophisticated solutions to this issue, but these are outside the scope of this lab.

In [None]:
clean_sub_dataset = sub_dataset.dropna()
clean_sub_dataset

In this case, we have roughly halved the dataset - this is far from ideal considering how much useful data we have just thrown away!
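
One of the simpler alternatives is to fill each gap with the median of its column. The sketch below is one possible approach on made-up values, not the method used in this lab. It compares `dropna` with `fillna`:

```python
import numpy as np
import pandas as pd

# Toy data with gaps, for illustration only
toy = pd.DataFrame({
    "full_sq": [40.0, np.nan, 80.0, 60.0],
    "floor": [2.0, 5.0, np.nan, 3.0],
})

dropped = toy.dropna()                                # keeps only complete rows
imputed = toy.fillna(toy.median(numeric_only=True))   # fills gaps column by column

print(len(dropped), len(imputed))    # 2 4
print(imputed.loc[1, "full_sq"])     # 60.0
print(imputed.loc[2, "floor"])       # 3.0
```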

### Data Pre-Processing #2, convert Categorical data to numeric

Keras will not be able to handle string data such as `product_type` above, so we will need to encode it as a numerical representation

In [None]:
set(clean_sub_dataset['product_type'])

We see that there are two options for product_type - either Investment or OwnerOccupier. A simple option is to encode these as 0 and 1 respectively

In [None]:
clean_sub_dataset=clean_sub_dataset.replace("OwnerOccupier", 1)
clean_sub_dataset=clean_sub_dataset.replace("Investment", 0)

In [None]:
clean_sub_dataset

### Data Pre-processing #3, scaling

In general, neural network optimisation works more effectively when data are closely distributed around 0. Why is this?

First, we will export our pandas data to numpy

In [None]:
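# request a float array explicitly, in case the encoded column is still stored as objects
np_data = clean_sub_dataset.to_numpy(dtype=np.float64)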
np_data

The y data (price) is the final column of the dataset

In [None]:
x_data = np_data[:,:-1]
y_data = np_data[:,-1]

#### Doing the scaling

 To scale the data, so that each column has values between 0 and 1 we perform a few steps.

 1. Get the max and min of each column
 2. Subtract the column-wise min from each value
 3. Divide this subtracted value by (max-min)

 i.e.

 $\displaystyle x_{scaled}=\frac{x-x_{col min}}{x_{col max}-x_{col min}}$
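
As a quick worked example of this formula, here it is applied to a toy matrix. The values and the `toy_*` names are made up for illustration:

```python
import numpy as np

# Toy feature matrix: 3 rows, 2 columns
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [3.0, 20.0]])

toy_min = toy.min(axis=0)    # column minima: 1 and 10
toy_max = toy.max(axis=0)    # column maxima: 3 and 30

# Broadcasting applies the column-wise min and range to every row
toy_scaled = (toy - toy_min) / (toy_max - toy_min)
print(toy_scaled)
# [[0.  0. ]
#  [0.5 1. ]
#  [1.  0.5]]
```

Note that a column whose values are all identical has max equal to min, so this formula would divide by zero. Such columns need separate handling.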

In [None]:
x_data.shape

In [None]:
col_max = np.max(x_data, axis=0)
col_max.shape

In [None]:
col_min = np.min(x_data, axis=0)
col_min.shape

In [None]:
x_scaled = (x_data-col_min)/(col_max-col_min)

In [None]:
y_scaled = (y_data-np.min(y_data))/(np.max(y_data)-np.min(y_data))

The data is now ready for use!

# Modelling this Data With Keras

First, partition the data into train and test splits

In [None]:
train_frac = 0.7

num_train_samples = int(train_frac*len(x_scaled))

x_train = x_scaled[:num_train_samples,:]
y_train = y_scaled[:num_train_samples]

In [None]:
x_test = x_scaled[num_train_samples:,:]
y_test = y_scaled[num_train_samples:]

We will now fit a model on this data, tuning the following hyper-parameters:

1. The number of layers used
2. The number of neurons in each hidden layer
3. The activation function used
4. The number of training epochs and learning rate

Finally, we will produce a model that should achieve good overall results!

#### Imports and installations

In [None]:
import tensorflow as tf

In [None]:
!pip install wandb -qU


In [None]:
%load_ext tensorboard

In [None]:
import datetime
log_dir = "logs/" + datetime.datetime.now().strftime("%Y%m%d")

In [None]:
%tensorboard --logdir logs

In [None]:

import wandb
from wandb.keras import WandbMetricsLogger, WandbModelCheckpoint
wandb.login()

True

## Varying Number of Layers & Number of Neurons in Each Layer

In [None]:
for hidden_layers in [1,5,10,20,50]:
  project_name = f"varying-layers"
  run_name = f"{hidden_layers}-layers"

  wandb.init(
      project= project_name,
      name = run_name ,
      config={
        "layers": hidden_layers,
        "optimizer": "SGD",
        "loss": "mean_squared_error",
        "epoch": 10,
        "batch_size": 8
      },
  )
  config = wandb.config

  #######
  #### MODEL STARTS HERE
  #######
  model = tf.keras.models.Sequential()
  dense_layer_1 = tf.keras.layers.Dense( 10, input_shape=[6])
  model.add(dense_layer_1)
  for l in range(config.layers):
    dense_layer = tf.keras.layers.Dense(10)
    model.add(dense_layer)

  # output layer has 1 neuron
  dense_layer = tf.keras.layers.Dense(1)
  model.add(dense_layer)

  model.compile(optimizer=config.optimizer, loss=config.loss, metrics=[])
  #######
  #### MODEL ENDS HERE
  #######


  # GIVE EACH RUN ITS OWN TENSORBOARD LOG DIRECTORY SO THAT SEPARATE RUNS ARE CLEARLY DISTINGUISHED
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir+'/'+project_name+'_'+run_name, update_freq=1, histogram_freq=1)
  # Add WandbMetricsLogger to log metrics and WandbModelCheckpoint to log model checkpoints
  wandb_callbacks = [
      WandbMetricsLogger(),
      WandbModelCheckpoint(filepath="model_{epoch:02d}"),
  ]

  model.fit(
      x=x_train,
      y=y_train,
      epochs=config.epoch,
      batch_size=config.batch_size,
      validation_data=(x_test, y_test),
      callbacks=[*wandb_callbacks, tensorboard_callback],  # unpack so the list is flat
      verbose=0
  )

  # Mark the run as finished
  wandb.finish()

## Exploring different activation functions

Will activations such as `sigmoid` and `tanh` help?

What about `leaky_relu` and `elu`?
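
For reference, these activations can be written out in a few lines of numpy. This is a sketch of the standard textbook definitions. The `alpha` defaults are assumptions chosen for illustration and may differ from the defaults Keras uses:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.2):    # alpha: assumed slope for negative inputs
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):           # alpha: assumed scale for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))          # [0. 0. 2.]
print(leaky_relu(x))    # [-0.4  0.   2. ]
print(sigmoid(x)[1])    # 0.5
print(np.tanh(x)[1])    # 0.0
```

Note that `relu` outputs exactly zero for all negative inputs, whereas `leaky_relu` and `elu` keep a small non-zero response there.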

In [None]:
for act in ["relu", "elu", "leaky_relu"]:
  project_name = f"varying-act-func"
  run_name = f"{act}-func"

  wandb.init(
      project= project_name,
      name = run_name ,
      config={
        "layers": ''' TODO: DECIDE ON A NUMBER OF LAYERS''',
        "act" : act,
        "optimizer": "SGD",
        "loss": "mean_squared_error",
        "epoch": 10,
        "batch_size": 8
      },
  )
  config = wandb.config

  #######
  #### MODEL STARTS HERE
  #######
  model = tf.keras.models.Sequential()
  dense_layer_1 = tf.keras.layers.Dense( 10, input_shape=[6],activation=config.act)
  model.add(dense_layer_1)
  for l in range(config.layers):
    dense_layer = tf.keras.layers.Dense('''TODO: HOW MANY NEURONS PER LAYER''', activation=config.act)
    model.add(dense_layer)

  # output layer has 1 neuron
  dense_layer = tf.keras.layers.Dense(1, activation=config.act)
  model.add(dense_layer)

  model.compile(optimizer=config.optimizer, loss=config.loss, metrics=[])
  #######
  #### MODEL ENDS HERE
  #######


  # GIVE EACH RUN ITS OWN TENSORBOARD LOG DIRECTORY SO THAT SEPARATE RUNS ARE CLEARLY DISTINGUISHED
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir+'/'+project_name+'_'+run_name, update_freq=1, histogram_freq=1)
  # Add WandbMetricsLogger to log metrics and WandbModelCheckpoint to log model checkpoints
  wandb_callbacks = [
      WandbMetricsLogger(),
      WandbModelCheckpoint(filepath="model_{epoch:02d}"),
  ]

  model.fit(
      x=x_train,
      y=y_train,
      epochs=config.epoch,
      batch_size=config.batch_size,
      validation_data=(x_test, y_test),
      callbacks=[*wandb_callbacks, tensorboard_callback],  # unpack so the list is flat
      verbose=0
  )

  # Mark the run as finished
  wandb.finish()

## Regularization

Regularization penalizes large weights, which helps to reduce overfitting (high variance). It can be added to layers quite simply:

`tf.keras.layers.Dense(..., kernel_regularizer=tf.keras.regularizers.L2(l2=0.01), bias_regularizer=tf.keras.regularizers.L2(l2=0.01))`

If you suspect that your model could benefit from regularization, try adding these hyper-parameters and seeing if performance improves.
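
To make the idea concrete: an L2 regularizer adds a term proportional to the sum of squared weights to the loss. Below is a minimal numpy sketch of that term, using made-up weight vectors and the factor 0.01 from the example above. The helper `l2_penalty` is hypothetical and not part of Keras.

```python
import numpy as np

def l2_penalty(weights, factor=0.01):
    """Term added to the loss: factor * sum of squared weights."""
    return factor * np.sum(np.square(weights))

small = np.array([0.1, -0.2, 0.3])
large = np.array([1.0, -2.0, 3.0])

print(l2_penalty(small))   # roughly 0.0014
print(l2_penalty(large))   # roughly 0.14
```

Scaling the weights by 10 scales the penalty by 100, so the optimizer is pushed towards smaller weights.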


## Varying Optimizer and Its Configuration

So far we have used straightforward SGD - but what about more complex optimizers such as Adam and RMSProp?

Consider also the impact of changing the learning rate
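
To build some intuition for the learning rate, here is a toy sketch of plain gradient descent on the one-dimensional function f(w) = w². The function and the three rates are arbitrary choices for illustration, not taken from the lab's model:

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2, whose gradient is 2*w, starting from w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

for lr in [0.01, 0.1, 1.1]:
    print(lr, gradient_descent(lr))
```

With the smallest rate the weight moves towards the minimum at 0 only slowly. The middle rate gets much closer in the same number of steps. The largest rate overshoots on every step and diverges.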

In [None]:
'''
TODO: Adapt the code above to tune the optimizer, learning rate, batch size and number of epochs of training

BONUS: does adding Dropout ( tf.keras.layers.Dropout(rate=0.2) ) to the model after each layer have any impact?
'''

# Next Steps

There are a few avenues to improve the performance of this model:

1. Improve the data preprocessing \\
  a. We dropped all rows with missing data - is there a better way to impute this missing data, to make better use of the other values in these rows that we are otherwise throwing away? \\
  b. Our system of encoding the categorical data was very simple - but is there an issue with it? \\
2. Are there more effective strategies for optimising the choice of hyper-parameters in the training process?
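
On point 1b, one common alternative to integer codes is one-hot encoding, which gives each category its own 0/1 column. A minimal sketch using `pd.get_dummies` on made-up rows:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"product_type": ["Investment", "OwnerOccupier", "Investment"]})

encoded = pd.get_dummies(toy, columns=["product_type"], dtype=int)
print(encoded.columns.tolist())
# ['product_type_Investment', 'product_type_OwnerOccupier']
print(encoded.values.tolist())
# [[1, 0], [0, 1], [1, 0]]
```

With only two categories this carries the same information as a single 0/1 column. The difference matters mainly when there are three or more categories, where integer codes would imply an ordering.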


Once you have settled on good sets of hyper-parameters, more exhaustive fine tuning can be performed with tools like the Keras Tuner (https://www.tensorflow.org/tutorials/keras/keras_tuner).
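
A simple strategy to try before reaching for a dedicated tuner is random search: sample a handful of configurations from a search space instead of looping over one hyper-parameter at a time. In the sketch below, the search space, its values and the helper `sample_config` are all hypothetical:

```python
import random

# Hypothetical search space; names and ranges are illustrative only
search_space = {
    "layers": [1, 2, 5, 10],
    "units": [8, 16, 32, 64],
    "act": ["relu", "elu", "tanh"],
    "learning_rate": [1e-1, 1e-2, 1e-3],
}

def sample_config(space, rng):
    """Draw one value per hyper-parameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)   # fixed seed so the draws are repeatable
configs = [sample_config(search_space, rng) for _ in range(5)]
for cfg in configs:
    print(cfg)           # each dict could be passed as config= to wandb.init
```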