[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/drdave-teaching/OPIM5509Files/blob/main/OPIM5509_Module2_Files/CheatSheet_BuildingFFNNs.ipynb)

# Cheat Sheet: Building FFNNs
---------------------------------------
**Dr. Dave Wanik - University of Connecticut**

A lot of students want to know how many layers and hidden units to use in their neural networks. The easy/annoying answer is 'it depends on the problem', which can be frustrating to a Deep Learning newbie. However, there are some rules of thumb and general guidelines that may be useful to you.

## Strategy 1: One layer, same number of hidden nodes as input data features.

This is an easy one to understand - remember, you are making intermediate/temporary, nonlinear representations of data. So, if you have 10 input data features, you can create 10 intermediate, nonlinear representations of data.

This makes sense - your 10 inputs can be combined in such a way that you make 10 new things that are useful for prediction. Each one of the nodes in the hidden layer is a different nonlinear combination of inputs.
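
To make "nonlinear combination of inputs" concrete, here is a minimal sketch in plain Python of what one hidden layer computes. The weights are random placeholders rather than trained values, and the names `dense_relu`, `weights`, and `biases` are made up for illustration; they are not Keras internals.

```python
import random

def dense_relu(inputs, weights, biases):
    """One hidden layer: each node is relu(weighted sum of all inputs + bias)."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(max(0.0, z))  # ReLU supplies the nonlinearity
    return outputs

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_features = 10
n_hidden = n_features  # Strategy 1: as many hidden nodes as input features

x = [rng.uniform(-1, 1) for _ in range(n_features)]  # one made-up observation
weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
biases = [0.0] * n_hidden  # placeholder biases (assumption)

hidden = dense_relu(x, weights, biases)
print(len(hidden))  # 10 new representations from 10 inputs
```

Every hidden node sees all 10 inputs, but each has its own weights, so each one ends up as a different combination.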

In [None]:
# X_train.shape[1] = 13 features

In [None]:
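# build the model!
# imports assume the Keras that ships with TensorFlow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(13, input_shape=(X_train.shape[1],), activation='relu')) # hidden layer; input_shape is (features,)
model.add(Dense(1, activation='linear')) # output node
model.summary() # see what your model looks like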

## Strategy 2: One layer, two times the number of input data features.

If you have 10 input data features, you create a single layer with 20 hidden units. You are making 20 nonlinear representations of data from your 10 input features. This seems believable to me.

Perhaps you could also try three or five times the number of input data features, but realize what you are doing: every extra hidden unit adds more weights to fit, so training slows down and the network takes more tuning to converge.
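
To see why wider layers cost more, you can count the trainable parameters by hand. The sketch below is an illustration with a made-up helper, `dense_param_count`, and it assumes fully connected layers with one bias per node and the 13 input features used in this notebook.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = n_inputs
    for units in layer_sizes:
        total += (previous + 1) * units  # +1 accounts for each node's bias
        previous = units
    return total

n_features = 13
for multiplier in (1, 2, 3, 5):
    hidden_units = multiplier * n_features
    # one hidden layer followed by a single output node
    params = dense_param_count(n_features, [hidden_units, 1])
    print(multiplier, hidden_units, params)
# 1 13 196
# 2 26 391
# 3 39 586
# 5 65 976
```

These totals should match the "Total params" line that `model.summary()` reports for the corresponding one-hidden-layer models.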

In [None]:
# build the model!
model = Sequential()
model.add(Dense(26, input_shape=(X_train.shape[1],), activation='relu')) # (features,)
model.add(Dense(1, activation='linear')) # output node
model.summary() # see what your model looks like

## Strategy 3: Two or more layers, same number of hidden units as input data features.

Recall WHY we are using multiple layers - multiple layers mean that you can have deeper representations of data. This table is not the ultimate authority on neural networks, but it will get you thinking the right way.

**Table:** Determining the Number of Hidden Layers

| Num Hidden Layers | Result |
|---|---|
| none | Only capable of representing linearly separable functions or decisions. |
| 1 | Can approximate any function that contains a continuous mapping from one finite space to another. |
| 2 | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy. |
| >2 | Additional layers can learn complex representations (a sort of automatic feature engineering) for later layers. |
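
The first two rows of the table can be illustrated with XOR, the classic example of a function that is not linearly separable. The sketch below hand-picks the weights of a tiny network with one hidden layer of two ReLU nodes; nothing here is learned, and `xor_net` is a made-up name for illustration only.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    """One hidden layer (two ReLU nodes) plus a linear output node."""
    h1 = relu(x1 + x2)        # hidden node 1
    h2 = relu(x1 + x2 - 1.0)  # hidden node 2, only active when both inputs are 1
    return h1 - 2.0 * h2      # linear output node

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
# 0 0 0.0
# 0 1 1.0
# 1 0 1.0
# 1 1 0.0
```

No single straight line separates the 1s from the 0s here, which is why a model with no hidden layer cannot reproduce this mapping.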

We see this all the time in the Chollet book. Just keep repeating the layer size!

In [None]:
# build the model!
model = Sequential()
model.add(Dense(13, input_shape=(X_train.shape[1],), activation='relu')) # (features,)
model.add(Dense(13, activation='relu')) # hidden layer 2
model.add(Dense(13, activation='relu')) # hidden layer 3
model.add(Dense(13, activation='relu')) # hidden layer 4
model.add(Dense(13, activation='relu')) # hidden layer 5
model.add(Dense(13, activation='relu')) # hidden layer 6
model.add(Dense(1, activation='linear')) # output node
model.summary() # see what your model looks like

## Strategy 4: Many layers, decreasing hidden nodes by half.

Imagine you have a first layer with 50 hidden units - your second layer would have 25 hidden units, then 12, then 6. Just keep dividing by 2. This is what I mean by 'information funnel' - because the shape goes from wide to narrow, left to right. You are creating new representations as you go along - and the nonlinear nuggets you create at the end will be nonlinearly combined to predict the output.
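
The "keep dividing by 2" rule is easy to write down as a small helper. This is a sketch only: `funnel_sizes` and its `min_units` cutoff are made-up names and choices, not part of Keras.

```python
def funnel_sizes(first_layer_units, min_units=6):
    """Halve the layer width (integer division) until it drops below min_units."""
    sizes = []
    units = first_layer_units
    while units >= min_units:
        sizes.append(units)
        units //= 2
    return sizes

print(funnel_sizes(50))                  # [50, 25, 12, 6]
print(funnel_sizes(1000, min_units=15))  # [1000, 500, 250, 125, 62, 31, 15]
```

Each number in the list can then be used as the width of one `Dense(..., activation='relu')` layer, followed by the output node. Rounding to friendlier numbers (say 60 and 30 instead of 62 and 31) is fine; the funnel shape is what matters.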

In [None]:
# build the model!
model = Sequential()
model.add(Dense(1000, input_shape=(X_train.shape[1],), activation='relu')) # (features,)
model.add(Dense(500, activation='relu')) # hidden layer 2
model.add(Dense(250, activation='relu')) # hidden layer 3
model.add(Dense(125, activation='relu')) # hidden layer 4
model.add(Dense(60, activation='relu')) # hidden layer 5
model.add(Dense(30, activation='relu')) # hidden layer 6
model.add(Dense(15, activation='relu')) # hidden layer 7
model.add(Dense(1, activation='linear')) # output node
model.summary() # see what your model looks like

## [Optional] Strategy 5: A systematic grid search
This is a bit advanced, but do as Brownlee does and try to tune everything at once in a loop!

* https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
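
The idea of trying everything in a loop can be sketched in plain Python. This is not the code from the linked post; it is a minimal illustration in which `architecture_grid`, `grid_search`, and `toy_score` are made-up names, and `toy_score` is a placeholder where you would really train a model and return its validation loss.

```python
from itertools import product

def architecture_grid(n_features, multipliers, depths, dropout_rates):
    """List every combination of layer width, depth, and dropout rate."""
    grid = []
    for multiplier, depth, dropout in product(multipliers, depths, dropout_rates):
        grid.append({
            "hidden_layers": [multiplier * n_features] * depth,
            "dropout": dropout,
        })
    return grid

def grid_search(grid, score_fn):
    """Return the configuration with the lowest score (e.g. validation loss)."""
    return min(grid, key=score_fn)

def toy_score(config):
    # Placeholder (assumption): stands in for "build, train, return val loss".
    return abs(sum(config["hidden_layers"]) - 52) + config["dropout"]

grid = architecture_grid(13, multipliers=(1, 2), depths=(1, 2, 3),
                         dropout_rates=(0.1, 0.3, 0.5))
print(len(grid))  # 18 candidate models
best = grid_search(grid, toy_score)
print(best)  # {'hidden_layers': [26, 26], 'dropout': 0.1}
```

Notice how quickly the grid grows: 2 widths x 3 depths x 3 dropout rates is already 18 models to train, which is why this strategy is marked optional.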

P.S. Just as an FYI, also check out how to build your own NN from scratch. In practice, use the Sequential() API from Keras to make things easier!
* https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/

## Bells and Whistles
You should always be using:
* Dropout between 0.1 and 0.5
* Early stopping with a suitable patience or other stopping criterion (min decrease in error).
* 'relu' or 'tanh' activation function in every hidden layer (the output node is the exception - there you have to use the appropriate 'linear', 'sigmoid' or 'softmax'). You need to know the difference between all of these!
* Batch size - try full-batch, stochastic (batch size of 1) or mini-batch gradient descent and see how that affects learning.
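
To show what "patience" means for early stopping, here is a sketch of the rule in plain Python. It is an illustration of the logic only, with a made-up function name and made-up validation losses; in a real model you would use Keras's `EarlyStopping` callback, which exposes `patience` and `min_delta` arguments.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch where training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:  # improved by more than min_delta
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # ran out of patience
    return len(val_losses) - 1, best_epoch  # never triggered

val_losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.69, 0.68]  # made-up values
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

With `patience=2` training stops at epoch 4 and misses the later improvement; with `patience=5` the same losses run to the end. That trade-off is what "a suitable patience" refers to.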

# (new!) Bayesian hyperparameter tuning with Optuna
* https://www.kaggle.com/code/mistag/keras-model-tuning-with-optuna

The notebook linked above uses Optuna to tune the hyperparameters of a recurrent neural network.

# Resources
* https://www.heatonresearch.com/2017/06/01/hidden-layers.html
* https://towardsdatascience.com/17-rules-of-thumb-for-building-a-neural-network-93356f9930af
* https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/