
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Advanced Keras

Congrats on building your first neural network! In this notebook, we will cover even more topics to improve your model building. After you learn the concepts here, you will apply them to the neural network you just created.

We will use the California Housing Dataset.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Perform data standardization for better model convergence
 - Add validation data
 - Generate model checkpointing/callbacks
 - Use TensorBoard
 - Apply dropout regularization

In [3]:
%run "./Includes/Classroom-Setup"

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

cal_housing = fetch_california_housing()

# split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                    cal_housing.target,
                                                    test_size=0.2,
                                                    random_state=1)

print(cal_housing.DESCR)

Let's take a look at the distribution of our features.

In [6]:
import pandas as pd

pd.DataFrame(X_train, columns=cal_housing.feature_names).describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | 3.876149 | 28.604469 | 5.441114 | 1.099598 | 1425.257146 | 3.094971 | 35.632194 | -119.574288 |
| std | 1.891584 | 12.586046 | 2.613727 | 0.507173 | 1123.756792 | 11.597402 | 2.137087 | 2.007578 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.75 | 32.54 | -124.3 |
| 25% | 2.57205 | 18.0 | 4.439906 | 1.00626 | 786.0 | 2.427283 | 33.93 | -121.81 |
| 50% | 3.54455 | 29.0 | 5.226528 | 1.048797 | 1164.0 | 2.813449 | 34.26 | -118.49 |
| 75% | 4.75 | 37.0 | 6.057778 | 1.099574 | 1723.0 | 3.273834 | 37.71 | -118.01 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 |


## 1. Data Standardization

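Because our features are on very different scales (compare `Population`, with a mean around 1,425, to `AveBedrms`, with a mean around 1.1), gradient-based training will converge more slowly and less reliably. Let's do feature-wise standardization.

We are going to use the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn, which subtracts each feature's mean and scales it to unit variance. We fit the scaler on the training data only and reuse those statistics to transform the test data.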

$$x' = \frac{x - \bar{x}}{\sigma}$$

In [8]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
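
To make the formula concrete, here is a minimal plain-Python sketch of the same fit-then-transform pattern for a single feature column. The helper names `fit_scaler` and `transform` and the toy values are hypothetical and only for illustration; the sketch uses the population standard deviation, which is what `StandardScaler` computes.

```python
import math

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature column."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, math.sqrt(variance)

def transform(column, mean, std):
    """Apply x' = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical toy values, not taken from the housing data
train_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_column = [3.0, 11.0]

# Statistics come from the training column only
mean, std = fit_scaler(train_column)
train_scaled = transform(train_column, mean, std)
test_scaled = transform(test_column, mean, std)  # reuses the training mean and std

print(mean, std)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test column mirrors calling `fit_transform` on `X_train` and only `transform` on `X_test`, so no information from the test set leaks into preprocessing.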

## Keras Model

In [10]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense  
tf.random.set_seed(42)

model = Sequential([
  Dense(20, input_dim=8, activation="relu"),
  Dense(20, activation="relu"),
  Dense(1, activation="linear")
])

# Sequential creates a linear stack of layers
# Each layer has its own activation function (AF)

In [11]:
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.01), loss="mse", metrics=["mse"])

## 2. Validation Data

Let's take a look at the [.fit()](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit) method in the docs to see all of the options we have available! 

We can either explicitly specify a validation dataset, or we can specify a fraction of our training data to be used as our validation dataset.

We need a validation dataset to evaluate how well the model performs on unseen data while it trains (neural networks will overfit if you train them for too long!).

We can specify `validation_split` to be any value between 0.0 and 1.0 (defaults to 0.0).

In [13]:
history = model.fit(X_train, y_train, validation_split=.2, epochs=10, verbose=2)
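
It can help to see which rows `validation_split` actually holds out. According to the Keras documentation, the validation data is taken from the last samples of the arrays passed to `.fit()`, before any shuffling. The plain-Python sketch below imitates that slicing; the helper name `split_for_validation` is hypothetical, and the row count matches the 16,512 training rows shown earlier.

```python
import math

def split_for_validation(samples, validation_split):
    """Hold out the last fraction of the samples for validation, without shuffling."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

# Row indices stand in for the 16,512 rows of X_train
rows = list(range(16512))
train_rows, val_rows = split_for_validation(rows, 0.2)

print(len(train_rows), len(val_rows))
```

Because the held-out rows are simply the tail of the array, data that is sorted in some meaningful order should be shuffled before calling `.fit()` with `validation_split`. Here `train_test_split` already shuffled the rows.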

## 3. Checkpointing

After each epoch, we want to save the model. However, we will pass in the flag `save_best_only=True`, which saves the model only when the monitored metric (validation loss by default) improves on the best value seen so far. This way, if our machine crashes or we start to overfit, we can always go back to the "good" state of the model.

To accomplish this, we will use the ModelCheckpoint [callback](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint). History is an example of a callback that is automatically applied to every Keras model.

In [15]:
from tensorflow.keras.callbacks import ModelCheckpoint

filepath = f"{working_dir}/keras_checkpoint_weights.ckpt"

model_checkpoint = ModelCheckpoint(filepath=filepath, verbose=1, save_best_only=True)
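
As a rough illustration of what `save_best_only=True` does, the plain-Python sketch below tracks the best validation loss seen so far and records the epochs at which a checkpoint would be written. The helper name `epochs_that_save` and the loss values are hypothetical; this imitates the behavior and is not the Keras implementation.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which the validation loss improves on the best so far."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than every earlier epoch
            best = loss
            saved.append(epoch)
    return saved

# Hypothetical validation losses for five epochs
val_losses = [0.62, 0.48, 0.51, 0.45, 0.47]
saved_epochs = epochs_that_save(val_losses)

print(saved_epochs)  # [1, 2, 4]
```

Epochs 3 and 5 are skipped because their losses are worse than the best value already on disk, so the checkpoint file always holds the best model so far.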

In [16]:
working_dir

## 4. TensorBoard

[TensorBoard](https://www.tensorflow.org/tensorboard/get_started) provides a nice UI to visualize and debug your neural networks. We can also define it as a callback.

In [18]:
from tensorflow.keras.callbacks import TensorBoard

log_dir = f"{working_dir}/_tb.dir"
tensorboard = TensorBoard(log_dir)

dbutils.tensorboard.start(log_dir) # Will be empty until we call .fit() below

# dbutils is a Databricks utility

Now let's add our model checkpoint and TensorBoard callbacks to our `.fit()` command.

In [20]:
history = model.fit(X_train, y_train, validation_split=.2, epochs=10, verbose=2, callbacks=[model_checkpoint, tensorboard])

In [21]:
model.evaluate(X_test, y_test)

# The validation loss is a bit higher than the training loss, so the model may be overfitting

## 5. Dropout Regularization

It's generally more difficult to overtrain neural networks than classical machine learning methods.  Overfitting still happens, though, especially with smaller datasets, and it is caused in part by co-adapted neurons (units that only work well in combination with specific other units).

**Dropout is a regularization method that reduces overfitting by randomly and temporarily removing nodes during training.**  It works like this:<br><br>

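- Apply to most types of layers (e.g. fully connected, convolutional, recurrent) and larger networks
- Set an additional probability for keeping each node (note that the Keras `Dropout` layer takes the fraction to *drop*, so a keep probability of .9 is written `Dropout(.1)`)
  - .5 is a good starting place for hidden layers
  - .9 is a good starting place for the input layer
- Temporarily and randomly remove nodes and their connections during each training cycle
- Since this results in larger weights, rescale to compensate: the original paper multiplies the weights by the keep probability at test time, while Keras instead scales the kept activations up by `1 / (1 - rate)` during training
- Score on the entire architecture (without dropout)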

![](https://files.training.databricks.com/images/nn_dropout.png)

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the original paper here: [Dropout: A Simple Way to Prevent Neural Networks from
Overfitting](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
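
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration of "inverted" dropout, the variant in which kept activations are scaled up during training so that nothing needs rescaling at scoring time. The function name `dropout` and the activation values are hypothetical, and this is not the Keras implementation.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each unit with probability `rate` during training and scale the
    survivors by 1 / (1 - rate); return the inputs unchanged when not training."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(42)  # seeded so the mask is reproducible
hidden = [0.5, 1.0, 1.5, 2.0]  # hypothetical hidden-layer activations

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.5, training=False, rng=rng)

print(train_out)  # each entry is either 0.0 or twice the original value
print(infer_out)  # identical to `hidden`
```

A different random mask is drawn on every call, so each training step effectively updates a different thinned sub-network, while scoring always uses the full network.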

Redefine a larger network.  Note the following changes:<br><br>

- We'll use a dropout rate of .5
- Increase the size of the network to compensate for the dropped units (`original network size` / `keep probability`; with a rate of .5 this doubles each hidden layer from 20 to 40 units)
- Large weight sizes can be a sign of an unstable network.  Manage this using a weight constraint to force the magnitude of all weights to be below a specific value.  Typical values for this constraint `c` are between 3 and 4.
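
The sizing rule and the weight constraint can both be checked with a little arithmetic. The sketch below is plain Python with hypothetical helper names (`widen`, `apply_max_norm`) and toy weights. It assumes the divisor in the sizing rule is the keep probability (`1 - rate`), as in the dropout paper; at a rate of .5 the keep probability and the rate are the same number. The constraint rescales one unit's incoming weight vector only when its L2 norm exceeds `c`.

```python
import math

def widen(units, dropout_rate):
    """Units needed so that, on average, `units` stay active after dropout."""
    return round(units / (1.0 - dropout_rate))

def apply_max_norm(weights, c):
    """Rescale a unit's incoming weight vector so its L2 norm is at most c."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= c:
        return list(weights)
    return [w * c / norm for w in weights]

wider_units = widen(20, 0.5)
print(wider_units)  # 40

# Hypothetical incoming weights with L2 norm 13, clipped back to norm 4
clipped = apply_max_norm([3.0, 4.0, 12.0], c=4.0)
clipped_norm = math.sqrt(sum(w * w for w in clipped))
print(clipped, clipped_norm)
```

The constraint changes only the length of the weight vector, not its direction, which is why it can tame exploding weights without discarding what the unit has learned.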

In [24]:
import tensorflow as tf
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
tf.random.set_seed(42)

def create_dropout_model():
  model = Sequential()
  model.add(Dropout(.1)) # drop 10% of the input features (keep probability .9)
  model.add(Dense(40, input_dim=8, activation="relu", kernel_constraint=max_norm(4))) # rule of thumb: max-norm value between 3 and 4
  model.add(Dropout(.5)) # drop 50% of the units
  model.add(Dense(40, activation="relu", kernel_constraint=max_norm(4)))
  model.add(Dropout(.5))
  model.add(Dense(1, activation="linear"))
  return model

# Dropout layers can be placed before or after Dense layers

dropoutModel = create_dropout_model()

Compile the model with a learning rate increased by 1-2 orders of magnitude.

> "Although dropout alone gives significant improvements, using dropout along with maxnorm regularization, large decaying learning rates and high momentum provides a significant
boost over just using dropout"
> [- Dropout: A Simple Way to Prevent Neural Networks from
Overfitting](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
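
The quote mentions large *decaying* learning rates, while the cell below uses a fixed rate. As a hedged illustration of what a decaying schedule could look like, here is a plain-Python sketch of exponential decay. The function name `decayed_learning_rate` and the schedule values are hypothetical and are not settings used elsewhere in this notebook.

```python
def decayed_learning_rate(initial_lr, decay_rate, decay_steps, step):
    """Exponential decay: multiply the rate by `decay_rate` every `decay_steps` steps."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Hypothetical schedule: start at 0.1 and halve every 5 epochs
schedule = [decayed_learning_rate(0.1, 0.5, 5, epoch) for epoch in (0, 5, 10)]

print(schedule)
```

Starting high lets the noisy dropout network move quickly at first, and the decay then lets training settle as it approaches a good solution.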

In [26]:
dropoutModel.compile(optimizer=Adam(learning_rate=0.1), loss="mse", metrics=["mse"])

dropoutHistory = dropoutModel.fit(X_train, y_train, validation_split=.2, epochs=10, verbose=2)

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Batch normalization is an alternative approach.  This will be discussed in the lesson on CNNs.

Now it's your turn to try out these techniques on the Boston Housing Dataset!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>