### Improving Keras Models

Welcome!  Today we're going to build on the previous lessons by improving the way we build and train TensorFlow models with Keras.

To recap:
 - We've discussed the basics of an NLP model with two different types of network layers:
   - `word embedding`:  the docking stations for each word in our training corpus, with their assigned set of weights
   - `dense layers`: the 'standard' neural network layer, which multiplies the incoming data by a block of weights in a linear fashion
 - We've discussed the basics of how to prep word data for said model:
   - cleaning up unnecessary characters due to formatting issues
   - tokenizing our data so that each word is mapped to its own index position based on its frequency in the training corpus
   - padding our word sequences so that they all have equal length
   
 - We've also discussed some other nuts and bolts of neural networks:
   - `activation functions`:  data transformations applied between layers to introduce non-linearity, which helps with pattern recognition; sometimes they are also used to coerce data into a particular format;
     - in the broadest terms, activation functions come in two flavors:
       - **internal activation functions**:  these tend to be simple, only slightly non-linear, and are usually done with the purpose of providing the neural network with easy gradients to calculate.  The `ReLU` is the most common example of this type.
       - **squashing functions**: these are designed to make data conform to a particular range of values, and are usually used at the end of classification models to turn output into a proper prediction.  There are usually two used:
         - **sigmoid:** takes a single number and squashes it into the range 0 to 1 so it can be read as a probability, e.g. `sigmoid(0) = 0.5`.  This is standard for binary classification.
         - **softmax:** the big brother of the sigmoid, it takes multiple raw scores and turns them into probabilities that all add up to 1.  The standard for multi-class classification.
   - `model compilation`:  In Scikit-learn, models come pre-packaged with specific loss functions (in many ways this defines what the models are).  In Keras, you choose the loss function yourself, and you do so at the compilation step.
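
As a quick refresher on the activation functions above, here is a minimal NumPy sketch of `ReLU`, `sigmoid`, and `softmax`. The function names and the sample scores are made up for illustration; Keras ships its own implementations, so treat this as a sketch of the math rather than what the library runs internally.

```python
import numpy as np

def relu(x):
    # keeps positive values and zeroes out negative ones
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # subtracting the max keeps the exponentials numerically stable
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, -1.0])  # made-up raw scores for three classes

print(relu(scores))           # the negative score becomes 0
print(sigmoid(0.0))           # 0.5
print(softmax(scores).sum())  # the three probabilities add up to 1 (up to rounding)
```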
   
In addition to this, we also got a chance to take a look at some more sophisticated network layers that do a better job of preserving sequential detail in your data -- a useful feature for time series and NLP problems.  

These two layers were:

 - `SimpleRNN`:  the standard way of capturing sequential dependence in a neural network.  It takes each time step in your sample, passes it through a set of matrix multiplications, and then feeds the output of that in alongside the input for the next time step. So by the time you get to your final time step, information from every previous word / time step leading up to it has been incorporated (with the most recent values accounting for more).
 - `LSTM`:  a slightly more advanced version of the `SimpleRNN`, this layer "gates" information at each multiplication to make information that was present early on in the sequence stay relevant for longer.  
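
The recurrence that `SimpleRNN` describes can be sketched in a few lines of NumPy. This is a simplified illustration, not the Keras implementation: the sizes and the random weights are made up, and a real layer learns its weights during training. `tanh` is used because it is the default activation for Keras's `SimpleRNN`.

```python
import numpy as np

rng = np.random.default_rng(0)

time_steps, n_features, n_units = 5, 4, 3       # small, made-up sizes
x = rng.normal(size=(time_steps, n_features))   # one sample: 5 time steps, 4 features each

# hypothetical weights; a real layer learns these during training
W_in = rng.normal(size=(n_features, n_units))   # current input -> hidden state
W_rec = rng.normal(size=(n_units, n_units))     # previous hidden state -> hidden state
b = np.zeros(n_units)

h = np.zeros(n_units)  # the hidden state starts out empty
for x_t in x:
    # each step mixes the current input with the state carried over from earlier steps
    h = np.tanh(x_t @ W_in + h @ W_rec + b)

print(h.shape)  # (3,)
```

The final `h` is what gets handed to the next layer. An LSTM follows the same loop but adds gates that control how much of the old state is kept at each step.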
 
Each of these can be added to a Keras model rather easily.  See below.

In [1]:
from tensorflow import keras
# a sample neural network with an RNN
rnn_mod = keras.models.Sequential([
      keras.layers.Embedding(10000, 64, input_length = 300),
      # notice we don't flatten -- recurrent layers take 3-D input (batch, time steps, features) and return 2-D output (batch, units) by default
      keras.layers.SimpleRNN(32),
      keras.layers.Dense(64, activation = 'relu'),
      keras.layers.Dense(64, activation = 'relu'),
      keras.layers.Dense(1, activation = 'sigmoid')
])

# a sample neural network with an LSTM cell
lstm_mod = keras.models.Sequential([
      keras.layers.Embedding(10000, 64, input_length = 300),
      # notice we don't flatten -- recurrent layers take 3-D input (batch, time steps, features) and return 2-D output (batch, units) by default
      keras.layers.LSTM(32),
      keras.layers.Dense(64, activation = 'relu'),
      keras.layers.Dense(64, activation = 'relu'),
      keras.layers.Dense(1, activation = 'sigmoid')
])

We'll also add one additional detail to using these layers: how to stack them on top of each other.  If you refer to our previous notebook, you'll recall that an RNN layer computes an output *at each time step*.  This means that for our 300-word reviews, there are 300 sets of outputs.  If you are going to pass the output from this layer into a dense layer, the first 299 are superfluous:  you only need the last one (since it's the final step in a sequence of calculations that make use of the previous inputs), and that is all Keras returns by default.

However, if you want to stack recurrent layers, the next recurrent layer needs the output from every time step.  To accommodate this, you need to pass in one additional argument when creating every recurrent layer except the last one:  `return_sequences = True`.

An example is below:

In [3]:
# keras model with stacked LSTM cells
lstm_mod_stacked = keras.models.Sequential([
    keras.layers.Embedding(10000, 64, input_length = 300),
    # notice the argument that we're adding in here -- important!
    keras.layers.LSTM(32, return_sequences = True),
    # no need to do that here since we're passing this into a dense layer
    keras.layers.LSTM(32),
    keras.layers.Dense(64, activation = 'relu'),
    keras.layers.Dense(64, activation = 'relu'),
    keras.layers.Dense(1, activation = 'sigmoid')
])
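
To make the shapes concrete, here is a rough NumPy sketch of what `return_sequences` changes. The `toy_rnn` function and its random weights are hypothetical stand-ins rather than Keras code, and the batch dimension is left out to keep things short.

```python
import numpy as np

def toy_rnn(x, W_in, W_rec, return_sequences=False):
    """Toy recurrent layer; x has shape (time_steps, features)."""
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x_t in x:
        h = np.tanh(x_t @ W_in + h @ W_rec)
        outputs.append(h)
    # every time step if requested, otherwise only the final one
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
embedded = rng.normal(size=(300, 64))  # stand-in for one embedded 300-word review

# hypothetical random weights for two stacked layers of 32 units each
W_in1, W_rec1 = rng.normal(size=(64, 32)) * 0.1, rng.normal(size=(32, 32)) * 0.1
W_in2, W_rec2 = rng.normal(size=(32, 32)) * 0.1, rng.normal(size=(32, 32)) * 0.1

first = toy_rnn(embedded, W_in1, W_rec1, return_sequences=True)  # feeds another recurrent layer
second = toy_rnn(first, W_in2, W_rec2)                           # feeds a dense layer

print(first.shape)   # (300, 32)
print(second.shape)  # (32,)
```

Without `return_sequences=True` on the first layer, the second layer would receive a single vector instead of a sequence and would have nothing to step through.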

### Making Improvements

We'll now discuss three different ways to improve upon our configurations from previous classes:
 - optimizers (algorithms that adjust your model's weights during training)
 - using batch sizes 
 - callbacks
 
#### Optimizers
After every forward pass in your neural network, your model measures its error with the loss function and uses the gradient of that loss to update the model weights.  It turns out there are a few different ways to update your model's weights from your gradient.  These different methods are known as optimizers.  This is not the most critical choice you can make, but choosing the right one can help in small ways.  

For the time being, there's a near-universal default choice:  Adam.  (You can read more here:  https://keras.io/api/optimizers/adam/).  It tends to work well because it adapts the step size for each weight, using running averages of recent gradients (momentum) and of their squares.
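
For intuition, the Adam update rule can be sketched in plain Python on a toy one-dimensional problem. The `adam_minimize` helper and the toy loss are made up for illustration, and the learning rate of 0.1 is chosen only so the toy converges quickly; this is a sketch of the published update rule, not the Keras source.

```python
import math

def adam_minimize(grad_fn, w, learning_rate=0.1, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-7, steps=500):
    m = 0.0  # running average of gradients (momentum)
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g * g
        # correct the bias from starting both averages at zero
        m_hat = m / (1 - beta_1 ** t)
        v_hat = v / (1 - beta_2 ** t)
        # weights with consistently large gradients take proportionally smaller steps
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

# toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(w_final)  # ends up close to 3
```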

To use this in our model, we'll specify it during the compilation step.  

In [5]:
# compilation step -- but specifying the adam optimizer
lstm_mod.compile(optimizer = keras.optimizers.Adam(learning_rate = 0.0001), 
                 loss='binary_crossentropy', 
                 metrics = ['acc'])


#### Callbacks

Callbacks are functions you can pass into a model during training to monitor its behavior.  They allow you to take model snapshots, monitor out-of-sample error, and adjust the learning rate throughout training.  

Since neural networks can get very complex, it's helpful to be able to adjust their behavior in the middle of training, vs. waiting until the very end to make modifications.  Keras has built-in callbacks for some of the most common uses, so let's take a look at them now.

We'll add callbacks that stop training if the model goes a set number of epochs without improvement (early stopping), and that reduce the learning rate if it goes a set number of epochs without improvement (learning rate annealing).
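
The "patience" idea behind both callbacks can be sketched in plain Python. The `simulate_callbacks` helper and the validation losses below are hypothetical, and the logic is deliberately simplified (the real Keras callbacks have extra options such as a minimum improvement threshold), so read it as an illustration of the counting rather than the library's implementation.

```python
def simulate_callbacks(val_losses, stop_patience, lr_patience, learning_rate, factor=0.1):
    """Walk through per-epoch validation losses roughly the way the two callbacks would."""
    best_loss = float("inf")
    best_epoch = None
    epochs_since_best = 0    # counter for early stopping
    epochs_since_lr_cut = 0  # counter for learning rate annealing
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # new best score: remember it and reset both counters
            best_loss, best_epoch = loss, epoch
            epochs_since_best = 0
            epochs_since_lr_cut = 0
        else:
            epochs_since_best += 1
            epochs_since_lr_cut += 1
        if epochs_since_best >= stop_patience:
            # early stopping: give up and report which epoch had the best weights
            return epoch, best_epoch, learning_rate
        if epochs_since_lr_cut >= lr_patience:
            # annealing: shrink the learning rate and start counting again
            learning_rate *= factor
            epochs_since_lr_cut = 0
    return len(val_losses) - 1, best_epoch, learning_rate

# made-up validation losses: improvement stalls after the third epoch (index 2)
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]
stopped_at, best_epoch, final_lr = simulate_callbacks(
    losses, stop_patience=4, lr_patience=2, learning_rate=0.001)
print(stopped_at, best_epoch, final_lr)
```

Here training stops at epoch index 6, four epochs after the best score at index 2, and the learning rate has been cut once along the way.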

#### Batch Sizes

It's better to feed your data to a neural network in small sips vs. big gulps.  What this means is that you should break your training data into smaller chunks before sending it through the network.  If your training data is 1000 samples and your batch size is 100, a single epoch will actually consist of 10 different batches of 100.  After each batch, your model will update its weights, making the next batch a little more effective than the last one.  This typically makes training faster, and it often helps accuracy as well.

Like layer size, it's conventional to make this number a multiple of 32.  Batches of 64 are a good starting point for this parameter.
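
The batch arithmetic can be checked with a short Python sketch. The 1000-sample and 100-per-batch figures come from the example above; the 800-sample case is a hypothetical that assumes 1000 samples with 20% held out for validation.

```python
import math

n_samples, batch_size = 1000, 100

# each tuple is the (start, stop) slice of the training data used for one batch
batches = [(start, min(start + batch_size, n_samples))
           for start in range(0, n_samples, batch_size)]
print(len(batches))  # 10 weight updates per epoch

# when the batch size does not divide evenly, the last batch is simply smaller
n_train = 800  # hypothetical: 1000 samples with 20% held out for validation
print(math.ceil(n_train / 64))  # 13 batches: twelve of 64 and one of 32
```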

Both the batch size and the callbacks are specified when you call the `fit()` method.

Please see below:

In [None]:
# and now we'll fit -- notice the use of callbacks
lstm_mod.fit(X_train, y_train, 
             epochs = 5, 
             # specify batch size here
             batch_size = 64,
             validation_split = 0.2, 
             # enter your list of callbacks into this argument
             callbacks = [
                 # the EarlyStopping callback will shut your model down if no improvement is observed after a certain time
                 keras.callbacks.EarlyStopping(monitor="val_loss", 
                                               # the model will turn itself off after 20 rounds w/ no improvement (only matters when epochs > 20)
                                               patience=20, 
                                               # when we are finished, restore weights with the best validation score
                                               restore_best_weights=True),
                 # this will lower the learning rate if we go 10 rounds without improvement
                 keras.callbacks.ReduceLROnPlateau(monitor = "val_loss", patience = 10, verbose = 1)])

**Your Turn:** Re-run the model from the previous class, but with the following details:

 - Stack two LSTM layers with 64 units each before connecting to your dense layers
 - Use Adam as your optimizer, with a learning rate of 0.001
 - Fit your model with a batch size of 64, and the following two callbacks:
   - Early stopping after 4 rounds of no improvement, restore the best weights
   - Adjust the learning rate after 2 rounds of no improvement
 - Fit the model for 300 rounds of training