# Introduction to Deep Learning with Keras
Deep learning is here to stay! It's the go-to technique to solve complex problems that arise with unstructured data and an incredible tool for innovation. Keras is one of the frameworks that make it easier to start developing deep learning models, and it's versatile enough to build industry-ready models in no time. In this course, you will learn regression and save the earth by predicting asteroid trajectories, apply binary classification to distinguish between real and fake dollar bills, use multiclass classification to decide who threw which dart at a dart board, learn to use neural networks to reconstruct noisy images and much more. Additionally, you will learn how to better control your models during training and how to tune them to boost their performance.

**Instructor:** Miguel Esteban, co-founder of Xtreme AI

## $\star$ Chapter 1: Introducing Keras
In this first chapter, you will get introduced to neural networks, understand what kind of problems they can solve, and when to use them. You will also build several networks and save the earth by training a regression model that approximates the orbit of a meteor that is approaching us!

* Keras is a high-level deep learning framework.

#### Theano vs Keras
* Theano is a lower level framework
* Building a neural network in Theano can take many lines of code and requires a deep understanding of how they work internally 

<img src='data/Theano_vs_Keras.png' width="700" height="350" align="center"/>

* Building and training this very same network in Keras only takes a few lines of code

#### Keras
* Open source deep learning Framework
* Enables fast experimentation with neural networks
* Runs on top of other frameworks like TensorFlow, Theano or CNTK
* Created by French AI researcher François Chollet

#### Why Keras instead of other low-level libraries like TensorFlow?
* Fast industry-ready models
* For beginners and experts
* Less code
* Allows for quickly and easily checking if a neural network will solve your problems
* Build any architecture, from:
    * simple networks
    * more complex networks
    * auto-encoders
    * convolutional neural networks (CNNs)
    * recurrent neural networks (RNNs)
* Deploy models on multiple platforms (Android, iOS, web apps, etc.)
* Keras is now fully integrated into TensorFlow 2 and is TensorFlow's high-level framework of choice
* Keras is complementary to TensorFlow
* **Can also use TensorFlow in same code pipeline if, as you dive into deep learning, you find yourself needing to use low-level features, have a finer control of how your network applies gradients, etc.**

#### Feature Engineering
* Neural networks are good feature extractors, since they learn the best way to make sense of **unstructured data.**
* NNs can learn the best features and their combinations, and can perform feature engineering themselves

#### Unstructured data
* **Unstructured data** is data that is not easily put into a table
* Examples: sound, video, images, etc.
* These are also the types of data where performing feature engineering can be more challenging, so leaving this task to NNs is often a good idea.

#### When to use neural networks
* Dealing with unstructured data
* Don't need easily interpretable results
* You can benefit from a known architecture

### Simple Neural Networks
* A **neural network** is a machine learning algorithm with the training data being the input to the input layer and the predicted value being the value at the output layer
* Each connection from one neuron to another has an associated weight, $w$.

<img src='data/bias_weight.png' width="400" height="200" align="center"/>

* Each neuron, except the input layer which just holds the input value, also has an extra weight we call the bias weight, $b$.
* During **forward propagation** (feed-forward), our input gets transformed by weight multiplications and additions at each layer. The output of each neuron can also be transformed by applying what's called an **activation function**.
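As a rough illustration of the forward pass described above, here is a minimal pure-Python sketch. The weights, biases, and inputs are made-up values, and `neuron_output` and `dense_forward` are hypothetical helper names, not Keras functions.

```python
def relu(z):
    """ReLU activation: negative values become 0."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation=None):
    """Weighted sum of the inputs plus the bias, optionally activated."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z) if activation is not None else z

def dense_forward(inputs, weight_rows, biases, activation=None):
    """One fully-connected layer: every neuron sees every input."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]

# Hypothetical network: 3 inputs -> 2 hidden neurons (ReLU) -> 1 output
inputs = [1.0, 2.0, 3.0]
hidden = dense_forward(inputs,
                       [[0.5, 1.0, -0.5], [-1.0, 0.0, 0.0]],
                       [0.5, 0.0],
                       activation=relu)
output = dense_forward(hidden, [[2.0, -1.0]], [1.0])

print(hidden)  # [1.5, 0.0]
print(output)  # [4.0]
```

The second hidden neuron computes a negative sum, so ReLU sets it to 0 and it contributes nothing to the output.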

<img src='data/grad_descent.png' width="600" height="300" align="center"/>

* Learning in neural networks consists of tuning the weights or parameters to give the desired output
* One way of achieving this is by using gradient descent and applying weight updates incrementally via back-propagation
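To make the idea of incremental weight updates concrete, the sketch below runs gradient descent on a single weight. It assumes a one-weight model `y = w * x`, a mean squared error loss, made-up data, and a hand-derived gradient; Keras computes the gradients for every weight automatically through back-propagation.

```python
# Hypothetical data generated by y = 3 * x, so the ideal weight is 3.0
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

def mse_gradient(w, xs, ys):
    """Derivative of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # initial weight
learning_rate = 0.05  # assumed step size

# Each step nudges the weight against the gradient of the loss
for _ in range(100):
    w -= learning_rate * mse_gradient(w, xs, ys)

print(round(w, 4))  # 3.0
```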

### The sequential API
* Keras allows you to build models in 2 different ways; using either the Functional API or the Sequential API
* The sequential API is a simple, yet very powerful way of building neural networks that will get you covered for most use cases
* With the sequential API, you're essentially building a model as a stack of layers

```
from keras.models import Sequential
from keras.layers import Dense

# Create a new sequential model
model = Sequential()

# Add an input AND dense layer
model.add(Dense(2, input_shape=(3,)))
```
* **In this last line of code, *we add 2 layers*: a 2-neuron fully-connected Dense layer, and an input layer consisting of 3 neurons.**

```
# Add a final 1-neuron layer
model.add(Dense(1))
```

#### model.summary()
* **Parameters** are the weights, including the bias weight of each neuron in a given layer (or the model as a whole)
* **When the input layer is defined via the `input_shape` parameter, as we did before, it is not shown as a layer in the summary, but is included in the layer where it was defined** (in the case above, within the first Dense layer).

<img src='data/visualize_parameters.png' width="600" height="300" align="center"/>

* The image above helps illustrate why the layer **`dense_3`** has **8 parameters**:
    * **6 parameters** (or weights) come from the connections of the 3 input neurons to the 2 neurons in the layer (green lines)
    * **2 parameters** come from the bias weights, `b0`, and `b1`, 1 per each neuron in the hidden layer
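The same count can be reproduced with a one-line formula: each neuron has one weight per input plus one bias. The helper name `dense_params` is hypothetical and only illustrates the arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a Dense layer: one weight per connection plus one bias per neuron."""
    return n_inputs * n_units + n_units

first_layer = dense_params(3, 2)   # 3 inputs -> 2 neurons
output_layer = dense_params(2, 1)  # 2 neurons -> 1 output neuron

print(first_layer)                 # 8
print(output_layer)                # 3
print(first_layer + output_layer)  # 11
```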
    
#### Exercises: Hello nets!
You're going to build a simple neural network to get a feeling for how quick it is to accomplish this in Keras.

You will build a network that **takes two numbers as an input**, passes them through **a hidden layer of 10 neurons**, and finally **outputs a single non-constrained number**.

A **non-constrained output can be obtained by avoiding setting an activation function in the output layer**. This is useful for problems like regression, when we want our output to be able to take any non-constrained value.

<img src='data/ex1_vis.png' width="400" height="200" align="center"/>

```
# Import the Sequential model and Dense layer
from keras.models import Sequential
from keras.layers import Dense

# Create a Sequential model
model = Sequential()

# Add an input layer and a hidden layer with 10 neurons
model.add(Dense(10, input_shape=(2,), activation="relu"))

# Add a 1-neuron output layer
model.add(Dense(1))

# Summarise your model
model.summary()
```

#### Exercises: Counting Parameters
You've just created a neural network. But you're going to create a new one now, taking some time to think about the weights of each layer. The Keras `Dense` layer and the `Sequential` model are already loaded for you to use.

This is the network you will be creating:

<img src='data/ex2_vis.png' width="300" height="150" align="center"/>

```
# Instantiate a new Sequential model
model = Sequential()

# Add a Dense layer with five neurons and three inputs
model.add(Dense(5, input_shape=(3,), activation="relu"))

# Add a final Dense layer with one neuron and no activation
model.add(Dense(1))

# Summarize your model
model.summary()
```
**There are 20 parameters, 15 from the connections of our inputs to our hidden layer and 5 from the bias weight of each neuron in the hidden layer.**

#### Exercises: Build as shown!
You will take on a final challenge before moving on to the next lesson. Build the network shown in the picture below. Prove you've mastered Keras basics in no time!

<img src='data/ex3_vis.png' width="200" height="100" align="center"/>

```
from keras.models import Sequential
from keras.layers import Dense

# Instantiate a Sequential model
model = Sequential()

# Build the input and hidden layer
model.add(Dense(3, input_shape=(2,)))

# Add the output layer
model.add(Dense(1))
```

### Surviving a meteor
* **The `loss` function is the function we are trying to minimize during training.**
* **Compiling a model produces no output.**
    * But, our model is now ready to train.
* Creating a model is useless if we don't train it.

#### Training

```
# Train your model
model.fit(X_train, y_train, epochs=5)
```

#### Predicting
* We can store predictions in a variable for later use.
* The predictions are stored as numbers in a numpy array 

```
# Predict on new data
preds = model.predict(X_test)

# Look at the predictions
print(preds)
```

#### Evaluating
* `evaluate` performs a feed-forward pass, which consists of computing the model's output from a given set of inputs (here, `X_test`)
* It then computes the error by comparing the results to the true values stored in `y_test`

```
# Evaluate your results
model.evaluate(X_test, y_test)
```
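As a sketch of the error computation, assume the model was compiled with mean squared error (`'mse'`) as its loss. The predictions and true values below are made up; in practice they would come from `model.predict(X_test)` and `y_test`.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true values and model predictions
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3
loss = mean_squared_error(y_true, y_pred)
print(round(loss, 4))  # 0.4167
```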

#### Exercises: Specifying a model 
You will build a simple regression model to predict the orbit of the meteor!

Your training data consist of measurements taken at time steps from **-10 minutes before the impact region to +10 minutes after**. Each time step can be viewed as an X coordinate in our graph, which has an associated position Y for the meteor orbit at that time step.

*Note that you can view this problem as approximating a quadratic function via the use of neural networks.*

<img src='data/impact_reg.png' width="300" height="150" align="center"/>

This data is stored in two numpy arrays: one called `time_steps` , what we call *features*, and another called `y_positions`, with the *labels*. Go on and build your model! It should be able to predict the y positions for the meteor orbit at future time steps.

Keras `Sequential` model and `Dense` layers are available for you to use.

```
# Instantiate a Sequential model
model = Sequential()

# Add a Dense layer with 50 neurons and an input of 1 neuron
model.add(Dense(50, input_shape=(1,), activation='relu'))

# Add two Dense layers with 50 neurons and relu activation
model.add(Dense(50,activation='relu'))
model.add(Dense(50,activation='relu'))

# End your model with a Dense layer and no activation
model.add(Dense(1))
```

#### Training
You're going to train your first model in this course, and for a good cause!

Remember that **before training your Keras models you need to compile them**. This can be done with the `.compile()` method. The `.compile()` method takes arguments such as the `optimizer`, used for weight updating, and the `loss` function, which is what we want to minimize. Training your model is as easy as calling the `.fit()` method, passing on the features, labels and a number of epochs to train for.

The regression `model` you built in the previous exercise is loaded for you to use, along with the `time_steps` and `y_positions` data. Train it and evaluate it on this very same data, let's see if your model can learn the meteor's trajectory.

```
# Compile your model
model.compile(optimizer = 'adam', loss = 'mse')

print("Training started..., this can take a while:")

# Fit your model on your data for 30 epochs
model.fit(time_steps,y_positions, epochs = 30)

# Evaluate your model 
print("Final loss value:",model.evaluate(time_steps, y_positions))
```

#### Exercises: Predicting the orbit!
You've already trained a `model` that approximates the orbit of the meteor approaching Earth and it's loaded for you to use.

Since you trained your model for values between -10 and 10 minutes, your model hasn't yet seen any other values for different time steps. You will now visualize how your model behaves on unseen data.

If you want to check the source code of `plot_orbit`, paste `show_code(plot_orbit)` into the console.

Hurry up, the Earth is running out of time!

*Remember `np.arange(x,y)` produces a range of values from **x** to **y-1**. That is the `[x, y)` interval.*

```
# Predict the twenty minutes orbit
twenty_min_orbit = model.predict(np.arange(-10, 11))

# Plot the twenty minute orbit 
plot_orbit(twenty_min_orbit)

# Predict the eighty minute orbit
eighty_min_orbit = model.predict(np.arange(-40, 41))

# Plot the eighty minute orbit 
plot_orbit(eighty_min_orbit)
```
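A quick check of the half-open interval produced by `np.arange`, which is why the upper bounds above are `11` and `41`. The reshape at the end is optional: it only makes the `(n_samples, 1)` shape that matches `input_shape=(1,)` explicit.

```python
import numpy as np

# The stop value is excluded, so 11 is needed to include minute 10
twenty = np.arange(-10, 11)
print(twenty[0], twenty[-1], len(twenty))  # -10 10 21

eighty = np.arange(-40, 41)
print(eighty[0], eighty[-1], len(eighty))  # -40 40 81

# Explicit column shape: one feature per sample
column = twenty.reshape(-1, 1)
print(column.shape)  # (21, 1)
```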

### Binary classification
* We use binary classification when we want to predict whether an observation belongs to one of two possible classes

<img src='data/bin_class_plot.png' width="500" height="250" align="center"/>

* The coordinates are pairs of values corresponding to the X and Y coordinates of each circle in the graph
* The labels are `1` for red circles and `0` for blue circles 

#### Pairplot
* We can make use of seaborn's `pairplot` function to explore a small dataset and identify whether our classification problem will be easily separable
* We can get an intuition for this if we see that the classes separate well enough along several variables

```
import seaborn as sns

# Plot a pairplot
sns.pairplot(circles, hue='target')
```

<img src='data/pplots.png' width="400" height="200" align="center"/>

* In this case, for the circles dataset, there is a very clear boundary. The red circles concentrate at the center while the blue ones are outside. It should be easy for our network to find a way to separate them just based on x and y coordinates.

#### The sigmoid function
* You can consider the output of the sigmoid function as the probability of a pair of coordinates being in one class or another
* So we can set a threshold and say everything below 0.5 will be a blue circle and everything above 0.5, a red circle.
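A minimal sketch of the sigmoid and the 0.5 threshold described above. The input `z` stands for the value the output neuron computes before activation; `predict_class` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Return 1 (red circle) above the threshold, otherwise 0 (blue circle)."""
    return 1 if sigmoid(z) > threshold else 0

print(sigmoid(0.0))         # 0.5
print(predict_class(2.0))   # 1
print(predict_class(-2.0))  # 0
```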

#### Binary crossentropy
* **Binary cross-entropy** is the function we use when our output neuron is using sigmoid as its activation function $\star$
* In binary classification, a sigmoid output neuron is used to predict the probability of the positive class, just as in logistic regression.
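For reference, binary cross-entropy for a single example can be written by hand as below. This is an illustrative sketch with made-up probabilities; Keras computes it internally (averaged over the batch) when `loss='binary_crossentropy'` is used.

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one example: y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A confident correct prediction is penalized far less than a confident wrong one
confident_right = binary_crossentropy(1, 0.9)
confident_wrong = binary_crossentropy(1, 0.1)

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```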

```
# Compile model
model.compile(optimizer='sgd', loss='binary_crossentropy')

# Train model
model.fit(coordinates, labels, epochs=20)

# Predict with trained model
preds = model.predict(coordinates)
```

* Note that we obtain the predicted probabilities by calling `predict` on `coordinates`; thresholding them at 0.5 gives the predicted labels

<img src='data/circ_class.png' width="400" height="200" align="center"/>

```
# Import the sequential model and dense layer
from keras.models import Sequential
from keras.layers import Dense

# Create a sequential model
model = Sequential()

# Add a dense layer 
model.add(Dense(1, input_shape=(4,), activation='sigmoid'))

# Compile your model
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Display a summary of your model
model.summary()

# Train your model for 20 epochs
model.fit(X_train, y_train, epochs = 20)

# Evaluate your model accuracy on the test set
accuracy = model.evaluate(X_test, y_test)[1]

# Print accuracy
print('Accuracy:', accuracy)
```

### Multi-class classification
* If we have more than two classes to classify, then we have a multi-class classification problem.
* Outcomes of multi-class classification problems must be **mutually exclusive**
* The activation function of choice for the *final* layer of a multi-class classification neural network is `softmax`.
* Output values will be provided as probabilities, and the class with the highest probability is chosen as the model's prediction.

```
# Instantiate a sequential model 
# ...
# Add an input and hidden layer
# ...
# Add more hidden layers
# ...
# Add your output layer
model.add(Dense(4, activation='softmax'))
```
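To see how `softmax` turns the raw outputs of the last layer into probabilities, here is a small pure-Python sketch. The four input values are made up, and subtracting the maximum is a common numerical-stability trick rather than something specific to Keras.

```python
import math

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of a 4-neuron output layer
probs = softmax([2.0, 1.0, 0.1, -1.0])

# The predicted class is the index with the highest probability
predicted_class = probs.index(max(probs))
print(predicted_class)  # 0
```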

#### Categorical cross-entropy
* When compiling your model, instead of binary cross-entropy as is used for binary classification problems, we use categorical cross-entropy (aka **log loss**).
* **Categorical cross-entropy** measures the difference between the predicted probabilities and the true label of the class we should have predicted

<img src='data/log_loss1.png' width="700" height="350" align="center"/>

* So, if we should have predicted `1` for a given class, by taking a look at the graph above, we see we would get high loss values for predicting close to `0` (since we'd be "very" wrong) and low loss values for predicting closer to 1 (the true label).

* Since our outputs are vectors containing the probabilities of each class, our neural network must also be trained with vectors representing this concept. To achieve that, we make use of the `keras.utils.to_categorical` function
* We first turn our response variable into a categorical variable with pandas `Categorical`; this allows us to redefine the column using the categorical codes (**cat codes**) of the different categories
* Once our categories are each represented by a unique integer, we can use the `to_categorical` function to turn them into one-hot-encoded vectors, where each component is 0 except for the one corresponding to the labeled categories

```
import pandas as pd
from keras.utils import to_categorical

# Load dataset
df = pd.read_csv('data.csv')

# Turn response variable into labeled codes
df.response = pd.Categorical(df.response)
df.response = df.response.cat.codes

# Turn response variable into one-hot response vector
y = to_categorical(df.response)
```

#### Label encoding vs one-hot encoding

<img src='data/label_vs_ohe.png' width="400" height="200" align="center"/>

* Keras `to_categorical` essentially performs the process described in the picture above (transforming label-encoded variables into one-hot-encoded variables).
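The two encodings, and the categorical cross-entropy computed against a one-hot label, can be sketched in plain Python. The class names and predicted probabilities are made up, and `one_hot` is a hypothetical stand-in for what `to_categorical` does to a single label.

```python
import math

def one_hot(code, n_classes):
    """Vector of zeros with a single 1 at the position of the label code."""
    return [1.0 if i == code else 0.0 for i in range(n_classes)]

def categorical_crossentropy(y_true, y_pred):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# Label encoding: each category gets an integer code (sorted order, as pandas does by default)
names = ['blue', 'green', 'red', 'green']
categories = sorted(set(names))
codes = [categories.index(name) for name in names]
print(codes)  # [0, 1, 2, 1]

# One-hot encoding of the label codes
vectors = [one_hot(code, len(categories)) for code in codes]
print(vectors[2])  # [0.0, 0.0, 1.0]

# The loss is low when the true class gets a high probability, and high otherwise
good = categorical_crossentropy(vectors[2], [0.1, 0.1, 0.8])
bad = categorical_crossentropy(vectors[2], [0.7, 0.2, 0.1])
print(good < bad)  # True
```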

#### Exercises: A multi-class model
You're going to build a model that predicts who threw which dart only based on where that dart landed! (That is the dart's x and y coordinates on the board.)

This problem is a multi-class classification problem since each dart can only be thrown by one of 4 competitors. So classes/labels are mutually exclusive, and therefore we can build an output layer with as many neurons as competitors and use the `softmax` activation function to achieve a total sum of probabilities of 1 over all competitors.

Keras `Sequential` model and `Dense` layer are already loaded for you to use.

```
# Instantiate a sequential model
model = Sequential()
  
# Add 3 dense layers of 128, 64 and 32 neurons each
model.add(Dense(128, input_shape=(2,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
  
# Add a dense layer with as many neurons as competitors
model.add(Dense(4, activation='softmax'))
  
# Compile your model using categorical_crossentropy loss
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
```          

#### Exercises: Prepare your dataset
In the console you can check that your labels, `darts.competitor`, are not yet in a format to be understood by your network. They contain the names of the competitors as strings. You will first turn these competitors into unique numbers, then use the `to_categorical()` function from `keras.utils` to turn these numbers into their one-hot encoded representation.

This is useful for multi-class classification problems, since there are as many output neurons as classes and for every observation in our dataset we just want one of the neurons to be activated.

The darts dataset is loaded as `darts`. Pandas is imported as `pd`. Let's prepare this dataset!

```
# Transform into a categorical variable
darts.competitor = pd.Categorical(darts.competitor)

# Assign a number to each category (label encoding)
darts.competitor = darts.competitor.cat.codes

# Print the label encoded competitors
print('Label encoded competitors: \n',darts.competitor.head())

# Drop the original competitor column with competitors' names
coordinates = darts.drop(['competitor'], axis=1)

# Use to_categorical on your labels
competitors = to_categorical(darts.competitor)

# Now print the one-hot encoded labels
print('One-hot encoded competitors: \n',competitors)
```

#### Exercises: Training on dart throwers
Your model is now ready, just as your dataset. It's time to train!

The `coordinates` features and `competitors` labels you just transformed have been partitioned into `coord_train`, `coord_test` and `competitors_train`, `competitors_test`.

Your `model` is also loaded. Feel free to visualize your training data or `model.summary()` in the console.

Let's find out who threw which dart just by looking at the board!

```
# Fit your model to the training data for 200 epochs
model.fit(coord_train,competitors_train,epochs=200)

# Evaluate your model accuracy on the test data
accuracy = model.evaluate(coord_test, competitors_test)[1]

# Print accuracy
print('Accuracy:', accuracy)
```

#### Exercises: Softmax predictions
Your recently trained `model` is loaded for you. This model is generalizing well, which is why you got a high accuracy on the test set.

Since you used the `softmax` activation function, for every input of 2 coordinates provided to your model there's an output vector of 4 numbers. Each of these numbers encodes the probability of a given dart being thrown by one of the 4 possible competitors.

When computing accuracy with the model's `.evaluate()` method, your model takes the class with the highest probability as the prediction. `np.argmax()` can help you do this since it returns the index with the highest value in an array.

Use the collection of test throws stored in `coords_small_test` and `np.argmax()` to check this out!

```
# Predict on coords_small_test
preds = model.predict(coords_small_test)

# Print preds vs true values
print("{:45} | {}".format('Raw Model Predictions','True labels'))
for i,pred in enumerate(preds):
  print("{} | {}".format(pred,competitors_small_test[i]))

# Extract the position of highest probability from each pred vector
preds_chosen = [np.argmax(pred) for pred in preds]

# Print preds vs true values
print("{:25} | {}".format('Rounded Model Predictions','True labels'))
for i,pred in enumerate(preds_chosen):
  print("{:25} | {}".format(pred,competitors_small_test[i]))
```

### Multi-label classification
* Now that we know how multi-class classification works, we can take a look at **multi-label classification.**
* Both deal with predicting classes, but in multi-label classification, a single input can be assigned to more than one class
* Real-world example: movie genre classification could be multi-label, since a single film can be, for example, Drama, Suspense, and Action

<img src='data/multiclass_vs_multilabel.png' width="500" height="250" align="center"/>

* Imagine we have three classes: `sun`, `moon`, and `clouds`
* In **multi-class problems**, if we took a sample of our observations, each individual in the sample will belong to a unique class
* However, in **multi-label problems**, each individual in the sample can have **all, none, or some subset of the available classes.**
* As you can see in the image, multilabel vectors are also **one-hot encoded** (there's a 1 or a 0 representing the presence or absence of each class).

### Multi-label architecture
* Making a multi-label model is not very different from building a multi-class model
* For the sake of this example, we will assume that to differentiate between these three different classes, we need just one input and 2 hidden neurons
* The **biggest changes** (between this model and the multiclass model) happen in the **output layer** and in its **activation function**.

```
from keras.models import Sequential
from keras.layers import Dense

# Instantiate model
model = Sequential()

# Add input and hidden layers
model.add(Dense(2, input_shape=(1,)))

# Add an output layer for the 3 classes and sigmoid activation
model.add(Dense(3, activation='sigmoid'))
```

* **In the output layer, we use as many neurons as possible classes** and we also use **sigmoid activation**
* **We use sigmoid outputs because we no longer care about the sum of probabilities.** We are not deciding **between** or **among** possible outcomes, but rather **selecting any and all possible outcomes with a probability greater than 0.5.**

<img src='data/multilabel_outcome_probabilities.png' width="400" height="200" align="center"/>

* We want each output neuron to be able to individually take a value between 0 and 1
* This can be achieved with the sigmoid activation because it constrains our neuron output in range 0-1. (This is what we did in binary classification, though we only had one output neuron there).
* **Binary cross-entropy** is now used as the loss function when compiling the model 
* You can look at it **as if we were performing several binary classification problems; for each output we are deciding whether or not its corresponding label is present.**
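A small sketch of how independent sigmoid outputs translate into a set of labels. The probabilities are made up and `predicted_labels` is a hypothetical helper; the class names follow the `sun`, `moon`, `clouds` example.

```python
def predicted_labels(probabilities, class_names, threshold=0.5):
    """Keep every class whose own probability exceeds the threshold."""
    return [name for name, p in zip(class_names, probabilities) if p > threshold]

classes = ['sun', 'moon', 'clouds']

# Each output is judged on its own, so the probabilities need not sum to 1
print(predicted_labels([0.9, 0.2, 0.7], classes))  # ['sun', 'clouds']
print(predicted_labels([0.1, 0.3, 0.2], classes))  # []
print(predicted_labels([0.8, 0.9, 0.6], classes))  # ['sun', 'moon', 'clouds']
```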

```
# Compile the model with binary crossentropy
model.compile(optimizer='adam', loss='binary_crossentropy')

# Train your model, recall validation_split
model.fit(X_train, y_train, epochs=100, validation_split=0.2)
```

* By using `validation_split`, a fraction of the training data is held out and used for validation at the end of each epoch.
* Using neural networks for multi-label classification can be performed by minor tweaks in our model architecture
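To picture what `validation_split=0.2` does, the sketch below mimics the split on a plain list. It relies on the documented Keras behavior of taking the validation samples from the end of the data before any shuffling; the helper name `split_for_validation` is hypothetical.

```python
def split_for_validation(samples, validation_split):
    """Hold out the last fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(10))  # stand-in for 10 training observations
train_part, val_part = split_for_validation(samples, 0.2)

print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```

Because the held-out samples come from the end, data sorted by class or time may need shuffling before calling `fit`.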

<img src='data/one_vs_rest.png' width="400" height="200" align="center"/>

* If we were to use a classical machine learning approach to solve multi-label problems, we would need more complex methods.
    * One way to do so would be to train several classifiers to distinguish each particular class from the rest
    * This is called **one-vs-rest classification** and is illustrated above
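The one-vs-rest idea amounts to splitting the multi-label targets into one binary target per class, each handed to its own classifier. A minimal sketch with a made-up label matrix; `one_vs_rest_targets` is a hypothetical helper, and the classifiers themselves are omitted.

```python
def one_vs_rest_targets(label_matrix):
    """Turn rows of multi-label vectors into one binary target list per class."""
    return [list(column) for column in zip(*label_matrix)]

# Rows are observations, columns are the classes sun, moon, clouds
labels = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]

sun_target, moon_target, clouds_target = one_vs_rest_targets(labels)
print(sun_target)     # [1, 0, 1]
print(moon_target)    # [0, 1, 1]
print(clouds_target)  # [1, 0, 0]
```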

### Keras Callbacks
* Now that we've trained quite a few models, it's time to learn more about how to better control and supervise model training by using **callbacks**.
* A **callback** is a function that is executed after some other function, event, or task has finished.
* A Keras callback is a block of code that gets executed after each epoch during training or after the training is finished
    * `EarlyStopping`
    * `ModelCheckpoint`
    * `History`
* They are useful to store metrics as the model trains and to make decisions as the training goes by
* Every time you call the `fit` method on a Keras model, a `History` callback object is returned after the model finishes training
    * Its **`history`** attribute is a Python dictionary.
    * Within the `history` attribute, we can check the **saved metrics of the model at each epoch during training as a list of numbers.**
    
```
# Training a model and saving its history
# (metrics such as accuracy are set in model.compile(), not in fit())
history = model.fit(X_train, y_train,
                    epochs = 100)
print(history.history['loss'])
```
* To get the most out of the history object, we should use the `validation_data` parameter in our fit method, passing `X_test` and `y_test` as a tuple, as shown below:

```
# Training a model and saving its history
# (metrics such as accuracy are set in model.compile(), not in fit())
history = model.fit(X_train, y_train,
                    epochs = 100,
                    validation_data=(X_test, y_test))
print(history.history['val_loss'])
```
* The `validation_split` parameter can be used instead, specifying the fraction of the training data that will be held out for validation
* That way, we not only have the training metrics, but also the validation metrics

### History plots
* You can compare training and validation metrics with a few matplotlib commands
* We just need to define a figure
* Plot the values of the history attribute for the training accuracy (`acc`) and the validation accuracy (`val_acc`)

```
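import matplotlib.pyplot as plt

# Plot train vs test accuracy per epoch
plt.figure()

# Use history metrics (newer Keras versions name these keys 'accuracy' and 'val_accuracy')
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])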

# Make it pretty
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'])
plt.show()
```

<img src='data/history_plots.png' width="400" height="200" align="center"/>

* **We can see our model accuracy increases for both training and test sets until it reaches epoch 25**
* Then accuracy flattens for the test set whilst the training keeps improving
* At that point, overfitting is occurring, since the training set keeps improving while the test set plateaus and then even decreases in accuracy.
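The same reading of the plot can be done numerically on the lists stored in `history.history`. The accuracy values below are made up for illustration, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_metric):
    """Index of the epoch where the validation metric peaks."""
    return max(range(len(val_metric)), key=lambda i: val_metric[i])

# Hypothetical per-epoch accuracies
train_acc = [0.60, 0.70, 0.78, 0.84, 0.89, 0.93]
val_acc = [0.58, 0.67, 0.74, 0.76, 0.75, 0.73]

peak = best_epoch(val_acc)
print(peak)  # 3

# After the peak, training keeps improving while validation drops: the gap widens
gap = [t - v for t, v in zip(train_acc, val_acc)]
print(gap[-1] > gap[peak])  # True
```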

### Early stopping
* Early stopping a model can solve the overfitting problem, since it stops training when it no longer improves
* This is extremely useful since deep neural models can take a long time to train and we don't know beforehand how many epochs will be needed
* The early stopping callback can monitor several metrics, like validation accuracy, validation loss, etc. specified with the **`monitor`** parameter.
* It is also important to define a **`patience`** argument, or, the number of epochs to wait for the model to improve before stopping its training
    * There aren't any rules to decide which patience number works best at any time, and this depends mostly on the implementation

```
# Import early stopping from keras callbacks
from keras.callbacks import EarlyStopping

# Instantiate an early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5)

# Train your model with the callback
model.fit(X_train, y_train, epochs=100, 
                            validation_data = (X_test, y_test),
                            callbacks = [early_stopping])
```
* The callback is passed as a list to the `callbacks` parameter in the model `fit` method
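The role of `patience` can be sketched as a simple counter over the monitored values. This is a simplified model of the idea, not the Keras implementation (which also supports options such as `min_delta` and `restore_best_weights`); the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Hypothetical validation losses per epoch
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.70, 0.71, 0.73, 0.74]

print(early_stopping_epoch(val_losses, patience=2))  # 4
print(early_stopping_epoch(val_losses, patience=3))  # 8
```

With a patience of 2 the run stops before reaching the better loss at epoch 5, which illustrates why too small a patience can end training prematurely.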

### Model checkpoint
* The model checkpoint callback allows us to save our model as it trains 
* We specify the model filename with a name and the `.hdf5` extension
* You can also decide what to monitor to determine which model is best with the **`monitor`** parameter; **by default validation loss is monitored**
* Setting the `save_best_only` parameter to `True` guarantees that the latest best model according to the quantity monitored will not be overwritten

```
# Import model checkpoint from keras callbacks
from keras.callbacks import ModelCheckpoint

# Instantiate a model checkpoint callback
model_save = ModelCheckpoint('best_model.hdf5', save_best_only=True)

# Train your model with the callback
model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test),
                                        callbacks = [model_save])
```
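With `save_best_only=True`, the file is only rewritten when the monitored quantity improves. The sketch below is a simplified model of that rule, using made-up validation losses; `epochs_that_save` is a hypothetical helper, not a Keras function.

```python
def epochs_that_save(val_losses):
    """Epoch indices at which a new best validation loss would trigger a save."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved

# Hypothetical validation losses per epoch
print(epochs_that_save([1.0, 0.8, 0.9, 0.7, 0.75]))  # [0, 1, 3]
```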

## $\star$ Improving Your Model
In the previous chapters, you've trained a lot of models! You will now learn how to interpret learning curves to understand your models as they train. You will also visualize the effects of activation functions, batch sizes, and batch normalization. Finally, you will learn how to perform automatic hyperparameter optimization on your Keras models using sklearn.

### Learning curves
* Learning curves provide a lot of insight into your model.
* Now that we know how to use the `history` callback to plot them, we will learn how to get the most value out of them
* So far, we've seen two types of learning curves:
    * **Loss curves**
    * **Accuracy curves**
    
<img src='data/loss_vs_acc_curves.png' width="800" height="400" align="center"/>

* Loss tends to decrease as epochs go by
    * This is expected since our model is essentially learning to minimize the loss function
    * After a certain number of epochs, **the value converges**, meaning it no longer gets much lower; we have arrived at a minimum.
* Accuracy tends to increase as epochs go by
    * This shows our model makes fewer mistakes as it learns
   
* **If we plot training versus validation data, we can identify overfitting**: we will see the training and validation curves start to diverge.

<img src='data/identifying_overfitting.png' width="600" height="300" align="center"/>

* Again, the `EarlyStopping` callback is useful to stop our model before it starts overfitting
* **But, not all curves are smooth and pretty; many times we will find unstable curves.**

<img src='data/unstable_curves.png' width="600" height="300" align="center"/>

* **There are many reasons that can lead to unstable learning curves: the chosen optimizer, learning rate, batch size, network architecture, weight initialization, etc.**
* All of these parameters can be tuned to improve our model learning curves, as we aim for better accuracy and generalization power
* NNs are well-known for surpassing traditional ML techniques as we increase the size of our datasets
* We can check whether collecting more data would increase a model's generalization and accuracy
* We aim to produce a graph like the one below, where we have fit the model with increasing amounts of data and plotted the values for the training and test accuracies of each run 

<img src='data/acc_vs_train.png' width="700" height="350" align="center"/>

* If, after using all of our data, we see that our test set accuracy still has a tendency to improve (that is, it's not parallel to our train set curve, and it's increasing), then it's worth gathering more data if possible to allow the model to keep learning

<img src='data/acc_vs_train2.png' width="700" height="350" align="center"/>

* How would we go about coding a graph like the two above?
* Imagine we want to evaluate an already-built-and-compiled model and have partitioned our data into `X_train`, `y_train`, `X_test`, and `y_test`.

```
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping

# Store initial model weights
initial_weights = model.get_weights()

# Lists for storing accuracies
train_accs = []
test_accs = []

# Loop over a predefined list of train sizes (train_sizes), getting the corresponding training data fraction for each
for train_size in train_sizes:

    # Split a fraction according to train_size
    X_train_frac, _, y_train_frac, _ = train_test_split(X_train, y_train, train_size=train_size)

    # Reset the model to its initial weights so every run starts from the same point
    model.set_weights(initial_weights)

    # Fit model on the training set fraction
    model.fit(X_train_frac, y_train_frac, epochs=100, verbose=0,
              callbacks=[EarlyStopping(monitor='loss', patience=1)])

    # Get the accuracy for this training set fraction
    # (index 1 assumes the model was compiled with metrics=['accuracy'])
    train_acc = model.evaluate(X_train_frac, y_train_frac, verbose=0)[1]
    train_accs.append(train_acc)

    # Get the accuracy on the whole test set
    test_acc = model.evaluate(X_test, y_test, verbose=0)[1]
    test_accs.append(test_acc)
    print("Done with size: ", train_size)
```
* Loop over predefined list of train sizes, and for each we get the corresponding training data fraction
* Then, fit the model on the training fraction
    * Use an EarlyStopping callback which monitors **loss**
    * It is important to note that it is not validation loss, since we haven't provided the fit method with validation data.
* After the training is done, we can get the accuracy for the training set fraction and the accuracy from the test set and append it to our lists of accuracies
    * Observe that the same amount of test data was used for each iteration
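
Once the accuracies are collected, deciding whether the test curve "still has a tendency to improve" can be made less subjective by fitting a line to its last few points. This is only a rough sketch: the helper `still_improving`, its threshold, and the accuracy values are hypothetical, not part of the course code.

```python
import numpy as np

def still_improving(train_sizes, test_accs, last_n=3, min_slope=1e-6):
    """Fit a straight line to the last `last_n` points of the test curve.

    Returns (True, slope) when the fitted slope exceeds `min_slope`,
    which suggests that more data could still help.
    """
    x = np.asarray(train_sizes[-last_n:], dtype=float)
    y = np.asarray(test_accs[-last_n:], dtype=float)
    slope = np.polyfit(x, y, deg=1)[0]
    return bool(slope > min_slope), float(slope)

# Made-up results for two hypothetical models
train_sizes = [100, 200, 400, 800, 1600]
flat_accs = [0.70, 0.80, 0.85, 0.85, 0.85]    # test accuracy has plateaued
rising_accs = [0.60, 0.68, 0.75, 0.81, 0.86]  # test accuracy keeps climbing

print("Plateaued model:", still_improving(train_sizes, flat_accs))
print("Improving model:", still_improving(train_sizes, rising_accs))
```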
        

## Activation Functions
* Inside the neurons of any neural network, the same process takes place: the inputs reaching the neuron are multiplied by the weights of their connections and summed, and the bias weight is added.

<img src='data/act_func.png' width="400" height="200" align="center"/>

* This operation results in a number, $a$, which can be anything (it isn't bounded).
* We pass this number into an activation function that essentially takes it as an input and decides how the neuron fires and which output it produces

<img src='data/act_func2.png' width="400" height="200" align="center"/>
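
The two steps above (weighted sum plus bias, then activation) can be written out for a single neuron. This is a minimal NumPy sketch with made-up inputs and weights; the helper name `neuron_output` is ours, and sigmoid is just one possible choice of activation.

```python
import numpy as np

def sigmoid(a):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron_output(inputs, weights, bias, activation):
    """Compute the unbounded value a, then pass it through the activation."""
    a = np.dot(inputs, weights) + bias
    return a, activation(a)

# Made-up values for a neuron with three incoming connections
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.8, 0.2, -0.5])
bias = 0.1

a, out = neuron_output(inputs, weights, bias, sigmoid)
print("a (before activation):", a)
print("output (after sigmoid):", out)
```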

* **Activation functions impact learning time, making our model converge faster or slower and achieve lower or higher accuracy.**
* They also allow us to learn more complex functions
* Four very well known activation functions are:

### Sigmoid
* Varies between 0 and 1 for all possible X input values

<img src='data/sig.png' width="400" height="200" align="center"/>

### Tanh
* Aka Hyperbolic Tangent
* Similar to the sigmoid shape, but varies between -1 and 1

<img src='data/tanh.png' width="400" height="200" align="center"/>

### ReLU
* Rectified Linear Unit
* Varies between 0 and infinity

<img src='data/relu.png' width="400" height="200" align="center"/>

### Leaky ReLU
* We can look at it as a variant of ReLU that doesn't sit at 0 for negative inputs; instead, it allows small negative outputs when the input is negative.

<img src='data/leaky_relu.png' width="400" height="200" align="center"/>
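
For reference, the four functions can be sketched in a few lines of NumPy. The slope `alpha=0.01` used for leaky ReLU is an assumed, commonly seen value; the notes don't fix one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # output in (0, 1)

def tanh(x):
    return np.tanh(x)                      # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # 0 for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope instead of 0

x = np.array([-2.0, 0.0, 2.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh),
                 ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, fn(x))
```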

### Effect of activation function
* Changing the activation function used in the hidden layer of the model we built for binary classification results in different classification boundaries

#### Sigmoid vs. Tanh
<img src='data/sig_vs_tanh.png' width="600" height="300" align="center"/>

#### ReLU vs Leaky ReLU
<img src='data/relu_vs_leaky.png' width="600" height="300" align="center"/>

* It's important to note that **these boundaries will be different for every run of the same model because of the random initialization of weights and other random variables that aren't fixed.**
* Each activation function comes with its pros and cons; there's no magic formula
    * Based on their properties, the problem at hand, and the layer we are looking at in our network, one activation function will perform better than the others at achieving our goal.
    * What is the goal to achieve in a given layer?
    * ReLUs are usually a good place to start, as they train quickly and will tend to generalize well to most problems.
    * Sigmoids are not recommended for deep models
    * Tune with experimentation
    
#### Comparing activation functions
* It's easy to compare how models with different activation functions perform if they are small enough and train fast
* It is important to set a random seed with numpy, so that the model weights are initialized the same way for each activation function
* We then define a function that returns a fresh, new model each time, using the `act_function` parameter

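```
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Set a random seed
np.random.seed(1)

# Return a new, compiled model with the given activation
def get_model(act_function):
    model = Sequential()
    model.add(Dense(4, input_shape=(2,), activation=act_function))
    model.add(Dense(1, activation='sigmoid'))
    # Compile so the model can be trained with fit() (the optimizer is an assumed choice)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```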
* We can then use this function as we loop over several activation functions, training different models and saving their `history` callbacks.

```
# Activation functions to try out
activations = ['relu', 'sigmoid', 'tanh']

# Dictionary to store results
activation_results = {}

for funct in activations:
    # Get a fresh model for each activation function
    model = get_model(act_function=funct)
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        epochs=100, verbose=0)
    activation_results[funct] = history
```
* We store all these callbacks in a dictionary.
* With this dictionary of histories, we can extract the metrics we want to plot, build a pandas dataframe, and plot it

```
import pandas as pd

# Extract val_loss history of each activation function
val_loss_per_funct = {k:v.history['val_loss'] for k,v in activation_results.items()}

# Turn the dictionary into a pandas dataframe
val_loss_curves = pd.DataFrame(val_loss_per_funct)

# Plot the curves
val_loss_curves.plot(title='Loss per activation function')
```
***

```
# Activation functions to try
activations = ['relu', 'leaky_relu', 'sigmoid', 'tanh']

# Loop over the activation functions
activation_results = {}

for act in activations:
  # Get a new model with the current activation
  model = get_model(act)
  # Fit the model and store the history results
  h_callback = model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs =20, verbose=0)
  activation_results[act] = h_callback
```

## Batch size and batch normalization
* A mini-batch is a subset of data samples
* If we were training a neural network with images, each image in our training set would be a **sample** and we could take **mini-batches** of different sizes (different numbers of samples) from the overall training set **batch**.
* Remember that during an epoch, we feed our network, calculate the errors, and update the network weights

<img src='data/mini_batches.png' width="400" height="200" align="center"/>

* It is not very practical to update our network weights only once per epoch after looking at the error produced by all training samples
* In practice, we take a mini-batch of training samples
* **Mini-batches will all be of the same size**
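
To make the idea concrete, here is a small sketch of how a training set could be shuffled and cut into mini-batches. The helper `iterate_minibatches` and the toy arrays are hypothetical. Note that when the number of samples isn't divisible by the batch size, the last mini-batch ends up smaller than the rest.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Shuffle the samples once, then yield consecutive mini-batches."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.arange(10)                  # one label per sample

batches = list(iterate_minibatches(X, y, batch_size=4, rng=rng))
batch_sizes = [len(X_batch) for X_batch, _ in batches]

# One weight update would happen per mini-batch
print("Weight updates per epoch:", len(batches))
print("Mini-batch sizes:", batch_sizes)
```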

#### Mini-batches
* **Networks tend to train faster with mini-batches since weights are updated often.**
* Sometimes datasets are so huge that they would struggle to fit in memory if we didn't use mini-batches
* Also, noise can help networks reach a lower error, escaping local minima

<img src='data/minibatch_pro_con.png' width="400" height="200" align="center"/>

* Here you can see how different batch sizes converge towards a minimum as training goes on 
* Training with all samples is shown in blue.
* Stochastic Gradient Descent, in red, uses a `batch_size` of 1
* We can see how the path towards the best value for our weights is noisier the smaller the `batch_size`

<img src='data/minibatch2.png' width="400" height="200" align="center"/>
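
The noisier path of small batch sizes can be reproduced on a toy problem. The sketch below fits a single weight `w` in `y = w * x` with plain gradient descent, once with the whole dataset per update and once with a `batch_size` of 1. The data, learning rate, and the "jitter" measure are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy regression data (made up): y = 3x plus some noise
X = rng.normal(size=200)
y = 3.0 * X + rng.normal(scale=0.5, size=200)

def train(batch_size, epochs=100, lr=0.05):
    """Fit y = w * x with gradient descent; return every value w took."""
    w = 0.0
    path = [w]
    for _ in range(epochs):
        order = rng.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch
            grad = np.mean(2 * (w * X[idx] - y[idx]) * X[idx])
            w -= lr * grad
            path.append(w)
    return np.array(path)

full_path = train(batch_size=len(X))  # one update per epoch
sgd_path = train(batch_size=1)        # one update per sample

print("Updates (full batch):", len(full_path) - 1)
print("Updates (batch_size=1):", len(sgd_path) - 1)
print("Final w (full batch):", round(float(full_path[-1]), 3))
print("Jitter over last 50 updates (full batch):", np.std(full_path[-50:]))
print("Jitter over last 50 updates (batch_size=1):", np.std(sgd_path[-50:]))
```

With a `batch_size` of 1 the weight gets updated far more often, but it keeps bouncing around the minimum instead of settling smoothly.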

### Batch size in Keras
* You can set your own batch size with the `batch_size` parameter of the model's `fit` method
* Keras uses a default batch size of 32
    * Increasing powers of 2 tend to be used
    * As a rule of thumb, **you tend to make your batch size bigger, the bigger your dataset is.**
    
## Normalization 
* Normalization is a common pre-processing step in ML algorithms, especially when features have different scales.
* One way to normalize data is with the equation below:

<img src='data/norm_eq.png' width="400" height="200" align="center"/>

* We always tend to normalize our model inputs
* This leaves everything centered around 0, with a standard deviation of 1

<img src='data/norm2.png' width="400" height="200" align="center"/>
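
Assuming the equation pictured above is the usual standardization (subtract the mean, divide by the standard deviation), it can be sketched in NumPy as follows. The helper name `standardize` and the toy data are ours.

```python
import numpy as np

def standardize(X):
    """Center each feature (column) at 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Two made-up features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

X_norm = standardize(X)
print("Means after normalization:", X_norm.mean(axis=0))
print("Standard deviations after normalization:", X_norm.std(axis=0))
```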

* Normalizing neural network inputs improves our model 
* But deeper layers are trained on the outputs of previous layers. Since weights get updated via gradient descent, the inputs to these deeper layers are no longer normalized, and each layer has to keep adapting to the weight changes of the layers before it, which makes learning its own weights harder.

<img src='data/norm3.png' width="400" height="200" align="center"/>

* Batch normalization makes sure that, independently of the changes, the inputs to the next layers are normalized
* It does this in a smart way, with trainable parameters that also learn how much of this normalization is kept, scaling or shifting it.
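
A rough NumPy sketch of what batch normalization computes for one mini-batch during training is shown below. Here `gamma` and `beta` stand for the trainable scale and shift parameters mentioned above, and `eps` is a small constant to avoid dividing by zero; the function name and the data are made up, and details such as the moving averages used at prediction time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the mini-batch, then scale and shift it."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Made-up layer outputs: a mini-batch of 32 samples, 3 neurons, far from normalized
layer_outputs = rng.normal(loc=5.0, scale=3.0, size=(32, 3))

# gamma=1, beta=0 keeps the plain normalization
plain = batch_norm_forward(layer_outputs, gamma=np.ones(3), beta=np.zeros(3))
# Other values let the network rescale and shift what the next layer sees
shifted = batch_norm_forward(layer_outputs, gamma=2.0 * np.ones(3), beta=np.ones(3))

print("Plain   mean/std:", plain.mean(axis=0), plain.std(axis=0))
print("Shifted mean/std:", shifted.mean(axis=0), shifted.std(axis=0))
```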

#### Batch normalization advantages
* Improves gradient flow
* Allows higher learning rates
* Reduces dependence on weight initializations
* Acts as an unintended form of regularization
* Limits internal covariate shift
* Batch normalization is widely used today in many deep learning models

#### Batch normalization in Keras
* Batch normalization in Keras is applied as a layer

```
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

# Instantiate a Sequential model
model = Sequential()

# Add an input layer
model.add(Dense(3, input_shape=(2,), activation='relu'))

# Add batch normalization for the outputs of the layer above
model.add(BatchNormalization())

# Add an output layer
model.add(Dense(1, activation='sigmoid'))
```

#### Exercises: Changing batch sizes
You've seen models are usually trained in batches of a fixed size. The smaller a batch size, the more weight updates per epoch, but at the cost of a more unstable gradient descent, especially if the batch size is too small and not representative of the entire training set.

Let's see how different batch sizes affect the accuracy of a simple binary classification model that separates red from blue dots.

You'll use a batch size of one, updating the weights once per sample in your training set for each epoch. Then you will use the entire dataset, updating the weights only once per epoch.

```
# Get a fresh new model with get_model
model = get_model()

# Train your model for 5 epochs with a batch size of 1
model.fit(X_train, y_train, epochs=5, batch_size=1)
print("\n The accuracy when using a batch of size 1 is: ",
      model.evaluate(X_test, y_test)[1])
```
***

```
model = get_model()

# Fit your model for 5 epochs with a batch of size the training set
model.fit(X_train, y_train, epochs=5, batch_size=len(X_train))
print("\n The accuracy when using the whole training set as batch-size was: ",
      model.evaluate(X_test, y_test)[1])
```

```
# Import batch normalization from keras layers
from keras.layers import BatchNormalization

# Build your deep network
batchnorm_model = Sequential()
batchnorm_model.add(Dense(50, input_shape=(64,), activation='relu', kernel_initializer='normal'))
batchnorm_model.add(BatchNormalization())
batchnorm_model.add(Dense(50, activation='relu', kernel_initializer='normal'))
batchnorm_model.add(BatchNormalization())
batchnorm_model.add(Dense(50, activation='relu', kernel_initializer='normal'))
batchnorm_model.add(BatchNormalization())
batchnorm_model.add(Dense(10, activation='softmax', kernel_initializer='normal'))

# Compile your model with sgd
batchnorm_model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
```

### Hyperparameter tuning
* Our aim is to identify those parameters that make our model generalize better

#### Neural network hyperparameters
* A NN is full of parameters that can be tweaked:
    * Number of layers
    * Number of neurons per layer
    * Layer order
    * Layer activations
    * Batch sizes
    * Learning rates
* In sklearn we can perform hyperparameter search by using methods like `RandomizedSearchCV`

```
# Import RandomizedSearchCV and the classifier to tune
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Instantiate your classifier
tree = DecisionTreeClassifier()

# Define a series of parameters to look over
params = {'max_depth': [3, None], 'max_features': range(1, 4), 'min_samples_leaf': range(1, 4)}

# Perform random search with cross validation
tree_cv = RandomizedSearchCV(tree, params, cv=5)
tree_cv.fit(X, y)

# Print the best parameters
print(tree_cv.best_params_)
```
   
## Turn a Keras model into a sklearn estimator
* We can do the same with our Keras models
* But we first have to transform them into sklearn estimators
* We do this by first defining a function that creates our model
* Then we import the `KerasClassifier` wrapper from `keras.wrappers.scikit_learn`

```
from keras.models import Sequential
from keras.layers import Dense

# Function that creates our Keras model
def create_model(optimizer='adam', activation='relu'):
    model = Sequential()
    model.add(Dense(16, input_shape=(2,), activation=activation))
    model.add(Dense(1, activation='sigmoid'))
    # Track accuracy so that sklearn tools can score the wrapped model
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Import sklearn wrapper from keras
from keras.wrappers.scikit_learn import KerasClassifier

# Create a model as a sklearn estimator
model = KerasClassifier(build_fn=create_model, epochs=6, batch_size=16)
```
* **Note** that parameters like `epochs` and `batch_size` are **optional** but should be passed if we want to specify them.

* This is very cool!
* Our model is now just like any other sklearn estimator, so we can, for instance, perform cross-validation on it to see the stability of its predictions across folds

## Cross-validation

```
# Import cross_val_score
from sklearn.model_selection import cross_val_score

# Check how your keras model performs with a 5 fold crossvalidation
kfold = cross_val_score(model, X, y, cv=5)

# Print the mean accuracy across folds
print(kfold.mean())

# Print the standard deviation of the accuracy across folds
print(kfold.std())
```
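
To see what `cv=5` means in practice, here is a simplified sketch of how the sample indices could be split into folds. It ignores shuffling and stratification, which sklearn may apply, and the helper `kfold_indices` is hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print("Fold", fold, "train:", train_idx, "validation:", val_idx)
```

The model is trained from scratch and scored once per fold, which is why `cross_val_score` returns one score per fold.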

### Tips for neural networks hyperparameter tuning
* Random search is preferred over grid search
* Don't use many epochs
* Use a smaller sample of your dataset (this makes things faster if you have a huge dataset, and makes it easier to play with things like optimizers, batch sizes, activations, and learning rates)
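
A quick count shows why random search is usually the cheaper option. The parameter values below are illustrative, `n_iter = 10` matches the default number of candidates `RandomizedSearchCV` samples, and every candidate is trained once per cross-validation fold.

```python
from itertools import product
import random

# Illustrative search space
params = {
    'activation': ['relu', 'tanh'],
    'batch_size': [32, 128, 256],
    'epochs': [50, 100, 200],
    'learning_rate': [0.1, 0.01, 0.001],
}
n_folds = 3

# Grid search would try every combination
all_combinations = [dict(zip(params, values)) for values in product(*params.values())]
grid_fits = len(all_combinations) * n_folds

# Random search only tries a fixed number of randomly picked combinations
random.seed(0)
n_iter = 10
sampled = random.sample(all_combinations, n_iter)
random_fits = n_iter * n_folds

print("Combinations in the grid:", len(all_combinations))
print("Model fits with grid search:", grid_fits)
print("Model fits with random search:", random_fits)
print("One sampled candidate:", sampled[0])
```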

## Random Search on Keras model
* To perform randomized search on a Keras model, we just need to define the parameters to try.
* We can try different optimizers, activation functions for the hidden layers and batch sizes.
* **Note:** The keys in the parameter dictionary must be named exactly as the parameters in our `create_model` function
* We then instantiate a RandomizedSearchCV object passing our model and parameters with 3 fold cross-validation

```
from sklearn.model_selection import RandomizedSearchCV

# Define a series of parameters (every value must be a list of options)
params = dict(optimizer=['sgd', 'adam'],
              epochs=[3],
              batch_size=[5, 10, 20],
              activation=['relu', 'tanh'])

# Create a random search cv object and fit it to the data
random_search = RandomizedSearchCV(model, param_distributions=params, cv=3)
random_search_results = random_search.fit(X, y)

# Print results
print("Best {:f} using {}".format(random_search_results.best_score_, random_search_results.best_params_))
```

## Tuning other hyperparameters
* Parameters like the number of neurons per layer and the number of layers can also be tuned using the same method. 
* We just need to make some changes in our `create_model` function:
* `nl` = number of layers
* `nn` = number of neurons

```
def create_model(nl=1, nn=256):
    model = Sequential()
    model.add(Dense(16, input_shape=(2,), activation='relu'))
    
    # Add as many hidden layers as specified in nl
    for i in range(nl):
        # Layers have nn neurons
        model.add(Dense(nn, activation='relu'))
        
    # End defining and compiling your model...
```
* Then we just need to use the same exact names in the parameter dictionary as we have in our function and repeat the process.

```
# Define parameters, named just like in create_model()
params = dict(nl=[1, 2, 9], nn=[128, 256, 1000])

# Repeat the random search...

# Print results
```
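
Before launching such a search, it helps to get a feel for how quickly the model grows with `nl` and `nn`. The sketch below counts the trainable parameters of a stack of `Dense` layers (weights plus biases). The helper names are ours, and the final `Dense(1)` output layer is an assumption, since the end of `create_model` is not shown in these notes.

```python
def count_dense_params(n_inputs, layer_sizes):
    """Each Dense layer has (inputs + 1 bias) * units trainable parameters."""
    total = 0
    previous = n_inputs
    for units in layer_sizes:
        total += (previous + 1) * units
        previous = units
    return total

def model_params(nl, nn):
    # 2 inputs -> Dense(16) -> nl layers of nn neurons -> Dense(1) (assumed output layer)
    return count_dense_params(2, [16] + [nn] * nl + [1])

for nl in [1, 2, 9]:
    for nn in [128, 256, 1000]:
        print("nl =", nl, "nn =", nn, "->", model_params(nl, nn), "parameters")
```

The largest candidates in the grid have millions of parameters, so they will dominate the time the search takes.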

#### Exercises: Preparing a model for tuning
Let's tune the hyperparameters of a **binary classification model** that does well classifying the **breast cancer dataset**.

You've seen that the first step to turn a model into a sklearn estimator is to build a function that creates it. The definition of this function is important since hyperparameter tuning is carried out by varying the arguments your function receives.

Build a simple `create_model()` function that receives both a learning rate and an activation function as arguments. The `Adam` optimizer has been imported as an object from `keras.optimizers` so that you can also change its learning rate parameter.

```
# Creates a model given an activation and learning rate
def create_model(learning_rate, activation):

    # Create an Adam optimizer with the given learning rate
    opt = Adam(lr=learning_rate)

    # Create your binary classification model
    model = Sequential()
    model.add(Dense(128, input_shape=(30,), activation=activation))
    model.add(Dense(256, activation=activation))
    model.add(Dense(1, activation='sigmoid'))

    # Compile your model with your optimizer, loss, and metrics
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model
```

#### Exercises: Tuning the model parameters
It's time to try out different parameters on your model and see how well it performs!

The `create_model()` function you built in the previous exercise is ready for you to use.

Since fitting the `RandomizedSearchCV` object would take too long, the results you'd get are printed in the `show_results()` function. You could try `random_search.fit(X,y)` in the console yourself to check it does work after you have built everything else, but you will probably timeout the exercise (so copy your code first if you try this or you can lose your progress!).

You don't need to use the optional `epochs` and `batch_size` parameters when building your `KerasClassifier` object since you are passing them as `params` to the random search and this works already.

```
# Import KerasClassifier from keras scikit learn wrappers
from keras.wrappers.scikit_learn import KerasClassifier
# Import the search and fold utilities from sklearn
from sklearn.model_selection import RandomizedSearchCV, KFold

# Create a KerasClassifier
model = KerasClassifier(build_fn=create_model)

# Define the parameters to try out
params = {'activation': ['relu', 'tanh'], 'batch_size': [32, 128, 256],
          'epochs': [50, 100, 200], 'learning_rate': [0.1, 0.01, 0.001]}

# Create a randomized search cv object passing in the parameters to try
random_search = RandomizedSearchCV(model, param_distributions=params, cv=KFold(3))

# Running random_search.fit(X,y) would start the search, but it takes too long!
show_results()
```

#### Exercises: Training with cross-validation
Time to train your model with the best parameters found: **0.001** for the **learning rate, 50 epochs, a 128 batch_size** and **relu activations**.

The `create_model()` function from the previous exercise is ready for you to use. `X` and `y` are loaded as features and labels.

Use the best values found for your model when creating your `KerasClassifier` object so that they are used when performing cross_validation.

End this chapter by training an awesome tuned model on the **breast cancer dataset**!

```
# Import KerasClassifier from keras wrappers
from keras.wrappers.scikit_learn import KerasClassifier
# Import cross_val_score from sklearn
from sklearn.model_selection import cross_val_score

# Create a KerasClassifier, passing the function itself plus the best values found
model = KerasClassifier(build_fn=create_model, learning_rate=0.001, activation='relu',
                        epochs=50, batch_size=128, verbose=0)

# Calculate the accuracy score for each fold
kfolds = cross_val_score(model, X, y, cv=3)

# Print the mean accuracy
print('The mean accuracy was:', kfolds.mean())

# Print the accuracy standard deviation
print('With a standard deviation of:', kfolds.std())
```

# $\star$ Chapter 4: Advanced Model Architectures
It's time to get introduced to more advanced architectures! You will create an autoencoder to reconstruct noisy images, visualize convolutional neural network activations, use deep pre-trained models to classify images and learn more about recurrent neural networks and working with text as you build a network that predicts the next word in a sentence.

## Tensors, layers, and autoencoders
* Now that you know how to tune your models, it's time to better understand how they work and learn about new neural network architectures

#### Accessing Keras layers
* Model layers are easily accessible: we just need to access the `layers` attribute of a built model and index the layer we want
* From a chosen layer we can print its inputs, outputs and weights

```
# Accessing the first layer of a Keras model
first_layer = model.layers[0]

# Printing the layer's input, output and weights
print(first_layer.input)
print(first_layer.output)
print(first_layer.weights)
```
* Inputs and outputs are tensors of a given shape built with TensorFlow tensor objects; weights are also tensors, but their values change as the neural network learns the best weights

#### What are tensors?
* Tensors are the main data structures used in deep learning; inputs, outputs and transformations in neural networks are all represented using tensors 
* A **tensor** is a multi-dimensional array of numbers
* A **2-dimensional tensor** is a **matrix**
* A **3-dimensional tensor** is an **array of matrices**


```
# Defining a rank 2 tensor (2 dimensions)
T2 = [[1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]]

# Defining a rank 3 tensor (3 dimensions): a list of three 3x3 matrices
T3 = [[[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]],

      [[10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]],

      [[19, 20, 21],
       [22, 23, 24],
       [25, 26, 27]]]
```
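
Wrapping such nested lists in NumPy arrays makes the rank and shape of each tensor explicit. This is a small sketch using NumPy rather than Keras or TensorFlow tensors, which expose similar shape information.

```python
import numpy as np

# Rank 2 tensor: a 3x3 matrix
T2 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

# Rank 3 tensor: three stacked 3x3 matrices holding the numbers 1 to 27
T3 = np.arange(1, 28).reshape(3, 3, 3)

print("T2 rank:", T2.ndim, "shape:", T2.shape)
print("T3 rank:", T3.ndim, "shape:", T3.shape)

# The first index of T3 selects one of the matrices
print("Second matrix of T3:")
print(T3[1])
```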

* If we import Keras backend, we can build a function that takes in an input tensor from a given layer and returns an output tensor from another or the same layer
* To define the function with our backend K, we need to give it a list of inputs and outputs, even if we just want 1 input and 1 output
* Then we can use it on a tensor with the same shape as the input layer given during its definition
* If the weights of the layers between our inputs and outputs change, the function output for the same input will change as well. 
* We can use this to see the output of certain layers as weights change during training

```
# Import Keras backend
import keras.backend as K

# Get the input and output tensors of a model layer
inp = model.layers[0].input
out = model.layers[0].output

# Function that maps layer input to output
inp_to_out = K.function([inp], [out])

print(inp_to_out([X_test]))
```

## Autoencoders
* A new type of architecture: autoencoders
* **Autoencoders** are models that aim at producing the same inputs as outputs

<img src='data/autoencoders.png' width="400" height="200" align="center"/>

* This task alone wouldn't be very useful
* But, since along the way we decrease the number of neurons, **we are effectively making our network learn to compress its inputs into a small set of neurons.**
* This makes autoencoders very useful

<img src='data/autoencoders2.png' width="400" height="200" align="center"/>

#### Autoencoder use cases
* **Dimensionality reduction:**
    * Smaller dimensional space representation of our inputs
* **De-noising data:**
    * If trained with clean data, irrelevant noise will be filtered out during reconstruction 
* **Anomaly detection:**
    * A poor reconstruction will result when the model is fed with unseen inputs
    * In other words, if you train an autoencoder to reconstruct typical data and then pass in strange values, the network will fail to reproduce them accurately
* Many other applications can also benefit from this architecture
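
The anomaly detection idea can be illustrated without training a network. The sketch below swaps in a linear stand-in for an autoencoder (a projection onto the top principal directions, computed with NumPy's SVD), so it is not the Keras model built in these notes; the data, function names, and sizes are all made up. The principle is the same: inputs that look like the training data are reconstructed well, unusual inputs are not.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 "normal" samples with 10 features that really live on a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_train = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# "Train" the linear stand-in: keep the 2 main directions of the data
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:2]  # shape (2, 10)

def encode(X):
    """Compress 10 features into 2 numbers."""
    return (X - mean) @ components.T

def decode(Z):
    """Reconstruct 10 features from the 2-number code."""
    return Z @ components + mean

def reconstruction_error(X):
    """Mean squared error between each sample and its reconstruction."""
    return np.mean((X - decode(encode(X))) ** 2, axis=1)

train_errors = reconstruction_error(X_train)

# A strange input that does not follow the structure of the training data
anomaly = rng.normal(scale=5.0, size=(1, 10))
anomaly_error = reconstruction_error(anomaly)[0]

print("Largest error on training data:", train_errors.max())
print("Error on the strange input:", anomaly_error)
```

Flagging inputs whose reconstruction error is far above what was seen on the training data is one simple way to turn this into an anomaly detector.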

### Building a simple autoencoder
* To make an autoencoder that maps a hundred inputs to a hundred outputs, encoding the inputs into a layer of 4 neurons, we would do the following:

```
# Instantiate a sequential model
autoencoder = Sequential()

# Add a hidden layer of 4 neurons and an input layer of 100
autoencoder.add(Dense(4, input_shape=(100,), activation='relu'))

# Add an output layer of 100 neurons
autoencoder.add(Dense(100, activation='sigmoid'))

# Compile your model with the appropriate loss
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
```
* Once you've built and trained your autoencoder you might want to encode your inputs
* To do this, you need to build a new model and add the first layer of your previously trained autoencoder.
* This new model's predictions return the 4 numbers given by the 4 neurons of the hidden layer for each observation in the input dataset

```
# Building a separate model to encode inputs
encoder = Sequential()
encoder.add(autoencoder.layers[0])

# Predicting returns the four hidden layer neuron outputs 
encoder.predict(X_test)
```
***

```
# Import keras backend
import keras.backend as K

# Input tensor from the 1st layer of the model
inp = model.layers[0].input

# Output tensor from the 1st layer of the model
out = model.layers[0].output

# Define a function from inputs to outputs
inp_to_out = K.function([inp], [out])

# Print the results of passing X_test through the 1st layer
print(inp_to_out([X_test]))
```
***

```
# Start with a sequential model
autoencoder = Sequential()

# Add a dense layer with input the original image pixels and neurons the encoded representation
autoencoder.add(Dense(32, input_shape=(784, ), activation="relu"))

# Add an output layer with as many neurons as the original image pixels
autoencoder.add(Dense(784, activation = "sigmoid"))

# Compile your model with adadelta
autoencoder.compile(optimizer = 'adadelta', loss = 'binary_crossentropy')

# Summarize your model structure
autoencoder.summary()
```

### Intro to CNNs

<img src='data/visualize_parameters.png' width="400" height="200" align="center"/>