# 1. A Single Neuron

**What is Deep Learning?**

---


Deep learning is an approach to machine learning characterized by deep stacks of computations. Neural networks are composed of neurons, where each neuron individually performs only a simple computation. The power of a neural network comes instead from the complexity of the connections these neurons can form.

**The Linear Unit**

---

The input is `x`. Its connection to the neuron has a weight, `w`. Whenever a value flows through a connection, it is multiplied by the connection's weight, so what reaches the neuron is `w * x`. The `b` is a special kind of weight called the bias. The bias doesn't have any input data associated with it; instead, we put a 1 in the diagram so that the value that reaches the neuron is just `1 * b = b`. The `y` is the value the neuron ultimately outputs. This neuron's activation is `y = w * x + b`.

**Multiple Inputs**

---

We can add more input connections to the neuron. To find the output, we multiply each input by its connection weight and add the results together, along with the bias. With three inputs, the formula would look like

```
y = w0 * x0 + w1 * x1 + w2 * x2 + b
```
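
As a minimal sketch of the formula above, the weighted sum can be written in plain Python. The function name `linear_unit` and the example numbers are illustrative assumptions, not part of Keras.

```python
def linear_unit(inputs, weights, bias):
    """Multiply each input by its weight, add the products, then add the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Three inputs, three weights, one bias (all values chosen for illustration).
y = linear_unit(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 2.0], bias=0.25)
print(y)  # 0.5*1 - 1*2 + 2*3 + 0.25 = 4.75
```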



**Linear Units in Keras**

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(units=1, input_shape=[3])
])

# Deep Neural Networks

**Layers**

---
Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs we get a dense layer.
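
One way to picture a dense layer is as several linear units that all read the same inputs, each with its own weights and bias. The sketch below is plain Python for illustration only; `dense_layer` and the numbers are hypothetical and are not how Keras implements `layers.Dense` internally.

```python
def dense_layer(inputs, weight_rows, biases):
    """Apply one linear unit per row of weights to the same list of inputs."""
    outputs = []
    for weights, bias in zip(weight_rows, biases):
        outputs.append(sum(w * x for w, x in zip(weights, inputs)) + bias)
    return outputs


# A layer of three units, each connected to the same two inputs.
outputs = dense_layer(
    inputs=[1.0, 2.0],
    weight_rows=[[1.0, 0.0], [0.5, 0.5], [-1.0, 1.0]],
    biases=[0.0, 1.0, -0.5],
)
print(outputs)  # [1.0, 2.5, 0.5]
```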


**The Activation Function**

---
However, two dense layers with nothing in between are no better than a single dense layer by itself. What we need is something *nonlinear*. What we need are activation functions. An activation function is some function we apply to each of a layer's outputs. The most common one is the rectifier function which has a graph that's a line with the negative part rectified to zero.
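
The claim that stacked linear layers collapse into one can be checked with a small sketch. It uses single-input units for simplicity; the names and weights below are illustrative assumptions.

```python
def relu(x):
    """Rectifier: negative values become zero, positive values pass through."""
    return max(0.0, x)


def linear(x, w, b):
    return w * x + b


w1, b1 = 2.0, 1.0   # first unit (illustrative values)
w2, b2 = 3.0, -4.0  # second unit (illustrative values)


def stacked(x):
    """Two linear units with nothing in between."""
    return linear(linear(x, w1, b1), w2, b2)


def collapsed(x):
    """A single linear unit that computes exactly the same function."""
    return linear(x, w2 * w1, w2 * b1 + b2)


def stacked_with_relu(x):
    """The same two units with a rectifier between them."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


for x in [-2.0, 0.0, 1.0]:
    print(x, stacked(x), collapsed(x), stacked_with_relu(x))
```

For every input, `stacked` and `collapsed` agree, while `stacked_with_relu` differs wherever the first unit's output is negative. That difference is what lets a network bend away from a straight line.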


**Stacking Dense Layers**

---

After adding some nonlinearity, now we can stack layers to get complex data transformations. The layers before the output layer are sometimes called hidden.

**Building Sequential Models**

---

The `Sequential` model we've been using will connect together a list of layers in order from first to last. The first layer gets the input, the last layer produces the output.



In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=3, activation='relu'),
    layers.Dense(units=1),
])

# Stochastic Gradient Descent

**The Loss Function**

---

The loss function measures the disparity between the target's true value and the value the model predicts. A common loss function for regression problems is the mean absolute error, or MAE. Other loss functions you might see for regression problems include the mean squared error (MSE) and the Huber loss.
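
To make the three losses concrete, here is a small plain-Python sketch. The Huber loss uses its usual definition with a threshold `delta` (quadratic for small errors, linear for large ones); the value `delta=1.0` and the sample numbers are assumptions for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def huber_loss(y_true, y_pred, delta=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        error = abs(t - p)
        if error <= delta:
            total += 0.5 * error ** 2               # quadratic near zero
        else:
            total += delta * (error - 0.5 * delta)  # linear for large errors
    return total / len(y_true)


y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]  # absolute errors: 1, 0, 2

print(mean_absolute_error(y_true, y_pred))  # (1 + 0 + 2) / 3
print(mean_squared_error(y_true, y_pred))   # (1 + 0 + 4) / 3
print(huber_loss(y_true, y_pred))           # (0.5 + 0 + 1.5) / 3
```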

**The Optimizer - Stochastic Gradient Descent**

---

We need to tell the network how to solve the problem. This is the job of the optimizer, an algorithm that adjusts the weights to minimize the loss. Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent. They are iterative algorithms that train a network in steps:

1. Sample some training data and run it through the network to make predictions.
2. Measure the loss between the predictions and the true values.
3. Adjust the weights in a direction that makes the loss smaller.
4. Repeat these steps until the loss stops improving.

Each iteration's sample of training data is called a minibatch, while a complete round through the training data is called an epoch.
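
The training steps above can be sketched for a single linear unit in plain Python. This is an illustrative toy, not what Keras does internally: it uses mean squared error (because its gradient is simple to write by hand) rather than MAE, and the function name, data, and hyperparameter values are all assumptions.

```python
import random


def train_sgd(xs, ys, learning_rate=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b with minibatch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                     # one epoch = one pass over the data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]   # step 1: sample a minibatch
            # Steps 1-2: predict, then measure the error for each example.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Step 3: gradient of mean(error**2) with respect to w and b.
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = train_sgd(xs, ys)
print(round(w, 3), round(b, 3))
```

With 20 examples and a batch size of 4, each epoch consists of 5 minibatches. On this toy data the learned values end up very close to the true weight 2 and bias 1.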

**Learning Rate and Batch size**

---

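The size of these weight shifts is determined by the learning rate. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values. The learning rate and the size of the minibatches are the two parameters that have the largest effect on how SGD training proceeds.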
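
To illustrate how the learning rate changes the number of steps needed, the sketch below runs plain gradient descent on the toy loss `w ** 2`, whose minimum is at `w = 0`. The loss, starting point, and tolerance are assumptions chosen only to keep the example small.

```python
def steps_to_converge(learning_rate, start=1.0, tolerance=1e-3, max_steps=10_000):
    """Count gradient steps until |w| drops below the tolerance."""
    w = start
    for step in range(1, max_steps + 1):
        gradient = 2.0 * w            # derivative of the toy loss w ** 2
        w -= learning_rate * gradient
        if abs(w) < tolerance:
            return step
    return max_steps


steps_large = steps_to_converge(learning_rate=0.1)
steps_small = steps_to_converge(learning_rate=0.01)
print(steps_large, steps_small)  # the smaller learning rate needs many more steps
```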

**Adding the Loss and Optimizer**

In [None]:
model.compile(
    optimizer="adam",
    loss="mae",
)

**Example - Red Wine Quality**

In [None]:
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')

df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

In [None]:
print(X_train.shape)

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])

In [None]:
model.compile(
    optimizer='adam',
    loss='mae',
)

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

In [None]:
import pandas as pd

# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
history_df['loss'].plot();

# Overfitting and Underfitting

**Interpreting the Learning Curves**

---

The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is that part that is only true of the training data. We train a model by choosing weights or parameters that minimize the loss on a training set. To accurately assess a model's performance, we need validation data. Learning curves are the plots of the loss on the training set and validation set.

**Underfitting** the training set is when the loss is not as low as it could be because the model hasn't learned enough *signal*. **Overfitting** the training set is when the loss is not as low as it could be because the model learned too much *noise*.

**Capacity**

---

A model's capacity refers to the size and complexity of the patterns it is able to learn. You can increase the capacity of a network either by making it wider or making it deeper.

In [None]:
model = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])

wider = keras.Sequential([
    layers.Dense(32, activation='relu'),
    layers.Dense(1),
])

deeper = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])
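
As a rough way to compare the capacity of the three networks above, we can count their trainable parameters by hand. A dense layer with `n` inputs and `u` units has `n * u` weights plus `u` biases. The helper below is a hypothetical illustration, and the figure of 11 input features is an assumption borrowed from the red wine example.

```python
def dense_parameter_count(n_inputs, layer_units):
    """Total weights and biases for a stack of dense layers."""
    total = 0
    for units in layer_units:
        total += n_inputs * units + units  # weights + biases for this layer
        n_inputs = units                   # this layer's outputs feed the next
    return total


n_features = 11  # assumed input size

base_params = dense_parameter_count(n_features, [16, 1])
wider_params = dense_parameter_count(n_features, [32, 1])
deeper_params = dense_parameter_count(n_features, [16, 16, 1])
print(base_params, wider_params, deeper_params)  # 209 417 481
```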

**Early Stopping**

---

We can stop the training whenever it seems the validation loss isn't decreasing anymore. This is called early stopping. Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. Training with early stopping also means we're in less danger of stopping the training too early: we can set a large number of epochs and let early stopping end the training once the validation loss stops improving.
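
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Keras's implementation: an epoch counts as an improvement only if it beats the best validation loss so far by at least `min_delta`, and training stops after `patience` epochs in a row without improvement. The function name and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=3):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Validation loss falls, then starts to rise again (illustrative values).
val_losses = [1.0, 0.8, 0.7, 0.69, 0.71, 0.72, 0.73, 0.5]
stopped_at, best_at = early_stopping_epoch(val_losses)
print(stopped_at, best_at)  # 6 3
```

Training stops at epoch 6, so the final value in the list is never reached, and the weights to restore are the ones from epoch 3.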

**Adding Early Stopping**

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001,
    patience=20,
    restore_best_weights=True,
)

**Example**

In [None]:
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')

df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

In [None]:
from tensorflow import keras
from tensorflow.keras import layers, callbacks

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001,
    patience=20,
    restore_best_weights=True,
)

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])
model.compile(
    optimizer='adam',
    loss='mae',
)

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping],
    verbose=0,
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))

# Dropout and Batch Normalization

**Dropout**

---

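A dropout layer can help correct overfitting. Overfitting is caused by the network learning spurious patterns in the training data, and to recognize them it often relies on very specific combinations of weights, a kind of "conspiracy" of weights. To break up these conspiracies, we randomly drop out some fraction of a layer's input units every step of training, making it much harder for the network to learn those spurious patterns in the training data. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.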
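
What dropping units means can be sketched in plain Python. During training each value is zeroed with probability `rate`, and the values that remain are scaled up by `1 / (1 - rate)` so their expected sum is unchanged, which is the same rescaling the Keras `Dropout` layer applies; at inference time nothing is dropped. The function below is a hypothetical illustration, not the Keras implementation.

```python
import random


def dropout(values, rate, rng, training=True):
    """Randomly zero a fraction of the values and rescale the rest."""
    if not training:
        return list(values)  # dropout is inactive outside of training
    keep_probability = 1.0 - rate
    return [
        v / keep_probability if rng.random() < keep_probability else 0.0
        for v in values
    ]


rng = random.Random(0)  # seeded so the example is repeatable
activations = [1.0, 2.0, 3.0, 4.0, 5.0]

print(dropout(activations, rate=0.3, rng=rng))                  # some entries zeroed
print(dropout(activations, rate=0.3, rng=rng, training=False))  # unchanged
```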

**Let's Add Dropout**

In [None]:
keras.Sequential([
    # ...
    layers.Dropout(rate=0.3),
    layers.Dense(16),
    # ...
])


**Batch Normalization**

---

Batch normalization can help correct training that is slow or unstable. A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.
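
The two stages described above (normalize with the batch's own statistics, then rescale with two trainable parameters) can be sketched for a single feature. This covers only the training-time computation; the names `gamma`, `beta`, and `epsilon` follow common convention, and the values are assumptions for illustration.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize a batch of values for one feature, then rescale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [
        gamma * (x - mean) / math.sqrt(variance + epsilon) + beta
        for x in batch
    ]


batch = [1.0, 2.0, 3.0, 4.0]

normalized = batch_normalize(batch)                     # roughly zero mean, unit spread
rescaled = batch_normalize(batch, gamma=2.0, beta=5.0)  # mean moved to about 5
print(normalized)
print(rescaled)
```

The small `epsilon` guards against dividing by zero when every value in a batch is identical.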

**Adding Batch Normalization**

In [None]:
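# Option 1: put batch normalization after a layer and its activation
layers.Dense(16, activation='relu'),
layers.BatchNormalization(),

# Option 2: put it between a layer and its activation function
layers.Dense(16),
layers.BatchNormalization(),
layers.Activation('relu'),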

**Example**

In [None]:
import matplotlib.pyplot as plt

# Note: Matplotlib 3.6 renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

plt.rc('figure', autolayout=True)
plt.rc('axes', labelweight='bold', labelsize='large',
       titleweight='bold', titlesize=18, titlepad=10)


import pandas as pd
red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')

df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)

X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])

In [None]:
model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();

# Binary Classification

**Accuracy and Cross-Entropy**

---

Accuracy is one of the many metrics in use for measuring success on a classification problem. It is the ratio of correct predictions to total predictions. The problem with accuracy is that it can't be used as a loss function: SGD needs a loss that changes smoothly, but accuracy, being a ratio of counts, changes in jumps. So we use the cross-entropy function as a substitute. Cross-entropy is a sort of measure of the distance from one probability distribution to another.
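
A small sketch shows why cross-entropy gives the optimizer more to work with than accuracy does. Both sets of predictions below classify every example correctly, so their accuracy is identical, yet cross-entropy still rewards the more confident set. The helper names and numbers are illustrative assumptions, and the probabilities are assumed to lie strictly between 0 and 1 so the logarithms are defined.

```python
import math


def binary_cross_entropy(y_true, y_prob):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_prob) if (p >= threshold) == (t == 1))
    return correct / len(y_true)


y_true = [1, 0, 1]
confident = [0.9, 0.1, 0.8]
hesitant = [0.6, 0.4, 0.55]

print(accuracy(y_true, confident), accuracy(y_true, hesitant))  # 1.0 1.0
print(binary_cross_entropy(y_true, confident))  # lower loss
print(binary_cross_entropy(y_true, hesitant))   # higher loss
```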

**Making Probabilities with the Sigmoid Function**

---

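The cross-entropy and accuracy functions both require probabilities as inputs, meaning numbers from 0 to 1. To convert the real-valued outputs of a dense layer into probabilities, we attach a sigmoid activation. To get the final class prediction, we define a threshold probability. Typically this will be 0.5: below 0.5 means the class with label 0, and 0.5 or above means the class with label 1.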
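
As a minimal sketch, the sigmoid function and the thresholding step look like this in plain Python. The function names are illustrative, and treating a probability of exactly 0.5 as class 1 is an assumption of this sketch.

```python
import math


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_class(raw_output, threshold=0.5):
    """Turn a raw network output into a class label of 0 or 1."""
    return 1 if sigmoid(raw_output) >= threshold else 0


print(sigmoid(0.0))         # 0.5
print(predict_class(2.0))   # 1
print(predict_class(-2.0))  # 0
```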

**Example**

In [None]:
ion = pd.read_csv('../input/dl-course-data/ion.csv', index_col=0)
display(ion.head())

df = ion.copy()
df['Class'] = df['Class'].map({'good': 0, 'bad': 1})

df_train = df.sample(frac=0.7, random_state=0)
df_valid = df.drop(df_train.index)

max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)

df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)
df_train.dropna(axis=1, inplace=True) # drop the empty feature in column 2
df_valid.dropna(axis=1, inplace=True)

X_train = df_train.drop('Class', axis=1)
X_valid = df_valid.drop('Class', axis=1)
y_train = df_train['Class']
y_valid = df_valid['Class']

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(4, activation='relu', input_shape=[33]),
    layers.Dense(4, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

In [None]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)

In [None]:
early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0,
)

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[5:, ['loss', 'val_loss']].plot()
history_df.loc[5:, ['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(),
              history_df['val_binary_accuracy'].max()))