# Introduction to Neural Networks in Python

The lecture slides have provided an overview of neural networks, including key concepts such as network **layers**, the different types of **nodes** (or neurons), and **weights**, while also touching on important concepts like **activation functions**. We are now going to go from concepts to practice by learning how to build, train, and test your own neural network in Python using the ``tensorflow`` and ``keras`` libraries.

Let's start by importing several key libraries:

In [8]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

## Data

We are going to continue with our Titanic data example. To refresh, our goal is to predict passenger survival (1 for survived, 0 for not survived) based on the following variables:
* `Pclass`: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd).
* `Age`: Age of the passenger.
* `SibSp`: Number of siblings/spouses aboard.
* `Parch`: Number of parents/children aboard.
* `Fare`: Ticket fare paid.
* `female`: Binary variable indicating if the passenger is female.

In [None]:
# Read the data
df = pd.read_csv('data/titanic.csv')
df.head()

### Data preprocessing

We'll do just a little bit of pre-processing to the data prior to training our NN:

In [10]:
# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Standardize features (zero mean, unit variance). This helps the model converge faster
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

# Encode female variable
df['female'] = df['Sex'].apply(lambda x: 1 if x == 'female' else 0)

# Split data into training and test sets (train_test_split was imported in the first cell)
X = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'female']]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
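
To make the scaling step less of a black box, here is a minimal sketch of what standardization does to a single column, assuming the usual z-score formula with the population standard deviation (which is what `StandardScaler` uses). The ages are made up for illustration, and the helper names `fit_standardizer` and `standardize` are hypothetical:

```python
import statistics

def fit_standardizer(values):
    """Return the mean and population standard deviation of a list of numbers."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std to each value."""
    return [(v - mean) / std for v in values]

toy_ages = [22.0, 38.0, 26.0, 35.0, 54.0]   # made-up ages, not from the Titanic file
age_mean, age_std = fit_standardizer(toy_ages)
scaled_ages = standardize(toy_ages, age_mean, age_std)

print(f"mean before: {age_mean:.1f}, std before: {age_std:.2f}")
print(f"mean after: {statistics.fmean(scaled_ages):.2f}, std after: {statistics.pstdev(scaled_ages):.2f}")
```

After scaling, the column has a mean of roughly 0 and a standard deviation of roughly 1, which puts `Age` and `Fare` on a comparable footing.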

## Neural network architecture

With our data ready to go, it's now time to build our simple neural network. We'll go with the following model for the time being:
  - **Input Layer**: 6 nodes (1 for each feature: `Pclass`, `Age`, `SibSp`, `Parch`, `Fare`, `female`).
  - **Hidden Layer**: 1 fully connected layer (i.e., "dense" layer) with 4 nodes with **ReLU (Rectified Linear Unit)** activation functions.
  - **Output Layer**: 1 node with a **sigmoid activation function** for binary classification (survival or not).
  - **Loss Function**: Binary cross-entropy for classification problems.
  - **Optimizer**: Adam optimizer for training.
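
As a quick sanity check on the size of this architecture, the sketch below counts its trainable parameters by hand. It assumes every fully connected layer has one weight per connection plus one bias per node (the default for a `Dense` layer); the helper name `count_dense_params` is hypothetical:

```python
# Layer sizes for the network described above: 6 inputs -> 4 hidden -> 1 output
layer_sizes = [6, 4, 1]

def count_dense_params(sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = n_in * n_out   # one weight per connection between the two layers
        biases = n_out           # one bias per node in the receiving layer
        total += weights + biases
    return total

n_params = count_dense_params(layer_sizes)
print(n_params)  # (6*4 + 4) + (4*1 + 1) = 33
```

Under those assumptions, this tiny network has only 33 numbers to learn.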

It's quite easy to build this model in `Python` using `keras`:

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Input

# Build the model
model = Sequential([
    Input(shape=(6,)),              # Define the input shape explicitly  
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')  # Output layer
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Here's an overview of the key **concepts** used in our architecture:

- **Sequential**: This is a linear stack of layers in Keras. It allows you to build a model layer by layer, where each layer has exactly one input tensor and one output tensor.

- **Dense**: This layer is a fully connected layer, meaning each neuron in the layer is connected to every neuron in the previous layer. This is what we used in lecture.

- **ReLU (Rectified Linear Unit)**: This is an activation function that is defined as the positive part of its input. It is one of the most popular activation functions used in neural networks because it helps to mitigate the vanishing gradient problem. More on this below!

- **Sigmoid**: This is an activation function that outputs a value between 0 and 1. It is often used in the output layer of binary classification problems because it can be interpreted as a probability. This is what we use for logistic regression, so it should look familiar.

- **Adam**: This is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data. Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. **You can think of Adam as a smarter version of the gradient descent algorithm that we discussed in week 1**.

- **Binary Cross-Entropy**: This is a loss function used for binary classification problems. It measures the performance of a classification model whose output is a probability value between 0 and 1. The binary cross-entropy loss increases as the predicted probability diverges from the actual label.
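
To see how binary cross-entropy rewards good probabilities and punishes bad ones, here is a minimal sketch in plain Python. The labels and probabilities are made up, and the clipping constant `eps` is an illustrative choice to avoid taking `log(0)`:

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Average binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log() never sees 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
loss_good = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])  # close to the labels
loss_bad = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])   # far from the labels

print(f"Loss for good predictions: {loss_good:.3f}")
print(f"Loss for bad predictions:  {loss_bad:.3f}")
```

The second set of predictions is confidently wrong, so its loss is much larger than the first.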

### Activation functions

As we discussed in lecture, **activation functions** decide which neurons "fire" when moving through the network and add nonlinearity to it. Without activation functions, the entire neural network would behave like a single linear transformation, no matter how many layers it has. Such a network could not capture the nonlinear patterns found in most real-world problems.
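
The claim that layers without activation functions collapse into a single linear transformation can be checked with a few lines of code. This is a minimal sketch with made-up weights and biases omitted for brevity; the `matmul` helper is hypothetical:

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Two "layers" with no activation function
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # 3 inputs -> 2 hidden nodes
W2 = [[2.0], [-1.0]]                          # 2 hidden nodes -> 1 output

x = [[1.0, 2.0, 3.0]]                         # one example with 3 features

two_layer_output = matmul(matmul(x, W1), W2)  # pass through both layers in turn
W_combined = matmul(W1, W2)                   # collapse both layers into one matrix
one_layer_output = matmul(x, W_combined)

print(two_layer_output, one_layer_output)
```

Both routes give the same answer, so the extra layer adds nothing until a nonlinear activation sits between the two matrices.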

Some common activation functions:
   - **ReLU (Rectified Linear Unit)**: $$\text{ReLU}(x) = \max(0, x)$$
   - **Sigmoid**: $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
   - **Tanh**: $$\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

#### Why ReLU?

ReLU is a very (very!) common activation function for hidden layers. Here are some reasons why:
   - **Simplicity**: ReLU is computationally cheap compared to activation functions like Sigmoid or Tanh, which require evaluating exponentials.
   - **Mitigates Vanishing Gradients**:
     - Gradients of the Sigmoid and Tanh functions become very small for inputs that are large in magnitude (positive or negative), slowing down learning.
     - ReLU has a gradient of exactly 1 for any positive input, which helps keep gradients from shrinking as they pass back through the layers.
   - **Effective in Deep Networks**: It works well for deep networks by introducing sparsity (many neurons output 0).
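
The vanishing-gradient point is easier to appreciate with numbers. The sketch below evaluates the derivative of each activation function at a few inputs, using the standard closed-form derivatives; the function names are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # derivative of the sigmoid

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2    # derivative of tanh

def relu_grad(x):
    return 1.0 if x > 0 else 0.0      # derivative of ReLU (taken as 0 at x = 0)

for x in [0.5, 5.0, 10.0]:
    print(f"x={x:>4}: sigmoid'={sigmoid_grad(x):.6f}, "
          f"tanh'={tanh_grad(x):.6f}, relu'={relu_grad(x):.1f}")
```

As the input grows, the Sigmoid and Tanh gradients shrink toward zero, while the ReLU gradient stays at 1.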

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# ReLU function
def relu(x):
    return np.maximum(0, x)

# Input values
x = np.linspace(-10, 10, 100)
y_relu = relu(x)

# Plot ReLU
plt.figure(figsize=(8, 5))
plt.plot(x, y_relu, label='ReLU', color='blue')
plt.axhline(0, color='black', linewidth=0.5, linestyle='--')
plt.axvline(0, color='black', linewidth=0.5, linestyle='--')
plt.title('ReLU Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.legend()
plt.show()

### Training our model

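That's it for setting up our model. Now we can train it using the model's `fit()` method. We'll first define a small helper, `plot_loss()`, that plots the training and validation loss for each epoch: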

In [7]:
def plot_loss(history):
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

In [None]:
# Train the model
history = model.fit(X_train, y_train, epochs=50, batch_size=16, validation_data=(X_test, y_test))
plot_loss(history)

When training a neural network, two important hyperparameters to understand are **epochs** and **batch size**.

- **Epochs**: An epoch refers to one complete pass through the entire training dataset. During each epoch, the model processes every training example once. Training for more epochs generally improves the model's performance, but too many epochs can lead to overfitting, where the model performs well on the training data but poorly on unseen data.

- **Batch Size**: The batch size is the number of training examples processed before the model's internal parameters are updated. Instead of updating the parameters after every single training example, the model updates them once per batch. Smaller batch sizes produce noisier gradient estimates and require more updates to complete an epoch, though that noise often helps generalization. Larger batch sizes produce smoother gradient estimates and faster epochs, but may generalize worse.

| `batch_size`  | **Effect** |
|--------------|-----------|
| **Small (e.g., 16, 32)**  | More updates, better generalization, but **slower training**. |
| **Large (e.g., 128, 256, 512)**  | Fewer updates, **faster training**, but may generalize worse. |
| **Full Batch (`batch_size=len(X_train)`)** | One update per epoch (**slow convergence, may get stuck**). |

Choosing the right number of epochs and batch size is crucial for training an effective neural network. Too few epochs can result in underfitting, while too many can cause overfitting. Similarly, the batch size can affect the stability and speed of the training process.
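
To make the link between batch size and the number of weight updates concrete, here is a small sketch. The training-set size of 712 is an assumption (80% of an 891-row Titanic file); substitute `len(X_train)` for your own data. The helper name `updates_per_epoch` is hypothetical:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_train = 712  # assumed training-set size; replace with len(X_train)

for batch_size in [16, 32, 128, n_train]:
    n_updates = updates_per_epoch(n_train, batch_size)
    print(f"batch_size={batch_size:>3}: {n_updates} updates per epoch")
```

With 50 epochs, a batch size of 16 therefore means thousands of small updates, while full-batch training makes only 50 updates in total.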

### Making predictions

With our fitted model in hand, we can now calculate predicted probabilities, get predictions, and examine out-of-sample performance using our test set:

In [None]:
# The predict method outputs the probability of survival
probs = model.predict(X_test)

# Convert the probabilities to binary predictions using a 0.5 threshold
predictions = (probs > 0.5).astype(int)

print(f'Here are the first 10 predictions: {predictions[0:10]}')

# Calculate evaluation metrics
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1: {f1}')
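
If you want to see what these three metrics are measuring, the sketch below computes them by hand from counts of true positives, false positives, and false negatives. The probabilities and labels are made up for illustration, and the helper name `precision_recall_f1` is hypothetical:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up probabilities and labels, purely for illustration
toy_probs = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]
toy_labels = [1, 1, 0, 0, 1, 0]
toy_preds = [int(p > 0.5) for p in toy_probs]   # same 0.5 threshold as above

toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_labels, toy_preds)
print(f"Precision: {toy_precision:.3f}, Recall: {toy_recall:.3f}, F1: {toy_f1:.3f}")
```

Here the model flags three passengers as survivors and gets two right (precision of 2/3), while finding two of the three actual survivors (recall of 2/3).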

To avoid having to repeat the code in the previous cell each time that we want to assess model performance, let's create a function that we can reuse:

In [None]:
def evaluate_performance(model, X_test, y_test):
    # The predict method outputs the probability of survival
    probs = model.predict(X_test)

    # Convert the probabilities to binary predictions using a 0.5 threshold
    predictions = (probs > 0.5).astype(int)

    # Calculate evaluation metrics
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)

    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1: {f1}')

## Making our network deep(er)

The example we've used so far is very, very simple and doesn't really demonstrate the power of NNs for solving complex machine learning tasks. Let's add additional hidden layers to allow our model to learn more complex patterns in the data:

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Input

# Build the model
model = Sequential([
    Input(shape=(6,)),
    Dense(16, activation='relu'),  # Add a hidden layer!    
    Dense(8, activation='relu'),  # Add a hidden layer!           
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=25, batch_size=16, validation_data=(X_test, y_test))

plot_loss(history)

How does our more complex model perform? Let's take a look:

In [None]:
evaluate_performance(model, X_test, y_test)

### Dropout

Dropout is a **regularization technique** used to prevent overfitting in neural networks. During training, dropout randomly sets a fraction of the input units to zero at each update. This prevents the network from becoming too reliant on any particular neurons and encourages the network to learn more robust features. Dropout can be applied to both input and hidden layers.

Key points about dropout:
- **Randomly drops neurons**: During each training iteration, a random subset of neurons is ignored (dropped out).
- **Reduces overfitting**: By preventing neurons from co-adapting too much, dropout helps in reducing overfitting.
- **Improves generalization**: Dropout forces the network to learn more general features that are useful across different subsets of data.

Regularization is a big topic, and a full account of these techniques is outside the scope of this module. In a nutshell, regularization techniques aim to prevent overfitting by constraining the model. Many of them (such as L1 and L2 regularization) do so by adding a penalty to the loss function, whereas dropout does so by randomly removing units during training. Either way, the goal is to discourage the model from fitting the noise in the training data and to encourage simpler models that generalize better to unseen data.
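
To illustrate the mechanics of dropout, here is a minimal sketch in plain Python. It assumes the common "inverted dropout" formulation, in which the surviving activations are scaled up by `1 / (1 - rate)` during training so that their expected value is unchanged. The function below is a toy stand-in, not the `keras` implementation:

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)        # fixed seed so the example is repeatable
hidden = [1.0] * 10_000       # pretend activations coming out of a hidden layer
dropped = dropout(hidden, rate=0.2, rng=rng)

frac_zero = sum(1 for a in dropped if a == 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(f"Fraction of units dropped: {frac_zero:.3f}")
print(f"Mean activation after dropout: {mean_after:.3f}")
```

Roughly 20% of the units are zeroed out, yet the average activation stays close to 1 because the survivors are scaled up. At prediction time, dropout is switched off and all units are used.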

In [None]:
from keras.layers import Dropout

# Build the model
model = Sequential([
    Input(shape=(6,)),
    Dense(16, activation='relu'),  # Add a hidden layer!  
    Dropout(0.2),  # Add dropout!  
    Dense(8, activation='relu'),  # Add a hidden layer!
    Dropout(0.2),  # Add dropout!          
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=25, batch_size=16, validation_data=(X_test, y_test))

plot_loss(history)

evaluate_performance(model, X_test, y_test)

## The "art" of training a neural network

As we've discussed, training a neural network is more art than science. How many epochs should we use? What's a good learning rate? What is a good batch size? Answering these -- and many other! -- questions is an important part of training a neural network, and there is rarely (if ever) a one-size-fits-all approach.

This section outlines a handful of techniques that I've found useful over the years when fitting these networks.


### Early stopping

The `keras` library has a number of different `callbacks` that we can use to monitor training performance and make decisions on our behalf. Setting up an **early stopping** rule is one such `callback` that helps avoid using too many epochs.

In [None]:
from keras.callbacks import EarlyStopping

# Build the model
model = Sequential([
    Input(shape=(6,)),
    Dense(16, activation='relu'),  # Add a hidden layer!  
    Dropout(0.2),  # Add dropout!  
    Dense(8, activation='relu'),  # Add a hidden layer!
    Dropout(0.2),  # Add dropout!          
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')
])


# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stopping = EarlyStopping(
    monitor='val_loss',       # Monitor the validation loss
    patience=3,               # Number of epochs with no improvement to wait before stopping
    restore_best_weights=True # Restore model weights to the best epoch
)

history = model.fit(
    X_train, y_train,
    epochs=100,               # Set a high maximum number of epochs
    batch_size=16,
    validation_data=(X_test, y_test),
    callbacks=[early_stopping] # Use early stopping
)

plot_loss(history)
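
The `patience` logic can be hard to picture, so here is a simplified sketch of the rule in plain Python. It is a toy model of the idea (it ignores options such as `min_delta`), the validation losses are made up, and the helper name `early_stopping_epoch` is hypothetical:

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0                       # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stop training here
    return len(val_losses) - 1, best_epoch  # never triggered: ran all epochs

toy_val_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=3)
print(f"Stopped at epoch {stop_epoch}; best weights came from epoch {best_epoch}")
```

Notice that training stops before the final (and lowest) loss of 0.50 is ever reached. That is the trade-off with a small `patience`: you save time, but you can stop too early on a noisy validation curve.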

### Adjusting the learning rate

The default learning rate for the `Adam` optimizer in `keras` is `0.001`, which works pretty well a lot of the time. However, sometimes it helps to adjust this learning rate (up or down) depending on the dataset at hand. Here's an example of how to adjust the learning rate used by the `Adam` optimizer:

In [None]:
# Import the Adam optimizer
from keras.optimizers import Adam

# Build the model
model = Sequential([
    Input(shape=(6,)),
    Dense(16, activation='relu'),  # Add a hidden layer!  
    Dropout(0.2),  # Add dropout!   
    Dense(8, activation='relu'),  # Add a hidden layer!
    Dropout(0.2),  # Add dropout!          
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Define the Adam optimizer with a custom learning rate
custom_adam = Adam(learning_rate=0.01)

# Compile the model
model.compile(optimizer=custom_adam, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=25, batch_size=16, validation_data=(X_test, y_test))

plot_loss(history)

Another (and probably better!) way to adjust the learning rate is by using the `ReduceLROnPlateau` callback in `keras`. By using this callback, you can start with a larger learning rate for early epochs and decrease the learning rate for later epochs when the validation loss stops improving. Here's how you would implement this callback:

In [None]:
from keras.callbacks import ReduceLROnPlateau

# Build the model
model = Sequential([
    Input(shape=(6,)),
    Dense(16, activation='relu'),  # Add a hidden layer!  
    Dropout(0.2),  # Add dropout!  
    Dense(8, activation='relu'),  # Add a hidden layer!
    Dropout(0.2),  # Add dropout!          
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Define the Adam optimizer with a custom learning rate
custom_adam = Adam(learning_rate=0.01)

# Compile the model
model.compile(optimizer=custom_adam, loss='binary_crossentropy', metrics=['accuracy'])

reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',       # Monitor the validation loss
    factor=0.5,               # Reduce learning rate by half
    patience=2,               # Wait for 2 epochs of no improvement
    min_lr=.001,              # Minimum learning rate
)

history = model.fit(
    X_train, y_train,
    epochs=25,
    batch_size=16,
    validation_data=(X_test, y_test),
    callbacks=[reduce_lr]
)

plot_loss(history)
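
To get a feel for how `factor`, `patience`, and `min_lr` interact, the sketch below simulates a reduce-on-plateau rule on a made-up sequence of validation losses. It is a simplified toy version of the idea (it ignores options such as `cooldown` and `min_delta`), not the `keras` implementation, and the helper name `simulate_reduce_lr` is hypothetical:

```python
def simulate_reduce_lr(val_losses, initial_lr, factor, patience, min_lr):
    """Return the learning rate in effect during each epoch."""
    lr = initial_lr
    best_loss = float("inf")
    wait = 0                              # epochs since the last improvement
    lrs = []
    for loss in val_losses:
        lrs.append(lr)                    # rate used during this epoch
        if loss < best_loss:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # shrink, but never below min_lr
                wait = 0
    return lrs

# Made-up losses that improve twice and then plateau
plateau_losses = [0.60, 0.55, 0.56, 0.57, 0.58, 0.59,
                  0.60, 0.61, 0.62, 0.63, 0.64, 0.65]
lr_schedule = simulate_reduce_lr(
    plateau_losses, initial_lr=0.01, factor=0.5, patience=2, min_lr=0.001
)
print(lr_schedule)
```

With these settings, the learning rate is halved after every two epochs without improvement until it reaches the `min_lr` floor of 0.001.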

One really cool feature about using callbacks in `keras` is that they can be combined. For instance, we can systematically update our learning rate, while also setting up an early stopping rule:

In [None]:
# Build the model
model = Sequential([
    Input(shape=(6,)),
    Dense(16, activation='relu'),  # Add a hidden layer!  
    Dropout(0.2),  # Add dropout!  
    Dense(8, activation='relu'),  # Add a hidden layer!
    Dropout(0.2),  # Add dropout!          
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Define the Adam optimizer with a custom learning rate
custom_adam = Adam(learning_rate=0.005)

# Compile the model
model.compile(optimizer=custom_adam, loss='binary_crossentropy', metrics=['accuracy'])

# Reduce LR when val_loss plateaus
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,            # Reduce LR by half
    patience=3,            # Wait 3 epochs of no improvement
    min_lr=1e-6,           # Minimum allowed LR
    verbose=1
)

# Stop training early if val_loss does not improve
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,            # Stop after 5 epochs of no improvement
    restore_best_weights=True,  # Restore best weights before stopping
    verbose=1
)

# Train the model with both callbacks
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=16,
    validation_data=(X_test, y_test),
    callbacks=[reduce_lr, early_stopping]
)

## Resources on training neural networks in Python

The purpose of this week was to introduce you to the key concepts of neural networks and to show you how these models can be trained in practice. In this class (and probably in your careers!), we are rarely going to build a neural network from scratch, but instead use **transfer learning** to efficiently fine-tune neural networks created by others to work for our specific tasks.

If, however, you wanted to know more about training NNs in Python, there are a TON of resources online. Here are some of my favourites:

- Francois Chollet's book, [Deep Learning with Python](https://sourestdeeds.github.io/pdf/Deep%20Learning%20with%20Python.pdf)
- If YouTube is more your thing, then take a look at the [Neural networks playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) created by 3Blue1Brown.
- And many, many more!