# Exercise 4: Model Learning

> Welcome to the Advanced Robot Learning and Decision Making exercises!

In this exercise, we will review **model learning**, a fundamental concept in robot learning. Specifically, we will focus on learning the dynamics of a quadrotor from data using two popular model types: **Gaussian Processes (GPs)** and **Neural Networks (NNs)**.

Model learning involves approximating the underlying dynamics of a system based on observed data. This is crucial for tasks such as control, planning, and simulation in robotics. By learning a model of the system's dynamics, we can predict how the system will behave and use that information to improve decision-making and control.

### Exercise Overview

1. **Data Collection and Preprocessing**
   
   Collect state transition data from a simulated quadrotor and preprocess the data for use with GPs and NNs.

2. **Gaussian Processes**
  
   Implement and train a multi-output GP model. Evaluate the GP model's performance and analyze its uncertainty estimates.

3. **Introduction to Deep Learning**

   PyTorch, autograd, and basic optimization.

4. **Neural Networks**
   
   Design and train an NN for the same task as for the GPs. Evaluate the NN model's performance and analyze its learning curves.

5. **Performance Comparison**

   Compare the performance of the GP and NN models in terms of accuracy, training time, and inference time. Discuss the trade-offs and practical considerations for using each model type.


_____

By the end of this exercise, you will have a solid understanding of how to use Gaussian Processes and Neural Networks for learning dynamics from data, as well as the strengths and weaknesses of each approach.

### Objectives of This Exercise
1. **Gaussian Processes (GPs):**
   - Understand the basics of GPs and their application in modeling dynamics.
   - Learn how to implement and use GPs for multi-output regression.
   - Explore the benefits of GPs, such as uncertainty quantification, and their limitations, such as scalability.

2. **Neural Networks (NNs):**
   - Understand the basics of NNs and their application in modeling dynamics.
   - Learn how to design, train, and evaluate NNs for regression tasks.
   - Explore best practices for training NNs, including normalization, regularization, and debugging learning curves.

3. **Comparison of GPs and NNs:**
   - Compare the performance of GPs and NNs in terms of accuracy, training time, and inference time.
   - Discuss the trade-offs between these two model types and their suitability for different tasks.

### Why Gaussian Processes and Neural Networks?
- **Gaussian Processes** are non-parametric models that provide uncertainty estimates, making them ideal for tasks where understanding model confidence is critical. However, they can be computationally expensive for large datasets.
- **Neural Networks** are parametric models that excel at handling large datasets and complex relationships. They are highly scalable but require careful tuning to avoid overfitting and ensure generalization.


_____



Imports

In [1]:
%load_ext autoreload
%autoreload 2

import time

import matplotlib.pyplot as plt
import numpy as np
import torch
from data_collection import collect_state_transitions
from gaussian_process import MultiOutputGaussianProcess, SVGPTrainer
from neural_network import NeuralNetwork, RegressionTrainer
from plotting_utils import (
    plot_2d_positions_with_std,
    plot_error_distribution,
    plot_prediction_vs_truth,
    plot_random_test_point,
)
from sklearn.model_selection import train_test_split
from utils import (
    CustomDataset,
    Normalizer,
    error_statistics,
    generate_synthetic_data,
    plot_learning_curves,
    set_seed,
)



## 1. Data Collection and Preprocessing
Data collection and processing are critical steps in any machine learning pipeline, especially in robotics, where data quality is extremely important and data collection is often difficult.

In this exercise, we collect state transitions from a simulated quadrotor drone flying in a noisy environment using a pretrained DRL policy (which you will see again in exercise 06). These transitions include sequences of states, actions, and the resulting next states. Of course, in practice data collection is done in real-world environments, not in simulation.

Let's review some fundamental concepts in data collection and processing.
### Data Collection
- **"Garbage in, Garbage out"**: The performance of data-based methods depends strongly on the available data. If the training data is poor or unrepresentative, the resulting model will also perform poorly. However, this may not show during training, which is why critically evaluating the learning process is an important part of training data-based models.
<!-- - **Simulation vs Real-World Deployment**: Simulations are faster, easier, and safer compared to real-world data collection. They allow for controlled environments and reproducibility. However, simulations often lack the details and imperfections of real-world scenarios, such as sensor noise, environmental variability, and unexpected dynamics. Simulators themselves are often built using real-world data to approximate reality. -->
- **Importance of Data Diversity**: Diversity in the collected data is key to preventing overfitting and enabling generalization. A diverse dataset ensures that the model can handle a wide range of scenarios, including varying initial conditions, different trajectories or behaviors, and environmental factors.

### Data Preprocessing
Before training a model, data is almost always processed to improve learning efficiency and model performance. Key preprocessing steps include:

1. **Normalization**: 
   - Normalizing input features (e.g., states and actions) ensures that all features have similar scales. This prevents certain features from dominating the learning process and often significantly improves the performance of NNs and other data-based models.

2. **Data Augmentation**:
   - Augmenting the data by adding noise, rotations, scaling, or other transformations can increase the effective dataset size and improve the model's robustness to variations. This is particularly useful when the available dataset is small.

3. **Data Splitting**:
   - Splitting the dataset into training, validation, and test sets is crucial for evaluating model performance. The training set is used to train the model, the validation set is used to tune hyperparameters and prevent overfitting, and the test set is used to assess the final model performance.
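
The normalization and splitting steps above can be sketched with plain NumPy. This is an illustrative sketch under assumptions, not the `Normalizer` class from `utils.py`: the helper names `fit_standardizer` and `standardize` and the toy data are hypothetical. The statistics are computed on the training split only, so no information from the test split leaks into preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed, not the drone data): 100 samples, 3 features on very different scales
data = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.01], size=(100, 3))

# Shuffle once, then split 80/20 into training and test sets
perm = rng.permutation(len(data))
train, test = data[perm[:80]], data[perm[80:]]


def fit_standardizer(x):
    """Return the per-feature mean and standard deviation of x."""
    return x.mean(axis=0), x.std(axis=0) + 1e-8  # epsilon avoids division by zero


def standardize(x, mean, std):
    """Shift and scale x using the given per-feature statistics."""
    return (x - mean) / std


mean, std = fit_standardizer(train)  # fit on the training data only
train_norm = standardize(train, mean, std)
test_norm = standardize(test, mean, std)  # reuse the training statistics
print(train_norm.mean(axis=0), train_norm.std(axis=0))
```

The test split will generally not have exactly zero mean and unit variance after this transformation, which is expected.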

In [None]:
# We are collecting data using a pretrained DRL agent. Review the collect_state_transitions function in data_collection.py for more details.
# NOTE if data exists in /outputs/*.pkl already, it will be loaded from disk when load_if_exists is True
recorded_data = collect_state_transitions(load_if_exists=True)

Let's plot the drone's x and z positions to verify how the drone flew.

In [None]:
x_positions = recorded_data["states"][:, 0]  # x
z_positions = recorded_data["states"][:, 2]  # z

# Plot the trajectory and positions
plt.figure(figsize=(8, 6))
plt.scatter(x_positions, z_positions, color="blue", alpha=0.5, s=2)
plt.plot(x_positions, z_positions, color="black", alpha=0.3)
plt.title("Drone's Positions")
plt.xlabel("X Coordinate")
plt.ylabel("Z Coordinate")
plt.grid(True)
plt.show()

Next, we extract the features (states and actions) and the targets (next states) from the recorded data. Then we split the data into training and test sets using the scikit-learn library. As we are not doing any automatic hyperparameter tuning, we do not need a validation set.

In [None]:
# Extract x and y datasets from recorded_data
x_drone = np.concatenate((recorded_data["states"], recorded_data["actions"]), axis=1)
y_drone = recorded_data["next_states"]  # we will need those variables later

# Perform train test split
x_train, x_test, y_train, y_test = train_test_split(
    x_drone, y_drone, test_size=0.2, random_state=42
)

raw_data = {"x_train": x_train, "x_test": x_test, "y_train": y_train, "y_test": y_test}

# Print the shapes of the train and test datasets
for key, value in raw_data.items():
    print(f"Shape of {key}: {value.shape}")

## 2. Gaussian Process Regression

Gaussian Processes (GPs) are powerful tools for regression tasks, especially in robotics, where quantifying uncertainty is crucial. In this section, we review the fundamentals of Gaussian Processes and regression, introduce the task of implementing a custom multi-output GP model, and discuss the difference between learning complete dynamics and residual dynamics.

### Basics of Gaussian Processes
- **Gaussian Processes** are non-parametric models that define a distribution over functions. They are particularly useful for regression because they provide both predictions and associated uncertainty estimates.
- A GP is fully specified by a **mean function** and a **kernel (covariance) function**:
  - The **mean function** gives the expected value of the function at any input.
  - The **kernel function** measures similarity between inputs and determines the smoothness and complexity of the function.
- GPs are best suited for small to medium-sized datasets due to their computational complexity, which scales cubically with the number of data points.

### Task Overview: Implementing a Multi-Output Gaussian Process Model
In this section, you will:
1. Implement a custom multi-output GP model.
2. Fit the GP model to previously collected training data.
3. Evaluate the model’s performance by predicting the mean and uncertainty of the next states.

The implementation involves:
- Defining a **Radial Basis Function (RBF) kernel**, commonly used for its smoothness properties.
- Completing the **predict()** function to compute the posterior mean and variance for new inputs.
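
As a rough sketch of what an RBF kernel and a GP posterior prediction involve, the snippet below implements a zero-mean, single-output GP in NumPy. It is an illustration under assumptions, not the reference implementation of `MultiOutputGaussianProcess`: the function signatures, the helper `gp_predict`, the parameter `noise_var` (treated here as a noise variance), and the toy data are all hypothetical. With $K = k(X, X)$ and $K_* = k(X_*, X)$, it uses the standard expressions $\mu_* = K_* (K + \sigma_n^2 I)^{-1} y$ and $\Sigma_* = k(X_*, X_*) - K_* (K + \sigma_n^2 I)^{-1} K_*^\top$, where $k(x, x') = \exp(-\lVert x - x' \rVert^2 / (2 \ell^2))$.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of a and b."""
    sq_dist = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_predict(x_train, y_train, x_new, length_scale=1.0, noise_var=1e-2):
    """Posterior mean and standard deviation of the latent function (single output)."""
    k_train = rbf_kernel(x_train, x_train, length_scale) + noise_var * np.eye(len(x_train))
    k_cross = rbf_kernel(x_new, x_train, length_scale)
    chol = np.linalg.cholesky(k_train)  # O(n^3) step that limits GP scalability
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_cross @ alpha
    v = np.linalg.solve(chol, k_cross.T)
    var = 1.0 - np.sum(v**2, axis=0)  # k(x, x) = 1 for this RBF kernel
    return mean, np.sqrt(np.maximum(var, 0.0))


# Toy 1D example (assumed data): noise-free samples of sin(x)
x_train = np.linspace(0.0, 5.0, 20)[:, None]
y_train = np.sin(x_train[:, 0])
x_new = np.array([[2.5], [10.0]])  # one point inside the data, one far outside
mean, std = gp_predict(x_train, y_train, x_new)
print(mean, std)
```

Far away from the training data the kernel values vanish, so the prediction falls back to the prior: the mean returns to zero and the standard deviation approaches one.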

### Learning Complete Dynamics vs. Residual Dynamics
When modeling system dynamics, two common approaches are:

1. **Learning Complete Dynamics**:
   - The GP learns the entire mapping from the current state and action to the next state.
   - This approach typically requires more data and is computationally more expensive.

2. **Learning Residual Dynamics**:
   - If a base model $f_\mathrm{base}$ is available (e.g., a physics-based model or an informed guess), the GP learns the residuals $y_\mathrm{res} = y - f_\mathrm{base}(x)$.
   - This approach is often more stable and data-efficient, as the GP only models the discrepancies between the base model and observed data.
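
The residual setup can be sketched as follows. Everything here is hypothetical and chosen for illustration: the base model `f_base`, the "true" system `f_true`, and the use of a least-squares fit on a few hand-picked features as a stand-in for the GP.

```python
import numpy as np

rng = np.random.default_rng(0)


def f_base(x):
    """Hypothetical nominal model: captures only the linear part of the dynamics."""
    return 0.9 * x


def f_true(x):
    """Hypothetical true system: nominal part plus an unmodeled nonlinearity."""
    return 0.9 * x + 0.1 * np.sin(3.0 * x)


def features(x):
    """Hand-picked features for the stand-in learner (assumption, replaces the GP)."""
    return np.stack([np.sin(3.0 * x), np.cos(3.0 * x), np.ones_like(x)], axis=1)


x = rng.uniform(-2.0, 2.0, size=200)
y = f_true(x) + 0.01 * rng.normal(size=200)  # noisy observations

# Residual targets: the learner only has to explain what f_base gets wrong
y_res = y - f_base(x)
coef, *_ = np.linalg.lstsq(features(x), y_res, rcond=None)


def predict(x):
    """Full prediction = base model + learned residual."""
    return f_base(x) + features(x) @ coef


x_test = np.linspace(-2.0, 2.0, 50)
err_base = np.abs(f_base(x_test) - f_true(x_test)).max()
err_full = np.abs(predict(x_test) - f_true(x_test)).max()
print(f"max error base model: {err_base:.3f}, base + residual: {err_full:.3f}")
```

The learned part only needs to represent the small discrepancy, which is why residual learning tends to need less data than learning the full mapping.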

In this exercise, we will focus on learning the complete dynamics because it is simpler to implement. However, in practice, learning residual dynamics is often preferred when a reliable base model is available. The next exercise (GP-MPC) will deal with such a residual model learning setup.

<div class="alert alert-success">
    <h3>Exam Preparation</h3>
    <p>
    Review the lecture note sections that consider Gaussian Processes (GPs). Then discuss important GP parameters such as kernel selection.
    </p>
</div>

<div class="alert alert-info">
    <h3>Task 1: Review and complete the MultiOutputGaussianProcess class</h3>
    Review the class <code>MultiOutputGaussianProcess</code> in <code>gaussian_process.py</code>.
    <p>
    Implement the <code>rbf_kernel()</code> function that calculates the Radial Basis Function (RBF) kernel.
    <p>
    Afterwards complete the <code>predict()</code> function (in the same class) that predicts the posterior mean and variance.
</div>

In [None]:
# Fit a Gaussian Process to the data
gp_results = {}
gp = MultiOutputGaussianProcess(length_scale=0.2, noise=0.1)
start = time.time()
gp.fit(raw_data["x_train"], raw_data["y_train"])
gp_results["t_train"] = time.time() - start

In [None]:
# Predict test set
start = time.time()

gp_results["y_pred"], gp_results["y_std"] = gp.predict(raw_data["x_test"], return_std=True)

gp_results["t_pred"] = time.time() - start
print(f"Took {gp_results['t_pred']}s.")

In [None]:
# Compute the mean absolute error (MAE), root mean squared error (RMSE), and uncertainty statistics
gp_results["mae"], gp_results["rmse"], gp_results["uncertainty"] = error_statistics(
    raw_data["y_test"], gp_results["y_pred"], gp_results["y_std"]
)
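
For reference, the two error metrics named above can be written in a few lines of NumPy. This is a sketch and not the `error_statistics` function from `utils.py`, whose internals are not shown here. The helper names `mae_rmse` and `coverage` are hypothetical, and the coverage measure (the fraction of targets inside a $\pm 2\sigma$ band) is one possible way to check the predicted uncertainty, added here as an assumption.

```python
import numpy as np


def mae_rmse(y_true, y_pred):
    """Per-output mean absolute error and root mean squared error."""
    err = y_pred - y_true
    return np.abs(err).mean(axis=0), np.sqrt((err**2).mean(axis=0))


def coverage(y_true, y_pred, y_std, n_sigma=2.0):
    """Fraction of targets that fall inside the +/- n_sigma predictive band."""
    return (np.abs(y_pred - y_true) <= n_sigma * y_std).mean(axis=0)


# Tiny hand-checkable example with two outputs (made-up numbers)
y_true = np.array([[0.0, 1.0], [2.0, 3.0]])
y_pred = np.array([[1.0, 1.0], [2.0, 0.0]])
y_std = np.array([[0.6, 0.1], [0.1, 1.0]])
mae, rmse = mae_rmse(y_true, y_pred)
print(mae, rmse, coverage(y_true, y_pred, y_std))
```

The RMSE penalizes large individual errors more strongly than the MAE, so a large gap between the two hints at a few badly predicted points.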

Next, we inspect results via plotting. The plotting functions are in <code>plotting_utils.py</code>.

In [None]:
# define lists that are convenient for plotting
state_names = [
    "pos_x",
    "pos_y",
    "pos_z",
    "vel_x",
    "vel_y",
    "vel_z",
    "quat_w",
    "quat_x",
    "quat_y",
    "quat_z",
    "ang_vel_x",
    "ang_vel_y",
    "ang_vel_z",
]
action_names = ["thrust", "roll", "pitch", "yaw"]

# Select two states to visualize
state_1_idx = 0  # x
state_2_idx = 2  # z

# Plot predictions of two states with standard deviation for a random single point
# (the return value gets its own name so that matplotlib's `plt` is not overwritten)
plt1 = plot_random_test_point(
    raw_data["y_test"],
    gp_results["y_pred"],
    gp_results["y_std"],
    state_names,
    state_1_idx=state_1_idx,
    state_2_idx=state_2_idx,
)

# Plot predictions of two states with standard deviation
plt2 = plot_2d_positions_with_std(
    raw_data["x_train"],
    raw_data["y_train"],
    raw_data["x_test"],
    raw_data["y_test"],
    gp_results["y_pred"],
    gp_results["y_std"],
    state_names,
    state_1_idx=0,  # x
    state_2_idx=2,  # z
    show_train=False,
    x_lim=(-0.4, -0.1),  # for better visibility
    y_lim=(0.5, 0.6),  # for better visibility
)
# Plot prediction, uncertainty and ground truth for a single feature
feature_idx = 0
plot_prediction_vs_truth(
    raw_data["y_test"],
    gp_results["y_pred"],
    gp_results["y_std"],
    feature_idx=feature_idx,
    feature_name=state_names[feature_idx],
)

# Plot the error distribution across all states
plot_error_distribution(raw_data["y_test"], gp_results["y_pred"], state_names)

<div class="alert alert-success">
    <h3>Exam Preparation</h3>
    <p>
    Review the GP Prediction vs Ground Truth plot and answer the following questions.
    </p>
    <ul>
        <li>Why are there certain regions where the model confidence decreases significantly? (There is a precise answer for our dataset.)</li>
        <li>What could be done to counteract that?</li>
    </ul>
</div>

## 3. Introduction to Deep Learning: PyTorch, Autograd and Basic Optimization

For more involved optimization problems that require gradient computation, we usually use PyTorch. PyTorch can be described as "NumPy for the GPU": it enables efficient parallelization on the GPU as well as automatic gradient computation (autograd). Arrays in PyTorch are called tensors.
For an introduction, check out parts 0, 1, 2, 5 and 6 of the [PyTorch guide](https://pytorch.org/tutorials/beginner/basics/intro.html#how-to-use-this-guide). We strongly recommend working through those tutorials, as you will use PyTorch a lot if you pursue deep learning applications. They will also help with our deep reinforcement learning exercises later on. In the following, we review some basic features of PyTorch.

In [None]:
# We can create basic operations, just as in numpy. Arrays (numpy) are called tensors in PyTorch.

# tensor creation
a = torch.tensor([1, 2, 3, 4])  # from list
b = torch.ones(2, 4)  # 2x4 matrix of ones
c = a + b  # broadcasting, just like in numpy
print("a:", a)
print("b:", b)
print("c:", c)

A powerful feature of PyTorch is its support for operations on the GPU.
For this, tensors need to be moved to the GPU device.
By default, this devcontainer is set up to work with the CPU, which should be sufficient for this exercise.
However, you can easily switch to the GPU; see the README.md of this repo. You will need an NVIDIA GPU in that case. Modern deep learning workloads run almost exclusively on GPUs because of the large speed-ups.

Note that all tensors involved in a calculation must be on the same device.

In [None]:
# inspect location of a tensor; default is cpu
print(a.device)

# move to GPU, if available
if torch.cuda.is_available():
    print("CUDA is available")
    a = a.to("cuda")
    print(a.device)
else:
    print("CUDA is not available, using CPU instead.")

# move back to CPU
a = a.to("cpu")
print(a.device)

Let's review some basic operations in PyTorch.

In [None]:
# reshaping
a = a.reshape(2, 2)
b = b.reshape(-1, 2)  # -1 infers the size
print("a:", a)
print("b:", b)

In [None]:
# concatenation
c = torch.cat([a, b], dim=0)  # stack a and b vertically
print("c:", c)

In [None]:
# indexing
print(c[1, 1])  # get the element at row 1, column 1

Another powerful feature of PyTorch is [Autograd (automatic differentiation)](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html). If you haven't taken a deep learning course, we strongly recommend checking out the linked tutorial. Autograd automatically tracks operations in a computational graph so that gradients can be computed with a backward pass.

In [None]:
x = torch.tensor(
    [1.0, 2.0, 3.0], requires_grad=True
)  # requires_grad tells PyTorch to track gradients (off by default for plain tensors, on for nn.Parameter)

# calculate a function on the input tensor
y = torch.sin(x) * torch.sin(x)
z = y.sum()  # that's our 'target' function

z.backward()  # compute gradients in computational graph
print("\nGradient after sin²(x) and sum:")
print("Gradient with PyTorch: dx =", x.grad)

x.grad.zero_()  # reset gradients. This is important if you want to perform another backward pass later on.

with torch.no_grad():  # disable gradient tracking (used for faster and less memory intense inference)
    print("Manual analytical gradient: dx =", 2 * torch.sin(x) * torch.cos(x))

Autograd can be used for gradient-based optimization (as is done for neural networks). In this example, we fit a parameterized basis-function model to some simple synthetic data. Note that this is only an example of how a simple optimization can be computed. Training neural networks requires further strategies that we introduce later.
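
To make explicit what autograd automates, the following NumPy sketch derives the gradient of a mean squared error by hand for a model that is linear in its weights, checks it against central finite differences, and then runs plain gradient descent. The random feature matrix stands in for a basis-function matrix; all names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))  # stand-in for a basis-function matrix
targets = features @ np.array([1.0, -2.0, 0.5])  # assumed noise-free targets
w = np.zeros(3)


def mse(w):
    """Mean squared error of the linear model features @ w."""
    return np.mean((features @ w - targets) ** 2)


def mse_grad(w):
    """Analytic gradient of the MSE: (2 / N) * X^T (X w - y)."""
    return 2.0 / len(targets) * features.T @ (features @ w - targets)


# Central finite differences as an independent check of the analytic gradient
eps = 1e-6
numeric = np.array([(mse(w + eps * e) - mse(w - eps * e)) / (2 * eps) for e in np.eye(3)])
grad_matches = np.allclose(numeric, mse_grad(w), atol=1e-5)
print("analytic gradient matches finite differences:", grad_matches)

# Plain gradient descent using the analytic gradient
for _ in range(500):
    w -= 0.1 * mse_grad(w)
print(w)
```

With autograd, the function `mse_grad` does not have to be written by hand, which matters once the model is no longer a simple linear map.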

In [None]:
# get random dataset
X_np, y_np = generate_synthetic_data(n_samples=100, noise_level=0.1)
# convert to PyTorch tensors
X = torch.FloatTensor(X_np)
y = torch.FloatTensor(y_np)

# Construct basis functions. In our case it's easy, since we know the functions used in generate_synthetic_data
X_poly = torch.cat(
    [
        X,
        X**2,
        X**3,
        X**4,
        X**5,  # Polynomial features
        torch.sin(6 * X),  # Sinusoidal basis
        torch.exp((X - 0.5) ** 2),  # Exponential of the squared offset (no minus sign, so not a true Gaussian bump)
    ],
    dim=1,
)

# Define trainable weights
# Our trainable function will be X_poly @ weights
weights = torch.randn(X_poly.shape[1], 1, requires_grad=True)

# plot the data and the initial prediction
plt.figure(figsize=(10, 6))
plt.scatter(X_np, y_np, label="Training data")
plt.plot(X_np, (X_poly @ weights).detach().numpy(), "r-", label="Prediction")
plt.title("Initial prediction")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()

In a training loop, we iterate over the data (X) to optimize the weights of our target function to best fit the data.

For the optimization, we require an optimizer and a loss function (the objective). A good default optimizer is Adam, and a typical loss function for regression is the mean squared error (L2 loss).

If you haven't worked with deep learning before, please check out how to write basic optimization loops [in the PyTorch tutorials](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html). You will need this knowledge when working with deep neural networks.
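
For intuition about what an Adam step does internally, here is a minimal NumPy sketch of the Adam update rule applied to a toy quadratic objective. The function name `adam_minimize`, the objective, and the step count are assumptions for illustration; in the exercise itself the optimizer comes from `torch.optim`.

```python
import numpy as np


def adam_minimize(grad_fn, w0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=2000):
    """Minimize a function with the Adam update rule, given its gradient."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)  # running mean of gradients (first moment)
    v = np.zeros_like(w)  # running mean of squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w


# Toy objective (assumed): L(w) = ||w - target||^2 with gradient 2 * (w - target)
target = np.array([1.0, -3.0])
w_opt = adam_minimize(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
print(w_opt)
```

Because each step is scaled by the running gradient magnitude, the initial step size per parameter is roughly `lr` regardless of how large the raw gradient is.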

In [None]:
# Create optimizer
# An optimizer defines a routine of how the weights should be updated depending on the gradients. A key parameter is the learning rate: https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html
# The trainable weights need to be registered with the optimizer
optimizer = torch.optim.Adam([weights], lr=0.01)

# Training loop
n_epochs = 500
losses = []

# one epoch is one full pass through the data
# please read up on training loops if you haven't done so
for epoch in range(n_epochs):
    # Forward pass
    y_pred = X_poly @ weights
    loss = torch.mean((y_pred - y) ** 2)  # L2 loss

    # Backward pass
    optimizer.zero_grad()  # reset gradients
    loss.backward()  # calculate gradients
    optimizer.step()  # perform parameter updates based on the gradients using the defined optimizer

    losses.append(loss.detach().cpu().numpy())

    # Print progress every 50 epochs
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item():.4f}")

# Plot results
plt.figure(figsize=(12, 4))

# Plot 1: Final fit
plt.subplot(1, 2, 1)
plt.scatter(X_np, y_np, alpha=0.5, label="Data")
plt.plot(X_np, y_pred.detach().numpy(), "r-", label="Fitted curve")
plt.title("Data and Optimized Prediction")
plt.legend()

# Plot 2: Loss curve
plt.subplot(1, 2, 2)
plt.plot(losses)
plt.title("Loss over epochs")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")

plt.tight_layout()
plt.show()

# You will notice that the fitting gets better if you run this cell multiple times, as we do not reset the weights.

Note that when using such naive basis functions, the final optimization result depends heavily on the initial guess. You can experiment with more complex basis functions and different starting conditions to find a better fit. Also try different learning rates and numbers of training epochs to get a feel for the optimization. In practice, the loss curve is a good tool for debugging when training neural networks.

## 4. Neural Networks

Neural Networks are a class of machine learning models inspired by the structure and function of the human brain. They consist of layers of interconnected nodes (neurons) that process input data to learn patterns and make predictions. Fundamentally, neural networks are overparameterized, complex functions whose parameters are optimized using a dataset to minimize a loss function, thereby modeling the relationships in the data.

In the following, we give a short, nonexhaustive introduction to NNs.

### Why Use Neural Networks?
Neural Networks excel in tasks involving large datasets and complex, non-linear relationships due to their:
- **Scalability**: Handle high-dimensional inputs and large datasets effectively.
- **Expressiveness**: Can approximate any continuous function on a compact domain to arbitrary accuracy given sufficient capacity (Universal Approximation Theorem).
- **Flexibility**: Adapt to tasks like regression, classification, and reinforcement learning.
- **End-to-End Learning**: Learn directly from raw data, reducing the need for feature engineering.

### Key Components of Neural Networks
- **Input Layer**: Receives the input features (e.g., states, actions, and observations in robotics).
- **Hidden Layers**: Extract patterns and relationships in the data. Stacking layers increases capacity.
- **Output Layer**: Produces final predictions (e.g., next states when learning dynamics).
- **Weights and Biases**: Parameters adjusted during training to minimize prediction error.
- **Activation Functions**: Introduce non-linearity, enabling the model to learn complex relationships (e.g., ReLU, Sigmoid, Tanh).
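
The components listed above fit together as in the following NumPy sketch of a forward pass through a small fully connected network. The layer sizes (17 inputs for 13 states plus 4 actions, two hidden layers of 64 units, 13 outputs), the scaled random initialization, and the helper names are assumptions for illustration; the exercise itself uses `torch.nn`.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(z, 0.0)


def init_layer(n_in, n_out):
    """Scaled random weights and zero biases for one fully connected layer."""
    return rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)


def forward(x, layers):
    """Apply each layer; ReLU on hidden layers, no activation on the output layer."""
    for i, (weight, bias) in enumerate(layers):
        x = x @ weight + bias
        if i < len(layers) - 1:
            x = relu(x)
    return x


# Assumed sizes: input layer, two hidden layers, output layer
dims = [17, 64, 64, 13]
layers = [init_layer(n_in, n_out) for n_in, n_out in zip(dims[:-1], dims[1:])]
batch = rng.normal(size=(8, 17))  # batch of 8 random input vectors
outputs = forward(batch, layers)
print(outputs.shape)
```

The output layer is left linear because regression targets such as next states are not restricted to a fixed range.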

### Key Concepts When Training Neural Networks

#### 1. **Architecture Choice**
The architecture of a neural network defines its capacity to learn and generalize from data. Key considerations include:

- **Depth and Width**: Depth and width refer to the number of layers and the number of neurons per layer, respectively. Deeper networks can learn hierarchical features but are computationally more expensive and prone to vanishing gradients, while wider layers increase capacity but may lead to overfitting if not regularized.
  
- **Activation Functions**: Introduce non-linearity to the model. Common choices include:
  - **ReLU**: Efficient and widely used for hidden layers.
  - **Sigmoid/Tanh**: Useful for specific tasks but prone to vanishing gradients.
  - **Softmax**: Often used in the output layer for classification tasks.

- **Layer Types**:
  - **Fully Connected (Dense) Layers**: Standard for general-purpose tasks.
  - **Convolutional Layers**: Ideal for image processing tasks because they exploit the spatial structure of images. They process localized regions (receptive fields) to capture spatial features, use parameter sharing to reduce complexity, and provide translation equivariance, so features are detected regardless of position.
  - **Recurrent Layers**: Suitable for sequential data like time series or text (e.g., LSTMs, GRUs).
  - **Transformer Layers**: State-of-the-art for natural language processing and vision tasks. Use self-attention mechanisms to capture long-range dependencies and relationships in the data, enabling efficient parallel processing and improved performance on sequential and spatial data.

- **Task-Specific Architectures**:
  - **ResNet**: For image classification, uses skip connections to mitigate vanishing gradients.
  - **U-Net**: For image segmentation, combines encoder-decoder architecture with skip connections.
  - **Transformer**: For NLP and vision tasks, excels in capturing long-range dependencies.
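
The remark that Sigmoid and Tanh are prone to vanishing gradients can be illustrated numerically. The sketch below only looks at the derivative of the activation function itself; the full backpropagated gradient also contains the weight matrices, so the numbers are an indication rather than an exact statement about any particular network. The depth of 10 layers is an arbitrary assumption.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)


z = np.linspace(-6.0, 6.0, 121)
max_sigmoid_grad = sigmoid_grad(z).max()  # attained at z = 0
print("largest sigmoid derivative:", max_sigmoid_grad)

# Upper bound on the product of activation derivatives across 10 stacked layers
n_layers = 10
print("sigmoid bound:", max_sigmoid_grad**n_layers)
print("ReLU factor on active units:", relu_grad(np.array([1.0]))[0] ** n_layers)
```

Even in the best case the sigmoid factor shrinks geometrically with depth, whereas active ReLU units pass the gradient through unchanged.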

---

#### 2. **Enabling Generalization**
In the end, the goal is for models to perform well on unseen data. While NNs (with suitable architecture and hyperparameter choice) generally perform well on the training data, generalization is much more difficult to achieve. Methods to achieve generalization are generally called regularization techniques. Moreover, splitting the dataset into training, test, and validation sets allows evaluating underfitting, overfitting, and hyperparameter selection.

- **Detecting Underfitting and Overfitting**:
  - **Learning Curves**: Plot training and validation losses over epochs. Training and validation losses that converge to similar values indicate good generalization.
  - **Underfitting**: Training and validation losses are high, indicating the model is too simple or insufficiently trained.
  - **Overfitting**: Training loss is low, but validation loss is high, indicating the model is memorizing the training data.

- **Regularization Techniques**:
  - **Early Stopping**: Stop training when validation loss stops improving.
  - **Dropout**: Randomly deactivate neurons during training to prevent co-adaptation.
  - **Batch Normalization**: Normalize layer inputs to stabilize and accelerate training.
  - **Weight Regularization**: Add L1 or L2 penalties to the loss function to constrain model complexity.
  - **Skip Connections**: Counteract vanishing and exploding gradients by improving gradient flow between layers.
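
Early stopping, for example, reduces to a small amount of bookkeeping on the validation loss. The sketch below uses only the Python standard library; the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the loss values are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has not improved by more than min_delta
    for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation curve: improves first, then starts to overfit
val_losses = [1.0, 0.6, 0.4, 0.35, 0.36, 0.38, 0.41, 0.45]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop, the model weights from the best epoch would be saved and restored after stopping.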

---

#### 3. **Hyperparameter Optimization**
Hyperparameters control the training process and model architecture. Optimizing them is essential for achieving strong performance.

- **Common Hyperparameters**:
  - **Learning Rate**: Controls the step size for weight updates. Too high may cause divergence; too low can slow convergence. Many optimizers (e.g., Adam) adapt the learning rate during training.
  - **Batch Size**: Smaller batches introduce more noise but allow more frequent updates. Larger batches are more stable but require more memory.
  - **Number of Epochs**: Specifies how many times the model sees the entire dataset.
- **Optimization Methods**:
  - **Grid Search**: Exhaustively searches over a predefined set of hyperparameter values, but is often inefficient.
  - **Random Search**: Samples random combinations of hyperparameters and is generally more efficient than grid search.
  - **Bayesian Optimization**: Uses probabilistic models to efficiently search for optimal hyperparameters.
- **Cross-Validation**: Splits the dataset into multiple folds to evaluate model performance across different subsets, providing a more robust estimate of generalization.
- **Validation Set**: Always use a separate validation set to evaluate hyperparameter choices.
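
As an illustration of cross-validation, the following NumPy sketch generates the index sets for k-fold splitting. The function name `k_fold_indices` and its parameters are hypothetical; scikit-learn provides a ready-made `KFold` class for the same purpose.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is in exactly one validation fold."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(perm, n_folds)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print(len(train_idx), sorted(val_idx.tolist()))
```

A hyperparameter setting would then be scored by training once per fold and averaging the validation losses.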


---

### Best Practices for Training Neural Networks

1. **Normalize Input Data**: Ensure that input features have similar scales, which helps counteract vanishing or exploding gradients.
2. **Monitor Learning Curves**: Use training, validation, and test losses to diagnose underfitting, overfitting, and the effects of hyperparameters.
3. **Start Simple**: Begin with a basic model and increase complexity only as needed.
4. **Use Pretrained Models**: Leverage pretrained models when available to save time and improve performance.
5. **Automate Hyperparameter Tuning**: Use methods like random search or Bayesian optimization to efficiently explore hyperparameter space.

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print("Using device:", device)

# Use same data to learn a drone model as for GP
x, y = x_drone, y_drone
x, y = torch.FloatTensor(x), torch.FloatTensor(y)
x, y = x.to(device), y.to(device)


Neural Networks are prone to overfitting, i.e., they learn the training data including the noise really well but may fail to find the underlying feature-target relations. There exist many methods to counteract overfitting, which are a subset of Regularization methods.

The most basic one is to separate the data into training and test sets. Although this reduces the amount of data usable for training, evaluating the learned model on the test set reveals whether the model overfits.

Often the training data is further split into an additional validation set, which is often used to determine hyperparameter selection.

In [None]:
data_torch = {"x": x, "y": y}
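data_torch["x_train"], data_torch["x_test"], data_torch["y_train"], data_torch["y_test"] = (
    train_test_split(x, y, test_size=0.2, random_state=42)
)
# Same random_state, so these variables hold exactly the same split as data_torch
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)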
# Print the shapes of the train and test datasets
for key, value in data_torch.items():
    print(f"Shape of {key}: {value.shape}")

Another common method is to normalize the input data before training and evaluation. Normalization often increases the training speed by helping to prevent vanishing or exploding gradients.

Here, we are using a normalizer class that conveniently encapsulates the normalization and reverse normalization functions fitted to a given dataset.

In [None]:
# For neural networks input data should ALWAYS be normalized (also true for DRL), to prevent vanishing or exploding gradients
normalizer = Normalizer()
normalizer.fit(data_torch["x_train"], data_torch["y_train"])  # Fit on both input and output
data_torch["x_train_norm"], data_torch["y_train_norm"] = normalizer.transform(
    data_torch["x_train"], data_torch["y_train"]
)
data_torch["x_test_norm"], data_torch["y_test_norm"] = normalizer.transform(
    data_torch["x_test"], data_torch["y_test"]
)

# Print statistics of normalized data
print("Normalized Data Statistics:")
print("Mean:", data_torch["x_train_norm"].mean().item())  # should be close to zero
print("Standard Deviation:", data_torch["x_train_norm"].std().item())  # should be close to one

Neural networks are trained in epochs and batches, which helps the optimization process. If you haven't worked with deep learning before, read up on that [here](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/). For easy data access during training, one uses [dataloaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).

<div class="alert alert-info">
    <h3>Task 2: Review and complete the CustomDataset class</h3>
    <p>
    Complete the <code>iter()</code> function in the <code>CustomDataset</code> class in <code>utils.py</code>.
    </p>
</div>

In [None]:
train_loader = CustomDataset(
    data_torch["x_train_norm"], data_torch["y_train_norm"], batch_size=64, shuffle=True
)
val_loader = CustomDataset(
    data_torch["x_test_norm"], data_torch["y_test_norm"], batch_size=64, shuffle=False
)

<div class="alert alert-success">
    <h3>Exam Preparation</h3>
    <p>
    Review the lecture note sections that consider Neural Networks (NNs). 
    </p>
    <p>
    Then understand and explain for yourself fundamental concepts of NNs and their tradeoffs:
    <ul>
    <li>Learning rate, epochs, batches, and mini batches</li>
    <li>Optimizer selection (ADAM, SGD, ...)</li>
    <li>Computational graphs in auto-differentiation</li>
    </ul>
    </p>
</div>
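As a small illustration of the computational graph and the role of the learning rate, the following sketch differentiates a scalar loss with autograd and performs one manual gradient descent step. The values of `w`, `x`, and `lr` are arbitrary.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # trainable parameter
x = torch.tensor(3.0)                      # fixed input

# Forward pass: autograd records every operation in a computational graph.
loss = (w * x - 1.0) ** 2                  # (2 * 3 - 1)^2 = 25
print(loss.grad_fn)                        # graph node that produced the loss

# Backward pass: traverse the graph in reverse to obtain d(loss)/d(w).
loss.backward()
grad = w.grad.item()                       # 2 * (w * x - 1) * x = 30

# One plain gradient descent step with learning rate lr.
lr = 0.01
with torch.no_grad():
    w_new = w - lr * w.grad                # 2 - 0.01 * 30 = 1.7
```

Optimizers such as SGD or Adam automate the last step and differ in how they scale and accumulate the gradient.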

<div class="alert alert-info">
    <h3>Task 3: Review and complete the NeuralNetwork class</h3>
    <p>Review the class <code>NeuralNetwork</code> in <code>neural_network.py</code>.</p>
    <p>Then complete the <code>__init__()</code> function that initializes the Neural Network (NN) and defines the model architecture. You have complete freedom over the architecture choice (e.g., you can select the activation functions, number of layers, hidden units per layer, and layer types) as long as:</p>
    <ul>
        <li>The model uses layer types defined by <code>torch.nn</code>.</li>
        <li>The model has at most <code>1e6</code> trainable parameters.</li>
        <li>All model parameters are included in the hyperparameter dictionary defined in the code cell below.</li>
        <li>The input and output dimensions are not changed.</li>
    </ul>
</div>
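To check the parameter budget from Task 3, one option is to count the trainable parameters of a `torch.nn` module directly. The helper name `count_trainable_parameters` is hypothetical, and the layer sizes in the example are arbitrary placeholders, not a recommended architecture.

```python
import torch.nn as nn


def count_trainable_parameters(model: nn.Module) -> int:
    """Sum the number of elements of all parameters that receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Placeholder network, used only to demonstrate the count.
example = nn.Sequential(
    nn.Linear(12, 64),  # 12 * 64 weights + 64 biases = 832
    nn.ReLU(),
    nn.Linear(64, 64),  # 64 * 64 weights + 64 biases = 4160
    nn.ReLU(),
    nn.Linear(64, 9),   # 64 * 9 weights + 9 biases = 585
)

n_params = count_trainable_parameters(example)
print(n_params, n_params <= 1e6)
```

The same helper can be applied to the `model` instance created in the next cell.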

In [None]:
# setting seed for reproducibility
set_seed(42)
# Define hyperparameters for the neural network (make sure to adjust them according to your implementation)
# TODO A hidden dim of 5 is not sufficient.
hyperparameters = {
    "input_dim": data_torch["x_train_norm"].shape[1],
    "output_dim": data_torch["y_train_norm"].shape[1],
    "hidden_dims": [5, 5],
}
# Create Neural Network
model = NeuralNetwork(hyperparameters)
model.to(device)


# Plot predictions before training
x_train_plot = data_torch["x_train_norm"].clone().detach().cpu().numpy()
y_train_plot = data_torch["y_train_norm"].clone().detach().cpu().numpy()
y_pred_plot = model(data_torch["x_train_norm"]).detach().cpu().numpy()

# index zero corresponds to x position. If we plot x_pos (current state) against x_pos (next state), we expect an almost linear graph.
feature_idx = 0

sort_indices = np.argsort(x_train_plot[:, 0])
x_sorted = x_train_plot[sort_indices, 0]
y_true_sorted = y_train_plot[sort_indices, feature_idx]
y_pred_sorted = y_pred_plot[sort_indices, feature_idx]

plt.figure(figsize=(10, 6))
plt.scatter(x_sorted, y_true_sorted, alpha=0.6, label="Ground Truth", color="blue")
plt.scatter(x_sorted, y_pred_sorted, alpha=0.6, label="Predictions (Before Training)", color="red")
plt.title("Neural Network Predictions vs Ground Truth (Before Training)")
plt.xlabel("Input Feature (First Dimension, x_pos)")
plt.ylabel(f"Output Feature {state_names[feature_idx]}")
plt.legend()
plt.grid(True, alpha=0.3)

<div class="alert alert-info">
    <h3>Task 4: Review and complete the RegressionTrainer class</h3>
    Review the class <code>RegressionTrainer</code> in <code>neural_network.py</code>.
    <p>
    Then complete the <code>train_epoch()</code> function that performs a training iteration.
    </p>
    <p>
    The <code>train()</code> function saves a checkpoint at the end of the training. Make sure to commit the checkpoint before submitting your solution!
    </p>
</div>

In [None]:
# Train network
nn_results = {}
trainer = RegressionTrainer(model, device=device)

start_time = time.time()
trainer.train(train_loader, val_loader, epochs=40, patience=5)

nn_results["t_train"] = time.time() - start_time

# Plot the learning curve. Understanding learning curves is crucial for debugging your models: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
plot_learning_curves(trainer.train_losses)

As already mentioned, inspecting learning curves is important for understanding where your deep learning algorithm fails. Check out [this blog post](https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/) for a good overview of how to diagnose problems from learning curves.
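As a rough sketch of how a learning curve can be summarized numerically, the hypothetical helper below reports the best validation epoch, the final gap between validation and training loss, and whether a simple patience rule would have stopped training. The loss values are made up, and the patience rule is an assumption that may differ from the one implemented in `RegressionTrainer`.

```python
def summarize_learning_curve(train_losses, val_losses, patience=5):
    """Summarize a pair of per-epoch loss curves (hypothetical helper)."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    epochs_since_best = len(val_losses) - 1 - best_epoch
    return {
        "best_epoch": best_epoch,
        "best_val_loss": val_losses[best_epoch],
        # A large positive gap suggests overfitting.
        "final_gap": val_losses[-1] - train_losses[-1],
        # Assumed rule: stop once `patience` epochs pass without improvement.
        "would_stop_early": epochs_since_best >= patience,
    }


# Made-up curves: training loss keeps falling, validation loss turns upward.
train = [1.00, 0.60, 0.40, 0.30, 0.22, 0.17, 0.13, 0.10, 0.08]
val = [1.05, 0.70, 0.52, 0.45, 0.47, 0.50, 0.55, 0.61, 0.68]

summary = summarize_learning_curve(train, val, patience=5)
print(summary)
```

In this example the validation loss bottoms out at epoch 3 while the training loss continues to decrease, which is the typical signature of overfitting.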

In [None]:
# Plot predictions after training
x_train_plot = data_torch["x_train_norm"].clone().detach().cpu().numpy()
y_train_plot = data_torch["y_train_norm"].clone().detach().cpu().numpy()
y_pred_plot = model(data_torch["x_train_norm"]).detach().cpu().numpy()

feature_idx = 0

sort_indices = np.argsort(x_train_plot[:, 0])
x_sorted = x_train_plot[sort_indices, 0]
y_true_sorted = y_train_plot[sort_indices, feature_idx]
y_pred_sorted = y_pred_plot[sort_indices, feature_idx]

plt.figure(figsize=(10, 6))
plt.scatter(x_sorted, y_true_sorted, alpha=0.6, label="Ground Truth", color="blue")
plt.scatter(x_sorted, y_pred_sorted, alpha=0.6, label="Predictions (After Training)", color="red")
plt.title("Neural Network Predictions vs Ground Truth (After Training)")
plt.xlabel("Input Feature (First Dimension, x_pos)")
plt.ylabel(f"Output Feature {state_names[feature_idx]}")
plt.legend()
plt.grid(True, alpha=0.3)

In [None]:
# Predict on the test set and check that the test loss is close to the train loss; a large gap indicates overfitting or a problem with the training procedure or the data
model.eval()
start_time = time.time()
with torch.no_grad():  # no gradient
    nn_results["y_pred_norm"] = model(data_torch["x_test_norm"])
nn_results["t_pred"] = time.time() - start_time


test_loss = trainer.criterion(nn_results["y_pred_norm"], data_torch["y_test_norm"])
print(f"Train Loss: {trainer.train_losses[-1]:.4f}")
print(f"Test Loss: {test_loss:.4f}")

# Note: as those losses are calculated on normalized data, they are not directly interpretable as physical quantities.
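To relate a loss on normalized data back to physical units, note that under z-score normalization an error in normalized space is the physical error divided by the per-output standard deviation. The sketch below checks this numerically with NumPy. The means, standard deviations, and noise level are made-up values, and the relation assumes the normalizer uses plain per-output z-scoring.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalization statistics for two outputs.
y_mean = np.array([1.0, -3.0])
y_std = np.array([0.5, 2.0])

y_true = rng.normal(size=(100, 2)) * y_std + y_mean
y_pred = y_true + rng.normal(size=(100, 2)) * 0.1  # predictions with small errors

y_true_norm = (y_true - y_mean) / y_std
y_pred_norm = (y_pred - y_mean) / y_std

# Per-output RMSE in normalized space and in physical units.
rmse_norm = np.sqrt(np.mean((y_pred_norm - y_true_norm) ** 2, axis=0))
rmse_phys_direct = np.sqrt(np.mean((y_pred - y_true) ** 2, axis=0))

# Rescaling the normalized RMSE by the standard deviation recovers physical units.
rmse_phys_from_norm = rmse_norm * y_std

print(rmse_norm, rmse_phys_direct, rmse_phys_from_norm)
```

This only works per output dimension; a single loss averaged over all outputs mixes different scales and cannot be rescaled by one factor.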

In [None]:
# Same statistics as for the GP
_, nn_results["y_pred"] = normalizer.inverse_transform(
    data_torch["x_test_norm"], nn_results["y_pred_norm"]
)
nn_results["y_pred"] = nn_results["y_pred"].cpu().numpy()
# This neural network does not provide uncertainty estimates, so the standard deviation is set to zero
nn_results["y_std"] = np.zeros_like(nn_results["y_pred"])

# Calculate the mean absolute error (MAE), the root mean squared error (RMSE), and the mean uncertainty
nn_results["mae"], nn_results["rmse"], nn_results["uncertainty"] = error_statistics(
    data_torch["y_test"].cpu().numpy(), nn_results["y_pred"], nn_results["y_std"], model_name="NN"
)
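`error_statistics` comes from the exercise utilities and its implementation is not shown here. As a minimal sketch of the kind of metrics it reports, assuming errors are averaged over all samples and outputs, consider the hypothetical `sketch_error_statistics` below. The actual function may aggregate differently, for example per output dimension.

```python
import numpy as np


def sketch_error_statistics(y_true, y_pred, y_std):
    """Hypothetical metrics: MAE, RMSE, and mean predicted standard deviation."""
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err**2)))
    uncertainty = float(np.mean(y_std))
    return mae, rmse, uncertainty


# Tiny worked example: absolute errors are 1, 1, 3, 3.
y_true = np.zeros((2, 2))
y_pred = np.array([[1.0, -1.0], [3.0, -3.0]])
y_std = np.zeros_like(y_pred)  # no uncertainty estimate, as for the NN

mae, rmse, uncertainty = sketch_error_statistics(y_true, y_pred, y_std)
print(mae, rmse, uncertainty)  # MAE = 2, RMSE = sqrt(5), uncertainty = 0
```

RMSE penalizes large errors more strongly than MAE, so a gap between the two hints at a few large outliers.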

In [None]:
# Plot predictions of two states with standard deviation for a random single point
plot_2d_positions_with_std(
    data_torch["x_train"].cpu().numpy(),
    data_torch["y_train"].cpu().numpy(),
    data_torch["x_test"].cpu().numpy(),
    data_torch["y_test"].cpu().numpy(),
    nn_results["y_pred"],
    y_std=None,
    state_names=state_names,
    state_1_idx=0,  # x
    state_2_idx=2,  # z
    show_train=False,
    x_lim=(-0.4, -0.1),  # for better visibility
    y_lim=(0.5, 0.6),  # for better visibility
)

# Performance Comparison
Now we compare the performance of the Neural Network and Gaussian Process implementations.
For that, we evaluate the accuracy of the predictions for the test set and additionally measure the times required for inference and learning.

To allow for a fair comparison, we will reimplement the GP model using [GPyTorch](https://gpytorch.ai/). GPyTorch is a package built upon PyTorch that implements a wide range of GP models and utility tools and allows for GPU acceleration. As a result, the GP parameters are learned with gradient-based optimization, similarly to NNs.

Note: GP implementations generally have a much higher memory footprint than an equivalent NN implementation. In practice, this means that training an exact GP on a CPU is limited to datasets of size $\leq 2000$, increasing to $\approx 10000$ data points when using a GPU. Hence, we will use [Stochastic Variational GP Regression](https://docs.gpytorch.ai/en/stable/examples/04_Variational_and_Approximate_GPs/SVGP_Regression_CUDA.html) for this example.
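A back-of-the-envelope sketch shows why exact GPs scale poorly: they work with a dense $n \times n$ kernel matrix, and a Cholesky factorization of that matrix costs roughly $n^3/3$ floating-point operations. The helper name `exact_gp_cost` is hypothetical, and the numbers count only a single kernel matrix in 64-bit floats. Real implementations hold several such buffers (for example for gradients and one model per output), so treat these values as a lower bound.

```python
def exact_gp_cost(n, bytes_per_value=8):
    """Rough cost of one dense n x n kernel matrix and its Cholesky factorization."""
    kernel_bytes = n * n * bytes_per_value  # O(n^2) memory
    cholesky_flops = n**3 / 3               # O(n^3) time
    return kernel_bytes, cholesky_flops


for n in (2_000, 10_000, 100_000):
    kernel_bytes, flops = exact_gp_cost(n)
    print(f"n={n:>7}: kernel matrix {kernel_bytes / 1e9:8.3f} GB, Cholesky ~{flops:.2e} flops")
```

Going from 2,000 to 10,000 points multiplies the memory by 25 and the factorization cost by 125, which is why sparse approximations with a fixed number of inducing points become attractive.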

<div class="alert alert-info">
    <h3>Task 5: Review the SVGPTrainer class</h3>
    Review the classes <code>SVGPTrainer</code> and <code>IndependentMultitaskSVGPModel</code> in <code>gaussian_process.py</code>.
    <p>
    For more details, see the GPyTorch documentation (https://docs.gpytorch.ai/en/stable/examples/04_Variational_and_Approximate_GPs/index.html).
    </p>
</div>

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
# device = "cpu" # for now we will use CPU
gp_trainer = SVGPTrainer(X_train=raw_data["x_train"], y_train=raw_data["y_train"], device=device)

In [None]:
gp_trainer.train(lr=0.01, epochs=50)
svgp_results = {}
svgp_results["t_train"] = gp_trainer.training_time

In [None]:
svgp_results["y_pred"], svgp_results["y_std"], svgp_results["lower"], svgp_results["upper"] = (
    gp_trainer.infer(X_test=raw_data["x_test"])
)
svgp_results["t_pred"] = gp_trainer.inference_time
svgp_results["mae"], svgp_results["rmse"], svgp_results["uncertainty"] = error_statistics(
    raw_data["y_test"], svgp_results["y_pred"], svgp_results["y_std"], model_name="SVGP"
)

In [None]:
# Data for the table
import pandas as pd

data = {
    "Method": ["Manual GP", "SVGP", "Neural Network"],
    "Mean Absolute Error": [gp_results["mae"], svgp_results["mae"], nn_results["mae"]],
    "Mean RMSE": [gp_results["rmse"], svgp_results["rmse"], nn_results["rmse"]],
    "Mean Uncertainty": [
        gp_results["uncertainty"],
        svgp_results["uncertainty"],
        nn_results["uncertainty"],
    ],
    "Training Time (s)": [gp_results["t_train"], svgp_results["t_train"], nn_results["t_train"]],
    "Inference Time (s)": [gp_results["t_pred"], svgp_results["t_pred"], nn_results["t_pred"]],
}

# Create the DataFrame
results_df = pd.DataFrame(data)
print(results_df)
plot_error_distribution(raw_data["y_test"], svgp_results["y_pred"], state_names, model_name="SVGP")
plot_error_distribution(
    data_torch["y_test"].cpu().numpy(), nn_results["y_pred"], state_names, model_name="NN"
)


feature = 0  # x
plot_prediction_vs_truth(
    raw_data["y_test"],
    svgp_results["y_pred"],
    svgp_results["y_std"],
    feature,
    state_names[feature],
    model_name="SVGP",
)
plot_prediction_vs_truth(
    data_torch["y_test"].cpu().numpy(),
    nn_results["y_pred"],
    nn_results["y_std"],
    feature,
    state_names[feature],
    model_name="NN",
)