# Model Predictive Control and Neural Network-based control

The goal of this exercise is to implement two controllers for the `Pendulum-v1` environment of the _Gymnasium_ library: a model-based controller and a model-free, neural network-based controller. Please **carefully read** the [documentation](https://gymnasium.farama.org/environments/classic_control/pendulum/) of the environment before starting (focus on the state variables and controls).
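
As a quick orientation, the sketch below assumes the observation layout described in the linked documentation: $(x, y, \omega)$ with $x = \cos\theta$, $y = \sin\theta$, and $\theta = 0$ in the upright position. The helper names `angle_to_observation` and `observation_to_angle` are hypothetical and are not part of _Gymnasium_.

```python
import math


def angle_to_observation(theta, omega):
    """Return (x, y, omega) with x = cos(theta) and y = sin(theta)."""
    return (math.cos(theta), math.sin(theta), omega)


def observation_to_angle(obs):
    """Recover theta in [-pi, pi] from an (x, y, omega) observation."""
    x, y, _ = obs
    return math.atan2(y, x)


# Example: a pendulum tilted by 20 degrees with a small angular velocity
obs = angle_to_observation(math.radians(20.0), 0.1)
theta_deg = math.degrees(observation_to_angle(obs))
```

Using `atan2` rather than `acos` keeps the sign of the angle, which matters when plotting $\theta$ over time.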

## Part 1: Model Predictive Control

Implement a model-based controller that uses Model Predictive Control (MPC) theory (see the slides in the _Genetic Algorithm_ set in the course repo) to **stabilize the pendulum in its upright position** ($\theta = 0$, $\omega = 0$). Set **gravity to 9.81** using the `g` argument of the `Pendulum-v1` environment. In general, you should follow these steps:

1. Define the _cost_ function associated with the MPC using the **reward** of the environment (the reward is never
   positive, so the cost can be taken as the negated sum of the predicted rewards). To predict the future states and
   rewards associated with a sequence of actions, use a second _Pendulum_ environment called `env_mpc`, separate from
   the environment `main_env` that the controller is interacting with.
2. Define an `MPC` class compatible with `pygmo` that implements the optimization problem
   that needs to be solved using Model Predictive Control to compute the optimal action
   (the one that minimizes the cost defined at Step 1). Remember to appropriately set
   the bounds for the controls.
3. Implement the `get_best_mpc_action` function to solve the MPC problem using `pygmo`.
4. Define the function `play_game` to play a "game" using a controller chosen by the
   user among the following: 1) MPC; 2) random; 3) Neural Network (see Part 2). The
   _initial conditions_ for the angle and angular velocity should be randomly set within
   the intervals -20/+20 degrees and -0.1/0.1 rad/s, respectively, using the
   `env.unwrapped.state` variable. For the MPC controller, at each time step an
   optimization problem must be solved using the function defined at Step 3.
   The function should store and return the lists of _observations_ and corresponding
   _controls_, and the _total score_ of the game. Each game lasts at most 200 actions.
5. Play _a few_ games with _random_ initial conditions (angle between -20 and +20 degrees and angular velocity between -0.1 and 0.1 rad/s) and compute the _average total score_. **You should get a total score above -10, at least in some games.** For one game, plot the angle and the angular velocity as a function of time, and the controls in a separate figure.
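
Steps 1 and 2 can be sketched without _Gymnasium_ or `pygmo` installed. The block below is only an illustration under stated assumptions: `step_model` is a hand-written stand-in for `env_mpc` that follows the dynamics and cost given in the environment documentation (mass 1, length 1, time step 0.05, torque limited to $\pm 2$, speed limited to $\pm 8$), and the class name `PendulumMPCProblem` is hypothetical. In the exercise itself, the prediction should come from `env_mpc`.

```python
import math

# Assumed constants: documented defaults of Pendulum-v1, with g set to 9.81
G, M, L, DT = 9.81, 1.0, 1.0, 0.05
MAX_SPEED, MAX_TORQUE = 8.0, 2.0


def angle_normalize(theta):
    """Wrap an angle to [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def step_model(theta, omega, u):
    """One step of the stand-in model; returns (theta, omega, cost)."""
    u = max(-MAX_TORQUE, min(MAX_TORQUE, u))
    # Cost is the negated reward, evaluated on the state before the update
    cost = angle_normalize(theta) ** 2 + 0.1 * omega ** 2 + 0.001 * u ** 2
    omega = omega + (3 * G / (2 * L) * math.sin(theta) + 3.0 / (M * L ** 2) * u) * DT
    omega = max(-MAX_SPEED, min(MAX_SPEED, omega))
    theta = theta + omega * DT
    return theta, omega, cost


def rollout_cost(state, action_sequence):
    """Sum the predicted costs over a sequence of actions."""
    theta, omega = state
    total = 0.0
    for u in action_sequence:
        theta, omega, cost = step_model(theta, omega, u)
        total += cost
    return total


class PendulumMPCProblem:
    """Hypothetical problem class exposing the two methods pygmo looks for."""

    def __init__(self, state, control_horizon=10):
        self.state = state
        self.control_horizon = control_horizon

    def fitness(self, action_sequence):
        # Single-objective problem: return a one-element sequence
        return [rollout_cost(self.state, action_sequence)]

    def get_bounds(self):
        # One decision variable per action in the horizon
        n = self.control_horizon
        return ([-MAX_TORQUE] * n, [MAX_TORQUE] * n)
```

An object with `fitness` and `get_bounds` methods of this shape is what `pygmo` treats as a user-defined problem, so it can be wrapped with `pg.problem(...)`. Only the first action of the optimized sequence is applied to the main environment; the optimization is then repeated from the new state at the next time step.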

In [None]:
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
import torch 
import pygmo as pg
from torch import nn, tensor
from skorch import NeuralNetRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [None]:
def cost_mpc(action_sequence, main_env, env_mpc):
    # set initial state of the environment used for simulating MPC actions equal to the
    # current state in the "real" environment
    # (copied, so that the two environments never share the same array)
    env_mpc.unwrapped.state = np.copy(main_env.unwrapped.state)

    pass

In [None]:
# pygmo problem class for MPC
class MPC():
    def __init__(self, main_env, env_mpc, control_horizon=10):
        pass

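    def fitness(self, action_sequence):
        # pygmo expects a sequence of objective values, e.g. [cost]
        pass

    def get_bounds(self):
        # pygmo expects a tuple (lower_bounds, upper_bounds), one entry per decision variable
        pass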

In [None]:
# return best action by minimizing MPC cost using genetic algorithm
def get_best_mpc_action(env, env_mpc):
    # best_action = ... 
    # convert the results into a 1x1 numpy array to be compatible with the step method
    # of the gym environment
    return np.array([best_action])

In [None]:
# Play a game with specified controller
def play_game(controller, nn_model=None):
    env = gym.make('Pendulum-v1', g=9.81)  # Main environment
    env.reset()
    env_mpc = gym.make('Pendulum-v1', g=9.81) # Environment for MPC
    env_mpc.reset()

    # initial conditions
    # ...

    # set initial x, y and omega (angular velocity)
    previous_obs = np.array([np.cos(initial_angle), np.sin(initial_angle), initial_angular_vel])

    states = []
    controls = []
    total_score = 0

    for _ in range(200):
        pass

    return states, controls, total_score

## Part 2: Neural Network controller

In this part, you will train a neural network controller that imitates the control strategy found in Part 1. To this end, you should:
1. Write a function `generate_training_data` that generates a training dataset by playing a certain number of games (suggested minimum
   100) using the MPC controller implemented in Part 1 and providing as random initial conditions
   an angle between -20 and 20 degrees and zero angular velocity. Each step of the game
   corresponds to a sample of the training dataset, where $x$, $y$, $\omega$ are the
   features and the action chosen by the MPC controller is the label. Note: this step may be
   _slow_. Finally, convert the dataset to `torch` tensors ($X$ and $y$) with float32 precision.
2. Create a feedforward neural network implemented in `torch` that takes the current **observation** ($x$, $y$, $\omega$) as an input and returns the **control** to be applied to the system. Make sure that the returned value is "admissible", i.e. within the control bounds of the environment.
3. **Train** and **select** the network (by exploring different _architectures_ and values for the _hyperparameters_).
4. Play 2000 games with the controls given by the "best" network obtained in Step 3. Compute the average total score. **You should get an average score above -5.** Compare the average total score with that of a _random controller_.
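
One possible way to keep the output of Step 2 admissible is to squash the last layer with `tanh` and scale it by the maximum torque. The sketch below shows the idea in plain Python with untrained random weights, assuming the documented torque limit of 2; the helpers `make_layer`, `dense` and `controller` are hypothetical. In `torch`, the same effect could be obtained by ending the network with `nn.Tanh()` and multiplying the result by the torque limit.

```python
import math
import random

MAX_TORQUE = 2.0  # assumed torque limit, taken from the environment documentation


def make_layer(n_in, n_out, rng):
    """Create a dense layer as (weights, biases) with random weights."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer):
    """Apply an affine layer to a list of inputs."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def controller(obs, hidden_layer, output_layer):
    """Map an (x, y, omega) observation to a torque in [-MAX_TORQUE, MAX_TORQUE]."""
    hidden = [max(0.0, h) for h in dense(obs, hidden_layer)]  # ReLU activation
    raw = dense(hidden, output_layer)[0]
    return MAX_TORQUE * math.tanh(raw)  # tanh bounds the output before scaling


rng = random.Random(0)
hidden_layer = make_layer(3, 8, rng)
output_layer = make_layer(8, 1, rng)
torque = controller((1.0, 0.0, 0.0), hidden_layer, output_layer)
```

Compared with clipping the output after the fact, a bounded activation keeps the mapping differentiable everywhere, at the price of making the extreme torques harder to reach exactly.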

In [None]:
def generate_training_data(num_games):
    env = gym.make('Pendulum-v1', g=9.81)  # Main environment
    env_mpc = gym.make('Pendulum-v1', g=9.81) # Environment for MPC

    training_data = []

    for i in range(num_games):
        print(f"Playing game number {i + 1}")
        env.reset()
        env_mpc.reset()

        # ...
        # ...


        # save only the actions of "successful" games
        if total_score >= -0.8:
            # ...
            pass  # placeholder so that the skeleton is valid Python

    return training_data

In [None]:
training_data = generate_training_data(150)

In [None]:
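# stack the samples into arrays first: building a tensor from a list of arrays is slow
X = tensor(np.array([i[0] for i in training_data]), dtype=torch.float32).reshape(len(training_data), 3)
y = tensor(np.array([i[1] for i in training_data]), dtype=torch.float32).reshape(len(training_data), 1)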

In [None]:
# load training data from disk, instead of using the generate_training_data function
training_data = np.load("data/training_pendulum.npy")
X = tensor(training_data[:,:3], dtype=torch.float32)
y = tensor(training_data[:,3], dtype=torch.float32).reshape(len(training_data),1)