# Data-Enhanced Simulation for Solids - final assignment

The goals of this assignment are to study a heat conduction problem, solve it using a
neural network, and develop a model-free controller based on reinforcement learning.
As a solution to the assignment, please submit a **Jupyter notebook**.

Consider the one-dimensional diffusion equation with source term:

$$ \frac{\partial u}{\partial t}(x,t) = \frac{\partial^2 u}{\partial x^2}(x,t) +
\lambda(x) u(x,t) \qquad 0<x<1\,, 0<t\leq 1,$$

where $\lambda(x) = 50 \cos(8\arccos(x))$. The boundary conditions are $u(0,t) = 0$ and
$u(1,t) = a(t)$, with $a(t)$ a **control function**, while the initial condition is
$u(x,0) = 1$. Notice that the source term has a _destabilizing_ effect on the solution.
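
As a quick look at the source coefficient, the sketch below evaluates $\lambda(x)$ on a grid and checks it against the identity $\cos(8\arccos x) = T_8(x)$, where $T_8$ is the Chebyshev polynomial of degree 8. The grid size and the variable names are choices made here for illustration and are not part of the assignment. Wherever $\lambda(x) > 0$ the source term amplifies $u$ locally, which is where the destabilizing effect comes from.

```python
import numpy as np
from numpy.polynomial import chebyshev

x = np.linspace(0.0, 1.0, 201)  # illustrative grid (assumption: 201 points)
lam = 50.0 * np.cos(8.0 * np.arccos(x))  # lambda(x) as defined above

# Same coefficient written as 50 * T_8(x); the coefficient vector selects T_8
coeffs = np.zeros(9)
coeffs[8] = 1.0
lam_cheb = 50.0 * chebyshev.chebval(x, coeffs)

max_diff = np.max(np.abs(lam - lam_cheb))
positive_fraction = np.mean(lam > 0.0)  # share of grid points where the source amplifies u
print(f"max |lambda - 50 T_8| = {max_diff:.2e}")
print(f"fraction of grid points with lambda > 0: {positive_fraction:.2f}")
```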

1. Solve the diffusion problem with finite differences (forward first-order in time,
   centered second-order in space) with constant control $a(t) = 0$ and make a
   3D plot of the solution $u$ for $(x,t) \in [0,1] \times [0,1]$. Discretize the space
   with $\Delta x = 0.005$ and time with $\Delta t = 10^{-5}$.
2. Use a neural network to solve the same problem (given only the boundary and the
   initial conditions) and evaluate the mean squared error between the network
   prediction and the numerical solution of step 1 over the space-time domain.
3. Create a `gymnasium` environment that simulates the diffusion problem (using the
   numerical solver developed in step 1) and is
   suitable for applying reinforcement learning algorithms, following the template
   below. The _state_ $s$ corresponds to the field $u(x,t)$, while the _action_ is the
   choice of the value for the boundary control $a(t)$ at each time step. 
   The _reward_ should be such that the control tries to **stabilize** the solution
   $u(x,t)$, i.e. it should compensate for the source term and drive the solution
   towards zero. Precisely, the reward should be assigned as:
   - $\|s'\| - \|s\|$, for each timestep, where $s'$ is the state after performing an action and $s$ is the
     state before it, and $\|\cdot\|$ denotes the $L_2$ norm;
   - $300 - \|u(x,1)\| - \sum_{t=0}^1 |a(t)|/1000$, at the end of the episode if $\|u(x,1)\| < 20$,
     or $0$ otherwise, with $|\cdot|$ the $L_1$ norm.
   
   Actions are sampled every $0.01$ time units. Since the timestep for the numerical
   integration of the PDE ($10^{-5}$) is smaller than the control sampling timestep, the
   control is kept constant until reaching the next multiple of $0.01$ in time. It is
   advisable to test the environment before using it for reinforcement learning.

In [None]:
from gymnasium import Env, spaces
import numpy as np
import jax.numpy as jnp
from jax import jit

dx = 0.005
dt = 1e-5
control_dt = 0.01  # Control sampling timestep
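nx = round(1 / dx) + 1  # Number of grid points, including both boundaries
# linspace avoids the float-step round-off of arange, which can push the last
# point past 1, where arccos returns NaN
x = jnp.linspace(0.0, 1.0, nx)
# round() instead of int(): the float quotients can land just below the exact
# integer, and int() would then truncate them
num_timesteps = round(1 / dt)  # Total simulation time of 1 time unit
num_control_steps = round(control_dt / dt)  # Number of PDE steps per control action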

@jit
def source_term(x, u):
    return 50 * jnp.cos(8 * jnp.arccos(x)) * u

class DiffusionEnv(Env):
    def __init__(self):
        super(DiffusionEnv, self).__init__()
        
        # Define action and observation spaces
        self.action_space = spaces.Box(low=-20, high=20, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(nx,), dtype=np.float32)
        
        # Initialize state
        # self.state = ...

        self.current_step = 0  # Track the number of time steps

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # Seeds self.np_random, following the Gymnasium API
        
        self.state = jnp.ones(nx, dtype=jnp.float32)
        self.state = self.state.at[0].set(0)  # Enforce boundary condition at x=0
        self.current_step = 0
        return np.array(self.state), {}  # Convert to numpy for compatibility with stable-baselines

    def step(self, action):
        # Apply boundary control at x=1
        boundary_control = action[0]

        # Update state (i.e. integrate PDE for num_control steps)
        # self.state = ...

        new_state = np.array(self.state)

        # Calculate the per-step reward from the L2 norms of the states before
        # and after the action
        # reward = ...

        # Count this control action before testing for the end of the episode, so
        # that the episode lasts exactly round(1 / control_dt) actions (t = 1)
        self.current_step += 1
        done = self.current_step >= round(1 / control_dt)
        truncated = False  # No truncation

        # If this is the last timestep, adjust the reward based on the final L2 norm
        if done:
            # final_norm = ...
            if final_norm < 20:
                pass  # reward = ...
            else:
                reward = 0

        return new_state, reward, done, truncated, {}

    def render(self):
        pass
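
The two reward rules of step 3 can be written as small helper functions. This is a sketch under stated assumptions: the norm is taken as the plain vector 2-norm of the grid values (no $\sqrt{\Delta x}$ weighting), the sum over $|a(t)|$ runs over the sampled control values, and the helper names are hypothetical. The per-step term is transcribed exactly as stated in the text.

```python
import numpy as np

def step_reward(s, s_next):
    """Per-step term ||s'|| - ||s||, transcribed as stated in step 3."""
    return float(np.linalg.norm(s_next) - np.linalg.norm(s))

def terminal_reward(u_final, actions):
    """End-of-episode term: 300 - ||u(x,1)|| - sum|a|/1000 if ||u(x,1)|| < 20, else 0."""
    final_norm = float(np.linalg.norm(u_final))
    if final_norm < 20.0:
        return 300.0 - final_norm - float(np.sum(np.abs(actions))) / 1000.0
    return 0.0

# Illustrative inputs only (not simulation results)
s = np.ones(201)
s[0] = 0.0
s_next = 0.5 * s
print(step_reward(s, s_next))
print(terminal_reward(s_next, np.full(100, 2.0)))
```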

4. Use a reinforcement learning algorithm to stabilize the solution by acting on the boundary control. The algorithm should interact with the environment built at step 3. Plot the reward as a function of the epochs or total number of timesteps during training of the algorithm. Also make a 3D plot of the stabilized solution. 

In [None]:
# The following snippet could be useful if using stable-baselines

from stable_baselines3.common.vec_env import DummyVecEnv, VecMonitor

# Wrap the environment with VecMonitor for reward logging
env = DummyVecEnv([lambda: DiffusionEnv()])  # The factory builds the environment when called
env = VecMonitor(env)
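
For the reward curve requested in step 4, per-episode returns tend to be noisy, so a running mean often makes the trend easier to read. The sketch below is plain NumPy: `episode_returns` is a placeholder array standing in for whatever returns were logged during training, `moving_average` is a hypothetical helper, and the window length is an arbitrary choice.

```python
import numpy as np

def moving_average(values, window):
    """Running mean over `window` consecutive entries ('valid' mode, no padding)."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Placeholder data: synthetic returns, NOT results from the assignment
rng = np.random.default_rng(0)
episode_returns = np.linspace(0.0, 100.0, 200) + rng.normal(0.0, 10.0, 200)
smoothed = moving_average(episode_returns, window=20)
print(smoothed.shape)
```

The smoothed array is shorter than the input by `window - 1` entries, so the horizontal axis of the plot should be trimmed accordingly.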