# Problem 5: Control and Reinforcement Learning

### (a) The merits and demerits of reinforcement learning for mechatronic systems

Reinforcement learning trains an agent to choose actions in an environment so as to maximize a cumulative reward. In robotics, the agent is the robot and the environment is the world the robot interacts with.

Reinforcement learning can be quite useful for mechatronic systems. Because a mechatronic system involves many uncertainties, it can be difficult to determine in advance which actions will accomplish a task. With reinforcement learning, the system learns which actions to take from a reward function, which lets it cope with much of that uncertainty.

Reinforcement learning is built around the agent's interaction with its environment: the agent chooses its next action based on what it observes there. This suits mechatronic systems, which must interact with their surroundings to accomplish a task. Applying reinforcement learning to a mechatronic system trains the system to make decisions based on its environment.
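
The interaction described above can be sketched as a loop in which the agent picks an action and the environment answers with a new state and a reward. This is a minimal illustration only: `LineWorld`, `run_episode`, and `always_right` are hypothetical names made up for this sketch and are unrelated to the CartPole code in part (c).

```python
class LineWorld:
    """Hypothetical 1-D environment: the agent starts at 0 and must reach `goal`."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action (-1 or +1) and return (state, reward, done)."""
        self.position += action
        self.steps += 1
        reached = self.position == self.goal
        reward = 1.0 if reached else -0.1  # small penalty for every extra step
        done = reached or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent decides from the current state
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


def always_right(state):
    """A fixed policy that ignores the state and always moves right."""
    return +1


print(run_episode(LineWorld(), always_right))
```

A learning agent would replace `always_right` with a policy that is adjusted between episodes based on the rewards it collected.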

Reinforcement learning also does not require large amounts of preexisting data. A task for a mechatronic system rarely comes with a large dataset that a model can be trained on. Instead, a reinforcement learning model learns from the effect each action has on the observations it receives.

Reinforcement learning can be useful for mechatronic systems; however, obtaining experience on a real system is costly. The model must therefore learn from far fewer interactions than a model applied in other contexts, which makes a usable reinforcement learning model much harder to implement. A simulation can stand in for the real system, but replicating a real-world environment with the precision that a mechatronic system sometimes requires is nearly impossible.

The true state of the environment can also never be observed completely or without noise in a mechatronic system. The reinforcement learning model therefore cannot know precisely which state it is in, which makes it much harder to measure what effect an action had on the environment. This increases the complexity of the problem because the system has to learn from a partially observed environment.
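
To make the idea of a noisy, partial observation concrete, here is a small sketch. It assumes a cart-pole-like state of four values and hypothetical sensors that measure only the position and the angle; the function name `observe` and the noise level are illustrative choices, not part of any real sensor model.

```python
import random


def observe(true_state, noise_std=0.05, rng=random):
    """Return a noisy, partial observation of a cart-pole-like state.

    true_state is (cart_position, cart_velocity, pole_angle, pole_velocity).
    Only position and angle are "measured" here (an assumption of this sketch),
    and each measurement carries zero-mean Gaussian noise.
    """
    cart_position, _, pole_angle, _ = true_state
    return (
        cart_position + rng.gauss(0.0, noise_std),
        pole_angle + rng.gauss(0.0, noise_std),
    )


rng = random.Random(0)                  # seeded so repeated runs match
true_state = (0.10, 0.50, 0.02, -0.30)
for _ in range(3):
    print(observe(true_state, rng=rng))  # same true state, different readings
```

The agent sees only these readings, so it has to infer the velocities and the true position from a sequence of imperfect measurements.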

Because reinforcement learning for mechatronic systems deals with real-world environments rather than simulated ones, the reward function must account for real-world experience. This makes it much harder to design a reward function that leads to fast learning.

### (b) Reinforcement Learning vs. Controllers

![Reinforcement vs. Controller Diagrams](images/controller_vs_reinforcement.png)

Reinforcement learning and controllers are similar in that both use information about the state of their environment to decide their next action. Controllers, for example, process measured signals according to knowledge about the process and drive the system to approximate a desired behavior. Reinforcement learning models interact with an environment, which provides a reward, and take actions to maximize that reward.

Controllers, however, require knowledge of the environment they are interacting with in order to provide instruction to the actuators to drive a desired behavior. We do not always have precise knowledge of our environment, so controllers are not always practical to use.
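
As a rough illustration of how a controller depends on knowledge of the process, the sketch below derives PD gains from an assumed plant model (a point mass obeying `m * x'' = u`) and then drives the mass to a target position. The plant, the function names, and the chosen natural frequency and damping ratio are all assumptions of this example; it is not a cart-pole controller.

```python
def pd_gains(mass, natural_freq=2.0, damping_ratio=1.0):
    """Derive PD gains from a known plant model m * x'' = u."""
    kp = mass * natural_freq ** 2
    kd = 2.0 * damping_ratio * mass * natural_freq
    return kp, kd


def simulate(mass=1.0, target=1.0, dt=0.01, steps=1000):
    """Drive a point mass to `target` with a PD controller; return the final position."""
    kp, kd = pd_gains(mass)
    position, velocity = 0.0, 0.0
    for _ in range(steps):
        error = target - position
        force = kp * error - kd * velocity  # control law computed from the measured state
        velocity += (force / mass) * dt     # semi-implicit Euler integration of the plant
        position += velocity * dt
    return position


print(f"final position: {simulate():.3f}")
```

The gains are only appropriate because the mass is known. If the real mass differed from the value passed to `pd_gains`, the closed-loop response would no longer match the intended natural frequency and damping.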

Reinforcement learning models do not rely on knowledge of the environment. They aim to take actions that maximize a reward: they take an action, observe the resulting state of the environment, and use the reward to decide which action to take next.
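
One simple way to see learning from rewards alone is tabular Q-learning on a toy problem. This is a different technique from the policy network used in part (c), and the chain environment in `step` is hypothetical. The learner never reads the rules inside `step`; it only sees the next state, the reward, and whether the episode ended.

```python
import random


def step(state, action, n_states):
    """Hypothetical chain environment: action 1 moves right, action 0 moves left."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    done = next_state == n_states - 1  # the right-most state is the goal
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values q[state][action] purely from observed transitions."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)  # explore, or break a tie at random
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action, n_states)
            # Move the estimate toward reward + discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
greedy = ["left" if left > right else "right" for left, right in q[:-1]]
print(greedy)
```

After training, the greedy action in every non-goal state should be "right", even though nothing in `q_learning` encodes where the goal is.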

The main difference between a controller and a reinforcement learning model is how much each needs to know about the environment the robot is interacting with. As I stated previously, a controller requires a lot more knowledge of the environment to make meaningful and accurate decisions that accomplish a desired behavior. Introducing too many uncertainties greatly hinders the controller's ability to make accurate decisions. Reinforcement learning models rely on the results of their previous actions, so they improve their decisions with each interaction with the environment. This requires significantly less knowledge of the environment.

### (c) Reinforcement Learning Model for CartPole

The reinforcement learning model we used is a simple neural network consisting of only two linear layers. It takes the environment's observation of the current state as its input features.
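
The `agent` module itself is not shown here, so the following is only a sketch of what a two-linear-layer policy for CartPole might look like. The class name `PolicyNetwork`, the helper `act`, the hidden size of 32, and the ReLU between the layers are assumptions; the 4-value observation and the 2 discrete actions are properties of CartPole.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Two linear layers mapping a CartPole observation to action logits."""

    def __init__(self, obs_dim=4, hidden_dim=32, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),  # assumed nonlinearity between the two layers
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)


def act(policy, state):
    """Sample an action and return it with its log-probability (shape [1])."""
    obs = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    dist = torch.distributions.Categorical(logits=policy(obs))
    action = dist.sample()
    return action.item(), dist.log_prob(action)


policy = PolicyNetwork()
action, log_prob = act(policy, [0.0, 0.1, -0.02, 0.3])
print(action, log_prob)
```

Returning the log-probability as a one-element tensor keeps it attached to the computation graph, so a list of them can be concatenated and differentiated when the loss is built.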

In [1]:
import gym
import numpy as np
import torch
import os

from agent import Agent


In [3]:
epochs = 1
log_iter = 1

env = gym.make("CartPole-v1", render_mode="human")

agent = Agent(env, load_exist=True)
learning = []

for i in range(epochs):
    steps = 0
    done = False
    state, _ = env.reset()
    
    cumulative_rewards = 0
    action_probs = []

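    while not done:
        steps += 1
        
        # Sample an action; `action_prob` is treated as a log-probability in the loss below.
        action, action_prob = agent.act(state)
        action_probs.append(action_prob)

        # `*_` discards `truncated` and `info`, so `done` is `terminated` only:
        # the episode ends when the pole falls or the cart leaves the track.
        state, reward, done, *_ = env.step(action)
        cumulative_rewards += reward
        
    # Policy-gradient loss: weight every log-probability by the episode's total reward.
    policy_loss = torch.cat([-log_prob * cumulative_rewards for log_prob in action_probs]).sum()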
    
    agent.optimizer.zero_grad()
    policy_loss.backward()
    agent.optimizer.step()
    
    learning.append(steps)
    if i % log_iter == 0:
        print(f"Iteration: {i:3d} | Moving-Average Steps: {np.mean(learning[-log_iter:]):.4f}")
        
env.close()

Iteration:   0 | Moving-Average Steps: 9769.0000
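
Assuming `agent.act` returns the log-probability $\log \pi_\theta(a_t \mid s_t)$ of the sampled action (the `agent` module is not shown here), the loss built in the loop above can be written as follows, where $T$ is the number of steps in the episode and $r_t$ is the reward at step $t$:

$$
L(\theta) = -\sum_{t=0}^{T-1} R \, \log \pi_\theta(a_t \mid s_t), \qquad R = \sum_{t=0}^{T-1} r_t
$$

Because CartPole gives a reward of 1 for every step, $R$ equals the episode length $T$. Minimizing $L(\theta)$ therefore raises the probability of every action taken in an episode, and does so more strongly for longer episodes.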


Below is a graph of the number of steps the cart-pole completed before failure in each epoch. As the model improves, the number of steps the agent performs before it fails and the next epoch begins should increase. As you can see, the number of steps increased greatly until approximately epoch 500, after which the model's performance dropped significantly.

*Note:* This model was loaded from previous training runs, so 500 epochs is not the total number of epochs the model has trained for.

![Model Training Plot](images/training_plot.png)