
In [10]:
%matplotlib inline
from typing import Tuple, List



# Q-Learning on the CartPole Environment

This is an adapted version of Jose Nieves Flores Maynez's notebook.

This tutorial shows how to use Q-Learning to train an RL agent on the CartPole-v0 task from the [OpenAI Gym](https://gym.openai.com/).

![cartpole](https://github.com/pytorch/tutorials/blob/main/_static/img/cartpole.gif?raw=true)

CartPole is a common, simple benchmark for RL examples.

In this environment, the task is to balance a pole attached to a cart by moving the cart left or right.
The agent receives a reward for each step (up to 200 steps per episode) in which the pole stays within a set angle and the cart stays within the bounds of the track.
The environment describes its state with four values:
the position and velocity of the cart and the angle and angular velocity of the pole (see [the documentation](https://gymnasium.farama.org/environments/classic_control/cart_pole/#observation-space)).
We will solve this by applying Q-Learning to our RL agent.


### Packages


First, let's import the packages we need.

In [11]:
import gym
import math
import random
import numpy as np
from numpy.polynomial import Polynomial
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict

In [12]:
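w = np.random.uniform(0, 1, size=4)
print(w) # => w is an array
print(w.shape)
print(w)
# Polynomial takes coefficients in order of increasing degree: w[0] + w[1]*x + ...
p = np.polynomial.Polynomial(w)
print(f"This is p: {p}")
derivative_poly = p.deriv()
print(f"This is p derived: {derivative_poly}")
s = np.array([-0.08776303, -0.3829126, 0.09531414, 0.63496107], dtype=np.float64)
evaluated = p(s) # evaluates p element-wise at each entry of s
print(f"valuated on states: {evaluated}")
# Note: np.polyder uses the opposite convention (highest degree first),
# so its result differs from p.deriv() above
derivative = np.polyder(w)
print(f"this is the derivative: {derivative}")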

[0.78798356 0.33518178 0.03976538 0.80374435]
(4,)
[0.78798356 0.33518178 0.03976538 0.80374435]
This is p: 0.7879835584141536 + 0.335181775547571·x¹ + 0.03976537860564011·x² +
0.8037443465605424·x³
This is p derived: 0.335181775547571 + 0.07953075721128022·x¹ + 2.411233039681627·x²
valuated on states: [0.75832996 0.62034375 0.82098835 1.22260255]
this is the derivative: [2.36395068 0.67036355 0.03976538]


## Implementation
Tabular Q-Learning updates a value for every state-action pair, so environments with a large state space become problematic. The estimate for a pair improves the more often that pair is visited; with many states or many actions, visits are spread across more pairs and it takes much longer to converge to the true values.

The CartPole environment describes its state by the position of the cart, its velocity, the angle of the pole, and the angular velocity of the pole. All of these are continuous variables. Even though they are bounded, learning a separate value for every possible combination would take forever, so we discretize them: several nearby values of each variable are grouped into the same “bucket” and treated as the same state. The agent implemented for this problem uses 3, 3, 6, and 6 buckets respectively.
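The bucketing idea can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `discretize` and the clipping bounds `LOWER` and `UPPER` are assumptions introduced here; only the bucket counts 3, 3, 6, and 6 come from the text.

```python
import numpy as np

# Bucket counts from the text: cart position, cart velocity,
# pole angle, pole angular velocity.
N_BUCKETS = np.array([3, 3, 6, 6])

# Assumed clipping bounds (hypothetical values chosen for illustration).
LOWER = np.array([-4.8, -3.0, -0.42, -3.5])
UPPER = np.array([4.8, 3.0, 0.42, 3.5])


def discretize(observation):
    """Map a continuous 4-dimensional observation to a tuple of bucket indices."""
    obs = np.clip(np.asarray(observation, dtype=np.float64), LOWER, UPPER)
    # Scale each variable to [0, 1], then to a bucket index in [0, n - 1].
    ratios = (obs - LOWER) / (UPPER - LOWER)
    indices = np.minimum((ratios * N_BUCKETS).astype(int), N_BUCKETS - 1)
    return tuple(int(i) for i in indices)


state = discretize([-0.08776303, -0.3829126, 0.09531414, 0.63496107])
print(state)
```

The resulting tuple is hashable, so it could serve directly as a key in a Q-table such as a `defaultdict`. With these counts there are only 3 · 3 · 6 · 6 = 324 distinct states.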

In [21]:
LEFT = 0
RIGHT = 1

class MonteCarloGeneration(object):
  def __init__(self, env: object, max_steps: int = 1000, debug: bool = False):
    self.env = env
    self.max_steps = max_steps
    self.debug = debug

  def run(self, pi_policy) -> List:
    buffer = []
    n_steps = 0 # Keep track of the number of steps so I can bail out if it takes too long
    state = self.env.reset() # Reset environment back to start
    terminal = False
    while not terminal: # Run until terminal state
      action = self.choose_action(pi_policy, state) # take action based on current policy
      next_state, reward, terminal, _ = self.env.step(action) # Take action in environment
      buffer.append((state, action, reward)) # Store the result
      state = next_state # Ready for the next step
      n_steps += 1
      if n_steps >= self.max_steps:
        if self.debug:
          print("Terminated early due to large number of steps")
        terminal = True # Bail out if we've been working for too long
    return buffer

  def choose_action(self, policy, state):
      prob_left = policy(state)
      if np.random.rand() < prob_left:
        action = LEFT
      else:
        action = RIGHT
      return action


class CartPoleAgent():
    def __init__(self, env: object, generator: MonteCarloGeneration,
                 num_episodes=1000, min_lr=0.00001, discount=1, decay=50, initial_lr=0.05):
        self.num_episodes = num_episodes
        self.min_lr = min_lr # minimal learning rate
        self.initial_lr = initial_lr
        self.discount = discount
        self.decay = decay
        self.generator = generator
        self.env = env

        # Initialize thetas arbitrarily with random values
        min_value = -1
        max_value = 1
        # TODO: check whether this flat shape is the correct one instead:
        # self.thetas = np.random.uniform(min_value, max_value, size=4) # 4 = number of state variables
        self.thetas = np.random.uniform(min_value, max_value, size=(1, 4)) # 4 = number of state variables
        self.steps = np.zeros(self.num_episodes)

        # Additions for A2C:
        # Parameter w (=> coefficients of the polynomial that now represents the value function)
        self.weights = np.random.uniform(min_value, max_value, size=10) # numpy array


    def state_value(self, weights):
      state_value_polynom = np.polynomial.Polynomial(weights)
      return state_value_polynom

    def evaluate_polynomial(self, polynomial, state):
      state = np.array(state, dtype=np.float64)
      # Note: 10e3 equals 10000, so the x**9 term alone is scaled by 1e36
      sum_of_evaluations = np.sum(polynomial(state + 10e3))
      return sum_of_evaluations

    def get_learning_rate(self, e):
      # Learning rate declines as we advance in episodes
      return self.initial_lr * max(self.min_lr, min(1., 1. - math.log10((e + 1) / self.decay)))

    def train(self) -> None:
      for episode_t in range(self.num_episodes):
        trajectory = self.generator.run(self.pi_policy) # Generate a trajectory
        G = 0
        for i, step in enumerate(reversed(trajectory)): # Starting from the terminal state
          self.steps[episode_t] += 1
          state, action, reward = step
          # print(f"this is state {state}")
          # Discount factor for time step t = len(trajectory) - i - 1 (always 1 with the default discount=1)
          gamma = self.discount**(len(trajectory) - i - 1)
          # TODO implement look-ahead
          # Return from step t onward; identical to the running sum when discount == 1
          G = reward + self.discount * G
          polynomial = self.state_value(self.weights)
          evaluated_polynomial = self.evaluate_polynomial(polynomial, state)
          delta = G - evaluated_polynomial
          learning_rate = self.get_learning_rate(episode_t)
          gradient_value_function = self.gradient_V(state, self.weights)
          print(f"this is the gradient of V(s, w): {gradient_value_function}")
          self.weights += learning_rate * delta * gradient_value_function
          # Update thetas
          self.thetas += learning_rate * gamma * G * self.gradient_pi(state, action)
      print('finished training!')

    def pi_policy(self, state):
        # Given a state, return probability of action left (using logistic function)
        action_left_prob = 1 /  (1 + np.exp(-(np.dot(self.thetas, state))))
        return action_left_prob

    def gradient_pi(self, state, action):
      # Gradient of log pi(action | state) with respect to thetas for the logistic policy
      if (action == LEFT):
        gradient = state - state * self.pi_policy(state)
      else:
        gradient = -state * self.pi_policy(state)
      print(f"this is the pi policy gradient: {gradient}")
      return gradient

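    def gradient_V(self, state, weights):
      polynomial = self.state_value(weights)
      # Note: deriv() differentiates with respect to the input x, not the weights,
      # and the sum below collapses the result to a single scalar
      derivative = polynomial.deriv()
      print(f"this is the derivative: {derivative}")
      return np.sum(derivative(state + 10e3))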

    def plot_learning(self):
        # Create subplots
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))

        # Plot steps per episode
        ax1.scatter(range(self.num_episodes), self.steps, label='Steps per Episode', color='blue', s=1)
        # for Ardi <3
        #ax1.plot(self.steps, label='Steps per Episode', color='blue')
        ax1.set_xlabel("Episode")
        ax1.set_ylabel("Steps")

        # Plot learning rates
        ax2.plot([self.get_learning_rate(e) for e in range(self.num_episodes)], label='Learning Rate', color='green')
        ax2.set_xlabel("Episode")
        ax2.set_ylabel("Learning Rate")

        # Show legends
        ax1.legend()
        ax2.legend()

        # Adjust spacing
        plt.tight_layout()

        # Show the plots
        plt.show()

        t = 0
        for i in range(self.num_episodes):
            if self.steps[i] == 200:
                t += 1
        print(t, "episodes were successfully completed.")


        # Added to illustrate the polynomial
        #poly = np.polynomial.Polynomial(self.weights)
        x_values = np.arange(0, self.num_episodes, 1)
        print(f"x values: {x_values}")
        # Note: np.polyval reads coefficients highest degree first,
        # the opposite of np.polynomial.Polynomial used in state_value
        y_values = np.polyval(self.weights, x_values)
        print(f"these are the y_values: {y_values}")
        plt.scatter(x_values, y_values)
        plt.show()


def load_REINFORCE():
    env = gym.make('CartPole-v0')
    generator = MonteCarloGeneration(env=env, debug=True)
    agent = CartPoleAgent(env=env, generator=generator)
    agent.train()
    agent.plot_learning()

    return agent

agent = load_REINFORCE()

this is the derivative: 0.4834937431190569 + 0.8980828032239239·x¹ - 0.2618291679305793·x² +
0.3457532489405235·x³ + 1.7251530848042074·x⁴ - 2.5470443524787196·x⁵ -
5.7063451708348065·x⁶ - 6.689857337504785·x⁷ + 7.798186695201721·x⁸
this is the gradient of V(s, w): 3.118400614351296e+33
this is the pi policy gradient: [-0.03901446 -0.65792469  0.09658798  1.06320175]
this is the derivative: -5.402268391660212e+68 - 1.0804536783320425e+69·x¹ -
1.6206805174980637e+69·x² - 2.160907356664085e+69·x³ -
2.7011341958301064e+69·x⁴ - 3.2413610349961275e+69·x⁵ -
3.7815878741621485e+69·x⁶ - 4.32181471332817e+69·x⁷ -
4.8620415524941914e+69·x⁸
this is the gradient of V(s, w): -1.9446722395710847e+102
this is the pi policy gradient: [-0.02583511 -0.52614281  0.07319752  0.84068995]
this is the derivative: -2.100952856082823e+206 - 4.201905712165646e+206·x¹ -
6.302858568248469e+206·x² - 8.403811424331292e+206·x³ -
1.0504764280414115e+207·x⁴ - 1.2605717136496938e+207·x⁵ -
1.4706669992579761e+207·x⁶ - 1

  self.weights += learning_rate * delta * gradient_value_function


The last 5000 lines of the streaming output were truncated.
this is the derivative: -inf - inf·x¹ - inf·x² - inf·x³ - inf·x⁴ - inf·x⁵ - inf·x⁶ - inf·x⁷ -
inf·x⁸
this is the gradient of V(s, w): -inf
this is the pi policy gradient: [-0.04571555  0.01317296  0.02050929 -0.02448205]
this is the derivative: -inf - inf·x¹ - inf·x² - inf·x³ - inf·x⁴ - inf·x⁵ - inf·x⁶ - inf·x⁷ -
inf·x⁸
this is the gradient of V(s, w): -inf
this is the pi policy gradient: [ 0.00533187  0.00834323 -0.00224141 -0.01127368]
this is the derivative: -inf - inf·x¹ - inf·x² - inf·x³ - inf·x⁴ - inf·x⁵ - inf·x⁶ - inf·x⁷ -
inf·x⁸
this is the gradient of V(s, w): -inf
this is the pi policy gradient: [-0.03289924  0.01011641  0.0142829  -0.0268879 ]
this is the derivative: -inf - inf·x¹ - inf·x² - inf·x³ - inf·x⁴ - inf·x⁵ - inf·x⁶ - inf·x⁷ -
inf·x⁸
this is the gradient of V(s, w): -inf
this is the pi policy gradient: [ 0.00725931  0.01156024 -0.00297633 -0.01378057]
this is the derivative: -inf - inf·

KeyboardInterrupt: ignored
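
The run above overflows: the value-function gradient grows from about 1e33 to `-inf` within a few updates. Two things in the code plausibly contribute. First, `gradient_V` differentiates the polynomial with respect to its input rather than its coefficients. Second, the inputs are shifted by `10e3`, which equals 10000. The sketch below shows one way the gradient with respect to the weights could be written, assuming the value function is V(s, w) = Σᵢ Σₖ wₖ · sᵢᵏ (the same sum `evaluate_polynomial` computes, without the shift). The helper names `poly_features`, `value`, and `grad_value_wrt_weights` are hypothetical and not part of the notebook.

```python
import numpy as np

DEGREE = 9  # assumption: 10 coefficients, matching size=10 in the agent


def poly_features(state, degree=DEGREE):
    """Return phi(s) with phi_k = sum_i s_i**k for k = 0..degree."""
    s = np.asarray(state, dtype=np.float64)
    powers = np.arange(degree + 1)
    # s[:, None] ** powers has shape (len(s), degree + 1); sum over state variables.
    return np.sum(s[:, None] ** powers, axis=0)


def value(state, weights):
    """V(s, w) = w . phi(s), which is linear in the weights."""
    return float(np.dot(weights, poly_features(state, len(weights) - 1)))


def grad_value_wrt_weights(state, weights):
    """Gradient of V(s, w) with respect to w, which is simply phi(s)."""
    return poly_features(state, len(weights) - 1)


rng = np.random.default_rng(0)
weights = rng.uniform(-1, 1, size=DEGREE + 1)
state = np.array([-0.08776303, -0.3829126, 0.09531414, 0.63496107])

# One gradient step toward a hypothetical return G (illustrative values).
G, learning_rate = 10.0, 0.05
delta = G - value(state, weights)
weights = weights + learning_rate * delta * grad_value_wrt_weights(state, weights)
print(np.all(np.isfinite(weights)))
```

Because V is linear in `weights`, the gradient is a vector with one entry per coefficient, so each weight receives its own update instead of all weights being shifted by the same scalar. Whether this alone stabilizes training in the full agent is not verified here.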