<h1 style="color:#333333; text-align:center; line-height: 0;">Reinforcement Learning | Assignment 2</h1>

<br/><br/>

This notebook covers a Policy Gradient **REINFORCE** approach.

Complete the code snippets in Section 3: there are several places to insert your code, plus string fields for your first and last name. The name fields are needed to automatically save the results of your runs to a .json file. When you are done, please upload the notebook (.ipynb) and the .json file via https://forms.gle/MWZ4Po2f6hs2s7Ny8.

* Problem 2.1 - Swing Up Policy (10 points)
* Problem 2.2 - Gradient Calculation (20 points)
* Problem 2.3* (additional) - NPG (10 points)

***

<h2 style="color:#A7BD3F;">Section 1: Theory recap</h2>

Let us recall the REINFORCE algorithm from the lecture.

<img src="PG.png" alt="REINFORCE" width=75% height=75% />

The second problem will be dedicated to the implementation of the function that calculates the right-hand side of the 10th line of the pseudocode.

***

<h2 style="color:#A7BD3F;">Section 2: OpenAI Pendulum environment</h2>

In contrast to the first assignment, this time we consider an environment with continuous state and action spaces: OpenAI Pendulum (https://gym.openai.com/envs/Pendulum-v0/). An overview of the state vector, the possible actions and their bounds is given at https://mspries.github.io/jimmy_pendulum.html

Let us examine the dynamic behaviour of the Pendulum by applying several simple policies. First, we will implement a wrapper function that will run the simulation for a number of episodes with a given policy and plot the reward.

In [None]:
import gym
import numpy as np
import collections
import sys
from tqdm import tqdm
from IPython.display import clear_output
import time
import matplotlib.pyplot as plt

def run_episode(policy):
    ep_len = 282

    env = gym.make('Pendulum-v0')
    env._max_episode_steps = ep_len
    
    observation = env.reset()
    reward_history = []
    
    for t in range(ep_len):  
        env.render()
        
        time.sleep(0.01)
        
        action = policy(observation)
        
        observation, reward, done, info = env.step(action)
        
        reward_history.append(reward)
    
    plt.plot(reward_history)
    
    env.close()
    
    return reward_history

The first policy applies a constant counterclockwise torque of $0.5$. Run it several times to explore how the reward behaves under this policy for different initial states. You could increase the torque up to the limit to make the pendulum rotate.

In [None]:
def half_policy(obs):
    return [0.5]

_ = run_episode(half_policy)

Another policy, one that does not exhibit such cyclic behaviour, is the random one.

In [None]:
def random_policy(obs):
    return [np.random.random_sample() * 4 - 2]

_ = run_episode(random_policy)

***

<h2 style="color:#A7BD3F;">Section 3: Problems</h2>

### <font color="blue">Problem 2.1 - Swing Up Policy</font>

Implement a policy that stabilizes the pendulum in the upwards position.

The first policy in the code below is the one that stabilizes the pendulum downwards (check the plot!). The second does the opposite. Please familiarize yourself with the environment (using the links given above) well enough to understand exactly how the negative feedback in the first policy stabilizes the pendulum.

The approach that you are asked to complete relies on the following:
* When the pendulum is in the relatively low position, the policy should destabilize (accelerate) it
* When the surrounding of the higher equilibrium is reached, the policy should stabilize the pendulum
* Negative feedback on the angular velocity is enough for stabilization in the lower equilibrium. However, stabilization in the higher one requires an additional negative feedback term on the angle: without it the pendulum will slowly move away from the desired position.
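
To see why the sign of the angular-velocity feedback matters, the following minimal sketch simulates a simplified pendulum without gym. The constants and the equation of motion are assumptions modelled on the classic Pendulum-v0 defaults (angle 0 is upright), and the helper names `clip`, `energy` and `simulate` are illustrative, not part of the assignment code. For this model the rate of change of the mechanical energy equals torque times angular velocity, so a gain of -1 drains energy and a gain of +1 pumps it in.

```python
import math

# Assumed constants, mirroring the classic Pendulum-v0 defaults
G, M, L, DT = 10.0, 1.0, 1.0, 0.05
MAX_TORQUE, MAX_SPEED = 2.0, 8.0

def clip(value, low, high):
    return max(low, min(high, value))

def energy(theta, theta_dot):
    """Mechanical energy of a uniform rod; theta = 0 is the upright position."""
    kinetic = 0.5 * (M * L ** 2 / 3.0) * theta_dot ** 2
    potential = M * G * (L / 2.0) * math.cos(theta)
    return kinetic + potential

def simulate(gain, theta=math.pi / 2, theta_dot=0.0, steps=200):
    """Apply torque = gain * theta_dot and return the energy after `steps` steps."""
    for _ in range(steps):
        torque = clip(gain * theta_dot, -MAX_TORQUE, MAX_TORQUE)
        # Simplified dynamics (assumption): gravity term plus applied torque
        theta_ddot = 3.0 * G / (2.0 * L) * math.sin(theta) + 3.0 * torque / (M * L ** 2)
        theta_dot = clip(theta_dot + theta_ddot * DT, -MAX_SPEED, MAX_SPEED)
        theta += theta_dot * DT
    return energy(theta, theta_dot)

initial = energy(math.pi / 2, 0.0)   # start horizontal and at rest
damped = simulate(gain=-1.0)         # same sign as stabilizing_policy
pumped = simulate(gain=+1.0)         # same sign as destabilizing_policy
print(f"initial={initial:.2f}  damped={damped:.2f}  pumped={pumped:.2f}")
```

With negative velocity feedback the energy ends below its initial value (the pendulum settles near the bottom), while with positive feedback it ends above it.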

Your goal is to complete the code below, in particular:
* Set a condition for switching between stabilizing and destabilizing modes for the policy. It could be angle, measured from the desired position (note that it is not in the observation vector, it should be calculated), height of the center of mass, etc.
* Set the control coefficients for both ways of torque calculation. Try to understand the relation between them: are they positive/negative, which one has greater value. Generally, the policy should accelerate the pendulum with moderately high torque for the stabilization to be possible.
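
As a starting point for the switching condition, here is a small sketch of how the angle and a height measure can be recovered from the observation. It assumes the Pendulum-v0 observation layout `[cos(theta), sin(theta), theta_dot]` with `theta = 0` at the upper equilibrium. The helper names are hypothetical, and the switching threshold is deliberately left for you to choose.

```python
import math

def angle_from_upright(obs):
    """Angle in radians in [-pi, pi]; 0 means the pendulum points straight up."""
    cos_theta, sin_theta = obs[0], obs[1]
    return math.atan2(sin_theta, cos_theta)

def normalised_height(obs):
    """cos(theta): +1 at the upper equilibrium, -1 at the lower one."""
    return obs[0]

# Hand-made observations for three characteristic positions
upright = [1.0, 0.0, 0.0]
hanging = [-1.0, 0.0, 0.0]
horizontal = [0.0, 1.0, 0.0]

for name, obs in [("upright", upright), ("hanging", hanging), ("horizontal", horizontal)]:
    print(name, angle_from_upright(obs), normalised_height(obs))
```

Either quantity can serve as the switching variable: the absolute angle is small near the upper equilibrium, and the normalised height is close to 1 there.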

The policy should be capable of stabilizing the pendulum most of the time, in at least 4 out of 5 trials, within the given episode length. After you have implemented it, save the rewards of a single run with the help of the Auto-grading cell below.

In [None]:
def stabilizing_policy(obs):
    return [- obs[2]]

def destabilizing_policy(obs):
    return [obs[2]]

def swing_up_policy(obs):
    ### YOUR SOLUTION BELOW
    if (...):
        torque = ... * (obs[2] + obs [1])
        
        return [torque]
    
    else:
        return [... * obs[2]]
    ### YOUR SOLUTION ABOVE

reward_history = run_episode(stabilizing_policy)
#reward_history = run_episode(destabilizing_policy)
#reward_history = run_episode(swing_up_policy)

### <font color="orange">Auto-grading</font>
Run this cell to track your answers and to save your answer for problem 2.1. Make sure you have defined the necessary variable above to avoid a `NameError` below.

In [None]:
### GRADING DO NOT MODIFY
from grading_utilities import AnswerTracker
asgn2_answers = AnswerTracker()
asgn2_answers.record('problem_2-1', {'reward_history': reward_history})

### <font color="blue">Problem 2.2 - Gradient Calculation</font>

Examine the code below. Note the way it generalizes and wraps the `swing_up_policy`. Correlate the lines of code of `REINFORCE` with the pseudocode above.

Let us briefly outline the main novelties in comparison to the code above.

* The control coefficients are given to the policy as a parameter.
* A function for the PDF gradient calculation is sketched.
* Random noise (named `nrv`) is included in the process. Familiarize yourself with the way it is passed around during the execution.

Let us unwrap the latter a little bit. The resultant torque is given by

$$\hat{\tau}(\vartheta) =
\begin{cases}
  \vartheta[0] \, (\dot{\theta} + \sin(\theta)), & \text{condition} \\
  \vartheta[1] \, \dot{\theta}, & \text{otherwise}
\end{cases}$$

where $\vartheta$ is a vector of policy parameters.

Adding Gaussian noise leads to the following PDF:

$f(\tau) = \dfrac{1}{\sigma \sqrt{2 \pi}} e^{ -\frac{1}{2}\left(\dfrac{\tau - \hat{\tau}}{\sigma}\right)^2 }$
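
For orientation, the update that the `REINFORCE` function in the code cell below applies once per outer iteration (when `update_params` is set) can be read off its code as follows. This is a transcription of that code rather than of the lecture pseudocode, which may differ in details such as normalization. Here $T$ stands for `episodes_num`, $\alpha$ for the learning rate, and $r_t$ and $\tau_t$ for the reward and the noisy torque at the $t$-th simulation step.

$$\vartheta \leftarrow \vartheta + \alpha \left( \sum_{t=1}^{T} r_t \right) \left( \sum_{t=1}^{T} \nabla_{\vartheta} \ln f(\tau_t) \right)$$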

The task is the following:
* Insert your policy switching criteria into `parametrized_swing_up_policy` and `param_policy_grad`.
* Insert your control coefficients into the initialization of `vartheta` in `REINFORCE`.
* Take the partial derivatives of $\ln f(\tau)$ with respect to the components of $\vartheta$.
* Write code that calculates them (using the given variables) in the `param_policy_grad` function.
* Run the cell (note the `visualize` flag) with and without updating the parameters during the run (`update_params`). Do it multiple times and compare the performance. Feel free to output any information you need, such as the cumulative reward, and to plot anything you need. Because the reward is a nonconvex function of the parameters, the performance could change in either direction. What is checked in this task is that the method is indeed working, not that it converges to the optimum.
* When you are done, run the code with updating parameters and save the reward history by running the Grading cell below.
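
Before plugging your analytic derivatives into `param_policy_grad`, it may help to compare them against a numerical estimate. The sketch below differentiates $\ln f(\tau)$ by central finite differences for the first branch of the policy. The helper names and the sample values of the observation, coefficients and noise are made up for illustration, and the observation layout `[cos(theta), sin(theta), theta_dot]` is assumed.

```python
import math

def log_gaussian_pdf(tau, tau_hat, sigma):
    """ln f(tau) for the Gaussian PDF given above."""
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((tau - tau_hat) / sigma) ** 2

def numeric_grad(func, params, eps=1e-6):
    """Central finite differences of func with respect to each entry of params."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((func(up) - func(down)) / (2.0 * eps))
    return grad

# Hypothetical sample: one observation, trial coefficients, one noise draw
obs = [0.8, 0.6, -1.0]      # assumed layout: [cos(theta), sin(theta), theta_dot]
vartheta = [-2.0, 1.5]
sigma = 0.3
nrv = 0.1

def log_pdf_first_branch(params):
    """ln f of the action actually taken, as a function of the policy parameters."""
    tau_hat = params[0] * (obs[2] + obs[1])
    tau = vartheta[0] * (obs[2] + obs[1]) + nrv   # noisy torque that was applied
    return log_gaussian_pdf(tau, tau_hat, sigma)

print(numeric_grad(log_pdf_first_branch, vartheta))
```

If your analytic expression for the same branch, evaluated at the same `obs`, `sigma` and `nrv`, agrees with the printed values to several decimal places, the derivative is most likely correct. The second branch can be checked the same way.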

You could vary the learning rate $\alpha$ if you need to, or scale the components of $\vartheta$ relative to each other if necessary.
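
When comparing runs with and without parameter updates, the raw per-step reward is noisy, so a cumulative reward and a smoothed curve can be easier to read. The helpers below are a minimal sketch with hypothetical names. They work on any list of rewards, such as the one returned by `REINFORCE`, and the sample values are made up.

```python
def cumulative_reward(rewards):
    """Total reward collected over the whole run."""
    return sum(rewards)

def moving_average(rewards, window):
    """Average over a sliding window; returns len(rewards) - window + 1 values."""
    if window <= 0 or window > len(rewards):
        raise ValueError("window must be between 1 and len(rewards)")
    running = sum(rewards[:window])
    averages = [running / window]
    for i in range(window, len(rewards)):
        # Slide the window: add the new reward, drop the oldest one
        running += rewards[i] - rewards[i - window]
        averages.append(running / window)
    return averages

sample = [-4.0, -3.0, -2.0, -1.0]    # made-up rewards, not real results
print(cumulative_reward(sample))     # -10.0
print(moving_average(sample, 2))     # [-3.5, -2.5, -1.5]
```

The smoothed list can be passed to `plt.plot` in the same way as the raw reward history.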

In [None]:
import gym
import numpy as np
import collections
import sys
from tqdm import tqdm
from IPython.display import clear_output
import time
import matplotlib.pyplot as plt
import math

def parametrized_swing_up_policy(obs, vartheta, s):
    #normal random variable
    nrv = np.random.normal(0, s, 1)[0]
    
    ### YOUR SOLUTION ON THE LINE BELOW
    if (...):
        torque = vartheta[0] * (obs[2] + obs [1]) + nrv
        
        return [torque], nrv
    
    else:
        return [vartheta[1] * obs[2] + nrv], nrv

#x - state
#u - action
#s - sigma of the normal distribution
#nrv - the specific value of the random variable
def param_policy_grad(x, u, s, nrv):
    ### YOUR SOLUTION BELOW
    if (...):
        

    else:
        
    
    ### YOUR SOLUTION ABOVE

ep_len = 340
env = gym.make('Pendulum-v0')
env._max_episode_steps = ep_len

def REINFORCE(env, update_params, visualize = False):
    observation = env.reset()
    
    ### YOUR SOLUTION BELOW
    vartheta = np.array([..., ...])
    ### YOUR SOLUTION ABOVE

    steps_num    = 20
    episodes_num = 30

    policy = parametrized_swing_up_policy
    alpha = 0.00001

    sigma = 0.3
    
    reward_history = []
    
    for step in range(steps_num):
        Grad = np.array([0.0, 0.0])

        acc_reward = 0
        policy_PDF_grad = np.array([0.0, 0.0])

        for ep in range(episodes_num):        
            if (visualize == True):
                env.render()

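            # Remember the state in which the action is chosen: the gradient of
            # the log-PDF must be evaluated at this state, not at the next one
            state = observation

            action, nrv = policy(state, vartheta, sigma)

            observation, reward, done, info = env.step(action)

            acc_reward += reward
            
            reward_history.append(reward)

            ppg = param_policy_grad(state, action, sigma, nrv)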

            policy_PDF_grad += ppg

        Grad += acc_reward * policy_PDF_grad

        if (update_params):
            vartheta += alpha * Grad
    
    return reward_history

parametric_policy_reward_history = REINFORCE(env, update_params = True, visualize = True)

env.close()

### <font color="orange">Auto-grading</font>
Run this cell to track your answers and to save your answer for problem 2.2. Make sure you have defined the necessary variable above to avoid a `NameError` below.

In [None]:
### GRADING DO NOT MODIFY
asgn2_answers.record('problem_2-2', {'reward_history': parametric_policy_reward_history})

### <font color="blue">Problem 2.3* (additional) - NPG</font>

Copy the code from Problem 2.2 and modify the gradient step in accordance with the NPG algorithm. Feel free to rewrite the code in any way you need. This task is an extra one, so there will be no guidance. The only requirement (apart from convergence to the higher equilibrium) is the name of the list with the reward history, `NPG_reward_history`, so that the AnswerTracker can save it.

In [None]:
### YOUR SOLUTION BELOW

### YOUR SOLUTION ABOVE

### <font color="orange">Auto-grading</font>
Run this cell to track your answers and to save your answer for problem 2.3. Make sure you have defined the necessary variable above to avoid a `NameError` below.

In [None]:
### GRADING DO NOT MODIFY
asgn2_answers.record('problem_2-3', {'reward_history': NPG_reward_history})

### <font color="orange">Auto-grading: Submit your answers</font>
Enter your first and last name in the cell below and then run it to save your answers for this assignment to a JSON file. The file is saved next to this notebook. After the file is created, upload the JSON file and the notebook via the form provided at the beginning of the assignment.

In [None]:
assignment_name = "asgn_2"
first_name = ""
last_name = ""

asgn2_answers.save_to_json(assignment_name, first_name, last_name)

## Questions?

Reach out to Ilya Osokin (@elijahmipt) on Telegram.
