# Deep Cross-Entropy Method, 8 pts + bonuses

In this section we'll extend your CEM implementation with neural networks. You will train a multi-layer neural network to solve simple games with continuous state spaces. __Please make sure you've finished the tabular cross-entropy method in the other notebook first.__


In [None]:
# Install the necessary libraries. If local installation causes trouble, use Google Colab.

!pip install swig
!pip install gymnasium[toy_text,classic_control,box2d]

We start with the [CartPole-v1](https://gymnasium.farama.org/environments/classic_control/cart_pole/) environment. As usual, first read the description of the environment: what the goal of the game is, what the observations and actions are, how the reward is calculated, and so on.

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make("CartPole-v1", render_mode="rgb_array")

env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape[0]

plt.imshow(env.render())
print("state vector dim =", state_dim)
print("n_actions =", n_actions)

env.close()

Let's play the environment with a random policy and record a video of the result.

In [None]:
from gymnasium.wrappers import RecordVideo

with RecordVideo(
    env=gym.make("CartPole-v1", render_mode="rgb_array"),
    video_folder="./videos",
    episode_trigger=lambda episode_number: True,
) as env_monitor:

    s, info = env_monitor.reset()
    for t in range(100):
        a = env_monitor.action_space.sample()
        s, r, terminated, truncated, info = env_monitor.step(a)
        if terminated or truncated:
            break

In [None]:
import sys
from pathlib import Path
from base64 import b64encode
from IPython.display import HTML

video_paths = sorted([s for s in Path("videos").iterdir() if s.suffix == ".mp4"])
video_path = video_paths[0]  # You can also try other indices

if "google.colab" in sys.modules:
    # https://stackoverflow.com/a/57378660/1214547
    with video_path.open("rb") as fp:
        mp4 = fp.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
else:
    data_url = str(video_path)

HTML(
    """
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format(
        data_url
    )
)

# Neural Network Policy

For this assignment we'll utilize the simplified neural network implementation from __[Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)__. Here's what you'll need:

* `agent.partial_fit(states, actions)` - make a single training pass over the data. Maximize the probability of :actions: from :states:
* `agent.predict_proba(states)` - predict probabilities of all actions, a matrix of shape __[len(states), n_actions]__
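
The two calls above can be tried on made-up data before wiring them into the environment. This is a minimal sketch, assuming two actions and four-dimensional states (as in CartPole-v1); `demo_agent`, `states` and `actions` are illustrative names, and the random data is not real experience.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

n_actions = 2   # assumed: CartPole-v1 has two discrete actions
state_dim = 4   # assumed: CartPole-v1 observations are 4-dimensional

rng = np.random.default_rng(0)
states = rng.normal(size=(8, state_dim))      # made-up states, for illustration only
actions = rng.integers(0, n_actions, size=8)  # made-up actions

demo_agent = MLPClassifier(hidden_layer_sizes=(20, 20), activation="tanh")

# The first partial_fit call needs the full list of classes (actions),
# because a given batch may not contain every action.
demo_agent.partial_fit(states, actions, classes=np.arange(n_actions))

# One row of action probabilities per state.
probs = demo_agent.predict_proba(states)
print(probs.shape)
```

Each row of `probs` sums to 1, so a single row can be used directly as a sampling distribution over actions.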


In [None]:
from sklearn.neural_network import MLPClassifier

agent = MLPClassifier(
    hidden_layer_sizes=(20, 20),
    activation="tanh",
)

# initialize agent to the dimension of state space and number of actions
agent.partial_fit([env.reset()[0]] * n_actions, range(n_actions), range(n_actions))


In [None]:
def generate_session(env, agent, t_max=1000, test=False):
    """
    Play a single game using agent neural network.
    Terminate when game finishes or after :t_max: steps
    """
    states, actions = [], []
    total_reward = 0

    s, info = env.reset()

    for t in range(t_max):

        # use agent to predict a vector of action probabilities for state :s:
        probs = <YOUR CODE>

        assert probs.shape == (env.action_space.n,), "make sure probabilities are a vector (hint: np.reshape)"

        # use the probabilities you predicted to pick an action
        if test:
            # at test time, take the best (most likely) action
            # experiment: would this also work during training, and vice versa?
            a = <YOUR CODE>
            # ^-- hint: try np.argmax
        else:
            # during training, sample proportionally to the probabilities;
            # don't just take the most likely action
            a = <YOUR CODE>
            # ^-- hint: try np.random.choice
        
        new_s, r, terminated, truncated, info = env.step(a)

        # record sessions like you did before
        states.append(s)
        actions.append(a)
        total_reward += r

        s = new_s
        if terminated or truncated:
            break
            
    return states, actions, total_reward
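
The difference between the two action-selection modes in `generate_session` can be seen on a fixed probability vector. The sketch below is an illustration under assumed values: `probs` is made up, and it uses NumPy's `Generator.choice` rather than the `np.random.choice` mentioned in the hint (both accept a `p` argument).

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])  # made-up action probabilities for one state
n_actions = len(probs)

# Greedy choice (test mode): always the most likely action.
greedy_action = int(np.argmax(probs))

# Stochastic choice (training mode): action i is drawn with probability probs[i].
rng = np.random.default_rng(0)
sampled_actions = rng.choice(n_actions, size=10_000, p=probs)

# Empirical frequencies approach probs as the number of samples grows.
frequencies = np.bincount(sampled_actions, minlength=n_actions) / len(sampled_actions)
print(greedy_action, frequencies)
```

Sampling keeps the less likely actions in play, which is what lets CEM discover better sessions; the greedy choice never explores them.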


In [None]:
dummy_states, dummy_actions, dummy_reward = generate_session(env, agent, t_max=5)
print("states:", np.stack(dummy_states))
print("actions:", dummy_actions)
print("reward:", dummy_reward)


In [None]:
# let's see the initial reward distribution
sample_rewards = [generate_session(env, agent, t_max=1000, test=False)[-1] for _ in range(200)]

plt.hist(sample_rewards, bins=20)
plt.vlines([np.percentile(sample_rewards, 50)], [0], [100], label="50th percentile", color='green')
plt.vlines([np.percentile(sample_rewards, 90)], [0], [100], label="90th percentile", color='red')
plt.legend()

### CEM steps
Deep CEM uses exactly the same strategy as the regular CEM, so you can copy your function code from the previous notebook.

In [None]:
def select_elites(states_batch, actions_batch, rewards_batch, percentile=50):
    """
    Select states and actions from games that have rewards >= percentile
    :param states_batch: list of lists of states, states_batch[session_i][t]
    :param actions_batch: list of lists of actions, actions_batch[session_i][t]
    :param rewards_batch: list of rewards, rewards_batch[session_i]

    :returns: elite_states, elite_actions, both 1D lists of states and respective actions from elite sessions

    Please return elite states and actions in their original order
    [i.e. sorted by session number and timestep within session]

    If you are confused, see examples below. Please don't assume that states are integers
    (they will become different later).
    """

    <YOUR CODE>

    return elite_states, elite_actions
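
To make the selection rule in the docstring concrete, here is one possible reading of it on toy data. `select_elites_sketch` is a hypothetical helper written for illustration, not the reference solution, and the toy sessions are made up.

```python
import numpy as np

def select_elites_sketch(states_batch, actions_batch, rewards_batch, percentile=50):
    """Hypothetical helper: keep (state, action) pairs from sessions at or above the percentile."""
    reward_threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    # Iterating in session order preserves the original ordering of states and actions.
    for session_states, session_actions, reward in zip(states_batch, actions_batch, rewards_batch):
        if reward >= reward_threshold:
            elite_states.extend(session_states)
            elite_actions.extend(session_actions)
    return elite_states, elite_actions

# Three toy sessions of different lengths; states are float tuples, not integers.
toy_states = [[(0.0, 0.1), (0.2, 0.3)], [(1.0, 1.1)], [(2.0, 2.1), (2.2, 2.3), (2.4, 2.5)]]
toy_actions = [[0, 1], [1], [0, 0, 1]]
toy_rewards = [2.0, 1.0, 3.0]

elite_states, elite_actions = select_elites_sketch(toy_states, toy_actions, toy_rewards, percentile=50)
print(elite_actions)
```

With these rewards the 50th percentile is 2.0, so the first and third sessions are kept and the second is dropped.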

# Training loop
Generate sessions, select the elite ones (those at or above the reward percentile), and fit the agent to them. We don't need to solve the environment perfectly here: reaching a mean reward of 190 is enough.

In [None]:
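n_sessions = 20     # sessions generated per iteration
n_iterations = 100  # upper bound on training iterations (an assumed value; adjust as needed)
percentile = 50
log = []

for i in range(n_iterations):
    # generate new sessions
    sessions = [ < generate a list of n_sessions new sessions > ]

    # sessions differ in length, so keep them as tuples of lists;
    # converting ragged data with np.array raises an error in recent NumPy versions
    states_batch, actions_batch, rewards_batch = zip(*sessions)

    < estimate mean reward and print >

    elite_states, elite_actions = <select elite actions just like before>

    <partial_fit agent to predict elite_actions (y) from elite_states (X)>

    if mean_reward > 190:
        print("You Win! You may stop training now via KeyboardInterrupt.")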


# Results

Let's generate a video for our trained policy

In [None]:
# Record sessions

from gymnasium.wrappers import RecordVideo

with RecordVideo(
    env=gym.make("CartPole-v1", render_mode="rgb_array"),
    video_folder="./videos",
    episode_trigger=lambda episode_number: True,
) as env_monitor:
    sessions = [generate_session(env_monitor, agent) for _ in range(5)]


In [None]:
# Show video

import sys
from pathlib import Path
from base64 import b64encode
from IPython.display import HTML

video_paths = sorted([s for s in Path("videos").iterdir() if s.suffix == ".mp4"])
video_path = video_paths[0]  # You can also try other indices

if "google.colab" in sys.modules:
    # https://stackoverflow.com/a/57378660/1214547
    with video_path.open("rb") as fp:
        mp4 = fp.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
else:
    data_url = str(video_path)

HTML(
    """
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format(
        data_url
    )
)


# The Assignment

### Deep Cross-Entropy Method

By now, your score on CartPole-v1 should be high enough to consider it solved. It's time to try something harder.

### Tasks

* __2.1__ (**3pts**) Pick one of the environments: [MountainCar-v0](https://gymnasium.farama.org/environments/classic_control/mountain_car/) or [LunarLander-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/)
  * For MountainCar, get average reward of __at least -150__
  * For LunarLander, get average reward of __at least +50__

See the tips section below; it's kinda important.

__Note:__ If your agent is below the target score, you'll still get some of the points depending on the result.
  
  
* __2.2__ Devise a way to speed up training relative to the default version
  * (**2pts**) Try re-using samples from the last 3-5 iterations when computing the threshold and training.
  * (**3pts**) Obtain **-100** at MountainCar-v0 or **+200** at LunarLander-v2. Feel free to experiment with hyperparameters, architectures, schedules etc.
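
One way to re-use samples from recent iterations is a fixed-length buffer of session batches. The sketch below is only an illustration of that idea: `recent_sessions`, `pooled_threshold` and the buffer length of 4 are assumptions, and the toy sessions carry empty state and action lists.

```python
from collections import deque
import numpy as np

n_kept_iterations = 4  # assumed value within the suggested 3-5 range
# A deque with maxlen drops the oldest iteration automatically when a new one is appended.
recent_sessions = deque(maxlen=n_kept_iterations)

def pooled_threshold(recent_sessions, percentile):
    """Reward threshold computed over every session still in the buffer."""
    pooled_rewards = [reward for batch in recent_sessions for (_, _, reward) in batch]
    return np.percentile(pooled_rewards, percentile)

# Toy data: iteration i produces two sessions with rewards i and i + 0.5.
for iteration in range(6):
    batch = [([], [], float(iteration)), ([], [], iteration + 0.5)]
    recent_sessions.append(batch)

threshold = pooled_threshold(recent_sessions, percentile=50)
print(len(recent_sessions), threshold)
```

In a real training loop, the elite states and actions would be selected from the pooled sessions in the same way, so each `partial_fit` call sees more data per iteration.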
    
  
### Tips
* Sessions for MountainCar may last for 10k+ ticks if the episode time limit is removed or raised (by default, `gym.make("MountainCar-v0")` truncates episodes at 200 steps). In that case, make sure the ```t_max``` param is at least 10k.
  * It may also be a good idea to cut rewards via ">" and not ">=". If 90% of your sessions get a reward of -10k and 10% are better, then with the 20th percentile as the threshold, R >= threshold __fails to cut off bad sessions__ while R > threshold works fine.
* If it doesn't train, it's a good idea to plot reward distribution and record sessions: they may give you some clue.
* 20-neuron network is probably not enough, feel free to experiment.
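
The ">" versus ">=" tip can be checked numerically. This small sketch reproduces the scenario from the tip with ten made-up sessions (nine at -10k, one better); the value -150 for the better session is an arbitrary assumption.

```python
import numpy as np

# 90% of sessions fail with -10k, 10% do better.
rewards = np.array([-10_000.0] * 9 + [-150.0])
threshold = np.percentile(rewards, 20)

# ">=" keeps every session, because the threshold equals the worst reward.
n_elite_inclusive = int(np.sum(rewards >= threshold))
# ">" keeps only the session that actually did better.
n_elite_strict = int(np.sum(rewards > threshold))
print(threshold, n_elite_inclusive, n_elite_strict)
```

Here the threshold equals the failure reward, so the inclusive comparison selects all ten sessions while the strict one selects only the single better session.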


### Bonus tasks (Up to 6 points)

* __2.3 bonus__ (2 pts) Try to find a network architecture and training params that solve __both__ environments above

* __2.4 bonus__ (4 pts) Solve a continuous action space task with `MLPRegressor` or similar.
  * Since your agent only predicts the "expected" action, you will have to add noise to ensure exploration.
  * Choose one of [MountainCarContinuous-v0](https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous/) or [LunarLanderContinuous-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/).
  * You get slightly fewer points for results below the solution threshold. Note that discrete and continuous environments may have slightly different rules aside from action spaces.
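
For the exploration noise mentioned in 2.4, one simple option is Gaussian noise around the predicted action, clipped to the action bounds. This is a sketch under stated assumptions: `noisy_action` is a hypothetical helper, `expected_action` stands in for a regressor's prediction, and the bounds of -1 and 1 match MountainCarContinuous-v0 but should be read from `env.action_space` in practice.

```python
import numpy as np

def noisy_action(expected_action, noise_std, low, high, rng):
    """Hypothetical helper: add Gaussian noise to the predicted action and clip it to the bounds."""
    expected_action = np.asarray(expected_action, dtype=float)
    noise = rng.normal(loc=0.0, scale=noise_std, size=expected_action.shape)
    return np.clip(expected_action + noise, low, high)

rng = np.random.default_rng(0)
expected_action = np.array([0.3])  # stand-in for the regressor's prediction for one state

# Training: explore around the predicted action.
action = noisy_action(expected_action, noise_std=0.5, low=-1.0, high=1.0, rng=rng)
# Testing: with zero noise the predicted action is returned unchanged.
test_action = noisy_action(expected_action, noise_std=0.0, low=-1.0, high=1.0, rng=rng)
print(action, test_action)
```

The noise scale is a hyperparameter: too small and the agent rarely finds better sessions, too large and the elite actions are mostly noise.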
  