# **Reinforcement Learning** [SOW-BKI258]
## Practical 1: Introduction (demos)

In today's practical, we will have some fun with modern RL tools and software. We will explore some demos and
examples that illustrate the usefulness, interest (and fun) of RL algorithms and how easy it is to use them with modern software tools.

**NOTE:** This week's practical is merely a demonstration/tutorial. There is no assignment to hand in. In part 2, we
will superficially explore some advanced topics in modern RL and see a concrete demo, but most of the content of this
demo (models and environments) is beyond the scope of this course. Thus, you are not asked to understand the models used and you will not be evaluated on this content.

# 1. `OpenAI-gym`

We will start by going through the basics of [OpenAI Gym](https://www.gymlibrary.dev/index.html), the most popular choice to define environments and agent-environment interactions. It comes with a set of handy tools as well as pre-defined (classical) RL environments / tasks. The code in this practical uses `gymnasium`, the maintained fork of Gym, imported under the name `gym`. The use of `gym` will come up several times during the practical exercises, so try to get an early grasp of the basic concepts involved (these will, of course, be explained in the coming weeks).

### 1.1. Let's look at a few examples of classic RL tasks: the `LunarLander`, `CartPole` and `BipedalWalker` environments.

In [10]:
## Make sure the Python version installed in your environment is one of: 3.8, 3.9, 3.10, 3.11

## You may need to install some packages first.
## Mac users: un-comment the following lines to complete the installation

#pip install gymnasium
#pip install swig
#pip install 'gymnasium[box2d]'


## Windows users: copy and run the following commands in the Anaconda terminal window
## (double quotes are used here because the Windows command prompt does not treat single quotes as quoting characters)
# pip install gymnasium
# pip install swig
# pip install "gymnasium[box2d]"

In [11]:
import gymnasium as gym
import matplotlib.pyplot as plt

In [12]:
# load the gym environment and set the render mode to "human" for us to visualize the outputs

env = gym.make("LunarLander-v2", render_mode="human")
# seed the action space and environment (so the results can be reproduced)
env.action_space.seed(42)
observation, info = env.reset(seed=42)
# randomly sample the action space (take random actions) to see what the environment looks like
for _ in range(1000):
	observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
	if terminated or truncated:
		observation, info = env.reset()

env.close()

In [13]:
env = gym.make('CartPole-v1', render_mode="human")
env.action_space.seed(42)
observation, info = env.reset(seed=42)
for _ in range(1000):
	env.render()
	observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
	if terminated or truncated:
		observation, info = env.reset()
env.close()

In [21]:
env = gym.make('BipedalWalker-v3', render_mode="human")
env.action_space.seed(42)
observation, info = env.reset(seed=42)
for _ in range(1000):
	env.render()
	observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
	if terminated or truncated:
		observation, info = env.reset()
env.close()


We can check the attributes and methods of the environments to understand what they are made of. I advise you to check
the official documentation and, for example, [this tutorial](https://www.gymlibrary.dev/content/basic_usage/) to gain a better insight into `gym` environments.

Particularly important are the state and action spaces:

In [14]:
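# inspect the CartPole environment (the output below is for CartPole-v1)
env = gym.make('CartPole-v1')
obs_space = env.observation_space
action_space = env.action_space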
print("The observation space: {}".format(obs_space))
print("The action space: {}".format(action_space))

The observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
The action space: Discrete(2)


In the CartPole environment, the state is a continuous (not discretized) 4-dimensional vector: cart position, cart velocity, pole angle and pole
angular velocity. There are only 2 possible actions: push the cart to the left (`0`) or to the right (`1`). For more info on the task, see
the [documentation](https://www.gymlibrary.dev/environments/classic_control/cart_pole/).

### 1.2. Interacting with the environment

Preconfigured `gym` environments are typically accompanied by sophisticated visualizations and rendering tools that
allow us to see how the agent is learning the task at hand. More importantly, `gym` provides a wrapper for any new RL environment, exposing the following main API methods:
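*  `step`: Receives an action (e.g. an `int` for a discrete action space) and returns 5 objects: the next state (observation), the reward, a boolean `terminated` indicating whether the episode has reached a terminal state, a boolean `truncated` indicating whether the episode was cut short (e.g. by a time limit), and a dictionary with any extra info depending on the environment.
*  `reset`: Resets the environment and returns the initial state / observation together with an `info` dictionary.
*  `render`: Renders the environment according to the `render_mode` chosen when the environment was created (e.g. `"human"` or `"rgb_array"`).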

See [here](https://github.com/openai/gym/blob/3bd5ef71c2ca3766a26c3dacf87a33a9390ce1e6/gym/core.py) for more details on this API.

As we have done in the demo above, we use these methods to interact with pre-existing environments:

In [15]:
# create the environment
env = gym.make('MountainCar-v0', render_mode="human")

# reset the environment and see the initial observation
# (reset returns a tuple: the observation and an info dictionary)
observation, info = env.reset()
print("The initial observation is {}".format(observation))

# Sample a random action from the entire action space
random_action = env.action_space.sample()

# Take the action and get the new observation
new_observation, reward, terminated, truncated, info = env.step(random_action)
print("The new observation is {}".format(new_observation))

The initial observation is [-0.4689014  0.       ]
The new observation is [-0.4703098  -0.00140839]


To run the RL environment, we loop through a number of steps, as we did in the demos above. Note that the goal of RL
algorithms is to select actions that maximize rewards. In these examples, we are using a random *policy*, i.e.
actions are randomly sampled from the action space.

In [16]:
# Number of steps you run the agent for
num_steps = 150
obs, info = env.reset()
for step in range(num_steps):
	# take a random action, but you can also do something more intelligent
	# action = my_intelligent_agent_fn(obs)
	action = env.action_space.sample()

	# apply the action
	obs, reward, terminated, truncated, info = env.step(action)

	# Render the env: with render_mode="human" the window is updated automatically and render() returns None;
	# it returns a 2D RGB array only if the env was created with render_mode="rgb_array"
	env_img = env.render()

	# If the episode is over, then start another one
	if terminated or truncated:
		obs, info = env.reset()

# Close the env
env.close()

# import matplotlib.pyplot as plt
# plt.imshow(env_img)

---

### 1.3. Now let's see how we can create a new `env` object from scratch

Suppose you want to program a controller that can regulate the heat of your shower and keep it in an optimal range.
Some constraints on the problem:
- We want our optimal temperature to be between 37 and 39 degrees Celsius.
- The shower length will be 60 seconds (episode length = 60 seconds).
- Three actions can be performed: `turn up`, `leave`, and `turn down` (encoded in the code below as `2`, `1` and `0`, respectively).

(for a more complete example, see [this tutorial](https://medium.com/geekculture/developing-reinforcement-learning-environment-using-openai-gym-f510b0393eb7))

In [17]:
import numpy as np
from gymnasium import Env
from gymnasium.spaces import Box, Discrete
import random

In [18]:
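class CustomEnv(Env):
	def __init__(self):
		# actions: 0 = turn down, 1 = leave, 2 = turn up
		self.action_space = Discrete(3)
		self.observation_space = Box(low=np.array([0], dtype=np.float32), high=np.array([100], dtype=np.float32))
		self.state = 38 + random.randint(-3,3)
		self.shower_duration = 60

	def step(self, action):
		# action - 1 maps the actions to a temperature change of -1, 0 or +1 degrees
		self.state += action - 1
		self.shower_duration -= 1

		# Calculating the reward
		if 37 <= self.state <= 39:
			reward = 1
		else:
			reward = -1

		# Checking if shower is done
		terminated = self.shower_duration <= 0
		# No additional cut-off criterion in this example
		truncated = False

		# Setting the placeholder for info
		info = {}

		# Returning the step information (gymnasium API: 5 values)
		return np.array([self.state], dtype=np.float32), reward, terminated, truncated, info

	def render(self):
		# This is where you would write the visualization code (we will leave it for now)
		pass

	def reset(self, seed=None, options=None):
		# seed and options are part of the gymnasium reset signature; super().reset seeds self.np_random
		# (the start temperature below still uses Python's `random` module, which is not affected by seed)
		super().reset(seed=seed)
		self.state = 38 + random.randint(-3,3)
		self.shower_duration = 60
		# gymnasium API: return the initial observation and an info dictionary
		return np.array([self.state], dtype=np.float32), {}



In [19]:
env = CustomEnv()

In [20]:
episodes = 20 #20 shower episodes
for episode in range(1, episodes+1):
	state, info = env.reset()
	done = False
	score = 0

	while not done:
		action = env.action_space.sample()
		n_state, reward, terminated, truncated, info = env.step(action)
		done = terminated or truncated
		score+=reward
	print('Episode:{} Score:{}'.format(episode, score))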

Episode:1 Score:-48
Episode:2 Score:6
Episode:3 Score:-60
Episode:4 Score:-16
Episode:5 Score:-60
Episode:6 Score:-46
Episode:7 Score:12
Episode:8 Score:-38
Episode:9 Score:-60
Episode:10 Score:-60
Episode:11 Score:-16
Episode:12 Score:-38
Episode:13 Score:-50
Episode:14 Score:-60
Episode:15 Score:28
Episode:16 Score:-60
Episode:17 Score:-12
Episode:18 Score:10
Episode:19 Score:-54
Episode:20 Score:-18


With these examples and the linked tutorials, you should get a good overview of how to use `gym`. Naturally, what
actually matters is what comes next: how to choose the actions to take at any given time point.
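
To make the point about action selection concrete, here is a minimal pure-Python sketch of the shower dynamics used above (temperature change of `action - 1` per step, reward +1 inside the 37 to 39 degree range and -1 outside, 60 steps). The names `thermostat_policy` and `simulate_shower` are invented for this illustration, and the rule is hand-coded rather than learned.

```python
def thermostat_policy(temperature):
    """Hypothetical hand-coded rule: 0 = turn down, 1 = leave, 2 = turn up."""
    if temperature < 37:
        return 2
    if temperature > 39:
        return 0
    return 1


def simulate_shower(policy, start_temperature, duration=60):
    """Replay the shower dynamics for `duration` steps and return the total score."""
    temperature, score = start_temperature, 0
    for _ in range(duration):
        # Same update as in the environment: the action changes the temperature by -1, 0 or +1
        temperature += policy(temperature) - 1
        # Reward is computed on the temperature after the update
        score += 1 if 37 <= temperature <= 39 else -1
    return score


# All possible start temperatures: 38 plus or minus 3 degrees
scores = {start: simulate_shower(thermostat_policy, start) for start in range(35, 42)}
print(scores)
# {35: 58, 36: 60, 37: 60, 38: 60, 39: 60, 40: 60, 41: 58}
```

Compared with the mostly negative scores of the random policy above, this simple rule collects close to the maximum of 60 from every start temperature. The RL algorithms covered in the course aim to discover such rules from rewards alone.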

# 2. Mario DQN

To complement the lecture content and to demonstrate how modern RL algorithms can do fun things, explore [this
tutorial](https://ml-showcase.paperspace.com/projects/super-mario-bros-double-deep-q-network) to see a practical and
complete example of an agent learning to play Super Mario Bros.

**Note:** This tutorial is just for fun. The actual RL algorithm used (Deep Q-Network) is beyond the scope of the
current course. We will go through the basics, including Q-learning, but this is an example of a Deep-RL algorithm.
So, explore and have fun with the implementation, but don't get frightened or confused by the actual agent /
algorithm used. In time, you will learn more about these approaches.