
# `Hit the Gym: MiniProject by CVI`

The exercises in this notebook are meant for CVI aspirants who wish to work on the RL Games CPP. 

**About the project:** The goal of the RL Games project is to explore and analyse current Reinforcement Learning Methods. These techniques will then be used to help F1 teams optimize their race strategies.

In this notebook, you'll use a popular Reinforcement Learning technique called Q-learning to train an agent to play simple games using the Python library OpenAI Gym.

You may refer to the following while coding:

Python reference: https://bit.ly/3ajUalZ and https://bit.ly/2UgiZKa

OpenAI Gym Documentation: https://gym.openai.com/docs/


### `RL Basics:`
Reinforcement learning is an approach used in Machine Learning where the agent is allowed to interact with the system to learn its behaviour and come up with an optimal strategy to achieve an objective. The agent models the problem as a probabilistic state machine (a graph where the transition from one node to another has a probability distribution). Nodes in the model graph are called states, and the choices that move the agent from one state to another are called actions. Each state transition (which results from a (state, action) pair) has a corresponding reward or penalty. The goal of the RL agent is to maximise the total reward it collects.
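
To make the "probabilistic state machine" picture concrete, here is a minimal sketch of a toy problem with two states. Everything in it (the state names, the actions, the probabilities and the rewards) is invented for illustration and has nothing to do with Gym; it only shows how states, actions, transition probabilities and rewards fit together.

```python
import random

# Hypothetical toy model: transitions[state][action] is a list of
# (probability, next_state, reward) triples. All values are made up.
transitions = {
    "cold": {
        "wait": [(1.0, "cold", -1.0)],
        "heat": [(0.8, "warm", 0.0), (0.2, "cold", -1.0)],
    },
    "warm": {
        "wait": [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
        "heat": [(1.0, "warm", 0.5)],
    },
}

def sample_step(state, action, rng=random):
    """Sample (next_state, reward) from the distribution for (state, action)."""
    outcomes = transitions[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in transitions[state][action])

print(sample_step("cold", "heat"))      # random: one of the two outcomes
print(expected_reward("warm", "wait"))  # probability-weighted reward
```

An RL agent does not get to read the `transitions` table directly. It only sees the results of calls like `sample_step` and has to estimate which actions pay off.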

Training an RL agent from scratch requires us to model the state space and the action space. In addition, we must also come up with suitable rewards for each state transition. The RL agent estimates this reward structure and executes actions so as to maximise them. The final performance of the RL agent is heavily dependent on how the system is modelled. Luckily for us, we **do not need to get into the mathematics** of Reinforcement Learning right now, thanks to the Python library Gym.


Gym offers many in-built RL environments which you can use to play around with. These environments are Python classes with their state spaces, action spaces and rewards pre-defined. You will use two such environments (Taxi-v3 and Maze) to train an agent to accomplish a goal. You can find the documentation for these environments here:

Taxi-v3 Documentation: https://gym.openai.com/envs/Taxi-v3/  

Maze: A custom environment, similar to Gym environments, defined later in this notebook

To create a gym environment of 'Taxi-v3' you do this:

In [None]:
import gym
import numpy as np  
# Create an environment of Taxi-v3:
env = gym.make('Taxi-v3').env 
env.render()
print(env.s)
print(env.observation_space)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

183
Discrete(500)


`env.s` is the current state of the environment.

`env.observation_space` gives the type and size of the observation (state) space. It is an attribute, not a method. `Discrete(500)` means that there are 500 distinct states rather than a continuous range. For games like Pacman, Gym uses another type of space, `Box`, because the observations are arrays (such as screen pixels) and the number of possible states is far too large to enumerate.
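
The 500 states come from 25 taxi positions × 5 passenger locations × 4 destinations. The sketch below decodes a state index into those parts. It assumes the encoding used by the Taxi-v3 source, `((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination`, with landmarks numbered R=0, G=1, Y=2, B=3 and passenger value 4 meaning "in the taxi". The helper names are hypothetical and not part of Gym.

```python
def decode_taxi_state(state):
    """Split a Taxi-v3 state index into (taxi_row, taxi_col, passenger, destination).

    Assumes the encoding ((row * 5 + col) * 5 + passenger) * 4 + destination.
    """
    state, destination = divmod(state, 4)  # 4 possible destinations
    state, passenger = divmod(state, 5)    # 4 landmarks + "in the taxi"
    taxi_row, taxi_col = divmod(state, 5)  # 5 x 5 grid
    return taxi_row, taxi_col, passenger, destination

def encode_taxi_state(taxi_row, taxi_col, passenger, destination):
    """Inverse of decode_taxi_state under the same assumed encoding."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination

print(decode_taxi_state(183))  # (1, 4, 0, 3)
```

For the state 183 printed above, this gives a taxi in row 1, column 4, a passenger waiting at R and a destination of B, which matches the rendered grid.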

**Solving the Problem:** The goal of this game is to train the taxi to efficiently pick up the passenger from the blue spot (`R` in the render above) and drop them off at the purple spot (`B`). The taxi itself is the highlighted cell. The taxi can perform the following actions: Move Up, Move Down, Move Left, Move Right, Pickup, Dropoff. The reward structure is as follows:
1. -10 points for illegal dropoff/pickup actions
2. +20 points if the passenger is dropped off at the correct location
3. -1 point for every other action

A paleolithic approach to this problem would be to pick an action at random and execute it. Eventually the passenger would get picked up and then dropped off at the correct location.

In [None]:
state = env.reset()
epochs = 0
penalty, reward = 0, 0  # penalty counts illegal pickup/dropoff actions (reward of -10)
frames = []
done = # Find out the role of 'done' and complete the statement for its initial condition 
while #insert condition:
    '''
      Enter your code here
      The code must pick an action from the action space at random, execute it and update 'penalty' accordingly
    '''
    frames.append({'frame': env.render(mode='ansi' ), 'state': state, 'action': action, 'reward': reward})
    epochs += 1
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalty))

Run the following cell to visualise the performance of the agent. It won't come as a surprise to see that the approach is quite bad: the agent has no memory of the past and hence learns nothing.

In [None]:
from IPython.display import clear_output
from time import sleep
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait = True)
        print(frame['frame'])
        print(f"Timestep: {i+1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(0.1)
print_frames(frames)

# `Enter Q-learning`

A popular technique used in Reinforcement Learning is to have the agent maintain a table of estimates, called the Q-table, of the reward it expects to gain from taking each action in each state. The table is updated after each step based on the reward obtained. When exploiting what it has learned, the agent picks the action with the highest estimated value in its current state. That is, if a particular (state, action) pair resulted in a good reward, the agent is better off repeating that action whenever that state is reached.
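
As a rough illustration of what a Q-table looks like, the sketch below stores one row per state and one column per action, and picks the best-looking action for a given state. The sizes and the values in the table are made up, and `greedy_action` is a hypothetical helper name.

```python
# Hypothetical sizes: 3 states, 2 actions. Taxi-v3 would need 500 x 6.
n_states, n_actions = 3, 2

# One row per state, one column per action, all estimates start at zero.
q_table = [[0.0] * n_actions for _ in range(n_states)]

# Pretend the agent has already learned something about state 1.
q_table[1] = [0.2, 0.7]

def greedy_action(q_table, state):
    """Return the index of the action with the highest estimate in `state`."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)

print(greedy_action(q_table, 1))  # 1, because 0.7 > 0.2
print(greedy_action(q_table, 0))  # 0, ties go to the first action
```

Note that always taking the greedy action means the agent never tries anything new, which is exactly the issue Task 1 asks you to think about.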



## **TASK 1:**

Find out the update rule for Q-learning and implement Q-learning for the Taxi-v3 problem. Also, find out what 'exploration versus exploitation' is and use a suitable way to balance exploration and exploitation.

You will need to set a few hyperparameters: learning rate ($\alpha$), reward decay rate, also called the discount factor ($\gamma$), number of episodes and exploration probability ($\epsilon$). Obtain the performance characteristics of the agent (that is, the number of epochs per episode and the average number of penalties per episode) for ($\alpha$, $\gamma$, episodes, $\epsilon$) = (0.6, 0.9, 1000, 0.4).
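
To show where each hyperparameter enters, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration on a made-up six-cell corridor, not on Taxi-v3. The corridor, its rewards and the `step`/`train` helpers are assumptions for illustration only. The update used is the standard one, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, and the hyperparameter values are the ones given above.

```python
import random

ALPHA, GAMMA, EPSILON, EPISODES = 0.6, 0.9, 0.4, 1000
N_STATES, ACTIONS = 6, (0, 1)  # hypothetical corridor; 0 = left, 1 = right

def step(state, action):
    """Toy dynamics: reaching the right end pays +20, every other move costs -1."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (20.0 if done else -1.0), done

def train(seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(EPISODES):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)                      # explore
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])  # exploit
            next_state, reward, done = step(state, action)
            # No future value is added once the episode has ended.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q = train()
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)  # greedy action for each non-terminal cell
```

For Taxi-v3 the same loop structure applies, but the table needs one row per state and one column per action of that environment, and the transitions come from the environment instead of the toy `step` function.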

In [None]:
'''
  Q-LEARNING ON Taxi-v3:
  Enter Your Code Here
'''

## **TASK 2:**
Now that you have trained the agent on Taxi-v3, try Q-learning on the Maze environment. After training, obtain the following performance characteristics- number of epochs per episode and average number of wins. What can you do to improve the performance of the agent?
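
One possible way to compute these characteristics, assuming you record a `(steps, won)` pair for every evaluation episode, is sketched below. The `summarise` helper and the sample numbers are hypothetical.

```python
def summarise(episodes):
    """Return (average steps per episode, fraction of episodes won).

    `episodes` is a list of (steps, won) tuples, one per episode.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    avg_steps = sum(steps for steps, _ in episodes) / n
    win_rate = sum(1 for _, won in episodes if won) / n
    return avg_steps, win_rate

# Made-up results from four episodes: two wins, two falls into a hole.
print(summarise([(12, True), (3, False), (15, True), (6, False)]))  # (9.0, 0.5)
```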

### Maze
### This game is described using a grid like:
>   _ _ _ H _ _ _ H _ _<br>   _ _ _ _ _ H _ H _ _<br>   _ _ H _ H _ _ _ _ H<br>   H _ _ _ _ _ H _ _ _<br>   _ H _ H H _ _ _ H _<br>   _ _ _ H _ _ _ H _ _<br>   _ _ _ _ _ H H _ _ _<br>   _ _ H _ _ H _ _ _ H <br>   H _ _ _ _ _ H _ _ _ <br>   _ _ _ H _ _ _ H _ G
<br><br> _ : Safe path
<br> H : Hole, avoid falling
<br> G: Goal, target to reach
### Your goal is to reach G.
### The episode ends when you reach G or fall into a hole (H).

### You receive a reward of 1 if you reach G, and 0 otherwise.




### Do Not Edit This Code

In [None]:
import random
import sys
import copy
class maze:
  class action:
    def __init__(self):
      self.total_actions = 4
      self.__out=sys.stdout
    
    def random_action(self):
      act = random.randint(1,4)
      return act

    def show_actions(self):
      actions= "1->Up, 2->Right, 3->Down, 4->Left"
      self.__out.write(actions)
    
  class observation:
    def __init__(self):
      self.total_observations = 100
      self.dtype = type(self.total_observations)

    def random(self):
      obs = random.randint(1,99)
      return obs
  def __init__(self):
    self.observation_space = self.observation()
    self.__map=['___H___H__',
       '_____H_H__',
       '__H_H____H',
       'H_____H___',
       '_H_HH___H_',
       '___H___H__',
       '_____HH___',
       '__H__H___H',
       'H_____H___',
       '___H___H_G']
    self.action_space = self.action()
    self.__x = None
    self.__y = None
    self.__state = None
    self.__out = sys.stdout
    self.__action = None
    self.__action_dict = {1:'Up',2:'Right',3:'Down',4:'Left'}
    self.__done = False

  def reset(self):
    self.__y = random.randint(0,9)
    while True:
      self.__x = random.randint(0,9)
      if self.__map[self.__y][self.__x]=='_':
        break
    self.current_state()
    self.__action = None
    self.__done = False
    return self.__state
  
  def current_state(self):
    if self.__y is not None:
     self.__state = self.__y*10+self.__x+1
    return self.__state

  def take_step(self,action):
    if self.__done == False :
      if action == 1:
        if self.__y-1>=0:
          self.__y-=1
        self.__action = action
      elif action == 3:
        if self.__y+1<=9:
          self.__y+=1
        self.__action = action
      elif action == 2:
        if self.__x+1<=9:
          self.__x+=1
        self.__action = action
      elif action == 4:
        if self.__x-1>=0:
          self.__x-=1
        self.__action = action
      else :
        self.__out.write("Enter a valid action.")
        return
      self.current_state()
      reward = 0.0
      if self.__map[self.__y][self.__x]=='G':
        reward=1.0
        self.__done= True
      if self.__map[self.__y][self.__x]=='H':
        self.__done = True
      return self.__state,reward,self.__done
    else :
      self.__out.write("\n\033[38;5;11mWARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True\033[0;0m")

  def show(self):
    map = copy.deepcopy(self.__map)
    val = map[self.__y][self.__x]
    map[self.__y] = map[self.__y][:self.__x] + 'P' +map[self.__y][self.__x+1:]
    map = self.__add_colour_h(map)
    map[-1]=map[-1].replace('G',"\033[38;5;12mG\033[0;0m")
    map[self.__y] = map[self.__y].replace('P',f'\033[48;5;9m{val}\033[0;0m')
    if self.__action is not None:
      self.__out.write('\n'+self.__action_dict[self.__action])
    for i in map:
      self.__out.write('\n'+i)
    if val =='H':
      self.__out.write("\nTRY AGAIN.......You fell in hole!!!")
    if val =='G':
      self.__out.write("\nGG!!")
    self.__out.write("\n")

  def __add_colour_h(self,map):
    for i in range(len(map)):
      map[i]=map[i].replace('H','\033[48;5;16mH\033[0;0m')
    return map

  def set_state(self,state):
    if state>100 or state<1:
      self.__out.write("Enter a valid state.")
      return
    self.__state = state
    self.__y = (state-1)//10
    self.__x = (state-1)%10
    if self.__map[self.__y][self.__x]=='_':
      self.__done = False
    else: 
      self.__done = True
    self.__action = None

### Environment methods and attributes

In [None]:
env = maze() #Creating object of maze class

In [None]:
print(env.observation_space.total_observations) #Total observations in observation space
print(env.observation_space.random()) # Random observation from observation space

100
40


In [None]:
print(env.action_space.total_actions) #Total actions in action space
print(env.action_space.random_action()) #Returns random action from action space
env.action_space.show_actions() #Prints details about actions in action space

4
1
1->Up, 2->Right, 3->Down, 4->Left

In [None]:
print(env.current_state()) # No state has been initialized yet

None


In [None]:
env.reset() # Initializes the game to a random state
env.show() #prints observation
print(env.current_state())


___H___H__
_____H_H__
__H_H____H
H_____H___
_H_HH___H_
___H___H__
_____HH___
__H__H___H
H_____H___
___H___H_G
79


In [None]:
env.set_state(90) #state of environment is changed to state specified
env.show() 
print(env.current_state())


___H___H__
_____H_H__
__H_H____H
H_____H___
_H_HH___H_
___H___H__
_____HH___
__H__H___H
H_____H___
___H___H_G
90
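
The state numbers printed above follow the rule used inside the `maze` class, `state = y*10 + x + 1`, so states run from 1 to 100, row by row. The sketch below converts between a state number and a 0-based (row, column) cell. The helper names are hypothetical and live outside the class, so the class itself stays unedited.

```python
GRID_WIDTH = 10  # the maze is a 10 x 10 grid

def state_to_cell(state):
    """Return the 0-based (row, col) for a Maze state in 1..100."""
    return divmod(state - 1, GRID_WIDTH)

def cell_to_state(row, col):
    """Inverse mapping, mirroring state = y*10 + x + 1 in the maze class."""
    return row * GRID_WIDTH + col + 1

print(state_to_cell(79))  # (7, 8)
print(state_to_cell(90))  # (8, 9)
```

Because states start at 1 and actions are numbered 1 to 4, a Q-table stored with 0-based indices needs an offset of 1 on both, for example `q[state - 1][action - 1]`.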


### `env.take_step()` returns only THREE values: state, reward and done (whether the episode has ended)

In [None]:
from IPython.display import clear_output
from time import sleep
env.reset()
done = False
while True:
  env.show()
  clear_output(wait=True)
  sleep(1)
  if done: 
    break
  action = env.action_space.random_action()
  state,reward,done = env.take_step(action)


Left
___H___H__
_____H_H__
__H_H____H
H_____H___
_H_HH___H_
___H___H__
_____HH___
__H__H___H
H_____H___
___H___H_G
TRY AGAIN.......You fell in hole!!!


### Now you are familiar with the Maze environment. You have to implement Q-learning on this custom environment. Remember that you are not allowed to make any changes to the Maze class.

In [None]:
'''
  Q-LEARNING ON Maze:
  Enter Your Code Here
'''

## `Further Motivation`

In case you are curious about the mathematics of Reinforcement Learning, you can check out the following resources:

RL Lectures by Dr. David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

Medium Blog Post on RL techniques: https://medium.com/@jonathan_hui/rl-introduction-to-deep-reinforcement-learning-35c25e04c199

Deep Neural Networks (useful for Deep RL): http://cs231n.github.io/