## MountainCar-v0 with Q-learning



First, we import the relevant libraries.

In [90]:
import gym
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import seaborn as sns
import time

sns.set()

Then, we instantiate the environment using the code below.

In [39]:
env = gym.make('MountainCar-v0')

The problem has a continuous state space, and tabular Q-learning cannot be applied to it directly because the Q table needs a finite set of states. To solve the issue, the state space needs to be broken into discrete states.

To do so, we first obtain the high and low bounds of the observation space. Please note that each observation has two components, the position and the velocity:


In [98]:
print(f'Low bound of position state: {env.observation_space.low[0]}')
print(f'High bound of position state: {env.observation_space.high[0]}')
print(f'Low bound of velocity state: {env.observation_space.low[1]}')
print(f'High bound of velocity state: {env.observation_space.high[1]}')

Low bound of position state: -1.2000000476837158
High bound of position state: 0.6000000238418579
Low bound of velocity state: -0.07000000029802322
High bound of velocity state: 0.07000000029802322


The length (high bound minus low bound) of the position and velocity states:


In [99]:
print(f'Length of position state: {env.observation_space.high[0] - env.observation_space.low[0]}')
print(f'Length of velocity state: {env.observation_space.high[1] - env.observation_space.low[1]}')


Length of position state: 1.8000000715255737
Length of velocity state: 0.14000000059604645


We discretize the position and velocity states with factors of 20 and 200, respectively. Thus, at each time step, the observed position and velocity are multiplied by 20 and 200 and converted into integer values. (Note that the chosen values, i.e. 20 and 200, are arbitrary. Increasing them makes the steps finer, which enlarges the Q table and therefore the required memory and training time. Lower values reduce the required memory and training time but also diminish the precision of the model.)

For instance:

In [121]:
state = env.reset()
print(f'Discretized position state {int(20*state[0])}')
print(f'Discretized velocity state {int(200*state[1])}')

Discretized position state -9
Discretized velocity state 0


As you know, to solve a problem using the Q-learning algorithm we need to construct a state-action table that maps every state-action pair to its corresponding value.

Looking at the values above, the issue is that the discretized values can be negative, and negative values cannot serve as indices into the table (NumPy reads a negative index as counting from the end of an axis, so it would address the wrong cell or raise an error). So, we need to shift both states by a value that guarantees they are never negative. These shifts are the absolute values of the low bound of each state multiplied by the corresponding discretization factor. For instance:

In [122]:
print(f'Discretized position state {int(20*state[0]) + abs(int(20*env.observation_space.low[0]))}')
print(f'Discretized velocity state {int(200*state[1]) + abs(int(200*env.observation_space.low[1]))}')

Discretized position state 15
Discretized velocity state 14
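
The scale-and-shift steps above can be collected into one small helper. This is only a sketch: the name `discretize` and the constants `LOW`, `HIGH` and `FACTORS` are introduced here for illustration, and the bounds are copied from the printed observation-space limits rather than read from the environment.

```python
# Bounds copied from the printed observation-space limits (assumed constants).
LOW = (-1.2000000476837158, -0.07000000029802322)
HIGH = (0.6000000238418579, 0.07000000029802322)
FACTORS = (20, 200)  # discretization factors for position and velocity


def discretize(obs, low=LOW, factors=FACTORS):
    """Map a (position, velocity) pair to non-negative integer indices."""
    return tuple(
        int(f * x) + abs(int(f * lo))  # scale, truncate, then shift by |low|
        for x, lo, f in zip(obs, low, factors)
    )


print(discretize((-0.47, 0.0)))  # an observation near the starting region
print(discretize(LOW))           # smallest possible indices
print(discretize(HIGH))          # largest possible indices
```

With these bounds the smallest indices are `(0, 0)` and the largest are `(36, 28)`, so a table indexed this way needs 37 entries along the position axis and 29 along the velocity axis. Note also that `int()` truncates toward zero, so the raw values just below and just above zero fall into the same bin.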


ALL SET!

Now, it is time to specify the hyperparameters as follows:

$$
\begin{aligned}
\text{learning rate} &= 0.4 \\
\text{discount rate } (\gamma) &= 0.99 \\
\text{initial } \epsilon &= 0.5 \\
\epsilon \text{ decay} &= 0.90 \\
\text{minimum } \epsilon &= 0.01
\end{aligned}
$$
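
To get a feel for these values, the short sketch below counts how many decay steps it takes for epsilon to fall below its minimum. It assumes that epsilon is multiplied by the decay factor once per time step for as long as it exceeds the minimum, which is how the training loop below applies it; the variable `n_decays` is introduced here for illustration only.

```python
epsilon, min_epsilon, decay = 0.5, 0.01, 0.90

n_decays = 0
while epsilon > min_epsilon:   # decay only while epsilon is above the minimum
    epsilon *= decay
    n_decays += 1

print(n_decays, epsilon)
```

Under this assumption epsilon drops below 0.01 after 38 steps, so exploration is concentrated in the early part of each 200-step episode.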

Then, we are ready to write the class as follows:

In [116]:
class MountainCar():
    def __init__(self, env, n_epochs, lr=0.4, df=0.99, init_epsilon=0.5, 
                 min_epsilon = 0.01, decay_epsilon=0.90, dis_factor=20):
        
        """
        A class to train the MountainCar-v0 created by OpenAI.
        Args:
            env: Instantiated MountainCar-v0 environment
            n_epochs: Number of epochs to train the model
            lr: Learning rate, default 0.4
            df: discount factor or gamma, default 0.99
            init_epsilon: initial probability of exploration, default 0.5
            decay_epsilon: the factor by which the epsilon value decreases exponentially, default 0.90
            min_epsilon: the minimum likelihood of exploration, default 0.01
            dis_factor: The factor by which the Position state is discretized. This is ten times 
                        bigger for the Velocity state. default 20
        """
        self.env = env
        self.n_epochs = n_epochs
        self.lr = lr
        self.df = df
        self.init_epsilon = init_epsilon
        self.min_epsilon = min_epsilon
        self.decay_epsilon = decay_epsilon
        self.dis_factor = dis_factor
        
        
        
        self.upper_position = self.env.observation_space.high[0]     #obtaining high bound of position state
        self.lower_position = self.env.observation_space.low[0]      #obtaining low bound of position state
        self.upper_velocity = self.env.observation_space.high[1]     #obtaining high bound of velocity state
        self.lower_velocity = self.env.observation_space.low[1]      #obtaining low bound of velocity state
        
        self.shift_position = np.abs(int(self.lower_position * self.dis_factor))       #shifting the positions 
                                                                                       #to get non-negative values
        self.shift_velocity = np.abs(int(self.lower_velocity * self.dis_factor * 10))  #shifting the velocities 
                                                                                       #to get non-negative values
        self.n_state = self.env.observation_space.shape[0]          #number of state variables, position and velocity
        self.n_action = self.env.action_space.n         #number of possible actions: push left, no push, push right

        self.ave_reward_list = []              #list to store the average reward of every 100 episodes
        self.reward_list = []                  #list to store the total reward of each episode in the current 100
        
        #number of bins per state; the +1 keeps the largest index (reached at the high bound) inside the table
        self.n_state_position = int((self.upper_position - self.lower_position) * self.dis_factor) + 1
        self.n_state_velocity = int((self.upper_velocity - self.lower_velocity) * self.dis_factor * 10) + 1
        #initializing the Q table with random values between -0.5 and +0.5
        self.Q = np.random.uniform(-0.5, 0.5, size=(self.n_state_position, self.n_state_velocity, self.n_action))
        
        
        
    def train(self):
        for e in range(self.n_epochs):
            
            tot_reward = 0           #the total reward returned by the env during each episode          
            
            s = self.env.reset()     #the environment should be reset at the beginning of each episode
            self.epsilon = self.init_epsilon    #the probability of exploration should be reset at 
                                                #the beginning of each episode
            
            for t in range(200):
                
                if e % 100 == 0:
                    self.env.render()
                    
                #obtaining shifted position state, sp, and velocity state, sv.
                sp = int(s[0]*self.dis_factor) + self.shift_position
                sv = int(s[1]*self.dis_factor*10) + self.shift_velocity
                #choosing the next action based on e-greedy policy
                a = np.argmax(self.Q[sp, sv, :])
                if np.random.random() < self.epsilon:
                    a = self.env.action_space.sample()
                #decaying the epsilon if it is bigger than minimum epsilon
                if self.epsilon > self.min_epsilon:
                    self.epsilon *= self.decay_epsilon
                #the env step forward and returns next state, s_, reward, r and if the goal is hit
                s_, r, done, _ = self.env.step(a)
                sp_ = int(s_[0]*self.dis_factor) + self.shift_position
                sv_ = int(s_[1]*self.dis_factor*10) + self.shift_velocity
                #updating the Q table with the Q-learning rule (greedy value of the next state)
                self.Q[sp, sv, a] += self.lr * (r + self.df * np.max(self.Q[sp_, sv_, :]) - self.Q[sp, sv, a])
                
                
                tot_reward += r
        
                if done:
                    break
                #setting the next state as the current state for the next time step
                s = s_
            self.reward_list.append(tot_reward)
            if (e+1) % 100 == 0:
                ave_reward = np.mean(self.reward_list)
                self.ave_reward_list.append(ave_reward)
                self.reward_list = []
                print(f'episode {e} finished in {t} time steps and reward is {ave_reward}')
        self.env.close()
                
    def test(self):
        s = self.env.reset()
        for t in range(200):
            self.env.render()
            sp = int(s[0]*self.dis_factor) + self.shift_position
            sv = int(s[1]*self.dis_factor*10) + self.shift_velocity
            a = np.argmax(self.Q[sp, sv, :])
            s_, r, done, _ = self.env.step(a)
            s = s_
            if done:
                print(f'Finished in {t} time steps.')
                break
            
        self.env.close()
        
    def reset_Q(self):
        self.Q = np.random.uniform(-0.1, 0.1, size=(self.n_state_position, self.n_state_velocity, self.n_action))
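
The update applied at each step of `train` is meant to follow the standard Q-learning rule, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$. As a minimal sketch, the snippet below performs one such update by hand. The toy table, the state indices and the transition are made-up values for illustration, not data from the environment, and plain nested lists stand in for the NumPy array used in the class.

```python
lr, df = 0.4, 0.99           # learning rate and discount factor from above

# Hypothetical toy Q table: 2 states x 3 actions.
Q = [[0.0, 0.0, 0.0],
     [1.0, 2.0, 0.5]]

s, a, r, s_next = 0, 1, -1.0, 1      # one made-up transition

td_target = r + df * max(Q[s_next])  # r + gamma * max_a' Q(s', a')
td_error = td_target - Q[s][a]       # current estimate is subtracted outside the max
Q[s][a] += lr * td_error

print(round(Q[s][a], 3))
```

Here the target is -1 + 0.99 * 2.0 = 0.98, so the entry moves from 0.0 to 0.4 * 0.98 = 0.392. The maximum is taken over the next state's action values only, and the current estimate is subtracted afterwards.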
                

In [117]:
m = MountainCar(env, 10000)

In [118]:
m.train()

episode 99 finished in 199 time steps and reward is -200.0
episode 199 finished in 199 time steps and reward is -200.0
episode 299 finished in 199 time steps and reward is -200.0
episode 399 finished in 199 time steps and reward is -200.0
episode 499 finished in 199 time steps and reward is -199.67
episode 599 finished in 199 time steps and reward is -200.0
episode 699 finished in 199 time steps and reward is -200.0
episode 799 finished in 199 time steps and reward is -198.18
episode 899 finished in 199 time steps and reward is -196.83
episode 999 finished in 199 time steps and reward is -198.26
episode 1099 finished in 199 time steps and reward is -198.11
episode 1199 finished in 199 time steps and reward is -197.1
episode 1299 finished in 199 time steps and reward is -196.53
episode 1399 finished in 193 time steps and reward is -187.29
episode 1499 finished in 199 time steps and reward is -192.33
episode 1599 finished in 199 time steps and reward is -196.54
episode 1699 finished in 1

In [123]:
m.n_state

2