<a href="https://colab.research.google.com/github/amogh-code2021/Tutorials_1/blob/master/RL_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font size = 5>**Implementation based RL Assignment**</font>

I will be using the OpenAI Gym environment CartPole-v1 for this project. I will use the Q-learning algorithm (with an epsilon-greedy policy) to arrive at a final reward. Additionally, I will use Deep Q-Networks with different Q-learning policies to compare results. Unfortunately, I couldn't include the render, as rendering does not work on Google Colab.

We begin by installing and importing all the packages needed for the project.

In [None]:
!pip install keras
!pip install keras-rl2

In [1]:
import gym
import random
import numpy as np
import time # to get the time
import math # needed for calculations
from IPython.display import clear_output

In [2]:
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy, EpsGreedyQPolicy, GreedyQPolicy
from rl.memory import SequentialMemory

<font size = 4>**Setting up the CartPole environment**</font>

CartPole is one of the classic OpenAI Gym environments: a cart moves along a track while trying to keep a pole balanced vertically.
The only actions allowed are pushing the cart left or right (actions 0 and 1). The episode ends when the cart position goes beyond +/- 2.4, the pole angle exceeds +/- 12 degrees, or the episode length reaches 500 steps [2].
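
A minimal sketch of the termination rule, assuming the episode ends as soon as any one of the three conditions holds. This is not Gym's own code; `episode_over` and the constant names are hypothetical, and only the thresholds come from the description above.

```python
import math

# Thresholds quoted in the text above (assumed to match CartPole-v1).
X_LIMIT = 2.4                    # cart position limit
THETA_LIMIT = math.radians(12)   # pole angle limit, converted to radians
MAX_STEPS = 500                  # episode length cap

def episode_over(cart_position, pole_angle, steps_taken):
    """Return True if any of the three end conditions holds (hypothetical helper)."""
    out_of_track = abs(cart_position) > X_LIMIT
    pole_fell = abs(pole_angle) > THETA_LIMIT
    timed_out = steps_taken >= MAX_STEPS
    return out_of_track or pole_fell or timed_out

print(episode_over(0.0, 0.05, 120))  # upright, centred, mid-episode -> False
print(episode_over(2.5, 0.0, 120))   # cart left the track -> True
print(episode_over(0.0, 0.0, 500))   # reached the step cap -> True
```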

In [3]:
env_name = "CartPole-v1"
env = gym.make(env_name)

The `Observation` list sets the number of bins for each state variable and was chosen by hand. The first two variables (cart position, cart velocity) get fewer bins because they are less important than the other two (pole angle, pole velocity) [1].

`np_array_win_size` holds the bin width (the "step") for cart position, cart velocity, pole angle, and pole velocity, in that order [1].

In [4]:
Observation = [30, 30, 50, 50]
np_array_win_size = np.array([0.25, 0.25, 0.01, 0.1])


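#Convert a continuous state into a tuple of integer bin indices
def get_discrete_state(state):
    discrete_state = state / np_array_win_size + np.array([15, 10, 1, 10])
    #np.int was deprecated in NumPy 1.20; the built-in int behaves the same here
    return tuple(discrete_state.astype(int))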
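
To see what the discretisation does, here is a small self-contained sketch that repeats the bin widths and offsets from the cell above and applies them to two made-up observations. The sample states are hypothetical, and the conversion to plain Python `int` is only there to make the printed tuples easy to read.

```python
import numpy as np

# Bin widths and offsets copied from the notebook cell above.
np_array_win_size = np.array([0.25, 0.25, 0.01, 0.1])
offsets = np.array([15, 10, 1, 10])

def get_discrete_state(state):
    # Scale each feature by its bin width, shift by the offset, truncate to int.
    discrete_state = state / np_array_win_size + offsets
    return tuple(int(i) for i in discrete_state.astype(int))

# Hypothetical observations: [cart position, cart velocity, pole angle, pole velocity]
print(get_discrete_state(np.array([0.6, -1.1, 0.105, 0.55])))  # (17, 5, 11, 15)
print(get_discrete_state(np.array([0.0, 0.0, -0.055, 0.0])))   # (15, 10, -4, 10)
```

Note that the second example produces a negative index for the pole angle. NumPy counts negative indices from the end of an axis, so on a 50-bin axis the index -4 addresses slot 46 rather than raising an error.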

<font size = 4>**Q - Learning**</font>

The function below trains the Q-table. The default values are an initial epsilon of 1, a learning rate of 0.1, a discount rate of 0.95, and 70000 episodes.


In [5]:
q_table = np.random.uniform(low=0, high=1, size=(Observation + [env.action_space.n]))

def q_train_1(env, q_table, eps = 1.0, epsilon_decay_value = 0.99995, lr = 0.1, disc_rate = 0.95, episodes = 70000):
    
    #Setting up initial values
    total_reward = 0
    prior_reward = 0
    #The episode rewards starts from 0 before every episode loop
    episode_reward = 0

    #Running loop over all episodes
    for episode in range(episodes + 1):
    
        state = env.reset()
        discrete_state = get_discrete_state(state)
        done = False
        episode_reward = 0

        while not done:
            #Epsilon-greedy: exploit the Q-table with probability 1 - eps
            if np.random.random() > eps:
                action = np.argmax(q_table[discrete_state])
            else:
                #Random action (mostly taken initially, until the algorithm starts learning)
                action = np.random.randint(0, env.action_space.n)

            #Get new state
            new_state, reward, done, _ = env.step(action)
            
            episode_reward += reward

            new_discrete_state = get_discrete_state(new_state)

            if not done:

                #Updating Q table
                future_q = np.max(q_table[new_discrete_state])

                current_q = q_table[discrete_state + (action,)]

                new_q = (1 - lr) * current_q + lr * (reward + disc_rate * future_q)

                q_table[discrete_state + (action,)] = new_q

            discrete_state = new_discrete_state
        
        if eps > 0.05: #Decay epsilon only after episode 10000, and stop once it reaches 0.05
            if episode_reward > prior_reward and episode > 10000:
                eps = math.pow(epsilon_decay_value, episode - 10000)

        total_reward += episode_reward #Running total for the current 1000-episode window
        prior_reward = episode_reward

        if episode % 1000 == 0: #Every 1000 episodes, print the mean reward

            mean_reward = total_reward / 1000
            print("Episode: " + str(episode) + " Mean Reward: " + str(mean_reward))

            total_reward = 0
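
To make the update rule used in `q_train_1` concrete, here is a small worked example with made-up numbers. The values of `current_q`, `reward` and `future_q` are hypothetical, and `q_update` is a helper introduced only for illustration; the formula and the defaults for `lr` and `disc_rate` are the ones from the function above.

```python
def q_update(current_q, reward, future_q, lr=0.1, disc_rate=0.95):
    """One tabular Q-learning update, same formula as in q_train_1."""
    return (1 - lr) * current_q + lr * (reward + disc_rate * future_q)

# Hypothetical numbers for a single step.
new_q = q_update(current_q=0.5, reward=1.0, future_q=0.8)
print(round(new_q, 3))  # 0.626
```

With a learning rate of 0.1, the stored value moves only a tenth of the way from the old estimate (0.5) toward the new target (1.76).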

After training the model, we can see that the rewards initially barely increase (until episode 10000). Once the algorithm starts learning, the reward rises steadily from there on. After 50000 episodes a mean reward above 200 was obtained (the log reproduced below stops at episode 28000).
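
The flat stretch up to episode 10000 is consistent with the epsilon schedule in `q_train_1`: epsilon stays at 1.0 until then, so every action is random. The sketch below prints the value the decay branch would assign at a few episodes. `scheduled_epsilon` is a hypothetical helper, and it ignores the extra `episode_reward > prior_reward` condition, so the real epsilon can lag behind these values.

```python
import math

def scheduled_epsilon(episode, decay=0.99995, start_episode=10000):
    """Epsilon that q_train_1 assigns when its decay branch fires (hypothetical helper)."""
    if episode <= start_episode:
        return 1.0
    return math.pow(decay, episode - start_episode)

for episode in (10000, 20000, 40000, 70000):
    print(episode, round(scheduled_epsilon(episode), 3))
```

Under these assumptions epsilon only approaches the 0.05 floor near the end of the 70000 episodes, which would explain why the mean reward keeps climbing slowly rather than jumping.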

In [10]:
q_train_1(env,q_table)

Episode: 0 Mean Reward: 0.021
Episode: 1000 Mean Reward: 22.351
Episode: 2000 Mean Reward: 21.631
Episode: 3000 Mean Reward: 22.502
Episode: 4000 Mean Reward: 21.949
Episode: 5000 Mean Reward: 21.279
Episode: 6000 Mean Reward: 22.259
Episode: 7000 Mean Reward: 21.853
Episode: 8000 Mean Reward: 21.866
Episode: 9000 Mean Reward: 21.811
Episode: 10000 Mean Reward: 21.909
Episode: 11000 Mean Reward: 22.804
Episode: 12000 Mean Reward: 23.642
Episode: 13000 Mean Reward: 25.102
Episode: 14000 Mean Reward: 26.625
Episode: 15000 Mean Reward: 28.616
Episode: 16000 Mean Reward: 30.389
Episode: 17000 Mean Reward: 34.879
Episode: 18000 Mean Reward: 35.107
Episode: 19000 Mean Reward: 39.091
Episode: 20000 Mean Reward: 43.785
Episode: 21000 Mean Reward: 43.877
Episode: 22000 Mean Reward: 48.849
Episode: 23000 Mean Reward: 54.363
Episode: 24000 Mean Reward: 58.403
Episode: 25000 Mean Reward: 63.629
Episode: 26000 Mean Reward: 68.394
Episode: 27000 Mean Reward: 73.323
Episode: 28000 Mean Reward: 76.632

<font size = 4>**Deep Q learning Networks**</font>

Here I combine a neural network with several Q-learning policies. The neural network architecture stays the same for every policy [3]. For this project I picked the Boltzmann Q-policy, the epsilon-greedy Q-policy, and the greedy Q-policy.
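
The three policies differ only in how they turn Q-values into an action. The sketch below is a simplified NumPy version of each rule, not the keras-rl implementation; the function names, the `eps` and `tau` values, and the sample Q-values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(q_values):
    # Always pick the action with the highest Q-value.
    return int(np.argmax(q_values))

def eps_greedy(q_values, eps=0.1):
    # With probability eps explore uniformly, otherwise act greedily.
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return greedy(q_values)

def boltzmann_probs(q_values, tau=1.0):
    # Softmax over Q-values; subtracting the max keeps exp() numerically stable.
    z = (np.asarray(q_values, dtype=float) - np.max(q_values)) / tau
    e = np.exp(z)
    return e / e.sum()

def boltzmann(q_values, tau=1.0):
    # Sample an action in proportion to its softmax probability.
    return int(rng.choice(len(q_values), p=boltzmann_probs(q_values, tau)))

q = [1.0, 2.0]  # hypothetical Q-values for "push left" and "push right"
print(greedy(q))           # 1
print(boltzmann_probs(q))  # higher probability for action 1, but not 100%
print(boltzmann(q))        # a sampled action, 0 or 1
```

The greedy rule never explores, epsilon-greedy explores uniformly at random, and the Boltzmann rule explores more often among actions whose Q-values are close to the best one.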

Functions to build the neural network and the agent:

In [4]:
def build_model(states, actions):
    model = Sequential()
    model.add(Flatten(input_shape=(1,states)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

def build_agent(model, actions, policy = BoltzmannQPolicy()):
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, 
                  nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn
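
As far as I understand keras-rl, a `target_model_update` value below 1 (here `1e-2`) is treated as a soft-update rate: after each training step the target network's weights are nudged a small fraction toward the online network's weights. The sketch below shows that blending on plain NumPy arrays. `soft_update` is a hypothetical helper, not a keras-rl function.

```python
import numpy as np

def soft_update(target_weights, online_weights, tau=1e-2):
    """Blend each target weight array toward its online counterpart (hypothetical helper)."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]

# Toy weights: the target starts at 0, the online network sits at 1.
target = [np.zeros(3)]
online = [np.ones(3)]

target = soft_update(target, online)
print(target[0])  # each entry has moved 1% of the way from 0 toward 1
```

A small rate keeps the training targets stable, since the target network changes slowly even when the online network changes quickly.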


In [5]:
states = env.observation_space.shape[0]
actions = env.action_space.n

In [8]:
model = build_model(states, actions)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 4)                 0         
                                                                 
 dense (Dense)               (None, 24)                120       
                                                                 
 dense_1 (Dense)             (None, 24)                600       
                                                                 
 dense_2 (Dense)             (None, 2)                 50        
                                                                 
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
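
The parameter counts in the summary can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch, with `dense_params` as a hypothetical helper:

```python
def dense_params(n_in, n_out):
    # Each Dense layer has n_in * n_out weights plus n_out biases.
    return n_in * n_out + n_out

layer_sizes = [4, 24, 24, 2]  # states, two hidden layers, actions
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [120, 600, 50] 770
```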


Build and train the DQN. The number of training steps here is 50000.

In [10]:
dqn = build_agent(model, actions, policy= EpsGreedyQPolicy())
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)

Training for 50000 steps ...
Interval 1 (0 steps performed)


179 episodes - episode_reward: 55.508 [8.000, 155.000] - loss: 1.029 - mae: 14.423 - mean_q: 29.281

Interval 2 (10000 steps performed)
73 episodes - episode_reward: 136.384 [108.000, 172.000] - loss: 1.471 - mae: 26.463 - mean_q: 53.606

Interval 3 (20000 steps performed)
71 episodes - episode_reward: 142.296 [82.000, 241.000] - loss: 0.999 - mae: 27.570 - mean_q: 55.606

Interval 4 (30000 steps performed)
58 episodes - episode_reward: 170.397 [116.000, 281.000] - loss: 0.833 - mae: 29.640 - mean_q: 59.828

Interval 5 (40000 steps performed)
done, took 385.304 seconds



As the test results below show, the DQN performed significantly better in comparison, yielding high rewards after far fewer episodes. A likely reason is that tabular Q-learning updates only one table entry per step, whereas each DQN update adjusts network weights that are shared across all states, so what is learned for one state carries over to similar ones. This gives better performance in fewer episodes.

In [11]:
scores = dqn.test(env, nb_episodes=10, visualize=False)
print(np.mean(scores.history['episode_reward']))

Testing for 10 episodes ...
Episode 1: reward: 180.000, steps: 180
Episode 2: reward: 158.000, steps: 158
Episode 3: reward: 164.000, steps: 164
Episode 4: reward: 165.000, steps: 165
Episode 5: reward: 162.000, steps: 162
Episode 6: reward: 173.000, steps: 173
Episode 7: reward: 176.000, steps: 176
Episode 8: reward: 177.000, steps: 177
Episode 9: reward: 173.000, steps: 173
Episode 10: reward: 180.000, steps: 180
170.8


In [8]:
model_1 = build_model(states,actions)

In [9]:
dqn_1 = build_agent(model_1,actions, policy = BoltzmannQPolicy())
dqn_1.compile(Adam(learning_rate = 1e-3), metrics = ['mae'])
dqn_1.fit(env, nb_steps=50000, visualize=False, verbose=1)

Training for 50000 steps ...
Interval 1 (0 steps performed)


91 episodes - episode_reward: 108.044 [10.000, 400.000] - loss: 2.071 - mae: 19.197 - mean_q: 38.907

Interval 2 (10000 steps performed)
44 episodes - episode_reward: 227.614 [158.000, 312.000] - loss: 3.325 - mae: 40.677 - mean_q: 82.265

Interval 3 (20000 steps performed)
40 episodes - episode_reward: 248.550 [178.000, 500.000] - loss: 2.461 - mae: 43.881 - mean_q: 88.482

Interval 4 (30000 steps performed)
43 episodes - episode_reward: 235.674 [166.000, 325.000] - loss: 1.625 - mae: 41.662 - mean_q: 83.842

Interval 5 (40000 steps performed)
done, took 385.418 seconds



The Boltzmann Q-policy DQN performed quite well, giving much higher rewards than the epsilon-greedy Q-policy, with rewards of up to 500 in some episodes (which is the highest that can be achieved).

In [10]:
scores_1 = dqn_1.test(env, nb_episodes=10, visualize=False)
print(np.mean(scores_1.history['episode_reward']))

Testing for 10 episodes ...
Episode 1: reward: 275.000, steps: 275
Episode 2: reward: 240.000, steps: 240
Episode 3: reward: 253.000, steps: 253
Episode 4: reward: 256.000, steps: 256
Episode 5: reward: 239.000, steps: 239
Episode 6: reward: 307.000, steps: 307
Episode 7: reward: 500.000, steps: 500
Episode 8: reward: 500.000, steps: 500
Episode 9: reward: 211.000, steps: 211
Episode 10: reward: 500.000, steps: 500
328.1


In [6]:
model_2 = build_model(states,actions)
dqn_2 = build_agent(model_2,actions, policy = GreedyQPolicy())
dqn_2.compile(Adam(learning_rate = 1e-3), metrics = ['mae'])
dqn_2.fit(env, nb_steps=50000, visualize=False, verbose=1)

Training for 50000 steps ...
Interval 1 (0 steps performed)


326 episodes - episode_reward: 30.607 [8.000, 129.000] - loss: 1.022 - mae: 10.024 - mean_q: 20.765

Interval 2 (10000 steps performed)
84 episodes - episode_reward: 118.048 [107.000, 135.000] - loss: 2.294 - mae: 24.835 - mean_q: 51.208

Interval 3 (20000 steps performed)
82 episodes - episode_reward: 123.134 [109.000, 149.000] - loss: 1.284 - mae: 25.039 - mean_q: 51.244

Interval 4 (30000 steps performed)
79 episodes - episode_reward: 126.658 [104.000, 157.000] - loss: 0.716 - mae: 23.954 - mean_q: 48.681

Interval 5 (40000 steps performed)
done, took 387.630 seconds



With the greedy Q-policy we achieved decent results, but they were not as good as those of the other two DQN models.

In [7]:
scores_2 = dqn_2.test(env, nb_episodes=10, visualize=False)
print(np.mean(scores_2.history['episode_reward']))

Testing for 10 episodes ...
Episode 1: reward: 117.000, steps: 117
Episode 2: reward: 160.000, steps: 160
Episode 3: reward: 122.000, steps: 122
Episode 4: reward: 169.000, steps: 169
Episode 5: reward: 138.000, steps: 138
Episode 6: reward: 128.000, steps: 128
Episode 7: reward: 122.000, steps: 122
Episode 8: reward: 180.000, steps: 180
Episode 9: reward: 141.000, steps: 141
Episode 10: reward: 125.000, steps: 125
140.2
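
To compare the three policies side by side, the sketch below recomputes the mean test reward from the per-episode rewards printed in the logs above, and adds the standard deviation and maximum. The dictionary is just the logged numbers typed in by hand; with only 10 test episodes per policy, the comparison should be read as indicative rather than conclusive.

```python
from statistics import mean, stdev

# Test-episode rewards copied from the three test logs above.
test_rewards = {
    "EpsGreedyQPolicy": [180, 158, 164, 165, 162, 173, 176, 177, 173, 180],
    "BoltzmannQPolicy": [275, 240, 253, 256, 239, 307, 500, 500, 211, 500],
    "GreedyQPolicy": [117, 160, 122, 169, 138, 128, 122, 180, 141, 125],
}

for name, rewards in test_rewards.items():
    print(f"{name}: mean={mean(rewards):.1f} std={stdev(rewards):.1f} max={max(rewards)}")
```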


<font size = 4>**References**</font>

1. https://medium.com/swlh/using-q-learning-for-openais-cartpole-v1-4a216ef237df

2. https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

3. https://www.youtube.com/watch?v=cO5g5qLrLSo&t=536s

<font size = 4>**Additional Resources**</font>

1. https://towardsdatascience.com/deep-q-learning-for-the-cartpole-44d761085c2f

2. https://www.youtube.com/watch?v=mlTDpdh24qs