<a href="https://colab.research.google.com/github/coryroyce/reinforcement_learning_open_ai_gym/blob/main/notebook/220306_CMPE_252_HW_4_Reinforcement_Learning_Cory_Randolph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning



Cory Randolph

3/7/2022

# Prompt

Implementing Q-Learning in OpenAI Gym (40p)
* A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
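
The need to build up momentum can be seen in a small stand-alone simulation. The sketch below does not use gym; it re-implements the car's update rule with constants that are assumed to match Gym's classic `MountainCar-v0` (force 0.001, gravity 0.0025, goal at position 0.5). The fixed start state and the two policies are hypothetical and only for illustration.

```python
import math

# Constants assumed to match Gym's classic MountainCar-v0 (not read from gym itself).
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5


def step(position, velocity, action):
    """Advance one time step; action is 0 (push left), 1 (no push) or 2 (push right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def reaches_goal(policy, max_steps=200):
    """Run one episode from a fixed, hypothetical start state near the valley floor."""
    position, velocity = -0.5, 0.0
    for _ in range(max_steps):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POS:
            return True
    return False


def always_right(position, velocity):
    return 2  # full throttle towards the goal


def follow_velocity(position, velocity):
    return 2 if velocity >= 0 else 0  # push in the direction the car is already moving


print("always push right reaches goal:", reaches_goal(always_right))
print("push with the motion reaches goal:", reaches_goal(follow_velocity))
```

Under these assumed constants, pushing right all the time should leave the car oscillating in the valley, while pushing in the direction of motion adds energy on every swing and should reach the flag within the 200-step limit.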


# Reinforcement Learning Overview

The image below, from [sadiakhaf](https://github.com/sadiakhaf/IEEE-Hands-On-RL-using-Python), shows the main components of reinforcement learning and answers the following questions:

*   What is an **environment** in RL? 
*   What is an **agent**?
*   What is *state*?
*   What is an *action*?
*   What is *reward*?

![picture](https://drive.google.com/uc?export=view&id=13oYKs5qWbpPekxMQN5ExG2kLo4ih4pKS)




Basic process overview of RL

1. RESET() the env and get the initial *state*
2. Give this state to the *agent* and wait for it to *act* --> ACT()
3. Give its *action* to the env and get the *reward* and the next state --> STEP()
4. Pass this reward to the *agent* so that it can learn from its behavior --> UPDATE(). Plot something if you need to.
5. Go to step 2
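
The five steps above can be sketched as a plain Python loop. The environment and agent here (`CoinFlipEnv`, `RandomAgent`) are hypothetical stand-ins invented for this illustration, not gym or keras-rl classes; only the order of the RESET / ACT / STEP / UPDATE calls is the point.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a hidden bit for 10 steps."""

    def reset(self):
        self.steps_left = 10
        self.state = random.randint(0, 1)
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.steps_left -= 1
        self.state = random.randint(0, 1)
        done = self.steps_left == 0
        return self.state, reward, done


class RandomAgent:
    """Hypothetical agent that acts randomly and only accumulates reward."""

    def __init__(self):
        self.total_reward = 0.0

    def act(self, state):
        return random.choice([0, 1])

    def update(self, state, action, reward, next_state):
        self.total_reward += reward  # placeholder for a real learning rule


def run_episode(env, agent):
    state = env.reset()                                  # 1. RESET
    done, steps = False, 0
    while not done:
        action = agent.act(state)                        # 2. ACT
        next_state, reward, done = env.step(action)      # 3. STEP
        agent.update(state, action, reward, next_state)  # 4. UPDATE
        state = next_state                               # 5. go to step 2
        steps += 1
    return steps


agent = RandomAgent()
print("steps:", run_episode(CoinFlipEnv(), agent), "reward:", agent.total_reward)
```

A learning agent would replace the body of `update` with a rule that changes how `act` behaves, which is what the DQN and Q-learning sections of this notebook do.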

# Imports

Install the needed packages

In [1]:
# Install required system dependencies
!apt-get install -y xvfb x11-utils

# Install required Python dependencies (additional gym extras may be needed depending on the environment)
!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

# Install ML libraries 
!pip install tensorflow #==2.3.0
!pip install gym
!pip install keras
!pip install keras-rl2


# Clear output for this cell
from IPython.display import clear_output
clear_output()

Set up the correct virtual display settings to run fully in Colab

In [2]:
import pyvirtualdisplay

_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
                                    size=(1400, 900))
_ = _display.start()

# Gym Environment

## Setup Gym

Import Gym

In [3]:
import gym 
import random

Load the environment and the needed components

In [4]:
# Load gym environment 
env_name = 'MountainCar-v0'
env = gym.make(env_name)

# Update the display to render the video as a file (then download from Colab)
video_every_n_episodes = 5
env = gym.wrappers.Monitor(env, "./video", video_callable=lambda episode_id: (episode_id%video_every_n_episodes)==0, force=True)

# Extract the states and actions
states = env.observation_space.shape[0]
actions = env.action_space.n

Generate some random actions just to verify that the environment is set up correctly

In [5]:
episodes = 10
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()
        # MountainCar-v0 has three actions: 0 = push left, 1 = no push, 2 = push right
        action = random.choice([0, 1, 2])
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))

Episode:1 Score:-200.0
Episode:2 Score:-200.0
Episode:3 Score:-200.0
Episode:4 Score:-200.0
Episode:5 Score:-200.0
Episode:6 Score:-200.0
Episode:7 Score:-200.0
Episode:8 Score:-200.0
Episode:9 Score:-200.0
Episode:10 Score:-200.0


# Build Reinforcement Learning Model

## Build RL Model

Import libraries

In [6]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Activation
from tensorflow.keras.optimizers import Adam

Define how the model should be built

In [7]:
# Set Random state
np.random.seed(3)
env.seed(3)

[3]

In [8]:
def build_model(states, actions):
    model = Sequential()
    model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
    model.add(Dense(100))
    model.add(Activation('relu'))
    model.add(Dense(100))
    model.add(Activation('relu'))
    model.add(Dense(actions, activation='linear'))
    return model

In [9]:
model = build_model(states, actions)

Display model architecture

In [10]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 2)                 0         
                                                                 
 dense (Dense)               (None, 100)               300       
                                                                 
 activation (Activation)     (None, 100)               0         
                                                                 
 dense_1 (Dense)             (None, 100)               10100     
                                                                 
 activation_1 (Activation)   (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 3)                 303       
                                                                 
Total params: 10,703
Trainable params: 10,703
Non-traina
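
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic for the layer sizes used here (2 inputs, two hidden layers of 100 units, 3 outputs); `dense_params` is a helper name made up for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [2, 100, 100, 3]  # observation size, two hidden layers, number of actions
counts = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # prints [300, 10100, 303] 10703
```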

## Build RL Agent

Use the Keras reinforcement learning library (keras-rl2) to create and train an RL model

Import Keras RL libraries

In [11]:
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Build the Agent

In [12]:
def build_agent(model, actions):
    policy = EpsGreedyQPolicy(eps=.1)
    memory = SequentialMemory(limit=50_000, window_length=1)
    dqn = DQNAgent(model=model, 
                   memory=memory, 
                   policy=policy, 
                   nb_actions=actions, 
                   nb_steps_warmup=10,
                   enable_dueling_network=True, # Dueling network can improve performance by splitting the value and advantage into 2 outputs
                   dueling_type='avg',
                   target_model_update=1e-2,)
    return dqn
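
Two of the settings above can be illustrated with plain NumPy. The sketch below shows what `dueling_type='avg'` and a `target_model_update` value below 1 are understood to mean: the Q-values are recombined as value plus mean-centered advantage, and the target network is moved a small fraction towards the online network after each update. This is a simplified sketch of the idea, not keras-rl's actual code, and the function names are hypothetical.

```python
import numpy as np


def dueling_avg(value, advantages):
    """Combine the two streams: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()


def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight array a fraction tau towards the online weights."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]


# One state with V(s) = -40 and three action advantages.
print(dueling_avg(-40.0, [1.0, -1.0, 3.0]))  # Q-values: -40, -42, -38

# A target network starting at 0 drifts slowly towards online weights of 1.
target = soft_update([np.zeros(2)], [np.ones(2)])
print(target[0])  # each entry is 0.01 after one update
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, and the small `tau` keeps the bootstrapping target stable between updates.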

Remove any old model before compiling

In [13]:
del model

In [14]:
model = build_model(states, actions)

In [15]:
dqn = build_agent(model, actions)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

  super(Adam, self).__init__(name, **kwargs)


## Fit Model

Fit the model by training for a number of steps.

Since fitting the model takes a while to run through the 200,000 steps, you can skip the training steps by loading the saved model weights in the code below.

In [16]:
nb_steps=200_000
dqn.fit(env, nb_steps=nb_steps, visualize=False, verbose=1) #nb_steps=500_000

Training for 200000 steps ...
Interval 1 (0 steps performed)
    5/10000 [..............................] - ETA: 2:08 - reward: -1.0000 

  updates=self.state_updates,


    9/10000 [..............................] - ETA: 2:08 - reward: -1.0000



50 episodes - episode_reward: -198.000 [-200.000, -144.000] - loss: 2.387 - mae: 20.056 - mean_q: -29.677

Interval 2 (10000 steps performed)
55 episodes - episode_reward: -180.636 [-200.000, -105.000] - loss: 5.305 - mae: 32.976 - mean_q: -48.674

Interval 3 (20000 steps performed)
64 episodes - episode_reward: -158.234 [-200.000, -88.000] - loss: 3.417 - mae: 30.763 - mean_q: -45.245

Interval 4 (30000 steps performed)
68 episodes - episode_reward: -146.088 [-200.000, -95.000] - loss: 2.402 - mae: 29.736 - mean_q: -43.736

Interval 5 (40000 steps performed)
66 episodes - episode_reward: -151.667 [-200.000, -112.000] - loss: 2.175 - mae: 30.200 - mean_q: -44.387

Interval 6 (50000 steps performed)
68 episodes - episode_reward: -146.882 [-186.000, -91.000] - loss: 1.297 - mae: 30.263 - mean_q: -44.460

Interval 7 (60000 steps performed)
70 episodes - episode_reward: -143.229 [-200.000, -88.000] - loss: 0.598 - mae: 29.899 - mean_q: -43.867

Interval 8 (70000 steps performed)
79 episode

<keras.callbacks.History at 0x7f85653d2cd0>

## Save and Load Weights

Save the weights of the trained model

In [17]:
# After training is done, we save the final weights.
dqn.save_weights(f'dqn_{nb_steps}_steps_{env_name}_weights.h5f', overwrite=True)

Delete the current model to verify that we are fully loading from the trained weights

In [18]:
del model
del dqn
del env

Set up the same model architecture as before so that the weights can be loaded back in.

In [19]:
env = gym.make(env_name)
# Record every episode (returning episode_id itself would skip episode 0, since 0 is falsy)
env = gym.wrappers.Monitor(env, "./video", video_callable=lambda episode_id: True, force=True)
actions = env.action_space.n
states = env.observation_space.shape[0]
model = build_model(states, actions)
dqn = build_agent(model, actions)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

  super(Adam, self).__init__(name, **kwargs)


In [20]:
# Load previously saved weights just to test
dqn.load_weights(f'dqn_{nb_steps}_steps_{env_name}_weights.h5f')

## Test Model

In [21]:
# Update the display to render the video as a file (then download from Colab)
del env
env = gym.make(env_name)
# Record every episode (returning episode_id itself would skip episode 0, since 0 is falsy)
env = gym.wrappers.Monitor(env, "./video", video_callable=lambda episode_id: True, force=True)

Evaluate our algorithm for a few episodes. 

Note, the video will be available to download in the video folder in Colab for these test runs.

In [22]:
dqn.test(env, nb_episodes= 5, visualize=True)

Testing for 5 episodes ...


  updates=self.state_updates,


Episode 1: reward: -104.000, steps: 104
Episode 2: reward: -103.000, steps: 103
Episode 3: reward: -86.000, steps: 86
Episode 4: reward: -84.000, steps: 84
Episode 5: reward: -85.000, steps: 85


<keras.callbacks.History at 0x7f8564314d50>

# Basic RL Model

This section implements basic tabular Q-learning on the MountainCar Gym environment without using any deep learning. It is included mainly as a reference: because its scores were not very good, I explored the deep learning model above, which performs better.

Code for this section is from [Genevieve Hayes](https://towardsdatascience.com/getting-started-with-reinforcement-learning-and-open-ai-gym-c289aca874f)
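
Before the full training function, a single step of tabular Q-learning can be sketched in isolation: pick an action epsilon-greedily, then move the table entry towards `reward + discount * max Q(next state)`. The helper names, table shape, and numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Exploit with probability 1 - epsilon, otherwise pick a random action."""
    if rng.random() < 1 - epsilon:
        return int(np.argmax(q_row))
    return int(rng.integers(0, len(q_row)))


def q_update(Q, state, action, reward, next_state, learning, discount):
    """Apply one Q-learning update in place; state and next_state are index tuples."""
    target = reward + discount * np.max(Q[next_state])
    Q[state + (action,)] += learning * (target - Q[state + (action,)])
    return Q


rng = np.random.default_rng(0)
Q = np.zeros((19, 15, 3))       # illustrative table: position bins x velocity bins x actions
Q[7, 8] = [0.0, 0.5, 0.2]       # pretend the next state already has some estimates

print(epsilon_greedy(Q[7, 8], epsilon=0.0, rng=rng))  # greedy choice: action 1
q_update(Q, (7, 7), 2, reward=-1.0, next_state=(7, 8), learning=0.2, discount=0.9)
print(Q[7, 7, 2])               # 0.2 * (-1 + 0.9 * 0.5 - 0), approximately -0.11
```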

Delete the previous environment

In [23]:
del env

In [24]:
import gym
import random

Import needed packages

In [25]:
import numpy as np
import gym
import matplotlib.pyplot as plt

In [26]:
# Import and initialize Mountain Car Environment
# Load gym environment 
env_name = 'MountainCar-v0'
env = gym.make(env_name)

In [27]:
# Define Q-learning function
def QLearning(env, learning, discount, epsilon, min_eps, episodes):
    # Determine size of discretized state space
    num_states = (env.observation_space.high - env.observation_space.low)*\
                    np.array([10, 100])
    num_states = np.round(num_states, 0).astype(int) + 1
    
    # Initialize Q table
    Q = np.random.uniform(low = -1, high = 1, 
                          size = (num_states[0], num_states[1], 
                                  env.action_space.n))
    
    # Initialize variables to track rewards
    reward_list = []
    ave_reward_list = []
    
    # Calculate episodic reduction in epsilon
    reduction = (epsilon - min_eps)/episodes
    
    # Run Q learning algorithm
    for i in range(episodes):
        # Initialize parameters
        done = False
        tot_reward, reward = 0,0
        state = env.reset()
        
        # Discretize state
        state_adj = (state - env.observation_space.low)*np.array([10, 100])
        state_adj = np.round(state_adj, 0).astype(int)
    
        while done != True:   
            # Optionally render the environment for the last 20 episodes
            # if i >= (episodes - 20):
            #   print(f'episode #: {i + 1}')
            #   env.render()
                
            # Determine next action - epsilon greedy strategy
            if np.random.random() < 1 - epsilon:
                action = np.argmax(Q[state_adj[0], state_adj[1]]) 
            else:
                action = np.random.randint(0, env.action_space.n)
                
            # Get next state and reward
            state2, reward, done, info = env.step(action) 
            
            # Discretize state2
            state2_adj = (state2 - env.observation_space.low)*np.array([10, 100])
            state2_adj = np.round(state2_adj, 0).astype(int)
            
            #Allow for terminal states
            if done and state2[0] >= 0.5:
                Q[state_adj[0], state_adj[1], action] = reward
                
            # Adjust Q value for current state
            else:
                delta = learning*(reward + 
                                 discount*np.max(Q[state2_adj[0], 
                                                   state2_adj[1]]) - 
                                 Q[state_adj[0], state_adj[1],action])
                Q[state_adj[0], state_adj[1],action] += delta
                                     
            # Update variables
            tot_reward += reward
            state_adj = state2_adj
        
        # Decay epsilon
        if epsilon > min_eps:
            epsilon -= reduction
        
        # Track rewards
        reward_list.append(tot_reward)
        
        # Every 100 episodes, record and print the average reward, then reset the window
        if (i+1) % 100 == 0:
            ave_reward = np.mean(reward_list)
            ave_reward_list.append(ave_reward)
            reward_list = []
            print('Episode {} Average Reward: {}'.format(i+1, ave_reward))
            
    env.close()
    
    return ave_reward_list

# Run Q-learning algorithm
rewards = QLearning(env, 0.2, 0.9, 0.8, 0, 5_000)

# Plot Rewards
plt.plot(100*(np.arange(len(rewards)) + 1), rewards)
plt.xlabel('Episodes')
plt.ylabel('Average Reward')
plt.title('Average Reward vs Episodes')
plt.savefig('rewards.jpg')     
plt.close()

Episode 100 Average Reward: -200.0
Episode 200 Average Reward: -200.0
Episode 300 Average Reward: -200.0
Episode 400 Average Reward: -200.0
Episode 500 Average Reward: -200.0
Episode 600 Average Reward: -200.0
Episode 700 Average Reward: -200.0
Episode 800 Average Reward: -200.0
Episode 900 Average Reward: -200.0
Episode 1000 Average Reward: -200.0
Episode 1100 Average Reward: -200.0
Episode 1200 Average Reward: -200.0
Episode 1300 Average Reward: -200.0
Episode 1400 Average Reward: -200.0
Episode 1500 Average Reward: -200.0
Episode 1600 Average Reward: -200.0
Episode 1700 Average Reward: -200.0
Episode 1800 Average Reward: -200.0
Episode 1900 Average Reward: -200.0
Episode 2000 Average Reward: -200.0
Episode 2100 Average Reward: -200.0
Episode 2200 Average Reward: -200.0
Episode 2300 Average Reward: -200.0
Episode 2400 Average Reward: -200.0
Episode 2500 Average Reward: -200.0
Episode 2600 Average Reward: -200.0
Episode 2700 Average Reward: -200.0
Episode 2800 Average Reward: -200.0
E
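
The discretization used by `QLearning` can be examined on its own. The sketch below hard-codes the MountainCar observation bounds (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) instead of reading them from `env.observation_space`, which is an assumption made so that the example runs without gym; `table_shape` and `discretize` are hypothetical helper names.

```python
import numpy as np

# Observation bounds assumed for MountainCar-v0: [position, velocity].
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
SCALE = np.array([10, 100])  # 10 bins per unit of position, 100 per unit of velocity


def table_shape(n_actions=3):
    """Size of the Q table: one cell per discretized (position, velocity, action)."""
    bins = np.round((HIGH - LOW) * SCALE, 0).astype(int) + 1
    return (int(bins[0]), int(bins[1]), n_actions)


def discretize(state):
    """Map a continuous observation to integer table indices."""
    idx = np.round((np.asarray(state) - LOW) * SCALE, 0).astype(int)
    return int(idx[0]), int(idx[1])


print(table_shape())             # (19, 15, 3)
print(discretize([-0.5, 0.0]))   # (7, 7)
print(discretize([0.6, 0.07]))   # (18, 14), the last cell along each axis
```

With only 19 x 15 cells, many distinct observations share a table entry, which is one plausible reason a tabular agent learns slowly on this task.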

# Reference

Reviewed Q-policy RL from [Genevieve Hayes](https://towardsdatascience.com/getting-started-with-reinforcement-learning-and-open-ai-gym-c289aca874f)

Colab install dependencies and video saving from [cwkx's video](https://www.youtube.com/watch?v=BNSwFURmaCA&ab_channel=cwkx)

RL Overview picture and comments from [sadiakhaf](https://github.com/sadiakhaf/IEEE-Hands-On-RL-using-Python)

Sample code for using Keras RL with Mountain Car from [aslamplr](https://github.com/aslamplr/mountaincar_gym)