# Solving OpenAI Gym MountainCar-v0 problem

In this reinforcement learning notebook, we train Q-learning agents, including a deep Q-network (DQN), on the Mountain Car environment. The environment is available through OpenAI Gym, an open-source reinforcement learning library.

The objective of the game is to drive the car up the right-hand hill to reach the flag. There is one problem, however: the car's engine is not powerful enough to drive straight up the hill. Instead, the car (the agent) must learn to drive partway up one hill, accelerate back down and up the other side, and repeat until it has built up enough momentum to reach the top.


## Q-Learning: OpenAI gym Mountain Car

Example using the OpenAI Gym Mountain Car environment.

#### Description

Get an underpowered car to the top of a hill (top = position 0.5).

#### Observation

Num | Observation  | Min  | Max  
----|--------------|------|----   
0   | position     | -1.2 | 0.6
1   | velocity     | -0.07| 0.07

#### Actions

Num | Action|
----|-------------|
0   | push left   |
1   | no push     |
2   | push right  |



#### Reward
-1 for each time step, until the goal position of 0.5 is reached. There is no penalty for climbing the left hill; once reached, its top acts as a wall.

#### Episode Termination
The episode ends when the car reaches position 0.5, or after 200 time steps.
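
To make the momentum argument concrete, here is a minimal sketch that re-implements the car's update rule in plain Python and compares two hand-written policies. The constants `FORCE = 0.001` and `GRAVITY = 0.0025` and the update formula reflect our understanding of Gym's classic-control implementation and should be treated as assumptions; the helper names (`step`, `simulate`, `always_right`, `follow_velocity`) are hypothetical and not part of Gym.

```python
import math

# Assumed constants, mirroring Gym's classic-control MountainCar as we understand it
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5
FORCE, GRAVITY = 0.001, 0.0025


def step(position, velocity, action):
    """Advance the car one time step; action is 0 (left), 1 (none) or 2 (right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left edge acts as a wall
    return position, velocity


def simulate(policy, start=-0.5, max_steps=200):
    """Run one episode from rest; return (reached_goal, steps_taken, max_position)."""
    position, velocity = start, 0.0
    max_position = position
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        max_position = max(max_position, position)
        if position >= GOAL_POS:
            return True, t, max_position
    return False, max_steps, max_position


def always_right(position, velocity):
    # Naive policy: full throttle towards the flag
    return 2


def follow_velocity(position, velocity):
    # Push in the direction the car is already moving to pump up momentum
    return 2 if velocity >= 0 else 0


print("always right   :", simulate(always_right))
print("follow velocity:", simulate(follow_velocity))
```

Under these assumed dynamics, pushing right from a standstill only rocks the car a short way up the slope, whereas pushing along the current velocity adds energy on every swing. That is the behaviour the learning agents in this notebook have to discover on their own.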

**Source**:
    - https://github.com/llSourcell/Q_Learning_Explained/blob/master/q_learning.py

In [1]:
import math
import random

import gym
from gym import wrappers
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

In [2]:
n_states = 40
iter_max = 10000

initial_lr = 1.0 # Learning rate
min_lr = 0.003
gamma = 1.0
t_max = 10000
eps = 0.02

In [3]:
def run_episode(env, policy=None, render=False):
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    for _ in range(t_max):
        if render:
            env.render()
        if policy is None:
            action = env.action_space.sample()
        else:
            a,b = obs_to_state(env, obs)
            action = policy[a][b]
        obs, reward, done, _ = env.step(action)
        total_reward += gamma ** step_idx * reward
        step_idx += 1
        if done:
            break
    return total_reward

In [4]:
def obs_to_state(env, obs):
    """ Maps an observation to state """
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    env_dx = (env_high - env_low) / n_states
    a = int((obs[0] - env_low[0])/env_dx[0])
    b = int((obs[1] - env_low[1])/env_dx[1])
    return a, b

In [4]:
env_name = 'MountainCar-v0'
env = gym.make(env_name)
env.seed(0)
np.random.seed(0)

In [5]:
print ('----- using Q Learning -----')
q_table = np.zeros((n_states, n_states, 3))
for i in range(iter_max):
    obs = env.reset()
    total_reward = 0
    ## eta: learning rate decays by a factor of 0.85 every 100 episodes, floored at min_lr
    eta = max(min_lr, initial_lr * (0.85 ** (i//100)))
    for j in range(t_max):
        a, b = obs_to_state(env, obs)
        if np.random.uniform(0, 1) < eps:
            action = np.random.choice(env.action_space.n)
        else:
            logits = q_table[a][b]
            logits_exp = np.exp(logits)
            probs = logits_exp / np.sum(logits_exp)
            action = np.random.choice(env.action_space.n, p=probs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        # update q table
        a_, b_ = obs_to_state(env, obs)
        q_table[a][b][action] = q_table[a][b][action] + eta * (reward + gamma *  np.max(q_table[a_][b_]) - q_table[a][b][action])
        if done:
            break
    if i % 100 == 0:
        print('Iteration #%d -- Total reward = %d.' %(i+1, total_reward))
solution_policy = np.argmax(q_table, axis=2)
solution_policy_scores = [run_episode(env, solution_policy, False) for _ in range(100)]
print("Average score of solution = ", np.mean(solution_policy_scores))

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
----- using Q Learning -----
Iteration #1 -- Total reward = -200.
Iteration #101 -- Total reward = -200.
Iteration #201 -- Total reward = -200.
Iteration #301 -- Total reward = -200.
Iteration #401 -- Total reward = -200.
Iteration #501 -- Total reward = -200.
Iteration #601 -- Total reward = -200.
Iteration #701 -- Total reward = -200.
Iteration #801 -- Total reward = -200.
Iteration #901 -- Total reward = -200.
Iteration #1001 -- Total reward = -200.
Iteration #1101 -- Total reward = -200.
Iteration #1201 -- Total reward = -200.
Iteration #1301 -- Total reward = -200.
Iteration #1401 -- Total reward = -200.
Iteration #1501 -- Total reward = -200.
Iteration #1601 -- Total reward = -200.
Iteration #1701 -- Total reward = -200.
Iteration #1801 -- Total reward = -200.
Iteration #1901 -- Total reward = -200.
Iteration #2001 -- Total reward = -200.
Iteration #2101 -- Total reward = -

In [6]:
# Animate it
run_episode(env, solution_policy, True)

-125.0

In [5]:
print(env.observation_space.high) # used to discretize the space later
print(env.observation_space.low)
print(env.action_space.n) # number of possible actions in the Gym environment

[0.6  0.07]
[-1.2  -0.07]
3


We will discretize the observation space (i.e., position and velocity) so that the Q-table has a reasonable size.

To do this, we need to convert continuous states into discrete states.
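
As a small illustration of the mapping from continuous observations to bin indices, the sketch below uses the bounds from the observation table above and 20 bins per dimension. The function name `discretize` is hypothetical, and clipping the top edge into the last bin is our own addition for illustration.

```python
import numpy as np

# Bounds taken from the observation table: (position, velocity)
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
BINS = 20  # assumed number of bins per dimension


def discretize(obs, low=LOW, high=HIGH, bins=BINS):
    """Map a continuous observation to a tuple of integer bin indices."""
    win_size = (high - low) / bins
    idx = ((np.asarray(obs) - low) / win_size).astype(int)
    # An observation exactly on the upper bound would land in bin `bins`,
    # which is out of range for the Q-table, so clip it into the last bin.
    return tuple(int(i) for i in np.clip(idx, 0, bins - 1))


print(discretize([-1.2, -0.07]))   # lower corner of the space
print(discretize([-0.5, 0.001]))   # near the bottom of the valley, almost at rest
print(discretize([0.6, 0.07]))     # upper corner, clipped into the last bin
```

Each bin covers 0.09 units of position and 0.007 units of velocity, so the resulting Q-table needs only 20 x 20 x 3 entries.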

In [6]:
# Discrete Observation Space (OS = observation space)
bins = 20
DISCRETE_OS_SIZE = [bins]*len(env.observation_space.high) # to be functional in every environment
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE
# window size (win)
print(discrete_os_win_size)

[0.09  0.007]


In [7]:
# 20 x 20 x 3 (3-dimensional table)
# one entry for every combination of discrete position, discrete velocity, and the 3 actions we can take
q_table = np.random.uniform(low=-2, high=0, 
                            size=(DISCRETE_OS_SIZE + [env.action_space.n]))
print(q_table.shape)

(20, 20, 3)


In [2]:
# Q-Learning settings
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000

In [11]:
def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    # np.int was removed from recent NumPy releases; the built-in int behaves identically here
    return tuple(discrete_state.astype(int))

In [12]:
discrete_state = get_discrete_state(env.reset()) # reset returns us just the initial state
print(discrete_state)

(8, 10)


In [13]:
# print out the arbitrary random starting values
q_table[discrete_state]

array([-1.79074422, -1.30304802, -0.51980495])

In [14]:
# to act greedily, take the index of the largest Q-value
np.argmax(q_table[discrete_state]) # action 2

2

## Greedy approach

We use only the max function to select the action with the highest Q-value, which results in pure exploitation: the agent is always greedy and never explores.
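
The training loop in this section applies the tabular Q-learning update after every step. As a worked example of that arithmetic, the sketch below wraps the update in a hypothetical helper `q_update` and applies it once, using the `LEARNING_RATE` and `DISCOUNT` values from the settings cell and made-up Q-values.

```python
LEARNING_RATE = 0.1
DISCOUNT = 0.95


def q_update(current_q, reward, max_future_q,
             learning_rate=LEARNING_RATE, discount=DISCOUNT):
    """Blend the old estimate with the one-step bootstrapped target."""
    target = reward + discount * max_future_q
    return (1 - learning_rate) * current_q + learning_rate * target


# Illustrative (made-up) values: old estimate -0.5, step reward -1,
# best Q-value in the next state -0.4
new_q = q_update(current_q=-0.5, reward=-1.0, max_future_q=-0.4)
print(new_q)
```

Note that a purely greedy agent only ever updates the action it selects, so the Q-values of actions it never tries keep their random initial values.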

In [1]:
SHOW_EVERY = 1000

for episode in range(EPISODES):
    
    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False
        
     # reset returns us just the initial state
    discrete_state = get_discrete_state(env.reset())
    
    done = False
    
    while not done: # discrete_state --> tuple
        action = np.argmax(q_table[discrete_state])
    
        new_state, reward, done, _ = env.step(action)
    
        # we will use this in our formulation of q-values
        new_discrete_state = get_discrete_state(new_state)
    
        if episode % SHOW_EVERY == 0:
            env.render()
        
        # If simulation did not end yet after last step - update Q table
        if not done:
            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])
            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action, )]
    
            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            # Update Q table with new Q value (Q-formula) 
            q_table[discrete_state + (action, )] = new_q
        
        # Simulation ended (for any reason) - if the goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            #q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action, )] = 0
            print(f"We made it on episode {episode}")
        discrete_state = new_discrete_state
    
env.close()


## Epsilon-greedy approach

By introducing an epsilon-greedy approach, we let the agent explore during each episode instead of always exploiting.

In [None]:
# Exploration settings
epsilon = 0.5  # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2 # stop decaying halfway through training (integer division)
epsilon_decay_value = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)

We decay epsilon at the end of every episode until the decay window is over. To use epsilon, we draw a random number between 0 and 1 with np.random.random(). If it is greater than epsilon, we pick the action with the highest Q-value as usual; otherwise, we take a random action:
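
The selection rule and the linear decay schedule can be sketched in isolation as follows. The helper names `choose_action` and `decayed_epsilon`, the seeded `default_rng` generator, and the floor at 0 are our own choices for illustration; the default arguments assume `EPISODES = 25000`, so the decay ends at episode 12500.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, our choice for reproducibility


def choose_action(q_values, epsilon, rng=rng):
    """Exploit with probability 1 - epsilon, otherwise pick a uniformly random action."""
    if rng.random() > epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(0, len(q_values)))


def decayed_epsilon(episode, start_eps=0.5, start=1, end=12500):
    """Epsilon in effect at the start of `episode` under linear decay, floored at 0."""
    decay_value = start_eps / (end - start)
    # Number of end-of-episode decrements applied before this episode begins
    steps_done = min(max(episode - start, 0), end - start + 1)
    return max(start_eps - decay_value * steps_done, 0.0)


q_values = np.array([-1.8, -1.3, -0.5])  # made-up Q-values for one state
print(choose_action(q_values, epsilon=0.0))  # with epsilon = 0 this picks the greedy action
print(decayed_epsilon(0), decayed_epsilon(6250), decayed_epsilon(25000))
```

Early episodes therefore act randomly about half of the time, while the second half of training is effectively greedy.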

In [None]:
# objective is to get the car to the flag.

for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False

    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False

    while not done:
        # Check whether we want to exploit or explore state/action space
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action (set/take a random action)
            action = np.random.randint(0, env.action_space.n)


        new_state, reward, done, _ = env.step(action)

        new_discrete_state = get_discrete_state(new_state)

        if episode % SHOW_EVERY == 0:
            env.render()
        #new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        # If simulation did not end yet after last step - update Q table
        if not done:

            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])

            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]

            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q


        # Simulation ended (for any reason) - if the goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            #q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state

    # Decaying is being done every episode if episode number is within decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value


env.close()

https://pythonprogramming.net/q-learning-algorithm-reinforcement-learning-python-tutorial/

### Boltzmann Exploration Policy

https://github.com/lukedottec/QLearnGrid/blob/master/main.py
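
Boltzmann (softmax) exploration picks each action with probability proportional to exp(Q / temperature). The following vectorized NumPy sketch shows the idea under our own assumptions: the helper names `boltzmann_probs` and `boltzmann_action` are hypothetical, the Q-values are made up, and subtracting the maximum before exponentiating is a standard trick to avoid overflow rather than something taken from the linked source.

```python
import numpy as np


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values scaled by temperature (must be > 0)."""
    scaled = np.asarray(q_values, dtype=float) / temperature
    scaled -= scaled.max()  # shifting by the max avoids overflow in exp
    weights = np.exp(scaled)
    return weights / weights.sum()


def boltzmann_action(q_values, temperature, rng):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
q_values = [-1.8, -1.3, -0.5]  # made-up Q-values for one state
for temperature in (10, 1, 0.1):
    print(temperature, np.round(boltzmann_probs(q_values, temperature), 3))
print("sampled action:", boltzmann_action(q_values, 1.0, rng))
```

A high temperature flattens the distribution towards uniform random choice, while a temperature close to 0 concentrates almost all probability on the greedy action.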

In [None]:
# objective is to get the car to the flag.
temperature = 10 #[2000, 1000, 100, 10, 1]

for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False

    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False

    while not done:
        moves = q_table[discrete_state]

        # Guard against division by zero when the temperature is exactly 0
        if temperature > 0:
            # Compute action probabilities using temperature; when the
            # temperature is high, actions with very different Q-values
            # become almost equally likely to be chosen
            action_probs_numes = []
            denom = 0
            for m in moves:
                val = math.exp(m / temperature)
                action_probs_numes.append(val)
                denom += val
            action_probs = [x / denom for x in action_probs_numes]

            # Sample an action: higher-probability actions are more likely
            # to be chosen, but this is not guaranteed
            rand_val = random.uniform(0, 1)
            prob_sum = 0
            # Fallback in case rounding leaves prob_sum just below rand_val
            action = len(action_probs) - 1
            for i, prob in enumerate(action_probs):
                prob_sum += prob
                if rand_val <= prob_sum:
                    action = i
                    break
        else:
            # Temperature of 0: we are totally cold, so we only exploit
            action = np.argmax(moves)

        new_state, reward, done, _ = env.step(action)

        new_discrete_state = get_discrete_state(new_state)

        if episode % SHOW_EVERY == 0:
            env.render()
        #new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        # If simulation did not end yet after last step - update Q table
        if not done:

            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])

            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]

            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q


        # Simulation ended (for any reason) - if the goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            #q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state


env.close()

## Deep Q-Learning Network (DQN) approach

In [3]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

import random
import gym
import numpy as np
from collections import deque

import os # for creating directories

env.seed(0)
np.random.seed(0)

In [4]:
# initialise environment to version 0 of the Cart Pole problem (note: this section uses CartPole, not MountainCar)
env = gym.make('CartPole-v0') 

In [5]:
# Obtain state and action size
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
print(state_size)
print(action_size)

4
2


In [11]:
# Set parameters
batch_size = 32

n_episodes = 200 # n games we want agent to play (default 1001)
output_dir = 'model_output/cartpole/'

In [12]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

In [16]:
# Define DQN agent
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000) # double-ended queue; acts like list, but elements can be added/removed from either end
        self.gamma = 0.95 # decay or discount rate: enables agent to take into account future actions in addition to the immediate ones, but discounted at this rate
        self.epsilon = 1.0 # exploration rate: how much to act randomly; more initially than later due to epsilon decay
        self.epsilon_decay = 0.995 # decrease number of random explorations as the agent's performance (hopefully) improves over time
        self.epsilon_min = 0.01 # minimum amount of random exploration permitted
        self.learning_rate = 0.001 # rate at which NN adjusts models parameters via SGD to reduce cost 
        self.model = self.build_model() # private method 
    
    def build_model(self):
        # neural net to approximate Q-value function:
        model = Sequential()
        model.add(Dense(units=24, input_dim=self.state_size, activation='relu')) # 1st hidden layer; states as input
        model.add(Dense(units=24, activation='relu')) # 2nd hidden layer
        model.add(Dense(units=self.action_size, activation='linear')) # 2 actions, so 2 output neurons: 0 and 1 (L/R)
        model.compile(loss='mse',
                      optimizer=Adam(lr=self.learning_rate))
        return model
    
    def remember(self, state, action, reward, next_state, done):
        # list of previous experiences, enabling re-training later
        self.memory.append((state, action, reward, next_state, done)) 

    def act(self, state):
        if np.random.rand() <= self.epsilon: # if acting randomly, take random action
            return random.randrange(self.action_size)
        
        # if not acting randomly, predict reward value based on current state
        act_values = self.model.predict(state) 
         # pick the action that will give the highest reward (i.e., go left or right?)
        return np.argmax(act_values[0])

    def replay(self, batch_size): # method that trains NN with experiences sampled from memory
        minibatch = random.sample(self.memory, batch_size) # sample a minibatch from memory
        
        for state, action, reward, next_state, done in minibatch: # extract data for each minibatch sample
            target = reward # if done (boolean whether game ended or not, i.e., whether final state or not), then target = reward
            
            if not done: # if not done, then predict future discounted reward
                target = (reward + self.gamma * # (target) = reward + (discount rate gamma) * 
                          np.amax(self.model.predict(next_state)[0])) # (maximum target Q based on future action a')
                
            target_f = self.model.predict(state) # approximately map current state to future discounted reward
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0) # single epoch of training with x=state, y=target_f; fit decreases loss between target_f and y_hat
            
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)

### Interact with environment

In [17]:
# initialise agent
agent = DQNAgent(state_size, action_size) 

In [19]:
done = False
for e in range(n_episodes): # iterate over new episodes of the game
    state = env.reset() # reset state at start of each new episode of the game
    state = np.reshape(state, [1, state_size])
    
    for time in range(5000):  # time represents a frame of the game; goal is to keep pole upright as long as possible up to range, e.g., 500 or 5000 timesteps
        env.render() # comment out for faster training
        
        action = agent.act(state) # action is either 0 or 1 (move cart left or right); decide on one or other here
        next_state, reward, done, _ = env.step(action) # agent interacts with env, gets feedback; 4 state data points, e.g., pole angle, cart position        
        reward = reward if not done else -10 # reward +1 for each additional frame with pole upright        
        next_state = np.reshape(next_state, [1, state_size])
        
        agent.remember(state, action, reward, next_state, done) # remember the previous timestep's state, actions, reward, etc.        
        
        state = next_state # set "current state" for upcoming iteration to the current next state        
        
        if done: # episode ends if agent drops pole or the environment's step limit is reached (200 for CartPole-v0)
            print("episode: {}/{}, score: {}, e: {:.2}" # print the episode's score and agent's epsilon
                  .format(e, n_episodes, time, agent.epsilon))
            break # exit loop
            
    if len(agent.memory) > batch_size:
        agent.replay(batch_size) # train the agent by replaying the experiences of the episode
    if e % 50 == 0:
        agent.save(output_dir + "weights_" + '{:04d}'.format(e) + ".hdf5")

episode: 0/200, score: 97, e: 0.37
episode: 1/200, score: 136, e: 0.37
episode: 2/200, score: 199, e: 0.37
episode: 3/200, score: 184, e: 0.36
episode: 4/200, score: 33, e: 0.36
episode: 5/200, score: 86, e: 0.36
episode: 6/200, score: 56, e: 0.36
episode: 7/200, score: 113, e: 0.36
episode: 8/200, score: 62, e: 0.35
episode: 9/200, score: 174, e: 0.35
episode: 10/200, score: 137, e: 0.35
episode: 11/200, score: 113, e: 0.35
episode: 12/200, score: 70, e: 0.35
episode: 13/200, score: 59, e: 0.35
episode: 14/200, score: 42, e: 0.34
episode: 15/200, score: 94, e: 0.34
episode: 16/200, score: 199, e: 0.34
episode: 17/200, score: 29, e: 0.34
episode: 18/200, score: 111, e: 0.34
episode: 19/200, score: 155, e: 0.34
episode: 20/200, score: 143, e: 0.33
episode: 21/200, score: 199, e: 0.33
episode: 22/200, score: 56, e: 0.33
episode: 23/200, score: 93, e: 0.33
episode: 24/200, score: 90, e: 0.33
episode: 25/200, score: 160, e: 0.33
episode: 26/200, score: 49, e: 0.32
episode: 27/200, score: 1

Saved agents can be loaded with agent.load("./path/filename.hdf5")

In [None]:
# metric to track over time
ep_rewards = []
aggr_ep_rewards = {'ep': [], 'avg': [], 'min': [], 'max': []}
# Note: 'avg' is the mean reward over the most recent window of SHOW_EVERY episodes

In [None]:
# objective is to get the car to the flag.

for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False
    episode_reward = 0

    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False

    while not done:
        # Check whether we want to exploit or explore state/action space
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action (set/take a random action)
            action = np.random.randint(0, env.action_space.n)


        new_state, reward, done, _ = env.step(action)

        episode_reward += reward
        
        new_discrete_state = get_discrete_state(new_state)

        if episode % SHOW_EVERY == 0:
            env.render()
        #new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        # If simulation did not end yet after last step - update Q table
        if not done:

            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])

            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]

            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q


        # Simulation ended (for any reason) - if the goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            #q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state

    # Decaying is being done every episode if episode number is within decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

    ep_rewards.append(episode_reward)
    
    if not episode % SHOW_EVERY:
        average_reward = sum(ep_rewards[-SHOW_EVERY:]) / len(ep_rewards[-SHOW_EVERY:])
        aggr_ep_rewards['ep'].append(episode)
        aggr_ep_rewards['avg'].append(average_reward)
        aggr_ep_rewards['min'].append(min(ep_rewards[-SHOW_EVERY:]))
        aggr_ep_rewards['max'].append(max(ep_rewards[-SHOW_EVERY:]))
        
        print(f'Episode: {episode} avg: {average_reward} min: {min(ep_rewards[-SHOW_EVERY:])} '
              f'max: {max(ep_rewards[-SHOW_EVERY:])}')

env.close()

In [None]:
plt.plot(aggr_ep_rewards['ep'], aggr_ep_rewards['avg'], label='avg')
plt.plot(aggr_ep_rewards['ep'], aggr_ep_rewards['min'], label='min')
plt.plot(aggr_ep_rewards['ep'], aggr_ep_rewards['max'], label='max')
plt.legend(loc=4)
plt.show()

## Q-Learning method

In [None]:
class GameRunner:
  def __init__(self, sess, model, env, memory, max_eps, min_eps,
                 decay, render=True):
        self._sess  = sess
        self._env   = env
        self._model = model
        self._memory  = memory
        self._render  = render
        self._max_eps = max_eps
        self._min_eps = min_eps
        self._decay   = decay
        self._eps   = self._max_eps
        self._steps = 0
        self._reward_store = []
        self._max_x_store  = []
        
  def run(self):
      state = self._env.reset()
      tot_reward = 0
      max_x = -100

      while True:
        #if self._render:      
          #self._env.render()
          
        action = self._choose_action(state)        
        next_state, reward, done, info = self._env.step(action)
        
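        # reward shaping: check the highest threshold first, otherwise the
        # larger bonuses can never be reached
        if next_state[0] >= 0.5:
          reward += 100
        elif next_state[0] >= 0.25:
          reward += 20
        elif next_state[0] >= 0.1:
          reward += 10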

        if next_state[0] > max_x:
          max_x = next_state[0]
        # is the game complete? If so, set the next state to
        # None for storage sake
        if done:
          next_state = None

        self._memory.add_sample((state, action, reward, next_state))
        self._replay()

        # exponentially decay the eps value
        self._steps += 1
        self._eps = self._min_eps + (self._max_eps - self._min_eps) \
                                * math.exp(-self._decay * self._steps)

        # move the agent to the next state and accumulate the reward
        state = next_state
        tot_reward += reward

        # if the game is done, break the loop
        if done:
          self._reward_store.append(tot_reward)
          self._max_x_store.append(max_x)
          break
        
      print("Step {}, Total reward: {}, Eps: {}".format(self._steps, tot_reward, self._eps))

  def _choose_action(self, state):
    if random.random() < self._eps:
        return random.randint(0, self._model._num_actions - 1)
    else:
        return np.argmax(self._model.predict_one(state, self._sess))
      
  def _replay(self):
    GAMMA=0.99 #added
    batch = self._memory.sample(self._model._batch_size)
    states = np.array([val[0] for val in batch])
    next_states = np.array([(np.zeros(self._model._num_states)
                             if val[3] is None else val[3]) for val in batch])
    # predict Q(s,a) given the batch of states
    q_s_a = self._model.predict_batch(states, self._sess)
    # predict Q(s',a') - so that we can do gamma * max(Q(s'a')) below
    q_s_a_d = self._model.predict_batch(next_states, self._sess)
    # setup training arrays
    x = np.zeros((len(batch), self._model._num_states))
    y = np.zeros((len(batch), self._model._num_actions))
    for i, b in enumerate(batch):
        state, action, reward, next_state = b[0], b[1], b[2], b[3]
        # get the current q values for all actions in state
        current_q = q_s_a[i]
        # update the q value for action
        if next_state is None:
            # in this case, the game completed after action, so there is no max Q(s',a')
            # prediction possible
            current_q[action] = reward
        else:
            current_q[action] = reward + GAMMA * np.amax(q_s_a_d[i])
        x[i] = state
        y[i] = current_q
    self._model.train_batch(self._sess, x, y)

In [None]:
class Model:
    def __init__(self, num_states, num_actions, batch_size):
        self._num_states = num_states
        self._num_actions = num_actions
        self._batch_size = batch_size
        # define the placeholders
        self._states = None
        self._actions = None
        # the output operations
        self._logits = None
        self._optimizer = None
        self._var_init = None
        # now setup the model
        self._define_model()

    def _define_model(self):
        self._states = tf.placeholder(shape=[None, self._num_states], dtype=tf.float32)
        self._q_s_a = tf.placeholder(shape=[None, self._num_actions], dtype=tf.float32)
        # create a couple of fully connected hidden layers
        fc1 = tf.layers.dense(self._states, 50, activation=tf.nn.relu)
        fc2 = tf.layers.dense(fc1, 50, activation=tf.nn.relu)
        self._logits = tf.layers.dense(fc2, self._num_actions)
        loss = tf.losses.mean_squared_error(self._q_s_a, self._logits)
        self._optimizer = tf.train.AdamOptimizer().minimize(loss)
        self._var_init = tf.global_variables_initializer()
        
    def predict_one(self, state, sess):
      return sess.run(self._logits, feed_dict={self._states:
                                               state.reshape(1, self._num_states)})

    def predict_batch(self, states, sess):
      return sess.run(self._logits, feed_dict={self._states: states})

    def train_batch(self, sess, x_batch, y_batch):
      sess.run(self._optimizer, feed_dict={self._states: x_batch, self._q_s_a: y_batch})

In [None]:
class Memory:
  def __init__(self, max_memory):
    self._max_memory = max_memory
    self._samples = []

  def add_sample(self, sample):
    self._samples.append(sample)
    if len(self._samples) > self._max_memory:
      self._samples.pop(0)

  def sample(self, no_samples):
    if no_samples > len(self._samples):
      return random.sample(self._samples, len(self._samples))
    else:
      return random.sample(self._samples, no_samples)    

In [None]:
# sets up the environment and run multiple games to perform the learning
env_name = 'MountainCar-v0'
env = gym.make(env_name)

In [None]:
# number of state dimensions and number of actions
num_states  = env.env.observation_space.shape[0]
num_actions = env.env.action_space.n

In [None]:
# build RL model
BATCH_SIZE=50
MAX_EPSILON=0.9
MIN_EPSILON=0.1
LAMBDA=0.2

model = Model(num_states, num_actions, BATCH_SIZE)
mem = Memory(50000)

with tf.Session() as sess:
  sess.run(model._var_init)
  gr = GameRunner(sess, model, env, mem, 
                  MAX_EPSILON, MIN_EPSILON, LAMBDA)
  num_episodes = 300
  cnt = 0
  while cnt < num_episodes:
    if cnt % 10 == 0:
      print('Episode {} of {}'.format(cnt+1, num_episodes))
    gr.run()
    cnt += 1
  plt.plot(gr._reward_store)
  plt.show()
  plt.close("all")
  plt.plot(gr._max_x_store)
  plt.show()

## Basic implementation

In [None]:
# This code creates a virtual display to draw game images on. 
# If you are running locally, just ignore it
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

In [None]:
import gym
env = gym.make("MountainCar-v0")

In [None]:
plt.imshow(env.render('rgb_array'))
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

In [None]:
obs0 = env.reset()
print("initial observation code:", obs0)

# Note: in MountainCar, observation is just two numbers: car position and velocity

In [None]:
print("taking action 2 (right)")
new_obs, reward, is_done, _ = env.step(2)

print("new observation code:", new_obs)
print("reward:", reward)
print("is game over?:", is_done)

# Note: as you can see, the car has moved to the right slightly (around 0.0005)

In [None]:
TIME_LIMIT = 250
env = gym.wrappers.TimeLimit(gym.envs.classic_control.MountainCarEnv(),
                             max_episode_steps=TIME_LIMIT + 1)

In [None]:
s = env.reset()
actions = {'left': 0, 'stop': 1, 'right': 2}

In [None]:
# prepare "display"
#%matplotlib notebook
fig = plt.figure()
ax = fig.add_subplot(111)
fig.show()

In [None]:
def policy(s, t):
    # YOUR CODE HERE
    if t>50 and t<100:
        return actions['left']
    else:
        return actions['right']
    
    #return actions['right']

In [None]:
def policy(s, t):  # keep the (s, t) signature used by the loop below; s is unused here
    if t < 50:
        return actions['left']
    elif t < 100:
        return actions['right']
    elif t < 150:
        return actions['left']
    else:
        return actions['right']

In [None]:
for t in range(TIME_LIMIT):
    
    s, r, done, _ = env.step(policy(s, t))
    
    #draw game image on display
    ax.clear()
    ax.imshow(env.render('rgb_array'))
    fig.canvas.draw()
    
    if done:
        print("Well done!")
        break
else:    
    print("Time limit exceeded. Try again.")

In [None]:
env.close()