# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
from scores.score_logger import ScoreLogger

  
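ENV_NAME = "CartPole-v1"

GAMMA = 0.95            # discount factor applied to future rewards
LEARNING_RATE = 0.001   # Adam step size

MEMORY_SIZE = 1000000   # maximum number of transitions kept for replay
BATCH_SIZE = 20         # transitions sampled per replay call

EXPLORATION_MAX = 1.0      # initial epsilon (fully random actions)
EXPLORATION_MIN = 0.01     # floor for epsilon
EXPLORATION_DECAY = 0.995  # multiplicative decay applied after each replay call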
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
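    def act(self, state):
        # Epsilon-greedy: explore with probability exploration_rate,
        # otherwise pick the action with the highest predicted Q-value.
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])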
  
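    def experience_replay(self):
        # Wait until the memory holds at least one full batch.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Q-learning target: the reward alone for terminal transitions,
            # otherwise the reward plus the discounted best next-state Q-value.
            q_update = reward
            if not terminal:
                q_update = reward + GAMMA * np.amax(self.model.predict(state_next)[0])
            # Overwrite only the taken action's Q-value, then fit on that single sample.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay epsilon once per replay call, never below the floor.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)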
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  


Using TensorFlow backend.
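
The exploration rate printed in the log can be checked by hand. The sketch below assumes, as in `experience_replay` above, that the rate is multiplied by `EXPLORATION_DECAY` once per replay call that actually trains, and that replay only trains once the memory holds `BATCH_SIZE` transitions. The helper names `exploration_rate` and `replays_in_first_episode` are hypothetical and not part of the notebook.

```python
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
BATCH_SIZE = 20


def exploration_rate(num_replays):
    """Epsilon after `num_replays` training replay calls (hypothetical helper)."""
    return max(EXPLORATION_MIN, EXPLORATION_MAX * EXPLORATION_DECAY ** num_replays)


def replays_in_first_episode(steps):
    """Training replay calls in a first episode of `steps` steps.

    Replay is skipped on the terminal step and does nothing until the
    memory holds BATCH_SIZE transitions, so only steps 20 .. steps-1 count.
    """
    return max(0, steps - BATCH_SIZE)


first_run_replays = replays_in_first_episode(23)
first_run_rate = exploration_rate(first_run_replays)
print(first_run_replays, first_run_rate)
```

For a first episode of 23 steps this gives three decays, which agrees with the exploration value of roughly 0.985 reported for Run 1 in the output that follows.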


In [2]:
cartpole()

Run: 1, exploration: 0.985074875, score: 23
Scores: (min: 23, avg: 23, max: 23)

Run: 2, exploration: 0.9137248860125932, score: 16
Scores: (min: 16, avg: 19.5, max: 23)

Run: 3, exploration: 0.8433051360508336, score: 17
Scores: (min: 16, avg: 18.666666666666668, max: 23)

Run: 4, exploration: 0.8020760579717637, score: 11
Scores: (min: 11, avg: 16.75, max: 23)

Run: 5, exploration: 0.6832098777212641, score: 33
Scores: (min: 11, avg: 20, max: 33)

Run: 6, exploration: 0.653073201944699, score: 10
Scores: (min: 10, avg: 18.333333333333332, max: 33)

Run: 7, exploration: 0.6242658676435396, score: 10
Scores: (min: 10, avg: 17.142857142857142, max: 33)

Run: 8, exploration: 0.6027415843082742, score: 8
Scores: (min: 8, avg: 16, max: 33)

Run: 9, exploration: 0.5647174463480732, score: 14
Scores: (min: 8, avg: 15.777777777777779, max: 33)

Run: 10, exploration: 0.5398075216808175, score: 10
Scores: (min: 8, avg: 15.2, max: 33)

Run: 11, exploration: 0.5082950737585841, score: 13
Scores: 

Run: 88, exploration: 0.01, score: 18
Scores: (min: 8, avg: 20.818181818181817, max: 50)

Run: 89, exploration: 0.01, score: 25
Scores: (min: 8, avg: 20.865168539325843, max: 50)

Run: 90, exploration: 0.01, score: 27
Scores: (min: 8, avg: 20.933333333333334, max: 50)

Run: 91, exploration: 0.01, score: 25
Scores: (min: 8, avg: 20.978021978021978, max: 50)

Run: 92, exploration: 0.01, score: 20
Scores: (min: 8, avg: 20.967391304347824, max: 50)

Run: 93, exploration: 0.01, score: 21
Scores: (min: 8, avg: 20.967741935483872, max: 50)

Run: 94, exploration: 0.01, score: 19
Scores: (min: 8, avg: 20.9468085106383, max: 50)

Run: 95, exploration: 0.01, score: 22
Scores: (min: 8, avg: 20.957894736842107, max: 50)

Run: 96, exploration: 0.01, score: 43
Scores: (min: 8, avg: 21.1875, max: 50)

Run: 97, exploration: 0.01, score: 24
Scores: (min: 8, avg: 21.216494845360824, max: 50)

Run: 98, exploration: 0.01, score: 18
Scores: (min: 8, avg: 21.183673469387756, max: 50)

Run: 99, exploration: 0

Run: 190, exploration: 0.01, score: 225
Scores: (min: 8, avg: 120.28, max: 483)

Run: 191, exploration: 0.01, score: 171
Scores: (min: 8, avg: 121.74, max: 483)

Run: 192, exploration: 0.01, score: 150
Scores: (min: 8, avg: 123.04, max: 483)

Run: 193, exploration: 0.01, score: 141
Scores: (min: 8, avg: 124.24, max: 483)

Run: 194, exploration: 0.01, score: 165
Scores: (min: 8, avg: 125.7, max: 483)

Run: 195, exploration: 0.01, score: 12
Scores: (min: 8, avg: 125.6, max: 483)

Run: 196, exploration: 0.01, score: 10
Scores: (min: 8, avg: 125.27, max: 483)

Run: 197, exploration: 0.01, score: 169
Scores: (min: 8, avg: 126.72, max: 483)

Run: 198, exploration: 0.01, score: 230
Scores: (min: 8, avg: 128.84, max: 483)

Run: 199, exploration: 0.01, score: 148
Scores: (min: 8, avg: 130.11, max: 483)

Run: 200, exploration: 0.01, score: 170
Scores: (min: 8, avg: 131.29, max: 483)

Run: 201, exploration: 0.01, score: 141
Scores: (min: 8, avg: 132.26, max: 483)

Run: 202, exploration: 0.01, sco

Run: 291, exploration: 0.01, score: 181
Scores: (min: 10, avg: 180.54, max: 319)

Run: 292, exploration: 0.01, score: 196
Scores: (min: 10, avg: 181, max: 319)

Run: 293, exploration: 0.01, score: 147
Scores: (min: 10, avg: 181.06, max: 319)

Run: 294, exploration: 0.01, score: 218
Scores: (min: 10, avg: 181.59, max: 319)

Run: 295, exploration: 0.01, score: 241
Scores: (min: 10, avg: 183.88, max: 319)

Run: 296, exploration: 0.01, score: 189
Scores: (min: 101, avg: 185.67, max: 319)

Run: 297, exploration: 0.01, score: 130
Scores: (min: 101, avg: 185.28, max: 319)

Run: 298, exploration: 0.01, score: 119
Scores: (min: 101, avg: 184.17, max: 319)

Run: 299, exploration: 0.01, score: 167
Scores: (min: 101, avg: 184.36, max: 319)

Run: 300, exploration: 0.01, score: 196
Scores: (min: 101, avg: 184.62, max: 319)

Run: 301, exploration: 0.01, score: 323
Scores: (min: 101, avg: 186.44, max: 323)

Run: 302, exploration: 0.01, score: 153
Scores: (min: 101, avg: 186.13, max: 323)

Run: 303, ex

Run: 392, exploration: 0.01, score: 228
Scores: (min: 8, avg: 185.37, max: 413)

Run: 393, exploration: 0.01, score: 223
Scores: (min: 8, avg: 186.13, max: 413)

Run: 394, exploration: 0.01, score: 305
Scores: (min: 8, avg: 187, max: 413)

Run: 395, exploration: 0.01, score: 243
Scores: (min: 8, avg: 187.02, max: 413)

Run: 396, exploration: 0.01, score: 146
Scores: (min: 8, avg: 186.59, max: 413)

Run: 397, exploration: 0.01, score: 147
Scores: (min: 8, avg: 186.76, max: 413)

Run: 398, exploration: 0.01, score: 189
Scores: (min: 8, avg: 187.46, max: 413)

Run: 399, exploration: 0.01, score: 145
Scores: (min: 8, avg: 187.24, max: 413)

Run: 400, exploration: 0.01, score: 128
Scores: (min: 8, avg: 186.56, max: 413)

Run: 401, exploration: 0.01, score: 183
Scores: (min: 8, avg: 185.16, max: 413)

Run: 402, exploration: 0.01, score: 210
Scores: (min: 8, avg: 185.73, max: 413)

Run: 403, exploration: 0.01, score: 231
Scores: (min: 8, avg: 185.6, max: 413)

Run: 404, exploration: 0.01, sco

NameError: name 'exit' is not defined

Note: If the code is running properly, output should begin to appear above this code block. The run takes several minutes, so consider letting it run in the background while you complete other work. When the code has finished, it prints "Solved in _ runs, _ total runs."

You may see a NameError stating that the name 'exit' is not defined. This error does not affect the program's functionality; it results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard it.
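
The "Solved in _ runs" message and the averages in the log suggest that the logger tracks a rolling mean of recent scores (the averages appear to cover the last 100 runs). The sketch below is not the code in score_logger.py. It is a minimal illustration, assuming a 100-run window and a hypothetical threshold of 195, in which the tracker returns a flag instead of calling `exit()`, so a training loop could stop with a plain `break`. The class name `RollingScore` and both constants are assumptions.

```python
from collections import deque
from statistics import mean

WINDOW = 100           # assumed window size, inferred from the log averages
SOLVE_THRESHOLD = 195  # hypothetical threshold, not taken from score_logger.py


class RollingScore:
    """Tracks the most recent scores and reports when their mean is high enough."""

    def __init__(self, window=WINDOW, threshold=SOLVE_THRESHOLD):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def add_score(self, score):
        """Record a score; return True once the window is full and its mean meets the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and mean(self.scores) >= self.threshold


# Small demonstration with a 3-run window and a threshold of 10.
tracker = RollingScore(window=3, threshold=10)
results = [tracker.add_score(s) for s in (5, 10, 12, 14)]
print(results)
```

In a loop like `cartpole()`, the caller would check the returned flag after each episode and break out of the outer `while True` when it is true.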

In [5]:
# Modified cartpole: increased gamma to 0.995

ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.995  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  
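
Raising `GAMMA` from 0.95 to 0.995 changes how far ahead the Q-learning target effectively looks. A common rule of thumb, which is not stated in the assignment, is the effective horizon 1 / (1 - gamma). The sketch below compares the two settings and also sums a constant reward of +1 over 500 steps, the episode cap for CartPole-v1. The function names are hypothetical.

```python
def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that matter for a discount factor."""
    return 1.0 / (1.0 - gamma)


def discounted_return(gamma, steps, reward=1.0):
    """Discounted sum of a constant per-step reward over `steps` steps."""
    return sum(reward * gamma ** t for t in range(steps))


for gamma in (0.95, 0.995):
    horizon = effective_horizon(gamma)
    total = discounted_return(gamma, 500)
    print(f"gamma={gamma}: horizon ~ {horizon:.0f} steps, 500-step return ~ {total:.1f}")
```

With 0.95 the agent's targets are dominated by roughly the next 20 steps, while 0.995 stretches that to roughly 200, which is closer to the length of a long CartPole episode.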


In [6]:
cartpole()

Run: 1, exploration: 1.0, score: 12
Scores: (min: 12, avg: 12, max: 12)

Run: 2, exploration: 0.9703725093562657, score: 14
Scores: (min: 12, avg: 13, max: 14)

Run: 3, exploration: 0.8433051360508336, score: 29
Scores: (min: 12, avg: 18.333333333333332, max: 29)

Run: 4, exploration: 0.697046600835495, score: 39
Scores: (min: 12, avg: 23.5, max: 39)

Run: 5, exploration: 0.6057704364907278, score: 29
Scores: (min: 12, avg: 24.6, max: 39)

Run: 6, exploration: 0.500708706245853, score: 39
Scores: (min: 12, avg: 27, max: 39)

Run: 7, exploration: 0.39561243860243744, score: 48
Scores: (min: 12, avg: 30, max: 48)

Run: 8, exploration: 0.3386767948568688, score: 32
Scores: (min: 12, avg: 30.25, max: 48)

Run: 9, exploration: 0.27714603575484437, score: 41
Scores: (min: 12, avg: 31.444444444444443, max: 48)

Run: 10, exploration: 0.17829525136613786, score: 89
Scores: (min: 12, avg: 37.2, max: 89)

Run: 11, exploration: 0.14811319969530845, score: 38
Scores: (min: 12, avg: 37.2727272727272

Run: 92, exploration: 0.01, score: 10
Scores: (min: 8, avg: 110.6195652173913, max: 318)

Run: 93, exploration: 0.01, score: 10
Scores: (min: 8, avg: 109.53763440860214, max: 318)

Run: 94, exploration: 0.01, score: 9
Scores: (min: 8, avg: 108.46808510638297, max: 318)

Run: 95, exploration: 0.01, score: 9
Scores: (min: 8, avg: 107.42105263157895, max: 318)

Run: 96, exploration: 0.01, score: 10
Scores: (min: 8, avg: 106.40625, max: 318)

Run: 97, exploration: 0.01, score: 10
Scores: (min: 8, avg: 105.41237113402062, max: 318)

Run: 98, exploration: 0.01, score: 9
Scores: (min: 8, avg: 104.42857142857143, max: 318)

Run: 99, exploration: 0.01, score: 10
Scores: (min: 8, avg: 103.47474747474747, max: 318)

Run: 100, exploration: 0.01, score: 11
Scores: (min: 8, avg: 102.55, max: 318)

Run: 101, exploration: 0.01, score: 9
Scores: (min: 8, avg: 102.52, max: 318)

Run: 102, exploration: 0.01, score: 10
Scores: (min: 8, avg: 102.48, max: 318)

Run: 103, exploration: 0.01, score: 10
Scores:

Run: 197, exploration: 0.01, score: 14
Scores: (min: 8, avg: 10.26, max: 51)

Run: 198, exploration: 0.01, score: 13
Scores: (min: 8, avg: 10.3, max: 51)

Run: 199, exploration: 0.01, score: 27
Scores: (min: 8, avg: 10.47, max: 51)

Run: 200, exploration: 0.01, score: 26
Scores: (min: 8, avg: 10.62, max: 51)

Run: 201, exploration: 0.01, score: 52
Scores: (min: 8, avg: 11.05, max: 52)

Run: 202, exploration: 0.01, score: 46
Scores: (min: 8, avg: 11.41, max: 52)

Run: 203, exploration: 0.01, score: 56
Scores: (min: 8, avg: 11.87, max: 56)

Run: 204, exploration: 0.01, score: 396
Scores: (min: 8, avg: 15.74, max: 396)

Run: 205, exploration: 0.01, score: 61
Scores: (min: 8, avg: 16.26, max: 396)

Run: 206, exploration: 0.01, score: 13
Scores: (min: 8, avg: 16.3, max: 396)

Run: 207, exploration: 0.01, score: 11
Scores: (min: 8, avg: 16.32, max: 396)

Run: 208, exploration: 0.01, score: 11
Scores: (min: 8, avg: 16.34, max: 396)

Run: 209, exploration: 0.01, score: 11
Scores: (min: 8, avg:

Run: 301, exploration: 0.01, score: 500
Scores: (min: 9, avg: 140.78, max: 500)

Run: 302, exploration: 0.01, score: 500
Scores: (min: 9, avg: 145.32, max: 500)

Run: 303, exploration: 0.01, score: 500
Scores: (min: 9, avg: 149.76, max: 500)

Run: 304, exploration: 0.01, score: 500
Scores: (min: 9, avg: 150.8, max: 500)

Run: 305, exploration: 0.01, score: 500
Scores: (min: 9, avg: 155.19, max: 500)

Run: 306, exploration: 0.01, score: 500
Scores: (min: 9, avg: 160.06, max: 500)

Run: 307, exploration: 0.01, score: 500
Scores: (min: 9, avg: 164.95, max: 500)

Run: 308, exploration: 0.01, score: 500
Scores: (min: 9, avg: 169.84, max: 500)

Run: 309, exploration: 0.01, score: 500
Scores: (min: 9, avg: 174.73, max: 500)

Run: 310, exploration: 0.01, score: 500
Scores: (min: 9, avg: 179.63, max: 500)

Run: 311, exploration: 0.01, score: 500
Scores: (min: 9, avg: 184.54, max: 500)

Run: 312, exploration: 0.01, score: 500
Scores: (min: 9, avg: 189.44, max: 500)

Run: 313, exploration: 0.01, 

NameError: name 'exit' is not defined

In [9]:
# Modified cartpole: increased gamma to 0.995, increased exploration min to 0.5
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.995  
LEARNING_RATE = 0.001 
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.5 
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  
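
With `EXPLORATION_MIN` raised to 0.5, half of all actions remain random for the whole run. The sketch below estimates how often the agent still ends up executing its greedy action under epsilon-greedy selection, assuming the random branch picks uniformly among the actions (as `random.randrange` does) and that CartPole-v1 has two discrete actions. The function name is hypothetical.

```python
def greedy_action_probability(epsilon, num_actions=2):
    """Probability that the executed action equals the greedy action.

    The greedy branch is taken with probability 1 - epsilon; the random
    branch happens to pick the greedy action with probability 1 / num_actions.
    """
    return (1.0 - epsilon) + epsilon / num_actions


p_default = greedy_action_probability(0.01)
p_floor_half = greedy_action_probability(0.5)
print(p_default, p_floor_half)

# Chance of following the greedy policy for 10 consecutive steps.
print(p_default ** 10, p_floor_half ** 10)
```

Under these assumptions a floor of 0.5 means roughly one step in four deviates from the greedy action, which is one plausible reason the scores with this setting fluctuate more from run to run.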


In [10]:
cartpole()

Run: 1, exploration: 0.9801495006250001, score: 24
Scores: (min: 24, avg: 24, max: 24)

Run: 2, exploration: 0.8911090557802088, score: 20
Scores: (min: 20, avg: 22, max: 24)

Run: 3, exploration: 0.7940753492934954, score: 24
Scores: (min: 20, avg: 22.666666666666668, max: 24)

Run: 4, exploration: 0.7328768546436799, score: 17
Scores: (min: 17, avg: 21.25, max: 24)

Run: 5, exploration: 0.697046600835495, score: 11
Scores: (min: 11, avg: 19.2, max: 24)

Run: 6, exploration: 0.6465587967553006, score: 16
Scores: (min: 11, avg: 18.666666666666668, max: 24)

Run: 7, exploration: 0.6027415843082742, score: 15
Scores: (min: 11, avg: 18.142857142857142, max: 24)

Run: 8, exploration: 0.5452463540625918, score: 21
Scores: (min: 11, avg: 18.5, max: 24)

Run: 9, exploration: 0.5211953074858876, score: 10
Scores: (min: 10, avg: 17.555555555555557, max: 24)

Run: 10, exploration: 0.500708706245853, score: 9
Scores: (min: 9, avg: 16.7, max: 24)

Run: 11, exploration: 0.5, score: 11
Scores: (min:

Run: 94, exploration: 0.5, score: 78
Scores: (min: 9, avg: 92.30851063829788, max: 296)

Run: 95, exploration: 0.5, score: 55
Scores: (min: 9, avg: 91.91578947368421, max: 296)

Run: 96, exploration: 0.5, score: 142
Scores: (min: 9, avg: 92.4375, max: 296)

Run: 97, exploration: 0.5, score: 145
Scores: (min: 9, avg: 92.97938144329896, max: 296)

Run: 98, exploration: 0.5, score: 198
Scores: (min: 9, avg: 94.05102040816327, max: 296)

Run: 99, exploration: 0.5, score: 231
Scores: (min: 9, avg: 95.43434343434343, max: 296)

Run: 100, exploration: 0.5, score: 26
Scores: (min: 9, avg: 94.74, max: 296)

Run: 101, exploration: 0.5, score: 232
Scores: (min: 9, avg: 96.82, max: 296)

Run: 102, exploration: 0.5, score: 90
Scores: (min: 9, avg: 97.52, max: 296)

Run: 103, exploration: 0.5, score: 31
Scores: (min: 9, avg: 97.59, max: 296)

Run: 104, exploration: 0.5, score: 196
Scores: (min: 9, avg: 99.38, max: 296)

Run: 105, exploration: 0.5, score: 212
Scores: (min: 9, avg: 101.39, max: 296)



NameError: name 'exit' is not defined

In [11]:
# Modified cartpole: increased gamma to 0.995, increased exploration min to 0.5, decreased exploration decay to 0.8
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.995  
LEARNING_RATE = 0.001 
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.5 
EXPLORATION_DECAY = 0.8  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  
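
Lowering `EXPLORATION_DECAY` to 0.8 mainly changes how quickly the exploration rate reaches its floor. The sketch below computes the number of training replay calls needed for the three configurations used in this notebook, assuming the rate starts at 1.0 and is multiplied by the decay once per replay call. The helper name `replays_until_floor` is hypothetical.

```python
import math


def replays_until_floor(decay, floor, start=1.0):
    """Smallest n such that start * decay**n is at or below the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


configs = {
    "decay 0.995, min 0.01": replays_until_floor(0.995, 0.01),
    "decay 0.995, min 0.5": replays_until_floor(0.995, 0.5),
    "decay 0.8, min 0.5": replays_until_floor(0.8, 0.5),
}
for name, n in configs.items():
    print(name, "->", n, "replay calls")
```

With a decay of 0.8 the floor is reached after only a handful of replay calls, so the agent spends essentially the whole run at an exploration rate of 0.5.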


In [12]:
cartpole()

Run: 1, exploration: 1.0, score: 14
Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.5, score: 28
Scores: (min: 14, avg: 21, max: 28)

Run: 3, exploration: 0.5, score: 11
Scores: (min: 11, avg: 17.666666666666668, max: 28)

Run: 4, exploration: 0.5, score: 13
Scores: (min: 11, avg: 16.5, max: 28)

Run: 5, exploration: 0.5, score: 11
Scores: (min: 11, avg: 15.4, max: 28)

Run: 6, exploration: 0.5, score: 20
Scores: (min: 11, avg: 16.166666666666668, max: 28)

Run: 7, exploration: 0.5, score: 11
Scores: (min: 11, avg: 15.428571428571429, max: 28)

Run: 8, exploration: 0.5, score: 15
Scores: (min: 11, avg: 15.375, max: 28)

Run: 9, exploration: 0.5, score: 11
Scores: (min: 11, avg: 14.88888888888889, max: 28)

Run: 10, exploration: 0.5, score: 11
Scores: (min: 11, avg: 14.5, max: 28)

Run: 11, exploration: 0.5, score: 18
Scores: (min: 11, avg: 14.818181818181818, max: 28)

Run: 12, exploration: 0.5, score: 10
Scores: (min: 10, avg: 14.416666666666666, max: 28)

Run: 13, explora

Run: 95, exploration: 0.5, score: 53
Scores: (min: 10, avg: 94.71578947368421, max: 339)

Run: 96, exploration: 0.5, score: 78
Scores: (min: 10, avg: 94.54166666666667, max: 339)

Run: 97, exploration: 0.5, score: 66
Scores: (min: 10, avg: 94.24742268041237, max: 339)

Run: 98, exploration: 0.5, score: 248
Scores: (min: 10, avg: 95.81632653061224, max: 339)

Run: 99, exploration: 0.5, score: 32
Scores: (min: 10, avg: 95.17171717171718, max: 339)

Run: 100, exploration: 0.5, score: 22
Scores: (min: 10, avg: 94.44, max: 339)

Run: 101, exploration: 0.5, score: 88
Scores: (min: 10, avg: 95.18, max: 339)

Run: 102, exploration: 0.5, score: 83
Scores: (min: 10, avg: 95.73, max: 339)

Run: 103, exploration: 0.5, score: 133
Scores: (min: 10, avg: 96.95, max: 339)

Run: 104, exploration: 0.5, score: 178
Scores: (min: 10, avg: 98.6, max: 339)

Run: 105, exploration: 0.5, score: 71
Scores: (min: 10, avg: 99.2, max: 339)

Run: 106, exploration: 0.5, score: 207
Scores: (min: 10, avg: 101.07, max: 

Run: 197, exploration: 0.5, score: 120
Scores: (min: 11, avg: 150.17, max: 433)

Run: 198, exploration: 0.5, score: 17
Scores: (min: 11, avg: 147.86, max: 433)

Run: 199, exploration: 0.5, score: 61
Scores: (min: 11, avg: 148.15, max: 433)

Run: 200, exploration: 0.5, score: 240
Scores: (min: 11, avg: 150.33, max: 433)

Run: 201, exploration: 0.5, score: 69
Scores: (min: 11, avg: 150.14, max: 433)

Run: 202, exploration: 0.5, score: 225
Scores: (min: 11, avg: 151.56, max: 433)

Run: 203, exploration: 0.5, score: 149
Scores: (min: 11, avg: 151.72, max: 433)

Run: 204, exploration: 0.5, score: 226
Scores: (min: 11, avg: 152.2, max: 433)

Run: 205, exploration: 0.5, score: 115
Scores: (min: 11, avg: 152.64, max: 433)

Run: 206, exploration: 0.5, score: 40
Scores: (min: 11, avg: 150.97, max: 433)

Run: 207, exploration: 0.5, score: 99
Scores: (min: 11, avg: 150.81, max: 433)

Run: 208, exploration: 0.5, score: 72
Scores: (min: 11, avg: 147.83, max: 433)

Run: 209, exploration: 0.5, score: 

Run: 299, exploration: 0.5, score: 249
Scores: (min: 11, avg: 145.66, max: 413)

Run: 300, exploration: 0.5, score: 89
Scores: (min: 11, avg: 144.15, max: 413)

Run: 301, exploration: 0.5, score: 19
Scores: (min: 11, avg: 143.65, max: 413)

Run: 302, exploration: 0.5, score: 156
Scores: (min: 11, avg: 142.96, max: 413)

Run: 303, exploration: 0.5, score: 151
Scores: (min: 11, avg: 142.98, max: 413)

Run: 304, exploration: 0.5, score: 78
Scores: (min: 11, avg: 141.5, max: 413)

Run: 305, exploration: 0.5, score: 58
Scores: (min: 11, avg: 140.93, max: 413)

Run: 306, exploration: 0.5, score: 111
Scores: (min: 11, avg: 141.64, max: 413)

Run: 307, exploration: 0.5, score: 171
Scores: (min: 11, avg: 142.36, max: 413)

Run: 308, exploration: 0.5, score: 51
Scores: (min: 11, avg: 142.15, max: 413)

Run: 309, exploration: 0.5, score: 47
Scores: (min: 11, avg: 140.63, max: 413)

Run: 310, exploration: 0.5, score: 185
Scores: (min: 11, avg: 141.51, max: 413)

Run: 311, exploration: 0.5, score: 

Run: 401, exploration: 0.5, score: 336
Scores: (min: 12, avg: 133.35, max: 336)

Run: 402, exploration: 0.5, score: 208
Scores: (min: 12, avg: 133.87, max: 336)

Run: 403, exploration: 0.5, score: 158
Scores: (min: 12, avg: 133.94, max: 336)

Run: 404, exploration: 0.5, score: 113
Scores: (min: 12, avg: 134.29, max: 336)

Run: 405, exploration: 0.5, score: 500
Scores: (min: 12, avg: 138.71, max: 500)

Run: 406, exploration: 0.5, score: 400
Scores: (min: 12, avg: 141.6, max: 500)

Run: 407, exploration: 0.5, score: 39
Scores: (min: 12, avg: 140.28, max: 500)

Run: 408, exploration: 0.5, score: 145
Scores: (min: 12, avg: 141.22, max: 500)

Run: 409, exploration: 0.5, score: 158
Scores: (min: 12, avg: 142.33, max: 500)

Run: 410, exploration: 0.5, score: 229
Scores: (min: 12, avg: 142.77, max: 500)

Run: 411, exploration: 0.5, score: 281
Scores: (min: 12, avg: 144.95, max: 500)

Run: 412, exploration: 0.5, score: 308
Scores: (min: 12, avg: 146.37, max: 500)

Run: 413, exploration: 0.5, sc

Run: 503, exploration: 0.5, score: 156
Scores: (min: 12, avg: 183.06, max: 500)

Run: 504, exploration: 0.5, score: 18
Scores: (min: 12, avg: 182.11, max: 500)

Run: 505, exploration: 0.5, score: 87
Scores: (min: 12, avg: 177.98, max: 500)

Run: 506, exploration: 0.5, score: 57
Scores: (min: 12, avg: 174.55, max: 500)

Run: 507, exploration: 0.5, score: 228
Scores: (min: 12, avg: 176.44, max: 500)

Run: 508, exploration: 0.5, score: 139
Scores: (min: 12, avg: 176.38, max: 500)

Run: 509, exploration: 0.5, score: 19
Scores: (min: 12, avg: 174.99, max: 500)

Run: 510, exploration: 0.5, score: 93
Scores: (min: 12, avg: 173.63, max: 500)

Run: 511, exploration: 0.5, score: 22
Scores: (min: 12, avg: 171.04, max: 500)

Run: 512, exploration: 0.5, score: 33
Scores: (min: 12, avg: 168.29, max: 500)

Run: 513, exploration: 0.5, score: 463
Scores: (min: 12, avg: 170.55, max: 500)

Run: 514, exploration: 0.5, score: 399
Scores: (min: 12, avg: 172.8, max: 500)

Run: 515, exploration: 0.5, score: 7

Run: 605, exploration: 0.5, score: 173
Scores: (min: 11, avg: 169.81, max: 500)

Run: 606, exploration: 0.5, score: 152
Scores: (min: 11, avg: 170.76, max: 500)

Run: 607, exploration: 0.5, score: 84
Scores: (min: 11, avg: 169.32, max: 500)

Run: 608, exploration: 0.5, score: 26
Scores: (min: 11, avg: 168.19, max: 500)

Run: 609, exploration: 0.5, score: 297
Scores: (min: 11, avg: 170.97, max: 500)

Run: 610, exploration: 0.5, score: 190
Scores: (min: 11, avg: 171.94, max: 500)

Run: 611, exploration: 0.5, score: 228
Scores: (min: 11, avg: 174, max: 500)

Run: 612, exploration: 0.5, score: 91
Scores: (min: 11, avg: 174.58, max: 500)

Run: 613, exploration: 0.5, score: 147
Scores: (min: 11, avg: 171.42, max: 500)

Run: 614, exploration: 0.5, score: 135
Scores: (min: 11, avg: 168.78, max: 500)

Run: 615, exploration: 0.5, score: 411
Scores: (min: 11, avg: 172.18, max: 500)

Run: 616, exploration: 0.5, score: 119
Scores: (min: 11, avg: 172.93, max: 500)

Run: 617, exploration: 0.5, score:

Run: 707, exploration: 0.5, score: 245
Scores: (min: 13, avg: 181.14, max: 500)

Run: 708, exploration: 0.5, score: 119
Scores: (min: 13, avg: 182.07, max: 500)

Run: 709, exploration: 0.5, score: 61
Scores: (min: 13, avg: 179.71, max: 500)

Run: 710, exploration: 0.5, score: 41
Scores: (min: 13, avg: 178.22, max: 500)

Run: 711, exploration: 0.5, score: 83
Scores: (min: 13, avg: 176.77, max: 500)

Run: 712, exploration: 0.5, score: 203
Scores: (min: 13, avg: 177.89, max: 500)

Run: 713, exploration: 0.5, score: 34
Scores: (min: 13, avg: 176.76, max: 500)

Run: 714, exploration: 0.5, score: 99
Scores: (min: 13, avg: 176.4, max: 500)

Run: 715, exploration: 0.5, score: 121
Scores: (min: 13, avg: 173.5, max: 500)

Run: 716, exploration: 0.5, score: 37
Scores: (min: 13, avg: 172.68, max: 500)

Run: 717, exploration: 0.5, score: 45
Scores: (min: 13, avg: 172.33, max: 500)

Run: 718, exploration: 0.5, score: 54
Scores: (min: 13, avg: 171.85, max: 500)

Run: 719, exploration: 0.5, score: 28


Run: 809, exploration: 0.5, score: 127
Scores: (min: 11, avg: 152.03, max: 500)

Run: 810, exploration: 0.5, score: 500
Scores: (min: 11, avg: 156.62, max: 500)

Run: 811, exploration: 0.5, score: 117
Scores: (min: 11, avg: 156.96, max: 500)

Run: 812, exploration: 0.5, score: 500
Scores: (min: 11, avg: 159.93, max: 500)

Run: 813, exploration: 0.5, score: 77
Scores: (min: 11, avg: 160.36, max: 500)

Run: 814, exploration: 0.5, score: 57
Scores: (min: 11, avg: 159.94, max: 500)

Run: 815, exploration: 0.5, score: 48
Scores: (min: 11, avg: 159.21, max: 500)

Run: 816, exploration: 0.5, score: 12
Scores: (min: 11, avg: 158.96, max: 500)

Run: 817, exploration: 0.5, score: 75
Scores: (min: 11, avg: 159.26, max: 500)

Run: 818, exploration: 0.5, score: 117
Scores: (min: 11, avg: 159.89, max: 500)

Run: 819, exploration: 0.5, score: 183
Scores: (min: 11, avg: 161.44, max: 500)

Run: 820, exploration: 0.5, score: 126
Scores: (min: 11, avg: 161.94, max: 500)

Run: 821, exploration: 0.5, score

Run: 911, exploration: 0.5, score: 115
Scores: (min: 12, avg: 184.11, max: 500)

Run: 912, exploration: 0.5, score: 49
Scores: (min: 12, avg: 179.6, max: 500)

Run: 913, exploration: 0.5, score: 239
Scores: (min: 12, avg: 181.22, max: 500)

Run: 914, exploration: 0.5, score: 52
Scores: (min: 12, avg: 181.17, max: 500)

Run: 915, exploration: 0.5, score: 190
Scores: (min: 12, avg: 182.59, max: 500)

Run: 916, exploration: 0.5, score: 220
Scores: (min: 13, avg: 184.67, max: 500)

Run: 917, exploration: 0.5, score: 29
Scores: (min: 13, avg: 184.21, max: 500)

Run: 918, exploration: 0.5, score: 238
Scores: (min: 13, avg: 185.42, max: 500)

Run: 919, exploration: 0.5, score: 102
Scores: (min: 13, avg: 184.61, max: 500)

Run: 920, exploration: 0.5, score: 105
Scores: (min: 13, avg: 184.4, max: 500)

Run: 921, exploration: 0.5, score: 319
Scores: (min: 13, avg: 184.89, max: 500)

Run: 922, exploration: 0.5, score: 317
Scores: (min: 13, avg: 187.67, max: 500)

Run: 923, exploration: 0.5, score

NameError: name 'exit' is not defined

In [3]:
# Modified cartpole: increased gamma to 0.995, increased exploration min to 0.5, increased learning rate to 0.01

import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
from scores.score_logger import ScoreLogger


ENV_NAME = "CartPole-v1"

GAMMA = 0.995
LEARNING_RATE = 0.01

MEMORY_SIZE = 1000000
BATCH_SIZE = 20

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.5
EXPLORATION_DECAY = 0.995


class DQNSolver:

    def __init__(self, observation_space, action_space):
        self.exploration_rate = EXPLORATION_MAX

        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)

        self.model = Sequential()
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))
        self.model.add(Dense(24, activation="relu"))
        self.model.add(Dense(self.action_space, activation="linear"))
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)


def cartpole():
    env = gym.make(ENV_NAME)
    score_logger = ScoreLogger(ENV_NAME)
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    dqn_solver = DQNSolver(observation_space, action_space)
    run = 0
    while True:
        run += 1
        state = env.reset()
        state = np.reshape(state, [1, observation_space])
        step = 0
        while True:
            step += 1
            #env.render()
            action = dqn_solver.act(state)
            state_next, reward, terminal, info = env.step(action)
            reward = reward if not terminal else -reward
            state_next = np.reshape(state_next, [1, observation_space])
            dqn_solver.remember(state, action, reward, state_next, terminal)
            state = state_next
            if terminal:
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)
                break
            dqn_solver.experience_replay()


Using TensorFlow backend.


In [4]:
cartpole()

Run: 1, exploration: 1.0, score: 18
Scores: (min: 18, avg: 18, max: 18)

Run: 2, exploration: 0.8734200960253871, score: 29
Scores: (min: 18, avg: 23.5, max: 29)

Run: 3, exploration: 0.8224322824348486, score: 13
Scores: (min: 13, avg: 20, max: 29)

Run: 4, exploration: 0.7292124703704616, score: 25
Scores: (min: 13, avg: 21.25, max: 29)

Run: 5, exploration: 0.697046600835495, score: 10
Scores: (min: 10, avg: 19, max: 29)

Run: 6, exploration: 0.6696478204705644, score: 9
Scores: (min: 9, avg: 17.333333333333332, max: 29)

Run: 7, exploration: 0.6305556603555866, score: 13
Scores: (min: 9, avg: 16.714285714285715, max: 29)

Run: 8, exploration: 0.5997278763867329, score: 11
Scores: (min: 9, avg: 16, max: 29)

Run: 9, exploration: 0.5704072587541458, score: 11
Scores: (min: 9, avg: 15.444444444444445, max: 29)

Run: 10, exploration: 0.5371084840724134, score: 13
Scores: (min: 9, avg: 15.2, max: 29)

Run: 11, exploration: 0.5, score: 22
Scores: (min: 9, avg: 15.818181818181818, max: 29

Run: 96, exploration: 0.5, score: 25
Scores: (min: 8, avg: 14.583333333333334, max: 47)

Run: 97, exploration: 0.5, score: 17
Scores: (min: 8, avg: 14.608247422680412, max: 47)

Run: 98, exploration: 0.5, score: 49
Scores: (min: 8, avg: 14.959183673469388, max: 49)

Run: 99, exploration: 0.5, score: 33
Scores: (min: 8, avg: 15.141414141414142, max: 49)

Run: 100, exploration: 0.5, score: 13
Scores: (min: 8, avg: 15.12, max: 49)

Run: 101, exploration: 0.5, score: 38
Scores: (min: 8, avg: 15.32, max: 49)

Run: 102, exploration: 0.5, score: 14
Scores: (min: 8, avg: 15.17, max: 49)

Run: 103, exploration: 0.5, score: 11
Scores: (min: 8, avg: 15.15, max: 49)

Run: 104, exploration: 0.5, score: 21
Scores: (min: 8, avg: 15.11, max: 49)

Run: 105, exploration: 0.5, score: 15
Scores: (min: 8, avg: 15.16, max: 49)

Run: 106, exploration: 0.5, score: 46
Scores: (min: 8, avg: 15.53, max: 49)

Run: 107, exploration: 0.5, score: 19
Scores: (min: 8, avg: 15.59, max: 49)

Run: 108, exploration: 0.5, 

Run: 202, exploration: 0.5, score: 87
Scores: (min: 9, avg: 32.11, max: 87)

Run: 203, exploration: 0.5, score: 117
Scores: (min: 9, avg: 33.17, max: 117)

Run: 204, exploration: 0.5, score: 34
Scores: (min: 9, avg: 33.3, max: 117)

Run: 205, exploration: 0.5, score: 65
Scores: (min: 9, avg: 33.8, max: 117)

Run: 206, exploration: 0.5, score: 47
Scores: (min: 9, avg: 33.81, max: 117)

Run: 207, exploration: 0.5, score: 111
Scores: (min: 9, avg: 34.73, max: 117)

Run: 208, exploration: 0.5, score: 105
Scores: (min: 9, avg: 35.65, max: 117)

Run: 209, exploration: 0.5, score: 123
Scores: (min: 9, avg: 36.61, max: 123)

Run: 210, exploration: 0.5, score: 131
Scores: (min: 9, avg: 37.81, max: 131)

Run: 211, exploration: 0.5, score: 144
Scores: (min: 9, avg: 39.01, max: 144)

Run: 212, exploration: 0.5, score: 178
Scores: (min: 9, avg: 40.7, max: 178)

Run: 213, exploration: 0.5, score: 101
Scores: (min: 9, avg: 41.6, max: 178)

Run: 214, exploration: 0.5, score: 122
Scores: (min: 9, avg: 

Run: 306, exploration: 0.5, score: 35
Scores: (min: 12, avg: 153.53, max: 500)

Run: 307, exploration: 0.5, score: 75
Scores: (min: 12, avg: 153.17, max: 500)

Run: 308, exploration: 0.5, score: 81
Scores: (min: 12, avg: 152.93, max: 500)

Run: 309, exploration: 0.5, score: 258
Scores: (min: 12, avg: 154.28, max: 500)

Run: 310, exploration: 0.5, score: 245
Scores: (min: 12, avg: 155.42, max: 500)

Run: 311, exploration: 0.5, score: 54
Scores: (min: 12, avg: 154.52, max: 500)

Run: 312, exploration: 0.5, score: 103
Scores: (min: 12, avg: 153.77, max: 500)

Run: 313, exploration: 0.5, score: 224
Scores: (min: 12, avg: 155, max: 500)

Run: 314, exploration: 0.5, score: 441
Scores: (min: 12, avg: 158.19, max: 500)

Run: 315, exploration: 0.5, score: 273
Scores: (min: 12, avg: 160.12, max: 500)

Run: 316, exploration: 0.5, score: 164
Scores: (min: 12, avg: 156.76, max: 441)

Run: 317, exploration: 0.5, score: 256
Scores: (min: 12, avg: 157.78, max: 441)

Run: 318, exploration: 0.5, score: 

Run: 408, exploration: 0.5, score: 305
Scores: (min: 19, avg: 161.76, max: 441)

Run: 409, exploration: 0.5, score: 244
Scores: (min: 19, avg: 161.62, max: 441)

Run: 410, exploration: 0.5, score: 28
Scores: (min: 19, avg: 159.45, max: 441)

Run: 411, exploration: 0.5, score: 384
Scores: (min: 19, avg: 162.75, max: 441)

Run: 412, exploration: 0.5, score: 318
Scores: (min: 19, avg: 164.9, max: 441)

Run: 413, exploration: 0.5, score: 51
Scores: (min: 19, avg: 163.17, max: 441)

Run: 414, exploration: 0.5, score: 15
Scores: (min: 15, avg: 158.91, max: 384)

Run: 415, exploration: 0.5, score: 137
Scores: (min: 15, avg: 157.55, max: 384)

Run: 416, exploration: 0.5, score: 19
Scores: (min: 15, avg: 156.1, max: 384)

Run: 417, exploration: 0.5, score: 350
Scores: (min: 15, avg: 157.04, max: 384)

Run: 418, exploration: 0.5, score: 238
Scores: (min: 15, avg: 159.12, max: 384)

Run: 419, exploration: 0.5, score: 251
Scores: (min: 15, avg: 161.41, max: 384)

Run: 420, exploration: 0.5, score:

Run: 510, exploration: 0.5, score: 181
Scores: (min: 15, avg: 168.41, max: 454)

Run: 511, exploration: 0.5, score: 45
Scores: (min: 15, avg: 165.02, max: 454)

Run: 512, exploration: 0.5, score: 144
Scores: (min: 15, avg: 163.28, max: 454)

Run: 513, exploration: 0.5, score: 173
Scores: (min: 15, avg: 164.5, max: 454)

Run: 514, exploration: 0.5, score: 122
Scores: (min: 16, avg: 165.57, max: 454)

Run: 515, exploration: 0.5, score: 287
Scores: (min: 16, avg: 167.07, max: 454)

Run: 516, exploration: 0.5, score: 244
Scores: (min: 16, avg: 169.32, max: 454)

Run: 517, exploration: 0.5, score: 275
Scores: (min: 16, avg: 168.57, max: 454)

Run: 518, exploration: 0.5, score: 145
Scores: (min: 16, avg: 167.64, max: 454)

Run: 519, exploration: 0.5, score: 25
Scores: (min: 16, avg: 165.38, max: 454)

Run: 520, exploration: 0.5, score: 51
Scores: (min: 16, avg: 161.95, max: 454)

Run: 521, exploration: 0.5, score: 13
Scores: (min: 13, avg: 159.18, max: 454)

Run: 522, exploration: 0.5, score

Run: 612, exploration: 0.5, score: 265
Scores: (min: 13, avg: 174.29, max: 500)

Run: 613, exploration: 0.5, score: 458
Scores: (min: 13, avg: 177.14, max: 500)

Run: 614, exploration: 0.5, score: 149
Scores: (min: 13, avg: 177.41, max: 500)

Run: 615, exploration: 0.5, score: 455
Scores: (min: 13, avg: 179.09, max: 500)

Run: 616, exploration: 0.5, score: 255
Scores: (min: 13, avg: 179.2, max: 500)

Run: 617, exploration: 0.5, score: 205
Scores: (min: 13, avg: 178.5, max: 500)

Run: 618, exploration: 0.5, score: 89
Scores: (min: 13, avg: 177.94, max: 500)

Run: 619, exploration: 0.5, score: 128
Scores: (min: 13, avg: 178.97, max: 500)

Run: 620, exploration: 0.5, score: 73
Scores: (min: 13, avg: 179.19, max: 500)

Run: 621, exploration: 0.5, score: 134
Scores: (min: 16, avg: 180.4, max: 500)

Run: 622, exploration: 0.5, score: 500
Scores: (min: 16, avg: 184.23, max: 500)

Run: 623, exploration: 0.5, score: 259
Scores: (min: 16, avg: 184.69, max: 500)

Run: 624, exploration: 0.5, score

NameError: name 'exit' is not defined

# Analysis

## Explain how reinforcement learning concepts apply to the cartpole problem

In reinforcement learning, the agent's goal is to learn, in an initially unknown environment, which actions maximize the reward it receives. The goal of this program is to "create an agent that is capable of learning through trial and error and ultimately solving the cartpole problem" (Surma, 2021). The state is represented by "cart position, cart velocity, pole angle, and the velocity of the tip of the pole" while the available actions are "moving left or moving right" (Kurban, 2022). The agent earns a reward for each step the pole stays balanced, and the code negates the reward on the step that ends the episode. This notebook uses the Deep Q-Learning (DQN) reinforcement learning algorithm (Surma, 2021).
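
As a rough illustration of how the state and the two actions fit together, the sketch below mirrors the epsilon-greedy choice made in `DQNSolver.act`. The state and Q values are made-up numbers, and `choose_action` and `ACTIONS` are hypothetical names introduced here for illustration only.

```python
import random

# Index 0 = push the cart left, index 1 = push the cart right.
ACTIONS = ("left", "right")


def choose_action(q_values, exploration_rate, rng=random):
    """Epsilon-greedy choice: explore with probability exploration_rate, else exploit."""
    if rng.random() < exploration_rate:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest estimated Q value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# A CartPole observation has four numbers; these values are made up.
state = (0.02, -0.15, 0.03, 0.21)  # cart position, cart velocity, pole angle, pole tip velocity
q_values = [0.4, 1.3]              # made-up Q estimates for `state`

greedy = choose_action(q_values, exploration_rate=0.0)
print(ACTIONS[greedy])  # prints "right", the action with the larger Q value
```

With `exploration_rate=1.0` the same function always picks a random action, which is how every run in this notebook starts before the rate decays.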

## Analyze how experience replay is applied to the cartpole problem

Experience replay reuses previously stored experiences for training. According to Rita Kurban's article,

>"Experience replay stores the agent’s experiences in memory. Batches of experiences are randomly sampled from memory and are used to train the neural network. Such learning consists of two phases — gaining experience and updating the model. The size of the replay controls the number of experiences that are used for the network update. Memory is an array that stores the agent’s state, reward, and action, as well as whether the action finished the game and the next state." (Kurban, 2022)

The replay update also applies a discount factor (`GAMMA` in the code), which is "multiplied by future rewards to dampen these rewards’ effect on the agent. It is designed to make future rewards worth less than immediate rewards" (Surma, 2021).
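
The storing, sampling, and discounting steps can be sketched without Keras as follows. This is a minimal illustration and not the notebook's implementation: the buffer sizes are deliberately tiny, and `fake_q_values` is a hypothetical stand-in for `model.predict` that returns made-up numbers.

```python
import random
from collections import deque

GAMMA = 0.95
MEMORY_SIZE = 5   # tiny on purpose; the notebook uses 1000000
BATCH_SIZE = 3    # the notebook uses 20

memory = deque(maxlen=MEMORY_SIZE)


def remember(state, action, reward, next_state, done):
    """Store one transition; the oldest one is dropped once the buffer is full."""
    memory.append((state, action, reward, next_state, done))


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Target for the action taken: the reward plus the discounted best next Q value."""
    if terminal:
        return reward  # no future rewards after the episode ends
    return reward + gamma * max(next_q_values)


def fake_q_values(state):
    """Hypothetical stand-in for model.predict: returns made-up Q values."""
    return [0.1 * state, 0.2 * state]


# Gain experience: eight made-up transitions, the last one ending the episode.
for step in range(8):
    done = step == 7
    remember(step, step % 2, -1.0 if done else 1.0, step + 1, done)

# Update phase: sample a random batch and compute one target per transition.
batch = random.sample(list(memory), BATCH_SIZE)
targets = [q_target(r, fake_q_values(s_next), done) for _, _, r, s_next, done in batch]
print(len(memory), len(batch), len(targets))  # prints "5 3 3"
```

Sampling at random rather than replaying the most recent steps in order is what breaks up the correlation between consecutive transitions.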

## Analyze how neural networks are used in deep Q-learning

In the cartpole problem, a simple neural network predicts and updates the Q values for the agent. "The network takes the agent’s state as an input and returns the 𝑄 values for each of the actions. The maximum 𝑄 value is selected by the agent to perform the next action." (Surma, 2021) The neural network approximates the Q values instead of storing them in a table, which lets Q-learning scale to cartpole's continuous state space. In the modified run, which also raised gamma to `0.995` and the minimum exploration rate to `0.5`, increasing the learning rate from `0.001` to `0.01` almost doubled the number of runs needed to reach the end state.
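
To make the network's shape concrete, here is a plain-Python sketch of a forward pass through the same 4-24-24-2 layout that `DQNSolver` builds with Keras. The weights are random and untrained, and the helper names (`dense`, `init_layer`, `predict_q`) are hypothetical, so the Q values it returns carry no meaning beyond showing the data flow.

```python
import random


def dense(inputs, weights, biases, relu):
    """One fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(max(0.0, total) if relu else total)
    return outputs


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases; an arbitrary initialization for illustration."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


rng = random.Random(0)
layer_sizes = [4, 24, 24, 2]  # 4 state values in, 2 Q values out
layers = [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]


def predict_q(state):
    """ReLU on the two hidden layers, linear output, as in the notebook's model."""
    activations = list(state)
    for index, (weights, biases) in enumerate(layers):
        is_last = index == len(layers) - 1
        activations = dense(activations, weights, biases, relu=not is_last)
    return activations


state = (0.02, -0.15, 0.03, 0.21)  # made-up observation
q_values = predict_q(state)
action = max(range(len(q_values)), key=lambda a: q_values[a])  # greedy action
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(len(q_values), n_params)  # prints "2 770"
```

The 770 trainable parameters are what the Adam optimizer adjusts on every `fit` call, which is why the learning rate has such a visible effect on how quickly the scores improve.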

## References

Kurban, R. (2022, December 12). Deep Q Learning for the CartPole - Towards Data Science. Medium. https://towardsdatascience.com/deep-q-learning-for-the-cartpole-44d761085c2f

Surma, G. (2021, October 13). Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning). Medium. https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288
