# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

# Original

Solved in 37 runs, 137 total runs.

In [5]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
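    def experience_replay(self):
        # Wait until there are enough stored transitions to fill a batch.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Target is the reward alone on a terminal step; otherwise add the
            # discounted best Q-value predicted for the next state.
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            # Overwrite only the taken action's Q-value, then fit on that target.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay exploration once per replay call, never below the minimum.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)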
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [2]:
cartpole()

Run: 1, exploration: 1.0, score: 20
Scores: (min: 20, avg: 20, max: 20)

Run: 2, exploration: 0.9369146928798039, score: 14
Scores: (min: 14, avg: 17, max: 20)

Run: 3, exploration: 0.798065677681905, score: 33
Scores: (min: 14, avg: 22.333333333333332, max: 33)

Run: 4, exploration: 0.7147372386831305, score: 23
Scores: (min: 14, avg: 22.5, max: 33)

Run: 5, exploration: 0.6696478204705644, score: 14
Scores: (min: 14, avg: 20.8, max: 33)

Run: 6, exploration: 0.6369088258938781, score: 11
Scores: (min: 11, avg: 19.166666666666668, max: 33)

Run: 7, exploration: 0.5997278763867329, score: 13
Scores: (min: 11, avg: 18.285714285714285, max: 33)

Run: 8, exploration: 0.5704072587541458, score: 11
Scores: (min: 11, avg: 17.375, max: 33)

Run: 9, exploration: 0.5264466124450268, score: 17
Scores: (min: 11, avg: 17.333333333333332, max: 33)

Run: 10, exploration: 0.46444185833082485, score: 26
Scores: (min: 11, avg: 18.2, max: 33)

Run: 11, exploration: 0.42437208406280985, score: 19
Scores:

Run: 89, exploration: 0.01, score: 290
Scores: (min: 10, avg: 118.98876404494382, max: 474)

Run: 90, exploration: 0.01, score: 222
Scores: (min: 10, avg: 120.13333333333334, max: 474)

Run: 91, exploration: 0.01, score: 500
Scores: (min: 10, avg: 124.3076923076923, max: 500)

Run: 92, exploration: 0.01, score: 441
Scores: (min: 10, avg: 127.75, max: 500)

Run: 93, exploration: 0.01, score: 251
Scores: (min: 10, avg: 129.0752688172043, max: 500)

Run: 94, exploration: 0.01, score: 176
Scores: (min: 10, avg: 129.5744680851064, max: 500)

Run: 95, exploration: 0.01, score: 186
Scores: (min: 10, avg: 130.16842105263157, max: 500)

Run: 96, exploration: 0.01, score: 243
Scores: (min: 10, avg: 131.34375, max: 500)

Run: 97, exploration: 0.01, score: 164
Scores: (min: 10, avg: 131.68041237113403, max: 500)

Run: 98, exploration: 0.01, score: 321
Scores: (min: 10, avg: 133.6122448979592, max: 500)

Run: 99, exploration: 0.01, score: 156
Scores: (min: 10, avg: 133.83838383838383, max: 500)

Ru

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

# Exploration Factor Changed

    EXPLORATION_MAX: 1.0 --> 0.5

    EXPLORATION_MIN: 0.01 --> 0.1
    
Solved in 201 runs, 301 total runs.
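
To see what these two settings change, it may help to work out how long exploration keeps decaying. In the code cells, `experience_replay` multiplies the exploration rate by `EXPLORATION_DECAY` once per call and never lets it drop below `EXPLORATION_MIN`. The sketch below is only arithmetic on those three constants; the helper name `steps_until_floor` is hypothetical and not part of the notebook.

```python
import math


def steps_until_floor(exploration_max, exploration_min, decay):
    """Number of decay steps before the rate first reaches the minimum.

    Solves exploration_max * decay**n <= exploration_min for the smallest n.
    """
    return math.ceil(math.log(exploration_min / exploration_max) / math.log(decay))


# Settings from the original run and from this changed run.
original = steps_until_floor(1.0, 0.01, 0.995)
changed = steps_until_floor(0.5, 0.1, 0.995)

print("original settings:", original, "replay steps until the floor")
print("changed settings: ", changed, "replay steps until the floor")
```

With the changed settings the agent starts with less random exploration and reaches its floor sooner, but that floor is ten times higher, so it keeps taking random actions 10% of the time for the rest of training.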

In [6]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 0.5  
EXPLORATION_MIN = 0.1  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()

Run: 1, exploration: 0.47794478917877986, score: 29
Scores: (min: 29, avg: 29, max: 29)

Run: 2, exploration: 0.43671004801269353, score: 19
Scores: (min: 19, avg: 24, max: 29)

Run: 3, exploration: 0.39505248627351397, score: 21
Scores: (min: 19, avg: 23, max: 29)

Run: 4, exploration: 0.3609692879892581, score: 19
Scores: (min: 19, avg: 22, max: 29)

Run: 5, exploration: 0.33650644244751976, score: 15
Scores: (min: 15, avg: 20.6, max: 29)

Run: 6, exploration: 0.3121329338217698, score: 16
Scores: (min: 15, avg: 19.833333333333332, max: 29)

Run: 7, exploration: 0.29244193182929556, score: 14
Scores: (min: 14, avg: 19, max: 29)

Run: 8, exploration: 0.2809469295581664, score: 9
Scores: (min: 9, avg: 17.75, max: 29)

Run: 9, exploration: 0.2516124151989211, score: 23
Scores: (min: 9, avg: 18.333333333333332, max: 29)

Run: 10, exploration: 0.23222092916541243, score: 17
Scores: (min: 9, avg: 18.2, max: 29)

Run: 11, exploration: 0.21540092780399553, score: 16
Scores: (min: 9, avg: 18,

Run: 94, exploration: 0.1, score: 200
Scores: (min: 9, avg: 136.56382978723406, max: 296)

Run: 95, exploration: 0.1, score: 160
Scores: (min: 9, avg: 136.81052631578947, max: 296)

Run: 96, exploration: 0.1, score: 170
Scores: (min: 9, avg: 137.15625, max: 296)

Run: 97, exploration: 0.1, score: 51
Scores: (min: 9, avg: 136.2680412371134, max: 296)

Run: 98, exploration: 0.1, score: 14
Scores: (min: 9, avg: 135.0204081632653, max: 296)

Run: 99, exploration: 0.1, score: 299
Scores: (min: 9, avg: 136.67676767676767, max: 299)

Run: 100, exploration: 0.1, score: 170
Scores: (min: 9, avg: 137.01, max: 299)

Run: 101, exploration: 0.1, score: 173
Scores: (min: 9, avg: 138.45, max: 299)

Run: 102, exploration: 0.1, score: 79
Scores: (min: 9, avg: 139.05, max: 299)

Run: 103, exploration: 0.1, score: 9
Scores: (min: 9, avg: 138.93, max: 299)

Run: 104, exploration: 0.1, score: 10
Scores: (min: 9, avg: 138.84, max: 299)

Run: 105, exploration: 0.1, score: 11
Scores: (min: 9, avg: 138.8, max:

Run: 198, exploration: 0.1, score: 9
Scores: (min: 8, avg: 73.46, max: 361)

Run: 199, exploration: 0.1, score: 11
Scores: (min: 8, avg: 70.58, max: 361)

Run: 200, exploration: 0.1, score: 9
Scores: (min: 8, avg: 68.97, max: 361)

Run: 201, exploration: 0.1, score: 9
Scores: (min: 8, avg: 67.33, max: 361)

Run: 202, exploration: 0.1, score: 9
Scores: (min: 8, avg: 66.63, max: 361)

Run: 203, exploration: 0.1, score: 9
Scores: (min: 8, avg: 66.63, max: 361)

Run: 204, exploration: 0.1, score: 11
Scores: (min: 8, avg: 66.64, max: 361)

Run: 205, exploration: 0.1, score: 12
Scores: (min: 8, avg: 66.65, max: 361)

Run: 206, exploration: 0.1, score: 54
Scores: (min: 8, avg: 67.08, max: 361)

Run: 207, exploration: 0.1, score: 8
Scores: (min: 8, avg: 67.06, max: 361)

Run: 208, exploration: 0.1, score: 10
Scores: (min: 8, avg: 67.07, max: 361)

Run: 209, exploration: 0.1, score: 10
Scores: (min: 8, avg: 67.08, max: 361)

Run: 210, exploration: 0.1, score: 9
Scores: (min: 8, avg: 67.07, max:

NameError: name 'exit' is not defined

# Learning Rate Changed

    LEARNING_RATE: 0.001 --> 0.0001
    
Solved in 366 runs, 466 total runs.

In [9]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.0001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()

Run: 1, exploration: 1.0, score: 16
Scores: (min: 16, avg: 16, max: 16)

Run: 2, exploration: 0.946354579813443, score: 15
Scores: (min: 15, avg: 15.5, max: 16)

Run: 3, exploration: 0.8348931673187264, score: 26
Scores: (min: 15, avg: 19, max: 26)

Run: 4, exploration: 0.7402609576967045, score: 25
Scores: (min: 15, avg: 20.5, max: 26)

Run: 5, exploration: 0.697046600835495, score: 13
Scores: (min: 13, avg: 19, max: 26)

Run: 6, exploration: 0.6401093727576664, score: 18
Scores: (min: 13, avg: 18.833333333333332, max: 26)

Run: 7, exploration: 0.5790496471185967, score: 21
Scores: (min: 13, avg: 19.142857142857142, max: 26)

Run: 8, exploration: 0.5290920728090721, score: 19
Scores: (min: 13, avg: 19.125, max: 26)

Run: 9, exploration: 0.4982051627146237, score: 13
Scores: (min: 13, avg: 18.444444444444443, max: 26)

Run: 10, exploration: 0.46211964903917074, score: 16
Scores: (min: 13, avg: 18.2, max: 26)

Run: 11, exploration: 0.42650460709830135, score: 17
Scores: (min: 13, avg: 1

Run: 85, exploration: 0.01, score: 64
Scores: (min: 8, avg: 19.305882352941175, max: 90)

Run: 86, exploration: 0.01, score: 18
Scores: (min: 8, avg: 19.290697674418606, max: 90)

Run: 87, exploration: 0.01, score: 22
Scores: (min: 8, avg: 19.32183908045977, max: 90)

Run: 88, exploration: 0.01, score: 23
Scores: (min: 8, avg: 19.363636363636363, max: 90)

Run: 89, exploration: 0.01, score: 58
Scores: (min: 8, avg: 19.797752808988765, max: 90)

Run: 90, exploration: 0.01, score: 26
Scores: (min: 8, avg: 19.866666666666667, max: 90)

Run: 91, exploration: 0.01, score: 47
Scores: (min: 8, avg: 20.164835164835164, max: 90)

Run: 92, exploration: 0.01, score: 34
Scores: (min: 8, avg: 20.315217391304348, max: 90)

Run: 93, exploration: 0.01, score: 28
Scores: (min: 8, avg: 20.397849462365592, max: 90)

Run: 94, exploration: 0.01, score: 24
Scores: (min: 8, avg: 20.43617021276596, max: 90)

Run: 95, exploration: 0.01, score: 43
Scores: (min: 8, avg: 20.673684210526314, max: 90)

Run: 96, exp

Run: 186, exploration: 0.01, score: 173
Scores: (min: 21, avg: 117.14, max: 434)

Run: 187, exploration: 0.01, score: 175
Scores: (min: 21, avg: 118.67, max: 434)

Run: 188, exploration: 0.01, score: 194
Scores: (min: 21, avg: 120.38, max: 434)

Run: 189, exploration: 0.01, score: 204
Scores: (min: 21, avg: 121.84, max: 434)

Run: 190, exploration: 0.01, score: 170
Scores: (min: 21, avg: 123.28, max: 434)

Run: 191, exploration: 0.01, score: 191
Scores: (min: 21, avg: 124.72, max: 434)

Run: 192, exploration: 0.01, score: 192
Scores: (min: 21, avg: 126.3, max: 434)

Run: 193, exploration: 0.01, score: 183
Scores: (min: 21, avg: 127.85, max: 434)

Run: 194, exploration: 0.01, score: 159
Scores: (min: 21, avg: 129.2, max: 434)

Run: 195, exploration: 0.01, score: 201
Scores: (min: 21, avg: 130.78, max: 434)

Run: 196, exploration: 0.01, score: 184
Scores: (min: 21, avg: 131.54, max: 434)

Run: 197, exploration: 0.01, score: 205
Scores: (min: 21, avg: 133.26, max: 434)

Run: 198, explorat

Run: 286, exploration: 0.01, score: 182
Scores: (min: 143, avg: 168.11, max: 216)

Run: 287, exploration: 0.01, score: 177
Scores: (min: 143, avg: 168.13, max: 216)

Run: 288, exploration: 0.01, score: 157
Scores: (min: 143, avg: 167.76, max: 216)

Run: 289, exploration: 0.01, score: 161
Scores: (min: 143, avg: 167.33, max: 216)

Run: 290, exploration: 0.01, score: 159
Scores: (min: 143, avg: 167.22, max: 216)

Run: 291, exploration: 0.01, score: 165
Scores: (min: 143, avg: 166.96, max: 216)

Run: 292, exploration: 0.01, score: 165
Scores: (min: 143, avg: 166.69, max: 216)

Run: 293, exploration: 0.01, score: 149
Scores: (min: 143, avg: 166.35, max: 216)

Run: 294, exploration: 0.01, score: 155
Scores: (min: 143, avg: 166.31, max: 216)

Run: 295, exploration: 0.01, score: 172
Scores: (min: 143, avg: 166.02, max: 216)

Run: 296, exploration: 0.01, score: 169
Scores: (min: 143, avg: 165.87, max: 216)

Run: 297, exploration: 0.01, score: 170
Scores: (min: 143, avg: 165.52, max: 216)

Run:

Run: 385, exploration: 0.01, score: 184
Scores: (min: 140, avg: 187.44, max: 459)

Run: 386, exploration: 0.01, score: 198
Scores: (min: 140, avg: 187.6, max: 459)

Run: 387, exploration: 0.01, score: 200
Scores: (min: 140, avg: 187.83, max: 459)

Run: 388, exploration: 0.01, score: 190
Scores: (min: 140, avg: 188.16, max: 459)

Run: 389, exploration: 0.01, score: 213
Scores: (min: 140, avg: 188.68, max: 459)

Run: 390, exploration: 0.01, score: 165
Scores: (min: 140, avg: 188.74, max: 459)

Run: 391, exploration: 0.01, score: 168
Scores: (min: 140, avg: 188.77, max: 459)

Run: 392, exploration: 0.01, score: 169
Scores: (min: 140, avg: 188.81, max: 459)

Run: 393, exploration: 0.01, score: 177
Scores: (min: 140, avg: 189.09, max: 459)

Run: 394, exploration: 0.01, score: 208
Scores: (min: 140, avg: 189.62, max: 459)

Run: 395, exploration: 0.01, score: 167
Scores: (min: 140, avg: 189.57, max: 459)

Run: 396, exploration: 0.01, score: 174
Scores: (min: 140, avg: 189.62, max: 459)

Run: 

NameError: name 'exit' is not defined

# Discount Factor (Gamma) Changed

    GAMMA: 0.95 --> 0.99
    
Solved in 84 runs, 184 total runs.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.99  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()

Using TensorFlow backend.


Run: 1, exploration: 0.946354579813443, score: 31
Scores: (min: 31, avg: 31, max: 31)

Run: 2, exploration: 0.9046104802746175, score: 10
Scores: (min: 10, avg: 20.5, max: 31)

Run: 3, exploration: 0.8183201210226743, score: 21
Scores: (min: 10, avg: 20.666666666666668, max: 31)

Run: 4, exploration: 0.778312557068642, score: 11
Scores: (min: 10, avg: 18.25, max: 31)

Run: 5, exploration: 0.7076077347272662, score: 20
Scores: (min: 10, avg: 18.6, max: 31)

Run: 6, exploration: 0.6662995813682115, score: 13
Scores: (min: 10, avg: 17.666666666666668, max: 31)

Run: 7, exploration: 0.5937455908197752, score: 24
Scores: (min: 10, avg: 18.571428571428573, max: 31)

Run: 8, exploration: 0.5618938591163328, score: 12
Scores: (min: 10, avg: 17.75, max: 31)

Run: 9, exploration: 0.531750826943791, score: 12
Scores: (min: 10, avg: 17.11111111111111, max: 31)

Run: 10, exploration: 0.49571413690105054, score: 15
Scores: (min: 10, avg: 16.9, max: 31)

Run: 11, exploration: 0.46444185833082485, sco

Run: 82, exploration: 0.018042124582040707, score: 9
Scores: (min: 8, avg: 11, max: 31)

Run: 83, exploration: 0.017332943577287888, score: 9
Scores: (min: 8, avg: 10.975903614457831, max: 31)

Run: 84, exploration: 0.016485538227150154, score: 11
Scores: (min: 8, avg: 10.976190476190476, max: 31)

Run: 85, exploration: 0.01583754189442009, score: 9
Scores: (min: 8, avg: 10.952941176470588, max: 31)

Run: 86, exploration: 0.014987930304771725, score: 12
Scores: (min: 8, avg: 10.965116279069768, max: 31)

Run: 87, exploration: 0.013422995398979608, score: 23
Scores: (min: 8, avg: 11.10344827586207, max: 31)

Run: 88, exploration: 0.01283090141222608, score: 10
Scores: (min: 8, avg: 11.090909090909092, max: 31)

Run: 89, exploration: 0.012388500230681249, score: 8
Scores: (min: 8, avg: 11.0561797752809, max: 31)

Run: 90, exploration: 0.01184203826198843, score: 10
Scores: (min: 8, avg: 11.044444444444444, max: 31)

Run: 91, exploration: 0.01137656378004644, score: 9
Scores: (min: 8, avg

Run: 181, exploration: 0.01, score: 156
Scores: (min: 8, avg: 188.68, max: 489)

Run: 182, exploration: 0.01, score: 296
Scores: (min: 8, avg: 191.55, max: 489)

Run: 183, exploration: 0.01, score: 225
Scores: (min: 8, avg: 193.71, max: 489)

Run: 184, exploration: 0.01, score: 210
Scores: (min: 8, avg: 195.7, max: 489)

Solved in 84 runs, 184 total runs.


NameError: name 'exit' is not defined


# Explain how reinforcement learning concepts apply to the cartpole problem.
## What is the goal of the agent in this case?
In the Cartpole problem, the goal of the agent is to balance a pole that is attached to a cart for as long as possible by applying forces to the cart (Surma, 2018). 

## What are the various state values?
The observation in the cartpole problem has four state values:
- The cart's position
- The cart's velocity
- The pole's angle
- The pole's angular velocity
(Surma, 2018)
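
As a small illustration of the list above, the four values arrive as one array, in the order shown. The `CartPoleState` container and the `describe` helper below are hypothetical names used only for this sketch, and the numbers are made up.

```python
from typing import NamedTuple, Sequence


class CartPoleState(NamedTuple):
    """Hypothetical container that names the four observation values."""
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_angular_velocity: float


def describe(observation: Sequence[float]) -> CartPoleState:
    # The environment reports the four values in this order.
    return CartPoleState(*observation)


# Made-up observation, only to show the layout.
state = describe([0.02, -0.01, 0.03, 0.04])
print(state)
print("pole angle:", state.pole_angle)
```

This is why `observation_space` in the code cells evaluates to 4: it is the length of this array.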

## What are the possible actions that can be performed?
Only two actions are possible: push the cart to the right (action value 1) or push the cart to the left (action value 0) (Gym Documentation, n.d.).

## What reinforcement algorithm is used for this problem?
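This notebook uses deep Q-learning (the `DQNSolver` class), which is Q-learning with a neural network estimating the Q-values; other algorithms could also solve the problem. Q-learning is a type of reinforcement learning in which the agent learns to make better decisions by interacting with an environment, here the CartPole environment from OpenAI's Gym.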
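
For comparison, here is a minimal sketch of the plain (tabular) Q-learning update, without a neural network. It is not the notebook's method: the step size `ALPHA`, the string state labels, and the `q_update` helper are assumptions made for illustration. The target it computes (the reward alone on a terminal step, otherwise the reward plus `GAMMA` times the best next Q-value) follows the same rule as `experience_replay` in the code cells.

```python
from collections import defaultdict

GAMMA = 0.95   # discount factor, same value as the original run
ALPHA = 0.1    # hypothetical tabular step size, not a setting from the notebook


def q_update(q_table, state, action, reward, next_state, done, n_actions=2):
    """Apply one Q-learning update to q_table and return the new Q(state, action)."""
    # On a terminal step there is no future reward to add.
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + GAMMA * best_next
    # Move the old estimate a fraction ALPHA of the way toward the target.
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
    return q_table[(state, action)]


q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
after_step = q_update(q, state="s0", action=1, reward=1.0, next_state="s1", done=False)
after_fall = q_update(q, state="s1", action=0, reward=-1.0, next_state="s2", done=True)
print(after_step, after_fall)
```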

# Analyze how experience replay is applied to the cartpole problem.
## How does experience replay work in this algorithm?
Experience replay stores the values associated with each step: the state, the action taken, the reward, the next state, and whether the episode ended (DeepLizard, 2018). The model can then draw on these stored experiences in order to learn, just as a human uses past experiences to guide decisions.
A simple breakdown of how experience replay works in the cartpole problem: after each step, the agent draws a random sample of stored experiences and applies a Q-learning update to each one. "A key reason for using replay memory is to break the correlation between consecutive samples" (DeepLizard, 2018).
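
The storage and sampling steps can be sketched on their own, without the neural network. This uses the same `deque` and `random.sample` calls as the code cells, but with a deliberately tiny buffer and made-up transitions so the behavior is easy to see.

```python
import random
from collections import deque

MEMORY_SIZE = 5   # tiny on purpose; the notebook uses 1000000
BATCH_SIZE = 3    # the notebook uses 20

memory = deque(maxlen=MEMORY_SIZE)

# Store made-up transitions: (state, action, reward, next_state, done).
for step in range(8):
    memory.append((step, step % 2, 1.0, step + 1, False))

# Once the buffer is full, each append silently drops the oldest transition.
oldest_state = memory[0][0]

# Sampling at random, without replacement, mixes old and new steps,
# so the batch is not a run of consecutive steps from one episode.
batch = random.sample(list(memory), BATCH_SIZE)

print("stored:", len(memory), "oldest state kept:", oldest_state)
print("sampled states:", [transition[0] for transition in batch])
```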

## What is the effect of introducing a discount factor for calculating the future rewards?
The discount factor (gamma) determines how much future rewards count relative to immediate ones. Rewards are important because, just like when training your dog, rewards help reinforce the desired behavior. A larger discount factor puts more emphasis on long-term rewards, whereas a lower one makes the agent greedier, so it tries to collect rewards as soon as possible (NitinAgarwal494, 2020). Introducing a discount factor for calculating the future rewards can also help stabilize the learning process (NitinAgarwal494, 2020; Singh, n.d.).
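
A short worked example shows how much the two gamma values used in this notebook differ. It assumes a reward of 1 on every step of a full 500-step episode, which is the longest score that appears in the output above; the `discounted_return` helper is written for this sketch and is not part of the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 500  # assumed: reward of 1 per step for a 500-step episode

for gamma in (0.95, 0.99):
    print("gamma =", gamma, "discounted return =", round(discounted_return(rewards, gamma), 2))
```

With gamma at 0.95 the total levels off near 20, so rewards more than a few dozen steps ahead barely register. With gamma at 0.99 the total is close to 100, so the agent values staying upright much further into the future.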

# Analyze how neural networks are used in deep Q-learning.
## Explain the neural network architecture that is used in the cartpole problem.
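In our case the network takes four inputs for the cartpole problem (one for each state value, given by `observation_space`) and has two output neurons (one for each of the actions that can be taken, given by `action_space`). In between are two hidden layers of 24 neurons each, both using ReLU activation; the first of these is the layer that receives the input. The output layer uses a linear activation, so its two values are the estimated Q-values for the two actions.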
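
Reading the `Sequential` model in the code cells, the layer widths are 4 inputs, then `Dense(24)`, `Dense(24)`, and a final `Dense` with one unit per action (2). A quick way to get a feel for the size of this network is to count its trainable parameters by hand. The `dense_params` helper is a name made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units


# Inputs, two hidden layers, outputs, as declared in the code cells.
layer_sizes = [4, 24, 24, 2]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print("parameters per layer:", per_layer)
print("total parameters:", total)
```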

## How does the neural network make the Q-learning algorithm more efficient?
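According to Penick's article *Neural Network Q Learning*, "Neural Network Q Learning has been shown to outperform traditional Q Learning algorithms in scenarios with large state spaces and complex decision-making processes" (Penick, 2023). This means that Q-learning with a neural network is more efficient than Q-learning without one in problems like this. Instead of storing a separate Q-value for every possible state, the network estimates the Q-values directly from the four state values, which matters here because those values are continuous.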
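
One way to picture the "large state spaces" point is to ask how big a Q-table would need to be for cartpole. This is a hypothetical comparison, not something the notebook does: it assumes each of the four state values is cut into a fixed number of bins, and the bin counts are arbitrary.

```python
N_STATE_VALUES = 4  # cart position, cart velocity, pole angle, pole angular velocity
N_ACTIONS = 2       # push left, push right


def q_table_entries(bins_per_value):
    """Entries in a Q-table if every state value is split into the same number of bins."""
    return (bins_per_value ** N_STATE_VALUES) * N_ACTIONS


# Arbitrary bin counts, chosen only to show how fast the table grows.
for bins in (10, 50, 100):
    print(bins, "bins per value ->", q_table_entries(bins), "table entries")
```

The table grows with the fourth power of the resolution, while the network in the code cells keeps the same fixed set of weights however finely the state is measured.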

## What difference do you see in the algorithm performance when you increase or decrease the learning rate?
I divided the learning rate by 10 (from 0.001 to 0.0001) and saw a dramatic slowdown: the problem was solved in 366 runs, compared with 37 runs for the original settings. I believe I also attempted multiplying the learning rate by 10, but that run went on for so long that I gave up on it and started over. I believe it was sitting at over 1200 runs and over an hour and a half of running time.
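
The general effect of the learning rate can be seen on a toy problem. The sketch below is plain gradient descent on f(x) = x**2, not the DQN and not the Adam optimizer used in the code cells, and its learning rates are chosen for the toy problem only. It is meant as an analogy for why a smaller rate learns more slowly and a rate that is too large may never settle.

```python
def steps_to_converge(learning_rate, start=1.0, tolerance=1e-3, max_steps=10_000):
    """Gradient descent on f(x) = x**2.

    Returns the number of steps until |x| < tolerance,
    or None if x blows up or never gets there.
    """
    x = start
    for step in range(1, max_steps + 1):
        x -= learning_rate * 2 * x  # the gradient of x**2 is 2x
        if abs(x) < tolerance:
            return step
        if abs(x) > 1e6:  # the updates are overshooting further each time
            return None
    return None


for rate in (0.1, 0.01, 1.1):
    print("learning rate", rate, "->", steps_to_converge(rate))
```

Dividing the toy learning rate by 10 makes convergence take roughly ten times as many steps, and the largest rate overshoots the minimum on every update and never converges.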

# References
- *Cart Pole - gym documentation*. (n.d.). https://www.gymlibrary.dev/environments/classic_control/cart_pole/ 

- NitinAgarwal494. (2020, April 6). *Reinforcement Learning (Q-learning) – An Introduction (Part 1)*. Data Science Central. https://www.datasciencecentral.com/reinforcement-learning-q-learning-an-introduction-part-1/ 

- Penick, H. (2023, December 17). *Neural Network Q Learning.* Get Neural Net. https://getneuralnet.com/neural-network-q-learning/ 

- *Replay memory explained - Experience for deep Q-Network training*. (n.d.). Deeplizard. https://deeplizard.com/learn/video/Bcuj2fTH4_4 

- Singh, S. (2022, October 7). *How are neural networks used in deep Q-Learning?* https://www.turing.com/kb/how-are-neural-networks-used-in-deep-q-learning 

- Surma, G. (2021, October 13). *Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning)*. Medium. https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288