# Deep Reinforcement Learning - Cartpole Game (Code Along)

Code Along with - https://www.youtube.com/watch?v=OYhFoMySoVs

## Background

*The Cartpole Game*  
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. Use the arrow keys to apply a force on the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A score of +1 is provided for every move that the pole remains upright. The game ends when the pole is more than 15 degrees from vertical, or the cart touches the edges.  
[Source](https://fluxml.ai/experiments/cartPole/)

#### Important Terms

1. **Environment** - The world in which our agent acts. Returns **state** and **reward** to the **agent**.
2. **Agent** - The Actor that makes decisions in the environment. This will be what handles our model. Takes in **state** and **reward**, then returns **action**.
3. **State** - The important variables in our environment. In this case, it will be cart position, cart velocity, pole angle, and pole angular velocity.
4. **Reward** - Feedback for the model to determine how successful an action was. In this case, the agent receives +1 for every move the pole stays upright, so maximizing reward means maximizing the length of the game.
5. **Action** - What the Agent does in response to the given environment. This will be what our model is attempting to predict. In this case, it will be deciding whether to move the cart left or right.

#### Markov Decision Process

+ **S** - All Possible States
+ **A** - List of Possible Actions
+ **R** - Reward Distribution given (s,a)
+ **P** - Transition Probability Distribution of S[t+1] given (s,a)
+ **D** - Discount Factor

*The objective is to maximize the sum of D^t × r_t over all steps t, where t counts the actions taken from the current state and r_t is the reward received at step t.*
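
The discounted sum in the objective can be computed directly. The following is a minimal sketch; the function name `discounted_return` and the example reward list are illustrative assumptions, not part of the original notebook.

```python
def discounted_return(rewards, discount):
    """Return the sum of discount**t * rewards[t] for t = 0, 1, 2, ..."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (discount ** t) * reward
    return total


# Three surviving CartPole steps (+1 each) with a discount factor of 0.95:
# 1 + 0.95 + 0.95**2, which is roughly 2.8525
print(discounted_return([1.0, 1.0, 1.0], 0.95))
```

With a discount factor below 1, rewards that arrive later count for less, which keeps the sum finite even for very long episodes.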

#### Q Learning Value Functions

+ **Value Function (State)** - Expected reward given current state. In this situation, if the pole is completely vertical, this will be high, as the likelihood of success is high.
+ **Q-Value Function (State, Action)** - Expected reward given current state and action. 
+ **Q*-Value Function (State, Action, Deep Q-Learning Network)** - Expected reward given current state and action through the lens of the Deep Q-Learning Network
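
The `replay` method defined later in this notebook trains the network toward a one-step target built from these Q-values. As a sketch, writing the discount factor D as $\gamma$ (the code calls it `gamma`) and using $s'$ and $a'$ for the next state and the candidate next actions (notation introduced here, not in the original), that target is:

$$
y =
\begin{cases}
r & \text{if the episode ended on this step} \\
r + \gamma \max_{a'} Q(s', a') & \text{otherwise}
\end{cases}
$$

The case split mirrors the `if not done` branch in `replay`.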

## Dependencies, Environment, and Parameters

In [1]:
import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

In [2]:
import numpy as np
import random
import gym
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


Using plaidml.keras.backend backend.


In [3]:
env = gym.make('CartPole-v0')

In [4]:
state_size = env.observation_space.shape[0]
state_size

4

In [5]:
action_size = env.action_space.n
action_size

2

In [6]:
batch_size = 32

In [7]:
n_episodes = 1001

In [8]:
done = False

In [9]:
output_dir = 'model_output/cartpole'

In [10]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

## Define Agent

In [11]:
class DQNAgent:
    
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        
        #Used for sampling from past experiences. This is important for making sure there is enough variety in the actions
        self.memory = deque(maxlen=2000)
        
        #Discount factor applied to future rewards
        self.gamma = 0.95
        
        #Helps balance exploitation vs exploration
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        
        #Step size for our optimizer
        self.learning_rate = 0.001
        
        self.model = self._build_model()
        
    def _build_model(self):
        
        # Set up Model
        model = Sequential()
        
        # Hidden Layers
        model.add(Dense(24, input_dim = self.state_size, activation='relu'))
        model.add(Dense(24, activation = 'relu'))
        
        # Output Layer
        model.add(Dense(self.action_size, activation='linear'))
        
        #Compile Model
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        
        return model
    
    #Create Datapoint for learning
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    
    #Determine explore or exploit
    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])
    
    #Uses batch of memories to train the model
    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            
            self.model.fit(state, target_f, epochs=1, verbose=0)
            
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
    
    #Load model
    def load(self,name):
        self.model.load_weights(name)
    
    #Save Model
    def save(self, name):
        self.model.save_weights(name)
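
The target logic inside `replay` can be isolated from Keras to see what the network is asked to fit. The following is a minimal sketch in which plain Python lists stand in for the arrays returned by `model.predict`; the helper names `q_learning_target` and `build_training_row`, and all Q-values shown, are made-up assumptions for illustration.

```python
def q_learning_target(reward, next_q_values, done, gamma=0.95):
    """Target for the action taken, mirroring the `if not done` branch in replay()."""
    if done:
        # Terminal step: there is no future reward to add
        return reward
    return reward + gamma * max(next_q_values)


def build_training_row(current_q_values, action, target):
    """Copy the current predictions and overwrite only the action that was taken."""
    row = list(current_q_values)
    row[action] = target
    return row


# Hypothetical predictions for one remembered transition
current_q = [0.5, 1.5]   # Q(state, left), Q(state, right)
next_q = [1.0, 2.0]      # Q(next_state, left), Q(next_state, right)

target = q_learning_target(reward=1.0, next_q_values=next_q, done=False)
print(build_training_row(current_q, action=0, target=target))
```

Only the entry for the chosen action changes, so the mean-squared-error loss is zero for the other action and the update only pushes the Q-value of the action that was actually taken.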

In [12]:
agent = DQNAgent(state_size, action_size)

INFO:plaidml:Opening device "metal_amd_radeon_rx_5700_xt.0"
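
The exploration settings determine how long the agent keeps acting randomly. The following is a minimal sketch that counts how many calls to `replay` it takes for epsilon to fall to its floor, using the same defaults as `DQNAgent`; the function name `decays_until_minimum` is an illustrative assumption.

```python
def decays_until_minimum(epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
    """Count the multiplicative decays applied before epsilon stops shrinking."""
    count = 0
    # Same condition replay() checks before decaying epsilon
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        count += 1
    return count


print(decays_until_minimum())  # 919
```

Because `replay` runs at most once per episode in the training loop of this notebook, the agent would only reach its minimum exploration rate after roughly 919 episodes with these settings.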


## Interact With Environment

In [13]:
done = False

for e in range(n_episodes): 
    #Reset Game
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    
    for time in range(5000):  
        #Render Environment
        #env.render()
        
        #Get Action
        action = agent.act(state)
        
        #Process Action in Environment
        next_state, reward, done, _ = env.step(action) 
        
        #Punish Failures
        reward = reward if not done else -10     
        
        #Store Datapoint
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done) 
        
        state = next_state    
        if done: 
            print("episode: {}/{}, score: {}, e: {:.2}" 
                  .format(e, n_episodes, time, agent.epsilon))
            break
        
    #When Memory filled to batch size, train model on batch
    if len(agent.memory) > batch_size:
        agent.replay(batch_size) 
        
    if e % 50 == 0:
        agent.save(os.path.join(output_dir, "weights_{:04d}.hdf5".format(e)))

episode: 0/300, score: 42, e: 1.0
episode: 1/300, score: 28, e: 0.99
episode: 2/300, score: 14, e: 0.99
episode: 3/300, score: 43, e: 0.99
episode: 4/300, score: 25, e: 0.98
episode: 5/300, score: 31, e: 0.98
episode: 6/300, score: 19, e: 0.97
episode: 7/300, score: 22, e: 0.97
episode: 8/300, score: 12, e: 0.96
episode: 9/300, score: 19, e: 0.96
episode: 10/300, score: 13, e: 0.95
episode: 11/300, score: 38, e: 0.95
episode: 12/300, score: 37, e: 0.94
episode: 13/300, score: 23, e: 0.94
episode: 14/300, score: 20, e: 0.93
episode: 15/300, score: 24, e: 0.93
episode: 16/300, score: 12, e: 0.92
episode: 17/300, score: 37, e: 0.92
episode: 18/300, score: 20, e: 0.91
episode: 19/300, score: 42, e: 0.91
episode: 20/300, score: 36, e: 0.9
episode: 21/300, score: 26, e: 0.9
episode: 22/300, score: 33, e: 0.9
episode: 23/300, score: 19, e: 0.89
episode: 24/300, score: 10, e: 0.89
episode: 25/300, score: 27, e: 0.88
episode: 26/300, score: 69, e: 0.88
episode: 27/300, score: 18, e: 0.87
...

episode: 225/300, score: 65, e: 0.32
episode: 226/300, score: 72, e: 0.32
episode: 227/300, score: 82, e: 0.32
episode: 228/300, score: 88, e: 0.32
episode: 229/300, score: 95, e: 0.32
episode: 230/300, score: 181, e: 0.32
episode: 231/300, score: 97, e: 0.31
episode: 232/300, score: 106, e: 0.31
episode: 233/300, score: 36, e: 0.31
episode: 234/300, score: 69, e: 0.31
episode: 235/300, score: 99, e: 0.31
episode: 236/300, score: 46, e: 0.31
episode: 237/300, score: 199, e: 0.3
episode: 238/300, score: 55, e: 0.3
episode: 239/300, score: 52, e: 0.3
episode: 240/300, score: 82, e: 0.3
episode: 241/300, score: 72, e: 0.3
episode: 242/300, score: 124, e: 0.3
episode: 243/300, score: 64, e: 0.3
episode: 244/300, score: 64, e: 0.29
episode: 245/300, score: 63, e: 0.29
episode: 246/300, score: 97, e: 0.29
episode: 247/300, score: 88, e: 0.29
episode: 248/300, score: 88, e: 0.29
episode: 249/300, score: 65, e: 0.29
episode: 250/300, score: 70, e: 0.29
episode: 251/300, score: 94, e: 0.28
...