## CartPole Gym
*If code differs between the notebook and the curriculum, go with the notebook!!!*

### Imports

In [1]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

import numpy as np
import gym

Using TensorFlow backend.


### Creating the environment

In [2]:
env = gym.make('CartPole-v0')

In [3]:
action_space = env.action_space.n
print(action_space)

2


In [4]:
def getStateSize():
    # CartPole observations are a flat vector, so the first
    # dimension of the observation space is the state size
    return env.observation_space.shape[0]

In [5]:
state_space = getStateSize()
print(state_space)

4


### Random games test
We did this in the previous lesson, so skip it if you wish. We'll see the game played later.

In [6]:
def some_random_games_first():
    for episode in range(5):
        env.reset()
        for t in range(500):
            env.render()
            action = env.action_space.sample()
            observation, reward, done, info = env.step(action)
            if done:
                env.close()
                break

# uncomment line below to watch the random agent play
# some_random_games_first()                

### Creating the model

In [6]:
model = Sequential()

model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(action_space, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

Instructions for updating:
Colocations handled automatically by placer.


### Generating data and training

In [7]:
# This function is crucially important! Mind the details!
#
def initial_data(number_of_games, game_turns, acceptable_score):
    # lists for features X and labels y
    # i.e., states of the environment 
    # (observations) are in X and the 
    # proposed actions to take are in y
    X = []
    y = []
    # one-hot encoded vector for actions,
    # initialized to all zeros
    one_hot = [0 for i in range(action_space)]
    # How many games should we play? A 
    # good place to start is to balance
    # number of games and acceptable score
    # so that you're producing at least a 
    # few thousand examples
    for i in range(number_of_games):
        env.reset()
        # game_memory is new. It's described 
        # in the curriculum!
        game_memory = []
        prev_obs = []
        score = 0
        # We want a max number of game turns
        # so that the game doesn't run forever
        # on a bad set of inputs. Only relevant
        # to certain games, however.
        for turn in range(game_turns):
            # we're just collecting data from
            # games played by a random agent. If
            # it gets to play 1000s of games, it
            # will occasionally do well!
            action = env.action_space.sample()
            new_obs, reward, done, info = env.step(action)
            # summing the final score
            score += int(reward)
            # the first turn (or 0th turn) has no 
            # prev_obs, so skip it. otherwise,
            # we tack 1) the previous observation
            # and 2) the action taken during that 
            # state onto the game_memory
            if turn > 0:
                game_memory.append([prev_obs, int(action)])
            # we cycle the obs so that on each
            # step we have the previous obs 
            # stored when we receive the new one
            prev_obs = new_obs
            # if the round finished, we want to
            # break out of this for loop
            if done:
                break
                
        # this occurs after each game completes. 
        # if the score from that game is above 
        # the threshold, we append that entire
        # game onto X for training!
        if score >= acceptable_score:
            for data in game_memory:
                X.append(np.array(data[0]).reshape(1, len(data[0])))
                # the next two lines create our one hot
                # labels array. We just set the index of
                # our desired move to be a 1
                predicted_action = list(one_hot)
                predicted_action[data[1]] = 1
                y.append(np.array(predicted_action).reshape(1, action_space))
    print('{} examples were made.'.format(len(X)))
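    # use state_space here so the return does not depend on a loop
    # variable that only exists if at least one game qualified
    return np.array(X).reshape(-1, 1, state_space), np.array(y).reshape(-1, 1, action_space)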

In [9]:
X, y = initial_data(3000, 200, 60)

2986 examples were made.
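
The filtering and labeling steps inside initial_data can be tried out without the environment. What follows is a minimal plain-Python sketch, not the notebook's code: the helper names one_hot_label and keep_good_games and the two toy games are made up for illustration.

```python
def one_hot_label(action, n_actions):
    """Return a list of zeros with a 1 at the index of the chosen action."""
    label = [0] * n_actions
    label[action] = 1
    return label


def keep_good_games(games, acceptable_score, n_actions=2):
    """Flatten the (observation, action) pairs of games that met the threshold."""
    features, labels = [], []
    for score, game_memory in games:
        # games below the threshold are dropped entirely
        if score >= acceptable_score:
            for obs, action in game_memory:
                features.append(obs)
                labels.append(one_hot_label(action, n_actions))
    return features, labels


# two made-up games: (score, [(observation, action), ...])
toy_games = [
    (75, [([0.01, 0.02, 0.03, 0.04], 1), ([0.02, 0.21, 0.03, -0.25], 0)]),
    (12, [([0.00, -0.01, 0.02, 0.03], 0)]),
]
toy_X, toy_y = keep_good_games(toy_games, acceptable_score=60)
print(len(toy_X), toy_y)  # 2 [[0, 1], [1, 0]]
```

Only the first toy game clears the threshold of 60, so both of its turns become training examples and the second game contributes nothing.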


A word on validation_split. It reserves a given fraction of the input data for validation. Basically, it's a way of keeping an eye on how well the model generalizes while it's training. If the training loss keeps improving while the loss on the held-out data stalls or gets worse, the model is overfitting. This will be more useful in future projects.
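
To make the split concrete, here is a small sketch of the arithmetic, assuming the held-out portion is cut from the end of the data at int(n * (1 - validation_split)). The function name split_for_validation is hypothetical; the result agrees with the sample counts reported in the training log of this notebook.

```python
def split_for_validation(n_samples, validation_split):
    """Return (n_train, n_val) when the tail of the data is held out."""
    # assumption: the cut point is rounded down to a whole sample
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at


n_train, n_val = split_for_validation(2986, 0.2)
print('Train on {} samples, validate on {} samples'.format(n_train, n_val))
```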

For now, this model only needs a single epoch to get good results. The accuracy and loss measures here clearly don't reflect how well the model actually plays. 60% accuracy, but acing it every time? It's a bit more complicated than that: the labels are just the moves a random agent happened to make in its better games, so matching them is not the same as balancing the pole. We definitely need to see the model in action.
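
One way to build intuition for the low accuracy number is to score a sensible fixed policy against labels drawn at random. The sketch below is a toy under stated assumptions, not a measurement of the notebook's model: the function name, the angle range, and the hand-written "push toward the lean" policy are all made up for illustration, and the labels here are fully random, whereas the notebook's labels come from the better random games.

```python
import random


def accuracy_against_random_labels(n_samples=10000, seed=0):
    """Score a fixed, sensible policy against randomly sampled action labels."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_samples):
        # hypothetical pole angle in radians; the range is illustrative only
        pole_angle = rng.uniform(-0.2, 0.2)
        # hand-written policy: push the cart toward the side the pole leans
        policy_action = 1 if pole_angle > 0 else 0
        # label drawn independently of the state, as a random agent would
        random_label = rng.randint(0, 1)
        matches += int(policy_action == random_label)
    return matches / n_samples


print(accuracy_against_random_labels())
```

The printed value should land near 0.5. A policy can be perfectly consistent and still agree with random labels only about half the time, which is why accuracy against this kind of data is a weak signal of playing strength.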

In [10]:
model.fit(x=X, y=y, epochs=1, verbose=2, validation_split=0.2)

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 2388 samples, validate on 598 samples
Epoch 1/1
 - 0s - loss: 0.6736 - acc: 0.5942 - val_loss: 0.6679 - val_acc: 0.6154


<keras.callbacks.History at 0x13a6c44a8>

### Playing

In [8]:
def play_game(n_games, model=None):
    for i in range(n_games):
        env.reset()
        prev_obs = []
        score = 0
        done = False
        while not done:
            env.render()
            # If no data is loaded up, take a random action
            # If we're using a model, this will 
            # only happen on the 0th step
            if (model is None) or (len(prev_obs) < 1):
                action = env.action_space.sample()
            else:
                # otherwise we use our model to choose an
                # action based on the current observation (state)
                action = np.argmax(model.predict(prev_obs.reshape(-1, 1, state_space)))
            new_obs, reward, done, _ = env.step(action)
            prev_obs = new_obs
            score += reward
                
        env.close()
        print('Final score: {}'.format(score))

In [12]:
play_game(10, model)

Final score: 200.0
Final score: 200.0
Final score: 200.0
Final score: 200.0
Final score: 200.0
Final score: 200.0
Final score: 200.0
Final score: 200.0
Final score: 200.0
Final score: 200.0
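
The action selection in play_game relies on np.argmax being called without an axis argument. A small sketch with a made-up prediction (the numbers are not real model output) shows why the extra dimensions in the model's output do not matter:

```python
import numpy as np

# made-up model output for a single observation, shaped (batch, 1, n_actions)
fake_prediction = np.array([[[0.45, 0.55]]])

# with no axis argument, np.argmax flattens the array first, so the
# result is directly the index of the most probable action
action = int(np.argmax(fake_prediction))
print(fake_prediction.shape, action)  # (1, 1, 2) 1
```

This only works as-is because a single observation is predicted at a time; with a batch of several observations, an explicit axis would be needed.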
