# Flappy Bird Solver!


## Intro
Flappy Bird is a ridiculously difficult game with lots and lots of failure and some questionable collision detection.  In hindsight, perhaps a regrettable choice for a learning experiment.
![what the?](./assets/collision.png "questionable collision")

I followed a few blogs and notes to implement the reinforcement learning agents, particularly:
* https://keon.io/deep-q-learning/
* https://www.intelnervana.com/demystifying-deep-reinforcement-learning/

My goal was to implement deep reinforcement learning for Flappy Bird.

## Signal to Noise
As noted above, Flappy Bird is extremely difficult.  Flying through even one pipe is very unlikely.  The game also awards its point halfway through a pipe, while there is still a high chance of crashing before exiting the pipe.  Thus, my primary issue was getting as much signal as possible out of the system.

To that end, I avoided the standard random sampling of states and rewards, and instead opted to randomly sample from full game instances.
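
A minimal sketch of that idea, assuming a memory that stores whole games rather than single transitions. The `EpisodeMemory` class and its method names are hypothetical and exist only for illustration; the notebook's agent keeps a plain `deque` of game histories instead.

```python
from collections import deque
import random


class EpisodeMemory:
    """Hypothetical helper: keeps the most recent full games."""

    def __init__(self, max_games):
        # Oldest games fall off automatically once max_games is reached.
        self.games = deque(maxlen=max_games)

    def store(self, history):
        # history is a list of (state, action, reward) steps from one game.
        self.games.append(list(history))

    def sample(self, n_games):
        # Draw whole games, so every step keeps its place in its episode.
        n_games = min(n_games, len(self.games))
        return random.sample(list(self.games), n_games)
```

Sampling whole games keeps the rare rewarded step together with every step that led up to it.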

Furthermore, the input to the agent is both the *state* of the game and the **chosen action**.  The output is the predicted **value** of that state + action pair.
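
To make the input layout concrete, here is a small sketch that appends the action index to the state vector and picks the action with the higher predicted value. The names `prep_q`, `greedy_action`, and `toy_value` are illustrative assumptions; `toy_value` is a stand-in for the trained network, not the real model.

```python
import numpy as np


def prep_q(state, action):
    """Append the action index to the state to form one input vector."""
    return np.array([*state, action], dtype=float)


def greedy_action(value_fn, state, n_actions=2):
    """Score each candidate action and return the index of the best one."""
    values = [value_fn(prep_q(state, a)) for a in range(n_actions)]
    return int(np.argmax(values))


def toy_value(x):
    # Hypothetical value function: rewards action 1 when the first
    # state feature is positive, penalizes it otherwise.
    return x[0] * x[-1]
```

With an 8-value state, `prep_q` returns a vector of length 9, one slot for the action.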

## Method
The "values" are computed via the *Bellman equation*:

  Q(s,a) =  r + g * max(Q(s',a'))

Where r is the instantaneous reward, g is the discount factor (`gamma` in the code), a parameter that controls how much the value of the next state counts toward this one, s' is the next state, and a' ranges over the actions available in s'.
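
As a worked example of the equation, the sketch below computes a single one-step target. The function name `bellman_target` and the `terminal` flag are assumptions added for illustration; they do not appear in the notebook.

```python
def bellman_target(reward, gamma, next_q_values, terminal=False):
    """One-step target: r + g * max over a' of Q(s', a')."""
    if terminal:
        # No next state after a crash, so the target is just the reward.
        return reward
    return reward + gamma * max(next_q_values)


# Passing a pipe (r = 1) with next-state values of 0.5 and 2.0:
target = bellman_target(1.0, 0.97, [0.5, 2.0])  # 1.0 + 0.97 * 2.0
```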

Full game histories are fed to the agent and stored in memory.  At training time, a set of histories is extracted at random from the agent's memory.  The histories are **reversed** to make computing Q(s,a) more intuitive.
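
Walking a game backwards lets each step's target be built from the one computed just before it. The sketch below assumes the history is already reversed and holds `(state, action, reward)` triples; the function name `discounted_targets` is hypothetical. Note that it uses the return of the actions actually taken in the game in place of the max over next actions, which appears to match what the `train` method in this notebook does.

```python
def discounted_targets(reversed_history, gamma=0.97):
    """Return one training target per step, in the same (reversed) order."""
    targets = []
    running = 0.0  # value after the final step of the game
    for _state, _action, reward in reversed_history:
        # Each step is worth its own reward plus the discounted value
        # of everything that followed it in the game.
        running = reward + gamma * running
        targets.append(running)
    return targets


# A three-step game played forward as rewards 0, 1, -1 (crash at the end),
# stored in reverse:
game = [("s2", 0, -1.0), ("s1", 1, 1.0), ("s0", 0, 0.0)]
targets = discounted_targets(game, gamma=0.5)
```

The crash penalty at the end of the game flows back into every earlier step in a single pass.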

Finally, to help bootstrap the agent, an exploration rate *epsilon* is introduced: with probability *epsilon*, the agent takes a random action instead of the one the model prefers.  *Epsilon* starts at a high value and is slowly decayed as the model is trained.
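
A rough sketch of such a schedule, assuming multiplicative decay with a lower bound. The `EpsilonSchedule` class is a hypothetical helper for illustration; its default values are taken from the agent's parameters in the code cells of this notebook.

```python
import random


class EpsilonSchedule:
    """Hypothetical helper: tracks a decaying exploration rate."""

    def __init__(self, eps=0.9, decay=0.995, eps_min=0.0):
        self.eps = eps
        self.decay = decay
        self.eps_min = eps_min

    def explore(self):
        # True means "take a random action on this step".
        return random.random() < self.eps

    def step(self):
        # Shrink epsilon after each training round, never below eps_min.
        self.eps = max(self.eps_min, self.eps * self.decay)
        return self.eps
```

Calling `step()` once per training round makes the agent lean on the model more and more as training progresses.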

## Results
At the time of writing, my bird flies through a *median* of 10 pipes, with a high score of **94** pipes.  Pretty good!

In [None]:
from collections import deque
from matplotlib import pyplot as plt
import numpy as np
from ple.games.flappybird import FlappyBird
from ple import PLE
import pygame
import random
import time

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Conv3D, MaxPooling2D, Dropout, Flatten
from keras.optimizers import Adam
from keras import backend as K
from keras import regularizers
import keras

def make_batch(frame):
    return frame.reshape(1,frame.shape[0])

def prepQ(frame, a):
    return np.array([*frame,a])

class QAgent:
    def __init__(self, dim):
        #parameters
        self.gamma = 0.97
        self.eps = 0.9
        self.eps_dec = 0.995
        self.eps_min = 0.0
        
        #build model(s)
        self.model = self.build_model()
                
        self.batch_size = 64
        self.memory_size = self.batch_size * 10
        self.memory = deque(maxlen=self.memory_size)
        
    def build_model(self):
        model = Sequential()
        model.add(Dense(256,
                       input_shape=(9,),  #state is 8, action is 1
#                        kernel_regularizer=regularizers.l2(0.01),
                       activation='relu'))
        model.add(Dense(256,
#                        kernel_regularizer=regularizers.l2(0.01),
                       activation='relu'))
        model.add(Dense(1,
                       activation='linear')) #output is value of state+action pair
    
        model.compile(loss='mse',optimizer=Adam(lr=0.0001))
        return model
        
    def act(self, state, verbose=False):
        rs = np.random.rand()
        
        if rs < self.eps:
            return int(np.random.rand() < 0.5)
        
        #cheat! (hand-written heuristic; unreachable as written, since the identical check above already returned)
        if rs < self.eps:
            if state[4] < state[7]:
                pad = 65
            else:
                pad = 65

            if (state[4]-pad+state[1]) > state[0]:
                act = 1
            else:
                act = 0
                
            if verbose:
                print("cheating",state, pad, act)                

            return act
        
        flap = self.model.predict(make_batch(prepQ(state,0)))
        fall = self.model.predict(make_batch(prepQ(state,1)))
        act = int(flap < fall)
        if verbose:
            print("act:", act, "Flap:",flap,"Fall:",fall)

        return act
        
    def store_reverse_game(self, history):
        history.reverse()
        self.memory.append(history)
            
    def train(self, verbose=False):
        gamebatch = random.sample(self.memory, self.batch_size)
        
        states = []
        qvals = []
        
        for game in gamebatch:
            rr = 0.0
            for s,a,r in game:
                t = r + self.gamma * rr
                states.append(prepQ(s,a))
                qvals.append(t)
                rr = t
        
        self.model.fit(np.array(states), np.array(qvals), epochs=1, verbose=verbose)
        self.eps = max(self.eps_min, self.eps * self.eps_dec)

In [None]:
#prepare a game
Agent = QAgent
game = FlappyBird()
def get_state():
    return np.array([float(v) for v in game.getGameState().values()])
env = PLE(game, fps=30, force_fps=True, display_screen=True, reward_values={"positive": 1.0, "loss": -1.0, "tick": 0.0})  #force_fps is a boolean flag
agent = Agent(8)  #dim is currently unused; the game state has 8 values
actions = env.getActionSet()
env.init()

In [None]:
#generate some memory
for i in range(agent.batch_size):
    history = []
    while not env.game_over():
        state = get_state()
        act = agent.act(state)
        reward = env.act(actions[act])
        history.append([state, act, reward])
    agent.store_reverse_game(history)
    env.reset_game()
    env.act(actions[0])
agent.train(True)

In [None]:
#train in an infinite loop
high_score = 0
print("Starting game with gamma:",agent.gamma, "eps:", agent.eps, "decay", agent.eps_dec)
try:
    while True:
        frameset = []
        for i in range(agent.batch_size):
            history = []
            frames = 0
            score = 0.0

            while not env.game_over():
                state = get_state()
                act = agent.act(state)
                reward = env.act(actions[act])
                score += reward
                frames += 1
                history.append([state,act,reward])

            agent.store_reverse_game(history)
            env.reset_game()
            env.act(actions[0])
            frameset.append(frames)

            if (frames > high_score):
                print("New max flight!", frames, score)
                high_score = frames
                
        print("Max frames:", np.max(frameset), "Avg frames:", np.average(frameset), "Median frames:", np.median(frameset), "eps:", agent.eps)
        #hmmm... multi train?
        for i in range(5):
            agent.train(False)

except KeyboardInterrupt:
    pass

In [None]:
#qc check single game with no random elements
e = agent.eps
print("Note: gamma", agent.gamma, "eps", agent.eps, "eps decay", agent.eps_dec)
agent.eps = 0
score = 0.0
for i in range(1):
    env.reset_game()
    ctr = 0
    env.act(actions[0]) #first action
    while not env.game_over():
        state = get_state()
        choice = agent.act(state, verbose=True)
        reward = env.act(actions[choice])
        score += reward
        time.sleep(1./30)
        ctr+=1
    print("flew",ctr,"frames, score:", score)
env.reset_game()
agent.eps = e

In [None]:
#save
agent.model.save("modelv1")
agent.model.save_weights("mweightsv1")