# Deep Q-Learning applied to the 'Snake game'

This is an implementation of Deep Q-Learning applied to a Snake game environment written in Python and built on NumPy arrays for faster processing.

![Snake](Snake.gif "Snake")

![RL](https://www.kdnuggets.com/images/reinforcement-learning-fig1-700.jpg)

## The environment

The environment is simulated using the class [SnakeEnvironment](SnakeEnvironment.py).
The state $S_t$ is represented by the following measurements:
  - distance to food (vertical)
  - distance to food (horizontal)
  - distance up (wall or own body)
  - distance right (wall or own body)
  - distance down (wall or own body)
  - distance left (wall or own body)
  - long snake (whether the snake is longer than 3 or not)
  
Each call to the method `move_snake(direction)` moves the snake in the specified direction and returns the new state of the environment $S_{t+1}$ along with the reward $R_{t+1}$ for the action $A_t$, which is given by the input `direction`.
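
To make the seven measurements concrete, here is a minimal sketch of how such a state vector could be computed with NumPy. The grid layout, the sign conventions, the free-cell counting, and the helper name `build_state` are assumptions made for illustration; the actual `SnakeEnvironment` class may compute them differently.

```python
import numpy as np

def build_state(head, food, body, grid_size, long_threshold=3):
    """Hypothetical helper: return the 7 measurements as a float vector.

    head, food: (row, col) tuples; body: list of (row, col) cells excluding the head.
    """
    row, col = head
    occupied = set(body)

    def free_cells(d_row, d_col):
        # Count free cells from the head until a wall or a body segment is hit.
        steps = 0
        r, c = row + d_row, col + d_col
        while 0 <= r < grid_size and 0 <= c < grid_size and (r, c) not in occupied:
            steps += 1
            r, c = r + d_row, c + d_col
        return steps

    return np.array([
        food[0] - row,                          # distance to food (vertical)
        food[1] - col,                          # distance to food (horizontal)
        free_cells(-1, 0),                      # distance up
        free_cells(0, 1),                       # distance right
        free_cells(1, 0),                       # distance down
        free_cells(0, -1),                      # distance left
        float(len(body) + 1 > long_threshold),  # long snake flag
    ], dtype=float)

# Snake of length 3 heading right on a 5x5 grid, food in the top-right corner.
state = build_state(head=(2, 2), food=(0, 4), body=[(2, 1), (2, 0)], grid_size=5)
print(state)
```

In this toy layout the body sits directly to the left of the head, so the "distance left" entry is 0 while the other three directions each report 2 free cells.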

## The agent

The agent is an implementation of a Deep Q-Learning network using Keras. The class [SnakeDQNAgent](SnakeDQNAgent.py) builds two models of the following form:

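```python
# Excerpt from SnakeDQNAgent; `self` refers to the agent instance.
from keras import layers, models, optimizers
from keras.models import clone_model

policy_nn = models.Sequential()
policy_nn.add(layers.Dense(128, activation='relu', input_shape=(self.state_size,)))
policy_nn.add(layers.Dense(64, activation='relu'))
policy_nn.add(layers.Dense(32, activation='relu'))
policy_nn.add(layers.Dense(self.action_size))  # linear output: one Q-value per action
policy_nn.compile(loss='mse', optimizer=optimizers.RMSprop(learning_rate=self.learning_rate))
target_nn = clone_model(policy_nn)  # clone_model copies the architecture, not the weights
target_nn.set_weights(policy_nn.get_weights())  # start from identical weights
```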

The policy neural network estimates the Q-values used to choose the action $A_t$ sent to the environment for a given state $S_t$.
The target neural network is a clone of `policy_nn` whose weights lag slightly behind; it is used to estimate the future Q-value $Q(s', a')$. This second network helps reduce the instability that arises when a single network computes both the Q-values and the target Q-values ([source](https://youtu.be/xVkPh9E9GfE?t=114)).
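
One common way to obtain such a lag is to copy the policy weights into the target network every fixed number of training steps. The sketch below illustrates that idea with a plain NumPy stand-in for the two models; the class `TinyNet`, the interval `sync_every`, and the fake gradient step are all assumptions for illustration, and the README does not state how `SnakeDQNAgent` schedules its own synchronization.

```python
import numpy as np

class TinyNet:
    """Stand-in for a Keras model: it only holds a list of weight arrays."""
    def __init__(self, rng):
        self.weights = [rng.normal(size=(7, 4)), np.zeros(4)]

    def get_weights(self):
        return [w.copy() for w in self.weights]

    def set_weights(self, weights):
        self.weights = [w.copy() for w in weights]

rng = np.random.default_rng(0)
policy_nn, target_nn = TinyNet(rng), TinyNet(rng)
target_nn.set_weights(policy_nn.get_weights())  # start from identical weights

sync_every = 5   # hypothetical lag, in training steps
total_steps = 12
for step in range(1, total_steps + 1):
    # Placeholder for a gradient step: only the policy network changes.
    policy_nn.weights[1] += 0.1
    if step % sync_every == 0:
        # Hard update: the target catches up with the policy network.
        target_nn.set_weights(policy_nn.get_weights())

steps_since_sync = total_steps % sync_every
gap = policy_nn.weights[1][0] - target_nn.weights[1][0]
print(steps_since_sync, round(gap, 2))
```

Between synchronizations the target network stays frozen, so the values it predicts for $Q(s', a')$ do not shift with every gradient step taken on the policy network.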

The Bellman equation is used to update the Q-value for the current state $s=S_t$ and action $a=A_t$, given the reward $R_{t+1}$ and the estimate of the future Q-value in the new state $s'=S_{t+1}$:
$$Q(s,a)\leftarrow(1-\alpha)\,Q(s,a)+\alpha\left(R_{t+1}+\gamma\max_{a'}Q(s',a')\right)$$
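
As a worked example of this update rule, the sketch below applies it once to a small tabular Q array. The table sizes, the values of $\alpha$ and $\gamma$, and the `terminal` flag (which drops the future term at the end of an episode) are illustrative assumptions; the agent itself approximates Q with a neural network rather than a table.

```python
import numpy as np

def q_update(q, s, a, reward, s_next, alpha, gamma, terminal=False):
    """Apply one Bellman update to a tabular Q array in place and return the new value."""
    future = 0.0 if terminal else gamma * np.max(q[s_next])
    q[s, a] = (1 - alpha) * q[s, a] + alpha * (reward + future)
    return q[s, a]

q = np.zeros((3, 4))          # 3 toy states, 4 actions
q[1] = [0.0, 2.0, 1.0, 0.0]   # pretend the next state already has some estimates
new_value = q_update(q, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
# (1 - 0.5) * 0 + 0.5 * (1 + 0.9 * 2) = 1.4
print(new_value)
```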

## Training the agent

The agent is trained using the class [SnakeTrain](SnakeTrain.py). The `replay` method of `SnakeDQNAgent` is called at the end of each episode to update the Q-value estimates for the given batch of states:

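```python
# Excerpt from SnakeDQNAgent.replay; the shapes noted below are inferred from the broadcasting.
targets = self.policy_nn.predict(states)  # current Q-value estimates, one column per action
action_rewards = rewards * actions  # reward placed in the column of the action taken (assumes one-hot actions)
# Discounted best future Q-value from the target network, minus the current best Q-value
learned_value = self.discount_rate * np.amax(self.target_nn.predict(new_states), axis=1) - np.amax(targets, axis=1)
# (1 - deads) zeroes this term for transitions where the snake died
target_modifier = action_rewards + (1-deads) * learned_value.reshape(self.batch_size, 1)
targets += target_modifier
self.policy_nn.fit(states, targets, epochs=self.epochs, verbose=0)
```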
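
The standard DQN target, which overwrites only the entry of the action actually taken, can be sketched in plain NumPy as follows. The array shapes (one-hot `actions`, column-vector `rewards` and `deads`) are assumptions inferred from the excerpt above, the helper name `dqn_targets` is hypothetical, and the hand-written arrays stand in for network predictions; this is not the `replay` method itself.

```python
import numpy as np

def dqn_targets(q_current, q_next, actions, rewards, deads, discount_rate):
    """Return regression targets with the taken action's Q-value replaced.

    q_current, q_next: (batch, n_actions) predictions for states / new_states.
    actions: one-hot (batch, n_actions); rewards, deads: (batch, 1).
    """
    best_next = np.amax(q_next, axis=1, keepdims=True)           # max over a' of Q(s', a')
    bellman = rewards + (1 - deads) * discount_rate * best_next  # no future term after death
    # Keep the predictions for untaken actions so their error (and gradient) is zero.
    return np.where(actions == 1, bellman, q_current)

q_current = np.array([[0.2, 0.5, 0.1, 0.0], [1.0, 0.3, 0.4, 0.2]])
q_next = np.array([[0.0, 1.0, 0.5, 0.2], [0.3, 0.1, 0.0, 0.6]])
actions = np.array([[0, 1, 0, 0], [0, 0, 1, 0]])
rewards = np.array([[1.0], [-1.0]])
deads = np.array([[0.0], [1.0]])

targets = dqn_targets(q_current, q_next, actions, rewards, deads, discount_rate=0.9)
print(targets)
```

In the first row the taken action receives $1 + 0.9 \cdot 1.0 = 1.9$; in the second row the snake died, so the target is just the reward of $-1$.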

Run the following command from a terminal (inside the code folder) to train the snake for 50,000 episodes and then visualize 100 more (you can also run `python SnakeTrain.py --help`):

```shell
python SnakeTrain.py --train_episodes 50000 --print_from 50100
```