# Reinforcement Learning Chess 
Reinforcement Learning Chess is a series of notebooks where I implement Reinforcement Learning algorithms to develop a chess AI. I start off with simpler versions (environments) that can be tackled with simple methods and gradually expand on those concepts until I have a full-fledged chess AI.

[**Notebook 1: Policy Iteration**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-1-policy-iteration)  
[**Notebook 3: Q-networks**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-3-q-networks)  
[**Notebook 4: Policy Gradients**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-4-policy-gradients)  
[**Notebook 5: Monte Carlo Tree Search**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-5-tree-search)  

# Notebook II: Model-free control
In this notebook I use the same move-chess environment as in notebook 1. In that notebook I mentioned that policy evaluation calculates the state value by backing up the successor state values, weighted by the transition probabilities to those states. The problem is that these probabilities are usually unknown in real-world problems. Luckily, there are control techniques that work in such unknown environments. These techniques don't leverage any prior knowledge about the environment's dynamics: they are model-free.
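
To make the contrast concrete, the policy evaluation backup from notebook 1 can be written as follows. This uses standard textbook notation (Sutton and Barto); the symbols are not taken from the RLC code.

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma\, v_k(s')\bigr]$$

The transition probabilities $p(s' \mid s, a)$ are exactly what is unavailable in the model-free setting. The methods in this notebook replace that expectation with quantities sampled from experience, for example the observed return $G_t$ after taking action $a$ in state $s$:

$$q(s, a) \approx \text{average of observed } G_t \text{ with } S_t = s,\ A_t = a$$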

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import inspect

In [3]:
!pip install --upgrade git+https://github.com/arjangroen/RLC.git  # RLC is the Reinforcement Learning package

Collecting git+https://github.com/arjangroen/RLC.git
  Cloning https://github.com/arjangroen/RLC.git to /tmp/pip-req-build-gn6xg08b
  Running command git clone -q https://github.com/arjangroen/RLC.git /tmp/pip-req-build-gn6xg08b
Building wheels for collected packages: RLC
  Building wheel for RLC (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-x011tmyw/wheels/04/68/a5/cb835cd3d76a49de696a942739c71a56bfe66d0d8ea7b4b446
Successfully built RLC
Installing collected packages: RLC
Successfully installed RLC-0.3


In [4]:
from RLC.move_chess.environment import Board
from RLC.move_chess.agent import Piece
from RLC.move_chess.learn import Reinforce

### The environment
- The state space is an 8 by 8 grid
- The starting state S is the top-left square (0,0)
- The terminal state F is the square in row 7, column 5 (zero-indexed), i.e. state (7,5)
- Every move from state to state gives a reward of minus 1
- Naturally the best policy for this environment is to move from S to F in the fewest moves possible.
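
To make these rules concrete, here is a minimal sketch of such an environment. `ToyBoard` is a hypothetical stand-in written for illustration, not the RLC `Board` class. It assumes states are (row, column) tuples, that F sits in row 7, column 5 as on the rendered board, and that a move which would leave the grid keeps the piece in place.

```python
class ToyBoard:
    """Hypothetical 8x8 grid world; not the RLC Board implementation."""

    def __init__(self, start=(0, 0), terminal=(7, 5), size=8):
        self.size = size
        self.start = start
        self.terminal = terminal
        self.state = start

    def step(self, action):
        """Apply an action (d_row, d_col) and return (reward, episode_end)."""
        row = self.state[0] + action[0]
        col = self.state[1] + action[1]
        # Assumption: a move that would leave the board keeps the piece in place.
        if 0 <= row < self.size and 0 <= col < self.size:
            self.state = (row, col)
        reward = -1  # every move costs one
        return reward, self.state == self.terminal


toy = ToyBoard()
total = 0
# A king moving diagonally five times, then straight down twice, reaches F.
for move in [(1, 1)] * 5 + [(1, 0)] * 2:
    reward, done = toy.step(move)
    total += reward
```

Under these assumptions the king needs seven moves, so the best achievable undiscounted return from S is -7.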

In [5]:
env = Board()
env.render()
env.visual_board

[['[S]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]'],
 ['[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]'],
 ['[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]'],
 ['[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]'],
 ['[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]'],
 ['[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]'],
 ['[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[ ]'],
 ['[ ]', '[ ]', '[ ]', '[ ]', '[ ]', '[F]', '[ ]', '[ ]']]

### The agent
- The agent is a chess Piece (king, queen, rook, knight or bishop)
- The agent has a behavior policy determining what the agent does in what state
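
A common choice for such a behavior policy is epsilon-greedy: explore with probability epsilon, otherwise take the best-valued action. The sketch below is an illustration only. `epsilon_greedy` is a hypothetical helper, not the RLC `apply_policy` method, and it assumes the action values are stored in a (rows, columns, actions) array.

```python
import numpy as np


def epsilon_greedy(action_values, state, epsilon, rng):
    """Pick an action index for `state` from a (rows, cols, n_actions) array."""
    n_actions = action_values.shape[2]
    if rng.random() < epsilon:
        # Explore: uniform random action.
        return int(rng.integers(n_actions))
    # Exploit: best-valued action in this state.
    return int(np.argmax(action_values[state[0], state[1], :]))


rng = np.random.default_rng(0)
q = np.zeros((8, 8, 8))
q[0, 0, 3] = 1.0  # make action 3 the greedy choice in state (0, 0)

greedy_choice = epsilon_greedy(q, (0, 0), epsilon=0.0, rng=rng)
random_choices = [epsilon_greedy(q, (0, 0), epsilon=1.0, rng=rng) for _ in range(50)]
```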

In [6]:
p = Piece(piece='king')

### Reinforce
- The reinforce object contains the algorithms for solving move chess
- The agent and the environment are attributes of the Reinforce object

In [7]:
r = Reinforce(p,env)

# 2.1 Monte Carlo Control

**Theory**  
The basic intuition is:
* We do not know the environment, so we sample an episode from beginning to end by running our current policy
* We estimate action values rather than state values. Because we are working model-free, knowing only the state values won't help us select the best actions: without a model we cannot look ahead to the successor states.
* The value of a state-action pair is estimated as the average of the returns observed after the first visit to that pair in each episode
* Based on these estimates we improve our policy and repeat the process until the algorithm converges

![](http://incompleteideas.net/book/first/ebook/pseudotmp5.png)
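
As a small worked example of the first-visit rule, the sketch below computes the undiscounted return that follows the first occurrence of each state-action pair in one toy episode. `first_visit_returns` is a hypothetical helper for illustration, and the episode data is made up.

```python
import numpy as np


def first_visit_returns(states, actions, rewards):
    """Map each (state, action) pair to the return after its first visit."""
    returns = {}
    for idx, (state, action) in enumerate(zip(states, actions)):
        if (state, action) in returns:
            continue  # only the first visit in the episode counts
        returns[(state, action)] = float(np.sum(rewards[idx:]))
    return returns


# Toy episode in which the pair ((0, 0), 1) is visited twice.
states = [(0, 0), (0, 1), (0, 0), (1, 1)]
actions = [1, 2, 1, 3]
rewards = [-1, -1, -1, -1]
episode_returns = first_visit_returns(states, actions, rewards)
```

Averaging these returns over many episodes gives the Monte Carlo estimate of each action value.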

**Implementation**

In [8]:
print(inspect.getsource(r.monte_carlo_learning))

    def monte_carlo_learning(self, epsilon=0.1):
        """
        Learn move chess through monte carlo control
        :param epsilon: exploration rate
        :return:
        """
        state = (0, 0)
        self.env.state = state

        # Play out an episode
        states, actions, rewards = self.play_episode(state, epsilon=epsilon)

        first_visits = []
        for idx, state in enumerate(states):
            action_index = actions[idx]
            if (state, action_index) in first_visits:
                continue
            r = np.sum(rewards[idx:])
            if (state, action_index) in self.agent.Returns.keys():
                self.agent.Returns[(state, action_index)].append(r)
            else:
                self.agent.Returns[(state, action_index)] = [r]
            self.agent.action_function[state[0], state[1], action_index] = \
                np.mean(self.agent.Returns[(state, action_index)])
            first_visits.append((state, action_index))
        #

**Demo**  
We do 100 iterations of Monte Carlo learning while maintaining a high exploration rate of 0.5:

In [9]:
for k in range(100):
    eps = 0.5
    r.monte_carlo_learning(epsilon=eps)

In [10]:
r.visualize_policy()

[['↙', '↖', '↓', '←', '↗', '→', '→', '↗'],
 ['←', '↓', '↙', '↓', '↖', '←', '↘', '↗'],
 ['↑', '↙', '↖', '↙', '→', '↖', '→', '→'],
 ['→', '↙', '↙', '↗', '←', '→', '↙', '↘'],
 ['↓', '↙', '↓', '←', '↙', '↖', '↗', '↑'],
 ['↓', '↘', '→', '→', '↙', '↙', '↘', '↗'],
 ['↗', '↓', '↘', '↖', '↘', '↓', '↙', '←'],
 ['↗', '→', '↖', '→', '→', 'F', '↗', '←']]


Best action value for each state:

In [11]:
r.agent.action_function.max(axis=2).astype(int)

array([[ -52,  -54,  -51,  -55,  -98,  -76,  -70,  -63],
       [ -47,  -47,  -47,  -56,  -77, -140,  -81,  -58],
       [ -57,  -39,  -44,  -43,  -85,  -65, -170, -174],
       [ -32,  -31,  -32,  -59,  -66, -105,  -63,  -84],
       [ -30,  -34,  -37,  -36,  -39,  -62, -146,  -44],
       [ -26,  -16,  -25,  -24,  -31,  -57,  -65,  -22],
       [ -23,  -32,  -12,  -24,   -1,   -1,   -1,   -3],
       [ -35,  -30,  -32,   -4,   -1,    0,    0,   -2]])

# 2.2 Temporal Difference Learning 

**Theory**
* As in Policy Iteration, we can back up state-action values from the successor state-action pair without waiting for the episode to end.
* We update our state-action value in the direction of the reward plus the discounted successor state-action value.
* The algorithm is called SARSA: State-Action-Reward-State-Action.
* Epsilon is gradually lowered over the episodes, in the spirit of the GLIE property (Greedy in the Limit with Infinite Exploration)
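
The update described above can be sketched as a single function. This is a minimal illustration of the standard SARSA rule, not the RLC `sarsa_td` source. `sarsa_update` is a hypothetical name, and the (rows, columns, actions) array layout is an assumption.

```python
import numpy as np


def sarsa_update(q, state, action, reward, next_state, next_action, alpha, gamma):
    """One SARSA (TD(0)) step on a (rows, cols, n_actions) action-value array."""
    current = q[state[0], state[1], action]
    successor = q[next_state[0], next_state[1], next_action]
    td_target = reward + gamma * successor
    # Move the current estimate a fraction alpha towards the TD target.
    q[state[0], state[1], action] = current + alpha * (td_target - current)
    return q


q = np.zeros((8, 8, 8))
q[1, 1, 2] = -4.0  # pretend the successor pair is already valued at -4
sarsa_update(q, (0, 0), 5, -1, (1, 1), 2, alpha=0.5, gamma=0.9)
# TD target: -1 + 0.9 * -4 = -4.6, so the new estimate is 0 + 0.5 * -4.6 = -2.3
```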

**Implementation**

In [12]:
print(inspect.getsource(r.sarsa_td))

    def sarsa_td(self, n_episodes=1000, alpha=0.01, gamma=0.9):
        """
        Run the sarsa control algorithm (TD0), finding the optimal policy and action function
        :param n_episodes: int, amount of episodes to train
        :param alpha: learning rate
        :param gamma: discount factor of future rewards
        :return: finds the optimal policy for move chess
        """
        for k in range(n_episodes):
            state = (0, 0)
            self.env.state = state
            episode_end = False
            epsilon = max(1 / (1 + k), 0.05)
            while not episode_end:
                state = self.env.state
                action_index = self.agent.apply_policy(state, epsilon)
                action = self.agent.action_space[action_index]
                reward, episode_end = self.env.step(action)
                successor_state = self.env.state
                successor_action_index = self.agent.apply_policy(successor_state, epsilon)

                action_va

**Demonstration**

In [13]:
p = Piece(piece='king')
env = Board()
r = Reinforce(p,env)
r.sarsa_td(n_episodes=10000,alpha=0.2,gamma=0.9)

In [14]:
r.visualize_policy()

[['↘', '↘', '↘', '→', '↘', '→', '↑', '↗'],
 ['↘', '↙', '↘', '↙', '→', '↘', '→', '↓'],
 ['↘', '↘', '↓', '↓', '↙', '↖', '↙', '↖'],
 ['↘', '↘', '↓', '↙', '↘', '↘', '↑', '↙'],
 ['→', '↘', '↘', '↘', '↘', '↓', '↘', '↙'],
 ['↘', '→', '↘', '↘', '↘', '↘', '↓', '↙'],
 ['←', '↗', '↘', '↘', '↘', '↓', '↙', '←'],
 ['↗', '→', '→', '→', '→', 'F', '←', '←']]


# 2.3 TD-lambda
**Theory**  
In Monte Carlo we do a full-depth backup, while in Temporal Difference Learning we do a 1-step backup. You could also choose a depth in between: back up by n steps. But what value to choose for n?
* TD-lambda combines all n-step returns, weighting the n-step return by a factor proportional to lambda^(n-1), so longer backups count for less
* This weighted average is called the lambda-return
* TD-lambda uses an eligibility trace to keep track of the previously encountered state-action pairs
* This way action values can be updated in retrospect
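
The eligibility-trace idea can be sketched as follows. This is a textbook-style SARSA(lambda) step with accumulating traces, written for illustration. It is not the RLC `sarsa_lambda` source, and `sarsa_lambda_update` and the array layout are assumptions.

```python
import numpy as np


def sarsa_lambda_update(q, trace, state, action, reward,
                        next_state, next_action, alpha, gamma, lamb):
    """One SARSA(lambda) step with accumulating eligibility traces."""
    delta = (reward
             + gamma * q[next_state[0], next_state[1], next_action]
             - q[state[0], state[1], action])
    trace[state[0], state[1], action] += 1  # mark the pair just visited
    q += alpha * delta * trace              # credit all recently visited pairs
    trace *= gamma * lamb                   # decay every trace
    return q, trace


q = np.zeros((8, 8, 8))
trace = np.zeros_like(q)
# Two consecutive steps, each with reward -1.
sarsa_lambda_update(q, trace, (0, 0), 0, -1, (1, 1), 0, alpha=0.5, gamma=0.9, lamb=0.8)
sarsa_lambda_update(q, trace, (1, 1), 0, -1, (2, 2), 0, alpha=0.5, gamma=0.9, lamb=0.8)
```

After the second step the first pair is updated again, scaled by its decayed trace (0.9 * 0.8 = 0.72). This is the "update in retrospect" mentioned above.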

**Implementation**

In [15]:
print(inspect.getsource(r.sarsa_lambda))

    def sarsa_lambda(self, n_episodes=1000, alpha=0.05, gamma=0.9, lamb=0.8):
        """
        Run the sarsa control algorithm (TD lambda), finding the optimal policy and action function
        :param n_episodes: int, amount of episodes to train
        :param alpha: learning rate
        :param gamma: discount factor of future rewards
        :param lamb: lambda parameter describing the decay over n-step returns
        :return: finds the optimal move chess policy
        """
        for k in range(n_episodes):
            self.agent.E = np.zeros(shape=self.agent.action_function.shape)
            state = (0, 0)
            self.env.state = state
            episode_end = False
            epsilon = max(1 / (1 + k), 0.2)
            action_index = self.agent.apply_policy(state, epsilon)
            action = self.agent.action_space[action_index]
            while not episode_end:
                reward, episode_end = self.env.step(action)
                successor_state = self.env.

**Demonstration**

In [16]:
p = Piece(piece='king')
env = Board()
r = Reinforce(p,env)
r.sarsa_lambda(n_episodes=10000,alpha=0.2,gamma=0.9)

In [17]:
r.visualize_policy()

[['↓', '↘', '↘', '↓', '↓', '↙', '↓', '→'],
 ['↘', '↘', '↘', '↓', '↙', '↑', '→', '↙'],
 ['↘', '↓', '↘', '↘', '↘', '↓', '↙', '↓'],
 ['→', '↘', '↘', '↓', '↓', '↓', '←', '↙'],
 ['↘', '↘', '↘', '↘', '↓', '↘', '↓', '↙'],
 ['↘', '↘', '↘', '↘', '↓', '↓', '↙', '↙'],
 ['↘', '→', '↘', '↘', '↘', '↓', '↙', '↙'],
 ['→', '→', '↗', '→', '→', 'F', '←', '←']]


# 2.4 Q-learning

**Theory**
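* In SARSA/TD0, we back up our action values with the successor action value, i.e. the value of the action that the current policy actually selects in the successor state
* In SARSA-max/Q-learning, we back up using the maximum action value in the successor state.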
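
The difference with SARSA is a single line: the target uses the maximum over the successor state's action values. The sketch below illustrates the standard Q-learning rule. It is not the RLC `q_learning` source, and `q_learning_update` and the array layout are assumptions.

```python
import numpy as np


def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """One Q-learning step: bootstrap from the best successor action value."""
    current = q[state[0], state[1], action]
    best_successor = np.max(q[next_state[0], next_state[1], :])
    td_target = reward + gamma * best_successor
    q[state[0], state[1], action] = current + alpha * (td_target - current)
    return q


q = np.full((8, 8, 8), -5.0)
q[1, 1, 2] = -1.0  # one clearly better action in the successor state
q_learning_update(q, (0, 0), 5, -1, (1, 1), alpha=0.5, gamma=0.9)
# TD target: -1 + 0.9 * -1 = -1.9, so the new estimate is -5 + 0.5 * 3.1 = -3.45
```

Because the target ignores which action the behavior policy takes next, Q-learning is an off-policy method.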

**Implementation**

In [18]:
print(inspect.getsource(r.q_learning))

**Demonstration**

In [19]:
p = Piece(piece='king')
env = Board()
r = Reinforce(p,env)
r.q_learning(n_episodes=1000,alpha=0.2,gamma=0.9)

In [20]:
r.visualize_policy()

[['↘', '↑', '↖', '→', '→', '↗', '↑', '↖'],
 ['↘', '↘', '↘', '↑', '↗', '↘', '↘', '↗'],
 ['↘', '↓', '↘', '↘', '↘', '↖', '↖', '↘'],
 ['→', '↘', '↓', '↘', '↙', '↙', '↗', '↖'],
 ['↖', '↘', '↘', '↘', '↙', '↓', '↑', '↓'],
 ['↘', '↘', '→', '↘', '↓', '↙', '↙', '↙'],
 ['↖', '↙', '←', '↘', '↘', '↓', '↙', '←'],
 ['↓', '→', '→', '↗', '→', 'F', '←', '↖']]


In [21]:
r.agent.action_function.max(axis=2).round().astype(int)

array([[-5, -5, -5, -4, -4, -4, -4, -3],
       [-5, -5, -4, -4, -4, -4, -3, -3],
       [-4, -4, -4, -4, -4, -3, -3, -3],
       [-4, -3, -3, -3, -3, -3, -3, -3],
       [-3, -3, -3, -3, -3, -3, -3, -3],
       [-3, -3, -3, -2, -2, -2, -2, -2],
       [-3, -3, -2, -2, -1, -1, -1, -2],
       [-3, -3, -2, -2, -1,  0, -1, -1]])

# References
1. Reinforcement Learning: An Introduction  
   Richard S. Sutton and Andrew G. Barto  
   1st Edition  
   MIT Press, March 1998
2. RL Course by David Silver: Lecture playlist  
   https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ