### TIC-TAC-TOE with Reinforcement Learning

### What is tic-tac-toe?

The objective of Tic-Tac-Toe is to be the first player to place three of your marks (crosses or noughts) in a horizontal, vertical, or diagonal line. The important elements are:

![Tic-tac-toe-game-1%201.svg](attachment:Tic-tac-toe-game-1%201.svg)

- Agents: the two Tic-Tac-Toe players, who take turns placing their marks and try to outwit each other
- Reward: an arbitrary value earned by the winning agent
- Actions: on its turn, each agent may place its mark only in an empty box

The state is the configuration of the tic-tac-toe board after each turn until the game ends in either a win or a draw.


## Assignment 

In this assignment, you will:

- Become familiar with the `tictactoe` package
- Train two RL agents to play against each other

#### Note: Please do not modify any pre-defined variables. Doing so can affect the autograder results.

In [None]:
# required imports
import argparse
import os
import pickle
import sys
import numpy as np
 
from tictactoe.agent import NoLearner, BaseLearner
from tictactoe.teacher import Teacher
from tictactoe.game import Game, plot_agent_reward

We initialize the board parameters for a 3x3 board. An example configuration is as follows:

        o | x | o 
        x | o |  
        x | o | x
        
In the above example, nobody wins, since neither player has three marks in a row.

A `GameLearning` class has been defined for you with several methods. You can try playing the game manually using its `beginPlaying` method.

In [None]:
class GameLearning(object):
    """
    A class that holds the state of the learning process. Learning
    agents are created/loaded here, and a count is kept of the
    games that have been played.
    """
    def __init__(self, agent_type="random", alpha=0.5, gamma=0.9, epsilon=0.1):

        if agent_type == "random":
            self.agent = NoLearner()
        elif agent_type == "q":
            self.agent = Qlearner(alpha,gamma,epsilon)
        else:
            self.agent = SARSAlearner(alpha,gamma,epsilon)
            
        self.games_played = 0

    def beginPlaying(self):
        """ Loop through game iterations with a human player. """
        print("Welcome to Tic-Tac-Toe. You are 'X' and the computer is 'O'.")

        def play_again():
            print("Games played: %i" % self.games_played)
            while True:
                play = input("Do you want to play again? [y/n]: ")
                if play == 'y' or play == 'yes':
                    return True
                elif play == 'n' or play == 'no':
                    return False
                else:
                    print("Invalid input. Please choose 'y' or 'n'.")

        while True:
            game = Game(self.agent)
            game.start()
            self.games_played += 1
            if not play_again():
                print("OK. Quitting.")
                break

    def beginTeaching(self, episodes):
        """ Loop through game iterations with a teaching agent. """
        teacher = Teacher()
        # Train for the allotted number of episodes
        while self.games_played < episodes:
            game = Game(self.agent, teacher=teacher)
            game.start()
            self.games_played += 1
            # Monitor progress
            if self.games_played % 1000 == 0:
                print("Games played: %i" % self.games_played)


In [None]:
# try playing the game manually
gl = GameLearning()
gl.beginPlaying()

Now that you are familiar with the interface, we will use some reinforcement learning algorithms to train an agent to win the game. 

First, we train a Q-learning agent that maintains and updates a `Q` value for every (state, action) pair, where an action is a cell on the board.
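As a reference point, the standard Q-learning rule moves `Q(s, a)` toward the reward plus the discounted best value available in the next state: $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. The sketch below applies that rule to a standalone table indexed as `Q[action][state]`. The helper name `q_learning_update`, the `defaultdict` table, and the toy values are assumptions for illustration only; how `BaseLearner` actually stores its values is not shown in this notebook.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update to Q[a][s] in place (non-terminal s_next)."""
    # Legal next actions: empty cells ('-') of the row-major 9-character state.
    next_actions = [(i, j) for i in range(3) for j in range(3)
                    if s_next[i * 3 + j] == '-']
    # Off-policy target: bootstrap from the best next action,
    # regardless of which action is actually played next.
    best_next = max(Q[a_next][s_next] for a_next in next_actions)
    Q[a][s] += alpha * (r + gamma * best_next - Q[a][s])

# Hypothetical toy table, Q[action][state], defaulting to 0.0.
Q = defaultdict(lambda: defaultdict(float))
s, s_next = "---------", "X---O----"
Q[(0, 1)][s_next] = 1.0  # pretend one next action already looks good

q_learning_update(Q, s, a=(1, 1), r=0, s_next=s_next)
print(round(Q[(1, 1)][s], 3))  # 0.45
```

With `alpha=0.5` and `gamma=0.9`, the update is `0.5 * (0 + 0.9 * 1.0 - 0) = 0.45`.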

In [None]:
class Qlearner(BaseLearner):
    """
    A class to implement the Q-learning agent.
    """
    def __init__(self, alpha, gamma, eps, eps_decay=0.):
        super().__init__(alpha, gamma, eps, eps_decay)

    def update(self, s, s_, a, a_, r):
        """
        Perform the Q-Learning update of Q values.

        Parameters
        ----------
        s : string
            previous state
        s_ : string
            new state
        a : (i,j) tuple
            previous action
        a_ : (i,j) tuple
            new action. NOT used by Q-learner!
        r : int
            reward received after executing action "a" in state "s"
        """
        # Update Q(s,a)
        if s_ is not None:
            # hold list of Q values for all a_,s_ pairs. We will access the max later
            possible_actions = [action for action in self.actions if s_[action[0]*3 + action[1]] == '-']
            Q_options = [self.Q[action][s_] for action in possible_actions]
            
            # update self.Q[a][s] using Q-learning update
            # your code here
            
            
        else:
            # terminal state update
            self.Q[a][s] += self.alpha*(r - self.Q[a][s])

        # add r to rewards list
        self.rewards.append(r)


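Second, we train a SARSA-based agent. Unlike Q-learning, SARSA is on-policy: it updates toward the `Q` value of the action `a_` that the agent actually chooses in the new state.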
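For comparison, the standard SARSA rule is $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma\, Q(s',a') - Q(s,a)]$, where $a'$ is the next action actually taken. The sketch below is a minimal standalone version, assuming a table indexed as `Q[action][state]`. The helper name `sarsa_update` and the toy values are hypothetical and do not come from the `tictactoe` package.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Apply one SARSA update to Q[a][s] in place (non-terminal s_next)."""
    # On-policy target: bootstrap from the action actually chosen next.
    Q[a][s] += alpha * (r + gamma * Q[a_next][s_next] - Q[a][s])

# Hypothetical toy table, Q[action][state], defaulting to 0.0.
Q = defaultdict(lambda: defaultdict(float))
s, s_next = "---------", "X---O----"
Q[(0, 1)][s_next] = 1.0  # best next action, not the one chosen here
Q[(2, 2)][s_next] = 0.5  # next action the agent actually picks

sarsa_update(Q, s, a=(1, 1), r=0, s_next=s_next, a_next=(2, 2))
print(round(Q[(1, 1)][s], 3))  # 0.225
```

A Q-learning update on the same table would use the maximum (`1.0`) instead of the chosen action's value (`0.5`), which is the only difference between the two rules.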

In [None]:
class SARSAlearner(BaseLearner):
    """
    A class to implement the SARSA agent.
    """
    def __init__(self, alpha, gamma, eps, eps_decay=0.):
        super().__init__(alpha, gamma, eps, eps_decay)

    def update(self, s, s_, a, a_, r):
        """
        Perform the SARSA update of Q values.

        Parameters
        ----------
        s : string
            previous state
        s_ : string
            new state
        a : (i,j) tuple
            previous action
        a_ : (i,j) tuple
            new action
        r : int
            reward received after executing action "a" in state "s"
        """
        if s_ is not None:
            
            # update self.Q[a][s] using SARSA update
            # your code here
            
            
        else:
            # terminal state update
            self.Q[a][s] += self.alpha*(r - self.Q[a][s])

        # add r to rewards list
        self.rewards.append(r)

## Visualization

Now that the RL agents are implemented, we can train them and compare their performance. We assume that a game loss results in a penalty of `-1`.

In [None]:
gl = GameLearning()
gl.beginTeaching(1000)
plot_agent_reward(gl.agent.rewards)

In [None]:
gl = GameLearning(agent_type="q")
gl.beginTeaching(1000)
plot_agent_reward(gl.agent.rewards)

Before proceeding, we check whether the Q-learning agent performs any better than an agent that does not learn at all.
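The check compares the tails of the two cumulative-reward curves element by element. The sketch below shows the idea on made-up reward lists, which are illustrative values and not real training output. It uses a tail of 3 where the cells below use the last 50 entries.

```python
import numpy as np

# Hypothetical reward sequences, for illustration only.
rewards_random = [-1, 0, -1, -1, 0, -1]
rewards_q = [-1, 0, 1, 1, 0, 1]

# Running totals of reward over time.
cum_random = np.cumsum(rewards_random)  # [-1 -1 -2 -3 -3 -4]
cum_q = np.cumsum(rewards_q)            # [-1 -1  0  1  1  2]

# Compare the last `tail` entries position by position.
tail = 3
print((cum_random[-tail:] < cum_q[-tail:]).all())  # True
```

Because the comparison is positional, it passes only if every one of the trailing cumulative totals of the learning agent exceeds the corresponding total of the random agent.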

In [None]:
gl_random = GameLearning(agent_type="random")
gl_random.beginTeaching(1000)
gl_q = GameLearning(agent_type="q")
gl_q.beginTeaching(1000)

In [None]:
last_50_random = np.cumsum(gl_random.agent.rewards)[-50:]
last_50_q      = np.cumsum(gl_q.agent.rewards)[-50:]

assert (last_50_random < last_50_q).all(), "Check the rewards in Q-learning agent"

Similarly, we check how SARSA performs.

In [None]:
gl_sarsa = GameLearning(agent_type="sarsa")
gl_sarsa.beginTeaching(1000)
plot_agent_reward(gl_sarsa.agent.rewards)

In [None]:
last_50_sarsa = np.cumsum(gl_sarsa.agent.rewards)[-50:]
assert (last_50_random < last_50_sarsa).all(), "Check the rewards in Sarsa agent"