## Solving Tic-Tac-Toe using TD(λ)
In this assignment you will build an RL agent that plays Tic-Tac-Toe, using the TD(λ) algorithm and the environment you built in the first week.

First, copy the environment you built in the first week into the cell below.
- The `act` function should also return the state of the board on every call, ideally as a numpy array for speed.
- The function must take input only through its own arguments, not through the `input()` function.

The ideal TicTacToe environment:
- Takes N, the size of the board, as an argument in its constructor.
- Has an `act` function that takes a single argument representing the action (preferably an int) and returns the state of the board (preferably a numpy array), the reward signal (float), and a bool `done` that is true if the game is over and false otherwise.
- Has a `reset` function that resets the board and starts a new game.
- Gives a reward of 1 if 'X' wins, -1 if 'O' wins (or vice versa), and 0 for a draw.
- Treats successive calls to `act` as moves by alternating players, so every other call belongs to the same player.
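
The integer action convention can be mapped to board coordinates with `divmod`. The following is a minimal sketch, assuming row-major numbering and 0 for empty cells; `action_to_cell` and `legal_actions` are hypothetical helper names, not part of the required interface.

```python
import numpy as np

def action_to_cell(action: int, n: int) -> tuple:
    """Map a flat action index to (row, col), assuming row-major numbering."""
    return divmod(action, n)

def legal_actions(board: np.ndarray) -> np.ndarray:
    """Return the flat indices of empty cells (assumes 0 marks an empty cell)."""
    return np.flatnonzero(board == 0)

board = np.zeros((3, 3), dtype=int)
board[action_to_cell(4, 3)] = 1    # X takes the centre
print(action_to_cell(7, 3))        # (2, 1)
print(legal_actions(board))        # [0 1 2 3 5 6 7 8]
```

A helper like `legal_actions` is also handy later for the random opponent, which only needs to sample from the empty cells.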

For example:
```text
env.reset()
Returns ==> (array([[0., 0., 0.],[0., 0., 0.],[0., 0., 0.]]), 0, False)
                | | | |
        Board:  | | | |
                | | | |

env.act(4)
Returns ==> (array([[0., 0., 0.],[0., 1.0, 0.],[0., 0., 0.]]), 0, False)
                | | | |
        Board:  | |X| |
                | | | |

env.act(0)
Returns ==> (array([[-1.0, 0., 0.],[0., 1.0, 0.],[0., 0., 0.]]), 0, False)
                |O| | |
        Board:  | |X| |
                | | | |

env.act(7)
Returns ==> (array([[-1.0, 0., 0.],[0., 1.0, 0.],[0., 1.0, 0.]]), 0, False)
                |O| | |
        Board:  | |X| |
                | |X| |

env.act(6)
Returns ==> (array([[-1.0, 0., 0.],[0., 1.0, 0.],[-1.0, 1.0, 0.]]), 0, False)
                |O| | |
        Board:  | |X| |
                |O|X| |

env.act(1)
Returns ==> (array([[-1.0, 1.0, 0.],[0., 1.0, 0.],[-1.0, 1.0, 0.]]), 1, True)
                |O|X| |
        Board:  | |X| |
                |O|X| |


```
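
The board drawings in the example above can be produced directly from the returned array. This is a small sketch, assuming the encoding used in the example (1 for 'X', -1 for 'O', 0 for empty); `render` and `SYMBOLS` are hypothetical names introduced here for illustration.

```python
import numpy as np

# Assumed encoding, matching the example: 1 -> X, -1 -> O, 0 -> empty
SYMBOLS = {1: "X", -1: "O", 0: " "}

def render(board: np.ndarray) -> str:
    """Return the board as text, one row per line, cells separated by '|'."""
    rows = ["|" + "|".join(SYMBOLS[int(v)] for v in row) + "|" for row in board]
    return "\n".join(rows)

board = np.array([[-1, 1, 0], [0, 1, 0], [-1, 1, 0]])
print(render(board))
# |O|X| |
# | |X| |
# |O|X| |
```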
<hr>

Note: You can modify your TicTacToe environment code before using it here.


In [1]:
# Import any necessary libraries here
import numpy as np

Your TicTacToe environment class goes here.

In [1]:
class TicTacToe:
    def __init__(self, n : int = 3):
        self.n = n
        self.grid = np.full((n, n), 0)
        self.currMove = 1
        self.moveNum = 0

    # Given below is the preferable structure of act function
    def act(self, action : int) -> tuple:     # Returns tuple of types (np.ndarray, int, bool)
        self.action = action
        self.xMove = action // self.n
        self.yMove = action % self.n
        maxMove = self.n ** 2
        if action < 0 or action >= maxMove or self.grid[self.xMove][self.yMove] != 0:
            print("Wrong Move, try again!")
            return self.grid, 0, False
        self.moveNum += 1
        self.grid[self.xMove][self.yMove] = self.currMove
        if self._checkWin():
            # currMove is 1 for X and -1 for O, so it doubles as the reward
            return self.grid, self.currMove, True
        if self.moveNum == maxMove:
            return self.grid, 0, True
        self._changeMove()
        return self.grid, 0, False

    def reset(self):
        self.grid = np.full((self.n, self.n), 0)
        self.currMove = 1
        self.moveNum = 0     # Restart the move counter so draws are detected in later games
        return self.grid, 0, False

    def _checkWin(self):
        # Only lines passing through the last move can have just been completed
        player = self.currMove
        if np.all(self.grid[self.xMove, :] == player):
            return True
        if np.all(self.grid[:, self.yMove] == player):
            return True
        # Check each diagonal independently: the centre cell lies on both,
        # so a failed main diagonal must not skip the anti-diagonal
        if self.xMove == self.yMove and np.all(np.diag(self.grid) == player):
            return True
        if (self.xMove + self.yMove == self.n - 1
                and np.all(np.diag(np.fliplr(self.grid)) == player)):
            return True
        return False


    def _changeMove(self):
        if self.currMove == 1:
            self.currMove = -1
        else:
            self.currMove = 1

Next comes the agent class, which:
- Uses the TD(λ) algorithm to find the optimal policy for each state
- Stores the calculated optimal policies as a .npy file for later use
- Periodically calculates its own average return against a random player (one that makes random moves on its turn) during training and plots it. For example, if the total number of training iterations is 10000, calculate the average return every 500 steps; for each average, play at least 5 games and take the mean of their returns.
- Can include any additional functions you need
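
The assignment does not fix a particular variant of TD(λ). As one possible starting point, here is a minimal sketch of the backward view with accumulating eligibility traces over a tabular value function. The function name `td_lambda_episode`, the hyperparameter defaults, and the use of hashable state keys (for example `board.tobytes()`) are all assumptions made for illustration.

```python
from collections import defaultdict

def td_lambda_episode(states, rewards, V, alpha=0.1, gamma=1.0, lam=0.8):
    """Apply backward-view TD(lambda) with accumulating traces to one episode.

    states  : hashable state keys s_0 .. s_T, where s_T is terminal
    rewards : rewards[t] is the reward received on leaving states[t]
    V       : dict mapping state key -> value estimate, updated in place
    """
    traces = defaultdict(float)
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        terminal = (t + 1 == len(rewards))
        # The value of a terminal state is taken to be 0
        bootstrap = 0.0 if terminal else gamma * V.get(s_next, 0.0)
        delta = rewards[t] + bootstrap - V.get(s, 0.0)   # TD error
        traces[s] += 1.0                                 # accumulating trace
        for key in list(traces):
            V[key] = V.get(key, 0.0) + alpha * delta * traces[key]
            traces[key] *= gamma * lam                   # decay every trace
    return V

V = {}
td_lambda_episode(["s0", "s1", "s2"], [0.0, 1.0], V)
print(V)   # s1 moves towards the final reward; s0 gets a smaller, trace-weighted share
```

With λ = 0 this reduces to one-step TD(0), and with λ = 1 and γ = 1 the credit reaches every visited state undiminished, so λ controls how far back a win or loss is propagated.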

You can store all the encountered states in a numpy array (with 3 dimensions) and the corresponding value of each state in another array (with 1 dimension), then save these arrays to a .npy file for future use, so that you don't have to retrain the model each time you want to play TicTacToe.
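
A minimal sketch of this storage scheme follows, assuming one `.npy` file per array (`np.save` writes a single array per file). The file names, the temporary directory, and the example values are hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical learned data: two 3x3 states and their value estimates
states = np.zeros((2, 3, 3), dtype=np.int8)   # shape (num_states, N, N)
states[1, 1, 1] = 1                           # second state: X in the centre
values = np.array([0.0, 0.3])                 # shape (num_states,)

out_dir = tempfile.mkdtemp()                  # any writable directory works
np.save(os.path.join(out_dir, "states.npy"), states)
np.save(os.path.join(out_dir, "values.npy"), values)

# Later: reload and rebuild a lookup table keyed by the raw board bytes
loaded_states = np.load(os.path.join(out_dir, "states.npy"))
loaded_values = np.load(os.path.join(out_dir, "values.npy"))
table = {s.tobytes(): float(v) for s, v in zip(loaded_states, loaded_values)}
print(table[states[1].tobytes()])             # 0.3
```

Keys built with `tobytes()` depend on the array's dtype, so a board must be cast to the same dtype as the stored states before it is looked up.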

In [None]:
# Both the constructor and the train() function can take any arguments you need
class TicTacToeAgent:
    def __init__(self) -> None:
        pass
    def train(self) -> None:
        pass
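
The periodic evaluation described in the agent requirements can be organised as a thin loop around training. This is a sketch only: `train_step` and `play_game` are hypothetical callables standing in for one training iteration and one full game against the random player, and the stand-ins below exist just so the example runs on its own.

```python
import random

def average_return(play_game, n_games: int = 5) -> float:
    """Average the return of play_game() over n_games independent games."""
    return sum(play_game() for _ in range(n_games)) / n_games

def train_with_evaluation(train_step, play_game, total_steps=10000, eval_every=500):
    """Run training and record (step, average return) at regular intervals."""
    history = []
    for step in range(1, total_steps + 1):
        train_step()
        if step % eval_every == 0:
            history.append((step, average_return(play_game)))
    return history

# Stand-ins: a no-op training step and a "game" whose return is drawn
# at random from {-1, 0, 1}
random.seed(0)
history = train_with_evaluation(lambda: None,
                                lambda: random.choice([-1, 0, 1]),
                                total_steps=2000, eval_every=500)
print([step for step, _ in history])   # [500, 1000, 1500, 2000]
```

The resulting `history` list can be unpacked into steps and averages and passed to a plotting call such as matplotlib's `plt.plot`.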

For evaluation and for your own checking, the code block below should, when run:
- Initialize the agent and call the train function to train it
- Load the stored state-value data
- Start a single-player game of TicTacToe that takes the user's moves as input, following the convention given below, with the trained values playing as the computer
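
One way the stored values could drive the computer's moves is to score the board that results from each legal move and pick the best one. This is a sketch under assumptions that the assignment does not prescribe: values are stored from X's point of view in a dict keyed by `board.tobytes()`, unseen boards default to 0, and `greedy_action` is a hypothetical helper name.

```python
import numpy as np

def greedy_action(board: np.ndarray, player: int, V: dict) -> int:
    """Pick the empty cell whose resulting board has the best value for `player`.

    Assumes V maps board.tobytes() -> value from X's point of view
    (X = 1, O = -1), with unseen boards defaulting to 0.
    """
    best_action, best_score = -1, -np.inf
    for action in np.flatnonzero(board == 0):
        after = board.copy()
        after.flat[action] = player
        score = player * V.get(after.tobytes(), 0.0)   # O prefers low X-values
        if score > best_score:
            best_action, best_score = int(action), score
    return best_action

board = np.zeros((3, 3), dtype=np.int8)
centre = board.copy()
centre[1, 1] = 1
V = {centre.tobytes(): 0.5}          # hypothetical learned value
print(greedy_action(board, 1, V))    # 4
```

Ties are broken in favour of the lowest cell index here, and the function returns -1 when the board is full.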

In [None]:
'''
You will be asked to enter the number of the cell where you want to make your move. For example, in a 3x3 TicTacToe:
0 | 1 | 2
3 | 4 | 5
6 | 7 | 8
The model should train on a 3x3 TicTacToe by default. You can modify the values (N, number of iterations, etc.) for your convenience, but training the model for a bigger N might take a lot of time.
'''

# Code Here