<a href="https://colab.research.google.com/github/cmoore102589/ai-data-science-portfolio/blob/main/Fundamentals%20of%20AI%20Portfolio%20Assignments/Project_Part_III/Project_Part_III_MMoore25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COSC-640: Project (Part III)

**Matthew Corley Moore**

[PLACEHOLDER_FOR_NOTEBOOK_LINK]


## Getting Started

Follow the instructions below to copy this notebook and to perform some initial setup.

1. Copy this notebook by selecting `File > Save a copy in Drive`.
2. A new window should open for the copied notebook. Move the new notebook to your course folder in Google Drive by selecting `File > Move` and then selecting the desired folder.
3. Update the name of the notebook by removing "Copy of" and replacing "Username" with your actual username.
4. Update the first cell in the notebook by replacing "**Student Name**" with your actual name.
5. Do not edit the line that says `PLACEHOLDER_FOR_NOTEBOOK_LINK`. This will be used by the [Notebook Renderer](https://colab.research.google.com/drive/1CJTipys46ldZxJFwnt7XbdjQUfkmoXeU?usp=sharing) tool to insert a link to your Colab notebook.
6. Enable link sharing for your notebook.

## Preparing the Colab Environment

Run the cell below to download the `aitools` course package.

In [None]:
%%capture
!rm -rf aitools
!git clone https://github.com/drbeane/aitools.git

Run the cell below to import the necessary tools for this assignment. No other import statements are required for this lab, and no other import statements should be included in this assignment.

In [None]:
from aitools.envs import BotPlayerEnv
from aitools.algs import PolicyPlayer, RandomPlayer, MinimaxPlayer
from aitools.algs import TDAgent
from aitools.algs import play_game, tournament

# Description of Part III of Project

In Part III of the project, you will train a Q-learning agent to play Nim. The agent will be trained by playing thousands of games against a `RandomPlayer` agent, but will eventually be able to consistently defeat the stronger `MinimaxPlayer` agents.

# Define `Nim` Class

Copy the definition for the `Nim` class from Part I of the project into the cell below.

In [None]:
class Nim:
    """Multi-pile Nim in which each move removes 1 to `limit` stones from a single pile."""

    def __init__(self, piles, stones, limit):
        self.piles = piles      # number of piles
        self.stones = stones    # starting stones per pile
        self.limit = limit      # maximum stones removed per move
        self.board = [stones] * piles
        self.turns = 0
        self.cur_player = 1
        self.winner = None

    def display(self):
        print(f"Current Turn: {self.turns}, Current Player: {self.cur_player}, Piles: {self.board}")

    def copy(self):
        """Return an independent copy of this game state."""
        new_instance = Nim(self.piles, self.stones, self.limit)
        new_instance.board = self.board.copy()
        new_instance.turns = self.turns
        new_instance.cur_player = self.cur_player
        new_instance.winner = self.winner
        return new_instance

    def check_for_win(self):
        """Record the player to move as the winner once every pile is empty."""
        if all(pile == 0 for pile in self.board):
            self.winner = self.cur_player

    def get_actions(self):
        """List every legal (pile_index, stones_to_take) pair."""
        actions = []
        for pile_index, stones in enumerate(self.board):
            max_stones = min(self.limit, stones)
            for stones_to_take in range(1, max_stones + 1):
                actions.append((pile_index, stones_to_take))
        return actions

    def take_action(self, a):
        """Apply action `a` to a copy of the game and return the new state."""
        new_node = self.copy()
        pile_index, stones_removed = a
        new_node.board[pile_index] -= stones_removed
        new_node.turns += 1
        new_node.cur_player = 2 if new_node.cur_player == 1 else 1
        new_node.check_for_win()
        return new_node

    def get_state(self):
        """Encode the board as one integer, treating pile i as digit i in base stones + 1."""
        s = 0
        for i in range(self.piles):
            s += self.board[i] * (self.stones + 1) ** i
        return s

    def heuristic(self, agent, mode=None):
        """No heuristic is used; every state is scored 0."""
        return 0
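
The `get_state()` method above packs the board into a single integer by treating each pile count as a digit in base `stones + 1`. The sketch below repeats that arithmetic in a standalone function and adds a hypothetical `decode_state` helper (not part of the `Nim` class or the `aitools` package) to illustrate that the encoding is reversible.

```python
def encode_state(board, stones):
    """Same arithmetic as Nim.get_state(): pile i is digit i in base (stones + 1)."""
    base = stones + 1
    return sum(count * base ** i for i, count in enumerate(board))


def decode_state(state, piles, stones):
    """Hypothetical inverse, for illustration only: recover the board from the integer."""
    base = stones + 1
    board = []
    for _ in range(piles):
        state, digit = divmod(state, base)
        board.append(digit)
    return board


start = encode_state([9, 9, 9], stones=9)   # 9 + 90 + 900
after = encode_state([4, 9, 9], stones=9)   # board after taking 5 stones from pile 0
print(start, after)
print(decode_state(after, piles=3, stones=9))
```

With 3 piles of 9 stones the base is 10, so the encoded state reads like the pile counts written in reverse order.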

# New Classes

You will work with two new classes in this notebook. These are named `BotPlayerEnv` and `PolicyPlayer`. These classes are described below.

The `BotPlayerEnv` class provides an interface that can be used with reinforcement learning algorithms to train agents to play games by having them compete against a "bot player" controlled by an adversarial search agent (such as `RandomPlayer`). Instances of `BotPlayerEnv` combine an instance of a game environment with an instance of an adversarial agent to create an environment that can be used with our RL algorithms. When an action is taken in this environment, the `BotPlayerEnv` class will apply that action, and then generate and apply an action for the bot player. The code block below demonstrates how to create an instance of `BotPlayerEnv` and how to use it with an instance of `TDAgent`.

    nim = Nim(piles=3, stones=9, limit=5)
    bot = RandomPlayer('Bot')
    bot_env = BotPlayerEnv(game_env=nim, agent=bot)
    td = TDAgent(bot_env, gamma=1, random_state=1)
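
To make the "apply the learner's action, then the bot's action" idea concrete, here is a minimal sketch of such a wrapper. It is not the `aitools` implementation: `TinyNim`, `SketchBotEnv`, the `step()` method, and the +1/-1 reward scheme are all assumptions made for illustration. `TinyNim` is a single-pile stand-in that copies the win rule from the `Nim` class in this notebook.

```python
import random


class TinyNim:
    """Single-pile stand-in that mirrors the notebook's Nim interface."""

    def __init__(self, stones, limit):
        self.stones, self.limit = stones, limit
        self.cur_player, self.winner = 1, None

    def get_actions(self):
        return list(range(1, min(self.limit, self.stones) + 1))

    def take_action(self, a):
        nxt = TinyNim(self.stones - a, self.limit)
        nxt.cur_player = 2 if self.cur_player == 1 else 1
        # As in the notebook's check_for_win(): once the pile is empty,
        # the player who is now to move is recorded as the winner.
        if nxt.stones == 0:
            nxt.winner = nxt.cur_player
        return nxt


class SketchBotEnv:
    """Hypothetical wrapper: one step is the learner's move plus the bot's reply."""

    def __init__(self, game, rng):
        self.game, self.rng = game, rng

    def step(self, action):
        self.game = self.game.take_action(action)      # learner (player 1) moves
        if self.game.winner is None:                   # bot replies if the game continues
            reply = self.rng.choice(self.game.get_actions())
            self.game = self.game.take_action(reply)
        done = self.game.winner is not None
        # Assumed reward scheme: +1 for a learner win, -1 for a loss, 0 otherwise.
        reward = 0 if not done else (1 if self.game.winner == 1 else -1)
        return self.game, reward, done


env = SketchBotEnv(TinyNim(stones=7, limit=3), random.Random(1))
done = False
while not done:
    state, reward, done = env.step(env.game.get_actions()[0])  # always take 1 stone
print(state.stones, reward)
```

From the learner's point of view the opponent becomes part of the environment, which is what lets a single-agent algorithm such as Q-learning be applied to a two-player game.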



An instance of the `PolicyPlayer` class represents an adversarial search agent that follows a policy that maps game states to actions. We will use this class to create agents that follow the policies learned by applying the Q-learning algorithm. It is interesting to note that since `PolicyPlayer` agents don't have to perform a search when selecting actions, they will always select their actions very quickly. It might take a significant amount of time to run the Q-learning algorithm that learns the policy to use in conjunction with `PolicyPlayer`, but once the policy is learned, the agent will play very quickly.

The code block below demonstrates how to create an instance of `PolicyPlayer`.

    p1 = PolicyPlayer('Policy Player', policy=some_policy)
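
The speed advantage described above comes from replacing search with a lookup. The sketch below assumes a policy is a dictionary from encoded states to `(pile_index, stones_to_take)` actions. Both that format and the `select_action` helper with its fallback rule are hypothetical, so they may differ from how `PolicyPlayer` and `TDAgent.policy` actually work.

```python
# Hypothetical policy: encoded state -> (pile_index, stones_to_take).
some_policy = {999: (0, 5), 994: (1, 5)}


def select_action(policy, state, legal_actions):
    """Look up the stored action; fall back to the first legal move for unseen states."""
    action = policy.get(state)
    return action if action in legal_actions else legal_actions[0]


legal = [(0, 1), (0, 2), (1, 5)]
print(select_action(some_policy, 994, legal))   # stored action is legal, so it is used
print(select_action(some_policy, 123, legal))   # unseen state, so the fallback is used
```

Each move costs one dictionary lookup regardless of how long the policy took to learn.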



# Part 1: Basic Q-Learning Agent

In Part 1, we will use Q-learning to learn a policy for playing Nim. The policy will be learned by having the Q-learning algorithm play many games against a `RandomPlayer` agent, and will be tested by having it play against `RandomPlayer` and `MinimaxPlayer` agents. Our eventual goal is to find a policy that can be used to consistently defeat a Minimax agent with a depth of 4.


## 1.A - Training the Agent

Create the following objects:

- An instance of `Nim` with 3 piles, 9 stones per pile, and with a limit of 5 stones per action.
- An instance of `RandomPlayer`.
- An instance of `BotPlayerEnv` using the `Nim` and `RandomPlayer` instances you created above.
- An instance of `TDAgent` that uses your instance of `BotPlayerEnv`. Set `gamma=1` and `random_state=1`.

After creating the objects above, use your `TDAgent` instance to apply Q-learning to learn a policy for `Nim`. Run 10,000 episodes of Q-learning with an exploration rate of 0.1 and a learning rate of 0.1. Also set `track_history=False` when calling the `q_learning()` method. This will significantly reduce the memory requirements of the algorithm.

In [None]:
nim = Nim(piles=3, stones=9, limit=5)
bot = RandomPlayer('Bot')
bot_env = BotPlayerEnv(game_env=nim, agent=bot)
td = TDAgent(bot_env, gamma=1, random_state=1)

td.q_learning(episodes=10000, epsilon=0.1, alpha=0.1, track_history=False)
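
For reference, the sketch below shows the standard tabular Q-learning update together with epsilon-greedy action selection, which are the roles that the learning rate (`alpha`) and exploration rate (`epsilon`) play in the call above. It is a generic illustration with hypothetical helper names, and it makes no claim about how `TDAgent.q_learning()` is implemented internally.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the highest-valued action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q, s, a, reward, s_next, next_actions, alpha, gamma):
    """One tabular Q-learning step; a terminal state has no next actions."""
    best_next = max((Q[(s_next, b)] for b in next_actions), default=0.0)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])


Q = defaultdict(float)
# A winning terminal transition, using alpha=0.1 and gamma=1 as in the cell above.
q_update(Q, s=4, a=(0, 4), reward=1.0, s_next=0, next_actions=[], alpha=0.1, gamma=1)
print(Q[(4, (0, 4))])
# With epsilon=0 the choice is purely greedy, so the updated action is selected.
print(epsilon_greedy(Q, 4, [(0, 1), (0, 4)], epsilon=0.0, rng=random.Random(1)))
```

A small `alpha` moves each estimate only a fraction of the way toward its target, which is one reason thousands of episodes are needed before the values settle.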

## 1.B - Create Agents

Create the following agents:
- A `PolicyPlayer` instance using the policy found by Q-learning.
- A `RandomPlayer` instance.
- A `MinimaxPlayer` instance with `depth=2`.
- A `MinimaxPlayer` instance with `depth=3`.
- A `MinimaxPlayer` instance with `depth=4`.

In [None]:
policyplayer = PolicyPlayer('Policy Player', policy=td.policy)

randomplayer = RandomPlayer('Random Player')

minimax_player1 = MinimaxPlayer('Minimax Player (Depth 2)', depth=2)
minimax_player2 = MinimaxPlayer('Minimax Player (Depth 3)', depth=3)
minimax_player3 = MinimaxPlayer('Minimax Player (Depth 4)', depth=4)

## 1.C - Versus RandomPlayer

Run a 1000 round tournament between the `PolicyPlayer` agent and the `RandomPlayer` agent. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.

In [None]:
tournament(nim, agents = [policyplayer, randomplayer], rounds = 1000, random_state = 1, switch_players = True)

100%|██████████| 1000/1000 [00:00<00:00, 5350.32it/s]

Policy Player vs. Random Player
-------------------------------
Ties:          0
Player 1 Wins: 920
Player 2 Wins: 80
Player 1 took: 0.04 seconds
Player 2 took: 0.10 seconds
Average number of turns: 11.4





## 1.D - Versus Minimax(2)

Run a 1000 round tournament between the `PolicyPlayer` agent and the `MinimaxPlayer` agent with `depth=2`. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.

In [None]:
tournament(nim, agents = [policyplayer, minimax_player1], rounds = 1000, random_state = 1, switch_players = True)

100%|██████████| 1000/1000 [00:01<00:00, 554.69it/s]

Policy Player vs. Minimax Player (Depth 2)
------------------------------------------
Ties:          0
Player 1 Wins: 726
Player 2 Wins: 274
Player 1 took: 0.05 seconds
Player 2 took: 1.66 seconds
Average number of turns: 11.2





## 1.E - Versus Minimax(3)

Run a 1000 round tournament between the `PolicyPlayer` agent and the `MinimaxPlayer` agent with `depth=3`. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.



In [None]:
tournament(nim, agents = [policyplayer, minimax_player2], rounds = 1000, random_state = 1, switch_players = True)

100%|██████████| 1000/1000 [00:17<00:00, 56.44it/s]

Policy Player vs. Minimax Player (Depth 3)
------------------------------------------
Ties:          0
Player 1 Wins: 402
Player 2 Wins: 598
Player 1 took: 0.07 seconds
Player 2 took: 17.29 seconds
Average number of turns: 12.2





## 1.F - Versus Minimax(4)

Run a 1000 round tournament between the `PolicyPlayer` agent and the `MinimaxPlayer` agent with `depth=4`. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.

In [None]:
tournament(nim, agents = [policyplayer, minimax_player3], rounds = 1000, random_state = 1, switch_players = True)

100%|██████████| 1000/1000 [02:55<00:00,  5.68it/s]

Policy Player vs. Minimax Player (Depth 4)
------------------------------------------
Ties:          0
Player 1 Wins: 188
Player 2 Wins: 812
Player 1 took: 0.13 seconds
Player 2 took: 173.90 seconds
Average number of turns: 11.6





## 1.G - Summarizing Results

Indicate the win rates for the `PolicyPlayer` agent by filling in each of the blanks below. Provide your answers as percentages rounded to 1 decimal place.

The policy player won:
- 92.0% of games played against the `RandomPlayer` agent.
- 72.6% of games played against the `MinimaxPlayer` agent with depth 2.
- 40.2% of games played against the `MinimaxPlayer` agent with depth 3.
- 18.8% of games played against the `MinimaxPlayer` agent with depth 4.
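
The win rates can be computed directly from the tournament outputs in 1.C through 1.F. The sketch below assumes that "Player 1" in each output refers to the `PolicyPlayer` agent, since it was listed first. The `win_rate` helper is a hypothetical convenience function, not part of `aitools`.

```python
def win_rate(wins, games=1000):
    """Percentage of games won, rounded to one decimal place."""
    return round(100 * wins / games, 1)


# "Player 1 Wins" counts copied from the Part 1 tournament outputs.
part1_wins = {
    "RandomPlayer": 920,
    "Minimax depth 2": 726,
    "Minimax depth 3": 402,
    "Minimax depth 4": 188,
}
for opponent, wins in part1_wins.items():
    print(f"{opponent}: {win_rate(wins)}%")
```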

# Part 2: Improved Q-Learning Agent

We will now attempt to improve the `PolicyPlayer` agent by using Q-learning to find a better policy.

Please repeat steps A-G from Part 1, with one change: experiment with the parameters used in Step A when applying Q-learning. You can experiment with any or all of the following:

* You can increase the number of episodes.
* You can experiment with the exploration rate.
* You can experiment with the learning rate.
* You can experiment with exploring starts.
* You can consider applying multiple rounds of Q-learning, changing the parameters between rounds.
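
One way to organize the last option is a schedule of `(episodes, epsilon, alpha)` rounds that is passed to a training function in turn. The sketch below is hypothetical: `run_schedule` and the specific values are illustrative, and `fake_train` stands in for a real call such as `td2.q_learning(...)`. It also assumes that repeated calls to `q_learning()` continue from the previously learned values, which this notebook does not confirm.

```python
def run_schedule(train_fn, schedule):
    """Call train_fn once per round with that round's hyperparameters."""
    total = 0
    for episodes, epsilon, alpha in schedule:
        train_fn(episodes=episodes, epsilon=epsilon, alpha=alpha)
        total += episodes
    return total


# Hypothetical schedule: explore broadly first, then refine with smaller steps.
schedule = [(20000, 0.3, 0.2), (20000, 0.1, 0.1), (10000, 0.05, 0.05)]

log = []


def fake_train(**params):
    """Stand-in for a real training call; it only records the parameters."""
    log.append(params)


print(run_schedule(fake_train, schedule), len(log))
```

Starting with a higher exploration rate visits more of the state space early on, while lower rates in later rounds let the estimates settle.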

Your goal is to find a policy that results in the `PolicyPlayer` agent winning at least 85% of the games in the tournament against the `MinimaxPlayer` with `depth=4`. If you find a policy that gets close to this, you will get partial credit, but you will not receive full credit unless the `PolicyPlayer` agent reaches the 85% win rate.

**Hint:** You should be able to attain this goal by simply increasing the number of episodes. However, you might be able to achieve the desired results in fewer episodes, or reach a win rate higher than 85%, by experimenting with the other parameters. It is possible (but not required) to find a policy with a win rate of at least 95%.

## 2.A - Training the Agent

Create the following objects:

- An instance of `Nim` with 3 piles, 9 stones per pile, and with a limit of 5 stones per action.
- An instance of `RandomPlayer`.
- An instance of `BotPlayerEnv` using the `Nim` and `RandomPlayer` instances you created above.
- An instance of `TDAgent` that uses your instance of `BotPlayerEnv`. Set `gamma=1` and `random_state=1`.

After creating the objects above, use your `TDAgent` instance to apply Q-learning to learn a policy for `Nim`. As before, you should set `track_history=False` when calling the `q_learning()` method.

In [None]:
nim_game = Nim(piles=3, stones=9, limit=5)

random_player = RandomPlayer('Random Player 2')

bot_env = BotPlayerEnv(game_env=nim_game, agent=random_player)

td2 = TDAgent(bot_env, gamma=1, random_state=1)

td2.q_learning(episodes=50000, epsilon=0.1, alpha=0.1, track_history=False)

## 2.B - Create Agents

Create the following agents:
- A `PolicyPlayer` instance using the policy found by Q-learning.
- A `RandomPlayer` instance.
- A `MinimaxPlayer` instance with `depth=2`.
- A `MinimaxPlayer` instance with `depth=3`.
- A `MinimaxPlayer` instance with `depth=4`.

In [None]:
policyplayer2 = PolicyPlayer('Policy Player', policy=td2.policy)

random_player = RandomPlayer('Random Player')

minimaxplayer_depth_2 = MinimaxPlayer('Minimax Player Depth 2', depth=2)
minimaxplayer_depth_3 = MinimaxPlayer('Minimax Player Depth 3', depth=3)
minimaxplayer_depth_4 = MinimaxPlayer('Minimax Player Depth 4', depth=4)


## 2.C - Versus RandomPlayer

Run a 1000 round tournament between the `PolicyPlayer` agent and the `RandomPlayer` agent. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.

In [None]:
tournament(nim_game, agents=[policyplayer2, random_player], rounds=1000, random_state=1)

100%|██████████| 1000/1000 [00:00<00:00, 6272.85it/s]

Policy Player vs. Random Player
-------------------------------
Ties:          0
Player 1 Wins: 968
Player 2 Wins: 32
Player 1 took: 0.03 seconds
Player 2 took: 0.09 seconds
Average number of turns: 11.5





## 2.D - Versus Minimax(2)

Run a 1000 round tournament between the `PolicyPlayer` agent and the `MinimaxPlayer` agent with `depth=2`. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.

In [None]:
tournament(nim_game, agents=[policyplayer2, minimaxplayer_depth_2], rounds=1000, random_state=1)

100%|██████████| 1000/1000 [00:01<00:00, 526.68it/s]

Policy Player vs. Minimax Player Depth 2
----------------------------------------
Ties:          0
Player 1 Wins: 865
Player 2 Wins: 135
Player 1 took: 0.06 seconds
Player 2 took: 1.73 seconds
Average number of turns: 11.4





## 2.E - Versus Minimax(3)

Run a 1000 round tournament between the `PolicyPlayer` agent and the `MinimaxPlayer` agent with `depth=3`. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.



In [None]:
tournament(nim_game, agents=[policyplayer2, minimaxplayer_depth_3], rounds=1000, random_state=1)

100%|██████████| 1000/1000 [00:18<00:00, 55.31it/s]

Policy Player vs. Minimax Player Depth 3
----------------------------------------
Ties:          0
Player 1 Wins: 607
Player 2 Wins: 393
Player 1 took: 0.07 seconds
Player 2 took: 17.62 seconds
Average number of turns: 12.5





## 2.F - Versus Minimax(4)

Run a 1000 round tournament between the `PolicyPlayer` agent and the `MinimaxPlayer` agent with `depth=4`. Set `random_state=1`. When creating the agent list, please list the `PolicyPlayer` agent first.

In [None]:
tournament(nim_game, agents=[policyplayer2, minimaxplayer_depth_4], rounds=1000, random_state=1)

100%|██████████| 1000/1000 [02:53<00:00,  5.75it/s]

Policy Player vs. Minimax Player Depth 4
----------------------------------------
Ties:          0
Player 1 Wins: 362
Player 2 Wins: 638
Player 1 took: 0.13 seconds
Player 2 took: 171.95 seconds
Average number of turns: 11.8





## 2.G - Summarizing Results

Indicate the win rates for the `PolicyPlayer` agent by filling in each of the blanks below. Provide your answers as percentages rounded to 1 decimal place.

The policy player won:
- 96.8% of games played against the `RandomPlayer` agent.
- 86.5% of games played against the `MinimaxPlayer` agent with depth 2.
- 60.7% of games played against the `MinimaxPlayer` agent with depth 3.
- 36.2% of games played against the `MinimaxPlayer` agent with depth 4.
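
To check progress toward the Part 2 goal, the depth-4 results from 1.F and 2.F can be compared against the 85% target. The sketch below again assumes that "Player 1" in the tournament output is the `PolicyPlayer` agent. The variable names are illustrative.

```python
TARGET = 85.0  # required win rate (%) against MinimaxPlayer with depth=4

# "Player 1 Wins" counts out of 1000 games, copied from sections 1.F and 2.F.
part1 = 100 * 188 / 1000
part2 = 100 * 362 / 1000

print(f"Part 1: {part1:.1f}%, Part 2: {part2:.1f}%")
print(f"Improvement: {part2 - part1:.1f} percentage points")
print(f"Shortfall versus target: {TARGET - part2:.1f} percentage points")
```

With these outputs, raising the episode count from 10,000 to 50,000 improved the depth-4 win rate, but the result still falls short of the 85% target.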

# Submission Instructions

1. Perform a Restart and Run All by clicking **Runtime > Restart session and run all**.
2. Copy the link to your notebook by clicking **Share > Copy Link**.
3. Paste the copied link into the `notebook_url` field in the [Notebook Renderer](https://colab.research.google.com/drive/1CJTipys46ldZxJFwnt7XbdjQUfkmoXeU?usp=sharing) tool and then execute the cell to render the notebook.
4. The Notebook Renderer will open a save file dialog. Save the resulting HTML file to your local machine.
5. Submit the HTML file to Canvas.
