In [1]:
import random
import numpy as np

In [8]:
# Constants for the game
EMPTY = 0
X = 1
O = 2

In [9]:
class TicTacToe:
    def __init__(self):
        self.board = [EMPTY] * 9  # 3x3 grid
        self.done = False
        self.winner = None

    def reset(self):
        self.board = [EMPTY] * 9
        self.done = False
        self.winner = None
        return self.board

    def render(self):
        for i in range(3):
            print(self.board[i*3:i*3+3])
        print()

    def is_winner(self, player):
        win_conditions = [
            [0, 1, 2], [3, 4, 5], [6, 7, 8],  # rows
            [0, 3, 6], [1, 4, 7], [2, 5, 8],  # columns
            [0, 4, 8], [2, 4, 6]  # diagonals
        ]
        for condition in win_conditions:
            if all(self.board[i] == player for i in condition):
                return True
        return False

    def is_full(self):
        return EMPTY not in self.board

    def step(self, action, player):
        if self.board[action] != EMPTY:
            return self.board, -10, True  # Invalid move penalty

        self.board[action] = player

        if self.is_winner(player):
            return self.board, 10, True  # Win reward

        if self.is_full():
            return self.board, 0, True  # Draw reward

        return self.board, 0, False  # No winner yet

In [10]:
class QLearningAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.q_table = {}

    def get_q(self, state, action):
        state = tuple(state)
        if state not in self.q_table:
            self.q_table[state] = [0] * 9  # Initialize Q-values for all actions
        return self.q_table[state][action]

    def update_q(self, state, action, reward, next_state, done):
        state = tuple(state)
        next_state = tuple(next_state)
        if next_state not in self.q_table:
            self.q_table[next_state] = [0] * 9

        max_next_q = max(self.q_table[next_state]) if not done else 0  # Best Q-value reachable from next_state
        q_value = self.get_q(state, action)
        new_q_value = q_value + self.alpha * (reward + self.gamma * max_next_q - q_value)
        self.q_table[state][action] = new_q_value

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice([i for i in range(9) if state[i] == EMPTY])  # Explore
        q_values = [self.get_q(state, i) if state[i] == EMPTY else -float('inf') for i in range(9)]
        return int(np.argmax(q_values))  # Exploit (cast NumPy integer to a plain int)

In [11]:
def train(agent, num_episodes=10000):
    game = TicTacToe()

    for episode in range(num_episodes):
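        state = tuple(game.reset())  # Snapshot: the board list is mutated in place by step()
        done = False
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done = game.step(action, X)  # X is the agent
            next_state = tuple(next_state)  # Copy so the pre-move state is not overwritten
            agent.update_q(state, action, reward, next_state, done)
            state = next_state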


In [12]:
def test(agent, num_games=100):
    game = TicTacToe()
    agent_wins, opponent_wins, draws = 0, 0, 0

    for _ in range(num_games):
        state = game.reset()
        done = False
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done = game.step(action, X)
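            if reward == 10:
                agent_wins += 1
            elif reward == -10:
                opponent_wins += 1  # Invalid move by the agent; no opponent actually moves here
            elif done:
                draws += 1  # Board filled with no winner
            state = next_state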

    print(f"Agent wins: {agent_wins}, Opponent wins: {opponent_wins}, Draws: {draws}")


In [13]:
# Training the agent
agent = QLearningAgent()
train(agent, num_episodes=10000)

In [14]:
# Testing the agent
test(agent, num_games=100)

Agent wins: 100, Opponent wins: 0, Draws: 0


Explanation of the Code:

a. Setting up the environment
The TicTacToe class simulates the environment. It initializes the game state (board), checks for wins, handles game steps (actions), and displays the current state with render().

b. Defining the Tic-Tac-Toe game
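The game is played on a 3x3 grid stored as a flat list of nine cells. The agent plays as X, and step() checks for a win after each move. An episode ends on a win (reward 10), a full board (reward 0), or an attempt to play an occupied cell (reward -10).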

c. Building the reinforcement learning model
The QLearningAgent class represents the Q-learning agent. The agent uses a Q-table (q_table) where each key is a game state, and the value is a list of Q-values for each possible action.
The agent uses the epsilon-greedy strategy: it either explores random moves or exploits the best-known move (based on the Q-values).
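
To make the Q-table layout concrete, the following minimal sketch builds a one-entry table by hand and picks the greedy move among the free cells. The Q-values are made up for illustration, and greedy_action is a hypothetical helper, not part of the classes above.

```python
EMPTY, X, O = 0, 1, 2

def greedy_action(q_table, state):
    """Return the free cell with the highest Q-value for this state."""
    key = tuple(state)                      # lists are unhashable, so use a tuple key
    q_values = q_table.get(key, [0.0] * 9)  # unseen states default to all zeros
    free_cells = [i for i in range(9) if state[i] == EMPTY]
    return max(free_cells, key=lambda i: q_values[i])

# Illustrative board: X holds cells 0 and 1, O holds cell 4.
state = [X, X, EMPTY,
         EMPTY, O, EMPTY,
         EMPTY, EMPTY, EMPTY]

# Made-up Q-values: cell 1 scores highest but is occupied, so it must be skipped.
q_table = {tuple(state): [0.0, 9.0, 5.0, 0.1, 0.0, 0.2, 0.0, 0.0, 0.3]}

best = greedy_action(q_table, state)
print(best)
```

Occupied cells are excluded before the maximum is taken, which plays the same role as the -inf masking in choose_action.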

d. Training the model
In the train function, the agent plays num_episodes games, learning from each game's rewards. After every action, the Q-table is updated using the Q-learning update rule:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)$$
where:
α is the learning rate.
γ is the discount factor.
r_t is the reward from the action.
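
As a worked example of the update rule, the sketch below applies it twice with the default alpha=0.1 and gamma=0.9. The function q_update is a hypothetical standalone helper that mirrors the arithmetic in update_q; the starting Q-values of 0 are an assumption matching a freshly initialized table.

```python
def q_update(q, reward, max_next_q, alpha=0.1, gamma=0.9, done=False):
    """One application of the Q-learning update rule."""
    # On a terminal step there is no future value to add.
    target = reward if done else reward + gamma * max_next_q
    return q + alpha * (target - q)

# Winning move: reward 10 and the episode ends.
q_win = q_update(q=0.0, reward=10, max_next_q=0.0, done=True)

# The move before it: no immediate reward, but the best next Q-value
# is discounted by gamma and propagated one step back.
q_prev = q_update(q=0.0, reward=0, max_next_q=q_win)

print(q_win, q_prev)
```

The first update moves the Q-value a tenth of the way toward the reward of 10, giving about 1.0. The second gives about 0.09, which shows how value flows backward from the winning position over repeated episodes.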

e. Testing the model
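The test function plays num_games games with the trained agent and tallies agent wins, opponent wins, and draws. As written, only the agent (X) places marks: no opponent move is made, and a -10 reward (an invalid move) is what gets counted under opponent wins. The 100 agent wins reported above therefore do not measure play against an actual opponent.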
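
One way to evaluate against a random opponent is sketched below. It is a self-contained illustration, not the notebook's method: play_game, random_move, and has_won are hypothetical helpers, and X is also played at random here as a stand-in for the trained agent.

```python
import random

EMPTY, X, O = 0, 1, 2
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def has_won(board, player):
    return any(all(board[i] == player for i in line) for line in WIN_LINES)

def random_move(board):
    """Pick uniformly among the empty cells."""
    return random.choice([i for i, cell in enumerate(board) if cell == EMPTY])

def play_game(choose_x_move):
    """X moves via choose_x_move, O moves at random. Returns X, O, or None for a draw."""
    board = [EMPTY] * 9
    player = X
    while True:
        move = choose_x_move(board) if player == X else random_move(board)
        board[move] = player
        if has_won(board, player):
            return player
        if EMPTY not in board:
            return None
        player = O if player == X else X  # alternate turns

# Stand-in policy for X: random play (assumption; swap in the agent's policy).
results = [play_game(random_move) for _ in range(100)]
print("X wins:", results.count(X), "O wins:", results.count(O), "Draws:", results.count(None))
```

To score the trained agent instead, a callable such as agent.choose_action could be passed as choose_x_move, ideally with epsilon set to 0 so that no exploratory moves are made during evaluation.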

Main Algorithms Used:
Q-Learning: A reinforcement learning algorithm in which an agent learns action values by interacting with the environment and updating its estimates from the rewards it receives.
Epsilon-Greedy: A strategy that balances exploration and exploitation by choosing a random move with probability epsilon and the best-known move otherwise.

Conclusion:
This code sets up a simple Q-learning agent for Tic-Tac-Toe. The agent learns from repeated games and adjusts its Q-values accordingly. Because the listing never places O on the board, evaluating it against a random opponent requires adding opponent moves to the game loop.


Explanation of Changes:
render() method: In the revised version, render() is called after each move in step() to print the current board. It separates columns with vertical bars (|) and rows with dashes (-). Note that the listing above still shows the earlier version, in which render() prints each row as a Python list and step() does not call it.

Example output for a game board:

X | O |  
-----
O | X | O
-----
X |   |  

Visual Output during Gameplay: After every move, self.render() is called to print the updated board, which allows you to track how the game evolves visually.
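
A minimal sketch of a formatter that produces this layout is shown below. The name format_board and the SYMBOLS mapping are assumptions for illustration; the revised render() itself is not shown in the notebook.

```python
EMPTY, X, O = 0, 1, 2
SYMBOLS = {EMPTY: " ", X: "X", O: "O"}

def format_board(board):
    """Return the board as text, with '|' between columns and dashes between rows."""
    rows = []
    for r in range(3):
        cells = [SYMBOLS[board[r * 3 + c]] for c in range(3)]
        rows.append(" | ".join(cells))
    return "\n-----\n".join(rows)

board = [X, O, EMPTY,
         O, X, O,
         X, EMPTY, EMPTY]
print(format_board(board))
```

Returning a string rather than printing directly keeps the formatting easy to check; a render() method could simply print the result.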

Example of Game Play:
Let's consider a random game scenario where the agent plays against an opponent. The output will look like this:

X |   |  
-----
O | X |  
-----
  |   | O

X | O |  
-----
O | X |  
-----
X | O |  

This gives you a step-by-step visualization of how the game progresses, showing the state of the board after each move.