In [1]:
import random
import numpy as np

In [9]:
# Display the Tic-Tac-Toe board
def display_board(board):
    print("\n".join([" | ".join(board[i*3:(i+1)*3]) for i in range(3)]))
    print("-" * 9)

In [10]:
# Check if there's a winner
def check_winner(board, player):
    wins = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]
    return any(all(board[i] == player for i in win) for win in wins)

In [12]:
# Q-learning agent functions
q_table = {}  # Stores Q-values for state-action pairs
def get_state(board):
    return "".join(board)

In [13]:
def choose_action(board, epsilon=0.1):
    state = get_state(board)
    if random.random() < epsilon or state not in q_table:
        return random.choice([i for i, x in enumerate(board) if x == ' '])
    return max((i for i, x in enumerate(board) if x == ' '), key=lambda x: q_table[state].get(x, 0))

In [14]:
def update_q(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    if state not in q_table:
        q_table[state] = {}
    old_q = q_table[state].get(action, 0)
    next_max = max(q_table.get(next_state, {}).values(), default=0)
    q_table[state][action] = old_q + alpha * (reward + gamma * next_max - old_q)

In [15]:
# Training the AI through reinforcement learning
def train_ai(episodes=5000):
    for _ in range(episodes):
        board = [' '] * 9
        while True:
            state = get_state(board)
            action = choose_action(board)
            board[action] = 'X'
            if check_winner(board, 'X'):
                update_q(state, action, 1, get_state(board))
                break
            if ' ' not in board:
                update_q(state, action, 0.5, get_state(board))
                break
            opp_action = random.choice([i for i, x in enumerate(board) if x == ' '])
            board[opp_action] = 'O'
            if check_winner(board, 'O'):
                update_q(state, action, -1, get_state(board))
                break
            update_q(state, action, 0, get_state(board))

In [16]:
# Play a game against the trained AI
def play_game():
    board = [' '] * 9
    while True:
        display_board(board)
        ai_action = choose_action(board, epsilon=0)  # No exploration in test
        board[ai_action] = 'X'
        print("\nAI moved:")
        display_board(board)
        if check_winner(board, 'X'):
            print("AI wins!")
            break
        if ' ' not in board:
            print("It's a draw!")
            break
        while True:
            try:
                player_action = int(input("\nEnter your move (0-8): "))
                if 0 <= player_action <= 8 and board[player_action] == ' ':
                    board[player_action] = 'O'
                    break
                print("Invalid move, try again.")
            except ValueError:
                print("Invalid input, try again.")
        if check_winner(board, 'O'):
            display_board(board)
            print("You win!")
            break
        if ' ' not in board:
            display_board(board)
            print("It's a draw!")
            break

In [17]:
# Main
print("Training AI...")
train_ai()
print("Training complete!\n")
print("Game starts! You are 'O', AI is 'X'")
print("Positions (0-8):")
print(np.arange(9).reshape(3,3))

Training AI...
Training complete!

Game starts! You are 'O', AI is 'X'
Positions (0-8):
[[0 1 2]
 [3 4 5]
 [6 7 8]]


In [19]:
play_game()

  |   |  
  |   |  
  |   |  
---------

AI moved:
  | X |  
  |   |  
  |   |  
---------
O | X |  
  |   |  
  |   |  
---------

AI moved:
O | X |  
  | X |  
  |   |  
---------
O | X |  
  | X | O
  |   |  
---------

AI moved:
O | X | X
  | X | O
  |   |  
---------
Invalid move, try again.
O | X | X
  | X | O
O |   |  
---------

AI moved:
O | X | X
  | X | O
O | X |  
---------
AI wins!


Explanation:
-------------

We import random for choosing random moves (exploration during Q-learning and the random training opponent) and numpy, which is used only to print the 3x3 grid of position numbers.

Display the Tic-Tac-Toe Board:
This function displays the Tic-Tac-Toe board in a readable format. The board is a 1D list of 9 elements, and we use slicing to create rows.

Check for a Winner:
This function checks whether a player has won. The variable wins lists all eight winning index combinations (three rows, three columns, two diagonals), and the function returns True if every cell in any one of them holds the player’s symbol.
Suggestion: Add variations by allowing different board sizes or modifying winning conditions.
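As a starting point for the board-size suggestion, here is a minimal sketch that generates the winning lines for an n x n board stored as a flat list, assuming a win still means filling a complete row, column, or diagonal. The names winning_lines and check_winner_n are hypothetical and not part of the notebook.

```python
def winning_lines(n):
    """Return every winning index tuple for an n x n board stored as a flat list."""
    rows = [tuple(r * n + c for c in range(n)) for r in range(n)]
    cols = [tuple(r * n + c for r in range(n)) for c in range(n)]
    main_diag = tuple(i * n + i for i in range(n))
    anti_diag = tuple(i * n + (n - 1 - i) for i in range(n))
    return rows + cols + [main_diag, anti_diag]


def check_winner_n(board, player, n):
    """Generalised check_winner: True if player fills any complete line."""
    return any(all(board[i] == player for i in line) for line in winning_lines(n))
```

For n = 3 this reproduces the hard-coded wins list; the rest of the notebook (board length, input range, display) would also need to use n.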

Q-learning Setup:
q_table is a nested dictionary: the outer key is a board state, the inner key is an action (a cell index 0-8), and the value is the Q-value learned for that state-action pair. Pairs that have never been updated are treated as 0.

Get State:
The board is a list, which cannot be used as a dictionary key, so its state is joined into a 9-character string that serves as a unique key for q_table.
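A small illustration of how a board maps to a key and how Q-values sit under it. The board position and the Q-values here are made up for the example; only the layout mirrors the notebook.

```python
# Board with X in the top-left corner and O in the centre.
board = ['X', ' ', ' ', ' ', 'O', ' ', ' ', ' ', ' ']
state = "".join(board)  # same operation as get_state(board)

# Hypothetical Q-values for two of the empty cells in this state.
example_q_table = {state: {2: 0.5, 8: -0.25}}

# Unseen actions default to 0, as in choose_action and update_q.
q_for_cell_8 = example_q_table[state].get(8, 0)
q_for_cell_1 = example_q_table[state].get(1, 0)
```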

Choose Action:
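This function selects the AI’s action. With probability epsilon, or whenever the current state has no entry in q_table yet, it picks a random empty cell (exploration); otherwise, it picks the empty cell with the highest Q-value for that state (exploitation).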
Suggestion: Change the epsilon parameter or decrease it gradually to experiment with the balance between exploration and exploitation.
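One way to decrease epsilon gradually is an exponential decay with a floor. This is only a sketch: the function name decayed_epsilon and the values of start, end, and decay are assumptions chosen for illustration, not settings from the notebook.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.999):
    """Shrink epsilon exponentially with the episode number, never going below end."""
    return max(end, start * decay ** episode)
```

Inside train_ai, the loop variable could then drive exploration, for example choose_action(board, epsilon=decayed_epsilon(episode)), so that early episodes are mostly random and later ones mostly greedy.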

Update Q-values:
This function updates Q-values based on the reward for the current action and the maximum expected reward from the next state, applying the Q-learning formula.
Parameters:
alpha (learning rate): Controls the weight given to new information.
gamma (discount factor): Determines how much future rewards affect the current state.
Suggestion: Adjust alpha and gamma values to see how quickly the AI learns or how it values future rewards over immediate ones.
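To make the effect of alpha and gamma concrete, the sketch below applies the same arithmetic as update_q to plain numbers. The helper q_update is hypothetical; the defaults match the notebook (alpha=0.5, gamma=0.9).

```python
def q_update(old_q, reward, next_max, alpha=0.5, gamma=0.9):
    """One Q-learning step on plain numbers, mirroring the last line of update_q."""
    return old_q + alpha * (reward + gamma * next_max - old_q)


# Winning move seen for the first time: old_q = 0, reward = 1, terminal so next_max = 0.
first = q_update(0.0, 1.0, 0.0)    # 0.5
# The same winning move seen again moves halfway from 0.5 toward 1.
second = q_update(first, 1.0, 0.0)  # 0.75
# Non-terminal move (reward 0) whose best known follow-up is worth 0.5.
earlier = q_update(0.0, 0.0, 0.5)   # 0.225
print(first, second, earlier)
```

With alpha=0.5 each update closes half the gap to the target, and gamma=0.9 means a value one move ahead counts for 90% of its size.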

Q-learning Algorithm Explanation
Q-learning is a reinforcement learning algorithm that aims to learn a policy by maximizing cumulative rewards. Here’s how it fits into this Tic-Tac-Toe game:

Q-table: Stores Q-values that represent the desirability of actions given states.
Exploration and Exploitation: During training the AI picks a random move with probability epsilon (0.1 by default) or when it has never seen the state; otherwise it exploits the move with the highest learned Q-value.
Update Rule: Updates Q-values based on actions taken, rewards received, and maximum Q-values of resulting states to improve decision-making over time.
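Written out, the update rule that update_q implements is the standard tabular Q-learning update. Here s is the board before the AI moves, a is the chosen cell, r is the reward, and s' is the board passed as next_state (in train_ai, the board after the opponent has replied or the game has ended):

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

When s' has no entry in q_table, as is the case for finished games, the max term is taken as 0.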

Training the AI:
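This function trains the AI over many episodes, each one a full game. The AI (X) moves first and plays against an opponent (O) that picks uniformly random moves; this is not self-play. After each AI move, the Q-value is updated once the outcome of that move is known: immediately on a win or draw, otherwise after the opponent has replied.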
Rewards:
Win: +1
Draw: +0.5
Loss: -1
Any other move: 0
Suggestion: Increase or decrease the number of episodes to control training depth. More episodes can lead to better AI performance.
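To check whether more episodes actually help, it is useful to measure results rather than judge by playing. The sketch below is one possible evaluator, not part of the notebook: evaluate, has_won, and first_empty are hypothetical names, and the opponent is uniformly random, as in training.

```python
import random

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def has_won(board, player):
    """Same test as check_winner, repeated so this block runs on its own."""
    return any(all(board[i] == player for i in line) for line in WINS)


def evaluate(policy, games=1000, seed=0):
    """Play policy (as X, moving first) against a random O and count outcomes.

    policy is any callable that takes the board list and returns an empty cell index.
    """
    rng = random.Random(seed)  # seeded so the opponent's moves are repeatable
    results = {"win": 0, "draw": 0, "loss": 0}
    for _ in range(games):
        board = [' '] * 9
        while True:
            board[policy(board)] = 'X'
            if has_won(board, 'X'):
                results["win"] += 1
                break
            if ' ' not in board:
                results["draw"] += 1
                break
            empty = [i for i, x in enumerate(board) if x == ' ']
            board[rng.choice(empty)] = 'O'
            if has_won(board, 'O'):
                results["loss"] += 1
                break
    return results


def first_empty(board):
    """Baseline policy for comparison: always take the lowest-numbered empty cell."""
    return board.index(' ')
```

With the notebook's functions loaded, the trained agent could be scored with evaluate(lambda b: choose_action(b, epsilon=0)) and compared against the first_empty baseline for different values of episodes.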

Play Against the AI:
This function allows a player to play against the AI. It uses the trained q_table to guide AI moves, and disables exploration (epsilon=0) to ensure the AI follows its learned strategy.
Suggestion: Play around with epsilon to see how random moves affect gameplay or modify the player symbols.

Main Execution:
This segment trains the AI, then initiates a game with the player.
Suggestion: Add a prompt for the player to choose X or O and dynamically adjust the AI’s training reward structure for that symbol.

Summary
Q-learning suits this problem: Tic-Tac-Toe is a small, discrete, turn-based game in which actions lead to rewards or penalties, and the number of board states is small enough to store in a table.
Parameter Tuning (alpha, gamma, epsilon, episodes): Adjusting these values allows control over the AI’s learning rate, exploration-exploitation balance, and emphasis on immediate vs. future rewards.
By altering these aspects, you can observe different behaviors and outcomes, allowing deeper insight into how Q-learning influences the game dynamics.