# 1. Introduction

## 1.1 About

In game theory, finding an equilibrium strategy for each player in a non-cooperative game is a central problem. Equilibrium strategies are those strategies that, when played by all players, result in a stable outcome where no player can improve their payoff by unilaterally changing their strategy. Two well-known approaches to finding equilibrium strategies are fictitious play and reinforcement learning.


## 1.2 Description

Fictitious play is a learning algorithm that has been widely used in game theory to find equilibrium strategies in repeated games. The algorithm works by iteratively updating each player's strategy based on the observed actions of the other players. The name "fictitious play" comes from the fact that each player acts as if the other players' strategies were fixed, although in reality they may also be changing. In several classes of games, including two-player zero-sum games and games that can be solved by iterated elimination of strictly dominated strategies, the empirical frequencies of play have been shown to converge to a Nash equilibrium, a strategy profile in which no player can improve their payoff by unilaterally changing their strategy. Convergence is not guaranteed for every game.

Reinforcement learning, on the other hand, is a general learning framework that has been applied to various fields such as robotics, control, and artificial intelligence. Reinforcement learning is based on the idea that an agent interacts with an environment and learns to optimize its behavior by receiving rewards or penalties. In the context of games, reinforcement learning can be used to find equilibrium strategies by training agents to play against each other and learn from the rewards or penalties of the game.

## 1.3 Goal

The goal of this assignment is to explore the use of fictitious play and reinforcement learning for computing equilibrium strategies in repeated games. The assignment will involve implementing and experimenting with both algorithms and comparing their performance through experimental results. The assignment will also include a theoretical analysis of the properties of the algorithms, such as their convergence and termination.

# 2. Theory

## 2.1 Fictitious play

Fictitious play is an iterative learning rule for repeated games. In every round, each player observes the actions of the other players, forms a belief about their strategies from those observations, and chooses its own action in response. The name "fictitious play" comes from the fact that each player acts as if the other players' strategies were fixed, although in reality they may also be changing.

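The algorithm starts with an initial belief for each player about the other players' strategies and updates that belief after every round. After $t$ rounds, player $i$'s belief is the empirical frequency of the other players' observed actions:

$p_{i,t+1}(s) = \frac{\sum_{t'=1}^t[s_{-i,t'}=s]}{t}$

where $p_{i,t+1}(s)$ is the probability that player $i$ assigns, at time $t+1$, to the other players playing $s$; $s_{-i,t'}$ is the action profile that the other players actually played at time $t'$; and $[\cdot]$ is the Iverson bracket, equal to 1 when the condition holds and 0 otherwise. Player $i$ then plays a best response to this belief, that is, an action that maximizes its expected payoff against $p_{i,t+1}$.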
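The belief update and the best-response step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `empirical_belief` and `best_response`, the example history, and the example payoff matrix are all hypothetical, and the history is assumed to be non-empty.

```python
import numpy as np

def empirical_belief(opponent_history, num_actions):
    """Empirical frequency of each opponent action (history must be non-empty)."""
    counts = np.bincount(opponent_history, minlength=num_actions)
    return counts / len(opponent_history)

def best_response(payoff_matrix, belief):
    """Action with the highest expected payoff against `belief`.

    payoff_matrix[a, b] is the player's payoff for own action a
    when the opponent plays action b.
    """
    expected_payoffs = payoff_matrix @ belief
    return int(np.argmax(expected_payoffs))

# Hypothetical observations: the opponent played action 0 once and action 1 three times.
history = [0, 1, 1, 1]
belief = empirical_belief(history, num_actions=2)
print(belief)  # [0.25 0.75]

# Hypothetical 2x2 payoff matrix, used only to exercise best_response.
example_payoffs = np.array([[2, 0], [0, 1]])
print(best_response(example_payoffs, belief))  # 1
```

Ties in `np.argmax` are broken in favor of the lowest action index, which is one of several possible tie-breaking conventions.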

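Fictitious play does not converge in every game, but for several classes of games the empirical frequencies of play are known to converge to a Nash equilibrium, a strategy profile in which no player can improve their payoff by unilaterally changing their strategy while the other players' strategies stay fixed. These classes include two-player zero-sum games, potential games, and games that can be solved by iterated elimination of strictly dominated strategies.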

In this assignment, we use fictitious play to train agents to play a repeated game and find equilibrium strategies. The implementation consists of a simulation of the game in which the agents play against each other and learn from the observed actions of the other players.

It is worth mentioning that variants of fictitious play exist, such as fictitious play with adaptive exploration, which adds exploration to reduce the risk of getting stuck in an undesirable equilibrium.

## 2.2 Reinforcement learning

Reinforcement learning (RL) is a type of machine learning that focuses on training agents to make decisions by interacting with an environment and receiving rewards or penalties. In the context of games, RL can be used to find equilibrium strategies by training agents to play against each other and learn from the rewards or penalties of the game.

One popular algorithm for RL in games is Q-learning. Q-learning is a model-free algorithm that estimates the value of each action in a given state. The agent starts with an initial estimate of the value of each action and updates it as it experiences new states and rewards. The agent uses the following formula to update its Q-values:

$Q(s,a) = Q(s,a) + \alpha(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$

where $Q(s,a)$ is the estimate of the value of taking action a in state s, $\alpha$ is the learning rate, $r$ is the reward received after taking action a in state s, $\gamma$ is the discount factor, and $s'$ is the resulting state after taking action a in state s.
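A single application of this update rule can be sketched as follows. The function name `q_learning_update`, the table shape, and the example numbers are hypothetical and serve only to make the formula concrete; the Q-table is assumed to be a NumPy array indexed as `Q[state, action]`.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to Q in place and return the TD error."""
    # Target uses the best action in the next state, whatever the agent does next.
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table with 3 states and 2 actions.
Q = np.zeros((3, 2))
Q[1] = [1.0, 2.0]

# The agent took action 1 in state 0, received reward 1.0, and landed in state 1.
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1 * (1.0 + 0.9 * 2.0 - 0.0), up to floating-point rounding
```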

Another popular algorithm for RL in games is SARSA. SARSA is similar to Q-learning, but it is an on-policy method: its update target uses the value of the action that the agent actually takes in the next state, rather than the value of the highest-valued action. The agent uses the following formula to update its Q-values:

$Q(s,a) = Q(s,a) + \alpha(r + \gamma Q(s',a') - Q(s,a))$

where $Q(s,a)$ is the estimate of the value of taking action a in state s, $\alpha$ is the learning rate, $r$ is the reward received after taking action a in state s, $\gamma$ is the discount factor, $s'$ is the resulting state after taking action a in state s, and $a'$ is the action chosen by the agent in state $s'$.
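The SARSA update can be sketched in the same style. Here the next action $a'$ is chosen by an epsilon-greedy policy, which is an assumption made for illustration since the text does not fix a behavior policy; the names `epsilon_greedy` and `sarsa_update` and all numbers are hypothetical.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update to Q in place and return the TD error."""
    # Target uses the action actually chosen in the next state (on-policy).
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

rng = np.random.default_rng(0)

# Hypothetical table with 3 states and 2 actions.
Q = np.zeros((3, 2))
Q[1] = [1.0, 2.0]

# Choose the next action with the behavior policy, then update.
a_next = epsilon_greedy(Q, s=1, epsilon=0.1, rng=rng)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=a_next)
print(a_next, Q[0, 1])
```

When `a_next` happens to be the greedy action, the SARSA and Q-learning targets coincide; they differ only on exploratory steps.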

In this assignment, we use the Q-learning and SARSA algorithms to train agents to play a repeated game and find equilibrium strategies. The implementation consists of a simulation of the game in which the agents play against each other and learn from the rewards or penalties they receive.

It is worth mentioning that these are just a few examples of RL algorithms. Many others could be used, such as SARSA($\lambda$) and actor-critic methods.

# 3. Implementation 

## 3.1 The Prisoner's Dilemma

The Prisoner's Dilemma is a classic non-zero-sum game that has been widely studied in game theory, both as a one-shot game and in repeated form. It models a scenario in which two suspects are arrested and held in separate cells, unable to communicate with each other. Each player can either stay silent (cooperate) or testify against the other player (defect). If both players cooperate, each receives a light sentence. If one player defects while the other cooperates, the defector receives the most lenient outcome while the cooperator receives the heaviest sentence. If both players defect, both receive a sentence that is heavier than the one for mutual cooperation but lighter than the sentence of a cooperator who is betrayed.

The payoff matrix for the Prisoner's Dilemma is as follows:

|       | C    | D    |
|-------|------|------|
| **C** | R, R | S, T |
| **D** | T, S | P, P |


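Here R, S, T, and P are the payoffs. Rows are player 1's actions, columns are player 2's actions, and each cell lists player 1's payoff followed by player 2's payoff. R is the reward when both players cooperate, T is the temptation payoff for a player who defects while the other cooperates, S is the sucker's payoff for a player who cooperates while the other defects, and P is the punishment when both players defect. A Prisoner's Dilemma requires $T > R > P > S$.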
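To make the payoff matrix concrete, the sketch below fills it with illustrative numbers and searches for pure-strategy Nash equilibria by brute force. The values `T, R, P, S = 5, 3, 1, 0` are an assumption (any values with T > R > P > S would do), and the helper `pure_nash_equilibria` is a hypothetical name, not part of any library.

```python
import numpy as np

# Illustrative payoff values (assumption): temptation, reward, punishment, sucker.
T, R, P, S = 5, 3, 1, 0

# row_payoffs[i, j]: player 1's payoff when player 1 plays i and player 2 plays j.
# Action 0 = cooperate (C), action 1 = defect (D).
row_payoffs = np.array([[R, S], [T, P]])
# The game is symmetric, so player 2's payoffs are the transpose.
col_payoffs = row_payoffs.T

def pure_nash_equilibria(A, B):
    """Return all action pairs (i, j) where neither player gains by deviating alone."""
    equilibria = []
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            row_is_best = A[i, j] >= A[:, j].max()  # player 1 cannot do better against j
            col_is_best = B[i, j] >= B[i, :].max()  # player 2 cannot do better against i
            if row_is_best and col_is_best:
                equilibria.append((i, j))
    return equilibria

actions = ["C", "D"]
result = pure_nash_equilibria(row_payoffs, col_payoffs)
print([(actions[i], actions[j]) for i, j in result])  # [('D', 'D')]
```

With these values the only pure-strategy equilibrium is mutual defection, even though mutual cooperation would give both players a higher payoff.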

The goal of this section is to find the equilibrium strategies of the two players in the Prisoner's Dilemma using fictitious play and reinforcement learning. We implement both algorithms, run them on the game, and compare their results. The theoretical analysis of their convergence and termination properties is also applied to this game.

The Prisoner's Dilemma is a useful test case both for understanding how agents behave in non-cooperative scenarios and for evaluating how well the algorithms find equilibrium strategies.
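As a short worked check of what the algorithms should find, assume the standard Prisoner's Dilemma ordering $T > R > P > S$ and let $q$ (a symbol introduced here for illustration) be the probability that the opponent cooperates. The expected payoffs of the two actions are

$u(C) = qR + (1-q)S, \qquad u(D) = qT + (1-q)P$

so their difference is

$u(D) - u(C) = q(T-R) + (1-q)(P-S) > 0 \quad \text{for every } q \in [0,1]$

because both $T-R$ and $P-S$ are positive. Defecting is therefore a strictly dominant action, and mutual defection (D, D) is the only Nash equilibrium of the one-shot game. This gives a reference point against which the strategies learned by the algorithms can be compared.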

### 3.1.1 Solving The Prisoner's Dilemma using Fictitious play algorithm 

In [1]:
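import numpy as np

# Illustrative payoff values (an assumption; any T > R > P > S gives a Prisoner's Dilemma)
T, R, P, S = 5, 3, 1, 0

# Payoff matrix for the Prisoner's Dilemma game, from one player's point of view:
# payoffs[own_action, opponent_action], with action 0 = C and action 1 = D.
# The game is symmetric, so both players use the same matrix.
payoffs = np.array([[R, S], [T, P]])

# Initialize strategies for both players (empirical frequencies of their play)
p1_strategy = np.array([0.5, 0.5])
p2_strategy = np.array([0.5, 0.5])

# Number of iterations for the algorithm
num_iterations = 100

# Fictitious play algorithm
for t in range(num_iterations):
    # Each player best-responds to the opponent's empirical strategy
    p1_action = np.argmax(payoffs @ p2_strategy)
    p2_action = np.argmax(payoffs @ p1_strategy)

    # Update the empirical strategies as running averages; the initial
    # [0.5, 0.5] counts as one prior observation, hence the weight t + 1
    p1_strategy = (p1_strategy * (t + 1) + np.eye(2)[p1_action]) / (t + 2)
    p2_strategy = (p2_strategy * (t + 1) + np.eye(2)[p2_action]) / (t + 2)

# Equilibrium strategies
print("Player 1's equilibrium strategy:", p1_strategy)
print("Player 2's equilibrium strategy:", p2_strategy)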

### Description

In this code, the `payoffs` matrix holds the payoffs of the game, and the initial strategies for both players are set to [0.5, 0.5], which corresponds to a uniformly random choice between C and D. In each of the 100 iterations, the algorithm applies the update rule to each player's strategy. The final strategies of both players are printed at the end.

### 3.1.2 Solving The Prisoner's Dilemma using Reinforcement learning 

In [None]:
import numpy as np

# Illustrative payoff values (an assumption; any T > R > P > S gives a Prisoner's Dilemma)
T, R, P, S = 5, 3, 1, 0

# Payoff matrix for the Prisoner's Dilemma game, from one player's point of view:
# payoffs[own_action, opponent_action], with action 0 = C and action 1 = D
payoffs = np.array([[R, S], [T, P]])

# Q-values for both players: q_values[player, action]
q_values = np.zeros((2, 2))

# Initialize strategies for both players
p1_strategy = np.array([0.5, 0.5])
p2_strategy = np.array([0.5, 0.5])

# Parameters for Q-learning
alpha = 0.1
gamma = 0.9
num_iterations = 100

# Seeded random generator so that runs are reproducible
rng = np.random.default_rng(0)

def softmax(q):
    # Subtract the maximum before exponentiating to avoid overflow
    z = np.exp(q - np.max(q))
    return z / np.sum(z)

# Q-learning algorithm (the repeated game is treated as a single-state problem)
for t in range(num_iterations):
    # Select actions for both players
    p1_action = rng.choice(2, p=p1_strategy)
    p2_action = rng.choice(2, p=p2_strategy)

    # Rewards: each player indexes the matrix by (own action, opponent action)
    r1 = payoffs[p1_action, p2_action]
    r2 = payoffs[p2_action, p1_action]

    # Update Q-values for both players
    q_values[0, p1_action] += alpha * (r1 + gamma * np.max(q_values[0]) - q_values[0, p1_action])
    q_values[1, p2_action] += alpha * (r2 + gamma * np.max(q_values[1]) - q_values[1, p2_action])

    # Update strategies for both players (softmax over each player's Q-values)
    p1_strategy = softmax(q_values[0])
    p2_strategy = softmax(q_values[1])

# Equilibrium strategies
print("Player 1's equilibrium strategy:", p1_strategy)
print("Player 2's equilibrium strategy:", p2_strategy)


### Description

In this code, the `payoffs` matrix holds the payoffs of the game, `q_values` stores the Q-values, and the initial strategies for both players are set to [0.5, 0.5], which corresponds to a uniformly random choice between C and D. In each of the 100 iterations, both players sample an action from their current strategy, the Q-values are updated with learning rate `alpha = 0.1` and discount factor `gamma = 0.9`, and each strategy is recomputed as a softmax over Q-values. The final strategies of both players are printed at the end.
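The step that turns Q-values into a strategy is a softmax. A minimal, numerically stable sketch is shown below; the `temperature` parameter is a hypothetical addition that is not part of the code above, included only to show how the sharpness of the resulting strategy could be controlled.

```python
import numpy as np

def softmax(q, temperature=1.0):
    """Convert a vector of Q-values into a probability distribution.

    Lower temperatures concentrate probability on the highest Q-value;
    higher temperatures move the distribution toward uniform.
    """
    q = np.asarray(q, dtype=float) / temperature
    z = np.exp(q - q.max())  # subtracting the maximum avoids overflow in exp
    return z / z.sum()

print(softmax([0.0, 0.0]))  # [0.5 0.5]

# Large Q-values do not overflow because of the max subtraction.
probs = softmax([1000.0, 1001.0])
print(probs)
```

Equal Q-values give a uniform strategy, which matches the initial [0.5, 0.5] strategies when all Q-values start at zero.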