# Optimizing Agent Behavior in Dynamic Environments using Q-learning and ε-greedy Exploration

## Machine Learning Course - Reinforcement Learning Project 

## NCSR Demokritos 2022-2023

### Author: Alexandros Filios - mtn2219

## 1. Introduction

In this report, we present an implementation of Q-learning agents with ε-greedy exploration for a repeated coordination game. The game is defined by a 2×2 payoff matrix whose rows and columns are indexed by actions 1 and 2 and whose entries give the payoff for each action combination. Each agent belongs to one of two types, X and Y: X agents prefer action 1 over action 2, and Y agents prefer action 2 over action 1. The agents interact sparsely over an adjacency graph of our choice and have no knowledge of the game or of the other agents' action repertoires.

Q-learning with ε-greedy exploration is a widely used approach in multi-agent reinforcement learning, where each agent learns to improve its decisions from its own actions and the rewards it receives from the environment. The ε-greedy exploration strategy lets the agents balance exploitation of their current knowledge against exploration of new actions.

Our implementation runs a specified number of episodes, with ε starting at 1 and decreasing by 0.01 after every fixed number of episodes (100 in the code of Section 3). At the beginning of each episode, the time is set to 0. At each time step, every agent chooses an action using the ε-greedy strategy; once all agents have acted, the environment announces each agent's reward, and the Q-values are updated using the Q-learning update rule. We track the energy value and the average total reward for each agent at each point in time, and create charts to visualize this data.
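
The ε schedule described above can be sketched as a small function. This is only an illustration: the name `epsilon_for_episode` is hypothetical, the floor at 0 is an added safeguard, and the decay interval of 100 and run length of 2500 episodes are taken from the settings in the code of Section 3.

```python
def epsilon_for_episode(episode, start=1.0, step=0.01, decay_every=100):
    """Exploration rate used in a given 0-based episode, floored at 0."""
    # One decrement of `step` is applied for every completed block of
    # `decay_every` episodes (assumed values; see the lead-in).
    return max(start - step * (episode // decay_every), 0.0)


eps_first = epsilon_for_episode(0)     # first episode
eps_last = epsilon_for_episode(2499)   # last of 2500 episodes
```

Under these assumed values, ε is still about 0.76 in the final episode, so the agents keep choosing random actions most of the time throughout the run.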

The main goal of this report is to explore whether the agents converge to an equilibrium and how the "equilibrium" between X and Y agents affects their convergence. We will also include the code used, charts showing the energy value per episode, and charts for each agent type showing their average total reward at each point in time.

## 2. Theoretical background & Methodology

### 2.1 Q-Learning

Q-learning is a type of reinforcement learning algorithm that is used to find the optimal action-selection policy for an agent in a specific environment. The basic idea behind Q-learning is to use a function, called the Q-function, to estimate the expected long-term reward for an agent when it takes a certain action in a given state. The Q-function is typically represented by a table or a neural network, and it is updated during the learning process to more accurately reflect the expected rewards for different actions in different states.

The Q-function is defined as Q(s,a), where s is the current state and a is the action taken in that state. It estimates the expected cumulative future reward for an agent that starts from state s and takes action a. The agent's goal is to learn the optimal action-value function Q*(s,a), from which a policy that maximizes the expected cumulative reward can be derived.

In Q-learning, the agent interacts with the environment by taking actions and receiving rewards. At each time step, the agent selects an action based on its current state and the current estimates of the Q-function. The agent then receives a reward for the action and updates the Q-function to more accurately reflect the new information about the expected rewards for that action in that state. This process is repeated until the agent reaches a stopping criterion, such as a maximum number of iterations or a satisfactory level of performance.

The Q-function is updated using the Q-learning update rule, which is derived from the Bellman equation, a fundamental equation in dynamic programming. The rule moves the current Q-value of a state-action pair by a step, scaled by the learning rate α, towards a target made up of the observed reward plus the discounted value of the best action in the next state. The update rule is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$

Where:

- s: current state
- a: current action
- α: learning rate
- r: reward for taking action a in state s
- γ: discount factor
- s': next state
- a': next action
- $\max_{a'} Q(s',a')$: maximum Q-value over all actions in next state s'
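
As a minimal sketch of the update rule above, assume a tabular Q-function stored as a NumPy array indexed by `[state, action]`. The function name `q_update` and the default values of `alpha` and `gamma` are illustrative assumptions, not taken from the project code.

```python
import numpy as np


def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Target: immediate reward plus discounted value of the best next action
    td_target = r + gamma * np.max(q_table[s_next])
    td_error = td_target - q_table[s, a]
    # Move the current estimate a fraction alpha towards the target
    q_table[s, a] += alpha * td_error
    return td_error


# Hypothetical example: one state, two actions, all estimates start at zero
q = np.zeros((1, 2))
q_update(q, s=0, a=1, r=1.1, s_next=0)
```

In this example the estimate for action index 1 moves from 0 to roughly 0.11, that is, a fraction α = 0.1 of the way towards the target 1.1, while the estimate for the other action is unchanged.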

The Q-function is used to select actions by choosing the action that has the highest Q-value for the current state. This is known as the greedy policy.

### 2.2 ε-Greedy algorithm

The ε-greedy algorithm is a method used to balance exploration and exploitation during the learning process. In the ε-greedy algorithm, the agent selects actions according to the following rule: with probability ε, a random action is chosen, and with probability 1-ε, the action that has the highest estimated Q-value is chosen. The value of ε is called the exploration rate and it's used to control the trade-off between exploration and exploitation. At the beginning of the learning process, ε is set to a high value to allow the agent to explore the environment and gather information about the different states and actions. As the learning process progresses, ε is gradually decreased, allowing the agent to become more exploitation-focused and make use of the information it has gathered.
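
The selection rule can be sketched as follows. This is a minimal illustration, assuming the Q-values for the current state are held in a 1-D NumPy array; the name `epsilon_greedy`, the example Q-values, and the seed are hypothetical.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from a 1-D array of Q-values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions
        return int(rng.integers(len(q_row)))
    # Exploit: choose the action with the highest estimated Q-value
    return int(np.argmax(q_row))


rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
q_row = np.array([0.2, 0.7])    # made-up Q-values for two actions

# With epsilon = 0 the choice is always greedy; with epsilon = 1 it is always random
greedy_action = epsilon_greedy(q_row, epsilon=0.0, rng=rng)
random_actions = [epsilon_greedy(q_row, epsilon=1.0, rng=rng) for _ in range(200)]
```

Passing the random generator explicitly keeps the sketch reproducible and makes the two extreme cases (ε = 0 and ε = 1) easy to check.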

In summary, Q-learning is a model-free, off-policy algorithm for learning control policies in which the Q-function approximates the optimal action-value function. Its update rule is derived from the Bellman equation, and the agent learns by trial and error through interaction with the environment.

## 3. Implementation & Experiments

import numpy as np

# Define the game matrix
game_matrix = np.array([[1.1, 0.0], [0.0, 1.1]])

# Define the number of agents and the types of agents
num_agents = 7
agent_types = ["X", "Y"]

# Initialize the Q-values of each agent's two actions to small random values
# (the game is stateless, so each agent keeps one Q-value per action)
q_values = {agent_type: {i: np.random.rand(2) * 0.1 for i in range(num_agents)} for agent_type in agent_types}

# Define the adjacency graph as a random symmetric 0/1 matrix without self-loops
# For example, adjacency_graph[0][1] == 1 means agent 0 is connected to agent 1
adjacency_graph = np.triu(np.random.randint(0, 2, size=(num_agents, num_agents)), k=1)
adjacency_graph = adjacency_graph + adjacency_graph.T

# Define the number of episodes and the episode length
num_episodes = 2500
episode_length = 50

# Define the discount factor
gamma = 0.9

# Define the initial exploration probability and the number of episodes between successive decreases
epsilon = 1
epsilon_decay_episodes = 100

# Initialize the rewards for each agent
rewards = {agent_type: np.zeros(num_agents) for agent_type in agent_types}

# Initialize the energy value
energy_value = 0

# Initialize the episode counter
episode_counter = 0

# Learning rate for the Q-learning update
# (assumed value; the original code did not define one)
alpha = 0.1

# Run the episodes
for episode in range(num_episodes):
    # Decrease the exploration probability every epsilon_decay_episodes episodes
    if episode_counter % epsilon_decay_episodes == 0 and episode_counter != 0:
        epsilon = max(epsilon - 0.01, 0.0)

    # Initialize the episode rewards
    episode_rewards = {agent_type: np.zeros(num_agents) for agent_type in agent_types}

    # Run the episode steps
    for step in range(episode_length):
        # Initialize the actions for each agent
        actions = {agent_type: np.zeros(num_agents) for agent_type in agent_types}

        # Get the actions for each agent (agents alternate between types X and Y)
        for i in range(num_agents):
            agent_type = agent_types[i % 2]
            if np.random.rand() < epsilon:
                # Choose a random action with probability epsilon
                actions[agent_type][i] = np.random.randint(2)
            else:
                # Choose the action with the highest Q-value with probability 1-epsilon
                actions[agent_type][i] = np.argmax(q_values[agent_type][i])

        # Compute the rewards for each agent from the games played with its neighbours
        for i in range(num_agents):
            agent_type = agent_types[i % 2]
            agent_rewards = np.zeros(num_agents)
            for j in range(num_agents):
                if adjacency_graph[i][j]:
                    agent_rewards[j] = game_matrix[int(actions[agent_type][i]), int(actions[agent_types[j % 2]][j])]
            rewards[agent_type][i] = np.mean(agent_rewards)

        # Update the Q-value of the action each agent took, using the Q-learning rule
        for i in range(num_agents):
            agent_type = agent_types[i % 2]
            action = int(actions[agent_type][i])
            td_target = rewards[agent_type][i] + gamma * np.max(q_values[agent_type][i])
            q_values[agent_type][i][action] += alpha * (td_target - q_values[agent_type][i][action])

        # Accumulate the episode rewards for each agent
        for agent_type in agent_types:
            episode_rewards[agent_type] += rewards[agent_type]

    # Update the energy value: stack the per-type reward arrays before summing,
    # since np.sum cannot take an axis over a dict
    energy_value += np.mean(np.sum(np.stack(list(episode_rewards.values())), axis=1))

    # Update the episode counter
    episode_counter += 1

# Plot the energy value and the average total rewards for each agent type
# Code for plotting goes here

## 4. Epilogue