# Tutorial 1: Tabular Q-learning with Gym

Gym provides a common interface to many reinforcement learning environments, which makes it easy to experiment with different algorithms. In this tutorial, we implement one such algorithm, Tabular Q-learning, and train an agent on the FrozenLake environment.

## Step 1: Understanding the FrozenLake environment

The FrozenLake environment is a simple grid-world game in which the agent must walk from the start tile to the goal tile while avoiding holes. The agent can move up, down, left, or right. It receives a reward of 1 for reaching the goal and 0 for every other step, including falling into a hole or moving into a wall. The environment ships with Gym and can be created with the following code:

In [1]:
import gym
env = gym.make("FrozenLake-v1")
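
To make the state and reward structure concrete, here is a minimal pure-Python sketch of a deterministic 4x4 grid laid out like FrozenLake's default map. It is an illustration only, not Gym's implementation: the helper `toy_step` is hypothetical, and it ignores the slipperiness that `FrozenLake-v1` applies by default, where the agent may slide in a direction other than the one it chose. States are numbered row by row from 0 to 15, and actions follow Gym's encoding (0 = left, 1 = down, 2 = right, 3 = up).

```python
# Default 4x4 layout: S = start, F = frozen, H = hole, G = goal
LAYOUT = ["SFFF",
          "FHFH",
          "FFFH",
          "HFFG"]
N_ROWS, N_COLS = len(LAYOUT), len(LAYOUT[0])

# Row/column offsets for actions 0 = left, 1 = down, 2 = right, 3 = up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def toy_step(state, action):
    """Hypothetical deterministic step: returns (next_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # Moving into a wall leaves the agent where it is
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    tile = LAYOUT[row][col]
    next_state = row * N_COLS + col
    reward = 1.0 if tile == "G" else 0.0  # only the goal pays out
    done = tile in "HG"                   # holes and the goal end the episode
    return next_state, reward, done

print(toy_step(0, 2))   # move right from the start tile
print(toy_step(0, 0))   # bump into the left wall
print(toy_step(14, 2))  # step onto the goal
```

Because the reward is 0 everywhere except the goal, the agent learns nothing useful until it first reaches the goal by chance, which is why exploration matters in this environment.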

## Step 3: Implementing the Tabular Q-learning algorithm

Next, we need to implement the Tabular Q-learning algorithm. This algorithm works by estimating the optimal action-value function, Q(s,a), which is the expected cumulative reward for taking action a in state s and then following the optimal policy thereafter. The algorithm updates the Q-values using the following formula:

Q(s,a) = Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))

Here alpha is the learning rate, gamma is the discount factor, s' is the next state after taking action a in state s, and the maximum is taken over all actions a' available in s'.
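
As a worked instance of the update rule, the sketch below applies it once by hand. The Q-values and the reward are made-up numbers chosen for easy arithmetic. The names `td_target` and `td_error` are the terms commonly used for the two intermediate quantities and do not come from this tutorial.

```python
alpha = 0.8    # learning rate
gamma = 0.95   # discount factor

q_sa = 0.5            # current estimate Q(s, a) (made-up value)
reward = 0.0          # reward observed for this transition
next_q = [0.2, 1.0]   # Q(s', a') for each action a' (made-up values)

td_target = reward + gamma * max(next_q)   # 0 + 0.95 * 1.0 = 0.95
td_error = td_target - q_sa                # 0.95 - 0.5 = 0.45
new_q_sa = q_sa + alpha * td_error         # 0.5 + 0.8 * 0.45 = 0.86

print(round(new_q_sa, 4))  # 0.86
```

With alpha = 0.8 the estimate moves 80% of the way from its old value toward the target, so a single noisy transition can change it a lot. A smaller alpha averages over more transitions at the cost of slower learning.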

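Here's an implementation of the Tabular Q-learning algorithm on CartPole, whose continuous observations are discretized into bins so that they can index a Q-table: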

In [2]:
import gym
import numpy as np

# Initialize environment
env = gym.make("CartPole-v1")

# Initialize Q-table with zeros
n_states = 10  # Number of bins per observation feature
n_actions = env.action_space.n  # Number of actions
Q = np.zeros([n_states, n_states, n_states, n_states, n_actions])

# Set hyperparameters
alpha = 0.8
gamma = 0.95
epsilon = 0.1
num_episodes = 5000

# Discretize observation space
def discretize(obs, n_states):
    obs_min = -1.0  # Lower edge of the binned range
    obs_max = 1.0   # Upper edge of the binned range
    bins = np.linspace(obs_min, obs_max, n_states)  # Bin edges

    # The observation is an array, not a dict; unpack it by position
    cart_pos, cart_vel, pole_ang, pole_vel = obs

    # Map one feature to a bin index, clipping out-of-range values
    # into the first or last bin
    def to_index(value):
        return int(np.clip(np.digitize(value, bins) - 1, 0, n_states - 1))

    state = (to_index(cart_pos),
             to_index(cart_vel),
             to_index(pole_ang),
             to_index(pole_vel))

    return state

# For each episode
for i in range(num_episodes):
    # Reset environment (gym >= 0.26 returns (observation, info))
    obs, _ = env.reset()
    state = discretize(obs, n_states)
    done = False

    # While episode is not finished
    while not done:
        # Choose action using epsilon-greedy policy
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

        # Take action and observe next state and reward
        # (gym >= 0.26 returns separate terminated and truncated flags)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = discretize(obs, n_states)

        # Update Q-table
        Q[state][action] = Q[state][action] + alpha * (reward + gamma * np.max(Q[next_state]) - Q[state][action])

        # Update state
        state = next_state

# Print learned Q-table
print(Q)
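
The discretization step can be illustrated without NumPy. The sketch below uses a hypothetical helper, `to_bin`, that splits a range into equal-width bins and clips values outside it. It shows the same idea as the cell above, not the exact bin edges that `np.linspace` and `np.digitize` produce there.

```python
def to_bin(value, low=-1.0, high=1.0, n_bins=10):
    """Hypothetical helper: index of the equal-width bin containing value."""
    width = (high - low) / n_bins
    index = int((value - low) // width)
    # Values outside [low, high) fall into the first or last bin
    return min(max(index, 0), n_bins - 1)

for value in (-3.0, -0.95, 0.05, 2.5):
    print(value, "->", to_bin(value))
```

A single range for all four features is a simplification. CartPole ends an episode once the pole tilts past roughly 0.21 radians, so the pole angle only ever occupies a few of the bins on [-1, 1], while the velocities can exceed that range and pile up in the outer bins. Per-feature ranges would likely give a more useful table.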

## Step 4: Training the agent on FrozenLake

Now that we have implemented the Tabular Q-learning algorithm, we can train an agent on the FrozenLake environment. Here's an implementation of the training loop:

In [None]:
# Create the FrozenLake environment (env may still refer to another environment)
env = gym.make("FrozenLake-v1")

# Initialize Q-table with zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Set hyperparameters
alpha = 0.8
gamma = 0.95
epsilon = 0.1
num_episodes = 5000

# For each episode
for i in range(num_episodes):
    # Reset environment (gym >= 0.26 returns (observation, info))
    state, _ = env.reset()
    done = False

    # While episode is not finished
    while not done:
        # Choose action using epsilon-greedy policy
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state,:])

        # Take action and observe next state and reward
        # (gym >= 0.26 returns separate terminated and truncated flags)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update Q-table
        Q[state,action] = Q[state,action] + alpha * (reward + gamma * np.max(Q[next_state,:]) - Q[state,action])

        # Update state
        state = next_state

    # Print episode number every 1000 episodes
    if i % 1000 == 0:
        print("Episode ", i)
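
Once training finishes, the learned behaviour is easier to read as a policy than as a table of numbers. The sketch below picks the highest-valued action in each state. The Q-values are made up for illustration, `greedy_policy` is a hypothetical helper, and the action names follow Gym's FrozenLake encoding (0 = left, 1 = down, 2 = right, 3 = up).

```python
ACTION_NAMES = ["left", "down", "right", "up"]  # Gym's FrozenLake action order

# Made-up Q-values for three states: one row per state, one column per action
q_table = [
    [0.1, 0.4, 0.2, 0.0],
    [0.0, 0.0, 0.3, 0.1],
    [0.5, 0.2, 0.2, 0.2],
]

def greedy_policy(q_rows):
    """Hypothetical helper: best action index per state (first wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_rows]

policy = greedy_policy(q_table)
print([ACTION_NAMES[a] for a in policy])  # ['down', 'right', 'left']
```

For a NumPy Q-table such as the one trained above, `np.argmax(Q, axis=1)` yields the same per-state choice.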


## Step 5: Evaluating the agent

After training the agent, we can evaluate its performance by running several episodes and computing the average reward. Here's an implementation of the evaluation loop:


In [None]:
# Run 100 episodes
num_episodes = 100
total_reward = 0

for i in range(num_episodes):
    # Reset environment (gym >= 0.26 returns (observation, info))
    state, _ = env.reset()
    done = False

    # While episode is not finished
    while not done:
        # Choose action using greedy policy
        action = np.argmax(Q[state,:])

        # Take action and observe next state and reward
        # (gym >= 0.26 returns separate terminated and truncated flags)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update total reward
        total_reward += reward

        # Update state
        state = next_state
        
    # Print episode number every 10 episodes
    if i % 10 == 0:
        print("Episode ", i)
        
# Print average reward
print("Average reward: ", total_reward/num_episodes)
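
Because each FrozenLake episode returns either 0 or 1, the average reward is simply the fraction of episodes in which the agent reached the goal. With only 100 episodes that fraction is noisy, so it can help to report a rough uncertainty alongside it. The sketch below uses made-up episode outcomes and a hypothetical helper, `summarize`, that computes the mean and the binomial standard error. It assumes every return is 0 or 1.

```python
import math

def summarize(returns):
    """Hypothetical helper: mean return and its binomial standard error."""
    n = len(returns)
    mean = sum(returns) / n
    std_err = math.sqrt(mean * (1.0 - mean) / n)
    return mean, std_err

# Made-up outcomes for eight evaluation episodes (1 = reached the goal)
episode_returns = [1, 0, 1, 1, 0, 1, 1, 1]
mean, std_err = summarize(episode_returns)
print(f"success rate: {mean:.2f} +/- {std_err:.2f}")
```

To apply this to the evaluation loop above, one would collect each episode's return in a list instead of only accumulating `total_reward`.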
