# Frozen Lake Reinforcement Learning Project

William Hinkley 

CSPB 3202 

Fall 2023

GitHub Repository Link: https://github.com/WiHi1131/Frozen-Lake-Reinforcement-Learning

YouTube Link: https://youtu.be/kjfbU22-EUE

## Project Overview

This report presents a study of intelligent agents that applies reinforcement learning (RL) techniques. The project implements and analyzes three RL models tasked with navigating the FrozenLake environment from Gymnasium (the Farama Foundation's maintained successor to OpenAI Gym).

## Reinforcement Learning Models

Three distinct RL models were explored:

1. **Value Iteration**: A classic dynamic-programming algorithm that, given the environment's transition model, repeatedly applies the Bellman optimality update to the value function until it converges, and then extracts the optimal policy from the converged values.
2. **Q-Learning**: A model-free, off-policy algorithm that estimates the value of state-action pairs in a Q-table and updates these values from sampled transitions using a Bellman-style update.
3. **Deep Q-Learning Network (DQN)**: An advanced approach that integrates deep learning with Q-learning, employing neural networks to approximate the Q-value function.
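
For reference, the update rules behind the first two models can be written as follows. The notation ($V$, $Q$, $P$, $R$, $\alpha$ for the learning rate, $\gamma$ for the discount factor) is standard textbook notation assumed here; it does not come from the project scripts. DQN keeps the second update's target but replaces the table $Q$ with a neural network fitted toward that target.

$$
V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V(s')\bigr]
$$

$$
Q(s, a) \leftarrow Q(s, a) + \alpha\,\bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr]
$$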

## The FrozenLake Environment

### Environment Description

- **Theme**: Players navigate across a frozen lake, aiming to reach a goal without falling into holes.
- **Slippery Terrain**: The lake's slippery nature adds randomness to movement, causing occasional deviations from the intended direction.
- **Map Image**: 

![Frozen Lake 4x4 Map](./images/frozen_lake_4x4.PNG)

### Specifications

- **Action Space**: Discrete(4) - {0: Left, 1: Down, 2: Right, 3: Up}
- **Observation Space**: Discrete(16) for the 4x4 map, representing the player's current position.
- **Starting State**: Player begins at [0,0].
- **Rewards**: +1 for reaching the goal, 0 otherwise.
- **Episode Termination**: Occurs upon falling into a hole or reaching the goal.
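
The observation is a single integer, not a coordinate pair. As a minimal sketch, assuming the row-major layout described in the Gymnasium documentation (`row * ncols + col`), the helper names below are hypothetical and only illustrate the mapping:

```python
def state_to_row_col(state, ncols=4):
    """Convert a flat FrozenLake state index into a (row, col) pair."""
    return divmod(state, ncols)


def row_col_to_state(row, col, ncols=4):
    """Convert a (row, col) pair back into the flat state index."""
    return row * ncols + col


print(state_to_row_col(0))               # start tile of the 4x4 map
print(state_to_row_col(15))              # goal tile of the 4x4 map
print(row_col_to_state(7, 7, ncols=8))   # goal tile of the 8x8 map
```

Under this layout the starting state [0,0] is observation 0, and the goal in the bottom-right corner is observation 15 on the 4x4 map and 63 on the 8x8 map.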

### Environment Setup

- **Default Map (4x4)** (see image above)
- **Large Map (8x8)**: 

![Frozen Lake 8x8 Map](./images/frozen_lake_8x8.PNG)
- **Slippery Condition**: For each map, `is_slippery` is set to `True`. The player then moves in the intended direction only 1/3 of the time; otherwise the player moves "in either perpendicular direction with equal probability of 1/3 in both directions. For example, if action is left and is_slippery is True, then: P(move left)=1/3, P(move up)=1/3, P(move down)=1/3" (https://gymnasium.farama.org/environments/toy_text/frozen_lake/)
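
The slippery rule can be sketched in a few lines. This is an illustration of the behavior quoted above, not Gymnasium's own code: with the action numbering from the Specifications section, the two perpendicular directions are the neighbors of the intended action modulo 4. The function name `slippery_outcomes` is hypothetical.

```python
ACTION_NAMES = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}


def slippery_outcomes(action):
    """Return {direction actually moved: probability} for an intended action."""
    # The intended direction and its two perpendicular neighbors are equally likely.
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return {ACTION_NAMES[a]: 1 / 3 for a in candidates}


for action, name in ACTION_NAMES.items():
    print(name, "->", slippery_outcomes(action))
```

For the intended action Left, the sketch yields Up, Left, and Down with probability 1/3 each, matching the documented example; the opposite direction is never taken.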

This project aims to evaluate and compare the effectiveness of these RL models in mastering the FrozenLake game, offering insights into their learning capabilities and adaptability in a stochastic environment.


# Approach

## Model Selection Rationale

For this project, I chose three models: Value Iteration, Q-Learning, and Deep Q-Learning Network (DQN), each representing a unique approach in the spectrum of reinforcement learning (RL).

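1. **Value Iteration**: Selected for its foundational significance in RL, providing a clear understanding of how policies and value functions interact when the environment's transition model is known. Because it plans directly from the transition probabilities, it also handles stochastic environments such as the slippery FrozenLake.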
2. **Q-Learning**: As a model-free algorithm, it offers insights into learning dynamics without requiring a model of the environment, making it suitable for stochastic settings like FrozenLake.
3. **Deep Q-Learning Network (DQN)**: Chosen for its ability to handle high-dimensional observation spaces and to demonstrate the integration of deep learning with traditional RL methods.

## Implementation Strategy

### Development Environment

- **IDE**: Visual Studio Code was used for its robust Python support, ease of managing multiple scripts, and integrated debugging tools.
- **Dependency Management**: Necessary libraries, including Gymnasium with its Toy Text and Box2D extras (FrozenLake belongs to the Toy Text family), were installed to facilitate environment setup and agent interactions.
- **Local Execution**: Scripts were run locally to manage computational resources effectively while ensuring real-time visualization and interaction with the environment.

### Visualization

- A key aspect of the project was visualizing the agent's performance in the environment. Gymnasium's `human` render mode (`render_mode="human"` in the scripts below) provided a simple yet effective way to render the game, giving immediate feedback on the agent's behavior and the effectiveness of its strategy.

## Evaluation Metrics

- **Average Reward/Score**: The primary metric for evaluating model performance was the average reward or score obtained over time across all training episodes. This metric was chosen for its direct correlation with the agent's ability to navigate the environment successfully, reflecting both the efficiency and effectiveness of the learning process.
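
Because each episode's reward is either 0 or 1, the average reward is a success rate, and its uncertainty can be estimated with the binomial standard error. The sketch below is an illustrative addition, not part of the project scripts; `summarize_scores` is a hypothetical helper and the reward list is made-up example data.

```python
import math


def summarize_scores(episode_rewards):
    """Return (mean, standard error) for a list of 0/1 episode rewards."""
    n = len(episode_rewards)
    mean = sum(episode_rewards) / n
    # Binomial standard error of an observed success rate
    std_err = math.sqrt(mean * (1 - mean) / n)
    return mean, std_err


# Example data: 73 successful episodes out of 100
rewards = [1] * 73 + [0] * 27
mean, std_err = summarize_scores(rewards)
print(f"average reward = {mean:.2f}, standard error = {std_err:.3f}")
```

With 100 episodes the standard error is a few hundredths, so average scores that differ by a similar amount should be compared with caution.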


## Troubleshooting and Optimization

- **Rendering and Computational Complexity**: One of the challenges was finding an environment setup that allowed easy rendering of the game for visualization purposes while balancing the computational load. This balance was crucial for running the models on a local machine without compromising performance.
- **Environment Compatibility**: Ensuring compatibility with the chosen RL environments and the visualization tools was a significant part of the setup process. This required testing different versions of libraries and adjusting configurations to achieve optimal functionality, as well as downloading swigwin and Visual Studio to get the visualization of the environment to work. 

## Testing on Different Environment Scales

### Simple 4x4 Version

- Initially, the models were tested on the simpler 4x4 version of the FrozenLake environment. This provided a controlled setting to fine-tune the models and understand their learning dynamics in a relatively straightforward context.

### Complex 8x8 Version

- To assess the scalability and adaptability of the models, they were subsequently tested on the more challenging 8x8 version of FrozenLake. This larger and more complex environment offered a better understanding of how the models perform under increased environmental complexity and uncertainty.
- The performance in the 8x8 environment was particularly crucial in evaluating the models' ability to generalize their learning strategies to larger state spaces and more intricate navigation challenges.

### Code: 

- The following code snippets show all the Python scripts used in my project, with comments added for readability. Each snippet is titled with the name of its file (all files can also be found in the GitHub repository for my project). These snippets are not designed to be run within this Jupyter notebook; they were run in VS Code on my local machine.
- Note: Some of the code in the Q-learning agent file was adapted from the following online source: https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake_unslippery%20(Deterministic%20version).ipynb
- Note: test.py is not shown in the snippets below. It was only a script I used to check that the environment could be visualized properly.

#### value_iteration_policy.py: 
- this script defines my value iteration policy

In [None]:
import gymnasium as gym  # Importing the Gym library for creating and managing environments
import numpy as np  # Importing NumPy for numerical operations

# Define the function for running value iteration
def run_value_iteration(env, discount_factor=0.9, theta=1e-4, print_iterations=(100, 1000)):  # Tuple default avoids a shared mutable argument
    value_table = np.zeros(env.observation_space.n)  # Initialize the value table with zeros
    policy_table = np.zeros(env.observation_space.n, dtype=int)  # Initialize the policy table with zeros
    iteration = 0  # Counter for iterations

    while True:  # Start of the value iteration loop
        delta = 0  # Initialize the delta, which tracks the change in value
        iteration += 1  # Increment the iteration count

        # Loop over all states in the environment
        for state in range(env.observation_space.n):
            v = value_table[state]  # Store the current value of the state
            # Update the value of the state based on the Bellman equation
            value_table[state] = max(sum(prob * (reward + discount_factor * value_table[next_state])
                                        for prob, next_state, reward, _ in env.unwrapped.P[state][action])  # .unwrapped reaches the base env behind gym.make's wrappers
                                    for action in range(env.action_space.n))
            # Update delta with the maximum change observed in the value table
            delta = max(delta, abs(v - value_table[state]))

        # Print the value table at specified iterations
        if iteration in print_iterations:
            print(f"Value Table after {iteration} iterations:")
            print(value_table)

        # Check for convergence, break if the change is below the threshold
        if delta < theta:
            break

    # Policy extraction loop
    for state in range(env.observation_space.n):
        # For each state, find the best action by looking at the future rewards
        policy_table[state] = np.argmax([sum(prob * (reward + discount_factor * value_table[next_state])
                                             for prob, next_state, reward, _ in env.unwrapped.P[state][action])  # .unwrapped reaches the base env behind gym.make's wrappers
                                         for action in range(env.action_space.n)])

    return policy_table  # Return the final policy table

# Main execution block
if __name__ == "__main__":
    env = gym.make('FrozenLake-v1', render_mode=None)  # Create the FrozenLake environment (no window needed: only the transition model is used)
    policy = run_value_iteration(env)  # Run value iteration on the environment
    env.close()  # Close the environment after running value iteration
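
The function above depends on the environment's transition table, in which each state and action maps to a list of `(probability, next_state, reward, terminated)` tuples. To make that layout concrete, here is a minimal, self-contained sketch of the same procedure on a hypothetical 3-state chain. The table `P`, the helper `action_value`, and the chosen probabilities are all invented for illustration and have nothing to do with a FrozenLake map.

```python
# Hypothetical 3-state chain (not a FrozenLake map), stored in the same layout:
# P[state][action] -> list of (probability, next_state, reward, terminated)
P = {
    0: {0: [(1.0, 0, 0.0, False)],                          # action 0: go back / stay
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},   # action 1: usually advance
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},    # reaching state 2 pays +1
    2: {0: [(1.0, 2, 0.0, True)],                           # terminal state loops on itself
        1: [(1.0, 2, 0.0, True)]},
}


def action_value(P, values, state, action, gamma):
    """Expected one-step return of taking `action` in `state`."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r, _ in P[state][action])


def value_iteration(P, gamma=0.9, theta=1e-8):
    """Return (values, greedy policy) for a transition table in the layout above."""
    values = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(action_value(P, values, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < theta:  # stop once no state value changed by more than theta
            break
    policy = {s: max(P[s], key=lambda a: action_value(P, values, s, a, gamma)) for s in P}
    return values, policy


values, policy = value_iteration(P)
print(values)
print(policy)
```

In this toy chain the greedy policy picks action 1 in states 0 and 1, and the terminal state keeps a value of 0 because its self-loop never pays a reward.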


#### test_value_iteration_policy.py 
- this tests and renders the environment where an agent uses the value iteration policy defined in value_iteration_policy.py

In [None]:
import gymnasium as gym  # Importing the Gym library for creating and managing environments
from value_iteration_policy import run_value_iteration  # Importing the value iteration function

# Function to test the policy in the environment
def test_policy(env, policy, total_episodes=100):
    total_rewards = 0  # Initialize total rewards

    # Run the policy for a specified number of episodes
    for episode in range(total_episodes):
        observation, info = env.reset()  # Reset the environment at the start of each episode
        episode_reward = 0  # Initialize reward for this episode

        # Run the episode for a maximum of 99 steps
        for _ in range(99):
            action = policy[observation]  # Select an action based on the policy
            # Perform the action in the environment and get the next state and reward
            observation, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward  # Accumulate reward

            # Break the loop if the episode is terminated or truncated
            if terminated or truncated:
                break
        total_rewards += episode_reward  # Add episode reward to total rewards

    average_reward = total_rewards / total_episodes  # Calculate the average reward
    return average_reward  # Return the average reward

# Function to render and demonstrate the policy in the environment
def render_policy(env, policy, total_episodes=5):
    for episode in range(total_episodes):  # Loop for a specified number of episodes
        observation, info = env.reset()  # Reset the environment at the start of each episode

        # Run the episode for a maximum of 99 steps
        for step in range(99):
            env.render()  # Render the current state of the environment
            action = policy[observation]  # Select an action based on the policy
            # Perform the action in the environment and get the next state and reward
            observation, reward, terminated, truncated, info = env.step(action)

            # Break the loop if the episode is terminated or truncated
            if terminated or truncated:
                print("****************************************************")
                print(f"EPISODE {episode + 1}")
                print("Number of steps:", step)
                break

# Main execution block
if __name__ == "__main__":
    env = gym.make('FrozenLake-v1', render_mode=None)  # Create environment without rendering for training
    policy = run_value_iteration(env)  # Run value iteration to get the policy

    # Test the policy and print the average reward
    average_reward = test_policy(env, policy, 100)
    print("Average Score over time: " + str(average_reward))

    env.close()  # Close the non-rendering environment before replacing it
    # Create environment with rendering for demonstration
    env = gym.make('FrozenLake-v1', render_mode="human")
    render_policy(env, policy, 5)  # Render and demonstrate the policy
    env.close()  # Close the environment

#### test_vi_frozen_8x8.py 
- this script tests a value iteration policy on an 8x8 version of Frozen Lake: 

In [None]:
import gymnasium as gym  # Importing the Gym library for creating and managing environments
from value_iteration_policy import run_value_iteration  # Importing the value iteration function

# Function to test the policy in the environment
def test_policy(env, policy, total_episodes=100):
    total_rewards = 0  # Initialize total rewards

    # Run the policy for a specified number of episodes
    for episode in range(total_episodes):
        observation, info = env.reset()  # Reset the environment at the start of each episode
        episode_reward = 0  # Initialize reward for this episode

        # Run the episode for a maximum of 199 steps
        for _ in range(199):
            action = policy[observation]  # Select an action based on the policy
            # Perform the action in the environment and get the next state and reward
            observation, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward  # Accumulate reward

            # Break the loop if the episode is terminated or truncated
            if terminated or truncated:
                break
        total_rewards += episode_reward  # Add episode reward to total rewards

    average_reward = total_rewards / total_episodes  # Calculate the average reward
    return average_reward  # Return the average reward

# Function to render and demonstrate the policy in the environment
def render_policy(env, policy, total_episodes=5):
    for episode in range(total_episodes):  # Loop for a specified number of episodes
        observation, info = env.reset()  # Reset the environment at the start of each episode

        # Run the episode for a maximum of 199 steps
        for step in range(199):
            env.render()  # Render the current state of the environment
            action = policy[observation]  # Select an action based on the policy
            # Perform the action in the environment and get the next state and reward
            observation, reward, terminated, truncated, info = env.step(action)

            # Break the loop if the episode is terminated or truncated
            if terminated or truncated:
                print("****************************************************")
                print(f"EPISODE {episode + 1}")
                print("Number of steps:", step)
                break

# Main execution block
if __name__ == "__main__":
    # Create the 8x8 version of the FrozenLake environment without rendering for training
    env = gym.make('FrozenLake8x8-v1', render_mode=None)
    policy = run_value_iteration(env)  # Run value iteration to get the policy

    # Test the policy and print the average reward
    average_reward = test_policy(env, policy, 100)
    print("Average Score over time: " + str(average_reward))

    env.close()  # Close the non-rendering environment before replacing it
    # Create the 8x8 environment with rendering for demonstration
    env = gym.make('FrozenLake8x8-v1', render_mode="human")
    render_policy(env, policy, 5)  # Render and demonstrate the policy
    env.close()  # Close the environment

#### q_learning_agent.py
- this script defines a q-learning agent, tests and renders it on the 4x4 Frozen Lake environment: 

In [None]:
import gymnasium as gym  # Importing the Gym library for creating and managing environments
import numpy as np  # Importing NumPy for numerical operations
import random  # Importing random for stochastic elements in Q-Learning

class QLearningAgent:
    def __init__(self, env, learning_rate=0.8, discount_factor=0.95, exploration_rate=1.0, max_exploration_rate=1.0, min_exploration_rate=0.01, exploration_decay_rate=0.005):
        self.env = env  # The environment in which the agent operates
        self.learning_rate = learning_rate  # Learning rate for Q-learning updates
        self.discount_factor = discount_factor  # Discount factor for future rewards
        self.exploration_rate = exploration_rate  # Initial exploration rate
        self.max_exploration_rate = max_exploration_rate  # Maximum exploration rate
        self.min_exploration_rate = min_exploration_rate  # Minimum exploration rate
        self.exploration_decay_rate = exploration_decay_rate  # Rate at which exploration rate decays
        self.q_table = np.zeros((env.observation_space.n, env.action_space.n))  # Initialize Q-table with zeros

    def train(self, total_episodes, max_steps):
        rewards = []  # To store rewards obtained in each episode

        # Training loop over episodes
        for episode in range(total_episodes):
            state, _ = self.env.reset()  # Reset the environment
            total_rewards = 0  # Initialize total rewards for the episode

            # Loop for each step in an episode
            for step in range(max_steps):
                exp_exp_tradeoff = random.uniform(0, 1)  # Exploration-exploitation decision
                # Choose action based on exploration rate or Q-table
                if exp_exp_tradeoff > self.exploration_rate:
                    action = np.argmax(self.q_table[state, :])  # Exploitation (choosing best action)
                else:
                    action = self.env.action_space.sample()  # Exploration (choosing random action)

                # Perform action and get new state and reward
                new_state, reward, terminated, truncated, _ = self.env.step(action)

                # Update Q-table using the Q-learning algorithm
                self.q_table[state, action] = self.q_table[state, action] + \
                    self.learning_rate * (reward + self.discount_factor * np.max(self.q_table[new_state, :]) - self.q_table[state, action])

                total_rewards += reward  # Update total rewards
                state = new_state  # Update state

                # Break if the episode has ended
                if terminated or truncated:
                    break

            # Adjust the exploration rate
            self.exploration_rate = self.min_exploration_rate + \
                (self.max_exploration_rate - self.min_exploration_rate) * np.exp(-self.exploration_decay_rate * episode)
            rewards.append(total_rewards)  # Store rewards for this episode

        print("Score over time: " + str(sum(rewards) / total_episodes))  # Print average reward

    def test(self, total_episodes):
        # Testing loop over episodes
        for episode in range(total_episodes):
            state, _ = self.env.reset()  # Reset the environment
            done = False
            total_reward = 0

            # Loop until the episode ends
            while not done:
                action = self.get_action(state)  # Choose action based on Q-table
                state, reward, terminated, truncated, _ = self.env.step(action)  # Perform action
                total_reward += reward  # Update total reward
                done = terminated or truncated  # Check if episode ended
                self.env.render()  # Render the environment

            print(f"Episode {episode+1}: Total Reward: {total_reward}")  # Print total reward for the episode

    def get_action(self, state):
        return np.argmax(self.q_table[state, :])  # Choose the best action based on Q-table

    def play(self, total_episodes, max_steps):
        # Play loop over episodes
        for episode in range(total_episodes):
            state, _ = self.env.reset()  # Reset the environment
            print("****************************************************")
            print("EPISODE ", episode)

            # Loop for each step in an episode
            for step in range(max_steps):
                action = np.argmax(self.q_table[state, :])  # Choose the best action based on Q-table
                new_state, reward, terminated, truncated, _ = self.env.step(action)  # Perform action

                done = terminated or truncated  # Check if episode ended
                if done:
                    self.env.render()  # Render the environment
                    print("Number of steps", step)  # Print number of steps taken
                    break
                state = new_state  # Update state

        self.env.close()  # Close the environment

# Main execution block
if __name__ == "__main__":
    env = gym.make('FrozenLake-v1', render_mode=None)  # Create environment without rendering for training
    agent = QLearningAgent(env)  # Initialize QLearningAgent
    agent.train(15000, 99)  # Train the agent
    env.close()  # Close the environment

    env = gym.make('FrozenLake-v1', render_mode="human")  # Create environment with rendering for playing
    agent.env = env  # Update the agent's environment
    agent.play(5, 99)  # Play the game using the trained Q-table
    env.close()  # Close the environment
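
The exploration schedule used in `train` can be examined on its own. The sketch below reuses the same formula and default values as the agent above; the standalone function `exploration_rate` is a hypothetical helper written only for this illustration.

```python
import math


def exploration_rate(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.005):
    """Exponentially decayed epsilon, using the same formula as QLearningAgent.train."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500, 1000, 5000):
    print(f"episode {episode:>5}: epsilon = {exploration_rate(episode):.4f}")
```

With a decay rate of 0.005, epsilon falls below 0.02 by roughly episode 1,000, so most of a 15,000-episode run is spent acting almost greedily.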


#### q_learning_agent_flv2.py
- this script tests a q-learning agent on the 8x8 version of Frozen Lake. Its only changes from the above q_learning_agent.py file are in the execution block, shown below: 

In [None]:
if __name__ == "__main__":
    env = gym.make('FrozenLake8x8-v1', render_mode=None)
    agent = QLearningAgent(env)
    agent.train(50000, 399)  # Updated number of training episodes and max_steps
    env.close()

    env = gym.make('FrozenLake8x8-v1', render_mode="human")
    agent.env = env
    agent.play(5, 399)  # Using the Q-table as a 'cheatsheet' to play
    env.close()


#### dql_agent.py
- this script creates a deep q-learning neural network, trains and renders it for the standard 4x4 Frozen Lake: 

In [None]:
import gymnasium as gym  # Importing the Gym library for creating and managing environments
import numpy as np  # Importing NumPy for numerical operations
import random  # Importing random for stochastic elements in Deep Q-Learning
import tensorflow as tf  # Importing TensorFlow for building neural network
from tensorflow.keras.models import Sequential  # Sequential model for creating neural network
from tensorflow.keras.layers import Dense  # Dense layer for neural network
from tensorflow.keras.optimizers import Adam  # Adam optimizer for training neural network

class DQNAgent:
    def __init__(self, env, learning_rate=0.001, discount_factor=0.95, exploration_rate=1.0, max_exploration_rate=1.0, min_exploration_rate=0.01, exploration_decay_rate=0.005):
        self.env = env  # The environment in which the agent operates
        self.learning_rate = learning_rate  # Learning rate for neural network
        self.discount_factor = discount_factor  # Discount factor for future rewards
        self.exploration_rate = exploration_rate  # Initial exploration rate
        self.max_exploration_rate = max_exploration_rate  # Maximum exploration rate
        self.min_exploration_rate = min_exploration_rate  # Minimum exploration rate
        self.exploration_decay_rate = exploration_decay_rate  # Rate at which exploration rate decays

        # Neural Network model for Deep Q-Learning
        self.model = Sequential([
            Dense(24, input_shape=(env.observation_space.n,), activation='relu'),  # First hidden layer
            Dense(24, activation='relu'),  # Second hidden layer
            Dense(env.action_space.n, activation='linear')  # Output layer
        ])
        self.model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))  # Compile the model

    def train(self, total_episodes, max_steps):
        # Training loop over episodes
        for episode in range(total_episodes):
            state, _ = self.env.reset()  # Reset the environment
            state_one_hot = np.identity(self.env.observation_space.n)[state:state+1]  # One-hot encode state
            total_rewards = 0  # Initialize total rewards for the episode

            # Loop for each step in an episode
            for step in range(max_steps):
                # Exploration-exploitation decision
                if random.uniform(0, 1) > self.exploration_rate:
                    action = np.argmax(self.model.predict(state_one_hot)[0])  # Exploitation
                else:
                    action = self.env.action_space.sample()  # Exploration

                # Perform action and get new state and reward
                new_state, reward, terminated, truncated, _ = self.env.step(action)
                new_state_one_hot = np.identity(self.env.observation_space.n)[new_state:new_state+1]  # One-hot encode new state

                # Update target for Q-value (do not bootstrap from a terminal state)
                if terminated:
                    target = reward
                else:
                    target = (reward + self.discount_factor *
                              np.max(self.model.predict(new_state_one_hot)[0]))
                target_f = self.model.predict(state_one_hot)
                target_f[0][action] = target

                # Fit the model
                self.model.fit(state_one_hot, target_f, epochs=1, verbose=0)
                total_rewards += reward  # Update total rewards
                state_one_hot = new_state_one_hot  # Update state

                # Break if the episode has ended
                if terminated or truncated:
                    break

            # Adjust the exploration rate
            self.exploration_rate = self.min_exploration_rate + \
                (self.max_exploration_rate - self.min_exploration_rate) * np.exp(-self.exploration_decay_rate * episode)

            print(f"Episode {episode+1}: Total Reward: {total_rewards}")  # Print reward for the episode

    def get_action(self, state):
        state_one_hot = np.identity(self.env.observation_space.n)[state:state+1]  # One-hot encode state
        return np.argmax(self.model.predict(state_one_hot)[0])  # Choose action based on Q-values

    def play(self, total_episodes):
        # Play loop over episodes
        for episode in range(total_episodes):
            state, _ = self.env.reset()  # Reset the environment
            state_one_hot = np.identity(self.env.observation_space.n)[state:state+1]  # One-hot encode state
            done = False
            step = 0

            print("****************************************************")
            print(f"EPISODE {episode + 1}")

            while not done:
                self.env.render()  # Render the environment
                action = np.argmax(self.model.predict(state_one_hot)[0])  # Choose action based on Q-values
                new_state, reward, terminated, truncated, _ = self.env.step(action)  # Perform action
                new_state_one_hot = np.identity(self.env.observation_space.n)[new_state:new_state+1]  # One-hot encode new state

                state_one_hot = new_state_one_hot  # Update state
                done = terminated or truncated  # Check if episode ended
                step += 1

            print(f"Episode {episode + 1} finished after {step} steps")  # Print steps taken

        self.env.close()  # Close the environment

# Main execution block
if __name__ == "__main__":
    env = gym.make('FrozenLake-v1', render_mode=None)  # Create environment without rendering for training
    agent = DQNAgent(env)  # Initialize DQNAgent
    agent.train(5000, 99)  # Train the agent
    env.close()  # Close the environment

    env = gym.make('FrozenLake-v1', render_mode="human")  # Create environment with rendering for playing
    agent.env = env  # Update the agent's environment
    agent.play(5)  # Play 5 episodes in human mode
    env.close()  # Close the environment
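
The target that the network is fitted toward can be isolated as a small function. This is a sketch under stated assumptions rather than the project's code: `td_target` is a hypothetical helper, and it follows the common convention of not bootstrapping from the next state when the episode has terminated, since a network's prediction for a terminal state is not guaranteed to be zero.

```python
def td_target(reward, next_q_values, discount_factor=0.95, terminated=False):
    """One-step Q-learning target for a single transition."""
    if terminated:
        # No future reward is available after a hole or the goal.
        return reward
    return reward + discount_factor * max(next_q_values)


# Ordinary step on frozen ice: the target is the discounted best next Q-value.
print(td_target(0.0, [0.2, 0.5, 0.1, 0.4]))

# Step that reaches the goal: the target is just the reward.
print(td_target(1.0, [0.3, 0.9, 0.0, 0.1], terminated=True))
```

Episodes that are merely truncated by a time limit are usually still bootstrapped, which is why the sketch checks `terminated` only.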

## Results

These results are taken from screenshots of the output terminal after running the scripts above. The hyperparameters of all models (particularly the number of training episodes and the steps per episode) were varied over several runs; the best results obtained are shown below: 

### Value Iteration Agent Results

#### 4x4 

![Value Iteration 4x4 Results](./images/value_it_4x4_results.PNG)

The average score over 100 evaluation episodes shows that the agent reaches the goal 73% of the time. (Value iteration involves no training episodes; it computes its policy directly from the transition model.) Because the environment is stochastic, this figure may change with repeated runs of the script. The policy is then run for 5 more episodes in the environment's human render mode, and the terminal prints the number of steps taken in each. The step counts vary widely from episode to episode, even though the agent succeeds most of the time. (Note: The demo video shows episodes from a different run of the script, so it will not match the image above.)

#### 8x8 

![Value Iteration 8x8 Results](./images/value_it_8x8_results.PNG)

Again, this is the average score over 100 evaluation episodes: the agent reaches the goal in 77 of them. Surprisingly, this is even better than on the 4x4 map. The 5 rendered episodes again show a wide range of steps per episode; the episodes with small step counts (17 and 21) were cases where the agent fell into a hole before reaching the goal. 

### Q-Learning Agent Results

#### 4x4 

![Q-learning 4x4](./images/q_learning_4x4_results.PNG)

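The Q-learning agent reaches the goal a little less than half the time, and only after 15,000 training episodes. This is a markedly lower success rate than the value iteration agent's, at far greater cost: value iteration needs no training episodes at all, since it plans directly from the transition model. Note that this score is averaged over all training episodes, including the early exploration-heavy ones, so it may understate the final greedy policy and is not directly comparable to the 100-episode evaluation scores above. The range in the number of steps per episode is much narrower, however, and the video demonstration of the Q-learning agent runs much faster than that of the value iteration agent. 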

#### 8x8 

![Q-learning 8x8](./images/q_learning_8x8_results.PNG)

This model was trained with progressively more episodes, and correspondingly longer training times, up to 50,000 episodes with `max_steps` set to 399; the success rate remained 0%. The number of steps is the same in every rendered episode, and the visualization shows the agent moving only vertically within one column of the map. The printed step count is a zero-based loop index, so a reported 199 corresponds to the 200th step, which matches the 200-step limit at which Gymnasium truncates `FrozenLake8x8-v1` episodes; the larger `max_steps` value therefore never took effect. Reasons for the failure of this model to find success are discussed in the Conclusion section below. 

### DQL Results

![Frozen Lake DQL](./images/dql_results.PNG)

This model was trained for 5,000 episodes. Training time increased vastly, even for the very simple neural network defined for this model, so given limited computational resources no larger numbers of training episodes were tested. The image above shows a snapshot of training after 100 episodes, with every episode ending with a total reward of 0 and no sign of improvement. This continued for all 5,000 episodes, so this approach unfortunately did not yield success for the DQL agent either. Reasons for the failure of this model to find success are discussed in the Conclusion section below. 

# Conclusion, Discussion, and Reflection

## Summary of Results

The project's exploration of reinforcement learning models in the FrozenLake environment yielded insightful results. The performance of Value Iteration, Q-Learning, and Deep Q-Learning Network (DQN) models varied significantly across the 4x4 and 8x8 maps.

### Value Iteration Results

- **4x4 Map**: Achieved an average score of 0.73 over 100 episodes. Steps per episode varied, with a sample range of 37 to 70 steps.
- **8x8 Map**: Surprisingly outperformed its 4x4 counterpart with an average score of 0.77, although with a larger spread in steps per episode (range from 17 to 136).

### Q-Learning Results

- **4x4 Map**: Had a moderate average score of 0.49053333, but exhibited faster execution and a narrower spread of steps per episode (range from 14 to 40).
- **8x8 Map**: Showed no success, with an average score of 0.0 after 50,000 training episodes, typically reaching 199 steps per episode.

### Deep Q-Learning Results

- The 4x4 map resulted in an average reward of 0 after extensive training (5,000 episodes).

## Analysis of Results

### Value Iteration Performance Analysis

#### 4x4 Map Performance

- **Average Score**: The value iteration agent achieved an average score of 0.73 over 100 episodes. This high score indicates a good level of proficiency in navigating the simpler 4x4 environment. 
- **Steps per Episode**: The observed range of steps per episode (37 to 70) suggests a variability in the efficiency of the generated policy. This could be attributed to the stochastic nature of the environment where the slippery condition can lead to less predictable movements.
- **Implementation Aspects**: The setting of the discount factor (0.9) and the convergence threshold (theta=1e-4) in the value iteration algorithm influenced the policy development. A higher discount factor emphasizes the importance of future rewards, leading to a policy that may take more steps but aims for higher overall rewards.
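
To make the roles of the discount factor and the convergence threshold concrete, here is a minimal value iteration sketch in plain Python. It is not the project's implementation: the three-state chain `TOY_P` is a hypothetical stand-in for the lake, although its layout (`P[s][a]` as a list of `(prob, next_state, reward, done)` tuples) mirrors the transition table that Gymnasium's toy-text environments expose. Only `gamma=0.9` and `theta=1e-4` are taken from the report.

```python
def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(
        p * (r + (0.0 if done else gamma * V[s2]))
        for p, s2, r, done in P[s][a]
    )


def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-4):
    """Return (V, policy) for a tabular MDP given as P[s][a] -> outcomes."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_value(P, V, s, a, gamma) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place sweep: later states see the updated value
        if delta < theta:  # stop once no state value moved by theta or more
            break
    # Greedy policy extraction: choose the action with the best backed-up value.
    policy = [
        max(range(n_actions), key=lambda a: q_value(P, V, s, a, gamma))
        for s in range(n_states)
    ]
    return V, policy


# Hypothetical toy MDP: states 0 -> 1 -> 2 (goal). Action 0 = left, 1 = right.
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

V, policy = value_iteration(TOY_P, n_states=3, n_actions=2)
```

In this toy chain the state next to the goal is worth 1 and the start state is worth `gamma` times that, which shows how the discount factor shrinks the value of rewards that are further away.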

#### 8x8 Map Performance

- **Average Score Improvement**: Interestingly, the average score improved to 0.77 on the 8x8 map. This could be due to the larger state space allowing for more nuanced policy development, where the value iteration algorithm can better differentiate between good and bad states over a larger scale. If the 8x8 score was also averaged over 100 episodes, however, each average carries a standard error of roughly 0.04, so a gap of 0.04 may simply be sampling noise.
- **Greater Steps Variability**: The wider range of steps per episode (17 to 136) on the 8x8 map can be attributed to the increased complexity and the larger number of possible paths to the goal. Because the extracted policy is fixed during evaluation, this spread reflects how slips play out over longer routes rather than exploration by the agent.
- **Policy Extraction in Larger State Space**: The policy extraction phase of the value iteration might have benefited from the increased diversity in state transitions available in the 8x8 environment, leading to a more robust policy despite the larger state space.

#### Implications and Future Improvements

- **Hyperparameter Tuning**: Adjusting the discount factor and convergence threshold specifically for each environment could potentially optimize the performance. A lower discount factor might reduce the variability in steps per episode in the 4x4 environment.
- **Exploration of Stochasticity**: Further analysis of the impact of the environment's stochastic nature on policy performance could provide insights for improvement. Experimenting with environments having different levels of stochasticity (in a different environment besides Frozen Lake but with a similar state space and goals) might reveal more about the algorithm's adaptability.
- **Incremental Complexity Approach**: Gradually increasing the complexity of the training environment, starting from a simple deterministic version and moving towards more complex and stochastic versions, could help in understanding the scalability of the value iteration algorithm.
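
As a reference point for the stochasticity discussed above, the sketch below enumerates where an intended move can actually go. It assumes the slip rule described in the Gymnasium documentation for the default slippery lake (the intended direction and the two perpendicular directions, each with probability 1/3). The helper name `move_distribution` is hypothetical and is not part of Gymnasium or of this project.

```python
ACTIONS = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}


def move_distribution(action, is_slippery=True):
    """Map each direction the agent may actually move in to its probability."""
    if not is_slippery:
        return {ACTIONS[action]: 1.0}
    # Assumed slip rule: intended direction plus its two perpendicular neighbours.
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return {ACTIONS[a]: 1.0 / 3.0 for a in candidates}


slippery_right = move_distribution(2)
deterministic_right = move_distribution(2, is_slippery=False)
```

Under this rule the agent never moves opposite to its intended direction. The deterministic variant corresponds to creating the environment with `is_slippery=False`, which would be a natural first step in the incremental approach.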

### Q-Learning Performance Analysis

#### 4x4 Map Performance

- **Average Score**: The Q-learning agent achieved an average score of 0.49053333 over 15,000 training episodes. This moderate success rate reflects the agent's ability to learn an effective policy in a relatively small state space.
- **Exploration-Exploitation Trade-off**: The initial high exploration rate, which decayed over time, allowed the agent to explore various actions in the early stages and gradually focus more on exploitation. This strategy is crucial in environments like FrozenLake, where certain actions can lead to falling in holes, and safe paths need to be learned over time.
- **Steps per Episode**: The smaller spread of steps per episode (ranging from 14 to 40) suggests that the Q-learning agent was relatively efficient in navigating the environment, possibly due to the balance between exploration and exploitation achieved through the decay of the exploration rate.
- **Learning Rate and Discount Factor**: The relatively high learning rate (0.8) and discount factor (0.95) might have contributed to the agent’s ability to adapt its policy based on the feedback from the environment. 
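
The effect of the learning rate and discount factor can be seen in a single tabular update. The following is a minimal sketch of the standard Q-learning rule, not the project's training loop. The function name `q_update` and the two hand-picked transitions are illustrative assumptions, and slipping is ignored; only `alpha=0.8` and `gamma=0.95` come from the report.

```python
def q_update(Q, state, action, reward, next_state, done, alpha=0.8, gamma=0.95):
    """Apply one Q-learning update in place and return the new Q[state][action]."""
    # Terminal transitions have no future value to bootstrap from.
    target = reward if done else reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
    return Q[state][action]


# 16 states x 4 actions, all zeros, as for the 4x4 map.
Q = [[0.0] * 4 for _ in range(16)]

# Illustrative transition: state 14, action Right (2) reaches the goal (state 15).
first = q_update(Q, state=14, action=2, reward=1.0, next_state=15, done=True)

# Illustrative transition: state 13, action Right (2) lands on state 14, no reward.
second = q_update(Q, state=13, action=2, reward=0.0, next_state=14, done=False)
```

With a learning rate of 0.8, the first update moves the estimate most of the way to the reward in one step, and the second update propagates a discounted share of that value one state back. This is also why learning is slow while rewards are rare: until the goal is reached at least once, every target is zero.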

#### 8x8 Map Performance

- **Lack of Success**: On the larger 8x8 map, the Q-learning agent did not achieve any success (average score over all episodes was 0.0) after 50,000 episodes of training. This indicates a significant challenge in scaling the Q-learning approach to larger, more complex environments.
- **Potential Reasons for Poor Performance**:
    - **Complexity of State Space**: The increased complexity and size of the 8x8 map may have exceeded the capacity of the agent to learn an effective policy within the given number of episodes. 
    - **Insufficient Exploration**: Given the larger environment, the agent might have needed more exploration to effectively learn about the diverse states and actions possible, especially in the early stages of training.
    - **Hyperparameter Tuning**: The learning rate, discount factor, and exploration decay rate may not have been optimal for the larger environment, requiring further tuning to adapt to the increased complexity.
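
One way to check whether exploration ended too early is to compute how quickly the exploration rate decays. This section does not restate the decay schedule, so the exponential form and every constant below are assumptions chosen for illustration. The function names are hypothetical as well.

```python
import math


def epsilon_at(episode, max_eps=1.0, min_eps=0.01, decay_rate=0.005):
    """Exploration rate under an assumed exponential decay schedule."""
    return min_eps + (max_eps - min_eps) * math.exp(-decay_rate * episode)


def episodes_until(threshold, max_eps=1.0, min_eps=0.01, decay_rate=0.005):
    """First episode at which epsilon has fallen to the threshold or below."""
    if threshold <= min_eps:
        raise ValueError("epsilon never drops to min_eps or below")
    if threshold >= max_eps:
        return 0
    ratio = (max_eps - min_eps) / (threshold - min_eps)
    return math.ceil(math.log(ratio) / decay_rate)


fast = episodes_until(0.05)                     # assumed decay_rate=0.005
slow = episodes_until(0.05, decay_rate=0.0005)  # ten times slower decay
```

Under these assumed constants, the faster schedule is almost purely greedy after a few hundred episodes, a small fraction of a 50,000-episode run. If the agent has not reached the goal by then, the remaining episodes add little new information.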

#### Implications and Future Improvements

- **Enhanced Exploration Techniques**: Implementing advanced exploration techniques, such as epsilon-greedy with a more adaptive decay rate or even sophisticated methods like Upper Confidence Bound (UCB), could improve learning in complex environments.
- **Hyperparameter Optimization**: Adjusting the learning rate, discount factor, and exploration decay rate specifically for each environment size could yield better results, especially in larger environments.
- **Incremental Complexity**: Training the agent initially on smaller environments and progressively increasing the environment size might help in building a more robust learning process.
- **More Computational Resources**: A machine with more computational resources could run more training episodes in less time, which may be required for Q-learning on the 8x8 map.
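
As an illustration of the UCB idea mentioned in the list above, the sketch below selects an action using a UCB1-style bonus on top of the Q-values. This was not implemented in the project. The function name `ucb_action`, the per-state visit counts, and the exploration constant `c=2.0` are all assumptions.

```python
import math


def ucb_action(q_row, counts_row, c=2.0):
    """Pick an action for one state from its Q-values and visit counts."""
    # Try every action at least once before trusting any estimate.
    for action, n in enumerate(counts_row):
        if n == 0:
            return action
    total = sum(counts_row)
    # The bonus shrinks as an action is tried more often in this state.
    scores = [
        q + c * math.sqrt(math.log(total) / n)
        for q, n in zip(q_row, counts_row)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


rarely_tried = ucb_action([0.5, 0.4, 0.0, 0.0], [100, 2, 50, 50])
untried = ucb_action([0.5, 0.4, 0.0, 0.0], [3, 0, 1, 1])
```

Unlike epsilon-greedy, which explores uniformly at random, this rule directs exploration toward actions that have been tried least in the current state.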

### Deep Q-Learning (DQN) Agent Analysis

- The DQN agent, despite extensive training (5000 episodes), did not achieve success in the 4x4 map of the FrozenLake environment, reflected in an average reward of 0 over all episodes. This result prompts a deeper analysis of the factors influencing the DQN agent's performance.

#### Factors Influencing DQN Performance

- **Complexity of State Representation**: The DQN agent's neural network utilized a one-hot encoding for the states, which may not be the most effective representation for capturing the nuances of the FrozenLake environment. This method of state representation could limit the network's ability to learn complex policies.
- **Neural Network Architecture**: The DQN's model, comprising two hidden layers with 24 neurons each, might not have been sufficiently complex to capture the required policy for navigation. The computational resources of my local machine prevented the implementation of a more complex neural network.
- **Learning Rate and Discount Factor**: The chosen learning rate (0.001) and discount factor (0.95) play a crucial role in the learning process. The learning rate may have been too low, slowing down the learning process significantly. In contrast, the discount factor might have been too high for an environment where immediate rewards are sparse.
- **Exploration and Exploitation Balance**: The exploration strategy, starting from a high exploration rate and decaying over time, might not have been optimal. This could have led to insufficient exploration or premature convergence to a suboptimal policy.
- **Reward Structure Sensitivity**: DQN is known to be sensitive to the reward structure of the environment. In FrozenLake, where the rewards are sparse and only obtained upon reaching the goal, the agent might struggle to associate actions with positive outcomes effectively.
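
For reference, the one-hot state representation mentioned above can be sketched in a few lines of plain Python. The helper names `one_hot` and `linear` and the small weight matrix are hypothetical and are not taken from the project's network code.

```python
def one_hot(state, n_states=16):
    """Encode a discrete state index as a vector with a single 1.0 entry."""
    vec = [0.0] * n_states
    vec[state] = 1.0
    return vec


def linear(x, W):
    """Bias-free linear layer: x (length n) times W (n rows, m columns)."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x)))
        for j in range(len(W[0]))
    ]


# Hypothetical weights for 3 states and 2 actions.
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
encoded = one_hot(5)
row = linear(one_hot(2, n_states=3), W)
```

Feeding a one-hot vector through a linear layer simply selects one row of the weight matrix, so the first layer of such a network behaves like a lookup table indexed by state. The encoding carries no information about which squares are adjacent, which the network has to learn from experience.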

#### Suggestions for Improvement

- **State Representation Enhancement**: Experimenting with different state representation techniques, such as embedding layers or simpler state aggregation methods, could improve the learning efficiency of the network.
- **Neural Network Tuning**: Adjusting the neural network's architecture, such as varying the number of layers and neurons, could help in finding a more suitable model complexity for the task. A machine with more computational resources would likely have helped in this endeavor.
- **Hyperparameter Optimization**: Fine-tuning the learning rate and discount factor could accelerate learning or improve the agent's ability to value future rewards appropriately.
- **Advanced Exploration Strategies**: Implementing more sophisticated exploration strategies, such as adding noise to actions or using methods like epsilon-greedy with a variable decay rate, could enhance the agent's exploration efficiency.

## Reflection

This project illuminated the nuances of applying different reinforcement learning models to environments of varying complexity. The surprising performance of value iteration in a more complex environment and the challenges faced by Q-learning and DQN highlight the importance of context-specific model selection and hyperparameter tuning in reinforcement learning. These insights pave the way for future exploration into more complex environments, assuming sufficient computational resources are available.

## References

- “Gymnasium Documentation.” Gymnasium.farama.org, gymnasium.farama.org/environments/toy_text/frozen_lake/.
- simoninithomas. “Deep_reinforcement_learning_Course/Q Learning/FrozenLake/Q Learning with FrozenLake_unslippery (Deterministic Version).Ipynb at Master · Simoninithomas/Deep_reinforcement_learning_Course.” GitHub, 2018, github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake_unslippery%20(Deterministic%20version).ipynb. Accessed 19 Dec. 2023.