# **Table of Contents**

### 1. **Import Libraries**  
   - 1A. [Import Required Libraries](#1a-import-required-libraries)  
   - 1B. [Create Environment and Test](#1b-create-environment-and-test)  

### 2. **Train Model for Normal Version with PPO**  
   - 2A. [Train the Model](#2a-train-the-model)  
   - 2B. [Save the Model](#2b-save-the-model)  
   - 2C. [Evaluate the Model](#2c-evaluate-the-model)  

### 3. **Train Model for Hardcore Version with PPO**  
   - 3A. [Test the Environment](#3a-test-the-environment)  
   - 3B. [Train the Hardcore Model](#3b-train-the-hardcore-model)  
   - 3C. [Save the Hardcore Model](#3c-save-the-hardcore-model)  
   - 3D. [Evaluate the Hardcore Model](#3d-evaluate-the-hardcore-model)  


# 1) Import Libraries

## 1A) Import Required Libraries

In [None]:
# Import the necessary libraries

# gymnasium is the maintained successor to the gym library, used to create and interact with reinforcement learning environments
import gymnasium as gym

# Import PPO (Proximal Policy Optimization) from stable-baselines3, which is a popular RL algorithm
from stable_baselines3 import PPO

# Import the evaluation function to assess the performance of the trained policy
from stable_baselines3.common.evaluation import evaluate_policy

## 1B) Create Environment and Test

In [None]:
# Create the BipedalWalker environment with human-rendering mode enabled
env = gym.make("BipedalWalker-v3", render_mode="human")

In [None]:
# Reset the environment to start a new episode (no seed or options passed)
# Gymnasium's reset() returns a tuple: (observation, info)
obs, info = env.reset()

# Let the agent take random actions for 1000 steps
for _ in range(1000):
    # Take a random action sampled from the environment's action space
    action = env.action_space.sample()

    # Step the environment forward using the chosen action
    # The environment returns the new observation (obs), the reward,
    # whether the episode terminated (terminated), whether it was cut off early (truncated),
    # and additional info (info)
    obs, reward, terminated, truncated, info = env.step(action)

    # If the episode is finished (either terminated or truncated), reset the environment for a new episode
    if terminated or truncated:
        obs, info = env.reset()

# Close the environment when finished to clean up resources
env.close()
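
The random-action test above throws the rewards away. A minimal sketch of the same rollout that also records the total reward of each finished episode is shown below. `random_rollout` is a hypothetical helper name, and the sketch assumes the Gymnasium API, in which `reset` returns `(obs, info)` and `step` returns five values.

```python
def random_rollout(env, n_steps=1000, seed=None):
    """Take random actions for n_steps and return the total reward of each finished episode."""
    episode_returns = []
    current_return = 0.0

    # Gymnasium's reset() returns (observation, info); the seed is optional
    obs, info = env.reset(seed=seed)

    for _ in range(n_steps):
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        current_return += float(reward)

        # Close out the episode and start a new one when it ends for either reason
        if terminated or truncated:
            episode_returns.append(current_return)
            current_return = 0.0
            obs, info = env.reset()

    return episode_returns
```

Calling `random_rollout(gym.make("BipedalWalker-v3"), n_steps=1000)` should return one entry per completed episode. A random policy usually falls early, and the environment penalizes a fall with -100, so these returns make a useful baseline to compare a trained agent against.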

# 2) Train Model for Normal Version with PPO

## 2A) Train the Model

In [None]:
# Recreate the environment without rendering so that training runs faster
env = gym.make("BipedalWalker-v3")

In [None]:
# Create the PPO model with a Multi-Layer Perceptron (MLP) policy
model = PPO("MlpPolicy", env, verbose=1)

In [None]:
# Train the agent for 1,000,000 environment steps
model.learn(total_timesteps=1_000_000)
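
To get a feel for what a budget of 1,000,000 timesteps means in practice, the sketch below converts it into the number of rollouts and minibatch gradient steps PPO performs. It assumes the stable-baselines3 defaults for `PPO` (`n_steps=2048`, `batch_size=64`, `n_epochs=10`) and a single environment. `ppo_update_counts` is a hypothetical helper, not part of the library.

```python
import math

def ppo_update_counts(total_timesteps, n_steps=2048, n_envs=1, batch_size=64, n_epochs=10):
    """Estimate how many rollouts and minibatch gradient steps a PPO run performs.

    The defaults are assumed to mirror stable-baselines3's PPO defaults; check your version.
    """
    rollout_size = n_steps * n_envs
    # Training proceeds in whole rollouts, so the last one may overshoot the budget
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = math.ceil(rollout_size / batch_size)
    gradient_steps = n_rollouts * n_epochs * minibatches_per_epoch
    return n_rollouts, gradient_steps

print(ppo_update_counts(1_000_000))  # (489, 156480)
```

Under these assumptions, the run collects 489 rollouts of 2048 steps each, slightly more than the requested budget, because training stops only at a rollout boundary.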

## 2B) Save the Model

In [None]:
model.save("ppo_bipedalwalker_1M")

In [None]:
# Delete the in-memory model so the next section demonstrates loading it from disk
del model

## 2C) Evaluate the Model

In [None]:
# Load the model saved in 2B (the file name must match the one passed to model.save)
model = PPO.load("ppo_bipedalwalker_1M")

In [None]:
env = gym.make("BipedalWalker-v3", render_mode="human")

In [None]:
# Evaluate the model (e.g., over 10 episodes)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

print(f"Average reward: {mean_reward} ± {std_reward}")
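
`evaluate_policy` can also return the raw per-episode results when called with `return_episode_rewards=True`, which yields a list of episode rewards and a list of episode lengths instead of the mean and standard deviation. The sketch below summarizes such a list. `summarize_rewards` is a hypothetical helper, and the threshold of 300 is the score the Gymnasium documentation gives for solving BipedalWalker.

```python
from statistics import mean, pstdev

def summarize_rewards(episode_rewards, solved_threshold=300.0):
    """Summarize per-episode rewards, e.g. from evaluate_policy(..., return_episode_rewards=True)."""
    solved = [r for r in episode_rewards if r >= solved_threshold]
    return {
        "mean": mean(episode_rewards),
        # Population standard deviation, matching NumPy's default np.std
        "std": pstdev(episode_rewards),
        "min": min(episode_rewards),
        "max": max(episode_rewards),
        "solved_fraction": len(solved) / len(episode_rewards),
    }

# Illustrative numbers only, not results from this notebook
summary = summarize_rewards([310.0, 290.0, 305.0, 295.0])
print(summary)
```

The minimum and the solved fraction are worth a look alongside the mean, since a walker that scores well on average can still fall in some episodes.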

# 3) Train Model for Hardcore Version with PPO

## 3A) Test the Environment

In [None]:
env = gym.make("BipedalWalker-v3", hardcore=True, render_mode="human")

In [None]:
# Reset the environment to start a new episode (no seed or options passed)
# Gymnasium's reset() returns a tuple: (observation, info)
obs, info = env.reset()

# Let the agent take random actions for 1000 steps
for _ in range(1000):
    # Take a random action sampled from the environment's action space
    action = env.action_space.sample()

    # Step the environment forward using the chosen action
    # The environment returns the new observation (obs), the reward,
    # whether the episode terminated (terminated), whether it was cut off early (truncated),
    # and additional info (info)
    obs, reward, terminated, truncated, info = env.step(action)

    # If the episode is finished (either terminated or truncated), reset the environment for a new episode
    if terminated or truncated:
        obs, info = env.reset()

# Close the environment when finished to clean up resources
env.close()

## 3B) Train the Hardcore Model

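In [None]:
# Create the hardcore environment without rendering so that training runs faster
# (this must happen before the model is built, because the previous env was closed)
env = gym.make("BipedalWalker-v3", hardcore=True)

In [None]:
# Create the PPO model with a Multi-Layer Perceptron (MLP) policy
model = PPO("MlpPolicy", env, verbose=1)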

In [None]:
# Train the agent for 2,000,000 environment steps
model.learn(total_timesteps=2_000_000)
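
A run of this length takes a long time, and an interruption loses all progress. One possible alternative is to train in chunks and save after each one. The sketch assumes the stable-baselines3 methods `learn(total_timesteps=..., reset_num_timesteps=False)` and `save(path)`. `train_in_chunks` and its file-name pattern are hypothetical.

```python
def train_in_chunks(model, total_timesteps, chunk_size, prefix):
    """Train in chunks, saving a checkpoint after each chunk. Returns the saved paths."""
    saved_paths = []
    trained = 0
    while trained < total_timesteps:
        steps = min(chunk_size, total_timesteps - trained)
        # reset_num_timesteps=False keeps the model's step counter running across calls
        model.learn(total_timesteps=steps, reset_num_timesteps=False)
        trained += steps
        # Name each checkpoint after the number of steps requested so far
        path = f"{prefix}_{trained}"
        model.save(path)
        saved_paths.append(path)
    return saved_paths
```

For example, `train_in_chunks(model, 2_000_000, 500_000, "ppo_bipedalwalker_hardcore")` would write four checkpoints. Because PPO trains in whole rollouts, the actual step count of each chunk may slightly exceed the requested value.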

## 3C) Save the Hardcore Model

In [None]:
# Save under a name that matches the 2,000,000 training steps used in 3B
model.save("ppo_bipedalwalker_hardcore_2M")

## 3D) Evaluate the Hardcore Model

In [None]:
env = gym.make("BipedalWalker-v3", hardcore=True, render_mode="human")

In [None]:
# Evaluate the model (e.g., over 10 episodes)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

print(f"Average reward: {mean_reward} ± {std_reward}")
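
`evaluate_policy` hides the rollout loop. A simplified manual version, which can be handy for logging per-step information, might look like the sketch below. It assumes the stable-baselines3 call `model.predict(obs, deterministic=True)`, which returns an action and a recurrent state, and the Gymnasium five-value `step`. `run_episodes` is a hypothetical helper, not a replacement for `evaluate_policy`.

```python
def run_episodes(model, env, n_episodes=10):
    """Run the policy deterministically and return each episode's total reward."""
    episode_returns = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        total = 0.0
        finished = False
        while not finished:
            # predict() returns (action, state); the state only matters for recurrent policies
            action, _state = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            total += float(reward)
            finished = terminated or truncated
        episode_returns.append(total)
    return episode_returns
```

With the objects from this section, `run_episodes(model, env, n_episodes=10)` returns a list of ten episode totals that can be averaged or inspected one by one.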