# Solving RL Problems: Defining States, Actions, and Rewards

## 📚 Learning Objectives

By completing this notebook, you will:
- Define state spaces for RL problems
- Define action spaces for RL problems
- Design reward functions
- Run RL simulations
- Apply these concepts to real environments

## 🔗 Prerequisites

- ✅ OpenAI Gym setup (version 0.26 or later, whose `reset()` and `step()` signatures this notebook uses)
- ✅ Understanding of RL basics (agent, environment, episode)
- ✅ Python knowledge (functions, classes, loops)
- ✅ NumPy knowledge

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 1**:
- Solving RL problems: defining states, actions, and rewards, running RL simulations
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 1 Practical Content

---

## Introduction

**Solving RL problems** requires carefully defining:
1. **States**: What the agent observes from the environment
2. **Actions**: What the agent can do
3. **Rewards**: Feedback signals that guide learning

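These definitions largely determine whether learning succeeds: a state that omits needed information, or a reward that scores the wrong behavior, will mislead the agent no matter which algorithm is used.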

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import gym

print("✅ Libraries imported!")
print("\nSolving RL Problems: States, Actions, and Rewards")
print("=" * 60)

## Part 1: Defining State Spaces


In [None]:
print("=" * 60)
print("Part 1: Defining State Spaces")
print("=" * 60)

# Create CartPole environment
env = gym.make('CartPole-v1')

print("\nCartPole-v1 Environment:")
observation, info = env.reset()
print(f" State/Observation: {observation}")
print(f" State shape: {observation.shape}")
print(f" State type: {type(observation)}")
print(f" State space: {env.observation_space}")

print("\nState Components (CartPole):")
print(" [0] Cart position")
print(" [1] Cart velocity")
print(" [2] Pole angle")
print(" [3] Pole angular velocity")

# Create FrozenLake environment
env_fl = gym.make('FrozenLake-v1', render_mode='ansi')
obs_fl, info_fl = env_fl.reset()
print(f"\nFrozenLake-v1 Environment:")
print(f" State/Observation: {obs_fl}")
print(f" State space: {env_fl.observation_space}")
print(f" State type: Discrete (single integer representing grid position)")

env.close()
env_fl.close()

print("\n✅ State spaces defined!")
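
FrozenLake's single-integer state is easier to reason about once it is decoded back into grid coordinates. The following is a minimal sketch, assuming the default 4x4 map and the row-major encoding `row * ncols + col` described in the Gym documentation. The helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of Gym.

```python
# Hypothetical helpers (not part of Gym) for a row-major grid encoding.
NCOLS = 4  # assumption: default 4x4 FrozenLake map


def state_to_cell(state, ncols=NCOLS):
    """Decode a discrete state index into (row, col)."""
    return divmod(state, ncols)


def cell_to_state(row, col, ncols=NCOLS):
    """Encode (row, col) back into a single integer state."""
    return row * ncols + col


print(state_to_cell(0))     # start cell in the top-left corner
print(state_to_cell(15))    # goal cell on the default 4x4 map
print(cell_to_state(2, 3))  # encode row 2, column 3
```

Decoding like this is only for inspection; the environment itself still expects and returns the integer form.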

## Part 2: Defining Action Spaces


In [None]:
print("\n" + "=" * 60)
print("Part 2: Defining Action Spaces")
print("=" * 60)

# CartPole actions
env = gym.make('CartPole-v1')
print("\nCartPole-v1 Actions:")
print(f" Action space: {env.action_space}")
print(f" Number of actions: {env.action_space.n}")
print(" Actions:")
print(" 0: Push cart to the left")
print(" 1: Push cart to the right")

# FrozenLake actions
env_fl = gym.make('FrozenLake-v1')
obs_fl, info_fl = env_fl.reset()
print(f"\nFrozenLake-v1 Actions:")
print(f" Action space: {env_fl.action_space}")
print(f" Number of actions: {env_fl.action_space.n}")
print(" Actions:")
print(" 0: Move left")
print(" 1: Move down")
print(" 2: Move right")
print(" 3: Move up")

env.close()
env_fl.close()

print("\n✅ Action spaces defined!")
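
The integer actions listed above only gain meaning through the transition they trigger. As a rough illustration, the sketch below maps each FrozenLake action index to a grid move. It assumes deterministic movement on a 4x4 grid, unlike the default slippery FrozenLake, and the names `ACTION_NAMES`, `ACTION_DELTAS`, and `apply_action` are hypothetical.

```python
# Hypothetical lookup tables; the index-to-direction mapping follows the
# FrozenLake action list above (0=left, 1=down, 2=right, 3=up).
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}
ACTION_DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # (d_row, d_col)


def apply_action(row, col, action, nrows=4, ncols=4):
    """Move one cell in the chosen direction, staying inside the grid.

    Assumption: deterministic movement (no slipping).
    """
    d_row, d_col = ACTION_DELTAS[action]
    new_row = min(max(row + d_row, 0), nrows - 1)
    new_col = min(max(col + d_col, 0), ncols - 1)
    return new_row, new_col


# Moving right from the top-left corner changes the column.
print(ACTION_NAMES[2], apply_action(0, 0, 2))
# Moving up at the top edge leaves the agent where it is.
print(ACTION_NAMES[3], apply_action(0, 0, 3))
```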

## Part 3: Designing Reward Functions


In [None]:
print("\n" + "=" * 60)
print("Part 3: Designing Reward Functions")
print("=" * 60)

print("\nReward Function Design Principles:")
print(" 1. Provide clear feedback (positive for good, negative for bad)")
print(" 2. Shape rewards to guide learning (sparse vs dense)")
print(" 3. Balance immediate vs long-term rewards")
print(" 4. Avoid reward hacking (unintended behaviors)")

# Test CartPole rewards
env = gym.make('CartPole-v1')
obs, info = env.reset()

print("\nCartPole-v1 Reward Function:")
print(" +1 for each step the pole remains balanced")
print(" Episode terminates when the pole tilts past about 12 degrees or the cart leaves the track (|x| > 2.4)")
print(" Maximum return: 500 (episodes are truncated after 500 steps)")

total_reward = 0
# Take up to 10 random actions and accumulate the reward
for step in range(10):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
    print(f" Step {step+1}: Reward = {reward}, Total = {total_reward}")
    if done:
        break

env.close()

print("\n✅ Reward functions understood!")
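
Principle 3 (balancing immediate and long-term rewards) is commonly handled with a discount factor. The sketch below assumes the standard discounted return, where the reward at step t is weighted by `gamma**t`. The function name `discounted_return` and the `gamma` values are illustrative choices, not something this notebook defines.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum rewards with the reward at step t weighted by gamma**t."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


episode_rewards = [1.0] * 10  # e.g. ten CartPole steps at +1 each
print(discounted_return(episode_rewards, gamma=1.0))  # undiscounted sum
print(discounted_return(episode_rewards, gamma=0.9))  # later rewards count less
```

A `gamma` close to 1 makes the agent value distant rewards almost as much as immediate ones, while a smaller `gamma` makes it more short-sighted.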

## Part 4: Running RL Simulations


In [None]:
print("\n" + "=" * 60)
print("Part 4: Running RL Simulations")
print("=" * 60)

def run_random_episode(env, max_steps=100):
    """Run one episode with random actions and return (total_reward, steps)."""
    obs, info = env.reset()
    total_reward = 0
    steps = 0

    for step in range(max_steps):
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        total_reward += reward
        steps += 1

        # Stop as soon as the environment ends the episode
        if done:
            break

    return total_reward, steps

# Run multiple episodes
env = gym.make('CartPole-v1')
n_episodes = 10

print(f"\nRunning {n_episodes} random episodes in CartPole-v1:")
rewards = []
steps_list = []

for episode in range(n_episodes):
    reward, steps = run_random_episode(env)
    rewards.append(reward)
    steps_list.append(steps)
    print(f" Episode {episode+1}: Reward = {reward:.1f}, Steps = {steps}")

print(f"\nStatistics:")
print(f" Average reward: {np.mean(rewards):.2f}")
print(f" Average steps: {np.mean(steps_list):.2f}")
print(f" Best reward: {np.max(rewards):.2f}")
print(f" Worst reward: {np.min(rewards):.2f}")

env.close()

print("\n✅ RL simulations complete!")

## Summary

### Key Components:
1. **States**: What the agent observes (observation space)
   - Continuous states (CartPole: cart position, cart velocity, pole angle, pole angular velocity)
   - Discrete states (FrozenLake: grid position)
2. **Actions**: What the agent can do (action space)
   - Discrete actions (CartPole: left/right, FrozenLake: up/down/left/right)
   - Continuous actions (MountainCarContinuous: force applied to the car)
3. **Rewards**: Feedback signals
   - Dense rewards (CartPole: +1 per step)
   - Sparse rewards (FrozenLake: +1 only at goal)
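
The dense/sparse distinction can be made concrete on a small grid. The sketch below assumes a 4x4 grid with the goal in the bottom-right corner. `sparse_reward` mirrors FrozenLake's +1-at-goal signal, while `dense_reward` is a hypothetical shaped alternative (negative Manhattan distance to the goal) and is not FrozenLake's actual reward.

```python
GOAL = (3, 3)  # assumption: goal in the bottom-right corner of a 4x4 grid


def sparse_reward(cell, goal=GOAL):
    """+1 only when the agent reaches the goal, 0 everywhere else."""
    return 1.0 if cell == goal else 0.0


def dense_reward(cell, goal=GOAL):
    """Hypothetical shaped reward: negative Manhattan distance to the goal."""
    distance = abs(goal[0] - cell[0]) + abs(goal[1] - cell[1])
    return float(-distance)


for cell in [(0, 0), (2, 3), (3, 3)]:
    print(cell, sparse_reward(cell), dense_reward(cell))
```

The dense version gives feedback on every step, which can speed up learning, but a shaped signal that is poorly aligned with the real goal is one way reward hacking arises.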

### Design Principles:
- **States**: Include all relevant information for decision-making
- **Actions**: Cover all possible behaviors the agent can take
- **Rewards**: Provide clear feedback, shape learning, avoid hacking

### Simulation Process:
1. Reset environment (get initial state)
2. Loop: Select action → Step environment → Receive reward → Update state
3. Repeat until episode ends
4. Reset and repeat for multiple episodes
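
The four steps above can be sketched without Gym installed by using a toy stand-in. `CorridorEnv` below is a hypothetical environment written only for illustration. It mimics the `reset()` and `step()` return shapes used in this notebook but is not a Gym class, and its corridor length, step limit, and reward are arbitrary assumptions.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment that mimics Gym's reset/step return shapes.

    State: integer position 0..length-1. Actions: 0 = left, 1 = right.
    Reward: +1 on reaching the right end, 0 otherwise.
    """

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        terminated = self.position == self.length - 1  # goal reached
        truncated = self.steps >= self.max_steps       # step limit hit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, rng):
    state, info = env.reset()            # 1. reset environment
    total_reward, done = 0.0, False
    while not done:                      # 3. repeat until the episode ends
        action = rng.choice([0, 1])      # 2. select a (random) action
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward, env.steps


rng = random.Random(0)
env = CorridorEnv()
for episode in range(3):                 # 4. reset and repeat
    print(episode + 1, run_episode(env, rng))
```

Swapping the random choice for a learned policy is the only change needed to turn this loop into an evaluation loop.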

### Next Steps:
- Implement learning algorithms (Q-learning, policy gradients)
- Train agents to maximize rewards
- Evaluate performance across episodes

**Reference:** Course 09, Unit 1: "Introduction to Reinforcement Learning" - Solving RL problems practical content