# Reinforcement Learning: Policy Search and OpenAI Gym

This notebook introduces the foundational concepts of policy search in reinforcement learning (RL) and demonstrates practical usage of OpenAI Gym with the classic CartPole environment.

In [1]:
import numpy as np

# Deterministic policy: always returns the same action for a given observation
def deterministic_policy(obs):
    # Example: always push right (action 1), whatever the observation
    return 1

# Stochastic policy: returns an action sampled from a probability distribution
def stochastic_policy(obs):
    # Example: 30% chance to push left (action 0), 70% chance to push right (action 1)
    return np.random.choice([0, 1], p=[0.3, 0.7])


## 1. Policy Search in Reinforcement Learning

A **policy** is the strategy an agent uses to decide what action to take given an observation from the environment. It is the agent's "brain" or decision-making function.

**Types of Policies:**
- **Deterministic:** Always outputs the same action for a given observation.
- **Stochastic:** Outputs a probability distribution over actions and samples from it.

**Policy Representations:**
- Rule-based systems
- Lookup tables
- Neural networks (common in Deep RL)
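
As a rough illustration of the lookup-table representation, the sketch below discretizes one entry of the observation into bins and reads the action from a table. It assumes the third entry of the observation is the pole angle in radians, as in CartPole; the bin edges, `ANGLE_BIN_EDGES`, `ACTION_TABLE`, and `lookup_table_policy` are hypothetical names and values chosen only for this example.

```python
from bisect import bisect_right

# Hypothetical bin edges (radians) for the pole angle; illustrative values only.
ANGLE_BIN_EDGES = [-0.05, 0.0, 0.05]

# One action per bin (0 = push left, 1 = push right).
# Bins: (-inf, -0.05), [-0.05, 0.0), [0.0, 0.05), [0.05, +inf)
ACTION_TABLE = [0, 0, 1, 1]


def lookup_table_policy(obs):
    """Deterministic policy: discretize the pole angle, then read the action from the table."""
    angle = obs[2]  # assumption: index 2 holds the pole angle
    bin_index = bisect_right(ANGLE_BIN_EDGES, angle)
    return ACTION_TABLE[bin_index]


print(lookup_table_policy([0.0, 0.0, -0.10, 0.0]))
print(lookup_table_policy([0.0, 0.0, 0.02, 0.0]))
```

In this representation the table entries are the policy parameters: changing an entry of `ACTION_TABLE` changes the behavior for every observation that falls in that bin.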

**Policy Search:**
The process of finding the best policy parameters (e.g., probabilities, neural network weights) to maximize cumulative reward.

**Methods for Policy Search:**
1. **Brute Force:** Try many parameter combinations and pick the best.
2. **Genetic Algorithms:** Evolve a population of policies over generations.
3. **Policy Gradients:** Use gradient ascent to adjust the policy parameters in the direction that increases expected cumulative reward.
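
To make the policy-gradient idea concrete, here is a minimal sketch on a toy problem rather than a full RL environment. It assumes a one-step task with two actions where action 1 always pays reward 1.0 and action 0 pays 0.0, and a stochastic policy with a single parameter `theta` such that P(action = 1) = sigmoid(`theta`). The function `train_bandit_policy` and all of its defaults are hypothetical; the update is a REINFORCE-style step, not a tuned algorithm.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_bandit_policy(n_steps=2000, learning_rate=0.1, seed=0):
    """Gradient ascent on a one-parameter stochastic policy for a toy two-action task."""
    rng = random.Random(seed)
    theta = 0.0  # policy parameter; P(action = 1) = sigmoid(theta)
    for _ in range(n_steps):
        p_right = sigmoid(theta)
        # Sample an action from the current stochastic policy.
        action = 1 if rng.random() < p_right else 0
        # Toy reward (assumption): only action 1 is rewarded.
        reward = 1.0 if action == 1 else 0.0
        # For this Bernoulli policy, d/dtheta log pi(action) = action - p_right.
        grad_log_prob = action - p_right
        # Step along reward * gradient of the log-probability of the sampled action.
        theta += learning_rate * reward * grad_log_prob
    return theta


theta = train_bandit_policy()
print("P(action = 1) after training:", round(sigmoid(theta), 3))
```

Because only action 1 is ever rewarded in this toy setup, every nonzero update pushes `theta` upward, so the policy gradually shifts probability toward the rewarded action.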


## 2. Setting Up and Using OpenAI Gym

[OpenAI Gym](https://www.gymlibrary.dev/) is a toolkit that provides a wide variety of simulated environments for RL and is a standard platform for training and evaluating RL agents. The code in this notebook imports Gymnasium, the maintained successor to Gym, under the alias `gym`.

**Installation:**
```bash
pip install -U gymnasium
```

Some environments may require extra dependencies for rendering (e.g., `pyglet`, `pygame`). For CartPole, installing `gymnasium[classic-control]` pulls in `pygame`.


In [2]:
# Import Gymnasium under the alias gym and check the version
import gymnasium as gym
print('Gym version:', gym.__version__)


Gym version: 1.1.1


## 3. Exploring the CartPole Environment

The CartPole environment is a classic RL "hello world." The goal is to keep a pole balanced upright on a moving cart for as long as possible.

Let's walk through the basic steps to interact with this environment.

In [3]:
# Create the CartPole environment; render_mode="human" opens a window that updates on reset() and step()
env = gym.make('CartPole-v1', render_mode="human")

# Reset the environment to start a new episode
obs, info = env.reset()
print('Initial observation:', obs)

# Inspect the observation space
print('Observation space:', env.observation_space)

Initial observation: [-0.02014128  0.01499837  0.00417846  0.02987751]
Observation space: Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
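
The four entries of a CartPole observation are, in order, the cart position, the cart velocity, the pole angle (in radians), and the pole angular velocity. The sketch below attaches those names to the initial observation printed above and converts the angle bound of the observation space to degrees; `OBS_FIELDS` and `describe_observation` are hypothetical helpers, not part of the library.

```python
import math

OBS_FIELDS = ("cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity")


def describe_observation(obs):
    """Pair each entry of a CartPole observation with its name."""
    return dict(zip(OBS_FIELDS, (float(v) for v in obs)))


# Values copied from the initial observation printed above.
initial_obs = [-0.02014128, 0.01499837, 0.00417846, 0.02987751]
labeled = describe_observation(initial_obs)
for name, value in labeled.items():
    print(f"{name:>22}: {value: .5f}")

# The angle bound in the observation space is given in radians; convert it to degrees.
print("Angle bound in degrees:", round(math.degrees(0.41887903), 1))
```

Note that the bounds of the observation space (±4.8 for position, about ±24° for the angle) are wider than the limits at which an episode actually ends.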


In [4]:
# No explicit env.render() call is needed here: with render_mode="human" the window
# is redrawn automatically on reset() and step() (this may not work on headless servers)

# Check the action space
print('Action space:', env.action_space)

Action space: Discrete(2)


In [5]:
# Take a step in the environment
action = 1  # Push the cart to the right (0 = push left, 1 = push right)
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
print('New observation:', obs)
print('Reward:', reward)
print('Done:', done)
print('Info:', info)

New observation: [-0.01984131  0.21006015  0.00477601 -0.26148415]
Reward: 1.0
Done: False
Info: {}
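
`step()` returns two separate end-of-episode flags: `terminated` means the task itself ended (for CartPole, the pole fell too far or the cart left the track), while `truncated` means a time limit cut the episode short. The sketch below restates the documented CartPole-v1 limits (pole angle beyond ±12°, cart position beyond ±2.4, 500 steps) as plain Python. `would_terminate` and `would_truncate` are hypothetical helpers written for illustration and are not the library's own code.

```python
import math

X_THRESHOLD = 2.4                    # cart position limit
ANGLE_THRESHOLD = math.radians(12)   # pole angle limit, about 0.2095 rad
MAX_EPISODE_STEPS = 500              # CartPole-v1 time limit


def would_terminate(obs):
    """True if the observation lies outside the limits that end an episode."""
    x, _, angle, _ = obs
    return abs(x) > X_THRESHOLD or abs(angle) > ANGLE_THRESHOLD


def would_truncate(step_count):
    """True once the episode has reached the time limit."""
    return step_count >= MAX_EPISODE_STEPS


# Values copied from the observation printed above, after one step.
new_obs = [-0.01984131, 0.21006015, 0.00477601, -0.26148415]
print("terminated:", would_terminate(new_obs))
print("truncated:", would_truncate(1))
```

Both checks are false for the step shown above, which agrees with the printed `Done: False`.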


In [6]:
env.render()  # Optional: with render_mode="human" the window already updates on reset() and step()

In [7]:
# Close the environment when done
env.close()

## 4. Implementing a Simple Hardcoded Policy

Let's implement a basic policy for CartPole: if the pole is leaning left (angle < 0), move left; otherwise, move right. We'll run this policy for multiple episodes and analyze the results.

In [None]:
import numpy as np


def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1

env = gym.make('CartPole-v1')
totals = []
for episode in range(500):
    episode_rewards = 0
    obs, info = env.reset()
    for step in range(200):  # cap each episode at 200 steps (CartPole-v1 itself truncates at 500)
        action = basic_policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

env.close()

print('Mean episode reward:', np.mean(totals))
print('Max episode reward:', np.max(totals))

Mean episode reward: 42.51
Max episode reward: 68.0
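
When several policies need to be compared, the evaluation loop above can be factored into a reusable function. The following is one possible sketch, not the notebook's own code: `evaluate_policy`, `summarize`, and the stand-in `CountdownEnv` are hypothetical names. `CountdownEnv` exists only so the example runs without Gymnasium installed; it mimics the `reset()` and `step()` return values used above, and its episodes always last a fixed number of steps.

```python
import statistics


def evaluate_policy(policy, env, n_episodes, max_steps=500):
    """Run `policy` in `env` and return the total reward of each episode."""
    totals = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        total = 0.0
        for _ in range(max_steps):
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            total += reward
            if terminated or truncated:
                break
        totals.append(total)
    return totals


def summarize(totals):
    """Basic statistics over a list of episode rewards."""
    return {
        "mean": statistics.mean(totals),
        "std": statistics.pstdev(totals),
        "min": min(totals),
        "max": max(totals),
    }


class CountdownEnv:
    """Hypothetical stand-in environment: every episode lasts exactly `length` steps."""

    def __init__(self, length):
        self.length = length
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        self.steps += 1
        terminated = self.steps >= self.length
        # Reward of 1.0 per step, mirroring CartPole's per-step reward.
        return [0.0, 0.0, 0.0, 0.0], 1.0, terminated, False, {}


totals = evaluate_policy(lambda obs: 1, CountdownEnv(length=5), n_episodes=3)
print(summarize(totals))
```

In the notebook, a call such as `evaluate_policy(basic_policy, gym.make('CartPole-v1'), n_episodes=500, max_steps=200)` would play the role of the loop in the cell above, and `summarize` adds the spread and the worst episode to the mean and maximum already reported.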

