# 🎮 Reinforcement Learning

**Author**: Data Science Master System  
**Difficulty**: ⭐⭐⭐⭐ Advanced  
**Time**: 90 minutes  
**Prerequisites**: PyTorch, Probability

## Learning Objectives
- RL fundamentals (MDP, rewards)
- Q-Learning and DQN
- Policy Gradient methods
- Stable-Baselines3

In [None]:
import numpy as np
import torch
import torch.nn as nn

# Seed both NumPy and PyTorch so results are reproducible
np.random.seed(42)
torch.manual_seed(42)

## 1. Q-Learning (Tabular)

In [None]:
# Simple grid world
n_states = 16  # 4x4 grid
n_actions = 4  # up, down, left, right

Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.99  # Discount factor
epsilon = 0.1  # Exploration rate

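def q_learning_update(state, action, reward, next_state, done=False):
    """Q-learning update rule: move Q(s, a) toward the TD target."""
    # A terminal transition has no future value to bootstrap from
    best_next = 0.0 if done else np.max(Q[next_state])
    td_target = reward + gamma * best_next
    Q[state, action] += alpha * (td_target - Q[state, action])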

print("✅ Q-Learning setup complete")
print(f"Q-table shape: {Q.shape}")
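
The update rule above only learns once it is driven by an environment loop. The following is a minimal sketch of epsilon-greedy training on the 4x4 grid. The environment itself is an assumption that is not defined in this notebook: the `step` function, the start state 0, the goal state 15, and the reward scheme (+1 at the goal, -0.01 per move) are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical 4x4 grid world: start at state 0, goal at state 15
N_STATES, N_ACTIONS, SIDE, GOAL = 16, 4, 4, 15
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move within the grid; walking into a wall leaves the agent in place."""
    row, col = divmod(state, SIDE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIDE - 1)
    col = min(max(col + d_col, 0), SIDE - 1)
    next_state = row * SIDE + col
    done = next_state == GOAL
    reward = 1.0 if done else -0.01  # assumed reward scheme
    return next_state, reward, done


def train(episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=200, seed=42):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    q_table = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Explore with probability epsilon, otherwise act greedily
            if rng.random() < epsilon:
                action = int(rng.integers(N_ACTIONS))
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = step(state, action)
            # Terminal transitions do not bootstrap from the next state
            best_next = 0.0 if done else np.max(q_table[next_state])
            td_target = reward + gamma * best_next
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            state = next_state
            if done:
                break
    return q_table


def greedy_rollout(q_table, max_steps=50):
    """Follow the greedy policy from the start state and record visited states."""
    state, path = 0, [0]
    for _ in range(max_steps):
        state, _, done = step(state, int(np.argmax(q_table[state])))
        path.append(state)
        if done:
            break
    return path


q_table = train()
path = greedy_rollout(q_table)
print(f"Greedy path: {path}")
```

The small per-step penalty is a design choice: because the table starts at zero, untried actions look better than tried ones, which nudges the agent to explore systematically before any goal reward has been seen.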

## 2. Deep Q-Network (DQN)

In [None]:
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
    
    def forward(self, x):
        return self.net(x)
    
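    def select_action(self, state, epsilon):
        """Epsilon-greedy action selection for a single state."""
        # Explore: pick a uniformly random action
        if np.random.random() < epsilon:
            return np.random.randint(self.net[-1].out_features)
        # Exploit: pick the action with the highest predicted Q-value
        with torch.no_grad():
            q_values = self(torch.as_tensor(state, dtype=torch.float32))
            return q_values.argmax().item()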

dqn = DQN(4, 2)  # 4 state dims, 2 actions
print(f"DQN parameters: {sum(p.numel() for p in dqn.parameters()):,}")
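
The network above only defines the Q-function; training it also needs a replay buffer, a target network, and a TD loss. The following is a minimal sketch of one gradient step, not a full training loop. The names `make_q_net`, `buffer`, `target_net`, and `dqn_update` are hypothetical, the transitions are random placeholders rather than real environment data, and the Huber loss and Adam optimizer are common choices assumed here.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

random.seed(42)
torch.manual_seed(42)


def make_q_net(state_dim, action_dim):
    """Same layer sizes as the DQN class in this notebook."""
    return nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, action_dim),
    )


q_net = make_q_net(4, 2)
target_net = make_q_net(4, 2)
target_net.load_state_dict(q_net.state_dict())  # start as an exact copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)  # replay buffer of (s, a, r, s', done)

# Placeholder transitions; in practice these come from env.step()
for _ in range(256):
    state = torch.randn(4).tolist()
    next_state = torch.randn(4).tolist()
    action = random.randrange(2)
    done = random.random() < 0.05
    buffer.append((state, action, 1.0, next_state, done))


def dqn_update(batch_size=32, gamma=0.99):
    """One gradient step on the TD error of a sampled minibatch."""
    batch = random.sample(list(buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target from the frozen target network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


print(f"TD loss after one update: {dqn_update():.4f}")
```

In a full loop, `target_net` would be re-synced from `q_net` every few hundred updates with `load_state_dict`, which keeps the regression target from shifting on every step.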

## 3. Stable-Baselines3

In [None]:
sb3_example = '''
from stable_baselines3 import PPO
import gymnasium as gym  # Stable-Baselines3 2.x uses Gymnasium instead of gym

# Create environment
env = gym.make('CartPole-v1')

# Train agent
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# Evaluate (Gymnasium API: reset returns (obs, info), step returns 5 values)
obs, info = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

# Save
model.save('ppo_cartpole')
'''
print("📋 Stable-Baselines3:")
print(sb3_example)

## 4. Algorithm Comparison

In [None]:
import pandas as pd

algorithms = pd.DataFrame({
    'Algorithm': ['Q-Learning', 'DQN', 'A2C', 'PPO', 'SAC'],
    'Type': ['Value', 'Value', 'Actor-Critic', 'Policy Gradient', 'Actor-Critic'],
    'Actions': ['Discrete', 'Discrete', 'Both', 'Both', 'Continuous'],
    'Best For': ['Small state spaces', 'Atari', 'Simple tasks', 'General', 'Robotics']
})

display(algorithms)

## 🎯 Key Takeaways
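1. Start with PPO: it is robust and handles both discrete and continuous actions
2. Use DQN for discrete action spaces and SAC for continuous ones
3. Reward shaping is crucial: the agent optimizes exactly the reward it is given
4. For robotics, train in simulation and then transfer to hardware (sim-to-real)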
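
One common way to shape rewards without changing which policy is optimal is potential-based shaping, which adds `gamma * phi(next_state) - phi(state)` to the environment reward. The following is a minimal sketch for a 4x4 grid; the potential function (negative Manhattan distance to a goal at state 15) and both function names are hypothetical choices for illustration.

```python
def manhattan_potential(state, goal=15, side=4):
    """Hypothetical potential: negative Manhattan distance to the goal."""
    row, col = divmod(state, side)
    goal_row, goal_col = divmod(goal, side)
    return -(abs(goal_row - row) + abs(goal_col - col))


def shaped_reward(reward, state, next_state, gamma=0.99):
    """Add the potential-based shaping term to the environment reward."""
    shaping = gamma * manhattan_potential(next_state) - manhattan_potential(state)
    return reward + shaping


# Moving toward the goal earns a bonus, moving away a penalty
print(shaped_reward(0.0, state=0, next_state=1))
print(shaped_reward(0.0, state=1, next_state=0))
```

With this potential the goal state has potential zero, so no special handling of the terminal state is needed.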

**Next**: 26_model_deployment_api.ipynb