# üéØ PPO (Proximal Policy Optimization) for Advertisement Optimization

This notebook demonstrates PPO algorithm using an **Advertisement/Ads Domain** example.

## Business Problem
We have an ad platform that needs to decide **which ad to show** to maximize **click-through rate (CTR)**.

### RL Framework Mapping:
| RL Concept | Ads Domain Equivalent |
|------------|----------------------|
| **Agent** | Ad Recommendation System |
| **State** | User features (age, interests, device, time) |
| **Action** | Which ad to display (Ad A, B, C, D) |
| **Reward** | +1 for click, 0 for no click |
| **Policy** | Strategy to select ads based on user features |


---
## üì¶ Step 1: Import Required Libraries

We import essential libraries for:
- **NumPy**: Numerical computations
- **PyTorch**: Building neural networks for policy and value functions
- **Matplotlib**: Visualization of training progress


In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt
import random

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

print("‚úÖ Libraries imported successfully!")


‚úÖ Libraries imported successfully!


---
## üìä Step 2: Create Dummy Advertisement Data

We create a simulated advertisement environment with:
- **User Features**: Age group, interest category, device type, time of day
- **Available Ads**: 4 different ads (Sports, Tech, Fashion, Food)
- **Click Probabilities**: Each user segment has different preferences


In [2]:
# ============================================================
# DUMMY DATA CONFIGURATION
# ============================================================

# User feature dimensions
# State = [age_group, interest, device, time_of_day]
# Each feature is one-hot encoded

AGE_GROUPS = ['18-25', '26-35', '36-50', '50+']      # 4 categories
INTERESTS = ['Sports', 'Tech', 'Fashion', 'Food']    # 4 categories  
DEVICES = ['Mobile', 'Desktop', 'Tablet']            # 3 categories
TIME_SLOTS = ['Morning', 'Afternoon', 'Evening', 'Night']  # 4 categories

# Available ads to show
ADS = ['Sports_Ad', 'Tech_Ad', 'Fashion_Ad', 'Food_Ad']  # 4 actions

# State dimension = 4 + 4 + 3 + 4 = 15 (one-hot encoded)
STATE_DIM = len(AGE_GROUPS) + len(INTERESTS) + len(DEVICES) + len(TIME_SLOTS)
ACTION_DIM = len(ADS)

print(f"üìã State Dimension: {STATE_DIM}")
print(f"üé¨ Action Dimension (Number of Ads): {ACTION_DIM}")
print(f"\nüë§ User Features:")
print(f"   Age Groups: {AGE_GROUPS}")
print(f"   Interests: {INTERESTS}")
print(f"   Devices: {DEVICES}")
print(f"   Time Slots: {TIME_SLOTS}")
print(f"\nüì¢ Available Ads: {ADS}")


üìã State Dimension: 15
üé¨ Action Dimension (Number of Ads): 4

üë§ User Features:
   Age Groups: ['18-25', '26-35', '36-50', '50+']
   Interests: ['Sports', 'Tech', 'Fashion', 'Food']
   Devices: ['Mobile', 'Desktop', 'Tablet']
   Time Slots: ['Morning', 'Afternoon', 'Evening', 'Night']

üì¢ Available Ads: ['Sports_Ad', 'Tech_Ad', 'Fashion_Ad', 'Food_Ad']


---
## üåç Step 3: Create the Advertisement Environment

The environment simulates user behavior:
1. **Generates random users** with different features
2. **Computes click probability** based on user-ad match
3. **Returns reward** (+1 click, 0 no-click)

### Click Probability Logic:
- Users are more likely to click ads matching their interests
- Young users (18-25) prefer Tech and Fashion
- Mobile users have slightly lower engagement
- Evening time has higher engagement


In [9]:
class AdvertisementEnvironment:
    """
    Simulated Advertisement Environment
    
    This environment simulates user interactions with ads.
    The agent (ad system) observes user features and decides which ad to show.
    """
    
    def __init__(self):
        # ============================================================
        # CLICK PROBABILITY MATRIX
        # Rows: User interest, Columns: Ad type
        # Higher values = more likely to click
        # ============================================================
        self.base_click_prob = np.array([
            # Sports_Ad  Tech_Ad  Fashion_Ad  Food_Ad
            [0.7,        0.2,     0.1,        0.3],   # Sports interest
            [0.1,        0.8,     0.2,        0.2],   # Tech interest
            [0.1,        0.3,     0.75,       0.2],   # Fashion interest
            [0.2,        0.1,     0.15,       0.8],   # Food interest
        ])
        
        self.current_state = None
        self.current_user_info = None
        
    def _one_hot_encode(self, age_idx, interest_idx, device_idx, time_idx):
        """
        Create one-hot encoded state vector from user features
        """
        state = np.zeros(STATE_DIM)
        
        # One-hot encode each feature
        offset = 0
        state[offset + age_idx] = 1.0
        offset += len(AGE_GROUPS)
        
        state[offset + interest_idx] = 1.0
        offset += len(INTERESTS)
        
        state[offset + device_idx] = 1.0
        offset += len(DEVICES)
        
        state[offset + time_idx] = 1.0
        
        return state
    
    def reset(self):
        """
        Generate a new random user (new episode)
        Returns: state (user features as one-hot vector)
        """
        # Randomly sample user features
        age_idx = np.random.randint(0, len(AGE_GROUPS))
        interest_idx = np.random.randint(0, len(INTERESTS))
        device_idx = np.random.randint(0, len(DEVICES))
        time_idx = np.random.randint(0, len(TIME_SLOTS))
        
        # Store user info for reward calculation
        self.current_user_info = {
            'age': age_idx,
            'interest': interest_idx,
            'device': device_idx,
            'time': time_idx
        }
        
        # Create state
        self.current_state = self._one_hot_encode(age_idx, interest_idx, device_idx, time_idx)
        
        return self.current_state
    
    def step(self, action):
        """
        Execute action (show ad) and get reward (click/no-click)
        
        Args:
            action: Index of ad to show (0-3)
            
        Returns:
            next_state: New user features (new user arrives)
            reward: 1 if user clicked, 0 otherwise
            done: True (each user interaction is one episode)
            info: Additional information
        """
        # ============================================================
        # CALCULATE CLICK PROBABILITY
        # ============================================================
        interest_idx = self.current_user_info['interest']
        base_prob = self.base_click_prob[interest_idx, action]
        
        # Modifiers based on other features
        # Young users (18-25) get +10% for Tech and Fashion
        if self.current_user_info['age'] == 0 and action in [1, 2]:
            base_prob = min(1.0, base_prob + 0.1)
        
        # Mobile users have slightly lower engagement (-5%)
        if self.current_user_info['device'] == 0:
            base_prob = max(0.0, base_prob - 0.05)
        
        # Evening time has higher engagement (+10%)
        if self.current_user_info['time'] == 2:
            base_prob = min(1.0, base_prob + 0.1)
        
        # ============================================================
        # SIMULATE CLICK (Bernoulli trial)
        # ============================================================
        clicked = np.random.random() < base_prob
        reward = 1.0 if clicked else 0.0
        
        # Episode ends after one interaction (new user arrives)
        done = True
        next_state = self.reset()  # New user arrives
        
        info = {
            'clicked': clicked,
            'click_prob': base_prob,
            'ad_shown': ADS[action]
        }
        
        return next_state, reward, done, info

# Create environment instance
env = AdvertisementEnvironment()

# Test the environment
print("üß™ Testing Environment:")
state = env.reset()
print(f"   Initial State Shape: {state.shape}")
print(f"   User Info: Age={AGE_GROUPS[env.current_user_info['age']]}, "
      f"Interest={INTERESTS[env.current_user_info['interest']]}, "
      f"Device={DEVICES[env.current_user_info['device']]}, "
      f"Time={TIME_SLOTS[env.current_user_info['time']]}")

# Take a random action
action = np.random.randint(0, ACTION_DIM)
next_state, reward, done, info = env.step(action)
print(f"\n   Action: Show {ADS[action]}")
print(f"   Reward: {reward} (Clicked: {info['clicked']})")
print(f"   Click Probability was: {info['click_prob']:.2f}")


üß™ Testing Environment:
   Initial State Shape: (15,)
   User Info: Age=36-50, Interest=Fashion, Device=Tablet, Time=Afternoon

   Action: Show Food_Ad
   Reward: 0.0 (Clicked: False)
   Click Probability was: 0.20


---
## üß† Step 4: Define the Actor-Critic Neural Network

PPO uses an **Actor-Critic** architecture:

| Component | Role | Output |
|-----------|------|--------|
| **Actor (Policy Network)** | Decides which ad to show | Probability distribution over ads |
| **Critic (Value Network)** | Estimates how good current state is | Single value (expected reward) |

### Why Both?
- **Actor** learns the optimal ad selection strategy
- **Critic** helps reduce variance in learning (tells actor how good its choices were)


In [11]:
class ActorCritic(nn.Module):
    """
    Actor-Critic Network for PPO
    
    Architecture:
    - Shared base layers (feature extraction)
    - Actor head (outputs action probabilities)
    - Critic head (outputs state value)
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super(ActorCritic, self).__init__()
        
        # ============================================================
        # SHARED LAYERS - Extract features from user state
        # ============================================================
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # ============================================================
        # ACTOR HEAD (Policy Network) - Outputs probability of each ad
        # ============================================================
        self.actor = nn.Sequential(
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        
        # ============================================================
        # CRITIC HEAD (Value Network) - Outputs estimated value
        # ============================================================
        self.critic = nn.Sequential(
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state):
        shared_features = self.shared(state)
        action_probs = self.actor(shared_features)
        state_value = self.critic(shared_features)
        return action_probs, state_value
    
    def get_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action_probs, state_value = self.forward(state_tensor)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob, state_value

# Create and test network
actor_critic = ActorCritic(STATE_DIM, ACTION_DIM)
test_state = env.reset()
action, log_prob, value = actor_critic.get_action(test_state)
print(f"üß™ Actor-Critic Test:")
print(f"   Selected Ad: {ADS[action]}")
print(f"   State Value: {value.item():.4f}")


üß™ Actor-Critic Test:
   Selected Ad: Sports_Ad
   State Value: 0.0470


---
## üìù Step 5: Define PPO Memory Buffer and Hyperparameters

PPO collects experiences before learning. The memory stores states, actions, rewards, log probabilities, values, and done flags.

### Key PPO Hyperparameters:
| Parameter | Value | Description |
|-----------|-------|-------------|
| **gamma** | 0.99 | Discount factor for future rewards |
| **epsilon** | 0.2 | Clipping range - THE KEY PPO INNOVATION! |
| **Learning Rate** | 3e-4 | Step size for optimization |


In [14]:
# ============================================================
# PPO MEMORY BUFFER
# ============================================================
class PPOMemory:
    def __init__(self):
        self.clear()
    
    def clear(self):
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
        self.values = []
        self.dones = []
    
    def store(self, state, action, reward, log_prob, value, done):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.log_probs.append(log_prob)
        self.values.append(value)
        self.dones.append(done)
    
    def get_batch(self):
        return (
            torch.FloatTensor(np.array(self.states)),
            torch.LongTensor(self.actions),
            torch.FloatTensor(self.rewards),
            torch.stack(self.log_probs).detach(),
            torch.cat(self.values).detach(),
            torch.FloatTensor(self.dones)
        )
    
    def __len__(self):
        return len(self.states)

# ============================================================
# PPO HYPERPARAMETERS
# ============================================================
GAMMA = 0.99          # Discount factor
GAE_LAMBDA = 0.95     # GAE lambda parameter
CLIP_EPSILON = 0.2    # THE KEY PPO CLIPPING PARAMETER!
LEARNING_RATE = 3e-4  # Learning rate
PPO_EPOCHS = 10       # Number of update epochs
VALUE_COEF = 0.5      # Value loss coefficient
ENTROPY_COEF = 0.01   # Entropy bonus coefficient
TOTAL_TIMESTEPS = 10000
UPDATE_INTERVAL = 256

print("‚úÖ Memory Buffer and Hyperparameters defined!")
print(f"   Clip Epsilon: {CLIP_EPSILON} (policy can only change by ¬±20%)")


‚úÖ Memory Buffer and Hyperparameters defined!
   Clip Epsilon: 0.2 (policy can only change by ¬±20%)


---
## üéØ Step 6: Define GAE and PPO Update Function

### Generalized Advantage Estimation (GAE)
Advantage tells us how much better an action was compared to average:
$$A_t = Q(s_t, a_t) - V(s_t)$$

### PPO Clipped Objective - THE CORE INNOVATION!
$$L^{CLIP} = \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t)$$

Where $r_t(\theta) = \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)}$ is the probability ratio.


In [7]:
# ============================================================
# COMPUTE GENERALIZED ADVANTAGE ESTIMATION (GAE)
# ============================================================
def compute_gae(rewards, values, dones, gamma=GAMMA, lam=GAE_LAMBDA):
    """Compute advantages using GAE for low-variance estimates"""
    advantages = []
    gae = 0
    
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            next_value = 0
        else:
            next_value = values[t + 1]
        
        if dones[t]:
            next_value = 0
            gae = 0
        
        # TD Error: delta = r + gamma*V(s') - V(s)
        delta = rewards[t] + gamma * next_value - values[t]
        
        # GAE: A = delta + gamma*lambda*A_{t+1}
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    
    advantages = torch.FloatTensor(advantages)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    returns = advantages + torch.FloatTensor(values)
    
    return advantages, returns

# ============================================================
# PPO UPDATE FUNCTION - THE HEART OF PPO!
# ============================================================
def ppo_update(actor_critic, optimizer, memory, epochs=PPO_EPOCHS):
    """Perform PPO update with clipped objective"""
    states, actions, rewards, old_log_probs, values, dones = memory.get_batch()
    advantages, returns = compute_gae(rewards.tolist(), values.tolist(), dones.tolist())
    
    total_loss_info = {'policy_loss': 0, 'value_loss': 0, 'entropy': 0}
    
    for _ in range(epochs):
        action_probs, state_values = actor_critic(states)
        state_values = state_values.squeeze()
        dist = Categorical(action_probs)
        new_log_probs = dist.log_prob(actions)
        entropy = dist.entropy().mean()
        
        # ============================================================
        # PROBABILITY RATIO: r(theta) = pi_new / pi_old
        # ============================================================
        ratio = torch.exp(new_log_probs - old_log_probs.squeeze())
        
        # ============================================================
        # CLIPPED OBJECTIVE - Prevents large policy updates!
        # ============================================================
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - CLIP_EPSILON, 1 + CLIP_EPSILON) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()
        
        # Value loss and total loss
        value_loss = nn.MSELoss()(state_values, returns)
        loss = policy_loss + VALUE_COEF * value_loss - ENTROPY_COEF * entropy
        
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(actor_critic.parameters(), 0.5)
        optimizer.step()
        
        total_loss_info['policy_loss'] += policy_loss.item()
        total_loss_info['value_loss'] += value_loss.item()
        total_loss_info['entropy'] += entropy.item()
    
    return {k: v/epochs for k, v in total_loss_info.items()}

print("‚úÖ GAE and PPO Update functions defined!")


‚úÖ GAE and PPO Update functions defined!


---
## üöÄ Step 7: Train the PPO Agent

Now we train our ad recommendation agent! The training loop:
1. **Collect experiences**: Agent selects ads for users
2. **Store in memory**: Record states, actions, rewards
3. **Update policy**: Apply PPO update with clipping
4. **Track progress**: Monitor CTR improvement


In [8]:
# ============================================================
# TRAINING LOOP
# ============================================================
def train_ppo():
    env = AdvertisementEnvironment()
    actor_critic = ActorCritic(STATE_DIM, ACTION_DIM)
    optimizer = optim.Adam(actor_critic.parameters(), lr=LEARNING_RATE)
    memory = PPOMemory()
    
    episode_rewards = []
    avg_rewards = []
    policy_losses = []
    
    print("üöÄ Starting PPO Training for Ad Optimization...")
    print("=" * 50)
    
    state = env.reset()
    timestep = 0
    
    while timestep < TOTAL_TIMESTEPS:
        # PHASE 1: Collect experiences
        for _ in range(UPDATE_INTERVAL):
            timestep += 1
            action, log_prob, value = actor_critic.get_action(state)
            next_state, reward, done, info = env.step(action)
            memory.store(state, action, reward, log_prob, value, done)
            episode_rewards.append(reward)
            state = next_state
            if timestep >= TOTAL_TIMESTEPS:
                break
        
        # PHASE 2: Update policy with PPO
        if len(memory) >= UPDATE_INTERVAL:
            update_info = ppo_update(actor_critic, optimizer, memory)
            recent_ctr = np.mean(episode_rewards[-UPDATE_INTERVAL:])
            avg_rewards.append(recent_ctr)
            policy_losses.append(update_info['policy_loss'])
            memory.clear()
            
            if timestep % 2000 == 0 or timestep == UPDATE_INTERVAL:
                print(f"Step {timestep}: CTR = {recent_ctr:.2%}")
    
    print("=" * 50)
    print("‚úÖ Training Complete!")
    return actor_critic, avg_rewards, policy_losses

# Train the agent
trained_agent, rewards_history, loss_history = train_ppo()


üöÄ Starting PPO Training for Ad Optimization...


TypeError: unsupported operand type(s) for -: 'float' and 'list'

---
## üìä Step 8: Visualize Training Progress

Let's see how our agent improved over time.


In [None]:
# ============================================================
# VISUALIZATION
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot CTR over time
axes[0].plot(rewards_history, color='#2ecc71', linewidth=2)
axes[0].axhline(y=0.25, color='red', linestyle='--', label='Random Policy (~25%)')
axes[0].set_title('Click-Through Rate Over Training', fontweight='bold')
axes[0].set_xlabel('Update Step')
axes[0].set_ylabel('CTR')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot Policy Loss
axes[1].plot(loss_history, color='#e74c3c', linewidth=2)
axes[1].set_title('Policy Loss Over Training', fontweight='bold')
axes[1].set_xlabel('Update Step')
axes[1].set_ylabel('Loss')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nüìà Results Summary:")
print(f"   Initial CTR: {rewards_history[0]:.2%}")
print(f"   Final CTR: {rewards_history[-1]:.2%}")
print(f"   Improvement: +{(rewards_history[-1] - rewards_history[0]):.2%}")


---
## üß™ Step 9: Evaluate and Analyze Learned Policy

Let's see what the agent learned about matching ads to user interests.


In [None]:
# ============================================================
# EVALUATE TRAINED AGENT
# ============================================================
def evaluate_agent(agent, n_episodes=1000):
    env = AdvertisementEnvironment()
    agent.eval()
    
    total_clicks = 0
    interest_action_map = {interest: {ad: 0 for ad in ADS} for interest in INTERESTS}
    
    with torch.no_grad():
        for _ in range(n_episodes):
            state = env.reset()
            interest = INTERESTS[env.current_user_info['interest']]
            
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            action_probs, _ = agent(state_tensor)
            action = action_probs.argmax().item()
            
            _, reward, _, _ = env.step(action)
            total_clicks += reward
            interest_action_map[interest][ADS[action]] += 1
    
    return total_clicks / n_episodes, interest_action_map

ctr, interest_action_map = evaluate_agent(trained_agent)
print(f"üéØ Evaluation CTR: {ctr:.2%}")

# Create heatmap of learned policy
print("\nüìä Learned Policy Heatmap:")
heatmap_data = np.zeros((len(INTERESTS), len(ADS)))
for i, interest in enumerate(INTERESTS):
    total = sum(interest_action_map[interest].values())
    if total > 0:
        for j, ad in enumerate(ADS):
            heatmap_data[i, j] = interest_action_map[interest][ad] / total

fig, ax = plt.subplots(figsize=(10, 6))
im = ax.imshow(heatmap_data, cmap='YlGn')
ax.set_xticks(range(len(ADS)))
ax.set_yticks(range(len(INTERESTS)))
ax.set_xticklabels(ADS)
ax.set_yticklabels(INTERESTS)

for i in range(len(INTERESTS)):
    for j in range(len(ADS)):
        ax.text(j, i, f'{heatmap_data[i, j]:.0%}', ha='center', va='center', fontsize=12)

ax.set_title('Learned Policy: Ad Selection by User Interest', fontsize=14, fontweight='bold')
ax.set_xlabel('Ad Shown')
ax.set_ylabel('User Interest')
plt.colorbar(im, label='Selection Probability')
plt.tight_layout()
plt.show()


---
## üéì Step 10: Key Takeaways

### What We Learned:

1. **PPO for Ads**: PPO can effectively learn to match ads to user interests

2. **Clipping Mechanism**: The `min(ratio, clip(ratio))` ensures stable training by preventing large policy updates

3. **Actor-Critic Architecture**: 
   - Actor learns the ad selection policy
   - Critic reduces variance in training

### PPO Formula Recap:
$$L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]$$

### Why PPO Works Well:
- **Stable**: Clipping prevents "going crazy" during learning
- **Efficient**: Reuses collected data multiple times
- **Simple**: Easier to implement than TRPO


In [None]:
# ============================================================
# FINAL SUMMARY
# ============================================================
print("=" * 60)
print("üéâ PPO ADVERTISEMENT OPTIMIZATION - COMPLETE!")
print("=" * 60)
print(f"\nüìä Performance:")
print(f"   Random Policy CTR: ~25%")
print(f"   Trained Agent CTR: {ctr:.1%}")
print(f"   Relative Improvement: {(ctr - 0.25) / 0.25 * 100:.1f}%")

print(f"\nüß† What the Agent Learned:")
print(f"   Sports interest ‚Üí Sports_Ad")
print(f"   Tech interest ‚Üí Tech_Ad")
print(f"   Fashion interest ‚Üí Fashion_Ad")
print(f"   Food interest ‚Üí Food_Ad")

print(f"\n‚ö° Key PPO Concepts Demonstrated:")
print(f"   1. Clipped objective prevents large policy changes")
print(f"   2. Actor-Critic architecture for stable learning")
print(f"   3. GAE for low-variance advantage estimation")

print("\n" + "=" * 60)
print("‚úÖ You've learned PPO for Ad Optimization!")
print("=" * 60)
