# 🎯 TF-Agents RewardPredictionBasePolicy for Advertisement Optimization

This notebook demonstrates how to use **TF-Agents' RewardPredictionBasePolicy** for a **Multi-Armed Bandit** problem in the **Advertisement Domain**.

## Business Problem
An ad platform needs to decide **which ad to show** to each user to maximize **click-through rate (CTR)**.

### Why Bandits for Ads?

- **Exploration vs Exploitation**: We need to balance showing ads we know work well vs. trying new ads
- **Contextual Decisions**: User features (age, interests, device) affect which ad works best
- **Immediate Feedback**: We get reward (click/no-click) right after showing the ad

### RL Framework Mapping:
| Bandit Concept | Ads Domain Equivalent |
|----------------|----------------------|
| **Context** | User features (age, interests, device, time) |
| **Arms/Actions** | Different ads to display (Sports, Tech, Fashion, Food) |
| **Reward** | +1 for click, 0 for no click |
| **Policy** | Strategy to select ads based on user context |


---
## 🔑 What is RewardPredictionBasePolicy?

**RewardPredictionBasePolicy** is a base class in TF-Agents for policies that:

1. **Predict rewards** for each action given the current context, using a reward network
2. **Select an action** from those predictions; the simplest rule is greedy (pick the highest predicted reward)
3. **Can be extended** with exploration strategies such as epsilon-greedy

### Key Components:
- **Reward Network**: Neural network that predicts expected reward for each action
- **Action Selection**: Typically greedy (pick best) or epsilon-greedy (explore sometimes)
- **Training**: Update the reward network based on observed rewards
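
As a rough illustration of the "predict, then pick the best" idea, the sketch below applies greedy selection to a hand-written table of predicted rewards. The numbers and the `AD_NAMES` list are made up for illustration, and no TF-Agents API is involved.

```python
import numpy as np

# Hypothetical predicted click probabilities: one row per user, one column per ad.
AD_NAMES = ["Sports", "Tech", "Fashion", "Food"]
predicted_rewards = np.array([
    [0.52, 0.18, 0.15, 0.20],  # user 0
    [0.12, 0.14, 0.47, 0.16],  # user 1
])

# Greedy selection: for each user, take the ad with the highest predicted reward.
greedy_actions = predicted_rewards.argmax(axis=1)

for user, action in enumerate(greedy_actions):
    print(f"user {user}: show {AD_NAMES[action]}")
```

A purely greedy rule never tries ads whose current estimate is low, which is why an exploration strategy is usually layered on top.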


---
## 📦 Step 1: Install and Import Required Libraries

We need:
- **TensorFlow**: Deep learning framework
- **TF-Agents**: RL library with bandit policies
- **NumPy**: Numerical computations
- **Matplotlib**: Visualization


In [1]:
# Install dependencies (uncomment if needed)
# !pip install tf-agents tensorflow tensorflow-probability tf-keras numpy matplotlib

# IMPORTANT: Set legacy Keras mode for TF-Agents compatibility
import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'


In [2]:
import numpy as np
import tensorflow as tf
from typing import Optional, Tuple



# TF-Agents imports
from tf_agents.specs import tensor_spec, array_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.trajectories import policy_step
from tf_agents.bandits.policies import reward_prediction_base_policy
from tf_agents.networks import network
from tf_agents.utils import common
import tensorflow_probability as tfp

# Visualization
import matplotlib.pyplot as plt

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print(f"✅ TensorFlow version: {tf.__version__}")
print("✅ All libraries imported successfully!")


✅ TensorFlow version: 2.17.1
✅ All libraries imported successfully!


---
## 📊 Step 2: Define the Advertisement Domain Configuration

We create a simulated advertisement environment with:
- **User Features (Context)**: Age group, interest category, device type, time of day
- **Available Ads (Arms)**: 4 different ads (Sports, Tech, Fashion, Food)
- **Click Probabilities**: Each user segment has different preferences for ads
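
The context described above can be sketched as a concatenation of one one-hot block per feature. The `encode_user` helper below is hypothetical (it is not part of this notebook or of TF-Agents); the category lists are the ones used in this notebook's configuration.

```python
import numpy as np

AGE_GROUPS = ['18-25', '26-35', '36-50', '50+']
INTERESTS = ['Sports', 'Tech', 'Fashion', 'Food']
DEVICES = ['Mobile', 'Desktop', 'Tablet']
TIME_SLOTS = ['Morning', 'Afternoon', 'Evening', 'Night']
FEATURES = [AGE_GROUPS, INTERESTS, DEVICES, TIME_SLOTS]

def encode_user(age, interest, device, time_slot):
    """Hypothetical helper: build the context as one one-hot block per feature."""
    blocks = []
    for categories, value in zip(FEATURES, (age, interest, device, time_slot)):
        block = np.zeros(len(categories), dtype=np.float32)
        block[categories.index(value)] = 1.0  # mark the active category
        blocks.append(block)
    return np.concatenate(blocks)

context = encode_user('18-25', 'Sports', 'Mobile', 'Evening')
print(context.shape)                     # (15,)
print(np.flatnonzero(context).tolist())  # [0, 4, 8, 13]
```

Exactly four entries are set to 1, one per feature, so the vector length equals the sum of the category counts.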


In [3]:
# ============================================================
# ADVERTISEMENT DOMAIN CONFIGURATION
# ============================================================

# User feature categories
AGE_GROUPS = ['18-25', '26-35', '36-50', '50+']          # 4 categories
INTERESTS = ['Sports', 'Tech', 'Fashion', 'Food']        # 4 categories  
DEVICES = ['Mobile', 'Desktop', 'Tablet']                # 3 categories
TIME_SLOTS = ['Morning', 'Afternoon', 'Evening', 'Night'] # 4 categories

# Available ads (arms/actions)
ADS = ['Sports_Ad', 'Tech_Ad', 'Fashion_Ad', 'Food_Ad']  # 4 actions

# Context dimension = 4 + 4 + 3 + 4 = 15 (one-hot encoded)
CONTEXT_DIM = len(AGE_GROUPS) + len(INTERESTS) + len(DEVICES) + len(TIME_SLOTS)
NUM_ACTIONS = len(ADS)

print(f"📋 Context Dimension: {CONTEXT_DIM}")
print(f"🎬 Number of Actions (Ads): {NUM_ACTIONS}")
print(f"\n👤 User Features:")
print(f"   Age Groups: {AGE_GROUPS}")
print(f"   Interests: {INTERESTS}")
print(f"   Devices: {DEVICES}")
print(f"   Time Slots: {TIME_SLOTS}")
print(f"\n📢 Available Ads: {ADS}")


📋 Context Dimension: 15
🎬 Number of Actions (Ads): 4

👤 User Features:
   Age Groups: ['18-25', '26-35', '36-50', '50+']
   Interests: ['Sports', 'Tech', 'Fashion', 'Food']
   Devices: ['Mobile', 'Desktop', 'Tablet']
   Time Slots: ['Morning', 'Afternoon', 'Evening', 'Night']

📢 Available Ads: ['Sports_Ad', 'Tech_Ad', 'Fashion_Ad', 'Food_Ad']


---
## 🎲 Step 3: Create the Click Probability Matrix

The function below defines the **true click probabilities** for each combination of user segment and ad.
In real scenarios these probabilities are unknown, and the agent has to learn them through interaction.

**Key Insight**: Users are more likely to click an ad that matches their interest:
- Sports enthusiasts → Sports_Ad
- Tech enthusiasts → Tech_Ad
- and likewise for Fashion and Food


In [4]:
# ============================================================
# CLICK PROBABILITY FUNCTION (Ground Truth - Unknown to Agent)
# ============================================================

def get_click_probability(context: np.ndarray, action: int) -> float:
    """
    Calculate click probability based on user context and ad shown.
    
    Context format (one-hot encoded):
    - [0:4]  = Age group
    - [4:8]  = Interest
    - [8:11] = Device
    - [11:15] = Time slot
    
    Higher probability when ad matches user interest.
    """
    # Extract interest from context (slice 4:8, i.e. indices 4 through 7)
    interest_idx = np.argmax(context[4:8])
    
    # Base click probability
    base_prob = 0.1
    
    # Bonus if ad matches interest (interest_idx == action since both are aligned)
    if interest_idx == action:
        match_bonus = 0.4  # High bonus for matching
    else:
        match_bonus = 0.05  # Small bonus for non-matching
    
    # Age group effect (younger users click more on mobile)
    age_idx = np.argmax(context[0:4])
    device_idx = np.argmax(context[8:11])
    
    if age_idx < 2 and device_idx == 0:  # Young + Mobile
        age_device_bonus = 0.1
    else:
        age_device_bonus = 0.0
    
    # Time effect (evening has higher engagement)
    time_idx = np.argmax(context[11:15])
    if time_idx == 2:  # Evening
        time_bonus = 0.05
    else:
        time_bonus = 0.0
    
    # Combine all factors
    final_prob = min(base_prob + match_bonus + age_device_bonus + time_bonus, 0.9)
    
    return final_prob

print("✅ Click probability function defined!")
print("\n📈 Example probabilities:")

# Example: Young user interested in Sports on Mobile in Evening
example_context = np.zeros(CONTEXT_DIM, dtype=np.float32)
example_context[0] = 1.0   # Age: 18-25
example_context[4] = 1.0   # Interest: Sports
example_context[8] = 1.0   # Device: Mobile
example_context[13] = 1.0  # Time: Evening


for i, ad in enumerate(ADS):
    prob = get_click_probability(example_context, i)
    print(f"   Sports fan + {ad}: {prob:.2%}")


✅ Click probability function defined!

📈 Example probabilities:
   Sports fan + Sports_Ad: 65.00%
   Sports fan + Tech_Ad: 30.00%
   Sports fan + Fashion_Ad: 30.00%
   Sports fan + Food_Ad: 30.00%
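
Because the ground-truth rule is small, the expected CTR of a uniformly random policy and of the best possible policy can be computed exactly by enumerating every context. The sketch below restates the rule in terms of category indices (`click_prob` is a hypothetical stand-in for `get_click_probability`) and assumes each feature is drawn uniformly and independently.

```python
import itertools

N_AGE, N_INTEREST, N_DEVICE, N_TIME, N_ADS = 4, 4, 3, 4, 4

def click_prob(age, interest, device, time, ad):
    """Index-based restatement of the notebook's ground-truth rule."""
    p = 0.1                                          # base probability
    p += 0.4 if interest == ad else 0.05             # interest match bonus
    p += 0.1 if (age < 2 and device == 0) else 0.0   # young user on mobile
    p += 0.05 if time == 2 else 0.0                  # evening
    return min(p, 0.9)

# All 4 * 4 * 3 * 4 = 192 contexts, assumed equally likely.
contexts = list(itertools.product(
    range(N_AGE), range(N_INTEREST), range(N_DEVICE), range(N_TIME)))

# Random policy: average over all contexts and all ads.
random_ctr = sum(
    click_prob(*c, ad) for c in contexts for ad in range(N_ADS)
) / (len(contexts) * N_ADS)

# Optimal policy: best ad for each context, averaged over contexts.
optimal_ctr = sum(
    max(click_prob(*c, ad) for ad in range(N_ADS)) for c in contexts
) / len(contexts)

print(f"random:  {random_ctr:.4f}")   # 0.2667
print(f"optimal: {optimal_ctr:.4f}")  # 0.5292
```

These two numbers bracket what a learned policy can achieve in this simulation: roughly 0.27 with no knowledge of the user and roughly 0.53 with perfect knowledge.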


---
## 🏭 Step 4: Create the Advertisement Bandit Environment

We create a contextual bandit environment that:
1. **Generates user contexts** (random user features)
2. **Returns rewards** based on ad shown (click = 1, no click = 0)

### Key Difference from RL:
- **No state transitions**: Each user visit is independent
- **Immediate reward**: We know right away if user clicked


In [5]:
class AdvertisementBanditEnvironment:
    """
    Contextual Bandit Environment for Advertisement Selection.
    
    Each step:
    1. A new user arrives with random features (context)
    2. Agent selects an ad to show (action)
    3. User clicks or not (reward: 1 or 0)
    """
    
    def __init__(self):
        # Define specs
        self._observation_spec = tensor_spec.TensorSpec(
            shape=(CONTEXT_DIM,), 
            dtype=tf.float32, 
            name='observation'
        )
        
        self._action_spec = tensor_spec.BoundedTensorSpec(
            shape=(), 
            dtype=tf.int32, 
            minimum=0, 
            maximum=NUM_ACTIONS - 1, 
            name='action'
        )
        
        self._current_context = None
        
    @property
    def observation_spec(self):
        return self._observation_spec
    
    @property
    def action_spec(self):
        return self._action_spec
    
    @property
    def time_step_spec(self):
        """Returns the time_step_spec for this environment."""
        return ts.time_step_spec(observation_spec=self._observation_spec)
    
    def _generate_random_context(self) -> np.ndarray:
        """
        Generate a random user context (one-hot encoded features).
        """
        context = np.zeros(CONTEXT_DIM, dtype=np.float32)
        
        # Randomly select one category for each feature
        age_idx = np.random.randint(0, len(AGE_GROUPS))
        interest_idx = np.random.randint(0, len(INTERESTS))
        device_idx = np.random.randint(0, len(DEVICES))
        time_idx = np.random.randint(0, len(TIME_SLOTS))
        
        # One-hot encode
        context[age_idx] = 1.0
        context[len(AGE_GROUPS) + interest_idx] = 1.0
        context[len(AGE_GROUPS) + len(INTERESTS) + device_idx] = 1.0
        context[len(AGE_GROUPS) + len(INTERESTS) + len(DEVICES) + time_idx] = 1.0
        
        return context
    
    def reset(self) -> ts.TimeStep:
        """
        Reset environment and return initial time step with new user context.
        """
        self._current_context = self._generate_random_context()
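        # batch_size=1 gives step_type, reward, and discount a batch dimension matching the observation
        return ts.restart(tf.constant([self._current_context], dtype=tf.float32), batch_size=1)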
    
    def step(self, action: int) -> Tuple[ts.TimeStep, float]:
        """
        Execute action (show ad) and return reward (click/no-click).
        
        Args:
            action: Index of ad to show
            
        Returns:
            time_step: New time step with next user context
            reward: 1.0 if clicked, 0.0 if not
        """
        # Get click probability for this context-action pair
        click_prob = get_click_probability(self._current_context, action)
        
        # Sample reward (Bernoulli with click_prob)
        reward = 1.0 if np.random.random() < click_prob else 0.0
        
        # Generate new user context for next step
        self._current_context = self._generate_random_context()
        
        # Return transition to new context
        next_time_step = ts.transition(
            observation=tf.constant([self._current_context], dtype=tf.float32),
            reward=tf.constant([reward], dtype=tf.float32)
        )
        
        return next_time_step, reward

# Create environment instance
env = AdvertisementBanditEnvironment()

print("✅ Advertisement Bandit Environment created!")
print(f"\n📋 Observation Spec: {env.observation_spec}")
print(f"🎬 Action Spec: {env.action_spec}")
print(f"⏰ Time Step Spec: {env.time_step_spec}")


✅ Advertisement Bandit Environment created!

📋 Observation Spec: TensorSpec(shape=(15,), dtype=tf.float32, name='observation')
🎬 Action Spec: BoundedTensorSpec(shape=(), dtype=tf.int32, name='action', minimum=array(0, dtype=int32), maximum=array(3, dtype=int32))
⏰ Time Step Spec: TimeStep(
{'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type'),
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'observation': TensorSpec(shape=(15,), dtype=tf.float32, name='observation')})


---
## 🧠 Step 5: Create the Reward Prediction Network

The **Reward Prediction Network** is a neural network that:
1. Takes **context (user features)** as input
2. Outputs **predicted reward for each action (ad)**

### Architecture:
```
Context (15) → Dense(64) → ReLU → Dense(32) → ReLU → Dense(4) → Sigmoid → Predicted Rewards
```

This network learns the mapping: `f(context) → [reward_ad1, reward_ad2, reward_ad3, reward_ad4]`
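
To make the shapes concrete, here is a NumPy-only sketch of the same forward pass with randomly initialised weights standing in for trained layers. It is an illustration of the architecture, not the Keras implementation used in this notebook; the `forward` function and the weight scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [15, 64, 32, 4]  # context -> hidden -> hidden -> one output per ad

# Random weights and zero biases, standing in for trained Dense layers.
weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(context_batch):
    """Map a [batch, 15] context array to [batch, 4] predicted rewards."""
    x = context_batch
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ w + b, 0.0)        # Dense followed by ReLU
    logits = x @ weights[-1] + biases[-1]     # final Dense layer
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid keeps outputs in (0, 1)

# Two hand-built one-hot contexts.
batch = np.zeros((2, 15))
batch[0, [0, 4, 8, 13]] = 1.0
batch[1, [1, 5, 9, 11]] = 1.0

predictions = forward(batch)
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
print(predictions.shape)  # (2, 4)
print(n_params)           # 3236
```

The parameter count is (15·64 + 64) + (64·32 + 32) + (32·4 + 4) = 3236, small enough to train quickly on a CPU.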


In [None]:
class RewardPredictionNetwork(network.Network):
    """
    Neural network that predicts expected reward for each action given context.
    
    Input: User context (one-hot encoded features)
    Output: Predicted reward for each ad (NUM_ACTIONS outputs)
    """
    
    def __init__(self, 
                 observation_spec, 
                 action_spec,
                 fc_layer_params=(64, 32),
                 name='RewardPredictionNetwork'):
        """
        Initialize the reward prediction network.
        
        Args:
            observation_spec: Spec for observations (context)
            action_spec: Spec for actions
            fc_layer_params: Tuple of hidden layer sizes
            name: Network name
        """
        super(RewardPredictionNetwork, self).__init__(
            input_tensor_spec=observation_spec,
            state_spec=(),
            name=name
        )
        
        self._action_spec = action_spec
        self._num_actions = action_spec.maximum - action_spec.minimum + 1
        
        # Build the network layers
        self._layers = []
        
        # Hidden layers
        for units in fc_layer_params:
            self._layers.append(
                tf.keras.layers.Dense(
                    units,
                    activation='relu',
                    kernel_initializer=tf.keras.initializers.GlorotUniform()
                )
            )
        
        # Output layer: one output per action (predicted reward)
        self._output_layer = tf.keras.layers.Dense(
            self._num_actions,
            activation='sigmoid',  # Rewards are between 0 and 1 (click probability)
            kernel_initializer=tf.keras.initializers.GlorotUniform(),
            name='reward_predictions'
        )
        
    def call(self, observation, step_type=None, network_state=(), training=False):
        """
        Forward pass through the network.
        
        Args:
            observation: User context tensor [batch_size, CONTEXT_DIM]
            step_type: Not used in bandits
            network_state: Not used (stateless network)
            training: Whether in training mode
            
        Returns:
            predicted_rewards: Tensor of shape [batch_size, NUM_ACTIONS]
            network_state: Empty tuple (stateless)
        """
        x = observation
        
        # Pass through hidden layers
        for layer in self._layers:
            x = layer(x, training=training)
        
        # Output layer
        predicted_rewards = self._output_layer(x, training=training)
        
        return predicted_rewards, network_state

# Create the reward prediction network
reward_network = RewardPredictionNetwork(
    observation_spec=env.observation_spec,
    action_spec=env.action_spec,
    fc_layer_params=(64, 32)
)

print("✅ Reward Prediction Network created!")
print(f"\n🧠 Network Architecture:")
print(f"   Input: Context ({CONTEXT_DIM} features)")
print(f"   Hidden: Dense(64) → ReLU → Dense(32) → ReLU")
print(f"   Output: {NUM_ACTIONS} predicted rewards (one per ad)")

# Test the network
test_time_step = env.reset()
test_predictions, _ = reward_network(test_time_step.observation)
print(f"\n🧪 Test prediction shape: {test_predictions.shape}")
print(f"   Sample predictions: {test_predictions.numpy()[0]}")


---
## 🎯 Step 6: Create the RewardPredictionBasePolicy

Now we create a policy that uses the reward prediction network to select actions.

### How It Works:
1. **Receive context** (user features)
2. **Predict rewards** for all actions using the network
3. **Select action** with highest predicted reward (greedy)
4. **Optional**: Add epsilon-greedy exploration

### Key Methods in This Subclass:
- `_predict_rewards()`: Helper defined in this notebook that returns the predicted rewards for all actions
- `_distribution()`: The `TFPolicy` hook that returns the action distribution, overridden here to apply epsilon-greedy selection
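
The selection steps listed under "How It Works" can be sketched in plain NumPy. The `epsilon_greedy` function below is a hypothetical stand-alone version, not the TF-Agents code path; it also checks a detail that is easy to overlook: because the random draw can land on the greedy ad too, the greedy ad is chosen with probability 1 − ε + ε/K rather than 1 − ε.

```python
import numpy as np

def epsilon_greedy(predicted_rewards, epsilon, rng):
    """Pick one action per row: random with probability epsilon, else greedy."""
    batch_size, num_actions = predicted_rewards.shape
    greedy = predicted_rewards.argmax(axis=1)
    random_actions = rng.integers(0, num_actions, size=batch_size)
    explore = rng.random(batch_size) < epsilon
    return np.where(explore, random_actions, greedy)

rng = np.random.default_rng(0)

# Made-up predictions in which ad 1 is always the greedy choice.
predicted = np.tile([0.2, 0.6, 0.1, 0.1], (100_000, 1))
actions = epsilon_greedy(predicted, epsilon=0.1, rng=rng)

greedy_share = np.mean(actions == 1)
print(f"share of greedy picks: {greedy_share:.3f}")
```

With ε = 0.1 and K = 4 ads, the expected share is 1 − 0.1 + 0.1/4 = 0.925, and the simulated share should land close to that value.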


In [None]:
class AdRewardPredictionPolicy(reward_prediction_base_policy.RewardPredictionBasePolicy):
    """
    Policy that selects ads based on predicted click rewards.
    
    Uses epsilon-greedy exploration:
    - With probability (1-epsilon): Select ad with highest predicted reward
    - With probability epsilon: Select random ad
    """
    
    def __init__(self,
                 time_step_spec,
                 action_spec,
                 reward_network,
                 epsilon=0.1,
                 name='AdRewardPredictionPolicy'):
        """
        Initialize the policy.
        
        Args:
            time_step_spec: Spec for time steps
            action_spec: Spec for actions
            reward_network: Network that predicts rewards
            epsilon: Exploration rate (probability of random action)
            name: Policy name
        """
        super(AdRewardPredictionPolicy, self).__init__(
            time_step_spec=time_step_spec,
            action_spec=action_spec,
            reward_network=reward_network,  # the base-class constructor takes the reward network
            name=name
        )
        
        self._reward_network = reward_network
        self._epsilon = epsilon
        self._num_actions = action_spec.maximum - action_spec.minimum + 1
        
    @property
    def reward_network(self):
        return self._reward_network
        
    def _predict_rewards(self, time_step, policy_state):
        """
        Predict rewards for all actions given current context.
        
        Helper called by _distribution() below.
        
        Args:
            time_step: Current time step with observation
            policy_state: Not used (stateless policy)
            
        Returns:
            predicted_rewards: Tensor [batch_size, num_actions]
        """
        predicted_rewards, _ = self._reward_network(
            time_step.observation,
            training=False
        )
        return predicted_rewards
    
    def _distribution(self, time_step, policy_state):
        """
        Get action distribution based on predicted rewards.
        
        Uses epsilon-greedy: mostly greedy, sometimes random.
        
        Args:
            time_step: Current time step
            policy_state: Not used
            
        Returns:
            PolicyStep with action distribution
        """
        # Get predicted rewards for all actions
        predicted_rewards = self._predict_rewards(time_step, policy_state)
        
        # Get batch size
        batch_size = tf.shape(predicted_rewards)[0]
        
        # Greedy action (highest predicted reward)
        greedy_action = tf.argmax(predicted_rewards, axis=1, output_type=tf.int32)
        
        # Random action for exploration
        random_action = tf.random.uniform(
            shape=(batch_size,),
            minval=0,
            maxval=self._num_actions,
            dtype=tf.int32
        )
        
        # Epsilon-greedy: choose random with probability epsilon
        explore = tf.random.uniform(shape=(batch_size,)) < self._epsilon
        action = tf.where(explore, random_action, greedy_action)
        
        # Return deterministic distribution at selected action
        return policy_step.PolicyStep(
            action=tfp.distributions.Deterministic(loc=action),
            state=policy_state,
            info={'predicted_rewards': predicted_rewards}
        )

# Create the policy
policy = AdRewardPredictionPolicy(
    time_step_spec=env.time_step_spec,
    action_spec=env.action_spec,
    reward_network=reward_network,
    epsilon=0.1  # 10% exploration
)

print("✅ AdRewardPredictionPolicy created!")
print(f"\n🎲 Policy Configuration:")
print(f"   Exploration rate (epsilon): {policy._epsilon}")
print(f"   Number of actions: {policy._num_actions}")

# Test the policy
test_time_step = env.reset()
test_action_step = policy.action(test_time_step)
print(f"\n🧪 Test action: {test_action_step.action.numpy()[0]} ({ADS[test_action_step.action.numpy()[0]]})")


---
## 🏋️ Step 7: Create the Training Loop

Now we train the reward prediction network using observed data.

### Training Process:
1. **Collect experience**: Show ads, observe clicks
2. **Update network**: Train to predict observed rewards
3. **Repeat**: Continuously improve predictions

### Loss Function:
We use **Mean Squared Error (MSE)** between:
- Predicted reward for the chosen action
- Actual observed reward (0 or 1)
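
The loss only involves the output for the ad that was actually shown, since no click feedback exists for the other ads. A small NumPy sketch with made-up numbers shows the gather-then-MSE computation:

```python
import numpy as np

# Made-up network outputs for a batch of 3 users (one column per ad).
predicted = np.array([
    [0.6, 0.2, 0.1, 0.3],
    [0.1, 0.5, 0.2, 0.2],
    [0.3, 0.3, 0.7, 0.1],
])
actions = np.array([0, 1, 3])        # ad shown to each user
rewards = np.array([1.0, 0.0, 0.0])  # observed click (1) or no click (0)

# Keep only the prediction for the chosen ad in each row.
chosen = predicted[np.arange(len(actions)), actions]  # [0.6, 0.5, 0.1]

# Mean squared error between chosen predictions and observed rewards.
loss = np.mean((chosen - rewards) ** 2)
print(f"{loss:.4f}")  # 0.1400
```

With 0/1 rewards, the prediction that minimises squared error for a given context and ad is the mean reward, which is the click probability; this is why the trained outputs can be read as CTR estimates.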


In [None]:
class BanditTrainer:
    """
    Trainer for the reward prediction bandit policy.
    Collects experience and updates the reward network.
    """
    
    def __init__(self, policy, learning_rate=0.001):
        self.policy = policy
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
        self.experience_buffer = []
        self.buffer_size = 1000
        self.rewards_history = []
        self.loss_history = []
        
    def collect_experience(self, env, num_steps=100):
        """Collect experience by interacting with environment."""
        total_reward = 0
        
        for _ in range(num_steps):
            time_step = env.reset()
            action_step = self.policy.action(time_step)
            action = action_step.action.numpy()[0]
            _, reward = env.step(action)
            total_reward += reward
            
            experience = {
                'context': time_step.observation.numpy()[0],
                'action': action,
                'reward': reward
            }
            self.experience_buffer.append(experience)
            
            if len(self.experience_buffer) > self.buffer_size:
                self.experience_buffer.pop(0)
        
        avg_reward = total_reward / num_steps
        self.rewards_history.append(avg_reward)
        return avg_reward
    
    def train_step(self, batch_size=32):
        """Perform one training step on the reward network."""
        if len(self.experience_buffer) < batch_size:
            return 0.0
        
        indices = np.random.choice(len(self.experience_buffer), batch_size, replace=False)
        batch = [self.experience_buffer[i] for i in indices]
        
        contexts = tf.constant([exp['context'] for exp in batch], dtype=tf.float32)
        actions = tf.constant([exp['action'] for exp in batch], dtype=tf.int32)
        rewards = tf.constant([exp['reward'] for exp in batch], dtype=tf.float32)
        
        with tf.GradientTape() as tape:
            predicted_rewards, _ = self.policy.reward_network(contexts, training=True)
            batch_indices = tf.range(batch_size)
            indices_2d = tf.stack([batch_indices, actions], axis=1)
            predicted_for_action = tf.gather_nd(predicted_rewards, indices_2d)
            loss = tf.reduce_mean(tf.square(predicted_for_action - rewards))
        
        gradients = tape.gradient(loss, self.policy.reward_network.trainable_variables)
        self.optimizer.apply_gradients(
            zip(gradients, self.policy.reward_network.trainable_variables)
        )
        
        self.loss_history.append(loss.numpy())
        return loss.numpy()

trainer = BanditTrainer(policy, learning_rate=0.001)
print("✅ BanditTrainer created!")
print(f"\n📊 Trainer Configuration:")
print(f"   Learning rate: 0.001")
print(f"   Buffer size: {trainer.buffer_size}")


---
## 🚀 Step 8: Run the Training

Now we train the bandit agent to learn optimal ad selection!


In [None]:
# Training Configuration
NUM_ITERATIONS = 100
COLLECT_STEPS = 50
TRAIN_STEPS = 10
BATCH_SIZE = 32

print("🚀 Starting Training...")
print(f"   Iterations: {NUM_ITERATIONS}")
print(f"   Collect steps per iteration: {COLLECT_STEPS}")
print(f"   Train steps per iteration: {TRAIN_STEPS}")
print("\n" + "="*60)

for iteration in range(NUM_ITERATIONS):
    avg_reward = trainer.collect_experience(env, num_steps=COLLECT_STEPS)
    
    total_loss = 0
    for _ in range(TRAIN_STEPS):
        loss = trainer.train_step(batch_size=BATCH_SIZE)
        total_loss += loss
    avg_loss = total_loss / TRAIN_STEPS
    
    if (iteration + 1) % 10 == 0:
        print(f"📈 Iteration {iteration + 1:3d}: Avg Reward = {avg_reward:.4f}, Avg Loss = {avg_loss:.4f}")

print("\n" + "="*60)
print("✅ Training Complete!")
print(f"\n📊 Final Statistics:")
print(f"   Final Average Reward: {trainer.rewards_history[-1]:.4f}")
print(f"   Best Average Reward: {max(trainer.rewards_history):.4f}")


---
## 📈 Step 9: Visualize Training Progress


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Reward History
ax1 = axes[0]
ax1.plot(trainer.rewards_history, color='#2ecc71', linewidth=2, alpha=0.8)
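# A uniformly random policy has an expected CTR of about 0.267 under get_click_probability with
# uniformly sampled contexts: 0.1 (base) + 0.1375 (interest) + 0.0167 (young + mobile) + 0.0125 (evening)
ax1.axhline(y=0.267, color='#e74c3c', linestyle='--', label='Random Policy (expected)', alpha=0.7)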

window = 10
if len(trainer.rewards_history) >= window:
    smoothed = np.convolve(trainer.rewards_history, np.ones(window)/window, mode='valid')
    ax1.plot(range(window-1, len(trainer.rewards_history)), smoothed, 
             color='#27ae60', linewidth=3, label='Smoothed')

ax1.set_xlabel('Iteration', fontsize=12)
ax1.set_ylabel('Average Reward (CTR)', fontsize=12)
ax1.set_title('🎯 Click-Through Rate Over Training', fontsize=14, fontweight='bold')
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)

# Plot 2: Loss History
ax2 = axes[1]
ax2.plot(trainer.loss_history, color='#3498db', linewidth=1, alpha=0.5)

if len(trainer.loss_history) >= window:
    smoothed_loss = np.convolve(trainer.loss_history, np.ones(window)/window, mode='valid')
    ax2.plot(range(window-1, len(trainer.loss_history)), smoothed_loss,
             color='#2980b9', linewidth=2, label='Smoothed')

ax2.set_xlabel('Training Step', fontsize=12)
ax2.set_ylabel('MSE Loss', fontsize=12)
ax2.set_title('📉 Training Loss', fontsize=14, fontweight='bold')
ax2.legend(loc='upper right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


---
## 🔍 Step 10: Evaluate the Learned Policy

Let's see what the policy learned by checking predicted rewards for different user segments.


In [None]:
def create_test_context(age_idx, interest_idx, device_idx, time_idx):
    """Create a specific user context for testing."""
    context = np.zeros(CONTEXT_DIM, dtype=np.float32)
    context[age_idx] = 1.0
    context[len(AGE_GROUPS) + interest_idx] = 1.0
    context[len(AGE_GROUPS) + len(INTERESTS) + device_idx] = 1.0
    context[len(AGE_GROUPS) + len(INTERESTS) + len(DEVICES) + time_idx] = 1.0
    return context

print("🔍 Policy Evaluation: Predicted Rewards by User Interest")
print("=" * 70)

for interest_idx, interest in enumerate(INTERESTS):
    context = create_test_context(age_idx=1, interest_idx=interest_idx, device_idx=0, time_idx=2)
    context_tensor = tf.constant([context], dtype=tf.float32)
    predictions, _ = reward_network(context_tensor)
    predictions = predictions.numpy()[0]
    best_action = np.argmax(predictions)
    
    print(f"\n👤 User interested in: {interest}")
    print(f"   Predicted rewards: ", end="")
    for ad_idx, ad in enumerate(ADS):
        marker = "⭐" if ad_idx == best_action else "  "
        print(f"{ad}: {predictions[ad_idx]:.3f}{marker}  ", end="")
    print(f"\n   → Selected: {ADS[best_action]} | Optimal: {ADS[interest_idx]} | {'✅' if best_action == interest_idx else '❌'}")


In [None]:
# Compare with baseline policies
def evaluate_policy(policy_fn, env, num_episodes=500):
    """Evaluate a policy over multiple episodes."""
    total_reward = 0
    for _ in range(num_episodes):
        time_step = env.reset()
        action = policy_fn(time_step.observation.numpy()[0])
        _, reward = env.step(action)
        total_reward += reward
    return total_reward / num_episodes

def random_policy(context):
    return np.random.randint(0, NUM_ACTIONS)

def optimal_policy(context):
    probs = [get_click_probability(context, a) for a in range(NUM_ACTIONS)]
    return np.argmax(probs)

def learned_policy(context):
    context_tensor = tf.constant([context], dtype=tf.float32)
    predictions, _ = reward_network(context_tensor)
    return tf.argmax(predictions[0]).numpy()

print("📊 Policy Comparison (500 episodes each)")
print("=" * 50)

random_reward = evaluate_policy(random_policy, env)
optimal_reward = evaluate_policy(optimal_policy, env)
learned_reward = evaluate_policy(learned_policy, env)

print(f"\n🎲 Random Policy:   {random_reward:.4f} CTR")
print(f"🧠 Learned Policy:  {learned_reward:.4f} CTR")
print(f"🏆 Optimal Policy:  {optimal_reward:.4f} CTR")

# Relative change; the "+" format flag prints the sign, so a regression shows as a negative number
improvement = (learned_reward - random_reward) / random_reward * 100
print(f"\n📈 Improvement over random: {improvement:+.1f}%")
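
A CTR measured over 500 simulated episodes is itself a noisy estimate, so small gaps between policies may not be meaningful. The sketch below uses the binomial standard error to give a rough 95% interval; `ctr_standard_error` is a hypothetical helper, and the two CTR values are illustrative placeholders rather than measured results.

```python
import math

def ctr_standard_error(ctr, num_episodes):
    """Standard error of a click-through rate estimated from Bernoulli trials."""
    return math.sqrt(ctr * (1.0 - ctr) / num_episodes)

# Illustrative CTR values (placeholders, not results from this notebook).
for name, ctr in [("random-like", 0.27), ("optimal-like", 0.53)]:
    se = ctr_standard_error(ctr, num_episodes=500)
    print(f"{name}: {ctr:.2f} +/- {1.96 * se:.3f} (approx. 95% interval)")
```

At 500 episodes the half-width is around 0.04, so differences of a few percentage points between policies call for more episodes before drawing conclusions.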


---
## 📝 Summary: Key Steps Explained

### Step-by-Step Breakdown:

| Step | What We Did | Why It Matters |
|------|-------------|----------------|
| **1. Setup** | Imported TF-Agents and defined domain | Foundation for bandit implementation |
| **2. Environment** | Created `AdvertisementBanditEnvironment` | Simulates user arrivals and clicks |
| **3. Network** | Built `RewardPredictionNetwork` | Learns context → reward mapping |
| **4. Policy** | Implemented `AdRewardPredictionPolicy` | Extends RewardPredictionBasePolicy for ads |
| **5. Training** | Created training loop with experience buffer | Updates network from observed data |
| **6. Evaluation** | Compared against baselines | Measured improvement over random |

### Key Takeaways:

1. **RewardPredictionBasePolicy** is a flexible base class for bandit policies
2. In this notebook, the `_predict_rewards()` helper wraps the reward network to produce per-ad reward estimates
3. Overriding `_distribution()` customizes action selection (epsilon-greedy here)
4. Training uses MSE loss between predicted and observed rewards
5. Epsilon-greedy exploration helps discover better actions

### Next Steps:

- Try **Thompson Sampling** for better exploration
- Add **Upper Confidence Bound (UCB)** exploration
- Use **TF-Agents' built-in bandit agents** (LinUCB, Neural Epsilon Greedy)
- Scale to **real ad data** with proper feature engineering
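
As a starting point for the Thompson Sampling suggestion, here is a minimal context-free Beta-Bernoulli sketch. It ignores user features entirely, so it is a simplification of the contextual problem in this notebook; the `true_ctr` values are hypothetical and no TF-Agents API is used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-ad click rates, unknown to the algorithm.
true_ctr = np.array([0.30, 0.30, 0.65, 0.30])
num_ads = len(true_ctr)

# Beta(1, 1) prior on each ad's click rate.
successes = np.ones(num_ads)
failures = np.ones(num_ads)
pulls = np.zeros(num_ads, dtype=int)

for _ in range(5000):
    # Sample a plausible click rate per ad from its posterior, then act greedily on the samples.
    samples = rng.beta(successes, failures)
    arm = int(samples.argmax())

    # Simulate the click and update that ad's posterior counts.
    reward = float(rng.random() < true_ctr[arm])
    successes[arm] += reward
    failures[arm] += 1.0 - reward
    pulls[arm] += 1

print(pulls)
```

Exploration here comes from posterior uncertainty rather than a fixed ε, so ads with little data are still tried while clearly worse ads are shown less and less often.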


In [None]:
print("\n🎉 Notebook Complete!")
print("\nYou've learned how to:")
print("  ✅ Create a contextual bandit environment for ads")
print("  ✅ Build a reward prediction network")
print("  ✅ Implement RewardPredictionBasePolicy")
print("  ✅ Train and evaluate bandit policies")
print("\n🚀 Now try modifying the code for your own use case!")
