import torch

if torch.backends.mps.is_available():
    print("✅ MPS (Metal) backend is available and enabled.")
else:
    print("❌ MPS not available.")

✅ MPS (Metal) backend is available and enabled.


### **Chapter 2: The Adaptive Recommender: Contextual Bandits in Action**

#### **2.1 Introduction: Escaping the Static World with the Explore-Exploit Dilemma**

In the previous chapter, we built a capable, yet fundamentally flawed, batched recommender. Its knowledge is frozen in time, learned from a static dataset. It is like a student who has memorized a textbook but cannot apply that knowledge to new problems or learn from their mistakes. To build a truly intelligent system, we need to move from this passive, offline learning to an active, online paradigm.

This brings us face-to-face with one of the most fundamental trade-offs in decision-making and machine learning: the **explore-exploit dilemma**.

Imagine you are at a new food court with five stalls.
*   **Exploitation** is the safe bet. You try the pizza stall, and it's pretty good. The "exploit" strategy would be to eat pizza every single day. You are guaranteed a decent meal, maximizing your immediate reward based on your current knowledge.
*   **Exploration** is the risky, but potentially more rewarding, path. You could try the mysterious taco stall. It might be terrible (a loss of immediate reward), but it could also be the best food you've ever had, leading to much higher rewards in the long run.

Our batched recommender is a pure exploiter. Once trained, it will always recommend the items it *believes* have the highest CTR, based on its fixed knowledge. It never dares to try the "taco stall"—a new product or a niche item—because its predicted CTR is low or unknown.

To build a better system, we need an algorithm that can intelligently manage this trade-off. This is the domain of **Multi-Armed Bandits**, a class of reinforcement learning algorithms designed specifically for this problem. The name comes from the analogy of a gambler at a row of slot machines (or "one-armed bandits"), trying to figure out which machine to play to maximize their total winnings.
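To make the food court analogy concrete, here is a minimal sketch of a non-contextual bandit using an epsilon-greedy policy, a simpler strategy than the one this chapter builds toward. The stall qualities, the `run_epsilon_greedy` helper, and the value of `epsilon` are all illustrative assumptions, not part of the Zooplus simulator.

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon=0.1, n_rounds=5000, seed=0):
    """Play a Bernoulli multi-armed bandit with an epsilon-greedy policy."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms)      # how often each stall was visited
    estimates = np.zeros(n_arms)   # running mean reward per stall
    total_reward = 0.0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))    # explore: pick a random stall
        else:
            arm = int(np.argmax(estimates))    # exploit: pick the best so far
        reward = float(rng.random() < true_means[arm])  # 1.0 = good meal

        # Incremental update of the running mean for the chosen stall
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward

# Hypothetical probability of a satisfying meal at each of the five stalls
stall_quality = [0.50, 0.55, 0.60, 0.40, 0.80]
counts, estimates, total_reward = run_epsilon_greedy(stall_quality)

print("Visits per stall:   ", counts.astype(int))
print("Estimated quality:  ", np.round(estimates, 2))
print("Average reward:     ", round(total_reward / counts.sum(), 3))
```

With `epsilon=0` the policy would lock onto the first stall that ever paid off, which is the behavior of a pure exploiter. A small amount of random exploration gives every stall a chance to reveal its true quality.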

We will take this concept one step further by using a **Contextual Bandit**. A simple multi-armed bandit learns the best "arm" (or product) to pull on average, across all situations. A contextual bandit is far more powerful: it learns the best arm to pull *given the current context*. In our Zooplus scenario, the **context** is the user. The algorithm doesn't just learn "which product is best overall?"; it learns "which product is best for *this specific user* right now?".

This chapter will introduce a classic, elegant, and highly effective contextual bandit algorithm: the **Linear Upper Confidence Bound (LinUCB)** algorithm. We will implement it, pit it against our static batched model in our simulation, and witness firsthand the power of continuous, adaptive learning.

#### **2.2 The Frontier Technique: The Linear Upper Confidence Bound (LinUCB) Algorithm**

The LinUCB algorithm, first introduced by Li et al. (2010) for news article recommendation, strikes a beautiful balance between performance, efficiency, and interpretability. It is a perfect entry point into the world of online, reinforcement learning-based recommenders.

**The Core Assumption: A Linear World**

LinUCB makes a simplifying assumption: the expected reward (the true, underlying CTR) of showing a product to a user is a **linear function** of a combined feature vector. (For nonlinear cases, we will investigate NeuralUCB and related methods in subsequent chapters.)

Let's say for a given user-product pair, we can construct a feature vector, `x`. This vector could include:
*   User features (e.g., one-hot encoding of their persona)
*   Product features (e.g., one-hot encoding of the product's category)
*   Interaction features (e.g., the product of user and item embeddings)

The algorithm assumes there exists an unknown coefficient vector, `θ`, such that the expected reward, `E[r]`, is simply their dot product:

`E[r] = x^T θ`

The entire goal of the LinUCB algorithm is to **learn the `θ` vector** for each product as efficiently as possible.
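As a small worked example of the linear assumption, the sketch below builds a feature vector for one user-product pair and evaluates `x^T θ`. The feature layout (two personas, three categories) and the values in `theta` are made up for illustration; in LinUCB, `θ` is unknown and has to be learned.

```python
import numpy as np

# Hypothetical features: 2 personas and 3 product categories, both one-hot
user_feat = np.array([1.0, 0.0])          # user belongs to persona 0
product_feat = np.array([0.0, 1.0, 0.0])  # product belongs to category 1

# Combined feature vector x for this user-product pair
x = np.concatenate([user_feat, product_feat])

# An assumed "true" coefficient vector for this product (unknown in practice)
theta = np.array([0.10, 0.02, 0.01, 0.15, 0.05])

# Expected reward under the linear model: E[r] = x^T theta
expected_reward = float(x @ theta)
print(f"x = {x}")
print(f"Expected CTR = {expected_reward:.2f}")
```

Because `x` is one-hot in both halves, the dot product simply adds the persona coefficient (0.10) and the category coefficient (0.15), giving an expected CTR of 0.25.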

**How LinUCB Balances Exploration and Exploitation**

For each "arm" (i.e., each product in our catalog), LinUCB maintains two key pieces of information:

1.  **A `d x d` matrix `A`**: This matrix accumulates the outer products of the feature vectors `x` it has seen so far for that arm. It equals `I + X^T X`, where `X` is the matrix of all feature vectors observed for that arm and the identity `I` is the starting value that keeps `A` invertible. The inverse of `A` helps us measure our uncertainty about the arm's true reward.
2.  **A `d x 1` vector `b`**: This vector stores the sum of the feature vectors `x` weighted by the rewards `r` they produced. It's `X^T r`.
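The sketch below checks that updating `A` and `b` one observation at a time gives the same result as the batch expressions. It assumes `A` starts from the identity matrix, as in the implementation later in this chapter; the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_obs = 4, 10

# Placeholder data: 10 feature vectors and 10 binary rewards for one arm
X = rng.random((n_obs, d))
r = rng.integers(0, 2, size=n_obs).astype(float)

# Online accumulation, one observation at a time
A = np.identity(d)
b = np.zeros(d)
for x_t, r_t in zip(X, r):
    A += np.outer(x_t, x_t)   # adds x x^T
    b += r_t * x_t            # adds r * x

# Batch equivalents
A_batch = np.identity(d) + X.T @ X
b_batch = X.T @ r

print("A matches batch form:", np.allclose(A, A_batch))
print("b matches batch form:", np.allclose(b, b_batch))
```

This is why LinUCB can learn online: each arm only needs to store `A` and `b`, never the full history of interactions.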

At each step, to make a recommendation, LinUCB calculates a score for every possible product using these two pieces of information. The score is composed of two parts:

`Score = (Predicted CTR) + (Uncertainty Bonus)`

1.  **Predicted CTR (Exploitation):** The algorithm first calculates its current best estimate of the coefficient vector, `θ_hat = A⁻¹ b`. The predicted CTR is then simply `x^T θ_hat`. This is the exploitation term—it favors products that have performed well in the past.

2.  **Uncertainty Bonus (Exploration):** The second term is `α * sqrt(x^T A⁻¹ x)`.
    *   `A⁻¹` is proportional to the covariance of our estimate `θ_hat`. Large entries mean we are very uncertain about our estimate.
    *   `x^T A⁻¹ x` is proportional to the variance of the prediction specifically for the feature vector `x`. If we have seen feature vectors similar to `x` many times before, this term will be small. If `x` represents a new, unseen combination of user and product features, this term will be large.
    *   `α` (alpha) is a hyperparameter that you control. It scales how much the algorithm values exploration. A higher `α` makes the algorithm more adventurous.

The algorithm then simply chooses the product with the **highest combined score**.
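The scoring rule can be sketched numerically for two arms: one that has already been shown 50 times to this kind of user, and one that has never been shown. The `ucb_score` helper, the feature vector, and the click counts are illustrative assumptions.

```python
import numpy as np

def ucb_score(A, b, x, alpha):
    """Return (predicted CTR, uncertainty bonus) for one arm."""
    theta_hat = np.linalg.solve(A, b)               # theta_hat = A^-1 b
    predicted = float(x @ theta_hat)                # exploitation term
    bonus = float(alpha * np.sqrt(x @ np.linalg.solve(A, x)))  # exploration
    return predicted, bonus

d, alpha = 3, 1.5
x = np.array([1.0, 0.0, 1.0])   # hypothetical user-product feature vector

# Arm 0: seen this x 50 times, with 10 clicks
A_seen = np.identity(d) + 50 * np.outer(x, x)
b_seen = 10 * x

# Arm 1: never shown, still at its initial state
A_new = np.identity(d)
b_new = np.zeros(d)

pred_seen, bonus_seen = ucb_score(A_seen, b_seen, x, alpha)
pred_new, bonus_new = ucb_score(A_new, b_new, x, alpha)

print(f"Seen arm: predicted={pred_seen:.3f}, bonus={bonus_seen:.3f}")
print(f"New arm:  predicted={pred_new:.3f}, bonus={bonus_new:.3f}")
print("Chosen arm:", int(np.argmax([pred_seen + bonus_seen, pred_new + bonus_new])))
```

The well-observed arm has a predicted CTR of about 0.198 and a small bonus of about 0.211. The unseen arm has a predicted CTR of 0 but a bonus of about 2.121, so it wins this round and gets explored.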

**The Intuition:**

*   If a product has a high predicted CTR and we are very certain about it (low uncertainty bonus), it gets a high score. **(Pure Exploitation)**
*   If a product has a mediocre predicted CTR but we are very uncertain about it (high uncertainty bonus), it can also get a high score. Choosing this product is an act of **exploration**. By trying it, we get a new data point, which reduces our uncertainty (updating `A` and `b`) and helps us learn its true value for the future.

This combination lets LinUCB learn efficiently. Exploration is concentrated on the parts of the feature space where its knowledge is weakest, and the bonus shrinks automatically as evidence accumulates, so well-understood products are judged mostly on their predicted CTR.
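To see how the uncertainty bonus behaves over time, the following sketch repeatedly updates `A` with the same feature vector and tracks the bonus for that vector and for a different, never-observed one. The vectors and `alpha` are arbitrary choices for illustration.

```python
import numpy as np

alpha = 1.0
A = np.identity(3)
x_seen = np.array([1.0, 0.0, 0.0])    # context shown again and again
x_other = np.array([0.0, 1.0, 0.0])   # context never shown

def bonus(A, x, alpha):
    """Uncertainty bonus alpha * sqrt(x^T A^-1 x)."""
    return float(alpha * np.sqrt(x @ np.linalg.solve(A, x)))

history = []
for n in range(101):
    if n in (0, 1, 10, 100):
        # Record the bonuses after n observations of x_seen
        history.append((n, bonus(A, x_seen, alpha), bonus(A, x_other, alpha)))
    A += np.outer(x_seen, x_seen)      # observe x_seen once more

for n, b_seen, b_other in history:
    print(f"after {n:3d} observations: seen={b_seen:.3f}, other={b_other:.3f}")
```

The bonus for the observed context decays roughly like `1 / sqrt(n)`, while the bonus for the unobserved, orthogonal context stays at its initial value. Observing one region of the feature space tells the algorithm nothing about directions it has not visited.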

#### **2.3 Implementing the LinUCB Agent for Zooplus**

Now, let's translate this theory into practice. We will create a Python class for a single LinUCB "arm" (representing one product) and a main agent class that manages all the arms.

For our feature vector `x`, we will do something simple and effective: we will **concatenate the user's embedding with the product's category features**. But where do we get a user embedding for a contextual bandit? We can't use the one from the batched model directly, as it was trained for a different task.

Instead, we will create a new set of user embeddings, one for each *persona*. This is a reasonable simplification for our simulation. In a real system, these could be embeddings learned from user demographics or other side information. The product features will be the one-hot encoded categories we already have in our simulator.
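As a standalone illustration of this feature construction, the sketch below uses made-up persona names and category assignments rather than the ones defined by the simulator. It only shows the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical personas; the real names come from the simulator
persona_names = ["new_puppy_parent", "cat_connoisseur", "budget_shopper"]
persona_embedding_dim = 8
user_features = {
    name: rng.random(persona_embedding_dim) for name in persona_names
}

# Hypothetical catalog: 5 products, each assigned to one of 4 categories
n_categories = 4
product_categories = np.array([0, 2, 1, 3, 2])
product_features = np.eye(n_categories)[product_categories]  # one-hot rows

# Feature vector for one user-product pair: persona embedding + category
x = np.concatenate([user_features["cat_connoisseur"], product_features[1]])

print("Product feature matrix shape:", product_features.shape)
print("Feature vector dimensionality d:", x.shape[0])
```

Here `d` is 12: eight embedding dimensions plus four category indicators. Every arm receives vectors of this same length, which is what fixes the size of its `A` matrix and `b` vector.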

**Code Block 2.1: The LinUCB Implementation**

```python
import numpy as np

class LinUCBArm:
    """Represents a single arm in the LinUCB algorithm."""
    def __init__(self, arm_index, d, alpha):
        """
        Args:
            arm_index (int): The index of the arm (e.g., product_id).
            d (int): The dimensionality of the feature vector.
            alpha (float): The exploration parameter.
        """
        self.arm_index = arm_index
        self.alpha = alpha
        
        # Initialize A as a d x d identity matrix.
        # This corresponds to a standard Bayesian linear regression prior.
        self.A = np.identity(d)
        
        # Initialize b as a d x 1 zero vector.
        self.b = np.zeros([d, 1])

    def calc_p(self, x):
        """
        Calculates the score for this arm given a feature vector x.
        
        Args:
            x (np.array): A d-dimensional feature vector.
        
        Returns:
            The UCB score for this arm.
        """
        # Ensure x is a column vector
        x = x.reshape(-1, 1)
        
        # Calculate A_inv and theta_hat
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv.dot(self.b)
        
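        # Calculate the UCB score: predicted reward plus exploration bonus
        p = theta_hat.T.dot(x) + self.alpha * np.sqrt(x.T.dot(A_inv).dot(x))
        
        # p is a 1 x 1 array; return it as a plain Python float
        return p.item()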

    def update(self, x, reward):
        """
        Updates the A and b matrices for this arm.
        
        Args:
            x (np.array): The feature vector for the interaction.
            reward (int): The observed reward (0 or 1).
        """
        x = x.reshape(-1, 1)
        self.A += x.dot(x.T)
        self.b += reward * x

class LinUCBAgent:
    """The main agent that manages all the LinUCB arms."""
    def __init__(self, n_products, user_features, product_features, alpha=1.0):
        """
        Args:
            n_products (int): The number of arms (products).
            user_features (dict): A dict mapping persona name to a feature vector.
            product_features (np.array): A matrix of one-hot encoded product categories.
            alpha (float): The exploration parameter.
        """
        self.user_features = user_features
        self.product_features = product_features
        self.n_products = n_products
        
        # The dimensionality of our combined feature vector
        d = list(user_features.values())[0].shape[0] + product_features.shape[1]
        
        # Create a list of arms
        self.arms = [LinUCBArm(i, d, alpha) for i in range(n_products)]

    def _create_feature_vector(self, persona, product_id):
        """Creates the concatenated feature vector x."""
        user_feat = self.user_features[persona]
        product_feat = self.product_features[product_id]
        return np.concatenate([user_feat, product_feat])

    def choose_action(self, user_persona):
        """
        Chooses the best product to recommend for the given user persona.
        
        Returns:
            The product_id of the chosen action.
        """
        scores = []
        for product_id in range(self.n_products):
            # Create the feature vector for this user-product pair
            x = self._create_feature_vector(user_persona, product_id)
            
            # Calculate the score for this arm
            score = self.arms[product_id].calc_p(x)
            scores.append(score)
            
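        # Choose the arm with the highest score (break ties randomly).
        # Flatten to a 1-D float array so the comparison is element-wise.
        scores = np.asarray(scores, dtype=float).ravel()
        best_arms = np.flatnonzero(scores == scores.max())
        chosen_arm = int(np.random.choice(best_arms))
        
        return chosen_arm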

    def update(self, chosen_arm, user_persona, reward):
        """Updates the agent after an action is taken."""
        x = self._create_feature_vector(user_persona, chosen_arm)
        self.arms[chosen_arm].update(x, reward)

# --- Setup for the LinUCB Agent ---

# Create simple, random embeddings for our user personas
persona_embedding_dim = 8
user_features_for_bandit = {
    name: np.random.rand(persona_embedding_dim) 
    for name in sim.personas.keys()
}

# The product features are the one-hot encoded categories from the simulator
product_features_for_bandit = sim.product_features

# Instantiate the agent
linucb_agent = LinUCBAgent(
    n_products=sim.n_products,
    user_features=user_features_for_bandit,
    product_features=product_features_for_bandit,
    alpha=1.5 # Let's be a bit adventurous
)

print("LinUCB Agent created successfully.")
d = list(user_features_for_bandit.values())[0].shape[0] + product_features_for_bandit.shape[1]
print(f"Feature vector dimensionality (d): {d}")
```
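One design note on the implementation above: `calc_p` inverts `A` on every call, which costs on the order of `d³` operations per arm per recommendation. A common alternative is to store `A⁻¹` directly and update it with the Sherman-Morrison formula. The class below is a sketch of that idea under the same identity initialization, not a drop-in replacement for the code above; the name `ShermanMorrisonArm` is hypothetical.

```python
import numpy as np

class ShermanMorrisonArm:
    """A LinUCB arm that maintains A^-1 directly (illustrative sketch)."""

    def __init__(self, d, alpha):
        self.alpha = alpha
        self.A_inv = np.identity(d)   # inverse of the initial A = I
        self.b = np.zeros(d)

    def score(self, x):
        """UCB score: x^T theta_hat + alpha * sqrt(x^T A^-1 x)."""
        theta_hat = self.A_inv @ self.b
        return float(x @ theta_hat + self.alpha * np.sqrt(x @ self.A_inv @ x))

    def update(self, x, reward):
        """Rank-one update of A^-1 for A <- A + x x^T (Sherman-Morrison)."""
        A_inv_x = self.A_inv @ x
        self.A_inv -= np.outer(A_inv_x, A_inv_x) / (1.0 + x @ A_inv_x)
        self.b += reward * x

# Check against the explicit inverse on random placeholder data
rng = np.random.default_rng(1)
d = 5
arm = ShermanMorrisonArm(d, alpha=1.5)
A_explicit = np.identity(d)
for _ in range(200):
    x_t = rng.random(d)
    arm.update(x_t, float(rng.integers(0, 2)))
    A_explicit += np.outer(x_t, x_t)

print("Matches explicit inverse:", np.allclose(arm.A_inv, np.linalg.inv(A_explicit)))
```

Each update now costs on the order of `d²` operations. For the small `d` used in this chapter the difference is negligible, so the simpler version is a reasonable choice here.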

With the agent class defined and instantiated, we are now ready for the main event: a head-to-head competition. We will create a simulation loop that pits our new, adaptive `LinUCBAgent` against the static, pre-trained `MLPRecommender` from Chapter 1. This will allow us to see, step-by-step, how an online learning agent behaves compared to its offline counterpart.

---
This concludes the first part of Chapter 2. We have introduced the core concepts and provided a full implementation of the LinUCB agent, setting up all the necessary components for the simulation. The next section will detail the experimental design and present the code for the head-to-head comparison.