```python
import torch

if torch.backends.mps.is_available():
    print("✅ MPS (Metal) backend is available and enabled.")
else:
    print("❌ MPS not available.")
```

```
✅ MPS (Metal) backend is available and enabled.
```


### **Revised Outline: Chapter 3. Neural Contextual Bandits: Generalization Through Shared Representation**

*   **3.1 The Limits of Disjoint Models and the Need for Generalization**
    *   (This section is excellent as written. We will reuse it verbatim to motivate the chapter.)

*   **3.2 Generalizing the UCB Principle Beyond Linearity**
    *   (This section is also excellent. We will reuse it to set up the theoretical challenge.)

*   **3.3 The NeuralUCB Ideal and its Computational Hurdle**
    *   **Intuition:** We will first present the "ideal" NeuralUCB algorithm, where the exploration bonus is based on the gradient with respect to *all* network parameters, $\theta$. This is the formulation you began with.
    *   **Rigor:** We will present the formal mathematics, including the definition of the $p \times p$ matrix $A$.
    *   **The "No Steps Skipped" Analysis:** We will then immediately confront the computational reality. We will calculate the number of parameters `p` for our network and explain why forming and inverting a `p x p` matrix at each step is computationally infeasible. This is not a flaw in our teaching; it is a crucial lesson in the gap between pure theory and practical application.

*   **3.4 The Practical Solution: Neural-LinUCB and Last-Layer Linearization**
    *   **Intuition:** We will introduce the elegant solution: treat the deep layers of the network as a powerful, learnable **feature extractor**. The network's job is to map the raw context $x$ into a rich, lower-dimensional embedding $z_\theta(x)$. We then run the fast, proven LinUCB algorithm on these learned embeddings.
    *   **Rigor:** We will define this formally. The reward is modeled as $r \approx z_\theta(x)^T \omega$, where $\omega$ is a small weight vector for the final linear layer. The exploration bonus is now calculated in the low-dimensional embedding space, using a small `embedding_dim` x `embedding_dim` matrix. This is the **Neural-LinUCB** algorithm, a practical and widely used hybrid.

*   **3.5 Implementation: The Practical `NeuralUCBAgent`**
    *   **Code as a Didactic Tool:** We will build the new agent step-by-step.
        *   First, we'll define the `NeuralBanditModel`, architected to separate the `feature_extractor` from the `output_layer`.
        *   Then, we'll build the `NeuralUCBAgent` class. Its `predict` method will perform a forward pass to get embeddings and then apply the standard LinUCB math in the embedding space. Its `update` method will perform a dual update: (1) update the small UCB matrices $A$ and $b$, and (2) perform a standard `backward()` pass to train the entire neural network.

*   **3.6 Experimental Showdown: Generalization vs. Disjoint Models**
    *   **Application:** This is the climax. We will implement the head-to-head simulation you envisioned.
    *   We will run both a stationary and a non-stationary (market shock) experiment comparing three agents:
        1.  `MLPRecommender` (Static Baseline from Ch. 1)
        2.  `LinUCBAgent` (Disjoint Online Learner from Ch. 2)
        3.  `NeuralUCBAgent` (Our new, practical, generalizing agent)
    *   We will generate plots and provide a detailed analysis of the results, highlighting the superior adaptability of the generalizing model.

*   **3.7 Remark on Scalability and Production**
    *   **Computational Perspective:** We will analyze the complexity of our practical `NeuralUCBAgent`, showing why it is scalable.
    *   **Production Perspective:** We will discuss how such a model fits into a real-world, large-scale system, introducing the concept of a two-stage "candidate generation -> re-ranking" pipeline.

*   **3.8 Chapter Conclusion**
    *   We will summarize our findings, cementing the student's understanding of why generalization is key to building robust, adaptive learning systems.

---

## **Part I: From Static Predictions to Adaptive Learning**

### **Chapter 3: Neural Contextual Bandits: Generalization Through Shared Representation**

#### **3.1 The Limits of Disjoint Models and the Need for Generalization**

In our exploration thus far, we have journeyed through two fundamental paradigms of machine learning for recommendation. In Chapter 1, we constructed an `MLPRecommender`, a model endowed with the expressive power of deep learning. It learned rich, non-linear relationships from a large, static dataset, creating effective, generalizable embeddings for users and products. Its crucial flaw, however, was its static nature; it was a creature of the past, unable to adapt to new information without complete retraining.

In Chapter 2, we addressed this inertia by introducing the `LinUCBAgent`, an online learning system grounded in the contextual bandit framework. This agent, learning at every interaction, demonstrated a remarkable ability to adapt and optimize its strategy over time, ultimately outperforming its static counterpart. Yet, it too possessed a critical architectural weakness. Our `LinUCBAgent` was a **disjoint model**. It maintained an independent set of parameters—a matrix $A_a$ and vector $b_a$—for each of the 50 products in our catalog.

Consider the implications of this design. When the agent recommends "Premium Dog Toy A" and observes a click, it updates only the parameters for that specific toy. If it is subsequently asked to evaluate "Deluxe Dog Toy B," it approaches the problem with no transference of knowledge. The insight that this user has an affinity for dog toys remains siloed within the parameters of "Toy A." The model cannot **generalize**.

This is not merely inefficient; it is fundamentally unscalable. In a real-world e-commerce setting with millions of products, maintaining a separate model for each item is computationally infeasible and statistically disastrous. The vast majority of items would be recommended so infrequently that their parameters would never be reliably estimated—a severe form of the cold-start problem.

The path forward must therefore be a synthesis. We need a model that combines the adaptive, principled exploration of a bandit algorithm with the powerful, generalizable representation learning of a neural network. We seek a single, unified model that learns from every interaction to refine a shared understanding of the world, where learning about one dog toy informs its beliefs about *all* dog toys. This chapter is dedicated to the theory and implementation of such a model: the **Neural Bandit**.

#### **3.2 Generalizing the UCB Principle Beyond Linearity**

To build our new agent, we must first return to the theoretical foundation of the UCB algorithm and abstract it away from the restrictive assumption of linearity.

Recall from Chapter 2 that the LinUCB algorithm operates on a crucial assumption: the expected reward $r_{t,a}$ for a given arm $a$ at time $t$ is a linear function of its context feature vector $x_{t,a}$. Our goal is to replace the linear model $(x_{t,a})^T \theta_a$ with a more powerful, non-linear function approximator, which we will represent as $f(x_{t,a}; \theta)$, where $\theta$ represents the shared parameters of this function. In our case, this function $f$ will be a neural network.

The core UCB principle is not intrinsically tied to linearity. It is a general strategy for decision-making under uncertainty. We can state it more broadly:

**Principle: The Generalized UCB**
At each time step $t$, compute the following optimistic payoff for every arm $a$ and select the arm that maximizes it:
$$
\text{select } a_t = \arg\max_{a \in \mathcal{A}} \left( \hat{\mu}_{t-1}(x_{t,a}) + \kappa_{t-1}(x_{t,a}) \right)
$$
where:
*   $\hat{\mu}_{t-1}(x_{t,a})$ is our model's estimate of the mean reward for arm $a$ given features $x_{t,a}$, based on information up to step $t-1$. This will be the output of our neural network, $f(x_{t,a}; \hat{\theta}_{t-1})$.
*   $\kappa_{t-1}(x_{t,a})$ is the confidence width, or "exploration bonus," which quantifies the uncertainty of the estimate $\hat{\mu}$; the sum $\hat{\mu} + \kappa$ is the upper confidence bound itself. The bonus should be large when the model is uncertain and small when it is confident.
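
The selection rule can be sketched in a few lines. This is only an illustration: the function name `select_arm_ucb` and the numbers are made up for the example, and the per-arm estimates and bonuses are assumed to have been produced by some model already.

```python
import numpy as np

def select_arm_ucb(mu_hat, kappa):
    """Return the index of the arm with the largest optimistic payoff."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    scores = mu_hat + kappa          # estimated mean plus exploration bonus
    return int(np.argmax(scores))

# Hypothetical values for three arms.
mu_hat = [0.30, 0.25, 0.10]          # estimated mean rewards
kappa = [0.02, 0.15, 0.05]           # exploration bonuses
chosen = select_arm_ucb(mu_hat, kappa)
print(chosen)
```

In this toy case the second arm is chosen even though its estimated mean is lower than the first arm's, because its larger bonus makes its optimistic payoff the highest.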

This generalization presents us with a formidable new challenge. For linear models, the confidence bound $\kappa$ can be derived analytically. But how do we compute a meaningful confidence bound for the output of a complex, non-linear model like a deep neural network? This is the central question that neural bandit algorithms aim to answer.

#### **3.3 The NeuralUCB Ideal and its Computational Hurdle**

The most direct theoretical extension of LinUCB to the neural network setting is an algorithm known as NeuralUCB. The core idea is to approximate the complex neural network as a linear model in a very high-dimensional space: the space of its own gradients.

**Intuition First: A Linear Model in Gradient Space**

Imagine that instead of using our hand-crafted feature vector $x_{t,a}$, we could find a more expressive feature representation. The NeuralUCB algorithm proposes using the **gradient of the network's output with respect to its parameters** as this feature vector. This gradient vector, $\nabla_{\theta} f(x_{t,a}; \theta)$, tells us how a small change in each of the network's weights would affect the final prediction. It is a rich, high-dimensional representation of how sensitive the prediction for this particular input is to each parameter.

The algorithm then proceeds by running LinUCB in this gradient space.

**Definition 3.1: The Idealized NeuralUCB Algorithm**
The NeuralUCB agent is defined by:
1.  A neural network $f(x; \theta)$ with a total of $p$ trainable parameters, $\theta \in \mathbb{R}^p$.
2.  A shared $p \times p$ matrix $A$ and a shared $p \times 1$ vector $b$.

The payoff score for an arm $a$ at time $t$ is given by:
$$
p_{t,a} = \underbrace{f(x_{t,a}; \hat{\theta}_{t-1})}_{\text{Exploitation}} + \underbrace{\alpha \sqrt{ (\nabla g_{t,a})^T A_{t-1}^{-1} (\nabla g_{t,a}) }}_{\text{Exploration}}
$$
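where $\nabla g_{t,a} = \nabla_{\theta} f(x_{t,a}; \hat{\theta}_{t-1})$ is the gradient of the network's output with respect to its parameters, evaluated at the current parameter estimate. The updates to $A$ and $b$ follow the standard LinUCB rule with this gradient as the feature vector: $A_t = A_{t-1} + \nabla g_{t,a_t} (\nabla g_{t,a_t})^T$ and $b_t = b_{t-1} + r_t \nabla g_{t,a_t}$. Note that only $A$ enters the payoff score above; the mean estimate comes from the network itself rather than from $A^{-1} b$.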
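
To make the gradient-space bonus concrete, here is a minimal PyTorch sketch on a deliberately tiny network, so that the $p \times p$ matrix stays small. The layer sizes, the identity initialization of $A$, and the helper name `gradient_features` are assumptions chosen for illustration; this is not the reference implementation of NeuralUCB.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny hypothetical network: 4 inputs, 8 hidden units, 1 output.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
params = list(net.parameters())
p = sum(w.numel() for w in params)   # 4*8 + 8 + 8*1 + 1 = 49

def gradient_features(x):
    """Flattened gradient of the scalar output f(x; theta) w.r.t. theta."""
    out = net(x).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

alpha = 1.0
A = torch.eye(p)                     # assumed initialization: identity
x = torch.randn(4)                   # context features of one arm

g = gradient_features(x)
bonus = alpha * torch.sqrt(g @ torch.linalg.solve(A, g))
score = net(x).item() + bonus.item() # exploitation + exploration

# Rank-one update of A after this arm has been played.
A = A + torch.outer(g, g)
bonus_after = alpha * torch.sqrt(g @ torch.linalg.solve(A, g))
```

Playing an arm shrinks the bonus for the same context on the next evaluation, which is the behaviour the exploration term is meant to have.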

**The Computational Hurdle: The Curse of Dimensionality**

This formulation is theoretically elegant, but it presents a monumental practical challenge. The dimensionality of our "feature space" is now $p$, the total number of parameters in the neural network.

Let's calculate $p$ for a modest network architecture, similar to the one we might use for our Zooplus problem.
*   Input layer (size 9) to Hidden Layer 1 (size 128): $(9 \times 128) + 128_{\text{bias}} = 1,280$ parameters.
*   Hidden Layer 1 (128) to Hidden Layer 2 (64): $(128 \times 64) + 64_{\text{bias}} = 8,256$ parameters.
*   Hidden Layer 2 (64) to Output (1): $(64 \times 1) + 1_{\text{bias}} = 65$ parameters.
*   **Total parameters $p = 1,280 + 8,256 + 65 = 9,601$.**

The algorithm requires us to maintain and, critically, to **invert** the matrix $A$, which is of size $p \times p$, here $9601 \times 9601$.
*   **Memory:** Storing this single matrix in single precision would require $(9601)^2 \times 4 \text{ bytes/float} \approx 3.69 \times 10^{8}$ bytes, or roughly 369 megabytes of RAM.
*   **Computation:** The complexity of dense matrix inversion is roughly $O(p^3)$. For our network, a single `predict` step that naively recomputes the inverse would involve on the order of $(9601)^3 \approx 8.85 \times 10^{11}$ floating-point operations.
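
The arithmetic above is easy to reproduce. The short script below is a sketch in plain Python; the helper name `count_parameters` is introduced here for illustration, and the layer sizes are the ones listed above.

```python
# Layer widths of the fully connected network: 9 -> 128 -> 64 -> 1.
layer_sizes = [9, 128, 64, 1]

def count_parameters(sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

p = count_parameters(layer_sizes)
matrix_bytes = p * p * 4             # float32 storage for the p x p matrix
inversion_flops = p ** 3             # order-of-magnitude cost of one inversion

print(f"p = {p}")
print(f"memory for A: {matrix_bytes / 1e6:.1f} MB")
print(f"inversion cost: {inversion_flops:.2e} operations")
```

Changing `layer_sizes` shows how quickly the cost grows: the memory scales with $p^2$ and the inversion with $p^3$, so even a modest increase in layer width makes the matrix substantially more expensive.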

This is computationally non-viable for real-time decision-making on standard hardware. We have hit a wall. The theoretically pure approach is practically unusable. This is not a failure, but a critical insight: we need a more clever, practical approximation.

#### **3.4 The Practical Solution: Neural-LinUCB and Last-Layer Linearization**

The solution lies in a beautiful and highly effective hybrid approach that captures the spirit of NeuralUCB without inheriting its crippling computational cost. This algorithm is often called **Neural-LinUCB**.

**Intuition First: The Best of Both Worlds**

Instead of linearizing the *entire network*, we will linearize only its **final layer**. We can conceptually divide our neural network into two parts:
1.  **A Deep Feature Extractor, $\phi(x; \theta_{enc})$:** This part consists of all layers *except* the final output layer. Its job is to take the raw, sparse input features $x$ and transform them into a rich, dense **embedding vector** whose dimension is tiny compared with the $p$-dimensional gradient space. Let's say this embedding has a dimension $d_{emb} = 32$.
2.  **A Linear Predictor, $h(z; \theta_{out})$:** This is just the final linear layer of the network. It takes the learned embedding $z = \phi(x)$ as input and produces the final reward prediction.
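
One way to express this two-part split in PyTorch is sketched below. The attribute names `feature_extractor` and `output_layer` follow the chapter outline; the class name `TwoPartModel`, the `embed` helper, and the hidden width of 64 are assumptions made for this example.

```python
import torch
import torch.nn as nn

class TwoPartModel(nn.Module):
    """Network split into a feature extractor and a final linear layer."""

    def __init__(self, input_dim=9, hidden_dim=64, embedding_dim=32):
        super().__init__()
        # phi(x; theta_enc): every layer except the last one.
        self.feature_extractor = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim), nn.ReLU(),
        )
        # h(z; theta_out): the final linear predictor.
        self.output_layer = nn.Linear(embedding_dim, 1)

    def embed(self, x):
        """Return the embedding z = phi(x)."""
        return self.feature_extractor(x)

    def forward(self, x):
        """Return the reward prediction f(x) = h(phi(x))."""
        return self.output_layer(self.embed(x)).squeeze(-1)

model = TwoPartModel()
x = torch.randn(5, 9)                # a batch of 5 contexts with 9 features
z = model.embed(x)                   # embeddings, shape (5, 32)
y = model(x)                         # predictions, shape (5,)
```

Exposing `embed` separately from `forward` lets a bandit layer read the embeddings without touching the final prediction head.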

The key insight of Neural-LinUCB is this: we can run the fast and efficient LinUCB algorithm *on the learned embeddings*. The deep network learns the features, and the bandit algorithm manages the explore-exploit tradeoff on those features.

**Rigor: The Neural-LinUCB Formulation**

Let our neural network model be $f(x; \theta) = h(\phi(x; \theta_{enc}); \theta_{out})$, where $\theta = (\theta_{enc}, \theta_{out})$. The output of the feature extractor is the embedding $z_{t,a} = \phi(x_{t,a}; \hat{\theta}_{enc, t-1})$.

The payoff score is now calculated with respect to this embedding:

**Definition 3.2: The Neural-LinUCB Payoff Score**
$$
p_{t,a} = \underbrace{ (z_{t,a})^T \hat{\omega}_{t-1} }_{\text{Exploitation}} + \underbrace{ \alpha \sqrt{ (z_{t,a})^T A_{t-1}^{-1} z_{t,a} } }_{\text{Exploration}}
$$
where:
*   $z_{t,a} \in \mathbb{R}^{d_{emb}}$ is the embedding of the feature vector $x_{t,a}$.
*   $A$ is now a small, manageable $d_{emb} \times d_{emb}$ matrix.
*   $\hat{\omega}_{t-1} = A_{t-1}^{-1} b_{t-1}$ is the estimate of the final layer's weights.
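
Definition 3.2 translates almost directly into NumPy. The sketch below scores all arms at once; the function name `neural_linucb_scores`, the random embeddings, and the identity and zero initializations of $A$ and $b$ are illustrative assumptions, with 50 arms and $d_{emb} = 32$ taken from the running example.

```python
import numpy as np

def neural_linucb_scores(Z, A, b, alpha):
    """Payoff scores for embeddings Z of shape (n_arms, d_emb)."""
    omega_hat = np.linalg.solve(A, b)        # A^{-1} b, last-layer weights
    exploit = Z @ omega_hat                  # z^T omega_hat for every arm
    A_inv_Zt = np.linalg.solve(A, Z.T)       # A^{-1} z for every arm
    explore = alpha * np.sqrt(np.sum(Z * A_inv_Zt.T, axis=1))
    return exploit + explore

rng = np.random.default_rng(0)
d_emb, n_arms = 32, 50
Z = rng.normal(size=(n_arms, d_emb))         # stand-in for learned embeddings
A = np.eye(d_emb)                            # assumed initial A
b = np.zeros(d_emb)                          # assumed initial b

scores = neural_linucb_scores(Z, A, b, alpha=1.0)
best_arm = int(np.argmax(scores))
```

Using `np.linalg.solve` avoids forming $A^{-1}$ explicitly. With these initial values the exploitation term is zero, so each score reduces to $\alpha \lVert z \rVert$.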

**The Learning Process:** The update step is now a two-part process:
1.  **Update the Bandit:** For the chosen arm $a_t$, we update the small matrices $A$ and $b$ using the embedding $z_{t,a_t}$ as the feature vector. This is fast.
2.  **Train the Feature Extractor:** We also perform a standard gradient descent step on the entire network $f(x; \theta)$ to minimize the prediction error $(f(x_{t,a_t}; \theta) - r_t)^2$. This trains the feature extractor $\phi$ to produce embeddings that are useful for the final linear prediction task.
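
The two-part update can be sketched as follows. This is a simplified illustration, not the chapter's final agent: the function name `neural_linucb_update`, the layer sizes, the SGD optimizer, and the learning rate are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_emb = 9, 32

feature_extractor = nn.Sequential(
    nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_emb), nn.ReLU()
)
output_layer = nn.Linear(d_emb, 1)
model = nn.Sequential(feature_extractor, output_layer)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def neural_linucb_update(x, reward, A, b):
    """One interaction: update the bandit statistics, then train the network."""
    # 1) Bandit update in embedding space (no gradients needed here).
    with torch.no_grad():
        z = feature_extractor(x)
    A = A + torch.outer(z, z)
    b = b + reward * z
    # 2) One gradient step on the squared prediction error of the full network.
    optimizer.zero_grad()
    loss = (model(x).squeeze() - reward) ** 2
    loss.backward()
    optimizer.step()
    return A, b, loss.item()

A = torch.eye(d_emb)                 # assumed initialization
b = torch.zeros(d_emb)
x = torch.randn(d_in)                # context of the chosen arm
A, b, loss = neural_linucb_update(x, reward=1.0, A=A, b=b)
```

One consequence of this design is that $A$ and $b$ accumulate embeddings produced by earlier versions of the feature extractor, so they drift out of sync as the network's weights change. A practical agent has to decide how to handle this, for example by periodically rebuilding the statistics from stored interactions.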

This hybrid approach gives us the best of both worlds: the powerful representation learning of a deep network and the principled, computationally efficient exploration of LinUCB. We have tamed the complexity while preserving the power of generalization.

Let us now proceed to implement this practical and powerful agent.