## **Part II: Understanding the User's Journey and Network**

### **Chapter 3: The Session-Aware Recommender: Sequential Models for In-the-Moment Personalization**

#### **3.1 Introduction: Beyond a Single Click - Modeling User Intent**

In the previous chapters, we built and contrasted two powerful but fundamentally different recommendation paradigms. Our batched MLP model learned a static, long-term profile of a user's preferences. Our LinUCB agent learned to adapt to these preferences in real time, exploring deliberately to maximize reward over time. Yet both models share a critical blind spot: they are largely "memoryless" from one moment to the next. They treat each user interaction as an isolated event and ignore the rich context provided by the user's *immediate* actions within the current browsing session.

Consider Anna, our `new_puppy_parent`. Her long-term profile indicates a preference for dog-related products. But what is she trying to accomplish *right now*?
*   **Session A:** She clicks on "Puppy Food," then "Food Bowl," then "Water Dispenser." Her intent is clear: she is setting up the feeding station for her new pet. The next logical recommendation is not a random dog toy, but perhaps "Puppy Training Treats" or a "Placemat for Food Bowls."
*   **Session B:** A few weeks later, she clicks on "Flea & Tick Prevention," then "Dog Shampoo," then "Grooming Brush." Her current mission is pet hygiene. The best recommendation would be "Nail Clippers" or "Medicated Ear Wipes."

Both the static MLP and the contextual bandit would struggle to distinguish between these two sessions. They see a `new_puppy_parent` and recommend a high-average-CTR item like a "Dog Toy," missing the specific, in-the-moment user intent.

This is the limitation we will address in this chapter. We will move beyond single-interaction predictions and build a **session-aware recommender**. Our goal is to model the *sequence* of a user's actions to predict their next move. By understanding the "grammar" of a user's journey, we can achieve a much deeper and more responsive form of personalization.

#### **3.2 The Frontier Technique: Transformer Architectures for Recommendations**

To model sequences, the natural inclination for many years was to turn to Recurrent Neural Networks (RNNs) and their more powerful variants like LSTMs and GRUs. These models process a sequence element by element, maintaining a "hidden state" that acts as a memory of what has been seen so far. While effective, they suffer from two key weaknesses: difficulty in capturing very long-range dependencies and an inherently sequential nature that makes them difficult to parallelize during training.

The modern solution, which has revolutionized natural language processing and is now a frontier technique in recommendations, is the **Transformer architecture**. The power of the Transformer lies in its core mechanism: **self-attention**.

Instead of processing a sequence one item at a time, self-attention lets the model look at the entire sequence at once. For each item, it computes a set of "attention weights" that determine how much every other item in the sequence contributes to that item's updated representation. In our Zooplus example, when predicting the item that should follow "Puppy Food" -> "Food Bowl", the self-attention mechanism can learn that "Puppy Food" is a much more important clue than an unrelated item clicked at the beginning of the session.
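
To make the mechanism concrete, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The function name `self_attention`, the sizes, and the random matrices `W_q`, `W_k`, `W_v` (stand-ins for learned projections) are illustrative assumptions, not part of the chapter's model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) item embeddings for one session.
    Returns the context-aware outputs and the (seq_len, seq_len) weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise query-key similarity
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                       # hypothetical: a 3-click session
X = rng.normal(size=(seq_len, d_model))
# Random projections stand in for weights that would be learned in training.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

attn_out, attn_weights = self_attention(X, W_q, W_k, W_v)
print(attn_weights.round(3))  # row i: how much click i attends to each click
```

Row `i` of `attn_weights` is the distribution over the session's clicks that item `i` uses to build its new representation; a trained model would place more weight on the informative clicks.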

We will implement a specific, well-regarded Transformer-based model for recommendations: the **Behavioral Sequence Transformer (BST)**. The architecture is an elegant application of the Transformer's encoder block for the task of next-item prediction.

Here's a conceptual overview of the BST model:
1.  **Inputs:** The model takes two primary inputs: a user's recent behavior sequence (e.g., the last 10 products they clicked on) and a "target item" (a candidate product we are considering recommending).
2.  **Embedding:** All items in the behavior sequence and the target item are converted from IDs into dense, learned embedding vectors. This is the same `nn.Embedding` concept from Chapter 1.
3.  **Positional Encoding:** Because the self-attention mechanism itself has no inherent sense of order, we add a "positional embedding" to each item in the sequence. This vector encodes the item's position (e.g., 1st, 2nd, 3rd...), giving the model a crucial sense of temporality.
4.  **Transformer Encoder (Self-Attention):** The sequence of (item + positional) embeddings is fed into a Transformer Encoder layer. This layer performs self-attention, allowing every item to "communicate" with every other item. The output is a new sequence of context-aware embeddings, where each item's representation has been enriched with information from the rest of the sequence.
5.  **Aggregation & Prediction:** The context-aware embeddings from the sequence are aggregated (e.g., averaged or concatenated). This aggregated vector, representing the user's overall session intent, is then combined with the target item's embedding.
6.  **MLP Tower:** This final combined vector is passed through a few dense layers (an MLP) to produce a final prediction: the probability that the user will click on the target item, given their behavior sequence.
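
The role of step 3 can be checked with a small experiment. The sketch below, which assumes a single attention head followed by mean pooling and uses random vectors in place of learned embeddings, compares the pooled session vector for the same clicks in two different orders, with and without positional vectors added. The helper name `attend_and_pool` is hypothetical.

```python
import numpy as np

def attend_and_pool(X, W_q, W_k, W_v):
    """Single-head self-attention followed by mean pooling over the sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ V).mean(axis=0)

rng = np.random.default_rng(0)
seq_len, d = 4, 8
items = rng.normal(size=(seq_len, d))       # stand-in item embeddings for 4 clicks
positions = rng.normal(size=(seq_len, d))   # stand-in for learned positional embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

order = np.array([3, 1, 0, 2])              # the same clicks in a different order

# Without positions, reordering the clicks should not change the pooled vector.
same_without_pos = np.allclose(
    attend_and_pool(items, W_q, W_k, W_v),
    attend_and_pool(items[order], W_q, W_k, W_v),
)

# With positions added, slot p always receives positions[p], so order matters.
same_with_pos = np.allclose(
    attend_and_pool(items + positions, W_q, W_k, W_v),
    attend_and_pool(items[order] + positions, W_q, W_k, W_v),
)
print(same_without_pos, same_with_pos)
```

The first comparison is expected to hold and the second to fail: attention alone treats the session as an unordered set, and the positional term is what lets the model distinguish "Food Bowl, then Puppy Food" from the reverse.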

Let's begin by preparing our data for this new, sequence-aware model.

#### **3.3 Building a Behavioral Sequence Transformer (BST) for Zooplus**

**Step 1: Preparing Sequential Data**

Our `training_data` from Chapter 1 is a simple log of `(user_id, product_id, clicked)` events. To train a sequential model, we need to transform this into samples of `(history_sequence, target_item, label)`. We keep only the clicked events, order them per user as they appear in the log, and create sliding windows over each user's interaction history.

For a user who interacted with items `[p1, p2, p3, p4, p5]`, we can generate the following training examples:
*   `history=[p1]`, `target=p2`
*   `history=[p1, p2]`, `target=p3`
*   `history=[p1, p2, p3]`, `target=p4`
*   ...and so on.

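Since our model requires fixed-size inputs, we define a maximum sequence length and left-pad any shorter history with a reserved value, `0`. Because `0` must never collide with a real product, we also shift product IDs so that they run from 1 to `n_products`.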
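
As a quick illustration of the windowing and padding described above, here is a dependency-free sketch. The function name `sliding_windows`, the product IDs, and the short `max_len=4` are hypothetical choices made to keep the output readable; padding is placed on the left with the value `0`.

```python
def sliding_windows(items, max_len=4, pad_value=0):
    """Yield (left-padded history, target) pairs for one user's click list."""
    for i in range(1, len(items)):
        history = items[max(0, i - max_len):i]            # most recent items, at most max_len
        padding = [pad_value] * (max_len - len(history))  # pad on the left
        yield padding + history, items[i]

clicks = [11, 12, 13, 14, 15, 16]   # hypothetical product IDs; 0 is reserved for padding
windows = list(sliding_windows(clicks, max_len=4))
for history, target in windows:
    print(history, "->", target)
```

Left-padding keeps the most recent click in the last slot of every history, so the final position always carries the same meaning regardless of how long the history is.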

**Code Block 3.1: Creating Sequential Training Data**
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# We will reuse the simulator and data generation from Chapter 1
# Let's re-run them here for a self-contained chapter.
# (Code from Chapter 1)
class ZooplusSimulator:
    def __init__(self, n_products=50, n_users=1000, seed=42):
        self.rng = np.random.default_rng(seed)
        self.n_products = n_products
        self.n_users = n_users
        self.products = self._create_product_catalog()
        self.personas = self._create_user_personas()
        self.user_to_persona_map = self._assign_users_to_personas()
    def _create_product_catalog(self):
        product_ids = range(self.n_products)
        categories = ['Fish Supplies', 'Cat Food', 'Dog Food', 'Dog Toy', 'Cat Toy']
        num_per_category = self.n_products // len(categories)
        cat_list = []
        for cat in categories: cat_list.extend([cat] * num_per_category)
        cat_list.extend(self.rng.choice(categories, self.n_products - len(cat_list)))
        return pd.DataFrame({'product_id': product_ids, 'category': self.rng.permutation(cat_list)}).set_index('product_id')
    def _create_user_personas(self):
        return {
            'new_puppy_parent': {'Dog Food': 0.40, 'Dog Toy': 0.50, 'Cat Food': 0.10, 'Cat Toy': 0.05, 'Fish Supplies': 0.02},
            'cat_connoisseur':  {'Dog Food': 0.05, 'Dog Toy': 0.02, 'Cat Food': 0.55, 'Cat Toy': 0.45, 'Fish Supplies': 0.05},
            'budget_shopper':   {'Dog Food': 0.25, 'Dog Toy': 0.15, 'Cat Food': 0.40, 'Cat Toy': 0.20, 'Fish Supplies': 0.20},
            'fish_hobbyist':    {'Dog Food': 0.02, 'Dog Toy': 0.02, 'Cat Food': 0.10, 'Cat Toy': 0.08, 'Fish Supplies': 0.60}
        }
    def _assign_users_to_personas(self):
        persona_names = list(self.personas.keys())
        return {user_id: self.rng.choice(persona_names) for user_id in range(self.n_users)}
    def get_reward(self, user_id, product_id):
        # Adjust for 1-based indexing if product_id is not in index
        if product_id not in self.products.index: return 0.0
        persona_name = self.user_to_persona_map[user_id]
        persona_prefs = self.personas[persona_name]
        product_category = self.products.loc[product_id, 'category']
        click_prob = persona_prefs.get(product_category, 0.01)
        return self.rng.binomial(1, click_prob)
    def get_random_user(self):
        return self.rng.integers(0, self.n_users)

def generate_training_data(simulator, num_interactions):
    user_ids, product_ids, clicks = [], [], []
    for _ in range(num_interactions):
        user_id = simulator.get_random_user()
        # Adjust to generate product_ids from 1 to n_products
        product_id = simulator.rng.integers(1, simulator.n_products + 1)
        click = simulator.get_reward(user_id, product_id)
        user_ids.append(user_id); product_ids.append(product_id); clicks.append(click)
    return pd.DataFrame({'user_id': user_ids, 'product_id': product_ids, 'clicked': clicks})

# product_id 0 is reserved as the padding token, so real product IDs run from 1 to n_products.
# We shift the simulator's catalog index to match (1-based indexing for products).
sim = ZooplusSimulator(n_products=50, seed=42)
sim.products.index = sim.products.index + 1 # Shift index to be 1 to 50
# We only need the interactions of users who actually clicked on something
interaction_log = generate_training_data(sim, 200_000)
interaction_log = interaction_log[interaction_log['clicked'] == 1].drop('clicked', axis=1)


def create_sequences(df, max_len=10):
    """Transforms interaction log into sequences for BST."""
    sequences = []
    # Group by user and create a list of their clicked product_ids
    df_sorted = df.sort_index() # Ensure interactions are in order if not already
    user_groups = df_sorted.groupby('user_id')['product_id'].apply(list)

    for user_id, user_interactions in tqdm(user_groups.items(), desc="Creating Sequences"):
        if len(user_interactions) < 2:
            continue # Need at least one history item and one target item
        
        # Create sliding windows
        for i in range(1, len(user_interactions)):
            # History is items up to i-1, target is item i
            history = user_interactions[max(0, i - max_len):i]
            target = user_interactions[i]
            
            # Pad history to max_len
            # We use a padding value of 0, which is why we shifted product_ids to start at 1
            padded_history = np.pad(history, (max_len - len(history), 0), 'constant', constant_values=0)
            
            sequences.append((padded_history, target))
            
    # Convert to DataFrame
    seq_df = pd.DataFrame(sequences, columns=['history', 'target_item_id'])
    return seq_df

# Generate sequences
MAX_SEQ_LEN = 10
seq_data = create_sequences(interaction_log, max_len=MAX_SEQ_LEN)
print("Original positive sequences:", len(seq_data))

# Step 1.b: Negative Sampling
# For each positive sequence (history -> target), we create a negative one
# by pairing the same history with a randomly sampled item.
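def generate_negative_samples(df, n_products, seed=42):
    rng = np.random.default_rng(seed)  # Seeded so the sampled negatives are reproducible
    pos_df = df.copy()  # Copy so the caller's DataFrame is not mutated
    neg_df = df.copy()
    # Sample a random product_id in [1, n_products] for each row.
    # A sample can occasionally coincide with the true target, which adds a little label noise.
    neg_df['target_item_id'] = rng.integers(1, n_products + 1, size=len(df))

    # Add labels
    pos_df['label'] = 1
    neg_df['label'] = 0

    # Combine and shuffle
    return pd.concat([pos_df, neg_df]).sample(frac=1, random_state=42).reset_index(drop=True)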

# Now generate the final training set
final_training_data = generate_negative_samples(seq_data, sim.n_products)

print(f"Total training samples (positive + negative): {len(final_training_data):,}")
print("\nExample of final training data:")
print(final_training_data.head())
```

<pre>
Creating Sequences: 100%|██████████| 999/999 [00:00&lt;00:00, 1952.47it/s]
</pre>

<pre>
Original positive sequences: 41040
Total training samples (positive + negative): 82,080

Example of final training data:
                            history  target_item_id  label
0  [0, 0, 0, 0, 0, 0, 0, 0, 48, 12]              20      0
1   [0, 0, 0, 0, 0, 0, 0, 4, 38, 2]              31      1
2    [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]              25      1
3  [0, 0, 0, 0, 0, 0, 0, 0, 31, 28]               2      0
4   [0, 0, 0, 0, 0, 0, 0, 0, 2, 45]              49      1
</pre>
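
Uniform random negatives are simple, but a sampled item can coincide with the positive target or with something the user has just clicked. A stricter variant, sketched below under the assumption that recently clicked items should not be treated as negatives, excludes both. The helper `sample_negative` is hypothetical and is not used elsewhere in this chapter.

```python
import numpy as np

def sample_negative(history, positive_target, n_products, rng):
    """Draw a product ID in 1..n_products that is neither the positive target
    nor an item already present in the (padded) history."""
    excluded = {int(h) for h in history if h != 0} | {int(positive_target)}
    candidates = [p for p in range(1, n_products + 1) if p not in excluded]
    return int(rng.choice(candidates))

rng = np.random.default_rng(42)
history = [0, 0, 0, 0, 0, 0, 0, 4, 38, 2]   # a padded history like those shown above
negative = sample_negative(history, positive_target=31, n_products=50, rng=rng)
print(negative)
```

Whether to exclude history items is a design choice: it removes some contradictory labels, at the cost of never teaching the model that a user may not want to see an item twice.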

**Step 2: Implementing the BST Model in PyTorch**

With our data prepared, we can now define the model architecture. We will use PyTorch's built-in `nn.TransformerEncoderLayer` as it provides a robust and optimized implementation of the self-attention mechanism.

A key detail is the handling of padding. We need a padding mask, passed to the encoder as `src_key_padding_mask`, that tells the Transformer which positions in the sequence hold real items and which are just padding. With this mask in place, no item attends to a padding token.
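
The effect of the mask can be verified in isolation. The sketch below uses `nn.MultiheadAttention` directly (the attention module inside each encoder layer) with its `key_padding_mask` argument; the tensor sizes and the random inputs standing in for embeddings are assumptions for illustration. The module is left in its default mode, which is deterministic here because its dropout defaults to 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, n_heads = 8, 2
attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

history = torch.tensor([[0, 0, 7, 3]])      # left-padded sequence; 0 is padding
padding_mask = history == 0                 # True marks keys to ignore

x = torch.randn(1, 4, embed_dim)            # stand-in for item + positional embeddings
x_changed = x.clone()
x_changed[:, :2] = torch.randn(1, 2, embed_dim)   # alter only the padded positions

with torch.no_grad():
    out, weights = attn(x, x, x, key_padding_mask=padding_mask)
    out_changed, _ = attn(x_changed, x_changed, x_changed,
                          key_padding_mask=padding_mask)

# Outputs at the real positions do not depend on what sits in the padded slots.
print(torch.allclose(out[:, 2:], out_changed[:, 2:], atol=1e-6))
# Attention weights on the padded keys are zero for every query.
print(weights[0, :, :2])
```

Note that the mask only stops padding from being attended *to*. The encoder still produces an output vector at each padded position, which is why the aggregation step must exclude those positions as well.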

**Code Block 3.2: The Behavioral Sequence Transformer (BST) Model**
```python
# --- Device Configuration ---
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")
    print("Using MPS (Apple Silicon) device.")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using CUDA device.")
else:
    device = torch.device("cpu")
    print("Using CPU device.")

class BSTDataset(Dataset):
    def __init__(self, df):
        self.histories = [torch.tensor(h, dtype=torch.long) for h in df.history.values]
        self.targets = torch.tensor(df.target_item_id.values, dtype=torch.long)
        self.labels = torch.tensor(df.label.values, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.histories[idx], self.targets[idx], self.labels[idx]

class BehavioralSequenceTransformer(nn.Module):
    def __init__(self, n_products, max_len, embed_dim=64, n_heads=4, n_layers=2, dropout=0.2):
        super().__init__()
        # Note: n_products + 1 to account for the padding token (0)
        self.item_embedding = nn.Embedding(n_products + 1, embed_dim, padding_idx=0)
        # Learnable positional embedding
        self.positional_embedding = nn.Embedding(max_len + 1, embed_dim)
        
        # Standard Transformer Encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, 
            nhead=n_heads, 
            dropout=dropout,
            batch_first=True # Important: our data is (batch, seq, feature)
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        
        # Prediction MLP
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * 2, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 1)
        )
        self.sigmoid = nn.Sigmoid()
        self.max_len = max_len

    def forward(self, history_seq, target_item):
        # Create attention mask for padding
        # Shape: (batch_size, seq_len) -> True for padding tokens
        attention_mask = (history_seq == 0)

        # 1. Embeddings and Positional Encoding
        item_embeds = self.item_embedding(history_seq)
        
        # Create position IDs (1 to max_len)
        positions = torch.arange(1, self.max_len + 1, device=history_seq.device).unsqueeze(0)
        pos_embeds = self.positional_embedding(positions)
        
        # Add embeddings together
        seq_embeds = item_embeds + pos_embeds

        # 2. Transformer Encoder
        # The mask will prevent attention to padding tokens
        transformer_out = self.transformer_encoder(seq_embeds, src_key_padding_mask=attention_mask)
        
        # 3. Aggregation
        # Average the transformer outputs over real (non-padding) positions only.
        # masked_fill is out-of-place, so it does not modify a tensor autograd may still need.
        transformer_out = transformer_out.masked_fill(attention_mask.unsqueeze(-1), 0.0)
        # Sum non-padded embeddings and divide by the number of non-padded items
        seq_representation = torch.sum(transformer_out, dim=1)
        non_pad_counts = (history_seq != 0).sum(dim=1, dtype=torch.float32).unsqueeze(1)
        seq_representation = seq_representation / torch.clamp(non_pad_counts, min=1.0) # Avoid division by zero
        
        # 4. Prediction
        target_embed = self.item_embedding(target_item)
        combined_rep = torch.cat([seq_representation, target_embed], dim=1)
        
        output = self.mlp(combined_rep)
        return self.sigmoid(output)

# --- Instantiate Dataset and Dataloaders ---
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(final_training_data, test_size=0.2, random_state=42)
train_dataset = BSTDataset(train_df)
val_dataset = BSTDataset(val_df)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=256, shuffle=False)

print("Datasets and DataLoaders are ready.")
```
```
Using MPS (Apple Silicon) device.
Datasets and DataLoaders are ready.
```
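
The masked averaging used in the model's aggregation step is easy to get subtly wrong, so it is worth checking on a tiny input where the answer can be computed by hand. This is a standalone sketch: the `encoder_out` tensor is a made-up stand-in for the Transformer output, not a value produced by the model above.

```python
import torch

history = torch.tensor([[0, 0, 7, 3],
                        [5, 9, 2, 8]])     # batch of 2 left-padded sequences
# Fake encoder output of shape (batch=2, seq_len=4, embed_dim=3) with easy numbers.
encoder_out = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)

pad_mask = history == 0                                          # True for padding
masked = encoder_out.masked_fill(pad_mask.unsqueeze(-1), 0.0)    # zero out padded rows
counts = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)       # real items per sequence
pooled = masked.sum(dim=1) / counts

print(pooled)
```

For the first sequence only the last two positions are real, so the pooled vector is the mean of `[6, 7, 8]` and `[9, 10, 11]`. The second sequence has no padding and is averaged over all four positions.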