![Factored Banner](images/Factored_Logo_Profile_Asset_Cover-.png)

---
# 🇨🇴 **ColombiaTechFest – Workshop – Factored**
## 🚀 **Beyond Matrix Factorization: Deep RecSys Architectures in Action**

Recommender systems are one of the 💎 **most impactful applications of machine learning** in business today. They help people navigate the endless variety of products and services companies offer, from 🎵 **Spotify** playlists tailored to your mood to 🍿 **Netflix** suggesting your next binge-worthy series.  

They’re everywhere in our daily lives, and when used effectively they can **boost engagement, satisfaction, and business growth** 📈.

---

### 🛠️ **What’s in this workshop?**

In this **hands-on session**, we’ll start with **traditional approaches** like collaborative filtering and **Matrix Factorization**, and then step into the world of **modern deep learning architectures** that power today’s most sophisticated platforms.  

You’ll leave with:
- 🧠 **Conceptual understanding** of core and advanced models.
- 💻 **Code examples** you can run and adapt.
- 🗺️ **Guidance** on when to use each approach.

--- 
## Workshop Repository: https://github.com/factoredai/eb-recsys-overview-workshop

---

## **Quick Overview: Traditional Recommenders**

### 🧩 **From Matrix Factorization to Factorization Machines**

**🔍 Matrix Factorization Recap**  
Matrix Factorization (MF) is a foundation of collaborative filtering. It learns:
- 👤 **User vectors** → latent preferences.
- 🎯 **Item vectors** → latent attributes.

Prediction is the dot product of the two:
$$
\hat{r}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i
$$

Where:
- $\mathbf{p}_u$ = latent vector for user $u$
- $\mathbf{q}_i$ = latent vector for item $i$

✅ **Strengths**: Great for uncovering hidden patterns in sparse user–item data.  
⚠️ **Limitations**: Only models interactions between *user ID* and *item ID*. No easy way to add context or side features.

![Matrix Factorization](images/Matrix_Architecture.png)
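
As a minimal illustration of the dot-product prediction above, the sketch below uses small, randomly initialized factor matrices `P` and `Q`. These are hypothetical toy values, not factors learned from the workshop data; the point is only to show how one score and the full score matrix are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users_toy, n_items_toy, k = 4, 5, 3   # toy sizes, chosen only for illustration

P = rng.normal(scale=0.1, size=(n_users_toy, k))  # one latent vector p_u per user
Q = rng.normal(scale=0.1, size=(n_items_toy, k))  # one latent vector q_i per item

u, i = 2, 4
r_hat_ui = P[u] @ Q[i]   # prediction for a single (user, item) pair
R_hat = P @ Q.T          # predictions for every pair, shape (n_users_toy, n_items_toy)

print(r_hat_ui, R_hat.shape)
```

In a trained model, `P` and `Q` would be fitted so that `R_hat` matches the observed ratings as closely as possible.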

---

### 🌟 **Factorization Machines (FMs)**

**💡 Concept & Motivation**  
Factorization Machines extend MF to model interactions between **any pair of features**, not just users and items. That means you can mix:
- 🆔 IDs (users, items)
- 🏷️ Metadata (genres, categories)
- ⏱️ Context (time, location)

This makes them a good fit for **sparse, high-dimensional data**, as in CTR prediction and modern recommender systems.

They bridge the gap between:
- 📏 **Linear models** → handle individual features well.
- 🔄 **Nonlinear models** → capture complex feature interactions.

**📐 Mathematical Formulation**  
$$
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
+ \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j
$$

Where:
- $\mathbf{x}$ = feature vector (user ID, item ID, side features)
- $w_0$ = global bias
- $w_i$ = weight for feature $i$
- $\mathbf{v}_i \in \mathbb{R}^k$ = latent vector for feature $i$
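
The double sum over pairs looks quadratic in the number of features, but it can be rewritten as $\frac{1}{2}\sum_{f=1}^{k}\big[(\sum_i v_{i,f}x_i)^2 - \sum_i v_{i,f}^2 x_i^2\big]$, which is linear in $n$. The sketch below checks this identity numerically on a toy dense vector; `x` and `V` are random stand-ins introduced only for the check.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 6, 3                    # toy number of features and latent factors
x = rng.random(n)              # dense toy feature vector
V = rng.normal(size=(n, k))    # row i is the latent vector v_i

# Naive pairwise sum over i < j: O(n^2 * k)
naive = sum(
    (V[i] @ V[j]) * x[i] * x[j]
    for i in range(n)
    for j in range(i + 1, n)
)

# Linear-time form: 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
xv = x @ V                                          # shape (k,)
fast = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))

print(np.isclose(naive, fast))
```

The `FactorizationMachine` class in this notebook relies on the same rewritten form for its interaction term.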

In [None]:
import numpy as np
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten, Concatenate

import functions_workshop as fn

---

### 📝 Tasks 1 to 3 – Preparing the Dataset

Before we dive into modeling, let’s make sure our data is in the right shape.  
In these tasks, you’ll load and prepare the **MovieLens dataset** so it’s ready for training recommender models.

#### ‚úÖ Steps

1. **Load the Data**  
   - Read the three CSV files into separate **Pandas DataFrames**.

2. **Merge the Tables**  
   - Combine them so that each row contains the following fields:  
     - `userId`  
     - `movieId`  
     - `rating`  

3. **Encode IDs**  
   - Reindex `userId` and `movieId` so they become **consecutive integers starting from 0**.  
   - You can use `LabelEncoder` or create a manual mapping with Pandas.
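
The encoding step above can be sketched on a toy table. This is one possible approach under assumed column names (`userId`, `movieId`, `rating`); the toy values are made up, and the expected solution in `src/task3.py` may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy ratings table; the values are illustrative only
toy = pd.DataFrame({
    "userId":  [10, 10, 42, 97],
    "movieId": [500, 7, 7, 1200],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# Option 1: LabelEncoder maps the sorted unique IDs to 0..n-1
toy["u_idx"] = LabelEncoder().fit_transform(toy["userId"])

# Option 2: a manual mapping with pandas (codes follow order of first appearance)
toy["m_idx"], movie_ids = pd.factorize(toy["movieId"])

print(toy)
```

Both options yield consecutive integers starting from 0; they differ only in which original ID gets which index.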

---

💡 *Hint*: Consecutive integer IDs can be used directly as row indices into embedding tables and as one-hot columns, which is what lets both the factorization models and the deep learning models handle users and items efficiently.  

In [None]:
# === TO DO 1 ===
# Load the MovieLens dataset from the CSV files into Pandas DataFrames
# Complete the function load_movielens_data in task1.py
# The function should return three DataFrames: users, movies, and ratings
from src.task1 import load_movielens_data

users, movies, ratings = load_movielens_data()

In [None]:
# === TO DO 2 ===
# Merge the three DataFrames into one called `data`
# The final DataFrame should contain: userId, movieId, and rating
# Complete the function merge_data in task2.py
# The function should return a DataFrame with the merged data
from src.task2 import merge_data
data = merge_data(users, movies, ratings)
data.head()

In [None]:
# === TO DO 3 ===
# Encode userId and movieId as integer indices for embeddings/one-hot encoding
# Add two new columns to `data`: 
#   - u_idx (encoded userId)
#   - m_idx (encoded movieId)
#
# Complete the function encode_user_movie_ids in task3.py
# The function should return the DataFrame with the new columns

from src.task3 import encode_user_movie_ids
data = encode_user_movie_ids(data)
print(data[['u_idx', 'm_idx']].head())

In [None]:
### Get target variable and number of users/items
y = data["Rating"].astype(float).values
n_users = data["u_idx"].nunique()
n_items = data["m_idx"].nunique()

# genres split (for models that use side features)
data["Genres_list"] = data["Genres"].fillna("(no genres listed)").str.split("|")
id_to_movie = data[['m_idx', 'Title']].drop_duplicates().reset_index(drop=True)

print(data[['u_idx', 'm_idx', 'Rating','Genres_list','Gender','Age']].head())

In [None]:
### Implementation of the Factorization Machine model
class FactorizationMachine:
    """
    A simple Factorization Machine for regression (e.g., rating prediction)
    trained with stochastic gradient descent (SGD).

    Parameters
    ----------
    n_features : int
        Total number of input features (columns in X).
    k : int, default=10
        Number of latent factors for modeling pairwise interactions.
    learning_rate : float, default=0.01
        SGD step size.
    n_iter : int, default=100
        Number of passes (epochs) over the training data.

    Notes
    -----
    - Assumes X is a scipy.sparse CSR matrix for efficiency.
    - Uses squared error loss: (y - y_hat)^2
    """

    def __init__(self, n_features, k=10, learning_rate=0.01, n_iter=100):
        self.k = k
        self.lr = learning_rate
        self.n_iter = n_iter

        # Model parameters:
        self.w0 = 0.0                          # global bias
        self.W = np.zeros(n_features)          # linear weights
        # latent factors initialized small (Gaussian)
        self.V = np.random.normal(scale=0.01, size=(n_features, k))

    def _predict_instance(self, x):
        """
        Predict a single instance.
        x: 1xN sparse row (CSR format expected)
        """
        # Linear term: w0 + x · W
        # x.dot(self.W) returns a (1,) ndarray, so take its single element
        linear = self.w0 + x.dot(self.W)[0]

        # Interaction term using the FM identity:
        # 0.5 * [ (xV)^2 - (x^2)(V^2) ] summed over features and factors
        # x.dot(self.V) -> shape (1, k)
        xv = x.dot(self.V)               # (1, k)
        xv_sq = np.sum(xv**2)            # scalar
        # (x.multiply(x)) keeps sparsity; (V**2) is dense (N,k); result is (1,k)
        x_sq_v_sq = (x.multiply(x)).dot(self.V**2)
        x_sq_v_sq_sum = np.sum(x_sq_v_sq)

        interactions = 0.5 * (xv_sq - x_sq_v_sq_sum)

        # Both linear and interactions are scalars now
        return float(linear + interactions)

    def predict(self, X):
        """
        Predict every row of X.
        Loops over rows and reuses _predict_instance, so it is not vectorized,
        but it works fine with sparse input.
        """
        return np.array([self._predict_instance(X[i]) for i in range(X.shape[0])])

    def fit(self, X, y):
        """
        Train with simple SGD over epochs.

        X: CSR matrix of shape (n_samples, n_features)
        y: array of shape (n_samples,)
        """
        assert isinstance(X, csr_matrix), "Use a CSR sparse matrix for X."

        for _ in range(self.n_iter):
            for i in range(X.shape[0]):
                x_i = X[i]                    # 1xN CSR sparse row
                y_hat = self._predict_instance(x_i)
                error = y[i] - y_hat          # residual

                # === Update w0 (scalar) ===
                self.w0 += self.lr * error

                # === Update W (linear weights) ===
                # Only update positions where x_i is non-zero for sparsity efficiency
                # x_i.indices -> non-zero column indices
                # x_i.data    -> non-zero values at those indices
                for idx, val in zip(x_i.indices, x_i.data):
                    self.W[idx] += self.lr * error * val

                # === Update V (latent factors) ===
                # For each factor f, use the FM gradient:
                # dL/dV[j,f] = -error * ( x_j * ( sum_l x_l * V[l,f] - V[j,f] * x_j ) )
                # We compute xV[:, f] once, then update only non-zero j.
                xV = x_i.dot(self.V)  # shape (1, k)
                for f in range(self.k):
                    xV_f = xV[0, f]   # scalar
                    for j, xj in zip(x_i.indices, x_i.data):
                        v_jf = self.V[j, f]
                        grad = error * (xj * (xV_f - v_jf * xj))
                        self.V[j, f] += self.lr * grad

In [None]:
### Use one hot encoding to create user and item features, also create sparse matrix for Factorization Machine
ohe_u = OneHotEncoder(handle_unknown="ignore")
ohe_m = OneHotEncoder(handle_unknown="ignore")

X_u = ohe_u.fit_transform(data[["u_idx"]])     # (n, n_users)
X_m = ohe_m.fit_transform(data[["m_idx"]])     # (n, n_items)
X_fm = hstack([X_u, X_m]).tocsr()            # (n, n_users+n_items)

### Train the Factorization Machine model
fm = FactorizationMachine(n_features=X_fm.shape[1], k=3, learning_rate=0.1, n_iter=1)
fm.fit(X_fm, y)

In [None]:
### Get recommendations for a specific user
user_idx = 10  # encoded user
n_items = len(ohe_m.categories_[0])  # number of encoded items

# Build features for (user_idx, each item)
u_vecs = ohe_u.transform([[user_idx]] * n_items)  # repeat user row
m_vecs = ohe_m.transform([[i] for i in range(n_items)])

X_all_items = hstack([u_vecs, m_vecs]).tocsr()

# Predict ratings
preds = fm.predict(X_all_items)

# Rank items by predicted score
top_n = 5
top_items = np.argsort(preds)[::-1][:top_n]

### Get original movie titles for the top recommended items
# inverse_transform returns shape (top_n, 1); flatten to plain m_idx values
top_item_ids = ohe_m.inverse_transform(m_vecs[top_items]).ravel()
for m in top_item_ids:
    match = id_to_movie.loc[id_to_movie.m_idx == m, "Title"]
    movie_rec = match.iloc[0] if len(match) else "Unknown"
    print(f"Recommended Movie ID: {m}, Title: {movie_rec}")

# **Deep RecSys Architectures**

## **Part 2 – Wide & Deep Learning (WDL)**  

**💡 Concept & Motivation**  
- Introduced by Google in 2016 for **large-scale recommendations** and **click-through rate (CTR) prediction**.  
- **Wide** 🏎️ = memorization of **explicit feature interactions** (fast, rule-based learning).  
- **Deep** 🧠 = generalization via **embeddings** and **neural networks** (learn hidden patterns).  

**🏗️ Architecture**  
- **Wide branch**: Generalized Linear Model (GLM) with original features + cross features.  
- **Deep branch**: Embedding layers for categorical features + dense layers for non-linear transformations.  
- Outputs from both branches are **concatenated** and fed into a final prediction layer.  

![Wide & Deep Architecture](images/Wide_Deep_Models_Architecture.png)  

---

### 🔍 **Why Both Components Are Necessary**

#### 1️⃣ Complementary Strengths
| Component | Best At... | How It Works |
| --------- | ---------- | ------------ |
| **Wide**  | Frequent, memorized patterns | Uses explicit cross-features to memorize known rules |
| **Deep**  | Unseen or complex patterns   | Uses learned embeddings + nonlinearities to generalize |

✅ **Wide** → memorizes co-occurrence rules.  
✅ **Deep** → generalizes to rare or never-before-seen combinations.

---

#### 2️⃣ Cold-Start & Long-Tail Handling
- **Wide branch** handles "hot" user–item combos efficiently.  
- **Deep branch** helps with cold-start problems by using **shared embeddings** and generalizing from similar known items/users.  
- The deep path can make a prediction for an unseen pair if their embeddings are close to known patterns.

---

#### 3️⃣ Balanced Recommendations
- **Wide** keeps strong priors: “User 123 always buys Item 456.”  
- **Deep** promotes diversity and serendipity by exploring subtle or novel associations.  
- This balance helps avoid overfitting to popular items while still recommending relevant content.

---

#### 4️⃣ Scalability & Performance
- **Wide branch** → sparse and fast (especially at inference time).  
- **Deep branch** → slower but more expressive.  
- Together, they balance **speed vs. capacity**, a critical trade-off in real-world production RecSys.

---

In [None]:
# ==== Build features ====
u_idx = data["u_idx"].astype("int32").values
m_idx = data["m_idx"].astype("int32").values

ohe_demo = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
X_demo   = ohe_demo.fit_transform(data[['Age','Gender', 'Occupation', 'Zip-code']])

### 📝 Task 4 – Add extra features to the dataset
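
As a warm-up for this task, here is how a multi-hot genre encoding can look on a toy list. This is a sketch of one possible approach; `genres_toy`, `mlb_toy`, and `X_gen_toy` are illustrative names, and the expected solution in `src/task4.py` may differ.

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MultiLabelBinarizer

# Each entry is the list of genres for one sample (toy values)
genres_toy = [["Action", "Comedy"], ["Drama"], ["Comedy", "Drama"]]

mlb_toy = MultiLabelBinarizer()
X_gen_toy = csr_matrix(mlb_toy.fit_transform(genres_toy))  # multi-hot, stored sparse

print(mlb_toy.classes_)       # column order of the encoding
print(X_gen_toy.toarray())
```

Unlike one-hot encoding, a row can contain several ones, one for each genre the movie belongs to.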


---

In [None]:
# === TO DO 4 ===
# Create a multi-label binarizer for movie genres using MultiLabelBinarizer.
# Transform the "Genres_list" column into a multi-hot encoding and wrap it 
# with `csr_matrix` for efficiency.

from src.task4 import get_wide_input
X_wide, X_gen, mlb = get_wide_input(data, X_demo)

wide_dim = X_wide.shape[1]

print(f"Users: {n_users}, Items: {n_items}, Wide dims: {wide_dim}, Samples: {len(data)}")

In [None]:
# ==== 2) Model: Wide & Deep for ratings ====
def create_wide_deep_regression(num_users, num_items, wide_dim, embedding_dim=16):
    user_in = Input(shape=(), dtype="int32", name="user_id")
    item_in = Input(shape=(), dtype="int32", name="item_id")
    wide_in = Input(shape=(wide_dim,), dtype="float32", name="wide")

    # Deep branch (embeddings)
    u_emb = Embedding(num_users, embedding_dim, name="user_emb")(user_in)
    i_emb = Embedding(num_items, embedding_dim, name="item_emb")(item_in)
    deep  = Concatenate()([Flatten()(u_emb), Flatten()(i_emb)])
    deep  = Dense(64, activation="relu")(deep)
    deep  = Dense(32, activation="relu")(deep)

    # Merge wide + deep
    x = Concatenate()([deep, wide_in])
    out = Dense(1, activation=None, name="rating")(x)

    model = Model([user_in, item_in, wide_in], out)
    model.compile(optimizer="adam",
                  loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse")])
    return model

model = create_wide_deep_regression(n_users, n_items, wide_dim=wide_dim, embedding_dim=16)
model.summary()

In [None]:
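# ==== 3) Train ====
# Note: X_wide.toarray() densifies the sparse wide matrix, which can use a lot of memory on large datasets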
history = model.fit(
    x={"user_id": u_idx, "item_id": m_idx, "wide": X_wide.toarray()},
    y=y,
    batch_size=1024,
    epochs=3,
    verbose=1
)

In [None]:
top_recs = fn.recommend_for_uidx_wide(u_idx=10, data=data, model=model,
                              ohe_demo=ohe_demo, mlb=mlb, top_n=5)
for mid, title, score in top_recs:
    print(f"{title}  (pred: {score:.2f})")

---
## **Part 3 – Two-Tower Models** 🏛️🏛️

Two-Tower models split representation learning into **two parallel “towers”**, one for users and one for items.  
This design lets us **precompute item embeddings** and perform **lightning-fast retrieval** at scale. 🚀

---

### 🏗️ **Model Structure**

**👤 User Tower**
- **Inputs**: User ID (embedded), demographics (e.g., age, region), recent interactions (past item IDs, session stats).
- **Output**: Dense user vector $\mathbf{u}_u \in \mathbb{R}^d$

**🎯 Item Tower**
- **Inputs**: Item ID (embedded), metadata (category, price), content features (text embeddings, image CNN features).
- **Output**: Dense item vector $\mathbf{v}_i \in \mathbb{R}^d$

![Two Towers](images/Tower_Architecture.png)

---

### 💡 **Why Two-Towers?**

| Benefit | Impact |
|---------|--------|
| ⚡ **Scalability** | Precompute item embeddings → one vector lookup + ANN (Approximate Nearest Neighbor) search → millisecond latency. |
| 🧩 **Modularity** | Retrain/update towers independently; plug in new features without retraining the whole model. |
| 🎨 **Multimodal** | Easily add text, images, or audio by connecting specialized sub-networks. |
| 🆕 **Fresh Content** | Generate embeddings for new items instantly, without retraining the user tower or reindexing everything. |
| 🖥️ **Resource-Light** | Inference = a single dot product per candidate; highly parallelizable on GPUs/CPUs. |

---

✅ **Key takeaway:**  
Two-Tower models **shine in large-scale retrieval**: they get you from *millions* of candidates down to a *small shortlist* in milliseconds, ready for a second-stage ranking model.
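
The retrieval step can be sketched with plain NumPy. The vectors below are random stand-ins for tower outputs (an assumption made for illustration), and the exhaustive scoring shown here is what an ANN index would approximate in a production system.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_catalog = 8, 1000                       # toy embedding size and catalog size

# Stand-ins for tower outputs; in practice these come from the trained towers
item_vecs = rng.normal(size=(n_catalog, d))  # precomputed once, offline
user_vec = rng.normal(size=d)                # computed at request time

scores = item_vecs @ user_vec                # one dot product per candidate
top_k = 5
cand = np.argpartition(-scores, top_k)[:top_k]    # unordered top-k, linear time
top_items_toy = cand[np.argsort(-scores[cand])]   # sort only the shortlist

print(top_items_toy, scores[top_items_toy])
```

Because the item vectors do not depend on the user, only `user_vec` has to be computed per request.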

In [None]:
# ================================
# 0) Encoders & Side-Feature Setup
# ================================

# Existing encoders (user-side)
le_gender = LabelEncoder().fit(data["Gender"])
le_occ    = LabelEncoder().fit(data["Occupation"])
le_zip    = LabelEncoder().fit(data["Zip-code"])

data["g_idx"] = le_gender.transform(data["Gender"]).astype("int32")
data["o_idx"] = le_occ.transform(data["Occupation"]).astype("int32")
data["z_idx"] = le_zip.transform(data["Zip-code"]).astype("int32")

# Item genres (already computed earlier)
G = X_gen  # csr_matrix with shape (n_samples, n_genres)

In [None]:
# === TO DO 5 ===
# Extract the release year from the "Title" column and encode it for use as a side feature.
# Steps:
# 1. Extract the year (4 digits inside parentheses) and store it in a new column "Year".
# 2. Use LabelEncoder to convert the "Year" column into integer indices.
# 3. Save the result in a new column "y_idx", which will be used for year embeddings.
from src.task5 import get_year_embedding

data = get_year_embedding(data)

In [None]:
# ==================
# 1) Basic dimensions
# ==================
n_users  = data["u_idx"].nunique()
n_items  = data["m_idx"].nunique()
n_g      = data["g_idx"].nunique()
n_o      = data["o_idx"].nunique()
n_z      = data["z_idx"].nunique()
n_genres = G.shape[1]
n_years  = data["y_idx"].nunique()                               

u_idx = data["u_idx"].astype("int32").values
m_idx = data["m_idx"].astype("int32").values

In [None]:
# ============
# 2) Two-Tower
# ============
class TwoTower(Model):
    def __init__(self, n_users, n_items, n_g, n_o, n_z, n_genres, n_years,
                 emb_dim=32, side_dim=8, tower_dim=64):
        super().__init__()
        # ID embeddings
        self.user_emb = layers.Embedding(n_users, emb_dim, name="user_id_emb")
        self.item_emb = layers.Embedding(n_items, emb_dim, name="item_id_emb")
        # user side
        self.gender_emb = layers.Embedding(n_g, side_dim, name="gender_emb")
        self.occ_emb    = layers.Embedding(n_o, side_dim, name="occ_emb")
        self.zip_emb    = layers.Embedding(n_z, side_dim, name="zip_emb")
        # item side
        self.genre_proj = layers.Dense(side_dim, use_bias=False, name="genre_proj")  # projects multi-hot genre vector
        self.year_emb   = layers.Embedding(n_years, side_dim, name="year_emb")       

        # projections to common tower space
        self.user_proj = layers.Dense(tower_dim, activation=None, name="user_proj")
        self.item_proj = layers.Dense(tower_dim, activation=None, name="item_proj")

    def user_tower(self, user_id, g_idx, o_idx, z_idx):
        u = self.user_emb(user_id)
        g = self.gender_emb(g_idx)
        o = self.occ_emb(o_idx)
        z = self.zip_emb(z_idx)
        u_cat = tf.concat([u, g, o, z], axis=-1)          # (..., 32 + 3*8 = 56)
        return self.user_proj(u_cat)                      # -> (..., tower_dim)

    def item_tower(self, item_id, genres_vec, y_idx):
        i = self.item_emb(item_id)
        g_emb = self.genre_proj(genres_vec)               # (..., 8)
        y_emb = self.year_emb(y_idx)                      # (..., 8)   
        v_cat = tf.concat([i, g_emb, y_emb], axis=-1)     # (..., 32 + 8 + 8 = 48)
        return self.item_proj(v_cat)                      # -> (..., tower_dim)

    def call(self, inputs):
        u_vec = self.user_tower(inputs["user_id"], inputs["g_idx"], inputs["o_idx"], inputs["z_idx"])
        v_vec = self.item_tower(inputs["item_id"], inputs["genres"], inputs["y_idx"])
        return tf.reduce_sum(u_vec * v_vec, axis=-1, keepdims=True)  # dot product

In [None]:
# =========================================
# 3) tf.data generator & signature
# =========================================
def _as_float1d(x):
    # Accepts dense row or scipy sparse row
    if hasattr(x, "toarray"):
        return x.toarray().astype("float32").ravel()
    return np.asarray(x, dtype="float32").ravel()

# Pull the per-row columns out once so the generator avoids slow .iloc lookups
g_arr  = data["g_idx"].to_numpy(dtype="int32")
o_arr  = data["o_idx"].to_numpy(dtype="int32")
z_arr  = data["z_idx"].to_numpy(dtype="int32")
yr_arr = data["y_idx"].to_numpy(dtype="int32")

def gen():
    for i in range(len(data)):
        yield (
            {
                "user_id": np.int32(u_idx[i]),
                "item_id": np.int32(m_idx[i]),
                "g_idx":   g_arr[i],
                "o_idx":   o_arr[i],
                "z_idx":   z_arr[i],
                "genres":  _as_float1d(G[i]),               # (n_genres,) float32
                "y_idx":   yr_arr[i],
            },
            np.float32(y[i]),
        )

sig = (
    {
        "user_id": tf.TensorSpec(shape=(), dtype=tf.int32),
        "item_id": tf.TensorSpec(shape=(), dtype=tf.int32),
        "g_idx":   tf.TensorSpec(shape=(), dtype=tf.int32),
        "o_idx":   tf.TensorSpec(shape=(), dtype=tf.int32),
        "z_idx":   tf.TensorSpec(shape=(), dtype=tf.int32),
        "genres":  tf.TensorSpec(shape=(n_genres,), dtype=tf.float32),
        "y_idx":   tf.TensorSpec(shape=(), dtype=tf.int32),        
    },
    tf.TensorSpec(shape=(), dtype=tf.float32),
)

ds = tf.data.Dataset.from_generator(gen, output_signature=sig).batch(1024).prefetch(tf.data.AUTOTUNE)

In [None]:
# ===========================
# 4) Compile, Train, Evaluate
# ===========================
model = TwoTower(n_users, n_items, n_g, n_o, n_z, n_genres, n_years, emb_dim=32, side_dim=8, tower_dim=64)
model.compile(optimizer="adam", loss="mse", metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse")])
model.fit(ds, epochs=1)

In [None]:
# ===========================
# 5) Inference / Recommendation
# ===========================
# Example usage:
top_recs = fn.recommend_for_uidx_tt(10, data, model, mlb, top_n=5)
print(top_recs[["Title", "pred"]])

# [🏆 **Factored Tech Week 2025 – Recommender Systems Challenge**](https://www.codabench.org/competitions/10195/?secret_key=a03669ef-a8e7-483a-aa8b-3686beb4be9b)


