# Assignment 2 – Behance Like Prediction

## 1. Predictive Task & Evaluation

### Task formulation (Context)

Our goal is to **predict which Behance projects a user will “appreciate” (like)** in the future, based on their past implicit feedback.

- **Input data:** triples of the form \((u, i, t)\), where:
  - \(u\) is a user ID
  - \(i\) is a project (item) ID
  - \(t\) is a timestamp of when the user appreciated the project
- **Prediction task:** for a given user \(u\) and a set of candidate projects, we want to produce a **ranking** in which items the user actually appreciates are ranked as highly as possible.
- **Supervision type:** this is an **implicit-feedback recommendation** problem (we only see positive interactions, not explicit ratings or true negatives).

In our experiments, we evaluate a scoring function \(s(u, i)\) on **one positive project** and **100 sampled negatives** for each user in the validation/test sets. The model’s job is to assign higher scores to the positive project than to the negatives.

### Inputs, outputs, and what is optimized

- **Model inputs:**
  - A user index \(u\)
  - A project index \(i\)
  - Optionally, an image feature vector for project \(i\) (in the visual and hybrid models)
- **Model output:**
  - A **real-valued score** \(s(u, i)\). Higher scores mean “more likely that user \(u\) will appreciate project \(i\).”
- **Optimization objective:**
  - For the **matrix factorization (MF)** model, we optimize a **logistic loss** over positive and sampled negative user–item pairs, with \(L_2\) regularization on user and item embeddings.
  - The **visual** model has no learned parameters: it scores projects using precomputed image features only. The **hybrid** model blends this visual score with the learned MF score, and its combination weight \(\alpha\) is chosen using validation performance.

### Evaluation metrics

We evaluate models using two ranking metrics computed over each user’s 1 positive + 100 sampled negatives:

- **AUC (Area Under the ROC Curve):**
  - Interpreted here as \(P(s(u, i_{pos}) > s(u, i_{neg}))\) averaged over negatives.
  - Measures how often the model ranks the true positive above a random negative.
- **Precision@10 (P@10):**
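  - For each user, we rank the 1 positive + 100 negatives, take the **top 10**, and record 1 if the positive is inside that top-10 set and 0 otherwise. With a single positive per user this is a hit rate at 10 rather than precision in the usual sense (which could be at most 0.1 here); we keep the name P@10 to match the code.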
  - This approximates a practical scenario where we show the user a **small recommendation list** and ask whether it contains something they actually appreciated.
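
Written out for a single user, the two per-user quantities can be sketched as follows. The symbol \(N_u\) (the set of 100 sampled negatives for user \(u\)) and the rank notation are introduced here for illustration only:

\[ \mathrm{AUC}(u) = \frac{1}{|N_u|} \sum_{j \in N_u} \mathbb{1}\big[s(u, i_{pos}) > s(u, j)\big], \qquad \mathrm{P@10}(u) = \mathbb{1}\big[\mathrm{rank}_u(i_{pos}) \le 10\big] \]

Here \(\mathrm{rank}_u(i_{pos})\) is the position of the positive among the 101 scored candidates (1 = highest score). The reported numbers are the means of these quantities over users.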

We report both metrics because:

- **AUC** captures overall ranking quality across all 101 items.
- **P@10** focuses on the **very top of the ranking**, which is often the most important region for a recommender system’s UI.

These evaluation choices are consistent with our goal: **produce a ranked list where true future appreciations appear near the top.**

## 2. Exploratory Data Analysis (EDA)

In [None]:
# --- Core ---
import os, math, random, struct
from collections import defaultdict, Counter

# --- Data ---
import numpy as np
import pandas as pd

# --- Visualization ---
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

# --- ML helpers ---
from sklearn.decomposition import PCA

# --- Utils ---
from tqdm import tqdm
from numpy.linalg import norm

random.seed(0)
np.random.seed(0)

## Data Loading & Preprocessing

### Load appreciates file

In [None]:
import gzip

data_dir = "data"
path = os.path.join(data_dir, "Behance_appreciate_1M.gz")

triples_raw = []
# The file is gzip-compressed, so it is opened with gzip.open in text mode.
with gzip.open(path, "rt") as f:
    for line in f:
        u, i, t = line.strip().split()
        triples_raw.append((u, i, int(t)))

len(triples_raw), triples_raw[:5]

### Map users/items to integer indices

In [None]:
user2idx, item2idx = {}, {}
idx2user, idx2item = [], []

triples = []  # (u_idx, i_idx, timestamp)

for u, i, t in triples_raw:
    if u not in user2idx:
        user2idx[u] = len(user2idx)
        idx2user.append(u)
    if i not in item2idx:
        item2idx[i] = len(item2idx)
        idx2item.append(i)
    triples.append((user2idx[u], item2idx[i], t))

num_users = len(user2idx)
num_items = len(item2idx)
num_users, num_items, len(triples)

### Train / val / test split (per user)

In [None]:
by_user = defaultdict(list)
for u, i, t in triples:
    by_user[u].append((t, i))

train_pos = []
val_pos   = []
test_pos  = []

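for u, lst in by_user.items():
    lst.sort()  # sort by time (ties broken by item index)
    if len(lst) >= 3:
        # last item -> test, second-to-last -> validation, the rest -> train
        *train_items, val_item, test_item = lst
        train_pos.extend((u, i) for (_, i) in train_items)
        val_pos.append((u, val_item[1]))
        test_pos.append((u, test_item[1]))
    elif len(lst) == 2:
        # too short for a validation item: first -> train, last -> test
        (t1, i1), (t2, i2) = lst
        train_pos.append((u, i1))
        test_pos.append((u, i2))
    else:
        # a single interaction is kept for training only
        train_pos.append((u, lst[0][1]))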

len(train_pos), len(val_pos), len(test_pos)
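
To make the split rule concrete, the sketch below applies the same per-user logic to toy histories. `split_history` is a hypothetical helper written for illustration; it is not part of the notebook, and the toy events are made up.

```python
def split_history(events):
    """Split one user's (timestamp, item) events into train/val/test item lists.

    Mirrors the per-user rule above: the last event is the test item, the
    second-to-last is the validation item (only with at least 3 events),
    and everything earlier is training data.
    """
    events = sorted(events)  # chronological order
    items = [i for _, i in events]
    if len(items) >= 3:
        return items[:-2], [items[-2]], [items[-1]]
    if len(items) == 2:
        return items[:1], [], items[1:]
    return items, [], []

print(split_history([(30, "c"), (10, "a"), (20, "b"), (40, "d")]))
# (['a', 'b'], ['c'], ['d'])
print(split_history([(5, "x"), (9, "y")]))
# (['x'], [], ['y'])
```

Users with exactly two interactions therefore appear in the test set but not in the validation set, which is why `len(val_pos)` can be smaller than `len(test_pos)`.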

### Users' liked items

In [None]:
user_pos_items = defaultdict(set)
for u, i in train_pos + val_pos + test_pos:
    user_pos_items[u].add(i)

### Negative sampling + eval data builder

In [None]:
def sample_negative(u):
    """Sample an item that user u has NOT liked."""
    while True:
        j = random.randrange(num_items)
        if j not in user_pos_items[u]:
            return j

def build_eval_data(pos_pairs, num_neg=100):
    """
    pos_pairs: list of (u, i_pos).
    Returns: list of dicts: {"u": u, "pos": i_pos, "negs": [neg_items]}
    """
    eval_data = []
    for u, i_pos in pos_pairs:
        negs = [sample_negative(u) for _ in range(num_neg)]
        eval_data.append({"u": u, "pos": i_pos, "negs": negs})
    return eval_data

val_data  = build_eval_data(val_pos,  num_neg=100)
test_data = build_eval_data(test_pos, num_neg=100)

len(val_data), len(test_data)

### Basic stats

In [None]:
num_interactions = len(triples)
print("Users:", num_users)
print("Items:", num_items)
print("Interactions:", num_interactions)
print("Train/Val/Test:", len(train_pos), len(val_pos), len(test_pos))

### Likes per user

In [None]:
user_counts = Counter(u for u, _, _ in triples)
plt.figure(figsize=(6,4))
plt.hist(list(user_counts.values()), bins=50)
plt.yscale("log")
plt.xlabel("Likes per user")
plt.ylabel("Count (log scale)")
plt.title("Distribution of likes per user")
plt.show()

### Likes per item

In [None]:
item_counts_full = Counter(i for _, i, _ in triples)
plt.figure(figsize=(6,4))
plt.hist(list(item_counts_full.values()), bins=50)
plt.yscale("log")
plt.xlabel("Likes per item")
plt.ylabel("Count (log scale)")
plt.title("Distribution of likes per item")
plt.show()
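
The histograms show the shape of the distributions; a single number can make the popularity skew easier to compare. A minimal sketch, assuming a `Counter` of likes per item such as `item_counts_full` (`top_share` is a hypothetical helper and the toy counts are made up):

```python
from collections import Counter

def top_share(counts, frac=0.01):
    """Fraction of all interactions that fall on the most popular `frac` of items."""
    values = sorted(counts.values(), reverse=True)
    if not values:
        return 0.0
    n_top = max(1, int(len(values) * frac))  # always look at one item or more
    return sum(values[:n_top]) / sum(values)

# Toy counts: one very popular item and nine niche ones.
toy_counts = Counter({"a": 91, **{f"x{k}": 1 for k in range(9)}})
print(top_share(toy_counts, frac=0.1))  # 0.91
```

In the notebook this could be called as `top_share(item_counts_full, frac=0.01)` to report the share of likes received by the top 1% of projects.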

## 3. Modeling

### 3.1 Task as an ML problem (Context)

We formulate Behance like prediction as a **personalized ranking problem**:

- We have a set of users \(\mathcal{U}\) and a set of projects \(\mathcal{I}\).
- Whenever a user appreciates a project, we observe a positive interaction \((u, i, t)\).
- We treat each observed \((u, i)\) as a **positive example** and sample additional **unobserved items** for the same user as **implicit negatives**.
- The model learns a scoring function \(s(u, i)\) such that \(s(u, i_{pos}) > s(u, i_{neg})\) for as many user–item pairs as possible.

In our implementation:

- The **matrix factorization (MF)** model learns:
  - \(P \in \mathbb{R}^{U \times K}\): user latent factors
  - \(Q \in \mathbb{R}^{I \times K}\): item latent factors
  - Score: \(s_{MF}(u, i) = P_u^\top Q_i\)
- The **visual model** builds a visual profile per user by averaging image features of liked projects, and scores a candidate project via **cosine similarity**.
- The **hybrid model** combines MF and visual scores linearly:  
  \[ s_{hyb}(u, i) = \alpha\, s_{MF}(u, i) + (1 - \alpha)\, s_{vis}(u, i). \]

We compare these against a **popularity baseline**, which ignores users and simply ranks items by how many times they were appreciated in the training data.

### 3.2 Modeling approaches: advantages and disadvantages

**Popularity baseline**

- **Idea:** rank all projects by how many likes they have, and recommend the same top projects to everyone.
- **Advantages:**
  - Very simple and efficient (just a count lookup).
  - Strong baseline when interactions are heavily concentrated on a few very popular items.
- **Disadvantages:**
  - Completely ignores individual user preferences.
  - Cannot recommend long-tail or niche projects that are not globally popular.

**Matrix factorization (MF)**

- **Idea:** embed users and items into a shared **K-dimensional latent space**, and compute scores as dot products.
- **Advantages:**
  - Captures user–item interaction patterns (“people who liked X also liked Y”).
  - Scales relatively well once embeddings are learned; scoring is just a dot product.
  - Our implementation uses a **logistic loss with negative sampling**, which is natural for implicit feedback.
- **Disadvantages / challenges:**
  - Training is more complex and computationally expensive than the popularity baseline.
  - Requires hyperparameter choices (latent dimension K, learning rate, regularization, number of epochs).
  - Cold-start items with few interactions are still hard, since they have poorly learned embeddings.

**Visual-only model**

- **Idea:** use **image features** to represent projects. For each user, average the features of all projects they liked to form a **visual profile**, and then score new projects by cosine similarity to that profile.
- **Advantages:**
  - Uses **content information** (how projects look), which is very relevant in a visual platform like Behance.
  - Can help in cold-start settings: even if a project is new, as long as we have an image feature vector, we can still compute a score.
- **Disadvantages / challenges:**
  - Ignores global interaction patterns between users and items.
  - Averaging features is a simple heuristic; it may not capture multiple different styles a user likes.
  - In our results, this model performs significantly worse than the interaction-based models on P@10.

**Hybrid model (MF + Visual)**

- **Idea:** combine MF and visual scores with a weight \(\alpha\), chosen based on validation performance.
- **Advantages:**
  - Balances **interaction signals** and **visual content signals**.
  - Can still rank visually similar items even if they have limited interactions, while leveraging MF where data is abundant.
- **Disadvantages / challenges:**
  - Requires tuning the combination weight \(\alpha\).
  - Our combination is a simple linear blend; more advanced methods (e.g., learning \(\alpha\) per user or per item) could potentially perform better but would increase complexity.

Overall, MF and the hybrid model are more expressive and personalized, while the popularity baseline sets a strong, simple reference point. The visual-only model explores the value of pure content information.

### 3.3 Code walkthrough (Modeling code)

Below we implement and train the different models:

1. **Popularity baseline** (Section: *Baseline model – Item popularity*):
   - We count how many times each item appears in the training set and define `pop_score(u, i)` to return this count.
   - This is our **trivial but strong baseline**.

2. **MF training data and model** (Sections: *Build MF training data (with negatives)*, *MF model: parameters and scoring function*, *Training loop*):
   - `build_mf_training_data` constructs labeled pairs `(u, i, y)`
     where `y = 1` for positive (observed) interactions and `y = 0` for sampled negatives.
   - We initialize user and item embeddings `P` and `Q` with small random values.
   - `mf_score(u, i)` computes the dot product between user and item embeddings.
   - `train_mf` performs **stochastic gradient descent** on the logistic loss, with \(L_2\) regularization on `P` and `Q`.

3. **Visual model** (Sections: *Load image features*, *Build user visual profiles*, *Visual-only scoring function*):
   - We load precomputed 4096-dimensional image features for each project.
   - For each user, we average the features of all liked projects to build `user_visual[u]`.
   - `visual_score(u, i)` computes cosine similarity between the user’s visual profile and the project’s feature vector.

4. **Hybrid model** (Section: *Hybrid model (MF + Visual)*):
   - `make_hybrid_score(alpha)` returns a scoring function that linearly combines `mf_score` and `visual_score`.
   - We later evaluate several values of `alpha` on the validation set to pick the best weight.

In the next section, we will define an evaluation helper and compare these models quantitatively using AUC and P@10.

### Baseline model – Item popularity

In [None]:
item_counts = Counter(i for (u, i) in train_pos)

def pop_score(u, i):
    # same for all users; uses only item popularity
    return item_counts[i]

### Build MF training data (with negatives)

In [None]:
def build_mf_training_data(num_neg_per_pos=2):
    data = []
    for u, i in train_pos:
        data.append((u, i, 1))
        for _ in range(num_neg_per_pos):
            j = sample_negative(u)
            data.append((u, j, 0))
    random.shuffle(data)
    return data

mf_train_data = build_mf_training_data(num_neg_per_pos=2)
len(mf_train_data)

### MF model: parameters and scoring function

In [None]:
K = 40        # latent dimension
lr = 0.05
reg = 0.001
epochs = 5    # start small, increase if training fast

P = 0.01 * np.random.randn(num_users, K)
Q = 0.01 * np.random.randn(num_items, K)

In [None]:
def mf_score(u, i):
    return float(P[u] @ Q[i])

### Training loop

In [None]:
def train_mf(train_data, epochs=5, lr=0.05, reg=0.001):
    global P, Q
    for epoch in range(epochs):
        random.shuffle(train_data)
        total_loss = 0.0

        for u, i, y in train_data:
            pred = P[u] @ Q[i]
            p_hat = 1.0 / (1.0 + math.exp(-pred))  # logistic
            grad = p_hat - y                       # d/dpred logloss

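            # Copy the current vectors: P[u] and Q[i] are views, so without a copy
            # the in-place update of P[u] would leak into the gradient for Q[i].
            Pu = P[u].copy()
            Qi = Q[i].copy()

            P[u] -= lr * (grad * Qi + reg * Pu)
            Q[i] -= lr * (grad * Pu + reg * Qi)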

            total_loss += -(y * math.log(p_hat + 1e-8) +
                            (1 - y) * math.log(1 - p_hat + 1e-8))

        avg_loss = total_loss / len(train_data)
        print(f"Epoch {epoch+1}/{epochs}, avg log-loss: {avg_loss:.4f}")

train_mf(mf_train_data, epochs=epochs, lr=lr, reg=reg)
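
The updates in `train_mf` can be read as stochastic gradient descent on a per-example regularized logistic loss. A sketch of that derivation follows, writing \(\sigma\) for the logistic function, \(\hat{p} = \sigma(P_u^\top Q_i)\), \(\eta\) for `lr`, and \(\lambda\) for `reg` (these symbols are introduced here for illustration):

\[ \ell(u, i, y) = -\big[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\big] + \frac{\lambda}{2}\big(\lVert P_u \rVert^2 + \lVert Q_i \rVert^2\big) \]

\[ \frac{\partial \ell}{\partial P_u} = (\hat{p} - y)\, Q_i + \lambda P_u, \qquad \frac{\partial \ell}{\partial Q_i} = (\hat{p} - y)\, P_u + \lambda Q_i \]

\[ P_u \leftarrow P_u - \eta\, \frac{\partial \ell}{\partial P_u}, \qquad Q_i \leftarrow Q_i - \eta\, \frac{\partial \ell}{\partial Q_i} \]

The factor \(\hat{p} - y\) is the `grad` variable in the loop, and both gradients are meant to be evaluated at the values of \(P_u\) and \(Q_i\) from before the step. The printed average log-loss covers only the first term and excludes the regularization penalty.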

### Load image features

In [None]:
IMG_PATH = os.path.join(data_dir, "Behance_Image_Features.b")
print("Image file exists:", os.path.exists(IMG_PATH))

item_features = {}  # item_idx -> np.array(4096,)

with open(IMG_PATH, "rb") as f:
    while True:
        item_id_bytes = f.read(8)
        if not item_id_bytes:
            break
        raw_id = item_id_bytes.decode("ascii").strip()
        vec = f.read(4 * 4096)
        if len(vec) < 4 * 4096:
            break
        feat = np.frombuffer(vec, dtype=np.float32)

        # map original item ID -> our index
        if raw_id in item2idx:
            idx = item2idx[raw_id]
            item_features[idx] = feat

len(item_features)

### Build user visual profiles

In [None]:
user_visual = {}

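# Group training items by user in one pass instead of rescanning train_pos per user.
liked_by_user = defaultdict(list)
for u, i in train_pos:
    if i in item_features:
        liked_by_user[u].append(i)

for u, liked in liked_by_user.items():
    mat = np.stack([item_features[i] for i in liked])
    user_visual[u] = mat.mean(axis=0)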

len(user_visual)

### Visual-only scoring function

In [None]:
def visual_score(u, i):
    vu = user_visual.get(u)
    fi = item_features.get(i)
    if vu is None or fi is None:
        return 0.0
    denom = (norm(vu) * norm(fi)) + 1e-8
    return float(vu @ fi / denom)

### Hybrid model (MF + Visual)

In [None]:
def make_hybrid_score(alpha):
    def score(u, i):
        return alpha * mf_score(u, i) + (1 - alpha) * visual_score(u, i)
    return score
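
`mf_score` is an unbounded dot product while `visual_score` is a cosine in \([-1, 1]\), so the effective weight of each term in the blend depends on the scale of the MF scores as well as on `alpha`. One possible variant, not used in this notebook, standardizes both score vectors within a user's candidate set before blending. `blend_standardized` is a hypothetical helper and the toy scores are made up:

```python
import numpy as np

def blend_standardized(mf_scores, vis_scores, alpha, eps=1e-8):
    """Blend two score vectors after z-scoring each within the candidate set."""
    mf = np.asarray(mf_scores, dtype=float)
    vis = np.asarray(vis_scores, dtype=float)
    mf_z = (mf - mf.mean()) / (mf.std() + eps)      # eps guards against zero variance
    vis_z = (vis - vis.mean()) / (vis.std() + eps)
    return alpha * mf_z + (1 - alpha) * vis_z

# Toy scores for three candidates.
mf = [4.0, 2.0, 0.0]     # large-scale dot products
vis = [0.1, 0.3, 0.2]    # cosine similarities
order = np.argsort(-blend_standardized(mf, vis, alpha=0.5))
print(order)  # candidate 1 ranks first, then 0, then 2
```

With the raw scores and the same `alpha = 0.5`, the blended values are 2.05, 1.15, and 0.1, so the ranking simply follows the MF scores; after standardization the visual signal can change the order.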

## 4. Evaluation & Results

### 4.1 Evaluation setup (Context)

For each user in the validation and test sets, we construct one evaluation instance:

- A **single positive project** (the held-out appreciated item for that user), and
- **100 negative projects**, sampled uniformly from items the user has never appreciated.

Given a scoring function `score_fn(u, i)`, we compute:

- **AUC:** fraction of negatives whose score is below the positive’s score.
- **P@10:** whether the positive item appears in the **top 10** ranked among 1 positive + 100 negatives.

We then average these metrics over all users in the validation/test sets.

This protocol lets us compare different models **under the same candidate set**, and focuses on whether the true appreciated project is ranked near the top of a plausible recommendation list.

### 4.2 Evaluation helper

In [None]:
def eval_model(score_fn, eval_data, k=10):
    """
    score_fn(u, i) -> float
    eval_data: list of {"u": u, "pos": i_pos, "negs": [j1,...]}
    Returns: (mean AUC, mean Precision@k)
    """
    aucs = []
    precisions = []

    for row in eval_data:
        u   = row["u"]
        pos = row["pos"]
        negs = row["negs"]

        items = [pos] + negs
        scores = np.array([score_fn(u, it) for it in items])

        pos_score = scores[0]
        neg_scores = scores[1:]

        # AUC = P(score_pos > score_neg)
        auc = np.mean(pos_score > neg_scores)
        aucs.append(auc)

        # Precision@k
        order = np.argsort(-scores)  # descending
        topk = order[:k]
        prec = 1.0 if 0 in topk else 0.0
        precisions.append(prec)

    return float(np.mean(aucs)), float(np.mean(precisions))
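
When several candidates share a score (for example, items with a popularity count of 0, or pairs for which `visual_score` falls back to 0.0), the strict `>` above counts every tie as a loss for AUC, while the top-k step depends on how `np.argsort` happens to order equal scores. A possible tie-aware variant is sketched below; `tie_aware_metrics` is a hypothetical helper and is not what produced the results reported in this notebook.

```python
import numpy as np

def tie_aware_metrics(pos_score, neg_scores, k=10):
    """AUC with ties counted as 0.5, and a top-k hit based on the mid-rank under ties."""
    neg = np.asarray(neg_scores, dtype=float)
    greater = np.sum(neg > pos_score)   # negatives strictly above the positive
    ties = np.sum(neg == pos_score)     # negatives tied with the positive
    auc = (np.sum(neg < pos_score) + 0.5 * ties) / len(neg)
    mid_rank = 1 + greater + 0.5 * ties  # expected rank under random tie-breaking, 1 = best
    hit = 1.0 if mid_rank <= k else 0.0
    return float(auc), hit

# Positive tied with every negative, e.g. all scores equal to 0.
print(tie_aware_metrics(0.0, [0.0] * 100))  # (0.5, 0.0)
```

Under this convention a model that gives every candidate the same score gets an AUC of 0.5, the value expected from random ranking, instead of 0.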

### 4.3 Validation performance and model selection

We first evaluate all models on the **validation set** to understand their behavior and to choose the best \(\alpha\) for the hybrid model.

In [None]:
# Baseline
pop_auc_val, pop_prec_val = eval_model(pop_score, val_data, k=10)
print("Popularity baseline (val): AUC =", pop_auc_val, "P@10 =", pop_prec_val)

# MF
mf_auc_val, mf_prec_val = eval_model(mf_score, val_data, k=10)
print("MF (val): AUC =", mf_auc_val, "P@10 =", mf_prec_val)

# Visual-only
vis_auc_val, vis_prec_val = eval_model(visual_score, val_data, k=10)
print("Visual-only (val): AUC =", vis_auc_val, "P@10 =", vis_prec_val)

# Hybrid with different alphas
alphas = [0.2, 0.5, 0.8]
hyb_val_results = []
for alpha in alphas:
    hyb_score = make_hybrid_score(alpha)
    auc, prec = eval_model(hyb_score, val_data, k=10)
    hyb_val_results.append((alpha, auc, prec))
    print(f"Hybrid alpha={alpha}: AUC={auc:.4f}, P@10={prec:.4f}")

hyb_val_results

From the validation results:

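- The **popularity baseline** achieves relatively high P@10, which is consistent with likes being concentrated on a few very popular projects: the negatives are sampled uniformly over items, so a popular held-out project tends to outrank them.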
- The **MF model** improves AUC over popularity, meaning it ranks positives above negatives more consistently, but its P@10 is slightly lower than pure popularity.
- The **visual-only model** has substantially lower performance, indicating that using image features alone is not sufficient in this setting.
- The **hybrid models** with different \(\alpha\) values improve AUC further, and for some \(\alpha\) they also get closer to popularity in terms of P@10.

Based on these trends, we choose **\(\alpha = 0.5\)** as a reasonable compromise between MF and visual information for the final test evaluation.

In [None]:
alpha_best = 0.5
hybrid_best = make_hybrid_score(alpha_best)
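
`alpha_best` is set by hand above. If one preferred to select it programmatically from a list shaped like `hyb_val_results`, a minimal sketch could look like this. `pick_best_alpha` is a hypothetical helper, and the numbers are placeholders, not results from this notebook:

```python
def pick_best_alpha(results, metric="auc"):
    """results: list of (alpha, auc, prec) tuples; returns the alpha with the best metric."""
    idx = {"auc": 1, "prec": 2}[metric]
    return max(results, key=lambda r: r[idx])[0]

# Placeholder numbers for illustration only.
toy_results = [(0.2, 0.70, 0.30), (0.5, 0.75, 0.28), (0.8, 0.74, 0.33)]
print(pick_best_alpha(toy_results, "auc"))   # 0.5
print(pick_best_alpha(toy_results, "prec"))  # 0.8
```

The two metrics can disagree, as in the toy values, so the selection criterion should be fixed before looking at test performance.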

### 4.4 Test set performance

In [None]:
pop_auc_test, pop_prec_test = eval_model(pop_score, test_data, k=10)
mf_auc_test,  mf_prec_test  = eval_model(mf_score,  test_data, k=10)
vis_auc_test, vis_prec_test = eval_model(visual_score, test_data, k=10)
hyb_auc_test, hyb_prec_test = eval_model(hybrid_best, test_data, k=10)

results = pd.DataFrame({
    "Model": [
        "Popularity (baseline)",
        "MF",
        "Visual-only",
        f"Hybrid (alpha={alpha_best})"
    ],
    "AUC":  [pop_auc_test, mf_auc_test, vis_auc_test, hyb_auc_test],
    "P@10": [pop_prec_test, mf_prec_test, vis_prec_test, hyb_prec_test],
})
results

The test results summarize how each model performs on held-out interactions (each user's most recent appreciation):

- **Popularity (baseline):**
  - Strong P@10. Because negatives are sampled uniformly over items, a held-out project that is globally popular tends to outrank most of them and land in the top 10, even without any personalization.
- **MF:**
  - Higher AUC than popularity, indicating better **overall ranking quality**, but a slightly lower P@10.
- **Visual-only:**
  - Much lower AUC and P@10, confirming that content alone is not competitive here.
- **Hybrid (alpha = 0.5):**
  - Achieves the **best AUC** among the learned models, and P@10 that is between MF and popularity.

Overall, our experiments show that:

- The **popularity baseline remains difficult to beat at P@10**, which is consistent with heavy popularity skew in real-world platforms.
- Our **MF and hybrid models** provide better global ranking quality (AUC), and the hybrid makes use of visual information without completely sacrificing top-10 performance.

This suggests that future work might focus on **better integrating popularity into the learned models**, for example via popularity-aware regularization or explicit popularity features, to close the P@10 gap while maintaining strong AUC.
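
As one illustration of an explicit popularity feature, a score function could be wrapped with an additive log-popularity term. This is only a sketch of the idea and was not evaluated in this notebook; `make_pop_aware_score` and the weight `beta` are hypothetical, and `beta` would need to be tuned on the validation set:

```python
import math
from collections import Counter

def make_pop_aware_score(base_score, item_counts, beta):
    """Wrap a score function with an additive beta * log(1 + count) popularity term."""
    def score(u, i):
        return base_score(u, i) + beta * math.log1p(item_counts[i])
    return score

# Toy stand-ins: a constant base scorer and two items with made-up counts.
toy_counts = Counter({0: 99, 1: 0})
score = make_pop_aware_score(lambda u, i: 0.0, toy_counts, beta=0.5)
print(score(0, 0) > score(0, 1))  # True
```

In the notebook, `base_score` could be `mf_score` or `hybrid_best`, and `item_counts` the training-set counts used by `pop_score`.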

## 5. Related Work

Our project fits into a long line of research on systems that suggest items to users based on
their past behavior, such as likes or clicks, and on using image information to make better
recommendations.

**Using past likes and clicks to make recommendations:**  
Many existing systems only see whether a user clicked or liked something, not a detailed
rating. Early work such as *Collaborative Filtering for Implicit Feedback Datasets* (Hu,
Koren, & Volinsky, 2008) and *BPR: Bayesian Personalized Ranking from Implicit
Feedback* (Rendle, Freudenthaler, Gantner, & Schmidt-Thieme, 2009) showed how to use
this kind of “yes/no” data to learn which users and items go well together. Their models
learn a small list of numbers for each user and for each item, and then give a higher score
to user–item pairs whose numbers line up well. At prediction time, the system picks the
items with the highest scores for each user. Our main interaction based model is inspired by this
same idea, where for each user and each project, we learn a short numeric description, and to
recommend projects to a user we score all projects and keep the ones with the highest
scores.

**Popularity based recommendations:**  
Other work on sites like YouTube, Pinterest, or Behance-style platforms has found that
simply recommending the most popular items, meaning the ones that receive the most likes overall,
can already work surprisingly well. Because many users interact with only a few items, and
because attention is often focused on a small set of very popular items, “recommend what
is globally popular” is commonly used as a simple starting point or a backup strategy. Our
popularity based model follows this idea, since we rank projects by how many likes they
have, and recommend the top ones to everyone. In our results, this approach is quite effective:
for many users, the held-out project they actually liked is ranked in the top 10 of its 101
candidates (1 positive + 100 sampled negatives). This is consistent with the claim
from prior work that likes tend to be heavily concentrated on a small set of very popular
projects.

**Using image information to make recommendations:**  
More recent research combines interaction data with information extracted from the images
themselves. In domains like fashion, art, and design, how something looks is very important,
so using image information can help. Models such as VBPR (*VBPR: Visual Bayesian
Personalized Ranking for Personalized Recommendation of Visual Content*; He &
McAuley, 2016) use an image recognition network to turn each image into a long vector of
numbers that captures aspects like style, color, and composition. They then combine this
image based description with the interaction based model so that the system can
recommend items that both match a user’s taste and look similar to things they have liked
before, and can still recommend reasonable items even when there are few past likes for a
given item. Other work on image popularity prediction, such as *What Makes an Image
Popular?* (Khosla, Das Sarma, & Hamid, 2014), also shows that these kinds of image
features are strongly related to whether an image will attract attention and engagement.

Our visual only model is closest to this line of work on image popularity, in that it relies
only on image-based features. We do not learn user or item embeddings; instead, we build a visual profile for each
user by averaging the image features of the projects they liked, and then recommend
projects whose image features are similar to that profile. We find that this image only
approach does contain useful information, but it is not enough by itself: compared to our
interaction-based model and our popularity based model, it suggests fewer projects that
the user actually likes near the top of the ranked list.

**Combining interaction and image information:**  
A common idea in related works is to combine these different signals rather than choose
only one of them. Many papers use a weighted combination of a score based on past
interactions and a score based on content, such as image features. This lets the system
balance between “people who liked X also liked Y” and “this looks similar to what you
liked before,” and can help especially for users or items that do not have much interaction
history. Our combined model follows this pattern, as for each user–project pair, we compute
one score from the interaction based model and one from the image based model, and
then take a weighted average of the two, controlled by a parameter that decides how much
weight to give to each source of information. In line with prior work, we see that using a
middle value for this weight, rather than relying purely on interactions or purely on images,
gives the best overall ranking (highest average AUC) among our learned models on the test data. However, our
results also show that the pure popularity based model still gives the largest number of
correct hits in the top 10 recommendations. This suggests that our simple way of combining
signals does not yet fully exploit popularity information, a limitation that is also discussed in
recent work on how to correctly handle very popular items when building recommendation
systems.