# 🎯 实验背景与应用场景：去偏排序模型在推荐系统中的应用

在大多数实际推荐系统中，**用户点击行为常用于训练排序模型**，然而这些点击数据往往存在两类严重偏差：

1. **标签偏差（Label Bias）**：用户点击未必表示真实兴趣，未点击也不代表不感兴趣。例如，一些商品由于图文吸引力被误点，而另一些有价值商品未被注意到。

2. **曝光偏差（Exposure Bias）**：用户只能看到推荐系统主动曝光的内容，导致训练样本分布与真实兴趣分布不一致。系统更倾向于曝光热门商品，冷门内容很难获得点击。

这些偏差直接影响排序模型的训练质量，容易导致推荐结果过度集中在热门内容、难以捕捉用户的真实个性化需求。


# LambdaMART vs Debiased LambdaMART + IPW

通过模拟排序推荐数据，比较：
- Click Label + LambdaMART（Baseline）
- Denoised Label + LambdaMART（Debiased）
- Denoised Label + IPW权重 + LambdaMART（Debiased）

## 构造模拟数据集

In [1]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

np.random.seed(42)

n_users = 100
n_items = 50

data = []
for user in range(n_users):
    items = np.random.choice(n_items, size=10, replace=False)
    for item in items:
        # 模拟点击：70%由兴趣决定，30%为从众点击
        is_interest = np.random.rand() < 0.6
        is_conformity = not is_interest and (np.random.rand() < 0.5)
        click = int(is_interest or is_conformity)
        dwell_time = np.random.rand() * 100 if click else np.random.rand() * 10
        data.append([user, item, click, dwell_time, is_interest, is_conformity])

df = pd.DataFrame(data, columns=["user_id", "item_id", "click", "dwell_time", "is_interest", "is_conformity"])
df["click_label"] = df["click"]
df["denoised_label"] = df["is_interest"].astype(int)  # 去掉 conformity 的影响
df.head()

Unnamed: 0,user_id,item_id,click,dwell_time,is_interest,is_conformity,click_label,denoised_label
0,0,13,1,1.596625,True,False,1,1
1,0,39,1,24.102547,True,False,1,1
2,0,30,0,8.331949,False,False,0,0
3,0,45,1,39.106061,True,False,1,1
4,0,17,1,75.536141,True,False,1,1


## 特征编码 + 构造 group

In [2]:
user_enc = LabelEncoder()
item_enc = LabelEncoder()
df["user_id"] = user_enc.fit_transform(df["user_id"])
df["item_id"] = item_enc.fit_transform(df["item_id"])

# 构造 Group（每个用户是一组）
df = df.sort_values("user_id")
group_sizes = df.groupby("user_id").size().tolist()

# 训练测试划分
train_df = df.sample(frac=0.8, random_state=1)
test_df = df.drop(train_df.index)

group_train = train_df.groupby("user_id").size().tolist()
group_test = test_df.groupby("user_id").size().tolist()


## 定义排序评估指标（AUC, NDCG, MAP）

In [5]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score

def dcg_at_k(r, k):
    r = np.asarray(r, dtype=np.float32)[:k]
    if r.size:
        return np.sum(r / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    return dcg_at_k(r, k) / dcg_max if dcg_max else 0.

def evaluate(model, test_df, label_col):
    X_test = test_df[["user_id", "item_id"]]
    y_true = test_df[label_col].values
    y_pred = model.predict(X_test)

    auc = roc_auc_score(y_true, y_pred)
    ap = average_precision_score(y_true, y_pred)

    test_df = test_df.copy()
    test_df["score"] = y_pred
    test_df["label"] = y_true
    ndcgs = []
    for _, group in test_df.groupby("user_id"):
        ranked = group.sort_values("score", ascending=False)
        ndcg = ndcg_at_k(ranked["label"].values, 5)
        ndcgs.append(ndcg)

    return auc, np.mean(ndcgs), ap


## Baseline: Click Label + LambdaMART

In [6]:
train_set = lgb.Dataset(train_df[["user_id", "item_id"]], label=train_df["click_label"], group=group_train)
test_set = lgb.Dataset(test_df[["user_id", "item_id"]], label=test_df["click_label"], group=group_test, reference=train_set, free_raw_data=False)

params = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [5],
    "learning_rate": 0.1,
    "boosting_type": "gbdt"
}

model_baseline = lgb.train(params, train_set, num_boost_round=100, valid_sets=[test_set],
                           callbacks=[lgb.early_stopping(stopping_rounds=10)])

auc1, ndcg1, map1 = evaluate(model_baseline, test_df, "click_label")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000613 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 150
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 2
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[4]	valid_0's ndcg@5: 0.951479


## Debiased: Denoised Label + LambdaMART

In [7]:
train_set_denoised = lgb.Dataset(train_df[["user_id", "item_id"]], label=train_df["denoised_label"], group=group_train)
test_set_denoised = lgb.Dataset(test_df[["user_id", "item_id"]],
                                label=test_df["denoised_label"],
                                group=group_test,
                                reference=train_set_denoised,
                                free_raw_data=False)

model_denoised = lgb.train(params, train_set_denoised, num_boost_round=100, valid_sets=[test_set_denoised],
                           callbacks=[lgb.early_stopping(stopping_rounds=10)])

auc2, ndcg2, map2 = evaluate(model_denoised, test_df, "denoised_label")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000576 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 150
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 2
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.926893


## Debiased + IPW 权重

In [8]:
# 模拟曝光位置偏差，构造IPW权重（越靠后越难被看见）
def simulate_exposure_bias(row):
    position = np.random.randint(1, 11)
    prob = 1.0 / position
    return prob

train_df["exposure_prob"] = train_df.apply(simulate_exposure_bias, axis=1)
train_df["ipw_weight"] = 1.0 / (train_df["exposure_prob"] + 1e-6)

train_set_ipw = lgb.Dataset(train_df[["user_id", "item_id"]],
                            label=train_df["denoised_label"],
                            group=group_train,
                            weight=train_df["ipw_weight"])

test_set_ipw = lgb.Dataset(test_df[["user_id", "item_id"]],
                           label=test_df["denoised_label"],
                           group=group_test,
                           reference=train_set_ipw,
                           free_raw_data=False)

model_ipw = lgb.train(params, train_set_ipw, num_boost_round=100, valid_sets=[test_set_ipw],
                      callbacks=[lgb.early_stopping(stopping_rounds=10)])

auc3, ndcg3, map3 = evaluate(model_ipw, test_df, "denoised_label")


[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001015 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 150
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 2
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[4]	valid_0's ndcg@5: 0.923977


## 模型效果对比

In [9]:
print("Baseline (Click Label):       AUC = %.4f, NDCG@5 = %.4f, MAP = %.4f" % (auc1, ndcg1, map1))
print("Debiased (Denoised Label):    AUC = %.4f, NDCG@5 = %.4f, MAP = %.4f" % (auc2, ndcg2, map2))
print("Debiased + IPW (Our Method):  AUC = %.4f, NDCG@5 = %.4f, MAP = %.4f" % (auc3, ndcg3, map3))

Baseline (Click Label):       AUC = 0.5113, NDCG@5 = 0.9044, MAP = 0.8154
Debiased (Denoised Label):    AUC = 0.5483, NDCG@5 = 0.7249, MAP = 0.6705
Debiased + IPW (Our Method):  AUC = 0.5413, NDCG@5 = 0.7240, MAP = 0.6280
