# LightGBM

Learning to rank with LightGBM.

<a href="https://colab.research.google.com/github/fuyu-quant/data-science-wiki/blob/develop/tabledata/ranking/lightgbm.ipynb" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install --upgrade lightgbm

In [5]:
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import ndcg_score
from sklearn.model_selection import train_test_split

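### Creating the dataset
- Each row carries two special columns:
    - qid: the identifier that groups rows into one query (for example, a user ID). It is not a feature; it only determines which rows are ranked against each other.
    - Relevance Score: the target variable.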
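
LightGBM's ranking objective does not consume `qid` directly. It takes the number of rows in each query, in order, with the rows of a query stored next to each other. The conversion can be sketched as follows, using a small `qid` array whose values are illustrative only:

```python
import numpy as np

# Hypothetical query ids for six rows (illustrative values only).
qid = np.array([2, 0, 1, 0, 2, 2])

# Sort rows so that each query forms one contiguous block.
order = np.argsort(qid, kind="stable")
qid_sorted = qid[order]

# Count rows per query; np.unique returns the ids in ascending order,
# which matches the row order after sorting.
unique_qids, group_sizes = np.unique(qid_sorted, return_counts=True)

print(qid_sorted)    # [0 0 1 2 2 2]
print(group_sizes)   # [2 1 3]
```

The same `order` index would also be applied to the feature matrix and the labels so that all three stay aligned.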

In [6]:
# Generate the data
X, y = make_classification(random_state=3655, n_samples=300)
rng = np.random.default_rng(seed=3655)
n_query_groups = 3
qid = rng.integers(0, n_query_groups, size=X.shape[0])

# Sort the data by qid
sorted_idx = np.argsort(qid)
X = X[sorted_idx, :]
y = y[sorted_idx]
qid = qid[sorted_idx]

# Build a DataFrame
df = pd.DataFrame(X)
df['Relevance Score'] = y
df['qid'] = qid

In [7]:
# Split into training and test data (stratified by qid)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=3655, stratify=df['qid'])
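
`train_test_split` shuffles rows by default, so within each split the rows of one query are generally interleaved with rows of other queries. One way to restore a contiguous layout is a stable sort by `qid` on each split. This is a sketch on a small hypothetical frame: the column names mirror the ones used here, the `feature` column and all values are made up.

```python
import pandas as pd

# Hypothetical shuffled split: rows of the same qid are interleaved.
split_df = pd.DataFrame({
    "feature": [0.1, 0.4, 0.3, 0.9, 0.5],
    "Relevance Score": [1, 0, 1, 0, 1],
    "qid": [1, 0, 1, 2, 0],
})

# A stable sort keeps the within-query row order and makes each query contiguous.
split_df = split_df.sort_values("qid", kind="stable").reset_index(drop=True)

# groupby sorts its keys by default, so the sizes come back in ascending
# qid order, matching the row order produced by the sort above.
query_sizes = split_df.groupby("qid").size().to_numpy()

print(split_df["qid"].tolist())  # [0, 0, 1, 1, 2]
print(query_sizes)               # [2 2 1]
```

Applied here, the same `sort_values("qid", kind="stable")` call would go on both `train_df` and `test_df` before the query sizes are computed.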

### LightGBMのランキング学習

In [8]:

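# Count the items per query. `group` is read as consecutive block sizes,
# so rows should already be ordered by qid (a shuffled split does not guarantee this).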
train_query_sizes = train_df.groupby('qid').size().to_numpy()
test_query_sizes = test_df.groupby('qid').size().to_numpy()

# Build the LightGBM datasets
train_data = lgb.Dataset(data=train_df.drop(columns=['Relevance Score', 'qid']), 
                         label=train_df['Relevance Score'], 
                         group=train_query_sizes)
test_data = lgb.Dataset(data=test_df.drop(columns=['Relevance Score', 'qid']), 
                        label=test_df['Relevance Score'], 
                        group=test_query_sizes)

# Set the parameters
params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [1, 3, 5, 10],
    'learning_rate': 0.05,
    'num_leaves': 32,
    'min_data_in_leaf': 1,
    'boosting': 'gbdt',
}

# Train the model
num_round = 100
bst = lgb.train(params, train_data, num_round, valid_sets=[test_data])

# Predict relevance scores for the test rows
y_pred = bst.predict(test_df.drop(columns=['Relevance Score', 'qid']))

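# Evaluate the score (note: for a meaningful evaluation, Relevance Score must vary among rows that share a qid)
# This call scores the whole test set as a single ranking and ignores qid.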
ndcg_val = ndcg_score(np.asarray([test_df['Relevance Score']]), np.asarray([y_pred]))
print(f'NDCG Score: {ndcg_val}')


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1634
[LightGBM] [Info] Number of data points in the train set: 240, number of used features: 20
NDCG Score: 0.9951746057602089
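
The `ndcg_score` call above treats the whole test set as one ranking, whereas a ranking metric is normally computed per query, truncated at a cutoff such as the values in `ndcg_eval_at`, and then averaged. The sketch below illustrates that per-query NDCG@k. It assumes a gain of `2**label - 1` with a log2 discount (for 0/1 labels this equals a linear gain) and skips queries that contain no relevant item. The helper names and the data are hypothetical, and it is not LightGBM's internal implementation.

```python
import numpy as np

def ndcg_at_k(labels, scores, k):
    """NDCG@k for one query, with gain 2**label - 1 and a log2 discount."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    k = min(k, len(labels))
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    # Order items by predicted score (descending) and by true label (ideal).
    predicted_order = np.argsort(-scores, kind="stable")[:k]
    ideal_order = np.argsort(-labels, kind="stable")[:k]
    dcg = np.sum((2.0 ** labels[predicted_order] - 1.0) * discounts)
    ideal_dcg = np.sum((2.0 ** labels[ideal_order] - 1.0) * discounts)
    # Assumption: a query without any relevant item yields NaN and is skipped later.
    return dcg / ideal_dcg if ideal_dcg > 0 else np.nan

def mean_ndcg_by_query(labels, scores, query_sizes, k):
    """Average NDCG@k over contiguous query blocks of the given sizes."""
    boundaries = np.cumsum(query_sizes)[:-1]
    label_blocks = np.split(np.asarray(labels), boundaries)
    score_blocks = np.split(np.asarray(scores), boundaries)
    per_query = [ndcg_at_k(l, s, k) for l, s in zip(label_blocks, score_blocks)]
    return float(np.nanmean(per_query))

# Made-up example: two queries with three items each.
labels = [1, 0, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.1, 0.8, 0.3, 0.1]
result = mean_ndcg_by_query(labels, scores, query_sizes=[3, 3], k=3)
print(result)
```

In this example the first query is ranked perfectly and the second places its only relevant item second, so the average falls between the two per-query values.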


### Prediction

In [9]:
y_pred = bst.predict(test_df.drop(columns=['Relevance Score', 'qid']))

ndcg_val = ndcg_score(np.asarray([test_df['Relevance Score']]), np.asarray([y_pred]))
print(f'NDCG Score: {ndcg_val}')

NDCG Score: 0.9951746057602089
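
The values returned by `bst.predict` are raw scores that are only comparable among items of the same query. A minimal sketch of turning them into rank positions per query with pandas follows. The frame, the `item` column, and the `score` values are hypothetical stand-ins for the model output.

```python
import pandas as pd

# Hypothetical predictions: `score` stands in for the output of bst.predict.
pred_df = pd.DataFrame({
    "qid": [0, 0, 0, 1, 1],
    "item": ["a", "b", "c", "d", "e"],
    "score": [0.2, 1.5, -0.3, 0.7, 0.9],
})

# Rank 1 is the highest score within each query.
pred_df["rank"] = (
    pred_df.groupby("qid")["score"]
    .rank(method="first", ascending=False)
    .astype(int)
)

# Present each query's items in ranked order.
ranked = pred_df.sort_values(["qid", "rank"]).reset_index(drop=True)
print(ranked[["qid", "item", "rank"]])
```

With `method="first"`, ties are broken by row order, which keeps the ranks unique within a query.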
