Setup & library installation: if you are using a fresh environment, install CatBoost / LightGBM and imbalanced-learn first.

In [1]:
%pip install pandas numpy scikit-learn catboost lightgbm imbalanced-learn


Import core libraries & set display options

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

# modelling
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE

pd.set_option("display.max_columns", None)
DATA_DIR = Path(r"D:\KULIAH\Lomba\Intelectra\intelectra\dataset")  # raw string so the backslashes are not read as escapes; change this if the files are in another folder


Load all CSV files

In [None]:
df_trans   = pd.read_csv(DATA_DIR / "train_transaction_data.csv")
df_trans_t = pd.read_csv(DATA_DIR / "test_transaction_data.csv")
df_member  = pd.read_csv(DATA_DIR / "member_data.csv")
df_prod    = pd.read_csv(DATA_DIR / "product_data.csv")
df_prog    = pd.read_csv(DATA_DIR / "prodgram_data.csv")
df_label   = pd.read_csv(DATA_DIR / "train_label_data.csv")
print(df_trans.shape, df_member.shape, df_prod.shape, df_prog.shape, df_label.shape)


Quick EDA: check the label proportions & preview the data

In [None]:
print("next_buy distribution:")
print(df_label['next_buy'].value_counts(normalize=True))
df_trans.head()

Relational joins into the training table

train_transaction ↔ member (MemberID)

product (FK_PRODUCT_ID)

program (FK_PROD_GRAM_ID)

merge with train_label

In [None]:
train = (df_trans
         .merge(df_member,  on='MemberID',      how='left')
         .merge(df_prod,    left_on='FK_PRODUCT_ID',   right_on='productID',  how='left')
         .merge(df_prog,    left_on='FK_PROD_GRAM_ID', right_on='programID',  how='left')
         .merge(df_label,   on='MemberID',      how='left')
        )

test  = (df_trans_t
         .merge(df_member,  on='MemberID',      how='left')
         .merge(df_prod,    left_on='FK_PRODUCT_ID',   right_on='productID',  how='left')
         .merge(df_prog,    left_on='FK_PROD_GRAM_ID', right_on='programID',  how='left')
        )

print(train.shape, test.shape)
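
The left joins above only keep one row per transaction if every lookup table has a single row per key. The following is a minimal sketch on invented toy frames (not the competition data) of how pandas' `validate="many_to_one"` option can guard against a duplicated key silently multiplying rows.

```python
import pandas as pd

# Toy stand-ins for the transaction and member tables (values are invented).
trans = pd.DataFrame({"MemberID": [1, 1, 2], "Qty": [3, 1, 5]})
member = pd.DataFrame({"MemberID": [1, 2], "JoinDate": ["2020-01-01", "2021-06-15"]})

# validate="many_to_one" raises MergeError if MemberID is duplicated in `member`.
merged = trans.merge(member, on="MemberID", how="left", validate="many_to_one")

# A left join against a unique key must keep the number of transaction rows.
rows_preserved = len(merged) == len(trans)
print(rows_preserved)

# With a duplicated key the same merge is rejected instead of inflating rows.
member_dup = pd.concat([member, member.iloc[[0]]], ignore_index=True)
try:
    trans.merge(member_dup, on="MemberID", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

The same option could be added to each `.merge(...)` call in the cell above, assuming the member, product, program and label tables are meant to be unique on their keys.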


Concise feature engineering

TotalSpending = Qty * PricePerUnit

datetime → Month, DayOfWeek, Recency

MemberAge, MemberTenure

drop unused ID columns

In [None]:
def create_features(df, is_train=True):
    df = df.copy()
    # datetime
    df['TransactionDatetime'] = pd.to_datetime(df['TransactionDatetime'])
    df['Month']       = df['TransactionDatetime'].dt.month
    df['DayOfWeek']   = df['TransactionDatetime'].dt.dayofweek
    
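    # days since the member's previous transaction (NaN for a member's first one)
    df.sort_values(['MemberID', 'TransactionDatetime'], inplace=True)
    df['Recency'] = df.groupby('MemberID')['TransactionDatetime'].diff().dt.days
    # assign back: chained inplace fillna is deprecated and has no effect under copy-on-write
    df['Recency'] = df['Recency'].fillna(df['Recency'].median())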
    
    # numeric conversions
    df['Qty'] = pd.to_numeric(df['Qty'], errors='coerce')
    df['TotalSpending'] = df['Qty'] * df['PricePerUnit']
    
    # member age & tenure
    dob = pd.to_datetime(df['DateOfBirth'], errors='coerce')
    df['MemberAge'] = (pd.Timestamp('today') - dob).dt.days // 365
    joind = pd.to_datetime(df['JoinDate'], errors='coerce')
    df['MemberTenure'] = (pd.Timestamp('today') - joind).dt.days
    
    # select features
    keep_cols = ['MemberID','ProductCategory','ProductLevel','Source',
                 'Qty','TotalSpending','Month','DayOfWeek','Recency',
                 'MemberAge','MemberTenure']
    if is_train:
        keep_cols.append('next_buy')
    return df[keep_cols]

train_feat = create_features(train)
test_feat  = create_features(test, is_train=False)
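
To make the `Recency` feature concrete, here is a small sketch on invented toy transactions. It reproduces the `groupby(...).diff()` computation from `create_features` and, as a hypothetical alternative that is not part of the notebook, a per-member "days since last purchase" measured against a fixed reference date.

```python
import pandas as pd

# Toy transactions for two members (values are invented for illustration).
toy = pd.DataFrame({
    "MemberID": [1, 1, 1, 2],
    "TransactionDatetime": pd.to_datetime(
        ["2024-01-01", "2024-01-11", "2024-01-31", "2024-02-05"]),
})
toy = toy.sort_values(["MemberID", "TransactionDatetime"])

# Same computation as in create_features: gap in days to the previous purchase.
toy["Recency"] = toy.groupby("MemberID")["TransactionDatetime"].diff().dt.days
gaps = toy["Recency"].tolist()   # NaN marks each member's first transaction

# Hypothetical alternative: days from each member's last purchase to a fixed
# reference date (here the latest timestamp in the toy data).
ref_date = toy["TransactionDatetime"].max()
last_purchase = toy.groupby("MemberID")["TransactionDatetime"].max()
days_since_last = (ref_date - last_purchase).dt.days
print(gaps, days_since_last.to_dict())
```

The first variant describes the gap between purchases, the second how long a member has been inactive. Which one is more useful for `next_buy` is an open modelling choice.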


Pre-processing: split X/y, handle class imbalance (SMOTE), list the categorical columns

In [None]:
y = train_feat['next_buy']
X = train_feat.drop(columns=['next_buy'])

cat_cols = ['ProductCategory','ProductLevel','Source']
for col in cat_cols:
    X[col] = X[col].astype('category')
    test_feat[col] = test_feat[col].astype('category')

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

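# Optional oversampling. Plain SMOTE only accepts numeric features, so SMOTENC
# is used because X contains categorical columns. It expects no missing values
# (impute NaN first) and treats MemberID, still a column of X, as numeric.
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(categorical_features=[X_train.columns.get_loc(c) for c in cat_cols],
             random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)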
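
As a possible alternative to oversampling, class weights can rebalance the loss without creating synthetic rows. The sketch below uses scikit-learn's `compute_class_weight` on an invented label vector. It is an illustration of the idea, not the approach used in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced label vector: 8 negatives, 2 positives (invented values).
y_toy = np.array([0] * 8 + [1] * 2)

classes = np.unique(y_toy)
# 'balanced' gives each class the weight n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)
```

A dictionary like this could be passed to `CatBoostClassifier` through its `class_weights` parameter, or CatBoost can derive the weights itself with `auto_class_weights='Balanced'`.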


Model: CatBoost (because it handles categorical features natively) + Balanced Accuracy evaluation

In [None]:
model = CatBoostClassifier(
    iterations=300,
    depth=6,
    learning_rate=0.1,
    loss_function='Logloss',
    eval_metric='BalancedAccuracy',
    random_state=42,
    verbose=False
)

model.fit(
    X_train_res, y_train_res,
    eval_set=(X_val, y_val),
    cat_features=[X.columns.get_loc(c) for c in cat_cols],
    verbose=100
)

pred_val = model.predict(X_val)
print("Balanced Accuracy:", balanced_accuracy_score(y_val, pred_val))
print(classification_report(y_val, pred_val))
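
Balanced accuracy is the mean of the per-class recalls, which is why it is more informative than plain accuracy on an imbalanced label. The following sketch, on invented labels and predictions, computes it by hand from the confusion matrix and compares it with `balanced_accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Toy labels: 8 negatives, 2 positives; the "model" predicts mostly 0.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_neg = tn / (tn + fp)   # recall of class 0 (specificity)
recall_pos = tp / (tp + fn)   # recall of class 1 (sensitivity)
manual = (recall_neg + recall_pos) / 2

plain = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
print(plain, manual, balanced)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because one of the two positives is missed.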


5-fold cross-validation (optional, but recommended)

In [None]:
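cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# declare cat_features in the constructor: cross_val_score fits clones and passes no fit arguments
cv_model = CatBoostClassifier(iterations=300, depth=6, learning_rate=0.1, loss_function='Logloss',
                              random_state=42, verbose=False, cat_features=cat_cols)
scores = cross_val_score(cv_model, X, y, cv=cv, scoring='balanced_accuracy', n_jobs=-1)
print("Balanced Acc 5-fold:", scores.mean(), "±", scores.std())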


Refit on the full training data + predict on the test data

In [None]:
model.fit(
    X, y,
    cat_features=[X.columns.get_loc(c) for c in cat_cols],
    verbose=False
)
test_pred = model.predict(test_feat)


Create submission.csv: make sure the MemberID order matches the sample file

In [None]:
submission = pd.DataFrame({
    'MemberID': test_feat['MemberID'],
    'next_buy': test_pred.astype(int)
})
submission.to_csv('submission.csv', index=False)
submission.head()
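
The predictions above are made per transaction, so a member with several test transactions appears on several rows, and `create_features` sorts the rows by `MemberID`. If the sample file expects one row per member in a given order, something like the sketch below may be needed. The majority-vote rule and the `sample_order` list are assumptions made for illustration, and the data is invented.

```python
import pandas as pd

# Toy per-transaction predictions (invented values): member 7 has three rows.
pred_rows = pd.DataFrame({
    "MemberID": [7, 7, 7, 3, 5, 5],
    "next_buy": [1, 0, 1, 0, 1, 1],
})

# Assumption: one prediction per member is wanted; here a majority vote over
# the member's transactions (mean >= 0.5), which is only one possible rule.
per_member = (pred_rows.groupby("MemberID")["next_buy"].mean() >= 0.5).astype(int)

# Hypothetical MemberID order, as it would be read from a sample submission file.
sample_order = [5, 3, 7]
submission_toy = (per_member.reindex(sample_order)
                  .rename_axis("MemberID")
                  .reset_index())
print(submission_toy)
```

Averaging predicted probabilities from `model.predict_proba` per member before thresholding would be another reasonable aggregation rule.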
