{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 04_modeling_and_inference.ipynb  \n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Loading & Preparing Data  \n",
    "*We’re loading the saved CSVs for users, content, and training labels, then converting date strings into numeric timestamps for modeling.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Loading user features to merge aggregated user signals\n",
    "import pandas as pd\n",
    "user_df = pd.read_csv('user_features.csv')  \n",
    "# Loading content features to merge aggregated content signals\n",
    "content_df = pd.read_csv('content_features.csv')  \n",
    "# Loading combined training data (positive interactions + negative candidates)\n",
    "train_df = pd.read_csv('training_data.csv')  \n",
    "# Parsing datetime columns into Unix timestamps\n",
    "for col in ['first_seen', 'last_seen']:\n",
    "    train_df[col] = (pd.to_datetime(train_df[col], errors='coerce')  \n",
    "                        .astype('int64') // 10**9)\n",
    "# Quick sanity check shapes\n",
    "print(f\"User: {user_df.shape}, Content: {content_df.shape}, Train: {train_df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Feature Matrix Construction  \n",
    "*We’re merging user & content features into a single DataFrame, then splitting into X (features) and y (labels).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge user-level signals\n",
    "train = train_df.merge(user_df, on='deviceId', how='left')  \n",
    "# Merge content-level signals\n",
    "train = train.merge(content_df, on='hashId', how='left')  \n",
    "# Save for reproducibility\n",
    "train.to_csv('train.csv', index=False)  \n",
    "# Prepare feature matrix and target\n",
    "X = train.drop(columns=['deviceId','hashId','label']).select_dtypes(include=['int64','float64'])  \n",
    "y = train['label']  \n",
    "print(f\"X shape: {X.shape}, positive rate: {y.mean():.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Model Training & Validation  \n",
    "*We’re splitting the data, training a LightGBM classifier with early stopping, and evaluating via AUC.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import roc_auc_score\n",
    "import lightgbm as lgb\n",
    "from lightgbm import LGBMClassifier\n",
    "# Split\n",
    "X_train,X_val,y_train,y_val = train_test_split(X,y,stratify=y,test_size=0.2,random_state=42)\n",
    "# Train\n",
    "model = LGBMClassifier(n_estimators=200,random_state=42)\n",
    "model.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=[(X_val,y_val)],\n",
    "    callbacks=[lgb.early_stopping(20)]\n",
    ")\n",
    "# Evaluate\n",
    "preds = model.predict_proba(X_val)[:,1]\n",
    "print('Validation AUC:', roc_auc_score(y_val,preds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Prediction & Submission  \n",
    "*We’re loading test candidates, merging features, scoring, and exporting top-50 per user.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_candidates = pd.read_csv('test_candidates.csv')  \n",
    "X_test = test_candidates.merge(user_df,on='deviceId',how='left')\n",
    "X_test = X_test.merge(content_df,on='hashId',how='left')\n",
    "for col in ['first_seen','last_seen']:\n",
    "    X_test[col] = pd.to_datetime(X_test[col],errors='coerce').astype('int64')//1e9\n",
    "X_test_model = X_test[X.columns]  \n",
    "test_candidates['label'] = model.predict_proba(X_test_model)[:,1]  \n",
    "test_candidates.to_csv('submission.csv',index=False)  \n",
    "test_candidates.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*End of notebook. Everything cleaned up and ready for review.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
