1_lightgbm.ipynb

This notebook implements a **user-level LightGBM pipeline** for the *Influencer vs Observer* task.

Pipeline Overview:
1. Load and normalize JSONL data
2. Engineer structural features from tweets (temporal, textual, user profile)
3. Process object columns with frequency encoding and presence flags
4. Aggregate tweet-level features to user-level statistics
5. Train LightGBM classifier with stratified k-fold cross-validation
6. Generate out-of-fold and test predictions and save them as CSV files
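
Steps 3 and 4 can be illustrated on a toy example. The sketch below uses only the standard library rather than the pandas calls in the notebook. The tweets, user keys, and feature names are hypothetical, chosen to show frequency encoding, per-user aggregation, and the majority-vote user label.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy tweets: (user_key, source, text_length, label)
tweets = [
    ("u1", "iPhone", 120, 1),
    ("u1", "Web", 80, 1),
    ("u1", "iPhone", 100, 0),
    ("u2", "Android", 40, 0),
    ("u2", "iPhone", 60, 0),
]

# Frequency encoding: replace each category by its relative frequency.
counts = Counter(source for _, source, _, _ in tweets)
source_freq = {src: n / len(tweets) for src, n in counts.items()}

# Group the tweet-level values by user.
by_user = defaultdict(list)
for user, source, length, label in tweets:
    by_user[user].append((source_freq[source], length, label))

# Aggregate each user's tweets into one row of summary statistics.
user_rows = {}
for user, rows in by_user.items():
    freqs, lengths, labels = zip(*rows)
    user_rows[user] = {
        "n_tweets": len(rows),
        "source_freq_mean": mean(freqs),
        "text_length_mean": mean(lengths),
        "text_length_max": max(lengths),
        # Majority vote: label 1 when at least half of the user's tweets are 1.
        "label": 1 if mean(labels) >= 0.5 else 0,
    }

for user, row in user_rows.items():
    print(user, row)
```

The notebook computes `mean`, `max`, `min`, and `std` for every numeric tweet-level feature; the toy version keeps only a few statistics to stay readable.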


In [1]:
import os
import gc
import re
import ast
import numpy as np
import pandas as pd
from pandas import json_normalize
from urllib.parse import urlparse
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")


# CONFIGURATION

SEED = 42
N_FOLDS = 5
ID_COL = "ID"
LABEL_COL = "label"

# Directory paths
ROOT_DIR = "."
# ROOT_DIR = "/content/drive/MyDrive/Colab Notebooks/code"  # for Google Colab
OUT_DIR = os.path.join(ROOT_DIR, "intermediate")
os.makedirs(OUT_DIR, exist_ok=True)
TRAIN_JSONL = os.path.join(ROOT_DIR, "train.jsonl")
TEST_JSONL = os.path.join(ROOT_DIR, "kaggle_test.jsonl")

# Twitter datetime format
TWITTER_DT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Columns to exclude from feature engineering
TEXT_COLS_EXCLUDE = ["text", "extended_tweet.full_text", "full_text"]

# LightGBM hyperparameters
LGBM_PARAMS = {
    'n_estimators': 1500,       # number of trees
    'learning_rate': 0.03,      # learning rate
    'num_leaves': 31,           # tree complexity
    'max_depth': 7,             # max tree depth
    'min_child_samples': 20,    # minimum samples per leaf
    'subsample': 0.8,           # row sampling
    'colsample_bytree': 0.8,    # feature sampling
    'reg_alpha': 0.1,           # L1 regularization
    'reg_lambda': 1.0,          # L2 regularization
    'random_state': SEED,       # reproducibility
    'n_jobs': -1,               # use all cores
    'verbose': -1               # no logs
}


# UTILITY FUNCTIONS

def safe_upper_ratio(text):
    """Calculate the ratio of uppercase characters in text."""
    if not isinstance(text, str) or not text:
        return 0.0
    return sum(1 for c in text if c.isupper()) / max(1, len(text))


def is_hex_color(x):
    """Check if string is a valid 6-digit hex color code."""
    return isinstance(x, str) and re.fullmatch(r"[0-9A-Fa-f]{6}", x) is not None


def hex_to_rgb(h):
    """Convert hex color code to RGB tuple."""
    if is_hex_color(h):
        return tuple(int(h[i:i+2], 16) for i in (0, 2, 4))
    return (0, 0, 0)


def extract_domain(x):
    """Extract domain name from URL string."""
    if not isinstance(x, str) or len(x) < 5:
        return "none"
    try:
        return urlparse(x).netloc or "none"
    except Exception:  # malformed URLs (e.g. invalid IPv6 literals) raise ValueError
        return "none"


def parse_list_like(x):
    """Parse list-like strings and handle various input types."""
    if isinstance(x, list):
        return x
    if pd.isna(x):
        return []
    if isinstance(x, str):
        s = x.strip()
        if s in ("", "[]"):
            return []
        try:
            val = ast.literal_eval(s)
            return val if isinstance(val, list) else []
        except Exception:  # not a valid Python literal
            return []
    return []


def detect_bool_like_series(s, max_unique=4):
    """
    Detect if a pandas Series contains boolean-like values.
    Returns True if values can be interpreted as boolean (True/False, 0/1).
    """
    sample = s.dropna()
    if sample.apply(lambda v: isinstance(v, (list, dict, set))).any():
        return False

    try:
        vals = pd.unique(sample)
    except TypeError:
        return False

    if len(vals) == 0 or len(vals) > max_unique:
        return False

    for v in vals:
        if isinstance(v, bool):
            continue
        elif isinstance(v, (int, float)) and v in (0, 1):
            continue
        elif isinstance(v, str) and v.strip().lower() in ("true", "false", "0", "1"):
            continue
        else:
            return False
    return True


def map_bool_like_to_float(s):
    """Convert boolean-like values to float (1.0 or 0.0)."""
    def _map(v):
        if isinstance(v, bool):
            return 1.0 if v else 0.0
        if isinstance(v, (int, float)) and v in (0, 1):
            return float(v)
        if isinstance(v, str):
            vs = v.strip().lower()
            if vs in ("true", "1"):
                return 1.0
            if vs in ("false", "0"):
                return 0.0
        return np.nan
    return s.apply(_map).astype("float32")


def is_mostly_listlike(s, threshold=0.3):
    """
    Check whether a Series holds list-like values.
    Returns True if more than `threshold` (default 30%) of the first
    200 non-null values are lists, dicts, or sets.
    """
    sample = s.dropna().head(200)
    if len(sample) == 0:
        return False
    return sample.apply(lambda v: isinstance(v, (list, dict, set))).mean() > threshold


def balanced_sample_weight(y):
    """
    Compute balanced class weights for training.
    Helps address class imbalance by upweighting minority class.
    """
    classes, counts = np.unique(y, return_counts=True)
    weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([weights[v] for v in y], dtype=np.float32)


# STEP 1: DATA LOADING

print("Step 1: Loading JSONL data")

# Load and normalize nested JSON structures
train_df = json_normalize(pd.read_json(TRAIN_JSONL, lines=True).to_dict(orient="records"))
test_df = json_normalize(pd.read_json(TEST_JSONL, lines=True).to_dict(orient="records"))

# Set ID column
train_df[ID_COL] = train_df["challenge_id"]
test_df[ID_COL] = test_df["challenge_id"]

print(f"Train shape: {train_df.shape}, Test shape: {test_df.shape}")


# STEP 2: FEATURE ENGINEERING
print("\nStep 2: Engineering structural features")

# Temporal Features
print("  - Temporal features (hour, weekday, month, year)")
for df_ in (train_df, test_df):
    if "created_at" in df_.columns:
        dt = pd.to_datetime(df_["created_at"], format=TWITTER_DT_FORMAT, errors="coerce")
        df_["tweet_hour"] = dt.dt.hour
        df_["tweet_weekday"] = dt.dt.weekday
        df_["tweet_dayofmonth"] = dt.dt.day
        df_["tweet_month"] = dt.dt.month
        df_["tweet_year"] = dt.dt.year
        df_["is_weekend"] = df_["tweet_weekday"].isin([5, 6]).astype("float32")

# Account Age
print("  - Account age features")
if "user.created_at" in train_df.columns:
    for df_ in (train_df, test_df):
        dt_user = pd.to_datetime(df_["user.created_at"], format=TWITTER_DT_FORMAT, errors="coerce")
        dt_tweet = pd.to_datetime(df_["created_at"], format=TWITTER_DT_FORMAT, errors="coerce")
        age_days = (dt_tweet - dt_user).dt.days.astype("float32").clip(lower=0)
        df_["user_account_age_days"] = age_days
        df_["log_user_account_age_days"] = np.log1p(age_days)


# Text Features
print("  - Text-based features (length, hashtags, mentions)")
for df_ in (train_df, test_df):
    if "text" in df_.columns:
        txt = df_["text"].fillna("").astype(str)
        df_["text_length"] = txt.str.len().astype("float32")
        df_["text_n_hashtags"] = txt.str.count("#").astype("float32")
        df_["text_n_mentions"] = txt.str.count("@").astype("float32")
        df_["text_upper_ratio"] = txt.apply(safe_upper_ratio).astype("float32")


# User Description Features
print("  - User description features")
for df_ in (train_df, test_df):
    if "user.description" in df_.columns:
        desc = df_["user.description"].fillna("").astype(str)
        df_["user_desc_len"] = desc.str.len().astype("float32")
        df_["user_desc_has_url"] = desc.str.contains("http", regex=False).astype("float32")


# Presence flags
print("  - Presence flags (location, banner, profile URL)")
for df_ in (train_df, test_df):
    presence_cols = [
        ("user.location", "has_location"),
        ("user.profile_banner_url", "has_banner"),
        ("user.url", "has_profile_url")
    ]
    for col, feat in presence_cols:
        if col in df_.columns:
            df_[feat] = df_[col].notna().astype("float32")


# Entity counts
print("  - Entity counts (hashtags, mentions, URLs, symbols)")
ENTITIES_LIST_COLS = [
    "entities.hashtags", "entities.user_mentions", "entities.urls", "entities.symbols",
    "extended_tweet.entities.hashtags", "extended_tweet.entities.user_mentions",
    "extended_tweet.entities.urls", "extended_tweet.entities.symbols",
    "quoted_status.entities.hashtags", "quoted_status.entities.user_mentions",
    "quoted_status.entities.urls", "quoted_status.entities.symbols",
]
for df_ in (train_df, test_df):
    for col in ENTITIES_LIST_COLS:
        if col in df_.columns:
            df_[f"{col}_n_items"] = df_[col].apply(
                lambda x: len(x) if isinstance(x, list) else 0
            ).astype("float32")

    if "entities.media" in df_.columns:
        df_["has_media"] = df_["entities.media"].notna().astype("float32")


# Profile color features (RGB)
print("  - Profile color RGB components")
COLOR_COLS = [
    "user.profile_link_color",
    "user.profile_background_color",
    "user.profile_sidebar_fill_color",
    "user.profile_sidebar_border_color",
    "user.profile_text_color"
]
for df_ in (train_df, test_df):
    for col in COLOR_COLS:
        if col in df_.columns:
            # Build the three RGB columns in one pass instead of one pd.Series per row
            rgb = pd.DataFrame(df_[col].map(hex_to_rgb).tolist(), index=df_.index)
            df_[f"{col}_r"] = rgb[0].astype("float32")
            df_[f"{col}_g"] = rgb[1].astype("float32")
            df_[f"{col}_b"] = rgb[2].astype("float32")


# Source frequency encoding (iPhone, Android, Web, etc.)
print("  - Source frequency encoding")
if "source" in train_df.columns:
    source_freq = train_df["source"].value_counts(normalize=True)
    train_df["source_freq"] = train_df["source"].map(source_freq).fillna(0).astype("float32")
    test_df["source_freq"] = test_df["source"].map(source_freq).fillna(0).astype("float32")


# User URL domain features (encode domain frequency)
print("  - User URL domain frequency")
if "user.url" in train_df.columns:
    train_df["user_url_domain"] = train_df["user.url"].apply(extract_domain)
    test_df["user_url_domain"] = test_df["user.url"].apply(extract_domain)
    domain_freq = train_df["user_url_domain"].value_counts(normalize=True)
    train_df["user_url_domain_freq"] = train_df["user_url_domain"].map(domain_freq).fillna(0).astype("float32")
    test_df["user_url_domain_freq"] = test_df["user_url_domain"].map(domain_freq).fillna(0).astype("float32")


# Withheld countries (is the content withheld, and in how many countries?)
print("  - Withheld countries features")
withheld_cols = ["withheld_in_countries", "quoted_status.withheld_in_countries"]
for col in withheld_cols:
    for df_ in (train_df, test_df):
        if col in df_.columns:
            parsed = df_[col].apply(parse_list_like)
            df_[f"{col}_has_withheld"] = parsed.apply(
                lambda x: 1.0 if len(x) > 0 else 0.0
            ).astype("float32")
            df_[f"{col}_n_withheld_countries"] = parsed.apply(
                lambda x: float(len(x))
            ).astype("float32")


# Coordinate features (are coordinates present?)
print("  - Coordinate presence flags")
coord_cols = [
    "coordinates.coordinates",
    "geo.coordinates",
    "quoted_status.coordinates.coordinates",
    "quoted_status.geo.coordinates"
]
for col in coord_cols:
    for df_ in (train_df, test_df):
        if col in df_.columns:
            parsed = df_[col].apply(parse_list_like)
            df_[f"{col}_has_coords"] = parsed.apply(
                lambda x: 1.0 if len(x) >= 2 else 0.0
            ).astype("float32")


# Quoted status features (how old is the quoted tweet/user?)
print("  - Quoted status temporal features")
for df_ in (train_df, test_df):
    dt_main = None
    if "created_at" in df_.columns:
        dt_main = pd.to_datetime(df_["created_at"], format=TWITTER_DT_FORMAT, errors="coerce")

    # Time lag between main tweet and quoted tweet
    if "quoted_status.created_at" in df_.columns and dt_main is not None:
        dt_quoted = pd.to_datetime(df_["quoted_status.created_at"], format=TWITTER_DT_FORMAT, errors="coerce")
        lag_days = (dt_main - dt_quoted).dt.days.astype("float32").clip(lower=0)
        df_["quoted_tweet_lag_days"] = lag_days
        df_["quoted_tweet_is_recent_7d"] = (lag_days <= 7).astype("float32")

    # Quoted user account age
    if "quoted_status.user.created_at" in df_.columns and dt_main is not None:
        dt_q_user = pd.to_datetime(df_["quoted_status.user.created_at"], format=TWITTER_DT_FORMAT, errors="coerce")
        age_q = (dt_main - dt_q_user).dt.days.astype("float32").clip(lower=0)
        df_["quoted_user_account_age_days"] = age_q
        df_["log_quoted_user_account_age_days"] = np.log1p(age_q)


# Quoted status numeric features
print("  - Quoted status numeric features and log transforms")
quoted_numeric_cols = [
    "quoted_status.user.favourites_count",
    "quoted_status.user.listed_count",
    "quoted_status.user.friends_count",
    "quoted_status.user.followers_count",
    "quoted_status.user.statuses_count",
    "quoted_status.retweet_count",
    "quoted_status.favorite_count",
    "quoted_status.quote_count",
    "quoted_status.reply_count"
]
for col in quoted_numeric_cols:
    for df_ in (train_df, test_df):
        if col in df_.columns:
            df_[col] = pd.to_numeric(df_[col], errors="coerce").astype("float32")
            df_[f"{col}_log1p"] = np.log1p(df_[col]).astype("float32")


# Global mapping from quoted users to main users (train + test)
print("  - Global mapping quoted user friends/followers to main users via created_at")

# required columns to build the global quoted-user mapping
required_cols = [
    "user.created_at",
    "quoted_status.user.created_at",
    "quoted_status.user.friends_count",
    "quoted_status.user.followers_count",
]

# only if all required columns are available
# in both train and test datasets (robust to schema variations):
if all(c in train_df.columns for c in required_cols) and all(c in test_df.columns for c in required_cols):

    # extract quoted-user statistics from the training set
    tmp_train = train_df[[
        "quoted_status.user.created_at",
        "quoted_status.user.friends_count",
        "quoted_status.user.followers_count"
    ]]

    # extract quoted-user statistics from the test set
    tmp_test = test_df[[
        "quoted_status.user.created_at",
        "quoted_status.user.friends_count",
        "quoted_status.user.followers_count"
    ]]

    # combine train and test to build a global mapping
    combined_tmp = pd.concat([tmp_train, tmp_test], axis=0, ignore_index=True)

    # keep only rows where the quoted user's creation time is known
    # (proxy identifier for the user)
    combined_tmp = combined_tmp.dropna(subset=["quoted_status.user.created_at"])

    if not combined_tmp.empty:
        # numeric conversion of friends and followers counters
        combined_tmp["quoted_status.user.friends_count"] = pd.to_numeric(
            combined_tmp["quoted_status.user.friends_count"], errors="coerce"
        )
        combined_tmp["quoted_status.user.followers_count"] = pd.to_numeric(
            combined_tmp["quoted_status.user.followers_count"], errors="coerce"
        )

        # aggregate quoted-user statistics by their account creation timestamp
        # the mean is used to stabilize potentially noisy observations
        agg_global = combined_tmp.groupby("quoted_status.user.created_at").agg({
            "quoted_status.user.friends_count": "mean",
            "quoted_status.user.followers_count": "mean",
        }).rename(columns={
            "quoted_status.user.friends_count": "mapped_user_friends_count",
            "quoted_status.user.followers_count": "mapped_user_followers_count",
        })

        # map aggregated quoted-user statistics back to the main users
        # using user.created_at as a proxy key shared across interactions
        for df_ in (train_df, test_df):
            df_["user_friends_from_quoted"] = df_["user.created_at"].map(
                agg_global["mapped_user_friends_count"]
            ).astype("float32")
            df_["user_followers_from_quoted"] = df_["user.created_at"].map(
                agg_global["mapped_user_followers_count"]
            ).astype("float32")

    else:
        # degenerate case: no valid quoted-user timestamps available
        for df_ in (train_df, test_df):
            df_["user_friends_from_quoted"] = np.nan
            df_["user_followers_from_quoted"] = np.nan

else:
    # if the required quoted-user columns are absent
    for df_ in (train_df, test_df):
        df_["user_friends_from_quoted"] = np.nan
        df_["user_followers_from_quoted"] = np.nan


# STEP 3: GENERIC OBJECT COLUMN PROCESSING
print("\nStep 3: Processing generic object columns")

# Align columns between train and test
all_cols = sorted(set(train_df.columns) | set(test_df.columns))
for col in all_cols:
    if col not in train_df.columns:
        train_df[col] = np.nan
    if col not in test_df.columns:
        test_df[col] = np.nan

# Process all object-type columns
obj_cols = [c for c in train_df.columns if train_df[c].dtype == "O"]
print(f"  Processing {len(obj_cols)} object columns")

for col in obj_cols:
    # Skip label, ID, and text columns
    if col in [LABEL_COL, ID_COL] + TEXT_COLS_EXCLUDE:
        continue

    s_train = train_df[col]
    s_test = test_df[col]
    s_all = pd.concat([s_train, s_test], axis=0)

    # Adds a presence flag (is the value missing or not?)
    train_df[f"{col}_is_present"] = s_train.notna().astype("float32")
    test_df[f"{col}_is_present"] = s_test.notna().astype("float32")

    # Adds the string length as a numeric feature
    train_df[f"{col}_str_len"] = s_train.fillna("").astype(str).str.len().astype("float32")
    test_df[f"{col}_str_len"] = s_test.fillna("").astype(str).str.len().astype("float32")

    # Skip list-like columns for frequency encoding
    if is_mostly_listlike(s_all):
        continue

    # detects and encodes boolean-like columns
    if detect_bool_like_series(s_train):
        train_df[f"{col}_bool"] = map_bool_like_to_float(s_train)
        test_df[f"{col}_bool"] = map_bool_like_to_float(s_test)

    # applies frequency encoding to low-cardinality categorical features
    try:
        n_unique = s_all.nunique(dropna=True)
    except TypeError:
        continue

    if 1 < n_unique <= 20:
        freq = s_all.value_counts(normalize=True)
        train_df[f"{col}_freq"] = s_train.map(freq).fillna(0).astype("float32")
        test_df[f"{col}_freq"] = s_test.map(freq).fillna(0).astype("float32")


# List-like Item Counts
# For columns containing list-like values, we extract the number of items as a numeric feature
print("  - Counting items in list-like columns")
for col in obj_cols:
    if col in [LABEL_COL, ID_COL] + TEXT_COLS_EXCLUDE + ENTITIES_LIST_COLS:
        continue

    s_all = pd.concat([train_df[col], test_df[col]], axis=0)
    if is_mostly_listlike(s_all):
        train_df[f"{col}_n_items"] = train_df[col].apply(parse_list_like).apply(
            lambda x: float(len(x))
        ).astype("float32")
        test_df[f"{col}_n_items"] = test_df[col].apply(parse_list_like).apply(
            lambda x: float(len(x))
        ).astype("float32")


# For selected categorical columns, we apply frequency encoding based on train + test values
print("  - Additional frequency encodings for categorical columns")
categorical_cols = [
    "lang", "user.lang", "filter_level", "user.translator_type",
    "place.country", "place.country_code", "place.full_name",
    "place.name", "place.type", "user.time_zone", "in_reply_to_screen_name"
]
for col in categorical_cols:
    if col not in train_df.columns:
        continue
    s_train = train_df[col].astype(str)
    s_test = test_df[col].astype(str)
    freq = pd.concat([s_train, s_test]).value_counts(normalize=True)
    train_df[f"{col}_freq_all"] = s_train.map(freq).fillna(0).astype("float32")
    test_df[f"{col}_freq_all"] = s_test.map(freq).fillna(0).astype("float32")

# Copy both frames to defragment them after the many column insertions above
train_df = train_df.copy()
test_df = test_df.copy()


# STEP 4: DATA CLEANING
print("\nStep 4: Removing empty and constant columns")

# Combine datasets to check for empty/constant columns
all_df = pd.concat([train_df, test_df], axis=0, ignore_index=True)
cols_to_drop = []

for c in train_df.columns:
    if c in [LABEL_COL, ID_COL]:
        continue

    s = all_df[c]

    # drop if all missing
    if s.isna().all():
        cols_to_drop.append(c)
        continue

    # drop if constant; columns holding unhashable values raise TypeError and are kept
    try:
        if s.nunique(dropna=False) <= 1:
            cols_to_drop.append(c)
    except TypeError:
        continue

print(f"  Dropping {len(cols_to_drop)} empty or constant columns")
train_df.drop(columns=cols_to_drop, inplace=True, errors="ignore")
test_df.drop(columns=cols_to_drop, inplace=True, errors="ignore")

del all_df
gc.collect()


# STEP 5: SELECT NUMERIC FEATURES

print("\nStep 5: Selecting numeric features for modeling")

# Columns to exclude from feature selection
exclude_cols = [LABEL_COL, ID_COL, "user_key", "challenge_id"] + TEXT_COLS_EXCLUDE

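# Select only numeric columns (excluding IDs); note that the substring test for "id"
# also drops names that merely contain those letters, such as the "sidebar" color features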
feature_cols = [
    c for c in train_df.columns
    if c not in exclude_cols
    and str(train_df[c].dtype) in ["int64", "int32", "float64", "float32", "bool"]
    and "id" not in c.lower()
]
print(f"  Selected {len(feature_cols)} tweet-level features")


# STEP 6: USER-LEVEL AGGREGATION

print("\nStep 6: Aggregating features to user level")

# create user key based on account creation time (proxy for unique user)
train_df["user_key"] = train_df["user.created_at"].fillna("UNK").astype(str)
test_df["user_key"] = test_df["user.created_at"].fillna("UNK").astype(str)

# define aggregation functions
agg_funcs = ["mean", "max", "min", "std"]
agg_dict = {c: agg_funcs for c in feature_cols}
agg_dict[ID_COL] = "count"  # count number of tweets per user

# aggregate training data
train_user = train_df.groupby("user_key").agg(agg_dict)
train_user.columns = [f"{col}_{stat}" for col, stat in train_user.columns.to_flat_index()]
train_user = train_user.reset_index().rename(columns={f"{ID_COL}_count": "n_tweets"})

# user-level labels (majority vote: 1 if mean >= 0.5, else 0)
user_label = train_df.groupby("user_key")[LABEL_COL].agg(
    lambda x: 1 if x.mean() >= 0.5 else 0
).reset_index()
train_user = train_user.merge(user_label, on="user_key")

# aggregate test data
test_user = test_df.groupby("user_key").agg(agg_dict)
test_user.columns = [f"{col}_{stat}" for col, stat in test_user.columns.to_flat_index()]
test_user = test_user.reset_index().rename(columns={f"{ID_COL}_count": "n_tweets"})

print(f"  Train users: {len(train_user)}, Test users: {len(test_user)}")

# select user-level feature columns
user_feature_cols = [c for c in train_user.columns if c not in ["user_key", LABEL_COL]]
print(f"  User-level features: {len(user_feature_cols)}")


# STEP 7: PREPARE FEATURE MATRICES

print("\nStep 7: Preparing feature matrices for training")

# extract labels
y_user = train_user[LABEL_COL].astype(int).values

# impute missing values and standardize features
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

X_train = scaler.fit_transform(
    imputer.fit_transform(train_user[user_feature_cols].values)
).astype(np.float32)

X_test = scaler.transform(
    imputer.transform(test_user[user_feature_cols].values)
).astype(np.float32)

print(f"  X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")


# STEP 8: TRAIN LIGHTGBM WITH CROSS-VALIDATION

print("\nStep 8: Training LightGBM with stratified k-fold cross-validation")

# Initialize stratified k-fold
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Initialize out-of-fold predictions
oof_user = np.zeros(len(train_user), dtype=np.float32)
test_user_folds = []

# Train model for each fold
for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_user), 1):
    print(f"\n  Fold {fold}/{N_FOLDS}")

    # Split data
    X_tr, X_val = X_train[tr_idx], X_train[val_idx]
    y_tr, y_val = y_user[tr_idx], y_user[val_idx]

    # Train model with balanced class weights
    model = LGBMClassifier(**LGBM_PARAMS)
    model.fit(
        X_tr, y_tr,
        sample_weight=balanced_sample_weight(y_tr),
        eval_set=[(X_val, y_val)],
        eval_metric='binary_logloss'
    )

    # predict on validation set
    val_proba = model.predict_proba(X_val)[:, 1]
    oof_user[val_idx] = val_proba

    # calculate and display fold accuracy
    fold_acc = accuracy_score(y_val, (val_proba >= 0.5).astype(int))
    print(f"    Fold {fold} validation accuracy: {fold_acc:.4f}")

    # predict on test set
    test_user_folds.append(model.predict_proba(X_test)[:, 1])

    # clean up
    del model
    gc.collect()

# calculate overall out-of-fold accuracy
oof_acc = accuracy_score(y_user, (oof_user >= 0.5).astype(int))
print(f"OVERALL USER-LEVEL OUT-OF-FOLD ACCURACY: {oof_acc:.4f}")

# average test predictions across folds
test_user_mean = np.mean(test_user_folds, axis=0).astype(np.float32)


# STEP 9: MAP USER-LEVEL PREDICTIONS BACK TO TWEETS

print("\nStep 9: Mapping user-level predictions back to tweet level")

# map OOF predictions to training tweets
oof_tweet = train_df[[ID_COL, "user_key", LABEL_COL]].merge(
    pd.DataFrame({
        "user_key": train_user["user_key"],
        "lightgbm_user_proba": oof_user
    }),
    on="user_key",
    how="left"
)

# map test predictions to test tweets
test_tweet = test_df[[ID_COL, "user_key"]].merge(
    pd.DataFrame({
        "user_key": test_user["user_key"],
        "lightgbm_user_proba": test_user_mean
    }),
    on="user_key",
    how="left"
)

# prepare user-level features for export
lightgbm_features_train = oof_tweet[[ID_COL, "user_key"]].merge(
    train_user.drop(columns=[LABEL_COL]),
    on="user_key",
    how="left"
).drop(columns=["user_key"])

lightgbm_features_test = test_tweet[[ID_COL, "user_key"]].merge(
    test_user,
    on="user_key",
    how="left"
).drop(columns=["user_key"])


# STEP 10: SAVE OUTPUTS

print("\nStep 10: Saving outputs")

# save out-of-fold predictions
oof_tweet[[ID_COL, LABEL_COL, "lightgbm_user_proba"]].to_csv(
    os.path.join(OUT_DIR, "oof_lgbm.csv"),
    index=False
)

# save test predictions
test_tweet[[ID_COL, "lightgbm_user_proba"]].to_csv(
    os.path.join(OUT_DIR, "test_lgbm.csv"),
    index=False
)

# save aggregated features for potential ensemble models
lightgbm_features_train.to_csv(
    os.path.join(OUT_DIR, "lgbm_features_train.csv"),
    index=False
)
lightgbm_features_test.to_csv(
    os.path.join(OUT_DIR, "lgbm_features_test.csv"),
    index=False
)

print(f"\n  Out-of-fold predictions: {OUT_DIR}/oof_lgbm.csv")
print(f"  Test predictions: {OUT_DIR}/test_lgbm.csv")
print(f"  Aggregated features (train): {OUT_DIR}/lgbm_features_train.csv")
print(f"  Aggregated features (test):  {OUT_DIR}/lgbm_features_test.csv")
print("\nPipeline completed!")


```text
Step 1: Loading JSONL data
Train shape: (154914, 194), Test shape: (103380, 192)

Step 2: Engineering structural features
  - Temporal features (hour, weekday, month, year)
  - Account age features
  - Text-based features (length, hashtags, mentions)
  - User description features
  - Presence flags (location, banner, profile URL)
  - Entity counts (hashtags, mentions, URLs, symbols)
  - Profile color RGB components
  - Source frequency encoding
  - User URL domain frequency
  - Withheld countries features
  - Coordinate presence flags
  - Quoted status temporal features
  - Quoted status numeric features and log transforms
  - Global mapping quoted user friends/followers to main users via created_at

Step 3: Processing generic object columns
  Processing 131 object columns
  - Counting items in list-like columns
  - Additional frequency encodings for categorical columns

Step 4: Removing empty and constant columns
  Dropping 78 empty or constant columns

Step 5: Selecting numeric features for modeling
  Selected 343 tweet-level features

Step 6: Aggregating features to user level
  Train users: 30696, Test users: 20465
  User-level features: 1373

Step 7: Preparing feature matrices for training
  X_train shape: (30696, 1372), X_test shape: (20465, 1372)

Step 8: Training LightGBM with stratified k-fold cross-validation

  Fold 1/5
    Fold 1 validation accuracy: 0.8686

  Fold 2/5
    Fold 2 validation accuracy: 0.8708

  Fold 3/5
    Fold 3 validation accuracy: 0.8681

  Fold 4/5
    Fold 4 validation accuracy: 0.8760

  Fold 5/5
    Fold 5 validation accuracy: 0.8698

OVERALL USER-LEVEL OUT-OF-FOLD ACCURACY: 0.8707

Step 9: Mapping user-level predictions back to tweet level

Step 10: Saving outputs

  Out-of-fold predictions: ./intermediate/oof_lgbm.csv
  Test predictions: ./intermediate/test_lgbm.csv
  Aggregated features (train): ./intermediate/lgbm_features_train.csv
  Aggregated features (test):  ./intermediate/lgbm_features_test.csv

Pipeline completed!
```
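
Two figures in the log above can be cross-checked with simple arithmetic. The sketch below copies the numbers from the log. It assumes the five folds are of near-equal size, which `StratifiedKFold` normally produces, so the overall out-of-fold accuracy should be close to the plain mean of the fold accuracies.

```python
from statistics import fmean

# Figures copied from the run log above.
fold_accuracies = [0.8686, 0.8708, 0.8681, 0.8760, 0.8698]
n_tweet_features = 343
n_agg_stats = 4  # mean, max, min, std

# With near-equal fold sizes, overall OOF accuracy is close to the plain mean.
approx_oof_accuracy = round(fmean(fold_accuracies), 4)

# Four statistics per tweet-level feature, plus the per-user tweet count.
expected_user_features = n_tweet_features * n_agg_stats + 1

print(approx_oof_accuracy)
print(expected_user_features)
```

Both values agree with the log (0.8707 and 1373). The feature matrix itself has 1372 columns, one fewer, and the log does not say why. One possibility, offered only as an assumption, is that `SimpleImputer` discarded a column that was entirely missing for the training users, which is its default behaviour with the median strategy.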
