### Unified Tabular Learning with LightGBM

This notebook implements a single unified machine learning model trained jointly on three tabular datasets: HELOC, HIGGS, and CoverType.  
The goal is to study whether a single model can generalize across heterogeneous tabular domains while remaining interpretable and robust.


In [69]:
# Imports 
import pandas as pd
import numpy as np

import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [70]:
# Load Training Data
heloc = pd.read_csv("heloc_train.csv")
covtype = pd.read_csv("covtype_train.csv")
higgs = pd.read_csv("higgs_train.csv")

In [71]:
# Dataset Identifiers
heloc["dataset_id"] = 0
covtype["dataset_id"] = 1
higgs["dataset_id"] = 2

In [72]:
# Global Target Encoding
# HELOC
heloc["target"] = heloc["RiskPerformance"].map({
    "Bad": 0,
    "Good": 1
})

# HIGGS
higgs["target"] = higgs["Label"].map({
    "b": 0,
    "s": 1
}) + 2

# COVTYPE
covtype["target"] = covtype["Cover_Type"] + 3

In [73]:
# Remove Original Label Columns
heloc = heloc.drop(columns=["RiskPerformance"])
higgs = higgs.drop(columns=["Label"])
covtype = covtype.drop(columns=["Cover_Type"])

### Dataset Integration and Unified Label Space

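Each dataset has different features, label definitions, and class distributions. To enable a single model, we align all feature spaces (features absent from a dataset are filled with 0) and construct a global label encoding by shifting each dataset's labels into its own range of a shared 11-class space, so that all tasks can be learned jointly.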
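The global label encoding can be summarized as a small lookup table. The sketch below is illustrative only: the offsets are the ones used in the encoding cell, while the names `LABEL_SPACE`, `to_global`, and `to_local` are hypothetical, and the local label sets assume binary HELOC and HIGGS targets and CoverType labels 1 to 7.

```python
# Offsets copied from the encoding cell. The local label sets are assumptions
# (binary HELOC and HIGGS, CoverType labelled 1..7).
LABEL_SPACE = {
    "heloc": {"offset": 0, "local_labels": [0, 1]},
    "higgs": {"offset": 2, "local_labels": [0, 1]},
    "covtype": {"offset": 3, "local_labels": [1, 2, 3, 4, 5, 6, 7]},
}


def to_global(name, local_label):
    """Shift a dataset-local label into the shared label space."""
    return local_label + LABEL_SPACE[name]["offset"]


def to_local(name, global_label):
    """Invert to_global for a label known to belong to dataset `name`."""
    return global_label - LABEL_SPACE[name]["offset"]


# Every global label that can occur, across all three datasets.
global_labels = sorted(
    to_global(name, label)
    for name, spec in LABEL_SPACE.items()
    for label in spec["local_labels"]
)
```

Under these assumptions the global labels are contiguous and non-overlapping, which is what a multiclass objective with `num_class` set to 11 expects.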

In [74]:
# Align Feature Spaces
heloc_features = [c for c in heloc.columns if c not in ["target", "dataset_id"]]
covtype_features = [c for c in covtype.columns if c not in ["target", "dataset_id"]]
higgs_features = [c for c in higgs.columns if c not in ["target", "dataset_id"]]

ALL_FEATURES = sorted(
    set(heloc_features) |
    set(covtype_features) |
    set(higgs_features)
)

In [75]:
# Align feature spaces for training data
def align_features(df):
    df = df.copy()
    for col in ALL_FEATURES:
        if col not in df.columns:
            df[col] = 0
    return df[ALL_FEATURES + ["dataset_id", "target"]]


# Apply feature alignment to training datasets
heloc = align_features(heloc)
covtype = align_features(covtype)
higgs = align_features(higgs)
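A compact way to express the same alignment is `DataFrame.reindex`, shown here on two hypothetical toy frames (`a`, `b`, and the helper `align` are made up for illustration and are not part of the notebook). This is a sketch of the idea, not a replacement for the cell above.

```python
import pandas as pd

# Two toy datasets with disjoint feature columns (hypothetical data).
a = pd.DataFrame({"x": [1.0, 2.0], "dataset_id": [0, 0], "target": [0, 1]})
b = pd.DataFrame({"y": [5.0], "dataset_id": [1], "target": [2]})

meta = ["dataset_id", "target"]
# Union of all feature columns, in a fixed (sorted) order.
all_features = sorted((set(a.columns) | set(b.columns)) - set(meta))


def align(df, fill_value=0):
    """Reorder columns to the shared layout, filling absent features."""
    return df.reindex(columns=all_features + meta, fill_value=fill_value)


aligned = pd.concat([align(a), align(b)], ignore_index=True)
```

Filling with 0 makes an absent feature indistinguishable from a genuine zero. Since LightGBM treats NaN as a missing value by default, passing `fill_value=float("nan")` is an alternative worth comparing, although the notebook itself uses 0.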

In [76]:
# Combine All Training Data
full_data = pd.concat([heloc, covtype, higgs], ignore_index=True)

X = full_data.drop(columns=["target"])
y = full_data["target"]
dataset_id = full_data["dataset_id"]

### Handling Dataset and Class Imbalance

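The three datasets differ substantially in size and class balance. We experiment with multiple weighting strategies to prevent the largest dataset from dominating training and to improve performance on minority classes. The strategy used below gives each row a weight proportional to 1 / (dataset size × class count) and then rescales the weights to a mean of 1.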
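To make the effect of dataset- and class-level inverse-frequency weighting concrete, here is a minimal sketch on hypothetical toy arrays (the data and the names `toy_weights` and `class_totals` are made up for illustration). It assumes, as in this notebook, that every class belongs to exactly one dataset.

```python
import numpy as np

# Toy data: dataset 0 has 4 rows (classes 0 and 1), dataset 1 has 2 rows
# (classes 2 and 3). Class 0 is the majority class of dataset 0.
dataset_id = np.array([0, 0, 0, 0, 1, 1])
target = np.array([0, 0, 0, 1, 2, 3])

# For each row, look up the size of its dataset and of its class.
_, d_inv, d_counts = np.unique(dataset_id, return_inverse=True, return_counts=True)
_, c_inv, c_counts = np.unique(target, return_inverse=True, return_counts=True)

# Inverse-frequency weight, rescaled so the average weight is 1.
raw = (1.0 / d_counts[d_inv]) * (1.0 / c_counts[c_inv])
toy_weights = raw / raw.mean()

# Total weight received by each class.
class_totals = np.bincount(target, weights=toy_weights)
```

In this toy case the two classes of each dataset end up with the same total weight, and rows from the smaller dataset receive larger individual weights than rows from the larger one.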

In [77]:
# Compute Dataset + Class-Balanced Weights
dataset_counts = full_data["dataset_id"].value_counts()
class_counts = full_data["target"].value_counts()

# Inverse-frequency weight per row: (1 / rows in its dataset) * (1 / rows in its class).
# Series.map does the per-row lookup without a slow Python-level loop.
weights = (
    (1.0 / full_data["dataset_id"].map(dataset_counts))
    * (1.0 / full_data["target"].map(class_counts))
)

# Convert to a NumPy array (used for positional indexing later) and rescale to mean 1
weights = weights.to_numpy(dtype=float)
weights = weights / weights.mean()

In [78]:
# Determine Best Number of Boosting Rounds
X_train, X_val, y_train, y_val, d_train, d_val = train_test_split(
    X,
    y,
    dataset_id,
    test_size=0.2,
    random_state=42,
    stratify=dataset_id
)

w_train = weights[X_train.index]
w_val = weights[X_val.index]


In [79]:
train_data = lgb.Dataset(X_train, label=y_train, weight=w_train)
val_data = lgb.Dataset(X_val, label=y_val, weight=w_val)

params = {
    "objective": "multiclass",
    "num_class": 11,
    "metric": "multi_error",  # 1 - accuracy; weighted here because val_data carries sample weights
    "learning_rate": 0.05,
    "num_leaves": 63,
    "min_data_in_leaf": 50,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "lambda_l2": 1.0,
    "verbosity": -1,
    "seed": 42
}

model = lgb.train(
    params,
    train_data,
    valid_sets=[val_data],
    num_boost_round=2000,
    callbacks=[lgb.early_stopping(50)]
)

best_iteration = model.best_iteration
print("Best iteration:", best_iteration)

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[171]	valid_0's multi_error: 0.2162
Best iteration: 171
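The single validation error above mixes all three datasets and is weighted. A per-dataset breakdown of plain accuracy can be sketched as follows; the helper `per_dataset_accuracy` is hypothetical, and the toy labels are made up and are not results from this notebook.

```python
import numpy as np


def per_dataset_accuracy(y_true, y_pred, dataset_ids):
    """Return {dataset_id: unweighted accuracy} for each dataset present."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    dataset_ids = np.asarray(dataset_ids)
    result = {}
    for d in np.unique(dataset_ids):
        mask = dataset_ids == d
        result[int(d)] = float(np.mean(y_true[mask] == y_pred[mask]))
    return result


# Toy check with made-up labels
toy_acc = per_dataset_accuracy(
    y_true=[0, 1, 4, 5, 2, 3],
    y_pred=[0, 0, 4, 5, 2, 2],
    dataset_ids=[0, 0, 1, 1, 2, 2],
)
```

With the objects defined in this notebook, the call would look like `per_dataset_accuracy(y_val, np.argmax(model.predict(X_val, num_iteration=model.best_iteration), axis=1), d_val)`.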


### Final Model Selection

After evaluating multiple weighting strategies and random seeds, we retrain the best-performing configuration on the full dataset to produce the final submission.

In [80]:
# Retrain Final Model on all data
final_train_data = lgb.Dataset(X, label=y, weight=weights)

final_model = lgb.train(
    params,
    final_train_data,
    num_boost_round=best_iteration
)

In [81]:
# Load Test Data
heloc_test = pd.read_csv("heloc_test.csv")
covtype_test = pd.read_csv("covtype_test.csv")
higgs_test = pd.read_csv("higgs_test.csv")

In [82]:
# Prepare Test Sets
def prepare_test(df, dataset_id):
    df = df.copy()
    df["dataset_id"] = dataset_id
    for col in ALL_FEATURES:
        if col not in df.columns:
            df[col] = 0
    return df[ALL_FEATURES + ["dataset_id"]]

heloc_test = prepare_test(heloc_test, 0)
covtype_test = prepare_test(covtype_test, 1)
higgs_test = prepare_test(higgs_test, 2)

In [83]:
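# Predict Test Sets
# Take the argmax only over each dataset's own slice of the global label space
# (HELOC 0-1, HIGGS 2-3, CoverType 4-10), then shift back to the global index,
# so that decoding can never yield a label outside the dataset's valid range.
heloc_preds = np.argmax(final_model.predict(heloc_test)[:, 0:2], axis=1)
covtype_preds = np.argmax(final_model.predict(covtype_test)[:, 4:11], axis=1) + 4
higgs_preds = np.argmax(final_model.predict(higgs_test)[:, 2:4], axis=1) + 2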

In [84]:
# Decode Predictions
heloc_final = heloc_preds
higgs_final = higgs_preds - 2
covtype_final = covtype_preds - 3

In [85]:
# Kaggle submission
final_submission = pd.concat([
    pd.DataFrame({
        "ID": np.arange(1, 1 + len(covtype_final)),
        "Prediction": covtype_final
    }),
    pd.DataFrame({
        "ID": np.arange(3501, 3501 + len(heloc_final)),
        "Prediction": heloc_final
    }),
    pd.DataFrame({
        "ID": np.arange(4547, 4547 + len(higgs_final)),
        "Prediction": higgs_final
    })
], ignore_index=True)

final_submission.to_csv(
    "final_submission_group9_unified_class_balanced_full.csv",
    index=False
)

### Conclusion

In this notebook, we developed a single unified LightGBM model trained jointly on three heterogeneous tabular datasets: HELOC, HIGGS, and CoverType. To enable this, we aligned all feature spaces, constructed a global label encoding, and included a dataset identifier as an additional feature.

We systematically evaluated several design choices, including different weighting strategies to address dataset and class imbalance, as well as multiple random seeds to assess robustness. Dataset-only and smoothed weighting strategies performed worse, while combined dataset- and class-level inverse-frequency weighting consistently achieved the best results.

Our final model was selected based on both performance and stability and was retrained on the full training data before generating the submission. Overall, the results show that a single unified model can effectively learn across heterogeneous tabular domains when imbalance is handled carefully and experimental choices are evaluated in a controlled manner.
