#### Unified Tabular Benchmark (HELOC + HIGGS + COVTYPE)

We train a single LightGBM multiclass model across three datasets by:
1. Assigning a dataset identifier (`dataset_id`) so the model can account for dataset-specific label priors.
2. Mapping all targets into one unified label space (11 classes total).
3. Aligning feature spaces across datasets (missing features filled with 0).
4. Removing suspicious identifier/proxy columns from HIGGS (`Weight/weight`, `EventId`) consistently in train and test.
5. Using smoothed dataset + class reweighting to reduce dominance of large datasets/classes.

Finally, we generate predictions for each dataset's test set and format them into Kaggle's required submission CSV.
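
As a minimal sketch of step 2, the unified label space can be described by a per-dataset offset and checked for a clean round trip. The names `RAW_TO_LOCAL`, `OFFSET`, `encode`, and `decode` are hypothetical helpers introduced for illustration; the offsets mirror the encoding applied in the target-encoding cell of this notebook.

```python
# Hypothetical helpers; only the offsets themselves come from the notebook.
RAW_TO_LOCAL = {
    "heloc": {"Bad": 0, "Good": 1},
    "higgs": {"b": 0, "s": 1},
    "covtype": {k: k for k in range(1, 8)},  # Cover_Type is already 1..7
}
OFFSET = {"heloc": 0, "higgs": 2, "covtype": 3}


def encode(dataset, raw_label):
    """Map a dataset-specific label to the unified 0..10 label space."""
    return RAW_TO_LOCAL[dataset][raw_label] + OFFSET[dataset]


def decode(dataset, unified_label):
    """Map a unified label back to the dataset's original label."""
    inverse = {local: raw for raw, local in RAW_TO_LOCAL[dataset].items()}
    return inverse[unified_label - OFFSET[dataset]]


# Every (dataset, label) pair should land on a distinct unified class.
unified = sorted(
    encode(name, raw) for name, mapping in RAW_TO_LOCAL.items() for raw in mapping
)
assert unified == list(range(11))
```

A check like the final assertion guards against two datasets accidentally sharing a class index when offsets are edited.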


In [29]:
import pandas as pd
import numpy as np
import lightgbm as lgb

from sklearn.model_selection import train_test_split

In [30]:
# Load Training Data
heloc = pd.read_csv("heloc_train.csv")
covtype = pd.read_csv("covtype_train.csv")
higgs = pd.read_csv("higgs_train.csv")

# Drop identifier-like columns from HIGGS (names that are absent are ignored)
DROP_HIGGS_COLS = ["Weight", "weight", "EventId", "EventID"]
higgs = higgs.drop(columns=DROP_HIGGS_COLS, errors="ignore")

In [31]:
# Dataset Identifiers
heloc["dataset_id"] = 0
covtype["dataset_id"] = 1
higgs["dataset_id"] = 2

# Global Target Encoding (11 classes total)
# HELOC: Bad/Good - 0/1
heloc["target"] = heloc["RiskPerformance"].map({"Bad": 0, "Good": 1})

# HIGGS: b/s - 0/1 then shift by +2 - 2/3
higgs["target"] = higgs["Label"].map({"b": 0, "s": 1}) + 2

# COVTYPE: 1..7 shift by +3 - 4..10
covtype["target"] = covtype["Cover_Type"] + 3

# Remove original label columns
heloc = heloc.drop(columns=["RiskPerformance"])
higgs = higgs.drop(columns=["Label"])
covtype = covtype.drop(columns=["Cover_Type"])

In [32]:
# Determine feature sets (exclude metadata)
heloc_features = [c for c in heloc.columns if c not in ["target", "dataset_id"]]
covtype_features = [c for c in covtype.columns if c not in ["target", "dataset_id"]]
higgs_features = [c for c in higgs.columns if c not in ["target", "dataset_id"]]

ALL_FEATURES = sorted(set(heloc_features) | set(covtype_features) | set(higgs_features))

def align_features_train(df):
    df = df.copy()
    for col in ALL_FEATURES:
        if col not in df.columns:
            df[col] = 0
    return df[ALL_FEATURES + ["dataset_id", "target"]]

In [33]:
# Align + combine training data
heloc = align_features_train(heloc)
covtype = align_features_train(covtype)
higgs = align_features_train(higgs)

full_data = pd.concat([heloc, covtype, higgs], ignore_index=True)

X = full_data.drop(columns=["target"])
y = full_data["target"]
dataset_id = full_data["dataset_id"]

### Smoothed reweighting

We reduce imbalance by combining:
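- **Dataset weighting**: each row is scaled by `1/dataset_count`, so rows from larger datasets get smaller weights (prevents HIGGS from dominating).
- **Smoothed class weighting**: each row is further scaled by `1/sqrt(class_count)`, a softer correction than `1/class_count` that avoids extreme weights.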
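
Written out, the per-row weight computed in the next cell is the product of the two factors, rescaled to mean 1. The symbols $n_{d}$ (rows in dataset $d$), $n_{c}$ (rows with unified class $c$), and $N$ (total rows) are notation introduced here for illustration:

```latex
w_i = \frac{1}{n_{d(i)}} \cdot \frac{1}{\sqrt{n_{c(i)}}},
\qquad
\tilde{w}_i = \frac{w_i}{\frac{1}{N} \sum_{j=1}^{N} w_j}
```

Because every unified class belongs to exactly one dataset, summing $w_i$ over a dataset $d$ with $K_d$ classes gives $\sum_{c \in d} \sqrt{n_c} / n_d$, which lies between $1/\sqrt{n_d}$ and $\sqrt{K_d / n_d}$. The datasets are therefore balanced only approximately, not exactly.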

In [34]:
# Compute smoothed weights (vectorized; same values as a row-by-row loop)
dataset_counts = full_data["dataset_id"].value_counts().to_dict()
class_counts = full_data["target"].value_counts().to_dict()

# Per-row factors: 1 / dataset size and 1 / sqrt(class size)
dataset_factor = 1.0 / full_data["dataset_id"].map(dataset_counts)
class_factor = 1.0 / np.sqrt(full_data["target"].map(class_counts))

# Combine and normalize to mean 1 so the overall loss scale is unchanged
weights = (dataset_factor * class_factor).to_numpy(dtype=float)
weights = weights / weights.mean()

In [35]:
# Train/Validation split and early stopping
X_train, X_val, y_train, y_val, d_train, d_val = train_test_split(
    X, y, dataset_id,
    test_size=0.2,
    random_state=42,
    stratify=dataset_id
)

w_train = weights[X_train.index]
w_val = weights[X_val.index]

train_data = lgb.Dataset(X_train, label=y_train, weight=w_train)
val_data = lgb.Dataset(X_val, label=y_val, weight=w_val)

params = {
    "objective": "multiclass",
    "num_class": 11,
    "metric": "multi_error",
    "learning_rate": 0.05,
    "num_leaves": 63,
    "min_data_in_leaf": 50,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "lambda_l2": 1.0,
    "verbosity": -1,
    "seed": 42
}

model = lgb.train(
    params,
    train_data,
    valid_sets=[val_data],
    num_boost_round=2000,
    callbacks=[lgb.early_stopping(50)]
)

best_iteration = model.best_iteration
print("Best iteration:", best_iteration)


Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[207]	valid_0's multi_error: 0.220864
Best iteration: 207


In [36]:
# Retrain final model on all data
final_train_data = lgb.Dataset(X, label=y, weight=weights)

final_model = lgb.train(
    params,
    final_train_data,
    num_boost_round=best_iteration
)

## Prepare test sets

We apply the same feature alignment as training, including:
- add `dataset_id`
- fill missing features with 0
- remove HIGGS proxy/identifier columns consistently (`Weight/weight`, `EventId`)
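
One compact way to express this alignment, assuming the feature order is held in a list like `ALL_FEATURES`, is `DataFrame.reindex`. The helper name `align_to_features` and the toy columns below are illustrative and not part of the pipeline:

```python
import pandas as pd


def align_to_features(df, dataset_id_value, feature_order):
    """Return a frame with exactly `feature_order` columns plus `dataset_id`.

    Columns missing from df are created and filled with 0; columns that are
    not listed in `feature_order` are dropped.
    """
    aligned = df.reindex(columns=feature_order, fill_value=0)
    aligned["dataset_id"] = dataset_id_value
    return aligned


# Toy example: the frame has only feature "b" plus a stray ID column.
toy_features = ["a", "b", "c"]
toy_test = pd.DataFrame({"b": [1.5, 2.5], "EventId": [100, 101]})
aligned = align_to_features(toy_test, dataset_id_value=2, feature_order=toy_features)

print(list(aligned.columns))  # ['a', 'b', 'c', 'dataset_id']
```

Keeping the column order identical between training and test matters here because the model is fed positional feature matrices.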

In [37]:
# Load test data + drop proxy/ID
heloc_test = pd.read_csv("heloc_test.csv")
covtype_test = pd.read_csv("covtype_test.csv")
higgs_test = pd.read_csv("higgs_test.csv")

# Drop the identifier-like columns from HIGGS TEST (names that are absent are ignored)
higgs_test = higgs_test.drop(columns=DROP_HIGGS_COLS, errors="ignore")

In [38]:
# Prepare test function
def prepare_test(df, dataset_id_value):
    df = df.copy()
    df["dataset_id"] = dataset_id_value
    for col in ALL_FEATURES:
        if col not in df.columns:
            df[col] = 0
    return df[ALL_FEATURES + ["dataset_id"]]

heloc_test = prepare_test(heloc_test, 0)
covtype_test = prepare_test(covtype_test, 1)
higgs_test = prepare_test(higgs_test, 2)

In [39]:
# Predict and decode back to dataset label spaces
heloc_preds = np.argmax(final_model.predict(heloc_test), axis=1)
covtype_preds = np.argmax(final_model.predict(covtype_test), axis=1)
higgs_preds = np.argmax(final_model.predict(higgs_test), axis=1)

# Undo the label offsets applied during training
heloc_final = heloc_preds          # 0/1 (Bad/Good), no offset
higgs_final = higgs_preds - 2      # 2/3 -> 0/1 (b/s)
covtype_final = covtype_preds - 3  # 4..10 -> 1..7
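
A plain `argmax` over all 11 columns can, in principle, pick a class that belongs to a different dataset, which would decode to an invalid label. A possible safeguard, sketched here under the assumption that the class layout is HELOC 0-1, HIGGS 2-3, COVTYPE 4-10, is to take the `argmax` only over the columns of the dataset being scored. `CLASS_SLICES`, `FIRST_RAW_LABEL`, and `decode_restricted` are hypothetical names, and the probability matrix is a toy stand-in for `final_model.predict(...)`:

```python
import numpy as np

# Columns of the unified 11-class output that belong to each dataset
# (assumed layout: HELOC 0-1, HIGGS 2-3, COVTYPE 4-10).
CLASS_SLICES = {"heloc": slice(0, 2), "higgs": slice(2, 4), "covtype": slice(4, 11)}
# Original label of the first class in each slice.
FIRST_RAW_LABEL = {"heloc": 0, "higgs": 0, "covtype": 1}


def decode_restricted(proba, dataset):
    """Argmax restricted to the dataset's own classes, in original label space."""
    block = proba[:, CLASS_SLICES[dataset]]
    return np.argmax(block, axis=1) + FIRST_RAW_LABEL[dataset]


# Toy probabilities: row 0 puts most mass on unified class 5 (a COVTYPE class).
toy = np.zeros((2, 11))
toy[0, [0, 1, 5]] = [0.3, 0.1, 0.6]
toy[1, 1] = 0.9

print(np.argmax(toy, axis=1))            # unrestricted: [5 1]
print(decode_restricted(toy, "heloc"))   # restricted:   [0 1]
```

Whether this changes any prediction in practice depends on how often the model confuses datasets, which the `dataset_id` feature should make rare.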

### Kaggle submission formatting
 
We follow the provided ordering and offsets:
- COVTYPE IDs start at 1
- HELOC IDs start at 3501
- HIGGS IDs start at 4547
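
Since the three ID blocks are built independently, a small check that they do not overlap can catch a wrong row count before submission. The sketch below uses hypothetical helper names (`id_block`, `check_id_blocks`). The row counts 3500 and 1046 are what the offsets above would imply if the blocks are contiguous, which is an assumption, and the HIGGS count of 10 is a placeholder:

```python
# Start offsets as listed above.
ID_START = {"covtype": 1, "heloc": 3501, "higgs": 4547}


def id_block(name, n_rows):
    """IDs assigned to a dataset with n_rows test rows."""
    start = ID_START[name]
    return range(start, start + n_rows)


def check_id_blocks(row_counts):
    """Raise if any two ID blocks overlap; return the total number of IDs."""
    blocks = sorted(
        (id_block(name, n) for name, n in row_counts.items()),
        key=lambda block: block.start,
    )
    for prev, nxt in zip(blocks, blocks[1:]):
        if prev.stop > nxt.start:
            raise ValueError(f"ID blocks overlap: {prev} and {nxt}")
    return sum(len(block) for block in blocks)


# Illustrative counts only (see the assumptions in the paragraph above).
example_counts = {"covtype": 3500, "heloc": 1046, "higgs": 10}
total_ids = check_id_blocks(example_counts)
```

In the notebook this could be called with `len(covtype_final)`, `len(heloc_final)`, and `len(higgs_final)` as the counts.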


In [40]:
# Kaggle submission
final_submission = pd.concat([
    pd.DataFrame({
        "ID": np.arange(1, 1 + len(covtype_final)),
        "Prediction": covtype_final
    }),
    pd.DataFrame({
        "ID": np.arange(3501, 3501 + len(heloc_final)),
        "Prediction": heloc_final
    }),
    pd.DataFrame({
        "ID": np.arange(4547, 4547 + len(higgs_final)),
        "Prediction": higgs_final
    })
], ignore_index=True)

final_submission.to_csv(
    "final_submission_unified_tabular_benchmark_group_09.csv",
    index=False
)

final_submission.head()

   ID  Prediction
0   1           1
1   2           1
2   3           1
3   4           1
4   5           1


# Sanity checks 

We include quick checks to ensure:
- `dataset_id` alone cannot fully solve the problem (it only encodes dataset priors).
- No single proxy feature (e.g., HIGGS Weight) provides suspiciously strong predictive power (we removed these).
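
The second check is not shown as code in this notebook. One simple way to screen for proxy features on a binary task such as HIGGS is to measure how well a single threshold on one feature separates the two classes. This is a sketch on synthetic data, not the procedure actually used to flag `Weight`; `best_threshold_accuracy`, the 0.95 cutoff, and the feature names are assumptions made for illustration:

```python
import numpy as np


def best_threshold_accuracy(x, y, n_thresholds=50):
    """Best accuracy of a one-feature rule `x > t` (or its negation) for binary y."""
    thresholds = np.quantile(x, np.linspace(0.0, 1.0, n_thresholds))
    best = 0.0
    for t in thresholds:
        acc = np.mean((x > t) == (y == 1))
        best = max(best, acc, 1.0 - acc)  # allow the rule in either direction
    return float(best)


# Synthetic data: one pure-noise feature and one that nearly copies the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
features = {
    "noise": rng.normal(size=1000),
    "proxy": y + rng.normal(scale=0.05, size=1000),  # deliberate leak
}

scores = {name: best_threshold_accuracy(x, y) for name, x in features.items()}
suspicious = [name for name, score in scores.items() if score > 0.95]
print(suspicious)
```

A feature that clears such a cutoff on its own is worth inspecting before it is allowed into the model.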


In [41]:
# dataset_id-only sanity check
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_check = full_data[["dataset_id"]]
y_check = full_data["target"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_check, y_check, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=500)
clf.fit(X_tr, y_tr)

print("Accuracy using dataset_id only:", accuracy_score(y_te, clf.predict(X_te)))


Accuracy using dataset_id only: 0.6181060965301115


Using `dataset_id` alone yields ~62% accuracy, reflecting differing class priors
across datasets but not label leakage. In contrast, the HIGGS `Weight` feature
exhibited unusually strong predictive power and was therefore excluded.
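
The best any `dataset_id`-only classifier can do is predict the majority class within each dataset, so that in-sample ceiling is a useful reference point for the ~62% figure. A minimal sketch follows, with the hypothetical helper `dataset_id_ceiling` and toy labels rather than the real data:

```python
from collections import Counter, defaultdict


def dataset_id_ceiling(dataset_ids, targets):
    """Accuracy of always predicting each dataset's most frequent class."""
    per_dataset = defaultdict(Counter)
    for d, t in zip(dataset_ids, targets):
        per_dataset[d][t] += 1
    # Count the rows covered by the majority class of each dataset.
    correct = sum(counts.most_common(1)[0][1] for counts in per_dataset.values())
    return correct / len(targets)


# Toy labels in the unified space: three datasets with different priors.
toy_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
toy_targets = [0, 0, 1, 4, 5, 5, 5, 2, 3, 3]
ceiling = dataset_id_ceiling(toy_ids, toy_targets)
print(ceiling)  # 0.7
```

On the real data the call would be `dataset_id_ceiling(full_data["dataset_id"], full_data["target"])`. A held-out accuracy far above this ceiling would indicate that something other than class priors is leaking through `dataset_id`.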


# Results summary

To examine predictive accuracy per class, we evaluate predictions on the local validation split in the code below. To avoid data leakage, this evaluation must use a model that has not been trained on the complete dataset, i.e. `model` rather than `final_model`. This first model is retrained with `num_boost_round` set to the best iteration identified by early stopping.

In [42]:
from sklearn.metrics import classification_report

# Refit on the training split only, for exactly best_iteration rounds
model = lgb.train(
    params,
    train_data,
    valid_sets=[val_data],
    num_boost_round=best_iteration,
)

name_map = {
    # HELOC
    "0":  "bad",
    "1":  "good",

    # Higgs
    "2":  "background",
    "3":  "signal",

    # COVTYPE 
    "4":  "Spruce/Fir",          
    "5":  "Lodgepole Pine",      
    "6":  "Ponderosa Pine",      
    "7":  "Cottonwood/Willow",
    "8":  "Aspen", 
    "9":  "Douglas-fir",         
    "10": "Krummholz",           
}

predictions_val = np.argmax(model.predict(X_val), axis=1) 

resultsTab = pd.DataFrame(classification_report(y_val, predictions_val, output_dict=True, zero_division=0)).T
resultsTab.index = resultsTab.index.map(lambda x: name_map.get(str(x), x))
resultsTab = resultsTab.loc[:, ["precision", "recall", "f1-score"]].round(2)
resultsTab = resultsTab.drop(index=["macro avg", "weighted avg"], errors="ignore")

print(resultsTab)

                   precision  recall  f1-score
bad                     0.73    0.76      0.74
good                    0.71    0.68      0.70
background              0.88    0.87      0.88
signal                  0.76    0.78      0.77
Spruce/Fir              0.86    0.85      0.85
Lodgepole Pine          0.87    0.88      0.88
Ponderosa Pine          0.89    0.91      0.90
Cottonwood/Willow       0.87    0.85      0.86
Aspen                   0.76    0.64      0.70
Douglas-fir             0.78    0.78      0.78
Krummholz               0.89    0.90      0.89
accuracy                0.84    0.84      0.84
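
The table above reports per-class scores and one overall accuracy. Since the split also returns `d_val`, accuracy can be broken down per dataset as well. The helper below is a sketch with a hypothetical name (`accuracy_by_group`) and toy inputs:

```python
import numpy as np


def accuracy_by_group(y_true, y_pred, groups):
    """Fraction of correct predictions within each group value."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    correct = y_true == y_pred
    return {g.item(): float(correct[groups == g].mean()) for g in np.unique(groups)}


# Toy example in the unified label space; groups play the role of dataset_id.
toy_true = [0, 1, 4, 5, 2, 3]
toy_pred = [0, 0, 4, 5, 2, 2]
toy_groups = [0, 0, 1, 1, 2, 2]
per_dataset = accuracy_by_group(toy_true, toy_pred, toy_groups)
print(per_dataset)  # {0: 0.5, 1: 1.0, 2: 0.5}
```

In the notebook the corresponding call would be `accuracy_by_group(y_val, predictions_val, d_val)`.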
