1. Download and import the model

Set-up a baseline form HuggingFace
https://huggingface.co/alana89/TabSTAR


In [9]:

!pip install tabstar


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




2. Read the datasets

In [1]:
import pandas as pd
covtype_test = pd.read_csv("covtype_test.csv")
covtype_train = pd.read_csv("covtype_train.csv")

heloc_test = pd.read_csv("heloc_test.csv")
heloc_train = pd.read_csv("heloc_train.csv")

higgs_test = pd.read_csv("higgs_test.csv")
higgs_train = pd.read_csv("higgs_train.csv")

3. Merge 3 datasets into one dataset for train

The respecitve classes for this tabular problem are respective for the following datasets
1) covtype - 6 different classes of tree
2) heloc - binary class (good/ bad)
3) higgs - binary class (signal / background)

To aggregate these all classes we can notice that the whole problem consists of:
- 7 classes for trees
- 1 binary class where:
    - "1" = good / signal
    - "0" = bad / background

This means that in order to keep track of the class we can combine all of the columns and mark the classes by additional outcome_class as follows:
-  0 = bad / background for heloc / higggs 
- 1 = good / signal for heloc / higgs
- 2 = Spruce/Fir
- 3 = Lodgepole Pine
- 4 = Ponderosa Pine
- 5 = Cottonwood/Willow
- 6 = Aspen
- 7 = Douglas-fir
- 8 = Krummholz

This way the outcome_class with 9 integers can comprehehend 7 different classes for trees and binary class for 2 other datasets. The next step would be to design a table combining columns for all datasets and for each respective dataset it would mark the class accordingly to the mapping described above. 

Each row will consist of:
- outcome_class = respective class for the case as indicated in the mapping above
- features that are relevant to the class
- other features not relevant to the class

In [2]:
import pandas as pd

# covytype
# create mapping for covtype classes to new outcome_class
covtype_map = {
    1: 2,  
    2: 3,  
    3: 4,  
    4: 5,  
    5: 6,  
    6: 7,  
    7: 8,  
}

# make copy of covtype_train to avoid modifying original data
covtype_train_copy = covtype_train.copy()

covtype_train_copy["outcome_class"] = covtype_train_copy["Cover_Type"].map(covtype_map)
covtype_train_copy = covtype_train_copy.drop(columns=["Cover_Type"])

# heloc 
heloc_train_copy = heloc_train.copy()
heloc_train_copy["outcome_class"] = heloc_train_copy["RiskPerformance"].map({
    "Bad": 0,
    "Good": 1,
})
heloc_train_copy = heloc_train_copy.drop(columns=["RiskPerformance"])

# higgs data
higgs_train_copy = higgs_train.copy()            
higgs_train_copy["outcome_class"] = higgs_train_copy["Label"].map({
    "b": 0,   
    "s": 1,  
})
higgs_train_copy = higgs_train_copy.drop(columns=["Label"])


#  merge all 3 datasets into a single training set
merged_train = pd.concat([covtype_train_copy, heloc_train_copy, higgs_train_copy], axis=0, ignore_index=True)
merged_train.head()


Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight
0,3351.0,206.0,27.0,726.0,124.0,3813.0,192.0,252.0,180.0,2271.0,...,,,,,,,,,,
1,2732.0,129.0,7.0,212.0,1.0,1082.0,231.0,236.0,137.0,912.0,...,,,,,,,,,,
2,2572.0,24.0,9.0,201.0,25.0,957.0,216.0,222.0,142.0,2191.0,...,,,,,,,,,,
3,2824.0,69.0,13.0,417.0,39.0,3223.0,233.0,214.0,110.0,6478.0,...,,,,,,,,,,
4,2529.0,84.0,5.0,120.0,9.0,1092.0,227.0,231.0,139.0,4983.0,...,,,,,,,,,,


Let's check if numbers of classes in merged_train data match the numbers from original datasets

In [3]:
# number of rows with class 1
count_class1 = int((merged_train["outcome_class"] == 1).sum())
count_class1 == sum(heloc_train["RiskPerformance"] == "Good") +  sum(higgs_train["Label"] == 1)



False

The next step would be to merge the test datasets similarly to how the training datasets were merged.

In [4]:
covtype_test_copy = covtype_test.copy()
heloc_test_copy = heloc_test.copy()
higgs_test_copy = higgs_test.copy() 

import numpy as np


# merge datasets
merged_test = pd.concat([covtype_test_copy, heloc_test_copy, higgs_test_copy], axis=0, ignore_index=True)

# ensure that the columns in merged_test match those in t
merged_test = merged_test.reindex(columns= merged_train.columns.drop("outcome_class"))

Once we have merged data for both train and test sets we can proceed to fit the model and get the predictions

In [5]:
# split into X and y
y = merged_train["outcome_class"].astype(int)
X = merged_train.drop(columns=["outcome_class"])


# fill NaNs

# train data
X = X.fillna(-999)

# test data
merged_test = merged_test.fillna(-999)



In [23]:
from tabstar.tabstar_model import TabSTARClassifier

# get the model and fit it on the training data
tabstar = TabSTARClassifier(max_epochs=2)
tabstar.fit(X, y)

# save the model
tabstar.save("baseline3.pkl")
tabstar = TabSTARClassifier.load("baseline3.pkl")


# predict on the merged test set
X_test = merged_test
predictions = tabstar.predict(X_test) 

# save the predictions to a CSV file 
submission = pd.DataFrame({
    "ID": range(1, len(predictions) + 1),
    "Prediction": predictions.astype(int),
})

submission = submission[["ID", "Prediction"]]
submission.to_csv("combined_test_submission3.csv", index=False)



üñ•Ô∏è Using device: mps
ü§© Loading pretrained model version: alana89/TabSTAR


Epochs:   0%|          | 0/2 [00:00<?, ?it/s]

Epoch 1 || Train 0.2783 || Val 0.2431 || Metric 0.9905  ü•á


Epochs:  50%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 1/2 [1:46:15<1:46:15, 6375.70s/it]

Epoch 2 || Train 0.2460 || Val 0.2429 || Metric 0.9905  ü•á


                                                           

üèÜ Best checkpoint: Epoch 2 with loss 0.2429
Threshold: 0.2764 was chosen for best val loss of 0.2429
üìä Averaging 2 checkpoints:
- checkpoint_epoch_1.pt (val_loss=0.2431)
- checkpoint_epoch_2.pt (val_loss=0.2429)
üíæ Saved averaged checkpoint to .tabstar_checkpoint/20251212_132808/checkpoint_averaged.pt
‚úÖ Saved averaged model to .tabstar_checkpoint/20251212_132808/averaged_model
üìà Averaged checkpoint || Val Loss: 0.2430 || Val Metric: 0.9905


To elaborate the comparison between the TabSTARClassifier and the LightGBM model we need to split the mergred train dataset into training and validation sets. This will allow us to evaluate the performance of tabstar on validation set and thus allow insights into quality of components.

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

tabstar = TabSTARClassifier.load("baseline_model.pkl")
predictions_val = tabstar.predict(X_val)


After getting the local predictions we may summarise the results for each class using the table below.

In [9]:
from sklearn.metrics import classification_report
import pandas as pd

name_map = {
    "0":"bad/background","1":"good/signal",
    "2":"Spruce/Fir","3":"Lodgepole Pine","4":"Ponderosa Pine",
    "5":"Cottonwood/Willow","6":"Aspen","7":"Douglas-fir","8":"Krummholz"
}

resultsTable = pd.DataFrame(classification_report(y_val, predictions_val, output_dict=True, zero_division=0)).T
resultsTable.index = resultsTable.index.map(lambda x: name_map.get(str(x), x))
resultsTable = resultsTable.loc[:, ["precision","recall","f1-score"]].round(2)

resultsTable = resultsTable.drop(index=["macro avg", "weighted avg"])
print(resultsTable)


                   precision  recall  f1-score
bad/background          0.96    0.97      0.97
good/signal             0.94    0.93      0.94
Spruce/Fir              0.65    0.78      0.71
Lodgepole Pine          0.78    0.68      0.73
Ponderosa Pine          0.61    0.90      0.72
Cottonwood/Willow       0.00    0.00      0.00
Aspen                   0.00    0.00      0.00
Douglas-fir             0.47    0.17      0.25
Krummholz               0.66    0.64      0.65
accuracy                0.89    0.89      0.89
