### Performance Comparison of Synthetic Data for Prediction Tasks

Here we work with a cardiovascular disease dataset covering 70,000 patients. Based on the features in the dataset, we perform binary classification of whether a patient has cardiovascular disease. The model is first trained on real data only, then on synthetic data only, and finally on a combination of real and synthetic data. In every case it is scored on the same held-out real test set.

In [2]:
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score, classification_report, roc_auc_score
from xgboost import XGBClassifier
from bulian.Tabular.synthesizers import TwinSynthesizer

warnings.filterwarnings('ignore')

#### Loading Data

We define a simple helper function, `get_f1_score`, which returns the F1 score of a fitted model on a given dataset.

In [3]:
def get_f1_score(X, Y, model):
    preds = model.predict(X)
    return f1_score(Y, preds)
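
For readers who want to see what the score measures, here is a minimal pure-Python sketch of binary F1, the harmonic mean of precision and recall for the positive class. The function name `binary_f1` and the toy labels are illustrative assumptions and are not part of the notebook; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for a single positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, not taken from the cardio dataset):
# 2 true positives, 1 false positive, 1 false negative,
# so precision = recall = F1 = 2/3.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
toy_f1 = binary_f1(y_true, y_pred)
print(toy_f1)
```

Because F1 ignores true negatives, it responds to how well the positive class (`cardio == 1`) is recovered rather than to overall accuracy.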

In [4]:
df = pd.read_csv("/home/anurag/Tabular_Synthesizers/examples/csv/cardio_train.csv", sep=";")
df = df.drop(['id'], axis=1)

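# Drop rows with a missing label before splitting, so X_full and Y_full stay aligned
df = df.dropna(subset=['cardio'])
Y_full = df['cardio']
X_full = df.drop(['cardio'], axis=1)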

The dataset above was retrieved from https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

#### Model Definition

In [5]:
model = XGBClassifier(learning_rate=0.1, colsample_bytree=0.8, n_estimators=500, n_jobs=-1)

#### Training on Real Data

In [6]:
train_X, test_X, train_Y, test_Y = train_test_split(X_full, Y_full, train_size=0.7, test_size=0.3, random_state=0)
model.fit(train_X, train_Y)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.8,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.1, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=500,
              n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

In [7]:
print("F1 Score on Real Test Data - ", get_f1_score(test_X, test_Y, model))

F1 Score on Real Test Data -  0.7192912602030659


#### Generate Synthetic Data

We use the `TwinSynthesizer` model to generate synthetic data.

In [8]:
synth = TwinSynthesizer(batch_size=200, device='cpu')
synth.fit(data=df, epochs=50, discrete_columns=[])
sample = synth.sample(30000)

Epoch: [0]  [  0/349]  eta: 0:00:19  loss_g: 0.0377 (0.0377)  loss_d: -0.0125 (-0.0125)  loss: 0.0252 (0.0252)  time: 0.0558  data: 0.0000
Epoch: [0]  [ 50/349]  eta: 0:00:12  loss_g: -0.5799 (-0.2457)  loss_d: 0.0715 (-0.0088)  loss: -0.4804 (-0.2545)  time: 0.0414  data: 0.0000
Epoch: [0]  [100/349]  eta: 0:00:09  loss_g: -0.2984 (-0.3274)  loss_d: -0.0541 (-0.0138)  loss: -0.3210 (-0.3412)  time: 0.0337  data: 0.0000
Epoch: [0]  [150/349]  eta: 0:00:07  loss_g: -0.4459 (-0.3480)  loss_d: 0.0835 (0.0227)  loss: -0.3644 (-0.3253)  time: 0.0339  data: 0.0000
Epoch: [0]  [200/349]  eta: 0:00:05  loss_g: -0.4767 (-0.3791)  loss_d: -0.2333 (0.0085)  loss: -0.6772 (-0.3705)  time: 0.0485  data: 0.0000
Epoch: [0]  [250/349]  eta: 0:00:03  loss_g: -0.9680 (-0.4695)  loss_d: 0.0854 (0.0203)  loss: -0.8657 (-0.4492)  time: 0.0298  data: 0.0000
Epoch: [0]  [300/349]  eta: 0:00:01  loss_g: -0.8542 (-0.5576)  loss_d: -0.1127 (0.0155)  loss: -0.9707 (-0.5421)  time: 0.0371  data: 0.0000
Epoch: [0]
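
Before training on the generated sample, it can help to check that its labels look like the real ones. The sketch below is a hedged illustration: `label_summary` is a hypothetical helper, the toy lists stand in for `df['cardio']` and `sample['cardio']`, and the notebook does not state whether the synthesizer returns strictly binary labels.

```python
from collections import Counter


def label_summary(labels, positive=1):
    """Return the row count, the distinct label values, and the positive rate."""
    labels = list(labels)
    counts = Counter(labels)
    rate = counts[positive] / len(labels) if labels else float("nan")
    return {"n": len(labels), "values": sorted(counts), "positive_rate": rate}


# Toy stand-ins (hypothetical values, not drawn from the notebook's data)
real_labels = [0, 1, 1, 0, 1, 0, 0, 1]
synthetic_labels = [1, 1, 0, 1, 1, 0]

real_summary = label_summary(real_labels)
synthetic_summary = label_summary(synthetic_labels)

# A large gap in positive rate, or label values outside {0, 1},
# would be worth investigating before fitting a classifier.
rate_gap = abs(real_summary["positive_rate"] - synthetic_summary["positive_rate"])
print(real_summary, synthetic_summary, rate_gap)
```

In the notebook, the same helper could be called on the real and synthetic `cardio` columns, since it only needs an iterable of labels.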

#### Train on Synthetic Data

In [11]:
synth_Y = sample['cardio']
synth_X = sample.drop('cardio', axis=1)
# Refit on synthetic rows only, then score on the held-out real test set
model.fit(synth_X, synth_Y)
print("F1 Score on Synthetic Data - ", get_f1_score(test_X, test_Y, model))

F1 Score on Synthetic Data -  0.6496411779262559


#### Train on Synthetic and Real Data

In [12]:
X = pd.concat([train_X, synth_X], ignore_index=True)
Y = pd.concat([train_Y, synth_Y], ignore_index=True)
model.fit(X, Y)

print("F1 Score on Real + Synthetic Data - ", get_f1_score(test_X, test_Y, model))

F1 Score on Real + Synthetic Data -  0.7205735909181438


#### Conclusion

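Training on real plus synthetic data gives a slight improvement in F1 on the real test set (0.7206 versus 0.7193 for real data alone), while training on synthetic data alone scores lower (0.6496). The gain is small and comes from a single train/test split, and the synthesizer was fitted on the full dataframe, which includes the test rows, so the result should be read as indicative rather than conclusive. With more hyperparameter tuning or other models the score might improve further, but this test suggests that combining real and synthetic data can improve overall model performance.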
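
One way to judge whether a gain of this size is meaningful is to repeat the experiment over several `random_state` values and compare the paired scores. The following is a minimal sketch under that assumption; `paired_gain` is a hypothetical helper, and the numbers in the usage lines are made up purely to exercise it. They are not results from this notebook.

```python
from statistics import mean, stdev


def paired_gain(baseline_scores, augmented_scores):
    """Summarise per-run score differences (augmented minus baseline)."""
    if len(baseline_scores) != len(augmented_scores):
        raise ValueError("both score lists must come from the same runs")
    if len(baseline_scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    diffs = [a - b for b, a in zip(baseline_scores, augmented_scores)]
    return {
        "mean_gain": mean(diffs),
        "std_gain": stdev(diffs),            # sample standard deviation
        "wins": sum(d > 0 for d in diffs),   # runs where augmentation helped
        "runs": len(diffs),
    }


# Made-up F1 scores for three hypothetical seeds (NOT measured values)
baseline = [0.70, 0.72, 0.71]    # trained on real data only
augmented = [0.71, 0.71, 0.73]   # trained on real + synthetic data
gain_summary = paired_gain(baseline, augmented)
print(gain_summary)
```

If the mean gain is small relative to its spread across seeds, the improvement is hard to distinguish from split-to-split noise.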