# Decision Tree Model Comparison: Tabular Synthetic Data with Gretel

In this notebook, the Adult dataset from the UCI Machine Learning Repository is split into training and testing sets, and each set is exported to a CSV file. Splitting first avoids data leakage: only the training set is passed to the Gretel API to generate a synthetic dataset, so the test set never influences the synthetic data. Later on, a pipeline prepares both the real training data and the synthetic training data for two simple decision tree models. Both models are then evaluated on the same held-out test data to see how they compare.
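As a rough preview of the comparison described above, here is a minimal sketch on toy data. It does not use the Adult data or the Gretel API: the "synthetic" training set is only a noisy bootstrap resample standing in for a generator's output, and the names `X_synth`, `y_synth`, and `fit_and_score` are hypothetical. The tree depth is also an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real dataset (assumption: not the Adult data).
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Placeholder for synthetic training data: a noisy bootstrap resample of the
# real training set. In the notebook this role is played by Gretel's output.
rng = np.random.default_rng(42)
idx = rng.integers(0, len(X_train), size=len(X_train))
X_synth = X_train[idx] + rng.normal(scale=0.1, size=X_train[idx].shape)
y_synth = y_train[idx]


def fit_and_score(X_fit, y_fit):
    """Fit a small decision tree and return its accuracy on the shared test set."""
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_fit, y_fit)
    return accuracy_score(y_test, model.predict(X_test))


real_acc = fit_and_score(X_train, y_train)
synth_acc = fit_and_score(X_synth, y_synth)
print(f"trained on real data:      {real_acc:.3f}")
print(f"trained on synthetic data: {synth_acc:.3f}")
```

The key design point is that both models are scored on the same test set, which neither the real nor the synthetic training data has seen.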

In [1]:
import pandas as pd

In [2]:
# Load the Adult dataset from the UCI Machine Learning Repository.
# Note: the file has no header row, so pandas uses the first record as column names.
df = pd.read_csv("adult (3).data")
df.head()

| | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   39              32560 non-null  int64 
 1    State-gov      32560 non-null  object
 2    77516          32560 non-null  int64 
 3    Bachelors      32560 non-null  object
 4    13             32560 non-null  int64 
 5    Never-married  32560 non-null  object
 6    Adm-clerical   32560 non-null  object
 7    Not-in-family  32560 non-null  object
 8    White          32560 non-null  object
 9    Male           32560 non-null  object
 10   2174           32560 non-null  int64 
 11   0              32560 non-null  int64 
 12   40             32560 non-null  int64 
 13   United-States  32560 non-null  object
 14   <=50K          32560 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
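The column names in the output above (`39`, ` State-gov`, ...) are the values of the first record, because the raw file has no header row. One way to avoid this is to pass `header=None` together with explicit names. The sketch below assumes the standard attribute names from the dataset's documentation and runs on two sample records taken from the output above rather than on the full file. The list `ADULT_COLUMNS` and the variable `named_df` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# Attribute names as listed in the dataset's documentation (assumed here;
# check them against the adult.names file that ships with the data).
ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Two records in the same layout as the raw file: comma plus space, no header.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

named_df = pd.read_csv(
    sample,
    header=None,            # do not treat the first record as a header
    names=ADULT_COLUMNS,    # supply column names explicitly
    skipinitialspace=True,  # drop the space that follows each comma
)
print(named_df.shape)                # (2, 15)
print(named_df.loc[0, "workclass"])  # State-gov
```

With the real file, the same keyword arguments would be passed to the existing `pd.read_csv("adult (3).data")` call. Doing so keeps the first record as data, so the row counts would differ by one from the outputs shown in this notebook.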


In [4]:
# Import the helper used to perform the train-test split
from sklearn.model_selection import train_test_split

In [5]:
# With no test_size given, scikit-learn holds out 25% of the rows for testing
train, test = train_test_split(df, random_state=42)

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24420 entries, 8610 to 23654
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   39              24420 non-null  int64 
 1    State-gov      24420 non-null  object
 2    77516          24420 non-null  int64 
 3    Bachelors      24420 non-null  object
 4    13             24420 non-null  int64 
 5    Never-married  24420 non-null  object
 6    Adm-clerical   24420 non-null  object
 7    Not-in-family  24420 non-null  object
 8    White          24420 non-null  object
 9    Male           24420 non-null  object
 10   2174           24420 non-null  int64 
 11   0              24420 non-null  int64 
 12   40             24420 non-null  int64 
 13   United-States  24420 non-null  object
 14   <=50K          24420 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.0+ MB


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8140 entries, 14160 to 30338
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   39              8140 non-null   int64 
 1    State-gov      8140 non-null   object
 2    77516          8140 non-null   int64 
 3    Bachelors      8140 non-null   object
 4    13             8140 non-null   int64 
 5    Never-married  8140 non-null   object
 6    Adm-clerical   8140 non-null   object
 7    Not-in-family  8140 non-null   object
 8    White          8140 non-null   object
 9    Male           8140 non-null   object
 10   2174           8140 non-null   int64 
 11   0              8140 non-null   int64 
 12   40             8140 non-null   int64 
 13   United-States  8140 non-null   object
 14   <=50K          8140 non-null   object
dtypes: int64(6), object(9)
memory usage: 1017.5+ KB
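The row counts above (24420 training rows, 8140 test rows) follow from `train_test_split`'s default of holding out 25% of the data. As a small check, the sketch below uses a placeholder frame with the same number of rows as `df` (the column `x` and the variable names are hypothetical) and confirms that the default call and an explicit `test_size=0.25` produce the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with as many rows as df above (32560); contents are arbitrary.
toy = pd.DataFrame({"x": range(32560)})

# Default split, as used in the notebook.
default_train, default_test = train_test_split(toy, random_state=42)

# The same split with the held-out fraction written out explicitly.
explicit_train, explicit_test = train_test_split(
    toy, test_size=0.25, random_state=42
)

print(len(default_train), len(default_test))  # 24420 8140
print(default_train.index.equals(explicit_train.index))  # True
```

Writing `test_size` explicitly makes the split ratio visible to readers without changing the result.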


In [8]:
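# Export both splits (the ./data directory must already exist).
# The DataFrame index is written as an unnamed first column by default.
train.to_csv("./data/og_train.csv")
test.to_csv("./data/test.csv")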
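Because `to_csv` writes the index as an unnamed first column, reading these files back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column. The sketch below illustrates this on a tiny hypothetical frame written to a temporary directory, and shows that `index_col=0` restores the original index instead. The frame contents and variable names are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Tiny hypothetical frame with a non-default index, like a shuffled train split.
frame = pd.DataFrame(
    {"age": [39, 50], "income": ["<=50K", "<=50K"]},
    index=[8610, 23654],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "og_train.csv")
    frame.to_csv(path)  # the index is written as the first column

    naive = pd.read_csv(path)                  # index comes back as a data column
    restored = pd.read_csv(path, index_col=0)  # first column becomes the index

print(list(naive.columns))     # ['Unnamed: 0', 'age', 'income']
print(list(restored.columns))  # ['age', 'income']
print(list(restored.index))    # [8610, 23654]
```

An alternative is to pass `index=False` to `to_csv` when the original row positions are not needed downstream. Which option fits best depends on whether later steps rely on the index.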