# CTGAN, introduction

## Step 1: Prepare your data

In [1]:
from ctgan.data import read_csv

In [2]:
data, discrete_columns = read_csv(csv_filename="./examples/csv/adult.csv", 
                                  meta_filename="./examples/csv/adult.json")

The metadata file is a JSON document with an entry called columns: a list of sub-documents, each giving the name of a column and its type.

**A column's type is continuous for continuous columns, and categorical, ordinal or discrete for non-continuous columns.**
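As an illustration of the layout described above, here is a minimal sketch of such a metadata document and of how the non-continuous column names could be pulled out of it. The columns entry comes from the description above; the key names "name" and "type", and the helper `discrete_column_names`, are assumptions made for this sketch and may not match what `read_csv` actually expects.

```python
import json

# Hypothetical metadata for four columns of the adult dataset.
# The "columns" entry is described in the text; the "name" and "type"
# key names are assumptions made for this sketch.
metadata = {
    "columns": [
        {"name": "age", "type": "continuous"},
        {"name": "workclass", "type": "categorical"},
        {"name": "education-num", "type": "continuous"},
        {"name": "income", "type": "categorical"},
    ]
}

# Serialize the way it might be stored on disk, then parse it back.
meta_json = json.dumps(metadata, indent=2)
parsed = json.loads(meta_json)

# Types the text lists for non-continuous columns.
NON_CONTINUOUS = {"categorical", "ordinal", "discrete"}


def discrete_column_names(meta):
    """Return the names of all columns whose type is not continuous."""
    return [c["name"] for c in meta["columns"] if c["type"] in NON_CONTINUOUS]


discrete = discrete_column_names(parsed)
print(discrete)
```

The list produced this way plays the same role as the discrete_columns value returned by the loader.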

In [4]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [5]:
discrete_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country',
 'income']
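Before fitting, it can be useful to cross-check a list like this against the DataFrame itself. The sketch below uses a small stand-in DataFrame built from rows shown earlier; `check_discrete_columns` is a hypothetical helper, not part of CTGAN, and it assumes that every non-numeric column should be treated as discrete.

```python
import pandas as pd

# Small stand-in for the adult DataFrame, using values from the rows shown above.
toy_data = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "fnlwgt": [77516, 83311, 215646],
    "sex": ["Male", "Male", "Male"],
    "income": ["<=50K", "<=50K", "<=50K"],
})
toy_discrete = ["workclass", "sex", "income"]


def check_discrete_columns(df, columns):
    """Hypothetical helper: return (missing, unlisted).

    missing  -- listed columns that do not exist in df
    unlisted -- non-numeric columns of df that were not listed
    """
    missing = [c for c in columns if c not in df.columns]
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    unlisted = [c for c in non_numeric if c not in columns]
    return missing, unlisted


missing, unlisted = check_discrete_columns(toy_data, toy_discrete)
print(missing, unlisted)
```

Two empty lists mean the declared discrete columns and the non-numeric columns of the data agree. Numeric columns that encode categories would not be flagged by this check.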

## Step 2: Fit CTGAN to your data

In [6]:
from ctgan.synthesizer import CTGANSynthesizer

In [7]:
ctgan = CTGANSynthesizer()

In [8]:
ctgan

<ctgan.synthesizer.CTGANSynthesizer at 0x1a2ba57470>

In [9]:
ctgan.fit(data, discrete_columns, epochs=5)

Epoch 1, Loss G: 1.8868, Loss D: -0.2757
Epoch 2, Loss G: 1.5564, Loss D: 0.2455
Epoch 3, Loss G: 1.1967, Loss D: 0.2388
Epoch 4, Loss G: 0.7673, Loss D: -0.0487
Epoch 5, Loss G: 0.4341, Loss D: 0.0820


## Step 3: Generate synthetic data

Once fitting has finished, call the sample method of your CTGANSynthesizer instance, passing the number of rows that you want to generate.

In [10]:
samples = ctgan.sample(1000)

In [11]:
samples

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,43.5579,Private,407321,HS-grad,4.18167,Never-married,Prof-specialty,Unmarried,Black,Male,-45.0411,4.16282,44.625,United-States,<=50K
1,34.444,Private,266373,Bachelors,3.48671,Never-married,Adm-clerical,Husband,White,Male,-113.307,2.90152,54.0152,United-States,>50K
2,51.5516,Private,231401,HS-grad,13.2405,Never-married,Machine-op-inspct,Unmarried,Black,Male,78.5132,-2.36432,72.0984,United-States,>50K
3,25.6598,Private,49548.4,Some-college,16.2661,Married-civ-spouse,Other-service,Own-child,White,Male,44.4972,-1.25731,34.1151,United-States,<=50K
4,34.6581,Private,198839,HS-grad,2.29787,Never-married,Prof-specialty,Own-child,White,Female,65.8026,4.56506,39.9558,United-States,<=50K
5,48.2409,Self-emp-inc,25909.3,HS-grad,3.1007,Married-civ-spouse,Prof-specialty,Husband,White,Female,137.413,2.44584,57.0688,United-States,<=50K
6,21.9093,Private,230304,1st-4th,10.0475,Widowed,Handlers-cleaners,Other-relative,White,Male,94.1392,5.64152,21.9799,United-States,<=50K
7,22.6247,Self-emp-not-inc,319036,HS-grad,9.03482,Married-civ-spouse,?,Husband,Black,Male,92.9954,-2.3871,40.332,United-States,<=50K
8,24.5064,Private,135014,Some-college,8.9904,Widowed,Protective-serv,Not-in-family,White,Female,-3.60939,4.55226,49.479,United-States,>50K
9,39.7407,Private,114037,Some-college,13.7575,Married-civ-spouse,Other-service,Husband,White,Male,53.7699,2.44408,45.8428,United-States,<=50K


NOTE: CTGAN does not distinguish between float and integer columns, so it samples float values in all cases. If integer values are required, round the sampled float values to integers in a later step, outside of CTGAN.
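The rounding step mentioned in the note can be done with pandas. The sketch below uses a small stand-in DataFrame with values copied from the sampled rows above. The helper `round_integer_columns` and the hand-picked list of integer columns are assumptions for illustration, not part of CTGAN.

```python
import pandas as pd

# Small stand-in for the output of ctgan.sample(); values copied from the rows above.
toy_samples = pd.DataFrame({
    "age": [43.5579, 34.444, 51.5516],
    "workclass": ["Private", "Private", "Private"],
    "education-num": [4.18167, 3.48671, 13.2405],
    "hours-per-week": [44.625, 54.0152, 72.0984],
})

# Columns that hold integers in the original data (chosen by hand here).
integer_columns = ["age", "education-num", "hours-per-week"]


def round_integer_columns(df, columns):
    """Return a copy of df with the given columns rounded and cast to int."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].round().astype(int)
    return out


rounded = round_integer_columns(toy_samples, integer_columns)
print(rounded)
```

With the real output, the same call would be applied to samples, passing whichever columns are integer-valued in the original data.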