<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_5_tabular_synthetic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 7: Generative Adversarial Networks**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 7 Material

* Part 7.1: Introduction to GANs for Image and Data Generation [[Video]](https://www.youtube.com/watch?v=hZw-AjbdN5k&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_1_gan_intro.ipynb)
* Part 7.2: Train StyleGAN3 with your Own Images [[Video]](https://www.youtube.com/watch?v=R546LYsQk5M&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_2_train_gan.ipynb)
* Part 7.3: Exploring the StyleGAN Latent Vector [[Video]](https://www.youtube.com/watch?v=goQzp8QSb2s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_3_latent_vector.ipynb)
* Part 7.4: GANs to Enhance Old Photographs Deoldify [[Video]](https://www.youtube.com/watch?v=0OTd5GlHRx4&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_4_deoldify.ipynb)
* **Part 7.5: GANs for Tabular Synthetic Data Generation** [[Video]](https://www.youtube.com/watch?v=yujdA46HKwA&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_5_tabular_synthetic.ipynb)


# Google CoLab Instructions

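The following code detects whether the notebook is running in Google CoLab and requests TensorFlow 2.x.
  Note that it only imports the `drive` module; it does not call `drive.mount`, so your GDrive is not mapped to ```/content/drive``` unless you mount it yourself.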

In [8]:
try:
    from google.colab import drive
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

Note: using Google CoLab
Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


# Part 7.5: GANs for Tabular Synthetic Data Generation

Typically, GANs are used to generate images. However, we can also use a GAN to generate tabular data. In this part, we will use the Python tabgan utility to create fake data from tabular data. Specifically, we will use the Auto MPG dataset to train a GAN to generate fake cars. [Cite:ashrapov2020tabular](https://arxiv.org/pdf/2010.00638.pdf)

## Installing Tabgan

PyTorch is the foundation of the tabgan neural network utility. The following code installs the software needed to run tabgan in Google Colab.

In [22]:
# HIDE OUTPUT
CMD = "wget https://raw.githubusercontent.com/Diyago/Tabular-data-generation/refs/heads/master/requirements.txt"

!{CMD}
!pip install -r requirements.txt
!pip install tabgan

--2025-09-02 16:19:12--  https://raw.githubusercontent.com/Diyago/Tabular-data-generation/refs/heads/master/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 211 [text/plain]
Saving to: ‘requirements.txt’


2025-09-02 16:19:12 (8.31 MB/s) - ‘requirements.txt’ saved [211/211]

Collecting python-dateutil==2.8.1 (from -r requirements.txt (line 10))
  Downloading python_dateutil-2.8.1-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting be-great==0.0.8 (from -r requirements.txt (line 13))
  Downloading be_great-0.0.8-py3-none-any.whl.metadata (5.7 kB)
INFO: pip is looking at multiple versions of pandas to determine which version is compatible with other requirements. This could take a while.
Collecting pandas>=1.2.2 (from -r requirements.txt (line 5)

Note: after installing, you may see this message:

* You must restart the runtime in order to use newly installed versions.

If so, click the "restart runtime" button just under the message. Then rerun this notebook, and you should not encounter further issues.
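
After restarting, it can help to confirm which package versions the runtime actually picked up. The following is a minimal sketch using only the standard library; the helper name `installed_version` and the list of packages checked are illustrative choices, not something tabgan requires.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names are illustrative; adjust them to whatever requirements.txt pins.
for name in ("tabgan", "torch", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a version printed here differs from the one pip reported installing, the runtime has probably not been restarted yet.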

## Loading the Auto MPG Data and Training a Neural Network

We will begin by generating fake data for the Auto MPG dataset we have previously seen. The tabgan library can generate categorical (textual) and continuous (numeric) data. However, it cannot generate unstructured data, such as the name of the automobile. Car names, such as "AMC Rebel SST" cannot be replicated by the GAN, because every row has a different car name; it is a textual but non-categorical value.
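
One rough way to spot columns like the car name before training is to look for non-numeric columns in which nearly every row is distinct. The sketch below assumes pandas; the helper `free_text_columns`, the 0.9 cutoff, and the toy rows are all hypothetical and only illustrate the idea.

```python
import pandas as pd


def free_text_columns(df, unique_ratio=0.9):
    """List non-numeric columns in which almost every row holds a distinct value."""
    if len(df) == 0:
        return []
    columns = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns are treated as continuous data
        if df[col].nunique() / len(df) >= unique_ratio:
            columns.append(col)
    return columns


# Toy rows for illustration only; not the real dataset.
cars = pd.DataFrame({
    "name": ["amc rebel sst", "ford torino", "buick skylark", "chevy impala"],
    "origin": ["us", "us", "europe", "us"],
    "mpg": [16.0, 17.0, 15.0, 14.0],
})
print(free_text_columns(cars))  # ['name']
```

Here `origin` repeats across rows and behaves like a category, whereas `name` is unique per row and would be dropped before handing the data to the GAN.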

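The following code is similar to what we have seen before, although this version of the notebook loads a normalized DoS Slowhttptest network-flow dataset (`filtered_DoS_Slowhttptest_normalized.csv`) in place of Auto MPG. The tabgan library requires a Pandas dataframe for training, so we keep both the Pandas and NumPy versions of the data.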

In [46]:
# HIDE OUTPUT
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(
    "https://raw.githubusercontent.com/dayvid91110/ddos_detection/refs/heads/main/filtered_DoS_Slowhttptest_normalized.csv",
    na_values=['NA', '?'])

COLS_USED = ['Destination Port', 'Flow Duration', 'Packet Length Mean', 'Flow Bytes/s',
          'Total Fwd Packets', 'Label']
COLS_TRAIN = ['Destination Port', 'Flow Duration', 'Packet Length Mean', 'Flow Bytes/s',
          'Total Fwd Packets', 'Label']

df = df[COLS_USED].copy()  # copy so the Label assignment below does not write to a view
label_encoder = LabelEncoder()
df['Label'] = label_encoder.fit_transform(df['Label'])


# Split into training and test sets
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(
    df.drop("Label", axis=1),
    df["Label"],
    test_size=0.20,
    #shuffle=False,
    random_state=42,
)

# Create dataframe versions for tabular GAN
df_x_test, df_y_test = df_x_test.reset_index(drop=True), \
  df_y_test.reset_index(drop=True)
df_y_train = pd.DataFrame(df_y_train)
df_y_test = pd.DataFrame(df_y_test)

# Pandas to Numpy
x_train = df_x_train.values
x_test = df_x_test.values
y_train = df_y_train.values
y_test = df_y_test.values

# Build the neural network
model = Sequential()
# Hidden 1
model.add(Dense(50, input_dim=x_train.shape[1], activation='relu'))
model.add(Dense(25, activation='relu')) # Hidden 2
model.add(Dense(12, activation='relu')) # Hidden 3
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
        patience=5, verbose=1, mode='auto',
        restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
        callbacks=[monitor], verbose=2,epochs=1000)

Epoch 1/1000
138/138 - 2s - 11ms/step - loss: 2.7728 - val_loss: 0.0063
Epoch 2/1000
138/138 - 0s - 3ms/step - loss: 0.0054 - val_loss: 0.0050
Epoch 3/1000
138/138 - 0s - 2ms/step - loss: 0.0042 - val_loss: 0.0038
Epoch 4/1000
138/138 - 1s - 5ms/step - loss: 0.0033 - val_loss: 0.0029
Epoch 5/1000
138/138 - 0s - 2ms/step - loss: 0.0025 - val_loss: 0.0023
Epoch 6/1000
138/138 - 0s - 2ms/step - loss: 0.0020 - val_loss: 0.0017
Epoch 7/1000
138/138 - 0s - 2ms/step - loss: 0.0015 - val_loss: 0.0013
Epoch 8/1000
138/138 - 1s - 4ms/step - loss: 0.0012 - val_loss: 0.0010
Epoch 9/1000
138/138 - 0s - 3ms/step - loss: 8.9295e-04 - val_loss: 7.5911e-04
Epoch 10/1000
138/138 - 1s - 6ms/step - loss: 6.4237e-04 - val_loss: 5.7147e-04
Epoch 11/1000
138/138 - 1s - 4ms/step - loss: 5.0083e-04 - val_loss: 5.5005e-04
Epoch 12/1000
138/138 - 1s - 5ms/step - loss: 3.6607e-04 - val_loss: 3.7560e-04
Epoch 13/1000
138/138 - 0s - 3ms/step - loss: 2.9554e-04 - val_loss: 2.3630e-04
Epoch 13: early stopping
Restori

<keras.src.callbacks.history.History at 0x7ac11d369850>
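
The log above stops at epoch 13 even though `val_loss` is still falling, because improvements smaller than `min_delta=1e-3` do not count as progress. The following is a simplified sketch of that patience logic in plain Python. It is not Keras's actual implementation, and the function name `early_stop_epoch` and the toy losses are made up for illustration.

```python
def early_stop_epoch(val_losses, min_delta=1e-3, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement large enough to count: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Toy validation losses (made up): one big improvement, then a plateau.
print(early_stop_epoch([1.0, 0.5, 0.5, 0.5, 0.5], min_delta=0.01, patience=3))  # 5
```

Once the losses themselves drop below `min_delta`, no further epoch can register as an improvement, so a smaller `min_delta` may be worth trying on normalized data such as this.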

We now evaluate the trained neural network to see the RMSE. We will use this trained neural network to compare the accuracy between the original data and the GAN-generated data. We will later see that you can use such comparisons for anomaly detection, a technique that can be applied to security systems. If a neural network trained on original data does not perform well on new data, then the new data may be suspect or fake.

In [47]:
pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred,y_test))
print("Final score (RMSE): {}".format(score))

35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Final score (RMSE): 0.032126180046159074
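
For reference, the score printed above is the root mean squared error, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$. A minimal NumPy sketch of the same calculation follows; the helper name `rmse` and the toy values are illustrative only.

```python
import numpy as np


def rmse(pred, target):
    """Root mean squared error between two equally sized arrays."""
    # Flatten both inputs so an (n, 1) prediction and an (n,) target
    # are compared element by element rather than broadcast to (n, n).
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    return float(np.sqrt(np.mean((pred - target) ** 2)))


# Toy values for illustration: the errors are 0, 0 and 2.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Flattening is a deliberate choice here: Keras returns predictions with shape `(n, 1)`, and mixing that with a one-dimensional target would otherwise broadcast silently.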


## Training a GAN for Auto MPG

Next, we will train the GAN to generate fake data from the original MPG data. There are quite a few options that you can fine-tune for the GAN. The example presented here uses most of the default values. These are the usual hyperparameters that must be tuned for any model and require some experimentation for optimal results. To learn more about tabgan, refer to its paper or this [Medium article](https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342), written by the creator of tabgan.

In [48]:
from tabgan.sampler import GANGenerator
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Configure the generator; most values are the tabgan defaults
generator = GANGenerator(
    gen_x_times=1.1,
    cat_cols=None,
    bot_filter_quantile=0.001,
    top_filter_quantile=0.999,
    is_post_process=True,
    adversarial_model_params={
        "metrics": "rmse", "max_depth": 2, "max_bin": 100,
        "learning_rate": 0.02, "random_state": 42, "n_estimators": 500,
    },
    pregeneration_frac=2,
    only_generated_data=False,
    gen_params={"batch_size": 500, "patience": 25, "epochs": 500},
)

# Train on the original data and generate synthetic features and targets
gen_x, gen_y = generator.generate_data_pipe(
    df_x_train, df_y_train, df_x_test,
    deep_copy=True, only_adversarial=False, use_adversarial=True,
)



Fitting CTGAN transformers for each column:   0%|          | 0/6 [00:00<?, ?it/s]

Training CTGAN, epochs::   0%|          | 0/500 [00:00<?, ?it/s]

Note: if you receive an error running the above code, you likely need to restart the runtime. You should have a "restart runtime" button in the output from the second cell. Once you restart the runtime, rerun all of the cells. This step is necessary as tabgan requires specific versions of some packages.
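
The `bot_filter_quantile` and `top_filter_quantile` arguments appear to control a post-processing step that discards generated rows falling outside a quantile range of the reference data. The sketch below illustrates that idea with pandas. It is an assumption about the behavior, not tabgan's own code, and the helper `filter_by_quantiles` and the toy frames are hypothetical.

```python
import pandas as pd


def filter_by_quantiles(generated, reference, bot=0.001, top=0.999):
    """Keep generated rows whose values lie within the reference quantile range."""
    keep = pd.Series(True, index=generated.index)
    for col in reference.columns:
        low = reference[col].quantile(bot)
        high = reference[col].quantile(top)
        keep &= generated[col].between(low, high)  # inclusive on both ends
    return generated[keep]


# Toy data: the reference spans 0..100, and two generated rows fall far outside.
reference = pd.DataFrame({"x": range(101)})
generated = pd.DataFrame({"x": [-50, 10, 50, 500]})
print(filter_by_quantiles(generated, reference)["x"].tolist())  # [10, 50]
```

Tightening the two quantiles would remove more extreme synthetic rows at the cost of returning fewer samples.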

## Evaluating the GAN Results

If we display the results, we can see that the GAN-generated data looks similar to the original. Some values, typically whole numbers in the original data, have fractional values in the synthetic data.

In [49]:
gen_x

| | Destination Port | Flow Duration | Packet Length Mean | Flow Bytes/s | Total Fwd Packets |
|---|---|---|---|---|---|
| 0 | 80 | 0.961382 | 0.259281 | 6.731538e-02 | 0.051958 |
| 1 | 80 | 0.950892 | 0.400929 | 3.521738e-02 | 0.145347 |
| 2 | 80 | 0.959911 | 0.359905 | 5.048190e-02 | 0.073688 |
| 3 | 80 | 0.993381 | 0.060368 | 1.173333e-01 | 0.067660 |
| 4 | 80 | 0.947085 | 0.060705 | 3.131553e-02 | 0.059011 |
| ... | ... | ... | ... | ... | ... |
| 5815 | 80 | 0.898676 | 0.216239 | 2.686108e-08 | 0.681818 |
| 5816 | 80 | 0.898506 | 0.216239 | 2.686615e-08 | 0.681818 |
| 5817 | 80 | 0.898765 | 0.216239 | 2.685843e-08 | 0.681818 |
| 5818 | 80 | 0.898814 | 0.216239 | 2.685696e-08 | 0.681818 |
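
If fractional values in columns that were whole numbers in the original data are a problem downstream, one option is to round those columns after generation. The following is a minimal sketch under that assumption; the helper `round_integer_columns`, the column names, and the toy values are hypothetical and not part of tabgan.

```python
import numpy as np
import pandas as pd


def round_integer_columns(original, synthetic):
    """Round synthetic columns whose original values are all whole numbers."""
    out = synthetic.copy()
    for col in original.columns:
        values = original[col].to_numpy(dtype=float)
        if np.all(values == np.round(values)):
            # Match the original column's dtype after rounding.
            out[col] = out[col].round().astype(original[col].dtype)
    return out


# Toy frames for illustration only.
original = pd.DataFrame({"cylinders": [4, 6, 8], "mpg": [18.5, 20.1, 30.0]})
synthetic = pd.DataFrame({"cylinders": [4.2, 5.7, 7.9], "mpg": [19.33, 21.27, 28.91]})
print(round_integer_columns(original, synthetic))
```

Only `cylinders` is rounded in this toy case; `mpg` keeps its fractional values because the original column already contained them.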


Finally, we present the synthetic data to the previously trained neural network to see how accurately we can predict the synthetic targets. In this run, the RMSE rises from about 0.0321 on the original test data to about 0.0359 on the synthetic data, so some accuracy is lost by going to synthetic data.

In [50]:
# Predict
pred = model.predict(gen_x.values)
score = np.sqrt(metrics.mean_squared_error(pred,gen_y.values))
print("Final score (RMSE): {}".format(score))

[1m182/182[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
Final score (RMSE): 0.035865706602142326
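
To put the two scores side by side, and to illustrate the anomaly-detection idea mentioned earlier, the sketch below computes the relative increase in RMSE and compares it with a cutoff. The helper names and the 0.5 tolerance are hypothetical; a real system would need a threshold chosen from its own data.

```python
def relative_increase(baseline, new):
    """Fractional increase of `new` over `baseline`."""
    return (new - baseline) / baseline


def looks_suspect(baseline, new, tolerance=0.5):
    """Flag new data when the error grows by more than `tolerance` (hypothetical cutoff)."""
    return relative_increase(baseline, new) > tolerance


rmse_test = 0.032126180046159074       # from the test-set cell above
rmse_synthetic = 0.035865706602142326  # from the synthetic-data cell above
print(f"{relative_increase(rmse_test, rmse_synthetic):.1%}")  # 11.6%
print(looks_suspect(rmse_test, rmse_synthetic))               # False
```

With these numbers the synthetic data costs roughly 12 percent in RMSE, which under the assumed cutoff would not be flagged as suspect.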
