# GAN for Numerical Data

GANs and their variants have been used extensively on images. This notebook applies a GAN to numerical data instead. The goal is to generate new samples from a given dataset, combine them with the original data, and train a classifier on the result. In theory, the additional training data could improve the classifier's accuracy, which could be especially helpful when the original dataset is small.

## Adult Income Dataset

The experiment uses the Adult Income dataset from the UCI Machine Learning Repository, converted to CSV format for easier processing and available on [Kaggle](https://www.kaggle.com/flyingwombat/logistic-regression-with-uci-adult-income/data).

Start by importing all the required libraries. This should work without problems in the provided conda environment.

In [1]:
import numpy as np
import pandas as pd

from keras.layers import Activation, BatchNormalization
from keras.layers import Input, Dense, Reshape, Flatten, Dropout
from keras.layers.advanced_activations import LeakyReLU
from keras.models import Sequential, Model
from keras.models import load_model
from keras.optimizers import Adam
from keras.utils import to_categorical

Using TensorFlow backend.


In [15]:
dataFrame = pd.read_csv('data/adult-training.csv')
dataFrame

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


No preprocessing of any kind, including missing-value handling, is performed on the dataset.
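
Even though missing values are left untouched here, it can help to know how many there are. In the original UCI files, missing entries are marked with `?` rather than left empty, so `isna()` would not catch them. The sketch below counts such markers per column on a small made-up frame. The toy rows and the helper name `count_missing_markers` are illustrative assumptions, and the CSV used in this notebook may encode missing values differently.

```python
import pandas as pd


def count_missing_markers(frame, marker="?"):
    """Count, per column, the cells whose stripped text equals `marker`."""
    # Cast everything to text so numeric columns can be compared safely,
    # then strip whitespace in case the marker is stored as " ?"
    as_text = frame.astype(str).apply(lambda col: col.str.strip())
    return (as_text == marker).sum()


# Made-up rows for illustration only (not taken from the real file)
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "occupation": [" ?", "Exec-managerial", "?"],
})

missing_counts = count_missing_markers(toy)
print(missing_counts)
```

Running the same helper on `dataFrame` would show which columns are affected, under the assumption that `?` is the marker used in this file.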

In [3]:
dataset = dataFrame.values
print('dataset, shape: {}\n{}'.format(dataset.shape, dataset))

dataset, shape: (48842, 15)
[[25 'Private' 226802 ... 40 'United-States' '<=50K']
 [38 'Private' 89814 ... 50 'United-States' '<=50K']
 [28 'Local-gov' 336951 ... 40 'United-States' '>50K']
 ...
 [58 'Private' 151910 ... 40 'United-States' '<=50K']
 [22 'Private' 201490 ... 20 'United-States' '<=50K']
 [52 'Self-emp-inc' 287927 ... 40 'United-States' '>50K']]


In [4]:
dataset.shape

(48842, 15)
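
Because the columns mix integers and strings, `dataFrame.values` returns a NumPy array of dtype `object`, which is why the printed rows show quoted strings next to bare integers. A minimal illustration on a made-up two-row frame follows; the variable names `toy` and `toy_values` are arbitrary.

```python
import pandas as pd

# Two illustrative rows with one numeric and two string columns
toy = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "income": ["<=50K", "<=50K"],
})

toy_values = toy.values
print(toy_values.dtype)  # object
print(toy_values.shape)  # (2, 3)
```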

Split the data into features `X` (the first 14 columns) and labels `Y` (the last column, `income`).

In [8]:
X = dataset[:,0:14]
Y = dataset[:,14]
print('X, shape: {}\n{}'.format(X.shape, X))
print('Y, shape: {}\n{}'.format(Y.shape, Y))

X, shape: (48842, 14)
[[25 'Private' 226802 ... 0 40 'United-States']
 [38 'Private' 89814 ... 0 50 'United-States']
 [28 'Local-gov' 336951 ... 0 40 'United-States']
 ...
 [58 'Private' 151910 ... 0 40 'United-States']
 [22 'Private' 201490 ... 0 20 'United-States']
 [52 'Self-emp-inc' 287927 ... 0 40 'United-States']]
Y, shape: (48842,)
['<=50K' '<=50K' '>50K' ... '<=50K' '<=50K' '>50K']
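
The positional slices above rely on `income` being the last of the 15 columns. An alternative sketch that selects the label by name is shown below on a made-up frame. The toy rows and the names `X_toy` and `Y_toy` are illustrative assumptions; the column name `income` is taken from the table header shown earlier.

```python
import pandas as pd

# Made-up frame with the label in the last column, as in the real data
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Features: every column except the label; labels: the income column
X_toy = toy.drop(columns=["income"]).to_numpy()
Y_toy = toy["income"].to_numpy()

print(X_toy.shape)  # (3, 2)
print(Y_toy.shape)  # (3,)
```

Selecting by name gives the same arrays as slicing by position as long as the label is the last column, but it keeps working if the column order changes.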
