# GAN for Numerical Data

GANs and their variants have been used extensively on images. This notebook instead applies a GAN to numerical data. The goal is to generate new records from a given dataset, combine them with the original data, and train a classifier on the combined set. In principle, the additional training data could improve the classifier's accuracy, which would be most useful when the original dataset is small.

## Adult Income Dataset

The experiment uses the Adult Income dataset from the UCI Machine Learning Repository, converted to CSV format for easier processing and available on [Kaggle](https://www.kaggle.com/flyingwombat/logistic-regression-with-uci-adult-income/data).

First, import the required libraries. This should work without issues in the provided conda environment.

In [8]:
import numpy as np
import pandas as pd

In [10]:
dataFrame = pd.read_csv('data/adult.csv')
# Show the first 10 rows
dataFrame.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K


No preprocessing of any kind (including missing-value handling) is performed on the dataset. Missing values appear as `?`, as in rows 4 and 6 above.
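
Even without handling them, it can help to know how many `?` placeholders each column contains. The following is a minimal sketch on a tiny stand-in frame built from a few of the rows shown above; the names `sample` and `missing_counts` are illustrative and not part of the original notebook, which would apply the same call to `dataFrame`.

```python
import pandas as pd

# Stand-in for `dataFrame`: three rows with a subset of the columns above.
sample = pd.DataFrame({
    "age": [25, 18, 29],
    "workclass": ["Private", "?", "?"],
    "occupation": ["Machine-op-inspct", "?", "?"],
    "income": ["<=50K", "<=50K", "<=50K"],
})

# Mark every cell equal to the '?' placeholder, then count per column.
missing_counts = sample.isin(["?"]).sum()
print(missing_counts)
```

If missing-value handling were wanted later, passing `na_values='?'` to `pd.read_csv` would be one way to load these placeholders as `NaN`, assuming the file stores them exactly as `?`.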

In [31]:
# Convert the DataFrame to a NumPy array (mixed column types give dtype object)
dataset = dataFrame.values
print('dataset, shape: {}\n{}'.format(dataset.shape, dataset))

dataset, shape: (48842, 15)
[[25 'Private' 226802 ... 40 'United-States' '<=50K']
 [38 'Private' 89814 ... 50 'United-States' '<=50K']
 [28 'Local-gov' 336951 ... 40 'United-States' '>50K']
 ...
 [58 'Private' 151910 ... 40 'United-States' '<=50K']
 [22 'Private' 201490 ... 20 'United-States' '<=50K']
 [52 'Self-emp-inc' 287927 ... 40 'United-States' '>50K']]


In [29]:
dataset.shape

(48842, 15)
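
Because the frame mixes integer and string columns, the array returned by `.values` has dtype `object` rather than a numeric dtype. The sketch below illustrates this on a small stand-in frame and shows one way to pull out only the numeric columns; `sample`, `mixed`, and `numeric_only` are hypothetical names, and selecting numeric columns is shown only for illustration, not as a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Stand-in for `dataFrame`: two numeric columns and one string column.
sample = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "hours-per-week": [40, 50, 40],
})

# Mixed column types collapse to a single object-dtype array.
mixed = sample.to_numpy()
print(mixed.dtype)

# Keeping only the numeric columns yields a numeric array instead.
numeric_only = sample.select_dtypes(include="number").to_numpy()
print(numeric_only.dtype, numeric_only.shape)
```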

Split the data into X (the first 14 columns, used as features) and Y (the last column, `income`, used as the label).

In [28]:
# Columns 0-13 are the features; column 14 (income) is the label
X = dataset[:, 0:14]
Y = dataset[:, 14]
print('X, shape: {}\n{}'.format(X.shape, X))
print('Y, shape: {}\n{}'.format(Y.shape, Y))

X, shape: (48842, 14)
[[25 'Private' 226802 ... 0 40 'United-States']
 [38 'Private' 89814 ... 0 50 'United-States']
 [28 'Local-gov' 336951 ... 0 40 'United-States']
 ...
 [58 'Private' 151910 ... 0 40 'United-States']
 [22 'Private' 201490 ... 0 20 'United-States']
 [52 'Self-emp-inc' 287927 ... 0 40 'United-States']]
Y, shape: (48842,)
['<=50K' '<=50K' '>50K' ... '<=50K' '<=50K' '>50K']
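
The positional slice above depends on `income` being the last column. As an alternative sketch, the same split can be made by column name, which keeps working if the column order changes. The stand-in frame and the names `sample`, `X_named`, and `Y_named` are assumptions for illustration only.

```python
import pandas as pd

# Stand-in for `dataFrame` with `income` as the last column.
sample = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "hours-per-week": [40, 50, 40],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Name-based split: everything except `income` is a feature.
X_named = sample.drop(columns="income").to_numpy()
Y_named = sample["income"].to_numpy()

# Positional split, as done in the cell above, for comparison.
values = sample.to_numpy()
X_pos = values[:, 0:3]
Y_pos = values[:, 3]

print(X_named.shape, Y_named.shape)
print((X_named == X_pos).all(), (Y_named == Y_pos).all())
```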
