# Tutorial GAIN

    
## Missing Data Imputation using Generative Adversarial Nets

This tutorial shows how to use GAIN (Generative Adversarial Imputation Nets) to impute missing data. See the [GAIN paper](http://proceedings.mlr.press/v80/yoon18a.html) for details. The example uses the UCI Forest Cover Type dataset.

Load the dataset and show the first five samples:

In [16]:
from sklearn.datasets import fetch_covtype
import pandas as pd
data = fetch_covtype()
df = pd.DataFrame(data.data)
target = 'target'
df[target] = data.target

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,45,46,47,48,49,50,51,52,53,target
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


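Introduce missing data. The script `create_missing.py` is run on 50,000 samples with a missing rate of 0.2 (`--pmiss 0.2`); it writes the data with missing values to `xmissing.csv` and the complete reference data to `x.csv`: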

In [21]:
nsample = 50000
python_exe = 'python3'  # on some platforms the Python 3 executable is named python, python3.6, ...

!{python_exe} create_missing.py --dataset cover --pmiss 0.2 --normalize01 0 -o xmissing.csv --oref x.csv -n {nsample}
df_mis = pd.read_csv("xmissing.csv")
df_mis.head()

features: #54 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
label(s): ['target'] # 7
#: 50000


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,2596.0,51.0,3.0,258.0,,510.0,221.0,232.0,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
1,2590.0,,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,,...,,0.0,0.0,0.0,0.0,,,0.0,0.0,
2,,139.0,9.0,,65.0,3180.0,234.0,238.0,135.0,6121.0,...,,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,,155.0,18.0,,118.0,3090.0,238.0,238.0,,,...,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0
4,2595.0,45.0,,,-1.0,391.0,,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0
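
The source of `create_missing.py` is not shown in this tutorial. As an illustration only, the sketch below removes values completely at random with a fixed probability, which is one plausible way to produce a file like `xmissing.csv`. The helper `add_missing_at_random` and the toy data are assumptions made for this example, not the script's actual implementation.

```python
import numpy as np
import pandas as pd


def add_missing_at_random(df, p_miss, seed=0):
    """Return a copy of df where each cell is set to NaN with probability p_miss."""
    rng = np.random.default_rng(seed)
    # True marks the cells to blank out
    drop = rng.random(df.shape) < p_miss
    # DataFrame.mask replaces values where the condition is True (with NaN by default)
    return df.astype(float).mask(drop)


# Toy data standing in for the cover type features (hypothetical values)
toy = pd.DataFrame(np.arange(2000, dtype=float).reshape(200, 10))
toy_mis = add_missing_at_random(toy, p_miss=0.2)

# Fraction of missing cells; it should be close to p_miss
missing_fraction = toy_mis.isna().mean().mean()
print(f"missing fraction: {missing_fraction:.3f}")
```

The same check, `df_mis.isna().mean().mean()`, can be run on the real `df_mis` to confirm that the fraction of missing cells is near the requested rate.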


Run the GAIN algorithm:

In [22]:
!{python_exe} gain.py -i xmissing.csv -o ximputed.csv

gain data:ximputed.csv # it:5000 testall:1 odir:. autocat:1 is_cat_one_hot:False

Namespace(alpha=10, autocategorical=1, bs=128, dataset=None, i='xmissing.csv', it=5000, o='ximputed.csv', phint=0.9, pmiss=0.2, ref=None, target=None, testall=1, trainratio=0.8, verbose=0)

features: #54 ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53'] label:None

From gain.py:227: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.


From gain.py:207: The name tf.random_normal is deprecated. Please use tf.random.normal instead.


From gain.py:302: The name tf.log is deprecated. Please use tf.math.log instead.


From gain.py:313: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOpt

     0) loss train 0.484 test 0.514                                             
     0) loss train 0.484 test 0.514:   0%|     | 1/5000 [00:00<43:43,  1.91it/s]
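
The run above reports `phint=0.9`, which presumably is the hint rate described in the GAIN paper. The NumPy sketch below shows how the inputs for one batch could be assembled and how the final imputation combines observed and generated values. It is a simplified illustration: the noise range, the hint form `b * m`, and the column-mean stand-in for the trained generator are all assumptions, and the exact code in `gain.py` is not shown in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 4 samples, 3 features, scaled to [0, 1]; NaN marks a missing value
x = np.array([[0.2, np.nan, 0.7],
              [np.nan, 0.5, 0.1],
              [0.9, 0.4, np.nan],
              [0.3, 0.8, 0.6]])

m = (~np.isnan(x)).astype(float)      # mask: 1 = observed, 0 = missing
x0 = np.nan_to_num(x, nan=0.0)        # missing entries set to 0
z = rng.uniform(0.0, 0.01, x.shape)   # small random noise (assumed range)

# Generator input: observed values where available, noise elsewhere
x_tilde = m * x0 + (1.0 - m) * z

# Hint for the discriminator: reveal each mask entry with probability hint_rate
hint_rate = 0.9                       # assumed to correspond to phint=0.9
b = (rng.uniform(size=x.shape) < hint_rate).astype(float)
h = b * m

# Stand-in for the trained generator output; here simply the column means
g_out = np.tile(np.nanmean(x, axis=0), (x.shape[0], 1))

# Final imputation keeps observed values and fills only the missing ones
x_imputed = m * x0 + (1.0 - m) * g_out
print(x_imputed)
```

The last line is the important design point: whatever the generator produces, observed entries pass through unchanged and only the missing entries are filled in.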


Show the imputed data:

In [23]:
df_imp = pd.read_csv("ximputed.csv")
df_imp.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,2596.0,51.0,3.0,258.0,27.0,510.0,221.0,232.0,135.0,3293.959076,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2590.0,147.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,3266.718292,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2802.0,139.0,9.0,253.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2840.0,155.0,18.0,278.0,118.0,3090.0,238.0,238.0,128.0,3202.90802,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2595.0,45.0,13.0,248.0,-1.0,391.0,203.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
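
In the rows shown, the values that were observed in `xmissing.csv` (for example 2596.0 in row 0) reappear unchanged, and the empty cells are filled. A quick way to confirm this for a whole frame is sketched below. `check_imputation` is a hypothetical helper written for this example, and the toy frames only stand in for `df_mis` and `df_imp`.

```python
import numpy as np
import pandas as pd


def check_imputation(df_missing, df_imputed):
    """Return (n_remaining_nan, observed_unchanged) for an imputed frame."""
    observed = df_missing.notna().to_numpy()
    a = df_missing.to_numpy(dtype=float)
    b = df_imputed.to_numpy(dtype=float)
    # Number of cells that are still NaN after imputation
    n_remaining_nan = int(np.isnan(b).sum())
    # True if every originally observed cell kept its value
    observed_unchanged = bool(np.allclose(a[observed], b[observed]))
    return n_remaining_nan, observed_unchanged


# Toy frames standing in for df_mis and df_imp (hypothetical values)
toy_mis = pd.DataFrame({"0": [2596.0, np.nan], "1": [np.nan, 56.0]})
toy_imp = pd.DataFrame({"0": [2596.0, 2802.0], "1": [147.0, 56.0]})
print(check_imputation(toy_mis, toy_imp))  # prints (0, True)
```

On the tutorial data the call would be `check_imputation(df_mis, df_imp)`.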


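Compare the imputed data with the reference data `x.csv` using an RMSE on min-max normalized values:

In [24]: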
import numpy as np


def normalization(data):
  '''Min-max normalize each column of data to the [0, 1] range.

  Args:
    - data: 2-D array of feature values (may contain NaN)

  Returns:
    - norm_data: normalized copy of data
  '''
  _, dim = data.shape
  norm_data = data.copy()

  # MinMax normalization, one feature (column) at a time
  for i in range(dim):
    norm_data[:, i] = norm_data[:, i] - np.nanmin(norm_data[:, i])
    # The small constant avoids division by zero for constant columns
    norm_data[:, i] = norm_data[:, i] / (np.nanmax(norm_data[:, i]) + 1e-6)

  return norm_data


# Compute the RMSE between the original data and the imputed data.
#   ori_data:     original data without missing values (x.csv)
#   imputed_data: imputed data (ximputed.csv)
#   data_m:       indicator matrix, True where xmissing.csv has a missing value
# Each array is normalized with its own column minima and maxima.
ori_data = normalization(np.array(pd.read_csv("x.csv")))
imputed_data = normalization(np.array(df_imp))
data_m = np.array(df_mis != df_mis)  # NaN != NaN, so True marks missing entries
print(ori_data.shape, imputed_data.shape, data_m.shape)
# Because data_m is True for missing entries, (1 - data_m) weights the observed entries
numerator = np.sum(((1 - data_m) * ori_data - (1 - data_m) * imputed_data)**2)
denominator = np.sum(1 - data_m)

rmse = np.sqrt(numerator / float(denominator))

print(rmse)


(50000, 54) (50000, 54) (50000, 54)
0.008663944049070279
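
In the cell above, `data_m` is True where `df_mis` is NaN, so the `(1 - data_m)` weights select the observed entries, and the two arrays are normalized independently of each other. To measure imputation quality on the missing entries instead, one option is sketched below: both arrays are scaled with the column statistics of the reference data, and the error is averaged over the missing positions only. `rmse_on_missing` is a hypothetical helper for this example, not part of the tutorial's scripts, and the toy arrays are made up.

```python
import numpy as np


def rmse_on_missing(ori, imputed, missing_mask):
    """RMSE over entries where missing_mask is True, after min-max scaling
    both arrays with the column minima and ranges of `ori`."""
    ori = np.asarray(ori, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)

    col_min = ori.min(axis=0)
    col_range = ori.max(axis=0) - col_min + 1e-6  # avoid division by zero

    ori_n = (ori - col_min) / col_range
    imp_n = (imputed - col_min) / col_range

    sq_err = (ori_n - imp_n) ** 2
    # Average the squared error over the missing positions only
    return float(np.sqrt(sq_err[missing_mask].mean()))


# Toy example: one missing entry in each column
ori = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
mis = np.array([[True, False], [False, True], [False, False]])
imp = ori.copy()
imp[0, 0] = 1.0    # true value is 0.0, so this entry is off by half the column range
imp[1, 1] = 20.0   # true value is 20.0, so this entry is exact
toy_rmse = rmse_on_missing(ori, imp, mis)
print(toy_rmse)
```

On the tutorial data the call would be `rmse_on_missing(pd.read_csv("x.csv"), df_imp, df_mis.isna())`. The resulting value is not comparable to the number printed above, since it covers a different set of entries.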
