# DataWig

[Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) (EDA) is an approach to analyzing data sets to summarize and understand their main characteristics, often with visual methods.

Starting with your *Machine Learning Checklist*, you will see that a crucial step is **preparing and understanding** the data. It is the step where you will spend most of your time. Here is an example checklist from [Aurélien Géron](10_20_2018_Machine-Learning-Project-Checklist.txt) that you can adapt to your needs.

In this series of notebooks, we explore the Home Credit data set. We clean, preprocess, and visualize the data for downstream processes in the Data Science Workflow. All notebooks can be obtained from GitHub: https://github.com/chalendony/data-prep-visualization


# Python Imports

In [99]:
%config IPCompleter.greedy=True
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer
pd.set_option('display.max_columns', 125)
import quilt
from scripts.preprocess import percent_missing, align_dataframes, as_dict
from string import Template
import missingno as msno
%matplotlib inline
import impyute
import datawig

## Import Quilt Packages from Local Repository 

In [100]:
from quilt.data.avare import homecredit

In [101]:
# avoid parens and copy original data
table = 'previous_application'
df = homecredit[table]().copy(deep=True)
df.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,Y,1,0.0,0.182832,0.867336,XAP,Approved,-73,Cash through the bank,XAP,,Repeater,Mobile,POS,XNA,Country-wide,35,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,Y,1,,,,XNA,Approved,-164,XNA,XAP,Unaccompanied,Repeater,XNA,Cash,x-sell,Contact center,-1,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,Y,1,,,,XNA,Approved,-301,Cash through the bank,XAP,"Spouse, partner",Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,Y,1,,,,XNA,Approved,-512,Cash through the bank,XAP,,Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,Y,1,,,,Repairs,Refused,-781,Cash through the bank,HC,,Repeater,XNA,Cash,walk-in,Credit and cash offices,-1,XNA,24.0,high,Cash Street: high,,,,,,


In [102]:
# drop keys and empty columns
dropcols = ['RATE_INTEREST_PRIVILEGED','RATE_INTEREST_PRIMARY','SK_ID_PREV', 'SK_ID_CURR']
df.drop(dropcols, axis=1, inplace=True)

# drop rows containing null, also done by datawig?
df.dropna(axis=0, how='any', inplace=True)

# select random instances
seed = 500
numinstances = 1000
df = df.sample(numinstances,random_state=seed)
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 929166 to 1290315
Data columns (total 33 columns):
NAME_CONTRACT_TYPE             1000 non-null object
AMT_ANNUITY                    1000 non-null float64
AMT_APPLICATION                1000 non-null float64
AMT_CREDIT                     1000 non-null float64
AMT_DOWN_PAYMENT               1000 non-null float64
AMT_GOODS_PRICE                1000 non-null float64
WEEKDAY_APPR_PROCESS_START     1000 non-null object
HOUR_APPR_PROCESS_START        1000 non-null int64
FLAG_LAST_APPL_PER_CONTRACT    1000 non-null object
NFLAG_LAST_APPL_IN_DAY         1000 non-null int64
RATE_DOWN_PAYMENT              1000 non-null float64
NAME_CASH_LOAN_PURPOSE         1000 non-null object
NAME_CONTRACT_STATUS           1000 non-null object
DAYS_DECISION                  1000 non-null int64
NAME_PAYMENT_TYPE              1000 non-null object
CODE_REJECT_REASON             1000 non-null object
NAME_TYPE_SUITE                1000 non-null objec
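
The columns dropped above were presumably chosen after checking how sparse each column is. The following is a minimal sketch of such a check on a toy frame. The helper name `percent_missing_sketch` is hypothetical; it is not the `percent_missing` function imported from `scripts.preprocess`, whose implementation is not shown in this notebook. The 50% cut-off is likewise an arbitrary assumption.

```python
import numpy as np
import pandas as pd


def percent_missing_sketch(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isna().mean() * 100).sort_values(ascending=False)


# Toy data: one column is mostly empty, one is complete (values are illustrative).
toy = pd.DataFrame({
    "RATE_INTEREST_PRIMARY": [0.18, np.nan, np.nan, np.nan],
    "AMT_CREDIT": [17145.0, 679671.0, 136444.5, 470790.0],
})

missing = percent_missing_sketch(toy)
print(missing)

# Columns above the chosen threshold (50% here, an arbitrary cut-off) are drop candidates.
drop_candidates = missing[missing > 50].index.tolist()
print(drop_candidates)
```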

In [103]:
# assign data types
description = pd.read_excel('data/HomeCredit_columns_description.xlsx', sheet_name='Sheet1',usecols=[2,3,4])
description.head()

Unnamed: 0,Table,Row,Type
0,application_train,SK_ID_CURR,numerical
1,application_train,TARGET,categorical
2,application_train,NAME_CONTRACT_TYPE,categorical
3,application_train,CODE_GENDER,categorical
4,application_train,FLAG_OWN_CAR,categorical


In [104]:
# map description types to pandas dtypes: object (categorical), float64 (numerical)
python_cat_dtype = 'object'
python_num_dtype = 'float64'

description.replace('categorical', python_cat_dtype, inplace=True)
description.replace('numerical', python_num_dtype, inplace=True)

# type cols
typecols = description[(description.Table == table)]
typecols.head()

Unnamed: 0,Table,Row,Type
173,previous_application,SK_ID_PREV,float64
174,previous_application,SK_ID_CURR,float64
175,previous_application,NAME_CONTRACT_TYPE,object
176,previous_application,AMT_ANNUITY,float64
177,previous_application,AMT_APPLICATION,float64


In [105]:
# get target columns 
targetcols = pd.DataFrame(df.columns, columns=['Row'])
targetcols.head()

Unnamed: 0,Row
0,NAME_CONTRACT_TYPE
1,AMT_ANNUITY
2,AMT_APPLICATION
3,AMT_CREDIT
4,AMT_DOWN_PAYMENT


In [106]:
# left join to attach a type to each column; we don't know which columns are present in the description
targetcols = targetcols.merge(typecols, how='left')
targetcols.head()

Unnamed: 0,Row,Table,Type
0,NAME_CONTRACT_TYPE,previous_application,object
1,AMT_ANNUITY,previous_application,float64
2,AMT_APPLICATION,previous_application,float64
3,AMT_CREDIT,previous_application,float64
4,AMT_DOWN_PAYMENT,previous_application,float64
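
Because a left join keeps every data column even when the description has no entry for it, it can help to list the unmatched columns explicitly. The sketch below uses toy stand-ins for `targetcols` and `typecols`; the column `EXTRA_COLUMN` is a made-up name that simulates a column the description does not cover.

```python
import pandas as pd

# Toy stand-ins for `targetcols` and `typecols`; "EXTRA_COLUMN" is a made-up name
# used to simulate a column that the description file does not cover.
targetcols_toy = pd.DataFrame({"Row": ["NAME_CONTRACT_TYPE", "AMT_ANNUITY", "EXTRA_COLUMN"]})
typecols_toy = pd.DataFrame({
    "Table": ["previous_application", "previous_application"],
    "Row": ["NAME_CONTRACT_TYPE", "AMT_ANNUITY"],
    "Type": ["object", "float64"],
})

# indicator=True adds a `_merge` column saying where each row was found.
merged = targetcols_toy.merge(typecols_toy, how="left", on="Row", indicator=True)

# Rows marked "left_only" exist in the data but not in the description.
undescribed = merged.loc[merged["_merge"] == "left_only", "Row"].tolist()
print(undescribed)
```

Any column reported here has a missing `Type` after the join and would need a dtype assigned by hand.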


In [107]:
# retrieve all columns of same type 
cat = targetcols.loc[(targetcols.Type == python_cat_dtype),'Row'].values.tolist()
num = targetcols.loc[(targetcols.Type == python_num_dtype),'Row'].values.tolist()

print(cat)
print(num)
#print(len(cat) + len(num))

['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY', 'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION', 'NFLAG_INSURED_ON_APPROVAL']
['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT', 'DAYS_DECISION', 'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION']


In [108]:
## batch update types 
df[cat] = df[cat].astype(python_cat_dtype)
df[num] = df[num].astype(python_num_dtype)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 929166 to 1290315
Data columns (total 33 columns):
NAME_CONTRACT_TYPE             1000 non-null object
AMT_ANNUITY                    1000 non-null float64
AMT_APPLICATION                1000 non-null float64
AMT_CREDIT                     1000 non-null float64
AMT_DOWN_PAYMENT               1000 non-null float64
AMT_GOODS_PRICE                1000 non-null float64
WEEKDAY_APPR_PROCESS_START     1000 non-null object
HOUR_APPR_PROCESS_START        1000 non-null object
FLAG_LAST_APPL_PER_CONTRACT    1000 non-null object
NFLAG_LAST_APPL_IN_DAY         1000 non-null object
RATE_DOWN_PAYMENT              1000 non-null float64
NAME_CASH_LOAN_PURPOSE         1000 non-null object
NAME_CONTRACT_STATUS           1000 non-null object
DAYS_DECISION                  1000 non-null float64
NAME_PAYMENT_TYPE              1000 non-null object
CODE_REJECT_REASON             1000 non-null object
NAME_TYPE_SUITE                1000 non-null o

In [110]:
# our categoricals don't look like categoricals

df[cat] 

Unnamed: 0,NAME_CONTRACT_TYPE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,NAME_YIELD_GROUP,PRODUCT_COMBINATION,NFLAG_INSURED_ON_APPROVAL
929166,Consumer loans,SATURDAY,8,Y,1,XAP,Approved,Cash through the bank,XAP,Unaccompanied,Repeater,Audio/Video,POS,XNA,Country-wide,4000,Consumer electronics,high,POS household with interest,0
1053922,Consumer loans,WEDNESDAY,17,Y,1,XAP,Approved,Cash through the bank,XAP,Family,Repeater,Computers,POS,XNA,Country-wide,1200,Consumer electronics,low_normal,POS household with interest,0
218963,Consumer loans,SATURDAY,11,Y,1,XAP,Approved,Cash through the bank,XAP,Family,New,Clothing and Accessories,POS,XNA,Stone,35,Clothing,middle,POS industry with interest,0
1593438,Consumer loans,THURSDAY,17,Y,1,XAP,Approved,Cash through the bank,XAP,Family,Repeater,Construction Materials,POS,XNA,Regional / Local,4000,Construction,low_normal,POS industry with interest,0
1346827,Cash loans,TUESDAY,17,Y,1,XNA,Approved,Cash through the bank,XAP,Family,Repeater,XNA,Cash,x-sell,Credit and cash offices,0,XNA,high,Cash X-Sell: high,1
897815,Consumer loans,SATURDAY,14,Y,1,XAP,Approved,Cash through the bank,XAP,Unaccompanied,Refreshed,Computers,POS,XNA,Regional / Local,50,Consumer electronics,high,POS household with interest,1
24666,Consumer loans,WEDNESDAY,18,Y,1,XAP,Approved,XNA,XAP,"Spouse, partner",New,Computers,POS,XNA,Country-wide,2170,Consumer electronics,middle,POS household with interest,0
577624,Consumer loans,SATURDAY,11,Y,1,XAP,Approved,Cash through the bank,XAP,Other_B,New,Computers,POS,XNA,Country-wide,1378,Consumer electronics,middle,POS household with interest,0
1408920,Consumer loans,TUESDAY,3,Y,1,XAP,Approved,XNA,XAP,Children,New,Computers,POS,XNA,Stone,312,Consumer electronics,middle,POS household with interest,0
613751,Consumer loans,MONDAY,10,Y,1,XAP,Approved,XNA,XAP,Unaccompanied,New,Consumer Electronics,POS,XNA,Stone,60,Consumer electronics,middle,POS household with interest,0


In [109]:

# select a portion of the data for evaluation
df_train, df_test = datawig.utils.random_split(df)


output_column = 'PRODUCT_COMBINATION'
output_path = 'imputer_model'
# use every column except the target as input
input_cols = [c for c in df.columns if c != output_column]

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=input_cols,  # columns containing information about the column we want to impute
    output_column='PRODUCT_COMBINATION',  # the column we'd like to impute values for
    output_path=output_path  # stores model data and metrics
)

# Fit an imputer model on the train data
#imputer.fit(train_df=df_train, num_epochs=5)

# Fit an imputer model with default list of hyperparameters
imputer.fit_hpo(train_df=df_train)

# Impute missing values and return original dataframe with predictions
predictions = imputer.predict(df_test)

# Calculate f1 score for true vs predicted values
f1 = datawig.f1_score(predictions[output_column], predictions[output_column+'_imputed'], average='weighted')

# Print overall classification report
print(datawig.classification_report(predictions[output_column], predictions[output_column+'_imputed']))


# fit an imputer model with customized hyperparameters
#imputer.fit_hpo(
#    train_df=df_train,
#    num_epochs=100,
#    patience=3,
#    learning_rate_candidates=[1e-3, 3e-4, 1e-4]
#)


2019-05-01 16:49:04,377 [INFO]  Assuming 13 numeric input columns: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE, RATE_DOWN_PAYMENT, DAYS_DECISION, CNT_PAYMENT, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, DAYS_TERMINATION
2019-05-01 16:49:04,379 [INFO]  Assuming 19 string input columns: SELLERPLACE_AREA, NAME_PAYMENT_TYPE, NAME_TYPE_SUITE, NAME_PRODUCT_TYPE, NAME_SELLER_INDUSTRY, NAME_CASH_LOAN_PURPOSE, HOUR_APPR_PROCESS_START, NAME_PORTFOLIO, CHANNEL_TYPE, FLAG_LAST_APPL_PER_CONTRACT, NFLAG_LAST_APPL_IN_DAY, NAME_CONTRACT_TYPE, NAME_GOODS_CATEGORY, NAME_YIELD_GROUP, NAME_CONTRACT_STATUS, NFLAG_INSURED_ON_APPROVAL, NAME_CLIENT_TYPE, CODE_REJECT_REASON, WEEKDAY_APPR_PROCESS_START
2019-05-01 16:49:04,380 [INFO]  Assuming 13 numeric input columns: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE, RATE_DOWN_PAYMENT, DAYS_DECISION, CNT_PAYMENT, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VER

2019-05-01 16:49:04,465 [INFO]  Detected 0 rows with missing labels                         for column PRODUCT_COMBINATION
2019-05-01 16:49:04,467 [INFO]  Dropping 0/640 rows
2019-05-01 16:49:04,472 [INFO]  Detected 0 rows with missing labels                         for column PRODUCT_COMBINATION
2019-05-01 16:49:04,473 [INFO]  Dropping 0/160 rows
2019-05-01 16:49:04,475 [INFO]  Train: 640, Test: 160
2019-05-01 16:49:04,477 [INFO]  Fitting data encoder <class 'datawig.column_encoders.NumericalEncoder'> on columns AMT_ANNUITY and 640 rows of training data with parameters {'input_columns': ['AMT_ANNUITY'], 'output_column': 'AMT_ANNUITY_numeric', 'output_dim': 1, 'normalize': True, 'scaler': None}
2019-05-01 16:49:04,485 [INFO]  Fitting data encoder <class 'datawig.column_encoders.NumericalEncoder'> on columns AMT_APPLICATION and 640 rows of training data with parameters {'input_columns': ['AMT_APPLICATION'], 'output_column': 'AMT_APPLICATION_numeric', 'output_dim': 1, 'normalize': True, 

2019-05-01 16:49:04,701 [INFO]  Concatenating numeric columns ['AMT_GOODS_PRICE'] into AMT_GOODS_PRICE_numeric
2019-05-01 16:49:04,703 [INFO]  Normalizing with StandardScaler
2019-05-01 16:49:04,708 [INFO]  Data Encoding - Encoded 640 rows of column                         AMT_GOODS_PRICE with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-01 16:49:04,725 [INFO]  Data Encoding - Encoded 640 rows of column                         WEEKDAY_APPR_PROCESS_START with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (640, 32768)                         and then into shape (640, 32768)


AttributeError: 'int' object has no attribute 'lower'
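
The traceback is consistent with a string encoder calling `.lower()` on a value that is still an integer. Casting a column to `object` changes the dtype label but leaves the underlying Python values untouched, so columns such as `HOUR_APPR_PROCESS_START` still hold integers. This is an inference from the error message and the logged list of string input columns, not something confirmed against DataWig's source. A minimal sketch on a toy frame (values are illustrative) of the difference between `astype('object')` and `astype(str)`:

```python
import pandas as pd

# Toy frame standing in for two of the "categorical" columns (values are illustrative).
toy = pd.DataFrame({
    "HOUR_APPR_PROCESS_START": [8, 17, 11],
    "NAME_CONTRACT_TYPE": ["Consumer loans", "Cash loans", "Consumer loans"],
})

as_object = toy.astype("object")  # relabels the dtype, values stay integers
as_string = toy.astype(str)       # converts every value to a string

# A value can be lower-cased only if it has a `.lower` attribute.
object_ok = all(hasattr(v, "lower") for v in as_object["HOUR_APPR_PROCESS_START"])
string_ok = all(hasattr(v, "lower") for v in as_string["HOUR_APPR_PROCESS_START"])

print(object_ok)
print(string_ok)
```

If this is indeed the cause, casting the categorical columns with `df[cat] = df[cat].astype(str)` before fitting would be one thing to try. Depending on the pandas version, `astype(str)` may also turn missing values into the literal string `'nan'`, so it is safest applied after nulls have been handled.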

## Missing Numerical

In [162]:
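# batch: assign each column the dtype listed for it in the description
typed = targetcols.dropna(subset=['Type'])  # skip columns without a description entry
for c, dtype in zip(typed['Row'], typed['Type']):
    df[c] = df[c].astype(dtype)

df.info(verbose=True)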

Unnamed: 0,Table,Row,Type
175,previous_application,NAME_CONTRACT_TYPE,categorical
181,previous_application,WEEKDAY_APPR_PROCESS_START,categorical
182,previous_application,HOUR_APPR_PROCESS_START,categorical
183,previous_application,FLAG_LAST_APPL_PER_CONTRACT,categorical
184,previous_application,NFLAG_LAST_APPL_IN_DAY,categorical


#### Strategy: Probabilistic Imputation

* Datawig: https://github.com/awslabs/datawig/blob/master/README.md#imputation-of-numerical-columns

In [None]:
# blank out a random 20% of PRODUCT_COMBINATION to create values to impute
## hmm does not include missing data in th
seed = 200
nullval = df.sample(frac=0.2, random_state=seed)
# keep the true values before overwriting them, for later evaluation
test = df.loc[nullval.index, ['PRODUCT_COMBINATION']].copy()
df.loc[nullval.index, ['PRODUCT_COMBINATION']] = np.nan
df.loc[nullval.index, ['PRODUCT_COMBINATION']].head()


# test set: counts for each value 
test.PRODUCT_COMBINATION.value_counts()
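
The masking step above can be sketched in isolation on a toy frame. The values are made up for illustration; the point is that the true labels must be copied before they are overwritten, otherwise there is nothing left to score the imputed values against.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "PRODUCT_COMBINATION": ["Cash", "POS mobile", "Cash", "POS mobile", "Cash",
                            "Cash", "POS mobile", "Cash", "Cash", "POS mobile"],
    "AMT_CREDIT": np.arange(10, dtype="float64"),
})

# Pick 20% of the rows to blank out; a fixed seed keeps the selection reproducible.
masked_rows = toy.sample(frac=0.2, random_state=200).index

# Keep the true labels BEFORE overwriting them, so predictions can be scored later.
truth = toy.loc[masked_rows, "PRODUCT_COMBINATION"].copy()

toy.loc[masked_rows, "PRODUCT_COMBINATION"] = np.nan

print(toy["PRODUCT_COMBINATION"].isna().sum())  # number of masked rows
print(truth.value_counts())
```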

In [None]:
###### TODO : Assignment of Categoricals is not working!!!

#https://stackoverflow.com/questions/32718639/pandas-filling-nans-in-categorical-data
# update data types: once a column is Categorical, only values that are already categories can be inserted
#print('Updating data types')

#table = 'previous_application'

# retrieve type from description
#meta = hc_description.loc[hc_description['Table']==table,['Row','Type']]
#dict_types = as_dict(meta)

# set types in data table
#df = df.astype(dict_types)
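
The restriction noted in the comments above (a categorical column only accepts values that are already categories) can be worked around by registering the fill value as a category first. This is a minimal sketch on a toy series, not a fix verified against this notebook's data; the fill value `"Unknown"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (values are illustrative).
status = pd.Series(["Approved", "Refused", np.nan, "Approved"], dtype="category")

# "Unknown" is not yet a category, so register it before using it as a fill value.
filled = status.cat.add_categories(["Unknown"]).fillna("Unknown")

print(filled.tolist())
print(list(filled.cat.categories))
```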
