# DataWig

[Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) (EDA) is an approach to analyzing data sets to summarize and understand their main characteristics, often with visual methods.

If you start from a *Machine Learning Checklist*, you will see that a crucial step is **preparing and understanding** the data. It is also the step where you will spend most of your time. Here is an example checklist from [Aurélien Géron](10_20_2018_Machine-Learning-Project-Checklist.txt) that you can adapt to your needs.

In this series of notebooks, we explore the Home Credit data set. We clean, preprocess and visualize the data for downstream processes in the Data Science Workflow. All notebooks can be obtained from GitHub: https://github.com/chalendony/data-prep-visualization


# Python Imports

In [104]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer
pd.set_option('display.max_columns', 125)
import quilt
from scripts.preprocess import percent_missing, align_dataframes, as_dict
from string import Template
import missingno as msno
%matplotlib inline
import impyute
import datawig

## Import Quilt Packages from Local Repository 

In [105]:
from quilt.data.avare import homecredit
homecredit

<GroupNode>
POS_CASH_balance
application_train
bureau
bureau_balance
credit_card_balance
installments_payments
previous_application

In [106]:
# call each Quilt node once to get its DataFrame, and keep a deep copy of the original data
frames = {}
for key, val in homecredit._items():
    frames[key] = val().copy(deep=True)

# column descriptions, including the expected data types
hc_description = pd.read_excel('data/HomeCredit_columns_description.xlsx', sheet_name='Sheet1', usecols=[2, 3, 4])
hc_description.head()

# fix an error in the description sheet: drop the NFLAG_MICRO_CASH row listed for previous_application
idx = hc_description[(hc_description.Row == 'NFLAG_MICRO_CASH') & (hc_description.Table == 'previous_application')].index
hc_description.drop(idx, inplace=True)

# work on a separate copy of a single table
table = 'previous_application'
df = frames[table].copy(deep=True)

# drop the empty columns
dropcols = ['RATE_INTEREST_PRIVILEGED','RATE_INTEREST_PRIMARY']
df.drop(dropcols, axis=1, inplace=True)

In [107]:
df.shape

(1670214, 35)

In [108]:
df.PRODUCT_COMBINATION.value_counts()

Cash                              285990
POS household with interest       263622
POS mobile with interest          220670
Cash X-Sell: middle               143883
Cash X-Sell: low                  130248
Card Street                       112582
POS industry with interest         98833
POS household without interest     82908
Card X-Sell                        80582
Cash Street: high                  59639
Cash X-Sell: high                  59301
Cash Street: middle                34658
Cash Street: low                   33834
POS mobile without interest        24082
POS other with interest            23879
POS industry without interest      12602
POS others without interest         2555
Name: PRODUCT_COMBINATION, dtype: int64

In [109]:
# select random instances
seed = 300
numinstances = 10000
df = df.sample(numinstances,random_state=seed)

In [110]:
# overall: counts for each value
df.PRODUCT_COMBINATION.value_counts()

Cash                              1696
POS household with interest       1532
POS mobile with interest          1359
Cash X-Sell: middle                854
Cash X-Sell: low                   753
Card Street                        724
POS industry with interest         603
POS household without interest     496
Card X-Sell                        477
Cash X-Sell: high                  372
Cash Street: high                  340
Cash Street: low                   207
Cash Street: middle                179
POS other with interest            155
POS mobile without interest        149
POS industry without interest       84
POS others without interest         14
Name: PRODUCT_COMBINATION, dtype: int64
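
The two `value_counts()` outputs above are hard to compare by eye because one is counted over 1,670,214 rows and the other over 10,000. A minimal sketch of a side-by-side comparison of relative frequencies follows. It runs on synthetic data (the series `full`, the three category labels and their probabilities are made up for illustration), so the numbers it prints say nothing about the Home Credit data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full PRODUCT_COMBINATION column (illustrative values only)
rng = np.random.default_rng(0)
full = pd.Series(
    rng.choice(["Cash", "Card Street", "Card X-Sell"], size=5000, p=[0.6, 0.3, 0.1]),
    name="PRODUCT_COMBINATION",
)

# Random subsample, drawn the same way as in the notebook
sample = full.sample(500, random_state=300)

# Relative frequencies side by side; a category missing from the sample becomes 0
comparison = pd.DataFrame({
    "full": full.value_counts(normalize=True),
    "sample": sample.value_counts(normalize=True),
}).fillna(0.0)
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()

print(comparison.sort_values("full", ascending=False))
```

On the real frame, the same comparison could be run with the full column and the sampled column in place of `full` and `sample`. Large values in `abs_diff` for rare categories would suggest stratified sampling instead of a plain random sample.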

In [111]:
# https://stackoverflow.com/questions/32718639/pandas-filling-nans-in-categorical-data
# update data types: once a column is Categorical, only values from its categories can be inserted
print('Updating data types')
table = 'previous_application'

# retrieve types from the description
meta = hc_description.loc[hc_description['Table']==table,['Row','Type']]
dict_types = as_dict(meta)

for table in frames.keys():
    print(table)

    # retrieve types from the description
    meta = hc_description.loc[hc_description['Table']==table,['Row','Type']]
    dict_types = as_dict(meta)

    # set types in data table
    # NOTE: df holds only previous_application, so applying the mapping of any
    # other table (here POS_CASH_balance comes first) raises the KeyError below
    df = df.astype(dict_types)

Updating data types
POS_CASH_balance


KeyError: 'Only a column name can be used for the key in a dtype mappings argument.'
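
pandas raises this `KeyError` when the dictionary passed to `DataFrame.astype` contains a key that is not a column of the frame. One way to avoid it is to filter the mapping down to the columns that actually exist before casting. The sketch below uses two small made-up frames (`description` and `frame`) and builds the mapping with `dict(zip(...))`, which only stands in for the notebook's `as_dict` helper; the real helper's behavior is not shown in this notebook and is assumed here.

```python
import pandas as pd

# Hypothetical miniature of the column description sheet
description = pd.DataFrame({
    "Table": ["previous_application", "previous_application", "bureau"],
    "Row": ["AMT_CREDIT", "NAME_CONTRACT_TYPE", "CREDIT_ACTIVE"],
    "Type": ["float64", "category", "category"],
})

# Hypothetical miniature of the previous_application frame
frame = pd.DataFrame({
    "AMT_CREDIT": [1000, 2500],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Consumer loans"],
})

# Deliberately use ALL description rows, including those of other tables
dtype_map = dict(zip(description["Row"], description["Type"]))

# Keep only the keys that are columns of this frame, so astype never sees an unknown key
dtype_map = {col: dtype for col, dtype in dtype_map.items() if col in frame.columns}

typed = frame.astype(dtype_map)
print(typed.dtypes)
```

Filtering on `frame.columns` also protects against description rows for columns that were dropped earlier, such as the two empty interest-rate columns.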

In [78]:
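%%time
from sklearn.metrics import classification_report, f1_score

# select a portion of the data for evaluation
df_train, df_test = datawig.utils.random_split(df)

# use every column except the ID columns as input
# note: this list still contains the output column PRODUCT_COMBINATION (see the log below)
input_cols = [*df.columns.difference(['SK_ID_PREV', 'SK_ID_CURR']).values]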
output_column = 'PRODUCT_COMBINATION'
output_path = 'imputer_model'

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=input_cols,  # columns containing information about the column we want to impute
    output_column='PRODUCT_COMBINATION',  # the column we'd like to impute values for
    output_path=output_path  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=5)

# Fit an imputer model with default list of hyperparameters
#imputer.fit_hpo(train_df=df_train)

# Impute missing values and return original dataframe with predictions
predictions = imputer.predict(df_test)

# Calculate f1 score for true vs predicted values
f1 = f1_score(predictions[output_column], predictions[output_column+'_imputed'], average='weighted')

# Print overall classification report
print(classification_report(predictions[output_column], predictions[output_column+'_imputed']))


# fit an imputer model with customized hyperparameters
#imputer.fit_hpo(
#    train_df=df_train,
#    num_epochs=100,
#    patience=3,
#    learning_rate_candidates=[1e-3, 3e-4, 1e-4]
#)


2019-05-01 02:26:32,440 [INFO]  Assuming 13 numeric input columns: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE, CNT_PAYMENT, DAYS_DECISION, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, RATE_DOWN_PAYMENT
2019-05-01 02:26:32,442 [INFO]  Assuming 20 string input columns: HOUR_APPR_PROCESS_START, NAME_TYPE_SUITE, FLAG_LAST_APPL_PER_CONTRACT, WEEKDAY_APPR_PROCESS_START, NAME_PRODUCT_TYPE, NAME_YIELD_GROUP, NFLAG_LAST_APPL_IN_DAY, NAME_SELLER_INDUSTRY, NAME_GOODS_CATEGORY, NAME_CASH_LOAN_PURPOSE, NAME_CONTRACT_STATUS, NAME_PORTFOLIO, PRODUCT_COMBINATION, SELLERPLACE_AREA, CHANNEL_TYPE, NFLAG_INSURED_ON_APPROVAL, CODE_REJECT_REASON, NAME_CONTRACT_TYPE, NAME_CLIENT_TYPE, NAME_PAYMENT_TYPE
2019-05-01 02:26:32,445 [INFO]  No output column name provided for ColumnEncoder using PRODUCT_COMBINATION
2019-05-01 02:26:32,446 [INFO]  Assuming categorical output column: PRODUCT_COMBINATION
2019-05-01 02:26:32,449 [INFO]  

ValueError: fill value must be in categories
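
The traceback is cut off, so the exact failing call is not visible. The message does match what older pandas versions raise when `fillna` is given a value that is not among the categories of a categorical column; newer versions raise a `TypeError` with a different message. Assuming that is the cause here, the sketch below reproduces the failure on a tiny made-up series and shows two possible workarounds. The fill value `"missing"` is an arbitrary placeholder, not something DataWig is known to use.

```python
import pandas as pd

# Tiny categorical series with one missing value (illustrative data)
s = pd.Series(["Cash", None, "Card Street"], dtype="category")

# Filling with a value outside the categories fails
try:
    s.fillna("missing")
    raised = False
except (ValueError, TypeError):
    # ValueError in older pandas, TypeError in newer releases
    raised = True

# Workaround 1: cast to a plain object column before filling
filled_obj = s.astype(object).fillna("missing")

# Workaround 2: register the new category first, then fill
filled_cat = s.cat.add_categories(["missing"]).fillna("missing")

print(raised, filled_obj.tolist(), filled_cat.tolist())
```

For the notebook this would mean casting the categorical columns of `df` back to `object` (or `str`) before handing the frame to the imputer.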

## Missing Numerical

#### Strategy: Impute Probabilistic

* Datawig: https://github.com/awslabs/datawig/blob/master/README.md#imputation-of-numerical-columns

In [None]:
# fill in some nulls: mask a random 20% of PRODUCT_COMBINATION
# hmm does not include missing data in th
seed = 200
nullval = df.sample(frac=0.2, random_state=seed)

# keep the true values before they are overwritten
test = df.loc[nullval.index, ['PRODUCT_COMBINATION']].copy()

df.loc[nullval.index, ['PRODUCT_COMBINATION']] = np.nan
df.loc[nullval.index, ['PRODUCT_COMBINATION']].head()


# test set: counts for each value 
test.PRODUCT_COMBINATION.value_counts()
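
Masking known values only pays off if the imputed values are later compared with the held-out truth. The sketch below shows that comparison end to end on a small made-up frame (`toy`). It uses a most-frequent-value fill as a stand-in baseline, not DataWig, so that it runs with pandas and NumPy alone. Any imputer that returns a filled column could be scored the same way.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sampled previous_application frame (hypothetical data)
toy = pd.DataFrame({
    "PRODUCT_COMBINATION": [
        "Cash", "Cash", "Card Street", "Cash", "Card X-Sell",
        "Cash", "Card Street", "Cash", "Cash", "Card X-Sell",
    ],
})

# Choose 20% of the rows to mask and store their true values first
masked_idx = toy.sample(frac=0.2, random_state=200).index
truth = toy.loc[masked_idx, "PRODUCT_COMBINATION"].copy()
toy.loc[masked_idx, "PRODUCT_COMBINATION"] = np.nan

# Baseline imputer (NOT DataWig): fill with the most frequent observed value
most_frequent = toy["PRODUCT_COMBINATION"].mode().iloc[0]
imputed = toy["PRODUCT_COMBINATION"].fillna(most_frequent)

# Share of masked cells for which the baseline recovered the true value
accuracy = (imputed.loc[masked_idx] == truth).mean()
print(f"masked rows: {len(masked_idx)}, baseline accuracy: {accuracy:.2f}")
```

A learned imputer is only worth its training cost if it beats this kind of baseline on the same masked rows.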