# DataWig

[Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) (EDA) is an approach to analyzing data sets to summarize and understand their main characteristics, often with visual methods.

Starting with your *Machine Learning Checklist*, you will see that a crucial step is **preparing and understanding** the data. This is the step where you will spend most of your time. Here is an example checklist from [Aurélien Géron](10_20_2018_Machine-Learning-Project-Checklist.txt) that you can adapt to your needs.

In this series of notebooks, we explore the Home Credit data set. We clean, preprocess, and visualize the data for downstream processes in the data science workflow. All notebooks are available on GitHub: https://github.com/chalendony/data-prep-visualization


# Python Imports

In [123]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer
pd.set_option('display.max_columns', 125)
import quilt
from scripts.preprocess import percent_missing, align_dataframes, as_dict
from string import Template
import missingno as msno
%matplotlib inline
import impyute
import datawig

## Import Quilt Packages from Local Repository 

In [124]:
from quilt.data.avare import homecredit
homecredit

<GroupNode>
POS_CASH_balance
application_train
bureau
bureau_balance
credit_card_balance
installments_payments
previous_application

In [135]:
# Materialize every Quilt node as a DataFrame once (avoids calling node() each
# time) and deep-copy it so the original package data is never modified

# TODO: this loop is not needed
frames = {}
for key, val in homecredit._items():
    frames[key] = val().copy(deep=True)

# column descriptions, including the data types
hc_description = pd.read_excel('data/HomeCredit_columns_description.xlsx', sheet_name='Sheet1', usecols=[2, 3, 4])
hc_description.head()

# fix an error in the description sheet: drop the NFLAG_MICRO_CASH row listed for previous_application
idx = hc_description[(hc_description.Row == 'NFLAG_MICRO_CASH') & (hc_description.Table == 'previous_application')].index
hc_description.drop(idx, inplace=True)

# work on a deep copy of the table of interest
table = 'previous_application'
df = frames[table].copy(deep=True)



In [126]:
df.shape

(1670214, 37)

In [127]:
df.PRODUCT_COMBINATION.value_counts()

Cash                              285990
POS household with interest       263622
POS mobile with interest          220670
Cash X-Sell: middle               143883
Cash X-Sell: low                  130248
Card Street                       112582
POS industry with interest         98833
POS household without interest     82908
Card X-Sell                        80582
Cash Street: high                  59639
Cash X-Sell: high                  59301
Cash Street: middle                34658
Cash Street: low                   33834
POS mobile without interest        24082
POS other with interest            23879
POS industry without interest      12602
POS others without interest         2555
Name: PRODUCT_COMBINATION, dtype: int64

In [136]:
# select random instances
seed = 300
numinstances = 10000
df = df.sample(numinstances,random_state=seed)

In [129]:
# overall: counts for each value
df.PRODUCT_COMBINATION.value_counts()

Cash                              1696
POS household with interest       1532
POS mobile with interest          1359
Cash X-Sell: middle                854
Cash X-Sell: low                   753
Card Street                        724
POS industry with interest         603
POS household without interest     496
Card X-Sell                        477
Cash X-Sell: high                  372
Cash Street: high                  340
Cash Street: low                   207
Cash Street: middle                179
POS other with interest            155
POS mobile without interest        149
POS industry without interest       84
POS others without interest         14
Name: PRODUCT_COMBINATION, dtype: int64
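
The two `value_counts()` outputs above are easier to compare as proportions than as raw counts. The following is a minimal sketch on made-up data, not on the Home Credit table: the column name `product`, the category labels, and the sizes are assumptions chosen for illustration. It draws a random sample with the same `sample(..., random_state=...)` call and puts the class shares of the full data and of the sample side by side.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a categorical column such as PRODUCT_COMBINATION (hypothetical data)
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "product": rng.choice(["Cash", "POS", "Card"], size=5000, p=[0.5, 0.3, 0.2])
})

# Same kind of sampling call as in the notebook, on a smaller scale
sample = full.sample(500, random_state=300)

# normalize=True turns counts into fractions, so both columns are comparable
comparison = pd.DataFrame({
    "full": full["product"].value_counts(normalize=True),
    "sample": sample["product"].value_counts(normalize=True),
})
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()

print(comparison.round(3))
```

Small values in `abs_diff` suggest that the random sample keeps roughly the same class balance as the full table; rare classes are the ones most likely to drift.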

In [130]:
###### TODO : Assignment of Categoricals is not working!!!

#https://stackoverflow.com/questions/32718639/pandas-filling-nans-in-categorical-data
# update data types: once a column is categorical, only values that are already declared categories can be inserted
#print('Updating data types')

#table = 'previous_application'

# retrieve types from description
#meta = hc_description.loc[hc_description['Table']==table,['Row','Type']]
#dict_types = as_dict(meta)

# set types in data table
#df = df.astype(dict_types)

## Drop the empty columns
#dropcols = ['RATE_INTEREST_PRIVILEGED','RATE_INTEREST_PRIMARY']
#df.drop(dropcols, axis=1, inplace=True)

Updating data types
{'SK_ID_PREV': 'float64', 'SK_ID_CURR': 'float64', 'NAME_CONTRACT_TYPE': 'category', 'AMT_ANNUITY': 'float64', 'AMT_APPLICATION': 'float64', 'AMT_CREDIT': 'float64', 'AMT_DOWN_PAYMENT': 'float64', 'AMT_GOODS_PRICE': 'float64', 'WEEKDAY_APPR_PROCESS_START': 'category', 'HOUR_APPR_PROCESS_START': 'category', 'FLAG_LAST_APPL_PER_CONTRACT': 'category', 'NFLAG_LAST_APPL_IN_DAY': 'category', 'RATE_DOWN_PAYMENT': 'float64', 'RATE_INTEREST_PRIMARY': 'float64', 'RATE_INTEREST_PRIVILEGED': 'float64', 'NAME_CASH_LOAN_PURPOSE': 'category', 'NAME_CONTRACT_STATUS': 'category', 'DAYS_DECISION': 'float64', 'NAME_PAYMENT_TYPE': 'category', 'CODE_REJECT_REASON': 'category', 'NAME_TYPE_SUITE': 'category', 'NAME_CLIENT_TYPE': 'category', 'NAME_GOODS_CATEGORY': 'category', 'NAME_PORTFOLIO': 'category', 'NAME_PRODUCT_TYPE': 'category', 'CHANNEL_TYPE': 'category', 'SELLERPLACE_AREA': 'category', 'NAME_SELLER_INDUSTRY': 'category', 'CNT_PAYMENT': 'float64', 'NAME_YIELD_GROUP': 'category', 
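
The comment in the cell above notes that a categorical column only accepts values that are already declared categories, which is one likely reason the type assignment gets in the way of filling NaNs. The sketch below reproduces that behaviour on a toy Series; the label `"Unknown"` and the data are assumptions made for illustration. It shows one common workaround: add the new category first, then fill.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (hypothetical data)
s = pd.Series(["Cash", "Card Street", np.nan], dtype="category")

# Filling with a label that is not a declared category is rejected.
# The exception type differs between pandas versions, so catch both.
try:
    s.fillna("Unknown")
    fill_failed = False
except (TypeError, ValueError):
    fill_failed = True

# Declare the new category first, then fill the missing value
filled = s.cat.add_categories(["Unknown"]).fillna("Unknown")

print("direct fillna rejected:", fill_failed)
print(filled.tolist())
```

An alternative is to impute while the columns are still plain `object` dtype and convert to `category` afterwards.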

In [None]:
%%time
from sklearn.metrics import classification_report, f1_score

# select a portion of the data for evaluation
df_train, df_test = datawig.utils.random_split(df)

# NOTE: only the ID columns are excluded here, so input_cols still contains the
# output column PRODUCT_COMBINATION (it is listed among the string input columns
# in the log below). Exclude it as well to avoid leaking the label into the inputs.
input_cols = [*df.columns.difference(['SK_ID_PREV', 'SK_ID_CURR']).values]
output_column = 'PRODUCT_COMBINATION'
output_path = 'imputer_model'

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=input_cols,  # columns containing information about the column we want to impute
    output_column=output_column,  # the column we'd like to impute values for
    output_path=output_path  # stores model data and metrics
)

# Fit an imputer model on the train data
#imputer.fit(train_df=df_train, num_epochs=5)

# Fit an imputer model with default list of hyperparameters
imputer.fit_hpo(train_df=df_train)

# Impute missing values and return original dataframe with predictions
predictions = imputer.predict(df_test)

# Rows without a true label cannot be scored, so drop them before evaluating
scored = predictions.dropna(subset=[output_column])

# Calculate weighted F1 score for true vs predicted values (metrics from scikit-learn)
f1 = f1_score(scored[output_column], scored[output_column+'_imputed'], average='weighted')
print('weighted F1:', f1)

# Print overall classification report
print(classification_report(scored[output_column], scored[output_column+'_imputed']))


# fit an imputer model with customized hyperparameters
#imputer.fit_hpo(
#    train_df=df_train,
#    num_epochs=100,
#    patience=3,
#    learning_rate_candidates=[1e-3, 3e-4, 1e-4]
#)


2019-05-01 03:27:32,258 [INFO]  Assuming 19 numeric input columns: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE, CNT_PAYMENT, DAYS_DECISION, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_TERMINATION, HOUR_APPR_PROCESS_START, NFLAG_INSURED_ON_APPROVAL, NFLAG_LAST_APPL_IN_DAY, RATE_DOWN_PAYMENT, RATE_INTEREST_PRIMARY, RATE_INTEREST_PRIVILEGED, SELLERPLACE_AREA
2019-05-01 03:27:32,260 [INFO]  Assuming 16 string input columns: NAME_PRODUCT_TYPE, CODE_REJECT_REASON, NAME_YIELD_GROUP, NAME_SELLER_INDUSTRY, NAME_PORTFOLIO, NAME_CONTRACT_TYPE, NAME_TYPE_SUITE, PRODUCT_COMBINATION, NAME_GOODS_CATEGORY, NAME_CASH_LOAN_PURPOSE, FLAG_LAST_APPL_PER_CONTRACT, CHANNEL_TYPE, NAME_CLIENT_TYPE, WEEKDAY_APPR_PROCESS_START, NAME_PAYMENT_TYPE, NAME_CONTRACT_STATUS
2019-05-01 03:27:32,262 [INFO]  Assuming 19 numeric input columns: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE, CNT_PAYMENT, DAYS_DECISION, DAYS_F

2019-05-01 03:27:32,312 [INFO]  Assuming categorical output column: PRODUCT_COMBINATION
2019-05-01 03:27:32,318 [INFO]  Using [[cpu(0)]] as the context for training
2019-05-01 03:27:32,321 [INFO]  Fitting label encoder <class 'datawig.column_encoders.CategoricalEncoder'> on 6400 rows                             of training data
2019-05-01 03:27:32,329 [INFO]  17 most often encountered discrete values:                      ['Cash' 'POS household with interest' 'POS mobile with interest'
 'Cash X-Sell: middle' 'Cash X-Sell: low' 'Card Street'
 'POS industry with interest' 'Card X-Sell'
 'POS household without interest' 'Cash X-Sell: high' 'Cash Street: high'
 'Cash Street: low' 'Cash Street: middle' 'POS other with interest'
 'POS mobile without interest' 'POS industry without interest'
 'POS others without interest']
2019-05-01 03:27:32,339 [INFO]  Detected 6 rows with missing labels                         for column PRODUCT_COMBINATION
2019-05-01 03:27:32,341 [INFO]  Dropping 6/6400 r

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]
2019-05-01 03:27:32,941 [INFO]  Fitting data encoder <class 'datawig.column_encoders.NumericalEncoder'> on columns DAYS_LAST_DUE and 6394 rows of training data with parameters {'input_columns': ['DAYS_LAST_DUE'], 'output_column': 'DAYS_LAST_DUE_numeric', 'output_dim': 1, 'normalize': True, 'scaler': None}
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]
2019-05-01 03:27:33,005 [INFO]  Fitting data encoder <class 'datawig.column_encoders.NumericalEncoder'> on columns DAYS_LAST_DUE_1ST_VERSION and 6394 rows 

2019-05-01 03:27:33,801 [INFO]  Concatenating numeric columns ['AMT_ANNUITY'] into AMT_ANNUITY_numeric
2019-05-01 03:27:33,803 [INFO]  Normalizing with StandardScaler
2019-05-01 03:27:33,807 [INFO]  Data Encoding - Encoded 6400 rows of column                         AMT_ANNUITY with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (6400, 1)                         and then into shape (6400, 1)
2019-05-01 03:27:33,811 [INFO]  Concatenating numeric columns ['AMT_APPLICATION'] into AMT_APPLICATION_numeric
2019-05-01 03:27:33,813 [INFO]  Normalizing with StandardScaler
2019-05-01 03:27:33,817 [INFO]  Data Encoding - Encoded 6400 rows of column                         AMT_APPLICATION with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (6400, 1)                         and then into shape (6400, 1)
2019-05-01 03:27:33,823 [INFO]  Concatenating numeric columns ['AM

2019-05-01 03:27:35,094 [INFO]  Data Encoding - Encoded 6400 rows of column                         NAME_GOODS_CATEGORY with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (6400, 32768)                         and then into shape (6400, 32768)
2019-05-01 03:27:35,336 [INFO]  Data Encoding - Encoded 6400 rows of column                         NAME_PAYMENT_TYPE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (6400, 32768)                         and then into shape (6400, 32768)
2019-05-01 03:27:35,397 [INFO]  Data Encoding - Encoded 6400 rows of column                         NAME_PORTFOLIO with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (6400, 32768)                         and then into shape (6400, 32768)
2019-05-01 03:27:35,470 [INFO]  Data Encoding - Encod

2019-05-01 03:27:36,502 [INFO]  Concatenating numeric columns ['AMT_GOODS_PRICE'] into AMT_GOODS_PRICE_numeric
2019-05-01 03:27:36,504 [INFO]  Normalizing with StandardScaler
2019-05-01 03:27:36,509 [INFO]  Data Encoding - Encoded 1600 rows of column                         AMT_GOODS_PRICE with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (1600, 1)                         and then into shape (1600, 1)
2019-05-01 03:27:36,593 [INFO]  Data Encoding - Encoded 1600 rows of column                         CHANNEL_TYPE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (1600, 32768)                         and then into shape (1600, 32768)
2019-05-01 03:27:36,597 [INFO]  Concatenating numeric columns ['CNT_PAYMENT'] into CNT_PAYMENT_numeric
2019-05-01 03:27:36,598 [INFO]  Normalizing with StandardScaler
2019-05-01 03:27:36,601 [INFO]  Data Encoding - E

2019-05-01 03:27:37,102 [INFO]  Data Encoding - Encoded 1600 rows of column                         NAME_SELLER_INDUSTRY with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (1600, 32768)                         and then into shape (1600, 32768)
2019-05-01 03:27:37,133 [INFO]  Data Encoding - Encoded 1600 rows of column                         NAME_TYPE_SUITE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (1600, 32768)                         and then into shape (1600, 32768)
2019-05-01 03:27:37,163 [INFO]  Data Encoding - Encoded 1600 rows of column                         NAME_YIELD_GROUP with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (1600, 32768)                         and then into shape (1600, 32768)
2019-05-01 03:27:37,167 [INFO]  Concatenating numeri
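
The evaluation step in the DataWig cell scores the imputed column with an F1 score using `average='weighted'`. To make clear what that number means, here is a small pure-Python sketch of the support-weighted F1; the function name `weighted_f1` and the toy labels are made up for illustration, and the code is meant as an explanation of the metric rather than a replacement for a library implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by how often each class occurs in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # true positives and number of predictions for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy labels: the frequent class is predicted well, the rare class "Card" is missed
y_true = ["Cash", "Cash", "Cash", "Cash", "Card", "POS"]
y_pred = ["Cash", "Cash", "Cash", "Cash", "Cash", "POS"]

print(round(weighted_f1(y_true, y_pred), 3))
```

Because frequent classes carry more weight, a model can reach a high weighted F1 while failing completely on rare classes such as `POS others without interest`, so the per-class classification report is worth reading alongside the single score.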

## Missing Numerical

#### Strategy: Probabilistic Imputation

* DataWig: https://github.com/awslabs/datawig/blob/master/README.md#imputation-of-numerical-columns

In [None]:
# fill in some nulls: blank out PRODUCT_COMBINATION in a random 20% of the rows
# hmm does not include missing data in th
seed = 200
nullval = df.sample(frac=0.2, random_state=seed)

# keep the true values of the selected rows before overwriting them
test = df.loc[nullval.index, ['PRODUCT_COMBINATION']].copy()

df.loc[nullval.index, ['PRODUCT_COMBINATION']] = np.nan
df.loc[nullval.index, ['PRODUCT_COMBINATION']].head()


# test set: counts for each value
test.PRODUCT_COMBINATION.value_counts()
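
The cell above hides known `PRODUCT_COMBINATION` values so that imputed values can later be compared with the truth. The sketch below walks through that mask-then-score idea end to end on toy data. It is an illustration under stated assumptions: the data is randomly generated, and a most-frequent-value fill stands in for a trained imputer such as DataWig, so the resulting accuracy is only a baseline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the sampled previous_application table (hypothetical data)
rng = np.random.default_rng(200)
toy = pd.DataFrame({
    "PRODUCT_COMBINATION": rng.choice(
        ["Cash", "Card Street", "Card X-Sell"], size=200, p=[0.6, 0.3, 0.1]
    )
})

# 1. Pick 20% of the rows and remember their true values
masked_idx = toy.sample(frac=0.2, random_state=200).index
truth = toy.loc[masked_idx, "PRODUCT_COMBINATION"].copy()

# 2. Blank those cells out
toy.loc[masked_idx, "PRODUCT_COMBINATION"] = np.nan

# 3. Impute with a simple baseline (most frequent value);
#    a trained imputer would take the place of this step
most_frequent = toy["PRODUCT_COMBINATION"].mode().iloc[0]
imputed = toy["PRODUCT_COMBINATION"].fillna(most_frequent)

# 4. Score only the cells that were hidden
accuracy = (imputed.loc[masked_idx] == truth).mean()
print(f"masked rows: {len(masked_idx)}, baseline accuracy: {accuracy:.2f}")
```

Any learned imputer should beat this baseline on the hidden cells; if it does not, the input columns probably carry little information about the target.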