# Imputing Categorical Data Using DataWig

Before you include Datawig in your operational pipeline, you can evaluate its performance on a data set that you are familiar with.

Predictions are made on a single column at a time, and the output includes the probability of each prediction.

# Python Imports

In [1]:
%config IPCompleter.greedy=True
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer
pd.set_option('display.max_columns', 125)
import quilt
from scripts.preprocess import percent_missing, as_dict
from string import Template
import missingno as msno
import impyute
import datawig
from sklearn.metrics import f1_score, classification_report

In [2]:
# load data from local repository, once per session
from quilt.data.avare import homecredit

# Validate Data Types 

We manually assign a type to each variable and override the data types inferred by pandas.

In [3]:
description = pd.read_csv('data/new_data_description_file.csv')
description.head()

|   | Row | Table | Type |
|---|---|---|---|
| 0 | SK_ID_PREV | POS_CASH_balance | object |
| 1 | SK_ID_CURR | POS_CASH_balance | object |
| 2 | MONTHS_BALANCE | POS_CASH_balance | float64 |
| 3 | CNT_INSTALMENT | POS_CASH_balance | float64 |
| 4 | CNT_INSTALMENT_FUTURE | POS_CASH_balance | float64 |


## Override Inferred Data Types (Single Table)

TODO: check the Datawig code to see how data types are used.

We are concerned with the data types because we have to choose the correct data encoding before passing the data to Datawig.

In [4]:
table = 'previous_application'
df = homecredit[table]()

python_cat_dtype = 'object'
python_num_dtype = 'float64'

    
condtable = description.Table == table
condcat = description.Type == python_cat_dtype
condnum = description.Type == python_num_dtype
        
catcols = description.loc[(condtable & condcat),'Row'].values.tolist()
numcols = description.loc[(condtable & condnum),'Row'].values.tolist()
    
df[catcols] = df[catcols].astype(python_cat_dtype) 
df[numcols] = df[numcols].astype(python_num_dtype)

## Select Subset for Analysis

We will create a subset __without__ nulls, then split the data into a training and test set.

The test set will allow us to measure the performance.

In [5]:
# drop empty columns and identifier columns
dropcols = ['RATE_INTEREST_PRIVILEGED','RATE_INTEREST_PRIMARY','SK_ID_PREV', 'SK_ID_CURR']
df.drop(dropcols, axis=1, inplace=True)

# drop rows containing null 
df.dropna(axis=0, how='any', inplace=True)

# select random instances
seed = 500
numinstances = 1000
df = df.sample(numinstances,random_state=seed)
#df.info(verbose=True)
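
For readers who want to see the split spelled out, the following pandas-only sketch performs an 80/20 random split. It is a stand-in for illustration, not Datawig's own implementation: `split_frame` and its defaults are hypothetical names, and the sketch assumes the index values are unique.

```python
import pandas as pd

def split_frame(frame, train_fraction=0.8, seed=500):
    """Randomly split `frame` into train and test parts (index must be unique)."""
    train = frame.sample(frac=train_fraction, random_state=seed)
    # Everything that was not drawn for training becomes the test set.
    test = frame.drop(train.index)
    return train, test

# Hypothetical toy frame with the same number of rows as the sample above.
toy = pd.DataFrame({"x": range(1000)})
train, test = split_frame(toy)
print(len(train), len(test))
```

With 1000 sampled rows, an 80/20 split leaves 200 rows for testing, which is consistent with the total support shown in the classification report later in this notebook.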

# Strategy to Encode Categoricals

When validating the data types, go beyond a simple categorical-versus-numerical distinction so that you can choose the appropriate strategy to encode the data.

* [Rule of Thumb for Validating Data Types](https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689) 

* [Category Encoders Package](http://contrib.scikit-learn.org/categorical-encoding/index.html)

__Algorithm Requirements:__ When working with Datawig, a nominal variable whose values are integers is interpreted as numeric. In our data set, the following variables are affected:

* HOUR_APPR_PROCESS_START : hours, __ordinal__  (but we encode it as numeric)
* NFLAG_LAST_APPL_IN_DAY, NFLAG_INSURED_ON_APPROVAL : (0,1) __binary__ 
* SELLERPLACE_AREA : 4-digit code for a location, __nominal__  
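
A minimal sketch of why integer-coded categoricals need attention, assuming the numeric-versus-string decision follows the pandas dtype of each column. That is an assumption about Datawig's internals, not something confirmed here. The toy values are hypothetical, and pandas' `is_numeric_dtype` is used only as a stand-in for that check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: a location code stored as integers and a text category.
toy = pd.DataFrame({
    "SELLERPLACE_AREA": [1200, 35, 4001],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Consumer loans", "Cash loans"],
})

# pandas infers an integer dtype for the code column, so a dtype-based check
# treats it as numeric even though the values are labels, not quantities.
looks_numeric = {col: is_numeric_dtype(toy[col]) for col in toy.columns}
print(looks_numeric)
```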

__Categorical Encoding Strategy:__ For this task and algorithm, we create a simple user-defined encoding for the binary and nominal variables listed above: each integer value is mapped to a string by adding a prefix, so that the column is read as plain text rather than as a number.

In [6]:
# User-defined Categorical Encoding
prefix = 's_'
df['NFLAG_LAST_APPL_IN_DAY'] =  prefix + df['NFLAG_LAST_APPL_IN_DAY'].astype(str) 
df['SELLERPLACE_AREA'] = prefix + df['SELLERPLACE_AREA'].astype(str) 
df['NFLAG_INSURED_ON_APPROVAL'] = prefix +  df['NFLAG_INSURED_ON_APPROVAL'].astype(str) 
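
The same mapping can be wrapped in a small helper so that the list of affected columns lives in one place. This is a sketch only: `prefix_as_string` is a hypothetical name, not part of Datawig or pandas, and the toy frame stands in for `df`.

```python
import pandas as pd

def prefix_as_string(frame, columns, prefix="s_"):
    """Return a copy of `frame` with `columns` converted to prefixed strings."""
    out = frame.copy()
    for col in columns:
        out[col] = prefix + out[col].astype(str)
    return out

# Hypothetical toy data standing in for the sampled `df`.
toy = pd.DataFrame({"NFLAG_LAST_APPL_IN_DAY": [1, 0], "SELLERPLACE_AREA": [1200, 35]})
encoded = prefix_as_string(toy, ["NFLAG_LAST_APPL_IN_DAY", "SELLERPLACE_AREA"])
print(encoded["SELLERPLACE_AREA"].tolist())
```

Returning a copy leaves the input frame untouched, which makes it easier to compare the encoded and original columns side by side.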

# Train Model

In [7]:
#%%time
# select a portion of the data for evaluation
df_train, df_test = datawig.utils.random_split(df)

output_column = 'PRODUCT_COMBINATION'
output_path = 'imputer_model' # TODO: file in .gitignore, change path
# use every column except the target as model input
input_cols = [col for col in df.columns if col != output_column]


# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=input_cols,  # columns containing information about the column we want to impute
    output_column=output_column,  # the column we'd like to impute values for
    output_path=output_path  # stores model data and metrics 
)

# Fit an imputer model with default list of hyperparameters
imputer.fit_hpo(train_df=df_train)

2019-05-08 10:28:10,447 [INFO]  Assuming 14 numeric input columns: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE, HOUR_APPR_PROCESS_START, RATE_DOWN_PAYMENT, DAYS_DECISION, CNT_PAYMENT, DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, DAYS_TERMINATION
2019-05-08 10:28:10,449 [INFO]  Assuming 18 string input columns: NAME_YIELD_GROUP, NFLAG_LAST_APPL_IN_DAY, NAME_SELLER_INDUSTRY, NAME_CONTRACT_STATUS, NAME_TYPE_SUITE, NAME_PORTFOLIO, FLAG_LAST_APPL_PER_CONTRACT, CODE_REJECT_REASON, CHANNEL_TYPE, NAME_CONTRACT_TYPE, NAME_CASH_LOAN_PURPOSE, NFLAG_INSURED_ON_APPROVAL, NAME_PAYMENT_TYPE, SELLERPLACE_AREA, WEEKDAY_APPR_PROCESS_START, NAME_CLIENT_TYPE, NAME_GOODS_CATEGORY, NAME_PRODUCT_TYPE
2019-05-08 10:28:10,450 [INFO]  Assuming 14 numeric input columns: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT, AMT_GOODS_PRICE, HOUR_APPR_PROCESS_START, RATE_DOWN_PAYMENT, DAYS_DECISION, CNT_PAYMENT, DAYS_FIRST_DRAWING, DAYS_FIRST_D

2019-05-08 10:28:10,547 [INFO]  Detected 0 rows with missing labels                         for column PRODUCT_COMBINATION
2019-05-08 10:28:10,549 [INFO]  Dropping 0/640 rows
2019-05-08 10:28:10,555 [INFO]  Detected 0 rows with missing labels                         for column PRODUCT_COMBINATION
2019-05-08 10:28:10,557 [INFO]  Dropping 0/160 rows
2019-05-08 10:28:10,560 [INFO]  Train: 640, Test: 160
2019-05-08 10:28:10,562 [INFO]  Fitting data encoder <class 'datawig.column_encoders.NumericalEncoder'> on columns AMT_ANNUITY and 640 rows of training data with parameters {'input_columns': ['AMT_ANNUITY'], 'output_column': 'AMT_ANNUITY_numeric', 'output_dim': 1, 'normalize': True, 'scaler': None}
2019-05-08 10:28:10,578 [INFO]  Fitting data encoder <class 'datawig.column_encoders.NumericalEncoder'> on columns AMT_APPLICATION and 640 rows of training data with parameters {'input_columns': ['AMT_APPLICATION'], 'output_column': 'AMT_APPLICATION_numeric', 'output_dim': 1, 'normalize': True, 

2019-05-08 10:28:10,802 [INFO]  Concatenating numeric columns ['AMT_DOWN_PAYMENT'] into AMT_DOWN_PAYMENT_numeric
2019-05-08 10:28:10,803 [INFO]  Normalizing with StandardScaler
2019-05-08 10:28:10,810 [INFO]  Data Encoding - Encoded 640 rows of column                         AMT_DOWN_PAYMENT with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:28:10,814 [INFO]  Concatenating numeric columns ['AMT_GOODS_PRICE'] into AMT_GOODS_PRICE_numeric
2019-05-08 10:28:10,815 [INFO]  Normalizing with StandardScaler
2019-05-08 10:28:10,822 [INFO]  Data Encoding - Encoded 640 rows of column                         AMT_GOODS_PRICE with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:28:10,843 [INFO]  Data Encoding - Encoded 6

2019-05-08 10:28:11,254 [INFO]  Concatenating numeric columns ['DAYS_FIRST_DUE'] into DAYS_FIRST_DUE_numeric
2019-05-08 10:28:11,257 [INFO]  Normalizing with StandardScaler
2019-05-08 10:28:11,261 [INFO]  Data Encoding - Encoded 640 rows of column                         DAYS_FIRST_DUE with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:28:11,265 [INFO]  Concatenating numeric columns ['DAYS_LAST_DUE_1ST_VERSION'] into DAYS_LAST_DUE_1ST_VERSION_numeric
2019-05-08 10:28:11,266 [INFO]  Normalizing with StandardScaler
2019-05-08 10:28:11,272 [INFO]  Data Encoding - Encoded 640 rows of column                         DAYS_LAST_DUE_1ST_VERSION with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:28:11,278 [INFO]  C

2019-05-08 10:28:11,532 [INFO]  Data Encoding - Encoded 160 rows of column                         NAME_PAYMENT_TYPE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (160, 32768)                         and then into shape (160, 32768)
2019-05-08 10:28:11,539 [INFO]  Data Encoding - Encoded 160 rows of column                         CODE_REJECT_REASON with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (160, 32768)                         and then into shape (160, 32768)
2019-05-08 10:28:11,552 [INFO]  Data Encoding - Encoded 160 rows of column                         NAME_TYPE_SUITE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (160, 32768)                         and then into shape (160, 32768)
2019-05-08 10:28:11,562 [INFO]  Data Encoding - Encoded 160 ro

2019-05-08 10:28:21,023 [INFO]  Epoch[0] Validation-cross-entropy=0.590815
2019-05-08 10:28:21,024 [INFO]  Epoch[0] Validation-PRODUCT_COMBINATION-accuracy=0.850000
2019-05-08 10:28:25,004 [INFO]  Epoch[1] Batch [0-20]	Speed: 84.42 samples/sec	cross-entropy=0.425005	PRODUCT_COMBINATION-accuracy=0.869048
2019-05-08 10:28:28,519 [INFO]  Epoch[1] Train-cross-entropy=0.399919
2019-05-08 10:28:28,526 [INFO]  Epoch[1] Train-PRODUCT_COMBINATION-accuracy=0.882812
2019-05-08 10:28:28,527 [INFO]  Epoch[1] Time cost=7.502
2019-05-08 10:28:28,721 [INFO]  Saved checkpoint to "imputer_model0/model-0001.params"
2019-05-08 10:28:29,991 [INFO]  Epoch[1] Validation-cross-entropy=0.373925
2019-05-08 10:28:29,992 [INFO]  Epoch[1] Validation-PRODUCT_COMBINATION-accuracy=0.893750
2019-05-08 10:28:33,852 [INFO]  Epoch[2] Batch [0-20]	Speed: 87.21 samples/sec	cross-entropy=0.274019	PRODUCT_COMBINATION-accuracy=0.910714
2019-05-08 10:28:37,437 [INFO]  Epoch[2] Train-cross-entropy=0.271097
2019-05-08 10:28:37,4

2019-05-08 10:29:56,607 [INFO]  Attribute PRODUCT_COMBINATION, Label: POS household without interest	Reaching 0.9285714285714286 precision / 0.8666666666666667 recall at threshold 0.8725219964981079
2019-05-08 10:29:56,610 [INFO]  Attribute PRODUCT_COMBINATION, Label: Cash Street: high	Reaching 0.75 precision / 0.6 recall at threshold 0.7542812824249268
2019-05-08 10:29:56,612 [INFO]  Attribute PRODUCT_COMBINATION, Label: POS other with interest	Reaching 1.0 precision / 0.0 recall at threshold 0.9942069053649902
2019-05-08 10:29:56,615 [INFO]  Attribute PRODUCT_COMBINATION, Label: Cash X-Sell: high	Reaching 0.0 precision / 0.0 recall at threshold 0.9367694854736328
2019-05-08 10:29:56,618 [INFO]  Attribute PRODUCT_COMBINATION, Label: POS industry without interest	Reaching 0.0 precision / 0.0 recall at threshold 0.26197901368141174
2019-05-08 10:29:56,621 [INFO]  Attribute PRODUCT_COMBINATION, Label: Cash X-Sell: middle	Reaching 1.0 precision / 0.0 recall at threshold 0.6984896063804626

2019-05-08 10:29:56,860 [INFO]  Data Encoding - Encoded 160 rows of column                         CODE_REJECT_REASON with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (160, 32768)                         and then into shape (160, 32768)
2019-05-08 10:29:56,868 [INFO]  Data Encoding - Encoded 160 rows of column                         NAME_TYPE_SUITE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (160, 32768)                         and then into shape (160, 32768)
2019-05-08 10:29:56,875 [INFO]  Data Encoding - Encoded 160 rows of column                         NAME_CLIENT_TYPE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (160, 32768)                         and then into shape (160, 32768)
2019-05-08 10:29:56,888 [INFO]  Data Encoding - Encoded 160 row

2019-05-08 10:29:58,529 [INFO]  Concatenating numeric columns ['AMT_DOWN_PAYMENT'] into AMT_DOWN_PAYMENT_numeric
2019-05-08 10:29:58,531 [INFO]  Normalizing with StandardScaler
2019-05-08 10:29:58,536 [INFO]  Data Encoding - Encoded 640 rows of column                         AMT_DOWN_PAYMENT with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:29:58,544 [INFO]  Concatenating numeric columns ['AMT_GOODS_PRICE'] into AMT_GOODS_PRICE_numeric
2019-05-08 10:29:58,545 [INFO]  Normalizing with StandardScaler
2019-05-08 10:29:58,550 [INFO]  Data Encoding - Encoded 640 rows of column                         AMT_GOODS_PRICE with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:29:58,567 [INFO]  Data Encoding - Encoded 6

2019-05-08 10:29:58,903 [INFO]  Concatenating numeric columns ['DAYS_FIRST_DUE'] into DAYS_FIRST_DUE_numeric
2019-05-08 10:29:58,905 [INFO]  Normalizing with StandardScaler
2019-05-08 10:29:58,908 [INFO]  Data Encoding - Encoded 640 rows of column                         DAYS_FIRST_DUE with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:29:58,914 [INFO]  Concatenating numeric columns ['DAYS_LAST_DUE_1ST_VERSION'] into DAYS_LAST_DUE_1ST_VERSION_numeric
2019-05-08 10:29:58,915 [INFO]  Normalizing with StandardScaler
2019-05-08 10:29:58,920 [INFO]  Data Encoding - Encoded 640 rows of column                         DAYS_LAST_DUE_1ST_VERSION with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (640, 1)                         and then into shape (640, 1)
2019-05-08 10:29:58,926 [INFO]  C

<datawig.simple_imputer.SimpleImputer at 0x10bc21a20>

# Evaluate Performance

In [8]:
# Impute missing values and return original dataframe with predictions
predictions = imputer.predict(df_test)

predictions.head()

# Calculate f1 score for true vs predicted values
f1 = f1_score(predictions[output_column], predictions[output_column+'_imputed'], average='weighted')

# Print overall classification report
print(classification_report(predictions[output_column], predictions[output_column+'_imputed']))


2019-05-08 10:32:25,788 [INFO]  Data Encoding - Encoded 208 rows of column                         NAME_CONTRACT_TYPE with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (208, 32768)                         and then into shape (208, 32768)
2019-05-08 10:32:25,791 [INFO]  Concatenating numeric columns ['AMT_ANNUITY'] into AMT_ANNUITY_numeric
2019-05-08 10:32:25,793 [INFO]  Normalizing with StandardScaler
2019-05-08 10:32:25,796 [INFO]  Data Encoding - Encoded 208 rows of column                         AMT_ANNUITY with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (208, 1)                         and then into shape (208, 1)
2019-05-08 10:32:25,800 [INFO]  Concatenating numeric columns ['AMT_APPLICATION'] into AMT_APPLICATION_numeric
2019-05-08 10:32:25,801 [INFO]  Normalizing with StandardScaler
2019-05-08 10:32:25,807 [INFO]  Data Encoding - Encod

2019-05-08 10:32:26,052 [INFO]  Data Encoding - Encoded 208 rows of column                         NAME_SELLER_INDUSTRY with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_matrix'> of shape (208, 32768)                         and then into shape (208, 32768)
2019-05-08 10:32:26,057 [INFO]  Concatenating numeric columns ['CNT_PAYMENT'] into CNT_PAYMENT_numeric
2019-05-08 10:32:26,059 [INFO]  Normalizing with StandardScaler
2019-05-08 10:32:26,064 [INFO]  Data Encoding - Encoded 208 rows of column                         CNT_PAYMENT with <class 'datawig.column_encoders.NumericalEncoder'> into                         <class 'numpy.ndarray'> of shape (208, 1)                         and then into shape (208, 1)
2019-05-08 10:32:26,074 [INFO]  Data Encoding - Encoded 208 rows of column                         NAME_YIELD_GROUP with <class 'datawig.column_encoders.BowEncoder'> into                         <class 'scipy.sparse.csr.csr_ma

                                precision    recall  f1-score   support

             Cash Street: high       0.75      0.75      0.75         4
             Cash X-Sell: high       0.88      0.88      0.88         8
           Cash X-Sell: middle       1.00      1.00      1.00         1
   POS household with interest       0.86      0.99      0.92        76
POS household without interest       0.95      0.69      0.80        29
    POS industry with interest       0.93      0.96      0.94        26
      POS mobile with interest       0.96      0.98      0.97        49
   POS mobile without interest       0.00      0.00      0.00         2
       POS other with interest       1.00      0.40      0.57         5

                     micro avg       0.91      0.91      0.91       200
                     macro avg       0.81      0.74      0.76       200
                  weighted avg       0.90      0.91      0.90       200
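
The macro and weighted averages can be reproduced by hand from the per-class rows, which makes clear why they differ: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below uses the rounded F1 values and supports copied from the report, so its results agree with the reported 0.76 and 0.90 only up to rounding.

```python
# (f1-score, support) per class, copied from the classification report.
report = {
    "Cash Street: high": (0.75, 4),
    "Cash X-Sell: high": (0.88, 8),
    "Cash X-Sell: middle": (1.00, 1),
    "POS household with interest": (0.92, 76),
    "POS household without interest": (0.80, 29),
    "POS industry with interest": (0.94, 26),
    "POS mobile with interest": (0.97, 49),
    "POS mobile without interest": (0.00, 2),
    "POS other with interest": (0.57, 5),
}

total_support = sum(support for _, support in report.values())

# Macro average: plain mean over classes, regardless of class size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro avg f1:    {macro_f1:.3f}")
print(f"weighted avg f1: {weighted_f1:.3f}")
```

The two small classes with low F1 (`POS mobile without interest` and `POS other with interest`) pull the macro average down far more than the weighted one, because together they account for only 7 of the 200 test rows.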





In [9]:
predictions[['PRODUCT_COMBINATION', 'PRODUCT_COMBINATION_imputed', 'PRODUCT_COMBINATION_imputed_proba']]

|   | PRODUCT_COMBINATION | PRODUCT_COMBINATION_imputed | PRODUCT_COMBINATION_imputed_proba |
|---|---|---|---|
| 265076 | POS mobile with interest | POS mobile with interest | 0.995796 |
| 1529698 | POS household without interest | POS household with interest | 0.560829 |
| 788943 | POS household with interest | POS household with interest | 0.991698 |
| 69679 | POS mobile with interest | POS mobile with interest | 0.956910 |
| 420487 | POS industry with interest | POS household with interest | 0.483662 |
| 231187 | POS mobile with interest | POS mobile with interest | 0.925527 |
| 583942 | POS household with interest | POS household with interest | 0.983383 |
| 1088118 | POS household without interest | POS household without interest | 0.978934 |
| 542793 | POS household with interest | POS household with interest | 0.830658 |
| 1332567 | POS household with interest | POS household with interest | 0.922260 |
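
Because each prediction carries a probability, low-confidence imputations can be filtered out before they are written back. The following is a minimal sketch using only the ten rows shown above; the 0.8 cutoff is an arbitrary illustrative choice, not a recommended value.

```python
import pandas as pd

col = "PRODUCT_COMBINATION"

# The ten rows displayed above: (true value, imputed value, probability).
rows = [
    ("POS mobile with interest", "POS mobile with interest", 0.995796),
    ("POS household without interest", "POS household with interest", 0.560829),
    ("POS household with interest", "POS household with interest", 0.991698),
    ("POS mobile with interest", "POS mobile with interest", 0.956910),
    ("POS industry with interest", "POS household with interest", 0.483662),
    ("POS mobile with interest", "POS mobile with interest", 0.925527),
    ("POS household with interest", "POS household with interest", 0.983383),
    ("POS household without interest", "POS household without interest", 0.978934),
    ("POS household with interest", "POS household with interest", 0.830658),
    ("POS household with interest", "POS household with interest", 0.922260),
]
sample = pd.DataFrame(rows, columns=[col, col + "_imputed", col + "_imputed_proba"])

threshold = 0.8  # illustrative cutoff only
confident = sample[sample[col + "_imputed_proba"] >= threshold]

# Share of correct imputations before and after applying the cutoff.
accuracy_all = (sample[col] == sample[col + "_imputed"]).mean()
accuracy_confident = (confident[col] == confident[col + "_imputed"]).mean()
print(len(confident), accuracy_all, accuracy_confident)
```

On these ten rows the two incorrect imputations are also the two with the lowest probability, so the cutoff removes both. A sample this small does not show that the same holds in general.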


# Conclusion

The results don't look too bad on this simple example: the weighted-average F1 score is 0.90 on 200 test rows.

* Try it on another data set and see how well it works
* Practice using a larger sample size, or a different column
