# Exercise: Imputing Categorical Data Using Impyute

KNN (k-nearest neighbors) is an algorithm that matches a point with its k closest neighbors in a multi-dimensional space. It is useful for dealing with different types of missing data.

If you have a small data set with few dimensions, KNN imputation is worth trying.

The Impyute package provides an implementation called [fast_knn](https://impyute.readthedocs.io/en/master/_modules/impyute/imputation/cs/fast_knn.html#fast_knn), which is intended to be faster than fit+transform for each subset.
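
To make the idea concrete, here is a minimal sketch of KNN imputation for a single missing categorical value, written in plain Python so the mechanics stay visible. It is not Impyute's implementation: the function name `knn_impute_label`, the toy rows, and the labels are all made up for illustration, and the sketch assumes the other features of each row are numeric and complete.

```python
from collections import Counter
from math import dist


def knn_impute_label(rows, labels, query, k=3):
    """Return the most common label among the k rows closest to `query`."""
    # rank the complete rows by Euclidean distance to the query point
    ranked = sorted(range(len(rows)), key=lambda i: dist(rows[i], query))
    nearest = [labels[i] for i in ranked[:k]]
    # majority vote among the k nearest neighbours
    return Counter(nearest).most_common(1)[0][0]


# toy data (hypothetical): two numeric features per row and a known label
rows = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8)]
labels = ["cash", "cash", "cash", "pos", "pos"]

# a row whose label is missing; fill it from its 3 nearest neighbours
imputed = knn_impute_label(rows, labels, query=(4.9, 5.2), k=3)
print(imputed)
```

Because every candidate row has to be compared with the query, this brute-force approach only scales to small, low-dimensional data, which is the situation described above.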
    
    
Try it out with the Home Credit data set and compare the result to [DataWig](https://datawig.readthedocs.io/en/latest/source/userguide.html#introduction-to-imputer) or [MICE](https://www.statsmodels.org/stable/imputation.html).
  

# Python Imports

In [9]:
%config IPCompleter.greedy=True
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer
pd.set_option('display.max_columns', 125)
import quilt
from scripts.preprocess import percent_missing, as_dict
from string import Template
import missingno as msno
import impyute
import datawig
from sklearn.metrics import f1_score, classification_report

In [2]:
# load data from local repository, once per session
from quilt.data.avare import homecredit

# Validate Data Types 

We manually assign a type to each variable and override the data types that pandas inferred for the dataframe.

In [3]:
description = pd.read_csv('data/new_data_description_file.csv')
description.head()

| Unnamed: 0 | Row | Table | Type |
|---|---|---|---|
| 0 | SK_ID_PREV | POS_CASH_balance | object |
| 1 | SK_ID_CURR | POS_CASH_balance | object |
| 2 | MONTHS_BALANCE | POS_CASH_balance | float64 |
| 3 | CNT_INSTALMENT | POS_CASH_balance | float64 |
| 4 | CNT_INSTALMENT_FUTURE | POS_CASH_balance | float64 |


## Override Inferred Data Types (Single Table)

In [6]:
table = 'previous_application'
df = homecredit[table]()

python_cat_dtype = 'object'
python_num_dtype = 'float64'
  
condtable = description.Table == table
condcat = description.Type == python_cat_dtype
condnum = description.Type == python_num_dtype
        
catcols = description.loc[(condtable & condcat),'Row'].values.tolist()
numcols = description.loc[(condtable & condnum),'Row'].values.tolist()
    
df[catcols] = df[catcols].astype(python_cat_dtype) 
df[numcols] = df[numcols].astype(python_num_dtype)

## Select Subset for Analysis

We will create a subset __without__ nulls, then split the data into a training and test set.

The test set will allow us to measure the performance.

In [7]:
# drop the (near-)empty rate columns and the ID columns
dropcols = ['RATE_INTEREST_PRIVILEGED','RATE_INTEREST_PRIMARY','SK_ID_PREV', 'SK_ID_CURR']
df.drop(dropcols, axis=1, inplace=True)

# drop rows containing null 
df.dropna(axis=0, how='any', inplace=True)

# select random instances
seed = 500
numinstances = 1000
df = df.sample(numinstances,random_state=seed)
#df.info(verbose=True)

# Strategy to Encode Categoricals

When validating the data types, go beyond the categorical/numerical distinction so you can choose the appropriate strategy to encode the data.

* [Rule of Thumb for Validating Data Types](https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689) 

* [Category Encoders Package](http://contrib.scikit-learn.org/categorical-encoding/index.html)

__Algorithm Requirements:__ When working with DataWig, if a nominal variable consists of a sequence of integers, it will be interpreted as numeric. In our data set, the following variables are affected:

* HOUR_APPR_PROCESS_START : hours, __ordinal__  (but we encode it as numeric)
* NFLAG_LAST_APPL_IN_DAY, NFLAG_INSURED_ON_APPROVAL : (0,1) __binary__ 
* SELLERPLACE_AREA : 4-digit code for a location, __nominal__

__Categorical Encoding Strategy:__ For this task and algorithm we create a simple user-defined encoding for the binary and nominal variables listed above: each integer code is mapped to a string by adding a prefix, so that the values are read as plain text.

In [8]:
# User-defined Categorical Encoding
prefix = 's_'
df['NFLAG_LAST_APPL_IN_DAY'] =  prefix + df['NFLAG_LAST_APPL_IN_DAY'].astype(str) 
df['SELLERPLACE_AREA'] = prefix + df['SELLERPLACE_AREA'].astype(str) 
df['NFLAG_INSURED_ON_APPROVAL'] = prefix +  df['NFLAG_INSURED_ON_APPROVAL'].astype(str) 
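
The effect of the prefix encoding can be checked on a small example. The sketch below uses hypothetical toy series (`flags`, `float_flags`) rather than the Home Credit columns. It shows that integer codes count as numeric until the prefix is added, and that a column stored as float keeps its decimal point in the resulting label.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# hypothetical toy column standing in for an integer-coded flag
flags = pd.Series([0, 1, 1, 0], name="toy_flag")
print(is_numeric_dtype(flags))  # integer codes are treated as numeric

# same pattern as the cell above: prefix + column cast to string
encoded = "s_" + flags.astype(str)
print(encoded.tolist())
print(is_numeric_dtype(encoded))  # the prefixed values are no longer numeric

# if the column is stored as float, the decimal point ends up in the label
float_flags = pd.Series([0.0, 1.0])
float_encoded = "s_" + float_flags.astype(str)
print(float_encoded.tolist())
```

If labels without the decimal part are preferred, one option is to cast such a column to an integer type before converting it to strings, assuming it contains no nulls.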

# Train Model

In [12]:
#%%time
# select a portion of the data for evaluation
df_impute = df.copy(deep=True)

# insert random null in a selected column


# convert dataframe to array


# impute

k = 3
#predicted = fast_knn(data, k=k)
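
One possible way to fill in the first two placeholder steps is sketched below on a hypothetical toy frame (`toy`) standing in for `df_impute`. It hides a random subset of one categorical column, keeps the true values for later evaluation, and converts the frame to a float array. The column names, the number of masked cells, and the use of integer category codes are all assumptions of this sketch, not part of the exercise.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df_impute: one categorical and one numeric column
toy = pd.DataFrame({
    "contract": ["cash", "pos", "cash", "pos", "cash", "pos"],
    "amount": [1.0, 5.0, 1.2, 5.1, 0.9, 4.8],
})
target = "contract"

# remember the true values, then blank out a random subset of the target
rng = np.random.default_rng(500)
masked_idx = rng.choice(toy.index, size=2, replace=False)
truth = toy.loc[masked_idx, target].copy()
toy_missing = toy.copy(deep=True)
toy_missing.loc[masked_idx, target] = np.nan

# map categories to integer codes so the frame can become a float array;
# pandas uses the code -1 for missing values, which is turned back into NaN
codes = toy_missing[target].astype("category").cat.codes.astype("float64")
codes[codes < 0] = np.nan
data = np.column_stack([codes.to_numpy(), toy_missing["amount"].to_numpy()])

print(np.isnan(data[:, 0]).sum())
```

An array like `data` could then be passed to `fast_knn(data, k=k)`, which the linked module path places under `impyute.imputation.cs`. Two caveats apply: integer codes impose an artificial order on nominal categories, and the imputed values may not be whole numbers, so they would need to be rounded back to a valid code before being mapped to labels.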

# Evaluate Performance
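
Since the imports at the top already bring in `f1_score` and `classification_report`, one way to score the imputation is to compare the imputed labels against the true values that were hidden. The sketch below uses hypothetical label lists (`y_true`, `y_pred`) and not results from the Home Credit data. The weighted average is just one reasonable choice for multi-class labels.

```python
from sklearn.metrics import f1_score, classification_report

# hypothetical true and imputed labels for the masked cells
y_true = ["cash", "pos", "cash", "pos", "cash"]
y_pred = ["cash", "pos", "pos", "pos", "cash"]

# per-class F1 averaged with weights proportional to class frequency
score = f1_score(y_true, y_pred, average="weighted")
print(round(score, 3))

# per-class precision, recall and F1 in one table
print(classification_report(y_true, y_pred))
```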

# Conclusion
