# Impyte Documentation
This is a first practical attempt to clarify the usage of `impyte`. It is a collection of easily applicable and reproducible examples that you can use to simplify your data processing workflow.

## Importing and generating sample data
In order to show some of the features of the library, we'll be using Kaggle's HR data that can be found [here](https://www.kaggle.com/ludobenistant/hr-analytics).

In [46]:
# import library and data set
import impyte
import pandas as pd
from tools.data_prep import remove_random

In [47]:
data = pd.read_csv('data/hr_test.csv')

In [48]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


## Add random missing values
To demonstrate the pattern visualization and imputation methods, we need an incomplete data set. To create one, we use a helper function that deletes values at random. It is called with a fraction of `.25`; in the run shown here, this leaves 374 missing values in each of the 10 columns, so roughly a quarter of all rows end up with one missing value.

In [49]:
data = remove_random(data, .25, randomize_index=False)

In [50]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.54,3.0,153.0,2.0,0.0,0.0,0.0,technical,medium
1,,0.92,5.0,169.0,2.0,0.0,0.0,0.0,sales,medium
2,,0.62,5.0,199.0,4.0,0.0,0.0,0.0,sales,low
3,,0.63,4.0,104.0,7.0,1.0,0.0,0.0,sales,medium
4,,0.61,3.0,266.0,2.0,0.0,0.0,0.0,management,high
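
The helper `remove_random` comes from the author's `tools.data_prep` module, and its implementation is not shown in this document. The following is only a minimal sketch of how such a helper could work, written with plain pandas and NumPy for Python 3. The name `remove_random_sketch`, the `seed` parameter and the scheme of removing exactly one value per affected row are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def remove_random_sketch(df, fraction, seed=0):
    """Hypothetical stand-in for remove_random: blank out one value per chosen row."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_cols = out.shape[1]
    # Number of values to delete in every column.
    per_column = int(len(out) * fraction) // n_cols
    # Draw distinct row positions so that each affected row loses exactly one value.
    rows = rng.choice(len(out), size=per_column * n_cols, replace=False)
    for col_pos, chunk in enumerate(np.split(rows, n_cols)):
        out.iloc[chunk, col_pos] = np.nan
    return out


demo = pd.DataFrame({"a": np.arange(100.0), "b": np.arange(100.0), "c": ["x"] * 100})
holes = remove_random_sketch(demo, 0.25)
print(holes.isna().sum())
```

Under this scheme, 14,999 rows, 10 columns and a fraction of `.25` give 374 deleted values per column, which agrees with the per-column counts reported in this document. The real helper may still work differently.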


In [51]:
data.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0
mean,0.613072,0.715699,3.801231,201.070769,3.500034,0.144547,0.238359,0.02065
std,0.248873,0.171078,1.230459,49.960864,1.459104,0.351656,0.426094,0.142213
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [52]:
len(data)

14999

In [53]:
reload(impyte)

<module 'impyte' from 'impyte.pyc'>

## Testing rudimentary features
Below is a first trial run of some of the features implemented by `Impyter` and its helper classes.

## `NanChecker`
Functionality testing of `NanChecker` class.

In [54]:
nan_checker = impyte.NanChecker()

#### `NanChecker.is_nan(data, nan_vals=None, recursive=True)`
Detects missing values (NaN in numeric arrays, empty strings in string arrays). Additional values that should be treated as NaN can be passed through the `nan_vals` parameter.

In [55]:
# [True, False, False]
print nan_checker.is_nan(["", 'None', 'NaN'])

# [True, True, True]
print nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

[True, False, False]
[True, True, True]


In [56]:
# Recursive nan detection
# [True, True, False, [False, True, True]]
print nan_checker.is_nan(["", None, 'NaN', ["List Value 1", '', None]])

[True, True, False, [False, True, True]]


In [57]:
# Values can be declared as nan-values
# [True, False, False, True]
nan_checker.is_nan(['NaN', 'Empty', 'None', 'N/A'], nan_vals=['NaN', 'N/A'])

[True, False, False, True]
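
The behaviour shown in the cells above can be approximated in a few lines of plain Python. This is a sketch, not the `NanChecker` implementation: the function name `is_nan_sketch` is hypothetical, and it assumes that `None`, the empty string and float NaN always count as missing, with `nan_vals` adding further markers on top of those.

```python
import math


def is_nan_sketch(data, nan_vals=None, recursive=True):
    """Hypothetical re-implementation of the NaN check for list-like input."""
    extra = [] if nan_vals is None else list(nan_vals)

    def check(value):
        # Nested lists are checked element by element when recursive is set.
        if recursive and isinstance(value, (list, tuple)):
            return [check(item) for item in value]
        if value is None or value == "":
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        # User-supplied markers such as 'N/A' count as missing as well.
        return value in extra

    return [check(item) for item in data]


print(is_nan_sketch(["", "None", "NaN"]))
print(is_nan_sketch(["", None, "NaN", ["List Value 1", "", None]]))
print(is_nan_sketch(["NaN", "Empty", "None", "N/A"], nan_vals=["NaN", "N/A"]))
```

With these assumptions the sketch reproduces the four results printed in the cells above.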

## `Pattern`
Functionality testing of `Pattern` class. The `Pattern` class stores different patterns and data summaries regarding NaN values. 

In [58]:
reload(impyte)
pattern_log = impyte.Pattern()

#### `Pattern._check_complete_row(row)`
Determines whether a row consists only of 1s. Used when creating the NaN summary.

In [59]:
pd.DataFrame([1, 1, 1]).apply(pattern_log._check_complete_row)
#pattern_log._check_complete_row()

0   -1
dtype: int64

#### `Pattern._compute_pattern(data, nan_values="", verbose=False)`
Checks the data for missing values and computes the NaN patterns. The result contains the pattern structure, the number of data points that fall into each pattern, and the row indices per pattern. To get a readable summary table, select `["table"]` from the output.

In [60]:
pattern_dict = pattern_log._compute_pattern(data)
indices, table = pattern_dict["indices"], pattern_dict["table"]

In [61]:
table

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [62]:
indices[0][:10] # first 10 indices of pattern 0

[3740, 3741, 3742, 3743, 3744, 3745, 3746, 3747, 3748, 3749]

In [63]:
pattern_nr = 0
print "Pattern {} has {:,} rows.".format(pattern_nr, len(indices[0])) 

Pattern 0 has 11,259 rows.
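
The idea of a pattern table with per-pattern row indices can be illustrated with plain pandas. The sketch below is not the code behind `_compute_pattern()`; the name `compute_pattern_sketch` is hypothetical, and only the shape of the result (a dictionary with `"table"` and `"indices"`) follows the cells above.

```python
import numpy as np
import pandas as pd


def compute_pattern_sketch(df):
    """Group rows by which columns are present and count each combination."""
    mask = df.notna()
    groups = {}
    # Collect the row labels that share the same presence mask.
    for label, key in zip(mask.index, mask.itertuples(index=False, name=None)):
        groups.setdefault(key, []).append(label)
    # Most frequent pattern first; ties keep their order of first appearance.
    ordered = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    table = pd.DataFrame(
        [[1.0 if present else np.nan for present in key] for key, _ in ordered],
        columns=df.columns,
    )
    table["Count"] = [len(rows) for _, rows in ordered]
    indices = {number: rows for number, (_, rows) in enumerate(ordered)}
    return {"table": table, "indices": indices}


demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", None, "z"]})
result = compute_pattern_sketch(demo)
print(result["table"])
print(result["indices"][0])
```

Keeping the row labels per pattern is what makes it cheap to select or drop all rows of one pattern later on.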


#### `Pattern._is_discrete(tmpdata, unique_instances)`
Determines, based on the dtype and the number of unique instances, whether a column contains categorical or continuous values.

In [64]:
pattern_log._is_discrete(data['satisfaction_level'], unique_instances=5) # False

False

In [65]:
pattern_log._is_discrete(data['sales'], unique_instances=5) # True

True

#### `Pattern._get_discrete_and_continuous(tmpdata, unique_instances)`
Returns the column names of discrete and continuous variables. Column names are stored in lists for easy selection, and those lists are stored in one dictionary object. All continuous column names can be accessed through `['continuous']`, and all discrete column names through `['discrete']`.

In [66]:
pattern_log._get_discrete_and_continuous(data, unique_instances=5)

{'continuous': ['satisfaction_level',
  'last_evaluation',
  'number_project',
  'average_montly_hours',
  'time_spend_company'],
 'discrete': ['Work_accident',
  'left',
  'promotion_last_5years',
  'sales',
  'salary']}
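
A heuristic of this kind can be sketched with pandas alone. The rule used here is an assumption and not necessarily the one in `Pattern`: non-numeric columns are treated as discrete, and numeric columns are treated as discrete when they have at most `unique_instances` distinct values. The function names are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_discrete_sketch(column, unique_instances):
    """Assumed rule: text columns or columns with few distinct values are discrete."""
    if not is_numeric_dtype(column):
        return True
    return column.dropna().nunique() <= unique_instances


def split_discrete_continuous_sketch(df, unique_instances):
    """Sort column names into 'continuous' and 'discrete' lists."""
    result = {"continuous": [], "discrete": []}
    for name in df.columns:
        kind = "discrete" if is_discrete_sketch(df[name], unique_instances) else "continuous"
        result[kind].append(name)
    return result


demo = pd.DataFrame({
    "score": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    "flag": [0, 1, 0, 1, 0, 1, 0],
    "dept": ["it", "hr", "it", "hr", "it", "hr", "it"],
})
groups = split_discrete_continuous_sketch(demo, unique_instances=5)
print(groups)
```

The threshold matters: a count column with six distinct values is continuous for `unique_instances=5` but would become discrete with a larger threshold.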

#### `Pattern._get_missing_value_percentage(data, importance_filter=False)`
Returns a table with the number of complete and missing values per column and the resulting percentage of missing values.

In [67]:
pattern_log._get_missing_value_percentage(data)

Unnamed: 0,Complete,Missing,Percentage
satisfaction_level,14625,374,2.49 %
last_evaluation,14625,374,2.49 %
number_project,14625,374,2.49 %
average_montly_hours,14625,374,2.49 %
time_spend_company,14625,374,2.49 %
Work_accident,14625,374,2.49 %
left,14625,374,2.49 %
promotion_last_5years,14625,374,2.49 %
sales,14625,374,2.49 %
salary,14625,374,2.49 %
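
The table above can be reproduced in spirit with a few pandas calls. The sketch below assumes the percentage is the number of missing values divided by the total number of rows, which is consistent with the figures shown (374 of 14,999 rows is 2.49 %). The function name `missing_summary_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary_sketch(df, importance_filter=False):
    """Per-column counts of complete and missing values plus a percentage."""
    missing = df.isna().sum()
    summary = pd.DataFrame({"Complete": len(df) - missing, "Missing": missing})
    # Share of missing values relative to all rows, formatted like the table above.
    summary["Percentage"] = (100.0 * missing / len(df)).map("{:.2f} %".format)
    if importance_filter:
        # Keep only the columns that actually contain missing values.
        summary = summary[summary["Missing"] > 0]
    return summary


demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
full = missing_summary_sketch(demo)
filtered = missing_summary_sketch(demo, importance_filter=True)
print(full)
print(filtered)
```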


#### `Pattern.get_pattern()`
Returns the stored NaN patterns if they have already been computed. Otherwise, it triggers a new computation through `_compute_pattern()` and returns the resulting pattern table.

In [68]:
pattern_log.get_pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
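
The compute-once, return-afterwards behaviour described for `get_pattern()` is a plain caching pattern. The class below is a hypothetical illustration with pandas, not the `Pattern` class itself. Unlike the table above, it marks present values with `True` and missing ones with `False`.

```python
import numpy as np
import pandas as pd


class LazyPatternSketch:
    """Hypothetical holder that computes the pattern table only on first access."""

    def __init__(self, data):
        self.data = data
        self._table = None
        self.computations = 0  # counts how often the table was actually built

    def get_pattern(self):
        if self._table is None:
            self.computations += 1
            mask = self.data.notna()
            # One row per distinct presence mask, most frequent first.
            self._table = mask.value_counts().reset_index(name="Count")
        return self._table


demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})
lazy = LazyPatternSketch(demo)
first = lazy.get_pattern()
second = lazy.get_pattern()
print(first)
print(lazy.computations)
```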


#### `Pattern.row_nan_pattern(row)`
This is a core piece in determining which patterns are present. Other methods work with the output of `row_nan_pattern()` to determine unique pattern structures and count them before turning them into readable tables.

In [69]:
# (1, 1, 'NaN', 'NaN')
print pattern_log.row_nan_pattern(['Value 1', 'Value 2', '', None])

(1, 1, 'NaN', 'NaN')
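
To illustrate how per-row tuples lead to pattern counts, here is a small sketch in plain Python. `row_nan_pattern_sketch` is a hypothetical stand-in that treats `None`, empty strings and float NaN as missing; counting with `collections.Counter` is one possible way to do the aggregation, not necessarily the one used in `Pattern`.

```python
import math
from collections import Counter


def row_nan_pattern_sketch(row):
    """Map a row to a hashable tuple: 1 for present values, 'NaN' for missing ones."""
    def missing(value):
        return value is None or value == "" or (isinstance(value, float) and math.isnan(value))
    return tuple("NaN" if missing(value) else 1 for value in row)


rows = [
    ["Value 1", "Value 2", "", None],
    ["a", "b", "c", "d"],
    ["e", "f", "", None],
]
# Tuples are hashable, so identical patterns can be counted directly.
pattern_counts = Counter(row_nan_pattern_sketch(row) for row in rows)
print(pattern_counts.most_common())
```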


#### `Pattern.get_pattern_indices(pattern_no)`
Returns the row indices that belong to the pattern with the given number.

In [70]:
pattern_log.get_pattern_indices(0)[:10] # get first 10 indices of pattern 0

[3740, 3741, 3742, 3743, 3744, 3745, 3746, 3747, 3748, 3749]

#### `Pattern.get_continuous()`
Returns list with names of all continuous variables.

In [71]:
pattern_log.get_continuous()

['satisfaction_level', 'last_evaluation', 'average_montly_hours']

#### `Pattern.get_discrete()`
Returns list with names of all discrete variables.

In [72]:
pattern_log.get_discrete()

['number_project',
 'time_spend_company',
 'Work_accident',
 'left',
 'promotion_last_5years',
 'sales',
 'salary']

## `Impyter`

In [73]:
reload(impyte)
imp = impyte.Impyter() # instantiate impyte class

### Load data into imputer

#### `Impyter.load_data()`
Requires a pandas DataFrame to load; any other input is transformed into a DataFrame. While loading, the data is copied into the object to avoid consistency issues with the original data set.

In [74]:
imp.load_data(data)

Alternatively, a DataFrame can be passed when instantiating the `Impyter` object.

In [75]:
imp = impyte.Impyter(data)
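
Why copying on load matters can be shown with pandas alone. The function `load_data_sketch` below is a hypothetical stand-in for the loading step described above, not the library's code.

```python
import pandas as pd


def load_data_sketch(data):
    """Return an independent DataFrame, converting other inputs if necessary."""
    if isinstance(data, pd.DataFrame):
        # A deep copy keeps later edits away from the caller's DataFrame.
        return data.copy()
    return pd.DataFrame(data)


original = pd.DataFrame({"a": [1.0, 2.0]})
loaded = load_data_sketch(original)
loaded.loc[0, "a"] = 99.0  # only the loaded copy changes

from_records = load_data_sketch([{"a": 1}, {"a": 2}])
print(original)
print(from_records)
```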

#### `Impyter.get_data()`
Returns the loaded data for quick reference.

In [76]:
imp.get_data().head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.54,3.0,153.0,2.0,0.0,0.0,0.0,technical,medium
1,,0.92,5.0,169.0,2.0,0.0,0.0,0.0,sales,medium
2,,0.62,5.0,199.0,4.0,0.0,0.0,0.0,sales,low
3,,0.63,4.0,104.0,7.0,1.0,0.0,0.0,sales,medium
4,,0.61,3.0,266.0,2.0,0.0,0.0,0.0,management,high


#### `Impyter.pattern()`
Leverages the `Pattern._compute_pattern()` and `Pattern.get_pattern()` methods to compute and return an overview of all existing NaN patterns in the data set. The overview shows `NaN` in a column where a data point is missing and `1` for all complete slots. The `Count` column on the right-hand side indicates how often each pattern was found. The patterns are sorted by count, so it is not guaranteed that pattern 0 is the pattern with only complete cases.

In [77]:
reload(impyte)
imp = impyte.Impyter(data) # instantiate impyte class
nan_checker = impyte.NanChecker()
pattern_log = impyte.Pattern()
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


#### `Impyter.drop_pattern()`
Drops pattern from data set.

In [78]:
imp.drop_pattern(4).head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.54,3.0,153.0,2.0,0.0,0.0,0.0,technical,medium
1,,0.92,5.0,169.0,2.0,0.0,0.0,0.0,sales,medium
2,,0.62,5.0,199.0,4.0,0.0,0.0,0.0,sales,low
3,,0.63,4.0,104.0,7.0,1.0,0.0,0.0,sales,medium
4,,0.61,3.0,266.0,2.0,0.0,0.0,0.0,management,high


In [79]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


If the `inplace` flag is set to `True`, the changes are applied to the data set stored in the `Impyter` object. Otherwise, a copy without the dropped pattern is returned and the stored data set stays intact.

In [80]:
_ = imp.drop_pattern(4, inplace=True)

In [81]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1,374
10,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1,374
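
The difference between the two modes can be sketched with a small, hypothetical class. It identifies a single-NaN pattern by the column that is missing rather than by a pattern number, which is a simplification for illustration.

```python
import numpy as np
import pandas as pd


class StoreSketch:
    """Hypothetical container that mimics the inplace / copy behaviour."""

    def __init__(self, data):
        self.data = data.copy()

    def drop_missing_in(self, column, inplace=False):
        """Drop rows whose only missing value sits in `column`."""
        mask = self.data.isna()
        hit = mask[column] & (mask.sum(axis=1) == 1)
        reduced = self.data[~hit]
        if inplace:
            # Only now is the stored data set replaced.
            self.data = reduced
        return reduced


demo = pd.DataFrame({"a": [np.nan, 2.0, 3.0, np.nan], "b": [1.0, np.nan, 3.0, np.nan]})
store = StoreSketch(demo)
preview = store.drop_missing_in("a")
rows_after_preview = len(store.data)
store.drop_missing_in("a", inplace=True)
rows_after_inplace = len(store.data)
print(rows_after_preview, rows_after_inplace)
```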


#### `Impyter.get_missing_summary()`
Returns table with information on missing values per column and its percentage.

In [82]:
imp.get_missing_summary()

Unnamed: 0,Complete,Missing,Percentage
satisfaction_level,14251,374,2.56 %
last_evaluation,14251,374,2.56 %
number_project,14251,374,2.56 %
average_montly_hours,14251,374,2.56 %
time_spend_company,14251,374,2.56 %
Work_accident,14251,374,2.56 %
left,14251,374,2.56 %
promotion_last_5years,14251,374,2.56 %
sales,14251,374,2.56 %


Setting the `importance_filter` flag to `True` shows only columns that have missing values. This is helpful for data sets with a large number of variables and only a few NaN values.

In [83]:
for pattern_no in range(1,4): #  drop patterns 1 to 3
    imp.drop_pattern(pattern_no, inplace=True)

In [84]:
imp.get_missing_summary(importance_filter=True)

Unnamed: 0,Complete,Missing,Percentage
satisfaction_level,13129,374,2.77 %
last_evaluation,13129,374,2.77 %
time_spend_company,13129,374,2.77 %
Work_accident,13129,374,2.77 %
left,13129,374,2.77 %
promotion_last_5years,13129,374,2.77 %


#### `Impyter.one_hot_encode()`
Relies on `pandas.get_dummies()` method to transform categorical values into one-hot-encoded values.

In [85]:
reload(impyte)
imputer = impyte.Impyter()
imputer.load_data(data)
_ = imputer.pattern()

In [86]:
ohe_data = imputer.one_hot_encode(data)
ohe_data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales_ohe_IT,sales_ohe_RandD,sales_ohe_accounting,sales_ohe_hr,sales_ohe_management,sales_ohe_marketing,sales_ohe_product_mng,sales_ohe_sales,sales_ohe_support,sales_ohe_technical,salary_ohe_high,salary_ohe_low,salary_ohe_medium
0,,0.54,3.0,153.0,2.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,,0.92,5.0,169.0,2.0,0.0,0.0,0.0,0,0,0,0,0,0,0,1,0,0,0,0,1
2,,0.62,5.0,199.0,4.0,0.0,0.0,0.0,0,0,0,0,0,0,0,1,0,0,0,1,0
3,,0.63,4.0,104.0,7.0,1.0,0.0,0.0,0,0,0,0,0,0,0,1,0,0,0,0,1
4,,0.61,3.0,266.0,2.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,1,0,0


#### `Impyter.one_hot_decode()`
The inverse of `Impyter.one_hot_encode()`. Transforms one-hot-encoded columns back to categorical values.

In [87]:
imputer.one_hot_decode(ohe_data).head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.54,3.0,153.0,2.0,0.0,0.0,0.0,technical,medium
1,,0.92,5.0,169.0,2.0,0.0,0.0,0.0,sales,medium
2,,0.62,5.0,199.0,4.0,0.0,0.0,0.0,sales,low
3,,0.63,4.0,104.0,7.0,1.0,0.0,0.0,sales,medium
4,,0.61,3.0,266.0,2.0,0.0,0.0,0.0,management,high
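
The round trip can be sketched directly with `pandas.get_dummies()`. The `_ohe_` separator is taken from the column names shown above; the function names, the choice to encode every non-numeric column, and the rule that an all-zero row decodes to a missing value are assumptions of this sketch.

```python
import pandas as pd


def one_hot_encode_sketch(df, sep="_ohe_"):
    """Encode all non-numeric columns with pandas.get_dummies."""
    categorical = list(df.select_dtypes(exclude="number").columns)
    return pd.get_dummies(df, columns=categorical, prefix_sep=sep, dtype=int)


def one_hot_decode_sketch(encoded, sep="_ohe_"):
    """Collapse groups of dummy columns back into one categorical column each."""
    out = encoded.copy()
    prefixes = []
    for name in encoded.columns:
        if sep in name and name.split(sep)[0] not in prefixes:
            prefixes.append(name.split(sep)[0])
    for prefix in prefixes:
        dummy_cols = [c for c in encoded.columns if c.startswith(prefix + sep)]
        block = encoded[dummy_cols]
        # The column holding the 1 names the category; strip the prefix again.
        labels = block.idxmax(axis=1).str[len(prefix + sep):]
        # Rows without any 1 had no category, so they decode to a missing value.
        labels = labels.where(block.sum(axis=1) > 0)
        out = out.drop(columns=dummy_cols)
        out[prefix] = labels
    return out


demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "dept": ["it", "hr", None]})
encoded = one_hot_encode_sketch(demo)
decoded = one_hot_decode_sketch(encoded)
print(encoded)
print(decoded)
```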


#### `Impyter.impute()`
`impute()` is the core method of impyte. It works out of the box and uses Random Forest estimators by default to impute missing values. It automatically performs cross-validation to indicate the potential accuracy of the imputation. Classifiers (binary and multi-class) are scored with `f1_macro`, regression models with `r2`.

In [88]:
reload(impyte)
imputer = impyte.Impyter()
imputer.load_data(data)
_ = imputer.pattern()
complete_df = imputer.impute(estimator='rf', accuracy=[.5, .7])

Scoring Threshold             Classification                Regression                    
                              0.5                           0.7                           

Pattern: Label                Score                         Estimator                     
1: sales                      0.334 (f1_macro)              RandomForestClassifier          dropped...
2: average_montly_hours       0.337 (r2)                    RandomForestRegressor           dropped...
3: number_project             0.567 (f1_macro)              RandomForestClassifier          filled...
4: salary                     0.515 (f1_macro)              RandomForestClassifier          filled...
5: promotion_last_5years      0.758 (f1_macro)              RandomForestClassifier          filled...
6: time_spend_company         0.570 (f1_macro)              RandomForestClassifier          filled...
7: satisfaction_level         0.466 (r2)                    RandomForestRegressor           dropped...
8: last_e
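
The principle behind a thresholded, cross-validated imputation can be sketched with scikit-learn for a single numeric target. This is not the `impute()` implementation. The function name is hypothetical, all feature columns are assumed to be numeric and complete, and a score below the threshold is assumed to mean that the missing values are left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def impute_column_sketch(df, target, threshold=None, cv=5):
    """Fill missing values of `target` if the cross-validated r2 is good enough."""
    features = [c for c in df.columns if c != target]
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    # Estimate the quality on rows where the target is known.
    score = cross_val_score(model, known[features], known[target], cv=cv, scoring="r2").mean()
    result = df.copy()
    if threshold is not None and score < threshold:
        return result, score  # below threshold: skip the imputation
    model.fit(known[features], known[target])
    result.loc[unknown.index, target] = model.predict(unknown[features])
    return result, score


rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.uniform(0, 10, 200), "x2": rng.uniform(0, 10, 200)})
demo["y"] = 2.0 * demo["x1"] - demo["x2"]
demo.loc[:19, "y"] = np.nan  # blank out the first 20 target values

filled, score = impute_column_sketch(demo, "y")
untouched, _ = impute_column_sketch(demo, "y", threshold=1.1)  # r2 can never reach 1.1
print(round(score, 3), filled["y"].isna().sum(), untouched["y"].isna().sum())
```

In the call shown above, `accuracy=[.5, .7]` sets separate thresholds for classification and regression; a classifier variant of this sketch would use `scoring="f1_macro"` instead of `"r2"`.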

### Investigate completed data set

In [89]:
imp = impyte.Impyter(complete_df)

In [90]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1,1.0,1,1,1,1,1.0,1,13503
1,,1.0,1,1.0,1,1,1,1,1.0,1,374
2,1.0,,1,1.0,1,1,1,1,1.0,1,374
3,1.0,1.0,1,,1,1,1,1,1.0,1,374
4,1.0,1.0,1,1.0,1,1,1,1,,1,374


#### `Impyter.get_result()`
Returns the resulting data set once imputation has been performed.

In [91]:
imputer.get_result().head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.54,3.0,153.0,2.0,0.0,0.0,0.0,technical,medium
1,,0.92,5.0,169.0,2.0,0.0,0.0,0.0,sales,medium
2,,0.62,5.0,199.0,4.0,0.0,0.0,0.0,sales,low
3,,0.63,4.0,104.0,7.0,1.0,0.0,0.0,sales,medium
4,,0.61,3.0,266.0,2.0,0.0,0.0,0.0,management,high


### Comparison of complete indices methods

In [92]:
%timeit len(imputer.pattern_log.get_complete_indices())

1000 loops, best of 3: 1.57 ms per loop


In [93]:
%timeit len(imputer.get_complete_old())

100 loops, best of 3: 8.3 ms per loop


In [94]:
_ = imputer.drop_pattern(7, inplace=True)

In [95]:
data.iloc[:, imputer.pattern_log.store_tuple_columns[(1, 1, 1, 1, 1, 1, 1, 1, 1, 'NaN')]].columns

Index([u'salary'], dtype='object')

In [96]:
data.tail()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
14994,0.43,0.55,2.0,129.0,3.0,0.0,1.0,0.0,sales,low
14995,0.9,0.93,4.0,263.0,3.0,1.0,0.0,0.0,support,medium
14996,0.64,0.97,4.0,268.0,2.0,0.0,0.0,0.0,technical,low
14997,0.77,0.54,5.0,252.0,2.0,0.0,0.0,0.0,IT,low
14998,0.62,0.73,3.0,245.0,4.0,0.0,0.0,0.0,IT,low


### Accessing final models

In [97]:
mdl = imputer.get_model(2) # returns model for pattern 2

In [98]:
mdl.get_feature_name()

'average_montly_hours'

In [99]:
mdl.get_model()

[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)]

### Scaling for preprocessing

In [100]:
from sklearn.preprocessing import StandardScaler

In [101]:
scaler = StandardScaler()

In [102]:
test = data.dropna()

In [103]:
len(test[test.apply(nan_checker.is_nan)])

11259

In [104]:
test.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,11259.0,11259.0,11259.0,11259.0,11259.0,11259.0,11259.0,11259.0
mean,0.611434,0.715921,3.791811,200.558487,3.494715,0.145572,0.238565,0.020339
std,0.248173,0.171192,1.228547,49.811842,1.456982,0.352693,0.426225,0.141164
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,199.0,3.0,0.0,0.0,0.0
75%,0.81,0.87,5.0,244.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [105]:
test_scaled = scaler.fit_transform(test[test.corr().columns])

In [106]:
test[:5]

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
3740,0.1,0.82,6.0,244.0,4.0,0.0,1.0,0.0,technical,medium
3741,0.48,0.53,3.0,211.0,4.0,0.0,0.0,0.0,technical,low
3742,0.68,0.8,2.0,254.0,2.0,1.0,0.0,0.0,sales,low
3743,0.84,0.87,4.0,246.0,6.0,0.0,1.0,0.0,hr,low
3744,0.63,0.5,4.0,167.0,3.0,1.0,0.0,0.0,technical,medium


In [107]:
pd.DataFrame(test_scaled[:5])

Unnamed: 0,0,1,2,3,4,5,6,7
0,-2.060893,0.607995,1.797479,0.872151,0.346818,-0.412764,1.786543,-0.144089
1,-0.529632,-1.086083,-0.644539,0.209628,0.346818,-0.412764,-0.55974,-0.144089
2,0.276294,0.491162,-1.458545,1.072915,-1.025944,2.422691,-0.55974,-0.144089
3,0.921035,0.900078,0.169467,0.912304,1.719579,-0.412764,1.786543,-0.144089
4,0.074813,-1.261332,0.169467,-0.673735,-0.339563,2.422691,-0.55974,-0.144089


In [108]:
test_rescaled = scaler.inverse_transform(test_scaled)

In [109]:
pd.DataFrame(test_rescaled[:5])

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.1,0.82,6.0,244.0,4.0,0.0,1.0,0.0
1,0.48,0.53,3.0,211.0,4.0,0.0,0.0,0.0
2,0.68,0.8,2.0,254.0,2.0,1.0,0.0,0.0
3,0.84,0.87,4.0,246.0,6.0,0.0,1.0,0.0
4,0.63,0.5,4.0,167.0,3.0,1.0,0.0,0.0


In [115]:
reload(impyte)
testdf = pd.read_csv('data/masterdf_201710230_andirs.csv', low_memory=False)
imp = impyte.Impyter()
_ = imp.load_data(testdf)
imp.pattern()

KeyboardInterrupt: 

In [111]:
imp.get_missing_summary()

Unnamed: 0,Complete,Missing,Percentage
Incident_Year,24791,170517,87.31 %
Incident Date,24791,170517,87.31 %
Incident_Cat,24767,170541,87.32 %


In [112]:
imp.get_missing_summary(importance_filter=True)

Unnamed: 0,Complete,Missing,Percentage
Incident_Year,24791,170517,87.31 %
Incident Date,24791,170517,87.31 %
Incident_Cat,24767,170541,87.32 %


### Timing some of the functions
Below is some information on the efficiency of the functions. It doesn't reflect the complexity of the functions, but it gives a runtime estimate for your local machine.

In [113]:
%timeit nan_checker.is_nan(["", 'None', 'NaN'])
%timeit nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

The slowest run took 11.63 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.44 µs per loop
100000 loops, best of 3: 7.12 µs per loop
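
Outside of IPython, where the `%timeit` magic is not available, a comparable measurement can be taken with the standard-library `timeit` module. The function `is_empty_marker` below is a hypothetical stand-in for the method being timed.

```python
import timeit


def is_empty_marker(values):
    """Stand-in workload: flag None and empty strings."""
    return [value is None or value == "" for value in values]


loops = 10000
# Three independent runs of `loops` calls each; the minimum is the least disturbed one.
runs = timeit.repeat(lambda: is_empty_marker(["", "None", "NaN"]), number=loops, repeat=3)
best_microseconds = min(runs) / loops * 1e6
print("{} loops, best of 3: {:.2f} microseconds per loop".format(loops, best_microseconds))
```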


### Multi-Nans
Predicting values for rows with multiple NaNs is a last-resort option. It might be suitable for certain edge cases, but if the scores are low, consider dropping the feature or the data points instead.

In [117]:
multi_data = data.copy()
import numpy as np
for i in range(0, 100):
    multi_data.at[i, "last_evaluation"] = np.nan
reload(impyte)
imp = impyte.Impyter(multi_data)
_ = imp.pattern()
imp.pattern_log.get_multi_nan_pattern_nos()
imp.pattern_log.get_column_name(11)
res = imp.impute(estimator="rf", multi_nans=True)

Scoring Threshold             Classification                Regression                    
                              None                          None                          

Pattern: Label                Score                         Estimator                     
1: sales                      0.334 (f1_macro)              RandomForestClassifier          filled...
2: average_montly_hours       0.334 (r2)                    RandomForestRegressor           filled...
3: number_project             0.571 (f1_macro)              RandomForestClassifier          filled...
4: salary                     0.518 (f1_macro)              RandomForestClassifier          filled...
5: promotion_last_5years      0.760 (f1_macro)              RandomForestClassifier          filled...
6: time_spend_company         0.568 (f1_macro)              RandomForestClassifier          filled...
7: last_evaluation            0.294 (r2)                    RandomForestRegressor           filled...
8: Work_acci
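
Before deciding whether multi-NaN imputation is worth it, it helps to see how many rows are affected and which columns are missing together. The following pandas sketch does that; `multi_nan_summary_sketch` is a hypothetical helper and not part of impyte.

```python
from collections import Counter

import numpy as np
import pandas as pd


def multi_nan_summary_sketch(df):
    """Return row labels with more than one NaN and a count per column combination."""
    mask = df.isna()
    multi = mask[mask.sum(axis=1) > 1]
    # Each affected row is described by the tuple of columns that are missing.
    combos = Counter(tuple(mask.columns[row]) for row in multi.to_numpy())
    return multi.index.tolist(), combos


demo = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [1.0, np.nan, 3.0, np.nan],
})
rows, combos = multi_nan_summary_sketch(demo)
print(rows)
print(combos)
```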

In [118]:
model = imp.get_model(11)

In [119]:
model.feature_name[0]

'satisfaction_level'

In [120]:
model.estimator_name[0]

'RandomForestRegressor'

In [121]:
model.scoring[0]

'r2'

In [122]:
model.accuracy[0]

array([ 0.42704134,  0.42658864,  0.41218536,  0.40748348,  0.42532636])

In [123]:
model.model[0]

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [124]:
imp2 = impyte.Impyter(res)
imp2.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1,1,1,1,1,1,1,1,1,1,14999


In [125]:
imp2.get_data()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.256000,0.726000,3.0,153.0,2.0,0.0,0.0,0.0,technical,medium
1,0.390000,0.777000,5.0,169.0,2.0,0.0,0.0,0.0,sales,medium
2,0.390000,0.809000,5.0,199.0,4.0,0.0,0.0,0.0,sales,low
3,0.390000,0.730000,4.0,104.0,7.0,1.0,0.0,0.0,sales,medium
4,0.223000,0.791000,3.0,266.0,2.0,0.0,0.0,0.0,management,high
5,0.334000,0.756000,5.0,217.0,6.0,0.0,0.0,0.0,accounting,medium
6,0.390000,0.809000,5.0,266.0,6.0,0.0,0.0,0.0,sales,low
7,0.256000,0.727000,4.0,271.0,4.0,0.0,0.0,0.0,technical,low
8,0.226000,0.741000,5.0,191.0,2.0,1.0,0.0,0.0,technical,high
9,0.274000,0.714000,5.0,236.0,4.0,1.0,0.0,0.0,support,high


In [126]:
imp.save_model("testmodel.pkl")

## Ensemble
`Impyter.ensemble()` iterates over several estimators and imputes with each of them. So far, it only prints the results for easy comparison.

In [45]:
imp.ensemble(estimator_list=["rf", "dt"])

Scoring Threshold             Classification                Regression                    
                              None                          None                          

Pattern: Label                Score                         Estimator                     
1: sales                      0.332 (f1_macro)              RandomForestClassifier          filled...
2: average_montly_hours       0.324 (r2)                    RandomForestRegressor           filled...
3: number_project             0.571 (f1_macro)              RandomForestClassifier          filled...
4: salary                     0.521 (f1_macro)              RandomForestClassifier          filled...
5: promotion_last_5years      0.741 (f1_macro)              RandomForestClassifier          filled...
6: time_spend_company         0.578 (f1_macro)              RandomForestClassifier          filled...
7: last_evaluation            0.274 (r2)                    RandomForestRegressor           filled...
8: Work_acci
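
Comparing several estimators on the same target can be sketched with scikit-learn as follows. The mapping of `"rf"` to a random forest follows the output shown in this document; the mapping of `"dt"` to a decision tree, the function name and the synthetic data are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed mapping from short names to estimator classes.
ESTIMATORS = {"rf": RandomForestClassifier, "dt": DecisionTreeClassifier}


def compare_estimators_sketch(X, y, estimator_list, cv=5):
    """Cross-validate each estimator and collect the mean f1_macro score."""
    scores = {}
    for key in estimator_list:
        model = ESTIMATORS[key](random_state=0)
        scores[key] = cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    return scores


rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] > 5).astype(int)  # simple synthetic label

scores = compare_estimators_sketch(X, y, ["rf", "dt"])
for name, value in scores.items():
    print("{}: {:.3f} (f1_macro)".format(name, value))
```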