# Impyte Documentation
This is a first practical attempt to clarify the usage of `impyte`. It's a collection of easily applicable and reproducible examples that you can use to simplify your data processing workflow.

## Importing and generating sample data
In order to show some of the features of the library, we'll be using Kaggle's HR data that can be found [here](https://www.kaggle.com/ludobenistant/hr-analytics).

In [1]:
# import library and data set
from importlib import reload
import impyte
reload(impyte)
import pandas as pd
from tools.testing_sets import TestingSetCreator
from tools.data_prep import remove_random

In [2]:
data = pd.read_csv('data/hr_test.csv')

In [3]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


## Add random missing values
To demonstrate the pattern visualization and imputation methods, we need an incomplete data set. To achieve this, we use a helper function that deletes values at random. With the setting used here, roughly 25 % of all rows lose one value, which amounts to about 2.5 % missing values per column (see the summary tables further down).

In [4]:
data = remove_random(data, .25, randomize_index=False)

In [5]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.49,2.0,131.0,3.0,0.0,1.0,0.0,RandD,high
1,,0.96,4.0,268.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.65,3.0,235.0,10.0,0.0,0.0,0.0,technical,low
3,,0.43,5.0,269.0,3.0,0.0,0.0,0.0,sales,medium
4,,0.43,3.0,224.0,6.0,0.0,0.0,0.0,hr,low
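
The source of `remove_random` is not shown in this document. As a rough illustration of what such a helper could look like, here is a minimal sketch; the name `remove_random_sketch`, its `seed` parameter and the "one value per selected row" strategy are assumptions, not the behaviour of the actual helper.

```python
import numpy as np
import pandas as pd


def remove_random_sketch(df, fraction, seed=0):
    """Return a copy of df in which about `fraction` of the rows miss one value.

    Hypothetical stand-in for the remove_random helper; the real one may differ.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_rows = int(round(len(out) * fraction))
    # pick the rows that lose a value (no row is picked twice)
    rows = rng.choice(len(out), size=n_rows, replace=False)
    # pick one column position per selected row
    cols = rng.integers(0, out.shape[1], size=n_rows)
    for r, c in zip(rows, cols):
        out.iat[int(r), int(c)] = np.nan
    return out


# toy frame with float columns only, so NaN fits without a dtype change
toy = pd.DataFrame({"a": [1.0] * 8, "b": [2.0] * 8})
holed = remove_random_sketch(toy, 0.25)
print(holed.isna().sum().sum())  # number of deleted cells
```

Because the function works on a copy, the frame passed in stays complete.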


In [6]:
data.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0
mean,0.612661,0.716468,3.805607,201.128274,3.494222,0.14441,0.237607,0.021607
std,0.248629,0.171086,1.234099,49.885469,1.455338,0.351517,0.425632,0.145401
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [7]:
len(data)

14999

## Testing rudimentary features
Below is a first trial run of some of the features implemented by `Impyter` and its helper classes.

## `NanChecker`
Functionality testing of `NanChecker` class.

In [8]:
nan_checker = impyte.NanChecker()

#### `NanChecker.is_nan(data, nan_vals=None, recursive=True)`
Detects missing values (NaN in numeric arrays, empty strings in string arrays). Additional values that should be treated as NaN can be passed through the `nan_vals` parameter.

In [9]:
# [True, False, False]
print(nan_checker.is_nan(["", 'None', 'NaN']))

# [True, True, True]
print(nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN']))

[True, False, False]
[True, True, True]


In [10]:
# Recursive nan detection
# [True, True, False, [False, True, True]]
print(nan_checker.is_nan(["", None, 'NaN', ["List Value 1", '', None]]))

[True, True, False, [False, True, True]]


In [11]:
# Values can be declared as nan-values
# [True, False, False, True]
nan_checker.is_nan(['NaN', 'Empty', 'None', 'N/A'], nan_vals=['NaN', 'N/A'])

[True, False, False, True]
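
The behaviour shown in the three cells above can be reproduced with a few lines of plain Python. The following is only a sketch: `is_nan_sketch` and the default tuple `DEFAULT_NAN_VALS` are assumptions chosen to match the outputs above, not the library's implementation.

```python
import math

# assumption: without nan_vals, only empty strings, None and float NaN count as missing
DEFAULT_NAN_VALS = ("", None)


def is_nan_sketch(data, nan_vals=None, recursive=True):
    """Return a (possibly nested) list of booleans marking missing entries."""
    nan_vals = DEFAULT_NAN_VALS if nan_vals is None else nan_vals
    result = []
    for item in data:
        if recursive and isinstance(item, (list, tuple)):
            # descend into nested lists and keep their structure
            result.append(is_nan_sketch(item, nan_vals, recursive))
        elif isinstance(item, float) and math.isnan(item):
            result.append(True)
        else:
            result.append(any(item is v or item == v for v in nan_vals))
    return result


flat = is_nan_sketch(["", "None", "NaN"])
nested = is_nan_sketch(["", None, "NaN", ["List Value 1", "", None]])
custom = is_nan_sketch(["NaN", "Empty", "None", "N/A"], nan_vals=["NaN", "N/A"])
print(flat, nested, custom)
```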

## `Pattern`
Functionality testing of `Pattern` class. The `Pattern` class stores different patterns and data summaries regarding NaN values. 

In [12]:
reload(impyte)
pattern_log = impyte.Pattern()

#### `Pattern._check_complete_row(row)`
Determines whether a row consists only of 1s. This method is used to create the NaN summary.

In [13]:
tsc = TestingSetCreator(random_seed=23)

In [14]:
df = tsc.test_set(complete=2, spat1=4, spat2=5)
imp = impyte.Impyter(df)
imp.pattern()

Unnamed: 0,0,1,2,Count
0,1.0,,1,5
1,,1.0,1,4
2,1.0,1.0,1,2


In [15]:
complete_contestants = imp.pattern().apply(pattern_log._check_complete_row, axis=1)

In [16]:
complete_contestants.values

array([-1, -1,  2])

The pattern number containing all complete data points can be found by searching for the maximum value in the array.

In [17]:
complete_contestants.max()

2
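
Judging from the output above, the method appears to return the row label for a complete pattern and `-1` otherwise. The sketch below reproduces that behaviour on the pattern table from `In [14]`; `check_complete_row_sketch` is a hypothetical stand-in, not the library's code.

```python
import numpy as np
import pandas as pd


def check_complete_row_sketch(row):
    """Return the row label if no column is NaN, otherwise -1."""
    values = row.drop("Count")  # the Count column is not part of the pattern
    return row.name if values.notna().all() else -1


pattern_table = pd.DataFrame({
    0: [1.0, np.nan, 1.0],
    1: [np.nan, 1.0, 1.0],
    2: [1, 1, 1],
    "Count": [5, 4, 2],
})
complete = pattern_table.apply(check_complete_row_sketch, axis=1)
print(complete.max())  # label of the complete pattern
```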

#### `Pattern._compute_pattern(data, nan_values="", verbose=False)`
Checks for missing values and computes the NaN-patterns of a data set. The result contains the pattern structure, the count of data points for each pattern, and the row indices that belong to each pattern. To get an easily digestible table, select the `["table"]` entry of the returned dictionary.

In [18]:
pattern_dict = pattern_log._compute_pattern(data)
indices, table = pattern_dict["indices"], pattern_dict["table"]

In [19]:
table

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374


In [20]:
indices[0][:10] # first 10 indices of pattern 0

[374, 749, 1124, 1499, 1874, 2249, 2624, 2999, 3374, 3749]

In [21]:
pattern_nr = 0
print("Pattern {} has {:,} rows.".format(pattern_nr, len(indices[0])))

Pattern 0 has 11,259 rows.
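
The idea of a NaN-pattern table can be sketched independently of the library: group rows by which columns are filled, count each group, and remember the row indices per group. `compute_pattern_sketch` below is an illustrative assumption about the approach, not the actual `_compute_pattern` code.

```python
import numpy as np
import pandas as pd


def compute_pattern_sketch(df):
    """Group rows by which columns are filled and count each group."""
    mask = df.notna()
    rows_by_pattern = {}
    for idx, key in zip(df.index, mask.itertuples(index=False, name=None)):
        rows_by_pattern.setdefault(key, []).append(idx)
    # most frequent pattern first; sorted() is stable, so ties keep first-seen order
    ordered = sorted(rows_by_pattern.items(), key=lambda kv: len(kv[1]), reverse=True)
    table_rows = [
        [1 if present else np.nan for present in key] + [len(rows)]
        for key, rows in ordered
    ]
    table = pd.DataFrame(table_rows, columns=list(df.columns) + ["Count"])
    indices = {no: rows for no, (_, rows) in enumerate(ordered)}
    return {"table": table, "indices": indices}


toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0], "y": [1.0, 2.0, np.nan, 4.0]})
result = compute_pattern_sketch(toy)
print(result["table"])
print(result["indices"])
```

As in the table above, a `1` marks a filled slot and `NaN` a missing one, with the pattern count in the last column.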


#### `Pattern._is_discrete(tmpdata, unique_instances)`
Determines based on dtype and by counting unique instances whether a column contains categorical/discrete or continuous values.

In [22]:
continuous_df = pd.DataFrame([.53, .22, .1, .11, .4, .7])

In [23]:
discrete_df = pd.DataFrame(["all", "your", "base", "are", "belong", "to", "us"])

In [24]:
pattern_log._is_discrete(continuous_df[0], unique_instances=5) # False

False

In [25]:
pattern_log._is_discrete(discrete_df[0], unique_instances=5) # True

True

#### `Pattern._get_discrete_and_continuous(tmpdata, unique_instances)`
Returns the column names of discrete and continuous variables. Column names are stored in lists for easy selection, and both lists are stored in one dictionary. All continuous column names can be accessed through `['continuous']`, all discrete ones through `['discrete']`.

In [26]:
pattern_log._get_discrete_and_continuous(data, unique_instances=5)

{'continuous': ['satisfaction_level',
  'last_evaluation',
  'number_project',
  'average_montly_hours',
  'time_spend_company'],
 'discrete': ['Work_accident',
  'left',
  'promotion_last_5years',
  'sales',
  'salary']}
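
One way to arrive at such a split is to treat every non-numeric column as discrete and every numeric column as discrete only if it has few distinct values. The sketch below follows that rule; the function names and the exact comparison (`<=`, NaN counted as a value) are assumptions, since the library's rule is not spelled out here.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_discrete_sketch(column, unique_instances):
    """Non-numeric columns are discrete; numeric ones only with few distinct values."""
    if not is_numeric_dtype(column):
        return True
    return column.nunique(dropna=False) <= unique_instances


def split_columns_sketch(df, unique_instances):
    """Sort column names into 'continuous' and 'discrete' lists."""
    result = {"continuous": [], "discrete": []}
    for name in df.columns:
        kind = "discrete" if is_discrete_sketch(df[name], unique_instances) else "continuous"
        result[kind].append(name)
    return result


toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    "left": [1, 1, 0, 0, 1, 0],
    "salary": ["low", "medium", "medium", "low", "high", "low"],
})
split = split_columns_sketch(toy, unique_instances=5)
print(split)
```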

#### `Pattern._get_unique_vals(data)`
For each column, this method returns a unique value count.

In [27]:
# [93, 66, 7, 216, 9, 3, 3, 3, 11, 4]
pattern_log._get_unique_vals(data)

[93, 66, 7, 216, 9, 3, 3, 3, 11, 4]

In [28]:
df = tsc.test_set(complete=5)
df

[[1, 1, 0.9248652516259452],
 [1, 1, 0.9486057779931771],
 [1, 1, 0.8924333440485793],
 [1, 1, 0.08355067683068362],
 [1, 1, 0.5920272268857353]]

In [29]:
# [1, 1, 5]
pattern_log._get_unique_vals(pd.DataFrame(df))

[1, 1, 5]

#### `Pattern._store_tuple(tup, row_idx, tmp_col_names)`
Internal storage method to save patterns in pattern_log.

#### `Pattern.get_complete_id()`
Returns pattern number of complete data points.

In [30]:
pattern_log.get_complete_id()

0

#### `Pattern.get_column_name(pattern_no)`
Returns column name(s) of nan-pattern.

In [31]:
pattern_log.get_column_name(1)

['satisfaction_level']

#### `Pattern.get_missing_value_percentage(data, importance_filter=False)`
Shows missing value percentage and count of unique values in category based on result and actual data table.

In [32]:
pattern_log.get_missing_value_percentage(data)

Unnamed: 0,Complete,Missing,Percentage,Unique
satisfaction_level,14625,374,2.49 %,93
last_evaluation,14625,374,2.49 %,66
number_project,14625,374,2.49 %,7
average_montly_hours,14625,374,2.49 %,216
time_spend_company,14625,374,2.49 %,9
Work_accident,14625,374,2.49 %,3
left,14625,374,2.49 %,3
promotion_last_5years,14625,374,2.49 %,3
sales,14625,374,2.49 %,11
salary,14625,374,2.49 %,4
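
A summary of this shape can be derived with a handful of pandas calls. The following sketch is an approximation under assumptions: `missing_value_summary_sketch` is a hypothetical name, the percentage is kept numeric instead of being formatted as text, and NaN is counted as its own unique value (which matches the `Unique` counts above).

```python
import numpy as np
import pandas as pd


def missing_value_summary_sketch(df, importance_filter=False):
    """Per-column counts of complete, missing and unique values."""
    summary = pd.DataFrame({
        "Complete": df.notna().sum(),
        "Missing": df.isna().sum(),
        "Unique": df.nunique(dropna=False),  # NaN counts as one unique value
    })
    summary["Percentage"] = (100 * summary["Missing"] / len(df)).round(2)
    if importance_filter:
        # keep only columns that actually miss something
        summary = summary[summary["Missing"] > 0]
    return summary[["Complete", "Missing", "Percentage", "Unique"]]


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
full = missing_value_summary_sketch(toy)
filtered = missing_value_summary_sketch(toy, importance_filter=True)
print(full)
```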


#### `Pattern.get_pattern(data=None, unique_instances=10, recompute=False)`
Returns the NaN-patterns, either from a previous computation or by starting a new one. If no patterns have been computed yet, `_compute_pattern()` is used to compute them. Otherwise, the stored pattern information is returned. If a pattern has already been computed, the `recompute` flag has to be set to `True` in order to compute a new pattern structure.

In [33]:
reload(impyte)

<module 'impyte' from '/Users/andirs/Dropbox (Personal)/_Studium/04_Semester/Projektarbeit/impyter/impyte.py'>

In [34]:
pattern_log = impyte.Pattern()

In [35]:
pattern_log.get_pattern(data)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374


In [36]:
df_max = tsc.test_set(complete=10000, spat1=2000, spat2=1000, spat3=500, mpat1=300, mpat2=200, mpat3=100)

In [37]:
pattern_log.get_pattern(pd.DataFrame(df_max), recompute=True)

Unnamed: 0,0,1,2,Count
0,1.0,1.0,1.0,10000
1,,1.0,1.0,2000
2,1.0,,1.0,1000
3,1.0,1.0,,500
4,,,1.0,300
5,,1.0,,200
6,1.0,,,100


#### `Pattern.get_single_nan_pattern_nos()`
Returns all pattern numbers of single nans.

In [38]:
pattern_log.get_single_nan_pattern_nos()

Int64Index([1, 2, 3], dtype='int64')

#### `Pattern.get_multi_nan_pattern_nos(multi=True)`
Returns all pattern numbers of multi-nans or single-nans.

In [39]:
pattern_log.get_multi_nan_pattern_nos()

Int64Index([4, 5, 6], dtype='int64')
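
Both selections follow from counting the NaN entries per row of the pattern table. A minimal sketch with plain pandas, rebuilding the `df_max` pattern table from above by hand (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

nan = np.nan
table = pd.DataFrame({
    0: [1, nan, 1, 1, nan, nan, 1],
    1: [1, 1, nan, 1, nan, 1, nan],
    2: [1, 1, 1, nan, 1, nan, nan],
    "Count": [10000, 2000, 1000, 500, 300, 200, 100],
})

# number of missing columns per pattern, ignoring the Count column
nan_per_pattern = table.drop(columns="Count").isna().sum(axis=1)
single_nan_pattern_nos = nan_per_pattern[nan_per_pattern == 1].index
multi_nan_pattern_nos = nan_per_pattern[nan_per_pattern > 1].index
print(list(single_nan_pattern_nos), list(multi_nan_pattern_nos))
```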

#### `Pattern.get_pattern_indices(pattern_no)`
Returns the row indices of all data points that belong to the given pattern.

In [40]:
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (pattern_log now holds the patterns of df_max)
pattern_log.get_pattern_indices(0)[:10] # get first 10 indices of pattern 0

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

#### `Pattern.get_continuous()`
Returns list with names of all continuous variables.

In [41]:
pattern_log.get_continuous()

[2]

#### `Pattern.get_discrete()`
Returns list with names of all discrete variables.

In [42]:
pattern_log.get_discrete()

[0, 1]

#### `Pattern.remove_pattern(pattern_no)`
Removes a pattern from storage.

In [43]:
reload(impyte)
pattern_log = impyte.Pattern()
pattern_log.get_pattern(pd.DataFrame(df_max))

Unnamed: 0,0,1,2,Count
0,1.0,1.0,1.0,10000
1,,1.0,1.0,2000
2,1.0,,1.0,1000
3,1.0,1.0,,500
4,,,1.0,300
5,,1.0,,200
6,1.0,,,100


In [44]:
pattern_log.remove_pattern(6)

In [45]:
pattern_log.get_pattern()

Unnamed: 0,0,1,2,Count
0,1.0,1.0,1.0,10000
1,,1.0,1.0,2000
2,1.0,,1.0,1000
3,1.0,1.0,,500
4,,,1.0,300
5,,1.0,,200


## `Impyter`

In [46]:
reload(impyte)
imp = impyte.Impyter() # instantiate impyte class

### Load data into imputer

#### `Impyter._data_check(data)`
Checks if data is pandas DataFrame and transforms otherwise.

In [47]:
df_max[:5]

[[1, 1, 0.9248652516259452],
 [1, 1, 0.9486057779931771],
 [1, 1, 0.8924333440485793],
 [1, 1, 0.08355067683068362],
 [1, 1, 0.5920272268857353]]

In [48]:
imp._data_check(df_max).head()

Unnamed: 0,0,1,2
0,1.0,1.0,0.924865
1,1.0,1.0,0.948606
2,1.0,1.0,0.892433
3,1.0,1.0,0.083551
4,1.0,1.0,0.592027


In [49]:
imp._data_check([["a", "d", "g"], ["b", "e", "h"], ["c", "f", "i"]])

Unnamed: 0,0,1,2
0,a,d,g
1,b,e,h
2,c,f,i


#### `Impyter._get_display_options(cols=True)`
Returns pandas display options for better readability of results. 

In [50]:
imp._get_display_options()

20

In [51]:
imp._get_display_options(cols=False)

60

#### `Impyter._set_display_options(length, cols=True)`
Sets individual display options for pattern results. If `cols` set to `False`, the maximum rows displayed can be set.

In [52]:
imp._set_display_options(25)

In [53]:
imp._set_display_options(65, cols=False)
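
The values returned above (20 and 60) match the usual pandas defaults for `display.max_columns` and `display.max_rows`, so these helpers presumably wrap the pandas options API. A sketch under that assumption (the two function names are hypothetical):

```python
import pandas as pd


def get_display_option_sketch(cols=True):
    """Read the pandas limit for displayed columns (or rows if cols=False)."""
    key = "display.max_columns" if cols else "display.max_rows"
    return pd.get_option(key)


def set_display_option_sketch(length, cols=True):
    """Set the pandas limit for displayed columns (or rows if cols=False)."""
    key = "display.max_columns" if cols else "display.max_rows"
    pd.set_option(key, length)


previous = get_display_option_sketch()
set_display_option_sketch(25)
current = get_display_option_sketch()
set_display_option_sketch(previous)  # restore the earlier setting
print(current)
```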

#### `Impyter.load_data()`
Loads a pandas DataFrame. Any other input is transformed into a DataFrame first. While loading, the data is copied into the object to avoid consistency issues with the original data set.

In [54]:
imp.load_data(data)

As an alternative, a DataFrame can be passed when instantiating the `Impyter` object.

In [55]:
imp = impyte.Impyter(data)
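
The "convert if necessary, then copy" behaviour described for `_data_check` and `load_data` can be illustrated with a small stand-in class. `LoaderSketch` is hypothetical and only mirrors the described semantics:

```python
import pandas as pd


class LoaderSketch:
    """Hypothetical stand-in that stores a private copy of the loaded data."""

    def __init__(self, data=None):
        self.data = None
        if data is not None:
            self.load_data(data)

    def load_data(self, data):
        # anything that is not a DataFrame is converted first
        frame = data if isinstance(data, pd.DataFrame) else pd.DataFrame(data)
        # the copy decouples the stored data from the caller's object
        self.data = frame.copy()

    def get_data(self):
        return self.data


original = pd.DataFrame({"a": [1, 2, 3]})
loader = LoaderSketch(original)
loader.get_data().loc[0, "a"] = 99  # change only the stored copy
from_list = LoaderSketch([["a", "d"], ["b", "e"]]).get_data()
print(original.loc[0, "a"], loader.get_data().loc[0, "a"])
```

Because of the copy, edits to the stored data never leak back into the frame that was passed in.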

#### `Impyter.get_data()`
Returns the loaded data for quick reference.

In [56]:
imp.get_data().head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.49,2.0,131.0,3.0,0.0,1.0,0.0,RandD,high
1,,0.96,4.0,268.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.65,3.0,235.0,10.0,0.0,0.0,0.0,technical,low
3,,0.43,5.0,269.0,3.0,0.0,0.0,0.0,sales,medium
4,,0.43,3.0,224.0,6.0,0.0,0.0,0.0,hr,low


#### `Impyter.pattern()`
Leverages `Pattern._compute_pattern()` and `Pattern.get_pattern()` to compute and return an overview of all existing NaN patterns in the data set. The overview shows `NaN` in each column where a value is missing and `1` for all complete slots. The `Count` column on the right-hand side indicates how often that pattern was found. The patterns are always sorted by count, so pattern 0 is not guaranteed to be the pattern with only complete cases.

In [57]:
reload(impyte)
imp = impyte.Impyter(data) # instantiate impyte class
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374


#### `Impyter.drop_pattern(pattern_no, inplace=False)`
Drops a pattern from the data set and returns the preliminary result. If the `inplace` flag is set to `True`, the data stored inside the `Impyter` object is modified as well.

In [58]:
imp.drop_pattern(4).head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.49,2.0,131.0,3.0,0.0,1.0,0.0,RandD,high
1,,0.96,4.0,268.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.65,3.0,235.0,10.0,0.0,0.0,0.0,technical,low
3,,0.43,5.0,269.0,3.0,0.0,0.0,0.0,sales,medium
4,,0.43,3.0,224.0,6.0,0.0,0.0,0.0,hr,low


In [59]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374


If the `inplace` flag is set to `True`, the changes are applied to the data set stored in the `Impyter` object. Otherwise, a copy without the dropped pattern is returned and the stored data set stays intact.

In [60]:
_ = imp.drop_pattern(4, inplace=True)

In [61]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1,,1.0,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1,1.0,,1.0,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1,1.0,1.0,,1.0,1.0,1.0,374
8,1.0,1.0,1.0,1,1.0,1.0,1.0,,1.0,1.0,374
9,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,,1.0,374
10,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,,374


#### `Impyter.get_pattern(pattern_no)`
Returns data points for a specific pattern_no for further investigation.

In [62]:
imp.get_pattern(1).head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.49,2.0,131.0,3.0,0.0,1.0,0.0,RandD,high
1,,0.96,4.0,268.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.65,3.0,235.0,10.0,0.0,0.0,0.0,technical,low
3,,0.43,5.0,269.0,3.0,0.0,0.0,0.0,sales,medium
4,,0.43,3.0,224.0,6.0,0.0,0.0,0.0,hr,low


#### `Impyter.get_summary()`
Returns table with information on missing values per column, its percentage and the count of unique values within that column.

In [63]:
imp.get_summary()

Unnamed: 0,Complete,Missing,Percentage,Unique
satisfaction_level,14625,374,2.49 %,93
last_evaluation,14625,374,2.49 %,66
number_project,14625,374,2.49 %,7
time_spend_company,14625,374,2.49 %,9
Work_accident,14625,374,2.49 %,3
left,14625,374,2.49 %,3
promotion_last_5years,14625,374,2.49 %,3
sales,14625,374,2.49 %,11
salary,14625,374,2.49 %,4


In [64]:
data['Work_accident'].unique()

array([  0.,   1.,  nan])

Setting the `importance_filter` flag to `True` shows only columns that have missing values. This is helpful for data sets with a large number of variables and only a few NaN values.

In [65]:
for pattern_no in range(1,4): #  drop patterns 1 to 3
    imp.drop_pattern(pattern_no, inplace=True)

In [66]:
imp.get_summary(importance_filter=True)

Unnamed: 0,Complete,Missing,Percentage,Unique
time_spend_company,14625,374,2.49 %,9
Work_accident,14625,374,2.49 %,3
left,14625,374,2.49 %,3
promotion_last_5years,14625,374,2.49 %,3
sales,14625,374,2.49 %,11
salary,14625,374,2.49 %,4


#### `Impyter.one_hot_encode()`
Relies on `pandas.get_dummies()` method to transform categorical values into one-hot-encoded values.

In [67]:
reload(impyte)
imputer = impyte.Impyter()
imputer.load_data(data)
_ = imputer.pattern()

In [68]:
ohe_data = imputer.one_hot_encode(data)
ohe_data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales_ohe_IT,sales_ohe_RandD,sales_ohe_accounting,sales_ohe_hr,sales_ohe_management,sales_ohe_marketing,sales_ohe_product_mng,sales_ohe_sales,sales_ohe_support,sales_ohe_technical,salary_ohe_high,salary_ohe_low,salary_ohe_medium
0,,0.49,2.0,131.0,3.0,0.0,1.0,0.0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,,0.96,4.0,268.0,3.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,1,0,0,1
2,,0.65,3.0,235.0,10.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,1,0,1,0
3,,0.43,5.0,269.0,3.0,0.0,0.0,0.0,0,0,0,0,0,0,0,1,0,0,0,0,1
4,,0.43,3.0,224.0,6.0,0.0,0.0,0.0,0,0,0,1,0,0,0,0,0,0,0,1,0


#### `Impyter.one_hot_decode()`
The inverse of `Impyter.one_hot_encode()`. Transforms one-hot-encoded columns back to categorical values.

In [69]:
imputer.one_hot_decode(ohe_data).head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.49,2.0,131.0,3.0,0.0,1.0,0.0,RandD,high
1,,0.96,4.0,268.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.65,3.0,235.0,10.0,0.0,0.0,0.0,technical,low
3,,0.43,5.0,269.0,3.0,0.0,0.0,0.0,sales,medium
4,,0.43,3.0,224.0,6.0,0.0,0.0,0.0,hr,low
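
The encode/decode round trip can be sketched with `pandas.get_dummies` and a manual inverse. The `_ohe_` separator is taken from the column names shown above; the two function names and the decoding strategy are assumptions, not the library's implementation.

```python
import pandas as pd

SEP = "_ohe_"  # separator seen in the encoded column names above


def one_hot_encode_sketch(df, columns):
    """One-hot encode the given columns with the '_ohe_' naming scheme."""
    return pd.get_dummies(df, columns=columns, prefix_sep=SEP)


def one_hot_decode_sketch(encoded):
    """Collapse '<name>_ohe_<value>' columns back into one column per name."""
    plain = [c for c in encoded.columns if SEP not in c]
    decoded = encoded[plain].copy()
    groups = {}
    for col in encoded.columns:
        if SEP in col:
            groups.setdefault(col.split(SEP)[0], []).append(col)
    for name, cols in groups.items():
        block = encoded[cols].astype(int)
        # the column holding the 1 names the original category
        labels = block.idxmax(axis=1).map(lambda c: c.split(SEP)[1])
        # rows without any 1 had no category, so they stay missing
        decoded[name] = labels.where(block.sum(axis=1) > 0)
    return decoded


toy = pd.DataFrame({"hours": [131.0, 268.0, 235.0], "salary": ["high", "medium", "low"]})
encoded = one_hot_encode_sketch(toy, ["salary"])
decoded = one_hot_decode_sketch(encoded)
print(list(encoded.columns))
print(decoded)
```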


#### `Impyter.impute()`
`impute` is the core method of impyte. It works out of the box and uses Random Forest estimators by default to impute missing values. It automatically performs cross-validation to indicate the potential accuracy of the imputation. The scoring used is `f1_macro` for classifiers (supporting binary and multi-class problems) and `r2` for regression models.

In [113]:
reload(impyte)
imputer = impyte.Impyter()
imputer.load_data(data)

In [114]:
_ = imputer.pattern()
complete_df = imputer.impute(estimator='rf')

Scoring Threshold             Classification                Regression                    
                              None                          None                          

Pattern: Label                Score                         Estimator                     
1: satisfaction_level         0.465 (r2)                    RandomForestRegressor           imputed...
2: last_evaluation            0.296 (r2)                    RandomForestRegressor           imputed...
3: number_project             0.576 (f1_macro)              RandomForestClassifier          imputed...
4: average_montly_hours       0.321 (r2)                    RandomForestRegressor           imputed...
5: time_spend_company         0.594 (f1_macro)              RandomForestClassifier          imputed...
6: Work_accident              0.599 (f1_macro)              RandomForestClassifier          imputed...
7: left                       0.981 (f1_macro)              RandomForestClassifier          imputed...
8: pr

##### Investigate completed data set

In [124]:
imp = impyte.Impyter(complete_df)

In [125]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1,1,1,1,1,1,1,1,1,1,14999
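
To make the idea behind a single imputation step more tangible, the sketch below trains a Random Forest on the complete rows, cross-validates it with `r2`, and fills the missing values of one column with its predictions. It uses scikit-learn directly on synthetic data; the column names, the data and the hyperparameters are made up, and the actual internals of `Impyter.impute()` may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# synthetic data: y depends on x1 and x2 (assumption for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
toy["y"] = 2 * toy["x1"] - toy["x2"] + rng.normal(scale=0.1, size=200)
toy.loc[:19, "y"] = np.nan  # one single-NaN pattern: y is missing in 20 rows

features = ["x1", "x2"]
complete = toy[toy["y"].notna()]
missing = toy[toy["y"].isna()]

# cross-validate on the complete rows to estimate the imputation quality
model = RandomForestRegressor(n_estimators=20, random_state=0)
score = cross_val_score(model, complete[features], complete["y"], scoring="r2", cv=3).mean()

# fit on all complete rows and predict the missing values
model.fit(complete[features], complete["y"])
imputed = toy.copy()
imputed.loc[missing.index, "y"] = model.predict(missing[features])
print("y: {:.3f} (r2)".format(score))
```

For a discrete target, the same steps would use a classifier and `scoring="f1_macro"` instead.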


##### Scoring measures
To fill in only columns that surpass a certain scoring threshold (e.g. f1 score > .7), the `threshold` parameter can be set. The `f1_macro` entry sets the threshold for classification, the `r2` entry the one for regression.

In [126]:
complete_df_threshold = imputer.impute(estimator='rf', threshold={"f1_macro": .7,
                                                                  "r2": .5})

Scoring Threshold             Classification                Regression                    
                              0.7                           0.5                           

Pattern: Label                Score                         Estimator                     
1: satisfaction_level         0.465 (r2)                    RandomForestRegressor           not imputed...
2: last_evaluation            0.296 (r2)                    RandomForestRegressor           not imputed...
3: number_project             0.576 (f1_macro)              RandomForestClassifier          not imputed...
4: average_montly_hours       0.321 (r2)                    RandomForestRegressor           not imputed...
5: time_spend_company         0.594 (f1_macro)              RandomForestClassifier          not imputed...
6: Work_accident              0.599 (f1_macro)              RandomForestClassifier          not imputed...
7: left                       0.981 (f1_macro)              RandomForestClassifier  

In [127]:
len(imputer.get_result())

14999

In [128]:
imputer.get_pattern(7, result=True)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
2250,0.96,0.70,3.0,207.0,3.0,0.0,0.0,0.0,IT,high
2251,0.87,0.90,5.0,254.0,6.0,0.0,1.0,0.0,support,low
2252,0.70,0.57,5.0,247.0,3.0,0.0,0.0,0.0,RandD,low
2253,0.43,0.54,2.0,150.0,3.0,0.0,1.0,0.0,hr,low
2254,0.52,0.90,5.0,176.0,3.0,0.0,0.0,0.0,technical,medium
2255,0.43,0.47,2.0,135.0,3.0,0.0,0.0,0.0,sales,low
2256,0.50,0.90,5.0,226.0,2.0,0.0,1.0,0.0,management,high
2257,0.55,0.98,2.0,144.0,2.0,0.0,1.0,0.0,hr,medium
2258,0.79,0.74,5.0,172.0,2.0,0.0,1.0,0.0,product_mng,low
2259,0.58,0.67,5.0,265.0,3.0,0.0,0.0,0.0,support,medium


In [129]:
imp = impyte.Impyter(complete_df_threshold)
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1,1,1.0,1.0,12007
1,,1.0,1.0,1.0,1.0,1.0,1,1,1.0,1.0,374
2,1.0,,1.0,1.0,1.0,1.0,1,1,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1,1,1.0,1.0,374
4,1.0,1.0,1.0,,1.0,1.0,1,1,1.0,1.0,374
5,1.0,1.0,1.0,1.0,,1.0,1,1,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,,1,1,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1,1,,1.0,374
8,1.0,1.0,1.0,1.0,1.0,1.0,1,1,1.0,,374


##### Multi-Nans
Predicting values for multi-NaN patterns is a last-resort option. It might be suitable for certain edge cases, but if the scores are low, consider dropping the feature or the data points instead.

In [294]:
import numpy as np

reload(impyte)
multi_data = data.copy()
# set last_evaluation to NaN in the first 100 rows so that they miss more than one value
for i in range(0, 100):
    multi_data.at[i, "last_evaluation"] = np.nan
imp = impyte.Impyter(multi_data)
_ = imp.pattern()
res = imp.impute(estimator="rf", multi_nans=True)

Scoring Threshold             Classification                Regression                    
                              None                          None                          

Pattern: Label                Score                         Estimator                     
1: last_evaluation            0.287 (r2)                    RandomForestRegressor           imputed...
2: number_project             0.577 (f1_macro)              RandomForestClassifier          imputed...
3: average_montly_hours       0.333 (r2)                    RandomForestRegressor           imputed...
4: time_spend_company         0.564 (f1_macro)              RandomForestClassifier          imputed...
5: Work_accident              0.598 (f1_macro)              RandomForestClassifier          imputed...
6: left                       0.981 (f1_macro)              RandomForestClassifier          imputed...
7: promotion_last_5years      0.758 (f1_macro)              RandomForestClassifier          imputed...
8: sa

In [220]:
imp.save_model(name='all_models_hr.pkl')

In [259]:
reload(impyte)

<module 'impyte' from '/Users/andirs/Dropbox (Personal)/_Studium/04_Semester/Projektarbeit/impyter/impyte.py'>

In [275]:
imp2 = impyte.Impyter()

In [276]:
imp2.load_data(multi_data)

In [277]:
imp2.load_model('all_models_hr.pkl')

Computing NaN-patterns first ...

Found 11 models...
Added model for pattern 1
Added model for pattern 2
Added model for pattern 3
Added model for pattern 4
Added model for pattern 5
Added model for pattern 6
Added model for pattern 7
Added model for pattern 8
Added model for pattern 9
Added model for pattern 10
Added model for pattern 11
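
The `.pkl` file name suggests that the models are serialized with `pickle`, although this document does not confirm it. Under that assumption, a save/load round trip for a dictionary of per-pattern model records could look like the sketch below; the record layout and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# made-up model records keyed by pattern number (not the library's format)
models = {
    1: {"feature": "last_evaluation", "scoring": "r2", "score": 0.287},
    2: {"feature": "number_project", "scoring": "f1_macro", "score": 0.577},
}

path = os.path.join(tempfile.mkdtemp(), "models_sketch.pkl")

# save: write the whole dictionary into one file
with open(path, "wb") as handle:
    pickle.dump(models, handle)

# load: read it back and report what was found
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print("Found {} models...".format(len(restored)))
for pattern_no in sorted(restored):
    print("Added model for pattern {}".format(pattern_no))
```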


#### `Impyter.drop_imputation()`
In the case of multi-nan patterns, `drop_imputation` averages the scores of all models. However, using this method for multi-nan patterns is discouraged. Treating the data set individually beforehand might be more helpful in order to preprocess the information correctly. One potential action is to drop multi-nan columns if they contain no information.

In [272]:
imp2.drop_imputation({"f1_macro": .5,
                     "r2": .3})

Dropping pattern 1 (0.2911254806690595 < 0.3 r2)
Dropping pattern 8 (0.34046861019180874 < 0.5 f1_macro)
Dropping pattern 11 (0.24383207565250203 < 0.3 r2)


In [273]:
imp2.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
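
The selection logic described for `drop_imputation` (average the scores of a pattern, then compare against the threshold of its metric) can be sketched in plain Python. The function name, the data layout with one metric per pattern, and all scores below are made up for illustration.

```python
def patterns_below_threshold(models, threshold):
    """Return (pattern_no, mean_score, scoring) for each pattern below its threshold."""
    dropped = []
    for pattern_no in sorted(models):
        scoring, scores = models[pattern_no]
        # multi-NaN patterns carry several scores, which are averaged
        mean_score = sum(scores) / len(scores)
        if scoring in threshold and mean_score < threshold[scoring]:
            dropped.append((pattern_no, mean_score, scoring))
    return dropped


# made-up scores: pattern 11 stands for a multi-NaN pattern with two models
models = {
    1: ("r2", [0.29]),
    2: ("f1_macro", [0.58]),
    8: ("f1_macro", [0.34]),
    11: ("r2", [0.31, 0.18]),
}
threshold = {"f1_macro": 0.5, "r2": 0.3}

to_drop = patterns_below_threshold(models, threshold)
for pattern_no, mean_score, scoring in to_drop:
    print("Dropping pattern {} ({:.3f} < {} {})".format(
        pattern_no, mean_score, threshold[scoring], scoring))
```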


`drop_imputation` can also be used to drop the pattern itself if the imputation results aren't sufficient. If `drop_pattern` is set to `True`, impyte automatically removes each pattern that doesn't reach the threshold.

In [284]:
imp2.load_data(multi_data)

In [285]:
imp2.load_model('all_models_hr.pkl')

Computing NaN-patterns first ...

Found 11 models...
Added model for pattern 1
Added model for pattern 2
Added model for pattern 3
Added model for pattern 4
Added model for pattern 5
Added model for pattern 6
Added model for pattern 7
Added model for pattern 8
Added model for pattern 9
Added model for pattern 10
Added model for pattern 11


In [286]:
imp2.drop_imputation({"f1_macro": .5,
                      "r2": .3}, drop_pattern=True)

Dropping pattern 1 (0.2911254806690595 < 0.3 r2)
Dropping pattern 8 (0.34046861019180874 < 0.5 f1_macro)
Dropping pattern 11 (0.24383207565250203 < 0.3 r2)


In [287]:
imp2.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,11259
2,1.0,1,,1.0,1.0,1.0,1.0,1.0,1,1.0,374
3,1.0,1,1.0,,1.0,1.0,1.0,1.0,1,1.0,374
4,1.0,1,1.0,1.0,,1.0,1.0,1.0,1,1.0,374
5,1.0,1,1.0,1.0,1.0,,1.0,1.0,1,1.0,374
6,1.0,1,1.0,1.0,1.0,1.0,,1.0,1,1.0,374
7,1.0,1,1.0,1.0,1.0,1.0,1.0,,1,1.0,374
9,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,1,,374
10,,1,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,274


#### `Impyter.get_result()`
Returns the result data set once `impute` has been performed. Before imputation, this method returns the original data set.

In [127]:
imp.get_result().head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
100,0.431,0.84,6.0,298.0,4.0,0.0,1.0,0.0,sales,low
101,0.209,0.89,5.0,152.0,3.0,0.0,0.0,0.0,support,medium
102,0.265,0.65,3.0,139.0,4.0,0.0,0.0,0.0,management,high
103,0.291,0.71,5.0,145.0,4.0,0.0,0.0,0.0,technical,medium
104,0.344,0.53,5.0,205.0,3.0,1.0,0.0,0.0,technical,medium


In [128]:
len(imp.get_result())

14151

#### `Impyter.get_model()`

In [320]:
model = imp.get_model(2)

In [321]:
model.feature_name[0]

'number_project'

In [322]:
model.estimator_name[0]

'RandomForestClassifier'

In [323]:
model.scoring[0]

'f1_macro'

In [324]:
model.get_score()

[0.57158731113498162]

In [325]:
model.model[0]

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [342]:
reload(impyte)
imp_max = impyte.Impyter(df_max)

In [343]:
imp_max.pattern()

Unnamed: 0,0,1,2,Count
0,1.0,1.0,1.0,10000
1,,1.0,1.0,2000
2,1.0,,1.0,1000
3,1.0,1.0,,500
4,,,1.0,300
5,,1.0,,200
6,1.0,,,100


In [344]:
imp_max.pattern_log.pattern_dependent_dict

{('NaN', 'NaN', 1): [0, 1],
 ('NaN', 1, 'NaN'): [0, 2],
 ('NaN', 1, 1): [0],
 (1, 'NaN', 'NaN'): [1, 2],
 (1, 'NaN', 1): [1],
 (1, 1, 'NaN'): [2],
 (1, 1, 1): []}
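
A dictionary like the one above can be derived from the rows of a data set. The following is a minimal sketch that assumes missing values are represented by `None`; the helper names `nan_pattern` and `dependent_columns` are hypothetical and not part of impyte.

```python
from collections import Counter


def nan_pattern(row):
    """Map a row to a tuple with 1 for present values and 'NaN' for missing ones."""
    return tuple("NaN" if value is None else 1 for value in row)


def dependent_columns(pattern):
    """Return the column positions that are missing in a pattern."""
    return [idx for idx, flag in enumerate(pattern) if flag == "NaN"]


# Tiny illustrative data set with three columns.
rows = [
    (1.0, "a", 3),
    (None, "b", 4),
    (None, None, 5),
    (2.0, None, None),
    (3.0, "c", 6),
]

# Count how often each pattern occurs, then list its missing columns.
counts = Counter(nan_pattern(row) for row in rows)
dependents = {pattern: dependent_columns(pattern) for pattern in counts}

print(counts[(1, 1, 1)])              # 2
print(dependents[("NaN", "NaN", 1)])  # [0, 1]
```

The columns listed for a pattern are the ones to be imputed (dependent variables); all remaining columns can serve as predictors.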

In [345]:
_ = imp_max.impute(multi_nans=True)

Scoring Threshold             Classification                Regression                    
                              None                          None                          

Pattern: Label                Score                         Estimator                     
1: 0                          1.000 (f1_macro)              RandomForestClassifier          imputed...
2: 1                          1.000 (f1_macro)              RandomForestClassifier          imputed...
3: 2                          -0.000 (r2)                   RandomForestRegressor           imputed...

Multi nans
4: 0                          1.000 (f1_macro)              RandomForestClassifier          imputed...
4: 1                          1.000 (f1_macro)              RandomForestClassifier          imputed...
5: 0                          1.000 (f1_macro)              RandomForestClassifier          imputed...
5: 2                          -0.000 (r2)                   RandomForestRegressor           impu

In [346]:
mmdl = imp_max.model_log[4]

In [347]:
mmdl.model_list

[<impyte.ImpyterModel at 0x1a2eb3e828>, <impyte.ImpyterModel at 0x1a2fcf1d68>]

In [348]:
mmdl.get_dependend_and_independent_variables()

{'dependent_variables': [0, 1], 'independent_variables': [2]}

#### `Impyter.save_model(pattern_no=None, name=None)`
Stores the imputation models for either the whole data set or a single pattern in a pickle file. If `pattern_no` is not set, the method stores all models. If `name` is not set, a name containing a timestamp is generated automatically.

In [349]:
imp_max.save_model(name='all_models_dummy.pkl')

In [350]:
imp_max.save_model(2, name='one_model_dummy.pkl')
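
Storing models in a pickle file with an automatically generated name could look roughly like the sketch below. It uses only the standard library; `default_model_name`, `save_models` and `load_models` are hypothetical helpers, and the name format is an assumption modelled on file names such as `2017-12-01_impyte_mdl_1512153316.pkl`.

```python
import os
import pickle
import tempfile
import time
from datetime import date


def default_model_name(now=None):
    """Build a file name from the current date and Unix timestamp (assumed format)."""
    now = time.time() if now is None else now
    day = date.fromtimestamp(now).isoformat()
    return "{}_impyte_mdl_{}.pkl".format(day, int(now))


def save_models(models, name=None, directory="."):
    """Pickle `models` to `directory` and return the path that was written."""
    path = os.path.join(directory, name or default_model_name())
    with open(path, "wb") as handle:
        pickle.dump(models, handle)
    return path


def load_models(path):
    """Read back whatever object was pickled at `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    # Plain strings stand in for fitted models here.
    models = {1: "model for pattern 1", 2: "model for pattern 2"}
    all_path = save_models(models, directory=tmp)  # automatic name
    one_path = save_models(models[2], name="one_model_dummy.pkl", directory=tmp)
    restored_all = load_models(all_path)
    restored_one = load_models(one_path)

print(os.path.basename(all_path))
```

As with any pickle file, only load model files that come from a trusted source.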

#### `Impyter.map_model_to_pattern(mdl)`
Checks a model for similarity to the stored patterns and returns the pattern number if a match is found.

In [351]:
mdl = imp_max.get_model(2)

In [352]:
print(imp_max.map_model_to_pattern(mdl))

2


#### `Impyter.map_multimodel_to_pattern(mdl)`
Checks a multi-NaN model for similarity to the stored patterns and returns the pattern number if a match is found.

In [354]:
mdl = imp_max.get_model(4)

In [355]:
print(imp.map_multimodel_to_pattern(mdl))

4


#### `Impyter.load_model(model)`
Loads a stored machine learning model to perform value imputation. If `model` is a list of models, each model is checked against the patterns by its independent and dependent variables. If a matching pattern is found, the model for that pattern is updated.

In [356]:
# 2017-12-01_impyte_mdl_1512153316.pkl
imp.load_model(model='all_models_dummy.pkl')

Found 6 models...
Added model for pattern 1
Added model for pattern 2
Added model for pattern 3
Added model for pattern 4
Added model for pattern 5
Added model for pattern 6


In [357]:
imp.load_model(model='one_model_dummy.pkl')

Added model for pattern 2
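
Matching a loaded model to a pattern by its variables can be sketched as follows. This is a simplified illustration under the assumption that both models and patterns expose their variables as dictionaries with `dependent_variables` and `independent_variables` keys; `find_matching_pattern` is a hypothetical helper, not impyte's own code.

```python
def find_matching_pattern(model_vars, pattern_vars):
    """Return the number of the first pattern with the same variables, else None."""
    wanted = (
        set(model_vars["dependent_variables"]),
        set(model_vars["independent_variables"]),
    )
    for pattern_no in sorted(pattern_vars):
        candidate = pattern_vars[pattern_no]
        found = (
            set(candidate["dependent_variables"]),
            set(candidate["independent_variables"]),
        )
        # Sets make the comparison independent of column order.
        if found == wanted:
            return pattern_no
    return None


# Three patterns of a data set with columns 0, 1 and 2.
pattern_vars = {
    1: {"dependent_variables": [0], "independent_variables": [1, 2]},
    2: {"dependent_variables": [1], "independent_variables": [0, 2]},
    4: {"dependent_variables": [0, 1], "independent_variables": [2]},
}

match = find_matching_pattern(
    {"dependent_variables": [1, 0], "independent_variables": [2]}, pattern_vars
)
no_match = find_matching_pattern(
    {"dependent_variables": [2], "independent_variables": [0]}, pattern_vars
)
print(match, no_match)  # 4 None
```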


#### `Impyter.compare_features(list1, list2)`
Compares two lists by their elements, based on a comparison of `Counter` dicts. The order of the elements is irrelevant.

In [358]:
# [True]
imp.compare_features(["one", "two", "three"], ["one", "two", "three"])

True

In [359]:
# [True]
imp.compare_features(["one", "two", "three"], ["one", "three", "two"])

True

In [360]:
# [False]
imp.compare_features(["one", "two", "three"], ["one", "two", "four"])

False
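
A `Counter`-based comparison of this kind can be written in a few lines. The function below is a stand-in named `same_elements` for illustration; it is not necessarily identical to impyte's implementation.

```python
from collections import Counter


def same_elements(list1, list2):
    """Return True if both lists hold the same elements with the same multiplicity."""
    return Counter(list1) == Counter(list2)


print(same_elements(["one", "two", "three"], ["one", "three", "two"]))  # True
print(same_elements(["one", "two", "three"], ["one", "two", "four"]))   # False
print(same_elements(["one", "one"], ["one"]))                           # False
```

Unlike a comparison of sets, a `Counter` comparison also notices when an element occurs a different number of times in the two lists.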

### Timing some of the functions
The timings below give a rough idea of how efficient some of the functions are. They don't reflect the complexity of the functions, but running them gives a runtime estimate for your local machine.

In [362]:
%timeit nan_checker.is_nan(["", 'None', 'NaN'])
%timeit nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

3.67 µs ± 415 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.01 µs ± 473 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
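
Outside of IPython, where the `%timeit` magic is not available, a comparable measurement can be taken with the standard `timeit` module. In the sketch below, `is_missing` is a hypothetical stand-in for a NaN check and does not reproduce `nan_checker.is_nan`; `time_call` is a hypothetical helper as well.

```python
import timeit

DEFAULT_NAN_VALS = ("", None, "None", "NaN")


def is_missing(values, nan_vals=DEFAULT_NAN_VALS):
    """Hypothetical stand-in: flag each value that counts as missing."""
    return [value in nan_vals for value in values]


def time_call(func, number=10000, repeat=7):
    """Return the best per-call runtime in seconds over several repeats."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


flags = is_missing(["", "None", "NaN", 0.5])
per_call = time_call(lambda: is_missing(["", "None", "NaN"]))

print(flags)  # [True, True, True, False]
print("{:.2f} us per call".format(per_call * 1e6))
```

Note that `%timeit` reports the mean and standard deviation over its runs, whereas this sketch reports the fastest run, so the numbers are not directly comparable.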


## Ensemble
`ensemble` iterates over several estimators and imputes with each of them. So far, it only prints the results for easy comparison.

In [361]:
imp.ensemble(estimator_list=["rf", "dt"])

Scoring Threshold             Classification                Regression                    
                              None                          None                          

Pattern: Label                Score                         Estimator                     
1: 0                          1.000 (f1_macro)              RandomForestClassifier          imputed...
2: 1                          1.000 (f1_macro)              RandomForestClassifier          imputed...
3: 2                          -0.000 (r2)                   RandomForestRegressor           imputed...
Scoring Threshold             Classification                Regression                    
                              None                          None                          

Pattern: Label                Score                         Estimator                     
1: 0                          1.000 (f1_macro)              DecisionTreeClassifier          imputed...
2: 1                          1.000 (f1_

In [375]:
imp2.pattern_log.column_names

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'sales', 'salary'],
      dtype='object')

In [377]:
print(imp2.pattern_log.complete_idx)

None


In [378]:
imp2.pattern_log.get_complete_id()

0

In [379]:
imp2.pattern_log.complete_idx

0