# Impyte Documentation
This is a first practical attempt to clarify the usage of `impyte`. It's a collection of easily applicable and reproducible examples that you can use to simplify your data processing workflow.

## Importing and generating sample data
In order to show some of the features of the library, we'll be using Kaggle's HR data that can be found [here](https://www.kaggle.com/ludobenistant/hr-analytics).

In [246]:
# import library and data set
import impyte
import pandas as pd
from tools.data_prep import remove_random

In [247]:
data = pd.read_csv('data/hr_test.csv')

In [248]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


## Add random missing values
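In order to show some of the pattern visualization and imputation methods, we need an incomplete data set. To achieve this, we'll use a helper function that deletes values at random. With the argument `.25` used below, roughly 25 % of all rows lose one value, spread evenly across the columns (374 values, about 2.5 %, per column, as the pattern counts further down show).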

In [249]:
data = remove_random(data, .25)
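
The helper `remove_random` lives in the project's `tools.data_prep` module and its source is not shown here. As a rough illustration only, a comparable helper could look like the sketch below. It assumes that each affected row loses exactly one value and that the gaps are spread evenly over the columns; the name `remove_random_sketch` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def remove_random_sketch(df, fraction, seed=0):
    """Blank out one cell in roughly `fraction` of the rows (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    out = df.copy()
    n_rows, n_cols = out.shape
    per_col = int(n_rows * fraction / n_cols)  # rows to blank per column
    # Draw disjoint row positions so that no row loses more than one value.
    chosen = rng.choice(n_rows, size=per_col * n_cols, replace=False)
    for col, rows in zip(out.columns, np.split(chosen, n_cols)):
        hit = np.zeros(n_rows, dtype=bool)
        hit[rows] = True
        # Series.mask replaces the flagged entries with NaN (upcasting if needed).
        out[col] = out[col].mask(hit)
    return out


demo = pd.DataFrame({"x": np.linspace(0, 1, 40), "y": ["low", "high"] * 20})
holes = remove_random_sketch(demo, 0.25)
print(holes.isnull().sum())
```

With 14999 rows, 10 columns and a fraction of `.25`, this scheme would blank 374 values per column, which matches the counts shown in this notebook, although the real helper may work differently.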

In [250]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.9,4.0,263.0,5.0,0.0,1.0,0.0,support,low
1,,0.62,4.0,138.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.59,2.0,144.0,3.0,0.0,0.0,0.0,sales,medium
3,,0.66,4.0,270.0,2.0,0.0,0.0,0.0,IT,low
4,,0.47,2.0,253.0,3.0,0.0,1.0,0.0,sales,low


In [251]:
data.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0
mean,0.612568,0.715946,3.804581,201.070154,3.500239,0.143795,0.238359,0.02147
std,0.248915,0.171149,1.233244,49.919767,1.4599,0.350894,0.426094,0.14495
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [252]:
len(data)

14999

In [253]:
reload(impyte)
imputer = impyte.Imputer() # instantiate impyte class

## Testing rudimentary features
Below is a first trial run of some of the features implemented by `impyte` and its helper classes.

## `NanChecker`
Functionality testing of the `NanChecker` class.

In [254]:
nan_checker = impyte.NanChecker()

#### `NanChecker.is_nan(data, nan_vals=None, recursive=True)`
Detects missing values (NaN in numeric arrays, empty strings in string arrays). Additional values that should count as missing can be passed through the `nan_vals` parameter. With `recursive=True`, nested lists are checked element by element.

In [255]:
# [True, False, False]
print nan_checker.is_nan(["", 'None', 'NaN'])

# [True, True, True]
print nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

[True, False, False]
[True, True, True]


In [256]:
# Recursive nan detection
# [True, True, False, [False, True, True]]
print nan_checker.is_nan(["", None, 'NaN', ["List Value 1", '', None]])

[True, True, False, [False, True, True]]


In [257]:
# Values can be declared as nan-values
# [True, False, False, True]
nan_checker.is_nan(['NaN', 'Empty', 'None', 'N/A'], nan_vals=['NaN', 'N/A'])

[True, False, False, True]
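
To make the behaviour above easier to follow, here is a minimal stand-alone sketch that reproduces the three example outputs. It is not the `impyte` implementation: the function name `is_nan_sketch` is hypothetical, and it assumes that `nan_vals` extends the default markers (`""`, `None` and float NaN) rather than replacing them.

```python
import math


def is_nan_sketch(values, nan_vals=None, recursive=True):
    """Flag missing entries; nested lists are handled recursively (illustrative only)."""
    # Assumption: user-supplied markers are added to the defaults.
    markers = ["", None] + list(nan_vals or [])
    flags = []
    for value in values:
        if recursive and isinstance(value, (list, tuple)):
            flags.append(is_nan_sketch(value, nan_vals, recursive))
        elif isinstance(value, float) and math.isnan(value):
            flags.append(True)
        else:
            flags.append(value in markers)
    return flags


print(is_nan_sketch(["", "None", "NaN"]))
print(is_nan_sketch(["", None, "NaN", ["List Value 1", "", None]]))
print(is_nan_sketch(["NaN", "Empty", "None", "N/A"], nan_vals=["NaN", "N/A"]))
```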

## `Pattern`
Functionality testing of the `Pattern` class. The `Pattern` class stores the different patterns and data summaries regarding NaN values.

#### `Pattern._get_discrete_and_continuous(tmpdata)`
Returns the column names of discrete and continuous variables. The column names are stored in lists for easy selection, and both lists are stored in one dictionary. All continuous column names can be accessed through `['continuous']`, all discrete column names through `['discrete']`.

In [258]:
reload(impyte)
pattern_log = impyte.Pattern()

In [259]:
pattern_log._get_discrete_and_continuous(data)

{'continuous': ['satisfaction_level',
  'last_evaluation',
  'number_project',
  'average_montly_hours',
  'time_spend_company',
  'Work_accident',
  'left',
  'promotion_last_5years'],
 'discrete': ['sales', 'salary']}
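
A comparable split can be sketched with plain pandas. The version below assumes that "continuous" simply means a numeric dtype and "discrete" means everything else, which is consistent with the output above (binary columns such as `left` end up as continuous) but is not confirmed by the `impyte` source. The function name `split_by_dtype` is hypothetical.

```python
import pandas as pd


def split_by_dtype(df):
    """Group column names into numeric ('continuous') and other ('discrete')."""
    continuous = df.select_dtypes(include="number").columns.tolist()
    discrete = [col for col in df.columns if col not in continuous]
    return {"continuous": continuous, "discrete": discrete}


sample = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80],
    "left": [1, 0],
    "salary": ["low", "medium"],
})
columns = split_by_dtype(sample)
print(columns["continuous"])
print(columns["discrete"])
```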

#### `Pattern._compute_pattern(data, nan_values="", verbose=False)`
Checks the data for missing values and summarizes them: it computes the NaN-patterns found in the data and returns the pattern structure together with the number of data points that follow each pattern. The row indices per pattern are available under `["indices"]`. For a readable summary table, select `["table"]` from the output.

In [260]:
pattern_dict = pattern_log._compute_pattern(data)
indices, table = pattern_dict["indices"], pattern_dict["table"]

In [261]:
table

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [262]:
indices[0][:10] # first 10 indices of pattern 0

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
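
For readers who want to see how such a summary table can be derived, the following sketch builds a similar one with plain pandas. It is an illustration under assumptions, not the `impyte` code: it marks present values with `1` and missing values with `0` (the table above shows blanks instead), and the function name `pattern_table` is hypothetical.

```python
import numpy as np
import pandas as pd


def pattern_table(df):
    """Count how many rows share each present/missing pattern."""
    present = df.notnull().astype(int)  # 1 = value present, 0 = value missing
    table = (
        present.groupby(list(present.columns))
        .size()
        .reset_index(name="Count")
        .sort_values("Count", ascending=False)
        .reset_index(drop=True)
    )
    return table


sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
})
summary = pattern_table(sample)
print(summary)
```

Sorting by `Count` puts the most frequent pattern first, which mirrors the layout of the table above.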

#### `Pattern.get_pattern()`
Returns the NaN-patterns. If the patterns have not been computed yet, `_compute_pattern()` is called to compute them; otherwise the stored pattern information is returned.

In [263]:
pattern_log.get_pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


#### `Pattern.row_nan_pattern(row)`
This method is the core building block for determining which patterns exist in the data. Other methods use the output of `row_nan_pattern()` to find the unique pattern structures and count them before turning them into readable tables.

In [264]:
# (1, 1, 'NaN', 'NaN')
print pattern_log.row_nan_pattern(['Value 1', 'Value 2', '', None])

(1, 1, 'NaN', 'NaN')
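
The idea can be sketched in a few lines. The version below reproduces the output above under the assumption that empty strings, `None` and float NaN count as missing; the function name `row_nan_pattern_sketch` and its `nan_vals` parameter are hypothetical.

```python
import math


def row_nan_pattern_sketch(row, nan_vals=("", None)):
    """Map a row to a tuple with 1 for present values and 'NaN' for missing ones."""
    def is_missing(value):
        if isinstance(value, float) and math.isnan(value):
            return True
        return value in nan_vals

    return tuple("NaN" if is_missing(value) else 1 for value in row)


print(row_nan_pattern_sketch(["Value 1", "Value 2", "", None]))
```

Returning a tuple rather than a list keeps the pattern hashable, so it can serve as a dictionary key when patterns are counted.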


#### `Pattern.get_pattern_indices(pattern_no)`
Returns the row indices of the data points that follow the pattern with the given number.

In [265]:
pattern_log.get_pattern_indices(0)[:10] # get first 10 indices of pattern 0

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

#### `Pattern.print_pattern(data)`
Counts the individual NaN-patterns and returns the counts in a dictionary-like `Counter`.

In [266]:
print_pattern_dict = pattern_log.print_pattern(data)

In [267]:
print_pattern_dict

Counter({(1, 1, 1, 1, 1, 1, 1, 1, 1, 1): 11259,
         (1, 1, 1, 1, 1, 1, 1, 1, 1, 'NaN'): 374,
         (1, 1, 1, 1, 1, 1, 1, 1, 'NaN', 1): 374,
         (1, 1, 1, 1, 1, 1, 1, 'NaN', 1, 1): 374,
         (1, 1, 1, 1, 1, 1, 'NaN', 1, 1, 1): 374,
         (1, 1, 1, 1, 1, 'NaN', 1, 1, 1, 1): 374,
         (1, 1, 1, 1, 'NaN', 1, 1, 1, 1, 1): 374,
         (1, 1, 1, 'NaN', 1, 1, 1, 1, 1, 1): 374,
         (1, 1, 'NaN', 1, 1, 1, 1, 1, 1, 1): 374,
         (1, 'NaN', 1, 1, 1, 1, 1, 1, 1, 1): 374,
         ('NaN', 1, 1, 1, 1, 1, 1, 1, 1, 1): 374})
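
A similar count can be produced directly from a `DataFrame`. The sketch below assumes that pandas' own notion of missing (`isnull()`) is sufficient and uses the same `1` / `'NaN'` encoding as the output above; the function name `count_patterns` is hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd


def count_patterns(df):
    """Count rows per missingness pattern, keyed by a tuple of 1 / 'NaN'."""
    missing = df.isnull().values  # boolean array, True where a value is missing
    return Counter(
        tuple("NaN" if flag else 1 for flag in row) for row in missing
    )


sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
})
counts = count_patterns(sample)
print(counts.most_common())
```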

#### `Pattern.get_continuous()`
Returns list with names of all continuous variables.

In [268]:
pattern_log.get_continuous()

['satisfaction_level',
 'last_evaluation',
 'number_project',
 'average_montly_hours',
 'time_spend_company',
 'Work_accident',
 'left',
 'promotion_last_5years']

#### `Pattern.get_discrete()`
Returns list with names of all discrete variables.

In [269]:
pattern_log.get_discrete()

['sales', 'salary']

## `Imputer`

### Load data into imputer

In [270]:
reload(impyte)
imputer.load_data(data)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.90,4.0,263.0,5.0,0.0,1.0,0.0,support,low
1,,0.62,4.0,138.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.59,2.0,144.0,3.0,0.0,0.0,0.0,sales,medium
3,,0.66,4.0,270.0,2.0,0.0,0.0,0.0,IT,low
4,,0.47,2.0,253.0,3.0,0.0,1.0,0.0,sales,low
5,,0.97,5.0,242.0,3.0,0.0,0.0,0.0,product_mng,low
6,,0.78,3.0,166.0,2.0,0.0,0.0,0.0,marketing,medium
7,,0.74,5.0,253.0,2.0,0.0,0.0,0.0,accounting,medium
8,,0.40,3.0,113.0,3.0,0.0,0.0,0.0,marketing,high
9,,0.59,2.0,164.0,2.0,0.0,0.0,0.0,sales,low


In [271]:
%timeit imputer.pattern()

The slowest run took 226234.13 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 9.06 µs per loop


In [272]:
%timeit imputer.get_pattern(4)

The slowest run took 8.74 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 898 µs per loop


In [273]:
reload(impyte)
imputer = impyte.Imputer(data) # instantiate impyte class
nan_checker = impyte.NanChecker()
pattern_log = impyte.Pattern()
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [274]:
_ = imputer.drop_pattern(4, inplace=True)

In [275]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1,1.0,1.0,,1.0,1.0,374
0,,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1,,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1,1.0,,1.0,1.0,1.0,374


In [276]:
_ = imputer.load_data(data)

In [277]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [278]:
_ = imputer.drop_pattern(4)

In [279]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
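
Comparing the outputs above, `drop_pattern(4, inplace=True)` removes the rows of pattern 4 from the data held by the imputer, while the call without `inplace` leaves the stored patterns unchanged. The same distinction can be sketched with plain pandas. This is an illustration only: the function name `drop_rows_with_pattern` is hypothetical, and it identifies a pattern by the list of columns that are missing rather than by a pattern number.

```python
import numpy as np
import pandas as pd


def drop_rows_with_pattern(df, missing_cols, inplace=False):
    """Drop rows whose set of missing columns is exactly `missing_cols`."""
    target = np.array([col in missing_cols for col in df.columns])
    # A row matches when its missingness mask equals the target in every column.
    matches = (df.isnull().values == target).all(axis=1)
    if inplace:
        df.drop(df.index[matches], inplace=True)
        return df
    return df.loc[~matches]


frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
kept = drop_rows_with_pattern(frame, ["a"])  # frame itself is left untouched
print(len(frame), len(kept))
```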


In [280]:
imputer.drop_pattern(7, inplace=True)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.90,4.0,263.0,5.0,0.0,1.0,0.0,support,low
1,,0.62,4.0,138.0,3.0,0.0,0.0,0.0,technical,medium
2,,0.59,2.0,144.0,3.0,0.0,0.0,0.0,sales,medium
3,,0.66,4.0,270.0,2.0,0.0,0.0,0.0,IT,low
4,,0.47,2.0,253.0,3.0,0.0,1.0,0.0,sales,low
5,,0.97,5.0,242.0,3.0,0.0,0.0,0.0,product_mng,low
6,,0.78,3.0,166.0,2.0,0.0,0.0,0.0,marketing,medium
7,,0.74,5.0,253.0,2.0,0.0,0.0,0.0,accounting,medium
8,,0.40,3.0,113.0,3.0,0.0,0.0,0.0,marketing,high
9,,0.59,2.0,164.0,2.0,0.0,0.0,0.0,sales,low


In [281]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,1.0,,1,1.0,1.0,374


In [282]:
imputer.impute()

(['satisfaction_level',
  'last_evaluation',
  'number_project',
  'average_montly_hours',
  'time_spend_company',
  'Work_accident',
  'left',
  'promotion_last_5years'],
 ['sales', 'salary'])

In [283]:
data.tail()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
14994,0.84,0.51,4.0,259.0,3.0,0.0,0.0,0.0,RandD,medium
14995,0.36,0.5,2.0,132.0,3.0,0.0,1.0,0.0,technical,low
14996,0.92,0.56,3.0,174.0,3.0,0.0,0.0,0.0,sales,medium
14997,0.68,0.64,2.0,167.0,2.0,0.0,0.0,0.0,hr,low
14998,0.78,0.63,4.0,158.0,3.0,0.0,0.0,0.0,technical,medium


### Timing some of the functions
Below are some timings for the functions. They don't reflect the complexity of the functions, but they give a runtime estimate for your local machine.

In [284]:
%timeit nan_checker.is_nan(["", 'None', 'NaN'])
%timeit nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

100000 loops, best of 3: 5.29 µs per loop
100000 loops, best of 3: 5.19 µs per loop
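
The `%timeit` magic is only available in IPython and Jupyter. In a plain Python script, a comparable "best of 3" measurement can be sketched with the standard library's `timeit` module. The function `is_empty_string` below is a hypothetical stand-in for the call being timed.

```python
import timeit


def is_empty_string(values):
    """Stand-in workload: flag empty strings in a list."""
    return [value == "" for value in values]


loops = 10000
# Three repetitions of `loops` calls each; the minimum is the least disturbed run.
runs = timeit.repeat(
    lambda: is_empty_string(["", "None", "NaN"]), number=loops, repeat=3
)
best_us = min(runs) / loops * 1e6
print("%d loops, best of 3: %.2f us per loop" % (loops, best_us))
```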


In [318]:
reload(impyte)
pattern_log = impyte.Pattern()
pattern_log._compute_pattern(data)["table"]

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [319]:
imp = impyte.Imputer()

In [320]:
_ = imp.load_data(data)

In [321]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [323]:
len(imp.get_pattern(10))

11259
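
As a quick sanity check on the numbers in this notebook, the pattern counts should add up to the total number of rows. The short calculation below uses only figures shown above: 11259 complete rows, ten single-gap patterns of 374 rows each, and 14999 rows overall.

```python
complete_rows = 11259    # rows without any missing value (pattern 10)
rows_per_gap = 374       # rows in each single-gap pattern
gap_patterns = 10        # one pattern per column
total_rows = 14999       # len(data)

incomplete_rows = rows_per_gap * gap_patterns
assert complete_rows + incomplete_rows == total_rows

share_incomplete = incomplete_rows / float(total_rows)
print("incomplete rows: %d (%.1f %% of all rows)" % (incomplete_rows, 100 * share_incomplete))
```

The share of incomplete rows comes out at just under 25 %, in line with the `.25` passed to `remove_random`.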