# Impyte Documentation
This is a first practical attempt to clarify the usage of `impyte`. It's a collection of easily applicable and reproducible examples that you can use to simplify your data processing workflow.

## Generating sample data
To show some of the library's features, we'll use Kaggle's HR analytics data set, which can be found [here](https://www.kaggle.com/ludobenistant/hr-analytics).

In [1]:
# import the libraries and the helper that removes values at random
import impyte
import pandas as pd
from tools.data_prep import remove_random

In [2]:
data = pd.read_csv('data/hr_test.csv')

In [3]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


## Add random missing values
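To demonstrate the pattern visualization and imputation methods, we need an incomplete data set. To get one, we use a helper function that deletes values at random. Called with `.25`, it leaves about 25 % of the rows with one missing value: the pattern table further down counts 11,259 complete rows out of 14,999, and 374 rows for each single-column gap it lists.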

In [4]:
data = remove_random(data, .25)
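
The source of `remove_random` is not shown in this document. As a rough illustration only, a helper with similar behaviour could look like the sketch below. The name `remove_random_sketch`, the `seed` parameter, and the rule "at most one missing value per row, split evenly across columns" are assumptions inferred from the pattern counts, not the actual implementation.

```python
import numpy as np
import pandas as pd


def remove_random_sketch(df, fraction, seed=0):
    """Blank out one cell in roughly `fraction` of the rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = df.shape
    # Same number of deletions for every column, e.g. int(14999 * .25 / 10) == 374
    per_col = int(n_rows * fraction / n_cols)
    # Pick distinct rows so that no row loses more than one value
    rows = rng.choice(n_rows, size=per_col * n_cols, replace=False)
    cols = np.repeat(np.arange(n_cols), per_col)
    mask = np.zeros(df.shape, dtype=bool)
    mask[rows, cols] = True
    # DataFrame.mask replaces the flagged cells with NaN and returns a new frame
    return df.mask(mask)


demo = pd.DataFrame({
    "a": range(40),
    "b": range(40),
    "c": ["x"] * 40,
    "d": [0.5] * 40,
})
incomplete = remove_random_sketch(demo, .25)
print(incomplete.isna().sum())
```

Integer columns come back as floats because `NaN` cannot be stored in an integer column, which matches the `7.0`, `297.0` style values in the output of the next cell.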

In [5]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.91,7.0,297.0,4.0,0.0,1.0,0.0,technical,low
1,,0.88,5.0,204.0,5.0,0.0,0.0,0.0,sales,low
2,,0.59,4.0,250.0,2.0,0.0,0.0,0.0,support,low
3,,0.64,3.0,183.0,3.0,0.0,0.0,0.0,hr,medium
4,,0.66,6.0,174.0,3.0,0.0,0.0,0.0,sales,medium


In [6]:
len(data)

14999

In [7]:
reload(impyte) # pick up local edits to the module; on Python 3, run `from importlib import reload` first
imputer = impyte.Imputer() # instantiate impyte class
nan_checker = impyte.NanChecker()
pattern_log = impyte.Pattern()

### Timing some of the functions
Below are timings for some of the functions. They don't reflect the algorithmic complexity of the functions, but re-running the cells gives a runtime estimate for your local machine.

In [8]:
%timeit nan_checker.is_nan(["", 'None', 'NaN'])
%timeit nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

The slowest run took 4.31 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.87 µs per loop
100000 loops, best of 3: 5.92 µs per loop
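
The two calls above differ only in the optional `nan_vals` argument, which appears to list the values that should count as missing. What `is_nan` returns is not shown here, so the following is only a sketch of the idea: the function name, the default values, and the "one boolean per element" return shape are all assumptions.

```python
import math

# Assumed defaults; the real defaults of impyte's NanChecker are not documented here
DEFAULT_NAN_VALS = ("", None)


def is_nan_sketch(values, nan_vals=DEFAULT_NAN_VALS):
    """Return one boolean per value: True where the value counts as missing."""
    def missing(value):
        # A float NaN is always treated as missing
        if isinstance(value, float) and math.isnan(value):
            return True
        # Everything else is missing only if it is listed in nan_vals
        return value in nan_vals

    return [missing(v) for v in values]


print(is_nan_sketch(["", "None", "NaN"]))
print(is_nan_sketch(["", "None", "NaN"], nan_vals=["", None, "None", "NaN"]))
```

With the assumed defaults, the strings `'None'` and `'NaN'` are ordinary values; they only count as missing once they are passed in `nan_vals`.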


In [9]:
# Time the NaN pattern computation for a single row
%timeit pattern_log.row_nan_pattern(['yes', 'no', '', None])

100000 loops, best of 3: 8.4 µs per loop
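
A row's NaN pattern can be thought of as a fixed-length marker sequence that records which positions are filled. The sketch below is one hedged way to build such a pattern and to count how often each one occurs. The name `row_pattern`, the 1/0 markers, and the default `nan_vals` are assumptions; the output format of `row_nan_pattern` itself is not shown in this document.

```python
from collections import Counter


def row_pattern(row, nan_vals=("", None)):
    """Return a tuple with 1 for a present value and 0 for a missing one."""
    # `v != v` is True only for float NaN
    return tuple(0 if (v in nan_vals or v != v) else 1 for v in row)


rows = [
    ["yes", "no", "", None],
    ["yes", "no", "maybe", 3],
    ["no", "yes", "", None],
]

# Tuples are hashable, so identical patterns can be counted directly
pattern_counts = Counter(row_pattern(r) for r in rows)
print(pattern_counts)
```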


In [74]:
%timeit pattern_log.print_pattern(data)

1 loop, best of 3: 1.45 s per loop


In [75]:
%timeit pattern_log.compute_pattern(data)

1 loop, best of 3: 3.46 s per loop
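
`%timeit` is an IPython magic and is not available in a plain Python script. The standard-library `timeit` module gives comparable measurements there. The sketch below times a stand-in function, `count_missing`, which is hypothetical and not part of `impyte`; swap in the call you want to measure.

```python
import timeit


def count_missing(values, nan_vals=("", None)):
    # Stand-in workload (hypothetical), used only to have something to time
    return sum(1 for v in values if v in nan_vals)


row = ["yes", "no", "", None]
loops = 10000

# repeat() returns the total time of each run; keep the fastest run
best = min(timeit.repeat(lambda: count_missing(row), number=loops, repeat=3))
per_loop_us = best / loops * 1e6
print("%d loops, best of 3: %.2f us per loop" % (loops, per_loop_us))
```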


In [76]:
pattern_log.compute_pattern(data)['table']

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
9,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
4,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
10,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
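
In the table above, each row is one distinct missingness pattern: `1.0` marks a column that is filled, an empty cell marks a missing column, and `Count` is the number of data rows with that pattern. A table of this shape can be approximated with plain pandas, as in the sketch below. This is an illustration under assumptions, not the code behind `compute_pattern`; the function name `compute_pattern_sketch` is made up.

```python
import pandas as pd


def compute_pattern_sketch(df):
    """Count how often each present/missing combination occurs in df."""
    cols = list(df.columns)
    # 1.0 where a value is present, 0.0 where it is missing
    marks = df.notna().astype(float)
    table = (
        marks.groupby(cols).size()
        .reset_index(name="Count")
        .sort_values("Count", ascending=False, kind="stable")
    )
    # Display missing positions as NaN instead of 0.0, as in the table above
    table[cols] = table[cols].where(table[cols] == 1.0)
    return table


nan = float("nan")
demo = pd.DataFrame({
    "a": [1.0, nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
})
patterns = compute_pattern_sketch(demo)
print(patterns)
```

Because the sort happens after the patterns are numbered, the index is no longer in ascending order, which is also the case in the output above.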


### Load data into imputer

In [85]:
reload(impyte)
imputer.load_data(data)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.72,1.00,4.0,169.0,3.0,0.0,1.0,0.0,sales,medium
1,0.92,1.00,3.0,212.0,2.0,0.0,0.0,0.0,support,low
2,0.72,1.00,4.0,,5.0,0.0,1.0,0.0,sales,low
3,0.50,0.90,5.0,226.0,2.0,0.0,0.0,0.0,management,high
4,0.53,0.92,3.0,199.0,2.0,0.0,0.0,0.0,hr,medium
5,0.49,0.67,4.0,185.0,2.0,0.0,0.0,0.0,sales,low
6,0.53,0.45,4.0,180.0,3.0,0.0,0.0,0.0,technical,medium
7,0.21,0.53,3.0,229.0,5.0,0.0,0.0,0.0,accounting,medium
8,0.73,0.60,4.0,222.0,3.0,0.0,0.0,0.0,technical,medium
9,0.49,0.95,4.0,156.0,2.0,0.0,0.0,0.0,technical,medium


In [86]:
%timeit imputer.pattern()

1 loop, best of 3: 3.35 s per loop


In [87]:
%timeit imputer.get_pattern(4)

100 loops, best of 3: 1.62 ms per loop


In [88]:
reload(impyte)
imputer = impyte.Imputer(data) # instantiate impyte class
nan_checker = impyte.NanChecker()
pattern_log = impyte.Pattern()
imputer.pattern()
imputer.drop_pattern(4)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.72,1.00,4.0,169.0,3.0,0.0,1.0,0.0,sales,medium
1,0.92,1.00,3.0,212.0,2.0,0.0,0.0,0.0,support,low
2,0.72,1.00,4.0,,5.0,0.0,1.0,0.0,sales,low
3,0.50,0.90,5.0,226.0,2.0,0.0,0.0,0.0,management,high
4,0.53,0.92,3.0,199.0,2.0,0.0,0.0,0.0,hr,medium
5,0.49,0.67,4.0,185.0,2.0,0.0,0.0,0.0,sales,low
6,0.53,0.45,4.0,180.0,3.0,0.0,0.0,0.0,technical,medium
7,0.21,0.53,3.0,229.0,5.0,0.0,0.0,0.0,accounting,medium
8,0.73,0.60,4.0,222.0,3.0,0.0,0.0,0.0,technical,medium
9,0.49,0.95,4.0,156.0,2.0,0.0,0.0,0.0,technical,medium


In [89]:
imputer.load_data(data)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.72,1.00,4.0,169.0,3.0,0.0,1.0,0.0,sales,medium
1,0.92,1.00,3.0,212.0,2.0,0.0,0.0,0.0,support,low
2,0.72,1.00,4.0,,5.0,0.0,1.0,0.0,sales,low
3,0.50,0.90,5.0,226.0,2.0,0.0,0.0,0.0,management,high
4,0.53,0.92,3.0,199.0,2.0,0.0,0.0,0.0,hr,medium
5,0.49,0.67,4.0,185.0,2.0,0.0,0.0,0.0,sales,low
6,0.53,0.45,4.0,180.0,3.0,0.0,0.0,0.0,technical,medium
7,0.21,0.53,3.0,229.0,5.0,0.0,0.0,0.0,accounting,medium
8,0.73,0.60,4.0,222.0,3.0,0.0,0.0,0.0,technical,medium
9,0.49,0.95,4.0,156.0,2.0,0.0,0.0,0.0,technical,medium


In [90]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
9,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
4,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
10,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374


In [91]:
imputer.drop_pattern(4)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.72,1.00,4.0,169.0,3.0,0.0,1.0,0.0,sales,medium
1,0.92,1.00,3.0,212.0,2.0,0.0,0.0,0.0,support,low
2,0.72,1.00,4.0,,5.0,0.0,1.0,0.0,sales,low
3,0.50,0.90,5.0,226.0,2.0,0.0,0.0,0.0,management,high
4,0.53,0.92,3.0,199.0,2.0,0.0,0.0,0.0,hr,medium
5,0.49,0.67,4.0,185.0,2.0,0.0,0.0,0.0,sales,low
6,0.53,0.45,4.0,180.0,3.0,0.0,0.0,0.0,technical,medium
7,0.21,0.53,3.0,229.0,5.0,0.0,0.0,0.0,accounting,medium
8,0.73,0.60,4.0,222.0,3.0,0.0,0.0,0.0,technical,medium
9,0.49,0.95,4.0,156.0,2.0,0.0,0.0,0.0,technical,medium


In [92]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
7,,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
8,1.0,1,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
6,1.0,1,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
4,1.0,1,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
9,1.0,1,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
5,1.0,1,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
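
Comparing the pattern tables before and after `drop_pattern(4)` shows two things: the pattern with `last_evaluation` missing is gone, and the remaining patterns have been renumbered (the `satisfaction_level` gap moved from index 8 to index 7). An index read from an older table may therefore point at a different pattern. The sketch below reproduces this behaviour with plain pandas; `pattern_table` and `drop_pattern_sketch` are hypothetical names, and the numbering scheme is an assumption rather than impyte's own.

```python
import pandas as pd


def pattern_table(df):
    """One row per present/missing combination, numbered from 0 (1 = present)."""
    cols = list(df.columns)
    marks = df.notna().astype(int)
    return marks.groupby(cols).size().reset_index(name="Count")


def drop_pattern_sketch(df, pattern_index):
    """Return df without the rows that match the pattern at `pattern_index`."""
    table = pattern_table(df)
    wanted = table.loc[pattern_index, list(df.columns)].astype(bool)
    # A row matches when its present/missing flags equal the pattern in every column
    matches = df.notna().eq(wanted, axis="columns").all(axis=1)
    return df[~matches]


nan = float("nan")
demo = pd.DataFrame({
    "a": [1.0, nan, 3.0, 4.0],
    "b": [nan, 2.0, 3.0, 4.0],
})

before = pattern_table(demo)            # index 0: `a` missing, index 1: `b` missing
smaller = drop_pattern_sketch(demo, 0)  # drop the rows where `a` is missing
after = pattern_table(smaller)          # `b` missing is now index 0
print(before)
print(after)
```

Re-running `imputer.pattern()` before each `drop_pattern` call, as the cells here do, avoids dropping the wrong pattern.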


In [93]:
imputer.drop_pattern(7)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.72,1.00,4.0,169.0,3.0,0.0,1.0,0.0,sales,medium
1,0.92,1.00,3.0,212.0,2.0,0.0,0.0,0.0,support,low
2,0.72,1.00,4.0,,5.0,0.0,1.0,0.0,sales,low
3,0.50,0.90,5.0,226.0,2.0,0.0,0.0,0.0,management,high
4,0.53,0.92,3.0,199.0,2.0,0.0,0.0,0.0,hr,medium
5,0.49,0.67,4.0,185.0,2.0,0.0,0.0,0.0,sales,low
6,0.53,0.45,4.0,180.0,3.0,0.0,0.0,0.0,technical,medium
7,0.21,0.53,3.0,229.0,5.0,0.0,0.0,0.0,accounting,medium
8,0.73,0.60,4.0,222.0,3.0,0.0,0.0,0.0,technical,medium
9,0.49,0.95,4.0,156.0,2.0,0.0,0.0,0.0,technical,medium


In [94]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1,1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1,1,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1,1,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
6,1,1,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
4,1,1,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374
8,1,1,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,374
5,1,1,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
