# Impyte Documentation
This is a first practical attempt to clarify the usage of `impyte`. It is a collection of easily applicable and reproducible examples that you can use to simplify your data processing workflow.

## Importing and generating sample data
In order to show some of the features of the library, we'll be using Kaggle's HR data that can be found [here](https://www.kaggle.com/ludobenistant/hr-analytics).

In [1]:
# import library and data set
import impyte
import pandas as pd
from tools.data_prep import remove_random

In [2]:
data = pd.read_csv('data/hr_test.csv')

In [3]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


## Add random missing values
To show some of the pattern visualization and imputation methods, we need an incomplete data set. To get one, we'll use the helper function `remove_random`, which deletes values at random. It is called with a fraction of `.25`; in the pattern tables further down, 3,740 of the 14,999 rows (about 25 %) end up with one missing value, 374 per column.

In [4]:
data = remove_random(data, .25)

In [5]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.88,3.0,153.0,3.0,0.0,0.0,0.0,technical,low
1,,0.8,3.0,263.0,3.0,0.0,0.0,0.0,marketing,medium
2,,0.61,4.0,147.0,3.0,0.0,0.0,0.0,accounting,medium
3,,0.48,3.0,211.0,7.0,0.0,0.0,0.0,sales,medium
4,,0.71,5.0,265.0,2.0,0.0,0.0,1.0,sales,medium

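The source of `remove_random` is not shown here, so the following is only a minimal plain-Python sketch of the idea, not the helper's actual implementation. The name `remove_random_sketch`, the `seed` parameter, and the choice to blank exactly one value per affected row are assumptions made for illustration.

```python
import random


def remove_random_sketch(rows, fraction, seed=None):
    """Return a copy of rows where a fraction of them has one value set to None.

    Hypothetical stand-in for tools.data_prep.remove_random; the real helper
    may select and delete values differently.
    """
    rng = random.Random(seed)
    result = [list(row) for row in rows]  # copy so the input stays untouched
    n_affected = int(len(result) * fraction)
    # pick distinct row positions, then blank one random column in each
    for row_idx in rng.sample(range(len(result)), n_affected):
        col_idx = rng.randrange(len(result[row_idx]))
        result[row_idx][col_idx] = None
    return result


rows = [[0.38, 0.53, 2, "sales", "low"] for _ in range(100)]
incomplete = remove_random_sketch(rows, 0.25, seed=42)
n_incomplete = sum(1 for row in incomplete if None in row)
print(n_incomplete)  # 25 of the 100 rows now contain one missing value
```

Working on a copy keeps the original rows available for comparing imputed values against the true ones later.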

In [6]:
data.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0
mean,0.612628,0.716301,3.803556,201.022359,3.498872,0.144889,0.238906,0.021128
std,0.248868,0.171224,1.232165,49.976471,1.460321,0.352001,0.42643,0.143817
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [7]:
len(data)

14999

In [8]:
reload(impyte)
imputer = impyte.Imputer() # instantiate impyte class
nan_checker = impyte.NanChecker()
pattern_log = impyte.Pattern()

### Testing rudimentary features
Below is a first trial run of some of the features implemented by `impyte` and its helper classes.

#### NanChecker.is_nan( )
Detects NaN values in a list and returns one boolean per element. A custom list of values that should be treated as NaN can be passed via the `nan_vals` parameter.

In [9]:
# expected: [True, False, False]
print(nan_checker.is_nan(["", 'None', 'NaN']))

# expected: [True, True, True]
print(nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN']))

[True, False, False]
[True, True, True]

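To make the effect of `nan_vals` concrete, here is a minimal sketch that reproduces the two outputs above in plain Python. It is not the code of `NanChecker`: the function name `is_nan_sketch` and the default list `DEFAULT_NAN_VALS` are assumptions, chosen only because the output shows that `""` counts as missing by default while the strings `'None'` and `'NaN'` do not.

```python
import math

# Assumed defaults: the documented output only confirms that "" counts as
# missing while the strings 'None' and 'NaN' do not.
DEFAULT_NAN_VALS = ["", None]


def is_nan_sketch(values, nan_vals=None):
    """Return one boolean per value: True if the value counts as missing."""
    if nan_vals is None:
        nan_vals = DEFAULT_NAN_VALS
    flags = []
    for value in values:
        # float('nan') never equals itself, so it needs an explicit check
        is_float_nan = isinstance(value, float) and math.isnan(value)
        flags.append(is_float_nan or value in nan_vals)
    return flags


default_flags = is_nan_sketch(["", 'None', 'NaN'])
custom_flags = is_nan_sketch(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])
print(default_flags)  # [True, False, False]
print(custom_flags)   # [True, True, True]
```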

#### Pattern.row_nan_pattern( )
This method is the core of pattern detection. For a single row, it returns a tuple with `1` for each value that is present and `'NaN'` for each value that is missing. Other methods use the output of `row_nan_pattern()` to identify the unique patterns and count them before turning the result into a readable table.

In [10]:
# expected: (1, 1, 'NaN', 'NaN')
print(pattern_log.row_nan_pattern(['yes', 'no', '', None]))

(1, 1, 'NaN', 'NaN')

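The mapping from a row to its pattern can be sketched in a few lines of plain Python. This is an illustration that matches the output above, not the library's implementation; the name `row_nan_pattern_sketch` and the default `nan_vals` tuple are assumptions.

```python
def row_nan_pattern_sketch(row, nan_vals=("", None)):
    """Map a row to a tuple: 1 for a present value, 'NaN' for a missing one."""
    return tuple("NaN" if value in nan_vals else 1 for value in row)


pattern = row_nan_pattern_sketch(['yes', 'no', '', None])
print(pattern)  # (1, 1, 'NaN', 'NaN')
```

Returning a tuple rather than a list makes the pattern hashable, so it can serve directly as a dictionary key when patterns are counted.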

### Timing some of the functions
The timings below give a rough idea of how fast the functions are. They don't reflect the computational complexity of the functions, but they give a runtime estimate when run on your local machine.

In [11]:
%timeit nan_checker.is_nan(["", 'None', 'NaN'])
%timeit nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

100000 loops, best of 3: 4.74 µs per loop
100000 loops, best of 3: 4.9 µs per loop


In [12]:
# time row_nan_pattern on a single row
%timeit pattern_log.row_nan_pattern(['yes', 'no', '', None])

The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.02 µs per loop


In [13]:
%timeit pattern_log.print_pattern(data)

1 loop, best of 3: 748 ms per loop


In [14]:
%timeit pattern_log.compute_pattern(data)

1 loop, best of 3: 1.71 s per loop

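The `%timeit` magic only works inside IPython or Jupyter. Outside of a notebook, a comparable "best of 3" measurement can be taken with the standard-library `timeit` module. The sketch below times a hypothetical stand-in function, `row_pattern`, rather than `impyte` itself; the loop count of 10,000 is an arbitrary choice.

```python
import timeit


def row_pattern(row):
    # hypothetical stand-in for the function being timed
    return tuple("NaN" if value in ("", None) else 1 for value in row)


n_loops = 10000
# repeat=3 mirrors the "best of 3" reported by %timeit
runs = timeit.repeat(
    lambda: row_pattern(['yes', 'no', '', None]),
    repeat=3,
    number=n_loops,
)
best_per_loop = min(runs) / n_loops
print("best of 3: %.2f us per loop" % (best_per_loop * 1e6))
```

Taking the minimum of the runs, as `%timeit` does, reduces the influence of other processes competing for the CPU.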

#### Pattern.compute_pattern( )
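Computes the missing-value patterns of a data set and returns the pattern structure together with the number of rows that match each pattern. To get an easily digestible table, add the `['table']` selector to the output. In the table, `1.0` marks a value that is present, an empty cell marks a missing value, and `Count` is the number of rows with that pattern.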

In [15]:
pattern_log.compute_pattern(data)['table']

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374

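Counting patterns boils down to mapping every row to its pattern and tallying identical patterns. The following is a minimal sketch of that idea with `collections.Counter` on plain lists; it is not how `compute_pattern` is implemented, and the name `compute_pattern_sketch` and the sample rows are made up for illustration.

```python
from collections import Counter


def compute_pattern_sketch(rows, nan_vals=("", None)):
    """Count how many rows share each missing-value pattern."""
    counts = Counter(
        tuple("NaN" if value in nan_vals else 1 for value in row)
        for row in rows
    )
    # most frequent pattern first, similar to the table above
    return counts.most_common()


rows = [
    [0.38, 0.53, "sales"],
    [None, 0.86, "sales"],
    [0.11, 0.88, "technical"],
    [None, 0.52, "support"],
    [0.72, 0.87, None],
    [0.37, 0.52, "sales"],
]
pattern_counts = compute_pattern_sketch(rows)
for pattern, count in pattern_counts:
    print(pattern, count)
# (1, 1, 1) 3
# ('NaN', 1, 1) 2
# (1, 1, 'NaN') 1
```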

### Load data into imputer

In [16]:
reload(impyte)
imputer.load_data(data)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.88,3.0,153.0,3.0,0.0,0.0,0.0,technical,low
1,,0.80,3.0,263.0,3.0,0.0,0.0,0.0,marketing,medium
2,,0.61,4.0,147.0,3.0,0.0,0.0,0.0,accounting,medium
3,,0.48,3.0,211.0,7.0,0.0,0.0,0.0,sales,medium
4,,0.71,5.0,265.0,2.0,0.0,0.0,1.0,sales,medium
5,,0.44,4.0,166.0,4.0,1.0,0.0,0.0,support,low
6,,0.54,4.0,191.0,2.0,1.0,0.0,1.0,sales,medium
7,,0.96,6.0,310.0,4.0,0.0,1.0,0.0,technical,low
8,,0.62,5.0,254.0,5.0,1.0,0.0,0.0,sales,low
9,,0.84,6.0,309.0,4.0,0.0,1.0,0.0,support,low


In [17]:
%timeit imputer.pattern()

1 loop, best of 3: 1.78 s per loop


In [18]:
%timeit imputer.get_pattern(4)

The slowest run took 6.55 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 704 µs per loop


In [40]:
reload(impyte)
imputer = impyte.Imputer(data) # instantiate impyte class
nan_checker = impyte.NanChecker()
pattern_log = impyte.Pattern()
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [41]:
_ = imputer.drop_pattern(4, inplace=True)

In [42]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1,1.0,1.0,,1.0,1.0,374
0,,1.0,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1,,1.0,1.0,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1,1.0,,1.0,1.0,1.0,374


In [43]:
_ = imputer.load_data(data)

In [44]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [45]:
_ = imputer.drop_pattern(4)

In [46]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374

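Judging from the outputs above, `drop_pattern(4, inplace=True)` removes pattern 4 from the data held by the imputer, while `drop_pattern(4)` without `inplace` leaves the stored data unchanged. The sketch below mimics only this observed behaviour. The class `ImputerSketch` is hypothetical, it identifies a pattern by its tuple instead of by a table index, and its return value is an assumption.

```python
class ImputerSketch(object):
    """Hypothetical stand-in that only mimics the observed inplace behaviour."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    @staticmethod
    def _pattern(row):
        return tuple("NaN" if value in ("", None) else 1 for value in row)

    def drop_pattern(self, pattern, inplace=False):
        """Return the rows that do not match pattern; store them if inplace."""
        kept = [row for row in self.rows if self._pattern(row) != pattern]
        if inplace:
            self.rows = kept
        return kept


sketch = ImputerSketch([[1, 2], [None, 2], [1, None]])

preview = sketch.drop_pattern(("NaN", 1))      # stored rows stay untouched
rows_after_preview = len(sketch.rows)

sketch.drop_pattern(("NaN", 1), inplace=True)  # stored rows are replaced
rows_after_inplace = len(sketch.rows)

print(rows_after_preview, rows_after_inplace)  # 3 2
```

Defaulting to `inplace=False` follows the convention known from pandas: a call without the flag is safe to use for a preview and cannot silently discard data.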

In [47]:
imputer.drop_pattern(7, inplace=True)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.88,3.0,153.0,3.0,0.0,0.0,0.0,technical,low
1,,0.80,3.0,263.0,3.0,0.0,0.0,0.0,marketing,medium
2,,0.61,4.0,147.0,3.0,0.0,0.0,0.0,accounting,medium
3,,0.48,3.0,211.0,7.0,0.0,0.0,0.0,sales,medium
4,,0.71,5.0,265.0,2.0,0.0,0.0,1.0,sales,medium
5,,0.44,4.0,166.0,4.0,1.0,0.0,0.0,support,low
6,,0.54,4.0,191.0,2.0,1.0,0.0,1.0,sales,medium
7,,0.96,6.0,310.0,4.0,0.0,1.0,0.0,technical,low
8,,0.62,5.0,254.0,5.0,1.0,0.0,0.0,sales,low
9,,0.84,6.0,309.0,4.0,0.0,1.0,0.0,support,low


In [48]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,1.0,11259
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,,1.0,374
3,1.0,1.0,1.0,,1.0,1.0,1.0,1,1.0,1.0,374
2,1.0,1.0,,1.0,1.0,1.0,1.0,1,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,,374
4,1.0,1.0,1.0,1.0,,1.0,1.0,1,1.0,1.0,374
0,,1.0,1.0,1.0,1.0,1.0,1.0,1,1.0,1.0,374
1,1.0,,1.0,1.0,1.0,1.0,1.0,1,1.0,1.0,374
5,1.0,1.0,1.0,1.0,1.0,,1.0,1,1.0,1.0,374
6,1.0,1.0,1.0,1.0,1.0,1.0,,1,1.0,1.0,374


In [44]:
imputer.impute()

(['satisfaction_level',
  'last_evaluation',
  'number_project',
  'average_montly_hours',
  'time_spend_company',
  'Work_accident',
  'left',
  'promotion_last_5years'],
 ['sales', 'salary'])

In [45]:
data.tail()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
14994,0.52,0.94,3.0,263.0,3.0,0.0,0.0,0.0,product_mng,medium
14995,0.43,0.44,5.0,213.0,3.0,0.0,0.0,0.0,sales,low
14996,0.45,0.49,2.0,134.0,3.0,0.0,1.0,0.0,RandD,medium
14997,0.1,0.9,7.0,281.0,4.0,0.0,1.0,0.0,sales,low
14998,0.54,0.67,2.0,129.0,3.0,1.0,0.0,0.0,sales,low


In [35]:
nnna = {}

In [39]:
if not nnna:
    print("yes")

yes
