# Impyte Documentation
This is a first practical attempt to clarify the usage of `impyte`. It's a collection of easily applicable and reproducible examples that you can use to simplify your data processing workflow.

## Importing and generating sample data
In order to show some of the features of the library, we'll be using Kaggle's HR data that can be found [here](https://www.kaggle.com/ludobenistant/hr-analytics).

In [92]:
# import library and data set
import impyte
import pandas as pd
from tools.data_prep import remove_random

In [93]:
data = pd.read_csv('data/hr_test.csv')

In [94]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


## Add random missing values
To demonstrate the pattern visualization and imputation methods, we need an incomplete data set. To get one, we use a helper function that deletes values at random. With the settings below, 25 % of the rows lose one value each, spread evenly across all columns.

In [95]:
data = remove_random(data, .25, randomize_index=False)

In [96]:
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.87,4.0,129.0,5.0,0.0,0.0,0.0,support,medium
1,,0.95,5.0,149.0,2.0,0.0,0.0,0.0,technical,low
2,,0.85,7.0,307.0,4.0,0.0,1.0,0.0,management,low
3,,0.84,5.0,303.0,5.0,0.0,1.0,0.0,accounting,medium
4,,0.87,6.0,262.0,6.0,0.0,0.0,0.0,sales,high
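
The source of `remove_random` isn't shown in this document. As a rough idea of what such a helper could do, the plain-Python sketch below blanks one value in a given fraction of rows and cycles through the columns so that each column loses about the same number of values. The name `remove_random_sketch`, its parameters, and the round-robin strategy are assumptions for illustration, not the helper's actual implementation.

```python
import random


def remove_random_sketch(rows, fraction, seed=None):
    """Return a copy of `rows` with one value set to None in `fraction` of the rows.

    `rows` is a list of dicts sharing the same keys (a hypothetical
    stand-in for a DataFrame).
    """
    rng = random.Random(seed)
    result = [dict(row) for row in rows]  # leave the input untouched
    columns = list(result[0].keys()) if result else []
    n_remove = int(len(result) * fraction)
    # Pick the affected rows at random, then cycle through the columns.
    chosen = rng.sample(range(len(result)), n_remove)
    for i, row_idx in enumerate(chosen):
        result[row_idx][columns[i % len(columns)]] = None
    return result


rows = [{"a": i, "b": i * 2.0, "c": "x"} for i in range(100)]
damaged = remove_random_sketch(rows, 0.25, seed=0)
n_incomplete = sum(any(v is None for v in row.values()) for row in damaged)
print(n_incomplete)  # 25 rows now contain exactly one missing value
```

Because every affected row loses exactly one value, each incomplete row ends up in a pattern with a single gap, which matches the pattern tables shown further down.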


In [97]:
data.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0,14625.0
mean,0.612939,0.716557,3.80506,201.075624,3.495521,0.145915,0.237812,0.021265
std,0.248681,0.171202,1.233292,49.941275,1.455249,0.353033,0.425758,0.144271
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [98]:
len(data)

14999

In [99]:
reload(impyte)

<module 'impyte' from 'impyte.py'>

## Testing rudimentary features
Below is a first trial run of some of the features implemented by `Impyter` and its helper classes.

## `NanChecker`
Functionality testing of the `NanChecker` class.

In [100]:
nan_checker = impyte.NanChecker()

#### `NanChecker.is_nan(data, nan_vals=None, recursive=True)`
Detects missing values (NaN in numeric arrays, empty strings in string arrays). Custom values that should count as missing can be passed through `nan_vals`, and with `recursive=True` nested lists are checked element by element.

In [101]:
# [True, False, False]
print nan_checker.is_nan(["", 'None', 'NaN'])

# [True, True, True]
print nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

[True, False, False]
[True, True, True]


In [102]:
# Recursive nan detection
# [True, True, False, [False, True, True]]
print nan_checker.is_nan(["", None, 'NaN', ["List Value 1", '', None]])

[True, True, False, [False, True, True]]


In [103]:
# Values can be declared as nan-values
# [True, False, False, True]
nan_checker.is_nan(['NaN', 'Empty', 'None', 'N/A'], nan_vals=['NaN', 'N/A'])

[True, False, False, True]
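
The behaviour shown in these cells can be reproduced with a few lines of plain Python. The sketch below is not impyte's code. It assumes that the default markers are the empty string and `None` (which is what the outputs above suggest) and that a custom `nan_vals` list replaces those defaults. The name `is_nan_sketch` is hypothetical.

```python
import math

DEFAULT_NAN_VALS = ("", None)  # assumption, inferred from the outputs above


def is_nan_sketch(data, nan_vals=None, recursive=True):
    """Flag missing entries in a (possibly nested) list."""
    markers = DEFAULT_NAN_VALS if nan_vals is None else tuple(nan_vals)

    def check(value):
        # Descend into nested lists when recursive checking is enabled.
        if recursive and isinstance(value, (list, tuple)):
            return [check(item) for item in value]
        # float('nan') never equals itself, so test for it explicitly.
        if isinstance(value, float) and math.isnan(value):
            return True
        return value in markers

    return [check(value) for value in data]


print(is_nan_sketch(["", "None", "NaN"]))
print(is_nan_sketch(["", None, "NaN", ["List Value 1", "", None]]))
print(is_nan_sketch(["NaN", "Empty", "None", "N/A"], nan_vals=["NaN", "N/A"]))
```

Note that the strings `'None'` and `'NaN'` only count as missing when they are listed in `nan_vals`.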

## `Pattern`
Functionality testing of the `Pattern` class. The `Pattern` class stores different patterns and data summaries regarding NaN values.

#### `Pattern._get_discrete_and_continuous(tmpdata)`
Returns the column names of discrete and continuous variables. The names are stored in two lists for easy selection, and both lists are held in one dictionary: all continuous column names can be accessed through `['continuous']`, all discrete ones through `['discrete']`.

In [104]:
reload(impyte)
pattern_log = impyte.Pattern()

In [105]:
pattern_log._get_discrete_and_continuous(data)

{'continuous': ['satisfaction_level',
  'last_evaluation',
  'number_project',
  'average_montly_hours',
  'time_spend_company',
  'Work_accident',
  'left',
  'promotion_last_5years'],
 'discrete': ['sales', 'salary']}
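
The output above suggests that the split follows the column type: numeric columns count as continuous, text columns as discrete. The following plain-Python sketch illustrates that idea on a dictionary of columns. The function name and the rule itself are assumptions, not the library's implementation.

```python
from numbers import Number


def split_discrete_continuous(columns):
    """Group column names by whether their non-missing values are all numeric.

    `columns` maps a column name to its list of values; None marks a gap.
    """
    result = {"continuous": [], "discrete": []}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in present
        )
        result["continuous" if numeric else "discrete"].append(name)
    return result


columns = {
    "satisfaction_level": [0.38, None, 0.11],
    "number_project": [2, 5, None],
    "sales": ["sales", None, "support"],
    "salary": ["low", "medium", "medium"],
}
print(split_discrete_continuous(columns))
```

Under this rule, numeric 0/1 columns such as `left` also land in the continuous group, just as in the output above.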

#### `Pattern._compute_pattern(data, nan_values="", verbose=False)`
Checks the data for missing values and computes the NaN-patterns it contains. The result holds the pattern structure, the number of rows that match each pattern, and the count of missing values per column. Select `["table"]` from the result for a readable summary table and `["indices"]` for the row indices of each pattern.

In [106]:
pattern_dict = pattern_log._compute_pattern(data)
indices, table = pattern_dict["indices"], pattern_dict["table"]

In [107]:
table

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [108]:
indices[0][:10] # first 10 indices of pattern 0

[3740, 3741, 3742, 3743, 3744, 3745, 3746, 3747, 3748, 3749]

In [109]:
pattern_nr = 0
print "Pattern {} has {:,} rows.".format(pattern_nr, len(indices[pattern_nr]))

Pattern 0 has 11,259 rows.


#### `Pattern.get_pattern()`
Returns the NaN-patterns. If the patterns have not been computed yet, `_compute_pattern()` is called first; otherwise the stored pattern information is returned.

In [110]:
pattern_log.get_pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


#### `Pattern.row_nan_pattern(row)`
This method is the core piece for determining which patterns exist. Other methods take the output of `row_nan_pattern()`, collect the unique pattern structures, and count them before turning them into readable tables.

In [111]:
# (1, 1, 'NaN', 'NaN')
print pattern_log.row_nan_pattern(['Value 1', 'Value 2', '', None])

(1, 1, 'NaN', 'NaN')
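
The pipeline described for `row_nan_pattern()` (row to pattern tuple, then unique patterns, then counts) can be sketched in plain Python as follows. The helper names are hypothetical and the code only mirrors the behaviour visible in the outputs. It is not taken from impyte.

```python
from collections import defaultdict

MISSING = ("", None)  # assumption: same default markers as in the NaN check


def row_pattern(row):
    """Map a row to a tuple with 1 for present and 'NaN' for missing values."""
    return tuple("NaN" if value in MISSING else 1 for value in row)


def group_by_pattern(rows):
    """Return {pattern: [row indices]}, most frequent pattern first."""
    indices = defaultdict(list)
    for idx, row in enumerate(rows):
        indices[row_pattern(row)].append(idx)
    return dict(sorted(indices.items(), key=lambda item: -len(item[1])))


rows = [
    [0.38, 2, "sales"],
    [None, 5, "sales"],
    [0.11, 7, ""],
    [0.72, 5, "hr"],
    [None, 3, "support"],
]
for pattern, idx in group_by_pattern(rows).items():
    print(pattern, len(idx), idx)
```

Keeping the row indices per pattern is what makes lookups such as "first 10 indices of pattern 0" cheap once the patterns have been computed.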


#### `Pattern.get_pattern_indices(pattern_no)`
Returns the row indices that belong to the pattern with the given number.

In [112]:
pattern_log.get_pattern_indices(0)[:10] # get first 10 indices of pattern 0

[3740, 3741, 3742, 3743, 3744, 3745, 3746, 3747, 3748, 3749]

#### `Pattern.get_continuous()`
Returns list with names of all continuous variables.

In [113]:
pattern_log.get_continuous()

['satisfaction_level',
 'last_evaluation',
 'number_project',
 'average_montly_hours',
 'time_spend_company',
 'Work_accident',
 'left',
 'promotion_last_5years']

#### `Pattern.get_discrete()`
Returns list with names of all discrete variables.

In [114]:
pattern_log.get_discrete()

['sales', 'salary']

## `Impyter`

In [115]:
reload(impyte)
imputer = impyte.Impyter() # instantiate impyte class

### Load data into imputer

In [116]:
_ = imputer.load_data(data)

In [117]:
%timeit imputer.pattern()

The slowest run took 145407.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 5.96 µs per loop


In [118]:
%timeit imputer.get_pattern(4)

1000 loops, best of 3: 813 µs per loop


In [119]:
reload(impyte)
imputer = impyte.Impyter(data) # instantiate impyte class
nan_checker = impyte.NanChecker()
pattern_log = impyte.Pattern()
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [120]:
_ = imputer.drop_pattern(4, inplace=True)

In [121]:
imputer.pattern_log.missing_per_column

[374, 374, 374, 374, 374, 374, 374, 374, 374, 374]

In [122]:
imputer.get_missing_summary()

Unnamed: 0,Complete,Missing,Percentage
satisfaction_level,14251,374,2.56 %
last_evaluation,14251,374,2.56 %
number_project,14251,374,2.56 %
average_montly_hours,14251,374,2.56 %
time_spend_company,14251,374,2.56 %
Work_accident,14251,374,2.56 %
left,14251,374,2.56 %
promotion_last_5years,14251,374,2.56 %
sales,14251,374,2.56 %
salary,14251,374,2.56 %
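
The numbers in this summary are simple per-column counts. As a minimal sketch (plain Python, hypothetical function name, `None` standing in for a missing value), the percentage is the number of missing values divided by the total number of rows in the column:

```python
def missing_summary(columns):
    """Return (name, complete, missing, percentage) tuples, one per column."""
    summary = []
    for name, values in columns.items():
        missing = sum(1 for v in values if v is None)
        complete = len(values) - missing
        share = 100.0 * missing / len(values) if values else 0.0
        summary.append((name, complete, missing, "{:.2f} %".format(share)))
    return summary


# Same proportions as in the table above: 374 gaps among 14,625 rows.
columns = {"salary": ["low"] * 14251 + [None] * 374}
print(missing_summary(columns))  # [('salary', 14251, 374, '2.56 %')]
```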


In [123]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1,374
10,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1,374


In [124]:
_ = imputer.load_data(data)

In [125]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [126]:
_ = imputer.drop_pattern(4)

In [127]:
imputer.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11259
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,374
2,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,374
3,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,374
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,374
6,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,374
7,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
8,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,374
9,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,374


In [128]:
imputer.pattern_log.easy_access

{(1, 1, 1, 1, 1, 1, 1, 1, 1, 'NaN'): ['salary'],
 (1, 1, 1, 1, 1, 1, 1, 1, 'NaN', 1): ['sales'],
 (1, 1, 1, 1, 1, 1, 1, 'NaN', 1, 1): ['promotion_last_5years'],
 (1, 1, 1, 1, 1, 1, 'NaN', 1, 1, 1): ['left'],
 (1, 1, 1, 1, 1, 'NaN', 1, 1, 1, 1): ['Work_accident'],
 (1, 1, 1, 1, 'NaN', 1, 1, 1, 1, 1): ['time_spend_company'],
 (1, 1, 1, 'NaN', 1, 1, 1, 1, 1, 1): ['average_montly_hours'],
 (1, 1, 'NaN', 1, 1, 1, 1, 1, 1, 1): ['number_project'],
 (1, 'NaN', 1, 1, 1, 1, 1, 1, 1, 1): ['last_evaluation'],
 ('NaN', 1, 1, 1, 1, 1, 1, 1, 1, 1): ['satisfaction_level']}
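
The `easy_access` dictionary maps each pattern tuple to the columns that are missing in it. How impyte builds it is not shown here. One straightforward way to derive such a lookup from pattern tuples and column names is sketched below, with hypothetical helper names.

```python
def missing_columns(pattern, column_names):
    """Return the names of the columns flagged 'NaN' in a pattern tuple."""
    return [name for name, flag in zip(column_names, pattern) if flag == "NaN"]


def build_lookup(patterns, column_names):
    """Map every incomplete pattern to the list of its missing columns."""
    return {
        pattern: missing_columns(pattern, column_names)
        for pattern in patterns
        if "NaN" in pattern
    }


column_names = ["satisfaction_level", "sales", "salary"]
patterns = [(1, 1, 1), (1, 1, "NaN"), ("NaN", 1, 1), ("NaN", "NaN", 1)]
print(build_lookup(patterns, column_names))
```

The complete pattern `(1, 1, 1)` is left out because it has no column to impute.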

### Comparison of complete indices methods

In [129]:
%timeit len(imputer.pattern_log.get_complete_indices())

1000 loops, best of 3: 918 µs per loop


In [130]:
%timeit len(imputer.get_complete_old())

100 loops, best of 3: 5.84 ms per loop


In [131]:
_ = imputer.drop_pattern(7, inplace=True)

In [132]:
data.iloc[:, imputer.pattern_log.store_tuple_columns[(1, 1, 1, 1, 1, 1, 1, 1, 1, 'NaN')]].columns

Index([u'salary'], dtype='object')

In [133]:
data.tail()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
14994,0.59,0.57,5.0,257.0,2.0,0.0,0.0,0.0,sales,low
14995,0.1,0.79,7.0,310.0,4.0,0.0,1.0,0.0,hr,medium
14996,0.19,0.72,4.0,102.0,3.0,0.0,0.0,0.0,sales,medium
14997,0.1,0.8,6.0,264.0,4.0,0.0,1.0,0.0,technical,low
14998,0.81,0.5,4.0,170.0,4.0,0.0,0.0,0.0,support,low


### Timing some of the functions
Below are some timings that indicate how efficient the functions are. They don't reflect the complexity of the functions, but they give a runtime estimate for your local machine.

In [134]:
%timeit nan_checker.is_nan(["", 'None', 'NaN'])
%timeit nan_checker.is_nan(["", 'None', 'NaN'], nan_vals=['', None, 'None', 'NaN'])

100000 loops, best of 3: 4.9 µs per loop
100000 loops, best of 3: 5.09 µs per loop
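
`%timeit` is an IPython magic and only works inside a notebook or IPython shell. In a regular script, the standard library's `timeit` module gives a comparable "best of 3" figure. The function `is_missing` below is a hypothetical stand-in for whatever call you want to time. It is not part of impyte.

```python
import timeit


def is_missing(values, nan_vals=("", None)):
    """Hypothetical stand-in for the function being timed."""
    return [value in nan_vals for value in values]


number = 10000  # calls per measurement
runs = timeit.repeat(
    stmt=lambda: is_missing(["", "None", "NaN"]),
    repeat=3,
    number=number,
)
best_per_call = min(runs) / number
print("best of 3: {:.2f} µs per loop".format(best_per_call * 1e6))
```

Taking the minimum of the repeats, as `%timeit` does, reduces the influence of other processes running on the machine.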


### `one_hot_encode()`
Transforms categorical values into a one-hot-encoded data set.

In [135]:
reload(impyte)
imputer = impyte.Impyter()
imputer.load_data(data)
_ = imputer.pattern()

In [136]:
ohe_data = imputer.one_hot_encode(data)
ohe_data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales_ohe_IT,sales_ohe_RandD,...,sales_ohe_hr,sales_ohe_management,sales_ohe_marketing,sales_ohe_product_mng,sales_ohe_sales,sales_ohe_support,sales_ohe_technical,salary_ohe_high,salary_ohe_low,salary_ohe_medium
0,,0.87,4.0,129.0,5.0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,,0.95,5.0,149.0,2.0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,0,1,0,1,0
2,,0.85,7.0,307.0,4.0,0.0,1.0,0.0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,,0.84,5.0,303.0,5.0,0.0,1.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,,0.87,6.0,262.0,6.0,0.0,0.0,0.0,0,0,...,0,0,0,0,1,0,0,1,0,0


### `one_hot_decode()`
Transforms one-hot-encoded columns back to categorical values.

In [137]:
imputer.one_hot_decode(ohe_data).head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,,0.87,4.0,129.0,5.0,0.0,0.0,0.0,support,medium
1,,0.95,5.0,149.0,2.0,0.0,0.0,0.0,technical,low
2,,0.85,7.0,307.0,4.0,0.0,1.0,0.0,management,low
3,,0.84,5.0,303.0,5.0,0.0,1.0,0.0,accounting,medium
4,,0.87,6.0,262.0,6.0,0.0,0.0,0.0,sales,high
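
The encode and decode round trip can be illustrated without the library. The sketch below works on a list of dicts and reuses the `<column>_ohe_<category>` naming visible in the output above. Everything else (function names, the assumption that categorical values are complete strings) is illustrative and not taken from impyte.

```python
SEP = "_ohe_"  # separator taken from the column names in the output above


def one_hot_encode(rows, categorical):
    """Replace each categorical column with one 0/1 column per category."""
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for level in levels[col]:
                new_row[col + SEP + level] = int(row[col] == level)
        encoded.append(new_row)
    return encoded


def one_hot_decode(rows):
    """Collapse `<column>_ohe_<category>` columns back into one column."""
    decoded = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if SEP in key:
                col, level = key.split(SEP, 1)
                if value == 1:
                    new_row[col] = level
            else:
                new_row[key] = value
        decoded.append(new_row)
    return decoded


rows = [
    {"left": 0.0, "sales": "support", "salary": "medium"},
    {"left": 1.0, "sales": "technical", "salary": "low"},
]
encoded = one_hot_encode(rows, ["sales", "salary"])
print(encoded[0])
print(one_hot_decode(encoded) == rows)  # True
```

Encoding the separator into the column name is what makes decoding possible without any extra bookkeeping, at the cost of reserving `_ohe_` in column names.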


### `impute()`
`impute()` is the core method of impyte. It works out of the box and by default uses Random Forest estimators to impute missing values. It automatically runs cross-validation to indicate how accurate the imputation is likely to be.

In [159]:
data.dtypes

satisfaction_level       float64
last_evaluation          float64
number_project           float64
average_montly_hours     float64
time_spend_company       float64
Work_accident            float64
left                     float64
promotion_last_5years    float64
sales                     object
salary                    object
dtype: object

In [166]:
for col in data.columns:
    print cont_is_categorical(data[col])

False
False
False
False
False
True
True
True
False
True


In [172]:
reload(impyte)
imputer = impyte.Impyter()
imputer.load_data(data)
_ = imputer.pattern()
complete_df = imputer.impute(classifier='dt')

Label: sales 	 Fitting DecisionTreeClassifier
CV-Scores: [ 0.35135135  0.36573458  0.36234458  0.37216541  0.37989324]
Label: average_montly_hours 	 Fitting DecisionTreeRegressor
CV-Scores: [-0.07836237 -0.03201855 -0.11174284 -0.09918681 -0.07093816]
Label: number_project 	 Fitting DecisionTreeClassifier
CV-Scores: [ 0.50709849  0.51863354  0.5044405   0.50888889  0.51756336]
Label: salary 	 Fitting DecisionTreeClassifier
CV-Scores: [ 0.54860186  0.56882771  0.57415631  0.57707685  0.58196357]
Label: promotion_last_5years 	 Fitting DecisionTreeClassifier
CV-Scores: [ 0.97024867  0.96580817  0.96891652  0.97291297  0.9697912 ]
Label: time_spend_company 	 Fitting DecisionTreeClassifier
CV-Scores: [ 0.51929047  0.54727031  0.52975133  0.54933333  0.55002223]
Label: satisfaction_level 	 Fitting DecisionTreeRegressor
CV-Scores: [ 0.0319617   0.14864342  0.07286285  0.10683744  0.17800497]
Label: last_evaluation 	 Fitting DecisionTreeRegressor
CV-Scores: [-0.2111922  -0.17343372 -0.22220342

In [173]:
len(data['time_spend_company'].unique())

9

In [168]:
imputer.get_result()['promotion_last_5years'].unique()

array([ 0.,  1.])

### Accessing final models

In [73]:
imputer.model_log[2].get_model()

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [62]:
imputer.model_log[1].get_feature_name()

'sales'

### Investigate completed data set

In [147]:
imp = impyte.Impyter(complete_df)

In [148]:
imp.pattern()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary,Count
0,1,1,1,1,1,1,1,1,1,1,14999


In [50]:
data['last_evaluation'].describe()

count    14625.000000
mean         0.716410
std          0.171208
min          0.360000
25%          0.560000
50%          0.720000
75%          0.870000
max          1.000000
Name: last_evaluation, dtype: float64

### Scaling for preprocessing

In [44]:
from sklearn.preprocessing import StandardScaler

In [45]:
scaler = StandardScaler()

In [46]:
test = data.dropna()

In [47]:
len(test[test.apply(nan_checker.is_nan)])

11259

In [48]:
test.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,11259.0,11259.0,11259.0,11259.0,11259.0,11259.0,11259.0,11259.0
mean,0.611581,0.717509,3.807532,201.078337,3.503419,0.142997,0.240608,0.02105
std,0.248967,0.171942,1.23237,50.063233,1.456804,0.350085,0.427471,0.143557
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [75]:
test_scaled = scaler.fit_transform(test[test.corr().columns])

In [76]:
test[:5]

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
3740,0.9,0.74,5.0,249.0,3.0,0.0,0.0,0.0,support,medium
3741,0.89,0.52,4.0,189.0,3.0,0.0,0.0,0.0,support,medium
3742,0.58,0.87,3.0,268.0,2.0,0.0,0.0,0.0,sales,low
3743,0.62,0.5,4.0,156.0,2.0,0.0,0.0,0.0,support,medium
3744,0.27,0.75,5.0,264.0,3.0,0.0,0.0,0.0,technical,medium


In [83]:
pd.DataFrame(test_scaled[:5])

Unnamed: 0,0,1,2,3,4,5,6,7
0,1.1585,0.142626,0.978675,0.962305,-0.349958,-0.404773,-0.557824,-0.151309
1,1.118144,-1.143529,0.164571,-0.240267,-0.349958,-0.404773,-0.557824,-0.151309
2,-0.132901,0.902626,-0.649533,1.343119,-1.027781,-0.404773,-0.557824,-0.151309
3,0.028524,-1.260452,0.164571,-0.901681,-1.027781,-0.404773,-0.557824,-0.151309
4,-1.383945,0.201087,0.978675,1.262947,-0.349958,-0.404773,-0.557824,-0.151309


In [79]:
test_rescaled = scaler.inverse_transform(test_scaled)

In [82]:
pd.DataFrame(test_rescaled[:5])

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.9,0.74,5.0,249.0,3.0,0.0,0.0,0.0
1,0.89,0.52,4.0,189.0,3.0,0.0,0.0,0.0
2,0.58,0.87,3.0,268.0,2.0,0.0,0.0,0.0
3,0.62,0.5,4.0,156.0,2.0,0.0,0.0,0.0
4,0.27,0.75,5.0,264.0,3.0,0.0,0.0,0.0
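
The round trip above is ordinary z-score standardization: subtract the column mean, divide by the column's standard deviation, and reverse both steps to get the original values back. A minimal plain-Python sketch for one column follows. The function names are made up for this example, and the population standard deviation (dividing by n) is used because that is what `StandardScaler` computes.

```python
import math


def fit(values):
    """Return the mean and population standard deviation of a column."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std


def transform(values, mean, std):
    """Standardize values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Undo the standardization."""
    return [z * std + mean for z in scaled]


hours = [249.0, 189.0, 268.0, 156.0, 264.0]
mean, std = fit(hours)
scaled = transform(hours, mean, std)
restored = inverse_transform(scaled, mean, std)
print(scaled)
print(restored)
```

The scaled values differ from the table above because the statistics here come from five rows only, whereas the notebook fits the scaler on all 11,259 complete rows.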


In [131]:
reload(impyte)
testdf = pd.read_csv('data/masterdf_201710230_andirs.csv', low_memory=False)
imp = impyte.Impyter()
_ = imp.load_data(testdf)
imp.pattern()

Unnamed: 0.1,Unnamed: 0,Incident Date,EAS,Incident_Year,Incident_Cat,Incident_Dummy,Neighborhood,Location_y,Address,Building_Cat,...,count all complaints not corrected,count potential fire control not corrected,count fire emergency safety,count potential fire cause,count fire emergency safety not corrected,count potential fire cause not corrected,next_fire_dpt_address,next_fire_dpt_distance,next_fire_dpt_latlong,Count
0,1,,1,,,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,170517
1,1,1.0,1,1.0,1.0,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,24767
2,1,1.0,1,1.0,,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,24


In [132]:
imp.get_missing_summary()

Unnamed: 0,Complete,Missing,Percentage
Unnamed: 0,195308,0,0.00 %
next_fire_dpt_address,195308,0,0.00 %
count potential fire cause not corrected,195308,0,0.00 %
count fire emergency safety not corrected,195308,0,0.00 %
count potential fire cause,195308,0,0.00 %
count fire emergency safety,195308,0,0.00 %
count potential fire control not corrected,195308,0,0.00 %
count all complaints not corrected,195308,0,0.00 %
count all complaints,195308,0,0.00 %
count potential fire control,195308,0,0.00 %


In [133]:
imp.get_missing_summary(importance_filter=True)

Unnamed: 0,Complete,Missing,Percentage
Incident_Year,24791,170517,87.31 %
Incident Date,24791,170517,87.31 %
Incident_Cat,24767,170541,87.32 %
