## import local

In [1]:
from __future__ import print_function
__file__ = !cd .. ;pwd
__file__ = __file__[0]
__file__

'/Users/brucecottman/Documents/PROJECTS/paso'

In [2]:
import sys
from random import random
sys.path.append(__file__)
sys.path

['/Users/brucecottman/Documents/PROJECTS/paso/lessons',
 '/Users/brucecottman/anaconda3/envs/paso/lib/python37.zip',
 '/Users/brucecottman/anaconda3/envs/paso/lib/python3.7',
 '/Users/brucecottman/anaconda3/envs/paso/lib/python3.7/lib-dynload',
 '',
 '/Users/brucecottman/.local/lib/python3.7/site-packages',
 '/Users/brucecottman/anaconda3/envs/paso/lib/python3.7/site-packages',
 '/Users/brucecottman/anaconda3/envs/paso/lib/python3.7/site-packages/aeosa',
 '/Users/brucecottman/anaconda3/envs/paso/lib/python3.7/site-packages/IPython/extensions',
 '/Users/brucecottman/.ipython',
 '/Users/brucecottman/Documents/PROJECTS/paso']

## paso's Offering of Data Cleaners for your Machine or Deep Learning Projects

Data cleaning is a subject that is only lightly touched on in brick-and-mortar or online classes. However, in your work as a Data Engineer or Data Scientist you will spend a great deal of your time pre-processing your data so that it can be input into your model. Data cleaning is critical to any production service. Can we find a way to compose a ``sklearn pipeline`` to automate some of your data cleaning?

**Scikit-Learn pipelines** are composed of steps, and each step has to be a **Scikit-Learn transformer** or a custom **Scikit-Learn transformer**. The last step of the pipeline can be a transformer or an estimator, where a **Scikit-Learn estimator** is a compatible model.
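As a rough illustration of this structure, here is a minimal sketch of a pipeline with one custom transformer step followed by an estimator. The ``DropConstantColumns`` class and the toy data are hypothetical and are not part of **paso**; only the **Scikit-Learn** classes are real.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class DropConstantColumns(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: drop columns with one unique value."""

    def fit(self, X, y=None):
        # Remember which columns vary in the training data.
        self.keep_ = [c for c in X.columns if X[c].nunique() > 1]
        return self

    def transform(self, X):
        return X[self.keep_]


rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=50), "b": 1.0, "c": rng.normal(size=50)})
y = 2.0 * X["a"] - X["c"]

pipe = Pipeline([
    ("drop_constant", DropConstantColumns()),  # transformer step
    ("model", LinearRegression()),             # final step: an estimator
])
pipe.fit(X, y)

kept = pipe.named_steps["drop_constant"].keep_
print(kept)
```

The transformer learns what to keep in ``fit`` and applies it in ``transform``, so the same columns are dropped when the fitted pipeline later sees new data.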

**paso** has quite a few data cleaning transformers that are custom **Scikit-Learn transformers**. Note that *paso* is the Spanish word for "step".

**paso** is a package written in Python and some ``C`` (for speed) that was originally intended to bundle best practices and state-of-the-art services, classes and functions for the Machine Learning and Deep Learning community. **paso** has grown beyond this to offer patterns, classes and methods you can use in your **Scikit-Learn pipelines** or custom code, with or without adopting the entire **paso** package.

The **paso** package consists of a growing set of **paso** services that you can turn on and off for any of your Python projects. Also included are new state-of-the-art classes and methods that are not yet available in **Scikit-Learn** but are compatible with all of **Scikit-Learn**.

**paso** will supply many, but not all, of the tools you need to clean your data. Just because **paso** supplies a tool does not mean you should use it on your dataset. For example, removing rows that have a high density of ``NAs`` will usually, but not always, result in a worse loss metric or unwanted bias. Iterating over different cleaning strategies (pipelines) is an important goal of **paso**.

Discussion will be divided into the following major segments:
- What data cleaning behaviors are currently available in the **paso** package.
- How to add a data cleaning **Scikit-Learn pipeline** to your code.
- An overview of the source code for the x, y, z, a and e classes. You are free to use and modify this code.

As we saw in [lesson-1](https://github.com/bcottman/paso/blob/master/lessons/lesson_1.ipynb), we need to start up the **paso** services.

In [3]:
import numpy as np
import pandas as pd
from pandas_summary import DataFrameSummary

import warnings
warnings.filterwarnings("ignore")

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib import cm

%reload_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib
import seaborn as sns

from paso.base import Paso,Log,PasoError
from loguru import logger
session = Paso(parameters_filepath='../parameters/lesson.2.yaml').startup()

paso 15.7.2019 17:07:38 INFO Log started
paso 15.7.2019 17:07:38 INFO Read in parameter file: ../parameters/lesson.2.yaml


Next, we load the ``boston`` data set into the ``City`` dataframe. We will munge ``City`` up to show what the **paso** cleaners can do.

In [4]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older release.
from sklearn.datasets import load_boston
boston = load_boston()

City = pd.DataFrame(boston.data, columns = boston.feature_names )
City['MEDV'] = boston.target
logger.info(boston.DESCR)
City.head()

paso 15.7.2019 17:07:38 INFO .. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per 

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


Now we can have some fun and dirty the data.
1. First, add the feature ``asv``, which has the same value in every row.
1. Second, add the feature ``saCRIM``, whose values are the same as ``CRIM``.
1. Add the feature ``vlvZN``, which has very low variance.
1. Add the feature ``hcwINDUS``, which has a very high (here perfectly negative) correlation with the feature ``INDUS``.

In [5]:
City['asv'] = 99.0
City['saCRIM'] = City['CRIM']
City['vlvZN'] = 1
indi = list(City.columns).index('vlvZN')
City.iloc[0:1,indi] = 0
City['hcwINDUS'] = - City['INDUS']
City.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,asv,saCRIM,vlvZN,hcwINDUS
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,99.0,0.00632,0,-2.31
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,99.0,0.02731,1,-7.07
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,99.0,0.02729,1,-7.07
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,99.0,0.03237,1,-2.18
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,99.0,0.06905,1,-2.18


## Transform Value to Missing

Detecting and correcting for missing and outlier (good or bad) values is an evolving area of research. We will cover outlier values in the lesson on Scalers. For this lesson we will just focus on missing values.

Different values can indicate that a value is missing. For example,
- ``999`` could mean "did not answer" in some features.
- ``NA`` could mean not-applicable for this feature/record.
- ``-1`` could mean missing for this feature/record.

and so on.

[This is the pandas tutorial on missing data](http://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).
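With plain **pandas**, the idea of turning sentinel values into missing values can be sketched as follows. This is a minimal sketch and not **paso**'s implementation; the toy ``survey`` frame and the per-feature ``sentinels`` mapping are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy survey data; 999 and -1 are assumed sentinel codes for "missing".
survey = pd.DataFrame({"age": [34, 999, 51], "income": [-1, 42000, 58000]})

# Replace sentinels per feature, so a legitimate 999 elsewhere is untouched.
sentinels = {"age": [999], "income": [-1]}
cleaned = survey.copy()
for feature, values in sentinels.items():
    cleaned[feature] = cleaned[feature].replace(values, np.nan)

print(cleaned.isna().sum())
```

Listing sentinels per feature is a deliberate choice here: a value such as ``0.0`` can mean "missing" in one feature and be a perfectly valid measurement in another.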

In [6]:
from paso.pre.cleaners import Transform_Values_Ratios_to_Missing
o = Transform_Values_Ratios_to_Missing()
x = o.transform(City,inplace=True,missing_values=[99.0,0.0])
x.head()


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,asv,saCRIM,vlvZN,hcwINDUS
0,0.00632,18.0,2.31,,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,,0.00632,,-2.31
1,0.02731,,7.07,,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.02731,1.0,-7.07
2,0.02729,,7.07,,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,,0.02729,1.0,-7.07
3,0.03237,,2.18,,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,,0.03237,1.0,-2.18
4,0.06905,,2.18,,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,,0.06905,1.0,-2.18


## paso Class for Determining Missing Value Ratios for Features and Rows

A feature with a sufficiently large ratio of missing values is statistically irrelevant, and you can remove it from the dataset. Similarly, a row (an observation) with a large ratio of missing values is statistically irrelevant, and you can remove it from the dataset.

An extra row (each feature's missing value ratio) and an extra feature (each row's missing value ratio) are added to the returned **pandas** dataframe. The missing value ratios are kept in the class instance attributes ``features_mvr`` and ``rows_mvr``.


In [7]:
from paso.pre.cleaners import Missing_Values_Ratios
o = Missing_Values_Ratios()
x = o.transform(City,inplace=True,missing_values=[])
x.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,asv,saCRIM,vlvZN,hcwINDUS
0,0.00632,18.0,2.31,,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,,0.00632,,-2.31
1,0.02731,,7.07,,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.02731,1.0,-7.07
2,0.02729,,7.07,,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,,0.02729,1.0,-7.07
3,0.03237,,2.18,,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,,0.03237,1.0,-2.18
4,0.06905,,2.18,,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,,0.06905,1.0,-2.18


In [8]:
o.features_mvr

CRIM        0.000000
ZN          0.735178
INDUS       0.000000
CHAS        0.930830
NOX         0.000000
RM          0.000000
AGE         0.000000
DIS         0.000000
RAD         0.000000
TAX         0.000000
PTRATIO     0.000000
B           0.000000
LSTAT       0.000000
MEDV        0.000000
asv         1.000000
saCRIM      0.000000
vlvZN       0.001976
hcwINDUS    0.000000
dtype: float64

In [9]:
o.rows_mvr.head()

0    0.166667
1    0.166667
2    0.166667
3    0.166667
4    0.166667
dtype: float64
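The same kind of ratios can be computed with plain **pandas**. The sketch below is not **paso**'s code; the toy frame and the ``threshold`` value are assumptions chosen only to show the arithmetic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

features_ratio = df.isna().mean(axis=0)  # fraction missing per feature
rows_ratio = df.isna().mean(axis=1)      # fraction missing per row

# Assumed cut-off for illustration only; choose one that suits your data.
threshold = 0.5
drop_features = features_ratio[features_ratio > threshold].index.tolist()

print(features_ratio, rows_ratio, drop_features, sep="\n")
```

Because ``isna()`` yields booleans, their mean along an axis is exactly the fraction of missing entries along that axis.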

## paso Class For Imputing a Feature's Missing Values

Someone should write a book on filling in missing and bad values of features. Setting these values to their best approximation is key to your data cleaning efforts and, further downstream, to your predictive accuracy and power. But someone did write a [succinct article on the subject](https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779).

We offer a smorgasbord of imputation strategies in **paso**'s ``Impute_Missing_Values``. We decided to impute by list of features, as the imputation strategy can vary over subsets of features.
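To make the "strategy per list of features" idea concrete, here is a minimal sketch in plain **pandas**. The ``strategy_for`` mapping is hypothetical and is not **paso**'s configuration format; the three strategies shown are only common examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "tax": [200.0, 300.0, np.nan, 1000.0],
    "river": [0.0, 0.0, np.nan, 1.0],
})

# Hypothetical mapping of strategy -> list of features.
strategy_for = {"mean": ["rooms"], "median": ["tax"], "most_frequent": ["river"]}

imputed = df.copy()
for strategy, features in strategy_for.items():
    for f in features:
        if strategy == "mean":
            fill = df[f].mean()
        elif strategy == "median":
            fill = df[f].median()
        else:  # most_frequent
            fill = df[f].mode().iloc[0]
        imputed[f] = df[f].fillna(fill)

print(imputed)
```

A skewed feature such as ``tax`` is filled with the median here because a single large value would pull the mean upward.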


In [10]:
#Impute_Features_by_Values

## paso Class for Removing Duplicate Features in a DataSet

If a feature has the same values by index as another feature then one of those features should be deleted. The duplicate feature is redundant and will have no predictive power. 

Duplicate features are quite common as an enterprise's database or data lake ages and different data sources are added.
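For intuition, duplicate features can be found in plain **pandas** by transposing the dataframe. This is a sketch on toy data and not how the **paso** class is necessarily implemented; transposing can be slow on large dataframes.

```python
import pandas as pd

df = pd.DataFrame({
    "CRIM": [0.1, 0.2, 0.3],
    "ZN": [18.0, 0.0, 0.0],
    "saCRIM": [0.1, 0.2, 0.3],  # same values as CRIM, row for row
})

# Transposing turns features into rows, so duplicated() flags repeated features.
is_duplicate = df.T.duplicated(keep="first")
deduped = df.loc[:, ~is_duplicate]

print(list(deduped.columns))
```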

The **paso** class to delete duplicate features is given as:

In [11]:
from paso.pre.cleaners import Dupilicate_Features_by_Values
x = Dupilicate_Features_by_Values().transform(City,inplace=True)
x.head()

paso 15.7.2019 17:07:39 ERROR Passed dataset, DataFrame, contained NA


PasoError: Passed dataset, DataFrame, contained NA

## paso Class for Removing Zero Variance Features from a DataSet

This class finds all the features which have only one unique value. 
The variation between values is zero. All these features are removed from
the dataset as they have no predictive ability.
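A minimal plain-**pandas** sketch of the same check is shown below, using toy data. It illustrates the idea only and makes no claim about **paso**'s internals.

```python
import pandas as pd

df = pd.DataFrame({"asv": [99.0, 99.0, 99.0], "RM": [6.5, 6.4, 7.1]})

# nunique() counts the distinct non-missing values in each feature.
n_unique = df.nunique()
constant = n_unique[n_unique <= 1].index.tolist()
reduced = df.drop(columns=constant)

print(constant, list(reduced.columns))
```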

In [None]:
from paso.pre.cleaners import Features_with_Single_Unique_Value
x = Features_with_Single_Unique_Value().transform(City,inplace=True)
x.head()

In [None]:
City.head()

## paso Class for Determining the Variance of Features

This class finds the variance of each feature and returns a dataframe where the first column is the feature name and the second column is the variance of that feature. This class is a diagnostic tool for deciding whether a feature's variance is so small that the feature will have low predictive power. Take care before eliminating any feature; use the **SHAP** value as a second opinion before deciding to remove a feature.
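A two-column report of this shape can be sketched in plain **pandas** as follows. The toy data and the column names ``feature`` and ``variance`` are assumptions for illustration, not the names **paso** returns.

```python
import pandas as pd

df = pd.DataFrame({
    "vlvZN": [0, 1, 1, 1, 1, 1, 1, 1],
    "RM": [6.6, 6.4, 7.2, 7.0, 7.1, 6.4, 6.0, 6.2],
})

# Feature name and its sample variance (pandas uses ddof=1 by default).
variances = (
    df.var()
      .rename_axis("feature")
      .reset_index(name="variance")
      .sort_values("variance")
      .reset_index(drop=True)
)

print(variances)
```

Variance depends on the scale of each feature, so comparing raw variances across features measured in different units can mislead; this is one reason to treat the report as a diagnostic rather than a decision rule.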

In [None]:
from paso.pre.cleaners import Features_Variances
Features_Variances().transform(City)

## paso Class for Determining Feature-Feature Correlation

[Read this for a good overview of correlation](https://medium.com/fintechexplained/did-you-know-the-importance-of-finding-correlations-in-data-science-1fa3943debc2)

If a feature's correlation coefficient with another feature has a high absolute value (the coefficient lies in the closed interval [-1.0, 1.0]), then it is very likely that the second of the two will have low predictive power, as it is redundant with the other.

Usually the Pearson correlation coefficient is used, which is sensitive only to a linear relationship between two variables (which may be present even when one variable is a nonlinear function of the other). A Pearson correlation coefficient of -0.97 is a strong negative correlation while a correlation of 0.10 would be a weak positive correlation.

Spearman's rank correlation coefficient is a measure of how well the relationship between two variables can be described by a monotonic function. Kendall's rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities. Spearman's rank correlation coefficient is the more widely used rank correlation coefficient. However, Kendall's is easier to understand.

In most situations, the interpretations of the Kendall and Spearman rank correlation coefficients are very similar to that of the Pearson correlation coefficient, and thus usually lead to the same diagnosis. The **paso** class calculates Pearson's, Spearman's, or Kendall's correlation coefficients for all feature pairs of the dataset.

One feature of a highly correlated feature pair should be removed, as it has a negligible effect on the prediction. Again, this class is a diagnostic that indicates whether one feature of a pair will have low predictive power. Before eliminating any feature, take care to look at the **SHAP** value, the ratio (sd/mean) and the correlation coefficient in order to reach a decision to remove a feature.
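The pairwise diagnosis described above can be sketched in plain **pandas** as follows. The synthetic data and the ``threshold`` of 0.75 are assumptions for illustration; this is not the **paso** class itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
indus = rng.uniform(0, 25, size=100)
df = pd.DataFrame({
    "INDUS": indus,
    "hcwINDUS": -indus,                # perfectly negatively correlated
    "RM": rng.normal(6, 1, size=100),  # independent noise
})

threshold = 0.75                       # assumed cut-off on |correlation|
corr = df.corr(method="pearson")       # also accepts "spearman" or "kendall"

# Keep each pair once by looking only above the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs.abs() > threshold]

print(high)
```

Comparing the absolute value against the threshold matters: a strong negative correlation signals redundancy just as a strong positive one does.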

In [None]:
from paso.pre.cleaners import Feature_Feature_Correlation
o = Feature_Feature_Correlation()
corr = o.transform(City,threshold= 0.75,inplace=True)

corr.head()

In [None]:
o.plot()

## paso Class for Removing Features from a Dataset

I should mention at this point that **sklearn** has a few different algorithms for [feature selection](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#examples-using-sklearn-feature-selection-selectkbest). You might want to look at these before removing any features.

However, I prefer the feature diagnosis tools given in **paso**, especially **SHAP**. **SHAP** was state-of-the-art in determining a feature's importance in a model as of 2018 and, as far as I know, still is. Please let us know if you know of something better and we will add it to **paso**. This is very likely, as the field is growing rapidly.

Based on your analysis, you are now ready to remove features from a **pandas** dataframe. We have written a wrapper class around ``drop`` so as to be compatible with the ``.transform`` method and to include **paso**'s services.

City.drop(['asv','saCRIM','vlvZN','hcwINDUS'],axis=1,inplace=True)
City.columns

In [None]:
City.columns

In [None]:
from paso.pre.cleaners import Remove_Features
x = Remove_Features().transform(City,inplace=True,remove=['vlvZN','hcwINDUS']) 

x.columns

## paso Class for Removing Features that are not Common to Train and Test 

If the train or test datasets have features the other does not, then those features will have no predictive power and should be removed from both datasets. The exception is the target feature, which is present only in the training dataset of a supervised problem.

Features in one dataset (train, test) and not in the other (train, test) may point to other problems in the population of these datasets. You should check the steps in your data loading.

Where I have seen this most often is in the assembly of the test dataset from upstream services (Kafka, Google PubSub, Amazon Kinesis Stream, PySpark, and RabbitMQ, to name a few). The features of the test dataset are changed, while the pre-trained model (and thus the training set) does not have these new features. Depending on your error handling, what usually happens is failure.

Using `` ``:
1. Differences in the features of the train and test datasets are removed.
2. What features have been removed are logged for later reconciliation.
3. The prediction from the input test dataset is successfully handled by the pre-trained model.

[A good overview of data streaming can be found at this link.](https://medium.com/analytics-vidhya/data-streams-and-online-machine-learning-in-python-a382e9e8d06a)
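The steps listed above can be sketched in plain **pandas**. This is a minimal sketch on toy data, assuming ``MEDV`` is the target; it is not the **paso** implementation, and a real service would log the dropped features rather than print them.

```python
import pandas as pd

train = pd.DataFrame({"RM": [6.5, 6.4], "TAX": [296.0, 242.0], "MEDV": [24.0, 21.6]})
test = pd.DataFrame({"RM": [7.1], "TAX": [222.0], "extraF": [10]})

target = "MEDV"  # present only in train for a supervised problem

common = [c for c in train.columns if c in test.columns]
only_train = [c for c in train.columns if c not in test.columns and c != target]
only_test = [c for c in test.columns if c not in train.columns]

# Report what is dropped so that it can be reconciled later.
print("dropped from train:", only_train, "dropped from test:", only_test)

train_aligned = train[common + [target]]
test_aligned = test[common]
```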

To create test and train datasets, we can use 30% of ``City`` as test, leaving 70% of ``City`` as train.

In [None]:
from sklearn.model_selection import train_test_split
train, test, train_target, test_target = train_test_split( City[City.columns.difference(['CRIM'])]
                                              ,  City['MEDV']
                                              , test_size=0.3
                                              , random_state=88)

train.shape,train_target.shape, test.shape,  test_target.shape

In [None]:
from paso.pre.cleaners import Features_not_in_train_or_test

In [None]:
test['extraF'] = 10
x = Features_not_in_train_or_test().transform(train,y=test,inplace=False)
x[1].head()

In the above example, the two returned dataframes are the transforms of ``train`` and ``test``. By using ``inplace=False``, the ``test`` dataset is not changed.

In [None]:
test.head()

## paso Class for Handling Imbalanced Classes

All available class balance strategies are shown with:

In [None]:
from paso.pre.cleaners import Class_Balance
o = Class_Balance('SMOTE')
o.classBalancers()
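To make the idea of class balancing concrete, here is a minimal sketch of the simplest strategy, random oversampling, in plain **pandas**. Note that this is not ``SMOTE``: it duplicates existing minority rows instead of synthesizing new ones, and the toy data is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "x": range(10),
    "label": [0] * 6 + [1] * 3 + [2] * 1,
})

majority = df["label"].value_counts().max()

# Keep every original row, then top up each minority class by resampling.
parts = [df]
for _, group in df.groupby("label"):
    shortfall = majority - len(group)
    if shortfall > 0:
        parts.append(group.sample(n=shortfall, replace=True, random_state=0))
balanced = pd.concat(parts, ignore_index=True)

print(balanced["label"].value_counts().to_dict())
```

In practice, resampling is usually applied to the training split only, so that copies of test rows do not leak into training.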

Next, we load the ``iris`` data set, which has a class (categorical) target, into the ``Flower`` dataframe.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

Flower = pd.DataFrame(iris.data, columns = iris.feature_names )
Flower['TypeOf'] = iris.target
#logger.info(iris.DESCR)
DataFrameSummary(Flower).summary()

In [None]:
X = Flower[Flower.columns.difference(['TypeOf'])]
y = Flower['TypeOf']
X.shape,y.shape

Split the original dataset into a training set (70\%) and a test set (30\%).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [None]:
epochs = 10000  # number of random train/test splits to average over
accAcum = 0
for _ in range(epochs):  # `_` avoids shadowing the built-in round()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accAcum += metrics.accuracy_score(y_test, y_pred)

print("Accuracy:", accAcum/epochs)

In [None]:
Flower_0 = Flower[Flower['TypeOf']==0]
Flower = pd.concat([Flower, Flower_0])  # append Flower_0 (DataFrame.append was removed in pandas 2.0)
print(Flower_0.shape)
DataFrameSummary(Flower).summary()

In [None]:
X = Flower[Flower.columns.difference(['TypeOf'])]
y = Flower['TypeOf']
X.shape,y.shape

In [None]:
X,y = o.transform(X,y)
X.shape,y.shape

1. Recreate ``Flower`` from ``X``, ``y`` with artificial data for ``TypeOf=[1,2]`` added in to balance the classes. In this step the ``TypeOf=0`` rows are also removed.
1. Add the base ``TypeOf=0`` back in.
1. ``Flower`` now contains 50 rows each of real data for ``TypeOf=[0,1,2]`` and 50 rows each of synthetic data for ``TypeOf=[1,2]``.

In [None]:
Flower = pd.DataFrame(X,columns= Flower[Flower.columns.difference(['TypeOf'])].columns)
Flower['TypeOf'] = y
Flower = Flower[Flower['TypeOf']>0]
Flower = pd.concat([Flower, Flower_0])  # append Flower_0 (DataFrame.append was removed in pandas 2.0)
print(Flower_0.shape)
DataFrameSummary(Flower).summary()

Now balance the classes again with the oversampler ``SMOTE``. Only ``TypeOf=0`` needs 50 rows of synthetic data to balance the classes. The result is similar to image augmentation in that we have doubled the ``Iris`` dataset with synthetic data.

In [None]:
X = Flower[Flower.columns.difference(['TypeOf'])]
y = Flower['TypeOf']
X,y = o.transform(X,y)
X.shape,y.shape

In [None]:
Flower = pd.DataFrame(X,columns= Flower[Flower.columns.difference(['TypeOf'])].columns)
Flower['TypeOf'] = y
# append Flower_0
print(Flower_0.shape)
DataFrameSummary(Flower).summary()

In [None]:
epochs = 10000  # number of random train/test splits to average over
accAcum = 0
for _ in range(epochs):  # `_` avoids shadowing the built-in round()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accAcum += metrics.accuracy_score(y_test, y_pred)

print("Accuracy:", accAcum/epochs)

Let us check to see whether training is better or not.

## Creating a Pipeline for your Data Cleaning

In production, data cleaning routines (sequences of transformations that convert raw data into a format useful for analysis) have been viewed as static components that fit into data integration or Extract-Transform-Load (ETL) pipelines and are executed on new data entering the prediction model.

However, this perspective fails to take into account that data cleaning is frequently an iterative analysis process when training a model or models. **paso** is constantly evolving to support both dynamic production and iterative development data cleaning pipelines.
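One way to picture this iteration is to express each cleaning strategy as an ordered list of steps and compare the results. The sketch below uses plain **pandas** and ``DataFrame.pipe``; the step functions and strategy names are hypothetical and are not **paso** classes.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df, max_missing=0.5):
    """Drop rows whose fraction of missing values exceeds max_missing."""
    return df[df.isna().mean(axis=1) <= max_missing]


def fill_with_median(df):
    """Fill missing values with each feature's median."""
    return df.fillna(df.median())


def drop_constant_features(df):
    """Drop features that have a single unique value."""
    return df.loc[:, df.nunique() > 1]


raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [2.0, np.nan, np.nan, 8.0],
    "c": [5.0, 5.0, 5.0, 5.0],
})

# Two hypothetical strategies, each an ordered list of cleaning steps.
strategies = {
    "drop_then_fill": [drop_sparse_rows, fill_with_median, drop_constant_features],
    "fill_only": [fill_with_median, drop_constant_features],
}

results = {}
for name, steps in strategies.items():
    df = raw
    for step in steps:
        df = df.pipe(step)
    results[name] = df
    print(name, df.shape)
```

Each strategy leaves ``raw`` untouched, so the cleaned outputs can be fed to the same model and compared on the same metric.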

[A good introduction to Scikit-learn pipelines](https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf)

You can display a directed acyclic graph (DAG) of your **paso** pipeline.

In [None]:
session.display_DAG()

## Summary

In summary paso data cleaning transformation classes are:
- ``Impute_Features_by_Values``
- ``Duplicate_Features_by_Values``
- ``Features_with_Single_Unique_Value``
- `` ``
- `` ``
- ``Remove_Features``
- ``Features_not_in_train_or_test``


paso data cleaning analysis classes are:
- ``Features_missing_value_ratios``
- ``Features_Variances``
- ``Feature_Feature_Correlation``
- ``Features_SHAP_values ``

All **paso** data cleaning transformation classes are similar to sklearn transformation classes and support sklearn pipeline classes, but differ from sklearn in that **paso** uses a pandas dataframe as the first argument for input and for output, whereas sklearn uses a numpy array for input and output.

You have seen that **paso** offers data cleaning classes for both production data engineers and research data scientists. **paso** supports streaming data as well as bulk extraction data cleaning. You can expect **paso** to continue to offer state-of-the-art tools for data cleaning.

Other lessons on **paso** are:
1. [**paso**'s Offering of Logging and Parameter Services for your Python Project](https://github.com/bcottman/paso/blob/master/lessons/lesson_1.ipynb)

In the future, we will cover **paso** in more depth with the following lessons:
- Overview of **paso** scalers and handling data outliers.
- Overview of **paso** encoders.
- Overview of **paso** machine learning and deep learning models.
- Using  **paso** on GPUs.
- and yet more…

If you have a service or feature to suggest, or see a bug, then leave the **paso** project a [note](https://github.com/bcottman/paso/issues).