# Pandas Pipes

Cleaner Data Analysis with Pandas Using Pipes  
https://towardsdatascience.com/cleaner-data-analysis-with-pandas-using-pipes-4d73770fbf3c


## Idea: collect utility cleanup functions like these into a standard library


Data:  
https://www.kaggle.com/yoghurtpatil/direct-marketing?select=DirectMarketing.csv  

In [2]:
import numpy as np
import pandas as pd

In [3]:
marketing = pd.read_csv("./data/DirectMarketing.csv")
marketing.head()

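| | Age | Gender | OwnHome | Married | Location | Salary | Children | History | Catalogs | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Old | Female | Own | Single | Far | 47500 | 0 | High | 6 | 755 |
| 1 | Middle | Male | Rent | Single | Close | 63600 | 0 | High | 6 | 1318 |
| 2 | Young | Female | Rent | Single | Close | 13500 | 0 | Low | 18 | 296 |
| 3 | Middle | Male | Own | Married | Close | 85600 | 1 | High | 18 | 2436 |
| 4 | Middle | Female | Own | Single | Close | 68400 | 0 | High | 12 | 1304 |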


## Functions used by `pipe`

In [5]:
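def drop_missing(df):
    # Drop columns with fewer than 60% non-missing values.
    # Note: this modifies df in place, so run it on a copy.
    thresh = int(len(df) * 0.6)  # dropna documents thresh as an int
    df.dropna(axis=1, thresh=thresh, inplace=True)
    return df

def remove_outliers(df, column_name):
    # Keep rows whose value lies between the 5th and 95th percentiles.
    low = np.quantile(df[column_name], 0.05)
    high = np.quantile(df[column_name], 0.95)
    # inclusive=True was deprecated in pandas 1.3 and removed in 2.0; use "both".
    # .copy() lets later steps assign columns without a SettingWithCopyWarning.
    return df[df[column_name].between(low, high, inclusive="both")].copy()

def to_category(df):
    # Convert low-cardinality object columns to the category dtype.
    cols = df.select_dtypes(include='object').columns
    for col in cols:
        # Share of distinct values relative to the number of rows.
        ratio = df[col].nunique() / len(df)
        if ratio < 0.05:
            df[col] = df[col].astype('category')
    return df

def copy_df(df):
    # Return a copy so the steps that follow never touch the original.
    return df.copy()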
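
Because `drop_missing` calls `dropna` with `inplace=True`, it changes whatever DataFrame it receives, which is why the pipeline below starts with `copy_df`. The following is a minimal sketch of that effect, not part of the original notebook. The `toy` DataFrame and its columns `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_missing(df):
    # Mirrors the function above: drops sparse columns in place.
    thresh = int(len(df) * 0.6)
    df.dropna(axis=1, thresh=thresh, inplace=True)
    return df

def copy_df(df):
    return df.copy()

# Hypothetical toy data: column "b" has only 1 of 5 values present.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [np.nan, np.nan, np.nan, np.nan, 1.0]})

# With copy_df first, the original keeps both columns.
cleaned = toy.pipe(copy_df).pipe(drop_missing)
cols_after_safe_run = list(toy.columns)

# Without it, drop_missing returns the same object it received, now modified.
mutated = toy.pipe(drop_missing)
cols_after_unsafe_run = list(toy.columns)

print(cols_after_safe_run)    # ['a', 'b']
print(list(cleaned.columns))  # ['a']
print(mutated is toy)         # True
print(cols_after_unsafe_run)  # ['a']
```
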

### Pipe to clean a copy of the data  

In [6]:
marketing_cleaned = (
    marketing
    .pipe(copy_df)
    .pipe(drop_missing)
    .pipe(remove_outliers, 'Salary')
    .pipe(to_category)
)
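
As a rough illustration of what each `.pipe(...)` step above does: `df.pipe(func, arg)` calls `func(df, arg)`, so a chain of pipes is a left-to-right way of writing nested function calls. The sketch below assumes pandas 1.3 or later (for `inclusive="both"`) and uses a made-up `toy` DataFrame rather than the marketing data.

```python
import numpy as np
import pandas as pd

def remove_outliers(df, column_name):
    # Same idea as the notebook's function: keep the 5th to 95th percentile range.
    low = np.quantile(df[column_name], 0.05)
    high = np.quantile(df[column_name], 0.95)
    return df[df[column_name].between(low, high, inclusive="both")]

# Hypothetical toy data, not the marketing dataset.
toy = pd.DataFrame({"Salary": range(1, 101)})

# The extra positional argument to pipe is forwarded to the function.
piped = toy.pipe(remove_outliers, "Salary")
nested = remove_outliers(toy, "Salary")

print(piped.equals(nested))  # True
```
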

In [7]:
print(marketing.shape)
marketing.dtypes

(1000, 10)


Age            object
Gender         object
OwnHome        object
Married        object
Location       object
Salary          int64
Children        int64
History        object
Catalogs        int64
AmountSpent     int64
dtype: object

In [8]:
print(marketing_cleaned.shape)
marketing_cleaned.dtypes

(900, 10)


Age            category
Gender         category
OwnHome        category
Married        category
Location       category
Salary            int64
Children          int64
History        category
Catalogs          int64
AmountSpent       int64
dtype: object
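
The dtypes above show every object column converted to `category`, because each has only a handful of distinct values across the rows. A small sketch of the ratio check in `to_category`, using a hypothetical `toy` DataFrame with one low-cardinality column and one ID-like column (both invented for illustration), suggests how the 5% threshold separates the two cases:

```python
import pandas as pd

def to_category(df):
    # Same heuristic as the notebook: convert object columns whose
    # distinct-value ratio is below 5% of the row count.
    cols = df.select_dtypes(include="object").columns
    for col in cols:
        ratio = df[col].nunique() / len(df)
        if ratio < 0.05:
            df[col] = df[col].astype("category")
    return df

# Hypothetical toy data, stored explicitly as object dtype.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male"] * 50,              # 2 / 100 = 0.02
        "CustomerId": [f"id{i}" for i in range(100)],   # 100 / 100 = 1.0
    },
    dtype=object,
)

result = to_category(toy.copy())
print(result.dtypes)
```

Under these assumptions `Gender` becomes a category while `CustomerId` stays as it was, since converting a column of nearly unique values would gain nothing.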