# <font color="green">Bamboo Pipeline</font>
### A data pipelining tool for preprocessing mixed-attribute datasets with pandas dataframes
This notebook walks through the bamboo pipeline preprocessing modules.

This library provides preprocessing transformers (inheriting from sklearn.base.TransformerMixin) for pandas DataFrames, built around the idea of using PandasFeatureUnion to stitch together the results of several Pipelines. Its advantage is that the pipelined transformers return pandas DataFrames instead of the native output of sklearn transformers (scipy sparse matrices, NumPy arrays, etc.), so the index and column names are preserved in the output.
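To make the motivation concrete, here is a minimal sketch using only pandas and scikit-learn (not bamboo_pipeline itself, whose internals are not shown in this notebook). The toy values and the names `toy_df`, `raw`, and `wrapped` are assumptions for illustration. It shows how a stock sklearn transformer drops the labels and how re-wrapping the result restores them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a meaningful index and named columns (illustrative values only).
toy_df = pd.DataFrame(
    {"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 71.2833, np.nan]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

imputer = SimpleImputer(strategy="median")

# With sklearn's default configuration the transformer returns a bare
# NumPy array, so the index and column names are lost.
raw = imputer.fit_transform(toy_df)
print(type(raw).__name__)

# Re-wrapping the array restores the original index and column names.
wrapped = pd.DataFrame(raw, index=toy_df.index, columns=toy_df.columns)
print(wrapped)
```

A DataFrame-returning wrapper presumably does something along these lines for every step, which is what keeps the labels intact through a long pipeline.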

### Datasets
This walk-through will consist of three datasets:
1. <font color="red">The "canonical" Titanic dataset</font>
2. <font color="green">The NCCTG Lung Cancer Data</font>
3. <font color="blue">A dataset of Soil Compositions of Physical and Chemical Characteristics</font>

Note that each of the datasets has numeric and categorical features, with some examples of binary and ordinal features as well.

In [1]:
# imports
from bamboo_pipeline import PandasFeatureUnion, PandasTransform, PandasSubsetSelector, PandasOneHotEncoder, PandasLabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
import sklearn.preprocessing  # ensures sklearn.preprocessing.StandardScaler resolves below
import os, sys

# Titanic Dataset

In [2]:
# Get data
titanic_path = os.path.join('data', 'titanic.csv')
titanic_df = pd.read_csv(titanic_path, index_col=0)
titanic_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Titanic Discussion
Irrelevant (I'm declaring these irrelevant for the sake of this exercise):
- Name
- Ticket 
- Cabin

Target: 
- Survived (binary)

Numeric:
- Age
- Fare
- Pclass (can be thought of as ordinal categories as well)
- Parch (number of parents and children aboard)
- SibSp (number of siblings and spouses aboard)

Categorical:
- Embarked
- Sex (binary)

### Pipelines
What we are going to do is create the following pipelines:
1. Categorical pipeline (Embarked) will impute missing data with the most frequent value and use a one-hot encoder.
2. Numerical pipeline (Age, Pclass, Parch, SibSp) will impute missing data with the median value and standardize with StandardScaler.
3. Identity pipeline (Fare, Survived, Name) will pass the target, and any columns we want to keep as-is, through without adjustments.
4. Boolean pipeline (Sex) will impute missing data with the most frequent value and label-encode (0 or 1).
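As a rough plain-pandas sketch of what the categorical and boolean pipelines are meant to achieve (an illustration under assumptions, not the bamboo_pipeline implementation), the block below imputes with the most frequent value, one-hot encodes Embarked, and label-encodes Sex. The rows are toy values invented for the example, and the variable names are hypothetical.

```python
import pandas as pd

# Toy rows (illustrative values, not the real Titanic data).
toy_df = pd.DataFrame(
    {
        "Embarked": ["S", "C", None, "Q", "S"],
        "Sex": ["male", "female", "male", None, "male"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

# Most-frequent imputation: fill gaps with each column's mode.
filled = toy_df.fillna(toy_df.mode().iloc[0])

# One-hot encode Embarked into Embarked_C / Embarked_Q / Embarked_S.
embarked = pd.get_dummies(filled["Embarked"], prefix="Embarked", dtype=float)

# Label-encode Sex: categories sort alphabetically, so female -> 0, male -> 1.
sex = filled["Sex"].astype("category").cat.codes.rename("Sex")

result = pd.concat([embarked, sex], axis=1)
print(result)
```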

### Getting fancy
For kicks, and to demonstrate capabilities, let's:
- knock a few years off Mrs. Cumings' age (and that of any other 38-year-old), so that she is now the same age as Mrs. Futrelle.
- implement a charity to upgrade anyone under the age of 25 to first class (this should change Mr. Braund from 3rd to 1st).
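The two rules above can also be sketched with vectorized pandas assignments, independent of PandasTransform. The Age and Pclass values are the five rows shown in the titanic_df.head() output; the frame name `df` is just for this example.

```python
import pandas as pd

# Age and Pclass for the first five passengers, copied from titanic_df.head().
df = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0, 35.0], "Pclass": [3, 1, 3, 1, 3]},
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

# Rule 1: every 38-year-old becomes 35.
df.loc[df["Age"] == 38, "Age"] = 35

# Rule 2: anyone under 25 is upgraded to first class.
df.loc[df["Age"] < 25, "Pclass"] = 1

print(df)
```

Passenger 2 ends up aged 35 and passenger 1 ends up in first class, which is the effect the row-wise lambdas in the numeric pipeline are meant to produce before scaling.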

In [3]:
# Create the numeric data pre-processing pipeline
num_attribs = ['Age', 'Pclass', 'Parch', 'SibSp']
def numeric_pipeline(num_attribs):
    num_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(num_attribs)),
        ('imputer', PandasTransform( SimpleImputer(strategy='median') )),
        ('age_before_beauty', PandasTransform(lambda X: X.where(X.index!='Age',35) if X['Age']==38 else X, axis=1)),
        ('assistance', PandasTransform(lambda X: X.where(X.index!='Pclass',1) if X['Age']<25 else X, axis=1)),
        ('std_scaler', PandasTransform( sklearn.preprocessing.StandardScaler() ) )
    ])
    return num_pipeline

# Create the categorical data pre-processing pipeline
cat_attribs = ['Embarked']
def categorical_pipeline(cat_attribs):
    cat_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(cat_attribs)),
        ('imputer', PandasTransform( SimpleImputer(strategy='most_frequent') )),
        ('cat_encoder', PandasOneHotEncoder())
    ])
    return cat_pipeline

# Pass columns through without scaling
idt_attribs = ['Fare', 'Survived', 'Name']
def identity_pipeline(idt_attribs):
    idt_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(idt_attribs))
    ])
    return idt_pipeline

# Create the boolean data pre-processing pipeline
bool_attribs = ['Sex']
def boolean_pipeline(bool_attribs):
    bool_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(bool_attribs)),
        ('imputer', PandasTransform( SimpleImputer(strategy='most_frequent') )),
        ('bool_encoder', PandasLabelEncoder())
    ])
    return bool_pipeline

# Create pipeline
num_pipeline = numeric_pipeline(num_attribs)
cat_pipeline = categorical_pipeline(cat_attribs)
idt_pipeline = identity_pipeline(idt_attribs)
bool_pipeline = boolean_pipeline(bool_attribs)

full_pipeline = PandasFeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
    ('idt_pipeline', idt_pipeline),
    ('bool_pipeline', bool_pipeline)
])
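Conceptually, a DataFrame-preserving feature union amounts to a column-wise concatenation of each pipeline's output, aligned on the shared index. The sketch below is an assumption about the idea, not PandasFeatureUnion's actual code; `union_frames` is a hypothetical helper and the values are toy numbers.

```python
import pandas as pd


def union_frames(frames):
    """Concatenate per-pipeline outputs column-wise, aligning rows on the index."""
    return pd.concat(frames, axis=1)


# Toy outputs standing in for three fitted pipelines (illustrative values only).
index = pd.Index([1, 2, 3], name="PassengerId")
numeric_part = pd.DataFrame({"Age": [-0.5, 0.4, -0.2]}, index=index)
categorical_part = pd.DataFrame(
    {"Embarked_C": [0.0, 1.0, 0.0], "Embarked_S": [1.0, 0.0, 1.0]}, index=index
)
passthrough_part = pd.DataFrame({"Survived": [0, 1, 1]}, index=index)

combined = union_frames([numeric_part, categorical_part, passthrough_part])
print(combined)
```

Because every piece keeps the same index, the combined frame lines up row for row and the column order follows the order of the transformer list.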

### Pipeline-processed Data:

In [4]:
# Execute
processed_titanic_df = full_pipeline.fit_transform(titanic_df)
processed_titanic_df.head()

PassengerId,Age,Pclass,Parch,SibSp,Embarked_C,Embarked_Q,Embarked_S,Fare,Survived,Name,Sex
1,-0.563773,-0.917852,-0.473674,0.432793,0.0,0.0,1.0,7.25,0,"Braund, Mr. Owen Harris",1
2,0.436842,-0.917852,-0.473674,0.432793,1.0,0.0,0.0,71.2833,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0
3,-0.255892,1.274658,-0.473674,-0.474545,0.0,0.0,1.0,7.925,1,"Heikkinen, Miss. Laina",0
4,0.436842,-0.917852,-0.473674,0.432793,0.0,0.0,1.0,53.1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0
5,0.436842,1.274658,-0.473674,-0.474545,0.0,0.0,1.0,8.05,0,"Allen, Mr. William Henry",1


### Compare with original dataframe

In [5]:
titanic_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### You can see that:
- Mrs. Cumings's age is now the same as Mrs. Futrelle and Mr. Allen in the processed_titanic_df (35, but scaled).
- Mr. Braund now has the same Pclass as Mrs. Cumings (1, but scaled).
- There are three one-hot encoded columns where Embarked used to be: Embarked_C, Embarked_Q and Embarked_S.
- The columns in the numeric pipeline (Age, Pclass, Parch, SibSp) are standardized; Fare passes through the identity pipeline unscaled.
- Sex is label-encoded to 0/1 now.
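For readers wondering what "scaled" means here, the following sketch shows that StandardScaler is equivalent to subtracting the column mean and dividing by the population standard deviation (ddof=0). It uses only the five post-adjustment ages from above, so the numbers will not match the notebook output, which was scaled over the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Ages of the first five passengers after the 38 -> 35 adjustment.
ages = pd.DataFrame({"Age": [22.0, 35.0, 26.0, 35.0, 35.0]})

# StandardScaler output (a NumPy array).
scaled = StandardScaler().fit_transform(ages)

# The same z-score computed by hand with the population standard deviation.
manual = (ages - ages.mean()) / ages.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

Equal raw values always map to equal scaled values, which is why passengers 2, 4 and 5 share the same Age in the processed frame.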

# NCCTG Lung Cancer Dataset

In [6]:
# Get data
cancer_path = os.path.join('data', 'cancer.csv')
cancer_df = pd.read_csv(cancer_path, index_col=0)
cancer_df.sort_index(inplace=True)
cancer_df.head()

Unnamed: 0,inst,time,status,age,sex,ph.ecog,ph.karno,pat.karno,meal.cal,wt.loss
1,3.0,306,2,74,1,ONE,90.0,100.0,1175.0,
2,3.0,455,2,68,1,ZERO,90.0,90.0,1225.0,15.0
3,3.0,1010,1,56,1,ZERO,90.0,90.0,,15.0
4,5.0,210,2,57,1,ONE,90.0,60.0,1150.0,11.0
5,1.0,883,2,60,1,ZERO,100.0,90.0,,0.0


### Cancer Discussion
Here are the column definitions for the less obvious ones:
- inst: Institution code
- time: Survival time in days
- status: Censoring status (1=censored, 2=dead)
- ph.ecog: ECOG performance score (0=good, 5=dead)
- ph.karno: Karnofsky performance score (0=bad, 100=good) rated by physician
- pat.karno: Karnofsky performance score as rated by patient
- meal.cal: Calories consumed at meals
- wt.loss: Weight loss in last six months

Numeric:
- time
- age
- ph.karno
- pat.karno
- meal.cal
- wt.loss

Categorical:
- status (binary)
- sex (binary; already in ML format!)
- inst

Ordinal:
- ph.ecog

### Getting a little fancier
- I changed ph.ecog from numbers to word representations in order to show another functionality of PandasTransform.
- Let's assume the patients with missing meal.cal values are eating the maximum caloric intake of any patient.
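Both tweaks can be sketched in plain pandas, independent of PandasTransform. The values are the five rows shown in cancer_df.head(), so the maximum here is taken over those five rows only, whereas the notebook uses the full column. The mapping dictionary is the one defined in the pipeline cell.

```python
import numpy as np
import pandas as pd

# ph.ecog and meal.cal for the first five patients, copied from cancer_df.head().
df = pd.DataFrame(
    {
        "ph.ecog": ["ONE", "ZERO", "ZERO", "ONE", "ZERO"],
        "meal.cal": [1175.0, 1225.0, np.nan, 1150.0, np.nan],
    },
    index=[1, 2, 3, 4, 5],
)

# Word-to-number mapping for the ordinal ECOG score.
mapping = {"ZERO": 0, "ONE": 1, "TWO": 2, "THREE": 3, "FOUR": 4, "FIVE": 5}
df["ph.ecog"] = df["ph.ecog"].str.upper().map(mapping)

# Fill missing meal.cal with the largest observed value.
df["meal.cal"] = df["meal.cal"].fillna(df["meal.cal"].max())

print(df)
```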

In [7]:
# Create the numeric data pre-processing pipeline
num_attribs = ['time', 'age', 'ph.karno', 'pat.karno', 'wt.loss']
def numeric_pipeline(num_attribs):
    num_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(num_attribs)),
        ('imputer', PandasTransform( SimpleImputer(strategy='median') )),
        ('std_scaler', PandasTransform( sklearn.preprocessing.StandardScaler() ) )
    ])
    return num_pipeline

# Create pipeline for meal calories
meal_attribs = ['meal.cal']
def mealcal_pipeline(meal_attribs):
    meal_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(meal_attribs)),
        ('eat_well', PandasTransform(lambda X: X.where(pd.notnull(X),X.max()))),
        ('std_scaler', PandasTransform( sklearn.preprocessing.StandardScaler() ) )
    ])
    return meal_pipeline

# Create the categorical data pre-processing pipeline
cat_attribs = ['inst']
def categorical_pipeline(cat_attribs):
    cat_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(cat_attribs)),
        ('imputer', PandasTransform( SimpleImputer(strategy='most_frequent') )),
        ('cat_encoder', PandasOneHotEncoder())
    ])
    return cat_pipeline

# Pass columns through without scaling
idt_attribs = ['sex']
def identity_pipeline(idt_attribs):
    idt_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(idt_attribs))
    ])
    return idt_pipeline

# Create the boolean data pre-processing pipeline
bool_attribs = ['status']
def boolean_pipeline(bool_attribs):
    bool_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(bool_attribs)),
        ('imputer', PandasTransform( SimpleImputer(strategy='most_frequent') )),
        ('bool_encoder', PandasLabelEncoder())
    ])
    return bool_pipeline

# Create the ordinal data pre-processing pipeline
ord_attribs = ['ph.ecog']
mapping = {"ZERO": 0, "ONE": 1, "TWO": 2, "THREE": 3, "FOUR": 4, "FIVE": 5}
def ordinal_pipeline(ord_attribs):
    ord_pipeline = Pipeline([
        ('selector', PandasSubsetSelector(ord_attribs)),
        ('imputer', PandasTransform( SimpleImputer(strategy='most_frequent') )),
        ('ord_encoder', PandasTransform( lambda X: mapping[X[0].upper()], axis=1))
    ])
    return ord_pipeline

# Create pipeline
num_pipeline = numeric_pipeline(num_attribs)
meal_pipeline = mealcal_pipeline(meal_attribs)
cat_pipeline = categorical_pipeline(cat_attribs)
idt_pipeline = identity_pipeline(idt_attribs)
bool_pipeline = boolean_pipeline(bool_attribs)
ord_pipeline = ordinal_pipeline(ord_attribs)

full_pipeline = PandasFeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('meal_pipeline', meal_pipeline),
    ('cat_pipeline', cat_pipeline),
    ('idt_pipeline', idt_pipeline),
    ('bool_pipeline', bool_pipeline),
    ('ord_pipeline', ord_pipeline)
])

In [8]:
# Execute
processed_cancer_df = full_pipeline.fit_transform(cancer_df)
processed_cancer_df.head()

Unnamed: 0,time,age,ph.karno,pat.karno,wt.loss,meal.cal,inst_1.0,inst_2.0,inst_3.0,inst_4.0,...,inst_15.0,inst_16.0,inst_21.0,inst_22.0,inst_26.0,inst_32.0,inst_33.0,sex,status,ph.ecog
1,0.003652,1.276035,0.657478,1.382875,-0.208979,-0.128529,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1
2,0.712558,0.613311,0.657478,0.692951,0.420026,-0.063143,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0
3,3.353112,-0.712138,0.657478,0.692951,0.420026,1.734966,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0
4,-0.453093,-0.601684,0.657478,-1.376823,0.105524,-0.161222,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1
5,2.748877,-0.270322,1.47218,0.692951,-0.759358,1.734966,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0


### Compare to original dataset

In [9]:
cancer_df.head()

Unnamed: 0,inst,time,status,age,sex,ph.ecog,ph.karno,pat.karno,meal.cal,wt.loss
1,3.0,306,2,74,1,ONE,90.0,100.0,1175.0,
2,3.0,455,2,68,1,ZERO,90.0,90.0,1225.0,15.0
3,3.0,1010,1,56,1,ZERO,90.0,90.0,,15.0
4,5.0,210,2,57,1,ONE,90.0,60.0,1150.0,11.0
5,1.0,883,2,60,1,ZERO,100.0,90.0,,0.0


### You can see that:
- Patients 3 and 5, who were missing meal.cal, have the highest values.
- There are a lot of institutions!
- ph.ecog looks to be mapped correctly with our PandasTransform.
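The inst and status columns can be cross-checked with stock pandas and scikit-learn calls. This is a sketch on the five rows shown in cancer_df.head(), not the bamboo_pipeline encoders themselves; it illustrates where column names such as inst_3.0 come from and why status 1/2 becomes 0/1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# inst and status for the first five patients, copied from cancer_df.head().
df = pd.DataFrame(
    {"inst": [3.0, 3.0, 3.0, 5.0, 1.0], "status": [2, 2, 1, 2, 2]},
    index=[1, 2, 3, 4, 5],
)

# One column per institution code; float codes yield names like "inst_3.0".
inst_onehot = pd.get_dummies(df["inst"], prefix="inst", dtype=float)
print(list(inst_onehot.columns))

# LabelEncoder sorts the classes, so status 1 -> 0 and status 2 -> 1.
status_encoded = LabelEncoder().fit_transform(df["status"])
print(status_encoded.tolist())
```

With only five rows there are three institution columns; the full dataset has many more codes, hence the wide processed frame.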