# Python Learning Sessions: Feature Engineering

## Basics

### Pipelines
Pipelines combine multiple preprocessing steps into a single step.
Pandas has a built-in method for building pipelines: `DataFrame.pipe()`. You pass a function to `pipe()`, and you can chain multiple transformations by calling `pipe()` again on each result.

SKLearn has a built-in `Pipeline` class that applies each step of the pipeline sequentially. SKLearn also has a `ColumnTransformer` class that applies different preprocessing steps to different subsets of columns and concatenates the results. Used together, the two classes give a flexible way to combine multiple preprocessing steps into a single step.

#### Pandas Pipeline
---

When applying a series of transformations to a dataframe, we usually call them one after another, like this:

```python
df = h(df)
df = g(df, arg1=a)
df = func(df, arg2=b, arg3=c)
```

Or we can nest the function calls:
```python
df = func(g(h(df), arg1=a), arg2=b, arg3=c)
```

Or we can use the pipe() method to apply a series of transformations to a dataframe:

```python
(df.pipe(h)
   .pipe(g, arg1=a)
   .pipe(func, arg2=b, arg3=c)
)
```

Each pipeline step is just a function that takes a dataframe as its first argument and returns a dataframe.
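
As a minimal sketch of such step functions (the toy data and the helper names `add_total` and `scale` are made up for illustration), `pipe()` forwards extra keyword arguments to the function. It also accepts a `(callable, keyword_name)` tuple when the dataframe is not the function's first parameter:

```python
import pandas as pd


def add_total(df, cols):
    """Return a copy of df with a 'total' column summing `cols`."""
    return df.assign(total=df[cols].sum(axis=1))


def scale(factor, data):
    """Here the dataframe is the second parameter, named `data`."""
    return data.assign(total=data["total"] * factor)


toy = pd.DataFrame({"fare": [7.0, 5.0], "tip": [2.0, 0.0]})

result = (
    toy.pipe(add_total, cols=["fare", "tip"])
       # (callable, keyword) tells pipe which argument receives the dataframe
       .pipe((scale, "data"), factor=2)
)
print(result)
```

Because both helpers use `assign()`, `toy` itself is left unchanged.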

In [72]:
import pandas as pd
import numpy as np

link = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/taxis.csv'
df = pd.read_csv(link)
df

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.60,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.00,0.0,9.30,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.70,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.10,0.0,13.40,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan
6429,2019-03-31 17:38:00,2019-03-31 18:34:23,1,18.74,58.0,0.00,0.0,58.80,green,credit card,Jamaica,East Concourse/Concourse Village,Queens,Bronx
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.00,0.0,17.30,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.00,0.0,6.80,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn


In [14]:
def get_unique_routes(df):
    """
    Aggregate trip statistics for each unique route
    (pickup borough, dropoff borough pair)
    
    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with the taxi data

    Returns
    -------
    df : pandas.DataFrame
        DataFrame with one row per route: mean distance, fare, tip and
        tolls, plus the most common payment type and taxi color

    """
    df = df.groupby(by=['pickup_borough', 'dropoff_borough']).agg({
        'distance': 'mean',
        'fare': 'mean',
        'tip': 'mean',
        'tolls': 'mean',
        # mode can return several values when there is a tie
        'payment': pd.Series.mode,
        'color': pd.Series.mode,
    })
    df.reset_index(inplace=True)
    df['route'] = df.pickup_borough + '-' + df.dropoff_borough
    df.drop(['pickup_borough', 'dropoff_borough'], axis=1, inplace=True)
    return df


def bin_distance(df, bins):
    """
    Bin the distance column

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with the taxi data
    bins : list or array
        Bin edges to use

    Returns
    -------
    df : pandas.DataFrame
        DataFrame with the taxi data with the distance column binned

    """
    # assign() returns a new DataFrame, so the caller's DataFrame is not mutated
    df = df.assign(distance_bin=pd.cut(df.distance, bins))
    return df.drop(columns=['distance'])


def get_tip_toll_percentage(df):
    """
    Get the percentage of tip and tolls
    
    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with the taxi data
        
    Returns
    -------
    df : pandas.DataFrame
        DataFrame with the taxi data with the percentage of tip and tolls

    """
    # Values are fractions of the fare (0.05 means 5%)
    df = df.assign(
        ave_tip_percentage=df.tip / df.fare,
        ave_toll_percentage=df.tolls / df.fare,
    )
    return df.drop(columns=['tip', 'tolls'])

In [75]:
df.pipe(get_unique_routes)\
  .pipe(bin_distance, bins=np.arange(0, 100, 5))\
  .pipe(get_tip_toll_percentage)

Unnamed: 0,fare,payment,color,route,distance_bin,ave_tip_percentage,ave_toll_percentage
0,14.539091,credit card,green,Bronx-Bronx,"(0, 5]",0.006586,0.008754
1,54.0625,credit card,green,Bronx-Brooklyn,"(10, 15]",0.0,0.079908
2,29.698,credit card,green,Bronx-Manhattan,"(5, 10]",0.0113,0.038144
3,40.1575,credit card,green,Bronx-Queens,"(10, 15]",0.0,0.143435
4,58.124,credit card,green,Brooklyn-Bronx,"(15, 20]",0.0,0.099098
5,11.877589,credit card,green,Brooklyn-Brooklyn,"(0, 5]",0.052987,0.0
6,25.096567,credit card,green,Brooklyn-Manhattan,"(5, 10]",0.092901,0.047958
7,34.842692,credit card,green,Brooklyn-Queens,"(10, 15]",0.040203,0.0
8,24.127273,"[cash, credit card]",yellow,Manhattan-Bronx,"(5, 10]",0.036963,0.010309
9,24.495098,credit card,yellow,Manhattan-Brooklyn,"(5, 10]",0.138886,0.027606
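
One caveat worth illustrating: `pipe()` passes the dataframe object itself to the function, so a step that assigns columns in place also changes the caller's dataframe. A minimal sketch, with made-up helper names and toy data:

```python
import pandas as pd


def add_flag_inplace(df):
    """Mutates the dataframe it receives (side effect on the caller)."""
    df["long_trip"] = df["distance"] > 5
    return df


def add_flag_copy(df):
    """Returns a new dataframe and leaves the input untouched."""
    return df.assign(long_trip=df["distance"] > 5)


trips = pd.DataFrame({"distance": [1.6, 7.7, 18.74]})

out = trips.pipe(add_flag_copy)
unchanged = "long_trip" not in trips.columns
print(unchanged)  # True: the copy-based step did not touch `trips`

out = trips.pipe(add_flag_inplace)
mutated = "long_trip" in trips.columns
print(mutated)  # True: the in-place step added the column to `trips`
```

Writing steps in the copy-returning style keeps a pipeline safe to re-run on the same input.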


#### SKLearn Pipeline, FeatureUnion, ColumnTransformer
* `Pipeline`: chaining estimators
* `FeatureUnion`: composite feature spaces
* `ColumnTransformer`: heterogeneous data
* `TransformedTargetRegressor`: transforming target in regression
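
The last item is probably the least obvious of the four, so here is a minimal sketch of `TransformedTargetRegressor` on synthetic data. The data and the choice of a log/exp transform are assumptions for illustration and are not part of the taxi example:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
# Hypothetical target that is linear in X only after taking logs
y = np.exp(1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200))

log_regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting
    inverse_func=np.exp,  # applied to the regressor's predictions
)
log_regr.fit(X, y)

# The inner regressor was fit on log(y), so its slope should be close to 2
print(log_regr.regressor_.coef_)
# Predictions are mapped back to the original scale of y
print(log_regr.predict(X[:3]))
```

Passing a `transformer=` object instead of `func`/`inverse_func` works the same way, as long as the transformer implements `inverse_transform`.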

In [32]:
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.preprocessing import (
    OneHotEncoder, StandardScaler, QuantileTransformer, MinMaxScaler
)
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import set_config

numeric_preprocessor = Pipeline(
    steps=[
        ('quasi_constant_remover', VarianceThreshold(threshold=0.0)),
        (
            "imputation_mean",
            SimpleImputer(missing_values=np.nan, strategy="mean")
        ),
        ("scaler", StandardScaler()),
    ]
)

categorical_preprocessor = Pipeline(
    steps=[
        (
            "imputation_constant", 
            SimpleImputer(fill_value="missing", strategy="constant"),
        ),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Text preprocessing
#------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, TransformerMixin

class AverageWordLengthExtractor(BaseEstimator, TransformerMixin):
    """Takes in a Series of strings, outputs their average word lengths"""

    def average_word_length(self, text):
        """Helper code to compute average word length of a string"""
        words = str(text).split()
        return np.mean([len(word) for word in words]) if words else 0.0

    def transform(self, X, y=None):
        """The workhorse of this feature extractor"""
        # Return a 2D column so FeatureUnion can stack it with other features
        lengths = [self.average_word_length(text) for text in X]
        return np.array(lengths).reshape(-1, 1)

    def fit(self, X, y=None):
        """Nothing is learned from the data, so just return `self`"""
        return self

best_word_selector = Pipeline(
    steps=[
        ('bow', CountVectorizer()),
        ('select_one', SelectKBest(k=1))
    ]
)

text_preprocessor = FeatureUnion(
    transformer_list=[
        ('best_word', best_word_selector),
        ('ave', AverageWordLengthExtractor())
    ]
)

preprocessor = ColumnTransformer(
    [
        ('keys', 'drop', ['cif', 'mastercif', 'acct_num']),
        ("categorical", categorical_preprocessor, ["state", "gender"]),
        ("numerical", numeric_preprocessor, ["age", "weight"]),
        # A scalar column name passes a 1D Series, which is what
        # CountVectorizer expects
        ('text', text_preprocessor, 'feedback'),
    ]
)

qt_transformer = QuantileTransformer(output_distribution='normal')
linreg = LinearRegression(n_jobs=-1)
regr = TransformedTargetRegressor(
    regressor=linreg,
    transformer=qt_transformer)

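# Gradient boosting classifier (assumed here: scikit-learn's implementation)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()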

pipe = make_pipeline(preprocessor, clf)
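
The pipeline above refers to columns from a dataset that is not loaded in this notebook. As a self-contained sketch of the same `ColumnTransformer` pattern, the following fits a smaller pipeline on a made-up table. The data, the `toy_` names and the choice of `LogisticRegression` are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names mirror the example above
X = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "weight": [60.0, 72.5, 80.1, np.nan, 68.3, 90.0],
    "state": ["NY", "CA", np.nan, "NY", "TX", "CA"],
    "gender": ["F", "M", "M", "F", np.nan, "M"],
    "acct_num": [101, 102, 103, 104, 105, 106],
})
y = np.array([0, 1, 1, 0, 0, 1])

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
toy_preprocessor = ColumnTransformer([
    ("keys", "drop", ["acct_num"]),
    ("numerical", numeric, ["age", "weight"]),
    ("categorical", categorical, ["state", "gender"]),
])

toy_pipe = make_pipeline(toy_preprocessor, LogisticRegression())
toy_pipe.fit(X, y)

# Slicing off the final estimator gives the preprocessing part only:
# 2 scaled numeric columns + one-hot columns for state (4) and gender (3)
Xt = toy_pipe[:-1].transform(X)
print(Xt.shape)
print(toy_pipe.predict(X))
```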

In [31]:
set_config(display="diagram")
pipe  # click on the diagram below to see the details of each step

In [26]:
pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('keys', 'drop',
                                    ['cif', 'mastercif', 'acct_num']),
                                   ('categorical',
                                    Pipeline(steps=[('imputation_constant',
                                                     SimpleImputer(fill_value='missing',
                                                                   strategy='constant')),
                                                    ('onehot',
                                                     OneHotEncoder(handle_unknown='ignore'))]),
                                    ['state', 'gender']),
                                   ('numerical',
                                    Pipeline(steps=[('quasi_constant_remover',
                                                     VarianceThreshold()),
                                                    ('imputation_mean',
                             

### Advantages of Pipelines

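#### Avoid data leakage

Imputation statistics computed on the full dataset leak information from held-out rows into training. The cells below compare filling missing values with the mode of all rows against the mode of the first 4,000 rows only (treated here as a stand-in for a training split). Note how the imputed `dropoff_zone` differs between the two.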

In [43]:
df.loc[lambda x: x.isna().any(axis=1), :]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
7,2019-03-22 12:47:13,2019-03-22 12:58:17,0,1.4,8.5,0.00,0.0,11.80,yellow,,Murray Hill,Flatiron,Manhattan,Manhattan
42,2019-03-30 23:59:14,2019-03-30 23:59:17,1,0.0,80.0,20.08,0.0,100.38,yellow,credit card,,,,
445,2019-03-19 06:57:14,2019-03-19 07:00:08,1,1.3,5.5,0.00,0.0,6.30,yellow,,Boerum Hill,Columbia Street,Brooklyn,Brooklyn
491,2019-03-07 07:11:33,2019-03-07 07:11:39,1,1.6,2.5,0.00,0.0,5.80,yellow,,Murray Hill,Murray Hill,Manhattan,Manhattan
545,2019-03-27 11:03:43,2019-03-27 11:14:34,1,4.2,15.0,0.00,0.0,15.80,yellow,,LaGuardia Airport,Forest Hills,Queens,Queens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6118,2019-03-30 00:49:48,2019-03-30 00:49:56,1,0.0,25.0,0.00,0.0,25.50,green,credit card,Prospect Heights,,Brooklyn,
6169,2019-03-27 02:11:01,2019-03-27 02:12:03,1,4.1,3.0,0.00,0.0,4.30,green,,Jackson Heights,Jackson Heights,Queens,Queens
6311,2019-03-12 07:10:30,2019-03-12 07:14:18,1,0.7,4.5,0.00,0.0,5.30,green,,Long Island City/Hunters Point,Long Island City/Hunters Point,Queens,Queens
6314,2019-03-28 22:36:04,2019-03-28 22:36:07,1,0.0,25.0,0.00,0.0,25.00,green,cash,Jamaica,,Queens,


In [51]:
df.fillna(df.mode().iloc[0])\
  .loc[df.isna().any(axis=1), :]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
7,2019-03-22 12:47:13,2019-03-22 12:58:17,0,1.4,8.5,0.00,0.0,11.80,yellow,credit card,Murray Hill,Flatiron,Manhattan,Manhattan
42,2019-03-30 23:59:14,2019-03-30 23:59:17,1,0.0,80.0,20.08,0.0,100.38,yellow,credit card,Midtown Center,Upper East Side North,Manhattan,Manhattan
445,2019-03-19 06:57:14,2019-03-19 07:00:08,1,1.3,5.5,0.00,0.0,6.30,yellow,credit card,Boerum Hill,Columbia Street,Brooklyn,Brooklyn
491,2019-03-07 07:11:33,2019-03-07 07:11:39,1,1.6,2.5,0.00,0.0,5.80,yellow,credit card,Murray Hill,Murray Hill,Manhattan,Manhattan
545,2019-03-27 11:03:43,2019-03-27 11:14:34,1,4.2,15.0,0.00,0.0,15.80,yellow,credit card,LaGuardia Airport,Forest Hills,Queens,Queens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6118,2019-03-30 00:49:48,2019-03-30 00:49:56,1,0.0,25.0,0.00,0.0,25.50,green,credit card,Prospect Heights,Upper East Side North,Brooklyn,Manhattan
6169,2019-03-27 02:11:01,2019-03-27 02:12:03,1,4.1,3.0,0.00,0.0,4.30,green,credit card,Jackson Heights,Jackson Heights,Queens,Queens
6311,2019-03-12 07:10:30,2019-03-12 07:14:18,1,0.7,4.5,0.00,0.0,5.30,green,credit card,Long Island City/Hunters Point,Long Island City/Hunters Point,Queens,Queens
6314,2019-03-28 22:36:04,2019-03-28 22:36:07,1,0.0,25.0,0.00,0.0,25.00,green,cash,Jamaica,Upper East Side North,Queens,Manhattan


In [60]:
df.fillna(df.iloc[:4000].mode().iloc[0])\
  .loc[df.isna().any(axis=1), :]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
7,2019-03-22 12:47:13,2019-03-22 12:58:17,0,1.4,8.5,0.00,0.0,11.80,yellow,credit card,Murray Hill,Flatiron,Manhattan,Manhattan
42,2019-03-30 23:59:14,2019-03-30 23:59:17,1,0.0,80.0,20.08,0.0,100.38,yellow,credit card,Midtown Center,Midtown Center,Manhattan,Manhattan
445,2019-03-19 06:57:14,2019-03-19 07:00:08,1,1.3,5.5,0.00,0.0,6.30,yellow,credit card,Boerum Hill,Columbia Street,Brooklyn,Brooklyn
491,2019-03-07 07:11:33,2019-03-07 07:11:39,1,1.6,2.5,0.00,0.0,5.80,yellow,credit card,Murray Hill,Murray Hill,Manhattan,Manhattan
545,2019-03-27 11:03:43,2019-03-27 11:14:34,1,4.2,15.0,0.00,0.0,15.80,yellow,credit card,LaGuardia Airport,Forest Hills,Queens,Queens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6118,2019-03-30 00:49:48,2019-03-30 00:49:56,1,0.0,25.0,0.00,0.0,25.50,green,credit card,Prospect Heights,Midtown Center,Brooklyn,Manhattan
6169,2019-03-27 02:11:01,2019-03-27 02:12:03,1,4.1,3.0,0.00,0.0,4.30,green,credit card,Jackson Heights,Jackson Heights,Queens,Queens
6311,2019-03-12 07:10:30,2019-03-12 07:14:18,1,0.7,4.5,0.00,0.0,5.30,green,credit card,Long Island City/Hunters Point,Long Island City/Hunters Point,Queens,Queens
6314,2019-03-28 22:36:04,2019-03-28 22:36:07,1,0.0,25.0,0.00,0.0,25.00,green,cash,Jamaica,Midtown Center,Queens,Manhattan
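
A pipeline avoids this kind of leakage by learning the imputation value only from the rows passed to `fit()`. A minimal sketch on made-up data (the `zone` column, the labels and the four-row training split are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the most common zone differs between the
# first four rows (the "training" rows) and the full dataset
X = pd.DataFrame(
    {"zone": ["A", "A", "A", "B", "B", "B", "B", np.nan, np.nan, "B"]}
)
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
X_train, y_train = X.iloc[:4], y[:4]
X_test = X.iloc[4:]

leak_free = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)
# fit() learns the imputation value from the training rows only
leak_free.fit(X_train, y_train)
train_mode = leak_free.named_steps["simpleimputer"].statistics_[0]

# For comparison: an imputer fit on all rows also sees the test rows
full_mode = SimpleImputer(strategy="most_frequent").fit(X).statistics_[0]

print(train_mode, full_mode)
# predict() reuses the training-set mode to fill the missing test rows
print(leak_free.predict(X_test))
```

The same holds under cross-validation: helpers such as `cross_val_score` refit the whole pipeline on each training fold, so the imputer never sees the validation fold.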


#### Grid search and cross-validation are more efficient with pipelines

In [29]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(
    estimator=pipe,
    # Keys follow `<step>__<sub-step>__<parameter>` and must match
    # entries of pipe.get_params()
    param_grid={
        'columntransformer__text__best_word__bow__ngram_range': [
            (1, 1), (1, 2), (1, 3),
        ],
        'columntransformer__numerical__imputation_mean__strategy': [
            'mean', 'median',
        ],
        'columntransformer__numerical__scaler': [
            MinMaxScaler(), StandardScaler(), None,
        ],
    },
)
gs
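
The cell above only constructs the search. As a runnable sketch of the whole loop, the following uses a small made-up pipeline on synthetic data (the step names `scaler` and `model` and the grid values are assumptions for illustration). It also shows one way to check grid keys against `get_params()` before fitting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

param_grid = {
    # A whole step can be swapped out, or skipped with "passthrough"
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],
    # `<step>__<parameter>` reaches into a step
    "model__C": [0.1, 1.0, 10.0],
}

# Any key missing from get_params() would make fit() raise an error
unknown_keys = set(param_grid) - set(toy_pipe.get_params())
print(unknown_keys)  # set()

search = GridSearchCV(toy_pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 3 scalers x 3 values of C = 9
```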

# References
* [6.1. Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html)