 # `scikit-learn` Pipelines
 
_Author: Carleton Smith_

<a id="learning-objectives"></a>
# Learning Objectives
- Review EDA and preprocessing in pandas
- Build a preprocessing pipeline in sklearn

<a id="top"></a>
# Lesson Guide
- [Acquire Data](#acquire)
- [Sklearn Pipelines](#pipelines)
- [Exploratory Data Analysis](#explore)
- [Preprocessing Pipeline](#preprocess)
- [Model Building](#modeling)

In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
import warnings

# silence all warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
warnings.warn = warn

<a id="acquire"></a>
# Acquire Data

For this lesson, we will use the "Census Income" dataset, provided by the UCI Machine Learning Repository.

**Link**: https://archive.ics.uci.edu/ml/datasets/Adult

Our goal will be to predict if an individual's income exceeds \$50k per year based on census data.


**FEATURES**

1. `age`: continuous.
2. `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. `fnlwgt`: continuous.
4. `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5. `education-num`: continuous.
6. `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. `sex`: Female, Male.
11. `capital-gain`: continuous.
12. `capital-loss`: continuous.
13. `hours-per-week`: continuous.
14. `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



In [None]:
# the raw file has no header row; missing values appear as ' ?'
adult = pd.read_csv('./datasets/adult.data.txt', na_values=' ?', header=None)
adult.head()

In [None]:
features = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education_num',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country',
    'income',
]

**Challenge**: Add column headers using the `features` list defined above

These are all the modifications we'll make at this time.

[Back to Top](#top)

<a id="pipelines"></a>
# Review: Sklearn Pipelines
---

Sklearn provides a [module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline) for creating preprocessing pipelines. Some of you may already be familiar with it. We will demonstrate Pipelines through an "end to end" project.

In [None]:
# import Pipeline class
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [None]:
X = adult[['age']].copy()
y = adult['fnlwgt']
display(X.head())
display(y.head())

**Explanation**    
A sklearn `Pipeline` takes a list of tuples. Each tuple is a step. The first element of a tuple is the name of the step and the second element is the object that performs the step. Every step must have a `.fit` and a `.transform` method. The only exception is the last step, which can be a model object (sklearn models do not have a `.transform` method).

The output of one step becomes the input of the next step, so order matters when constructing the pipeline.
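As a minimal sketch of the idea, the cell below builds a two-step pipeline on a tiny made-up array instead of the census data. The names `X_toy`, `y_toy`, `pipe`, and the step names `'scaler'` and `'lr'` are illustrative assumptions, not part of the lesson.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# hypothetical toy data: y is exactly 2 * x + 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

# each step is a (name, estimator) tuple; the last step is the model
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LinearRegression()),
])

# fit() fits the scaler, transforms the data, then fits the model on the scaled data
pipe.fit(X_toy, y_toy)

# predict() re-applies the fitted scaler before calling the model
preds = pipe.predict(np.array([[5.0]]))
print(preds)
```

Because the scaler sits inside the pipeline, new data passed to `predict` is scaled with the statistics learned during `fit`, with no separate call needed.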

**Example**:

In [None]:
# instantiate a pipeline with 2 steps


In [None]:
# fit the pipeline with data


In [None]:
# make predictions


<a id="explore"></a>
# Exploratory Data Analysis
---

Let's run through some basic EDA procedures.

**Challenge**: Show how many missing values exist.

**Challenge**: What are the data types for each column?

**Challenge**: What is the distribution of `income`? This is our target variable.

**Challenge**: Create a plot of correlations

**Challenge**: What are the distributions of the numeric data?

[Back to Top](#top)

<a id="preprocess"></a>
# Preprocessing Pipeline
---

We will build this pipeline out sequentially.

**PREPROCESSING STEPS**
1. Separate target variable from features - sklearn requires this.
2. Perform a train-test split
3. With training data:
    - **SEPARATE** numeric columns from categorical ones
    - **NUMERIC DF** preprocessing:
        - Replace nan values
        - Standardize features
   
    - **CATEGORICAL DF** preprocessing:
        - Replace nan values
        - Create dummy variables
    - **CONCATENATE** numeric and categorical DF
    - **ENCODE** target variable
<br>
<br>
4. Package these steps into a `Pipeline`

**Challenge**: Using a list comprehension, create a list of the numeric columns. Call it `num_cols`.

**Challenge**: Using a list comprehension, create a list of the categorical columns. Call it `cat_cols`.

**1. Separate target variable from features**

In [None]:
X = adult.drop('income', axis=1)
y = adult['income']

**Challenge**: Make a function that will extract out `X`. Call the function `feature_extractor`.

- The function should accept a DataFrame and return a DataFrame with the target variable removed.
- It's okay to hardcode the name of the target variable `income`.

**2. Perform a train-test split**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    stratify=y,
    random_state=24
)

**3. We will make functions for the following:**

- **SEPARATE** numeric columns from categorical ones
- **NUMERIC DF** preprocessing:
    - Replace nan values
    - Standardize features
- **CATEGORICAL DF** preprocessing:
    - Replace nan values
    - Create dummy variables
- **CONCATENATE** numeric and categorical DF
- **ENCODE** target variable

**Challenge**: Create a function that will extract the categorical columns. Name this function `categorical_extractor`.

- The function should accept a DataFrame and return a DataFrame with only categorical variables.

**Challenge**: Create a function that will extract the numeric columns. Name this function `numeric_extractor`.

- The function should accept a DataFrame and return a DataFrame with only numeric variables.
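One possible shape for the column lists and the two extractors is sketched below on a small made-up frame. The frame `toy` is hypothetical, and selecting columns by dtype with `pd.api.types.is_numeric_dtype` is an assumption about how the lists might be built, not necessarily the lesson's intended answer.

```python
import pandas as pd

# hypothetical toy frame standing in for the training features
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': [' State-gov', ' Private', None],
    'hours_per_week': [40, 13, 40],
    'sex': [' Male', ' Male', ' Female'],
})

# assumption: a column is numeric if pandas reports a numeric dtype for it
num_cols = [col for col in toy.columns if pd.api.types.is_numeric_dtype(toy[col])]
cat_cols = [col for col in toy.columns if col not in num_cols]

def categorical_extractor(df):
    """Return only the categorical columns of df."""
    return df[cat_cols]

def numeric_extractor(df):
    """Return only the numeric columns of df."""
    return df[num_cols]

print(num_cols, cat_cols)
```

Both functions read the module-level lists, which keeps their signature down to a single DataFrame argument.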

The following function is provided for convenience. It will add column names to dummy variable columns.

In [None]:
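def dummy_col_adder(array):
    """Wrap the one-hot encoded array in a DataFrame with readable column names.

    Relies on the globals `cat_cols` and a fitted `cat_pipe`.
    """
    dummy_cols = []
    # pair each categorical column with the categories the encoder learned for it
    for col, cat_set in zip(cat_cols, cat_pipe.named_steps['OneHotEncoder'].categories_):
        for cat in cat_set:
            dummy_cols.append(col + '_' + cat)
    return pd.DataFrame(array, columns=dummy_cols)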

[Back to Top](#top)

### `FunctionTransformer`

Sklearn pipelines require every step to be an object with a `.fit` and a `.transform` method. In order to use the functions we defined above, we need to convert them to "transformer" objects. The `FunctionTransformer` class is designed for exactly that.
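A small sketch of the wrapping step, using a made-up frame and a hypothetical helper `keep_age` rather than the lesson's own extractors:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# hypothetical toy frame
toy = pd.DataFrame({'age': [39, 50], 'sex': [' Male', ' Female']})

def keep_age(df):
    """Hypothetical helper: keep only the 'age' column."""
    return df[['age']]

# validate=False hands the DataFrame to the function without converting it to an array
age_transformer = FunctionTransformer(keep_age, validate=False)

direct = keep_age(toy)                          # plain function call
wrapped = age_transformer.fit_transform(toy)    # same work through the transformer API
print(wrapped)
```

The two results should be identical. The wrapper only adds the `.fit` / `.transform` interface that a `Pipeline` expects.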

**DEMO**: Use `numeric_extractor` to grab numeric columns. Then try doing the same task after converting it to a transformer.

In [None]:
from sklearn.preprocessing import FunctionTransformer

In [None]:
# use numeric_extractor function


In [None]:
# convert numeric_extractor to transformer, then use to transform data


**Challenge**: Using the `num_transformer` defined above, build a pipeline for the numeric data. Call the pipeline `num_pipe`. The steps should include:

1. `num_transformer`
2. `SimpleImputer` (from sklearn.impute - use the `median` strategy)
3. `StandardScaler` (from sklearn.preprocessing)
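For orientation, here is a sketch of how those three steps could fit together on a made-up frame. The frame `toy`, the one-column `num_cols`, and the step names `'num_im'` and `'scaler'` are assumptions for illustration. Treat it as one possible layout rather than the reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.impute import SimpleImputer

# hypothetical toy frame with one missing numeric value
toy = pd.DataFrame({
    'age': [20.0, 30.0, np.nan, 40.0],
    'sex': [' Male', ' Female', ' Male', ' Female'],
})
num_cols = ['age']

def numeric_extractor(df):
    """Return only the numeric columns of df."""
    return df[num_cols]

num_transformer = FunctionTransformer(numeric_extractor, validate=False)

num_pipe = Pipeline([
    ('num_transformer', num_transformer),           # keep numeric columns
    ('num_im', SimpleImputer(strategy='median')),   # fill NaN with the column median
    ('scaler', StandardScaler()),                   # zero mean, unit variance
])

num_out = num_pipe.fit_transform(toy)
print(num_out)
```

In this toy case the missing age is filled with the median (30), which is also the mean after imputation, so that row scales to 0.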


In [None]:
from sklearn.impute import SimpleImputer

In [None]:
# make numeric pipe


In [None]:
# transform X_train


#### We'll now build the pipeline for categorical data

1. Use `FunctionTransformer` to transform `categorical_extractor`
2. Use `FunctionTransformer` to transform `dummy_col_adder`
3. Build the categorical pipeline

For number 3, include the following steps:

- `cat_transformer`
- `SimpleImputer(strategy='most_frequent')`
- `OneHotEncoder(sparse=False, handle_unknown='ignore')`
- `dummy_col_transformer`

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# transform our two functions
cat_transformer = FunctionTransformer(categorical_extractor, validate=False)
dummy_col_transformer = FunctionTransformer(dummy_col_adder, validate=False)

In [None]:
cat_pipe = Pipeline([
    ('cat_transformer', cat_transformer),
    ('cat_im', SimpleImputer(strategy='most_frequent')),
    # note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False
    ('OneHotEncoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('dummy_col_transformer', dummy_col_transformer)
])

In [None]:
cat_pipe.fit(X_train)

In [None]:
cat_pipe.transform(X_train).head()

[Back to Top](#top)

### `FeatureUnion`

We now have two pipelines: `num_pipe` and `cat_pipe`. Each handles its own subset of columns, so they can work side by side on the same input. For this, we use the `FeatureUnion` class from the `pipeline` module. It runs each pipeline separately and concatenates the results column-wise.
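To see the concatenation in isolation, the sketch below unions two simple scalers on a made-up array. Using `StandardScaler` and `MinMaxScaler` as the two branches is purely an illustrative assumption. In the lesson the branches are the two pipelines.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# hypothetical toy data: a single feature
X_toy = np.array([[1.0], [2.0], [3.0]])

# every branch receives the same input; outputs are stacked side by side
union = FeatureUnion([
    ('standard', StandardScaler()),
    ('minmax', MinMaxScaler()),
])

combined = union.fit_transform(X_toy)
print(combined.shape)
```

One input column goes in and two come out: the first from the standard scaler, the second from the min-max scaler, in the order the branches were listed.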

In [None]:
from sklearn.pipeline import FeatureUnion

In [None]:
# make FeatureUnion
feat_union = FeatureUnion([
    ('num_pipe', num_pipe),
    ('cat_pipe', cat_pipe)
])

In [None]:
feat_union.fit(X_train).transform(X_train)[:1]

Above, the first 6 items are from the numeric data. The remaining columns are dummy variables from the categorical columns.


Finally, we combine this into one pipeline.

In [None]:
feature_pipe = Pipeline([
    ('feat_union', feat_union)
])

#### Use this pipeline to _fit_ and _transform_ `X_train`

In [None]:
# fit and transform training data
X_train_prepared = pd.DataFrame(
    feature_pipe.fit(X_train).transform(X_train),
    index=X_train.index,
    columns = num_cols + [col+ '_' + level.strip()
                          for col, cat in zip(cat_cols, cat_pipe.named_steps['OneHotEncoder'].categories_)
                          for level in cat])
X_train_prepared.head()

**Challenge**: Use this fitted pipeline to transform `X_test`.

In [None]:
# transform testing data
X_test_prepared = pd.DataFrame()

# print out 
X_test_prepared.head()

**ENCODE TARGET VARIABLE**

The final step is to encode the target variable `income` to be numeric. For this, we'll use `LabelEncoder` from sklearn.preprocessing.
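As a quick illustration of how `LabelEncoder` behaves, the sketch below encodes a few made-up labels. The lists `y_toy_train` and `y_toy_test` are hypothetical, and the leading space in each label is an assumption based on how values appear in the raw file.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical labels, written with the leading space found in the raw file
y_toy_train = [' <=50K', ' >50K', ' <=50K']
y_toy_test = [' >50K', ' <=50K']

le_toy = LabelEncoder()

# fit on the training labels only, then reuse the same mapping for the test labels
train_encoded = le_toy.fit_transform(y_toy_train)
test_encoded = le_toy.transform(y_toy_test)

print(le_toy.classes_)
print(train_encoded, test_encoded)
```

`LabelEncoder` sorts the classes before numbering them, which is why the lower income bracket ends up as 0 here.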

In [None]:
y_train[:5]

**Challenge**: Use `LabelEncoder` to transform `y_train` so that "<=50K" is 0 and ">50K" is 1.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# fit and transform y_train
le = ''

Transform the test set:

In [None]:
# transform ONLY: y_test
y_test_encoded = pd.Series(le.transform(y_test), index=y_test.index)
y_test_encoded[:5]

[Back to Top](#top)

<a id="modeling"></a>
# Model Building
---

Now that our data is ready, we can move on to modeling.

**Challenge**: What is the first step of the modeling process?

ANSWER: Calculate the baseline.

In [None]:
# baseline accuracy: share of class 0 (the majority class) in the test set
y_test_encoded.value_counts()[0] / y_test_encoded.value_counts().sum()

The simplest possible model predicts the majority class (\$50k or less) for every observation.

We would achieve about 76% accuracy if we did this. All future models must beat this baseline.
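The same baseline can also be expressed as a model. The sketch below uses sklearn's `DummyClassifier`, which the lesson itself does not use, on a tiny made-up label array. `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# hypothetical encoded labels: 3 of 4 observations belong to class 0
y_toy = np.array([0, 0, 0, 1])
X_toy = np.zeros((4, 1))  # the dummy model ignores the features

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

Wrapping the baseline in an estimator makes it easy to score with the same code path as the real models that follow.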