### Pipeline


#### This example uses the Titanic dataset, so all of the code below is based on it.

In [1]:
import pandas as pd

In [2]:
# df = pd.read_csv('titanic.csv')
# df.shape
# df.columns
# df.isna().sum()

In [3]:
# Step 1 - Selecting all rows and a handful of features.  We are trying to predict who SURVIVED by using the following
# features: ["Pclass", "Embarked", "Sex"]
# Step 2 - getting rid of null rows in "Embarked"

# df = df.loc[df.Embarked.notna(), ["Survived", "Pclass", "Embarked", "Sex"]]


In [4]:
# Step 3 - run df.shape and df.isna().sum() again to confirm that the filter worked and no null values remain

# df.shape
# df.isna().sum()
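The filter and the check above can be tried without the CSV file. This is a minimal sketch: the small frame below is made up (it is not the real Titanic data), and the extra `Age` column is only there to show that unselected columns get dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for titanic.csv (values are made up).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", np.nan, "Q", "S"],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
})

# Rows: keep only those where Embarked is present.
# Columns: keep the target plus the three features.
subset = df.loc[df["Embarked"].notna(), ["Survived", "Pclass", "Embarked", "Sex"]]

print(subset.shape)         # (rows kept, columns kept)
print(subset.isna().sum())  # every count should be 0
```

`.loc[rows, columns]` does both jobs in one call: the boolean mask picks the rows and the list picks the columns.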

In [5]:
# Step 4 - Cross-validate a model that predicts Survived using only Pclass. Then use a
# pipeline, OneHotEncoder and a column transformer to add Sex and Embarked

### Cross-validate a model (that predicts Survived using only the Pclass feature) using logistic regression (it is a classification problem). The point of cross-validation is to EVALUATE your model so you can decide whether you are building a good one.

In [6]:
# Step 1 - define X and y.  Remember, X needs to be 2-dimensional (a DataFrame); y can be 1-dimensional (a Series)

# X = df.loc[:, ["Pclass"]]
# y = df.Survived
# X.shape
# y.shape
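The 2-D versus 1-D point is easiest to see by comparing shapes. A small sketch with made-up values; the name `pclass_1d` is only for illustration.

```python
import pandas as pd

# Hypothetical values, only to show the shapes.
df = pd.DataFrame({"Survived": [0, 1, 1, 0], "Pclass": [3, 1, 2, 3]})

X = df.loc[:, ["Pclass"]]   # list of column names -> DataFrame, 2-D
y = df["Survived"]          # single column name   -> Series, 1-D
pclass_1d = df["Pclass"]    # same values as X, but 1-D

print(X.shape, y.shape, pclass_1d.shape)  # (4, 1) (4,) (4,)
```

The double brackets (a list of column names) are what keep X two-dimensional, even with a single feature.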

In [7]:
# Step 2 - Logistic Regression

# from sklearn.linear_model import LogisticRegression
# logreg = LogisticRegression()

In [8]:
# Step 3a - Calculate cross_val_score. (cv is the number of folds.  .mean() averages the accuracy over the 5 folds)

# from sklearn.model_selection import cross_val_score
# cross_val_score(logreg, X, y, cv=5, scoring="accuracy").mean()
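`cross_val_score` returns one score per fold, and `.mean()` collapses them into a single number. The sketch below shows the individual fold scores as well. The data is a made-up stand-in (every Pclass/Sex/Embarked combination with an invented survival rule), so the numbers it prints say nothing about the real Titanic data.

```python
from itertools import product

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the Titanic data: every Pclass/Sex/Embarked
# combination, repeated 5 times, with a made-up survival rule.
rows = list(product([1, 2, 3], ["female", "male"], ["C", "Q", "S"])) * 5
df = pd.DataFrame(rows, columns=["Pclass", "Sex", "Embarked"])
df["Survived"] = (
    ((df["Sex"] == "female") & (df["Pclass"] < 3))
    | ((df["Sex"] == "male") & (df["Pclass"] == 1) & (df["Embarked"] == "C"))
).astype(int)

X = df[["Pclass"]]
y = df["Survived"]

scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(scores)         # one accuracy value per fold
print(scores.mean())  # the single number used in the notebook
print(scores.std())   # spread across the folds
```

Looking at the spread as well as the mean gives a rough idea of how much the score depends on which rows land in which fold.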

In [1]:
# Step 3b - compare this to null accuracy.  It is the accuracy you would get by PREDICTING the MOST FREQUENT class.
# Your cross_val_score should generally beat the null accuracy

# y.value_counts(normalize=True)
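Null accuracy is just the share of the most frequent class. A minimal sketch with a made-up target column:

```python
import pandas as pd

# Hypothetical target column: 6 of 10 passengers did not survive.
y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="Survived")

class_shares = y.value_counts(normalize=True)  # share of each class, largest first
null_accuracy = class_shares.max()             # accuracy of always predicting the top class

print(class_shares)
print(null_accuracy)  # 0.6
```

A model that scores 0.6 on this target has learned nothing beyond "always predict 0".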

### One-hot encoding (also known as dummy encoding) CATEGORICAL features (the model needs numeric input!)

#### Why use OneHotEncoder versus pandas get_dummies?  
1. Because OneHotEncoder DOES NOT affect the width of your original df.
2. You don't have to worry about preprocessing NEW data as it comes in. For example: You won't have problems if your NEW data has different categories than your TRAINING data.  What happens if your TRAINING data has (C, Q, S in Embarked) and your NEW data has (only C and Q)?  With get_dummies the new data ends up with fewer columns than the model was trained on, so the shape is wrong!!
3. You can do a GRIDSEARCH with both model parameters and preprocessing parameters
4. In some cases, preprocessing OUTSIDE sklearn can make cross validation scores LESS RELIABLE. 
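Point 2 in the list above can be checked directly. This is a sketch with made-up training and new data for a single `Embarked` column. `handle_unknown="ignore"` is an optional extra that is not in the notebook cells; it covers the opposite case, where new data contains a category that never appeared in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and new data for the Embarked column.
train = pd.DataFrame({"Embarked": ["C", "Q", "S", "S"]})
new = pd.DataFrame({"Embarked": ["C", "Q"]})  # "S" never appears here

# get_dummies builds columns from whatever categories it sees in each frame.
train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)
print(train_dummies.shape, new_dummies.shape)  # (4, 3) (2, 2)

# OneHotEncoder learns the categories once, during fit, and reuses them.
ohe = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe.fit_transform(train).toarray()
new_encoded = ohe.transform(new).toarray()
print(train_encoded.shape, new_encoded.shape)  # (4, 3) (2, 3)
```

The fitted encoder keeps the "S" column for the new data (filled with zeros), so the width always matches what the model was trained on.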

In [10]:
# Step 1 - OneHotEncode sex column

# from sklearn.preprocessing import OneHotEncoder
# ohe = OneHotEncoder()
# ohe.fit_transform(df[["Sex"]])

In [11]:
# Step 2 - Check the CATEGORIES that OHE learned to make sure they are correct

# ohe.categories_
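By default `fit_transform` returns a sparse matrix, which is hard to read when printed. A small sketch, using a made-up `Sex` column, that converts it to a dense array so the columns can be matched against `categories_`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical Sex column.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["Sex"]])  # sparse matrix by default

print(ohe.categories_)    # learned categories, sorted: female first, then male
print(encoded.toarray())  # column 0 = female, column 1 = male
```

The output columns follow the order in `categories_`, so the first row ("male") becomes `[0, 1]`.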

#### Dummy encoding multiple categorical features in one go without changing Pclass feature (using make_column_transformer)

In [None]:
# Step 1 - Redefine X to have all 3 features

# X = df.drop("Survived", axis="columns")

In [12]:
# Step 2 - Using make_column_transformer to transform the 2 categorical features and leave Pclass alone. The
# column transformer does ALL THE PREPROCESSING at the same time.  It accepts more than one (transformer, columns) pair, so other preprocessing steps can be added the same way.

# from sklearn.compose import make_column_transformer
# column_transform = make_column_transformer(
#                                 (OneHotEncoder(), ["Sex", "Embarked"]),
#                                 remainder="passthrough") 

# column_transform.fit_transform(X)

#### Pipeline step (a Pipeline chains steps together). Then pass the pipeline through cross_val_score to check the accuracy of using 3 features instead of one.

In [14]:
# Step 1 - create the pipeline by chaining the PREPROCESSING step (column_transform) with the LOGREG model

# from sklearn.pipeline import make_pipeline
# pipe = make_pipeline(column_transform, logreg)

In [17]:
# Step 2 - Now pass the entire pipeline into cross_val_score.  Hopefully your accuracy score increased.  If
# it did, then adding the 2 other features IMPROVED your model! (e.g. 0.7777 versus 0.67)

# cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
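The comparison described above (null accuracy, one feature, three features) can be put side by side. This is a sketch on made-up data with an invented survival rule, so the printed numbers are not the notebook's Titanic results. `handle_unknown="ignore"` is an optional addition, not part of the original cells.

```python
from itertools import product

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic data (made-up survival rule).
rows = list(product([1, 2, 3], ["female", "male"], ["C", "Q", "S"])) * 5
df = pd.DataFrame(rows, columns=["Pclass", "Sex", "Embarked"])
df["Survived"] = (
    ((df["Sex"] == "female") & (df["Pclass"] < 3))
    | ((df["Sex"] == "male") & (df["Pclass"] == 1) & (df["Embarked"] == "C"))
).astype(int)

y = df["Survived"]
X_one = df[["Pclass"]]
X_all = df[["Pclass", "Sex", "Embarked"]]

pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
        remainder="passthrough",
    ),
    LogisticRegression(),
)

null_accuracy = y.value_counts(normalize=True).max()
one_feature = cross_val_score(LogisticRegression(), X_one, y, cv=5, scoring="accuracy").mean()
all_features = cross_val_score(pipe, X_all, y, cv=5, scoring="accuracy").mean()

print(f"null accuracy:  {null_accuracy:.3f}")
print(f"Pclass only:    {one_feature:.3f}")
print(f"all 3 features: {all_features:.3f}")
```

Adding features is only worth it if the last number is clearly above the other two.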

In [18]:
# This cross-validates an ENTIRE PIPELINE OF STEPS (not just a model): both the preprocessing of data and the model building.
# This includes:

# a) cross_val_score splits the data into 5 folds
# b) after splitting, it runs the pipeline on each split, so the preprocessing is fit on the training folds only
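Steps a) and b) can be written out by hand to see what `cross_val_score` is doing with a pipeline. This is a sketch of the idea, not scikit-learn's actual implementation. It assumes that `cv=5` with a classifier means an unshuffled `StratifiedKFold`, and it runs on made-up data.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic data (made-up survival rule).
rows = list(product([1, 2, 3], ["female", "male"], ["C", "Q", "S"])) * 5
df = pd.DataFrame(rows, columns=["Pclass", "Sex", "Embarked"])
df["Survived"] = (
    ((df["Sex"] == "female") & (df["Pclass"] < 3))
    | ((df["Sex"] == "male") & (df["Pclass"] == 1) & (df["Embarked"] == "C"))
).astype(int)
X = df[["Pclass", "Sex", "Embarked"]]
y = df["Survived"]

pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
        remainder="passthrough",
    ),
    LogisticRegression(),
)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):  # a) split
    fold_pipe = clone(pipe)                                # fresh, unfitted copy
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])    # b) encoder + model see training rows only
    manual_scores.append(fold_pipe.score(X.iloc[test_idx], y.iloc[test_idx]))

auto_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(manual_scores)
print(auto_scores)
print(np.allclose(manual_scores, auto_scores))
```

Because the encoder is refit inside every fold, nothing about the held-out rows leaks into the preprocessing.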

#### So, what do you do now?  Run NEW data through the pipeline you just built to make predictions!

In [19]:
# Step 1 - quickly create some "new" data by randomly sampling from your TRAINING data (for demonstration only; don't do this in a real project, because the model has already seen these rows)

# X_new = X.sample(5, random_state = 99)

In [20]:
# Step 2 - fit the pipeline (preprocessing + model) on the training data

# pipe.fit(X, y)

In [23]:
# Step 3 - predict!

# pipe.predict(X_new)

# NOTE: passing X_new directly only works because the PIPELINE INCLUDES the preprocessing step that converts Sex and Embarked
#       into numerical values (dummy encoding/one-hot encoding).  Without that step, the model could not
#       predict, because those columns would still hold CATEGORICAL data (male/female)
# The one line of code is DOING EVERYTHING!!
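Instead of sampling from the training data, new rows can be written by hand with raw values. A sketch on made-up data: the two "new passengers" are hypothetical, and `predict_proba` is an optional extra that is not in the notebook cells.

```python
from itertools import product

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic data (made-up survival rule).
rows = list(product([1, 2, 3], ["female", "male"], ["C", "Q", "S"])) * 5
df = pd.DataFrame(rows, columns=["Pclass", "Sex", "Embarked"])
df["Survived"] = (
    ((df["Sex"] == "female") & (df["Pclass"] < 3))
    | ((df["Sex"] == "male") & (df["Pclass"] == 1) & (df["Embarked"] == "C"))
).astype(int)
X = df[["Pclass", "Sex", "Embarked"]]
y = df["Survived"]

pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
        remainder="passthrough",
    ),
    LogisticRegression(),
)
pipe.fit(X, y)  # fits the encoder and the model together

# Hypothetical new passengers, given as raw (unencoded) values.
X_new = pd.DataFrame({
    "Pclass": [1, 3],
    "Sex": ["female", "male"],
    "Embarked": ["C", "S"],
})

predictions = pipe.predict(X_new)          # predicted class per row (0 or 1)
probabilities = pipe.predict_proba(X_new)  # one column per class, in pipe.classes_ order

print(predictions)
print(probabilities)
```

The new frame only needs the same column names as the training features; the pipeline applies the already fitted encoding before the model sees it.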