# Basic Workflow

In [2]:
# Always have your imports at the top
import random
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from hashlib import sha1 # just for grading purposes
import json # just for grading purposes

from utils import get_dataset, workflow_steps, data_analysis_steps

def _hash(obj, salt='none'):
    if type(obj) is not str:
        obj = json.dumps(obj)
    to_encode = obj + salt
    return sha1(to_encode.encode()).hexdigest()

X, y = get_dataset()  # preloaded dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

## Exercise 1: ~The Larch~ Workflow steps

What are the basic workflow steps?

<img src="media/the_larch.jpg" width="300" />

<p style="font-size: 9px; text-align:center">A larch</p>

You probably know them already, but we want you to really internalize them. We've given you a list of steps, `workflow_steps`, but it appears that not only does it have too many steps, some of them are _probably_ wrong as well.

Select the correct ones and reorder them!

In [3]:
print("Workflow steps:")
for i, step in enumerate(workflow_steps, start=1):
    print(i, ': ', step)

Workflow steps:
1 :  Evaluate results
2 :  Get the data
3 :  Establish a Baseline
4 :  Watch Netflix
5 :  Increase complexity
6 :  Data analysis and preparation
7 :  Iterate
8 :  Spam
9 :  Train model
10 :  Google Hackathon solutions


In [4]:
# Exercise 1.1. Filter and sort the names of the steps in the workflow_steps list
workflow_steps_answer = ['Get the data', 'Data analysis and preparation', 'Train model', 'Evaluate results', 'Iterate']
workflow_steps_answer

['Get the data',
 'Data analysis and preparation',
 'Train model',
 'Evaluate results',
 'Iterate']

In [5]:
### BEGIN TESTS
assert _hash([step.lower() for step in workflow_steps_answer], 'salt0') == '701e2306da9bfde36382bdb6feb80a354916ebf4'
### END TESTS

There are way too many substeps in the Data Analysis and Preparation step to group them all under a single category. We've given you another list of steps: `data_analysis_steps`.

Aside from being shuffled, it should be fine, but keep an eye out. You never know what to expect...

In [6]:
print("Data Analysis and Preparation steps:")
for i, step in enumerate(data_analysis_steps, start=1):
    print(i, ': ', step)

Data Analysis and Preparation steps:
1 :  Feature engineering
2 :  Data analysis
3 :  Spanish Inquisition
4 :  Dealing with data problems
5 :  Feature selection


In [7]:
# Exercise 1.2. Filter and sort the names of the steps in the data_analysis_steps list
data_analysis_steps_answer = ['Data analysis','Dealing with data problems', 'Feature engineering', 'Feature selection']


In [8]:
### BEGIN TESTS
assert _hash([step.lower() for step in data_analysis_steps_answer], 'salt0') == '658ab90eff4a0cea2bfb51cc89c8db5b4121fa86'
### END TESTS

<p style="text-align:center">That's right!</p>

<p style="text-align:center"><b>Nobody</b> expects the Spanish Inquisition!</p>

<img src="media/spanish_inquisition.gif" width="400" />

## Exercise 2: Specific workflow questions

Here are some more specific questions about individual workflow steps.

In [9]:
# Exercise 2.1. True or False, you should split your data into a training set and a test set
split_training_test = True

# Exercise 2.2. True or False, Scikit Pipelines are only useful in production environments
scikit_pipelines_useful = False

# Exercise 2.3. True or False, you should try to make a complex baseline, so that you only
#               have to make simple improvements to it later on.
baseline_complex = False

# Exercise 2.4. (optional) True or False, is Brian the Messiah?
# is_brian_the_messiah = ...


In [10]:
### BEGIN TESTS
assert _hash(split_training_test, 'salt1') == '569b45c42b5c7b490c92692b911af35f575c8a06'
assert _hash(scikit_pipelines_useful, 'salt2') == 'ef07576cc7d3bcb2cf29e1a772aec2aad7f59158'
assert _hash(baseline_complex, 'salt3') == 'f24a294afb4a09f7f9df9ee13eb18e7d341c439d'
### END TESTS

<img src="media/monty_python_messiah.jpg" width=500>
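
To make Exercises 2.1 and 2.3 more concrete, here is a minimal sketch of a simple baseline evaluated on a held-out test set. It runs on synthetic data from `make_classification` rather than on this notebook's dataset, and the choice of `DummyClassifier`, `LogisticRegression` and the `random_state` values are illustrative assumptions, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set so the score reflects data the model has never seen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Simplest possible baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# A more complex model is only worth keeping if it beats the baseline
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
model_acc = accuracy_score(y_te, model.predict(X_te))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

Starting from something this simple gives you a reference number, so every later increase in complexity can be judged against it.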

## Scikit Pipelines

We've already loaded and split a dataset for the following exercises. The splits are stored in the `X_train`, `X_test`, `y_train` and `y_test` variables.

In a perfect world, where all your data is clean and ready to go, you can build your pipeline with just Scikit-learn's built-in Transformers. In the real world that's rarely the case, and you'll need to create custom Transformers to get the job done. Take a look at the dataset. What do you see?

In [11]:
# Do some data analysis here
X_train.describe()
X_train.info()
X_train.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67 entries, 61 to 54
Data columns (total 24 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   17      67 non-null     float64
 1   leg_1   67 non-null     object 
 2   7       67 non-null     float64
 3   arm_2   67 non-null     object 
 4   2       67 non-null     float64
 5   11      67 non-null     float64
 6   4       59 non-null     float64
 7   16      67 non-null     float64
 8   0       57 non-null     float64
 9   14      59 non-null     float64
 10  5       56 non-null     float64
 11  3       67 non-null     float64
 12  leg_3   67 non-null     object 
 13  13      67 non-null     float64
 14  12      59 non-null     float64
 15  10      67 non-null     float64
 16  15      67 non-null     float64
 17  9       67 non-null     float64
 18  19      67 non-null     float64
 19  18      67 non-null     float64
 20  8       67 non-null     float64
 21  6       67 non-null     float64
 22  arm

|  | 17 | leg_1 | 7 | arm_2 | 2 | 11 | 4 | 16 | 0 | 14 | ... | 12 | 10 | 15 | 9 | 19 | 18 | 8 | 6 | arm_0 | 1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 61 | 0.489187 | courageous | 1.1097 | neuronal | 0.634721 | 0.586694 | 0.513908 | -1.000331 | 0.871297 | -0.474904 | ... | NaN | 1.35537 | 0.563723 | 0.12638 | 0.35063 | 0.179582 | -1.34598 | -0.241258 | teletype | -1.47487 |
| 65 | -1.660961 | surety | 0.207688 | clipboard | 0.429618 | -0.074433 | 0.833922 | 0.55979 | NaN | 0.620672 | ... | 0.271579 | 1.804741 | -0.566037 | 0.380198 | -0.070166 | 0.45918 | -1.335344 | -1.276749 | aggregate | -1.416933 |
| 58 | -0.428115 | mask | 0.850222 | g | 1.50076 | -1.294681 | 0.996267 | 0.076822 | -0.467701 | 1.160827 | ... | NaN | 0.559426 | 1.724002 | -0.046921 | -1.556582 | -0.493757 | 0.346504 | -0.349258 | impeccable | -1.228234 |
| 70 | -0.083106 | lulu | 0.760056 | airborne | -1.50472 | 0.043602 | -0.429302 | -0.611769 | -0.622649 | 1.695051 | ... | 0.08244 | 1.27989 | 0.663523 | -0.742471 | -1.406317 | -0.692421 | 0.194607 | -1.457551 | upkeep | -1.447232 |
| 29 | -1.125489 | jackdaw | 0.129221 | interruptible | 2.445752 | 0.869606 | 0.654366 | -1.77872 | 0.413435 | 1.355638 | ... | 0.109395 | -1.54851 | -0.734265 | -0.773789 | 0.279969 | -0.055585 | 1.876796 | 0.725767 | chapati | 1.722513 |


While crunching your data, you probably found two issues:

1. There are 4 columns whose names start with either `arm` or `leg`, and all of them are filled with gibberish
2. Some columns have missing values
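
Both issues can also be surfaced programmatically. The sketch below uses a small hypothetical DataFrame (`toy`, with invented column names and values) instead of `X_train`; `select_dtypes` and `isna` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two problems (not the real dataset)
toy = pd.DataFrame({
    "0": [0.5, np.nan, 1.2],
    "1": [1.0, 2.0, 3.0],
    "arm_0": ["teletype", "aggregate", "upkeep"],
})

# Issue 1: columns that do not hold numbers
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

# Issue 2: missing values per column, keeping only the affected columns
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

print(text_columns)
print(missing_counts.to_dict())
```

Running the same two checks on `X_train` should point at the `arm`/`leg` columns and at the columns with fewer non-null entries in the `info()` output.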

So, first things first: let's get rid of those columns with a Custom Transformer, so we can plug it into a Scikit Pipeline afterwards.

## Exercise 3: Custom Transformer

In [67]:
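class RemoveLimbs(TransformerMixin):
    """Drop every column whose name starts with 'arm' or 'leg'."""

    def fit(self, *_):
        # Stateless transformer: there is nothing to learn from the data
        return self

    def transform(self, X, *_):
        # str.startswith accepts a tuple, so both prefixes are checked at once
        limb_columns = [col for col in X.columns if col.startswith(('arm', 'leg'))]
        # drop returns a new DataFrame, leaving the caller's X untouched
        return X.drop(columns=limb_columns)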


In [68]:
### BEGIN TESTS
assert _hash(sorted(RemoveLimbs().fit_transform(X).columns), 'salt5') == '71443dfc3077d773d4c74e958dadf91dc2cc148a'
assert _hash(list(map(lambda col: col.startswith('arm') or col.startswith('leg'), RemoveLimbs().fit_transform(X_train).columns)), 'salt6') == 'ce45cf3759d2210f2d1315f1673b18f34e3ac711'
### END TESTS

<img src="media/monty_python_black_knight.gif" width=500>
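
For a stateless step like this one, scikit-learn's `FunctionTransformer` is an alternative to writing a class. The following is a minimal sketch of that idea, not the exercise's expected solution; the helper `drop_limb_columns` and the toy DataFrame are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def drop_limb_columns(df):
    """Return a copy of df without columns whose name starts with 'arm' or 'leg'."""
    limb_columns = [col for col in df.columns if col.startswith(("arm", "leg"))]
    return df.drop(columns=limb_columns)


# Wrap the plain function so it exposes the fit/transform interface
limb_remover = FunctionTransformer(drop_limb_columns)

# Invented example frame with one numeric and two "limb" columns
toy = pd.DataFrame({"0": [1.0, 2.0], "arm_0": ["a", "b"], "leg_1": ["c", "d"]})
remaining_columns = limb_remover.fit_transform(toy).columns.tolist()
print(remaining_columns)
```

The grading tests in this notebook are written against the class-based `RemoveLimbs`, so treat this purely as a comparison.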

Now that we have our Custom Transformer in place, we can design our pipeline. For the sake of the exercise, you'll want to create a pipeline with the following steps:

1. Removes the limb columns
2. Imputes missing values with the mean
3. Has a Random Forest Classifier as the last step

You may use `make_pipeline` to create your pipeline with as many steps as you want, as long as the first step is the Custom Transformer you developed previously, the second is a `SimpleImputer`, and the last is a `RandomForestClassifier`.
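
As a quick illustration of the imputation step, this sketch shows what mean imputation does to a tiny hand-made array. The values are invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny invented matrix with one missing value per column
X_demo = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

# Each NaN is replaced by the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_demo)

print(imputer.statistics_)  # per-column means learned during fit
print(X_filled)
```

The means are learned in `fit` and reused in `transform`, which is why the imputer belongs inside the pipeline: the test set then gets filled with statistics from the training set.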

## Exercise 4: Scikit Pipelines

In [69]:
pipeline = make_pipeline(
    RemoveLimbs(),
    # it's cool how scikit already has a mean imputer ready to go!
    SimpleImputer(strategy='mean'),
    
    RandomForestClassifier(n_estimators=10)
)

In [70]:
### BEGIN TESTS
assert _hash(pipeline.steps[0][0], 'salt7') == '471b02068ac2c4f479c2e9f85f4b3dc2179bb841'
assert _hash(pipeline.steps[1][0], 'salt8') == 'ca83eaea1a7e243fa5574cfa6f52831166ee0f32'
assert _hash(pipeline.steps[-1][0], 'salt9') == '0d66ba4309ad4939673169e74f87088dcadd510b'
### END TESTS
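
The tests above inspect `pipeline.steps`, a list of `(name, estimator)` pairs. `make_pipeline` generates each name from the lowercased class name, as this sketch shows on a small stand-alone pipeline (built only from scikit-learn classes, so it does not depend on `RemoveLimbs`).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

demo_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=10),
)

# steps is a list of (name, estimator) tuples in execution order
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# named_steps gives access to an individual step by its generated name
imputer_strategy = demo_pipeline.named_steps["simpleimputer"].strategy
print(imputer_strategy)
```

If you would rather choose the names yourself, the `Pipeline` class takes the `(name, estimator)` pairs explicitly.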

Does it work? Let's check it out on our dataset!

In [71]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('removelimbs',
                 <__main__.RemoveLimbs object at 0x7f403f4f2da0>),
                ('simpleimputer', SimpleImputer()),
                ('randomforestclassifier',
                 RandomForestClassifier(n_estimators=10))])

In [72]:
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)

0.9393939393939394

That's it for this Exercise Notebook! It doesn't get much cleaner than this, does it?

You can keep practicing with pipelines, for example by adding a few more steps. See how you can adapt your pipeline and how each change affects the predictions.
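
One way to try this out is sketched below: two candidate pipelines, one with an extra `MinMaxScaler` step, compared by cross-validation. This is only an illustration under stated assumptions: the data is synthetic (`make_classification` with some values set to NaN), and the pipeline labels, seeds and `cv=5` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with a few values knocked out (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

candidates = {
    "impute + forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(n_estimators=10, random_state=0),
    ),
    "impute + scale + forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        MinMaxScaler(),
        RandomForestClassifier(n_estimators=10, random_state=0),
    ),
}

# The whole pipeline is refit on each split, so the imputer and scaler
# are fit on the training folds only
scores = {
    name: cross_val_score(pipe, X_demo, y_demo, cv=5).mean()
    for name, pipe in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Tree-based models are largely insensitive to feature scaling, so don't expect the scaler to change much here; the point is how little code an extra step costs.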

Can you see how Scikit-learn's Pipelines might save time? Can you imagine how useful that would be in stressful situations (like, *for example*, a Hackathon)?