# Basic Workflow

In [54]:
# Always have your imports at the top
import random
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from hashlib import sha1 # just for grading purposes
import json # just for grading purposes

from utils import get_dataset, workflow_steps, data_analysis_steps

def _hash(obj, salt='none'):
    if type(obj) is not str:
        obj = json.dumps(obj)
    to_encode = obj + salt
    return sha1(to_encode.encode()).hexdigest()

X, y = get_dataset()  # preloaded dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

## Exercise 1: ~The Larch~ Workflow steps

What are the basic workflow steps?

<img src="media/the_larch.jpg" width="300" />

<p style="font-size: 9px; text-align:center">A larch</p>

You probably know them already, but we want you to really internalize them. We've given you a list of steps, `workflow_steps`, but not only does it have too many steps, some of them are _probably_ wrong as well.

Select the correct ones and reorder them!

In [55]:
print("Workflow steps:")
for i, step in enumerate(workflow_steps, start=1):
    print(i, ': ', step)

Workflow steps:
1 :  Watch Netflix
2 :  Increase complexity
3 :  Evaluate results
4 :  Get the data
5 :  Establish a Baseline
6 :  Data analysis and preparation
7 :  Train model
8 :  Spam
9 :  Google Hackathon solutions
10 :  Iterate


In [56]:
# Exercise 1.1. Filter and sort the names of the steps in the workflow_steps list
# workflow_steps_answer = [...]

# YOUR CODE HERE
workflow_steps_answer = ['Get the data', 'Data analysis and preparation', 'Train model', 'Evaluate results', 'Iterate']


In [57]:
### BEGIN TESTS
assert _hash([step.lower() for step in workflow_steps_answer], 'salt0') == '701e2306da9bfde36382bdb6feb80a354916ebf4'
### END TESTS

There are way too many substeps in the Data Analysis and Preparation step to group them all under a single category. We've given you another list of steps: `data_analysis_steps`.

Aside from being shuffled, it should be fine, but keep an eye out. You never know what to expect...

In [58]:
print("Data Analysis and Preparation steps:")
for i, step in enumerate(data_analysis_steps, start=1):
    print(i, ': ', step)

Data Analysis and Preparation steps:
1 :  Feature engineering
2 :  Dealing with data problems
3 :  Spanish Inquisition
4 :  Feature selection
5 :  Data analysis


In [59]:
# Exercise 1.2. Filter and sort the names of the steps in the data_analysis_steps list
# data_analysis_steps_answer = [...]

# YOUR CODE HERE
data_analysis_steps_answer = ["Data analysis",
                              "Dealing with data problems",
                              "Feature engineering",
                              "Feature selection"]

In [60]:
### BEGIN TESTS
assert _hash([step.lower() for step in data_analysis_steps_answer], 'salt0') == '658ab90eff4a0cea2bfb51cc89c8db5b4121fa86'
### END TESTS

<p style="text-align:center">That's right!</p>

<p style="text-align:center"><b>Nobody</b> expects the Spanish Inquisition!</p>

<img src="media/spanish_inquisition.gif" width="400" />

## Exercise 2: Specific workflow questions

Here are some more specific questions about individual workflow steps.

In [61]:
# Exercise 2.1. True or False, you should split your data into a training set and a test set
# split_training_test = ...

# Exercise 2.2. True or False, Scikit Pipelines are only useful in production environments
# scikit_pipelines_useful = ...

# Exercise 2.3. True or False, you should try to make a complex baseline, so that you
#               only have to make simple improvements to it later on.
# baseline_complex = ...

# Exercise 2.4. (optional) True or False, is Brian the Messiah?
# is_brian_the_messiah = ...

# YOUR CODE HERE
split_training_test = True
scikit_pipelines_useful = False
baseline_complex = False
is_brian_the_messiah = True

In [62]:
### BEGIN TESTS
assert _hash(split_training_test, 'salt1') == '569b45c42b5c7b490c92692b911af35f575c8a06'
assert _hash(scikit_pipelines_useful, 'salt2') == 'ef07576cc7d3bcb2cf29e1a772aec2aad7f59158'
assert _hash(baseline_complex, 'salt3') == 'f24a294afb4a09f7f9df9ee13eb18e7d341c439d'
### END TESTS

<img src="media/monty_python_messiah.jpg" width="500" />

## Scikit Pipelines

We've already loaded and split a dataset for the following exercises. The resulting sets are stored in the `X_train`, `X_test`, `y_train` and `y_test` variables.
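If you want to see what such a split looks like on its own, here is a minimal sketch. It is only an illustration: the synthetic data from `make_classification` and the `random_state=42` seed are assumptions, not the notebook's actual dataset or settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for the preloaded dataset (assumption: 100 rows, 5 features)
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

# Fixing random_state makes the split reproducible between runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

# Every row ends up in exactly one of the two sets
print(len(X_tr), len(X_te))
```

The split at the top of this notebook does not set `random_state`, so the rows in each set (and any score computed from them) can change from one run to the next.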

In a perfect world, where all your data is clean and ready to go, you can build your pipeline with just Scikit-learn's Transformers. In the real world, however, that's rarely the case, and you'll need to create custom Transformers to get the job done. Take a look at the dataset. What do you see?

In [63]:
# Do some data analysis here
X_test

Unnamed: 0,17,leg_1,7,arm_2,2,11,4,16,0,14,...,12,10,15,9,19,18,8,6,arm_0,1
93,0.086144,Locke,-2.92135,temptation,-1.072139,-0.734592,0.176442,0.428817,0.200569,-0.810252,...,0.43656,0.250588,0.046683,-1.015822,-0.82759,-0.367028,1.148637,0.903935,periscope,-0.24878
39,-1.438278,shrub,-0.668144,Erich,0.919229,-0.291811,-0.426358,0.298753,0.88311,,...,1.873298,1.572337,0.77814,-0.18048,0.11327,1.148446,-0.077837,1.080048,yuh,-1.762549
53,-0.803179,warranty,-0.271124,bay,1.492689,-0.998385,-0.226479,-0.877983,0.76608,,...,-0.021367,2.368674,1.170775,-0.100154,0.913585,0.367366,1.226933,-0.747212,Riviera,-2.654613
71,-0.977555,SW,0.751387,value,0.099332,-0.592394,-0.576771,-0.238948,0.048522,-0.863991,...,-1.669405,-1.693028,-0.83807,0.270457,0.500917,0.755391,-0.83095,0.54336,neuralgia,1.897924
46,2.010205,Baltimore,-0.798297,nag,-0.176947,-0.552223,0.285865,-0.612789,0.202923,0.632932,...,-1.379319,1.312175,-0.886027,1.547505,0.658544,0.334457,-1.515744,-0.73093,recovery,-0.833116
36,0.040592,Palmyra,-0.662901,nobody'd,-0.701992,-1.592994,,0.125225,-0.019638,0.440475,...,-1.402605,0.155132,1.519901,0.223914,0.04886,0.543298,0.55249,1.749577,liberal,-0.773361
1,1.375707,Runge,0.125576,bottleneck,-0.150056,0.321357,,1.18947,1.613711,0.421921,...,-0.173072,-1.158068,0.96336,-0.244157,-0.297564,0.701173,0.453534,0.015579,sloven,0.659924
5,0.88366,dearth,-1.576392,hint,0.652323,-0.483186,1.078681,1.683928,-1.225766,1.573987,...,1.47654,-1.28568,0.889484,0.224452,-0.172627,-0.038508,-1.464375,1.380091,epicure,0.807427
14,0.425458,deport,-0.047711,seem,-0.966976,0.128104,-1.583903,-0.452306,0.840644,-0.681052,...,-0.003603,0.254157,-1.463612,-0.446183,0.7858,0.760415,-0.652624,-1.158365,Clayton,0.375316
77,-0.978764,ineluctable,0.3773,metallurgic,-0.444293,-0.459361,1.03754,0.47898,0.830336,-0.849844,...,0.756989,-0.793714,-0.528785,0.071566,-0.269875,-0.510016,-0.856084,-0.922165,geocentric,0.946218


While crunching your data, you probably found two issues:

1. There are 4 columns whose names start with either `arm` or `leg`, and they are all filled with gibberish
2. There are some values missing in some columns
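Both issues in the list above can be confirmed with a couple of pandas calls. The sketch below runs on a hypothetical miniature DataFrame (`df`, with made-up values borrowed from the output above), not on the real `X_test`, so treat it as a pattern rather than the expected analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: numeric columns plus text "limb" columns
df = pd.DataFrame({
    "0": [0.2, 0.9, np.nan, 0.05],
    "1": [-0.2, -1.8, -2.7, 1.9],
    "arm_0": ["periscope", "yuh", "Riviera", "neuralgia"],
    "leg_1": ["Locke", "shrub", "warranty", "SW"],
})

# Columns that do not hold numbers (the gibberish ones in this toy frame)
text_columns = [col for col in df.columns
                if not pd.api.types.is_numeric_dtype(df[col])]

# Number of missing values per column, keeping only columns that have some
missing_counts = df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

print(text_columns)
print(missing_counts)
```

On the real data you would run the same two checks on `X_train` instead of `df`.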

So, first things first, let's get rid of those columns with a Custom Transformer, so we can plug it into a Scikit Pipeline afterwards.

## Exercise 3: Custom Transformer

In [67]:
# Create a pipeline step called RemoveLimbs that removes any
# column whose name starts with the string 'arm' or 'leg'

# YOUR CODE HERE
    
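class RemoveLimbs(TransformerMixin):
    """Drops every column whose name starts with 'arm' or 'leg'."""

    def fit(self, *_):
        # Nothing to learn from the data
        return self

    def transform(self, X, *_):
        # Select by column name, as the exercise asks, rather than by dtype
        keep = [col for col in X.columns
                if not str(col).startswith(('arm', 'leg'))]
        return X[keep].copy()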

In [68]:
### BEGIN TESTS
assert _hash(sorted(RemoveLimbs().fit_transform(X).columns), 'salt5') == '71443dfc3077d773d4c74e958dadf91dc2cc148a'
assert _hash(list(map(lambda col: col.startswith('arm') or col.startswith('leg'), RemoveLimbs().fit_transform(X_train).columns)), 'salt6') == 'ce45cf3759d2210f2d1315f1673b18f34e3ac711'
### END TESTS

<img src="media/monty_python_black_knight.gif" width="500" />
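If you ever need the same trick with other column names, one possible variation is a transformer that takes the prefixes as a parameter. The class name `DropColumnsByPrefix`, its `prefixes` argument, and the toy DataFrame below are all hypothetical, made up for this sketch.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumnsByPrefix(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drops columns whose name starts with a prefix."""

    def __init__(self, prefixes=("arm", "leg")):
        self.prefixes = prefixes

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the scikit-learn contract
        return self

    def transform(self, X):
        keep = [col for col in X.columns
                if not str(col).startswith(tuple(self.prefixes))]
        return X[keep].copy()


toy = pd.DataFrame({"0": [1.0, 2.0], "arm_0": ["a", "b"], "leg_1": ["c", "d"]})
remaining = DropColumnsByPrefix().fit_transform(toy).columns.tolist()
params = DropColumnsByPrefix(prefixes=("leg",)).get_params()
print(remaining)
print(params)
```

`TransformerMixin` is what supplies `fit_transform`, while `BaseEstimator` adds `get_params` and `set_params`, which become useful once a transformer has parameters you may want to tune.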

Now that we have our Custom Transformer in place, we can design our pipeline. For the sake of the exercise, you'll want to create a pipeline with the following steps:

1. Removes the limb columns
2. Imputes missing values with the mean
3. Has a Random Forest Classifier as the last step

You may use `make_pipeline` to create your pipeline with as many steps as you want, as long as the first step is the Custom Transformer you developed previously, the second step is a `SimpleImputer`, and the last step is a `RandomForestClassifier`.
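A detail worth knowing when you build the pipeline: `make_pipeline` names each step after the lowercased class name of the estimator, whereas `Pipeline` lets you pick the names yourself. The sketch below shows both on an unrelated example; the step names `impute` and `forest` and the choice of estimators are just assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

# make_pipeline derives the step names from the class names
auto_named = make_pipeline(
    SimpleImputer(strategy="median"),
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=10),
)
step_names = [name for name, _ in auto_named.steps]
print(step_names)

# Pipeline takes explicit (name, estimator) pairs instead
explicit = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("forest", RandomForestClassifier(n_estimators=10)),
])
explicit_names = list(explicit.named_steps)
print(explicit_names)
```

Either way, `pipeline.steps` is a list of `(name, estimator)` tuples, so `pipeline.steps[0][0]` is the name of the first step.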

## Exercise 4: Scikit Pipelines

In [69]:
# YOUR CODE HERE

pipeline = make_pipeline(
    RemoveLimbs(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=10)
)



In [70]:
### BEGIN TESTS
assert _hash(pipeline.steps[0][0], 'salt7') == '471b02068ac2c4f479c2e9f85f4b3dc2179bb841'
assert _hash(pipeline.steps[1][0], 'salt8') == 'ca83eaea1a7e243fa5574cfa6f52831166ee0f32'
assert _hash(pipeline.steps[-1][0], 'salt9') == '0d66ba4309ad4939673169e74f87088dcadd510b'
### END TESTS

Does it work? Let's check it out on our dataset!

In [71]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('removelimbs',
                 <__main__.RemoveLimbs object at 0x7f9399e102e8>),
                ('simpleimputer', SimpleImputer()),
                ('randomforestclassifier',
                 RandomForestClassifier(n_estimators=10))])

In [72]:
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)

0.9090909090909091

That's it for this Exercise Notebook! It doesn't get much cleaner than this, does it?

You can keep practicing with pipelines, for example by adding a few more steps. See how your changes to the pipeline affect the predictions.
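As a starting point for that kind of experiment, here is a rough sketch that compares two pipeline variants. It runs on synthetic data with artificially injected missing values (an assumption, so it works without the exercise dataset), and the variant names and seeds are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the exercise data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Knock out roughly 5% of the values to imitate the missing entries
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

variants = {
    "mean imputer": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(n_estimators=10, random_state=0),
    ),
    "median imputer + scaler": make_pipeline(
        SimpleImputer(strategy="median"),
        MinMaxScaler(),
        RandomForestClassifier(n_estimators=10, random_state=0),
    ),
}

# Fit each variant on the training set and score it on the test set
scores = {}
for name, candidate in variants.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, candidate.predict(X_te))

print(scores)
```

Don't expect the scaler to change much here, since tree-based models are largely insensitive to feature scaling. The point is how little code it takes to swap steps in and out.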

Can you see how Scikit-learn's Pipelines might save time? Can you imagine how useful that would be in stressful situations (like, *for example*, a Hackathon)?