# Basic Workflow

In [1]:
# Always have your imports at the top
import random
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from hashlib import sha1 # just for grading purposes
import json # just for grading purposes

from utils import get_dataset, workflow_steps, data_analysis_steps

def _hash(obj, salt='none'):
    if type(obj) is not str:
        obj = json.dumps(obj)
    to_encode = obj + salt
    return sha1(to_encode.encode()).hexdigest()

X, y = get_dataset()  # preloaded dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

## Exercise 1: ~The Larch~ Workflow steps

What are the basic workflow steps?

<img src="media/the_larch.jpg" width="300" />

<p style="font-size: 9px; text-align:center">A larch</p>

You probably know them already, but we want you to really internalize them. We've given you a list of steps, `workflow_steps`, but not only does it have too many steps, some of them are _probably_ wrong as well.

Select the correct ones and reorder them!

In [2]:
print("Workflow steps:")
for i, step in enumerate(workflow_steps, start=1):
    print(i, ': ', step)

Workflow steps:
1 :  Establish a Baseline
2 :  Data analysis and preparation
3 :  Evaluate results
4 :  Watch Netflix
5 :  Increase complexity
6 :  Get the data
7 :  Spam
8 :  Iterate
9 :  Google Hackathon solutions
10 :  Train model


In [56]:
# Exercise 1.1. Filter and sort the names of the steps in the workflow_steps list
# workflow_steps_answer = [...]

# YOUR CODE HERE
workflow_steps_answer = ["Get the data",
                         "Data analysis and preparation",
                         "Train model",
                         "Evaluate results",
                         "Establish a Baseline",
                         "Increase complexity"
                         ]

In [57]:
### BEGIN TESTS
assert _hash([step.lower() for step in workflow_steps_answer], 'salt0') == '701e2306da9bfde36382bdb6feb80a354916ebf4'
### END TESTS

AssertionError: 

There are way too many substeps in the Data Analysis and Preparation step to group them all under a single category. We've given you another list of steps: `data_analysis_steps`.

Aside from being shuffled, it should be fine, but keep an eye out. You never know what to expect...

In [11]:
print("Data Analysis and Preparation steps:")
for i, step in enumerate(data_analysis_steps, start=1):
    print(i, ': ', step)

Data Analysis and Preparation steps:
1 :  Data analysis
2 :  Feature selection
3 :  Dealing with data problems
4 :  Spanish Inquisition
5 :  Feature engineering


In [18]:
# Exercise 1.2. Filter and sort the names of the steps in the data_analysis_steps list
# data_analysis_steps_answer = [...]

# YOUR CODE HERE
data_analysis_steps_answer = ["Data analysis",
                              "Dealing with data problems",
                              "Feature engineering",
                              "Feature selection"]

In [19]:
### BEGIN TESTS
assert _hash([step.lower() for step in data_analysis_steps_answer], 'salt0') == '658ab90eff4a0cea2bfb51cc89c8db5b4121fa86'
### END TESTS

<p style="text-align:center">That's right!</p>

<p style="text-align:center"><b>Nobody</b> expects the Spanish Inquisition!</p>

<img src="media/spanish_inquisition.gif" width="400" />

## Exercise 2: Specific workflow questions

Here are some more specific questions about individual workflow steps.

In [66]:
# Exercise 2.1. True or False, you should split your data into a training and a test set
# split_training_test = ...

# Exercise 2.2. True or False, Scikit Pipelines are only useful in production environments
# scikit_pipelines_useful = ...

# Exercise 2.3. True or False, you should try to make a complex baseline, so you just have 
#               to make simple improvements on it, later on.
# baseline_complex = ...

# Exercise 2.4. (optional) True or False, is Brian the Messiah?
# is_brian_the_messiah = ...

# YOUR CODE HERE
split_training_test = True
scikit_pipelines_useful = False
baseline_complex = False
is_brian_the_messiah = True

In [67]:
### BEGIN TESTS
assert _hash(split_training_test, 'salt1') == '569b45c42b5c7b490c92692b911af35f575c8a06'
assert _hash(scikit_pipelines_useful, 'salt2') == 'ef07576cc7d3bcb2cf29e1a772aec2aad7f59158'
assert _hash(baseline_complex, 'salt3') == 'f24a294afb4a09f7f9df9ee13eb18e7d341c439d'
### END TESTS

<img src="media/monty_python_messiah.jpg" width=500>
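
To make the split and baseline questions above concrete, here is a minimal sketch. It assumes synthetic stand-in data (not the notebook's dataset) and uses `DummyClassifier` as one possible trivial baseline; the names `X_demo`, `y_demo`, `baseline` and `model` are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Hold out a test set so the score is measured on data the model never saw
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Simplest possible baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(Xd_train, yd_train)
baseline_acc = accuracy_score(yd_test, baseline.predict(Xd_test))

# A more complex model, only worth keeping if it beats the baseline
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xd_train, yd_train)
model_acc = accuracy_score(yd_test, model.predict(Xd_test))

print(f"baseline: {baseline_acc:.2f}, random forest: {model_acc:.2f}")
```

Starting from a trivial baseline gives you a reference score, so every later increase in complexity can be judged against it.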

## Scikit Pipelines

We've already loaded and split a dataset for the following exercises. The splits are stored in the `X_train`, `X_test`, `y_train` and `y_test` variables.

In a perfect world, where all your data is clean and ready to go, you can build your pipeline with just Scikit-learn's built-in Transformers. In the real world, that's rarely the case, and you'll need to create custom Transformers to get the job done. Take a look at the dataset. What do you see?

In [87]:
# Do some data analysis here

X_test

Unnamed: 0,17,leg_1,7,arm_2,2,11,4,16,0,14,...,12,10,15,9,19,18,8,6,arm_0,1
88,1.846707,lapelled,-0.359292,smutty,0.583928,-0.033127,-0.489439,2.526932,-0.517611,1.794558,...,0.590655,0.068456,-1.371117,-0.016423,0.681891,1.044161,0.223788,1.108704,heft,0.506885
37,1.236093,selenium,0.609138,peripheral,1.09131,-0.705012,-0.111226,0.169361,0.558327,-0.055769,...,-1.092313,-2.281386,0.460938,0.538756,-0.73553,-0.903908,0.076005,-0.316408,ineffective,1.896911
8,0.959271,headmen,-0.767348,Aventine,2.153182,-0.10876,0.02451,0.097676,0.690144,0.401712,...,0.872321,-2.009185,-1.24869,0.224092,1.451144,0.497998,-0.40122,0.183342,stank,2.357902
83,-0.603985,stratus,-0.155677,bract,0.08659,1.006293,,-2.471645,,-0.576892,...,1.167782,-0.575002,-0.149518,0.529804,0.371146,-0.203045,-1.129707,0.254421,sat,0.588465
33,-0.415288,test,2.270693,foothold,0.632782,0.018418,-1.478586,0.235615,0.326927,1.676437,...,0.181866,-2.003477,-1.373141,0.829406,0.338496,1.143754,-0.219101,0.248221,appliance,2.404373
31,-0.553588,coprocessor,1.628397,deconvolve,0.568983,0.043811,-0.803675,-0.088282,0.963879,-0.147002,...,-0.379128,-1.820377,0.655741,-0.557492,1.677701,1.639117,2.210523,-0.20358,shopworn,1.393983
15,-0.353166,Debby,-0.295401,chimney,0.338484,-0.079641,0.579633,1.187386,-1.062394,,...,0.168461,2.529834,0.884395,-0.187144,0.194384,0.325796,0.428307,1.317598,inferring,-2.68318
39,-1.438278,shrub,-0.668144,Erich,0.919229,-0.291811,-0.426358,0.298753,0.88311,,...,1.873298,1.572337,0.77814,-0.18048,0.11327,1.148446,-0.077837,1.080048,yuh,-1.762549
99,0.992042,walkover,-0.755745,young,-0.17496,0.388579,,0.919076,-0.006071,2.493,...,0.53651,-1.755186,0.717686,0.081829,-0.66809,0.321698,0.838491,-0.898468,Brighton,1.308576
85,-1.356582,calf,-0.035641,percussion,0.46643,-2.42424,0.794265,-1.562546,0.736844,0.884045,...,-1.615132,0.480502,0.308773,0.066991,0.293558,-1.254289,-0.281328,1.164739,earthen,-0.568113


While crunching your data, you probably found two issues:

1. There are 4 columns whose name starts with either `arm` or `leg` which are all filled with gibberish
2. There are some values missing in some columns
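
Both issues can be spotted with a couple of pandas calls. The sketch below uses a tiny made-up frame (`df_demo`, with invented column names and values) rather than the real `X_train`, so treat it as an illustration of the checks, not as the notebook's analysis.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two issues (column names and values are made up)
df_demo = pd.DataFrame({
    "0": [0.5, np.nan, -1.2, 0.3],
    "1": [1.1, 0.7, np.nan, np.nan],
    "arm_0": ["heft", "stank", "sat", "yuh"],
    "leg_1": ["test", "shrub", "calf", "Debby"],
})

# Issue 1: list the columns that are not numeric
text_cols = df_demo.select_dtypes(exclude="number").columns.tolist()

# Issue 2: count missing values per column and keep only columns that have some
missing_counts = df_demo.isna().sum()
cols_with_missing = missing_counts[missing_counts > 0]

print(text_cols)
print(cols_with_missing)
```

Running the same two checks on `X_train` should surface the `arm`/`leg` columns and the columns with missing values.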

So, first things first, let's get rid of those columns through a Custom Transformer, so we can plug it in a Scikit Pipeline after.
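
As a reminder of what a Custom Transformer looks like, here is a minimal sketch on a toy frame. `DropPrefixedColumns`, its `prefixes` parameter and the `tmp` prefix are hypothetical names chosen for illustration. The pattern itself is standard: inherit from `TransformerMixin` (which provides `fit_transform`) and `BaseEstimator` (which provides `get_params`/`set_params`), then implement `fit` and `transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropPrefixedColumns(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drops columns whose name starts with a prefix."""

    def __init__(self, prefixes=("tmp",)):
        self.prefixes = prefixes

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self
        return self

    def transform(self, X):
        # Collect the columns matching any of the prefixes and drop them
        to_drop = [c for c in X.columns if str(c).startswith(tuple(self.prefixes))]
        return X.drop(columns=to_drop)


toy = pd.DataFrame({"tmp_a": ["x", "y"], "keep": [1, 2], "tmp_b": ["z", "w"]})
cleaned = DropPrefixedColumns(prefixes=("tmp",)).fit_transform(toy)
print(cleaned.columns.tolist())
```

Because `fit` learns nothing, the transformer is stateless and behaves the same on training and test data.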

## Exercise 3: Custom Transformer

In [89]:
# Create a pipeline step called RemoveLimbs that removes any
# column whose name starts with the string 'arm' or 'leg'
from sklearn.base import BaseEstimator  # provides get_params/set_params

# YOUR CODE HERE
class RemoveLimbs(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        limbs = [col for col in X.columns if col.startswith(('arm', 'leg'))]
        return X.drop(columns=limbs)

In [84]:
### BEGIN TESTS
assert _hash(sorted(RemoveLimbs().fit_transform(X).columns), 'salt5') == '71443dfc3077d773d4c74e958dadf91dc2cc148a'
assert _hash(list(map(lambda col: col.startswith('arm') or col.startswith('leg'), RemoveLimbs().fit_transform(X_train).columns)), 'salt6') == 'ce45cf3759d2210f2d1315f1673b18f34e3ac711'
### END TESTS

<img src="media/monty_python_black_knight.gif" width=500>

Now that we have our Custom Transformer in place, we can design our pipeline. For the sake of the exercise, you'll want to create a pipeline with the following steps:

1. Removes the limb columns
2. Imputes missing values with the mean
3. Has a Random Forest Classifier as the last step

You may use `make_pipeline` to create your pipeline with as many steps as you want, as long as the first step is the Custom Transformer you developed previously, the second is a `SimpleImputer`, and the last is a `RandomForestClassifier`.
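
The three-step layout described above can be sketched on toy data. This is an illustration under assumptions, not the exercise solution: the data is synthetic, and a `FunctionTransformer` wrapping a hypothetical `drop_text_columns` helper stands in for the custom transformer. Note that `make_pipeline` names each step after the lowercased class name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def drop_text_columns(df):
    # Stand-in for the custom transformer: keep numeric columns only
    return df.select_dtypes(include="number")


# Synthetic data with one gibberish column and a few missing values
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({
    "a": rng.normal(size=60),
    "b": rng.normal(size=60),
    "arm_0": ["gibberish"] * 60,
})
toy_y = (toy_X["a"] > 0).astype(int)
toy_X.loc[::10, "b"] = np.nan  # every tenth row of "b" becomes missing

toy_pipeline = make_pipeline(
    FunctionTransformer(drop_text_columns),
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
toy_pipeline.fit(toy_X, toy_y)

step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)
print(toy_pipeline.predict(toy_X.head(3)))
```

The pipeline is fitted and used like a single estimator, so the same cleaning and imputation are applied at `fit` and at `predict` time.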

## Exercise 4: Scikit Pipelines

In [None]:
# YOUR CODE HERE
pipeline = make_pipeline(RemoveLimbs(),
                         SimpleImputer(strategy='mean'),
                         RandomForestClassifier())

In [None]:
### BEGIN TESTS
assert _hash(pipeline.steps[0][0], 'salt7') == '471b02068ac2c4f479c2e9f85f4b3dc2179bb841'
assert _hash(pipeline.steps[1][0], 'salt8') == 'ca83eaea1a7e243fa5574cfa6f52831166ee0f32'
assert _hash(pipeline.steps[-1][0], 'salt9') == '0d66ba4309ad4939673169e74f87088dcadd510b'
### END TESTS

Does it work? Let's check it out on our dataset!

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)

That's it for this Exercise Notebook! It doesn't get much cleaner than this, does it?

You can keep practicing with pipelines, for example by adding a few more steps. See how adapting your pipeline affects the predictions.
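
One way to run that kind of experiment is to build a few candidate pipelines and score them the same way. The sketch below assumes synthetic data and uses 5-fold cross-validation; `X_syn`, `y_syn` and the candidate names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption; not the notebook's dataset)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(150, 4))
y_syn = (X_syn[:, 0] - X_syn[:, 2] > 0).astype(int)
X_syn[rng.random(X_syn.shape) < 0.05] = np.nan  # roughly 5% missing values

# Two candidate pipelines that differ by a single extra step
candidates = {
    "impute + forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ),
    "impute + scale + forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        MinMaxScaler(),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ),
}

# Mean cross-validated accuracy for each candidate
scores = {
    name: cross_val_score(pipe, X_syn, y_syn, cv=5).mean()
    for name, pipe in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Tree-based models are largely insensitive to feature scaling, so the two scores will likely be close here. The point is the comparison pattern, which works for any step you add or swap.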

Can you see how Scikit-learn's Pipelines might save time? Can you imagine how useful that would be in stressful situations (like, *for example*, a Hackathon)?