# Basic Workflow

In [4]:
# Always have your imports at the top
import random
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from hashlib import sha1 # just for grading purposes
import json # just for grading purposes

from utils import get_dataset, workflow_steps, data_analysis_steps

def _hash(obj, salt='none'):
    if type(obj) is not str:
        obj = json.dumps(obj)
    to_encode = obj + salt
    return sha1(to_encode.encode()).hexdigest()

X, y = get_dataset()  # preloaded dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

## Exercise 1: ~The Larch~ Workflow steps

What are the basic workflow steps?

<img src="media/the_larch.jpg" width="300" />

<p style="font-size: 9px; text-align:center">A larch</p>

You probably know them already, but we want you to really internalize them. We've given you a list of steps, `workflow_steps`, but it appears that not only does it have too many steps, some are _probably_ wrong as well.

Select the correct ones and reorder them!

In [5]:
print("Workflow steps:")
for i, step in enumerate(workflow_steps, start=1):
    print(i, ': ', step)

Workflow steps:
1 :  Establish a Baseline
2 :  Evaluate results
3 :  Google Hackathon solutions
4 :  Spam
5 :  Increase complexity
6 :  Get the data
7 :  Watch Netflix
8 :  Iterate
9 :  Train model
10 :  Data analysis and preparation


In [8]:
# Exercise 1.1. Filter and sort the names of the steps in the workflow_steps list
# workflow_steps_answer = [...]

# YOUR CODE HERE
workflow_steps_answer = ["Get the data",
                         "Data analysis and preparation",
                         "Train model",
                         "Evaluate results",
                         "Iterate",
                         "Establish a Baseline",
                         "Increase complexity"]

In [9]:
### BEGIN TESTS
assert _hash([step.lower() for step in workflow_steps_answer], 'salt0') == '701e2306da9bfde36382bdb6feb80a354916ebf4'
### END TESTS

AssertionError: 

There are way too many substeps in the Data Analysis and Preparation step to group them all under a single category. We've given you another list of steps: `data_analysis_steps`.

Aside from being shuffled, it should be fine, but keep an eye out. You never know what to expect...

In [42]:
print("Data Analysis and Preparation steps:")
for i, step in enumerate(data_analysis_steps, start=1):
    print(i, ': ', step)

Data Analysis and Preparation steps:
1 :  Dealing with data problems
2 :  Data analysis
3 :  Feature selection
4 :  Spanish Inquisition
5 :  Feature engineering


In [43]:
# Exercise 1.2. Filter and sort the names of the steps in the data_analysis_steps list
# data_analysis_steps_answer = [...]

# YOUR CODE HERE
data_analysis_steps_answer = ["Data analysis",
                              "Dealing with data problems",
                              "Feature engineering",
                              "Feature selection"]

In [44]:
### BEGIN TESTS
assert _hash([step.lower() for step in data_analysis_steps_answer], 'salt0') == '658ab90eff4a0cea2bfb51cc89c8db5b4121fa86'
### END TESTS

<p style="text-align:center">That's right!</p>

<p style="text-align:center"><b>Nobody</b> expects the Spanish Inquisition!</p>

<img src="media/spanish_inquisition.gif" width="400" />

## Exercise 2: Specific workflow questions

Here are some more specific questions about individual workflow steps.

In [45]:
# Exercise 2.1. True or False, you should split your data into a training and a test set
# split_training_test = ...

# Exercise 2.2. True or False, Scikit Pipelines are only useful in production environments
# scikit_pipelines_useful = ...

# Exercise 2.3. True or False, you should try to make a complex baseline, so you just have 
#               to make simple improvements on it, later on.
# baseline_complex = ...

# Exercise 2.4. (optional) True or False, is Brian the Messiah?
# is_brian_the_messiah = ...

# YOUR CODE HERE
split_training_test = True
scikit_pipelines_useful = False
baseline_complex = False
is_brian_the_messiah = True

In [46]:
### BEGIN TESTS
assert _hash(split_training_test, 'salt1') == '569b45c42b5c7b490c92692b911af35f575c8a06'
assert _hash(scikit_pipelines_useful, 'salt2') == 'ef07576cc7d3bcb2cf29e1a772aec2aad7f59158'
assert _hash(baseline_complex, 'salt3') == 'f24a294afb4a09f7f9df9ee13eb18e7d341c439d'
### END TESTS

<img src="media/monty_python_messiah.jpg" width="500" />
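Questions 2.1 and 2.3 can be made concrete with a small sketch: hold out a test set, score a deliberately trivial baseline, and only then score a more complex model. Everything here is an assumption for illustration: the synthetic data from `make_classification`, the names `X_demo` and `y_demo`, and the choice of `DummyClassifier` as the baseline. None of it comes from the exercise dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: this is not the exercise dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Hold out a test set so both models are scored on data they have never seen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Simple baseline: always predict the most frequent class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# A more complex model only earns its keep if it beats the baseline
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
model_acc = accuracy_score(y_te, model.predict(X_te))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"random forest accuracy: {model_acc:.3f}")
```

Starting from a trivial baseline gives every later change a reference point to be measured against.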

## Scikit Pipelines

We've already loaded and split a dataset for the following exercises. The splits are stored in the `X_train`, `X_test`, `y_train` and `y_test` variables.

In a perfect world, where all your data is clean and ready to go, you can build your pipeline with just Scikit-learn's built-in Transformers. In the real world that's rarely the case, and you'll need to create custom Transformers to get the job done. Take a look at the dataset. What do you see?

In [47]:
# Do some data analysis here
X_test

Unnamed: 0,17,leg_1,7,arm_2,2,11,4,16,0,14,...,12,10,15,9,19,18,8,6,arm_0,1
53,-0.803179,warranty,-0.271124,bay,1.492689,-0.998385,-0.226479,-0.877983,0.76608,,...,-0.021367,2.368674,1.170775,-0.100154,0.913585,0.367366,1.226933,-0.747212,Riviera,-2.654613
62,-0.272724,abrogate,-0.054295,beryllium,-2.696887,-1.012104,,-0.335785,0.823171,-1.654857,...,-0.230935,1.658822,-0.749202,-1.289961,-0.245743,-1.503143,0.073318,0.696206,vernier,-1.207273
52,-0.062593,buckskin,-0.280675,businessman,-0.753965,-1.389572,0.758929,0.594754,1.02257,-1.645399,...,-1.692957,-0.053198,1.46098,1.384273,0.104201,0.281191,2.439752,-0.09834,crypt,-0.558181
58,-0.428115,mask,0.850222,g,1.50076,-1.294681,0.996267,0.076822,-0.467701,1.160827,...,,0.559426,1.724002,-0.046921,-1.556582,-0.493757,0.346504,-0.349258,impeccable,-1.228234
56,1.440117,chimeric,1.80094,clavicle,-0.676392,-0.082151,0.196521,0.642723,0.342725,1.117296,...,-0.040158,1.086594,-1.054638,0.569767,-0.089736,0.709004,0.456753,-1.430775,eyepiece,-0.556581
76,1.001046,runabout,0.677875,certitude,-2.703232,0.950308,0.975198,-0.927353,,,...,-0.654076,1.372711,0.500477,0.070052,0.189582,0.501094,-0.168822,-1.830633,cacao,-1.464473
1,1.375707,Runge,0.125576,bottleneck,-0.150056,0.321357,,1.18947,1.613711,0.421921,...,-0.173072,-1.158068,0.96336,-0.244157,-0.297564,0.701173,0.453534,0.015579,sloven,0.659924
47,-1.211016,wattage,0.047399,horn,-0.651836,-0.662624,,0.259723,-0.763259,0.570599,...,,-1.559849,-1.113279,-1.627542,-0.06608,-1.66152,-1.804882,-0.384556,everything,1.890331
74,1.187679,Exxon,0.20116,shamrock,-0.464617,1.615376,-0.903702,0.40373,1.217159,-0.32232,...,0.283288,0.964233,0.193754,0.998311,-1.17904,0.324359,1.521316,-0.258905,commissary,-0.963142
91,-0.953329,slim,1.624678,Gullah,0.12267,-2.362932,-0.645964,-0.182896,0.619154,,...,0.323079,1.178696,0.55832,0.020794,-0.482744,-0.799192,2.057495,-0.252354,Aruba,-1.310899


While crunching your data, you probably found two issues:

1. There are 4 columns whose names start with either `arm` or `leg`, and they are all filled with gibberish
2. There are some values missing in some columns
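Both issues can be confirmed with a couple of pandas calls. The sketch below runs on a small hand-made DataFrame called `toy` (a hypothetical stand-in with the same kind of problems), so the values are illustrative only; the same two expressions should work on `X_train` or `X_test`, assuming the column names are strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the exercise data, with the same two problems
toy = pd.DataFrame({
    "0": [0.5, np.nan, -1.2],
    "arm_0": ["Riviera", "vernier", "crypt"],
    "1": [1.0, 2.0, np.nan],
    "leg_1": ["warranty", "abrogate", "mask"],
})

# Issue 1: columns whose name starts with 'arm' or 'leg'
# (str.startswith accepts a tuple of prefixes)
limb_columns = [col for col in toy.columns if col.startswith(("arm", "leg"))]

# Issue 2: number of missing values in each column
missing_per_column = toy.isna().sum()

print(limb_columns)
print(missing_per_column[missing_per_column > 0])
```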

So, first things first: let's get rid of those columns with a Custom Transformer, so we can plug it into a Scikit Pipeline afterwards.
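As a reminder of the general shape of a custom Transformer, here is a minimal sketch of a different, hypothetical one: `DropEmptyColumns`, which drops columns that are entirely missing. The class name, the `empty_columns_` attribute and the `toy` data are made up for illustration. The pattern is what matters: `fit` learns something from the training data and returns `self`, and `transform` applies it without modifying its input.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropEmptyColumns(TransformerMixin, BaseEstimator):
    """Hypothetical example transformer: drop columns that are entirely NaN."""

    def fit(self, X, y=None):
        # Learn which columns are empty, using only the data passed to fit
        self.empty_columns_ = [col for col in X.columns if X[col].isna().all()]
        return self

    def transform(self, X):
        # Return a new DataFrame; the caller's DataFrame is left untouched
        return X.drop(columns=self.empty_columns_)


toy = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan], "c": [3.0, np.nan]})
cleaned = DropEmptyColumns().fit_transform(toy)
print(list(cleaned.columns))
```

Inheriting from `TransformerMixin` provides `fit_transform` for free, and `BaseEstimator` supplies the parameter handling that Scikit-learn utilities expect.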

## Exercise 3: Custom Transformer

In [52]:
# Create a pipeline step called RemoveLimbs that removes any
# column whose name starts with the string 'arm' or 'leg'

# YOUR CODE HERE
from sklearn.base import BaseEstimator  # TransformerMixin is imported at the top

class RemoveLimbs(TransformerMixin, BaseEstimator):
    """Drop every column whose name starts with 'arm' or 'leg'."""

    def fit(self, X, y=None):
        # Nothing to learn: the columns to drop depend only on their names
        return self

    def transform(self, X):
        cols_to_drop = [col for col in X.columns
                        if str(col).startswith(('arm', 'leg'))]
        # Return a new DataFrame instead of mutating the caller's data
        return X.drop(columns=cols_to_drop)

In [53]:
### BEGIN TESTS
assert _hash(sorted(RemoveLimbs().fit_transform(X).columns), 'salt5') == '71443dfc3077d773d4c74e958dadf91dc2cc148a'
assert _hash(list(map(lambda col: col.startswith('arm') or col.startswith('leg'), RemoveLimbs().fit_transform(X_train).columns)), 'salt6') == 'ce45cf3759d2210f2d1315f1673b18f34e3ac711'
### END TESTS

<img src="media/monty_python_black_knight.gif" width="500" />

Now that we have our Custom Transformer in place, we can design our pipeline. For the sake of the exercise, you'll want to create a pipeline with the following steps:

1. Removes limbs columns
2. Imputes missing values with the mean
3. Has a Random Forest Classifier as the last step

You may use `make_pipeline` to create your pipeline with as many steps as you want, as long as the first step is the Custom Transformer you developed previously, the second is a `SimpleImputer`, and the last is a `RandomForestClassifier`.
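One detail worth knowing before building the pipeline: `make_pipeline` names each step automatically after the lowercased class name of the estimator, and those names are what `pipeline.steps[i][0]` returns. A minimal sketch on a throwaway pipeline (`demo_pipeline` is a hypothetical name and not the exercise answer):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# A throwaway pipeline, unrelated to the exercise answer
demo_pipeline = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())

# Each entry in .steps is a (name, estimator) tuple
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)  # ['simpleimputer', 'minmaxscaler']
```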

## Exercise 4: Scikit Pipelines

In [None]:
# YOUR CODE HERE
pipeline = make_pipeline(RemoveLimbs(),
                         SimpleImputer(strategy='mean'),
                         RandomForestClassifier())

In [None]:
### BEGIN TESTS
assert _hash(pipeline.steps[0][0], 'salt7') == '471b02068ac2c4f479c2e9f85f4b3dc2179bb841'
assert _hash(pipeline.steps[1][0], 'salt8') == 'ca83eaea1a7e243fa5574cfa6f52831166ee0f32'
assert _hash(pipeline.steps[-1][0], 'salt9') == '0d66ba4309ad4939673169e74f87088dcadd510b'
### END TESTS

Does it work? Let's check it out on our dataset!

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)

That's it for this Exercise Notebook! It doesn't get much cleaner than this, does it?

You can keep experimenting with pipelines, maybe by adding a few more steps. See how you can adapt your pipeline and how that affects the predictions.
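As one possible starting point, here is a sketch that adds a `MinMaxScaler` between the imputer and the classifier. It runs on synthetic data from `make_classification` with a few values blanked out (an assumption, so the numbers say nothing about the exercise dataset) and leaves out the limb-removal step, since the synthetic data has no such columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the exercise dataset (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Blank out roughly 5% of the values so the imputer has something to do
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Same idea as the exercise pipeline, with an extra scaling step in the middle
extended = make_pipeline(
    SimpleImputer(strategy="mean"),
    MinMaxScaler(),
    RandomForestClassifier(random_state=0),
)
extended.fit(X_tr, y_tr)
score = accuracy_score(y_te, extended.predict(X_te))
print(f"accuracy with scaling step: {score:.3f}")
```

Tree-based models are largely insensitive to feature scaling, so the score may barely move here. The point is how little code changes when a step is added or swapped.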

Can you see how Scikit-learn's Pipelines might save time? Can you imagine how useful that would be in stressful situations (like, *for example*, a Hackathon)?