# Basic Workflow

In [1]:
# Always have your imports at the top
import random
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from hashlib import sha1 # just for grading purposes
import json # just for grading purposes

from utils import get_dataset, workflow_steps, data_analysis_steps

def _hash(obj, salt='none'):
    if type(obj) is not str:
        obj = json.dumps(obj)
    to_encode = obj + salt
    return sha1(to_encode.encode()).hexdigest()

X, y = get_dataset()  # preloaded dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

## Exercise 1: ~The Larch~ Workflow steps

What are the basic workflow steps?

<img src="media/the_larch.jpg" width="300" />

<p style="font-size: 9px; text-align:center">A larch</p>

You probably know them already, but we want you to really internalize them. We've given you a list of steps, `workflow_steps`, but not only does it have too many steps, some of them are _probably_ wrong as well.

Select the correct ones and reorder them!

In [2]:
print("Workflow steps:")
for i in range(len(workflow_steps)):
    print(i+1, ': ', workflow_steps[i])

Workflow steps:
1 :  Iterate
2 :  Watch Netflix
3 :  Google Hackathon solutions
4 :  Get the data
5 :  Spam
6 :  Data analysis and preparation
7 :  Evaluate results
8 :  Train model
9 :  Establish a Baseline
10 :  Increase complexity


In [7]:
# Exercise 1.1. Filter and sort the names of the steps in the workflow_steps list
# workflow_steps_answer = [...]

# YOUR CODE HERE
workflow_steps_answer = ["Get the data",
                         # Without the trailing comma, Python silently joins two adjacent strings
                         "Data analysis and preparation",
                         "Train model",
                         "Evaluate results",
                         "Iterate"]


In [8]:
### BEGIN TESTS
assert _hash([step.lower() for step in workflow_steps_answer], 'salt0') == '701e2306da9bfde36382bdb6feb80a354916ebf4'
### END TESTS

There are way too many substeps in the Data Analysis and Preparation step to group them all under a single category. We've given you another list of steps: `data_analysis_steps`.

Aside from being shuffled, it should be fine, but keep an eye out. You never know what to expect...

In [49]:
print("Data Analysis and Preparation steps:")
for i in range(len(data_analysis_steps)):
    print(i+1, ': ', data_analysis_steps[i])

Data Analysis and Preparation steps:
1 :  Data analysis
2 :  Spanish Inquisition
3 :  Dealing with data problems
4 :  Feature engineering
5 :  Feature selection


In [50]:
# Exercise 1.2. Filter and sort the names of the steps in the data_analysis_steps list
# data_analysis_steps_answer = [...]

# YOUR CODE HERE
data_analysis_steps_answer = ["Data analysis",
                              "Dealing with data problems",
                              "Feature engineering",
                              "Feature selection"]

In [51]:
### BEGIN TESTS
assert _hash([step.lower() for step in data_analysis_steps_answer], 'salt0') == '658ab90eff4a0cea2bfb51cc89c8db5b4121fa86'
### END TESTS

<p style="text-align:center">That's right! 

<p style="text-align:center"><b>Nobody</b> expects the Spanish Inquisition!

<img src="media/spanish_inquisition.gif" width="400" />

## Exercise 2: Specific workflow questions

Here are some more specific questions about individual workflow steps.

In [52]:
# Exercise 2.1. True or False, you should split your data into a training and a test set
# split_training_test = ...

# Exercise 2.2. True or False, Scikit Pipelines are only useful in production environments
# scikit_pipelines_useful = ...

# Exercise 2.3. True or False, you should try to make a complex baseline, so that you only
#               have to make simple improvements to it later on.
# baseline_complex = ...

# Exercise 2.4. (optional) True or False, is Brian the Messiah?
# is_brian_the_messiah = ...

# YOUR CODE HERE
split_training_test = True
scikit_pipelines_useful = False
baseline_complex = False
is_brian_the_messiah = True

In [53]:
### BEGIN TESTS
assert _hash(split_training_test, 'salt1') == '569b45c42b5c7b490c92692b911af35f575c8a06'
assert _hash(scikit_pipelines_useful, 'salt2') == 'ef07576cc7d3bcb2cf29e1a772aec2aad7f59158'
assert _hash(baseline_complex, 'salt3') == 'f24a294afb4a09f7f9df9ee13eb18e7d341c439d'
### END TESTS
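
To make the baseline question a bit more concrete, here is a minimal sketch of what a simple baseline can look like. It is only an illustration: it uses synthetic data from `make_classification` instead of the notebook's dataset, and `DummyClassifier` as one possible choice of baseline (our assumption, not something the exercise prescribes).

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook's own dataset is not used here
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# Simple baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# A more complex model is only worth keeping if it beats the baseline
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
forest_acc = accuracy_score(y_te, forest.predict(X_te))

print(f"baseline: {baseline_acc:.2f}, random forest: {forest_acc:.2f}")
```

Both scores are computed on the held-out test split, which is the point of splitting in the first place: the comparison only means something on data neither model has seen.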

<img src="media/monty_python_messiah.jpg" width=500>

## Scikit Pipelines

We've already loaded and split a dataset for the following exercises. The splits are stored in the `X_train`, `X_test`, `y_train` and `y_test` variables.

In a perfect world, where all your data is clean and ready to go, you can create your pipeline with just Scikit-learn's Transformers. In the real world, that's rarely the case, and you'll need to create custom Transformers to get the job done. Take a look at the dataset. What do you see?

In [54]:
# Do some data analysis here
X_test

Unnamed: 0,17,leg_1,7,arm_2,2,11,4,16,0,14,...,12,10,15,9,19,18,8,6,arm_0,1
33,-0.415288,test,2.270693,foothold,0.632782,0.018418,-1.478586,0.235615,0.326927,1.676437,...,0.181866,-2.003477,-1.373141,0.829406,0.338496,1.143754,-0.219101,0.248221,appliance,2.404373
60,-0.30318,tribesman,-1.616311,fleet,0.799942,0.73881,,0.367287,-0.935439,0.615367,...,-1.053682,-0.150138,-0.053969,-0.535963,-0.01942,-0.349317,1.085982,-1.067803,plural,0.159855
82,0.186609,Iran,0.19409,warplane,-0.446434,-1.359856,-0.095296,0.249384,0.645484,0.746254,...,1.073632,0.826007,-1.027544,-0.307778,0.607897,0.279022,2.163255,-1.026515,choosy,-0.329294
23,-1.380101,wrap,-0.055548,precocious,-1.703382,-0.964923,,-0.269407,1.058424,,...,0.384065,-1.402101,-0.993251,-1.183259,1.628616,0.074095,-1.758739,-0.032695,swore,1.69607
67,0.849602,shepherdess,-0.69291,experimentation,0.357015,1.586017,-0.208122,0.280992,2.133033,-1.237815,...,0.8996,-1.00635,-0.639226,-0.151785,-0.589365,-0.493001,-1.952088,0.3073,Waldorf,1.186741
68,-0.546859,Sarasota,-0.543425,Elmsford,-0.032753,1.848956,0.198085,0.013929,-0.268889,1.126565,...,-0.712846,1.151991,-0.843212,2.57336,-0.573662,-0.14436,-1.106526,0.10643,Rutledge,-0.70427
94,0.184836,genteel,0.70031,poplar,-0.858358,0.186767,-0.949399,-0.975873,-0.611518,-0.755383,...,-0.575638,1.652145,-0.691021,-0.923233,0.493318,2.632382,-1.406661,0.12201,Carrara,-1.225329
29,-1.125489,jackdaw,0.129221,interruptible,2.445752,0.869606,0.654366,-1.77872,0.413435,1.355638,...,0.109395,-1.54851,-0.734265,-0.773789,0.279969,-0.055585,1.876796,0.725767,chapati,1.722513
99,0.992042,walkover,-0.755745,young,-0.17496,0.388579,,0.919076,-0.006071,2.493,...,0.53651,-1.755186,0.717686,0.081829,-0.66809,0.321698,0.838491,-0.898468,Brighton,1.308576
14,0.425458,deport,-0.047711,seem,-0.966976,0.128104,-1.583903,-0.452306,0.840644,-0.681052,...,-0.003603,0.254157,-1.463612,-0.446183,0.7858,0.760415,-0.652624,-1.158365,Clayton,0.375316


While crunching your data, you probably found two issues:

1. There are 4 columns whose names start with either `arm` or `leg`, and they are all filled with gibberish
2. Some columns have missing values
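
If you'd rather not spot these by eye, a couple of pandas calls can surface both issues. The sketch below runs on a tiny made-up frame (the column names and values are ours, chosen to mimic the dataset), so treat it as an illustration rather than the analysis of the real data.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two issues; column names and values are made up
toy = pd.DataFrame({
    "0": [0.1, np.nan, 0.3],
    "1": [1.0, 2.0, np.nan],
    "arm_0": ["plural", "choosy", "swore"],
    "leg_1": ["test", "wrap", "Iran"],
})

# Issue 1: columns that are not numeric (here, the text columns)
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

# Issue 2: count missing values per column, keep the columns that have any
missing_counts = toy.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(text_columns)
print(columns_with_missing)
```

On the real data you would run the same calls on `X_train` instead of `toy`.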

So, first things first: let's get rid of those columns with a Custom Transformer, so we can plug it into a Scikit Pipeline afterwards.

## Exercise 3: Custom Transformer

In [60]:
# Create a pipeline step called RemoveLimbs that removes any
# column whose name starts with the string 'arm' or 'leg'

# YOUR CODE HERE
class RemoveLimbs(TransformerMixin):
    """Transformer that drops every column whose name starts with 'arm' or 'leg'."""

    def fit(self, X, y=None):
        # Nothing to learn from the data; returning self lets
        # TransformerMixin provide fit_transform for free
        return self

    def transform(self, X):
        cols_to_drop = [colname for colname in X.columns
                        if colname.startswith(('arm', 'leg'))]
        # Return a new DataFrame rather than modifying the input in place
        return X.drop(columns=cols_to_drop)

RemoveLimbs().fit_transform(X_test)


Unnamed: 0,17,7,2,11,4,16,0,14,5,3,13,12,10,15,9,19,18,8,6,1
33,-0.415288,2.270693,0.632782,0.018418,-1.478586,0.235615,0.326927,1.676437,-2.39955,0.770865,-2.211135,0.181866,-2.003477,-1.373141,0.829406,0.338496,1.143754,-0.219101,0.248221,2.404373
60,-0.30318,-1.616311,0.799942,0.73881,,0.367287,-0.935439,0.615367,,1.838184,0.808058,-1.053682,-0.150138,-0.053969,-0.535963,-0.01942,-0.349317,1.085982,-1.067803,0.159855
82,0.186609,0.19409,-0.446434,-1.359856,-0.095296,0.249384,0.645484,0.746254,-1.053839,1.577453,0.21915,1.073632,0.826007,-1.027544,-0.307778,0.607897,0.279022,2.163255,-1.026515,-0.329294
23,-1.380101,-0.055548,-1.703382,-0.964923,,-0.269407,1.058424,,-1.720671,0.717542,-2.039232,0.384065,-1.402101,-0.993251,-1.183259,1.628616,0.074095,-1.758739,-0.032695,1.69607
67,0.849602,-0.69291,0.357015,1.586017,-0.208122,0.280992,2.133033,-1.237815,-1.140548,-0.6227,0.588317,0.8996,-1.00635,-0.639226,-0.151785,-0.589365,-0.493001,-1.952088,0.3073,1.186741
68,-0.546859,-0.543425,-0.032753,1.848956,0.198085,0.013929,-0.268889,1.126565,-0.713525,-0.024125,0.059218,-0.712846,1.151991,-0.843212,2.57336,-0.573662,-0.14436,-1.106526,0.10643,-0.70427
94,0.184836,0.70031,-0.858358,0.186767,-0.949399,-0.975873,-0.611518,-0.755383,,1.053642,-1.351685,-0.575638,1.652145,-0.691021,-0.923233,0.493318,2.632382,-1.406661,0.12201,-1.225329
29,-1.125489,0.129221,2.445752,0.869606,0.654366,-1.77872,0.413435,1.355638,-1.435349,1.496044,-1.244655,0.109395,-1.54851,-0.734265,-0.773789,0.279969,-0.055585,1.876796,0.725767,1.722513
99,0.992042,-0.755745,-0.17496,0.388579,,0.919076,-0.006071,2.493,0.36017,-0.290275,-0.09889,0.53651,-1.755186,0.717686,0.081829,-0.66809,0.321698,0.838491,-0.898468,1.308576
14,0.425458,-0.047711,-0.966976,0.128104,-1.583903,-0.452306,0.840644,-0.681052,-1.79532,-2.423879,-1.889541,-0.003603,0.254157,-1.463612,-0.446183,0.7858,0.760415,-0.652624,-1.158365,0.375316


In [61]:
### BEGIN TESTS
assert _hash(sorted(RemoveLimbs().fit_transform(X).columns), 'salt5') == '71443dfc3077d773d4c74e958dadf91dc2cc148a'
assert _hash(list(map(lambda col: col.startswith('arm') or col.startswith('leg'), RemoveLimbs().fit_transform(X_train).columns)), 'salt6') == 'ce45cf3759d2210f2d1315f1673b18f34e3ac711'
### END TESTS

<img src="media/monty_python_black_knight.gif" width=500>

Now that we have our Custom Transformer in place, we can design our pipeline. For the sake of the exercise, you'll want to create a pipeline with the following steps:

1. Remove the limb columns
2. Impute missing values with the mean
3. Use a Random Forest Classifier as the last step

You may use `make_pipeline` to create your pipeline with as many steps as you want, as long as the first step is the Custom Transformer you developed previously, the second step is a `SimpleImputer`, and the last step is a `RandomForestClassifier`.
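
For a rough idea of how these three stages fit together, here is a sketch on toy data. Note the assumptions: the first step is a `FunctionTransformer` wrapping a hypothetical helper, `drop_limb_columns`, standing in for your Custom Transformer, and the data is made up. It is not the answer to the exercise, which asks for your own transformer as the first step.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def drop_limb_columns(table):
    # Hypothetical helper: drop columns named 'arm...' or 'leg...'
    limb_columns = [c for c in table.columns if c.startswith(("arm", "leg"))]
    return table.drop(columns=limb_columns)

# Made-up data: two numeric columns, one text column, a few missing values
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "0": rng.normal(size=40),
    "1": rng.normal(size=40),
    "arm_0": ["gibberish"] * 40,
})
toy_X.loc[::7, "1"] = np.nan
toy_y = (toy_X["0"] > 0).astype(int)  # label depends on column "0" only

toy_pipeline = make_pipeline(
    FunctionTransformer(drop_limb_columns),
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
toy_pipeline.fit(toy_X, toy_y)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)
```

Because `make_pipeline` derives step names from the class names, swapping the stand-in for a class called `RemoveLimbs` would name the first step `removelimbs`.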

## Exercise 4: Scikit Pipelines

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
### BEGIN TESTS
assert _hash(pipeline.steps[0][0], 'salt7') == '471b02068ac2c4f479c2e9f85f4b3dc2179bb841'
assert _hash(pipeline.steps[1][0], 'salt8') == 'ca83eaea1a7e243fa5574cfa6f52831166ee0f32'
assert _hash(pipeline.steps[-1][0], 'salt9') == '0d66ba4309ad4939673169e74f87088dcadd510b'
### END TESTS

Does it work? Let's check it out on our dataset!

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)

That's it for this Exercise Notebook! It doesn't get much cleaner than this, does it?

You can keep practicing with pipelines, maybe by adding a few more steps. See how changes to your pipeline affect the predictions.
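
As one possible starting point, the sketch below compares two pipelines that differ by a single extra step, a `MinMaxScaler`. It runs on synthetic data with randomly injected missing values (our assumption, not the notebook's dataset), so the numbers say nothing about the exercise data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with roughly 5% of the values knocked out
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# Two candidate pipelines that differ only by the scaling step
candidates = {
    "impute + forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(n_estimators=50, random_state=0)),
    "impute + scale + forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        MinMaxScaler(),
        RandomForestClassifier(n_estimators=50, random_state=0)),
}

# Fit each candidate on the training split and score it on the test split
scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, candidate.predict(X_te))
    print(f"{name}: {scores[name]:.2f}")
```

Tree-based models are largely insensitive to feature scaling, so don't be surprised if the two scores come out close. The point is that trying a new step is a one-line change.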

Can you see how Scikit-learn's Pipelines might save time? Can you imagine how useful that would be in stressful situations (like, *for example*, a Hackathon)?