# Preprocessor Tuning

## (0) The `tumors` dataset

* 👩🏻‍⚕️ The following dataset describes tumors that are either <font color=red>malignant</font> or <font color=green>benign</font>. 
* 🎯 The task is to detect as many malignant tumors as possible.

In [3]:
import pandas as pd
# pd.set_option("display.max_columns", None)  # uncomment to display every column

url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_tumors_dataset.csv"
data = pd.read_csv(url)

data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malignant
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,,0.205,0.4,0.1625,0.2364,0.07678,1


In [4]:
data.malignant.value_counts(normalize=True).round(2)

0    0.63
1    0.37
Name: malignant, dtype: float64
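
With roughly 63% benign tumors, accuracy alone can be misleading for the task stated above. As a minimal sketch (toy labels that only mirror the class balance, not the real data), a baseline that always predicts "benign" looks decent on accuracy while detecting no malignant tumor at all:

```python
# Toy labels mirroring the 63% / 37% class balance (illustrative, not the real dataset)
y_true = [0] * 63 + [1] * 37
# Naive baseline: always predict "benign" (0)
y_pred = [0] * 100

# Accuracy: share of correct predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of the actual malignant tumors that were detected
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(accuracy, recall)
```

This is why recall, rather than accuracy, is the natural scoring metric when the goal is to miss as few malignant tumors as possible.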

## (1) Building a Pipeline

❓ **Question: Building a Pipeline** ❓

Combine the following steps in a **`Pipeline`** object named `pipeline`:

1. Impute missing values with a **`KNNImputer`**
2. Scale all the (numerical) features with a **`MinMaxScaler`**
3. Model a **`LogisticRegression`** with default parameters

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

X = data.drop(columns=['malignant'])
y = data['malignant']

num_transformer = make_pipeline(KNNImputer(),  MinMaxScaler())
cat_transformer = OneHotEncoder()

preproc = make_column_transformer(
    (num_transformer, make_column_selector(dtype_include=['float64'])),
    (cat_transformer, make_column_selector(dtype_include=['int64', 'bool'])))

pipe = make_pipeline(preproc, LogisticRegression())
pipe.fit(X, y)

pipe.get_params

<bound method Pipeline.get_params of Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('knnimputer',
                                                                   KNNImputer()),
                                                                  ('minmaxscaler',
                                                                   MinMaxScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7efbd55781f0>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7efbd5578400>)])),
                ('logisticregression', LogisticRegression())])>
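
`make_pipeline` names each step after its lowercased class name, and nested parameters are addressed as `<step>__<parameter>`. The sketch below uses a simplified, flat pipeline (the name `demo_pipe` is hypothetical and there is no `ColumnTransformer`) to show how those names can be listed and used:

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Simplified, flat version of the pipeline (assumption: numerical features only)
demo_pipe = make_pipeline(KNNImputer(), MinMaxScaler(), LogisticRegression())

# Step names are generated from the lowercased class names
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# Hyperparameters of a step are exposed as "<step>__<parameter>"
imputer_params = [p for p in demo_pipe.get_params() if p.startswith("knnimputer__")]
print(imputer_params)

# The same naming scheme is used to set a hyperparameter
demo_pipe.set_params(knnimputer__n_neighbors=2)
```

In the nested pipeline above, the same rule simply adds one prefix per level, which is where a name such as `columntransformer__pipeline__knnimputer__n_neighbors` comes from.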

## (2) Optimizing a pipelined model

❓ **Question (GridSearching a Pipeline)** ❓

* What is the optimal number of neighbors for the KNN imputer: 2, 5, or 10?
    * Perform a GridSearch on your pipeline and save your answer under a variable called `n_best`.
    * _Be careful: Use a scoring metric that is relevant for the task in your Grid Search, just saying... :)_
* Feel free to GridSearch on the whole dataset instead of using a train/test split in this challenge. Here, the goal is just to become familiar with Pipelines :)



In [13]:
n_best = 5

In [7]:
# The grid must target a hyperparameter (n_neighbors), not the KNNImputer step itself
param_distributions = {'columntransformer__pipeline__knnimputer__n_neighbors': [2, 5, 10]}
pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__pipeline', 'columntransformer__onehotencoder', 'columntransformer__pipeline__memory', 'columntransformer__pipeline__steps', 'columntransformer__pipeline__verbose', 'columntransformer__pipeline__knnimputer', 'columntransformer__pipeline__minmaxscaler', 'columntransformer__pipeline__knnimputer__add_indicator', 'columntransformer__pipeline__knnimputer__copy', 'columntransformer__pipeline__knnimputer__metric', 'columntransformer__pipeline__knnimputer__missing_values', 'columntransformer__pipeline__knnimputer__n_neighbors', 'columntransformer__pipeline__knnimputer__weights', 'columntransformer__pipeline__minmaxscaler__clip', 'columntransf

In [1]:
from sklearn.model_selection import GridSearchCV

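# Tune only the number of neighbors used by the KNN imputer
k_grid = {'columntransformer__pipeline__knnimputer__n_neighbors': [2, 5, 10]}

# Recall is the relevant metric: the goal is to miss as few malignant tumors as possible
grid_search = GridSearchCV(
    pipe,
    k_grid,
    n_jobs=-1,
    scoring="recall",
    cv=5)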

grid_search.fit(X, y)

grid_search.best_params_

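
Once a search has been fitted, the winning value can be read from `best_params_` instead of being typed by hand. Here is a minimal sketch on synthetic data (the names `X_demo`, `y_demo`, `demo_pipe`, `demo_search` and `demo_n_best` are hypothetical, and the pipeline is a flat stand-in for the one above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the tumors data, with about 5% of the values set to NaN
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_pipe = make_pipeline(KNNImputer(), MinMaxScaler(), LogisticRegression())
demo_search = GridSearchCV(
    demo_pipe,
    {"knnimputer__n_neighbors": [2, 5, 10]},
    scoring="recall",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# best_params_ is a dict keyed by the full parameter name
demo_n_best = demo_search.best_params_["knnimputer__n_neighbors"]
print(demo_n_best, round(demo_search.best_score_, 3))
```

With the notebook's nested pipeline, the key to look up would be `columntransformer__pipeline__knnimputer__n_neighbors`.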

## (3) Evaluating a pipeline

❓ **Question: what is the performance of the optimal pipeline**  ❓

- Make sure you cross-validate your optimal pipeline! 
- Store your result as a `float` number in a variable named `cv_score`

In [11]:
from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(grid_search, X, y, cv=5, scoring='recall').mean()
cv_score

0.9150609080841638
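
Passing the `GridSearchCV` object itself to `cross_val_score`, as in the cell above, re-runs the whole search inside each outer fold (nested cross-validation). An alternative reading of the question is to cross-validate only the selected configuration via `best_estimator_`. The sketch below compares both on synthetic data; all names in it are hypothetical and the pipeline is a flat stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with a few missing values
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
rng = np.random.default_rng(1)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_search = GridSearchCV(
    make_pipeline(KNNImputer(), MinMaxScaler(), LogisticRegression()),
    {"knnimputer__n_neighbors": [2, 5, 10]},
    scoring="recall",
    cv=5,
).fit(X_demo, y_demo)

# Option 1: cross-validate the single best configuration found by the search
fixed_score = float(
    cross_val_score(demo_search.best_estimator_, X_demo, y_demo, cv=5, scoring="recall").mean()
)

# Option 2: nested cross-validation, the search is repeated in every outer fold
nested_score = float(
    cross_val_score(demo_search, X_demo, y_demo, cv=5, scoring="recall").mean()
)

print(round(fixed_score, 3), round(nested_score, 3))
```

The nested variant is slower but also accounts for the hyperparameter selection step. Wrapping the mean in `float(...)` gives the plain `float` the challenge asks for.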

In [14]:
from nbresult import ChallengeResult
result = ChallengeResult('solution', 
                         n_best = n_best,
                         cv_score=cv_score)
result.write()
print(result.check())


platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 -- /bin/python3
cachedir: .pytest_cache
rootdir: /home/quantium/labs/lewagon/data-challenges/05-ML/08-Workflow/01-Preprocessor-Tuning/tests
plugins: dash-1.19.0, anyio-3.6.1
collecting ... collected 2 items

test_solution.py::TestSolution::test_n_neighbours PASSED                 [ 50%]
test_solution.py::TestSolution::test_score_good_enough PASSED            [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master



## (4) Predicting using a fitted and pipelined model

👇 Here is a new tumor.

In [16]:
new_url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_new_tumor.csv"

new_data = pd.read_csv(new_url)
new_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


❓ **Question: Using your optimal pipeline, predict whether the new tumor is malignant or not** ❓

In [17]:
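# `pipe` was fitted on the full dataset with KNNImputer's default n_neighbors=5, which equals n_best
pipe.predict(new_data)  # predicted class (not displayed: only the last expression is shown)
pipe.predict_proba(new_data)[0][1]  # probability of the malignant class (label 1)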

0.9673757870962858
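
A fitted `GridSearchCV` (with the default `refit=True`) can also be used directly for prediction: its `predict` and `predict_proba` delegate to the best pipeline, refitted on all the data passed to `fit`. A minimal sketch on synthetic data, where every name is hypothetical and the first row plays the role of the new tumor:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with a few missing values
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=2)
rng = np.random.default_rng(2)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_search = GridSearchCV(
    make_pipeline(KNNImputer(), MinMaxScaler(), LogisticRegression()),
    {"knnimputer__n_neighbors": [2, 5, 10]},
    scoring="recall",
    cv=5,
).fit(X_demo, y_demo)

# Stand-in for a new observation: a single row, kept two-dimensional
new_row = X_demo[:1]

# Both calls go through the refitted best pipeline (imputation, scaling, model)
predicted_label = int(demo_search.predict(new_row)[0])
proba_positive = float(demo_search.predict_proba(new_row)[0, 1])

print(predicted_label, round(proba_positive, 3))
```

The predicted label is simply the positive-class probability thresholded at 0.5, so a probability close to 1 corresponds to a confident "malignant" prediction.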

🏁 Congratulations! You are now an expert at pipelining!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!