# Preprocessor Tuning

## (0) The `tumors` Dataset

* 👩🏻‍⚕️ The following dataset describes tumors that are either <font color=red>malignant</font> or <font color=green>benign</font>. 
* 🎯 The task is to detect as many malignant tumors as possible.
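
Because the goal is to miss as few malignant tumors as possible, recall (the share of malignant tumors that are actually detected) is a more telling metric than accuracy. The sketch below uses made-up toy labels, not the `tumors` data, to show how the two can diverge:

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (made up for illustration): 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A cautious model that flags only one of the four malignant tumors
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of malignant tumors detected

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.70, recall = 0.25
```

An accuracy of 70% looks acceptable here, yet three out of four malignant tumors go undetected.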

In [36]:
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV


In [13]:
pd.set_option('display.max_columns', None)

url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/tumors_dataset.csv"
data = pd.read_csv(url)

data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malignant
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,,0.205,0.4,0.1625,0.2364,0.07678,1


In [14]:
data["malignant"].value_counts(normalize=True).round(2)

0    0.63
1    0.37
Name: malignant, dtype: float64

In [15]:
data.dtypes.value_counts()

float64    30
int64       1
dtype: int64

In [16]:
data.isnull().sum().sort_values(ascending=False).head(5)

worst smoothness        5
worst compactness       5
worst concavity         4
worst concave points    3
worst texture           3
dtype: int64

In [25]:
X = data.drop("malignant", axis=1)
y = data["malignant"]

## (1) Building a Pipeline

❓ **Question: Building a Pipeline** ❓

Combine the following steps in a **`Pipeline`** object named `pipeline`:

1. Impute missing values with a **`KNNImputer`**
2. Scale all the (numerical) features with a **`MinMaxScaler`**
3. Model a **`LogisticRegression`** with default parameters
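
To get a feel for what the first two steps do before chaining them, here is a minimal sketch on a tiny toy array (the values are illustrative and unrelated to the `tumors` data). `KNNImputer` fills each missing value with the mean of that feature over the nearest rows, and `MinMaxScaler` then rescales every column to the range 0 to 1:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data with two missing values (illustrative only)
toy = np.array([
    [1, 2, np.nan],
    [3, 4, 3],
    [np.nan, 6, 5],
    [8, 8, 7],
])

# Step 1: fill each gap with the mean of the 2 nearest rows for that feature
imputed = KNNImputer(n_neighbors=2).fit_transform(toy)
print(imputed)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]

# Step 2: rescale every column so that its minimum is 0 and its maximum is 1
scaled = MinMaxScaler().fit_transform(imputed)
print(scaled.min(axis=0), scaled.max(axis=0))
```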

In [38]:
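# Build the numerical preprocessing pipeline: impute first, then scale.
# Note: the second step is named 'standard_scaler' but wraps a MinMaxScaler.
num_preprocessing = Pipeline([
    ('knn_imputer', KNNImputer()),
    ('standard_scaler', MinMaxScaler())
]).set_output(transform="pandas")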

num_preprocessing

In [39]:
pipelined_model = Pipeline([
    ("preprocessing", num_preprocessing),
    ("logistic_regression", LogisticRegression())
])

pipelined_model

In [40]:
pipelined_model.fit(X, y)

In [41]:
# Accuracy on the training data itself: optimistic, and not the metric this task calls for
pipelined_model.score(X, y)

0.9718804920913884

## (2) Optimizing a pipelined model

❓ **Question (GridSearching a Pipeline)** ❓

* What is the optimal number of neighbors for the KNN imputer: 2, 5, or 10?
    * Perform a GridSearch on your pipeline and save your answer under a variable called `n_best`.
    * _Be careful: Use a scoring metric that is relevant for the task in your Grid Search, just saying... :)_
* Feel free to GridSearch on the whole dataset instead of using a train/test split in this challenge. Here, the goal is just to become familiar with Pipelines :)
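
A grid search addresses the hyperparameters of a nested pipeline by joining step names with a double underscore: `<outer step>__<inner step>__<parameter>`. The sketch below rebuilds a small pipeline (the `demo_` names and the `scaler` step name are hypothetical, chosen for illustration) and shows how to look up and set such a parameter:

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# A pipeline nested inside another pipeline
demo_preprocessing = Pipeline([
    ("knn_imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
])
demo_pipeline = Pipeline([
    ("preprocessing", demo_preprocessing),
    ("logistic_regression", LogisticRegression()),
])

# Keep only the parameter names that belong to the imputer
imputer_params = [
    name for name in demo_pipeline.get_params() if "knn_imputer__" in name
]
print(imputer_params)

# The same naming scheme works for setting a value
demo_pipeline.set_params(preprocessing__knn_imputer__n_neighbors=10)
n_neighbors = demo_pipeline.get_params()["preprocessing__knn_imputer__n_neighbors"]
print(n_neighbors)  # 10
```

These full names are the ones to use as keys in the hyperparameter grid.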



In [None]:
n_best = None

In [43]:
pipelined_model.get_params()

{'memory': None,
 'steps': [('preprocessing',
   Pipeline(steps=[('knn_imputer', KNNImputer()),
                   ('standard_scaler', MinMaxScaler())])),
  ('logistic_regression', LogisticRegression())],
 'verbose': False,
 'preprocessing': Pipeline(steps=[('knn_imputer', KNNImputer()),
                 ('standard_scaler', MinMaxScaler())]),
 'logistic_regression': LogisticRegression(),
 'preprocessing__memory': None,
 'preprocessing__steps': [('knn_imputer', KNNImputer()),
  ('standard_scaler', MinMaxScaler())],
 'preprocessing__verbose': False,
 'preprocessing__knn_imputer': KNNImputer(),
 'preprocessing__standard_scaler': MinMaxScaler(),
 'preprocessing__knn_imputer__add_indicator': False,
 'preprocessing__knn_imputer__copy': True,
 'preprocessing__knn_imputer__keep_empty_features': False,
 'preprocessing__knn_imputer__metric': 'nan_euclidean',
 'preprocessing__knn_imputer__missing_values': nan,
 'preprocessing__knn_imputer__n_neighbors': 5,
 'preprocessing__knn_imputer__weights': 

In [50]:
# Hyperparameter grid: nested step names are joined with a double underscore
grid = {
    'preprocessing__knn_imputer__n_neighbors': [2, 5, 10]
}

# Instantiate Grid Search
search = GridSearchCV(
    pipelined_model,
    grid,
    scoring="recall",  # share of malignant tumors that are detected
    cv = 5,
    n_jobs=-1, # parallelize computation
    verbose=1
)

# Fit data to Grid Search
search.fit(X, y);
search

Fitting 5 folds for each of 3 candidates, totalling 15 fits


In [51]:
# Best score
display(search.best_score_)

# Best Params
display(search.best_params_)

0.9244739756367663

{'preprocessing__knn_imputer__n_neighbors': 5}
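
Besides `best_params_`, it can be useful to see how close the candidates are to each other, which `cv_results_` makes possible. The sketch below is self-contained and rests on assumptions: it uses scikit-learn's bundled breast cancer dataset as a stand-in for the `tumors` data, injects about 1% of missing values at random, and uses hypothetical `demo_` names, so its numbers will not match the ones above:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
demo_X, target = load_breast_cancer(return_X_y=True)
demo_y = (target == 0).astype(int)  # in this dataset, 0 encodes malignant

# Knock out about 1% of the values so that the imputer has work to do
rng = np.random.default_rng(0)
demo_X = demo_X.copy()
demo_X[rng.random(demo_X.shape) < 0.01] = np.nan

demo_pipeline = Pipeline([
    ("knn_imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
    ("logistic_regression", LogisticRegression()),
])

demo_search = GridSearchCV(
    demo_pipeline,
    {"knn_imputer__n_neighbors": [2, 5, 10]},
    scoring="recall",
    cv=5,
)
demo_search.fit(demo_X, demo_y)

# One row per candidate, best-ranked first
summary = pd.DataFrame(demo_search.cv_results_)[
    ["param_knn_imputer__n_neighbors", "mean_test_score",
     "std_test_score", "rank_test_score"]
]
print(summary.sort_values("rank_test_score"))
```

If the mean scores differ by less than their standard deviations, the choice of `n_neighbors` probably matters little for this model.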

In [63]:
n_best = search.best_params_["preprocessing__knn_imputer__n_neighbors"]

## (3) Evaluating a pipeline

❓ **Question: What is the performance of the optimal pipeline?** ❓

- Make sure you cross-validate your optimal pipeline! 
- Store your result as a `float` number in a variable named `cv_score`

In [58]:
# Cross-validate the best pipeline found by the grid search (5 folds by default)
cv_results = cross_val_score(estimator=search.best_estimator_, X=X, y=y, scoring="recall")

print(cv_results)

cv_score = cv_results.mean()
print(cv_score)

[0.93023256 0.93023256 0.92857143 0.9047619  0.92857143]
0.9244739756367663
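
Recall alone does not show how many benign tumors are flagged by mistake. `cross_validate` accepts several metrics at once, which gives a fuller picture. The sketch below is a self-contained illustration under assumptions: scikit-learn's bundled breast cancer dataset stands in for the `tumors` data, and the `demo_` names are hypothetical:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
demo_X, target = load_breast_cancer(return_X_y=True)
demo_y = (target == 0).astype(int)  # in this dataset, 0 encodes malignant

demo_pipeline = Pipeline([
    ("knn_imputer", KNNImputer(n_neighbors=5)),
    ("scaler", MinMaxScaler()),
    ("logistic_regression", LogisticRegression()),
])

# Evaluate three metrics on the same 5 folds
metrics = ["recall", "precision", "accuracy"]
scores = cross_validate(demo_pipeline, demo_X, demo_y, cv=5, scoring=metrics)

for metric in metrics:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```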


In [64]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'solution',
    n_best = n_best,
    cv_score=cv_score
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/05-ML/08-Workflow/data-preprocessor-tuning/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 2 items

test_solution.py::TestSolution::test_n_neighbours PASSED                 [ 50%]
test_solution.py::TestSolution::test_score_good_enough PASSED            [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master



## (4) Predicting using a fitted and pipelined model

👇 Here is a new tumor.

In [65]:
new_url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/new_tumor.csv"

new_data = pd.read_csv(new_url)
new_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


❓ **Question: Using your optimal pipeline, predict whether the new tumor is malignant or not** ❓

In [66]:
# Predict with the best pipeline found by the grid search (already refit on all of X and y)
result = search.best_estimator_.predict(new_data)

In [67]:
result

array([1])
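
A hard 0/1 prediction hides how confident the model is. Since the task favors catching malignant tumors, one option is to look at `predict_proba` and flag a tumor below the default 0.5 cut-off. The sketch below illustrates the idea under assumptions: scikit-learn's bundled breast cancer dataset stands in for the `tumors` data, the `demo_` names are hypothetical, the 0.3 threshold is arbitrary, and recall is measured on the training data purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
demo_X, target = load_breast_cancer(return_X_y=True)
demo_y = (target == 0).astype(int)  # in this dataset, 0 encodes malignant

demo_pipeline = Pipeline([
    ("knn_imputer", KNNImputer(n_neighbors=5)),
    ("scaler", MinMaxScaler()),
    ("logistic_regression", LogisticRegression()),
])
demo_pipeline.fit(demo_X, demo_y)

# Column 1 holds the estimated probability of class 1 (malignant)
proba_malignant = demo_pipeline.predict_proba(demo_X)[:, 1]

# Default decision rule versus a more cautious, arbitrary threshold
default_pred = (proba_malignant >= 0.5).astype(int)
cautious_pred = (proba_malignant >= 0.3).astype(int)

default_recall = recall_score(demo_y, default_pred)
cautious_recall = recall_score(demo_y, cautious_pred)
print(f"recall at 0.5: {default_recall:.3f}, recall at 0.3: {cautious_recall:.3f}")
```

Lowering the threshold can only keep or increase recall on a given dataset, at the cost of more benign tumors being flagged as malignant.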

🏁 Congratulations! You are now an expert at pipelining!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!