# Preprocessor Tuning

👇 Consider the following dataset as your training set

In [1]:
import pandas as pd

data = pd.read_csv("data.csv")

data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malignant
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,,0.205,0.4,0.1625,0.2364,0.07678,1


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              568 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           567 non-null    float64
 3   mean area                568 non-null    float64
 4   mean smoothness          568 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           567 non-null    float64
 7   mean concave points      567 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            567 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

The dataset describes tumors that are either malignant or benign. The task is to detect as many malignant tumors as possible.
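Because the goal is to miss as few malignant tumors as possible, recall on the malignant class is one natural candidate metric; F1 additionally penalizes false alarms. The sketch below uses invented toy labels (not taken from the dataset) to show how accuracy, recall, and F1 can disagree on the same predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels, invented for illustration: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two malignant tumors are missed

accuracy = accuracy_score(y_true, y_pred)  # share of all predictions that are correct
recall = recall_score(y_true, y_pred)      # share of malignant tumors that are detected
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy looks comfortable even though half of the malignant cases go undetected, which is why accuracy alone is a weak guide for this task.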

👇 Combine the following steps in a `Pipeline`:

- Impute missing values with a `KNNImputer`
- Scale all the features with a `MinMaxScaler`
- Model a `LogisticRegression` with default parameters
- Use the scoring metric relevant for the task

❓With how many neighbors does the `KNNImputer` produce the optimal pipeline: 2, 5, or 10?

In [3]:
from sklearn.pipeline import Pipeline

X = data.drop(columns='malignant')
y = data['malignant']

In [4]:
y.value_counts() / len(y)  # class proportions: the classes are imbalanced (about 63% vs 37%)

0    0.627417
1    0.372583
Name: malignant, dtype: float64

In [5]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

In [6]:
preprocessing_pipe = Pipeline([
    ('imputer', KNNImputer()),
    ('scaler', MinMaxScaler())
])

In [7]:
from sklearn import set_config
set_config(display='diagram')

In [8]:
preprocessing_pipe

In [9]:
final_pipe = Pipeline([
    ('prep_pipe', preprocessing_pipe),
    ('log_regression', LogisticRegression())]
)

In [10]:
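from sklearn.model_selection import cross_val_score, train_test_split

# Hold out 30% of the rows as a test set.
# No random_state is set, so the split (and the scores below) change from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)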

In [11]:
final_pipe.fit(X_train, y_train)

In [12]:
final_pipe.predict(X_test.iloc[0:2])

array([1, 0])

In [13]:
final_pipe.score(X_test, y_test)

0.9590643274853801

In [14]:
cross_val_score(final_pipe, X_train, y_train, cv=5, scoring='f1').mean()

0.9518120842726336

In [15]:
from sklearn.model_selection import GridSearchCV

In [16]:
final_pipe.get_params()

{'memory': None,
 'steps': [('prep_pipe',
   Pipeline(steps=[('imputer', KNNImputer()), ('scaler', MinMaxScaler())])),
  ('log_regression', LogisticRegression())],
 'verbose': False,
 'prep_pipe': Pipeline(steps=[('imputer', KNNImputer()), ('scaler', MinMaxScaler())]),
 'log_regression': LogisticRegression(),
 'prep_pipe__memory': None,
 'prep_pipe__steps': [('imputer', KNNImputer()), ('scaler', MinMaxScaler())],
 'prep_pipe__verbose': False,
 'prep_pipe__imputer': KNNImputer(),
 'prep_pipe__scaler': MinMaxScaler(),
 'prep_pipe__imputer__add_indicator': False,
 'prep_pipe__imputer__copy': True,
 'prep_pipe__imputer__metric': 'nan_euclidean',
 'prep_pipe__imputer__missing_values': nan,
 'prep_pipe__imputer__n_neighbors': 5,
 'prep_pipe__imputer__weights': 'uniform',
 'prep_pipe__scaler__clip': False,
 'prep_pipe__scaler__copy': True,
 'prep_pipe__scaler__feature_range': (0, 1),
 'log_regression__C': 1.0,
 'log_regression__class_weight': None,
 'log_regression__dual': False,
 'log_regres

In [17]:
grid_search = GridSearchCV(
    final_pipe,
    param_grid={
        'prep_pipe__imputer__n_neighbors': [2, 5, 10]
    },
    cv=5,
    scoring='f1'   # F1 is used here because of the class imbalance; 'recall' would target missed malignant tumors directly
)

In [18]:
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'prep_pipe__imputer__n_neighbors': 2}
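`best_params_` only names the winner; `cv_results_` shows how close the candidates were. The sketch below is a stand-in, not the notebook's own run: it assumes scikit-learn's bundled breast cancer dataset with a few values blanked out at random in place of `data.csv`, and it uses a flat pipeline, so the parameter is named `imputer__n_neighbors` rather than `prep_pipe__imputer__n_neighbors`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): data.csv is not available here
X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_demo  # the bundled dataset codes malignant as 0; flip so that 1 = malignant

# Blank out roughly 1% of the values so the imputer has something to do
rng = np.random.default_rng(0)
X_demo = X_demo.copy()
X_demo[rng.random(X_demo.shape) < 0.01] = np.nan

demo_pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
    ("log_regression", LogisticRegression()),
])

demo_search = GridSearchCV(
    demo_pipe,
    param_grid={"imputer__n_neighbors": [2, 5, 10]},
    cv=5,
    scoring="f1",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, with its mean and spread across the 5 folds
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_imputer__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)
```

If the mean scores differ by less than their standard deviations, the choice of `n_neighbors` is probably not decisive for this data.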

👇 What is the performance of the optimal pipeline? Make sure you cross-validate!

In [19]:
best_pipeline = grid_search.best_estimator_
best_pipeline

In [20]:
best_pipeline.predict(X_test)

array([1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0])

In [21]:
best_pipeline.score(X_test, y_test)

0.9590643274853801

In [22]:
cross_val_score(best_pipeline, X_train, y_train, cv=5, scoring='f1').mean()

0.9518120842726336
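Note that `score` on a classifier pipeline reports accuracy, while the cross-validated figure above is an F1 score, so the two numbers are not directly comparable. A minimal sketch of reporting several metrics from the same folds with `cross_validate` follows; it assumes scikit-learn's bundled breast cancer dataset as a stand-in for `data.csv`, and the names `X_demo`, `y_demo`, and `demo_pipe` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): the bundled dataset replaces data.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_demo  # flip labels so that 1 = malignant

demo_pipe = make_pipeline(KNNImputer(n_neighbors=2), MinMaxScaler(), LogisticRegression())

# Evaluate accuracy, recall and F1 on the same 5 folds
scores = cross_validate(demo_pipe, X_demo, y_demo, cv=5, scoring=["accuracy", "recall", "f1"])

for metric in ["accuracy", "recall", "f1"]:
    print(metric, scores[f"test_{metric}"].mean())
```

Reporting recall next to F1 makes it visible how many malignant tumors the pipeline is expected to miss.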

👇 Using your optimal pipeline, predict whether the following tumor is malignant or not.

In [23]:
new_data = pd.read_csv("new_data.csv")
new_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In [24]:
best_pipeline.predict(new_data)

array([1])
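`predict` returns only the class label. For a single patient it can help to look at the estimated probability as well, which a pipeline ending in `LogisticRegression` exposes through `predict_proba`. The sketch below assumes scikit-learn's bundled breast cancer dataset as a stand-in for `data.csv` and `new_data.csv`; `demo_pipe` and `sample` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): the bundled dataset replaces the CSV files
X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_demo  # flip labels so that 1 = malignant

demo_pipe = make_pipeline(KNNImputer(n_neighbors=2), MinMaxScaler(), LogisticRegression())
demo_pipe.fit(X_demo, y_demo)

# Take one row and blank a feature to show that imputation also runs at predict time
sample = X_demo[[1]].copy()
sample[0, 2] = np.nan

proba = demo_pipe.predict_proba(sample)  # columns follow demo_pipe.classes_
print(demo_pipe.classes_, proba)
```

The second column is the estimated probability of the malignant class, which could be compared against a threshold lower than 0.5 if missing a malignant tumor is the costlier error.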