# Tuning Pipeline

👇 Consider the following dataset.

In [None]:
import pandas as pd

data = pd.read_csv("data.csv")

data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.isnull().sum()/len(data)

In [None]:
# ====> no feature contains more than 1% null values

In [None]:
import matplotlib.pyplot as plt
data.hist(bins=50, figsize=(15,8))
plt.show()

In [None]:
data.boxplot(figsize=(15, 8))
plt.show()

In [None]:
X = data.drop(columns='target_5y')
y = data['target_5y']

In [None]:
y.value_counts()/len(y)  # check class balance ====> imbalance

Each observation represents a player, and each column a performance characteristic. The target indicates whether the player's professional career lasts less than 5 years [0] or 5 years or more [1].

The task is to build a model that assists in recruiting promising young players. The model should limit false alarms (false positives) as much as possible, to avoid recruiting players who will flop.
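
To make the "false alarm" requirement concrete, here is a minimal sketch of how precision penalizes false positives. The labels below are hypothetical toy values, not taken from the dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels: 1 = career of 5 years or more, 0 = shorter career
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision = TP / (TP + FP): the share of recommended players who really last
manual_precision = tp / (tp + fp)
sk_precision = precision_score(y_true, y_pred)

print(manual_precision, sk_precision)
```

Each false positive is a recruited player who does not last, so a metric with FP in the denominator matches the stated goal.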

👇 In a pipeline:

- Impute missing values using a `SimpleImputer` and find out which strategy works best
- Scale the features
- Model with an `svm.SVC` and tune its parameters `C` and `kernel` using a `RandomizedSearchCV`
- Use the scoring metric appropriate for the task

Once you have built your optimal pipeline, export it as a pickle file.

## Pipeline

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

preprocessor = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', MinMaxScaler())
])

In [None]:
from sklearn.svm import SVC

final_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SVC())
])

In [None]:
from sklearn import set_config

set_config(display='diagram')

In [None]:
final_pipe

## Random Search

In [None]:
final_pipe.get_params()

In [None]:
from sklearn.model_selection import cross_val_score, train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

# Hyperparameter search space
grid = {'preprocessor__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'classifier__C': stats.uniform(1, 10),  # uniform(loc, scale): samples C in [1, 11]
        'classifier__kernel': ['rbf', 'poly', 'sigmoid'],
        'classifier__degree': [2, 3, 4, 6]  # only used by the 'poly' kernel
       }

# Instantiate the randomized search
search = RandomizedSearchCV(final_pipe, 
                            grid, 
                            scoring='precision',
                            n_iter=100,  # number of draws
                            cv=5, n_jobs=-1)
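
A quick check of what `stats.uniform(1, 10)` actually samples may help here: SciPy parameterizes it as `(loc, scale)`, so the range is `[loc, loc + scale]`. The log-uniform alternative shown below is an assumption of this sketch, not part of the original search; it is a common choice when `C` may vary over several orders of magnitude.

```python
from scipy import stats

# uniform(loc, scale) covers [loc, loc + scale], so uniform(1, 10) spans 1 to 11
c_dist = stats.uniform(1, 10)
low, high = c_dist.support()

# Draw reproducible samples to confirm the range empirically
samples = c_dist.rvs(size=1000, random_state=0)
print(low, high, samples.min(), samples.max())

# Assumed alternative: log-uniform over [0.01, 100], with illustrative bounds
c_log_dist = stats.loguniform(1e-2, 1e2)
log_samples = c_log_dist.rvs(size=1000, random_state=0)
print(log_samples.min(), log_samples.max())
```

Either frozen distribution can be passed as a value in the parameter dictionary, since `RandomizedSearchCV` only needs an object with an `rvs` method.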

In [None]:
search.fit(X_train, y_train)
search.best_params_

In [None]:
optimal_pipeline = search.best_estimator_
optimal_pipeline

In [None]:
# Pipeline.score() reports accuracy here, not the precision used for tuning
optimal_pipeline.score(X_test, y_test)
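
Since the search optimizes precision, it can be useful to report the same metric on the held-out set. The sketch below uses a synthetic dataset from `make_classification` as a stand-in (an assumption, not the player data) to show that a pipeline's `.score()` matches accuracy, while precision has to be computed explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for the player data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.4, 0.6], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

demo_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', SVC())
]).fit(X_tr, y_tr)

y_hat = demo_pipe.predict(X_te)

pipe_score = demo_pipe.score(X_te, y_te)        # classifier default: accuracy
test_accuracy = accuracy_score(y_te, y_hat)     # same value as pipe_score
test_precision = precision_score(y_te, y_hat)   # metric matching the task

print(pipe_score, test_accuracy, test_precision)
```

With the real data, the equivalent call would be `precision_score(y_test, optimal_pipeline.predict(X_test))`.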

In [None]:
cross_val_score(optimal_pipeline, X_train, y_train, cv=5, scoring='precision').mean()

In [None]:
cross_val_score(optimal_pipeline, X_train, y_train, cv=5, scoring='f1').mean()

## Export

In [None]:
import pickle

# Export pipeline as pickle file
with open("optimal_pipeline.pkl", "wb") as file:
    pickle.dump(optimal_pipeline, file)

In [None]:
# Reload with a context manager so the file handle is closed
with open('optimal_pipeline.pkl', 'rb') as file:
    reloaded_opt_pipe = pickle.load(file)
reloaded_opt_pipe.score(X_test, y_test)
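
A stricter round-trip check than comparing scores is to confirm that the reloaded pipeline yields identical predictions. The sketch below does this on synthetic data, with a temporary file and a hypothetical file name, so it runs independently of the notebook state.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the player dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', MinMaxScaler()),
    ('classifier', SVC())
]).fit(X_demo, y_demo)

# Write and re-read the fitted pipeline; the file name is illustrative only
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# Identical predictions indicate the fitted state survived serialization
same_predictions = np.array_equal(pipe.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

Pickled estimators are generally only safe to reload with the same scikit-learn version that produced them, and only from trusted sources.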