# Tuning Pipeline

👇 Consider the following dataset.

In [1]:
import pandas as pd

data = pd.read_csv("data.csv")

data.head()

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers,target_5y
0,36.0,27.4,7.4,2.6,7.6,,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,35.0,26.9,,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,58.0,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,48.0,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


Each observation represents a player, and each column is a performance statistic. The target, `target_5y`, indicates whether the player's professional career lasts less than 5 years (0) or 5 years or more (1).

The task is to build a model to assist the recruitment of promising young players. The model should limit false alarms (false positives) as much as possible, to avoid recruiting players who will flop.
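Limiting false alarms means keeping false positives low, which is what precision measures: of all players predicted to last 5 years or more, the share that actually do. The sketch below computes it by hand; the labels are made up for illustration and are not taken from the dataset, and returning 0 when nothing is predicted positive is a convention chosen here.

```python
def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return 0.0  # convention chosen here: no positive predictions gives 0
    return tp / (tp + fp)


# Hypothetical labels: 1 = career of 5 years or more, 0 = shorter career
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

# 2 true positives and 1 false positive
score = precision(y_true, y_pred)
print(score)
```

A missed good player (a false negative) does not lower this score, which matches the stated priority of avoiding bad recruits.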

👇 In a pipeline:

- Impute missing values using a `SimpleImputer` and find out which strategy works best
- Scale the features
- Model with an `svm.SVC` and tune its parameters `C` and `kernel` using a random search (`RandomizedSearchCV`)
- Use the scoring metric appropriate for the task

Once you have built your optimal pipeline, export it as a pickle file.

## Pipeline

## Random Search

## Export

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaling', MinMaxScaler()),
    ('svn', SVC())
])
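To see what the two preprocessing steps do before the data reaches the classifier, here is a minimal sketch on a tiny made-up array (not the player data). It assumes the default `mean` imputation strategy and the default `(0, 1)` scaling range.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with one missing value (illustrative only)
X_toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [5.0, 8.0],
])

preprocessing = Pipeline([
    ('imputer', SimpleImputer()),  # fills the NaN with the column mean, (4 + 8) / 2 = 6
    ('scaling', MinMaxScaler()),   # rescales each column to the range [0, 1]
])

X_prepared = preprocessing.fit_transform(X_toy)
print(X_prepared)
```

The imputer runs first because `MinMaxScaler` ignores missing values when fitting and leaves them in place, while `SVC` cannot handle them at all.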


In [3]:
pipe.get_params()

{'memory': None,
 'steps': [('imputer', SimpleImputer()),
  ('scaling', MinMaxScaler()),
  ('svn', SVC())],
 'verbose': False,
 'imputer': SimpleImputer(),
 'scaling': MinMaxScaler(),
 'svn': SVC(),
 'imputer__add_indicator': False,
 'imputer__copy': True,
 'imputer__fill_value': None,
 'imputer__missing_values': nan,
 'imputer__strategy': 'mean',
 'imputer__verbose': 0,
 'scaling__copy': True,
 'scaling__feature_range': (0, 1),
 'svn__C': 1.0,
 'svn__break_ties': False,
 'svn__cache_size': 200,
 'svn__class_weight': None,
 'svn__coef0': 0.0,
 'svn__decision_function_shape': 'ovr',
 'svn__degree': 3,
 'svn__gamma': 'scale',
 'svn__kernel': 'rbf',
 'svn__max_iter': -1,
 'svn__probability': False,
 'svn__random_state': None,
 'svn__shrinking': True,
 'svn__tol': 0.001,
 'svn__verbose': False}
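The keys in this output follow the `<step name>__<parameter>` convention, and those same names are what a search or `set_params` expects. A short sketch, reusing the step names from the pipeline above (the values `'median'` and `10.0` are arbitrary choices for illustration):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaling', MinMaxScaler()),
    ('svn', SVC()),
])

# Address a parameter of a step as "<step name>__<parameter name>"
pipe.set_params(imputer__strategy='median', svn__C=10.0)

strategy = pipe.get_params()['imputer__strategy']
c_value = pipe.named_steps['svn'].C
print(strategy, c_value)
```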

In [6]:
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats


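# Hyperparameter search space: lists are sampled uniformly, C is drawn from a log-uniform distribution
param_distributions = {
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'svn__C': stats.loguniform(0.001, 1000),
    'svn__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
}

# Instantiate the randomized search; precision is the metric that penalizes false positives
random_search = RandomizedSearchCV(pipe, param_distributions,
                                   scoring='precision', n_iter=100,  # number of draws
                                   cv=5,
                                   n_jobs=-1)

# Fit the search on the features and the target, then keep the best pipeline
random_search.fit(data.drop(columns="target_5y"), data['target_5y'])
tuned_pipe = random_search.best_estimator_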
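After fitting, the search object exposes the winning combination and its cross-validated score through `best_params_` and `best_score_`. The sketch below shows how to read them; it runs on synthetic data from `make_classification` as a stand-in for the player data, and uses a smaller search (`n_iter=5`, `cv=3`, fewer candidate values) so that it finishes quickly. Those settings and `random_state=0` are assumptions for the example, not the notebook's values.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the player data (assumption, not the real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaling', MinMaxScaler()),
    ('svn', SVC()),
])

search = RandomizedSearchCV(
    pipe,
    {
        'imputer__strategy': ['mean', 'median'],
        'svn__C': stats.loguniform(0.001, 1000),
        'svn__kernel': ['linear', 'rbf'],
    },
    scoring='precision',
    n_iter=5,        # reduced number of draws to keep the example fast
    cv=3,
    random_state=0,  # makes the random draws reproducible
)
search.fit(X, y)

best_params = search.best_params_  # winning value for each searched parameter
best_score = search.best_score_    # mean cross-validated precision of that combination
print(best_params)
print(best_score)
```

The full table of sampled combinations and their scores is available in `search.cv_results_`, which can be passed to `pd.DataFrame` for inspection.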

In [7]:
import pickle

# Export pipeline as pickle file
with open("pipeline.pkl", "wb") as file:
    pickle.dump(tuned_pipe, file)
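To check that an exported pipeline is usable, it can be loaded back with `pickle.load` and asked for predictions. The sketch below is a self-contained round trip under stated assumptions: it fits an untuned pipeline on synthetic data and writes to a temporary directory, so it does not touch the `pipeline.pkl` produced above.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the player data (assumption, not the real dataset)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

fitted = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaling', MinMaxScaler()),
    ('svn', SVC()),
]).fit(X, y)

# Write the fitted pipeline to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
with open(path, "wb") as file:
    pickle.dump(fitted, file)
with open(path, "rb") as file:
    loaded = pickle.load(file)

# The reloaded pipeline should reproduce the original predictions
same_predictions = (loaded.predict(X) == fitted.predict(X)).all()
print(same_predictions)
```

A pickled pipeline is only reliably loadable with the same scikit-learn version that created it, and pickle files from untrusted sources should never be loaded because unpickling can execute arbitrary code.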