## Imports

In [1]:
import pandas as pd

from scipy.stats import uniform

from sklearn import set_config; set_config(display='diagram')

from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Tuning Pipeline

👇 Consider the following dataset.

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/tuning_pipeline_data.csv")
data.head()

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers,target_5y
0,36.0,27.4,7.4,2.6,7.6,,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,35.0,26.9,,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,58.0,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,48.0,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


- Each observation represents a player
- Each column represents a characteristic of a player's performance

The target `target_5y` indicates whether the player's professional career lasted less than 5 years (`0`) or 5 years or more (`1`).

In [3]:
X = data.drop(columns="target_5y")
y = data['target_5y']
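
The table above already contains empty cells, which is why the pipeline needs an imputing step. As a minimal sketch, the snippet below rebuilds a four-row excerpt of two columns from that table and counts the missing values per feature and the share of each class. The `toy` frame and the `*_toy` variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Four-row excerpt of two feature columns and the target, copied from the head above
toy = pd.DataFrame({
    "games played": [36.0, 35.0, np.nan, 58.0],
    "points per game": [7.4, np.nan, 5.2, 5.7],
    "target_5y": [0, 0, 0, 1],
})

# Same feature/target split as in the notebook
X_toy = toy.drop(columns="target_5y")
y_toy = toy["target_5y"]

# Number of missing cells in each feature column
missing_per_column = X_toy.isna().sum()

# Fraction of rows belonging to each class
class_share = y_toy.value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

Running the same two checks on the full `X` and `y` would show how much imputation is needed and whether the classes are balanced.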

## Pipeline

👇 We are giving you the simple pipeline below.

In [4]:
# Preprocessing pipe
preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaling', MinMaxScaler())
])

# Final pipe
pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('model_svm', SVC())
])

pipe

## Fine-Tuning

Our task is to assist in the recruitment process of promising young players.  
The model should **limit false alarms as much as possible** to avoid recruiting players that will flop.
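
A "false alarm" here is a player predicted to last 5 years or more who does not, that is, a false positive. The metric that penalizes false positives is precision, TP / (TP + FP). The sketch below computes it by hand and with `precision_score` on made-up labels, which are assumptions for illustration and not predictions from this notebook.

```python
from sklearn.metrics import precision_score

# Made-up labels for illustration only (1 = lasted 5+ years, 0 = did not)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# True positives: predicted 1 and actually 1
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
# False positives (false alarms): predicted 1 but actually 0
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

manual_precision = tp / (tp + fp)
sk_precision = precision_score(y_true, y_pred)

print(manual_precision, sk_precision)
```

The missed player at index 3 (a false negative) does not lower precision; it would only affect recall.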

❓ **Fine-tune this pipeline to maximize your objective**

- Use the `scoring` metric appropriate for the task
- Do a (randomized) search for the optimal
    - imputing `strategy`
    - `kernel`
    - regularization factor `C`
- Store your randomized search results in a `search` variable
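
One possible shape for such a randomized search is sketched below. It runs on a synthetic dataset from `make_classification` with artificially injected missing values, so it does not reproduce any result from this notebook. The names `X_demo`, `y_demo`, `pipe_demo` and `search_demo`, the choice of `n_iter=10`, and the range for `C` are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the player data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject roughly 5% missing values

# Same step names as the notebook's pipeline, so parameter names match
pipe_demo = Pipeline([
    ("preprocessing", Pipeline([
        ("imputer", SimpleImputer()),
        ("scaling", MinMaxScaler()),
    ])),
    ("model_svm", SVC()),
])

# Lists are sampled uniformly; scipy distributions are sampled via .rvs()
param_distributions = {
    "preprocessing__imputer__strategy": ["mean", "median", "most_frequent"],
    "model_svm__kernel": ["linear", "poly", "rbf", "sigmoid"],
    "model_svm__C": uniform(0.1, 10),  # uniform(loc, scale) covers [0.1, 10.1]
}

search_demo = RandomizedSearchCV(
    pipe_demo,
    param_distributions,
    n_iter=10,            # number of sampled candidates (illustrative choice)
    cv=5,
    scoring="precision",  # penalizes false alarms
    random_state=0,
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(search_demo.best_score_)
```

Unlike a grid, the randomized search can draw `C` from a continuous range, and its cost is fixed by `n_iter` rather than by the size of the search space.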

In [11]:
pipe.get_params()['steps']

[('preprocessing',
  Pipeline(steps=[('imputer', SimpleImputer()), ('scaling', MinMaxScaler())])),
 ('model_svm', SVC())]
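
The step names shown above determine the keys of the search space: scikit-learn addresses a nested parameter by joining step names and the parameter name with a double underscore. The following sketch rebuilds the same pipeline and filters `get_params()` for the three parameters to tune. The `wanted` tuple and `tunable` list are illustrative helpers, not part of the notebook.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Same structure and step names as the notebook's pipeline
preprocessor = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaling", MinMaxScaler()),
])
pipe = Pipeline([
    ("preprocessing", preprocessor),
    ("model_svm", SVC()),
])

# Keep only the parameter names that end with one of the suffixes of interest
wanted = ("strategy", "kernel", "__C")
tunable = sorted(name for name in pipe.get_params() if name.endswith(wanted))
print(tunable)

# The same names work with set_params, which is what the search uses internally
pipe.set_params(model_svm__C=10, preprocessing__imputer__strategy="median")
```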

In [30]:
set_config(display='text')

from sklearn.model_selection import GridSearchCV

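# Search space: 3 imputing strategies x 4 kernels x 3 values of C = 36 candidates
params = {
    'preprocessing__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'model_svm__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'model_svm__C': [0.1, 1, 10]
}

# Exhaustive grid search, scored on precision to penalize false alarms
search = GridSearchCV(pipe, params, cv=5, scoring='precision', n_jobs=-1)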
search.fit(X,y)
search.cv_results_

{'mean_fit_time': array([0.05637379, 0.06320896, 0.05491695, 0.04826846, 0.05122166,
        0.05239687, 0.05588264, 0.06366973, 0.05810452, 0.08375454,
        0.08602419, 0.09997406, 0.04039888, 0.04207692, 0.05820022,
        0.07356806, 0.06337557, 0.06769419, 0.05782585, 0.0600224 ,
        0.05818567, 0.06891131, 0.06924429, 0.07720795, 0.04441562,
        0.04592185, 0.05425797, 0.16605763, 0.15395627, 0.15969462,
        0.07027488, 0.06747179, 0.07024384, 0.07569847, 0.07195215,
        0.0661315 ]),
 'std_fit_time': array([0.01931638, 0.01745383, 0.01292715, 0.00339254, 0.00334023,
        0.00230977, 0.00360721, 0.0071021 , 0.00109484, 0.00234432,
        0.00086562, 0.01640605, 0.0012035 , 0.00157097, 0.0207311 ,
        0.01282383, 0.00220079, 0.00697718, 0.00397391, 0.00503809,
        0.0004998 , 0.00113604, 0.00091657, 0.00911401, 0.00134561,
        0.00067483, 0.00501408, 0.03372987, 0.02225147, 0.0101268 ,
        0.00621247, 0.00582465, 0.00686071, 0.00421368, 0.003

In [26]:
best_model = search.best_estimator_

In [29]:
# Note: .score() on a classifier returns mean accuracy (here on the training data), not precision
best_model.score(X, y)

0.7119402985074627
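
The value above is accuracy measured on the same data the model was refit on, so it is neither the metric being optimized nor an out-of-sample estimate. The sketch below contrasts the two quantities on a synthetic dataset; `X_demo`, `y_demo` and the plain `SVC()` are stand-ins chosen for illustration. With a fitted search object, the cross-validated precision of the best candidate is also available as `search.best_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the player dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = SVC().fit(X_demo, y_demo)

# .score() on a classifier is mean accuracy on the data passed in
train_accuracy = model.score(X_demo, y_demo)
same_as_accuracy = accuracy_score(y_demo, model.predict(X_demo))

# Out-of-sample precision: one value per fold, each computed on held-out data
cv_precision = cross_val_score(SVC(), X_demo, y_demo, cv=5, scoring="precision")

print(train_accuracy, same_as_accuracy)
print(cv_precision.mean())
```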

In [31]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'solution',
    scoring = search.scoring,
    cv = search.cv,
    mean_test_score = search.cv_results_['mean_test_score']
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/andreza/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/andreza/code/andrezaumbelino/05-ML/06-Workflow/data-tuning-pipeline/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_solution.py::TestSolution::test_cv_results PASSED                   [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master



## Export

Once you have built your optimal pipeline, export it as a pickle file.

In [34]:
import joblib
joblib.dump(search.best_estimator_, 'melhor_modelo.pkl')


['melhor_modelo.pkl']

In [35]:
modelo_carregado = joblib.load('melhor_modelo.pkl')
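
A quick way to confirm that an exported model survives the round trip is to compare predictions before and after reloading. The sketch below does this with a small model trained on synthetic data and a temporary directory, so it leaves no file behind. The file name `demo_model.pkl` and the variable names are hypothetical and unrelated to the file saved above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data and model (assumption: not the tuned pipeline)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC().fit(X_demo, y_demo)

# Dump to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo_model.pkl"
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded estimator should predict exactly like the original
same_predictions = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that produced them, so recording the version alongside the file is a sensible habit.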

🏁 Congratulations! Don't forget to add, commit and push your notebook.