# Pipeline Building

This notebook builds the pipeline that takes the data to be predicted, applies the required transformations, and then applies the chosen ML model.

> This code must be placed in the src folder, adapted, and run from there to produce a working pipeline for use with Streamlit.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

from feature_engine.selection import DropFeatures

import src.Preprocessors.preprocessors as pp

import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline, FeatureUnion

import joblib

In [2]:
# Load the data
df = pd.read_csv('data_mobile_price_range.csv')

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['price_range'], axis=1), 
    df['price_range'], 
    test_size=0.1, 
    random_state=0,
)

### Starting pipeline building

All the steps carried out in the **EDA** are recreated on the training set to obtain a pipeline that performs both the preprocessing and the prediction with the model.

#### Preprocessing pipeline

In [4]:
features_non_zero = ['battery_power', 'clock_speed', 'int_memory', 'm_dep',
       'mobile_wt', 'n_cores', 'px_height', 'px_width', 'ram', 'sc_h',
       'sc_w', 'talk_time']

In [5]:
train_set = pd.concat([X_train, y_train], axis=1) # Temporarily concatenate features and target so rows are filtered together

In [6]:
train_set = train_set[(train_set[features_non_zero] > 0).all(axis=1)]

In [7]:
train_set = train_set[(train_set['px_width']/train_set['px_height']) <= 20]

In [8]:
X_train = train_set.drop(['price_range'], axis=1)
y_train = train_set['price_range']
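
The two row filters above can be tried out on a small toy frame. This is a minimal sketch: the values are invented for illustration, and only three of the original columns are used.

```python
import pandas as pd

# Toy data (invented values); column names are taken from the notebook
toy = pd.DataFrame({
    "px_width":  [1000, 500, 2100, 800],
    "px_height": [500, 0, 100, 400],
    "sc_w":      [5, 3, 4, 0],
})
non_zero_cols = ["px_width", "px_height", "sc_w"]

# Keep only rows where every listed column is strictly positive
positive = toy[(toy[non_zero_cols] > 0).all(axis=1)]

# Keep only rows whose width/height ratio is at most 20
filtered = positive[(positive["px_width"] / positive["px_height"]) <= 20]

print(list(positive.index))
print(list(filtered.index))
```

Note that in this notebook the filters are applied only to the training set; the test set is passed to the pipeline unfiltered.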

In [9]:
prange_pp_pipeline = Pipeline(
    [('screen_features', pp.FeatureCreator()),
     ('drop_used_features', DropFeatures(features_to_drop=['px_width','px_height','sc_w', 'sc_h'])),
     ('drop_features_less_important', DropFeatures(features_to_drop=['wifi', 'touch_screen', 'four_g', 'dual_sim', 'blue', 'three_g'])),
     ('scaler', MinMaxScaler()),
     ]
)
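
The source of `pp.FeatureCreator` is not shown in this notebook. The sketch below is a hypothetical stand-in that only illustrates the general shape of a custom scikit-learn transformer: the class name `ScreenFeatureCreator` and both formulas (`num_pix` as the product of the pixel dimensions, `aspect_ratio` as `sc_w / sc_h`) are assumptions inferred from the column names, not the actual implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ScreenFeatureCreator(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.FeatureCreator (formulas are assumptions)."""

    def fit(self, X, y=None):
        # Nothing to learn: the new features are row-wise computations
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["num_pix"] = X["px_height"] * X["px_width"]   # assumed definition
        X["aspect_ratio"] = X["sc_w"] / X["sc_h"]        # assumed definition
        return X


toy = pd.DataFrame({
    "px_height": [500, 1000],
    "px_width": [1000, 2000],
    "sc_w": [5, 6],
    "sc_h": [10, 12],
})
out = ScreenFeatureCreator().fit_transform(toy)
print(out[["num_pix", "aspect_ratio"]])
```

Because the transformer returns a DataFrame with the original columns still present, a later `DropFeatures` step can remove the raw columns by name, as the pipeline above does.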

In [10]:
prange_pp_pipeline.fit(X_train)

In [11]:
new_columns = ['battery_power', 'clock_speed', 'fc', 'int_memory', 'm_dep',
       'mobile_wt', 'n_cores', 'pc', 'ram', 'talk_time', 'num_pix',
       'aspect_ratio']

In [12]:
# Pipeline applied to the training set
X_train = pd.DataFrame(
    prange_pp_pipeline.transform(X_train)
)

# Pipeline applied to the test set; note that it is the same pipeline, already fitted on the training set
X_test = pd.DataFrame(
    prange_pp_pipeline.transform(X_test)
)

In [13]:
X_train_show = X_train.copy()
X_train_show.columns = new_columns
X_train_show

| | battery_power | clock_speed | fc | int_memory | m_dep | mobile_wt | n_cores | pc | ram | talk_time | num_pix | aspect_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333556 | 0.00 | 0.631579 | 0.983871 | 0.666667 | 0.825000 | 1.000000 | 0.90 | 0.932122 | 0.277778 | 0.695425 | 0.042781 |
| 1 | 0.122995 | 0.00 | 0.421053 | 0.177419 | 0.777778 | 0.016667 | 0.857143 | 0.75 | 0.789417 | 0.000000 | 0.335459 | 0.462745 |
| 2 | 0.750668 | 0.84 | 0.000000 | 0.903226 | 1.000000 | 1.000000 | 0.285714 | 0.55 | 0.265901 | 0.888889 | 0.685527 | 0.285068 |
| 3 | 0.647059 | 0.00 | 0.157895 | 0.903226 | 0.444444 | 0.583333 | 0.857143 | 0.40 | 0.351416 | 0.000000 | 0.808124 | 0.993080 |
| 4 | 0.863636 | 0.88 | 0.631579 | 0.677419 | 0.666667 | 0.791667 | 0.571429 | 0.85 | 0.680652 | 0.222222 | 0.243329 | 0.470588 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1577 | 0.483289 | 0.44 | 0.473684 | 0.500000 | 1.000000 | 0.641667 | 0.000000 | 0.90 | 0.946018 | 0.611111 | 0.072469 | 0.500000 |
| 1578 | 0.439171 | 0.08 | 0.052632 | 0.435484 | 0.666667 | 0.358333 | 0.142857 | 1.00 | 0.343666 | 0.722222 | 0.140804 | 0.532872 |
| 1579 | 0.460561 | 0.60 | 0.000000 | 0.612903 | 0.111111 | 0.108333 | 0.571429 | 0.05 | 0.896312 | 0.388889 | 0.591141 | 0.542986 |
| 1580 | 0.461230 | 0.76 | 0.105263 | 0.177419 | 0.888889 | 0.741667 | 0.000000 | 0.35 | 0.206307 | 0.333333 | 0.080380 | 0.420168 |


#### Prediction pipeline

In [14]:
clf_rf = RandomForestClassifier()

In [15]:
# Use the same parameters with which the Random Forest model was previously trained in PyCaret
model_params = {'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 123,
 'verbose': 0,
 'warm_start': False}

In [16]:
clf_rf.set_params(**model_params)
clf_rf.fit(X_train.values, y_train.values)

In [17]:
y_pred = clf_rf.predict(X_test.values)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("AUC:", metrics.roc_auc_score(y_test, clf_rf.predict_proba(X_test.values), multi_class='ovr'))
print("precision_macro:", metrics.precision_score(y_test, y_pred, average='macro'))

Accuracy: 0.87
AUC: 0.9802227083990889
precision_macro: 0.8684048268804366


### Saving the final pipeline

To do this, the preprocessing pipeline and the model are chained sequentially.

In [18]:
final_pipeline = Pipeline([
    ('preprocessing', prange_pp_pipeline),
    ('Model', clf_rf),
    ])

final_pipeline

In [19]:
joblib.dump(final_pipeline, './src/Pipelines/prange_ml_pipeline.joblib')

['./src/Pipelines/prange_ml_pipeline.joblib']
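
The saved file can later be restored with `joblib.load`. The sketch below shows the round trip on a toy pipeline with random data and a temporary file; the file name `toy_pipeline.joblib` and the data are assumptions made for illustration, not part of the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data: 40 samples, 3 features, binary target (invented for illustration)
rng = np.random.default_rng(0)
X = rng.random((40, 3))
y = (X[:, 0] > 0.5).astype(int)

# Small pipeline: scaling followed by a random forest
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
]).fit(X, y)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same = np.array_equal(pipe.predict(X), restored.predict(X))
print(same)
```

When a pipeline contains a custom class such as `pp.FeatureCreator`, the module that defines it must be importable under the same path in the environment that loads the file, because joblib relies on pickle and stores only a reference to the class.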