# Assignment 9: scikit-learn in parallel

This notebook compares the execution time of a random forest grid search run sequentially against the same search run in parallel.

The goal is to forecast the tip proportion (tip amount divided by fare amount) left by a taxi passenger.

## Packages

In [1]:
from dask.distributed import Client
from dask import delayed
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from datetime import datetime
import numpy as np
import pandas as pd
import math
import sys

### Dask package - Grid Search

In [2]:
!{sys.executable} -m pip install dask-searchcv

You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
from dask_searchcv import GridSearchCV as DaskGridSearchCV

## Dask client

In [4]:
client = Client("scheduler:8786")

## Data preprocessing

New variables for the forecast are created and the dataset is preprocessed.

In [5]:
datos = pd.read_csv('/data/trips.csv')
datos['tip_prop'] = datos.tip_amount/datos.fare_amount
datos['tpep_dropoff_datetime'] = datos.tpep_dropoff_datetime\
.apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
datos['tpep_pickup_datetime'] = datos.tpep_pickup_datetime\
.apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
datos['car'] = 0
datos.loc[datos.car_type == 'A', 'car'] = 1
datos.loc[datos.car_type == 'B', 'car'] = 2
diff = []

The dataset is checked for null values.

In [6]:
datos.isnull().any()

car_type                 False
fare_amount              False
passenger_count          False
taxi_id                  False
tip_amount               False
tpep_dropoff_datetime    False
tpep_pickup_datetime     False
trip_distance            False
tip_prop                  True
car                      False
dtype: bool

There are null values in the tip proportion.

Since there are only 3 such rows, they are dropped.
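One plausible source of these nulls is a trip with both a zero tip and a zero fare, since `0/0` yields `NaN`. The sketch below uses hypothetical toy rows (not taken from `trips.csv`) to show that case, and also that a nonzero tip over a zero fare yields `inf`, which `isnull()` does not flag. Whether the real data contains such rows is an assumption worth checking.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "tip_amount":  [2.0, 0.0, 1.5],
    "fare_amount": [10.0, 0.0, 0.0],
})
toy["tip_prop"] = toy.tip_amount / toy.fare_amount

# 0/0 becomes NaN and is caught by isnull(); 1.5/0 becomes inf and is not
print(toy.tip_prop.isnull().tolist())
print(np.isinf(toy.tip_prop).tolist())

# Treat infinities as missing, then drop every row without a usable target
clean = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=["tip_prop"])
print(len(clean))
```

If infinite values were present, they would need to be removed as well before fitting the forest.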

In [7]:
nulos = datos.loc[pd.isnull(datos.tip_prop)].index.tolist()
len(nulos)

3

In [8]:
datos = datos.drop(nulos).reset_index(drop=True)

### Trip duration

In [9]:
for i in range(len(datos)):
    aux = (datos.tpep_dropoff_datetime[i] - datos.tpep_pickup_datetime[i])
    diff.append(aux.seconds/60)

datos['time_diff'] = diff
datos['hour'] = datos.tpep_pickup_datetime.apply(lambda x: x.hour)
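The row-by-row loop above can also be written in vectorized form. The following is a minimal sketch on hypothetical toy timestamps, assuming the same column names and date format as the notebook. It uses `total_seconds()` instead of `.seconds`; the two agree for ordinary trips but differ when a duration is negative or reaches a full day, because `.seconds` only holds the sub-day component.

```python
import pandas as pd

# Hypothetical toy timestamps in the same format as the notebook's columns
toy = pd.DataFrame({
    "tpep_pickup_datetime":  ["2018-01-01 08:00:00", "2018-01-01 23:50:00"],
    "tpep_dropoff_datetime": ["2018-01-01 08:12:30", "2018-01-02 00:05:00"],
})

# Parse both columns at once instead of applying strptime per row
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d %H:%M:%S")

# Trip duration in minutes and pickup hour, computed column-wise
elapsed = toy.tpep_dropoff_datetime - toy.tpep_pickup_datetime
toy["time_diff"] = elapsed.dt.total_seconds() / 60
toy["hour"] = toy.tpep_pickup_datetime.dt.hour

print(toy[["time_diff", "hour"]])
```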

In [10]:
base = datos.drop(['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'tip_prop', 'car_type'], axis = 1)

## Training and test sets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(base, datos.tip_prop, test_size=0.3, random_state=109649)

Random Forest

In [15]:
Regr = RandomForestRegressor(bootstrap=True, n_jobs=-1)
parameters = {'n_estimators': [3, 5, 10, 15], 'max_depth':[4, 5, 6, 10], 'min_samples_split':[2, 3, 4, 7]}
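To get a sense of the workload, the size of this grid can be counted directly. The sketch below enumerates the combinations with the standard library and multiplies by the number of folds; `cv = 4` is the value passed to both searches in this notebook.

```python
from itertools import product

parameters = {
    'n_estimators': [3, 5, 10, 15],
    'max_depth': [4, 5, 6, 10],
    'min_samples_split': [2, 3, 4, 7],
}
cv = 4  # number of cross-validation folds used in the searches

# Every combination of one value per hyperparameter
candidates = list(product(*parameters.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per fold
n_fits = n_candidates * cv

print(n_candidates, n_fits)
```

With the default `refit=True`, the search fits the best candidate one more time on the full training set after cross-validation.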

Sequential search

In [13]:
predict_model_seq = GridSearchCV(Regr, parameters, cv=4)
%time predict_model_seq.fit(X_train, y_train)
print(predict_model_seq.best_params_)
predict_model_seq = predict_model_seq.best_estimator_

CPU times: user 18.7 s, sys: 1.97 s, total: 20.7 s
Wall time: 1min 23s
{'max_depth': 10, 'n_estimators': 10, 'min_samples_split': 7}


Processing time: 1 minute 23 seconds (wall time)

Parallel search

In [16]:
predict_model_paral = DaskGridSearchCV(Regr, parameters, cv=4)
%time predict_model_paral.fit(X_train, y_train)

print(predict_model_paral.best_params_)
predict_model_paral=predict_model_paral.best_estimator_


CPU times: user 30 ms, sys: 80 ms, total: 110 ms
Wall time: 23.3 s
{'max_depth': 10, 'n_estimators': 10, 'min_samples_split': 2}


Processing time: 23.3 seconds (wall time)
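The two wall times reported above can be turned into a speedup factor. This is a small arithmetic sketch using only those two figures; it says nothing about how many workers the cluster had, which the notebook does not state.

```python
# Wall times reported by %time in the two search cells
seq_seconds = 1 * 60 + 23   # 1 min 23 s
par_seconds = 23.3

# Speedup of the Dask search relative to the scikit-learn search
speedup = seq_seconds / par_seconds
print(f"{speedup:.2f}x")
```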

## Forecast results

In [17]:
Test1 = predict_model_seq.predict(X_test).tolist()
Test2 = predict_model_paral.predict(X_test).tolist()

In [18]:
base_pronostico = pd.DataFrame({'Mod_Seq':Test1, 'Mod_Par':Test2, 'Real':y_test.tolist()})
base_pronostico.head()

    Mod_Par   Mod_Seq      Real
0  0.229280  0.229213  0.235714
1  0.241334  0.243250  0.232000
2  0.140394  0.138464  0.109091
3  0.000004  0.000006  0.000000
4  0.062066  0.058180  0.038462


In [19]:
mean_squared_error(base_pronostico.Mod_Seq, base_pronostico.Real)

0.0038645379358035544

In [20]:
mean_squared_error(base_pronostico.Mod_Par, base_pronostico.Real)

0.003405549879579453
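Because MSE is expressed in squared units, taking the square root gives an error on the same scale as the tip proportion itself. The sketch below does this for the two values printed above and picks the model with the lower error; the dictionary keys simply reuse the column names of `base_pronostico`.

```python
import math

# MSE values printed by the two cells above
mse = {
    "Mod_Seq": 0.0038645379358035544,
    "Mod_Par": 0.003405549879579453,
}

# RMSE is in the same units as tip_prop (a proportion of the fare)
rmse = {name: math.sqrt(value) for name, value in mse.items()}

# Lower error means a better fit on the test set
best = min(mse, key=mse.get)

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.4f}")
print("lowest error:", best)
```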

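In this case the parallel search is not only faster (23.3 s versus 1 min 23 s of wall time) but also slightly more accurate on the test set: its MSE is 0.00341 versus 0.00386 for the sequential search. Since both searches cover the same grid and no `random_state` is set on the forest, the gap likely reflects run-to-run randomness rather than the search strategy itself.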