## Exercise/Assignment
### Daniel Sharp 138176

Take advantage of Dask's ability to compute in parallel to fit a model that predicts the tip proportion of a trip. Run a grid search over hyperparameters with cross-validation. You may use scikit-learn functions. Remember to use the `delayed` decorator to run in parallel.

* How fast is the parallel search compared with a sequential search in Python?

**Python**  
We need to one-hot encode the categorical variables (car type), and it may be worth doing some feature engineering to derive other variables, such as month of the year, day of the week, or even period of the day (early morning, morning, afternoon, night).

In [1]:
from dask.distributed import Client
from dask import delayed
import dask.dataframe as dd
import pandas as pd
client = Client("scheduler:8786")
import time

In [2]:
trips_df = dd.read_csv("/data/trips.csv")
trips_df.tpep_pickup_datetime = trips_df.tpep_pickup_datetime.astype('M8[us]')
trips_df.tpep_dropoff_datetime = trips_df.tpep_dropoff_datetime.astype('M8[us]')
trips_df.head()

Unnamed: 0,car_type,fare_amount,passenger_count,taxi_id,tip_amount,tpep_dropoff_datetime,tpep_pickup_datetime,trip_distance
0,A,22.0,1,1,4.6,2015-01-03 01:37:02,2015-01-03 01:17:32,6.9
1,A,9.0,1,1,0.0,2015-01-05 23:35:02,2015-01-05 23:25:15,1.81
2,A,7.5,1,1,1.0,2015-01-06 15:22:12,2015-01-06 15:11:45,0.96
3,A,8.5,1,1,1.0,2015-01-08 08:31:23,2015-01-08 08:22:12,1.9
4,A,7.5,1,1,1.66,2015-01-08 12:35:54,2015-01-08 12:26:26,1.0


In [3]:
from sklearn.base import TransformerMixin

Functions for feature engineering on the data:

In [4]:
# Extract a component (hour, day, day of week, month) from a timestamp column
class getFromTimestamp(TransformerMixin):

    def __init__(self, columns=None, what='hour'):
        # Avoid a mutable default argument
        self.columns = columns if columns is not None else []
        self.what = what

    def transform(self, df):
        # Compare strings with ==; `is` tests object identity, not equality
        if self.what == 'hour':
            for col in self.columns:
                df["hour"] = df[col].map(lambda d: d.hour)
        elif self.what == 'day':
            for col in self.columns:
                df["day"] = df[col].map(lambda d: d.day)
        elif self.what == 'dow':
            for col in self.columns:
                df["dow"] = df[col].map(lambda d: d.dayofweek)
        elif self.what == 'month':
            for col in self.columns:
                df["month"] = df[col].map(lambda d: d.month)
        else:
            print("option not available")
        # The column is added to df in place; return df so the transformer can be chained
        return df

    def fit(self, *_):
        return self

In [5]:
# Bucket an hour-of-day column into four periods of the day
class hourBuckets(TransformerMixin):

    def __init__(self, column=None):
        self.column = column

    def transform(self, df):
        colname = self.column + "_buck"

        def func(x):
            # 1: early morning (0-6), 2: morning (7-12), 3: afternoon (13-18), 4: night (19-23)
            if x < 7:
                return 1
            elif x < 13:
                return 2
            elif x < 19:
                return 3
            else:
                return 4

        df[colname] = df[self.column].astype(int).map(func)
        # The column is added to df in place; return df so the transformer can be chained
        return df

    def fit(self, *_):
        return self
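
As a cross-check of the bucket boundaries, the same mapping can be sketched with `pandas.cut` on a small hand-made series. The `hours` values are illustrative and not taken from the trips data; the bin edges are chosen so that the right-closed intervals reproduce the `< 7`, `< 13`, `< 19` thresholds for integer hours.

```python
import pandas as pd

# Illustrative hours around each threshold (not from the trips data)
hours = pd.Series([0, 6, 7, 12, 13, 18, 19, 23], name="hour")

# Right-closed bins: (-1, 6], (6, 12], (12, 18], (18, 23]
buckets = pd.cut(hours, bins=[-1, 6, 12, 18, 23], labels=[1, 2, 3, 4]).astype(int)

print(buckets.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]
```

On a Dask dataframe a function like this could be applied per partition, but that variant is not exercised here.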

In [6]:
# Add dummy (one-hot) columns for the given categorical columns
class make_dummies(TransformerMixin):

    def __init__(self, columns=None):
        # Avoid a mutable default argument
        self.columns = columns if columns is not None else []

    def transform(self, df):
        for c in self.columns:
            # categorize() makes the categories known so that Dask can build the dummies
            dummies = dd.get_dummies(df.categorize(columns=[c])[c], prefix=c, drop_first=True)
            for col in dummies.columns:
                df[col] = dummies[col]
        # The columns are added to df in place; return df so the transformer can be chained
        return df

    def fit(self, *_):
        return self
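
To illustrate what `drop_first=True` does, here is a minimal pandas sketch on a toy frame (the `toy` data is made up for illustration). Each categorical column loses its first level, so a column with only one level contributes no dummy column at all.

```python
import pandas as pd

# Toy data, invented for illustration: two car types, a single month
toy = pd.DataFrame({
    "car_type": pd.Categorical(["A", "B", "A", "B"]),
    "month": pd.Categorical([1, 1, 1, 1]),
    "fare_amount": [22.0, 9.0, 7.5, 8.5],
})

parts = [toy[["fare_amount"]]]
for col in ["car_type", "month"]:
    # drop_first=True drops the first level to avoid perfectly collinear columns
    parts.append(pd.get_dummies(toy[col], prefix=col, drop_first=True))

encoded = pd.concat(parts, axis=1)
print(encoded.columns.tolist())
```

Here `car_type` yields only `car_type_B`, and `month` yields nothing because it has a single level.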

Instantiate the transformers and run them to build the dataset on which I will fit the models:

In [7]:
h = getFromTimestamp(["tpep_dropoff_datetime"], "hour")
dow = getFromTimestamp(["tpep_dropoff_datetime"], "dow")
mon = getFromTimestamp(["tpep_dropoff_datetime"],"month")
hb = hourBuckets("hour")
md = make_dummies(['dow','hour_buck','month', 'car_type'])

In [8]:
h.fit(trips_df).transform(trips_df)
dow.fit(trips_df).transform(trips_df)
mon.fit(trips_df).transform(trips_df)
hb.fit(trips_df).transform(trips_df)
md.fit(trips_df).transform(trips_df)
trips_df = trips_df.assign(target = trips_df.tip_amount/trips_df.fare_amount)
trips_df = trips_df.drop(['car_type','taxi_id','tpep_dropoff_datetime','tpep_pickup_datetime','hour','hour_buck','month','dow','tip_amount'],axis=1)

In [9]:
trips_df.dtypes

fare_amount        float64
passenger_count      int64
trip_distance      float64
dow_0                uint8
dow_1                uint8
dow_3                uint8
dow_4                uint8
dow_6                uint8
dow_2                uint8
hour_buck_4          uint8
hour_buck_3          uint8
hour_buck_2          uint8
car_type_B           uint8
target             float64
dtype: object

In [10]:
trips_df[trips_df.isnull().any(axis=1)].compute()

Unnamed: 0,fare_amount,passenger_count,trip_distance,dow_0,dow_1,dow_3,dow_4,dow_6,dow_2,hour_buck_4,hour_buck_3,hour_buck_2,car_type_B,target
3276,0.0,5,0.23,0,1,0,0,0,0,0,0,1,0,
4050,0.0,2,13.4,0,0,0,0,0,0,0,0,0,1,
5739,0.0,2,4.8,0,0,0,0,0,1,1,0,0,1,


In [11]:
# Drop the three observations with NA (fare_amount is 0, so the target ratio is undefined)
trips_df = trips_df.dropna()
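
A small sketch of why the ratio can be missing, using invented values: dividing 0 by 0 gives `NaN`, while a positive tip over a zero fare gives `inf`, which `dropna()` alone would not remove. Replacing infinities with `NaN` first is one possible guard, shown here as an assumption rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# Invented rows: a normal trip, a 0/0 case, and a tip on a zero fare
toy = pd.DataFrame({"tip_amount": [1.0, 0.0, 2.0],
                    "fare_amount": [10.0, 0.0, 0.0]})
toy["target"] = toy["tip_amount"] / toy["fare_amount"]  # 0.1, NaN, inf

# Treat infinite ratios as missing before dropping incomplete rows
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()
print(clean)
```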

#### Hyper-parameters

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

In [13]:
params = {'max_iter' :[50, 100,200],
          'alpha':[0.001,0.01,0.1,1]}

In [14]:
clf = GridSearchCV(Lasso(), params, cv=10, scoring='neg_mean_squared_error',verbose=1, n_jobs=-1)
delayed_clf = delayed(clf)
X = trips_df.drop('target',axis=1)
y = trips_df['target']

In [15]:
%%time
res = delayed_clf.fit(X, y).compute()

CPU times: user 20.9 ms, sys: 0 ns, total: 20.9 ms
Wall time: 1.6 s
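
Note that `delayed(clf).fit(X, y)` builds a single task for the whole search, so any parallelism in that cell comes from `n_jobs=-1` inside scikit-learn rather than from Dask. A minimal sketch of the alternative, one `delayed` task per hyperparameter combination, is shown below. It runs on synthetic data (`X_demo`, `y_demo`) with the local threaded scheduler instead of the cluster, and `score_candidate` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import dask
from dask import delayed
from sklearn.linear_model import Lasso
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic stand-in for the trips data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([0.5, 0.0, -0.2, 0.0, 0.1]) + rng.normal(scale=0.1, size=500)

def score_candidate(candidate, X, y):
    """Cross-validate one hyperparameter combination; return it with its mean score."""
    scores = cross_val_score(Lasso(**candidate), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    return candidate, scores.mean()

grid = ParameterGrid({"max_iter": [50, 100, 200],
                      "alpha": [0.001, 0.01, 0.1, 1]})

# One delayed task per candidate, so the scheduler can run them concurrently
tasks = [delayed(score_candidate)(c, X_demo, y_demo) for c in grid]
results = dask.compute(*tasks, scheduler="threads")

best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

With a `Client` connected, dropping the `scheduler="threads"` argument would send the tasks to the cluster instead.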


In [16]:
res.grid_scores_



[mean: -0.01638, std: 0.00502, params: {'max_iter': 50, 'alpha': 0.001},
 mean: -0.01638, std: 0.00502, params: {'max_iter': 100, 'alpha': 0.001},
 mean: -0.01638, std: 0.00502, params: {'max_iter': 200, 'alpha': 0.001},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 50, 'alpha': 0.01},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 100, 'alpha': 0.01},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 200, 'alpha': 0.01},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 50, 'alpha': 0.1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 100, 'alpha': 0.1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 200, 'alpha': 0.1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 50, 'alpha': 1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 100, 'alpha': 1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 200, 'alpha': 1}]
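
The `grid_scores_` attribute was deprecated in scikit-learn 0.18 and removed in 0.20. On newer versions the same per-candidate summary can be read from `cv_results_`; the sketch below does this on synthetic data (`X_demo` and `y_demo` are illustrative stand-ins, not the trips data).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([0.5, 0.0, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

params = {"max_iter": [50, 100, 200], "alpha": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(Lasso(), params, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate, in the same order for every key
rows = zip(search.cv_results_["mean_test_score"],
           search.cv_results_["std_test_score"],
           search.cv_results_["params"])
for mean, std, candidate in rows:
    print(f"mean: {mean:.5f}, std: {std:.5f}, params: {candidate}")
```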

In [17]:
X = trips_df.drop('target',axis=1).compute()
y = trips_df['target'].compute()

In [18]:
%%time
res = clf.fit(X,y)

Fitting 10 folds for each of 12 candidates, totalling 120 fits
CPU times: user 222 ms, sys: 92.6 ms, total: 314 ms
Wall time: 1.92 s


[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    1.8s finished


In [19]:
res.grid_scores_



[mean: -0.01638, std: 0.00502, params: {'max_iter': 50, 'alpha': 0.001},
 mean: -0.01638, std: 0.00502, params: {'max_iter': 100, 'alpha': 0.001},
 mean: -0.01638, std: 0.00502, params: {'max_iter': 200, 'alpha': 0.001},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 50, 'alpha': 0.01},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 100, 'alpha': 0.01},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 200, 'alpha': 0.01},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 50, 'alpha': 0.1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 100, 'alpha': 0.1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 200, 'alpha': 0.1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 50, 'alpha': 1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 100, 'alpha': 1},
 mean: -0.01640, std: 0.00500, params: {'max_iter': 200, 'alpha': 1}]

Do the same as above, but using the Dask-ML library http://dask-ml.readthedocs.io/en/latest/ 

* How do the run times of your search compare with those of Dask-ML?

In [20]:
from dask_searchcv import GridSearchCV as GSCVDask
from dask_ml.linear_model import LinearRegression

In [21]:
params = {'max_iter' :[50, 100,200],
          'C':[0.001,0.01,0.1,1]}

**Using the `LinearRegression` estimator and Grid Search from Dask**  
For some reason it is very slow and takes more than 4 minutes to run:

In [22]:
%%time
clf_dask = GSCVDask(LinearRegression(penalty='l1'), params, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
X = trips_df.drop('target',axis=1).values
y = trips_df['target'].values
res_dask=clf_dask.fit(X,y)

CPU times: user 348 ms, sys: 29.5 ms, total: 378 ms
Wall time: 4min 13s


In [23]:
res_dask.best_score_

-0.03341257110650153

**Using scikit-learn's `Lasso` estimator and Grid Search from Dask**  
It is much faster than the previous one and comparable with scikit-learn's own grid search:

In [24]:
params = {'max_iter' :[50, 100,200],
          'alpha':[0.001,0.01,0.1,1]}

In [25]:
%%time
#clf_dask = GSCVDask(LinearRegression(penalty='l1'), params, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
clf_dask = GSCVDask(Lasso(), params, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
X = trips_df.drop('target',axis=1).values
y = trips_df['target'].values
res_dask=clf_dask.fit(X,y)

CPU times: user 49.7 ms, sys: 6.22 ms, total: 55.9 ms
Wall time: 1.09 s


In [26]:
res_dask.best_score_

-0.016381782765348213

**Bonus**

Do the same using Spark ML

* How do Spark's run times compare with Dask's?

I save the processed dataframe to CSV so that it can be run in Spark:

In [27]:
dd.to_csv(trips_df,'/data/')

['/data/0.part']

The Spark run is in the file 'trips_spark.ipynb'; below is a screenshot of the GridSearch execution in that notebook:  

![](spark.png)

From the exercise we can see that the fastest grid search was Dask's GridSearch with the scikit-learn estimator, which took about one second (1.09 s). The second fastest was scikit-learn's `GridSearchCV` wrapped in `delayed`, which ran the full search in just 1.6 s, followed by scikit-learn's `GridSearchCV` run directly, at 1.92 s (this baseline used `n_jobs=-1`, so it is not strictly sequential). Spark took about 20 seconds, far slower than the other Lasso runs. The slowest run overall was Dask's GridSearch with the Dask-ML `LinearRegression` estimator, at 4min 13s.
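
The timings compared here come from single `%%time` runs of one to two seconds, so differences of a few tenths of a second may be within run-to-run noise. A hedged sketch of repeating a measurement and reporting the median is given below; `time_repeated` is a hypothetical helper and the workload is a stand-in, not one of the notebook's searches.

```python
import statistics
import time

def time_repeated(fn, repeats=5):
    """Call fn() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload; in the notebook this could be e.g. lambda: clf.fit(X, y)
durations = time_repeated(lambda: sum(i * i for i in range(100_000)), repeats=5)
print(f"median {statistics.median(durations):.4f}s, min {min(durations):.4f}s")
```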