# Graded Assignment 2 - Group 1
---
<h3>1. OBJECTIVE</h3>

**Taxi fare prediction**<br>
The objective of this assignment is to build a machine learning model that can
predict the fare charged by a taxi from a given set of input features.

<h3>2. PACKAGES AND MODULES</h3>

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, average_precision_score, precision_recall_curve
from inspect import signature
from math import sqrt, sin, cos, asin, pi, log
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

import plotly.express as px
import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import os
import matplotlib.pyplot as plt
import ciso8601  # module providing a fast datetime parser

%matplotlib inline

<h3>3. SAMPLING</h3>

The dataset is a single CSV file containing about 55 million taxi trip records. Each record contains the following fields:
* **key**: string that uniquely identifies each record
* **pickup_datetime**: timestamp indicating when the trip started
* **pickup_longitude**: real number giving the **longitude** where the trip started
* **pickup_latitude**: real number giving the **latitude** where the trip started
* **dropoff_longitude**: real number giving the longitude where the trip ended
* **dropoff_latitude**: real number giving the latitude where the trip ended
* **passenger_count**: integer giving the number of passengers in the taxi
* **fare_amount: real number giving the cost of the trip. This is the variable to predict**

**Population** <br>
Loading all the data

In [3]:
%%time
df = pd.read_csv("train.csv") # approx 55M 
df.shape

Wall time: 2min 21s


(55423856, 8)
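
Loading the full file with default dtypes is memory-hungry. As a hedged sketch only (the notebook itself calls `read_csv` with defaults), `pd.read_csv` accepts a `dtype` mapping that can shrink the numeric columns. The `float32`/`uint8` choices and the two-row in-memory sample below are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (rows are illustrative only)
csv_text = """key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
a,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
b,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
"""

# Assumed dtypes: float32 uses half the memory of float64 per value
dtypes = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "passenger_count": "uint8",
}

sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(sample.dtypes)
```

Note the trade-off: `float32` keeps roughly 7 significant digits, which for coordinates near -74 degrees corresponds to a precision on the order of a metre.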

**Sample**<br>
The entire population is used

In [4]:
%%time
df_s = df

Wall time: 0 ns


In [6]:
df_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55423856 entries, 0 to 55423855
Data columns (total 8 columns):
 #   Column             Dtype  
---  ------             -----  
 0   key                object 
 1   fare_amount        float64
 2   pickup_datetime    object 
 3   pickup_longitude   float64
 4   pickup_latitude    float64
 5   dropoff_longitude  float64
 6   dropoff_latitude   float64
 7   passenger_count    int64  
dtypes: float64(5), int64(1), object(2)
memory usage: 3.3+ GB


<h3>4. LIMPIEZA</h3>

**REMOVE NULL VALUES AND THE KEY COLUMN**<br>
The full dataset has 376 rows with null drop-off coordinates `(dropoff_longitude, dropoff_latitude)`; we remove them with `dropna`

In [7]:
df_s.isna().sum()

key                    0
fare_amount            0
pickup_datetime        0
pickup_longitude       0
pickup_latitude        0
dropoff_longitude    376
dropoff_latitude     376
passenger_count        0
dtype: int64
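
The `dropna` cell itself does not appear in this export, although the row count reported next (55423480, down from 55423856) is consistent with 376 rows being removed. A minimal sketch of that step on a toy frame, assuming `dropna` is called with its default arguments:

```python
import numpy as np
import pandas as pd

# Toy frame with one row missing both drop-off coordinates
toy = pd.DataFrame({
    "dropoff_longitude": [-73.98, np.nan, -73.95],
    "dropoff_latitude": [40.75, np.nan, 40.78],
    "passenger_count": [1, 2, 1],
})

# With default arguments, dropna removes every row containing at least one NaN
cleaned = toy.dropna()
print(toy.shape, "->", cleaned.shape)
```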

In [10]:
df_s.shape

(55423480, 8)

In [11]:
df_s.drop(columns='key', inplace=True)

**GENERAL CLEANING**

In [22]:
%%time
def filter_data(dataframe):
    # Keep only the rows that satisfy every condition
    return dataframe[
                    # Discard illegal coordinates
                    (-180.0 <= dataframe["pickup_longitude"])&
                    (dataframe["pickup_longitude"] <= 180.0)&
                    (-90.0 <= dataframe["pickup_latitude"])&
                    (dataframe["pickup_latitude"] <= 90.0)&
                    (-180.0 <= dataframe["dropoff_longitude"])&
                    (dataframe["dropoff_longitude"] <= 180.0)&
                    (-90.0 <= dataframe["dropoff_latitude"])&
                    (dataframe["dropoff_latitude"] <= 90.0)&
                    #(df_s["pickup_longitude"] != df_s["dropoff_longitude"])&
                    # Fare amount
                    (2.0 <= dataframe["fare_amount"])&
                    (dataframe["fare_amount"] <= 100)&
                    # Passenger count
                    (1 <= dataframe["passenger_count"])&
                    (dataframe["passenger_count"] <= 6)]

print("Shape before general cleaning: ", df_s.shape)
# store the result of the general cleaning in data
data = filter_data(df_s)
print("Shape after general cleaning: ", data.shape)
print("Removed %d records" % (df_s.shape[0] - data.shape[0]))

Shape before general cleaning:  (55423480, 7)
Shape after general cleaning:  (55200088, 7)
Removed 223392 records
Wall time: 26.2 s


**SWAP THE COORDINATES OF POINTS SO THAT THEY FALL WITHIN THE VALID REGION**

In [23]:
def swap_coordinates (dataframe,
    city_limits = { 
        "lon_min":-76,
        "lon_max":-73,
        "lat_min":38,
        "lat_max":50} ):
    # Swap latitude and longitude for trips whose stored
    # (longitude, latitude) pairs fall in the region [38, 50] x [-76, -73],
    # i.e. were recorded in reversed order. The main working region
    # (New York City) is [-76, -73] x [38, 50].
    datap = dataframe
    city_interchange = ((datap["pickup_longitude"] > city_limits["lat_min"])&
                        (datap["pickup_longitude"] < city_limits["lat_max"])&
                        (datap["pickup_latitude"] > city_limits["lon_min"] )& #-74.252444 
                        (datap["pickup_latitude"] < city_limits["lon_max"] )& 

                        (datap["dropoff_longitude"] > city_limits["lat_min"])&
                        (datap["dropoff_longitude"] < city_limits["lat_max"])&
                        (datap["dropoff_latitude"] >  city_limits["lon_min"])&
                        (datap["dropoff_latitude"] <  city_limits["lon_max"]))
    print("Number of swapped records: ", city_interchange.sum())

    # Assignment through .loc aligns on column labels, so renaming the
    # columns before assigning swaps the values of the selected rows
    datap.loc[city_interchange] = datap.loc[city_interchange].rename(columns={
                                            'pickup_longitude':'pickup_latitude',
                                            'pickup_latitude':'pickup_longitude',
                                            'dropoff_latitude':'dropoff_longitude',
                                            'dropoff_longitude':'dropoff_latitude'})
    return datap

data = swap_coordinates(data)

Number of swapped records:  26684
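
The rename-then-assign idiom used in `swap_coordinates` relies on pandas aligning column labels during `.loc` assignment. A more explicit variant, sketched here on a toy frame as an alternative rather than the notebook's method, assigns a label-free NumPy array instead. Note that `Series.between` is inclusive by default, whereas the notebook uses strict inequalities.

```python
import pandas as pd

# Toy frame: the first row is stored in reversed (latitude, longitude) order
toy = pd.DataFrame({
    "pickup_longitude": [40.75, -73.98],
    "pickup_latitude": [-73.99, 40.76],
})

# Rows whose "longitude" looks like a New York latitude and vice versa
reversed_rows = (toy["pickup_longitude"].between(38, 50)
                 & toy["pickup_latitude"].between(-76, -73))

cols = ["pickup_longitude", "pickup_latitude"]
# .to_numpy() drops the column labels, so the values are written positionally
toy.loc[reversed_rows, cols] = toy.loc[reversed_rows, cols[::-1]].to_numpy()
print(toy)
```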


**DEFINE THE VALID REGION**

In [24]:
#long_border = (-74.03, -73.75)
#lat_border = (40.63, 40.85)
def filter_out_of_city(dataframe,city_limits = {"lon_min":-74.03 ,
                                                "lon_max":-73.75,
                                                "lat_min":40.63,
                                                "lat_max":40.85}):
    # Keep only the trips that start and end inside the city
    return dataframe[(city_limits["lon_min"]<= dataframe["pickup_longitude"])&
                    (dataframe["pickup_longitude"] <= city_limits["lon_max"])&
                    (city_limits["lat_min"]<= dataframe["pickup_latitude"])&
                    (dataframe["pickup_latitude"] <= city_limits["lat_max"])&
                    (city_limits["lon_min"] <= dataframe["dropoff_longitude"])&
                    (dataframe["dropoff_longitude"] <= city_limits["lon_max"])&
                    (city_limits["lat_min"]<= dataframe["dropoff_latitude"])&
                    (dataframe["dropoff_latitude"] <= city_limits["lat_max"])]

print("Shape before region filtering: ", data.shape)
data = filter_out_of_city(data)
print("Shape after region filtering: ", data.shape)
print("Removed %d records in total so far" % (df_s.shape[0] - data.shape[0]))

Shape before region filtering:  (55200088, 7)
Shape after region filtering:  (53383845, 7)
Removed 2039635 records in total so far


<h3>5. FEATURE ENGINEERING</h3>

In [27]:
def isInside( column_lat ,column_lon ,region):
    return (column_lat>=region['min_lat'])&(column_lat<=region['max_lat'])&(column_lon>=region['min_long'])&(column_lon<=region['max_long'])    

**FLAG TRIPS STARTING OR ENDING AT NEW YORK AIRPORTS**

In [28]:
JFK={'min_long':-73.8352,'min_lat':40.6195,'max_long':-73.7401, 'max_lat':40.6659}
EWR={'min_long':-74.1925,'min_lat':40.6700, 'max_long':-74.1531, 'max_lat':40.7081}
LG={'min_long':-73.8895, 'min_lat':40.7664,'max_long':-73.8550,'max_lat':40.7931}

In [29]:
%%time
data['pickup_airport'] = isInside(data.pickup_latitude, data.pickup_longitude , JFK)|isInside(data.pickup_latitude, data.pickup_longitude ,EWR) | isInside(data.pickup_latitude, data.pickup_longitude ,LG)
data['dropoff_airport'] = isInside(data.dropoff_latitude, data.dropoff_longitude , JFK) |isInside(data.dropoff_latitude, data.dropoff_longitude ,EWR) |isInside(data.dropoff_latitude, data.dropoff_longitude ,LG)

Wall time: 1.69 s


In [30]:
data['pickup_airport'] = data['pickup_airport'].astype(int)
data['dropoff_airport'] = data['dropoff_airport'].astype(int)
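
To make the bounding-box flag concrete, here is a small worked example that reuses the notebook's `isInside` logic and JFK box. The two test points are illustrative assumptions: one near the JFK terminals and one in Midtown Manhattan.

```python
import pandas as pd

def isInside(column_lat, column_lon, region):
    # Same bounding-box test as in the notebook
    return ((column_lat >= region['min_lat']) & (column_lat <= region['max_lat']) &
            (column_lon >= region['min_long']) & (column_lon <= region['max_long']))

JFK = {'min_long': -73.8352, 'min_lat': 40.6195, 'max_long': -73.7401, 'max_lat': 40.6659}

# Illustrative points (assumed): first inside the JFK box, second outside it
points = pd.DataFrame({
    "pickup_latitude": [40.6413, 40.7580],
    "pickup_longitude": [-73.7781, -73.9855],
})
points["pickup_airport"] = isInside(points.pickup_latitude,
                                    points.pickup_longitude, JFK).astype(int)
print(points)
```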

<h5>SPLIT DATETIME INTO YEAR, MONTH, DAY, HOUR AND WEEKDAY</h5>

In [31]:
%%time
def f2(datestr):
    # Drop the last 4 characters (assumed to be a trailing " UTC") before parsing
    return ciso8601.parse_datetime(datestr[:-4])

def separate_datetime_to_features(dataframe):
    # Convert the strings to datetime values; the commented-out alternative was
    # pd.to_datetime(data.pickup_datetime, infer_datetime_format=True)
    dataframe['pickup_datetime'] = dataframe.pickup_datetime.apply(f2)
    dataframe['year'] = dataframe['pickup_datetime'].dt.year
    dataframe['month'] = dataframe['pickup_datetime'].dt.month
    dataframe['day'] = dataframe['pickup_datetime'].dt.day
    dataframe['hour'] = dataframe['pickup_datetime'].dt.hour
    dataframe['weekday'] = dataframe['pickup_datetime'].dt.weekday
    dataframe.drop(columns='pickup_datetime', inplace=True)
    return dataframe

data = separate_datetime_to_features(data)

Wall time: 54.5 s
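
For comparison, the same features can be derived without `ciso8601` by giving `pd.to_datetime` an explicit format, which avoids per-value format inference. This is a sketch under the assumption that the raw strings look like `2009-06-15 17:26:21 UTC`; the sample values are illustrative.

```python
import pandas as pd

raw = pd.Series(["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"])

# The literal " UTC" in the format string is an assumption about the raw data
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC")

features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "hour": parsed.dt.hour,
    "weekday": parsed.dt.weekday,  # Monday is 0
})
print(features)
```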


<h5>COMPUTE THE HAVERSINE DISTANCE BETWEEN POINTS FROM THEIR LATITUDE AND LONGITUDE</h5>

In [34]:
def haversine(lon1, lat1, lon2, lat2):
    # Vectorized haversine using NumPy functions
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    km = 6367 * 2 * np.arcsin(np.sqrt(a))  # 6367 km is the Earth radius used here
    km[km < 0.00008] = 0.00008  # clamp to a minimum of 0.00008 km (8 cm)
    return km

In [None]:
%%time
data['distance'] = haversine(data['pickup_longitude'],data['pickup_latitude'] , data['dropoff_longitude'], data['dropoff_latitude'])
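
A quick sanity check of the haversine function: along a meridian, one degree of latitude should give `6367 * pi / 180` km (about 111 km), and identical points should hit the 0.00008 km floor. The coordinates below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    # Same vectorized formula as in the notebook
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    km = 6367 * 2 * np.arcsin(np.sqrt(a))
    km[km < 0.00008] = 0.00008
    return km

# Row 0: one degree of latitude apart; row 1: identical pickup and drop-off
lon1 = pd.Series([-73.98, -73.98])
lat1 = pd.Series([40.0, 40.75])
lon2 = pd.Series([-73.98, -73.98])
lat2 = pd.Series([41.0, 40.75])

dist = haversine(lon1, lat1, lon2, lat2)
print(dist)
```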

<h4>FEATURE SELECTION</h4>

In [37]:
# 53383845 records remain for training and testing
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53383845 entries, 0 to 55423855
Data columns (total 14 columns):
 #   Column             Dtype  
---  ------             -----  
 0   fare_amount        float64
 1   pickup_longitude   float64
 2   pickup_latitude    float64
 3   dropoff_longitude  float64
 4   dropoff_latitude   float64
 5   passenger_count    int64  
 6   pickup_airport     int32  
 7   dropoff_airport    int32  
 8   year               int64  
 9   month              int64  
 10  day                int64  
 11  hour               int64  
 12  weekday            int64  
 13  distance           float64
dtypes: float64(6), int32(2), int64(6)
memory usage: 5.6 GB


In [38]:
data.drop(columns=['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'], inplace=True)

In [37]:
predictors = ['distance', 'pickup_airport', 'dropoff_airport', 'year','month','hour']
salida = 'fare_amount'
X = data[predictors]
y = data[salida]

# Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
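
The cell above fits the scaler on all rows before the train/test split performed later. A common alternative, sketched here on synthetic data as an assumption rather than the notebook's method, fits the scaler on the training split only, so that test-set statistics do not influence preprocessing. All names with a `_demo` suffix are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # synthetic features
y_demo = 2.5 * X_demo[:, 0] + rng.normal(size=1000)      # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=7, test_size=0.3)

# Mean and variance are estimated from the training split only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuses the training statistics
```

Tree-based models such as the ones trained below are largely insensitive to feature scaling, so this choice mainly matters if a scale-sensitive model is added later.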

## 6. Model training

In [38]:
def print_metrics(y_test, y_pred):
    r2score = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = sqrt(mse)  # RMSE is the square root of the MSE computed above

    print("MSE", mse)
    print("RMSE",rmse)
    print("R2", r2score)
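
A small worked example of the three metrics, using made-up values: the errors are 2, -2 and 3, so MSE = (4 + 4 + 9) / 3, RMSE is its square root, and R2 = 1 - 17 / 200 because the total sum of squares around the mean (20) is 200.

```python
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only
y_true_demo = [10.0, 20.0, 30.0]
y_pred_demo = [12.0, 18.0, 33.0]

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
rmse_demo = sqrt(mse_demo)
r2_demo = r2_score(y_true_demo, y_pred_demo)

print("MSE", mse_demo)
print("RMSE", rmse_demo)
print("R2", r2_demo)
```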

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7, test_size=0.3)

## Random Forest Regressor

In [None]:
model = RandomForestRegressor(max_depth=6, max_features='sqrt', n_estimators=50,n_jobs=1)

In [50]:
%%time
model.fit(X_train,y_train)

Wall time: 23min 49s


RandomForestRegressor(max_depth=6, max_features='sqrt', n_estimators=50,
                      n_jobs=1)

In [51]:
y_pred = model.predict(X_test)

#### Metrics

In [52]:
print_metrics(y_test, y_pred)

MSE 14.998795151405638
RMSE 3.8728277977991272
R2 0.8030349035166128


## Boosting

In [39]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=60, learning_rate=0.5, max_depth=2, random_state=0)

In [42]:
%%time 
gbr.fit(X_train, y_train)

GradientBoostingRegressor(learning_rate=0.5, max_depth=2, n_estimators=60,
                          random_state=0)

In [45]:
gbr.score(X_test, y_test)
y_pred = gbr.predict(X_test)

#### Metrics

In [46]:
print_metrics(y_test, y_pred)

MSE 12.560874420619301
RMSE 3.5441323932126605
R2 0.8350498278562641
