# Project description

Sweet Lift Taxi has collected historical data on taxi orders at airports. To attract more drivers during peak hours, we need to predict the number of taxi orders for the next hour. Build a model for this prediction.

The RMSE metric on the test set must not exceed 48.
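
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The following is a minimal sketch in plain Python; the function name `rmse_score` and the sample numbers are illustrative assumptions, not part of the project data.

```python
import math


def rmse_score(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: the two errors are -3 and 4 orders
example = rmse_score([10, 20], [13, 16])
print(example)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.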

## Project instructions

1. Download the data and resample it by one hour.
2. Analyze the data.
3. Train different models with different hyperparameters. The test sample must be 10% of the initial dataset.
4. Test the data using the test sample and provide a conclusion.
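
Steps 1 and 3 can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the frame `toy` and its values are made up, and the hold-out is done with plain positional slicing rather than any particular library helper.

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute series standing in for taxi.csv (values are made up)
index = pd.date_range("2018-03-01", periods=6 * 24 * 10, freq="10min")
toy = pd.DataFrame({"num_orders": np.arange(len(index)) % 7}, index=index)

# Step 1: aggregate the 10-minute counts into hourly totals
hourly = toy.resample("h").sum()

# Step 3: chronological hold-out, keeping the last 10% of rows as the test sample
split_at = int(len(hourly) * 0.9)
train, test = hourly.iloc[:split_at], hourly.iloc[split_at:]

print(len(hourly), len(train), len(test))
```

Splitting by position instead of at random keeps every test observation later in time than the training data, which is what a forecasting model faces in practice.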

## Data description

The data is stored in the file `taxi.csv`.
The number of orders is in the `num_orders` column.

## Preparation

In [105]:
!pip install plotly_express






In [106]:
import pandas as pd
import numpy as np

import plotly.graph_objects as go
import plotly_express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error  # importing root_mean_squared_error raises an error

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
from catboost import CatBoostRegressor

from statsmodels.tsa.statespace.sarimax import SARIMAX

In [107]:
try:
    # Local copy of the dataset
    data = pd.read_csv(
        r"C:\Users\armod\OneDrive\Escritorio\Curso Data Science\Project_13\taxi.csv",
        index_col=0,
        parse_dates=[0],
    )
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is missing
    data = pd.read_csv(
        "https://practicum-content.s3.us-west-1.amazonaws.com/datasets/taxi.csv?etag=11687de0e23962e5a11c9d8ae13eb630",
        index_col=0,
        parse_dates=[0],
    )

In [108]:
# data= pd.read_csv("/datasets/taxi.csv",
#     index_col=0,
#     parse_dates=[0],    )

In [109]:
data

datetime,num_orders
2018-03-01 00:00:00,9
2018-03-01 00:10:00,14
2018-03-01 00:20:00,28
2018-03-01 00:30:00,20
2018-03-01 00:40:00,32
...,...
2018-08-31 23:10:00,32
2018-08-31 23:20:00,24
2018-08-31 23:30:00,27
2018-08-31 23:40:00,39


## Analysis

In [110]:
data.isna().sum()
np.random.seed(999)

In [111]:
data_hour = data.resample("1H").sum()
data_hour.head()

datetime,num_orders
2018-03-01 00:00:00,124
2018-03-01 01:00:00,85
2018-03-01 02:00:00,71
2018-03-01 03:00:00,66
2018-03-01 04:00:00,43


In [112]:
decomposed = seasonal_decompose(data_hour["num_orders"], model="additive")

fig = make_subplots(
    rows=4,
    cols=1,
    subplot_titles=("Original", "Trend", "Seasonality", "Residuals"),
    vertical_spacing=0.07,
)

for i, observation in enumerate(["observed", "trend", "seasonal", "resid"], start=1):
    dataframe = pd.DataFrame(
        {"ds": data_hour.index, "observation": getattr(decomposed, observation)}
    )
    fig.add_trace(
        go.Scatter(x=dataframe["ds"], y=dataframe["observation"], showlegend=False),
        row=i,
        col=1,
        
    )
start_date = '2018-08-24'
end_date = '2018-08-31'
fig.update_layout(
    height=1000, 
    width=1000, 
    title_text="Time Series Decomposition",
    xaxis=dict(range=[start_date, end_date]),   
    xaxis3=dict(range=[start_date, end_date], autorange=False),  
    xaxis4=dict(range=[start_date, end_date], autorange=False) 
)
fig.show()
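
The seasonal panel of an additive decomposition is a pattern that repeats with a fixed period. The sketch below illustrates that idea with pandas only, by averaging a synthetic series by hour of day. It is a simplified stand-in for intuition, not the procedure `seasonal_decompose` uses, and the series, its level of 50, and its amplitude of 10 are made-up assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series: a flat level of 50 plus a daily cycle (made-up values)
index = pd.date_range("2018-03-01", periods=24 * 14, freq="h")
hours = index.hour.to_numpy()
series = pd.Series(50 + 10 * np.sin(2 * np.pi * hours / 24), index=index)

# Average by hour of day, then centre the profile around zero
hourly_profile = series.groupby(series.index.hour).mean()
seasonal_estimate = hourly_profile - hourly_profile.mean()

peak_hour = int(seasonal_estimate.idxmax())
print(peak_hour)
```

On this toy series the estimated profile peaks at hour 6, where the sine term reaches its maximum.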

## Training

In [113]:

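def make_features(data, max_lag, rolling_mean_size):
    """Add calendar, lag, and rolling-mean features to an hourly series."""
    # Append a placeholder row for the next hour (num_orders=0) so that its
    # lag features can be computed. NOTE: this row is not removed later, so it
    # ends up in the test split with a target of 0.
    new_row_index = [data.index[-1] + pd.Timedelta(hours=1)]
    datos = {"num_orders": [0]}
    new_row = pd.DataFrame(data=datos, index=new_row_index)
    data = pd.concat([data, new_row])

    # Calendar features taken from the datetime index
    data["year"] = data.index.year
    data["month"] = data.index.month
    data["day"] = data.index.day
    data["dayofweek"] = data.index.dayofweek

    # Lagged targets: lag_k is the order count k hours earlier
    for lag in range(1, max_lag + 1):
        data["lag_{}".format(lag)] = data["num_orders"].shift(lag)

    # Rolling mean over previous hours only; shift() excludes the current hour
    data["rolling_mean"] = data["num_orders"].shift().rolling(rolling_mean_size).mean()
    return data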

data_hour = make_features(data_hour, 4, 4)
display(data_hour.tail(5))

Unnamed: 0,num_orders,year,month,day,dayofweek,lag_1,lag_2,lag_3,lag_4,rolling_mean
2018-08-31 20:00:00,154,2018,8,31,4,136.0,207.0,217.0,197.0,189.25
2018-08-31 21:00:00,159,2018,8,31,4,154.0,136.0,207.0,217.0,178.5
2018-08-31 22:00:00,223,2018,8,31,4,159.0,154.0,136.0,207.0,164.0
2018-08-31 23:00:00,205,2018,8,31,4,223.0,159.0,154.0,136.0,168.0
2018-09-01 00:00:00,0,2018,9,1,5,205.0,223.0,159.0,154.0,185.25
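
The lag and rolling-mean columns only use past values, which is what keeps the target from leaking into its own features. Here is a small sketch on a six-row toy frame; the values are made up and the window of 2 is chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

index = pd.date_range("2018-03-01", periods=6, freq="h")
toy = pd.DataFrame({"num_orders": [10, 20, 30, 40, 50, 60]}, index=index)

# lag_1 holds the previous hour's value
toy["lag_1"] = toy["num_orders"].shift(1)

# shift() before rolling() keeps the current hour out of its own average
toy["rolling_mean"] = toy["num_orders"].shift().rolling(2).mean()

print(toy)
```

For the row with 40 orders, `lag_1` is 30 and `rolling_mean` is the mean of 20 and 30. The first two rows have no complete window, so their rolling mean is NaN, which is why a `dropna` step is needed before training.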


## Testing

In [114]:
print(data_hour.index[-1] + pd.Timedelta(hours=1))
#display(data_hour.iloc[-6:-1]) 

2018-09-01 01:00:00


In [115]:
data_hour.dropna(inplace=True)
X = data_hour.drop("num_orders", axis=1)
y = data_hour["num_orders"]

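# Chronological split (shuffle=False): roughly 60% train, 20% validation, 20% test.
# NOTE: the project instructions call for a 10% test sample; test_size=0.2 holds out 20%.
X_Bulk, X_test, y_Bulk, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)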

X_train, X_valid, y_train, y_valid = train_test_split(X_Bulk, y_Bulk, test_size=0.25, shuffle=False) 
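
The two-stage split above can be pictured with plain slicing. The helper `chronological_split` below is hypothetical and written only for illustration; it mirrors the ordering and the proportions of the split, not the exact rounding that `train_test_split` applies.

```python
def chronological_split(rows, test_fraction, valid_fraction_of_rest):
    """Split an ordered sequence into train/valid/test without shuffling."""
    n_bulk = round(len(rows) * (1 - test_fraction))
    bulk, test = rows[:n_bulk], rows[n_bulk:]
    n_train = round(len(bulk) * (1 - valid_fraction_of_rest))
    return bulk[:n_train], bulk[n_train:], test


rows = list(range(100))  # stand-in for 100 consecutive hourly rows
train, valid, test = chronological_split(rows, 0.2, 0.25)
print(len(train), len(valid), len(test))
```

Holding out 20% first and then 25% of the remainder gives a 60/20/20 split, with each block strictly later in time than the one before it.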


#### Linear regression

In [116]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

predictions_train = lr_model.predict(X_train)
predictions_valid = lr_model.predict(X_valid)  # validation set, not the test set

rmse_lr = mean_squared_error(y_valid, predictions_valid, squared=False)

print("rmse train:", mean_squared_error(y_train, predictions_train, squared=False))
print("rmse valid:", rmse_lr)


data_predict = pd.DataFrame(data={"pred": predictions_valid, "real": y_valid})
fig = px.line(data_predict)
fig.update_layout(xaxis=dict(range=["2018-08-30", "2018-09-01"]))
fig.show()

rmse train: 27.750323387290692
rmse valid: 33.081286692524834



'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.





#### Random Forest


In [117]:
rf_model = RandomForestRegressor(max_features="sqrt", max_depth=10)

rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_valid)

rmse_rf = mean_squared_error(y_valid, rf_predictions, squared=False)
print(f"RMSE: {rmse_rf:.2f}")

data_predict = pd.DataFrame(data={"pred": rf_predictions, "real": y_valid})
fig = px.line(data_predict)
fig.update_layout(xaxis=dict(range=["2018-08-30", "2018-09-01"]))
fig.show()

# feature_importances = rf_model.feature_importances_
# features = X_train.columns
# indices = np.argsort(feature_importances)[-10:]


# plt.figure(figsize=(10, 8))
# plt.title('Feature Importance')
# plt.barh(range(len(indices)), feature_importances[indices], color='blue', align='center')
# plt.yticks(range(len(indices)), [features[i] for i in indices])
# plt.xlabel('Relative Importance')
# plt.show()

RMSE: 33.79






#### LightGBM


In [118]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid)

lgb_params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": {"l2", "l1"},
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "verbose": 0,
}

lgb_model = lgb.train(lgb_params, lgb_train, valid_sets=lgb_eval)

lgb_predictions = lgb_model.predict(X_valid, num_iteration=lgb_model.best_iteration)


rmse_lgbm = mean_squared_error(y_valid, lgb_predictions, squared=False)
print(f"RMSE: {rmse_lgbm:.2f}")

data_predict = pd.DataFrame(data={"pred": lgb_predictions, "real": y_valid})
fig = px.line(data_predict)
fig.update_layout(xaxis=dict(range=["2018-08-30", "2018-09-01"]))
fig.show()


RMSE: 34.32


#### CatBoost Regressor

In [119]:
cat_model = CatBoostRegressor(
    iterations=1000, learning_rate=0.03, depth=6, loss_function="RMSE", verbose=False
)

cat_model.fit(X_train, y_train)
cat_predictions = cat_model.predict(X_valid)

rmse_cat = mean_squared_error(y_valid, cat_predictions, squared=False)
print(f"RMSE: {rmse_cat:.2f}")


data_predict = pd.DataFrame(data={"pred":cat_predictions,"real":y_valid})
fig = px.line(data_predict)
fig.update_layout(xaxis=dict(range=["2018-08-30", "2018-09-01"]))
fig.show()


RMSE: 33.77






In [120]:
rmses = {"Linear Regression": rmse_lr, "Random Forest Regressor": rmse_rf, "LightGBM": rmse_lgbm, "Catboost": rmse_cat}
print(rmses)

{'Linear Regression': 33.081286692524834, 'Random Forest Regressor': 33.792425244115755, 'LightGBM': 34.32321226037138, 'Catboost': 33.76895667859574}
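
Choosing the model with the lowest validation RMSE can be sketched as below. The numbers are copied from the dictionary printed above; the names `validation_rmse`, `best_model`, and `spread` are illustrative.

```python
# Validation RMSE per model, copied from the printed dictionary
validation_rmse = {
    "Linear Regression": 33.081286692524834,
    "Random Forest Regressor": 33.792425244115755,
    "LightGBM": 34.32321226037138,
    "Catboost": 33.76895667859574,
}

# The key with the smallest RMSE is the preferred model
best_model = min(validation_rmse, key=validation_rmse.get)

# Gap between the worst and the best model on the validation set
spread = max(validation_rmse.values()) - min(validation_rmse.values())

print(best_model, round(spread, 2))
```

The spread between the best and worst model is about 1.24 orders, so the ranking is close and could change with different hyperparameters or features.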


In [121]:
# Testing the chosen model
predictions_test = lr_model.predict(X_test)

rmse_lr = mean_squared_error(y_test, predictions_test, squared=False)

print("rmse test:", rmse_lr )


data_predict = pd.DataFrame(data={"pred":predictions_test,"real":y_test})
fig = px.line(data_predict)
fig.update_layout(xaxis=dict(range=["2018-08-30", "2018-09-01"]))
fig.show()





rmse test: 48.550743877119366
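
The printed test RMSE can be checked against the project threshold with a couple of lines. The constant name `RMSE_TARGET` is an illustrative assumption; both numbers come from this notebook.

```python
RMSE_TARGET = 48  # threshold stated in the project description
test_rmse = 48.550743877119366  # value printed by the test cell

meets_target = test_rmse <= RMSE_TARGET
gap = test_rmse - RMSE_TARGET

print(meets_target, round(gap, 2))
```

With these values the check fails by about 0.55, so the test result sits just above the required limit.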


In this project we worked with time series using data from a taxi company. We decomposed demand over time into trend and seasonality to understand it better, and based on that understanding we forecast the number of orders for the next hour.
We also compared several models against linear regression. On the validation set, linear regression was slightly better than all the other models (RMSE 33.08 versus 33.77 to 34.32), and given the computational cost of each model it is the best option. On the test set it reached an RMSE of 48.55, slightly above the target of 48.

# Review checklist

- [x]  Jupyter Notebook is open.
- [ ]  The code has no errors
- [ ]  The cells with code are arranged in execution order.
- [ ]  The data has been downloaded and prepared.
- [ ]  Step 2 has been completed: the data has been analyzed
- [ ]  The model was trained and the hyperparameters were selected
- [ ]  The models have been evaluated. A conclusion has been provided
- [ ] The *RMSE* for the test set is no more than 48