# Model training – NYC Taxi Fare

This notebook runs **only the ML model training step**:

1. Reads the **curated** layer from GCS (already produced by the ETL).
2. Uses the functions in `src/models/trainer.py` to:
   - prepare the features,
   - train a `GBTRegressor`,
   - evaluate metrics (RMSE / MAE),
   - save the model to GCS.

> Run the ETL notebook first, and make sure
> `models/trainer.py` and `gcs/paths.py` have been created.
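
The steps listed above could be sketched roughly as follows. This is not the content of `src/models/trainer.py`, which the notebook does not show: the function name `train_fare_model_sketch`, the feature list, the 80/20 split, and the hyperparameters are assumptions made for illustration. Only the use of a `GBTRegressor`, the RMSE/MAE metrics, and saving to a model path come from the description.

```python
# Hypothetical sketch of a trainer; NOT the real src/models/trainer.py.
# The feature columns are an assumption, picked from the curated schema.
FEATURE_COLS = ["trip_distance", "passenger_count", "trip_duration_min",
                "pickup_hour", "pickup_dow"]
LABEL_COL = "fare_amount"


def train_fare_model_sketch(df, model_path, seed=42):
    """Fit a GBT pipeline on `df`, save it to `model_path`, return metrics."""
    # PySpark is imported inside the function so that this module can be
    # imported on a machine without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    # 1. Prepare features: pack the numeric columns into a single vector.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features",
                                handleInvalid="skip")

    # 2. Train a GBTRegressor (the hyperparameters are placeholders).
    gbt = GBTRegressor(featuresCol="features", labelCol=LABEL_COL,
                       maxIter=20, maxDepth=5, seed=seed)
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=seed)
    model = Pipeline(stages=[assembler, gbt]).fit(train_df)

    # 3. Evaluate RMSE and MAE on the held-out split.
    predictions = model.transform(test_df)
    metrics = {
        name: RegressionEvaluator(labelCol=LABEL_COL,
                                  predictionCol="prediction",
                                  metricName=name).evaluate(predictions)
        for name in ("rmse", "mae")
    }

    # 4. Save the fitted pipeline (gs:// paths require the GCS connector
    #    to be configured on the cluster).
    model.write().overwrite().save(model_path)
    return metrics
```

Wrapping the assembler and the regressor in a single `Pipeline` means the saved artifact carries its own feature preparation, which is consistent with the notebook later loading the model through `PipelineModel.load`.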


## 1. Environment and path setup

In [13]:
import os, sys
project_root = "/home/barcenasvac/nyc-taxi-etl-pyspark"
src_path = os.path.join(project_root, "src")

if src_path not in sys.path:
    sys.path.append(src_path)

print("Project root:", project_root)
print("SRC path    :", src_path)


Project root: /home/barcenasvac/nyc-taxi-etl-pyspark
SRC path    : /home/barcenasvac/nyc-taxi-etl-pyspark/src
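
The cell above hardcodes an absolute path for one user's home directory. A minimal sketch of a more portable variant is shown here. The helper name `add_src_to_path` is hypothetical, and it assumes the package code lives in a `src/` folder directly under the project root, as in this notebook.

```python
import sys
from pathlib import Path


def add_src_to_path(project_root):
    """Put `<project_root>/src` on sys.path once and return it as a string."""
    src_path = str(Path(project_root).expanduser().resolve() / "src")
    if src_path not in sys.path:
        sys.path.append(src_path)
    return src_path
```

Calling `add_src_to_path("~/nyc-taxi-etl-pyspark")` would then work for any user who has cloned the project into their home directory, and calling it twice does not duplicate the entry.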


## 2. Import utilities, paths, and trainer

In [14]:
from utils.spark_builder import build_spark_session
from gcs.paths import GCS_CURATED_PATH, GCS_MODEL_PATH
from models.trainer import train_fare_model

print("CURATED path:", GCS_CURATED_PATH)
print("MODEL path  :", GCS_MODEL_PATH)

CURATED path: gs://nyc-taxi-etl/curated/nyc_taxi/yellow_2015_01
MODEL path  : gs://nyc-taxi-etl/models/nyc_taxi_fare_gbt_2015_01
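
For readers without access to the repository, one way `gcs/paths.py` might be laid out is sketched here. Only the two constant names `GCS_CURATED_PATH` and `GCS_MODEL_PATH` and their printed values come from the output above. `GCS_BUCKET`, `DATASET`, and `gcs_uri` are hypothetical names introduced for the example.

```python
# Hypothetical layout for gcs/paths.py; only the two exported constants and
# their printed values come from the notebook output.
GCS_BUCKET = "nyc-taxi-etl"      # assumed name for the bucket constant
DATASET = "yellow_2015_01"       # assumed name for the dataset slice


def gcs_uri(*parts):
    """Join path segments into a gs:// URI under GCS_BUCKET."""
    return "gs://" + "/".join([GCS_BUCKET, *parts])


GCS_CURATED_PATH = gcs_uri("curated", "nyc_taxi", DATASET)
GCS_MODEL_PATH = gcs_uri("models", "nyc_taxi_fare_gbt_2015_01")
```

Keeping every path in one module means the ETL notebook and this training notebook cannot drift apart on where the curated layer lives.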


## 3. Create the Spark session and read the curated layer

In [15]:
spark = build_spark_session("NYC Taxi – Training Notebook")

df_curated = spark.read.parquet(GCS_CURATED_PATH)

print("Sample of curated data:")
df_curated.show(5, truncate=False)

Sample of curated data:

+------------+--------+--------------------+---------------------+---------------+-------------+------------------+------------------+----------+------------------+------------------+------------------+-----------+-----+-------+----------+------------+---------------------+------------+-----------------+-----------+----------+------------------+------------+-----------+
|payment_type|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|pickup_longitude  |pickup_latitude   |RateCodeID|store_and_fwd_flag|dropoff_longitude |dropoff_latitude  |fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|trip_duration_min|pickup_hour|pickup_dow|avg_speed_kmh     |payment_desc|pickup_date|
+------------+--------+--------------------+---------------------+---------------+-------------+------------------+------------------+----------+------------------+------------------+------------------+-----------+-----+-------+----------+------------+---------------------+------------+-----------------+-----------+----------+------------------+------------+-----------+
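
Before training, it can help to confirm that the curated layer really contains the columns the trainer expects. The sketch below is one way to do that. `REQUIRED_COLS` is an assumed list chosen from the schema printed above, not the trainer's actual feature list, and both function names are hypothetical.

```python
# Assumed set of columns the trainer needs; adjust to the real feature list.
REQUIRED_COLS = ("fare_amount", "trip_distance", "passenger_count",
                 "trip_duration_min", "pickup_hour", "pickup_dow")


def missing_columns(available, required=REQUIRED_COLS):
    """Return the required column names absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


def check_curated_schema(available, required=REQUIRED_COLS):
    """Raise ValueError if any required column is missing."""
    missing = missing_columns(available, required)
    if missing:
        raise ValueError(f"Curated layer is missing columns: {missing}")
```

In the notebook this would be called as `check_curated_schema(df_curated.columns)`, which fails fast on a stale or partial ETL run instead of failing inside the Spark job.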

                                                                                

## 4. Model training and metrics

In [17]:
metrics = train_fare_model(df_curated, GCS_MODEL_PATH)
metrics

{'rmse': 3.2708882939846484, 'mae': 1.1268357786766927}
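
The RMSE (about 3.27) is almost three times the MAE (about 1.13). Because RMSE squares each error before averaging, one plausible reading is that a small number of trips are predicted badly while most are predicted well. The sketch below illustrates the effect on made-up residuals. The numbers are toy values, not output from this model.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Toy residuals (made up for illustration, not taken from the model).
even = [2.0, -2.0, 2.0, -2.0, 2.0]
skewed = [0.5, -0.5, 0.5, -0.5, 8.0]   # same MAE, one large miss

print(mae(even), rmse(even))                 # 2.0 2.0
print(mae(skewed), round(rmse(skewed), 2))   # 2.0 3.61
```

Both lists have the same MAE, but the single large miss in `skewed` pushes its RMSE well above it, which is the same pattern the metrics above show.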

## 5. (Optional) Inspect the saved model

In [18]:
from pyspark.ml import PipelineModel

print("Loading model from GCS…")
model = PipelineModel.load(GCS_MODEL_PATH)
print(model)

Loading model from GCS…

PipelineModel_d7ceba606afe
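
Printing the model shows only its uid. A slightly more informative inspection is sketched below. The helper `describe_pipeline` is hypothetical. It relies on the `stages` attribute of a `PipelineModel` and on the `featureImportances` attribute that fitted tree models such as a GBT regression model expose, and it assumes the saved pipeline contains such a stage.

```python
def describe_pipeline(model):
    """Return one summary dict per stage of a fitted pipeline."""
    summary = []
    for stage in model.stages:
        info = {"stage": type(stage).__name__}
        # Only tree-based stages carry feature importances.
        importances = getattr(stage, "featureImportances", None)
        if importances is not None:
            info["feature_importances"] = [
                round(float(v), 4) for v in importances.toArray()
            ]
        summary.append(info)
    return summary
```

Running `describe_pipeline(model)` on the loaded model would list the stage class names in order and, for the regressor stage, the relative weight of each input feature.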


In [None]:
import time

# ------------------------------ #
# 1. Start timing
# ------------------------------ #
overall_start = time.perf_counter()

print(f"Reading curated data from: {GCS_CURATED_PATH}")
t0 = time.perf_counter()

df_curated = spark.read.parquet(GCS_CURATED_PATH)

# spark.read is lazy, so count() runs inside the timed span to force an
# actual job; otherwise read_time would only measure query planning.
n_rows = df_curated.count()
read_time = time.perf_counter() - t0
print(f"[ML] Rows read: {n_rows:,}")

# --------------------------------- #
# 2. Training
# --------------------------------- #
t0 = time.perf_counter()
metrics = train_fare_model(df_curated, GCS_MODEL_PATH)
train_time = time.perf_counter() - t0

overall_time = time.perf_counter() - overall_start

# --------------------------------- #
# 3. Summary, same format as main_train
# --------------------------------- #
print("=" * 80)
print("TRAINING SUMMARY")
print(f"  Curated read     : {read_time:.2f} s")
print(f"  Model training   : {train_time:.2f} s")
print(f"  Total script time: {overall_time:.2f} s")
print("ML METRICS:")
print(f"  RMSE = {metrics['rmse']:.2f}")
print(f"  MAE  = {metrics['mae']:.2f}")
print("=" * 80)
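
The repeated `t0 = time.perf_counter()` pattern in the cell above could be folded into a small context manager. This is only a sketch. The name `timed` and the `timings` dictionary are hypothetical and are not part of the project.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Store the wall-clock seconds spent in the block under timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so partial runs are still timed.
        timings[label] = time.perf_counter() - start


timings = {}
with timed("sleep", timings):
    time.sleep(0.01)   # stand-in for the read or training step
print(f"sleep: {timings['sleep']:.2f} s")
```

In the notebook, the training step would become `with timed("train", timings): metrics = train_fare_model(df_curated, GCS_MODEL_PATH)`, and the summary would read its values from `timings`.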
