# 🤖 MLflow Training Pipeline
## PHASE 3: Machine Learning + Experiment Tracking

This notebook implements the training pipeline with MLflow tracking.

### 🎯 **Objectives:**
1. **Configure MLflow** tracking
2. **Feature engineering** from PostgreSQL
3. **Train models** (Random Forest, XGBoost, Linear Regression)
4. **Compare experiments** and select the best one
5. **Register the model** in the MLflow Registry

### 🏗️ **Hexagonal Architecture:**
- **Domain**: ML services and business rules
- **Ports**: Interfaces for ML and tracking
- **Adapters**: MLflow, scikit-learn, XGBoost, PostgreSQL
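
The domain/ports/adapters split listed above can be illustrated with a small, library-free sketch. Every name here (`DurationModel`, `ExperimentTracker`, `MeanBaseline`, `InMemoryTracker`, `evaluate`) is hypothetical and not part of the project code; the toy adapters only stand in for scikit-learn and MLflow.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


class DurationModel(Protocol):
    """Port: anything that can be fitted and then predict trip durations."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None: ...
    def predict(self, X: Sequence[Sequence[float]]) -> list[float]: ...


class ExperimentTracker(Protocol):
    """Port: where metrics are recorded (MLflow in this notebook)."""

    def log_metric(self, name: str, value: float) -> None: ...


@dataclass
class MeanBaseline:
    """Toy adapter: always predicts the mean training duration."""

    mean_: float = 0.0

    def fit(self, X, y) -> None:
        self.mean_ = sum(y) / len(y)

    def predict(self, X) -> list[float]:
        return [self.mean_ for _ in X]


class InMemoryTracker:
    """Toy adapter standing in for an MLflow-backed tracker."""

    def __init__(self) -> None:
        self.metrics: dict[str, float] = {}

    def log_metric(self, name: str, value: float) -> None:
        self.metrics[name] = value


def evaluate(model: DurationModel, tracker: ExperimentTracker,
             X_train, y_train, X_test, y_test) -> float:
    """Domain service: depends only on the ports, never on a concrete library."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = sum(abs(p - t) for p, t in zip(preds, y_test)) / len(y_test)
    tracker.log_metric("mae", mae)
    return mae


tracker = InMemoryTracker()
mae = evaluate(MeanBaseline(), tracker, [[1.0], [2.0]], [10.0, 20.0], [[3.0]], [18.0])
```

Swapping `MeanBaseline` for a wrapper around `RandomForestRegressor`, or `InMemoryTracker` for one that calls `mlflow.log_metric`, would leave `evaluate` untouched, which is the point of the layering.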

### 📋 **Prerequisites:**
- PHASE 2 completed (PostgreSQL loaded with data)
- 49,719 trips in the database

In [1]:
# 📦 Setup and imports
import pandas as pd
import numpy as np
import asyncio
import asyncpg
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
import logging
from datetime import datetime
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add the project directory to the path
sys.path.append(os.path.join(os.getcwd(), 'taxi_duration_predictor'))

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("🚀 Setup FASE 3 completado!")
print(f"📅 Fecha: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("🎯 Objetivo: Entrenar modelos con MLflow tracking")

🚀 Setup FASE 3 completado!
📅 Fecha: 2025-07-19 10:37:56
🎯 Objetivo: Entrenar modelos con MLflow tracking


## 🔧 **Step 1: Configure MLflow Tracking**

In [2]:
# 🔧 MLflow configuration
# Use a local SQLite database for tracking (keeps things simple for the course)
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("taxi_duration_prediction")

# PostgreSQL connection settings (reused from PHASE 2)
AWS_ENDPOINT = "taxi-duration-db.ckj7uy651uld.us-east-1.rds.amazonaws.com"
DB_PORT = 5432
DB_NAME = "postgres"
DB_USER = "taxiuser"
DB_PASSWORD = "TaxiDB2025!"

print("🔬 MLflow configurado:")
print(f"   Tracking URI: sqlite:///mlflow.db")
print(f"   Experiment: taxi_duration_prediction")
print("📡 PostgreSQL configurado para feature extraction")
print("✅ Configuración completa!")

2025/07/19 10:39:35 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/07/19 10:39:35 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembi

🔬 MLflow configurado:
   Tracking URI: sqlite:///mlflow.db
   Experiment: taxi_duration_prediction
📡 PostgreSQL configurado para feature extraction
✅ Configuración completa!
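
The connection settings in this cell are hardcoded, including the password. One possible alternative, sketched here under assumptions, is to read them from environment variables. The `TAXI_DB_*` variable names, the `DbConfig` class and the `load_db_config` helper are all hypothetical and do not exist in the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DbConfig:
    host: str
    port: int
    database: str
    user: str
    password: str


def load_db_config(env: Mapping[str, str] = os.environ) -> DbConfig:
    """Build connection settings from environment variables.

    The TAXI_DB_* names are hypothetical; only the password is mandatory.
    """
    password = env.get("TAXI_DB_PASSWORD")
    if not password:
        raise RuntimeError("Set TAXI_DB_PASSWORD before running the notebook")
    return DbConfig(
        host=env.get("TAXI_DB_HOST", "localhost"),
        port=int(env.get("TAXI_DB_PORT", "5432")),
        database=env.get("TAXI_DB_NAME", "postgres"),
        user=env.get("TAXI_DB_USER", "taxiuser"),
        password=password,
    )
```

With this in place, the later cells could call `cfg = load_db_config()` and pass `cfg.host`, `cfg.port`, and so on to `asyncpg.connect`, keeping the secret out of the notebook file.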


## 📊 **Step 2: Feature Engineering**

In [9]:
# 📊 Function to extract data and build features
async def extract_and_engineer_features(sample_size: int = 10000):
    """Extract data from PostgreSQL and build features for ML."""
    try:
        print(f"🔄 Extrayendo {sample_size} registros para entrenamiento...")

        # Connect to PostgreSQL
        conn = await asyncpg.connect(
            host=AWS_ENDPOINT,
            port=DB_PORT,
            database=DB_NAME,
            user=DB_USER,
            password=DB_PASSWORD
        )

        # Extract a random sample
        query = f"""
        SELECT
            pickup_longitude, pickup_latitude,
            dropoff_longitude, dropoff_latitude,
            passenger_count, vendor_id,
            pickup_datetime, trip_duration_seconds
        FROM taxi_trips
        ORDER BY RANDOM()
        LIMIT {sample_size}
        """

        rows = await conn.fetch(query)
        await conn.close()

        # Convert to a DataFrame
        df = pd.DataFrame([dict(row) for row in rows])
        print(f"📊 Datos extraídos: {df.shape}")

        # Feature Engineering
        print("🔨 Creando features...")

        # 1. Haversine distance
        def haversine_distance(lat1, lon1, lat2, lon2):
            R = 6371  # Earth radius in km
            lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
            dlat = lat2 - lat1
            dlon = lon2 - lon1
            a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
            return 2 * R * np.arcsin(np.sqrt(a))

        df['distance_km'] = haversine_distance(
            df['pickup_latitude'], df['pickup_longitude'],
            df['dropoff_latitude'], df['dropoff_longitude']
        )

        # 2. Temporal features
        df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
        df['hour_of_day'] = df['pickup_datetime'].dt.hour
        df['day_of_week'] = df['pickup_datetime'].dt.dayofweek
        df['month'] = df['pickup_datetime'].dt.month

        # 3. Categorical features (binary flags)
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_rush_hour'] = df['hour_of_day'].isin([7, 8, 9, 17, 18, 19]).astype(int)

        # 4. Target (convert to minutes)
        df['duration_minutes'] = df['trip_duration_seconds'] / 60

        # Select the final features
        feature_columns = [
            'distance_km', 'passenger_count', 'vendor_id',
            'hour_of_day', 'day_of_week', 'month',
            'is_weekend', 'is_rush_hour'
        ]

        X = df[feature_columns]
        y = df['duration_minutes']

        print(f"✅ Features creadas: {X.shape}")
        print(f"📈 Features: {list(X.columns)}")
        print(f"🎯 Target: duration_minutes (promedio: {y.mean():.1f} min)")

        return X, y, df

    except Exception as e:
        print(f"❌ Error en feature engineering: {e}")
        return None, None, None

# ✅ Run the extraction
print("💡 Ejecuta: X, y, df = await extract_and_engineer_features(10000)")

💡 Ejecuta: X, y, df = await extract_and_engineer_features(10000)


## 🤖 **Step 3: Training Pipeline**

In [4]:
# 🤖 Function to train and evaluate a model
def train_and_evaluate_model(X, y, model, model_name, hyperparams=None):
    """Train a model and log it to MLflow."""

    with mlflow.start_run(run_name=f"{model_name}_experiment"):
        print(f"🔄 Entrenando {model_name}...")

        # Split the data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        # Standardize features for the models that need it
        if model_name in ['LinearRegression']:
            scaler = StandardScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)
        else:
            X_train_scaled = X_train
            X_test_scaled = X_test

        # Train the model
        start_time = datetime.now()
        model.fit(X_train_scaled, y_train)
        training_time = (datetime.now() - start_time).total_seconds()

        # Predictions
        y_pred = model.predict(X_test_scaled)

        # Metrics
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Log to MLflow
        mlflow.log_param("model_type", model_name)
        mlflow.log_param("train_size", len(X_train))
        mlflow.log_param("test_size", len(X_test))
        mlflow.log_param("features", list(X.columns))

        if hyperparams:
            mlflow.log_params(hyperparams)

        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("mae", mae)
        mlflow.log_metric("r2_score", r2)
        mlflow.log_metric("training_time_seconds", training_time)

        # Save the model (the StandardScaler fitted above is not logged with it)
        mlflow.sklearn.log_model(model, "model")

        print(f"✅ {model_name} entrenado:")
        print(f"   RMSE: {rmse:.2f} minutos")
        print(f"   MAE: {mae:.2f} minutos")
        print(f"   R²: {r2:.3f}")
        print(f"   Tiempo: {training_time:.1f}s")

        return {
            'model_name': model_name,
            'rmse': rmse,
            'mae': mae,
            'r2': r2,
            'training_time': training_time
        }

print("💡 Función de entrenamiento lista")

💡 Función de entrenamiento lista
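
In the training function above, the `StandardScaler` used for `LinearRegression` lives outside the logged model, so anyone loading that model later would have to re-create the scaling step. A common way around this is a scikit-learn `Pipeline`. The sketch below is not the notebook's method; it uses synthetic data as a stand-in for the taxi features (an assumption) to show the idea.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the taxi features (assumption: 8 numeric columns).
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 8))
y = X @ np.arange(1, 9) + rng.normal(scale=0.1, size=200)

# The scaler is fitted inside the pipeline, so it is saved together with the model.
linear_pipeline = make_pipeline(StandardScaler(), LinearRegression())
linear_pipeline.fit(X[:160], y[:160])

# Raw (unscaled) rows can be passed directly at prediction time.
preds = linear_pipeline.predict(X[160:])
```

Passing `linear_pipeline` to `mlflow.sklearn.log_model` would then log the scaler and the regressor as one artifact.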


## 🏁 **Step 4: Run Experiments**

Now we train several models and compare them:

In [5]:
# 🏁 Run the full set of experiments
async def run_ml_experiments():
    """Run all ML experiments."""

    print("🚀 Iniciando experimentos de ML...")

    # 1. Extract data and build features
    X, y, df = await extract_and_engineer_features(10000)

    if X is None:
        print("❌ Error en extracción de datos")
        return

    # 2. Define the models to test
    models_to_test = [
        {
            'model': RandomForestRegressor(n_estimators=100, random_state=42),
            'name': 'RandomForest',
            'params': {'n_estimators': 100, 'random_state': 42}
        },
        {
            'model': LinearRegression(),
            'name': 'LinearRegression',
            'params': {}
        },
        {
            'model': xgb.XGBRegressor(n_estimators=100, random_state=42, verbosity=0),
            'name': 'XGBoost',
            'params': {'n_estimators': 100, 'random_state': 42}
        }
    ]

    # 3. Train all models
    results = []

    for model_config in models_to_test:
        result = train_and_evaluate_model(
            X, y,
            model_config['model'],
            model_config['name'],
            model_config['params']
        )
        results.append(result)
        print("-" * 50)

    # 4. Results summary
    print("\n📊 **RESUMEN DE EXPERIMENTOS:**")
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('rmse')

    print(results_df[['model_name', 'rmse', 'mae', 'r2']].to_string(index=False))

    best_model = results_df.iloc[0]
    print(f"\n🏆 **MEJOR MODELO: {best_model['model_name']}**")
    print(f"   RMSE: {best_model['rmse']:.2f} minutos")
    print(f"   MAE: {best_model['mae']:.2f} minutos")
    print(f"   R²: {best_model['r2']:.3f}")

    return results_df

print("💡 Ejecuta: results = await run_ml_experiments()")

💡 Ejecuta: results = await run_ml_experiments()


In [10]:
# 🚀 RUN ALL EXPERIMENTS
results = await run_ml_experiments()

🚀 Iniciando experimentos de ML...
🔄 Extrayendo 10000 registros para entrenamiento...
📊 Datos extraídos: (10000, 8)
🔨 Creando features...
❌ Error en feature engineering: loop of ufunc does not support argument 0 of type decimal.Decimal which has no callable radians method
❌ Error en extracción de datos
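
The error above can be reproduced without a database. asyncpg maps PostgreSQL `numeric` columns to `decimal.Decimal`, so the coordinates arrive as an object-dtype array that NumPy ufuncs such as `np.radians` cannot process. The following minimal sketch shows the failure and the cast that avoids it, using two latitudes taken from the sample row shown later in the notebook.

```python
from decimal import Decimal

import numpy as np

# Object-dtype array of Decimal values, as produced from a `numeric` column.
latitudes = np.array([Decimal("40.7679367"), Decimal("40.7656021")], dtype=object)

try:
    np.radians(latitudes)
    failed = False
except TypeError as exc:
    failed = True
    print(f"TypeError: {exc}")

# Casting to float first gives NumPy a dtype it has a ufunc loop for.
radians = np.radians(latitudes.astype(float))
```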


In [7]:
# 🔍 Check the database structure
async def check_database_structure():
    """Print the columns of the taxi_trips table."""
    try:
        conn = await asyncpg.connect(
            host=AWS_ENDPOINT,
            port=DB_PORT,
            database=DB_NAME,
            user=DB_USER,
            password=DB_PASSWORD
        )

        # Inspect the table structure
        columns = await conn.fetch("""
            SELECT column_name, data_type
            FROM information_schema.columns
            WHERE table_name = 'taxi_trips'
            ORDER BY ordinal_position
        """)

        print("📋 Columnas en taxi_trips:")
        for col in columns:
            print(f"   - {col['column_name']}: {col['data_type']}")

        # Inspect a sample row
        sample = await conn.fetchrow("SELECT * FROM taxi_trips LIMIT 1")
        print(f"\n🔍 Muestra de datos:")
        for key, value in sample.items():
            print(f"   {key}: {value}")

        await conn.close()

    except Exception as e:
        print(f"❌ Error: {e}")

await check_database_structure()

📋 Columnas en taxi_trips:
   - id: character varying
   - vendor_id: integer
   - pickup_datetime: timestamp without time zone
   - dropoff_datetime: timestamp without time zone
   - passenger_count: integer
   - pickup_longitude: numeric
   - pickup_latitude: numeric
   - dropoff_longitude: numeric
   - dropoff_latitude: numeric
   - store_and_fwd_flag: character varying
   - trip_duration_seconds: numeric
   - created_at: timestamp without time zone

🔍 Muestra de datos:
   id: id2875421
   vendor_id: 2
   pickup_datetime: 2016-03-14 17:24:55
   dropoff_datetime: 2016-03-14 17:32:30
   passenger_count: 1
   pickup_longitude: -73.9821548
   pickup_latitude: 40.7679367
   dropoff_longitude: -73.9646301
   dropoff_latitude: 40.7656021
   store_and_fwd_flag: N
   trip_duration_seconds: 455.00
   created_at: 2025-07-19 14:14:20.619628
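
The sample row above is handy for sanity-checking the Haversine feature by hand. The sketch below reimplements the same formula with the standard library only (the function name `haversine_km` is illustrative, not from the project) and applies it to that row's coordinates.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Coordinates and duration of the sample row printed above (trip id2875421).
distance = haversine_km(40.7679367, -73.9821548, 40.7656021, -73.9646301)
duration_min = 455.00 / 60
print(f"{distance:.2f} km in {duration_min:.1f} min")
```

The result should come out at roughly 1.5 km for a trip of about 7.6 minutes, which is a plausible order of magnitude for a short city ride.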


In [11]:
# 📊 FIXED function to extract data and build features
async def extract_and_engineer_features_fixed(sample_size: int = 10000):
    """Extract data from PostgreSQL and build features for ML."""
    try:
        print(f"🔄 Extrayendo {sample_size} registros para entrenamiento...")

        # Connect to PostgreSQL
        conn = await asyncpg.connect(
            host=AWS_ENDPOINT,
            port=DB_PORT,
            database=DB_NAME,
            user=DB_USER,
            password=DB_PASSWORD
        )

        # Extract a random sample; sample_size is passed as a bound parameter
        query = """
        SELECT
            pickup_longitude, pickup_latitude,
            dropoff_longitude, dropoff_latitude,
            passenger_count, vendor_id,
            pickup_datetime, trip_duration_seconds
        FROM taxi_trips
        ORDER BY RANDOM()
        LIMIT $1
        """

        # Close the connection even if the query fails
        try:
            rows = await conn.fetch(query, sample_size)
        finally:
            await conn.close()

        # Convert to a DataFrame
        df = pd.DataFrame([dict(row) for row in rows])

        # asyncpg returns PostgreSQL numeric columns as decimal.Decimal; cast them to float
        numeric_columns = [
            'pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude',
            'trip_duration_seconds'
        ]
        for col in numeric_columns:
            df[col] = df[col].astype(float)

        print(f"📊 Datos extraídos: {df.shape}")

        # Feature Engineering
        print("🔨 Creando features...")

        # 1. Haversine distance
        def haversine_distance(lat1, lon1, lat2, lon2):
            R = 6371  # Earth radius in km
            lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
            dlat = lat2 - lat1
            dlon = lon2 - lon1
            a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
            return 2 * R * np.arcsin(np.sqrt(a))

        df['distance_km'] = haversine_distance(
            df['pickup_latitude'], df['pickup_longitude'],
            df['dropoff_latitude'], df['dropoff_longitude']
        )

        # 2. Temporal features
        df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
        df['hour_of_day'] = df['pickup_datetime'].dt.hour
        df['day_of_week'] = df['pickup_datetime'].dt.dayofweek
        df['month'] = df['pickup_datetime'].dt.month

        # 3. Categorical features (binary flags)
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_rush_hour'] = df['hour_of_day'].isin([7, 8, 9, 17, 18, 19]).astype(int)

        # 4. Target (convert to minutes)
        df['duration_minutes'] = df['trip_duration_seconds'] / 60

        # Select the final features
        feature_columns = [
            'distance_km', 'passenger_count', 'vendor_id',
            'hour_of_day', 'day_of_week', 'month',
            'is_weekend', 'is_rush_hour'
        ]

        X = df[feature_columns]
        y = df['duration_minutes']

        print(f"✅ Features creadas: {X.shape}")
        print(f"📈 Features: {list(X.columns)}")
        print(f"🎯 Target: duration_minutes (promedio: {y.mean():.1f} min)")

        return X, y, df

    except Exception as e:
        print(f"❌ Error en feature engineering: {e}")
        return None, None, None

print("✅ Función corregida lista!")

✅ Función corregida lista!


In [12]:
# 🚀 RUN THE FIXED EXPERIMENTS
async def run_ml_experiments_fixed():
    """Run all ML experiments using the fixed extraction function."""

    print("🚀 Iniciando experimentos de ML...")

    # 1. Extract data and build features (fixed version)
    X, y, df = await extract_and_engineer_features_fixed(10000)

    if X is None:
        print("❌ Error en extracción de datos")
        return

    # 2. Define the models to test
    models_to_test = [
        {
            'model': RandomForestRegressor(n_estimators=100, random_state=42),
            'name': 'RandomForest',
            'params': {'n_estimators': 100, 'random_state': 42}
        },
        {
            'model': LinearRegression(),
            'name': 'LinearRegression',
            'params': {}
        },
        {
            'model': xgb.XGBRegressor(n_estimators=100, random_state=42, verbosity=0),
            'name': 'XGBoost',
            'params': {'n_estimators': 100, 'random_state': 42}
        }
    ]

    # 3. Train all models
    results = []

    for model_config in models_to_test:
        result = train_and_evaluate_model(
            X, y,
            model_config['model'],
            model_config['name'],
            model_config['params']
        )
        results.append(result)
        print("-" * 50)

    # 4. Results summary
    print("\n📊 **RESUMEN DE EXPERIMENTOS:**")
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('rmse')

    print(results_df[['model_name', 'rmse', 'mae', 'r2']].to_string(index=False))

    best_model = results_df.iloc[0]
    print(f"\n🏆 **MEJOR MODELO: {best_model['model_name']}**")
    print(f"   RMSE: {best_model['rmse']:.2f} minutos")
    print(f"   MAE: {best_model['mae']:.2f} minutos")
    print(f"   R²: {best_model['r2']:.3f}")

    print(f"\n🔬 **MLflow UI:** mlflow ui --backend-store-uri sqlite:///mlflow.db")

    return results_df

# Run the experiments!
results = await run_ml_experiments_fixed()

🚀 Iniciando experimentos de ML...
🔄 Extrayendo 10000 registros para entrenamiento...
📊 Datos extraídos: (10000, 8)
🔨 Creando features...
✅ Features creadas: (10000, 8)
📈 Features: ['distance_km', 'passenger_count', 'vendor_id', 'hour_of_day', 'day_of_week', 'month', 'is_weekend', 'is_rush_hour']
🎯 Target: duration_minutes (promedio: 14.0 min)
🔄 Entrenando RandomForest...




✅ RandomForest entrenado:
   RMSE: 6.62 minutos
   MAE: 4.27 minutos
   R²: 0.681
   Tiempo: 5.1s
--------------------------------------------------
🔄 Entrenando LinearRegression...




✅ LinearRegression entrenado:
   RMSE: 7.47 minutos
   MAE: 4.85 minutos
   R²: 0.595
   Tiempo: 0.1s
--------------------------------------------------
🔄 Entrenando XGBoost...




✅ XGBoost entrenado:
   RMSE: 6.85 minutos
   MAE: 4.31 minutos
   R²: 0.659
   Tiempo: 0.3s
--------------------------------------------------

📊 **RESUMEN DE EXPERIMENTOS:**
      model_name     rmse      mae       r2
    RandomForest 6.623458 4.274288 0.681264
         XGBoost 6.849632 4.305443 0.659124
LinearRegression 7.468528 4.847471 0.594741

🏆 **MEJOR MODELO: RandomForest**
   RMSE: 6.62 minutos
   MAE: 4.27 minutos
   R²: 0.681

🔬 **MLflow UI:** mlflow ui --backend-store-uri sqlite:///mlflow.db
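
The notebook selects the best model but stops short of the registration step listed in the objectives. The sketch below shows one way it might look. It assumes `train_and_evaluate_model` were extended to also return the run ID (for example via `mlflow.active_run().info.run_id` inside the `with` block). The helpers `select_best` and `register_best` are hypothetical, and the `run_id` values are placeholders rather than real run IDs.

```python
from typing import Iterable, Mapping


def select_best(results: Iterable[Mapping], metric: str = "rmse") -> Mapping:
    """Return the result with the lowest value of `metric`."""
    return min(results, key=lambda r: r[metric])


def register_best(results: Iterable[Mapping], registry_name: str):
    """Register the best run's model; assumes each result carries a `run_id`."""
    import mlflow  # imported lazily so select_best works without MLflow installed

    best = select_best(results)
    # "model" matches the artifact path used in mlflow.sklearn.log_model(model, "model").
    return mlflow.register_model(f"runs:/{best['run_id']}/model", registry_name)


# Metrics from the run above (rounded); the run_id values are placeholders.
results = [
    {"model_name": "RandomForest", "rmse": 6.62, "mae": 4.27, "run_id": "<run-id-1>"},
    {"model_name": "LinearRegression", "rmse": 7.47, "mae": 4.85, "run_id": "<run-id-2>"},
    {"model_name": "XGBoost", "rmse": 6.85, "mae": 4.31, "run_id": "<run-id-3>"},
]
best = select_best(results)
```

With real run IDs, `register_best(results, "taxi_duration_model")` would create a new version under that (hypothetical) registry name in the same `sqlite:///mlflow.db` store.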


## 📋 **PHASE 3 Summary**

### ✅ **Steps to run:**
1. **Run setup** (cell 2)
2. **Configure MLflow** (cell 4)
3. **Run experiments**: `results = await run_ml_experiments_fixed()` (the original `run_ml_experiments()` fails on the `Decimal` columns)

### 🎯 **What we'll achieve:**
- **MLflow tracking** of experiments
- **Comparison** of 3 different models
- **Feature engineering**: Haversine distance, temporal features, and weekend/rush-hour flags
- **Standardized metrics** (RMSE, MAE, R²)

### 🔧 **To view the MLflow UI:**
```bash
mlflow ui --backend-store-uri sqlite:///mlflow.db
```

### 🎓 **For the course:**
*"We implemented the MLModelService adapter with scikit-learn and XGBoost, showing how hexagonal architecture lets us compare different algorithms without changing the business logic. MLflow gives us professional experiment tracking."*