# Module 1.3: Advanced Regression with MLflow

## Theory: Regression in Machine Learning

### What is regression?
Regression is a supervised learning technique that predicts continuous values, unlike classification, which predicts discrete categories.

### Types of regression:
1. **Linear Regression**: Assumes a linear relationship between the features and the target
   - Simple: one independent variable
   - Multiple: several independent variables

2. **Polynomial Regression**: Captures nonlinear relationships by adding polynomial terms of the features

3. **Ridge Regression (L2)**: Adds regularization to reduce overfitting
   - Penalizes large coefficients
   - Useful when features are multicollinear

4. **Lasso Regression (L1)**: Regularization that can eliminate features
   - Automatic feature selection
   - Produces sparse models

5. **Elastic Net**: Combines the L1 and L2 penalties

6. **Decision trees and ensembles**: Random Forest, Gradient Boosting
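
Polynomial regression is still linear in its coefficients: only the feature matrix changes. The sketch below illustrates this with NumPy on synthetic data. The quadratic target, the noise level, and the helper `fit_polynomial` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
# Synthetic quadratic target (illustrative assumption, not the housing data)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.3, size=200)

def fit_polynomial(x, y, degree):
    """Least-squares fit on the features [1, x, x^2, ..., x^degree]."""
    X = np.vander(x, N=degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse = float(np.mean((y - X @ coef) ** 2))
    return coef, mse

coef_lin, mse_lin = fit_polynomial(x, y, degree=1)
coef_quad, mse_quad = fit_polynomial(x, y, degree=2)

print(f"Degree 1 training MSE: {mse_lin:.4f}")
print(f"Degree 2 training MSE: {mse_quad:.4f}")
```

In scikit-learn the same idea is usually expressed by chaining `PolynomialFeatures` and `LinearRegression` in a `Pipeline`, all of which are already imported in this notebook.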

### Evaluation metrics:
- **MAE (Mean Absolute Error)**: Average absolute error
- **MSE (Mean Squared Error)**: Average squared error (penalizes large errors more heavily)
- **RMSE (Root Mean Squared Error)**: Square root of the MSE (same units as the target)
- **R² Score**: Proportion of variance explained (1 is a perfect fit, 0 matches always predicting the mean, and the score is negative for models that do worse than that)
- **MAPE (Mean Absolute Percentage Error)**: Average absolute error relative to the true value
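
To make the definitions concrete, the following sketch computes each metric by hand with NumPy on a tiny made-up example. The arrays and the helper `regression_metrics` are hypothetical. The notebook itself relies on the `sklearn.metrics` functions, which should agree with these formulas.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the metrics listed above directly from their definitions."""
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mae": float(np.mean(np.abs(errors))),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": 1.0 - ss_res / ss_tot,
        # Fraction, not percent; undefined when y_true contains zeros
        "mape": float(np.mean(np.abs(errors) / np.abs(y_true))),
    }

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
good = regression_metrics(y_true, y_pred)

# A constant prediction far from the data does worse than predicting the mean
bad = regression_metrics(y_true, np.full_like(y_true, 10.0))

print(good)
print(f"R² of the poor model: {bad['r2']:.3f}")
```

The second call shows that R² is not bounded below by 0: a model that does worse than predicting the mean gets a negative score.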

## Objectives
- Implement multiple regression algorithms
- Track experiments end to end with MLflow
- Feature engineering and selection
- Residual analysis
- Regularization and tuning

In [None]:
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing, load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet, 
    HuberRegressor, RANSACRegressor
)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor,
    AdaBoostRegressor, ExtraTreesRegressor
)
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    mean_absolute_percentage_error
)
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')

In [None]:
# Assumes an MLflow tracking server is already running at this address
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("sklearn-regression-advanced")

## 1. Dataset: California Housing

California housing price dataset with 8 features.

In [None]:
california = fetch_california_housing()
X = california.data
y = california.target

df = pd.DataFrame(X, columns=california.feature_names)
df['MedHouseVal'] = y

print("Dataset shape:", df.shape)
print("\nFeatures:")
for i, feature in enumerate(california.feature_names):
    print(f"{i+1}. {feature}")

print("\nTarget variable (MedHouseVal): Median house value in $100,000s")
print(f"Mean: ${y.mean()*100000:.2f}")
print(f"Median: ${np.median(y)*100000:.2f}")
print(f"Min: ${y.min()*100000:.2f}")
print(f"Max: ${y.max()*100000:.2f}")

print("\nStatistical summary:")
print(df.describe())

### Exploratory Data Analysis (EDA)

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for i, col in enumerate(california.feature_names):
    axes[i].scatter(df[col], df['MedHouseVal'], alpha=0.3, s=1)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('MedHouseVal')
    axes[i].set_title(f'{col} vs MedHouseVal')
    
    corr = np.corrcoef(df[col], df['MedHouseVal'])[0, 1]
    axes[i].text(0.05, 0.95, f'Corr: {corr:.3f}', 
                transform=axes[i].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

axes[8].hist(df['MedHouseVal'], bins=50, edgecolor='black')
axes[8].set_xlabel('MedHouseVal')
axes[8].set_ylabel('Frequency')
axes[8].set_title('Distribution of Target Variable')

plt.tight_layout()
plt.savefig('eda_california.png', dpi=150)
plt.show()

### Correlation matrix

In [None]:
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=150)
plt.show()

print("Correlations with the target (MedHouseVal):")
print(correlation_matrix['MedHouseVal'].sort_values(ascending=False))

## 2. Data Preparation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse its statistics on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Train set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

## 3. Full Evaluation Function

In [None]:
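def evaluate_regression_model(model, X_train, y_train, X_test, y_test, model_name="Model"):
    """Compute train/test metrics and save a 2x2 residual diagnostics figure.

    Returns (metrics dict, test residuals, path of the saved plot).
    """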
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    metrics = {
        'train_mae': mean_absolute_error(y_train, y_pred_train),
        'train_mse': mean_squared_error(y_train, y_pred_train),
        'train_rmse': np.sqrt(mean_squared_error(y_train, y_pred_train)),
        'train_r2': r2_score(y_train, y_pred_train),
        
        'test_mae': mean_absolute_error(y_test, y_pred_test),
        'test_mse': mean_squared_error(y_test, y_pred_test),
        'test_rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
        'test_r2': r2_score(y_test, y_pred_test),
        'test_mape': mean_absolute_percentage_error(y_test, y_pred_test)
    }
    
    residuals = y_test - y_pred_test
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    axes[0, 0].scatter(y_test, y_pred_test, alpha=0.5, s=10)
    axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    axes[0, 0].set_xlabel('Actual Values')
    axes[0, 0].set_ylabel('Predicted Values')
    axes[0, 0].set_title(f'Actual vs Predicted - {model_name}')
    axes[0, 0].text(0.05, 0.95, f"R² = {metrics['test_r2']:.4f}", 
                   transform=axes[0, 0].transAxes, verticalalignment='top',
                   bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
    
    axes[0, 1].scatter(y_pred_test, residuals, alpha=0.5, s=10)
    axes[0, 1].axhline(y=0, color='r', linestyle='--', lw=2)
    axes[0, 1].set_xlabel('Predicted Values')
    axes[0, 1].set_ylabel('Residuals')
    axes[0, 1].set_title('Residual Plot')
    
    axes[1, 0].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
    axes[1, 0].set_xlabel('Residuals')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Distribution of Residuals')
    axes[1, 0].axvline(x=0, color='r', linestyle='--', lw=2)
    
    stats.probplot(residuals, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plot_filename = f'evaluation_{model_name.replace(" ", "_")}.png'
    plt.savefig(plot_filename, dpi=150)
    plt.close()
    
    return metrics, residuals, plot_filename

## 4. Experiment 1: Basic Linear Regression

### Theory: Linear Regression
Linear regression assumes that the relationship between the independent variables (X) and the dependent variable (y) is linear:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon$$

Where:
- $\beta_0$ is the intercept
- $\beta_i$ are the coefficients
- $\epsilon$ is the error term

The goal is to minimize the sum of squared errors, which is equivalent to minimizing the MSE.
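
As a minimal sketch of that least-squares objective, the snippet below recovers the coefficients of a synthetic linear model with `np.linalg.lstsq`. The data, the noise level, and the values in `true_beta` are illustrative assumptions. `LinearRegression` in scikit-learn solves the same problem.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 500
X = rng.normal(size=(n_samples, 2))

# Hypothetical ground truth: [beta_0, beta_1, beta_2]
true_beta = np.array([1.5, -2.0, 0.7])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so the intercept beta_0 is estimated with the other coefficients
X_design = np.column_stack([np.ones(n_samples), X])

# Least-squares solution: minimizes the sum of squared errors ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated coefficients:", np.round(beta_hat, 3))
```

With enough samples and little noise, the estimates should land close to `true_beta`.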

In [None]:
with mlflow.start_run(run_name="linear_regression_baseline"):
    
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_param("scaler", "StandardScaler")
    mlflow.log_param("n_features", X_train.shape[1])
    
    metrics, residuals, plot_file = evaluate_regression_model(
        model, X_train_scaled, y_train, X_test_scaled, y_test, "LinearRegression"
    )
    
    for metric_name, metric_value in metrics.items():
        mlflow.log_metric(metric_name, metric_value)
    
    mlflow.log_artifact(plot_file)
    
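    # Rank features by absolute coefficient size (inputs are standardized, so magnitudes are comparable)
    coefficients_df = pd.DataFrame({
        'feature': california.feature_names,
        'coefficient': model.coef_
    }).sort_values('coefficient', ascending=False, key=np.abs)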
    coefficients_df.to_csv('linear_reg_coefficients.csv', index=False)
    mlflow.log_artifact('linear_reg_coefficients.csv')
    
    mlflow.sklearn.log_model(model, "linear_regression_model")
    
    mlflow.set_tag("model_family", "linear")
    mlflow.set_tag("dataset", "california_housing")
    
    print("Linear Regression Results:")
    print(f"Train R²: {metrics['train_r2']:.4f}")
    print(f"Test R²: {metrics['test_r2']:.4f}")
    print(f"Test RMSE: ${metrics['test_rmse']*100000:.2f}")
    print(f"Test MAE: ${metrics['test_mae']*100000:.2f}")
    print(f"Test MAPE: {metrics['test_mape']*100:.2f}%")
    
    print("\nFeature Importance (by coefficient magnitude):")
    print(coefficients_df)

## 5. Experiment 2: Regularization (Ridge, Lasso, ElasticNet)

### Theory: Regularization

**Ridge (L2 Regularization)**:
$$\text{Cost} = MSE + \alpha \sum_{i=1}^{n} \beta_i^2$$

- Penalizes large coefficients
- Does not eliminate features (coefficients shrink toward 0 but are generally never exactly 0)
- Useful when features are multicollinear

**Lasso (L1 Regularization)**:
$$\text{Cost} = MSE + \alpha \sum_{i=1}^{n} |\beta_i|$$

- Can eliminate features (coefficients exactly 0)
- Automatic feature selection
- More interpretable models

**ElasticNet**: Combines L1 and L2
$$\text{Cost} = MSE + \alpha_1 \sum_{i=1}^{n} |\beta_i| + \alpha_2 \sum_{i=1}^{n} \beta_i^2$$

scikit-learn's `ElasticNet` sets these two weights through `alpha` (overall penalty strength) and `l1_ratio` (share of the penalty assigned to L1).
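
The different behavior of the two penalties can be sketched in plain NumPy. The helpers `ridge_fit`, `soft_threshold`, and `lasso_fit` below are hypothetical, the data is synthetic, and the exact scaling of the loss term is an assumption chosen for convenience (it differs by constant factors from the formulas above). The intercept is omitted to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # only features 0 and 3 matter
y = X @ true_beta + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, alpha):
    """Closed-form minimizer of ||y - Xb||^2 + alpha * ||b||^2 (no intercept)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning exactly 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_fit(X, y, alpha, n_sweeps=100):
    """Coordinate descent for (1 / (2n)) * ||y - Xb||^2 + alpha * ||b||_1 (no intercept)."""
    n_samples, n_features = X.shape
    b = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution added back
            partial_residual = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            b[j] = soft_threshold(rho, alpha) / z
    return b

ridge_small = ridge_fit(X, y, alpha=0.1)
ridge_large = ridge_fit(X, y, alpha=1000.0)
lasso_coef = lasso_fit(X, y, alpha=0.5)

print("Ridge, alpha=0.1:  ", np.round(ridge_small, 3))
print("Ridge, alpha=1000: ", np.round(ridge_large, 3))
print("Lasso, alpha=0.5:  ", np.round(lasso_coef, 3))
```

Ridge shrinks every coefficient but leaves all of them non-zero, while the soft-thresholding step in Lasso sets the coefficients of the irrelevant features to exactly zero.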

In [None]:
regularization_models = [
    {'name': 'Ridge_alpha_0.1', 'model': Ridge(alpha=0.1, random_state=42)},
    {'name': 'Ridge_alpha_1.0', 'model': Ridge(alpha=1.0, random_state=42)},
    {'name': 'Ridge_alpha_10', 'model': Ridge(alpha=10.0, random_state=42)},
    {'name': 'Lasso_alpha_0.01', 'model': Lasso(alpha=0.01, random_state=42)},
    {'name': 'Lasso_alpha_0.1', 'model': Lasso(alpha=0.1, random_state=42)},
    {'name': 'ElasticNet_alpha_0.1', 'model': ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)},
]

with mlflow.start_run(run_name="regularization_comparison") as parent_run:
    
    mlflow.set_tag("experiment_type", "regularization")
    
    results = []
    
    for config in regularization_models:
        with mlflow.start_run(run_name=config['name'], nested=True):
            
            model = config['model']
            model.fit(X_train_scaled, y_train)
            
            mlflow.log_param("model_type", config['name'].split('_')[0])
            
            if hasattr(model, 'alpha'):
                mlflow.log_param("alpha", model.alpha)
            if hasattr(model, 'l1_ratio'):
                mlflow.log_param("l1_ratio", model.l1_ratio)
            
            metrics, residuals, plot_file = evaluate_regression_model(
                model, X_train_scaled, y_train, X_test_scaled, y_test, config['name']
            )
            
            for metric_name, metric_value in metrics.items():
                mlflow.log_metric(metric_name, metric_value)
            
            mlflow.log_artifact(plot_file)
            
            # Count non-zero coefficients; cast to a plain int for logging
            n_features_used = int(np.sum(model.coef_ != 0))
            mlflow.log_metric("n_features_used", n_features_used)
            
            mlflow.sklearn.log_model(model, f"model_{config['name']}")
            
            results.append({
                'model': config['name'],
                'test_r2': metrics['test_r2'],
                'test_rmse': metrics['test_rmse'],
                'test_mae': metrics['test_mae'],
                'n_features_used': n_features_used
            })
            
            print(f"{config['name']}: R²={metrics['test_r2']:.4f}, "
                  f"RMSE={metrics['test_rmse']:.4f}, Features={n_features_used}")
    
    results_df = pd.DataFrame(results)
    results_df.to_csv('regularization_comparison.csv', index=False)
    mlflow.log_artifact('regularization_comparison.csv')
    
    print("\nRegularization Comparison:")
    print(results_df)

## 6. Experiment 3: Tree-Based Models

### Theory: Tree-based models

**Random Forest**:
- Ensemble of decision trees
- Bootstrap aggregating (bagging)
- Reduces variance
- Relatively robust to outliers

**Gradient Boosting**:
- Sequential ensemble
- Each tree corrects the errors of the previous ones
- Often outperforms Random Forest when well tuned
- More prone to overfitting
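
The sequential idea behind gradient boosting can be illustrated with a toy implementation for squared error, where each new learner is fit to the residuals of the current ensemble. This is a simplified sketch under stated assumptions: one-dimensional synthetic data, depth-1 trees (stumps), and hypothetical helpers `fit_stump` and `predict_stump`. It is not how `GradientBoostingRegressor` is implemented internally.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        sse = (((residuals[left] - left_value) ** 2).sum()
               + ((residuals[~left] - right_value) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=150)
y = np.sin(x) + rng.normal(scale=0.1, size=150)  # synthetic nonlinear target

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from the mean prediction
mse_history = [float(np.mean((y - prediction) ** 2))]

for _ in range(30):
    # Each stump is trained on what the current ensemble still gets wrong
    stump = fit_stump(x, y - prediction)
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(f"Training MSE: start={mse_history[0]:.4f}, after 30 rounds={mse_history[-1]:.4f}")
```

The training error never increases from one round to the next, which is also why boosting can overfit if the number of rounds is not controlled with validation data.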

In [None]:
tree_models = [
    {'name': 'RandomForest_50', 'model': RandomForestRegressor(n_estimators=50, random_state=42)},
    {'name': 'RandomForest_100', 'model': RandomForestRegressor(n_estimators=100, random_state=42)},
    {'name': 'GradientBoosting', 'model': GradientBoostingRegressor(n_estimators=100, random_state=42)},
    {'name': 'ExtraTrees', 'model': ExtraTreesRegressor(n_estimators=100, random_state=42)},
]

with mlflow.start_run(run_name="tree_models_comparison") as parent_run:
    
    results = []
    
    for config in tree_models:
        with mlflow.start_run(run_name=config['name'], nested=True):
            
            model = config['model']
            # Tree-based models do not need feature scaling, so the raw features are used
            model.fit(X_train, y_train)
            
            mlflow.log_param("model_type", config['name'].split('_')[0])
            if hasattr(model, 'n_estimators'):
                mlflow.log_param("n_estimators", model.n_estimators)
            
            metrics, residuals, plot_file = evaluate_regression_model(
                model, X_train, y_train, X_test, y_test, config['name']
            )
            
            for metric_name, metric_value in metrics.items():
                mlflow.log_metric(metric_name, metric_value)
            
            mlflow.log_artifact(plot_file)
            
            if hasattr(model, 'feature_importances_'):
                importance_df = pd.DataFrame({
                    'feature': california.feature_names,
                    'importance': model.feature_importances_
                }).sort_values('importance', ascending=False)
                
                plt.figure(figsize=(10, 6))
                plt.barh(importance_df['feature'], importance_df['importance'])
                plt.xlabel('Importance')
                plt.title(f'Feature Importance - {config["name"]}')
                plt.tight_layout()
                importance_plot = f'importance_{config["name"]}.png'
                plt.savefig(importance_plot)
                mlflow.log_artifact(importance_plot)
                plt.close()
            
            mlflow.sklearn.log_model(model, f"model_{config['name']}")
            
            results.append({
                'model': config['name'],
                'test_r2': metrics['test_r2'],
                'test_rmse': metrics['test_rmse'],
                'test_mae': metrics['test_mae']
            })
            
            print(f"{config['name']}: R²={metrics['test_r2']:.4f}, RMSE={metrics['test_rmse']:.4f}")
    
    results_df = pd.DataFrame(results)
    print("\nTree Models Comparison:")
    print(results_df)

## 7. Global Comparison of All Models

In [None]:
experiment = mlflow.get_experiment_by_name("sklearn-regression-advanced")
all_runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

runs_with_metrics = all_runs.dropna(subset=['metrics.test_r2'])
top_runs = runs_with_metrics.nlargest(10, 'metrics.test_r2')

print("Top 10 Models by R² Score:")
print(top_runs[['tags.mlflow.runName', 'metrics.test_r2', 'metrics.test_rmse', 'metrics.test_mae']])

plt.figure(figsize=(12, 6))
plt.barh(range(len(top_runs)), top_runs['metrics.test_r2'])
plt.yticks(range(len(top_runs)), top_runs['tags.mlflow.runName'])
plt.xlabel('R² Score')
plt.title('Top 10 Models - R² Score Comparison')
plt.tight_layout()
plt.savefig('final_model_comparison.png', dpi=150)
plt.show()

## Module 1.3 Summary

### Key Concepts:

1. **Linear Regression and Regularization**
   - Ridge: shrinks coefficient magnitudes
   - Lasso: can eliminate features
   - ElasticNet: combines both penalties

2. **Tree-based models**
   - Random Forest: bagging, reduces variance
   - Gradient Boosting: boosting, each tree corrects the previous errors

3. **Regression Metrics**
   - R²: proportion of variance explained
   - RMSE: error in the units of the target
   - MAE: average absolute error
   - MAPE: average percentage error

4. **Residual Analysis**
   - Scatter plots
   - Histograms
   - Q-Q plots
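
A Q-Q plot can also be summarized numerically: if residuals are roughly normal, their sorted values line up with the theoretical normal quantiles and the correlation between the two is close to 1. The sketch below assumes synthetic residuals and a hypothetical helper `qq_correlation`. It is meant as a companion to the plots produced by `evaluate_regression_model`, not a formal normality test.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(residuals):
    """Correlation between sorted residuals and standard normal quantiles."""
    n = len(residuals)
    probabilities = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probabilities])
    return float(np.corrcoef(theoretical, np.sort(residuals))[0, 1])

rng = np.random.default_rng(0)
# Synthetic residuals for illustration only
normal_residuals = rng.normal(scale=0.5, size=500)
skewed_residuals = rng.exponential(scale=0.5, size=500) - 0.5  # mean near 0, right-skewed

qq_normal = qq_correlation(normal_residuals)
qq_skewed = qq_correlation(skewed_residuals)

print(f"Q-Q correlation, normal residuals: {qq_normal:.4f}")
print(f"Q-Q correlation, skewed residuals: {qq_skewed:.4f}")
```

A noticeably lower value for a model's residuals suggests skew or heavy tails that are worth inspecting in the histogram and Q-Q plot.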

### Best Practices with MLflow:
- Nested runs for comparisons
- Complete metric logging
- Automatic visualizations
- Feature importance tracking

### Next Module:
Deep Learning with TensorFlow/Keras