Student: Antonio Neto

*I had to miss your class on 27/03 to translate into English and improve the TDE itself. Translated to English via DeepL.com (free version).*

Task

Steps:
a) Load the Air Quality Index Dataset from the link available in Canvas;
b) Plot the target data NO2(G7);
c) Adjust target data to prepare for modeling;
d) Use time series split (5 splits) for validation;
e) Compare 3 regression models (Linear Regression, Random Forest, Naive);
f) Check MSE and MAE.
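
Steps c) to f) can be sketched as follows. This is a minimal illustration under stated assumptions, not the assignment's solution: a synthetic hourly series stands in for the NO2 target because the Canvas file is not reproduced here, and the three lag features, the forest size, and the definition of the naive model as "repeat the previous value" are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic hourly series standing in for the NO2 target (assumption: the real
# values would be read from the Canvas dataset instead).
rng = np.random.default_rng(0)
n = 600
t = np.arange(n)
series = pd.Series(100 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, n))

# Step c: recast the series as a supervised problem with lagged values as features.
n_lags = 3  # hypothetical choice
frame = pd.DataFrame({f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)})
frame["target"] = series
frame = frame.dropna()
X = frame.drop(columns="target").to_numpy()
y = frame["target"].to_numpy()

# Step e: the three models; None marks the naive baseline.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Naive": None,
}

# Step d: every training fold ends before its test fold begins.
tscv = TimeSeriesSplit(n_splits=5)
scores = {name: {"MSE": [], "MAE": []} for name in models}
for train_idx, test_idx in tscv.split(X):
    for name, model in models.items():
        if model is None:
            pred = X[test_idx, 0]  # column 0 is lag_1, the previous observation
        else:
            pred = model.fit(X[train_idx], y[train_idx]).predict(X[test_idx])
        # Step f: both metrics are non-negative by construction.
        scores[name]["MSE"].append(mean_squared_error(y[test_idx], pred))
        scores[name]["MAE"].append(mean_absolute_error(y[test_idx], pred))

summary = pd.DataFrame(
    [[np.mean(s["MSE"]), np.mean(s["MAE"])] for s in scores.values()],
    index=list(scores), columns=["MSE", "MAE"],
)
print(summary.round(2))
```

Unlike shuffled k-fold, `TimeSeriesSplit` never trains on observations that come after the test fold, which is why it suits a time-ordered target.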

Data Project Management Track - Air Quality Index Project Planning

Detail the dataset info

Plan project steps according to: CRISP-DM, KDD, or TDSP

In [2]:
# ================================================
# CALIFORNIA HOUSING PRICE PREDICTION
# ================================================

# Libraries
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PowerTransformer, StandardScaler, MinMaxScaler
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error
import numpy as np

# Load dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Evaluation metrics (MAE and MSE)
scorers = {
    'MAE': make_scorer(mean_absolute_error),
    'MSE': make_scorer(mean_squared_error)
}

# Models to compare
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=5)  # Limited depth to prevent overfitting
}

# Preprocessing techniques
preprocess = {
    'Original': None,                   # No transformation
    'Power Transform': PowerTransformer(),  # Handles non-linear data
    'Z-score': StandardScaler(),        # Standardization (mean=0, std=1)
    'Min-Max': MinMaxScaler()           # Normalization to [0,1] range
}

# Evaluation with 5-fold cross-validation
results = {}
for model_name, model in models.items():
    for prep_name, prep in preprocess.items():
        X_processed = X.copy()
        if prep:
            X_processed = prep.fit_transform(X_processed)

        # Calculate MSE
        mse_scores = cross_val_score(
            model, X_processed, y, cv=5,
            scoring=scorers['MSE']
        )
        mse = np.mean(mse_scores)  # make_scorer(mean_squared_error) already returns positive errors, so no sign flip

        # Calculate MAE
        mae_scores = cross_val_score(
            model, X_processed, y, cv=5,
            scoring=scorers['MAE']
        )
        mae = np.mean(mae_scores)

        # Store results
        results[f"{model_name} + {prep_name}"] = {'MSE': mse, 'MAE': mae}

# Display results
print("\n" + "="*50)
print("MODEL COMPARISON RESULTS")
print("="*50)
for name, metrics in results.items():
    print(f"\n{name}:")
    print(f"  MSE: {metrics['MSE']:.2f}")  # Mean Squared Error
    print(f"  MAE: {metrics['MAE']:.2f}")   # Mean Absolute Error


MODEL COMPARISON RESULTS

Linear Regression + Original:
  MSE: 0.56
  MAE: 0.55

Linear Regression + Power Transform:
  MSE: 0.61
  MAE: 0.60

Linear Regression + Z-score:
  MSE: 0.56
  MAE: 0.55

Linear Regression + Min-Max:
  MSE: 0.56
  MAE: 0.55

Decision Tree + Original:
  MSE: 0.67
  MAE: 0.60

Decision Tree + Power Transform:
  MSE: 0.67
  MAE: 0.61

Decision Tree + Z-score:
  MSE: 0.67
  MAE: 0.60

Decision Tree + Min-Max:
  MSE: 0.67
  MAE: 0.60
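
The loop above transforms the full feature matrix before cross-validation, so each test fold influences the fitted scaler. One possible alternative, sketched here and not taken from the notebook, wraps the scaler and the model in a pipeline so the scaler is refit on the training part of every fold, and collects both metrics in a single `cross_validate` call. The synthetic `X_demo` and `y_demo` are stand-ins (an assumption) so the sketch runs without downloading the housing data; the notebook's `X` and `y` could be passed in their place.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption); not the California housing features.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# The scaler is fit only on the training rows of each fold.
pipe = make_pipeline(StandardScaler(), LinearRegression())

cv_out = cross_validate(
    pipe, X_demo, y_demo, cv=5,
    scoring={"MSE": "neg_mean_squared_error", "MAE": "neg_mean_absolute_error"},
)

# The built-in neg_* scorers return negated errors, so one sign flip is needed here.
mse = -cv_out["test_MSE"].mean()
mae = -cv_out["test_MAE"].mean()
print(f"MSE: {mse:.2f}  MAE: {mae:.2f}")
```

The sign flip belongs only with the `neg_*` scorers; a scorer built with `make_scorer(mean_squared_error)` and default arguments already yields positive values.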


 **Report**  
The comparative analysis of California housing price prediction models, implemented with Python, scikit-learn, pandas, and NumPy, showed that the Decision Tree delivered the more consistent performance (MSE: 0.67, MAE: 0.60 to 0.61) across all preprocessing scenarios (Original, Power Transform, Z-score, and Min-Max), highlighting its robustness to data transformations. Errors are quoted here as positive magnitudes, since MSE cannot be negative. Linear Regression, however, achieved the lower error in every scenario: with original, standardized (Z-score), or Min-Max data it held an MSE of 0.56 and an MAE of 0.55, and the Power Transform raised these to 0.61 and 0.60, still below the Decision Tree's 0.67 and 0.61. These findings indicate that, for this dataset, Linear Regression is the more accurate choice and the Decision Tree the more transformation-robust one. At the two-decimal precision shown, none of the preprocessing techniques improved either model over the original features.
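
The tree's insensitivity to scaling follows from how it splits: each threshold depends only on the ordering of a feature's values, and Z-score or Min-Max scaling preserves that ordering. The sketch below illustrates this on synthetic stand-in data (an assumption; it does not reproduce the housing experiment). The helper `tree_predictions` is hypothetical and introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (assumption); not the California housing features.
X_syn, y_syn = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

def tree_predictions(scaler=None):
    """Fit a depth-5 tree on optionally scaled features and predict the held-out rows."""
    if scaler is not None:
        scaler = scaler.fit(X_tr)  # fit on training rows only
        train, test = scaler.transform(X_tr), scaler.transform(X_te)
    else:
        train, test = X_tr, X_te
    tree = DecisionTreeRegressor(max_depth=5, random_state=0)
    return tree.fit(train, y_tr).predict(test)

baseline = tree_predictions()
for scaler in (StandardScaler(), MinMaxScaler()):
    same = np.allclose(baseline, tree_predictions(scaler))
    print(type(scaler).__name__, "matches unscaled tree:", same)
```

Small differences can still appear in practice when a transformation pushes distinct feature values close enough together to be treated as ties, so identical scores across scalers are expected but not guaranteed.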