# Housing prices in Hyderabad, India

## Project Objective 🎯

The objective of this project is to develop a regression model to predict housing prices in Hyderabad, India. Using features such as the property's area, location, number of bedrooms, and available amenities, the model will aim to estimate the market value of a property as accurately as possible.

This predictive model will be a valuable tool for:

- Home Buyers and Sellers: To obtain an objective price estimate for a property.
- Real Estate Agents: To assist with property valuation and client advisory.
- Investors: To identify potentially undervalued or overvalued properties in the market.

## 4. Training the models

### 4.1 Loading the dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from IPython.display import display
import sys


training_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_features.parquet')
training_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_labels.parquet')

test_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_features.parquet')
test_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_labels.parquet')

target_metrics = '../../datasets/processed/housing_prices/hyderabad_house_price_metrics.csv'

### 4.2 Training a default regression model using cross-validation

**Problem:**

We need to train the model, but we want to ensure that our training set is sufficiently representative. Furthermore, we need to obtain a reliable and stable estimate of the model's performance, as a single data split can lead to misleading results (either too optimistic or too pessimistic).

**Justification:**

Cross-validation addresses this problem. By dividing the data into k folds and iteratively training and validating on different subsets, we obtain a more robust measure of the model's generalization ability. However, the choice of k itself influences the bias and stability of the metrics. With a very low k, each model trains on a smaller share of the data, so the performance estimate tends to be pessimistically biased; with a very high k, each validation fold is small, so the per-fold scores vary widely. It is therefore worth experimenting with different values of k to understand how this parameter affects the measured performance of our model (R² and RMSE).
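For reference, the two metrics and the cross-validated estimate can be written out as below. The notation is an assumption introduced here, not taken from the notebook: $n$ is the number of samples in a validation fold, $y_i$ the true value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the true values in that fold, and $M_j$ the value of either metric on fold $j$.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},
\qquad
\widehat{M}_{\mathrm{CV}} = \frac{1}{k}\sum_{j=1}^{k} M_j
```

Under this definition, $R^2$ is negative whenever the model's squared error exceeds that of always predicting $\bar{y}$.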

**Action:**

- We will train a LinearRegression model using a Pipeline.
- We will use the cross_validate function to evaluate its performance with different numbers of folds: 2, 5, 10, and 100.
- For each run, we will calculate the average R-squared (R²) and Root Mean Squared Error (RMSE).
- Finally, we will compile all the results into a single DataFrame to compare how the choice of folds affects the metrics and their standard deviation. This will help us choose a reliable cross-validation strategy.
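
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own cell: the data comes from `make_regression` as a synthetic stand-in for `training_features` and `training_labels`, and the names `fold_comparison` and `rows` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Assumption: synthetic data standing in for the real training set.
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)

pipe = Pipeline([('regresion', LinearRegression())])
scoring = {'neg_rmse': 'neg_root_mean_squared_error', 'r2': 'r2'}

rows = []
for k in [2, 5, 10, 100]:
    # One cross-validation run per fold count.
    res = cross_validate(pipe, X, y, cv=k, scoring=scoring)
    rows.append({
        'folds': k,
        'r2_mean': res['test_r2'].mean(),
        'r2_std': res['test_r2'].std(ddof=1),
        # scikit-learn reports RMSE negated, so flip the sign of the mean.
        'rmse_mean': -res['test_neg_rmse'].mean(),
        'rmse_std': res['test_neg_rmse'].std(ddof=1),
    })

# One row per fold count, ready for side-by-side comparison.
fold_comparison = pd.DataFrame(rows).set_index('folds')
print(fold_comparison)
```

Comparing the `*_std` columns across rows shows how the spread of per-fold scores changes with k.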


In [2]:
folds = 10

pipe = Pipeline([
    ('regresion', LinearRegression())
])

param_grid = {
}

scoring_metrics = {
    'neg_rmse': 'neg_root_mean_squared_error',
    'r2': 'r2'
}

cv_results = cross_validate(
    estimator=pipe,
    X=training_features,
    y=training_labels,
    cv=folds,
    scoring=scoring_metrics,
    n_jobs=-1
)

metrics = pd.DataFrame(cv_results)

print("Metrics summary:")
print(f"R2 Mean: {metrics['test_r2'].mean():.4f} (std: +/- {metrics['test_r2'].std():.4f})")
print(f"RMSE Mean: {-metrics['test_neg_rmse'].mean():.4f} (std: +/- {metrics['test_neg_rmse'].std():.4f})")

metrics


Metrics summary:
R2 Mean: 0.8660 (std: +/- 0.0475)
RMSE Mean: 0.2313 (std: +/- 0.0360)


| fold | fit_time | score_time | test_neg_rmse | test_r2 |
|---|---|---|---|---|
| 0 | 0.007202 | 0.002419 | -0.25801 | 0.857748 |
| 1 | 0.007641 | 0.002614 | -0.23514 | 0.872808 |
| 2 | 0.007637 | 0.003692 | -0.200422 | 0.897125 |
| 3 | 0.007653 | 0.002563 | -0.236432 | 0.880555 |
| 4 | 0.007795 | 0.002598 | -0.174697 | 0.92077 |
| 5 | 0.006364 | 0.002373 | -0.209514 | 0.884365 |
| 6 | 0.006705 | 0.002428 | -0.307203 | 0.740807 |
| 7 | 0.006382 | 0.002307 | -0.238624 | 0.865136 |
| 8 | 0.006409 | 0.002491 | -0.212318 | 0.869049 |
| 9 | 0.009028 | 0.002522 | -0.240806 | 0.871474 |
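
The summary figures can be re-derived from the per-fold scores listed above. The sketch below copies those ten values by hand and also illustrates one detail worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`), whereas the `std_test_r2` column that `GridSearchCV` reports is a population standard deviation (`ddof=0`), so the two differ slightly for the same folds. The variable names are illustrative only.

```python
import numpy as np

# Per-fold scores copied from the cross_validate output above.
test_r2 = np.array([0.857748, 0.872808, 0.897125, 0.880555, 0.92077,
                    0.884365, 0.740807, 0.865136, 0.869049, 0.871474])
test_neg_rmse = np.array([-0.25801, -0.23514, -0.200422, -0.236432, -0.174697,
                          -0.209514, -0.307203, -0.238624, -0.212318, -0.240806])

r2_mean = test_r2.mean()
rmse_mean = -test_neg_rmse.mean()          # undo scikit-learn's sign flip
r2_std_sample = test_r2.std(ddof=1)        # what pandas .std() computes
r2_std_population = test_r2.std(ddof=0)    # what cv_results_['std_test_r2'] computes

print(f"R2 mean: {r2_mean:.4f}")
print(f"RMSE mean: {rmse_mean:.4f}")
print(f"R2 std (ddof=1): {r2_std_sample:.4f}")
print(f"R2 std (ddof=0): {r2_std_population:.4f}")
```

The `ddof=1` value corresponds to the 0.0475 printed in the summary, while the `ddof=0` value corresponds to the 0.045106 that appears as `std_test_r2` for the same model in the grid search results.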


### 4.3 Model Validation

**Problem:**

We need a systematic way to find the best regression model, and its optimal hyperparameters, for our dataset.

**Justification:**

Since no single model is universally best (the "No Free Lunch" theorem), a systematic comparison is essential. GridSearchCV is the standard tool for this, as it exhaustively explores hyperparameter combinations and uses cross-validation to provide a robust estimate of model performance on unseen data.

**Action:**

The code iterates through a predefined list of models (e.g., Linear Regression, Lasso, Ridge) and their respective hyperparameter grids. For each model, it performs an exhaustive GridSearchCV with 10-fold cross-validation, recording both R² and RMSE for every parameter combination and selecting the best parameter set by R² (`refit='r2'`). All results are then compiled into a single Pandas DataFrame and saved to a CSV file for analysis.
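
Because the search is exhaustive, it helps to estimate its size before running it. The sketch below counts the candidates in each grid with `ParameterGrid`, using the same grids as the cell that follows. The value `max_components = 87` is an assumption read off the `pca__n_components` values visible in the output; in the notebook it comes from `training_features.shape[1]`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

max_components = 87  # assumption: inferred from the displayed results
folds = 10

grids = {
    'DLR': {'regresion__fit_intercept': [True, False]},
    'PCA+DLR': {
        'pca__n_components': list(range(1, max_components + 1)),
        'pca__whiten': [True, False],
    },
    'L1': {'regresion__alpha': np.logspace(-4, 1, 30)},
    'L2': {'regresion__alpha': np.logspace(-4, 4, 30)},
    'L1+L2': {
        'regresion__alpha': np.logspace(-4, 1, 30),
        'regresion__l1_ratio': np.arange(0.1, 1.0, 0.1),
    },
    'PCA+L1': {
        'pca__n_components': list(range(1, max_components + 1)),
        'regresion__alpha': np.logspace(-4, 1, 30),
    },
}

# Number of parameter combinations per model.
candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
total_candidates = sum(candidates.values())
# Each candidate is fitted once per fold (the final refit per model is not counted).
total_fits = total_candidates * folds

for name, count in candidates.items():
    print(f"{name}: {count} candidates")
print(f"Total: {total_candidates} candidates, {total_fits} fits")
```

Under that assumption, the PCA+L1 grid accounts for 2,610 of the 3,116 candidates, so it is the first place to coarsen the grid if the search becomes too slow.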

**Verify:**

Success is verified by inspecting the resulting `final_metrics` DataFrame, which should contain the consolidated performance metrics for all evaluated models. The creation of the target CSV file also confirms that the process completed successfully.
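
One way to inspect the consolidated results is to pull out the best configuration per model. The sketch below assumes a DataFrame shaped like `final_metrics`; here it is a four-row toy excerpt built by hand from values shown in the output, and `best_per_model` is a hypothetical name. It compares models on `mean_test_r2` rather than `rank_test_r2`, because the rank is computed separately inside each grid search and is not comparable across models.

```python
import pandas as pd

# Toy excerpt standing in for final_metrics (assumption: only a few columns kept).
final_metrics = pd.DataFrame({
    'model_name': ['DLR', 'DLR', 'PCA+DLR', 'PCA+DLR'],
    'params': [
        "{'regresion__fit_intercept': True}",
        "{'regresion__fit_intercept': False}",
        "{'pca__n_components': 1, 'pca__whiten': True}",
        "{'pca__n_components': 2, 'pca__whiten': True}",
    ],
    'mean_test_r2': [0.865984, 0.801072, 0.208900, 0.612292],
    'std_test_r2': [0.045106, 0.048315, 0.077873, 0.052087],
})

# Row label of the highest mean R2 within each model.
best_idx = final_metrics.groupby('model_name')['mean_test_r2'].idxmax()

# Keep those rows and order the models from best to worst.
best_per_model = (
    final_metrics.loc[best_idx]
    .sort_values('mean_test_r2', ascending=False)
    .reset_index(drop=True)
)
print(best_per_model)
```

The same two lines work unchanged on the full DataFrame, since they rely only on the `model_name` and `mean_test_r2` columns.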

In [3]:
folds = 10
max_components = training_features.shape[1]

scoring_metrics = {
    'neg_rmse': 'neg_root_mean_squared_error',
    'r2': 'r2'
}

models_configuration = [
    {
        'name': 'DLR',
        'pipeline': Pipeline([
            ('regresion', LinearRegression())
        ]),
        'param_grid': {
            'regresion__fit_intercept': [True, False]  # with intercept (y = mx + b) or without (y = mx)
        }
    },
    {
        'name': 'PCA+DLR',
        'pipeline': Pipeline([
            ('pca', PCA()),
            ('regresion', LinearRegression())
        ]),
        'param_grid': {
            'pca__n_components': range(1, max_components + 1),
            'pca__whiten': [True, False],  # if True, rescale each component to unit variance
        }
    },
    {
        'name': 'L1',
        'pipeline': Pipeline([
            ('regresion', Lasso(max_iter=10000))  # raising max_iter is good practice to help convergence
        ]),
        'param_grid': {
            'regresion__alpha': np.logspace(-4, 1, 30)
        }
    },
    {
        'name': 'L2',
        'pipeline': Pipeline([
            ('regresion', Ridge())
        ]),
        'param_grid': {
            'regresion__alpha': np.logspace(-4, 4, 30)
        }
    },
    {
        'name': 'L1+L2',
        'pipeline': Pipeline([
            ('regresion', ElasticNet(max_iter=10000))
        ]),
        'param_grid': {
            'regresion__alpha': np.logspace(-4, 1, 30),  # overall regularization strength
            'regresion__l1_ratio': np.arange(0.1, 1.0, 0.1)  # L1/L2 mix
        }
    },
    {
        'name': 'PCA+L1',
        'pipeline': Pipeline([
            ('pca', PCA()),
            ('regresion', Lasso(max_iter=10000))
        ]),
        'param_grid': {
            'pca__n_components': range(1, max_components + 1),
            'regresion__alpha': np.logspace(-4, 1, 30)
        }
    }
]

all_metrics = []

for config in models_configuration:
    model_name = config['name']
    pipeline = config['pipeline']
    param_grid = config['param_grid']
    
    print(f"--- Running: {model_name} ---")

    grid_search = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        scoring=scoring_metrics,
        refit='r2',
        cv=folds,
        n_jobs=-1
    )

    grid_search.fit(training_features, training_labels)

    grid_metrics = pd.DataFrame(grid_search.cv_results_)
    grid_metrics['model_name'] = model_name

    all_metrics.append(grid_metrics)

final_metrics = pd.concat(all_metrics, ignore_index=True)
final_metrics['params'] = final_metrics['params'].astype(str)

final_metrics.to_csv(target_metrics)

display(final_metrics)

--- Running: DLR ---
--- Running: PCA+DLR ---
--- Running: L1 ---
--- Running: L2 ---
--- Running: L1+L2 ---
--- Running: PCA+L1 ---


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_regresion__fit_intercept | params | split0_test_neg_rmse | split1_test_neg_rmse | split2_test_neg_rmse | split3_test_neg_rmse | ... | split8_test_r2 | split9_test_r2 | mean_test_r2 | std_test_r2 | rank_test_r2 | model_name | param_pca__n_components | param_pca__whiten | param_regresion__alpha | param_regresion__l1_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009296 | 0.002161 | 0.003468 | 0.000791 | True | {'regresion__fit_intercept': True} | -0.258010 | -0.235140 | -0.200422 | -0.236432 | ... | 0.869049 | 0.871474 | 0.865984 | 0.045106 | 1 | DLR | | | | |
| 1 | 0.010961 | 0.002158 | 0.004128 | 0.000408 | False | {'regresion__fit_intercept': False} | -0.326480 | -0.278803 | -0.274026 | -0.255292 | ... | 0.801262 | 0.782330 | 0.801072 | 0.048315 | 2 | DLR | | | | |
| 2 | 0.008035 | 0.000812 | 0.003920 | 0.000340 | | {'pca__n_components': 1, 'pca__whiten': True} | -0.589999 | -0.583586 | -0.573684 | -0.605158 | ... | 0.152395 | 0.211369 | 0.208900 | 0.077873 | 173 | PCA+DLR | 1.0 | True | | |
| 3 | 0.007980 | 0.001178 | 0.003686 | 0.000770 | | {'pca__n_components': 1, 'pca__whiten': False} | -0.589999 | -0.583586 | -0.573684 | -0.605158 | ... | 0.152395 | 0.211369 | 0.208900 | 0.077873 | 174 | PCA+DLR | 1.0 | False | | |
| 4 | 0.007007 | 0.001069 | 0.002990 | 0.000819 | | {'pca__n_components': 2, 'pca__whiten': True} | -0.425748 | -0.407447 | -0.403174 | -0.418704 | ... | 0.584162 | 0.643420 | 0.612292 | 0.052087 | 172 | PCA+DLR | 2.0 | True | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3111 | 0.010971 | 0.001130 | 0.004394 | 0.000454 | | {'pca__n_components': 87, 'regresion__alpha': ... | -0.685871 | -0.659323 | -0.628748 | -0.685061 | ... | -0.016161 | -0.000687 | -0.006274 | 0.007100 | 1915 | PCA+L1 | 87.0 | | 2.043360 | |
| 3112 | 0.011144 | 0.000993 | 0.004679 | 0.000381 | | {'pca__n_components': 87, 'regresion__alpha': ... | -0.685871 | -0.659323 | -0.628748 | -0.685061 | ... | -0.016161 | -0.000687 | -0.006274 | 0.007100 | 1915 | PCA+L1 | 87.0 | | 3.039195 | |
| 3113 | 0.011748 | 0.000844 | 0.004860 | 0.000323 | | {'pca__n_components': 87, 'regresion__alpha': ... | -0.685871 | -0.659323 | -0.628748 | -0.685061 | ... | -0.016161 | -0.000687 | -0.006274 | 0.007100 | 1915 | PCA+L1 | 87.0 | | 4.520354 | |
| 3114 | 0.011723 | 0.001212 | 0.004592 | 0.000362 | | {'pca__n_components': 87, 'regresion__alpha': ... | -0.685871 | -0.659323 | -0.628748 | -0.685061 | ... | -0.016161 | -0.000687 | -0.006274 | 0.007100 | 1915 | PCA+L1 | 87.0 | | 6.723358 | |
