# Housing prices in Hyderabad, India

## Project Objective 🎯

The objective of this project is to develop a regression model to predict housing prices in Hyderabad, India. Using features such as the property's area, location, number of bedrooms, and available amenities, the model will aim to estimate the market value of a property as accurately as possible.

This predictive model will be a valuable tool for:

- Home Buyers and Sellers: To obtain an objective price estimate for a property.
- Real Estate Agents: To assist with property valuation and client advisory.
- Investors: To identify potentially undervalued or overvalued properties in the market.

In [60]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from IPython.display import display
import sys

sys.path.append('../../src/utils')


# Utilities
from regresion_metrics import show_model_equation, get_model_coeficients_dataframe


training_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_features.parquet')
training_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_labels.parquet')

test_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_features.parquet')
test_labels= pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_labels.parquet')

target_metrics = '../../datasets/processed/housing_prices/hyderabad_house_price_metrics.csv'

## 1.1 Training a default regression model using cross-validation

**Problem:**

We need to train the model, but we want to ensure that our training set is sufficiently representative. Furthermore, we need to obtain a reliable and stable estimate of the model's performance, as a single data split can lead to misleading results (either too optimistic or too pessimistic).

**Justification:**

Cross-validation addresses this problem. By dividing the data into k folds and iteratively training and validating on different subsets, we obtain a more robust measure of the model's generalization ability. However, the choice of k itself influences the bias and stability of the metrics. A very low k tends toward high (pessimistic) bias, because each model is trained on a much smaller share of the data, while a very high k tends toward high variance, because each validation fold contains only a few samples. It is therefore worth experimenting with different values of k to understand how this parameter affects the perceived performance of our model (measured by R² and RMSE).

**Action:**

- We will train a LinearRegression model using a Pipeline.
- We will use the cross_validate function to evaluate its performance with different numbers of folds: 2, 5, 10, and 100.
- For each run, we will calculate the average R-squared (R²) and Root Mean Squared Error (RMSE).
- Finally, we will compile all the results into a single DataFrame to compare how the choice of folds affects the metrics and their standard deviation. This will help us choose a reliable cross-validation strategy.
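
The comparison across fold counts described in the list above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cell: the synthetic data from `make_regression` is an assumed stand-in for the training parquet files, and the name `fold_comparison` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Assumption: synthetic data stands in for training_features / training_labels.
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)

pipe = Pipeline([('regresion', LinearRegression())])
scoring = {'neg_rmse': 'neg_root_mean_squared_error', 'r2': 'r2'}

rows = []
for k in [2, 5, 10, 100]:
    # Each validation fold needs at least two samples for R² to be defined.
    cv = cross_validate(pipe, X, y, cv=k, scoring=scoring)
    rows.append({
        'folds': k,
        'r2_mean': cv['test_r2'].mean(),
        'r2_std': cv['test_r2'].std(),
        # Scores are negated RMSE values, so flip the sign for the mean.
        'rmse_mean': -cv['test_neg_rmse'].mean(),
        'rmse_std': cv['test_neg_rmse'].std(),
    })

# One row per fold count, so means and spreads can be compared side by side.
fold_comparison = pd.DataFrame(rows).set_index('folds')
print(fold_comparison)
```

Reading the `r2_std` and `rmse_std` columns from top to bottom shows how the spread of per-fold scores changes as the validation folds get smaller.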


In [67]:
folds = 10

pipe = Pipeline([
    ('regresion', LinearRegression())
])

param_grid = {
}

scoring_metrics = {
    'neg_rmse': 'neg_root_mean_squared_error',
    'r2': 'r2'
}

cv_results = cross_validate(
    estimator=pipe,
    X=training_features,
    y=training_labels,
    cv=folds,
    scoring=scoring_metrics,
    n_jobs=-1
)

metrics = pd.DataFrame(cv_results)

print("Metrics summary:")
print(f"R2 Mean: {metrics['test_r2'].mean():.4f} (std: +/- {metrics['test_r2'].std():.4f})")
print(f"RMSE Mean: {-metrics['test_neg_rmse'].mean():.4f} (std: +/- {metrics['test_neg_rmse'].std():.4f})")

metrics


Metrics summary:
R2 Mean: 0.8660 (std: +/- 0.0475)
RMSE Mean: 0.2313 (std: +/- 0.0360)


| fold | fit_time | score_time | test_neg_rmse | test_r2 |
|---|---|---|---|---|
| 0 | 0.007488 | 0.002319 | -0.25801 | 0.857748 |
| 1 | 0.007017 | 0.002382 | -0.23514 | 0.872808 |
| 2 | 0.006967 | 0.002385 | -0.200422 | 0.897125 |
| 3 | 0.006465 | 0.002403 | -0.236432 | 0.880555 |
| 4 | 0.008704 | 0.002467 | -0.174697 | 0.92077 |
| 5 | 0.007076 | 0.002409 | -0.209514 | 0.884365 |
| 6 | 0.007244 | 0.00239 | -0.307203 | 0.740807 |
| 7 | 0.006415 | 0.002345 | -0.238624 | 0.865136 |
| 8 | 0.007422 | 0.00244 | -0.212318 | 0.869049 |
| 9 | 0.007934 | 0.00254 | -0.240806 | 0.871474 |


## 1.3 Dimensionality Reduction

**Problem:**

Having too many features (high dimensionality) can cause models to overfit, makes coefficient estimates unstable when features are redundant (multicollinearity), and increases the computational cost of training.

**Justification:**

PCA reduces the number of features by creating a smaller set of new, uncorrelated features called principal components. This method retains most of the original data's important information (variance) while making the model simpler, faster, and less prone to overfitting.
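
As a rough illustration of "retaining most of the variance", the sketch below fits PCA on synthetic correlated features and counts how many components are needed to reach a chosen share of the total variance. The data, the 95% `threshold`, and the variable names are assumptions made for this example, not values taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: 12 correlated features generated from 4 hidden factors,
# standing in for the real training features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X = latent @ rng.normal(size=(4, 12)) + 0.05 * rng.normal(size=(300, 12))

# Fit PCA with all components and accumulate the variance each one explains.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Hypothetical target: keep enough components to explain 95% of the variance.
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(f"{n_components} of {X.shape[1]} components explain "
      f"{cumulative[n_components - 1]:.1%} of the variance")
```

Passing a fraction directly, as in `PCA(n_components=0.95)`, asks scikit-learn to make a similar selection internally.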

**Action:**

- Iterate over the number of principal components and select the top components that explain most of the variance.
- Compare the RMSE and R² of the smallest reduction that still explains most of the variance against the scores of the initial regression.
- Transform the dataset into this new, smaller set of features.
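
One possible way to carry out the three steps above is sketched below: search over `pca__n_components`, keep the smallest value whose cross-validated R² stays close to the plain regression, then transform the data. The synthetic data, the `tolerance` of 0.01, and the 5-fold setting are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Assumption: correlated synthetic features driven by 4 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X = latent @ rng.normal(size=(4, 12)) + 0.05 * rng.normal(size=(300, 12))
y = latent @ np.array([3.0, -2.0, 1.5, 0.5]) + 0.3 * rng.normal(size=300)

# Reference score: plain linear regression on all original features.
baseline_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2').mean()

# Step 1: evaluate every possible number of components.
pipe = Pipeline([('pca', PCA()), ('regresion', LinearRegression())])
grid = {'pca__n_components': list(range(1, X.shape[1] + 1))}
search = GridSearchCV(pipe, grid, scoring='r2', cv=5).fit(X, y)

results = pd.DataFrame(search.cv_results_)
results['n_components'] = results['param_pca__n_components'].astype(int)

# Step 2: smallest reduction whose R² is within `tolerance` of the baseline.
tolerance = 0.01  # hypothetical acceptable loss in R²
good_enough = results[results['mean_test_score'] >= baseline_r2 - tolerance]
smallest = int(good_enough['n_components'].min())

# Step 3: project the data onto the selected components.
X_reduced = PCA(n_components=smallest).fit_transform(X)
print(f"baseline R2={baseline_r2:.4f}, components kept={smallest}, "
      f"shape={X_reduced.shape}")
```

Using all components reproduces the baseline fit, so at least one candidate always satisfies the tolerance; how far below that the selection goes depends on how correlated the features are.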

In [None]:
folds = 10
max_components = training_features.shape[1]

scoring_metrics = {
    'neg_rmse': 'neg_root_mean_squared_error',
    'r2': 'r2'
}

models_configuration = [
    {
        'name': 'Default Lineal Regresion',
        'pipeline': Pipeline([
            ('regresion', LinearRegression())
        ]),
        'param_grid': {
            'regresion__fit_intercept': [True, False]  # y = mx + b vs. y = mx
        }
    },
    {
        'name': 'PCA + Default Lineal Regresion',
        'pipeline': Pipeline([
            ('pca', PCA()),
            ('regresion', LinearRegression())
        ]),
        'param_grid': {
            'pca__n_components': range(1, max_components + 1),
            'pca__whiten': [True, False],  # if True, rescale each component to unit variance
        }
    },
    {
        'name': 'Lasso (L1)',
        'pipeline': Pipeline([
            ('regresion', Lasso(max_iter=10000)) # Increasing max_iter is good practice (helps convergence)
        ]),
        'param_grid': {
            'regresion__alpha': np.logspace(-4, 1, 30)
        }
    },
    {
        'name': 'Ridge (L2)',
        'pipeline': Pipeline([
            ('regresion', Ridge())
        ]),
        'param_grid': {
            'regresion__alpha': np.logspace(-4, 4, 30)
        }
    },
    {
        'name': 'ElasticNet (L1 + L2)',
        'pipeline': Pipeline([
            ('regresion', ElasticNet(max_iter=10000))
        ]),
        'param_grid': {
            'regresion__alpha': np.logspace(-4, 1, 30), # Overall regularization strength
            'regresion__l1_ratio': np.arange(0.1, 1.0, 0.1) # L1/L2 mix
        }
    },
    {
        'name': 'PCA + Lasso (L1)',
        'pipeline': Pipeline([
            ('pca', PCA()),
            ('regresion', Lasso(max_iter=10000))
        ]),
        'param_grid': {
            'pca__n_components': range(1, max_components + 1),
            'regresion__alpha': np.logspace(-4, 1, 30)
        }
    }
]

all_metrics = []

for config in models_configuration:
    model_name = config['name']
    pipeline = config['pipeline']
    param_grid = config['param_grid']
    
    print(f"--- Ejecutando: {model_name} ---")

    grid_search = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        scoring=scoring_metrics,
        refit='r2',
        cv=folds,
        n_jobs=-1
    )

    grid_search.fit(training_features, training_labels)

    grid_metrics = pd.DataFrame(grid_search.cv_results_)
    grid_metrics['model_name'] = model_name
    
    all_metrics.append(grid_metrics)

final_metrics = pd.concat(all_metrics, ignore_index=True)
final_metrics['params'] = final_metrics['params'].astype(str)

final_metrics.to_csv(target_metrics)

display(final_metrics)

--- Ejecutando: Default Lineal Regresion ---
--- Ejecutando: Default Lineal Regresion with PCA ---
--- Ejecutando: Lasso (L1) ---
--- Ejecutando: Ridge (L2) ---
--- Ejecutando: ElasticNet (L1 + L2) ---
--- Ejecutando: PCA + Lasso (L1) ---


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_regresion__fit_intercept,params,split0_test_neg_rmse,split1_test_neg_rmse,split2_test_neg_rmse,split3_test_neg_rmse,...,split8_test_r2,split9_test_r2,mean_test_r2,std_test_r2,rank_test_r2,model_name,param_pca__n_components,param_pca__whiten,param_regresion__alpha,param_regresion__l1_ratio
0,0.007440,0.000634,0.002494,0.000112,True,{'regresion__fit_intercept': True},-0.258010,-0.235140,-0.200422,-0.236432,...,0.869049,0.871474,0.865984,0.045106,1,Default Lineal Regresion,,,,
1,0.008140,0.001186,0.003548,0.001659,False,{'regresion__fit_intercept': False},-0.326480,-0.278803,-0.274026,-0.255292,...,0.801262,0.782330,0.801072,0.048315,2,Default Lineal Regresion,,,,
2,0.005491,0.000775,0.003058,0.000811,,"{'pca__n_components': 1, 'pca__whiten': True}",-0.589999,-0.583586,-0.573684,-0.605158,...,0.152395,0.211369,0.208900,0.077873,173,Default Lineal Regresion with PCA,1.0,True,,
3,0.007461,0.001650,0.003592,0.000580,,"{'pca__n_components': 1, 'pca__whiten': False}",-0.589999,-0.583586,-0.573684,-0.605158,...,0.152395,0.211369,0.208900,0.077873,174,Default Lineal Regresion with PCA,1.0,False,,
4,0.005839,0.000989,0.003224,0.000678,,"{'pca__n_components': 2, 'pca__whiten': True}",-0.425748,-0.407447,-0.403174,-0.418704,...,0.584162,0.643420,0.612292,0.052087,172,Default Lineal Regresion with PCA,2.0,True,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3111,0.013264,0.001675,0.005979,0.001226,,"{'pca__n_components': 87, 'regresion__alpha': ...",-0.685871,-0.659323,-0.628748,-0.685061,...,-0.016161,-0.000687,-0.006274,0.007100,1915,PCA + Lasso (L1),87.0,,2.043360,
3112,0.014141,0.003878,0.006216,0.001348,,"{'pca__n_components': 87, 'regresion__alpha': ...",-0.685871,-0.659323,-0.628748,-0.685061,...,-0.016161,-0.000687,-0.006274,0.007100,1915,PCA + Lasso (L1),87.0,,3.039195,
3113,0.013290,0.003989,0.005404,0.002035,,"{'pca__n_components': 87, 'regresion__alpha': ...",-0.685871,-0.659323,-0.628748,-0.685061,...,-0.016161,-0.000687,-0.006274,0.007100,1915,PCA + Lasso (L1),87.0,,4.520354,
3114,0.014629,0.004272,0.006736,0.001554,,"{'pca__n_components': 87, 'regresion__alpha': ...",-0.685871,-0.659323,-0.628748,-0.685061,...,-0.016161,-0.000687,-0.006274,0.007100,1915,PCA + Lasso (L1),87.0,,6.723358,
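
The combined table above has more than three thousand rows, so it is hard to compare models by eye. A minimal sketch of one way to reduce it to the best configuration per model follows. The small `final_metrics` frame here is rebuilt from the first five rows shown in the output above purely so the example is self-contained; in the notebook the real `final_metrics` would be used instead, and `best_per_model` is a hypothetical name.

```python
import pandas as pd

# Assumption: a tiny excerpt of the displayed output, not the full results.
final_metrics = pd.DataFrame({
    'model_name': [
        'Default Lineal Regresion',
        'Default Lineal Regresion',
        'Default Lineal Regresion with PCA',
        'Default Lineal Regresion with PCA',
        'Default Lineal Regresion with PCA',
    ],
    'params': [
        "{'regresion__fit_intercept': True}",
        "{'regresion__fit_intercept': False}",
        "{'pca__n_components': 1, 'pca__whiten': True}",
        "{'pca__n_components': 1, 'pca__whiten': False}",
        "{'pca__n_components': 2, 'pca__whiten': True}",
    ],
    'mean_test_r2': [0.865984, 0.801072, 0.208900, 0.208900, 0.612292],
})

# Index label of the highest mean R² within each model family.
best_idx = final_metrics.groupby('model_name')['mean_test_r2'].idxmax()

# Keep only those rows and order the models from best to worst.
best_per_model = (
    final_metrics.loc[best_idx]
    .sort_values('mean_test_r2', ascending=False)
    .reset_index(drop=True)
)
print(best_per_model)
```

Because the grid search was refit on `'r2'`, ranking by `mean_test_r2` matches the criterion that selected each model's best estimator; the same pattern would work with a negated-RMSE column if that metric is preferred.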
