# Housing prices in Hyderabad, India

## Project Objective 🎯

The objective of this project is to develop a regression model to predict housing prices in Hyderabad, India. Using features such as the property's area, location, number of bedrooms, and available amenities, the model will aim to estimate the market value of a property as accurately as possible.

This predictive model will be a valuable tool for:

- Home Buyers and Sellers: To obtain an objective price estimate for a property.
- Real Estate Agents: To assist with property valuation and client advisory.
- Investors: To identify potentially undervalued or overvalued properties in the market.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.decomposition import PCA
from IPython.display import display, Markdown
import sys

sys.path.append('../../src/utils')


# Utilities
from regresion_metrics import show_model_equation, get_model_coeficients_dataframe


training_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_features.parquet')
training_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_labels.parquet')


test_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_features.parquet')
test_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_labels.parquet')


## 1.1 Training a default regression model using cross-validation

**Problem:**

We need to train the model, but we want to ensure that our training set is sufficiently representative. Furthermore, we need to obtain a reliable and stable estimate of the model's performance, as a single data split can lead to misleading results (either too optimistic or too pessimistic).

**Justification:**

Cross-validation is used to address this problem. By dividing the data into multiple folds (k) and iteratively training and validating on different subsets, we obtain a more robust measure of the model's generalization ability. However, the choice of k itself can influence the stability and bias of the metrics. A very low k tends to give a pessimistically biased estimate, because each model is trained on a smaller share of the data, while a very high k tends to give a higher-variance estimate, because each validation fold is small. Therefore, it is justified to experiment with different values of k to understand how this parameter affects the perceived performance of our model (measured by R² and RMSE).

**Action:**

- We will train a LinearRegression model using a Pipeline.
- We will use the `cross_validate` function to evaluate its performance with different numbers of folds: 5 and 10.
- For each run, we will calculate the average R-squared (R²) and Root Mean Squared Error (RMSE).
- Finally, we will compile all the results into a single DataFrame to compare how the choice of folds affects the metrics and their standard deviation. This will help us choose a reliable cross-validation strategy.
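
To make the fold mechanics concrete, here is a minimal sketch of k-fold cross-validation written by hand and compared against scikit-learn's `cross_val_score`. It runs on synthetic data from `make_regression` (an assumption for illustration, not the Hyderabad dataset), and the helper `manual_cv_r2` is a hypothetical name, not part of this project's utilities.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the Hyderabad dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)


def manual_cv_r2(X, y, n_splits):
    """Return one validation R2 score per fold, computed by hand."""
    scores = []
    # KFold without shuffling yields n_splits consecutive validation blocks.
    for train_idx, val_idx in KFold(n_splits=n_splits).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))
    return np.array(scores)


for k in (5, 10):
    manual = manual_cv_r2(X, y, k)
    library = cross_val_score(LinearRegression(), X, y, cv=k, scoring="r2")
    print(f"k={k}: mean R2={manual.mean():.4f}, std={manual.std():.4f}, "
          f"matches cross_val_score: {np.allclose(manual, library)}")
```

Each fold serves as the validation set exactly once, so a larger k means more models, each trained on more data and validated on fewer rows.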


In [29]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from IPython.display import display, Markdown

all_results = []

# Evaluate the same pipeline with each number of folds.
for fold in [5, 10]:
    pipe = Pipeline([
        ('regresion', LinearRegression())
    ])

    # Score every fold with both RMSE (negated by scikit-learn) and R2.
    scoring_metrics = {
        'neg_rmse': 'neg_root_mean_squared_error',
        'r2': 'r2'
    }

    metrics = cross_validate(estimator=pipe,
                             X=training_features,
                             y=training_labels,
                             cv=fold,
                             scoring=scoring_metrics)


    summary = {
        'Folds': fold,
        'R2 Mean': metrics['test_r2'].mean(),
        'R2 Std': metrics['test_r2'].std(),
        'RMSE Mean': -metrics['test_neg_rmse'].mean(),
        'RMSE Std': metrics['test_neg_rmse'].std()
    }

    all_results.append(summary)

final_results_df = pd.DataFrame(all_results)

best_r2_row = final_results_df.loc[final_results_df['R2 Mean'].idxmax()]
best_rmse_row = final_results_df.loc[final_results_df['RMSE Mean'].idxmin()]

print(f"🏆 Best R² (the highest): {best_r2_row}")
print(f"📉 Best RMSE (the lowest): {best_rmse_row}")
display(final_results_df)


🏆 Best R² (the highest): Folds        10.000000
R2 Mean       0.881804
R2 Std        0.021611
RMSE Mean     0.212943
RMSE Std      0.018193
Name: 1, dtype: float64
📉 Best RMSE (the lowest): Folds        10.000000
R2 Mean       0.881804
R2 Std        0.021611
RMSE Mean     0.212943
RMSE Std      0.018193
Name: 1, dtype: float64


|   | Folds | R2 Mean | R2 Std | RMSE Mean | RMSE Std |
|---|-------|----------|----------|-----------|----------|
| 0 | 5 | 0.879093 | 0.021429 | 0.215632 | 0.012236 |
| 1 | 10 | 0.881804 | 0.021611 | 0.212943 | 0.018193 |
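
One hedged way to read these numbers is to compare the gap between the 5-fold and 10-fold means with the fold-to-fold standard deviation. The sketch below uses only the values reported in the table above. Its suggested reading, that a gap much smaller than the standard deviation is unlikely to be meaningful, is a rule of thumb and not a formal statistical test.

```python
import pandas as pd

# Values copied from the results table above.
results = pd.DataFrame({
    "Folds": [5, 10],
    "R2 Mean": [0.879093, 0.881804],
    "R2 Std": [0.021429, 0.021611],
    "RMSE Mean": [0.215632, 0.212943],
    "RMSE Std": [0.012236, 0.018193],
})

# Absolute difference between the 5-fold and 10-fold averages.
r2_gap = abs(results["R2 Mean"].diff().iloc[-1])
rmse_gap = abs(results["RMSE Mean"].diff().iloc[-1])

print(f"R2 gap:   {r2_gap:.6f} vs smallest R2 std   {results['R2 Std'].min():.6f}")
print(f"RMSE gap: {rmse_gap:.6f} vs smallest RMSE std {results['RMSE Std'].min():.6f}")
```

Both gaps are roughly 0.0027, well below the corresponding standard deviations, which suggests that 5 and 10 folds give practically equivalent estimates for this model.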


## 1.3 Dimensionality Reduction

**Problem:**

Having too many features (high dimensionality) can cause models to overfit, make coefficient estimates unstable when features are redundant (multicollinearity), and increase the computational cost of training.

**Justification:**

Principal Component Analysis (PCA) reduces the number of features by creating a smaller set of new, uncorrelated features called principal components. This method retains most of the original data's important information (variance) while making the model simpler, faster, and less prone to overfitting.
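
As a small illustration of the "uncorrelated" property, the following sketch builds three synthetic features, two of which are almost copies of each other, and checks the correlation before and after PCA. The data and variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic, deliberately redundant features (assumption for illustration).
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),      # feature 0
    2 * base + 0.1 * rng.normal(size=(500, 1)),  # feature 1, nearly 2 * feature 0
    rng.normal(size=(500, 1)),                   # feature 2, independent
])

pca = PCA(n_components=2)
components = pca.fit_transform(X)

original_corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
component_corr = np.corrcoef(components[:, 0], components[:, 1])[0, 1]

print(f"Correlation between original features 0 and 1: {original_corr:.3f}")
print(f"Correlation between principal components 0 and 1: {component_corr:.3f}")
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The two redundant features collapse mostly into the first component, and the resulting components have a correlation of zero up to floating-point error.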

**Action:**

- Iterate over the number of principal components and select the smallest number that still explains most of the variance.
- Compare the RMSE and R² of that reduced model with those of the regression trained initially on all features.
- Transform the dataset into this new, smaller set of features.
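
The first step can be sketched as picking the smallest number of components whose cumulative explained variance reaches a threshold. The 0.95 threshold and the synthetic data below are assumptions for illustration. The notebook itself selects the number of components by cross-validated score, not by a variance cutoff.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by 3 latent factors plus noise
# (assumption for illustration, not the Hyderabad dataset).
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# Fit PCA with all components and accumulate the explained variance.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # hypothetical cutoff
# Index of the first component at which the cumulative variance reaches the cutoff.
n_components = int(np.argmax(cumulative >= threshold)) + 1

# Transform the dataset into the smaller set of features.
X_reduced = PCA(n_components=n_components).fit_transform(X)

print("Cumulative explained variance:", cumulative.round(4))
print(f"Components needed for {threshold:.0%} of the variance: {n_components}")
print("Reduced shape:", X_reduced.shape)
```

scikit-learn can also apply a variance cutoff directly when `n_components` is a float between 0 and 1, for example `PCA(n_components=0.95)`.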

In [30]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import display, Markdown


pipe = Pipeline([
    ('pca', PCA()),
    ('regresion', LinearRegression())
])


max_components = training_features.shape[1]
param_grid = {
    'pca__n_components': range(1, max_components + 1)
}

# Configure and run GridSearchCV.
# To obtain both RMSE and R2, we can pass multiple metrics.
scoring_metrics = {
    'neg_rmse': 'neg_root_mean_squared_error',
    'r2': 'r2'
}

grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring=scoring_metrics,
    refit='r2',
    cv=5,
    n_jobs=-1
)

grid_search.fit(training_features, training_labels)


results_df = pd.DataFrame(grid_search.cv_results_)

results_df.to_parquet("../../datasets/processed/metrics.parquet")
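
The raw `cv_results_` table has many columns, so a compact per-component summary can make the comparison easier to read. The sketch below reproduces the same grid search on synthetic data (an assumption, so that it runs on its own) and keeps only the columns scikit-learn generates for the `r2` and `neg_rmse` scorer names used above. The `summary` and `rmse_mean` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the Hyderabad dataset).
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

search = GridSearchCV(
    estimator=Pipeline([("pca", PCA()), ("regresion", LinearRegression())]),
    param_grid={"pca__n_components": range(1, X.shape[1] + 1)},
    scoring={"neg_rmse": "neg_root_mean_squared_error", "r2": "r2"},
    refit="r2",
    cv=5,
)
search.fit(X, y)

# One row per candidate; flip the sign of the negated RMSE for readability.
summary = (
    pd.DataFrame(search.cv_results_)
    .rename(columns={"param_pca__n_components": "n_components"})
    .assign(rmse_mean=lambda df: -df["mean_test_neg_rmse"])
    [["n_components", "mean_test_r2", "std_test_r2", "rmse_mean", "rank_test_r2"]]
    .sort_values("rank_test_r2")
)

print(summary.to_string(index=False))
print("Best n_components by R2:", search.best_params_["pca__n_components"])
```

Because `refit='r2'` is set, `best_params_` and `best_estimator_` refer to the candidate ranked first by mean R², which is the first row of the sorted summary.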