# Housing prices in Hyderabad, India

## Project Objective 🎯

The objective of this project is to develop a regression model to predict housing prices in Hyderabad, India. Using features such as the property's area, location, number of bedrooms, and available amenities, the model will aim to estimate the market value of a property as accurately as possible.

This predictive model will be a valuable tool for:

- Home Buyers and Sellers: To obtain an objective price estimate for a property.
- Real Estate Agents: To assist with property valuation and client advisory.
- Investors: To identify potentially undervalued or overvalued properties in the market.

## 4. Training the models

### 4.1 Loading the dataset

In [5]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

from IPython.display import display


training_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_features.parquet')
training_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_labels.parquet')

test_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_features.parquet')
test_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_labels.parquet')

target_metrics = '../../datasets/processed/housing_prices/hyderabad_house_price_neural_network_metrics.csv'

### 4.2 Model Validation

**Problem:**

How to systematically find the best regression model and its optimal hyperparameters for our dataset.

**Justification:**

Because no single model is best on every problem (the "No Free Lunch" theorem), candidates have to be compared systematically. `GridSearchCV` is scikit-learn's standard tool for this: it exhaustively evaluates every hyperparameter combination in a grid and uses cross-validation to estimate how each one performs on unseen data.
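To make the cross-validation step concrete, the sketch below partitions sample indices into contiguous folds using only the standard library. It follows the fold-size rule of scikit-learn's unshuffled `KFold`, which `GridSearchCV` uses for regressors when `cv` is an integer. The helper name `k_fold_indices` and the sample count are illustrative assumptions, not part of this project's code.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each contiguous fold."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + 1 if fold < extra else base
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# Hypothetical sample count, chosen only for illustration.
splits = list(k_fold_indices(n_samples=25, n_folds=10))
fold_sizes = [len(validation) for _, validation in splits]
print(fold_sizes)  # prints: [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Each sample lands in exactly one validation fold, so every candidate is scored on data it was not trained on, and the ten scores are then averaged.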

**Action:**

The code iterates through a predefined list of model configurations and their hyperparameter grids. In this notebook the list holds a single entry, an `MLPRegressor` neural network. For each configuration, it runs an exhaustive `GridSearchCV` with 10-fold cross-validation, scores every parameter combination on R² and RMSE, and refits the best combination by R². All results are then compiled into a single pandas DataFrame and saved to a CSV file for analysis.
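As a rough guide to the cost of an exhaustive search, the sketch below counts the candidates and model fits implied by the neural-network grid used in this notebook. It relies only on `itertools.product`. The grid values and fold count are copied from the notebook's code, while the variable names `n_candidates`, `n_cv_fits`, and `n_total_fits` are introduced here for illustration.

```python
from itertools import product

# Grid copied from the notebook's `param_grid` for the MLPRegressor pipeline.
param_grid = {
    'regresion__hidden_layer_sizes': [(64, 32), (100,), (128, 64, 32)],
    'regresion__activation': ['relu', 'tanh'],
    'regresion__solver': ['adam'],
    'regresion__alpha': [0.0001, 0.001],
}
folds = 10

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(combinations)
n_cv_fits = n_candidates * folds   # each candidate is fit once per fold
n_total_fits = n_cv_fits + 1       # plus one refit of the best candidate

print(n_candidates, n_cv_fits, n_total_fits)  # prints: 12 120 121
```

Adding one more value to any hyperparameter multiplies the candidate count, which is why the grid is kept small for a model as slow to train as a neural network.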

**Verify:**

Success is verified by inspecting the `final_metrics` DataFrame, which should contain the consolidated cross-validation metrics for every evaluated configuration. The presence of the target CSV file also confirms that the process ran to completion.
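The verification could also be automated. The following is a minimal sketch, using only the standard library, that checks the metrics CSV exists and carries the expected columns. The helper `check_metrics_file` and the `REQUIRED_COLUMNS` list are hypothetical names introduced here. The column names assume the `cv_results_` naming that `GridSearchCV` produces for scorers keyed `neg_rmse` and `r2`.

```python
import csv
from pathlib import Path


def check_metrics_file(path, required_columns):
    """Return (row_count, missing_columns) for a metrics CSV; raise if it is absent."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Metrics file not found: {path}")
    with path.open(newline='') as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        row_count = sum(1 for _ in reader)
    missing = [column for column in required_columns if column not in header]
    return row_count, missing


# Assumed column names: cv_results_ keys for the 'neg_rmse' and 'r2' scorers,
# plus the 'model_name' column that the notebook adds.
REQUIRED_COLUMNS = [
    'params',
    'mean_test_neg_rmse',
    'mean_test_r2',
    'rank_test_r2',
    'model_name',
]
```

Calling `check_metrics_file(target_metrics, REQUIRED_COLUMNS)` after the search should return an empty list of missing columns and one row per evaluated parameter combination.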

In [6]:
folds = 10
max_components = training_features.shape[1]

scoring_metrics = {
    'neg_rmse': 'neg_root_mean_squared_error',
    'r2': 'r2'
}

models_configuration = [
    {
        'name': 'NN', # Neural Network
        'pipeline': Pipeline([
            ('regresion', MLPRegressor(max_iter=1000, random_state=42)) 
        ]),
        'param_grid': {
            'regresion__hidden_layer_sizes': [(64, 32), (100,), (128, 64, 32)],
            'regresion__activation': ['relu', 'tanh'],
            'regresion__solver': ['adam'],
            'regresion__alpha': [0.0001, 0.001]
        }
    }
]

all_metrics = []

for config in models_configuration:
    model_name = config['name']
    pipeline = config['pipeline']
    param_grid = config['param_grid']
    
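    print(f"--- Running: {model_name} ---")

    # Score every parameter combination with 10-fold cross-validation;
    # the best combination by R² is refit on the full training set.
    grid_search = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        scoring=scoring_metrics,
        refit='r2',
        cv=folds,
        n_jobs=-1
    )

    # Flatten the single-column labels DataFrame to 1-D to avoid
    # scikit-learn's DataConversionWarning about column-vector targets.
    grid_search.fit(training_features, training_labels.to_numpy().ravel())

    # Collect the per-combination results and tag them with the model name.
    grid_metrics = pd.DataFrame(grid_search.cv_results_)
    grid_metrics['model_name'] = model_name
    
    all_metrics.append(grid_metrics)

# Combine the results of all models; store params as strings for CSV export.
final_metrics = pd.concat(all_metrics, ignore_index=True)
final_metrics['params'] = final_metrics['params'].astype(str)

final_metrics.to_csv(target_metrics)

display(final_metrics)

--- Running: NN ---
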
index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_regresion__activation,param_regresion__alpha,param_regresion__hidden_layer_sizes,param_regresion__solver,params,split0_test_neg_rmse,...,split4_test_r2,split5_test_r2,split6_test_r2,split7_test_r2,split8_test_r2,split9_test_r2,mean_test_r2,std_test_r2,rank_test_r2,model_name
0,1.295753,0.303075,0.006516,0.00126,relu,0.0001,"(64, 32)",adam,"{'regresion__activation': 'relu', 'regresion__...",-0.285708,...,0.880493,0.853397,0.72226,0.837466,0.815485,0.841891,0.829926,0.039549,3,NN
1,1.824042,0.247982,0.006145,0.000505,relu,0.0001,"(100,)",adam,"{'regresion__activation': 'relu', 'regresion__...",-0.290934,...,0.887891,0.856641,0.746299,0.831757,0.849158,0.80298,0.835471,0.037566,1,NN
2,3.506766,0.682786,0.006295,0.001473,relu,0.0001,"(128, 64, 32)",adam,"{'regresion__activation': 'relu', 'regresion__...",-0.270478,...,0.871452,0.871051,0.741675,0.858718,0.729346,0.849264,0.834152,0.050072,2,NN
3,1.641684,0.501082,0.005841,0.00074,relu,0.001,"(64, 32)",adam,"{'regresion__activation': 'relu', 'regresion__...",-0.276028,...,0.8886,0.855056,0.71334,0.828785,0.775192,0.829446,0.824747,0.046686,5,NN
4,1.982924,0.226258,0.005925,0.000629,relu,0.001,"(100,)",adam,"{'regresion__activation': 'relu', 'regresion__...",-0.298164,...,0.885296,0.853081,0.738084,0.828972,0.785256,0.813324,0.829534,0.042174,4,NN
5,3.831703,0.604308,0.006084,0.000656,relu,0.001,"(128, 64, 32)",adam,"{'regresion__activation': 'relu', 'regresion__...",-0.285204,...,0.899724,0.875657,0.725575,0.861642,0.627896,0.840484,0.821001,0.077933,6,NN
6,1.601778,0.33855,0.00606,0.000994,tanh,0.0001,"(64, 32)",adam,"{'regresion__activation': 'tanh', 'regresion__...",-0.66756,...,0.051033,0.058766,0.033897,0.025659,0.025412,0.053956,0.042827,0.010986,10,NN
7,3.06164,0.587811,0.005657,0.000433,tanh,0.0001,"(100,)",adam,"{'regresion__activation': 'tanh', 'regresion__...",-0.349484,...,0.76197,0.727838,0.6606,0.754414,0.745482,0.760971,0.741048,0.029954,7,NN
8,2.628256,0.5721,0.006323,0.000704,tanh,0.0001,"(128, 64, 32)",adam,"{'regresion__activation': 'tanh', 'regresion__...",-0.683953,...,0.003046,0.001686,0.002079,-0.021448,-0.017115,0.002549,-0.002947,0.008561,12,NN
9,1.288439,0.240589,0.005525,0.000472,tanh,0.001,"(64, 32)",adam,"{'regresion__activation': 'tanh', 'regresion__...",-0.667556,...,0.051045,0.058781,0.033907,0.025665,0.025418,0.053963,0.042836,0.010987,9,NN
