# Housing prices in Hyderabad, India

## Project Objective 🎯

The objective of this project is to develop a regression model to predict housing prices in Hyderabad, India. Using features such as the property's area, location, number of bedrooms, and available amenities, the model will aim to estimate the market value of a property as accurately as possible.

- This predictive model will be a valuable tool for:
- Home Buyers and Sellers: To obtain an objective price estimate for a property.
- Real Estate Agents: To assist with property valuation and client advisory.
- Investors: To identify potentially undervalued or overvalued properties in the market.

## 1.1 Getting training, validation, test datasets

In [27]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.decomposition import PCA
import sys

sys.path.append('../../src/utils')


# Utilities
from regresion_metrics import evaluate_model_metrics, show_model_equation, get_model_coeficients_dataframe


training_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_features.parquet')
training_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_training_labels.parquet')

validation_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_validation_features.parquet')
validation_labels = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_validation_labels.parquet')

test_features = pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_features.parquet')
test_labels= pd.read_parquet('../../datasets/processed/housing_prices/hyderabad_house_price_test_labels.parquet')


## 1.2 Training and Predict with predetermined hyperparameters

In [28]:
# Training
linealRegresionModel = LinearRegression()
linealRegresionModel.fit(training_features, training_labels)

# Predict data sets (validation, test)
validation_predictions = linealRegresionModel.predict(validation_features)
test_predictions = linealRegresionModel.predict(test_features)

validation_rmse = np.sqrt(mean_squared_error(validation_labels, validation_predictions))
validation_r2_score = r2_score(validation_labels, validation_predictions)

# Validation set metrics
validation_metrics = {
    'MAE': mean_absolute_error(validation_labels, validation_predictions),
    'MSE': mean_squared_error(validation_labels, validation_predictions),
    'RMSE': validation_rmse,
    'R²': validation_r2_score
}

# Test set metrics
test_metrics = {
    'MAE': mean_absolute_error(test_labels, test_predictions),
    'MSE': mean_squared_error(test_labels, test_predictions),
    'RMSE': np.sqrt(mean_squared_error(test_labels, test_predictions)),
    'R²': r2_score(test_labels, test_predictions)
}

comparison_df = pd.DataFrame({
    'Validation Set': validation_metrics,
    'Test Set': test_metrics
}).round(4)


print("\n--- Regresion Metrics ---")
print(comparison_df)

print("\n--- Regresion Model Equation ---")
show_model_equation(linealRegresionModel, training_features)

print("\n--- Coeficients ---")
get_model_coeficients_dataframe(linealRegresionModel, training_features)


--- Regresion Metrics ---
      Validation Set  Test Set
MAE           0.1501    0.1605
MSE           0.0393    0.0516
RMSE          0.1982    0.2272
R²            0.8859    0.8750

--- Regresion Model Equation ---
y = 5.3275 + 1.3755 x (Area) - 0.0638 x (No. of Bedrooms) + 0.0353 x (Resale) - 0.0647 x (MaintenanceStaff) - 0.0524 x (Gymnasium) + 0.0108 x (SwimmingPool) + 0.1009 x (LandscapedGardens) - 0.0215 x (JoggingTrack) - 0.0596 x (RainWaterHarvesting) + 0.0582 x (IndoorGames) + 0.0484 x (ShoppingMall) - 0.0108 x (Intercom) + 0.0008 x (SportsFacility) + 0.0218 x (ATM) + 0.0374 x (ClubHouse) - 0.0927 x (School) + 0.0141 x (24X7Security) + 0.0185 x (PowerBackup) - 0.0272 x (CarParking) - 0.0780 x (StaffQuarter) - 0.0432 x (Cafeteria) + 0.0599 x (MultipurposeRoom) + 0.0515 x (Hospital) - 0.0192 x (WashingMachine) + 0.0338 x (Gasconnection) + 0.0738 x (AC) - 0.0209 x (Wifi) + 0.0426 x (Children'splayarea) + 0.0181 x (LiftAvailable) + 0.0116 x (BED) - 0.0043 x (VaastuCompliant) - 0.11

Unnamed: 0,Coeficiente (m)
Area,1.375493
No. of Bedrooms,-0.063752
Resale,0.035315
MaintenanceStaff,-0.064707
Gymnasium,-0.052377
...,...
Location_Tarnaka,0.628071
Location_Tellapur,0.546346
Location_TellapurOsman Nagar Road,0.651408
Location_Toli Chowki,0.531638


### 1.3 Dimensionality Reduction

Problem:

Having too many features (high dimensionality) causes models to overfit, become unstable due to redundant data (multicollinearity), and require significant computational resources to train.

Justification:

PCA reduces the number of features by creating a smaller set of new, uncorrelated features called principal components. This method retains most of the original data's important information (variance) while making the model simpler, faster, and less prone to overfitting.

Action:

- Scale the numerical features first.
- Apply PCA to the scaled data.
- Select the top principal components that explain most of the variance.
- Transform the dataset into this new, smaller set of features.

In [31]:
max_components = training_features.shape[1]
results = []
pca_models = []
score_lambda = lambda row, alpha=1, beta=1, gamma=1: \
    (row['PCA_SCALED_RMSE'] * alpha) + \
    (row['PCA_SCALED_R2_SCORE'] * beta) + \
    (row['SCALED_COMPONENTS'] * gamma)

for n in range(1, max_components + 1):
    pca_model = PCA(n_components=n)
    train_pca_x = pca_model.fit_transform(training_features)
    val_pca_x = pca_model.transform(validation_features)

    pca_explained_variance_ratio = sum(pca_model.explained_variance_ratio_)

    linealRegresionModel = LinearRegression()
    linealRegresionModel.fit(train_pca_x, training_labels)

    val_pca_predictions = linealRegresionModel.predict(val_pca_x)
    reduction_val_rmse = mean_squared_error(validation_labels, val_pca_predictions)
    reduction_val_r2_score = r2_score(validation_labels, val_pca_predictions)

    results.append({
        'Components': n,
        'Explained Variance': pca_explained_variance_ratio,
        'PCA_RMSE': reduction_val_rmse,
        'PCA_R2_Score': reduction_val_r2_score
    })

    pca_models.append(pca_model)



pca_dataset = pd.DataFrame(results)

min_rmse = pca_dataset['PCA_RMSE'].min()
max_rmse = pca_dataset['PCA_RMSE'].max()
delta_rmse = max_rmse - min_rmse

min_r2_score = pca_dataset['PCA_R2_Score'].min()
max_r2_score = pca_dataset['PCA_R2_Score'].max()
delta_r2_score = max_r2_score - min_r2_score

min_comp = pca_dataset['Components'].min()
max_comp = pca_dataset['Components'].max()
delta_comp = max_comp - min_comp


pca_dataset['PCA_SCALED_RMSE'] = 0 if delta_rmse == 0 else (pca_dataset['PCA_RMSE'] - min_rmse) / delta_rmse
pca_dataset['PCA_SCALED_R2_SCORE'] = 0 if delta_r2_score == 0 else (1 - pca_dataset['PCA_R2_Score'] - min_r2_score ) / delta_r2_score
pca_dataset['SCALED_COMPONENTS'] = 0 if delta_comp == 0 else (pca_dataset['Components'] - min_comp) / delta_comp
pca_dataset['SCORE'] = pca_dataset.apply(score_lambda, axis=1)

best_model_idx = pca_dataset['SCORE'].idxmin()
best_model_info = pca_dataset.loc[best_model_idx]
best_pca_model = pca_models[best_model_idx]

print(f"Inital Regresion RMSE: {validation_rmse} - Initial Regresion R2 Score: {validation_r2_score}")
print(f"Best PCA RMSE: {best_model_info['PCA_RMSE']} - Best PCA R2 Score: {best_model_info['PCA_R2_Score']}")

pca_dataset

Inital Regresion RMSE: 0.1982084184914916 - Initial Regresion R2 Score: 0.8859082519240723
Best PCA RMSE: 0.050623645044639835 - Best PCA R2 Score: 0.8529843887019305


Unnamed: 0,Components,Explained Variance,PCA_RMSE,PCA_R2_Score,PCA_SCALED_RMSE,PCA_SCALED_R2_SCORE,SCALED_COMPONENTS,SCORE
0,1,0.369925,0.254714,0.260287,1.000000,0.766320,0.000000,1.766320
1,2,0.454542,0.150858,0.561895,0.517906,0.284227,0.010989,0.813122
2,3,0.520228,0.122935,0.642985,0.388292,0.154612,0.021978,0.564882
3,4,0.569137,0.122817,0.643328,0.387743,0.154063,0.032967,0.574773
4,5,0.604028,0.124508,0.638418,0.395591,0.161911,0.043956,0.601458
...,...,...,...,...,...,...,...,...
87,88,0.998839,0.039809,0.884390,0.002427,-0.231252,0.956044,0.727219
88,89,0.999270,0.039900,0.884127,0.002847,-0.230832,0.967033,0.739048
89,90,0.999657,0.039900,0.884126,0.002849,-0.230830,0.978022,0.750041
90,91,0.999984,0.039827,0.884339,0.002508,-0.231172,0.989011,0.760347
