# Housing prices in Hyderabad, India

## Project Objective 🎯

The objective of this project is to develop a regression model to predict housing prices in Hyderabad, India. Using features such as the property's area, location, number of bedrooms, and available amenities, the model will aim to estimate the market value of a property as accurately as possible.

This predictive model will be a valuable tool for:

- Home Buyers and Sellers: To obtain an objective price estimate for a property.
- Real Estate Agents: To assist with property valuation and client advisory.
- Investors: To identify potentially undervalued or overvalued properties in the market.

## 1. Exploratory Data Analysis (EDA)

In [1]:
import pandas as pd
import numpy as np
import logging
from pandas.errors import ParserError, EmptyDataError

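# Path to the raw CSV export
filepath = '../../datasets/raw/hyderabad_house_price_original.csv'

raw_data = pd.read_csv(filepath)

# Persist a Parquet copy so later cells can reload the data with its dtypes preserved
datasetPath = '../../datasets/processed/housing_prices/hyderabad_house_price_original.parquet'
raw_data.to_parquet(datasetPath)

# Print column names, non-null counts, and dtypes
raw_data.info()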

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2518 entries, 0 to 2517
Data columns (total 40 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Price                2518 non-null   int64 
 1   Area                 2518 non-null   int64 
 2   Location             2518 non-null   object
 3   No. of Bedrooms      2518 non-null   int64 
 4   Resale               2518 non-null   int64 
 5   MaintenanceStaff     2518 non-null   int64 
 6   Gymnasium            2518 non-null   int64 
 7   SwimmingPool         2518 non-null   int64 
 8   LandscapedGardens    2518 non-null   int64 
 9   JoggingTrack         2518 non-null   int64 
 10  RainWaterHarvesting  2518 non-null   int64 
 11  IndoorGames          2518 non-null   int64 
 12  ShoppingMall         2518 non-null   int64 
 13  Intercom             2518 non-null   int64 
 14  SportsFacility       2518 non-null   int64 
 15  ATM                  2518 non-null   int64 
 16  ClubHo

**Initial Observations:**

- The dataset loaded successfully.
- It contains 2,518 rows and 40 columns.
- No column contains null values.
- `Location` is a text column; the other columns are numeric.
- `Price` is the target variable.
- Most columns are flags indicating whether the home includes a specific amenity.
- The remaining columns describe the area and the number of bedrooms.
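
The observations above can also be checked programmatically rather than read off the `info()` output. The following is a minimal sketch: `summarize_frame` is a hypothetical helper that is not part of the original notebook, and the small frame it runs on is made up for illustration.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return basic structural facts about a DataFrame (hypothetical helper)."""
    return {
        # (rows, columns)
        "shape": df.shape,
        # True if any cell in any column is null
        "has_nulls": bool(df.isna().any().any()),
        # Names of all non-numeric columns
        "text_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Toy stand-in for the real dataset; the values are invented
toy = pd.DataFrame({
    "Price": [5_000_000, 7_500_000, 12_000_000],
    "Area": [1000, 1500, 2200],
    "Location": ["Kukatpally", "Gachibowli", "Kukatpally"],
    "Gymnasium": [0, 1, 9],
})

summary = summarize_frame(toy)
print(summary)
```

On the real data, the same call would be `summarize_frame(raw_data)`. Note that a null check alone does not catch placeholder values stored as ordinary integers.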

In [2]:
class ColumnInfo:
    def __init__(self, name, desc, dtype, format_str=None):
        """
        Initializes the feature information.
        
        Args:
            name (str): The name of the column.
            desc (str): The description of the feature.
            dtype (type): The expected data type (e.g., np.int64, str).
            format_str (str, optional): The format string for display.
        """
        self.name = name
        self.desc = desc
        self.dtype = dtype
        self.format_str = format_str

class Column:
    PRICE = ColumnInfo("Price", "The price of the house.", np.int64, format_str='$ {:,.0f}')
    AREA = ColumnInfo("Area", "The area of the property in square feet.", np.int64, format_str='{:,.0f}')
    LOCATION = ColumnInfo("Location", "The neighborhood in Hyderabad.", str)
    NO_OF_BEDROOMS = ColumnInfo("No. of Bedrooms", "The number of bedrooms.", np.int64, format_str='{:.0f}')
    RESALE = ColumnInfo("Resale", "A binary flag indicating if the property is for resale.", np.int64, format_str='{:.0f}')
    MAINTENANCE_STAFF = ColumnInfo("MaintenanceStaff", "A flag for the availability of maintenance staff.", np.int64, format_str='{:.0f}')
    GYMNASIUM = ColumnInfo("Gymnasium", "A flag for the availability of a gymnasium.", np.int64, format_str='{:.0f}')
    SWIMMING_POOL = ColumnInfo("SwimmingPool", "A flag for the availability of a swimming pool.", np.int64, format_str='{:.0f}')
    LANDSCAPED_GARDENS = ColumnInfo("LandscapedGardens", "A flag for the availability of landscaped gardens.", np.int64, format_str='{:.0f}')
    JOGGING_TRACK = ColumnInfo("JoggingTrack", "A flag for the availability of a jogging track.", np.int64, format_str='{:.0f}')
    RAIN_WATER_HARVESTING = ColumnInfo("RainWaterHarvesting", "A flag for the availability of rainwater harvesting.", np.int64, format_str='{:.0f}')
    INDOOR_GAMES = ColumnInfo("IndoorGames", "A flag for the availability of indoor games facilities.", np.int64, format_str='{:.0f}')
    SHOPPING_MALL = ColumnInfo("ShoppingMall", "A flag for the availability of a nearby shopping mall.", np.int64, format_str='{:.0f}')
    INTERCOM = ColumnInfo("Intercom", "A flag for the availability of an intercom facility.", np.int64, format_str='{:.0f}')
    SPORTS_FACILITY = ColumnInfo("SportsFacility", "A flag for the availability of a sports facility.", np.int64, format_str='{:.0f}')
    ATM = ColumnInfo("ATM", "A flag for the availability of a nearby ATM.", np.int64, format_str='{:.0f}')
    CLUB_HOUSE = ColumnInfo("ClubHouse", "A flag for the availability of a club house.", np.int64, format_str='{:.0f}')
    SCHOOL = ColumnInfo("School", "A flag for the availability of a nearby school.", np.int64, format_str='{:.0f}')
    SECURITY_24X7 = ColumnInfo("24X7Security", "A flag for the availability of 24x7 security.", np.int64, format_str='{:.0f}')
    POWER_BACKUP = ColumnInfo("PowerBackup", "A flag for the availability of power backup.", np.int64, format_str='{:.0f}')
    CAR_PARKING = ColumnInfo("CarParking", "A flag for the availability of car parking.", np.int64, format_str='{:.0f}')
    STAFF_QUARTER = ColumnInfo("StaffQuarter", "A flag for the availability of a staff quarter.", np.int64, format_str='{:.0f}')
    CAFETERIA = ColumnInfo("Cafeteria", "A flag for the availability of a cafeteria.", np.int64, format_str='{:.0f}')
    MULTIPURPOSE_ROOM = ColumnInfo("MultipurposeRoom", "A flag for the availability of a multipurpose room.", np.int64, format_str='{:.0f}')
    HOSPITAL = ColumnInfo("Hospital", "A flag for the availability of a nearby hospital.", np.int64, format_str='{:.0f}')
    WASHING_MACHINE = ColumnInfo("WashingMachine", "A flag indicating if a washing machine is included.", np.int64, format_str='{:.0f}')
    GAS_CONNECTION = ColumnInfo("Gasconnection", "A flag for the availability of a gas connection.", np.int64, format_str='{:.0f}')
    AC = ColumnInfo("AC", "A flag indicating if air conditioning is included.", np.int64, format_str='{:.0f}')
    WIFI = ColumnInfo("Wifi", "A flag for the availability of WiFi.", np.int64, format_str='{:.0f}')
    CHILDRENS_PLAY_AREA = ColumnInfo("Children'splayarea", "A flag for the availability of a children's play area.", np.int64, format_str='{:.0f}')
    LIFT_AVAILABLE = ColumnInfo("LiftAvailable", "A flag for the availability of a lift.", np.int64, format_str='{:.0f}')
    BED = ColumnInfo("BED", "A flag indicating if a bed is included.", np.int64, format_str='{:.0f}')
    VAASTU_COMPLIANT = ColumnInfo("VaastuCompliant", "A flag indicating if the property is Vaastu compliant.", np.int64, format_str='{:.0f}')
    MICROWAVE = ColumnInfo("Microwave", "A flag indicating if a microwave is included.", np.int64, format_str='{:.0f}')
    GOLF_COURSE = ColumnInfo("GolfCourse", "A flag for the availability of a golf course.", np.int64, format_str='{:.0f}')
    TV = ColumnInfo("TV", "A flag indicating if a TV is included.", np.int64, format_str='{:.0f}')
    DINING_TABLE = ColumnInfo("DiningTable", "A flag indicating if a dining table is included.", np.int64, format_str='{:.0f}')
    SOFA = ColumnInfo("Sofa", "A flag indicating if a sofa is included.", np.int64, format_str='{:.0f}')
    WARDROBE = ColumnInfo("Wardrobe", "A flag indicating if a wardrobe is included.", np.int64, format_str='{:.0f}')
    REFRIGERATOR = ColumnInfo("Refrigerator", "A flag indicating if a refrigerator is included.", np.int64, format_str='{:.0f}')

    @staticmethod
    def get_format_dict():
        """
        Generates a format dictionary from the features defined in the class
        that have a format_str defined.
        """
        return {
            value.name: value.format_str
            for value in Column.__dict__.values()
            if isinstance(value, ColumnInfo) and value.format_str is not None
        }

    @staticmethod
    def get_dtypes_dict():
        """
        Generates a data type dictionary for all features defined in the class.
        """
        return {
            value.name: value.dtype
            for value in Column.__dict__.values()
            if isinstance(value, ColumnInfo)
        }
    

# Reload the dataset from the Parquet copy written in the previous cell
try:
    dataset = pd.read_parquet(datasetPath)
except (FileNotFoundError, ValueError, ParserError, EmptyDataError) as e:
    logging.exception(f"Error trying to load the dataset '{datasetPath}'. Cause: {e}")
    raise  # Without a dataset the rest of the cell cannot run
    
all_stats = dataset.describe(include='all')
styled_all_stats = all_stats.style.format(Column.get_format_dict(), na_rep="-")

# Iterate over all attributes of the 'Column' class
for attribute_name in dir(Column):
    attribute = getattr(Column, attribute_name)
    
    # Check whether the attribute is a ColumnInfo instance
    if isinstance(attribute, ColumnInfo):
        print(f"\nColumn: {attribute.name}")
        print(f"  - Description: {attribute.desc}")
        # Use .__name__ to get a cleaner type name (e.g. 'int64' instead of <class 'numpy.int64'>)
        print(f"  - Expected Type: {attribute.dtype.__name__}")
        
        # Handle the case where format_str is not defined
        expected_format = attribute.format_str if attribute.format_str else "Not defined"
        print(f"  - Expected Format: {expected_format}")

print("### Descriptive Statistics for ALL Columns ###")
display(styled_all_stats)


Column: AC
  - Description: A flag indicating if air conditioning is included.
  - Expected Type: int64
  - Expected Format: {:.0f}

Column: Area
  - Description: The area of the property in square feet.
  - Expected Type: int64
  - Expected Format: {:,.0f}

Column: ATM
  - Description: A flag for the availability of a nearby ATM.
  - Expected Type: int64
  - Expected Format: {:.0f}

Column: BED
  - Description: A flag indicating if a bed is included.
  - Expected Type: int64
  - Expected Format: {:.0f}

Column: Cafeteria
  - Description: A flag for the availability of a cafeteria.
  - Expected Type: int64
  - Expected Format: {:.0f}

Column: CarParking
  - Description: A flag for the availability of car parking.
  - Expected Type: int64
  - Expected Format: {:.0f}

Column: Children'splayarea
  - Description: A flag for the availability of a children's play area.
  - Expected Type: int64
  - Expected Format: {:.0f}

Column: ClubHouse
  - Description: A flag for the avai

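| Statistic | Price | Area | Location | No. of Bedrooms | Resale | MaintenanceStaff | Gymnasium | SwimmingPool | LandscapedGardens | JoggingTrack | RainWaterHarvesting | IndoorGames | ShoppingMall | Intercom | SportsFacility | ATM | ClubHouse | School | 24X7Security | PowerBackup | CarParking | StaffQuarter | Cafeteria | MultipurposeRoom | Hospital | WashingMachine | Gasconnection | AC | Wifi | Children'splayarea | LiftAvailable | BED | VaastuCompliant | Microwave | GolfCourse | TV | DiningTable | Sofa | Wardrobe | Refrigerator |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | $ 2,518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 | 2518 |
| unique | - | - | 243 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| top | - | - | Kukatpally | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| freq | - | - | 166 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| mean | $ 9,818,380 | 1645 | - | 3 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| std | $ 8,777,113 | 746 | - | 1 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| min | $ 2,000,000 | 500 | - | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25% | $ 4,760,000 | 1160 | - | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50% | $ 7,754,000 | 1500 | - | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 75% | $ 10,900,000 | 1829 | - | 3 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |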


| Feature | Detailed Observation | Implication & Next Steps |
| :--- | :--- | :--- |
| **Price** | The range is vast (₹2M to ₹165M, i.e. ₹16.5 Cr). The mean (**₹9.8M**) is well above the median (**₹7.75M**), and the standard deviation (₹8.8M) is almost as large as the mean. | The distribution is **highly right-skewed** due to high-priced outliers. A **logarithmic transformation** (`np.log1p`) will likely be necessary for better model performance. |
| **Area** | Ranges from 500 to 9,400 sq ft. There is a clear positive trend with `Price`, but the variance in `Price` increases as `Area` gets larger. | `Area` is a **strong, primary predictor**. The increasing variance suggests the relationship might not be perfectly linear. Feature scaling (e.g., `StandardScaler`) will be needed for scale-sensitive models. |
| **Location** | A **high-cardinality categorical feature** with 243 unique values. `Kukatpally` is the most frequent (166 listings). Median prices vary drastically between locations. | This is a critical feature, but its high number of unique values requires a careful **encoding strategy** (e.g., target encoding or grouping) to avoid making the dataset too wide. |
| **No. of Bedrooms** | A discrete feature from 1 to 8, centered around 3. While median price increases with more bedrooms, there is a **significant price overlap** between categories. | This overlap implies that `No. of Bedrooms` alone isn't enough to determine price; its interaction with `Area` and `Location` is key. It can be used directly as a numerical feature. |
| **Resale** | A **binary feature** (0 or 1) indicating if a property is new (0) or a resale (1). | This clearly segments the market. We need to analyze if there is a significant price difference between new and resale properties and check if the classes are imbalanced. |
| **Amenities** | Multiple amenity columns (e.g., `MaintenanceStaff`, `Gymnasium`, `SwimmingPool`) are expected to be binary but contain the value **`9`**. | The value `9` likely represents **missing or unknown data**. This is a critical **data quality issue** that must be addressed during data cleaning, for example by replacing each `9` with the column's mode (likely `0`). |
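
Three of the next steps in the table (handling the `9` placeholder, log-transforming `Price`, and grouping rare locations) can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual cleaning code: the helper names `replace_sentinel_with_mode` and `group_rare_locations`, the `min_count` threshold, and the `"Other"` label are all assumptions introduced here.

```python
import numpy as np
import pandas as pd


def replace_sentinel_with_mode(df, columns, sentinel=9):
    """Replace the sentinel in each flag column with the mode of its valid values."""
    out = df.copy()
    for col in columns:
        valid = out.loc[out[col] != sentinel, col]
        # Fall back to 0 if every value in the column is the sentinel
        fill = int(valid.mode().iloc[0]) if not valid.empty else 0
        out[col] = out[col].replace(sentinel, fill)
    return out


def group_rare_locations(locations, min_count=2, other_label="Other"):
    """Collapse locations seen fewer than min_count times into one label."""
    counts = locations.value_counts()
    rare = counts[counts < min_count].index
    return locations.where(~locations.isin(rare), other_label)


# Toy data; the values are invented for illustration
toy = pd.DataFrame({
    "Price": [4_000_000, 6_000_000, 9_000_000, 30_000_000, 5_500_000],
    "Location": ["Kukatpally", "Kukatpally", "Gachibowli",
                 "Banjara Hills", "Kukatpally"],
    "Gymnasium": [0, 1, 9, 1, 1],
    "SwimmingPool": [9, 0, 0, 1, 0],
})

cleaned = replace_sentinel_with_mode(toy, ["Gymnasium", "SwimmingPool"])
# log1p compresses the long right tail of the price distribution
cleaned["LogPrice"] = np.log1p(cleaned["Price"])
cleaned["LocationGrouped"] = group_rare_locations(cleaned["Location"])
print(cleaned)
```

If a model is trained on the log-transformed target, its predictions have to be mapped back to rupees with `np.expm1`. Grouping rare locations is only one way to reduce cardinality; target encoding, the other option named in the table, would need to be fitted on the training split only to avoid leaking the target.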