# A Comprehensive Analysis Of Predicting Housing Prices


## Introduction

In the ever-evolving real estate market, accurately predicting housing prices is crucial for investors, homeowners, and policy makers alike. This project aims to leverage the power of data science to forecast housing prices based on a variety of features, including location, square footage, and additional house characteristics. Through meticulous data importing, cleaning, and manipulation, followed by exploratory data analysis (EDA), hypothesis testing, and predictive modeling, we seek to uncover the underlying patterns that drive housing prices.

Our journey begins with gathering comprehensive housing data from various sources, followed by rigorous preprocessing to ensure data quality and usability. We then dive deep into the data, employing statistical and visual analysis techniques to explore relationships and trends. Hypothesis testing allows us to challenge assumptions and gain insights, while machine learning models enable us to predict prices with accuracy. Finally, we encapsulate our findings, model performance, and insights in a detailed report, complemented by visualizations to aid in understanding.

This notebook serves as a structured guide through each phase of the project, from data importation to predictive analytics and reporting. Whether you're a seasoned data scientist or a curious enthusiast, this analysis aims to provide valuable insights into the dynamics of housing prices and demonstrate the power of data-driven decision-making in the real estate domain.


# Let's Begin!


## Data Importing

#### The dataset is taken from [Kaggle: Property Listings in Kuala Lumpur](https://www.kaggle.com/datasets/dragonduck/property-listings-in-kuala-lumpur?resource=download)

In this initial phase, we source and load our housing data. The data comes from a public Kaggle dataset of property listings in Kuala Lumpur. Using the pandas library, we load it into our Python environment, setting the stage for the subsequent steps of data cleaning and manipulation.

First we install the **pandas** library, then use it to import the **housing.csv** file into Python.


In [6]:
!pip install pandas



In [7]:
import pandas as pd

# Read the housing.csv file

df = pd.read_csv("housing.csv")

df.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur","RM 1,250,000",2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur","RM 6,800,000",6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur","RM 1,030,000",3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
3,"Cheras, Kuala Lumpur",,,,,,,
4,"Bukit Jalil, Kuala Lumpur","RM 900,000",4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished
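
The preview suggests that several columns are stored as text and that some rows are almost empty. A minimal sketch of how this could be checked is shown here. It rebuilds three of the preview rows in memory instead of reading `housing.csv`, so the names `sample_csv` and `sample` are illustrative only, and the counts apply to this sample rather than the full dataset.

```python
import io

import pandas as pd

# Three rows copied from the preview above (not the full dataset).
sample_csv = io.StringIO(
    'Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing\n'
    '"KLCC, Kuala Lumpur","RM 1,250,000",2+1,3.0,2.0,Serviced Residence,'
    '"Built-up : 1,335 sq. ft.",Fully Furnished\n'
    '"Damansara Heights, Kuala Lumpur","RM 6,800,000",6,7.0,,Bungalow,'
    'Land area : 6900 sq. ft.,Partly Furnished\n'
    '"Cheras, Kuala Lumpur",,,,,,,\n'
)
sample = pd.read_csv(sample_csv)

# Column types: Price, Rooms and Size are read as text, not numbers.
dtypes = sample.dtypes

# Number of missing values per column, most affected columns first.
missing = sample.isna().sum().sort_values(ascending=False)

print(dtypes)
print(missing)
```

On the real data, the same two calls (`df.dtypes` and `df.isna().sum()`) show which columns need type conversion and which need missing-value handling.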


## Data Cleaning

Data cleaning is a critical step to prepare our dataset for analysis. This process involves handling missing values, removing duplicate entries, and converting data types to ensure consistency across our dataset. These actions are essential for maintaining data quality and reliability, paving the way for accurate and insightful analysis.
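
The three cleaning actions named above can be sketched on a toy table before applying them to the real data. The rows below are made up for illustration, and `toy` and `clean` are hypothetical names; this is one possible ordering of the steps, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Hypothetical toy listings, made up for illustration only.
toy = pd.DataFrame({
    "Location": ["KLCC", "KLCC", "Cheras", "Dutamas"],
    "Price": ["RM 1,250,000", "RM 1,250,000", None, "RM 1,030,000"],
    "Furnishing": ["Fully Furnished", "Fully Furnished", None, None],
})

clean = (
    toy.drop_duplicates()  # 1. remove exact duplicate rows
    .assign(
        # 2. convert the price text to a number; unparseable values become NaN
        Price=lambda d: pd.to_numeric(
            d["Price"]
            .str.replace("RM ", "", regex=False)
            .str.replace(",", "", regex=False),
            errors="coerce",
        ),
        # 3a. label missing categories instead of dropping the row
        Furnishing=lambda d: d["Furnishing"].fillna("Unknown"),
    )
    .dropna(subset=["Price"])  # 3b. drop rows that have no target value
    .reset_index(drop=True)
)

print(clean)
```

Chaining the steps with `assign` avoids modifying a filtered copy in place, which is a common source of pandas warnings.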


In [8]:
# The 'Price' column contains values formatted like "RM 1,250,000"
# Remove the 'RM ' prefix and the thousands separators, then convert to numeric
df['Price'] = df['Price'].str.replace('RM ', '', regex=False).str.replace(',', '', regex=False)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Display the cleaned head of the DataFrame to verify changes
df.head()



Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur",1250000.0,2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur",6800000.0,6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur",1030000.0,3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
3,"Cheras, Kuala Lumpur",,,,,,,
4,"Bukit Jalil, Kuala Lumpur",900000.0,4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished


In [9]:
# Handling missing values
# For the categorical 'Furnishing' column, fill NaN with the placeholder "Unknown"
df['Furnishing'] = df['Furnishing'].fillna('Unknown')

# Drop rows where 'Price' is NaN, since the target value is required
df.dropna(subset=['Price'], inplace=True)

# Replace NaN values in 'Car Parks' with 0 (assumes a missing value means no car park)
df['Car Parks'] = df['Car Parks'].fillna(0)

# Display the cleaned DataFrame to verify changes
df.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur",1250000.0,2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur",6800000.0,6,7.0,0.0,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur",1030000.0,3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
4,"Bukit Jalil, Kuala Lumpur",900000.0,4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000.0,4+2,5.0,4.0,Bungalow,Land area : 7200 sq. ft.,Partly Furnished


## Data Manipulation

With a clean dataset, we proceed to manipulate the data to better suit our analytical needs. This includes feature engineering to create new, insightful variables, encoding categorical variables for machine learning readiness, and scaling numerical features to normalize their ranges. These steps are crucial for enhancing our dataset's utility and preparing it for exploratory data analysis and model training.
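
As a rough illustration of these three steps, the sketch below uses pandas only. The toy rows are made up, and `PricePerSqFt` and `SizeScaled` are hypothetical column names chosen for this example; min-max scaling stands in for whichever scaling method the project ends up using.

```python
import pandas as pd

# Hypothetical cleaned listings, made up for illustration only.
toy = pd.DataFrame({
    "Price": [1_250_000.0, 6_800_000.0, 1_030_000.0, 900_000.0],
    "Size": [1335.0, 6900.0, 1875.0, 1513.0],
    "Furnishing": ["Fully Furnished", "Partly Furnished", "Partly Furnished", "Unknown"],
})

# Feature engineering: price per square foot.
toy["PricePerSqFt"] = toy["Price"] / toy["Size"]

# Encoding: one 0/1 indicator column per furnishing category.
encoded = pd.get_dummies(toy, columns=["Furnishing"], dtype=int)

# Scaling: min-max scale Size into the range [0, 1].
size = encoded["Size"]
encoded["SizeScaled"] = (size - size.min()) / (size.max() - size.min())

print(encoded.round(3))
```

When a model is trained later, the scaling minimum and maximum should be computed on the training split only and then reused on the test split, so that no information from the test data leaks into training.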


In [10]:
import numpy as np

# Converting data types
# Extract the first number from 'Size' (e.g. "Built-up : 1,335 sq. ft." -> 1335.0)
df['Size'] = (
    df['Size']
    .str.extract(r'(\d+,\d+|\d+)', expand=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

# 'Rooms' should be numeric; a custom function handles cases such as "2+1"
def convert_rooms(room):
    if pd.isnull(room):
        return np.nan  # Keep missing values as NaN
    parts = room.split('+')
    if len(parts) == 2:
        if parts[1] != "":
            return float(parts[0]) + float(parts[1])  # "2+1" becomes 3.0
        else:
            return float(parts[0])  # "2+" becomes 2.0
    try:
        return float(room)
    except ValueError:
        return np.nan  # Return NaN if conversion fails

df['Rooms'] = df['Rooms'].apply(convert_rooms)

df.head()

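
Extracting only the number from `Size` drops the distinction between built-up size and land area, which may matter for pricing. The sketch below shows one way to keep both parts. It assumes every value follows the `"<type> : <number> sq. ft."` pattern seen in the preview rows, and `SizeType` and `SizeValue` are hypothetical column names.

```python
import pandas as pd

# Two values copied from the preview rows, plus one missing value.
sizes = pd.Series([
    "Built-up : 1,335 sq. ft.",
    "Land area : 6900 sq. ft.",
    None,
])

# Named groups become column names; rows that do not match yield NaN.
parsed = sizes.str.extract(r"^(?P<SizeType>[^:]+?)\s*:\s*(?P<SizeValue>[\d,]+)")

# Remove thousands separators and convert the numeric part to float.
parsed["SizeValue"] = pd.to_numeric(
    parsed["SizeValue"].str.replace(",", "", regex=False)
)

print(parsed)
```

Values in the full dataset that use another layout would come out as NaN here and would need separate handling.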

## Exploratory Data Analysis (EDA) and Visualization

Exploratory Data Analysis (EDA) allows us to dive deep into the dataset, uncovering patterns, relationships, and insights. Through visualizations such as histograms, scatter plots, and heatmaps, we gain a comprehensive understanding of the data's characteristics and the factors influencing housing prices. This visual and statistical exploration is pivotal in guiding our hypothesis testing and predictive modeling efforts.
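
Before plotting, the numbers behind such charts can be computed directly with pandas. The sketch below uses the five preview rows shown earlier, with `Size` and `Rooms` converted by hand, so the results describe this tiny sample only and say nothing about the full dataset.

```python
import pandas as pd

# The five preview rows after cleaning (Size and Rooms converted by hand).
toy = pd.DataFrame({
    "Price": [1_250_000.0, 6_800_000.0, 1_030_000.0, 900_000.0, 5_350_000.0],
    "Size": [1335.0, 6900.0, 1875.0, 1513.0, 7200.0],
    "Rooms": [3.0, 6.0, 3.0, 5.0, 6.0],
    "Property Type": [
        "Serviced Residence", "Bungalow", "Condominium (Corner)",
        "Condominium (Corner)", "Bungalow",
    ],
})

# Distribution of the target: the numbers a histogram would show.
summary = toy["Price"].describe()

# Pairwise linear correlation: the numbers a heatmap would show.
corr = toy[["Price", "Size", "Rooms"]].corr()

# Typical price per property type, highest first.
median_by_type = (
    toy.groupby("Property Type")["Price"].median().sort_values(ascending=False)
)

print(summary)
print(corr.round(2))
print(median_by_type)
```

Medians are used for the group comparison because listing prices tend to be skewed by a few very expensive properties.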


## Hypothesis Testing

Armed with insights from our EDA, we formulate and test hypotheses regarding factors that may influence housing prices. Utilizing statistical tests, such as t-tests or chi-square tests, we assess the validity of these hypotheses, providing a data-driven foundation for our predictive models. This step is crucial for identifying significant variables and relationships within our dataset.
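
As one possible instance of such a test, the sketch below compares the mean price of two furnishing groups with Welch's t-test from SciPy. The prices are made up for illustration, and the 0.05 significance level is a conventional choice assumed here, not a value taken from this project.

```python
import numpy as np
from scipy import stats

# Hypothetical prices (RM) for two furnishing groups, made up for illustration.
fully = np.array([1_250_000, 1_400_000, 980_000, 1_100_000, 1_600_000], dtype=float)
partly = np.array([1_030_000, 900_000, 850_000, 1_200_000, 950_000], dtype=float)

# H0: both groups have the same mean price.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(fully, partly, equal_var=False)

alpha = 0.05  # assumed significance level
reject_h0 = bool(result.pvalue < alpha)

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}, reject H0: {reject_h0}")
```

For two categorical columns, such as property type and furnishing, `scipy.stats.chi2_contingency` applied to a `pd.crosstab` table would play the corresponding role.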


## Predictive Analytics (Machine Learning / Deep Learning)

Transitioning from analysis to prediction, we employ machine learning models to forecast housing prices. This phase involves splitting our data into training and testing sets, selecting appropriate models, and training them on our dataset. Through model evaluation on the held-out test set, we assess the accuracy of our predictions, striving for models with low prediction error that generalize well to unseen listings.
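
The split, train, and evaluate workflow can be sketched with scikit-learn. No model has been chosen in this notebook yet, so linear regression is used here only as a simple baseline, and the data is synthetic: the coefficients and noise level below are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: price grows with size and rooms, plus noise.
rng = np.random.default_rng(0)
size = rng.uniform(500, 5000, 200)
rooms = rng.integers(1, 7, 200)
price = 600 * size + 50_000 * rooms + rng.normal(0, 100_000, 200)

X = np.column_stack([size, rooms])

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# Fit on the training split only, then predict the unseen test split.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

print(f"MAE: {mae:,.0f}  R^2: {r2:.3f}")
```

Mean absolute error is reported in the same unit as the price, which makes it easy to interpret, while R² summarizes how much of the price variation the model explains.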


## Reporting

In the final phase of our project, we compile and present our findings, insights, and model performance metrics. This comprehensive report not only highlights the key outcomes of our exploratory data analysis and hypothesis testing but also showcases the predictive power of our models. Through detailed visualizations and narrative, we provide a clear and engaging overview of our project's achievements and implications for the real estate market.


## Extra Bonus Components

Beyond the core components of our project, we explore additional enhancements such as web deployment, dashboard creation, prescriptive analytics, and GUI development. These extra features aim to extend the applicability and accessibility of our analysis, offering real-time insights, interactive visualizations, and user-friendly interfaces for diverse audiences.
