- Project Overview
- Objectives
- Dataset
- Features
- Tools & Libraries
- Exploratory Data Analysis (EDA)
- Model Evaluation & Refinement
- Key Insights
- Project Structure
- How to Run the Project
- Future Improvements
- Author & Contact
This project involves:
- Performing Exploratory Data Analysis (EDA) to identify trends, patterns, and correlations.
- Handling missing data and preparing the dataset.
- Building and evaluating regression models (Linear Regression, Polynomial Regression, Ridge Regression).
- Comparing model performance to find the best approach for predicting house prices.
The objective of this project is to analyze residential housing data from King County, USA (including Seattle) and predict house sale prices using regression models. The project simulates a real-world scenario where a Real Estate Investment Trust wants to estimate property values based on features like square footage, number of bedrooms, bathrooms, grade, and location.
- Source: King County Housing dataset (modified for learning purposes)
- Rows: 21,613
- Columns: 22
- Target Variable: `price`
--
- `bedrooms`, `bathrooms`
- `sqft_living`, `sqft_lot`
- `floors`, `waterfront`, `view`, `condition`, `grade`
- `sqft_above`, `sqft_basement`
- `yr_built`, `yr_renovated`, `zipcode`, `lat`, `long`
- `sqft_living15`, `sqft_lot15`
- Python 🐍
- Pandas – Data manipulation & analysis
- Matplotlib, Seaborn – Data visualization
- Scikit-learn (Linear Regression, Ridge Regression, Polynomial Features, Pipelines) – Model development
- Jupyter Notebook – Development & exploration
- GitHub
- Checked for missing values (`bedrooms`, `bathrooms`) and replaced them with the mean.
- Dropped irrelevant columns (`id`, `Unnamed: 0`).
- Distribution of houses by number of floors.
- Boxplot: `waterfront` vs `price` → waterfront houses tend to be more expensive.
- Regression plot: `sqft_above` vs `price` → strong positive correlation.
- Correlation heatmap → `sqft_living`, `grade`, and `sqft_above` are most correlated with `price`.
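
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `clean_housing` is a hypothetical helper name, and the small `toy` frame is made-up data standing in for `data/kc_house_data_NaN.csv`.

```python
import numpy as np
import pandas as pd


def clean_housing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop index-like columns and mean-impute bedrooms/bathrooms (hypothetical helper)."""
    # errors="ignore" keeps the function usable if a column is already gone.
    out = df.drop(columns=["id", "Unnamed: 0"], errors="ignore")
    for col in ("bedrooms", "bathrooms"):
        # Series.mean() skips NaN by default, so the mean covers observed values only.
        out[col] = out[col].fillna(out[col].mean())
    return out


# Made-up rows standing in for the real CSV (assumption, for illustration only).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "id": [101, 102, 103, 104],
    "bedrooms": [3.0, np.nan, 2.0, 4.0],
    "bathrooms": [1.0, 2.0, np.nan, 3.0],
    "floors": [1.0, 2.0, 1.0, 2.0],
    "price": [300000.0, 450000.0, 250000.0, 600000.0],
})

cleaned = clean_housing(toy)

# Remaining missing values after imputation.
print(cleaned.isna().sum().sum())

# Distribution of houses by number of floors.
print(cleaned["floors"].value_counts().to_frame())

# Correlation of each remaining column with price, strongest first.
corr_with_price = cleaned.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

With the real data loaded into `df`, calls such as `sns.boxplot(x="waterfront", y="price", data=df)` and `sns.regplot(x="sqft_above", y="price", data=df)` would produce the kind of plots described in the list.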
- Simple Linear Regression
  - Feature: `long` → very weak predictor (R² ≈ 0.00046).
  - Feature: `sqft_living` → moderate predictor (R² ≈ 0.49).
- Multiple Linear Regression
  - Features: `floors`, `waterfront`, `lat`, `bedrooms`, `sqft_basement`, `view`, `bathrooms`, `sqft_living15`, `sqft_above`, `grade`, `sqft_living`.
  - R² ≈ 0.65
- Polynomial Regression with Pipeline
  - Degree = 2
  - R² ≈ 0.75
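
The three models above could be set up along these lines. This is a sketch under assumptions, not the notebook's code: it runs on randomly generated data (`make_synthetic_housing` is a hypothetical stand-in for the cleaned dataset), it scores each model on the data it was fitted on, and the `StandardScaler` step in the pipeline is an assumption. The R² values it prints will not match the figures reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

FEATURES = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view",
            "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]


def make_synthetic_housing(n_rows=300, seed=0):
    """Random stand-in for the cleaned dataset (hypothetical, not real housing data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.normal(size=(n_rows, len(FEATURES))), columns=FEATURES)
    df["long"] = rng.normal(size=n_rows)  # unrelated to price by construction
    # Price: linear in grade, linear plus quadratic in sqft_living, plus noise.
    df["price"] = (3.0 * df["grade"] + 2.0 * df["sqft_living"]
                   + 1.5 * df["sqft_living"] ** 2
                   + rng.normal(scale=0.5, size=n_rows))
    return df


def r2_of(model, df, columns):
    """Fit on the given columns and return in-sample R²."""
    X, y = df[columns], df["price"]
    return model.fit(X, y).score(X, y)


df = make_synthetic_housing()

# Simple linear regression: one feature at a time.
r2_long = r2_of(LinearRegression(), df, ["long"])
r2_simple = r2_of(LinearRegression(), df, ["sqft_living"])

# Multiple linear regression on the feature list from the README.
r2_multi = r2_of(LinearRegression(), df, FEATURES)

# Degree-2 polynomial regression wrapped in a pipeline.
poly_pipe = Pipeline([
    ("scale", StandardScaler()),  # assumed preprocessing step
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
r2_poly = r2_of(poly_pipe, df, FEATURES)

print(f"long: {r2_long:.4f}  sqft_living: {r2_simple:.4f}")
print(f"multiple: {r2_multi:.4f}  polynomial: {r2_poly:.4f}")
```

Because each model is scored on its own training data, adding features or polynomial terms can only raise R² here; a held-out test set is needed to tell whether the extra terms generalize.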
- Train-Test Split (85% train / 15% test).
- Ridge Regression (α=0.1): R² ≈ 0.64 on test data.
- Polynomial Features + Ridge Regression: R² ≈ 0.70 on test data.
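
A rough sketch of the evaluation steps above, using synthetic arrays rather than the housing data. The 85/15 split and α=0.1 come from the list; the polynomial degree of 2, the `random_state`, and the generated `X` and `y` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with one linear and one quadratic effect (not the real dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 1.5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=400)

# Hold out 15% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1
)

# Plain ridge regression on the raw features.
ridge = Ridge(alpha=0.1).fit(X_train, y_train)

# Degree-2 polynomial expansion followed by ridge regression.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2), Ridge(alpha=0.1)
).fit(X_train, y_train)

# R² on the held-out rows only.
r2_ridge = ridge.score(X_test, y_test)
r2_poly_ridge = poly_ridge.score(X_test, y_test)
print(f"ridge: {r2_ridge:.3f}  polynomial + ridge: {r2_poly_ridge:.3f}")
```

Scoring on `X_test` rather than `X_train` is what makes these numbers comparable across models of different complexity.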
--
- Most important predictors: `sqft_living`, `grade`, `sqft_above`, `sqft_living15`.
- Waterfront houses and homes with higher grade have significantly higher prices.
- Polynomial regression improves performance compared to simple linear models.
📦 House-Sale-Analysis-EDA-Regression-Python
│
├── README.md
├── .gitignore
├── notebooks/            # Jupyter notebooks
│   └── regression_exploratory_data_analysis.ipynb
└── data/
    └── kc_house_data_NaN.csv
- Clone the repository:
  `git clone https://github.com/codewithchirag18/House-Sale-Analysis-EDA-Regression-Python.git`
- Navigate to the folder:
  `cd House-Sale-Analysis-EDA-Regression-Python`
- Install required libraries:
  `pip install -r requirements.txt`
- Open the Jupyter Notebook:
  `jupyter notebook notebooks/regression_exploratory_data_analysis.ipynb`
- Try Random Forest or Gradient Boosting for better accuracy.
- Feature engineering (e.g., combine year built + renovation into “house age”).
- Deploy the model using Flask / Streamlit for predictions.
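
As one possible take on the "house age" idea above, the sketch below measures age from the most recent of construction or renovation. It assumes that `yr_renovated` is 0 for homes that were never renovated; `add_house_age` and its `reference_year` parameter are hypothetical names, and the rows in `toy_years` are made up.

```python
import numpy as np
import pandas as pd


def add_house_age(df: pd.DataFrame, reference_year: int = 2015) -> pd.DataFrame:
    """Add a house_age column counted from the last build or renovation year."""
    out = df.copy()
    # Assumption: yr_renovated == 0 means the house was never renovated.
    last_update = np.where(out["yr_renovated"] > 0,
                           out["yr_renovated"], out["yr_built"])
    out["house_age"] = reference_year - last_update
    return out


# Made-up rows for illustration only.
toy_years = pd.DataFrame({
    "yr_built": [1955, 2001, 1987],
    "yr_renovated": [1991, 0, 0],
})
aged = add_house_age(toy_years)
print(aged)
```

Using the sale year of each row instead of a fixed `reference_year` would be a natural refinement if a sale date column is available.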
Chirag Tomar
Data Analyst
📧 Email: tomarchirag431@gmail.com
🔗 LeetCode
--