Data Analysis with Python: House Sales in King County, USA

Repository Overview

This repository contains an analysis of house sale prices in King County, Washington, which includes Seattle. The dataset covers houses sold between May 2014 and May 2015.

The goal of this project is to predict house prices from features such as the number of bedrooms, the number of bathrooms, and square footage.

Phases

Importing Data Sets: I imported the necessary libraries and loaded the dataset from the provided CSV file. I then displayed the first few rows and checked the data type of each column.
Data Wrangling: I cleaned and preprocessed the data by dropping unnecessary columns and replacing missing values with the mean of the respective column.
Exploratory Data Analysis: I generated visualizations such as box plots and regression plots to understand how the different features correlate with the target variable, price.
Model Development: I built regression models from the features identified during the exploratory data analysis. These include linear regression models and a pipeline that predicts house prices. I calculated the R^2 value for each model.
Model Evaluation and Refinement: I split the dataset into training and testing sets to evaluate the models. I then used Ridge regression and polynomial transformations to try to improve their performance.
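The Data Wrangling and Model Evaluation phases above can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: the rows are made up, the column name `sqft_living` is an assumption, and `fill_missing_with_mean` and `split_rows` are hypothetical helpers written for this example.

```python
import random
from statistics import mean

# Made-up rows standing in for the CSV; None marks a missing value.
rows = [
    {"bedrooms": 3, "bathrooms": 1.0, "sqft_living": 1100, "price": 200000.0},
    {"bedrooms": None, "bathrooms": 2.5, "sqft_living": 2400, "price": 500000.0},
    {"bedrooms": 2, "bathrooms": None, "sqft_living": 800, "price": 170000.0},
    {"bedrooms": 4, "bathrooms": 3.0, "sqft_living": 2000, "price": 600000.0},
]


def fill_missing_with_mean(rows, column):
    """Return new rows where None in `column` is replaced by the column mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    column_mean = mean(observed)
    return [
        {**r, column: column_mean if r[column] is None else r[column]}
        for r in rows
    ]


def split_rows(rows, test_fraction=0.25, seed=1):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Impute each column that has gaps, then hold out part of the data for testing.
cleaned = rows
for column in ("bedrooms", "bathrooms"):
    cleaned = fill_missing_with_mean(cleaned, column)

train, test = split_rows(cleaned)
print(len(train), "training rows,", len(test), "test rows")
```

The helpers return new lists instead of modifying `rows` in place, so the raw data stays available for comparison after cleaning.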

Conclusion

In this project, I developed and evaluated several regression models that predict house prices from different features. Refining the model and applying polynomial transformations produced higher R^2 values, which indicate a better fit. The R^2 scores of the individual models compare as follows:

Simple Linear Regression: This model has the lowest R^2 score of all the models. A single predictor explains only about 49% of the variance in the dependent variable (price).
Multiple Linear Regression: The R^2 score is higher than for Simple Linear Regression (about 66% of the variance explained), suggesting that using multiple features to predict the target variable results in a better fit.
Pipeline Model: This model's R^2 score (0.751) is the highest among all the models, suggesting that the preprocessing steps in the pipeline, including feature scaling, improved the fit.
Ridge Regression: The R^2 score (0.648) is slightly lower than that of the Multiple Linear Regression model (0.658), which implies that the regularization term did not improve the model's performance on the test data.
Polynomial Ridge Regression: The R^2 score (0.700) is higher than that of Ridge Regression but lower than that of the Pipeline Model. Transforming the data into higher-order polynomial features before applying Ridge Regression gave a better fit than Ridge Regression alone, but not as good a fit as the Pipeline Model.
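For reference, the R^2 score used in these comparisons is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it for a one-feature least-squares fit on made-up numbers (not the King County data). It also checks that standardising the feature leaves R^2 unchanged for a plain least-squares fit, so it may be worth checking which pipeline step accounts for a gain. The function and variable names are illustrative assumptions.

```python
from statistics import mean, pstdev


def fit_simple_linear(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Made-up square footage and prices, for illustration only.
sqft = [800, 1200, 1500, 2000, 2600]
price = [190000, 260000, 310000, 450000, 520000]

slope, intercept = fit_simple_linear(sqft, price)
score = r_squared(price, [intercept + slope * s for s in sqft])

# Standardise the feature (zero mean, unit variance) and fit again.
mu, sigma = mean(sqft), pstdev(sqft)
sqft_scaled = [(s - mu) / sigma for s in sqft]
slope_s, intercept_s = fit_simple_linear(sqft_scaled, price)
score_scaled = r_squared(price, [intercept_s + slope_s * s for s in sqft_scaled])

print(f"R^2 raw: {score:.4f}, R^2 scaled: {score_scaled:.4f}")
```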

Model	R^2 Score
Simple Linear Regression	0.492853
Multiple Linear Regression	0.657695
Pipeline Model	0.751340
Ridge Regression	0.647876
Polynomial Ridge Regression	0.700274
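The Ridge Regression score sits slightly below the Multiple Linear Regression score in the table (0.647876 versus 0.657695). The sketch below illustrates why a penalty can lower R^2. For a single feature with an unpenalised intercept, the ridge slope has the closed form S_xy / (S_xx + alpha), so a larger alpha shrinks the slope. On the data used for fitting, this can only lower R^2 relative to the unpenalised fit. On held-out data the effect can go either way. The data and helper names are made up for illustration and are not taken from the notebook.

```python
from statistics import mean


def fit_ridge_1d(x, y, alpha):
    """One-feature ridge fit: minimises sum((y - b0 - b1*x)^2) + alpha * b1^2."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # alpha = 0 gives ordinary least squares
    return slope, y_bar - slope * x_bar


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Made-up data, roughly following y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

results = {}
for alpha in (0.0, 1.0, 10.0):
    slope, intercept = fit_ridge_1d(x, y, alpha)
    pred = [intercept + slope * xi for xi in x]
    results[alpha] = (slope, r_squared(y, pred))
    print(f"alpha={alpha:>4}: slope={slope:.3f}, training R^2={results[alpha][1]:.4f}")
```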

In summary, the Pipeline Model has the highest R^2 score (0.751) of the evaluated models, making it the best fit for the target variable (price). The preprocessing steps in the pipeline appear to account for this improvement. However, other factors, such as model complexity, interpretability, and training time, should be considered before selecting a model for a specific application.

danielmschaves/ibm-data-analysis-python-project

Folders and files

Latest commit

History

Repository files navigation

Data Analysis with Python: House Sales in King County, USA

Repository Overview

Phases

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages