Zillow Clustering Project

Project Description

Zillow is a billion dollar real estate business that sells thousands of houses in the state of California alone. We have decided to look into log error, the overvaluation or undervalution of a houses' worth. We will look into what features if any influence the amount of log error and create a model to estimate the log error for properties.

Project Goal

Discover driver of log error from the Zillow data
Use drivers to develop machine learning model to predict log error
This information will be used to further our understanding of which elements contribute to or detract from a log_error.

Initial Thoughts

Log Error is driven by certain home features, such as size, location, or value metrics.

The Plan

Aquire data
Explore data in search of what causes log_error
- Answer the following initial questions
  - What is the distribution of logerror?
  - Is a property more likely to over evaluated or under evaluated?
  - If we cluster location data with home age, is there a relationship with log error?
  - If we cluster size features(Bath bed ratio and calculated finished square feet), is there a relationship with log error?
  - If we cluster value features(tax value, structure dollar square feet), is there a relationship with log error?
Develop a Model to predict log error
- Use drivers identified in explore to build predictive models of different types
- Evaluate models on train and validate data
- Select the best model to use on test data
Draw Conclusions

Data Dictionary

Feature	Definition
fips	Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more detail
latitude	Latitude of the middle of the parcel multiplied by 10e6
longitude	Longitude of the middle of the parcel multiplied by 10e6
LA	fips for the LA county
Orange	fips for the Orange county
Ventura	fips for the Ventura county
yearbuilt	The Year the principal residence was built
age	The year sold, 2017, minus the yearbuilt
age_bin	The age of the residence divided into several bins
taxamount	The total property tax assessed for that assessment year
taxrate	The taxamount divided by tax value multiplied by 100
taxvalue	The total tax assessed value of the parcel
lot_sqft	Area of the lot in square feet
acres	lot_sqft divided by 43560
acres_bin	The acres of residence divided into several bins
sqft_bin	The sqft of residence divided into several bins
structure_dollar_per_sqft	The tax value divided by sqft
structure_dollar_sqft_bin	A division of the structure dollar divided into several bins
land_dollar_per_sqft	land_value divided by lot_sqft
lot_dollar_sqft_bin	land_dollar_per_sqft divided into several bins
bath_count	number of bathrooms in residence
bed_count	number of bedrooms in residence
bath_bed_ratio	bath_count divided by bed_count
cola	Whether or not a residence is in the city of LA

Steps to Reproduce

Clone this repository
Ensure an .env file is in the repo, with definitions of host, username, and password for the Codeup database
Aquire the data from Codeup SQL database
Put the data in the file containing the cloned repo
Run notebook

Takeaways and Conclusions

Feature engineering suburban vs. urban areas is likely to provide value to a model
Feature engineering neighborhoods within the three counties is likely to improve modeling
Additional creation of clusters could provide more insight into log error

Next Steps

Better data collection from real estate agents and brokers will help avoid missing values, which plague this dataset
Features that include amenities that are not part of a home, such as crime rate, distance from hospitals, and school districts are likely to add value to a model

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.gitignore		.gitignore
Burton- notebook.ipynb		Burton- notebook.ipynb
README.md		README.md
cristina_notebook.ipynb		cristina_notebook.ipynb
explore.py		explore.py
final_report.ipynb		final_report.ipynb
model.py		model.py
wrangle.py		wrangle.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Zillow Clustering Project

Project Description

Project Goal

Initial Thoughts

The Plan

Data Dictionary

Steps to Reproduce

Takeaways and Conclusions

Next Steps

About

Releases

Packages

Contributors 2

Languages

burton-barnes-cristina-lucin/zillow-clustering-project

Folders and files

Latest commit

History

Repository files navigation

Zillow Clustering Project

Project Description

Project Goal

Initial Thoughts

The Plan

Data Dictionary

Steps to Reproduce

Takeaways and Conclusions

Next Steps

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages