Collaborators: Miguel Villoslada and Andrew Coupe
We are working as analysts for a real estate company. Our company wants to build a machine learning model to predict the selling prices of houses based on a variety of features used to evaluate a house's value.
Our job is to build a model that predicts the price of a house based on the features in the dataset. Our management also wants to explore the characteristics of the houses using business intelligence tools, including understanding which factors are responsible for higher property values of $650K and above.
We followed the workflow below in Python to build the required model. Explore our Python Notebook at the following link.
The dataset we used consists of information on roughly 22,000 properties and contains historic data on houses sold between May 2014 and May 2015. Our first approach consisted of:
- Importing the data
- Exploring the data (EDA)
- Data wrangling
- Data cleaning
- The total dataset consisted of 21,597 rows, with no nulls or duplicates. Since the project concerned houses, we first checked whether any properties shared a geolocation, and found that some houses were sold more than once between May 2014 and May 2015. Because we want to predict house prices from different features, we will first model the data with all data points, then run it again keeping only the most recent sale for each house.
- We also found a house listed with 33 bedrooms, all located on a single floor of around 100 m², which would leave each bedroom only a few square metres. We assume this is bad data or an entry error and removed the property (see the pandas sketch after this list).
- We also looked to clarify the definitions of the features sqft_living15, grade, condition, and bathrooms.
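As a rough illustration of the import and cleaning steps above, a minimal pandas sketch might look like the following; the file name `kc_house_data.csv` is an assumption, and the column names (`id`, `bedrooms`) follow the features mentioned in this write-up.

```python
import pandas as pd

# Load the dataset (file name is an assumption, not the notebook's actual path).
df = pd.read_csv("kc_house_data.csv")

# Basic checks: 21,597 rows, no nulls, no exact duplicates.
print(df.shape)
print(df.isna().sum().sum())    # 0
print(df.duplicated().sum())    # 0

# Rows sharing an id are properties resold within May 2014 - May 2015.
resold = df[df.duplicated(subset="id", keep=False)]
print(f"{resold['id'].nunique()} houses were sold more than once")

# The 33-bedroom listing sits on one floor with a small living area,
# so we treat it as a data-entry error and drop it.
df = df[df["bedrooms"] != 33]
```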
We carried out the first iteration of linear regression with our machine learning model, dropping only the columns id and date. Our R² accuracy score was 0.7068, so the model performs well, but we wanted to improve it.
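A sketch of this first iteration, assuming a standard scikit-learn train/test split (the split ratio and random seed are our assumptions, and `price` is also dropped from the predictors since it is the target):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Drop only the identifier columns; every other feature stays in.
X = df.drop(columns=["id", "date", "price"])  # price is the target
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)

lm = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, lm.predict(X_test)))  # we obtained ~0.71
```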
For the second linear regression iteration we dropped duplicate ids for houses sold several times, keeping only the last sale, changed the year of renovation to a binary 0/1 flag, and removed the bad-data outlier of the 33-bedroom property. We also rescaled the data using MinMaxScaler. Our R² went down to 0.6914.
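The second-iteration preprocessing could be sketched as follows; `yr_renovated` is our assumed name for the renovation-year column:

```python
from sklearn.preprocessing import MinMaxScaler

# Keep only the most recent sale for houses sold more than once.
df2 = df.sort_values("date").drop_duplicates(subset="id", keep="last").copy()

# Encode renovation as a binary flag: 1 if ever renovated, else 0.
df2["yr_renovated"] = (df2["yr_renovated"] > 0).astype(int)

X = df2.drop(columns=["id", "date", "price"])
y = df2["price"]

# Rescale every feature to [0, 1]. For simplicity the scaler is fit on
# the full feature matrix here; fitting on the training split only
# would avoid leakage.
X_scaled = MinMaxScaler().fit_transform(X)
```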
In our third iteration we used a KNN model with K=4; the resulting R² of 0.7864 was one we were very pleased with, considering our previous results.
After running three versions of regression models, we concluded that our best price prediction model was the K-Nearest Neighbours (KNN) model with K=4, which gave us a satisfying R² of 0.7864.
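A minimal sketch of the KNN model, continuing from the scaled features above (split settings again assumed):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

knn = KNeighborsRegressor(n_neighbors=4)  # K=4
knn.fit(X_train, y_train)
print(r2_score(y_test, knn.predict(X_test)))  # we obtained ~0.79
```

The rescaling matters here: KNN is distance-based, so features with large raw ranges (such as square footage) would otherwise dominate the neighbour search.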
Generally speaking, we concluded that dropping multiple features did not improve the accuracy of our models, and that when predicting the value of a property, all the features are relevant to our machine learning model.
To understand which factors are responsible for a property value of over $650,000, we used SQL to calculate the mode of the main factors (a pandas equivalent is sketched after the list):
- Bedrooms: 4
- Bathrooms: 2.5
- Sqft living: 2,440
- Condition: 3
- Grade: 9
- Year built: 2014
- The most expensive zip codes in the dataset were Bellevue (98004), Medina (98039), and Mercer Island (98040).
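The original analysis computed these modes in SQL; a pandas equivalent, assuming the column names used earlier, would be:

```python
# Houses at or above the $650K threshold.
high_value = df[df["price"] >= 650_000]

# Mode of each main factor among high-value houses.
for col in ["bedrooms", "bathrooms", "sqft_living",
            "condition", "grade", "yr_built", "zipcode"]:
    print(col, high_value[col].mode().iloc[0])
```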
Seattle is home to more billionaires than any other city in the USA: a total of 8 billionaires live there, with a combined net worth of $252B. Neighbours include Bill Gates, Jeff Bezos and Colleen Willoughby.
Our assignment deliverables also included visualization in Tableau. We aimed to visualize our findings in a way that a non-technical audience would find simple to understand as well as visually pleasing. To view our story, follow the link.
We used SQL to reach a deeper understanding of our dataset, writing queries to answer the questions set by our real estate company. Follow the link to dive into our queries.
We used PowerPoint to make our presentation as visually appealing and simple as possible for stakeholders. To view our presentation, follow the link.