# Project 2: Ames Housing Price Prediction (Part 3/3)

## Recap of our Business Problem
* Build a model to predict prices for house listings in Ames, Iowa, USA, based on characteristics such as size, amenities, and location
* Identify the 30 most influential features, with the goal of identifying strategies for potential home sellers to maximize profit

## Project Organization

The analysis is broken into 3 parts. This is the third and final part of the series, in which we use the insights from the previous notebooks to draw conclusions and suggest future improvements:

1. Data Preprocessing and Feature Engineering
2. Model Selection and Tuning
3. __Business Summary and Recommendations__

Linear Assumption. Linear regression assumes that the relationship between your input and output is linear. It does not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).

Remove Noise. Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output variable and you want to remove outliers in the output variable (y) if possible.

Remove Collinearity. Linear regression will over-fit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and removing the most correlated.
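
A minimal sketch of the pairwise-correlation check described above, assuming a purely numeric feature frame. The helper name `drop_highly_correlated`, the 0.9 threshold and the toy columns are illustrative assumptions, not taken from the notebooks.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


# Hypothetical data: one feature is a near-duplicate of another.
rng = np.random.default_rng(0)
area = rng.normal(1500, 400, size=200)
demo = pd.DataFrame({
    "living_area": area,
    "living_area_copy": area + rng.normal(0, 5, size=200),
    "lot_frontage": rng.normal(70, 20, size=200),
})

reduced = drop_highly_correlated(demo)
print(list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.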

Gaussian Distributions. Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit from using transforms (e.g. log or Box-Cox) on your variables to make their distributions look more Gaussian.
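
As a rough illustration of the transforms mentioned above, the sketch below compares the skewness of a right-skewed variable before and after a log and a Box-Cox transform. The synthetic `prices` array is an assumption standing in for a real column; Box-Cox requires strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical right-skewed, strictly positive values.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)
# boxcox returns the transformed data and the fitted lambda.
boxcox_prices, fitted_lambda = stats.boxcox(prices)

skew_before = stats.skew(prices)
skew_log = stats.skew(log_prices)
skew_boxcox = stats.skew(boxcox_prices)
print(f"before={skew_before:.2f} log={skew_log:.2f} boxcox={skew_boxcox:.2f}")
```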

Rescale Inputs. Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.

One of the great advantages of linear regression relative to more complex techniques is its interpretability. Not only can we identify which features exert the most influence on our predictions, we even have the relative magnitude and direction of their influence, as given by the associated coefficients. This information can be very useful. When using more complex and high-powered modeling techniques like gradient boosting or neural networks, we will tend to get higher predictive accuracy, but much more limited interpretability. This is one of the great trade-offs in data science.

As an example, we might have considered using something like principal component analysis with these data, as a way of reducing the dimensionality of the polynomial features and systematically gleaning the value in all of the multicollinear interaction terms heavily correlated with sale price. Doing this would greatly reduce feature complexity and might well improve the accuracy of the model, but we would also lose all interpretability. The composite features would be a palimpsest. Instead, we chose to manually select a few top polynomial features by looking at correlations and coefficient magnitudes, and by using common sense and domain knowledge to leave out anything that was likely to be redundant.
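
To make the interpretability trade-off concrete, here is a small sketch (not something run in this project) of PCA on synthetic data. The feature names are hypothetical. The point is only that each principal component is a weighted blend of several original features, so its coefficient in a regression no longer maps to a single house characteristic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
size = rng.normal(size=500)
# Three hypothetical, strongly correlated size-like features plus one unrelated feature.
X = np.column_stack([
    size + rng.normal(scale=0.2, size=500),
    size + rng.normal(scale=0.2, size=500),
    size + rng.normal(scale=0.2, size=500),
    rng.normal(size=500),
])
feature_names = ["total_sf", "gr_liv_area", "garage_area", "lot_frontage"]

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
# Loadings of the first component: every size-like feature contributes.
first_component = dict(zip(feature_names, pca.components_[0].round(2)))
print(first_component)
```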

## Numeric Variables
There are 36 relevant numerical features. MSSubClass, which "identifies the type of dwelling involved in the sale", is encoded as numeric but is in reality a categorical variable. The numerical features are of the following types:
* Square footage: the square footage of certain parts of the home, e.g. 1stFlrSF (first floor square footage) and GarageArea (size of garage in square feet).
* Time: time-related variables, such as when the home was built or sold.
* Rooms and amenities: data that represent amenities, like “How many bathrooms?”
* Condition and quality: subjective variables rated from 1 to 10.

Most of the variables that deal with the actual physical space of the home are positively skewed, which makes sense, as most people live in smaller homes and only the extremely wealthy in very large ones.
Sale Price has a similar positively skewed distribution. I hypothesize that the variables dealing with the actual dimensions of the home have a large impact on Sale Price.

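```python
import matplotlib.pyplot as plt
import pandas as pd

# Horizontal bar chart of the model coefficients, sorted ascending, one bar per feature
pd.Series(enet_opt.coef_, index=final.columns).sort_values(ascending=True).plot.barh(figsize=(15,8))
plt.title('Features that affect sale price the most', fontsize=14)
plt.xlabel('coefficients', fontsize=12)
plt.ylabel('features', fontsize=12);
```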

Many features do not have a strong relationship with Sale Price, such as ‘Year Sold’. However, a few variables, like overall quality and lot square footage are highly correlated with Sale Price.

## Categorical Variables
Similar to the numeric features, there is a range of categorical features. For many of them the sale price clearly varies by category, but for many others it does not. Features that affect value include the presence or absence of central air, the neighborhood, the external quality, and the zoning.
Features whose sale price varies little across categories include the roof style and land slope.

In order to run our models on the data, I had to transform many of the variables. The following pre-processing steps were taken:
* Removing outliers: the classic Tukey method of using 1.5 * IQR removed too much data. I therefore removed values that were outside of 3 * IQR instead.
* Filling NaN values: many of the variables had NaN values that needed to be dealt with. Those values were filled based on what made the most sense. For example, NaN values for Alley were filled with a string ("No Alley"), whereas NaN values for GarageYrBuilt were filled with the median to avoid skewing the data.
* Creating dummy variables for the categorical variables.
* Splitting the data into a training set and a test set.
* Scaling the data.
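
The steps above could be sketched roughly as follows. This is an illustrative outline on a tiny made-up frame, not the notebook's actual code: the values, the `test_size`, and applying the 3 * IQR rule to `SalePrice` only are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the last row is an extreme sale price.
df = pd.DataFrame({
    "SalePrice": [120000, 135000, 150000, 142000, 160000, 155000, 148000, 900000],
    "GrLivArea": [900, 1100, 1400, 1250, 1600, 1500, 1350, 1450],
    "Alley": [None, "Grvl", None, "Pave", None, None, "Grvl", None],
    "GarageYrBuilt": [1990, np.nan, 2001, 1985, 2005, np.nan, 1998, 2003],
})

# 1. Remove rows whose SalePrice lies more than 3 * IQR beyond the quartiles.
q1, q3 = df["SalePrice"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["SalePrice"].between(q1 - 3 * iqr, q3 + 3 * iqr)].copy()

# 2. Fill NaNs: a label where NaN means "absent", the median otherwise.
df["Alley"] = df["Alley"].fillna("No Alley")
df["GarageYrBuilt"] = df["GarageYrBuilt"].fillna(df["GarageYrBuilt"].median())

# 3. One-hot encode the categorical columns.
X = pd.get_dummies(df.drop(columns="SalePrice"), drop_first=True, dtype=float)
y = df["SalePrice"]

# 4. Split, then 5. scale using statistics from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training rows only keeps information from the test rows out of the model inputs.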

As you can see above, there was significant missingness by feature across the train and test datasets. Most of the missing data corresponded to the absence of a feature. For example, the Garage features, mentioned in the above table, showed up as "NA" if the house did not have a garage. These were imputed as 0 or "None" depending on the feature type. Below is a breakdown of how we handled imputation across all the features.

* GarageYear → There was an observation with a GarageYear of 2207. After checking it against YearBuilt, we concluded that this should be changed to 2007.
* MSSubClass → Converted to string type, as this is really a categorical variable.
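
A small sketch of the two fixes above plus the "absent feature" imputation, on hypothetical rows. The column names follow the text; the values are made up, and imputing a missing garage year as 0 follows the "0 or None depending on the feature type" rule described earlier.

```python
import pandas as pd

# Hypothetical rows for illustration only.
homes = pd.DataFrame({
    "YearBuilt": [2006, 1998, 2005],
    "GarageYear": [2207.0, 1998.0, None],
    "MSSubClass": [20, 60, 120],
})

# A garage year of 2207 is treated as a typo for 2007.
homes.loc[homes["GarageYear"] == 2207, "GarageYear"] = 2007

# A missing garage year corresponds to a house with no garage; impute 0.
homes["GarageYear"] = homes["GarageYear"].fillna(0)

# MSSubClass codes are labels, not quantities, so store them as strings.
homes["MSSubClass"] = homes["MSSubClass"].astype(str)
```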

The model suggests that while location is important (being in Northridge Heights and Stone Brook were the 3rd and 4th strongest predictors), size and quality are even more important for predicting home prices in Ames.

This is just a start, however. Much more could be done to improve the model. As I work to improve the model, I will spend more time looking at the data before constructing my final model.

## Main Takeaways and Future Improvements
* Improve feature engineering and selection:
    * Recursive feature selection
    * Reduce multicollinearity
    * Find better ways to interact quality and square footage
    * Convert 'quality' to a more understandable metric (e.g. cost of improvements)
* Impute missing data more effectively:
    * Zeroing missing numeric data reduces the predictive power of the variables
* Collect more data:
    * Incorporate interactions between buyers and sellers (e.g. listing vs final sale price)
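
One of the proposed improvements, recursive feature selection, could be tried with scikit-learn's `RFE`. The sketch below is a hedged illustration on synthetic data, where only the first two columns actually drive the target; the choice of estimator and of `n_features_to_select=2` are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only columns 0 and 1 influence y; the remaining columns are pure noise.
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# RFE repeatedly refits the model and discards the weakest feature.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
selected = np.flatnonzero(selector.support_)
print(selected)
```

Because `RFE` ranks features by coefficient magnitude for linear models, the inputs should be on comparable scales before it is applied to real data.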

The provided data set included property characteristics for 2050 sold properties in Ames, Iowa. All properties in the data were sold from 2006-2008.

The characteristics included typical property features (e.g. number of bedrooms and bathrooms, square footage), as well as ratings for house quality or condition, construction materials, zoning data, and location data.

A more in-depth description of the data can be found here.

Current house pricing metrics within Zillow are outdated. The predictions are made primarily on bed and bath and square footage. With the movement to a more micro style of living, we should be evaluating more on the overall quality instead of the overall quantity.

## Recommendations
Homeowners looking to increase the value of their property could carry out the following renovations:
* Improving the quality and condition of key home features such as the external areas (e.g. lawn, porch), basement and kitchen
* Making sure the home is in a finished condition, ideally move-in ready, when selling
* Installing features that keep the house warm during freezing winters, such as fireplaces and a brick exterior for better insulation

Considerations for buyers looking to purchase a house in Ames:
* Ideal neighbourhoods to purchase homes in are Northridge Heights, Stone Brook or Northridge, which would give better returns on property investments
* Homes with large square footage and in good, finished condition will command a higher price
* Single-storey homes have a better sale price per square foot than other types of homes, while townhouses have the lowest value per square foot
* Homes with 2 or 3 bedrooms are ideal; any more bedrooms could decrease the home value
* Look for homes with fireplaces and brick exteriors, which could increase the home value, while the presence of masonry veneer could decrease it
* Younger homes have higher value

## Problem Statement
This project examines a housing dataset from Ames, Iowa, USA. Ideally, homeowners in Ames should be armed with data, such as the Ames housing dataset, to maximise their property value, while buyers would make the best investment decision based on the available data when purchasing property. However, due to a lack of data, many homeowners overspend on improving particular features of their home that do not translate to higher home value. Buyers may not get bang for their buck when purchasing homes, as they do not know which characteristics (size, amenities, neighbourhood, etc.) warrant a higher or lower home price. This could result in a financial loss for both homeowners and buyers.

The Ames housing train dataset will be processed using linear regression models to find out which features in the dataset have a positive correlation with sale price. The goal of this project is to produce a model that can best predict sale prices based on the features in the Ames housing test set. This model would provide homeowners with information on which parts of their home to improve to get a higher sale price. Buyers would also know which features are worth paying more for, enabling them to make better-informed purchasing choices.

## Executive Summary
The Ames housing train dataset comprises information on 81 home features. These features broadly refer to the size, condition and quality of the house, number of rooms, type of home, key features (garage, basement, fireplace, pool, etc), significant dates (year built, sold, remodelled) and the neighbourhood it is located in. It also includes the sale price for each of the 2051 entries. The features are classified into numeric and categorical, and the latter can be subdivided into ordinal (quality assessments) and descriptive categories (neighbourhoods, etc).

To generate a good production model that can accurately predict sale prices of houses in Ames on an unseen dataset, I first conducted data cleaning and EDA on the train dataset, where I was able to gauge how each feature affected sale price. Following this, I conducted feature engineering by adding, dropping and tweaking features, and used one-hot encoding to convert the remaining categorical columns to numeric so that modelling could take place. Subsequently, I used regularised models (Ridge and Lasso) to further narrow down to the best-performing features, i.e., the features with the highest coefficients for each model. After several iterations to pick the combination of features that returned good R2 scores and RMSE, I ended up with a list of 30 best-performing features to be included in the features matrix. I conducted Linear, Ridge, Lasso and Elastic Net regression modelling, before concluding that the Elastic Net regression produced the model that was best at predicting sale prices. Lastly, I applied the final production model to the Ames housing test set (which I processed in the same way as the Ames housing train set) to obtain the sale price predictions, which I submitted to the Kaggle competition.
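
The model comparison described above might look roughly like the sketch below. It uses synthetic data and arbitrary, untuned regularisation strengths, so the ranking it prints says nothing about the actual Ames results; it only shows the mechanics of scoring the four model families with cross-validated RMSE.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 30-feature matrix; only 5 features carry signal.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 30))
true_coef = np.zeros(30)
true_coef[:5] = [40, 25, 15, 10, 5]
y = 180 + X @ true_coef + rng.normal(scale=10, size=400)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.5),
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),
}

rmse = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

best = min(rmse, key=rmse.get)
print(rmse, best)
```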

As expected, the model showed that the square footage of the house affected the sale price most. The quality and condition of key features such as external areas, basement and kitchen also significantly affect the sale price. The ideal home type is a single-storey house. I also found that, generally, the younger the home, the higher the sale price, and that Hillside homes commanded an additional premium. The most desirable neighbourhoods in Ames are Northridge Heights, Stone Brook and Northridge, due to their proximity to schools and amenities. The data also showed that fireplaces and brick exteriors had a positive impact on sale price, as Iowa experiences extremely frigid winters. Conversely, factors that would have a detrimental impact on sale price are an unfinished home, a house with more than 3 bedrooms, or the house type being a townhouse.

Accordingly, recommendations for homeowners to improve the value of their homes would be to make sure the home is in a livable condition, to improve the quality of features such as the kitchen, basement and external areas, and to install fireplaces or even a brick exterior to improve liveability during wintertime. Buyers wanting to make good investments in the Ames housing market should purchase newly built, single-storey homes with 2 or 3 bedrooms in the neighbourhoods of Northridge Heights, Stone Brook or Northridge. Homes with fireplaces and brick exteriors would be an added bonus for the property's value.

## Contents
* EDA and Cleaning
* Preprocessing and Feature Engineering
* Model Benchmarks
* Model Tuning
* Production Model and Insights

## Data Dictionary
The data dictionary can be found in this link

## Kaggle Submission Score

## Conclusion and Business Recommendations
After several iterations of feature engineering, feature selection and regularization, we have selected 25 features out of the initial 241 variables for our housing price prediction model. The top 3 features that had the largest influence on house prices were 'overall_qual * total_sf', 'neighborhood_northcluster' and 'overall_cond'. In other words, a property's overall quality, size, location and condition are key predictors of sale price. As such, a homeowner in Ames could engage in some home upgrading to improve the overall material, finish and condition in order to fetch a higher sale price.

Using Lasso regularization, removing outliers and adding a quadratic predictor enabled our model to outperform the baseline model. While the baseline linear regression model returned an R2 of 0.85 and an RMSE of 31,018, our final model achieved an R2 of 0.90 and an RMSE of 24,700. However, when scored on Kaggle, we received a private score of 31,165 and a public score of 26,550, which are higher than the RMSE obtained on our training set. This is probably because RMSE is highly sensitive to the 3 outliers in the holdout set.

The distribution of sale prices is slightly right-skewed, with a small number of houses that were sold at high prices. This had implications for our model's performance, resulting in underestimation of houses at high prices. Taking a log transformation of sale prices would help to normalize the distribution and could improve the RMSE.
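
A hedged sketch of how the log transformation of the target could be tested, using scikit-learn's `TransformedTargetRegressor` so that predictions are automatically converted back to dollars. The data are synthetic and the single `area` feature is an assumption; whether the transform actually lowers RMSE on the Ames data would have to be checked empirically.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
area = rng.uniform(600, 4000, size=500)
# Synthetic right-skewed prices that grow multiplicatively with area.
price = 60000 * np.exp(0.0004 * area + rng.normal(scale=0.15, size=500))

X = area.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

# Baseline: regress on the raw price.
raw = LinearRegression().fit(X_train, y_train)

# Alternative: fit on log1p(price), then map predictions back with expm1.
logged = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
).fit(X_train, y_train)

rmse_raw = mean_squared_error(y_test, raw.predict(X_test)) ** 0.5
rmse_log = mean_squared_error(y_test, logged.predict(X_test)) ** 0.5
print(f"raw target RMSE: {rmse_raw:,.0f}  log target RMSE: {rmse_log:,.0f}")
```

Both RMSE values are reported in the original price units, which keeps them comparable with each other.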

The subprime mortgage crisis saw real estate prices take a nosedive between 2007 and 2010. As our dataset covers only sales between 2006 and 2010, the prediction model is specific to that time period and its market conditions. As such, it is hard to generalise the results from this model to other market conditions or to cities outside of Ames.

Having recently joined a company in Ames, Iowa that specialises in purchasing existing residential properties, performing cost-effective renovations (if required) and on-selling these properties for profit, I’ve been tasked with supporting the company to optimise investment and maximise return by:

* Developing an algorithm to estimate the price of residential houses based on fixed features, i.e. characteristics that cannot be easily renovated (e.g. location, square feet, number of bedrooms and bathrooms)
* Identifying characteristics of residential houses that can be cost-effectively renovated and estimating the mean value of these renovations

This project uses the Ames housing data available on Kaggle, which includes 81 features describing a wide range of characteristics of 1,460 homes in Ames, Iowa sold between 2006 and 2010. Models were required to be trained on houses sold prior to 2010 and evaluated on houses sold in 2010.
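
The train/evaluate requirement above amounts to a split by sale year rather than a random split. A minimal sketch, assuming a `YrSold` column and using made-up rows:

```python
import pandas as pd

# Hypothetical rows; only the YrSold column matters for the split.
sales = pd.DataFrame({
    "YrSold": [2006, 2007, 2008, 2009, 2010, 2010],
    "GrLivArea": [1200, 1500, 900, 2000, 1700, 1100],
    "SalePrice": [150000, 185000, 110000, 240000, 200000, 125000],
})

# Train on everything sold before 2010, evaluate on 2010 sales.
train = sales[sales["YrSold"] < 2010]
test = sales[sales["YrSold"] == 2010]

X_train, y_train = train.drop(columns="SalePrice"), train["SalePrice"]
X_test, y_test = test.drop(columns="SalePrice"), test["SalePrice"]
```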
