# Ames Housing Sale Predictions - Production Model

## Contents:
- [Imports & Data](#Imports-\&-Data)
- [Define X & y](#Define-X-\&-y)
- [Scale Model](#Scale-Model)
- [Fit and Assess Model](#Fit-and-Asses-Model)
    - [Linear Regression (OLS)](#Linear-Regression-(OLS))
    - [Ridge Model](#Ridge-Model)
    - [LASSO Model](#LASSO-Model)
- [Predictions & Review](#Predictions-\&-Review)
- [Save Submission Data](#Save-Submission-Data)
- [Conclusions & Recommendations](#Conclusions-\&-Recommendations)

## Imports & Data

#### Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV

#### Read In Data

In [2]:
# define paths
path_production_data = '../datasets/03_production/'
path_submission_data = '../datasets/04_submission/'

In [3]:
# Read in data
df_train = pd.read_csv(f'{path_production_data}train_production.csv')
df_test = pd.read_csv(f'{path_production_data}test_production.csv')

## Define X & y

In [4]:
# Define X & y
X_train = df_train.drop(columns='SalePrice')
y_train = df_train['SalePrice']

# Test data - for predictions
# Copy so that adding 'SalePrice' to df_test later does not also alter X_test
X_test = df_test.copy()

## Scale Model

In [5]:
# Instantiate scaler for the features
sc = StandardScaler()

In [6]:
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [7]:
# Review scaled means
print(f'Train mean sum: {Z_train.mean(axis=0).sum().round(4)}')
print(f'Test mean sum: {Z_test.mean(axis=0).sum().round(4)}')

Train mean sum: -0.0
Test mean sum: -1.078
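The train sum is zero and the test sum is not because the scaler learns its means and standard deviations from the training data only, then reuses them on the test data. The sketch below reproduces that behavior on synthetic data; the array names and the generated values are assumptions for illustration, not the Ames features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Ames features)
rng = np.random.default_rng(42)
X_train_demo = rng.normal(loc=50, scale=10, size=(200, 3))
X_test_demo = rng.normal(loc=50, scale=10, size=(80, 3))

sc_demo = StandardScaler()
Z_train_demo = sc_demo.fit_transform(X_train_demo)  # learns mean/std from train only
Z_test_demo = sc_demo.transform(X_test_demo)        # reuses the train mean/std

train_mean_sum = Z_train_demo.mean(axis=0).sum()
test_mean_sum = Z_test_demo.mean(axis=0).sum()

print(f'Train mean sum: {train_mean_sum:.4f}')  # zero up to floating-point error
print(f'Test mean sum: {test_mean_sum:.4f}')    # generally nonzero
```

A nonzero test sum is therefore not an error by itself. A very large value could still be worth a look, since it may point to a distribution shift between the two files.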


## Fit and Asses Model
- Only one model is uncommented at a time.
- This allows for easy switching between models to produce different outputs.

### Linear Regression (OLS)

In [8]:
# # Instantiate Model
# model = LinearRegression()

In [9]:
# # Fit model
# model.fit(Z_train, y_train);

### Ridge Model

In [10]:
# # Create a list of ridge alphas to test
# r_alphas = np.logspace(0, 1, 500)

# # Cross-validate over our list of ridge alphas.
# model = RidgeCV(alphas=r_alphas, scoring='r2', cv=10) # Scored on R2 across 10 folds

# # Fit model using best ridge alpha!
# model.fit(Z_train, y_train);

In [11]:
# # Optimal value of alpha
# model.alpha_

### LASSO Model

In [18]:
# Create a list of LASSO alphas to test
l_alphas = np.logspace(1, 2, 500)

# Cross-validate over our list of Lasso alphas.
model = LassoCV(alphas=l_alphas, cv=10, max_iter=5000)

# Fit model using best lasso alpha!
model.fit(Z_train, y_train);

In [19]:
# Optimal value of alpha
model.alpha_

42.38913057338779
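The selected alpha of about 42.39 lies inside the searched range of 10 to 100 (`np.logspace(1, 2, 500)`), which suggests the grid was wide enough. Two quick follow-up checks are whether the chosen alpha sits on a grid edge and how many coefficients LASSO kept. The sketch below shows both on synthetic data; the `*_demo` names and the generated data are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Ames features)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=40, n_informative=8, noise=20.0, random_state=0
)
Z_demo = StandardScaler().fit_transform(X_demo)

alphas_demo = np.logspace(-2, 2, 100)
lasso_demo = LassoCV(alphas=alphas_demo, cv=5, max_iter=5000).fit(Z_demo, y_demo)

# If the chosen alpha sits on an edge of the grid, the grid may be too narrow
edges = [alphas_demo.min(), alphas_demo.max()]
on_grid_edge = bool(np.isclose(lasso_demo.alpha_, edges).any())

# LASSO's l1 penalty can shrink coefficients exactly to zero
n_kept = int(np.sum(lasso_demo.coef_ != 0))

print(f'Best alpha: {lasso_demo.alpha_:.4f} (on grid edge: {on_grid_edge})')
print(f'Features kept: {n_kept} of {Z_demo.shape[1]}')
```

In the notebook itself, the same two checks would apply to `model` and `l_alphas`.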

## Predictions & Review
- Make predictions and review the training $R^2$
- Format data for submission
- Final data check

In [20]:
# Make predictions
y_preds = model.predict(Z_test)

In [21]:
# Add predictions to df_test
df_test['SalePrice'] = y_preds

In [22]:
# Create submission df (correct 'Id' column name and set 'Id' as index)
df_submission = df_test[['Id', 'SalePrice']].set_index('Id')
print(df_submission.shape) # Confirm this is (878, 1)!
df_submission.head() # Id starts with (2658, 2718, 2414, 1989, 625, ...)

(878, 1)


| Id | SalePrice |
|---|---|
| 2658 | 147300.72206 |
| 2718 | 158758.907182 |
| 2414 | 211361.910297 |
| 1989 | 99272.602747 |
| 625 | 175110.811257 |


In [23]:
# Verify Train data R2
print(f'Training R2: {model.score(Z_train, y_train)}')

Training R2: 0.9465720446283913
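A training $R^2$ is measured on the same rows the model was fit on, so it tends to be optimistic. One way to get a less biased estimate without the Kaggle labels is cross-validation with the scaler inside a pipeline. The sketch below uses synthetic data and a fixed `alpha=1.0`; both are assumptions for illustration and not values from this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Ames features)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=40, n_informative=8, noise=20.0, random_state=0
)

# Scaling inside the pipeline means each fold is scaled using only its own training rows
pipe_demo = make_pipeline(StandardScaler(), Lasso(alpha=1.0, max_iter=5000))

# R2 on ten held-out folds
cv_r2 = cross_val_score(pipe_demo, X_demo, y_demo, cv=10, scoring='r2')

# R2 on the rows used for fitting, for comparison
train_r2 = pipe_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f'Training R2: {train_r2:.4f}')
print(f'CV R2 (mean of 10 folds): {cv_r2.mean():.4f}')
```

A large gap between the two numbers would be a sign of overfitting.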


## Save Submission Data
- Save formatted csv for Kaggle submission

In [18]:
# Save submission
descriptor = 'dum_ord_poly_eng_lasso_5' # describe filename
df_submission.to_csv(f'{path_submission_data}cl_submission_{descriptor}.csv')

## Conclusions & Recommendations

### Conclusion
Our team of data scientists analyzed the Ames, IA housing dataset to determine if the data provided meaningful information about the Sale Price of each home.

We first cleaned the data to resolve abnormalities and other problems, then performed EDA to discover meaningful correlations between features. Next, we engineered features using several methods: creating dummy variables, mapping ordinal categories to numbers, combining similar features into new ones, and creating polynomial features to capture interactions between features. All features were then scaled to prepare them for modeling and regularization.

We then fit three linear regression models on these engineered features:
- Ordinary Least Squares (OLS)
- Ridge Regression (l2 penalty)
- LASSO Regression (l1 penalty)

Of these models, Ridge and LASSO performed best based on their $R^2$ and MSE scores. Both used 10-fold cross-validation to select the regularization strength (alpha). The Ordinary Least Squares model outperformed the other two on the training data but performed worse on the test data, which indicates that it was overfit. Because the Ridge and LASSO models penalize large coefficients, they are regularized and generalize better to new data.
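The train-versus-held-out comparison described here can be sketched as follows. This is not the notebook's evaluation: the data is synthetic, with few rows relative to columns to encourage overfitting, and the split sizes, alpha grids, and `*_demo` names are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 120 rows, 80 columns
X_demo, y_demo = make_regression(
    n_samples=120, n_features=80, n_informative=10, noise=30.0, random_state=1
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Fit the scaler on the training split only, then reuse it on the validation split
sc_demo = StandardScaler()
Z_tr = sc_demo.fit_transform(X_tr)
Z_val = sc_demo.transform(X_val)

models_demo = {
    'OLS': LinearRegression(),
    'Ridge': RidgeCV(alphas=np.logspace(-2, 3, 50)),
    'LASSO': LassoCV(cv=5, max_iter=10000),
}

# Store (training R2, validation R2) for each model
scores_demo = {}
for name, est in models_demo.items():
    est.fit(Z_tr, y_tr)
    scores_demo[name] = (est.score(Z_tr, y_tr), est.score(Z_val, y_val))
    print(f'{name:>5}: train R2 = {scores_demo[name][0]:.3f}, '
          f'validation R2 = {scores_demo[name][1]:.3f}')
```

OLS minimizes squared error on the training rows directly, so its training $R^2$ cannot be lower than that of the penalized models fit on the same data. The validation column is the one that indicates how well each model generalizes.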

We conclude that any of the three linear regression models can produce predictions from this data set that meet the thresholds Zillow provided: an $R^2$ above 0.90 and an MSE below 30000.

### Recommendations
Because the models met the success metrics, we recommend that Zillow allocate more funding to further develop this home sale price prediction technology. Resources should go toward collecting larger, higher-quality data sets and toward refining the current proof-of-concept models.

If Zillow chooses to use models to estimate how specific aspects of a home affect sale price, rather than simply using all available information to predict it, we also recommend reducing model complexity by removing variables that exhibit multicollinearity. This would likely reduce overall predictive performance but would make the effect of individual features on sale price easier to interpret.
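One common way to flag multicollinear variables is the variance inflation factor, $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all other features. The sketch below is a minimal version of that idea; the helper `vif_table`, the column names, and the synthetic data are hypothetical and were not part of this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return VIF_j = 1 / (1 - R2_j) for each column, largest first."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        # R2 from regressing this column on all remaining columns
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name='VIF').sort_values(ascending=False)


# Synthetic stand-in data with hypothetical column names
rng = np.random.default_rng(0)
area = rng.normal(1500, 300, size=500)
X_vif_demo = pd.DataFrame({
    'living_area': area,
    'total_rooms': area / 250 + rng.normal(0, 0.3, size=500),  # strongly tied to area
    'lot_frontage': rng.normal(70, 15, size=500),               # independent of the others
})

vif_demo = vif_table(X_vif_demo)
print(vif_demo)
```

A frequently cited rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, though the appropriate cutoff depends on the use case.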