
# Predict Home Value of Zillow Listings

**Alec Hartman**

**April 13, 2020**

## Goals
1. Develop a model to predict home values using square feet, bedrooms, and bathrooms.
2. Create a summary [presentation](https://docs.google.com/presentation/d/1ECtW4r91m_6WJGXTojXHFAnLWTK0KJ-z5euiXm5XgHI/edit?usp=sharing) describing the drivers of single unit property values.
3. Plot distributions of tax rates for each county, and provide key measures of central tendency and measures of spread.

---
### 1. Acquire + Preparation (aka Wrangling)

In [1]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import math
from scipy import stats
from statsmodels.formula.api import ols

import sklearn
import sklearn.metrics
import sklearn.linear_model
import sklearn.feature_selection

import wrangle as wr
import split_scale as ss
import explore as ex
import evaluate as ev

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = wr.wrangle_zillow()
df

### Data Dictionary
- **bathrooms**:
    - The number of bathrooms in each home
    - zillow SQL database field: properties_2017.bathroomcnt
    - Homes with zero bathrooms were filtered out in my SQL query
    - I chose this field to represent the number of bathrooms per home as it appears to be the most complete and appropriate field in the database
- **bedrooms**:
    - The number of bedrooms in each home
    - zillow SQL database field: properties_2017.bedroomcnt
    - Homes with zero bedrooms were filtered out in my SQL query
    - I chose this field to represent the number of bedrooms per home as it appears to be the most complete and appropriate field in the database
- **square_feet**:
    - The square footage of each home
    - zillow SQL database field: properties_2017.calculatedfinishedsquarefeet
    - I chose this field to represent square footage per home as it appears to be the most complete and appropriate field in the database
- **fips_code**:
    - The Federal Information Processing Standards (FIPS) code for each home. Essentially, this is a unique identifier for a state and county; the codes were issued by the National Institute of Standards and Technology (NIST)
    - zillow SQL database field: properties_2017.fips
    - I chose to use this field to index the county in which each home is located 
- **property_description**:
    - The property description of each home
    - zillow SQL database field: propertylandusetype.propertylandusedesc
    - I filtered the data in my SQL query to include Single Family Residential properties only
    - I chose to use and filter by this field as I interpreted the term "single unit properties" from the project specifications to mean Single Family Residential properties
- **home_value**:
    - The property's tax assessed value in 2017, presumably
    - zillow SQL database field: properties_2017.taxvaluedollarcnt
    - I used this field to represent home value as suggested in the project specifications
- **tax_amount**:
    - The amount of tax paid on each property in 2017, presumably
    - zillow SQL database field: properties_2017.taxamount
    - I used this field to represent tax amount and calculate the tax rate for each property
- **tax_rate**:
    - The tax rate for each property in 2017, presumably
    - zillow SQL database fields: (properties_2017.taxamount/properties_2017.taxvaluedollarcnt) as tax_rate
    - I used the fields above to calculate the tax rate for each property
- **transaction_date**:
    - The last transaction date for each property
    - zillow SQL database field: predictions_2017.transactiondate
    - I filtered the data in my SQL query to include only those homes whose last transaction date was in the "hot months" of May and June in terms of real estate demand as per the project specifications
    - I chose to use and filter by this field as it appears to be the most appropriate date field in the database
- **county**:
    - The county in which each property is located
    - This field was derived by mapping each FIPS code to its county name, using the FIPS list hosted on the FCC website ([link](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt)).
    - I chose to include this field in my DataFrame because it will be used to plot the distribution of tax rates by county
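
The two derived fields in the dictionary above (`tax_rate` and `county`) can be sketched in pandas as follows. This is only an illustration under stated assumptions: the rows are made up, the county labels other than "Los Angeles County" are assumed spellings, and the actual `wrangle` module may build these columns differently (for example, directly in SQL).

```python
import pandas as pd

# Hypothetical rows standing in for the SQL query result; values are illustrative only.
sample = pd.DataFrame({
    "fips_code": [6037, 6059, 6111],
    "home_value": [500_000.0, 750_000.0, 600_000.0],
    "tax_amount": [6_250.0, 8_250.0, 6_600.0],
})

# FIPS county codes for the three California counties in this dataset.
# The label spellings are an assumption made for this sketch.
fips_to_county = {
    6037: "Los Angeles County",
    6059: "Orange County",
    6111: "Ventura County",
}

# tax_rate mirrors the SQL expression taxamount / taxvaluedollarcnt.
sample["tax_rate"] = sample["tax_amount"] / sample["home_value"]

# county is a lookup of the FIPS code in the mapping above.
sample["county"] = sample["fips_code"].map(fips_to_county)

print(sample)
```

Any FIPS code missing from the mapping would become `NaN` in `county`, which makes unexpected codes easy to spot with `isnull()`.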

#### In which state and counties are these listings located?

In [None]:
df.county.unique()

> All properties contained in the DataFrame above are in the California counties of Los Angeles, Orange, and Ventura.

#### Let's ensure our data is truly ready to analyze.

In [None]:
assert (df.bathrooms == 0).sum() == 0, "If you see an AssertionError, there are zero values in the bathrooms feature."
assert (df.bedrooms == 0).sum() == 0, "If you see an AssertionError, there are zero values in the bedrooms feature."
assert (df.square_feet == 0).sum() == 0, "If you see an AssertionError, there are zero values in the square_feet feature."
assert (df.fips_code == 0).sum() == 0, "If you see an AssertionError, there are zero values in the fips_code feature."
assert (df.home_value == 0).sum() == 0, "If you see an AssertionError, there are zero values in the home_value feature."
assert (df.tax_amount == 0).sum() == 0, "If you see an AssertionError, there are zero values in the tax_amount feature."
assert (df.tax_rate == 0).sum() == 0, "If you see an AssertionError, there are zero values in the tax_rate feature."

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
train, test = ss.split_my_data(df)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.shape[0] / df.shape[0]

In [None]:
test.shape[0] / df.shape[0]
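
For readers without access to the `split_scale` module, a split helper of this kind might look roughly like the sketch below. The function name `split_sketch`, the 80/20 ratio, and the random seed are all assumptions for illustration; the proportions printed by the two cells above are the authoritative ones for this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sketch(df, train_pct=0.80, random_state=123):
    """Hypothetical stand-in for ss.split_my_data: one random train/test split."""
    train, test = train_test_split(df, train_size=train_pct, random_state=random_state)
    return train, test


# Small synthetic DataFrame so the proportions are easy to check.
demo = pd.DataFrame({"square_feet": range(1000, 1100), "home_value": range(100)})
demo_train, demo_test = split_sketch(demo)

print(len(demo_train) / len(demo))
print(len(demo_test) / len(demo))
```

Fixing `random_state` keeps the split reproducible, so the train and test metrics do not change between runs.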

In [None]:
train = train[["square_feet", "bedrooms", "bathrooms", "home_value"]]
test = test[["square_feet", "bedrooms", "bathrooms", "home_value"]]

In [None]:
train.head()

In [None]:
test.head()

---
### 2. Explore

#### Distribution of Tax Rates by County

In [None]:
for county in df.county.unique():
    # Select this county's tax rates once and reuse them below
    rates = df[df.county == county].tax_rate
    plt.figure(figsize=(16, 4))
    plt.title(f"Distribution of Tax Rates in {county} - mean: {rates.mean():.3f}; median: {rates.median():.3f}; std: {rates.std():.3f}")
    sns.distplot(rates)
    plt.xlabel("Tax Rate")
    plt.xlim(0, 0.50)
    plt.ylim(0, 400)
    plt.show()
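
The measures of central tendency and spread shown in the plot titles can also be collected into a single table. The sketch below uses made-up tax rates and an assumed "Orange County" label; with the real data, the same `groupby` would run on `df`.

```python
import pandas as pd

# Hypothetical tax rates; the real values come from df.tax_rate.
rates = pd.DataFrame({
    "county": ["Los Angeles County"] * 3 + ["Orange County"] * 3,
    "tax_rate": [0.012, 0.013, 0.020, 0.011, 0.012, 0.013],
})

# Central tendency (mean, median) and spread (std, min, max) per county.
summary = rates.groupby("county")["tax_rate"].agg(["mean", "median", "std", "min", "max"])
summary["range"] = summary["max"] - summary["min"]

print(summary.round(4))
```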

#### Statistical Testing

> Is the home value in Los Angeles County different from the average home value?

$H_0$: There is no difference in home value between homes located in Los Angeles County and the average home.

$H_a$: There is a difference in home value between homes located in Los Angeles County and the average home.

In [None]:
alpha = .01

x = df[df.county == "Los Angeles County"].home_value
mu = df.home_value.mean()

t_stat, p = stats.ttest_1samp(x, mu)
print(f"t = {t_stat:.3}")
print(f"p = {p:.3}")

if p < alpha:
    print("Reject null hypothesis.")
else:
    print("Fail to reject null hypothesis")
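
To make the test above less of a black box, the one-sample t statistic can be computed by hand as $t = (\bar{x} - \mu) / (s / \sqrt{n})$ and compared with SciPy. The data and the comparison mean below are synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_values = rng.normal(loc=500_000, scale=100_000, size=200)  # synthetic home values
mu_demo = 480_000  # hypothetical comparison mean

n = sample_values.size
# Sample standard deviation (ddof=1) divided by sqrt(n) is the standard error of the mean.
standard_error = sample_values.std(ddof=1) / np.sqrt(n)
t_manual = (sample_values.mean() - mu_demo) / standard_error

t_scipy, p_two_sided = stats.ttest_1samp(sample_values, mu_demo)

# For a directional ("greater than") question, a common convention is to
# halve the two-sided p-value when t is positive.
p_one_sided = p_two_sided / 2 if t_scipy > 0 else 1 - p_two_sided / 2

print(f"manual t = {t_manual:.3f}, scipy t = {t_scipy:.3f}")
print(f"two-sided p = {p_two_sided:.3g}, one-sided p = {p_one_sided:.3g}")
```

One caveat worth keeping in mind: in the notebook, the Los Angeles County homes are themselves part of the overall mean they are compared against, so the test is best read as a rough check rather than a strict comparison of independent groups.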

> Is the tax rate correlated with square feet? 

$H_0$: There is no linear relationship between tax rate and square feet.

$H_a$: There is a linear relationship between tax rate and square feet.

In [None]:
r, p = stats.pearsonr(df.tax_rate, df.square_feet)
print("r =", r)
print("p =", p)

if p < alpha:
    print("Reject null hypothesis.")
else:
    print("Fail to reject null hypothesis")
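
As a sanity check on what `stats.pearsonr` reports, the correlation coefficient can be reproduced as the covariance divided by the product of the standard deviations. The two variables below are synthetic and generated independently, so they stand in for the notebook's columns only in shape, not in value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sqft = rng.uniform(800, 4000, size=300)         # synthetic square footage
rate = 0.012 + rng.normal(0, 0.002, size=300)   # synthetic tax rates, independent of sqft

# Pearson r = cov(x, y) / (std(x) * std(y)), using the same ddof throughout.
covariance = np.cov(sqft, rate, ddof=1)[0, 1]
r_manual = covariance / (sqft.std(ddof=1) * rate.std(ddof=1))

r_scipy, p_value = stats.pearsonr(sqft, rate)

print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.3g}")
```

With large samples, even a very small `r` can produce a p-value below `alpha`, so the size of `r` matters as much as the significance decision.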

In [None]:
plt.figure(figsize=(16, 8))
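# Restrict to numeric columns; newer pandas versions raise an error when corr() meets text columns
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True)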
plt.show()

#### Plotting Variable Pairs of Train DataFrame

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(data=train, kind="reg", plot_kws={"line_kws": {"color": sns.color_palette("colorblind")[4]}})
plt.show()

> **Observations**:
>    - square_feet and bathrooms have a strong, positive linear relationship
>    - home_value and square_feet have a strong, positive linear relationship
>    - home_value and bedrooms have a positive linear relationship
>    - home_value and bathrooms have a positive linear relationship
>    - bathrooms and bedrooms have a positive linear relationship

---
### 3. Model

### Hypothesis

$H_0$: Single unit property value is independent of square footage, number of bedrooms, and number of bathrooms

$H_a$: Single unit property value is dependent on square footage, number of bedrooms, and number of bathrooms

---
#### Linear Regression Model

In [None]:
pd.options.display.float_format = '{:.3f}'.format

In [None]:
print(f"Mean home_value = {train.home_value.mean():.2f}")
print(f"Median home_value = {train.home_value.median():.2f}")

> I chose the median home_value of the train dataset to be my baseline because the mean is heavily influenced by outliers.
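
The trade-off behind this baseline choice can be illustrated with a few made-up values. A constant prediction equal to the mean minimizes RMSE, while one equal to the median minimizes mean absolute error (MAE), so a median baseline scored by RMSE is a slightly easier bar to clear than a mean baseline. The numbers below are hypothetical.

```python
import numpy as np

# Hypothetical, right-skewed home values (one very expensive outlier).
values = np.array([200_000, 250_000, 300_000, 350_000, 5_000_000], dtype=float)


def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(actual - predicted)))


mean_baseline = values.mean()
median_baseline = float(np.median(values))

print(f"mean baseline   = {mean_baseline:,.0f}: RMSE = {rmse(values, mean_baseline):,.0f}, MAE = {mae(values, mean_baseline):,.0f}")
print(f"median baseline = {median_baseline:,.0f}: RMSE = {rmse(values, median_baseline):,.0f}, MAE = {mae(values, median_baseline):,.0f}")
```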

In [None]:
predictions = pd.DataFrame({
    "actual_home_value": train.home_value,
    "baseline_home_value": train.home_value.median()
})
predictions.head()

In [None]:
train.shape

In [None]:
# features
X = train[["square_feet", "bedrooms", "bathrooms"]]
# target
y = train.home_value

# 1. Make the model
lm = sklearn.linear_model.LinearRegression()
# 2. Fit the model
lm.fit(X, y)
# 3. Use the model
predictions["home_value ~ square_feet + bedrooms + bathrooms"] = lm.predict(X)
predictions.head()

In [None]:
plt.figure(figsize=(16, 8))
ev.plot_residuals(predictions.actual_home_value, predictions["home_value ~ square_feet + bedrooms + bathrooms"])
plt.ticklabel_format(axis="both", style="plain")
plt.show()

In [None]:
print(f"""The equation of our regression line is: 
y = ({lm.coef_[0]:.2f} * square_feet) + ({lm.coef_[1]:.2f} * bedrooms) + ({lm.coef_[2]:.2f} * bathrooms) + {lm.intercept_:.2f}""")
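
The printed equation is exactly what `lm.predict` evaluates: each coefficient times its feature, plus the intercept. The sketch below confirms this on synthetic data; the coefficients used to generate the data are arbitrary and do not reflect the Zillow results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
# Synthetic features in the same column order as the notebook's X.
X_demo = np.column_stack([
    rng.uniform(800, 4000, 200),   # square_feet
    rng.integers(1, 6, 200),       # bedrooms
    rng.integers(1, 5, 200),       # bathrooms
])
y_demo = (
    200 * X_demo[:, 0]
    - 10_000 * X_demo[:, 1]
    + 30_000 * X_demo[:, 2]
    + 50_000
    + rng.normal(0, 20_000, 200)
)

lm_demo = LinearRegression().fit(X_demo, y_demo)

# Apply the regression equation by hand: X @ coefficients + intercept.
by_hand = X_demo @ lm_demo.coef_ + lm_demo.intercept_

print(np.allclose(by_hand, lm_demo.predict(X_demo)))
```

Because `coef_` follows the column order of `X`, reordering the feature list changes which coefficient belongs to which feature.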

In [None]:
ev.regression_errors(predictions.actual_home_value, predictions["home_value ~ square_feet + bedrooms + bathrooms"], predictions)

In [None]:
predictions.apply(lambda c: math.sqrt(sklearn.metrics.mean_squared_error(predictions.actual_home_value, c)))
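
The `evaluate` module is not shown in this notebook, so the following is only a guess at the kind of metrics a function like `ev.regression_errors` might return. The name `regression_errors_sketch` and the choice of metrics are assumptions; the formulas themselves are the standard definitions.

```python
import numpy as np


def regression_errors_sketch(actual, predicted):
    """Hypothetical error summary: SSE, MSE, RMSE, TSS, and R^2."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted

    sse = float(np.sum(residuals ** 2))                  # sum of squared errors
    mse = sse / actual.size                              # mean squared error
    rmse = mse ** 0.5                                    # root mean squared error
    tss = float(np.sum((actual - actual.mean()) ** 2))   # total sum of squares
    r2 = 1 - sse / tss                                   # share of variance explained
    return {"SSE": sse, "MSE": mse, "RMSE": rmse, "TSS": tss, "R2": r2}


metrics = regression_errors_sketch([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)
```

RMSE is in the same units as the target (dollars here), which is why it is the metric quoted when describing how far off the predictions are.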

In [None]:
ev.better_than_baseline(predictions.actual_home_value, predictions["home_value ~ square_feet + bedrooms + bathrooms"], predictions.baseline_home_value, predictions)

In [None]:
plt.figure(figsize=(16, 8))
sns.regplot(x=predictions.actual_home_value, y=predictions["home_value ~ square_feet + bedrooms + bathrooms"], data=predictions, label="home_value ~ square_feet + bedrooms + bathrooms", line_kws={"color": sns.color_palette("colorblind")[4]})
plt.ticklabel_format(axis="both", style="plain")

plt.title("Actual v. Predicted Home Value")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()

plt.show()

---
### Let's test our Linear Regression model

In [None]:
X_test = test[["square_feet", "bedrooms", "bathrooms"]]

test["yhat"] = lm.predict(X_test)
test.head()

In [None]:
ev.regression_errors(test.home_value, test.yhat, test)

In [None]:
test["yhat_baseline"] = test.home_value.median()
test.head()

In [None]:
ev.baseline_errors(test.home_value, test.yhat_baseline, test)

In [None]:
ev.better_than_baseline(test.home_value, test.yhat, test.yhat_baseline, test)
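
One common convention is to derive the baseline from the training set only, so that nothing about the test set leaks into the benchmark. The cells above take the median of the test set instead; the sketch below shows the alternative on synthetic data. All values, including the stand-in for `lm.predict(X_test)`, are made up.

```python
import numpy as np

rng = np.random.default_rng(7)
train_values = rng.lognormal(mean=13, sigma=0.5, size=800)   # synthetic train home values
test_values = rng.lognormal(mean=13, sigma=0.5, size=200)    # synthetic test home values
# Stand-in for lm.predict(X_test): the truth plus some noise.
model_predictions = test_values + rng.normal(0, 50_000, size=200)


def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# The baseline is learned from train and then applied unchanged to test.
baseline_from_train = float(np.median(train_values))
baseline_predictions = np.full_like(test_values, baseline_from_train)

rmse_model = rmse(test_values, model_predictions)
rmse_baseline = rmse(test_values, baseline_predictions)
beats_baseline = rmse_model < rmse_baseline

print(f"model RMSE = {rmse_model:,.0f}, baseline RMSE = {rmse_baseline:,.0f}, better: {beats_baseline}")
```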

In [None]:
test.apply(lambda c: math.sqrt(sklearn.metrics.mean_squared_error(test.home_value, c)))[4:]

In [None]:
print(f'Coefficient of determination, or explained variance: {sklearn.metrics.r2_score(test.home_value, test.yhat):.2f}')

In [None]:
sklearn.feature_selection.f_regression(test[["square_feet", "bedrooms", "bathrooms"]], test.home_value)
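
It may help to know how to read the output of `f_regression`: it returns one F statistic and one p-value per feature, each from a separate univariate test against the target. It does not test the three features jointly. The sketch below uses one informative and one pure-noise synthetic feature to show the shape of the result.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(3)
n_rows = 500
informative = rng.normal(size=n_rows)      # synthetic feature that drives the target
noise_feature = rng.normal(size=n_rows)    # synthetic feature unrelated to the target
target = 5.0 * informative + rng.normal(scale=1.0, size=n_rows)

X_demo = np.column_stack([informative, noise_feature])

# One (F, p) pair per column of X_demo, in column order.
f_stats, p_values = f_regression(X_demo, target)

for name, f_stat, p_val in zip(["informative", "noise_feature"], f_stats, p_values):
    print(f"{name}: F = {f_stat:.2f}, p = {p_val:.3g}")
```

For a joint test of all features together, the F statistic and its p-value in a fitted OLS model summary are the more direct evidence.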

> My model's predictions of home_value are off by 600,863.73 dollars on average, which is better than the baseline model by ~200,000 dollars.

> About 42% of the variance in home_value is explained by square_feet, bedrooms, and bathrooms in my model.

> I reject the null hypothesis that single unit property value is independent of square footage, number of bedrooms, and number of bathrooms.

In [None]:
plt.figure(figsize=(16, 8))
sns.regplot(x=test.home_value, y=test.yhat, data=test, label="home_value ~ square_feet + bedrooms + bathrooms", line_kws={"color": sns.color_palette("colorblind")[4]})
plt.ticklabel_format(axis="both", style="plain")

plt.title("Actual v. Predicted Home Value")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()

plt.show()

---
#### Using ols to Compare Findings with sklearn 

In [None]:
model = ols("home_value ~ square_feet + bedrooms + bathrooms", test).fit()

In [None]:
actual = test.home_value
predicted = model.predict()

In [None]:
ev.model_significance(model)

In [None]:
model.summary()
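
When comparing `statsmodels` with `sklearn`, the two libraries should agree on coefficients only if they are fitted on the same rows. In this notebook, `lm` was fitted on `train` while `ols` is fitted on `test`, so some difference is expected. The sketch below fits both on one synthetic DataFrame to show the agreement; the data and generating coefficients are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols

rng = np.random.default_rng(11)
demo = pd.DataFrame({
    "square_feet": rng.uniform(800, 4000, 300),
    "bedrooms": rng.integers(1, 6, 300),
    "bathrooms": rng.integers(1, 5, 300),
})
# Synthetic target; bedrooms deliberately has no true effect here.
demo["home_value"] = (
    250 * demo.square_feet + 15_000 * demo.bathrooms + rng.normal(0, 40_000, 300)
)

features = ["square_feet", "bedrooms", "bathrooms"]
sk_model = LinearRegression().fit(demo[features], demo.home_value)
sm_model = ols("home_value ~ square_feet + bedrooms + bathrooms", data=demo).fit()

print("sklearn coefficients:    ", sk_model.coef_)
print("statsmodels coefficients:", sm_model.params[features].values)
print(f"R^2 = {sm_model.rsquared:.3f}, joint F-test p-value = {sm_model.f_pvalue:.3g}")
```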

---
### Additional Models

In [None]:
# feature
X = train[["square_feet"]]
# target
y = train.home_value

# 1. Make the model
lm = sklearn.linear_model.LinearRegression()
# 2. Fit the model
lm.fit(X, y)
# 3. Use the model
predictions["home_value ~ square_feet"] = lm.predict(X)
predictions.head()

In [None]:
plt.figure(figsize=(16, 8))
ev.plot_residuals(predictions.actual_home_value, predictions["home_value ~ square_feet"])
plt.ticklabel_format(axis="both", style="plain")
plt.show()

In [None]:
ev.regression_errors(predictions.actual_home_value, predictions["home_value ~ square_feet"], predictions)

In [None]:
ev.baseline_errors(predictions.actual_home_value, predictions.baseline_home_value, predictions)

In [None]:
ev.better_than_baseline(predictions.actual_home_value, predictions["home_value ~ square_feet"], predictions.baseline_home_value, predictions)

In [None]:
predictions.apply(lambda c: math.sqrt(sklearn.metrics.mean_squared_error(predictions.actual_home_value, c)))

In [None]:
plt.figure(figsize=(16, 8))
sns.scatterplot(x=predictions.actual_home_value, y=predictions["home_value ~ square_feet"], label="home_value ~ square_feet")
plt.ticklabel_format(axis="both", style="plain")
plt.show()

---

In [None]:
# feature
X = train[["bedrooms"]]
# target
y = train.home_value

# 1. Make the model
lm = sklearn.linear_model.LinearRegression()
# 2. Fit the model
lm.fit(X, y)
# 3. Use the model
predictions["home_value ~ bedrooms"] = lm.predict(X)
predictions.head()

In [None]:
plt.figure(figsize=(16, 8))
ev.plot_residuals(predictions.actual_home_value, predictions["home_value ~ bedrooms"])
plt.ticklabel_format(axis="both", style="plain")
plt.show()

In [None]:
ev.regression_errors(predictions.actual_home_value, predictions["home_value ~ bedrooms"], predictions)

In [None]:
ev.better_than_baseline(predictions.actual_home_value, predictions["home_value ~ bedrooms"], predictions.baseline_home_value, predictions)

In [None]:
predictions.apply(lambda c: math.sqrt(sklearn.metrics.mean_squared_error(predictions.actual_home_value, c)))

---

In [None]:
# feature
X = train[["bathrooms"]]
# target
y = train.home_value

# 1. Make the model
lm = sklearn.linear_model.LinearRegression()
# 2. Fit the model
lm.fit(X, y)
# 3. Use the model
predictions["home_value ~ bathrooms"] = lm.predict(X)
predictions.head()

In [None]:
plt.figure(figsize=(16, 8))
ev.plot_residuals(predictions.actual_home_value, predictions["home_value ~ bathrooms"])
plt.ticklabel_format(axis="both", style="plain")
plt.show()

In [None]:
ev.regression_errors(predictions.actual_home_value, predictions["home_value ~ bathrooms"], predictions)

In [None]:
ev.better_than_baseline(predictions.actual_home_value, predictions["home_value ~ bathrooms"], predictions.baseline_home_value, predictions)

In [None]:
predictions.apply(lambda c: math.sqrt(sklearn.metrics.mean_squared_error(predictions.actual_home_value, c)))

> home_value ~ square_feet + bedrooms + bathrooms is still the best model
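
The four models above repeat the same fit-and-score steps, which can be condensed into a loop over feature sets. The sketch below does so on synthetic data with invented coefficients, so the RMSE values themselves mean nothing; only the pattern carries over.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
demo = pd.DataFrame({
    "square_feet": rng.uniform(800, 4000, 400),
    "bedrooms": rng.integers(1, 6, 400),
    "bathrooms": rng.integers(1, 5, 400),
})
demo["home_value"] = (
    200 * demo.square_feet
    - 5_000 * demo.bedrooms
    + 20_000 * demo.bathrooms
    + rng.normal(0, 30_000, 400)
)

full_name = "home_value ~ square_feet + bedrooms + bathrooms"
feature_sets = {
    full_name: ["square_feet", "bedrooms", "bathrooms"],
    "home_value ~ square_feet": ["square_feet"],
    "home_value ~ bedrooms": ["bedrooms"],
    "home_value ~ bathrooms": ["bathrooms"],
}

# Fit one model per feature set and record its RMSE on the same rows.
rmse_by_model = {}
for name, cols in feature_sets.items():
    fitted = LinearRegression().fit(demo[cols], demo.home_value)
    mse = mean_squared_error(demo.home_value, fitted.predict(demo[cols]))
    rmse_by_model[name] = float(np.sqrt(mse))

best = min(rmse_by_model, key=rmse_by_model.get)
print(pd.Series(rmse_by_model).sort_values())
print("Lowest RMSE:", best)
```

Note that when models are scored on the data they were fitted on, an ordinary least squares model with more features can never have a higher RMSE than a model using a subset of them. Comparing the candidates on held-out data is therefore a more informative check.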