# King County Real Estate - Housing Analysis

## Business Question:

King County Real Estate has hired us to investigate which features of a home have the greatest effect on price.

* They would like us to make a model to predict housing prices.
* From that model, they would like to know which factors have the largest effect on price.

## Data Importing & Cleaning

The dataset "kc_house_data.csv" was obtained from the link below. King County 2014-2015 House Sales dataset

https://osf.io/twq9p/

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import scipy.stats as stats


sns.set_style("whitegrid")
%matplotlib inline

sns.set(rc={'figure.figsize':(11,8)})

In [2]:
url = "https://raw.githubusercontent.com/bigbenx3/housing_analysis_project/main/kc_house_data.csv"
df1 = pd.read_csv(url, error_bad_lines=False)

In [3]:
df1.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Sneak preview of all the features in dataset.

##### Null values present?

In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

21613 non null entries for each columns. Also, all the data, with the exception of date, are numeric (floats and integers). So, that's good. Most likely, "date" will be dropped as it's pretty irrelevant as a feature relating to house price.

In [5]:
df1.isnull().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

Still zero null values in all the columns.

## Minor Data - Manipulation

Let's again look back at the sort of data that we have, using the .info() function.

In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

So, there are few issues potentially: 
    
    1. It's true that pretty much all our data is numeric, but looking back at the 

Removing outliers, the 33 bedrooms/ bathrooms... that kind of thing

## Feature selection - Looking at what features contribute most

#### Correlation with Target Method

Simplifying the dataset means removing columns that might not be relevant for our current analysis.

If we are trying to investigate factors that affect the price and value of a home: 

1. So, "id" and "date" are irrelevant because they aren't really features pertaining to the house, not components of the house.
REMOVED "id" and "date"


So that leaves us with 18 other features (18 other columns excluding "price) to account for in our model.

To start, we can look at correlation between pairs of features to try to get an idea from there, what features may the most helpful in our model.

In [None]:
df1.corr()

So even with correlation values with each pair of variables, it's still a lot of information to digest.
What are we looking for? : 
High absolute values in the "price" row/column

To make it easier for us to discern, let's use a heatmap.

In [None]:
sns.heatmap(df1.corr())

Still a bit cluttered. (The lighter colors along the "price" row/column, represent the positive correlations between price and another variable).

Isolating the "price" column can give us a better visual to draw conclusions from.

In [None]:
df1_corrs = df1.corr()["price"].map(abs).sort_values(ascending=False)
df1_corrs

This shows that there are "price" and the following variables have the least correlation:

id, (date isn't on here but again irrelevant to house price), longitude, zipcode, yr_built, sqft_lot15, sqft_lot, yr_renovated.


Those features listed above were all displayed in the latter end of the spectrum, mostly in purple, representing the least correlation with "price".

###### The highest correlation was "sqft_living" (sqft footage of living space of home) to "price"; and "grade" (quality of the house) to "price".

So let's look at those two features in this simple model

In [None]:
df1_preds = df1[["sqft_living", "grade"]]
df1_target = df1["price"]

lr = LinearRegression()
lr.fit(df1_preds, df1_target)
lr.score(df1_preds, df1_target)

So we know about 53.5% of the variance surrounding data relating to "price"; We only know about 53.5% of what goes into calculating "price".

###### Now... there is an issue: How do we know if the large correlation values we are seeing are due to solely that of one feature to "price" (our target).

For example, our two features: "sqft_living" and "grade" both have high correlations to "price", 0.702 and 0.667, respectively.

However, these two features have a high correlation toward each other: 0.7627.


Because, we are specifically trying to investigate the features that contribute the most to price, we need to identify the effect (as much as possible) of the individual feature on our target, the "price" of the home.

Therefore, multicollinearity between features is sort of a big deal.

###### Our top contending features to use: 

* sqft_living      0.702035
* grade            0.667434
* sqft_above       0.605567
* sqft_living15    0.585379
* bathrooms        0.525138
* view             0.397293
* sqft_basement    0.323816
* bedrooms         0.308350
* lat              0.307003
* waterfront       0.266369
* floors           0.256794

Those features above 0.20 correlation value are somewhat of interest. Those above 0.40 are of great interest as features for our model.

In [None]:
df = df1[["sqft_lot", "sqft_living",
                    "grade", "condition", "bathrooms", "bedrooms",
                    "waterfront", "price", "floors", "lat", "long"]]

We are eliminating the columns below:


yr_built

date

view

sqft_above

sqft_basement

yr_renovated

zipcode

lat

long

sqft_living15

sqft_lot15

In [None]:
df.info()

From 21 to 11 columns to account for.

## Exploratory Analysis

We want to get a sense of the data, the values, for each feature and remove the outliers in preparation to building a model.

Before, that we want to change the datatypes for some the columns, for example "price".

In [None]:
df.info()

In [None]:
df["price"] = df["price"].astype(int)

In [None]:
df.info()

We may need to change the other features into another datatype. For now, this will do.

#### Prices Overview

The dependent variable here is price of the homes. Let's get a sense of the prices.

In [None]:
df.price.describe()

count    21,600

mean     540,000

std      367,000

min      75,000

25%      321,900

50%      450,000

75%      645000

max      7,700,000

(USD) 2014-2015
King County, Washington 98001

It's easier to see now the corresponding numerical values.

In [None]:
sns.histplot(df.price)

https://www.thoughtco.com/what-is-the-interquartile-range-rule-3126244

In [None]:
q3, q1 = np.percentile(df["price"], [75 ,25])
iqr = q3 - q1
iqr

323050 is the interquartile range.

In [None]:
q3

Oh, ok- the 75percentile.

In [None]:
q1

And the 25percentile. 

In [None]:
323050*1.5

This number will allow us to find the range that are outliers.

Though it's not often affected much by them, the interquartile range can be used to detect outliers. This is done using these steps:

Calculate the interquartile range for the data.

Multiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers).

Add 1.5 x (IQR) to the third quartile. Any number greater than this is a suspected outlier.

Subtract 1.5 x (IQR) from the first quartile. Any number less than this is a suspected outlier.


https://www.thoughtco.com/what-is-the-interquartile-range-rule-3126244

In [None]:
645000+484575.0

In [None]:
321950-484575.0

So regarding "price" values, any home price < -162625 and > 1129575 are outliers.
And since we don't deal with negative numbers with price, we'll ignore the < -162625 part.

In [None]:
import pandas as pd
df_price_unique_values = df["price"].unique()
print(sorted(df_price_unique_values))

Again, ignoring the negative range because our prices start at 75,000. so let's drop values greater than 1129575.

However, let's double check on how many entries we will be discarding before we do so.

In [None]:
price_counts = df.groupby("price")["price"].agg("count").sort_values(ascending=True)
price_counts

In [None]:
pd.set_option("display.max_rows", 5000)

In [None]:
price_counts = df.groupby("price")["price"].agg("count").sort_values(ascending=False)
price_counts

In [None]:
df.info()

Since price values greater than 1129575 are outliers, we have to keep values less than or equal to 1129575.

In [None]:
df_outliers_rmvd = df[df["price"] <= 1129575]
df_outliers_rmvd.info()

In [None]:
sns.histplot(df_outliers_rmvd.price)

Our new visual plot. Not the best, but with the outliers removed, it'll work for now.

In [None]:
df_outliers_rmvd.price.describe()

Now, trying to simplify the code: This will be out reuseable template for the other features.

In [None]:
q3, q1 = np.percentile(df_outliers_rmvd["price"], [75 ,25])
iqr = q3 - q1
print("iqr=", iqr)
print("q3=", q3)
print("q1=", q1)
print("constant=", iqr*1.5)

In [None]:
print("suspected outliers are greater than this number:", q3+(iqr*1.5))
print("suspected outliers are less than this number", q1-(iqr*1.5))

<b>So regarding "price", any price value < -162625 and > 1129575 are outliers.</b>

Trying to create a reuseable template. We'll try it with Living Space Square Footage.

#### Living Space Square Footage

In [None]:
df_outliers_rmvd.sqft_living.describe()

Visual Plot: Initial Look

In [None]:
sns.histplot(df_outliers_rmvd["sqft_living"])

Now, trying to take out the outliers to hopefully normalize the distribution.

In [None]:
q3, q1 = np.percentile(df_outliers_rmvd["sqft_living"], [75 ,25])
iqr = q3 - q1
print("iqr=", iqr)
print("q3=", q3)
print("q1=", q1)
print("constant=", iqr*1.5)

In [None]:
print("suspected outliers are greater than this number:", q3+(iqr*1.5))
print("suspected outliers are less than this number", q1-(iqr*1.5))

So regarding "sqft_living", any sqft_living value < -146.5 and > 3977.5 are outliers. Again, any negative numbers, we can sort of ignore, unless negative values start appearing on our histogram plot.

Let's remove the outliers.

In [None]:
df_outliers_rmvd = df_outliers_rmvd[df_outliers_rmvd["sqft_living"] <= 3977.5]
df_outliers_rmvd.info()

Let's see the new histogram plot.

In [None]:
sns.histplot(df_outliers_rmvd["sqft_living"])

Still a bit crude but we can work with that for now.

#### Lot Square Footage

In [None]:
df_outliers_rmvd.sqft_lot.describe()

Visual Plot: Initial Look

In [None]:
sns.histplot(df_outliers_rmvd["sqft_lot"])

Now, trying to take out the outliers to hopefully normalize the distribution.

In [None]:
q3, q1 = np.percentile(df_outliers_rmvd["sqft_lot"], [75 ,25])
iqr = q3 - q1
print("iqr=", iqr)
print("q3=", q3)
print("q1=", q1)
print("constant=", iqr*1.5)

In [None]:
print("suspected outliers are greater than this number:", q3+(iqr*1.5))
print("suspected outliers are less than this number", q1-(iqr*1.5))

So regarding "sqft_living", any sqft_living value < -2800.0 and > 18000.0 are outliers. Again, any negative numbers, we can sort of ignore, unless negative values start appearing on our histogram plot.

Let's remove the outliers.

In [None]:
df_outliers_rmvd = df_outliers_rmvd[df_outliers_rmvd["sqft_lot"] <= 18000]
df_outliers_rmvd.info()

Let's see the new histogram plot.

In [None]:
sns.histplot(df_outliers_rmvd["sqft_lot"])

Still a bit crude but we can work with that for now.

And since lot_sqftspace is a bit difficult to discern for a general correlation, we might just scratch the feature altogether towards the end.

#### Bedrooms

In [None]:
df_outliers_rmvd.bedrooms.describe()

Visual Plot: Initial Look

In [None]:
sns.histplot(df_outliers_rmvd["bedrooms"])

Now, trying to take out the outliers to hopefully normalize the distribution.

In [None]:
q3, q1 = np.percentile(df_outliers_rmvd["bedrooms"], [75 ,25])
iqr = q3 - q1
print("iqr=", iqr)
print("q3=", q3)
print("q1=", q1)
print("constant=", iqr*1.5)

In [None]:
print("suspected outliers are greater than this number:", q3+(iqr*1.5))
print("suspected outliers are less than this number", q1-(iqr*1.5))

So regarding "sqft_living", any sqft_living value < 1.5 and > 5.5 are outliers. Again, any negative numbers, we can sort of ignore, unless negative values start appearing on our histogram plot.

Let's remove the outliers.

In [None]:
df_outliers_rmvd = df_outliers_rmvd[df_outliers_rmvd["bedrooms"]<= 5.5]
df_outliers_rmvd = df_outliers_rmvd[df_outliers_rmvd["bedrooms"]>= 1.5]
df_outliers_rmvd.info()

We have to double check that both portions of the range were kept and not discarded.

In [None]:
df_outliers_rmvd.loc[df_outliers_rmvd["bedrooms"] <= 5.5]

In [None]:
df_outliers_rmvd.loc[df_outliers_rmvd["bedrooms"] >= 1.5]

Let's see the new histogram plot.

In [None]:
sns.histplot(df_outliers_rmvd["bedrooms"])

Still a bit crude but we can work with that for now.

Very crude correlation and normal distribution curve.

#### Bathrooms

In [None]:
df_outliers_rmvd.bathrooms.describe()

Visual Plot: Initial Look

In [None]:
sns.histplot(df_outliers_rmvd["bathrooms"])

Now, trying to take out the outliers to hopefully normalize the distribution.

In [None]:
q3, q1 = np.percentile(df_outliers_rmvd["bathrooms"], [75 ,25])
iqr = q3 - q1
print("iqr=", iqr)
print("q3=", q3)
print("q1=", q1)
print("constant=", iqr*1.5)

In [None]:
print("suspected outliers are greater than this number:", q3+(iqr*1.5))
print("suspected outliers are less than this number", q1-(iqr*1.5))

So regarding "sqft_living", any sqft_living value < 0 and > 4 are outliers. Again, any negative numbers, we can sort of ignore, unless negative values start appearing on our histogram plot.

Let's remove the outliers.

In [None]:
df_outliers_rmvd = df_outliers_rmvd[df_outliers_rmvd["bathrooms"] <= 4]
df_outliers_rmvd.info()

Let's see the new histogram plot.

In [None]:
sns.histplot(df_outliers_rmvd["bathrooms"])

Still a bit crude but we can work with that for now.

Very crude correlation as well.

#### Grade

Now grade is one of those that need not remove outliers because we just need to understand what grade homes is considered more expensive. So just a correlation will do.

In [None]:
df_outliers_rmvd.grade.describe()

Visual Plot: Initial Look

In [None]:
sns.histplot(df_outliers_rmvd["grade"])

A bit crude, but we will work that in with price later.

#### Condition

In [None]:
df_outliers_rmvd.condition.describe()

Visual Plot: Initial Look

In [None]:
sns.histplot(df_outliers_rmvd["condition"])

Still a bit crude but we can work with that for now.

#### Floors

In [None]:
df_outliers_rmvd.floors.describe()

Visual Plot: Initial Look

In [None]:
sns.histplot(df_outliers_rmvd["floors"])

No real correlation yet til we match with price.

#### Location

In [None]:
fig = plt.figure(figsize=(15,10))
ax = sns.scatterplot(x=df_outliers_rmvd["long"], y=df_outliers_rmvd["lat"], hue=df_outliers_rmvd["price"], palette="plasma",
                     marker=".")
ax.set( xlabel="Longitude",
        ylabel="Latitude", 
        title="Price by Location")

Seems there is a general area from 47.55 North latitude to 47.7 North latitude, where most of the most expensive properties are located.

#### Waterfront

In [None]:
df_outliers_rmvd.waterfront.describe()

In [None]:
sns.histplot(df_outliers_rmvd["waterfront"])

For our analysis, we will exclude waterfront as a feature because it doesn't show discernibility, that it would impact price. Perhaps, with the removal of outliers, has skewed the model towards homes without waterfronts and it would be interesting to see the effect a waterfront has on the price. My prior limited background knowledge agrees with the fact that a waterfront property would be more expensive than a similar property without one.

But right now that is my speculation. 

### Looking at Multicolinearity

In [None]:
corr_matrix = df_outliers_rmvd.corr()
print(corr_matrix["price"].sort_values(ascending=False))

Living area and grade have the highest correlations with price. Latitude visually showed more promise as a feature with a high correlation to price of the home.

## Data Modeling

Let's prepare a model and see where our features are at.

In [None]:
df_outliers_rmvd.info()

#### Model 0

In [None]:
X = df_outliers_rmvd.drop("price", 1)
y = df_outliers_rmvd["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

In [None]:
predictors = sm.add_constant(X_train)
model_0 = sm.OLS(y_train , predictors).fit()
model_0.summary()

R-Squred value is decent - An R^2 of 1 indicates that the regression predictions perfectly fit the data.
Near zero p-values indicated strong evidence that the null hypothesis be rejected.
<b>High Condition number</b>... something to watch out for too.

In [None]:
lr= LinearRegression()
lr.fit(X_train, y_train)

# Use Linear Regression to make predictions for train and test data
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)


# Calculate Root Mean Square Error
train_rmse = np.sqrt(mean_squared_error(y_train, y_hat_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_hat_test))

# Calculate Mean Absolute Error
test_mae = mean_absolute_error(y_test, y_hat_test)
train_mae = mean_absolute_error(y_train, y_hat_train)

print(f"Train Root Mean Square Error: {train_rmse}")
print(f"Test Root Mean Square Error: {test_rmse}")

print(f"Train Mean Absolute Error: {train_mae}")
print(f"Test Mean Absolute Error: {test_mae}")

In [None]:
fig = sm.graphics.qqplot(model_0.resid, dist=stats.norm, line='45', fit=True)

This residual plot is not all that good, room for improvement.

#### Model 1.0

The main goal of this model is to see if scaling helps in any way.

In [None]:
price_log = np.log(df_outliers_rmvd.price)
price_log = pd.DataFrame(price_log)

In [None]:
X1 = df_outliers_rmvd.drop('price', 1)
y1 =price_log

In [None]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, random_state=11)

In [None]:
scaler = StandardScaler()
scalerp = StandardScaler()

X_train1[["sqft_lot", "sqft_living", "bathrooms", "bedrooms", "floors", "grade", "lat", "long"]]  =scaler.fit_transform(X_train1[["sqft_lot", "sqft_living", "bathrooms", "bedrooms", "floors", "grade", "lat", "long"]])


X_test1[["sqft_lot", "sqft_living", "bathrooms", "bedrooms", "floors", "grade", "lat", "long"]] = scaler.transform(X_test1[["sqft_lot", "sqft_living", "bathrooms", "bedrooms", "floors", "grade", "lat", "long"]])


y_train1 = scalerp.fit_transform(pd.DataFrame(y_train1))
y_test1 = scalerp.transform(pd.DataFrame(y_test1))

In [None]:
predictors = sm.add_constant(X_train1)
model_1 = sm.OLS(y_train1 , predictors).fit()
model_1.summary()

The issue with the condition number is gone. And r-squared has jumped from 63% to 67%.
So, that's sort of the good news. 

The bad news: the r-squared is still too low.

In [None]:
lr1= LinearRegression()
lr1.fit(X_train1, y_train1)


# Use Linear Regression to make predictions for train and test data
y_hat_train = lr1.predict(X_train1)
y_hat_test = lr1.predict(X_test1)



# Undo scale
y_train1 = scalerp.inverse_transform(y_train1)
y_test1 = scalerp.inverse_transform(y_test1)
y_hat_train = scalerp.inverse_transform(y_hat_train)
y_hat_test = scalerp.inverse_transform(y_hat_test)

# Undo log
y_train1 = np.exp(y_train1)
y_test1 = np.exp(y_test1)
y_hat_train = np.exp(y_hat_train)
y_hat_test = np.exp(y_hat_test)


# Calculate Root Mean Square Error
train_rmse1 = np.sqrt(mean_squared_error(y_train1, y_hat_train))
test_rmse1 = np.sqrt(mean_squared_error(y_test1, y_hat_test))

# Calculate Mean Absolute Error
test_mae1 = mean_absolute_error(y_test1, y_hat_test)
train_mae1 = mean_absolute_error(y_train1, y_hat_train)

print(f'Train Root Mean Square Error: {train_rmse1}')
print(f'Test Root Mean Square Error: {test_rmse1}')

print(f'Train Mean Absolute Error: {train_mae1}')
print(f'Test Mean Absolute Error: {test_mae1}')

In [None]:
y_hat_test

In [None]:
fig = sm.graphics.qqplot(model_1.resid, dist=stats.norm, line='45', fit=True)

So here's the dilemma: we don't want a model to be too fitted, overfitted, because then it really isn't any use as a model to predict. It's nothing more than a glorified calculator that spit out calculations and numbers for existing data. 

However, we want it to have some degree of fit to the line so that it CAN be used as a model.

A happy medium somewhere in there...

In [None]:
results = [ ['Model 0', train_rmse, test_rmse, train_mae, test_mae],
            ['Model 1',train_rmse1, test_rmse1, train_mae1, test_mae1]]

df_results = pd.DataFrame(results, columns=['Model', 'Train RMSE', 'Test RMSE', 'Train MAE', 'Test MAE'])
df_results

## CRITICAL-Model Decision

I think I'll go with model 2 because the scaling brought down the condition number, visually it was more aesthetically pleasing.

#### Choosing the Model

Typically lower RSME shows better fit to the line.

In [None]:
Xf = df_outliers_rmvd.drop('price', 1)

scalerf= StandardScaler()

Xf[["sqft_lot", "sqft_living", "bathrooms", "bedrooms", "floors", "grade", "condition", "lat", "long"]]  =scalerf.fit_transform(Xf[["sqft_lot", "sqft_living", "bathrooms", "bedrooms", "floors", "grade", "condition", "lat", "long"]])

scalerfp = StandardScaler()

price_sc = scalerp.transform(pd.DataFrame(df_outliers_rmvd.price))

y_hat = lr1.predict(Xf)

y_hat = np.exp(scalerp.inverse_transform(y_hat))

y_hat

rmse_f = np.sqrt(mean_squared_error(df_outliers_rmvd.price , y_hat))
mae_f = mean_absolute_error(df_outliers_rmvd.price, y_hat)

In [None]:
print(f'Root Mean Square Error: {rmse_f}')
print(f'Mean Absolute Error: {mae_f}')


In [None]:
mae_f

In [None]:
mae_f/df_outliers_rmvd.price.mean()

In [None]:
sns.histplot(model_1.resid)

In [None]:
fig = sm.graphics.qqplot(model_1.resid, dist=stats.norm, line='45', fit=True)

In [None]:
sns.regplot(x=np.log(y_test1), y=np.log(y_hat_test))

The majority of the plot conforms to the best fit line.

## Data Question - Answers

1. The factors most affecting the price of a house are:

* Location(lat)

* Quality of the house(grade)

* Living area(sqft_living)


## Results

* We have a model that has an Coefficient of Determination(R-squared) value of 0.672 which indicates that our model can explain 67.2% of all variation in the data around the mean.

* With a Mean Squared Error of around 140227 USD, that means our predicted price is, on average, 140227 USD off from our mean. While that number doesn't look too bad our Root Mean Squared Error is around 183833 USD which means that our model is being heavily penalized for predictions that are very far off the actual price.

* Average home price: 476,985 USD. The price prediction was +/-$140,227 off the real price (29.4% margin of error)



## Conclusions

Descriptive analysis and modeling reveal which factors contribute most to housing prices: 

● Increase Living Area(in square feet) 

● Buy homes in regions specified (47.55 15°N to 47.7 15°N) (Or 
    maybe homes outside of this region will likely be more affordable) 
    
● Upgrade the quality of your home

## Future Research

* The data we were provided was from 2014 to 2015. And such outdated data may not give us the optimal insights relevant to 
  today's housing situation

* We should be able to get a lot more out of the location data, with further analysis, incorporating data relevant to the 
  zipcode so there is a better determination for prices that can be expected in a more defined area.

* Also, streamlining the methods of getting a more fitted model without going too far into "overfitted" territory. 
  Like I've mentioned before, there is a happy medium in there.

* The most obvious next step is to try out new modeling techniques.  While linear regression is a good start, there are many 
  other techniques that I believe could help make better predictions.  Of particular interest to me in this context are 
  Polynomial Regression and Weighted Least Squares, that might be promising.

## Presentation Prep

In [None]:
fig = plt.figure(figsize=(11,8))
ax = sns.regplot(data=df_outliers_rmvd, x="sqft_living", y="price", marker=".",
     scatter_kws={"color": "grey"}, line_kws={"color": "blue"})

ax.set(  xlabel="Living Area(square feet)",
         ylabel="Price(in Millions of $)", 
         title="Price by Living Area",
 )


plt.xlim([0,5000])
plt.ylim([0, 1250000])
plt.show()

In [None]:
df_outliers_rmvd.grade.describe()

In [None]:
fig = plt.figure(figsize=(15,20))
ax = sns.barplot(data=df_outliers_rmvd, x="grade", y="price", ci=None)
ax.set( xticklabels=(["4","5", "6", "7", "8", "9", "10", "11"]),
        xlabel="Grade",
        ylabel="Price (in Thousands of $)", 
        title="Price by Grade"  )