# TEAM NAME ETC

# Overview

Our task is to build an inferential linear regression model that addresses the business problem described below. We will check that the model satisfies the assumptions of linear regression: linearity, independence of errors, normality of residuals, and homoscedasticity. We will also aim for a high R^2 value, which indicates that our predictors explain much of the total variance in house sale prices.
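As a rough illustration of what R^2 measures, the sketch below fits a one-predictor least-squares line with NumPy and computes R^2 as 1 - SS_res / SS_tot. The data is synthetic and made up for the example (it is not the King County data), and the names `r_squared`, `sqft`, and `price` are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data: price is a noisy linear function of square footage.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
price = 50_000 + 250 * sqft + rng.normal(0, 40_000, size=200)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones_like(sqft), sqft])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
fitted = A @ coef
residuals = price - fitted  # the quantities the normality and homoscedasticity checks look at

print(f"R^2 = {r_squared(price, fitted):.3f}")
```

An R^2 near 1 means the fitted line accounts for most of the variation in the target. It says nothing by itself about whether the residuals are normal or have constant variance, so those assumptions still need separate checks.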

# Business Understanding

Our stakeholder is Opendoor, a company that makes cash offers for homes and resells them for a profit, sometimes performing repairs before resale. We are assisting the company in King County, Washington, whose county seat is Seattle. Our job is to analyze the King County data set and build models that support buy recommendations for our stakeholder. We will report the features that matter most when assessing home values. Because Opendoor also performs repairs, we will examine features that reflect a property's condition and the amount of repair work it is likely to need. Opendoor can use our recommendations to buy properties that are likely to resell at a profit.

# Data Understanding

We use the King County House Sales data set from the King County Assessor website.

Summary information for 'Condition' and 'Grade' can be found here: https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r

## *Loading our data*

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.power import TTestIndPower, TTestPower
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv('data/kc_house_data.csv')

## *Dealing with missing values*

In [23]:
df.isna().sum()

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

In [22]:
df['waterfront'].value_counts()
#2376 entries missing data

NO     19075
YES      146
Name: waterfront, dtype: int64

In [21]:
df['view'].value_counts()
#19422 none, so 63 missing data

NONE         19422
AVERAGE        957
GOOD           508
FAIR           330
EXCELLENT      317
Name: view, dtype: int64

In [24]:
df['yr_renovated'].value_counts()
#3842 missing values

0.0       17011
2014.0       73
2003.0       31
2013.0       31
2007.0       30
          ...  
1946.0        1
1959.0        1
1971.0        1
1951.0        1
1954.0        1
Name: yr_renovated, Length: 70, dtype: int64
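The counts above show which columns have gaps, but not how to fill them. One possible approach, sketched here on a small made-up frame, is to fill each column with its most frequent level (`NO`, `NONE`, and `0.0` in the counts above). This imputation strategy is an assumption for illustration, not necessarily the treatment used in the final model, and the names `toy`, `fill_values`, and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "waterfront": ["NO", np.nan, "YES", np.nan],
    "view": ["NONE", "GOOD", np.nan, "NONE"],
    "yr_renovated": [0.0, np.nan, 2003.0, np.nan],
})

# Assumed fill values: the most common level of each column.
fill_values = {"waterfront": "NO", "view": "NONE", "yr_renovated": 0.0}

# fillna with a dict fills each listed column with its own value
# and returns a new DataFrame, leaving `toy` unchanged.
cleaned = toy.fillna(value=fill_values)

print(cleaned.isna().sum())
```

Filling with the dominant level keeps every row, at the cost of treating unknown values as if they were the common case. Dropping the affected rows or adding a missing-value indicator column are alternatives worth comparing.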

## *Building Condition Column*

In [16]:
df['condition'].value_counts()

Average      14020
Good          5677
Very Good     1701
Fair           170
Poor            29
Name: condition, dtype: int64

In [26]:
# Map the ordinal condition labels to numeric codes (1 = worst, 5 = best).
# Assigning the result back avoids relying on inplace=True on a column selection.
condition_codes = {'Poor': '1', 'Fair': '2', 'Average': '3', 'Good': '4', 'Very Good': '5'}
df['condition'] = df['condition'].replace(condition_codes)

In [30]:
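# astype returns a new Series, so assign it back to make the conversion stick
df['condition'] = df['condition'].astype(np.int64)
df['condition']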

0        3
1        3
2        3
3        5
4        3
        ..
21592    3
21593    3
21594    3
21595    3
21596    3
Name: condition, Length: 21597, dtype: int64

In [31]:
df['condition'].value_counts()

3    14020
4     5677
5     1701
2      170
1       29
Name: condition, dtype: int64

# Modeling

## *Filtering columns*

In [10]:
y = df["price"]
X = df.drop("price", axis=1)

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [12]:
print(f"X_train is a DataFrame with {X_train.shape[0]} rows and {X_train.shape[1]} columns")
print(f"y_train is a Series with {y_train.shape[0]} values")

X_train is a DataFrame with 16197 rows and 20 columns
y_train is a Series with 16197 values
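The feature frame above still contains an identifier and text columns such as `date`, `waterfront`, and `view`, which a linear regression cannot use directly. A minimal sketch of one way to narrow it to numeric predictors follows. The helper `numeric_features` and the toy values are hypothetical, and dropping `id` is an assumption (it is a record key rather than a property attribute).

```python
import pandas as pd

def numeric_features(frame, target="price", drop=("id",)):
    """Return only the numeric predictor columns of `frame`.

    Removes the target and any identifier columns listed in `drop`,
    then keeps the columns whose dtype is numeric.
    """
    features = frame.drop(columns=[target, *drop])
    return features.select_dtypes(include="number")

# Toy rows with the same kinds of columns as the real data; values are made up.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "date": ["1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015"],
    "price": [300000.0, 450000.0, 250000.0, 600000.0],
    "sqft_living": [1200, 2100, 900, 2600],
    "waterfront": ["NO", "NO", "NO", "YES"],
    "condition": [3, 4, 3, 5],
})

X_num = numeric_features(toy)
print(list(X_num.columns))
```

Text columns that carry real signal, such as `waterfront` or `view`, would need to be encoded as numbers before they can join the numeric predictors.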


# Regression Results

# Conclusion