# TEAM NAME ETC

# Overview

Our task is to build an inferential linear regression model that addresses our business problem. We will check the assumptions of linear regression: linearity, independence of errors, normality of residuals, and homoscedasticity. We will also aim for a high R^2 value, which indicates that our predictors explain much of the total variance in house sale prices.
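As a reference for the R^2 goal above, here is a minimal sketch of how R^2 is computed: one minus the ratio of the residual sum of squares to the total sum of squares. The data is synthetic, and the `sqft` feature, the coefficients, and the `r_squared` helper are illustrative assumptions, not part of the King County analysis.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for observed and fitted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): price depends linearly on one feature plus noise
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
price = 50_000 + 150 * sqft + rng.normal(0, 40_000, size=200)

# Fit a one-feature least-squares line with an intercept
A = np.column_stack([np.ones_like(sqft), sqft])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
fitted = A @ coef

r2 = r_squared(price, fitted)
print(f"R^2 on synthetic data: {r2:.3f}")
```

An R^2 close to 1 means the fitted line accounts for most of the variation in the target. It does not confirm that the other regression assumptions hold.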

# Business Understanding

Our stakeholder is Opendoor, a company that makes cash offers for homes and resells them for a profit, sometimes after performing repairs. We will be assisting the company in King County, Washington, whose county seat is Seattle. Our job is to analyze the King County data set and build models that support buy recommendations for our stakeholder. We will report which parameters matter most when assessing home values. Because Opendoor also performs repairs, we will also look at factors that reflect a property's condition and the amount of repair work it is likely to need. Opendoor can use our suggestions to buy properties that are likely to resell.

# Data Understanding

We used the King County House Sales data set, which comes from the King County assessor website.

Summary information for 'Condition' and 'Grade' can be found here: https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r

## *Loading our data*

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.power import TTestIndPower, TTestPower
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('data/kc_house_data.csv')

## *Formatting Columns*

In [5]:
# Map the ordinal condition labels to integers (1 = Poor ... 5 = Very Good).
# Assigning the result back avoids relying on inplace edits of a column selection.
condition_map = {'Poor': 1, 'Fair': 2, 'Average': 3, 'Good': 4, 'Very Good': 5}
df['condition'] = df['condition'].map(condition_map).astype(np.int64)
df['condition'].value_counts()

3    14020
4     5677
5     1701
2      170
1       29
Name: condition, dtype: int64

In [6]:
# Fill missing view ratings with 'NONE' first, so they are mapped to 0 along with
# the existing 'NONE' rows; filling after the mapping would leave 'NONE' strings
# that cannot be cast to integers.
view_map = {'NONE': 0, 'FAIR': 1, 'AVERAGE': 2, 'GOOD': 3, 'EXCELLENT': 4}
df['view'] = df['view'].fillna('NONE').map(view_map).astype(np.int64)
df['view'].value_counts()

0    19485
2      957
3      508
1      330
4      317
Name: view, dtype: int64

## *Dealing with missing values*

In [None]:
df.isna().sum()
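
The missing-value count above can be read alongside the share of rows affected, which helps decide whether to fill or drop. The sketch below uses a small toy frame, not the real data; the prices are made up, and filling `view` with `'NONE'` mirrors the choice made in the formatting step.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are invented for illustration)
demo = pd.DataFrame({
    "price": [450000.0, 620000.0, 310000.0, 805000.0],
    "view": ["NONE", np.nan, "GOOD", np.nan],
})

# Count and share of missing values per column
missing_counts = demo.isna().sum()
missing_share = demo.isna().mean()
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))

# One possible treatment: a missing view rating is read as no view
demo["view"] = demo["view"].fillna("NONE")
```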

# Modeling

## *Filtering columns*

In [None]:
y = df["price"]
X = df.drop("price", axis=1)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
print(f"X_train is a DataFrame with {X_train.shape[0]} rows and {X_train.shape[1]} columns")
print(f"y_train is a Series with {y_train.shape[0]} values")

# Regression Results

# Conclusion