# Housing Prices in King County
Author: Carlos Garza

## Overview

This notebook contains a regression analysis of the cost of King County real estate. Using the CRISP-DM framework, linear regression models, and statistical techniques, I created and refined a model that describes the cost of real estate in King County as a function of a set of independent variables.

My data, methodology, and derived conclusions are detailed in the body of this document.

## Business Problem

To gain an edge in the industry, a Seattle-based real estate company wants to automate its initial appraisal process. An algorithm that accurately appraises the value of a house without a physical inspection of the property can be an invaluable advantage in the fast-paced real estate market of a rapidly expanding city.

## Data

The data utilized in this model describes houses sold in 2014 and 2015.

The data is summarized below.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

In [2]:
kc_house_data = pd.read_csv('data/kc_house_data.csv')
kc_house_data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,0.0,...,7,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,0.0,0.0,...,7,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,0.0,0.0,...,6,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,0.0,0.0,...,7,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,0.0,0.0,...,8,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503


In [3]:
# Data Summary
kc_house_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  float64
 9   view           21534 non-null  float64
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

In [4]:
# Exploration of Values
for column in kc_house_data.columns:
    print(column, '\n')
    print(kc_house_data[column].value_counts())
    print('________')

id 

795000620     3
1825069031    2
2019200220    2
7129304540    2
1781500435    2
             ..
7812801125    1
4364700875    1
3021059276    1
880000205     1
1777500160    1
Name: id, Length: 21420, dtype: int64
________
date 

6/23/2014    142
6/25/2014    131
6/26/2014    131
7/8/2014     127
4/27/2015    126
            ... 
2/15/2015      1
5/24/2015      1
3/8/2015       1
5/27/2015      1
7/27/2014      1
Name: date, Length: 372, dtype: int64
________
price 

350000.0    172
450000.0    172
550000.0    159
500000.0    152
425000.0    150
           ... 
870515.0      1
336950.0      1
386100.0      1
176250.0      1
884744.0      1
Name: price, Length: 3622, dtype: int64
________
bedrooms 

3     9824
4     6882
2     2760
5     1601
6      272
1      196
7       38
8       13
9        6
10       3
11       1
33       1
Name: bedrooms, dtype: int64
________
bathrooms 

2.50    5377
1.00    3851
1.75    3048
2.25    2047
2.00    1930
1.50    1445
2.75    1185
3.00     753
3

## Data Preprocessing

To begin, columns we are not interested in will be dropped. We are not concerned with the date the house was sold or whether the house has been viewed.

Additionally, because all of our data spans only two years, we will count multiple sales of the same house independently, and can therefore drop the house ID.
The rationale is that a house that is repeatedly on the market may hint at an underlying issue that may or may not be described numerically by our data, so we will leave each sale in our analysis to obtain the broadest picture.

In [5]:
kc_house_data.drop(columns = ['id', 'date', 'view'], inplace = True)
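
To see how many houses were sold more than once, the `id` column can be inspected before it is dropped. The following is a minimal sketch on a made-up toy frame; `toy_sales` and all of its values are assumptions for illustration, not the King County data.

```python
import pandas as pd

# Toy frame standing in for kc_house_data; the ids and prices are made up.
toy_sales = pd.DataFrame({
    'id': [101, 102, 101, 103, 101, 102],
    'price': [300000.0, 450000.0, 310000.0, 520000.0, 325000.0, 455000.0],
})

# Number of sales recorded for each house id.
sales_per_house = toy_sales['id'].value_counts()

# Houses that appear more than once, i.e. were sold repeatedly.
repeat_sales = sales_per_house[sales_per_house > 1]
print(repeat_sales)

# Rows that are a second or later sale of a house already seen.
n_repeat_rows = toy_sales['id'].duplicated().sum()
print(n_repeat_rows)
```

Applied to the real data, the same two calls would show how many rows are affected by the decision to keep every sale.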

Observing the data types and value counts in our data exploration above, it can be seen that basement square footage is stored as a string. Most values can be expressed as floats, so we will change the data type so that the column is treated as a continuous variable. Unknown values (stored as '?') will be changed to 0.0.

In [8]:
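# Replace the '?' placeholder with '0.0', then cast the column to float.
# Assigning the result back avoids in-place modification of a column selection.
kc_house_data['sqft_basement'] = kc_house_data['sqft_basement'].replace('?', '0.0')
kc_house_data['sqft_basement'] = kc_house_data['sqft_basement'].astype('float')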
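
An alternative way to do this conversion is `pd.to_numeric` with `errors='coerce'`, which turns every unparseable entry into `NaN` so the unknowns can be counted before they are filled. This is a sketch on a made-up series (`toy_basement` is an assumption for illustration), not the code this notebook ran.

```python
import pandas as pd

# Toy column standing in for kc_house_data['sqft_basement']; values are made up.
toy_basement = pd.Series(['0.0', '400.0', '?', '910.0'], name='sqft_basement')

# Anything that cannot be parsed as a number (here '?') becomes NaN.
basement_numeric = pd.to_numeric(toy_basement, errors='coerce')

# Count the unknown entries before overwriting them.
n_unknown = basement_numeric.isna().sum()
print(n_unknown)

# Same choice as in the text: treat unknown basement sizes as 0.0.
basement_filled = basement_numeric.fillna(0.0)
print(basement_filled.tolist())
```

Counting the `NaN` values first makes it visible how many rows are affected by the decision to treat unknown basements as 0.0.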

Now that all of our data is in a usable format, we can create our first baseline model.

## Baseline Model
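
The baseline cell itself is not shown in this section. As a rough illustration of what an ordinary least squares baseline looks like, here is a minimal sketch on made-up, noise-free data. The frame `toy`, the chosen predictors, and the use of `np.linalg.lstsq` are all assumptions for illustration, not the model fitted in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up). The price is built as an exact linear function of the
# predictors so that the fit should recover the coefficients used here.
toy = pd.DataFrame({
    'sqft_living': [1000, 1500, 2000, 2500, 3000],
    'bedrooms': [2, 3, 3, 4, 5],
})
toy['price'] = 50000 + 200 * toy['sqft_living'] + 10000 * toy['bedrooms']

# Design matrix with a leading column of ones for the intercept.
predictors = toy[['sqft_living', 'bedrooms']].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(toy)), predictors])
y = toy['price'].to_numpy(dtype=float)

# Ordinary least squares: minimise the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of the variance in price explained by the fit.
pred = X @ coef
ss_res = ((y - pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(coef)
print(r_squared)
```

On real data the fit would not be exact, and rows with missing values (for example in `waterfront` or `yr_renovated`) would need to be handled before building the design matrix.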

## Model Tuning/Reiteration

## Insights

## Conclusions

## Future Work