# King County Home Improvements
![Hero Lake Washington, King County](images/hero-lake-washington-xlg.jpg)
<br>

**Author**: Carl Schneck <br>
**Program**: Data Science Flex <br>
**Phase 2 Project**

---

## Contents
See phase 1 project for template

---

## Overview

This project analyzes King County housing sales data to help a wholesale real estate investor decide which home improvements are associated with the largest increases in sale price. King County is the most populous county in Washington State and the 13th most populous in the country.

---

## Business Understanding

A wholesale real estate investor wants a better idea of which improvements are associated with the biggest increase in sale price. With this information they can better assess which projects are more worthwhile and lead to the largest profit margin. Through a linear regression analysis we can estimate which features have the largest effect on sale price by looking at our model's coefficients.
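
As a rough illustration of reading effects off regression coefficients, the sketch below fits ordinary least squares with NumPy on synthetic data. The data, the "true" dollar values, and the variable names are all made up for the example; it is not the model used in this project.

```python
import numpy as np

# Synthetic, made-up data purely for illustration (not the King County data).
rng = np.random.default_rng(0)
n = 500
bathrooms = rng.integers(1, 5, size=n).astype(float)
sqft_living = rng.uniform(500, 4000, size=n)

# Assumed relationship used to generate the prices, plus random noise.
price = (50_000 + 20_000 * bathrooms + 150 * sqft_living
         + rng.normal(0, 10_000, size=n))

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([np.ones(n), bathrooms, sqft_living])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, beta_bath, beta_sqft = coef

# Each coefficient is the estimated price change for a one-unit increase
# in that feature, holding the other feature fixed.
print(f"one more bathroom:    about ${beta_bath:,.0f}")
print(f"one more square foot: about ${beta_sqft:,.0f}")
```

Because each coefficient is expressed per unit of its own feature, a bathroom coefficient and a square-foot coefficient are not directly comparable in size unless the features are put on a common scale first.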

---

## Data Understanding

Our dataset includes information on houses sold in 2014 and 2015 across 70 zip codes. There are a total of 21,597 entries with 21 columns of information. For this analysis we will cut the data down to features we believe can be changed. Features dealing with location, view, or neighbors are in most cases impossible to change, so they will be omitted.

Where did it come from?
How does it relate to the data analysis question?
What does the data represent, who is in the sample and what variables are included.
What is the target variable?
What are the properties of the variables you intend to use?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

%matplotlib inline

In [4]:
# Load King County dataset and preview first few entries
kc_df = pd.read_csv('data/kc_house_data.csv')
kc_df.head(3)

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | NONE | ... | 7 Average | 1180 | 0.0 | 1955 | 0.0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | NO | NONE | ... | 7 Average | 2170 | 400.0 | 1951 | 1991.0 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | NO | NONE | ... | 6 Low Average | 770 | 0.0 | 1933 | NaN | 98028 | 47.7379 | -122.233 | 2720 | 8062 |


In [5]:
# Taking a look at information of the dataset
kc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

The features that appear possible to work on include the counts of bathrooms, bedrooms, and floors, along with the square footage of the living space and of the space above the basement. The features `condition` and `grade`, which describe the maintenance condition and the quality of construction and design, can also be worked on.

---

## Data Preparation

To prepare the data we must first drop all unnecessary columns.

In [8]:
import data_preparation as dp
from sklearn.model_selection import train_test_split

In [9]:
# Dropping unnecessary columns from original dataset
unn_columns = ['id', 'date', 'lat', 'long', 'zipcode', 'waterfront',
               'view', 'yr_built', 'yr_renovated', 'sqft_living15',
               'sqft_lot15', 'sqft_lot', 'sqft_basement']

# Drop the columns and keep an independent copy
kc_df_iprep = kc_df.drop(columns=unn_columns).copy()
kc_df_iprep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        21597 non-null  float64
 1   bedrooms     21597 non-null  int64  
 2   bathrooms    21597 non-null  float64
 3   sqft_living  21597 non-null  int64  
 4   floors       21597 non-null  float64
 5   condition    21597 non-null  object 
 6   grade        21597 non-null  object 
 7   sqft_above   21597 non-null  int64  
dtypes: float64(3), int64(3), object(2)
memory usage: 1.3+ MB


Next, the values in the object columns are converted to numeric types, and the target variable `price` is separated from the independent variables. Lastly, the data is split into a training set and a test set so we can check whether the model generalizes to unseen data.

In [10]:
kc_df_iprep, condition_map, grade_map = dp.initial_prep(kc_df_iprep)

# Target variable
y = kc_df_iprep.price.copy()

# Independent variables
X = kc_df_iprep.drop(columns='price').copy()

# Split the rows into train and test sets at a 4:1 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
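
The `dp.initial_prep` helper lives in the project's `data_preparation` module and is not shown here. The sketch below is a hypothetical stand-in, not the actual implementation: it shows one way the two object columns could be turned into integers. The function name `initial_prep_sketch`, the `CONDITION_MAP` labels, and the demo rows are assumptions made for illustration; only the `"7 Average"` style of grade values comes from the preview above.

```python
import pandas as pd

# Hypothetical ordinal mapping. The labels are an assumption about the
# values stored in `condition`, not taken from data_preparation.py.
CONDITION_MAP = {"Poor": 1, "Fair": 2, "Average": 3, "Good": 4, "Very Good": 5}


def initial_prep_sketch(df):
    """Return a copy with `condition` and `grade` as integers, plus the maps used."""
    out = df.copy()
    out["condition"] = out["condition"].map(CONDITION_MAP)

    # Grades look like "7 Average": split the leading number from its label.
    parts = out["grade"].str.split(" ", n=1, expand=True)
    grade_num = parts[0].astype(int)
    grade_map = dict(zip(parts[1], grade_num))
    out["grade"] = grade_num
    return out, CONDITION_MAP, grade_map


# Tiny made-up frame; the condition values are invented for the demo.
demo = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "condition": ["Average", "Average", "Good"],
    "grade": ["7 Average", "7 Average", "6 Low Average"],
})
prepped, condition_map, grade_map = initial_prep_sketch(demo)
print(prepped.dtypes)
```

Encoding both columns as ordered integers treats each step up in condition or grade as the same size, which is a modeling choice; one-hot encoding would be an alternative if that assumption seems too strong.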

---

## Modeling

Need to find best model.

---

## Evaluation

---

## Conclusion

---

## Appendix
Extra Figures??

---

## Resources

Any Works Cited

---