In [8]:
# Standard library
import pickle

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns

# Modeling
from sklearn import linear_model, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Multicollinearity diagnostics
from statsmodels.stats.outliers_influence import variance_inflation_factor

%matplotlib inline

In [9]:
# Read in Data:
df = pd.read_csv('data/kc_house_data.csv')

# Business Understanding

### Define the median American home:
- The Atlantic: "According to the real-estate firms Zillow and Redfin, the median size of an American single-family home is in the neighborhood of 1,600 or 1,650 square feet."
- Trulia: the average home has 4 bedrooms.
- Business problem: help the average homeowner pinpoint which types of renovations work best for 1,600-1,650 square foot, 4-bedroom homes.
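
The definition above amounts to a simple row filter on living area and bedroom count. A minimal sketch, assuming the `sqft_living` and `bedrooms` columns of the King County data; the sample rows and the `select_median_homes` helper are hypothetical and only illustrate the filter:

```python
import pandas as pd

# Hypothetical sample rows; in the notebook this would be the King County DataFrame `df`.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "bedrooms": [4, 3, 4, 4],
    "sqft_living": [1600, 1620, 1650, 2400],
})

def select_median_homes(frame, min_sqft=1600, max_sqft=1650, bedrooms=4):
    """Return the rows that match the 'median American home' definition."""
    # Series.between is inclusive on both ends by default.
    in_size_range = frame["sqft_living"].between(min_sqft, max_sqft)
    has_bedrooms = frame["bedrooms"] == bedrooms
    return frame[in_size_range & has_bedrooms]

median_homes = select_median_homes(sample)
print(median_homes)
```

Keeping the bounds as parameters makes it easy to widen the window if the strict 1,600-1,650 range leaves too few rows to model.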

### King County, Washington Residential Glossary of Terms:
- https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r#s


### King County's Grading System for Buildings:
#### Represents the construction quality of improvements. Grades run from grade 1 to 13. Generally defined as:

1. Falls short of minimum building standards. Normally cabin or inferior structure.

2. Falls short of minimum building standards. Normally cabin or inferior structure.

3. Falls short of minimum building standards. Normally cabin or inferior structure.

4. Generally older, low quality construction. Does not meet code.

5. Low construction costs and workmanship. Small, simple design.

6. Lowest grade currently meeting building code. Low quality materials and simple designs.

7. Average grade of construction and design. Commonly seen in plats and older sub-divisions.

8. Just above average in construction and design. Usually better materials in both the exterior and interior finish work.

9. Better architectural design with extra interior and exterior design and quality.

10. Homes of this quality generally have high quality features. Finish work is better and more design quality is seen in the floor plans. Generally have a larger square footage.

11. Custom design and higher quality finish work with added amenities of solid woods, bathroom fixtures and more luxurious options.

12. Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.

13. Generally custom designed and built. Mansion level. Large amount of highest quality cabinet work, wood trim, marble, entry ways etc.

### King County's Building Condition Codes:
#### Relative to age and grade. Coded 1-5.

1. Poor- Worn out. Repair and overhaul needed on painted surfaces, roofing, plumbing, heating and numerous functional inadequacies. Excessive deferred maintenance and abuse, limited value-in-use, approaching abandonment or major reconstruction; reuse or change in occupancy is imminent. Effective age is near the end of the scale regardless of the actual chronological age.

2. Fair- Badly worn. Much repair needed. Many items need refinishing or overhauling, deferred maintenance obvious, inadequate building utility and systems all shortening the life expectancy and increasing the effective age.

3. Average- Some evidence of deferred maintenance and normal obsolescence with age in that a few minor repairs are needed, along with some refinishing. All major components still functional and contributing toward an extended life expectancy. Effective age and utility is standard for like properties of its class and usage.

4. Good- No obvious maintenance required but neither is everything new. Appearance and utility are above the standard and the overall effective age will be lower than the typical property.

5. Very Good- All items well maintained, many having been overhauled and repaired as they have shown signs of wear, increasing the life expectancy and lowering the effective age with little deterioration or obsolescence evident with a high degree of utility.

### Business Case:
- The goal is to suggest home improvements to owners of average single-family dwellings.
    - Definition of an average single-family home: 1,600-1,650 square feet of living space and 4 bedrooms, per the sources cited above.
    - Which types of renovations are worth the investment?
        - For example, would the cost of adding a bathroom be justified by the increase in selling price?

# Data Preparation

### Duplicate ID

In [6]:
id_counts = df.id.value_counts()
id_counts[id_counts > 1]  # ids that appear in more than one row

795000620     3
1825069031    2
2019200220    2
7129304540    2
1781500435    2
             ..
7893805650    2
8161020060    2
1432400120    2
7701960990    2
1788900230    2
Name: id, Length: 176, dtype: int64

In [15]:
# keep=False flags every row of a repeated id, not just the later copies
id_dupe_bools = df['id'].duplicated(keep=False)
df_dupe_id = df[id_dupe_bools]

In [17]:
df_dupe_id[['id', 'date', 'condition', 'grade', 'bathrooms', 'bedrooms', 'sqft_living']]

| | id | date | condition | grade | bathrooms | bedrooms | sqft_living |
|---|---|---|---|---|---|---|---|
| 93 | 6021501535 | 7/25/2014 | 3 | 8 | 1.50 | 3 | 1580 |
| 94 | 6021501535 | 12/23/2014 | 3 | 8 | 1.50 | 3 | 1580 |
| 313 | 4139480200 | 6/18/2014 | 3 | 11 | 3.25 | 4 | 4290 |
| 314 | 4139480200 | 12/9/2014 | 3 | 11 | 3.25 | 4 | 4290 |
| 324 | 7520000520 | 9/5/2014 | 3 | 6 | 1.00 | 2 | 1240 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 20654 | 8564860270 | 3/30/2015 | 3 | 8 | 2.50 | 4 | 2680 |
| 20763 | 6300000226 | 6/26/2014 | 3 | 7 | 1.00 | 4 | 1200 |
| 20764 | 6300000226 | 5/4/2015 | 3 | 7 | 1.00 | 4 | 1200 |
| 21564 | 7853420110 | 10/3/2014 | 3 | 9 | 3.00 | 3 | 2780 |
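
In the rows shown, each repeated `id` has identical attributes but a different `date`, which suggests the same house was sold more than once. One possible way to handle this, sketched below, is to keep only the most recent sale per `id`. This is an assumption for illustration, not necessarily what the notebook does next, and the sample rows (ids and prices) are made up:

```python
import pandas as pd

# Hypothetical sample with repeated ids; values are illustrative, not real records.
sales = pd.DataFrame({
    "id": [101, 101, 102, 103, 103],
    "date": ["7/25/2014", "12/23/2014", "9/5/2014", "12/9/2014", "6/18/2014"],
    "price": [430000.0, 700000.0, 232000.0, 1400000.0, 1380000.0],
})

# Parse the month/day/year strings so that sorting is chronological, not lexical.
sales["date"] = pd.to_datetime(sales["date"], format="%m/%d/%Y")

# Sort by sale date, keep the last (latest) row per id, then restore row order.
latest_sales = (
    sales.sort_values("date")
         .drop_duplicates(subset="id", keep="last")
         .sort_index()
)
print(latest_sales)
```

Keeping the latest sale reflects the most recent market price. Keeping both sales is also defensible if the price change between them is itself of interest.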


In [4]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

## PostgreSQL Data Cleaning

### Bring data into PostgreSQL:
> CREATE TABLE original (id text, date text, price text, bedrooms text, bathrooms text, sqft_living text, sqft_lot text, floors text, waterfront text, view text, condition text, grade text, sqft_above text, sqft_basement text, yr_built text, yr_renovated text, zipcode text, lat text, long text, sqft_living15 text, sqft_lot15 text);

>  \copy original FROM 'ph2finproj/dsc-phase-2-project-main/data/kc_house_data.csv' WITH DELIMITER ',' CSV HEADER;

### Clean up 'sqft_basement', which included '?' characters along with numeric data

> UPDATE original SET sqft_basement = replace(sqft_basement, '?', '0.0');
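
For comparison, the same cleanup can be sketched in pandas. This is an illustrative equivalent of the `UPDATE` above, not part of the PostgreSQL workflow; the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical text column mixing numeric strings with the '?' placeholder.
raw = pd.DataFrame({"sqft_basement": ["0.0", "?", "400.0", "?", "910.0"]})

# Mirror the SQL replace(): swap the literal '?' for '0.0'.
# regex=False makes '?' a plain character instead of a regex quantifier.
as_text = raw["sqft_basement"].str.replace("?", "0.0", regex=False)

# With the placeholder gone, the text can be cast to a numeric type.
raw["sqft_basement"] = as_text.astype(float)
print(raw["sqft_basement"].tolist())
```

Either way, replacing `?` with `0.0` treats an unknown basement area as no basement, which is a modeling assumption worth keeping in mind.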

### Convert columns to appropriate data types: