<h1> Predicting Single Family Home Prices </h1>
<h2> 1. Planning and Process Map </h2>
<br>

<h3>A. Timeline:</h3>
<br>
    - Present Thursday, 9 June
    <br>
    - division of time: 
    <br>
        - Monday - wrangle, explore and test statistically
        <br>
        - Tuesday - model, produce the final notebook and README, and refine all helper functions and notation
        <br>
        - Wednesday - finalize notebook and supporting materials, write script, time presentation
        <br>

<h3> B. Business Deliverables and Questions to Answer: </h3>
<br>
<br>
- Construct an ML regression model that predicts property tax assessed values ('taxvaluedollarcnt') of single family properties using attributes of the properties.
<br>
<br>
- Find the key drivers of property value for single family properties. Some questions that come to mind are: Why do some properties have a much higher value than others when they are located so close to each other? Why are some properties valued so differently from others when they have nearly the same physical attributes but only differ in location? Is having 1 bathroom worse than having 2 bedrooms?
<b> Frame home value in terms of both attributes and location </b>
<br>
<br>
- Deliver a report that the data science team can read through and replicate, and that makes clear what steps were taken, why, and what the outcome was.
<br>
<br>
- Make recommendations on what works or doesn't work in predicting these homes' values.

<h3> C. DS pipeline and Linear Regression specific tasks: </h3>
<br>
    a. Acquire/Wrangle Data
    <br>
        i. properties_2017, predictions_2017, and propertylandusetype tables, pulled with a SQL query from the Zillow data set
        <br>
        ii. Remove nulls, outliers, and unneeded columns
    <br>
        iii. Run distributions to view data at 30k feet (get a sense for dataset)
    <br>
        iv. split data into train, validate, test
    <br>
    b. Explore Data:
        <br>
        i. hypothesize/pose questions, visually explore relationships with bivariate stats, produce subsets of interest, answer hypotheses visually, statistically, and in English
        <br>
        ii. produce synthetic columns if needed (BR/BD count, groupby's)
        <br>
   c. Modeling:
       <br>
       i. create a scaled copy of the data frame (don't scale target variable)
       <br>
       ii. create a data frame with predictors, target, baseline prediction, and error for baseline, Zillow's 2017 predictions, and Zillow error.
       <br>
       iii. produce at least 4 models (more if time allows), and add to the DataFrame, as well as prediction error for each model.
           - a. errors will be reported as RMSE (in dollars)
       <br>
       iv. compare each model's performance to the baseline and to the Zillow predictive model's performance.
    <br>
   d. Report
   <br>
       i. Produce a clean notebook and README, with HTML and visuals
       <br>
       ii. Docstrings on all functions
       <br>
       iii. make script/outline, time presentation
       <br>
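
The modeling steps in C.c (baseline, candidate models, RMSE comparison) could be sketched roughly as follows. This is a minimal illustration, not the project's modeling code: it assumes a mean baseline and a single ordinary least squares model, the toy rows are made up, and the `rmse` helper is a hypothetical name. The column names `sq_ft` and `tax_assessed_price` follow the prepared data frame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def rmse(actual, predicted):
    """Root mean squared error, in the units of the target (dollars)."""
    return float(np.sqrt(mean_squared_error(actual, predicted)))


# Toy stand-in for the train split (made-up values, for illustration only).
train = pd.DataFrame({
    "sq_ft": [1000.0, 1500.0, 2000.0, 2500.0, 3000.0],
    "tax_assessed_price": [210_000.0, 290_000.0, 420_000.0, 480_000.0, 610_000.0],
})

# One row per home: the actual value plus each model's prediction.
results = pd.DataFrame({"actual": train["tax_assessed_price"]})

# Baseline: predict the mean of the target for every home.
results["baseline"] = train["tax_assessed_price"].mean()

# First candidate model: ordinary least squares on a single predictor.
ols = LinearRegression().fit(train[["sq_ft"]], train["tax_assessed_price"])
results["ols"] = ols.predict(train[["sq_ft"]])

# RMSE per column, so every candidate is compared to the baseline in dollars.
scores = {col: rmse(results["actual"], results[col]) for col in ["baseline", "ols"]}
print(scores)
```

Keeping every prediction as a column of one data frame makes it easy to add further models, or Zillow's own predictions, and score them all with the same helper.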

<h4> Note: Data to Use: </h4>
<br>
- properties_2017, predictions_2017, propertylandusetype tables
<br>
<h4> Note: Data to Exclude: </h4>
<br>
- taxvaluedollarcnt (the target, so never a predictor), landtaxvaluedollarcnt, structuretaxvaluedollarcnt, taxamount

<h2> 2. Acquisition </h2>

- a. SQL query to acquire and prep data from the Zillow tables properties_2017, predictions_2017, and propertylandusetype
- b. target variable: taxvaluedollarcnt (price) for single family homes
- c. columns of interest:
    - properties_2017: calculatedfinishedsqft AS home_sq_ft, garagecarcnt, fips AS county, lotsizesquarefeet AS lot_size, roomcnt AS room_count, taxvaluedollarcnt AS tax_assessed_price
    - predictions_2017 (77614): parcelid, logerror
    - propertylandusetype: id and desc
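
The body of `wrangle.get_zillow_data()` is not shown in this notebook. A common pattern, and only an assumption here, is to cache the SQL result in a local CSV so the query runs once. The sketch below uses a hypothetical `get_cached` helper and a stand-in `fake_fetch` function in place of a real database call.

```python
import os
import tempfile

import pandas as pd


def get_cached(path, fetch):
    """Return the DataFrame stored at `path`, calling `fetch()` only on a cache miss."""
    if os.path.isfile(path):
        return pd.read_csv(path)
    df = fetch()
    # index=False keeps the row index out of the file,
    # so no 'Unnamed: 0' column appears when the CSV is read back.
    df.to_csv(path, index=False)
    return df


def fake_fetch():
    """Stand-in for a real query such as pd.read_sql(query, connection_url)."""
    return pd.DataFrame({"parcelid": [1, 2], "tax_assessed_price": [100000.0, 250000.0]})


cache_path = os.path.join(tempfile.mkdtemp(), "zillow.csv")
first = get_cached(cache_path, fake_fetch)   # cache miss: calls fake_fetch, writes the CSV
second = get_cached(cache_path, fake_fetch)  # cache hit: reads the CSV
```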

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
import env
import os
import sklearn.preprocessing
from sklearn.model_selection import train_test_split

In [2]:
import wrangle

In [3]:
df = wrangle.get_zillow_data()

In [4]:
df.head()

Unnamed: 0,parcelid,bedroomcnt,bathroomcnt,sq_ft,lot_size,car_garage,room_count,tax_assessed_price,yearbuilt,taxamount,county,logerror
0,14297519,4.0,3.5,3100.0,4506.0,2.0,0.0,1023282.0,1998.0,11013.72,6059.0,0.025595
1,17052889,2.0,1.0,1465.0,12647.0,1.0,5.0,464000.0,1967.0,5672.48,6111.0,0.055619
2,14186244,3.0,2.0,1243.0,8432.0,2.0,6.0,564778.0,1962.0,6488.3,6059.0,0.005383
3,12177905,4.0,3.0,2376.0,13038.0,,0.0,145143.0,1970.0,1777.51,6037.0,-0.10341
4,12095076,4.0,3.0,2962.0,63000.0,,0.0,773303.0,1950.0,9516.26,6037.0,-0.001011


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52442 entries, 0 to 52441
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   parcelid            52442 non-null  int64  
 1   bedroomcnt          52442 non-null  float64
 2   bathroomcnt         52442 non-null  float64
 3   sq_ft               52360 non-null  float64
 4   lot_size            52073 non-null  float64
 5   car_garage          18015 non-null  float64
 6   room_count          52442 non-null  float64
 7   tax_assessed_price  52441 non-null  float64
 8   yearbuilt           52326 non-null  float64
 9   taxamount           52438 non-null  float64
 10  county              52442 non-null  float64
 11  logerror            52442 non-null  float64
dtypes: float64(11), int64(1)
memory usage: 4.8 MB


In [6]:
# parcelid and county are converted to objects inside the prep_zillow function
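
One way the conversion described in the comment above might look. `wrangle.prep_zillow` itself is not shown, so this is a sketch under assumptions: the helper name `prep_sketch` is hypothetical, the `los_angeles` label is assumed (the outputs in this notebook only show `orange_county` and `ventura`), and dropping every row that contains a null is just one possible cleaning choice.

```python
import pandas as pd

# FIPS codes for Los Angeles, Orange, and Ventura counties; label spellings are assumptions.
FIPS_TO_COUNTY = {6037: "los_angeles", 6059: "orange_county", 6111: "ventura"}


def prep_sketch(df):
    """Drop rows with nulls, then store parcelid and county as labels rather than numbers."""
    out = df.dropna().copy()
    out["parcelid"] = out["parcelid"].astype(str)
    out["county"] = out["county"].astype(int).map(FIPS_TO_COUNTY)
    return out


# First four rows of the df.head() output above, limited to three columns.
sample = pd.DataFrame({
    "parcelid": [14297519, 17052889, 14186244, 12177905],
    "car_garage": [2.0, 1.0, 2.0, None],
    "county": [6059.0, 6111.0, 6059.0, 6037.0],
})
clean = prep_sketch(sample)
print(clean)
```

Treating `parcelid` and `county` as labels keeps them out of numeric summaries and scaling, where their magnitudes carry no meaning.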

In [7]:
# Run the data through the prep function to scrub nulls and whitespace
df = wrangle.prep_zillow(df)
df.head()

(17929, 12)
(17042, 12)


Unnamed: 0,parcelid,bedroomcnt,bathroomcnt,sq_ft,lot_size,car_garage,room_count,tax_assessed_price,yearbuilt,taxamount,county,logerror
0,14297519,4.0,3.5,3100.0,4506.0,2.0,0.0,1023282.0,1998.0,11013.72,orange_county,0.025595
1,17052889,2.0,1.0,1465.0,12647.0,1.0,5.0,464000.0,1967.0,5672.48,ventura,0.055619
2,14186244,3.0,2.0,1243.0,8432.0,2.0,6.0,564778.0,1962.0,6488.3,orange_county,0.005383
8,13944538,3.0,2.5,1340.0,1199.0,2.0,6.0,319668.0,1980.0,4078.08,orange_county,0.045602
9,17110996,3.0,2.5,1371.0,3445.0,2.0,5.0,198054.0,2004.0,2204.84,ventura,0.008669
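
The next step in the plan is the train, validate, test split. A minimal sketch, assuming two successive calls to `train_test_split`; the proportions, the `seed` default, and the helper name `split_data` are illustrative assumptions, not values taken from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, seed=123):
    """Split df into train, validate, and test sets (about 56% / 24% / 20%)."""
    # Hold out 20% as the final test set first.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Then carve 30% of the remainder out as the validation set.
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    return train, validate, test


# Toy frame standing in for the prepared Zillow data.
demo = pd.DataFrame({"sq_ft": range(100), "tax_assessed_price": range(100)})
train, validate, test = split_data(demo)
print(train.shape, validate.shape, test.shape)
```

Splitting before any exploration or scaling keeps the validate and test rows from influencing decisions made on the training data.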
