## Estimate Home Values in Zillow

#### Goals:

Your customer is the Zillow data science team. State your goals as if you were delivering this to Zillow: they have asked you for something, and you are communicating, concisely and clearly, the goals as you understand them and as you have acted on them in your research.

Add outline of project here

#### Import
Import the libraries needed to create a baseline model and then predict home values. Add libraries as needed.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.linear_model import LinearRegression
from math import sqrt
import warnings
warnings.filterwarnings("ignore")

import env
import wrangle
import split_scale
import features

In this step, I loaded the cleaned up Zillow data from my wrangle.py file and explored it for a better understanding of what I am working with.

In [2]:
df = wrangle.wrangle_zillow()
df.shape

(14892, 10)

Summary Statistics of data

In [3]:
df.describe()

| | bedroom | bathroom | lot_size | square_feet | tax_amount | home_value | property_id | fips |
|---|---|---|---|---|---|---|---|---|
| count | 14892.0 | 14892.0 | 14892.0 | 14892.0 | 14892.0 | 14892.0 | 14892.0 | 14892.0 |
| mean | 3.316143 | 2.324302 | 10531.1 | 1938.497045 | 6564.370992 | 539646.0 | 261.0 | 6049.448429 |
| std | 0.926831 | 1.014254 | 29379.57 | 992.488089 | 8428.010576 | 729073.5 | 0.0 | 21.272489 |
| min | 1.0 | 1.0 | 594.0 | 300.0 | 51.26 | 10504.0 | 261.0 | 6037.0 |
| 25% | 3.0 | 2.0 | 5594.0 | 1276.0 | 2710.6225 | 199187.0 | 261.0 | 6037.0 |
| 50% | 3.0 | 2.0 | 6869.0 | 1678.0 | 4762.04 | 384778.5 | 261.0 | 6037.0 |
| 75% | 4.0 | 3.0 | 8865.25 | 2342.25 | 7637.5575 | 643557.2 | 261.0 | 6059.0 |
| max | 11.0 | 11.0 | 1323788.0 | 15450.0 | 276797.83 | 23858370.0 | 261.0 | 6111.0 |


Summary of data frame

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14892 entries, 0 to 14892
Data columns (total 10 columns):
bedroom             14892 non-null int64
bathroom            14892 non-null float64
lot_size            14892 non-null int64
square_feet         14892 non-null int64
tax_amount          14892 non-null float64
home_value          14892 non-null int64
property_id         14892 non-null int64
property_type       14892 non-null object
fips                14892 non-null int64
transaction_date    14892 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.2+ MB


#### Splitting the Data
I will use a split function I created to split the data into two pieces, train and test. I will then split each piece further into features ('x') and target ('y'). The training data is 80% of the rows and the remaining 20% is the test data. I set the random seed to 123.

In [5]:
train, test = split_scale.split_my_data(df, train_pct=.80, random_state=123)
train.head(), test.head()

(       bedroom  bathroom  lot_size  square_feet  tax_amount  home_value  \
 7040         2       3.0      9376         2353    13560.29     1121784   
 4866         3       1.0      5061         1085     2581.75      194062   
 2280         3       2.0      7588         1873     4301.71      339756   
 12524        4       3.0      7969         2082     4443.44      376676   
 10726        4       3.5      8000         4364     8876.72      875357   
 
        property_id              property_type  fips transaction_date  
 7040           261  Single Family Residential  6037       2017-06-19  
 4866           261  Single Family Residential  6037       2017-06-02  
 2280           261  Single Family Residential  6037       2017-05-16  
 12524          261  Single Family Residential  6059       2017-06-16  
 10726          261  Single Family Residential  6059       2017-05-19  ,
        bedroom  bathroom  lot_size  square_feet  tax_amount  home_value  \
 13487        5       3.0      60
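
The body of `split_scale.split_my_data` is not shown in this notebook. As a rough sketch only, a helper with the same call signature could be a thin wrapper around scikit-learn's `train_test_split`. The function name `split_my_data_sketch` and the synthetic `toy` frame below are assumptions for illustration, not the contents of `split_scale.py`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_my_data_sketch(df, train_pct=0.80, random_state=123):
    """Hypothetical stand-in for split_scale.split_my_data."""
    # train_size sets the training fraction; the remainder becomes the test set
    train, test = train_test_split(
        df, train_size=train_pct, random_state=random_state
    )
    return train, test


# Small synthetic frame (made-up values), used only to show the proportions.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "square_feet": rng.integers(300, 5000, size=100),
    "home_value": rng.integers(50_000, 900_000, size=100),
})

toy_train, toy_test = split_my_data_sketch(toy)
print(toy_train.shape, toy_test.shape)
```

Fixing `random_state` makes the split reproducible, so rerunning the notebook yields the same train and test rows.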

To establish a baseline, I chose three features from my data: the bedroom count, the bathroom count, and the total square footage of the house. Home value is my target variable, since that is what I am looking to predict.

In [6]:
x_train = train[['bedroom', 'bathroom', 'square_feet']]
y_train = train[['home_value']]
# the hold-out features and target must come from the test split, not train
x_test = test[['bedroom', 'bathroom', 'square_feet']]
y_test = test[['home_value']]
print(x_train.head())
print(y_train.head())

       bedroom  bathroom  square_feet
7040         2       3.0         2353
4866         3       1.0         1085
2280         3       2.0         1873
12524        4       3.0         2082
10726        4       3.5         4364
       home_value
7040      1121784
4866       194062
2280       339756
12524      376676
10726      875357


I will create a new data frame, 'predictions', to establish a baseline for my model by taking the mean of my target.

In [7]:
# baseline model: predict the mean training home value for every row
predictions = pd.DataFrame({'actual': y_train.home_value}).reset_index(drop=True)
predictions['baseline'] = y_train.home_value.mean()
predictions.head()

| | actual | baseline |
|---|---|---|
| 0 | 1121784 | 544122.62948 |
| 1 | 194062 | 544122.62948 |
| 2 | 339756 | 544122.62948 |
| 3 | 376676 | 544122.62948 |
| 4 | 875357 | 544122.62948 |
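
A mean baseline can also be expressed with scikit-learn's `DummyRegressor(strategy="mean")`, which keeps the baseline behind the same `fit`/`predict` interface as the real models. The sketch below is an alternative, not what this notebook does. It uses only the five home values from the training sample shown above, so its mean differs from the full-data baseline of 544122.63, and the `toy_*` names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor

# Toy target: the five home values shown in the training sample above.
toy_y = pd.DataFrame({"home_value": [1121784, 194062, 339756, 376676, 875357]})
# DummyRegressor ignores the features, so a placeholder column is enough.
toy_x = np.zeros((len(toy_y), 1))

dummy = DummyRegressor(strategy="mean")
dummy.fit(toy_x, toy_y.home_value)

# Every baseline prediction is the mean of the target seen during fit.
toy_baseline = pd.DataFrame({
    "actual": toy_y.home_value,
    "baseline": dummy.predict(toy_x),
})
print(toy_baseline)
```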


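Next, I fit a linear regression (model1) on the three training features and store its predictions next to the actual values and the baseline.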

In [24]:
def modeling_function(x_train, x_test, y_train, y_test):
    # x_test and y_test are accepted but not used yet; predictions cover train only
    predictions = pd.DataFrame({'actual_value': y_train.home_value}).reset_index(drop=True)

    # model 1: linear regression on the training features
    model1 = LinearRegression()
    model1.fit(x_train, y_train.home_value)
    predictions['model1'] = model1.predict(x_train)

    # baseline: mean home value of the training data
    predictions['baseline'] = y_train.home_value.mean()

    return predictions

In [25]:
predictions = modeling_function(x_train, x_test, y_train, y_test)

In [26]:
predictions.head()

| | actual_value | model1 | baseline |
|---|---|---|---|
| 0 | 1121784 | 983376.1 | 544122.62948 |
| 1 | 194062 | 98310.26 | 544122.62948 |
| 2 | 339756 | 541020.4 | 544122.62948 |
| 3 | 376676 | 541215.1 | 544122.62948 |
| 4 | 875357 | 1705229.0 | 544122.62948 |
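
The modeling function above returns predictions for the training rows only, so the numbers that follow describe in-sample fit. One possible extension, sketched here under assumptions, fits on the training split and returns prediction frames for both splits. The name `fit_and_predict` and the synthetic `toy` data are hypothetical; the baseline is still computed from the training target so that nothing from the test set leaks into it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_and_predict(x_train, x_test, y_train, y_test):
    """Fit on the training split; return (train, test) prediction frames."""
    model = LinearRegression().fit(x_train, y_train.home_value)
    baseline = y_train.home_value.mean()  # baseline always comes from train

    frames = []
    for x, y in ((x_train, y_train), (x_test, y_test)):
        frames.append(pd.DataFrame({
            "actual_value": y.home_value.to_numpy(),
            "model1": model.predict(x),
            "baseline": baseline,
        }))
    return frames[0], frames[1]


# Synthetic data standing in for the Zillow frame (values are made up).
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    "bedroom": rng.integers(1, 6, size=200),
    "bathroom": rng.integers(1, 5, size=200),
    "square_feet": rng.integers(500, 4000, size=200),
})
toy["home_value"] = 200 * toy.square_feet + rng.normal(0, 50_000, size=200)

toy_train, toy_test = train_test_split(toy, train_size=0.80, random_state=123)
cols = ["bedroom", "bathroom", "square_feet"]
train_preds, test_preds = fit_and_predict(
    toy_train[cols], toy_test[cols],
    toy_train[["home_value"]], toy_test[["home_value"]],
)
print(train_preds.shape, test_preds.shape)
```

Computing the same error metrics on the test frame would then show how well the model generalizes to rows it has not seen.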


#### Evaluate model1 compared to the baseline value.

In [29]:
MSE_base = mean_squared_error(predictions.actual_value, predictions.baseline)
SSE_base = MSE_base * len(predictions.actual_value)
RMSE_base = sqrt(MSE_base)
r2_base = r2_score(predictions.actual_value, predictions.baseline)
print(MSE_base, SSE_base, RMSE_base, r2_base)

575225329640.1548 6852659352003164.0 758436.1078167064 0.0
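
For reference, the four printed quantities follow the standard definitions below, where $y_i$ is the actual home value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of rows. The symbols are notation introduced here, not names from the notebook.

```latex
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{n}, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

For the baseline, every prediction is $\hat{y}_i = \bar{y}$, so the SSE equals the denominator of the $R^2$ fraction and $R^2 = 0$, which matches the last value printed above.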


#### Model Error

In [30]:
MSE_1 = mean_squared_error(predictions.actual_value, predictions.model1)
SSE_1 = MSE_1 * len(predictions.actual_value)
RMSE_1 = sqrt(MSE_1)
r2_1 = r2_score(predictions.actual_value, predictions.model1)
print(MSE_1,SSE_1,RMSE_1,r2_1)

356640202824.1915 4248654736244593.5 597193.6058132167 0.37999913347471004
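
As a quick consistency check on the two outputs above: because the baseline predicts the mean, its MSE times the row count is the total sum of squares, so the model's R² should equal one minus the ratio of the two MSE values. The sketch below copies the printed numbers by hand; the variable names are illustrative.

```python
from math import sqrt

# Values copied from the two evaluation outputs above.
mse_base_reported = 575225329640.1548
mse_1_reported = 356640202824.1915

# With a mean baseline, R^2 for model1 reduces to 1 - MSE_1 / MSE_base.
r2_from_mse = 1 - mse_1_reported / mse_base_reported

# Relative drop in RMSE when moving from the baseline to model1.
rmse_drop = 1 - sqrt(mse_1_reported) / sqrt(mse_base_reported)

print(round(r2_from_mse, 4), round(rmse_drop, 4))
```

The recomputed R² agrees with the 0.38 reported by `r2_score`, and the RMSE falls by roughly 21% relative to the baseline, all measured on the training data.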
