
<div align="center"><img width="275" height="50" src="http://zillow.mediaroom.com/image/Zillow_Wordmark_Blue_RGB.jpg" /> </div> 

<div align="center"> <h1>Cluster Project</h1> 
  <h6> by John Grinstead & David Berchelmann -- April 7, 2021 </h6> </div>
  
  ------------------------------------------------

<div align="center"><img width="800" height="50" src="https://www.zillowstatic.com/s3/homepage/static/Buy_a_home.png" /> </div>



-------

<h1> Welcome! </h1>

The following Jupyter notebook walks through our clustering project on Zillow data. The dataset comes from a SQL database and is also available on Kaggle.com.

All of the files and notebooks for this project are available in the GitHub repository: https://github.com/davidb-and-john/clustering-project



----

<h1> Executive Summary </h1>

------

<h4><b>The Problem</b></h4>

- What is driving the errors in the Zestimates?

<h4><b>The Goal</b></h4>

- Use clustering to identify which groups of features are the strongest drivers of log error.
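
For readers new to the target variable: the Kaggle Zillow competition defines log error as log(Zestimate) - log(SalePrice), so a positive value means the Zestimate was above the sale price. The sketch below shows that calculation. The function name and the sample prices are illustrative only, and the natural log is assumed.

```python
import numpy as np

def log_error(zestimate, sale_price):
    """Log error as defined for the Kaggle Zillow data (natural log assumed)."""
    return np.log(zestimate) - np.log(sale_price)

# Hypothetical prices, for illustration only.
over = log_error(110_000, 100_000)   # Zestimate above sale price -> positive
under = log_error(90_000, 100_000)   # Zestimate below sale price -> negative
```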

<h4><b>The Process</b></h4>

  * Acquire the Data
  * Prepare 
  * Explore 
  * Model
  * Create Recommendations Based On Findings 
  
<h4><b>The Findings</b></h4>

    
    


-------


<h3><u>Environment Setup</u></h3>

In [4]:
# packages for data analysis & mapping
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.dates as dates
import seaborn as sns
import plotly.express as px
from datetime import date 


# Statistical Tests
import scipy.stats as stats
from math import sqrt
from scipy.stats import norm


# modeling methods
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.feature_selection import SelectKBest, f_regression, RFE 
import sklearn.preprocessing
pd.options.display.float_format = '{:20,.2f}'.format


# address warnings
import warnings
warnings.filterwarnings("ignore")

# acquire, prep, train, & explore functions
from wrangle import get_connection, new_zillow_data, get_zillow_data, clean_zillow, split, seperate_y, scale_data, split_seperate_scale 



<h4> Data Validation </h4>

 - Before bringing the data in through the acquire file, we investigated the set in SQL. Below are a few of our findings:
     - Some properties were labeled 'single family residential' but had a unit count greater than 1.
     - A number of properties lacked location info (zip, latitude, longitude, fips); these are dropped in prep.
     - Bedrooms and bathrooms both had rows with a value of 0. These were filled with the median count for each feature.
     - Some entries had multiple transaction dates. To account for this, we kept only the most recent transaction date for each property and applied the same filter to log error.
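
The validation steps above can be sketched in pandas as follows. This is a minimal illustration, not the code in `wrangle.py`: the function name is hypothetical, and the column names `transactiondate`, `regionidzip`, `latitude`, and `longitude` are assumed from the Kaggle Zillow schema. It also assumes the median is computed after zeros are set aside, which the notes above do not specify.

```python
import pandas as pd

def validate_zillow(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the validation steps to a raw frame."""
    df = df.copy()

    # Keep only the most recent transaction (and its log error) per parcel.
    df["transactiondate"] = pd.to_datetime(df["transactiondate"])
    df = (df.sort_values("transactiondate")
            .drop_duplicates(subset="parcelid", keep="last"))

    # Drop rows that are missing any location field (assumed column names).
    location_cols = ["regionidzip", "latitude", "longitude", "fips"]
    df = df.dropna(subset=location_cols)

    # Treat a count of 0 as missing, then fill with the column median.
    for col in ["bedroomcnt", "bathroomcnt"]:
        counts = df[col].mask(df[col] == 0)
        df[col] = counts.fillna(counts.median())

    return df.reset_index(drop=True)
```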

---
<h3><u>Acquire the Data</u></h3>

----

In [7]:
df = pd.read_csv("zillowcluster_df.csv")
print(f'Our original dataframe is coming in with {df.shape[0]} rows and {df.shape[1]} columns.')

Our original dataframe is coming in with 77413 rows and 68 columns.


In [6]:
df.isna().sum()

Unnamed: 0                          0
typeconstructiontypeid          77191
storytypeid                     77363
heatingorsystemtypeid           27974
buildingclasstypeid             77398
architecturalstyletypeid        77207
airconditioningtypeid           52460
parcelid                            0
id                                  0
basementsqft                    77363
bathroomcnt                        33
bedroomcnt                         33
buildingqualitytypeid           27742
calculatedbathnbr                 642
decktypeid                      76799
finishedfloor1squarefeet        71390
calculatedfinishedsquarefeet      229
finishedsquarefeet12             3665
finishedsquarefeet13            77372
finishedsquarefeet15            74404
finishedsquarefeet50            71390
finishedsquarefeet6             77027
fips                               33
fireplacecnt                    69137
fullbathcnt                       642
garagecarcnt                    51939
garagetotals
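
With 68 columns, raw null counts are easier to read as a share of rows. For example, `typeconstructiontypeid` is missing in 77,191 of 77,413 rows, or about 99.7%. The helper below is a sketch of one way to produce that view; the name `null_summary` is hypothetical and is not part of the project's `wrangle` module.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most-missing first."""
    summary = pd.DataFrame({
        "num_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)

# Tiny made-up frame to show the output shape.
demo = pd.DataFrame({"a": [1.0, None, None, None], "b": [1, 2, 3, 4]})
demo_summary = null_summary(demo)
```

Calling `null_summary(df)` on the acquired frame would return one row per column, which can then be filtered by a chosen missingness threshold.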

-----

<h3>Takeaways From Acquire</h3>

-----

----

<h3>Clean/Prep the Data</h3>

----

-----
<h3> Prep Takeaways </h3>

---------

<h3><u>Data Dictionary</u></h3>



| Feature | Data Type | Description |
| :------------- | :----------: | :----------- |
| parcelid | int64 | Unique parcel identifier |
| landuse_id | float64 | Identifier for land-use type |
| landuse_desc | object | Description of the land-use type |
| last_sold_date | object | Transaction date of the property's most recent sale |
| total_sqft | float64 | Total livable square footage |
| bedroom_quanity | float64 | Count of bedrooms |
| bathroom_quanity | float64 | Count of bathrooms |
| fips | object | Federal Information Processing Standards (FIPS) county code |
| zip_code | object | 5-digit code used by the US Postal Service |
| year_built | object | Year the home was built |
| tax_assesed_value | float64 | Total value of the home as established by the taxing authority |
| latitude | float64 | Geographic coordinate specifying the north-south position |
| longitude | float64 | Geographic coordinate specifying the east-west position |
| tax_assess_yr | float64 | Most recent year property taxes were assessed |
| property_tax | float64 | Ad valorem tax on the value of the property |
| age_of_home | int64 | Age of the home in years as of today's date |
| tax_rate | float64 | property_tax / tax_assesed_value |
| baths_pers_qft | float64 | Number of bathrooms per square foot |
| beds_pers_qft | float64 | Number of bedrooms per square foot |
-----

<h3> Explore the Data </h3>

-----

-----

<h3> Model the Data </h3>

-----

-------

<h3>Conclusions & Thoughts Moving Forward</h3>

-----