**Zillow**

For each of the following, work through the steps you would take to create functions: write the code in a Jupyter notebook, test it, convert it to functions, then create the file to house those functions.

You will have a `zillow.ipynb` file and a helper file for each section in the pipeline.

**acquire & summarize**

- Acquire data from MySQL using the Python module to connect and query.
- You will want to end with **a single dataframe**. Make sure to include the logerror and all available fields related to the properties. You will end up **using all the tables in the database**.
- Be sure to do **the correct join (inner, outer, etc.)**. We do not want to eliminate properties purely because they may have a null value for airconditioningtypeid.
- Only include properties with a **transaction in 2017**, and include **only the last transaction for each property** (so no duplicate property IDs), along with the zestimate error and the date of the transaction.
- Only include properties that include a latitude and longitude value.
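
The join requirement in the list above can be checked on toy data. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL server, an assumption made only so the example runs without credentials. The table and column names mirror the Zillow schema, but the rows are invented.

```python
import sqlite3

# In-memory SQLite stands in for the real MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table properties_2017 (parcelid integer, airconditioningtypeid integer);
    create table airconditioningtype (airconditioningtypeid integer, airconditioningdesc text);
    insert into properties_2017 values (1, 1), (2, null), (3, 13);
    insert into airconditioningtype values (1, 'Central'), (13, 'Yes');
""")

# Inner join: parcel 2 disappears because its airconditioningtypeid is null.
inner_rows = conn.execute("""
    select parcelid, airconditioningdesc
    from properties_2017
    join airconditioningtype using(airconditioningtypeid)
    order by parcelid
""").fetchall()

# Left join: every property is kept; the missing description comes back as None.
left_rows = conn.execute("""
    select parcelid, airconditioningdesc
    from properties_2017
    left join airconditioningtype using(airconditioningtypeid)
    order by parcelid
""").fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```

The left join is the safer choice for the lookup tables, since most properties have a null in at least one of the type-id columns.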

**Summarize Zillow Database**

- airconditioningtype: 13 unique values
    - primary key: airconditioningtypeid


- architecturalstyletype: 27 unique values
    - primary key: architecturalstyletypeid
    
    
- buildingclasstype: 5 unique values
    - primary key: buildingclasstypeid
    
    
- heatingorsystemtype: 25 unique values
    - primary key: heatingorsystemtypeid
    
    
- predictions_2016: all the transactions in 2016 
    - No need to be joined
    
    
- predictions_2017: 77614 records in total
    - join key: parcelid (not unique in this table, so not a true primary key)
    - 77613 records in 2017
    - 1 record in 2018
    - unique id: 77614
    - **unique parcelid: 77414**
    
    
- properties_2016: No need to be joined


- properties_2017: main table
    - primary key: parcelid
    
    
- propertylandusetype
    - primary key: propertylandusetypeid
    
    
- storytype: 35 unique values
    - primary key: storytypeid
    

- typeconstructiontype: 18 unique values
    - primary key: typeconstructiontypeid
    
    
- unique_properties: 2,985,217 rows
    - primary key: parcelid
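
Counts like the record and unique-parcelid totals listed for predictions_2017 can be obtained with `count(*)` and `count(distinct ...)`. A minimal sketch on invented rows, using in-memory SQLite as a stand-in for the real database (an assumption for runnability); on MySQL the year could be extracted with `year(transactiondate)` instead of `substr`.

```python
import sqlite3

# Toy stand-in for predictions_2017; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table predictions_2017 (id integer, parcelid integer, logerror real, transactiondate text);
    insert into predictions_2017 values
        (0, 100, 0.02, '2017-01-05'),
        (1, 100, -0.01, '2017-06-30'),
        (2, 200, 0.10, '2017-03-12'),
        (3, 300, 0.05, '2018-05-25');
""")

# Total rows, unique ids, and unique parcelids in one pass.
total, unique_ids, unique_parcels = conn.execute("""
    select count(*), count(distinct id), count(distinct parcelid)
    from predictions_2017
""").fetchone()

# Number of transactions per calendar year.
by_year = conn.execute("""
    select substr(transactiondate, 1, 4) as yr, count(*)
    from predictions_2017
    group by yr
    order by yr
""").fetchall()
conn.close()

print(total, unique_ids, unique_parcels)
print(by_year)
```

When the unique parcelid count is lower than the row count, some properties have more than one transaction and will need de-duplicating after the join.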

In [1]:
import warnings
warnings.filterwarnings("ignore")
import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import env, acquire

In [31]:
# Acquire properties with a transaction in 2017, ordered by parcelid and then transactiondate

query = """
        select *
        from properties_2017
        join predictions_2017 using(parcelid)
        left join airconditioningtype using(airconditioningtypeid)
        left join architecturalstyletype using(architecturalstyletypeid)
        left join buildingclasstype using(buildingclasstypeid)
        left join heatingorsystemtype using(heatingorsystemtypeid)
        left join propertylandusetype using(propertylandusetypeid)
        left join storytype using(storytypeid)
        left join typeconstructiontype using(typeconstructiontypeid)
        where transactiondate between '2017-01-01' and '2017-12-31'
        order by parcelid, transactiondate
        """

df = acquire.get_zillow_data(query, '1')
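
The body of `acquire.get_zillow_data` is not shown in this notebook, and neither is the meaning of its second argument. As a rough illustration of what an acquire helper often looks like, here is a sketch that caches a query result to CSV. The names `get_cached_data`, `run_query`, and `cache_file` are hypothetical, and `run_query` is injected so the example runs without a database; in practice it might be a thin wrapper around `pd.read_sql`.

```python
import os
import tempfile

import pandas as pd


def get_cached_data(query, cache_file, run_query):
    """Return the query result, reading cache_file instead when it already exists.

    run_query is a hypothetical callable: SQL string in, DataFrame out.
    """
    if os.path.isfile(cache_file):
        return pd.read_csv(cache_file)
    df = run_query(query)
    # index=False keeps the row index out of the cached file.
    df.to_csv(cache_file, index=False)
    return df


# Usage with a fake loader so the sketch runs without credentials.
calls = []


def fake_run_query(query):
    calls.append(query)
    return pd.DataFrame({"parcelid": [10711855, 10711877],
                         "logerror": [-0.007357, 0.021066]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "zillow_cache.csv")
    first = get_cached_data("select 1", path, fake_run_query)   # runs the query
    second = get_cached_data("select 1", path, fake_run_query)  # reads the cache

print(len(calls))
```

Caching avoids re-running a large multi-table join each time the notebook is restarted, at the cost of having to delete the file when the query changes.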

In [32]:
df.head()

Unnamed: 0,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,parcelid,id,basementsqft,...,id.1,logerror,transactiondate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc
0,,,261.0,2.0,,,,10711855,1087254,,...,55006,-0.007357,2017-07-07,,,,Central,Single Family Residential,,
1,,,261.0,2.0,,,1.0,10711877,1072280,,...,71382,0.021066,2017-08-29,Central,,,Central,Single Family Residential,,
2,,,261.0,2.0,,,1.0,10711888,1340933,,...,23209,0.077174,2017-04-04,Central,,,Central,Single Family Residential,,
3,,,261.0,2.0,,,,10711910,1878109,,...,18017,-0.041238,2017-03-17,,,,Central,Single Family Residential,,
4,,,261.0,2.0,,,,10711923,2190858,,...,20378,-0.009496,2017-03-24,,,,Central,Single Family Residential,,


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77613 entries, 0 to 77612
Data columns (total 69 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   typeconstructiontypeid        223 non-null    float64
 1   storytypeid                   50 non-null     float64
 2   propertylandusetypeid         77579 non-null  float64
 3   heatingorsystemtypeid         49571 non-null  float64
 4   buildingclasstypeid           15 non-null     float64
 5   architecturalstyletypeid      207 non-null    float64
 6   airconditioningtypeid         25007 non-null  float64
 7   parcelid                      77613 non-null  int64  
 8   id                            77613 non-null  int64  
 9   basementsqft                  50 non-null     float64
 10  bathroomcnt                   77579 non-null  float64
 11  bedroomcnt                    77579 non-null  float64
 12  buildingqualitytypeid         49809 non-null  float64
 13  c

In [34]:
# Address duplicates: show all duplicates

mask = df.duplicated(subset='parcelid', keep=False)
df_duplicated = df[mask]
df_duplicated.head()

Unnamed: 0,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,parcelid,id,basementsqft,...,id.1,logerror,transactiondate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc
293,,,261.0,2.0,,,,10722858,16179,,...,14033,0.095171,2017-03-02,,,,Central,Single Family Residential,,
294,,,261.0,2.0,,,,10722858,16179,,...,14034,-0.172843,2017-07-28,,,,Central,Single Family Residential,,
539,,,261.0,2.0,,,,10732347,1836115,,...,13913,0.077198,2017-03-01,,,,Central,Single Family Residential,,
540,,,261.0,2.0,,,,10732347,1836115,,...,13914,-0.221145,2017-07-25,,,,Central,Single Family Residential,,
721,,,261.0,2.0,,,1.0,10739478,2119208,,...,2904,0.08328,2017-01-13,Central,,,Central,Single Family Residential,,


In [35]:
df_duplicated.shape

(395, 69)

In [40]:
# Only keep the last transaction for each property.

df.drop_duplicates(subset=['parcelid'], keep='last', inplace=True, ignore_index=True)
df.shape

(77414, 69)

In [41]:
# Check that the row with the most recent transaction date is kept for each property.

df[(df.parcelid == 10722858) | (df.parcelid == 10732347)]

Unnamed: 0,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,parcelid,id,basementsqft,...,id.1,logerror,transactiondate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc
293,,,261.0,2.0,,,,10722858,16179,,...,14034,-0.172843,2017-07-28,,,,Central,Single Family Residential,,
538,,,261.0,2.0,,,,10732347,1836115,,...,13914,-0.221145,2017-07-25,,,,Central,Single Family Residential,,


In [44]:
# Check whether any duplicate property IDs remain

df.duplicated(subset='parcelid').any()

False
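
The de-duplication above relies on the rows already being ordered by parcelid and transactiondate in the SQL query. A small sketch of the same idea that sorts explicitly in pandas, so it does not depend on the query's `order by`. The toy frame uses a few rows adapted from the output above, shuffled on purpose.

```python
import pandas as pd

# A few rows adapted from the notebook output, deliberately out of order.
toy = pd.DataFrame({
    "parcelid": [10722858, 10732347, 10722858, 10739478, 10732347],
    "transactiondate": ["2017-07-28", "2017-03-01", "2017-03-02",
                        "2017-01-13", "2017-07-25"],
    "logerror": [-0.172843, 0.077198, 0.095171, 0.08328, -0.221145],
})
toy["transactiondate"] = pd.to_datetime(toy["transactiondate"])

# Sort so the latest transaction is last within each parcelid, then keep it.
latest = (
    toy.sort_values(["parcelid", "transactiondate"])
       .drop_duplicates(subset="parcelid", keep="last", ignore_index=True)
)

# Cross-check: the kept date equals the maximum date per parcelid.
expected_dates = toy.groupby("parcelid")["transactiondate"].max()
kept_dates = latest.set_index("parcelid")["transactiondate"]
print((kept_dates == expected_dates).all())
print(latest)
```

Comparing against the per-parcel maximum date checks every property at once, rather than spot-checking two parcel IDs.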

**Takeaways: Properties with a transaction in 2017**

In [23]:
df.isnull().sum(axis=0)

typeconstructiontypeid    77390
storytypeid               77563
propertylandusetypeid        34
heatingorsystemtypeid     28042
buildingclasstypeid       77598
                          ...  
buildingclassdesc         77598
heatingorsystemdesc       28042
propertylandusedesc          34
storydesc                 77563
typeconstructiondesc      77390
Length: 69, dtype: int64
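
The null counts can be paired with percentages to make sparse columns easier to spot, and the latitude/longitude requirement from the acquire step can be met with `dropna`. A minimal sketch on invented data; the column names `latitude` and `longitude` are assumed to match the properties table, and the summary column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names are meant to resemble the Zillow data.
toy = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "latitude": [34.1, np.nan, 33.9, 34.0],
    "longitude": [-118.2, -118.3, np.nan, -118.1],
    "storytypeid": [np.nan, np.nan, np.nan, 7.0],
})

# Keep only rows where both latitude and longitude are present.
with_location = toy.dropna(subset=["latitude", "longitude"])

# Count and percentage of missing values per column.
null_summary = pd.DataFrame({
    "num_rows_missing": toy.isnull().sum(),
    "pct_rows_missing": toy.isnull().mean() * 100,
})

print(with_location)
print(null_summary)
```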