## Zillow

For the following, iterate through the steps you would take to create functions: write the code in a Jupyter notebook, test it, convert it to functions, then create the file to house those functions.

You will have a `zillow.ipynb` file and a helper file for each section in the pipeline.

**Summarize Zillow Database**

- airconditioningtype: 13 unique values
    - primary key: airconditioningtypeid


- architecturalstyletype: 27 unique values
    - primary key: architecturalstyletypeid
    
    
- buildingclasstype: 5 unique values
    - primary key: buildingclasstypeid
    
    
- heatingorsystemtype: 25 unique values
    - primary key: heatingorsystemtypeid
    
    
- predictions_2016: all the transactions in 2016 
    - No need to be joined
    
    
- predictions_2017: 77614 records in total
    - primary key: parcelid
    - 77613 records in 2017
    - 1 record in 2018
    - unique id: 77614
    - **unique parcelid: 77414**
    
    
- properties_2016: No need to be joined


- properties_2017: main table
    - primary key: parcelid
    
    
- propertylandusetype
    - primary key: propertylandusetypeid
    
    
- storytype: 35 unique values
    - primary key: storytypeid
    

- typeconstructiontype: 18 unique values
    - primary key: typeconstructiontypeid
    
    
- unique_properties: 2,985,217 rows
    - primary key: parcelid

## Acquire & Summarize

### 1. Acquire data from MySQL using the Python module to connect and query. You will want to end with **a single dataframe**. Make sure to include the logerror and all available fields related to the properties. You will end up **using all the tables in the database**.
- Be sure to do **the correct join (inner, outer, etc.)**. We do not want to eliminate properties purely because they may have a null value for airconditioningtypeid.
- Only include properties with a **transaction in 2017**, and include **only the last transaction for each property** (so no duplicate property IDs), along with the zestimate error and date of transaction.
- Only include properties that have a latitude and longitude value.

In [2]:
import warnings
warnings.filterwarnings("ignore")
import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import env, acquire, summarize, prepare, wrangle_zillow

In [None]:
# Acquire properties with a transaction in 2017, ordered by parcelid, then transactiondate

query = """
        select *
        from properties_2017
        join predictions_2017 using(parcelid)
        left join airconditioningtype using(airconditioningtypeid)
        left join architecturalstyletype using(architecturalstyletypeid)
        left join buildingclasstype using(buildingclasstypeid)
        left join heatingorsystemtype using(heatingorsystemtypeid)
        left join propertylandusetype using(propertylandusetypeid)
        left join storytype using(storytypeid)
        left join typeconstructiontype using(typeconstructiontypeid)
        where transactiondate between '2017-01-01' and '2017-12-31'
        order by parcelid, transactiondate
        """

df = acquire.get_zillow_data(query, '1')
df.shape

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Address duplicates: show all duplicates

mask = df.duplicated(subset='parcelid', keep=False)
df_duplicated = df[mask]
df_duplicated.head()

In [None]:
df_duplicated.shape

In [None]:
# Only keep the last (most recent) transaction for each property.

df.drop_duplicates(subset=['parcelid'], keep='last', inplace=True, ignore_index=True)
df.shape

In [None]:
# Check that the row with the most recent transaction date is kept for each property.

df[(df.parcelid == 10722858) | (df.parcelid == 10732347)]

In [None]:
# Check whether any duplicate property IDs remain

df.duplicated(subset='parcelid').any()
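
The deduplication steps above can be sketched on a toy dataframe. This is a minimal illustration, not the contents of `prepare.py`: the function name `keep_last_transaction` and the sample values are assumptions made for the example. It also applies the latitude/longitude requirement from the exercise, which the SQL query above does not enforce.

```python
import pandas as pd

# Toy stand-in for the joined query result (all values are made up).
toy = pd.DataFrame({
    "parcelid": [1, 1, 2, 3],
    "transactiondate": ["2017-01-05", "2017-06-30", "2017-03-12", "2017-09-01"],
    "logerror": [0.01, -0.02, 0.03, 0.04],
    "latitude": [34.1, 34.1, None, 33.9],
    "longitude": [-118.2, -118.2, -118.4, -118.0],
})

def keep_last_transaction(df):
    """Hypothetical helper: one row per parcelid, keeping the latest transaction."""
    out = df.copy()
    out["transactiondate"] = pd.to_datetime(out["transactiondate"])
    # Sort explicitly so keep="last" really means "most recent".
    out = out.sort_values(["parcelid", "transactiondate"])
    out = out.drop_duplicates(subset="parcelid", keep="last")
    # Keep only rows that have both coordinates.
    out = out.dropna(subset=["latitude", "longitude"])
    return out.reset_index(drop=True)

latest = keep_last_transaction(toy)
print(latest)
```

Sorting inside the helper means the result does not depend on the `order by` clause of the query.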

**Takeaways**: Properties with transaction in 2017

### 2. Summarize your data (summary stats, info, dtypes, shape, distributions, value_counts, etc.)

In [None]:
zillow = prepare.drop_zillow_duplicates(df)
zillow.shape

In [None]:
# Summary stats

zillow.describe()

In [None]:
# Info

zillow.info()

In [None]:
# Display object columns and the counts of unique values

zillow_obj_sum = summarize.sum_obj_cols(zillow)
zillow_obj_sum

In [None]:
# Count unique values in each attribute

summarize.obj_value_counts(zillow)

In [None]:
zillow.shape

In [None]:
zillow_num = summarize.num_df(zillow)
zillow_num.shape

In [None]:
zillow_obj = summarize.obj_df(zillow)
zillow_obj.shape
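
The `summarize` helpers are not shown in this notebook. One plausible way to split a dataframe into numeric and non-numeric columns is `DataFrame.select_dtypes`; the sketch below is an assumption about how such helpers could work, with hypothetical names `split_numeric` and `split_non_numeric` and made-up sample values.

```python
import pandas as pd

def split_numeric(df):
    """Hypothetical helper: return only the numeric columns."""
    return df.select_dtypes(include="number")

def split_non_numeric(df):
    """Hypothetical helper: return every column that is not numeric."""
    return df.select_dtypes(exclude="number")

# Toy data with two numeric columns and one text column.
toy = pd.DataFrame({
    "bedroomcnt": [3.0, 2.0],
    "fips": [6037, 6059],
    "propertylandusedesc": ["Single Family Residential", "Condominium"],
})

print(split_numeric(toy).columns.tolist())
print(split_non_numeric(toy).columns.tolist())
```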

### 3. Write a function that takes in a dataframe of observations and attributes and returns a dataframe where each row is an attribute name, the first column is the number of rows with missing values for that attribute, and the second column is the percent of total rows that have missing values for that attribute. Run the function and document takeaways from this on how you want to handle missing values.

In [None]:
zillow.head()

In [None]:
# Compute the number of rows with missing values 

attributes_missing_values = pd.DataFrame(zillow.isna().sum(axis=0), columns=['num_row_missing'])
attributes_missing_values

In [None]:
# Add a column to compute the percent of total rows that have missing values

total_rows = zillow.shape[0]

attributes_missing_values['pct_rows_missing'] = (attributes_missing_values.num_row_missing/total_rows)*100
attributes_missing_values.head()

In [None]:
# Test the function

attributes_missing_values = summarize.sum_missing_values_attributes(zillow)
attributes_missing_values
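
The two cells above can be folded into a single function. The version below is a sketch under the assumption that the percentage is expressed on a 0 to 100 scale; the name `missing_by_attribute` is hypothetical and this is not necessarily what `summarize.sum_missing_values_attributes` does.

```python
import numpy as np
import pandas as pd

def missing_by_attribute(df):
    """Hypothetical helper: per-column count and percent of missing rows."""
    num_missing = df.isna().sum()             # one count per column
    pct_missing = num_missing / len(df) * 100  # share of all rows, in percent
    return pd.DataFrame({
        "num_rows_missing": num_missing,
        "pct_rows_missing": pct_missing,
    })

# Toy data: "a" is half missing, "b" is complete, "c" is entirely missing.
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan] * 4,
})

summary = missing_by_attribute(toy)
print(summary)
```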

### 4. Write a function that takes in a dataframe and returns a dataframe with 3 columns: the number of columns missing, percent of columns missing, and number of rows with n columns missing. Run the function and document takeaways from this on how you want to handle missing values.

In [None]:
# Count the rows by the number of missing values in each row.

x = zillow.isnull().sum(axis=1).value_counts().sort_index()
x

In [None]:
# Construct the dataframe from a list of lists

cols_missing_values = pd.DataFrame([x.index.tolist(), x.values.tolist()], 
                                   index = ['num_cols_missing', 'num_rows'])
cols_missing_values.T

In [None]:
# Construct the dataframe from a dict

d = {'num_cols_missing': x.index.tolist(), 'num_rows': x.values.tolist()}

cols_missing_values = pd.DataFrame(d)
cols_missing_values

In [None]:
# Compute the percent of columns missing

n_cols = zillow.shape[1] # Total number of columns
cols_missing_values['pct_cols_missing'] = (cols_missing_values.num_cols_missing/n_cols)*100
cols_missing_values

In [None]:
# Visualize the distribution of the number of missing columns per row

x = cols_missing_values.num_cols_missing
y = cols_missing_values.num_rows

plt.rc('figure', figsize=(13,7))

plt.subplot(121)
plt.bar(x, y)

plt.subplot(122)
# Recent seaborn versions require x and y to be passed as keywords
sns.barplot(x=x, y=y)

In [None]:
# Test the function

cols_missing_values = summarize.sum_missing_values_cols(zillow)
cols_missing_values
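
The row-wise summary can be sketched as one function. This is a minimal version assuming that "percent of columns missing" means the number of missing columns divided by the total number of columns; the name `missing_by_row` is hypothetical and may differ from `summarize.sum_missing_values_cols`.

```python
import numpy as np
import pandas as pd

def missing_by_row(df):
    """Hypothetical helper: how many rows are missing n columns, for each n."""
    # Missing values per row, then how many rows share each count.
    counts = df.isna().sum(axis=1).value_counts().sort_index()
    out = pd.DataFrame({
        "num_cols_missing": counts.index,
        "num_rows": counts.values,
    })
    # Percent of all columns that are missing for rows in each group.
    out["pct_cols_missing"] = out["num_cols_missing"] / df.shape[1] * 100
    return out[["num_cols_missing", "pct_cols_missing", "num_rows"]]

# Toy data: rows are missing 0, 1, 1, and 2 of the 4 columns.
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [1, 2, np.nan, np.nan],
    "c": [1, 2, 3, 4],
    "d": [1, 2, 3, 4],
})

row_summary = missing_by_row(toy)
print(row_summary)
```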

## Prepare
### 1. Remove any properties that are likely to be something other than single unit properties. (e.g. no duplexes, no land/lot, ...). 
- There are multiple ways to estimate that a property is a single unit, and there is not a single "right" answer. But for this exercise, do not purely filter by unitcnt as we did previously. 
- Add some new logic that will reduce the number of properties that are falsely removed. 
- You might want to use the number of bedrooms, square feet, unit type, or the like to identify single unit properties among those with unitcnt not defined.

In [None]:
zillow.shape

In [None]:
zillow.propertylandusetypeid.value_counts()

In [None]:
zillow.propertylandusedesc.value_counts()

In [None]:
# Use the propertylandusetypeid values used previously in the regression project.
# This filter is better done in the SQL query.

single_unit = [260, 261, 262, 279]

zillow = zillow[zillow.propertylandusetypeid.isin(single_unit)]
zillow.shape

In [None]:
zillow.propertylandusetypeid.value_counts()
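
The cell above filters on land use type alone. A sketch of the extra logic the exercise asks for might combine land use with `unitcnt`, bedrooms, and square footage, so that rows with an undefined `unitcnt` are kept when they otherwise look like a single unit. The function name `filter_single_unit` and the thresholds (`min_sqft=500`, `min_bedrooms=1`) are assumptions chosen for illustration, not values from the source.

```python
import numpy as np
import pandas as pd

SINGLE_UNIT_LAND_USE = [260, 261, 262, 279]  # same IDs as the cell above

def filter_single_unit(df, min_sqft=500, min_bedrooms=1):
    """Hypothetical helper: keep rows that plausibly describe a single unit."""
    land_use_ok = df["propertylandusetypeid"].isin(SINGLE_UNIT_LAND_USE)
    # Do not drop a row just because unitcnt is undefined.
    unit_ok = df["unitcnt"].isna() | (df["unitcnt"] == 1)
    # Assumed sanity checks: at least one bedroom and a minimum living area.
    looks_like_home = (
        (df["bedroomcnt"] >= min_bedrooms)
        & (df["calculatedfinishedsquarefeet"] >= min_sqft)
    )
    return df[land_use_ok & unit_ok & looks_like_home]

# Toy rows (values are made up).
toy = pd.DataFrame({
    "parcelid": [1, 2, 3, 4, 5],
    "propertylandusetypeid": [261, 261, 246, 261, 279],
    "unitcnt": [np.nan, 2, np.nan, 1, 1],
    "bedroomcnt": [3, 4, 4, 0, 2],
    "calculatedfinishedsquarefeet": [1500, 2000, 2200, 0, 900],
})

single = filter_single_unit(toy)
print(single["parcelid"].tolist())
```

Parcel 1 is kept even though its `unitcnt` is missing, which is the case a pure `unitcnt == 1` filter would falsely remove.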

### 2. Create a function that will drop rows or columns based on the percent of values that are missing: handle_missing_values(df, prop_required_column, prop_required_row).

In [None]:
# Create the function based on the curriculum. 

def handle_missing_values(df, prop_required_column, prop_required_row):
    """
    Drop rows and columns based on the percent of values that are missing.
    Parameters: 
    1. df
    2. the proportion, for each column, of rows with non-missing values required to keep the column
    3. the proportion, for each row, of columns with non-missing values required to keep the row
    """
    threshold = int(round(prop_required_column*len(df.index),0))
    df.dropna(axis=1, thresh=threshold, inplace=True)
    threshold = int(round(prop_required_row*len(df.columns),0))
    df.dropna(axis=0, thresh=threshold, inplace=True)
    return df

In [None]:
# Test the function: keep columns with no more than 40% missing and rows with no more than 25% missing

zillow_dropna = handle_missing_values(zillow, 0.6, 0.75)
zillow_dropna.shape

In [None]:
# Because inplace=True is used inside the function, the original zillow dataframe has been modified as well.

zillow.shape
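
If mutating the caller's dataframe is not wanted, the same thresholds can be applied without `inplace=True`. The variant below is a sketch of that alternative, under the hypothetical name `handle_missing_values_copy`; it is not the curriculum's function.

```python
import numpy as np
import pandas as pd

def handle_missing_values_copy(df, prop_required_column, prop_required_row):
    """Hypothetical non-mutating variant: returns a new dataframe."""
    # Minimum number of non-missing values a column needs to be kept.
    col_thresh = int(round(prop_required_column * len(df.index), 0))
    out = df.dropna(axis=1, thresh=col_thresh)
    # Minimum number of non-missing values a row needs, among remaining columns.
    row_thresh = int(round(prop_required_row * len(out.columns), 0))
    out = out.dropna(axis=0, thresh=row_thresh)
    return out

# Toy data: column "c" is mostly missing, and one row is missing "b".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, 3, 4],
    "c": [np.nan, np.nan, np.nan, 4],
})

cleaned = handle_missing_values_copy(toy, 0.6, 0.75)
print(cleaned.shape, toy.shape)
```

Because `dropna` returns a new object when `inplace` is left at its default, `toy` keeps its original shape.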

### 3. Decide how to handle the remaining missing values
- Drop row/column

In [None]:
zillow.isna().sum(axis=0)

In [None]:
zillow.shape

In [None]:
# Drop rows that still have any missing values

mask = zillow.isna().sum(axis=1) == 0
zillow_handle_na = zillow[mask]
zillow_handle_na.shape

In [None]:
# Double-check that no missing values remain in the dataframe

zillow_handle_na.isna().sum(axis=1).sum()

### 4. Test the functions in the .py files

In [3]:
query = """
        select *
        from properties_2017
        join predictions_2017 using(parcelid)
        left join airconditioningtype using(airconditioningtypeid)
        left join architecturalstyletype using(architecturalstyletypeid)
        left join buildingclasstype using(buildingclasstypeid)
        left join heatingorsystemtype using(heatingorsystemtypeid)
        left join propertylandusetype using(propertylandusetypeid)
        left join storytype using(storytypeid)
        left join typeconstructiontype using(typeconstructiontypeid)
        where transactiondate between '2017-01-01' and '2017-12-31'
        order by parcelid, transactiondate
        """

zillow = acquire.get_zillow_data(query, '1')
zillow.shape

(77613, 69)

In [4]:
zillow = wrangle_zillow.wrangle_zillow_mvp(zillow)
zillow.shape

(32055, 35)