In [235]:
import acquire_zillow
import prepare_zillow

import pandas as pd
import numpy as np
import seaborn as sns

In [236]:
df = acquire_zillow.read_zillow_csv()

In [237]:
df = prepare_zillow.data_prep(df, cols_to_remove=['airconditioningtypeid', 'architecturalstyletypeid', 'buildingclasstypeid',
                              'buildingqualitytypeid', 'propertylandusetypeid', 'typeconstructiontypeid', 
                              'storytypeid', 'heatingorsystemtypeid'])

In [238]:
unit_one = df[(df['unitcnt'] >= 2) & (df['bedroomcnt'] >= 1) & (df['calculatedfinishedsquarefeet'] >= 500)].index    
df.drop(unit_one, inplace=True)

In [239]:
df = df[df.propertylandusedesc != 'Duplex (2 Units, Any Combination)']
df = df[df.propertylandusedesc != 'Quadruplex (4 Units, Any Combination)']
df = df[df.propertylandusedesc != 'Triplex (3 Units, Any Combination)']
df = df[df.propertylandusedesc != 'Condominium']

#### Handle Missing Values

1. Write or use a previously written function to return the total missing values and the percent missing values by column.

In [240]:
def null_percent_col(df):
    return df.isnull().sum() / df.shape[0] * 100.00
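
The prompt asks for the total as well as the percent. One possible variant that returns both in a single frame is sketched below; the name `missing_by_column` and the demo data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_by_column(df):
    """Return the count and percent of missing values for each column."""
    total = df.isnull().sum()
    percent = total / len(df) * 100
    return pd.DataFrame({'num_missing': total, 'pct_missing': percent})

# Tiny made-up frame, only to show the shape of the result.
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': [1, 2, 3, 4]})
summary = missing_by_column(demo)
print(summary)
```

Sorting the result with `summary.sort_values('pct_missing', ascending=False)` puts the sparsest columns first, which helps when choosing a cutoff.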

2. Write or use a previously written function to return the total missing values and the percent missing values by row.

In [241]:
def null_percent_row(df):
    return df.isnull().sum(axis=1) / df.shape[1] * 100.00
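
A row-wise summary that reports the total alongside the percent could look like the sketch below. The name `missing_by_row` and the demo frame are hypothetical; the key detail is `axis=1`, which counts nulls across the columns of each row.

```python
import numpy as np
import pandas as pd

def missing_by_row(df):
    """Return the count and percent of missing values for each row."""
    total = df.isnull().sum(axis=1)        # nulls per row
    percent = total / df.shape[1] * 100    # share of columns that are null
    return pd.DataFrame({'num_missing': total, 'pct_missing': percent})

# Made-up data: row 0 is missing one of four values, row 1 is missing two.
demo = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan],
                     'c': [1, 2], 'd': [3, 4]})
rows = missing_by_row(demo)
print(rows)
```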

3. Write a function that will take a dataframe and a list of column names as input and return the dataframe with the null values in those columns replaced by 0.

In [242]:
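def replace_with_zero(df, column_name):
    # column_name may be a single name or a list of names; other columns are left untouched.
    df[column_name] = df[column_name].fillna(0)
    return df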

4. Impute the values in land square feet.

For land square feet, the goal is to impute the missing values by creating a linear model where landtaxvaluedollarcnt is the x variable and the output/y-variable is the estimated land square feet. We'll then use this model to make predictions and fill in the missing values.

Write a function that accepts the zillow data frame and returns the data frame with the missing values filled in.

In [243]:
def replace_null_data(df, column_name, numerical=True):
    if numerical:
        # Numerical columns: fill NaN with the column mean.
        df[column_name] = df[column_name].fillna(df[column_name].mean())
    else:
        # Non-numerical columns: fill NaN with the most frequent value.
        df[column_name] = df[column_name].fillna(df[column_name].mode()[0])
    
    return df

df = replace_null_data(df,'landtaxvaluedollarcnt')
df = replace_null_data(df,'yearbuilt')
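
The cell above fills with the mean rather than a model. A minimal sketch of the linear-model approach described in the prompt might look like this. It assumes the land square feet column is named `lotsizesquarefeet` (that name does not appear in this notebook), uses `np.polyfit` for the one-variable least-squares fit, and the function name `impute_lot_size` is hypothetical.

```python
import numpy as np
import pandas as pd

def impute_lot_size(df, x_col='landtaxvaluedollarcnt', y_col='lotsizesquarefeet'):
    """Fill missing y_col values with predictions from a one-variable linear fit.

    y_col defaults to 'lotsizesquarefeet', which is an assumed column name.
    """
    df = df.copy()
    known = df[x_col].notnull() & df[y_col].notnull()
    # Least-squares line y = slope * x + intercept, fit on complete rows only.
    slope, intercept = np.polyfit(df.loc[known, x_col], df.loc[known, y_col], deg=1)
    # Only rows with a known x can receive a prediction.
    to_fill = df[y_col].isnull() & df[x_col].notnull()
    df.loc[to_fill, y_col] = slope * df.loc[to_fill, x_col] + intercept
    return df

# Made-up values that lie exactly on a line, to make the behaviour easy to see.
demo = pd.DataFrame({
    'landtaxvaluedollarcnt': [100.0, 200.0, 300.0, 400.0],
    'lotsizesquarefeet': [1000.0, 2000.0, np.nan, 4000.0],
})
filled = impute_lot_size(demo)
print(filled)
```

Rows where `landtaxvaluedollarcnt` is itself missing are left as NaN here, so they would still need a fallback such as the mean fill above.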

5. Create a function that fills missing values with 0s. Explore the data and decide which columns it makes sense to apply this transformation to.

In [244]:
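def replace_with_zero(df, column_name):
    # Fill only the requested column(s); the rest of the frame keeps its NaN values.
    df[column_name] = df[column_name].fillna(0)
    return df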

df = replace_with_zero(df,'fullbathcnt')
df = replace_null_data(df,'calculatedbathnbr')
df = replace_null_data(df,'unitcnt')

6. Run the first function that returns missing value totals by column: Does the attribute have enough information (i.e. enough non-null values) to be useful? Choose your cutoff and remove columns where there is not enough information available. Document your cutoff and your reasoning.

**Documentation of missing values by column:**

Each attribute in the Zillow data could be useful, so rather than dropping columns at this stage I chose to impute: NaN values in numerical columns are filled with the column mean, and NaN values in non-numerical columns are filled with the column mode. This treats the missing values in a generalized way, at the cost of biasing each column toward its mean.

There are columns that will not have a place during and after the train-test split, and we will look at those in further detail. Currently there are around 100k rows to look at, which is plenty to see how each column affects logerror.
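
If a column cutoff is chosen later, it could be applied with something like the sketch below. The 50% default is only a placeholder for illustration, not a cutoff this notebook has settled on, and `drop_sparse_columns` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=50.0):
    """Keep only columns whose percent of missing values is at or below the cutoff."""
    pct_missing = df.isnull().mean() * 100   # mean of a boolean mask = fraction missing
    keep = pct_missing[pct_missing <= max_missing_pct].index
    return df[keep]

# Made-up frame: 'mostly_empty' is 75% missing, 'full' has no missing values.
demo = pd.DataFrame({'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
                     'full': [1, 2, 3, 4]})
trimmed = drop_sparse_columns(demo, max_missing_pct=50.0)
print(trimmed.columns.tolist())
```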

7. Run the function that returns missing values by row: Does the observation have enough information to use in our sample? Choose your cutoff and remove rows where there is not enough information available. Document your cutoff and your reasoning.

**Documentation of missing values by row:**

Looking at each row comes into play when we want to eliminate certain values of propertylandusedesc from our prepared data. Part of the reasoning for looking only at single-unit listings is to focus on the majority: there are around 100k single-unit properties and only about half as many multi-unit properties.

I chose to eliminate duplexes, triplexes, quadruplexes, and condominiums, since these are known to have multiple units within each building. The rest of the descriptions could go either way, so I added a separate filter based on a different column: 'unitcnt'.
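
The filters above are based on property type rather than on missingness. If a missing-value cutoff per row is also wanted, one way to express it is sketched below. The 50% default is an arbitrary placeholder and `drop_sparse_rows` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def drop_sparse_rows(df, max_missing_pct=50.0):
    """Keep only rows whose percent of missing values is at or below the cutoff."""
    pct_missing = df.isnull().mean(axis=1) * 100   # fraction of null columns per row
    return df[pct_missing <= max_missing_pct]

# Made-up frame: row 0 is complete, row 1 is missing three of four values.
demo = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, np.nan],
                     'c': [3.0, np.nan], 'd': [4.0, 4.0]})
kept = drop_sparse_rows(demo, max_missing_pct=50.0)
print(kept)
```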

8. Of the remaining missing values, can they be imputed or otherwise estimated?

    - Impute those that can be imputed with the method you feel best fits the attribute.
    - Decide whether to remove the rows or columns of any that cannot be reasonably imputed.
    - Document your reasons for the decisions on how to handle each of those.


#### Handle Outliers

1. Write a function that accepts a series (i.e. one column from a data frame) and summarizes how many outliers are in the series. This function should accept a second parameter that determines how outliers are detected, with the ability to detect outliers in 3 ways:

    - Using the IQR
    - Using standard deviations
    - Based on whether the observation is in the top or bottom 1%
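
A rough sketch of how the three detection methods could share one function follows. The name `count_outliers`, the method labels (`'iqr'`, `'std'`, `'percentile'`), and the defaults of 1.5 IQRs and 3 standard deviations are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def count_outliers(series, method='iqr', k=1.5, n_std=3.0):
    """Summarize how many values fall below/above the bounds for the chosen method."""
    s = series.dropna()
    if method == 'iqr':
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
    elif method == 'std':
        lower = s.mean() - n_std * s.std()
        upper = s.mean() + n_std * s.std()
    elif method == 'percentile':
        # Bottom and top 1% of observations.
        lower, upper = s.quantile(0.01), s.quantile(0.99)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return {
        'lower': lower,
        'upper': upper,
        'n_low': int((s < lower).sum()),
        'n_high': int((s > upper).sum()),
    }

# Made-up series: the integers 1..100 plus one extreme value.
demo = pd.Series(list(range(1, 101)) + [1000])
iqr_summary = count_outliers(demo, method='iqr')
std_summary = count_outliers(demo, method='std')
pct_summary = count_outliers(demo, method='percentile')
print(iqr_summary, std_summary, pct_summary, sep='\n')
```

Returning the bounds together with the counts makes it easy to reuse the same thresholds later when deciding whether to clip or drop the flagged rows.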

In [None]:
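def summarize_outliers(series, method='iqr'):
    # Not yet implemented: should count outliers by IQR, standard deviation, or top/bottom 1%.
    raise NotImplementedError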

## Exploration

Write a function that will take, as input, a dataframe and a list containing the column names of all ordered numeric variables. It will output, through subplots, a pairplot, a heatmap, and 1 other type of plot that will loop through and plot each combination of numeric variables (an x and a y, combination order doesn't matter here!).

In [165]:
num_columns = df.select_dtypes('number').columns.tolist()
logerror_rate = df.logerror.mean()

Write a function that will use seaborn's relplot to plot 2 numeric (ordered) variables and 1 categorical variable. It will take, as input, a dataframe, column name indicated for each of the following: x, y, & hue.
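
A minimal sketch of such a wrapper is shown below. The name `plot_relationship` is hypothetical, and the demo rows are made up, although the column names are ones used elsewhere in this notebook.

```python
import pandas as pd
import seaborn as sns

def plot_relationship(df, x, y, hue):
    """Scatter two numeric columns with seaborn's relplot, colored by a categorical column."""
    return sns.relplot(x=x, y=y, hue=hue, data=df)

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'calculatedfinishedsquarefeet': [900, 1500, 2100, 1200, 1800, 2600],
    'logerror': [0.02, -0.01, 0.05, 0.00, 0.03, -0.04],
    'propertylandusedesc': ['A', 'A', 'B', 'B', 'A', 'B'],
})
rel_grid = plot_relationship(demo, 'calculatedfinishedsquarefeet', 'logerror',
                             'propertylandusedesc')
```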

Write a function that will take, as input, a dataframe, a categorical column name, and a list of numeric column names. It will return a series of subplots: a swarmplot for each numeric column. X will be the categorical variable.
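
The swarmplot function might be sketched as follows; `plot_swarms` is a hypothetical name and the demo data is made up. Swarmplots do not scale well to large frames, so with around 100k rows it is probably worth passing in a sample, for example `df.sample(1000)`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_swarms(df, cat_col, numeric_cols):
    """One swarmplot per numeric column, with the categorical column on the x axis."""
    fig, axes = plt.subplots(1, len(numeric_cols),
                             figsize=(5 * len(numeric_cols), 4), squeeze=False)
    for ax, col in zip(axes[0], numeric_cols):
        sns.swarmplot(x=cat_col, y=col, data=df, ax=ax)
        ax.set_title(col)
    fig.tight_layout()
    return fig

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'propertylandusedesc': ['A', 'A', 'A', 'B', 'B', 'B'],
    'logerror': [0.02, -0.01, 0.05, 0.00, 0.03, -0.04],
    'calculatedfinishedsquarefeet': [900, 1500, 2100, 1200, 1800, 2600],
})
swarm_fig = plot_swarms(demo, 'propertylandusedesc',
                        ['logerror', 'calculatedfinishedsquarefeet'])
```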

Write a function that will take a dataframe and a list of categorical columns to plot each combination of variables in the chart type of your choice.
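
For the categorical combinations, one option is a heatmap of the crosstab counts for each pair of columns, as in the sketch below. The chart type, the name `plot_categorical_pairs`, and the demo columns are all assumptions; the function expects at least two column names.

```python
import itertools

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_categorical_pairs(df, cat_cols):
    """Heatmap of co-occurrence counts for every pair of categorical columns."""
    pairs = list(itertools.combinations(cat_cols, 2))
    fig, axes = plt.subplots(1, len(pairs), figsize=(5 * len(pairs), 4), squeeze=False)
    for ax, (row_col, col_col) in zip(axes[0], pairs):
        counts = pd.crosstab(df[row_col], df[col_col])   # integer counts per combination
        sns.heatmap(counts, annot=True, fmt='d', cbar=False, ax=ax)
    fig.tight_layout()
    return fig

# Made-up categorical data with hypothetical column names.
demo = pd.DataFrame({
    'cat_a': ['x', 'x', 'y', 'y', 'x', 'y'],
    'cat_b': ['p', 'q', 'p', 'q', 'p', 'p'],
    'cat_c': ['m', 'm', 'n', 'n', 'n', 'm'],
})
cat_fig = plot_categorical_pairs(demo, ['cat_a', 'cat_b', 'cat_c'])
```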

Explore, explore, explore! Use the functions you wrote above to create plots, and explore some more with other plots.

Test, test, test!