### Acquire

1. Acquire data from MySQL using the Python module to connect and query. You will want to end with a single dataframe. Make sure to include the logerror and all available fields related to the properties. You will end up using all the tables in the database.

    - Be sure to do the correct join (inner, outer, etc.). We do not want to eliminate properties purely because they may have a null value for airconditioningtypeid.
    - Only include properties with a transaction in 2017, and include only the last transaction for each property (so no duplicate property IDs), along with zestimate error and date of transaction.
    - Only include properties that include a latitude and longitude value.
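
A minimal sketch of the last two requirements done in pandas instead of SQL, on a made-up frame. The function name `latest_2017_transactions` and the toy rows are assumptions for illustration only; the notebook's own `acquire.get_zillow_data()` is not shown and may apply these filters in the query itself.

```python
import pandas as pd

# Toy stand-in for the joined query result; values are made up.
raw = pd.DataFrame({
    "parcelid": [1, 1, 2, 3, 4],
    "transactiondate": pd.to_datetime(
        ["2017-01-05", "2017-06-01", "2017-03-10", "2016-12-30", "2017-02-02"]),
    "logerror": [0.01, 0.02, -0.03, 0.04, 0.05],
    "latitude": [34.1, 34.1, None, 34.3, 34.4],
    "longitude": [-118.1, -118.1, -118.2, -118.3, -118.4],
})

def latest_2017_transactions(df):
    """Keep the last 2017 transaction per parcel, requiring latitude and longitude."""
    in_2017 = df[df["transactiondate"].dt.year == 2017]
    located = in_2017.dropna(subset=["latitude", "longitude"])
    # Sort by date so keep="last" retains the most recent transaction per parcel
    latest = (located.sort_values("transactiondate")
                     .drop_duplicates(subset="parcelid", keep="last"))
    return latest.reset_index(drop=True)

result = latest_2017_transactions(raw)
print(result)
```

Parcel 2 is dropped for its missing latitude, parcel 3 for its 2016 transaction, and parcel 1 keeps only its June transaction.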

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
from util import get_db_url
import acquire
import summarize

In [2]:
df = acquire.get_zillow_data()

In [4]:
df.head()

Unnamed: 0,county,tax_rate,id,parcelid,airconditioningtypeid,airconditioningdesc,architecturalstyletypeid,architecturalstyledesc,basementsqft,bathroomcnt,...,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,typeconstructiontypeid,typeconstructiondesc,censustractandblock,transactiondate,logerror,transactions
0,Ventura,0.012225,1387261,17052889,,,,,,1.0,...,376000.0,5672.48,,,,,61110010000000.0,2017-01-01,0.055619,1
1,Ventura,0.011133,43675,17110996,,,,,,2.5,...,99028.0,2204.84,,,,,61110050000000.0,2017-01-02,0.008669,1
2,Ventura,0.010838,2490820,17134185,,,,,,2.0,...,273509.0,4557.52,,,,,61110060000000.0,2017-01-03,0.05769,1
3,Ventura,0.018693,269618,17292247,,,,,,2.0,...,24808.0,1450.06,,,,,61110060000000.0,2017-01-03,-0.421908,1
4,Ventura,0.010678,74982,17141654,,,,,,3.0,...,126138.0,4139.18,,,,,61110050000000.0,2017-01-03,-0.021898,1


2. Summarize your data (summary stats, info, dtypes, shape, distributions, value_counts, etc.)

In [5]:
df.shape

(52169, 72)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52169 entries, 0 to 52168
Data columns (total 72 columns):
county                          52169 non-null object
tax_rate                        52164 non-null float64
id                              52169 non-null int64
parcelid                        52169 non-null int64
airconditioningtypeid           13605 non-null float64
airconditioningdesc             13605 non-null object
architecturalstyletypeid        70 non-null float64
architecturalstyledesc          70 non-null object
basementsqft                    47 non-null float64
bathroomcnt                     52169 non-null float64
bedroomcnt                      52169 non-null float64
buildingclasstypeid             0 non-null object
buildingclassdesc               0 non-null object
buildingqualitytypeid           33628 non-null float64
calculatedbathnbr               52153 non-null float64
calculatedfinishedsquarefeet    52161 non-null float64
decktypeid                      387 n

In [7]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tax_rate,52164.0,0.01336932,0.006682133,9.372442e-05,0.01163577,0.0123255,0.01368476,0.81649
id,52169.0,1496932.0,859434.1,349.0,757739.0,1500051.0,2241574.0,2982270.0
parcelid,52169.0,12988910.0,3214149.0,10711860.0,11508310.0,12577660.0,14128860.0,167639200.0
airconditioningtypeid,13605.0,2.440647,3.849141,1.0,1.0,1.0,1.0,13.0
architecturalstyletypeid,70.0,7.1,2.66567,2.0,7.0,7.0,7.0,21.0
basementsqft,47.0,678.9787,711.8252,38.0,263.5,512.0,809.5,3560.0
bathroomcnt,52169.0,2.305641,1.017982,1.0,2.0,2.0,3.0,18.0
bedroomcnt,52169.0,3.309475,0.9336865,1.0,3.0,3.0,4.0,14.0
buildingqualitytypeid,33628.0,6.265582,1.715854,1.0,5.0,6.0,8.0,12.0
calculatedbathnbr,52153.0,2.305658,1.017997,1.0,2.0,2.0,3.0,18.0


In [8]:
df.dtypes

county                           object
tax_rate                        float64
id                                int64
parcelid                          int64
airconditioningtypeid           float64
airconditioningdesc              object
architecturalstyletypeid        float64
architecturalstyledesc           object
basementsqft                    float64
bathroomcnt                     float64
bedroomcnt                      float64
buildingclasstypeid              object
buildingclassdesc                object
buildingqualitytypeid           float64
calculatedbathnbr               float64
calculatedfinishedsquarefeet    float64
decktypeid                      float64
finishedfloor1squarefeet        float64
finishedsquarefeet12            float64
finishedsquarefeet13             object
finishedsquarefeet15             object
finishedsquarefeet50            float64
finishedsquarefeet6             float64
fips                            float64
state                            object


In [9]:
df.id.value_counts()

657407     1
1181077    1
253358     1
1441194    1
910761     1
2841485    1
1555874    1
1727902    1
958481     1
2194056    1
1076632    1
660887     1
1969556    1
2442621    1
2768274    1
2114674    1
2142897    1
1889679    1
2335556    1
1244554    1
2135431    1
2792838    1
2930051    1
846670     1
811953     1
2174384    1
2825654    1
231285     1
1414631    1
324198     1
          ..
1613625    1
1636148    1
1361714    1
931203     1
1318761    1
1849196    1
2615190    1
1326957    1
391060     1
247698     1
113587     1
1562512    1
2568079    1
468878     1
343949     1
210828     1
200587     1
2688906    1
71560      1
1009542    1
2488189    1
1920899    1
2386655    1
2011007    1
1068175    1
2555765    1
2404       1
929647     1
2500462    1
2232322    1
Name: id, Length: 52169, dtype: int64

In [None]:
# Count the nulls in each column

In [10]:
df.isnull().sum()

county                              0
tax_rate                            5
id                                  0
parcelid                            0
airconditioningtypeid           38564
airconditioningdesc             38564
architecturalstyletypeid        52099
architecturalstyledesc          52099
basementsqft                    52122
bathroomcnt                         0
bedroomcnt                          0
buildingclasstypeid             52169
buildingclassdesc               52169
buildingqualitytypeid           18541
calculatedbathnbr                  16
calculatedfinishedsquarefeet        8
decktypeid                      51782
finishedfloor1squarefeet        47815
finishedsquarefeet12              166
finishedsquarefeet13            52169
finishedsquarefeet15            52169
finishedsquarefeet50            47815
finishedsquarefeet6             52011
fips                                0
state                               0
fireplacecnt                    44948
fullbathcnt 

In [None]:
# Count the nulls in each row

In [11]:
df.isnull().sum(axis=1)

0        33
1        33
2        33
3        34
4        33
5        33
6        32
7        33
8        34
9        31
10       32
11       33
12       33
13       32
14       29
15       32
16       30
17       29
18       34
19       34
20       31
21       32
22       33
23       33
24       33
25       33
26       35
27       31
28       32
29       32
         ..
52139    36
52140    34
52141    32
52142    37
52143    37
52144    36
52145    34
52146    32
52147    32
52148    37
52149    34
52150    33
52151    36
52152    32
52153    34
52154    35
52155    29
52156    35
52157    35
52158    34
52159    37
52160    35
52161    34
52162    29
52163    33
52164    36
52165    32
52166    36
52167    34
52168    37
Length: 52169, dtype: int64

In [None]:
# Use the custom summarize function to run a quick summary of all the data.

In [12]:
summarize.df_summary(df)

--- Shape: (52169, 72)
--- Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52169 entries, 0 to 52168
Data columns (total 72 columns):
county                          52169 non-null object
tax_rate                        52164 non-null float64
id                              52169 non-null int64
parcelid                        52169 non-null int64
airconditioningtypeid           13605 non-null float64
airconditioningdesc             13605 non-null object
architecturalstyletypeid        70 non-null float64
architecturalstyledesc          70 non-null object
basementsqft                    47 non-null float64
bathroomcnt                     52169 non-null float64
bedroomcnt                      52169 non-null float64
buildingclasstypeid             0 non-null object
buildingclassdesc               0 non-null object
buildingqualitytypeid           33628 non-null float64
calculatedbathnbr               52153 non-null float64
calculatedfinishedsquarefeet    52161 non-null float64
deckt

   num_cols_missing    pct_cols_missing  num_rows
0                23  31.944444444444443         2
1                24   33.33333333333333        12
2                25   34.72222222222222        11
3                26   36.11111111111111        30
4                27                37.5       177
5                28   38.88888888888889       389
6                29   40.27777777777778      2527
7                30   41.66666666666667      2194
8                31   43.05555555555556      5986
9                32   44.44444444444444      8880
10               33   45.83333333333333     11960
11               34   47.22222222222222     11151
12               35   48.61111111111111      3459
13               36                50.0      4121
14               37  51.388888888888886      1016
15               38   52.77777777777778       214
16               39  54.166666666666664        22
17               40   55.55555555555556        13
18               41   56.94444444444444         3


3. Write a function that takes in a dataframe of observations and attributes and returns a dataframe where each row is an attribute name, the first column is the number of rows with missing values for that attribute, and the second column is the percent of total rows that have missing values for that attribute. Run the function and document takeaways from this on how you want to handle missing values.

In [13]:
number_rows = df.isnull().sum()
number_rows

county                              0
tax_rate                            5
id                                  0
parcelid                            0
airconditioningtypeid           38564
airconditioningdesc             38564
architecturalstyletypeid        52099
architecturalstyledesc          52099
basementsqft                    52122
bathroomcnt                         0
bedroomcnt                          0
buildingclasstypeid             52169
buildingclassdesc               52169
buildingqualitytypeid           18541
calculatedbathnbr                  16
calculatedfinishedsquarefeet        8
decktypeid                      51782
finishedfloor1squarefeet        47815
finishedsquarefeet12              166
finishedsquarefeet13            52169
finishedsquarefeet15            52169
finishedsquarefeet50            47815
finishedsquarefeet6             52011
fips                                0
state                               0
fireplacecnt                    44948
fullbathcnt 

In [14]:
rows = df.shape[0]
rows

52169

In [15]:
pct_missing = number_rows/rows
pct_missing

county                          0.000000
tax_rate                        0.000096
id                              0.000000
parcelid                        0.000000
airconditioningtypeid           0.739213
airconditioningdesc             0.739213
architecturalstyletypeid        0.998658
architecturalstyledesc          0.998658
basementsqft                    0.999099
bathroomcnt                     0.000000
bedroomcnt                      0.000000
buildingclasstypeid             1.000000
buildingclassdesc               1.000000
buildingqualitytypeid           0.355403
calculatedbathnbr               0.000307
calculatedfinishedsquarefeet    0.000153
decktypeid                      0.992582
finishedfloor1squarefeet        0.916540
finishedsquarefeet12            0.003182
finishedsquarefeet13            1.000000
finishedsquarefeet15            1.000000
finishedsquarefeet50            0.916540
finishedsquarefeet6             0.996971
fips                            0.000000
state           

In [16]:
def nulls_by_col(df):
    num_missing = df.isnull().sum()
    rows = df.shape[0]
    pct_missing = num_missing/rows
    cols_missing = pd.DataFrame({'num_rows_missing': num_missing, 'pct_rows_missing': pct_missing})
    return cols_missing

In [17]:
nulls_by_col(df)

Unnamed: 0,num_rows_missing,pct_rows_missing
county,0,0.000000
tax_rate,5,0.000096
id,0,0.000000
parcelid,0,0.000000
airconditioningtypeid,38564,0.739213
airconditioningdesc,38564,0.739213
architecturalstyletypeid,52099,0.998658
architecturalstyledesc,52099,0.998658
basementsqft,52122,0.999099
bathroomcnt,0,0.000000
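
One way to turn this summary into a takeaway is to list the columns whose missing share exceeds some cutoff. A minimal sketch on a made-up frame; the 0.5 cutoff and the name `mostly_missing` are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

def nulls_by_col(df):
    # Same logic as the function above, repeated so the example runs on its own
    num_missing = df.isnull().sum()
    pct_missing = num_missing / df.shape[0]
    return pd.DataFrame({"num_rows_missing": num_missing,
                         "pct_rows_missing": pct_missing})

# Toy data; values are made up
toy = pd.DataFrame({
    "bedroomcnt": [3, 4, 2, 3],
    "basementsqft": [np.nan, np.nan, np.nan, 512.0],
    "buildingqualitytypeid": [6.0, np.nan, 8.0, 5.0],
})

summary = nulls_by_col(toy)
# 0.5 is an illustrative cutoff only
mostly_missing = summary[summary["pct_rows_missing"] > 0.5].index.tolist()
print(mostly_missing)
```

Columns on such a list are candidates for dropping; columns with only a few missing rows are better candidates for imputation.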


4. Write a function that takes in a dataframe and returns a dataframe with 3 columns: the number of columns missing, percent of columns missing, and number of rows with n columns missing. Run the function and document takeaways from this on how you want to handle missing values.

In [18]:
num_cols_missing = df.isnull().sum(axis=1)
num_cols_missing

0        33
1        33
2        33
3        34
4        33
5        33
6        32
7        33
8        34
9        31
10       32
11       33
12       33
13       32
14       29
15       32
16       30
17       29
18       34
19       34
20       31
21       32
22       33
23       33
24       33
25       33
26       35
27       31
28       32
29       32
         ..
52139    36
52140    34
52141    32
52142    37
52143    37
52144    36
52145    34
52146    32
52147    32
52148    37
52149    34
52150    33
52151    36
52152    32
52153    34
52154    35
52155    29
52156    35
52157    35
52158    34
52159    37
52160    35
52161    34
52162    29
52163    33
52164    36
52165    32
52166    36
52167    34
52168    37
Length: 52169, dtype: int64

In [19]:
pct_cols_missing = df.isnull().sum(axis=1)/df.shape[1]*100
pct_cols_missing

0        45.833333
1        45.833333
2        45.833333
3        47.222222
4        45.833333
5        45.833333
6        44.444444
7        45.833333
8        47.222222
9        43.055556
10       44.444444
11       45.833333
12       45.833333
13       44.444444
14       40.277778
15       44.444444
16       41.666667
17       40.277778
18       47.222222
19       47.222222
20       43.055556
21       44.444444
22       45.833333
23       45.833333
24       45.833333
25       45.833333
26       48.611111
27       43.055556
28       44.444444
29       44.444444
           ...    
52139    50.000000
52140    47.222222
52141    44.444444
52142    51.388889
52143    51.388889
52144    50.000000
52145    47.222222
52146    44.444444
52147    44.444444
52148    51.388889
52149    47.222222
52150    45.833333
52151    50.000000
52152    44.444444
52153    47.222222
52154    48.611111
52155    40.277778
52156    48.611111
52157    48.611111
52158    47.222222
52159    51.388889
52160    48.

In [20]:
def nulls_by_row(df):
    num_cols_missing = df.isnull().sum(axis=1)
    pct_cols_missing = num_cols_missing/df.shape[1]*100
    rows_missing = (
        pd.DataFrame({'num_cols_missing': num_cols_missing, 'pct_cols_missing': pct_cols_missing})
        .reset_index().groupby(['num_cols_missing','pct_cols_missing']).count()
        .rename(index=str, columns={'index': 'num_rows'}).reset_index()
    )
    return rows_missing

In [21]:
nulls_by_row(df)

Unnamed: 0,num_cols_missing,pct_cols_missing,num_rows
0,23,31.944444444444443,2
1,24,33.33333333333333,12
2,25,34.72222222222222,11
3,26,36.11111111111111,30
4,27,37.5,177
5,28,38.88888888888889,389
6,29,40.27777777777778,2527
7,30,41.66666666666667,2194
8,31,43.05555555555556,5986
9,32,44.44444444444444,8880


### Prepare

1. Remove any properties that are likely to be something other than single unit properties. (e.g. no duplexes, no land/lot, ...). There are multiple ways to estimate that a property is a single unit, and there is not a single "right" answer. But for this exercise, do not purely filter by unitcnt as we did previously. Add some new logic that will reduce the number of properties that are falsely removed. You might want to use # bedrooms, square feet, unit type or the like to then identify those with unitcnt not defined.

**Brought in only Single Family Residential properties in my SQL query. Used 'propertylandusetypeid' to get this.**
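
For comparison, a minimal sketch of the kind of logic the exercise describes: keep rows with `unitcnt` equal to 1, and also keep rows where `unitcnt` is missing but bedrooms and square footage look like a home. The function name `likely_single_unit`, the `min_sqft=500` cutoff, and the toy rows are assumptions for illustration; this is not the filter used in this notebook, which relied on `propertylandusetypeid` in the SQL query.

```python
import numpy as np
import pandas as pd

# Toy data; values are made up
toy = pd.DataFrame({
    "parcelid": [1, 2, 3, 4, 5],
    "unitcnt": [1.0, 2.0, np.nan, np.nan, np.nan],
    "bedroomcnt": [3.0, 4.0, 3.0, 0.0, 2.0],
    "calculatedfinishedsquarefeet": [1500.0, 2400.0, 1800.0, 900.0, np.nan],
})

def likely_single_unit(df, min_sqft=500):
    """Keep unitcnt == 1, plus unitcnt-missing rows that look like a home."""
    known_single = df["unitcnt"] == 1
    unknown_units = df["unitcnt"].isnull()
    # Comparisons against NaN evaluate to False, so rows missing sqft are not kept
    looks_residential = (df["bedroomcnt"] > 0) & (
        df["calculatedfinishedsquarefeet"] >= min_sqft)
    return df[known_single | (unknown_units & looks_residential)]

kept = likely_single_unit(toy)
print(kept["parcelid"].tolist())
```

Parcel 3 is the case the exercise cares about: it has no `unitcnt`, so a pure `unitcnt == 1` filter would remove it even though it looks like a single home.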

2. Create a function that will drop rows or columns based on the percent of values that are missing: handle_missing_values(df, prop_required_column, prop_required_row).

 - The input:
    - A dataframe
    - A number between 0 and 1 that represents the proportion, for each column, of rows with non-missing values required to keep the column. For example, if prop_required_column = .6, then you are requiring a column to have at least 60% of values not-NA (no more than 40% missing).
    - A number between 0 and 1 that represents the proportion, for each row, of columns/variables with non-missing values required to keep the row. For example, if prop_required_row = .75, then you are requiring a row to have at least 75% of variables with a non-missing value (no more than 25% missing).
 - The output:
    - The dataframe with the columns and rows dropped as indicated. Be sure to drop the columns prior to the rows in your function.
 - hint:
    - Look up the dropna documentation.
    - You will want to compute a threshold from your input values (prop_required) and total number of rows or columns.
    - Make use of inplace, i.e. inplace=True/False.

In [None]:
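def handle_missing_values(df, prop_required_column = .5, prop_required_row = .75):
    # Note: both dropna calls use inplace=True, so the caller's dataframe is modified.
    # Drop columns first: keep a column only if it has at least this many non-null rows.
    threshold = int(round(prop_required_column*len(df.index),0))
    df.dropna(axis=1, thresh=threshold, inplace=True)
    # Then drop rows: keep a row only if it has at least this many non-null columns,
    # counted against the columns that survived the step above.
    threshold = int(round(prop_required_row*len(df.columns),0))
    df.dropna(axis=0, thresh=threshold, inplace=True)
    return df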
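
A worked example of the threshold arithmetic on a made-up four-row frame. `drop_sparse` is a hypothetical, non-mutating variant written only for this illustration; it uses the same `dropna(thresh=...)` calls but returns a new dataframe instead of using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Toy data; values are made up
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
    "d": [1.0, 2.0, np.nan, 4.0],
})

def drop_sparse(df, prop_required_column=0.5, prop_required_row=0.75):
    # Columns: 0.5 * 4 rows -> at least 2 non-null values required
    col_thresh = int(round(prop_required_column * len(df.index), 0))
    out = df.dropna(axis=1, thresh=col_thresh)
    # Rows: 0.75 * 3 remaining columns = 2.25 -> rounds to 2 non-null values required
    row_thresh = int(round(prop_required_row * len(out.columns), 0))
    return out.dropna(axis=0, thresh=row_thresh)

cleaned = drop_sparse(toy)
print(cleaned)
```

Column `b` has only one non-null value and is dropped first; the row at index 2 then has a single non-null value left and is dropped as well. The original `toy` frame is left untouched.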

3. Decide how to handle the remaining missing values:

    - Fill with constant value.
    - Impute with mean, median, mode.
    - Drop row/column
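
A minimal sketch of the first two options on a made-up frame. Which column gets which treatment is an assumption for illustration (zero for a count, median for square footage, mode for a category), not a decision recorded in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data; values are made up
toy = pd.DataFrame({
    "fireplacecnt": [1.0, np.nan, np.nan, 2.0],
    "calculatedfinishedsquarefeet": [1200.0, 1500.0, np.nan, 2100.0],
    "county": ["Ventura", None, "Ventura", "Orange"],
})

filled = toy.copy()

# Constant: treat a missing fireplace count as zero fireplaces
filled["fireplacecnt"] = filled["fireplacecnt"].fillna(0)

# Median: less sensitive to very large homes than the mean
median_sqft = filled["calculatedfinishedsquarefeet"].median()
filled["calculatedfinishedsquarefeet"] = (
    filled["calculatedfinishedsquarefeet"].fillna(median_sqft))

# Mode: the most frequent category; mode() returns a Series, so take the first entry
filled["county"] = filled["county"].fillna(filled["county"].mode()[0])

print(filled)
```

When the data will later be split for modeling, the fill values would normally be computed from the training portion only.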

In [None]:
def fill_zero(df, cols):
    # Fill missing values with 0 only in the requested columns
    df[cols] = df[cols].fillna(value=0)
    return df

In [None]:
def remove_columns(df, cols_to_remove):  
    df = df.drop(columns=cols_to_remove)
    return df

In [None]:
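def data_prep(df, cols_to_remove=None, prop_required_column=.5, prop_required_row=.75):
    # Default to None rather than a mutable [] default; treat None as "remove nothing"
    cols_to_remove = cols_to_remove or []
    df = remove_columns(df, cols_to_remove)
    df = handle_missing_values(df, prop_required_column, prop_required_row)
    return df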

##### wrangle_zillow.py
The functions from the work above that are needed to acquire and prepare a new sample of data.