## Purpose:

According to an article from [brookfieldresidential.com](https://stories.brookfieldresidential.com/homebuyersschool/duplex-vs.-single-family-home-whats-the-difference-and-which-one-should-i-invest-in), duplexes might be more highly valued than single-family homes. Using Seattle-area real estate data, I'll run some analysis to test the accuracy of this claim.

In [1]:
#add auto reload for src function testing
%load_ext autoreload
%autoreload 2

#let's add the project directory to our module path
import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
    
#also import all of our modules
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from src import data_cleaning

#and here is our data directory
data_folder = '../../data/'

Let's open our data and see what it looks like. I'm looking for both housing prices and whether or not they are duplexes.

## Optional

The following cells take a while to run because the CSV files are very large and contain data that we don't need. We solved this by creating new CSV files that contain only 2019 data. Read through this section to see how we did that, and uncomment the code if you want to follow along. Otherwise, skip ahead to the EXTR_ResBldg.csv header.

In [2]:
data_folder = '../../data/'


# res_bldg = pd.read_csv(data_folder+'EXTR_ResBldg.csv')

# rp_sale = pd.read_csv(data_folder+'EXTR_RPSale.csv')
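
If loading the full files is too slow, one option (not what this notebook does) is to ask `pd.read_csv` for only the columns of interest. The sketch below uses a tiny in-memory stand-in for EXTR_RPSale.csv; the `SellerName` values are placeholders, and the choice of columns is an assumption for illustration.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for EXTR_RPSale.csv (SellerName values are placeholders).
raw = io.StringIO(
    "ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,SellerName\n"
    "714942,284150,10,04/06/1983,91500,A\n"
    "2999169,919715,200,07/08/2019,192000,B\n"
)

# Assumed subset of columns needed for a price analysis.
cols = ["Major", "Minor", "DocumentDate", "SalePrice"]

sales = pd.read_csv(
    raw,
    usecols=cols,                        # skip columns we never use
    dtype={"Major": str, "Minor": str},  # keep parcel identifiers as text
    parse_dates=["DocumentDate"],        # convert dates while reading
)
```

With the real file, `raw` would be replaced by `data_folder+'EXTR_RPSale.csv'`.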



### EXTR_RPSale.csv

It looks like this data set contains the sale price as well as some interesting characteristics of the property, like whether it's historic or not.

In [3]:
# rp_sale.info()
# rp_sale.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087944 entries, 0 to 2087943
Data columns (total 24 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   ExciseTaxNbr        int64 
 1   Major               object
 2   Minor               object
 3   DocumentDate        object
 4   SalePrice           int64 
 5   RecordingNbr        object
 6   Volume              object
 7   Page                object
 8   PlatNbr             object
 9   PlatType            object
 10  PlatLot             object
 11  PlatBlock           object
 12  SellerName          object
 13  BuyerName           object
 14  PropertyType        int64 
 15  PrincipalUse        int64 
 16  SaleInstrument      int64 
 17  AFForestLand        object
 18  AFCurrentUseLand    object
 19  AFNonProfitUse      object
 20  AFHistoricProperty  object
 21  SaleReason          int64 
 22  PropertyClass       int64 
dtypes: int64(7), object(17)
memory usage: 382.3+ MB


Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,714942,284150,10,04/06/1983,91500,198304110267,100.0,57.0,284150.0,P,...,2,6,2,N,N,N,,1,8,
1,1729614,172204,9157,12/21/1999,0,19991229001498,,,,,...,3,11,15,N,N,N,N,10,8,31 45
2,1729614,172204,9005,12/21/1999,0,19991229001498,,,,,...,3,11,15,N,N,N,N,10,8,31 45
3,2254430,192304,9020,12/05/2006,0,20061207002200,,,,,...,3,11,15,N,N,N,N,18,2,18 45
4,685277,885730,120,08/11/1982,0,198208170380,86.0,75.0,885730.0,P,...,3,2,15,N,N,N,,1,3,11


### Date
I know I only want to look at records from 2019, so I'm going to filter the data to only include entries from that year.

First I'll convert document date to datetime:

In [4]:
# rp_sale['DocumentDate'] = pd.to_datetime(rp_sale['DocumentDate'])

# rp_sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087944 entries, 0 to 2087943
Data columns (total 24 columns):
 #   Column              Dtype         
---  ------              -----         
 0   ExciseTaxNbr        int64         
 1   Major               object        
 2   Minor               object        
 3   DocumentDate        datetime64[ns]
 4   SalePrice           int64         
 5   RecordingNbr        object        
 6   Volume              object        
 7   Page                object        
 8   PlatNbr             object        
 9   PlatType            object        
 10  PlatLot             object        
 11  PlatBlock           object        
 12  SellerName          object        
 13  BuyerName           object        
 14  PropertyType        int64         
 15  PrincipalUse        int64         
 16  SaleInstrument      int64         
 17  AFForestLand        object        
 18  AFCurrentUseLand    object        
 19  AFNonProfitUse      object        
 20  AF

Now I'm going to create a function so that I can make a dataframe mask using apply. It will check whether the year attribute of a datetime object equals 2019 and, if so, it will return True, otherwise False.

In [5]:
# def in_2019(dateTime):
#     if dateTime.year == 2019:
#         return True
#     else:
#         return False

In [6]:
#note: this cell needs rp_sale and in_2019 from the optional cells above, so uncomment those first
mask_2019 = rp_sale['DocumentDate'].apply(in_2019)


rp_sale_2019 = rp_sale[mask_2019]
rp_sale_2019.info()
rp_sale_2019.head(20)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61351 entries, 72 to 2087942
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   ExciseTaxNbr        61351 non-null  int64         
 1   Major               61351 non-null  object        
 2   Minor               61351 non-null  object        
 3   DocumentDate        61351 non-null  datetime64[ns]
 4   SalePrice           61351 non-null  int64         
 5   RecordingNbr        61351 non-null  object        
 6   Volume              61351 non-null  object        
 7   Page                61351 non-null  object        
 8   PlatNbr             61351 non-null  object        
 9   PlatType            61351 non-null  object        
 10  PlatLot             61351 non-null  object        
 11  PlatBlock           61351 non-null  object        
 12  SellerName          61351 non-null  object        
 13  BuyerName           61351 non-null  object 

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
72,2999169,919715,200,2019-07-08,192000,20190712001080.0,,,,,...,3,2,3,N,N,N,N,1,3,
236,3000673,894444,200,2019-06-26,185000,20190722001395.0,,,,,...,3,2,3,N,N,N,N,1,3,
257,3027422,213043,120,2019-12-20,560000,20191226000848.0,,,,,...,11,6,3,N,N,N,N,1,8,
302,3002257,940652,630,2019-07-22,435000,20190730001339.0,,,,,...,11,6,3,N,N,N,N,1,8,
446,3018109,152504,9008,2019-10-18,7600000,20191030001615.0,,,,,...,3,7,3,N,N,N,N,1,2,
465,2993601,140281,20,2019-06-04,450000,20190614000489.0,,,,,...,3,6,3,N,N,N,N,1,8,
482,3015516,779790,30,2019-10-07,0,20191016000009.0,,,,,...,11,6,3,N,N,N,N,1,8,
586,3031504,766620,3538,2019-12-30,0,20200128000956.0,,,,,...,51,7,15,N,N,N,N,18,2,
594,3015264,124550,98,2019-09-27,193000,20191015000395.0,,,,,...,3,6,15,N,N,N,N,18,8,18 51 52
599,2980648,797320,2320,2019-03-27,540000,,,,,,...,3,6,3,N,N,N,N,1,8,
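
As an aside, the same kind of mask can be built without `apply` by using the `.dt.year` accessor on a datetime column. This is a minimal sketch on a toy frame that stands in for `rp_sale`; the three rows are copied from the previews in this notebook.

```python
import pandas as pd

# Toy stand-in for rp_sale with a datetime DocumentDate column.
toy = pd.DataFrame({
    "DocumentDate": pd.to_datetime(["04/06/1983", "07/08/2019", "12/20/2019"]),
    "SalePrice": [91500, 192000, 560000],
})

# Vectorized comparison: one boolean per row, True where the year is 2019.
mask_2019 = toy["DocumentDate"].dt.year == 2019

# Keep only the rows where the mask is True.
toy_2019 = toy[mask_2019]
```

On a column with about two million rows, the vectorized comparison usually runs much faster than calling a Python function on every row.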


I want to use this dataframe for all my analysis, so I'm going to export it as a csv, and I'll create a function to automatically do this as well.

In [7]:
#uncomment line below to create file

#rp_sale_2019.to_csv(data_folder+'EXTR_RPSale_2019.csv')

## Function testing:

In [8]:
# rp_sale_2019 = data_cleaning.filter_data_by_year(rp_sale)
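
The body of `data_cleaning.filter_data_by_year` isn't shown in this notebook. As a rough idea of what a helper like this could look like, here is a hypothetical sketch; the name `filter_by_year_sketch` and its parameters are my own assumptions, not the actual code in `src`.

```python
import pandas as pd

def filter_by_year_sketch(df, year=2019, date_col="DocumentDate"):
    """Hypothetical helper: return the rows of df whose date_col falls in `year`."""
    # Convert to datetime here so the helper also accepts raw string dates.
    dates = pd.to_datetime(df[date_col])
    # Copy so later edits to the result don't touch the original frame.
    return df[dates.dt.year == year].copy()

# Toy stand-in for rp_sale.
toy = pd.DataFrame({
    "DocumentDate": ["04/06/1983", "07/08/2019"],
    "SalePrice": [91500, 192000],
})
toy_2019 = filter_by_year_sketch(toy)
```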

In [9]:
# rp_sale_2019.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61351 entries, 72 to 2087942
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   ExciseTaxNbr        61351 non-null  int64         
 1   Major               61351 non-null  object        
 2   Minor               61351 non-null  object        
 3   DocumentDate        61351 non-null  datetime64[ns]
 4   SalePrice           61351 non-null  int64         
 5   RecordingNbr        61351 non-null  object        
 6   Volume              61351 non-null  object        
 7   Page                61351 non-null  object        
 8   PlatNbr             61351 non-null  object        
 9   PlatType            61351 non-null  object        
 10  PlatLot             61351 non-null  object        
 11  PlatBlock           61351 non-null  object        
 12  SellerName          61351 non-null  object        
 13  BuyerName           61351 non-null  object 

In [11]:
# data_cleaning.create_2019_sale_csv(rp_sale)

In [12]:
# rp_sale = pd.read_csv(data_folder+'EXTR_RPSale_2019.csv')

In [13]:
# rp_sale.head()

Unnamed: 0.1,Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,72,2999169,919715,200,2019-07-08,192000,20190712001080,,,,...,3,2,3,N,N,N,N,1,3,
1,236,3000673,894444,200,2019-06-26,185000,20190722001395,,,,...,3,2,3,N,N,N,N,1,3,
2,257,3027422,213043,120,2019-12-20,560000,20191226000848,,,,...,11,6,3,N,N,N,N,1,8,
3,302,3002257,940652,630,2019-07-22,435000,20190730001339,,,,...,11,6,3,N,N,N,N,1,8,
4,446,3018109,152504,9008,2019-10-18,7600000,20191030001615,,,,...,3,7,3,N,N,N,N,1,2,
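
The extra `Unnamed: 0` column holding 72, 236, ... appears to be the old row index, which `to_csv` writes by default and `read_csv` then loads as a regular column. Here is a small sketch of two ways to avoid that, using an in-memory buffer and a toy frame instead of the real file.

```python
import io
import pandas as pd

# Toy stand-in for rp_sale_2019; the index labels mimic the filtered rows.
toy = pd.DataFrame({"SalePrice": [192000, 185000]}, index=[72, 236])

# Default export writes the index, which comes back as an 'Unnamed: 0' column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Option 1: leave the index out when exporting.
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Option 2: keep the index in the file and restore it when loading.
restored = pd.read_csv(io.StringIO(toy.to_csv()), index_col=0)
```

Either option keeps the reloaded frame's columns identical to the ones that were exported.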


## EXTR_ResBldg.csv

I might need to do the same thing for res_bldg as well, so let's check that one out too.

In [None]:
#note: res_bldg is only defined if the read_csv line in the Optional section was uncommented
res_bldg.info()
res_bldg.head()

It looks like it's just information about the building itself, so I'm not worried about filtering this by date.