## Purpose:

According to an article from [brookfieldresidential.com](https://stories.brookfieldresidential.com/homebuyersschool/duplex-vs.-single-family-home-whats-the-difference-and-which-one-should-i-invest-in), duplexes might be more highly valued than single-family homes. Using Seattle-area real estate data, I'll run some analysis to assess the accuracy of this claim.

In [1]:
#add auto reload for src function testing
%load_ext autoreload
%autoreload 2

#let's add the project directory to our module path
import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
    
#also import all of our modules
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from src import data_cleaning

#and here is our data directory
data_folder = '../../data/'

Let's open our data and see what it looks like. I'm looking for both sale prices and whether or not each property is a duplex.

## Optional

The following lines of code take a while to run because the csv files are very large and contain data that we don't need. We solved this by creating new csv files that contain only the 2019 data. Read through the following section to see how we did that, and uncomment the code if you want to follow along. Otherwise, skip ahead to the "2019 sales data import" header.

### EXTR_RPSale.csv

This data set contains the sale price as well as some interesting characteristics of the property, such as whether it's historic.

In [2]:
rp_sale = pd.read_csv(data_folder+'EXTR_RPSale.csv',dtype={'Major': 'str', 'Minor':'str'})
# rp_sale.info()
# rp_sale.head()

In [3]:
rp_sale.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,714942,284150,10,04/06/1983,91500,198304110267,100.0,57.0,284150.0,P,...,2,6,2,N,N,N,,1,8,
1,1729614,172204,9157,12/21/1999,0,19991229001498,,,,,...,3,11,15,N,N,N,N,10,8,31 45
2,1729614,172204,9005,12/21/1999,0,19991229001498,,,,,...,3,11,15,N,N,N,N,10,8,31 45
3,2254430,192304,9020,12/05/2006,0,20061207002200,,,,,...,3,11,15,N,N,N,N,18,2,18 45
4,685277,885730,120,08/11/1982,0,198208170380,86.0,75.0,885730.0,P,...,3,2,15,N,N,N,,1,3,11


### Date
I only want to look at records from 2019, so I'm going to filter the data to include only entries from that year.

First I'll convert document date to datetime:

In [3]:
# rp_sale['DocumentDate'] = pd.to_datetime(rp_sale['DocumentDate'])

# rp_sale.info()

Now I'm going to create a function so that I can make a dataframe mask using apply. It will check whether the year attribute of a datetime object equals 2019 and, if so, it will return True, otherwise False.

In [4]:
# def in_2019(dateTime):
#     if dateTime.year == 2019:
#         return True
#     else:
#         return False

In [5]:
# mask_2019 = rp_sale['DocumentDate'].apply(in_2019)


# rp_sale_2019 = rp_sale[mask_2019]
# rp_sale_2019.info()
# rp_sale_2019.head(20)
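
As an alternative to the `apply`-based mask, the same filter can be written with the vectorized `.dt` accessor. This is a minimal sketch on a toy dataframe; the `sales` name and its rows are stand-ins for `rp_sale`, not the real data.

```python
import pandas as pd

# Toy stand-in for rp_sale; the rows are illustrative only
sales = pd.DataFrame({
    "DocumentDate": ["04/06/1983", "07/08/2019", "12/20/2019", "01/28/2020"],
    "SalePrice": [91500, 192000, 560000, 0],
})

# Parse the MM/DD/YYYY strings into datetimes
sales["DocumentDate"] = pd.to_datetime(sales["DocumentDate"], format="%m/%d/%Y")

# Build the boolean mask directly from the year of each date
mask_2019 = sales["DocumentDate"].dt.year == 2019
sales_2019 = sales[mask_2019]
print(sales_2019)
```

On a large file this avoids calling a Python function once per row, which is usually noticeably faster.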

I want to use this dataframe for all my analysis, so I'm going to export it as a csv, and I'll create a function to automatically do this as well.

In [6]:
#uncomment line below to create file

#rp_sale_2019.to_csv(data_folder+'EXTR_RPSale_2019.csv')
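
One side effect worth knowing about: by default `to_csv` also writes the dataframe index, and that index comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a toy dataframe, using an in-memory buffer in place of a real file; passing `index=False` is one possible way to avoid the extra column, not something the original export does.

```python
import io
import pandas as pd

# Toy dataframe; the values are illustrative only
df = pd.DataFrame({"Major": ["919715"], "Minor": ["0200"], "SalePrice": [192000]})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'Major', 'Minor', 'SalePrice']

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Major', 'Minor', 'SalePrice']
```

If the export is changed this way, any later step that drops `Unnamed: 0` would need to be removed as well.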

### 2019 data import function testing:

In [7]:
# rp_sale_2019 = data_cleaning.filter_data_by_year(rp_sale)

In [8]:
# rp_sale_2019.info()

In [9]:
# data_cleaning.create_2019_sale_csv(rp_sale)

## 2019 sales data import

In [10]:
rp_sale = pd.read_csv(data_folder+'EXTR_RPSale_2019.csv')

In [11]:
rp_sale.info()
rp_sale.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61351 entries, 0 to 61350
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          61351 non-null  int64 
 1   ExciseTaxNbr        61351 non-null  int64 
 2   Major               61351 non-null  int64 
 3   Minor               61351 non-null  int64 
 4   DocumentDate        61351 non-null  object
 5   SalePrice           61351 non-null  int64 
 6   RecordingNbr        61351 non-null  object
 7   Volume              61351 non-null  object
 8   Page                61351 non-null  object
 9   PlatNbr             61351 non-null  object
 10  PlatType            61351 non-null  object
 11  PlatLot             61351 non-null  object
 12  PlatBlock           61351 non-null  object
 13  SellerName          61351 non-null  object
 14  BuyerName           61351 non-null  object
 15  PropertyType        61351 non-null  int64 
 16  PrincipalUse        61

Unnamed: 0.1,Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,72,2999169,919715,200,2019-07-08,192000,20190712001080.0,,,,...,3,2,3,N,N,N,N,1,3,
1,236,3000673,894444,200,2019-06-26,185000,20190722001395.0,,,,...,3,2,3,N,N,N,N,1,3,
2,257,3027422,213043,120,2019-12-20,560000,20191226000848.0,,,,...,11,6,3,N,N,N,N,1,8,
3,302,3002257,940652,630,2019-07-22,435000,20190730001339.0,,,,...,11,6,3,N,N,N,N,1,8,
4,446,3018109,152504,9008,2019-10-18,7600000,20191030001615.0,,,,...,3,7,3,N,N,N,N,1,2,
5,465,2993601,140281,20,2019-06-04,450000,20190614000489.0,,,,...,3,6,3,N,N,N,N,1,8,
6,482,3015516,779790,30,2019-10-07,0,20191016000009.0,,,,...,11,6,3,N,N,N,N,1,8,
7,586,3031504,766620,3538,2019-12-30,0,20200128000956.0,,,,...,51,7,15,N,N,N,N,18,2,
8,594,3015264,124550,98,2019-09-27,193000,20191015000395.0,,,,...,3,6,15,N,N,N,N,18,8,18 51 52
9,599,2980648,797320,2320,2019-03-27,540000,,,,,...,3,6,3,N,N,N,N,1,8,


## Creating the PIN column

The unique identifier for each piece of land is the PIN, which is made up of the Major and Minor columns. I'll combine the Major and Minor columns into a PIN column so that I can easily join the sales data with other tables. First I'll convert both columns to strings, then I'll zero-pad them (Major to 6 characters, Minor to 4), and finally I'll concatenate them.
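
To make the target format concrete, here is a small worked example on plain integers. The first pair of values comes from the first row of the 2019 sales data; the second pair is just an example of a Major with fewer than six digits.

```python
# Major and Minor from the first row of the 2019 sales data
major = 919715
minor = 200

# Zero-pad Major to 6 digits and Minor to 4 digits, then concatenate
pin = f"{major:06d}{minor:04d}"
print(pin)        # 9197150200

# A shorter Major is padded on the left as well
short_pin = f"{9800:06d}{440:04d}"
print(short_pin)  # 0098000440
```

Every PIN built this way is exactly 10 characters long, which is why it has to be stored as a string rather than an integer.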

In [12]:
rp_sale['Major'] = rp_sale['Major'].apply(str)
rp_sale['Minor'] = rp_sale['Minor'].apply(str)

In [13]:
rp_sale.info()
rp_sale.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61351 entries, 0 to 61350
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          61351 non-null  int64 
 1   ExciseTaxNbr        61351 non-null  int64 
 2   Major               61351 non-null  object
 3   Minor               61351 non-null  object
 4   DocumentDate        61351 non-null  object
 5   SalePrice           61351 non-null  int64 
 6   RecordingNbr        61351 non-null  object
 7   Volume              61351 non-null  object
 8   Page                61351 non-null  object
 9   PlatNbr             61351 non-null  object
 10  PlatType            61351 non-null  object
 11  PlatLot             61351 non-null  object
 12  PlatBlock           61351 non-null  object
 13  SellerName          61351 non-null  object
 14  BuyerName           61351 non-null  object
 15  PropertyType        61351 non-null  int64 
 16  PrincipalUse        61

Unnamed: 0.1,Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,72,2999169,919715,200,2019-07-08,192000,20190712001080.0,,,,...,3,2,3,N,N,N,N,1,3,
1,236,3000673,894444,200,2019-06-26,185000,20190722001395.0,,,,...,3,2,3,N,N,N,N,1,3,
2,257,3027422,213043,120,2019-12-20,560000,20191226000848.0,,,,...,11,6,3,N,N,N,N,1,8,
3,302,3002257,940652,630,2019-07-22,435000,20190730001339.0,,,,...,11,6,3,N,N,N,N,1,8,
4,446,3018109,152504,9008,2019-10-18,7600000,20191030001615.0,,,,...,3,7,3,N,N,N,N,1,2,
5,465,2993601,140281,20,2019-06-04,450000,20190614000489.0,,,,...,3,6,3,N,N,N,N,1,8,
6,482,3015516,779790,30,2019-10-07,0,20191016000009.0,,,,...,11,6,3,N,N,N,N,1,8,
7,586,3031504,766620,3538,2019-12-30,0,20200128000956.0,,,,...,51,7,15,N,N,N,N,18,2,
8,594,3015264,124550,98,2019-09-27,193000,20191015000395.0,,,,...,3,6,15,N,N,N,N,18,8,18 51 52
9,599,2980648,797320,2320,2019-03-27,540000,,,,,...,3,6,3,N,N,N,N,1,8,


In [14]:
# pad Major to 6 characters and Minor to 4 characters with leading zeros

rp_sale['Major'] = rp_sale['Major'].apply(lambda elem: elem.rjust(6, '0'))
rp_sale['Minor'] = rp_sale['Minor'].apply(lambda elem: elem.rjust(4, '0'))

rp_sale.head(10)

Unnamed: 0.1,Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,72,2999169,919715,200,2019-07-08,192000,20190712001080.0,,,,...,3,2,3,N,N,N,N,1,3,
1,236,3000673,894444,200,2019-06-26,185000,20190722001395.0,,,,...,3,2,3,N,N,N,N,1,3,
2,257,3027422,213043,120,2019-12-20,560000,20191226000848.0,,,,...,11,6,3,N,N,N,N,1,8,
3,302,3002257,940652,630,2019-07-22,435000,20190730001339.0,,,,...,11,6,3,N,N,N,N,1,8,
4,446,3018109,152504,9008,2019-10-18,7600000,20191030001615.0,,,,...,3,7,3,N,N,N,N,1,2,
5,465,2993601,140281,20,2019-06-04,450000,20190614000489.0,,,,...,3,6,3,N,N,N,N,1,8,
6,482,3015516,779790,30,2019-10-07,0,20191016000009.0,,,,...,11,6,3,N,N,N,N,1,8,
7,586,3031504,766620,3538,2019-12-30,0,20200128000956.0,,,,...,51,7,15,N,N,N,N,18,2,
8,594,3015264,124550,98,2019-09-27,193000,20191015000395.0,,,,...,3,6,15,N,N,N,N,18,8,18 51 52
9,599,2980648,797320,2320,2019-03-27,540000,,,,,...,3,6,3,N,N,N,N,1,8,


In [15]:
rp_sale['PIN'] = rp_sale['Major'] + rp_sale['Minor']
rp_sale.head(10)

Unnamed: 0.1,Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,...,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning,PIN
0,72,2999169,919715,200,2019-07-08,192000,20190712001080.0,,,,...,2,3,N,N,N,N,1,3,,9197150200
1,236,3000673,894444,200,2019-06-26,185000,20190722001395.0,,,,...,2,3,N,N,N,N,1,3,,8944440200
2,257,3027422,213043,120,2019-12-20,560000,20191226000848.0,,,,...,6,3,N,N,N,N,1,8,,2130430120
3,302,3002257,940652,630,2019-07-22,435000,20190730001339.0,,,,...,6,3,N,N,N,N,1,8,,9406520630
4,446,3018109,152504,9008,2019-10-18,7600000,20191030001615.0,,,,...,7,3,N,N,N,N,1,2,,1525049008
5,465,2993601,140281,20,2019-06-04,450000,20190614000489.0,,,,...,6,3,N,N,N,N,1,8,,1402810020
6,482,3015516,779790,30,2019-10-07,0,20191016000009.0,,,,...,6,3,N,N,N,N,1,8,,7797900030
7,586,3031504,766620,3538,2019-12-30,0,20200128000956.0,,,,...,7,15,N,N,N,N,18,2,,7666203538
8,594,3015264,124550,98,2019-09-27,193000,20191015000395.0,,,,...,6,15,N,N,N,N,18,8,18 51 52,1245500098
9,599,2980648,797320,2320,2019-03-27,540000,,,,,...,6,3,N,N,N,N,1,8,,7973202320


In [16]:
rp_sale.drop(axis=1, labels='Unnamed: 0', inplace=True)
rp_sale.head(10)

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning,PIN
0,2999169,919715,200,2019-07-08,192000,20190712001080.0,,,,,...,2,3,N,N,N,N,1,3,,9197150200
1,3000673,894444,200,2019-06-26,185000,20190722001395.0,,,,,...,2,3,N,N,N,N,1,3,,8944440200
2,3027422,213043,120,2019-12-20,560000,20191226000848.0,,,,,...,6,3,N,N,N,N,1,8,,2130430120
3,3002257,940652,630,2019-07-22,435000,20190730001339.0,,,,,...,6,3,N,N,N,N,1,8,,9406520630
4,3018109,152504,9008,2019-10-18,7600000,20191030001615.0,,,,,...,7,3,N,N,N,N,1,2,,1525049008
5,2993601,140281,20,2019-06-04,450000,20190614000489.0,,,,,...,6,3,N,N,N,N,1,8,,1402810020
6,3015516,779790,30,2019-10-07,0,20191016000009.0,,,,,...,6,3,N,N,N,N,1,8,,7797900030
7,3031504,766620,3538,2019-12-30,0,20200128000956.0,,,,,...,7,15,N,N,N,N,18,2,,7666203538
8,3015264,124550,98,2019-09-27,193000,20191015000395.0,,,,,...,6,15,N,N,N,N,18,8,18 51 52,1245500098
9,2980648,797320,2320,2019-03-27,540000,,,,,,...,6,3,N,N,N,N,1,8,,7973202320


## Joining rp_sale with residential building info

Now that I have the PIN (parcel ID number) in my sales info, I want to join it with the residential building data. Let's take a look at that dataframe.

In [17]:
res_bldg = pd.read_csv(data_folder+'EXTR_ResBldg.csv')
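
The first read of `EXTR_RPSale.csv` passed a `dtype` for Major and Minor, and the same option could be used here. The sketch below uses a tiny in-memory CSV and assumes the codes are stored zero-padded in the file, which may not hold for the real data; if they are not, the padding step is still needed.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a file on disk (values are illustrative)
raw = "Major,Minor,NbrLivingUnits\n009800,0440,1\n919715,0200,2\n"

# Default inference: the codes become integers and lose their leading zeros
inferred = pd.read_csv(io.StringIO(raw))
print(inferred["Major"].tolist())  # [9800, 919715]

# Explicit dtype: the codes stay as zero-padded strings
as_text = pd.read_csv(io.StringIO(raw), dtype={"Major": str, "Minor": str})
print(as_text["Major"].tolist())   # ['009800', '919715']
```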


In [18]:
res_bldg.info()
res_bldg.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515147 entries, 0 to 515146
Data columns (total 50 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Major               515147 non-null  int64  
 1   Minor               515147 non-null  int64  
 2   BldgNbr             515147 non-null  int64  
 3   NbrLivingUnits      515147 non-null  int64  
 4   Address             515147 non-null  object 
 5   BuildingNumber      515147 non-null  object 
 6   Fraction            515147 non-null  object 
 7   DirectionPrefix     514643 non-null  object 
 8   StreetName          515147 non-null  object 
 9   StreetType          515147 non-null  object 
 10  DirectionSuffix     514643 non-null  object 
 11  ZipCode             469753 non-null  object 
 12  Stories             515147 non-null  float64
 13  BldgGrade           515147 non-null  int64  
 14  BldgGradeVar        515147 non-null  int64  
 15  SqFt1stFloor        515147 non-nul

Unnamed: 0,Major,Minor,BldgNbr,NbrLivingUnits,Address,BuildingNumber,Fraction,DirectionPrefix,StreetName,StreetType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,9800,440,1,1,2053 277TH AVE SE 98075,2053,,,277TH,AVE,...,0,0,0,2004,0,0,0,0,3,0
1,9800,710,1,1,27713 SE 26TH WAY 98075,27713,,SE,26TH,WAY,...,0,0,0,2002,0,0,0,0,3,0
2,9800,720,1,1,27719 SE 26TH WAY 98075,27719,,SE,26TH,WAY,...,0,0,0,2001,0,0,0,0,3,0
3,9802,30,1,1,2831 278TH AVE SE 98075,2831,,,278TH,AVE,...,0,0,0,2004,0,0,0,0,3,0
4,9802,140,1,1,2829 277TH TER SE 98075,2829,,,277TH,TER,...,0,0,0,2004,0,0,0,0,3,0


Let's turn the steps I used to build the PIN column into a function and use it on this dataframe too.

In [19]:
def add_PIN_column(df):
    """
    input: dataframe with Major and Minor columns
    output: dataframe with PIN column added
    """
    # turn the major and minor columns into strings
    df['Major'] = df['Major'].apply(str)
    df['Minor'] = df['Minor'].apply(str)
    
    # pad each column with zeros
    df['Major'] = df['Major'].apply(lambda elem: elem.rjust(6, '0'))
    df['Minor'] = df['Minor'].apply(lambda elem: elem.rjust(4, '0'))
    
    #create pin column
    df['PIN'] = df['Major'] + df['Minor']
    
    return df
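
Note that `add_PIN_column` modifies the dataframe passed to it. A possible variant, sketched below, uses the vectorized string methods and returns a new dataframe instead. The name `with_pin` is hypothetical and is not part of `src.data_cleaning`.

```python
import pandas as pd

def with_pin(df):
    """Return a copy of df with zero-padded Major/Minor and a PIN column.

    Hypothetical helper for illustration; assumes non-negative codes.
    """
    major = df["Major"].astype(str).str.zfill(6)
    minor = df["Minor"].astype(str).str.zfill(4)
    # assign() builds a new dataframe, so the caller's df is untouched
    return df.assign(Major=major, Minor=minor, PIN=major + minor)

# Toy input; the values are illustrative only
toy = pd.DataFrame({"Major": [9800, 919715], "Minor": [440, 200]})
result = with_pin(toy)
print(result["PIN"].tolist())  # ['0098000440', '9197150200']
print(toy["Major"].tolist())   # [9800, 919715]
```

Leaving the input unchanged makes it safe to rerun the cell without worrying about what state the original dataframe is in.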

Now let's test it out and try joining the two dataframes (sales and building).

In [20]:
res_bldg = add_PIN_column(res_bldg)
res_bldg.head()


Unnamed: 0,Major,Minor,BldgNbr,NbrLivingUnits,Address,BuildingNumber,Fraction,DirectionPrefix,StreetName,StreetType,...,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost,PIN
0,9800,440,1,1,2053 277TH AVE SE 98075,2053,,,277TH,AVE,...,0,0,2004,0,0,0,0,3,0,98000440
1,9800,710,1,1,27713 SE 26TH WAY 98075,27713,,SE,26TH,WAY,...,0,0,2002,0,0,0,0,3,0,98000710
2,9800,720,1,1,27719 SE 26TH WAY 98075,27719,,SE,26TH,WAY,...,0,0,2001,0,0,0,0,3,0,98000720
3,9802,30,1,1,2831 278TH AVE SE 98075,2831,,,278TH,AVE,...,0,0,2004,0,0,0,0,3,0,98020030
4,9802,140,1,1,2829 277TH TER SE 98075,2829,,,277TH,TER,...,0,0,2004,0,0,0,0,3,0,98020140
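
With a PIN column on both dataframes, the join could look roughly like the sketch below. The two toy frames stand in for `rp_sale` and `res_bldg`, their values are illustrative, and the choice of an inner join is an assumption about what the analysis needs.

```python
import pandas as pd

# Toy frames standing in for rp_sale and res_bldg (values are illustrative)
sales = pd.DataFrame({
    "PIN": ["9197150200", "2130430120", "7797900030"],
    "SalePrice": [192000, 560000, 0],
})
buildings = pd.DataFrame({
    "PIN": ["9197150200", "2130430120", "0098000440"],
    "NbrLivingUnits": [1, 2, 1],
})

# Inner join keeps only parcels that appear in both tables
joined = sales.merge(buildings, on="PIN", how="inner")
print(joined)
```

In the real data a parcel may have more than one sale or more than one building, so the joined row count is worth checking after the merge.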
