# Data Cleaning with Pandas

In [1]:
import pandas as pd

## Scenario

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2019, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

### First, get the data!

When working on a project involving data that can fit on our computer, we store it in a `data` directory.

```bash
cd <project_directory>  # example: cd ~/flatiron_ds/pandas-3
mkdir data
cd data
```

Note that `<project_directory>` in angle brackets is a _placeholder_. You should type the path to the actual location on your computer where you're working on this project. Do not literally type `<project_directory>` and _do not type the angle brackets_. You can see an example in the _comment_ to the right of the command above.

Now, we'll need to download the two data files that we need. We can do this at the command line:

```bash
wget https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
wget https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip
```

*Note:* If you do not have the `wget` command yet, you can install it. On macOS with Homebrew: `brew install wget`.

Note that `%20` in a URL translates into a space. Even though you will *never put spaces in filenames*, you may need to deal with spaces that _other_ people have used in filenames.
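To see the `%20` translation concretely, here is a small sketch using Python's standard `urllib.parse` module. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

url = "https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip"

# Take the last path segment and decode percent-escapes such as %20
filename = unquote(url.rsplit("/", 1)[-1])
print(filename)

# Going the other way, quote() percent-encodes the spaces again
encoded = quote(filename)
print(encoded)
```

The decoded name is what ends up on disk after the download, which is why the spaces need handling at the command line.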

There are two ways to handle the spaces in these filenames when referencing them at the command line.

#### 1. You can _escape_ the spaces by putting a backslash (`\`, remember _backslash is next to backspace_) before each one:

`unzip Real\ Property\ Sales.zip`

This is what happens if you tab-complete the filename in the terminal. Tab completion is your friend!

#### 2. You can put the entire filename in quotes:

`unzip "Real Property Sales.zip"`

Try unzipping these files with the `unzip` command. The `unzip` command takes one argument, the name of the file that you want to unzip.


You can use tab completion (press the `tab` key after the first three letters) to fill in the names, including spaces.

In [2]:
sales_df = pd.read_csv('data/property_sales.csv')

  interactivity=interactivity, compiler=compiler, result=result)


### Seeing pink? Warnings are useful!

Note the warning above: `DtypeWarning: Columns (1, 2) have mixed types.` Because we start with an index of zero, the columns that we're being warned about are actually the _second_ and _third_ columns, `sales_df['Major']` and `sales_df['Minor']`.
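One common way to avoid mixed-type columns is to tell `read_csv` which dtype to use for them. The sketch below reads a tiny made-up CSV (not the King County file) and forces the parcel columns to be strings; `csv_text` and `sample_df` are hypothetical names.

```python
import io

import pandas as pd

# A made-up stand-in for the real file; the second row has blank parcel numbers
csv_text = (
    "ExciseTaxNbr,Major,Minor\n"
    "2687551,138860,110\n"
    "1235111,      ,    \n"
    "2704079,423943,50\n"
)

# Read Major and Minor as strings so every value in a column has the same type
sample_df = pd.read_csv(io.StringIO(csv_text), dtype={"Major": str, "Minor": str})
print(sample_df.dtypes)
```

Reading the columns as strings postpones the numeric conversion, so the decision about how to treat blank values can be made explicitly later.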

In [3]:
sales_df.head().T

Unnamed: 0,0,1,2,3,4
ExciseTaxNbr,2687551,1235111,2704079,2584094,1056831
Major,138860,664885,423943,403700,951120
Minor,110,40,50,715,900
DocumentDate,08/21/2014,07/09/1991,10/11/2014,01/04/2013,04/20/1989
SalePrice,245000,0,0,0,85000
RecordingNbr,20140828001436,199203161090,20141205000558,20130110000910,198904260448
Volume,,071,,,117
Page,,001,,,053
PlatNbr,,664885,,,951120
PlatType,,C,,,P


### Data overload?

That's a lot of columns. We're only interested in identifying the date, sale price, and square footage of each specific property. What can we do?

In [4]:
sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']]

In [5]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2014336 entries, 0 to 2014335
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 61.5+ MB


In [6]:
bldg_df = pd.read_csv('data/res_building.csv')

  interactivity=interactivity, compiler=compiler, result=result)


### Another warning! Which column has index 11?

In [7]:
bldg_df.columns[11]

'ZipCode'

`ZipCode` seems like a potentially useful column. We'll need it to determine which house sales took place in Seattle.

In [8]:
bldg_df.head().T

Unnamed: 0,0,1,2,3,4
Major,4300,4610,4610,4610,4900
Minor,167,399,503,505,56
BldgNbr,1,1,1,1,1
NbrLivingUnits,1,1,1,1,1
Address,15223 40TH AVE S 98188,4431 FERNCROFT RD 98040,4516 FERNCROFT RD 98040,4538 FERNCROFT RD 98040,3015 SW 105TH ST 98146
BuildingNumber,15223,4431,4516,4538,3015
Fraction,,,,,
DirectionPrefix,,,,,SW
StreetName,40TH,FERNCROFT,FERNCROFT,FERNCROFT,105TH
StreetType,AVE,RD,RD,RD,ST


### So many features!

As data scientists, we should be _very_ cautious about discarding potentially useful data. But, today, we're interested in _only_ the total square footage of each property. What can we do?


In [9]:
bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']]
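If memory is a concern, `read_csv` can also load only the needed columns through its `usecols` parameter. This is a minimal sketch on a made-up one-row CSV, not the actual building file.

```python
import io

import pandas as pd

# Made-up CSV with one extra column (BldgNbr) that we do not need
csv_text = (
    "Major,Minor,BldgNbr,SqFtTotLiving,ZipCode\n"
    "4300,167,1,1490,98188\n"
)

# Only the listed columns are parsed and kept
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Major", "Minor", "SqFtTotLiving", "ZipCode"],
)
print(subset.columns.tolist())
```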

In [10]:
bldg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511412 entries, 0 to 511411
Data columns (total 4 columns):
Major            511412 non-null int64
Minor            511412 non-null int64
SqFtTotLiving    511412 non-null int64
ZipCode          468372 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB


In [11]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

### Error!

Why are we seeing an error when we try to join the dataframes?

<table>
    <tr>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013160 entries, 0 to 2013159
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 61.4+ MB</pre></td>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511359 entries, 0 to 511358
Data columns (total 4 columns):
Major            511359 non-null int64
Minor            511359 non-null int64
SqFtTotLiving    511359 non-null int64
ZipCode          468345 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB
</pre></td>
    </tr>
</table>

Review the error message in light of the above:

* `ValueError: You are trying to merge on object and int64 columns.`
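A quick way to spot this kind of mismatch before merging is to compare the dtypes of the key columns on both sides. The two small frames below are made up for illustration; `left` and `right` stand in for `sales_df` and `bldg_df`.

```python
import pandas as pd

# Keys stored as strings on one side and as integers on the other
left = pd.DataFrame({"Major": ["138860", "664885"], "Minor": ["110", "40"]})
right = pd.DataFrame({"Major": [138860, 664885], "Minor": [110, 40]})

keys = ["Major", "Minor"]

# Collect the key columns whose dtypes differ between the two frames
mismatched = [k for k in keys if left[k].dtype != right[k].dtype]
print(mismatched)
```

Any column listed in `mismatched` needs to be converted on one side before `pd.merge` can match values.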

In [12]:
# Try converting Major to a numeric dtype so that it matches bldg_df['Major']
sales_df['Major'] = pd.to_numeric(sales_df['Major'])

### Error!

Note the useful error message above:

`ValueError: Unable to parse string "      " at position 936643`

In this case, we want to treat non-numeric values as missing values. Let's see if there's a way to change how the `pd.to_numeric` function handles errors.

In [13]:
# The single question mark means "show me the docstring"
# pd.to_numeric?

Here's the part that we're looking for:
```
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    - If 'coerce', then invalid parsing will be set as NaN
    - If 'ignore', then invalid parsing will return the input
```

Let's try setting the `errors` parameter to `'coerce'`.

In [13]:
sales_df['Major'] = pd.to_numeric(sales_df['Major'], errors='coerce')

Did it work?

In [14]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2014336 entries, 0 to 2014335
Data columns (total 4 columns):
Major           float64
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: float64(1), int64(1), object(2)
memory usage: 61.5+ MB


It worked! Let's do the same thing with the `Minor` parcel number.

In [15]:
sales_df['Minor'] = pd.to_numeric(sales_df['Minor'], errors='coerce')

In [16]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2014336 entries, 0 to 2014335
Data columns (total 4 columns):
Major           float64
Minor           float64
DocumentDate    object
SalePrice       int64
dtypes: float64(2), int64(1), object(1)
memory usage: 61.5+ MB
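To see the difference between the default `errors='raise'` and `errors='coerce'` on a small scale, here is a sketch with a made-up three-value Series. The object dtype is chosen to mirror how the `Major` column was read in.

```python
import pandas as pd

# One value is a whitespace-only string, like the one in the error message
raw = pd.Series(["138860", "      ", "664885"], dtype=object)

try:
    pd.to_numeric(raw)  # errors='raise' is the default
    raised = False
except ValueError as exc:
    raised = True
    print("raise:", exc)

# With 'coerce', values that cannot be parsed become NaN
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced)
```

Because NaN is a floating-point value, the coerced column ends up as `float64`, which matches the dtypes shown in the `info()` output.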


Now, let's try our join again.

In [17]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

In [18]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092


In [19]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1437522 entries, 0 to 1437521
Data columns (total 6 columns):
Major            1437522 non-null float64
Minor            1437522 non-null float64
DocumentDate     1437522 non-null object
SalePrice        1437522 non-null int64
SqFtTotLiving    1437522 non-null int64
ZipCode          1322148 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 76.8+ MB


We can see right away that we're missing zip codes for many of the sales transactions. (1322148 non-null entries for ZipCode is fewer than the 1437522 entries in the dataframe.)

In [20]:
sales_data.loc[sales_data['ZipCode'].isna()].head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
99,858140.0,376.0,05/22/2012,0,900,
100,858140.0,376.0,11/28/2017,0,900,
152,720319.0,520.0,11/20/2013,699950,2840,
153,720319.0,520.0,09/16/2013,0,2840,
163,894677.0,240.0,12/21/2016,818161,2450,


Because we are interested in finding houses in Seattle zip codes, we will need to drop the rows with missing zip codes.

In [21]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]

sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092
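An equivalent way to count and drop missing zip codes is `isna().sum()` followed by `dropna(subset=...)`. The frame below is a made-up example with hypothetical values.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SalePrice": [0, 699950, 245000],
        "ZipCode": [None, None, "98002"],
    }
)

# Count the rows where ZipCode is missing
n_missing = toy["ZipCode"].isna().sum()

# Keep only the rows that have a ZipCode
kept = toy.dropna(subset=["ZipCode"])
print(n_missing, len(kept))
```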


# Your turn: Data Cleaning with Pandas

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

Use multiple notebook cells to accomplish this! Press `[esc]` then `B` to create a new cell below the current cell. Press `[return]` to start typing in the new cell.

In [22]:
dummy_sales_data = sales_data.copy()  # work on a copy so sales_data stays unchanged

In [23]:
dummy_sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092


Let's confirm how many records we have in `dummy_sales_data`.

In [25]:
dummy_sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1322148 entries, 0 to 1437521
Data columns (total 6 columns):
Major            1322148 non-null float64
Minor            1322148 non-null float64
DocumentDate     1322148 non-null object
SalePrice        1322148 non-null int64
SqFtTotLiving    1322148 non-null int64
ZipCode          1322148 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 70.6+ MB


In [26]:
# keep only the records whose 'SalePrice' is greater than 0
dummy_sales_data = dummy_sales_data.loc[dummy_sales_data["SalePrice"] > 0, :]
dummy_sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
6,423943.0,50.0,07/15/1999,96000,960,98092
7,423943.0,50.0,01/08/2001,127500,960,98092


In [27]:
dummy_sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 867413 entries, 0 to 1437520
Data columns (total 6 columns):
Major            867413 non-null float64
Minor            867413 non-null float64
DocumentDate     867413 non-null object
SalePrice        867413 non-null int64
SqFtTotLiving    867413 non-null int64
ZipCode          867413 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 46.3+ MB


After filtering based on `SalePrice` values greater than 0, `dummy_sales_data` now only has 867413 records as opposed to 1322148.

In [28]:
# keep only records where 'SqFtTotLiving' is greater than 0
dummy_sales_data = dummy_sales_data.loc[dummy_sales_data["SqFtTotLiving"] > 0,:]
dummy_sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
6,423943.0,50.0,07/15/1999,96000,960,98092
7,423943.0,50.0,01/08/2001,127500,960,98092


In [29]:
dummy_sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 867404 entries, 0 to 1437520
Data columns (total 6 columns):
Major            867404 non-null float64
Minor            867404 non-null float64
DocumentDate     867404 non-null object
SalePrice        867404 non-null int64
SqFtTotLiving    867404 non-null int64
ZipCode          867404 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 46.3+ MB


After filtering based on `SqFtTotLiving` values greater than 0, `dummy_sales_data` now only has 867404 records as opposed to 867413.
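The two filters can also be combined into a single boolean mask with `&`. This is a sketch on a made-up three-row frame; note the parentheses around each comparison.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SalePrice": [245000, 0, 96000],
        "SqFtTotLiving": [1490, 960, 0],
    }
)

# A row is kept only if both conditions are True
valid_mask = (toy["SalePrice"] > 0) & (toy["SqFtTotLiving"] > 0)
valid = toy.loc[valid_mask, :]
print(valid)
```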

### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

What's the right thing to do with missing values?

In [29]:
# Read the error message and decide how to fix it.
# Note: using errors='coerce' is the *wrong* choice in this case.
# def is_integer(x):
#     try:
#         _ = int(x)
#     except ValueError:
#         return False
#     return True

# sales_data.loc[sales_data['ZipCode'].apply(is_integer) == False, 'ZipCode'].head()

I did not know how to use the is_integer function so I ultimately did not.
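For reference, here is one way the commented-out `is_integer` helper could be tried on plain Python values before applying it to the column. The sample strings resemble values found in the `ZipCode` column; catching `TypeError` as well is an addition in this sketch so that `None` does not raise.

```python
def is_integer(x):
    """Return True if int(x) succeeds, False otherwise."""
    try:
        int(x)
    except (ValueError, TypeError):
        return False
    return True


samples = ["98042", "98028-8908", "B", "98013.0", None]

# Keep the values that cannot be converted directly to an integer
non_integer = [z for z in samples if not is_integer(z)]
print(non_integer)
```

Applied to a column with `.apply(is_integer)`, the same function yields a boolean Series that can be used as a mask to inspect the unusual values.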

In [30]:
#Get a look at some of the unusual values
dummy_sales_data["ZipCode"].value_counts()

98042         25815
98038         21859
98115         21314
98023         20504
98006         20282
98058         19715
98052         19394
98117         19174
98103         17849
98118         17667
98034         17530
98133         16672
98074         16293
98033         15905
98155         15499
98059         15051
98056         14565
98031         13860
98092         13592
98125         13480
98001         13368
98053         13144
98003         12798
98075         12299
98168         12057
98178         11889
98106         11840
98008         11709
98029         11238
98027         11175
              ...  
98028-8908        1
98028-4377        1
98075-9517        1
98044             1
98074-3738        1
95055             1
98059-7428        1
98023-7841        1
98075-9565        1
98006-3954        1
98028-8533        1
98074-4092        1
98013.0           1
98075-8005        1
B                 1
90855             1
98054             1
98176             1
98028-6100        1


In [31]:
#Find out what kind of data these zipcodes are
type(dummy_sales_data["ZipCode"][60])

str

In [32]:
dummy_sales_data = dummy_sales_data.loc[((dummy_sales_data["ZipCode"].str.len()) > 4), :]

In [33]:
dummy_sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 782966 entries, 0 to 1437520
Data columns (total 6 columns):
Major            782966 non-null float64
Minor            782966 non-null float64
DocumentDate     782966 non-null object
SalePrice        782966 non-null int64
SqFtTotLiving    782966 non-null int64
ZipCode          782966 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 41.8+ MB


After keeping only the rows whose `ZipCode` string is longer than 4 characters, `dummy_sales_data` now only has 782966 records as opposed to 867404.

In [34]:
len(dummy_sales_data["ZipCode"])

782966

Take a look at what kinds of zip codes we have left.

In [35]:
dummy_sales_data["ZipCode"].unique()

array(['98002', '98092', '98008', '98058', '98038', '98031', '98188',
       '98051', '98001', '98108', '98198', '98115', '98118', '98072',
       '98117', '98039', '98155', '98075', '98003', '98103', '98022',
       '98042', '98040', '98133', '98105', '98056', '98102', '98053',
       '98168', '98027', '98011', '98074', '98146', '98024', '98029',
       '98006', '98005', '98028', '98034', '98144', '98030', '98177',
       '98166', '98065', '98112', '98116', '98010', '98199', '98032',
       '98106', '98059', '98070', '98045', '98136', '98125', '98023',
       '98033', '98077', '98109', '98055', '98178', '98052', '98122',
       '98014', '98004', '98119', '98107', '98126', '98007', '98019',
       '98224', '98148', '98047', '98288', '98050', '98042-3001', '98354',
       '98068', '98199-3014', '98057', '98302', '98083', '98031-3173',
       '98033-4917', '98058-9018', '98121', '98136-1728', '98058-7983',
       '98074-6315', '98043', '98052-1963', '98113', '98134', '98026',
       '891

In [36]:
def before_dash(s):
    t = s.split("-")[0]
    return t

In [37]:
before_dash('98074-9301')

'98074'

In [38]:
dummy_sales_data["ZipCode"] = dummy_sales_data["ZipCode"].apply(before_dash)
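The same shortening can be written with pandas' vectorized string methods instead of `apply`. This is a sketch on a few made-up zip strings.

```python
import pandas as pd

zips = pd.Series(["98074-9301", "98042", "98199-3014"])

# Split on the dash and keep the first piece of each value
short = zips.str.split("-").str[0]
print(short.tolist())
```

Values without a dash are left unchanged, because splitting them returns a single piece.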

In [39]:
dummy_sales_data["ZipCode"].unique()

array(['98002', '98092', '98008', '98058', '98038', '98031', '98188',
       '98051', '98001', '98108', '98198', '98115', '98118', '98072',
       '98117', '98039', '98155', '98075', '98003', '98103', '98022',
       '98042', '98040', '98133', '98105', '98056', '98102', '98053',
       '98168', '98027', '98011', '98074', '98146', '98024', '98029',
       '98006', '98005', '98028', '98034', '98144', '98030', '98177',
       '98166', '98065', '98112', '98116', '98010', '98199', '98032',
       '98106', '98059', '98070', '98045', '98136', '98125', '98023',
       '98033', '98077', '98109', '98055', '98178', '98052', '98122',
       '98014', '98004', '98119', '98107', '98126', '98007', '98019',
       '98224', '98148', '98047', '98288', '98050', '98354', '98068',
       '98057', '98302', '98083', '98121', '98043', '98113', '98134',
       '98026', '89118', '98104', '98000', '98204', '98035', '98422',
       '98132', '95059', '98097', '28028', '98189', '98017', '98025',
       '98044', '981

### 3. Add a column for PricePerSqFt



In [40]:
#I want to divide SalePrice by SqFtTotLiving, so I'll make sure they are all floats
dummy_sales_data["SalePrice"].dtype

dtype('int64')

In [41]:
#I reassigned it using astype. It got mad at me so I'll use loc for the next one
dummy_sales_data["SalePrice"].dtype

dtype('int64')

In [42]:
dummy_sales_data["SqFtTotLiving"].dtype

dtype('int64')

In [43]:
dummy_sales_data = dummy_sales_data.loc[dummy_sales_data["SqFtTotLiving"].astype(float), :]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


In [44]:
dummy_sales_data["SqFtTotLiving"].dtype

dtype('float64')

In [45]:
dummy_sales_data["PricePerSqFt"] = dummy_sales_data["SalePrice"]/dummy_sales_data["SqFtTotLiving"]
dummy_sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087
960.0,865010.0,200.0,11/15/2006,364950.0,2150.0,98042,169.744186
960.0,865010.0,200.0,11/15/2006,364950.0,2150.0,98042,169.744186
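In the output above, the row labels are values such as `1490.0` and `960.0`, which look like `SqFtTotLiving` values from earlier rows. That is consistent with `.loc` treating the float Series as a list of row labels to look up rather than as a type conversion. A minimal sketch of a different approach, on a made-up two-row frame (`toy` is a hypothetical name), assigns the result of `astype(float)` back to the column and then divides.

```python
import pandas as pd

toy = pd.DataFrame(
    {"SalePrice": [245000, 96000], "SqFtTotLiving": [1490, 960]},
    index=[0, 6],
)

# Convert the column in place; the row labels stay as they were
toy["SqFtTotLiving"] = toy["SqFtTotLiving"].astype(float)

# Element-wise division, aligned on the row labels
toy["PricePerSqFt"] = toy["SalePrice"] / toy["SqFtTotLiving"]
print(toy)
```

Since `/` on two integer columns already returns floats in pandas, the cast is optional for this calculation.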


### 4. Subset the data to 2019 sales only.

We can assume that the DocumentDate is approximately the sale date.

In [46]:
dummy_sales_data["DocumentDate"].value_counts()

08/20/2013    4703
05/14/2007    4667
06/03/2003    4617
03/22/2016    4397
06/10/1999    4386
06/20/2001    4378
11/05/2003    4296
01/11/2007    4190
10/22/2003    4179
06/13/2013    4089
07/17/2008    4051
08/23/2006    3957
03/16/2005    3948
08/30/2005    3938
07/06/2004    3936
11/05/2014    3931
03/09/1995    3900
12/16/1996    3831
03/11/1997    3817
04/28/2003    3805
09/14/2012    3792
03/20/2018    3791
12/09/1993    3763
03/03/2000    3716
01/21/1988    3711
09/02/2004    3658
03/04/2011    3581
12/18/2013    3559
11/08/1999    3532
05/10/1995    3526
              ... 
05/24/2000       1
06/29/2016       1
03/05/2004       1
03/21/2019       1
07/12/2000       1
03/08/1999       1
06/12/2018       1
05/27/2005       1
10/26/1995       1
07/24/1986       1
07/27/2015       1
03/14/2006       1
11/27/2018       1
05/10/2016       1
06/26/2002       1
08/28/2018       1
11/18/1998       1
05/31/2014       1
08/05/1992       1
03/24/2003       1
06/23/2010       1
03/25/2016  

Let's create a new column called `DocumentYear`. This requires us first to convert `DocumentDate` to a datetime. For safety, I'll store this conversion in a new column called `DocumentDateClean`.

In [48]:
dummy_sales_data["DocumentDateClean"] = pd.to_datetime(dummy_sales_data["DocumentDate"], 
                                                       format="%m/%d/%Y")
dummy_sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt,DocumentDateClean
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087,1999-09-27
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087,1999-09-27
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087,1999-09-27
960.0,865010.0,200.0,11/15/2006,364950.0,2150.0,98042,169.744186,2006-11-15
960.0,865010.0,200.0,11/15/2006,364950.0,2150.0,98042,169.744186,2006-11-15


In [52]:
dummy_sales_data["DocumentYear"] = dummy_sales_data["DocumentDateClean"].dt.year
dummy_sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt,DocumentDateClean,DocumentYear
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087,1999-09-27,1999.0
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087,1999-09-27,1999.0
1490.0,334570.0,202.0,09/27/1999,317000.0,2300.0,98056,137.826087,1999-09-27,1999.0
960.0,865010.0,200.0,11/15/2006,364950.0,2150.0,98042,169.744186,2006-11-15,2006.0
960.0,865010.0,200.0,11/15/2006,364950.0,2150.0,98042,169.744186,2006-11-15,2006.0
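The conversion can be checked on a couple of date strings in isolation. This sketch uses two made-up values in the same `MM/DD/YYYY` layout.

```python
import pandas as pd

dates = pd.Series(["08/21/2014", "02/20/2019"])

# Parse the strings using an explicit format
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# The .dt accessor exposes date parts such as the year
years = parsed.dt.year
print(years.tolist())
```

When no dates are missing, the year comes back as an integer. A float year such as `1999.0` usually indicates that the column contains missing values.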


Now let's count how many records there are for each `DocumentYear`:

In [54]:
dummy_sales_data.DocumentYear.value_counts().sort_index()

1966.0     1167
1982.0       40
1983.0       47
1984.0       49
1985.0       88
1986.0      196
1987.0      107
1988.0     6186
1989.0       43
1990.0     1985
1991.0     5728
1992.0     4960
1993.0    10551
1994.0    16533
1995.0    13354
1996.0    19611
1997.0     7005
1998.0    15360
1999.0    21043
2000.0    14710
2001.0    11067
2002.0    15997
2003.0    28671
2004.0    20627
2005.0    16663
2006.0    17965
2007.0    15878
2008.0    11646
2009.0     3707
2010.0       98
2011.0     8581
2012.0    13113
2013.0    18954
2014.0    15964
2015.0     4258
2016.0    14910
2017.0     6660
2018.0    14698
2019.0       16
Name: DocumentYear, dtype: int64

Surprisingly enough, **there seem to be only 16 records in the year 2019**. This leads me to wonder about King County's data collection process. It is reasonable that there would be some delay between a real estate transaction and the time it becomes available for public viewing.

In [55]:
dummy_sales_2019 = dummy_sales_data.loc[dummy_sales_data["DocumentDate"].str.contains("2019") == True]

In [56]:
dummy_sales_2019.head()


Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt,DocumentDateClean,DocumentYear
2555.0,98400.0,450.0,02/20/2019,409950.0,1850.0,98058,221.594595,2019-02-20,2019.0
2555.0,98400.0,450.0,02/20/2019,409950.0,1850.0,98058,221.594595,2019-02-20,2019.0
2555.0,98400.0,450.0,02/20/2019,409950.0,1850.0,98058,221.594595,2019-02-20,2019.0
1657.0,82007.0,9027.0,02/15/2019,895000.0,3160.0,98022,283.227848,2019-02-15,2019.0
1657.0,82007.0,9027.0,02/15/2019,895000.0,3160.0,98022,283.227848,2019-02-15,2019.0


In [57]:
dummy_sales_2019.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 16 entries, 2555.0 to 3541.0
Data columns (total 9 columns):
Major                16 non-null float64
Minor                16 non-null float64
DocumentDate         16 non-null object
SalePrice            16 non-null float64
SqFtTotLiving        16 non-null float64
ZipCode              16 non-null object
PricePerSqFt         16 non-null float64
DocumentDateClean    16 non-null datetime64[ns]
DocumentYear         16 non-null float64
dtypes: datetime64[ns](1), float64(6), object(2)
memory usage: 1.2+ KB
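Filtering on the numeric year gives the same rows as matching the text of the date. This is a sketch on a made-up three-row frame comparing the two masks.

```python
import pandas as pd

toy = pd.DataFrame({"DocumentDate": ["02/20/2019", "11/15/2006", "02/15/2019"]})
toy["DocumentYear"] = pd.to_datetime(
    toy["DocumentDate"], format="%m/%d/%Y"
).dt.year

# Text-based filter: the date string ends with the year
by_string = toy.loc[toy["DocumentDate"].str.endswith("2019")]

# Numeric filter: compare the extracted year directly
by_year = toy.loc[toy["DocumentYear"] == 2019]
print(by_year)
```

The numeric comparison does not depend on how the date happens to be written, which makes it the more robust of the two.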


### 5. Subset the data to zip codes within the City of Seattle.

You'll need to find a list of Seattle zip codes!

In [58]:
seattle_zips = [98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98164, 98174, 98177, 98178, 98195, 98199]

In [59]:
seattle_sales_2019 = dummy_sales_2019.loc[dummy_sales_2019["ZipCode"].isin([str(t) for t in seattle_zips]), :]


In [60]:
seattle_sales_2019.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 5 entries, 2821.0 to 868.0
Data columns (total 9 columns):
Major                5 non-null float64
Minor                5 non-null float64
DocumentDate         5 non-null object
SalePrice            5 non-null float64
SqFtTotLiving        5 non-null float64
ZipCode              5 non-null object
PricePerSqFt         5 non-null float64
DocumentDateClean    5 non-null datetime64[ns]
DocumentYear         5 non-null float64
dtypes: datetime64[ns](1), float64(6), object(2)
memory usage: 400.0+ bytes
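Because `ZipCode` holds strings while the list holds integers, the list is converted to strings before calling `isin`. A small sketch with a shortened, made-up list shows the idea.

```python
import pandas as pd

# Shortened list for illustration only
some_seattle_zips = [98117, 98118]

toy = pd.DataFrame({"ZipCode": ["98118", "98058", "98117"]})

# Convert the integers to strings so they can match the column values
zip_strings = [str(z) for z in some_seattle_zips]
in_seattle = toy["ZipCode"].isin(zip_strings)
print(in_seattle.tolist())
```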


In [61]:
seattle_sales_2019

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt,DocumentDateClean,DocumentYear
2821.0,945920.0,125.0,01/29/2019,1750000.0,1400.0,98118,1250.0,2019-01-29,2019.0
5359.0,186240.0,105.0,03/19/2019,380000.0,800.0,98117,475.0,2019-03-19,2019.0
5359.0,186240.0,105.0,03/19/2019,380000.0,800.0,98117,475.0,2019-03-19,2019.0
2821.0,945920.0,125.0,01/29/2019,1750000.0,1400.0,98118,1250.0,2019-01-29,2019.0
868.0,350160.0,125.0,02/06/2019,935000.0,2460.0,98117,380.081301,2019-02-06,2019.0


### 6. What is the mean price per square foot for a house sold in Seattle in 2019?

Don't just type the answer. Type code that generates the answer as output!

Assuming there were only 16 sales from 2019 on the list, 5 of them with Seattle zip codes:

In [62]:
mean_price_sq_ft = seattle_sales_2019["PricePerSqFt"].mean()

In [63]:
mean_price_sq_ft

766.0162601626016

That is about $766 per square foot, computed from only 5 rows.
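The five rows shown for `seattle_sales_2019` include repeated sales, so it may be worth checking how much the duplicates affect the mean. The sketch below retypes the values from that output into a small frame (`rows` is a hypothetical name) and compares the mean with and without duplicate rows.

```python
import pandas as pd

# Values copied from the seattle_sales_2019 output
rows = pd.DataFrame(
    {
        "Major": [945920.0, 186240.0, 186240.0, 945920.0, 350160.0],
        "Minor": [125.0, 105.0, 105.0, 125.0, 125.0],
        "DocumentDate": [
            "01/29/2019", "03/19/2019", "03/19/2019", "01/29/2019", "02/06/2019",
        ],
        "PricePerSqFt": [1250.0, 475.0, 475.0, 1250.0, 380.081301],
    }
)

# Mean over all five rows
mean_all = rows["PricePerSqFt"].mean()

# Mean after dropping rows that are exact duplicates
mean_unique = rows.drop_duplicates()["PricePerSqFt"].mean()
print(mean_all, mean_unique)
```

With so few rows, either figure should be treated as a rough indication rather than a reliable estimate.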