# 3 Wrangling - San Francisco Home Sales by Neighborhood (Redfin)<a id='3_Wrangling_-_San_Francisco_Home_Sales_by_Neighborhood_(Redfin)'></a>

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Wrangling - San Francisco Home Sales by Neighborhood (Redfin)](#3_Wrangling_-_San_Francisco_Home_Sales_by_Neighborhood_(Redfin))
  * [3.1 Contents](#3.1_Contents)
  * [3.2 Introduction](#3.2_Introduction)
  * [3.3 Imports](#3.3_Imports)
  * [3.4 Load The Data](#3.4_Load_The_Data)
    * [3.4.1 Shape and Column Analysis](#3.4.1_Shape_and_Column_Analysis)
  * [3.5 Examining Region](#3.5_Examining_Region)
    * [3.5.1 Dropping non-SF Region](#3.5.1_Dropping_non-SF_Region)
    * [3.5.2 Loading other Neighborhood data](#3.5.2_Loading_other_Neighborhood_data)
    * [3.5.3 Converting Redfin Region into Neighborhood](#3.5.3_Converting_Redfin_Region_into_Neighborhood)
  * [3.6 Condense data](#3.6_Condense_data)
    * [3.6.1 Dropping columns](#3.6.1_Dropping_columns)
    * [3.6.2 Adding Sales Year Month](#3.6.2_Adding_Sales_Year_Month)
    * [3.6.3 Handling NULLs](#3.6.3_Handling_NULLS)
    * [3.6.4 Fixing datatypes](#3.6.4_Fixing_datatypes)
    * [3.6.5 Aggregating data](#3.6.5_Aggregating_data)
  * [3.7 Save data](#3.7_Save_data)

## 3.2 Introduction<a id='3.2_Introduction'></a>

This dataset on San Francisco home sales by neighborhood was provided by <a href="https://www.redfin.com/">Redfin</a>, a national real estate brokerage. It was downloaded in October 2020 as a single CSV file and spans January 2018 through September 2020. It does not contain individual home sales; instead, it gives an aggregated view by neighborhood and month.

We plan to explore this data in conjunction with San Francisco's Police Incident Report data as well as San Francisco's 311 case data, and we will do this by comparing across San Francisco neighborhoods.

In this notebook, we will:
  * load the Redfin housing sales data
  * examine the Redfin Region list
  * transform the Redfin Region into the appropriate official SF neighborhoods

At the end of this notebook, we will generate the following file:
  * Redfin_SF_sales_adjusted_neighborhood.csv : data aggregated by month, from January 2018 up to and including September 2020

## 3.3 Imports<a id='3.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

## 3.4 Load The Data<a id='3.4_Load_The_Data'></a>

In [2]:
sales_data = pd.read_csv('raw_data/Redfin_SF-home-sales-by-neighborhood.csv')

In [3]:
sales_data.head()

Unnamed: 0,Region,Month of Period End,Median Sale Price,Median Sale Price MoM,Median Sale Price YoY,Homes Sold,Homes Sold MoM,Homes Sold YoY,New Listings,New Listings MoM,New Listings YoY,Inventory,Inventory MoM,Inventory YoY,Days on Market,Days on Market MoM,Days on Market YoY,Average Sale To List,Average Sale To List MoM,Average Sale To List YoY
0,"San Francisco, CA - Alamo Square",Jan-18,"$1,720K",22.90%,107.20%,6,-14.30%,20.00%,2.0,-50.00%,100.00%,2.0,100.00%,0.00%,22.0,1.0,-32.0,109.10%,-2.10%,7.70%
1,"San Francisco, CA - Alamo Square",Feb-18,"$1,020K",-40.70%,27.50%,2,-66.70%,-33.30%,1.0,-50.00%,-83.30%,1.0,-50.00%,-80.00%,72.0,50.0,-120.0,113.00%,3.90%,12.90%
2,"San Francisco, CA - Alamo Square",Mar-18,"$1,023K",0.20%,-24.30%,2,0.00%,-60.00%,4.0,300.00%,-33.30%,2.0,100.00%,0.00%,134.0,63.0,103.0,104.40%,-8.60%,3.60%
3,"San Francisco, CA - Alamo Square",Apr-18,"$1,150K",12.50%,1.10%,3,50.00%,-50.00%,8.0,100.00%,-11.10%,3.0,50.00%,200.00%,73.0,-61.0,42.0,110.00%,5.60%,6.70%
4,"San Francisco, CA - Alamo Square",May-18,"$2,000K",73.90%,116.00%,7,133.30%,-22.20%,9.0,12.50%,125.00%,2.0,-33.30%,,14.0,-59.0,-17.0,110.40%,0.40%,6.70%


### 3.4.1 Shape and Column Analysis<a id='3.4.1_Shape_and_Column_Analysis'></a>

In [4]:
sales_data.shape

(4084, 20)

In [5]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4084 entries, 0 to 4083
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Region                     4084 non-null   object 
 1   Month of Period End        4084 non-null   object 
 2   Median Sale Price          4084 non-null   object 
 3   Median Sale Price MoM      4013 non-null   object 
 4   Median Sale Price YoY      3964 non-null   object 
 5   Homes Sold                 4084 non-null   int64  
 6   Homes Sold MoM             4013 non-null   object 
 7   Homes Sold YoY             3964 non-null   object 
 8   New Listings               4021 non-null   float64
 9   New Listings MoM           3929 non-null   object 
 10  New Listings YoY           3881 non-null   object 
 11  Inventory                  3729 non-null   float64
 12  Inventory MoM              3474 non-null   object 
 13   Inventory YoY             3418 non-null   objec

There are 20 columns and 4084 rows. All rows have a non-null `Region`, which is Redfin's Neighborhood identifier.

## 3.5 Examining Region<a id='3.5_Examining_Region'></a>

Since we will be using Neighborhoods to compare across data sets, let's take a look at the `Region` values.

In [6]:
sales_data['Region'].value_counts()

San Francisco, CA - Clarendon Heights         33
San Francisco, CA - Mission District          33
San Francisco, CA - North Beach               33
San Francisco, CA - Lower Pacific Heights     33
South San Francisco, CA - Sunshine Gardens    33
                                              ..
San Francisco, CA - Duboce Park               15
San Francisco, CA - India Basin               14
San Francisco, CA - Presidio National Park    11
San Francisco, CA - Design District            9
San Francisco, CA - Produce Market             3
Name: Region, Length: 131, dtype: int64

Looks like we've included some South San Francisco rows! Let's drop those.

### 3.5.1 Dropping non-SF Region<a id='3.5.1_Dropping_non-SF_Region'></a>

In [7]:
sales_data[sales_data['Region'].str.startswith('South San Francisco')]['Region'].value_counts()

South San Francisco, CA - Paradise Valley-Terrabay        33
South San Francisco, CA - Avalon                          33
South San Francisco, CA - Westborough                     33
South San Francisco, CA - Downtown South San Francisco    33
South San Francisco, CA - Sign Hill                       33
South San Francisco, CA - Winston-Serra                   33
South San Francisco, CA - Sunshine Gardens                33
South San Francisco, CA - Orange Park                     32
Name: Region, dtype: int64

In [8]:
sales_data = sales_data[~sales_data['Region'].str.startswith('South San Francisco')]

In [9]:
sales_data.shape

(3821, 20)
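
The new shape is consistent with the counts above: seven South San Francisco regions with 33 rows each plus one with 32 gives 263 rows, and 4084 - 263 = 3821. As a minimal sketch of the same filter-and-verify pattern, here is a toy version (the three-row `toy` frame is made up for illustration and is not part of the Redfin data):

```python
import pandas as pd

# Toy stand-in for sales_data; only the Region column matters here
toy = pd.DataFrame({'Region': [
    'San Francisco, CA - Alamo Square',
    'South San Francisco, CA - Avalon',
    'San Francisco, CA - Mission District',
]})

# Boolean mask marking the rows we want to drop
is_south_sf = toy['Region'].str.startswith('South San Francisco')
kept = toy[~is_south_sf]

# The number of rows kept should equal the original count minus the masked rows
assert len(kept) == len(toy) - is_south_sf.sum()
print(len(toy), '->', len(kept))
```

Keeping the mask in a named variable makes it easy to count the dropped rows before actually dropping them.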

### 3.5.2 Loading other Neighborhood data<a id='3.5.2_Loading_other_Neighborhood_data'></a>

Since we are comparing against the SF police incident data and the 311 case data, we need to convert each Redfin Region into the neighborhood names those datasets use. To do this, we have prepared a file that maps each Redfin Region to a 311 Neighborhood and to an SF police incident Analysis Neighborhood.

In [10]:
neighborhood_mapper = pd.read_csv('data/Neighborhoods_Map.csv')

In [11]:
neighborhood_mapper.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   RedfinNeighborhood     130 non-null    object
 1   311 Neighborhood       130 non-null    object
 2   Analysis Neighborhood  130 non-null    object
dtypes: object(3)
memory usage: 3.2+ KB


In [12]:
neighborhood_mapper.head()

Unnamed: 0,RedfinNeighborhood,311 Neighborhood,Analysis Neighborhood
0,"San Francisco, CA - Alamo Square",Alamo Square,Hayes Valley
1,"San Francisco, CA - Anza Vista",Anza Vista,Lone Mountain/USF
2,"San Francisco, CA - Apparel City",Apparel City,Bayview Hunters Point
3,"San Francisco, CA - Aquatic Park-Fort Mason",Aquatic Park / Ft. Mason,Marina
4,"San Francisco, CA - Ashbury Heights",Ashbury Heights,Haight Ashbury


In [13]:
# list the Regions in sales_data that have no matching RedfinNeighborhood in neighborhood_mapper
print( np.setdiff1d( sales_data['Region'].unique(), neighborhood_mapper['RedfinNeighborhood'].unique() ) )

['San Francisco, CA - Northeast San Francisco'
 'San Francisco, CA - Northwest San Francisco'
 'San Francisco, CA - Southeast San Francisco'
 'San Francisco, CA - Southwest San Francisco']


The mapping table does not contain the following Redfin regions:
  * 'San Francisco, CA - Northeast San Francisco'
  * 'San Francisco, CA - Northwest San Francisco'
  * 'San Francisco, CA - Southeast San Francisco'
  * 'San Francisco, CA - Southwest San Francisco'

These look like quadrant-level aggregates of the city rather than individual neighborhoods, so we aren't concerned with losing them.
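
If we also want to know how many rows those unmapped regions account for, one option is a left merge with `indicator=True`. The sketch below uses toy stand-ins for `sales_data` and `neighborhood_mapper` (the values and the names `sales`, `mapper`, `checked`, and `rows_lost` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for sales_data and neighborhood_mapper (illustrative values only)
sales = pd.DataFrame({
    'Region': [
        'San Francisco, CA - Alamo Square',
        'San Francisco, CA - Alamo Square',
        'San Francisco, CA - Northeast San Francisco',
    ],
    'Homes Sold': [6, 2, 40],
})
mapper = pd.DataFrame({
    'RedfinNeighborhood': ['San Francisco, CA - Alamo Square'],
    '311 Neighborhood': ['Alamo Square'],
})

# A left merge with indicator=True tags each sales row by whether it found a mapping
checked = sales.merge(mapper, how='left', left_on='Region',
                      right_on='RedfinNeighborhood', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

# Rows per unmatched region: these are the rows an inner join would drop
rows_lost = unmatched['Region'].value_counts()
print(rows_lost)
```

On the real data we would expect this count to line up with the difference in row counts before and after the join.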

### 3.5.3 Converting Redfin Region into Neighborhood<a id='3.5.3_Converting_Redfin_Region_into_Neighborhood'></a>

We will join the Redfin sales data table with the Neighborhood mapping table. `pd.merge` performs an inner join by default, so rows for the four unmapped regions are dropped.

In [14]:
modified_sales = pd.merge( sales_data, neighborhood_mapper, left_on='Region', right_on='RedfinNeighborhood', validate='many_to_one')
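
The `validate='many_to_one'` argument in the merge above asks pandas to check that each key appears at most once on the right-hand side, so a duplicated entry in the mapping file cannot silently multiply sales rows. A small sketch of what happens when that check fails, using a deliberately faulty, hypothetical mapper (`bad_mapper` and the single-letter region names are made up):

```python
import pandas as pd

sales = pd.DataFrame({'Region': ['A', 'A', 'B'], 'Homes Sold': [1, 2, 3]})

# Hypothetical faulty mapper: region 'A' is listed twice
bad_mapper = pd.DataFrame({
    'RedfinNeighborhood': ['A', 'A', 'B'],
    'Analysis Neighborhood': ['X', 'Y', 'Z'],
})

try:
    pd.merge(sales, bad_mapper, left_on='Region',
             right_on='RedfinNeighborhood', validate='many_to_one')
    merge_failed = False
except pd.errors.MergeError as err:
    # pandas refuses the merge because the right-hand keys are not unique
    merge_failed = True
    print(f'merge rejected: {err}')
```

Without `validate`, the same merge would succeed and return four rows instead of three, which would inflate any later sums.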

In [15]:
modified_sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3689 entries, 0 to 3688
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Region                     3689 non-null   object 
 1   Month of Period End        3689 non-null   object 
 2   Median Sale Price          3689 non-null   object 
 3   Median Sale Price MoM      3619 non-null   object 
 4   Median Sale Price YoY      3569 non-null   object 
 5   Homes Sold                 3689 non-null   int64  
 6   Homes Sold MoM             3619 non-null   object 
 7   Homes Sold YoY             3569 non-null   object 
 8   New Listings               3626 non-null   float64
 9   New Listings MoM           3535 non-null   object 
 10  New Listings YoY           3486 non-null   object 
 11  Inventory                  3361 non-null   float64
 12  Inventory MoM              3125 non-null   object 
 13   Inventory YoY             3069 non-null   objec

In [16]:
modified_sales.head()

Unnamed: 0,Region,Month of Period End,Median Sale Price,Median Sale Price MoM,Median Sale Price YoY,Homes Sold,Homes Sold MoM,Homes Sold YoY,New Listings,New Listings MoM,...,Inventory YoY,Days on Market,Days on Market MoM,Days on Market YoY,Average Sale To List,Average Sale To List MoM,Average Sale To List YoY,RedfinNeighborhood,311 Neighborhood,Analysis Neighborhood
0,"San Francisco, CA - Alamo Square",Jan-18,"$1,720K",22.90%,107.20%,6,-14.30%,20.00%,2.0,-50.00%,...,0.00%,22.0,1.0,-32.0,109.10%,-2.10%,7.70%,"San Francisco, CA - Alamo Square",Alamo Square,Hayes Valley
1,"San Francisco, CA - Alamo Square",Feb-18,"$1,020K",-40.70%,27.50%,2,-66.70%,-33.30%,1.0,-50.00%,...,-80.00%,72.0,50.0,-120.0,113.00%,3.90%,12.90%,"San Francisco, CA - Alamo Square",Alamo Square,Hayes Valley
2,"San Francisco, CA - Alamo Square",Mar-18,"$1,023K",0.20%,-24.30%,2,0.00%,-60.00%,4.0,300.00%,...,0.00%,134.0,63.0,103.0,104.40%,-8.60%,3.60%,"San Francisco, CA - Alamo Square",Alamo Square,Hayes Valley
3,"San Francisco, CA - Alamo Square",Apr-18,"$1,150K",12.50%,1.10%,3,50.00%,-50.00%,8.0,100.00%,...,200.00%,73.0,-61.0,42.0,110.00%,5.60%,6.70%,"San Francisco, CA - Alamo Square",Alamo Square,Hayes Valley
4,"San Francisco, CA - Alamo Square",May-18,"$2,000K",73.90%,116.00%,7,133.30%,-22.20%,9.0,12.50%,...,,14.0,-59.0,-17.0,110.40%,0.40%,6.70%,"San Francisco, CA - Alamo Square",Alamo Square,Hayes Valley


In [17]:
# check for nulls
modified_sales.isnull().sum()

Region                         0
Month of Period End            0
Median Sale Price              0
Median Sale Price MoM         70
Median Sale Price YoY        120
Homes Sold                     0
Homes Sold MoM                70
Homes Sold YoY               120
New Listings                  63
New Listings MoM             154
New Listings YoY             203
Inventory                    328
Inventory MoM                564
 Inventory YoY               620
Days on Market                15
Days on Market MoM            92
Days on Market YoY           147
Average Sale To List           0
Average Sale To List MoM      70
Average Sale To List YoY     120
RedfinNeighborhood             0
311 Neighborhood               0
Analysis Neighborhood          0
dtype: int64

## 3.6 Condense data<a id='3.6_Condense_data'></a>

Let's condense this data. 

  1. Now that we have the Neighborhoods tying this data to the other datasets, we can drop `Region` and `RedfinNeighborhood`.
  2. We also don't require any month-over-month or year-over-year data, so we can drop those columns.
  3. We should convert the `Month of Period End` into the same string format we use in the other datasets (YYYYMM).
  4. Finally, we want to aggregate the data by the two neighborhoods and year-month:
    * First, we need to handle the NULL values
    * Then, we will apply certain methods of aggregation to the other columns, but we must convert them into the appropriate datatypes first.
      * `Median Sale Price`: mean()
      * `Homes Sold`: sum()
      * `New Listings`: sum()
      * `Inventory`: sum()
      * `Days on Market`: mean()
      * `Average Sale To List`: mean()

### 3.6.1 Dropping columns<a id='3.6.1_Dropping_columns'></a>

In [18]:
modified_sales.drop(columns=['Region','RedfinNeighborhood'], inplace=True)

In [19]:
cols_to_drop = [col for col in modified_sales.columns if (('MoM' in col) or ('YoY' in col))]
print(cols_to_drop)
modified_sales.drop(columns=cols_to_drop, inplace=True)

['Median Sale Price MoM ', 'Median Sale Price YoY ', 'Homes Sold MoM ', 'Homes Sold YoY ', 'New Listings MoM ', 'New Listings YoY ', 'Inventory MoM ', ' Inventory YoY ', 'Days on Market MoM', 'Days on Market YoY', 'Average Sale To List MoM ', 'Average Sale To List YoY ']


### 3.6.2 Adding Sales Year Month<a id='3.6.2_Adding_Sales_Year_Month'></a>

In [20]:
print(modified_sales['Month of Period End'].dtypes)

object


In [21]:
modified_sales['Month of Period End'] = pd.to_datetime(modified_sales['Month of Period End'], format='%b-%y')

In [22]:
print(modified_sales['Month of Period End'].dtypes)

datetime64[ns]


In [23]:
modified_sales['Sales Year Month'] = modified_sales['Month of Period End'].dt.strftime('%Y%m')
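
To make the two-step conversion concrete, here is a small standalone sketch on a few sample strings in the same `Mon-YY` style as `Month of Period End`. It assumes an English locale, since `%b` matches abbreviated month names in the current locale; the names `periods`, `parsed`, and `year_month` are illustrative.

```python
import pandas as pd

periods = pd.Series(['Jan-18', 'Feb-18', 'Sep-20'])

# %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(periods, format='%b-%y')

# %Y%m = four-digit year followed by zero-padded month, e.g. 201801
year_month = parsed.dt.strftime('%Y%m')
print(year_month.tolist())  # ['201801', '201802', '202009']
```

Because the result is a zero-padded string, sorting it alphabetically also sorts it chronologically.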

In [24]:
modified_sales.drop(columns=['Month of Period End'], inplace=True)

In [25]:
modified_sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3689 entries, 0 to 3688
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Median Sale Price      3689 non-null   object 
 1   Homes Sold             3689 non-null   int64  
 2   New Listings           3626 non-null   float64
 3   Inventory              3361 non-null   float64
 4   Days on Market         3674 non-null   float64
 5   Average Sale To List   3689 non-null   object 
 6   311 Neighborhood       3689 non-null   object 
 7   Analysis Neighborhood  3689 non-null   object 
 8   Sales Year Month       3689 non-null   object 
dtypes: float64(3), int64(1), object(5)
memory usage: 288.2+ KB


### 3.6.3 Handling NULLs<a id='3.6.3_Handling_NULLS'></a>

In [26]:
# let's take a quick look at some of the null values
modified_sales[modified_sales['New Listings'].isna()].head(10)

Unnamed: 0,Median Sale Price,Homes Sold,New Listings,Inventory,Days on Market,Average Sale To List,311 Neighborhood,Analysis Neighborhood,Sales Year Month
23,$813K,6,,,23.0,112.70%,Alamo Square,Hayes Valley,201912
228,"$2,650K",2,,,78.0,104.10%,Balboa Terrace,West of Twin Peaks,201803
237,"$1,620K",2,,,27.0,107.80%,Balboa Terrace,West of Twin Peaks,201812
238,"$1,440K",1,,,43.0,99.40%,Balboa Terrace,West of Twin Peaks,201901
383,$790K,7,,2.0,43.0,105.20%,Candlestick Point SRA,Bayview Hunters Point,201801
548,$870K,3,,,22.0,104.60%,Central Waterfront,Potrero Hill,201801
549,$875K,2,,,18.0,106.90%,Central Waterfront,Potrero Hill,201802
598,$680K,1,,,96.0,98.80%,Chinatown,Chinatown,201912
599,$680K,1,,,96.0,98.80%,Chinatown,Chinatown,202001
772,"$1,180K",1,,,14.0,102.60%,Showplace Square,Mission Bay,201806


In [27]:
modified_sales[modified_sales['Inventory'].isna()].head(10)

Unnamed: 0,Median Sale Price,Homes Sold,New Listings,Inventory,Days on Market,Average Sale To List,311 Neighborhood,Analysis Neighborhood,Sales Year Month
14,$650K,5,4.0,,26.0,105.10%,Alamo Square,Hayes Valley,201903
22,$840K,8,6.0,,18.0,115.10%,Alamo Square,Hayes Valley,201911
23,$813K,6,,,23.0,112.70%,Alamo Square,Hayes Valley,201912
25,"$2,331K",2,2.0,,13.0,126.30%,Alamo Square,Hayes Valley,202002
26,"$2,331K",2,2.0,,13.0,126.30%,Alamo Square,Hayes Valley,202003
33,$898K,2,1.0,,41.0,108.10%,Anza Vista,Lone Mountain/USF,201802
34,"$2,263K",2,1.0,,86.0,97.60%,Anza Vista,Lone Mountain/USF,201803
40,"$2,334K",4,5.0,,42.0,108.40%,Anza Vista,Lone Mountain/USF,201809
46,"$1,150K",1,1.0,,140.0,106.00%,Anza Vista,Lone Mountain/USF,201903
56,"$1,325K",5,1.0,,14.0,110.10%,Anza Vista,Lone Mountain/USF,202001


In [28]:
# appears that we should fill the null New Listings and Inventory with 0
modified_sales[['New Listings', 'Inventory']] = modified_sales[['New Listings', 'Inventory']].fillna(value=0)

In [29]:
modified_sales[modified_sales['Days on Market'].isna()]

Unnamed: 0,Median Sale Price,Homes Sold,New Listings,Inventory,Days on Market,Average Sale To List,311 Neighborhood,Analysis Neighborhood,Sales Year Month
66,"$1,568K",1,4.0,4.0,,104.90%,Aquatic Park / Ft. Mason,Marina,201804
67,"$1,568K",1,5.0,3.0,,104.90%,Aquatic Park / Ft. Mason,Marina,201805
162,"$3,000K",1,7.0,3.0,,120.20%,Dolores Heights,Mission,201803
652,"$1,490K",1,3.0,1.0,,100.00%,Cole Valley,Haight Ashbury,201903
664,"$3,850K",1,7.0,5.0,,100.00%,Cole Valley,Haight Ashbury,202004
887,"$2,250K",1,1.0,0.0,,112.80%,Duboce Triangle,Castro/Upper Market,202001
888,"$2,250K",1,1.0,0.0,,112.80%,Duboce Triangle,Castro/Upper Market,202002
889,"$2,250K",1,1.0,0.0,,112.80%,Duboce Triangle,Castro/Upper Market,202003
1560,"$1,025K",1,3.0,2.0,,100.00%,Japantown,Japantown,202002
1751,"$2,210K",1,5.0,2.0,,110.80%,Lone Mountain,Lone Mountain/USF,201803


In [30]:
modified_sales['Days on Market'].describe()

count    3674.000000
mean       30.209036
std        25.464749
min         1.000000
25%        15.000000
50%        22.000000
75%        36.000000
max       475.000000
Name: Days on Market, dtype: float64

`Days on Market` depends heavily on neighborhood and season. Every row with a missing `Days on Market` has only 1 home sold.

We searched Redfin but couldn't identify the exact properties behind these rows.
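
The observation that every missing `Days on Market` row has a single home sold can be checked programmatically rather than by eye. A minimal sketch on toy data (the `toy` frame and its values are made up; on the notebook's data the same expression would be applied to `modified_sales`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Homes Sold': [3, 1, 1, 2],
    'Days on Market': [19.0, np.nan, np.nan, 75.0],
})

# Select the rows with missing Days on Market and test that Homes Sold is 1 for all of them
missing = toy['Days on Market'].isna()
all_single_sale = bool(toy.loc[missing, 'Homes Sold'].eq(1).all())
print(all_single_sale)
```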

**QUESTION: How should we deal with this missing data for our aggregation purposes?**

Let's take a peek at what either ffill or bfill would look like.

In [31]:
missing_days_on_market = list(modified_sales[modified_sales['Days on Market'].isna()].index.values)
print(missing_days_on_market)
for i in missing_days_on_market:
    print(modified_sales.iloc[i-2:i+2])

[66, 67, 162, 652, 664, 887, 888, 889, 1560, 1751, 1974, 1985, 2189, 3012, 3648]
   Median Sale Price  Homes Sold  New Listings  Inventory  Days on Market  \
64           $1,638K           2           3.0        2.0            19.0   
65           $1,568K           3           4.0        2.0            19.0   
66           $1,568K           1           4.0        4.0             NaN   
67           $1,568K           1           5.0        3.0             NaN   

   Average Sale To List          311 Neighborhood Analysis Neighborhood  \
64              100.10%  Aquatic Park / Ft. Mason                Marina   
65              101.70%  Aquatic Park / Ft. Mason                Marina   
66              104.90%  Aquatic Park / Ft. Mason                Marina   
67              104.90%  Aquatic Park / Ft. Mason                Marina   

   Sales Year Month  
64           201802  
65           201803  
66           201804  
67           201805  
   Median Sale Price  Homes Sold  New Listings 

**ANSWER: For the most part, ffill looks like it would work. The only record that appears to have a discrepancy is row 664, so let's look deeper at it.**

In [32]:
modified_sales.iloc[660:666]

Unnamed: 0,Median Sale Price,Homes Sold,New Listings,Inventory,Days on Market,Average Sale To List,311 Neighborhood,Analysis Neighborhood,Sales Year Month
660,"$1,825K",7,4.0,3.0,48.0,104.90%,Cole Valley,Haight Ashbury,201912
661,"$1,500K",7,1.0,0.0,48.0,105.90%,Cole Valley,Haight Ashbury,202001
662,"$2,305K",5,2.0,2.0,63.0,105.80%,Cole Valley,Haight Ashbury,202002
663,"$2,225K",2,4.0,3.0,167.0,94.80%,Cole Valley,Haight Ashbury,202003
664,"$3,850K",1,7.0,5.0,,100.00%,Cole Valley,Haight Ashbury,202004
665,"$2,620K",2,9.0,6.0,75.0,103.70%,Cole Valley,Haight Ashbury,202005


In [33]:
# ffill would copy row 663's 167 days into row 664; we will instead use row 662's Days on Market (63) for row 664
modified_sales.at[664,'Days on Market'] = modified_sales.at[662,'Days on Market']

In [34]:
# now we can use ffill
modified_sales['Days on Market'] = modified_sales['Days on Market'].ffill()
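
One caveat with a plain forward fill is that it ignores neighborhood boundaries: if the first row of a neighborhood were missing, it would inherit the last value of the neighborhood listed before it. The sketch below contrasts that with a group-wise fill on toy data. Grouping by `311 Neighborhood` is an assumption here (the original Redfin `Region` has already been dropped at this point), and the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    '311 Neighborhood': ['A', 'A', 'B', 'B'],
    'Sales Year Month': ['201801', '201802', '201801', '201802'],
    'Days on Market': [20.0, np.nan, np.nan, 35.0],
})

# Plain ffill: B's first month inherits A's last known value
plain = toy['Days on Market'].ffill()

# Group-wise ffill: gaps are only filled from earlier rows of the same neighborhood
grouped = toy.groupby('311 Neighborhood')['Days on Market'].ffill()

print(plain.tolist())
print(grouped.tolist())
```

The group-wise version leaves a leading gap unfilled, so it would need a follow-up decision (for example a backward fill within the group) if such rows existed.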

In [35]:
modified_sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3689 entries, 0 to 3688
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Median Sale Price      3689 non-null   object 
 1   Homes Sold             3689 non-null   int64  
 2   New Listings           3689 non-null   float64
 3   Inventory              3689 non-null   float64
 4   Days on Market         3689 non-null   float64
 5   Average Sale To List   3689 non-null   object 
 6   311 Neighborhood       3689 non-null   object 
 7   Analysis Neighborhood  3689 non-null   object 
 8   Sales Year Month       3689 non-null   object 
dtypes: float64(3), int64(1), object(5)
memory usage: 448.2+ KB


### 3.6.4 Fixing datatypes<a id='3.6.4_Fixing_datatypes'></a>

In [36]:
modified_sales.head()

Unnamed: 0,Median Sale Price,Homes Sold,New Listings,Inventory,Days on Market,Average Sale To List,311 Neighborhood,Analysis Neighborhood,Sales Year Month
0,"$1,720K",6,2.0,2.0,22.0,109.10%,Alamo Square,Hayes Valley,201801
1,"$1,020K",2,1.0,1.0,72.0,113.00%,Alamo Square,Hayes Valley,201802
2,"$1,023K",2,4.0,2.0,134.0,104.40%,Alamo Square,Hayes Valley,201803
3,"$1,150K",3,8.0,3.0,73.0,110.00%,Alamo Square,Hayes Valley,201804
4,"$2,000K",7,9.0,2.0,14.0,110.40%,Alamo Square,Hayes Valley,201805


In [37]:
# change Median Sale Price into a number
modified_sales['Median Sale Price'] = modified_sales['Median Sale Price'].apply(
    lambda x: x.replace('$', '').replace(',', '').replace('K', '000') if isinstance(x, str) else x).astype(int)
modified_sales['Median Sale Price'].describe()

count    3.689000e+03
mean     1.569028e+06
std      6.804667e+05
min      2.650000e+05
25%      1.123000e+06
50%      1.488000e+06
75%      1.850000e+06
max      8.975000e+06
Name: Median Sale Price, dtype: float64
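
The string replacement above works because every price in this dataset is a whole number of thousands ending in `K`. If the format ever included decimals or another suffix, replacing `K` with `000` would give wrong results. A more defensive sketch follows; `parse_price` is a hypothetical helper, and the `M` suffix is an assumption that does not appear in this data.

```python
import pandas as pd

def parse_price(text):
    """Convert a price string such as '$1,720K' to an integer number of dollars."""
    # 'M' is an assumed suffix for illustration; only 'K' appears in this dataset
    multipliers = {'K': 1_000, 'M': 1_000_000}
    cleaned = text.replace('$', '').replace(',', '').strip()
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    return int(round(float(cleaned)))

prices = pd.Series(['$1,720K', '$813K', '$1.2M'])
parsed = prices.map(parse_price)
print(parsed.tolist())  # [1720000, 813000, 1200000]
```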

In [38]:
# change Average Sale To List to a number
modified_sales['Average Sale To List'] = modified_sales['Average Sale To List'].apply(
    lambda x: x.replace('%', '') if isinstance(x, str) else x).astype(float)
modified_sales.rename(columns={'Average Sale To List': 'Average Sale To List Pct'}, inplace=True)
modified_sales['Average Sale To List Pct'].describe()

count    3689.000000
mean      108.088561
std         7.814174
min        66.000000
25%       102.200000
50%       107.200000
75%       112.700000
max       144.900000
Name: Average Sale To List Pct, dtype: float64

In [39]:
modified_sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3689 entries, 0 to 3688
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Median Sale Price         3689 non-null   int32  
 1   Homes Sold                3689 non-null   int64  
 2   New Listings              3689 non-null   float64
 3   Inventory                 3689 non-null   float64
 4   Days on Market            3689 non-null   float64
 5   Average Sale To List Pct  3689 non-null   float64
 6   311 Neighborhood          3689 non-null   object 
 7   Analysis Neighborhood     3689 non-null   object 
 8   Sales Year Month          3689 non-null   object 
dtypes: float64(4), int32(1), int64(1), object(3)
memory usage: 273.8+ KB


### 3.6.5 Aggregating data<a id='3.6.5_Aggregating_data'></a>

Now that the datatypes have been fixed, we will apply the following aggregation methods:
  * `Median Sale Price`: mean()
  * `Homes Sold`: sum()
  * `New Listings`: sum()
  * `Inventory`: sum()
  * `Days on Market`: mean()
  * `Average Sale To List Pct`: mean()

In [40]:
cols_to_groupby = ['Sales Year Month', '311 Neighborhood', 'Analysis Neighborhood']
cols_to_agg = ['Median Sale Price', 'Homes Sold', 'New Listings', 'Inventory', 'Days on Market', 'Average Sale To List Pct']
cols_agg_methods = ['mean', 'sum', 'sum', 'sum', 'mean', 'mean']
agg_methods_to_use = dict(zip(cols_to_agg, cols_agg_methods))
aggregated_sales = modified_sales.groupby(cols_to_groupby).agg(agg_methods_to_use)
aggregated_sales = aggregated_sales.reset_index()
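
To see what this aggregation does when several Redfin regions map to the same neighborhood in the same month, here is a toy sketch with made-up values. Note that taking the mean of medians is an unweighted approximation; it is not the median of the combined sales.

```python
import pandas as pd

# Two rows share a neighborhood and month, as happens when two Redfin regions map to one 311 Neighborhood
toy = pd.DataFrame({
    'Sales Year Month': ['201801', '201801', '201801'],
    '311 Neighborhood': ['Ashbury Heights', 'Ashbury Heights', 'Alamo Square'],
    'Median Sale Price': [1_500_000, 1_900_000, 1_720_000],
    'Homes Sold': [10, 13, 6],
    'Days on Market': [14.0, 18.0, 22.0],
})

agg = (toy.groupby(['Sales Year Month', '311 Neighborhood'])
          .agg({'Median Sale Price': 'mean', 'Homes Sold': 'sum', 'Days on Market': 'mean'})
          .reset_index())
print(agg)
```

The two Ashbury Heights rows collapse into one row with a price of 1,700,000, 23 homes sold, and 16 days on market, while Alamo Square passes through unchanged.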

In [41]:
aggregated_sales.head()

Unnamed: 0,Sales Year Month,311 Neighborhood,Analysis Neighborhood,Median Sale Price,Homes Sold,New Listings,Inventory,Days on Market,Average Sale To List Pct
0,201801,Alamo Square,Hayes Valley,1720000.0,6,2.0,2.0,22.0,109.1
1,201801,Anza Vista,Lone Mountain/USF,1475000.0,4,1.0,2.0,18.0,117.5
2,201801,Aquatic Park / Ft. Mason,Marina,1230000.0,3,6.0,2.0,34.0,100.9
3,201801,Ashbury Heights,Haight Ashbury,1711500.0,23,8.0,1.0,16.0,110.85
4,201801,Balboa Terrace,West of Twin Peaks,1925000.0,3,2.0,1.0,14.0,104.2


In [42]:
aggregated_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3359 entries, 0 to 3358
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Sales Year Month          3359 non-null   object 
 1   311 Neighborhood          3359 non-null   object 
 2   Analysis Neighborhood     3359 non-null   object 
 3   Median Sale Price         3359 non-null   float64
 4   Homes Sold                3359 non-null   int64  
 5   New Listings              3359 non-null   float64
 6   Inventory                 3359 non-null   float64
 7   Days on Market            3359 non-null   float64
 8   Average Sale To List Pct  3359 non-null   float64
dtypes: float64(5), int64(1), object(3)
memory usage: 236.3+ KB


## 3.7 Save data<a id='3.7_Save_data'></a>

We will be saving our data with these new Neighborhoods to a separate location, to guard against overwriting our original data.

In [43]:
aggregated_sales.shape

(3359, 9)

In [44]:
datapath = 'data'

# create datapath if it doesn't exist
os.makedirs(datapath, exist_ok=True)

In [45]:
# write aggregated data, skipping the write if the file already exists
datapath_aggdata = os.path.join(datapath, 'Redfin_SF_sales_adjusted_neighborhood.csv')
if not os.path.exists(datapath_aggdata):
    aggregated_sales.to_csv(datapath_aggdata, index=False)
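
One thing to keep in mind when this file is loaded later: `Sales Year Month` is stored as a string like `201801`, but a CSV carries no type information, so `pd.read_csv` will infer it as an integer unless told otherwise. A minimal round-trip sketch using a temporary directory and a made-up two-row frame (`toy` and `toy_sales.csv` are illustrative names):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'Sales Year Month': ['201801', '201802'], 'Homes Sold': [6, 2]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_sales.csv')
    toy.to_csv(path, index=False)

    # Default read: the column is inferred as an integer
    default_read = pd.read_csv(path)

    # Explicit dtype: the column stays a string, matching the other datasets
    string_read = pd.read_csv(path, dtype={'Sales Year Month': str})

print(default_read['Sales Year Month'].dtype)
print(string_read['Sales Year Month'].tolist())
```

Passing `dtype` at load time avoids mismatched key types when joining this file to the other datasets on year-month.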