# 3 Wrangling - San Francisco Home Sales by Neighborhood (Redfin)<a id='3_Wrangling_-_San_Francisco_Home_Sales_by_Neighborhood_(Redfin)'></a>

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Wrangling - San Francisco Home Sales by Neighborhood (Redfin)](#3_Wrangling_-_San_Francisco_Home_Sales_by_Neighborhood_(Redfin))
  * [3.1 Contents](#3.1_Contents)
  * [3.2 Introduction](#3.2_Introduction)
  * [3.3 Imports](#3.3_Imports)
  * [3.4 Load The Data](#3.4_Load_The_Data)
  * [3.5 Explore The Data](#3.5_Explore_The_Data)
    * [3.5.1 Shape and Column Analysis](#3.5.1_Shape_and_Column_Analysis)
    * [3.5.2 Examining Region](#3.5.2_Examining_Region)
    * [3.5.3 Reviewing NULL values](#3.5.3_Reviewing_NULL_values)
      * [3.5.3.1 Unique Resort Names](#3.5.3.1_Filed_Online)
      * [3.5.3.2 Analysis Neighborhood](#3.5.3.2_Analysis_Neighborhood)
  * [3.13 Summary](#3.13_Summary)

## 3.2 Introduction<a id='3.2_Introduction'></a>

The data was provided by <a href="https://www.redfin.com/">Redfin</a>, a national real estate brokerage, and covers San Francisco home sales by neighborhood. It was downloaded in October 2020 as a single CSV file and spans January 2018 through September 2020. The dataset does not contain individual home sales; it is an aggregated monthly view per neighborhood.

We plan to explore this data in conjunction with San Francisco's Police Incident Report data as well as San Francisco's 311 case data, and we will do this by comparing across San Francisco neighborhoods and supervisor districts.

## 3.3 Imports<a id='3.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

## 3.4 Load The Data<a id='3.4_Load_The_Data'></a>

In [4]:
sales_data = pd.read_csv('raw_data/Redfin_SF-home-sales-by-neighborhood.csv')

In [5]:
sales_data.head()

| | Region | Month of Period End | Median Sale Price | Median Sale Price MoM | Median Sale Price YoY | Homes Sold | Homes Sold MoM | Homes Sold YoY | New Listings | New Listings MoM | New Listings YoY | Inventory | Inventory MoM | Inventory YoY | Days on Market | Days on Market MoM | Days on Market YoY | Average Sale To List | Average Sale To List MoM | Average Sale To List YoY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | San Francisco, CA - Alamo Square | Jan-18 | $1,720K | 22.90% | 107.20% | 6 | -14.30% | 20.00% | 2.0 | -50.00% | 100.00% | 2.0 | 100.00% | 0.00% | 22.0 | 1.0 | -32.0 | 109.10% | -2.10% | 7.70% |
| 1 | San Francisco, CA - Alamo Square | Feb-18 | $1,020K | -40.70% | 27.50% | 2 | -66.70% | -33.30% | 1.0 | -50.00% | -83.30% | 1.0 | -50.00% | -80.00% | 72.0 | 50.0 | -120.0 | 113.00% | 3.90% | 12.90% |
| 2 | San Francisco, CA - Alamo Square | Mar-18 | $1,023K | 0.20% | -24.30% | 2 | 0.00% | -60.00% | 4.0 | 300.00% | -33.30% | 2.0 | 100.00% | 0.00% | 134.0 | 63.0 | 103.0 | 104.40% | -8.60% | 3.60% |
| 3 | San Francisco, CA - Alamo Square | Apr-18 | $1,150K | 12.50% | 1.10% | 3 | 50.00% | -50.00% | 8.0 | 100.00% | -11.10% | 3.0 | 50.00% | 200.00% | 73.0 | -61.0 | 42.0 | 110.00% | 5.60% | 6.70% |
| 4 | San Francisco, CA - Alamo Square | May-18 | $2,000K | 73.90% | 116.00% | 7 | 133.30% | -22.20% | 9.0 | 12.50% | 125.00% | 2.0 | -33.30% | NaN | 14.0 | -59.0 | -17.0 | 110.40% | 0.40% | 6.70% |


## 3.5 Explore The Data<a id='3.5_Explore_The_Data'></a>

### 3.5.1 Shape and Column Analysis<a id='3.5.1_Shape_and_Column_Analysis'></a>

In [6]:
sales_data.shape

(4084, 20)

In [7]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4084 entries, 0 to 4083
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Region                     4084 non-null   object 
 1   Month of Period End        4084 non-null   object 
 2   Median Sale Price          4084 non-null   object 
 3   Median Sale Price MoM      4013 non-null   object 
 4   Median Sale Price YoY      3964 non-null   object 
 5   Homes Sold                 4084 non-null   int64  
 6   Homes Sold MoM             4013 non-null   object 
 7   Homes Sold YoY             3964 non-null   object 
 8   New Listings               4021 non-null   float64
 9   New Listings MoM           3929 non-null   object 
 10  New Listings YoY           3881 non-null   object 
 11  Inventory                  3729 non-null   float64
 12  Inventory MoM              3474 non-null   object 
 13  Inventory YoY              3418 non-null   object

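The data set has 4084 rows and 20 columns. Every row has a non-null `Region`, which is Redfin's neighborhood identifier. Most of the other columns shown, including `Median Sale Price`, are stored as `object` (string) columns rather than as numbers.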
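Several columns in the `info()` output above are `object` dtype because they hold formatted strings such as `$1,720K`, `22.90%`, and `Jan-18`. The following is a minimal sketch of one way to convert them, using a toy frame built from two rows of the `head()` output. The helper names `parse_price_k` and `parse_percent` and the new column names are hypothetical, and the price helper assumes every value uses the `K` (thousands) suffix, which has not been verified for the full file.

```python
import pandas as pd

# Toy rows copied from the head() output; not the full dataset.
sample = pd.DataFrame({
    "Month of Period End": ["Jan-18", "Feb-18"],
    "Median Sale Price": ["$1,720K", "$1,020K"],
    "Median Sale Price MoM": ["22.90%", "-40.70%"],
})

def parse_price_k(s: pd.Series) -> pd.Series:
    """Convert strings like '$1,720K' to dollars. Assumes a 'K' suffix on every value."""
    cleaned = s.str.replace(r"[$,K]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce") * 1_000

def parse_percent(s: pd.Series) -> pd.Series:
    """Convert strings like '22.90%' to a fraction (about 0.229)."""
    return pd.to_numeric(s.str.rstrip("%"), errors="coerce") / 100

sample["median_sale_price_usd"] = parse_price_k(sample["Median Sale Price"])
sample["median_sale_price_mom"] = parse_percent(sample["Median Sale Price MoM"])
# 'Jan-18' is abbreviated month name plus two-digit year.
sample["period_end"] = pd.to_datetime(sample["Month of Period End"], format="%b-%y")

print(sample.dtypes)
```

Missing values stay as `NaN` through both helpers, so the null counts reported by `info()` should carry over to the converted columns.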

### 3.5.2 Examining Region<a id='3.5.2_Examining_Region'></a>

Since we will be using Neighborhoods to compare across data sets, let's take a look at the `Region` values.

In [8]:
sales_data['Region'].value_counts()

San Francisco, CA - Transmission                      33
San Francisco, CA - Visitacion Valley                 33
San Francisco, CA - Golden Gate Heights               33
South San Francisco, CA - Paradise Valley-Terrabay    33
San Francisco, CA - South of Market                   33
                                                      ..
San Francisco, CA - Duboce Park                       15
San Francisco, CA - India Basin                       14
San Francisco, CA - Presidio National Park            11
San Francisco, CA - Design District                    9
San Francisco, CA - Produce Market                     3
Name: Region, Length: 131, dtype: int64

Looks like we've included some South San Francisco rows! South San Francisco is a separate city from San Francisco, so let's list those regions and then drop them.

In [11]:
sales_data[sales_data['Region'].str.startswith('South San Francisco')]['Region'].value_counts()

South San Francisco, CA - Winston-Serra                   33
South San Francisco, CA - Avalon                          33
South San Francisco, CA - Sunshine Gardens                33
South San Francisco, CA - Downtown South San Francisco    33
South San Francisco, CA - Westborough                     33
South San Francisco, CA - Paradise Valley-Terrabay        33
South San Francisco, CA - Sign Hill                       33
South San Francisco, CA - Orange Park                     32
Name: Region, dtype: int64
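The `startswith` check works because the unwanted prefix is already known. A more general check, sketched below on a few `Region` strings taken from the output above, splits each value into a city part and a neighborhood part so that any unexpected city would show up in the counts. The column names `city` and `neighborhood` are illustrative and do not exist in the Redfin file.

```python
import pandas as pd

# A few Region values copied from the notebook output; not the full column.
regions = pd.Series([
    "San Francisco, CA - Alamo Square",
    "San Francisco, CA - South of Market",
    "South San Francisco, CA - Paradise Valley-Terrabay",
], name="Region")

# Split once on ' - ' so hyphenated neighborhood names stay intact.
parts = regions.str.split(" - ", n=1, expand=True)
parts.columns = ["city", "neighborhood"]

# Counting by city reveals every city present, not just one known prefix.
city_counts = parts["city"].value_counts()
print(city_counts)
```

On the full data, the same split could be applied to `sales_data['Region']` to confirm that only `San Francisco, CA` remains after the drop.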

In [12]:
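# Keep only San Francisco rows; .copy() avoids SettingWithCopyWarning on later column assignments
sales_data = sales_data[~sales_data['Region'].str.startswith('South San Francisco')].copy()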

In [13]:
sales_data.shape

(3821, 20)
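The new row count can be cross-checked against the South San Francisco counts printed earlier: seven regions with 33 rows and one with 32. The sketch below runs the same filter on a small made-up frame and then repeats the arithmetic with the numbers reported in this notebook. The toy frame and the variable names are illustrative only.

```python
import pandas as pd

# Toy frame standing in for sales_data.
toy = pd.DataFrame({"Region": [
    "San Francisco, CA - Alamo Square",
    "San Francisco, CA - Duboce Park",
    "South San Francisco, CA - Avalon",
]})

is_south_sf = toy["Region"].str.startswith("South San Francisco")
filtered = toy[~is_south_sf].copy()

# Rows removed should equal the number of rows that matched the prefix.
assert len(filtered) == len(toy) - is_south_sf.sum()
assert not filtered["Region"].str.startswith("South San Francisco").any()

# Same arithmetic with the counts shown in the notebook output.
south_sf_rows = 7 * 33 + 32   # 263 South San Francisco rows
remaining_rows = 4084 - south_sf_rows
print(remaining_rows)
```

The result matches the `(3821, 20)` shape reported after the drop.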

In [14]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3821 entries, 0 to 3820
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Region                     3821 non-null   object 
 1   Month of Period End        3821 non-null   object 
 2   Median Sale Price          3821 non-null   object 
 3   Median Sale Price MoM      3751 non-null   object 
 4   Median Sale Price YoY      3701 non-null   object 
 5   Homes Sold                 3821 non-null   int64  
 6   Homes Sold MoM             3751 non-null   object 
 7   Homes Sold YoY             3701 non-null   object 
 8   New Listings               3758 non-null   float64
 9   New Listings MoM           3667 non-null   object 
 10  New Listings YoY           3618 non-null   object 
 11  Inventory                  3493 non-null   float64
 12  Inventory MoM              3257 non-null   object 
 13  Inventory YoY              3201 non-null   object

### 3.5.3 Reviewing NULL values<a id='3.5.3_Reviewing_NULL_values'></a>