# 2 Wrangling - San Francisco 311 Data<a id='2_Wrangling_-_San_Francisco_311_Data'></a>

## 2.1 Contents<a id='2.1_Contents'></a>
* [2 Wrangling - San Francisco 311 Data](#2_Wrangling_-_San_Francisco_311_Data)
  * [2.1 Contents](#2.1_Contents)
  * [2.2 Introduction](#2.2_Introduction)
  * [2.3 Imports](#2.3_Imports)
  * [2.4 Load The Data](#2.4_Load_The_Data)
    * [2.4.1 Testing One Month](#2.4.1_Testing_One_Month)
    * [2.4.2 Load All Files](#2.4.2_Load_All_Files)
  * [2.5 Explore The Data](#2.5_Explore_The_Data)
    * [2.5.1 Shape and Column Analysis](#2.5.1_Shape_and_Column_Analysis)
    * [2.5.2 Dropping unneeded columns](#2.5.2_Dropping_unneeded_columns)
    * [2.5.3 Reviewing NULL values](#2.5.3_Reviewing_NULL_values)
      * [2.5.3.1 Filed Online](#2.5.3.1_Filed_Online)
      * [2.5.3.2 Analysis Neighborhood](#2.5.3.2_Analysis_Neighborhood)
  * [2.13 Summary](#2.13_Summary)

## 2.2 Introduction<a id='2.2_Introduction'></a>

Data was downloaded from <a href="https://data.sfgov.org/City-Infrastructure/311-Cases/vw6y-z8j6">San Francisco's Open Data portal regarding 311 Data</a>. Because the data was too large to download as a single file, it was downloaded by month into separate CSV files, spanning January 2018 through September 2020. The dataset comprises SF311 requests; SF311 is the primary customer service center for the City of San Francisco. It receives requests via phone, web, mobile, and Twitter.

We plan to explore this data in conjunction with San Francisco's Police Incident Report data as well as Redfin's San Francisco neighborhood house sales data, and we will do this by comparing across San Francisco neighborhoods and supervisor districts.

## 2.3 Imports<a id='2.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

## 2.4 Load The Data<a id='2.4_Load_The_Data'></a>

The data is stored as one CSV file per month, so each month must be loaded separately. Let's load a single month first to explore it quickly, see whether we can reduce the number of rows, and decide which columns to keep when we combine the files.

### 2.4.1 Testing One Month<a id='2.4.1_Testing_One_Month'></a>

In [2]:
Jan2018_data = pd.read_csv('raw_data/311_Cases_201801.csv')

In [3]:
Jan2018_data.head()

Unnamed: 0,CaseID,Opened,Closed,Updated,Status,Status Notes,Responsible Agency,Category,Request Type,Request Details,Address,Street,Supervisor District,Neighborhood,Police District,Latitude,Longitude,Point,Source,Media URL
0,8448866,01/01/2018 12:25:00 AM,01/05/2018 01:07:00 PM,01/05/2018 01:07:00 PM,Closed,Case is a Duplicate,DPW Ops Queue,Street and Sidewalk Cleaning,Bulky Items,Boxed or Bagged Items,"2833 25TH ST, SAN FRANCISCO, CA, 94110",25TH ST,9.0,Mission,MISSION,37.751209,-122.406845,"(37.75120926, -122.40684509)",Phone,
1,8448872,01/01/2018 12:28:00 AM,01/30/2018 10:21:55 PM,01/30/2018 10:21:55 PM,Closed,Case is a Duplicate - Request is a duplicate a...,DPW Ops Queue,Graffiti,Graffiti on Bike_rack,Bike_rack - Not_Offensive,"909 PAGE ST, SAN FRANCISCO, CA, 94117",PAGE ST,5.0,Lower Haight,NORTHERN,37.7723,-122.435745,"(37.7723, -122.435745)",Mobile/Open311,http://mobile311.sfgov.org/reports/8448872/photos
2,8448870,01/01/2018 12:28:00 AM,01/04/2018 01:15:50 PM,01/04/2018 01:15:50 PM,Closed,Case Resolved - Completed cleaned,DPW Ops Queue,Street and Sidewalk Cleaning,General Cleaning,Other Loose Garbage,"2833 25TH ST, SAN FRANCISCO, CA, 94110",25TH ST,9.0,Mission,MISSION,37.751209,-122.406845,"(37.75120926, -122.40684509)",Phone,
3,8448879,01/01/2018 12:31:00 AM,01/09/2018 05:14:00 AM,01/09/2018 05:14:00 AM,Closed,DPT Abandoned Vehicles- Gone on Arrival - veh ...,DPT Abandoned Vehicles Work Queue,Abandoned Vehicle,Abandoned Vehicles,DPT Abandoned Vehicles Low,"115 SAGAMORE ST, SAN FRANCISCO, CA, 94112",SAGAMORE ST,11.0,Oceanview,TARAVAL,37.711192,-122.456682,"(37.7111921, -122.4566817)",Web,http://mobile311.sfgov.org/reports/8448879/photos
4,8448882,01/01/2018 12:33:00 AM,01/01/2018 10:42:27 AM,01/01/2018 10:42:27 AM,Closed,Case Resolved - Pickup completed.,Recology_Abandoned,Street and Sidewalk Cleaning,Bulky Items,Boxed or Bagged Items,"1468 HAMPSHIRE ST, SAN FRANCISCO, CA, 94110",HAMPSHIRE ST,9.0,Mission,MISSION,37.749149,-122.406746,"(37.74914932, -122.40674591)",Phone,


In [4]:
Jan2018_data.shape

(47537, 20)

In [5]:
Jan2018_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47537 entries, 0 to 47536
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   CaseID               47537 non-null  int64  
 1   Opened               47537 non-null  object 
 2   Closed               47404 non-null  object 
 3   Updated              47537 non-null  object 
 4   Status               47537 non-null  object 
 5   Status Notes         47499 non-null  object 
 6   Responsible Agency   47537 non-null  object 
 7   Category             47537 non-null  object 
 8   Request Type         47537 non-null  object 
 9   Request Details      46966 non-null  object 
 10  Address              47537 non-null  object 
 11  Street               45891 non-null  object 
 12  Supervisor District  45891 non-null  float64
 13  Neighborhood         45771 non-null  object 
 14  Police District      45759 non-null  object 
 15  Latitude             47537 non-null 

In [6]:
# Let's convert the columns Opened, Closed, and Updated to type DateTime
Jan2018_data['Opened'] = pd.to_datetime(Jan2018_data['Opened'])
Jan2018_data['Closed'] = pd.to_datetime(Jan2018_data['Closed'])
Jan2018_data['Updated'] = pd.to_datetime(Jan2018_data['Updated'])
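
The conversion above lets pandas infer the timestamp layout. As a minimal sketch, an explicit `format` can make the parsing stricter and usually faster on large files. The format string below is an assumption based on the sample rows shown by `head()`, and the small `sample` frame is a stand-in for `Jan2018_data`.

```python
import pandas as pd

# Sample timestamps copied from the head() output above.
sample = pd.DataFrame({
    "Opened": ["01/01/2018 12:25:00 AM", "01/01/2018 12:28:00 AM"],
    "Closed": ["01/05/2018 01:07:00 PM", None],  # a missing Closed value
})

# Assumed layout: month/day/year, 12-hour clock, AM/PM marker.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

for col in ["Opened", "Closed"]:
    # Missing values become NaT instead of raising an error.
    sample[col] = pd.to_datetime(sample[col], format=DATE_FORMAT)

print(sample.dtypes)
```

If any month uses a different layout, parsing with a fixed format raises an error, which surfaces the inconsistency early rather than silently misreading dates.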

In [7]:
# looks like there is some data where the Neighborhood and/or the Supervisor District is null
# Let's review the Neighborhood NULLs first
Jan2018_data[Jan2018_data['Neighborhood'].isna()].head().T

Unnamed: 0,190,236,522,661,664
CaseID,8449474,8449570,8450010,8450361,8450366
Opened,2018-01-01 10:24:00,2018-01-01 10:56:00,2018-01-01 12:46:44,2018-01-01 14:53:00,2018-01-01 14:56:57
Closed,2018-01-03 12:50:11,2018-01-02 11:29:00,2018-01-01 13:00:00,2018-01-02 06:50:00,2018-01-02 14:07:00
Updated,2018-01-03 12:50:11,2018-01-02 11:29:00,2018-01-01 13:00:00,2018-01-02 06:50:00,2018-01-02 14:07:00
Status,Closed,Closed,Closed,Closed,Closed
Status Notes,Case Resolved - Pickup completed,Case is Invalid - Fire debris is responsibilit...,Case Resolved - Please contact the registratio...,Not Accepted - WEEKENDS AND HOLIDAYS MUST CALL...,"Case Resolved - Jessa Lazo ""! I rescheduled th..."
Responsible Agency,Recology_Litter,311 Supervisor Queue,RPD NSA Queue,SFMTA - Parking Enforcement - G,County Clerk - G
Category,Street and Sidewalk Cleaning,General Request - PUBLIC WORKS,General Request - RPD,General Request - MTA,General Request - COUNTY CLERK
Request Type,Bulky Items,complaint,other,complaint,request_for_service
Request Details,Refrigerator,bsm - complaint,rpd_other - other,parking_enforcement - complaint,county_clerk - request_for_service


In [8]:
Jan2018_data[Jan2018_data['Neighborhood'].isna()].groupby("Address")['CaseID'].count().sort_values(ascending=False)

Address
Not associated with a specific address         1646
7 FULTON ST, SAN FRANCISCO, CA, 94102            19
PIER 45, SAN FRANCISCO, CA, 94133                 5
1075 O'FARRELL ST, SAN FRANCISCO, CA, 94109       5
275 O'FARRELL ST, SAN FRANCISCO, CA, 94102        4
                                               ... 
485 O'FARRELL ST, SAN FRANCISCO, CA, 94102        1
PIER 35, SAN FRANCISCO, CA, 94133                 1
445 O'FARRELL ST, SAN FRANCISCO, CA, 94102        1
439 O'FARRELL ST, SAN FRANCISCO, CA, 94102        1
1 FERRY BUILDING, SAN FRANCISCO, CA, 94111        1
Name: CaseID, Length: 74, dtype: int64

In [9]:
Jan2018_data['Neighborhood'].isna().sum()

1766

In [10]:
Jan2018_data[Jan2018_data['Neighborhood'].isna()]["Street"].isna().sum()

1646

Out of 1766 null Neighborhoods, 1646 have 'Not associated with a specific address' and a NULL Street value! The remaining 120 have a street address but no Neighborhood.
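
The overlap between the two sets of NULLs can also be read from a single table. The sketch below uses `pd.crosstab` on a toy frame; the rows are illustrative stand-ins, not real records from `Jan2018_data`.

```python
import pandas as pd

# Toy stand-in for Jan2018_data; values are illustrative only.
toy = pd.DataFrame({
    "Neighborhood": [None, None, None, "Mission", "Oceanview"],
    "Street": [None, None, "FULTON ST", "25TH ST", "SAGAMORE ST"],
})

# Rows: is Neighborhood null? Columns: is Street null?
null_overlap = pd.crosstab(
    toy["Neighborhood"].isna().rename("Neighborhood is null"),
    toy["Street"].isna().rename("Street is null"),
)
print(null_overlap)
```

On the real data, the same call with `Jan2018_data` in place of `toy` would show all four combinations at once, instead of one count per cell.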

In [11]:
# Let's review the NULL Supervisor District values
Jan2018_data[Jan2018_data['Supervisor District'].isna()].head().T

Unnamed: 0,190,236,522,661,664
CaseID,8449474,8449570,8450010,8450361,8450366
Opened,2018-01-01 10:24:00,2018-01-01 10:56:00,2018-01-01 12:46:44,2018-01-01 14:53:00,2018-01-01 14:56:57
Closed,2018-01-03 12:50:11,2018-01-02 11:29:00,2018-01-01 13:00:00,2018-01-02 06:50:00,2018-01-02 14:07:00
Updated,2018-01-03 12:50:11,2018-01-02 11:29:00,2018-01-01 13:00:00,2018-01-02 06:50:00,2018-01-02 14:07:00
Status,Closed,Closed,Closed,Closed,Closed
Status Notes,Case Resolved - Pickup completed,Case is Invalid - Fire debris is responsibilit...,Case Resolved - Please contact the registratio...,Not Accepted - WEEKENDS AND HOLIDAYS MUST CALL...,"Case Resolved - Jessa Lazo ""! I rescheduled th..."
Responsible Agency,Recology_Litter,311 Supervisor Queue,RPD NSA Queue,SFMTA - Parking Enforcement - G,County Clerk - G
Category,Street and Sidewalk Cleaning,General Request - PUBLIC WORKS,General Request - RPD,General Request - MTA,General Request - COUNTY CLERK
Request Type,Bulky Items,complaint,other,complaint,request_for_service
Request Details,Refrigerator,bsm - complaint,rpd_other - other,parking_enforcement - complaint,county_clerk - request_for_service


In [12]:
Jan2018_data[Jan2018_data['Supervisor District'].isna()].groupby("Address")['CaseID'].count().sort_values(ascending=False)

Address
Not associated with a specific address    1646
Name: CaseID, dtype: int64

In [13]:
Jan2018_data['Supervisor District'].isna().sum()

1646

Out of 1646 null Supervisor Districts, none have a specific address associated!

In [14]:
Jan2018_data[Jan2018_data['Supervisor District'].isna()]["Neighborhood"].isna().sum()

1646

All of the NULL Supervisor Districts have a NULL Neighborhood.

Based on the above analysis, we should be able to reduce the data by:
  1. Converting `Media URL` to a True/False bool (`Has Media`) and dropping the original column
  2. Dropping columns `Street`, `Latitude`, `Longitude`, and `Point`
  3. Removing records where `Supervisor District` is null

In [15]:
Jan2018_data["Has Media"] = ~Jan2018_data["Media URL"].isna()

In [16]:
Jan2018_data.drop(columns=['Street', 'Latitude', 'Longitude', 'Point', 'Media URL'], inplace=True)

In [17]:
Jan2018_data = Jan2018_data[~Jan2018_data['Supervisor District'].isna()]
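
Since the same three steps will be needed for every month, they could be wrapped in a helper. This is a sketch only: `reduce_cases` is a hypothetical name that does not appear in the notebook, and the toy frame carries just the columns the helper touches.

```python
import pandas as pd

def reduce_cases(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three reduction steps and return a new DataFrame."""
    out = df.copy()
    # 1. Collapse Media URL into a boolean flag.
    out["Has Media"] = out["Media URL"].notna()
    # 2. Drop the location detail columns and the raw URL.
    out = out.drop(columns=["Street", "Latitude", "Longitude", "Point", "Media URL"])
    # 3. Keep only rows that have a Supervisor District.
    return out[out["Supervisor District"].notna()]

# Toy input with illustrative values, not real records.
toy = pd.DataFrame({
    "CaseID": [1, 2, 3],
    "Street": ["25TH ST", None, "PAGE ST"],
    "Supervisor District": [9.0, None, 5.0],
    "Latitude": [37.75, 0.0, 37.77],
    "Longitude": [-122.41, 0.0, -122.44],
    "Point": ["(37.75, -122.41)", "(0.0, 0.0)", "(37.77, -122.44)"],
    "Media URL": [None, None, "http://example.com/photo"],
})

reduced = reduce_cases(toy)
print(reduced)
```

Working on a copy and returning the result avoids `inplace=True`, so the original frame stays available for comparison.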

In [18]:
Jan2018_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45891 entries, 0 to 47536
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   CaseID               45891 non-null  int64         
 1   Opened               45891 non-null  datetime64[ns]
 2   Closed               45777 non-null  datetime64[ns]
 3   Updated              45891 non-null  datetime64[ns]
 4   Status               45891 non-null  object        
 5   Status Notes         45866 non-null  object        
 6   Responsible Agency   45891 non-null  object        
 7   Category             45891 non-null  object        
 8   Request Type         45891 non-null  object        
 9   Request Details      45325 non-null  object        
 10  Address              45891 non-null  object        
 11  Supervisor District  45891 non-null  float64       
 12  Neighborhood         45771 non-null  object        
 13  Police District      45759 non-

### 2.4.2 Load All Files<a id='2.4.2_Load_All_Files'></a>

In [23]:
filename_prefix = 'raw_data/311_Cases_'
filename_suffix = '.csv'

case_data = pd.DataFrame()

# note that we only have data up to September 2020
for year in [2018,2019,2020]:
    for month in range(1,13):
        if year == 2020 and month == 10:
            break
        filename = filename_prefix + str(year)+str(month).zfill(2) + filename_suffix
        print(filename)
        # Loading is disabled here; this cell only verifies the filenames.
        # period_data = pd.read_csv(filename, low_memory=False)
        # print(period_data.shape)
        # print(period_data.info())
        # case_data = pd.concat([case_data, period_data])

raw_data/311_Cases_201801.csv
raw_data/311_Cases_201802.csv
raw_data/311_Cases_201803.csv
raw_data/311_Cases_201804.csv
raw_data/311_Cases_201805.csv
raw_data/311_Cases_201806.csv
raw_data/311_Cases_201807.csv
raw_data/311_Cases_201808.csv
raw_data/311_Cases_201809.csv
raw_data/311_Cases_201810.csv
raw_data/311_Cases_201811.csv
raw_data/311_Cases_201812.csv
raw_data/311_Cases_201901.csv
raw_data/311_Cases_201902.csv
raw_data/311_Cases_201903.csv
raw_data/311_Cases_201904.csv
raw_data/311_Cases_201905.csv
raw_data/311_Cases_201906.csv
raw_data/311_Cases_201907.csv
raw_data/311_Cases_201908.csv
raw_data/311_Cases_201909.csv
raw_data/311_Cases_201910.csv
raw_data/311_Cases_201911.csv
raw_data/311_Cases_201912.csv
raw_data/311_Cases_202001.csv
raw_data/311_Cases_202002.csv
raw_data/311_Cases_202003.csv
raw_data/311_Cases_202004.csv
raw_data/311_Cases_202005.csv
raw_data/311_Cases_202006.csv
raw_data/311_Cases_202007.csv
raw_data/311_Cases_202008.csv
raw_data/311_Cases_202009.csv
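
Once the filenames check out, the files can be combined. The sketch below is one possible approach, not the notebook's final loading code: `monthly_filenames` and `load_all` are hypothetical helper names, and the demo writes two tiny CSVs to a temporary directory so it runs without the real `raw_data` folder. It collects the frames in a list and calls `pd.concat` once, since `DataFrame.append` was removed in pandas 2.0 and concatenating inside the loop copies the data repeatedly.

```python
import os
import tempfile
import pandas as pd

def monthly_filenames(start="2018-01", end="2020-09",
                      prefix="raw_data/311_Cases_", suffix=".csv"):
    """Build one filename per month between start and end, inclusive."""
    periods = pd.period_range(start=start, end=end, freq="M")
    return [f"{prefix}{p.strftime('%Y%m')}{suffix}" for p in periods]

def load_all(filenames):
    """Read every file and concatenate the results in a single step."""
    frames = [pd.read_csv(name, low_memory=False) for name in filenames]
    return pd.concat(frames, ignore_index=True)

filenames = monthly_filenames()
print(len(filenames), filenames[0], filenames[-1])

# Demo on two tiny stand-in files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_files = monthly_filenames("2018-01", "2018-02",
                                   prefix=os.path.join(tmp, "311_Cases_"))
    for i, name in enumerate(demo_files):
        pd.DataFrame({"CaseID": [2 * i, 2 * i + 1]}).to_csv(name, index=False)
    demo = load_all(demo_files)

print(demo)
```

Generating the months with `pd.period_range` removes the need for the special-case `break` at October 2020, because the end of the range is stated directly.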
