# 1.2 Wrangling - San Francisco 311 Data<a id='Wrangling_-_San_Francisco_311_Data'></a>

## 1 Contents<a id='1_Contents'></a>
* [Wrangling - San Francisco 311 Data](#Wrangling_-_San_Francisco_311_Data)
  * [1 Contents](#1_Contents)
  * [2 Introduction](#2_Introduction)
  * [3 Imports](#3_Imports)
  * [4 Load One Month](#4_Load_One_Month)
    * [4.1 Explore The Data](#4.1_Explore_The_Data)
      * [4.1.1 NULL Neighborhood values](#4.1.1_NULL_Neighborhood_values)
      * [4.1.2 Source](#4.1.2_Source)
      * [4.1.3 Responsible Agency](#4.1.3_Responsible_Agency)
      * [4.1.4 Category](#4.1.4_Category)
      * [4.1.5 Request Type](#4.1.5_Request_Type)
      * [4.1.6 Request Details](#4.1.6_Request_Details)
    * [4.2 Condense Data](#4.2_Condense_Data)
      * [4.2.1 Condense Data Function](#4.2.1_Condense_Data_Function)
  * [5 Load All Files](#5_Load_All_Files)
  * [6 Save data](#6_Save_data)

## 2 Introduction<a id='2_Introduction'></a>

Data was downloaded from <a href="https://data.sfgov.org/City-Infrastructure/311-Cases/vw6y-z8j6">San Francisco's Open Data portal regarding 311 Data</a>. Since the full dataset was too large to download at once, it was downloaded by month into separate CSV files, spanning from January 2018 up to and including September 2020. This dataset comprises SF311 requests; SF311 is the primary customer service center for the City of San Francisco. It receives requests via phone, web, mobile, and Twitter.

We plan to explore this data in conjunction with San Francisco's Police Incident Report data as well as Redfin's San Francisco neighborhood house sales data, and we will do this by comparing across San Francisco neighborhoods.

In this notebook, we will:
  * first load a single month of 311 data
  * explore a single month to determine how to deal with NULL values, which columns to keep, and how to aggregate the data
  * load every month, condense and aggregate it, and append it to the total data
  
At the end of this notebook, we will generate the following files:
  * 311_Cases_aggregated.csv : data aggregated by month, from January 2018 up to and including September 2020
  * 311_Neighborhoods.csv : a list of all the Neighborhoods in the data (note that this list can also be obtained from the SF Find Neighborhoods dataset, https://data.sfgov.org/Geographic-Locations-and-Boundaries/SF-Find-Neighborhoods/pty2-tcw4, which includes geospatial data)

## 3 Imports<a id='3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import re

## 4 Load One Month<a id='4_Load_One_Month'></a>

The data is located in CSV files by month/year. As a result, we must load each month/year separately. Let's load a single month to do some quick exploration to see if we can reduce the number of rows, and what we should keep for our collation.

In [2]:
Jan2018_data = pd.read_csv('../raw_data/311_Cases_201801.csv')

### 4.1 Explore The Data<a id='4.1_Explore_The_Data'></a>

In [3]:
Jan2018_data.head()

Unnamed: 0,CaseID,Opened,Closed,Updated,Status,Status Notes,Responsible Agency,Category,Request Type,Request Details,Address,Street,Supervisor District,Neighborhood,Police District,Latitude,Longitude,Point,Source,Media URL
0,8448866,01/01/2018 12:25:00 AM,01/05/2018 01:07:00 PM,01/05/2018 01:07:00 PM,Closed,Case is a Duplicate,DPW Ops Queue,Street and Sidewalk Cleaning,Bulky Items,Boxed or Bagged Items,"2833 25TH ST, SAN FRANCISCO, CA, 94110",25TH ST,9.0,Mission,MISSION,37.751209,-122.406845,"(37.75120926, -122.40684509)",Phone,
1,8448872,01/01/2018 12:28:00 AM,01/30/2018 10:21:55 PM,01/30/2018 10:21:55 PM,Closed,Case is a Duplicate - Request is a duplicate a...,DPW Ops Queue,Graffiti,Graffiti on Bike_rack,Bike_rack - Not_Offensive,"909 PAGE ST, SAN FRANCISCO, CA, 94117",PAGE ST,5.0,Lower Haight,NORTHERN,37.7723,-122.435745,"(37.7723, -122.435745)",Mobile/Open311,http://mobile311.sfgov.org/reports/8448872/photos
2,8448870,01/01/2018 12:28:00 AM,01/04/2018 01:15:50 PM,01/04/2018 01:15:50 PM,Closed,Case Resolved - Completed cleaned,DPW Ops Queue,Street and Sidewalk Cleaning,General Cleaning,Other Loose Garbage,"2833 25TH ST, SAN FRANCISCO, CA, 94110",25TH ST,9.0,Mission,MISSION,37.751209,-122.406845,"(37.75120926, -122.40684509)",Phone,
3,8448879,01/01/2018 12:31:00 AM,01/09/2018 05:14:00 AM,01/09/2018 05:14:00 AM,Closed,DPT Abandoned Vehicles- Gone on Arrival - veh ...,DPT Abandoned Vehicles Work Queue,Abandoned Vehicle,Abandoned Vehicles,DPT Abandoned Vehicles Low,"115 SAGAMORE ST, SAN FRANCISCO, CA, 94112",SAGAMORE ST,11.0,Oceanview,TARAVAL,37.711192,-122.456682,"(37.7111921, -122.4566817)",Web,http://mobile311.sfgov.org/reports/8448879/photos
4,8448882,01/01/2018 12:33:00 AM,01/01/2018 10:42:27 AM,01/01/2018 10:42:27 AM,Closed,Case Resolved - Pickup completed.,Recology_Abandoned,Street and Sidewalk Cleaning,Bulky Items,Boxed or Bagged Items,"1468 HAMPSHIRE ST, SAN FRANCISCO, CA, 94110",HAMPSHIRE ST,9.0,Mission,MISSION,37.749149,-122.406746,"(37.74914932, -122.40674591)",Phone,


In [4]:
Jan2018_data.shape

(47537, 20)

In [5]:
Jan2018_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47537 entries, 0 to 47536
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   CaseID               47537 non-null  int64  
 1   Opened               47537 non-null  object 
 2   Closed               47404 non-null  object 
 3   Updated              47537 non-null  object 
 4   Status               47537 non-null  object 
 5   Status Notes         47499 non-null  object 
 6   Responsible Agency   47537 non-null  object 
 7   Category             47537 non-null  object 
 8   Request Type         47537 non-null  object 
 9   Request Details      46966 non-null  object 
 10  Address              47537 non-null  object 
 11  Street               45891 non-null  object 
 12  Supervisor District  45891 non-null  float64
 13  Neighborhood         45771 non-null  object 
 14  Police District      45759 non-null  object 
 15  Latitude             47537 non-null 

In [6]:
# Let's convert the column Opened to type DateTime
# we are not as interested in the Closed and Updated columns
Jan2018_data['Opened'] = pd.to_datetime(Jan2018_data['Opened'])
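
The timestamp strings in this file follow a month/day/year layout with a 12-hour clock (for example `01/01/2018 12:25:00 AM`). As a minimal sketch, assuming every row uses that layout, an explicit `format` can be passed to `pd.to_datetime` so that pandas does not have to infer it. The `sample` Series below holds two timestamp strings copied from the table above; it is illustrative and not the full column.

```python
import pandas as pd

# Two timestamp strings in the layout used by the 311 export (illustrative sample)
sample = pd.Series(['01/01/2018 12:25:00 AM', '01/30/2018 10:21:55 PM'])

# Explicit format: month/day/year, 12-hour clock with an AM/PM marker
opened = pd.to_datetime(sample, format='%m/%d/%Y %I:%M:%S %p')
print(opened)
```

If a month ever contained a value in a different layout, this call would raise an error rather than silently guess, which may or may not be what is wanted.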

#### 4.1.1 NULL Neighborhood values<a id='4.1.1_NULL_Neighborhood_values'></a>

In [7]:
# looks like there is some data where the Neighborhood and/or the Supervisor District is null
# Let's review the Neighborhood NULLs first
Jan2018_data[Jan2018_data['Neighborhood'].isna()].head().T

Unnamed: 0,190,236,522,661,664
CaseID,8449474,8449570,8450010,8450361,8450366
Opened,2018-01-01 10:24:00,2018-01-01 10:56:00,2018-01-01 12:46:44,2018-01-01 14:53:00,2018-01-01 14:56:57
Closed,01/03/2018 12:50:11 PM,01/02/2018 11:29:00 AM,01/01/2018 01:00:00 PM,01/02/2018 06:50:00 AM,01/02/2018 02:07:00 PM
Updated,01/03/2018 12:50:11 PM,01/02/2018 11:29:00 AM,01/01/2018 01:00:00 PM,01/02/2018 06:50:00 AM,01/02/2018 02:07:00 PM
Status,Closed,Closed,Closed,Closed,Closed
Status Notes,Case Resolved - Pickup completed,Case is Invalid - Fire debris is responsibilit...,Case Resolved - Please contact the registratio...,Not Accepted - WEEKENDS AND HOLIDAYS MUST CALL...,"Case Resolved - Jessa Lazo ""! I rescheduled th..."
Responsible Agency,Recology_Litter,311 Supervisor Queue,RPD NSA Queue,SFMTA - Parking Enforcement - G,County Clerk - G
Category,Street and Sidewalk Cleaning,General Request - PUBLIC WORKS,General Request - RPD,General Request - MTA,General Request - COUNTY CLERK
Request Type,Bulky Items,complaint,other,complaint,request_for_service
Request Details,Refrigerator,bsm - complaint,rpd_other - other,parking_enforcement - complaint,county_clerk - request_for_service


In [8]:
Jan2018_data[Jan2018_data['Neighborhood'].isna()].groupby("Address")['CaseID'].count().sort_values(ascending=False)

Address
Not associated with a specific address         1646
7 FULTON ST, SAN FRANCISCO, CA, 94102            19
PIER 45, SAN FRANCISCO, CA, 94133                 5
1075 O'FARRELL ST, SAN FRANCISCO, CA, 94109       5
275 O'FARRELL ST, SAN FRANCISCO, CA, 94102        4
                                               ... 
485 O'FARRELL ST, SAN FRANCISCO, CA, 94102        1
PIER 35, SAN FRANCISCO, CA, 94133                 1
445 O'FARRELL ST, SAN FRANCISCO, CA, 94102        1
439 O'FARRELL ST, SAN FRANCISCO, CA, 94102        1
1 FERRY BUILDING, SAN FRANCISCO, CA, 94111        1
Name: CaseID, Length: 74, dtype: int64

In [9]:
Jan2018_data['Neighborhood'].isna().sum()

1766

In [10]:
Jan2018_data[Jan2018_data['Neighborhood'].isna()]["Street"].isna().sum()

1646

Out of 1766 NULL `Neighborhood` values, 1646 have the address 'Not associated with a specific address' and a NULL `Street` value.

Since we plan to compare across neighborhoods, rows without a `Neighborhood` are of no use to us, so we can go ahead and drop them.
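
As a small sketch of what dropping these rows looks like, the toy frame below (illustrative values, not the real data) shows two routes: an explicit `dropna` on the `Neighborhood` column, and the fact that `groupby` excludes NaN keys by default, so grouping by `Neighborhood` removes the same rows implicitly.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the monthly data (values are illustrative)
toy = pd.DataFrame({
    'CaseID': [1, 2, 3, 4],
    'Neighborhood': ['Mission', np.nan, 'Mission', np.nan],
})

# Explicit drop of rows without a Neighborhood
explicit = toy.dropna(subset=['Neighborhood'])

# groupby excludes NaN keys by default, so the NULL rows vanish here as well
implicit = toy.groupby('Neighborhood')['CaseID'].count()

print(len(explicit))
print(implicit)
```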

#### 4.1.2 Source<a id='4.1.2_Source'></a>

Let's look at different `Source` values.

In [11]:
Jan2018_data['Source'].value_counts()

Mobile/Open311       25401
Phone                13880
Web                   6435
Integrated Agency     1549
Twitter                241
Other Department        31
Name: Source, dtype: int64
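
To read these counts as shares of all requests, a short sketch follows. The counts are copied from the output above into a standalone Series so the example runs on its own; on the full frame, `Jan2018_data['Source'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
source_counts = pd.Series({
    'Mobile/Open311': 25401,
    'Phone': 13880,
    'Web': 6435,
    'Integrated Agency': 1549,
    'Twitter': 241,
    'Other Department': 31,
})

# Share of each source among all January 2018 requests
source_share = (source_counts / source_counts.sum()).round(3)
print(source_share)
```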

#### 4.1.3 Responsible Agency<a id='4.1.3_Responsible_Agency'></a>

Let's look at different `Responsible Agency` values.

In [12]:
Jan2018_data['Responsible Agency'].value_counts()

DPW Ops Queue                        20049
Recology_Abandoned                    8363
DPT Abandoned Vehicles Work Queue     2522
311 Supervisor Queue                  2166
MUNI Work Queue                       1286
                                     ...  
Contract Administration - G              1
DBI Plumbing Inspection Queue            1
SFMTA - SFpark Queue                     1
Environment - G                          1
SFFD Support Services Queue              1
Name: Responsible Agency, Length: 166, dtype: int64

With 166 distinct values, most of them internal work queues, `Responsible Agency` doesn't look all that interesting or provide much clarity on the underlying issues.

#### 4.1.4 Category<a id='4.1.4_Category'></a>

Let's look at different `Category` values.

In [13]:
Jan2018_data['Category'].value_counts()

Street and Sidewalk Cleaning                 19514
Graffiti                                      7271
Encampments                                   4936
Abandoned Vehicle                             2512
MUNI Feedback                                 1573
                                             ...  
General Request - HUMAN RESOURCES                1
General Request - ENVIRONMENT                    1
General Request - HUMAN RIGHTS COMMISSION        1
General Request - ART COMMISSION                 1
General Request - SHERIFF                        1
Name: Category, Length: 66, dtype: int64

In [14]:
# perhaps we can fold all these General Requests into a single Category?
Jan2018_data['Category'].str.startswith('General Request').sum()

2388

In [15]:
Jan2018_data[~Jan2018_data['Category'].str.startswith('General Request')]['Category'].value_counts()

Street and Sidewalk Cleaning    19514
Graffiti                         7271
Encampments                      4936
Abandoned Vehicle                2512
MUNI Feedback                    1573
Homeless Concerns                1027
Damaged Property                 1006
Sewer Issues                      952
Tree Maintenance                  742
Streetlights                      727
Street Defects                    606
Rec and Park Requests             587
Litter Receptacles                566
Blocked Street or SideWalk        504
Noise Report                      481
Sidewalk or Curb                  465
Illegal Postings                  461
Sign Repair                       430
Residential Building Request      176
SFHA Requests                     168
Color Curb                        106
Temporary Sign Request            101
Construction Zone Permits          94
Catch Basin Maintenance            64
311 External Request               62
DPW Volunteer Programs             18
Name: Catego

#### 4.1.5 Request Type<a id='4.1.5_Request_Type'></a>

Let's look at different `Request Type` values.

In [16]:
Jan2018_data['Request Type'].value_counts()

Bulky Items                                                            8163
General Cleaning                                                       6292
Encampment Reports                                                     4486
Human or Animal Waste                                                  2138
Abandoned Vehicles                                                     1431
                                                                       ... 
mobile_food_facility                                                      1
Building - Deck_Stairs_Handrails                                          1
Building - Visitor_Policy_Violations                                      1
Construction Zone Tow-away Permits for DPW BSES                           1
Construction Zone Tow-away Permits for Roberts-Obayshi Construction       1
Name: Request Type, Length: 258, dtype: int64

In [17]:
# perhaps we can fold all these Construction Zone... into a single Request Type?
print(Jan2018_data[Jan2018_data['Request Type'].str.startswith('Construction Zone')]['Request Type'].unique())

Jan2018_data['Request Type'].str.startswith('Construction Zone').sum()

['Construction Zone Tow-away Permits for DPW BSES'
 'Construction Zone Tow-away Permits for Zayo Group'
 'Construction Zone Tow-away Permits for Abide International'
 'Construction Zone Tow-away Permits for Darcy and Harty Construction'
 'Construction Zone Tow-away Permits for DPW - BUF'
 'Construction Zone Tow-away Permits for DPW'
 'Construction Zone Tow-away Permits for Empire Engineering and Construction'
 'Construction Zone Tow-away Permits for Pacific West Communications'
 'Construction Zone Tow-away Permits for Cable Com'
 'Construction Zone Tow-away Permits for UNDERGROUND CONSTRUCTION'
 'Construction Zone Tow-away Permits for JMB Construction'
 'Construction Zone Tow-away Permits for Extenet Systems'
 'Construction Zone Tow-away Permits for Roberts-Obayshi Construction'
 'Construction Zone Tow-away Permits for DPW-BUF(Landscaping)'
 'Construction Zone Tow-away Permits for Mitchell Engineering'
 'Construction Zone Tow-away Permits for Underground Construction Co Inc'
 'Construc

94

In [18]:
Jan2018_data[~Jan2018_data['Request Type'].str.startswith('Construction Zone')]['Request Type'].value_counts(ascending=True).head(20)

Painting                                         1
delivery_service_vehicle                         1
Building - Fire_Alarm_System                     1
tour_bus                                         1
Streetlight - Pole_Cover_Missing                 1
Red Color Curb Request for HotelApartment        1
Streetlight - Pole_Leaning                       1
Building - Deck_Stairs_Handrails                 1
Building -                                       1
Building - Fire_Extinguishers_Missing_Expired    1
Building - Kitchen_Community                     1
Streetlight - Light_Glass_Cover_Missing          1
Sign Repair - Defaced                            1
Building - Visitor_Policy_Violations             1
Streetlight - Light_Glass_Cover_Hanging          1
Sign Repair - Incorrect_Signage                  1
Sign - Painted_Over                              1
mobile_food_facility                             1
Building - Mail_Service_Delivery_problem         1
Sign Repair - Not_Visible      

In [19]:
# perhaps we can fold all these Building ... into a single Request Type?
print( Jan2018_data[Jan2018_data['Request Type'].str.startswith('Building')]['Request Type'].unique() )

Jan2018_data['Request Type'].str.startswith('Building').sum()

['Building - Infestation_Rodent_Insect' 'Building - Garbage_Receptacles'
 'Building - General_Maintenance_Not_in_List_Above'
 'Building - Plumbing_Broken_leaking'
 'Building - Electrical_Hazardous_Condition' 'Building - Bathroom'
 'Building - Hot_Water_Lack_of_Hot_Water'
 'Building - Clutter_Hoarder_Unit_Interior_Storage'
 'Building - Infestation_Bed_Bugs' 'Building - Deck_Stairs_Handrails'
 'Building - Paint_Lead_Violating_Safe_Practices'
 'Building - Electrical_Non_Hazard' 'Building - Mold_and_Mildew'
 'Building - Kitchen_Community'
 'Building - Elevators_No_Working_Elevator_7_or_More_Stories'
 'Building - Illegal_Construction_No_Permit_Exceeds_Permit_Scope'
 'Building - Blocked_Exit_Common_Areas' 'Building - Fire_Hazard'
 'Building - Heat_Lack_of_Heat' 'Building - Second_Hand_Smoke'
 'Building - Doors_Windows_Broken_Defective'
 'Building - Elevators_Everthing_Else'
 'Building - Inadequately_Maintained_Building_Exterior'
 'Building - Fire_Alarm_System' 'Building - Light_Wells_Dirty_F

176

In [20]:
# perhaps we can fold all these Streetlight ... into a single Request Type?
Jan2018_data['Request Type'].str.startswith('Streetlight').sum()

727

In [21]:
# perhaps we can fold all these Sign ... into a single Request Type?
print( Jan2018_data[Jan2018_data['Request Type'].str.startswith('Sign')]['Request Type'].unique() )

Jan2018_data['Request Type'].str.startswith('Sign').sum()

['Sign - Defaced' 'Sign - Painted_Over' 'Sign - Other'
 'Sign Repair - Other' 'Sign Repair - On_Ground' 'Sign - Missing'
 'Sign - Bent' 'Sign Repair - Missing' 'Sign - Incorrect_Signage'
 'Sign Repair - Bent' 'Sign - Faded' 'Sign - Dirty' 'Sign Repair - Turned'
 'Sign - On_Ground' 'Sign Repair - Defaced' 'Sign Repair - Not_Visible'
 'Sign Repair - Incorrect_Signage' 'Sign - Turned' 'Sign Repair - Faded']


430

In [22]:
Jan2018_data[~Jan2018_data['Request Type'].str.startswith(('Construction Zone','Building','Streetlight','Sign'))]['Request Type'].value_counts()

Bulky Items                                  8163
General Cleaning                             6292
Encampment Reports                           4486
Human or Animal Waste                        2138
Abandoned Vehicles                           1431
                                             ... 
delivery_service_vehicle                        1
Red Color Curb Request for HotelApartment       1
Temporary Sign Request for Funerals             1
mobile_food_facility                            1
Painting                                        1
Name: Request Type, Length: 165, dtype: int64

In [23]:
Jan2018_data[~Jan2018_data['Request Type'].str.startswith(('Construction Zone','Building','Streetlight','Sign'))]['Request Type'].value_counts(ascending=True).head(30)

Painting                                         1
mobile_food_facility                             1
Temporary Sign Request for Funerals              1
Red Color Curb Request for HotelApartment        1
delivery_service_vehicle                         1
tour_bus                                         1
Damaged Kiosk_Public_Toilet                      2
Custodian                                        2
public_speech                                    2
vehicle_repair                                   2
emergency_equipment                              2
hospital                                         3
delivery_service_business                        3
Utility Lines/Wires                              3
Garbage                                          3
Blue Color Curb Request                          4
Sidewalk_or_Curb_Issues                          4
Illegal Postings - Posted_on_Directional_Sign    4
Graffiti on Bridge                               4
Catch_Basin_Other              

**TODO: let's drop the `Request Type`**

#### 4.1.6 Request Details<a id='4.1.6_Request_Details'></a>

Let's look at different `Request Details` values.

In [24]:
Jan2018_data.groupby('Request Details')['CaseID'].agg(['count','min','max']).reset_index().sort_values('count',ascending=False)

Unnamed: 0,Request Details,count,min,max
781,Other Loose Garbage,6284,8448870,8577917
481,Encampment Cleanup,4472,8448999,8577983
379,Boxed or Bagged Items,2914,8448866,8577984
497,Furniture,2433,8449086,8577923
639,Human or Animal Waste,2138,8448984,8577967
...,...,...,...,...
687,MULTICOLOR - FORD - - 3TEW569,1,8538672,8538672
686,MULTI COLO - CHEVY - VAN - 6TIH401,1,8542098,8542098
685,MOLD GREEN - FORD - - 72658C,1,8476684,8476684
684,MAROON - HONDA - - 4WSN026,1,8450864,8450864


In [25]:
Jan2018_data[Jan2018_data['CaseID'].isin([8448870,8448999,8448866,8538672,8542098,8476684,8450864,8511772])].T

Unnamed: 0,0,2,26,916,10852,23895,33522,34751
CaseID,8448866,8448870,8448999,8450864,8476684,8511772,8538672,8542098
Opened,2018-01-01 00:25:00,2018-01-01 00:28:00,2018-01-01 02:55:00,2018-01-01 17:18:00,2018-01-08 08:17:00,2018-01-16 10:44:00,2018-01-22 15:16:00,2018-01-23 12:06:00
Closed,01/05/2018 01:07:00 PM,01/04/2018 01:15:50 PM,01/01/2018 03:31:23 AM,01/03/2018 06:38:00 AM,01/10/2018 06:07:00 AM,01/24/2018 06:41:00 AM,01/26/2018 07:04:00 PM,01/29/2018 12:14:00 PM
Updated,01/05/2018 01:07:00 PM,01/04/2018 01:15:50 PM,01/01/2018 03:31:23 AM,01/03/2018 06:38:00 AM,01/10/2018 06:07:00 AM,01/24/2018 06:41:00 AM,01/26/2018 07:04:00 PM,01/29/2018 12:14:00 PM
Status,Closed,Closed,Closed,Closed,Closed,Closed,Closed,Closed
Status Notes,Case is a Duplicate,Case Resolved - Completed cleaned,Area Cleared,DPT Abandoned Vehicles- Gone on Arrival - 01/0...,DPT Abandoned Vehicles- Gone on Arrival - veh ...,DPT Abandoned Vehicles- Gone on Arrival - VEH ...,DPT Abandoned Vehicles- Gone on Arrival - GOA ...,DPT Abandoned Vehicles- Gone on Arrival - veh ...
Responsible Agency,DPW Ops Queue,DPW Ops Queue,DPW Ops Queue,DPT Abandoned Vehicles Work Queue,DPT Abandoned Vehicles Work Queue,DPT Abandoned Vehicles Work Queue,DPT Abandoned Vehicles Work Queue,DPT Abandoned Vehicles Work Queue
Category,Street and Sidewalk Cleaning,Street and Sidewalk Cleaning,Encampments,Abandoned Vehicle,Abandoned Vehicle,Abandoned Vehicle,Abandoned Vehicle,Abandoned Vehicle
Request Type,Bulky Items,General Cleaning,Encampment Reports,Abandoned Vehicle - Car4door,Abandoned Vehicle - PickupTruck,Abandoned Vehicle - Other,Abandoned Vehicle - Trailer,Abandoned Vehicle -
Request Details,Boxed or Bagged Items,Other Loose Garbage,Encampment Cleanup,MAROON - HONDA - - 4WSN026,MOLD GREEN - FORD - - 72658C,yellow - VW - bus - NONE,MULTICOLOR - FORD - - 3TEW569,MULTI COLO - CHEVY - VAN - 6TIH401


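`Request Details` is too granular for us to roll up or use in aggregation (for abandoned vehicles it holds free text such as color, make, and license plate). Since we are keeping `Category`, which already captures the kind of request, we can drop it.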

### 4.2 Condense Data<a id='4.2_Condense_Data'></a>

Based on the above analysis, we should be able to reduce the data by:
  1. Converting `Media URL` to a True/False bool, `Has Media`
  2. Dropping columns `Address`, `Street`, `Latitude`, `Longitude`, and `Point`
  3. Dropping column `Supervisor District` : we decided with the SF Police Incident data that this field has less relevance than `Neighborhood`
  4. Dropping columns `Updated` and `Closed` : the `Status` column preserves the state of the case
  5. Dropping columns `Status Notes` and `Responsible Agency`
  6. Rolling up all of the `Category` values that start with `General Request` into one category
  7. Rolling up all of the `Request Type` values that start with `Construction Zone`, `Building`, `Streetlight`, or `Sign` into one request type each
  8. Dropping column `Request Details` : too high a level of granularity

In [26]:
Jan2018_data["Has Media"] = ~Jan2018_data["Media URL"].isna()

In [27]:
columns_to_drop = ['Address', 'Street', 'Latitude', 'Longitude', 'Point', 'Media URL', 'Supervisor District', 'Updated', 'Closed', 'Status Notes', 'Responsible Agency', 'Request Details']
Jan2018_data.drop(columns=columns_to_drop, inplace=True)

In [28]:
Jan2018_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47537 entries, 0 to 47536
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   CaseID           47537 non-null  int64         
 1   Opened           47537 non-null  datetime64[ns]
 2   Status           47537 non-null  object        
 3   Category         47537 non-null  object        
 4   Request Type     47537 non-null  object        
 5   Neighborhood     45771 non-null  object        
 6   Police District  45759 non-null  object        
 7   Source           47537 non-null  object        
 8   Has Media        47537 non-null  bool          
dtypes: bool(1), datetime64[ns](1), int64(1), object(6)
memory usage: 2.9+ MB


In [29]:
# Roll up Category values
Jan2018_data['Category'] = Jan2018_data['Category'].apply(lambda x : re.sub(r"General Request.*", "General Request", x))

In [30]:
Jan2018_data[Jan2018_data['Category'].str.startswith('General Request')]['Category'].value_counts()

General Request    2388
Name: Category, dtype: int64
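
The same roll-up can be written without `apply`, using the vectorized string accessor. This is a sketch on a three-value sample Series rather than the notebook's frame; the `^` anchor is an addition here that restricts the rewrite to values that start with `General Request`.

```python
import pandas as pd

# Small sample of Category values (taken from the outputs above)
categories = pd.Series([
    'General Request - PUBLIC WORKS',
    'General Request - MTA',
    'Graffiti',
])

# Anchored pattern: only values that start with 'General Request' are rewritten
rolled = categories.str.replace(r'^General Request.*', 'General Request', regex=True)
print(rolled.tolist())
```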

In [31]:
# Roll up Request Type values
request_type_rollups = ['Construction Zone','Building','Streetlight','Sign']
for request_type in request_type_rollups:
    repl = r''+request_type+".*"
    Jan2018_data['Request Type'] = Jan2018_data['Request Type'].apply(lambda x : re.sub(repl, request_type, x))

In [32]:
Jan2018_data[Jan2018_data['Request Type'].str.startswith(('Construction Zone','Building','Streetlight','Sign'))]['Request Type'].value_counts()

Streetlight          727
Sign                 430
Building             176
Construction Zone     94
Name: Request Type, dtype: int64
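
One subtlety worth noting: `re.sub` matches anywhere in the string, not only at the start, so the loop above also shortens values that merely contain one of the prefixes. The check above uses `startswith`, so it would not surface this. The sketch below compares the loop's behavior with a hypothetical anchored variant (`roll_up_anchored` is not part of the notebook) on `Temporary Sign Request for Funerals`, a request type that appears in the January 2018 data.

```python
import re

rollups = ['Construction Zone', 'Building', 'Streetlight', 'Sign']

def roll_up_unanchored(value):
    # Mirrors the loop above: the pattern may match anywhere in the string
    for prefix in rollups:
        value = re.sub(prefix + '.*', prefix, value)
    return value

def roll_up_anchored(value):
    # Hypothetical variant: only rewrite values that start with the prefix
    for prefix in rollups:
        value = re.sub('^' + re.escape(prefix) + '.*', prefix, value)
    return value

example = 'Temporary Sign Request for Funerals'
print(roll_up_unanchored(example))  # shortened, because 'Sign' occurs mid-string
print(roll_up_anchored(example))    # left unchanged
```

Switching to the anchored form may change the aggregated row counts slightly, so the outputs shown in this notebook correspond to the unanchored loop.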

In [33]:
# Create year-month
Jan2018_data['Opened Year Month'] = Jan2018_data['Opened'].dt.strftime('%Y%m')

In [34]:
# aggregated data
cols_to_groupby = ['Opened Year Month','Source','Neighborhood','Police District','Status','Category','Request Type','Has Media']
agg_Jan2018_data = Jan2018_data.groupby(cols_to_groupby).agg({'CaseID': 'count'})
agg_Jan2018_data.columns = ['Case Count']
agg_Jan2018_data = agg_Jan2018_data.reset_index()
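
An equivalent, slightly shorter way to express this count is `groupby(...).size()` followed by `reset_index(name=...)`. The sketch below uses a toy frame with illustrative values; `size` counts rows per group, which matches counting `CaseID` here because `CaseID` has no NULL values.

```python
import pandas as pd

# Toy frame standing in for the condensed monthly data (values are illustrative)
toy = pd.DataFrame({
    'CaseID': [10, 11, 12, 13],
    'Neighborhood': ['Mission', 'Mission', 'Portola', 'Mission'],
    'Category': ['Graffiti', 'Graffiti', 'Graffiti', 'Encampments'],
})

# Count rows per key combination and name the resulting column in one step
agg = (
    toy.groupby(['Neighborhood', 'Category'])
       .size()
       .reset_index(name='Case Count')
)
print(agg)
```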

In [35]:
agg_Jan2018_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10019 entries, 0 to 10018
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Opened Year Month  10019 non-null  object
 1   Source             10019 non-null  object
 2   Neighborhood       10019 non-null  object
 3   Police District    10019 non-null  object
 4   Status             10019 non-null  object
 5   Category           10019 non-null  object
 6   Request Type       10019 non-null  object
 7   Has Media          10019 non-null  bool  
 8   Case Count         10019 non-null  int64 
dtypes: bool(1), int64(1), object(7)
memory usage: 636.1+ KB


In [36]:
agg_Jan2018_data.sort_values('Case Count',ascending=False).head()

Unnamed: 0,Opened Year Month,Source,Neighborhood,Police District,Status,Category,Request Type,Has Media,Case Count
2562,201801,Mobile/Open311,Mission,MISSION,Closed,Street and Sidewalk Cleaning,General Cleaning,True,578
2493,201801,Mobile/Open311,Mission,MISSION,Closed,Encampments,Encampment Reports,True,466
3734,201801,Mobile/Open311,Showplace Square,SOUTHERN,Closed,Encampments,Encampment Reports,False,433
2511,201801,Mobile/Open311,Mission,MISSION,Closed,Graffiti,Graffiti on Parking_meter,True,400
7110,201801,Phone,Portola,BAYVIEW,Closed,Street and Sidewalk Cleaning,Bulky Items,False,399


This looks good, so we will define a function to handle this aggregation for every month.

#### 4.2.1 Condense Data Function<a id='4.2.1_Condense_Data_Function'></a>

This function will take in raw 311 monthly data and return an aggregated, condensed version.

In [37]:
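def aggregate_monthly(df_monthly):
    """Condense one month of raw 311 data and return case counts per group.

    Note: df_monthly is modified in place (columns are added and dropped).
    """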
    df_monthly['Opened'] = pd.to_datetime(df_monthly['Opened'])
    
    df_monthly['Has Media'] = ~df_monthly['Media URL'].isna()
    
    columns_to_drop = ['Address', 'Street', 'Latitude', 'Longitude', 'Point', 'Media URL', 'Supervisor District', 'Updated', 'Closed', 'Status Notes', 'Responsible Agency', 'Request Details']
    df_monthly.drop(columns=columns_to_drop, inplace=True)
    
    # Roll up Category values
    df_monthly['Category'] = df_monthly['Category'].apply(lambda x : re.sub(r"General Request.*", "General Request", x))
    
    # Roll up Request Type values
    request_type_rollups = ['Construction Zone','Building','Streetlight','Sign']
    for request_type in request_type_rollups:
        repl = r''+request_type+".*"
        df_monthly['Request Type'] = df_monthly['Request Type'].apply(lambda x : re.sub(repl, request_type, x))
    
    # Create year-month
    df_monthly['Opened Year Month'] = df_monthly['Opened'].dt.strftime('%Y%m')
    
    # aggregated data
    cols_to_groupby = ['Opened Year Month','Source','Neighborhood','Police District','Status','Category','Request Type','Has Media']
    df_agg = df_monthly.groupby(cols_to_groupby).agg({'CaseID': 'count'})
    df_agg.columns = ['Case Count']
    df_agg = df_agg.reset_index()
    
    return df_agg

In [38]:
# test the function against Jan 2018 data and we should see the same result
df_test = aggregate_monthly( pd.read_csv('../raw_data/311_Cases_201801.csv') )

In [39]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10019 entries, 0 to 10018
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Opened Year Month  10019 non-null  object
 1   Source             10019 non-null  object
 2   Neighborhood       10019 non-null  object
 3   Police District    10019 non-null  object
 4   Status             10019 non-null  object
 5   Category           10019 non-null  object
 6   Request Type       10019 non-null  object
 7   Has Media          10019 non-null  bool  
 8   Case Count         10019 non-null  int64 
dtypes: bool(1), int64(1), object(7)
memory usage: 636.1+ KB


In [40]:
df_test.sort_values('Case Count',ascending=False).head()

Unnamed: 0,Opened Year Month,Source,Neighborhood,Police District,Status,Category,Request Type,Has Media,Case Count
2562,201801,Mobile/Open311,Mission,MISSION,Closed,Street and Sidewalk Cleaning,General Cleaning,True,578
2493,201801,Mobile/Open311,Mission,MISSION,Closed,Encampments,Encampment Reports,True,466
3734,201801,Mobile/Open311,Showplace Square,SOUTHERN,Closed,Encampments,Encampment Reports,False,433
2511,201801,Mobile/Open311,Mission,MISSION,Closed,Graffiti,Graffiti on Parking_meter,True,400
7110,201801,Phone,Portola,BAYVIEW,Closed,Street and Sidewalk Cleaning,Bulky Items,False,399


Great! We are ready to load all files and combine the aggregated data.
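
Rather than comparing the two outputs by eye, the comparison can be made programmatically with `pd.testing.assert_frame_equal`, which raises an `AssertionError` if the frames differ. The sketch below is self-contained: `count_by` is a hypothetical stand-in for `aggregate_monthly`, and the toy frame is illustrative. In this notebook the corresponding check would be along the lines of `pd.testing.assert_frame_equal(agg_Jan2018_data, df_test)`.

```python
import pandas as pd

def count_by(df, keys):
    # Hypothetical stand-in for aggregate_monthly: count rows per key combination
    return df.groupby(keys).size().reset_index(name='Case Count')

toy = pd.DataFrame({
    'CaseID': [1, 2, 3],
    'Neighborhood': ['Mission', 'Portola', 'Mission'],
})

# Step-by-step aggregation, in the style of the exploratory cells
step_by_step = toy.groupby(['Neighborhood']).agg({'CaseID': 'count'})
step_by_step.columns = ['Case Count']
step_by_step = step_by_step.reset_index()

# Same aggregation through the function
via_function = count_by(toy, ['Neighborhood'])

# Raises AssertionError if the two frames differ in values, dtypes, or labels
pd.testing.assert_frame_equal(step_by_step, via_function)
print('frames match')
```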

## 5 Load All Files<a id='5_Load_All_Files'></a>

In [51]:
filename_prefix = '../raw_data/311_Cases_'
filename_suffix = '.csv'

# collect each month's aggregated frame, then concatenate once at the end
monthly_frames = []

# note that we only have data up to September 2020
for year in [2018,2019,2020]:
    for month in range(1,13):
        if year == 2020 and month == 10:
            break
        filename = filename_prefix + str(year)+str(month).zfill(2) + filename_suffix
        monthly_311 = pd.read_csv(filename, low_memory=False)
        print(filename)
        agg_monthly_311 = aggregate_monthly( monthly_311 )
        print(agg_monthly_311.shape)
#        print(monthly_311.info())
        monthly_frames.append( agg_monthly_311 )

# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
agg_311 = pd.concat( monthly_frames, ignore_index=True )

raw_data/311_Cases_201801.csv
(10019, 9)
raw_data/311_Cases_201802.csv
(9610, 9)
raw_data/311_Cases_201803.csv
(10241, 9)
raw_data/311_Cases_201804.csv
(11667, 9)
raw_data/311_Cases_201805.csv
(11948, 9)
raw_data/311_Cases_201806.csv
(11598, 9)
raw_data/311_Cases_201807.csv
(11503, 9)
raw_data/311_Cases_201808.csv
(11850, 9)
raw_data/311_Cases_201809.csv
(11313, 9)
raw_data/311_Cases_201810.csv
(11733, 9)
raw_data/311_Cases_201811.csv
(10968, 9)
raw_data/311_Cases_201812.csv
(10779, 9)
raw_data/311_Cases_201901.csv
(12048, 9)
raw_data/311_Cases_201902.csv
(10690, 9)
raw_data/311_Cases_201903.csv
(11309, 9)
raw_data/311_Cases_201904.csv
(11422, 9)
raw_data/311_Cases_201905.csv
(11689, 9)
raw_data/311_Cases_201906.csv
(11599, 9)
raw_data/311_Cases_201907.csv
(11807, 9)
raw_data/311_Cases_201908.csv
(12181, 9)
raw_data/311_Cases_201909.csv
(12056, 9)
raw_data/311_Cases_201910.csv
(12531, 9)
raw_data/311_Cases_201911.csv
(11702, 9)
raw_data/311_Cases_201912.csv
(11596, 9)
raw_data/311_Case
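
The list of monthly filenames can also be generated from a period range, which avoids the nested loop and the special case for October 2020. This is a sketch that assumes the same `311_Cases_YYYYMM.csv` naming scheme used above; it only builds the names and does not read any files.

```python
import pandas as pd

# One period per month, January 2018 through September 2020 inclusive
periods = pd.period_range('2018-01', '2020-09', freq='M')

# Build filenames in the 311_Cases_YYYYMM.csv scheme used above
filenames = ['../raw_data/311_Cases_' + p.strftime('%Y%m') + '.csv' for p in periods]

print(len(filenames))
print(filenames[0], filenames[-1])
```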

In [52]:
agg_311.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374480 entries, 0 to 374479
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Opened Year Month  374480 non-null  object
 1   Source             374480 non-null  object
 2   Neighborhood       374480 non-null  object
 3   Police District    374480 non-null  object
 4   Status             374480 non-null  object
 5   Category           374480 non-null  object
 6   Request Type       374480 non-null  object
 7   Has Media          374480 non-null  bool  
 8   Case Count         374480 non-null  int64 
dtypes: bool(1), int64(1), object(7)
memory usage: 23.2+ MB


## 6 Save data<a id='6_Save_data'></a>

We will be saving our aggregated data and the list of `Neighborhood`s to a separate location, to guard against overwriting our original data.

In [53]:
agg_311.shape

(374480, 9)

In [54]:
agg_311['Neighborhood'].unique()

array(['Alamo Square', 'Anza Vista', 'Apparel City',
       'Aquatic Park / Ft. Mason', 'Ashbury Heights', 'Bayview',
       'Bernal Heights', 'Bret Harte', 'Buena Vista',
       'Candlestick Point SRA', 'Castro', 'Cathedral Hill', 'Cayuga',
       'Central Waterfront', 'Chinatown', 'Civic Center', 'Cole Valley',
       'Cow Hollow', 'Crocker Amazon', 'Dogpatch', 'Dolores Heights',
       'Downtown / Union Square', 'Duboce Triangle', 'Eureka Valley',
       'Excelsior', 'Fairmount', 'Financial District',
       "Fisherman's Wharf", 'Forest Hill', 'Forest Knolls', 'Glen Park',
       'Golden Gate Heights', 'Golden Gate Park', 'Haight Ashbury',
       'Hayes Valley', 'Hunters Point', 'Ingleside', 'Inner Richmond',
       'Inner Sunset', 'Lakeshore', 'Little Hollywood', 'Lone Mountain',
       'Lower Haight', 'Lower Nob Hill', 'Lower Pacific Heights',
       'Marina', 'Merced Heights', 'Mint Hill', 'Miraloma Park',
       'Mission', 'Mission Bay', 'Mission Dolores', 'Mission Terrace',
   

In [55]:
neighborhood_data = pd.DataFrame(agg_311['Neighborhood'].unique())

In [56]:
neighborhood_data.columns = ['Neighborhood']
neighborhood_data

Unnamed: 0,Neighborhood
0,Alamo Square
1,Anza Vista
2,Apparel City
3,Aquatic Park / Ft. Mason
4,Ashbury Heights
...,...
112,Monterey Heights
113,Parkmerced
114,Presidio National Park
115,Westwood Highlands
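
The neighborhood list can also be built in a single step that keeps the column name and sorts the result. A sketch on a toy frame with illustrative values follows; the sorting is an addition here and is not something the cells above do.

```python
import pandas as pd

# Toy aggregated frame (values are illustrative)
agg = pd.DataFrame({'Neighborhood': ['Mission', 'Alamo Square', 'Mission']})

# Unique neighborhoods as a one-column frame, sorted, with a clean index
neighborhoods = (
    agg[['Neighborhood']]
    .drop_duplicates()
    .sort_values('Neighborhood')
    .reset_index(drop=True)
)
print(neighborhoods)
```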


In [57]:
datapath = '../data'

# create datapath if it doesn't exist
if not os.path.exists(datapath):
    os.mkdir(datapath)

In [58]:
# write aggregated data
datapath_aggdata = os.path.join(datapath, '311_Cases_aggregated.csv')
if not os.path.exists(datapath_aggdata):
    agg_311.to_csv(datapath_aggdata, index=False)

In [59]:
# write neighborhoods list
datapath_neighborhoods = os.path.join(datapath, '311_Neighborhoods.csv')
if not os.path.exists(datapath_neighborhoods):
    neighborhood_data.to_csv(datapath_neighborhoods, index=False)
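
The create-directory and write-if-absent pattern used in the cells above can be folded into one small helper. This is a sketch under stated assumptions: `save_if_absent` is a hypothetical name, and the example writes into a temporary directory so it does not touch `../data`.

```python
import os
import tempfile

import pandas as pd

def save_if_absent(df, path):
    """Write df to path unless a file already exists there; return True if written."""
    # Create the parent directory (and any missing ancestors) if needed
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):
        return False
    df.to_csv(path, index=False)
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'data', 'example.csv')
    frame = pd.DataFrame({'Neighborhood': ['Mission']})
    first = save_if_absent(frame, target)   # file is written
    second = save_if_absent(frame, target)  # existing file is left untouched

print(first, second)
```

One consequence of this guard, here and in the cells above, is that re-running the notebook after changing the aggregation will not refresh an existing output file; the old file has to be removed first.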