# **NYU Wagner - Python Coding for Public Policy**
# Class 4: Dates and time series analysis

# LECTURE

As an example, we'll look at the frequency of 311 requests:

- Over time
- By day of the week
- By hour

From [Wikipedia](https://en.wikipedia.org/wiki/Time_series):

> A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.

Does the 311 data meet this definition?

## Importing necessary packages

In [1]:
import pandas as pd
import plotly.express as px

# boilerplate for allowing PDF export
import plotly.io as pio
pio.renderers.default = "notebook_connected+pdf"

## Data preparation

### Load 311 data

In [2]:
df = pd.read_csv('https://storage.googleapis.com/python-public-policy/data/311_Service_Requests_2018-19_clean.csv.zip')


Columns (8,17,20,31,34) have mixed types.Specify dtype option on import or set low_memory=False.



In [3]:
df.head()

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,39888071,08/01/2018 12:00:10 AM,08/01/2018 01:52:46 AM,DHS,Operations Unit - Department of Homeless Services,Homeless Person Assistance,,Other,10029,200 EAST 109 STREET,...,,,,,,,,40.793339,-73.942942,"(40.79333937834769, -73.9429417746998)"
1,39889166,08/01/2018 12:00:26 AM,08/18/2018 10:46:43 AM,HPD,Department of Housing Preservation and Develop...,DOOR/WINDOW,DOOR,RESIDENTIAL BUILDING,10031,528 WEST 136 STREET,...,,,,,,,,40.820124,-73.953071,"(40.82012422332215, -73.9530712339799)"
2,39882869,08/01/2018 12:00:54 AM,08/01/2018 12:49:55 AM,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11216,761 LINCOLN PLACE,...,,,,,,,,40.670809,-73.951399,"(40.67080917938279, -73.9513990916184)"
3,39894246,08/01/2018 12:01:00 AM,08/02/2018 10:30:00 PM,DEP,Department of Environmental Protection,Noise,Noise: Construction Before/After Hours (NM1),,10010,,...,,,,,,,,40.740262,-73.990517,"(40.74026158873342, -73.99051651686905)"
4,39881329,08/01/2018 12:01:00 AM,08/05/2018 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,11413,121-28 198 STREET,...,,,,,,,,40.688144,-73.75099,"(40.68814402968042, -73.75098958473612)"


In [4]:
# check data types and see that dates are stored as strings (objects)

df.dtypes

Unique Key                          int64
Created Date                       object
Closed Date                        object
Agency                             object
Agency Name                        object
Complaint Type                     object
Descriptor                         object
Location Type                      object
Incident Zip                       object
Incident Address                   object
Street Name                        object
Cross Street 1                     object
Cross Street 2                     object
Intersection Street 1              object
Intersection Street 2              object
Address Type                       object
City                               object
Landmark                           object
Facility Type                      object
Status                             object
Due Date                           object
Resolution Description             object
Resolution Action Updated Date     object
Community Board                   

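Remember this problem? Because the dates are strings, `min()` and `max()` compare them alphabetically rather than chronologically.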

In [5]:
df['Created Date'].min()

'01/01/2019 01:00:00 PM'

In [6]:
df['Created Date'].max()

'12/31/2018 12:59:41 PM'

### Convert columns to datetime timestamps using [pandas' `to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)

In [7]:
df['Created Date'] = pd.to_datetime(df['Created Date'], format='%m/%d/%Y %I:%M:%S %p')
df['Closed Date'] = pd.to_datetime(df['Closed Date'], format='%m/%d/%Y %I:%M:%S %p')

In [8]:
df.head()

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,39888071,2018-08-01 00:00:10,2018-08-01 01:52:46,DHS,Operations Unit - Department of Homeless Services,Homeless Person Assistance,,Other,10029,200 EAST 109 STREET,...,,,,,,,,40.793339,-73.942942,"(40.79333937834769, -73.9429417746998)"
1,39889166,2018-08-01 00:00:26,2018-08-18 10:46:43,HPD,Department of Housing Preservation and Develop...,DOOR/WINDOW,DOOR,RESIDENTIAL BUILDING,10031,528 WEST 136 STREET,...,,,,,,,,40.820124,-73.953071,"(40.82012422332215, -73.9530712339799)"
2,39882869,2018-08-01 00:00:54,2018-08-01 00:49:55,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11216,761 LINCOLN PLACE,...,,,,,,,,40.670809,-73.951399,"(40.67080917938279, -73.9513990916184)"
3,39894246,2018-08-01 00:01:00,2018-08-02 22:30:00,DEP,Department of Environmental Protection,Noise,Noise: Construction Before/After Hours (NM1),,10010,,...,,,,,,,,40.740262,-73.990517,"(40.74026158873342, -73.99051651686905)"
4,39881329,2018-08-01 00:01:00,2018-08-05 00:00:00,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,11413,121-28 198 STREET,...,,,,,,,,40.688144,-73.75099,"(40.68814402968042, -73.75098958473612)"


[More about the `format` string.](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) If you don't provide one, pandas has to work out the format itself, which can take much longer on a dataset this size.
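To make the format codes concrete, here is a small sketch using two made-up values in the same layout as the 311 dates. The variable names `raw`, `parsed`, and `formatted` are only for illustration.

```python
import pandas as pd

# Made-up values in the same layout as the 311 'Created Date' column
raw = pd.Series(['08/01/2018 12:00:10 AM', '08/18/2018 10:46:43 PM'])

# %m = month, %d = day, %Y = four-digit year
# %I = hour on a 12-hour clock, %M = minute, %S = second, %p = AM/PM
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')
print(parsed)

# The same codes work in the other direction, turning timestamps into strings
# (%H is the hour on a 24-hour clock)
formatted = parsed.dt.strftime('%Y-%m-%d %H:%M')
print(formatted)
```

Note that `12:00:10 AM` becomes `00:00:10`, since timestamps are stored on a 24-hour clock.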

In [9]:
# check data types and confirm they are now datetime

df.dtypes

Unique Key                                 int64
Created Date                      datetime64[ns]
Closed Date                       datetime64[ns]
Agency                                    object
Agency Name                               object
Complaint Type                            object
Descriptor                                object
Location Type                             object
Incident Zip                              object
Incident Address                          object
Street Name                               object
Cross Street 1                            object
Cross Street 2                            object
Intersection Street 1                     object
Intersection Street 2                     object
Address Type                              object
City                                      object
Landmark                                  object
Facility Type                             object
Status                                    object
Due Date            

In [10]:
df['Created Date'].min()

Timestamp('2018-08-01 00:00:10')

In [11]:
df['Created Date'].max()

Timestamp('2019-08-24 02:00:56')
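The string version of `min()` and `max()` gave misleading answers because strings are compared character by character, so the month at the start of the string dominates. A minimal sketch with three made-up values (not taken from the dataset) shows the difference:

```python
import pandas as pd

# Made-up values in the same layout as the raw 311 dates
dates = pd.Series([
    '08/01/2018 12:00:10 AM',
    '12/31/2018 12:59:41 PM',
    '01/01/2019 01:00:00 PM',
])

# As strings: compared character by character, so '01/...' sorts first
string_min = dates.min()
string_max = dates.max()

# As timestamps: compared chronologically
parsed = pd.to_datetime(dates, format='%m/%d/%Y %I:%M:%S %p')
real_min = parsed.min()
real_max = parsed.max()

print(string_min, '|', string_max)
print(real_min, '|', real_max)
```

The string "minimum" is the January 2019 value, even though it is the latest of the three dates.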

## In-class exercise

Let's do up through Step 3 of [Homework 4](https://padmgp-4506001-fall.rcnyu.org/user-redirect/notebooks/class_materials/hw_4.ipynb).

## Noise complaints per day

In [12]:
noise = df[df['Complaint Type'] == 'Noise - Residential']
noise_per_day = noise.resample('D', on='Created Date').size().reset_index(name='count_requests')

noise_per_day

Unnamed: 0,Created Date,count_requests
0,2018-08-01,331
1,2018-08-02,267
2,2018-08-03,395
3,2018-08-04,943
4,2018-08-05,1044
...,...,...
384,2019-08-20,330
385,2019-08-21,313
386,2019-08-22,343
387,2019-08-23,591


## [Resampling](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html#resample-a-time-series-to-another-frequency)

Once you have a column with datetime objects, pandas can manipulate them directly. From [the User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling):

> `resample()` is a time-based `groupby`

```python
.resample('D', on='Created Date')
```

The `'D'` is the [offset alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases), i.e. the desired frequency.
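To see what daily resampling does on a small scale, here is a sketch with four made-up requests. The `requests` DataFrame is hypothetical and stands in for the 311 data.

```python
import pandas as pd

# Made-up requests: three on Aug 1, none on Aug 2, one on Aug 3
requests = pd.DataFrame({
    'Created Date': pd.to_datetime([
        '2018-08-01 00:05', '2018-08-01 13:30', '2018-08-01 23:59',
        '2018-08-03 08:00',
    ]),
    'Complaint Type': ['Noise - Residential'] * 4,
})

# Bucket the rows into calendar days and count the rows in each bucket
per_day = requests.resample('D', on='Created Date').size()
print(per_day)
```

The day with no requests still appears, with a count of 0. That keeps the resulting series evenly spaced, which matters for the time series definition at the top of the lecture.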

In [13]:
fig = px.line(noise_per_day, x='Created Date', y='count_requests', title='Noise complaints per day')
fig.show()

How about a rolling average?

In [14]:
noise_per_day_rolling = noise_per_day.rolling('7D', on='Created Date').mean()

fig = px.line(noise_per_day_rolling, x='Created Date', y='count_requests', title='7-day rolling average of noise complaints per day')
fig.show()
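To check what a time-based rolling window computes, here is a sketch on four made-up daily counts, using a shorter `'3D'` window so the arithmetic is easy to follow. It also compares against an integer window, which behaves differently at the start of the series.

```python
import pandas as pd

# Made-up daily counts
daily = pd.DataFrame({
    'Created Date': pd.date_range('2018-08-01', periods=4, freq='D'),
    'count_requests': [10, 20, 30, 40],
})

# Time-based window: each row is averaged with any rows that fall in
# the 3 days ending at that row's timestamp
rolling = daily.rolling('3D', on='Created Date').mean()
print(rolling)

# Integer window: always exactly 3 rows, so the first two results are NaN
fixed = daily['count_requests'].rolling(3).mean()
print(fixed)
```

With the time-based window, the early rows are averaged over however many days are available, so the line starts at the first date instead of a few days in.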

Let's try weekly:

In [15]:
noise_per_week = noise.resample('W', on='Created Date').size().reset_index(name='count_requests')
fig = px.line(noise_per_week, x='Created Date', y='count_requests', title='Noise complaints per week')
fig.show()

## Resampling with other grouping

In [16]:
noise.resample('W', on='Created Date').size()

# can be rewritten as

noise.groupby([pd.Grouper(key='Created Date', freq='W')]).size()

Created Date
2018-08-05    2980
2018-08-12    3324
2018-08-19    3732
2018-08-26    4848
2018-09-02    4287
2018-09-09    4429
2018-09-16    4722
2018-09-23    4578
2018-09-30    4612
2018-10-07    4290
2018-10-14    4074
2018-10-21    3956
2018-10-28    4113
2018-11-04    4100
2018-11-11    3771
2018-11-18    3610
2018-11-25    3973
2018-12-02    3773
2018-12-09    3830
2018-12-16    3686
2018-12-23    3383
2018-12-30    4563
2019-01-06    4476
2019-01-13    3549
2019-01-20    3477
2019-01-27    3765
2019-02-03    3831
2019-02-10    3538
2019-02-17    3719
2019-02-24    4072
2019-03-03    3468
2019-03-10    3729
2019-03-17    3932
2019-03-24    3934
2019-03-31    4330
2019-04-07    4636
2019-04-14    4771
2019-04-21    4260
2019-04-28    4535
2019-05-05    4860
2019-05-12    4988
2019-05-19    5575
2019-05-26    5811
2019-06-02    6597
2019-06-09    5659
2019-06-16    5986
2019-06-23    5697
2019-06-30    4098
2019-07-07    4974
2019-07-14    4046
2019-07-21    3461
2019-07-28    3976

This means you can add other columns to group by:

In [17]:
noise.groupby([pd.Grouper(key='Created Date', freq='W'), 'Borough']).size()

Created Date  Borough      
2018-08-05    BRONX            769
              BROOKLYN         914
              MANHATTAN        453
              QUEENS           703
              STATEN ISLAND    132
                              ... 
2019-08-25    BROOKLYN         683
              MANHATTAN        408
              QUEENS           476
              STATEN ISLAND     56
              Unspecified       17
Length: 330, dtype: int64
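The same pattern on a small scale, as a sketch with made-up rows. Adding `reset_index()` turns the grouped result into an ordinary table with one row per week and borough.

```python
import pandas as pd

# Made-up requests across two calendar weeks
toy = pd.DataFrame({
    'Created Date': pd.to_datetime([
        '2018-08-01', '2018-08-02', '2018-08-03', '2018-08-08',
    ]),
    'Borough': ['BRONX', 'BRONX', 'QUEENS', 'QUEENS'],
})

# Group by week (labelled with the Sunday that ends it) and by borough
counts = (
    toy.groupby([pd.Grouper(key='Created Date', freq='W'), 'Borough'])
    .size()
    .reset_index(name='count_requests')
)
print(counts)
```

A table in this shape could then be passed to `px.line()` with `color='Borough'` to draw one line per borough.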

## In-class exercise

Let's do up through Step 5 of [Homework 4](https://padmgp-4506001-fall.rcnyu.org/user-redirect/notebooks/class_materials/hw_4.ipynb).

## Does the frequency of noise complaints vary by day of the week?

### Get the day of the week for each date

Add column using the [time/date component](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components).

In [18]:
noise_per_day['weekday_name'] = noise_per_day['Created Date'].dt.day_name()
noise_per_day['weekday'] = noise_per_day['Created Date'].dt.weekday

noise_per_day[['Created Date', 'weekday_name', 'weekday']]

Unnamed: 0,Created Date,weekday_name,weekday
0,2018-08-01,Wednesday,2
1,2018-08-02,Thursday,3
2,2018-08-03,Friday,4
3,2018-08-04,Saturday,5
4,2018-08-05,Sunday,6
...,...,...,...
384,2019-08-20,Tuesday,1
385,2019-08-21,Wednesday,2
386,2019-08-22,Thursday,3
387,2019-08-23,Friday,4


### Tip

- Use resampling when you want to work with dates as continuous values, e.g. points in time
- Use date components when you want to work with dates as categorical values, e.g. month number, day of week
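The two approaches in the tip can be compared side by side. This is a sketch on three made-up requests, not the 311 data:

```python
import pandas as pd

# Made-up requests
toy = pd.DataFrame({
    'Created Date': pd.to_datetime([
        '2018-08-06 09:00',  # a Monday
        '2018-08-13 10:00',  # the following Monday
        '2018-08-14 11:00',  # a Tuesday
    ]),
    'Complaint Type': ['Noise - Residential'] * 3,
})

# Continuous: one row per calendar week, in time order
per_week = toy.resample('W', on='Created Date').size()
print(per_week)

# Categorical: one row per weekday name, regardless of which week it fell in
per_weekday = toy.groupby(toy['Created Date'].dt.day_name()).size()
print(per_weekday)
```

The two Mondays land in different rows when resampling, but in the same row when grouping by day name.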

### Find the median count of noise complaints per weekday

In [19]:
noise_weekday = noise_per_day.groupby(['weekday', 'weekday_name'])['count_requests'].median().reset_index(name='median_requests')
noise_weekday

Unnamed: 0,weekday,weekday_name,median_requests
0,0,Monday,419.0
1,1,Tuesday,377.0
2,2,Wednesday,375.5
3,3,Thursday,372.0
4,4,Friday,525.5
5,5,Saturday,951.5
6,6,Sunday,993.0


In [20]:
fig = px.bar(noise_weekday, x='weekday_name', y='median_requests', title='Noise complaints per day of week')
fig.show()
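One reason to group by both `weekday` and `weekday_name` is ordering: grouping by the name alone sorts the days alphabetically. A small sketch with made-up counts illustrates this:

```python
import pandas as pd

# Made-up daily counts for three weekdays
toy = pd.DataFrame({
    'weekday': [0, 1, 4, 0, 1, 4],
    'weekday_name': ['Monday', 'Tuesday', 'Friday', 'Monday', 'Tuesday', 'Friday'],
    'count_requests': [400, 380, 500, 440, 370, 550],
})

# Grouping by name only: results come back in alphabetical order
by_name = toy.groupby('weekday_name')['count_requests'].median()
print(by_name)

# Grouping by number first: results come back in calendar order
by_number_and_name = toy.groupby(['weekday', 'weekday_name'])['count_requests'].median()
print(by_number_and_name)
```

Keeping the number as the first grouping key means the bar chart reads Monday through Sunday without any extra sorting.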

## What about by time of day?

Get count of noise complaints per individual date and hour:

In [21]:
noise_per_date_hour = noise.resample('H', on='Created Date').size().reset_index(name='count_requests')
# create a column for the hour number, so we can group on it
noise_per_date_hour['hour'] = noise_per_date_hour['Created Date'].dt.hour

noise_per_date_hour

Unnamed: 0,Created Date,count_requests,hour
0,2018-08-01 00:00:00,29,0
1,2018-08-01 01:00:00,23,1
2,2018-08-01 02:00:00,15,2
3,2018-08-01 03:00:00,5,3
4,2018-08-01 04:00:00,5,4
...,...,...,...
9309,2019-08-23 21:00:00,81,21
9310,2019-08-23 22:00:00,95,22
9311,2019-08-23 23:00:00,142,23
9312,2019-08-24 00:00:00,108,0


Get the median count of complaints per hour:

In [22]:
noise_hour = noise_per_date_hour.groupby('hour')['count_requests'].median().reset_index(name='median_requests')
noise_hour

Unnamed: 0,hour,median_requests
0,0,41.0
1,1,24.0
2,2,15.0
3,3,10.0
4,4,8.0
5,5,7.0
6,6,6.0
7,7,7.0
8,8,8.0
9,9,9.0


In [23]:
fig = px.line(noise_hour, x='hour', y='median_requests', title='Noise complaints per hour')
fig.show()

## Which 311 complaints take the longest to resolve?

In [24]:
# calculate the amount of time that passed between Created Date and Closed Date

df['resolution_duration'] = df['Closed Date'] - df['Created Date']

# print head to check results

df[['Closed Date', 'Created Date', 'resolution_duration']].head()

Unnamed: 0,Closed Date,Created Date,resolution_duration
0,2018-08-01 01:52:46,2018-08-01 00:00:10,0 days 01:52:36
1,2018-08-18 10:46:43,2018-08-01 00:00:26,17 days 10:46:17
2,2018-08-01 00:49:55,2018-08-01 00:00:54,0 days 00:49:01
3,2018-08-02 22:30:00,2018-08-01 00:01:00,1 days 22:29:00
4,2018-08-05 00:00:00,2018-08-01 00:01:00,3 days 23:59:00
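Subtracting two datetime columns produces timedeltas. If a plain number is easier to work with (for plotting, say), one option is to convert to hours. This is a sketch using two rows copied from the output above; the `resolution_hours` column name is made up for illustration.

```python
import pandas as pd

# Two rows with the same values as the first and fourth rows above
toy = pd.DataFrame({
    'Created Date': pd.to_datetime(['2018-08-01 00:00:10', '2018-08-01 00:01:00']),
    'Closed Date': pd.to_datetime(['2018-08-01 01:52:46', '2018-08-02 22:30:00']),
})

# Datetime minus datetime gives a timedelta
toy['resolution_duration'] = toy['Closed Date'] - toy['Created Date']

# Convert the timedelta to a number of hours
toy['resolution_hours'] = toy['resolution_duration'].dt.total_seconds() / 3600
print(toy[['resolution_duration', 'resolution_hours']])
```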


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2859011 entries, 0 to 2859010
Data columns (total 42 columns):
 #   Column                          Dtype          
---  ------                          -----          
 0   Unique Key                      int64          
 1   Created Date                    datetime64[ns] 
 2   Closed Date                     datetime64[ns] 
 3   Agency                          object         
 4   Agency Name                     object         
 5   Complaint Type                  object         
 6   Descriptor                      object         
 7   Location Type                   object         
 8   Incident Zip                    object         
 9   Incident Address                object         
 10  Street Name                     object         
 11  Cross Street 1                  object         
 12  Cross Street 2                  object         
 13  Intersection Street 1           object         
 14  Intersection Street 2           ob

In [26]:
df.resolution_duration.median()

Timedelta('1 days 05:18:00')

In [27]:
# let's ignore empty values
df_clean = df.dropna(subset=['resolution_duration'])
median_durations = df_clean.groupby('Complaint Type').resolution_duration.median(numeric_only=False)

median_durations.nlargest(15).reset_index(name='median_duration')

Unnamed: 0,Complaint Type,median_duration
0,FHV Licensee Complaint,117 days 21:16:54.500000
1,Taxi Complaint,99 days 16:14:32
2,Radioactive Material,98 days 00:25:34
3,For Hire Vehicle Complaint,96 days 16:03:04
4,Graffiti,88 days 09:34:52
5,New Tree Request,83 days 23:42:27.500000
6,Taxi Licensee Complaint,67 days 09:34:14
7,Food Establishment,60 days 15:32:19
8,Facades,53 days 07:07:02
9,Sustainability Enforcement,48 days 07:05:53


In [28]:
median_durations.nsmallest(15).reset_index(name='median_duration')

Unnamed: 0,Complaint Type,median_duration
0,BEST/Site Safety,0 days 00:00:00
1,Construction Safety Enforcement,0 days 00:00:00
2,Derelict Vehicles,0 days 00:00:00
3,Quality of Life,0 days 00:00:00
4,Street Light Condition,0 days 00:00:00
5,Taxi Report,0 days 00:00:36
6,Benefit Card Replacement,0 days 00:00:38
7,For Hire Vehicle Report,0 days 00:00:38
8,Advocate-Commercial Exemptions,0 days 00:06:34
9,Advocate-Property Value,0 days 00:07:14.500000


## [Filtering timestamps](https://www.interviewqs.com/ddi_code_snippets/select_pandas_dataframe_rows_between_two_dates)

Noise complaints over New Year's.

In [29]:
# `noise` is already limited to 'Noise - Residential', so only the date filter is needed
noise[(noise['Created Date'] >= '2018-12-31') & (noise['Created Date'] < '2019-01-02')]

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
1153445,41303456,2018-12-31 00:00:25,2018-12-31 05:13:01,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,11434,128-09 SIDWAY PLACE,...,,,,,,,,40.681950,-73.762979,"(40.68194967436897, -73.76297933533925)"
1153450,41302283,2018-12-31 00:02:02,2018-12-31 03:32:50,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11226,27 CROOKE AVENUE,...,,,,,,,,40.652084,-73.964390,"(40.65208355865196, -73.96439024060807)"
1153456,41303466,2018-12-31 00:03:32,2018-12-31 06:07:50,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,10032,560 WEST 165 STREET,...,,,,,,,,40.838793,-73.940416,"(40.83879341512027, -73.9404163746114)"
1153457,41302844,2018-12-31 00:03:37,2018-12-31 06:49:47,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11226,1060 OCEAN AVENUE,...,,,,,,,,40.636818,-73.958504,"(40.63681783665883, -73.95850381713777)"
1153460,41302158,2018-12-31 00:03:52,2018-12-31 06:23:33,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,10453,57 WEST 175 STREET,...,,,,,,,,40.848115,-73.915201,"(40.84811513088202, -73.91520061603975)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1164767,41317586,2019-01-01 23:56:22,2019-01-02 01:12:42,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,11210.0,2501 NOSTRAND AVENUE,...,,,,,,,,40.623528,-73.946250,"(40.62352798084708, -73.94624978072031)"
1164768,41319052,2019-01-01 23:56:33,2019-01-02 01:43:58,NYPD,New York City Police Department,Noise - Residential,Loud Talking,Residential Building/House,10023.0,250 WEST 61 DRIVE,...,,,,,,,,40.773083,-73.988819,"(40.77308309732899, -73.9888185606095)"
1164772,41317202,2019-01-01 23:57:13,2019-01-02 01:25:48,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,10031.0,509 WEST 135 STREET,...,,,,,,,,40.819185,-73.952768,"(40.81918540590982, -73.95276841145517)"
1164773,41318737,2019-01-01 23:57:16,2019-01-02 01:14:43,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,11236.0,10702 FARRAGUT ROAD,...,,,,,,,,40.651367,-73.896628,"(40.651367400005725, -73.89662791387717)"
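The filter uses `>=` for the start and `<` for the end, so it covers all of December 31 and January 1 but nothing from January 2. A sketch with made-up timestamps right at the boundaries shows which rows are kept:

```python
import pandas as pd

# Made-up timestamps on either side of each boundary
toy = pd.DataFrame({
    'Created Date': pd.to_datetime([
        '2018-12-30 23:59:59',  # just before the start: excluded
        '2018-12-31 00:00:00',  # exactly at the start: included
        '2019-01-01 23:59:59',  # just before the end: included
        '2019-01-02 00:00:00',  # exactly at the end: excluded
    ]),
})

start = pd.Timestamp('2018-12-31')
end = pd.Timestamp('2019-01-02')

new_years = toy[(toy['Created Date'] >= start) & (toy['Created Date'] < end)]
print(new_years)
```

A date with no time, like `'2019-01-02'`, is treated as midnight at the start of that day, which is why the end date is the day after the last day wanted.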


## Pivoting

Pandas supports "reshaping" DataFrames through [pivoting](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-by-pivoting-dataframe-objects), [like spreadsheets do](https://support.google.com/docs/answer/1272900).
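As a rough illustration of what pivoting does, here is a sketch that turns a made-up "long" table (one row per week and borough) into a "wide" one (one row per week, one column per borough). The column names and values are hypothetical.

```python
import pandas as pd

# Made-up long-format counts: one row per week and borough
long = pd.DataFrame({
    'week': ['2018-08-05', '2018-08-05', '2018-08-12', '2018-08-12'],
    'Borough': ['BRONX', 'QUEENS', 'BRONX', 'QUEENS'],
    'count_requests': [2, 1, 4, 3],
})

# Rows come from 'week', columns from 'Borough', cell values from 'count_requests'
wide = long.pivot(index='week', columns='Borough', values='count_requests')
print(wide)
```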

In [30]:
%%html
<video controls width="700">
  <source src="https://github.com/afeld/python-public-policy/raw/main/extras/img/pivot.mp4" type="video/mp4">
</video>