While looking at the completion-time outliers, I noticed anecdotally that many of them were caused by administrative errors: the issues were duplicates or otherwise invalid. We should remove these issues from our dataset.

In [1]:
import pandas as pd
from IPython.display import display
from datetime import datetime, timedelta
import re

pd.set_option('max_colwidth', 400)

In [20]:
df1 = pd.read_pickle('../data/data_w_transformed_census.pkl')
df1.shape

(905205, 48)

## Looking at any issues with negative completion times

In [26]:
(df1.COMPLETION_TIME < 0).sum()

39675

Yikes. Let's inspect these.

In [120]:
aa = df1[['COMPLETION_TIME', 'OPEN_DT', 'CLOSED_DT', 'TYPE', 'CLOSURE_REASON']][df1.COMPLETION_TIME < 0].head(500).tail()
aa

Unnamed: 0,COMPLETION_TIME,OPEN_DT,CLOSED_DT,TYPE,CLOSURE_REASON
11413,-10.428611,2016-06-10 12:06:00,2016-06-10 01:40:17,Missed Trash/Recycling/Yard Waste/Bulk Item,Case Closed. Closed date : 2016-06-10 13:40:17.743 Case Noted On inspection of above address there was no missed trash/rec visible but there is grass on the sidewalk not sure if that's what was missed but its not a YW service week so we can't service YW I did leave a callender for them.
11418,-8.720278,2015-09-01 12:32:00,2015-09-01 03:48:47,Needle Pickup,Case Closed. Closed date : 2015-09-01 15:48:47.68 Case Noted MSU responded to a report of a single syringe at Hobart Park in Brighton at the corner of Olivia and Ranleigh. After conducting a thorough sweep of the area and contacting the constituent who happened to be a city councilman we were unable to locate the syringe. We are however confident that the syringe is no longer in that park....
11423,-7.645278,2016-06-02 09:14:00,2016-06-02 01:35:17,Pick up Dead Animal,Case Closed. Closed date : 2016-06-02 13:35:17.557 Case Resolved Alset picked up
11460,-6.488611,2011-11-01 09:44:58,2011-11-01 03:15:39,General Comments For An Employee,Case Closed Case Resolved taken care of thank you
11486,-8.427222,2013-08-29 10:32:55,2013-08-29 02:07:17,Missed Trash/Recycling/Yard Waste/Bulk Item,Case Closed Case Resolved no tv on curb during my inspection . 8/29/13


In [193]:
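cols = ['COMPLETION_TIME', 'OPEN_DT', 'CLOSED_DT', 'TYPE', 'CLOSURE_REASON']
# Pattern without a capture group and with an escaped dot, so str.contains does not warn about match groups.
has_ts = df1.CLOSURE_REASON.str.contains(r'\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d', na=False)
# One combined mask instead of chained boolean indexing, which triggers a reindexing warning.
df1.loc[(df1.CLOSED_DT < '2015-05-03') & (df1.COMPLETION_TIME > 0) & has_ts, cols]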



Unnamed: 0,COMPLETION_TIME,OPEN_DT,CLOSED_DT,TYPE,CLOSURE_REASON
17770,1607.934722,2015-02-24 08:14:52,2015-05-02 08:10:57,Highway Maintenance,Case Closed. Closed date : 2015-05-02 20:10:57.963 Case Referred to External Agency gave to DCR in Feb
22621,1607.863056,2015-02-24 08:18:43,2015-05-02 08:10:30,Requests for Traffic Signal Studies or Reviews,Case Closed. Closed date : 2015-05-02 20:10:30.943 Case Referred to External Agency gave to DCR in Feb
30192,12.510278,2015-05-01 12:46:10,2015-05-02 01:16:47,Request for Pothole Repair,Case Closed. Closed date : 2015-05-02 13:16:47.883 Case Resolved has been filled
59339,1.6025,2015-05-02 04:32:44,2015-05-02 06:08:53,Empty Litter Basket,Case Closed. Closed date : 2015-05-02 18:08:53.343 Case Resolved emptied
65039,12.939444,2015-05-01 12:19:26,2015-05-02 01:15:48,Sidewalk Repair (Make Safe),Case Closed. Closed date : 2015-05-02 13:15:48.053 Case Resolved has been made safe
81892,1250.975556,2015-03-11 04:32:24,2015-05-02 07:30:56,Sidewalk Repair (Make Safe),Case Closed. Closed date : 2015-05-02 19:30:56.463 Case Referred to External Agency reported to NPS
103924,3.4475,2015-05-02 03:53:05,2015-05-02 07:19:56,Pick up Dead Animal,Case Closed. Closed date : 2015-05-02 19:19:56.617 Case Resolved dead animal picked up
105867,1201.949722,2015-03-13 05:32:57,2015-05-02 07:29:56,Requests for Street Cleaning,Case Closed. Closed date : 2015-05-02 19:29:56.793 Case Referred to External Agency reported to National Park Service in March
135007,1271.081111,2015-03-10 08:27:54,2015-05-02 07:32:46,Sidewalk Cover / Manhole,Case Closed. Closed date : 2015-05-02 19:32:46.067 Case Referred to External Agency reported to BWSC
161208,1147.827222,2015-03-15 11:38:54,2015-05-02 07:28:32,Sidewalk Repair (Make Safe),Case Closed. Closed date : 2015-05-02 19:28:32.003 Case Referred to External Agency reported to DCR in March


It looks like they changed their `CLOSED_DT` definition on 2015-05-02, the only closed date that shows up in the output above. To verify, are there any issues with negative completion times closed before that date? If not, then I will want to make every `CLOSED_DT` after that date line up with the time in `CLOSURE_REASON`.

In [194]:
df1.loc[(df1.COMPLETION_TIME < 0) & (df1.CLOSED_DT < '2015-05-02'),
        ['COMPLETION_TIME', 'OPEN_DT', 'CLOSED_DT', 'TYPE', 'CLOSURE_REASON']].head(500).tail()



Unnamed: 0,COMPLETION_TIME,OPEN_DT,CLOSED_DT,TYPE,CLOSURE_REASON
20193,-9.230278,2011-09-22 11:35:37,2011-09-22 02:21:48,Empty Litter Basket,Case Closed Case Resolved all barrels are empty
20287,-7.95,2013-10-02 10:52:45,2013-10-02 02:55:45,Traffic Signal Repair,Case Closed Case Resolved
20366,-7.066111,2014-01-24 09:50:53,2014-01-24 02:46:55,Request for Pothole Repair,Case Closed Case Resolved made safe
20376,-6.302222,2012-02-08 09:25:25,2012-02-08 03:07:17,Requests for Street Cleaning,Case Closed Case Resolved glass cleaned up.
20391,-6.917778,2014-03-24 09:31:17,2014-03-24 02:36:13,Request for Pothole Repair,Case Closed Case Noted pothole patched


Yes.

How many negative completion times before and after that date?

In [196]:
df1.loc[(df1.COMPLETION_TIME < 0) & (df1.CLOSED_DT < '2015-05-02'), ['COMPLETION_TIME', 'CLOSED_DT']].shape



(22070, 2)

In [197]:
df1.loc[(df1.COMPLETION_TIME < 0) & (df1.CLOSED_DT > '2015-05-02'), ['COMPLETION_TIME', 'CLOSED_DT']].shape



(17605, 2)

In [85]:
39675 / 905205.

0.043829850696803486
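The same three numbers (negatives before the cutoff, negatives after it, and the overall share) can also be computed from boolean masks in one place. This is only a sketch on a hypothetical toy frame named `toy`; the values are made up for illustration and the real data lives in `df1`.

```python
import pandas as pd

# Hypothetical toy data standing in for df1 (values are illustrative only).
toy = pd.DataFrame({
    'COMPLETION_TIME': [-9.2, -7.9, 3.4, -10.4, None],
    'CLOSED_DT': pd.to_datetime([
        '2011-09-22 02:21:48',
        '2013-10-02 02:55:45',
        '2015-05-02 07:19:56',
        '2016-06-10 01:40:17',
        None,
    ]),
})

cutoff = pd.Timestamp('2015-05-02')
negative = toy['COMPLETION_TIME'] < 0  # NaN compares as False

# Count negatives on each side of the cutoff with combined masks.
n_before = (negative & (toy['CLOSED_DT'] < cutoff)).sum()
n_after = (negative & (toy['CLOSED_DT'] > cutoff)).sum()

# The mean of a boolean mask is the share of True values over all rows.
share_negative = negative.mean()

print(n_before, n_after, share_negative)
```

Taking the mean over all rows, including those with a missing completion time, matches the denominator used in the division above.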

Now I have two options:
- change all the `CLOSED_DT`s after 2015-05-02 to match the timestamp in `CLOSURE_REASON`
- since only about 4% of the issues have negative completion times, trust `CLOSED_DT` rather than the description, and drop those issues
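The first option was not pursued here, but a minimal sketch of it might look like the following. It assumes the embedded timestamp always follows the literal text `Closed date : `, which holds for the rows printed above but is not verified for the full dataset. The toy frame `sample` copies three of those rows, with the closure text shortened; `embedded`, `fixed_closed` and `fixed_hours` are hypothetical names.

```python
import pandas as pd

# Three rows copied from the output above (closure text shortened).
sample = pd.DataFrame({
    'OPEN_DT': pd.to_datetime([
        '2016-06-10 12:06:00', '2015-09-01 12:32:00', '2011-11-01 09:44:58']),
    'CLOSED_DT': pd.to_datetime([
        '2016-06-10 01:40:17', '2015-09-01 03:48:47', '2011-11-01 03:15:39']),
    'CLOSURE_REASON': [
        'Case Closed. Closed date : 2016-06-10 13:40:17.743 Case Noted',
        'Case Closed. Closed date : 2015-09-01 15:48:47.68 Case Noted',
        'Case Closed Case Resolved taken care of thank you',
    ],
})

# Pull out the embedded timestamp, ignoring the fractional seconds.
pattern = r'Closed date : (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
embedded = pd.to_datetime(
    sample['CLOSURE_REASON'].str.extract(pattern, expand=False),
    format='%Y-%m-%d %H:%M:%S',
    errors='coerce',  # rows without a timestamp become NaT
)

# Prefer the embedded timestamp, fall back to CLOSED_DT when there is none.
fixed_closed = embedded.fillna(sample['CLOSED_DT'])

# Completion time in hours, the unit COMPLETION_TIME appears to use.
fixed_hours = (fixed_closed - sample['OPEN_DT']).dt.total_seconds() / 3600

# How far the embedded timestamp is from CLOSED_DT, in hours.
offset_hours = (embedded - sample['CLOSED_DT']).dt.total_seconds() / 3600
print(pd.DataFrame({'fixed_hours': fixed_hours, 'offset_hours': offset_hours}))
```

In the two sample rows that carry a timestamp, it is exactly 12 hours later than `CLOSED_DT`. Whether that holds across the dataset would need to be checked on `df1`. The third row has no embedded timestamp, so this approach leaves its negative completion time unchanged.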

Given more time, I could contact the city and ask them why the completion time stopped matching up after 2015-05-02. A quick Google search for 'boston 311 closure reason' didn't yield any helpful results. I could also look at the before-and-after completion times to see if there's a big difference.
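The before-and-after comparison was not run in this notebook. As a rough sketch of what it could look like, the snippet below splits closed issues at the cutoff and compares median completion times. The frame `toy` and its values are hypothetical, and the median is just one possible summary.

```python
import pandas as pd

# Hypothetical toy data standing in for df1 (values are illustrative only).
toy = pd.DataFrame({
    'COMPLETION_TIME': [5.0, 7.0, 1.0, 3.0, -8.0, None],
    'CLOSED_DT': pd.to_datetime([
        '2014-01-24 02:46:55',
        '2014-03-24 02:36:13',
        '2015-05-02 07:19:56',
        '2015-09-01 03:48:47',
        '2016-06-10 01:40:17',
        None,
    ]),
})

cutoff = pd.Timestamp('2015-05-02')

# Only closed issues can be assigned to a period.
closed = toy.dropna(subset=['CLOSED_DT'])

# Label each closed issue by which side of the cutoff it falls on.
period = (closed['CLOSED_DT'] >= cutoff).map({False: 'before', True: 'after'})

# Median completion time per period.
summary = closed.groupby(period)['COMPLETION_TIME'].median()
print(summary)
```

A large gap between the two medians on the real data would be one hint that the recording of `CLOSED_DT` changed.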

But for now, I will do the simple thing, trust the `CLOSED_DT` values, and drop the issues with negative completion times.

In [37]:
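# Keep rows with a positive completion time or none at all; note that `> 0` also drops exact zeros.
df2 = df1[(df1.COMPLETION_TIME > 0) | (df1.COMPLETION_TIME.isnull())]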

In [41]:
df2.CLOSED_DT.isnull().sum()

70557

We could characterize the issues with negative completion times by datetime and category, among other factors.
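A minimal sketch of such a breakdown, assuming counts by `TYPE` and by year and hour of `OPEN_DT` are a reasonable first look. The toy frame `toy` reuses a few values from the rows printed earlier, rounded; the names `by_type`, `by_year` and `by_hour` are hypothetical.

```python
import pandas as pd

# Toy rows loosely based on the output shown earlier (rounded values).
toy = pd.DataFrame({
    'COMPLETION_TIME': [-10.4, -8.7, -7.6, 1.6, -8.4],
    'OPEN_DT': pd.to_datetime([
        '2016-06-10 12:06:00',
        '2015-09-01 12:32:00',
        '2016-06-02 09:14:00',
        '2015-05-02 04:32:44',
        '2013-08-29 10:32:55',
    ]),
    'TYPE': [
        'Missed Trash/Recycling/Yard Waste/Bulk Item',
        'Needle Pickup',
        'Pick up Dead Animal',
        'Empty Litter Basket',
        'Missed Trash/Recycling/Yard Waste/Bulk Item',
    ],
})

# Restrict to the issues with negative completion times.
neg = toy[toy['COMPLETION_TIME'] < 0]

# Which categories are affected most often?
by_type = neg['TYPE'].value_counts()

# When were the affected issues opened (by year and by hour of day)?
by_year = neg['OPEN_DT'].dt.year.value_counts().sort_index()
by_hour = neg['OPEN_DT'].dt.hour.value_counts().sort_index()

print(by_type, by_year, by_hour, sep='\n\n')
```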

## OK, let's make a revised final df

See `remove_from_dataset.py`.