When looking at the outliers for completion time, I saw anecdotally that many of them were caused by administrative errors: the issues were duplicates or otherwise invalid. We should remove these issues from our dataset.

In [123]:
import pandas as pd
from IPython.display import display
from datetime import datetime, timedelta
import re

pd.set_option('max_colwidth', 400)

In [59]:
df = pd.read_pickle('../data/data_w_transformed_census.pkl')
df.shape

(905205, 48)

## Dropping invalid or duplicate or administratively closed issues

Let's manually inspect our "classifier"; any false positives?

In [67]:
# exclusion_str is defined in the next cell; run that cell first
df.loc[df.CLOSURE_REASON.str.contains(exclusion_str, case=False, na=False), ['CLOSURE_REASON']].head(20)

Unnamed: 0,CLOSURE_REASON
19,Case Closed Case Invalid No eform
102,Case Closed. Closed date : 2015-07-13 11:17:13.853 ADCLSD: Administratively Closed
104,Case Closed Duplicate of Existing Case %1219035
110,Case Closed. Closed date : 2015-09-18 09:57:35.08 Case Invalid wrong case (bulk Item)
124,Case Closed. Closed date : 2016-02-29 17:07:39.243 Duplicate of Existing Case
126,Case Closed Duplicate of Existing Case #101001148920 duplicate
142,Case Closed ADCLSD: Administratively Closed
170,Case Closed. Closed date : 2016-10-24 09:23:27.363 ADCLSD: Administratively Closed
181,Case Closed Case Invalid Opened in error.
194,Case Closed. Closed date : 2016-03-22 11:24:50.387 Case Invalid TV has been taken by someone else already


Doesn't look like it. There are probably still a number of false negatives, though.

In [68]:
exclusion_str = r'(duplicate of|administrative|never closed|invalid)'
df1 = df[~df.CLOSURE_REASON.str.contains(exclusion_str, case=False, na=False)]
df1.shape

(843352, 48)
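As a quick sanity check on the exclusion filter, here is a minimal sketch on a handful of made-up closure reasons modelled on the samples above (the toy `reasons` series and the names `pattern`, `is_excluded` and `kept` are illustrative, not part of the dataset). It uses a non-capturing group `(?:...)`, since pandas emits a `UserWarning` when the pattern passed to `str.contains` has capturing groups.

```python
import pandas as pd

# Toy closure reasons modelled on the samples above (not real rows).
reasons = pd.Series([
    'Case Closed Case Invalid No eform',
    'Case Closed Duplicate of Existing Case',
    'Case Closed ADCLSD: Administratively Closed',
    'Case Closed Case Resolved picked up',
    None,  # missing closure reason
])

# (?:...) is a non-capturing group, so str.contains does not warn about match groups.
pattern = r'(?:duplicate of|administrative|never closed|invalid)'

# na=False treats a missing closure reason as "not excluded".
is_excluded = reasons.str.contains(pattern, case=False, na=False)

# Rows that survive the filter.
kept = reasons[~is_excluded]
```

Printing `reasons[is_excluded]` and `kept` side by side is an easy way to eyeball false positives and false negatives on a small sample.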

## Drop internal categories?

Decided not to drop internal categories, or issues from internal sources. Since the internal categories and sources don't overlap with the non-internal categories and sources, including them in the dataset won't adversely affect my model.

## Looking at any issues with negative completion times

In [70]:
(df1.COMPLETION_TIME < 0).sum()

39675

Yikes. Let's inspect these.

In [120]:
aa = df1[['COMPLETION_TIME', 'OPEN_DT', 'CLOSED_DT', 'TYPE', 'CLOSURE_REASON']][df1.COMPLETION_TIME < 0].head(500).tail()
aa

Unnamed: 0,COMPLETION_TIME,OPEN_DT,CLOSED_DT,TYPE,CLOSURE_REASON
11413,-10.428611,2016-06-10 12:06:00,2016-06-10 01:40:17,Missed Trash/Recycling/Yard Waste/Bulk Item,Case Closed. Closed date : 2016-06-10 13:40:17.743 Case Noted On inspection of above address there was no missed trash/rec visible but there is grass on the sidewalk not sure if that's what was missed but its not a YW service week so we can't service YW I did leave a callender for them.
11418,-8.720278,2015-09-01 12:32:00,2015-09-01 03:48:47,Needle Pickup,Case Closed. Closed date : 2015-09-01 15:48:47.68 Case Noted MSU responded to a report of a single syringe at Hobart Park in Brighton at the corner of Olivia and Ranleigh. After conducting a thorough sweep of the area and contacting the constituent who happened to be a city councilman we were unable to locate the syringe. We are however confident that the syringe is no longer in that park....
11423,-7.645278,2016-06-02 09:14:00,2016-06-02 01:35:17,Pick up Dead Animal,Case Closed. Closed date : 2016-06-02 13:35:17.557 Case Resolved Alset picked up
11460,-6.488611,2011-11-01 09:44:58,2011-11-01 03:15:39,General Comments For An Employee,Case Closed Case Resolved taken care of thank you
11486,-8.427222,2013-08-29 10:32:55,2013-08-29 02:07:17,Missed Trash/Recycling/Yard Waste/Bulk Item,Case Closed Case Resolved no tv on curb during my inspection . 8/29/13


In [85]:
39675 / 905205.

0.043829850696803486

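From this sample, it looks like the close times were somehow mislogged: where `CLOSURE_REASON` contains a "Closed date", `CLOSED_DT` is exactly 12 hours earlier than it. The correct times are the ones in the `CLOSURE_REASON` description.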
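One way to check this beyond eyeballing is to pull the logged timestamp out of the text and compare it with `CLOSED_DT`. The following is a minimal sketch on three toy rows copied from the sample above; the names `sample`, `ts_pattern`, `logged` and `offset_hours` are illustrative, and the pattern assumes the timestamp always follows the literal text `Closed date : `.

```python
import pandas as pd

# Toy rows copied from the sample above (closure text shortened).
sample = pd.DataFrame({
    'CLOSED_DT': pd.to_datetime([
        '2016-06-10 01:40:17',
        '2015-09-01 03:48:47',
        '2011-11-01 03:15:39',
    ]),
    'CLOSURE_REASON': [
        'Case Closed. Closed date : 2016-06-10 13:40:17.743 Case Noted',
        'Case Closed. Closed date : 2015-09-01 15:48:47.68 Case Noted',
        'Case Closed Case Resolved taken care of thank you',
    ],
})

# Capture the timestamp up to whole seconds; the fractional part varies in length.
ts_pattern = r'Closed date : (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
logged = pd.to_datetime(sample['CLOSURE_REASON'].str.extract(ts_pattern, expand=False))

# Offset between the logged timestamp and CLOSED_DT, in hours (NaN where no timestamp).
offset_hours = (logged - sample['CLOSED_DT']) / pd.Timedelta(hours=1)
```

On the full dataset, `offset_hours.value_counts()` over the negative-completion rows would show whether the 12-hour offset holds everywhere or only in this sample.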

My best guess for the rest is that the input form allowed the user to choose a closing time earlier than the opening time, and the user intended to enter a 'finished soon after' time.

Since this affects 4% of the issues, if `CLOSURE_REASON` contains a datetime, I'll shift `CLOSED_DT` forward by 12 hours so that it matches that logged time; otherwise I'll set `COMPLETION_TIME` to half an hour.

In [133]:
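# Timestamp as logged in CLOSURE_REASON; the fractional part has 2 or 3 digits in the samples above
TIMESTAMP_RE = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}')

def fix_neg_completion_time(row):
    # very slow; do this using native pandas
    if row['COMPLETION_TIME'] < 0:
        if TIMESTAMP_RE.search(row['CLOSURE_REASON']):
            # in the samples above, CLOSED_DT is 12 hours behind the logged timestamp
            row['CLOSED_DT'] = row['CLOSED_DT'] + timedelta(hours=12)
        else:
            # no logged timestamp: assume the issue was closed 30 minutes after opening
            row['CLOSED_DT'] = row['OPEN_DT'] + timedelta(minutes=30)

    return row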

In [137]:
df2 = df1.apply(fix_neg_completion_time, axis=1)
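Since the row-wise `apply` is slow, the same rule could be written with boolean masks. This is only a sketch under assumptions, not the code that produced the outputs in this notebook: the function name `fix_neg_completion_time_vectorized` and the toy frame are hypothetical, and the pattern accepts a fractional-seconds part of any length (e.g. `.68`), which is an assumption about the intended match.

```python
import pandas as pd

def fix_neg_completion_time_vectorized(df):
    """Vectorized sketch of the negative-completion-time rule; returns a corrected copy."""
    out = df.copy()
    negative = out['COMPLETION_TIME'] < 0
    # Rows whose closure text carries a logged timestamp (fraction of any length).
    has_ts = out['CLOSURE_REASON'].str.contains(
        r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+', na=False)

    # Timestamp present: shift CLOSED_DT forward by 12 hours.
    shift = negative & has_ts
    out.loc[shift, 'CLOSED_DT'] = out.loc[shift, 'CLOSED_DT'] + pd.Timedelta(hours=12)

    # No timestamp: assume the issue was closed 30 minutes after opening.
    fallback = negative & ~has_ts
    out.loc[fallback, 'CLOSED_DT'] = out.loc[fallback, 'OPEN_DT'] + pd.Timedelta(minutes=30)

    # Recompute completion time in hours from the corrected timestamps.
    out['COMPLETION_TIME'] = (out['CLOSED_DT'] - out['OPEN_DT']) / pd.Timedelta(hours=1)
    return out

# Toy rows based on the samples in this notebook (closure text shortened).
toy = pd.DataFrame({
    'OPEN_DT': pd.to_datetime(['2016-06-10 12:06:00', '2011-11-01 09:44:58', '2016-02-10 08:46:00']),
    'CLOSED_DT': pd.to_datetime(['2016-06-10 01:40:17', '2011-11-01 03:15:39', '2016-02-10 10:00:01']),
    'CLOSURE_REASON': [
        'Case Closed. Closed date : 2016-06-10 13:40:17.743 Case Noted',
        'Case Closed Case Resolved taken care of thank you',
        'Case Closed. Closed date : 2016-02-10 10:00:01.353 Case Noted',
    ],
})
toy['COMPLETION_TIME'] = (toy['CLOSED_DT'] - toy['OPEN_DT']) / pd.Timedelta(hours=1)

fixed = fix_neg_completion_time_vectorized(toy)
```

Working on a copy keeps the input frame untouched, which makes it easy to compare `toy` and `fixed` row by row before committing to the change.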

In [141]:
# completion time in hours (pd.np was removed in pandas 2.0; divide by a Timedelta instead)
df2['COMPLETION_TIME'] = (df2.CLOSED_DT - df2.OPEN_DT) / pd.Timedelta(hours=1)

In [151]:
df2[['COMPLETION_TIME', 'OPEN_DT', 'CLOSED_DT', 'TYPE', 'CLOSURE_REASON']].loc[11413:11414]

Unnamed: 0,COMPLETION_TIME,OPEN_DT,CLOSED_DT,TYPE,CLOSURE_REASON
11413,1.571389,2016-06-10 12:06:00,2016-06-10 13:40:17,Missed Trash/Recycling/Yard Waste/Bulk Item,Case Closed. Closed date : 2016-06-10 13:40:17.743 Case Noted On inspection of above address there was no missed trash/rec visible but there is grass on the sidewalk not sure if that's what was missed but its not a YW service week so we can't service YW I did leave a callender for them.
11414,1.233611,2016-02-10 08:46:00,2016-02-10 10:00:01,Unshoveled Sidewalk,Case Closed. Closed date : 2016-02-10 10:00:01.353 Case Noted cited unshoveled sifewalk


In [152]:
(df2.COMPLETION_TIME < 0).sum()

0

From eyeballing it, the result looks good. Some of the issues with negative completion times may be ones that should never have been created, but I think these are very few.

I think the majority of issues with negative completion times were just done quickly, and there was user input error.

## OK, let's make a revised final df

In [153]:
df2.shape

(843352, 48)

In [154]:
df2.to_pickle('../data/data_w_transformed_census_and_removed_invalid_rows.pkl')