# Analyzing and Predicting DC's Service Request Types of 2018

The flow adopted in this notebook is as follows:

> 1. Read in the datasets using ArcGIS API for Python
> 2. Merge datasets
> 3. Construct model that predicts service type
> 4. How does my neighborhood fare?
> 5. Next Steps

The datasets used in this notebook are:
1. __`City Service Requests in 2018`__
2. __`Neighborhood Clusters`__

These datasets can be found on [opendata.dc.gov](http://opendata.dc.gov/)

We start by importing the ArcGIS API for Python, which loads the data from a service URL.

In [1]:
from arcgis.features import *
import arcgis

### 1.1 Read in service requests for 2018

[Link](http://opendata.dc.gov/datasets/city-service-requests-in-2018/geoservice?geometry=-77.49%2C38.811%2C-76.534%2C38.998) to Service Requests 2018 dataset

In [2]:
lyr_url = 'https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/ServiceRequests/MapServer/9'

req_layer = FeatureLayer(lyr_url)
req_layer

<FeatureLayer url:"https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/ServiceRequests/MapServer/9">

In [3]:
#Extract all the data and display number of rows
all_features = req_layer.query()
print('Total number of rows in the dataset: ')
print(len(all_features.features))

Total number of rows in the dataset: 
16780


In [4]:
#store as dataframe
requests = all_features.df

#View first 5 rows
requests.head()

Unnamed: 0,ADDDATE,CITY,DETAILS,INSPECTIONDATE,INSPECTIONFLAG,INSPECTORNAME,LATITUDE,LONGITUDE,MARADDRESSREPOSITORYID,OBJECTID,...,SERVICEREQUESTID,SERVICETYPECODEDESCRIPTION,STATE,STATUS_CODE,STREETADDRESS,WARD,XCOORD,YCOORD,ZIPCODE,SHAPE
0,1515054684000,WASHINGTON,Per T. Duckett 01-09-18. closed by A. Hedgeman...,1515497000000.0,Y,,38.860941,-76.989057,65434.0,309014,...,18-00005582,SNOW,DC,CLOSED,1375 MORRIS ROAD SE,8,400949.75,132569.42,20020,"{'spatialReference': {'wkid': 26985, 'latestWk..."
1,1514890494000,WASHINGTON,completed per Supervisor P. Redman-Smith. Clo...,1515049000000.0,Y,,38.872655,-76.972942,48733.0,309015,...,18-00001141,SNOW,DC,CLOSED,2310 MINNESOTA AVENUE SE,8,402348.03,133870.07,20020,"{'spatialReference': {'wkid': 26985, 'latestWk..."
2,1515014214000,,No information. closed by sg 1/4/18,1515049000000.0,Y,,38.846461,-76.971636,,309016,...,18-00005386,SNOW,,CLOSED,,8,402462.228864,130962.50761,20020,"{'spatialReference': {'wkid': 26985, 'latestWk..."
3,1515041355000,WASHINGTON,DPW Officer issued one ticket for fire hydrant...,,N,,38.920416,-77.013792,228207.0,309017,...,18-00005455,PEMA- Parking Enforcement Management Administr...,DC,CLOSED,149 ADAMS STREET NW,5,398803.97,139171.7,20001,"{'spatialReference': {'wkid': 26985, 'latestWk..."
4,1515006248000,WASHINGTON,,,N,,38.952663,-77.069486,223159.0,309018,...,18-00005276,PEMA- Parking Enforcement Management Administr...,DC,CLOSED,4817 36TH STREET NW,3,393976.96,142753.59,20008,"{'spatialReference': {'wkid': 26985, 'latestWk..."


In [5]:
#Import other necessary packages
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

We now convert this dataframe to a GeoDataFrame

In [6]:
geometry = [Point(xy) for xy in zip(requests['LONGITUDE'], requests['LATITUDE'])]
requests = requests.drop(['LONGITUDE', 'LATITUDE'], axis=1)
crs = {'init': 'epsg:4326'}
requests_gdf = gpd.GeoDataFrame(requests, crs=crs, geometry=geometry)

In [7]:
requests_gdf.head()

Unnamed: 0,ADDDATE,CITY,DETAILS,INSPECTIONDATE,INSPECTIONFLAG,INSPECTORNAME,MARADDRESSREPOSITORYID,OBJECTID,ORGANIZATIONACRONYM,PRIORITY,...,SERVICETYPECODEDESCRIPTION,STATE,STATUS_CODE,STREETADDRESS,WARD,XCOORD,YCOORD,ZIPCODE,SHAPE,geometry
0,1515054684000,WASHINGTON,Per T. Duckett 01-09-18. closed by A. Hedgeman...,1515497000000.0,Y,,65434.0,309014,DPW,STANDARD,...,SNOW,DC,CLOSED,1375 MORRIS ROAD SE,8,400949.75,132569.42,20020,"{'x': 400949.75, 'y': 132569.4200000018}",POINT (-76.98905709 38.86094051)
1,1514890494000,WASHINGTON,completed per Supervisor P. Redman-Smith. Clo...,1515049000000.0,Y,,48733.0,309015,DPW,STANDARD,...,SNOW,DC,CLOSED,2310 MINNESOTA AVENUE SE,8,402348.03,133870.07,20020,"{'x': 402348.0300000012, 'y': 133870.0700000003}",POINT (-76.97294183 38.87265468)
2,1515014214000,,No information. closed by sg 1/4/18,1515049000000.0,Y,,,309016,DPW,STANDARD,...,SNOW,,CLOSED,,8,402462.228864,130962.50761,20020,"{'x': 402462.22890000045, 'y': 130962.50759999...",POINT (-76.971636238 38.84646148780001)
3,1515041355000,WASHINGTON,DPW Officer issued one ticket for fire hydrant...,,N,,228207.0,309017,DPW,STANDARD,...,PEMA- Parking Enforcement Management Administr...,DC,CLOSED,149 ADAMS STREET NW,5,398803.97,139171.7,20001,"{'x': 398803.9699999988, 'y': 139171.69999999925}",POINT (-77.01379201 38.92041601)
4,1515006248000,WASHINGTON,,,N,,223159.0,309018,DPW,STANDARD,...,PEMA- Parking Enforcement Management Administr...,DC,CLOSED,4817 36TH STREET NW,3,393976.96,142753.59,20008,"{'x': 393976.9600000009, 'y': 142753.58999999985}",POINT (-77.06948607 38.95266291)


### 1.2 Read in Neighborhood Clusters dataset

[Link](http://opendata.dc.gov/datasets/neighborhood-clusters) to this dataset

In [8]:
#Raw string so the backslashes in the Windows path are not treated as escape sequences
neighborhood = gpd.read_file(r'D:\Data\DC_Neighborhood\Neighborhood_Clusters\Neighborhood_Clusters.shp')
neighborhood.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 8 columns):
OBJECTID      46 non-null int64
WEB_URL       39 non-null object
NAME          46 non-null object
NBH_NAMES     46 non-null object
Shape_Leng    46 non-null float64
Shape_Area    46 non-null float64
TYPE          46 non-null object
geometry      46 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 3.0+ KB


In [9]:
neighborhood.head()

Unnamed: 0,OBJECTID,WEB_URL,NAME,NBH_NAMES,Shape_Leng,Shape_Area,TYPE,geometry
0,1,http://planning.dc.gov/,Cluster 39,"Congress Heights, Bellevue, Washington Highlands",10711.66801,4886463.0,Original,POLYGON ((-76.99401890037231 38.84519662346873...
1,2,http://planning.dc.gov/,Cluster 38,"Douglas, Shipley Terrace",8229.486324,2367958.0,Original,"POLYGON ((-76.97471813575507 38.8528706360112,..."
2,3,http://planning.dc.gov/,Cluster 36,"Woodland/Fort Stanton, Garfield Heights, Knox ...",4746.344457,1119573.0,Original,"POLYGON ((-76.9687730019474 38.86067206227963,..."
3,4,http://planning.dc.gov/,Cluster 27,"Near Southeast, Navy Yard",7286.968902,1619167.0,Original,"POLYGON ((-76.9872595922274 38.87711832849107,..."
4,5,http://planning.dc.gov/,Cluster 32,"River Terrace, Benning, Greenway, Dupont Park",11251.012821,4286254.0,Original,POLYGON ((-76.93760147029893 38.88995958845385...


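`geopandas` expects the geometry column to be named `geometry`. A layer loaded through the ArcGIS API carries it as `SHAPE` and would need the rename below; the shapefile read above already has a `geometry` column, so here the rename changes nothing.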

In [10]:
neighborhood.rename(columns={'SHAPE': 'geometry'}, inplace=True)

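### 2. Merge datasets

We now merge the two datasets with a spatial join: each request point is matched to the neighborhood cluster polygon it intersects.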

In [11]:
merged = gpd.sjoin(requests_gdf, neighborhood, how="inner", op='intersects')
merged.head()

Unnamed: 0,ADDDATE,CITY,DETAILS,INSPECTIONDATE,INSPECTIONFLAG,INSPECTORNAME,MARADDRESSREPOSITORYID,OBJECTID_left,ORGANIZATIONACRONYM,PRIORITY,...,SHAPE,geometry,index_right,OBJECTID_right,WEB_URL,NAME,NBH_NAMES,Shape_Leng,Shape_Area,TYPE
0,1515054684000,WASHINGTON,Per T. Duckett 01-09-18. closed by A. Hedgeman...,1515497000000.0,Y,,65434.0,309014,DPW,STANDARD,...,"{'x': 400949.75, 'y': 132569.4200000018}",POINT (-76.98905709 38.86094051),28,29,http://planning.dc.gov/,Cluster 37,"Sheridan, Barry Farm, Buena Vista",7600.043391,2052485.0,Original
9,1514881461000,WASHINGTON,completed per Supervisor P. Redman-Smith. Clo...,1515049000000.0,Y,,803072.0,309023,DPW,STANDARD,...,"{'x': 400200.6319999993, 'y': 132307.12110000104}",POINT (-76.99768841789999 38.8585777698),28,29,http://planning.dc.gov/,Cluster 37,"Sheridan, Barry Farm, Buena Vista",7600.043391,2052485.0,Original
184,1514996387000,WASHINGTON,We are unable to complete verification until w...,1515071000000.0,Y,,67141.0,309198,DPW,STANDARD,...,"{'x': 400741.8599999994, 'y': 132536.73999999836}",POINT (-76.99145240999999 38.86064632),28,29,http://planning.dc.gov/,Cluster 37,"Sheridan, Barry Farm, Buena Vista",7600.043391,2052485.0,Original
316,1515232328000,,Closed per T. Duckett on 1/9/2018. Closed by ...,1515491000000.0,Y,,903772.0,310604,DPW,STANDARD,...,"{'x': 400322.6000000015, 'y': 132462.3200000003}",POINT (-76.9962830905 38.8599761644),28,29,http://planning.dc.gov/,Cluster 37,"Sheridan, Barry Farm, Buena Vista",7600.043391,2052485.0,Original
361,1515244430000,WASHINGTON,Collect on 1-7-18 by the big truck,,N,,62170.0,310649,DPW,STANDARD,...,"{'x': 401239.3900000006, 'y': 132151.4200000018}",POINT (-76.98572065 38.85717463),28,29,http://planning.dc.gov/,Cluster 37,"Sheridan, Barry Farm, Buena Vista",7600.043391,2052485.0,Original
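
Conceptually, the spatial join above tests every request point against the cluster polygons and keeps the matches. The sketch below illustrates that idea in plain Python with a ray-casting point-in-polygon test on made-up squares. It is not how `geopandas` implements `sjoin` (which relies on a spatial index and GEOS predicates), and names such as `point_in_polygon`, `toy_clusters` and `toy_requests` are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies strictly inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Made-up polygons and points standing in for clusters and requests
toy_clusters = {
    'Cluster A': [(0, 0), (2, 0), (2, 2), (0, 2)],
    'Cluster B': [(2, 0), (4, 0), (4, 2), (2, 2)],
}
toy_requests = {'req-1': (1, 1), 'req-2': (3, 1), 'req-3': (5, 5)}

# Inner join: requests that fall in no polygon are dropped
joined = {
    req_id: name
    for req_id, (x, y) in toy_requests.items()
    for name, ring in toy_clusters.items()
    if point_in_polygon(x, y, ring)
}
print(joined)
```

With `how="inner"`, as in the cell above, a request that falls outside every cluster (like `req-3` here) does not appear in the result.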


### 3. Construct model that predicts service type

The variables used to build the model are:
> 1. City Quadrant
> 2. Neighborhood cluster
> 3. Organization acronym
> 4. Status Code

### 3.1 Data preprocessing

In [12]:
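quads = ['NE', 'NW', 'SE', 'SW']
def generateQuadrant(x):
    '''Extract the quadrant (last two characters) from a street address'''
    try:
        temp = x[-2:]
    except TypeError:
        #Missing addresses (None/NaN) cannot be sliced
        return 'NaN'
    #Anything other than a known quadrant is flagged with the string 'NaN'
    return temp if temp in quads else 'NaN'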
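
As a quick check of the rule above, the sketch below applies the same last-two-characters test to a few sample addresses, including a missing one. The function name `extract_quadrant` and the sample list are illustrative, and the extra `strip()` for trailing whitespace is an assumption that the notebook's function does not make.

```python
QUADS = {'NE', 'NW', 'SE', 'SW'}

def extract_quadrant(address):
    """Return the trailing quadrant of an address, or 'NaN' if absent."""
    # Missing addresses arrive as None or float NaN, neither of which is a str
    if not isinstance(address, str):
        return 'NaN'
    suffix = address.strip()[-2:]
    return suffix if suffix in QUADS else 'NaN'

samples = ['1375 MORRIS ROAD SE', '149 ADAMS STREET NW', None, '100 MAIN STREET']
quadrants = [extract_quadrant(a) for a in samples]
print(quadrants)
```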

In [13]:
merged['QUADRANT'] = merged['STREETADDRESS'].apply(generateQuadrant)
merged['QUADRANT'].head()

0       SE
9       SE
184     SE
316    NaN
361     SE
Name: QUADRANT, dtype: object

In [14]:
merged['QUADRANT'].unique()

array(['SE', 'NaN', 'NE', 'NW', 'SW'], dtype=object)

In [15]:
merged['CLUSTER'] = merged['NAME'].apply(lambda x: x[8:])
merged['CLUSTER'].head()

0      37
9      37
184    37
316    37
361    37
Name: CLUSTER, dtype: object

In [16]:
merged['CLUSTER'] = merged['CLUSTER'].astype(int)
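
Slicing with `x[8:]` assumes every name starts with the eight characters `Cluster `. A slightly more defensive sketch, assuming names of the form `Cluster <number>`, pulls the digits out with a regular expression and fails loudly on anything unexpected. `cluster_number` is a hypothetical helper, not part of the notebook.

```python
import re

def cluster_number(name):
    """Parse 'Cluster 37' into 37; raise ValueError for unexpected names."""
    match = re.fullmatch(r'Cluster\s+(\d+)', name.strip())
    if match is None:
        raise ValueError(f'Unexpected cluster name: {name!r}')
    return int(match.group(1))

numbers = [cluster_number(n) for n in ['Cluster 37', 'Cluster 2', 'Cluster 18']]
print(numbers)
```

It could be applied the same way as the lambda above, for example `merged['NAME'].apply(cluster_number)`.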

In [17]:
merged['ORGANIZATIONACRONYM'].unique()

array(['DPW', 'DDOT', 'DMV', 'DOEE', 'FEMS', 'DOH', 'OUC', 'ORM', 'DC-ICH'], dtype=object)

In [18]:
merged['STATUS_CODE'].unique()

array(['CLOSED', 'OPEN'], dtype=object)

Let's count the number of possible outcomes, i.e. the number of distinct classes in the target variable.

In [19]:
len(merged['SERVICETYPECODEDESCRIPTION'].unique())

22

### 3.2 Model building

In [20]:
#Import necessary packages
from sklearn.preprocessing import *
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [21]:
#Converting categorical (text) fields to numbers
number = LabelEncoder()
merged['SERVICETYPE_NUMBER'] = number.fit_transform(merged['SERVICETYPECODEDESCRIPTION'].astype('str'))
merged['STATUS_CODE_NUMBER'] = number.fit_transform(merged['STATUS_CODE'].astype('str'))
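
`LabelEncoder` maps each distinct string to an integer in sorted order. Because the cell above reuses one encoder for both columns, the service-type mapping it learned is replaced by the second `fit_transform`. If predictions need to be turned back into service-type names later, one option is a dedicated encoder per column. A minimal sketch on made-up values (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

service_types = ['SNOW', 'PEMA', 'SNOW', 'SWMA']  # made-up sample values

# Keep a dedicated encoder so its classes_ are not overwritten later
service_encoder = LabelEncoder()
codes = service_encoder.fit_transform(service_types)
print(list(codes))                     # integer codes, assigned in sorted class order
print(list(service_encoder.classes_))  # the class names behind each code

# Map codes (for example, model predictions) back to readable names
decoded = service_encoder.inverse_transform(codes)
print(list(decoded))
```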

In [22]:
#Extracting desired fields; chaining reset_index avoids modifying a slice of `merged` in place
data = merged[['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'QUADRANT', 'CLUSTER', 'ORGANIZATIONACRONYM', 'STATUS_CODE', 'STATUS_CODE_NUMBER']].reset_index()
data.head()

Unnamed: 0,index,SERVICETYPECODEDESCRIPTION,SERVICETYPE_NUMBER,QUADRANT,CLUSTER,ORGANIZATIONACRONYM,STATUS_CODE,STATUS_CODE_NUMBER
0,0,SNOW,13,SE,37,DPW,CLOSED,0
1,9,SNOW,13,SE,37,DPW,CLOSED,0
2,184,SNOW,13,SE,37,DPW,CLOSED,0
3,316,SNOW,13,,37,DPW,CLOSED,0
4,361,SWMA- Solid Waste Management Admistration,14,SE,37,DPW,CLOSED,0


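Let's binarize the values in the fields `QUADRANT` (5 values, including `NaN`) and `ORGANIZATIONACRONYM` (9 values).

Wondering why we are not doing the same for `CLUSTER`? The cluster numbering is treated as meaningful here: [adjacent clusters](http://opendata.dc.gov/datasets/neighborhood-clusters) tend to carry neighboring numbers, so `CLUSTER` is kept as a single numeric feature.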

In [23]:
data = pd.get_dummies(data=data, columns=['QUADRANT', 'ORGANIZATIONACRONYM'])
data.head()

Unnamed: 0,index,SERVICETYPECODEDESCRIPTION,SERVICETYPE_NUMBER,CLUSTER,STATUS_CODE,STATUS_CODE_NUMBER,QUADRANT_NE,QUADRANT_NW,QUADRANT_NaN,QUADRANT_SE,QUADRANT_SW,ORGANIZATIONACRONYM_DC-ICH,ORGANIZATIONACRONYM_DDOT,ORGANIZATIONACRONYM_DMV,ORGANIZATIONACRONYM_DOEE,ORGANIZATIONACRONYM_DOH,ORGANIZATIONACRONYM_DPW,ORGANIZATIONACRONYM_FEMS,ORGANIZATIONACRONYM_ORM,ORGANIZATIONACRONYM_OUC
0,0,SNOW,13,37,CLOSED,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
1,9,SNOW,13,37,CLOSED,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
2,184,SNOW,13,37,CLOSED,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
3,316,SNOW,13,37,CLOSED,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
4,361,SWMA- Solid Waste Management Admistration,14,37,CLOSED,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0


In [24]:
#Extracting input dataframe
model_data = data.drop(['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'STATUS_CODE'], axis=1)
model_data.head()

Unnamed: 0,index,CLUSTER,STATUS_CODE_NUMBER,QUADRANT_NE,QUADRANT_NW,QUADRANT_NaN,QUADRANT_SE,QUADRANT_SW,ORGANIZATIONACRONYM_DC-ICH,ORGANIZATIONACRONYM_DDOT,ORGANIZATIONACRONYM_DMV,ORGANIZATIONACRONYM_DOEE,ORGANIZATIONACRONYM_DOH,ORGANIZATIONACRONYM_DPW,ORGANIZATIONACRONYM_FEMS,ORGANIZATIONACRONYM_ORM,ORGANIZATIONACRONYM_OUC
0,0,37,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
1,9,37,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
2,184,37,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
3,316,37,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
4,361,37,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0


In [25]:
#Defining independent and dependent variables
y = data['SERVICETYPE_NUMBER'].values
X = model_data.values
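
One detail worth checking: `reset_index` without `drop=True` keeps the old row index as a column named `index`, and the table above shows that column still present in `model_data`, so it is passed to the model as a feature. The notebook does not say whether this is intended. If it is not, a sketch of the difference on toy data looks like this (the accuracy figures reported in this notebook were obtained with the column included):

```python
import pandas as pd

# Toy frame standing in for a slice of `merged`
toy = pd.DataFrame(
    {'CLUSTER': [37, 37, 2], 'STATUS_CODE_NUMBER': [0, 0, 1]},
    index=[0, 9, 184],
)

with_index = toy.reset_index()              # adds an 'index' column
without_index = toy.reset_index(drop=True)  # discards the old index

print(list(with_index.columns))
print(list(without_index.columns))
```

Alternatively, `'index'` could simply be added to the list of columns dropped when building `model_data`.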

In [26]:
#Split the data into training (70%) and test (30%) samples, stratified by service type
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .3, random_state=522, stratify=y)
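
The `stratify=y` argument keeps the class proportions of `y` roughly the same in the training and test samples, which matters when some service types are rare. A minimal sketch on toy labels (the names `X_toy` and `y_toy` are illustrative):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels: 70 samples of class 0 and 30 of class 1
y_toy = [0] * 70 + [1] * 30
X_toy = [[i] for i in range(len(y_toy))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=522, stratify=y_toy
)

# Both splits keep the original 70/30 class ratio
print(Counter(y_tr), Counter(y_te))
```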

In [27]:
#n_estimators = number of trees in the forest
#min_samples_leaf = minimum number of samples required to be at a leaf node for the tree
rf = RandomForestClassifier(n_estimators=1500, min_samples_leaf=20, random_state=522)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(y_pred)

[14 18 14 ..., 14 14 14]


In [28]:
print('Accuracy: ', accuracy_score(y_test, y_pred))

Accuracy:  0.705286168521
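
An accuracy figure is easier to judge next to a naive baseline, such as always predicting the most frequent class. A minimal sketch with made-up labels (not the notebook's data):

```python
from collections import Counter

# Made-up test labels in which one class dominates
y_true = [14, 14, 14, 18, 14, 13, 14, 18, 14, 14]

# Baseline: always predict the most common class
majority_class, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(majority_class, baseline_accuracy)
```

In practice the majority class would be taken from `y_train` and scored against `y_test`, so the baseline sees the same split as the model.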


### 3.3 Alternate model, excluding the department codes

In [29]:
#Rebuild the working dataframe from the merged data
data = merged[['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'QUADRANT', 'CLUSTER', 'ORGANIZATIONACRONYM', 'STATUS_CODE', 'STATUS_CODE_NUMBER']].reset_index()
data.head()

Unnamed: 0,index,SERVICETYPECODEDESCRIPTION,SERVICETYPE_NUMBER,QUADRANT,CLUSTER,ORGANIZATIONACRONYM,STATUS_CODE,STATUS_CODE_NUMBER
0,0,SNOW,13,SE,37,DPW,CLOSED,0
1,9,SNOW,13,SE,37,DPW,CLOSED,0
2,184,SNOW,13,SE,37,DPW,CLOSED,0
3,316,SNOW,13,,37,DPW,CLOSED,0
4,361,SWMA- Solid Waste Management Admistration,14,SE,37,DPW,CLOSED,0


In [30]:
data_test = pd.get_dummies(data=data,columns=['QUADRANT'])
data_test.head()

Unnamed: 0,index,SERVICETYPECODEDESCRIPTION,SERVICETYPE_NUMBER,CLUSTER,ORGANIZATIONACRONYM,STATUS_CODE,STATUS_CODE_NUMBER,QUADRANT_NE,QUADRANT_NW,QUADRANT_NaN,QUADRANT_SE,QUADRANT_SW
0,0,SNOW,13,37,DPW,CLOSED,0,0,0,0,1,0
1,9,SNOW,13,37,DPW,CLOSED,0,0,0,0,1,0
2,184,SNOW,13,37,DPW,CLOSED,0,0,0,0,1,0
3,316,SNOW,13,37,DPW,CLOSED,0,0,0,1,0,0
4,361,SWMA- Solid Waste Management Admistration,14,37,DPW,CLOSED,0,0,0,0,1,0


In [31]:
model_test_data = data_test.drop(['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'STATUS_CODE', 'ORGANIZATIONACRONYM'], axis=1)
model_test_data.head()

Unnamed: 0,index,CLUSTER,STATUS_CODE_NUMBER,QUADRANT_NE,QUADRANT_NW,QUADRANT_NaN,QUADRANT_SE,QUADRANT_SW
0,0,37,0,0,0,0,1,0
1,9,37,0,0,0,0,1,0
2,184,37,0,0,0,0,1,0
3,316,37,0,0,0,1,0,0
4,361,37,0,0,0,0,1,0


In [32]:
y = data['SERVICETYPE_NUMBER'].values
X = model_test_data.values

In [33]:
#Split the data into training (70%) and test (30%) samples, stratified by service type
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .3, random_state=522, stratify=y)

In [34]:
#n_estimators = number of trees in the forest
#min_samples_leaf = minimum number of samples required to be at a leaf node for the tree
rf = RandomForestClassifier(n_estimators=1500, min_samples_leaf=20, random_state=522)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(y_pred)

[14 14 18 ..., 14 14 14]


In [35]:
print('Accuracy: ', accuracy_score(y_test, y_pred))

Accuracy:  0.52146263911


A drop in accuracy from __70.53%__ to __52.15%__ shows how much of the model's predictive power comes from the organization acronym, and how much the choice of predictors matters.
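
Another way to look at which predictors a forest relies on is its `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names, so the numbers say nothing about the DC dataset; it only shows the mechanics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(522)

# Synthetic data: the label depends only on the first feature
informative = rng.randint(0, 3, size=300)
noise = rng.randint(0, 3, size=300)
X_toy = np.column_stack([informative, noise])
y_toy = informative

feature_names = ['informative', 'noise']  # hypothetical names
forest = RandomForestClassifier(n_estimators=50, random_state=522)
forest.fit(X_toy, y_toy)

# Importances are normalized to sum to 1; larger values mean larger impurity reductions
importances = dict(zip(feature_names, forest.feature_importances_))
print(importances)
```

For the models above, the same attribute on `rf` could be paired with `model_data.columns` to label each value.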

### 4. How does my neighborhood fare?

In [36]:
#Count of service requests per cluster
cluster_count = merged.groupby('NAME').size().reset_index(name='counts')
cluster_count.head()

Unnamed: 0,NAME,counts
0,Cluster 1,349
1,Cluster 10,353
2,Cluster 11,506
3,Cluster 12,140
4,Cluster 13,369


In [37]:
#merge with original file
neighborhood = pd.merge(neighborhood, cluster_count, on='NAME')
neighborhood.head()

Unnamed: 0,OBJECTID,WEB_URL,NAME,NBH_NAMES,Shape_Leng,Shape_Area,TYPE,geometry,counts
0,1,http://planning.dc.gov/,Cluster 39,"Congress Heights, Bellevue, Washington Highlands",10711.66801,4886463.0,Original,POLYGON ((-76.99401890037231 38.84519662346873...,461
1,2,http://planning.dc.gov/,Cluster 38,"Douglas, Shipley Terrace",8229.486324,2367958.0,Original,"POLYGON ((-76.97471813575507 38.8528706360112,...",144
2,3,http://planning.dc.gov/,Cluster 36,"Woodland/Fort Stanton, Garfield Heights, Knox ...",4746.344457,1119573.0,Original,"POLYGON ((-76.9687730019474 38.86067206227963,...",87
3,4,http://planning.dc.gov/,Cluster 27,"Near Southeast, Navy Yard",7286.968902,1619167.0,Original,"POLYGON ((-76.9872595922274 38.87711832849107,...",84
4,5,http://planning.dc.gov/,Cluster 32,"River Terrace, Benning, Greenway, Dupont Park",11251.012821,4286254.0,Original,POLYGON ((-76.93760147029893 38.88995958845385...,241


In [38]:
#Ten clusters with the most service requests
temp = neighborhood.sort_values('counts', ascending=False)
temp[['NAME', 'NBH_NAMES', 'counts']].head(10)

Unnamed: 0,NAME,NBH_NAMES,counts
33,Cluster 2,"Columbia Heights, Mt. Pleasant, Pleasant Plain...",1192
20,Cluster 18,"Brightwood Park, Crestwood, Petworth",1141
30,Cluster 25,"Union Station, Stanton Park, Kingman Park",1018
13,Cluster 6,"Dupont Circle, Connecticut Avenue/K Street",879
38,Cluster 26,"Capitol Hill, Lincoln Park",759
5,Cluster 8,"Downtown, Chinatown, Penn Quarters, Mount Vern...",757
32,Cluster 21,"Edgewood, Bloomingdale, Truxton Circle, Eckington",744
23,Cluster 17,"Takoma, Brightwood, Manor Park",543
8,Cluster 31,"Deanwood, Burrville, Grant Park, Lincoln Heigh...",537
29,Cluster 34,"Twining, Fairlawn, Randle Highlands, Penn Bran...",514


### 5. Next Steps

> 1. Use __historical data__ to train a predictive model that includes features such as the __month of request__ and the __duration of service__
> 2. __Binarize Neighborhood Clusters__ and check whether accuracy improves
> 3. Test for __spatial/temporal autocorrelation within each neighborhood cluster__
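
As a starting point for the first item, the month of a request can be derived from `ADDDATE`. The sketch below assumes the field holds epoch milliseconds in UTC, which is how the values in the requests table appear to be stored; `request_month` is a hypothetical helper.

```python
from datetime import datetime, timezone

def request_month(epoch_ms):
    """Convert an epoch-milliseconds timestamp to its UTC calendar month (1-12)."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).month

# First ADDDATE value from the requests table shown earlier
added = datetime.fromtimestamp(1515054684000 / 1000, tz=timezone.utc)
print(added.isoformat(), request_month(1515054684000))
```

On the full dataframe the same conversion could be done in one step with `pd.to_datetime(requests['ADDDATE'], unit='ms').dt.month`.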