# Problem Statement

The people of New York use the 311 system to report non-emergency problems to local authorities. These complaints are assigned to various agencies in New York. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, their large volume and sudden increase are reducing the overall efficiency of the agency's operations.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions. The answers to those questions must be supported by data and analytics. 
## These are their questions:

### 1. Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?
### 2. Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or streets (where the complaints are severe) for the specific type of complaint identified in response to Question 1?
### 3. Does the complaint type identified in response to Question 1 have an obvious relationship with any particular characteristic or characteristics of the houses or buildings?
### 4. Can a predictive model be built to predict the possibility of future complaints of the type identified in response to Question 1?

    
Your organization has assigned you as the lead data scientist to provide the answers to these questions. You need to work on getting answers to them in this Capstone Project by following the standard approach of data science and machine learning.

# Importing the necessary Python libraries

In [1]:
import pandas as pd
import numpy as np
import os
import types  # used by types.MethodType in the PLUTO loading cells below
#from sodapy import Socrata

## Importing PLUTO data for Brooklyn, Bronx, Manhattan, Queens, and Staten Island

### Brooklyn PLUTO data

In [35]:
# The code was removed by Watson Studio for sharing.



Unnamed: 0,Borough,Block,Lot,CD,CT2010,CB2010,SchoolDist,Council,ZipCode,FireComp,...,ZMCode,Sanborn,TaxMap,EDesigNum,APPBBL,APPDate,PLUTOMapID,FIRM07_FLAG,PFIRM15_FLAG,Version
0,BK,1,1,302,21.0,,13.0,33.0,11201.0,L118,...,,302 007,30101.0,,3000010000.0,11/26/2013,1,1.0,1.0,18V1
1,BK,1,50,302,21.0,2000.0,13.0,33.0,11201.0,L118,...,,302 007,30101.0,E-231,0.0,,1,1.0,1.0,18V1
2,BK,1,7501,302,21.0,2000.0,13.0,33.0,11201.0,L118,...,,302 007,30101.0,,3000010000.0,03/04/2016,1,1.0,1.0,18V1
3,BK,3,1,302,21.0,3002.0,13.0,33.0,11201.0,L118,...,,302 007,30101.0,,0.0,,1,1.0,1.0,18V1
4,BK,3,5,302,21.0,,13.0,33.0,11201.0,L118,...,,302 007,30101.0,,0.0,,4,1.0,1.0,18V1


### Bronx PLUTO data

In [36]:
body = client_1e57117b9f404ba098ce646504561a71.get_object(Bucket='ibmdatasciencecapstoneproject-donotdelete-pr-ldxtiwxezyt0jo',Key='BX_18v1.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_pl_BX = pd.read_csv(body)
df_pl_BX.head()



Unnamed: 0,Borough,Block,Lot,CD,CT2010,CB2010,SchoolDist,Council,ZipCode,FireComp,...,ZMCode,Sanborn,TaxMap,EDesigNum,APPBBL,APPDate,PLUTOMapID,FIRM07_FLAG,PFIRM15_FLAG,Version
0,BX,2260,1,201,19.0,1022.0,7.0,8.0,10454.0,L029,...,,209S016,20901.0,E-143,0.0,,1,,,18V1
1,BX,2260,4,201,19.0,1022.0,7.0,8.0,10454.0,L029,...,,209S016,20901.0,E-143,0.0,,1,,,18V1
2,BX,2260,10,201,19.0,1022.0,7.0,8.0,10454.0,L029,...,,209S016,20901.0,E-143,0.0,,1,,,18V1
3,BX,2260,17,201,19.0,1022.0,7.0,8.0,10454.0,L029,...,,209S016,20901.0,E-143,0.0,,1,,,18V1
4,BX,2260,18,201,19.0,1022.0,7.0,8.0,10454.0,L029,...,,209S016,20901.0,E-143,0.0,,1,,,18V1


### Manhattan PLUTO data

In [37]:
body = client_1e57117b9f404ba098ce646504561a71.get_object(Bucket='ibmdatasciencecapstoneproject-donotdelete-pr-ldxtiwxezyt0jo',Key='MN_18v1.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_pl_MN = pd.read_csv(body)
df_pl_MN.head()



Unnamed: 0,Borough,Block,Lot,CD,CT2010,CB2010,SchoolDist,Council,ZipCode,FireComp,...,ZMCode,Sanborn,TaxMap,EDesigNum,APPBBL,APPDate,PLUTOMapID,FIRM07_Flag,PFIRM15_Flag,Version
0,MN,1,10,101,5.0,1011.0,2.0,1.0,10004.0,E007,...,Y,199 999,10101.0,,0.0,,1,1.0,1.0,18V1
1,MN,1,101,101,1.0,1001.0,2.0,1.0,10004.0,E007,...,Y,199 999,10101.0,,0.0,,1,,1.0,18V1
2,MN,1,201,101,1.0,1000.0,2.0,1.0,10004.0,E007,...,,199 999,10101.0,,0.0,,1,,1.0,18V1
3,MN,1,301,101,,,2.0,1.0,10004.0,E007,...,,199 999,10101.0,,0.0,,4,1.0,1.0,18V1
4,MN,1,401,101,,,2.0,1.0,10004.0,E007,...,,1 99 999,10101.0,,0.0,,4,1.0,1.0,18V1


### Queens PLUTO data

In [38]:
body = client_1e57117b9f404ba098ce646504561a71.get_object(Bucket='ibmdatasciencecapstoneproject-donotdelete-pr-ldxtiwxezyt0jo',Key='QN_18v1.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_pl_QN = pd.read_csv(body)
df_pl_QN.head()



Unnamed: 0,Borough,Block,Lot,CD,CT2010,CB2010,SchoolDist,Council,ZipCode,FireComp,...,ZMCode,Sanborn,TaxMap,EDesigNum,APPBBL,APPDate,PLUTOMapID,FIRM07_FLAG,PFIRM15_FLAG,Version
0,QN,6,1,402,1.0,,30.0,26.0,11101.0,L115,...,Y,401 011,40101.0,,4000060000.0,09/20/2013,1,1.0,1.0,18V1
1,QN,6,3,402,1.0,1015.0,30.0,26.0,11101.0,L115,...,,401 011,40101.0,,0.0,,1,1.0,1.0,18V1
2,QN,6,8,402,1.0,1011.0,30.0,26.0,11101.0,L115,...,,401 011,40101.0,,4000060000.0,08/07/2013,1,1.0,1.0,18V1
3,QN,6,20,402,1.0,,30.0,26.0,11101.0,L115,...,,401 011,40101.0,,4000060000.0,09/20/2013,1,1.0,1.0,18V1
4,QN,6,30,402,1.0,,30.0,26.0,11101.0,L115,...,,401 011,40101.0,,4000060000.0,09/08/2017,1,1.0,1.0,18V1


### Staten Island PLUTO data

In [6]:
body = client_1e57117b9f404ba098ce646504561a71.get_object(Bucket='ibmdatasciencecapstoneproject-donotdelete-pr-ldxtiwxezyt0jo',Key='SI_18v1.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_5 = pd.read_csv(body)
df_data_5.head()



Unnamed: 0,Borough,Block,Lot,CD,CT2010,CB2010,SchoolDist,Council,ZipCode,FireComp,...,ZMCode,Sanborn,TaxMap,EDesigNum,APPBBL,APPDate,PLUTOMapID,FIRM07_FLAG,PFIRM15_FLAG,Version
0,SI,1,10,501,3.0,2000.0,31.0,49.0,10301.0,L078,...,,501 017,50101.0,,0.0,,1,,,18V1
1,SI,1,17,501,3.0,,31.0,49.0,10301.0,L078,...,,501 017,50101.0,,0.0,,1,,,18V1
2,SI,1,18,501,3.0,,31.0,49.0,10301.0,L078,...,,501 017,50101.0,,0.0,,1,,,18V1
3,SI,1,19,501,3.0,,31.0,49.0,10301.0,L078,...,,501 017,50101.0,,0.0,,1,,,18V1
4,SI,1,21,501,3.0,,31.0,49.0,10301.0,L078,...,,501 017,50101.0,,0.0,,1,,,18V1
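
The five borough files appear to share one layout, so a natural next step is to stack them into a single DataFrame. The sketch below shows one way to do that; it is not necessarily what this notebook does later. It uses toy one-row stand-ins instead of the real files, and the names `combine_pluto` and `FLAG_RENAMES` are hypothetical. The rename mapping is there because the Manhattan header above spells `FIRM07_Flag` and `PFIRM15_Flag` in mixed case while the other boroughs use `FIRM07_FLAG` and `PFIRM15_FLAG`; without it, `pd.concat` would keep them as separate columns.

```python
import pandas as pd

# Hypothetical mapping that aligns the Manhattan flag columns with the other boroughs.
FLAG_RENAMES = {"FIRM07_Flag": "FIRM07_FLAG", "PFIRM15_Flag": "PFIRM15_FLAG"}


def combine_pluto(frames):
    """Align the differently cased flag columns, then stack the borough frames."""
    aligned = [frame.rename(columns=FLAG_RENAMES) for frame in frames]
    return pd.concat(aligned, ignore_index=True)


# Toy stand-ins with only three columns each (values taken from the previews above).
df_qn_toy = pd.DataFrame({"Borough": ["QN"], "ZipCode": [11101.0], "FIRM07_FLAG": [1.0]})
df_mn_toy = pd.DataFrame({"Borough": ["MN"], "ZipCode": [10004.0], "FIRM07_Flag": [1.0]})

df_pluto_toy = combine_pluto([df_qn_toy, df_mn_toy])
print(df_pluto_toy)
```

With the real data, the same helper could be called with the five borough DataFrames loaded above.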


The 311 data is read directly from a hosted CSV file, so a Socrata API client is not needed here.

### NYC 311 complaint data for the Department of Housing Preservation and Development

In [7]:
#!wget -O NYC_HPD_data.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0702EN/datasets/311_Service_Requests_from_2010_to_Present_min.csv
df_hpd = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0702EN/datasets/311_Service_Requests_from_2010_to_Present_min.csv')
df_hpd.head()

Unnamed: 0.1,Unnamed: 0,Unique Key,Created Date,Closed Date,Complaint Type,Location Type,Incident Zip,Incident Address,Street Name,Address Type,City,Status,Resolution Description,Borough,Latitude,Longitude
0,0,45531130,02/02/2020 06:09:17 AM,,HEAT/HOT WATER,RESIDENTIAL BUILDING,10019.0,426 WEST 52 STREET,WEST 52 STREET,ADDRESS,NEW YORK,Open,The following complaint conditions are still o...,MANHATTAN,40.765132,-73.988993
1,1,45529784,02/02/2020 02:15:24 PM,,UNSANITARY CONDITION,RESIDENTIAL BUILDING,11204.0,1751 67 STREET,67 STREET,ADDRESS,BROOKLYN,Open,The following complaint conditions are still o...,BROOKLYN,40.618484,-73.992673
2,2,45527528,02/02/2020 02:27:41 AM,,HEAT/HOT WATER,RESIDENTIAL BUILDING,11372.0,87-15 37 AVENUE,37 AVENUE,ADDRESS,Jackson Heights,Open,The following complaint conditions are still o...,QUEENS,40.750269,-73.879432
3,3,45530329,02/02/2020 12:13:18 PM,,HEAT/HOT WATER,RESIDENTIAL BUILDING,10458.0,2405 SOUTHERN BOULEVARD,SOUTHERN BOULEVARD,ADDRESS,BRONX,Open,The following complaint conditions are still o...,BRONX,40.853773,-73.881558
4,4,45528814,02/02/2020 01:59:44 PM,,APPLIANCE,RESIDENTIAL BUILDING,11209.0,223 78 STREET,78 STREET,ADDRESS,BROOKLYN,Open,The following complaint conditions are still o...,BROOKLYN,40.629745,-74.030533
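
The problem statement says complaint volume has grown in recent years. One way to check that against this table would be to parse `Created Date` and count complaints per year. The sketch below is only an illustration on a toy frame: the first two timestamps come from the preview above, the third row is made up, and the format string is an assumption based on how the dates are displayed.

```python
import pandas as pd

# Toy frame; the last row is hypothetical and only there to give a second year.
sample = pd.DataFrame({
    "Unique Key": [45531130, 45529784, 1],
    "Created Date": [
        "02/02/2020 06:09:17 AM",
        "02/02/2020 02:15:24 PM",
        "11/30/2019 10:00:00 PM",
    ],
})

# Assumed format: month/day/year with a 12-hour clock and AM/PM marker.
created = pd.to_datetime(sample["Created Date"], format="%m/%d/%Y %I:%M:%S %p")

# Number of complaints per calendar year, in chronological order.
complaints_per_year = created.dt.year.value_counts().sort_index()
print(complaints_per_year)
```

Passing an explicit `format` avoids per-row format inference, which matters on a table with millions of rows.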


## Let's find out what the major complaints are

In [47]:
# Count the number of complaints per complaint type
df_complain = df_hpd.groupby('Complaint Type')[['Unique Key']].count()
#df_complain.sort_values('Unique Key')
#df_hpd['Unique Key']
df_complain.shape

(30, 1)

We don't need all of this. We will select only the calls made to the Department of Housing Preservation and Development. We will also need the cases that haven't been closed yet.
We are going to build a query and embed it in the API request.
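
As a rough sketch of the "not closed yet" filter, the same selection can also be done in pandas once the data is loaded. The example below uses a toy frame with the `Status` and `Closed Date` columns shown in the preview above. The `"Closed"` status value and the toy rows are assumptions for illustration, and the agency filter is left out because no agency column appears in the preview.

```python
import pandas as pd

# Toy rows (hypothetical); only "Open" is confirmed as a Status value by the preview.
sample = pd.DataFrame({
    "Unique Key": [101, 102, 103],
    "Complaint Type": ["HEAT/HOT WATER", "PLUMBING", "HEAT/HOT WATER"],
    "Closed Date": [None, "01/15/2020 09:00:00 AM", None],
    "Status": ["Open", "Closed", "Open"],
})

# Keep cases that are marked open and have no closing date.
is_open = (sample["Status"] == "Open") & sample["Closed Date"].isna()
open_cases = sample[is_open]

# Count the open cases per complaint type, mirroring the groupby used in this notebook.
open_counts = open_cases.groupby("Complaint Type")[["Unique Key"]].count()
print(open_counts)
```

Checking both columns guards against rows whose status and closing date disagree; either condition alone may be enough if the data is consistent.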

In [49]:
df_complain.sort_values('Unique Key',ascending=False).head()

Complaint Type,Unique Key
HEAT/HOT WATER,1261574
HEATING,887850
PLUMBING,711130
GENERAL CONSTRUCTION,500863
UNSANITARY CONDITION,451643


# The major complaint is about Heat/Hot Water

### The other major complaints are Heating, Plumbing, and General Construction

Are these somehow related? Hot water and plumbing are related, after all. Is construction work in an area causing these complaints?
To answer these questions, <b>let's look at the complaints by area.</b>
  
 
# Which area has the most complaints?

In [51]:
df_hpd.groupby('Incident Zip')[['Unique Key']].count().sort_values('Unique Key',ascending=False).head(10)

Incident Zip,Unique Key
11226.0,215709
10467.0,173911
10458.0,169485
10453.0,162532
10468.0,148213
10457.0,146199
10452.0,146016
10456.0,132748
10031.0,123853
11225.0,120913


## ZIP code 11226 appears to have the most complaints. It is Flatbush, Brooklyn, NY.

### Another noticeable pattern is that 10467, 10458, 10453, 10468, 10457, 10452, and 10456 are neighboring areas with very high numbers of complaints. This is the west Bronx area, near Yankee Stadium.
### Do these share a common cause?

### Let's look at the ZIP code with the most complaints first

In [50]:
df_hpd[df_hpd['Incident Zip']==11226.0].groupby('Complaint Type')[['Unique Key']].count().sort_values('Unique Key',ascending=False)

Complaint Type,Unique Key
HEAT/HOT WATER,41786
PLUMBING,28534
HEATING,27255
GENERAL CONSTRUCTION,19994
UNSANITARY CONDITION,16627
PAINT - PLASTER,15531
PAINT/PLASTER,14566
ELECTRIC,11060
NONCONST,9227
WATER LEAK,8560
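
The table above lists `HEAT/HOT WATER` and `HEATING` separately, and likewise `PAINT - PLASTER` and `PAINT/PLASTER`. If each pair is really one category recorded under two labels (an assumption; nothing in this notebook confirms it), the counts could be merged before ranking. The sketch below does this for a few of the ZIP 11226 counts shown above; `LABEL_MAP` is a hypothetical mapping.

```python
import pandas as pd

# Hypothetical mapping from the assumed older labels to the assumed current ones.
LABEL_MAP = {"HEATING": "HEAT/HOT WATER", "PAINT - PLASTER": "PAINT/PLASTER"}

# A subset of the per-type counts for ZIP 11226 from the table above.
counts = pd.Series(
    {
        "HEAT/HOT WATER": 41786,
        "PLUMBING": 28534,
        "HEATING": 27255,
        "PAINT - PLASTER": 15531,
        "PAINT/PLASTER": 14566,
    },
    name="Unique Key",
)

# Relabel the index, then add up rows that now share a label.
merged = (
    counts.rename(index=LABEL_MAP)
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)
print(merged)
```

On the full DataFrame, the same idea could be applied with `df_hpd['Complaint Type'].replace(LABEL_MAP)` before grouping.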


In [48]:
df_hpd.groupby('Borough')[['Unique Key']].count()

Borough,Unique Key
BRONX,1617956
BROOKLYN,1739886
MANHATTAN,1055225
QUEENS,645971
STATEN ISLAND,87584
Unspecified,873221


In [46]:
df_hpd.shape

(6019843, 16)