## I need to answer four questions. Each answer must be supported by data and analytics. The questions are:

### 1. Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?
### 2. Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or streets (where the complaints are severe) for the specific type of complaint identified in response to Question 1?
### 3. Does the complaint type identified in response to Question 1 have an obvious relationship with any particular characteristic or characteristics of the houses or buildings?
### 4. Can a predictive model be built to forecast the likelihood of future complaints of the type identified in response to Question 1?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df= pd.read_csv('/kaggle/input/nyc-311-hpd-calls/311_Service_Requests_from_2010_to_Present.csv')
print(df.shape)
df.head()



(6087779, 41)


Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,45980284,04/10/2020 09:10:10 AM,,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10030.0,138 WEST 137 STREET,...,,,,,,,,40.8159,-73.941112,"(40.81590010208267, -73.9411124524788)"
1,45978285,04/10/2020 12:13:02 PM,,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,11235.0,3105 BRIGHTON 3 STREET,...,,,,,,,,40.576327,-73.964056,"(40.576327154021826, -73.9640562531078)"
2,45978263,04/10/2020 07:37:51 AM,04/10/2020 05:56:21 PM,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10462.0,2040 BRONXDALE AVENUE,...,,,,,,,,40.850795,-73.866537,"(40.850794587937656, -73.86653703997725)"
3,45976274,04/10/2020 01:53:42 PM,,HPD,Department of Housing Preservation and Develop...,UNSANITARY CONDITION,PESTS,RESIDENTIAL BUILDING,10462.0,1435 DORIS STREET,...,,,,,,,,40.835502,-73.849272,"(40.835501747393344, -73.84927206222912)"
4,45980927,04/10/2020 11:20:50 AM,,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10458.0,2410 WASHINGTON AVENUE,...,,,,,,,,40.858053,-73.890913,"(40.858052508996636, -73.8909130551251)"


In [3]:
df.columns

Index(['Unique Key', 'Created Date', 'Closed Date', 'Agency', 'Agency Name',
       'Complaint Type', 'Descriptor', 'Location Type', 'Incident Zip',
       'Incident Address', 'Street Name', 'Cross Street 1', 'Cross Street 2',
       'Intersection Street 1', 'Intersection Street 2', 'Address Type',
       'City', 'Landmark', 'Facility Type', 'Status', 'Due Date',
       'Resolution Description', 'Resolution Action Updated Date',
       'Community Board', 'BBL', 'Borough', 'X Coordinate (State Plane)',
       'Y Coordinate (State Plane)', 'Open Data Channel Type',
       'Park Facility Name', 'Park Borough', 'Vehicle Type',
       'Taxi Company Borough', 'Taxi Pick Up Location', 'Bridge Highway Name',
       'Bridge Highway Direction', 'Road Ramp', 'Bridge Highway Segment',
       'Latitude', 'Longitude', 'Location'],
      dtype='object')

### Let's get rid of all the unnecessary fields

In [4]:
df=df[['Unique Key', 'Created Date', 'Closed Date',
       'Complaint Type', 'Descriptor', 'Location Type', 'Incident Zip',
       'Incident Address', 'Street Name','Address Type',
       'City', 'Status', 'Due Date',
       'Resolution Description','Borough',
       'Latitude', 'Longitude']]
df.head()

Unnamed: 0,Unique Key,Created Date,Closed Date,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Address Type,City,Status,Due Date,Resolution Description,Borough,Latitude,Longitude
0,45980284,04/10/2020 09:10:10 AM,,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10030.0,138 WEST 137 STREET,WEST 137 STREET,ADDRESS,NEW YORK,Open,,The following complaint conditions are still o...,MANHATTAN,40.8159,-73.941112
1,45978285,04/10/2020 12:13:02 PM,,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,11235.0,3105 BRIGHTON 3 STREET,BRIGHTON 3 STREET,ADDRESS,BROOKLYN,Open,,The following complaint conditions are still o...,BROOKLYN,40.576327,-73.964056
2,45978263,04/10/2020 07:37:51 AM,04/10/2020 05:56:21 PM,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10462.0,2040 BRONXDALE AVENUE,BRONXDALE AVENUE,ADDRESS,BRONX,Closed,,The Department of Housing Preservation and Dev...,BRONX,40.850795,-73.866537
3,45976274,04/10/2020 01:53:42 PM,,UNSANITARY CONDITION,PESTS,RESIDENTIAL BUILDING,10462.0,1435 DORIS STREET,DORIS STREET,ADDRESS,BRONX,Open,,The following complaint conditions are still o...,BRONX,40.835502,-73.849272
4,45980927,04/10/2020 11:20:50 AM,,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10458.0,2410 WASHINGTON AVENUE,WASHINGTON AVENUE,ADDRESS,BRONX,Open,,The following complaint conditions are still o...,BRONX,40.858053,-73.890913
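The column selection above can also be pushed into the read itself, so the unused columns are never loaded. The following is a minimal sketch, not the notebook's actual loading code: it reads a tiny in-memory stand-in (made-up rows) instead of the real CSV, and the choice of `category` dtype and the date format string are assumptions based on the values shown in the preview.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (illustrative rows only, not real data).
sample_csv = io.StringIO(
    "Unique Key,Created Date,Complaint Type,Incident Zip,Agency\n"
    "1,04/10/2020 09:10:10 AM,HEAT/HOT WATER,10030,HPD\n"
    "2,04/10/2020 12:13:02 PM,PLUMBING,,HPD\n"
)

# Columns to keep; 'Agency' is deliberately left out.
keep = ['Unique Key', 'Created Date', 'Complaint Type', 'Incident Zip']

small = pd.read_csv(
    sample_csv,
    usecols=keep,                          # skip columns that would be dropped anyway
    dtype={'Complaint Type': 'category'},  # repeated labels are stored once
)

# Parse timestamps with an explicit format instead of letting pandas guess.
small['Created Date'] = pd.to_datetime(
    small['Created Date'], format='%m/%d/%Y %I:%M:%S %p'
)
print(small.dtypes)
```

On the full file the same `usecols` list would be passed together with the real path; whether this is worth doing depends on available memory.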


In [5]:
df[['Address Type']].describe()

Unnamed: 0,Address Type
count,6003012
unique,1
top,ADDRESS
freq,6003012


### The field "Address Type" has only one value, so it carries no useful information. We will drop it.

In [6]:
df=df.drop(columns=['Address Type'])

In [7]:
df_comp=df.groupby('Complaint Type')[["Unique Key"]].count()
df_comp.sort_values('Unique Key',ascending=False).head()

Complaint Type,Unique Key
HEAT/HOT WATER,1300488
HEATING,887869
PLUMBING,716042
GENERAL CONSTRUCTION,500863
UNSANITARY CONDITION,458663


In [8]:
df_comp=df_comp[df_comp['Unique Key']>80000].sort_values(by='Unique Key',ascending=False)
df_comp.columns=['No of complaints']
df_comp.T

Complaint Type,HEAT/HOT WATER,HEATING,PLUMBING,GENERAL CONSTRUCTION,UNSANITARY CONDITION,PAINT - PLASTER,PAINT/PLASTER,ELECTRIC,NONCONST,DOOR/WINDOW,WATER LEAK,GENERAL,FLOORING/STAIRS,APPLIANCE
No of complaints,1300488,887869,716042,500863,458663,361257,349651,309214,260890,208121,196048,153816,138862,114414


# ***HEAT/HOT WATER has the highest number of complaints!***
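The table above lists both HEAT/HOT WATER and HEATING, and both PAINT - PLASTER and PAINT/PLASTER. If each pair is really one category recorded under two labels (an assumption, not something the data shown here confirms), the counts can be combined before ranking. A small sketch using the counts copied from the table; the `aliases` mapping is hypothetical:

```python
import pandas as pd

# Counts copied from the complaint-type table above.
counts = pd.Series({
    'HEAT/HOT WATER': 1300488,
    'HEATING': 887869,
    'PLUMBING': 716042,
    'GENERAL CONSTRUCTION': 500863,
    'UNSANITARY CONDITION': 458663,
    'PAINT - PLASTER': 361257,
    'PAINT/PLASTER': 349651,
})

# ASSUMPTION: each pair below is the same kind of complaint under two labels.
aliases = {'HEATING': 'HEAT/HOT WATER', 'PAINT - PLASTER': 'PAINT/PLASTER'}

# Relabel, then add up the rows that now share a label.
merged = counts.rename(index=aliases).groupby(level=0).sum()
merged = merged.sort_values(ascending=False)
print(merged)
```

Under that assumption the heating-related total is 2,188,357, which would only strengthen the conclusion. On the full frame the same mapping could be applied with `df['Complaint Type'].replace(aliases)`.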

### We are only going to focus on the most frequently occurring problems (complaint types with more than 80,000 complaints)

In [9]:
print(df.shape)
df=df[df['Complaint Type'].isin(df_comp.index)]
print(df.shape)
df.head()

(6087779, 16)
(5956198, 16)


Unnamed: 0,Unique Key,Created Date,Closed Date,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,City,Status,Due Date,Resolution Description,Borough,Latitude,Longitude
0,45980284,04/10/2020 09:10:10 AM,,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10030.0,138 WEST 137 STREET,WEST 137 STREET,NEW YORK,Open,,The following complaint conditions are still o...,MANHATTAN,40.8159,-73.941112
1,45978285,04/10/2020 12:13:02 PM,,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,11235.0,3105 BRIGHTON 3 STREET,BRIGHTON 3 STREET,BROOKLYN,Open,,The following complaint conditions are still o...,BROOKLYN,40.576327,-73.964056
2,45978263,04/10/2020 07:37:51 AM,04/10/2020 05:56:21 PM,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10462.0,2040 BRONXDALE AVENUE,BRONXDALE AVENUE,BRONX,Closed,,The Department of Housing Preservation and Dev...,BRONX,40.850795,-73.866537
3,45976274,04/10/2020 01:53:42 PM,,UNSANITARY CONDITION,PESTS,RESIDENTIAL BUILDING,10462.0,1435 DORIS STREET,DORIS STREET,BRONX,Open,,The following complaint conditions are still o...,BRONX,40.835502,-73.849272
4,45980927,04/10/2020 11:20:50 AM,,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10458.0,2410 WASHINGTON AVENUE,WASHINGTON AVENUE,BRONX,Open,,The following complaint conditions are still o...,BRONX,40.858053,-73.890913


### Which area had the largest number of complaints?

In [10]:
df_bor= df.groupby('Borough')[['Unique Key']].count().sort_values('Unique Key',ascending=False)
df_bor

Borough,Unique Key
BROOKLYN,1731517
BRONX,1621995
MANHATTAN,1055419
Unspecified,818832
QUEENS,641707
STATEN ISLAND,86728


## Brooklyn has the highest number of complaints, with the Bronx close behind. There are also 818,832 entries with an unspecified borough. We will infer their borough from their ZIP codes.

In [11]:
df_pluto = pd.read_csv('/kaggle/input/nyc-pluto/pluto_20v2.csv')
df_pluto.head()



Unnamed: 0,borough,block,lot,cd,ct2010,cb2010,schooldist,council,zipcode,firecomp,...,appbbl,appdate,plutomapid,version,sanitdistrict,healthcenterdistrict,firm07_flag,pfirm15_flag,dcpedited,notes
0,QN,3331,57,409.0,136.0,2001.0,28.0,29.0,11415.0,E298,...,,,1,20v2,9.0,45.0,,,,
1,MN,1140,27,107.0,153.0,2000.0,3.0,6.0,10023.0,L035,...,,,1,20v2,7.0,15.0,,,,
2,QN,10192,43,412.0,264.0,1004.0,28.0,27.0,11433.0,E275,...,,,1,20v2,12.0,44.0,,,,
3,BK,1549,4,303.0,301.0,3000.0,16.0,41.0,11233.0,L176,...,,,1,20v2,3.0,32.0,,,t,
4,QN,15706,61,414.0,1008.02,1010.0,27.0,31.0,11691.0,E328,...,,,1,20v2,14.0,45.0,,,,


In [12]:
df_zip=df.groupby('Incident Zip')[['Borough']].agg(lambda x:x.value_counts().index[0])
df_zip.head()

Incident Zip,Borough
10001.0,MANHATTAN
10002.0,MANHATTAN
10003.0,MANHATTAN
10004.0,MANHATTAN
10005.0,MANHATTAN


In [13]:
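# Fill 'Unspecified' boroughs with the most common borough for the row's ZIP code
unspecified=df[df['Borough']=='Unspecified']
for i,j in zip(unspecified.index,unspecified['Incident Zip']):
    if np.isnan(j):
        continue  # no ZIP code, so there is nothing to look up
    df.at[i,'Borough']=df_zip.at[j,'Borough']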


In [14]:
df[df['Borough']=='Unspecified'].head()

Unnamed: 0,Unique Key,Created Date,Closed Date,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,City,Status,Due Date,Resolution Description,Borough,Latitude,Longitude
6462,45534862,02/03/2020 04:09:47 PM,02/21/2020 09:13:21 AM,PLUMBING,BATHTUB/SHOWER,RESIDENTIAL BUILDING,,149 WESER AVENUE,WESER AVENUE,,Closed,,The Department of Housing Preservation and Dev...,Unspecified,,
245677,19682256,01/25/2011 12:00:00 AM,01/30/2011 12:00:00 AM,HEATING,HEAT,RESIDENTIAL BUILDING,,102-23 HARACE HARDING EXPRESSWAY,HARACE HARDING EXPRESSWAY,,Closed,,The Department of Housing Preservation and Dev...,Unspecified,,
245889,19023113,11/01/2010 12:00:00 AM,11/12/2010 12:00:00 AM,HEATING,HEAT,RESIDENTIAL BUILDING,,146-26 BURLING STREET,BURLING STREET,,Closed,,The Department of Housing Preservation and Dev...,Unspecified,,
264446,43950950,10/02/2019 06:09:24 AM,10/17/2019 08:57:30 AM,ELECTRIC,POWER OUTAGE,RESIDENTIAL BUILDING,,147 SCRIBNER LANE,SCRIBNER LANE,,Closed,,The Department of Housing Preservation and Dev...,Unspecified,,
307324,15635792,01/01/2010 12:00:00 AM,01/13/2010 12:00:00 AM,HEATING,HEAT,RESIDENTIAL BUILDING,,63-11 63 STREET,63 STREET,,Closed,,The Department of Housing Preservation and Dev...,Unspecified,,


In [15]:
df.groupby('Borough')[['Unique Key']].count().sort_values('Unique Key',ascending=False)

Borough,Unique Key
BROOKLYN,2026221
BRONX,1873618
MANHATTAN,1208830
QUEENS,745901
STATEN ISLAND,101070
Unspecified,558


## We still have some Unspecified values because their ZIP code wasn't given. Only 558 remain (down from 818,832), so we will ignore them.
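The ZIP-based fill can also be written without a Python-level loop, which matters on several hundred thousand rows. This is a sketch on a toy frame with made-up rows, not a drop-in replacement: unlike the lookup table used earlier, it excludes 'Unspecified' before picking the most common borough per ZIP, so that label can never be chosen as the fill value.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows) with the two columns the fill relies on.
toy = pd.DataFrame({
    'Incident Zip': [10030.0, 10030.0, 10030.0, 11235.0, 11235.0, np.nan],
    'Borough': ['MANHATTAN', 'MANHATTAN', 'Unspecified',
                'BROOKLYN', 'Unspecified', 'Unspecified'],
})

# Most frequent named borough per ZIP, ignoring 'Unspecified' rows.
known = toy[toy['Borough'] != 'Unspecified']
zip_to_borough = known.groupby('Incident Zip')['Borough'].agg(
    lambda s: s.mode().iloc[0]
)

# Rows that need filling and have a ZIP code to look up.
mask = toy['Borough'].eq('Unspecified') & toy['Incident Zip'].notna()

# Map each ZIP to its borough; ZIPs with no named borough stay 'Unspecified'.
filled = toy.loc[mask, 'Incident Zip'].map(zip_to_borough)
toy.loc[mask, 'Borough'] = filled.fillna('Unspecified')
print(toy)
```

The last row keeps 'Unspecified' because it has no ZIP code, which matches the remaining entries described above.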

### Let's see which address has the most complaints

In [16]:
df.groupby('Incident Address')[['Unique Key']].count().sort_values('Unique Key',ascending=False)

Incident Address,Unique Key
34 ARDEN STREET,14362
89-21 ELMHURST AVENUE,13343
1025 BOYNTON AVENUE,9729
3810 BAILEY AVENUE,7178
9511 SHORE ROAD,5539
...,...
534 PUGSLEY AVENUE,1
188-22 87 DRIVE,1
534 TARGEE STREET,1
188-20C 69 AVENUE,1


In [17]:
df[df['Incident Address']=='34 ARDEN STREET'][['Incident Address','Incident Zip','Borough']].head(1)


Unnamed: 0,Incident Address,Incident Zip,Borough
472,34 ARDEN STREET,10040.0,MANHATTAN


### The address with the most complaints is
# 34 ARDEN STREET, MANHATTAN 10040

In [18]:
df.groupby('Incident Zip')[['Unique Key']].count().sort_values('Unique Key',ascending=False).head()

Incident Zip,Unique Key
11226.0,215792
10467.0,174156
10458.0,169767
10453.0,162725
10468.0,148501


## The ZIP code with the most complaints is
# 11226
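The borough, address, and ZIP counts so far cover every retained complaint type, while Question 2 asks about the complaint type identified in Question 1. A sketch of how the same counts could be restricted to HEAT/HOT WATER, on a toy frame with made-up rows (the real ranking may or may not change):

```python
import pandas as pd

# Toy frame (made-up rows) with the columns used for the location counts.
toy = pd.DataFrame({
    'Complaint Type': ['HEAT/HOT WATER', 'PLUMBING', 'HEAT/HOT WATER',
                       'HEAT/HOT WATER', 'ELECTRIC'],
    'Borough': ['BRONX', 'BROOKLYN', 'BRONX', 'BROOKLYN', 'BROOKLYN'],
    'Incident Zip': [10458.0, 11226.0, 10467.0, 11226.0, 11226.0],
})

# Keep only the complaint type of interest before counting.
heat = toy[toy['Complaint Type'] == 'HEAT/HOT WATER']

by_borough = heat['Borough'].value_counts()
by_zip = heat['Incident Zip'].value_counts()

print(toy['Borough'].value_counts())  # all complaint types
print(by_borough)                     # HEAT/HOT WATER only
```

In this toy data Brooklyn leads overall but the Bronx leads for HEAT/HOT WATER, which illustrates why the filter can matter.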

In [19]:
df.groupby('Status')[['Unique Key']].count().sort_values('Unique Key',ascending=False).head()

Unnamed: 0_level_0,Unique Key
Status,Unnamed: 1_level_1
Closed,5821801
Open,134395
Pending,2


In [20]:
df_pluto=df_pluto[['address','bldgarea','bldgdepth','builtfar','commfar','facilfar','lot','lotarea',
                   'lotdepth','numbldgs','numfloors','officearea','resarea','residfar','retailarea',
                   'yearbuilt','yearalter1','zipcode','ycoord','xcoord']]
df_pluto.shape

(859038, 20)

In [21]:
df_pluto['bldgage']=2020-df_pluto['yearbuilt']
df_pluto.head()

Unnamed: 0,address,bldgarea,bldgdepth,builtfar,commfar,facilfar,lot,lotarea,lotdepth,numbldgs,...,officearea,resarea,residfar,retailarea,yearbuilt,yearalter1,zipcode,ycoord,xcoord,bldgage
0,84-33 ABINGDON ROAD,2372.0,40.0,0.47,0.0,2.0,57,5000.0,100.0,2.0,...,0.0,2372.0,0.75,0.0,1920.0,0.0,11415.0,196855.0,1031306.0,100.0
1,107 WEST 68 STREET,22380.0,83.0,3.68,0.0,4.0,27,6075.0,100.42,1.0,...,0.0,22380.0,4.0,0.0,1930.0,1989.0,10023.0,221723.0,989505.0,90.0
2,CLAUDE AVENUE,0.0,0.0,0.0,0.0,1.0,43,2628.0,136.0,0.0,...,,,0.5,,0.0,0.0,11433.0,191541.0,1043307.0,2020.0
3,FULTON STREET,0.0,0.0,0.0,0.0,4.2,4,2000.0,100.0,0.0,...,,,4.2,,0.0,0.0,11233.0,186469.0,1006800.0,2020.0
4,10-12 GRASSMERE TERRACE,0.0,0.0,0.0,0.0,2.0,61,2771.0,102.03,0.0,...,,,0.75,,0.0,0.0,11691.0,158883.0,1051634.0,2020.0
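Rows with `yearbuilt` equal to 0 end up with a building age of 2020 in the preview above. Assuming 0 marks an unknown construction year rather than a real one (an assumption about the PLUTO data, not something verified here), those rows can be turned into missing values so they do not distort later statistics. A minimal sketch using the `yearbuilt` values from the preview:

```python
import pandas as pd

# yearbuilt values from the five preview rows above.
pluto = pd.DataFrame({'yearbuilt': [1920.0, 1930.0, 0.0, 0.0, 0.0]})

# ASSUMPTION: yearbuilt == 0 means "unknown", so replace it with NaN.
year = pluto['yearbuilt'].where(pluto['yearbuilt'] > 0)

# Age as of 2020; rows with an unknown year get NaN instead of 2020.
pluto['bldgage'] = 2020 - year
print(pluto)
```

pandas skips NaN values when computing correlations, so rows with an unknown year would no longer pull on `bldgage`.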


In [22]:
df_comp_count=df.groupby('Incident Address')[['Incident Address']].count()

In [23]:
df_comp_count.columns=['count of complaints']
df_comp_count['address']=df_comp_count.index
df_comp_count.head()

Incident Address,count of complaints,address
.537 SHEPERD AVE,4,.537 SHEPERD AVE
1 1 AVENUE,2,1 1 AVENUE
1 1 PLACE,3,1 1 PLACE
1 12 STREET,1,1 12 STREET
1 23 STREET,1,1 23 STREET


In [24]:
# Drop the address index; the same values are already in the 'address' column
df_comp_count.reset_index(drop=True,inplace=True)
df_comp_count.head()

Unnamed: 0,count of complaints,address
0,4,.537 SHEPERD AVE
1,2,1 1 AVENUE
2,3,1 1 PLACE
3,1,1 12 STREET
4,1,1 23 STREET


In [25]:
df_corr = pd.merge(df_comp_count,df_pluto,on='address')
df_corr.head()

Unnamed: 0,count of complaints,address,bldgarea,bldgdepth,builtfar,commfar,facilfar,lot,lotarea,lotdepth,...,officearea,resarea,residfar,retailarea,yearbuilt,yearalter1,zipcode,ycoord,xcoord,bldgage
0,3,1 1 PLACE,5332.0,50.0,2.48,0.0,2.0,50,2150.0,100.0,...,0.0,5332.0,2.0,0.0,1900.0,1982.0,11231.0,187611.0,984123.0,120.0
1,2,1 5 AVENUE,238923.0,100.0,19.91,0.0,10.0,22,12000.0,100.0,...,0.0,227923.0,10.0,11000.0,1926.0,1988.0,10003.0,205930.0,985305.0,94.0
2,5,1 7 AVENUE,15075.0,110.0,0.94,0.0,4.0,5,16050.0,96.25,...,0.0,0.0,4.0,15075.0,1900.0,0.0,11217.0,186061.0,991719.0,120.0
3,64,1 74 STREET,112140.0,294.0,3.36,0.0,3.0,1,33400.0,294.0,...,0.0,112140.0,3.0,0.0,1938.0,0.0,11209.0,170478.0,974216.0,82.0
4,7,1 ADLER PLACE,1512.0,36.0,0.96,0.0,2.0,66,1568.0,80.08,...,0.0,1512.0,1.25,0.0,1920.0,0.0,11208.0,188877.0,1019618.0,100.0


In [26]:
print(df_comp_count.shape)
print(df_pluto.shape)
df_corr.shape

(182867, 2)
(859038, 21)


(142710, 22)
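The merge keys on `address` alone, and the same street address string can exist in more than one ZIP code. The sketch below, on toy frames with made-up rows, shows how an address-only merge can pair a complaint count with the wrong building, and how adding the ZIP code as a second key avoids that. It assumes the complaint counts were grouped by both address and ZIP code, which the notebook does not do; `indicator=True` is used only to see which rows found no match.

```python
import pandas as pd

# Toy complaint counts (made-up rows), one row per address and ZIP code.
complaints = pd.DataFrame({
    'address': ['1 5 AVENUE', '1 5 AVENUE', '34 ARDEN STREET'],
    'zipcode': [10003.0, 11217.0, 10040.0],
    'count of complaints': [2, 5, 9],
})

# Toy building table (made-up rows) with a duplicated address string.
pluto = pd.DataFrame({
    'address': ['1 5 AVENUE', '1 5 AVENUE'],
    'zipcode': [10003.0, 11217.0],
    'numfloors': [27.0, 3.0],
})

# Address-only key: every '1 5 AVENUE' row matches both buildings.
by_address = pd.merge(complaints, pluto, on='address')

# Address plus ZIP: each complaint row matches at most one building.
by_both = pd.merge(complaints, pluto, on=['address', 'zipcode'],
                   how='left', indicator=True)

print(len(by_address), len(by_both))
print(by_both['_merge'].value_counts())
```

The `_merge` column also makes it easy to count how many complaint addresses have no PLUTO record, which is one possible reason the merged frame is smaller than `df_comp_count`.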

In [27]:
# Correlate numeric columns only ('address' is text)
df_corr.select_dtypes(include='number').corr()

Unnamed: 0,count of complaints,bldgarea,bldgdepth,builtfar,commfar,facilfar,lot,lotarea,lotdepth,numbldgs,...,officearea,resarea,residfar,retailarea,yearbuilt,yearalter1,zipcode,ycoord,xcoord,bldgage
count of complaints,1.0,0.097426,0.162229,0.089475,-0.021558,0.129655,-0.011392,0.027738,0.069044,-0.01647,...,-0.008231,0.16539,0.122435,0.004708,0.006644,0.057835,-0.09411,0.114882,-0.013338,-0.006644
bldgarea,0.097426,1.0,0.29012,0.457397,0.195094,0.212181,0.124292,0.302213,0.406654,0.143215,...,0.397871,0.623516,0.20591,0.22403,0.019414,0.071891,-0.113277,0.060287,-0.045015,-0.019414
bldgdepth,0.162229,0.29012,1.0,0.170076,0.137881,0.262826,-0.02501,0.122172,0.32109,0.004894,...,0.134987,0.275001,0.250923,0.113024,0.098898,0.161832,-0.14898,0.096937,-0.09444,-0.098898
builtfar,0.089475,0.457397,0.170076,1.0,0.227412,0.308852,0.122556,0.011359,0.024113,-0.027139,...,0.125707,0.182936,0.32596,0.068506,0.030435,0.148602,-0.196368,0.096571,-0.108675,-0.030435
commfar,-0.021558,0.195094,0.137881,0.227412,1.0,0.497404,0.125636,0.014596,0.036042,-0.014142,...,0.255903,0.08798,0.428192,0.144676,-0.046126,0.152968,-0.256341,0.04097,-0.152642,0.046126
facilfar,0.129655,0.212181,0.262826,0.308852,0.497404,1.0,0.120924,0.012921,0.025221,-0.053082,...,0.141514,0.206506,0.854016,0.097823,-0.047386,0.300273,-0.481069,0.300637,-0.196543,0.047386
lot,-0.011392,0.124292,-0.02501,0.122556,0.125636,0.120924,1.0,0.035919,-0.005342,0.057543,...,0.036407,0.141548,0.129733,0.058683,-0.016734,0.010673,-0.057861,0.005356,-0.05679,0.016734
lotarea,0.027738,0.302213,0.122172,0.011359,0.014596,0.012921,0.035919,1.0,0.503267,0.289445,...,0.116675,0.268479,0.007612,0.074088,-0.029802,0.016345,-0.020338,0.013214,-0.002485,0.029802
lotdepth,0.069044,0.406654,0.32109,0.024113,0.036042,0.025221,-0.005342,0.503267,1.0,0.234118,...,0.155099,0.378341,0.008634,0.141594,-0.009899,0.029062,-0.026759,0.007452,-0.014669,0.009899
numbldgs,-0.01647,0.143215,0.004894,-0.027139,-0.014142,-0.053082,0.057543,0.289445,0.234118,1.0,...,0.027591,0.183762,-0.055859,0.009603,-0.008103,-0.012335,0.028656,-0.018,0.03744,0.008103
