# Getting and cleaning housing maintenance code complaint data

To get and clean the housing maintenance code complaints data we will:
1. Read in the housing maintenance code complaints (HMCC) dataset.
2. Drop all features except:
 - ComplaintID
 - BoroughID
 - Block
 - Lot
 - ReceivedDate
3. Convert ReceivedDate to a pandas datetime.
4. Drop all rows with a ReceivedDate outside November 1st, 2014 through October 31st, 2015.
5. Drop all rows with missing values or with a BoroughID not in range(1,6).
6. Construct the BBL from the BoroughID, Block, and Lot features.
7. Drop Block and Lot.

In [3]:
import pandas as pd
import numpy as np
import re

def get_complaint_data():
    # Download the full HMCC dataset as CSV from NYC Open Data.
    query = "https://data.cityofnewyork.us/api/views/uwyv-629c/rows.csv?accessType=DOWNLOAD"
    complaints = pd.read_csv(query)
    return complaints
            
complaints = get_complaint_data()
complaints.head(3)

Unnamed: 0,ComplaintID,BuildingID,BoroughID,Borough,HouseNumber,StreetName,Zip,Block,Lot,Apartment,CommunityBoard,ReceivedDate,StatusID,Status,StatusDate
0,6960137,3418,1,MANHATTAN,1989,ADAM C POWELL BOULEVARD,10026,1904,4,12D,10,07/07/2014,2,CLOSE,07/29/2014
1,6960832,3512,1,MANHATTAN,2267,ADAM C POWELL BOULEVARD,10030,1918,4,3B,10,07/08/2014,2,CLOSE,07/12/2014
2,6946867,5318,1,MANHATTAN,778,11 AVENUE,10019,1083,1,4P,4,06/19/2014,2,CLOSE,07/13/2014


In [5]:
print(complaints.shape)
# Drop every column that is not needed downstream.
complaints = complaints.drop(['BuildingID', 'Borough', 'HouseNumber', 'StreetName', 'Zip',
                              'Apartment', 'CommunityBoard', 'StatusID', 'Status', 'StatusDate'],
                             axis=1)
print(complaints.shape)

(474746, 15)
(474746, 5)
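As an alternative to dropping columns after loading, the unwanted columns can be skipped at read time. The sketch below is illustrative only: it uses a small in-memory CSV (a subset of the real columns, with values copied from the preview above) in place of the real download, and the names `sample_csv`, `keep`, and `sample` are made up for this example.

```python
import io
import pandas as pd

# Toy CSV standing in for the HMCC download (assumption: a subset of the real columns).
sample_csv = io.StringIO(
    "ComplaintID,BuildingID,BoroughID,Borough,Block,Lot,ReceivedDate,Status\n"
    "6960137,3418,1,MANHATTAN,1904,4,07/07/2014,CLOSE\n"
    "6960832,3512,1,MANHATTAN,1918,4,07/08/2014,CLOSE\n"
)

# The five features this notebook keeps.
keep = ["ComplaintID", "BoroughID", "Block", "Lot", "ReceivedDate"]

# usecols tells read_csv to parse only the listed columns.
sample = pd.read_csv(sample_csv, usecols=keep)
print(sample.shape)
print(sample.columns.tolist())
```

Reading fewer columns should reduce memory use on the full file, though the download itself is the same size either way.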


In [6]:
complaints.head(3)

Unnamed: 0,ComplaintID,BoroughID,Block,Lot,ReceivedDate
0,6960137,1,1904,4,07/07/2014
1,6960832,1,1918,4,07/08/2014
2,6946867,1,1083,1,06/19/2014


In [7]:
complaints.ReceivedDate = pd.to_datetime(complaints.ReceivedDate)
print(complaints.ReceivedDate.dtype)

datetime64[ns]
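The conversion above lets pandas infer the date format. Since the preview rows show dates as MM/DD/YYYY, a stricter variant could state the format explicitly and turn unparseable values into missing values. This is a minimal sketch on a toy Series (`raw` and `parsed` are illustrative names), assuming every valid date follows that format:

```python
import pandas as pd

# Two dates in the MM/DD/YYYY style seen in the preview, plus one bad value.
raw = pd.Series(["07/07/2014", "06/19/2014", "not a date"])

# format= pins the expected layout; errors="coerce" maps failures to NaT
# so they can be dropped later along with other missing values.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
print(parsed)
print(parsed.isnull().sum())
```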


In [8]:
print(complaints.shape)
# Drop rows with any missing value.
complaints = complaints.dropna()
print(complaints.shape)
# Keep only valid borough codes (1 through 5).
complaints = complaints[complaints.BoroughID.isin(range(1,6))]
print(complaints.shape)

(474746, 5)
(474746, 5)
(474746, 5)


In [9]:
# Keep complaints received from November 1, 2014 through October 31, 2015.
start = pd.Timestamp('2014-11-01')
end = pd.Timestamp('2015-10-31')
allowed_date_range = pd.date_range(start, end, freq='D')
print(allowed_date_range)
complaints = complaints[(complaints['ReceivedDate'].isin(allowed_date_range))]
print(complaints.shape)

DatetimeIndex(['2014-11-01', '2014-11-02', '2014-11-03', '2014-11-04',
               '2014-11-05', '2014-11-06', '2014-11-07', '2014-11-08',
               '2014-11-09', '2014-11-10', 
               ...
               '2015-10-22', '2015-10-23', '2015-10-24', '2015-10-25',
               '2015-10-26', '2015-10-27', '2015-10-28', '2015-10-29',
               '2015-10-30', '2015-10-31'],
              dtype='datetime64[ns]', length=365, freq='D', tz=None)
(394775, 5)


In [10]:
def make_BBL(borough, block, lot): 
    '''
    The borough code is one numeric digit. 
    The tax block is one to five numeric digits, preceded with leading zeros when the block is less than five digits.
    The tax lot is one to four digits and is preceded with leading zeros when the lot is less than four digits.
    
    >>> make_BBL(1,16,100)
    1000160100
    >>> make_BBL(3,15828,7501)
    3158287501
    '''
    return int(str(borough) + str(block).zfill(5) + str(lot).zfill(4))
    
complaints['BBL'] = list(map(make_BBL, complaints['BoroughID'], complaints['Block'], complaints['Lot']))
complaints.head(5)

Unnamed: 0,ComplaintID,BoroughID,Block,Lot,ReceivedDate,BBL
79034,7085199,1,1788,22,2014-11-07,1017880022
79035,7075907,1,1901,1,2014-11-02,1019010001
79036,7080436,1,2034,34,2014-11-05,1020340034
79037,7131077,1,2046,63,2014-11-29,1020460063
79038,7124360,1,1060,36,2014-11-25,1010600036
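Because the BBL is a fixed-width concatenation (1 borough digit, 5 block digits, 4 lot digits), it can also be computed arithmetically on whole columns instead of row by row. The following is a sketch on a toy frame, assuming all three columns are integers, Block is at most 99999, and Lot is at most 9999; the first two rows reuse the `make_BBL` doctest inputs.

```python
import pandas as pd

toy = pd.DataFrame({
    "BoroughID": [1, 3, 1],
    "Block": [16, 15828, 1788],
    "Lot": [100, 7501, 22],
})

# Shift borough left by 9 digits and block by 4 digits, then add the lot.
# int64 is used explicitly because 10-digit BBLs overflow 32-bit integers.
toy["BBL"] = (
    toy["BoroughID"].astype("int64") * 10**9
    + toy["Block"].astype("int64") * 10**4
    + toy["Lot"].astype("int64")
)
print(toy)
```

On a few hundred thousand rows this is likely faster than mapping a Python function over each row, and it gives the same integers as long as the digit-width assumptions hold.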


In [11]:
complaints = complaints.drop(['Block','Lot'], axis=1)
complaints.head(3)

Unnamed: 0,ComplaintID,BoroughID,ReceivedDate,BBL
79034,7085199,1,2014-11-07,1017880022
79035,7075907,1,2014-11-02,1019010001
79036,7080436,1,2014-11-05,1020340034


To recap: we now have a filtered dataset of all complaints received from November 1st, 2014 through October 31st, 2015, including ComplaintIDs, validated BoroughIDs, and 10-digit BBLs. This will be:
1. Merged with the processed complaint problems dataset, using ComplaintID as a key.
2. Merged with the processed violation dataset, using constructed BBLs as keys.
3. Merged with processed PLUTO sub-dataset using BBL as a key.
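The merges listed above might look roughly like the sketch below. Everything except the `ComplaintID` and `BBL` key names is hypothetical: the toy frames, the `ProblemCode` and `YearBuilt` columns and their values, and the choice of a left join are assumptions for illustration, not the actual downstream code.

```python
import pandas as pd

# Toy stand-ins for the processed datasets (hypothetical columns and values).
complaints_toy = pd.DataFrame({
    "ComplaintID": [7085199, 7075907],
    "BBL": [1017880022, 1019010001],
})
problems_toy = pd.DataFrame({
    "ComplaintID": [7085199, 7085199, 7075907],
    "ProblemCode": ["A", "B", "C"],
})
pluto_toy = pd.DataFrame({
    "BBL": [1017880022],
    "YearBuilt": [1920],
})

# A left join keeps every complaint even when the other table has no match.
merged = complaints_toy.merge(problems_toy, on="ComplaintID", how="left")
merged = merged.merge(pluto_toy, on="BBL", how="left")
print(merged)
```

Note that a complaint with several problems produces several rows after the first merge, and a BBL missing from the second table yields missing values rather than a dropped row.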