# Getting and cleaning housing maintenance code complaint data

To get and clean the housing maintenance code complaints (HMCC) data we will:
1. Read in the HMCC dataset.
2. Drop all features except:
   - ComplaintID
   - BoroughID
   - Block
   - Lot
   - ReceivedDate
3. Convert ReceivedDate to a pandas datetime.
4. Drop all rows with missing values or with BoroughID not in range(1,6).
5. Drop all rows with ReceivedDate before April 1st, 2015.
6. Construct the BBL from the BoroughID, Block, and Lot features.
7. Drop Block and Lot.
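
The steps above can be sketched as a single function. This is a minimal sketch, not the notebook's own code: the name `clean_complaints`, the `start` parameter, and the three-row `sample` frame are assumptions made for illustration. It filters on `ReceivedDate`, as the cells below do. The first two sample rows reuse values shown in the outputs further down (with the date written in the raw file's MM/DD/YYYY style), and the third row is a made-up invalid-borough example.

```python
import pandas as pd

KEEP = ["ComplaintID", "BoroughID", "Block", "Lot", "ReceivedDate"]

def clean_complaints(raw, start="2015-04-01"):
    """Sketch of the cleaning steps; `start` is an assumed parameter."""
    df = raw[KEEP].copy()                                   # keep five features
    df["ReceivedDate"] = pd.to_datetime(df["ReceivedDate"], format="%m/%d/%Y")
    df = df.dropna()                                        # rows with missing values
    df = df[df["BoroughID"].isin([1, 2, 3, 4, 5])]          # valid borough codes
    df = df[df["ReceivedDate"] >= pd.Timestamp(start)]      # date cutoff
    # Build the 10-digit BBL: 1 borough digit + 5 block digits + 4 lot digits.
    ids = df[["BoroughID", "Block", "Lot"]].astype("int64").astype(str)
    bbl = ids["BoroughID"] + ids["Block"].str.zfill(5) + ids["Lot"].str.zfill(4)
    df = df.assign(BBL=bbl.astype("int64"))
    return df.drop(columns=["Block", "Lot"])

# Illustrative input; the last row is hypothetical (borough code 9 is invalid).
sample = pd.DataFrame({
    "ComplaintID": [6960137, 7423689, 1],
    "BoroughID": [1, 2, 9],
    "Borough": ["MANHATTAN", "BRONX", "UNKNOWN"],
    "Block": [1904, 4833, 1],
    "Lot": [4, 64, 1],
    "ReceivedDate": ["07/07/2014", "04/05/2015", "04/05/2015"],
})
cleaned = clean_complaints(sample)
print(cleaned)
```

Only the second sample row survives: the first is older than the cutoff and the third has an invalid borough code.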

In [2]:
import pandas as pd
import numpy as np
import re

### 

def get_complaint_data():
    query = ("https://data.cityofnewyork.us/api/views/uwyv-629c/rows.csv?accessType=DOWNLOAD")
    complaints = pd.read_csv(query)
    return complaints
            
complaints = get_complaint_data()
complaints.head(3)

Unnamed: 0,ComplaintID,BuildingID,BoroughID,Borough,HouseNumber,StreetName,Zip,Block,Lot,Apartment,CommunityBoard,ReceivedDate,StatusID,Status,StatusDate
0,6960137,3418,1,MANHATTAN,1989,ADAM C POWELL BOULEVARD,10026,1904,4,12D,10,07/07/2014,2,CLOSE,07/29/2014
1,6960832,3512,1,MANHATTAN,2267,ADAM C POWELL BOULEVARD,10030,1918,4,3B,10,07/08/2014,2,CLOSE,07/12/2014
2,6946867,5318,1,MANHATTAN,778,11 AVENUE,10019,1083,1,4P,4,06/19/2014,2,CLOSE,07/13/2014


In [3]:
print complaints.shape
complaints = complaints.drop(['BuildingID','Borough','HouseNumber','StreetName','Zip','Apartment','CommunityBoard','StatusID','Status','StatusDate'],axis=1)
print complaints.shape

(474746, 15)
(474746, 5)


In [4]:
complaints.head(3)

Unnamed: 0,ComplaintID,BoroughID,Block,Lot,ReceivedDate
0,6960137,1,1904,4,07/07/2014
1,6960832,1,1918,4,07/08/2014
2,6946867,1,1083,1,06/19/2014


In [5]:
complaints.ReceivedDate = pd.to_datetime(complaints.ReceivedDate)
print complaints.ReceivedDate.dtype

datetime64[ns]
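
The raw dates shown above look like MM/DD/YYYY strings (for example `07/07/2014`). As a hedged alternative to letting pandas infer the layout, the format can be passed explicitly, and `errors="coerce"` turns unparseable entries into `NaT` so that the missing-value step can drop them. The `"not a date"` entry is a made-up value for illustration.

```python
import pandas as pd

# Three dates from the head() output above plus one hypothetical bad entry.
received = pd.Series(["07/07/2014", "07/08/2014", "06/19/2014", "not a date"])

# An explicit format avoids any day/month ambiguity; bad entries become NaT.
parsed = pd.to_datetime(received, format="%m/%d/%Y", errors="coerce")
print(parsed)
```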


In [9]:
print complaints.shape
complaints = complaints[~(complaints.isnull().any(axis=1))]
print complaints.shape
complaints = complaints[complaints.BoroughID.isin(range(1,6))]
print complaints.shape

(474746, 5)
(474746, 5)
(474746, 5)
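
The shapes are identical before and after, so neither filter removed any rows from this extract. For reference, the null mask used above should behave the same as `DataFrame.dropna()`. The toy frame below is made up to show both filters removing something.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing BoroughID and one out-of-range code.
toy = pd.DataFrame({"ComplaintID": [1, 2, 3, 4],
                    "BoroughID": [1, np.nan, 7, 5]})

by_mask = toy[~toy.isnull().any(axis=1)]   # the notebook's approach
by_dropna = toy.dropna()                   # shorter equivalent

valid = by_dropna[by_dropna["BoroughID"].isin([1, 2, 3, 4, 5])]
print(valid)
```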


In [10]:
# Keep complaints received on or after April 1st, 2015 (matches the output below).
start = pd.Timestamp('2015-04-01')
end = pd.Timestamp.now()
allowed_date_range = pd.date_range(start, end, freq='D')
print allowed_date_range
complaints = complaints[(complaints['ReceivedDate'].isin(allowed_date_range))]
print complaints.shape

DatetimeIndex(['2015-04-01', '2015-04-02', '2015-04-03', '2015-04-04',
               '2015-04-05', '2015-04-06', '2015-04-07', '2015-04-08',
               '2015-04-09', '2015-04-10', 
               ...
               '2015-11-04', '2015-11-05', '2015-11-06', '2015-11-07',
               '2015-11-08', '2015-11-09', '2015-11-10', '2015-11-11',
               '2015-11-12', '2015-11-13'],
              dtype='datetime64[ns]', length=227, freq='D', tz=None)
(142794, 5)
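
Membership in a daily `date_range` only matches timestamps that fall exactly on midnight. That works here because `ReceivedDate` carries no time of day, but a plain comparison against the start date is a simpler way to express the same cutoff. The sketch below uses three made-up dates and an assumed fixed end date so that it is reproducible.

```python
import pandas as pd

# Hypothetical dates around the cutoff.
dates = pd.to_datetime(pd.Series(["2015-03-31", "2015-04-01", "2015-04-26"]))
start = pd.Timestamp("2015-04-01")

# Notebook-style filter: membership in a daily range (end date assumed here).
via_isin = dates[dates.isin(pd.date_range(start, "2015-12-31", freq="D"))]

# Equivalent comparison, which also handles non-midnight timestamps.
via_compare = dates[dates >= start]
print(via_compare)
```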


In [11]:
def make_BBL(borough, block, lot): 
    '''
    The borough code is one numeric digit. 
    The tax block is one to five numeric digits, preceded with leading zeros when the block is less than five digits.
    The tax lot is one to four digits and is preceded with leading zeros when the lot is less than four digits.
    
    >>> make_BBL(1,16,100)
    1000160100
    >>> make_BBL(3,15828,7501)
    3158287501
    '''
    return int(str(borough) + str(block).zfill(5) + str(lot).zfill(4))
    
# list(...) keeps this working on Python 3, where map() returns an iterator.
complaints['BBL'] = list(map(make_BBL, complaints['BoroughID'], complaints['Block'], complaints['Lot']))
complaints.head(5)

Unnamed: 0,ComplaintID,BoroughID,Block,Lot,ReceivedDate,BBL
331595,7423689,2,4833,64,2015-04-05,2048330064
331596,7447983,2,3944,7501,2015-04-26,2039447501
331597,7444237,2,2463,53,2015-04-23,2024630053
331598,7428031,2,4510,62,2015-04-08,2045100062
331599,7423638,2,3201,30,2015-04-05,2032010030
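
Because the BBL has a fixed layout (1 borough digit, 5 block digits, 4 lot digits), the zero padding can also be expressed arithmetically: borough × 10⁹ + block × 10⁴ + lot. The sketch below is a vectorized alternative under that assumption, not the notebook's method. The function name `make_bbl_column` is hypothetical, and the inputs are the two docstring examples plus the first row of the output above.

```python
import pandas as pd

def make_bbl_column(df):
    """Vectorized BBL: borough * 10**9 + block * 10**4 + lot (hypothetical helper)."""
    return (df["BoroughID"].astype("int64") * 10**9
            + df["Block"].astype("int64") * 10**4
            + df["Lot"].astype("int64"))

rows = pd.DataFrame({"BoroughID": [1, 3, 2],
                     "Block": [16, 15828, 4833],
                     "Lot": [100, 7501, 64]})
bbl = make_bbl_column(rows)
print(bbl.tolist())  # [1000160100, 3158287501, 2048330064]
```

This avoids a Python-level call per row, which may matter at a few hundred thousand rows. It only gives the same result as `make_BBL` when block and lot fit in five and four digits respectively.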


In [13]:
complaints = complaints.drop(['Block','Lot'], axis=1)
complaints.head(3)

Unnamed: 0,ComplaintID,BoroughID,ReceivedDate,BBL
331595,7423689,2,2015-04-05,2048330064
331596,7447983,2,2015-04-26,2039447501
331597,7444237,2,2015-04-23,2024630053


To recap: we now have a filtered dataset of all completed complaints received on or after April 1st, 2015, including ComplaintIDs, validated BoroughIDs, and 10-digit BBLs. This will be:
1. Merged with the processed complaint problems dataset, using ComplaintID as a key.
2. Merged with the processed violation dataset, using constructed BBLs as keys.
3. Merged with the processed PLUTO sub-dataset, using BBL as a key.
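
A rough sketch of how two of these merges might look with `DataFrame.merge`. Everything other than the key names is an assumption: the frames `problems` and `pluto`, their placeholder columns and values, and the choice of left joins are illustrative only. The ComplaintID and BBL values come from the output above.

```python
import pandas as pd

complaints = pd.DataFrame({"ComplaintID": [7423689, 7447983],
                           "BBL": [2048330064, 2039447501]})

# Hypothetical stand-ins for the processed datasets (placeholder values).
problems = pd.DataFrame({"ComplaintID": [7423689, 7423689],
                         "problem_placeholder": ["A", "B"]})
pluto = pd.DataFrame({"BBL": [2048330064],
                      "pluto_placeholder": [1]})

merged = (complaints
          .merge(problems, on="ComplaintID", how="left")
          # validate raises if a BBL appears more than once on the PLUTO side
          .merge(pluto, on="BBL", how="left", validate="many_to_one"))
print(merged)
```

A left join keeps every complaint even when no match exists. Whether unmatched rows should be kept is a choice the later notebooks would need to make.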