# Getting and cleaning housing maintenance code violation data

In this notebook, we will get and clean the NYC Housing Maintenance Code Violation data. Specifically, we will:
1. Read in the violation data from NYC Open Data.
2. Drop all features except:
 * BoroID
 * Block
 * Lot
 * Class
 * ApprovedDate
3. Convert ApprovedDate to a datetime datatype.
4. Clean the data by dropping all records with:
 * Incomplete data
 * BoroID not in range(1,6)
 * Class not in ['A', 'B', 'C']
 * ApprovedDate not between April 1st, 2014, and March 31st, 2015. This date range gives us one complete year of violations without overlapping the "new" complaints in our dataset (thus avoiding leakage).
5. Construct the BBL (borough-block-lot identifier) for each violation before dropping BoroID, Block, and Lot.
6. Group by BBL and aggregate by counting the total number of violations of each class for each BBL over our one-year window.
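
As a compact illustration of the six steps above, here is a minimal sketch that applies them to a small in-memory table. The function name `clean_violations`, the toy rows, and the arithmetic BBL construction are assumptions made for illustration; the cells that follow do the actual work on the full dataset.

```python
import pandas as pd

def clean_violations(raw, start, end):
    """Sketch of the steps listed above, applied to a raw violations DataFrame."""
    cols = ['BoroID', 'Block', 'Lot', 'Class', 'ApprovedDate']
    df = raw[cols].copy()                                   # step 2: keep five features
    df['ApprovedDate'] = pd.to_datetime(df['ApprovedDate'])  # step 3: parse dates
    df = df.dropna()                                        # step 4: incomplete rows
    df = df[df['BoroID'].isin(range(1, 6))]
    df = df[df['Class'].isin(['A', 'B', 'C'])]
    df = df[df['ApprovedDate'].between(start, end)]         # inclusive on both ends
    # step 5: BBL = 1 borough digit + 5 block digits + 4 lot digits
    df['BBL'] = (df['BoroID'].astype('int64') * 10**9
                 + df['Block'].astype('int64') * 10**4
                 + df['Lot'].astype('int64'))
    # step 6: one row per BBL, one column per class, counts as cell values
    return df.groupby(['BBL', 'Class']).size().unstack(fill_value=0)

# Toy rows (hypothetical): the last two fail the BoroID and Class checks.
toy = pd.DataFrame({
    'ViolationID': [1, 2, 3, 4, 5],
    'BoroID': [1, 1, 3, 9, 1],
    'Block': [16, 16, 15828, 1, 16],
    'Lot': [100, 100, 7501, 1, 100],
    'Class': ['A', 'B', 'C', 'A', 'I'],
    'ApprovedDate': ['04/01/2014', '03/31/2015', '01/15/2015',
                     '06/01/2014', '06/01/2014'],
})
counts = clean_violations(toy, pd.Timestamp(2014, 4, 1), pd.Timestamp(2015, 3, 31))
print(counts)
```

Only the first three toy rows survive the filters, so the result has two BBLs and three class columns.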

In [75]:
import pandas as pd
import numpy as np
import re

def get_violation_data():
    # Download the full Housing Maintenance Code Violations CSV from NYC Open Data.
    url = "https://data.cityofnewyork.us/api/views/wvxf-dwi5/rows.csv?accessType=DOWNLOAD"
    violations = pd.read_csv(url)
    return violations

violations = get_violation_data()
violations.head(3)

,ViolationID,BuildingID,RegistrationID,BoroID,Boro,HouseNumber,LowHouseNumber,HighHouseNumber,StreetName,StreetCode,...,NewCertifyByDate,NewCorrectByDate,CertifiedDate,OrderNumber,NOVID,NOVDescription,NOVIssuedDate,CurrentStatusID,CurrentStatus,CurrentStatusDate
0,10304176,45567,202840,2,BRONX,1905,1905,1905,ANDREWS AVENUE SOUTH,8820,...,,,,508,4873659,§ 27-2005 ADM CODE REPAIR THE BROKEN OR DEFEC...,07/14/2014,19,VIOLATION CLOSED,08/01/2014
1,10340355,41491,105339,1,MANHATTAN,111,111,115,WEST 141 STREET,36590,...,,,,508,4893132,§ 27-2005 ADM CODE REPAIR THE BROKEN OR DEFEC...,08/12/2014,22,FIRST NO ACCESS TO RE- INSPECT VIOLATION,09/18/2015
2,10337179,27609,107359,1,MANHATTAN,272,272,274,SHERMAN AVENUE,30490,...,,,08/19/2014,508,4891907,§ 27-2005 ADM CODE REPAIR THE BROKEN OR DEFEC...,08/11/2014,19,VIOLATION CLOSED,10/30/2014
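
Since only five of the thirty columns are used downstream, the read step could select them up front. The following is a minimal sketch, assuming the CSV header uses the column names shown above; the helper name `read_violation_columns` and the two-row sample (with made-up `ViolationID` and `CurrentStatus` values) are hypothetical, and an in-memory buffer stands in for the download.

```python
import io
import pandas as pd

KEEP = ['BoroID', 'Block', 'Lot', 'Class', 'ApprovedDate']

def read_violation_columns(source):
    """Read only the columns used later; `source` may be a path, URL, or file-like object."""
    return pd.read_csv(source, usecols=KEEP, parse_dates=['ApprovedDate'])

# Tiny stand-in for the real file, for illustration only.
sample_csv = io.StringIO(
    "ViolationID,BoroID,Block,Lot,Class,ApprovedDate,CurrentStatus\n"
    "1,2,3221,90,C,07/11/2014,VIOLATION CLOSED\n"
    "2,1,2010,21,B,08/11/2014,VIOLATION CLOSED\n"
)
sample = read_violation_columns(sample_csv)
print(sample.dtypes)
```

Passing `usecols` avoids holding the unused columns in memory, and `parse_dates` would make the later datetime conversion unnecessary.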


In [76]:
print(violations.shape)
# Keep only the five features needed to build the BBL and count violations.
violations = violations[['BoroID', 'Block', 'Lot', 'Class', 'ApprovedDate']]
print(violations.shape)

(1131841, 30)
(1131841, 5)


In [77]:
print(violations.shape)
# Drop rows with any missing value.
violations = violations.dropna()
print(violations.shape)
# Keep only the five valid borough codes (1 through 5).
violations = violations[violations.BoroID.isin(range(1,6))]
print(violations.shape)
# Keep only violation classes A, B, and C.
violations = violations[violations.Class.isin(['A','B','C'])]
print(violations.shape)

(1131841, 5)
(1131841, 5)
(1131841, 5)
(1016271, 5)
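
The shapes above show that the null and BoroID filters removed no rows, while the Class filter removed 115,570 (1,131,841 minus 1,016,271). One way to see what a filter is about to discard is to count the rejected values first. This is a sketch on toy data; the helper name and the `'I'` value are hypothetical.

```python
import pandas as pd

def rejected_value_counts(series, allowed):
    """Count the values in `series` that are not in `allowed` (missing values included)."""
    mask = ~series.isin(allowed)
    return series[mask].value_counts(dropna=False)

toy_class = pd.Series(['A', 'B', 'I', 'C', 'I', None])
rejected = rejected_value_counts(toy_class, ['A', 'B', 'C'])
print(rejected)
```

Running the same helper on `violations.Class` before filtering would list the class codes that account for the dropped rows.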


In [78]:
# Parse the ApprovedDate strings into datetime64 values.
violations['ApprovedDate'] = pd.to_datetime(violations['ApprovedDate'])
print(violations['ApprovedDate'].dtype)

datetime64[ns]


In [79]:
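# NOTE: the introduction specifies a window starting April 1st, 2014, but this cell
# starts in 2010; the outputs shown in this notebook were produced with the 2010 start.
start = pd.Timestamp(2010, 4, 1)
end = pd.Timestamp(2015, 3, 31)
# Every calendar day in the allowed range, inclusive of both endpoints.
allowed_date_range_violation_approval = pd.date_range(start, end, freq='D')
violations = violations[violations['ApprovedDate'].isin(allowed_date_range_violation_approval)]
print(violations.shape)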

(548802, 5)
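
The introduction describes a one-year window from April 1st, 2014 through March 31st, 2015, whereas the range in the cell above begins in 2010 (which is why a 2013 approval date still appears in later output). As a minimal sketch of the one-year filter, assuming the dates carry no time-of-day component, an inclusive comparison with `Series.between` would look like this on toy data:

```python
import pandas as pd

start = pd.Timestamp(2014, 4, 1)
end = pd.Timestamp(2015, 3, 31)

# Toy dates (hypothetical): one before the window, three inside, one after.
toy = pd.DataFrame({'ApprovedDate': pd.to_datetime(
    ['2013-03-18', '2014-04-01', '2014-08-28', '2015-03-31', '2015-04-01'])})

# between() is inclusive on both ends by default.
in_window = toy[toy['ApprovedDate'].between(start, end)]
print(in_window)
```

Both boundary dates are kept, and the 2013 and April 2015 rows are dropped.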


In [80]:
def make_BBL(borough, block, lot): 
    '''
    The borough code is one numeric digit. 
    The tax block is one to five numeric digits, preceded with leading zeros when the block is less than five digits.
    The tax lot is one to four digits and is preceded with leading zeros when the lot is less than four digits.
    
    >>> make_BBL(1,16,100)
    1000160100
    >>> make_BBL(3,15828,7501)
    3158287501
    '''
    return int(str(borough) + str(block).zfill(5) + str(lot).zfill(4))
    
# Wrap map() in list() so the assignment also works where map() returns an iterator.
violations['BBL'] = list(map(make_BBL, violations['BoroID'], violations['Block'], violations['Lot']))
violations.head(5)

,BoroID,Block,Lot,Class,ApprovedDate,BBL
0,2,3221,90,C,2014-07-11,2032210090
1,1,2010,21,B,2014-08-11,1020100021
2,1,2228,42,B,2014-08-08,1022280042
3,1,2153,36,A,2013-03-18,1021530036
4,3,1419,6,B,2014-08-28,3014190006
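
Because the BBL has a fixed layout (one borough digit, five block digits, four lot digits), it can also be built arithmetically on whole columns instead of row by row. This is a sketch of that alternative, not the method used above; the function name `make_bbl_vectorized` is hypothetical, and the two toy rows reuse the docstring examples from `make_BBL`.

```python
import pandas as pd

def make_bbl_vectorized(boro, block, lot):
    """Build BBLs as boro * 10^9 + block * 10^4 + lot, using 64-bit integers."""
    # int64 is needed because BBLs such as 3158287501 exceed the 32-bit range.
    return (boro.astype('int64') * 10**9
            + block.astype('int64') * 10**4
            + lot.astype('int64'))

toy = pd.DataFrame({'BoroID': [1, 3], 'Block': [16, 15828], 'Lot': [100, 7501]})
toy['BBL'] = make_bbl_vectorized(toy['BoroID'], toy['Block'], toy['Lot'])
print(toy)
```

Multiplying by powers of ten plays the same role as the zero-padding in `make_BBL`, and it avoids a Python-level function call per row.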


In [81]:
violations = violations.drop(['BoroID','Block','Lot'],axis=1)
violations.head(5)

,Class,ApprovedDate,BBL
0,C,2014-07-11,2032210090
1,B,2014-08-11,1020100021
2,B,2014-08-08,1022280042
3,A,2013-03-18,1021530036
4,B,2014-08-28,3014190006


In [82]:
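# Count violations per (BBL, Class) pair.
grouped_by_BBL = violations.groupby(['BBL','Class']).size().reset_index()
grouped_by_BBL.columns = ['BBL','Class','Count']
# Reshape to one row per BBL and one column per class; missing pairs become 0.
grouped_by_BBL = grouped_by_BBL.pivot(index='BBL', columns='Class', values='Count')
grouped_by_BBL = grouped_by_BBL.fillna(0)
grouped_by_BBL.head(10)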

BBL,A,B,C
1000157501,0,0,1
1000160100,1,14,2
1000167508,0,0,1
1000167515,1,4,0
1000167516,2,5,0
1000310001,1,0,0
1000317501,0,0,1
1000330011,1,3,0
1000420022,0,0,1
1000530033,1,7,3
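
The same BBL-by-class count table can be produced in one step with `pd.crosstab`, which fills absent combinations with zero and keeps integer counts. This is a sketch of an alternative on toy data (the rows are hypothetical); the `reindex` call is an added safeguard so that all three class columns exist even if one class never occurs.

```python
import pandas as pd

toy = pd.DataFrame({
    'BBL': [1000160100, 1000160100, 1000160100, 3014190006],
    'Class': ['A', 'B', 'B', 'C'],
})

# Rows are BBLs, columns are classes, cells are violation counts.
counts = pd.crosstab(toy['BBL'], toy['Class'])
counts = counts.reindex(columns=['A', 'B', 'C'], fill_value=0)
print(counts)
```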


In [83]:
print(grouped_by_BBL.shape)

(32636, 3)
