# Data Profiling and Cleaning

We profiled and cleaned the NYC Open Data `Historical_DOB_Permit` dataset using pandas and openclean.

Run all the cells in order to profile and clean the data.

Robert Ronan, Sheng Tong, Jerry Lee

In [1]:
import openclean
import glob
import pandas as pd
import numpy as np
import re

# Data Downloading

Download the data using openClean

In [2]:
import gzip
import humanfriendly
import os

from openclean.data.source.socrata import Socrata

dataset = Socrata().dataset('bty7-2jhb')
datafile = './bty7-2jhb.tsv.gz'

if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        print('Downloading ...\n')
        dataset.write(f)


fsize = humanfriendly.format_size(os.stat(datafile).st_size)
print("Using '{}' in file {} of size {}".format(dataset.name, datafile, fsize))

Using 'Historical DOB Permit Issuance' in file ./bty7-2jhb.tsv.gz of size 321.34 MB


# Data Loading

Load the data into a pandas DataFrame and an openclean data stream.

In [3]:
import pandas as pd
from openclean.pipeline import stream

df  = pd.read_csv(datafile, dtype='object', sep='\t')
ds = stream(datafile)
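Reading every column with `dtype='object'` keeps identifiers such as block, lot, and ZIP codes as text, so leading zeros and mixed formats survive the load. A minimal sketch on made-up tab-separated data (the column values here are illustrative, not taken from the file):

```python
import io
import pandas as pd

# Made-up TSV snippet; the values are hypothetical.
raw = "Block\tLot\n00062\t0001\n01942\t7501\n"

# Same options as the cell above: every column is read as text.
as_text = pd.read_csv(io.StringIO(raw), dtype="object", sep="\t")
# Default type inference, for comparison.
inferred = pd.read_csv(io.StringIO(raw), sep="\t")

print(as_text["Block"].tolist())   # strings, leading zeros kept
print(inferred["Block"].tolist())  # integers, leading zeros lost
```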

In [4]:
np.__version__

'1.21.4'

In [5]:
pd.__version__

'1.3.4'

In [6]:
import glob

In [7]:
glob.glob("*")

['bty7-2jhb.tsv.gz',
 'cleaned_data.csv',
 'dm9a-ab7w.tsv.gz',
 'DOB_Job_Cleaning.ipynb',
 'DOB_NOW_BUILD.ipynb',
 'DOB_NOW_Electrical.ipynb',
 'Historical_DOB_Permit.ipynb',
 'ic3t-wcy2.tsv.gz',
 'README.md',
 'w9ak-ipjd.tsv.gz']

### Get some basic info about the dataset columns

In [8]:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2428526 entries, 0 to 2428525
Data columns (total 60 columns):
 #   Column                            Non-Null Count    Dtype 
---  ------                            --------------    ----- 
 0   BOROUGH                           2428526 non-null  object
 1   BIN                               2428526 non-null  object
 2   Number                            2428522 non-null  object
 3   Street                            2428522 non-null  object
 4   Job #                             2428526 non-null  object
 5   Job doc. #                        2428526 non-null  object
 6   Job Type                          2428526 non-null  object
 7   Self_Cert                         900685 non-null   object
 8   Block                             2428028 non-null  object
 9   Lot                               2428019 non-null  object
 10  Community Board                   2425674 non-null  object
 11  Postcode                          2427476 non-null

If any rows are complete duplicates, drop them

In [9]:
df = df.drop_duplicates()
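The row counts hint at how many duplicates were dropped: `df.info()` reported 2428526 entries, while the later `describe()` output counts 2428524 rows, which suggests two exact duplicates. A small sketch of counting duplicates before dropping them, on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame: the first two rows are exact duplicates.
toy = pd.DataFrame({
    "Job Number": ["1", "1", "2"],
    "Lot": ["7", "7", "7"],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(f"dropping {n_dupes} duplicate row(s), {len(deduped)} rows remain")
```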

Take a look at some of the rows to get an idea of what the dataset looks like.

In [10]:
df

Unnamed: 0,BOROUGH,BIN,Number,Street,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,Latitude,Longitude,Council District,Census Tract,BBL,NTA
0,BRONX,2118801,2960,WEBSTER AVENUE,201088492,4,NB,,3274,4,...,,,,2016-01-03T00:00:00,40.86749,-73.883225,11,425,2032740001,Norwood ...
1,BRONX,2096812,100,DEKRUIF PLACE,200716298,2,A2,,5141,120,...,,,,2016-01-03T00:00:00,40.875769,-73.828899,12,46201,2051410120,Co-op City ...
2,BRONX,2008604,1898,HARRISON AVENUE,200974650,2,A2,,2869,87,...,,,,2016-01-03T00:00:00,40.852603,-73.911461,14,243,2028690087,University Heights-Morris Heights ...
3,BRONX,2007652,1998,MORRIS AVENUE,200278118,2,A1,,2807,15,...,,,,2016-01-03T00:00:00,40.851661,-73.906937,14,241,2028070015,Mount Hope ...
4,BRONX,2084155,565,WEST 235 STREET,201119173,2,A2,Y,5794,484,...,,,,2016-01-03T00:00:00,40.88572,-73.91027,11,297,2057940484,Spuyten Duyvil-Kingsbridge ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2428521,STATEN ISLAND,5141909,200,LANDER AVENUE,500644699,1,NB,Y,1599,10,...,NY,10304,7189809327,2016-01-03T00:00:00,40.614117,-74.16564,50,29104,5015990010,New Springville-Bloomfield-Travis ...
2428522,STATEN ISLAND,5127569,26,DIAZ PLACE,500145256,1,NB,Y,4055,37,...,NY,10309,7183170679,2016-01-03T00:00:00,40.561562,-74.108193,50,12806,5040550037,Oakwood-Oakwood Beach ...
2428523,STATEN ISLAND,5046419,1055,TARGEE STREET,500301176,1,A2,Y,3171,1,...,NY,11101,7184728597,2016-01-03T00:00:00,40.60223,-74.091355,50,50,5031710001,Grasmere-Arrochar-Ft. Wadsworth ...
2428524,STATEN ISLAND,5043894,10,WHITE PLAINS AVENUE,500112326,1,A2,,2968,30,...,NY,10305,7189815066,2016-01-03T00:00:00,40.614249,-74.075822,49,36,5029680033,Stapleton-Rosebank ...


In [11]:
#need 382 examples for sample

In [12]:
df_sample = df.sample(382).copy()

In [13]:
df_sample

Unnamed: 0,BOROUGH,BIN,Number,Street,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,Latitude,Longitude,Council District,Census Tract,BBL,NTA
1590526,MANHATTAN,1032679,251,WEST 81ST STREET,110247812,1,A2,Y,1229,8,...,NY,10530,9149972435,2016-01-03T00:00:00,40.785351,-73.979439,6,167,1012290008,Upper West Side ...
524120,BROOKLYN,3001939,280,CADMAN PLAZA WEST,300739324,1,A3,,239,16,...,NY,11232,7186265654,2016-01-03T00:00:00,40.695657,-73.991006,33,502,3002390016,Brooklyn Heights-Cobble Hill ...
2176945,QUEENS,4190933,107-11,92 STREET,420051856,1,A2,Y,9159,72,...,NY,11417,7188433678,2016-01-03T00:00:00,40.678251,-73.845932,32,54,4091590072,Ozone Park ...
1132106,MANHATTAN,1038538,232,E 53 ST,120614200,1,A2,,1326,35,...,NY,10022,2122073339,2016-01-03T00:00:00,40.757339,-73.968554,4,98,1013260035,Turtle Bay-East Midtown ...
2356636,STATEN ISLAND,5151844,700,FR CAPODANNO BLVD,520045040,3,NB,,3833,100,...,NY,11368,7187606601,2016-01-03T00:00:00,40.578925,-74.078151,50,11201,5038330100,Old Town-Dongan Hills-South Beach ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1154136,MANHATTAN,1030139,104,WEST 76 STREET,120756592,1,A2,,1147,35,...,NY,10023,2125018814,2016-01-03T00:00:00,40.780204,-73.977414,6,161,1011470035,Upper West Side ...
101937,BRONX,2120464,1401,TELLER AVENUE,210025337,1,NB,,2784,29,...,NY,10012,2125764124,2016-01-03T00:00:00,40.837152,-73.909577,16,225,2027840029,East Concourse-Concourse Village ...
975928,MANHATTAN,1076262,30,ROCKEFELLER PLAZA,120861647,1,A2,Y,1265,7501,...,NY,10111,2125888650,2016-01-03T00:00:00,40.758754,-73.978692,4,104,1012657501,Midtown-Midtown South ...
2134898,QUEENS,4180106,85-36,260 STREET,401744225,1,A2,,8802,55,...,NY,11368,6464080741,2016-01-03T00:00:00,40.734656,-73.707339,23,157903,4088020055,Glen Oaks-Floral Park-New Hyde Park ...


## Describe columns in groups so they fit on screen

In [14]:
df[df.columns[:20]].describe()

Unnamed: 0,BOROUGH,BIN,Number,Street,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,Community Board,Postcode,Bldg Type,Residential,Special District 1,Special District 2,Work Type,Permit Status,Filing Status,Permit Type
count,2428524,2428524,2428520,2428520,2428524,2428524,2428524,900685,2428026,2428017,2425672,2427474,2374245,772772,186592,2639,1974555,2419619,2428524,2428523
unique,5,300024,28639,20201,1110544,12,6,1,13625,1718,140,227,2,1,82,14,13,4,2,8
top,MANHATTAN,1015862,1,BROADWAY,200718330,1,A2,Y,16,1,105,10022,2,YES,MID,LH-1A,OT,ISSUED,INITIAL,EW
freq,1008003,1957,18445,68853,57,2222894,1389185,900685,7200,289865,247167,65192,1786549,772772,46452,1481,636379,2384954,1851581,1029741


In [15]:
# Notes: 
# Special District, status, BOROUGH, bin, Job Number, block, lot can all be reused

In [16]:
df[df.columns[20:40]].describe()

Unnamed: 0,Permit Sequence #,Permit Subtype,Oil Gas,Site Fill,Filing Date,Issuance Date,Expiration Date,Job Start Date,Permittee's First Name,Permittee's Last Name,Permittee's Business Name,Permittee's Phone #,Permittee's License Type,Permittee's License #,Act as Superintendent,Permittee's Other Title,HIC License,Site Safety Mgr's First Name,Site Safety Mgr's Last Name,Site Safety Mgr Business Name
count,2428524,1418231,31034,2260821,2428524,2428524,2428520,2428521,2428453,2428436,2396874,2428220,2173572,2207326,1595225,268182,26804,10237,10261,7861
unique,26,15,2,5,6415,6409,9024,7475,35865,85623,351176,145121,11,46745,2,2795,5512,506,1235,1257
top,1,OT,OIL,NONE,2007-03-29T00:00:00,2007-03-29T00:00:00,2007-12-31T00:00:00,2008-06-27T00:00:00,JOHN,SINGH,ROCKLEDGE SCAFFOLD,2124816100,GENERAL CONTRACTOR,0,Y,GC,0,JOHN,WAIVER,TOTAL SAFETY CONSULTING
freq,1851595,585307,29215,1553657,998,994,18638,1376,112451,28500,7732,16556,1343650,78260,1586973,105964,845,503,376,507


In [17]:
# Dates can all be reused

In [18]:
df[df.columns[40:60]].describe()

Unnamed: 0,Superintendent First & Last Name,Superintendent Business Name,Owner's Business Type,Non-Profit,Owner's Business Name,Owner's First Name,Owner's Last Name,Owner's House #,Owner's House Street Name,Owner’s House City,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,Latitude,Longitude,Council District,Census Tract,BBL,NTA
count,1602998,1602813,2267529,157298,1759197,2270361,2270672,2270987,2270837,2271354,2271391,2266426,2226545,2428524,2419008.0,2419008.0,2419008,2419008,2398934,2419552
unique,309028,307996,14,1,378163,75825,147394,35113,119849,11545,57,6785,284544,1,200886.0,206784.0,51,1326,281136,195
top,ROCKLEDGE SCAFFOLD,ROCKLEDGE SCAFFOLD,CORPORATION,Y,NY SCHOOL CONSTRUCTION AUTHORITY,ROBERT,COHEN,100,BROADWAY,NEW YORK,NY,10038,7184728000,2016-01-03T00:00:00,40.754162,-73.976557,4,7,1009720001,Midtown-Midtown South ...
freq,6842,6842,791752,157298,19966,48131,15554,71273,45474,516017,2216175,83810,17604,2428524,1973.0,1973.0,302821,20042,3923,182177


In [19]:
# Owner's info and GIS info can be reimplemented

In [20]:
df.columns

Index(['BOROUGH', 'BIN', 'Number', 'Street', 'Job #', 'Job doc. #', 'Job Type',
       'Self_Cert', 'Block', 'Lot', 'Community Board', 'Postcode', 'Bldg Type',
       'Residential', 'Special District 1', 'Special District 2', 'Work Type',
       'Permit Status', 'Filing Status', 'Permit Type', 'Permit Sequence #',
       'Permit Subtype', 'Oil Gas', 'Site Fill', 'Filing Date',
       'Issuance Date', 'Expiration Date', 'Job Start Date',
       'Permittee's First Name', 'Permittee's Last Name',
       'Permittee's Business Name', 'Permittee's Phone #',
       'Permittee's License Type', 'Permittee's License #',
       'Act as Superintendent', 'Permittee's Other Title', 'HIC License',
       'Site Safety Mgr's First Name', 'Site Safety Mgr's Last Name',
       'Site Safety Mgr Business Name', 'Superintendent First & Last Name',
       'Superintendent Business Name', 'Owner's Business Type', 'Non-Profit',
       'Owner's Business Name', 'Owner's First Name', 'Owner's Last Name',
       'O

## Renaming columns

In [21]:
rename_list = list(df.columns)
rename_dict = dict()

for i in rename_list:
    col_name = str(i)
    # https://stackoverflow.com/questions/2277352/split-a-string-at-uppercase-letters
    # Split on upper case to separate concatenated words:
    if (not col_name.islower()) and (not col_name.isupper()):
        col_name = " ".join(re.sub("([A-Z])", r" \1", col_name).split())
    # Split on underscores and make Title Case 
    col_name = col_name.strip().replace("_", " ").lower().title().replace("’", "'").replace(".", "")
        
    col_name = col_name.replace("No", "Number")
    col_name = col_name.replace("#", "Number")
    
    rename_dict[i] = col_name

In [22]:
rename_dict

{'BOROUGH': 'Borough',
 'BIN': 'Bin',
 'Number': 'Number',
 'Street': 'Street',
 'Job #': 'Job Number',
 'Job doc. #': 'Job Doc Number',
 'Job Type': 'Job Type',
 'Self_Cert': 'Self  Cert',
 'Block': 'Block',
 'Lot': 'Lot',
 'Community Board': 'Community Board',
 'Postcode': 'Postcode',
 'Bldg Type': 'Bldg Type',
 'Residential': 'Residential',
 'Special District 1': 'Special District 1',
 'Special District 2': 'Special District 2',
 'Work Type': 'Work Type',
 'Permit Status': 'Permit Status',
 'Filing Status': 'Filing Status',
 'Permit Type': 'Permit Type',
 'Permit Sequence #': 'Permit Sequence Number',
 'Permit Subtype': 'Permit Subtype',
 'Oil Gas': 'Oil Gas',
 'Site Fill': 'Site Fill',
 'Filing Date': 'Filing Date',
 'Issuance Date': 'Issuance Date',
 'Expiration Date': 'Expiration Date',
 'Job Start Date': 'Job Start Date',
 "Permittee's First Name": "Permittee'S First Name",
 "Permittee's Last Name": "Permittee'S Last Name",
 "Permittee's Business Name": "Permittee'S Business Nam

In [23]:
df = df.rename(columns=rename_dict)
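The renaming loop has a few side effects that show up in the outputs: `Self_Cert` becomes `Self  Cert` (two spaces), `Permittee's` becomes `Permittee'S`, and later cells refer to `Numbern- Profit` and `D O B Run Date`. The rest of this notebook keeps using those generated names. As a sketch only, a more conservative normalizer could look like the following; `normalize_column` is a hypothetical helper, not part of the notebook, and its output names would not match the cells below.

```python
import re

def normalize_column(name: str) -> str:
    """Hypothetical helper: tidy a raw column name into capitalized words."""
    # Unify apostrophes, drop periods, and treat underscores as spaces.
    name = name.replace("’", "'").replace(".", "").replace("_", " ")
    # Spell out '#' as a separate word.
    name = name.replace("#", " Number ")
    # Add a space at lower->Upper boundaries and between an acronym and a
    # following capitalized word (e.g. 'DOBRunDate' -> 'DOB Run Date').
    name = re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", name)
    # capitalize() upper-cases only the first character of each word,
    # so "Permittee's" keeps its lowercase 's'.
    return " ".join(word.capitalize() for word in name.split())

for raw in ["Self_Cert", "Job doc. #", "Permittee's First Name", "DOBRunDate"]:
    print(raw, "->", normalize_column(raw))
```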

#### Helper function to show the top 10 values of a column

In [24]:
def show_vals(column_name, show_rows=10, df=df):
    print("Top {} {}:\n".format(show_rows, column_name))
    print(df[column_name].value_counts(dropna=False)[:show_rows])
    print()

### Examining Job Numbers

There is some repetition in the Job Numbers, but nothing major. We will check some of the repeated Job Numbers to be sure they actually refer to the same jobs.

In [25]:
df['Job Number'].value_counts(dropna=False)

200718330    57
103297521    54
102318804    50
101229840    48
401910438    47
             ..
302403254     1
301359330     1
120278074     1
102443660     1
500301176     1
Name: Job Number, Length: 1110544, dtype: int64

Nothing weird looking here

In [26]:
df['Job Number'].min()

'100030011'

In [27]:
df['Job Number'].max()

'577777776'

In [28]:
df.loc[df['Job Number'].str.startswith('0')]

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta


Ratio of distinct Job Numbers to total rows

In [29]:
df['Job Number'].nunique()/df['Job Number'].count()

0.45729175416837553

Group by Job Number and check whether the latitude and longitude are the same for every row of a job, which would indicate that the different instances of a Job Number all refer to the same job.

In [30]:
group = df[['Job Number', 'Latitude', 'Longitude']].groupby('Job Number')

This will take a little while to run

In [31]:
tranformed = group.aggregate(lambda x: x.unique().shape[0])
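The lambda above counts distinct values per job, with NaN counted as a value of its own. A possibly faster equivalent is the built-in `nunique(dropna=False)` on the groupby. The sketch below uses a toy frame with made-up job numbers and coordinates:

```python
import pandas as pd

# Toy data: job "100" has one row with missing coordinates.
toy = pd.DataFrame({
    "Job Number": ["100", "100", "200", "200", "300"],
    "Latitude": ["40.1", None, "40.2", "40.2", "40.3"],
    "Longitude": ["-73.1", None, "-73.2", "-73.2", "-73.3"],
})

# dropna=False counts a missing value as a distinct value,
# matching what x.unique().shape[0] does in the lambda.
counts = toy.groupby("Job Number")[["Latitude", "Longitude"]].nunique(dropna=False)

# Jobs whose rows do not all share a single latitude/longitude pair.
suspect = counts[(counts["Latitude"] != 1) | (counts["Longitude"] != 1)]
print(suspect)
```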

Jobs with multiple latitude and longitudes:

In [32]:
# potential bad jobs
tranformed.loc[(tranformed['Latitude']!=1)
              |(tranformed['Longitude']!=1)]

Unnamed: 0_level_0,Latitude,Longitude
Job Number,Unnamed: 1_level_1,Unnamed: 2_level_1
100053406,2,2
100107493,2,2
100134472,2,2
100134481,2,2
100134515,2,2
100134524,2,2
100340481,2,2
100734706,2,2
100749442,2,2
100786295,2,2


In [33]:
potential_bad_jobs = list(tranformed.loc[(tranformed['Latitude']!=1)
              |(tranformed['Longitude']!=1)].index.unique())

Separate these into a temporary dataframe to play around with:

In [34]:
df_temp = df.loc[df['Job Number'].isin(potential_bad_jobs)].copy()

In [35]:
df_temp = df_temp.sort_values(['Job Number', 'Latitude', 'Longitude'])

Most of these are just missing lat and long values.

The others look to be Jobs that manage multiple houses/lots in a small area, so are probably correct

In [36]:
df_temp[df_temp.duplicated(subset=['Job Number', 'Block', 'Lot', 'Bin', 'Job Type'], keep=False)]

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta
1172405,MANHATTAN,1001084,1,LIBERTY PLAZA,100053406,2,A2,,62,1,...,,,,2016-01-03T00:00:00,40.70943,-74.011102,1,13,1000627501,Battery Park City-Lower Manhattan ...
1248743,MANHATTAN,1001084,1,LIBERTY PLAZA,100053406,2,A2,,62,1,...,,,,2016-01-03T00:00:00,40.70943,-74.011102,1,13,1000627501,Battery Park City-Lower Manhattan ...
763613,MANHATTAN,1085348,298A,WEST 137 STREET,100134472,1,NB,,1942,7501,...,NY,10038,2129786310,2016-01-03T00:00:00,40.817208,-73.944222,9,228,1019427501,Central Harlem North-Polo Grounds ...
1316050,MANHATTAN,1085348,298A,WEST 137 STREET,100134472,1,NB,,1942,7501,...,NY,10038,2129786310,2016-01-03T00:00:00,40.817208,-73.944222,9,228,1019427501,Central Harlem North-Polo Grounds ...
876095,MANHATTAN,1085348,298A,WEST 137 STREET,100134481,1,NB,,1942,7501,...,NY,10038,2129786310,2016-01-03T00:00:00,40.817208,-73.944222,9,228,1019427501,Central Harlem North-Polo Grounds ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1789002,QUEENS,4454470,37-02,73 STREET,400391143,1,A1,,1283,8,...,NY,11372,7184571171,2016-01-03T00:00:00,40.748565,-73.89267,25,291,4012830008,Jackson Heights ...
1985086,QUEENS,4454470,37-02,73 STREET,400391143,1,A1,,1283,8,...,NY,11372,7184571171,2016-01-03T00:00:00,40.748565,-73.89267,25,291,4012830008,Jackson Heights ...
2273243,QUEENS,4454470,37-02,73 STREET,400391143,1,A1,,1283,8,...,NY,11372,7184571171,2016-01-03T00:00:00,40.748565,-73.89267,25,291,4012830008,Jackson Heights ...
1811127,QUEENS,4045999,90-15,QUEENS BLVD,401042518,1,A2,,1860,100,...,NY,10001,2124945574,2016-01-03T00:00:00,40.733848,-73.871578,25,683,4018600100,Elmhurst ...


#### Later, after we have cleaned more values, we will fill these missing values by Job Number
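One way such a fill could work is to take the first known coordinate within each job and use it for that job's missing rows. This is a sketch under that assumption, on toy data, and not necessarily how the later cells implement it:

```python
import pandas as pd

# Toy data: each job has one row with coordinates and one without.
toy = pd.DataFrame({
    "Job Number": [1, 1, 2, 2],
    "Latitude": [40.7, None, None, 40.8],
    "Longitude": [-74.0, None, None, -73.9],
})

for col in ["Latitude", "Longitude"]:
    # transform("first") broadcasts the first non-null value of each job
    # back onto every row of that job.
    first_known = toy.groupby("Job Number")[col].transform("first")
    # Only the missing entries are filled; existing values are kept.
    toy[col] = toy[col].fillna(first_known)

print(toy)
```

Jobs that legitimately span several lots keep their differing coordinates, since only missing values are touched.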

Remove Jobs we know to be just missing data from the list of bad jobs

In [37]:
not_bad_jobs = df_temp[df_temp.duplicated(subset=['Job Number', 'Block', 'Lot', 'Bin', 'Job Type'], keep=False)]['Job Number'].unique()

In [38]:
df_temp = df_temp.loc[~df_temp['Job Number'].isin(not_bad_jobs)]

All of these are jobs that cover multiple lots or house numbers, which explains why the latitude and longitude change.


In [39]:
df_temp

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta
1075701,MANHATTAN,1046972,41,EAST 89 STREET,100107493,1,A1,,1501,7501,...,NY,10017.0,2122106666.0,2016-01-03T00:00:00,40.782479,-73.956647,4,15002,1015017501,Upper East Side-Carnegie Hill ...
1305991,MANHATTAN,1046972,45,EAST 89 STREET,100107493,2,A1,,1501,20,...,,,,2016-01-03T00:00:00,40.782869,-73.957095,4,15002,1015017501,Upper East Side-Carnegie Hill ...
848336,MANHATTAN,1085627,298A,WEST 138 STREET,100134515,1,NB,,2023,7503,...,NY,10038.0,2129786310.0,2016-01-03T00:00:00,40.818232,-73.944712,9,228,1020237503,Central Harlem North-Polo Grounds ...
872911,MANHATTAN,1085627,298,WEST 138 STREET,100134515,2,NB,,2023,1,...,,,,2016-01-03T00:00:00,40.818362,-73.945016,9,228,1020237503,Central Harlem North-Polo Grounds ...
997180,MANHATTAN,1085627,298A,WEST 138 STREET,100134524,1,NB,,2023,7503,...,NY,10038.0,2129786310.0,2016-01-03T00:00:00,40.818232,-73.944712,9,228,1020237503,Central Harlem North-Polo Grounds ...
1547307,MANHATTAN,1085627,298,WEST 138 STREET,100134524,2,NB,,2023,1,...,,,,2016-01-03T00:00:00,40.818362,-73.945016,9,228,1020237503,Central Harlem North-Polo Grounds ...
1661643,MANHATTAN,1057135,301,WEST 100 STREET,100786295,1,A1,,1889,7501,...,NY,10025.0,2128655858.0,2016-01-03T00:00:00,40.797937,-73.971604,6,187,1018897501,Upper West Side ...
1655044,MANHATTAN,1057135,823,WEST END AVENUE,100786295,2,A1,,1889,17,...,,,,2016-01-03T00:00:00,40.797964,-73.971243,6,187,1018897501,Upper West Side ...
2126442,QUEENS,4029819,37-33,74 STREET,400242171,3,A1,,1285,13,...,,,,2016-01-03T00:00:00,40.748048,-73.891656,25,289,4012850019,Jackson Heights ...
2076239,QUEENS,4029819,37-27,74 STREET,400242171,1,A1,,1285,19,...,NY,10017.0,2126829595.0,2016-01-03T00:00:00,40.748157,-73.891678,25,289,4012850019,Jackson Heights ...


Check if any Job Numbers have non-digit values

In [40]:
df['Job Number'] = df['Job Number'].astype('str')

In [41]:
df.loc[(~df['Job Number'].isna())
       &(~df['Job Number'].str.isdigit())]['Job Number']

Series([], Name: Job Number, dtype: object)

All Job Numbers are composed entirely of digits, so we cast them to ints.

In [42]:
df['Job Number'] = df['Job Number'].astype('int')

In [43]:
df['Job Number'].describe()

count    2.428524e+06
mean     2.481319e+08
std      1.364586e+08
min      1.000300e+08
25%      1.043485e+08
50%      2.201623e+08
75%      4.005741e+08
max      5.777778e+08
Name: Job Number, dtype: float64

## Examining and repairing Owner's House Numbers

Owner's House Numbers appear to be mostly ints.

However, there are legitimate house numbers with dashes, so we have to store them as strings.

In [44]:
show_vals('Owner\'S House Number', show_rows=10)

Top 10 Owner'S House Number:

NaN      157537
100       71273
30-30     43920
1         29243
75        18254
90        17500
200       16593
150       15762
60        13827
15        10979
Name: Owner'S House Number, dtype: int64



Replace NaN values with empty strings, then convert column to string, and make everything uppercase


In [45]:
# Assign the result back instead of using inplace=True on a column selection
df['Owner\'S House Number'] = df['Owner\'S House Number'].fillna('')
df['Owner\'S House Number'] = df['Owner\'S House Number'].astype('str')
df['Owner\'S House Number'] = df['Owner\'S House Number'].str.upper()

Check for numbers spelled out as words

In [46]:
df.loc[(~df['Owner\'S House Number'].isna())
       &(df['Owner\'S House Number'].str.isalpha())]['Owner\'S House Number']

455              ONE
586              TWO
603          OLMSTED
655        MUNICIPAL
740         OLMSTEAD
             ...    
2426095           PO
2427485          ONE
2428080          ONE
2428193        NORTH
2428485           PO
Name: Owner'S House Number, Length: 15955, dtype: object

Maybe the Owner's House Number and Borough were flipped in the 'MANHATTAN' case?

In [47]:
# Nope:
df.loc[df['Owner\'S House Number'].str.strip()=='MANHATTAN'][['Owner\'S House Number', 'Borough']]

Unnamed: 0,Owner'S House Number,Borough
803015,MANHATTAN,MANHATTAN
1351640,MANHATTAN,MANHATTAN


Check whether these are empty strings:

In [48]:
df.loc[(~df['Owner\'S House Number'].str.contains('\\d', regex=True))]['Owner\'S House Number']

0                      
1                      
2                      
3                      
4                      
               ...     
2428376        P.O. BOX
2428401                
2428439        P.O. BOX
2428476    MUNICIPAL BL
2428485              PO
Name: Owner'S House Number, Length: 204268, dtype: object

Replace spelled-out numbers with their digits, and blank out values that are clearly not house numbers.

In [49]:
# Trim surrounding whitespace before comparing (strip('') would remove nothing)
house_number = df['Owner\'S House Number'].str.strip()
df.loc[house_number == 'ONE', 'Owner\'S House Number'] = '1'
df.loc[house_number == 'TWO', 'Owner\'S House Number'] = '2'
# Values that are clearly not house numbers are blanked out
not_house_numbers = ['PIER', 'MANHATTAN', 'NO NUMBER', 'OLMSTED', 'OLMSTEAD', 'PO', 'NORTH', 'MUNICIPAL']
df.loc[house_number.isin(not_house_numbers), 'Owner\'S House Number'] = ''

Most of these will probably be legitimate house numbers, since house numbers can have dashes

In [50]:
df.loc[(~df['Owner\'S House Number'].isna())
       &(~df['Owner\'S House Number'].str.isdigit())]['Owner\'S House Number']

0               
1               
2               
3               
4               
           ...  
2428482    30-30
2428485         
2428507    30-30
2428519     123A
2428523    30-30
Name: Owner'S House Number, Length: 715905, dtype: object

Check non-numeric Owner's House Numbers that don't have dashes.

In [51]:
df.loc[(~df['Owner\'S House Number'].isna())
       &(~df['Owner\'S House Number'].str.isdigit())
      &(~df['Owner\'S House Number'].str.contains('-', regex=False))]['Owner\'S House Number'][:25]

0     
1     
2     
3     
4     
5     
6     
7     
8     
9     
10    
11    
13    
14    
15    
16    
17    
18    
19    
20    
21    
22    
23    
25    
26    
Name: Owner'S House Number, dtype: object
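The first rows of this output are blank because empty strings also fail `str.isdigit()`, so they pass the filter. A sketch of the same check with empty strings excluded, on a toy Series with made-up house numbers, makes the interesting values easier to see:

```python
import pandas as pd

# Made-up house numbers, including blanks left over from the NaN fill.
house = pd.Series(["", "100", "30-30", "123A", "45 REAR", "", "12GAR"])

mask = (
    (house != "")                             # skip blanks
    & ~house.str.isdigit()                    # not purely numeric
    & ~house.str.contains("-", regex=False)   # no dash
)
odd = house[mask]
print(odd.tolist())
```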

We see a mix of references to the house's garage, the rear house, and single letters that likely indicate apartments in multi-occupancy buildings.

We will standardize the formatting and keep the references to garage, rear, and apartment, since there is no apartment column for the job.

First split the numbers and words with a space

In [52]:
# Raw string for the replacement so that \g<...> is not treated as a string escape
df['Owner\'S House Number'] = df['Owner\'S House Number'].str.replace(pat='(?P<one>\\d)(?P<two>[A-Z]+)', repl=r'\g<one> \g<two>', regex=True)

Now we fix the formatting for garage and remove references to north, south, east, and west, since they belong with the street rather than the house number.

In [53]:
df['Owner\'S House Number'] = df['Owner\'S House Number'].str.replace(pat='(?P<one>GAR$)', repl='GARAGE', regex=True)

In [54]:
df['Owner\'S House Number'] = df['Owner\'S House Number'].str.replace(pat='NORTH([A-Z]+)?', repl='', regex=True)
df['Owner\'S House Number'] = df['Owner\'S House Number'].str.replace(pat='EAST([A-Z]+)?', repl='', regex=True)
df['Owner\'S House Number'] = df['Owner\'S House Number'].str.replace(pat='SOUTH([A-Z]+)?', repl='', regex=True)
df['Owner\'S House Number'] = df['Owner\'S House Number'].str.replace(pat='WEST([A-Z]+)?', repl='', regex=True)
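To see what these regex steps do, the same patterns can be applied to a few made-up values with the standard `re` module. `tidy_house_number` is a hypothetical helper written only for this illustration:

```python
import re

def tidy_house_number(value: str) -> str:
    """Apply the same three regex steps as the cells above to one string."""
    # 1. Put a space between a digit and the letters that follow it.
    value = re.sub(r"(?P<one>\d)(?P<two>[A-Z]+)", r"\g<one> \g<two>", value)
    # 2. Expand a trailing GAR to GARAGE.
    value = re.sub(r"GAR$", "GARAGE", value)
    # 3. Remove compass directions and any letters attached to them.
    for direction in ("NORTH", "EAST", "SOUTH", "WEST"):
        value = re.sub(direction + r"([A-Z]+)?", "", value)
    return value

for raw in ["123A", "12GAR", "30-30", "10 NORTH"]:
    print(repr(raw), "->", repr(tidy_house_number(raw)))
```

Note that the direction step also removes words that merely start with a direction (for example `WESTERN`) and can leave trailing whitespace, so a final `str.strip()` may be worth adding.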

In [55]:
# Confirm that it worked correctly:
df.loc[(~df['Owner\'S House Number'].isna())
       &(~df['Owner\'S House Number'].str.isdigit())
       &(~df['Owner\'S House Number'].str.contains('-', regex=False))]['Owner\'S House Number'][:30]

0     
1     
2     
3     
4     
5     
6     
7     
8     
9     
10    
11    
13    
14    
15    
16    
17    
18    
19    
20    
21    
22    
23    
25    
26    
27    
28    
31    
32    
33    
Name: Owner'S House Number, dtype: object

## Looking at Binary/Pseudo-binary columns:

For the flag-like columns below, NaN clearly indicates 'no', since the only other value is 'Y' or 'YES'.

In [56]:
for col in df.columns:
    if df[col].nunique() < 4:
        show_vals(col)

Top 10 Self  Cert:

NaN    1527839
Y       900685
Name: Self  Cert, dtype: int64

Top 10 Bldg Type:

2      1786549
1       587696
NaN      54279
Name: Bldg Type, dtype: int64

Top 10 Residential:

NaN    1655752
YES     772772
Name: Residential, dtype: int64

Top 10 Filing Status:

INITIAL    1851581
RENEWAL     576943
Name: Filing Status, dtype: int64

Top 10 Oil Gas:

NaN    2397490
OIL      29215
GAS       1819
Name: Oil Gas, dtype: int64

Top 10 Act As Superintendent:

Y      1586973
NaN     833299
N         8252
Name: Act As Superintendent, dtype: int64

Top 10 Numbern- Profit:

NaN    2271226
Y       157298
Name: Numbern- Profit, dtype: int64

Top 10 D O B Run Date:

2016-01-03T00:00:00    2428524
Name: D O B Run Date, dtype: int64



Fill the NaN values in the yes/no columns with an explicit 'N' (or 'NO' for Residential). Bldg Type and Oil Gas are not yes/no flags, so their NaN values become empty strings.

In [57]:
df["Bldg Type"] = df["Bldg Type"].fillna("")
df["Oil Gas"] = df["Oil Gas"].fillna("")
df["Act As Superintendent"] = df["Act As Superintendent"].fillna("N")

In [58]:
df["Self  Cert"] = df["Self  Cert"].fillna("N")
df["Numbern- Profit"] = df["Numbern- Profit"].fillna("N")
df["Residential"] = df["Residential"].fillna("NO")
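If real booleans are preferred over 'Y'/'N' strings, the flag columns could be converted with `isin`. This is a sketch on a toy frame; the column names follow the renamed columns above and the values are made up:

```python
import pandas as pd

# Toy flags: missing values stand for "no".
flags = pd.DataFrame({
    "Self  Cert": ["Y", None, "Y"],
    "Residential": ["YES", None, None],
})

# Anything that is not an explicit yes (including NaN) becomes False.
truthy = ["Y", "YES"]
as_bool = flags.isin(truthy)

print(as_bool.dtypes)
print(as_bool)
```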

## Checking Monetary Values for consistency

In [59]:
# No monetary value

## Checking owner's information

In [60]:
# list of Owner's columns:
owner_cols = df.columns[df.columns.str.contains('Owner')]

In [61]:
np.where(np.char.find(np.array(list(df.columns)), 'Owner') > -1)[0]

array([42, 44, 45, 46, 47, 48, 49, 50, 51, 52], dtype=int64)

In [62]:
owner_cols

Index(['Owner'S Business Type', 'Owner'S Business Name', 'Owner'S First Name',
       'Owner'S Last Name', 'Owner'S House Number',
       'Owner'S House Street Name', 'Owner'S House City',
       'Owner'S House State', 'Owner'S House Zip Code',
       'Owner'S Phone Number'],
      dtype='object')

In [63]:
for c in owner_cols:
    show_vals(c)

Top 10 Owner'S Business Type:

CORPORATION           791752
INDIVIDUAL            784316
PARTNERSHIP           483072
NaN                   160995
OTHER                 144842
CONDO/CO-OP            28483
OTHER GOV'T AGENCY     22525
NYCHA                   6180
HPD                     3473
HHC                      836
Name: Owner'S Business Type, dtype: int64

Top 10 Owner'S Business Name:

NaN                                 669327
NY SCHOOL CONSTRUCTION AUTHORITY     19966
HPD                                  17505
NONE                                 13668
OWNER                                12935
NYC HPD                               9877
NYC HOUSING AUTHORITY                 9726
NYC SCA                               7662
NYCHA                                 6704
-                                     6173
Name: Owner'S Business Name, dtype: int64

Top 10 Owner'S First Name:

NaN        158163
ROBERT      48131
MICHAEL     48075
JOHN        45967
JOSEPH      45317
DAVID       39

## Fixing owner's information

In [64]:
df.loc[~df["Owner'S Business Name"].isna() & df["Owner'S Business Name"].str.contains("(?i)new york city")]["Owner'S Business Name"].value_counts()

NEW YORK CITY HOUSING AUTHORITY     2713
New York City Housing Authority      536
NEW YORK CITY HPD                    299
NEW YORK CITY HOUSING                192
NEW YORK CITY SCHOOL CONSTRUCTIO     131
                                    ... 
HPD/NEW YORK CITY                      1
NEW YORK CITY HUM. RES. ADM'N.         1
NEW YORK CITY DEPT OF PARK & REC       1
NEW YORK CITY TRANSIT AUTHORITY        1
NEW YORK CITY OFF-TRACK BETTING        1
Name: Owner'S Business Name, Length: 237, dtype: int64

Normalize a few duplicate names

In [65]:
df["Owner'S Business Name"] = df["Owner'S Business Name"].str.replace("NEW YORK CITY", "NYC")
df["Owner'S Business Name"] = df["Owner'S Business Name"].str.upper()
df["Owner'S Business Name"] = df["Owner'S Business Name"].str.replace(".", '', regex=False)
df["Owner'S Business Name"] = df["Owner'S Business Name"].str.replace(",", '', regex=False)

All of these refer to the same entity. We use clustering to fix them.

In [66]:
#may have to use fuzzy/cluster to fix this problem
df.loc[~df["Owner'S Business Name"].isna() & df["Owner'S Business Name"].str.contains("(?i)HOUSING AUTHORITY")]["Owner'S Business Name"].value_counts()

NYC HOUSING AUTHORITY              14747
NEW YORK CITY HOUSING AUTHORITY      539
NEW YORK HOUSING AUTHORITY           201
NYCHOUSING AUTHORITY                 181
NY CITY HOUSING AUTHORITY            130
                                   ...  
N  Y C HOUSING AUTHORITY               1
NEW YORK    HOUSING AUTHORITY          1
NEW YOR CITY HOUSING AUTHORITY         1
HPD - NYC HOUSING AUTHORITY            1
NTC HOUSING AUTHORITY                  1
Name: Owner'S Business Name, Length: 71, dtype: int64

We use clustering further below to try to fix the rest of them.
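To illustrate the idea behind clustering name variants, here is a minimal key-collision sketch in plain Python. It is not the openclean API and not necessarily the method used later in this notebook; `fingerprint` and `key_collision_clusters` are hypothetical helpers, and the names are made-up variants.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Key for a name: uppercase, no punctuation, unique tokens sorted."""
    tokens = re.sub(r"[^\w\s]", "", value.upper()).split()
    return " ".join(sorted(set(tokens)))

def key_collision_clusters(values):
    """Group values that share a fingerprint; keep groups with 2+ members."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

names = [
    "NYC HOUSING AUTHORITY",
    "N.Y.C. HOUSING AUTHORITY",
    "HOUSING AUTHORITY NYC",
    "NYC  HOUSING AUTHORITY",
    "NTC HOUSING AUTHORITY",   # typo: not caught by key collision
]
clusters = key_collision_clusters(names)
print(clusters)
```

Key collision only catches differences in case, punctuation, spacing, and word order. Typos such as `NTC` or `NEW YOR CITY` need a fuzzy (edit-distance) comparison instead.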

In [67]:
df["Owner'S House State"].value_counts()

NY        2216175
NJ          29283
CT           3165
PA           3077
FL           2916
CA           2547
IL           1660
MA           1584
NC           1493
OH           1142
VA           1136
MD            875
TX            846
RI            649
GA            627
NV            529
DC            385
CO            369
AZ            343
NH            303
UT            220
TN            202
MI            189
WA            185
MN            165
SC            153
DE            143
MO            120
NM            118
KY            114
WI             98
IN             71
KS             57
VT             57
IA             56
NE             53
AL             40
OK             37
LA             29
AR             26
ME             25
OR             21
AK             17
ND             16
WV             12
PR             12
HI             11
SD              8
MS              8
ID              6
MT              5
CN              5
WY              3
sw              2
FQ              1
ï¿½ï¿½    

Owners can live outside New York, so these are probably fine.

In [68]:
df["Owner'S House Zip Code"].value_counts()

10038        83810
10022        69839
10017        69353
11101        68188
10019        57685
             ...  
100190118        1
100191457        1
100173144        1
98035            1
28690            1
Name: Owner'S House Zip Code, Length: 6785, dtype: int64

## Looking at Phone Numbers:

In [69]:
show_vals("Owner'S Phone Number")

Top 10 Owner'S Phone Number:

NaN           201979
7184728000     17604
2128637625     13286
2128637490      8500
2129786310      6780
0               5989
2120000000      5895
2123867490      5387
2128947000      4725
2124072400      4287
Name: Owner'S Phone Number, dtype: int64



Many rows share the same phone number

In [70]:
df["Owner'S Phone Number"] = df["Owner'S Phone Number"].astype('str')

In [71]:
df.loc[df["Owner'S Phone Number"].str.contains("7184728000")][["Owner'S First Name", "Owner'S Last Name","Owner'S Business Name", "Owner'S Phone Number"]]

Unnamed: 0,Owner'S First Name,Owner'S Last Name,Owner'S Business Name,Owner'S Phone Number
357,STEPHEN,GRANT,NY SCHOOL CONSTRUCTION AUTHORITY,7184728000
372,STEPHEN,GRANT,NY SCHOOL CONSTRUCTION AUTHORITY,7184728000
381,STEPHEN,GRANT,NY SCHOOL CONSTRUCTION AUTHORITY,7184728000
386,RAY,NAMI,NY SCHOOL CONSTRUCTION AUTHORITY,7184728000
388,STEPHEN,GRANT,NY SCHOOL CONSTRUCTION AUTHORITY,7184728000
...,...,...,...,...
2427194,WILLIAM,REISACHER,NY SCHOOL CONSTRUCTION AUTHORITY,7184728000
2427423,ANGELO,VOLONOKIS,NYC SCHOOL CONST AUTHORITY,7184728000
2427930,ELAN,ABNERI,NYC SCHOOL CONST AUTHORITY,7184728000
2428206,ELAN,ABNERI,NYC SCHOOL CONSTAUTHORITY,7184728000


These all come from the same organization (with a few spelling variants of its name), so the repeated number makes sense

In [72]:
df.loc[df["Owner'S Phone Number"]=='nan']

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta
0,BRONX,2118801,2960,WEBSTER AVENUE,201088492,4,NB,N,3274,4,...,,,,2016-01-03T00:00:00,40.86749,-73.883225,11,425,2032740001,Norwood ...
1,BRONX,2096812,100,DEKRUIF PLACE,200716298,2,A2,N,5141,120,...,,,,2016-01-03T00:00:00,40.875769,-73.828899,12,46201,2051410120,Co-op City ...
2,BRONX,2008604,1898,HARRISON AVENUE,200974650,2,A2,N,2869,87,...,,,,2016-01-03T00:00:00,40.852603,-73.911461,14,243,2028690087,University Heights-Morris Heights ...
3,BRONX,2007652,1998,MORRIS AVENUE,200278118,2,A1,N,2807,15,...,,,,2016-01-03T00:00:00,40.851661,-73.906937,14,241,2028070015,Mount Hope ...
4,BRONX,2084155,565,WEST 235 STREET,201119173,2,A2,Y,5794,484,...,,,,2016-01-03T00:00:00,40.88572,-73.91027,11,297,2057940484,Spuyten Duyvil-Kingsbridge ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2428247,STATEN ISLAND,5133377,52,FABER STREET,500185015,2,NB,Y,1076,51,...,,,,2016-01-03T00:00:00,40.639159,-74.135194,49,207,5010760051,Port Richmond ...
2428310,STATEN ISLAND,5000064,18,RICHMOND TERRACE,500332696,3,A2,N,7,12,...,,,,2016-01-03T00:00:00,40.643004,-74.075772,49,3,5000070012,West New Brighton-New Brighton-St. George ...
2428352,STATEN ISLAND,5007607,234,LAWRENCE AVENUE,500023379,1,A3,N,280,54,...,NY,10314,,2016-01-03T00:00:00,40.623854,-74.106291,49,121,5002800054,New Brighton-Silver Lake ...
2428401,STATEN ISLAND,5005991,311,TAYLOR STREET,500083526,2,A1,N,222,25,...,,,,2016-01-03T00:00:00,40.631106,-74.122669,49,125,5002220025,West New Brighton-New Brighton-St. George ...


Nothing looks wrong with these jobs; they simply have no owner's phone number

In [73]:
df.loc[~df["Owner'S Phone Number"].isna() & df["Owner'S Phone Number"].str.contains("-")]["Owner'S Phone Number"].value_counts()

71822936-3    63
-212386749     8
71835636-7     7
-718446591     7
-212789118     4
              ..
-718859650     1
191-15         1
-718270499     1
718886-455     1
2128-68306     1
Name: Owner'S Phone Number, Length: 82, dtype: int64

Phone numbers should not contain "-"

In [74]:
df.loc[~df["Owner'S Phone Number"].isna() & df["Owner'S Phone Number"].str.contains(" ")]["Owner'S Phone Number"].value_counts()

212 929318    46
718  42334    30
212 226650    17
718  88876    10
718   3891    10
              ..
2129252 70     1
212787  69     1
212  29389     1
212593  37     1
212960  52     1
Name: Owner'S Phone Number, Length: 209, dtype: int64

Phone numbers should not contain spaces

### Cleaning phone number

#### Keeps only the first run of consecutive digits

In [75]:

df["Owner'S Phone Number"] = df["Owner'S Phone Number"].str.extract(r'(\d+)', expand=False)
df.loc[~df["Owner'S Phone Number"].isna() & df["Owner'S Phone Number"].str.contains(" ")]["Owner'S Phone Number"].value_counts()

Series([], Name: Owner'S Phone Number, dtype: int64)
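Note that `str.extract` with `(\d+)` keeps the first run of digits and discards the rest of the string. The sketch below, on a toy Series built from values seen above, compares that with deleting every non-digit character instead. The second variant is only an alternative we could have used, not what the cell above ran.

```python
import pandas as pd

phones = pd.Series(["71822936-3", "212 929318", "-212386749", "7184728000"])

# What the cell above does: keep the first run of consecutive digits.
first_run = phones.str.extract(r"(\d+)", expand=False)

# Alternative: delete every non-digit character and keep all the digits.
digits_only = phones.str.replace(r"\D", "", regex=True)

print(pd.DataFrame({"raw": phones, "first_run": first_run, "digits_only": digits_only}))
```

Either way these malformed values end up with fewer than 10 digits, so the length check in the next step still turns them into missing values.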

#### Turns phone numbers that start with 0 or 1, or that do not have 10 digits, into nan

In [76]:
df["Owner'S Phone Number"] = df["Owner'S Phone Number"].astype('str')
df.loc[~df["Owner'S Phone Number"].isna() & ((df["Owner'S Phone Number"].str[0] == "0") | (df["Owner'S Phone Number"].str[0] == "1") | (df["Owner'S Phone Number"].apply(len) != 10)), ["Owner'S Phone Number"]] = np.nan
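The same rule can be written as one regular expression: exactly 10 digits, with a first digit from 2 to 9. This is a sketch on a toy Series rather than a replacement for the cell above; it assumes pandas 1.1 or later for `str.fullmatch`.

```python
import pandas as pd

phones = pd.Series(["7184728000", "0", "2120000000", "1234567890", "71822936", None])

# Valid: 10 digits, first digit 2-9. Missing values count as invalid (na=False).
valid = phones.str.fullmatch(r"[2-9]\d{9}", na=False)

# Keep valid numbers, turn everything else into NaN.
cleaned = phones.where(valid)
print(cleaned)
```

A placeholder such as `2120000000` passes this rule, just as it passes the rule in the cell above.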

#### Checks to see if there are any others not of length 10

In [77]:
df["Owner'S Phone Number"] = df["Owner'S Phone Number"].astype('str')
df.loc[(df["Owner'S Phone Number"].apply(len) != 10)]["Owner'S Phone Number"]

0          nan
1          nan
2          nan
3          nan
4          nan
          ... 
2428247    nan
2428310    nan
2428352    nan
2428401    nan
2428525    nan
Name: Owner'S Phone Number, Length: 213116, dtype: object

#### Check for non-numeric characters

In [78]:
df.loc[(~df["Owner'S Phone Number"].str.isnumeric()) & (~(df["Owner'S Phone Number"]=='nan'))]["Owner'S Phone Number"]

Series([], Name: Owner'S Phone Number, dtype: object)

### Checking additional numerical columns for coherency

## Districts

### Looking at special districts

In [79]:
show_vals("Special District 1")
show_vals("Special District 2")

Top 10 Special District 1:

NaN    2241932
MID      46452
SR       20816
OP       10133
PI        9922
SRD       8645
LM        7415
BR        6028
CL        5938
MP        4810
Name: Special District 1, dtype: int64

Top 10 Special District 2:

NaN      2425885
LH-1A       1481
PI           416
LH-1         241
MP           166
TA           154
MID           80
HY            66
DB            15
MX-9           6
Name: Special District 2, dtype: int64



In [80]:
#Checks to see if there are lower case values
df.loc[~df["Special District 1"].isna() & df["Special District 1"].str.islower()]

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta


### Analysis

Special Districts didn't have any noticeable values that were out of place

## Quick look at GIS

In [81]:
show_vals("Latitude")
show_vals("Longitude")
show_vals("Council District")
show_vals("Census Tract")
show_vals("Nta")
show_vals("Bbl")

Top 10 Latitude:

NaN          9516
40.754162    1973
40.748342    1935
40.751098    1820
40.758754    1626
40.758171    1419
40.733848    1415
40.76402     1414
40.703597    1396
40.582305    1392
Name: Latitude, dtype: int64

Top 10 Longitude:

NaN           9516
-73.976557    1973
-73.984643    1952
-73.992926    1808
-73.978692    1628
-73.973246    1425
-73.973189    1414
-74.009781    1396
-74.169053    1392
-73.871578    1386
Name: Longitude, dtype: int64

Top 10 Council District:

4     302821
3     182073
1     160263
2      93817
6      81411
33     71819
51     61214
5      61194
19     58197
50     51271
Name: Council District, dtype: int64

Top 10 Census Tract:

7      20042
33     18829
102    17524
104    17081
92     16222
9      16054
137    15909
96     15623
119    15101
94     14953
Name: Census Tract, dtype: int64

Top 10 Nta:

Midtown-Midtown South                                                          182177
Upper East Side-Carnegie Hill                        

In [82]:
#Manually looking at some of these
df[["Latitude", "Longitude", "Council District", "Census Tract", "Nta", "Bbl"]]

Unnamed: 0,Latitude,Longitude,Council District,Census Tract,Nta,Bbl
0,40.86749,-73.883225,11,425,Norwood ...,2032740001
1,40.875769,-73.828899,12,46201,Co-op City ...,2051410120
2,40.852603,-73.911461,14,243,University Heights-Morris Heights ...,2028690087
3,40.851661,-73.906937,14,241,Mount Hope ...,2028070015
4,40.88572,-73.91027,11,297,Spuyten Duyvil-Kingsbridge ...,2057940484
...,...,...,...,...,...,...
2428521,40.614117,-74.16564,50,29104,New Springville-Bloomfield-Travis ...,5015990010
2428522,40.561562,-74.108193,50,12806,Oakwood-Oakwood Beach ...,5040550037
2428523,40.60223,-74.091355,50,50,Grasmere-Arrochar-Ft. Wadsworth ...,5031710001
2428524,40.614249,-74.075822,49,36,Stapleton-Rosebank ...,5029680033


In [83]:
#shouldn't be 0
df["Latitude"] = df["Latitude"].astype('float')
df["Latitude"].min()

40.498628

In [84]:
df["Latitude"].max()

40.913711

In [85]:
df.loc[df["Latitude"] ==40.913711]

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta
60668,BRONX,2096790,600B,DEPEYSTER STREET,220026816,1,A2,Y,5958,65,...,NY,10705,9143758703,2016-01-03T00:00:00,40.913711,-73.904845,11,319,2059580065,North Riverdale-Fieldston-Riverdale ...
87369,BRONX,2096790,600B,DEPEYSTER STREET,220038750,1,A1,N,5958,65,...,NY,10705,9143758703,2016-01-03T00:00:00,40.913711,-73.904845,11,319,2059580065,North Riverdale-Fieldston-Riverdale ...
159008,BRONX,2096790,600B,DEPEYSTER STREET,220026816,1,A2,Y,5958,65,...,NY,10705,9143758703,2016-01-03T00:00:00,40.913711,-73.904845,11,319,2059580065,North Riverdale-Fieldston-Riverdale ...
200641,BRONX,2096790,600B,DEPEYSTER STREET,220026816,1,A2,Y,5958,65,...,NY,10705,9143758703,2016-01-03T00:00:00,40.913711,-73.904845,11,319,2059580065,North Riverdale-Fieldston-Riverdale ...


#### The min and max make sense, as the values range from Staten Island to the Bronx

In [86]:
df["Longitude"] = df["Longitude"].astype('float')
df["Longitude"].min()

-74.254825

In [87]:
df["Longitude"].max()

-73.700376

In [88]:
df.loc[df["Longitude"] == -73.700376]

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta
1905673,QUEENS,4179641,270-03,HILLSIDE AVENUE,420236097,1,A2,N,8781,101,...,NY,11001,5163540656,2016-01-03T00:00:00,40.739112,-73.700376,23,157901,4087810101,Glen Oaks-Floral Park-New Hyde Park ...
1990793,QUEENS,4179641,270-03,HILLSIDE AVENUE,420236097,1,A2,N,8781,101,...,NY,11001,5163540656,2016-01-03T00:00:00,40.739112,-73.700376,23,157901,4087810101,Glen Oaks-Floral Park-New Hyde Park ...
2090103,QUEENS,4179641,270-03,HILLSIDE AVENUE,420236097,1,A2,N,8781,101,...,NY,11001,5163540656,2016-01-03T00:00:00,40.739112,-73.700376,23,157901,4087810101,Glen Oaks-Floral Park-New Hyde Park ...


#### These longitudes and latitudes range from Queens to Staten Island which is also consistent with our dataset
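Instead of eyeballing the min and max, the same check can be written as a bounding-box filter. The sketch below uses a toy frame; `LAT_RANGE` and `LON_RANGE` are rough bounds we picked to sit just outside the min and max seen above, so they are assumptions rather than official city limits.

```python
import pandas as pd

# Assumed rough bounding box around NYC (slightly wider than the observed min/max).
LAT_RANGE = (40.4, 41.0)
LON_RANGE = (-74.3, -73.6)

coords = pd.DataFrame({
    "Latitude": [40.86749, 0.0, 40.739112, None],
    "Longitude": [-73.883225, 0.0, -73.700376, None],
})

# between() is inclusive and returns False for missing values.
in_box = coords["Latitude"].between(*LAT_RANGE) & coords["Longitude"].between(*LON_RANGE)
has_coords = coords[["Latitude", "Longitude"]].notna().all(axis=1)

# Rows that have coordinates but fall outside the box deserve a closer look.
suspicious = coords[has_coords & ~in_box]
print(suspicious)
```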

In [89]:
df["Council District"] = df["Council District"].astype('float')
df["Council District"].min()

1.0

In [90]:
df["Council District"].max()

51.0

#### 1-51 are all valid districts

In [91]:
df["Census Tract"] = df["Census Tract"].astype('float')
df["Census Tract"].min()

1.0

In [92]:
df["Census Tract"].max()

157903.0

In [93]:
df.loc[df["Census Tract"] == 157903]

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta
1747331,QUEENS,4180642,87-33,258 STREET,400596146,1,A2,N,8826,19,...,NY,11368,7184709081,2016-01-03T00:00:00,40.730361,-73.707679,23.0,157903.0,4088260019,Glen Oaks-Floral Park-New Hyde Park ...
1755777,QUEENS,4179705,255-16,85 AVENUE,401278452,1,A3,Y,8785,12,...,NY,11001,7183430276,2016-01-03T00:00:00,40.734289,-73.712085,23.0,157903.0,4087850012,Glen Oaks-Floral Park-New Hyde Park ...
1755900,QUEENS,4180509,86-52,260 STREET,400895938,1,A3,Y,8818,62,...,NY,11010,7182257755,2016-01-03T00:00:00,40.732560,-73.706561,23.0,157903.0,4088180062,Glen Oaks-Floral Park-New Hyde Park ...
1756213,QUEENS,4179761,84-35,256 STREET,420519727,1,A2,N,8786,71,...,NY,11004,7186260205,2016-01-03T00:00:00,40.735996,-73.711714,23.0,157903.0,4087860071,Glen Oaks-Floral Park-New Hyde Park ...
1756250,QUEENS,4827043,255-41,JAMAICA AVENUE,401428594,1,A2,N,8830,50,...,NY,11001,7183434616,2016-01-03T00:00:00,40.727421,-73.709622,23.0,157903.0,,Glen Oaks-Floral Park-New Hyde Park ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270050,QUEENS,4439982,87-06,259 STREET,400258537,1,NB,N,8826,39,...,NY,11791,5169220223,2016-01-03T00:00:00,40.731175,-73.707022,23.0,157903.0,4088260039,Glen Oaks-Floral Park-New Hyde Park ...
2270521,QUEENS,4179958,263-05,EAST WILLISTON AVENUE,420461627,1,DM,N,8793,4,...,NY,11001,7189626227,2016-01-03T00:00:00,40.736722,-73.704955,23.0,157903.0,4087930004,Glen Oaks-Floral Park-New Hyde Park ...
2270808,QUEENS,4180369,86-56,256 STREET,400853536,1,A2,N,8814,53,...,NY,11001,7183436749,2016-01-03T00:00:00,40.731864,-73.710198,23.0,157903.0,4088140053,Glen Oaks-Floral Park-New Hyde Park ...
2271424,QUEENS,4180237,85-15,263 STREET,400744637,1,A1,N,8806,43,...,NY,11001,7184701811,2016-01-03T00:00:00,40.736006,-73.704854,23.0,157903.0,4088060043,Glen Oaks-Floral Park-New Hyde Park ...


#### No irregular values for Census Tract

In [94]:
df["Bbl"] = df["Bbl"].astype('float')
df["Bbl"].min()

0.0

In [95]:
df["Bbl"] = df["Bbl"].astype('float')
df["Bbl"].max()

5200149999.0

In [96]:
df.loc[df["Bbl"] == 5200149999.0]

Unnamed: 0,Borough,Bin,Number,Street,Job Number,Job Doc Number,Job Type,Self Cert,Block,Lot,...,Owner'S House State,Owner'S House Zip Code,Owner'S Phone Number,D O B Run Date,Latitude,Longitude,Council District,Census Tract,Bbl,Nta
2317092,STATEN ISLAND,5164775,13,CLARKE AVENUE,500029248,1,A2,N,4391,34,...,NY,11101,7187296500,2016-01-03T00:00:00,40.564611,-74.131919,50.0,138.0,5200150000.0,Oakwood-Oakwood Beach ...
2324162,STATEN ISLAND,5164775,13,CLARKE AVENUE,500752116,1,A2,N,4395,34,...,NY,11101,7187296500,2016-01-03T00:00:00,40.564611,-74.131919,50.0,138.0,5200150000.0,Oakwood-Oakwood Beach ...
2334268,STATEN ISLAND,5164775,13,CLARKE AVENUE,500262592,1,A2,N,7391,34,...,NY,11753,5163386000,2016-01-03T00:00:00,40.564611,-74.131919,50.0,138.0,5200150000.0,Oakwood-Oakwood Beach ...
2401658,STATEN ISLAND,5164775,13,CLARKE AVENUE,500259837,1,A2,N,4391,34,...,NY,11753,5163386000,2016-01-03T00:00:00,40.564611,-74.131919,50.0,138.0,5200150000.0,Oakwood-Oakwood Beach ...
2417700,STATEN ISLAND,5164775,13,CLARKE AVENUE,500752116,1,A2,N,4395,34,...,NY,11101,7187296500,2016-01-03T00:00:00,40.564611,-74.131919,50.0,138.0,5200150000.0,Oakwood-Oakwood Beach ...


#### Same location but different jobs, so this is probably fine
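A stricter check than min and max is to split each BBL into its parts. The sketch below assumes the usual 10-digit layout (1 digit borough, 5 digits block, 4 digits lot) and works on the string form, since casting to float hides leading zeros and placeholder values. `split_bbl` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def split_bbl(bbl):
    """Hypothetical helper: return (borough, block, lot) or None if malformed."""
    s = str(bbl).strip()
    # Assumed layout: 10 digits, borough code 1-5 first.
    if len(s) != 10 or not s.isdigit() or s[0] not in "12345":
        return None
    return int(s[0]), int(s[1:6]), int(s[6:])

# A Bbl value from the rows shown earlier (Bronx, Block 5141, Lot 120).
print(split_bbl("2051410120"))
# A placeholder like the minimum of 0 found above is rejected.
print(split_bbl("0"))
```

The parts can then be compared against the `Block` and `Lot` columns to spot rows where the two disagree.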

# Data Profiling for datetime columns


Find format problems and outliers in all datetime columns

Using openclean's sklearn modules to detect problems and outliers

In [97]:
from openclean.profiling.anomalies.sklearn import DBSCANOutliers

def findDateOutliers(column_name, eps_setting = 0.05):
    datetime_data = ds.distinct(column_name)
    print("Column: ",column_name)
    
    for rank, val in enumerate(datetime_data.most_common(10)):        
        st, freq = val
        print('{:<3} {:>8}  {:>10}'.format('{}.'.format(rank + 1), st, '{:,}'.format(freq)))

    print('\nTotal number of distinct values in {} is {}'.format(column_name, len(datetime_data)))
    print(DBSCANOutliers().find(datetime_data))
    print(DBSCANOutliers(eps = eps_setting).find(datetime_data))
    print('\n==================================')

In [98]:
date_cols = []

print("Datetime Data columns:\n")
for col in ds.columns:
    if 'Date' in col or 'DATE' in col:
        print(col)
        date_cols.append(col)

print("----------------------------\n")        
        
for col in date_cols:
    findDateOutliers(col, 0.02)

Datetime Data columns:

Filing Date
Issuance Date
Expiration Date
Job Start Date
DOBRunDate
----------------------------

Column:  Filing Date
1.  2007-03-29T00:00:00         998
2.  2007-03-30T00:00:00         981
3.  2006-12-28T00:00:00         927
4.  2008-01-07T00:00:00         920
5.  2007-12-28T00:00:00         900
6.  2008-04-04T00:00:00         896
7.  2011-03-30T00:00:00         893
8.  2008-06-27T00:00:00         890
9.  2011-03-31T00:00:00         887
10. 2007-01-04T00:00:00         878

Total number of distinct values in Filing Date is 6415
[]
['2006-12-28T00:00:00', '2000-02-02T00:00:00', '2007-03-30T00:00:00', '2011-03-30T00:00:00', '2001-12-01T00:00:00', '2011-11-10T00:00:00', '2008-01-07T00:00:00', '2011-04-01T00:00:00', '2002-02-20T00:00:00', '2000-02-22T00:00:00', '2011-03-31T00:00:00', '2008-06-19T00:00:00', '2002-02-02T00:00:00', '2006-02-02T00:00:00', '2007-01-04T00:00:00', '2002-02-22T00:00:00', '2003-02-22T00:00:00', '2008-06-27T00:00:00', '2001-01-20T00:00:00', 

# Remember that after changing some of the column names, there are some columns that are also datetime data:

"Paid": "Paid Date"\
"Fully Paid": "Fully Paid Date"\
"Assigned": "Assigned Date"\
"Approved": "Approved Date"\
"Pre- Filing Date": "Pre-Filing Date"\
"DOBRunDate": "D O B Run Date"\
"SIGNOFF_DATE": "Signoff Date"\
"SPECIAL_ACTION_DATE": "Special Action Date"\

In [99]:
# takes awhile to run

In [100]:
date_cols = ["Filing Date","Issuance Date","Expiration Date","Job Start Date", "DOBRunDate"]

for col in date_cols:
    findDateOutliers(col, 0.02)

Column:  Filing Date
1.  2007-03-29T00:00:00         998
2.  2007-03-30T00:00:00         981
3.  2006-12-28T00:00:00         927
4.  2008-01-07T00:00:00         920
5.  2007-12-28T00:00:00         900
6.  2008-04-04T00:00:00         896
7.  2011-03-30T00:00:00         893
8.  2008-06-27T00:00:00         890
9.  2011-03-31T00:00:00         887
10. 2007-01-04T00:00:00         878

Total number of distinct values in Filing Date is 6415
[]
['2006-12-28T00:00:00', '2000-02-02T00:00:00', '2007-03-30T00:00:00', '2011-03-30T00:00:00', '2001-12-01T00:00:00', '2011-11-10T00:00:00', '2008-01-07T00:00:00', '2011-04-01T00:00:00', '2002-02-20T00:00:00', '2000-02-22T00:00:00', '2011-03-31T00:00:00', '2008-06-19T00:00:00', '2002-02-02T00:00:00', '2006-02-02T00:00:00', '2007-01-04T00:00:00', '2002-02-22T00:00:00', '2003-02-22T00:00:00', '2008-06-27T00:00:00', '2001-01-20T00:00:00', '1994-05-27T00:00:00', '2008-04-03T00:00:00', '2007-12-28T00:00:00', '2007-03-29T00:00:00', '2007-07-02T00:00:00', '2010-1

# Data Cleaning for outliers in datetime columns

## Fixing Datetime columns format

In [101]:
datetime_column_list =["Filing Date","Issuance Date","Expiration Date","Job Start Date", "D O B Run Date"]
for col in datetime_column_list:
    show_vals(col)

Top 10 Filing Date:

2007-03-29T00:00:00    998
2007-03-30T00:00:00    981
2006-12-28T00:00:00    927
2008-01-07T00:00:00    920
2007-12-28T00:00:00    900
2008-04-04T00:00:00    896
2011-03-30T00:00:00    893
2008-06-27T00:00:00    890
2011-03-31T00:00:00    887
2007-01-04T00:00:00    878
Name: Filing Date, dtype: int64

Top 10 Issuance Date:

2007-03-29T00:00:00    994
2007-03-30T00:00:00    959
2006-12-28T00:00:00    947
2007-12-28T00:00:00    918
2008-06-27T00:00:00    909
2008-04-03T00:00:00    895
2011-03-31T00:00:00    894
2008-01-07T00:00:00    873
2006-01-05T00:00:00    869
2011-03-30T00:00:00    869
Name: Issuance Date, dtype: int64

Top 10 Expiration Date:

2007-12-31T00:00:00    18638
2006-12-31T00:00:00    18065
2005-12-31T00:00:00    16359
2004-12-31T00:00:00    13974
2009-04-01T00:00:00    11426
2003-12-31T00:00:00    10472
2010-04-01T00:00:00    10205
2007-04-01T00:00:00     9778
2008-04-01T00:00:00     9732
2006-04-01T00:00:00     9599
Name: Expiration Date, dtype: int

Check to see if any columns have values in year-month-day format

In [102]:
for col in datetime_column_list:
    print(col, '\n', df.loc[df[col].str.contains('-', regex=False, na=False)][col], '\n\n')

Filing Date 
 0          2010-11-05T00:00:00
1          2012-01-30T00:00:00
2          2008-02-04T00:00:00
3          1998-08-31T00:00:00
4          2007-04-30T00:00:00
                  ...         
2428521    2003-10-08T00:00:00
2428522    1996-07-29T00:00:00
2428523    1999-07-09T00:00:00
2428524    1996-06-25T00:00:00
2428525    1999-09-20T00:00:00
Name: Filing Date, Length: 2428524, dtype: object 


Issuance Date 
 0          2010-11-05T00:00:00
1          2012-01-30T00:00:00
2          2008-02-04T00:00:00
3          1998-08-31T00:00:00
4          2007-04-30T00:00:00
                  ...         
2428521    2003-10-08T00:00:00
2428522    1997-07-28T00:00:00
2428523    1999-07-09T00:00:00
2428524    1996-06-25T00:00:00
2428525    1999-09-20T00:00:00
Name: Issuance Date, Length: 2428524, dtype: object 


Expiration Date 
 0          2011-11-05T00:00:00
1          2013-01-29T00:00:00
2          2009-02-03T00:00:00
3          1999-08-31T00:00:00
4          2008-01-08T00:00:00
       

#### Fix D O B Run Date formatting issues

D O B Run Date and Filing Date both have values in multiple datetime formats

We want all the dates in a standard year-month-day datetime format, so we'll separately convert the values that are not in this format and the values that are, to datetime objects


In [103]:
df.loc[~df['D O B Run Date'].str.contains('-', regex=False), 'D O B Run Date'] = pd.to_datetime(
    df.loc[~df['D O B Run Date'].str.contains('-', regex=False)]['D O B Run Date'], infer_datetime_format=True, errors='coerce')

That converts the values that are not in year-month-day format; now we just need to convert the rest

In [104]:
df.loc[:,'D O B Run Date'] = pd.to_datetime(
    df['D O B Run Date'], infer_datetime_format=True, errors='coerce')
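The two-step conversion can be sketched on a toy Series as follows. We assume here that the non-ISO values look like `MM/DD/YYYY`; that format is a guess for illustration, so check the actual values before reusing it. Explicit formats are used instead of `infer_datetime_format` so that each group is parsed the same way on every run.

```python
import pandas as pd

raw = pd.Series(["2016-01-03T00:00:00", "01/03/2016", "not a date"])

# Values containing "-" are treated as ISO-style timestamps.
iso_mask = raw.str.contains("-", regex=False)

# Start from an all-NaT datetime column and fill each group with its own format.
parsed = pd.Series(pd.NaT, index=raw.index, dtype="datetime64[ns]")
parsed[iso_mask] = pd.to_datetime(raw[iso_mask], format="%Y-%m-%dT%H:%M:%S", errors="coerce")
parsed[~iso_mask] = pd.to_datetime(raw[~iso_mask], format="%m/%d/%Y", errors="coerce")

print(parsed)
```

Values that match neither format become `NaT` because of `errors="coerce"`, so it is worth counting them afterwards.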

In [105]:
df['D O B Run Date']

0         2016-01-03
1         2016-01-03
2         2016-01-03
3         2016-01-03
4         2016-01-03
             ...    
2428521   2016-01-03
2428522   2016-01-03
2428523   2016-01-03
2428524   2016-01-03
2428525   2016-01-03
Name: D O B Run Date, Length: 2428524, dtype: datetime64[ns]

In [106]:
df['D O B Run Date'].value_counts()

2016-01-03    2428524
Name: D O B Run Date, dtype: int64

#### Fix Filing Date formatting issues

We'll use the same method as above

In [107]:
df.loc[~df['Filing Date'].str.contains('-', regex=False), 'Filing Date'] = pd.to_datetime(
    df.loc[~df['Filing Date'].str.contains('-', regex=False)]['Filing Date'], infer_datetime_format=True, errors='coerce')

In [108]:
df.loc[:,'Filing Date'] = pd.to_datetime(
    df['Filing Date'], infer_datetime_format=True, errors='coerce')

In [109]:
df['Filing Date']

0         2010-11-05
1         2012-01-30
2         2008-02-04
3         1998-08-31
4         2007-04-30
             ...    
2428521   2003-10-08
2428522   1996-07-29
2428523   1999-07-09
2428524   1996-06-25
2428525   1999-09-20
Name: Filing Date, Length: 2428524, dtype: datetime64[ns]

In [110]:
df['Filing Date'].value_counts()

2007-03-29    998
2007-03-30    981
2006-12-28    927
2008-01-07    920
2007-12-28    900
             ... 
2011-11-08      1
2012-02-25      1
2011-12-24      1
2008-09-06      1
2003-10-18      1
Name: Filing Date, Length: 6415, dtype: int64

These should all be proper datetime64[ns] columns now:

In [111]:
df.select_dtypes(include='datetime')

Unnamed: 0,Filing Date,D O B Run Date
0,2010-11-05,2016-01-03
1,2012-01-30,2016-01-03
2,2008-02-04,2016-01-03
3,1998-08-31,2016-01-03
4,2007-04-30,2016-01-03
...,...,...
2428521,2003-10-08,2016-01-03
2428522,1996-07-29,2016-01-03
2428523,1999-07-09,2016-01-03
2428524,1996-06-25,2016-01-03


In [112]:
for col in datetime_column_list:
    show_vals(col)

Top 10 Filing Date:

2007-03-29    998
2007-03-30    981
2006-12-28    927
2008-01-07    920
2007-12-28    900
2008-04-04    896
2011-03-30    893
2008-06-27    890
2011-03-31    887
2007-01-04    878
Name: Filing Date, dtype: int64

Top 10 Issuance Date:

2007-03-29T00:00:00    994
2007-03-30T00:00:00    959
2006-12-28T00:00:00    947
2007-12-28T00:00:00    918
2008-06-27T00:00:00    909
2008-04-03T00:00:00    895
2011-03-31T00:00:00    894
2008-01-07T00:00:00    873
2006-01-05T00:00:00    869
2011-03-30T00:00:00    869
Name: Issuance Date, dtype: int64

Top 10 Expiration Date:

2007-12-31T00:00:00    18638
2006-12-31T00:00:00    18065
2005-12-31T00:00:00    16359
2004-12-31T00:00:00    13974
2009-04-01T00:00:00    11426
2003-12-31T00:00:00    10472
2010-04-01T00:00:00    10205
2007-04-01T00:00:00     9778
2008-04-01T00:00:00     9732
2006-04-01T00:00:00     9599
Name: Expiration Date, dtype: int64

Top 10 Job Start Date:

2008-06-27T00:00:00    1376
2008-06-25T00:00:00    1095
2007-0

# Data Profiling for City and Other Description

Find format problems and outliers in City and Description columns

Using openclean's sklearn modules to detect problems and outliers

In [113]:
from openclean.profiling.anomalies.sklearn import DBSCANOutliers

# Print the ten most frequent values of the given column, then its DBSCAN outliers.
def findDateOutliers(column_name, eps_setting = 0.05):
    applicant_data = ds.distinct(column_name)
    print("Column: ",column_name)
    
    for rank, val in enumerate(applicant_data.most_common(10)):        
        st, freq = val
        print('{:<3} {:>8}  {:>10}'.format('{}.'.format(rank + 1), st, '{:,}'.format(freq)))

    print('\nTotal number of distinct values in {} is {}'.format(column_name, len(applicant_data)))
    print(DBSCANOutliers(eps = eps_setting).find(applicant_data))
    print('\n==================================')

# Data Cleaning for Applicant columns

* How to deal with empty values has not been decided yet

# Transform all city names to upper case

### Remember that we have changed some column names:
"City ": "Owner's House City"\
"State": "Owner’s House State"

In [114]:
df["Owner'S House City"] = df["Owner'S House City"].str.upper()

# Convert similar values to suggested value using kNN clustering

In [115]:
# Cluster string using kNN clusterer (with the default n-gram setting)
# using the Levenshtein distance as the similarity measure.

from openclean.cluster.knn import knn_clusters
from openclean.function.similarity.base import SimilarityConstraint
from openclean.function.similarity.text import LevenshteinDistance
from openclean.function.value.threshold import GreaterThan

def getClusters(col, minsize = 2, preds = 0.5):
    dba = ds.select(col).distinct()
    clusters = knn_clusters(
        values=dba,
        sim=SimilarityConstraint(func=LevenshteinDistance(), pred=GreaterThan(preds)),
        minsize=minsize
    )
    return clusters

def print_cluster(cnumber, cluster):
    item_count = 0
    print('Cluster {} (of size {})\n'.format(cnumber, len(cluster)))
    for val, count in cluster.items():
        item_count += 1
        if item_count <= 10:
            print('{} ({})'.format(val, count))
    if item_count>10:
        print(".......{} more items".format(item_count-10))
    print('\nSuggested value: {}\n\n'.format(cluster.suggestion()))

def updateUsingClusters(col, clusters, isPrint = False):
    # Collect (value -> suggestion) pairs across ALL clusters, largest cluster first.
    original_list = []
    suggestion_list = []
    clusters.sort(key=lambda c: len(c), reverse=True)

    for i, cluster in enumerate(clusters):
        suggestion = cluster.suggestion()
        if isPrint and i < 5:
            print_cluster(i, cluster)

        for val, count in cluster.items():
            original_list.append(val)
            suggestion_list.append(suggestion)

    # Replace once, after the loop. The lists are no longer reset per cluster,
    # which previously meant only the last cluster was applied.
    df[col] = df[col].replace(original_list, suggestion_list)
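To show the replacement step on its own, here is a small sketch that does not need openclean. Each cluster is stood in for by a plain `Counter` of value to frequency, and the most frequent value plays the role of the suggestion. The clusters, counts and the `BROOKYLN` typo are made up for illustration.

```python
from collections import Counter
import pandas as pd

# Stand-ins for clusters: value -> frequency (toy numbers).
clusters = [
    Counter({"NEW YORK": 1000, "NEW YORK,": 40}),
    Counter({"BROOKLYN": 500, "BROOKYLN": 3}),
]

# Build one mapping across all clusters before touching the data.
mapping = {}
for cluster in clusters:
    suggestion = cluster.most_common(1)[0][0]
    for value in cluster:
        if value != suggestion:
            mapping[value] = suggestion

cities = pd.Series(["NEW YORK,", "BROOKYLN", "BRONX"])
cleaned = cities.replace(mapping)
print(cleaned.tolist())
```

Building the full mapping first and calling `replace` once means every cluster is applied, and values outside any cluster are left untouched.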

In [116]:
# clustering was too slow to run here, so we didn't use it

# Check State Column

In [117]:
df.columns

Index(['Borough', 'Bin', 'Number', 'Street', 'Job Number', 'Job Doc Number',
       'Job Type', 'Self  Cert', 'Block', 'Lot', 'Community Board', 'Postcode',
       'Bldg Type', 'Residential', 'Special District 1', 'Special District 2',
       'Work Type', 'Permit Status', 'Filing Status', 'Permit Type',
       'Permit Sequence Number', 'Permit Subtype', 'Oil Gas', 'Site Fill',
       'Filing Date', 'Issuance Date', 'Expiration Date', 'Job Start Date',
       'Permittee'S First Name', 'Permittee'S Last Name',
       'Permittee'S Business Name', 'Permittee'S Phone Number',
       'Permittee'S License Type', 'Permittee'S License Number',
       'Act As Superintendent', 'Permittee'S Other Title', 'H I C License',
       'Site Safety Mgr'S First Name', 'Site Safety Mgr'S Last Name',
       'Site Safety Mgr Business Name', 'Superintendent First & Last Name',
       'Superintendent Business Name', 'Owner'S Business Type',
       'Numbern- Profit', 'Owner'S Business Name', 'Owner'S First Name'

In [118]:
ds.columns

['BOROUGH',
 'BIN',
 'Number',
 'Street',
 'Job #',
 'Job doc. #',
 'Job Type',
 'Self_Cert',
 'Block',
 'Lot',
 'Community Board',
 'Postcode',
 'Bldg Type',
 'Residential',
 'Special District 1',
 'Special District 2',
 'Work Type',
 'Permit Status',
 'Filing Status',
 'Permit Type',
 'Permit Sequence #',
 'Permit Subtype',
 'Oil Gas',
 'Site Fill',
 'Filing Date',
 'Issuance Date',
 'Expiration Date',
 'Job Start Date',
 "Permittee's First Name",
 "Permittee's Last Name",
 "Permittee's Business Name",
 "Permittee's Phone #",
 "Permittee's License Type",
 "Permittee's License #",
 'Act as Superintendent',
 "Permittee's Other Title",
 'HIC License',
 "Site Safety Mgr's First Name",
 "Site Safety Mgr's Last Name",
 'Site Safety Mgr Business Name',
 'Superintendent First & Last Name',
 'Superintendent Business Name',
 "Owner's Business Type",
 'Non-Profit',
 "Owner's Business Name",
 "Owner's First Name",
 "Owner's Last Name",
 "Owner's House #",
 "Owner's House Street Name",
 'Ownerâ€™

In [119]:
state_col = "Ownerâ€™s House State"
findDateOutliers(state_col, 0.1)

Column:  Ownerâ€™s House State
1.        NY   2,216,177
2.               157,133
3.        NJ      29,283
4.        CT       3,165
5.        PA       3,077
6.        FL       2,916
7.        CA       2,547
8.        IL       1,660
9.        MA       1,584
10.       NC       1,493

Total number of distinct values in Ownerâ€™s House State is 58
['', 'NY', 'Ã¯Â¿Â½Ã¯Â¿Â½']



In [120]:
ds.select('Ownerâ€™s House State').distinct()

Counter({'': 157133,
         'NY': 2216177,
         'PA': 3077,
         'NJ': 29283,
         'CT': 3165,
         'IL': 1660,
         'NC': 1493,
         'MN': 165,
         'KY': 114,
         'MA': 1584,
         'GA': 627,
         'FL': 2916,
         'TX': 846,
         'SC': 153,
         'AZ': 343,
         'OH': 1142,
         'WA': 185,
         'CA': 2547,
         'UT': 220,
         'NM': 118,
         'MD': 875,
         'VA': 1136,
         'TN': 202,
         'NV': 529,
         'CO': 369,
         'RI': 649,
         'MI': 189,
         'MS': 8,
         'PR': 12,
         'AR': 26,
         'WI': 98,
         'NH': 303,
         'KS': 57,
         'DC': 385,
         'IN': 71,
         'MO': 120,
         'WY': 3,
         'AK': 17,
         'CN': 5,
         'WV': 12,
         'DE': 143,
         'IA': 56,
         'VT': 57,
         'NE': 53,
         'ID': 6,
         'ME': 25,
         'MT': 5,
         'LA': 29,
         'AL': 40,
         'FQ': 1,
         

# Find functional dependency violations on City -> State
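A functional dependency City -> State holds when every city value appears with exactly one state value. As a rough pandas-only illustration of what counts as a violation, the sketch below groups a toy frame by city and keeps the cities that map to more than one state. The column names `city` and `state` and the rows are made up for the example.

```python
import pandas as pd

owners = pd.DataFrame({
    "city":  ["BRONX", "BRONX", "BRONX", "STAMFORD", "STAMFORD", "MONSEY"],
    "state": ["NY",    "NY",    "NC",    "CT",       "NY",       "NY"],
})

# Number of distinct states seen for each city.
states_per_city = owners.groupby("city")["state"].nunique()

# Cities with more than one state violate City -> State.
violating = states_per_city[states_per_city > 1].index

# Per-city breakdown of the conflicting states, most frequent first.
conflicts = (
    owners[owners["city"].isin(violating)]
    .groupby("city")["state"]
    .value_counts()
)
print(conflicts)
```

In most groups the dominant state is the likely correct one, and the rare ones (for example `NC` for `BRONX`) are probably typos of `NY`.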

In [121]:
from openclean.operator.collector.count import distinct
from openclean.operator.map.violations import fd_violations

groups = fd_violations(df, lhs="Owner'S House City", rhs="Owner'S House State")

print('City         \t|            State')
print('=============\t|  ===============')
for key in groups:
    conflicts = distinct(groups.get(key), "Owner'S House State").most_common()
    state, count = conflicts[0]
    print('{:<12} \t| {} x {}'.format(key, count, state))
    for state, count in conflicts[1:]:
        print('             \t| {} x {}'.format(count, state))
    print('-------------\t|  ---------------')

City         	|            State
nan          	| 157032 x nan
             	| 129 x NY
             	| 6 x NJ
             	| 3 x RI
-------------	|  ---------------
BRONX        	| 92906 x NY
             	| 82 x NC
             	| 17 x NV
             	| 15 x NM
             	| 5 x NH
             	| 3 x NJ
             	| 1 x NE
-------------	|  ---------------
BROOKLYN     	| 300042 x NY
             	| 250 x NC
             	| 135 x NV
             	| 63 x nan
             	| 22 x NJ
             	| 1 x NM
             	| 1 x NE
             	| 1 x NH
             	| 1 x OH
-------------	|  ---------------
MANHATTAN    	| 12710 x NY
             	| 5 x NC
             	| 1 x OR
             	| 1 x NV
             	| 1 x CA
-------------	|  ---------------
NEW YORK     	| 677593 x NY
             	| 252 x NC
             	| 121 x NV
             	| 40 x NJ
             	| 17 x NE
             	| 8 x NM
             	| 6 x TN
             	| 5 x nan
             	| 4 x OH
          

STAMFORD     	| 343 x CT
             	| 17 x NY
             	| 3 x CO
             	| 2 x VT
-------------	|  ---------------
MONSEY       	| 644 x NY
             	| 1 x NJ
-------------	|  ---------------
NEW YORK,    	| 3965 x NY
             	| 5 x NC
-------------	|  ---------------
VERONA       	| 13 x NJ
             	| 2 x CA
-------------	|  ---------------
SARASOTA     	| 43 x FL
             	| 1 x NY
-------------	|  ---------------
SO. PLAINFIELD 	| 51 x NJ
             	| 1 x NY
-------------	|  ---------------
HUNTINGTON   	| 782 x NY
             	| 6 x WV
             	| 2 x PA
             	| 1 x CT
-------------	|  ---------------
LAWRENCE     	| 855 x NY
             	| 2 x KS
-------------	|  ---------------
EAST MEADOW  	| 543 x NY
             	| 7 x NJ
-------------	|  ---------------
ROSEDALE     	| 5666 x NY
             	| 35 x NC
             	| 3 x NJ
-------------	|  ---------------
SUFFERN      	| 228 x NY
             	| 1 x NJ
             	| 1 x NC
-

JAMACIA      	| 167 x NY
             	| 1 x NC
-------------	|  ---------------
POTOMAC      	| 55 x MD
             	| 5 x MO
-------------	|  ---------------
HARTFORD     	| 183 x CT
             	| 9 x AL
             	| 7 x NY
             	| 6 x CO
             	| 1 x CN
-------------	|  ---------------
LOS ANGELOS  	| 11 x CA
             	| 4 x NY
-------------	|  ---------------
MARLTON      	| 57 x NJ
             	| 5 x NY
-------------	|  ---------------
CLEVELAND    	| 153 x OH
             	| 8 x NY
-------------	|  ---------------
WOODBURY     	| 334 x NY
             	| 9 x NJ
             	| 3 x NC
-------------	|  ---------------
ROCKVILLE CENTR 	| 298 x NY
             	| 1 x NJ
-------------	|  ---------------
HAZLET       	| 34 x NJ
             	| 5 x NY
-------------	|  ---------------
PALISADES PARK 	| 31 x NJ
             	| 4 x NY
-------------	|  ---------------
LITTLE NECK  	| 4230 x NY
             	| 5 x NC
-------------	|  ---------------
BRIARCLIFF   	| 

TETERBORO    	| 63 x NJ
             	| 5 x NY
-------------	|  ---------------
BEDMINSTER   	| 20 x NJ
             	| 2 x NY
-------------	|  ---------------
HILLSIDE     	| 141 x NJ
             	| 80 x NY
             	| 4 x NH
-------------	|  ---------------
BELROSE      	| 70 x NY
             	| 2 x NC
-------------	|  ---------------
COS COB      	| 22 x CT
             	| 2 x NY
-------------	|  ---------------
CLARK        	| 13 x NJ
             	| 2 x NY
-------------	|  ---------------
CANAAN       	| 7 x NY
             	| 6 x CT
-------------	|  ---------------
FT LEE       	| 42 x NJ
             	| 4 x NY
-------------	|  ---------------
PARSIPANY    	| 41 x NJ
             	| 4 x NY
-------------	|  ---------------
BRKLYN       	| 1757 x NY
             	| 8 x NJ
-------------	|  ---------------
NANUET       	| 90 x NY
             	| 2 x NJ
-------------	|  ---------------
LYNDHURST    	| 50 x NJ
             	| 6 x NY
-------------	|  ---------------
SO PLAINFIELD 

PASADENA     	| 27 x TX
             	| 21 x CA
-------------	|  ---------------
WAKEFIELD    	| 20 x NY
             	| 1 x MA
-------------	|  ---------------
SANTA MONICA 	| 575 x CA
             	| 18 x NY
-------------	|  ---------------
MILVILLE     	| 3 x NY
             	| 1 x NJ
-------------	|  ---------------
GREENVALE    	| 195 x NY
             	| 1 x SC
-------------	|  ---------------
ALLENHURST   	| 47 x NJ
             	| 3 x NH
             	| 2 x NY
-------------	|  ---------------
HAWLEY       	| 10 x PA
             	| 2 x NY
-------------	|  ---------------
KING OF PRUSSIA 	| 72 x PA
             	| 2 x NY
-------------	|  ---------------
STRATFORD    	| 3 x CO
             	| 2 x CT
-------------	|  ---------------
S. OZONE PARK 	| 1073 x NY
             	| 2 x NV
-------------	|  ---------------
MENDHAM      	| 43 x NJ
             	| 12 x NY
-------------	|  ---------------
HENDERSON    	| 8 x NV
             	| 1 x NE
-------------	|  ---------------
MIDDLETON

MARLBORO     	| 138 x NJ
             	| 4 x NY
             	| 2 x MA
-------------	|  ---------------
EATONTOWN,   	| 6 x NJ
             	| 1 x NY
-------------	|  ---------------
SILVER SPRINGS 	| 19 x MD
             	| 7 x CO
-------------	|  ---------------
BROOK        	| 11 x NY
             	| 2 x IL
-------------	|  ---------------
ENGLISHTOWN  	| 67 x NJ
             	| 1 x NY
-------------	|  ---------------
MANCHESTER   	| 13 x NJ
             	| 5 x VT
             	| 3 x NH
             	| 2 x NY
             	| 1 x CT
-------------	|  ---------------
ROCKY MOUNT  	| 10 x NC
             	| 2 x NY
-------------	|  ---------------
SPRINGFIELD GDN 	| 85 x NY
             	| 1 x NE
-------------	|  ---------------
GLEN ROCK    	| 21 x NJ
             	| 7 x NY
-------------	|  ---------------
MILFORD      	| 10 x PA
             	| 4 x CT
             	| 4 x NJ
             	| 3 x NY
-------------	|  ---------------
MANTOLOKING  	| 11 x NJ
             	| 3 x NY
----------

WATCHUNG     	| 25 x NJ
             	| 1 x NY
-------------	|  ---------------
HUDSON       	| 4 x FL
             	| 4 x NY
-------------	|  ---------------
CHESTER      	| 87 x NY
             	| 1 x PA
             	| 1 x NJ
-------------	|  ---------------
EASTON       	| 8 x CT
             	| 4 x PA
             	| 1 x NJ
-------------	|  ---------------
RED BANK     	| 18 x NJ
             	| 4 x NY
-------------	|  ---------------
SCOTTSDALE   	| 20 x AZ
             	| 5 x KY
             	| 1 x AR
             	| 1 x NY
-------------	|  ---------------
FOXPOINT     	| 1 x NY
             	| 1 x WI
-------------	|  ---------------
EDGEWOOD     	| 57 x NY
             	| 4 x MD
-------------	|  ---------------
BAYONE       	| 65 x NJ
             	| 3 x NH
-------------	|  ---------------
JERCEY CITY  	| 35 x NJ
             	| 3 x NY
-------------	|  ---------------
LAWRENCEVILLE 	| 8 x NJ
             	| 2 x NY
-------------	|  ---------------
HAVENFORD    	| 3 x PA
        

NEWWARK      	| 6 x NJ
             	| 2 x NY
-------------	|  ---------------
DURHAM       	| 4 x NC
             	| 1 x NJ
-------------	|  ---------------
NEWBURY      	| 7 x MA
             	| 1 x NY
-------------	|  ---------------
NB           	| 3 x NY
             	| 2 x NJ
-------------	|  ---------------
SANTA ROSA   	| 1 x CA
             	| 1 x NY
-------------	|  ---------------
STOCKHOLM    	| 3 x SD
             	| 2 x sw
             	| 1 x NY
-------------	|  ---------------
HAMPTON      	| 15 x NH
             	| 2 x NY
-------------	|  ---------------
GOSHEN       	| 19 x NY
             	| 7 x CT
-------------	|  ---------------
GREENWOOD    	| 9 x NY
             	| 2 x CT
-------------	|  ---------------
TORONTO ONTARIO 	| 2 x CA
             	| 1 x NY
-------------	|  ---------------
MARGATE      	| 5 x FL
             	| 1 x NY
-------------	|  ---------------
ENGLEWOOD CLIFT 	| 9 x NJ
             	| 1 x NY
-------------	|  ---------------
PORTSMOUTH   	| 15 x 
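
The same City -> State check can be approximated with plain pandas. The sketch below runs on made-up toy rows, and the short column names `city` and `state` are stand-ins for the real ones. It is an illustration of the idea, not a replacement for `fd_violations`.

```python
import pandas as pd

# Toy rows; the column names are shortened stand-ins for the real ones.
toy = pd.DataFrame({
    "city":  ["BRONX", "BRONX", "BRONX", "STAMFORD", "STAMFORD", "MONSEY"],
    "state": ["NY",    "NY",    "NC",    "CT",       "CT",       "NY"],
})

# A city violates City -> State when it maps to more than one distinct state.
states_per_city = toy.groupby("city")["state"].nunique()
violating = states_per_city[states_per_city > 1].index.tolist()

# For each violating city, list its states from most to least frequent.
for city in violating:
    counts = toy.loc[toy["city"] == city, "state"].value_counts()
    print(city, dict(counts))
```

pandas `groupby` and `nunique` skip missing values by default, so unlike the openclean output above, this sketch does not report a `nan` group.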

There is a row that has "NEW YORK CITY" as its city but "NJ" as its state; its state should be fixed to "NY". The cell below selects the matching rows.

In [122]:
df["Owner'S House State"].loc[(df["Owner'S House City"] == "NEW YORK CITY") & (df["Owner'S House State"] == "NJ")]

Series([], Name: Owner'S House State, dtype: object)
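
The cell above only selects rows. A minimal sketch of the assignment itself, on a made-up three-row frame, is shown here. It uses a boolean mask with a single `.loc` call, which is the usual pandas idiom for writing values in place.

```python
import pandas as pd

toy = pd.DataFrame({
    "Owner'S House City":  ["NEW YORK CITY", "NEW YORK CITY", "HOBOKEN"],
    "Owner'S House State": ["NJ",            "NY",            "NJ"],
})

# Rows whose city is NEW YORK CITY but whose state is NJ.
mask = (toy["Owner'S House City"] == "NEW YORK CITY") & (toy["Owner'S House State"] == "NJ")

# Assign through a single .loc call so the change is written to `toy` itself
# rather than to a temporary copy.
toy.loc[mask, "Owner'S House State"] = "NY"

print(toy["Owner'S House State"].tolist())  # ['NY', 'NY', 'NJ']
```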

# Apply a similar operation to Owner'S Business Name

In [123]:
bn_col = "Owner's Business Name"
findDateOutliers(bn_col)

Column:  Owner's Business Name
1.       N/A     348,159
2.               243,930
3.        NA      67,584
4.  NY SCHOOL CONSTRUCTION AUTHORITY      19,966
5.       HPD      17,505
6.      NONE      13,668
7.     OWNER      12,935
8.   NYC HPD       9,877
9.  NYC HOUSING AUTHORITY       9,726
10.      n/a       9,654

Total number of distinct values in Owner's Business Name is 378167



# Clustering Business Name takes too much time, so for now we only clean the empty placeholder values

In [124]:
bn_col = "Owner'S Business Name"
df[bn_col] = df[bn_col].replace(['N/A', '', 'NA','NONE'], [None,None,None,None])
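
The profile above also lists a lower-case `n/a` among the most frequent values, which an exact-match replacement does not catch. One possible way to cover case and whitespace variants is sketched here on a toy Series. The `PLACEHOLDERS` set is an assumption and would need to be extended after inspecting the real data.

```python
import pandas as pd

# Placeholder spellings, compared after trimming and upper-casing (assumed list).
PLACEHOLDERS = {"", "N/A", "NA", "NONE"}

names = pd.Series(["N/A", "n/a", " na ", "NONE", "HPD", "", "NYC HOUSING AUTHORITY"])

# Normalize only for the comparison; real names keep their original text.
is_placeholder = names.str.strip().str.upper().isin(PLACEHOLDERS)

# where() keeps values where the condition holds and writes NaN elsewhere.
cleaned = names.where(~is_placeholder)

print(cleaned.tolist())
```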

# Data Profiling for applicant columns

Find format problems and outliers in all applicant columns.

We use openclean's sklearn-based anomaly modules to detect problems and outliers.

In [125]:
from openclean.profiling.anomalies.sklearn import DBSCANOutliers

# Print the ten most frequent values of the given column, then run DBSCAN
# outlier detection on the column's distinct values.
def findDateOutliers(column_name, eps_setting = 0.05):
    applicant_data = ds.distinct(column_name)
    print("Column: ",column_name)
    
    for rank, val in enumerate(applicant_data.most_common(10)):        
        st, freq = val
        print('{:<3} {:>8}  {:>10}'.format('{}.'.format(rank + 1), st, '{:,}'.format(freq)))

    print('\nTotal number of distinct values in {} is {}'.format(column_name, len(applicant_data)))
    print(DBSCANOutliers(eps = eps_setting).find(applicant_data))
    print('\n==================================')

In [126]:
date_cols = []

print("Applicant Data columns:\n")
for col in ds.columns:
    if 'Permittee' in col:
        print(col)
        date_cols.append(col)

Applicant Data columns:

Permittee's First Name
Permittee's Last Name
Permittee's Business Name
Permittee's Phone #
Permittee's License Type
Permittee's License #
Permittee's Other Title


# Convert similar values to a suggested value using kNN clustering

In [127]:
# Cluster string using kNN clusterer (with the default n-gram setting)
# using the Levenshtein distance as the similarity measure.

from openclean.cluster.knn import knn_clusters
from openclean.function.similarity.base import SimilarityConstraint
from openclean.function.similarity.text import LevenshteinDistance
from openclean.function.value.threshold import GreaterThan

def getClusters(col, minsize = 2):
    dba = ds.select(col).distinct()
    clusters = knn_clusters(
        values=dba,
        sim=SimilarityConstraint(func=LevenshteinDistance(), pred=GreaterThan(0.75)),
        minsize=minsize
    )
    return clusters

def print_cluster(cnumber, cluster):
    item_count = 0
    print('Cluster {} (of size {})\n'.format(cnumber, len(cluster)))
    for val, count in cluster.items():
        item_count += 1
        if item_count <= 10:
            print('{} ({})'.format(val, count))
    if item_count>10:
        print(".......{} more items".format(item_count-10))
    print('\nSuggested value: {}\n\n'.format(cluster.suggestion()))

def updateUsingClusters(col, clusters, isPrint = False):
    # Collect the replacements from every cluster before touching the column.
    original_list = []
    suggestion_list = []
    clusters.sort(key=lambda c: len(c), reverse=True)

    for i, cluster in enumerate(clusters):
        suggestion = cluster.suggestion()
        if isPrint and i < 5:
            print_cluster(i, cluster)

        # Map every value in the cluster to the cluster's suggested value.
        for val, _ in cluster.items():
            original_list.append(val)
            suggestion_list.append(suggestion)

    # Apply all replacements in one pass, after every cluster has been collected.
    df[col] = df[col].replace(original_list, suggestion_list)
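
The replacement step can also be expressed as a single value-to-suggestion dictionary. The sketch below uses plain `Counter` objects as stand-ins for openclean clusters and takes the most frequent spelling as the suggestion. Both choices, and the toy counts, are assumptions made for illustration.

```python
from collections import Counter
import pandas as pd

# Stand-ins for openclean clusters: each maps a spelling to a toy frequency.
clusters = [
    Counter({"BROOKLYN": 300, "BROOKLN": 2, "BROOKYLN": 1}),
    Counter({"JAMAICA": 50, "JAMACIA": 3}),
]

# Use the most frequent spelling in each cluster as the suggestion (assumed rule).
mapping = {}
for cluster in clusters:
    suggestion = cluster.most_common(1)[0][0]
    for value in cluster:
        if value != suggestion:
            mapping[value] = suggestion

cities = pd.Series(["BROOKLN", "JAMACIA", "BROOKLYN", "QUEENS"])
print(cities.replace(mapping).tolist())
```

A dictionary makes it easy to inspect or hand-edit the proposed replacements before applying them, which matters when a cluster merges names that are similar but distinct.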

In [128]:
# Takes too long to run, so the loop is kept as a string and not executed.
'''
for col in date_cols[:3]:
    print("kNN cluster for ", col)
    col_clusters = getClusters(col)
    print("updating column ", col)
    print("----------------------\nTop 5 Cluster:\n----------------------")
    updateUsingClusters(col, col_clusters, True)
    print("================")
'''
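
For intuition about the `GreaterThan(0.75)` threshold, the sketch below implements a normalized Levenshtein similarity in pure Python. Dividing the edit distance by the longer string's length is one common normalization and is assumed here; openclean's `LevenshteinDistance` may normalize differently. The helper names `levenshtein` and `similarity` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(similarity("JERSEY CITY", "JERCEY CITY"))
print(similarity("NEWARK", "NEWWARK"))
```

Under this normalization, one edit in an 11-character string stays well above 0.75, while short strings cross the threshold after a single edit or two.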




## Precision and Recall

In [129]:
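# Note: "Owner'S House Number" is listed twice below, so selecting these
# columns yields two identically named columns.
cleaned_columns = ["Job Number", 'Owner\'S House Number', "Bldg Type", "Oil Gas", "Act As Superintendent", "Self  Cert", "Numbern- Profit", "Residential", 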
                       "Owner'S Business Type", 
                       "Owner'S Business Name", "Owner'S First Name", "Owner'S Last Name", "Owner'S House Number", 
                       "Owner'S House Street Name", "Owner'S House City", "Owner'S House State", "Owner'S House Zip Code", 
                       "Owner'S Phone Number", "Filing Date","Issuance Date","Expiration Date","Job Start Date", "D O B Run Date"]

In [130]:
df_sample_data = df_sample.rename(columns=rename_dict)

In [131]:
df_sample_data.columns

Index(['Borough', 'Bin', 'Number', 'Street', 'Job Number', 'Job Doc Number',
       'Job Type', 'Self  Cert', 'Block', 'Lot', 'Community Board', 'Postcode',
       'Bldg Type', 'Residential', 'Special District 1', 'Special District 2',
       'Work Type', 'Permit Status', 'Filing Status', 'Permit Type',
       'Permit Sequence Number', 'Permit Subtype', 'Oil Gas', 'Site Fill',
       'Filing Date', 'Issuance Date', 'Expiration Date', 'Job Start Date',
       'Permittee'S First Name', 'Permittee'S Last Name',
       'Permittee'S Business Name', 'Permittee'S Phone Number',
       'Permittee'S License Type', 'Permittee'S License Number',
       'Act As Superintendent', 'Permittee'S Other Title', 'H I C License',
       'Site Safety Mgr'S First Name', 'Site Safety Mgr'S Last Name',
       'Site Safety Mgr Business Name', 'Superintendent First & Last Name',
       'Superintendent Business Name', 'Owner'S Business Type',
       'Numbern- Profit', 'Owner'S Business Name', 'Owner'S First Name'

In [132]:
df_sample_data = df_sample_data[cleaned_columns]

In [133]:
df_temp = df.loc[df_sample_data.index][cleaned_columns].copy()

In [134]:
def precision(tp, fp):
    return tp/(tp+fp)

def recall(tp, fn):
    return tp/(tp+fn)

In [135]:
cleaned_columns

['Job Number',
 "Owner'S House Number",
 'Bldg Type',
 'Oil Gas',
 'Act As Superintendent',
 'Self  Cert',
 'Numbern- Profit',
 'Residential',
 "Owner'S Business Type",
 "Owner'S Business Name",
 "Owner'S First Name",
 "Owner'S Last Name",
 "Owner'S House Number",
 "Owner'S House Street Name",
 "Owner'S House City",
 "Owner'S House State",
 "Owner'S House Zip Code",
 "Owner'S Phone Number",
 'Filing Date',
 'Issuance Date',
 'Expiration Date',
 'Job Start Date',
 'D O B Run Date']

In [136]:
col_idx = 0
tp = 0
fp = 0
fn = 0

In [137]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Job Number
Original,	 Cleaned

110247812 	 110247812
300739324 	 300739324
420051856 	 420051856
120614200 	 120614200
520045040 	 520045040
402019338 	 402019338
104162683 	 104162683
104585708 	 104585708
100728875 	 100728875
110361625 	 110361625
120586277 	 120586277
101548595 	 101548595
320192750 	 320192750
300937805 	 300937805
400968761 	 400968761
400869379 	 400869379
120704540 	 120704540
301663224 	 301663224
100499649 	 100499649
302213218 	 302213218
500842092 	 500842092
104806051 	 104806051
100276194 	 100276194
100730023 	 100730023
103522369 	 103522369
103465849 	 103465849
520038138 	 520038138
103918913 	 103918913
121481967 	 121481967
121044967 	 121044967
500693724 	 500693724
310310632 	 310310632
104749710 	 104749710
301729048 	 301729048
320264209 	 320264209
500331802 	 500331802
200902540 	 200902540
101095477 	 101095477
104871818 	 104871818
402053638 	 402053638
101737827 	 101737827
120633216 	 120633216
420056012 	 420056012
100961363 	 10
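
The side-by-side listings in this section are checked by eye. As a complement, a small helper can list only the cells that actually changed. The function name `changed_cells` and the toy values are assumptions for illustration, and deciding whether a change is a true or false positive still needs human judgment.

```python
import pandas as pd

def changed_cells(original: pd.Series, cleaned: pd.Series) -> pd.DataFrame:
    """Return the rows where the cleaned value differs from the original.

    Two missing values count as equal; both Series must share the same index.
    """
    both_missing = original.isna() & cleaned.isna()
    differs = (original != cleaned) & ~both_missing
    return pd.DataFrame({"original": original[differs], "cleaned": cleaned[differs]})

orig = pd.Series(["New York", "NEW YORK", None, "na"])
clean = pd.Series(["NEW YORK", "NEW YORK", None, None])
print(changed_cells(orig, clean))
```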

In [138]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S House Number
Original,	 Cleaned

Owner'S House Number    280
Owner'S House Number    280
Name: 1590526, dtype: object 	 Owner'S House Number    280
Owner'S House Number    280
Name: 1590526, dtype: object
Owner'S House Number    1
Owner'S House Number    1
Name: 524120, dtype: object 	 Owner'S House Number    1
Owner'S House Number    1
Name: 524120, dtype: object
Owner'S House Number    107-11
Owner'S House Number    107-11
Name: 2176945, dtype: object 	 Owner'S House Number    107-11
Owner'S House Number    107-11
Name: 2176945, dtype: object
Owner'S House Number    232
Owner'S House Number    232
Name: 1132106, dtype: object 	 Owner'S House Number    232
Owner'S House Number    232
Name: 1132106, dtype: object
Owner'S House Number    117-02
Owner'S House Number    117-02
Name: 2356636, dtype: object 	 Owner'S House Number    117-02
Owner'S House Number    117-02
Name: 2356636, dtype: object
Owner'S House Number    89-24
Owner'S House Number    89-24
Name: 2180533, d

In [139]:
tp += 1
fn += 6
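
The output for `Owner'S House Number` prints whole Series objects instead of single values because that label appears twice in `cleaned_columns`. The toy example below reproduces the effect and shows one way to de-duplicate a column list while keeping its order; the names `toy`, `cols`, and `unique_cols` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Job Number": ["110247812"], "Owner'S House Number": ["280"]})

# The same label listed twice, as in `cleaned_columns`.
cols = ["Job Number", "Owner'S House Number", "Owner'S House Number"]

# Selecting with a repeated label yields two identically named columns, so
# indexing by that label returns a DataFrame instead of a Series.
dup = toy[cols]
print(type(dup["Owner'S House Number"]).__name__)  # DataFrame

# dict.fromkeys drops repeats while keeping the first-seen order.
unique_cols = list(dict.fromkeys(cols))
print(type(toy[unique_cols]["Owner'S House Number"]).__name__)  # Series
```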

In [140]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Bldg Type
Original,	 Cleaned

2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
2 	 2
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
2 	 2
2 	 2
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
2 	 2
2 	 2
1 	 1
1 	 1
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
1 	 1
2 	 2
2 	 2
2 	 2
1 	 1
1 	 1
2 	 2
2 	 2




In [141]:
tp += 4

In [142]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Oil Gas
Original,	 Cleaned

nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 
nan 	 




In [143]:
# or would this fall under fp?
tp += 49

In [144]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Act As Superintendent
Original,	 Cleaned

nan 	 N
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
Y 	 Y
Y 	 Y
Y 	 Y
nan 	 N
Y 	 Y
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
Y 	 Y
Y 	 Y
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
Y 	 Y
Y 	 Y
Y 	 Y
nan 	 N
Y 	 Y
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
Y 	 Y




In [145]:
tp += 19

In [146]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Self  Cert
Original,	 Cleaned

Y 	 Y
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
Y 	 Y
Y 	 Y
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
Y 	 Y
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N




In [147]:
tp += 31

In [148]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Numbern- Profit
Original,	 Cleaned

nan 	 N
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
Y 	 Y
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N
nan 	 N




In [149]:
tp += 48

In [150]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Residential
Original,	 Cleaned

YES 	 YES
nan 	 NO
YES 	 YES
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
YES 	 YES
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
YES 	 YES
YES 	 YES
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
YES 	 YES
YES 	 YES
YES 	 YES
YES 	 YES
YES 	 YES
nan 	 NO
nan 	 NO
YES 	 YES
YES 	 YES
nan 	 NO
nan 	 NO
YES 	 YES
nan 	 NO
YES 	 YES
YES 	 YES
nan 	 NO
nan 	 NO
nan 	 NO
nan 	 NO
YES 	 YES
nan 	 NO
nan 	 NO




In [151]:
tp += 39

In [152]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S Business Type
Original,	 Cleaned

CORPORATION 	 CORPORATION
OTHER 	 OTHER
INDIVIDUAL 	 INDIVIDUAL
PARTNERSHIP 	 PARTNERSHIP
OTHER GOV'T AGENCY 	 OTHER GOV'T AGENCY
INDIVIDUAL 	 INDIVIDUAL
CORPORATION 	 CORPORATION
PARTNERSHIP 	 PARTNERSHIP
nan 	 nan
PARTNERSHIP 	 PARTNERSHIP
PARTNERSHIP 	 PARTNERSHIP
INDIVIDUAL 	 INDIVIDUAL
PARTNERSHIP 	 PARTNERSHIP
INDIVIDUAL 	 INDIVIDUAL
INDIVIDUAL 	 INDIVIDUAL
INDIVIDUAL 	 INDIVIDUAL
INDIVIDUAL 	 INDIVIDUAL
CORPORATION 	 CORPORATION
nan 	 nan
INDIVIDUAL 	 INDIVIDUAL
PARTNERSHIP 	 PARTNERSHIP
CORPORATION 	 CORPORATION
nan 	 nan
PARTNERSHIP 	 PARTNERSHIP
nan 	 nan
nan 	 nan
INDIVIDUAL 	 INDIVIDUAL
PARTNERSHIP 	 PARTNERSHIP
CORPORATION 	 CORPORATION
INDIVIDUAL 	 INDIVIDUAL
CORPORATION 	 CORPORATION
INDIVIDUAL 	 INDIVIDUAL
CORPORATION 	 CORPORATION
OTHER 	 OTHER
CORPORATION 	 CORPORATION
CORPORATION 	 CORPORATION
INDIVIDUAL 	 INDIVIDUAL
INDIVIDUAL 	 INDIVIDUAL
PARTNERSHIP 	 PARTNERSHIP
CORPORATION 	 CORPORATION
CORPORATION 	 CORPORATION

In [153]:
fn += 4


In [154]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S Business Name
Original,	 Cleaned

SHARFMAN ORGANIZATION 	 SHARFMAN ORGANIZATION
BROOKLYN PULIC LIBRARY 	 BROOKLYN PULIC LIBRARY
0 	 0
KC 53 LLC DBA SOCIAL EATZ 	 KC 53 LLC DBA SOCIAL EATZ
NYC PARKS & RECEATION 	 NYC PARKS & RECEATION
nan 	 nan
New Water Street Corp. 	 NEW WATER STREET CORP
WATERSCAPE RESORTS, LLC 	 WATERSCAPE RESORTS LLC
nan 	 nan
960 5TH AVE CORP WALLACH MNGR 	 960 5TH AVE CORP WALLACH MNGR
TRUSTEE OF SPENCE SCHOOL 	 TRUSTEE OF SPENCE SCHOOL
nan 	 nan
CBC REALTY LLC 	 CBC REALTY LLC
nan 	 nan
na 	 None
nan 	 nan
ROSE ASSOCIATES, INC., MANAGER 	 ROSE ASSOCIATES INC MANAGER
ROSKO HOLDING, INC. 	 ROSKO HOLDING INC
nan 	 nan
nan 	 nan
Brisam Staten Island LLC. 	 BRISAM STATEN ISLAND LLC
12 East 87th Street Owners, Inc. 	 12 EAST 87TH STREET OWNERS INC
nan 	 nan
291 BROADWAY REALTY ASSOCIATES 	 291 BROADWAY REALTY ASSOCIATES
nan 	 nan
nan 	 nan
nan 	 nan
RECKSON ASSOCIATES REALTY CORP 	 RECKSON ASSOCIATES REALTY CORP
86 APARTMENT CORP 	 86 APARTMENT CORP
n

In [155]:
tp += 8
fp += 9
fn += 16

In [156]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S First Name
Original,	 Cleaned

MARK 	 MARK
HELMUT 	 HELMUT
DRUEPATIE 	 DRUEPATIE
SCOTT 	 SCOTT
THERESE 	 THERESE
JOSE 	 JOSE
William 	 William
SOLLY 	 SOLLY
nan 	 nan
BURTON 	 BURTON
KATHY 	 KATHY
THOMAS 	 THOMAS
PAT 	 PAT
SOO CHUL 	 SOO CHUL
Paul 	 Paul
JUNG SHIN 	 JUNG SHIN
MARCO 	 MARCO
ANDREW 	 ANDREW
nan 	 nan
WILLIAM 	 WILLIAM
Sam 	 Sam
Stephen 	 Stephen
nan 	 nan
SAM 	 SAM
nan 	 nan
nan 	 nan
ANTHONY 	 ANTHONY
MATTHEW 	 MATTHEW
MICHAEL 	 MICHAEL
RAM 	 RAM
RANDY 	 RANDY
XI 	 XI
Wing 	 Wing
TIMOTHY 	 TIMOTHY
ROBERT 	 ROBERT
BOBBY 	 BOBBY
ORATZIO 	 ORATZIO
SANG 	 SANG
JONATHAN 	 JONATHAN
Fred 	 Fred
BERNARD 	 BERNARD
SUSAN 	 SUSAN
EUNJA 	 EUNJA
REBECCA 	 REBECCA
Larry 	 Larry
NED 	 NED
SAL 	 SAL
MOHINDER 	 MOHINDER
nan 	 nan
VINCENT 	 VINCENT




In [157]:
fn += 11

In [158]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S Last Name
Original,	 Cleaned

SHARFMAN 	 SHARFMAN
HUTTER 	 HUTTER
POONAI 	 POONAI
ALLING 	 ALLING
BRADDICK 	 BRADDICK
MOLINA 	 MOLINA
Pupplo 	 Pupplo
ASSA 	 ASSA
nan 	 nan
WALLACH 	 WALLACH
JONES 	 JONES
BISHOP 	 BISHOP
CONTE 	 CONTE
YOO 	 YOO
Peters 	 Peters
HUANG 	 HUANG
MATTIA 	 MATTIA
JERRO 	 JERRO
nan 	 nan
REESE 	 REESE
Chang 	 Chang
Reinstadtler 	 Reinstadtler
nan 	 nan
SUTTON 	 SUTTON
nan 	 nan
nan 	 nan
LOPRESTI 	 LOPRESTI
DUTHIE 	 DUTHIE
ZINDER 	 ZINDER
SUBRAMANIAN 	 SUBRAMANIAN
LEE 	 LEE
WU 	 WU
Ma 	 Ma
NG 	 NG
SANNA 	 SANNA
RICCA 	 RICCA
LAPIETRA 	 LAPIETRA
PARK 	 PARK
LANMAN 	 LANMAN
Pouration 	 Pouration
ISER 	 ISER
HEWITT 	 HEWITT
RO 	 RO
ROBERTSON 	 ROBERTSON
Henzel 	 Henzel
BERNSTEIN 	 BERNSTEIN
CALCAGNO 	 CALCAGNO
SINGH 	 SINGH
nan 	 nan
TROCCHIA 	 TROCCHIA




In [159]:
fn += 8

In [160]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S House Number
Original,	 Cleaned

Owner'S House Number    280
Owner'S House Number    280
Name: 1590526, dtype: object 	 Owner'S House Number    280
Owner'S House Number    280
Name: 1590526, dtype: object
Owner'S House Number    1
Owner'S House Number    1
Name: 524120, dtype: object 	 Owner'S House Number    1
Owner'S House Number    1
Name: 524120, dtype: object
Owner'S House Number    107-11
Owner'S House Number    107-11
Name: 2176945, dtype: object 	 Owner'S House Number    107-11
Owner'S House Number    107-11
Name: 2176945, dtype: object
Owner'S House Number    232
Owner'S House Number    232
Name: 1132106, dtype: object 	 Owner'S House Number    232
Owner'S House Number    232
Name: 1132106, dtype: object
Owner'S House Number    117-02
Owner'S House Number    117-02
Name: 2356636, dtype: object 	 Owner'S House Number    117-02
Owner'S House Number    117-02
Name: 2356636, dtype: object
Owner'S House Number    89-24
Owner'S House Number    89-24
Name: 2180533, d

In [161]:
tp += 3
fn += 2

In [162]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S House Street Name
Original,	 Cleaned

NORTH CENTERAL PARK AVENUE 	 NORTH CENTERAL PARK AVENUE
GRAND ARMY PLAZA 	 GRAND ARMY PLAZA
92 STREET 	 92 STREET
E 53 ST 	 E 53 ST
ROOSEVELT AVENUE 	 ROOSEVELT AVENUE
183 STREET 	 183 STREET
Water Street 	 Water Street
WEST 34TH STREET, 7TH FLOOR 	 WEST 34TH STREET, 7TH FLOOR
nan 	 nan
E 64 ST 	 E 64 ST
EAST 91ST STREET 	 EAST 91ST STREET
BEDFORD STREET 	 BEDFORD STREET
GRAND STREET 	 GRAND STREET
MYRTLE AVE 	 MYRTLE AVE
Corporal Kennedy St. 	 Corporal Kennedy St.
ROOSVELT AVE. 	 ROOSVELT AVE.
AVENUE C 	 AVENUE C
5TH AVENUE 	 5TH AVENUE
nan 	 nan
VERNON AVENUE 	 VERNON AVENUE
Great Neck Road 	 Great Neck Road
East 87 Street 	 East 87 Street
nan 	 nan
BROADWAY 	 BROADWAY
nan 	 nan
nan 	 nan
HYLAN BLVD 	 HYLAN BLVD
WEST 45TH STREET 	 WEST 45TH STREET
SEVENTH AVENUE 	 SEVENTH AVENUE
WASHINGTON AVE 	 WASHINGTON AVE
CHRISTOPHER LANE 	 CHRISTOPHER LANE
67TH STREET 	 67TH STREET
Bowery 	 Bowery
THOMSON AVENUE 	 THOMSON AVENUE
METROTECH C

In [163]:
fn += 8

In [164]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S House City
Original,	 Cleaned

HARTSDALE 	 HARTSDALE
BKLYN 	 BKLYN
OZONE PARK 	 OZONE PARK
NEW YORK 	 NEW YORK
FLUSHING 	 FLUSHING
QUEENS 	 QUEENS
New York 	 NEW YORK
NEW YORK 	 NEW YORK
nan 	 nan
N.Y. 	 N.Y.
NY 	 NY
NEW YORK 	 NEW YORK
BROOKLYN 	 BROOKLYN
BROOKLYN 	 BROOKLYN
Bayside 	 BAYSIDE
QUEENS 	 QUEENS
NEW YORK 	 NEW YORK
BROOKLYN 	 BROOKLYN
nan 	 nan
BROOKLYN 	 BROOKLYN
Great Neck 	 GREAT NECK
New York 	 NEW YORK
nan 	 nan
NEW YORK 	 NEW YORK
nan 	 nan
nan 	 nan
STATEN ISLAND 	 STATEN ISLAND
NEW YORK 	 NEW YORK
NEW YORK 	 NEW YORK
NEW YORK 	 NEW YORK
STATEN ISLAND 	 STATEN ISLAND
BROOKLYN 	 BROOKLYN
New York 	 NEW YORK
LIC 	 LIC
BROOKLYN 	 BROOKLYN
STATEN ISLAND 	 STATEN ISLAND
BRONX 	 BRONX
NY 	 NY
N.Y. 	 N.Y.
Great Neck, 	 GREAT NECK,
NEW YORK 	 NEW YORK
NEW YORK 	 NEW YORK
REGO PARK 	 REGO PARK
NEW YORK 	 NEW YORK
New York 	 NEW YORK
NEW YORK 	 NEW YORK
STATEN ISLAND 	 STATEN ISLAND
QUEENS VILLAGE 	 QUEENS VILLAGE
nan 	 nan
NEW YORK 	 NEW YORK




In [165]:
tp += 6
fn += 15
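
Several of the values counted as missed above are abbreviations or carry trailing punctuation (`BKLYN`, `N.Y.`, `LIC`, `GREAT NECK,`). A possible normalization step is sketched here. The `CITY_ALIASES` table and the function `normalize_city` are hypothetical, and the expansions are assumptions that would need review before being applied to the real column.

```python
import re

# Hypothetical abbreviation table; the right-hand values are assumptions and
# would need review (for example, 'NY' could also be a misplaced state code).
CITY_ALIASES = {
    "BKLYN": "BROOKLYN",
    "N.Y.": "NEW YORK",
    "NY": "NEW YORK",
    "LIC": "LONG ISLAND CITY",
}

def normalize_city(value):
    """Upper-case, trim, expand known aliases, and drop trailing punctuation."""
    if value is None:
        return None
    text = re.sub(r"\s+", " ", value).strip().upper()
    # Expand aliases before stripping punctuation so 'N.Y.' still matches.
    if text in CITY_ALIASES:
        return CITY_ALIASES[text]
    text = text.rstrip(",.")
    return CITY_ALIASES.get(text, text)

for raw in ["BKLYN", "N.Y.", "Great Neck,", "LIC", "QUEENS"]:
    print(raw, "->", normalize_city(raw))
```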

In [166]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S House State
Original,	 Cleaned

NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
nan 	 nan
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
nan 	 nan
NY 	 NY
NY 	 NY
NY 	 NY
nan 	 nan
NY 	 NY
nan 	 nan
nan 	 nan
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
NY 	 NY
nan 	 nan
NY 	 NY




In [167]:
# No values were changed in this column, so the tallies stay the same.

In [168]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S House Zip Code
Original,	 Cleaned

10530 	 10530
11232 	 11232
11417 	 11417
10022 	 10022
11368 	 11368
11423 	 11423
10041 	 10041
10001 	 10001
nan 	 nan
10021 	 10021
10128 	 10128
10014 	 10014
11211 	 11211
11206 	 11206
11361 	 11361
11368 	 11368
10009 	 10009
11215 	 11215
nan 	 nan
11206 	 11206
11021 	 11021
10128 	 10128
nan 	 nan
10018 	 10018
nan 	 nan
nan 	 nan
103120000 	 103120000
10036 	 10036
10123 	 10123
10011 	 10011
10314 	 10314
11220 	 11220
10002 	 10002
11101 	 11101
11201 	 11201
10314 	 10314
10465 	 10465
10021 	 10021
10012 	 10012
11023 	 11023
10019 	 10019
10003 	 10003
11374 	 11374
10036 	 10036
10011 	 10011
10001 	 10001
10304 	 10304
11428 	 11428
nan 	 nan
10016 	 10016




In [169]:
fn += 4
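
The value `103120000` in the listing above has nine digits, which would be consistent with a ZIP+4 code written without a hyphen. Assuming that interpretation, a minimal sketch of reducing such values to five digits follows; `normalize_zip` is a hypothetical helper, not part of the notebook.

```python
import re

def normalize_zip(value):
    """Return a 5-digit ZIP code, or None if the value does not look like one.

    Accepts 'ddddd', 'ddddd-dddd', and the unhyphenated 9-digit ZIP+4 form.
    """
    if value is None:
        return None
    match = re.fullmatch(r"(\d{5})(?:-?\d{4})?", str(value).strip())
    return match.group(1) if match else None

print(normalize_zip("103120000"))   # 10312
print(normalize_zip("10530"))       # 10530
print(normalize_zip("1053"))        # None
```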

In [170]:
col = cleaned_columns[col_idx]
print("column: ", col)
print("Original,\t Cleaned\n")
for i in range(50):
    print(df_sample_data[col].iloc[i], '\t', df_temp[col].iloc[i])

print('======================\n\n')
col_idx += 1

column:  Owner'S Phone Number
Original,	 Cleaned

9149972435 	 9149972435
7186265654 	 7186265654
7188433678 	 7188433678
2122073339 	 2122073339
7187606601 	 7187606601
7182064203 	 7182064203
2127470114 	 2127470114
2122399900 	 2122399900
nan 	 nan
2127533381 	 2127533381
6469436822 	 6469436822
2122799722 	 2122799722
7187821222 	 7187821222
7188868781 	 7188868781
7189393901 	 7189393901
7184614587 	 7184614587
2125985271 	 2125985271
3472315798 	 3472315798
nan 	 nan
7185736368 	 7185736368
5167739300 	 5167739300
2122893493 	 2122893493
nan 	 nan
2125943336 	 2125943336
nan 	 nan
nan 	 nan
6462203326 	 6462203326
2127536600 	 2127536600
2127363680 	 2127363680
9179714987 	 9179714987
7189838800 	 7189838800
6466433234 	 6466433234
9172738188 	 9172738188
7184728000 	 7184728000
7189238400 	 7189238400
9175522410 	 9175522410
7188638943 	 7188638943
2128792900 	 2128792900
2125295688 	 2125295688
5164660055 	 5164660055
2122472603 	 2122472603
2128241191 	 2128241191
7184295007 	

In [171]:
fp += 3

In [172]:
tp

208

In [173]:
fp

12

In [174]:
fn

74

In [175]:
precision(tp, fp)

0.9454545454545454

In [176]:
recall(tp,fn)

0.7375886524822695
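
As a cross-check, the totals can be recomputed from the per-cell increments recorded above. The F1 score at the end is an addition that the notebook itself does not report; the variable names are illustrative and chosen so they do not shadow the notebook's `tp`, `fp`, `fn`, `precision`, or `recall`.

```python
# (tp, fp, fn) increments in the order they were added in the cells above.
tallies = [
    (1, 0, 6), (4, 0, 0), (49, 0, 0), (19, 0, 0), (31, 0, 0), (48, 0, 0),
    (39, 0, 0), (0, 0, 4), (8, 9, 16), (0, 0, 11), (0, 0, 8), (3, 0, 2),
    (0, 0, 8), (6, 0, 15), (0, 0, 4), (0, 3, 0),
]

total_tp = sum(t[0] for t in tallies)
total_fp = sum(t[1] for t in tallies)
total_fn = sum(t[2] for t in tallies)

p = total_tp / (total_tp + total_fp)   # precision
r = total_tp / (total_tp + total_fn)   # recall
f1 = 2 * p * r / (p + r)               # harmonic mean of precision and recall

print(total_tp, total_fp, total_fn)    # 208 12 74
print(round(p, 4), round(r, 4), round(f1, 4))
```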

# Save cleaned data

In [177]:
outputpath = 'cleaned_data.csv'
df.to_csv(outputpath,sep=',',index=False,header=True) 
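
When the saved file is read back, pandas infers column types unless told otherwise. The sketch below round-trips a toy frame through an in-memory buffer to show why passing `dtype=str` keeps identifier-like columns intact; the toy values are made up.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Owner'S House Zip Code": ["07030", None],
    "Job Number": ["110247812", "300739324"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without dtype=str pandas would infer a numeric column and drop the leading zero.
back = pd.read_csv(buffer, dtype=str)
print(back["Owner'S House Zip Code"].tolist())
```

Missing values are written as empty fields and come back as NaN, so the distinction between `None` and an empty string is not preserved by this format.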

# Discussion

We profiled and cleaned most of the columns. We first renamed some columns so that their names describe the data correctly, then examined each column for outliers and malformed values.

Some issues remain. Most empty values are still stored as NaN. We are also unsure whether clustering is the best way to clean name data, since it can merge similar but distinct names into one. Clustering the business names takes too much time, so for that column we only fixed the empty placeholder values. Finally, some column names are in upper case, and we do not know whether they should be converted to lower case to match the other columns.