# DOHMH New York City Restaurant Inspection Results

## Data Cleaning

Our goal is to reproduce a previous [study from 2014 that looks at the distribution of restaurant inspection grades in New York City](https://iquantny.tumblr.com/post/76928412519/think-nyc-restaurant-grading-is-flawed-heres). For our study, we use data that was downloaded in Sept. 2019.

We performed some basic [data profiling steps](https://github.com/VIDA-NYU/openclean-core/blob/master/examples/notebooks/NYCRestaurantInspections/NYC%20Restaurant%20Inspections%20-%20Profiling.ipynb) to get an understanding of the data and to identify some initial data cleaning tasks.

In [1]:
# Open the downloaded dataset to extract the relevant columns and records.

import os

from openclean.data.load import stream

ds = stream(os.path.join('data', '43nn-pn8j.tsv.gz'))

## Extract Relevant Records

Get the data records for the study. Perform initial cleaning guided by the profiling step.

In [2]:
# We will only consider the following columns from the full dataset.

columns=[
    'CAMIS',
    'DBA',
    'BORO',
    'BUILDING',
    'STREET',
    'ZIPCODE',
    'CUISINE DESCRIPTION',
    'INSPECTION DATE',
    'VIOLATION CODE',
    'CRITICAL FLAG',
    'SCORE',
    'GRADE',
    'INSPECTION TYPE'
]
ds = ds.select(*columns)

In [3]:
# During data profiling we decided to
#
# - replace the 'BORO' for records that have value '0' and a 'ZIPCODE' in ['11249', '10168', '10285'].
# - remove records ...
#   + for new establishments ('INSPECTION DATE' == 1900/1/1)
#   + with empty 'SCORE's
#   + 'BORO' == 'N/A'

from datetime import datetime
from openclean.function.eval.base import Col
from openclean.function.eval.datatype import Datetime
from openclean.function.eval.logic import And
from openclean.function.eval.null import IsNotEmpty
from openclean.function.eval.mapping import Lookup

# Date that identifies new establishments that have not
# been inspected.
new_establ_date = datetime(1900, 1, 1)

boro_0_mapping = {'11249': 'Brooklyn', '10168': 'Manhattan', '10285': 'Manhattan'}

# Filter and update the data in the stream. Then perform type casting and
# generate the pandas data frame with the selected records that we will be
# using for the remainder of our study.

df = ds\
    .filter(And(Col('BORO') != 'N/A', Datetime('INSPECTION DATE') != new_establ_date, IsNotEmpty('SCORE')))\
    .update('BORO', Lookup('ZIPCODE', boro_0_mapping, default=Col('BORO')))\
    .typecast()\
    .to_df()
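
For readers more familiar with pandas, the same filter-and-update logic can be sketched on a small hand-made frame. This is only an illustration under assumptions: the sample rows are invented, the dates are assumed to be parsed already, and empty scores are assumed to be represented as missing values. It does not show how openclean implements the pipeline.

```python
from datetime import datetime

import pandas as pd

# Invented sample rows (not taken from the real dataset).
sample = pd.DataFrame({
    'BORO': ['0', 'N/A', 'Queens', 'Bronx', 'Manhattan'],
    'ZIPCODE': ['11249', '10001', '11435', '10471', '10003'],
    'INSPECTION DATE': pd.to_datetime(
        ['2018-03-23', '2018-03-23', '1900-01-01', '2017-06-14', '2019-05-09']
    ),
    'SCORE': [14, 10, 5, None, 24],
})

new_establ_date = datetime(1900, 1, 1)
boro_0_mapping = {'11249': 'Brooklyn', '10168': 'Manhattan', '10285': 'Manhattan'}

# Keep rows with a known borough, a real inspection date, and a score.
keep = (
    (sample['BORO'] != 'N/A')
    & (sample['INSPECTION DATE'] != new_establ_date)
    & sample['SCORE'].notna()
)
cleaned = sample.loc[keep].copy()

# Look up 'BORO' by ZIP code; keep the existing value if there is no match.
cleaned['BORO'] = cleaned['ZIPCODE'].map(boro_0_mapping).fillna(cleaned['BORO'])

print(cleaned[['BORO', 'ZIPCODE', 'SCORE']])
```

Only the first and last sample rows survive the filter, and the first one has its borough replaced through the ZIP code lookup.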


In [4]:
# What is the size of the resulting dataset?

df.shape

(375085, 13)

In [5]:
# Remove any exact duplicates from the dataset.

df = df.drop_duplicates()
df.shape

(375078, 13)

In [6]:
# Profile the dataset.

from openclean.profiling.dataset import dataset_profile

profile = dataset_profile(df)

In [7]:
profile.stats()

Unnamed: 0,total,empty,distinct,uniqueness,entropy
CAMIS,375078,0,25530,0.068066,14.296316
DBA,375078,0,20468,0.05457,13.627345
BORO,375078,0,5,1.3e-05,1.998406
BUILDING,375078,224,7166,0.019117,11.707591
STREET,375078,0,3191,0.008508,9.835756
ZIPCODE,375078,5316,224,0.000606,7.029375
CUISINE DESCRIPTION,375078,0,84,0.000224,4.726841
INSPECTION DATE,375078,0,1301,0.003469,9.821872
VIOLATION CODE,375078,1590,73,0.000195,4.396638
CRITICAL FLAG,375078,2793,2,5e-06,0.983115


In [8]:
profile.types()

Unnamed: 0,date,int,str
CAMIS,0,375078,0
DBA,9,52,375017
BORO,0,0,375078
BUILDING,170,365018,9666
STREET,0,0,375078
ZIPCODE,0,369762,0
CUISINE DESCRIPTION,0,0,375078
INSPECTION DATE,375078,0,0
VIOLATION CODE,0,0,373488
CRITICAL FLAG,0,0,372285


### FD Violations

We want to make sure that certain constraints hold on the data before we start generating the charts.

From the [dataset description](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j): "*... When an inspection results in more than one violation, values for associated fields are repeated for each additional violation record.*"

There might be multiple records in the dataset for each business ('CAMIS') and inspection date. We want to ensure that the score that a business gets for an inspection is the same across all records for that individual inspection, i.e., **CAMIS, INSPECTION DATE -> SCORE**.

The unique business identifier should also uniquely determine all the inspection-independent (static) attributes in the dataset, i.e., **CAMIS -> DBA, BORO, BUILDING, STREET, ZIPCODE, CUISINE DESCRIPTION**.

One would also assume that the inspection score determines the inspection grade, i.e., **SCORE -> GRADE**.
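
To make the notion of a functional dependency violation concrete, here is a minimal pandas sketch. The helper `fd_violation_groups` is a hypothetical name introduced for illustration, and the sample rows are hand-made. A group of rows that agree on the left-hand side violates the dependency if the rows disagree on the right-hand side. This is not the openclean implementation, just one way to express the check.

```python
import pandas as pd


def fd_violation_groups(frame, lhs, rhs):
    """Return {lhs key: rows} for groups with more than one distinct rhs value.

    Hypothetical helper for illustration only.
    """
    lhs = [lhs] if isinstance(lhs, str) else list(lhs)
    rhs = [rhs] if isinstance(rhs, str) else list(rhs)
    groups = {}
    for key, rows in frame.groupby(lhs):
        # The FD lhs -> rhs holds within a group if all rows agree on rhs.
        if len(rows[rhs].drop_duplicates()) > 1:
            groups[key] = rows
    return groups


# Hand-made sample: the first inspection has two different scores.
sample = pd.DataFrame({
    'CAMIS': [41720266, 41720266, 50073585, 50073585],
    'INSPECTION DATE': ['2018-03-14', '2018-03-14', '2018-03-23', '2018-03-23'],
    'SCORE': [0, 9, 14, 14],
})

violations = fd_violation_groups(sample, ['CAMIS', 'INSPECTION DATE'], 'SCORE')
print(len(violations))
```

Each entry in the result maps a violating key to the rows that share it, which makes it easy to inspect the conflicting values side by side.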

In [9]:
# FD1: CAMIS, INSPECTION DATE -> SCORE

from openclean.operator.map.violations import fd_violations
fd1_violations = fd_violations(df, ['CAMIS', 'INSPECTION DATE'], 'SCORE')

print('# of violations for FD(CAMIS, INSPECTION DATE -> SCORE) is {}'.format(len(fd1_violations)))

# of violations for FD CAMIS, INSPECTION DATE -> SCORE: 8


In [11]:
# Have a look at the different sets of violations.

for key, gr in fd1_violations.items():
    print(gr[['CAMIS', 'INSPECTION DATE', 'VIOLATION CODE', 'SCORE']])
    print()

          CAMIS INSPECTION DATE VIOLATION CODE  SCORE
4569   41702610      2017-07-17                    29
78197  41702610      2017-07-17                     0

           CAMIS INSPECTION DATE VIOLATION CODE  SCORE
14420   40911114      2017-11-04            04N     20
31755   40911114      2017-11-04            08C     20
49122   40911114      2017-11-04            04M     15
56000   40911114      2017-11-04            08A     20
69598   40911114      2017-11-04            04N     15
112737  40911114      2017-11-04            08A     15
148895  40911114      2017-11-04            08C     15
167080  40911114      2017-11-04            04M     20
274101  40911114      2017-11-04            06C     20

           CAMIS INSPECTION DATE VIOLATION CODE  SCORE
29316   41720266      2018-03-14            02B      0
298918  41720266      2018-03-14            02B      9

           CAMIS INSPECTION DATE VIOLATION CODE  SCORE
66064   50065982      2018-03-23            08A     12
210211  50

#### Violation Repair Strategy

First, we are going to take a closer look at records where the score is zero. Are there other records with a score of zero and, if so, how many? Should we remove all records with a score of zero? Are there other records that we may want to get rid of (e.g., a score of -1)?

For the two groups whose violations do not involve a score of zero, we use majority voting to update the data.
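
Majority voting can be sketched in pandas as follows. This is a minimal illustration on hand-made rows, not the repair code used in this study. The tie-breaking rule (the smallest value wins, because `Series.mode` returns its results sorted) is an assumption of the sketch.

```python
import pandas as pd

# Hand-made sample: one inspection with conflicting scores (three 20s, two 15s).
sample = pd.DataFrame({
    'CAMIS': [40911114] * 5,
    'INSPECTION DATE': ['2017-11-04'] * 5,
    'SCORE': [20, 20, 15, 20, 15],
})

repaired = sample.copy()
# Replace every score in a (business, date) group with the group's most
# frequent score. Assumption: ties resolve to the smallest value.
repaired['SCORE'] = (
    sample.groupby(['CAMIS', 'INSPECTION DATE'])['SCORE']
    .transform(lambda scores: scores.mode().iloc[0])
)

print(repaired['SCORE'].tolist())
```

After the repair, all rows of the inspection carry the same score, so the dependency CAMIS, INSPECTION DATE -> SCORE holds for that group.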

In [12]:
# How many records are there where the 'SCORE' is zero?

df['SCORE'].value_counts().loc[0]

1626

In [33]:
# What are the VIOLATION CODES for score zero?

df.loc[df['SCORE'] == 0]['VIOLATION CODE'].value_counts()

       1579
08A       6
10F       5
04L       5
02B       4
06C       4
22F       3
06D       3
02G       3
10B       2
22G       2
06B       2
04N       1
04M       1
05A       1
04H       1
10D       1
04F       1
09B       1
04K       1
Name: VIOLATION CODE, dtype: int64

In [34]:
# There are also scores of -1

df.loc[df['SCORE'] == -1]['VIOLATION CODE'].value_counts()

08A    11
02G    10
04L     8
06D     8
02B     7
06C     7
10F     7
06E     6
05D     4
04H     4
04A     3
10B     3
04M     3
09B     2
06A     2
06F     2
10H     2
10I     2
04N     2
09C     1
06B     1
07A     1
04C     1
04K     1
08C     1
03A     1
Name: VIOLATION CODE, dtype: int64

In [26]:
# Delete records with a score less than or equal to zero or an empty violation code.

from openclean.function.eval.logic import Or
from openclean.function.eval.null import IsEmpty
from openclean.operator.transform.filter import delete

df = delete(df, Or(IsEmpty('VIOLATION CODE'), Col('SCORE') <= 0))

# What is the pandas equivalent? These are the rows we want to delete
# (the parentheses are needed because '|' binds tighter than '<='):
# df[df['VIOLATION CODE'].isnull() | (df['SCORE'] <= 0)]

NameError: name 'IsEmpty' is not defined
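
A possible pandas equivalent of the delete step, as asked in the cell comment, is sketched below on hand-made rows. It assumes that an "empty" violation code is either a missing value or an empty string; whether that matches the `IsEmpty` semantics exactly is an assumption here.

```python
import pandas as pd

# Hand-made sample rows.
sample = pd.DataFrame({
    'VIOLATION CODE': ['04M', None, '', '08A', '02B'],
    'SCORE': [15, 12, 9, 0, -1],
})

code = sample['VIOLATION CODE']
# Assumption: "empty" means a missing value or an empty string.
is_empty_code = code.isna() | (code == '')

# Parentheses are required: '|' binds tighter than '<=' in Python.
to_delete = is_empty_code | (sample['SCORE'] <= 0)

kept = sample.loc[~to_delete]
print(kept)
```

Building the mask of rows to delete first and then negating it keeps the condition readable and mirrors the `delete(df, predicate)` call.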

In [32]:
df.shape

(373441, 13)

In [53]:
# FD2: CAMIS -> DBA, BORO, BUILDING, STREET, ZIPCODE, CUISINE DESCRIPTION

from openclean.operator.map.violations import fd_violations

fd2_violations = fd_violations(df, 'CAMIS', ['DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE', 'CUISINE DESCRIPTION'])

print('# of violations for FD(CAMIS -> DBA, BORO, BUILDING, STREET, ZIPCODE, CUISINE DESCRIPTION) is {}'.format(len(fd2_violations)))

In [33]:
# FD 3: SCORE -> GRADE

from openclean.operator.map.violations import fd_violations
fd3_violations = fd_violations(df, 'SCORE', 'GRADE')

print('# of violations for FD(SCORE -> GRADE) is {}'.format(len(fd3_violations)))

Create a dataset with 'CAMIS', 'BORO', 'CUISINE', 'DATE', 'TYPE', 'SCORE' with one row per (business, inspection date).
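
One way to get to one row per inspection is sketched below in pandas, on a few hand-made rows. The short column names follow the list above; the mapping from the original column names is an assumption of this sketch. It also assumes that 'SCORE' is already consistent within each (business, inspection date) group, so keeping the first row of each group loses no information.

```python
import pandas as pd

# Hand-made sample: two violation records share the second inspection.
sample = pd.DataFrame({
    'CAMIS': [50073585, 50073585, 50073585],
    'BORO': ['Queens', 'Queens', 'Queens'],
    'CUISINE DESCRIPTION': ['Spanish', 'Spanish', 'Spanish'],
    'INSPECTION DATE': pd.to_datetime(['2018-03-23', '2018-04-30', '2018-04-30']),
    'VIOLATION CODE': ['09B', '04L', '08A'],
    'SCORE': [14, 23, 23],
    'INSPECTION TYPE': [
        'Pre-permit (Operational) / Initial Inspection',
        'Pre-permit (Operational) / Re-inspection',
        'Pre-permit (Operational) / Re-inspection',
    ],
})

keep_columns = [
    'CAMIS', 'BORO', 'CUISINE DESCRIPTION',
    'INSPECTION DATE', 'INSPECTION TYPE', 'SCORE',
]
inspections = (
    sample[keep_columns]
    # Keep the first record of every (business, inspection date) pair.
    .drop_duplicates(subset=['CAMIS', 'INSPECTION DATE'])
    .rename(columns={
        'CUISINE DESCRIPTION': 'CUISINE',
        'INSPECTION DATE': 'DATE',
        'INSPECTION TYPE': 'TYPE',
    })
    .reset_index(drop=True)
)

print(inspections)
```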

In [39]:
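# Show all records for a single establishment. Note that this import
# shadows the built-in filter function.
from openclean.operator.transform.filter import filter

filter(df, Col('CAMIS') == 50073585)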

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,CUISINE DESCRIPTION,INSPECTION DATE,VIOLATION CODE,CRITICAL FLAG,SCORE,GRADE,INSPECTION TYPE
153,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-03-23,09B,N,14,,Pre-permit (Operational) / Initial Inspection
14005,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-04-30,04L,Y,23,B,Pre-permit (Operational) / Re-inspection
19603,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-04-30,08A,N,23,B,Pre-permit (Operational) / Re-inspection
37700,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-03-23,02H,Y,14,,Pre-permit (Operational) / Initial Inspection
71282,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-12-31,10F,N,12,A,Cycle Inspection / Re-inspection
106469,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2019-03-30,10F,N,13,A,Cycle Inspection / Initial Inspection
116799,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2019-03-30,06D,Y,13,A,Cycle Inspection / Initial Inspection
195385,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-11-24,10F,N,16,,Cycle Inspection / Initial Inspection
211683,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-12-31,08A,N,12,A,Cycle Inspection / Re-inspection
256317,50073585,PUNTO ROJO,Queens,14716,HILLSIDE AVE,11435,Spanish,2018-11-24,06C,Y,16,,Cycle Inspection / Initial Inspection


In [41]:
# Count 'INSPECTION TYPE's for records with an empty 'GRADE'.

stream(df).filter(IsEmpty('GRADE')).select('INSPECTION TYPE').distinct()


Counter({'Cycle Inspection / Initial Inspection': 146307,
         'Pre-permit (Operational) / Initial Inspection': 18158,
         'Pre-permit (Operational) / Compliance Inspection': 1177,
         'Cycle Inspection / Re-inspection': 3439,
         'Pre-permit (Non-operational) / Initial Inspection': 2902,
         'Pre-permit (Operational) / Reopening Inspection': 451,
         'Pre-permit (Non-operational) / Re-inspection': 249,
         'Cycle Inspection / Reopening Inspection': 1095,
         'Cycle Inspection / Compliance Inspection': 718,
         'Inter-Agency Task Force / Initial Inspection': 769,
         'Pre-permit (Operational) / Re-inspection': 533,
         'Pre-permit (Operational) / Second Compliance Inspection': 104,
         'Cycle Inspection / Second Compliance Inspection': 16,
         'Pre-permit (Non-operational) / Compliance Inspection': 18,
         'Trans Fat / Re-inspection': 1,
         'Administrative Miscellaneous / Re-inspection': 1})
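
The same kind of count can be sketched in plain pandas with `value_counts`. The rows below are hand-made, and treating both missing values and empty strings as an empty 'GRADE' is an assumption of this sketch.

```python
import pandas as pd

# Hand-made sample rows.
sample = pd.DataFrame({
    'GRADE': [None, 'B', '', 'A', None],
    'INSPECTION TYPE': [
        'Cycle Inspection / Initial Inspection',
        'Pre-permit (Operational) / Re-inspection',
        'Cycle Inspection / Initial Inspection',
        'Cycle Inspection / Re-inspection',
        'Pre-permit (Operational) / Initial Inspection',
    ],
})

grade = sample['GRADE']
# Assumption: an empty grade is a missing value or an empty string.
no_grade = grade.isna() | (grade == '')

# Count how often each inspection type occurs among ungraded records.
counts = sample.loc[no_grade, 'INSPECTION TYPE'].value_counts()
print(counts)
```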

In [45]:
from openclean.operator.map.groupby import groupby

establishments = groupby(df, 'CAMIS')

In [46]:
len(establishments)

25508

In [48]:
list(establishments.keys())[:10]

[50069505,
 50069510,
 50069517,
 50069518,
 50069519,
 50069532,
 41680924,
 50069534,
 50069539,
 41418789]

In [50]:
establishments.get(50069505)[['INSPECTION DATE', 'VIOLATION CODE', 'SCORE', 'GRADE', 'INSPECTION TYPE']].sort_values('INSPECTION DATE')

Unnamed: 0,INSPECTION DATE,VIOLATION CODE,SCORE,GRADE,INSPECTION TYPE
366185,2017-10-25,04H,14,,Pre-permit (Operational) / Initial Inspection
281900,2017-10-25,09B,14,,Pre-permit (Operational) / Initial Inspection
183023,2017-10-25,04M,14,,Pre-permit (Operational) / Initial Inspection
139880,2017-10-25,04N,14,,Pre-permit (Operational) / Initial Inspection
353141,2018-01-25,04M,13,A,Pre-permit (Operational) / Re-inspection
319668,2018-01-25,10F,13,A,Pre-permit (Operational) / Re-inspection
77643,2018-01-25,08A,13,A,Pre-permit (Operational) / Re-inspection
100147,2018-09-06,06A,29,,Cycle Inspection / Initial Inspection
99727,2018-09-06,04M,29,,Cycle Inspection / Initial Inspection
5178,2018-09-06,06D,29,,Cycle Inspection / Initial Inspection


In [31]:
df

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,CUISINE DESCRIPTION,INSPECTION DATE,VIOLATION CODE,CRITICAL FLAG,SCORE,GRADE,INSPECTION TYPE
0,50001317,D'LILI BAKERY,Manhattan,526,W 207TH ST,10034,Bakery,2018-05-04,02G,Y,19,,Cycle Inspection / Initial Inspection
1,41716448,HARLEM SHAKE,Manhattan,100,WEST 124 STREET,10027,American,2019-07-30,09B,N,27,B,Cycle Inspection / Re-inspection
2,50044535,HAWKERS,Manhattan,225,E 14TH ST,10003,Asian,2017-06-14,04L,Y,24,,Cycle Inspection / Initial Inspection
3,50041601,118 KITCHEN,Manhattan,1,E 118TH ST,10035,Tex-Mex,2019-05-09,10B,N,11,A,Cycle Inspection / Initial Inspection
4,50045596,NEW CHINA,Bronx,5690,MOSHOLU AVE,10471,Chinese,2016-09-07,04M,Y,12,A,Cycle Inspection / Re-inspection
...,...,...,...,...,...,...,...,...,...,...,...,...,...
392126,50036584,BAR GOTO,Manhattan,245,ELDRIDGE ST,10002,Japanese,2016-09-14,06D,Y,12,A,Cycle Inspection / Initial Inspection
392127,41519582,CHINA KING,Staten Island,14,BRADLEY AVENUE,10314,Chinese,2019-03-13,10F,N,21,,Cycle Inspection / Initial Inspection
392128,50043143,IHOP,Queens,15517,NORTHERN BLVD,11354,Pancakes/Waffles,2016-11-07,04N,Y,30,,Cycle Inspection / Initial Inspection
392129,50002267,OCEAN SUSHI,Staten Island,20,JEFFERSON BLVD,10312,Japanese,2016-06-01,08A,N,32,,Cycle Inspection / Initial Inspection
