# Profiling - DOHMH New York City Restaurant Inspection Results

## Data Profiling

In this step we start by getting an overview of the properties of the Restaurant Inspection dataset. We then focus on a few quality issues in the data that we want to extract to plot the score distributions.

In [1]:
# For this demo we make use of openclean's streaming option for large datasets (to avoid
# having to load the full dataset into memory).

import os

from openclean.pipeline import stream

ds = stream(os.path.join('data', '43nn-pn8j.tsv.gz'))

### Basic Dataset Properties and Statistics

In [2]:
# Get the list of column names in the dataset.

ds.columns

['CAMIS',
 'DBA',
 'BORO',
 'BUILDING',
 'STREET',
 'ZIPCODE',
 'PHONE',
 'CUISINE DESCRIPTION',
 'INSPECTION DATE',
 'ACTION',
 'VIOLATION CODE',
 'VIOLATION DESCRIPTION',
 'CRITICAL FLAG',
 'SCORE',
 'GRADE',
 'GRADE DATE',
 'RECORD DATE',
 'INSPECTION TYPE',
 'Latitude',
 'Longitude',
 'Community Board',
 'Council District',
 'Census Tract',
 'BIN',
 'BBL',
 'NTA']

In [3]:
# Take a look at the first 10 rows in the dataset.

ds.head()

| | CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | PHONE | CUISINE DESCRIPTION | INSPECTION DATE | ACTION | ... | RECORD DATE | INSPECTION TYPE | Latitude | Longitude | Community Board | Council District | Census Tract | BIN | BBL | NTA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50001317 | D'LILI BAKERY | Manhattan | 526 | W 207TH ST | 10034 | 2123040356 | Bakery | 05/04/2018 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.865269576991 | -73.919671779695 | 112 | 10 | 29300 | 1064780 | 1022220025 | MN01 |
| 1 | 41716448 | HARLEM SHAKE | Manhattan | 100 | WEST 124 STREET | 10027 | 2122228300 | American | 07/30/2019 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.807286117139 | -73.946454954277 | 110 | 9 | 22200 | 1057790 | 1019080035 | MN11 |
| 2 | 50044535 | HAWKERS | Manhattan | 225 | E 14TH ST | 10003 | 2129821688 | Asian | 06/14/2017 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.732989026505 | -73.986454624452 | 106 | 2 | 4800 | 1019505 | 1008960012 | MN21 |
| 3 | 50041601 | 118 KITCHEN | Manhattan | 1 | E 118TH ST | 10035 | 2127226888 | Tex-Mex | 05/09/2019 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.801870167292 | -73.945260143617 | 111 | 9 | 18400 | 1053946 | 1017450001 | MN34 |
| 4 | 50045596 | NEW CHINA | Bronx | 5690 | MOSHOLU AVE | 10471 | 7188841111 | Chinese | 09/07/2016 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.905434879643 | -73.900843816155 | 208 | 11 | 33700 | 2084805 | 2058481761 | BX22 |
| 5 | 41156678 | SUNSWICK | Queens | 3502 | 35 STREET | 11106 | 7187520620 | American | 05/02/2017 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.756372985356 | -73.925517485863 | 401 | 26 | 5700 | 4009585 | 4006380025 | QN70 |
| 6 | 41434036 | CANCUN MEXICAN RESTAURANT | Manhattan | 937 | 8 AVENUE | 10019 | 2123077307 | Mexican | 04/14/2016 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.765676019444 | -73.983642958358 | 104 | 3 | 13900 | 1025437 | 1010460032 | MN15 |
| 7 | 50058219 | HAN JOO BBQ CYCJ | Queens | 4106 | 149TH PL | 11355 | 7183596888 | Korean | 03/21/2017 | Violations were cited in the following area(s). | ... | 09/28/2019 | Pre-permit (Operational) / Initial Inspection | 40.762113093414 | -73.814688971988 | 407 | 20 | 116700 | 4447126 | 4050540032 | QN51 |
| 8 | 41018004 | ANGELICA PIZZERIA | Brooklyn | 30 | NEVINS STREET | 11217 | 7188522728 | Pizza/Italian | 04/18/2019 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.687850595312 | -73.981473402889 | 302 | 33 | 3700 | 3000515 | 3001660037 | BK38 |
| 9 | 50064285 | E.A.K. RAMEN | Manhattan | 469 | 6TH AVE | 10011 | 6468632027 | Japanese | 06/07/2017 | Establishment Closed by DOHMH. Violations wer... | ... | 09/28/2019 | Pre-permit (Operational) / Initial Inspection | 40.73558361361 | -73.998141677927 | 102 | 3 | 7100 | 1010580 | 1006070045 | MN23 |


In [4]:
# Count the number of rows in the dataset.

ds.count()

392131

In [5]:
# Get basic profile information for all columns in the dataset.

profiles = ds.profile()
profiles

[{'column': 'CAMIS',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': Counter({'int': 392131}),
   'minmaxValues': {'int': {'minimum': 30075445, 'maximum': 50099136}}}},
 {'column': 'DBA',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 588,
   'datatypes': Counter({'str': 391478, 'int': 55, 'date': 10}),
   'minmaxValues': {'str': {'minimum': '#1 Chinese Restaurant',
     'maximum': 'zx accounting service llc'},
    'int': {'minimum': 33, 'maximum': 1976},
    'date': {'minimum': datetime.datetime(1983, 1, 2, 0, 0),
     'maximum': datetime.datetime(1983, 1, 2, 0, 0)}}}},
 {'column': 'BORO',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': Counter({'str': 392015, 'int': 116}),
   'minmaxValues': {'str': {'minimum': 'Bronx', 'maximum': 'Staten Island'},
    'int': {'minimum': 0, 'maximum': 0}}}},
 {'column': 'BUILDING',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 231,
   'datatypes': Counter({'i

In [6]:
# Print number of empty cells for each column

profiles.stats()['empty']

CAMIS                         0
DBA                         588
BORO                          0
BUILDING                    231
STREET                        0
ZIPCODE                    5557
PHONE                        11
CUISINE DESCRIPTION           0
INSPECTION DATE               0
ACTION                     1337
VIOLATION CODE             5696
VIOLATION DESCRIPTION      8918
CRITICAL FLAG              8918
SCORE                     17046
GRADE                    193610
GRADE DATE               195416
RECORD DATE                   0
INSPECTION TYPE            1337
Latitude                    430
Longitude                   430
Community Board            5987
Council District           5987
Census Tract               5987
BIN                        7670
BBL                         430
NTA                        5987
Name: empty, dtype: int64
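
The raw counts are easier to compare as fractions of the total row count. The snippet below is a small sketch in plain Python (it does not use openclean); the numbers are copied from the outputs above, and the variable names are only illustrative.

```python
# Total number of rows as returned by ds.count() above.
total_rows = 392131

# Empty-cell counts copied from the output above (three columns only).
empty_counts = {'SCORE': 17046, 'GRADE': 193610, 'GRADE DATE': 195416}

# Fraction of rows with an empty value, per column.
empty_ratio = {col: n / total_rows for col, n in empty_counts.items()}

for col, ratio in empty_ratio.items():
    print(f'{col}: {ratio:.1%}')
```

Roughly half of the records have no 'GRADE', whereas 'SCORE' is missing for less than 5% of the rows.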

In [7]:
# How many records are there for date '09/28/2019'
from openclean.function.eval.base import Col

ds.filter(Col('RECORD DATE') == '09/28/2019').count()

392131

In [8]:
# Print column type information (as data frame)

profiles.types()

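| | date | float | int | str |
|---|---|---|---|---|
| CAMIS | 0 | 0 | 392131 | 0 |
| DBA | 10 | 0 | 55 | 391478 |
| BORO | 0 | 0 | 116 | 392015 |
| BUILDING | 180 | 0 | 381625 | 10095 |
| STREET | 0 | 0 | 0 | 392131 |
| ZIPCODE | 0 | 0 | 386558 | 16 |
| PHONE | 0 | 0 | 391725 | 395 |
| CUISINE DESCRIPTION | 0 | 0 | 0 | 392131 |
| INSPECTION DATE | 392131 | 0 | 0 | 0 |
| ACTION | 0 | 0 | 0 | 390794 |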


### Datatype Outliers (?)

Note that type detection is difficult and can be ambiguous for some values. Let us take a closer look at some columns that appear to have values of different types.

In [9]:
# Get a quick look at the columns that contain values of
# different (raw) data types.

profiles.multitype_columns().types()

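| | date | float | int | str |
|---|---|---|---|---|
| DBA | 10 | 0 | 55 | 391478 |
| BORO | 0 | 0 | 116 | 392015 |
| BUILDING | 180 | 0 | 381625 | 10095 |
| ZIPCODE | 0 | 0 | 386558 | 16 |
| PHONE | 0 | 0 | 391725 | 395 |
| VIOLATION CODE | 0 | 163 | 0 | 386272 |
| Latitude | 0 | 386144 | 5557 | 0 |
| Longitude | 0 | 386144 | 5557 | 0 |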


In [10]:
# Get minimum and maximum values for each datatype in column 'DBA'

profiles.minmax('DBA')

| | min | max |
|---|---|---|
| str | #1 Chinese Restaurant | zx accounting service llc |
| int | 33 | 1976 |
| date | 1983-01-02 00:00:00 | 1983-01-02 00:00:00 |


In [11]:
# Which values are identified as 'date' in column 'BUILDING'

from openclean.function.eval.datatype import IsDatetime

ds.select('BUILDING').filter(IsDatetime('BUILDING')).distinct()

Counter({'271 1/2': 7,
         '94 1/2': 23,
         '168 1/2': 5,
         '3906 1/2': 12,
         '34   1/2': 37,
         '120 1/2': 10,
         '227-02/08': 35,
         '134 1/2': 7,
         '1045 1/2': 19,
         '22 1-2': 6,
         '42 1/2': 3,
         '67 1/2': 6,
         '501 1/2': 6,
         '48 1/2': 3,
         '27 1/2': 1})
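
Most of these values look like house numbers with a "1/2" suffix rather than dates. The sketch below shows one way such values could be told apart from genuine dates with a pattern check. The regular expression and the helper name `is_half_address` are assumptions made for illustration; they are not part of openclean.

```python
import re

# A house number, whitespace, then '1/2' or '1-2' (e.g. '271 1/2', '22 1-2').
HALF_ADDRESS = re.compile(r'^\d+\s+1[/-]2$')


def is_half_address(value):
    """Return True if the value looks like a 'half' house number."""
    return HALF_ADDRESS.match(value.strip()) is not None


# A few of the values reported as dates above.
values = ['271 1/2', '34   1/2', '22 1-2', '227-02/08']

# Values that do not follow the 'half' pattern need a closer look.
unmatched = [v for v in values if not is_half_address(v)]
print(unmatched)  # ['227-02/08']
```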

In [12]:
# Filter distinct values in column 'VIOLATION CODE' that were recognized
# as being of type float.

from openclean.function.eval.base import Col
from openclean.function.eval.datatype import IsFloat

ds.filter(IsFloat('VIOLATION CODE')).distinct('VIOLATION CODE')

Counter({'15E2': 134, '15E3': 29})
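
These two codes are reported as floats because a string of the form `<digits>E<digits>` is valid scientific notation. The snippet below illustrates the ambiguity with Python's built-in `float()`. It is a sketch only: openclean's internal type check may differ, and the codes `'04L'` and `'10F'` are illustrative values that do not come from the output above.

```python
def looks_like_float(value):
    """Return True if float() accepts the string but int() does not."""
    try:
        int(value)
        return False
    except ValueError:
        pass
    try:
        float(value)
        return True
    except ValueError:
        return False


codes = ['15E2', '15E3', '04L', '10F']
float_like = [c for c in codes if looks_like_float(c)]

print(float_like)     # ['15E2', '15E3']
print(float('15E2'))  # 1500.0
```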

In [13]:
# List the distinct values (and their total counts) for
# column 'BORO'

ds.distinct('BORO')

Counter({'Manhattan': 153958,
         'Bronx': 35761,
         'Queens': 89963,
         'Brooklyn': 99317,
         'Staten Island': 13016,
         '0': 116})

### Boroughs that are '0'

Take a look at the rows that have '0' as 'BORO'. Is it possible to infer the borough from other columns such as BBL (borough-block-lot) or NTA (Neighborhood Tabulation Area)?
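
In principle both columns carry the borough: the first digit of a ten-digit BBL is the borough code, and NTA codes start with a two-letter borough prefix, which matches the sample rows shown earlier (e.g. BBL `1022220025` and NTA `MN01` for a Manhattan record). The helper below is a hypothetical sketch in plain Python, not an openclean function, and it only helps for rows where BBL or NTA is actually filled in.

```python
# Borough code in the first digit of a BBL.
BBL_BOROUGH = {
    '1': 'Manhattan', '2': 'Bronx', '3': 'Brooklyn',
    '4': 'Queens', '5': 'Staten Island',
}

# Two-letter borough prefix of an NTA code.
NTA_BOROUGH = {
    'MN': 'Manhattan', 'BX': 'Bronx', 'BK': 'Brooklyn',
    'QN': 'Queens', 'SI': 'Staten Island',
}


def infer_borough(bbl=None, nta=None):
    """Infer the borough from BBL first, then NTA; return None if neither helps."""
    bbl = str(bbl).strip() if bbl is not None else ''
    if len(bbl) == 10 and bbl[0] in BBL_BOROUGH:
        return BBL_BOROUGH[bbl[0]]
    nta = str(nta).strip().upper() if nta is not None else ''
    return NTA_BOROUGH.get(nta[:2])


print(infer_borough('1022220025', 'MN01'))  # Manhattan
print(infer_borough(None, 'BK38'))          # Brooklyn
```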

In [14]:
# Profile only those rows that have a 'BORO' value '0'. Here we use the default
# column profiler that generates sets of distinct values for each column.

from openclean.function.eval.base import Col
from openclean.profiling.column import DefaultColumnProfiler

boro_0 = ds.filter(Col('BORO') == '0').profile(default_profiler=DefaultColumnProfiler)
boro_0.stats()

| | total | empty | distinct | uniqueness | entropy |
|---|---|---|---|---|---|
| CAMIS | 116 | 0 | 21 | 0.181034 | 2.75874 |
| DBA | 116 | 9 | 12 | 0.11215 | 2.29743 |
| BORO | 116 | 0 | 1 | 0.008621 | 0.0 |
| BUILDING | 116 | 0 | 6 | 0.051724 | 2.207016 |
| STREET | 116 | 0 | 6 | 0.051724 | 2.207016 |
| ZIPCODE | 116 | 0 | 4 | 0.034483 | 1.319701 |
| PHONE | 116 | 0 | 21 | 0.181034 | 2.75874 |
| CUISINE DESCRIPTION | 116 | 0 | 4 | 0.034483 | 1.596319 |
| INSPECTION DATE | 116 | 0 | 32 | 0.275862 | 4.709064 |
| ACTION | 116 | 16 | 1 | 0.01 | 0.0 |


In [15]:
# There are only four different zip codes. The different values and their
# frequency are included in the profiler results.

boro_0.column('ZIPCODE')['topValues']

[('11249', 81), ('N/A', 16), ('10168', 14), ('10285', 5)]

In [16]:
# What are the boroughs that are associated with zip codes
# '10168', '10285' in our dataset.

from openclean.function.eval.domain import IsIn

ds.filter(IsIn('ZIPCODE', {'10168', '10285'})).select('BORO').distinct()

Counter({'Manhattan': 8, '0': 19})

In [17]:
# What are the boroughs that are associated with zip code
# '11249' in our dataset.

ds.filter(Col('ZIPCODE') == '11249').select('BORO').distinct()

Counter({'Brooklyn': 2358, '0': 81})

In [18]:
# NOTE (for cleaning step):
#
# It seems safe to replace '0' in column 'BORO' for those rows that have
# zip codes in ['11249', '10168', '10285'] using the mapping below. Also
# delete those records that have 'ZIPCODE' 'N/A'.

boro_0_mapping = {'11249': 'Brooklyn', '10168': 'Manhattan', '10285': 'Manhattan'}
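
The note above could be applied along the following lines. This is a sketch in plain Python over dictionaries rather than the openclean cleaning step itself. It assumes one reading of the note: rows with 'BORO' '0' whose zip code is not in the mapping (such as 'N/A') are dropped, and all other rows are left unchanged. The function name `repair_boro` and the sample rows are made up for illustration.

```python
boro_0_mapping = {'11249': 'Brooklyn', '10168': 'Manhattan', '10285': 'Manhattan'}


def repair_boro(rows, mapping):
    """Yield rows with BORO '0' replaced via ZIPCODE.

    Rows with BORO '0' and an unmapped zip code are skipped.
    """
    for row in rows:
        if row['BORO'] != '0':
            yield row
        elif row['ZIPCODE'] in mapping:
            yield {**row, 'BORO': mapping[row['ZIPCODE']]}
        # Otherwise (e.g. ZIPCODE 'N/A') the row is dropped.


# Made-up sample rows.
sample = [
    {'BORO': '0', 'ZIPCODE': '11249'},
    {'BORO': '0', 'ZIPCODE': 'N/A'},
    {'BORO': 'Queens', 'ZIPCODE': '11106'},
    {'BORO': '0', 'ZIPCODE': '10168'},
]

repaired = list(repair_boro(sample, boro_0_mapping))
print([r['BORO'] for r in repaired])  # ['Brooklyn', 'Queens', 'Manhattan']
```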

### New Establishments

From the [dataset description](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j): "*... Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.*"

We expect that records with an 'INSPECTION DATE' of 1/1/1900 have no value for 'ACTION', 'INSPECTION TYPE', 'SCORE', and 'GRADE'.

In [19]:
from datetime import datetime
from openclean.function.eval.datatype import Datetime
from openclean.profiling.column import DefaultColumnProfiler

# Date that identifies new establishments that have not
# been inspected.
new_establ_date = datetime(1900, 1, 1)

# Count total, empty, and distinct values for the columns that we
# expect to be empty for new establishments.
ds.filter(Datetime('INSPECTION DATE') == new_establ_date)\
    .select(['ACTION', 'INSPECTION TYPE', 'SCORE', 'GRADE'])\
    .profile(default_profiler=DefaultColumnProfiler)\
    .stats()

| | total | empty | distinct | uniqueness | entropy |
|---|---|---|---|---|---|
| ACTION | 1337 | 1337 | 0 | | |
| INSPECTION TYPE | 1337 | 1337 | 0 | | |
| SCORE | 1337 | 1337 | 0 | | |
| GRADE | 1337 | 1337 | 0 | | |


In [20]:
# NOTE (for cleaning step):
#
# Delete records with date 1900/1/1.

### Missing Scores

We are interested in the overall scores that were assigned to individual establishments as a result of an inspection. For new establishments we have seen that the score is missing. Are there other records that do not have a 'SCORE'?

In [21]:
from openclean.function.eval.logic import And
from openclean.function.eval.datatype import Datetime
from openclean.function.eval.null import IsEmpty

df = ds.filter(And(Datetime('INSPECTION DATE') != new_establ_date, IsEmpty('SCORE'))).to_df()

In [22]:
df.shape

(15709, 26)

In [23]:
df['INSPECTION TYPE'].value_counts()

Administrative Miscellaneous / Initial Inspection              7018
Smoke-Free Air Act / Initial Inspection                        2320
Administrative Miscellaneous / Re-inspection                   2063
Trans Fat / Initial Inspection                                 1648
Calorie Posting / Initial Inspection                           1123
Smoke-Free Air Act / Re-inspection                              586
Trans Fat / Re-inspection                                       394
Calorie Posting / Re-inspection                                 233
Administrative Miscellaneous / Compliance Inspection            118
Administrative Miscellaneous / Reopening Inspection             102
Trans Fat / Compliance Inspection                                42
Administrative Miscellaneous / Second Compliance Inspection      19
Smoke-Free Air Act / Compliance Inspection                       17
Calorie Posting / Compliance Inspection                          16
Trans Fat / Second Compliance Inspection        

In [24]:
# QUESTION:
#
# Is there an easy way to get the difference between distinct value counts for those records that
# have a score and those that don't? That is, are there INSPECTION TYPEs that never result in a score
# and others that always have a score?

# NOTE (for cleaning step):
#
# Delete records with empty score.
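
One possible answer to the question above is to count inspection types separately for records with and without a score and then compare the two key sets. The sketch below uses only `collections.Counter` on a list of dictionaries. The helper name `split_types_by_score` and the sample rows are made up for illustration; this is not an openclean feature.

```python
from collections import Counter


def split_types_by_score(rows):
    """Group inspection types by whether their records have a score."""
    with_score, without_score = Counter(), Counter()
    for row in rows:
        has_score = row.get('SCORE') not in (None, '')
        target = with_score if has_score else without_score
        target[row['INSPECTION TYPE']] += 1
    return {
        'never': set(without_score) - set(with_score),
        'always': set(with_score) - set(without_score),
        'mixed': set(with_score) & set(without_score),
    }


# Made-up rows for illustration only.
rows = [
    {'INSPECTION TYPE': 'Cycle Inspection / Initial Inspection', 'SCORE': '12'},
    {'INSPECTION TYPE': 'Cycle Inspection / Initial Inspection', 'SCORE': ''},
    {'INSPECTION TYPE': 'Trans Fat / Initial Inspection', 'SCORE': ''},
    {'INSPECTION TYPE': 'Cycle Inspection / Re-inspection', 'SCORE': '7'},
]

groups = split_types_by_score(rows)
print(groups['never'])  # {'Trans Fat / Initial Inspection'}
```

With a pandas data frame that contains all rows, `pd.crosstab(df['INSPECTION TYPE'], df['SCORE'].isna())` should give a comparable table, assuming that empty scores are read as missing values.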

### Cuisine Types

We want to compare inspection results by cuisine type. Get a quick look at the different values in that column.

In [25]:
ds.distinct('CUISINE DESCRIPTION')

Counter({'Bakery': 12305,
         'American': 82709,
         'Asian': 6237,
         'Tex-Mex': 1786,
         'Chinese': 41431,
         'Mexican': 15811,
         'Korean': 5431,
         'Pizza/Italian': 8354,
         'Japanese': 13956,
         'Donuts': 5199,
         'Sandwiches': 4108,
         'Pizza': 17536,
         'Caribbean': 14080,
         'French': 4499,
         'Delicatessen': 6257,
         'Seafood': 3092,
         'Café/Coffee/Tea': 18833,
         'Spanish': 12231,
         'Salads': 956,
         'Juice, Smoothies, Fruit Salads': 4573,
         'Afghan': 190,
         'Chinese/Japanese': 884,
         'Filipino': 666,
         'Thai': 5334,
         'Irish': 3336,
         'Latin (Cuban, Dominican, Puerto Rican, South & Central American)': 17117,
         'Eastern European': 1369,
         'Turkish': 1179,
         'Italian': 15798,
         'Bottled beverages, including water, sodas, juices, etc.': 1202,
         'Hamburgers': 4385,
         'African': 1729,
