# Final Project

## NYC's New Restaurant Owners Guide

#### The dos and don'ts to keep an eye on in order to pass inspections

### Analysis to Explore

#### Food Safety Inspections:

- Most Common Violations for NYC Restaurants
  - Trends Over Time (Common Violations per Year)

- Average Grades by Borough

- Common Cuisines by Borough

### About this DataFrame

#### What it is:
 - Inspection results for restaurants and college cafeterias in an active status as of 06/17/2024
 - This dataset is from NYC Open Data and can be found [here](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data)
 - Another way to view this information is by going to [The Health Department’s Restaurant Grading Website](http://www1.nyc.gov/site/doh/services/restaurant-grades.page)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as st

In [2]:
inspections_df = pd.read_csv('New_York_City_Restaurant_Inspection_Results.csv')

In [3]:
with pd.option_context('display.max_columns', None):
    display(inspections_df.head())

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location Point1
0,50132353,AVENUE RESTAURANT,Manhattan,355,WEST 16 STREET,10011.0,2122292559,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.741726,-74.003082,104.0,3.0,8300.0,1088885.0,1007408000.0,MN13,
1,50151116,PLANET HOLLYWOOD/CHICKEN GUY!,Manhattan,136,WEST 42 STREET,10036.0,4079035637,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.755331,-73.985313,105.0,4.0,11300.0,1022578.0,1009948000.0,MN17,
2,50150858,,Queens,13335,ROOSEVELT AVE,11354.0,9179743745,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.758502,-73.833242,407.0,20.0,87100.0,4112276.0,4049730000.0,QN22,
3,50149676,CONOR'S GOAT,Manhattan,23,AVENUE A,10009.0,2126735550,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.722864,-73.985865,103.0,2.0,3002.0,1005747.0,1004290000.0,MN22,
4,50132087,FANTASTIC BEASTS,Queens,36-10,UNION STREET,11354.0,2538807717,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.763482,-73.828056,407.0,20.0,86900.0,4112354.0,4049770000.0,QN22,


In [4]:
inspections_df.isna().sum()

CAMIS                         0
DBA                         716
BORO                          0
BUILDING                    331
STREET                       12
ZIPCODE                    2795
PHONE                         3
CUISINE DESCRIPTION        2475
INSPECTION DATE               0
ACTION                     2475
VIOLATION CODE             3728
VIOLATION DESCRIPTION      3728
CRITICAL FLAG                 0
SCORE                     11339
GRADE                    119784
GRADE DATE               128360
RECORD DATE                   0
INSPECTION TYPE            2475
Latitude                    340
Longitude                   340
Community Board            3368
Council District           3364
Census Tract               3364
BIN                        4562
BBL                         585
NTA                        3368
Location Point1          234037
dtype: int64

### Cleaning that could be done

> Change names of columns **DONE!**

> Drop rows that don't have violation code or description (this is the base of the analysis) **DONE!**

> Drop unnecessary columns (Location Point, Phone, Building... many more!) **DONE!**

> Check for Duplicates **DONE!**

> Check Null Values (fill with average/median/other or drop) **DONE!**

In [5]:
#Drop Duplicates
inspections_df.drop_duplicates(inplace=True)

#Drop Null Values for Violation Code - necessary for our analysis
inspections_df.dropna(subset='VIOLATION CODE', inplace=True)

#Drop Unnecessary Columns
inspections_df.drop(columns=['BUILDING', 'STREET', 'ZIPCODE', 'PHONE', 'SCORE', 'GRADE DATE', 'RECORD DATE', 'Community Board', 'Council District', 'Census Tract', 'BIN', 'BBL', 'NTA', 'Location Point1'], inplace=True)

#Make Columns Lowercase & Change Spaces to '_'
inspections_df.columns = inspections_df.columns.str.lower().str.replace(' ', '_')

In [6]:
inspections_df.duplicated().sum()

4

In [7]:
inspections_df.isna().sum()

camis                         0
dba                           0
boro                          0
cuisine_description           0
inspection_date               0
action                        0
violation_code                0
violation_description         0
critical_flag                 0
grade                    116516
inspection_type               0
latitude                    280
longitude                   280
dtype: int64

In [8]:
inspections_df.shape

(230303, 13)

In [9]:
#Rename Columns for Clarification
inspections_df.rename(columns = {'camis':'establishment_id', 'dba':'establishment_name', 'boro':'borough'}, inplace=True)

In [10]:
#Change Date Columns to DateTime (dates in the file are MM/DD/YYYY, e.g. 06/17/2024)
inspections_df['inspection_date'] = pd.to_datetime(inspections_df['inspection_date'], format='%m/%d/%Y')

In [11]:
inspections_df.reset_index(drop=True, inplace=True)

In [12]:
inspections_df['establishment_id'].duplicated().sum()

203956

In [13]:
inspections_df['establishment_name'].duplicated().sum()

209320

### Although these values are duplicated, some are chain restaurants and the rows could belong to different inspections.
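The difference between a repeated id and a truly repeated record can be illustrated on a small made-up frame. This is only a sketch: `toy_df` and its values are hypothetical, and it assumes one row per cited violation, as in the inspection file.

```python
import pandas as pd

# Hypothetical toy data: one row per cited violation
toy_df = pd.DataFrame({
    'establishment_id': [1, 1, 1, 2],
    'inspection_date': ['2023-01-05', '2023-01-05', '2023-01-05', '2023-02-10'],
    'violation_code': ['02B', '04L', '04L', '02B'],
})

# Repeats of the id alone are expected: one inspection can cite several violations
id_repeats = toy_df['establishment_id'].duplicated().sum()

# Repeats across id + date + code mean the same violation was recorded twice
true_duplicates = toy_df.duplicated(
    subset=['establishment_id', 'inspection_date', 'violation_code']
).sum()

print(id_repeats, true_duplicates)
```

Only the second count points at rows that are candidates for dropping.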

In [14]:
#Making sure that the 'duplicated' restaurants are not the same inspections

inspections_df[['establishment_name', 'establishment_id', 'borough', 'inspection_date', 'violation_code', 'violation_description', 'latitude', 'longitude', 'inspection_type']].duplicated().sum()

4

In [15]:
#Drop rows with duplicated inspections (at least for the following columns)

inspections_df.drop_duplicates(subset=['establishment_name', 'establishment_id', 'borough', 'inspection_date', 'violation_code', 'violation_description', 'latitude', 'longitude', 'inspection_type'], inplace=True)

In [16]:
#Confirmation that the rows were dropped

inspections_df[['establishment_name', 'establishment_id', 'borough', 'inspection_date', 'violation_code', 'violation_description', 'latitude', 'longitude', 'inspection_type']].duplicated().sum()

0

In [17]:
#Another way to confirm that rows have unique values

inspections_df[['establishment_name', 'establishment_id', 'borough', 'inspection_date', 'violation_code']].value_counts().sort_values()

establishment_name                establishment_id  borough    inspection_date  violation_code
"U" LIKE CHINESE TAKE OUT         50126747          Manhattan  2023-02-03       02B               1
#1 GARDEN CHINESE RESTAURANT      50075009          Brooklyn   2023-02-09       06B               1
$1 PIZZA                          50086385          Manhattan  2022-08-16       08A               1
"W" CAFE                          50018480          Manhattan  2023-08-22       04M               1
"U" LIKE CHINESE TAKE OUT         50126747          Manhattan  2023-02-03       04H               1
                                                                                                 ..
kampai hibachi & brazilian Grill  50107632          Queens     2023-03-23       10B               1
                                                                                10F               1
                                                                                10G               1
ZOUJI

In [18]:
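#Fill Null Values with N = Not Yet Graded
#(assign the result back; fillna with inplace=True on a selected column is unreliable in newer pandas)
inspections_df['grade'] = inspections_df['grade'].fillna('N')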

In [19]:
inspections_df['grade'].value_counts()

#Should I combine N (Not Yet Graded) and Z (Grade Pending)
#P is pending but issued on re-opening following an initial inspection that resulted in a closure

grade
N    125079
A     78957
B     13394
C      8041
Z      4198
P       630
Name: count, dtype: int64

### Grade Values

- N = Not Yet Graded

- A = Grade A

- B = Grade B

- C = Grade C

- Z = Grade Pending

- P = Grade Pending issued on re-opening following an initial inspection that resulted in a closure
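As a rough sketch, the grade codes listed above could be translated into readable labels with a dictionary. The names `grade_labels` and `sample_grades` are hypothetical and the sample values are made up; the shortened label for `P` is also an assumption.

```python
import pandas as pd

# Labels taken from the grade list above ('P' label shortened for display)
grade_labels = {
    'N': 'Not Yet Graded',
    'A': 'Grade A',
    'B': 'Grade B',
    'C': 'Grade C',
    'Z': 'Grade Pending',
    'P': 'Grade Pending (re-opening after closure)',
}

# Hypothetical sample of grade codes, including a missing value
sample_grades = pd.Series(['A', None, 'Z', 'P'])

# Treat missing grades as 'N', then translate codes to labels
readable = sample_grades.fillna('N').map(grade_labels)
print(readable.tolist())
```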

In [20]:
#Confirm that all rows have had inspections in order to 'assume' that N and Z could be combined

inspections_df['inspection_date'].value_counts().sort_index()

inspection_date
2015-09-24      5
2015-10-14      2
2015-10-15      1
2015-11-19      2
2015-11-20      4
             ... 
2024-06-11    457
2024-06-12    367
2024-06-13    572
2024-06-14    238
2024-06-15      3
Name: count, Length: 1694, dtype: int64

In [21]:
#inspections_df['inspection_day'] = inspections_df['inspection_date'].dt.strftime('%d').astype(int)
inspections_df['inspection_month'] = inspections_df['inspection_date'].dt.strftime('%B')
inspections_df['inspection_year'] = inspections_df['inspection_date'].dt.strftime('%Y').astype(int)
inspections_df.drop(columns='inspection_date', inplace=True)

In [22]:
#inspections_df['inspection_day'].value_counts().sort_index()

In [23]:
inspections_df['inspection_month'].value_counts().sort_index()

inspection_month
April        23794
August       20767
December     15178
February     20549
January      21078
July         14233
June         18560
March        26231
May          26258
November     14830
October      14734
September    14087
Name: count, dtype: int64

In [24]:
inspections_df['inspection_year'].value_counts().sort_index()

inspection_year
2015       18
2016      358
2017      972
2018     1390
2019     2657
2020     2542
2021    18719
2022    79394
2023    81673
2024    42576
Name: count, dtype: int64

In [25]:
display(inspections_df.dtypes)
display(inspections_df.head())
inspections_df.shape

establishment_id           int64
establishment_name        object
borough                   object
cuisine_description       object
action                    object
violation_code            object
violation_description     object
critical_flag             object
grade                     object
inspection_type           object
latitude                 float64
longitude                float64
inspection_month          object
inspection_year            int64
dtype: object

Unnamed: 0,establishment_id,establishment_name,borough,cuisine_description,action,violation_code,violation_description,critical_flag,grade,inspection_type,latitude,longitude,inspection_month,inspection_year
0,50057566,DOMINO'S,Queens,Pizza,Violations were cited in the following area(s).,09C,Food contact surface not properly maintained.,Not Critical,A,Cycle Inspection / Initial Inspection,40.665341,-73.730655,August,2021
1,50065306,CHENG'S,Staten Island,Chinese,Violations were cited in the following area(s).,04L,Evidence of mice or live mice in establishment...,Critical,N,Cycle Inspection / Initial Inspection,40.62601,-74.156541,April,2023
2,41163307,TAQUERIA SAN PEDRO,Manhattan,Mexican,Violations were cited in the following area(s).,02B,Hot TCS food item not held at or above 140 °F.,Critical,B,Cycle Inspection / Re-inspection,40.830403,-73.947535,September,2022
3,50007331,PALACE RESTAURANT,Manhattan,American,Violations were cited in the following area(s).,02B,Hot food item not held at or above 140º F.,Critical,N,Cycle Inspection / Initial Inspection,40.761164,-73.969736,May,2022
4,40512788,ELIAS CORNER FOR FISH,Queens,Seafood,Violations were cited in the following area(s).,04M,Live roaches present in facility's food and/or...,Critical,A,Cycle Inspection / Initial Inspection,40.772154,-73.915568,January,2022


(230299, 14)

In [26]:
print(inspections_df['cuisine_description'].value_counts().sort_values())
len(inspections_df['cuisine_description'].value_counts())

#Many values... should I filter them out?

cuisine_description
Chimichurri               2
Haute Cuisine             5
Czech                     7
Basque                    9
Nuts/Confectionary       18
                      ...  
Latin American         9430
Pizza                 14167
Coffee/Tea            16015
Chinese               22513
American              37720
Name: count, Length: 89, dtype: int64


89

In [27]:
len(inspections_df['inspection_type'].value_counts())

#Is this column even necessary?

30

In [28]:
len(inspections_df['violation_code'].value_counts().sort_values())

143

In [29]:
len(inspections_df['violation_description'].value_counts().sort_values(ascending=False))

220

In [30]:
#Filter Violation Code - Only include those with more than 1000 value counts

violation_counts = inspections_df['violation_code'].value_counts()

filtered_violations = violation_counts[violation_counts > 1000].index

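#copy() so that later column assignments act on an independent DataFrame, not a view of the original
inspections_df = inspections_df[inspections_df['violation_code'].isin(filtered_violations)].copy()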

In [31]:
inspections_df.shape

(214174, 14)

In [32]:
#Rows (not columns) removed by the violation code filter
dropped_rows = 230299 - 214174
dropped_rows

#We still have sufficient data to draw conclusions

16125
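The same kind of frequency filter can be written in one pass with `groupby(...).transform('size')`, which also makes it easy to compute the number of dropped rows instead of typing the shapes by hand. This is a sketch on made-up data: `toy_df`, `min_count`, and the threshold of 2 (standing in for 1000) are assumptions.

```python
import pandas as pd

# Hypothetical toy data; min_count = 2 stands in for the 1000 used above
toy_df = pd.DataFrame({'violation_code': ['02B', '02B', '02B', '04L', '04L', '10F']})
min_count = 2

# Number of rows sharing each row's code, aligned to the original index
code_counts = toy_df.groupby('violation_code')['violation_code'].transform('size')

# Keep rows whose code occurs more than min_count times
frequent_df = toy_df[code_counts > min_count].copy()

# Derive the number of dropped rows from the two frames
rows_dropped = len(toy_df) - len(frequent_df)
print(frequent_df['violation_code'].unique(), rows_dropped)
```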

In [33]:
len(inspections_df['violation_code'].value_counts())

30

In [34]:
len(inspections_df['violation_description'].value_counts())

#Should I filter them out?

67

In [35]:
len(inspections_df['cuisine_description'].value_counts())

#No values were eliminated

89

In [36]:
print(inspections_df['inspection_type'].value_counts())
len(inspections_df['inspection_type'].value_counts())

inspection_type
Cycle Inspection / Initial Inspection                          121194
Cycle Inspection / Re-inspection                                40965
Pre-permit (Operational) / Initial Inspection                   32492
Pre-permit (Operational) / Re-inspection                         9436
Pre-permit (Non-operational) / Initial Inspection                2605
Administrative Miscellaneous / Initial Inspection                1961
Pre-permit (Operational) / Compliance Inspection                 1554
Cycle Inspection / Reopening Inspection                          1310
Cycle Inspection / Compliance Inspection                          917
Pre-permit (Operational) / Reopening Inspection                   683
Administrative Miscellaneous / Re-inspection                      353
Pre-permit (Non-operational) / Re-inspection                      196
Inter-Agency Task Force / Initial Inspection                      181
Pre-permit (Operational) / Second Compliance Inspection           176
Admi

20

In [37]:
#Only keep the actual Type of Inspection (Second Value)

inspections_df['inspection_type'] = inspections_df['inspection_type'].str.split('/').str.get(1).str.strip()

In [38]:
inspections_df['inspection_type'].value_counts()

inspection_type
Initial Inspection              158433
Re-inspection                    50950
Compliance Inspection             2563
Reopening Inspection              2001
Second Compliance Inspection       227
Name: count, dtype: int64

In [39]:
print(inspections_df['action'].value_counts())
len(inspections_df['action'].value_counts())

#Summarize each value for better visual

action
Violations were cited in the following area(s).                                                                                       204962
Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.      7460
Establishment re-opened by DOHMH.                                                                                                       1723
No violations were recorded at the time of this inspection.                                                                               29
Name: count, dtype: int64


4

In [40]:
#Function to summarize each value for better visual

def summarize_action(value):
    if 'Violations were cited in the following area(s).' in value:
        return 'Had Violations'
    elif 'Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.' in value:
        return 'Closed by DOHMH'
    elif 'Establishment re-opened by DOHMH.' in value:
        return 'Re-opened by DOHMH'
    elif 'No violations were recorded at the time of this inspection.' in value:
        return 'No Violations'
    else:
        return value
    
inspections_df['action'] = inspections_df['action'].apply(summarize_action)

inspections_df['action'].value_counts()

action
Had Violations        204962
Closed by DOHMH         7460
Re-opened by DOHMH      1723
No Violations             29
Name: count, dtype: int64
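A dictionary passed to `Series.replace` is one possible alternative to the if/elif chain. Unlike the substring checks in `summarize_action`, it matches whole values exactly. The sketch below uses a made-up sample (`sample_actions`) and a hypothetical `action_labels` name.

```python
import pandas as pd

# Full action strings as they appear in the value counts above
action_labels = {
    'Violations were cited in the following area(s).': 'Had Violations',
    'Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.': 'Closed by DOHMH',
    'Establishment re-opened by DOHMH.': 'Re-opened by DOHMH',
    'No violations were recorded at the time of this inspection.': 'No Violations',
}

# Hypothetical sample, including a value that is not in the mapping
sample_actions = pd.Series([
    'Establishment re-opened by DOHMH.',
    'Violations were cited in the following area(s).',
    'Some future action text',
])

# replace() swaps exact matches and leaves unmapped values unchanged,
# mirroring the else branch of summarize_action
summarized = sample_actions.replace(action_labels)
print(summarized.tolist())
```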

In [41]:
inspections_df['grade'].value_counts()

grade
N    112893
A     76181
B     12993
C      7517
Z      3969
P       621
Name: count, dtype: int64

In [42]:
inspections_df['grade'] = inspections_df['grade'].replace({'P':'PRO', 'Z':'PEN', 'N':'PEN'})
inspections_df['grade'].value_counts()

grade
PEN    116862
A       76181
B       12993
C        7517
PRO       621
Name: count, dtype: int64

In [43]:
inspections_df['critical_flag'].value_counts()

critical_flag
Critical        120629
Not Critical     93545
Name: count, dtype: int64

In [44]:
# Filter the data where critical_flag is 'Critical'
#critical_violations_df = pd.DataFrame(inspections_df[inspections_df['critical_flag'] == 'Critical'])

#print(len(critical_violations_df['violation_description'].value_counts()))
#critical_violations_df[['violation_code','violation_description']].value_counts()

In [45]:
# Filter the data where critical_flag is 'Not Critical'
#not_critical_violations_df = pd.DataFrame(inspections_df[inspections_df['critical_flag'] == 'Not Critical'])

#print(len(not_critical_violations_df['violation_description'].value_counts()))
#not_critical_violations_df[['violation_code','violation_description']].value_counts()

In [46]:
#Categorizing Violation Codes 

#Note: '10G' and '10H' are listed under both Food Handling and Cleaning and Sanitation;
#categorize_violation returns the first match, so both are counted as Food Handling
violation_mapping = {
    'Food Handling Violation' : ['06D','02G','02B','06C','04H','09B','04J','10E','10H','02H','04C','10G'],
    'Documentation Violation' : ['04A','09E','20-04','20-06'],
    'Pests or Rodents Violation' : ['08A','04L','04N','04M','04K'],
    'Safety Violation' : ['08C','10D'],
    'Cleaning and Sanitation Violation' : ['10F','10B','06E','06F','10G','05D','06A','10H','09C']
}

# Function to categorize violation codes
def categorize_violation(code):
    for category, codes in violation_mapping.items():
        if code in codes:
            return category
    return 'Other'

# Assign categories based on violation code using apply()
inspections_df['violation_category'] = inspections_df['violation_code'].apply(categorize_violation)

In [47]:
#Drop any columns that won't be necessary after further cleaning

inspections_df.drop(columns=['establishment_id','violation_description','inspection_month'], inplace=True)
inspections_df.shape

(214174, 12)

In [48]:
#Make sure all rows have a category

print(inspections_df['violation_category'].value_counts())

print('Total number of value counts: ', inspections_df['violation_category'].value_counts().sum())

violation_category
Food Handling Violation              77191
Cleaning and Sanitation Violation    64999
Pests or Rodents Violation           55946
Documentation Violation              10330
Safety Violation                      5708
Name: count, dtype: int64
Total number of value counts:  214174
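The totals confirm that every row received a category, but they do not show whether a code is listed under more than one category. A quick check, sketched here with the same codes per category as `violation_mapping`, inverts the mapping and reports any overlaps. The names `code_to_categories` and `overlapping` are hypothetical.

```python
from collections import defaultdict

# Same codes per category as violation_mapping above
violation_mapping = {
    'Food Handling Violation': ['06D', '02G', '02B', '06C', '04H', '09B', '04J', '10E', '10H', '02H', '04C', '10G'],
    'Documentation Violation': ['04A', '09E', '20-04', '20-06'],
    'Pests or Rodents Violation': ['08A', '04L', '04N', '04M', '04K'],
    'Safety Violation': ['08C', '10D'],
    'Cleaning and Sanitation Violation': ['10F', '10B', '06E', '06F', '10G', '05D', '06A', '10H', '09C'],
}

# Invert the mapping: code -> set of categories that list it
code_to_categories = defaultdict(set)
for category, codes in violation_mapping.items():
    for code in codes:
        code_to_categories[code].add(category)

# Codes listed under more than one category go to whichever category is checked first
overlapping = sorted(code for code, cats in code_to_categories.items() if len(cats) > 1)
print(overlapping)
```

Any code reported here is always assigned to the earlier category in the dictionary, so the later listing has no effect on the counts.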


In [49]:
new_column_order = ['establishment_name', 'cuisine_description', 'inspection_type', 'inspection_year', 'action', 'critical_flag', 'violation_code', 'violation_category', 'grade', 'borough', 'latitude', 'longitude']

In [50]:
inspections_df = inspections_df[new_column_order]
inspections_df

Unnamed: 0,establishment_name,cuisine_description,inspection_type,inspection_year,action,critical_flag,violation_code,violation_category,grade,borough,latitude,longitude
0,DOMINO'S,Pizza,Initial Inspection,2021,Had Violations,Not Critical,09C,Cleaning and Sanitation Violation,A,Queens,40.665341,-73.730655
1,CHENG'S,Chinese,Initial Inspection,2023,Had Violations,Critical,04L,Pests or Rodents Violation,PEN,Staten Island,40.626010,-74.156541
2,TAQUERIA SAN PEDRO,Mexican,Re-inspection,2022,Had Violations,Critical,02B,Food Handling Violation,B,Manhattan,40.830403,-73.947535
3,PALACE RESTAURANT,American,Initial Inspection,2022,Had Violations,Critical,02B,Food Handling Violation,PEN,Manhattan,40.761164,-73.969736
4,ELIAS CORNER FOR FISH,Seafood,Initial Inspection,2022,Had Violations,Critical,04M,Pests or Rodents Violation,A,Queens,40.772154,-73.915568
...,...,...,...,...,...,...,...,...,...,...,...,...
230297,KYURAMEN / TBAAR,Japanese,Re-inspection,2023,Had Violations,Not Critical,10D,Safety Violation,A,Manhattan,40.802480,-73.968023
230298,LOS NISPEROS PERUVIAN RESTAURANT,Peruvian,Initial Inspection,2021,Had Violations,Not Critical,10F,Cleaning and Sanitation Violation,PEN,Bronx,40.814823,-73.914657
230299,THE RED GRILL MEXICAN RESTAURANT,Mexican,Initial Inspection,2021,Had Violations,Not Critical,08A,Pests or Rodents Violation,PEN,Manhattan,40.779272,-73.950735
230301,GYRO KING,Bangladeshi,Re-inspection,2024,Had Violations,Critical,06E,Cleaning and Sanitation Violation,PEN,Brooklyn,40.633534,-73.967149


In [51]:
#Fill missing coordinates with 0.0 as a placeholder (not a real location)
#Assign the results back so the fill and the float conversion actually take effect
inspections_df['latitude'] = inspections_df['latitude'].fillna(0.0).astype(float)
inspections_df['longitude'] = inspections_df['longitude'].fillna(0.0).astype(float)
print(inspections_df.isna().sum())
inspections_df.dtypes

establishment_name     0
cuisine_description    0
inspection_type        0
inspection_year        0
action                 0
critical_flag          0
violation_code         0
violation_category     0
grade                  0
borough                0
latitude               0
longitude              0
dtype: int64


establishment_name      object
cuisine_description     object
inspection_type         object
inspection_year          int64
action                  object
critical_flag           object
violation_code          object
violation_category      object
grade                   object
borough                 object
latitude               float64
longitude              float64
dtype: object
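Because 0.0 stands in for missing coordinates, rows with that placeholder do not point to a location in New York City and would distort any map or distance calculation. One possible way to set them aside is sketched below on made-up rows; `toy_df`, `has_location`, and `mappable_df` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows; 0.0 marks coordinates that were missing
toy_df = pd.DataFrame({
    'establishment_name': ['A', 'B', 'C'],
    'latitude': [40.74, 0.0, 40.63],
    'longitude': [-74.00, 0.0, -73.97],
})

# Keep only rows with real coordinates for location-based analysis
has_location = (toy_df['latitude'] != 0.0) & (toy_df['longitude'] != 0.0)
mappable_df = toy_df[has_location]
print(len(mappable_df))
```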

In [53]:
inspections_df.to_csv('inspections_df.csv', index=False)