# Final Project

## NYC's New Restaurant Owners Guide

#### The dos and don'ts to watch for in order to pass inspections

### Analysis to Explore

#### Food Safety Inspections:

- Most Common Violations for NYC Restaurants
  - Trends Over Time (Common Violations per Year)

- Average Grades by Borough

- Common Cuisines by Borough

### About this DataFrame

#### What it is:
 - Inspection results for restaurants and college cafeterias with an active status as of 06/17/2024
 - This dataset comes from NYC Open Data and can be found [here](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data)
 - Another way to view this information is by going to [The Health Department’s Restaurant Grading Website](http://www1.nyc.gov/site/doh/services/restaurant-grades.page)
 - Restaurants with a score between **0 and 13 points earn an A**, those with **14 to 27 points receive a B**, and those with **28 or more receive a C** (lower scores are better)
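
The score thresholds above can be written as a small helper. This is only a sketch of the quoted rule: `score_to_grade` is a hypothetical name, and the `grade` recorded in the dataset does not always follow it (for example, some rows below carry a score of 18 with no letter grade yet).

```python
def score_to_grade(score: float) -> str:
    """Map an inspection score to a letter grade using the quoted thresholds."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 13:
        return "A"   # 0 to 13 points
    if score <= 27:
        return "B"   # 14 to 27 points
    return "C"       # 28 points or more


# Check the boundaries of each band
for s in (0, 13, 14, 27, 28):
    print(s, score_to_grade(s))
```
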

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as st

In [2]:
inspections_df = pd.read_csv('CSV Files/New_York_City_Restaurant_Inspection_Results.csv')

In [3]:
with pd.option_context('display.max_columns', None):
    display(inspections_df.head())

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location Point1
0,50132353,AVENUE RESTAURANT,Manhattan,355,WEST 16 STREET,10011.0,2122292559,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.741726,-74.003082,104.0,3.0,8300.0,1088885.0,1007408000.0,MN13,
1,50151116,PLANET HOLLYWOOD/CHICKEN GUY!,Manhattan,136,WEST 42 STREET,10036.0,4079035637,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.755331,-73.985313,105.0,4.0,11300.0,1022578.0,1009948000.0,MN17,
2,50150858,,Queens,13335,ROOSEVELT AVE,11354.0,9179743745,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.758502,-73.833242,407.0,20.0,87100.0,4112276.0,4049730000.0,QN22,
3,50149676,CONOR'S GOAT,Manhattan,23,AVENUE A,10009.0,2126735550,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.722864,-73.985865,103.0,2.0,3002.0,1005747.0,1004290000.0,MN22,
4,50132087,FANTASTIC BEASTS,Queens,36-10,UNION STREET,11354.0,2538807717,,01/01/1900,,,,Not Applicable,,,,06/17/2024,,40.763482,-73.828056,407.0,20.0,86900.0,4112354.0,4049770000.0,QN22,


In [4]:
inspections_df.isna().sum()

CAMIS                         0
DBA                         716
BORO                          0
BUILDING                    331
STREET                       12
ZIPCODE                    2795
PHONE                         3
CUISINE DESCRIPTION        2475
INSPECTION DATE               0
ACTION                     2475
VIOLATION CODE             3728
VIOLATION DESCRIPTION      3728
CRITICAL FLAG                 0
SCORE                     11339
GRADE                    119784
GRADE DATE               128360
RECORD DATE                   0
INSPECTION TYPE            2475
Latitude                    340
Longitude                   340
Community Board            3368
Council District           3364
Census Tract               3364
BIN                        4562
BBL                         585
NTA                        3368
Location Point1          234037
dtype: int64

### Cleaning that could be done

> Change names of columns **DONE!**

> Drop rows that don't have violation code or description (this is the base of the analysis) **DONE!**

> Drop unnecessary columns (Location Point, Phone, Building... many more!) **DONE!**

> Check for Duplicates **DONE!**

> Check Null Values (fill with average/median/other or drop) **DONE!**

In [5]:
#Drop Duplicates
inspections_df.drop_duplicates(inplace=True)

#Drop Null Values for Violation Code - necessary for our analysis
inspections_df.dropna(subset='VIOLATION CODE', inplace=True)

#Drop Unnecessary Columns
inspections_df.drop(columns=['BUILDING', 'STREET', 'ZIPCODE', 'PHONE', 'GRADE DATE', 'RECORD DATE', 'Community Board', 'Council District', 'Census Tract', 'BIN', 'BBL', 'NTA', 'Location Point1'], inplace=True)

#Make Columns Lowercase & Change Spaces to '_'
inspections_df.columns = inspections_df.columns.str.lower().str.replace(' ', '_')

In [6]:
inspections_df.duplicated().sum()

0

In [7]:
inspections_df.isna().sum()

camis                         0
dba                           0
boro                          0
cuisine_description           0
inspection_date               0
action                        0
violation_code                0
violation_description         0
critical_flag                 0
score                      8337
grade                    116516
inspection_type               0
latitude                    280
longitude                   280
dtype: int64

In [8]:
inspections_df.shape

(230303, 14)

In [9]:
#Rename Columns for Clarification
inspections_df.rename(columns = {'camis':'establishment_id', 'dba':'establishment_name', 'boro':'borough'}, inplace=True)

In [10]:
#Change Date Columns to DateTime
inspections_df['inspection_date'] = pd.to_datetime(inspections_df['inspection_date'])

In [11]:
inspections_df.reset_index(drop=True, inplace=True)

In [12]:
inspections_df['establishment_id'].duplicated().sum()

203956

In [13]:
inspections_df['establishment_name'].duplicated().sum()

209320

- Repeated IDs and names are expected here: chains share a name, and the same restaurant appears once per violation and per inspection, so these repeats are not necessarily duplicate records.

In [14]:
#Making sure that the 'duplicated' restaurants are not the same inspections

inspections_df[['establishment_name', 'establishment_id', 'borough', 'inspection_date', 'score', 'violation_code', 'violation_description', 'latitude', 'longitude', 'inspection_type']].duplicated().sum()

0

In [15]:
#Another way to confirm that rows have unique values

inspections_df[['establishment_name', 'establishment_id', 'borough', 'inspection_date', 'score', 'violation_code']].value_counts().sort_values()

establishment_name                establishment_id  borough    inspection_date  score  violation_code
"U" LIKE CHINESE TAKE OUT         50126747          Manhattan  2023-02-03       31.0   02B               1
#1 GARDEN CHINESE RESTAURANT      50075009          Brooklyn   2023-02-09       58.0   06B               1
$1 PIZZA                          50086385          Manhattan  2022-08-16       25.0   06C               1
#1 GARDEN CHINESE RESTAURANT      50075009          Brooklyn   2023-02-09       58.0   06A               1
"U" LIKE CHINESE TAKE OUT         50126747          Manhattan  2023-02-03       31.0   04H               1
                                                                                                        ..
kampai hibachi & brazilian Grill  50107632          Queens     2023-03-23       64.0   10B               1
                                                                                       10F               1
                                          

### Grade Values

- N = Not Yet Graded

- A = Grade A

- B = Grade B

- C = Grade C

- Z = Grade Pending

- P = Grade Pending issued on re-opening following an initial inspection that resulted in a closure

In [16]:
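#Fill Null Values with N = Not Yet Graded
#(assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
inspections_df['grade'] = inspections_df['grade'].fillna('N')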

In [17]:
inspections_df['grade'].value_counts()

#Should I combine N (Not Yet Graded) and Z (Grade Pending)
#P is pending but issued on re-opening following an initial inspection that resulted in a closure

grade
N    125083
A     78957
B     13394
C      8041
Z      4198
P       630
Name: count, dtype: int64

In [18]:
#Confirm that all rows have had inspections in order to 'assume' that N and Z could be combined

inspections_df['inspection_date'].value_counts().sort_index()

inspection_date
2015-09-24      5
2015-10-14      2
2015-10-15      1
2015-11-19      2
2015-11-20      4
             ... 
2024-06-11    457
2024-06-12    367
2024-06-13    572
2024-06-14    238
2024-06-15      3
Name: count, Length: 1694, dtype: int64

In [19]:
#inspections_df['inspection_day'] = inspections_df['inspection_date'].dt.strftime('%d').astype(int)
#inspections_df['inspection_month'] = inspections_df['inspection_date'].dt.strftime('%B')
inspections_df['inspection_year'] = inspections_df['inspection_date'].dt.strftime('%Y').astype(int)
inspections_df.drop(columns='inspection_date', inplace=True)

In [20]:
#inspections_df['inspection_day'].value_counts().sort_index()

In [21]:
#inspections_df['inspection_month'].value_counts().sort_index()

In [22]:
inspections_df['inspection_year'].value_counts().sort_index()

inspection_year
2015       18
2016      358
2017      976
2018     1390
2019     2657
2020     2542
2021    18719
2022    79394
2023    81673
2024    42576
Name: count, dtype: int64

In [23]:
display(inspections_df.dtypes)
display(inspections_df.head())
inspections_df.shape

establishment_id           int64
establishment_name        object
borough                   object
cuisine_description       object
action                    object
violation_code            object
violation_description     object
critical_flag             object
score                    float64
grade                     object
inspection_type           object
latitude                 float64
longitude                float64
inspection_year            int64
dtype: object

Unnamed: 0,establishment_id,establishment_name,borough,cuisine_description,action,violation_code,violation_description,critical_flag,score,grade,inspection_type,latitude,longitude,inspection_year
0,50057566,DOMINO'S,Queens,Pizza,Violations were cited in the following area(s).,09C,Food contact surface not properly maintained.,Not Critical,13.0,A,Cycle Inspection / Initial Inspection,40.665341,-73.730655,2021
1,50065306,CHENG'S,Staten Island,Chinese,Violations were cited in the following area(s).,04L,Evidence of mice or live mice in establishment...,Critical,18.0,N,Cycle Inspection / Initial Inspection,40.62601,-74.156541,2023
2,41163307,TAQUERIA SAN PEDRO,Manhattan,Mexican,Violations were cited in the following area(s).,02B,Hot TCS food item not held at or above 140 °F.,Critical,17.0,B,Cycle Inspection / Re-inspection,40.830403,-73.947535,2022
3,50007331,PALACE RESTAURANT,Manhattan,American,Violations were cited in the following area(s).,02B,Hot food item not held at or above 140º F.,Critical,19.0,N,Cycle Inspection / Initial Inspection,40.761164,-73.969736,2022
4,40512788,ELIAS CORNER FOR FISH,Queens,Seafood,Violations were cited in the following area(s).,04M,Live roaches present in facility's food and/or...,Critical,10.0,A,Cycle Inspection / Initial Inspection,40.772154,-73.915568,2022


(230303, 14)

In [24]:
print(inspections_df['cuisine_description'].value_counts().sort_values())
len(inspections_df['cuisine_description'].value_counts())

#Many values... should I filter them out? I can't combine any. Too specific!

cuisine_description
Chimichurri               2
Haute Cuisine             5
Czech                     7
Basque                    9
Nuts/Confectionary       18
                      ...  
Latin American         9430
Pizza                 14167
Coffee/Tea            16015
Chinese               22513
American              37724
Name: count, Length: 89, dtype: int64


89

In [25]:
len(inspections_df['violation_description'].value_counts().sort_values(ascending=False))

220

In [26]:
len(inspections_df['inspection_type'].value_counts())

30

In [27]:
len(inspections_df['violation_code'].value_counts())

143

In [28]:
#Filter Violation Code - Only include codes that appear more than 150 times

violation_counts = inspections_df['violation_code'].value_counts()

filtered_violations = violation_counts[violation_counts > 150].index

#.copy() so later column assignments act on an independent DataFrame, not a view
inspections_df = inspections_df[inspections_df['violation_code'].isin(filtered_violations)].copy()

In [29]:
# Filter the data where critical_flag is 'Critical'
#critical_violations_df = pd.DataFrame(inspections_df[inspections_df['critical_flag'] == 'Critical'])

#print(len(critical_violations_df['violation_description'].value_counts()))
#critical_violations_df[['violation_code','violation_description']].value_counts()

# Filter the data where critical_flag is 'Not Critical'
#not_critical_violations_df = pd.DataFrame(inspections_df[inspections_df['critical_flag'] == 'Not Critical'])

#print(len(not_critical_violations_df['violation_description'].value_counts()))
#not_critical_violations_df[['violation_code','violation_description']].value_counts()

In [30]:
#Categorize Violations Codes and Descriptions

#Create a violations Dataframe to use as guide
violations = pd.DataFrame(inspections_df[['violation_code', 'violation_description']].value_counts())

#with pd.option_context('display.max_rows', None):
#    display(violations.sort_values(by='violation_code'))
 
print(inspections_df.shape)    
len(inspections_df['violation_code'].value_counts())

(227732, 14)


64

In [31]:
#Categorizing Violation Codes 

violation_mapping = {
    'Food Handling Violations':['04H', '06C', '06D', '06E', '09A', '09B', '10E', '10H', '10I', '28-05', '02A', '02B', '02C', '02G', '02H', '02I', '04J', '05F', '03A', '03I'], 
    'Employee Practices Violations':['04C', '06A', '06B', '19-06', '19-07'], 
    'Documentation Violation':['04A', '05H', '09E', '10J', '16-02', '16-03', '19-10', '20-01', '20-04', '20-06', '20-08', '20A', '20D', '20F'], 
    'Safety Violations':['05A', '08C', '10C', '10D', '22A', '28-01', '28-03'], 
    'Cleaning and Sanitation Violations':['04F', '05C', '05D', '05E', '06F', '08B', '09C', '10A', '10B', '10F', '10G'], 
    'Pests and Rodents Violations':['04K', '04L', '04M', '04N', '04O', '08A', '28-06'] 
}

# Function to categorize violation codes
def categorize_violation(code):
    for category, codes in violation_mapping.items():
        if code in codes:
            return category
    return 'Other'

# Assign categories based on violation code using apply()
inspections_df['violation_category'] = inspections_df['violation_code'].apply(categorize_violation)
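
The category lookup above scans every list for every row. An alternative sketch, assuming the same `{category: [codes]}` shape, is to invert the dictionary once into `{code: category}` and use `Series.map`. The mapping below is a shortened, hypothetical stand-in for `violation_mapping`, and `99Z` is an invented code used only to show the fallback.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the full violation_mapping
toy_mapping = {
    "Food Handling Violations": ["02B", "04H"],
    "Pests and Rodents Violations": ["04L", "04M"],
}

# Invert {category: [codes]} into a flat {code: category} lookup
code_to_category = {
    code: category
    for category, codes in toy_mapping.items()
    for code in codes
}

codes = pd.Series(["02B", "04L", "99Z"])

# map() leaves unknown codes as NaN; fillna() supplies the 'Other' fallback
categories = codes.map(code_to_category).fillna("Other")
print(categories.tolist())
```

The result should match what `categorize_violation` returns, as long as each code belongs to exactly one category.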

In [32]:
inspections_df.shape

(227732, 15)

In [33]:
#Make sure all rows have a category

print(inspections_df['violation_category'].value_counts())

print('Total number of value counts: ', inspections_df['violation_category'].value_counts().sum())

violation_category
Food Handling Violations              79531
Cleaning and Sanitation Violations    62713
Pests and Rodents Violations          57108
Documentation Violation               14347
Employee Practices Violations          7322
Safety Violations                      6711
Name: count, dtype: int64
Total number of value counts:  227732


In [34]:
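#Duplicates reappeared after the steps above (most likely from replacing inspection_date with inspection_year; violation_description itself was not modified)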

inspections_df.duplicated().sum()

40

In [35]:
inspections_df.drop_duplicates(inplace=True)
inspections_df.shape

(227692, 15)

In [36]:
inspections_df.duplicated().sum()

0
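
A duplicate check can go from 0 to a positive count without any rows being added when a column that told rows apart is coarsened or dropped, as happened earlier when `inspection_date` was replaced by `inspection_year`. Whether that explains the 40 duplicates found here is an assumption. The toy frame below (invented values) only illustrates the mechanism.

```python
import pandas as pd

# Two invented inspections of the same place, same code and score, different dates
toy = pd.DataFrame({
    "establishment_name": ["CAFE X", "CAFE X"],
    "inspection_date": pd.to_datetime(["2023-02-03", "2023-09-18"]),
    "violation_code": ["10F", "10F"],
    "score": [12.0, 12.0],
})

# With the full date, the rows are distinct
dupes_with_date = int(toy.duplicated().sum())

# Reduce the date to a year, as the notebook does, then re-check
toy["inspection_year"] = toy["inspection_date"].dt.year
toy = toy.drop(columns="inspection_date")
dupes_with_year = int(toy.duplicated().sum())

print(dupes_with_date, dupes_with_year)
```

If that is the cause, `drop_duplicates()` at this stage removes real (if indistinguishable) citations, which may matter when counting violations per year.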

In [37]:
#Drop any columns that won't be necessary after further cleaning

inspections_df.drop(columns=['establishment_id','violation_description'], inplace=True)
inspections_df.shape

(227692, 13)

In [38]:
len(inspections_df['cuisine_description'].value_counts())

#No values were eliminated

89

In [39]:
print(inspections_df['inspection_type'].value_counts())
len(inspections_df['inspection_type'].value_counts())

inspection_type
Cycle Inspection / Initial Inspection                          125750
Cycle Inspection / Re-inspection                                42487
Pre-permit (Operational) / Initial Inspection                   34564
Pre-permit (Operational) / Re-inspection                         9947
Administrative Miscellaneous / Initial Inspection                5001
Pre-permit (Non-operational) / Initial Inspection                3028
Pre-permit (Operational) / Compliance Inspection                 1651
Cycle Inspection / Reopening Inspection                          1332
Administrative Miscellaneous / Re-inspection                      981
Cycle Inspection / Compliance Inspection                          963
Pre-permit (Operational) / Reopening Inspection                   708
Trans Fat / Initial Inspection                                    243
Pre-permit (Non-operational) / Re-inspection                      218
Inter-Agency Task Force / Initial Inspection                      205
Pre-

27

In [40]:
#Only keep the actual Type of Inspection (Second Value)

inspections_df['inspection_type'] = inspections_df['inspection_type'].str.split('/').str.get(1).str.strip()

In [41]:
inspections_df['inspection_type'].value_counts()

inspection_type
Initial Inspection              168918
Re-inspection                    53704
Compliance Inspection             2758
Reopening Inspection              2069
Second Compliance Inspection       243
Name: count, dtype: int64
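
The `split('/')` step keeps only the text after the slash. A minimal sketch of how it behaves, including a value with no slash: the third string below is hypothetical, added only to show that `.str.get(1)` yields a missing value in that case.

```python
import pandas as pd

types = pd.Series([
    "Cycle Inspection / Initial Inspection",
    "Pre-permit (Operational) / Re-inspection",
    "No Slash Here",  # hypothetical value, not taken from the dataset
])

# Same chain as in the notebook: split, take the second piece, trim spaces
second_part = types.str.split("/").str.get(1).str.strip()
print(second_part.tolist())

# Optional: keep the first piece (the inspection program) as well
parts = types.str.split("/", n=1, expand=True)
program = parts[0].str.strip()
print(program.tolist())
```

Checking `second_part.isna().sum()` after the split is a quick way to confirm that every row had both parts.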

In [42]:
print(inspections_df['action'].value_counts())
len(inspections_df['action'].value_counts())

#Summarize each value for better visual

action
Violations were cited in the following area(s).                                                                                       217596
Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.      8180
Establishment re-opened by DOHMH.                                                                                                       1771
No violations were recorded at the time of this inspection.                                                                              145
Name: count, dtype: int64


4

In [43]:
#Function to summarize each 'action' for better visual

def summarize_action(value):
    if 'Violations were cited in the following area(s).' in value:
        return 'Had Violations'
    elif 'Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.' in value:
        return 'Closed by DOHMH'
    elif 'Establishment re-opened by DOHMH.' in value:
        return 'Re-opened by DOHMH'
    elif 'No violations were recorded at the time of this inspection.' in value:
        return 'No Violations'
    else:
        return value
    
inspections_df['action'] = inspections_df['action'].apply(summarize_action)

inspections_df['action'].value_counts()

action
Had Violations        217596
Closed by DOHMH         8180
Re-opened by DOHMH      1771
No Violations            145
Name: count, dtype: int64

In [44]:
inspections_df['grade'].value_counts()

grade
N    122775
A     78788
B     13339
C      7987
Z      4173
P       630
Name: count, dtype: int64

In [45]:
inspections_df['grade'] = inspections_df['grade'].replace({'P':'PRO', 'Z':'PEN', 'N':'PEN'})
inspections_df['grade'].value_counts()

grade
PEN    126948
A       78788
B       13339
C        7987
PRO       630
Name: count, dtype: int64

In [46]:
inspections_df['critical_flag'].value_counts()

critical_flag
Critical          125534
Not Critical      101713
Not Applicable       445
Name: count, dtype: int64

In [47]:
new_column_order = ['establishment_name', 'cuisine_description', 'inspection_type', 'inspection_year', 'action', 'critical_flag', 'violation_code', 'violation_category', 'score', 'grade', 'borough', 'latitude', 'longitude']

In [48]:
inspections_df = inspections_df[new_column_order]
inspections_df

Unnamed: 0,establishment_name,cuisine_description,inspection_type,inspection_year,action,critical_flag,violation_code,violation_category,score,grade,borough,latitude,longitude
0,DOMINO'S,Pizza,Initial Inspection,2021,Had Violations,Not Critical,09C,Cleaning and Sanitation Violations,13.0,A,Queens,40.665341,-73.730655
1,CHENG'S,Chinese,Initial Inspection,2023,Had Violations,Critical,04L,Pests and Rodents Violations,18.0,PEN,Staten Island,40.626010,-74.156541
2,TAQUERIA SAN PEDRO,Mexican,Re-inspection,2022,Had Violations,Critical,02B,Food Handling Violations,17.0,B,Manhattan,40.830403,-73.947535
3,PALACE RESTAURANT,American,Initial Inspection,2022,Had Violations,Critical,02B,Food Handling Violations,19.0,PEN,Manhattan,40.761164,-73.969736
4,ELIAS CORNER FOR FISH,Seafood,Initial Inspection,2022,Had Violations,Critical,04M,Pests and Rodents Violations,10.0,A,Queens,40.772154,-73.915568
...,...,...,...,...,...,...,...,...,...,...,...,...,...
230298,LOS NISPEROS PERUVIAN RESTAURANT,Peruvian,Initial Inspection,2021,Had Violations,Not Critical,10F,Cleaning and Sanitation Violations,25.0,PEN,Bronx,40.814823,-73.914657
230299,THE RED GRILL MEXICAN RESTAURANT,Mexican,Initial Inspection,2021,Had Violations,Not Critical,08A,Pests and Rodents Violations,18.0,PEN,Manhattan,40.779272,-73.950735
230300,MIYAKO,Japanese,Re-inspection,2023,Had Violations,Not Critical,20-08,Documentation Violation,,PEN,Manhattan,40.791024,-73.972586
230301,GYRO KING,Bangladeshi,Re-inspection,2024,Had Violations,Critical,06E,Food Handling Violations,24.0,PEN,Brooklyn,40.633534,-73.967149


In [49]:
# Calculate the first quartile (Q1) and third quartile (Q3) of the score data
Q1 = inspections_df['score'].quantile(0.25)
Q3 = inspections_df['score'].quantile(0.75)

# Calculate the interquartile range (IQR)
IQR = Q3 - Q1

# Define the lower and upper bounds to filter out outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only scores within the IQR bounds; rows with a missing score are dropped too,
# because comparisons with NaN evaluate to False (.copy() ensures the df is a copy and not a view)
inspections_df = inspections_df[(inspections_df['score'] >= lower_bound) & (inspections_df['score'] <= upper_bound)].copy()

inspections_df.shape

(210682, 13)
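
The IQR filter above also removes rows whose score is missing, which is why `score` shows no nulls afterwards. The sketch below reproduces that behaviour on a small invented series. `Series.between` is used in place of the two comparisons; it is inclusive on both ends by default.

```python
import numpy as np
import pandas as pd

# Invented scores: one extreme value and one missing value
scores = pd.Series([5.0, 10.0, 12.0, 13.0, 15.0, 18.0, 90.0, np.nan])

q1 = scores.quantile(0.25)   # quantile() ignores NaN
q3 = scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# NaN is never "between" the bounds, so it is filtered out with the outlier
in_range = scores.between(lower, upper)
kept = scores[in_range]
print(kept.tolist())

# To keep missing scores instead, add them back to the mask explicitly
kept_with_nan = scores[in_range | scores.isna()]
print(len(kept), len(kept_with_nan))
```

Which variant is appropriate depends on whether unscored inspections should stay in the analysis. The notebook's version drops them.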

In [50]:
#Check for Duplicates after further cleaning

print('Duplicates: ', inspections_df.duplicated().sum())

Duplicates:  42


In [51]:
#Drop Duplicates
inspections_df.drop_duplicates(inplace=True)

In [52]:
inspections_df.isna().sum()

establishment_name       0
cuisine_description      0
inspection_type          0
inspection_year          0
action                   0
critical_flag            0
violation_code           0
violation_category       0
score                    0
grade                    0
borough                  0
latitude               255
longitude              255
dtype: int64

In [53]:
#Fill Null Values of Latitude and Longitude with 0.0 (a placeholder, not a real NYC coordinate)
#Assign the results back: astype() returns a new Series and does nothing unless stored
inspections_df['latitude'] = inspections_df['latitude'].fillna(0.0).astype(float)
inspections_df['longitude'] = inspections_df['longitude'].fillna(0.0).astype(float)

print(inspections_df.isna().sum())
print(inspections_df.dtypes)

establishment_name     0
cuisine_description    0
inspection_type        0
inspection_year        0
action                 0
critical_flag          0
violation_code         0
violation_category     0
score                  0
grade                  0
borough                0
latitude               0
longitude              0
dtype: int64
establishment_name      object
cuisine_description     object
inspection_type         object
inspection_year          int64
action                  object
critical_flag           object
violation_code          object
violation_category      object
score                  float64
grade                   object
borough                 object
latitude               float64
longitude              float64
dtype: object


In [54]:
inspections_df.reset_index(drop=True, inplace=True)

In [55]:
inspections_df.shape

(210640, 13)

In [56]:
#with pd.option_context('display.max_rows', None):
#    display(inspections_df['cuisine_description'].value_counts())

### Cuisine Descriptions

- American = ['American', 'Chicken', 'Hamburgers', 'Barbecue', 'Soul Food', 'Steakhouse', 'Pancakes/Waffles', 'Hotdogs', 'New American', 'Hotdogs/Pretzels', 'Californian', 'Southwestern']

- Beverages = ['Coffee/Tea', 'Juice, Smoothies, Fruit Salads', 'Bottled Beverages']

- Latin American = ['Latin American', 'Peruvian', 'Brazilian', 'Chilean', 'Chimichurri']

- Mexican = ['Mexican', 'Tex-Mex']

- Soups, Salads and Sandwiches = ['Sandwiches', 'Sandwiches/Salads/Mixed Buffet', 'Salads', 'Soups/Salads/Sandwiches', 'Soups']

- Bakery & Desserts = ['Bakery Products/Desserts', 'Donuts', 'Frozen Desserts', 'Bagels/Pretzels', 'Nuts/Confectionary']

- Italian & Pizza = ['Italian', 'Pizza']

- Creole/Cajun = ['Creole', 'Cajun', 'Creole/Cajun']

- Caribbean = ['Caribbean']

- Chinese = ['Chinese', 'Chinese/Cuban', 'Chinese/Japanese']

- East Asian = ['Japanese', 'Korean']

- Southeast Asian = ['Southeast Asian', 'Thai', 'Filipino', 'Indonesian']

- South Asian = ['Bangladeshi', 'Pakistani', 'Indian']

- Other Asian = ['Asian/Asian Fusion', 'Hawaiian']

- Mediterranean = ['Mediterranean', 'Greek', 'Portuguese', 'French', 'New French', 'Spanish', 'Tapas', 'Basque']

- Middle Eastern = ['Middle Eastern', 'Iranian', 'Lebanese', 'Armenian', 'Afghan', 'Turkish']

- Eastern European = ['Eastern European', 'Russian', 'Czech', 'Polish']

- African = ['African', 'Egyptian', 'Ethiopian', 'Moroccan']

- Vegan/Vegetarian = ['Vegan', 'Vegetarian']

- Western Europe = ['English', 'Irish', 'German']

- Australian = ['Australian']

- Scandinavian = ['Scandinavian']

- Other = ['Other', 'Seafood', 'Fusion', 'Continental', 'Fruits/Vegetables', 'Haute Cuisine', 'Jewish/Kosher']

#### _____________________________________________________________________________________________________

- Drop rows where the cuisine is 'Not Listed/Not Applicable'

- After merging, check value counts in case I want to eliminate others



In [57]:
#Drop rows with value 'Not Listed/Not Applicable' in cuisine_description

inspections_df = inspections_df[inspections_df['cuisine_description'] != 'Not Listed/Not Applicable']
print(len(inspections_df['cuisine_description'].value_counts()))
inspections_df.shape

88


(210560, 13)

In [58]:
#Categorizing Cuisine Descriptions

cuisine_mapping = {
    'American' : ['American', 'Chicken', 'Hamburgers', 'Barbecue', 'Soul Food', 'Steakhouse', 'Pancakes/Waffles', 'Hotdogs', 'New American', 'Hotdogs/Pretzels', 'Californian', 'Southwestern'],
    'Beverages' : ['Coffee/Tea', 'Juice, Smoothies, Fruit Salads', 'Bottled Beverages'],
    'Latin American' : ['Latin American', 'Peruvian', 'Brazilian', 'Chilean', 'Chimichurri'],
    'Mexican' : ['Mexican', 'Tex-Mex'],
    'Soups, Salads and Sandwiches' : ['Sandwiches', 'Sandwiches/Salads/Mixed Buffet', 'Salads', 'Soups/Salads/Sandwiches', 'Soups'],
    'Bakery & Desserts' : ['Bakery Products/Desserts', 'Donuts', 'Frozen Desserts', 'Bagels/Pretzels', 'Nuts/Confectionary'],
    'Italian & Pizza' : ['Italian', 'Pizza'],
    'Creole/Cajun' : ['Creole', 'Cajun', 'Creole/Cajun'],
    'Caribbean' : ['Caribbean'],
    'Chinese' : ['Chinese', 'Chinese/Cuban', 'Chinese/Japanese'],
    'East Asian' : ['Japanese', 'Korean'],
    'Southeast Asian' : ['Southeast Asian', 'Thai', 'Filipino', 'Indonesian'],
    'South Asian' : ['Bangladeshi', 'Pakistani', 'Indian'],
    'Other Asian' : ['Asian/Asian Fusion', 'Hawaiian'],
    'Mediterranean' : ['Mediterranean', 'Greek', 'Portuguese', 'French', 'New French', 'Spanish', 'Tapas', 'Basque'],
    'Middle Eastern' : ['Middle Eastern', 'Iranian', 'Lebanese', 'Armenian', 'Afghan', 'Turkish'],
    'Eastern European' : ['Eastern European', 'Russian', 'Czech', 'Polish'],
    'African' : ['African', 'Egyptian', 'Ethiopian', 'Moroccan'],
    'Vegan/Vegetarian' : ['Vegan', 'Vegetarian'],
    'Western Europe' : ['English', 'Irish', 'German'],
    'Australian' : ['Australian'],
    'Scandinavian' : ['Scandinavian'],
    'Other' : ['Other', 'Seafood', 'Fusion', 'Continental', 'Fruits/Vegetables', 'Haute Cuisine', 'Jewish/Kosher']
}

# Function to categorize cuisine descriptions
def categorize_cuisine(cuisine):
    for category, cuisines in cuisine_mapping.items():
        if cuisine in cuisines:
            return category
    return 'No Category'

# Assign categories based on cuisine description using apply()
inspections_df['cuisine_description'] = inspections_df['cuisine_description'].apply(categorize_cuisine)

In [59]:
inspections_df['cuisine_description'].value_counts()

cuisine_description
American                        46624
Chinese                         21057
Italian & Pizza                 20121
Beverages                       19026
Bakery & Desserts               16989
Mediterranean                   11851
Mexican                         10869
East Asian                      10364
Latin American                   9729
Other                            8056
Caribbean                        7518
Soups, Salads and Sandwiches     6251
Southeast Asian                  4582
South Asian                      4340
Other Asian                      3964
Middle Eastern                   2773
Western Europe                   1793
Eastern European                 1520
African                          1196
Vegan/Vegetarian                 1193
Creole/Cajun                      439
Australian                        260
Scandinavian                       45
Name: count, dtype: int64

In [60]:
#Check that all rows are accounted for
print(inspections_df['cuisine_description'].value_counts().sum())
#Check all categories have been added
len(inspections_df['cuisine_description'].value_counts())

210560


23

In [63]:
inspections_df.reset_index(drop=True, inplace=True)

In [64]:
inspections_df

Unnamed: 0,establishment_name,cuisine_description,inspection_type,inspection_year,action,critical_flag,violation_code,violation_category,score,grade,borough,latitude,longitude
0,DOMINO'S,Italian & Pizza,Initial Inspection,2021,Had Violations,Not Critical,09C,Cleaning and Sanitation Violations,13.0,A,Queens,40.665341,-73.730655
1,CHENG'S,Chinese,Initial Inspection,2023,Had Violations,Critical,04L,Pests and Rodents Violations,18.0,PEN,Staten Island,40.626010,-74.156541
2,TAQUERIA SAN PEDRO,Mexican,Re-inspection,2022,Had Violations,Critical,02B,Food Handling Violations,17.0,B,Manhattan,40.830403,-73.947535
3,PALACE RESTAURANT,American,Initial Inspection,2022,Had Violations,Critical,02B,Food Handling Violations,19.0,PEN,Manhattan,40.761164,-73.969736
4,ELIAS CORNER FOR FISH,Other,Initial Inspection,2022,Had Violations,Critical,04M,Pests and Rodents Violations,10.0,A,Queens,40.772154,-73.915568
...,...,...,...,...,...,...,...,...,...,...,...,...,...
210555,KYURAMEN / TBAAR,East Asian,Re-inspection,2023,Had Violations,Not Critical,10D,Safety Violations,11.0,A,Manhattan,40.802480,-73.968023
210556,LOS NISPEROS PERUVIAN RESTAURANT,Latin American,Initial Inspection,2021,Had Violations,Not Critical,10F,Cleaning and Sanitation Violations,25.0,PEN,Bronx,40.814823,-73.914657
210557,THE RED GRILL MEXICAN RESTAURANT,Mexican,Initial Inspection,2021,Had Violations,Not Critical,08A,Pests and Rodents Violations,18.0,PEN,Manhattan,40.779272,-73.950735
210558,GYRO KING,South Asian,Re-inspection,2024,Had Violations,Critical,06E,Food Handling Violations,24.0,PEN,Brooklyn,40.633534,-73.967149


In [65]:
#inspections_df.to_csv('inspections_df.csv', index=False)