# Exploratory data analysis

## 1. Dataset description

I will visualize the [DOHMH New York City Restaurant Inspection Results](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/rs6k-p7g6) dataset.

- The dataset is provided by the [New York City OpenData](https://opendata.cityofnewyork.us/).

- The dataset includes information about NYC restaurant names, locations, claims, inspections, violations, grades and adjudication. There are 26 variables, 18 of which are potentially useful and described here:

    + CAMIS: a unique 10-digit integer identifier for a restaurant.
        
    + DBA: "doing business as", the name of the restaurant.

    + BORO: the Borough of the restaurant location.
    
    + BUILDING: the building of the restaurant location.
    
    + STREET: the street of the restaurant location.
    
    + ZIPCODE: the zipcode of the restaurant location.

    + PHONE: the phone number of the restaurant.

    + CUISINE DESCRIPTION: the cuisine type of the restaurant.

    + INSPECTION DATE: the date of the inspection. The date 1/1/1900 indicates that the restaurant has not yet been inspected.

    + ACTION: the action associated with the inspection.

    + VIOLATION CODE: the violation code associated with the inspection.

    + VIOLATION DESCRIPTION: the violation description associated with the inspection.

    + CRITICAL FLAG: an indicator of critical violation.

    + INSPECTION TYPE: the type of inspection performed.

    + Latitude: the latitude of the restaurant location.

    + Longitude: the longitude of the restaurant location.

    + SCORE: score for the inspection, updated based on adjudication results.

    + GRADE: grade of the inspection. The grade is based on a two-step inspection process. A restaurant that doesn't receive a grade "A" on the initial inspection can be reinspected:
        
        - scoring less than 14 on the initial/re-inspection: grade "A";
        
        - scoring 14-27 on re-inspection: grade "B";
        
        - scoring 28 or more on re-inspection: grade "C";
        
        - reopening inspection or scoring more than 13 on the initial inspection: grade "P".

- The current dataset was collected between April 2016 and March 2020.

- The dataset is free to the public and was collected and provided by the Department of Health and Mental Hygiene (DOHMH). The purpose of this dataset is to collect recent restaurant inspection results.

- The dataset may contain missing values and errors.
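
The grading rules listed under GRADE can be read as a small decision function. The sketch below is my own reading of those rules, not DOHMH code; the function name and the `inspection` labels ('initial', 're-inspection', 'reopening') are made up for illustration.

```python
def expected_grade(score, inspection):
    """Map a score and an inspection stage to a letter grade.

    `inspection` is a hypothetical label: 'initial', 're-inspection' or 'reopening'.
    """
    if inspection == 'reopening':
        return 'P'   # reopening inspections are graded "P"
    if score < 14:
        return 'A'   # fewer than 14 points on either inspection
    if inspection == 'initial':
        return 'P'   # 14 or more points on the initial inspection
    return 'B' if score < 28 else 'C'   # re-inspection: 14-27 is "B", 28+ is "C"


print(expected_grade(11, 'initial'))        # A
print(expected_grade(18, 'initial'))        # P
print(expected_grade(28, 're-inspection'))  # C
```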


## 2. Load the dataset

In [1]:
import pandas as pd
import numpy as np
import re
import json
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [2]:
# dataset is downloaded from https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/rs6k-p7g6
data = pd.read_csv('../data/raw_data/DOHMH_New_York_City_Restaurant_Inspection_Results.csv')

## 3. Explore the dataset

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402013 entries, 0 to 402012
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   CAMIS                  402013 non-null  int64  
 1   DBA                    401647 non-null  object 
 2   BORO                   402013 non-null  object 
 3   BUILDING               401749 non-null  object 
 4   STREET                 402011 non-null  object 
 5   ZIPCODE                396529 non-null  float64
 6   PHONE                  401996 non-null  object 
 7   CUISINE DESCRIPTION    402013 non-null  object 
 8   INSPECTION DATE        402013 non-null  object 
 9   ACTION                 400732 non-null  object 
 10  VIOLATION CODE         396459 non-null  object 
 11  VIOLATION DESCRIPTION  393064 non-null  object 
 12  CRITICAL FLAG          393064 non-null  object 
 13  SCORE                  385257 non-null  float64
 14  GRADE                  203724 non-nu

In [4]:
# select columns of interest
isp_data = data.copy()[['CAMIS', 'DBA', 'BORO', 'BUILDING',
                        'STREET', 'ZIPCODE', 'PHONE',
                        'CUISINE DESCRIPTION', 'INSPECTION TYPE',
                        'VIOLATION CODE', 'VIOLATION DESCRIPTION',
                        'Latitude', 'Longitude', 'SCORE', 'GRADE']]

isp_data.columns = ['camis', 'dba', 'boro', 'building',
                    'street', 'zipcode', 'phone', 
                    'cuisine description', 'inspection type',
                    'violation code', 'violation description', 
                    'latitude', 'longitude', 'score', 'grade']

isp_data['inspection date'] = pd.to_datetime(data['INSPECTION DATE'].copy())

In [5]:
isp_data.head()

| | camis | dba | boro | building | street | zipcode | phone | cuisine description | inspection type | violation code | violation description | latitude | longitude | score | grade | inspection date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50082518 | PAPA JOHN'S | Brooklyn | 1408 | NEPTUNE AVE | 11224.0 | 7182657272 | Pizza | Calorie Posting / Re-inspection | 16C | Caloric content not posted on menus, menu boar... | 40.579177 | -73.982177 | | | 2018-12-13 |
| 1 | 50018573 | TURCO | Manhattan | 604 | 9TH AVE | 10036.0 | 2125108666 | Middle Eastern | Cycle Inspection / Initial Inspection | 08A | Facility not vermin proof. Harborage or condit... | 40.759235 | -73.992026 | 48.0 | | 2019-10-25 |
| 2 | 41600457 | TABATA NOODLE RESTAURANT | Manhattan | 540 | 9 AVENUE | 10018.0 | 2122907681 | Japanese | Cycle Inspection / Re-inspection | 04L | Evidence of mice or live mice present in facil... | 40.756993 | -73.993654 | 28.0 | C | 2018-01-18 |
| 3 | 41405368 | DYLAN MURPHY'S | Manhattan | 1453 | 3 AVENUE | 10028.0 | 2129889434 | American | Cycle Inspection / Re-inspection | 08A | Facility not vermin proof. Harborage or condit... | 40.776321 | -73.955788 | 11.0 | A | 2018-04-10 |
| 4 | 41594669 | TAR PIT | Brooklyn | 135 | WOODPOINT ROAD | 11211.0 | 6464699494 | Café/Coffee/Tea | Cycle Inspection / Initial Inspection | 04L | Evidence of mice or live mice present in facil... | 40.7175 | -73.941398 | 18.0 | | 2019-02-14 |


In [6]:
isp_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402013 entries, 0 to 402012
Data columns (total 16 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   camis                  402013 non-null  int64         
 1   dba                    401647 non-null  object        
 2   boro                   402013 non-null  object        
 3   building               401749 non-null  object        
 4   street                 402011 non-null  object        
 5   zipcode                396529 non-null  float64       
 6   phone                  401996 non-null  object        
 7   cuisine description    402013 non-null  object        
 8   inspection type        400732 non-null  object        
 9   violation code         396459 non-null  object        
 10  violation description  393064 non-null  object        
 11  latitude               401591 non-null  float64       
 12  longitude              401591 non-null  floa

In [7]:
print(f"There are {len(np.unique(isp_data['camis']))} restaurants in this dataset.\n")

max_isp = isp_data.groupby(['camis'])[['cuisine description']].count().sort_values(['cuisine description']).iloc[-1]

print(f"The restaurant with the maximal number of inspections is {isp_data[isp_data['camis'] == max_isp.name].iloc[0]['dba']} \
with {max_isp['cuisine description']} inspections.\n")

print(f"The boroughs of the restaurant locations are {np.unique(isp_data['boro'])}.\n\
There are {len(np.unique(isp_data[isp_data['boro'] == '0']['camis']))} restaurants without borough information.\n")

print(f"There are {len(np.unique(isp_data['cuisine description']))} cuisine types in this dataset.\n\
The cuisine types are \n{np.unique(isp_data['cuisine description'])}.\n")

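# count restaurants (unique CAMIS), not rows, that carry the 1/1/1900 placeholder date
print(f"There are {isp_data.loc[isp_data['inspection date'].dt.year == 1900, 'camis'].nunique()} restaurants without any inspections.\n")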

# drop missing values explicitly instead of assuming NaN is the first element of a set
isp_types = isp_data['inspection type'].dropna().unique()
print(f"There are {len(isp_types)} inspection types:\n\
{list(isp_types)}.\n")

print(f"The inspection dates are between {np.min(isp_data[isp_data['inspection date'].dt.year > 1900]['inspection date'])} \
to {np.max(isp_data[isp_data['inspection date'].dt.year > 1900]['inspection date'])}.\n")

# drop missing values explicitly instead of assuming NaN is the first element of a set
vio_codes = isp_data['violation code'].dropna().unique()
print(f"There are {len(vio_codes)} violation types in this dataset.\n\
The violation codes are \n{list(vio_codes)}.\n")

# nunique() ignores missing values by default
print(f"There are {isp_data['violation description'].nunique()} violation descriptions in this dataset.\n")

print(f"The range of score is from {np.min(isp_data['score'])} to {np.max(isp_data['score'])}. \
There are {sum(isp_data['score'] < 0)} negative scores.\n")

print(f"There are {isp_data['grade'].nunique()} grade levels, including {list(isp_data['grade'].dropna().unique())}.\n")

There are 27311 restaurants in this dataset.

The restaurant with the maximal number of inspections is ORCHID DYNASTY RESTAURANT with 98 inspections.

The boroughs of the restaurant locations are ['0' 'Bronx' 'Brooklyn' 'Manhattan' 'Queens' 'Staten Island'].
There are 8 restaurants without borough information.

There are 84 cuisine types in this dataset.
The cuisine types are 
['Afghan' 'African' 'American' 'Armenian' 'Asian' 'Australian'
 'Bagels/Pretzels' 'Bakery' 'Bangladeshi' 'Barbecue' 'Basque'
 'Bottled beverages, including water, sodas, juices, etc.' 'Brazilian'
 'Café/Coffee/Tea' 'Cajun' 'Californian' 'Caribbean' 'Chicken' 'Chilean'
 'Chinese' 'Chinese/Cuban' 'Chinese/Japanese' 'Continental' 'Creole'
 'Creole/Cajun' 'Czech' 'Delicatessen' 'Donuts' 'Eastern European'
 'Egyptian' 'English' 'Ethiopian' 'Filipino' 'French' 'Fruits/Vegetables'
 'German' 'Greek' 'Hamburgers' 'Hawaiian' 'Hotdogs' 'Hotdogs/Pretzels'
 'Ice Cream, Gelato, Yogurt, Ices' 'Indian' 'Indonesian' 'Iranian' 'Ir

## 4. Initial thoughts

1) What is the distribution of restaurant locations?
 
2) What is the distribution of the number of inspections per restaurant (per cuisine type)?
 
3) What is the distribution of cuisine types?
 
4) What is the distribution of inspection dates (per restaurant)?
 
5) What is the distribution of gaps between the initial inspection and re-inspection?
 
6) What is the distribution of grades?

7) What is the probability of getting a grade "A" on re-inspection?

8) How do the boroughs rank by the number of restaurants with grade "A"?

9) How do the cuisine types rank by the number of restaurants with grade "A"?

10) I need to decide how to handle restaurants that are missing borough or latitude/longitude information.

11) There are seven grade levels. I may need to regrade the scores into just three grade levels of "A", "B", "C".

## 5. Wrangling

In [8]:
mis_boro = np.unique(isp_data[isp_data['boro'] == '0']['camis'])
mis_lat = np.unique(isp_data[isp_data['latitude'].isnull()]['camis'])
mis_lon = np.unique(isp_data[isp_data['longitude'].isnull()]['camis'])

pd.DataFrame({'missing information': ['boro', 'latitude', 'longitude'],
              'number of missing': [len(mis_boro), len(mis_lat), len(mis_lon)],
              'overlap with boro': ['-', len(np.intersect1d(mis_boro, mis_lat)), len(np.intersect1d(mis_boro, mis_lon))],
              'overlap with latitude': ['-', '-', len(np.intersect1d(mis_lat, mis_lon))]})


| | missing information | number of missing | overlap with boro | overlap with latitude |
|---|---|---|---|---|
| 0 | boro | 8 | - | - |
| 1 | latitude | 63 | 8 | - |
| 2 | longitude | 63 | 8 | 63 |


In [9]:
# Check whether the missing location information can be filled from other rows of the same restaurant
dfg = isp_data.groupby(['camis'])

ids_1 = []
for id in mis_boro:
    if any(dfg.get_group(id)['boro'] != '0'):
        ids_1.append(id)
        
ids_2 = []
for id in mis_lat:
    if dfg.get_group(id)['latitude'].notnull().any():
        ids_2.append(id)


In [10]:
# none of the rows with missing borough information can be filled from related rows
len(ids_1)

0

In [11]:
# none of the rows with missing latitude/longitude information can be filled from related rows
len(ids_2)

0

> We still have the cuisine type information for those restaurants without location information, so I'll keep those rows.
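
The same check can be written without an explicit loop. This is only a sketch on made-up toy data: a missing latitude counts as fillable when another row with the same `camis` has a known latitude.

```python
import numpy as np
import pandas as pd

# Toy data (made up): restaurant 1 has one missing and one known latitude,
# restaurant 2 has only missing latitudes.
toy = pd.DataFrame({'camis': [1, 1, 2, 2],
                    'latitude': [np.nan, 40.7, np.nan, np.nan]})

known = toy['latitude'].notnull()

# For every row: does its restaurant have at least one known latitude?
has_known = known.groupby(toy['camis']).transform('any')

# Restaurants whose missing latitude could be filled from another row
fillable = toy.loc[~known & has_known, 'camis'].unique()
print(fillable.tolist())  # [1]
```

On the real data this list should be empty, which matches the zero counts above.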

In [12]:
mis_dba = np.unique(isp_data[isp_data['dba'].isnull()]['camis'])
mis_build = np.unique(isp_data[isp_data['building'].isnull()]['camis'])
mis_zip = np.unique(isp_data[isp_data['zipcode'].isnull()]['camis'])

# Check whether the missing information can be filled from other rows of the same restaurant

ids_1 = []
for id in mis_dba:
    if dfg.get_group(id)['dba'].notnull().any():
        ids_1.append(id)
        
ids_2 = []
for id in mis_build:
    if dfg.get_group(id)['building'].notnull().any():
        ids_2.append(id)
        
ids_3 = []
for id in mis_zip:
    if dfg.get_group(id)['zipcode'].notnull().any():
        ids_3.append(id)


In [13]:
# none of the rows with missing DBA information can be filled from related rows
len(ids_1)

0

In [14]:
# none of the rows with missing building information can be filled from related rows
len(ids_2)

0

In [15]:
# none of the rows with missing zipcode information can be filled from related rows
len(ids_3)

0

In [16]:
# check whether information for each restaurant is coherent.
for id in dfg.groups.keys():
    df = dfg.get_group(id)
    for col in ['dba', 'boro', 'building', 'street', 'zipcode', 'phone', 
                 'cuisine description', 'latitude', 'longitude']:
        if not df[col].isnull().any() and len(np.unique(df[col])) > 1:
            print(id, col, np.unique(df[col]))

> Great, the loop prints nothing, so all of this information is coherent.
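
A loop-free alternative is to count the distinct values per restaurant and column. The sketch below uses made-up toy data. Unlike the loop above, `nunique` simply ignores missing values instead of skipping the whole column for that restaurant.

```python
import pandas as pd

# Toy data (made up): restaurant 2 appears under two different names.
toy = pd.DataFrame({'camis': [1, 1, 2, 2],
                    'dba': ['A CAFE', 'A CAFE', 'B DINER', 'B DINER INC'],
                    'boro': ['Bronx', 'Bronx', 'Queens', 'Queens']})

# Number of distinct non-missing values per restaurant and column
n_distinct = toy.groupby('camis')[['dba', 'boro']].nunique()

# True where a restaurant has more than one value in a column
inconsistent = n_distinct.gt(1)

# Show only the restaurants with at least one incoherent column
print(inconsistent[inconsistent.any(axis=1)])
```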

In [17]:
# code inspection types into four groups: 0 : initial inspection, 1 : re-inspection, 2: reopening, -2: nan
code = np.full([isp_data.shape[0]], -2)
re_ips = isp_data['inspection type'].copy().str.contains('Re-inspection|Second', flags=re.IGNORECASE, regex=True)
re_op = isp_data['inspection type'].copy().str.contains('Reopening', flags=re.IGNORECASE, regex=True)
code[re_ips == True] = 1
code[re_ips == False] = 0
code[re_op == True] = 2
isp_data['inspection code'] = code

# replace all nans with -2
isp_data = isp_data.fillna(-2)
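
For comparison, the four codes can also be assigned in one step with `np.select`, which takes the first matching condition for each row. This is a sketch on a few toy inspection types, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy inspection types, including one missing value
types = pd.Series(['Cycle Inspection / Initial Inspection',
                   'Cycle Inspection / Re-inspection',
                   'Cycle Inspection / Reopening Inspection',
                   np.nan])

# na=False makes missing values evaluate to False instead of NaN
is_reopen = types.str.contains('Reopening', case=False, na=False)
is_reinsp = types.str.contains('Re-inspection|Second', case=False, na=False)

# Conditions are checked in order: missing, reopening, re-inspection, otherwise initial
codes = np.select([types.isnull(), is_reopen, is_reinsp], [-2, 2, 1], default=0)
print(codes.tolist())  # [0, 1, 2, -2]
```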

In [18]:
# check whether scores and grades match. If not, re-assign the grades.
dfg = isp_data.groupby(['camis'])
for id in dfg.groups.keys():
    df = dfg.get_group(id)
    for i in range(df.shape[0]):
        
        if df.iloc[i]['score'] >= 0:
            
            if df.iloc[i]['inspection code'] == 2:
                if df.iloc[i]['grade'] != 'P':
                    isp_data.loc[df.iloc[i].name, 'grade'] = 'P'
                    
            elif df.iloc[i]['score'] < 14:
                if df.iloc[i]['grade'] != 'A':
                    isp_data.loc[df.iloc[i].name, 'grade'] = 'A'
                    
            elif df.iloc[i]['inspection code'] == 1:
                if 14 <= df.iloc[i]['score'] < 28:
                    if df.iloc[i]['grade'] != 'B':
                        isp_data.loc[df.iloc[i].name, 'grade'] = 'B'
                        
                elif df.iloc[i]['grade'] != 'C':
                    isp_data.loc[df.iloc[i].name, 'grade'] = 'C'
                    
            elif df.iloc[i]['inspection code'] == 0:
                if df.iloc[i]['grade'] != 'P':
                    isp_data.loc[df.iloc[i].name, 'grade'] = 'P'
                
        elif df.iloc[i]['grade'] != -2 :
            isp_data.loc[df.iloc[i].name, 'grade'] = -2
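
The row-by-row loop above is slow on 400k rows. A vectorized version of the same rules might look like the sketch below. It runs on made-up toy data, uses the string `'-2'` as the missing-grade marker, and, unlike the loop, also assigns `'-2'` to rows that match none of the rules (the loop leaves those grades unchanged).

```python
import numpy as np
import pandas as pd

# Toy data (made up): scores with inspection codes 0 (initial), 1 (re-inspection), 2 (reopening)
toy = pd.DataFrame({'score': [11, 18, 20, 30, 5, -1],
                    'inspection code': [0, 0, 1, 1, 2, 0]})

score = toy['score']
code = toy['inspection code']

# The first matching condition wins, mirroring the if/elif order of the loop
conditions = [
    score < 0,                   # no usable score
    code == 2,                   # reopening inspection
    score < 14,                  # fewer than 14 points on any other inspection
    (code == 1) & (score < 28),  # re-inspection, 14-27 points
    code == 1,                   # re-inspection, 28 points or more
    code == 0,                   # initial inspection, 14 points or more
]
grades = ['-2', 'P', 'A', 'B', 'C', 'P']

toy['grade'] = np.select(conditions, grades, default='-2')
print(toy['grade'].tolist())  # ['A', 'P', 'B', 'C', 'P', '-2']
```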

In [19]:
isp_data.to_csv('../data/clean_data/nyc_restaurants_results.csv', index=False)

In [2]:
isp_data = pd.read_csv('../data/clean_data/nyc_restaurants_results.csv')

In [3]:
# remove observations without dba or boro info
isp_data = isp_data[isp_data.dba != '-2']
isp_data = isp_data[isp_data.boro != '0']


In [4]:
# # replace cuisine description with code
# code_to_cuisine = dict(zip(range(len(set(isp_data['cuisine description']))), set(isp_data['cuisine description'])))
# cuisine_to_code = dict(zip(set(isp_data['cuisine description']), range(len(set(isp_data['cuisine description'])))))

# code_to_cuisine[cuisine_to_code['Latin (Cuban, Dominican, Puerto Rican, South & Central American)']] = \
# 'Latin (Cuban, Dominican, Puerto<br>Rican, South & Central American)'
# cuisine_to_code['Latin (Cuban, Dominican, Puerto<br>Rican, South & Central American)'] =\
# cuisine_to_code['Latin (Cuban, Dominican, Puerto Rican, South & Central American)']

# isp_data['cuisine type'] = isp_data['cuisine description'].replace(cuisine_to_code)

isp_data['cuisine description'] = isp_data['cuisine description'].\
replace({'Latin (Cuban, Dominican, Puerto Rican, South & Central American)':
         'Latin (Cuban, Dominican, Puerto<br>Rican, South & Central American)'})

violations = {}

# add breaks in violation descriptions
for item in set(isp_data['violation description']):
    
    if item == '-2':
        violations[item] = 'NA'
    else:
        l = item.split()
        temp = []
        i = 5
        while i < len(l):
            temp.append(' '.join(l[max(0, i - 10): i]))
            i += 10
        temp.append(' '.join(l[i - 10:]))
        item_br = '<br>'.join(temp)
        violations[item] = item_br

isp_data['violation description'] = isp_data['violation description'].replace(violations)
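
The standard library's `textwrap` module offers a shorter way to insert the breaks. Note that this sketch wraps by character width rather than by word count as the loop above does, so the breaks land in different places. The helper name `add_breaks` and the sample sentence are made up.

```python
import textwrap


def add_breaks(text, width=60):
    """Insert '<br>' so that no line is longer than `width` characters."""
    return '<br>'.join(textwrap.wrap(text, width=width))


# Made-up sample text, only used to show where the breaks end up
desc = ("This is a made up and fairly long violation description "
        "that is only used to show where the line breaks end up.")
print(add_breaks(desc))
```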

In [5]:
isp_data['camis'] = isp_data['camis'].astype(str)
isp_data['dba'] = isp_data.dba + '<br>(CAMIS: ' + isp_data.camis + ')'
isp_data.head()

| | camis | dba | boro | building | street | zipcode | phone | cuisine description | inspection type | violation code | violation description | latitude | longitude | score | grade | inspection date | inspection code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50082518 | `PAPA JOHN'S<br>(CAMIS: 50082518)` | Brooklyn | 1408 | NEPTUNE AVE | 11224.0 | 7182657272 | Pizza | Calorie Posting / Re-inspection | 16C | `Caloric content not posted on<br>menus, menu b...` | 40.579177 | -73.982177 | -2.0 | -2 | 2018-12-13 | 1 |
| 1 | 50018573 | `TURCO<br>(CAMIS: 50018573)` | Manhattan | 604 | 9TH AVE | 10036.0 | 2125108666 | Middle Eastern | Cycle Inspection / Initial Inspection | 08A | `Facility not vermin proof. Harborage<br>or con...` | 40.759235 | -73.992026 | 48.0 | P | 2019-10-25 | 0 |
| 2 | 41600457 | `TABATA NOODLE RESTAURANT<br>(CAMIS: 41600457)` | Manhattan | 540 | 9 AVENUE | 10018.0 | 2122907681 | Japanese | Cycle Inspection / Re-inspection | 04L | `Evidence of mice or live<br>mice present in fa...` | 40.756993 | -73.993654 | 28.0 | C | 2018-01-18 | 1 |
| 3 | 41405368 | `DYLAN MURPHY'S<br>(CAMIS: 41405368)` | Manhattan | 1453 | 3 AVENUE | 10028.0 | 2129889434 | American | Cycle Inspection / Re-inspection | 08A | `Facility not vermin proof. Harborage<br>or con...` | 40.776321 | -73.955788 | 11.0 | A | 2018-04-10 | 1 |
| 4 | 41594669 | `TAR PIT<br>(CAMIS: 41594669)` | Brooklyn | 135 | WOODPOINT ROAD | 11211.0 | 6464699494 | Café/Coffee/Tea | Cycle Inspection / Initial Inspection | 04L | `Evidence of mice or live<br>mice present in fa...` | 40.7175 | -73.941398 | 18.0 | P | 2019-02-14 | 0 |


In [6]:
# save restaurant information
rst_info = isp_data[['camis', 'dba', 'boro', 'building', 
                     'street', 'zipcode', 'phone',
                     'cuisine description', 
                     'latitude', 'longitude']].drop_duplicates()
rst_info = rst_info.sort_values(['camis'])
rst_info['current_grade'] = isp_data.sort_values(['camis', 'inspection date']).groupby(['camis'])['grade'].apply(lambda x: x.iloc[-1]).values
rst_info.head()

| | camis | dba | boro | building | street | zipcode | phone | cuisine description | latitude | longitude | current_grade |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 35540 | 30075445 | `MORRIS PARK BAKE SHOP<br>(CAMIS: 30075445)` | Bronx | 1007 | MORRIS PARK AVE | 10462.0 | 7188924968 | Bakery | 40.848231 | -73.855972 | A |
| 39335 | 30112340 | `WENDY'S<br>(CAMIS: 30112340)` | Brooklyn | 469 | FLATBUSH AVENUE | 11225.0 | 7182875005 | Hamburgers | 40.662652 | -73.962081 | A |
| 30166 | 30191841 | `DJ REYNOLDS PUB AND RESTAURANT<br>(CAMIS: 3019...` | Manhattan | 351 | WEST 57 STREET | 10019.0 | 2122452912 | Irish | 40.767326 | -73.98431 | A |
| 44804 | 40356018 | `RIVIERA CATERERS<br>(CAMIS: 40356018)` | Brooklyn | 2780 | STILLWELL AVENUE | 11224.0 | 7183723031 | American | 40.57992 | -73.98209 | A |
| 5899 | 40356483 | `WILKEN'S FINE FOOD<br>(CAMIS: 40356483)` | Brooklyn | 7114 | AVENUE U | 11234.0 | 7184443838 | Delicatessen | 40.620112 | -73.906989 | A |
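
Assigning `current_grade` through `.values` only works because both sides happen to contain exactly one row per restaurant in the same `camis` order. A key-based merge does not depend on row order. The sketch below shows the idea on made-up toy data.

```python
import pandas as pd

# Toy inspection records (made up)
toy = pd.DataFrame({'camis': ['1', '1', '2'],
                    'inspection date': pd.to_datetime(['2018-01-01', '2019-01-01', '2018-06-01']),
                    'grade': ['B', 'A', 'C']})

# Toy restaurant table, deliberately in a different order
info = pd.DataFrame({'camis': ['2', '1'], 'dba': ['B DINER', 'A CAFE']})

# Keep the most recent inspection of each restaurant
latest = (toy.sort_values('inspection date')
             .drop_duplicates('camis', keep='last')[['camis', 'grade']]
             .rename(columns={'grade': 'current_grade'}))

# Attach the grade by key instead of by position
info = info.merge(latest, on='camis', how='left')
print(info)
```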


In [7]:
rst_info.to_pickle('../data/clean_data/nyc_restaurants_info.pkl')

In [8]:
#rst_info.to_csv('../data/clean_data/nyc_restaurants_info.csv', index=False)

In [9]:
# save inspection information
isp_info = isp_data[['camis', 'dba', 'inspection code', 
                     'violation description',
                     'inspection date', 'score', 'grade']]\
[isp_data['score'] >= 0].sort_values(['camis', 'inspection date'])

#isp_info.to_csv('../data/clean_data/nyc_restaurants_grades.csv', index=False)

In [10]:
isp_info.to_pickle('../data/clean_data/nyc_restaurants_grades.pkl')

In [11]:
# with open('../data/clean_data/code_to_cuisine.json', 'w') as json_file:
#     json.dump(code_to_cuisine, json_file)
    
# with open('../data/clean_data/cuisine_to_code.json', 'w') as json_file:
#     json.dump(cuisine_to_code, json_file)


In [12]:
analysis_data = rst_info.groupby(['boro', 'cuisine description', 'current_grade'])[['dba']].count().reset_index()
analysis_data.columns = ['boro', 'cuisine description', 'grade', 'count']
analysis_data

| | boro | cuisine description | grade | count |
|---|---|---|---|---|
| 0 | Bronx | African | A | 16 |
| 1 | Bronx | African | B | 4 |
| 2 | Bronx | African | C | 2 |
| 3 | Bronx | African | P | 6 |
| 4 | Bronx | American | -2 | 10 |
| ... | ... | ... | ... | ... |
| 1020 | Staten Island | Turkish | A | 1 |
| 1021 | Staten Island | Turkish | P | 1 |
| 1022 | Staten Island | Vegetarian | A | 1 |
| 1023 | Staten Island | Vegetarian | P | 1 |


In [13]:
analysis_data.to_pickle('../data/clean_data/nyc_restaurants_analysis.pkl')

In [14]:
# geojson data from https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm
with open('../data/raw_data/nyk_borough_boundaries.geojson', 'r') as json_data:
    nyk = json.load(json_data)

borough_loc = [{"type": "FeatureCollection", 'features': [feat]} for feat in nyk['features']]
colors = ['#2a6fd4', '#ffec00', '#d10010', '#c483cc', '#05a113']
layer = [dict(sourcetype='geojson',
              source=borough_loc[k],
              type='fill',
              color=colors[k],
              opacity=0.1
             ) for k in range(len(borough_loc))]

with open('../data/clean_data/borough_loc.json', 'w') as json_file:
    json.dump(layer, json_file)

## 6. Research questions

1) For a given cuisine type, which borough has the largest number of grade 'A' restaurants?

2) Which five cuisine types have the most grade 'A' restaurants in a given borough?

3) What is the inspection history of a given restaurant?
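
Questions 1 and 2 can be answered directly from a table shaped like `analysis_data` (one row per borough, cuisine type and grade, with a count). The sketch below uses made-up toy counts, and the helper names `top_grade_a` and `best_boro` are hypothetical.

```python
import pandas as pd

# Toy counts (made up), shaped like analysis_data
toy = pd.DataFrame({
    'boro': ['Bronx'] * 4 + ['Queens'] * 2,
    'cuisine description': ['Pizza', 'Pizza', 'Chinese', 'Bakery', 'Pizza', 'Thai'],
    'grade': ['A', 'B', 'A', 'A', 'A', 'A'],
    'count': [12, 3, 20, 7, 9, 4],
})


def top_grade_a(df, boro, n=5):
    """Return the n cuisine types with the most grade 'A' restaurants in `boro`."""
    mask = (df['boro'] == boro) & (df['grade'] == 'A')
    return df[mask].groupby('cuisine description')['count'].sum().nlargest(n)


def best_boro(df, cuisine):
    """Return the borough with the most grade 'A' restaurants of `cuisine`."""
    mask = (df['cuisine description'] == cuisine) & (df['grade'] == 'A')
    return df[mask].groupby('boro')['count'].sum().idxmax()


print(top_grade_a(toy, 'Bronx', n=2))
print(best_boro(toy, 'Pizza'))  # Bronx
```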

## 7. Data Analysis & Visualizations

In [12]:
# rst_info = pd.read_csv('../data/clean_data/nyc_restaurants_info.csv')
# isp_info = pd.read_csv('../data/clean_data/nyc_restaurants_grades.csv')
# analysis_data = pd.read_csv('../data/clean_data/nyc_restaurants_analysis.csv').replace('-2', 'NA')

In [13]:
with open('../data/clean_data/borough_loc.json', 'r') as json_data:
    layer = json.load(json_data)

In [14]:
rst_info = pd.read_pickle('../data/clean_data/nyc_restaurants_info.pkl')
isp_info = pd.read_pickle('../data/clean_data/nyc_restaurants_grades.pkl')
analysis_data = pd.read_pickle('../data/clean_data/nyc_restaurants_analysis.pkl').replace('-2', 'NA')

In [16]:
# with open('../data/clean_data/code_to_cuisine.json', 'r') as json_data:
#     code_to_cuisine = json.load(json_data)
    
# with open('../data/clean_data/cuisine_to_code.json', 'r') as json_data:
#     cuisine_to_code = json.load(json_data)

In [None]:
df_1 = isp_data.groupby(['camis'])[['inspection date']].count()
df_1.columns = ['# of inspections per restaurant']
df_1.head()

In [32]:
fig = px.histogram(df_1, x='# of inspections per restaurant')
fig.update_layout(title_text='The histogram of the number of inspections per restaurant')
fig.write_html('../img/inspection_per_restaurant.html', auto_open=True)


In [14]:
def plot_grades_boro(data):
    """
    Returns a bar graph of grades based on borough.
    
    Parameters:
    -----------
    data: pandas.DataFrame
        aggregated counts with the columns 'boro', 'grade' and 'count'.
        
    Returns:
    --------
    plotly figure
        a bar graph of grades based on borough
    """
    df = data.groupby(['boro', 'grade'])[['count']].sum().reset_index()
    # total per borough, broadcast to every row (does not assume five grades per borough)
    df['sum'] = df.groupby('boro')['count'].transform('sum')
    df['percentage'] = df['count'] / df['sum'] * 100

    boros = df.groupby(['boro'])[['count']].sum().sort_values(['count'], ascending=False).index

    fig = px.bar(df, 
                 y='boro', 
                 x='count', 
                 color='grade', 
                 text='percentage',
                 orientation='h',
                 category_orders={'boro': boros,
                                  'grade': ['A', 'B', 'C', 'P', 'NA']})
    
    fig.update_traces(texttemplate='%{text:.1f}%', 
                      textposition='inside',
                      hovertemplate = 'Borough: %{y}' + '<br>Number: %{x}<br>' + 'Percentage: %{text:.1f}%',)
    
    fig.update_layout(title_text='The distribution of restaurant grades in different boroughs',
                      xaxis_title="Number of restaurants",
                      yaxis_title="Borough",
                      legend_title='Grade',
                      clickmode="event+select",
                      dragmode="lasso")
    

    return fig

In [15]:
plot_grades_boro(analysis_data).write_html('../img/plot1.html', auto_open=True)

In [21]:
def plot_grades_cuisine(data, title='', types=False):
    """
    Returns a bar graph of grades based on cuisine.
    
    Parameters:
    -----------
    data: pandas.DataFrame
        aggregated counts with the columns 'cuisine description', 'grade' and 'count'.
    title: string (default: '')
        suffix appended to the plot title, e.g. the borough name.
    types: bool (default: False)
        plot all cuisine types if False, 
        plot selected cuisines if True. 
    
    Returns:
    --------
    plotly figure
        a bar graph of grades based on cuisine.
    """
    df = data.groupby(['cuisine description', 'grade'])[['count']].sum().reset_index()
    sum_df = df.groupby(['cuisine description'])[['count']].sum().sort_values(['count'], ascending=False).reset_index()
    
    if not types:
        sum_df = sum_df.iloc[:20]
        title = 'Restaurant grades distribution of top 20 most common cuisine types ' + title
    else:
        title = 'Restaurant grades distribution ' + title
        
    sum_df.columns = ['cuisine description', 'sum']
    df = pd.merge(sum_df, df, how="left", on='cuisine description')
    df['percentage'] = df['count'] / df['sum'] * 100

    fig = px.bar(df, 
                 y='cuisine description', 
                 x='count', 
                 color='grade', 
                 text='percentage',
                 orientation='h', 
                 height=max(400, 40 * sum_df.shape[0]),
                 category_orders={'cuisine description': sum_df['cuisine description'],
                                  'grade': ['A', 'B', 'C', 'P', 'NA']})
    fig.update_traces(texttemplate='%{text:.1f}%', 
                      textposition='inside',
                      hovertemplate = 'Cuisine description: %{y}' + '<br>Number: %{x}<br>' + 'Percentage: %{text:.1f}%',)
    fig.update_layout(title_text=title,
                      xaxis_title="Number of restaurants",
                      yaxis_title="Cuisine type",
                      legend_title='Grade')

    return fig

In [17]:
plot_grades_cuisine(analysis_data).write_html('../img/plot2.html', auto_open=True)

In [18]:
plot_grades_cuisine(analysis_data[analysis_data['boro'] == 'Manhattan'], 'in Manhattan').write_html('../img/plot3.html', auto_open=True)

In [18]:
plot_grades_cuisine(analysis_data[analysis_data['boro'] == 'Bronx'], 'in Bronx').write_html('../img/plot4.html', auto_open=True)

In [19]:
plot_grades_cuisine(analysis_data[analysis_data['boro'] == 'Brooklyn'], 'in Brooklyn').write_html('../img/plot5.html', auto_open=True)

In [20]:
plot_grades_cuisine(analysis_data[analysis_data['boro'] == 'Queens'], 'in Queens').write_html('../img/plot6.html', auto_open=True)

In [21]:
plot_grades_cuisine(analysis_data[analysis_data['boro'] == 'Staten Island'], 'in Staten Island').write_html('../img/plot7.html', auto_open=True)

In [22]:
types = ['American', 'Chinese']
# the data is filtered by cuisine type only, so the title must not name a single borough
plot_grades_cuisine(analysis_data[analysis_data['cuisine description'].isin(types)], 'for selected cuisine types', True).write_html('../img/plot8.html', auto_open=True)

In [16]:
token = 'pk.eyJ1IjoiZmxpemhvdSIsImEiOiJjazg4Y2hjaW4wMjFlM3NtemhhNG90Z2ZzIn0.gxDbD64mpZbxE2HMjxEZng'

fig = px.scatter_mapbox(rst_info, lat="latitude", lon="longitude", 
                        hover_name="dba", 
                        hover_data=['camis', 'boro', 'current_grade', 'cuisine description',
                                    'building', 'street', 'zipcode', 'phone'],
                        color='boro',
                        size=np.full(rst_info.shape[0], 1),
                        size_max=2,
                        center=dict(lat=40.7, lon=-74), 
                        zoom=10, 
                        height=500)

fig.update_layout(mapbox_accesstoken=token,
                  mapbox_layers=layer,
                  margin={"r":0,"t":0,"l":0,"b":0},
                  legend_title='Borough')

fig.write_html('../img/restaurant_location.html', auto_open=True)

In [23]:
def plot_restaurants(data):
    """
    Returns a plot of grades of selected restaurants over time.
    
    Parameters:
    -----------
    data: pandas.DataFrame
        inspection records with the columns 'camis', 'dba', 'grade', 'inspection date' and 'violation description'.
        
    Returns:
    --------
    plotly figure
        a plot of grades of selected restaurants over time
    """
    fig = px.line(data,
                  y='grade', 
                  x='inspection date',
                  color='camis',
                  hover_name="dba", 
                  hover_data=['camis', 'grade', 'inspection date', 'violation description'],
                  category_orders={'grade': ['A', 'P', 'B', 'C', 'NA']},
                  height=500)

    fig.update_traces(mode='markers+lines', 
                      opacity=0.5)

    fig.update_layout(title_text='Restaurants inspection results over time',
                      xaxis_title="Inspection date",
                      yaxis_title="Grade",
                      legend_title='CAMIS')
    
    return fig

In [24]:
restaurants = isp_info['camis'].unique()[:10]
# write to a new file so that the cuisine plot saved as plot8.html is not overwritten
plot_restaurants(isp_info[isp_info['camis'].isin(restaurants)].copy()).write_html('../img/plot9.html', auto_open=True)

## 8. Summary and conclusions
