## Food Inspection Analysis

### Potential Features  

1.  name  
1.  license number  
1.  result (pass/fail)
1.  business age (default start from 2010 inspection date)  
1.  *number of chains / is_chain boolean 
1.  risk  
1.  ward / neighborhood
1.  license code  
1.  renew  
1.  conditional approved  
1.  business activity
1.  *number of (pass/fail) inspections during the 1st, 2nd, 3rd, and 4th most recent license periods
    * this can have errors, so maybe use years from year_min to simplify
1.  *Geoapify: number of Starbucks within a 0.5 mile radius  
1.  *Geoapify: related businesses within a 0.5 mile radius  
1.  *US census tract income info  
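
The chain feature in the list above could be derived roughly as follows. This is a minimal sketch on a toy frame, assuming that a "chain" means one `aka_name_x` appearing at more than one distinct `address_x`; the toy rows are made up for illustration and the real definition may need to be stricter.

```python
import pandas as pd

# Toy stand-in for the inspection data (hypothetical rows, real column names).
toy_df = pd.DataFrame({
    "aka_name_x": ["SUBWAY", "SUBWAY", "SUBWAY", "MISS RICKY'S"],
    "address_x": ["1 N STATE ST", "1 N STATE ST", "200 W LAKE ST", "203 N WABASH AVE"],
})

# Assumption: distinct addresses per name approximates the number of chain stores.
store_counts = toy_df.groupby("aka_name_x")["address_x"].nunique()

toy_df["number_of_chain_stores"] = toy_df["aka_name_x"].map(store_counts)
toy_df["is_chain"] = toy_df["number_of_chain_stores"] > 1
```

Counting distinct addresses rather than rows keeps repeated inspections of the same store from inflating the chain size.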

In [None]:
# # Code formatter
# # !pip3 install nb_black
# %load_ext nb_black

In [1]:
# Import required libraries

# eda tools
import numpy as np
import pandas as pd
import re

# visualization dependencies
import matplotlib.pyplot as plt  
import seaborn as sns

# hide jupyter lab warnings
import warnings
warnings.filterwarnings('ignore')

# expand the number of dataframe columns visible
pd.options.display.max_columns = 100

# make sound when this code executes: Audio(sound_file, autoplay=True)
from IPython.display import Audio
sound_file = './sound/chord.wav'


In [2]:
# display package information
# !conda install -c conda-forge session-info
import session_info
session_info.show()

### Read Dataset

In [None]:
# Read data
restaurant_df = pd.read_csv('./data/combined_date.csv')

## Dataframe Review
restaurant_df.head(10)

### Create Summary Statistics  
- In place of the number of fails, I will often use percent fail  

**Checks**  
1 - Categorical plot of X (# starbucks) with pass/fail color coding   
1 - Bar chart of X (# starbucks) with pass/fail bars    
1 - Line chart of Fail Percent versus # of starbucks  
2/3 - Similar to above  
4/5 - not applicable but used to derive other features  

6/7/8/9 - Does time affect events?  How does the total/percent of fails change over time?  Are there more fails during a particular season?   Start with overall macro effects.  Are there multiple inspections for the same day for the same license number?  May need to combine.   

** To do micro analysis (company or license scale), we first need to answer these questions:  
Is the company and address the same but different license number? May need to drop one of the license numbers.   
What is the relation between previous number of inspections and success? - maybe show a category chart of 1st inspection (red is fail, blue is pass) through 10th inspection.  

10 - line chart versus percent fail  
11 - bar chart of franchise size  
11 - line chart of franchise size versus percent fail  
12 - bar chart comparing is chain and is not chain  
12 - bar chart comparing percent fail of chain versus not chain  
13 - bar chart comparing 3 risk categories and percent fail  
14 - bar chart of wards and percent fail  
14 - Does the location make a difference?  Make a bar chart.  
14 (other) - Do certain addresses with multiple companies have a history of failing?   
15 - license code - not needed  
16 - bar chart of renew and percent fail  
16 - bar chart of renew count versus first inspection count  
17 - bar chart of conditional approval boolean and percent fail  
18 - bar chart of total count fails per business and bar chart of percent fail per business  
19 - active or inactive business (based on business license expiration)

- Is there a pair plot with bars in the background?  
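
Most of the checks above reduce to "percent fail per group". A small helper along these lines may avoid repeating the count and divide steps; `percent_fail` is a hypothetical name, and the toy risk labels are illustrative only.

```python
import pandas as pd

def percent_fail(df, by):
    """Percentage of rows with results == 'Fail' for each group in `by`."""
    flagged = df.assign(is_fail=df["results"].eq("Fail"))
    return flagged.groupby(by)["is_fail"].mean() * 100

# Toy data with the notebook's column names (values are made up).
toy_df = pd.DataFrame({
    "risk": ["Risk 1 (High)", "Risk 1 (High)", "Risk 2 (Medium)", "Risk 2 (Medium)"],
    "results": ["Fail", "Pass", "Pass", "Pass"],
})

fail_by_risk = percent_fail(toy_df, "risk")
```

Because `by` is passed straight to `groupby`, the same helper accepts a single column or a list such as `['year', 'month']`, and the result can be sent to `.plot.bar()`.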

## Analysis

### Data Consistency Checks

In [None]:
# The two address columns largely line up across the 148k records; about 2500 records mismatch, but only due to abbreviation and spacing differences
x = restaurant_df['address_x'].apply(lambda x: x[0:10].lower())
y = restaurant_df['address_y'].apply(lambda x: x[0:10].lower())

restaurant_df[ x != y][['aka_name_x','address_x', 'aka_name_y','address_y']]

In [None]:
# show that for every name and address, there is a unique license_alias id number
# this removes confusion for restaurants with the same name and address that have multiple licenses
# information about the license is not lost since the original license_num is kept but also there are identifying columns like violations, license type, and license code that provide clarity
restaurant_df.groupby(['aka_name_x','address_x','license_alias']).count()['risk']

In [None]:
# Are there multiple inspections for the same day for the same license number? 
# Yes, removed the rows that were identical
# remaining duplicates are from failed initial attempt and later in the day fix and reinspect

# restaurant_df.groupby(['inspect_date','license_num', 'aka_name_x']).count()['inspect_id'].sort_values(ascending=False)
# restaurant_df[restaurant_df['aka_name_x'] == "PETE'S  PIZZA & BAKEHOUSE"].loc[:,['aka_name_x','license_num','facility_type','risk','address_x','city','state','zipcode',
#                                                                                'inspect_date','inspect_type','results','violations','lat','lon', 'year']].sort_values(by='inspect_date')
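
The same-day check described in the comments above might look like the following. It is a sketch on toy rows, assuming "identical" means every column matches and that the remaining same-day pairs are a fail followed by a re-inspection.

```python
import pandas as pd

# Toy rows: one exact duplicate plus a same-day re-inspection (values are made up).
toy_df = pd.DataFrame({
    "inspect_id": [101, 101, 102, 103],
    "license_num": [555, 555, 555, 777],
    "inspect_date": ["2015-03-02", "2015-03-02", "2015-03-02", "2015-03-02"],
    "results": ["Fail", "Fail", "Pass", "Pass"],
})

# Drop rows that are identical in every column.
deduped = toy_df.drop_duplicates()

# Count inspections per day and license; more than one means a same-day repeat.
per_day = deduped.groupby(["inspect_date", "license_num"])["inspect_id"].count()
repeat_visits = per_day[per_day > 1]
```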

### Basic Feature Analysis

In [None]:
total_records = restaurant_df.groupby('year').count()['inspect_id']

failed_records = restaurant_df[restaurant_df['results'] == 'Fail'].groupby('year').count()['inspect_id']
failed_records.plot.bar()

In [None]:
failed_percent = failed_records/total_records*100
failed_percent.plot.bar()

In [None]:
failed_records = restaurant_df[restaurant_df['results'] == 'Fail'].groupby(['year','month']).count()['inspect_id']
failed_records.plot()

In [None]:
failed_records = restaurant_df[(restaurant_df['results'] == 'Fail') & (restaurant_df['year'] == 2014)].groupby(['year','month']).count()['inspect_id']
failed_records.plot()

In [None]:
failed_records = restaurant_df[(restaurant_df['results'] == 'Fail') & (restaurant_df['year'] == 2015)].groupby(['year','month']).count()['inspect_id']
failed_records.plot()

In [None]:
failed_records = restaurant_df[(restaurant_df['results'] == 'Fail') & (restaurant_df['year'] == 2016)].groupby(['year','month']).count()['inspect_id']
failed_records.plot()

In [None]:
total_records = restaurant_df.groupby(['year','month']).count()['inspect_id']

failed_records = restaurant_df[restaurant_df['results'] == 'Fail'].groupby(['year','month']).count()['inspect_id']

diff = failed_records/total_records*100

diff.plot()

In [None]:
total_records = restaurant_df[restaurant_df['year'] == 2014].groupby(['year','month']).count()['inspect_id']
failed_records = restaurant_df[(restaurant_df['results'] == 'Fail') & (restaurant_df['year'] == 2014)].groupby(['year','month']).count()['inspect_id']
diff = failed_records/total_records*100
diff.plot()

In [None]:
total_records = restaurant_df[restaurant_df['year'] == 2015].groupby(['year','month']).count()['inspect_id']
failed_records = restaurant_df[(restaurant_df['results'] == 'Fail') & (restaurant_df['year'] == 2015)].groupby(['year','month']).count()['inspect_id']
diff = failed_records/total_records*100
diff.plot()

In [None]:
total_records = restaurant_df[restaurant_df['year'] == 2016].groupby(['year','month']).count()['inspect_id']
failed_records = restaurant_df[(restaurant_df['results'] == 'Fail') & (restaurant_df['year'] == 2016)].groupby(['year','month']).count()['inspect_id']
diff = failed_records/total_records*100
diff.plot()

In [None]:
# Post-Analysis Summary:  
# Fails build up during the first half of the year and then decrease during the second half
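
The per-year cells above could be collapsed into one month-by-year table, which makes the seasonal pattern easier to compare across years. This is a sketch on made-up toy data, not a result from the real dataset.

```python
import pandas as pd

# Toy data with the notebook's column names (values are made up).
toy_df = pd.DataFrame({
    "year":    [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "month":   [1, 1, 6, 6, 1, 1, 6, 6],
    "results": ["Pass", "Fail", "Fail", "Fail", "Pass", "Pass", "Fail", "Pass"],
})

is_fail = toy_df["results"].eq("Fail")

# Rows are months, columns are years, values are percent fail.
monthly_fail_pct = (
    is_fail.groupby([toy_df["year"], toy_df["month"]])
    .mean()
    .mul(100)
    .unstack("year")
)

# monthly_fail_pct.plot() would then draw one line per year on a shared month axis.
```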

### Odd listings  
O'Hare  
PRESENCE RESURRECTION MEDICAL CENTER  
Hospital Food courts  
2-4 downtown food courts  

In [None]:
# restaurant_df[restaurant_df['aka_name_x'] == 'PRESENCE RESURRECTION MEDICAL CENTER']

In [None]:
temp.xs('#1 CHOP SUEY', level='aka_name_x')

In [None]:
temp.sort_values(by='duplicate_licenses_per_name_address', ascending=False).index.get_level_values('aka_name_x').to_list()

In [None]:
temp.xs("RENAISSANCE CHICAGO O'HARE HOTEL", level='aka_name_x')

In [None]:
restaurant_df[restaurant_df['aka_name_x'] == "RENAISSANCE CHICAGO O'HARE HOTEL"].sort_values(by=['inspect_date', 'license_num'], ascending=True)

In [None]:
temp.sort_values(by='number_of_chain_stores', ascending=False)

In [None]:
temp.sort_values(by='duplicate_licenses_per_name_address', ascending=False).loc["MISS RICKY'S",:]

In [None]:
temp.sort_values(by='duplicate_licenses_per_name_address', ascending=False).loc["SUBWAY",:]

In [None]:
restaurant_df.groupby(['aka_name_x','license_description']).count().iloc[:,0:2]

In [None]:
temp = restaurant_df[['aka_name_x','facility_type','inspect_type','license_description','bus_activity', 'application_type']]

In [None]:
temp['bus_activity'].value_counts()

In [None]:
temp['application_type'].value_counts()

### Aggregate data  
**business activity**:  good feature, since a business might have multiple license numbers, each for a different license code(?) - the description can be broken into new categories  
**license code**:  is the same as license description - can be broken into a couple of categories


In [None]:
# Check whether this differs from license description
restaurant_df['license_code'].value_counts()

In [None]:
temp['license_description'].value_counts()

In [None]:
# count the number of business_activity per business name
# derive stats of how long the business has had each service
# aggregate all values to a single dataframe.  
# split business_activity into:  perishable foods, onsite-prep-dining, onsite-prep-nodining,  
restaurant_df.groupby(['aka_name_x','license_num','bus_activity','license_code']).count()
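
The first two steps in the comments above (count activities per name, then estimate how long each has been offered) could be sketched as below. It assumes `inspect_date` is a reasonable proxy for when an activity was observed; the toy rows are made up, and the activity labels reuse the category names proposed in the comment.

```python
import pandas as pd

# Toy data with the notebook's column names (values are made up).
toy_df = pd.DataFrame({
    "aka_name_x": ["SUBWAY", "SUBWAY", "SUBWAY", "MISS RICKY'S"],
    "bus_activity": ["perishable foods", "perishable foods",
                     "onsite-prep-dining", "onsite-prep-dining"],
    "inspect_date": pd.to_datetime(
        ["2014-01-01", "2015-01-01", "2015-06-01", "2016-02-01"]
    ),
})

# Number of distinct business activities per business name.
activities_per_name = toy_df.groupby("aka_name_x")["bus_activity"].nunique()

# First and last date each activity was seen, plus the span in days.
activity_span = toy_df.groupby(["aka_name_x", "bus_activity"])["inspect_date"].agg(
    first_seen="min", last_seen="max"
)
activity_span["days_observed"] = (
    activity_span["last_seen"] - activity_span["first_seen"]
).dt.days
```

An activity seen in only one inspection gets a span of zero days, so license start and expiration dates would likely be a better source if they are available.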