# Where to eat in Chigago : an analysis of the inspections from the Chicago department of Public Health's Food Protection Programm

## 1. Introduction

The Chicago department of Public Health’s Food Protection Program provides a database which contains the information from inspection reports of restaurants and other food establishments in Chicago from 2010 to the present. It contains many informations about the establishments, like their type of facility (groceries’ stores, restaurants, coffee shop, …) and their locations. Many informations about the violations listed are also provided in the database, like the description of the findings that caused them and the reason that invoked the staff from the program to led an inspection.

In our project we endeavor to visualize the healthiness of public food establishments according to their type of facility, their ward and the date of the inspection. An analysis of the violation’s types according to these three parameters will also be conducted. 

The principal questions we'll answer are : 
    - Which ward of Chicago are the most healthy and unhealthy ? 
    - Which type of facility tend to be less healthy ? 
    - Did the healthiness of the food in Chicago increase or decrease from 2010 until now ?

New problematics could be asked during the analysis and would be added to these.

The purpose of the project is to help the consumer to easily choose where to eat in Chicago and to provide them an interactive and intuitive way to browse the different places offered to them. Also, it could help the Chicago department of Public Health’s Food Protection Program to adapt their methods relying on the situation described by the findings of the analysis (for example, if a prevention program should be proposed for a specific area or type of facility).

In [564]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import requests as req
from bs4 import BeautifulSoup
import seaborn as sns

## 2. Preprocessing

### 2.1 Facilities of interest Selection

First a quick look at how is organized the dataset. 

In [565]:
df = pd.read_csv('food-inspections.csv',sep=',') #creation of the dataframe
df.head(3)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,...,Results,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Children's Services Facility,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,...,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Restaurant,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,...,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,
2,2320986,BABA'S COFFEE,BABA'S COFFEE,2423353.0,Restaurant,Risk 1 (High),5544-5546 N KEDZIE AVE,CHICAGO,IL,60625.0,...,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",,,,,


In [566]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195312 entries, 0 to 195311
Data columns (total 22 columns):
Inspection ID                 195312 non-null int64
DBA Name                      195312 non-null object
AKA Name                      192862 non-null object
License #                     195295 non-null float64
Facility Type                 190535 non-null object
Risk                          195239 non-null object
Address                       195312 non-null object
City                          195173 non-null object
State                         195270 non-null object
Zip                           195261 non-null float64
Inspection Date               195312 non-null object
Inspection Type               195311 non-null object
Results                       195312 non-null object
Violations                    143530 non-null object
Latitude                      194627 non-null float64
Longitude                     194627 non-null float64
Location                      194627 n

It is a dataset of 195'312 entries with 22 columns listed above.
First thing first, we want to put the different facility types in categories that make sense for our project.

In [567]:
#df['Facility Type'].unique()
#Run this command to see the list of the different facility types in the data.

A lot of different types of facility are found in the data.

First, we thought about only select the "private" establishments, where it is possible to eat a main course (for example, the places where you can only eat an ice cream are deleted of our list). They all are categorized in order to be compared with each other.


In [568]:
public_dic = {'restaurant' : ['Restaurant', 'DINING HALL', 'TENT RSTAURANT'], \
              'grocery_restaurant' : ['Grocery & Restaurant', 'GROCERY& RESTAURANT', 'GROCERY/RESTAURANT',\
                                    'GROCERY/ RESTAURANT', 'GROCERY STORE/ RESTAURANT', 'GROCERY & RESTAURANT',\
                                    'RESTAURANT/GROCERY', 'grocery & restaurant', 'RESTAURANT/GROCERY STORE',\
                                    'GROCERY/TAQUERIA', 'GAS STATION/RESTAURANT'],\
              'banquet' : ['LOUNGE/BANQUET HALL', 'BANQUET', 'Banquet Hall', 'BANQUET FACILITY', 'banquet hall',\
                         'banquets', 'Banquet Dining',  'Banquet/kitchen','RESTAURANT.BANQUET HALLS',\
                         'BANQUET HALL', 'Banquet', 'BOWLING LANES/BANQUETS'], \
              'rooftop_restaurant' : ['Wrigley Roof Top', 'REST/ROOFTOP'],\
              'bar_restaurant' : ['RESTAURANT/BAR', 'RESTUARANT AND BAR', 'BAR/GRILL', 'RESTAURANT/BAR/THEATER',\
                                'JUICE AND SALAD BAR', 'SUSHI COUNTER', 'TAVERN/RESTAURANT', 'tavern/restaurant',\
                                'TAVERN GRILL'], \
              'bakery_restaurant' : ['BAKERY/ RESTAURANT', 'bakery/restaurant', 'RESTAURANT/BAKERY'], \
              'liquor_restaurant' : ['RESTAURANT AND LIQUOR', 'RESTAURANT/LIQUOR'], \
              'catering' : ['CATERING/CAFE', 'Catering'], \
              'golden_diner' : ['Golden Diner']}

In [569]:
nombre = 0
for y in df['Facility Type'] :
    if y == 'Daycare (2 - 6 Years)' :
        nombre += 1
    else :
        pass
print(nombre)

# mais pourquoi pas utiliser panda ici ?? genre len(df['Facility Type'] == 'Daycare') ou quelque chose du genre...

2684


This command returns the number of occurencs of the `Facility Type` inputed. 

With trying different types previously categorized and listed in the `public_dic` dictionary we have noted that the results were too distant to conduct a meaningful analysis. That's why we then decided to also select "public" establishments like school cafeterias and hospitals. It could be interesting to compare private and public inspection results.


In [570]:
private_dic = {'daycare' : ['Daycare Above and Under 2 Years', 'Daycare (2 - 6 Years)', 'Daycare Combo 1586',\
                          'Daycare (Under 2 Years)', 'DAYCARE 2 YRS TO 12 YRS', 'Daycare Night', 'DAY CARE 2-14',\
                          'Daycare (2 Years)', 'DAYCARE', 'ADULT DAYCARE', '15 monts to 5 years old', 'youth housing',\
                          'DAYCARE 1586', 'DAYCARE COMBO', '1584-DAY CARE ABOVE 2 YEARS', 'CHURCH/DAY CARE', 'DAY CARE',\
                          'DAYCARE 6 WKS-5YRS', 'DAY CARE 1023', 'DAYCARE 2-6, UNDER 6', 'Day Care Combo (1586)'], \
               'school' : ['SCHOOL', 'School', 'PRIVATE SCHOOL', 'AFTER SCHOOL PROGRAM', 'COLLEGE',\
                         'BEFORE AND AFTER SCHOOL PROGRAM', 'Private School', 'TEACHING SCHOOL',\
                         'PUBLIC SHCOOL', 'CHARTER SCHOOL CAFETERIA', 'CAFETERIA', 'Cafeteria', 'cafeteria',\
                         'UNIVERSITY CAFETERIA', 'PREP INSIDE SCHOOL', 'CHARTER SCHOOL', 'school cafeteria',\
                         'CHARTER SCHOOL/CAFETERIA', 'School Cafeteria', 'ALTERNATIVE SCHOOL', 'CITY OF CHICAGO COLLEGE',\
                         'after school program', 'CHURCH/AFTER SCHOOL PROGRAM', 'AFTER SCHOOL CARE'], \
               'childrens_services' : ["Children's Services Facility", 'CHILDRENS SERVICES FACILITY', \
                                     "CHILDERN'S SERVICE FACILITY", "1023 CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICES FACILITY", "1023-CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICE FACILITY", "1023 CHILDERN'S SERVICE S FACILITY", \
                                     'CHILDERN ACTIVITY FACILITY', "CHILDERN'S SERVICES  FACILITY", '1023'], \
               'adultcare' : ['Long Term Care', 'REHAB CENTER', 'Hospital', 'ASSISTED LIVING', 'SENIOR DAY CARE',\
                            'Assisted Living', 'NURSING HOME', 'ASSISTED LIVING FACILITY', 'SUPPORTIVE LIVING FACILITY',\
                            'Assisted Living Senior Care', 'Adult Family Care Center', '1005 NURSING HOME', \
                            'Long-Term Care Facility', 'LONG TERM CARE FACILITY', 'ASSISSTED LIVING',\
                            'Long-Term Care','Long Term Care Facility', 'VFW HALL']}

In [571]:
total_dic = {**public_dic , **private_dic}

In [572]:
def newcolfromdict(dataf, dic, inputcolumn, outputcolumn) :
    new_list = []
    for content in dataf[inputcolumn] :
        new_content = 'Not listed'

        for k, v in dic.items() :
            if type(v) == list :
                for element in v :
                    if content == element :
                        new_content = k
            else :
                if content == v :
                    new_content = k

        new_list.append(new_content)
    dataf[outputcolumn] = new_list
    return dataf

The **newcolfromdict** function constructs a new column `outputcolum` from the inputed column `inputcolum` comparing its values to the `dic` dictionary and adds it to the `dataf` dataframe.

### 2.2 New dataframe Creation

To construct the new dataframe, the `Facility Type` column is dropped because it has been replaced by the column `Facility group` and the *Not Listed* establishments are not selected.

The duplicates are dropped.

In [573]:
eat_seat = newcolfromdict(df, total_dic, 'Facility Type', 'Facility Group')
eat_seat = df.loc[df['Facility Group'] != 'Not Listed']
eat_seat = eat_seat.drop(columns = ['Facility Type'])

eat_seat = eat_seat.drop_duplicates()

eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Risk,Address,City,State,Zip,Inspection Date,...,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,2019-11-01T00:00:00.000,...,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,,childrens_services
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,2019-11-01T00:00:00.000,...,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,,restaurant


Since we only care about establishments in Chicago Illinois, we will only keep the data for this city and drop the `City` and `State` columns.

We first check the different Citys' names to avoid deleting rows due to missprints.

In [574]:
df['City'].unique()

array(['CHICAGO', nan, 'chicago', 'Chicago', 'GRIFFITH', 'NEW YORK',
       'SCHAUMBURG', 'ELMHURST', 'ALGONQUIN', 'NEW HOLSTEIN', 'CCHICAGO',
       'NILES NILES', 'EVANSTON', 'CHICAGO.', 'CHESTNUT STREET',
       'LANSING', 'CHICAGOCHICAGO', 'WADSWORTH', 'WILMETTE', 'WHEATON',
       'CHICAGOHICAGO', 'ROSEMONT', 'CHicago', 'CALUMET CITY',
       'PLAINFIELD', 'HIGHLAND PARK', 'PALOS PARK', 'ELK GROVE VILLAGE',
       'CICERO', 'BRIDGEVIEW', 'OAK PARK', 'MAYWOOD', 'LAKE BLUFF',
       '312CHICAGO', 'SCHILLER PARK', 'SKOKIE', 'BEDFORD PARK',
       'BANNOCKBURNDEERFIELD', 'CHCICAGO', 'BLOOMINGDALE', 'Norridge',
       'CHARLES A HAYES', 'CHCHICAGO', 'CHICAGOI', 'SUMMIT',
       'OOLYMPIA FIELDS', 'WESTMONT', 'CHICAGO HEIGHTS', 'JUSTICE',
       'TINLEY PARK', 'LOMBARD', 'EAST HAZEL CREST', 'COUNTRY CLUB HILLS',
       'STREAMWOOD', 'BOLINGBROOK', 'INACTIVE', 'BERWYN', 'BURNHAM',
       'DES PLAINES', 'LAKE ZURICH', 'OLYMPIA FIELDS', 'alsip',
       'OAK LAWN', 'BLUE ISLAND', 'GLENCOE',

In [575]:
cities = ['CHICAGO','chicago','Chicago','CCHICAGO','CHICAGO.','CHICAGOCHICAGO','CHICAGOHICAGO',\
          'CHicago','312CHICAGO','CHCICAGO','CHCHICAGO','CHICAGOI','CHICAGO HEIGHTS']

eat_seat = eat_seat.loc[eat_seat['City'].isin(cities)]

eat_seat = eat_seat.drop(columns = ['City','State'])

# à la limite pour être stylés on pourrait tenter une expression régulière pour filtrer les différentes orthographes 
#de chigago au lieu de sélectionner à la main

Then we want to check the missing values.

In [576]:
eat_seat.isnull().sum()

Inspection ID                      0
DBA Name                           0
AKA Name                        2416
License #                         17
Risk                              70
Address                            0
Zip                                3
Inspection Date                    0
Inspection Type                    1
Results                            0
Violations                     51550
Latitude                         515
Longitude                        515
Location                         515
Historical Wards 2003-2015    194729
Zip Codes                     194729
Community Areas               194729
Census Tracts                 194729
Wards                         194729
Facility Group                     0
dtype: int64

In [577]:
print(len(eat_seat.index))    ##returns the number of rows of the df

194729


As we can see, the columns `Historical Wards 2003-2015`, `Zip Codes`, `Community Areas`, `Census Tracts` and `Wards` are empty and will be dropped.

We will only be using the `DBA Name` (the name under which the establishment is doing business ; DBA = doing business as), so we drop the `AKA Name`.

In [578]:
eat_seat = eat_seat.drop(columns = ['AKA Name','Historical Wards 2003-2015', 'Zip Codes', 'Community Areas',\
                                    'Census Tracts', 'Wards'])
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Risk 1 (High),7559 W ADDISON ST,60634.0,2019-11-01T00:00:00.000,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,Risk 1 (High),303 E SUPERIOR ST,60611.0,2019-11-01T00:00:00.000,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant


There is also adjustments to make in some columns, because the formats aren't usable :

* In `Inspection Date`, only the day will be kept, not the time of day that is actually not given
* In `Risk`, only the number will remain

For the column `Risk`, we first want to check what types of risk are listed.

In [579]:
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x:x.split('T')[0])

In [580]:
eat_seat.Risk.unique()
#print all types of entries in the column Risk

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', nan, 'All'],
      dtype=object)

We will replace **All** and **High Risk** by **3**, **Medium Risk** by **2** and **Low Risk** by **1**.
--> pourquoi remplacer All by High Risk ? A vérifier si c'était vraiment pensé comme ça quand la bdd a été conçue

In [581]:
eat_seat['Risk'] = eat_seat['Risk'].replace({'All':1, 'Risk 1 (High)':3, 'Risk 2 (Medium)':2, 'Risk 3 (Low)':1})
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634.0,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611.0,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant


In [582]:
eat_seat = eat_seat.rename(columns={"License #": "License"}) #rename the column 'License #' into 'License'

In [583]:
len(eat_seat.License.unique())

37145

In [584]:
eat_seat.head()

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634.0,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611.0,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant
2,2320986,BABA'S COFFEE,2423353.0,3.0,5544-5546 N KEDZIE AVE,60625.0,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant
3,2320910,J.T.'S GENUINE SANDWICH,2689893.0,3.0,3970 N ELSTON AVE,60618.0,2019-11-01,License,Pass,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant
4,2320904,"KID'Z COLONY DAYCARE, INC.",2215609.0,3.0,6287 S ARCHER AVE,60638.0,2019-11-01,Canvass,Fail,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare


In [585]:
eat_seat.Results.unique()

array(['Pass w/ Conditions', 'Pass', 'No Entry', 'Fail',
       'Out of Business', 'Not Ready', 'Business Not Located'],
      dtype=object)

As done for Risk, we should replace Pass w/ Conditions by 1, Pass by 0 and Fail by 2 (to compute the score of the facilities). But what should we do with Out of Business and Not Ready ?

### Zip Codes cleaning

In [586]:
def clean_zip(zipcode):
    try :
        return int(float(zipcode))
    except :
        return 0

In [587]:
eat_seat.Zip = eat_seat.Zip.apply(clean_zip)    ##guarantees the zipcodes to be integers

We found a file associating the chicago zipe codes and their associated community area. Using it we can create a new `Community Area` column.

In [588]:
zip_to_area = pd.read_csv('ZipCode_to_ComArea.csv',sep=',') ##creation of the dataframe
zip_to_area = zip_to_area.drop(columns = ['TOT2010'])

zip_to_area.ZipCode = zip_to_area.ZipCode.apply(clean_zip) ##guarantees the zipcodes to be integers

zip_to_area = zip_to_area.groupby('ComArea')['ZipCode'].apply(list)    ##groups the zipcodes by community area number
zip_to_area = zip_to_area.reset_index()

In [589]:
zip_dic = zip_to_area.set_index('ComArea')['ZipCode'].to_dict()

In [590]:
eat_seat = newcolfromdict(eat_seat, zip_dic, 'Zip', 'Community Area')

In [591]:
eat_seat.head(3)

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8
2,2320986,BABA'S COFFEE,2423353.0,3.0,5544-5546 N KEDZIE AVE,60625,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant,14


In [592]:
pb = 0
for zipcode in eat_seat.Zip :
    if zipcode == 0 : 
        pb += 1
print(pb)

# same remarque qu'avant, tenter avec len(eat_seat.Zip == 0) ou qqch du genre

3


The **clean_zip()** function returns **0** if the zip code is not convertible into an integer. Here it means that there are 51 missing zip codes in the `Zip` column, so we need to complete the data using the other informations available.
We found a dataset online which contains all the US zip code and the longitude and latitude associated. Only the Chicago zip codes have been kept.


In [593]:
ziplatlong = pd.read_csv('ZipLatLong.csv',sep=',')
ziplatlong.head()

Unnamed: 0,ZIP,LAT,LNG
0,60007,42.0086,-87.99734
1,60008,42.069786,-88.016221
2,60010,42.146494,-88.164651
3,60012,42.272492,-88.314084
4,60013,42.223439,-88.235506


### Second Dataset - 1. BUSINESS LICENSES/OWNERS

We found two datasets on Kaggle gathering the business licenses and the business owners of Chicago. It could be interesting to observe the results of different establishments owned by the same person.

In [594]:
licedf = pd.read_csv('business-licenses.csv', sep=',') #creation of the dataframe
licedf = licedf.rename(str.lower, axis='columns')
licedf = licedf.drop(columns = ['city', 'state', 'id', 'precinct', 'ward precinct', 'business activity id',\
                                'license number', 'application type', 'application created date',\
                                'application requirements complete', 'payment date', 'conditional approval',\
                                'license term start date', 'license term expiration date', 'license approved for issuance',\
                                'date issued', 'license status', 'license status change date', 'ssa',\
                                'historical wards 2003-2015', 'zip codes', 'wards', 'census tracts', 'location',\
                                'license code', 'license description', 'business activity', 'site number',\
                               'zip code', 'latitude', 'longitude', 'address', 'legal name', 'doing business as name',\
                                'community areas', 'ward'])
licedf.head(3)

FileNotFoundError: [Errno 2] File b'business-licenses.csv' does not exist: b'business-licenses.csv'

In [None]:
print(len(licedf.index))

In [None]:
owndf = pd.read_csv('business-owners.csv',sep=',') #creation of the dataframe
owndf = owndf.rename(str.lower, axis='columns')
owndf = owndf.drop(columns = ['suffix', 'legal entity owner', 'owner middle initial', 'legal name', 'title'])
owndf.head(3)

In [None]:
print(len(owndf.index))

In [None]:
liceown = pd.merge(owndf,  licedf, on = 'account number')
liceown = liceown.rename(columns={"license id": "License"})
liceown.head(3)

In [None]:
liceown = liceown.set_index('License')
liceown.head(3)

eat_seat = eat_seat.set_index('License')
eat_seat.head(3)

In [None]:
print(len(liceown.index))

In [None]:
scd_eat_seat = pd.merge(eat_seat, liceown, right_index = True, left_index = True)

In [None]:
scd_eat_seat.head()

In [None]:
scd_eat_seat.to_csv('newfood.csv')

In [None]:
est_cent = pd.read_csv("newfood.csv", index_col=['License','Inspection ID'])
est_cent.head(20)

### Managing the changes in the Food Code Rules

The Food Code Rules has changed since the 1st July 2018. After investigating those changes, it seems that only the denomination of the violations but not the violation itself has changed, and a few additionnal violations has been added in the possible violations. It means that those changes does not need more processing and can just be considered together as a common list of violations.

In [595]:
len(eat_seat.Violations.unique())

142356

It seems that every violation is a unique entry because it contains not only the violation type but also the comments of the inspectors. We have to split the Violations column into 3 different columns :
- Violation number
- Violation type
- Violation comments

It seems that every violation cell is architectured this way :
"number of the violation". "TYPE OF THE VIOLATION" - Comments : "comments of the inspector" (this format repeated as many times as the number of violations detected the day of the inspection, separated with a vertical line)

We just want to keep the violation number (because we can check which violation it is online). We create a column NumberViolations containing the ID of the violations found during the corresponding investigation.

In [596]:
eat_seat.isnull().sum()

Inspection ID          0
DBA Name               0
License               17
Risk                  70
Address                0
Zip                    0
Inspection Date        0
Inspection Type        1
Results                0
Violations         51550
Latitude             515
Longitude            515
Location             515
Facility Group         0
Community Area         0
dtype: int64

We see that there are more than 50'000 rows where the Violations column is empty. There is no point in keeping the entries with no violations because our research is based on the study of those violations so those entries are dropped. Then the identifying numbers of the violations are searched and added to the new column "NumberViolations". The rest is not kept because the titles of the violations can be found online and we do not plan on using the comments of the investigators.

In [597]:
eat_seat = eat_seat.dropna(subset=['Violations'], axis = 0, how = 'all')

In [598]:
violations = eat_seat.apply(lambda row: re.findall('\|\s([0-9]+)[.]', str(row['Violations'])), axis = 1)

In [599]:
first_violations = eat_seat.apply(lambda row: row['Violations'].split('.')[0], axis = 1)

In [600]:
for violation, first_violation in zip(violations, first_violations):
    violation.append(first_violation)

flat_list = [item for sublist in violations for item in sublist]
unique, counts = np.unique(flat_list, return_counts=True)

In [601]:
eat_seat['NumberViolations'] = violations

In [602]:
eat_seat = eat_seat.drop(columns = ['Location', 'Violations'])
eat_seat.head()

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Latitude,Longitude,Facility Group,Community Area,NumberViolations
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,41.945065,-87.816734,childrens_services,76,"[10, 56, 5]"
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,41.895692,-87.620143,restaurant,8,"[55, 39]"
3,2320910,J.T.'S GENUINE SANDWICH,2689893.0,3.0,3970 N ELSTON AVE,60618,2019-11-01,License,Pass,41.953378,-87.718848,restaurant,22,"[53, 58, 51]"
4,2320904,"KID'Z COLONY DAYCARE, INC.",2215609.0,3.0,6287 S ARCHER AVE,60638,2019-11-01,Canvass,Fail,41.793235,-87.777776,daycare,64,[16]
5,2320969,JUST A PIZZA PLUS INC,75583.0,3.0,5136 S ARCHER AVE,60632,2019-11-01,Complaint,Fail,41.800619,-87.731143,restaurant,63,"[38, 55, 55, 58, 60, 38]"


In [603]:
eat_seat = eat_seat.set_index(['Inspection ID']) #redifines the Index
eat_seat.head()

Unnamed: 0_level_0,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Latitude,Longitude,Facility Group,Community Area,NumberViolations
Inspection ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,41.945065,-87.816734,childrens_services,76,"[10, 56, 5]"
2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,41.895692,-87.620143,restaurant,8,"[55, 39]"
2320910,J.T.'S GENUINE SANDWICH,2689893.0,3.0,3970 N ELSTON AVE,60618,2019-11-01,License,Pass,41.953378,-87.718848,restaurant,22,"[53, 58, 51]"
2320904,"KID'Z COLONY DAYCARE, INC.",2215609.0,3.0,6287 S ARCHER AVE,60638,2019-11-01,Canvass,Fail,41.793235,-87.777776,daycare,64,[16]
2320969,JUST A PIZZA PLUS INC,75583.0,3.0,5136 S ARCHER AVE,60632,2019-11-01,Complaint,Fail,41.800619,-87.731143,restaurant,63,"[38, 55, 55, 58, 60, 38]"


### creating a dataframe where the Index is the DBA Name so that we can compute a score

In [610]:
#keeping just one occurence of DBA Name
df_restaurants = eat_seat.drop_duplicates(subset=['License'])
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Latitude,Longitude,Facility Group,Community Area,NumberViolations
Inspection ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,41.945065,-87.816734,childrens_services,76,"[10, 56, 5]"
2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,41.895692,-87.620143,restaurant,8,"[55, 39]"
2320910,J.T.'S GENUINE SANDWICH,2689893.0,3.0,3970 N ELSTON AVE,60618,2019-11-01,License,Pass,41.953378,-87.718848,restaurant,22,"[53, 58, 51]"
2320904,"KID'Z COLONY DAYCARE, INC.",2215609.0,3.0,6287 S ARCHER AVE,60638,2019-11-01,Canvass,Fail,41.793235,-87.777776,daycare,64,[16]
2320969,JUST A PIZZA PLUS INC,75583.0,3.0,5136 S ARCHER AVE,60632,2019-11-01,Complaint,Fail,41.800619,-87.731143,restaurant,63,"[38, 55, 55, 58, 60, 38]"


In [611]:
df_restaurants = df_restaurants.set_index(['License']) #redifines the Index

In [612]:
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Latitude,Longitude,Facility Group,Community Area,NumberViolations
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2589822.0,JUMPSTART EARLY LEARNING ACADEMY,3.0,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,41.945065,-87.816734,childrens_services,76,"[10, 56, 5]"
2698445.0,BEEFSTEAK,3.0,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,41.895692,-87.620143,restaurant,8,"[55, 39]"
2689893.0,J.T.'S GENUINE SANDWICH,3.0,3970 N ELSTON AVE,60618,2019-11-01,License,Pass,41.953378,-87.718848,restaurant,22,"[53, 58, 51]"
2215609.0,"KID'Z COLONY DAYCARE, INC.",3.0,6287 S ARCHER AVE,60638,2019-11-01,Canvass,Fail,41.793235,-87.777776,daycare,64,[16]
75583.0,JUST A PIZZA PLUS INC,3.0,5136 S ARCHER AVE,60632,2019-11-01,Complaint,Fail,41.800619,-87.731143,restaurant,63,"[38, 55, 55, 58, 60, 38]"


In [613]:
df_restaurants = df_restaurants.drop(columns = ['Inspection Date', 'Inspection Type', 'Results', 'NumberViolations'])
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Latitude,Longitude,Facility Group,Community Area
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2589822.0,JUMPSTART EARLY LEARNING ACADEMY,3.0,7559 W ADDISON ST,60634,41.945065,-87.816734,childrens_services,76
2698445.0,BEEFSTEAK,3.0,303 E SUPERIOR ST,60611,41.895692,-87.620143,restaurant,8
2689893.0,J.T.'S GENUINE SANDWICH,3.0,3970 N ELSTON AVE,60618,41.953378,-87.718848,restaurant,22
2215609.0,"KID'Z COLONY DAYCARE, INC.",3.0,6287 S ARCHER AVE,60638,41.793235,-87.777776,daycare,64
75583.0,JUST A PIZZA PLUS INC,3.0,5136 S ARCHER AVE,60632,41.800619,-87.731143,restaurant,63


Now we have a dataframe with every facility listed in our preprocessed database. We want to create two columns, one containing how many times the facility has been inspected, and one containing every violation of this facility. 

In [614]:
#jârrive pas à le faire mieux avec panda donc ça prend mille ans...

dic = {}

for license in df_restaurants.index :
    score = len(eat_seat[eat_seat['License'] == license])
    dic[license] = score

In [617]:
df_restaurants['NumberOfInspections'] = dic.values()

In [618]:
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Latitude,Longitude,Facility Group,Community Area,NumberOfInspections
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2589822.0,JUMPSTART EARLY LEARNING ACADEMY,3.0,7559 W ADDISON ST,60634,41.945065,-87.816734,childrens_services,76,3
2698445.0,BEEFSTEAK,3.0,303 E SUPERIOR ST,60611,41.895692,-87.620143,restaurant,8,1
2689893.0,J.T.'S GENUINE SANDWICH,3.0,3970 N ELSTON AVE,60618,41.953378,-87.718848,restaurant,22,1
2215609.0,"KID'Z COLONY DAYCARE, INC.",3.0,6287 S ARCHER AVE,60638,41.793235,-87.777776,daycare,64,4
75583.0,JUST A PIZZA PLUS INC,3.0,5136 S ARCHER AVE,60632,41.800619,-87.731143,restaurant,63,13


In [624]:
dic = {}

for license in df_restaurants.index :
    liste = []
    inspections = eat_seat['NumberViolations'][eat_seat['License'] == license]
    for inspection in inspections :
        for violation in inspection :
            liste.append(violation)
    dic[license] = liste

In [626]:
df_restaurants['TotalViolations'] = dic.values()

In [627]:
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Latitude,Longitude,Facility Group,Community Area,NumberOfInspections,TotalViolations
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2589822.0,JUMPSTART EARLY LEARNING ACADEMY,3.0,7559 W ADDISON ST,60634,41.945065,-87.816734,childrens_services,76,3,"[10, 56, 5, 5, 36, 51, 55, 57, 3, 2, 3, 5, 10,..."
2698445.0,BEEFSTEAK,3.0,303 E SUPERIOR ST,60611,41.895692,-87.620143,restaurant,8,1,"[55, 39]"
2689893.0,J.T.'S GENUINE SANDWICH,3.0,3970 N ELSTON AVE,60618,41.953378,-87.718848,restaurant,22,1,"[53, 58, 51]"
2215609.0,"KID'Z COLONY DAYCARE, INC.",3.0,6287 S ARCHER AVE,60638,41.793235,-87.777776,daycare,64,4,"[16, 8, 35, 35, 35]"
75583.0,JUST A PIZZA PLUS INC,3.0,5136 S ARCHER AVE,60632,41.800619,-87.731143,restaurant,63,13,"[38, 55, 55, 58, 60, 38, 5, 10, 23, 36, 51, 55..."


In [628]:
df_restaurants = df_restaurants.drop(columns = ['DBA Name', 'Risk', 'Address', 'Zip', 'Latitude', 'Longitude', 'Facility Group', 'Community Area'])
df_restaurants.head()

Unnamed: 0_level_0,NumberOfInspections,TotalViolations
License,Unnamed: 1_level_1,Unnamed: 2_level_1
2589822.0,3,"[10, 56, 5, 5, 36, 51, 55, 57, 3, 2, 3, 5, 10,..."
2698445.0,1,"[55, 39]"
2689893.0,1,"[53, 58, 51]"
2215609.0,4,"[16, 8, 35, 35, 35]"
75583.0,13,"[38, 55, 55, 58, 60, 38, 5, 10, 23, 36, 51, 55..."


In [632]:
eat_seat = pd.merge(eat_seat, df_restaurants, how='left', on='License', left_index=True, right_index=False)

In [633]:
eat_seat.head()

Unnamed: 0,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Latitude,Longitude,Facility Group,Community Area,NumberViolations,NumberOfInspections,TotalViolations
0,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,41.945065,-87.816734,childrens_services,76,"[10, 56, 5]",3,"[10, 56, 5, 5, 36, 51, 55, 57, 3, 2, 3, 5, 10,..."
1,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,41.895692,-87.620143,restaurant,8,"[55, 39]",1,"[55, 39]"
2,J.T.'S GENUINE SANDWICH,2689893.0,3.0,3970 N ELSTON AVE,60618,2019-11-01,License,Pass,41.953378,-87.718848,restaurant,22,"[53, 58, 51]",1,"[53, 58, 51]"
3,"KID'Z COLONY DAYCARE, INC.",2215609.0,3.0,6287 S ARCHER AVE,60638,2019-11-01,Canvass,Fail,41.793235,-87.777776,daycare,64,[16],4,"[16, 8, 35, 35, 35]"
4,JUST A PIZZA PLUS INC,75583.0,3.0,5136 S ARCHER AVE,60632,2019-11-01,Complaint,Fail,41.800619,-87.731143,restaurant,63,"[38, 55, 55, 58, 60, 38]",13,"[38, 55, 55, 58, 60, 38, 5, 10, 23, 36, 51, 55..."


This will be used to create a score of healthiness for each establishment.