# Where to eat in Chigago : an analysis of the inspections from the Chicago department of Public Health's Food Protection Programm

## 1. Introduction

The Chicago department of Public Health’s Food Protection Program provides a database which contains the information from inspection reports of restaurants and other food establishments in Chicago from 2010 to the present. It contains many informations about the establishments, like their type of facility (groceries’ stores, restaurants, coffee shop, …) and their locations. Many informations about the violations listed are also provided in the database, like the description of the findings that caused them and the reason that invoked the staff from the program to led an inspection.

In our project we endeavor to visualize the healthiness of public food establishments according to their type of facility, their ward and the date of the inspection. An analysis of the violation’s types according to these three parameters will also be conducted. 

The principal questions we'll answer are : 
    - Which ward of Chicago are the most healthy and unhealthy ? 
    - Which type of facility tend to be less healthy ? 
    - Did the healthiness of the food in Chicago increase or decrease from 2010 until now ?

New problematics could be asked during the analysis and would be added to these.

The purpose of the project is to help the consumer to easily choose where to eat in Chicago and to provide them an interactive and intuitive way to browse the different places offered to them. Also, it could help the Chicago department of Public Health’s Food Protection Program to adapt their methods relying on the situation described by the findings of the analysis (for example, if a prevention program should be proposed for a specific area or type of facility).

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import requests as req
from bs4 import BeautifulSoup
import seaborn as sns

## 2. Preprocessing

### 2.1 Selecting the facilities of interest

First a quick look at how is organized the dataset. 

In [2]:
df = pd.read_csv('food-inspections.csv',sep=',') #creation of the dataframe
df.head(3)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,...,Results,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Children's Services Facility,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,...,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Restaurant,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,...,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,
2,2320986,BABA'S COFFEE,BABA'S COFFEE,2423353.0,Restaurant,Risk 1 (High),5544-5546 N KEDZIE AVE,CHICAGO,IL,60625.0,...,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195312 entries, 0 to 195311
Data columns (total 22 columns):
Inspection ID                 195312 non-null int64
DBA Name                      195312 non-null object
AKA Name                      192862 non-null object
License #                     195295 non-null float64
Facility Type                 190535 non-null object
Risk                          195239 non-null object
Address                       195312 non-null object
City                          195173 non-null object
State                         195270 non-null object
Zip                           195261 non-null float64
Inspection Date               195312 non-null object
Inspection Type               195311 non-null object
Results                       195312 non-null object
Violations                    143530 non-null object
Latitude                      194627 non-null float64
Longitude                     194627 non-null float64
Location                      194627 n

It is a dataset of 195'312 entries with 22 columns listed above.

A lot of different types of facility are found in the data.

First, we thought about only select the "private" establishments, where it is possible to eat a main course (for example, the places where you can only eat an ice cream are deleted of our list). They all are categorized in order to be compared with each other.


In [4]:
public_dic = {'restaurant' : ['Restaurant', 'DINING HALL', 'TENT RSTAURANT'], \
              'grocery_restaurant' : ['Grocery & Restaurant', 'GROCERY& RESTAURANT', 'GROCERY/RESTAURANT',\
                                    'GROCERY/ RESTAURANT', 'GROCERY STORE/ RESTAURANT', 'GROCERY & RESTAURANT',\
                                    'RESTAURANT/GROCERY', 'grocery & restaurant', 'RESTAURANT/GROCERY STORE',\
                                    'GROCERY/TAQUERIA', 'GAS STATION/RESTAURANT'],\
              'banquet' : ['LOUNGE/BANQUET HALL', 'BANQUET', 'Banquet Hall', 'BANQUET FACILITY', 'banquet hall',\
                         'banquets', 'Banquet Dining',  'Banquet/kitchen','RESTAURANT.BANQUET HALLS',\
                         'BANQUET HALL', 'Banquet', 'BOWLING LANES/BANQUETS'], \
              'rooftop_restaurant' : ['Wrigley Roof Top', 'REST/ROOFTOP'],\
              'bar_restaurant' : ['RESTAURANT/BAR', 'RESTUARANT AND BAR', 'BAR/GRILL', 'RESTAURANT/BAR/THEATER',\
                                'JUICE AND SALAD BAR', 'SUSHI COUNTER', 'TAVERN/RESTAURANT', 'tavern/restaurant',\
                                'TAVERN GRILL'], \
              'bakery_restaurant' : ['BAKERY/ RESTAURANT', 'bakery/restaurant', 'RESTAURANT/BAKERY'], \
              'liquor_restaurant' : ['RESTAURANT AND LIQUOR', 'RESTAURANT/LIQUOR'], \
              'catering' : ['CATERING/CAFE', 'Catering'], \
              'golden_diner' : ['Golden Diner']}

In [5]:
nombre = 0
for y in df['Facility Type'] :
    if y == 'Daycare (2 - 6 Years)' :
        nombre += 1
    else :
        pass
print(nombre)

2684


This command returns the number of occurencs of the `Facility Type` inputed. 

With trying different types previously categorized and listed in the `public_dic` dictionary we have noted that the results were too distant to conduct a meaningful analysis. That's why we then decided to also select "public" establishments like school cafeterias and hospitals. It could be interesting to compare private and public inspection results.


In [6]:
private_dic = {'daycare' : ['Daycare Above and Under 2 Years', 'Daycare (2 - 6 Years)', 'Daycare Combo 1586',\
                          'Daycare (Under 2 Years)', 'DAYCARE 2 YRS TO 12 YRS', 'Daycare Night', 'DAY CARE 2-14',\
                          'Daycare (2 Years)', 'DAYCARE', 'ADULT DAYCARE', '15 monts to 5 years old', 'youth housing',\
                          'DAYCARE 1586', 'DAYCARE COMBO', '1584-DAY CARE ABOVE 2 YEARS', 'CHURCH/DAY CARE', 'DAY CARE',\
                          'DAYCARE 6 WKS-5YRS', 'DAY CARE 1023', 'DAYCARE 2-6, UNDER 6', 'Day Care Combo (1586)'], \
               'school' : ['SCHOOL', 'School', 'PRIVATE SCHOOL', 'AFTER SCHOOL PROGRAM', 'COLLEGE',\
                         'BEFORE AND AFTER SCHOOL PROGRAM', 'Private School', 'TEACHING SCHOOL',\
                         'PUBLIC SHCOOL', 'CHARTER SCHOOL CAFETERIA', 'CAFETERIA', 'Cafeteria', 'cafeteria',\
                         'UNIVERSITY CAFETERIA', 'PREP INSIDE SCHOOL', 'CHARTER SCHOOL', 'school cafeteria',\
                         'CHARTER SCHOOL/CAFETERIA', 'School Cafeteria', 'ALTERNATIVE SCHOOL', 'CITY OF CHICAGO COLLEGE',\
                         'after school program', 'CHURCH/AFTER SCHOOL PROGRAM', 'AFTER SCHOOL CARE'], \
               'childrens_services' : ["Children's Services Facility", 'CHILDRENS SERVICES FACILITY', \
                                     "CHILDERN'S SERVICE FACILITY", "1023 CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICES FACILITY", "1023-CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICE FACILITY", "1023 CHILDERN'S SERVICE S FACILITY", \
                                     'CHILDERN ACTIVITY FACILITY', "CHILDERN'S SERVICES  FACILITY", '1023'], \
               'adultcare' : ['Long Term Care', 'REHAB CENTER', 'Hospital', 'ASSISTED LIVING', 'SENIOR DAY CARE',\
                            'Assisted Living', 'NURSING HOME', 'ASSISTED LIVING FACILITY', 'SUPPORTIVE LIVING FACILITY',\
                            'Assisted Living Senior Care', 'Adult Family Care Center', '1005 NURSING HOME', \
                            'Long-Term Care Facility', 'LONG TERM CARE FACILITY', 'ASSISSTED LIVING',\
                            'Long-Term Care','Long Term Care Facility', 'VFW HALL']}

In [7]:
total_dic = {**public_dic , **private_dic}

In [8]:
def newcolfromdict(dataf, dic, inputcolumn, outputcolumn) :
    new_list = []
    for content in dataf[inputcolumn] :
        new_content = 'Not listed'

        for k, v in dic.items() :
            if type(v) == list :
                for element in v :
                    if content == element :
                        new_content = k
            else :
                if content == v :
                    new_content = k

        new_list.append(new_content)
    dataf[outputcolumn] = new_list
    return dataf

The **newcolfromdict** function constructs a new column `outputcolum` from the inputed column `inputcolum` comparing its values to the `dic` dictionary and adds it to the `dataf` dataframe.

### 2.2 Creation of a new dataframe

To construct the new dataframe, the `Facility Type` column is dropped because it has been replaced by the column `Facility group` and the *Not Listed* establishments are not selected.

The duplicates are dropped.

In [9]:
eat_seat = newcolfromdict(df, total_dic, 'Facility Type', 'Facility Group')
eat_seat = df.loc[df['Facility Group'] != 'Not Listed']
eat_seat = eat_seat.drop(columns = ['Facility Type'])

eat_seat = eat_seat.drop_duplicates()

eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Risk,Address,City,State,Zip,Inspection Date,...,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,2019-11-01T00:00:00.000,...,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,,childrens_services
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,2019-11-01T00:00:00.000,...,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,,restaurant


Since we only care about establishments in Chicago Illinois, we will only keep the data for this city and drop the `City` and `State` columns.

We first check the different Citys' names to avoid deleting rows due to missprints.

In [10]:
city = []
for x in eat_seat['City'] :
    if x in city :
        pass
    else :
        city.append(x)
print(city)

['CHICAGO', nan, 'chicago', 'Chicago', 'GRIFFITH', 'NEW YORK', 'SCHAUMBURG', 'ELMHURST', 'ALGONQUIN', 'NEW HOLSTEIN', 'CCHICAGO', 'NILES NILES', 'EVANSTON', 'CHICAGO.', 'CHESTNUT STREET', 'LANSING', 'CHICAGOCHICAGO', 'WADSWORTH', 'WILMETTE', 'WHEATON', 'CHICAGOHICAGO', 'ROSEMONT', 'CHicago', 'CALUMET CITY', 'PLAINFIELD', 'HIGHLAND PARK', 'PALOS PARK', 'ELK GROVE VILLAGE', 'CICERO', 'BRIDGEVIEW', 'OAK PARK', 'MAYWOOD', 'LAKE BLUFF', '312CHICAGO', 'SCHILLER PARK', 'SKOKIE', 'BEDFORD PARK', 'BANNOCKBURNDEERFIELD', 'CHCICAGO', 'BLOOMINGDALE', 'Norridge', 'CHARLES A HAYES', 'CHCHICAGO', 'CHICAGOI', 'SUMMIT', 'OOLYMPIA FIELDS', 'WESTMONT', 'CHICAGO HEIGHTS', 'JUSTICE', 'TINLEY PARK', 'LOMBARD', 'EAST HAZEL CREST', 'COUNTRY CLUB HILLS', 'STREAMWOOD', 'BOLINGBROOK', 'INACTIVE', 'BERWYN', 'BURNHAM', 'DES PLAINES', 'LAKE ZURICH', 'OLYMPIA FIELDS', 'alsip', 'OAK LAWN', 'BLUE ISLAND', 'GLENCOE', 'FRANKFORT', 'NAPERVILLE', 'BROADVIEW', 'WORTH', 'Maywood', 'ALSIP', 'EVERGREEN PARK']


In [11]:
cities = ['CHICAGO','chicago','Chicago','CCHICAGO','CHICAGO.','CHICAGOCHICAGO','CHICAGOHICAGO',\
          'CHicago','312CHICAGO','CHCICAGO','CHCHICAGO','CHICAGOI','CHICAGO HEIGHTS']

eat_seat = eat_seat.loc[eat_seat['City'].isin(cities)]

eat_seat = eat_seat.drop(columns = ['City','State'])

Then we want to check the missing values.

In [12]:
eat_seat.isnull().sum()

Inspection ID                      0
DBA Name                           0
AKA Name                        2416
License #                         17
Risk                              70
Address                            0
Zip                                3
Inspection Date                    0
Inspection Type                    1
Results                            0
Violations                     51550
Latitude                         515
Longitude                        515
Location                         515
Historical Wards 2003-2015    194729
Zip Codes                     194729
Community Areas               194729
Census Tracts                 194729
Wards                         194729
Facility Group                     0
dtype: int64

In [13]:
print(len(eat_seat.index))    ##returns the number of rows of the df

194729


As we can see, the columns `Historical Wards 2003-2015`, `Zip Codes`, `Community Areas`, `Census Tracts` and `Wards` are empty and will be dropped.

We will only be using the `DBA Name` (the name under which the establishment is doing business ; DBA = doing business as), so we drop the `AKA Name`.

In [14]:
eat_seat = eat_seat.drop(columns = ['AKA Name','Historical Wards 2003-2015', 'Zip Codes', 'Community Areas',\
                                    'Census Tracts', 'Wards'])
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Risk 1 (High),7559 W ADDISON ST,60634.0,2019-11-01T00:00:00.000,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,Risk 1 (High),303 E SUPERIOR ST,60611.0,2019-11-01T00:00:00.000,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant


There is also adjustments to make in some columns, because the formats aren't usable :

* In `Inspection Date`, only the day will be kept, not the time of day that is actually not given
* In `Risk`, only the number will remain

For the column `Risk`, we first want to check what types of risk are listed.

In [15]:
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x:x.split('T')[0])

In [16]:
type_of_risk = []
for x in eat_seat['Risk'] :
    if x in type_of_risk :
        pass
    else :
        type_of_risk.append(x)
print(type_of_risk)

['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', nan, 'All']


We will replace **All** and **High Risk** by **1**, **Medium Risk** by **2** and **Low Risk** by **3**.

In [17]:
eat_seat['Risk'] = eat_seat['Risk'].replace({'All':1, 'Risk 1 (High)':1, 'Risk 2 (Medium)':2, 'Risk 3 (Low)':3})
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,1.0,7559 W ADDISON ST,60634.0,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,1.0,303 E SUPERIOR ST,60611.0,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant


In [18]:
eat_seat = eat_seat.rename(columns={"License #": "License"}) #rename the column 'License #' into 'License'

In [19]:
len(eat_seat.License.unique())

37145

### Zip Codes cleaning

In [20]:
def clean_zip(zipcode):
    try :
        return int(float(zipcode))
    except :
        return 0

In [21]:
eat_seat.Zip = eat_seat.Zip.apply(clean_zip)    ##guarantees the zipcodes to be integers

We found a file associating the chicago zipe codes and their associated community area. Using it we can create a new `Community Area` column.

In [22]:
zip_to_area = pd.read_csv('ZipCode_to_ComArea.csv',sep=',') ##creation of the dataframe
zip_to_area = zip_to_area.drop(columns = ['TOT2010'])

zip_to_area.ZipCode = zip_to_area.ZipCode.apply(clean_zip) ##guarantees the zipcodes to be integers

zip_to_area = zip_to_area.groupby('ComArea')['ZipCode'].apply(list)    ##groups the zipcodes by community area number
zip_to_area = zip_to_area.reset_index()

In [23]:
zip_dic = zip_to_area.set_index('ComArea')['ZipCode'].to_dict()

In [24]:
eat_seat = newcolfromdict(eat_seat, zip_dic, 'Zip', 'Community Area')

In [25]:
eat_seat.head(3)

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,1.0,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76
1,2320918,BEEFSTEAK,2698445.0,1.0,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8
2,2320986,BABA'S COFFEE,2423353.0,1.0,5544-5546 N KEDZIE AVE,60625,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant,14


In [26]:
pb = 0
for zipcode in eat_seat.Zip :
    if zipcode == 0 : 
        pb += 1
print(pb)

3


The **clean_zip()** function returns **0** if the zip code is not convertible into an integer. Here it means that there are 51 missing zip codes in the `Zip` column, so we need to complete the data using the other informations available.
We found a dataset online which contains all the US zip code and the longitude and latitude associated. Only the Chicago zip codes have been kept.


In [27]:
ziplatlong = pd.read_csv('ZipLatLong.csv',sep=',')
ziplatlong.head()

Unnamed: 0,ZIP,LAT,LNG
0,60007,42.0086,-87.99734
1,60008,42.069786,-88.016221
2,60010,42.146494,-88.164651
3,60012,42.272492,-88.314084
4,60013,42.223439,-88.235506


### Second Dataset - 1. BUSINESS LICENSES/OWNERS

We found two datasets on Kaggle gathering the business licenses and the business owners of Chicago. It could be interesting to observe the results of different establishments owned by the same person.

In [28]:
licedf = pd.read_csv('business-licenses.csv',sep=',') #creation of the dataframe
licedf = licedf.rename(str.lower, axis='columns')
licedf = licedf.drop(columns = ['city', 'state', 'id', 'precinct', 'ward precinct', 'business activity id',\
                                'license number', 'application type', 'application created date',\
                                'application requirements complete', 'payment date', 'conditional approval',\
                                'license term start date', 'license term expiration date', 'license approved for issuance',\
                                'date issued', 'license status', 'license status change date', 'ssa',\
                                'historical wards 2003-2015', 'zip codes', 'wards', 'census tracts', 'location',\
                                'license code', 'license description', 'business activity', 'site number',\
                               'zip code', 'latitude', 'longitude', 'address', 'legal name'])
licedf.head(3)

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,license id,account number,doing business as name,ward,police district,community areas
0,1480073,1,BERGHOFF'S RESTAURANT,42.0,1.0,38.0
1,1278029,1,BERGHOFF'S RESTAURANT,42.0,1.0,38.0
2,1337924,1,BERGHOFF'S RESTAURANT,42.0,1.0,38.0


In [29]:
print(len(licedf.index))

989790


In [30]:
owndf = pd.read_csv('business-owners.csv',sep=',') #creation of the dataframe
owndf = owndf.rename(str.lower, axis='columns')
owndf = owndf.drop(columns = ['suffix', 'legal entity owner', 'legal name'])
owndf.head(3)

Unnamed: 0,account number,owner first name,owner middle initial,owner last name,title
0,373231,GUY,WILLIAM,SELLARS,MANAGING MEMBER
1,203002,NELCY,,SANTANA,PARTNER
2,338012,GREGORY,,EDINGBURG,SECRETARY


In [31]:
print(len(owndf.index))

282463


In [38]:
liceown = pd.merge(owndf,  licedf, on = 'account number')
liceown = liceown.rename(columns={"license id": "License", "doing business as name" : "DBA Name"})
liceown.head(3)

Unnamed: 0,account number,owner first name,owner middle initial,owner last name,title,License,DBA Name,ward,police district,community areas
0,373231,GUY,WILLIAM,SELLARS,MANAGING MEMBER,2210780,"PROCURIA CONSULTING, LLC",46.0,19.0,57.0
1,373231,GUY,WILLIAM,SELLARS,MANAGING MEMBER,2162176,"PROCURIA CONSULTING, LLC",46.0,19.0,57.0
2,373231,MARK,ALAN,LAVENDER,MANAGING MEMBER,2210780,"PROCURIA CONSULTING, LLC",46.0,19.0,57.0


In [56]:
liceown = liceown.set_index('License')
liceown.head(3)

Unnamed: 0_level_0,account number,owner first name,owner middle initial,owner last name,title,DBA Name,ward,police district,community areas
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2210780,373231,GUY,WILLIAM,SELLARS,MANAGING MEMBER,"PROCURIA CONSULTING, LLC",46.0,19.0,57.0
2162176,373231,GUY,WILLIAM,SELLARS,MANAGING MEMBER,"PROCURIA CONSULTING, LLC",46.0,19.0,57.0
2210780,373231,MARK,ALAN,LAVENDER,MANAGING MEMBER,"PROCURIA CONSULTING, LLC",46.0,19.0,57.0


In [33]:
print(len(liceown.index))

2080401


In [53]:
def listing(center, columntoadd) :
    return pd.DataFrame(eat_seat.groupby(center)[columntoadd].apply(list))

In [54]:
base = pd.DataFrame(eat_seat.groupby('License')['Inspection ID'].apply(list))
for column in list(eat_seat.columns) :
    base[column] = listing('License', column)[column]

In [55]:
base.head()

Unnamed: 0_level_0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0.0,"[2315561, 2313166, 2313036, 2312477, 2311708, ...","[DORE EARLY CHILDHOOD CENTER, EL COSTENO, TAFT...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 1.0, 1.0, 2.0, 1.0, 2.0, 2.0, 1.0, 1.0, ...","[6108 S Natoma AVE , 235 W 87TH ST , 4071 N OA...","[60638, 60620, 60634, 60609, 60641, 60623, 606...","[2019-10-09, 2019-10-01, 2019-09-27, 2019-09-1...","[Canvass, Complaint, Canvass, Complaint Re-Ins...","[Pass, Fail, Pass, Out of Business, Pass w/ Co...","[nan, 1. PERSON IN CHARGE PRESENT, DEMONSTRATE...","[41.78092716332793, 41.735933440610275, 41.954...","[-87.78764008652699, -87.62987777048866, -87.7...","[{'longitude': '41.78092716332793', 'latitude'...","[school, Not listed, school, Not listed, Not l...","[64, 73, 76, 63, 21, 30, 63, 22, 2, 14, 14, 24..."
1.0,[250567],[HARVEST CRUSADES MINISTRIES],[1.0],[2.0],[118 N CENTRAL AVE ],[60644],[2010-06-04],[Special Events (Festivals)],[Pass],[nan],[41.882845074718844],[-87.76509545204391],"[{'longitude': '41.88284507471884', 'latitude'...",[Not listed],[25]
2.0,"[2144871, 2050308, 1977093, 1970902, 1970312, ...","[COSI, COSI, COSI, COSI, COSI, COSI, COSI, COS...","[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[230 W MONROE ST , 230 W MONROE ST , 230 W MON...","[60606, 60606, 60606, 60606, 60606, 60606, 606...","[2018-02-13, 2017-05-12, 2016-12-14, 2016-11-0...","[Canvass, Canvass, Short Form Complaint, Canva...","[Pass w/ Conditions, Pass, Pass w/ Conditions,...",[3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATU...,"[41.880757158647214, 41.880757158647214, 41.88...","[-87.6347092983425, -87.6347092983425, -87.634...","[{'longitude': '41.88075715864721', 'latitude'...","[restaurant, restaurant, restaurant, restauran...","[32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 3..."
9.0,"[2304407, 2181605, 2181227, 2050713, 1975322, ...","[XANDO COFFEE & BAR / COSI SANDWICH BAR, XANDO...","[9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[116 S MICHIGAN AVE , 116 S MICHIGAN AVE , 116...","[60603, 60603, 60603, 60603, 60603, 60603, 606...","[2019-08-09, 2018-06-19, 2018-06-12, 2017-05-1...","[Canvass, Canvass Re-Inspection, Canvass, Canv...","[Pass w/ Conditions, Pass, Fail, Pass, Pass, P...","[1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNO...","[41.880395838259616, 41.880395838259616, 41.88...","[-87.62450172159464, -87.62450172159464, -87.6...","[{'longitude': '41.88039583825962', 'latitude'...","[restaurant, restaurant, restaurant, restauran...","[32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 3..."
40.0,"[2222357, 2151032, 2079140, 2078535, 2072109, ...","[COSI, COSI, COSI, COSI, COSI, COSI, COSI, COS...","[40.0, 40.0, 40.0, 40.0, 40.0, 40.0, 40.0, 40....","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[233 N MICHIGAN AVE , 233 N MICHIGAN AVE , 233...","[60601, 60601, 60601, 60601, 60601, 60601, 606...","[2018-09-14, 2018-03-27, 2017-08-28, 2017-08-1...","[Complaint, Canvass, Complaint Re-Inspection, ...","[Pass w/ Conditions, Pass, Pass, Fail, Fail, P...",[2. CITY OF CHICAGO FOOD SERVICE SANITATION CE...,"[41.886567370886944, 41.886567370886944, 41.88...","[-87.62438467059714, -87.62438467059714, -87.6...","[{'longitude': '41.886567370886944', 'latitude...","[restaurant, restaurant, restaurant, restauran...","[32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 3..."


In [57]:
scd_eat_seat = pd.merge(base, liceown, left_index=True, right_index=True)

In [58]:
scd_eat_seat.head()

Unnamed: 0_level_0,Inspection ID,DBA Name_x,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,...,Community Area,account number,owner first name,owner middle initial,owner last name,title,DBA Name_y,ward,police district,community areas
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1770.0,"[2229993, 2015455, 1951414, 1948654, 1561990, ...","[VALENTINO CLUB CAFE, VALENTINO CLUB CAFE, VAL...","[1770.0, 1770.0, 1770.0, 1770.0, 1770.0, 1770....","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[7150 W GRAND AVE , 7150 W GRAND AVE , 7150 W ...","[60707, 60707, 60707, 60707, 60707, 60707, 607...","[2018-10-18, 2017-08-21, 2017-02-16, 2016-08-0...","[Canvass, Canvass, Complaint, Canvass, Canvass...","[Out of Business, No Entry, No Entry, No Entry...","[nan, nan, nan, nan, nan, nan, nan, nan, 35. W...",...,"[25, 25, 25, 25, 25, 25, 25, 25, 25, 25]",81,RUDOLFO,,GUERRERO,SECRETARY,SABRINA'S CLUB,25.0,10.0,33.0
1770.0,"[2229993, 2015455, 1951414, 1948654, 1561990, ...","[VALENTINO CLUB CAFE, VALENTINO CLUB CAFE, VAL...","[1770.0, 1770.0, 1770.0, 1770.0, 1770.0, 1770....","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[7150 W GRAND AVE , 7150 W GRAND AVE , 7150 W ...","[60707, 60707, 60707, 60707, 60707, 60707, 607...","[2018-10-18, 2017-08-21, 2017-02-16, 2016-08-0...","[Canvass, Canvass, Complaint, Canvass, Canvass...","[Out of Business, No Entry, No Entry, No Entry...","[nan, nan, nan, nan, nan, nan, nan, nan, 35. W...",...,"[25, 25, 25, 25, 25, 25, 25, 25, 25, 25]",81,JOSE,,GUERRERO,OTHER,SABRINA'S CLUB,25.0,10.0,33.0
1770.0,"[2229993, 2015455, 1951414, 1948654, 1561990, ...","[VALENTINO CLUB CAFE, VALENTINO CLUB CAFE, VAL...","[1770.0, 1770.0, 1770.0, 1770.0, 1770.0, 1770....","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[7150 W GRAND AVE , 7150 W GRAND AVE , 7150 W ...","[60707, 60707, 60707, 60707, 60707, 60707, 607...","[2018-10-18, 2017-08-21, 2017-02-16, 2016-08-0...","[Canvass, Canvass, Complaint, Canvass, Canvass...","[Out of Business, No Entry, No Entry, No Entry...","[nan, nan, nan, nan, nan, nan, nan, nan, 35. W...",...,"[25, 25, 25, 25, 25, 25, 25, 25, 25, 25]",81,JOSE,,GUERRERO,PRESIDENT,SABRINA'S CLUB,25.0,10.0,33.0
7141.0,"[1229350, 606283, 277110, 250506, 250311]","[SOUTHWEST MONTESSORI PRE SCHOL, SOUTHWEST MON...","[7141.0, 7141.0, 7141.0, 7141.0, 7141.0]","[1.0, 1.0, 1.0, 1.0, 1.0]","[8620 -08624 S RACINE AVE , 8620 -08624 S RA...","[60620, 60620, 60620, 60620, 60620]","[2012-06-05, 2011-06-17, 2010-07-07, 2010-06-0...","[License, Canvass, License Re-Inspection, Lice...","[Pass, Pass, Pass, Fail, Fail]",[33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENS...,...,"[73, 73, 73, 73, 73]",282,HECTOR,,RODRIGUEZ,OTHER,EL TIPICO,40.0,20.0,6.0
9803.0,[1329644],[PEERLESS TRADING CORP],[9803.0],[3.0],[5933-5935 W LAWRENCE AVE ],[60630],[2013-12-06],[Canvass],[Pass],[33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENS...,...,[16],397,PAUL,,MARTINEZ,PRESIDENT,WEEDS TAVERN,2.0,18.0,37.0


### Managing the changes in the Food Code Rules

The Food Code Rules has changed since the 1st July 2018. 

The following code returns every unique entry in the considered column : 

In [None]:
#eat_seat.Violations.unique()

In [None]:
len(eat_seat.Violations.unique())

It seems that every violation is a unique entry because it contains not only the violation type but also the comments of the inspectors. We have to split the Violations column into 3 different columns :
- Violation number
- Violation type
- Violation comments

It seems that every violation cell is architectured this way :
"number of the violation". "TYPE OF THE VIOLATION" - Comments : "comments of the inspector"

We can also try to split each line in three, separating the numbers (taking each character until the character "."), then separating the violation type (every following character until the character "-"), and the rest of the characters would be the comments of the inspector. 

In [None]:
eat_seat['Violations'][0]

Actually it seems that it is not a good idea because there is more than one violation in every Violations cell. 

In [None]:
eat_seat['Violations'] = eat_seat.Violations.apply(lambda x : str(x))

eat_seat['Violations'] = eat_seat.Violations.apply(lambda x : x.split("."))

eat_seat['Violations'][0]

`****  PB A PARTIR DE LA CELL SUIVANTE PAS MEMES RESULTATS QUE DANS CLAIRE  ****`

In [None]:

len(eat_seat['Violations'][0])

In [None]:
eat_seat['ViolationNumber'] = eat_seat.Violations.apply(lambda x : x[0])

eat_seat['ViolationNumber'][0]

In [None]:


eat_seat['Violations'] = eat_seat.Violations.apply(lambda x : x[1])

eat_seat['Violations'] = eat_seat.Violations.apply(lambda x : x.split("-"))

eat_seat['ViolationType'] = eat_seat.Violations.apply(lambda x : x[0])

eat_seat['ViolationComment'] = eat_seat.Violations.apply(lambda x : x[1])


In [None]:
violators = eat_seat.dropna(subset=['Violations'], axis = 0, how = 'all')
violations = violators.apply(lambda row: re.findall('\|\s([0-9]+)[.]', str(row['Violations'])), axis = 1)
first_violations = violators.apply(lambda row: row['Violations'].split('.')[0], axis = 1)

for violation, first_violation in zip(violations, first_violations):
    violation.append(first_violation)

flat_list = [item for sublist in violations for item in sublist]
unique, counts = np.unique(flat_list, return_counts=True)

In [None]:
violations