# Where to eat in Chigago : an analysis of the inspections from the Chicago department of Public Health's Food Protection Programm

## 1. Introduction

The Chicago department of Public Health’s Food Protection Program provides a database which contains the information from inspection reports of restaurants and other food establishments in Chicago from 2010 to the present. It contains many informations about the establishments, like their type of facility (groceries’ stores, restaurants, coffee shop, …) and their locations. Many informations about the violations listed are also provided in the database, like the description of the findings that caused them and the reason that invoked the staff from the program to led an inspection.

In our project we endeavor to visualize the healthiness of public food establishments according to their type of facility, their ward and the date of the inspection. An analysis of the violation’s types according to these three parameters will also be conducted. 

The principal questions we'll answer are : 
    - Which ward of Chicago are the most healthy and unhealthy ? 
    - Which type of facility tend to be less healthy ? 
    - Did the healthiness of the food in Chicago increase or decrease from 2010 until now ?

New problematics could be asked during the analysis and would be added to these.

The purpose of the project is to help the consumer to easily choose where to eat in Chicago and to provide them an interactive and intuitive way to browse the different places offered to them. Also, it could help the Chicago department of Public Health’s Food Protection Program to adapt their methods relying on the situation described by the findings of the analysis (for example, if a prevention program should be proposed for a specific area or type of facility).

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import requests as req
from bs4 import BeautifulSoup
import seaborn as sns

## 2. Preprocessing

### 2.1 Facilities of interest Selection

First a quick look at how is organized the dataset. 

In [2]:
df = pd.read_csv('food-inspections.csv',sep=',') #creation of the dataframe
df.head(3)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,...,Results,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Children's Services Facility,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,...,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Restaurant,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,...,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,
2,2320986,BABA'S COFFEE,BABA'S COFFEE,2423353.0,Restaurant,Risk 1 (High),5544-5546 N KEDZIE AVE,CHICAGO,IL,60625.0,...,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195312 entries, 0 to 195311
Data columns (total 22 columns):
Inspection ID                 195312 non-null int64
DBA Name                      195312 non-null object
AKA Name                      192862 non-null object
License #                     195295 non-null float64
Facility Type                 190535 non-null object
Risk                          195239 non-null object
Address                       195312 non-null object
City                          195173 non-null object
State                         195270 non-null object
Zip                           195261 non-null float64
Inspection Date               195312 non-null object
Inspection Type               195311 non-null object
Results                       195312 non-null object
Violations                    143530 non-null object
Latitude                      194627 non-null float64
Longitude                     194627 non-null float64
Location                      194627 n

It is a dataset of 195'312 entries with 22 columns listed above.
First thing first, we want to put the different facility types in categories that make sense for our project.

In [4]:
#df['Facility Type'].unique()

This command returns a list which contains the different facility types found in the `Facility Type column`. A lot of different types of facility are found in the data.

First, we thought about only select the "private" establishments, where it is possible to eat a main course (for example, the places where you can only eat an ice cream are deleted of our list). They all are categorized in order to be compared with each other.


In [5]:
public_dic = {'restaurant' : ['Restaurant', 'DINING HALL', 'TENT RSTAURANT'], \
              'grocery_restaurant' : ['Grocery & Restaurant', 'GROCERY& RESTAURANT', 'GROCERY/RESTAURANT',\
                                    'GROCERY/ RESTAURANT', 'GROCERY STORE/ RESTAURANT', 'GROCERY & RESTAURANT',\
                                    'RESTAURANT/GROCERY', 'grocery & restaurant', 'RESTAURANT/GROCERY STORE',\
                                    'GROCERY/TAQUERIA', 'GAS STATION/RESTAURANT'],\
              'banquet' : ['LOUNGE/BANQUET HALL', 'BANQUET', 'Banquet Hall', 'BANQUET FACILITY', 'banquet hall',\
                         'banquets', 'Banquet Dining',  'Banquet/kitchen','RESTAURANT.BANQUET HALLS',\
                         'BANQUET HALL', 'Banquet', 'BOWLING LANES/BANQUETS'], \
              'rooftop_restaurant' : ['Wrigley Roof Top', 'REST/ROOFTOP'],\
              'bar_restaurant' : ['RESTAURANT/BAR', 'RESTUARANT AND BAR', 'BAR/GRILL', 'RESTAURANT/BAR/THEATER',\
                                'JUICE AND SALAD BAR', 'SUSHI COUNTER', 'TAVERN/RESTAURANT', 'tavern/restaurant',\
                                'TAVERN GRILL'], \
              'bakery_restaurant' : ['BAKERY/ RESTAURANT', 'bakery/restaurant', 'RESTAURANT/BAKERY'], \
              'liquor_restaurant' : ['RESTAURANT AND LIQUOR', 'RESTAURANT/LIQUOR'], \
              'catering' : ['CATERING/CAFE', 'Catering'], \
              'golden_diner' : ['Golden Diner']}

In [6]:
facilitytype = 'BANQUET'
len(df[df['Facility Type'] == facilitytype])

64

This command returns the number of occurencs of the `Facility Type` inputed. 

With trying different types previously categorized and listed in the `public_dic` dictionary we have noted that the results were too distant to conduct a meaningful analysis. That's why we then decided to also select "public" establishments like school cafeterias and hospitals. It could be interesting to compare private and public inspection results.


In [7]:
private_dic = {'daycare' : ['Daycare Above and Under 2 Years', 'Daycare (2 - 6 Years)', 'Daycare Combo 1586',\
                          'Daycare (Under 2 Years)', 'DAYCARE 2 YRS TO 12 YRS', 'Daycare Night', 'DAY CARE 2-14',\
                          'Daycare (2 Years)', 'DAYCARE', 'ADULT DAYCARE', '15 monts to 5 years old', 'youth housing',\
                          'DAYCARE 1586', 'DAYCARE COMBO', '1584-DAY CARE ABOVE 2 YEARS', 'CHURCH/DAY CARE', 'DAY CARE',\
                          'DAYCARE 6 WKS-5YRS', 'DAY CARE 1023', 'DAYCARE 2-6, UNDER 6', 'Day Care Combo (1586)'], \
               'school' : ['SCHOOL', 'School', 'PRIVATE SCHOOL', 'AFTER SCHOOL PROGRAM', 'COLLEGE',\
                         'BEFORE AND AFTER SCHOOL PROGRAM', 'Private School', 'TEACHING SCHOOL',\
                         'PUBLIC SHCOOL', 'CHARTER SCHOOL CAFETERIA', 'CAFETERIA', 'Cafeteria', 'cafeteria',\
                         'UNIVERSITY CAFETERIA', 'PREP INSIDE SCHOOL', 'CHARTER SCHOOL', 'school cafeteria',\
                         'CHARTER SCHOOL/CAFETERIA', 'School Cafeteria', 'ALTERNATIVE SCHOOL', 'CITY OF CHICAGO COLLEGE',\
                         'after school program', 'CHURCH/AFTER SCHOOL PROGRAM', 'AFTER SCHOOL CARE'], \
               'childrens_services' : ["Children's Services Facility", 'CHILDRENS SERVICES FACILITY', \
                                     "CHILDERN'S SERVICE FACILITY", "1023 CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICES FACILITY", "1023-CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICE FACILITY", "1023 CHILDERN'S SERVICE S FACILITY", \
                                     'CHILDERN ACTIVITY FACILITY', "CHILDERN'S SERVICES  FACILITY", '1023'], \
               'adultcare' : ['Long Term Care', 'REHAB CENTER', 'Hospital', 'ASSISTED LIVING', 'SENIOR DAY CARE',\
                            'Assisted Living', 'NURSING HOME', 'ASSISTED LIVING FACILITY', 'SUPPORTIVE LIVING FACILITY',\
                            'Assisted Living Senior Care', 'Adult Family Care Center', '1005 NURSING HOME', \
                            'Long-Term Care Facility', 'LONG TERM CARE FACILITY', 'ASSISSTED LIVING',\
                            'Long-Term Care','Long Term Care Facility', 'VFW HALL']}

In [8]:
total_dic = {**public_dic , **private_dic}

In [9]:
def newcolfromdict(dataf, dic, inputcolumn, outputcolumn) :
    new_list = []
    for content in dataf[inputcolumn] :
        new_content = 'Not listed'

        for k, v in dic.items() :
            if type(v) == list :
                for element in v :
                    if content == element :
                        new_content = k
            else :
                if content == v :
                    new_content = k

        new_list.append(new_content)
    dataf[outputcolumn] = new_list
    return dataf

The **newcolfromdict** function constructs a new column `outputcolum` from the inputed column `inputcolum` comparing its values to the `dic` dictionary and adds it to the `dataf` dataframe.

### 2.2 New dataframe Creation

To construct the new dataframe, the `Facility Type` column is dropped because it has been replaced by the column `Facility group` and the *Not Listed* establishments are not selected.

The duplicates are dropped.

In [10]:
eat_seat = newcolfromdict(df, total_dic, 'Facility Type', 'Facility Group')
eat_seat = df.loc[df['Facility Group'] != 'Not Listed']
eat_seat = eat_seat.drop(columns = ['Facility Type'])

eat_seat = eat_seat.drop_duplicates()

eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Risk,Address,City,State,Zip,Inspection Date,...,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,2019-11-01T00:00:00.000,...,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,,childrens_services
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,2019-11-01T00:00:00.000,...,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,,restaurant


Since we only care about establishments in Chicago Illinois, we will only keep the data for this city and drop the `City` and `State` columns.

We first check the different City's names to avoid deleting rows due to missprints.

In [11]:
df['City'].unique()

array(['CHICAGO', nan, 'chicago', 'Chicago', 'GRIFFITH', 'NEW YORK',
       'SCHAUMBURG', 'ELMHURST', 'ALGONQUIN', 'NEW HOLSTEIN', 'CCHICAGO',
       'NILES NILES', 'EVANSTON', 'CHICAGO.', 'CHESTNUT STREET',
       'LANSING', 'CHICAGOCHICAGO', 'WADSWORTH', 'WILMETTE', 'WHEATON',
       'CHICAGOHICAGO', 'ROSEMONT', 'CHicago', 'CALUMET CITY',
       'PLAINFIELD', 'HIGHLAND PARK', 'PALOS PARK', 'ELK GROVE VILLAGE',
       'CICERO', 'BRIDGEVIEW', 'OAK PARK', 'MAYWOOD', 'LAKE BLUFF',
       '312CHICAGO', 'SCHILLER PARK', 'SKOKIE', 'BEDFORD PARK',
       'BANNOCKBURNDEERFIELD', 'CHCICAGO', 'BLOOMINGDALE', 'Norridge',
       'CHARLES A HAYES', 'CHCHICAGO', 'CHICAGOI', 'SUMMIT',
       'OOLYMPIA FIELDS', 'WESTMONT', 'CHICAGO HEIGHTS', 'JUSTICE',
       'TINLEY PARK', 'LOMBARD', 'EAST HAZEL CREST', 'COUNTRY CLUB HILLS',
       'STREAMWOOD', 'BOLINGBROOK', 'INACTIVE', 'BERWYN', 'BURNHAM',
       'DES PLAINES', 'LAKE ZURICH', 'OLYMPIA FIELDS', 'alsip',
       'OAK LAWN', 'BLUE ISLAND', 'GLENCOE',

In [12]:
cities = ['CHICAGO','chicago','Chicago','CCHICAGO','CHICAGO.','CHICAGOCHICAGO','CHICAGOHICAGO',\
          'CHicago','312CHICAGO','CHCICAGO','CHCHICAGO','CHICAGOI','CHICAGO HEIGHTS']

eat_seat = eat_seat.loc[eat_seat['City'].isin(cities)]

eat_seat = eat_seat.drop(columns = ['City','State'])

# à la limite pour être stylés on pourrait tenter une expression régulière pour filtrer les différentes orthographes 
#de chigago au lieu de sélectionner à la main

Then we want to check the missing values.

In [13]:
eat_seat.isnull().sum()

Inspection ID                      0
DBA Name                           0
AKA Name                        2416
License #                         17
Risk                              70
Address                            0
Zip                                3
Inspection Date                    0
Inspection Type                    1
Results                            0
Violations                     51550
Latitude                         515
Longitude                        515
Location                         515
Historical Wards 2003-2015    194729
Zip Codes                     194729
Community Areas               194729
Census Tracts                 194729
Wards                         194729
Facility Group                     0
dtype: int64

In [14]:
print(len(eat_seat.index))    ##returns the number of rows of the df

194729


As we can see, the columns `Historical Wards 2003-2015`, `Zip Codes`, `Community Areas`, `Census Tracts` and `Wards` are empty and will be dropped.

We will only be using the `DBA Name` (the name under which the establishment is doing business ; DBA = doing business as), so we drop the `AKA Name` column too.

In [15]:
eat_seat = eat_seat.drop(columns = ['AKA Name','Historical Wards 2003-2015', 'Zip Codes', 'Community Areas',\
                                    'Census Tracts', 'Wards'])
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Risk 1 (High),7559 W ADDISON ST,60634.0,2019-11-01T00:00:00.000,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,Risk 1 (High),303 E SUPERIOR ST,60611.0,2019-11-01T00:00:00.000,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant


There is also adjustments to make in some columns, because the formats aren't usable :

* In `Inspection Date`, only the day will be kept, not the time of day that is actually not given
* In `Risk`, only the number will remain

For the column `Risk`, we first want to check what types of risk are listed.

In [16]:
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x:x.split('T')[0])

In [17]:
eat_seat.Risk.unique()
#print all types of entries in the column Risk

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', nan, 'All'],
      dtype=object)

We will replace **All** and **High Risk** by *3*, **Medium Risk** by *2* and **Low Risk** by *1*.


***--> pourquoi remplacer All by High Risk ?***

***A vérifier si c'était vraiment pensé comme ça quand la bdd a été conçue***

In [18]:
eat_seat['Risk'] = eat_seat['Risk'].replace({'All':1, 'Risk 1 (High)':3, 'Risk 2 (Medium)':2, 'Risk 3 (Low)':1})
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634.0,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611.0,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant


In [19]:
eat_seat = eat_seat.rename(columns={"License #": "License"}) #rename the column 'License #' into 'License'

In [20]:
len(eat_seat.License.unique())

37145

In [21]:
eat_seat.head()

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634.0,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611.0,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant
2,2320986,BABA'S COFFEE,2423353.0,3.0,5544-5546 N KEDZIE AVE,60625.0,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant
3,2320910,J.T.'S GENUINE SANDWICH,2689893.0,3.0,3970 N ELSTON AVE,60618.0,2019-11-01,License,Pass,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant
4,2320904,"KID'Z COLONY DAYCARE, INC.",2215609.0,3.0,6287 S ARCHER AVE,60638.0,2019-11-01,Canvass,Fail,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare


In [22]:
eat_seat.Results.unique()

array(['Pass w/ Conditions', 'Pass', 'No Entry', 'Fail',
       'Out of Business', 'Not Ready', 'Business Not Located'],
      dtype=object)

As done for Risk, we should replace Pass w/ Conditions by 1, Pass by 0 and Fail by 2 (to compute the score of the facilities). But what should we do with Out of Business and Not Ready ?

### Float to Integer

In [23]:
def clean_float(floatnumb):
    try :
        return int(float(floatnumb))
    except :
        return 0

In [24]:
eat_seat.Zip = eat_seat.Zip.apply(clean_float)    ##guarantees the zipcodes to be integers

In [25]:
eat_seat.Risk = eat_seat.Risk.apply(clean_float)    ##guarantees the Risk numbers to be integers

In [26]:
eat_seat.License = eat_seat.License.apply(clean_float)    ##guarantees the License numbers to be integers

### Community Areas

We found a file associating the chicago zipe codes and their associated community area. Using it we can create a new `Community Area` column.

In [27]:
zip_to_area = pd.read_csv('ZipCode_to_ComArea.csv',sep=',') ##creation of the dataframe
zip_to_area = zip_to_area.drop(columns = ['TOT2010'])

zip_to_area.ZipCode = zip_to_area.ZipCode.apply(clean_float) ##guarantees the zipcodes to be integers

zip_to_area = zip_to_area.groupby('ComArea')['ZipCode'].apply(list)    ##groups the zipcodes by community area number
zip_to_area = zip_to_area.reset_index()

In [28]:
zip_dic = zip_to_area.set_index('ComArea')['ZipCode'].to_dict()

In [29]:
eat_seat = newcolfromdict(eat_seat, zip_dic, 'Zip', 'Community Area')

In [30]:
eat_seat.head(3)

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76
1,2320918,BEEFSTEAK,2698445,3,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8
2,2320986,BABA'S COFFEE,2423353,3,5544-5546 N KEDZIE AVE,60625,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant,14


In [31]:
len(eat_seat[eat_seat['Zip'] == 0])

3

The **clean_float()** function returns **0** if the zip code is not convertible into an integer. Here it means that there are 3 missing zip codes in the `Zip` column. 

In [32]:
eat_seat[eat_seat['Zip'] == 0].head()

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area
114117,1464217,DUNKIN DONUTS,1515116,2,7545 N PAULINA ST,0,2014-04-02,Canvass,Out of Business,,42.019032,-87.673459,"{'longitude': '42.01903180273219', 'latitude':...",restaurant,Not listed
153781,1106210,DUNKIN DONUTS,1515116,2,7545 N PAULINA ST,0,2012-04-09,Canvass Re-Inspection,Pass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,42.019032,-87.673459,"{'longitude': '42.01903180273219', 'latitude':...",restaurant,Not listed
156081,670661,DUNKIN DONUTS,1515116,2,7545 N PAULINA ST,0,2012-02-21,Complaint,Pass w/ Conditions,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",42.019032,-87.673459,"{'longitude': '42.01903180273219', 'latitude':...",restaurant,Not listed


We can observe that the missing zip codes are all from the same esablishment. By doing a google search the zip code corresponding to the adress is easily retrieved :
*7545 N Paulina St
Chicago, IL 60626, USA*

So that we just have to replace the missing zip codes by 60626.

In [33]:
def clean_zip(zipcode):
    if zipcode == 0:
        return 60626

In [34]:
eat_seat.Zip = eat_seat.Zip.apply(clean_zip)

### Second Dataset - 1. BUSINESS LICENSES/OWNERS

We found two datasets on Kaggle gathering the business licenses and the business owners of Chicago. It could be interesting to observe the results of different establishments owned by the same person.

### Business Licenses

The first dataset contains the details about every licensed establishments. There are a lot of columns but the only ones interesting us are :
- the `license id` column to have a link with the *chicago food inspections* dataset
- the `account number` column to have a link with the *business owners* dataset
- the `police district` column in case we want to have a link with the *crime* dataset

In [35]:
licenses = pd.read_csv('business-licenses.csv', sep=',') #creation of the dataframe
licenses = licenses.rename(str.lower, axis='columns')
licenses = licenses.drop(columns = ['city', 'state', 'id', 'precinct', 'ward precinct', 'business activity id',\
                                'license number', 'application type', 'application created date',\
                                'application requirements complete', 'payment date', 'conditional approval',\
                                'license term start date', 'license term expiration date', 'license approved for issuance',\
                                'date issued', 'license status', 'license status change date', 'ssa',\
                                'historical wards 2003-2015', 'zip codes', 'wards', 'census tracts', 'location',\
                                'license code', 'license description', 'business activity', 'site number',\
                               'zip code', 'latitude', 'longitude', 'address', 'legal name', 'doing business as name',\
                                'community areas', 'ward'])
licenses['police district'] = licenses['police district'].apply(clean_float) 

licenses.head(3)

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,license id,account number,police district
0,1480073,1,1
1,1278029,1,1
2,1337924,1,1


In [36]:
licenses = licenses.set_index('account number')
licenses.head(3)

Unnamed: 0_level_0,license id,police district
account number,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1480073,1
1,1278029,1
1,1337924,1


In [37]:
print(len(licenses.index))

989790


### Business Owners

The second dataset contains the details about every license owners. We have decided to keep the following columns :
- the `account number` column to have a link with the *business licenses* dataset
- the `owner first name` and the `owner last name` columns in order to create a `full name` column

In [38]:
owners = pd.read_csv('business-owners.csv',sep=',') #creation of the dataframe
owners = owners.rename(str.lower, axis='columns')
owners = owners.drop(columns = ['suffix', 'legal entity owner', 'owner middle initial', 'legal name', 'title'])
owners.head(3)

Unnamed: 0,account number,owner first name,owner last name
0,373231,GUY,SELLARS
1,203002,NELCY,SANTANA
2,338012,GREGORY,EDINGBURG


In [39]:
owners['full name'] = owners['owner first name'] + ' ' + owners['owner last name']
owners = owners.drop(columns = ['owner first name', 'owner last name'])

A `full name` column is enough for the needs that we have to link the licenses to the owners.

In [40]:
owners.head(3)

Unnamed: 0,account number,full name
0,373231,GUY SELLARS
1,203002,NELCY SANTANA
2,338012,GREGORY EDINGBURG


In [41]:
len(owners['account number'].unique())

167992

In [42]:
len(owners['full name'].unique())

192354

Here we can see that the number of accounts is not the same that the number of full names. For now, we will consider that a same account can be shared by several people (for example, it could be the case for partners owning a business together).

In [43]:
owners = pd.DataFrame(owners.groupby('account number')['full name'].apply(list))

In [44]:
owners.head()

Unnamed: 0_level_0,full name
account number,Unnamed: 1_level_1
1,"[PETER BERGHOFF, PETER BERGHOFF, HERMAN BERGHOFF]"
2,"[nan, HERMAN BERGHOFF, PETER BERGHOFF, nan, EI..."
4,[LAWRENCE PRICE]
6,[JOHN SCHALLER]
8,"[CHRISTINE MAIR, CHRISTINE MAIR]"


We can observe that the lists of the `full name` column contain duplicates and 'nan' values. A function *clean_list* can be defined to clean them.

In [45]:
def clean_list(liste) :
    cleaned = []
    for element in liste :
        if type(element) == str and element not in cleaned :
            cleaned.append(element)
    return cleaned

In [46]:
owners['full name'] = owners['full name'].apply(clean_list)

In [47]:
owners.head()

Unnamed: 0_level_0,full name
account number,Unnamed: 1_level_1
1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
2,"[HERMAN BERGHOFF, PETER BERGHOFF, EILEEN GORMA..."
4,[LAWRENCE PRICE]
6,[JOHN SCHALLER]
8,[CHRISTINE MAIR]


### Business Licenses-Owners

Setting both the indexes of the *licenses* and the *owners* dataframes we can now merge them together.

In [48]:
business = pd.merge(licenses, owners, right_index = True, left_index = True)

In [49]:
business.head()

Unnamed: 0_level_0,license id,police district,full name
account number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1480073,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1278029,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1337924,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1480076,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1404362,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"


In [50]:
business.reset_index(level=0, inplace=True)

In [51]:
business.head()

Unnamed: 0,account number,license id,police district,full name
0,1,1480073,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1,1278029,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
2,1,1337924,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
3,1,1480076,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
4,1,1404362,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"


### Second Main Dataframe

Setting both the indexes of the *business* and the *eat_seat* dataframes we can now merge them together.

In [52]:
business = business.rename(columns= {'license id' : 'License'})
business = business.set_index('License')

eat_seat = eat_seat.set_index('License')

In [53]:
eat_seat_2 = pd.merge(eat_seat, business, right_index = True, left_index = True)

In [54]:
eat_seat_2.head()

Unnamed: 0_level_0,Inspection ID,DBA Name,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,account number,police district,full name
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1770,2229993,VALENTINO CLUB CAFE,3,7150 W GRAND AVE,,2018-10-18,Canvass,Out of Business,,41.923853,-87.804535,"{'longitude': '41.92385304509883', 'latitude':...",restaurant,25,81,10,"[RUDOLFO GUERRERO, JOSE GUERRERO]"
1770,2015455,VALENTINO CLUB CAFE,3,7150 W GRAND AVE,,2017-08-21,Canvass,No Entry,,41.923853,-87.804535,"{'longitude': '41.92385304509883', 'latitude':...",restaurant,25,81,10,"[RUDOLFO GUERRERO, JOSE GUERRERO]"
1770,1951414,VALENTINO CLUB CAFE,3,7150 W GRAND AVE,,2017-02-16,Complaint,No Entry,,41.923853,-87.804535,"{'longitude': '41.92385304509883', 'latitude':...",restaurant,25,81,10,"[RUDOLFO GUERRERO, JOSE GUERRERO]"
1770,1948654,VALENTINO CLUB CAFE,3,7150 W GRAND AVE,,2016-08-02,Canvass,No Entry,,41.923853,-87.804535,"{'longitude': '41.92385304509883', 'latitude':...",restaurant,25,81,10,"[RUDOLFO GUERRERO, JOSE GUERRERO]"
1770,1561990,VALENTINO CLUB CAFE,3,7150 W GRAND AVE,,2015-08-07,Canvass,No Entry,,41.923853,-87.804535,"{'longitude': '41.92385304509883', 'latitude':...",restaurant,25,81,10,"[RUDOLFO GUERRERO, JOSE GUERRERO]"


In [55]:
eat_seat_2.to_csv('newfood.csv')

### Managing the changes in the Food Code Rules

The Food Code Rules has changed since the 1st July 2018. After investigating those changes, it seems that only the denomination of the violations but not the violation itself has changed, and a few additionnal violations has been added in the possible violations. It means that those changes does not need more processing and can just be considered together as a common list of violations.

In [None]:
len(eat_seat.Violations.unique())

It seems that every violation is a unique entry because it contains not only the violation type but also the comments of the inspectors. We have to split the Violations column into 3 different columns :
- Violation number
- Violation type
- Violation comments

It seems that every violation cell is architectured this way :
"number of the violation". "TYPE OF THE VIOLATION" - Comments : "comments of the inspector" (this format repeated as many times as the number of violations detected the day of the inspection, separated with a vertical line)

We just want to keep the violation number (because we can check which violation it is online). We create a column NumberViolations containing the ID of the violations found during the corresponding investigation.

In [None]:
eat_seat.isnull().sum()

We see that there are more than 50'000 rows where the Violations column is empty. There is no point in keeping the entries with no violations because our research is based on the study of those violations so those entries are dropped. Then the identifying numbers of the violations are searched and added to the new column "NumberViolations". The rest is not kept because the titles of the violations can be found online and we do not plan on using the comments of the investigators.

In [None]:
eat_seat = eat_seat.dropna(subset=['Violations'], axis = 0, how = 'all')

In [None]:
violations = eat_seat.apply(lambda row: re.findall('\|\s([0-9]+)[.]', str(row['Violations'])), axis = 1)

In [None]:
first_violations = eat_seat.apply(lambda row: row['Violations'].split('.')[0], axis = 1)

In [None]:
for violation, first_violation in zip(violations, first_violations):
    violation.append(first_violation)

flat_list = [item for sublist in violations for item in sublist]
unique, counts = np.unique(flat_list, return_counts=True)

In [None]:
eat_seat['NumberViolations'] = violations

In [None]:
eat_seat = eat_seat.drop(columns = ['Location', 'Violations'])
eat_seat.head()

In [None]:
eat_seat = eat_seat.set_index(['Inspection ID']) #redifines the Index
eat_seat.head()

### creating a dataframe where the Index is the DBA Name so that we can compute a score

In [None]:
#keeping just one occurence of DBA Name
df_restaurants = eat_seat.drop_duplicates(subset=['License'])
df_restaurants.head()

In [None]:
df_restaurants = df_restaurants.set_index(['License']) #redifines the Index

In [None]:
df_restaurants.head()

In [None]:
df_restaurants = df_restaurants.drop(columns = ['Inspection Date', 'Inspection Type', 'Results', 'NumberViolations'])
df_restaurants.head()

Now we have a dataframe with every facility listed in our preprocessed database. We want to create two columns, one containing how many times the facility has been inspected, and one containing every violation of this facility. 

In [None]:
#jârrive pas à le faire mieux avec panda donc ça prend mille ans...

dic = {}

for license in df_restaurants.index :
    score = len(eat_seat[eat_seat['License'] == license])
    dic[license] = score

In [None]:
df_restaurants['NumberOfInspections'] = dic.values()

In [None]:
df_restaurants.head()

In [None]:
dic = {}

for license in df_restaurants.index :
    liste = []
    inspections = eat_seat['NumberViolations'][eat_seat['License'] == license]
    for inspection in inspections :
        for violation in inspection :
            liste.append(violation)
    dic[license] = liste

In [None]:
df_restaurants['TotalViolations'] = dic.values()

In [None]:
df_restaurants.head()

In [None]:
df_restaurants = df_restaurants.drop(columns = ['DBA Name', 'Risk', 'Address', 'Zip', 'Latitude', 'Longitude', 'Facility Group', 'Community Area'])
df_restaurants.head()

In [None]:
eat_seat = pd.merge(eat_seat, df_restaurants, how='left', on='License', left_index=True, right_index=False)

In [None]:
eat_seat.head()

# The Crime Dataset

## looking at the dataset and modify it

In [None]:
crime_2001_2004 = pd.read_csv('Chicago_Crimes_2001_to_2004.csv', error_bad_lines=False)

In [None]:
crime_2005_2007 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv', error_bad_lines=False)

In [None]:
crime_2008_2011 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)

In [None]:
crime_2012_2017 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)

These 4 cells load the data taken from kaggle in a panda DataFrame

In [None]:
crime = crime_2001_2004.append(crime_2005_2007).append(crime_2008_2011).append(crime_2012_2017)

crime.head()

This cell is uniting all the data in a single DataFrame to make the data cleaning easier

In [None]:
crime = crime.drop(['Unnamed: 0','Case Number','IUCR','Arrest', 'Domestic','FBI Code','X Coordinate','Y Coordinate','Updated On', 'Location Description'], axis=1)

crime.head()

Here the columns that are not useful for our analysis are dropped

In [None]:
Block_map = []
for x in crime['Block'] :
    y =x.replace('XX','00')
    Block_map.append(y)

The Adress under the `Block` column are anonimized with the XX at the end of the adress number. We'll use the adress for our map so this is a problem. We can't have the exact adress, so we make the aproximation of using the 00 adress instead of the XX. This is not exact but will give an approximative location of where the crimes where comitted.

In [None]:
crime['Block_mapable'] = Block_map

crime.head()

The new Serie created with the 00 instead of the XX is load in the DataFrame

In [None]:
for x in crime.columns :
    print(x + ' : ' + str(crime[x].isnull().values.any()) + ' --> ' + str(crime[x].isnull().sum()))

Here we are looking at missing data, to see there number and position, and hen be able to find a way to deal with them.

We can see that most of the missing data are set in the `Ward`and `Community Area` columns.

Most of the missing data are set in the location part of the data

By now, we keep all the data and we will see later if we have to deal with missing values (maybe for the map).

In [None]:
print(set(crime['Primary Type']))

We print all the Primary Types of crimes to create a dictionary with every crime type keyed with the minimal number of year of prison that someone risk when commiting this crime. More details below

In [None]:
crime_penalty = {'PROSTITUTION' : 1, 'DOMESTIC VIOLENCE' : 1, 'MOTOR VEHICLE THEFT' : 3, 'ASSAULT' : 1, 'OFFENSE INVOLVING CHILDREN' : 1,\
                 'RITUALISM' : 1, 'BATTERY' : 1, 'CRIM SEXUAL ASSAULT' : 4, 'GAMBLING' : 1,\
                 'PUBLIC INDECENCY' : 1, 'OTHER OFFENSE' : 1, 'LIQUOR LAW VIOLATION' : 1, 'OTHER NARCOTIC VIOLATION' : 1, 'OBSCENITY' : 1,\
                 'NON-CRIMINAL' : 1, 'KIDNAPPING' : 3, 'HOMICIDE' : 20, 'NARCOTICS' : 1, 'ARSON' : 6, 'DECEPTIVE PRACTICE' : 1, 'ROBBERY' : 3,\
                 'BURGLARY' : 3, 'INTIMIDATION' : 2, 'HUMAN TRAFFICKING' : 4, 'SEX OFFENSE' : 4, 'CRIMINAL TRESPASS' : 1,\
                 'CONCEALED CARRY LICENSE VIOLATION' : 2, 'CRIMINAL DAMAGE' : 1, 'INTERFERENCE WITH PUBLIC OFFICER' : 1, 'PUBLIC PEACE VIOLATION' : 1,\
                 'WEAPONS VIOLATION' : 1, 'THEFT' : 1, 'STALKING' : 1}

To make the crime_penalty dict, we look for each crime type what is the minimum prison penalty (in year). For the crime where the penalty is below 1 year of prison, we took the 1 value.

These values are taken from the Illinois penalty code (found on internet).

In [None]:
Primary_Type = []
for x in crime['Primary Type'] :
    if x == 'NON-CRIMINAL (SUBJECT SPECIFIED)' or x == 'NON - CRIMINAL' :
        Primary_Type.append('NON-CRIMINAL')
    else :
        Primary_Type.append(x)
        
crime['Primary_Type'] = Primary_Type

Here we group all the NON-CRIMINAL values under one label (3 label before)

In [None]:
crimescore = []
for x in crime['Primary_Type'] :
    crimescore.append(crime_penalty[x])

Here we create a new column with the crimescore for each row 8associated with the Primary type crime

In [None]:
crime['crimescore'] = crimescore

In [None]:
crime.head()

In [None]:
crimetest = pd.DataFrame()
crimetest['Community Area'] = crime['Community Area']
crimetest['crimescore'] = crime['crimescore']

crimetest.head()

Here we select the community area and the crimescore area to calculate the crimescore for each community area

In [None]:
crime_CA = crimetest.groupby('Community Area').sum()

crime_CA

Here we have the DataFrame interesting us : 

- Community area | Crimescore

With this df, we'll be able to see if there is a correlation between restaurant quality and crime in Chicago

# About this notebook

We saw that the data are not that hard to handle : The data is not that big, and the problems on the format of the different columns is not a problem anymore. However, the missing values on the Wards, the Community Area and the location (long, lat) is still a problem.

This can become a problem, these data without any location information are difficult to place on a map, we will certainly use them during the analysis and plotting phase, but we'll certainly delete them when we'll pin the data up on the map.

With these 3 datasets, we want to compare the global healthiness of public (hospital, ...) facilities vs private facilities (restaurant, ...).

We also want to see if there is differences of the global healthiness of facilites between all the community area. We'll then pin the facilities on a map and delimit healthy and unhealthy community area.

We'll next watch if there are similarities in the healthiness of several restaurants directed by one person (food chain), and if there is a difference between the restaurant of people that handle only one restaurant vs people that handle several restaurants.

We also want to see if there is a correlation between the healthiness of the restaurant and the crimes that take place in the same community area.

All these correlations, differences, similarities will be visualized and mapped.

To compare the healthiness of restaurants (and community area) and the criminality in different community area, we designed a way to get the healthiness score and the crimescore of these different things.

We calculated the healthiness score by taking the several infractions, and the score given in the dataset (to be more precise).

We calculated the crime score based on the minimal prison penalty of each infraction. We took the primary infractions only, this was easier to set up, but can bring some little imprecisions.

# Can the data bring us answer to our questions ?

We think so, All the informations we need to make the different analysis are in the 3 datasets. For the healthiness of the restaurants and in the different community area, all the data are in the food inspections dataframe.

For the analysis about the different owners of different facilities, all the informations needed were processed and are in the license dataset.

And for the crime related analysis, the informations were processed and we have all we need in the crime dataset.

# What comes next

We'll first start to get deeper into the analysis, and maybe come back to modify or arrange the data one last time according to our needs.

When the analysis will be completed, we'll start to get to the visualization : What is the best way to present our results ? Is the map still a good visualization ? How to handle the map ?

Then we'll finally report our results, and start preparing the presentation.