# Where to eat in Chicago: an analysis of the inspections from the Chicago Department of Public Health's Food Protection Program

## 1. Introduction

The Chicago Department of Public Health’s Food Protection Program provides a database containing the inspection reports of restaurants and other food establishments in Chicago from 2010 to the present. It holds information about each establishment, such as its facility type (grocery store, restaurant, coffee shop, …) and its location. It also describes the violations recorded, including the findings that caused them and the reason that prompted the program's staff to conduct the inspection.

In this project we aim to visualize the healthiness of public food establishments according to their facility type, their ward and the date of the inspection. We will also analyze the types of violation according to these three parameters.

The main questions we will answer are:
    - Which wards of Chicago are the most and the least healthy?
    - Which types of facility tend to be less healthy?
    - Did the healthiness of the food in Chicago increase or decrease from 2010 until now?

Further questions may arise during the analysis and will be added to this list.

The purpose of the project is to help consumers choose where to eat in Chicago and to give them an interactive and intuitive way to browse the places available to them. It could also help the Chicago Department of Public Health’s Food Protection Program adapt its methods to the situation revealed by the analysis (for example, by proposing a prevention program for a specific area or type of facility).

#### Assistant's comments :
•	What do you mean with healthiness of the owner (not ward, this is the wrong word), how do you assess this?

In [49]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import requests as req
from bs4 import BeautifulSoup
import seaborn as sns

## 2. Preprocessing

### 2.1 Facilities of interest Selection

First, a quick look at how the dataset is organized.

In [50]:
df = pd.read_csv('food-inspections.csv',sep=',') #creation of the dataframe
df.head(3)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,...,Results,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Children's Services Facility,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,...,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Restaurant,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,...,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,
2,2320986,BABA'S COFFEE,BABA'S COFFEE,2423353.0,Restaurant,Risk 1 (High),5544-5546 N KEDZIE AVE,CHICAGO,IL,60625.0,...,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",,,,,


In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195312 entries, 0 to 195311
Data columns (total 22 columns):
Inspection ID                 195312 non-null int64
DBA Name                      195312 non-null object
AKA Name                      192862 non-null object
License #                     195295 non-null float64
Facility Type                 190535 non-null object
Risk                          195239 non-null object
Address                       195312 non-null object
City                          195173 non-null object
State                         195270 non-null object
Zip                           195261 non-null float64
Inspection Date               195312 non-null object
Inspection Type               195311 non-null object
Results                       195312 non-null object
Violations                    143530 non-null object
Latitude                      194627 non-null float64
Longitude                     194627 non-null float64
Location                      194627 n

The dataset has 195,312 entries and the 22 columns listed above.
First of all, we want to group the different facility types into categories that make sense for our project.

In [52]:
#df['Facility Type'].unique()

When uncommented, this command returns an array containing the distinct facility types found in the `Facility Type` column. The data contain a large number of different facility types.

Our first idea was to select only the "private" establishments where it is possible to eat a main course (for example, places that only serve ice cream are left out of our list). They are grouped into categories so that they can be compared with each other; these categories are stored in the `public_dic` dictionary below.


In [53]:
public_dic = {'restaurant' : ['Restaurant', 'DINING HALL', 'TENT RSTAURANT'], \
              'grocery_restaurant' : ['Grocery & Restaurant', 'GROCERY& RESTAURANT', 'GROCERY/RESTAURANT',\
                                    'GROCERY/ RESTAURANT', 'GROCERY STORE/ RESTAURANT', 'GROCERY & RESTAURANT',\
                                    'RESTAURANT/GROCERY', 'grocery & restaurant', 'RESTAURANT/GROCERY STORE',\
                                    'GROCERY/TAQUERIA', 'GAS STATION/RESTAURANT'],\
              'banquet' : ['LOUNGE/BANQUET HALL', 'BANQUET', 'Banquet Hall', 'BANQUET FACILITY', 'banquet hall',\
                         'banquets', 'Banquet Dining',  'Banquet/kitchen','RESTAURANT.BANQUET HALLS',\
                         'BANQUET HALL', 'Banquet', 'BOWLING LANES/BANQUETS'], \
              'rooftop_restaurant' : ['Wrigley Roof Top', 'REST/ROOFTOP'],\
              'bar_restaurant' : ['RESTAURANT/BAR', 'RESTUARANT AND BAR', 'BAR/GRILL', 'RESTAURANT/BAR/THEATER',\
                                'JUICE AND SALAD BAR', 'SUSHI COUNTER', 'TAVERN/RESTAURANT', 'tavern/restaurant',\
                                'TAVERN GRILL'], \
              'bakery_restaurant' : ['BAKERY/ RESTAURANT', 'bakery/restaurant', 'RESTAURANT/BAKERY'], \
              'liquor_restaurant' : ['RESTAURANT AND LIQUOR', 'RESTAURANT/LIQUOR'], \
              'catering' : ['CATERING/CAFE', 'Catering'], \
              'golden_diner' : ['Golden Diner']}

In [54]:
facilitytype = 'BANQUET'
len(df[df['Facility Type'] == facilitytype])

64

This command returns the number of occurrences of the given `Facility Type`.

After trying several of the types listed in the `public_dic` dictionary, we noted that the counts differed too widely from one type to another to conduct a meaningful analysis. That is why we decided to also select "public" establishments such as school cafeterias and hospitals (the `private_dic` dictionary below). It could be interesting to compare private and public inspection results.
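Rather than querying one facility type at a time, the size of every category can be obtained in one pass. The following is only a minimal sketch on toy data: `toy`, `toy_groups` and `group_sizes` are illustrative names that do not appear in the notebook, and the values are made up.

```python
import pandas as pd

# Toy stand-in for the inspections table (values are illustrative only)
toy = pd.DataFrame({'Facility Type': ['Restaurant', 'BANQUET', 'Restaurant',
                                      'Banquet Hall', 'Golden Diner', None]})

# Toy stand-in for a category dictionary shaped like `public_dic`
toy_groups = {'restaurant': ['Restaurant'],
              'banquet': ['BANQUET', 'Banquet Hall'],
              'golden_diner': ['Golden Diner']}

# Number of rows whose facility type belongs to each category
group_sizes = pd.Series({group: int(toy['Facility Type'].isin(types).sum())
                         for group, types in toy_groups.items()})
print(group_sizes.sort_values(ascending=False))
```

Seeing all category sizes side by side makes it easier to judge which categories are too small to compare.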


In [55]:
private_dic = {'daycare' : ['Daycare Above and Under 2 Years', 'Daycare (2 - 6 Years)', 'Daycare Combo 1586',\
                          'Daycare (Under 2 Years)', 'DAYCARE 2 YRS TO 12 YRS', 'Daycare Night', 'DAY CARE 2-14',\
                          'Daycare (2 Years)', 'DAYCARE', 'ADULT DAYCARE', '15 monts to 5 years old', 'youth housing',\
                          'DAYCARE 1586', 'DAYCARE COMBO', '1584-DAY CARE ABOVE 2 YEARS', 'CHURCH/DAY CARE', 'DAY CARE',\
                          'DAYCARE 6 WKS-5YRS', 'DAY CARE 1023', 'DAYCARE 2-6, UNDER 6', 'Day Care Combo (1586)'], \
               'school' : ['SCHOOL', 'School', 'PRIVATE SCHOOL', 'AFTER SCHOOL PROGRAM', 'COLLEGE',\
                         'BEFORE AND AFTER SCHOOL PROGRAM', 'Private School', 'TEACHING SCHOOL',\
                         'PUBLIC SHCOOL', 'CHARTER SCHOOL CAFETERIA', 'CAFETERIA', 'Cafeteria', 'cafeteria',\
                         'UNIVERSITY CAFETERIA', 'PREP INSIDE SCHOOL', 'CHARTER SCHOOL', 'school cafeteria',\
                         'CHARTER SCHOOL/CAFETERIA', 'School Cafeteria', 'ALTERNATIVE SCHOOL', 'CITY OF CHICAGO COLLEGE',\
                         'after school program', 'CHURCH/AFTER SCHOOL PROGRAM', 'AFTER SCHOOL CARE'], \
               'childrens_services' : ["Children's Services Facility", 'CHILDRENS SERVICES FACILITY', \
                                     "CHILDERN'S SERVICE FACILITY", "1023 CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICES FACILITY", "1023-CHILDREN'S SERVICES FACILITY", \
                                     "1023 CHILDERN'S SERVICE FACILITY", "1023 CHILDERN'S SERVICE S FACILITY", \
                                     'CHILDERN ACTIVITY FACILITY', "CHILDERN'S SERVICES  FACILITY", '1023'], \
               'adultcare' : ['Long Term Care', 'REHAB CENTER', 'Hospital', 'ASSISTED LIVING', 'SENIOR DAY CARE',\
                            'Assisted Living', 'NURSING HOME', 'ASSISTED LIVING FACILITY', 'SUPPORTIVE LIVING FACILITY',\
                            'Assisted Living Senior Care', 'Adult Family Care Center', '1005 NURSING HOME', \
                            'Long-Term Care Facility', 'LONG TERM CARE FACILITY', 'ASSISSTED LIVING',\
                            'Long-Term Care','Long Term Care Facility', 'VFW HALL']}

In [56]:
total_dic = {**public_dic , **private_dic}

In [57]:
#inverting the dict 
facilities = {}
for key in total_dic :
    for facility in total_dic[key] :
        facilities[facility] = key

In [58]:
facilitygroups = pd.DataFrame(data = facilities.values(), index=facilities.keys(), columns = ['FacilityGroup'])
facilitygroups.head()

Unnamed: 0,FacilityGroup
Restaurant,restaurant
DINING HALL,restaurant
TENT RSTAURANT,restaurant
Grocery & Restaurant,grocery_restaurant
GROCERY& RESTAURANT,grocery_restaurant


In [60]:
facilitygroups.index.name = 'Facility Type'

In [66]:
#pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, 
#             right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

eat_seat = pd.merge(df, facilitygroups, on = ['Facility Type'])
eat_seat.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,...,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,FacilityGroup
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Children's Services Facility,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,...,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,,childrens_services
1,2320944,"ACADEMY OF CREATIVE THINKING, INC.",ACADEMY OF CREATIVE THINKING,2308857.0,Children's Services Facility,Risk 1 (High),4200 N Central AVE,CHICAGO,IL,60634.0,...,10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI...,41.956874,-87.767272,"{'longitude': '41.95687428646395', 'latitude':...",,,,,,childrens_services
2,2320896,LITTLE ACHIEVERS ACADEMY,LITTLE ACHIEVERS ACADEMY,2216050.0,Children's Services Facility,Risk 1 (High),3801 W DIVERSEY AVE,CHICAGO,IL,60647.0,...,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.931741,-87.722146,"{'longitude': '41.93174082265093', 'latitude':...",,,,,,childrens_services
3,2320771,DIVERSEY DAY CARE CENTER & KINDERGARTEN,DIVERSEY DAY CARE CENTER & KINDERGARTEN,2215545.0,Children's Services Facility,Risk 1 (High),3155 W DIVERSEY AVE,CHICAGO,IL,60647.0,...,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.931918,-87.70721,"{'longitude': '41.93191841302698', 'latitude':...",,,,,,childrens_services
4,2320762,VIREVA NURSEY SCHOOL,VIREVA NURSEY SCHOOL,2215871.0,Children's Services Facility,Risk 1 (High),1935 W 51ST ST,CHICAGO,IL,60609.0,...,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.801134,-87.673444,"{'longitude': '41.80113414357615', 'latitude':...",,,,,,childrens_services


#### Assistant's comments :
•	Good work on the facility type! But please create a pandas-index and use this instead of your for-loop. This will be much, much faster and more aligned with how pandas is used. In general refrain from looping over a column.
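A loop-free version of the grouping step can be sketched with `Series.map`. The `newcolfromdict` helper called in later cells is not defined in this section, so the function below, deliberately named `map_groups_sketch`, is only a guess at an equivalent; its signature and the `'Not Listed'` default are assumptions inferred from how the helper is used.

```python
import pandas as pd

def map_groups_sketch(frame, groups, source_col, new_col, default='Not Listed'):
    """Hypothetical helper: add `new_col` by looking up `source_col` in `groups`.

    `groups` maps a category name to the list of raw values it covers,
    like `total_dic` in the notebook.
    """
    # Invert {category: [raw values]} into {raw value: category}
    lookup = {raw: category for category, raws in groups.items() for raw in raws}
    out = frame.copy()
    # Vectorised lookup; values missing from the dictionary get the default label
    out[new_col] = out[source_col].map(lookup).fillna(default)
    return out

# Toy data, for illustration only
toy = pd.DataFrame({'Facility Type': ['Restaurant', 'BANQUET', 'Mobile Food Dispenser', None]})
toy_groups = {'restaurant': ['Restaurant'], 'banquet': ['BANQUET', 'Banquet Hall']}
mapped = map_groups_sketch(toy, toy_groups, 'Facility Type', 'Facility Group')
print(mapped)
```

Returning a copy rather than modifying the input keeps the original dataframe unchanged, which makes the cell safe to re-run.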



### 2.2 Columns Cleaning

### A) City Selection

To construct the new dataframe, the `Facility Type` column is dropped because it has been replaced by the `Facility Group` column, and the establishments labeled *Not Listed* are excluded.

The duplicates are dropped.

#### Assistant's comments :
•	You can't just drop all duplicates like that! Specify the column, but also look at the duplicates first!

In [10]:
eat_seat = newcolfromdict(df, total_dic, 'Facility Type', 'Facility Group')
eat_seat = eat_seat.loc[eat_seat['Facility Group'] != 'Not Listed']
eat_seat = eat_seat.drop(columns = ['Facility Type'])

eat_seat = eat_seat.drop_duplicates()

eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Risk,Address,City,State,Zip,Inspection Date,...,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,2019-11-01T00:00:00.000,...,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",,,,,,childrens_services
1,2320918,BEEFSTEAK,BEEFSTEAK,2698445.0,Risk 1 (High),303 E SUPERIOR ST,CHICAGO,IL,60611.0,2019-11-01T00:00:00.000,...,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",,,,,,restaurant
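Before dropping duplicates, it can help to look at them. The sketch below uses a small made-up table (`toy` and the other names are illustrative) to separate rows that are identical in every column from rows that merely share an `Inspection ID`.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'Inspection ID': [101, 101, 102, 103, 103],
    'DBA Name': ['A', 'A', 'B', 'C', 'C'],
    'Results': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail'],
})

# Rows sharing an Inspection ID, shown together for manual review
same_id = toy[toy.duplicated(subset=['Inspection ID'], keep=False)]
print(same_id)

# Rows identical in every column: the only ones a bare drop_duplicates() removes
exact_copies = toy[toy.duplicated(keep='first')]

# Deduplicate on exact copies, then check which IDs still conflict
deduped = toy.drop_duplicates()
conflicting_ids = deduped.loc[deduped.duplicated(subset=['Inspection ID'], keep=False),
                              'Inspection ID'].unique()
print(conflicting_ids)
```

IDs that remain in `conflicting_ids` point to rows that disagree with each other and need a decision rather than a silent drop.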


Since we only care about establishments in Chicago, Illinois, we keep only the rows for this city and then drop the `City` and `State` columns.

We first check the different city names to avoid deleting rows because of misprints.

In [11]:
df['City'].unique()

array(['CHICAGO', nan, 'chicago', 'Chicago', 'GRIFFITH', 'NEW YORK',
       'SCHAUMBURG', 'ELMHURST', 'ALGONQUIN', 'NEW HOLSTEIN', 'CCHICAGO',
       'NILES NILES', 'EVANSTON', 'CHICAGO.', 'CHESTNUT STREET',
       'LANSING', 'CHICAGOCHICAGO', 'WADSWORTH', 'WILMETTE', 'WHEATON',
       'CHICAGOHICAGO', 'ROSEMONT', 'CHicago', 'CALUMET CITY',
       'PLAINFIELD', 'HIGHLAND PARK', 'PALOS PARK', 'ELK GROVE VILLAGE',
       'CICERO', 'BRIDGEVIEW', 'OAK PARK', 'MAYWOOD', 'LAKE BLUFF',
       '312CHICAGO', 'SCHILLER PARK', 'SKOKIE', 'BEDFORD PARK',
       'BANNOCKBURNDEERFIELD', 'CHCICAGO', 'BLOOMINGDALE', 'Norridge',
       'CHARLES A HAYES', 'CHCHICAGO', 'CHICAGOI', 'SUMMIT',
       'OOLYMPIA FIELDS', 'WESTMONT', 'CHICAGO HEIGHTS', 'JUSTICE',
       'TINLEY PARK', 'LOMBARD', 'EAST HAZEL CREST', 'COUNTRY CLUB HILLS',
       'STREAMWOOD', 'BOLINGBROOK', 'INACTIVE', 'BERWYN', 'BURNHAM',
       'DES PLAINES', 'LAKE ZURICH', 'OLYMPIA FIELDS', 'alsip',
       'OAK LAWN', 'BLUE ISLAND', 'GLENCOE',

#### Assistant's comments :
•	When you remove the unwanted cities, make sure that what you're doing is what you want. Chicago Heights and Oak Park are both in Cook County, quite close to the municipality of Chicago... seems odd that you keep one but not the other. Please give a reason for that.

In [12]:
cities = ['CHICAGO','chicago','Chicago','CCHICAGO','CHICAGO.','CHICAGOCHICAGO','CHICAGOHICAGO',\
          'CHicago','312CHICAGO','CHCICAGO','CHCHICAGO','CHICAGOI','CHICAGO HEIGHTS']

eat_seat = eat_seat.loc[eat_seat['City'].isin(cities)]

eat_seat = eat_seat.drop(columns = ['City','State'])
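The list of accepted spellings can be shortened by normalizing case and whitespace first. This is a sketch on toy values, not the notebook's actual filter: `toy_cities` and `chicago_spellings` are illustrative names, and `CHICAGO HEIGHTS` is left out of the accepted set here only to show how a separate municipality would be excluded.

```python
import pandas as pd

# Toy values, for illustration only
toy_cities = pd.Series(['CHICAGO', 'chicago', ' CHicago ', 'CHICAGO.', 'CCHICAGO',
                        'CHICAGO HEIGHTS', 'EVANSTON', None])

# Upper-case and trim so that case variants collapse to a single spelling
normalized = toy_cities.str.upper().str.strip()

# Misspellings accepted as Chicago (a subset of the list used above)
chicago_spellings = {'CHICAGO', 'CHICAGO.', 'CCHICAGO'}

# Missing city names evaluate to False and are therefore not kept
is_chicago = normalized.isin(chicago_spellings)
print(pd.DataFrame({'raw': toy_cities, 'kept': is_chicago}))
```

Printing the raw value next to the decision makes it easy to review which spellings are kept and which are dropped.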

Then we want to check the missing values.

In [13]:
eat_seat.isnull().sum()

Inspection ID                      0
DBA Name                           0
AKA Name                        2416
License #                         17
Risk                              70
Address                            0
Zip                                3
Inspection Date                    0
Inspection Type                    1
Results                            0
Violations                     51550
Latitude                         515
Longitude                        515
Location                         515
Historical Wards 2003-2015    194729
Zip Codes                     194729
Community Areas               194729
Census Tracts                 194729
Wards                         194729
Facility Group                     0
dtype: int64

In [14]:
print(len(eat_seat.index))    ##returns the number of rows of the df

194729


As we can see, the columns `Historical Wards 2003-2015`, `Zip Codes`, `Community Areas`, `Census Tracts` and `Wards` are empty and will be dropped.

We will only be using the `DBA Name` (the name under which the establishment does business; DBA = doing business as), so we drop the `AKA Name` column too.

In [67]:
eat_seat = eat_seat.drop(columns = ['AKA Name','Historical Wards 2003-2015', 'Zip Codes', 'Community Areas',\
                                    'Census Tracts', 'Wards'])
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,FacilityGroup
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,Children's Services Facility,Risk 1 (High),7559 W ADDISON ST,CHICAGO,IL,60634.0,2019-11-01T00:00:00.000,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320944,"ACADEMY OF CREATIVE THINKING, INC.",2308857.0,Children's Services Facility,Risk 1 (High),4200 N Central AVE,CHICAGO,IL,60634.0,2019-11-01T00:00:00.000,Canvass,Pass w/ Conditions,10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI...,41.956874,-87.767272,"{'longitude': '41.95687428646395', 'latitude':...",childrens_services


### B) Inspection date and Risk

Some columns also need adjusting because their format can be simplified:

* In `Inspection Date`, only the date is kept; the time of day is not actually recorded
* In `Risk`, only the number is kept

For the `Risk` column, we first check which risk levels are listed.

#### Assistant's comments :
•	Why don't you parse the Inspection Date to a date?

In [16]:
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x:x.split('T')[0])
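As an alternative to splitting the string, the column can be parsed into real dates with `pd.to_datetime`, which makes year or month comparisons straightforward. A minimal sketch on two sample values in the dataset's format (`raw_dates`, `parsed` and `years` are illustrative names):

```python
import pandas as pd

# Two sample values in the format used by the `Inspection Date` column
raw_dates = pd.Series(['2019-11-01T00:00:00.000', '2014-04-02T00:00:00.000'])

# Parse ISO-8601 strings into datetime64 values
parsed = pd.to_datetime(raw_dates)

# Date components become available through the .dt accessor
years = parsed.dt.year
print(parsed.dt.date.tolist(), years.tolist())
```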

In [17]:
eat_seat.Risk.unique()
#print all types of entries in the column Risk

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', nan, 'All'],
      dtype=object)

We will replace **All** and **High Risk** by *3*, **Medium Risk** by *2* and **Low Risk** by *1*, so that a larger number means a higher risk (the reverse of the numbering in the original labels).

In [18]:
eat_seat['Risk'] = eat_seat['Risk'].replace({'All':3, 'Risk 1 (High)':3, 'Risk 2 (Medium)':2, 'Risk 3 (Low)':1})
eat_seat.head(2)

Unnamed: 0,Inspection ID,DBA Name,License #,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634.0,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611.0,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant


In [19]:
eat_seat = eat_seat.rename(columns={"License #": "License"}) #rename the column 'License #' into 'License'

In [20]:
len(eat_seat.License.unique())

37145

In [21]:
eat_seat.head()

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822.0,3.0,7559 W ADDISON ST,60634.0,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services
1,2320918,BEEFSTEAK,2698445.0,3.0,303 E SUPERIOR ST,60611.0,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant
2,2320986,BABA'S COFFEE,2423353.0,3.0,5544-5546 N KEDZIE AVE,60625.0,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant
3,2320910,J.T.'S GENUINE SANDWICH,2689893.0,3.0,3970 N ELSTON AVE,60618.0,2019-11-01,License,Pass,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant
4,2320904,"KID'Z COLONY DAYCARE, INC.",2215609.0,3.0,6287 S ARCHER AVE,60638.0,2019-11-01,Canvass,Fail,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare


### C) Float to Integer

In [22]:
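def clean_float(floatnumb):
    """Convert a value to int; return 0 when it is missing or not numeric."""
    try:
        return int(float(floatnumb))
    except (TypeError, ValueError, OverflowError):
        # NaN, None, non-numeric strings and infinities all end up here
        return 0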

In [23]:
eat_seat.Risk = eat_seat.Risk.apply(clean_float)    ##guarantees the Risk numbers to be integers

In [24]:
eat_seat.License = eat_seat.License.apply(clean_float)    ##guarantees the License numbers to be integers

In [25]:
eat_seat.Zip = eat_seat.Zip.apply(clean_float)    ##guarantees the zipcodes to be integers

In [26]:
len(eat_seat[eat_seat['Zip'] == 0])

3

The **clean_float()** function returns **0** if the zip code is not convertible into an integer. Here it means that there are 3 missing zip codes in the `Zip` column. 
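A vectorized alternative, sketched here on toy values, is to cast with `pd.to_numeric` and pandas' nullable `Int64` dtype, which keeps missing entries as `<NA>` instead of encoding them as 0. The names `toy_zip`, `as_number` and `as_int` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy values, for illustration only
toy_zip = pd.Series([60634.0, np.nan, 60611.0])

# errors='coerce' turns unparseable entries into NaN instead of raising
as_number = pd.to_numeric(toy_zip, errors='coerce')

# Nullable integer dtype: whole numbers, with <NA> for missing entries
as_int = as_number.astype('Int64')
print(as_int)
print('missing zip codes:', int(as_int.isna().sum()))
```

Keeping missing values explicit avoids confusing a placeholder 0 with a real value in later steps.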

In [27]:
eat_seat[eat_seat['Zip'] == 0].head()

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group
114117,1464217,DUNKIN DONUTS,1515116,2,7545 N PAULINA ST,0,2014-04-02,Canvass,Out of Business,,42.019032,-87.673459,"{'longitude': '42.01903180273219', 'latitude':...",restaurant
153781,1106210,DUNKIN DONUTS,1515116,2,7545 N PAULINA ST,0,2012-04-09,Canvass Re-Inspection,Pass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,42.019032,-87.673459,"{'longitude': '42.01903180273219', 'latitude':...",restaurant
156081,670661,DUNKIN DONUTS,1515116,2,7545 N PAULINA ST,0,2012-02-21,Complaint,Pass w/ Conditions,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",42.019032,-87.673459,"{'longitude': '42.01903180273219', 'latitude':...",restaurant


We can see that the missing zip codes all belong to the same establishment. A quick web search for the address gives the corresponding zip code:
*7545 N Paulina St
Chicago, IL 60626, USA*.

We therefore just have to replace the missing zip codes with 60626.

In [28]:
def clean_zip(zipcode):
    if zipcode == 0 :
        return 60626
    else :
        return zipcode

In [29]:
eat_seat.Zip = eat_seat.Zip.apply(clean_zip)

#### Assistants' comments :
•	NB.: in cell In [29] use df.loc[df['Zip'].eq(0), 'Zip'] = ... instead. Much, much faster and more readable (in this context, all hail pandas).
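The replacement suggested in the comment above can be sketched as follows on a two-row toy table (`toy` is an illustrative name; 0 stands for the placeholder produced by `clean_float`):

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({'Address': ['7545 N PAULINA ST', '303 E SUPERIOR ST'],
                    'Zip': [0, 60611]})

# The boolean mask selects the placeholder rows; .loc assigns to them in place
toy.loc[toy['Zip'].eq(0), 'Zip'] = 60626
print(toy)
```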

### D) Community Areas

We found a file that associates the Chicago zip codes with their community areas. Using it, we can create a new `Community Area` column.

#### Assistants' comments :
•	In D) Community Areas provide a source (e.g. URL) for that csv-file.

In [30]:
zip_to_area = pd.read_csv('ZipCode_to_ComArea.csv',sep=',') ##creation of the dataframe
zip_to_area = zip_to_area.drop(columns = ['TOT2010'])

zip_to_area.ZipCode = zip_to_area.ZipCode.apply(clean_float) ##guarantees the zipcodes to be integers

zip_to_area = zip_to_area.groupby('ComArea')['ZipCode'].apply(list)    ##groups the zipcodes by community area number
zip_to_area = zip_to_area.reset_index()

#### Assistants' comments :
•	If you want to cast columns to a certain type, prefer `pandas.to_numeric`; same for using findall (`pandas.Series.str.findall`).

In [31]:
zip_dic = zip_to_area.set_index('ComArea')['ZipCode'].to_dict()

In [32]:
eat_seat = newcolfromdict(eat_seat, zip_dic, 'Zip', 'Community Area')
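One thing worth checking with this kind of mapping is whether a zip code is associated with more than one community area, since a single `Community Area` value per inspection would then be ambiguous. The sketch below assumes the file has one row per (`ZipCode`, `ComArea`) pair; `toy_pairs` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the zip-to-area file (values are illustrative only)
toy_pairs = pd.DataFrame({'ZipCode': [60611, 60611, 60634, 60625],
                          'ComArea': [8, 32, 17, 14]})

# Number of distinct community areas per zip code
areas_per_zip = toy_pairs.groupby('ZipCode')['ComArea'].nunique()

# Zip codes for which a single 'Community Area' value is ambiguous
ambiguous = areas_per_zip[areas_per_zip > 1].index.tolist()
print(ambiguous)
```

If the list is not empty, the area assigned to those zip codes depends on which pair is read last, so the result should be interpreted with care.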

In [33]:
eat_seat.head(3)

Unnamed: 0,Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area
0,2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76
1,2320918,BEEFSTEAK,2698445,3,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8
2,2320986,BABA'S COFFEE,2423353,3,5544-5546 N KEDZIE AVE,60625,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant,14


### 2.3 The Violations Column

The Food Code rules changed on 1 July 2018. After investigating those changes, it seems that only the names of the violations changed, not the violations themselves, and that a few additional violations were added to the list. These changes therefore need no extra processing, and all violations can be treated as one common list.

In [34]:
len(eat_seat.Violations.unique())

142356

In [35]:
eat_seat.Violations[0]

'5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: OBSERVED INCOMPLETE CLEAN UP KIT ON SITE.INSTRUCTED MANAGEMENT TO HAVE A COMPLETE CLEAN UP KIT BY NEXT INSPECTION.CLEAN UP PROCEDURES  ON SITE PRIORITY FOUNDATION VIOLATION  7-38-005   NO CITATION ISSUED | 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: OBSERVED EXPOSED HAND SINK NOT ACCESSIBLE FOR HANDWASHING IN THE FOOD PREP AREA.INSTRUCTED MANAGEMENT TO INSTALL EXPOSED HAND SINK AT A LOCATION THAT IS ACCESSIBLE. PRIORITY FOUNDATION VIOLATION 7-38-030(c) NO CITATION ISSUED | 56. ADEQUATE VENTILATION & LIGHTING; DESIGNATED AREAS USED - Comments: OBSERVED HOT LINE HOOD FILTERS DUSTY.INSTRUCTED MANAGEMENT TO CLEAN HOT LINE HOOD FILTERS.'

Every entry of the `Violations` column is unique because it contains not only the violation types but also the inspectors' comments. Each entry combines three pieces of information:
- Violation number
- Violation type
- Violation comments

Every violation cell appears to be structured this way:
"number of the violation". "TYPE OF THE VIOLATION" - Comments: "comments of the inspector" (this pattern is repeated once for each violation found on the day of the inspection, separated by a vertical bar)

We only want to keep the violation number, because the title of each violation can be looked up online and we do not plan to use the inspectors' comments. We create a column `NumberViolations` containing the numbers of the violations found during the corresponding inspection.

This cleaning can only be done for the rows where the `Violations` field is not empty, so we temporarily set the other rows aside.

In [36]:
temp = eat_seat.dropna(subset=['Violations'], axis = 0, how = 'all')

In [37]:
violations = temp.apply(lambda row: re.findall(r'\|\s([0-9]+)[.]', str(row['Violations'])), axis = 1)

In [38]:
first_violations = temp.apply(lambda row: row['Violations'].split('.')[0], axis = 1)

In [39]:
for violation, first_violation in zip(violations, first_violations):
    violation.append(first_violation)

flat_list = [item for sublist in violations for item in sublist]
unique, counts = np.unique(flat_list, return_counts=True)

In [40]:
temp = temp.assign(NumberViolations = violations)

In [41]:
temp = temp[['Inspection ID', 'NumberViolations']]
temp.head()

Unnamed: 0,Inspection ID,NumberViolations
0,2320971,"[10, 56, 5]"
1,2320918,"[55, 39]"
3,2320910,"[53, 58, 51]"
4,2320904,[16]
5,2320969,"[38, 55, 55, 58, 60, 38]"
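The two extraction steps above (numbers after a `|`, plus the first number) can also be sketched as a single pattern, which keeps the numbers in their original order. This assumes every cell follows the format described earlier; `VIOLATION_NUMBER`, `sample` and `column` are illustrative names, and the sample is a shortened version of the cell shown above.

```python
import re
import pandas as pd

# Number at the very start of the cell, or right after a '| ' separator
VIOLATION_NUMBER = r'(?:^|\|\s)(\d+)\.'

sample = ('5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: OBSERVED '
          '| 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: OBSERVED '
          '| 56. ADEQUATE VENTILATION & LIGHTING; DESIGNATED AREAS USED - Comments: OBSERVED')

numbers = re.findall(VIOLATION_NUMBER, sample)
print(numbers)

# The same pattern applied to a whole column; missing cells stay missing
column = pd.Series([sample, None])
numbers_per_row = column.str.findall(VIOLATION_NUMBER)
print(numbers_per_row)
```

Because `Series.str.findall` leaves missing cells untouched, the temporary `dropna` step would not be needed with this variant.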


In [42]:
len(eat_seat)

194729

Now that we have a dataframe with the ID of every inspection where violations were found and a column containing the list of those violations, we can add it to the primary dataframe.

In [43]:
# `on` alone selects the key; recent pandas versions may reject combining it with left_index/right_index
eat_seat = pd.merge(eat_seat, temp, how='left', on='Inspection ID')

#### Assistants' comments :
•	Very good for using Inspection ID as the index! But you should have checked if it is truly unique first.
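The uniqueness check mentioned in the comment above could look like the following sketch, which uses toy tables (`left` and `right` are illustrative names). `Series.is_unique` gives a quick answer, and the `validate` argument of `pd.merge` makes the merge itself fail loudly if the key turns out to be duplicated.

```python
import pandas as pd

# Toy data, for illustration only
left = pd.DataFrame({'Inspection ID': [1, 2, 3], 'Results': ['Pass', 'Fail', 'Pass']})
right = pd.DataFrame({'Inspection ID': [1, 3], 'NumberViolations': [['5'], ['16', '38']]})

# Quick check before using the column as a key or an index
ids_are_unique = left['Inspection ID'].is_unique

# validate= makes merge raise MergeError if the key is duplicated on either side
merged = pd.merge(left, right, how='left', on='Inspection ID', validate='one_to_one')
print(ids_are_unique, len(merged))
```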

In [44]:
eat_seat = eat_seat.set_index(['Inspection ID']) #redefines the index
eat_seat.head()

Inspection ID,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,NumberViolations
2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76,"[10, 56, 5]"
2320918,BEEFSTEAK,2698445,3,303 E SUPERIOR ST,60611,2019-11-01,License,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8,"[55, 39]"
2320986,BABA'S COFFEE,2423353,3,5544-5546 N KEDZIE AVE,60625,2019-11-01,Canvass,No Entry,,41.982582,-87.708996,"{'longitude': '41.98258181784537', 'latitude':...",restaurant,14,
2320910,J.T.'S GENUINE SANDWICH,2689893,3,3970 N ELSTON AVE,60618,2019-11-01,License,Pass,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22,"[53, 58, 51]"
2320904,"KID'Z COLONY DAYCARE, INC.",2215609,3,6287 S ARCHER AVE,60638,2019-11-01,Canvass,Fail,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64,[16]


In [45]:
len(eat_seat)

194729

### 2.4 The Inspection Results

In [46]:
eat_seat.isnull().sum()

DBA Name                0
License                 0
Risk                    0
Address                 0
Zip                     0
Inspection Date         0
Inspection Type         1
Results                 0
Violations          51550
Latitude              515
Longitude             515
Location              515
Facility Group          0
Community Area          0
NumberViolations    51550
dtype: int64

We see that there are more than 50,000 rows where the `Violations` column is empty. We have to determine whether those cells are empty because there were no violations (meaning the establishment is healthy) or because the inspection could not be carried out (meaning we can drop the row because it cannot be used in our research).

In [47]:
eat_seat.Results.unique()

array(['Pass w/ Conditions', 'Pass', 'No Entry', 'Fail',
       'Out of Business', 'Not Ready', 'Business Not Located'],
      dtype=object)

We create a separate dataframe for each value of the `Results` column in order to study them.

In [48]:
noentry = eat_seat[eat_seat['Results']=='No Entry']

In [49]:
outofbusiness = eat_seat[eat_seat['Results']=='Out of Business']

In [50]:
notready = eat_seat[eat_seat['Results']=='Not Ready']

In [51]:
businessnotlocated = eat_seat[eat_seat['Results']=='Business Not Located']

In [52]:
passwithconditions = eat_seat[eat_seat['Results']=='Pass w/ Conditions']

In [53]:
passed = eat_seat[eat_seat['Results']=='Pass']

In [54]:
fail = eat_seat[eat_seat['Results']=='Fail']

We check, for each result type, how many rows have an empty `Violations` column.

In [55]:
results_dic = {'No Entry' : noentry, 'Out of Business' : outofbusiness, 'Not Ready' : notready,\
               'Business Not Located' : businessnotlocated, 'Pass With Conditions' : passwithconditions, 'Pass' : passed, 'Fail' : fail}

In [56]:
for name, result in results_dic.items() :
    print(name, ':', len(result[result['Violations'].isnull()]), 'empty Violation columns /', len(result),\
          'columns =', (len(result[result['Violations'].isnull()])/len(result)), '\n')

No Entry : 5768 empty Violation columns / 6199 columns = 0.9304726568801419 

Out of Business : 16728 empty Violation columns / 16757 columns = 0.9982693799606135 

Not Ready : 1799 empty Violation columns / 1849 columns = 0.9729583558680368 

Business Not Located : 65 empty Violation columns / 65 columns = 1.0 

Pass With Conditions : 444 empty Violation columns / 26847 columns = 0.016538160688345068 

Pass : 23676 empty Violation columns / 105356 columns = 0.2247237936140324 

Fail : 3070 empty Violation columns / 37656 columns = 0.08152751221584874 
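
The same per-result summary can also be obtained in one step with a `groupby`. The following is only a sketch on a small made-up dataframe (`toy` and its values are invented for illustration; only the column names `Results` and `Violations` come from our dataset):

```python
import pandas as pd

# Toy data standing in for eat_seat (the values are made up for illustration).
toy = pd.DataFrame({
    'Results': ['Pass', 'Pass', 'Fail', 'No Entry', 'No Entry'],
    'Violations': ['32. FOOD ...', None, '16. FOOD ...', None, None],
})

# For each result type: number of empty Violations cells ('sum'),
# number of rows ('count') and share of empty cells ('mean').
summary = (toy['Violations'].isnull()
           .groupby(toy['Results'])
           .agg(['sum', 'count', 'mean']))
print(summary)
```

This avoids creating one dataframe per result type when only the counts are needed.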



We see that almost every entry whose result is 'No Entry', 'Out of Business', 'Not Ready' or 'Business Not Located' has an empty Violations field. We can safely drop those rows because they are not relevant to our research.

In [57]:
results = ['Pass', 'Pass w/ Conditions', 'Fail']

In [58]:
eat_seat = eat_seat.loc[eat_seat['Results'].isin(results)]
len(eat_seat)

169859

Now we have to take care of the rows where the Violations field is empty and the result is either Pass, Fail or Pass with conditions.

When the result is Pass and the Violations field is empty, we can put a 0 in the column "NumberViolations".

In [59]:
eat_seat['NumberViolations'] = eat_seat['NumberViolations'].fillna(0)

In [60]:
len(eat_seat)

169859

When the result is either Fail or Pass with conditions but the Violations field is empty, we drop the row because values are missing. An establishment cannot fail an inspection or receive conditions when no violation is found, so those entries are inconsistent and cannot be taken into account in our research.

In [61]:
results = ['Pass w/ Conditions', 'Fail']
#rows without any violation whose result is Fail or Pass w/ Conditions are inconsistent
EmptyViolations = eat_seat[eat_seat['NumberViolations'] == 0]
EmptyinResults = EmptyViolations.loc[EmptyViolations['Results'].isin(results)]
indexes = EmptyinResults.index
eat_seat = eat_seat.drop(labels = indexes)

In [62]:
len(eat_seat)

166345
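
The filtering above can also be written as a single boolean mask. This is a minimal sketch on invented data; `toy`, `n_violations` and `inconsistent` are hypothetical names used only for this example, and the index plays the role of the Inspection ID:

```python
import pandas as pd

# Toy data (made up): NumberViolations is either 0 or a list of violation codes.
toy = pd.DataFrame({
    'Results': ['Pass', 'Fail', 'Pass w/ Conditions', 'Fail'],
    'NumberViolations': [0, 0, 0, [16]],
}, index=[101, 102, 103, 104])

# Number of violations per row: length of the list, or 0 when there is no list.
n_violations = toy['NumberViolations'].apply(
    lambda v: len(v) if isinstance(v, list) else 0)

# A row is inconsistent when it has no violation but did not simply pass.
inconsistent = (n_violations == 0) & toy['Results'].isin(['Pass w/ Conditions', 'Fail'])
cleaned = toy[~inconsistent]
print(cleaned)
```

Keeping the mask in a variable makes it easy to inspect the rows that are about to be dropped before actually dropping them.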

Now we can replace Pass with 1, Pass w/ Conditions with 2, and Fail with 3 (this encoding will be used to compute the healthiness score).

In [63]:
eat_seat['Results'].unique()

array(['Pass w/ Conditions', 'Pass', 'Fail'], dtype=object)

In [64]:
eat_seat['Results'] = eat_seat['Results'].replace({'Fail':3, 'Pass w/ Conditions':2, 'Pass':1})


In [65]:
eat_seat.head()

Unnamed: 0_level_0,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,NumberViolations
Inspection ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,2,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76,"[10, 56, 5]"
2320918,BEEFSTEAK,2698445,3,303 E SUPERIOR ST,60611,2019-11-01,License,1,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8,"[55, 39]"
2320910,J.T.'S GENUINE SANDWICH,2689893,3,3970 N ELSTON AVE,60618,2019-11-01,License,1,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22,"[53, 58, 51]"
2320904,"KID'Z COLONY DAYCARE, INC.",2215609,3,6287 S ARCHER AVE,60638,2019-11-01,Canvass,3,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64,[16]
2320969,JUST A PIZZA PLUS INC,75583,3,5136 S ARCHER AVE,60632,2019-11-01,Complaint,3,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.800619,-87.731143,"{'longitude': '41.8006193315046', 'latitude': ...",restaurant,63,"[38, 55, 55, 58, 60, 38]"


In [66]:
eat_seat.Results = eat_seat.Results.apply(clean_float) 

Now we can compute the healthiness score of each inspection by multiplying the Results score by the number of violations :


In [67]:
#add 0 in the new column InspectionScore for every row
eat_seat['InspectionScore'] = 0

In [68]:
#add the multiplication for the rows where NumberViolations != 0
for ID in eat_seat.index :
    if eat_seat.at[ID, 'NumberViolations'] != 0 :
        eat_seat.at[ID, 'InspectionScore'] = len(eat_seat.at[ID, 'NumberViolations']) * eat_seat.at[ID, 'Results']
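
The row-by-row loop can be slow on a dataframe of this size. As a sketch only, here is a column-wise version of the same computation on a small made-up dataframe (`toy` and `counts` are hypothetical names; `NumberViolations` is assumed to hold either 0 or a list of violation codes, as in our dataset):

```python
import pandas as pd

# Toy data (made up), with the same layout as the two columns used above.
toy = pd.DataFrame({
    'Results': [2, 1, 3, 1],
    'NumberViolations': [[10, 56, 5], [55, 39], [16], 0],
})

# Count the violations of each inspection (0 when the cell is not a list).
counts = toy['NumberViolations'].apply(
    lambda v: len(v) if isinstance(v, list) else 0)

# InspectionScore = number of violations * Results score, computed for all rows at once.
toy['InspectionScore'] = counts * toy['Results']
print(toy)
```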

In [69]:
eat_seat.head()

Unnamed: 0_level_0,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,NumberViolations,InspectionScore
Inspection ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,2,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76,"[10, 56, 5]",6
2320918,BEEFSTEAK,2698445,3,303 E SUPERIOR ST,60611,2019-11-01,License,1,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8,"[55, 39]",2
2320910,J.T.'S GENUINE SANDWICH,2689893,3,3970 N ELSTON AVE,60618,2019-11-01,License,1,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22,"[53, 58, 51]",3
2320904,"KID'Z COLONY DAYCARE, INC.",2215609,3,6287 S ARCHER AVE,60638,2019-11-01,Canvass,3,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64,[16],3
2320969,JUST A PIZZA PLUS INC,75583,3,5136 S ARCHER AVE,60632,2019-11-01,Complaint,3,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.800619,-87.731143,"{'longitude': '41.8006193315046', 'latitude': ...",restaurant,63,"[38, 55, 55, 58, 60, 38]",18


### 3. Computation of a healthiness score per facility

Now we want to sum every InspectionScore of an establishment and divide the total by the number of inspections of this establishment in order to compute a score per establishment. This score will be stored in a new column FacilityScore.

**NB :** This is a first attempt at analysing the data. We know that this score is not fair, because an establishment could have been very bad in the past and be very healthy now, but we could use it in another formula later (to compute the progress of an establishment, for example).

#### Assistants' comments :
•	NB.: Please read this in order to understand the difference between df.loc[...] and df[...]. When in doubt use .loc.
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

•	Your method of removing duplicate restaurants assumes that license numbers are not reused, check that instead of assuming it.
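
One way to check this assumption is to count how many distinct names and addresses appear under each license number. The snippet below is only a sketch on invented data (`toy`, `per_license` and `reused` are hypothetical names, and the establishments are placeholders); on the real data it would be applied to `eat_seat`:

```python
import pandas as pd

# Toy data (made up): license 20 is used by two different establishments.
toy = pd.DataFrame({
    'License': [10, 10, 20, 20],
    'DBA Name': ['CAFE A', 'CAFE A', 'DAYCARE B', 'BAKERY C'],
    'Address': ['1 FIRST ST', '1 FIRST ST', '2 SECOND ST', '3 THIRD ST'],
})

# Number of distinct names and addresses per license number.
per_license = toy.groupby('License')[['DBA Name', 'Address']].nunique()

# Licenses that refer to more than one name or more than one address.
reused = per_license[(per_license > 1).any(axis=1)]
print(reused)
```

An empty `reused` dataframe would support the assumption; otherwise the listed licenses need a closer look before duplicates are dropped.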

In [70]:
#keeping just one occurrence of License
df_restaurants = eat_seat.drop_duplicates(subset=['License'])
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,NumberViolations,InspectionScore
Inspection ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,2,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76,"[10, 56, 5]",6
2320918,BEEFSTEAK,2698445,3,303 E SUPERIOR ST,60611,2019-11-01,License,1,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8,"[55, 39]",2
2320910,J.T.'S GENUINE SANDWICH,2689893,3,3970 N ELSTON AVE,60618,2019-11-01,License,1,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22,"[53, 58, 51]",3
2320904,"KID'Z COLONY DAYCARE, INC.",2215609,3,6287 S ARCHER AVE,60638,2019-11-01,Canvass,3,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64,[16],3
2320969,JUST A PIZZA PLUS INC,75583,3,5136 S ARCHER AVE,60632,2019-11-01,Complaint,3,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.800619,-87.731143,"{'longitude': '41.8006193315046', 'latitude': ...",restaurant,63,"[38, 55, 55, 58, 60, 38]",18


In [71]:
df_restaurants = df_restaurants.set_index(['License']) #redefines the Index

In [72]:
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,NumberViolations,InspectionScore
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2589822,JUMPSTART EARLY LEARNING ACADEMY,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,2,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76,"[10, 56, 5]",6
2698445,BEEFSTEAK,3,303 E SUPERIOR ST,60611,2019-11-01,License,1,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8,"[55, 39]",2
2689893,J.T.'S GENUINE SANDWICH,3,3970 N ELSTON AVE,60618,2019-11-01,License,1,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22,"[53, 58, 51]",3
2215609,"KID'Z COLONY DAYCARE, INC.",3,6287 S ARCHER AVE,60638,2019-11-01,Canvass,3,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64,[16],3
75583,JUST A PIZZA PLUS INC,3,5136 S ARCHER AVE,60632,2019-11-01,Complaint,3,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.800619,-87.731143,"{'longitude': '41.8006193315046', 'latitude': ...",restaurant,63,"[38, 55, 55, 58, 60, 38]",18


In [73]:
#drop columns that give info about the inspection and not the facility
df_restaurants = df_restaurants.drop(columns = ['Inspection Date', 'Inspection Type', 'Results', 'NumberViolations', 'InspectionScore'])
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Violations,Latitude,Longitude,Location,Facility Group,Community Area
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2589822,JUMPSTART EARLY LEARNING ACADEMY,3,7559 W ADDISON ST,60634,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76
2698445,BEEFSTEAK,3,303 E SUPERIOR ST,60611,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8
2689893,J.T.'S GENUINE SANDWICH,3,3970 N ELSTON AVE,60618,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22
2215609,"KID'Z COLONY DAYCARE, INC.",3,6287 S ARCHER AVE,60638,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64
75583,JUST A PIZZA PLUS INC,3,5136 S ARCHER AVE,60632,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.800619,-87.731143,"{'longitude': '41.8006193315046', 'latitude': ...",restaurant,63


Now we have a dataframe with every facility listed in our preprocessed database. We want to create a column containing the sum of all the InspectionScore values of each license.

In [74]:
#intended: add up the InspectionScore values of the primary dataset per license number

dic = {}

for license in df_restaurants.index :
    score = 0
    for row in eat_seat[eat_seat['License']==license].index :
        score = 0 + eat_seat.at[row, 'InspectionScore']
    dic[license] = score

#### Assistants' comments :
•	in cell In [74] you have a mistake... check that.
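
One possible reading of this remark (our interpretation, not confirmed by the assistants) is that `score = 0 + ...` overwrites the accumulator on every row, so only the last inspection of each license is kept. The sketch below, on made-up data, shows a running total and the equivalent `groupby`; `toy`, `totals` and `per_license` are hypothetical names:

```python
import pandas as pd

# Toy data (made up): license 1 has three inspections, license 2 has one.
toy = pd.DataFrame({
    'License': [1, 1, 1, 2],
    'InspectionScore': [6, 0, 3, 2],
})

# Running total: the accumulator is added to itself for every inspection.
totals = {}
for lic in toy['License'].unique():
    score = 0
    for value in toy.loc[toy['License'] == lic, 'InspectionScore']:
        score = score + value
    totals[lic] = score

# Same totals, plus the number of inspections, without explicit loops.
per_license = toy.groupby('License')['InspectionScore'].agg(['sum', 'count'])
print(per_license)
```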

In [75]:
#add this number in the new database
df_restaurants['FacilityScore'] = dic.values()

In [76]:
#compute the number of inspections in the primary dataset per license number

dic = {}

for license in df_restaurants.index :
    score = len(eat_seat[eat_seat['License'] == license])
    dic[license] = score

In [77]:
#add this number in the new database
df_restaurants['NumberOfInspections'] = dic.values()

In [78]:
df_restaurants.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Violations,Latitude,Longitude,Location,Facility Group,Community Area,FacilityScore,NumberOfInspections
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2589822,JUMPSTART EARLY LEARNING ACADEMY,3,7559 W ADDISON ST,60634,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76,39,3
2698445,BEEFSTEAK,3,303 E SUPERIOR ST,60611,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8,2,1
2689893,J.T.'S GENUINE SANDWICH,3,3970 N ELSTON AVE,60618,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22,3,1
2215609,"KID'Z COLONY DAYCARE, INC.",3,6287 S ARCHER AVE,60638,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64,0,7
75583,JUST A PIZZA PLUS INC,3,5136 S ARCHER AVE,60632,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.800619,-87.731143,"{'longitude': '41.8006193315046', 'latitude': ...",restaurant,63,30,15


In [79]:
#only keep the new columns
df_restaurants = df_restaurants.drop(columns = ['DBA Name', 'Risk', 'Address', 'Zip', 'Latitude', 'Longitude', 'Facility Group', 
                                                'Community Area', 'Violations', 'Location'])
df_restaurants.head()

Unnamed: 0_level_0,FacilityScore,NumberOfInspections
License,Unnamed: 1_level_1,Unnamed: 2_level_1
2589822,39,3
2698445,2,1
2689893,3,1
2215609,0,7
75583,30,15


In [80]:
dic = {}


for inspection in eat_seat.index :
    license = eat_seat.loc[inspection, 'License']
    facilityscore = df_restaurants.loc[license, 'FacilityScore'] / df_restaurants.loc[license, 'NumberOfInspections']
    dic[inspection] = facilityscore
    

In [81]:
eat_seat['FacilityScore'] = dic.values()
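
The per-facility average described in the text (sum of the inspection scores divided by the number of inspections) can also be attached to every inspection row directly with `groupby(...).transform`. This is a sketch on invented data, not the computation that produced the outputs shown in this notebook; `toy` is a hypothetical name:

```python
import pandas as pd

# Toy data (made up); the index plays the role of the Inspection ID.
toy = pd.DataFrame({
    'License': [1, 1, 1, 2],
    'InspectionScore': [6, 0, 3, 2],
}, index=[11, 12, 13, 14])

# transform('mean') returns one value per row, aligned with the original index,
# so every inspection receives the average score of its license.
toy['FacilityScore'] = toy.groupby('License')['InspectionScore'].transform('mean')
print(toy)
```

Because the result is aligned on the index, no intermediate dictionary is needed and the row order cannot get mixed up.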

In [82]:
eat_seat.head()

Unnamed: 0_level_0,DBA Name,License,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,NumberViolations,InspectionScore,FacilityScore
Inspection ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2320971,JUMPSTART EARLY LEARNING ACADEMY,2589822,3,7559 W ADDISON ST,60634,2019-11-01,Canvass,2,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...,41.945065,-87.816734,"{'longitude': '41.945064857019986', 'latitude'...",childrens_services,76,"[10, 56, 5]",6,13.0
2320918,BEEFSTEAK,2698445,3,303 E SUPERIOR ST,60611,2019-11-01,License,1,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.895692,-87.620143,"{'longitude': '41.895692401410514', 'latitude'...",restaurant,8,"[55, 39]",2,2.0
2320910,J.T.'S GENUINE SANDWICH,2689893,3,3970 N ELSTON AVE,60618,2019-11-01,License,1,51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE...,41.953378,-87.718848,"{'longitude': '41.95337788158545', 'latitude':...",restaurant,22,"[53, 58, 51]",3,3.0
2320904,"KID'Z COLONY DAYCARE, INC.",2215609,3,6287 S ARCHER AVE,60638,2019-11-01,Canvass,3,16. FOOD-CONTACT SURFACES: CLEANED & SANITIZED...,41.793235,-87.777776,"{'longitude': '41.7932347787373', 'latitude': ...",daycare,64,[16],3,0.0
2320969,JUST A PIZZA PLUS INC,75583,3,5136 S ARCHER AVE,60632,2019-11-01,Complaint,3,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.800619,-87.731143,"{'longitude': '41.8006193315046', 'latitude': ...",restaurant,63,"[38, 55, 55, 58, 60, 38]",18,2.0


## 4.1 Additional Dataset - BUSINESS LICENSES/OWNERS

We found two datasets on Kaggle gathering the business licenses and the business owners of Chicago. It could be interesting to observe the results of different establishments owned by the same person.

### A) Business Licenses

The first dataset contains the details of every licensed establishment. There are a lot of columns, but the only ones of interest to us are :
- the `license id` column to have a link with the *chicago food inspections* dataset
- the `account number` column to have a link with the *business owners* dataset
- the `police district` column in case we want to have a link with the *crime* dataset

In [81]:
licenses = pd.read_csv('business-licenses.csv', sep=',') #creation of the dataframe
licenses = licenses.rename(str.lower, axis='columns')
licenses = licenses.drop(columns = ['city', 'state', 'id', 'precinct', 'ward precinct', 'business activity id',\
                                'license number', 'application type', 'application created date',\
                                'application requirements complete', 'payment date', 'conditional approval',\
                                'license term start date', 'license term expiration date', 'license approved for issuance',\
                                'date issued', 'license status', 'license status change date', 'ssa',\
                                'historical wards 2003-2015', 'zip codes', 'wards', 'census tracts', 'location',\
                                'license code', 'license description', 'business activity', 'site number',\
                               'zip code', 'latitude', 'longitude', 'address', 'legal name', 'doing business as name',\
                                'community areas', 'ward'])
licenses['police district'] = licenses['police district'].apply(clean_float) 

licenses.head(3)



Unnamed: 0,license id,account number,police district
0,1480073,1,1
1,1278029,1,1
2,1337924,1,1


In [82]:
licenses = licenses.set_index('account number')
licenses.head(3)

Unnamed: 0_level_0,license id,police district
account number,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1480073,1
1,1278029,1
1,1337924,1


In [83]:
print(len(licenses.index))

989790


### B) Business Owners

The second dataset contains the details of every license owner. We have decided to keep the following columns :
- the `account number` column to have a link with the *business licenses* dataset
- the `owner first name` and the `owner last name` columns in order to create a `full name` column

In [84]:
owners = pd.read_csv('business-owners.csv',sep=',') #creation of the dataframe
owners = owners.rename(str.lower, axis='columns')
owners = owners.drop(columns = ['suffix', 'legal entity owner', 'owner middle initial', 'legal name', 'title'])
owners.head(3)

Unnamed: 0,account number,owner first name,owner last name
0,373231,GUY,SELLARS
1,203002,NELCY,SANTANA
2,338012,GREGORY,EDINGBURG


In [85]:
owners['full name'] = owners['owner first name'] + ' ' + owners['owner last name']
owners = owners.drop(columns = ['owner first name', 'owner last name'])

A single `full name` column is enough to link the licenses to the owners.

In [86]:
owners.head(3)

Unnamed: 0,account number,full name
0,373231,GUY SELLARS
1,203002,NELCY SANTANA
2,338012,GREGORY EDINGBURG


In [87]:
len(owners['account number'].unique())

167992

In [88]:
len(owners['full name'].unique())

192354

Here we can see that the number of accounts is not the same as the number of full names. For now, we assume that one account can be shared by several people (for example, partners owning a business together).

In [89]:
owners = pd.DataFrame(owners.groupby('account number')['full name'].apply(list))

In [90]:
owners.head()

Unnamed: 0_level_0,full name
account number,Unnamed: 1_level_1
1,"[PETER BERGHOFF, PETER BERGHOFF, HERMAN BERGHOFF]"
2,"[nan, HERMAN BERGHOFF, PETER BERGHOFF, nan, EI..."
4,[LAWRENCE PRICE]
6,[JOHN SCHALLER]
8,"[CHRISTINE MAIR, CHRISTINE MAIR]"


We can observe that the lists of the `full name` column contain duplicates and 'nan' values. A function *clean_list* can be defined to clean them.

In [91]:
def clean_list(liste) :
    #keep each name once, in order of first appearance, and skip non-string values such as nan
    cleaned = []
    for element in liste :
        if isinstance(element, str) and element not in cleaned :
            cleaned.append(element)
    return cleaned

In [92]:
owners['full name'] = owners['full name'].apply(clean_list)
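
As an alternative sketch, the missing values and the duplicates could be removed before grouping, so that no cleaning function is needed afterwards. The data below is invented and `toy` and `names` are hypothetical names. Note one difference: an account whose names are all missing disappears here, whereas *clean_list* keeps it with an empty list.

```python
import pandas as pd

# Toy data (made up) with a duplicate name and a missing name.
toy = pd.DataFrame({
    'account number': [1, 1, 1, 2, 2],
    'full name': ['A ONE', 'A ONE', 'B TWO', None, 'C THREE'],
})

# Drop missing names and repeated (account, name) pairs, then build one list per account.
names = (toy.dropna(subset=['full name'])
            .drop_duplicates()
            .groupby('account number')['full name']
            .apply(list))
print(names)
```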

In [93]:
owners.head()

Unnamed: 0_level_0,full name
account number,Unnamed: 1_level_1
1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
2,"[HERMAN BERGHOFF, PETER BERGHOFF, EILEEN GORMA..."
4,[LAWRENCE PRICE]
6,[JOHN SCHALLER]
8,[CHRISTINE MAIR]


### C) Business Licenses-Owners

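Since both the *licenses* and the *owners* dataframes are now indexed by `account number`, we can merge them on their indexes (an inner join by default, so accounts missing from either dataframe are dropped).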

In [94]:
business = pd.merge(licenses, owners, right_index = True, left_index = True)
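
To see how many rows such a merge discards, an outer join with `indicator=True` can be run first. The following is a sketch on made-up data; `licenses_toy`, `owners_toy` and `check` are hypothetical names:

```python
import pandas as pd

# Toy data (made up): account 3 has no owner, account 2 has no license.
licenses_toy = pd.DataFrame(
    {'license id': [10, 11, 12]},
    index=pd.Index([1, 1, 3], name='account number'))
owners_toy = pd.DataFrame(
    {'full name': ['A ONE', 'B TWO']},
    index=pd.Index([1, 2], name='account number'))

# The '_merge' column tells where each row comes from:
# 'both', 'left_only' (license without owner) or 'right_only' (owner without license).
check = pd.merge(licenses_toy, owners_toy, left_index=True, right_index=True,
                 how='outer', indicator=True)
print(check['_merge'].value_counts())
```

Only the rows marked `both` survive the default inner join used above.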

In [95]:
business.head()

Unnamed: 0_level_0,license id,police district,full name
account number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1480073,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1278029,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1337924,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1480076,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1404362,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"


In [96]:
business.reset_index(level=0, inplace=True)

In [97]:
business.head()

Unnamed: 0,account number,license id,police district,full name
0,1,1480073,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
1,1,1278029,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
2,1,1337924,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
3,1,1480076,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"
4,1,1404362,1,"[PETER BERGHOFF, HERMAN BERGHOFF]"


### D) Second Main Dataframe

After indexing both the *business* and the *eat_seat* dataframes by license number, we can merge them together.

In [98]:
business = business.rename(columns= {'license id' : 'License'})
business = business.set_index('License')

eat_seat = eat_seat.set_index('License')

In [99]:
eat_seat_2 = pd.merge(eat_seat, business, right_index = True, left_index = True)
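
If a license number appears more than once in the *business* dataframe (we have not verified whether this is the case), the merge repeats every inspection once per matching row. The sketch below illustrates the effect on invented data and one way to guard against it with `validate`; `inspections`, `business_toy`, `multiplied`, `deduped` and `safe` are hypothetical names:

```python
import pandas as pd

# Toy data (made up): two inspections and two business rows for the same license.
inspections = pd.DataFrame(
    {'InspectionScore': [6, 3]},
    index=pd.Index([100, 100], name='License'))
business_toy = pd.DataFrame(
    {'account number': [5, 5]},
    index=pd.Index([100, 100], name='License'))

# Many-to-many merge: 2 inspections x 2 business rows = 4 rows.
multiplied = pd.merge(inspections, business_toy, left_index=True, right_index=True)

# Keep one business row per license, then ask pandas to check the relationship.
deduped = business_toy[~business_toy.index.duplicated(keep='first')]
safe = pd.merge(inspections, deduped, left_index=True, right_index=True,
                validate='many_to_one')
print(len(multiplied), len(safe))
```

With `validate='many_to_one'`, pandas raises an error instead of silently multiplying rows when the right-hand keys are not unique.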

In [100]:
eat_seat_2.head()

Unnamed: 0_level_0,DBA Name,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Facility Group,Community Area,NumberViolations,InspectionScore,FacilityScore,account number,police district,full name
License,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1770,VALENTINO CLUB CAFE,3,7150 W GRAND AVE,60707,2012-11-23,Complaint,1,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",41.923853,-87.804535,"{'longitude': '41.92385304509883', 'latitude':...",restaurant,25,[35],1,1,81,10,"[RUDOLFO GUERRERO, JOSE GUERRERO]"
7141,SOUTHWEST MONTESSORI PRE SCHOL,3,8620 -08624 S RACINE AVE,60620,2012-06-05,License,1,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,41.737096,-87.653521,"{'longitude': '41.73709646738882', 'latitude':...",daycare,73,"[35, 33]",2,15,282,20,[HECTOR RODRIGUEZ]
7141,SOUTHWEST MONTESSORI PRE SCHOL,3,8620 -08624 S RACINE AVE,60620,2011-06-17,Canvass,1,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,41.737096,-87.653521,"{'longitude': '41.73709646738882', 'latitude':...",daycare,73,"[34, 35, 41, 33]",4,15,282,20,[HECTOR RODRIGUEZ]
7141,SOUTHWEST MONTESSORI PRE SCHOL,3,8620 -08624 S RACINE AVE,60620,2010-07-07,License Re-Inspection,1,,41.737096,-87.653521,"{'longitude': '41.73709646738882', 'latitude':...",daycare,73,0,0,15,282,20,[HECTOR RODRIGUEZ]
7141,SOUTHWEST MONTESSORI PRE SCHOL,3,8620 -08624 S RACINE AVE,60620,2010-06-03,License Re-Inspection,3,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",41.737096,-87.653521,"{'longitude': '41.73709646738882', 'latitude':...",daycare,73,[35],3,15,282,20,[HECTOR RODRIGUEZ]


In [101]:
eat_seat_2.to_csv('newfood.csv')

## 4.2 Additional Dataset - CRIME 

### Looking at the dataset and modifying it

In [2]:
crime_2001_2004 = pd.read_csv('Chicago_Crimes_2001_to_2004.csv', error_bad_lines=False)

b'Skipping line 1513591: expected 23 fields, saw 24\n'


In [3]:
crime_2005_2007 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv', error_bad_lines=False)

b'Skipping line 533719: expected 23 fields, saw 24\n'


In [4]:
crime_2008_2011 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)

b'Skipping line 1149094: expected 23 fields, saw 41\n'


In [5]:
crime_2012_2017 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)

In [6]:
crime = pd.concat([crime_2001_2004, crime_2005_2007, crime_2008_2011, crime_2012_2017])

crime.head()

Unnamed: 0.1,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,879,4786321,HM399414,01/01/2004 12:01:00 AM,082XX S COLES AVE,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,...,7.0,46.0,6,,,2004.0,08/17/2015 03:03:40 PM,,,
1,2544,4676906,HM278933,03/01/2003 12:00:00 AM,004XX W 42ND PL,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,...,11.0,61.0,26,1173974.0,1876760.0,2003.0,04/15/2016 08:55:02 AM,41.8172,-87.637328,"(41.817229156, -87.637328162)"
2,2919,4789749,HM402220,06/20/2004 11:00:00 AM,025XX N KIMBALL AVE,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,...,35.0,22.0,20,,,2004.0,08/17/2015 03:03:40 PM,,,
3,2927,4789765,HM402058,12/30/2004 08:00:00 PM,045XX W MONTANA ST,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,...,31.0,20.0,6,,,2004.0,08/17/2015 03:03:40 PM,,,
4,3302,4677901,HM275615,05/01/2003 01:00:00 AM,111XX S NORMAL AVE,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,...,34.0,49.0,6,1174948.0,1831050.0,2003.0,04/15/2016 08:55:02 AM,41.6918,-87.635116,"(41.691784636, -87.635115968)"


In [7]:
crime.to_csv('crime.csv')

We load the data found on https://www.kaggle.com/currie32/crimes-in-chicago into pandas DataFrames and concatenate them. The *crime* dataframe is then saved to a csv file.

The datasets have several columns that are not relevant to us; they are dropped. In terms of location, we decide to only keep the `Community Area` column in order to have a link with the *chicago food inspections* dataset. We also keep the longitude and latitude so that the crimes can possibly be mapped in the visualization.

In [8]:
crime = crime.drop(['District', 'Ward', 'Unnamed: 0','Case Number','IUCR','Arrest', 'Domestic','FBI Code','X Coordinate','Y Coordinate','Updated On', 'Location Description'], axis=1)

crime.head()

Unnamed: 0,ID,Date,Block,Primary Type,Description,Beat,Community Area,Year,Latitude,Longitude,Location
0,4786321,01/01/2004 12:01:00 AM,082XX S COLES AVE,THEFT,FINANCIAL ID THEFT: OVER $300,424,46.0,2004.0,,,
1,4676906,03/01/2003 12:00:00 AM,004XX W 42ND PL,OTHER OFFENSE,HARASSMENT BY TELEPHONE,935,61.0,2003.0,41.8172,-87.637328,"(41.817229156, -87.637328162)"
2,4789749,06/20/2004 11:00:00 AM,025XX N KIMBALL AVE,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,1413,22.0,2004.0,,,
3,4789765,12/30/2004 08:00:00 PM,045XX W MONTANA ST,THEFT,FINANCIAL ID THEFT: OVER $300,2521,20.0,2004.0,,,
4,4677901,05/01/2003 01:00:00 AM,111XX S NORMAL AVE,THEFT,FINANCIAL ID THEFT:$300 &UNDER,2233,49.0,2003.0,41.6918,-87.635116,"(41.691784636, -87.635115968)"


In [9]:
for x in crime.columns :
    print(x + ' : ' + str(crime[x].isnull().values.any()) + ' --> ' + str(crime[x].isnull().sum()))

ID : False --> 0
Date : False --> 0
Block : False --> 0
Primary Type : False --> 0
Description : False --> 0
Beat : False --> 0
Community Area : True --> 702091
Year : False --> 0
Latitude : True --> 105573
Longitude : True --> 105574
Location : True --> 105574


Here we look at the missing values. Most of the missing data are in the location columns. The `Block` column contains anonymized addresses ("XX"). By replacing XX with 00 we can already get an approximate position of where the crimes took place.

In [10]:
def block_approx(address) :
    return address.replace('XX','00')

In [11]:
crime.Block = crime.Block.apply(block_approx)
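
The same replacement can be done with the vectorized string methods of pandas. The sketch below also shows a stricter variant that only touches an "XX" directly following the leading digits, in case a street name itself contains "XX". The third address is invented for illustration, and `blocks`, `approx` and `approx_strict` are hypothetical names:

```python
import pandas as pd

blocks = pd.Series(['082XX S COLES AVE', '004XX W 42ND PL',
                    '001XX N XXAVIER ST'])  # last address is made up

# Literal replacement of every 'XX', same behaviour as block_approx.
approx = blocks.str.replace('XX', '00', regex=False)

# Stricter variant: only replace the 'XX' that ends the leading house number.
approx_strict = blocks.str.replace(r'^(\d+)XX\b', r'\g<1>00', regex=True)
print(approx_strict.tolist())
```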

In [12]:
print(set(crime['Primary Type']))

{'CONCEALED CARRY LICENSE VIOLATION', 'PUBLIC PEACE VIOLATION', 'OFFENSE INVOLVING CHILDREN', 'THEFT', 'WEAPONS VIOLATION', 'LIQUOR LAW VIOLATION', 'DOMESTIC VIOLENCE', 'OTHER NARCOTIC VIOLATION', 'GAMBLING', 'ARSON', 'HUMAN TRAFFICKING', 'NARCOTICS', 'CRIM SEXUAL ASSAULT', 'MOTOR VEHICLE THEFT', 'OTHER OFFENSE', 'PROSTITUTION', 'BURGLARY', 'INTIMIDATION', 'KIDNAPPING', 'RITUALISM', 'NON-CRIMINAL', 'BATTERY', 'INTERFERENCE WITH PUBLIC OFFICER', 'HOMICIDE', 'OBSCENITY', 'NON-CRIMINAL (SUBJECT SPECIFIED)', 'ROBBERY', 'CRIMINAL DAMAGE', 'SEX OFFENSE', 'DECEPTIVE PRACTICE', 'NON - CRIMINAL', 'PUBLIC INDECENCY', 'ASSAULT', 'CRIMINAL TRESPASS', 'STALKING'}


We print all the `Primary Type` values in order to create a dictionary that maps every crime type to its minimum sentence, in years of imprisonment. More details are given below.

In [13]:
crime_penalty = {'PROSTITUTION' : 0.1, 'DOMESTIC VIOLENCE' : 0.1, 'MOTOR VEHICLE THEFT' : 3.0, 'ASSAULT' : 0.1, 'OFFENSE INVOLVING CHILDREN' : 0.1,\
                 'RITUALISM' : 0.1, 'BATTERY' : 0.1, 'NON-CRIMINAL (SUBJECT SPECIFIED)' : 0.1, 'CRIM SEXUAL ASSAULT' : 4.0, 'GAMBLING' : 0.1,\
                 'PUBLIC INDECENCY' : 0.1, 'OTHER OFFENSE' : 0.1, 'LIQUOR LAW VIOLATION' : 0.1, 'OTHER NARCOTIC VIOLATION' : 0.1, 'OBSCENITY' : 0.1,\
                 'NON-CRIMINAL' : 0.1, 'KIDNAPPING' : 3.0, 'HOMICIDE' : 20.0, 'NARCOTICS' : 0.1, 'ARSON' : 6.0, 'DECEPTIVE PRACTICE' : 0.1, 'ROBBERY' : 3.0,\
                 'BURGLARY' : 3.0, 'NON - CRIMINAL' : 0.1, 'INTIMIDATION' : 2.0, 'HUMAN TRAFFICKING' : 4.0, 'SEX OFFENSE' : 4.0, 'CRIMINAL TRESPASS' : 0.1,\
                 'CONCEALED CARRY LICENSE VIOLATION' : 2.0, 'CRIMINAL DAMAGE' : 1.0, 'INTERFERENCE WITH PUBLIC OFFICER' : 0.1, 'PUBLIC PEACE VIOLATION' : 0.1,\
                 'WEAPONS VIOLATION' : 0.1, 'THEFT' : 0.1, 'STALKING' : 0.1}

To build the *crime_penalty* dictionary, we looked up the minimum prison sentence (in years) for each crime type. When the minimum sentence is shorter than one year, the value is the corresponding fraction of a year if the minimum is expressed in months (6 months = 0.5), and is fixed at 0.1 when no prison sentence is required.
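The scoring rule described above can be sketched as a small function. This is only an illustration: `penalty_score` and its `min_months` parameter are hypothetical names that do not appear in the notebook, where the dictionary values were entered by hand.

```python
def penalty_score(min_months=None):
    """Hypothetical helper: turn a minimum prison sentence in months into a score in years."""
    if not min_months:
        # No prison sentence required (None or 0): fixed floor value
        return 0.1
    # Sentences are expressed as a fraction (or multiple) of a year
    return min_months / 12


print(penalty_score(None))  # no prison sentence required
print(penalty_score(6))     # 6 months
print(penalty_score(36))    # 3 years
```

With this convention, an offense with no prison sentence still contributes a small non-zero amount to the score of its area.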

These values are taken from the **Illinois Penalty Code**.

In [14]:
Primary_Type = []
for x in crime['Primary Type'] :
    if x == 'NON-CRIMINAL (SUBJECT SPECIFIED)' or x == 'NON - CRIMINAL' :
        Primary_Type.append('NON-CRIMINAL')
    else :
        Primary_Type.append(x)
        
crime['Primary_Type'] = Primary_Type

Here we group all the NON-CRIMINAL values under a single label (there were three different labels before).
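The same relabeling can probably be written without an explicit loop, using `Series.replace`. The sketch below runs on a small made-up DataFrame (`crime_demo` is an illustrative stand-in, not the real `crime` data) so that it is self-contained.

```python
import pandas as pd

# Toy stand-in for the notebook's `crime` DataFrame (illustrative values only)
crime_demo = pd.DataFrame({
    'Primary Type': ['THEFT', 'NON - CRIMINAL',
                     'NON-CRIMINAL (SUBJECT SPECIFIED)', 'NON-CRIMINAL'],
})

# The two spelling variants that should be merged into 'NON-CRIMINAL'
non_criminal_variants = ['NON-CRIMINAL (SUBJECT SPECIFIED)', 'NON - CRIMINAL']

# Replace every variant with the single label; other values are left untouched
crime_demo['Primary_Type'] = crime_demo['Primary Type'].replace(
    non_criminal_variants, 'NON-CRIMINAL')

print(crime_demo['Primary_Type'].tolist())
```

Applied to the real data, the same call would be made on `crime['Primary Type']`.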

In [15]:
crimescore = []
for x in crime['Primary_Type'] :
    crimescore.append(crime_penalty[x])
crime['crimescore'] = crimescore

Here we create a new column holding the crime score of each row, looked up from the row's `Primary_Type`.
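A loop-free alternative is `Series.map`, which looks up each value of a column in a dictionary. The sketch below uses a three-entry subset of the penalty values and a made-up DataFrame; `penalty_demo` and `crime_demo` are illustrative names only.

```python
import pandas as pd

# Subset of the crime_penalty values, enough for the illustration
penalty_demo = {'THEFT': 0.1, 'ARSON': 6.0, 'HOMICIDE': 20.0}

# Toy stand-in for the notebook's `crime` DataFrame
crime_demo = pd.DataFrame({'Primary_Type': ['THEFT', 'HOMICIDE', 'ARSON', 'THEFT']})

# Look up the score of every row in one vectorized call
crime_demo['crimescore'] = crime_demo['Primary_Type'].map(penalty_demo)

# Unlike the loop, map() yields NaN instead of raising KeyError for unknown types,
# so it is worth checking that nothing was left unmapped
assert crime_demo['crimescore'].notna().all()

print(crime_demo)
```

One difference to keep in mind: a crime type missing from the dictionary makes the loop fail immediately, whereas `map` silently produces a missing value, hence the explicit check.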

In [16]:
crime.head()

Unnamed: 0,ID,Date,Block,Primary Type,Description,Beat,Community Area,Year,Latitude,Longitude,Location,Primary_Type,crimescore
0,4786321,01/01/2004 12:01:00 AM,08200 S COLES AVE,THEFT,FINANCIAL ID THEFT: OVER $300,424,46.0,2004.0,,,,THEFT,0.1
1,4676906,03/01/2003 12:00:00 AM,00400 W 42ND PL,OTHER OFFENSE,HARASSMENT BY TELEPHONE,935,61.0,2003.0,41.8172,-87.637328,"(41.817229156, -87.637328162)",OTHER OFFENSE,0.1
2,4789749,06/20/2004 11:00:00 AM,02500 N KIMBALL AVE,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,1413,22.0,2004.0,,,,OFFENSE INVOLVING CHILDREN,0.1
3,4789765,12/30/2004 08:00:00 PM,04500 W MONTANA ST,THEFT,FINANCIAL ID THEFT: OVER $300,2521,20.0,2004.0,,,,THEFT,0.1
4,4677901,05/01/2003 01:00:00 AM,11100 S NORMAL AVE,THEFT,FINANCIAL ID THEFT:$300 &UNDER,2233,49.0,2003.0,41.6918,-87.635116,"(41.691784636, -87.635115968)",THEFT,0.1


In [17]:
crimetest = pd.DataFrame()
crimetest['Community Area'] = crime['Community Area']
crimetest['crimescore'] = crime['crimescore']

crimetest.head()

Unnamed: 0,Community Area,crimescore
0,46.0,0.1
1,61.0,0.1
2,22.0,0.1
3,20.0,0.1
4,49.0,0.1


Here we keep only the community area and the crime score columns, in order to compute the total crime score of each community area.
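The column selection and the per-area total can also be expressed in a single step, without an intermediate DataFrame. This is a sketch on made-up rows (`crime_demo` and `crime_CA_demo` are illustrative names), assuming the same two column names as in the notebook.

```python
import pandas as pd

# Toy rows with the two columns of interest (illustrative values only)
crime_demo = pd.DataFrame({
    'Community Area': [46.0, 61.0, 46.0, 20.0],
    'crimescore': [0.1, 0.1, 3.0, 20.0],
})

# Group by community area, keep only the score column, and sum it
crime_CA_demo = crime_demo.groupby('Community Area')[['crimescore']].sum()

print(crime_CA_demo)
```

Selecting `[['crimescore']]` with double brackets keeps the result as a DataFrame indexed by community area.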

In [18]:
crime_CA = crimetest.groupby('Community Area').sum()

crime_CA

Community Area,crimescore
0.0,53.6
1.0,69736.4
2.0,67716.7
3.0,58611.1
4.0,34806.9
...,...
73.0,65604.1
74.0,9039.9
75.0,40583.7
76.0,13296.0


This is the DataFrame we are interested in:

- Community area | Crimescore

Using this DataFrame, we will be able to check whether there is a correlation between restaurant quality and crime in Chicago.
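One possible way to run such a check is to join the per-area crime scores with per-area healthiness scores and compute a correlation coefficient. The sketch below is hypothetical: `health_CA_demo` and its `healthscore` column are assumed names, and all numbers are made up, so the printed value says nothing about Chicago.

```python
import pandas as pd

areas = pd.Index([1.0, 2.0, 3.0, 4.0], name='Community Area')

# Made-up per-area totals, shaped like crime_CA
crime_CA_demo = pd.DataFrame({'crimescore': [10.0, 20.0, 30.0, 40.0]}, index=areas)

# Hypothetical per-area healthiness scores (assumed column name)
health_CA_demo = pd.DataFrame({'healthscore': [8.0, 6.0, 4.0, 2.0]}, index=areas)

# Keep only the areas present in both tables
scores = crime_CA_demo.join(health_CA_demo, how='inner')

# Pearson correlation between the two scores
r = scores['crimescore'].corr(scores['healthscore'])
print(round(r, 3))
```

The inner join matters on real data, because an area that appears in only one of the two tables would otherwise introduce missing values.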

# About this notebook

We saw that the data are not that hard to handle: the datasets are not very large, and the formatting problems in the different columns have been resolved.

With these 3 datasets, we want to compare the overall healthiness of *public facilities* vs *private facilities*. We also want to see whether healthiness differs between Community Areas. The goal is to obtain a good visualization of the results on a map, using the Community Areas as zone boundaries.

Next, we will look for specific trends related to facility owners. We can hypothesize that a restaurant chain gets better results thanks to its experience and more organized structure, or that a given owner gets similar scores across his different establishments.

We would also like to see whether there is a correlation between the crimes committed in a zone (a Community Area) and the healthiness of its facilities. This is an interesting question from a criminologist's point of view.

To compare the healthiness of facility types (and community areas) with the crime level in the different Community Areas, we designed a **Healthiness Score** and a **Crime Score**.

We calculated the Healthiness Score from the recorded infractions, refined with the score given in the dataset. The Healthiness Score still has to be improved to be more representative of the situation.

We calculated the Crime Score from the minimum prison sentence for each offense. We used the primary offense type only; this was easier to set up, but can introduce some imprecision.

# Can the data answer our questions?

We think so: all the information we need for the different analyses is in the 3 datasets. For the healthiness of the restaurants and of the different community areas, all the data are in the food inspections DataFrame.

For the analysis of the owners of the different facilities, all the information needed has been processed and is in the license dataset.

For the crime-related analysis, the information has been processed and everything we need is in the crime dataset.

# What comes next?

We will first go deeper into the analysis, and may come back to modify or rearrange the data one last time according to our needs.

Once the analysis is complete, we will move on to the visualization: What is the best way to present our results? Is the map still a good visualization? How should we handle the map?

Finally, we will report our results and start preparing the presentation.

# Hypotheses

Our first hypothesis is that unhealthy facilities are not randomly distributed: we think they are concentrated in particular community areas.

We think that public facilities are usually healthier than private ones. Hospitals or public restaurants should be subject to more restrictions than private ones.

We also hypothesize that owners of several facilities usually pay more attention to the healthiness of their restaurants than owners of a single restaurant.

Finally, we hypothesize that facilities located in community areas with a lot of crime are inspected less often than other facilities, and that, when they are inspected, they may get a lower score than facilities located elsewhere.

#### Assistants' comments :
•	maybe instead of just copy-pasting a map from somewhere else try to incorporate it in your own style and plots

•	the whole "using a dict and then its values" works, but is far from an ideal choice. Try to find a pandas way. Data science in Python uses pandas all the time, so it is somewhat the language you have to speak if you want to be a good data scientist.

•	w.r.t. Hypothesis: think again what dataset you have and what you have done with it. does having a dataset of food inspections give you a complete overview of where and if a restaurant exists? Justify this claim in your report/presentation.


•	in general:
•	good work, but correct the above-mentioned points.
•	nice separation of work, quite easy to follow.
•	you need to provide more structure
•	try to define what you mean with healthiness sooner. I would assume you would use the type of food they serve, and not how many inspection violations they have. Definitions should be first thing you mention.
•	I would have liked to see more plots and visualisations, as they would also have helped you to see what you are doing. Don't start with visualisations at the end, data science is not a linear path, but more like a circle of iterations.
