<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#DATA-SET:-Chicago's-Food-Inspection" data-toc-modified-id="DATA-SET:-Chicago's-Food-Inspection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>DATA SET: Chicago's Food Inspection</a></span></li><li><span><a href="#DATA-EXPLORATION" data-toc-modified-id="DATA-EXPLORATION-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>DATA EXPLORATION</a></span><ul class="toc-item"><li><span><a href="#Let's-get-some-feeling-of-the-numbers" data-toc-modified-id="Let's-get-some-feeling-of-the-numbers-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Let's get some feeling of the numbers</a></span><ul class="toc-item"><li><span><a href="#Missing-data" data-toc-modified-id="Missing-data-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Missing data</a></span></li><li><span><a href="#Inspections-by-type--of-business" data-toc-modified-id="Inspections-by-type--of-business-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Inspections by type  of business</a></span></li><li><span><a href="#Relationship-between-columns" data-toc-modified-id="Relationship-between-columns-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Relationship between columns</a></span></li><li><span><a href="#Where-are-the-inspections-located?" data-toc-modified-id="Where-are-the-inspections-located?-2.1.4"><span class="toc-item-num">2.1.4&nbsp;&nbsp;</span>Where are the inspections located?</a></span></li></ul></li></ul></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
import pickle 
from matplotlib import pyplot as plt
import warnings
from pyjarowinkler import distance
import geopandas as gpd
import folium
import seaborn as sns
import pgeocode
warnings.filterwarnings("ignore")

# DATA SET: Chicago's Food Inspection

Chicago's Food Inspection dataset is part of a larger initiative by the local authorities of Chicago to make government data publicly available. Similar datasets are available at https://data.cityofchicago.org

This dataset, in particular, is generated from inspections of restaurants and other food establishments in Chicago from January 1, 2010, to the present. Inspections are performed by staff from the Chicago Department of Public Health’s Food Protection Program using a standardized procedure. The results of the inspection are inputted into a database, then reviewed and approved by a State of Illinois Licensed Environmental Health Practitioner (LEHP).

- __Data Owner__: Chicago Department of Public Health
- __Time Period__: 2010 - Present
- __Frequency__: This database was updated with information from new inspections each Friday.


# DATA EXPLORATION

We start by getting to know the dataset and its contents. An initial analysis of the Chicago food-inspections dataset follows:

In [2]:
chicago_df = pd.read_csv('../data/food-inspections.csv', delimiter=',')
chicago_df.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,...,Results,Violations,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards
0,2345318,SUBWAY,SUBWAY,2529116.0,Restaurant,Risk 1 (High),2620 N NARRAGANSETT AVE,CHICAGO,IL,60639.0,...,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.927995,-87.785752,"{'latitude': '-87.78575236468352', 'longitude'...",,,,,
1,2345321,GOPUFF,GOPUFF,2684560.0,Grocery Store,Risk 3 (Low),1801 W WARNER AVE,CHICAGO,IL,60613.0,...,Pass,,41.956846,-87.674395,"{'latitude': '-87.6743946694658', 'longitude':...",,,,,
2,2345334,LA MICHOACANA ICE CREAM SHOP,LA MICHOACANA ICE CREAM SHOP,2698396.0,Restaurant,Risk 1 (High),3591-3597 N MILWAUKEE AVE,CHICAGO,IL,60641.0,...,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.94614,-87.735183,"{'latitude': '-87.73518301995274', 'longitude'...",,,,,
3,2345339,THE CREPE SHOP,THE CREPE SHOP,2699005.0,Restaurant,Risk 1 (High),2934 N BROADWAY,CHICAGO,IL,60657.0,...,Fail,10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI...,41.93593,-87.644407,"{'latitude': '-87.64440716256712', 'longitude'...",,,,,
4,2345319,GOPUFF,GOPUFF,2684558.0,Grocery Store,Risk 3 (Low),1801 W WARNER AVE,CHICAGO,IL,60613.0,...,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.956846,-87.674395,"{'latitude': '-87.6743946694658', 'longitude':...",,,,,


In [3]:
chicago_df.columns

Index(['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type',
       'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Results', 'Violations', 'Latitude', 'Longitude',
       'Location', 'Historical Wards 2003-2015', 'Zip Codes',
       'Community Areas', 'Census Tracts', 'Wards'],
      dtype='object')

As we can see, this dataset has 22 columns of different types:
- **Integer columns**: `'Inspection ID'`, `'License #'`, `'Zip'`
- **Floating point columns**: `'Latitude'`, `'Longitude'`
- **Date columns**: `'Inspection Date'`
- **String Columns**: `'Results'`\*, `'DBA Name'`, `'Address'`, `'Risk'`\*, `'Facility Type'`\*, `'City'`\*, `'State'`\*, `'Location'`, `'AKA Name'`, `'Inspection Type'`\*, `'Violations'`^, where:
   - \*= Categorical
   - ^= Too many categories to be called categorical (142431)
- **Unknown data-type columns**: `'Wards'`, `'Community Areas'`, `'Zip Codes'`, `'Historical Wards 2003-2015'`, `'Census Tracts'`
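The classification above can be checked against what pandas actually infers. The sketch below uses a two-row stand-in (`sample`, a hypothetical frame built from values shown in the preview) rather than the full CSV. `License #` and `Zip` print as floats such as `2529116.0` in the preview, which is how pandas stores integer columns that contain missing values by default, and the date column stays a plain string until it is parsed.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real data, using values from the preview
sample = pd.DataFrame({
    'Inspection ID': [2345318, 2345321],
    'License #': [2529116.0, None],  # a missing value forces a float column
    'Inspection Date': ['2019-11-08T00:00:00.000', '2010-01-04T00:00:00.000'],
    'Latitude': [41.927995, 41.956846],
})

# Dates are plain strings until parsed explicitly
sample['Inspection Date'] = pd.to_datetime(sample['Inspection Date'])

# The nullable integer dtype keeps whole numbers while still allowing missing values
sample['License #'] = sample['License #'].astype('Int64')

print(sample.dtypes)
```

Whether the conversion to `Int64` is worthwhile depends on how the identifiers are used later; for pure lookups the float representation is usually harmless.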

The contents of the columns whose meaning is not immediate are explained below:

- __DBA__: ‘Doing business as.’ This is the legal name of the establishment.
- __AKA__: ‘Also known as.’ This is the name the public would know the establishment as.
- __License number__: This is a unique number assigned to the establishment for the purposes of licensing by the Department of Business Affairs and Consumer Protection.
- __Type of facility__: Each establishment is described by one of the following: bakery, banquet hall, candy store, caterer, coffee shop, day care center (for ages less than 2), day care center (for ages 2 – 6), day care center (combo, for ages less than 2 and 2 – 6 combined), gas station, Golden Diner, grocery store, hospital, long term care center (nursing home), liquor store, mobile food dispenser, restaurant, paleteria, school, shelter, tavern, social club, wholesaler, or Wrigley Field Rooftop.
- __Risk category of facility__: Each establishment is categorized as to its risk of adversely affecting the public’s health, with 1 being the highest and 3 the lowest. The frequency of inspection is tied to this risk, with risk 1 establishments inspected most frequently and risk 3 least frequently.
- __Street address, city, state and zip code of facility__: This data provides a full address for each business.
- __Inspection date__: This is the date the inspection occurred. A particular establishment is likely to have multiple inspections which are denoted by different inspection dates.
- __Inspection type__: An inspection can be one of the following types: 
 - *canvass*: the most common type of inspection performed at a frequency relative to the risk of the establishment.
 - *consultation*: when the inspection is done at the request of the owner prior to the opening of the establishment. 
 - *complaint*: when the inspection is done in response to a complaint against the establishment.
 - *license*: when the inspection is done as a requirement for the establishment to receive its license to operate.
 - *suspect food poisoning*: when the inspection is done in response to one or more persons claiming to have gotten ill as a result of eating at the establishment (a specific type of complaint-based inspection).
 - *task-force inspection*: when an inspection of a bar or tavern is done.
 
 Re-inspections can occur for most types of these inspections and are indicated as such.
- __Results__: An inspection can pass, pass with conditions or fail. Establishments receiving a ‘pass’ were found to have no critical or serious violations (violation numbers 1-14 and 15-29, respectively). Establishments receiving a ‘pass with conditions’ were found to have critical or serious violations, but these were corrected during the inspection. Establishments receiving a ‘fail’ were found to have critical or serious violations that were not correctable during the inspection. An establishment receiving a ‘fail’ does not necessarily mean the establishment’s license is suspended. Establishments found to be out of business or not located are indicated as such.
- __Violations__: An establishment can receive one or more of 45 distinct violations (violation numbers 1-44 and 70). For each violation number listed for a given establishment, the requirement the establishment must meet in order for it to NOT receive a violation is noted, followed by a specific description of the findings that caused the violation to be issued.
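The number ranges quoted in the *Results* description can be turned into a small helper. This is only a sketch: the function names `violation_number` and `violation_severity` and the label `'other'` are made up for illustration, and it is not verified here that the quoted ranges apply to every record in the file.

```python
import re

def violation_number(text):
    """Return the leading violation number of an entry such as '10. ADEQUATE ...'."""
    match = re.match(r'\s*(\d+)\.', text)
    return int(match.group(1)) if match else None

def violation_severity(number):
    """Map a violation number to the severity band quoted in the description."""
    if 1 <= number <= 14:
        return 'critical'
    if 15 <= number <= 29:
        return 'serious'
    return 'other'  # remaining numbers; label chosen for this sketch

number = violation_number('10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED')
print(number, violation_severity(number))
```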


In [4]:
# Get basic min/max statistics to get a feel for the value range of each column
basic_stats = (pd.Series(chicago_df.min(), name='mins').to_frame()
               .join(pd.Series(chicago_df.max(), name='maxs')))
basic_stats.index.name = 'Column'
basic_stats

Unnamed: 0_level_0,mins,maxs
Column,Unnamed: 1_level_1,Unnamed: 2_level_1
Inspection ID,44247,2345339
DBA Name,"#1 CHINA EXPRESS, LTD.",vitino pizzeria
License #,0,1e+07
Address,,N2660 HAYTON RD
Zip,10014,60827
Inspection Date,2010-01-04T00:00:00.000,2019-11-08T00:00:00.000
Results,Business Not Located,Pass w/ Conditions
Latitude,41.6447,42.0211
Longitude,-87.9144,-87.5251
Historical Wards 2003-2015,,


We observe that 5 columns are filled with NaN values: `Historical Wards 2003-2015`, `Zip Codes`, `Community Areas`, `Census Tracts`, `Wards`.

We are going to remove these columns from our data frame. From now on, we will only work with the dataframe where these columns have been removed (*'chicago_df_noNan'*).
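Rather than reading the empty columns off the statistics table, they can also be detected programmatically. A minimal sketch on a toy frame (`demo` is hypothetical and only mimics the situation described above):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Inspection ID': [1, 2, 3],
    'Zip': [60639.0, np.nan, 60613.0],  # partly missing: kept
    'Wards': [np.nan, np.nan, np.nan],  # entirely missing: dropped
})

# isna().all() is True only for columns in which every value is missing
empty_columns = demo.columns[demo.isna().all()].tolist()
demo_clean = demo.drop(columns=empty_columns)

print(empty_columns)
```

This keeps partly filled columns such as `Zip` and only drops columns that carry no information at all.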

In [5]:
chicago_df_noNan = chicago_df.drop(columns=["Historical Wards 2003-2015",
                                            "Zip Codes",
                                            "Community Areas",
                                            "Census Tracts",
                                            "Wards"])

In [6]:
chicago_df_noNan.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
0,2345318,SUBWAY,SUBWAY,2529116.0,Restaurant,Risk 1 (High),2620 N NARRAGANSETT AVE,CHICAGO,IL,60639.0,2019-11-08T00:00:00.000,Canvass Re-Inspection,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.927995,-87.785752,"{'latitude': '-87.78575236468352', 'longitude'..."
1,2345321,GOPUFF,GOPUFF,2684560.0,Grocery Store,Risk 3 (Low),1801 W WARNER AVE,CHICAGO,IL,60613.0,2019-11-08T00:00:00.000,License Re-Inspection,Pass,,41.956846,-87.674395,"{'latitude': '-87.6743946694658', 'longitude':..."
2,2345334,LA MICHOACANA ICE CREAM SHOP,LA MICHOACANA ICE CREAM SHOP,2698396.0,Restaurant,Risk 1 (High),3591-3597 N MILWAUKEE AVE,CHICAGO,IL,60641.0,2019-11-08T00:00:00.000,License,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.94614,-87.735183,"{'latitude': '-87.73518301995274', 'longitude'..."
3,2345339,THE CREPE SHOP,THE CREPE SHOP,2699005.0,Restaurant,Risk 1 (High),2934 N BROADWAY,CHICAGO,IL,60657.0,2019-11-08T00:00:00.000,License,Fail,10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI...,41.93593,-87.644407,"{'latitude': '-87.64440716256712', 'longitude'..."
4,2345319,GOPUFF,GOPUFF,2684558.0,Grocery Store,Risk 3 (Low),1801 W WARNER AVE,CHICAGO,IL,60613.0,2019-11-08T00:00:00.000,License Re-Inspection,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.956846,-87.674395,"{'latitude': '-87.6743946694658', 'longitude':..."


# Examination of the data
In this part, the following columns of the data are examined and cleaned:
`'Inspection ID'`, `'DBA Name'`, `'AKA Name'`, `'License #'`, `'Facility Type'`, `'Risk'`, `'Address'`, `'City'`, `'State'`

## inspection_id
Some inspections have been inserted twice in the dataset and need to be deleted to avoid duplicates.

In [7]:
# Create more coding-friendly column names :D
og_columns = chicago_df_noNan.columns
columns = [
    'inspection_id', 'dba_name', 'aka_name', 'license', 
    'facility_type', 'risk', 'address', 'city', 'state', 
    'zip', 'inspection_date', 'inspection_type', 'results', 
    'violations', 'latitude', 'longitude', 'location'
]
chicago_df_noNan.columns = columns

In [8]:
inspection_id = chicago_df_noNan['inspection_id']

We first verify that no entry has a null id

In [9]:
sum(inspection_id.isnull())

0

Then, we check for duplicate entries

In [10]:
duplicated = chicago_df_noNan.duplicated(keep='first')

In [11]:
print(len(chicago_df_noNan))
print(np.sum(duplicated))
chicago_df_noNan = chicago_df_noNan.drop_duplicates()
print(len(chicago_df_noNan)) 

195736
205
195531
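Note that `duplicated()` without arguments only flags rows that are identical in every column. Two rows sharing an inspection ID but differing elsewhere would survive. The toy example below (hypothetical data) contrasts the two notions of "duplicate":

```python
import pandas as pd

demo = pd.DataFrame({
    'inspection_id': [101, 101, 102, 102],
    'results': ['Pass', 'Pass', 'Fail', 'Pass'],
})

# Identical in every column
full_row_dupes = demo.duplicated(keep='first')
# Same ID, regardless of the other columns
id_dupes = demo.duplicated(subset='inspection_id', keep='first')

print(full_row_dupes.sum(), id_dupes.sum())
```

If the second count were larger than the first on the real data, the conflicting rows would need a closer look before deciding which one to keep.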


## DBA name

In [12]:
# Checking if there are any null entries
sum(chicago_df_noNan['dba_name'].isnull())

0

In [13]:
# Standardising the establishment names by converting them to lower case
print('Number of unique restaurants in the dataset when the names are case sensitive:', len(chicago_df_noNan['dba_name'].unique()))
chicago_df_noNan['dba_name'] = chicago_df_noNan['dba_name'].str.lower()
print('Number of unique restaurants when the names are not case sensitive:', len(chicago_df_noNan['dba_name'].unique()))

Number of unique restaurants in the dataset when the names are case sensitive: 27545
Number of unique restaurants when the names are not case sensitive: 27259


## AKA Name
In this part, we convert all AKA names to lower case and replace the null entries with their DBA names.

In [14]:
chicago_df_noNan['aka_name'].unique()

array(['SUBWAY', 'GOPUFF', 'LA MICHOACANA ICE CREAM SHOP', ...,
       'SAFAH FOOD & LIQUOR INC', 'MAKIA FOOD', 'RAINBOW GROCERY'],
      dtype=object)

In [15]:
print('Number of entries without an AKA name:', len(chicago_df_noNan[chicago_df_noNan['aka_name'].isnull()]))

Number of entries without an AKA name: 2456


In [16]:
# Replacing null entries with their dba name
# (fillna aligns on the index, so each missing AKA name gets the DBA name of its own row;
#  this avoids DataFrame.append, which was removed in pandas 2.0)
chicago_df_noNan['aka_name'] = chicago_df_noNan['aka_name'].fillna(chicago_df_noNan['dba_name'])

In [17]:
print('Number of unique names with case sensitivity:', len(chicago_df_noNan['aka_name'].unique()))
temp = chicago_df_noNan['aka_name'].str.lower()
print('Number of unique names without case sensitivity:', len(temp.unique()))
chicago_df_noNan['aka_name'] = chicago_df_noNan['aka_name'].str.lower()

Number of unique names with case sensitivity: 27211
Number of unique names without case sensitivity: 26734


## License Number
Some businesses don't have a license number. We add a boolean column to the dataframe indicating whether the business has a license number or not.

In [18]:
print('There are ', len(chicago_df_noNan[chicago_df_noNan['license'].isnull()]), 'entries without a license number')

There are  17 entries without a license number


In [19]:
chicago_df_noNan['has_license'] = ~chicago_df_noNan['license'].isnull()

## Facility Type
Some types of facilities only contain one establishment, either because it is a very niche category, or because it was entered in a way that similar entries don't match because of spelling or specificity. In order to group those categories, we match the facility types containing 40 or fewer entries to the most similar main category according to the Jaro-Winkler similarity. If none of the main categories is a match (similarity not above 0.70), we place them into misc (miscellaneous).
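To make the matching criterion concrete, here is a pure-Python sketch of the Jaro-Winkler similarity (1.0 for identical strings, 0.0 for strings with nothing in common). It follows the textbook definition and is not the `pyjarowinkler` implementation, so it is not guaranteed to reproduce that library's output digit for digit; the names `jaro` and `jaro_winkler` are chosen for this example.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity with a bonus for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


main_types = ['restaurant', 'bakery', 'grocery store']
small = 'restaurant/bar'
scores = {t: jaro_winkler(small, t) for t in main_types}
best = max(scores, key=scores.get)
print(best, scores[best] > 0.70)
```

Because the measure rewards shared prefixes and character overlap rather than meaning, short strings can be matched to unrelated categories, so the resulting mapping is worth reviewing by eye.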

In [20]:
# Turn NaN values into a string to be able to operate on the column
chicago_df_noNan['facility_type'] = chicago_df_noNan['facility_type'].fillna('not available')
# Standardising the facility types to lower case categories
chicago_df_noNan['facility_type'] = chicago_df_noNan['facility_type'].str.lower()

In [21]:
len(chicago_df_noNan)

195531

In [22]:
print('Number of facility types before standardising:', len(chicago_df_noNan['facility_type'].unique()))

Number of facility types before standardising: 441


In [23]:
(chicago_df_noNan['facility_type'].unique())

array(['restaurant', 'grocery store', "children's services facility",
       'liquor', 'bakery', 'coffee shop', 'daycare (2 - 6 years)',
       'pop-up food establishment user-tier ii', 'not available',
       'school', 'daycare above and under 2 years', 'live poultry',
       '15 monts to 5 years old', 'gas station/grocery',
       'daycare (under 2 years)', 'long term care',
       'mobile food preparer', 'ice cream', 'charter school',
       'pop-up establishment host-tier ii', 'tavern', 'catering',
       'paleteria', 'mobile food dispenser',
       'childrens services facility', 'brewery', 'restaurant/bar', 'gym',
       "childern's service facility", 'banquet', 'golden diner',
       'grocery store /pharmacy', 'daycare combo 1586', 'hospital',
       'assisted living', 'airport lounge',
       "1023 children's services facility", 'private school',
       'cooking school', 'mobile frozen desserts vendor', 'sushi counter',
       'daycare night', 'banquet hall', 'after school progr

In [24]:
facility_count = chicago_df_noNan['facility_type'].value_counts()

facility_types = []
small_types = {}

# Retrieving the main categories, and identifying the smallest ones
for facility in facility_count.index:
    if facility_count[facility] > 40:
        facility_types.append(facility)
    else:
        small_types[facility] = ''

# Matching the small categories to the principal ones
# (the score is treated as a similarity: the highest value is the best match)
for small in small_types:
    distances = []
    for facility in facility_types:
        dist = distance.get_jaro_distance(small, facility)
        distances.append(dist)
    index = np.argmax(distances)
    if distances[index] > 0.70:
        small_types[small] = facility_types[index]
    else:
        small_types[small] = 'misc'


In [25]:
print('Minority categories and their match in the main category pool:')
print(small_types)

Minority categories and their match in the main category pool:
{'store': 'misc', 'restaurant/bar': 'restaurant', 'church': 'misc', 'rooftop': 'misc', 'ice cream shop': 'coffee shop', "1023 childern's services facility": "children's services facility", 'church kitchen': 'shared kitchen', 'commissary': 'misc', 'cooking school': 'charter school', "1023-children's services facility": "children's services facility", 'culinary school': 'charter school', 'bar': 'bakery', 'grocery & restaurant': 'grocery/restaurant', 'pop-up establishment host-tier ii': 'misc', 'assisted living': 'misc', 'restaurant/grocery store': 'restaurant', 'roof tops': 'misc', 'theater': 'shelter', 'mobile desserts vendor': 'mobile frozen desserts vendor', 'restaurant/gas station': 'restaurant', 'nursing home': 'misc', 'paleteria': 'cafeteria', 'supportive living': 'misc', 'grocery store/gas station': 'grocery store', 'gas station/mini mart': 'gas station', 'roof top': 'misc', 'wrigley roof top': 'misc', 'after school pr

In [26]:
# Replacing the minority categories by majority ones in the dataframe

chicago_df_noNan['new_facility_type'] = chicago_df_noNan['facility_type']


def get_new_facility(x):
    if x not in small_types:
        return x
    else:
        return small_types[x]
    
chicago_df_noNan['new_facility_type'] = chicago_df_noNan['new_facility_type'].apply(get_new_facility)

In [27]:
print('Number of facilities in the dataset after matching minority types to majority types:', len(chicago_df_noNan['new_facility_type'].unique()))

Number of facilities in the dataset after matching minority types to majority types: 43


## Risk
We fill NaN values with 'Risk -1 (None)' and change the risk 'All' to 'Risk 4 (All)', so that the risk level can be extracted numerically later if needed.


In [28]:
chicago_df_noNan['risk'].unique()
chicago_df_noNan['risk'] = chicago_df_noNan['risk'].fillna('Risk -1 (None)')
chicago_df_noNan['risk'] = chicago_df_noNan['risk'].str.replace('All', 'Risk 4 (All)')

In [29]:
chicago_df_noNan['risk'].unique()

array(['Risk 1 (High)', 'Risk 3 (Low)', 'Risk 2 (Medium)', 'Risk 4 (All)',
       'Risk -1 (None)'], dtype=object)
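With every label now following the pattern `Risk <number> (<name>)`, the numeric level can be pulled out with a regular expression. A minimal sketch on the five labels shown above (`risk_level` is a name chosen for this example):

```python
import pandas as pd

risk = pd.Series(['Risk 1 (High)', 'Risk 3 (Low)', 'Risk 2 (Medium)',
                  'Risk 4 (All)', 'Risk -1 (None)'])

# Capture the optionally negative integer that follows the word "Risk"
risk_level = risk.str.extract(r'Risk (-?\d+)', expand=False).astype(int)

print(risk_level.tolist())
```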

## Address

In [30]:
chicago_df_noNan['city'].unique()

array(['CHICAGO', nan, 'chicago', 'Chicago', 'GRIFFITH', 'NEW YORK',
       'SCHAUMBURG', 'ELMHURST', 'ALGONQUIN', 'NEW HOLSTEIN', 'CCHICAGO',
       'NILES NILES', 'EVANSTON', 'CHICAGO.', 'CHESTNUT STREET',
       'LANSING', 'CHICAGOCHICAGO', 'WADSWORTH', 'WILMETTE', 'WHEATON',
       'CHICAGOHICAGO', 'ROSEMONT', 'CHicago', 'CALUMET CITY',
       'PLAINFIELD', 'HIGHLAND PARK', 'PALOS PARK', 'ELK GROVE VILLAGE',
       'CICERO', 'BRIDGEVIEW', 'OAK PARK', 'MAYWOOD', 'LAKE BLUFF',
       '312CHICAGO', 'SCHILLER PARK', 'SKOKIE', 'BEDFORD PARK',
       'BANNOCKBURNDEERFIELD', 'CHCICAGO', 'BLOOMINGDALE', 'Norridge',
       'CHARLES A HAYES', 'CHCHICAGO', 'CHICAGOI', 'SUMMIT',
       'OOLYMPIA FIELDS', 'WESTMONT', 'CHICAGO HEIGHTS', 'JUSTICE',
       'TINLEY PARK', 'LOMBARD', 'EAST HAZEL CREST', 'COUNTRY CLUB HILLS',
       'STREAMWOOD', 'BOLINGBROOK', 'INACTIVE', 'BERWYN', 'BURNHAM',
       'DES PLAINES', 'LAKE ZURICH', 'OLYMPIA FIELDS', 'OAK LAWN',
       'BLUE ISLAND', 'GLENCOE', 'FRANKFO

In [31]:
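def split_address(address):
    # Split e.g. "2620 N NARRAGANSETT AVE" into number, cardinal direction and street
    liste = address.split(' ')
    nr = liste[0]                # street number
    cardinal = liste[1]          # cardinal direction (N, S, E, W)
    reste = ' '.join(liste[2:])  # remainder: street name and suffix
    # Join the three parts with '*' so they can be split into columns afterwards
    liste = nr + ('*') + cardinal + '*' + reste
    return liste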

In [32]:
temp = chicago_df_noNan['address'].apply(split_address)
temp = temp.str.split('*', expand=True)
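The same three-way split can be sketched without the intermediate `*` separator by limiting the number of splits. This is an alternative formulation, not the notebook's method; the column names `number`, `direction` and `street` are chosen for this example.

```python
import pandas as pd

addresses = pd.Series(['2620 N NARRAGANSETT AVE', '1801 W WARNER AVE', '2934 N BROADWAY'])

# n=2 stops after two splits, so the street name keeps its internal spaces
parts = addresses.str.split(' ', n=2, expand=True)
parts.columns = ['number', 'direction', 'street']

print(parts)
```

One practical difference is that an address with fewer than three tokens yields missing values in the trailing columns instead of raising an `IndexError`.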


## City

In [33]:
print(len(chicago_df_noNan['city'].unique()))
chicago_df_noNan['city'] = chicago_df_noNan['city'].str.lower()
print(len(chicago_df_noNan['city'].unique()))

72
67


In [34]:
chicago_df_noNan['city'].unique()
chicago_df_noNan['city'] = chicago_df_noNan['city'].replace('cchicago', 'chicago').replace('chicago.', 'chicago')
chicago_df_noNan['city'] = chicago_df_noNan['city'].replace('chicagochicago', 'chicago')
chicago_df_noNan['city'] = chicago_df_noNan['city'].replace('chicagohicago', 'chicago')
chicago_df_noNan['city'] = chicago_df_noNan['city'].replace('312chicago', 'chicago').replace('chicagoi', 'chicago')
chicago_df_noNan['city'] = chicago_df_noNan['city'].replace('chchicago', 'chicago')
chicago_df_noNan['city'] = chicago_df_noNan['city'].replace('chcicago', 'chicago')
chicago_df_noNan['city'] = chicago_df_noNan['city'].fillna('chicago')
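The chain of `replace` calls can also be written as a single dictionary lookup, which keeps the list of known misspellings in one place. A sketch on a toy series (`city` here is hypothetical sample data; the misspellings are the ones handled above):

```python
import pandas as pd

city = pd.Series(['CHICAGO', 'cchicago', 'CHICAGO.', '312CHICAGO', None, 'Cicero'])

# Misspellings listed once; every key maps to the canonical spelling
chicago_variants = ['cchicago', 'chicago.', 'chicagochicago', 'chicagohicago',
                    '312chicago', 'chicagoi', 'chchicago', 'chcicago']
corrections = {variant: 'chicago' for variant in chicago_variants}

city_clean = city.str.lower().replace(corrections).fillna('chicago')

print(city_clean.tolist())
```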

In [35]:
np.sort(chicago_df_noNan['city'].unique())

array(['algonquin', 'alsip', 'bannockburndeerfield', 'bedford park',
       'berwyn', 'bloomingdale', 'blue island', 'bolingbrook',
       'bridgeview', 'broadview', 'burnham', 'calumet city',
       'charles a hayes', 'chestnut street', 'chicago', 'chicago heights',
       'cicero', 'country club hills', 'des plaines', 'east hazel crest',
       'elk grove village', 'elmhurst', 'evanston', 'evergreen park',
       'frankfort', 'glencoe', 'griffith', 'highland park', 'inactive',
       'justice', 'lake bluff', 'lake zurich', 'lansing', 'lombard',
       'maywood', 'naperville', 'new holstein', 'new york', 'niles niles',
       'norridge', 'oak lawn', 'oak park', 'olympia fields',
       'oolympia fields', 'palos park', 'plainfield', 'rosemont',
       'schaumburg', 'schiller park', 'skokie', 'streamwood', 'summit',
       'tinley park', 'wadsworth', 'westmont', 'wheaton', 'wilmette',
       'worth'], dtype=object)

In [36]:
cities = pd.read_csv('../data/listsChicago.csv', sep=';', header=None)
cities[0] = cities[0].str.lower()
cities = cities[0].values

In [37]:
outs = []
ins = []
for city in chicago_df_noNan['city'].unique():
    if city not in cities:
        outs.append(city)
        chicago_df_noNan = chicago_df_noNan[chicago_df_noNan['city'] != city]
    else:
        ins.append(city)

In [38]:
print('Here are the real cities from Chicago!')
print(ins)

Here are the real cities from Chicago!
['chicago', 'cicero', 'bridgeview', 'oak park', 'maywood', 'bedford park', 'berwyn', 'oak lawn', 'broadview', 'evergreen park']
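The loop above filters the dataframe once per rejected city. The same result can be sketched with a single boolean mask via `isin`; `demo` and `allowed` below are hypothetical stand-ins for the dataframe and the reference list of cities.

```python
import pandas as pd

demo = pd.DataFrame({'city': ['chicago', 'new york', 'cicero', 'inactive'],
                     'inspection_id': [1, 2, 3, 4]})
allowed = ['chicago', 'cicero', 'oak park']  # stand-in for the reference list

keep = demo['city'].isin(allowed)
outs = sorted(demo.loc[~keep, 'city'].unique())  # rejected city names
demo = demo[keep]

print(outs)
```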


## State
We make sure that all our remaining entries are in Illinois, filling missing states with `IL`.

In [39]:
print(chicago_df_noNan['state'].unique())
chicago_df_noNan['state'] = np.where(chicago_df_noNan['state'].isnull(), 'IL', chicago_df_noNan['state'])
    

['IL' nan]


In [40]:
print(chicago_df_noNan['state'].unique())

['IL']


## Location

In [41]:
missing_longitude_indices = chicago_df_noNan['longitude'].isna()

In [42]:
np.sum(missing_longitude_indices)

556

In [43]:
chicago_df_noNan[missing_longitude_indices][['latitude', 'location']].isna().sum()

latitude    556
location    556
dtype: int64

We can conclude that the location attributes (`latitude`, `longitude` and `location`) are always missing together.

In [44]:
# Find how many of the rows with missing location attributes have missing zip code also
np.sum(chicago_df_noNan['longitude'].isna() & chicago_df_noNan['zip'].isna())

3

In [64]:
# Define a function that performs geocoding using the zip code

def geocode_zip_code(zip_code):
    nomi = pgeocode.Nominatim('us')
    nomi_query = nomi.query_postal_code(int(zip_code))
    return pd.Series([nomi_query.latitude, nomi_query.longitude])

In [67]:
# Perform geocoding using the zipcode

missing_longitude_indices_with_zip = chicago_df_noNan['longitude'].isna() & chicago_df_noNan['zip'].notna()

chicago_df_noNan.loc[missing_longitude_indices_with_zip,['latitude', 'longitude']] =\
    chicago_df_noNan.loc[missing_longitude_indices_with_zip, 'zip'].apply(geocode_zip_code)

In [47]:
missing_longitude_indices = chicago_df_noNan['latitude'].isna()

In [48]:
np.sum(missing_longitude_indices)

3

We have filled most of the missing locations by geocoding the zip code, which is precise enough to place an establishment in its area. Geocoders that work from the street address give more accurate locations, but they either require an API key (such as Google Maps) or need a cleaner address format than ours (such as OpenStreetMap), so they do not work for all of our entries.
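Since many rows share a zip code, the lookup only needs to run once per distinct code. The sketch below illustrates that idea with a placeholder lookup function: `fill_from_zip` and `fake_lookup` are hypothetical names, and the coordinates returned by `fake_lookup` are made up rather than real centroids. In practice, the lookup would wrap a real geocoder.

```python
import numpy as np
import pandas as pd

def fill_from_zip(df, lookup):
    """Fill missing latitude/longitude by looking up each distinct zip once.

    `lookup` is any callable mapping a zip code to a (latitude, longitude) pair.
    """
    needs_fill = df['longitude'].isna() & df['zip'].notna()
    coords = {z: lookup(z) for z in df.loc[needs_fill, 'zip'].unique()}
    out = df.copy()
    out.loc[needs_fill, 'latitude'] = df.loc[needs_fill, 'zip'].map(lambda z: coords[z][0])
    out.loc[needs_fill, 'longitude'] = df.loc[needs_fill, 'zip'].map(lambda z: coords[z][1])
    return out

# Placeholder lookup with made-up coordinates, standing in for a real geocoder
calls = []
def fake_lookup(zip_code):
    calls.append(zip_code)
    return (41.0, -87.0)

demo = pd.DataFrame({
    'zip': [60639.0, 60639.0, np.nan, 60613.0],
    'latitude': [np.nan, np.nan, np.nan, 41.956846],
    'longitude': [np.nan, np.nan, np.nan, -87.674395],
})
filled = fill_from_zip(demo, fake_lookup)

print(filled)
print(len(calls))  # number of lookups actually performed
```

Rows without a zip code stay untouched, which mirrors the entries that remain missing after this step.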

In [49]:
# Group by name and aggregate to count the missing longitudes and the total number of entries in each group
exp_df = chicago_df_noNan[['longitude', 'dba_name']].groupby('dba_name').agg({"longitude": [lambda x: x.isnull().sum()], 'dba_name':'count'})

In [50]:
# Rename the columns for a more intuitive view
# (reassigning instead of inplace=True, which newer pandas versions no longer accept)
exp_df.columns = exp_df.columns.set_levels(['nulls_count', 'count'], level=1)

In [51]:
# Identify how many entries with missing locations have other entries for the same place with the location not missing

bool_1 = (exp_df['longitude']['nulls_count'] != 0)
bool_2 = (exp_df['dba_name']['count'] != exp_df['longitude']['nulls_count'])

np.sum(exp_df[ bool_1.values & bool_2.values]['longitude']['nulls_count'])

2.0

In [52]:
# Use establishments with multiple entries, some of which have a location, to fill the entries of the same establishment that lack one

missing_location_names = chicago_df_noNan.loc[missing_longitude_indices]['dba_name']

for index, name in missing_location_names.items():
    # Only consider entries with the same name that actually have a location
    similar_entries = chicago_df_noNan[(chicago_df_noNan['dba_name'] == name) & (chicago_df_noNan['latitude'].notna())]
    if similar_entries.empty:
        continue  # no other entry to borrow from; leave the location missing
    chicago_df_noNan.at[index, 'latitude'] = similar_entries.iloc[0]['latitude']
    chicago_df_noNan.at[index, 'longitude'] = similar_entries.iloc[0]['longitude']
    chicago_df_noNan.at[index, 'zip'] = similar_entries.iloc[0]['zip']

In [53]:
missing_longitude_indices = chicago_df_noNan['latitude'].isna()

In [54]:
np.sum(missing_longitude_indices)

1
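The same name-based fill can be sketched without an explicit loop using `groupby(...).transform('first')`, which broadcasts the first non-null value of each group to all of its rows. The frame `demo` below is a hypothetical example.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'dba_name': ['subway', 'subway', 'gopuff', 'unique cafe'],
    'latitude': [np.nan, 41.927995, 41.956846, np.nan],
    'longitude': [np.nan, -87.785752, -87.674395, np.nan],
})

for col in ['latitude', 'longitude']:
    # 'first' returns the first non-null value in each group, broadcast to every row
    known = demo.groupby('dba_name')[col].transform('first')
    demo[col] = demo[col].fillna(known)

print(demo)
```

Either way, matching on the name alone is a heuristic: for chains with several branches, the borrowed coordinates may belong to a different branch.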

In [56]:
chicago_df_noNan.columns

Index(['inspection_id', 'dba_name', 'aka_name', 'license', 'facility_type',
       'risk', 'address', 'city', 'state', 'zip', 'inspection_date',
       'inspection_type', 'results', 'violations', 'latitude', 'longitude',
       'location', 'has_license', 'new_facility_type'],
      dtype='object')

## Inspection type

In [59]:
chicago_df_noNan['inspection_type'] = chicago_df_noNan['inspection_type'].str.lower()
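As the dataset description notes, re-inspections are indicated in the inspection type itself (for example `Canvass Re-Inspection`). One possible next step, sketched here on hypothetical sample values, is to separate that flag from the base type; `is_reinspection` and `base_type` are names chosen for this example.

```python
import pandas as pd

inspection_type = pd.Series(['Canvass Re-Inspection', 'License',
                             'License Re-Inspection', None])

lowered = inspection_type.str.lower()
# Missing types are treated as "not a re-inspection"
is_reinspection = lowered.str.contains('re-inspection', regex=False, na=False)
# Strip the suffix so that "canvass re-inspection" and "canvass" share a base type
base_type = lowered.str.replace(' re-inspection', '', regex=False)

print(is_reinspection.tolist())
print(base_type.tolist())
```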