**TODO Important !**
* Data cleaning: explode the violations column. For each inspection, keep the violation numbers and the inspector's comments (the comments could be used for NLP; check whether this is useful).
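
A minimal sketch of the explode step described in this TODO. It assumes, without confirmation from this notebook, that the violations of one inspection are separated by ` | ` and that each violation reads `<number>. <description> - Comments: <comment>`. The function name `explode_violations` and the toy dataframe are illustrative only.

```python
import pandas as pd

def explode_violations(df, col="Violations"):
    """Return one row per (inspection, violation) with its number and comment."""
    out = df[["Inspection ID", col]].dropna(subset=[col]).copy()
    # ASSUMPTION: violations are separated by a pipe surrounded by spaces
    out[col] = out[col].str.split(r"\s*\|\s*")
    out = out.explode(col)
    # ASSUMPTION: each violation starts with its number followed by a dot
    numbers = out[col].str.extract(r"^\s*(\d+)\.", expand=False)
    out["violation_number"] = pd.to_numeric(numbers)
    # ASSUMPTION: the inspector's comment follows the marker '- Comments:'
    out["comment"] = out[col].str.extract(r"-\s*Comments:\s*(.*)$", expand=False)
    return out.drop(columns=[col]).reset_index(drop=True)

toy = pd.DataFrame({
    "Inspection ID": [1, 2],
    "Violations": [
        "3. MANAGEMENT - Comments: no policy | 55. PHYSICAL FACILITIES - Comments: dirty floor",
        None,
    ],
})
violations = explode_violations(toy)
```

Inspections without any recorded violation are dropped from the exploded table; they can be recovered later through a left join on `Inspection ID` if needed.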

# Relevant imports

In [80]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import math
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

from autocorrect import Speller

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

import pandas_profiling


from datetime import date

#import findspark
#findspark.init()

#import pyspark
# Important to use sql functions in pyspark as example: sqlf.max()
#[source](https://stackoverflow.com/questions/36604460/python-function-such-as-max-doesnt-work-in-pyspark-application)
#from pyspark.sql import functions as sqlf

# I. Dataset(s) preparation and cleaning

Before we proceed to tackle each of our research questions, some data cleaning is in order.

## 1. Load the data and explore its structure

In [81]:
inspections = pd.read_csv('datasets/food-inspections.csv')

In [3]:
len(inspections)

194615

The dataset has 22 columns. Let's examine what each of them is.

In [4]:
#Display columns
inspections.columns

Index(['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type',
       'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Results', 'Violations', 'Latitude', 'Longitude',
       'Location', 'Historical Wards 2003-2015', 'Zip Codes',
       'Community Areas', 'Census Tracts', 'Wards'],
      dtype='object')

In [5]:
inspections.dtypes

Inspection ID                   int64
DBA Name                       object
AKA Name                       object
License #                     float64
Facility Type                  object
Risk                           object
Address                        object
City                           object
State                          object
Zip                           float64
Inspection Date                object
Inspection Type                object
Results                        object
Violations                     object
Latitude                      float64
Longitude                     float64
Location                       object
Historical Wards 2003-2015    float64
Zip Codes                     float64
Community Areas               float64
Census Tracts                 float64
Wards                         float64
dtype: object

A description of the features is given below ([source](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF)).
The last five columns of the dataframe are not described in the dataset source; we will see below that those columns are in fact entirely null.

| Feature name                | Variable Type | Description 
|-----------------------------|---------------|--------------------------------------------------------
| Inspection ID        | Integer    | The inspection unique identifier.
| DBA Name                 | String        | ‘Doing business as.’ Legal name of the establishment.
| AKA Name                | String    |  ‘Also known as.’ Name the public would know the establishment as.
| License # | Integer    | Unique number assigned to the establishment for the purposes of licensing by the Department of Business Affairs and Consumer Protection.
| Facility Type                | String    | Each establishment is described by one of the following: bakery, banquet hall, candy store, caterer, coffee shop, day care center (for ages less than 2), day care center (for ages 2 – 6), day care center (combo, for ages less than 2 and 2 – 6 combined), gas station, Golden Diner, grocery store, hospital, long term care center (nursing home), liquor store, mobile food dispenser, restaurant, paleteria, school, shelter, tavern, social club, wholesaler, or Wrigley Field Rooftop.
| Risk                   | String    | Risk category of facility of adversely affecting the public’s health, with 1 being the highest and 3 the lowest. The frequency of inspection is tied to this risk, with risk 1 establishments inspected most frequently and risk 3 least frequently.
| Address        | String    | Street address of the establishment.
| City        | String    | City of the establishment.
| State        | String    | State of the establishment.
| Zip        | Integer    | Zip code of the establishment.
| Inspection Date        | Date    | Date of the inspection
| Inspection Type        | String    | An inspection can be one of the following types: canvass, the most common type of inspection performed at a frequency relative to the risk of the establishment; consultation, when the inspection is done at the request of the owner prior to the opening of the establishment; complaint, when the inspection is done in response to a complaint against the establishment; license, when the inspection is done as a requirement for the establishment to receive its license to operate; suspect food poisoning, when the inspection is done in response to one or more persons claiming to have gotten ill as a result of eating at the establishment (a specific type of complaint-based inspection); task-force inspection, when an inspection of a bar or tavern is done. Re-inspections can occur for most types of these inspections and are indicated as such.
| Results        | String    | An inspection can pass, pass with conditions or fail. Establishments receiving a ‘pass’ were found to have no critical or serious violations (violation numbers 1-14 and 15-29, respectively). Establishments receiving a ‘pass with conditions’ were found to have critical or serious violations, but these were corrected during the inspection. Establishments receiving a ‘fail’ were found to have critical or serious violations that were not correctable during the inspection. An establishment receiving a ‘fail’ does not necessarily mean the establishment’s license is suspended. Establishments found to be out of business or not located are indicated as such.
| Violations        | String    | An establishment can receive one or more of 45 distinct violations (violation numbers 1-44 and 70). For each violation number listed for a given establishment, the requirement the establishment must meet in order for it to NOT receive a violation is noted, followed by a specific description of the findings that caused the violation to be issued.
| Latitude        | Float    | Latitude of the establishment.
| Longitude        | Float    | Longitude of the establishment.



We use pandas_profiling to get a quick overview of the dataset: missing values, feature distributions and feature correlations.

In [6]:
#inspections.profile_report(style={'full_width':True})

In [7]:
#Save the report to an HTML file
#profile = inspections.profile_report(title='Inspection data Profiling Report')
#profile.to_file(output_file="data_profile.html")

## 2. Drop duplicates

The dataset source explicitly says there are duplicates in our data, so it makes sense to drop them. [source](https://www.kaggle.com/chicago/chicago-food-inspections)

In [8]:
inspections.drop_duplicates(inplace=True)
len(inspections)

194446

## 3. Dataset cleaning

### A. Drop null columns

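The 'Location' column contains the latitude and longitude of the establishment, which are also available in the separate 'Latitude' and 'Longitude' columns. The column is therefore redundant; the drop is left commented out below because a 'Location' column is rebuilt from the coordinates in section B.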

In [9]:
#inspections = inspections.drop(columns=['Location'])
#inspections.rename(columns={"Location": "Location_original"})

The head of the dataset only contains NaN entries for the 'Historical Wards 2003-2015', 'Zip Codes', 'Community Areas', 'Census Tracts', 'Wards' columns. Let's see if this is true for the whole dataset.

In [10]:
# make sure that our assumption is correct (each line checks its own column)
print('Values taken by \'Historical Wards 2003-2015\': ', inspections['Historical Wards 2003-2015'].unique())
print('Values taken by \'Zip Codes\': ', inspections['Zip Codes'].unique())
print('Values taken by \'Community Areas\': ', inspections['Community Areas'].unique())
print('Values taken by \'Census Tracts\': ', inspections['Census Tracts'].unique())
print('Values taken by \'Wards\': ', inspections['Wards'].unique())


Values taken by 'Historical Wards 2003-2015':  [nan]
Values taken by 'Zip Codes':  [nan]
Values taken by 'Community Areas':  [nan]
Values taken by 'Census Tracts':  [nan]
Values taken by 'Wards':  [nan]
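
The same check can be written more compactly. The sketch below is illustrative: the helper name `all_null_columns` and the toy dataframe are not part of the notebook.

```python
import numpy as np
import pandas as pd

def all_null_columns(df, candidates):
    """Return the candidate columns that contain only missing values."""
    return [c for c in candidates if df[c].isna().all()]

# Toy data: 'Wards' is entirely null, 'Zip' is only partly null
toy = pd.DataFrame({"Wards": [np.nan, np.nan], "Zip": [60625.0, np.nan]})
empty_cols = all_null_columns(toy, ["Wards", "Zip"])
```

Checking `isna().all()` per column avoids copy-paste mistakes when the same test is repeated for several column names.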


We drop all of these columns except 'Community Areas', which we will need in our study and will fill in later.

In [11]:
# drop the four all-null columns in a single call
inspections = inspections.drop(columns=['Historical Wards 2003-2015', 'Zip Codes', 'Census Tracts', 'Wards'])

### B. Clean the location related features and fill in community area feature

Let's examine whether the whole dataset is relevant to our study by checking which entries correspond to facilities in Chicago.

First, we check if there are any missing values for the column 'City' or 'State'

In [12]:
#Investigate the state=nan and city=nan restaurants
inspections[pd.isnull(inspections.State) | pd.isnull(inspections.City)]

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Community Areas
819,2312774,CHICAGO COLLEGIATE CHARTER,CHICAGO COLLEGIATE CHARTER,3846104.0,School,Risk 1 (High),10909 S COTTAGE GROVE AVE,,IL,,2019-09-24T00:00:00.000,Canvass Re-Inspection,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.696087,-87.608945,
976,2312540,CHICAGO COLLEGIATE CHARTER,CHICAGO COLLEGIATE CHARTER,3846104.0,School,Risk 1 (High),10909 S COTTAGE GROVE AVE,,IL,,2019-09-19T00:00:00.000,Canvass Re-Inspection,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.696087,-87.608945,
982,2312545,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,2671297.0,Children's Services Facility,Risk 1 (High),2112 W LAWRENCE AVE,,IL,60625.0,2019-09-19T00:00:00.000,License Re-Inspection,Pass,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.968821,-87.682201,
2152,2305166,"AMY BECK CAKE DESIGN, LLC","AMY BECK CAKE DESIGN, LLC",2079264.0,Bakery,Risk 1 (High),636 N RACINE AVE,,,60642.0,2019-08-23T00:00:00.000,Canvass,Pass,"55. PHYSICAL FACILITIES INSTALLED, MAINTAINED ...",41.893380,-87.657588,
2767,2304583,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,2671297.0,Children's Services Facility,Risk 1 (High),2112 W LAWRENCE AVE,,IL,60625.0,2019-08-13T00:00:00.000,License,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.968821,-87.682201,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193147,60291,"CLOVERHILL PASTRY-VEND,LLC","CLOVERHILL PASTRY-VEND,LLC",2004357.0,Wholesale,Risk 3 (Low),4464 W 44TH ST,,IL,60632.0,2010-02-03T00:00:00.000,License Re-Inspection,Pass,,41.814266,-87.736013,
193404,60282,"CLOVERHILL PASTRY-VEND,LLC","CLOVERHILL PASTRY-VEND,LLC",2004357.0,Wholesale,Risk 3 (Low),4464 W 44TH ST,,IL,60632.0,2010-01-28T00:00:00.000,License,Fail,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.814266,-87.736013,
193473,60279,"CLOVERHILL PASTRY-VEND,LLC","CLOVERHILL PASTRY-VEND,LLC",2004357.0,Wholesale,Risk 3 (Low),4464 W 44TH ST,,IL,60632.0,2010-01-27T00:00:00.000,License,Fail,,41.814266,-87.736013,
193994,67912,THREE CHEFS RESTURANT,THREE CHEFS RESTURANT,2009471.0,Restaurant,Risk 1 (High),8125 S HALSTED ST,,IL,60620.0,2010-01-15T00:00:00.000,License Re-Inspection,Pass,,41.746236,-87.643766,


Judging by their coordinates, all of these places appear to be in Chicago, so we fill in their City and State columns.

In [13]:
inspections['City'] = inspections['City'].fillna('Chicago')
inspections['State'] = inspections['State'].fillna('IL')

Next, we check if there are any facilities which are not located in Chicago.

In [14]:
# make sure that our assumption is correct
print('Values taken by \'City\': ', inspections['City'].unique())

Values taken by 'City':  ['CHICAGO' 'Chicago' 'chicago' 'GRIFFITH' 'NEW YORK' 'SCHAUMBURG'
 'ELMHURST' 'ALGONQUIN' 'NEW HOLSTEIN' 'CCHICAGO' 'NILES NILES' 'EVANSTON'
 'CHICAGO.' 'CHESTNUT STREET' 'LANSING' 'CHICAGOCHICAGO' 'WADSWORTH'
 'WILMETTE' 'WHEATON' 'CHICAGOHICAGO' 'ROSEMONT' 'CHicago' 'CALUMET CITY'
 'PLAINFIELD' 'HIGHLAND PARK' 'PALOS PARK' 'ELK GROVE VILLAGE' 'CICERO'
 'BRIDGEVIEW' 'OAK PARK' 'MAYWOOD' 'LAKE BLUFF' '312CHICAGO'
 'SCHILLER PARK' 'SKOKIE' 'BEDFORD PARK' 'BANNOCKBURNDEERFIELD' 'CHCICAGO'
 'BLOOMINGDALE' 'Norridge' 'CHARLES A HAYES' 'CHCHICAGO' 'CHICAGOI'
 'SUMMIT' 'OOLYMPIA FIELDS' 'WESTMONT' 'CHICAGO HEIGHTS' 'JUSTICE'
 'TINLEY PARK' 'LOMBARD' 'EAST HAZEL CREST' 'COUNTRY CLUB HILLS'
 'STREAMWOOD' 'BOLINGBROOK' 'INACTIVE' 'BERWYN' 'BURNHAM' 'DES PLAINES'
 'LAKE ZURICH' 'OLYMPIA FIELDS' 'alsip' 'OAK LAWN' 'BLUE ISLAND' 'GLENCOE'
 'FRANKFORT' 'NAPERVILLE' 'BROADVIEW' 'WORTH' 'Maywood' 'ALSIP'
 'EVERGREEN PARK']


We can see that this column takes values other than Chicago. Rows where the 'City' is not Chicago are irrelevant to our study and should be dropped. Let's first make sure that the bulk of the data is for Chicago before proceeding.

In [15]:
chicago_inspections = inspections.groupby('City')['Inspection ID'].nunique().filter(regex='(?i)chicago', axis=0)
print('{}% of the inspections in the dataframe come from Chicago.'.format(100 * chicago_inspections.values.sum()/len(inspections)))

99.89662939839339% of the inspections in the dataframe come from Chicago.
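
Note that a case-insensitive substring match on `chicago` also accepts 'CHICAGO HEIGHTS', which is a separate municipality. A stricter alternative is sketched below, assuming a hand-built whitelist of spellings read off the list of unique 'City' values; the names `CHICAGO_SPELLINGS` and `is_chicago` are hypothetical.

```python
import re
import pandas as pd

# Hypothetical whitelist of cleaned spellings that should count as Chicago
CHICAGO_SPELLINGS = {
    "CHICAGO", "CCHICAGO", "CHCICAGO", "CHCHICAGO", "CHICAGOI",
    "CHICAGOCHICAGO", "CHICAGOHICAGO", "312CHICAGO",
}

def is_chicago(city):
    """True if the city string is one of the accepted spellings of Chicago."""
    if pd.isna(city):
        return False
    # Upper-case and remove everything except letters and digits
    cleaned = re.sub(r"[^A-Z0-9]", "", str(city).upper())
    return cleaned in CHICAGO_SPELLINGS

cities = pd.Series(["Chicago", "CHICAGO.", "CHICAGO HEIGHTS", "EVANSTON", "312CHICAGO"])
mask = cities.map(is_chicago)
```

With such a mask, `inspections[inspections['City'].map(is_chicago)]` would keep the misspelled Chicago rows while excluding neighbouring cities whose name merely contains "Chicago".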


We can safely drop the rows which come from cities that are not Chicago.

In [16]:
# list of ways Chicago has been written in the dataset
chicago_variations = chicago_inspections.index.tolist()
inspections = inspections[inspections['City'].isin(chicago_variations)]
# drop the 'City' and 'State' columns since they have each only one value, 'Chicago' and 'IL' respectively
inspections = inspections.drop(columns=['City', 'State'])

Now that we only have facilities in Chicago in our dataset, let us fill the 'Community Areas' column. To that end, we use the geopy library.

We start by getting the unique locations in the dataset.

In [17]:
# def getareanneighbourhood(coord):
#     """
    
#     """
#     geolocator = Nominatim(timeout=10,user_agent="area_filler")
#     geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
#     dic = geocode.reverse(coord).raw['address']
#     return dic.get('suburb', np.nan), dic.get('neighbourhood', np.nan)

def combineloc(latitude, longitude):
    """
    function to format the latitude and longitude such that they can be used in geopy requests
    """
    return '{}, {}'.format(latitude, longitude)

In [18]:
locations = inspections['Latitude'].dropna().combine(inspections['Longitude'].dropna(),combineloc)
unique_locs = locations.unique()

In [19]:
unique_locs

array(['42.00558686485114, -87.66107732040031',
       '41.88342263701489, -87.62802165207536',
       '41.91039897821153, -87.6902068285586', ...,
       '41.768328334800714, -87.67381938402686',
       '41.764896400247046, -87.65396483351302',
       '41.846516428599394, -87.69542345938575'], dtype=object)

In [20]:
len(unique_locs)

16791

In [21]:
unique_locs_s = pd.Series(unique_locs, dtype=str)

We then request the geopy entry for each of these locations (the code takes about 4 h 40 min to run, since we can only send one geopy query per second) and save the resulting areas in a pickle.

In [22]:
# geolocator = Nominatim(timeout=17000,user_agent="area_filler")
# geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)
# # for i in unique_locs:
# #     print(i)
# #     print(geolocator.reverse(i))
# areas = unique_locs_s.copy().apply(geocode)
# areas.to_pickle('./areas.pickle')  # same file name as the one read back below
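
Because the full run takes hours, it may help to make the lookup resumable. The sketch below is one possible approach under stated assumptions, not the code used to produce the pickle: `reverse_geocode_all`, `reverse_fn` and `cache` are hypothetical names, and a stand-in function replaces the real geopy call so the example runs offline.

```python
import time

def reverse_geocode_all(locations, reverse_fn, cache=None, delay_seconds=1.0):
    """Reverse-geocode each location once, reusing results already in `cache`."""
    cache = {} if cache is None else cache
    for loc in locations:
        if loc in cache:
            continue  # already resolved, possibly in an earlier interrupted run
        cache[loc] = reverse_fn(loc)
        time.sleep(delay_seconds)  # pause between requests to respect the rate limit
    return cache

# Stand-in for the real lookup; in the notebook `reverse_fn` would be
# `geolocator.reverse` and `cache` could be reloaded from a saved pickle.
def fake_reverse(loc):
    return {"suburb": "Loop"}

results = reverse_geocode_all(
    ["41.88, -87.62", "41.88, -87.62"], fake_reverse, delay_seconds=0
)
```

Keying the results by the location string also removes any reliance on the positional alignment between `unique_locs` and the saved results.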

In [23]:
areas = pd.read_pickle('./areas.pickle')

In [24]:
areas.isna().sum()

0

Let's add the community areas and neighborhoods to the dataframe.

In [25]:
# get latitude, longitude and corresponding community area and neighbourhood in same dataframe
suburbs_neighbourhoods = [(x.raw.get('address', {}).get('suburb',np.nan), x.raw.get('address', {}).get('neighbourhood',np.nan)) for x in areas]
suburbs, neighbourhoods = zip(*suburbs_neighbourhoods)
locs_df = pd.concat([pd.Series(unique_locs, name='Location'), pd.Series(suburbs,name='Community Area'), pd.Series(neighbourhoods,name='Neighbourhood')], axis=1)

In [26]:
# add the community area and the neighbourhood to each entry in our dataframe
inspections['Location'] = inspections['Latitude'].combine(inspections['Longitude'],combineloc)
inspections = inspections.merge(locs_df,on='Location',how='outer')
inspections = inspections.drop(columns=['Community Areas'])
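
The choice of `how` matters in this merge. The toy example below (illustrative data, not taken from the dataset) shows that an outer merge keeps lookup rows that match no inspection, whereas a left merge preserves exactly the original inspections.

```python
import pandas as pd

left = pd.DataFrame({"Inspection ID": [1, 2], "Location": ["a", "b"]})
right = pd.DataFrame({"Location": ["a", "c"], "Community Area": ["Loop", "Kenwood"]})

# Outer merge: keeps keys from both sides; `indicator` records where each row came from
outer = left.merge(right, on="Location", how="outer", indicator=True)

# Left merge: one output row per inspection; `validate` checks the lookup keys are unique
left_only = left.merge(right, on="Location", how="left", validate="many_to_one")

# Rows that exist only in the lookup table carry no inspection data at all
orphans = outer[outer["_merge"] == "right_only"]
```

Inspecting the `_merge` column after an outer merge is a quick way to find out whether unexpected rows were introduced by the lookup table.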

Let's check if there are any NaN entries in our 'Community Area' column

In [27]:
print('{}% of rows don\'t have missing Community Areas'.format(100 * (1 - inspections['Community Area'].isna().sum()/len(inspections))))

96.55987151637446% of rows don't have missing Community Areas


We may safely drop the rows which have null 'Community Area'.

In [28]:
inspections = inspections[inspections['Community Area'].notna()]

### C. Check which columns still have missing values (& bug ?)

Let's check whether there are any more missing values in the dataframe.

In [29]:
inspections.isna().sum().apply(lambda x: '{}% missing values'.format(100 * x/len(inspections)))

Inspection ID      0.011195044327044561% missing values
DBA Name           0.011195044327044561% missing values
AKA Name             1.2794336373765214% missing values
License #          0.020257699258461586% missing values
Facility Type        2.4687738227877793% missing values
Risk                0.04531327465708513% missing values
Address            0.011195044327044561% missing values
Zip                 0.03625061972566811% missing values
Inspection Date    0.011195044327044561% missing values
Inspection Type    0.011728141675951445% missing values
Results            0.011195044327044561% missing values
Violations            26.55837682519205% missing values
Latitude           0.011195044327044561% missing values
Longitude          0.011195044327044561% missing values
Location           0.011195044327044561% missing values
Community Area                      0.0% missing values
Neighbourhood        16.249873389379633% missing values
dtype: object

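#### Zeineb to André: if you duplicate this code and look at the dataset before the community areas are added, there were no missing Inspection IDs -> is there a bug somewhere?
Entries with a null Inspection ID make no sense, so we look at those entries and drop them. They contain only a community area and a neighbourhood, which suggests that they were introduced by the `how='outer'` merge above (rows of `locs_df` that match no inspection) rather than by the original data.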

In [30]:
missing_inspections = inspections[inspections['Inspection ID'].isnull()]
print("Number of missing inspections ID: ", len(missing_inspections))
missing_inspections.head()

Number of missing inspections ID:  21


Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area,Neighbourhood
194245,,,,,,,,,,,,,,,,Humboldt Park,Beat 2534
194246,,,,,,,,,,,,,,,,Near West Side,Near West Side
194247,,,,,,,,,,,,,,,,Kenwood,Kenwood
194248,,,,,,,,,,,,,,,,Lower West Side,Pilsen
194249,,,,,,,,,,,,,,,,Logan Square,Maplewood


In [31]:
inspections.dropna(subset=['Inspection ID'],inplace = True)

In [32]:
inspections.isna().sum().apply(lambda x: '{}% missing values'.format(100 * x/len(inspections)))

Inspection ID                       0.0% missing values
DBA Name                            0.0% missing values
AKA Name             1.2683805888186306% missing values
License #          0.009063669613247886% missing values
Facility Type        2.4578539362983975% missing values
Risk                0.03412205030869792% missing values
Address                             0.0% missing values
Zip                 0.02505838069545004% missing values
Inspection Date                     0.0% missing values
Inspection Type    0.000533157036073405% missing values
Results                             0.0% missing values
Violations           26.550154082383425% missing values
Latitude                            0.0% missing values
Longitude                           0.0% missing values
Location                            0.0% missing values
Community Area                      0.0% missing values
Neighbourhood         16.24956014544524% missing values
dtype: object

* Some AKA names are still missing. We will replace them with the DBA name, because we will need the AKA names for our recommendation map later on and it makes more sense to display them to users. However, we will mostly stick to the DBA name when referring to establishments.
* The license number is missing for some entries. Since it is not essential to our main analysis, we ignore it for now.
* The missing Zip entries are not important, as we have enough information regarding location (latitude, longitude, community area and address). Hence we can safely drop this column.
* A large share of the neighbourhoods (about 16%) is missing. Hence we drop that column as well.
* We will try to recover the missing facility types from the establishment's name, using other entries with the same name and a filled-in type.
* We will check whether the missing violations are consistent with the inspection type and the inspection results.
* Entries with a missing Inspection Type make up a very small fraction of the dataset. Hence we can safely drop them.
* Missing Risk values are the only ones that might hinder our analysis. We will try to recover them from the establishment's name and other filled-in entries. For the values that we cannot recover, we drop the corresponding entries.
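
Two of the steps listed above can be sketched as follows, on toy data. This is an illustration under the assumption that inspections sharing a DBA name share a risk level, which may not hold for chains with several locations.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DBA Name": ["SUBWAY", "SUBWAY", "JOE'S"],
    "AKA Name": ["SUBWAY", np.nan, np.nan],
    "Risk": ["Risk 1 (High)", np.nan, np.nan],
})

# Fall back to the DBA name when the AKA name is missing
toy["AKA Name"] = toy["AKA Name"].fillna(toy["DBA Name"])

# Borrow the first known risk level from other inspections with the same name
known_risk = toy.groupby("DBA Name")["Risk"].transform("first")
toy["Risk"] = toy["Risk"].fillna(known_risk)

# Drop the rows whose risk could not be recovered
toy = toy.dropna(subset=["Risk"])
```

Grouping on the license number instead of the name would be a stricter variant of the same idea.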

### D. Drop unneeded columns

In [33]:
# drop neighbourhood and zip columns
inspections = inspections.drop(columns=['Neighbourhood','Zip'])

### E. Clean Facility Type column

Let's replace the null values with the right facility type

In [34]:
print("Number of null Facility Types before recovering: ",len(inspections[inspections['Facility Type'].isnull()]))
inspections[inspections['Facility Type'].isnull()].head(2)

Number of null Facility Types before recovering:  4610


Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area
124,1515531.0,GATEWAY NEWS STAND,GATEWAY NEWS STAND,1245984.0,,Risk 3 (Low),108 N STATE ST,2014-12-30T00:00:00.000,Canvass,Out of Business,,41.883423,-87.628022,"41.88342263701489, -87.62802165207536",Irving Park
181,660114.0,SUBWAY SANDWICH & SALADS,SUBWAY SANDWICH & SALADS,1275947.0,,Risk 1 (High),2512 W NORTH AVE OOB,2012-01-23T00:00:00.000,Canvass,Out of Business,,41.910399,-87.690207,"41.91039897821153, -87.6902068285586",Lincoln Park


In [35]:
# named helper so that it can be passed to the aggregation below
def to_set(a):
    return set(a)

# named aggregation replaces the deprecated dict-of-functions form
establisment_facility_types = (
    inspections[['DBA Name', 'Facility Type']]
    .drop_duplicates()
    .groupby('DBA Name')['Facility Type']
    .agg(nbr_types=len, possible_types=to_set)
)
print("Possible number of different Facility Types for some establishment : ", establisment_facility_types['nbr_types'].unique())
establisment_facility_types[establisment_facility_types['nbr_types']>2].head()

Possible number of different Facility Types for some establishment :  [1 2 4 3 7 5 6 9]


Unnamed: 0_level_0,nbr_types,possible_types
DBA Name,Unnamed: 1_level_1,Unnamed: 2_level_1
7-ELEVEN,4,"{nan, Liquor, Restaurant, Grocery Store}"
ALASKA PALETERIA Y NEVERIA,3,"{Restaurant, MOBILE FROZEN DESSERTS DISPENSER-..."
ALL ABOUT KIDS LEARNING ACADEMY,3,"{Children's Services Facility, nan, Daycare Co..."
ARAMARK,7,"{nan, CHARTER SCHOOL, Special Event, Restauran..."
ARAMARK EDUCATION SERVICES,4,"{CHARTER SCHOOL, Restaurant, SCHOOL, School}"


We see that some establishments have several possible facility types. In that case, we replace the null values with one of the possible types (chosen arbitrarily, since the code below takes the first element of a set).

In [36]:
#First remove NAN from sets
establisment_facility_types['possible_types'] = establisment_facility_types.apply(
    lambda row: {x for x in row['possible_types'] if pd.notna(x)},
    axis=1
) 
#Then keep only one type per establishment
establisment_facility_types['possible_types'] = establisment_facility_types.apply(
    lambda row: next(iter(row['possible_types'])) if len(row['possible_types'])!=0 else np.nan,
    axis=1
) 
establisment_types_dict=establisment_facility_types['possible_types'].to_dict()
#Assign a type to each missing value
inspections['Facility Type'] = inspections.apply(
    lambda row: establisment_types_dict[row['DBA Name']] if pd.isna(row['Facility Type']) else row['Facility Type'],
    axis=1
)

In [37]:
print(len(inspections[inspections['Facility Type'].isnull()]))

3822
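
Taking an arbitrary element of a set is not reproducible across runs. A deterministic alternative, sketched here on toy data, is to fill each gap with the most frequent known type for the same name; `most_common` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DBA Name": ["7-ELEVEN", "7-ELEVEN", "7-ELEVEN", "7-ELEVEN", "NEW PLACE"],
    "Facility Type": ["Grocery Store", "Grocery Store", "Liquor", np.nan, np.nan],
})

def most_common(values):
    """Most frequent non-null value (ties broken by sort order); NaN if none."""
    modes = values.dropna().mode()
    return modes.iloc[0] if len(modes) else np.nan

# One fallback type per establishment name
fallback = toy.groupby("DBA Name")["Facility Type"].agg(most_common)

# Fill only the missing entries, looking the fallback up by name
toy["Facility Type"] = toy["Facility Type"].fillna(toy["DBA Name"].map(fallback))
```

Names that never appear with a known type, such as 'NEW PLACE' here, stay missing and can still be labelled 'other' afterwards.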


We were able to successfully recover 17% of the missing facility types in our dataset. We will assign the value 'other' to those still missing.

In [38]:
inspections['Facility Type'] = inspections['Facility Type'].fillna('other')

## Zeineb to André: I did not understand the purpose of this part. Do you want to create a new feature with a more general facility_type? Please add a comment here explaining the goal.

We first examine the facility type entries

In [39]:
# from nltk.corpus import stopwords
# stop = stopwords.words('english')
spell = Speller(lang='en')
inspections['Facility Type'] = inspections['Facility Type'].str.lower().apply(lambda x: spell(str(x)))
unique_types = pd.Series(inspections['Facility Type'].unique())
unique_types = pd.Series(unique_types.apply(lambda x: re.sub(',|&|;','/',str(x)).split('/')).explode().unique())
unique_types.apply(lambda x: str(x).strip())

0            restaurant
1                museum
2               gallery
3             newsstand
4         grocery store
             ...       
375        soup kitchen
376        hooka lounge
377           religious
378    wholesale bakery
379          kids cafe'
Length: 380, dtype: object

In [40]:
#corrected_unique_types = unique_types.apply(lambda x: spell(str(x).strip()))

In [41]:
#corrected_unique_types

The dataframe contains hundreds of distinct facility-type labels (380 after splitting composite entries, as shown above). Let's see if we can cluster some of them into broader categories.

In [42]:
school_vocab = ['school', 'under 6', 'university', 'cafeteria', 'training program', 'kids', 'children', 'daycare', 'years', 'school cafeteria', 'college', 'class', 'day care']
restaurant_vocab = ['restaurant', 'smokehouse', 'diner', 'breakfast', 'lunch', 'grill', 'sushi','banquet', 'dining', 'taqueria']
homes_vocab = ['long term care','assisted living', 'nursing', 'care', 'supportive']

religious_vocab = ['religious', 'religion', 'church', 'synagogue', 'synaguogue', 'temple']  # misspelled variant kept in case it occurs in the data
coffee_vocab = ['coffee', 'cafe']
catering_vocab = ['cater']

bakery_vocab = ['bake', 'patisserie', 'boulangerie']
market_vocab = ['butcher', 'snack', 'dollar', 'grocery', 'frozen food storage', 'meat packing', 'food', 'market', 'packaged', 'popcorn', 'pantry', 'store', 'produce', 'kiosk', 'convenience', 'commissary', 'commiasary','gas station', 'wholesale', 'deli', 'convenient', 'retail', 'dollar tree']
dessert_vocab = ['ice cream', 'dessert', 'paleteria', 'candy', 'gelato', 'donut']

health_vocab = ['gym', 'exercise', 'nutrition','herbal', 'fitness', 'drug', 'rehab', 'herbalife', 'weight', 'herb', 'health', 'decream']
medical_vocab = ['pharmacy']
vending_vocab = ['vend', 'pop-up', 'mobile', 'cart', 'dispenser', 'truck']

drinks_vocab = ['liquor', 'music', 'bar', 'pub', 'beverage', 'club', 'roof', 'tavern', 'brewery', 'wine', 'beer', 'lounge', 'tea', 'shakes']
live_vocab =['live', 'poultry', 'slaughter', 'farm']
kitchen_vocab = ['kitchen']

distributor_vocab = ['distributor', 'distribution']
shelter_vocab = ['shelter', 'youth housing']
# other_vocab = ['warehouse', 'other', 'theater', 'golf', 'laundromat', 'hotel', 'riverbank', 'theatre', 'event', 'special', 'museum', 'hospital', 'pool', 'art', 'airport', 'gallery', 'terminal']


In [43]:
#unique_types = corrected_unique_types

In [44]:
#'live poultry' in corrected_unique_types.values

In [45]:
def extract_label(unique_labels, vocab, mssg):
    """Split `unique_labels` into the labels matching any word of `vocab` and the rest."""
    pattern = '|'.join(vocab)
    # use the argument rather than the global `unique_types`
    labels = unique_labels[unique_labels.apply(lambda x: bool(re.search(pattern, x)))]
    print(labels)
    print('{}: {}'.format(mssg, labels.size))
    remaining = unique_labels.drop(labels=labels.index, errors='ignore')
    return remaining, labels

In [46]:
unique_types, school_labels = extract_label(unique_types, school_vocab, 'Number of Schools: ')
unique_types, restaurant_labels = extract_label(unique_types, restaurant_vocab, 'Number of Restaurants: ')
unique_types, homes_labels = extract_label(unique_types, homes_vocab, 'Number of Nursing homes: ')

unique_types, religious_labels = extract_label(unique_types, religious_vocab, 'Number of Religious Establishments: ')
unique_types, coffee_labels = extract_label(unique_types, coffee_vocab, 'Number of Coffeeshops: ')
unique_types, catering_labels = extract_label(unique_types, catering_vocab, 'Number of Caterers: ')

unique_types, bakery_labels = extract_label(unique_types, bakery_vocab, 'Number of Bakeries: ')
unique_types, market_labels = extract_label(unique_types, market_vocab, 'Number of Markets: ')
unique_types, dessert_labels = extract_label(unique_types, dessert_vocab, 'Number of Dessert Places: ')

unique_types, health_labels = extract_label(unique_types, health_vocab, 'Number of Health Institutions: ')
unique_types, medical_labels = extract_label(unique_types, medical_vocab, 'Number of Pharmacies: ')
unique_types, vending_labels = extract_label(unique_types, vending_vocab, 'Number of Vending Establishments: ')

unique_types, drinks_labels = extract_label(unique_types, drinks_vocab, 'Number of Drinks Places: ')
unique_types, live_labels = extract_label(unique_types, live_vocab, 'Number of Live Animal Sellers and Slaughterhouses: ')
unique_types, kitchen_labels = extract_label(unique_types, kitchen_vocab, 'Number of Shared Kitchen: ')

unique_types, distributor_labels = extract_label(unique_types, distributor_vocab, 'Number of Distributors: ')
unique_types, shelter_labels = extract_label(unique_types, shelter_vocab, 'Number of Shelters: ')
other_labels = unique_types

5                       daycare (2 - 6 years)
7                children's services facility
10            daycare above and under 2 years
23                                     school
32                children's service facility
33                         daycare combo 1586
38                            teaching school
43          1023 children's services facility
44                             private school
46                             cooking school
47                             charter school
48                    daycare (under 2 years)
52                          daycare (2 years)
56                                  cafeteria
60                              daycare night
61                       after school program
74                                    daycare
86         1023 children's service s facility
107                           senior day care
119         1023-children's services facility
127                           culinary school
149       retail store offers cook

In [47]:
unique_types[0:10]

1                    museum
2                   gallery
3                 newsstand
9                     other
17     silverware warehouse
30                     pool
41                 hospital
50    chicago park district
66                  theater
77              event space
dtype: object

### f. Clean the AKA names column

**Explore the difference between DBA and AKA names**

## Zeineb: I think this whole part is useless, since we already know the difference between DBA and AKA

In [48]:
print ('There are {0} unique DBA (‘Doing business as.’) names in the dataset.'.format(len(inspections['DBA Name'].unique())))

There are 26555 unique DBA (‘Doing business as.’) names in the dataset.


In [49]:
# Display the number of unique AKA names
print ('There are {0} AKA (‘Also known as.’) names in the dataset.'.format(len(inspections['AKA Name'].unique())))

There are 25355 AKA (‘Also known as.’) names in the dataset.


In [50]:
# Explore how DBA and AKA names differ
print ('There are {0} rows where the DBA names and the AKA names differ.'\
       .format((len(inspections[inspections['DBA Name'] != inspections['AKA Name']]))))

There are 48701 rows where the DBA names and the AKA names differ.


In [51]:
print('Examples of different DBA and AKA names : ')
inspections[inspections['DBA Name'] != inspections['AKA Name']].head(10)

Examples of different DBA and AKA names : 


Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area
3,2261305.0,ROGERS PIER,ROGERS PIER SEAFOOD BAR AND GRILL,2340712.0,restaurant,Risk 1 (High),6800-6806 N SHERIDAN RD,2019-01-29T00:00:00.000,Canvass,Out of Business,,42.005587,-87.661077,"42.00558686485114, -87.66107732040031",Belmont Cragin
4,2233129.0,ROGERS PIER,ROGERS PIER SEAFOOD BAR AND GRILL,2340712.0,restaurant,Risk 1 (High),6800-6806 N SHERIDAN RD,2018-11-13T00:00:00.000,Canvass,Out of Business,,42.005587,-87.661077,"42.00558686485114, -87.66107732040031",Belmont Cragin
5,2145011.0,ROGERS PIER,ROGERS PIER SEAFOOD BAR AND GRILL,2340712.0,restaurant,Risk 1 (High),6800-6806 N SHERIDAN RD,2018-02-15T00:00:00.000,Canvass Re-Inspection,Pass,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO...",42.005587,-87.661077,"42.00558686485114, -87.66107732040031",Belmont Cragin
6,2144606.0,ROGERS PIER,ROGERS PIER SEAFOOD BAR AND GRILL,2340712.0,restaurant,Risk 1 (High),6800-6806 N SHERIDAN RD,2018-02-06T00:00:00.000,Canvass,Fail,12. HAND WASHING FACILITIES: WITH SOAP AND SAN...,42.005587,-87.661077,"42.00558686485114, -87.66107732040031",Belmont Cragin
7,1955612.0,ROGERS PIER,ROGERS PIER SEAFOOD BAR AND GRILL,2340712.0,restaurant,Risk 1 (High),6800-6806 N SHERIDAN RD,2017-04-12T00:00:00.000,Complaint,Pass,31. CLEAN MULTI-USE UTENSILS AND SINGLE SERVIC...,42.005587,-87.661077,"42.00558686485114, -87.66107732040031",Belmont Cragin
8,1955536.0,ROGERS PIER,ROGERS PIER SEAFOOD BAR AND GRILL,2340712.0,restaurant,Risk 1 (High),6800-6806 N SHERIDAN RD,2017-03-02T00:00:00.000,Short Form Complaint,Pass,"30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABEL...",42.005587,-87.661077,"42.00558686485114, -87.66107732040031",Belmont Cragin
9,1955443.0,ROGERS PIER,ROGERS PIER SEAFOOD BAR AND GRILL,2340712.0,restaurant,Risk 1 (High),6800-6806 N SHERIDAN RD,2017-01-10T00:00:00.000,Canvass,Pass,"30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABEL...",42.005587,-87.661077,"42.00558686485114, -87.66107732040031",Belmont Cragin
32,2311402.0,"RENA'S FUDGE SHOPS, INC./BLOCK 37 STARBUCKS",STARBUCK'S,2626476.0,restaurant,Risk 2 (Medium),108 N STATE ST,2019-08-27T00:00:00.000,Canvass Re-Inspection,Pass,,41.883423,-87.628022,"41.88342263701489, -87.62802165207536",Irving Park
33,2304949.0,MAGNOLIA BAKERY,MAGNOLIA BAKER,2114823.0,restaurant,Risk 1 (High),108 N STATE ST,2019-08-20T00:00:00.000,Complaint,Pass w/ Conditions,37. FOOD PROPERLY LABELED; ORIGINAL CONTAINER ...,41.883423,-87.628022,"41.88342263701489, -87.62802165207536",Irving Park
35,2305009.0,"RENA'S FUDGE SHOPS, INC./BLOCK 37 STARBUCKS",STARBUCK'S,2626476.0,restaurant,Risk 2 (Medium),108 N STATE ST,2019-08-20T00:00:00.000,Canvass,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.883423,-87.628022,"41.88342263701489, -87.62802165207536",Irving Park


## END

We see that the AKA name is the name of the restaurant as known to the public. We decide to copy the DBA name into the missing AKA names: we will need them for our recommendation map later on, and it makes more sense to display the AKA names to users.

In [52]:
# Assign the result back instead of using inplace=True on a column selection
inspections['AKA Name'] = inspections['AKA Name'].fillna(inspections['DBA Name'])

### g. Explore the violations column

Let's first check whether the missing violations are consistent with the inspection results.

In [53]:
inspections[inspections['Violations'].isnull()]['Results'].unique()

array(['No Entry', 'Out of Business', 'Pass', 'Fail', 'Not Ready',
       'Pass w/ Conditions', 'Business Not Located'], dtype=object)

We expected the entries with null Violations to have 'No Entry', 'Out of Business', 'Pass' or 'Business Not Located' as their Results value. We see that null Violations also occur for 'Fail', 'Not Ready' and 'Pass w/ Conditions' inspections. We still keep those entries because they can be useful for other metrics.
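To quantify how often each result comes without recorded violations, one could compute the share of null Violations per Results value. The following is a minimal sketch on a toy frame: the rows are made up for illustration and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; in the notebook the frame is `inspections`.
toy = pd.DataFrame({
    'Results': ['Pass', 'Pass', 'Fail', 'Fail', 'No Entry'],
    'Violations': ['32. FOOD ...', np.nan, '12. HAND ...', np.nan, np.nan],
})

# Share of inspections without recorded violations, per result
null_share = toy['Violations'].isna().groupby(toy['Results']).mean()
print(null_share)
```

Applied to `inspections`, the same expression would show whether null Violations are rare exceptions for 'Fail' and 'Pass w/ Conditions' or a sizeable fraction.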

In [54]:
#TODO: Explode the violations column and create a pickle of all possible violations with their respective numbers
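A possible way to approach this TODO is sketched below. It assumes that each Violations string lists violations separated by ` | `, and that each one looks like `<number>. <DESCRIPTION> - Comments: <text>`; this format is an assumption that should be checked against the real data. The helper `parse_violations` and the output column names are hypothetical.

```python
import re
import pandas as pd

# Assumed raw format (to be verified on the real column):
# "<number>. <DESCRIPTION> - Comments: <text> | <number>. <DESCRIPTION> ..."
VIOLATION_RE = re.compile(
    r'^\s*(\d+)\.\s*(.*?)(?:\s*-\s*Comments:\s*(.*))?$', re.DOTALL
)

def parse_violations(raw):
    """Return a list of (number, description, comment) tuples for one inspection."""
    if pd.isna(raw):
        return []
    parsed = []
    for chunk in raw.split(' | '):
        match = VIOLATION_RE.match(chunk)
        if match:
            number, description, comment = match.groups()
            parsed.append((int(number), description.strip(), (comment or '').strip()))
    return parsed

# Toy data for illustration only
toy = pd.DataFrame({
    'Inspection ID': [1, 2, 3],
    'Violations': [
        '34. FLOORS: CLEANED - Comments: FLOOR DIRTY | 35. WALLS - Comments: HOLE IN WALL',
        '12. HAND WASHING FACILITIES',
        None,
    ],
})

# One row per (inspection, violation); inspections without violations are dropped
exploded = (
    toy.assign(parsed=toy['Violations'].apply(parse_violations))
       .explode('parsed')
       .dropna(subset=['parsed'])
       .reset_index(drop=True)
)
details = pd.DataFrame(exploded['parsed'].tolist(),
                       columns=['Violation #', 'Violation', 'Comment'])
violations = pd.concat([exploded[['Inspection ID']], details], axis=1)
print(violations)
```

The resulting long table keeps the violation numbers for counting and the inspector comments for a possible NLP step.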

### h. Drop missing entries in Inspection Type

Let's see what's left to be done in our data cleaning process

In [55]:
inspections.isna().sum().apply(lambda x: '{}% missing values'.format(100 * x/len(inspections)))

Inspection ID                       0.0% missing values
DBA Name                            0.0% missing values
AKA Name                            0.0% missing values
License #          0.009063669613247886% missing values
Facility Type                       0.0% missing values
Risk                0.03412205030869792% missing values
Address                             0.0% missing values
Inspection Date                     0.0% missing values
Inspection Type    0.000533157036073405% missing values
Results                             0.0% missing values
Violations           26.550154082383425% missing values
Latitude                            0.0% missing values
Longitude                           0.0% missing values
Location                            0.0% missing values
Community Area                      0.0% missing values
dtype: object

The share of entries with a missing Inspection Type is tiny (about 0.0005%), so we can safely drop them.

In [56]:
inspections.dropna(axis=0, subset=['Inspection Type'], inplace=True)

### i. Clean Risk column

First of all, let's extract the Risk factor from the string so that we can perform the arithmetic comparisons needed in our later analysis.

Example: Risk 3 (Low) -> becomes: 3

In [57]:
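# Collect every run of digits found in the Risk string (raw string avoids an invalid-escape warning)
inspections['Risk'] = inspections.apply(
    lambda row: re.findall(r'\d+', row['Risk']) if pd.notna(row['Risk']) else [],
    axis=1
)
# Keep the first number found, or NaN when there is none
inspections['Risk'] = inspections.apply(
    lambda row: row['Risk'][0] if len(row['Risk'])>0 else np.nan,
    axis=1
)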

In [58]:
inspections['Risk'].unique()

array(['1', '2', '3', nan], dtype=object)
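The same extraction can probably be written without a row-wise `apply`, using the vectorized `Series.str.extract`. This is a sketch on a toy Series, not the code that produced the output above; the last entry is a made-up value without digits.

```python
import pandas as pd

# Toy values for illustration; the last one is a hypothetical entry without digits
risk = pd.Series(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', None, 'no level'])

# expand=False returns a Series; rows that are missing or contain no digit become NaN
risk_level = risk.str.extract(r'(\d+)', expand=False)
print(risk_level.unique())
```

As in the cell above, the extracted levels are still strings at this point and are only converted to integers once the missing values have been filled.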

Now we try to recover the missing values from other data entries

In [59]:
print("Number of null Risk values before recovering: ",len(inspections[inspections['Risk'].isnull()]))
inspections[inspections['Risk'].isnull()].head(2)

Number of null Risk values before recovering:  92


Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area
4020,2315960.0,MINGHIN JAPENESE,MINGHIN JAPENESE,2670193.0,other,,1232-1234 S MICHIGAN AVE,2019-10-15T00:00:00.000,License,Not Ready,,41.866541,-87.624281,"41.86654115432376, -87.62428056789263",Hermosa
5787,469312.0,ARACELI,TACOS ARACELI,2068908.0,other,,2158 W 23RD ST,2010-12-01T00:00:00.000,License,Fail,,41.850399,-87.680572,"41.850399240651626, -87.68057158857073",Logan Square


In [60]:
# Named helper passed to agg: collects the distinct Risk values of an establishment
def to_set(a):
    return set(a)

# Named aggregation replaces the deprecated dict-renaming form of agg
establisment_risks = (
    inspections[['DBA Name','Risk']]
    .drop_duplicates()
    .groupby('DBA Name')['Risk']
    .agg(nbr_values=len, possible_values=to_set)
)
print("Possible number of values for risk for some establishment : ", establisment_risks['nbr_values'].unique())
establisment_risks[establisment_risks['nbr_values']>1].head()

Possible number of values for risk for some establishment :  [1 2 3 4]


DBA Name,nbr_values,possible_values
11 DINING,2,"{2, 1}"
"11 DINING, LLC",3,"{2, 3, 1}"
111 COFFEE BAR,2,"{nan, 2}"
123 FOOD,2,"{2, 3}"
"14 W. HUBBARD, LLC",2,"{3, 1}"


We see that the same establishment may have different Risk values (perhaps the risk changes over time depending on the results of inspections). We choose to fill in the missing values in the Risk column by taking the maximum Risk number recorded for that establishment. Note that, since Risk 1 is High and Risk 3 is Low, the maximum number corresponds to the lowest recorded risk level.

In [61]:
#First remove NAN from sets
establisment_risks['possible_values'] = establisment_risks.apply(
    lambda row: {int(x) for x in row['possible_values'] if pd.notna(x)},
    axis=1
) 
#Then keep only one type per establishment
establisment_risks['possible_values'] = establisment_risks.apply(
    lambda row: max(row['possible_values']) if len(row['possible_values'])!=0 else np.NaN,
    axis=1
) 
establisment_risks_dict=establisment_risks['possible_values'].to_dict()
#Assign a type to each missing value
inspections['Risk'] = inspections.apply(
    lambda row: establisment_risks_dict[row['DBA Name']] if pd.isna(row['Risk']) else row['Risk'],
    axis=1
)
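The per-establishment fill can also be expressed with `groupby(...).transform('max')`, which avoids the intermediate dictionary. The sketch below uses a toy frame and assumes Risk is already numeric; in the notebook the column still holds strings at this point, so it would first need a conversion such as `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; Risk is assumed numeric here
toy = pd.DataFrame({
    'DBA Name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Risk': [1, 3, np.nan, 2, np.nan, np.nan],
})

# Maximum known Risk per establishment, broadcast back to every row
fallback = toy.groupby('DBA Name')['Risk'].transform('max')

# Only the missing entries are replaced; establishment 'C' has no known value and stays NaN
toy['Risk'] = toy['Risk'].fillna(fallback)
print(toy)
```

Establishments with no known Risk at all remain missing and still need a default value afterwards.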

In [62]:
print("Number of null Risk values after recovering: ",len(inspections[inspections['Risk'].isnull()]))
inspections[inspections['Risk'].isnull()].head(2)

Number of null Risk values after recovering:  63


Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area
4020,2315960.0,MINGHIN JAPENESE,MINGHIN JAPENESE,2670193.0,other,,1232-1234 S MICHIGAN AVE,2019-10-15T00:00:00.000,License,Not Ready,,41.866541,-87.624281,"41.86654115432376, -87.62428056789263",Hermosa
5787,469312.0,ARACELI,TACOS ARACELI,2068908.0,other,,2158 W 23RD ST,2010-12-01T00:00:00.000,License,Fail,,41.850399,-87.680572,"41.850399240651626, -87.68057158857073",Logan Square


We see that we were able to recover 29 of the 92 missing values (about 32%); for the remaining 63, let's use the default value 'Risk 2 (Medium)', as it is the middle value.

In [63]:
inspections['Risk'] = inspections['Risk'].fillna('2')
print("Number of null Risk values after filling: ",len(inspections[inspections['Risk'].isnull()]))

Number of null Risk values after filling:  0


### j. Clean column License #:

What do the entries with no "License #" look like?

In [64]:
inspections[inspections['License #'].isnull()]

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area
65421,2290863.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2019-06-04T00:00:00.000,Canvass,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65422,2181316.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2018-06-13T00:00:00.000,Canvass,Pass,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65423,2071910.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2017-08-04T00:00:00.000,Canvass,Pass,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65424,1933084.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2016-06-20T00:00:00.000,Canvass,Pass,38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS...,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65425,1561809.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2015-08-04T00:00:00.000,Canvass,Pass,,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65426,1459918.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2014-05-20T00:00:00.000,Canvass,Pass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65427,1099104.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2013-07-24T00:00:00.000,Canvass,Pass,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65428,1188285.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2012-07-25T00:00:00.000,Canvass,Pass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65429,521659.0,ST. DEMETRIOS GREEK ORTHODOX CHURCH,ST. DEMETRIOS CHURCH,,special event,2,2727 W WINONA ST,2011-08-10T00:00:00.000,Canvass,Pass,,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side
65430,339207.0,ST DEMETRIOS CHURCH,ST DEMETRIOS CHURCH,,special event,1,2727 W WINONA ST,2010-07-30T00:00:00.000,Special Events (Festivals),Pass,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.974653,-87.697529,"41.974653353169366, -87.69752945714045",Near North Side


In [65]:
inspections[inspections['DBA Name']=='ARGENTINA FOODS']

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area
159539,1152076.0,ARGENTINA FOODS,ARGENTINA FOODS,,grocery store,2,4500 S WOOD ST,2014-04-10T00:00:00.000,Canvass,Out of Business,,41.812105,-87.670072,"41.812105152977246, -87.67007183351623",Loop
159541,158273.0,ARGENTINA FOODS,ARGENTINA FOODS,57047.0,grocery store,2,4500 S WOOD ST,2010-01-20T00:00:00.000,Out of Business,Fail,,41.812105,-87.670072,"41.812105152977246, -87.67007183351623",Loop
159542,158274.0,ARGENTINA FOODS,ARGENTINA FOODS,57047.0,grocery store,2,4500 S WOOD ST,2010-01-20T00:00:00.000,Out of Business,Fail,,41.812105,-87.670072,"41.812105152977246, -87.67007183351623",Loop


* We see that almost all of the missing license numbers belong to churches (maybe they don't need a license to serve food). We assign the default license number 99999 to those entries.
* The only real license number missing is the one for ARGENTINA FOODS, which we can assign manually from its other entries.

In [66]:
# Use numeric values so the column stays numeric until the final type conversion
inspections.loc[159539, 'License #'] = 57047
inspections['License #'] = inspections['License #'].fillna(99999)

### k. Attribute the right types to each column

In [67]:
inspections.isna().sum().apply(lambda x: '{}% missing values'.format(100 * x/len(inspections)))

Inspection ID                    0.0% missing values
DBA Name                         0.0% missing values
AKA Name                         0.0% missing values
License #                        0.0% missing values
Facility Type                    0.0% missing values
Risk                             0.0% missing values
Address                          0.0% missing values
Inspection Date                  0.0% missing values
Inspection Type                  0.0% missing values
Results                          0.0% missing values
Violations         26.54976247727406% missing values
Latitude                         0.0% missing values
Longitude                        0.0% missing values
Location                         0.0% missing values
Community Area                   0.0% missing values
dtype: object

Now that we are done replacing NaN values (only Violations, which we deliberately keep, still has missing entries), we want to assign the right type to each column. See the [documentation about the categorical type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).


In [68]:
inspections = inspections.astype({
    "Inspection ID": int, "DBA Name": str, "AKA Name": str, "License #": int,
    "Facility Type": 'category', "Risk": int, "Address": str,
    "Inspection Type": 'category', "Results": 'category', "Community Area": 'category',
})

We notice that the 'Inspection Date' column only contains dates and no times (the time always seems to be midnight by default). Hence we parse the values into proper datetime objects.

In [69]:
# Vectorized conversion of the whole column to datetime64
inspections['Inspection Date'] = pd.to_datetime(inspections['Inspection Date'])

In [70]:
inspections.dtypes

Inspection ID               int64
DBA Name                   object
AKA Name                   object
License #                   int64
Facility Type            category
Risk                        int64
Address                    object
Inspection Date    datetime64[ns]
Inspection Type          category
Results                  category
Violations                 object
Latitude                  float64
Longitude                 float64
Location                   object
Community Area           category
dtype: object
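One practical reason for the `category` type is memory: a column with few distinct values stores each label once plus a small integer code per row. A minimal sketch on synthetic data (the Series below is made up, so the numbers say nothing about the real dataset):

```python
import pandas as pd

# Synthetic column with only three distinct labels, repeated many times
results = pd.Series(['Pass', 'Fail', 'Pass w/ Conditions'] * 1000)

# deep=True also counts the memory used by the string objects themselves
as_object = results.memory_usage(deep=True)
as_category = results.astype('category').memory_usage(deep=True)
print(as_object, as_category)
```

The same comparison could be run on `inspections['Results']` before and after the conversion to check the actual saving.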

In [74]:
inspections.to_pickle('./cleaned_inspections.pickle')
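Unlike a CSV export, a pickle keeps the column types assigned above, so the cleaned frame can be reloaded without repeating the conversions. The following sketch checks this round trip on a toy frame written to a temporary directory; the file name `toy.pickle` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy frame with a categorical and a datetime column, for illustration only
toy = pd.DataFrame({
    'Results': pd.Series(['Pass', 'Fail'], dtype='category'),
    'Inspection Date': pd.to_datetime(['2019-01-29', '2018-11-13']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pickle')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame has the same values and the same dtypes
print(restored.dtypes)
```

In later notebooks, `pd.read_pickle('./cleaned_inspections.pickle')` would load the cleaned data in the same way.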