# 0. Imports

In [0]:
# Imports
import pandas as pd
import numpy as np

In [0]:
# Load data set
imp_df = pd.read_csv("https://drive.switch.ch/index.php/s/jBrNwNiAGh7LcXX/download")

In [0]:
# Copy to easily reset df
df = imp_df.copy()

The EDA showed that changes to the legislation had an impact on inspections. Those changes included new violations, modified terminology and more, which would distort the results of this project. Consequently, only the inspections that occurred before July 1st, 2018 are considered for this project.

In [0]:
# Date conversion
df["Inspection Date"] =  pd.to_datetime(df["Inspection Date"])

In [0]:
# Rows selection
df = df[df["Inspection Date"] < "2018-07-01"].copy()

# 1. Augmentation using the original dataframe

### 1.1 Temporal information

The year and the month can be extracted from the inspection date. The day of the week (Monday to Sunday) can also be extracted using the .dayofweek attribute. Those features could be helpful in predicting the result of the inspections.
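As a small illustration of how these components are encoded, the sketch below uses three made-up dates (the `toy` dataframe is hypothetical and not part of the data set). It relies on the `.dt` accessor, which is one possible way to read the components of a datetime column.

```python
import pandas as pd

# Hypothetical dates, chosen only to illustrate the encoding
toy = pd.DataFrame(
    {"Inspection Date": pd.to_datetime(["2018-06-25", "2018-06-30", "2018-07-01"])}
)

# The .dt accessor exposes the datetime components of a Series
toy["Year"] = toy["Inspection Date"].dt.year
toy["Month"] = toy["Inspection Date"].dt.month
# dayofweek is an integer: Monday = 0, ..., Sunday = 6
toy["Weekday"] = toy["Inspection Date"].dt.dayofweek

print(toy)
```

Note that the weekday is a number rather than a name, so a model may treat it as ordinal unless it is encoded as a categorical feature.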

In [712]:
# Extract week day, month and year from inspection date
df["Year"] = df["Inspection Date"].dt.year
df["Month"] = df["Inspection Date"].dt.month
df["Weekday"] = df["Inspection Date"].dt.dayofweek

### 1.2 "Violations" column

The data set contains free-text comments describing the violations that were observed during each inspection. The underlying assumption is that, on average, the comment is longer when more violations were recorded.

In [713]:
# Replace NaN cells with the placeholder "X" (such rows get a length of 1)
df["Violations"] = df["Violations"].fillna("X")

In [0]:
# Length of the "Violations" text for each row
lenOfViolations = [len(v) for v in df["Violations"]]

In [715]:
# Update the DF
df["LenViol"] = lenOfViolations

As shown below, on average, the comments for passed inspections are shorter than the ones for failed inspections.

In [716]:
# Verify the assumption
df[["Results", "LenViol"]].groupby("Results").mean()

Results,LenViol
Fail,1804.615017
Pass,650.129954
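The length feature and the check above could also be written in vectorized form. The sketch below is one possible formulation on a made-up dataframe (`toy` and its placeholder strings are hypothetical), using `fillna` and `.str.len()` instead of an explicit loop.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: placeholder strings stand in for real violation comments
toy = pd.DataFrame(
    {
        "Results": ["Pass", "Fail", "Pass", "Fail"],
        "Violations": [np.nan, "abcdefghij", "abc", "abcdefghijklmnopqrst"],
    }
)

# Missing comments become the placeholder "X", then each text is measured
toy["LenViol"] = toy["Violations"].fillna("X").str.len()

# Mean comment length per inspection result
mean_len = toy.groupby("Results")["LenViol"].mean()
print(mean_len)
```

Because a missing comment is replaced by a one-character placeholder, such rows contribute a length of 1 rather than 0 to the mean.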


# 2. Augmentation using the NOAA Weather history

Another source of information is the weather history in Chicago. The assumption is that sanitary risks increase when temperatures rise, as defective freezers and fridges struggle to keep food cold. As a result, bacterial growth is faster, which could lead to failed inspections.

In [0]:
# Load data
df_weather = pd.read_csv("https://drive.switch.ch/index.php/s/ui6Zr1v2vzPlieH/download")

In [0]:
# Column selection
df_weather_zip = df_weather[["DATE", "TMAX"]].copy()

In [719]:
# Compute the mean maximum temperature over the previous 3 days for each day.
df_weather_zip["MeanMaxTemp3Days"] = (1.0/3.0)*(df_weather_zip.TMAX.shift(1) + df_weather_zip.TMAX.shift(2) + df_weather_zip.TMAX.shift(3))

In [720]:
# Example : 1/3 * ((-2.1) + (-4.9) + (2.8)) = -1.4
# which corresponds to 1/3 * (TMAX(2009-12-15) + TMAX(2009-12-16) + TMAX(2009-12-17)) = MeanMaxTemp3Days(2009-12-18)
df_weather_zip.head(10)

Unnamed: 0,DATE,TMAX,MeanMaxTemp3Days
0,2009-12-15,-2.1,
1,2009-12-16,-4.9,
2,2009-12-17,2.8,
3,2009-12-18,3.3,-1.4
4,2009-12-19,1.1,0.4
5,2009-12-20,-0.5,2.4
6,2009-12-21,-1.0,1.3
7,2009-12-22,0.0,-0.133333
8,2009-12-23,0.0,-0.5
9,2009-12-24,2.2,-0.333333
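The same quantity can be expressed as a rolling mean that is shifted by one day, so that the current day is excluded from its own average. The sketch below compares both formulations on the first six TMAX values of the table above; the variable names are illustrative.

```python
import pandas as pd

# First six TMAX values from the table above
tmax = pd.Series([-2.1, -4.9, 2.8, 3.3, 1.1, -0.5])

# Formulation used in this notebook: sum of the three previous days
by_shift = (tmax.shift(1) + tmax.shift(2) + tmax.shift(3)) / 3.0

# Alternative: 3-day rolling mean, shifted so it only covers previous days
by_rolling = tmax.rolling(window=3).mean().shift(1)

print(pd.DataFrame({"TMAX": tmax, "by_shift": by_shift, "by_rolling": by_rolling}))
```

Both versions leave the first three rows empty, since fewer than three previous days are available there.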


In [721]:
# Date conversion
df_weather_zip["DATE"] =  pd.to_datetime(df_weather_zip["DATE"])

In [722]:
# End result
df_weather_zip

Unnamed: 0,DATE,TMAX,MeanMaxTemp3Days
0,2009-12-15,-2.1,
1,2009-12-16,-4.9,
2,2009-12-17,2.8,
3,2009-12-18,3.3,-1.400000
4,2009-12-19,1.1,0.400000
...,...,...,...
3617,2019-11-12,-7.1,6.133333
3618,2019-11-13,-2.1,0.800000
3619,2019-11-14,0.6,-2.133333
3620,2019-11-15,3.3,-2.866667


Both the maximum temperature on the inspection day and the mean maximum temperature over the previous 3 days are added to the inspections dataframe.

In [0]:
# Merge
df = df.merge(df_weather_zip, left_on = "Inspection Date", right_on = "DATE")
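By default, `merge` performs an inner join, so inspections on a date without a weather record would be dropped silently. The sketch below, on made-up frames (`inspections` and `weather` are hypothetical), shows one way to check for such rows with a left join and the `indicator` option.

```python
import pandas as pd

# Hypothetical inspections and weather records
inspections = pd.DataFrame(
    {
        "Inspection ID": [1, 2, 3],
        "Inspection Date": pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06"]),
    }
)
weather = pd.DataFrame(
    {"DATE": pd.to_datetime(["2010-01-04", "2010-01-05"]), "TMAX": [-3.2, -1.0]}
)

# Default inner join: the inspection without a weather record disappears
inner = inspections.merge(weather, left_on="Inspection Date", right_on="DATE")

# Left join with an indicator column to list the unmatched inspections
checked = inspections.merge(
    weather, left_on="Inspection Date", right_on="DATE", how="left", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(inner), len(checked), unmatched["Inspection ID"].tolist())
```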

# 3. Augmentation using the "Business Licenses" data set

The "Business Licenses" data set contains information about business licenses that were issued in the past. From this data set, an approximation of the creation date of the restaurant can be retrieved.

In [724]:
# Load data set
df_licenses = pd.read_csv("https://drive.switch.ch/index.php/s/HUOmc3Db1dJ9WMh/download")

In [725]:
# Sample
df_licenses[["LICENSE NUMBER", "APPLICATION REQUIREMENTS COMPLETE", "DATE ISSUED"]].sample(1)

Unnamed: 0,LICENSE NUMBER,APPLICATION REQUIREMENTS COMPLETE,DATE ISSUED
225365,2240816.0,01/16/2015,02/13/2015


In [0]:
#Reformatting date
df_licenses["APPLICATION REQUIREMENTS COMPLETE"] =  pd.to_datetime(df_licenses["APPLICATION REQUIREMENTS COMPLETE"], format = "%m/%d/%Y")
df_licenses["DATE ISSUED"] =  pd.to_datetime(df_licenses["DATE ISSUED"], format = "%m/%d/%Y")

For each business license number, the earliest date indicated in the "APPLICATION REQUIREMENTS COMPLETE" column is retrieved. There can be multiple records for the same license number (for example when the license is renewed), which is why the .groupby() and .min() functions are needed.

In [0]:
#Extraction of the date of the very first record in the set for each business license number
df_licenses_bis = df_licenses[["LICENSE NUMBER", "APPLICATION REQUIREMENTS COMPLETE"]]
df_licenses_bis = df_licenses_bis.groupby("LICENSE NUMBER", as_index = False)["APPLICATION REQUIREMENTS COMPLETE"].min()
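To illustrate how the earliest date per license is obtained, the sketch below applies the same groupby and min pattern to a made-up table (the license numbers and dates are hypothetical). It also shows that missing dates (NaT) are skipped by `.min()`.

```python
import pandas as pd

col = "APPLICATION REQUIREMENTS COMPLETE"

# Hypothetical license records; license 1 has one record without a date
records = pd.DataFrame(
    {
        "LICENSE NUMBER": [1, 1, 1, 2],
        col: pd.to_datetime(["2016-01-15", None, "2014-01-15", "2012-03-01"]),
    }
)

# One row per license, holding the earliest available (non-missing) date
earliest = records.groupby("LICENSE NUMBER", as_index=False)[col].min()

print(earliest)
```

Since missing dates are ignored, the result for a license is only as early as its earliest dated record, which may be a renewal rather than the original issue.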

In [728]:
df[df["Inspection ID"] == 154341]

Unnamed: 0,Inspection ID,DBA Name,License #,Facility Type,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Year,Month,Weekday,LenViol,DATE,TMAX,MeanMaxTemp3Days
1,154341,LEVY RESTAURANTS AT WRIGLEY FIELD,1574001,restaurant,Risk 2 (Medium),1060 W ADDISON ST,60613,2010-04-06,canvass,Pass,X,41.947317,-87.656418,2010,4,1,1,2010-04-06,25.6,20.2


In [0]:
#Merge
df = df.merge(df_licenses_bis, left_on = "License #", right_on = "LICENSE NUMBER")

In [0]:
# Drop columns created by the merge
df = df.drop(["LICENSE NUMBER", "DATE"], axis = 1)

For each business license number the time delta between the approximate date of creation and the date of the inspection is computed. This yields the approximate "age" of the business at the date of inspection in days.

In [0]:
# Time delta
df["DaysInBusiness"] = (df["Inspection Date"] - df["APPLICATION REQUIREMENTS COMPLETE"]).dt.days
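The behaviour of this subtraction can be checked on a few made-up rows (the `toy` dataframe and its dates are hypothetical). The sketch covers three cases: an inspection after the approximate creation date, an inspection before it, and a missing creation date.

```python
import pandas as pd

# Hypothetical rows covering the three cases
toy = pd.DataFrame(
    {
        "Inspection Date": pd.to_datetime(["2015-03-01", "2013-06-10", "2016-01-01"]),
        "APPLICATION REQUIREMENTS COMPLETE": pd.to_datetime(
            ["2014-01-15", "2014-01-15", None]
        ),
    }
)

# Subtracting two datetime columns gives timedeltas; .dt.days keeps whole days
toy["DaysInBusiness"] = (
    toy["Inspection Date"] - toy["APPLICATION REQUIREMENTS COMPLETE"]
).dt.days

# Number of rows where the inspection precedes the approximate creation date
n_negative = int((toy["DaysInBusiness"] < 0).sum())

print(toy)
print(n_negative)
```

A missing creation date yields a missing value (NaN) rather than a negative number, so such rows are not counted as negative.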

In [0]:
# Rename
df = df.rename(columns = {"APPLICATION REQUIREMENTS COMPLETE" : "ApproxCreationDate"})

There are 962 rows for which the "DaysInBusiness" value is negative. Below is an example of a business license that yields a negative value: its initial ISSUE record has no "APPLICATION REQUIREMENTS COMPLETE" date, so the earliest available date comes from a later renewal.

Note : The date indicated in the "APPLICATION REQUIREMENTS COMPLETE" column yields fewer negative rows and is therefore chosen over the one in the "DATE ISSUED" column.

In [733]:
df_show = df_licenses[["LICENSE NUMBER", "APPLICATION TYPE", "APPLICATION REQUIREMENTS COMPLETE", "DATE ISSUED"]]
df_show[df_show["LICENSE NUMBER"] == 2208682]

Unnamed: 0,LICENSE NUMBER,APPLICATION TYPE,APPLICATION REQUIREMENTS COMPLETE,DATE ISSUED
61232,2208682.0,RENEW,2016-01-15,2016-03-08
61758,2208682.0,ISSUE,NaT,2012-12-29
61763,2208682.0,RENEW,2014-01-15,2014-02-26
122283,2208682.0,RENEW,2018-01-15,2018-03-08


Since a negative time delta does not make sense, those 962 rows are dropped. This is not expected to affect the results much.

In [0]:
# Keep only rows with a strictly positive delta (rows with a delta of 0 or a missing delta are dropped as well)
df = df[(df.DaysInBusiness > 0)]

# 4. Export

In [0]:
df.to_csv("augmented_food_inspections.csv", index=False)