# Relevant imports

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import math
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

from autocorrect import Speller

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

import findspark
findspark.init()

import pyspark
import pyspark.sql
from pyspark.sql import *
from pyspark.sql.functions import *

# I. Dataset(s) preparation and cleaning

Before we proceed to tackle each of our research questions, some data cleaning is in order.

## 1. Load the data and explore its structure and meaning

In [14]:
inspections = pd.read_csv('datasets/food-inspections.csv')

In [15]:
len(inspections)

194615

The dataset has 22 columns. Let's examine what each of them is.

In [16]:
#Display columns
inspections.columns

Index(['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type',
       'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Results', 'Violations', 'Latitude', 'Longitude',
       'Location', 'Historical Wards 2003-2015', 'Zip Codes',
       'Community Areas', 'Census Tracts', 'Wards'],
      dtype='object')

In [17]:
inspections.dtypes

Inspection ID                   int64
DBA Name                       object
AKA Name                       object
License #                     float64
Facility Type                  object
Risk                           object
Address                        object
City                           object
State                          object
Zip                           float64
Inspection Date                object
Inspection Type                object
Results                        object
Violations                     object
Latitude                      float64
Longitude                     float64
Location                       object
Historical Wards 2003-2015    float64
Zip Codes                     float64
Community Areas               float64
Census Tracts                 float64
Wards                         float64
dtype: object

A description of the features is given below ([source](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF)).
The last five columns are not described in the dataset source; we will see below that they contain only null values.

| Feature name                | Variable Type | Description 
|-----------------------------|---------------|--------------------------------------------------------
| Inspection ID        | Integer    | The inspection unique identifier.
| DBA Name                 | String        | ‘Doing business as.’ Legal name of the establishment.
| AKA Name                | String    |  ‘Also known as.’ Name the public would know the establishment as.
| License # | Integer    | Unique number assigned to the establishment for the purposes of licensing by the Department of Business Affairs and Consumer Protection.
| Facility Type                | String    | Each establishment is described by one of the following: bakery, banquet hall, candy store, caterer, coffee shop, day care center (for ages less than 2), day care center (for ages 2 – 6), day care center (combo, for ages less than 2 and 2 – 6 combined), gas station, Golden Diner, grocery store, hospital, long term care center (nursing home), liquor store, mobile food dispenser, restaurant, paleteria, school, shelter, tavern, social club, wholesaler, or Wrigley Field Rooftop.
| Risk                   | String    | Risk category of facility of adversely affecting the public’s health, with 1 being the highest and 3 the lowest. The frequency of inspection is tied to this risk, with risk 1 establishments inspected most frequently and risk 3 least frequently.
| Address        | String    | Street address of the establishment.
| City        | String    | City of the establishment.
| State        | String    | State of the establishment.
| Zip        | Integer    | Zip code of the establishment.
| Inspection Date        | Date    | Date of the inspection
| Inspection Type        | String    | An inspection can be one of the following types: canvass, the most common type of inspection performed at a frequency relative to the risk of the establishment; consultation, when the inspection is done at the request of the owner prior to the opening of the establishment; complaint, when the inspection is done in response to a complaint against the establishment; license, when the inspection is done as a requirement for the establishment to receive its license to operate; suspect food poisoning, when the inspection is done in response to one or more persons claiming to have gotten ill as a result of eating at the establishment (a specific type of complaint-based inspection); task-force inspection, when an inspection of a bar or tavern is done. Re-inspections can occur for most types of these inspections and are indicated as such.
| Results        | String    | An inspection can pass, pass with conditions or fail. Establishments receiving a ‘pass’ were found to have no critical or serious violations (violation number 1-14 and 15-29, respectively). Establishments receiving a ‘pass with conditions’ were found to have critical or serious violations, but these were corrected during the inspection. Establishments receiving a ‘fail’ were found to have critical or serious violations that were not correctable during the inspection. An establishment receiving a ‘fail’ does not necessarily mean the establishment’s license is suspended. Establishments found to be out of business or not located are indicated as such.
| Violations        | String    | An establishment can receive one or more of 45 distinct violations (violation numbers 1-44 and 70). For each violation number listed for a given establishment, the requirement the establishment must meet in order for it to NOT receive a violation is noted, followed by a specific description of the findings that caused the violation to be issued.
| Latitude        | Float    | Latitude of the establishment.
| Longitude        | Float    | Longitude of the establishment.



## 2. Drop duplicates

The dataset source explicitly says there are duplicates in our data, hence it makes sense to drop them ([source](https://www.kaggle.com/chicago/chicago-food-inspections)).

In [18]:
inspections.drop_duplicates(inplace=True)
len(inspections)

194446
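The row count goes from 194615 to 194446, so 169 fully duplicated rows were removed. A minimal sketch of how that number can be read off directly before dropping anything, using a small made-up table in place of `inspections`:

```python
import pandas as pd

# Toy data standing in for the inspections table (values are made up).
toy = pd.DataFrame({
    'Inspection ID': [1, 2, 2, 3],
    'Results': ['Pass', 'Fail', 'Fail', 'Pass'],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = int(toy.duplicated().sum())

deduplicated = toy.drop_duplicates()
print('Removed {} duplicated rows, {} rows remain.'.format(n_duplicates, len(deduplicated)))
```

Counting first makes it easy to confirm that only exact copies are removed, rather than distinct inspections of the same establishment.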

## 3. Dataset cleaning

### A. Drop redundant and empty columns

The 'Location' column contains the latitude and longitude of the establishment. However, there are separate 'Latitude' and 'Longitude' columns. We can hence safely drop the 'Location' column.

In [20]:
inspections = inspections.drop(columns=['Location'])

The head of the dataset only contains NaN entries for the 'Historical Wards 2003-2015', 'Zip Codes', 'Community Areas', 'Census Tracts', 'Wards' columns. Let's see if this is true for the whole dataset.

In [21]:
# check that each of these columns only contains NaN
for column in ['Historical Wards 2003-2015', 'Zip Codes', 'Community Areas', 'Census Tracts', 'Wards']:
    print('Values taken by \'{}\': '.format(column), inspections[column].unique())


Values taken by 'Historical Wards 2003-2015':  [nan]
Values taken by 'Zip Codes':  [nan]
Values taken by 'Community Areas':  [nan]
Values taken by 'Census Tracts':  [nan]
Values taken by 'Wards':  [nan]
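The same check can be written without naming the columns by hand. The sketch below, on made-up data, lists every column whose values are all missing; the column names are only borrowed from the dataset for readability.

```python
import numpy as np
import pandas as pd

# Toy frame: two entirely empty columns and one partially filled column.
toy = pd.DataFrame({
    'Zip Codes': [np.nan, np.nan, np.nan],
    'Wards': [np.nan, np.nan, np.nan],
    'Latitude': [41.88, np.nan, 41.91],
})

# isna().all() is True only for columns in which every value is missing.
all_null = toy.isna().all()
empty_columns = all_null[all_null].index.tolist()
print(empty_columns)
```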


We drop all of these columns except 'Community Areas', which we will need in our study and will fill in later.

In [22]:
inspections = inspections.drop(columns=['Historical Wards 2003-2015', 'Zip Codes', 'Census Tracts', 'Wards'])

### B. Clean the location related features and fill in community area feature

Let's examine whether the whole dataset is relevant to our study by checking which entries correspond to facilities in Chicago.

First, we check whether there are any missing values in the 'City' or 'State' columns.

In [23]:
#Investigate the state=nan and city=nan restaurants
inspections[pd.isnull(inspections.State) | pd.isnull(inspections.City)]

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Community Areas
819,2312774,CHICAGO COLLEGIATE CHARTER,CHICAGO COLLEGIATE CHARTER,3846104.0,School,Risk 1 (High),10909 S COTTAGE GROVE AVE,,IL,,2019-09-24T00:00:00.000,Canvass Re-Inspection,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.696087,-87.608945,
976,2312540,CHICAGO COLLEGIATE CHARTER,CHICAGO COLLEGIATE CHARTER,3846104.0,School,Risk 1 (High),10909 S COTTAGE GROVE AVE,,IL,,2019-09-19T00:00:00.000,Canvass Re-Inspection,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.696087,-87.608945,
982,2312545,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,2671297.0,Children's Services Facility,Risk 1 (High),2112 W LAWRENCE AVE,,IL,60625.0,2019-09-19T00:00:00.000,License Re-Inspection,Pass,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.968821,-87.682201,
2152,2305166,"AMY BECK CAKE DESIGN, LLC","AMY BECK CAKE DESIGN, LLC",2079264.0,Bakery,Risk 1 (High),636 N RACINE AVE,,,60642.0,2019-08-23T00:00:00.000,Canvass,Pass,"55. PHYSICAL FACILITIES INSTALLED, MAINTAINED ...",41.893380,-87.657588,
2767,2304583,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,JCYS IRIS & STEVEN PODOLSKY FAMILY CENTER,2671297.0,Children's Services Facility,Risk 1 (High),2112 W LAWRENCE AVE,,IL,60625.0,2019-08-13T00:00:00.000,License,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.968821,-87.682201,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193147,60291,"CLOVERHILL PASTRY-VEND,LLC","CLOVERHILL PASTRY-VEND,LLC",2004357.0,Wholesale,Risk 3 (Low),4464 W 44TH ST,,IL,60632.0,2010-02-03T00:00:00.000,License Re-Inspection,Pass,,41.814266,-87.736013,
193404,60282,"CLOVERHILL PASTRY-VEND,LLC","CLOVERHILL PASTRY-VEND,LLC",2004357.0,Wholesale,Risk 3 (Low),4464 W 44TH ST,,IL,60632.0,2010-01-28T00:00:00.000,License,Fail,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.814266,-87.736013,
193473,60279,"CLOVERHILL PASTRY-VEND,LLC","CLOVERHILL PASTRY-VEND,LLC",2004357.0,Wholesale,Risk 3 (Low),4464 W 44TH ST,,IL,60632.0,2010-01-27T00:00:00.000,License,Fail,,41.814266,-87.736013,
193994,67912,THREE CHEFS RESTURANT,THREE CHEFS RESTURANT,2009471.0,Restaurant,Risk 1 (High),8125 S HALSTED ST,,IL,60620.0,2010-01-15T00:00:00.000,License Re-Inspection,Pass,,41.746236,-87.643766,


Looking at the coordinates of these places, all of them appear to be in Chicago, so we fill in their 'City' and 'State' columns accordingly.

In [24]:
inspections['City'] = inspections['City'].fillna('Chicago')
inspections['State'] = inspections['State'].fillna('IL')
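The visual check of the coordinates can also be approximated in code. The sketch below tests whether a point falls inside a rough bounding box around Chicago. The box limits are approximate values picked for illustration, not official city limits, and a box of this kind also covers some neighbouring suburbs, so it can only rule places out. The first two points are taken from the table above; the third is a hypothetical out-of-town point.

```python
import pandas as pd

# Rough bounding box around Chicago. These bounds are approximate values
# chosen for illustration, not official city limits.
LAT_MIN, LAT_MAX = 41.6, 42.1
LON_MIN, LON_MAX = -88.0, -87.5

def in_chicago_box(latitude, longitude):
    """Return True if the point lies inside the approximate bounding box."""
    return (LAT_MIN <= latitude <= LAT_MAX) and (LON_MIN <= longitude <= LON_MAX)

toy = pd.DataFrame({
    'Latitude': [41.696087, 41.968821, 40.7128],
    'Longitude': [-87.608945, -87.682201, -74.0060],
})
toy['In Box'] = [in_chicago_box(lat, lon)
                 for lat, lon in zip(toy['Latitude'], toy['Longitude'])]
print(toy)
```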

Next, we check if there are any facilities which are not located in Chicago.

In [25]:
# list the distinct values taken by the 'City' column
print('Values taken by \'City\': ', inspections['City'].unique())

Values taken by 'City':  ['CHICAGO' 'Chicago' 'chicago' 'GRIFFITH' 'NEW YORK' 'SCHAUMBURG'
 'ELMHURST' 'ALGONQUIN' 'NEW HOLSTEIN' 'CCHICAGO' 'NILES NILES' 'EVANSTON'
 'CHICAGO.' 'CHESTNUT STREET' 'LANSING' 'CHICAGOCHICAGO' 'WADSWORTH'
 'WILMETTE' 'WHEATON' 'CHICAGOHICAGO' 'ROSEMONT' 'CHicago' 'CALUMET CITY'
 'PLAINFIELD' 'HIGHLAND PARK' 'PALOS PARK' 'ELK GROVE VILLAGE' 'CICERO'
 'BRIDGEVIEW' 'OAK PARK' 'MAYWOOD' 'LAKE BLUFF' '312CHICAGO'
 'SCHILLER PARK' 'SKOKIE' 'BEDFORD PARK' 'BANNOCKBURNDEERFIELD' 'CHCICAGO'
 'BLOOMINGDALE' 'Norridge' 'CHARLES A HAYES' 'CHCHICAGO' 'CHICAGOI'
 'SUMMIT' 'OOLYMPIA FIELDS' 'WESTMONT' 'CHICAGO HEIGHTS' 'JUSTICE'
 'TINLEY PARK' 'LOMBARD' 'EAST HAZEL CREST' 'COUNTRY CLUB HILLS'
 'STREAMWOOD' 'BOLINGBROOK' 'INACTIVE' 'BERWYN' 'BURNHAM' 'DES PLAINES'
 'LAKE ZURICH' 'OLYMPIA FIELDS' 'alsip' 'OAK LAWN' 'BLUE ISLAND' 'GLENCOE'
 'FRANKFORT' 'NAPERVILLE' 'BROADVIEW' 'WORTH' 'Maywood' 'ALSIP'
 'EVERGREEN PARK']


We can see that this column takes values other than Chicago. The rows where the 'City' is not Chicago are irrelevant to our study and should be dropped. Let's first make sure that the bulk of the data is for Chicago before proceeding.

In [26]:
chicago_inspections = inspections.groupby('City')['Inspection ID'].nunique().filter(regex='(?i)chicago', axis=0)
print('{}% of the inspections in the dataframe come from Chicago.'.format(100 * chicago_inspections.values.sum()/len(inspections)))

99.89662939839339% of the inspections in the dataframe come from Chicago.


We can safely drop the rows which come from cities that are not Chicago.

In [27]:
# list of ways Chicago has been written in the dataset
chicago_variations = chicago_inspections.index.tolist()
inspections = inspections[inspections['City'].isin(chicago_variations)]
# drop the 'City' and 'State' columns since they have each only one value, 'Chicago' and 'IL' respectively
inspections = inspections.drop(columns=['City', 'State'])
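One caveat: the case-insensitive 'chicago' pattern also matches 'CHICAGO HEIGHTS', which appears in the list of cities above and is a separate municipality. A possible refinement, sketched here on a handful of the spellings listed above, keeps the same match but removes a small exclusion list. The exclusion set is a hypothetical addition, not part of the original filter.

```python
import pandas as pd

# A few spellings taken from the 'City' values listed above.
cities = pd.Series(['CHICAGO', 'Chicago', 'CHICAGO.', '312CHICAGO',
                    'CHICAGO HEIGHTS', 'EVANSTON'])

# Same case-insensitive match as the filter used above.
contains_chicago = cities.str.contains('chicago', case=False)

# Hypothetical exclusion list: names containing 'chicago' that are other municipalities.
other_municipalities = {'CHICAGO HEIGHTS'}
is_other = cities.str.upper().str.strip().isin(other_municipalities)

chicago_variations = cities[contains_chicago & ~is_other].tolist()
print(chicago_variations)
```

Given that non-Chicago rows account for roughly 0.1% of the data, the effect on the analysis is likely small, but the stricter list avoids mislabelling those rows.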

Now that we only have facilities in Chicago in our dataset, let us fill the 'Community Areas' column. To that end, we use the geopy library.

We start by getting the unique locations in the dataset.

In [28]:
# def getareanneighbourhood(coord):
#     """
#     reverse-geocode a 'latitude, longitude' string and return its (suburb, neighbourhood)
#     """
#     geolocator = Nominatim(timeout=10,user_agent="area_filler")
#     reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)
#     dic = reverse(coord).raw['address']
#     return dic.get('suburb', np.nan), dic.get('neighbourhood', np.nan)

def combineloc(latitude, longitude):
    """
    function to format the latitude and longitude such that they can be used in geopy requests
    """
    return '{}, {}'.format(latitude, longitude)

In [29]:
locations = inspections['Latitude'].dropna().combine(inspections['Longitude'].dropna(),combineloc)
unique_locs = locations.unique()

In [30]:
unique_locs

array(['42.00558686485114, -87.66107732040031',
       '41.88342263701489, -87.62802165207536',
       '41.91039897821153, -87.6902068285586', ...,
       '41.768328334800714, -87.67381938402686',
       '41.764896400247046, -87.65396483351302',
       '41.846516428599394, -87.69542345938575'], dtype=object)

In [31]:
len(unique_locs)

16791

In [32]:
unique_locs_s = pd.Series(unique_locs, dtype=str)

We then reverse-geocode each of these locations with geopy and save the results in a pickle. The code takes about 4h40 to run, as we can only make one geopy query per second.

In [33]:
# geolocator = Nominatim(timeout=17000,user_agent="area_filler")
# geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)
# # for i in unique_locs:
# #     print(i)
# #     print(geolocator.reverse(i))
# areas = unique_locs_s.copy().apply(geocode)
# areas.to_pickle('./areas.pickle')
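As a sanity check on the quoted runtime, the rate limit alone gives a lower bound: 16791 locations at one query per second. The short calculation below ignores network latency and retries, so the real runtime can only be longer.

```python
n_locations = 16791        # number of unique locations reported above
seconds_per_query = 1      # min_delay_seconds used in the RateLimiter

total_seconds = n_locations * seconds_per_query
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print('{} h {} min {} s'.format(hours, minutes, seconds))
```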

In [34]:
areas = pd.read_pickle('./areas.pickle')

In [35]:
areas.isna().sum()

0

Let's add the community areas and neighborhoods to the dataframe.

In [36]:
# get latitude, longitude and corresponding community area and neighbourhood in same dataframe
suburbs_neighbourhoods = [(x.raw.get('address', {}).get('suburb',np.nan), x.raw.get('address', {}).get('neighbourhood',np.nan)) for x in areas]
suburbs, neighbourhoods = zip(*suburbs_neighbourhoods)
locs_df = pd.concat([pd.Series(unique_locs, name='Location'), pd.Series(suburbs,name='Community Area'), pd.Series(neighbourhoods,name='Neighbourhood')], axis=1)

In [37]:
# add the community area and the neighbourhood to each entry in our dataframe
inspections['Location'] = inspections['Latitude'].combine(inspections['Longitude'],combineloc)
inspections = inspections.merge(locs_df,on='Location',how='outer')
inspections = inspections.drop(columns=['Community Areas'])

Let's check if there are any NaN entries in our 'Community Area' column.

In [38]:
print('{}% of rows don\'t have missing Community Areas'.format(100 * (1 - inspections['Community Area'].isna().sum()/len(inspections))))

96.55987151637446% of rows don't have missing Community Areas


Since fewer than 4% of the rows have a null 'Community Area', we can safely drop them.

In [39]:
inspections = inspections[inspections['Community Area'].notna()]

### C. Check which columns still have missing values

Let's check if there are any more missing values in the dataframe.

In [40]:
inspections.isna().sum().apply(lambda x: '{}% missing values'.format(100 * x/len(inspections)))

Inspection ID      0.011195044327044561% missing values
DBA Name           0.011195044327044561% missing values
AKA Name             1.2794336373765214% missing values
License #          0.020257699258461586% missing values
Facility Type        2.4687738227877793% missing values
Risk                0.04531327465708513% missing values
Address            0.011195044327044561% missing values
Zip                 0.03625061972566811% missing values
Inspection Date    0.011195044327044561% missing values
Inspection Type    0.011728141675951445% missing values
Results            0.011195044327044561% missing values
Violations            26.55837682519205% missing values
Latitude           0.011195044327044561% missing values
Longitude          0.011195044327044561% missing values
Location           0.011195044327044561% missing values
Community Area                      0.0% missing values
Neighbourhood        16.249873389379633% missing values
dtype: object

#### Note from Zeineb to André: if you duplicate this code and look at the dataset before the community areas are added, there are no missing Inspection IDs, so there may be a bug somewhere.
Looking at the missing values, it makes no sense to have entries where the Inspection ID is null. We look at those entries and drop them.

In [41]:
missing_inspections = inspections[inspections['Inspection ID'].isnull()]
print("Number of missing inspections ID: ", len(missing_inspections))
missing_inspections.head()

Number of missing inspections ID:  21


Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area,Neighbourhood
194245,,,,,,,,,,,,,,,,Humboldt Park,Beat 2534
194246,,,,,,,,,,,,,,,,Near West Side,Near West Side
194247,,,,,,,,,,,,,,,,Kenwood,Kenwood
194248,,,,,,,,,,,,,,,,Lower West Side,Pilsen
194249,,,,,,,,,,,,,,,,Logan Square,Maplewood


In [42]:
inspections.dropna(subset=['Inspection ID'],inplace = True)
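One possible explanation for these rows, offered as a hypothesis rather than a confirmed diagnosis: the community areas were attached with an outer merge, which also keeps rows of the lookup table that match no inspection. The rows shown above have an empty 'Location' but a filled 'Community Area', which is what such unmatched lookup rows would look like. The sketch below reproduces the effect on made-up data and shows how `indicator=True` exposes it, and how a left merge avoids it.

```python
import numpy as np
import pandas as pd

# Toy inspections and a toy location lookup with one entry that matches nothing.
left = pd.DataFrame({'Inspection ID': [1, 2, 3],
                     'Location': ['a', 'a', 'b']})
right = pd.DataFrame({'Location': ['a', 'b', np.nan],
                      'Community Area': ['Loop', 'Kenwood', 'Pilsen']})

# An outer merge keeps lookup rows without a matching inspection;
# indicator=True adds a '_merge' column telling where each row came from.
outer = left.merge(right, on='Location', how='outer', indicator=True)
right_only = outer[outer['_merge'] == 'right_only']
print(right_only)

# A left merge keeps exactly the original inspections.
left_merged = left.merge(right, on='Location', how='left')
print(len(left_merged))
```

If the hypothesis holds for the real data, switching the merge to `how='left'` would make the later `dropna` on 'Inspection ID' unnecessary.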

In [43]:
inspections.isna().sum().apply(lambda x: '{}% missing values'.format(100 * x/len(inspections)))

Inspection ID                       0.0% missing values
DBA Name                            0.0% missing values
AKA Name             1.2683805888186306% missing values
License #          0.009063669613247886% missing values
Facility Type        2.4578539362983975% missing values
Risk                0.03412205030869792% missing values
Address                             0.0% missing values
Zip                 0.02505838069545004% missing values
Inspection Date                     0.0% missing values
Inspection Type    0.000533157036073405% missing values
Results                             0.0% missing values
Violations           26.550154082383425% missing values
Latitude                            0.0% missing values
Longitude                           0.0% missing values
Location                            0.0% missing values
Community Area                      0.0% missing values
Neighbourhood         16.24956014544524% missing values
dtype: object

* The AKA names still have missing entries. We will replace the missing ones with the DBA name, because we will need them for our recommendation map later on and it makes more sense to display the AKA names to users. However, we will mostly stick to the DBA Name when referring to establishments.
* The License Number is missing for some entries. Since it is not essential to our main analysis, we will not pay attention to it for now.
* The missing Zip entries are not important, as we have enough information regarding location (latitude, longitude, community area and address). Hence we can safely drop this column.
* The number of missing neighbourhoods is quite large (about 16% of rows). Hence, we drop that column as well.
* We will try to recover the missing facility type from the establishment's name, using other entries where the name is the same and the type is filled in.
* We will check whether the missing violations entries are consistent, i.e. whether they are related to the inspection type and inspection results.
* We will see if we can recover the missing risk and inspection type entries manually, as these represent a really small fraction of our dataset and might hinder our analysis. For the values that we cannot recover, we drop the corresponding entries.

### D. Drop unneeded columns

In [44]:
# drop neighbourhood and zip columns
inspections = inspections.drop(columns=['Neighbourhood','Zip'])

### E. Clean Facility Type column

Let's replace the null values with the right facility type

In [45]:
inspections[inspections['Facility Type'].isnull()].head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Community Area
124,1515531.0,GATEWAY NEWS STAND,GATEWAY NEWS STAND,1245984.0,,Risk 3 (Low),108 N STATE ST,2014-12-30T00:00:00.000,Canvass,Out of Business,,41.883423,-87.628022,"41.88342263701489, -87.62802165207536",Irving Park
181,660114.0,SUBWAY SANDWICH & SALADS,SUBWAY SANDWICH & SALADS,1275947.0,,Risk 1 (High),2512 W NORTH AVE OOB,2012-01-23T00:00:00.000,Canvass,Out of Business,,41.910399,-87.690207,"41.91039897821153, -87.6902068285586",Lincoln Park
275,1235584.0,PAPA JOHN'S PIZZA,PAPA JOHN'S PIZZA,1334580.0,,Risk 3 (Low),7003 N CLARK ST,2013-01-03T00:00:00.000,Canvass,Out of Business,,42.009133,-87.673846,"42.00913302441507, -87.6738463824712",North Park
285,1515226.0,ONE STOP FOOD ANDLIQUORS,ONE STOP FOOD ANDLIQUORS,8665.0,,Risk 3 (Low),3456 S WESTERN AVE,2014-12-18T00:00:00.000,Canvass,Out of Business,,41.830324,-87.685182,"41.83032421612542, -87.6851823335589",Portage Park
364,2315750.0,SPORTS BAR WHITE STAR,SPORTS BAR WHITE STAR,24612.0,,Risk 3 (Low),3049 N CICERO AVE,2019-10-10T00:00:00.000,Canvass,Out of Business,,41.936668,-87.746655,"41.93666807129704, -87.74665459887675",Albany Park


In [None]:
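# group the distinct (name, facility type) pairs by establishment name
facility_type_series = inspections[['DBA Name','Facility Type']].drop_duplicates().groupby('DBA Name')
# look at the facility types recorded for one establishment name
facility_type_series.get_group('SUBWAY SANDWICH & SALADS')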
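The idea of recovering a missing facility type from other rows with the same name could be sketched as follows. This is one possible approach on made-up data, not necessarily the one the notebook ends up using: each missing type is filled with the most frequent known type for the same 'DBA Name', and stays missing when no other row provides one. The helper `most_common` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data: one name with a known type on some rows, one name with none.
toy = pd.DataFrame({
    'DBA Name': ['SUBWAY', 'SUBWAY', 'SUBWAY', 'GATEWAY NEWS STAND'],
    'Facility Type': ['Restaurant', np.nan, 'Restaurant', np.nan],
})

def most_common(values):
    """Most frequent non-null value of a Series, or NaN if there is none."""
    counts = values.dropna().value_counts()
    return counts.index[0] if not counts.empty else np.nan

# One known type per establishment name (NaN when no row has a type).
known_type = toy.groupby('DBA Name')['Facility Type'].agg(most_common)

# Fill only the missing entries, looking the type up by name.
toy['Facility Type'] = toy['Facility Type'].fillna(toy['DBA Name'].map(known_type))
print(toy)
```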

We first examine the facility type entries.

In [None]:
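# from nltk.corpus import stopwords
# stop = stopwords.words('english')
# lower-case and spell-correct the facility types (note: str(x) turns missing values into the string 'nan')
spell = Speller(lang='en')
inspections['Facility Type'] = inspections['Facility Type'].str.lower().apply(lambda x: spell(str(x)))
unique_types = pd.Series(inspections['Facility Type'].unique())
# split compound types such as 'a/b', 'a, b', 'a & b' or 'a; b' into their individual parts
unique_types = pd.Series(unique_types.apply(lambda x: re.sub(',|&|;','/',str(x)).split('/')).explode().unique())
# display the parts without surrounding whitespace (the result is not stored)
unique_types.apply(lambda x: str(x).strip())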
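To make the splitting step concrete, here is a small sketch on made-up facility types. It splits on the same separators, stores the stripped parts, and only then takes the unique values, so that 'cafe' and ' cafe' are not counted as two different types.

```python
import re
import pandas as pd

# Made-up compound facility types.
types = pd.Series(['grocery/restaurant', 'bakery & cafe', 'restaurant'])

# Split on ',', '&', ';' or '/', put each part on its own row, then strip whitespace.
parts = types.apply(lambda s: re.split(r'[,&;/]', s)).explode().str.strip()

unique_parts = pd.Series(parts.unique())
print(unique_parts.tolist())
```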

In [None]:
#corrected_unique_types = unique_types.apply(lambda x: spell(str(x).strip()))

In [None]:
#corrected_unique_types

There are 427 facility types in the dataframe. Let's see if we can cluster some of them.

In [None]:
school_vocab = ['school', 'under 6', 'university', 'cafeteria', 'training program', 'kids', 'children', 'daycare', 'years', 'school cafeteria', 'college', 'class', 'day care']
restaurant_vocab = ['restaurant', 'smokehouse', 'diner', 'breakfast', 'lunch', 'grill', 'sushi','banquet', 'dining', 'taqueria']
homes_vocab = ['long term care','assisted living', 'nursing', 'care', 'supportive']

religious_vocab = ['religious', 'religion', 'church', 'synaguogue', 'temple']
coffee_vocab = ['coffee', 'cafe']
catering_vocab = ['cater']

bakery_vocab = ['bake', 'patisserie', 'boulangerie']
market_vocab = ['butcher', 'snack', 'dollar', 'grocery', 'frozen food storage', 'meat packing', 'food', 'market', 'packaged', 'popcorn', 'pantry', 'store', 'produce', 'kiosk', 'convenience', 'commiasary','gas station', 'wholesale', 'deli', 'convenient', 'retail', 'dollar tree']
dessert_vocab = ['ice cream', 'dessert', 'paleteria', 'candy', 'gelato', 'donut']

health_vocab = ['gym', 'exercise', 'nutrition','herbal', 'fitness', 'drug', 'rehab', 'herbalife', 'weight', 'herb', 'health', 'decream']
medical_vocab = ['pharmacy']
vending_vocab = ['vend', 'pop-up', 'mobile', 'cart', 'dispenser', 'truck']

drinks_vocab = ['liquor', 'music', 'bar', 'pub', 'beverage', 'club', 'roof', 'tavern', 'brewery', 'wine', 'beer', 'lounge', 'tea', 'shakes']
live_vocab =['live', 'poultry', 'slaughter', 'farm']
kitchen_vocab = ['kitchen']

distributor_vocab = ['distributor', 'distribution']
shelter_vocab = ['shelter', 'youth housing']
# other_vocab = ['warehouse', 'other', 'theater', 'golf', 'laundromat', 'hotel', 'riverbank', 'theatre', 'event', 'special', 'museum', 'hospital', 'pool', 'art', 'airport', 'gallery', 'terminal']


In [None]:
#unique_types = corrected_unique_types

In [None]:
#'live poultry' in corrected_unique_types.values

In [None]:
def extract_label(unique_labels, vocab, mssg):
    """
    split unique_labels into the labels matching any word of vocab and the remaining ones
    """
    labels = unique_labels.apply(lambda x: x if bool(re.search('|'.join(vocab), x)) else np.nan).dropna()
    print(labels)
    print('{}{}'.format(mssg, labels.size))
    unique_labels = unique_labels.drop(labels=labels.index, errors='ignore')
    return unique_labels, labels

In [None]:
unique_types, school_labels = extract_label(unique_types, school_vocab, 'Number of Schools: ')
unique_types, restaurant_labels = extract_label(unique_types, restaurant_vocab, 'Number of Restaurants: ')
unique_types, homes_labels = extract_label(unique_types, homes_vocab, 'Number of Nursing homes: ')

unique_types, religious_labels = extract_label(unique_types, religious_vocab, 'Number of Religious Establishments: ')
unique_types, coffee_labels = extract_label(unique_types, coffee_vocab, 'Number of Coffeeshops: ')
unique_types, catering_labels = extract_label(unique_types, catering_vocab, 'Number of Caterers: ')

unique_types, bakery_labels = extract_label(unique_types, bakery_vocab, 'Number of Bakeries: ')
unique_types, market_labels = extract_label(unique_types, market_vocab, 'Number of Markets: ')
unique_types, dessert_labels = extract_label(unique_types, dessert_vocab, 'Number of Dessert Places: ')

unique_types, health_labels = extract_label(unique_types, health_vocab, 'Number of Health Institutions: ')
unique_types, medical_labels = extract_label(unique_types, medical_vocab, 'Number of Pharmacies: ')
unique_types, vending_labels = extract_label(unique_types, vending_vocab, 'Number of Vending Establishments: ')

unique_types, drinks_labels = extract_label(unique_types, drinks_vocab, 'Number of Drinks Places: ')
unique_types, live_labels = extract_label(unique_types, live_vocab, 'Number of Live Animal Sellers and Slaughterhouses: ')
unique_types, kitchen_labels = extract_label(unique_types, kitchen_vocab, 'Number of Shared Kitchen: ')

unique_types, distributor_labels = extract_label(unique_types, distributor_vocab, 'Number of Distributors: ')
unique_types, shelter_labels = extract_label(unique_types, shelter_vocab, 'Number of Shelters: ')
other_labels = unique_types

In [None]:
unique_types[0:10]
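Because each call removes the matched labels before the next vocabulary is tried, the order of the calls acts as a priority: a type such as 'day care' is claimed by the school vocabulary before 'care' in the nursing-home vocabulary can match it. The sketch below shows the same first-match-wins logic applied to a single value. The `categorize` helper and the shortened vocabularies are hypothetical, for illustration only.

```python
import re
import numpy as np

# Shortened, hypothetical vocabularies; dict order defines the priority.
vocabularies = {
    'school': ['school', 'day care'],
    'nursing home': ['nursing', 'care'],
    'market': ['grocery', 'store'],
}

def categorize(facility_type, vocabularies):
    """Return the first category whose vocabulary matches, 'other' if none does."""
    if not isinstance(facility_type, str):
        return np.nan
    for category, vocab in vocabularies.items():
        pattern = '|'.join(re.escape(word) for word in vocab)
        if re.search(pattern, facility_type):
            return category
    return 'other'

print(categorize('day care (2 - 6 years)', vocabularies))
print(categorize('long term care', vocabularies))
print(categorize('golf course', vocabularies))
```

A function of this shape could later be applied to the 'Facility Type' column to replace the raw types with their cluster.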

### F. Clean the AKA names column

**Explore the difference between DBA and AKA names**

In [None]:
# Display the number of establishments (we count the unique DBA names)
print ('There are {0} unique DBA (‘Doing business as.’) names in the dataset.'.format(len(inspections['DBA Name'].unique())))

In [None]:
# Display the number of unique AKA names
print ('There are {0} unique AKA (‘Also known as.’) names in the dataset.'.format(len(inspections['AKA Name'].unique())))

In [None]:
# Explore how DBA and AKA names differ
print ('There are {0} rows where the DBA names and the AKA names differ.'\
       .format((len(inspections[inspections['DBA Name'] != inspections['AKA Name']]))))

In [None]:
print('Examples of different DBA and AKA names : ')
inspections[inspections['DBA Name'] != inspections['AKA Name']].head(10)

We see that the AKA name is the name of the establishment as known to the public. We decide to copy the DBA name into the missing AKA names: we will need those for our recommendation map later on, and it makes more sense to display the AKA names to users.

In [None]:
# fill the missing AKA names with the corresponding DBA names
inspections['AKA Name'] = inspections['AKA Name'].fillna(inspections['DBA Name'])

In [None]:
# check that no AKA name is missing any more
inspections.loc[inspections['AKA Name'].isnull(), 'AKA Name']

### G. Explore the violations column

Let's first see if the missing violations are consistent with the inspection results.

In [None]:
inspections[inspections['Violations'].isnull()]['Results'].unique()

We expected the entries which have null Violations to have 'No Entry', 'Out of Business', 'Pass' or 'Business Not Located' as a value for Results. We see that we also have null Violations for 'Fail' and 'Pass w/ Conditions' inspections. We still keep those entries because they can be useful for other metrics.
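To see how common missing violations are for each result, rather than only which results occur, one could compute the share of null 'Violations' per value of 'Results'. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows; the violation texts are placeholders.
toy = pd.DataFrame({
    'Results': ['Pass', 'Fail', 'Out of Business', 'Pass', 'Fail'],
    'Violations': [np.nan, 'some violation', np.nan, 'some violation', np.nan],
})

# Mean of a boolean Series = fraction of True values, here per result.
missing_share = toy['Violations'].isna().groupby(toy['Results']).mean()
print(missing_share)
```

A share close to 1 for 'Out of Business' and a small share for 'Fail' would suggest that the null violations on failed inspections are data-entry gaps rather than a systematic pattern.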

In [None]:
#TODO: Explode the violations column and create a pickle of all possible violations with their respective numbers
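A possible starting point for this TODO is sketched below. It assumes that the violations of one inspection are separated by '|' and that each one starts with its number followed by a period, as in '3. MANAGEMENT, ...' seen above. The separator is an assumption that should be checked against the raw data, and the violation texts here are made up.

```python
import pandas as pd

# Made-up violation strings; the '|' separator is an assumption.
toy = pd.DataFrame({
    'Inspection ID': [1, 2],
    'Violations': ['3. FIRST VIOLATION - Comments: x | 38. SECOND VIOLATION - Comments: y',
                   None],
})

# One row per (inspection, violation); inspections without violations are skipped.
exploded = (toy.dropna(subset=['Violations'])
               .assign(Violation=lambda d: d['Violations'].str.split('|'))
               .explode('Violation'))

# The violation number is the leading integer before the first period.
exploded['Violation Number'] = (exploded['Violation']
                                .str.extract(r'^\s*(\d+)\.', expand=False)
                                .astype(int))
print(exploded[['Inspection ID', 'Violation Number']])
```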