# Identifying Specific Tokens That Lead To Failed Food Inspections

**A personal project of Azizha Zeinita**

* **Project Goal:**

This project is one of a series of NLP projects using the food inspections dataset from the Chicago Department of Public Health’s Food Protection Program. The goal is to **improve the accuracy of the failed food inspections analysis** I produced in "NLP-ChicagoFoodInspections-Python" by **identifying the specific *tokens* that lead to failed food inspections** in Chicago and producing a more accurate top-10 list using **different tokenization methods**.
 
* **Data Source:**

Chicago Data Portal - Food inspections of restaurants and other food establishments in Chicago from January 1, 2010 to August 4, 2022. Inspections are performed by staff from the Chicago Department of Public Health’s Food Protection Program. https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5

* **Methods:**
1. General tokenization using nltk.tokenize
2. Porter Stemming
3. Lancaster Stemming
4. Lemmatization


In [1]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load The Data
df = pd.read_csv("/Users/azizhazeinita/Project/NLP/Food_Inspections.csv")
df.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
0,2561879,SOLITA,SOLITA,2857386.0,Restaurant,Risk 3 (Low),431 N WELLS ST,CHICAGO,IL,60654.0,08/09/2022,License,Pass,,41.890025,-87.633882,"(-87.63388240330882, 41.89002468212031)"
1,2559417,CHEZ JOEL,CHEZ JOEL,32359.0,Restaurant,Risk 1 (High),1119 W TAYLOR ST,CHICAGO,IL,60607.0,06/15/2022,Non-Inspection,No Entry,,41.869332,-87.655107,"(-87.65510678669794, 41.86933227916697)"
2,2559999,ROJO GUSANO,ROJO GUSANO,1305286.0,Restaurant,Risk 1 (High),3830 W LAWRENCE AVE,CHICAGO,IL,60625.0,06/28/2022,Canvass,Out of Business,,41.96839,-87.724448,"(-87.72444785924317, 41.968390431264375)"
3,2554552,HOST INTERNATIONAL B05,LA TAPENADE (T1-B5),34203.0,Restaurant,Risk 1 (High),11601 W TOUHY AVE,CHICAGO,IL,60666.0,04/20/2022,Canvass,Pass,,42.008536,-87.914428,"(-87.91442843927047, 42.008536400868735)"
4,2553925,LITTLE HARVARD ACADEMY,LITTLE HARVARD ACADEMY,2215573.0,Daycare (2 - 6 Years),Risk 1 (High),2708 W PETERSON AVE,CHICAGO,IL,60659.0,04/05/2022,Canvass,Out of Business,,41.990564,-87.697365,"(-87.69736479750581, 41.99056361928264)"


### Extract regulation descriptions from each failed-inspection record, as in the "NLP-ChicagoFoodInspections-Python" project

Select only the records corresponding to failed inspections (see the "Results" column).

In [3]:
# Filtering Dataframe Only With Value 'Fail' in Column 'Results'
df_fail = df[df.Results=='Fail']
df_fail.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
20,2561623,LA CHOZA MEXICAN GRILL,LA CHOZA MEXICAN GRILL,1840862.0,Restaurant,Risk 1 (High),7022 N CLARK ST,CHICAGO,IL,60626.0,08/02/2022,Canvass,Fail,9. NO BARE HAND CONTACT WITH RTE FOOD OR A PRE...,42.009698,-87.674274,"(-87.67427432484828, 42.009697980488845)"
50,2560320,iO Theater,iO Theater,2850833.0,Restaurant,Risk 3 (Low),1501-1519 N KINGSBURY ST,CHICAGO,IL,60642.0,07/06/2022,License,Fail,,41.908306,-87.6518,"(-87.65179957092685, 41.90830578569197)"
109,2556045,REGGIE'S ON THE BEACH,REGGIE'S ON THE BEACH,2840426.0,Restaurant,Risk 3 (Low),6245 S LAKE SHORE DR,CHICAGO,IL,60637.0,05/19/2022,License,Fail,,41.780804,-87.574616,"(-87.57461558340579, 41.780803899394144)"
135,2555232,HOT SEAFOOD MARKET INC.,HOT SEAFOOD MARKET INC.,2840868.0,Restaurant,Risk 2 (Medium),9454 S COTTAGE GROVE AVE,CHICAGO,IL,60619.0,05/04/2022,License,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.722519,-87.604637,"(-87.60463680899302, 41.72251869461573)"
141,2555124,PRIME FISH,PRIME FISH,2463968.0,Restaurant,Risk 1 (High),8022 S HALSTED ST,CHICAGO,IL,60620.0,05/02/2022,Canvass,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.748104,-87.644115,"(-87.64411513161956, 41.74810405559482)"


Clean the data, making sure that there are no NaNs in the "Violations" column, which lists the reasons for inspection failure.

The reasons are separated by "|"; each reason consists of a regulation code, a regulation description, and comments describing how the regulation was violated.

In [4]:
# Drop rows where 'Violations' is null; copy so a new column can be added later
df_clean = df_fail.dropna(subset=['Violations']).copy()
df_clean.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
20,2561623,LA CHOZA MEXICAN GRILL,LA CHOZA MEXICAN GRILL,1840862.0,Restaurant,Risk 1 (High),7022 N CLARK ST,CHICAGO,IL,60626.0,08/02/2022,Canvass,Fail,9. NO BARE HAND CONTACT WITH RTE FOOD OR A PRE...,42.009698,-87.674274,"(-87.67427432484828, 42.009697980488845)"
135,2555232,HOT SEAFOOD MARKET INC.,HOT SEAFOOD MARKET INC.,2840868.0,Restaurant,Risk 2 (Medium),9454 S COTTAGE GROVE AVE,CHICAGO,IL,60619.0,05/04/2022,License,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.722519,-87.604637,"(-87.60463680899302, 41.72251869461573)"
141,2555124,PRIME FISH,PRIME FISH,2463968.0,Restaurant,Risk 1 (High),8022 S HALSTED ST,CHICAGO,IL,60620.0,05/02/2022,Canvass,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.748104,-87.644115,"(-87.64411513161956, 41.74810405559482)"
149,2554905,LAS BRISAS DEL MAR,LAS BRISAS DEL MAR,84625.0,Restaurant,Risk 1 (High),3207 W 51ST ST,CHICAGO,IL,60632.0,04/27/2022,Canvass,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.800706,-87.704099,"(-87.70409936083796, 41.80070644396386)"
167,2554291,MANA,MANA,2712278.0,Grocery Store,Risk 2 (Medium),434 E 71ST ST,CHICAGO,IL,60619.0,04/13/2022,Complaint,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.765844,-87.613915,"(-87.61391496525177, 41.76584421474714)"


Parse "Violations" column to select only regulation descriptions

In [5]:
# Show an example of the Violations content (index 20 chosen arbitrarily)
df_clean.Violations[20]

# Remove comments that are followed by another violation (up to and including '|')
a = df_clean.Violations.apply(lambda x: re.sub(r'\s?-\s+Comments.*?\|', '', x))

# Remove the last comment, which has no trailing '|'
b = a.apply(lambda x: re.sub(r'\s?-\s+Comments.*', '', x))

# Remove regulation codes (any digits preceded by whitespace)
c = b.apply(lambda x: re.sub(r'\s\d+', '', x))

# Remove the regulation code at the beginning of the string
d = c.apply(lambda x: re.sub(r'^\d+\W+', '', x))

# Remove any remaining '|'
e = d.apply(lambda x: re.sub(r'\|', '', x))

# Remove the space before a period
f = e.apply(lambda x: re.sub(r'(\w) \.', r'\1.', x))

# Remove whitespace at the beginning of the string
g = f.apply(lambda x: re.sub(r'^\s+', '', x))

# Remove the space after a period, so the text is easy to split on periods
desc = g.apply(lambda x: re.sub(r'\. (\w)', r'.\1', x))

# Store the description-only text in the DataFrame
df_clean['Description'] = desc.tolist()
df_clean.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location,Description
20,2561623,LA CHOZA MEXICAN GRILL,LA CHOZA MEXICAN GRILL,1840862.0,Restaurant,Risk 1 (High),7022 N CLARK ST,CHICAGO,IL,60626.0,08/02/2022,Canvass,Fail,9. NO BARE HAND CONTACT WITH RTE FOOD OR A PRE...,42.009698,-87.674274,"(-87.67427432484828, 42.009697980488845)",NO BARE HAND CONTACT WITH RTE FOOD OR A PRE-AP...
135,2555232,HOT SEAFOOD MARKET INC.,HOT SEAFOOD MARKET INC.,2840868.0,Restaurant,Risk 2 (Medium),9454 S COTTAGE GROVE AVE,CHICAGO,IL,60619.0,05/04/2022,License,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.722519,-87.604637,"(-87.60463680899302, 41.72251869461573)",CITY OF CHICAGO FOOD SERVICE SANITATION CERTIF...
141,2555124,PRIME FISH,PRIME FISH,2463968.0,Restaurant,Risk 1 (High),8022 S HALSTED ST,CHICAGO,IL,60620.0,05/02/2022,Canvass,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.748104,-87.644115,"(-87.64411513161956, 41.74810405559482)","PERSON IN CHARGE PRESENT, DEMONSTRATES KNOWLED..."
149,2554905,LAS BRISAS DEL MAR,LAS BRISAS DEL MAR,84625.0,Restaurant,Risk 1 (High),3207 W 51ST ST,CHICAGO,IL,60632.0,04/27/2022,Canvass,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.800706,-87.704099,"(-87.70409936083796, 41.80070644396386)","MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPL..."
167,2554291,MANA,MANA,2712278.0,Grocery Store,Risk 2 (Medium),434 E 71ST ST,CHICAGO,IL,60619.0,04/13/2022,Complaint,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.765844,-87.613915,"(-87.61391496525177, 41.76584421474714)",CITY OF CHICAGO FOOD SERVICE SANITATION CERTIF...
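
As an alternative to the chained substitutions above, the same extraction can be sketched by splitting each cell on "|" and capturing the description with a single pattern. This is only a minimal sketch: the `sample` string, its comment text, and the code 37 are invented for illustration, and `parse_violations` is a hypothetical helper, not part of the original notebook.

```python
import re

# Hypothetical sample in the "code. DESCRIPTION - Comments: text | ..." layout;
# the comment text and the code 37 are invented for illustration.
sample = (
    "9. NO BARE HAND CONTACT WITH RTE FOOD - Comments: OBSERVED STAFF HANDLING BREAD. "
    "| 37. FOOD SEPARATED AND PROTECTED - Comments: RAW MEAT STORED ABOVE PRODUCE."
)

# Leading code, then the description up to " - Comments:" (or the end of the part).
VIOLATION = re.compile(r"^\s*(\d+)\.\s*(.*?)(?:\s+-\s+Comments:.*)?$", re.DOTALL)


def parse_violations(text):
    """Return a list of (code, description) pairs from one Violations cell."""
    pairs = []
    for part in text.split("|"):
        match = VIOLATION.match(part)
        if match:
            pairs.append((int(match.group(1)), match.group(2).strip()))
    return pairs


print(parse_violations(sample))
```

Keeping each description as a separate item avoids fusing neighbouring descriptions into tokens such as `ALLOWED.FOOD`, and it keeps the regulation code available if it is needed later.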


### Tokenize each regulation description

In [6]:
import nltk
import nltk.corpus
from nltk.text import Text
# word_tokenize, the stopword list and WordNetLemmatizer rely on NLTK data
# packages, which can be fetched with nltk.download()

In [7]:
tokens = nltk.tokenize.word_tokenize(" ".join(df_clean['Description'].values))
tokens[:20]

['NO',
 'BARE',
 'HAND',
 'CONTACT',
 'WITH',
 'RTE',
 'FOOD',
 'OR',
 'A',
 'PRE-APPROVED',
 'ALTERNATIVE',
 'PROCEDURE',
 'PROPERLY',
 'ALLOWED.FOOD',
 'SEPARATED',
 'AND',
 'PROTECTED.PROPER',
 'DATE',
 'MARKING',
 'AND']

### Find top-10 tokens

In [8]:
fdist = nltk.FreqDist(tokens)
fdist.most_common(10)

[(',', 347632),
 ('AND', 167060),
 (':', 124178),
 ('CONSTRUCTED', 64733),
 ('PROPERLY', 64341),
 ('EQUIPMENT', 64111),
 ('OF', 60302),
 ('INSTALLED', 59910),
 ('&', 57615),
 ('MAINTAINED', 54141)]

### Clean data: convert to lower case, remove stopwords, punctuation, numbers

In [9]:
# Convert to lower case and join all descriptions into one string
desc_lower = df_clean['Description'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')

# Tokenize the lower-cased text
words = nltk.tokenize.word_tokenize(desc_lower)

# Build the English stopword set
stopwords = set(nltk.corpus.stopwords.words('english'))

# Remove stopwords
words = [word for word in words if word not in stopwords]

# Remove punctuation: keep only purely alphabetic tokens
# (this also drops tokens containing '-' or '.', such as 'pre-approved')
words = [word for word in words if word.isalpha()]

# Remove numbers (already excluded by isalpha(); kept as an explicit safeguard)
words = [word for word in words if not word.isnumeric()]

words[:10]

['bare',
 'hand',
 'contact',
 'rte',
 'food',
 'alternative',
 'procedure',
 'properly',
 'separated',
 'date']
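
The output above skips from 'properly' to 'separated' because `str.isalpha()` rejects fused tokens such as 'allowed.food' as well as hyphenated ones such as 'pre-approved'. The sketch below reproduces that effect on a handful of tokens and shows one possible alternative, extracting runs of letters with a regular expression. The token list is a small hand-picked sample and the alternative is an assumption about what might be preferable, not the method used in this notebook.

```python
import re

# Fused and hyphenated tokens of the kind seen in the tokenizer output above.
tokens = ["properly", "allowed.food", "separated", "pre-approved", "alternative"]

# str.isalpha() rejects any token containing '.', '-' or digits.
kept = [t for t in tokens if t.isalpha()]

# Alternative: pull every run of letters out of the joined text instead.
recovered = re.findall(r"[a-z]+", " ".join(tokens))

print(kept)       # ['properly', 'separated', 'alternative']
print(recovered)  # ['properly', 'allowed', 'food', 'separated', 'pre', 'approved', 'alternative']
```

With the regular expression, 'allowed' and 'food' are counted instead of discarded, at the cost of splitting 'pre-approved' into two tokens.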

### Find top-10 tokens again after cleaning process

In [10]:
# Find top-10 Tokenized Word
word_dist = nltk.FreqDist(words)
df_word_dist = pd.DataFrame(word_dist.most_common(10),columns=['Word', 'Frequency'])
df_word_dist

Unnamed: 0,Word,Frequency
0,constructed,64733
1,properly,64341
2,equipment,64111
3,installed,59910
4,maintained,54141
5,cleaning,48295
6,surfaces,45992
7,clean,45015
8,contact,43345
9,good,38535


### Find top-10 tokens after applying Porter stemming

In [11]:
# Applying Porter Stemming
porter = nltk.PorterStemmer()
words_porter = [porter.stem(t) for t in words]

# Find top-10 Porter stemming
word_dist_porter = nltk.FreqDist(words_porter)
df_word_dist_porter = pd.DataFrame(word_dist_porter.most_common(10),columns=['Word', 'Frequency'])
df_word_dist_porter

Unnamed: 0,Word,Frequency
0,clean,115593
1,construct,64733
2,properli,64341
3,equip,64111
4,instal,59910
5,maintain,57690
6,surfac,45992
7,contact,43345
8,good,38535
9,per,37628
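
'clean' moves to first place after stemming because the counts of its inflected forms are summed under one stem. The sketch below illustrates that merging effect with a deliberately simple suffix stripper. `toy_stem` is a hypothetical stand-in written for this example, not the Porter algorithm, and the token list is invented.

```python
from collections import Counter


def toy_stem(word):
    """Strip one common suffix; a toy stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Invented sample tokens containing several inflected forms of 'clean'.
tokens = ["cleaning", "clean", "cleaned", "constructed",
          "constructed", "equipment", "clean", "surfaces"]

raw_counts = Counter(tokens)
stemmed_counts = Counter(toy_stem(t) for t in tokens)

print(raw_counts["clean"])            # 2
print(stemmed_counts.most_common(2))  # [('clean', 4), ('construct', 2)]
```

In the raw counts 'clean' ties with 'constructed', but once 'cleaning' and 'cleaned' are folded into it, it clearly leads, which mirrors the shift seen in the Porter table above.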


### Find top-10 tokens after applying Lancaster stemming

In [12]:
# Applying Lancaster Stemming
lancaster = nltk.LancasterStemmer()
words_lancester = [lancaster.stem(t) for t in words]

# Find top-10 Lancaster stemming
word_dist_lancester = nltk.FreqDist(words_lancester)
df_word_dist_lancester = pd.DataFrame(word_dist_lancester.most_common(10),columns=['Word', 'Frequency'])
df_word_dist_lancester


Unnamed: 0,Word,Frequency
0,cle,121593
1,prop,75145
2,construct,64733
3,equip,64111
4,instal,59910
5,maintain,57690
6,surfac,45992
7,contact,43345
8,good,38535
9,per,37628


### Find top-10 tokens after applying lemmatization

In [13]:
# Applying Lemmatization (lemmatize() treats each token as a noun by default)
wnl = nltk.WordNetLemmatizer()
words_lemmatization = [wnl.lemmatize(t) for t in words]

# Find top-10 after lemmatization
word_dist_lemmatization = nltk.FreqDist(words_lemmatization)
df_word_dist_lemmatization = pd.DataFrame(word_dist_lemmatization.most_common(10),columns=['Word', 'Frequency'])
df_word_dist_lemmatization


Unnamed: 0,Word,Frequency
0,constructed,64733
1,properly,64341
2,equipment,64111
3,installed,59910
4,maintained,54141
5,cleaning,48295
6,surface,45992
7,clean,45015
8,contact,43345
9,good,38535


### Compare the top-10 tokens obtained from all five approaches

In [14]:
# Store 1st tokenization method to dataframe
df_fdist = pd.DataFrame(fdist.most_common(10),columns=['Before Cleaning', 'Frequency'])

In [16]:
# Store results from all methods into a dataframe

df_comp = pd.DataFrame()

df_comp['Before Cleaning (BC)'] = df_fdist['Before Cleaning']
df_comp['Frequency (BC)'] = df_fdist['Frequency']
df_comp['After Cleaning (AC)'] = df_word_dist['Word']
df_comp['Frequency (AC)'] = df_word_dist['Frequency']
df_comp['Porter Stemming (PS)'] = df_word_dist_porter['Word']
df_comp['Frequency (PS)'] = df_word_dist_porter['Frequency']
df_comp['Lancaster (LA)'] = df_word_dist_lancester['Word']
df_comp['Frequency (LA)'] = df_word_dist_lancester['Frequency']
df_comp['Lemmatization (LEM)'] = df_word_dist_lemmatization['Word']
df_comp['Frequency (LEM)'] = df_word_dist_lemmatization['Frequency']
df_comp

Unnamed: 0,Before Cleaning (BC),Frequency (BC),After Cleaning (AC),Frequency (AC),Porter Stemming (PS),Frequency (PS),Lancaster (LA),Frequency (LA),Lemmatization (LEM),Frequency (LEM)
0,",",347632,constructed,64733,clean,115593,cle,121593,constructed,64733
1,AND,167060,properly,64341,construct,64733,prop,75145,properly,64341
2,:,124178,equipment,64111,properli,64341,construct,64733,equipment,64111
3,CONSTRUCTED,64733,installed,59910,equip,64111,equip,64111,installed,59910
4,PROPERLY,64341,maintained,54141,instal,59910,instal,59910,maintained,54141
5,EQUIPMENT,64111,cleaning,48295,maintain,57690,maintain,57690,cleaning,48295
6,OF,60302,surfaces,45992,surfac,45992,surfac,45992,surface,45992
7,INSTALLED,59910,clean,45015,contact,43345,contact,43345,clean,45015
8,&,57615,contact,43345,good,38535,good,38535,contact,43345
9,MAINTAINED,54141,good,38535,per,37628,per,37628,good,38535
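
The table can also be compared programmatically. The sketch below copies four of the top-10 lists from the table above and reports where two rankings disagree and which tokens two methods share. `rank_differences` and `shared_tokens` are hypothetical helper names introduced only for this illustration.

```python
# Top-10 lists copied from the comparison table above.
after_cleaning = ["constructed", "properly", "equipment", "installed", "maintained",
                  "cleaning", "surfaces", "clean", "contact", "good"]
lemmatization = ["constructed", "properly", "equipment", "installed", "maintained",
                 "cleaning", "surface", "clean", "contact", "good"]
porter = ["clean", "construct", "properli", "equip", "instal",
          "maintain", "surfac", "contact", "good", "per"]
lancaster = ["cle", "prop", "construct", "equip", "instal",
             "maintain", "surfac", "contact", "good", "per"]


def rank_differences(left, right):
    """Return (rank, left_word, right_word) wherever two rankings disagree."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(left, right)) if a != b]


def shared_tokens(left, right):
    """Return the tokens present in both top-10 lists, in left's order."""
    right_set = set(right)
    return [w for w in left if w in right_set]


print(rank_differences(after_cleaning, lemmatization))  # [(6, 'surfaces', 'surface')]
print(len(shared_tokens(porter, lancaster)))            # 8
```

The cleaned and lemmatized lists differ at a single rank, while the two stemmers share eight of their ten tokens and differ only in how they reduce 'clean' and 'properly'.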


**Summary:**
* Based on the comparison table, **Porter stemming** gives the **most acceptable** result. Although some words lose a few characters, they remain readable and their meaning is still clear.
* The 2nd and the last methods, nltk.tokenize **after cleaning** and **lemmatization**, keep the **clearest words** (no characters are lost), **but** they do not reduce words to a single form (e.g., present tense), so words like 'cleaning' and 'clean' are counted as different words. They **also** give **nearly the same result**: the only difference in the top 10 is 'surfaces' becoming 'surface'.
* Using nltk.tokenize **before cleaning** gives the **least informative result**, dominated by punctuation and stopwords, while **Lancaster** stemming cuts so many characters (e.g., 'cle', 'prop') that the words **lose their meaning**.