# Data preparation and exploration

Prepare data from the [restaurant inspections](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59) data set. Note: the data was exported as CSV.

Compute some descriptive statistics for the data set.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [1]:
# Read the complete data set
rest_insp = pd.read_csv('DOHMH_New_York_City_Restaurant_Inspection_Results.csv', sep=',',engine='python')
# separate columns specific to 
# 1. Inspection
# 2. Restaurant
# 3. Violations

In [7]:
rest_insp.head()

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE
0,50041313,876 MARKET DELI,MANHATTAN,876,6TH AVE,10001.0,2122132431,Delicatessen,01/26/2016,Violations were cited in the following area(s).,02B,Hot food item not held at or above 140º F.,Critical,12.0,A,01/26/2016,07/15/2017,Pre-permit (Operational) / Initial Inspection
1,41312955,RIPE JUICE BAR & GRILL,QUEENS,7013,AUSTIN STREET,11375.0,7182612881,American,04/01/2014,Violations were cited in the following area(s).,02G,Cold food item held above 41º F (smoked fis...,Critical,12.0,A,04/01/2014,07/15/2017,Cycle Inspection / Re-inspection
2,41601691,WAZA SUSHI,BROOKLYN,485,MYRTLE AVENUE,11205.0,7183993839,Japanese,10/26/2015,Violations were cited in the following area(s).,10H,Proper sanitization not provided for utensil w...,Not Critical,25.0,,,07/15/2017,Cycle Inspection / Initial Inspection
3,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,05/19/2017,Violations were cited in the following area(s).,10F,Non-food contact surface improperly constructe...,Not Critical,35.0,,,07/15/2017,Cycle Inspection / Initial Inspection
4,50001580,CIRO PIZZA CAFE,STATEN ISLAND,862,HUGUENOT AVENUE,10312.0,7186050620,Italian,12/01/2015,Violations were cited in the following area(s).,02G,Cold food item held above 41º F (smoked fis...,Critical,8.0,A,12/01/2015,07/15/2017,Cycle Inspection / Re-inspection
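Text fields in a CSV export like this one can come out mis-encoded, for example a degree sign that shows up as a run of accented letters. The sketch below assumes the cause is UTF-8 text that was wrongly decoded as Windows-1252, possibly more than once. `fix_mojibake` is a hypothetical helper written for this note, not a pandas function.

```python
def fix_mojibake(text, rounds=2):
    """Undo up to `rounds` layers of UTF-8 text wrongly decoded as cp1252."""
    for _ in range(rounds):
        try:
            text = text.encode('cp1252').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text is already clean (or damaged in another way): stop here.
            break
    return text


# Unicode escapes are used so the example does not depend on this file's encoding.
garbled = 'above 41\u00c3\u201a\u00c2\u00ba F'
repaired = fix_mojibake(garbled)
print(repaired)
```

Applying the helper to a whole column would also need to skip missing values, since those are not strings.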


In [8]:
len(rest_insp)

410649
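The preview shows `ZIPCODE` as a float (10001.0), which is what pandas usually does when an integer-looking column contains missing values. A minimal sketch, on two made-up rows, of reading identifier-like columns as text instead; whether this is wanted for the real file is an assumption.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real export.
sample_csv = io.StringIO(
    "CAMIS,DBA,ZIPCODE,PHONE,SCORE\n"
    "50041313,876 MARKET DELI,10001,2122132431,12\n"
    "41601691,WAZA SUSHI,,7183993839,25\n"
)

# Read identifier-like columns as text so a missing value
# does not turn the whole column into floats.
sample = pd.read_csv(sample_csv, dtype={'ZIPCODE': str, 'PHONE': str})
print(sample)
```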

In [2]:
# creates a new column based on inspection date
rest_insp['date'] = pd.to_datetime(rest_insp['INSPECTION DATE'])

In [13]:
rest_insp.head()

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,...,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,date,year,month,day
0,50041313,876 MARKET DELI,MANHATTAN,876,6TH AVE,10001.0,2122132431,Delicatessen,01/26/2016,Violations were cited in the following area(s).,...,Critical,12.0,A,01/26/2016,07/15/2017,Pre-permit (Operational) / Initial Inspection,2016-01-26,2016,1,26
1,41312955,RIPE JUICE BAR & GRILL,QUEENS,7013,AUSTIN STREET,11375.0,7182612881,American,04/01/2014,Violations were cited in the following area(s).,...,Critical,12.0,A,04/01/2014,07/15/2017,Cycle Inspection / Re-inspection,2014-04-01,2014,4,1
2,41601691,WAZA SUSHI,BROOKLYN,485,MYRTLE AVENUE,11205.0,7183993839,Japanese,10/26/2015,Violations were cited in the following area(s).,...,Not Critical,25.0,,,07/15/2017,Cycle Inspection / Initial Inspection,2015-10-26,2015,10,26
3,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,05/19/2017,Violations were cited in the following area(s).,...,Not Critical,35.0,,,07/15/2017,Cycle Inspection / Initial Inspection,2017-05-19,2017,5,19
4,50001580,CIRO PIZZA CAFE,STATEN ISLAND,862,HUGUENOT AVENUE,10312.0,7186050620,Italian,12/01/2015,Violations were cited in the following area(s).,...,Critical,8.0,A,12/01/2015,07/15/2017,Cycle Inspection / Re-inspection,2015-12-01,2015,12,1


In [3]:
# separation by month
rest_insp['year'] = rest_insp['date'].dt.year
rest_insp['month'] = rest_insp['date'].dt.month
rest_insp['day'] = rest_insp['date'].dt.day
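The dates in this file follow a month/day/year pattern. A small sketch of parsing with an explicit format, so that nothing is guessed and malformed values become `NaT` instead of raising; the third value is invented to show that behaviour.

```python
import pandas as pd

raw_dates = pd.Series(['01/26/2016', '04/01/2014', 'not a date'])

# An explicit format avoids day/month guessing;
# errors='coerce' turns values that do not match into NaT.
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')

parts = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,
    'day': parsed.dt.day,
})
print(parts)
```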

## Basic description of the data
Some descriptive counts over the complete table.

In [4]:
# years in the data
rest_insp['year'].value_counts()

2015    119371
2016    118620
2014     95602
2017     62312
2013     13709
1900      1020
2012        13
2011         2
Name: year, dtype: int64
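The 1,020 rows with year 1900 look like a placeholder rather than real inspections. As far as I know, the data set's documentation uses 01/01/1900 for establishments that have not been inspected yet, but that should be treated as an assumption and checked. A sketch of separating such rows, on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'CAMIS': [1, 2, 3],
    'date': pd.to_datetime(['2016-01-26', '1900-01-01', '2015-10-26']),
})

# Assumption: 1900-01-01 is a placeholder, not a real inspection date.
placeholder = toy['date'] == pd.Timestamp('1900-01-01')
not_inspected = toy[placeholder]
inspected = toy[~placeholder]
print(len(not_inspected), len(inspected))
```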

Here, only the four years with the most records (2014 to 2017) are kept.

In [5]:
# Keep only 2014 to 2017
inspections = rest_insp[(rest_insp['year'] >= 2014) & (rest_insp['year'] <= 2017)].copy()

In [6]:
type(inspections)
#type(rest_insp)

pandas.core.frame.DataFrame

In [7]:
# May or may not be relevant
inspections['CUISINE DESCRIPTION'].value_counts()

American                                                            90133
Chinese                                                             42136
Latin (Cuban, Dominican, Puerto Rican, South & Central American)    18908
Pizza                                                               18635
Italian                                                             17825
Café/Coffee/Tea                                                     15186
Mexican                                                             15010
Japanese                                                            14289
Caribbean                                                           12990
Spanish                                                             11687
Bakery                                                              11684
Pizza/Italian                                                        8642
Delicatessen                                                         6197
Asian                                 

In [8]:
#
inspections['GRADE'].value_counts()
# perhaps it makes sense to try to predict only low grades,
# or changes in grade, especially from higher to lower

A                 153672
B                  28724
C                   7072
Z                   2567
Not Yet Graded      1828
P                   1369
Name: GRADE, dtype: int64
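The comment in the cell above suggests looking at changes in grade, especially from a better grade to a worse one. A minimal sketch of flagging such downgrades on made-up data; it assumes only A, B and C are ranked, and leaves aside values such as Z, P or "Not Yet Graded".

```python
import pandas as pd

toy = pd.DataFrame({
    'CAMIS': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2015-01-10', '2016-02-11', '2017-03-12',
                            '2015-05-01', '2016-06-01']),
    'GRADE': ['A', 'B', 'A', 'C', 'C'],
})

# Assumption: A is best, C is worst; other grade values are not handled here.
rank = {'A': 0, 'B': 1, 'C': 2}

toy = toy.sort_values(['CAMIS', 'date'])
toy['grade_rank'] = toy['GRADE'].map(rank)
# Previous grade of the same restaurant (NaN for its first graded inspection).
toy['prev_rank'] = toy.groupby('CAMIS')['grade_rank'].shift()
# Comparisons against NaN are False, so first inspections are never flagged.
toy['downgrade'] = toy['grade_rank'] > toy['prev_rank']
print(toy[['CAMIS', 'date', 'GRADE', 'downgrade']])
```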

In [30]:
inspections['CRITICAL FLAG'].value_counts()

Critical          217888
Not Critical      172198
Not Applicable      5819
Name: CRITICAL FLAG, dtype: int64

In [None]:
inspections['VIOLATION CODE'].value_counts()

In [9]:

inspections['VIOLATION DESCRIPTION'].value_counts()


Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.                      55711
Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.                                                                                                                                                                   39238
Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.                                                                                                                                                                     27861
Evidence of mice or live mice present in facility's food and/or non-food areas.                        

In [10]:
# Perhaps it would be easy to look at records with these problems
inspections['ACTION'].value_counts()

Violations were cited in the following area(s).                                                                                        378322
Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.      8758
No violations were recorded at the time of this inspection.                                                                              5457
Establishment re-opened by DOHMH                                                                                                         2605
Establishment re-closed by DOHMH                                                                                                          763
Name: ACTION, dtype: int64

In [15]:
inspections['INSPECTION TYPE'].value_counts()

Cycle Inspection / Initial Inspection                          225319
Cycle Inspection / Re-inspection                               101552
Pre-permit (Operational) / Initial Inspection                   24329
Pre-permit (Operational) / Re-inspection                        10917
Administrative Miscellaneous / Initial Inspection                7947
Smoke-Free Air Act / Initial Inspection                          4266
Pre-permit (Non-operational) / Initial Inspection                4020
Cycle Inspection / Reopening Inspection                          3109
Trans Fat / Initial Inspection                                   2957
Administrative Miscellaneous / Re-inspection                     2840
Smoke-Free Air Act / Re-inspection                               1660
Cycle Inspection / Compliance Inspection                         1460
Trans Fat / Re-inspection                                        1228
Inter-Agency Task Force / Initial Inspection                     1171
Pre-permit (Operatio

In [22]:
inspections.to_csv('inspections.csv')

In [2]:
inspections = pd.read_csv('inspections.csv', sep=',',engine='python')

In [19]:
inspections.head()

Unnamed: 0.1,Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,...,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,date,year,month,day
0,0,50041313,876 MARKET DELI,MANHATTAN,876,6TH AVE,10001.0,2122132431,Delicatessen,01/26/2016,...,Critical,12.0,A,01/26/2016,07/15/2017,Pre-permit (Operational) / Initial Inspection,2016-01-26,2016,1,26
1,1,41312955,RIPE JUICE BAR & GRILL,QUEENS,7013,AUSTIN STREET,11375.0,7182612881,American,04/01/2014,...,Critical,12.0,A,04/01/2014,07/15/2017,Cycle Inspection / Re-inspection,2014-04-01,2014,4,1
2,2,41601691,WAZA SUSHI,BROOKLYN,485,MYRTLE AVENUE,11205.0,7183993839,Japanese,10/26/2015,...,Not Critical,25.0,,,07/15/2017,Cycle Inspection / Initial Inspection,2015-10-26,2015,10,26
3,3,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,05/19/2017,...,Not Critical,35.0,,,07/15/2017,Cycle Inspection / Initial Inspection,2017-05-19,2017,5,19
4,4,50001580,CIRO PIZZA CAFE,STATEN ISLAND,862,HUGUENOT AVENUE,10312.0,7186050620,Italian,12/01/2015,...,Critical,8.0,A,12/01/2015,07/15/2017,Cycle Inspection / Re-inspection,2015-12-01,2015,12,1
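The extra `Unnamed: 0.1` column in the preview comes from writing the row index into the CSV and reading it back as data. The `date` column also comes back as plain text unless it is parsed again. A sketch of a round trip that avoids both, using an in-memory buffer in place of a file:

```python
import io

import pandas as pd

toy = pd.DataFrame({
    'CAMIS': [50041313, 41312955],
    'date': pd.to_datetime(['2016-01-26', '2014-04-01']),
})

buffer = io.StringIO()            # stands in for a file on disk
toy.to_csv(buffer, index=False)   # do not write the row index as a column
buffer.seek(0)

# parse_dates restores the datetime dtype that a CSV cannot store.
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```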


In [31]:
list(inspections)
inspections['ACTION'].value_counts()

Violations were cited in the following area(s).                                                                                        378322
Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.      8758
No violations were recorded at the time of this inspection.                                                                              5457
Establishment re-opened by DOHMH                                                                                                         2605
Establishment re-closed by DOHMH                                                                                                          763
Name: ACTION, dtype: int64

In [3]:
# Normalization tables: violation codes and their descriptions
violations_vars = ['VIOLATION CODE', 'VIOLATION DESCRIPTION']
violations_rep = inspections[violations_vars]
#
violationsDescription = violations_rep.drop_duplicates(subset=None, keep='first', inplace=False)
violationsDescription.head()


Unnamed: 0,VIOLATION CODE,VIOLATION DESCRIPTION
0,02B,Hot food item not held at or above 140º F.
1,02G,Cold food item held above 41º F (smoked fis...
2,10H,Proper sanitization not provided for utensil w...
3,10F,Non-food contact surface improperly constructe...
5,20F,Current letter grade card not posted.
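Dropping duplicate (code, description) pairs gives a clean lookup table only if every code has a single description. That is not checked in this notebook, so here is a sketch of such a check on made-up rows, where one code deliberately has two wordings:

```python
import pandas as pd

toy = pd.DataFrame({
    'VIOLATION CODE': ['02B', '02G', '02G', '10F'],
    'VIOLATION DESCRIPTION': [
        'Hot food ...',
        'Cold food ...',
        'Cold food (older wording) ...',   # invented second wording
        'Non-food contact ...',
    ],
})

pairs = toy.drop_duplicates()
# Number of distinct descriptions attached to each code.
n_descriptions = pairs.groupby('VIOLATION CODE')['VIOLATION DESCRIPTION'].nunique()
ambiguous_codes = n_descriptions[n_descriptions > 1].index.tolist()
print(ambiguous_codes)
```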


In [4]:
len(violationsDescription)
violationsDescription.to_csv('violationsCode.csv')

In [48]:
# Normalization tables: violations per restaurant and inspection date
violations_vars = ['CAMIS', 'VIOLATION CODE', 'date', 'year', 'month','day']
violations_repeated = inspections[violations_vars]
#CI_restaurants = CI_restaurants_repeated.duplicated(subset=None, keep='first')
violationsDF = violations_repeated.drop_duplicates(subset=None, keep='first', inplace=False)
violationsDF.head()
# 'ACTION': include?
# to transpose for presence in an inspection
# 'VIOLATION DESCRIPTION': lookup table keyed by violation code
# 'CRITICAL FLAG': lookup table keyed by violation code and violation description

Unnamed: 0,CAMIS,VIOLATION CODE,date,year,month,day
0,50041313,02B,2016-01-26,2016,1,26
1,41312955,02G,2014-04-01,2014,4,1
2,41601691,10H,2015-10-26,2015,10,26
3,50043431,10F,2017-05-19,2017,5,19
4,50001580,02G,2015-12-01,2015,12,1


In [25]:
# Normalization tables # Inspections
inspection_vars = ['CAMIS', 'INSPECTION DATE', 'SCORE', 'GRADE', 'GRADE DATE', 'RECORD DATE', 'INSPECTION TYPE', 'date', 'year', 'month','day']
inspections_repeated = inspections[inspection_vars]
#CI_restaurants = CI_restaurants_repeated.duplicated(subset=None, keep='first')
inspectionsDF = inspections_repeated.drop_duplicates(subset=None, keep='first', inplace=False)
inspectionsDF.head()


Unnamed: 0,CAMIS,INSPECTION DATE,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,date,year,month,day
0,50041313,01/26/2016,12.0,A,01/26/2016,07/15/2017,Pre-permit (Operational) / Initial Inspection,2016-01-26,2016,1,26
1,41312955,04/01/2014,12.0,A,04/01/2014,07/15/2017,Cycle Inspection / Re-inspection,2014-04-01,2014,4,1
2,41601691,10/26/2015,25.0,NaN,NaN,07/15/2017,Cycle Inspection / Initial Inspection,2015-10-26,2015,10,26
3,50043431,05/19/2017,35.0,NaN,NaN,07/15/2017,Cycle Inspection / Initial Inspection,2017-05-19,2017,5,19
4,50001580,12/01/2015,8.0,A,12/01/2015,07/15/2017,Cycle Inspection / Re-inspection,2015-12-01,2015,12,1


In [21]:
# Normalization tables
restaurant_vars = ['CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET','ZIPCODE','PHONE','CUISINE DESCRIPTION']
restaurants_repeated = inspections[restaurant_vars]
#CI_restaurants = CI_restaurants_repeated.duplicated(subset=None, keep='first')
restaurantsDF = restaurants_repeated.drop_duplicates(subset=None, keep='first', inplace=False)
restaurantsDF.head()

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION
0,50041313,876 MARKET DELI,MANHATTAN,876,6TH AVE,10001.0,2122132431,Delicatessen
1,41312955,RIPE JUICE BAR & GRILL,QUEENS,7013,AUSTIN STREET,11375.0,7182612881,American
2,41601691,WAZA SUSHI,BROOKLYN,485,MYRTLE AVENUE,11205.0,7183993839,Japanese
3,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American
4,50001580,CIRO PIZZA CAFE,STATEN ISLAND,862,HUGUENOT AVENUE,10312.0,7186050620,Italian


In [23]:
len(restaurantsDF)
restaurantsDF.to_csv('restaurantsTable.csv')

In [26]:
len(inspectionsDF)
inspectionsDF.to_csv('inspectionsTable.csv')

In [29]:
len(violationsDF)
violationsDF.to_csv('violationsTable.csv')
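Once the data is split into restaurant, inspection and violation tables, joining them back on `CAMIS` is a quick sanity check. The sketch below uses made-up rows and the `validate` argument of `merge`, which raises an error if `CAMIS` is not unique on the restaurant side. Since the restaurant table above is de-duplicated on all columns, a restaurant whose phone or name changed could appear twice, so uniqueness is an assumption worth testing.

```python
import pandas as pd

restaurants = pd.DataFrame({'CAMIS': [1, 2], 'DBA': ['DELI', 'SUSHI']})
violations = pd.DataFrame({
    'CAMIS': [1, 1, 2],
    'date': pd.to_datetime(['2016-01-26', '2016-01-26', '2015-10-26']),
    'VIOLATION CODE': ['02B', '10F', '10H'],
})

# validate='many_to_one' raises a MergeError if CAMIS repeats in `restaurants`.
rebuilt = violations.merge(restaurants, on='CAMIS', how='left',
                           validate='many_to_one')
print(rebuilt)
```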

In [None]:
## What follows is draft work with inspection types and violation descriptions


In [55]:
# Testing with one restaurant
#cy_some = pd.DataFrame(CycleInspections[CycleInspections['CAMIS'] == 50043431])
#violationsDF.sort_values('date').head()
# hold presence of a violation
# .copy() gives an independent frame, so adding a column does not trigger SettingWithCopyWarning
df_multi = violationsDF[['CAMIS','date','VIOLATION CODE']].copy()
df_multi['violation_presence'] = 1
df_multi.head()


Unnamed: 0,CAMIS,date,VIOLATION CODE,violation_presence
0,50041313,2016-01-26,02B,1
1,41312955,2014-04-01,02G,1
2,41601691,2015-10-26,10H,1
3,50043431,2017-05-19,10F,1
4,50001580,2015-12-01,02G,1


In [56]:
# No set_index call is needed first: pivot_table builds the (CAMIS, date) index itself
df_multi = df_multi.pivot_table(index=['CAMIS', 'date'], columns='VIOLATION CODE')
df_multi.head()

Columns: violation_presence by VIOLATION CODE
CAMIS,date,02A,02B,02C,02D,02E,02F,02G,02H,02I,02J,...,20B,20D,20E,20F,22A,22B,22C,22E,22F,22G
30075445,2015-02-09,,,,,,,,,,,...,,,,,,,,,,
30075445,2016-02-18,,,,,,,,,,,...,,,,,,,,,,
30075445,2017-05-18,,,,,,,,,,,...,,,,,,,,,,
30112340,2014-07-01,,,,,,,1.0,,,,...,,,,,,,,,,
30112340,2014-11-13,,,,,,,,,,,...,,,,,,,,,,
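The pivoted table holds 1.0 where a violation was recorded and NaN everywhere else. If a 0/1 indicator matrix is wanted (an assumption about the intended use), `pivot_table` can fill the gaps directly. A sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'CAMIS': [1, 1, 2],
    'date': pd.to_datetime(['2016-01-26', '2016-01-26', '2015-10-26']),
    'VIOLATION CODE': ['02B', '10F', '02B'],
})
toy['violation_presence'] = 1

indicators = toy.pivot_table(
    index=['CAMIS', 'date'],
    columns='VIOLATION CODE',
    values='violation_presence',
    aggfunc='max',      # repeated rows for the same code still count as 1
    fill_value=0,       # absent violations become 0 instead of NaN
)
print(indicators)
```

Passing `values=` explicitly also keeps the column labels flat, with one column per violation code.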


In [58]:
len(df_multi)
df_multi.to_csv('multiTable.csv')

In [18]:

# Only Cycle Inspections
cycle_types = ['Cycle Inspection / Initial Inspection', 'Cycle Inspection / Re-inspection']
CycleInspections = inspections[inspections['INSPECTION TYPE'].isin(cycle_types)].copy()

In [23]:
variables = ['CAMIS','date','BORO','DBA','INSPECTION TYPE', 'SCORE','GRADE DATE','RECORD DATE', 'year','month','day','VIOLATION CODE','VIOLATION DESCRIPTION']
CycleInspections_out = CycleInspections[variables]
CycleInspections_out.to_csv('CycleInspections_out.csv')

In [None]:
#CycleInspections.to_csv('CycleInspections.csv')

In [2]:
#CycleInspections = pd.read_csv('CycleInspections.csv', sep=',',engine='python')
CycleInspections_out = pd.read_csv('CycleInspections_out.csv', sep=',',engine='python')


In [14]:
CycleInspections_out.head()
len(CycleInspections_out)

326871

In [17]:
# Restaurant 
# CAMIS, BORO, DBA, 
variables = ['CAMIS','BORO','DBA']
CI_restaurants_repeated = CycleInspections_out[variables]
#CI_restaurants = CI_restaurants_repeated.duplicated(subset=None, keep='first')
CI_restaurants = CI_restaurants_repeated.drop_duplicates(subset=None, keep='first', inplace=False)
CI_restaurants.head()

# Inspections
# CAMIS, date (year, month, day) , INSPECTION TYPE, SCORE, GRADE, RECORD DATE
# Violations
# VIOLATION CODE, VIOLATION DESCRIPTION

In [18]:
CI_restaurants.head()

Unnamed: 0,CAMIS,BORO,DBA
0,41312955,QUEENS,RIPE JUICE BAR & GRILL
1,41601691,BROOKLYN,WAZA SUSHI
2,50043431,MANHATTAN,SEATTLE CAFE
3,50001580,STATEN ISLAND,CIRO PIZZA CAFE
4,41722020,BRONX,2 BROS PIZZA


In [None]:
len(CI_restaurants)
#
#
#CI_restaurants_repeated.head()
#CI_restaurants.to_csv('CI_restaurants.csv')


In [3]:
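#CycleInspections_out.head()
# CycleInspections is only defined by the "Only Cycle Inspections" cell above;
# after a kernel restart that cell must be re-run, otherwise this lookup fails with the NameError below.
CycleInspections[CycleInspections['DBA'] == 'SEATTLE CAFE']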

NameError: name 'CycleInspections' is not defined

In [32]:
CycleInspections['date'][(CycleInspections['DBA'] == 'SEATTLE CAFE') & (CycleInspections['BORO'] == 'QUEENS') ]

88157    2016-02-25
91145    2015-02-13
96696    2017-05-02
126780   2017-03-29
132349   2015-08-25
139140   2015-02-24
160327   2015-02-13
172257   2017-03-29
174390   2015-09-14
182326   2015-02-13
190151   2017-05-02
218412   2015-02-24
220206   2017-05-02
253064   2017-03-29
268994   2015-08-25
336187   2016-02-25
349623   2015-08-25
382440   2015-09-14
388612   2015-02-24
Name: date, dtype: datetime64[ns]

In [44]:
cy_some = pd.DataFrame(CycleInspections[CycleInspections['CAMIS'] == 50043431])

In [60]:
cy_some[['VIOLATION DESCRIPTION','SCORE','date','INSPECTION TYPE','ACTION']].sort_values('date')

Unnamed: 0,VIOLATION DESCRIPTION,SCORE,date,INSPECTION TYPE,ACTION
124825,Cold food item held above 41º F (smoked fis...,19.0,2016-07-26,Cycle Inspection / Initial Inspection,Violations were cited in the following area(s).
136948,"Sanitized equipment or utensil, including in-u...",19.0,2016-07-26,Cycle Inspection / Initial Inspection,Violations were cited in the following area(s).
404115,Non-food contact surface improperly constructe...,19.0,2016-07-26,Cycle Inspection / Initial Inspection,Violations were cited in the following area(s).
76844,"Food contact surface not properly washed, rins...",12.0,2016-08-25,Cycle Inspection / Re-inspection,Violations were cited in the following area(s).
116190,Non-food contact surface improperly constructe...,12.0,2016-08-25,Cycle Inspection / Re-inspection,Violations were cited in the following area(s).
285089,Wiping cloths soiled or not stored in sanitizi...,12.0,2016-08-25,Cycle Inspection / Re-inspection,Violations were cited in the following area(s).
3,Non-food contact surface improperly constructe...,35.0,2017-05-19,Cycle Inspection / Initial Inspection,Violations were cited in the following area(s).
76623,Plumbing not properly installed or maintained;...,35.0,2017-05-19,Cycle Inspection / Initial Inspection,Violations were cited in the following area(s).
146463,"Food contact surface not properly washed, rins...",35.0,2017-05-19,Cycle Inspection / Initial Inspection,Violations were cited in the following area(s).
164194,Facility not vermin proof. Harborage or condit...,35.0,2017-05-19,Cycle Inspection / Initial Inspection,Violations were cited in the following area(s).


In [30]:
len(CycleInspections['date'][CycleInspections['DBA'] == 'SEATTLE CAFE'])

35
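Filtering on the name returns rows from more than one borough, which suggests that `DBA` alone does not identify an establishment and that `CAMIS` is the safer key. A sketch of counting distinct `CAMIS` values per name, on made-up rows:

```python
import pandas as pd

# The 9999xxxx identifiers are invented for this example.
toy = pd.DataFrame({
    'CAMIS': [50043431, 50043431, 99990001, 99990002],
    'DBA': ['SEATTLE CAFE', 'SEATTLE CAFE', 'SEATTLE CAFE', 'WAZA SUSHI'],
    'BORO': ['MANHATTAN', 'MANHATTAN', 'QUEENS', 'BROOKLYN'],
})

establishments_per_name = toy.groupby('DBA')['CAMIS'].nunique()
# Names shared by more than one establishment.
shared_names = establishments_per_name[establishments_per_name > 1]
print(shared_names)
```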

In [48]:
cy_some.sort_values('date').head()
# Aggregate by inspection date
# number of critical flags
# number of inspection notes?
# score is average(?)
# no change: boro, dba, street, zipcode, phone, cuisine description (by camis)
#            inspection date, action, grade, date, year, month, day
# only: camis, date (month, day, year), grade
# Aggregation of violation descriptions to grade?


Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,...,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,date,year,month,day
124825,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,07/26/2016,Violations were cited in the following area(s).,...,Critical,19.0,,,07/15/2017,Cycle Inspection / Initial Inspection,2016-07-26,2016,7,26
136948,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,07/26/2016,Violations were cited in the following area(s).,...,Critical,19.0,,,07/15/2017,Cycle Inspection / Initial Inspection,2016-07-26,2016,7,26
404115,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,07/26/2016,Violations were cited in the following area(s).,...,Not Critical,19.0,,,07/15/2017,Cycle Inspection / Initial Inspection,2016-07-26,2016,7,26
76844,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,08/25/2016,Violations were cited in the following area(s).,...,Critical,12.0,A,08/25/2016,07/15/2017,Cycle Inspection / Re-inspection,2016-08-25,2016,8,25
116190,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,08/25/2016,Violations were cited in the following area(s).,...,Not Critical,12.0,A,08/25/2016,07/15/2017,Cycle Inspection / Re-inspection,2016-08-25,2016,8,25
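The comments in the cell above describe collapsing the violation rows into one row per inspection. A minimal sketch of that aggregation, on toy rows modelled on the preview. Taking the first `SCORE` assumes the score is repeated on every row of an inspection, as it is in the preview, in which case it equals the average.

```python
import pandas as pd

toy = pd.DataFrame({
    'CAMIS': [50043431] * 5,
    'date': pd.to_datetime(['2016-07-26'] * 3 + ['2016-08-25'] * 2),
    'CRITICAL FLAG': ['Critical', 'Critical', 'Not Critical',
                      'Critical', 'Not Critical'],
    'SCORE': [19.0, 19.0, 19.0, 12.0, 12.0],
})
toy['is_critical'] = toy['CRITICAL FLAG'] == 'Critical'

per_inspection = (
    toy.groupby(['CAMIS', 'date'])
       .agg(n_violations=('CRITICAL FLAG', 'size'),   # rows per inspection
            n_critical=('is_critical', 'sum'),        # critical rows per inspection
            score=('SCORE', 'first'))                 # assumed constant within an inspection
       .reset_index()
)
print(per_inspection)
```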


In [85]:
# separate CAMIS identifiers (Tables: restaurant, inspection)
# generate columns for critical, not critical
# 'SCORE' is already at the inspection level
cy_some['violation_presence'] = 1
#Not here yet cy_some['critical'] = (cy_some['CRITICAL FLAG'] == "Critical") 
# sum
cy_some.head()


Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,...,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,date,year,month,day,critical,violation_presence
3,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,05/19/2017,Violations were cited in the following area(s).,...,,,07/15/2017,Cycle Inspection / Initial Inspection,2017-05-19,2017,5,19,False,1
76623,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,05/19/2017,Violations were cited in the following area(s).,...,,,07/15/2017,Cycle Inspection / Initial Inspection,2017-05-19,2017,5,19,False,1
76844,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,08/25/2016,Violations were cited in the following area(s).,...,A,08/25/2016,07/15/2017,Cycle Inspection / Re-inspection,2016-08-25,2016,8,25,True,1
80326,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,07/03/2017,Violations were cited in the following area(s).,...,A,07/03/2017,07/15/2017,Cycle Inspection / Re-inspection,2017-07-03,2017,7,3,False,1
83541,50043431,SEATTLE CAFE,MANHATTAN,1411,MADISON AVE,10029.0,2124230446,American,07/03/2017,Violations were cited in the following area(s).,...,A,07/03/2017,07/15/2017,Cycle Inspection / Re-inspection,2017-07-03,2017,7,3,False,1


In [86]:
# keep only the variables that are not restaurant-specific
# TEST
inspections_some = cy_some[['CAMIS','date','VIOLATION DESCRIPTION','violation_presence']].copy()
inspections_some.head()

Unnamed: 0,CAMIS,date,VIOLATION DESCRIPTION,violation_presence
3,50043431,2017-05-19,Non-food contact surface improperly constructe...,1
76623,50043431,2017-05-19,Plumbing not properly installed or maintained;...,1
76844,50043431,2016-08-25,"Food contact surface not properly washed, rins...",1
80326,50043431,2017-07-03,Facility not vermin proof. Harborage or condit...,1
83541,50043431,2017-07-03,Non-food contact surface improperly constructe...,1


In [87]:
df_multi = inspections_some.set_index(['date', 'CAMIS'])
df_multi.head()

In [89]:
df_multi.pivot_table(index = ['CAMIS','date'], columns='VIOLATION DESCRIPTION')

Columns: violation_presence by VIOLATION DESCRIPTION
CAMIS,date,Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.,Evidence of mice or live mice present in facility's food and/or non-food areas.,Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.,"Filth flies or food/refuse/sewage-associated (FRSA) flies present in facilitys food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.","Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.","Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.",Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.,"Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.","Sanitized equipment or utensil, including in-use food dispensing utensil, improperly used or stored.",Wiping cloths soiled or not stored in sanitizing solution.
50043431,2016-07-26,1.0,,,,,1.0,,,1.0,
50043431,2016-08-25,,,,,1.0,1.0,,,,1.0
50043431,2017-05-19,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,,
50043431,2017-07-03,,1.0,1.0,,,1.0,,,,


In [None]:
# More questions:
# what violations are more frequent? 
# improvement and worsening
# as signs of good or bad management
# what violations are more prevalent?
# one variable for each potential violation
# for each date
# aggregate on total of violations or absence (?)
# aggregate on total of critical and non-critical violations
# 
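One way to look at improvement and worsening is the change in `SCORE` between consecutive inspections of the same restaurant. The sketch below uses the three scores shown earlier for CAMIS 50043431 and assumes that a higher score means more violation points, so an increase is read as worsening.

```python
import pandas as pd

scores = pd.DataFrame({
    'CAMIS': [50043431, 50043431, 50043431],
    'date': pd.to_datetime(['2016-07-26', '2016-08-25', '2017-05-19']),
    'SCORE': [19.0, 12.0, 35.0],
}).sort_values(['CAMIS', 'date'])

# Difference with the previous inspection of the same restaurant.
scores['score_change'] = scores.groupby('CAMIS')['SCORE'].diff()

# Assumption: higher score = more violation points = worse.
scores['trend'] = scores['score_change'].map(
    lambda change: 'first' if pd.isna(change)
    else ('worse' if change > 0 else 'same or better')
)
print(scores)
```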

In [91]:
# questions to explore
# in a given inspection, are there many records for the same restaurant?
# are some violation descriptions frequently linked to the critical flag?
# are grades related to any of those variables?
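The first two questions can be approached with a group size and a cross-tabulation. A sketch on made-up rows; the code-to-flag pairing here only mirrors the few rows shown in the previews and is not a statement about the full data set.

```python
import pandas as pd

toy = pd.DataFrame({
    'CAMIS': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2016-07-26'] * 3 + ['2015-10-26'] * 2),
    'VIOLATION CODE': ['02G', '10F', '10H', '02G', '10F'],
    'CRITICAL FLAG': ['Critical', 'Not Critical', 'Not Critical',
                      'Critical', 'Not Critical'],
})

# How many records does each (restaurant, inspection date) pair have?
rows_per_inspection = toy.groupby(['CAMIS', 'date']).size()

# Share of each flag within every violation code (rows sum to 1).
flag_share = pd.crosstab(toy['VIOLATION CODE'], toy['CRITICAL FLAG'],
                         normalize='index')
print(rows_per_inspection)
print(flag_share)
```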
