# Data Preparation and Exploration for Cycle Inspections

Prepare data from the [restaurant inspections](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59) data set. Note: the data was exported as CSV.

Compute some descriptive statistics for the data set.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Load the cycle inspection records
CycleInspections_out = pd.read_csv('CycleInspections_out.csv', sep=',', engine='python')
len(CycleInspections_out)

326871

In [3]:
CycleInspections_out.head()

Unnamed: 0.1,Unnamed: 0,CAMIS,date,BORO,DBA,INSPECTION TYPE,SCORE,GRADE DATE,RECORD DATE,year,month,day,VIOLATION CODE,VIOLATION DESCRIPTION
0,1,41312955,2014-04-01,QUEENS,RIPE JUICE BAR & GRILL,Cycle Inspection / Re-inspection,12.0,04/01/2014,07/15/2017,2014,4,1,02G,Cold food item held above 41º F (smoked fis...
1,2,41601691,2015-10-26,BROOKLYN,WAZA SUSHI,Cycle Inspection / Initial Inspection,25.0,,07/15/2017,2015,10,26,10H,Proper sanitization not provided for utensil w...
2,3,50043431,2017-05-19,MANHATTAN,SEATTLE CAFE,Cycle Inspection / Initial Inspection,35.0,,07/15/2017,2017,5,19,10F,Non-food contact surface improperly constructe...
3,4,50001580,2015-12-01,STATEN ISLAND,CIRO PIZZA CAFE,Cycle Inspection / Re-inspection,8.0,12/01/2015,07/15/2017,2015,12,1,02G,Cold food item held above 41º F (smoked fis...
4,6,41722020,2017-04-12,BRONX,2 BROS PIZZA,Cycle Inspection / Initial Inspection,22.0,,07/15/2017,2017,4,12,04A,Food Protection Certificate not held by superv...


Add a violation group column (the two-digit prefix of the violation code) next to the violation description.

In [4]:
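# Raw string avoids an invalid-escape warning; expand=False returns a Series for the single capture group
CycleInspections_out['violation groups'] = CycleInspections_out['VIOLATION CODE'].str.extract(r'(\d\d)', expand=False)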
CycleInspections_out.head()

Unnamed: 0.1,Unnamed: 0,CAMIS,date,BORO,DBA,INSPECTION TYPE,SCORE,GRADE DATE,RECORD DATE,year,month,day,VIOLATION CODE,VIOLATION DESCRIPTION,violation groups
0,1,41312955,2014-04-01,QUEENS,RIPE JUICE BAR & GRILL,Cycle Inspection / Re-inspection,12.0,04/01/2014,07/15/2017,2014,4,1,02G,Cold food item held above 41º F (smoked fis...,2
1,2,41601691,2015-10-26,BROOKLYN,WAZA SUSHI,Cycle Inspection / Initial Inspection,25.0,,07/15/2017,2015,10,26,10H,Proper sanitization not provided for utensil w...,10
2,3,50043431,2017-05-19,MANHATTAN,SEATTLE CAFE,Cycle Inspection / Initial Inspection,35.0,,07/15/2017,2017,5,19,10F,Non-food contact surface improperly constructe...,10
3,4,50001580,2015-12-01,STATEN ISLAND,CIRO PIZZA CAFE,Cycle Inspection / Re-inspection,8.0,12/01/2015,07/15/2017,2015,12,1,02G,Cold food item held above 41º F (smoked fis...,2
4,6,41722020,2017-04-12,BRONX,2 BROS PIZZA,Cycle Inspection / Initial Inspection,22.0,,07/15/2017,2017,4,12,04A,Food Protection Certificate not held by superv...,4
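The grouping step above can be tried on a few codes in isolation. This is a minimal sketch on made-up input in the same format as the `VIOLATION CODE` column; the `codes` and `groups` names are illustrative and not part of the notebook. Note that the extracted groups are strings that keep the leading zero (for example `'02'`).

```python
import pandas as pd

# Toy codes in the same format as the VIOLATION CODE column (illustrative only).
codes = pd.Series(['02G', '10H', '10F', '04A', None], name='VIOLATION CODE')

# The first two digits identify the violation group.
# expand=False returns a Series because there is a single capture group.
groups = codes.str.extract(r'(\d\d)', expand=False)

# A missing code yields a missing group instead of raising an error.
print(groups.tolist())
```

If integer groups are preferred, the result could be converted afterwards, at the cost of losing the leading zero.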


In [6]:
CycleInspections_out['INSPECTION TYPE'].value_counts()
# How many initial inspections do not have a re-inspection?
# Hypothesis: only those whose initial inspection grade is A

Cycle Inspection / Initial Inspection    225319
Cycle Inspection / Re-inspection         101552
Name: INSPECTION TYPE, dtype: int64
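One possible way to approach the question of initial inspections without a re-inspection is sketched below on toy data. It assumes (this is not stated in the notebook) that a re-inspection, when it exists, is the next visit recorded for the same `CAMIS`. The toy values and the names `visits`, `next_type` and `n_without` are hypothetical.

```python
import pandas as pd

INITIAL = 'Cycle Inspection / Initial Inspection'
REINSPECTION = 'Cycle Inspection / Re-inspection'

# Toy data with one row per violation, like CycleInspections_out (values are made up).
toy = pd.DataFrame({
    'CAMIS': [1, 1, 1, 2, 2, 3],
    'date': ['2015-01-05', '2015-01-05', '2015-02-01',
             '2015-03-10', '2016-03-15', '2016-06-01'],
    'INSPECTION TYPE': [INITIAL, INITIAL, REINSPECTION,
                        INITIAL, INITIAL, INITIAL],
})

# Collapse to one row per visit, then order visits within each restaurant.
# ISO-formatted date strings sort chronologically.
visits = (toy[['CAMIS', 'date', 'INSPECTION TYPE']]
          .drop_duplicates()
          .sort_values(['CAMIS', 'date'])
          .reset_index(drop=True))

# Type of the next visit for the same restaurant ('' if there is none).
next_type = visits.groupby('CAMIS')['INSPECTION TYPE'].shift(-1).fillna('')

# An initial inspection counts as "without re-inspection" when the next
# visit for that restaurant is not a re-inspection (assumption, see above).
is_initial = visits['INSPECTION TYPE'] == INITIAL
no_reinspection = is_initial & (next_type != REINSPECTION)
n_without = int(no_reinspection.sum())
print(n_without)
```

Checking the hypothesis about grade A would additionally require a grade column, which is not among the columns shown above.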

For each year and inspection type, how many violations were recorded in each violation group?

In [10]:
# Each row is one violation; a constant 1 lets us count rows by summing
CycleInspections_out['presence'] = 1

In [13]:
viol_present_table = CycleInspections_out.groupby(['violation groups','INSPECTION TYPE','year'], as_index=False)['presence'].sum()

In [14]:
viol_present_table.head()

Unnamed: 0,violation groups,INSPECTION TYPE,year,presence
0,2,Cycle Inspection / Initial Inspection,2014,8450
1,2,Cycle Inspection / Initial Inspection,2015,10252
2,2,Cycle Inspection / Initial Inspection,2016,9403
3,2,Cycle Inspection / Initial Inspection,2017,4877
4,2,Cycle Inspection / Re-inspection,2014,3900
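The same counts can be obtained without a helper column. The sketch below, on invented toy rows, compares the sum over a constant `presence` column with `groupby(...).size()`; the shortened inspection type labels are placeholders.

```python
import pandas as pd

# Toy rows, one per violation (values are invented for illustration).
toy = pd.DataFrame({
    'violation groups': ['02', '02', '02', '10', '10'],
    'INSPECTION TYPE': ['Initial', 'Initial', 'Re-inspection', 'Initial', 'Initial'],
    'year': [2014, 2014, 2014, 2014, 2015],
})
keys = ['violation groups', 'INSPECTION TYPE', 'year']

# Approach used in the notebook: sum a constant column per group.
toy['presence'] = 1
by_sum = toy.groupby(keys, as_index=False)['presence'].sum()

# Equivalent alternative: count the rows in each group directly.
by_size = toy.groupby(keys).size().reset_index(name='presence')

print(by_sum.equals(by_size))
```

Both tables list only the combinations that actually occur in the data.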


In [None]:
# for each year
# plot side by side the inspection type
# violation group bars
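The plan in the comments above could be sketched as follows: reshape one year of the count table so that violation groups are rows and inspection types are columns, then draw grouped bars. The toy counts, the shortened type labels and the helper names `counts_for_year` and `plot_year` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy table shaped like viol_present_table (counts are invented).
toy = pd.DataFrame({
    'violation groups': ['02', '02', '04', '04'],
    'INSPECTION TYPE': ['Initial', 'Re-inspection', 'Initial', 'Re-inspection'],
    'year': [2014, 2014, 2014, 2014],
    'presence': [80, 40, 120, 60],
})

def counts_for_year(table, year):
    """Return rows = violation groups, columns = inspection types, for one year."""
    one_year = table[table['year'] == year]
    return one_year.pivot(index='violation groups',
                          columns='INSPECTION TYPE',
                          values='presence')

def plot_year(table, year):
    """Grouped bars: one cluster per violation group, one bar per inspection type.

    Requires matplotlib, which pandas uses for DataFrame.plot.
    """
    ax = counts_for_year(table, year).plot(kind='bar',
                                           title=f'Violations by group, {year}')
    ax.set_ylabel('number of violations')
    return ax

wide_2014 = counts_for_year(toy, 2014)
print(wide_2014)
```

Calling `plot_year(viol_present_table, 2014)` in the notebook would then draw the side-by-side bars for that year, assuming the table has the columns shown above.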

In [18]:
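# Combine element-wise conditions with & and parentheses; Python's `and` raises
# "ValueError: The truth value of a Series is ambiguous" on a Series.
# The inspection type must be compared against 'INSPECTION TYPE', not 'year'.
viol_present_table[(viol_present_table['year'] == 2014) & (viol_present_table['INSPECTION TYPE'] == 'Cycle Inspection / Initial Inspection')]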

In [2]:
#same_inspection_cycle.to_csv('same_inspection_cycle.csv')
same_inspection_cycle = pd.read_csv('same_inspection_cycle.csv', sep=',',engine='python')
len(same_inspection_cycle)

30138

In [3]:
same_inspection_cycle.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,CAMIS,date_x,BORO_x,year,02_x,03_x,04_x,05_x,...,05_y,06_y,07_y,08_y,09_y,10_y,22_y,date_init,date_reins,Difference
0,1,3,30112340,2016-10-03,BROOKLYN,2016,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2016-10-03,2016-10-27,24 days 00:00:00.000000000
1,2,4,30191841,2015-08-31,MANHATTAN,2015,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,2015-08-31,2015-09-21,21 days 00:00:00.000000000
2,3,5,40356151,2014-04-11,QUEENS,2014,1.0,0.0,2.0,1.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,2014-04-11,2014-05-02,21 days 00:00:00.000000000
3,5,8,40356151,2014-10-03,QUEENS,2014,0.0,0.0,2.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,2014-10-03,2014-11-15,43 days 00:00:00.000000000
4,6,9,40356151,2015-04-24,QUEENS,2015,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2015-04-24,2015-05-29,35 days 00:00:00.000000000
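After a round trip through CSV, `Difference` is read back as text such as `24 days 00:00:00.000000000`. A minimal sketch of turning it into a number of days, and of cross-checking it against the two date columns, is shown below. The two rows are copied from the output above; the `gap_days` and `recomputed` names are illustrative.

```python
import pandas as pd

# Two rows in the same string format as the columns shown above.
toy = pd.DataFrame({
    'date_init': ['2016-10-03', '2015-08-31'],
    'date_reins': ['2016-10-27', '2015-09-21'],
    'Difference': ['24 days 00:00:00.000000000', '21 days 00:00:00.000000000'],
})

# Parse the stored gap into a timedelta and keep the whole days.
toy['gap_days'] = pd.to_timedelta(toy['Difference']).dt.days

# Cross-check by recomputing the gap from the two dates.
recomputed = (pd.to_datetime(toy['date_reins'])
              - pd.to_datetime(toy['date_init'])).dt.days

print(toy['gap_days'].tolist(), recomputed.tolist())
```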


For each year, how many violations are reported for each violation group?
Which violations are included in each violation group?

In [7]:
same_inspection_cycle.groupby('year').sum() 
#[same_inspection_cycle['year']==2014]

year,Unnamed: 0,Unnamed: 0.1,CAMIS,02_x,03_x,04_x,05_x,06_x,07_x,08_x,...,02_y,03_y,04_y,05_y,06_y,07_y,08_y,09_y,10_y,22_y
2014,143211613,204965079,342200325963,5612.0,78.0,8673.0,430.0,5802.0,8.0,4374.0,...,3161.0,33.0,4998.0,143.0,3515.0,2.0,2755.0,487.0,5515.0,2.0
2015,218848963,312872720,461521748014,6909.0,84.0,10785.0,589.0,7533.0,5.0,5371.0,...,3808.0,35.0,5915.0,158.0,5420.0,1.0,3303.0,660.0,7478.0,46.0
2016,190541431,271782581,357904957341,5083.0,77.0,8124.0,475.0,6145.0,2.0,4311.0,...,2708.0,26.0,4803.0,135.0,4612.0,2.0,2784.0,568.0,5371.0,29.0
2017,85590694,121841252,151392973417,2061.0,43.0,3229.0,224.0,2659.0,0.0,1887.0,...,1172.0,23.0,1581.0,77.0,1901.0,1.0,1040.0,247.0,2416.0,16.0
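Summing every column per year also adds up identifiers such as `CAMIS` and the leftover `Unnamed` index columns, which is not meaningful. A possible refinement, sketched on toy rows, restricts the sum to the violation group columns. The column filter and the names `viol_cols` and `per_year` are illustrative; the reading of `_x` as initial inspection and `_y` as re-inspection is an assumption based on the usual merge suffixes and is not confirmed in the notebook.

```python
import pandas as pd

# Toy rows with a few columns like those in same_inspection_cycle (values invented).
toy = pd.DataFrame({
    'CAMIS': [30112340, 30191841, 40356151],
    'year': [2016, 2015, 2015],
    '02_x': [0.0, 0.0, 1.0],
    '04_x': [1.0, 1.0, 2.0],
    '02_y': [0.0, 0.0, 1.0],
    'date_init': ['2016-10-03', '2015-08-31', '2015-04-24'],
})

# Keep only columns named like "02_x" or "10_y": two digits plus a merge suffix.
viol_cols = [c for c in toy.columns
             if c[:2].isdigit() and c.endswith(('_x', '_y'))]

# Sum the violation counts per year, leaving out identifiers and dates.
per_year = toy.groupby('year')[viol_cols].sum()
print(per_year)
```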
