# 04 - dataframes 

# today's exercise:

use what you have learned in your previous solutions to read in the new york rodent inspection data as a `pandas` dataframe, and then augment that dataframe with wait time (the interval between inspection and approval), weekday, and iso-week of inspection. also add a new column, `season`, holding the season of the inspection date (dec-feb, inclusive, is winter; mar-may is spring; jun-aug is summer; sep-nov is fall).

then answer the following questions: 
- between the dates of 2010-01-01 and 2010-03-31, what was the daily average number of inspections?
- between the dates of 2010-01-01 and 2010-03-31, which weekday had the greatest number of inspections?
- between the dates of 2010-01-01 and 2010-03-31, which weekday had the least number of inspections?

if you are confident that you can already solve this, the rest of the session is not much use to you. 

as a bonus, we should be able to answer these further questions:
- which weekday has the longest average wait time for approval in the winter? (let's define the seasons as: dec-feb is winter, mar-may is spring, jun-aug is summer, sep-nov is fall.)
- which weekday has the longest average wait time for approval in the summer?
- which season has the greatest number of inspections? 
- which season has the greatest number of distinct dates ...
    + a) in the data set
    + b) in the calendar? 
- which borough has the greatest difference in the number of inspections in the spring vs in the fall?
- count the number of inspections per [iso-week](https://en.wikipedia.org/wiki/ISO_week_date). find the week with the greatest number of inspections. for that week, and that week only, count the inspections by day-of-week.

In [1]:
import pandas as pd

In [2]:
rodent_df = pd.read_csv('/Users/katiea/git/python_workshops/python_exercises/exercises/04_dataframes/NY_rodent_inspections_sample.csv',
                       parse_dates=['INSPECTION_DATE', 'APPROVED_DATE'])

In [3]:
rodent_df.head()

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
0,BAIT,1,PO12965,3,1011470035,1,1147,35,104,WEST 76 STREET,10023,990505,223527,40.780204,-73.977414,Manhattan,2009-10-14 12:00:27,Bait applied,2009-10-14 15:01:46,"(40.7802039792471, -73.9774144709456)"
1,BAIT,2,PO12966,3,1011470034,1,1147,34,102,WEST 76 STREET,10023,990516,223521,40.780188,-73.977375,Manhattan,2009-10-14 12:51:21,Bait applied,2009-10-14 15:02:30,"(40.7801875030438, -73.977374757787)"
2,BAIT,30,PO16966,3,2043370027,2,4337,27,620,THWAITES PLACE,10467,1020110,252216,40.858877,-73.870364,Bronx,2009-11-09 12:59:55,Bait applied,2009-11-10 14:54:52,"(40.8588765781972, -73.8703636422023)"
3,BAIT,31,PO13665,3,2037670077,2,3767,77,1227,WHITEPLAINS ROAD,10472,1022441,242180,40.831321,-73.861994,Bronx,2009-11-09 11:10:16,Bait applied,2009-11-10 14:56:42,"(40.8313209626148, -73.861994089899)"
4,BAIT,38,PO11291,3,1011690057,1,1169,57,2199,BROADWAY,10024,989641,224567,40.783059,-73.980533,Manhattan,2009-11-10 08:40:42,Bait applied,2009-11-17 11:39:11,"(40.7830590725833, -73.9805333640688)"


In [4]:
# Add a column providing wait time (i.e. the number of hours between inspection and approved date)
rodent_df['wait_time'] = rodent_df.apply(lambda row: ((row['APPROVED_DATE'] - row['INSPECTION_DATE']).days * 24) + ((row['APPROVED_DATE'] - row['INSPECTION_DATE']).seconds / (60*60)), axis=1)
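
The row-wise `apply` works, but subtracting two datetime columns already gives a timedelta column, so the same number can be computed in one vectorised step. A minimal sketch on a toy frame (the two rows reuse timestamps from the `head()` output; the name `toy` is only for illustration):

```python
import pandas as pd

# Toy frame with the two timestamp columns from the inspection data.
toy = pd.DataFrame({
    "INSPECTION_DATE": pd.to_datetime(["2009-10-14 12:00:27", "2009-11-10 08:40:42"]),
    "APPROVED_DATE": pd.to_datetime(["2009-10-14 15:01:46", "2009-11-17 11:39:11"]),
})

# Column subtraction yields timedeltas; total_seconds() / 3600 converts them to hours.
delta = toy["APPROVED_DATE"] - toy["INSPECTION_DATE"]
toy["wait_time"] = delta.dt.total_seconds() / 3600

print(toy["wait_time"].round(6).tolist())
```

On the full dataframe this avoids calling a Python lambda once per row, which usually matters once the data grows.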

In [5]:
# Add a column providing weekday name 
rodent_df['weekday_name'] = rodent_df.apply(lambda row: row['INSPECTION_DATE'].strftime("%A"),axis=1)

In [6]:
# Add a column providing ISO week number
rodent_df['iso_week'] = rodent_df.apply(lambda row: row['INSPECTION_DATE'].isocalendar()[1],axis=1)
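
Weekday name and ISO week can also be read straight from the `.dt` accessor. This sketch assumes pandas 1.1 or later, where `Series.dt.isocalendar()` is available; the dates are taken from the `head()` output and the variable names are illustrative:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2009-10-14 12:00:27",
    "2009-11-09 12:59:55",
    "2009-11-10 08:40:42",
]))

# English weekday names, independent of the machine's locale.
weekday_name = dates.dt.day_name()

# isocalendar() returns a DataFrame with 'year', 'week' and 'day' columns.
iso = dates.dt.isocalendar()
iso_week = iso["week"].astype(int)

print(weekday_name.tolist())
print(iso_week.tolist())
```

Unlike `strftime("%A")`, `day_name()` does not depend on the locale settings of the machine running the notebook.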

In [7]:
# Add a column providing season, defined as: dec-feb, inclusive, is winter; mar-may is spring; jun-aug is summer;
# sep-nov is fall

def determine_season(inspection_date):
    if inspection_date.month in [12, 1, 2]:
        season = "winter"
    elif inspection_date.month in [3, 4, 5]:
        season = "spring"
    elif inspection_date.month in [6, 7, 8]:
        season = "summer"
    elif inspection_date.month in [9, 10, 11]:
        season = "fall"
    return season
    
rodent_df['season'] = rodent_df.apply(lambda row: determine_season(row['INSPECTION_DATE']),axis=1)
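
The same season rule can be written as a month-to-season lookup and applied to the whole column with `map`. A small sketch, where `MONTH_TO_SEASON` and the sample dates are made up for illustration:

```python
import pandas as pd

# Month number -> season, following the definition in the exercise text.
MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

dates = pd.Series(pd.to_datetime(["2009-12-31", "2010-03-01", "2010-08-31", "2009-11-10"]))

# dt.month gives the month number of every row; map() looks each one up in the dict.
season = dates.dt.month.map(MONTH_TO_SEASON)

print(season.tolist())  # ['winter', 'spring', 'summer', 'fall']
```

A missing date simply becomes a missing season here, whereas an if/elif chain with no final `else` would raise an error for it.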

In [8]:
rodent_df.head()

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,...,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION,wait_time,weekday_name,iso_week,season
0,BAIT,1,PO12965,3,1011470035,1,1147,35,104,WEST 76 STREET,...,-73.977414,Manhattan,2009-10-14 12:00:27,Bait applied,2009-10-14 15:01:46,"(40.7802039792471, -73.9774144709456)",3.021944,Wednesday,42,fall
1,BAIT,2,PO12966,3,1011470034,1,1147,34,102,WEST 76 STREET,...,-73.977375,Manhattan,2009-10-14 12:51:21,Bait applied,2009-10-14 15:02:30,"(40.7801875030438, -73.977374757787)",2.185833,Wednesday,42,fall
2,BAIT,30,PO16966,3,2043370027,2,4337,27,620,THWAITES PLACE,...,-73.870364,Bronx,2009-11-09 12:59:55,Bait applied,2009-11-10 14:54:52,"(40.8588765781972, -73.8703636422023)",25.915833,Monday,46,fall
3,BAIT,31,PO13665,3,2037670077,2,3767,77,1227,WHITEPLAINS ROAD,...,-73.861994,Bronx,2009-11-09 11:10:16,Bait applied,2009-11-10 14:56:42,"(40.8313209626148, -73.861994089899)",27.773889,Monday,46,fall
4,BAIT,38,PO11291,3,1011690057,1,1169,57,2199,BROADWAY,...,-73.980533,Manhattan,2009-11-10 08:40:42,Bait applied,2009-11-17 11:39:11,"(40.7830590725833, -73.9805333640688)",170.974722,Tuesday,46,fall


In [9]:
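# Subset the dataframe between the dates of 2010-01-01 and 2010-03-31 in order to be able to answer the following 
# three questions.
# Note: the string "2010-03-31" is compared as 2010-03-31 00:00:00, so inspections timestamped later on 
# 31 March fall outside this filter; compare with < "2010-04-01" to include the whole day.

criteria = (rodent_df.INSPECTION_DATE >= "2010-01-01") & (rodent_df.INSPECTION_DATE <= "2010-03-31")

rodent_df_subset = rodent_df.loc[criteria]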
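
When a column holds full timestamps, an end date is interpreted as midnight at the start of that day. A short sketch with made-up timestamps shows how a closed comparison against the end date differs from a half-open interval that stops before the following day:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2010-01-01 00:00:00",
    "2010-03-31 00:00:00",
    "2010-03-31 14:30:00",  # afternoon of the last day
    "2010-04-01 00:00:00",
]))

start = pd.Timestamp("2010-01-01")

# Closed on the end date: the cut-off is 2010-03-31 00:00:00.
closed_on_end_date = (stamps >= start) & (stamps <= pd.Timestamp("2010-03-31"))

# Half-open: everything strictly before 2010-04-01, so all of 31 March is kept.
half_open = (stamps >= start) & (stamps < pd.Timestamp("2010-04-01"))

print(closed_on_end_date.tolist())  # [True, True, False, False]
print(half_open.tolist())           # [True, True, True, False]
```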

In [10]:
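# between the dates of 2010-01-01 and 2010-03-31, what was the daily average number of inspections?
# Note: INSPECTION_DATE holds full timestamps, so value_counts() here counts inspections per distinct 
# timestamp rather than per calendar day; rodent_df_subset['INSPECTION_DATE'].dt.date.value_counts() 
# would give per-day counts.

daily_inspections = rodent_df_subset['INSPECTION_DATE'].value_counts()
print(f"The daily average number of inspections = {daily_inspections.mean()}")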

The daily average number of inspections = 1.0109824873849806
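
An average this close to 1 suggests the counts are per timestamp, not per day. A sketch of per-day counting on made-up timestamps follows. It also shows that "daily average" can mean either the average over days that had inspections or the average over every calendar day in the range; which one the exercise intends is an assumption left open here.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2010-01-04 09:00:00", "2010-01-04 10:30:00", "2010-01-04 15:45:00",
    "2010-01-06 11:00:00",
]))

# Counting raw timestamps: each one is unique, so every count is 1.
per_timestamp = stamps.value_counts()

# normalize() resets the time to midnight, so rows group by calendar day.
per_day = stamps.dt.normalize().value_counts()
mean_over_active_days = per_day.mean()  # 4 inspections / 2 days with inspections

# Reindex over the full calendar so days without inspections count as 0.
all_days = pd.date_range("2010-01-04", "2010-01-06", freq="D")
mean_over_calendar_days = per_day.reindex(all_days, fill_value=0).mean()  # 4 / 3

print(per_timestamp.mean(), mean_over_active_days, mean_over_calendar_days)
```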


In [11]:
# between the dates of 2010-01-01 and 2010-03-31, which weekday had the greatest number of inspections?

inspections_per_weekday = rodent_df_subset['weekday_name'].value_counts()
print(inspections_per_weekday)
print(f"The weekday with the greatest number of inspections = {inspections_per_weekday.idxmax()}")

Tuesday      795
Thursday     739
Friday       671
Monday       658
Wednesday    542
Sunday         1
Name: weekday_name, dtype: int64
The weekday with the greatest number of inspections = Tuesday


In [12]:
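# between the dates of 2010-01-01 and 2010-03-31, which weekday had the least number of inspections?
# Note: value_counts() only lists weekdays that occur in the subset, so a weekday with no inspections 
# at all (Saturday here) is not considered by idxmin().

print(f"The weekday with the least number of inspections = {inspections_per_weekday.idxmin()}")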

The weekday with the least number of inspections = Sunday
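
Saturday never appears in the weekday counts above, which means it had zero inspections in this subset. One way to let such days take part in the comparison is to reindex the counts over all seven weekdays. The counts below are copied from that output; `WEEKDAYS` is a helper list introduced for this sketch:

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Counts as printed by value_counts(); Saturday is absent.
counts = pd.Series({
    "Tuesday": 795, "Thursday": 739, "Friday": 671,
    "Monday": 658, "Wednesday": 542, "Sunday": 1,
})

# Reindexing adds the missing weekday with a count of 0 and puts the days in calendar order.
full = counts.reindex(WEEKDAYS, fill_value=0)

print(full)
print(full.idxmax(), full.idxmin())
```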


In [13]:
# create winter data subset to answer the below question

winter_criteria = (rodent_df.season == "winter")
rodent_df_winter = rodent_df.loc[winter_criteria]

# which weekday has the longest average wait time for approval in the winter?

winter_average_wait_time = rodent_df_winter.groupby('weekday_name')['wait_time'].mean()
print(winter_average_wait_time)
print(f"The weekday with the longest average wait time for approval in winter = {winter_average_wait_time.idxmax()}")

weekday_name
Friday       102.970981
Monday        38.333509
Sunday        98.312500
Thursday      58.853018
Tuesday       42.990648
Wednesday     59.829062
Name: wait_time, dtype: float64
The weekday with the longest average wait time for approval in winter = Friday


In [14]:
# create summer data subset to answer the below question

summer_criteria = (rodent_df.season == "summer")
rodent_df_summer = rodent_df.loc[summer_criteria]

# which weekday has the longest average wait time for approval in the summer?

summer_average_wait_time = rodent_df_summer.groupby('weekday_name')['wait_time'].mean()
print(summer_average_wait_time)
print(f"The weekday with the longest average wait time for approval in summer = {summer_average_wait_time.idxmax()}")

weekday_name
Friday       97.479622
Monday       24.875926
Saturday     85.458623
Sunday       37.510391
Thursday     69.628545
Tuesday      46.619051
Wednesday    40.166068
Name: wait_time, dtype: float64
The weekday with the longest average wait time for approval in summer = Friday
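
The winter and summer cells repeat the same steps on two subsets. As an alternative sketch, a single `groupby` on both `season` and `weekday_name` answers the question for every season at once. The wait times below are invented for illustration and are not taken from the data:

```python
import pandas as pd

toy = pd.DataFrame({
    "season": ["winter", "winter", "winter", "summer", "summer", "summer"],
    "weekday_name": ["Friday", "Friday", "Monday", "Friday", "Saturday", "Saturday"],
    "wait_time": [100.0, 120.0, 40.0, 90.0, 95.0, 105.0],
})

# Mean wait per (season, weekday) pair.
mean_wait = toy.groupby(["season", "weekday_name"])["wait_time"].mean()

# One column per season, one row per weekday; idxmax() then picks the slowest weekday per season.
slowest = mean_wait.unstack("season").idxmax()

print(slowest)
```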


In [15]:
# which season has the greatest number of inspections?

seasonal_inspections = rodent_df['season'].value_counts()
print(seasonal_inspections)
print(f"The season with the greatest number of inspections = {seasonal_inspections.idxmax()}")

spring    5536
summer    2500
winter    1915
fall        48
Name: season, dtype: int64
The season with the greatest number of inspections = spring


In [16]:
# which season has the greatest number of distinct dates ...

# a) in the data set?

# create new column with inspection date separated from time 

rodent_df['inspection_date_only'] = rodent_df.apply(lambda row: row['INSPECTION_DATE'].date(),axis=1)

number_dates_per_season = rodent_df.groupby('season')['inspection_date_only'].nunique()
print(number_dates_per_season)
print(f"The season with the greatest number of distinct dates = {number_dates_per_season.idxmax()}")

season
fall      12
spring    65
summer    39
winter    57
Name: inspection_date_only, dtype: int64
The season with the greatest number of distinct dates = spring


In [17]:
# b) in the calendar?

# I'm assuming that this means the number of distinct calendar dates between the minimum and maximum dates in the 
# dataset? If so, this is my answer...

calendar_df = pd.date_range(start=rodent_df['inspection_date_only'].min(), end=rodent_df['inspection_date_only'].max())
calendar_df = calendar_df.to_frame(index=False)
calendar_df.columns = ['calendar_date']
calendar_df['season'] = calendar_df.apply(lambda row: determine_season(row['calendar_date']),axis=1)
calendar_dates_per_season = calendar_df.groupby('season')['calendar_date'].nunique()
print(calendar_dates_per_season)
print(f"The seasons with the greatest number of distinct dates = {calendar_dates_per_season.index[calendar_dates_per_season == calendar_dates_per_season.max()].tolist()}")


season
fall      239
spring    276
summer    276
winter    224
Name: calendar_date, dtype: int64
The seasons with the greatest number of distinct dates = ['spring', 'summer']
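
For comparison, here is a sketch of the same count over one complete calendar year, built with `pd.date_range` and a month lookup. The year 2010 and the `MONTH_TO_SEASON` dict are choices made for this illustration. With whole years the counts depend only on month lengths, so spring and summer tie at 92 days:

```python
import pandas as pd

MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

# Every calendar day of a non-leap year.
days = pd.date_range("2010-01-01", "2010-12-31", freq="D")
seasons = pd.Series(days.month, index=days).map(MONTH_TO_SEASON)

days_per_season = seasons.value_counts()

# Keep every season that reaches the maximum, so ties are reported rather than hidden.
top = days_per_season[days_per_season == days_per_season.max()].index.tolist()

print(days_per_season)
print(sorted(top))  # ['spring', 'summer']
```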


In [18]:
# which borough has the greatest difference in the number of inspections in the spring vs in the fall?

# subset the dataframe to only include spring or fall 
spring_criteria = (rodent_df.season == "spring") 
fall_criteria = (rodent_df.season == "fall")
rodent_df_spring = rodent_df.loc[spring_criteria]
rodent_df_fall = rodent_df.loc[fall_criteria]

spring_inspections_per_borough = rodent_df_spring.groupby('BOROUGH')['inspection_date_only'].count().to_frame()
spring_inspections_per_borough.columns = ['number_of_spring_inspections']
fall_inspections_per_borough = rodent_df_fall.groupby('BOROUGH')['inspection_date_only'].count().to_frame()
fall_inspections_per_borough.columns = ['number_of_fall_inspections']

# merge the dataframes on index
spring_and_fall_inspections_df = spring_inspections_per_borough.join(fall_inspections_per_borough, how='outer')
spring_and_fall_inspections_df = spring_and_fall_inspections_df.fillna(0)

# calculate the absolute difference, so that a borough with more fall than spring inspections is also considered
spring_and_fall_inspections_df['difference_between_spring_and_fall'] = (spring_and_fall_inspections_df['number_of_spring_inspections'] - spring_and_fall_inspections_df['number_of_fall_inspections']).abs()
print(f"The borough that has the greatest difference in the number of inspections in the spring vs. in the fall = {spring_and_fall_inspections_df['difference_between_spring_and_fall'].idxmax()}")

The borough that has the greatest difference in the number of inspections in the spring vs. in the fall = Bronx
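
A more compact route to the same table is `pd.crosstab`, which counts rows per borough and season and fills missing combinations with 0. The toy rows below are invented. They are chosen so that one borough has more fall than spring inspections, which is the case where a signed difference and an absolute difference disagree:

```python
import pandas as pd

toy = pd.DataFrame({
    "BOROUGH": ["Bronx", "Bronx", "Bronx", "Manhattan", "Manhattan",
                "Queens", "Queens", "Queens", "Queens"],
    "season": ["spring", "spring", "spring", "spring", "fall",
               "fall", "fall", "fall", "fall"],
})

# Rows: borough, columns: season, values: number of inspections.
counts = pd.crosstab(toy["BOROUGH"], toy["season"])

signed = counts["spring"] - counts["fall"]
gap = signed.abs()

print(counts)
print(signed.idxmax(), gap.idxmax())
```

Here the signed difference points at Bronx (3 more in spring), while the absolute gap points at Queens (4 more in fall).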


In [19]:
# count the number of inspections per iso-week
# Note: the week number alone pools the same iso-week from different years

iso_week_inspections = rodent_df['iso_week'].value_counts()
iso_week_inspections

23    601
12    587
17    541
15    535
16    501
14    497
24    471
25    466
9     444
18    433
13    404
19    392
8     365
20    345
22    344
21    339
10    313
5     264
2     212
1     209
27    207
28    207
7     206
11    205
4     203
26    202
3     160
51    102
6      93
50     41
53     33
47     20
49     17
52     17
48     13
46      5
42      2
30      2
43      1
Name: iso_week, dtype: int64

In [20]:
# find the week with the greatest number of inspections

print(f"The iso week with the greatest number of inspections = {iso_week_inspections.idxmax()}")

The iso week with the greatest number of inspections = 23
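
If the data cover more than one year, week 23 of one year and week 23 of the next are counted together when grouping by week number alone. A hedged sketch of grouping by ISO year and week instead, on made-up dates and assuming pandas 1.1 or later for `dt.isocalendar()`:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2010-06-07", "2010-06-08",                 # ISO week 23 of 2010
    "2011-06-06", "2011-06-07", "2011-06-08",   # ISO week 23 of 2011
]))

iso = dates.dt.isocalendar()  # columns: year, week, day

# Week number only: both years collapse into a single bucket.
by_week_only = iso.groupby("week").size()

# (ISO year, ISO week): each calendar week is counted separately.
by_year_week = iso.groupby(["year", "week"]).size()
busiest_year, busiest_week = (int(v) for v in by_year_week.idxmax())

print(by_week_only)
print(by_year_week)
print(busiest_year, busiest_week)
```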


In [21]:
# for that week, and that week only, count the inspections by day-of-week

# subset the dataframe to the iso week with the highest number of inspections
iso_week_criteria = (rodent_df.iso_week == iso_week_inspections.idxmax()) 
iso_week_subset = rodent_df.loc[iso_week_criteria]

# show the number of inspections per day of the week
inspections_per_day_of_week = iso_week_subset.groupby('weekday_name')['JOB_TICKET_OR_WORK_ORDER_ID'].count()
inspections_per_day_of_week

weekday_name
Friday        55
Monday        82
Saturday     117
Sunday       103
Thursday      65
Tuesday      104
Wednesday     75
Name: JOB_TICKET_OR_WORK_ORDER_ID, dtype: int64