# introducing dataframes

a dataframe is best described as a table with particular properties:

- there is one row per observation (what defines an observation may vary case by case).
- there is one column per measure.
- all values of each column (i.e. all rows) must be of the same data type.
- the columns may be (and most often are) of different data types.

together these properties define "tidy data". tidy data is the preferred format to work with when it comes to modeling and analysis. many useful modules and functions assume the input to be in tidy format and also shape their output to be tidy. this offers the ability to string these functions together to make complex manipulations. again, the `pandas` module makes working with dataframes easier so we will be using that.
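as a small illustration of these properties, here is a sketch of a tidy table built by hand. the column names and values are made up for the example and are not taken from the rodent data.

```python
import pandas as pd

# a made-up tidy table: one row per observation, one column per measure
tidy_df = pd.DataFrame({
    'borough':   ['Bronx', 'Queens', 'Bronx'],
    'job_count': [3, 1, 2],
    'latitude':  [40.82, 40.71, 40.85],
})

# every column holds a single data type, but the columns differ from each other
print(tidy_df.dtypes)
```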

# manipulating dataframes

today we will see how to use `pandas` to
- trim,
- filter,
- summarise, and
- augment

dataframes. (later we will look into joining dataframes, but we won't have time for it now.)


# today's exercise:
use what you have learned in your previous solutions to read in the new york rodent inspection data as a `pandas` dataframe, and then augment that dataframe with wait time (interval between inspection and approval), weekday,  and iso-week of inspection. also add a new column, `season` of the inspection date (defined as: dec-feb, inclusive, is winter, mar-may is spring, jun-aug is summer, sep-nov is fall). 

then answer the following questions: 
- between the dates of 2010-01-01 and 2010-03-31, what was the daily average number of inspections?
- between the dates of 2010-01-01 and 2010-03-31, which weekday had the greatest number of inspections?
- between the dates of 2010-01-01 and 2010-03-31, which weekday had the least number of inspections?

if you are confident that you can already solve this, the rest of the session is not much use to you. 

as a bonus we should be able to answer these further questions
- which weekday has the longest average wait time for approval in the winter? (seasons as defined above: dec-feb is winter, mar-may is spring, jun-aug is summer, sep-nov is fall.)
- which weekday has the longest average wait time for approval in the summer?
- which season has the greatest number of inspections? 
- which season has the greatest number of distinct dates ...
    + a) in the data set
    + b) in the calendar? 
- which borough has the greatest difference in the number of inspections in the spring vs in the fall?
- count the number of inspections per [iso-week](https://en.wikipedia.org/wiki/ISO_week_date). find the week with the greatest number of inspections. for that week, and that week only, count the inspections by day-of-week.

In [1]:
# let us start by importing the modules we need.
import pandas as pd
import datetime

In [2]:
#filename_csv = 'NY_rodent_inspections_sample_small.csv' # for initial testing
#filename_csv = 'NY_rodent_inspections_sample_medium.csv' # for further consideration
filename_csv = 'NY_rodent_inspections_sample.csv' # for full on run

when you read in a data file with `read_csv()`, `pandas` will automatically convert it into a dataframe for you.

In [3]:
rodent_df = pd.read_csv(filename_csv) 
# inspect the data frame
rodent_df.shape

(9999, 20)
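if you do not have the data file at hand, the same behaviour can be tried on a few lines of text. this is only a sketch: the three columns are a made-up subset, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# a tiny csv held in memory instead of a file on disk
csv_text = """BOROUGH,ZIP_CODE,LATITUDE
Bronx,10459,40.820742
Queens,11432,40.707774
"""

sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # the data type pandas inferred for each column
```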

In [4]:
rodent_df.tail(10)

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
9989,BAIT,23815,PO124451,4,2026970019,2,2697,19,939,EAST 163 STREET,10459,1012691,238311,40.820742,-73.897219,Bronx,07/15/2010 10:39:47 AM,Bait applied,07/16/2010 09:55:15 AM,"(40.8207420244034, -73.8972187135464)"
9990,BAIT,23816,PO124452,4,2026970022,2,2697,22,915,DAWSON STREET,10459,1012673,238233,40.820474,-73.897292,Bronx,07/15/2010 10:15:00 AM,Bait applied,07/16/2010 09:54:43 AM,"(40.8204738315909, -73.8972922812013)"
9991,BAIT,23817,PO124205,4,2026990060,2,2699,60,937,WESTCHESTER AVENUE,10459,1012718,238880,40.822281,-73.897191,Bronx,07/15/2010 10:47:50 AM,Bait applied,07/16/2010 09:54:09 AM,"(40.8222812434217, -73.8971906702658)"
9992,BAIT,23820,PO123787,4,2027050021,2,2705,21,1070,INTERVALE AVENUE,10459,1012861,239976,40.825313,-73.896491,Bronx,07/15/2010 09:36:11 AM,Bait applied,07/16/2010 09:52:13 AM,"(40.82531300756, -73.8964910829487)"
9993,BAIT,23823,PO123783,4,2027050025,2,2705,25,1078,INTERVALE AVENUE,10459,1012865,240063,40.825547,-73.896489,Bronx,07/15/2010 09:53:54 AM,Bait applied,07/16/2010 09:50:35 AM,"(40.8255470401929, -73.8964886688692)"
9994,BAIT,23826,PO107033,5,4097640117,4,9764,117,87-23,160 STREET,11432,1039216,197200,40.707774,-73.801744,Queens,07/15/2010 10:10:46 AM,Bait applied,07/15/2010 02:07:33 PM,"(40.7077744363728, -73.8017444698633)"
9995,BAIT,23827,PO69910,6,2032610091,2,3261,91,3160,BAILEY AVENUE,10463,1011423,259476,40.87884,-73.901741,Bronx,07/15/2010 12:16:29 PM,Bait applied,07/16/2010 09:49:36 AM,"(40.8788397683361, -73.9017406808259)"
9996,BAIT,23828,PO37474,8,2032160021,2,3216,21,1944,ANDREWS AVENUE SOUTH,10453,1008490,250758,40.854907,-73.91303,Bronx,07/15/2010 12:07:55 PM,Bait applied,07/16/2010 09:49:01 AM,"(40.8549066197023, -73.9130302748932)"
9997,BAIT,23830,PO84268,5,2024260001,2,2426,1,381,EAST 166 STREET,10456,1008570,241456,40.829389,-73.912123,Bronx,07/15/2010 10:09:31 AM,Bait applied,07/16/2010 09:47:50 AM,"(40.8293887213765, -73.9121234267685)"
9998,BAIT,23831,PO59293,11,2028490081,2,2849,81,63,CLIFFORD PLACE,10453,1008712,247532,40.846065,-73.911588,Bronx,07/15/2010 11:54:37 AM,Bait applied,07/16/2010 09:47:16 AM,"(40.8460651323037, -73.9115881298086)"


# exploring dataframes
`pandas` has two core data types: `Series` and `DataFrame`. a `Series` is an indexed array of values of the same type, and a `DataFrame` is an indexed table formed by a set of `Series`. (older versions of `pandas` also had a `Panel` type, a collection of dataframes, but it has since been removed.)

we often want to select a part of the data rather than having to deal with the whole table at once. there are multiple ways to address a segment of the data. we might want a single column, a single row, a set of columns, a contiguous set of rows, a set of rows that satisfy some criteria, or a selection defined by a combination of these.

# selecting a dataframe column
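any individual column/series of a dataframe is also available as an attribute of the dataframe, provided the column name is a valid python identifier that does not clash with an existing dataframe method. for instance, from the above we saw that the dataframe contains the `JOB_ID` column, so it can be reached as `rodent_df.JOB_ID`.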

In [5]:
print(type(rodent_df.INSPECTION_TYPE)) # just the INSPECTION_TYPE column

<class 'pandas.core.series.Series'>


each column is a series object, and series objects have some built-in methods for calculating summaries of the values (the sum, mean, median, max, min, etc.). you can find out what these methods are with the `dir(object)` function.

In [6]:
lower_left  = (rodent_df.X_COORD.min(), rodent_df.Y_COORD.min())
upper_right = (rodent_df.X_COORD.max(), rodent_df.Y_COORD.max())
xrange = upper_right[0] - lower_left[0]
yrange = upper_right[1] - lower_left[1]
mid_point = (rodent_df.X_COORD.mean(), rodent_df.Y_COORD.mean())
med_point = (rodent_df.X_COORD.median(), rodent_df.Y_COORD.median())

In [7]:
rodent_df.X_COORD.describe() # try also .sum() function

count    9.999000e+03
mean     1.276815e+06
std      6.887182e+06
min      0.000000e+00
25%      9.964650e+05
50%      1.003659e+06
75%      1.010183e+06
max      2.434079e+08
Name: X_COORD, dtype: float64

In [8]:
# we can also access individual columns by column name, much like a dict key.
print('INSPECTION_TYPE details:')
print(rodent_df['INSPECTION_TYPE'].describe())
print('------')
print('JOB_ID details:')
print(rodent_df['JOB_ID'].describe())
print('------')
print('BOROUGH details')
print(rodent_df['BOROUGH'].describe())
print('there are', len(rodent_df['JOB_TICKET_OR_WORK_ORDER_ID'].unique()), 'unique job ticket ids')

INSPECTION_TYPE details:
count     9999
unique       1
top       BAIT
freq      9999
Name: INSPECTION_TYPE, dtype: object
------
JOB_ID details:
count        9999
unique       3800
top       PO17708
freq           14
Name: JOB_ID, dtype: object
------
BOROUGH details
count      9999
unique        5
top       Bronx
freq       3418
Name: BOROUGH, dtype: object
there are 9999 unique job ticket ids


In [9]:
# or a set/list of columns:
rodent_df[['JOB_ID', 'JOB_TICKET_OR_WORK_ORDER_ID', 'BOROUGH']].head()

Unnamed: 0,JOB_ID,JOB_TICKET_OR_WORK_ORDER_ID,BOROUGH
0,PO12965,1,Manhattan
1,PO12966,2,Manhattan
2,PO16966,30,Bronx
3,PO13665,31,Bronx
4,PO11291,38,Manhattan


In [10]:
alist = ['JOB_ID', 'JOB_TICKET_OR_WORK_ORDER_ID', 'BOROUGH']
rodent_df[alist].head()
# note! what is wrong with this???
#rodent_df['JOB_ID', 'JOB_TICKET_OR_WORK_ORDER_ID'] # error!

Unnamed: 0,JOB_ID,JOB_TICKET_OR_WORK_ORDER_ID,BOROUGH
0,PO12965,1,Manhattan
1,PO12966,2,Manhattan
2,PO16966,30,Bronx
3,PO13665,31,Bronx
4,PO11291,38,Manhattan
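to see why the commented-out line above raises an error, here is a minimal sketch on a made-up two-column dataframe (`toy_df` is not part of the exercise data). without the inner brackets, python hands `pandas` a single tuple as the key rather than a list of column names.

```python
import pandas as pd

toy_df = pd.DataFrame({'JOB_ID': ['PO1', 'PO2'], 'BOROUGH': ['Bronx', 'Queens']})

# a list inside the brackets selects several columns and returns a dataframe
subset = toy_df[['JOB_ID', 'BOROUGH']]

# without the inner brackets the key is the tuple ('JOB_ID', 'BOROUGH'),
# and there is no single column with that name
try:
    toy_df['JOB_ID', 'BOROUGH']
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(type(subset), lookup_failed)
```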


# selecting rows
to select rows, we use either the `loc` or the `iloc` property. the former picks rows by index label and the latter by integer position in the dataframe.

In [11]:
rodent_df.loc[50]
# note the index column (Name:)

INSPECTION_TYPE                                                BAIT
JOB_TICKET_OR_WORK_ORDER_ID                                     193
JOB_ID                                                      PO15400
JOB_PROGRESS                                                      3
BBL                                                      1020880038
BORO_CODE                                                         1
BLOCK                                                          2088
LOT                                                              38
HOUSE_NUMBER                                                    610
STREET_NAME                                         WEST 141 STREET
ZIP_CODE                                                      10031
X_COORD                                                      997302
Y_COORD                                                      239524
LATITUDE                                                    40.8241
LONGITUDE                                       

In [12]:
# because the index of this dataframe is just a row ordinal, `loc[50]` and `iloc[50]` return the same row
rodent_df.iloc[50]

INSPECTION_TYPE                                                BAIT
JOB_TICKET_OR_WORK_ORDER_ID                                     193
JOB_ID                                                      PO15400
JOB_PROGRESS                                                      3
BBL                                                      1020880038
BORO_CODE                                                         1
BLOCK                                                          2088
LOT                                                              38
HOUSE_NUMBER                                                    610
STREET_NAME                                         WEST 141 STREET
ZIP_CODE                                                      10031
X_COORD                                                      997302
Y_COORD                                                      239524
LATITUDE                                                    40.8241
LONGITUDE                                       

In [13]:
# we can even select a range of rows (note that a `loc` slice includes the end label):
rodent_df.loc[50:55]

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
50,BAIT,193,PO15400,3,1020880038,1,2088,38,610,WEST 141 STREET,10031,997302,239524,40.824104,-73.952841,Manhattan,12/01/2009 01:19:58 PM,Bait applied,12/02/2009 01:30:51 PM,"(40.824103930119, -73.9528407970262)"
51,BAIT,199,PO16428,3,2032080031,2,3208,31,2250,AQUEDUCT AVENUE,10453,1009903,252202,40.858874,-73.907263,Bronx,12/02/2009 09:50:47 AM,Bait applied,12/07/2009 08:40:34 AM,"(40.8588737119502, -73.9072626673015)"
52,BAIT,200,PO12863,3,4012420022,4,1242,22,33-34,70 STREET,11372,1012966,213776,40.753396,-73.896354,Queens,12/03/2009 01:05:28 PM,Bait applied,12/09/2009 10:42:22 AM,"(40.7533955947211, -73.8963540872997)"
53,BAIT,201,PO10985,3,4014690050,4,1469,50,35-51,95 STREET,11372,1019565,213555,40.752765,-73.872537,Queens,12/03/2009 11:12:17 AM,Bait applied,12/14/2009 09:44:37 AM,"(40.7527651129745, -73.872537219612)"
54,BAIT,202,PO12163,3,4095870017,4,9587,17,104-36,134 STREET,11419,1036263,191009,40.690799,-73.812443,Queens,12/03/2009 12:21:47 PM,Bait applied,12/09/2009 10:44:47 AM,"(40.6907994771446, -73.8124433519563)"
55,BAIT,205,PO28701,3,1012300025,1,1230,25,209,WEST 82 STREET,10024,990522,225382,40.785295,-73.977351,Manhattan,12/04/2009 02:09:10 PM,Bait applied,12/07/2009 09:29:37 AM,"(40.7852954492163, -73.9773513556123)"


In [14]:
rodent_df.iloc[50:55]

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
50,BAIT,193,PO15400,3,1020880038,1,2088,38,610,WEST 141 STREET,10031,997302,239524,40.824104,-73.952841,Manhattan,12/01/2009 01:19:58 PM,Bait applied,12/02/2009 01:30:51 PM,"(40.824103930119, -73.9528407970262)"
51,BAIT,199,PO16428,3,2032080031,2,3208,31,2250,AQUEDUCT AVENUE,10453,1009903,252202,40.858874,-73.907263,Bronx,12/02/2009 09:50:47 AM,Bait applied,12/07/2009 08:40:34 AM,"(40.8588737119502, -73.9072626673015)"
52,BAIT,200,PO12863,3,4012420022,4,1242,22,33-34,70 STREET,11372,1012966,213776,40.753396,-73.896354,Queens,12/03/2009 01:05:28 PM,Bait applied,12/09/2009 10:42:22 AM,"(40.7533955947211, -73.8963540872997)"
53,BAIT,201,PO10985,3,4014690050,4,1469,50,35-51,95 STREET,11372,1019565,213555,40.752765,-73.872537,Queens,12/03/2009 11:12:17 AM,Bait applied,12/14/2009 09:44:37 AM,"(40.7527651129745, -73.872537219612)"
54,BAIT,202,PO12163,3,4095870017,4,9587,17,104-36,134 STREET,11419,1036263,191009,40.690799,-73.812443,Queens,12/03/2009 12:21:47 PM,Bait applied,12/09/2009 10:44:47 AM,"(40.6907994771446, -73.8124433519563)"
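the two outputs above differ by one row: `loc[50:55]` returned six rows and `iloc[50:55]` returned five. a minimal sketch on a made-up dataframe (`letters_df` is only for illustration) shows the same thing: a `loc` slice includes its end label, while an `iloc` slice stops before its end position, like ordinary python slicing.

```python
import pandas as pd

letters_df = pd.DataFrame({'value': list('abcdefgh')})  # default index 0..7

by_label = letters_df.loc[2:5]      # labels 2, 3, 4 and 5
by_position = letters_df.iloc[2:5]  # positions 2, 3 and 4

print(len(by_label), len(by_position))
```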


In [15]:
# and the selection need not be contiguous: 
row_indeces = [0,1,10,100]
rodent_df.iloc[row_indeces]

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
0,BAIT,1,PO12965,3,1011470035,1,1147,35,104,WEST 76 STREET,10023,990505,223527,40.780204,-73.977414,Manhattan,10/14/2009 12:00:27 PM,Bait applied,10/14/2009 03:01:46 PM,"(40.7802039792471, -73.9774144709456)"
1,BAIT,2,PO12966,3,1011470034,1,1147,34,102,WEST 76 STREET,10023,990516,223521,40.780188,-73.977375,Manhattan,10/14/2009 12:51:21 PM,Bait applied,10/14/2009 03:02:30 PM,"(40.7801875030438, -73.977374757787)"
10,BAIT,51,PO11817,3,1003770026,1,377,26,390,EAST 8 STREET,10009,990654,203066,40.724044,-73.976896,Manhattan,11/16/2009 09:55:20 PM,Bait applied,11/19/2009 12:09:49 PM,"(40.7240436116555, -73.976895948583)"
100,BAIT,295,PO20835,3,2027960039,2,2796,39,1773,WEEKS AVENUE,10457,1009806,247554,40.84624,-73.907447,Bronx,12/14/2009 10:52:40 AM,Bait applied,12/15/2009 08:41:13 AM,"(40.8462399851237, -73.9074465419394)"


In [16]:
# and we can even filter rows on some criteria
criteria = (rodent_df.BOROUGH == 'Manhattan') & (rodent_df.LONGITUDE < -73.950259)
#rodent_df.loc[row_indeces].filter(criteria)
rodent_df[criteria]

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
0,BAIT,1,PO12965,3,1011470035,1,1147,35,104,WEST 76 STREET,10023,990505,223527,40.780204,-73.977414,Manhattan,10/14/2009 12:00:27 PM,Bait applied,10/14/2009 03:01:46 PM,"(40.7802039792471, -73.9774144709456)"
1,BAIT,2,PO12966,3,1011470034,1,1147,34,102,WEST 76 STREET,10023,990516,223521,40.780188,-73.977375,Manhattan,10/14/2009 12:51:21 PM,Bait applied,10/14/2009 03:02:30 PM,"(40.7801875030438, -73.977374757787)"
4,BAIT,38,PO11291,3,1011690057,1,1169,57,2199,BROADWAY,10024,989641,224567,40.783059,-73.980533,Manhattan,11/10/2009 08:40:42 AM,Bait applied,11/17/2009 11:39:11 AM,"(40.7830590725833, -73.9805333640688)"
5,BAIT,39,PO12483,3,1015340012,1,1534,12,232,EAST 89 STREET,10128,997617,223597,40.780388,-73.951734,Manhattan,11/10/2009 09:53:06 AM,Bait applied,11/17/2009 11:39:42 AM,"(40.780388216313, -73.9517343542567)"
10,BAIT,51,PO11817,3,1003770026,1,377,26,390,EAST 8 STREET,10009,990654,203066,40.724044,-73.976896,Manhattan,11/16/2009 09:55:20 PM,Bait applied,11/19/2009 12:09:49 PM,"(40.7240436116555, -73.976895948583)"
11,BAIT,52,PO11077,3,1003890019,1,389,19,202,EAST 7 STREET,10009,989741,203281,40.724634,-73.980190,Manhattan,11/16/2009 10:20:14 PM,Bait applied,11/23/2009 12:11:55 PM,"(40.7246343483455, -73.980189651703)"
12,BAIT,53,PO15241,3,1004590040,1,459,40,19,EAST 3 STREET,10003,986804,203770,40.725978,-73.990786,Manhattan,11/16/2009 09:25:46 PM,Bait applied,11/19/2009 12:04:34 PM,"(40.7259778701156, -73.9907855310067)"
13,BAIT,57,PO10580,3,1002100005,1,210,5,336,CANAL STREET,10013,983586,201500,40.719748,-74.002395,Manhattan,11/17/2009 12:00:29 PM,Bait applied,11/23/2009 12:16:37 PM,"(40.7197476131092, -74.0023953936767)"
14,BAIT,58,PO13111,3,1003920010,1,392,10,350,EAST 10 STREET,10009,989987,203999,40.726605,-73.979302,Manhattan,11/17/2009 09:55:24 AM,Bait applied,11/23/2009 12:07:54 PM,"(40.726604927718, -73.9793015244998)"
15,BAIT,59,PO10759,3,1004290001,1,429,1,12,1 AVENUE,10009,987488,202831,40.723400,-73.988318,Manhattan,11/17/2009 10:30:04 AM,Bait applied,11/23/2009 02:11:19 PM,"(40.7234003190272, -73.9883182080389)"
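the criteria above is a boolean series with one true/false value per row. the following sketch, on a made-up four-row dataframe, shows how such masks can be combined with `&` (and), `|` (or) and `~` (not). each comparison needs its own parentheses when written inline, because these operators bind more tightly than `==` and `<`.

```python
import pandas as pd

mask_df = pd.DataFrame({
    'BOROUGH':   ['Manhattan', 'Bronx', 'Manhattan', 'Queens'],
    'LONGITUDE': [-73.98, -73.90, -73.94, -73.80],
})

is_manhattan = mask_df['BOROUGH'] == 'Manhattan'
is_west = mask_df['LONGITUDE'] < -73.95

both = mask_df[is_manhattan & is_west]         # rows where both conditions hold
either = mask_df[is_manhattan | is_west]       # rows where at least one holds
not_manhattan = mask_df[~is_manhattan]         # rows where the condition fails

print(len(both), len(either), len(not_manhattan))
```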


# selecting individual cells
as with a matrix in linear algebra, an individual value can be addressed by its coordinates (row, column) through the `iloc` property. note that this is the opposite of how cells are addressed in excel, where the column comes first (e.g. B4).

for instance, the value in the 3rd row, 4th column is accessed with `rodent_df.iloc[2,3]` (remember that indices start from 0).

In [17]:
print('the value in the 3rd row, 5th column is:', rodent_df.iloc[2,4])
rodent_df.head()

the value in the 3rd row, 5th column is: 2043370027


Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
0,BAIT,1,PO12965,3,1011470035,1,1147,35,104,WEST 76 STREET,10023,990505,223527,40.780204,-73.977414,Manhattan,10/14/2009 12:00:27 PM,Bait applied,10/14/2009 03:01:46 PM,"(40.7802039792471, -73.9774144709456)"
1,BAIT,2,PO12966,3,1011470034,1,1147,34,102,WEST 76 STREET,10023,990516,223521,40.780188,-73.977375,Manhattan,10/14/2009 12:51:21 PM,Bait applied,10/14/2009 03:02:30 PM,"(40.7801875030438, -73.977374757787)"
2,BAIT,30,PO16966,3,2043370027,2,4337,27,620,THWAITES PLACE,10467,1020110,252216,40.858877,-73.870364,Bronx,11/09/2009 12:59:55 PM,Bait applied,11/10/2009 02:54:52 PM,"(40.8588765781972, -73.8703636422023)"
3,BAIT,31,PO13665,3,2037670077,2,3767,77,1227,WHITEPLAINS ROAD,10472,1022441,242180,40.831321,-73.861994,Bronx,11/09/2009 11:10:16 AM,Bait applied,11/10/2009 02:56:42 PM,"(40.8313209626148, -73.861994089899)"
4,BAIT,38,PO11291,3,1011690057,1,1169,57,2199,BROADWAY,10024,989641,224567,40.783059,-73.980533,Manhattan,11/10/2009 08:40:42 AM,Bait applied,11/17/2009 11:39:11 AM,"(40.7830590725833, -73.9805333640688)"


In [18]:
rodent_df.iloc[2:5,4:7]

Unnamed: 0,BBL,BORO_CODE,BLOCK
2,2043370027,2,4337
3,2037670077,2,3767
4,1011690057,1,1169
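a single cell can also be addressed by row label and column name through `loc`. the sketch below uses a made-up dataframe with letters as row labels (not the rodent data) to show both forms side by side.

```python
import pandas as pd

cell_df = pd.DataFrame(
    {'BBL': [10, 20, 30], 'BLOCK': [1, 2, 3]},
    index=['a', 'b', 'c'],
)

by_position = cell_df.iloc[1, 0]   # 2nd row, 1st column
by_label = cell_df.loc['b', 'BBL'] # row labelled 'b', column named 'BBL'

print(by_position, by_label)
```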


# a note on indices
note that the first column, the row index, is by default an ordinal number from 0 to N-1. we can identify rows by this index, or redefine it to be any unique-valued series of equal length. in this case, for instance, each row (each observation) is an inspection job with its own job ticket id (`JOB_TICKET_OR_WORK_ORDER_ID`). for some use cases it would make sense to index the table on the job ticket id instead. let's see how that could be done:

In [19]:

rodent_df.index = rodent_df.JOB_TICKET_OR_WORK_ORDER_ID
rodent_df.head()

JOB_TICKET_OR_WORK_ORDER_ID,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,ZIP_CODE,X_COORD,Y_COORD,LATITUDE,LONGITUDE,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION
1,BAIT,1,PO12965,3,1011470035,1,1147,35,104,WEST 76 STREET,10023,990505,223527,40.780204,-73.977414,Manhattan,10/14/2009 12:00:27 PM,Bait applied,10/14/2009 03:01:46 PM,"(40.7802039792471, -73.9774144709456)"
2,BAIT,2,PO12966,3,1011470034,1,1147,34,102,WEST 76 STREET,10023,990516,223521,40.780188,-73.977375,Manhattan,10/14/2009 12:51:21 PM,Bait applied,10/14/2009 03:02:30 PM,"(40.7801875030438, -73.977374757787)"
30,BAIT,30,PO16966,3,2043370027,2,4337,27,620,THWAITES PLACE,10467,1020110,252216,40.858877,-73.870364,Bronx,11/09/2009 12:59:55 PM,Bait applied,11/10/2009 02:54:52 PM,"(40.8588765781972, -73.8703636422023)"
31,BAIT,31,PO13665,3,2037670077,2,3767,77,1227,WHITEPLAINS ROAD,10472,1022441,242180,40.831321,-73.861994,Bronx,11/09/2009 11:10:16 AM,Bait applied,11/10/2009 02:56:42 PM,"(40.8313209626148, -73.861994089899)"
38,BAIT,38,PO11291,3,1011690057,1,1169,57,2199,BROADWAY,10024,989641,224567,40.783059,-73.980533,Manhattan,11/10/2009 08:40:42 AM,Bait applied,11/17/2009 11:39:11 AM,"(40.7830590725833, -73.9805333640688)"


note the difference? now that the index is no longer the row ordinal, `loc` and `iloc` return different rows:

In [20]:
print('loc[2]: ', rodent_df['JOB_ID'].loc[2])
print('iloc[2]:', rodent_df['JOB_ID'].iloc[2])

loc[2]:  PO12966
iloc[2]: PO16966
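an alternative to assigning to `.index` is the `set_index()` method, which returns a new dataframe and moves the chosen column into the index. the sketch below uses a made-up three-row dataframe (the column name `ticket` is only for illustration), with ids borrowed from the rows shown above.

```python
import pandas as pd

jobs_df = pd.DataFrame({
    'ticket': [1, 2, 30],
    'JOB_ID': ['PO12965', 'PO12966', 'PO16966'],
})

# set_index returns a new dataframe; jobs_df itself is left unchanged
indexed_df = jobs_df.set_index('ticket')

label_lookup = indexed_df['JOB_ID'].loc[2]      # the row whose ticket is 2
position_lookup = indexed_df['JOB_ID'].iloc[2]  # the third row

# reset_index moves the index back into an ordinary column
restored_df = indexed_df.reset_index()

print(label_lookup, position_lookup)
```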


# aggregating
we often want to calculate summary statistics for similar sections of a dataframe. we do this by first _grouping by_ a column value, and then _aggregating_ the values in each group. e.g. here we might want to know how many inspections there are in each borough. to do that we group by the borough column values and aggregate all the rows for each borough with the `count()` function.


In [35]:
rodent_df.groupby('BOROUGH').agg({'BOROUGH':'count'})

BOROUGH,BOROUGH
Bronx,3418
Brooklyn,3045
Manhattan,2707
Queens,603
Staten Island,226


note that the output is a dataframe, indexed on the categorical column we grouped by. 

In [None]:
type(rodent_df.groupby('BOROUGH').agg({'BOROUGH':'count'}))

the above form works for any aggregation function. but in case you are just doing a count of rows, and especially if you have a large dataframe, you might prefer to use the `value_counts()` function which has been optimised. 

In [36]:
rodent_df['BOROUGH'].value_counts(dropna=False)

Bronx            3418
Brooklyn         3045
Manhattan        2707
Queens            603
Staten Island     226
Name: BOROUGH, dtype: int64

note that the output from `value_counts()` is a series, not a dataframe.

In [37]:
type(rodent_df['BOROUGH'].value_counts(dropna=False))

pandas.core.series.Series
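the same group-then-aggregate pattern works for several summaries at once. the sketch below uses a made-up dataframe (the `wait_hours` column and its values are invented for the example) to show `size()` for row counts and `agg()` with a list of function names.

```python
import pandas as pd

waits_df = pd.DataFrame({
    'BOROUGH':    ['Bronx', 'Bronx', 'Queens', 'Bronx', 'Queens'],
    'wait_hours': [24.0, 30.0, 4.0, 21.0, 8.0],
})

# number of rows per group, returned as a series
counts = waits_df.groupby('BOROUGH').size()

# several aggregations of one column, returned as a dataframe
summary = waits_df.groupby('BOROUGH')['wait_hours'].agg(['count', 'mean', 'max'])

print(counts)
print(summary)
```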

# back to the exercise:
we now have all the features in place we need to answer the questions set out in today's exercise. we start by copying over a solution to last exercise (to read and manipulate the new york rodent inspection data). 

In [22]:
# let us define this utility function for formatting an iso-week:
def format_iso_week(datetime_obj):
    the_date = datetime_obj.date()
    iso_year, iso_week, iso_weekday = the_date.isocalendar()
    # zero-pad the week number so that single-digit weeks read e.g. 2010-W05
    iso_week_str = str(iso_year) + '-W' + str(iso_week).zfill(2)
    return iso_week_str

In [23]:
# the columns of interest are 'INSPECTION_DATE' and 'APPROVED_DATE'
# inspection by eye gives us the datetime format descriptor
format_descriptor = '%m/%d/%Y %I:%M:%S %p'

rodent_df['inspection_datetime']   = rodent_df.apply(lambda row: datetime.datetime.strptime(row['INSPECTION_DATE'], format_descriptor), axis=1)
rodent_df['approval_datetime']     = rodent_df.apply(lambda row: datetime.datetime.strptime(row['APPROVED_DATE'],   format_descriptor), axis=1)
rodent_df['wait_time_to_approval'] = rodent_df.apply(lambda row: (row['approval_datetime'] - row['inspection_datetime']).total_seconds(),axis=1)
rodent_df['inspection_month']      = rodent_df.apply(lambda row: row['inspection_datetime'].strftime("%B"),   axis=1)
rodent_df['isoweek']               = rodent_df.apply(lambda row: format_iso_week(row['inspection_datetime']), axis=1)
rodent_df['inspection_weekday']    = rodent_df.apply(lambda row: row['inspection_datetime'].strftime("%A"),   axis=1)
rodent_df.head()

JOB_TICKET_OR_WORK_ORDER_ID,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,...,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION,inspection_datetime,approval_datetime,wait_time_to_approval,inspection_month,isoweek,inspection_weekday
1,BAIT,1,PO12965,3,1011470035,1,1147,35,104,WEST 76 STREET,...,10/14/2009 12:00:27 PM,Bait applied,10/14/2009 03:01:46 PM,"(40.7802039792471, -73.9774144709456)",2009-10-14 12:00:27,2009-10-14 15:01:46,10879.0,October,2009-W42,Wednesday
2,BAIT,2,PO12966,3,1011470034,1,1147,34,102,WEST 76 STREET,...,10/14/2009 12:51:21 PM,Bait applied,10/14/2009 03:02:30 PM,"(40.7801875030438, -73.977374757787)",2009-10-14 12:51:21,2009-10-14 15:02:30,7869.0,October,2009-W42,Wednesday
30,BAIT,30,PO16966,3,2043370027,2,4337,27,620,THWAITES PLACE,...,11/09/2009 12:59:55 PM,Bait applied,11/10/2009 02:54:52 PM,"(40.8588765781972, -73.8703636422023)",2009-11-09 12:59:55,2009-11-10 14:54:52,93297.0,November,2009-W46,Monday
31,BAIT,31,PO13665,3,2037670077,2,3767,77,1227,WHITEPLAINS ROAD,...,11/09/2009 11:10:16 AM,Bait applied,11/10/2009 02:56:42 PM,"(40.8313209626148, -73.861994089899)",2009-11-09 11:10:16,2009-11-10 14:56:42,99986.0,November,2009-W46,Monday
38,BAIT,38,PO11291,3,1011690057,1,1169,57,2199,BROADWAY,...,11/10/2009 08:40:42 AM,Bait applied,11/17/2009 11:39:11 AM,"(40.7830590725833, -73.9805333640688)",2009-11-10 08:40:42,2009-11-17 11:39:11,615509.0,November,2009-W46,Tuesday
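the row-by-row `apply` calls above are easy to read but can be slow on large tables. as a possible alternative, the sketch below parses whole columns at once with `pd.to_datetime` and the `.dt` accessor. it uses two date pairs copied from the output above in a small stand-alone dataframe, and it is not the approach the rest of this notebook relies on.

```python
import pandas as pd

dates_df = pd.DataFrame({
    'INSPECTION_DATE': ['10/14/2009 12:00:27 PM', '11/10/2009 08:40:42 AM'],
    'APPROVED_DATE':   ['10/14/2009 03:01:46 PM', '11/17/2009 11:39:11 AM'],
})
format_descriptor = '%m/%d/%Y %I:%M:%S %p'

# parse each column in one call instead of one row at a time
inspected = pd.to_datetime(dates_df['INSPECTION_DATE'], format=format_descriptor)
approved = pd.to_datetime(dates_df['APPROVED_DATE'], format=format_descriptor)

# the .dt accessor applies datetime operations to the whole series
dates_df['wait_time_to_approval'] = (approved - inspected).dt.total_seconds()
dates_df['inspection_month'] = inspected.dt.month_name()
dates_df['inspection_weekday'] = inspected.dt.day_name()

print(dates_df[['wait_time_to_approval', 'inspection_month', 'inspection_weekday']])
```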


In [24]:
earliest_date = rodent_df['inspection_datetime'].min()
latest_date = rodent_df['inspection_datetime'].max()
print('the inspection dates range from', earliest_date, 'to', latest_date)

the inspection dates range from 2009-01-16 10:04:24 to 2011-10-27 18:28:10


# solution to exercise

In [25]:
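start_date = pd.to_datetime('2010-01-01')
end_date = pd.to_datetime('2010-03-31')
# note: end_date is midnight at the start of 2010-03-31, so this comparison
# leaves out inspections made later on that day
criteria = (rodent_df.inspection_datetime >= start_date) & (rodent_df.inspection_datetime <= end_date)
print('there are', sum(criteria), 'rows selected')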

there are 3406 rows selected


In [38]:
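# count inspections per weekday (note: `criteria` is not applied here, so this covers the whole data set)
rodent_df[['inspection_datetime', 'inspection_weekday']].sort_values('inspection_datetime').groupby('inspection_weekday').agg({'inspection_weekday':'count'})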

inspection_weekday,inspection_weekday
Friday,1750
Monday,1824
Saturday,359
Sunday,342
Thursday,1927
Tuesday,2065
Wednesday,1732
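the counts above add up to 9999, i.e. they cover every row rather than only the first quarter of 2010. the sketch below shows one way the three exercise questions could be answered with the date filter applied. it runs on a handful of made-up timestamps, so the numbers it prints say nothing about the real data, and dividing by calendar days (rather than by the distinct dates present in the data) is an assumption about what "daily average" means.

```python
import pandas as pd

visits_df = pd.DataFrame({'inspection_datetime': pd.to_datetime([
    '2009-12-30 09:00', '2010-01-04 10:00', '2010-01-04 15:30',
    '2010-01-05 11:00', '2010-01-11 09:15', '2010-01-12 14:00',
    '2010-03-31 16:45', '2010-04-01 08:00',
])})

# half-open interval, so that all of 2010-03-31 is included
start = pd.Timestamp('2010-01-01')
stop = pd.Timestamp('2010-04-01')
in_range = (visits_df['inspection_datetime'] >= start) & (visits_df['inspection_datetime'] < stop)
selected = visits_df[in_range]

# inspections per weekday within the selected range only
weekday_counts = selected['inspection_datetime'].dt.day_name().value_counts()
busiest = weekday_counts.idxmax()
quietest = weekday_counts.idxmin()

# daily average over the calendar days in the range
n_days = (stop - start).days
daily_average = len(selected) / n_days

print(busiest, quietest, daily_average)
```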


In [41]:
rodent_df['inspection_month'].value_counts()

April       2243
June        2062
March       1784
May         1509
February     928
January      785
July         438
December     202
November      45
October        3
Name: inspection_month, dtype: int64
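the exercise also asks for a `season` column. one possible sketch, assuming the season definition given in the exercise, is to look up the month number in a dictionary. `SEASON_BY_MONTH` and the example dates are made up for illustration.

```python
import pandas as pd

# month number -> season, following the definition in the exercise
SEASON_BY_MONTH = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'fall', 10: 'fall', 11: 'fall',
}

example_dates = pd.Series(pd.to_datetime([
    '2009-12-01', '2010-02-28', '2010-03-01', '2010-07-15', '2010-11-30',
]))

# .dt.month gives the month number, .map looks each one up in the dictionary
seasons = example_dates.dt.month.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

on the real dataframe the same idea would read `rodent_df['season'] = rodent_df['inspection_datetime'].dt.month.map(SEASON_BY_MONTH)`, assuming `inspection_datetime` holds parsed datetimes as above.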