## Trip Data Processing

### Purpose

* Create datasets for Tableau visualization from raw trip data files.

### Rationale

* Monthly raw trip data files are large (about 1.9 million rows for May 2019), and two of them are needed for each year-over-year comparison.  Pointing Tableau at the raw files directly slows it down.  Pre-processed hourly summary files still give Tableau useful data, but in a much more compact format (about 350,000 rows for the same month).

### Requirements

* Pandas (0.24.2)
* Numpy (1.16.4)

### Input / Output

* Input files are named according to `{yyyy}{mm}-citibike-tripdata.csv`, where yyyy is the year and mm is the month of interest.  The notebook looks for two files: one for year yyyy, and another for year yyyy - 1.

* Output files are named `{yyyy}{mm}-hourly-tripdata.csv`, one for each input file.

* These files should be in a sub-directory called 'trips' within the 'data' directory, which sits in the folder above the one containing this notebook.  The input files can be downloaded from the New York Citibike site at https://s3.amazonaws.com/tripdata/index.html
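
The naming convention above can be sketched as a small helper. The function name `trip_filenames` and its `data_dir` default are illustrative assumptions, not part of the notebook; only the file name patterns come from the text.

```python
from pathlib import Path


def trip_filenames(yyyy, mm, data_dir='../data/trips'):
    """Return ((this_year_in, last_year_in), (this_year_out, last_year_out)).

    Hypothetical helper: it only builds paths, it does not check that they exist.
    """
    month_str = f'{mm:02d}'  # zero-pad single-digit months
    base = Path(data_dir)
    years = (yyyy, yyyy - 1)
    inputs = tuple(base / f'{year}{month_str}-citibike-tripdata.csv' for year in years)
    outputs = tuple(base / f'{year}{month_str}-hourly-tripdata.csv' for year in years)
    return inputs, outputs


inputs, outputs = trip_filenames(2019, 5)
print(inputs[0].name)   # 201905-citibike-tripdata.csv
print(inputs[1].name)   # 201805-citibike-tripdata.csv
print(outputs[0].name)  # 201905-hourly-tripdata.csv
```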

### <span style="color:blue">Required User Input</span>

In [1]:
valid = False
attempts = 0  # Tracked to avoid infinite loops
while (not valid) and (attempts < 6):
    attempts += 1
    if attempts == 6:
        print('Too many invalid attempts.  Exiting loop.  *This notebook will not process data*.')
        break
    try:
        mmyyyy = input('Please enter the two-digit month and four-digit year for analysis (e.g. 05-2019)')
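        mm = int(mmyyyy[0:2])
        yyyy = int(mmyyyy[-4:])
        if not 1 <= mm <= 12:  # Reject months such as 00 or 13
            raise ValueError('Month must be between 01 and 12')
        print('Thank you.')
        valid = True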
    except ValueError:
        print('Invalid entry.  Please try again.')

Please enter the two-digit month and four-digit year for analysis (e.g. 05-2019)05-2019
Thank you.
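
The slicing in the cell above accepts any string whose first two and last four characters are digits. A stricter parse could look like the sketch below; `parse_month_year` is a hypothetical name, and the regular expression assumes the entry is exactly `mm-yyyy` as in the prompt's example.

```python
import re


def parse_month_year(text):
    """Parse 'mm-yyyy' into (mm, yyyy); raise ValueError on anything else."""
    match = re.fullmatch(r'(\d{2})-(\d{4})', text.strip())
    if match is None:
        raise ValueError(f'expected mm-yyyy, got {text!r}')
    mm, yyyy = int(match.group(1)), int(match.group(2))
    if not 1 <= mm <= 12:
        raise ValueError(f'month out of range: {mm}')
    return mm, yyyy


print(parse_month_year('05-2019'))  # (5, 2019)
```

Because it raises `ValueError`, it could be dropped into the existing `try` / `except ValueError` loop without changing the retry logic.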


### Set-Up / Import 

In [1]:
import pandas as pd
import numpy as np

In [3]:
# Compose import filenames
trip_data_dir = '../data/trips/'
month_str = f'{mm:02d}'  # Zero-pad single-digit months
this_year_str = str(yyyy)
last_year_str = str(yyyy - 1)
this_year_filename = trip_data_dir + this_year_str + month_str + '-citibike-tripdata.csv'
last_year_filename = trip_data_dir + last_year_str + month_str + '-citibike-tripdata.csv'
# Get data
this_yr_trips = pd.read_csv(this_year_filename, parse_dates = ['starttime','stoptime'])
last_yr_trips = pd.read_csv(last_year_filename, parse_dates = ['starttime', 'stoptime'])
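
Since the raw files are large, one option is to load only the columns the later cells use. The sketch below assumes those are the ten columns listed in `needed` (station names, end coordinates, and `bikeid` do not appear in the aggregation code), and it reads from an in-memory string built from the first 2019 row instead of a real file.

```python
import io

import pandas as pd

# One-row stand-in for a raw trip file, using the real header
csv_text = (
    'tripduration,starttime,stoptime,start station id,start station name,'
    'start station latitude,start station longitude,end station id,end station name,'
    'end station latitude,end station longitude,bikeid,usertype,birth year,gender\n'
    '139,2019-05-01 00:00:01.901,2019-05-01 00:02:21.517,447,8 Ave & W 52 St,'
    '40.763707,-73.985162,423,W 54 St & 9 Ave,40.765849,-73.986905,31170,Subscriber,1983,1\n'
)

# Assumption: only these columns are needed downstream
needed = ['tripduration', 'starttime', 'stoptime', 'start station id',
          'start station latitude', 'start station longitude', 'end station id',
          'usertype', 'birth year', 'gender']

trips = pd.read_csv(io.StringIO(csv_text), usecols=needed,
                    parse_dates=['starttime', 'stoptime'])
print(list(trips.columns))
```

Note that the column comparison in the checks below would then cover only the selected columns.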

### Basic Data Checking

In [4]:
# Examine data
this_yr_trips.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,139,2019-05-01 00:00:01.901,2019-05-01 00:02:21.517,447,8 Ave & W 52 St,40.763707,-73.985162,423,W 54 St & 9 Ave,40.765849,-73.986905,31170,Subscriber,1983,1
1,754,2019-05-01 00:00:03.021,2019-05-01 00:12:37.692,3258,W 27 St & 10 Ave,40.750182,-74.002184,3255,8 Ave & W 31 St,40.750585,-73.994685,25560,Customer,1969,0
2,2308,2019-05-01 00:00:04.627,2019-05-01 00:38:33.171,3093,N 6 St & Bedford Ave,40.717452,-73.958509,3676,Van Brunt St & Van Dyke St,40.675833,-74.014726,33369,Subscriber,1978,1
3,143,2019-05-01 00:00:19.334,2019-05-01 00:02:42.520,3486,Schermerhorn St & Bond St,40.688417,-73.984517,3412,Pacific St & Nevins St,40.685376,-73.983021,32041,Subscriber,1997,1
4,138,2019-05-01 00:00:22.184,2019-05-01 00:02:40.648,388,W 26 St & 10 Ave,40.749718,-74.00295,494,W 26 St & 8 Ave,40.747348,-73.997236,35237,Subscriber,1967,1


In [5]:
last_yr_trips.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,367,2018-05-01 05:06:16.584,2018-05-01 05:12:23.965,72,W 52 St & 11 Ave,40.767272,-73.993929,514,12 Ave & W 40 St,40.760875,-74.002777,30567,Subscriber,1965,1
1,1313,2018-05-01 06:25:49.425,2018-05-01 06:47:42.712,72,W 52 St & 11 Ave,40.767272,-73.993929,426,West St & Chambers St,40.717548,-74.013221,18965,Subscriber,1956,1
2,1798,2018-05-01 06:40:26.445,2018-05-01 07:10:25.179,72,W 52 St & 11 Ave,40.767272,-73.993929,3435,Grand St & Elizabeth St,40.718822,-73.99596,30241,Subscriber,1959,2
3,518,2018-05-01 07:06:02.973,2018-05-01 07:14:41.004,72,W 52 St & 11 Ave,40.767272,-73.993929,477,W 41 St & 8 Ave,40.756405,-73.990026,28985,Subscriber,1986,1
4,109,2018-05-01 07:26:32.345,2018-05-01 07:28:21.542,72,W 52 St & 11 Ave,40.767272,-73.993929,530,11 Ave & W 59 St,40.771522,-73.990541,14556,Subscriber,1991,1


In [6]:
# Check row counts
print(str(len(this_yr_trips)) + ' rows for ' + this_year_str)
print(str(len(last_yr_trips)) + ' rows for ' + last_year_str)

1924563 rows for 2019
1824710 rows for 2018


In [7]:
# Check that columns are the same 
this_yr_trips.columns == last_yr_trips.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True])
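
The element-wise `==` works here because both files have the same number of columns; with different lengths it may raise an error instead of returning an array. A more defensive check, sketched on two toy frames with made-up column subsets, uses `Index.equals` plus `symmetric_difference` to report what differs.

```python
import pandas as pd

# Toy frames standing in for this_yr_trips and last_yr_trips
a = pd.DataFrame(columns=['tripduration', 'starttime', 'bikeid'])
b = pd.DataFrame(columns=['tripduration', 'starttime', 'gender'])

same = a.columns.equals(b.columns)  # True only if labels and order match
diff = sorted(a.columns.symmetric_difference(b.columns))  # labels in one frame only
print(same, diff)
```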

In [8]:
this_yr_trips.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender'],
      dtype='object')

### Generate Hourly Station Data

In [19]:
def frac_subscribers(series):
    """Returns the fraction of DataFrame Series entries that are 'Subscriber', ignores na"""
    num_subscribers = len(list(filter(lambda x: x == 'Subscriber', series)))
    series_count = series.count()
    if series_count != 0:
        return num_subscribers / series_count
    else:
        return 0

In [20]:
def frac_in_age_range(series, current_year = yyyy, lower_limit = 0, upper_limit = 120):
    """Returns fraction of data series that represents people between lower_limit (inclusive) 
    and upper_limit (exclusive) of age, given a DataFrame Series with birth years and the 
    current year.  Note that ages are not exact, but equal the difference in current_year and the
    birth year.  Limits are converted to integers."""
    try:
        # Drop missing birth years first; casting NaN to int raises ValueError
        numeric_series = series.dropna().astype('int')
        lower_limit = int(lower_limit)
        upper_limit = int(upper_limit)
        num_in_range = len(list(filter(lambda x: ((current_year - x) >= lower_limit) and
                                                ((current_year - x) < upper_limit), numeric_series)))
        series_count = len(numeric_series)
        if series_count != 0:
            return num_in_range / series_count
        else:
            return 0
    except ValueError:
        return 0
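
For comparison, the same fraction can be computed with vectorized pandas operations instead of `filter` and a lambda. This is a sketch, not the notebook's method: `frac_in_age_range_vec` is a hypothetical name, and it coerces missing or non-numeric birth years to NaN and drops them before computing ages.

```python
import pandas as pd


def frac_in_age_range_vec(series, current_year, lower_limit=0, upper_limit=120):
    """Fraction of valid birth years whose age falls in [lower_limit, upper_limit)."""
    birth_years = pd.to_numeric(series, errors='coerce').dropna()
    if birth_years.empty:
        return 0
    ages = current_year - birth_years
    in_range = (ages >= int(lower_limit)) & (ages < int(upper_limit))
    return in_range.mean()  # mean of a boolean Series is the fraction of True values


sample = pd.Series([1983, 1969, 1997, None])  # ages 36, 50, 22 in 2019
print(frac_in_age_range_vec(sample, 2019, 35, 45))
```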

In [27]:
def frac_male(series):
    """Returns fraction of data series that represents males, given a DataFrame series encoded
    as 0 = unknown, 1 = male, 2 = female"""
    num_male = len(list(filter(lambda x: x == 1, series)))
    num_female = len(list(filter(lambda x: x == 2, series)))
    if (num_male + num_female) != 0:
        return num_male / (num_male + num_female)
    else:
        return 0

In [38]:
def frac_under_25_this_year(series):
    """Returns frac_in_age_range for ages under 25 (0 to 24) using current year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy, 0, 25)

In [39]:
def frac_25_to_34_this_year(series):
    """Returns frac_in_age_range for age range 25-34 using current year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy, 25, 35)

In [40]:
def frac_35_to_44_this_year(series):
    """Returns frac_in_age_range for age range 35-44 using current year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy, 35, 45)

In [41]:
def frac_45_to_54_this_year(series):
    """Returns frac_in_age_range for age range 45-54 using current year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy, 45, 55)

In [42]:
def frac_over_55_this_year(series):
    """Returns frac_in_age_range for age range 55 and over using current year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy, 55, 199)

In [43]:
def frac_under_25_last_year(series):
    """Returns frac_in_age_range for ages under 25 (0 to 24) using previous year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy-1, 0, 25)

In [44]:
def frac_25_to_34_last_year(series):
    """Returns frac_in_age_range for age range 25 to 34 using previous year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy-1, 25, 35)

In [45]:
def frac_35_to_44_last_year(series):
    """Returns frac_in_age_range for age range 35 to 44 using previous year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy-1, 35, 45)

In [46]:
def frac_45_to_54_last_year(series):
    """Returns frac_in_age_range for age range 45 to 54 using previous year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy-1, 45, 55)

In [47]:
def frac_over_55_last_year(series):
    """Returns frac_in_age_range for age range 55 and over using previous year.  Provides a unique
    function name to simplify groupby operations"""
    return frac_in_age_range(series, yyyy-1, 55, 199)
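
The ten wrappers above differ only in year and age limits, so they could be generated from one table of age bands. The sketch below is an assumption-laden alternative: `AGE_BANDS`, `make_age_aggregators`, and the pure-Python stand-in `simple_frac_in_age_range` are hypothetical names. Each generated function gets its own `__name__`, since the wrappers exist to give groupby uniquely named functions.

```python
# (label, lower limit inclusive, upper limit exclusive), mirroring the wrappers above
AGE_BANDS = [('under_25', 0, 25), ('25_to_34', 25, 35), ('35_to_44', 35, 45),
             ('45_to_54', 45, 55), ('over_55', 55, 199)]


def simple_frac_in_age_range(birth_years, current_year, lower, upper):
    """Stand-in for frac_in_age_range that works on any iterable of ints."""
    birth_years = list(birth_years)
    if not birth_years:
        return 0
    hits = sum(lower <= current_year - by < upper for by in birth_years)
    return hits / len(birth_years)


def make_age_aggregators(year, suffix, frac_fn=simple_frac_in_age_range):
    """Build one uniquely named aggregation function per age band."""
    aggregators = []
    for label, lower, upper in AGE_BANDS:
        # Default arguments freeze this band's limits for the inner function
        def agg(series, lower=lower, upper=upper):
            return frac_fn(series, year, lower, upper)
        agg.__name__ = f'frac_{label}_{suffix}'
        aggregators.append(agg)
    return aggregators


this_year_aggs = make_age_aggregators(2019, 'this_year')
print([f.__name__ for f in this_year_aggs])
print([f([1983, 1969, 1997]) for f in this_year_aggs])  # ages 36, 50, 22
```

In the notebook, `frac_in_age_range` would be passed as `frac_fn`, and the returned list could be used as the `'birth year'` entry of the aggregation dictionary.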

In [48]:
# Build aggregation dictionary for use with groupby
# -- tripduration is very unevenly distributed (lots of long-duration outliers), so use median
# -- use custom functions for usertype, gender, and birth year
# -- Note that yyyy should already be set to the current year
agg_dict = {'tripduration':['median','count'], 'start station latitude':'first', 
            'start station longitude':'first','usertype':frac_subscribers,
            'gender':frac_male,'birth year' : [frac_under_25_this_year, 
                                               frac_25_to_34_this_year,
                                               frac_35_to_44_this_year,
                                               frac_45_to_54_this_year,
                                               frac_over_55_this_year]}

In [49]:
this_yr_hourly = this_yr_trips.groupby([this_yr_trips['starttime'].dt.date,
                                       this_yr_trips['starttime'].dt.hour,
                                       this_yr_trips['start station id']]).agg(agg_dict)

In [57]:
# Rename columns and reset index -- because of duplicate index names, do the inner columns first
this_yr_hourly = this_yr_hourly.reset_index(level = [1,2])
this_yr_hourly.columns = ['hour','start_id','median_duration','trip_count','station_lat',
                         'station_long','frac_subscriber','frac_male','frac_under_25','frac_25_34',
                         'frac_35_44','frac_45_54','frac_55_over']
this_yr_hourly.head()

starttime,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over
2019-05-01,0,83,1802.0,3,40.683826,-73.976323,0.0,0.666667,1.0,0.0,0.0,0.0,0.0
2019-05-01,0,127,581.0,1,40.731724,-74.006744,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2019-05-01,0,128,548.0,1,40.727103,-74.002971,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2019-05-01,0,146,467.0,1,40.71625,-74.009106,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2019-05-01,0,151,541.0,1,40.722104,-73.997249,1.0,1.0,0.0,0.0,0.0,0.0,0.0


In [58]:
# Now fix the outer column with standard reset index
this_yr_hourly = this_yr_hourly.reset_index()
this_yr_hourly.head()

Unnamed: 0,starttime,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over
0,2019-05-01,0,83,1802.0,3,40.683826,-73.976323,0.0,0.666667,1.0,0.0,0.0,0.0,0.0
1,2019-05-01,0,127,581.0,1,40.731724,-74.006744,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2,2019-05-01,0,128,548.0,1,40.727103,-74.002971,1.0,1.0,0.0,0.0,0.0,0.0,0.0
3,2019-05-01,0,146,467.0,1,40.71625,-74.009106,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,2019-05-01,0,151,541.0,1,40.722104,-73.997249,1.0,1.0,0.0,0.0,0.0,0.0,0.0


In [59]:
# Lastly, rename the column
this_yr_hourly = this_yr_hourly.rename(columns = {'starttime':'date'})
this_yr_hourly.head()

Unnamed: 0,date,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over
0,2019-05-01,0,83,1802.0,3,40.683826,-73.976323,0.0,0.666667,1.0,0.0,0.0,0.0,0.0
1,2019-05-01,0,127,581.0,1,40.731724,-74.006744,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2,2019-05-01,0,128,548.0,1,40.727103,-74.002971,1.0,1.0,0.0,0.0,0.0,0.0,0.0
3,2019-05-01,0,146,467.0,1,40.71625,-74.009106,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,2019-05-01,0,151,541.0,1,40.722104,-73.997249,1.0,1.0,0.0,0.0,0.0,0.0,0.0
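
The three clean-up cells above work around two index levels that share the name `starttime`. A possible alternative, shown here on a three-row toy frame with invented durations, names the index levels with `rename_axis` so that a single `reset_index` is enough.

```python
import pandas as pd

# Toy data: two trips from station 83 in hour 0, one from station 127 in hour 1
trips = pd.DataFrame({
    'starttime': pd.to_datetime(['2019-05-01 00:00:01', '2019-05-01 00:20:00',
                                 '2019-05-01 01:05:00']),
    'start station id': [83, 83, 127],
    'tripduration': [100, 300, 581],
})

hourly = (trips.groupby([trips['starttime'].dt.date,
                         trips['starttime'].dt.hour,
                         trips['start station id']])['tripduration']
               .agg(['median', 'count'])
               .rename_axis(['date', 'hour', 'start_id'])  # unique level names
               .reset_index()
               .rename(columns={'median': 'median_duration', 'count': 'trip_count'}))
print(hourly)
```

This sketch aggregates only `tripduration`; with the full aggregation dictionary the columns would still need the explicit renaming used in the notebook.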


In [63]:
# Repeat process for previous year data
agg_dict_last_year = {'tripduration':['median','count'], 'start station latitude':'first', 
            'start station longitude':'first','usertype':frac_subscribers,
            'gender':frac_male,'birth year' : [frac_under_25_last_year, 
                                               frac_25_to_34_last_year,
                                               frac_35_to_44_last_year,
                                               frac_45_to_54_last_year,
                                               frac_over_55_last_year]}

In [64]:
last_yr_hourly = last_yr_trips.groupby([last_yr_trips['starttime'].dt.date,
                                       last_yr_trips['starttime'].dt.hour,
                                       last_yr_trips['start station id']]).agg(agg_dict_last_year)

In [65]:
# DataFrame clean up -- see above for explanation of two-part index reset
last_yr_hourly = last_yr_hourly.reset_index(level = [1,2])
last_yr_hourly.columns = ['hour','start_id','median_duration','trip_count','station_lat',
                         'station_long','frac_subscriber','frac_male','frac_under_25','frac_25_34',
                         'frac_35_44','frac_45_54','frac_55_over']
last_yr_hourly = last_yr_hourly.reset_index()
last_yr_hourly = last_yr_hourly.rename(columns = {'starttime':'date'})
last_yr_hourly.head()

Unnamed: 0,date,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over
0,2018-05-01,0,128,550.0,3,40.727103,-74.002971,1.0,1.0,0.0,0.333333,0.333333,0.0,0.333333
1,2018-05-01,0,146,1182.5,2,40.71625,-74.009106,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2018-05-01,0,157,188.0,1,40.690893,-73.996123,1.0,1.0,0.0,0.0,0.0,0.0,0.0
3,2018-05-01,0,161,480.5,2,40.72917,-73.998102,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,2018-05-01,0,168,662.0,1,40.739713,-73.994564,1.0,0.0,0.0,0.0,0.0,0.0,0.0


### Add Balancing Data

In [68]:
# We need only the hourly count of trips ending at a given station
this_yr_hourly_balance = this_yr_trips.groupby([this_yr_trips['stoptime'].dt.date,
                                       this_yr_trips['stoptime'].dt.hour,
                                       this_yr_trips['end station id']])[['tripduration']].count()
this_yr_hourly_balance.head()

stoptime,stoptime,end station id,tripduration
2019-05-01,0,120,1
2019-05-01,0,127,1
2019-05-01,0,146,1
2019-05-01,0,151,1
2019-05-01,0,161,2


In [69]:
#Clean up, in parts to deal with duplicated index name
this_yr_hourly_balance = this_yr_hourly_balance.reset_index(level=[1,2])
this_yr_hourly_balance.columns = ['return_hour','return_id','return_count']
this_yr_hourly_balance = this_yr_hourly_balance.reset_index()
this_yr_hourly_balance = this_yr_hourly_balance.rename(columns = {'stoptime':'return_date'})
this_yr_hourly_balance.head()

Unnamed: 0,return_date,return_hour,return_id,return_count
0,2019-05-01,0,120,1
1,2019-05-01,0,127,1
2,2019-05-01,0,146,1
3,2019-05-01,0,151,1
4,2019-05-01,0,161,2


In [70]:
#Repeat for previous year
last_yr_hourly_balance = last_yr_trips.groupby([last_yr_trips['stoptime'].dt.date,
                                       last_yr_trips['stoptime'].dt.hour,
                                       last_yr_trips['end station id']])[['tripduration']].count()
last_yr_hourly_balance = last_yr_hourly_balance.reset_index(level=[1,2])
last_yr_hourly_balance.columns = ['return_hour','return_id','return_count']
last_yr_hourly_balance = last_yr_hourly_balance.reset_index()
last_yr_hourly_balance = last_yr_hourly_balance.rename(columns = {'stoptime':'return_date'})
last_yr_hourly_balance.head()

Unnamed: 0,return_date,return_hour,return_id,return_count
0,2018-05-01,0,72,3
1,2018-05-01,0,120,1
2,2018-05-01,0,127,1
3,2018-05-01,0,143,1
4,2018-05-01,0,150,2


In [72]:
#Now merge in the data
this_yr_hourly = this_yr_hourly.merge(this_yr_hourly_balance,
                                     left_on = ['date','hour','start_id'],
                                     right_on = ['return_date','return_hour','return_id'],
                                     how = 'left')
this_yr_hourly.head()

Unnamed: 0,date,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over,return_date,return_hour,return_id,return_count
0,2019-05-01,0,83,1802.0,3,40.683826,-73.976323,0.0,0.666667,1.0,0.0,0.0,0.0,0.0,,,,
1,2019-05-01,0,127,581.0,1,40.731724,-74.006744,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2019-05-01,0.0,127.0,1.0
2,2019-05-01,0,128,548.0,1,40.727103,-74.002971,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,
3,2019-05-01,0,146,467.0,1,40.71625,-74.009106,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2019-05-01,0.0,146.0,1.0
4,2019-05-01,0,151,541.0,1,40.722104,-73.997249,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2019-05-01,0.0,151.0,1.0


In [73]:
# Clean up the merge 
this_yr_hourly = this_yr_hourly.drop(columns = ['return_date','return_hour','return_id'])
this_yr_hourly = this_yr_hourly.fillna(0)
this_yr_hourly.head()

Unnamed: 0,date,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over,return_count
0,2019-05-01,0,83,1802.0,3,40.683826,-73.976323,0.0,0.666667,1.0,0.0,0.0,0.0,0.0,0.0
1,2019-05-01,0,127,581.0,1,40.731724,-74.006744,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2019-05-01,0,128,548.0,1,40.727103,-74.002971,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2019-05-01,0,146,467.0,1,40.71625,-74.009106,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,2019-05-01,0,151,541.0,1,40.722104,-73.997249,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [74]:
# Add net change in inventory column
this_yr_hourly['comp_net_inven'] = -1 * this_yr_hourly['trip_count'] + this_yr_hourly['return_count']
this_yr_hourly.head()

Unnamed: 0,date,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over,return_count,comp_net_inven
0,2019-05-01,0,83,1802.0,3,40.683826,-73.976323,0.0,0.666667,1.0,0.0,0.0,0.0,0.0,0.0,-3.0
1,2019-05-01,0,127,581.0,1,40.731724,-74.006744,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2019-05-01,0,128,548.0,1,40.727103,-74.002971,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0
3,2019-05-01,0,146,467.0,1,40.71625,-74.009106,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2019-05-01,0,151,541.0,1,40.722104,-73.997249,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
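
The merge, fill, and net-change steps can be traced on a toy example built from the first three station rows shown above (stations 83, 127, and 128 in hour 0). The frame names `departures` and `returns` are illustrative only.

```python
import pandas as pd

departures = pd.DataFrame({'date': ['2019-05-01'] * 3, 'hour': [0, 0, 0],
                           'start_id': [83, 127, 128], 'trip_count': [3, 1, 1]})
returns = pd.DataFrame({'return_date': ['2019-05-01'], 'return_hour': [0],
                        'return_id': [127], 'return_count': [1]})

# Left merge keeps every departure row; unmatched rows get NaN for return_count
merged = departures.merge(returns, how='left',
                          left_on=['date', 'hour', 'start_id'],
                          right_on=['return_date', 'return_hour', 'return_id'])
merged = merged.drop(columns=['return_date', 'return_hour', 'return_id'])
merged['return_count'] = merged['return_count'].fillna(0).astype(int)

# Net change in inventory = bikes returned minus bikes taken out
merged['comp_net_inven'] = merged['return_count'] - merged['trip_count']
print(merged)
```

Station 83 ends the hour three bikes down, station 127 breaks even, and station 128 is one bike down, matching the notebook output. Because the merge is a left join on the departures table, a station-hour that has returns but no departures has no row to attach to and does not appear in the result.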


In [75]:
# Repeat for previous year
last_yr_hourly = last_yr_hourly.merge(last_yr_hourly_balance,
                                     left_on = ['date','hour','start_id'],
                                     right_on = ['return_date','return_hour','return_id'],
                                     how = 'left')
last_yr_hourly = last_yr_hourly.drop(columns = ['return_date','return_hour','return_id'])
last_yr_hourly = last_yr_hourly.fillna(0)
last_yr_hourly['comp_net_inven'] = -1 * last_yr_hourly['trip_count'] + last_yr_hourly['return_count']
last_yr_hourly.head()

Unnamed: 0,date,hour,start_id,median_duration,trip_count,station_lat,station_long,frac_subscriber,frac_male,frac_under_25,frac_25_34,frac_35_44,frac_45_54,frac_55_over,return_count,comp_net_inven
0,2018-05-01,0,128,550.0,3,40.727103,-74.002971,1.0,1.0,0.0,0.333333,0.333333,0.0,0.333333,0.0,-3.0
1,2018-05-01,0,146,1182.5,2,40.71625,-74.009106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.0
2,2018-05-01,0,157,188.0,1,40.690893,-73.996123,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0
3,2018-05-01,0,161,480.5,2,40.72917,-73.998102,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.0
4,2018-05-01,0,168,662.0,1,40.739713,-73.994564,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0


### Save Data

* The output data is designed for import into Tableau and can also be used to estimate commuters with the notebook `Estimate-Commuters.jupnb`

In [76]:
# Compose filenames
trip_data_dir = '../data/trips/'
month_str = f'{mm:02d}'  # Zero-pad single-digit months
this_year_str = str(yyyy)
last_year_str = str(yyyy - 1)
this_year_out_filename = trip_data_dir + this_year_str + month_str + '-hourly-tripdata.csv'
last_year_out_filename = trip_data_dir + last_year_str + month_str + '-hourly-tripdata.csv'
# Write files
this_yr_hourly.to_csv(this_year_out_filename)
last_yr_hourly.to_csv(last_year_out_filename)

In [77]:
# Lastly, estimate our savings by aggregating hourly, for our own info
print(f'For current year data:  {len(this_yr_trips)} to {len(this_yr_hourly)} rows')
print(f'For previous year data:  {len(last_yr_trips)} to {len(last_yr_hourly)} rows')

For current year data:  1924563 to 348900 rows
For previous year data:  1824710 to 331921 rows
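
To put the row counts above in relative terms, the short calculation below divides the hourly row count by the raw row count for each year. The numbers are copied from the output above; the variable names are illustrative.

```python
# (raw trip rows, hourly summary rows) from the printed output
row_counts = {
    2019: (1924563, 348900),
    2018: (1824710, 331921),
}

ratios = {}
for year, (raw_rows, hourly_rows) in row_counts.items():
    ratios[year] = hourly_rows / raw_rows
    print(f'{year}: hourly file keeps {ratios[year]:.1%} of the raw row count')
```

Both years come out at roughly 18%, so each hourly file has a little under one fifth as many rows as its raw input.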
