First, we import the relevant libraries and load a dataframe representing all of the arrivals at a single stop in a single month, traveling in one of two directions on the M100 bus route.

Note that there are relatively few entries for the month (107) because of data loss earlier in the cleaning process. Further work on cleaning is necessary.

In [12]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
mta_df = pd.read_csv('../../data/busBoarding.csv', on_bad_lines='skip')
mta_df.shape

(14500, 8)

The dataframe is then sorted by 'BusDepartureTime'. Since every entry was recorded while the bus was at the stop, these values represent the actual arrival times.

In [14]:

mta_df.sort_values("BusDepartureTime", inplace=True)
mta_df.head()

Unnamed: 0.1,Unnamed: 0,Date,passengerId,passengerArrivalTime,BusDepartureTime,VehicleRef,TimeDelta,numPassengersPerBus
0,0,2017-08-01 00:00:00.000000000,4,2017-08-01 00:19:35.131602364,2017-08-01 00:21:06.000000000,NYCT_4368,0 days 00:09:27.000000000,5
1,1,2017-08-01 00:00:00.000000000,233,2017-08-01 00:20:47.502397393,2017-08-01 00:21:06.000000000,NYCT_4368,0 days 00:09:27.000000000,5
2,2,2017-08-01 00:00:00.000000000,290,2017-08-01 00:20:27.926917146,2017-08-01 00:21:06.000000000,NYCT_4368,0 days 00:09:27.000000000,5
3,3,2017-08-01 00:00:00.000000000,308,2017-08-01 00:11:59.891229654,2017-08-01 00:21:06.000000000,NYCT_4368,0 days 00:09:27.000000000,5
4,4,2017-08-01 00:00:00.000000000,398,2017-08-01 00:15:17.577506243,2017-08-01 00:21:06.000000000,NYCT_4368,0 days 00:09:27.000000000,5
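
The timestamp columns are read from the CSV as strings, so it may help to parse them once before sorting rather than converting them repeatedly later. The following is a minimal sketch under that assumption. It uses two rows copied from the output above in an in-memory CSV, restricted to three columns for brevity; the names `sample_csv` and `sample_df` are illustrative only.

```python
import io
import pandas as pd

# Two rows taken from the output above, limited to the time-related columns.
sample_csv = io.StringIO(
    "passengerArrivalTime,BusDepartureTime,TimeDelta\n"
    "2017-08-01 00:19:35.131602364,2017-08-01 00:21:06.000000000,0 days 00:09:27.000000000\n"
    "2017-08-01 00:11:59.891229654,2017-08-01 00:21:06.000000000,0 days 00:09:27.000000000\n"
)

# Parse the timestamp columns while reading, then convert the duration column.
sample_df = pd.read_csv(
    sample_csv, parse_dates=["passengerArrivalTime", "BusDepartureTime"]
)
sample_df["TimeDelta"] = pd.to_timedelta(sample_df["TimeDelta"])

# Sorting now compares real timestamps instead of strings.
sample_df = sample_df.sort_values("BusDepartureTime").reset_index(drop=True)
print(sample_df.dtypes)
```

With the columns already typed, later comparisons between passenger arrival times and bus departure times should not need a per-row `pd.to_datetime` call.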


The arrivals dataframe is initialized with a controlled number of random passenger arrival times. Both the number of data points and the frequency of the candidate times are set in the call to select_random_dates.

In [15]:
def select_random_dates(frequency, NumDataPoints):
    date_range = pd.date_range(start='2017-08-01', end='2017-08-30', freq=frequency)
    # Sample without replacement, excluding the first and last timestamps.
    random_dates = pd.to_datetime(
        np.random.choice(date_range[1:-1], size=NumDataPoints, replace=False)
    )
    return random_dates

arrivals_df = pd.DataFrame()
arrivals_df['passengerArrivalTime'] = select_random_dates('1min', 600)
arrivals_df.head(10)

Unnamed: 0,passengerArrivalTime
0,2017-08-28 17:25:00
1,2017-08-12 10:46:00
2,2017-08-28 04:44:00
3,2017-08-28 16:39:00
4,2017-08-03 23:25:00
5,2017-08-07 01:47:00
6,2017-08-08 07:01:00
7,2017-08-10 12:04:00
8,2017-08-10 03:59:00
9,2017-08-06 00:32:00
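
Because the sampling above is unseeded, each run produces a different set of arrival times. If reproducible results are wanted, one option is to pass a seed to NumPy's `Generator` API. This is a sketch only: the function name `select_random_dates_seeded` and the `seed` parameter are assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def select_random_dates_seeded(frequency, num_data_points, seed=0):
    """Pick unique random timestamps from the same range, reproducibly."""
    # Same candidate range as above, with the first and last timestamps dropped.
    candidates = pd.date_range(
        start='2017-08-01', end='2017-08-30', freq=frequency
    )[1:-1]
    rng = np.random.default_rng(seed)
    # Draw positions rather than timestamps, then index into the candidates.
    positions = rng.choice(len(candidates), size=num_data_points, replace=False)
    return candidates[positions]

seeded_dates = select_random_dates_seeded('1min', 600, seed=42)
print(seeded_dates[:5])
```

Calling the function twice with the same seed should return the same timestamps, which makes the wait-time results comparable between runs.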


For each of the random passenger arrival times defined above, the next arriving bus is found, along with the time delta between the two, which represents the wait time.

In [22]:
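def findNextBus(arrivals_df, mta_df):
    # Assumes mta_df is already sorted by 'BusDepartureTime' (see above).
    for arrivalIndex, arrivalRow in arrivals_df.iterrows():
        arrivalTime = pd.to_datetime(arrivalRow['passengerArrivalTime'])
        for mtaIndex, mtaRow in mta_df.iterrows():
            # The first departure strictly after the arrival is the next bus.
            if pd.to_datetime(mtaRow['BusDepartureTime']) > arrivalTime:
                arrivals_df.loc[arrivalIndex, 'NextBus'] = mtaRow['BusDepartureTime']
                break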

findNextBus(arrivals_df, mta_df)
arrivals_df['WaitTime'] = pd.to_datetime(arrivals_df['NextBus']) - pd.to_datetime(arrivals_df['passengerArrivalTime'])
arrivals_df.head(10)

Unnamed: 0,passengerArrivalTime,NextBus,WaitTime
0,2017-08-28 17:25:00,2017-08-28 17:30:21.000000000,00:05:21
1,2017-08-12 10:46:00,2017-08-12 10:52:08.000000000,00:06:08
2,2017-08-28 04:44:00,2017-08-28 06:29:28.000000000,01:45:28
3,2017-08-28 16:39:00,2017-08-28 16:41:02.000000000,00:02:02
4,2017-08-03 23:25:00,2017-08-03 23:27:49.000000000,00:02:49
5,2017-08-07 01:47:00,2017-08-07 01:52:46.000000000,00:05:46
6,2017-08-08 07:01:00,2017-08-08 07:34:46.000000000,00:33:46
7,2017-08-10 12:04:00,2017-08-10 12:18:37.000000000,00:14:37
8,2017-08-10 03:59:00,2017-08-10 06:08:56.000000000,02:09:56
9,2017-08-06 00:32:00,2017-08-06 00:51:08.000000000,00:19:08
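
The nested loop above rescans the bus records from the start for every passenger, which can become slow as either table grows. A possible vectorized alternative is a binary search over the sorted departure times with `np.searchsorted`. This is a sketch, not the notebook's method: the function name `find_next_bus_vectorized` is hypothetical, and the example inputs are small made-up series rather than the real data.

```python
import numpy as np
import pandas as pd

def find_next_bus_vectorized(arrival_times, departure_times):
    """For each arrival, return the first departure strictly after it (NaT if none)."""
    arrivals = pd.to_datetime(pd.Series(arrival_times))
    departures = pd.to_datetime(pd.Series(departure_times)).sort_values().to_numpy()
    # side='right' yields the position of the first departure > arrival,
    # matching the strict comparison used in the loop version.
    idx = np.searchsorted(departures, arrivals.to_numpy(), side='right')
    next_bus = np.full(len(arrivals), np.datetime64('NaT'), dtype=departures.dtype)
    found = idx < len(departures)  # arrivals after the last bus stay NaT
    next_bus[found] = departures[idx[found]]
    return pd.Series(next_bus, index=arrivals.index, name='NextBus')

# Illustrative inputs only.
example_arrivals = pd.Series(['2017-08-01 00:10:00', '2017-08-01 00:21:06'])
example_departures = pd.Series(['2017-08-01 00:21:06', '2017-08-01 00:30:00'])
example_next = find_next_bus_vectorized(example_arrivals, example_departures)
example_wait = example_next - pd.to_datetime(example_arrivals)
print(example_wait)
```

As in the loop version, a passenger who arrives after the last recorded departure gets no next bus, so the corresponding wait time is missing.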
