First, we import the relevant libraries and load a dataframe of every recorded arrival at a single stop on the M100 bus route, in one of the route's two directions, over a single month.

Note that the month has relatively few entries (107) because data was lost earlier in the cleaning process; further cleaning work is needed.

In [56]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

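# Skip malformed rows (on_bad_lines requires pandas >= 1.3; it replaces error_bad_lines=False, removed in pandas 2.0)
mta_df = pd.read_csv('M100_month_W125_st.csv', on_bad_lines='skip')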
mta_df.shape

(107, 21)

The dataframe is then sorted by 'RecordedAtTime'. Because every entry was recorded while the bus was at the stop, these timestamps represent the actual arrival times.

In [58]:
mta_df['RecordedAtTime'] = pd.to_datetime(mta_df['RecordedAtTime'])
mta_df.sort_values("RecordedAtTime", inplace=True)
mta_df.head()

Unnamed: 0,RecordedAtTime,DirectionRef,PublishedLineName,OriginName,OriginLat,OriginLong,DestinationName,DestinationLat,DestinationLong,VehicleRef,...,VehicleLocation.Longitude,NextStopPointName,ArrivalProximityText,DistanceFromStop,ExpectedArrivalTime,ScheduledArrivalTime,time_delta,time_delta_mins,time_diff,time_diff_mins
0,2017-08-01 07:51:49,0,M100,1 AV/125 ST,40.801968,-73.931358,INWOOD 220 ST via AMSTERDAM via BWAY,40.871902,-73.913101,NYCT_8368,...,-73.952819,W 125 ST/ST NICHOLAS AV,at stop,21.0,2017-08-01 07:52:03,07:49:06,,0,,0
1,2017-08-01 07:51:49,0,M100,1 AV/125 ST,40.801968,-73.931358,INWOOD 220 ST via AMSTERDAM via BWAY,40.871902,-73.913101,NYCT_8368,...,-73.952819,W 125 ST/ST NICHOLAS AV,at stop,21.0,2017-08-01 07:52:03,07:50:48,0 days 00:00:00.000000000,0,0 days 00:00:00.000000000,0
2,2017-08-01 07:51:49,0,M100,1 AV/125 ST,40.801968,-73.931358,INWOOD 220 ST via AMSTERDAM via BWAY,40.871902,-73.913101,NYCT_8368,...,-73.952819,W 125 ST/ST NICHOLAS AV,at stop,21.0,2017-08-01 07:52:03,07:51:39,0 days 00:00:00.000000000,0,0 days 00:00:00.000000000,0
3,2017-08-01 11:02:25,0,M100,1 AV/125 ST,40.801968,-73.931358,INWOOD 220 ST via AMSTERDAM via BWAY,40.871902,-73.913101,NYCT_8384,...,-73.952928,W 125 ST/ST NICHOLAS AV,at stop,11.0,2017-08-01 11:02:42,11:04:39,0 days 03:10:39.000000000,191,0 days 03:10:39.000000000,191
4,2017-08-01 13:42:23,0,M100,1 AV/125 ST,40.801968,-73.931358,INWOOD 220 ST via AMSTERDAM via BWAY,40.871902,-73.913101,NYCT_8391,...,-73.952865,W 125 ST/ST NICHOLAS AV,at stop,17.0,2017-08-01 13:43:07,13:35:20,0 days 02:40:25.000000000,161,0 days 02:40:25.000000000,161
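
The first three rows above share the same 'RecordedAtTime' and 'VehicleRef', which suggests that one physical arrival can appear more than once. Here is a minimal sketch of collapsing such rows, under the assumption (not confirmed by the source) that a matching timestamp and vehicle identify a single arrival. The `sample` frame is a hypothetical stand-in built from a few of the values shown above, not the real mta_df.

```python
import pandas as pd

# Hypothetical stand-in for mta_df: three records of one vehicle at the same
# instant, followed by a later arrival of a different vehicle.
sample = pd.DataFrame({
    'RecordedAtTime': pd.to_datetime([
        '2017-08-01 07:51:49', '2017-08-01 07:51:49',
        '2017-08-01 07:51:49', '2017-08-01 11:02:25',
    ]),
    'VehicleRef': ['NYCT_8368', 'NYCT_8368', 'NYCT_8368', 'NYCT_8384'],
    'ScheduledArrivalTime': ['07:49:06', '07:50:48', '07:51:39', '11:04:39'],
})

# Assumption: rows that agree on both timestamp and vehicle are one arrival.
# keep='first' retains the earliest-listed row of each group.
unique_arrivals = sample.drop_duplicates(
    subset=['RecordedAtTime', 'VehicleRef'], keep='first'
)
print(unique_arrivals)
```

Whether to deduplicate at all depends on what the repeated rows mean. Here they differ in 'ScheduledArrivalTime', so dropping them discards schedule information that may matter for other analyses.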


The arrivals dataframe is initialized with a fixed number of randomly chosen passenger arrival times. Both the number of entries and the frequency (granularity) of the candidate times are set in the call to select_random_dates.

In [59]:
def select_random_dates(frequency, NumDataPoints):
    # Candidate times on a regular grid; the first and last grid points are excluded below
    date_range = pd.date_range(start='2017-08-01', end='2017-08-30', freq=frequency)
    # Sample without replacement so that no passenger time is repeated
    random_dates = pd.to_datetime(
        np.random.choice(date_range[1:-1], size=NumDataPoints, replace=False)
    )
    return random_dates

arrivals_df = pd.DataFrame()
arrivals_df['PassengerTime'] = select_random_dates('1min', 10)
arrivals_df.head(10)

Unnamed: 0,PassengerTime
0,2017-08-14 20:14:00
1,2017-08-18 20:35:00
2,2017-08-01 15:14:00
3,2017-08-26 05:15:00
4,2017-08-26 23:08:00
5,2017-08-26 06:49:00
6,2017-08-17 12:00:00
7,2017-08-10 13:40:00
8,2017-08-19 02:50:00
9,2017-08-19 15:11:00
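
Because select_random_dates draws from NumPy's global random state, each run of the notebook produces different passenger times. Here is a minimal sketch of a reproducible variant. The function name `select_random_dates_seeded` and its `seed` parameter are hypothetical additions, not part of the original notebook. Like the original, it samples from a grid that ends at midnight at the start of 30 August.

```python
import numpy as np
import pandas as pd

def select_random_dates_seeded(frequency, num_points, seed=0):
    """Sample distinct timestamps from a fixed-frequency grid, reproducibly."""
    date_range = pd.date_range(start='2017-08-01', end='2017-08-30', freq=frequency)
    # A local generator keeps the draw repeatable without touching global state.
    rng = np.random.default_rng(seed)
    # Exclude the first and last grid points, as the original function does.
    picked = rng.choice(date_range[1:-1].to_numpy(), size=num_points, replace=False)
    return pd.DatetimeIndex(picked)

sample_times = select_random_dates_seeded('1min', 10, seed=42)
print(sample_times)
```

Calling the function again with the same seed returns the same ten timestamps, which makes the downstream wait-time table comparable across runs.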


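For each random passenger arrival time defined above, we find the next bus to arrive and compute the time delta between the two, which represents the passenger's wait time. Given the sparse bus data noted earlier, several of the resulting waits span many hours.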

In [64]:
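def findNextBus(arrivals_df, mta_df):
    # Assumes mta_df is already sorted by 'RecordedAtTime' (done above)
    arrivals_df['NextBus'] = pd.NaT  # stays NaT if no later bus exists
    for arrivalIndex, arrivalRow in arrivals_df.iterrows():
        for mtaIndex, mtaRow in mta_df.iterrows():
            # The first bus recorded strictly after the passenger arrives is the next bus
            if mtaRow['RecordedAtTime'] > arrivalRow['PassengerTime']:
                arrivals_df.loc[arrivalIndex, 'NextBus'] = mtaRow['RecordedAtTime']
                break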

findNextBus(arrivals_df, mta_df)
arrivals_df['WaitTime'] = arrivals_df['NextBus'] - arrivals_df['PassengerTime']
arrivals_df.head(10)

Unnamed: 0,PassengerTime,NextBus,WaitTime
0,2017-08-14 20:14:00,2017-08-15 10:07:05,0 days 13:53:05
1,2017-08-18 20:35:00,2017-08-20 06:35:28,1 days 10:00:28
2,2017-08-01 15:14:00,2017-08-01 15:53:02,0 days 00:39:02
3,2017-08-26 05:15:00,2017-08-26 12:07:20,0 days 06:52:20
4,2017-08-26 23:08:00,2017-08-27 07:18:07,0 days 08:10:07
5,2017-08-26 06:49:00,2017-08-26 12:07:20,0 days 05:18:20
6,2017-08-17 12:00:00,2017-08-17 14:11:18,0 days 02:11:18
7,2017-08-10 13:40:00,2017-08-10 18:19:26,0 days 04:39:26
8,2017-08-19 02:50:00,2017-08-20 06:35:28,1 days 03:45:28
9,2017-08-19 15:11:00,2017-08-20 06:35:28,0 days 15:24:28
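
The nested loops in findNextBus compare every passenger time against every bus record, which becomes slow as either table grows. One possible vectorized alternative is `pd.merge_asof` with a forward search, sketched below. The function name `find_next_bus_asof` and the toy frames are hypothetical; the bus and passenger times are taken from rows shown above, plus two made-up passenger times to exercise the edge cases.

```python
import pandas as pd

def find_next_bus_asof(passengers, buses):
    """Attach the first bus strictly after each passenger time, plus the wait."""
    # merge_asof requires both key columns to be sorted and free of nulls.
    left = passengers[['PassengerTime']].sort_values('PassengerTime')
    right = pd.DataFrame(
        {'NextBus': buses['RecordedAtTime'].dropna().sort_values().to_numpy()}
    )
    original_index = left.index
    merged = pd.merge_asof(
        left.reset_index(drop=True), right,
        left_on='PassengerTime', right_on='NextBus',
        direction='forward',        # search forward in time
        allow_exact_matches=False,  # strictly later, like the `>` in findNextBus
    )
    merged.index = original_index
    merged = merged.sort_index()    # restore the passengers' original row order
    merged['WaitTime'] = merged['NextBus'] - merged['PassengerTime']
    return merged

buses = pd.DataFrame({'RecordedAtTime': pd.to_datetime([
    '2017-08-01 11:02:25', '2017-08-01 13:42:23', '2017-08-01 15:53:02',
])})
passengers = pd.DataFrame({'PassengerTime': pd.to_datetime([
    '2017-08-01 15:14:00',  # from the table above
    '2017-08-01 16:00:00',  # made up: later than every bus in the toy data
    '2017-08-01 13:42:23',  # made up: coincides exactly with a bus record
])})
result = find_next_bus_asof(passengers, buses)
print(result)
```

A passenger who arrives after the last recorded bus gets `NaT` for both 'NextBus' and 'WaitTime', so such rows are easy to filter out before computing summary statistics.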
