Delays original data source: https://www1.toronto.ca/wps/portal/contentonly?vgnextoid=fa6be8c5a612c510VgnVCM10000071d60f89RCRD

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
delays = pd.read_csv('csv_originals/delays.csv', encoding = "ISO-8859-1")

In [3]:
delays.head()

Unnamed: 0,Date,Time,Day,Station,Code,Min Delay,Min Gap,Bound,Line,Vehicle
0,1/1/2014,02:06,Wednesday,HIGH PARK STATION,SUDP,3,7,W,BD,5001
1,1/1/2014,02:40,Wednesday,SHEPPARD STATION,MUNCA,0,0,,YU,0
2,1/1/2014,03:10,Wednesday,LANSDOWNE STATION,SUDP,3,8,W,BD,5116
3,1/1/2014,03:20,Wednesday,BLOOR STATION,MUSAN,5,10,S,YU,5386
4,1/1/2014,03:29,Wednesday,DUFFERIN STATION,MUPAA,0,0,E,BD,5174


In [4]:
delays.isnull().any()
## Looks like Code, Bound and Line contain NaNs

Date         False
Time         False
Day          False
Station      False
Code          True
Min Delay    False
Min Gap      False
Bound         True
Line          True
Vehicle      False
dtype: bool

## Date

In [5]:
dates = pd.to_datetime(delays['Date'])

In [6]:
years = dates.dt.year
months = dates.dt.month
days = dates.dt.day
years.unique(), months.unique(), days.unique()

(array([2014, 2015, 2016, 2017]),
 array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]),
 array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]))

The years fall within the expected 2014-2017 range and the months and days look reasonable, so no cleanup is needed for *Date*.

## Time

In [7]:
hours = pd.to_datetime(delays['Time'], format='%H:%M').dt.hour
minutes = pd.to_datetime(delays['Time'], format='%H:%M').dt.minute
hours.min(), hours.max(), minutes.min(), minutes.max()

(0, 23, 0, 59)

The hours are within 0-23 and the minutes are within 0-59, so no cleanup is needed for *Time*.
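
A stricter variant of this check can count unparseable values instead of relying on the parse not raising. The sketch below uses a small made-up Series (not the real `Time` column) and `errors='coerce'`, which turns values that do not match `%H:%M` into `NaT`.

```python
import pandas as pd

# Toy values standing in for delays['Time']; the last two are deliberately invalid
sample_times = pd.Series(['02:06', '23:59', '24:10', 'abc'])

# errors='coerce' yields NaT instead of raising on values that do not match %H:%M
parsed = pd.to_datetime(sample_times, format='%H:%M', errors='coerce')

# Count the rows that failed to parse
num_bad_times = int(parsed.isna().sum())
print(num_bad_times)
```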

## Day

In [8]:
delays['Day'].unique()

array(['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday',
       'Tuesday'], dtype=object)

The day names look reasonable, so no cleanup is needed.

Do the values in the *Day* column match the weekday implied by *Date*?

In [9]:
dates_day = delays.copy()[['Day', 'Date']]
dates_day['Date'] = pd.to_datetime(dates_day['Date'])
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() is the replacement
dates_day['Date Day'] = dates_day['Date'].dt.day_name()
dates_day[dates_day['Date Day'] != dates_day['Day']]

Unnamed: 0,Day,Date,Date Day


The result is empty, so the *Day* column matches the weekday of every date and no cleanup is needed here.
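
Because the real check returns no rows, it can help to see what a mismatch would look like. This is a minimal sketch on made-up rows, assuming the dates are month/day/year; the third `Day` value is deliberately wrong.

```python
import pandas as pd

# Toy frame; the last Day is intentionally inconsistent with its Date
toy = pd.DataFrame({
    'Date': ['1/1/2014', '1/2/2014', '1/3/2014'],
    'Day': ['Wednesday', 'Thursday', 'Monday'],
})

# Derive the weekday name from the date (assumed month/day/year order)
computed_day = pd.to_datetime(toy['Date'], format='%m/%d/%Y').dt.day_name()

# Rows where the recorded day disagrees with the computed one
mismatches = toy[computed_day != toy['Day']]
print(mismatches)
```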

## Station

In [10]:
delays['Station'].drop_duplicates().sort_values()

41188             (APPROACHING)
41018    0SSINGTON STATION (ENT
70435       169 DANFORTH AVENUE
39741    401 EMERGENCY EE (SHEP
42551     APPROACHING DOWNSVIEW
47634     APPROACHING HIGH PARK
2531     APPROACHING KEELE STAT
59133    APPROACHING KENNEDY BD
39046      APPROACHING LAWRENCE
66605    APPROACHING LAWRENCE S
37369    APPROACHING LESLIE STA
51425     APPROACHING OSSINGTON
26048    APPROACHING VICTORIA P
60229        APPROACHING WARDEN
44476    APPROACHING WARDEN STA
47937    APPROACHING WILSON STA
20309         APPROACHING YONGE
48403      APPROCHING DOWNSVIEW
59520        APROACHING MCCOWAN
27166        ASQUITH SUBSTATION
22806     BATHURST (APPROACHING
10064           BATHURST STAION
158            BATHURST STATION
17516    BATHURST STATION (APPR
15497    BATHURST STATION (ENTE
16813    BATHURST STATION (EXIT
17325    BATHURST STATION (IN T
18515    BATHURST STATION (LEAV
13325    BATHURST STATION (WEST
67072    BATHURST STATION - APP
                  ...          
58350   

In [11]:
# Source https://www1.toronto.ca/wps/portal/contentonly?vgnextoid=c077c316f16e8410VgnVCM10000071d60f89RCRD&vgnextchannel=7807e03bb8d1e310VgnVCM10000071d60f89RCRD
ttc_stations = pd.read_csv('stations.csv')
ttc_stations.head(2)

FileNotFoundError: File b'stations.csv' does not exist

The stations in the above data all have their line in brackets, which will cause issues when comparing them to the delay data.

Strip the bracketed line from each station name. The delay data has the exact line, so we can use that to differentiate the stations.

In [None]:
ttc_stations['Short Name'] = ttc_stations['Subway/RT Station'].dropna().str.split('(').str.get(0).str.upper().str.strip()

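# Fix up SHEPPARD and MAIN to match the delay data better.
# Use .loc with the column name so only 'Short Name' is overwritten, not the whole row.
ttc_stations.loc[ttc_stations['Short Name'] == 'SHEPPARD-YONGE', 'Short Name'] = 'SHEPPARD'
ttc_stations.loc[ttc_stations['Short Name'] == 'MAIN STREET', 'Short Name'] = 'MAIN'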

# Drop any NaNs and grab just the uniques
ttc_stations_names = ttc_stations['Short Name'].dropna().unique()

# Some stations have a 'West' version and we want to test those first so that
# St. Clair West is not misclassified as St. Clair. Reverse sorting the names
# puts the longer name ahead of its prefix.
ttc_stations_names = np.sort(ttc_stations_names)[::-1]

ttc_stations_names
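
The effect of the reverse sort can be checked on a few names. This is a sketch with a hand-picked list, not the real station file. Reverse alphabetical order is enough when the short name is a prefix of the long one; sorting by length is shown as a possible alternative that does not depend on the prefix relationship.

```python
import numpy as np

# Hand-picked names for illustration only
names = np.array(['ST. CLAIR', 'DUNDAS WEST', 'ST. CLAIR WEST', 'DUNDAS'])

# Same approach as the notebook: alphabetical sort, then reversed
reverse_alpha = np.sort(names)[::-1].tolist()

# Alternative (an assumption, not the notebook's method): longest names first
longest_first = sorted(names.tolist(), key=len, reverse=True)

print(reverse_alpha)
print(longest_first)
```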

Save the original stations and count how many there were

In [None]:
delays['Station_original'] = delays['Station']
stations = delays['Station']
# 76801 entries. Note that count() counts non-null rows, not distinct station names
stations.count()

Fix up some inconsistencies in the delay station names. The 'ST' prefixes need to be normalized and some common typos fixed.

In [None]:
import re

def fix_station(station):
    if station.startswith('ST'):
        # Normalize a leading 'ST.' or 'ST ' to 'ST. ' without touching
        # later words that merely end in 'ST' (e.g. 'WEST ')
        return re.sub(r'^ST[.\s]\s*', 'ST. ', station)
    elif station in ('NORTH YORK CTR STATION', 'NORTH YORK CENTER', 'NORTH YORK CENTER STAT'):
        return 'NORTH YORK CENTRE'
    elif '0SSINGTON' in station:
        return station.replace('0SSINGTON', 'OSSINGTON')
    elif 'BESSARIAN' in station or 'BESSARRION' in station:
        return 'BESSARION'
    elif 'BUTHURST' in station:
        return 'BATHURST'
    elif ('SCAR' in station or 'SCAB' in station) and 'RAPID' not in station:
        # Parentheses matter: 'and' binds tighter than 'or', so without them
        # the RAPID exclusion would apply only to the last alternative
        return 'SCARBOROUGH CENTRE'
    elif 'DOWNVIEW' in station:
        return 'DOWNSVIEW'
    else:
        return station

# Store the newly fixed stations in their own column
delays['Station_Fixed'] = delays['Station_original'].apply(fix_station)
delays

Now that we have a list of known stations and a cleaner list of delay stations, let's take our best guess at the 'Normalized Station'.

**Warning:** sometimes the above code is slow or gets stuck. If that happens, kill the kernel and restart.

In [None]:
def estimate_station(original_station):
    for station_name in ttc_stations_names:        
        if station_name in original_station:
            return station_name
        
    return np.nan
delays['Station'] = delays['Station_Fixed'].apply(estimate_station)
delays.head()
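
If the row-by-row `apply` is slow, one option is to run the substring search once per distinct raw name and then broadcast the result with `map`. The sketch below uses hypothetical stand-ins (`known_names`, `raw`, `match_station`) for `ttc_stations_names`, `delays['Station_Fixed']` and `estimate_station`; it is an illustration under those assumptions, not a measured speed-up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for ttc_stations_names and delays['Station_Fixed']
known_names = np.sort(np.array(['ST. CLAIR', 'ST. CLAIR WEST', 'BLOOR']))[::-1]
raw = pd.Series(['ST. CLAIR WEST STATION', 'BLOOR STATION',
                 'ST. CLAIR WEST STATION', 'TRANSIT CONTROL'])

def match_station(raw_name):
    """Return the first known name contained in raw_name, or NaN if none match."""
    for name in known_names:
        if name in raw_name:
            return str(name)
    return np.nan

# Search once per distinct raw value, then look the answer up for every row
lookup = {value: match_station(value) for value in raw.unique()}
matched = raw.map(lookup)
print(matched)
```

With many repeated station strings, the loop over known names runs only as many times as there are distinct raw values.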

Next, let's check the stations that are still unknown. We also exclude rows where the Line is null, since those data points likely have issues with their stations as well.

In [None]:
unknown_station = delays[delays['Station'].isnull() & (delays['Line'].notnull())]
# Number of distinct unmatched station names, then number of unmatched rows
print(unknown_station['Station_Fixed'].nunique())
print(len(unknown_station))
unknown_station.groupby('Station_Fixed').size().sort_values(ascending=False)

It looks like we weren't able to classify 342 of the original stations, but from a quick glance some of these aren't real stations (like TRANSIT CONTROL, DANFORTH DIVISION).

Interestingly, some of these seem to be entire lines or systems (SYSTEM WIDE, SRT LINE), which may be worth looking at separately from the stations.

In [None]:
# Quick glance at the final stations
delays.groupby('Station').size()

## Line

In [None]:
delays['Line'].drop_duplicates()

In [None]:
# Similar to station store the original lines
delays['Line_ori'] = delays['Line']

Based on the metadata that came with the data, we know there are only 4 real lines, so we can map these to their full names. Any value not in the mapping becomes null.

In [None]:
expected_lines = {'BD': 'Bloor-Danforth', 'YU': 'Yonge-University', 'SHP' : 'Sheppard', 'SRT' : 'Scarborough RT'}
delays['Line'] = delays['Line_ori'].map(expected_lines)
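
Before trusting the mapping, it can be useful to list which raw values it discards. This is a small sketch on a made-up Series; `'B/D'` is a hypothetical bad value, not one taken from the data.

```python
import pandas as pd

expected_lines = {'BD': 'Bloor-Danforth', 'YU': 'Yonge-University',
                  'SHP': 'Sheppard', 'SRT': 'Scarborough RT'}

# Toy values standing in for delays['Line_ori']
toy_lines = pd.Series(['BD', 'YU', 'B/D', None, 'SRT'])

# Keys missing from the dict map to NaN
mapped = toy_lines.map(expected_lines)

# Raw values that were present but did not survive the mapping
dropped = toy_lines[mapped.isna() & toy_lines.notna()]
print(dropped.value_counts())
```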

## Intersection Stations

We need to rename the stations that sit at the intersection of two lines (for example Bloor/Yonge and Sheppard), so that each line's platform gets its own name.

In [None]:
exchange_stations = delays[delays['Station'].isin(['BLOOR', 'SHEPPARD', 'YONGE', 'ST. GEORGE', 'KENNEDY', 'SPADINA'])]
exchange_stations.groupby(['Station', 'Line']).size()

In [None]:
# Named differently from the earlier fix_station(station) so that re-running
# the station clean-up cell does not pick up this three-argument version
def rename_station_on_line(station, line, new_name):
    delays.loc[(delays['Station'] == station) & (delays['Line'] == line), 'Station'] = new_name

In [None]:
rename_station_on_line('BLOOR', 'Bloor-Danforth', 'BLOOR-YONGE - BD')
rename_station_on_line('BLOOR', 'Yonge-University', 'BLOOR-YONGE - YU')
rename_station_on_line('YONGE', 'Bloor-Danforth', 'BLOOR-YONGE - BD')
rename_station_on_line('YONGE', 'Yonge-University', 'BLOOR-YONGE - YU')

rename_station_on_line('SHEPPARD', 'Yonge-University', 'SHEPPARD - YU')
rename_station_on_line('SHEPPARD', 'Sheppard', 'SHEPPARD - SHP')
rename_station_on_line('YONGE', 'Sheppard', 'SHEPPARD - SHP')

rename_station_on_line('SPADINA', 'Yonge-University', 'SPADINA - YU')
rename_station_on_line('SPADINA', 'Bloor-Danforth', 'SPADINA - BD')

rename_station_on_line('ST. GEORGE', 'Yonge-University', 'ST. GEORGE - YU')
rename_station_on_line('ST. GEORGE', 'Bloor-Danforth', 'ST. GEORGE - BD')

rename_station_on_line('KENNEDY', 'Scarborough RT', 'KENNEDY - SRT')
rename_station_on_line('KENNEDY', 'Bloor-Danforth', 'KENNEDY - BD')

In [None]:
exchange_stations = delays[delays['Station'].isin(['BLOOR', 'SHEPPARD', 'YONGE', 'ST. GEORGE', 'KENNEDY', 'SPADINA'])]
exchange_stations.groupby(['Station', 'Line']).size()

## Bound

The metadata says that Bound should be a direction (N/S/E/W), so we can likely drop the B/R/Y values, which may have been typos.

In [None]:
delays.groupby('Bound').size()

In [None]:
expected_directions = {'E': 'East', 'N': 'North', 'W':'West', 'S':'South'}
delays['Bound'] = delays['Bound'].map(expected_directions)
delays.groupby('Bound').size()

We also know something about the lines: the Yonge-University line only goes North/South, while Bloor-Danforth, Sheppard and Scarborough RT only go East/West, so we can NaN any values that don't match that.

In [None]:
delays_with_line = delays[delays['Line'].notnull()]
num_non_null_bound = len(delays_with_line[delays_with_line['Bound'].notnull()])
num_null_bound = len(delays_with_line[delays_with_line['Bound'].isnull()])
total_delay = len(delays_with_line)

print('Number of delays with non-null Bound: ', num_non_null_bound)
print('Number of delays with null Bound: ', num_null_bound)
print('Percentage of null Bound: ', num_null_bound / total_delay * 100)

In [None]:
# Fix up the YU line
non_null_yu = delays[(delays['Line'] == 'Yonge-University') & (delays['Bound'].notnull())]

invalid_yu_mask = ~non_null_yu['Bound'].isin(['North','South'])
invalid_yu = non_null_yu[invalid_yu_mask]
num_invalid_yu = len(invalid_yu)
print('Number of Invalid Bounds on the Yonge/University line ', num_invalid_yu)
print('Number of entries on the Yonge/University line ', len(non_null_yu))
print('Percent of Yonge/University line Bound being removed: ', num_invalid_yu/len(non_null_yu) * 100)

Looks like 0.23% of the non-null Bound values on the Yonge-University line were recorded with the wrong direction.

In [None]:
# Convert all the East/West to NaN
yu_mask = (delays['Line'] == 'Yonge-University') & (delays['Bound'].notnull()) &(~delays['Bound'].isin(['North','South']))
delays.loc[yu_mask, 'Bound'] = np.nan
delays[yu_mask]

In [None]:
non_null_other = delays[(delays['Line'] != 'Yonge-University') & (delays['Line'].notnull()) & (delays['Bound'].notnull())]

invalid_other = non_null_other[~non_null_other['Bound'].isin(['East','West'])]
num_invalid_other = len(invalid_other)
print('Number of Invalid Bounds on the Other Lines ', num_invalid_other)
print('Number of entries on the Other Lines ', len(non_null_other))
print('Percent of Other Line Bound being removed: ', num_invalid_other/len(non_null_other) * 100)

In [None]:
# Convert all the North/South to NaN
other_mask = (delays['Line'] != 'Yonge-University') & (delays['Line'].notnull()) & (delays['Bound'].notnull()) &(~delays['Bound'].isin(['East','West']))
delays.loc[other_mask, 'Bound'] = np.nan
delays[other_mask]

Looks like we will end up removing about 6% of the Bound values on the other lines.
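
The two per-line rules above could also be written as one table of allowed directions and a single loop. This is a sketch on a made-up frame, following the notebook's assumption that only Yonge-University runs North/South; the name `valid_bounds` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Allowed directions per line, as assumed in the notebook
valid_bounds = {
    'Yonge-University': ['North', 'South'],
    'Bloor-Danforth': ['East', 'West'],
    'Sheppard': ['East', 'West'],
    'Scarborough RT': ['East', 'West'],
}

# Toy rows: two are inconsistent, one has no line and is left untouched
toy = pd.DataFrame({
    'Line': ['Yonge-University', 'Yonge-University', 'Bloor-Danforth', 'Sheppard', None],
    'Bound': ['North', 'East', 'West', 'South', 'East'],
})

for line, allowed in valid_bounds.items():
    # Non-null bounds on this line that are not in the allowed set
    bad = (toy['Line'] == line) & toy['Bound'].notnull() & ~toy['Bound'].isin(allowed)
    toy.loc[bad, 'Bound'] = np.nan

print(toy)
```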

## Vehicle

Empty vehicle numbers seem to be treated as 0 but NaN or None is likely more appropriate here

In [None]:
delays['Vehicle'] = delays['Vehicle'].replace(0, np.nan)
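
One side effect worth knowing: introducing NaN turns an integer column into floats. If integer vehicle numbers are preferred, pandas' nullable `Int64` dtype is one option. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy values standing in for delays['Vehicle']
toy_vehicle = pd.Series([5001, 0, 5116])

# Replacing 0 with NaN upcasts the column to float64
as_float = toy_vehicle.replace(0, np.nan)

# Optional: the nullable integer dtype keeps whole numbers and shows <NA>
as_nullable = as_float.astype('Int64')

print(as_float.dtype, as_nullable.dtype)
```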

In [None]:
vehicle_grouping = delays.groupby('Vehicle').size()
# vehicle_grouping

In [None]:
# vehicle_grouping.hist()
vehicle_grouping.plot()
plt.show()

## Codes

We received the full code names with the data set, so we can add the description to the table.

In [None]:
codes = pd.read_csv('codes.csv')
# Likely due to an encoding issue but the Code column is 'SUB RMENU CODE' so rename it to Code
codes['Code'] = codes['SUB RMENU CODE']
codes['Code Description'] = codes['CODE DESCRIPTION']

To get the codes into our delays dataframe we need to merge the codes

In [None]:
delays = delays.merge(codes, how='left', on='Code')
delays
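
A left merge can be made a little safer with two optional `merge` arguments: `validate='many_to_one'` raises if a code appears twice in the lookup table (which would duplicate delay rows), and `indicator=True` adds a `_merge` column marking rows with no match. The frames and the description text below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the delays and codes tables
toy_delays = pd.DataFrame({'Code': ['SUDP', 'MUNCA', 'SUDP']})
toy_codes = pd.DataFrame({'Code': ['SUDP'],
                          'Code Description': ['Hypothetical description']})

merged = toy_delays.merge(
    toy_codes, how='left', on='Code',
    validate='many_to_one',  # each code may appear only once in toy_codes
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Codes that found no description
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Code'].unique().tolist()
print(unmatched)
```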

Let's see whether any codes didn't get translated.

In [None]:
# Let's see if any codes don't have a description
no_description = delays[delays['Code Description'].isnull()]
print('Number of non decoded values: ', len(no_description))

# Reverse sort by the number of entries of each of this code
no_description.groupby('Code').size().sort_values(ascending=False)

Looks like most of these are one-offs, with the exception of *MUNCA*, which has 1561 values!
One guess is that this was meant to be MUNOA (No Operator Immediately Available - Not E.S.A. Related), or it could simply be missing from the code list.

In [None]:
# List out the codes with the most common one first
delays.groupby('Code Description').size().sort_values(ascending=False).head(20)

In [None]:
# Not really cleanup, I was just curious how many of these were 'Passenger' related
filled = delays['Code Description'].fillna('')
pass_related = filled[filled.str.contains('Passenger')].unique()

In [None]:
pass_delays = delays[delays['Code Description'].isin(pass_related)]
print(len(pass_delays))
pass_delays.groupby('Code Description').size()
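
As a side note, the `fillna('')` step can be avoided by passing `na=False` to `str.contains`, which treats missing descriptions as non-matches. A small sketch with made-up descriptions:

```python
import pandas as pd

# Made-up descriptions; None stands in for a code without a description
toy_desc = pd.Series(['Passenger Assistance Alarm', None, 'Door Problems'])

# na=False makes missing values count as "does not contain"
is_passenger = toy_desc.str.contains('Passenger', na=False)
print(is_passenger.tolist())
```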

In [None]:
# Remove some not really useful columns before exporting (CODE DESCRIPTION is a duplicate)
delays = delays.drop(['Station_Fixed', 'SUB RMENU CODE', 'CODE DESCRIPTION'], axis=1)

In [None]:
delays.head()

In [None]:
delays.to_csv('ttc_delays_cleaned.csv', index=False)

In [None]:
delays['Min Gap'].describe()

In [None]:
delays['Min Delay'].describe()

In [None]:
cleaned = pd.read_csv('ttc_delays_cleaned.csv')

In [None]:
cleaned.head()