# SEPTA Data Project
#### William McKee
#### December 2017

SEPTA (Southeastern Pennsylvania Transportation Authority) is the public agency responsible for the public transportation system in Philadelphia and its Pennsylvania suburbs.

This code analyzes the SEPTA Bus and Rail data sets downloaded from https://transitfeeds.com.  I downloaded the SEPTA Bus zip file and renamed it from gfts.zip to septa_bus_gfts.zip, then downloaded the SEPTA Rail zip file and renamed it from gfts.zip to septa_rail_gfts.zip.

## Data Set Conversion

The code below lists the files in each zip archive, prints the first few lines of each file, extracts the archive into its own directory, and converts the extracted .txt files to .csv files.

In [1]:
import zipfile
import csv
import os

def read_and_print_first_lines_from_zipped_file(zipfilename, limit):
    """
    Reads a zip file and prints the header line plus up to limit data lines
    from each file contained in the zip file (lines are printed as raw bytes)
    zipfilename = zip file name (such as 'example.zip')
    limit = number of data lines to print after the header of each file
    """
    print()
    print("CONTENTS OF ZIP FILE " + zipfilename + ":")
    print()
    with zipfile.ZipFile(zipfilename, 'r') as z:
        file_name_list = sorted(z.namelist())
        for file in file_name_list:
            print(file)
            with z.open(file, 'r') as input_file:
                for line_number, line in enumerate(input_file):
                    if line_number > limit:
                        break
                    print(line)
            print()
    print()

# Loop through zip files
NUM_LINES = 5
ZIP_FILE_NAMES = ['septa_bus_gfts.zip', 'septa_rail_gfts.zip']
DIRECTORY_NAMES = []
for file in ZIP_FILE_NAMES:
    # Read the zip files and display some file contents
    read_and_print_first_lines_from_zipped_file(file, NUM_LINES)

    # Extract zip file contents
    directory_name = os.path.splitext(file)[0]
    DIRECTORY_NAMES.append(directory_name)
    with zipfile.ZipFile(file, 'r') as zip_ref:
        zip_ref.extractall(directory_name)

    # Convert txt files to csv files
    os.chdir(directory_name)
    for input_file in os.listdir('.'):
        # Only convert .txt files, so that a re-run never overwrites a .csv with itself
        if not input_file.endswith(".txt"):
            continue
        with open(input_file, 'r', newline='') as in_file:
            # csv.reader copes with quoted fields that contain commas; skip blank lines
            rows = (row for row in csv.reader(in_file) if row)
            output_file = os.path.splitext(input_file)[0] + ".csv"
            print("Convert " + input_file + " contents to " + output_file)
            with open(output_file, 'w', newline='') as out_file:
                writer = csv.writer(out_file, lineterminator='\n')
                writer.writerows(rows)
            
    # Remove original text files
    for item in os.listdir('.'):
        if item.endswith(".txt"):
            os.remove(item)

    os.chdir('..')


CONTENTS OF ZIP FILE septa_bus_gfts.zip:

agency.txt
b'agency_name,agency_url,agency_timezone,agency_lang,agency_fare_url\r\n'
b'SEPTA,http://www.septa.org,America/New_York,EN,http://www.septa.org/fares/transit/index.html'

calendar.txt
b'service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date\r\n'
b'10,1,1,1,1,1,0,0,20170903,20180224\r\n'
b'11,0,0,0,0,0,0,0,20170903,20180224\r\n'
b'12,0,0,0,0,0,1,0,20170903,20180224\r\n'
b'13,0,0,0,0,0,0,1,20170903,20180224\r\n'
b'16,1,1,1,1,1,0,0,20170903,20180224\r\n'

calendar_dates.txt
b'service_id,date,exception_type\r\n'
b'10,20170904,2\r\n'
b'13,20170904,1\r\n'
b'16,20170904,2\r\n'
b'19,20170904,1\r\n'
b'22,20170904,2\r\n'

fare_attributes.txt
b'fare_id,price,currency_type,payment_method,transfers,transfer_duration\r\n'
b'1,2.50,USD,0,0,0\r\n'
b'2,3.50,USD,0,1,3600\r\n'
b'3,4.50,USD,0,2,3600\r\n'
b'13,7.00,USD,0,0,0\r\n'
b'14,8.00,USD,0,1,3600\r\n'

fare_rules.txt
b'fare_id,origin_id,destination_id\r\n'
b'1,1,1\
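
The `b'...'` prefixes and `\r\n` endings in the output above appear because `z.open` yields raw bytes. As a minimal sketch (not part of the original analysis), the hypothetical helper `read_rows_from_zip` below wraps the zip member in `io.TextIOWrapper` so that `csv.reader` can parse decoded text. The `utf-8-sig` encoding is an assumption about the feed, and the in-memory archive only reuses a few sample lines from the output above.

```python
import csv
import io
import zipfile


def read_rows_from_zip(zip_source, member, limit):
    """Return the header plus up to `limit` data rows of one file inside a zip archive."""
    rows = []
    with zipfile.ZipFile(zip_source, 'r') as z:
        with z.open(member, 'r') as raw:
            # Decode bytes to text; newline='' lets the csv module handle \r\n itself
            text = io.TextIOWrapper(raw, encoding='utf-8-sig', newline='')
            for line_number, row in enumerate(csv.reader(text)):
                if line_number > limit:
                    break
                rows.append(row)
    return rows


# Build a tiny in-memory archive from lines shown in the output above
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('calendar_dates.txt',
               'service_id,date,exception_type\r\n'
               '10,20170904,2\r\n'
               '13,20170904,1\r\n')

sample_rows = read_rows_from_zip(buffer, 'calendar_dates.txt', 1)
print(sample_rows)
```

Passing a file name such as `'septa_bus_gfts.zip'` instead of the buffer should work the same way, since `zipfile.ZipFile` accepts either a path or a file-like object.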

## Data Set Basics

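We explore the CSV files for both bus and rail, starting with the shape (rows, columns) of each file.  Groupings are computed only for the files where a per-key row count is informative: shapes.csv (by shape_id), trips.csv (by route_id), and stop_times.csv (by trip_id).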

In [2]:
import pandas as pd

# Print sizes of CSV files
for directory in DIRECTORY_NAMES:
    os.chdir(directory)
    print("Looking at " + directory + " contents")
    print()
    for input_file in os.listdir('.'):
        print("Description of " + input_file + ":")
        data_set = pd.read_csv(input_file)
        print(data_set.shape)
        print()
    os.chdir('..')

Looking at septa_bus_gfts contents

Description of agency.csv:
(1, 5)

Description of calendar.csv:
(28, 10)

Description of calendar_dates.csv:
(168, 3)

Description of fare_attributes.csv:
(6, 6)

Description of fare_rules.csv:
(9, 3)

Description of routes.csv:
(139, 7)

Description of shapes.csv:
(570717, 4)

Description of stops.csv:
(13701, 8)

Description of stop_times.csv:
(3121225, 5)

Description of transfers.csv:
(1, 4)

Description of trips.csv:
(52000, 7)

Looking at septa_rail_gfts contents

Description of agency.csv:
(1, 6)

Description of calendar.csv:
(5, 10)

Description of calendar_dates.csv:
(2, 3)

Description of fare_attributes.csv:
(0, 6)

Description of fare_rules.csv:
(0, 5)

Description of routes.csv:
(13, 9)

Description of shapes.csv:
(180235, 4)

Description of stops.csv:
(155, 7)

Description of stop_times.csv:
(25082, 7)

Description of transfers.csv:
(3, 3)

Description of trips.csv:
(1711, 8)
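
The loop above moves in and out of each directory with `os.chdir`, which leaves the process in the wrong directory if a read fails midway. A hedged alternative is to build full paths with `os.path.join`. The hypothetical `csv_shape` and `directory_shapes` helpers below use only the standard library and mimic the `(rows, columns)` tuple of `DataFrame.shape` by not counting the header row. The temporary directory and its single file are illustrative only.

```python
import csv
import os
import tempfile


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file, excluding the header row."""
    with open(path, 'r', newline='') as csv_file:
        reader = csv.reader(csv_file)
        header = next(reader, [])
        data_rows = sum(1 for row in reader if row)
    return (data_rows, len(header))


def directory_shapes(directory):
    """Map each .csv file name in `directory` to its shape, without calling os.chdir."""
    shapes = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith('.csv'):
            shapes[name] = csv_shape(os.path.join(directory, name))
    return shapes


# Illustrative data only: rows copied from the calendar_dates output shown earlier
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'calendar_dates.csv'), 'w', newline='') as f:
        f.write('service_id,date,exception_type\n10,20170904,2\n13,20170904,1\n')
    demo_shapes = directory_shapes(demo_dir)

print(demo_shapes)
```

The same `os.path.join(directory, input_file)` pattern can be passed straight to `pd.read_csv`, which avoids changing directories at all.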



In [3]:
# Groupby could provide useful information for some files
GROUPBY_FILES_FIELDS = {'shapes.csv': 'shape_id', 
                        'trips.csv': 'route_id', 
                        'stop_times.csv': 'trip_id'}
for directory in DIRECTORY_NAMES:
    os.chdir(directory)
    print("Looking at " + directory + " groupby contents")
    print()
    for input_file in os.listdir('.'):
        if input_file in GROUPBY_FILES_FIELDS:
            this_field = GROUPBY_FILES_FIELDS[input_file]
            print("Description of " + input_file + " Groupings:")
            data_set = pd.read_csv(input_file)
            data_set_distinct = data_set.groupby(this_field)[this_field].count()
            print(data_set_distinct)
            print()
    os.chdir('..')

Looking at septa_bus_gfts groupby contents

Description of shapes.csv Groupings:
shape_id
203286     296
203287     296
203288     280
203290     195
203291     273
203292     188
203293     289
203305     211
203307     189
203308     204
203310     182
203311      95
203312     288
203313      84
203314      84
203315     338
203316     790
203317     896
203318     790
203319     900
203320     872
203322     305
203323     372
203324     387
203325     790
203326     306
203327     321
203329     381
203330     361
203332     814
          ... 
206251     328
206287     253
206288     276
206289     170
206290     249
206291     249
206293     235
206294     257
206305     251
206308     403
206309     578
206310     484
206311    1180
206312     496
206313    1086
206314     597
206315     409
206317     304
206318     381
206319    1001
206320     990
206321     408
206322     370
206323    1079
206324     457
206325     366
206326     884
206327    1337
206329    1419
208290    
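
Each number in the listing above is the number of rows that share one key value; for shapes.csv, that is the number of points tracing one shape. As a rough plain-Python sketch of the same count, the hypothetical `count_by_field` helper below tallies one column with `collections.Counter`. The sample rows are made up for illustration and are not taken from the feed.

```python
import csv
import io
from collections import Counter


def count_by_field(csv_file, field):
    """Count how many rows share each value of `field`, similar to a groupby count."""
    reader = csv.DictReader(csv_file)
    return Counter(row[field] for row in reader)


# Made-up sample using two of the trips.csv column names (not real SEPTA rows)
sample = io.StringIO(
    'route_id,trip_id\n'
    'TRE,trip_a\n'
    'TRE,trip_b\n'
    'AIR,trip_c\n'
)
trip_counts = count_by_field(sample, 'route_id')
print(trip_counts.most_common())
```

`most_common()` also sorts the keys by count, which the long printed listings above do not do.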

## One Train Route

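Here, I trace a single train, #734 on the Trenton Line, through the rail data: routes.csv identifies the route, trips.csv lists its trips, stop_times.csv gives the schedule, and stops.csv describes the stations.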

In [4]:
# Train directory
os.chdir(DIRECTORY_NAMES[1])

# List route information
routes_data_set = pd.read_csv('routes.csv')

print("Routes Data Set")
print(routes_data_set[['route_id', 'route_short_name', 'route_color']])

Routes Data Set
   route_id          route_short_name route_color
0       AIR              Airport Line      91456C
1       CHE   Chestnut Hill East Line      94763C
2       CHW   Chestnut Hill West Line      00B4B2
3       LAN  Lansdale/Doylestown Line      775B49
4       MED          Media/Elwyn Line      007CC8
5       FOX            Fox Chase Line      FF823D
6       NOR  Manayunk/Norristown Line      EE4C69
7       PAO      Paoli/Thorndale Line      20825C
8       CYN               Cynwyd Line      6F549E
9       TRE              Trenton Line      F683C9
10      WAR           Warminster Line      F7AF42
11      WIL    Wilmington/Newark Line      8AD16B
12      WTR         West Trenton Line      5D5EBC


In [5]:
# Find trips associated with the TRE line
trips_data_set = pd.read_csv('trips.csv')
trips_data_set_tre = trips_data_set.loc[trips_data_set['route_id'] == 'TRE']

print("Trips Data Set for Trenton")
print(trips_data_set_tre[['trip_id', 'service_id', 'trip_headsign', 'block_id', 'shape_id']])

Trips Data Set for Trenton
             trip_id service_id             trip_headsign  block_id  shape_id
1      TRE_717_V77_M         M4                   Trenton       717      7701
5      TRE_723_V77_M         M4                   Trenton       723      7701
11      TRE_773_V5_M         M1                   Trenton       773      7701
25      TRE_705_V5_M         M1                   Trenton       705      7701
74     TRE_9741_V5_M         M1                   Trenton      9741      7701
78     TRE_711_V66_M         M3                   Trenton       711      7701
81     TRE_7218_V5_M         M1  Center City Philadelphia      7218    701007
85     TRE_708_V77_M         M4  Center City Philadelphia       708    701004
104    TRE_1766_V5_M         M1  Center City Philadelphia      1766    701005
123     TRE_774_V5_M         M1  Center City Philadelphia       774    701004
127    TRE_722_V77_M         M4  Center City Philadelphia       722    701004
128    TRE_7406_V5_M         M1  Cent

In [8]:
# Look for rows associated with train #734 on the Trenton Line
# Build the mask from the already-filtered frame so that the indexes match
trips_data_set_tre_734 = trips_data_set_tre.loc[trips_data_set_tre['block_id'] == 734]

print("Trips Data Set for Trenton Train #734")
print(trips_data_set_tre_734[['trip_id', 'service_id', 'trip_headsign', 'block_id', 'shape_id']])

Trips Data Set for Trenton Train #734
            trip_id service_id             trip_headsign  block_id  shape_id
1009  TRE_734_V66_M         M3  Center City Philadelphia       734    701004
1152   TRE_734_V5_M         M1  Center City Philadelphia       734    701004
1445  TRE_734_V77_M         M4  Center City Philadelphia       734    701004


In [13]:
# Obtain the schedule for train #734 on the Trenton Line
trip_ids = trips_data_set_tre_734['trip_id'].tolist()

stop_times_data_set = pd.read_csv('stop_times.csv')
stop_times_data_set_tre_734 = stop_times_data_set[stop_times_data_set['trip_id'].isin(trip_ids)]

print("Stop Times Data Set for Trenton Train #734")
print(stop_times_data_set_tre_734[['trip_id', 'arrival_time', 'stop_id', 'stop_sequence']])

Stop Times Data Set for Trenton Train #734
             trip_id arrival_time  stop_id  stop_sequence
19721   TRE_734_V5_M     10:43:00    90701              1
19722   TRE_734_V5_M     10:50:00    90702              4
19723   TRE_734_V5_M     10:54:00    90703              6
19724   TRE_734_V5_M     10:58:00    90704              7
19725   TRE_734_V5_M     11:00:00    90705              8
19726   TRE_734_V5_M     11:02:00    90706              9
19727   TRE_734_V5_M     11:05:00    90707             11
19728   TRE_734_V5_M     11:09:00    90708             12
19729   TRE_734_V5_M     11:10:00    90709             13
19730   TRE_734_V5_M     11:13:00    90710             15
19731   TRE_734_V5_M     11:20:00    90711             17
19746   TRE_734_V5_M     11:33:00    90004             27
19747  TRE_734_V66_M     22:59:00    90701              1
19748  TRE_734_V66_M     23:07:00    90702              4
19749  TRE_734_V66_M     23:10:00    90703              6
19750  TRE_734_V66_M     23:1
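
The `arrival_time` values are `HH:MM:SS` strings, and the GTFS format allows hours past 24 for trips that run beyond midnight, so they are easier to compare once converted to seconds. A small sketch, using the hypothetical helper `gtfs_time_to_seconds` and the first and last V5 times from the table above:

```python
def gtfs_time_to_seconds(time_text):
    """Convert an HH:MM:SS string (hours may exceed 23) to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in time_text.strip().split(':'))
    return hours * 3600 + minutes * 60 + seconds


# First and last arrival times of trip TRE_734_V5_M from the table above
departure = gtfs_time_to_seconds('10:43:00')
arrival = gtfs_time_to_seconds('11:33:00')
trip_minutes = (arrival - departure) // 60
print(trip_minutes)  # 50
```

Keeping the values as plain seconds, rather than clock times, means a stop at `25:10:00` still sorts after one at `23:59:00`.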

In [27]:
# Focus on trip V5
stop_times_data_set_tre_734_v5 = stop_times_data_set_tre_734[stop_times_data_set_tre_734['trip_id'] == 'TRE_734_V5_M']

# Obtain the list of stops for train #734 for Trenton Line
stop_ids = stop_times_data_set_tre_734_v5['stop_id'].tolist()
print(stop_ids)

stops_data_set = pd.read_csv('stops.csv')
stops_data_set_tre_734_v5 = stops_data_set[stops_data_set['stop_id'].isin(stop_ids)]

print("Stops Data Set for Trenton Train #734 V5")
print(stops_data_set_tre_734_v5[['stop_id', 'stop_name', 'stop_lat', 'stop_lon', 'zone_id']])

[90701, 90702, 90703, 90704, 90705, 90706, 90707, 90708, 90709, 90710, 90711, 90004]
Stops Data Set for Trenton Train #734 V5
     stop_id                  stop_name   stop_lat   stop_lon zone_id
3      90004        30th Street Station  39.956667 -75.181667       C
120    90701                    Trenton  40.217778 -74.755000      NJ
121    90702        Levittown-Tullytown  40.140278 -74.816944       4
122    90703                    Bristol  40.104722 -74.854722       4
123    90704                    Croydon  40.093611 -74.906667       3
124    90705                  Eddington  40.083056 -74.933611       3
125    90706          Cornwells Heights  40.071667 -74.952222       3
126    90707                 Torresdale  40.054444 -74.984444       3
127    90708             Holmesburg Jct  40.032778 -75.023611       2
128    90709                     Tacony  40.023333 -75.038889       2
129    90710                 Bridesburg  40.010556 -75.069722       2
130    90711  North Philadelphia A
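
Note that `isin` keeps the row order of stops.csv, so 30th Street Station is listed first even though it is the last stop of the trip. One possible way to list stop names in travel order is to look each `stop_id` up in a dictionary while walking the stop times sorted by `stop_sequence`. The sketch below hard-codes a few rows copied from the two tables above, and `stops_in_travel_order` is a hypothetical helper rather than part of the original analysis.

```python
def stops_in_travel_order(stop_times, stop_names):
    """Return stop names ordered by stop_sequence.

    stop_times: list of (stop_id, stop_sequence) tuples
    stop_names: dict mapping stop_id to stop_name
    """
    ordered = sorted(stop_times, key=lambda pair: pair[1])
    return [stop_names[stop_id] for stop_id, _ in ordered]


# A subset of rows copied from the tables above for trip TRE_734_V5_M
stop_times_v5 = [(90004, 27), (90701, 1), (90702, 4), (90703, 6)]
stop_names = {
    90004: '30th Street Station',
    90701: 'Trenton',
    90702: 'Levittown-Tullytown',
    90703: 'Bristol',
}

route_stops = stops_in_travel_order(stop_times_v5, stop_names)
print(route_stops)
```

In pandas, merging the stop times with the stops on `stop_id` via `DataFrame.merge` and then calling `sort_values('stop_sequence')` would be the analogous route.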