# SEPTA Data Project
#### William McKee
#### December 2017

SEPTA (Southeastern Pennsylvania Transportation Authority) is the public agency responsible for the public transportation system in Philadelphia and its Pennsylvania suburbs.

This code analyzes the SEPTA Bus and Rail data sets downloaded from https://transitfeeds.com. Each download arrives as gfts.zip, so I renamed the SEPTA Bus file to septa_bus_gfts.zip and the SEPTA Rail file to septa_rail_gfts.zip.

## Data Set Conversion

The code below lists the files in each zip archive, prints the first few lines of every file, extracts the archive into a directory named after it, and converts each extracted .txt file to .csv.

In [14]:
import zipfile
import csv
import os

def read_and_print_first_lines_from_zipped_file(zipfilename, limit):
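    """
    Reads a zip file and prints the first lines of each file it contains.
    The loop stops only once line_number exceeds limit, so up to limit + 1
    lines are printed per file (the header row plus limit data rows).
    zipfilename = zip file name (such as 'example.zip')
    limit = number of data lines to print after the header line
    """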
    print()
    print("CONTENTS OF ZIP FILE " + zipfilename + ":")
    print()
    with zipfile.ZipFile(zipfilename, 'r') as z:
        file_name_list = sorted(z.namelist())
        for file in file_name_list:
            print(file)
            with z.open(file, 'r') as input_file:
                for line_number, line in enumerate(input_file):
                    if line_number > limit:
                        break
                    print(line)
            print()
    print()

# Loop through zip files
NUM_LINES = 5
ZIP_FILE_NAMES = ['septa_bus_gfts.zip', 'septa_rail_gfts.zip']
DIRECTORY_NAMES = []
for file in ZIP_FILE_NAMES:
    # Read the zip files and display some file contents
    read_and_print_first_lines_from_zipped_file(file, NUM_LINES)

    # Extract zip file contents
    directory_name = os.path.splitext(file)[0]
    DIRECTORY_NAMES.append(directory_name)
    with zipfile.ZipFile(file, 'r') as zip_ref:
        zip_ref.extractall(directory_name)

    # Convert txt files to csv files
    os.chdir(directory_name)
    for input_file in os.listdir('.'):
        with open(input_file, 'r', newline='') as in_file:
            # csv.reader keeps quoted fields that contain commas intact,
            # which a plain line.split(",") would break apart
            rows = (row for row in csv.reader(in_file) if row)
            output_file = os.path.splitext(input_file)[0] + ".csv"
            print("Convert " + input_file + " contents to " + output_file)
            with open(output_file, 'w', newline='') as out_file:
                writer = csv.writer(out_file, lineterminator='\n')
                writer.writerows(rows)
            
    # Remove original text files
    for item in os.listdir('.'):
        if item.endswith(".txt"):
            os.remove(item)

    os.chdir('..')


CONTENTS OF ZIP FILE septa_bus_gfts.zip:

agency.txt
b'agency_name,agency_url,agency_timezone,agency_lang,agency_fare_url\r\n'
b'SEPTA,http://www.septa.org,America/New_York,EN,http://www.septa.org/fares/transit/index.html'

calendar.txt
b'service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date\r\n'
b'10,1,1,1,1,1,0,0,20170903,20180224\r\n'
b'11,0,0,0,0,0,0,0,20170903,20180224\r\n'
b'12,0,0,0,0,0,1,0,20170903,20180224\r\n'
b'13,0,0,0,0,0,0,1,20170903,20180224\r\n'
b'16,1,1,1,1,1,0,0,20170903,20180224\r\n'

calendar_dates.txt
b'service_id,date,exception_type\r\n'
b'10,20170904,2\r\n'
b'13,20170904,1\r\n'
b'16,20170904,2\r\n'
b'19,20170904,1\r\n'
b'22,20170904,2\r\n'

fare_attributes.txt
b'fare_id,price,currency_type,payment_method,transfers,transfer_duration\r\n'
b'1,2.50,USD,0,0,0\r\n'
b'2,3.50,USD,0,1,3600\r\n'
b'3,4.50,USD,0,2,3600\r\n'
b'13,7.00,USD,0,0,0\r\n'
b'14,8.00,USD,0,1,3600\r\n'

fare_rules.txt
b'fare_id,origin_id,destination_id\r\n'
b'1,1,1\
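The preview above shows raw bytes (the `b'...'` prefix and the trailing `\r\n`) because `z.open` returns a binary stream. As a minimal sketch, assuming the archive members are UTF-8 text, the hypothetical helper `preview_zip_member_lines` below wraps each member in `io.TextIOWrapper` and returns decoded lines with the line endings removed. The helper name and its `encoding` parameter are illustrative and are not part of the original notebook.

```python
import io
import itertools
import zipfile


def preview_zip_member_lines(zipfilename, limit, encoding="utf-8"):
    """Return {member name: first `limit` decoded lines} for a zip archive.

    Hypothetical helper (not in the original notebook); assumes the
    members are text files in the given encoding.
    """
    preview = {}
    with zipfile.ZipFile(zipfilename, "r") as z:
        for name in sorted(z.namelist()):
            with z.open(name, "r") as raw:
                # Decode the binary member stream as text, line by line
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                lines = [line.rstrip("\r\n")
                         for line in itertools.islice(text, limit)]
                preview[name] = lines
    return preview
```

Calling `preview_zip_member_lines('septa_bus_gfts.zip', 3)` would return a dictionary keyed by member name, which can then be printed or inspected without the byte-string markers.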

## Data Set Basics

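We explore the CSV files for both bus and rail. First we print the shape (rows, columns) of every file. We then group only the files where per-key counts are informative: shapes.csv by shape_id, trips.csv by route_id, and stop_times.csv by trip_id.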
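To make the grouping idea concrete, here is a minimal sketch on a made-up DataFrame. The values are invented for illustration and are not taken from the SEPTA files. It counts rows per key with `groupby(...).size()`, which should give the same counts as the `groupby(field)[field].count()` form used later in this section whenever the key column has no missing values.

```python
import pandas as pd

# Made-up rows standing in for a file such as trips.csv (illustrative only)
toy_trips = pd.DataFrame({
    "route_id": ["A", "A", "B", "A", "C", "B"],
    "trip_id": [1, 2, 3, 4, 5, 6],
})

# Number of rows (trips) per route_id
trips_per_route = toy_trips.groupby("route_id").size()
print(trips_per_route)

# Summary statistics of the group sizes, useful when there are
# too many groups to read in a printed listing
print(trips_per_route.describe())
```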

In [15]:
import pandas as pd

# Print sizes of CSV files
for directory in DIRECTORY_NAMES:
    os.chdir(directory)
    print("Looking at " + directory + " contents")
    print()
    for input_file in os.listdir('.'):
        print("Description of " + input_file + ":")
        data_set = pd.read_csv(input_file)
        print(data_set.shape)
        print()
    os.chdir('..')

Looking at septa_bus_gfts contents

Description of agency.csv:
(1, 5)

Description of calendar.csv:
(28, 10)

Description of calendar_dates.csv:
(168, 3)

Description of fare_attributes.csv:
(6, 6)

Description of fare_rules.csv:
(9, 3)

Description of routes.csv:
(139, 7)

Description of shapes.csv:
(570717, 4)

Description of stops.csv:
(13701, 8)

Description of stop_times.csv:
(3121225, 5)

Description of transfers.csv:
(1, 4)

Description of trips.csv:
(52000, 7)

Looking at septa_rail_gfts contents

Description of agency.csv:
(1, 6)

Description of calendar.csv:
(5, 10)

Description of calendar_dates.csv:
(2, 3)

Description of fare_attributes.csv:
(0, 6)

Description of fare_rules.csv:
(0, 5)

Description of routes.csv:
(13, 9)

Description of shapes.csv:
(180235, 4)

Description of stops.csv:
(155, 7)

Description of stop_times.csv:
(25082, 7)

Description of transfers.csv:
(3, 3)

Description of trips.csv:
(1711, 8)
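
The shapes printed above can also be collected into one table so that the bus and rail feeds are easier to compare. The sketch below is one possible way to do that, not the notebook's own method. `summarize_csv_shapes` is a hypothetical helper, and it builds paths with `os.path.join` instead of changing the working directory.

```python
import os

import pandas as pd


def summarize_csv_shapes(directory_names):
    """Return a DataFrame with one row per CSV file: directory, file, rows, columns.

    Hypothetical helper (not in the original notebook).
    """
    records = []
    for directory in directory_names:
        for file_name in sorted(os.listdir(directory)):
            if not file_name.endswith(".csv"):
                continue  # skip anything that is not a converted CSV file
            path = os.path.join(directory, file_name)
            rows, columns = pd.read_csv(path).shape
            records.append({"directory": directory, "file": file_name,
                            "rows": rows, "columns": columns})
    return pd.DataFrame(records,
                        columns=["directory", "file", "rows", "columns"])
```

With the notebook's `DIRECTORY_NAMES` list, `summarize_csv_shapes(DIRECTORY_NAMES).pivot(index="file", columns="directory", values="rows")` would place the bus and rail row counts side by side.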



In [16]:
# Groupby could provide useful information for some files
GROUPBY_FILES_FIELDS = {'shapes.csv': 'shape_id', 
                        'trips.csv': 'route_id', 
                        'stop_times.csv': 'trip_id'}
for directory in DIRECTORY_NAMES:
    os.chdir(directory)
    print("Looking at " + directory + " groupby contents")
    print()
    for input_file in os.listdir('.'):
        if input_file in GROUPBY_FILES_FIELDS:
            this_field = GROUPBY_FILES_FIELDS[input_file]
            print("Description of " + input_file + " Groupings:")
            data_set = pd.read_csv(input_file)
            data_set_distinct = data_set.groupby(this_field)[this_field].count()
            print(data_set_distinct)
            print()
    os.chdir('..')

Looking at septa_bus_gfts groupby contents

Description of shapes.csv Groupings:
shape_id
203286     296
203287     296
203288     280
203290     195
203291     273
203292     188
203293     289
203305     211
203307     189
203308     204
203310     182
203311      95
203312     288
203313      84
203314      84
203315     338
203316     790
203317     896
203318     790
203319     900
203320     872
203322     305
203323     372
203324     387
203325     790
203326     306
203327     321
203329     381
203330     361
203332     814
          ... 
206251     328
206287     253
206288     276
206289     170
206290     249
206291     249
206293     235
206294     257
206305     251
206308     403
206309     578
206310     484
206311    1180
206312     496
206313    1086
206314     597
206315     409
206317     304
206318     381
206319    1001
206320     990
206321     408
206322     370
206323    1079
206324     457
206325     366
206326     884
206327    1337
206329    1419
208290    