# Exploring the Datasets
*Authors: Angelika Shastapalava, Excel Espina, David Hadaller, Sam Mundle*  

### What are we using:  
1) The MTA's "Discovery" API, the official developer resource for real-time data from the NYC Bus Time service. More information is available <a href="http://bustime.mta.info/wiki/Developers/Index">here</a>.  
2) Kaggle's NYC Bus Data, available <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a>.

### How are we using it:
Using regression and classification techniques learned in class, we want to explore the following:  
> 1. For a sample of ~10 stops/lines, how closely do the actual stop times reflect the posted bus
schedules, and what is the distribution of actual arrivals around the scheduled time?
> 2. What environmental factors affect a bus's schedule? What impact do time of day,
temperature, and weather have?
> 3. What socioeconomic factors play into a bus's schedule? Do we see better or worse
availability in neighborhoods with different average incomes?
> 4. Can we predict, with a defined degree of certainty, whether a bus will arrive within a given time frame?

### Sections:
1) [Loading the Datasets](#Loading-the-Datasets)  
2) [Cleaning the Data](#Cleaning-the-Data)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime

### Loading the Datasets 
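We work with the Kaggle dataset, so head over <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a> and download the zip file. (A word of caution: the dataset is approximately **5GB** when extracted!)

After you extract the data, load one of the CSV files into the notebook.  

The `error_bad_lines=False` parameter tells pandas to skip malformed rows (here, rows with 18 fields instead of the expected 17) rather than raise an error. Note that this parameter was deprecated in pandas 1.3 and removed in 2.0; on newer versions, `on_bad_lines='skip'` (or `'warn'`) is the replacement.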

In [2]:
mta = pd.read_csv('mta_1708.csv', error_bad_lines=False)
mta.set_index('PublishedLineName', inplace=True)

b'Skipping line 3356: expected 17 fields, saw 18\n'
b'Skipping line 59440: expected 17 fields, saw 18\nSkipping line 61296: expected 17 fields, saw 18\n'
b'Skipping line 66068: expected 17 fields, saw 18\nSkipping line 75328: expected 17 fields, saw 18\nSkipping line 81683: expected 17 fields, saw 18\nSkipping line 98179: expected 17 fields, saw 18\n'
b'Skipping line 116273: expected 17 fields, saw 18\n'
b'Skipping line 133094: expected 17 fields, saw 18\nSkipping line 137887: expected 17 fields, saw 18\nSkipping line 152688: expected 17 fields, saw 18\nSkipping line 160593: expected 17 fields, saw 18\n'
b'Skipping line 168801: expected 17 fields, saw 18\nSkipping line 170953: expected 17 fields, saw 18\nSkipping line 179203: expected 17 fields, saw 18\n'
b'Skipping line 201585: expected 17 fields, saw 18\nSkipping line 210727: expected 17 fields, saw 18\n'
b'Skipping line 289132: expected 17 fields, saw 18\n'
b'Skipping line 310840: expected 17 fields, saw 18\nSkipping line 311549: ex

b'Skipping line 2237867: expected 17 fields, saw 18\nSkipping line 2247688: expected 17 fields, saw 18\nSkipping line 2253588: expected 17 fields, saw 18\nSkipping line 2255383: expected 17 fields, saw 18\nSkipping line 2255868: expected 17 fields, saw 18\nSkipping line 2256435: expected 17 fields, saw 18\nSkipping line 2260295: expected 17 fields, saw 18\n'
b'Skipping line 2268763: expected 17 fields, saw 18\nSkipping line 2286471: expected 17 fields, saw 18\n'
b'Skipping line 2344654: expected 17 fields, saw 18\nSkipping line 2354426: expected 17 fields, saw 18\n'
b'Skipping line 2366801: expected 17 fields, saw 18\nSkipping line 2380028: expected 17 fields, saw 18\nSkipping line 2380096: expected 17 fields, saw 18\nSkipping line 2380637: expected 17 fields, saw 18\nSkipping line 2384540: expected 17 fields, saw 18\nSkipping line 2387555: expected 17 fields, saw 18\nSkipping line 2390096: expected 17 fields, saw 18\nSkipping line 2390568: expected 17 fields, saw 18\nSkipping line 239

b'Skipping line 4065803: expected 17 fields, saw 18\nSkipping line 4070743: expected 17 fields, saw 18\nSkipping line 4086582: expected 17 fields, saw 18\n'
b'Skipping line 4103429: expected 17 fields, saw 18\nSkipping line 4108827: expected 17 fields, saw 18\n'
b'Skipping line 4134726: expected 17 fields, saw 18\n'
b'Skipping line 4345024: expected 17 fields, saw 18\n'
b'Skipping line 4486230: expected 17 fields, saw 18\nSkipping line 4488411: expected 17 fields, saw 18\n'
b'Skipping line 4500721: expected 17 fields, saw 18\nSkipping line 4503037: expected 17 fields, saw 18\nSkipping line 4516948: expected 17 fields, saw 18\nSkipping line 4519551: expected 17 fields, saw 18\n'
b'Skipping line 4532148: expected 17 fields, saw 18\nSkipping line 4536531: expected 17 fields, saw 18\nSkipping line 4538611: expected 17 fields, saw 18\nSkipping line 4546027: expected 17 fields, saw 18\n'
b'Skipping line 4557757: expected 17 fields, saw 18\nSkipping line 4581724: expected 17 fields, saw 18\nS
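
On pandas 1.3 and later, the same skip-bad-rows behaviour is spelled `on_bad_lines='skip'`. A minimal sketch, using a small in-memory CSV (made up for illustration) in place of `mta_1708.csv`:

```python
import io
import pandas as pd

# Toy CSV standing in for mta_1708.csv (illustrative data only).
# The last row has one field too many, like the "expected 17, saw 18" rows.
raw = io.StringIO(
    "PublishedLineName,DistanceFromStop,ScheduledArrivalTime\n"
    "Q32,220.0,24:01:11\n"
    "B35,107.0,23:56:12\n"
    "Q83,25.0,24:00:00,extra\n"
)

# on_bad_lines='skip' (pandas >= 1.3) silently drops rows with too many fields;
# on_bad_lines='warn' would drop them and emit a warning per bad row.
demo = pd.read_csv(raw, on_bad_lines="skip")
print(demo.shape)
```

Only the two well-formed rows survive, so it is worth checking how many rows were skipped before trusting any aggregate statistics.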

In [3]:
mta.head()

| PublishedLineName | RecordedAtTime | DirectionRef | OriginName | OriginLat | OriginLong | DestinationName | DestinationLat | DestinationLong | VehicleRef | VehicleLocation.Latitude | VehicleLocation.Longitude | NextStopPointName | ArrivalProximityText | DistanceFromStop | ExpectedArrivalTime | ScheduledArrivalTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q32 | 2017-08-01 00:01:03 | 0 | W 32 ST/7 AV | 40.749405 | -73.99102 | JACKSON HTS NORTHERN - 81 via ROOSVLT | 40.755322 | -73.886139 | NYCT_7424 | 40.749403 | -73.990841 | W 32 ST/AV OF THE AMERICAS | < 1 stop away | 220.0 | 2017-08-01 00:01:37 | 24:01:11 |
| B35 | 2017-08-01 00:00:52 | 0 | 39 ST/1 AV | 40.656456 | -74.012245 | BROWNSVILLE M GASTON BL via CHURCH | 40.656345 | -73.907188 | NYCT_406 | 40.65133 | -73.93896 | CHURCH AV/E 42 ST | approaching | 107.0 | 2017-08-01 00:02:00 | 23:56:12 |
| Q83 | 2017-08-01 00:01:18 | 1 | 227 ST/113 DR | 40.702263 | -73.730339 | JAMAICA HILLSIDE - 153 via LIBERTY | 40.706795 | -73.8041 | NYCT_6449 | 40.706532 | -73.804177 | 153 ST/HILLSIDE AV | at stop | 25.0 | 2017-08-01 00:01:27 | 24:00:00 |
| M60-SBS | 2017-08-01 00:01:05 | 0 | BROADWAY/W 106 ST | 40.801819 | -73.967644 | SELECT BUS SERVICE LA GUARDIA AIRPORT | 40.768074 | -73.862091 | NYCT_5846 | 40.770403 | -73.917687 | HOYT AV/31 ST | 4.1 miles away | 6519.0 | 2017-08-01 00:06:47 | 23:39:14 |
| M60-SBS | 2017-08-01 00:01:05 | 0 | BROADWAY/W 106 ST | 40.801819 | -73.967644 | SELECT BUS SERVICE LA GUARDIA AIRPORT | 40.768074 | -73.862091 | NYCT_5846 | 40.770403 | -73.917687 | HOYT AV/31 ST | 4.1 miles away | 6519.0 | 2017-08-01 00:06:47 | 23:44:32 |


In [4]:
mta.dtypes

RecordedAtTime                object
DirectionRef                   int64
OriginName                    object
OriginLat                    float64
OriginLong                   float64
DestinationName               object
DestinationLat               float64
DestinationLong              float64
VehicleRef                    object
VehicleLocation.Latitude     float64
VehicleLocation.Longitude    float64
NextStopPointName             object
ArrivalProximityText          object
DistanceFromStop             float64
ExpectedArrivalTime           object
ScheduledArrivalTime          object
dtype: object

## Cleaning the Data
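Looking at the sample rows, we see that `ExpectedArrivalTime` and `ScheduledArrivalTime` are both stored as strings (dtype `object`) but in different formats: the former is a full timestamp (`2017-08-01 00:01:37`), the latter a time of day only (`24:01:11`). We should fix that.

In this case, we want to compare the Expected and Scheduled times when `DistanceFromStop` is <= 30. (The column appears to be in meters rather than feet: the sample row labelled "4.1 miles away" has a value of 6519.0, and 6519 m is about 4.05 miles.)

First things first: reduce ```ExpectedArrivalTime``` from a full timestamp to just the time of day.

Then we need to parse ```ScheduledArrivalTime```, whose hours can run past 23 (e.g. `24:01:11`) and so fall outside Python's 0-23 hour range. Here we pass the ```errors='coerce'``` parameter so that such out-of-range or otherwise unparseable times become NaT (Not a Time); those rows are dropped in a later step.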

In [5]:
# Strip the date from ExpectedArrivalTime, keeping only the time of day.
# Note: .dt.time yields Python datetime.time objects, so both new columns have dtype object.
mta['expected_time'] = pd.to_datetime(mta['ExpectedArrivalTime']).dt.time
# Scheduled times with hours outside 0-23 (e.g. 24:01:11) are coerced to NaT.
mta['scheduled_time'] = pd.to_datetime(mta['ScheduledArrivalTime'], format='%H:%M:%S', errors='coerce').dt.time
mta.infer_objects().dtypes

RecordedAtTime                object
DirectionRef                   int64
OriginName                    object
OriginLat                    float64
OriginLong                   float64
DestinationName               object
DestinationLat               float64
DestinationLong              float64
VehicleRef                    object
VehicleLocation.Latitude     float64
VehicleLocation.Longitude    float64
NextStopPointName             object
ArrivalProximityText          object
DistanceFromStop             float64
ExpectedArrivalTime           object
ScheduledArrivalTime          object
expected_time                 object
scheduled_time                object
dtype: object

In [6]:
mta.head()

| PublishedLineName | RecordedAtTime | DirectionRef | OriginName | OriginLat | OriginLong | DestinationName | DestinationLat | DestinationLong | VehicleRef | VehicleLocation.Latitude | VehicleLocation.Longitude | NextStopPointName | ArrivalProximityText | DistanceFromStop | ExpectedArrivalTime | ScheduledArrivalTime | expected_time | scheduled_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q32 | 2017-08-01 00:01:03 | 0 | W 32 ST/7 AV | 40.749405 | -73.99102 | JACKSON HTS NORTHERN - 81 via ROOSVLT | 40.755322 | -73.886139 | NYCT_7424 | 40.749403 | -73.990841 | W 32 ST/AV OF THE AMERICAS | < 1 stop away | 220.0 | 2017-08-01 00:01:37 | 24:01:11 | 00:01:37 | NaT |
| B35 | 2017-08-01 00:00:52 | 0 | 39 ST/1 AV | 40.656456 | -74.012245 | BROWNSVILLE M GASTON BL via CHURCH | 40.656345 | -73.907188 | NYCT_406 | 40.65133 | -73.93896 | CHURCH AV/E 42 ST | approaching | 107.0 | 2017-08-01 00:02:00 | 23:56:12 | 00:02:00 | 23:56:12 |
| Q83 | 2017-08-01 00:01:18 | 1 | 227 ST/113 DR | 40.702263 | -73.730339 | JAMAICA HILLSIDE - 153 via LIBERTY | 40.706795 | -73.8041 | NYCT_6449 | 40.706532 | -73.804177 | 153 ST/HILLSIDE AV | at stop | 25.0 | 2017-08-01 00:01:27 | 24:00:00 | 00:01:27 | NaT |
| M60-SBS | 2017-08-01 00:01:05 | 0 | BROADWAY/W 106 ST | 40.801819 | -73.967644 | SELECT BUS SERVICE LA GUARDIA AIRPORT | 40.768074 | -73.862091 | NYCT_5846 | 40.770403 | -73.917687 | HOYT AV/31 ST | 4.1 miles away | 6519.0 | 2017-08-01 00:06:47 | 23:39:14 | 00:06:47 | 23:39:14 |
| M60-SBS | 2017-08-01 00:01:05 | 0 | BROADWAY/W 106 ST | 40.801819 | -73.967644 | SELECT BUS SERVICE LA GUARDIA AIRPORT | 40.768074 | -73.862091 | NYCT_5846 | 40.770403 | -73.917687 | HOYT AV/31 ST | 4.1 miles away | 6519.0 | 2017-08-01 00:06:47 | 23:44:32 | 00:06:47 | 23:44:32 |
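
Coercing to NaT discards every scheduled time at or past `24:00:00`. If those values mean "after midnight on the same service day" (a GTFS-style convention we are assuming here, not something confirmed for this dataset), they could be kept by wrapping the hour instead. A rough sketch on three of the sample values; `sched`, `parts`, and `seconds` are illustrative names:

```python
import pandas as pd

# Three ScheduledArrivalTime values taken from the sample rows.
sched = pd.Series(["24:01:11", "23:56:12", "24:00:00"])

# The notebook's approach: hours outside 0-23 fail to parse and become NaT.
coerced = pd.to_datetime(sched, format="%H:%M:%S", errors="coerce")
print(coerced.isna().tolist())

# Alternative (assumption: 24:xx:xx means 00:xx:xx after midnight).
# Split into hours/minutes/seconds and wrap at 24 hours.
parts = sched.str.split(":", expand=True).astype(int)
seconds = (parts[0] * 3600 + parts[1] * 60 + parts[2]) % 86400
wrapped = pd.to_timedelta(seconds, unit="s")
print(wrapped.tolist())
```

Storing the wrapped values as timedeltas also makes the later subtraction simpler, at the cost of no longer having `datetime.time` objects in the column.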


Now let's drop observations whose ```DistanceFromStop``` is greater than 30 (the values appear to be in meters rather than feet).

In [7]:
mta = mta.loc[(mta['DistanceFromStop']<=30),]
print(mta.shape)

(1856274, 18)
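
A hedged aside on units: if `DistanceFromStop` is in meters (an assumption inferred from the sample rows, where "4.1 miles away" pairs with 6519.0), then a threshold intended as 30 ft needs converting before filtering. A minimal sketch on a toy frame; `demo`, `max_feet`, and `max_meters` are illustrative names:

```python
import pandas as pd

METERS_PER_FOOT = 0.3048  # exact by definition

# Toy distances (assumed to be meters), partly taken from the sample rows.
demo = pd.DataFrame({"DistanceFromStop": [220.0, 25.0, 6519.0, 9.0]})

max_feet = 30
max_meters = max_feet * METERS_PER_FOOT  # about 9.14 m

# Keep only the rows within the converted threshold.
near = demo.loc[demo["DistanceFromStop"] <= max_meters]
print(near)
```

With this conversion the 25.0 row no longer counts as "at the stop", so the choice of unit noticeably changes how many observations survive the filter.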


Now let's drop rows with a missing ```ExpectedArrivalTime```, ```expected_time```, or ```scheduled_time```, since we can't impute them at the moment.

In [8]:
mta = mta.loc[(mta['ExpectedArrivalTime'].notnull()),]
mta = mta.loc[(mta['expected_time'].notnull()),]
mta = mta.loc[(mta['scheduled_time'].notnull()),]
print(mta.shape)

(1315851, 18)
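
The three filters above can also be written as a single `dropna` call. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import datetime
import pandas as pd

# Toy frame: only the first row has all three values present.
demo = pd.DataFrame({
    "ExpectedArrivalTime": ["2017-08-01 00:01:27", None, "2017-08-01 00:02:00"],
    "expected_time": [datetime.time(0, 1, 27), pd.NaT, datetime.time(0, 2, 0)],
    "scheduled_time": [datetime.time(23, 58, 58), datetime.time(23, 44, 0), pd.NaT],
})

# Drop any row that is missing at least one of the listed columns.
cleaned = demo.dropna(
    subset=["ExpectedArrivalTime", "expected_time", "scheduled_time"]
)
print(cleaned.shape)
```

`dropna(subset=...)` treats `None`, `NaN`, and `NaT` alike, so it should select the same rows as chaining the three `notnull()` masks.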


We'll create a new column that holds the time difference between the expected and scheduled arrival times.

In [9]:
mta['time_diff'] = (pd.to_timedelta(mta['expected_time'].astype(str)) - 
                   pd.to_timedelta(mta['scheduled_time'].astype(str)))

In [10]:
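# Wrap into [-720, 720) minutes: midnight rollovers become small delays, same-day differences stay unchanged
diff_mins = mta['time_diff'] / np.timedelta64(1, 'm')
mta['time_diff_mins'] = (((diff_mins + 720) % 1440) - 720).astype(int)
mta.dtypes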

RecordedAtTime                        object
DirectionRef                           int64
OriginName                            object
OriginLat                            float64
OriginLong                           float64
DestinationName                       object
DestinationLat                       float64
DestinationLong                      float64
VehicleRef                            object
VehicleLocation.Latitude             float64
VehicleLocation.Longitude            float64
NextStopPointName                     object
ArrivalProximityText                  object
DistanceFromStop                     float64
ExpectedArrivalTime                   object
ScheduledArrivalTime                  object
expected_time                         object
scheduled_time                        object
time_diff                    timedelta64[ns]
time_diff_mins                         int32
dtype: object

In [11]:
mta.head(40)

| PublishedLineName | RecordedAtTime | DirectionRef | OriginName | OriginLat | OriginLong | DestinationName | DestinationLat | DestinationLong | VehicleRef | VehicleLocation.Latitude | VehicleLocation.Longitude | NextStopPointName | ArrivalProximityText | DistanceFromStop | ExpectedArrivalTime | ScheduledArrivalTime | expected_time | scheduled_time | time_diff | time_diff_mins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B49 | 2017-08-01 00:01:05 | 1 | FRANKLIN AV/LEFFERTS PL | 40.680473 | -73.955643 | MNHATN BCH KNGSBRO CC via BEDFORD AV | 40.578117 | -73.939857 | NYCT_5045 | 40.5859 | -73.951865 | SHEEPSHEAD BAY RD/VOORHIES AV | at stop | 28.0 | 2017-08-01 00:01:27 | 23:58:58 | 00:01:27 | 23:58:58 | -1 days +00:02:29 | 2 |
| Q28 | 2017-08-01 00:00:52 | 0 | 39 AV/138 ST | 40.760971 | -73.827057 | BAY TERRACE BELL BL | 40.782295 | -73.777031 | NYCT_7430 | 40.782288 | -73.77688 | 23 AV /BELL BL | at stop | 10.0 | 2017-08-01 00:01:27 | 23:44:00 | 00:01:27 | 23:44:00 | -1 days +00:17:27 | 17 |
| B47 | 2017-08-01 00:01:19 | 0 | AV U/E 54 ST | 40.611008 | -73.92041 | BED-STUY WOODHULL HOSP via RALPH AV | 40.699776 | -73.941505 | NYCT_4016 | 40.654108 | -73.921428 | REMSEN AV/LINDEN BL | at stop | 3.0 | 2017-08-01 00:01:27 | 23:56:34 | 00:01:27 | 23:56:34 | -1 days +00:04:53 | 4 |
| Bx36 | 2017-08-01 00:01:05 | 0 | W 179 ST/BROADWAY | 40.849113 | -73.937752 | SOUNDVIEW PUGSLEY AV | 40.820507 | -73.851631 | NYCT_7706 | 40.849021 | -73.905455 | E TREMONT AV/MONROE AV | at stop | 3.0 | 2017-08-01 00:01:27 | 23:51:32 | 00:01:27 | 23:51:32 | -1 days +00:09:55 | 9 |
| B16 | 2017-08-01 00:01:12 | 0 | SHORE RD/4 AV | 40.611763 | -74.035118 | LEFRTS GDNS PROSPCT PK STA | 40.660858 | -73.96138 | NYCT_331 | 40.660771 | -73.96286 | OCEAN AV/LINCOLN RD | at stop | 10.0 | 2017-08-01 00:01:27 | 23:49:05 | 00:01:27 | 23:49:05 | -1 days +00:12:22 | 12 |
| Bx36 | 2017-08-01 00:01:01 | 1 | RANDALL AV/OLMSTEAD AV | 40.818676 | -73.851555 | WASHINGTON HTS GW BRIDGE | 40.849033 | -73.937309 | NYCT_714 | 40.834829 | -73.867594 | E 174 ST/ST LAWRENCE AV | at stop | 25.0 | 2017-08-01 00:01:27 | 23:58:12 | 00:01:27 | 23:58:12 | -1 days +00:03:15 | 3 |
| B49 | 2017-08-01 00:01:11 | 0 | ORIENTAL BL/MACKENZIE ST | 40.57814 | -73.939796 | BD-STY FLTN ST via OCEAN AV via ROGRS AV | 40.680645 | -73.955711 | NYCT_4594 | 40.680692 | -73.955581 | FRANKLIN AV/LEFFERTS PL | at stop | 3.0 | 2017-08-01 00:01:27 | 23:50:00 | 00:01:27 | 23:50:00 | -1 days +00:11:27 | 11 |
| B8 | 2017-08-01 00:01:24 | 0 | 4 AV/95 ST | 40.616104 | -74.031143 | BROWNSVILLE ROCKAWAY AV | 40.656048 | -73.907379 | NYCT_414 | 40.646466 | -73.924625 | KINGS HY/BEVERLY RD | at stop | 5.0 | 2017-08-01 00:01:27 | 23:59:34 | 00:01:27 | 23:59:34 | -1 days +00:01:53 | 1 |
| Q3 | 2017-08-01 00:01:06 | 0 | JFK AIRPORT/TERMINAL 5 AirTrain STATION | 40.647278 | -73.779633 | JAMAICA 179 ST STA 165 ST TERM | 40.707615 | -73.79554 | NYCT_8453 | 40.705718 | -73.767101 | FARMERS BL/105 AV | at stop | 8.0 | 2017-08-01 00:01:27 | 23:57:19 | 00:01:27 | 23:57:19 | -1 days +00:04:08 | 4 |
| B82 | 2017-08-01 00:00:51 | 0 | STILLWELL TERMINAL BUS LOOP | 40.57708 | -73.981293 | SPRING CRK TWRS SEAVIEW AV via KINGS HWY | 40.64299 | -73.878326 | NYCT_6597 | 40.592202 | -73.992909 | CROPSEY AV/25 AV | at stop | 5.0 | 2017-08-01 00:01:27 | 23:56:31 | 00:01:27 | 23:56:31 | -1 days +00:04:56 | 4 |
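
The rows above all straddle midnight, which is why their raw `time_diff` reads `-1 days + ...`. Adding a flat 1440 minutes repairs those rows but inflates every same-day difference by a full day. One way to handle both cases is to wrap the difference into the range [-720, 720) minutes. A small sketch with made-up times (`expected`, `scheduled`, `unconditional`, and `wrapped` are illustrative names):

```python
import numpy as np
import pandas as pd

# Three cases: late across midnight, late within the same day, early across midnight.
expected = pd.to_timedelta(pd.Series(["00:01:27", "12:05:00", "23:59:00"]))
scheduled = pd.to_timedelta(pd.Series(["23:58:58", "12:00:00", "00:02:00"]))

# Raw difference in (fractional) minutes.
raw_mins = (expected - scheduled) / np.timedelta64(1, "m")

# Adding 1440 to every row only works for the first case.
unconditional = (raw_mins + 1440).astype(int)

# Wrapping into [-720, 720) handles all three cases.
wrapped = (((raw_mins + 720) % 1440) - 720).astype(int)

print(unconditional.tolist())  # [2, 1445, 2877]
print(wrapped.tolist())        # [2, 5, -3]
```

The wrap assumes no bus is more than 12 hours off schedule, which seems a safe bound for this data but is still an assumption.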



Now we can visualize the time difference between expected and scheduled arrival times.

In [12]:
# TODO

Looking at the MTA API:

In [13]:
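# Excel's Bus Time API key, read from an environment variable
# (the name MTA_BUSTIME_KEY is our own choice) rather than hard-coded in the notebook
import os
mta_key = os.environ["MTA_BUSTIME_KEY"]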