# Exploring the Datasets
*Authors: Angelika Shastapalava, Excel Espina, David Hadaller, Sam Mundle*  

### What are we using:  
1) The "Discovery" API is the MTA's official developer resource for getting real-time data from its NYC Bus Time service. You can find more information <a href="http://bustime.mta.info/wiki/Developers/Index">here</a>.  
2) Kaggle's NYC Bus Data, available <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a>.

### How are we using it:
Using regression and classification techniques learned in class, we want to explore the following:  
> 1. Based on ~10 stops/lines, how closely do the actual stop times reflect the posted bus
schedules, and what is the distribution of actual arrival times around the scheduled
time?
> 2. What environmental factors impact a bus's schedule? What impact do time of day,
temperature, and weather have?
> 3. What socioeconomic factors play into a bus's schedule? Do we see better or worse
availability in neighborhoods with different average incomes?
> 4. Predicting, with a defined degree of certainty, whether a bus is coming within a given time frame.

### Sections:
1) [Loading the Datasets](#Loading-the-Datasets)  
2) [Cleaning the Data](#Cleaning-the-Data)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### Loading the Datasets 
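We want to work with the Kaggle dataset, so head over <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a> and download the zip file. (A word of caution: the dataset is approximately **5GB** when extracted!)

After you extract the data, load one of the CSV files into the notebook.  

The `error_bad_lines=False` parameter tells pandas to skip rows that have the wrong number of fields instead of raising an error; each skipped row is reported in the output below. (This parameter is deprecated as of pandas 1.3 in favor of `on_bad_lines='skip'` and was removed in pandas 2.0.)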

In [2]:
mta = pd.read_csv('mta_1708.csv', error_bad_lines=False)

b'Skipping line 3356: expected 17 fields, saw 18\n'
b'Skipping line 59440: expected 17 fields, saw 18\nSkipping line 61296: expected 17 fields, saw 18\n'
b'Skipping line 66068: expected 17 fields, saw 18\nSkipping line 75328: expected 17 fields, saw 18\nSkipping line 81683: expected 17 fields, saw 18\nSkipping line 98179: expected 17 fields, saw 18\n'
b'Skipping line 116273: expected 17 fields, saw 18\n'
b'Skipping line 133094: expected 17 fields, saw 18\nSkipping line 137887: expected 17 fields, saw 18\nSkipping line 152688: expected 17 fields, saw 18\nSkipping line 160593: expected 17 fields, saw 18\n'
b'Skipping line 168801: expected 17 fields, saw 18\nSkipping line 170953: expected 17 fields, saw 18\nSkipping line 179203: expected 17 fields, saw 18\n'
b'Skipping line 201585: expected 17 fields, saw 18\nSkipping line 210727: expected 17 fields, saw 18\n'
b'Skipping line 289132: expected 17 fields, saw 18\n'
b'Skipping line 310840: expected 17 fields, saw 18\nSkipping line 311549: ex

b'Skipping line 2237867: expected 17 fields, saw 18\nSkipping line 2247688: expected 17 fields, saw 18\nSkipping line 2253588: expected 17 fields, saw 18\nSkipping line 2255383: expected 17 fields, saw 18\nSkipping line 2255868: expected 17 fields, saw 18\nSkipping line 2256435: expected 17 fields, saw 18\nSkipping line 2260295: expected 17 fields, saw 18\n'
b'Skipping line 2268763: expected 17 fields, saw 18\nSkipping line 2286471: expected 17 fields, saw 18\n'
b'Skipping line 2344654: expected 17 fields, saw 18\nSkipping line 2354426: expected 17 fields, saw 18\n'
b'Skipping line 2366801: expected 17 fields, saw 18\nSkipping line 2380028: expected 17 fields, saw 18\nSkipping line 2380096: expected 17 fields, saw 18\nSkipping line 2380637: expected 17 fields, saw 18\nSkipping line 2384540: expected 17 fields, saw 18\nSkipping line 2387555: expected 17 fields, saw 18\nSkipping line 2390096: expected 17 fields, saw 18\nSkipping line 2390568: expected 17 fields, saw 18\nSkipping line 239

b'Skipping line 4065803: expected 17 fields, saw 18\nSkipping line 4070743: expected 17 fields, saw 18\nSkipping line 4086582: expected 17 fields, saw 18\n'
b'Skipping line 4103429: expected 17 fields, saw 18\nSkipping line 4108827: expected 17 fields, saw 18\n'
b'Skipping line 4134726: expected 17 fields, saw 18\n'
b'Skipping line 4345024: expected 17 fields, saw 18\n'
b'Skipping line 4486230: expected 17 fields, saw 18\nSkipping line 4488411: expected 17 fields, saw 18\n'
b'Skipping line 4500721: expected 17 fields, saw 18\nSkipping line 4503037: expected 17 fields, saw 18\nSkipping line 4516948: expected 17 fields, saw 18\nSkipping line 4519551: expected 17 fields, saw 18\n'
b'Skipping line 4532148: expected 17 fields, saw 18\nSkipping line 4536531: expected 17 fields, saw 18\nSkipping line 4538611: expected 17 fields, saw 18\nSkipping line 4546027: expected 17 fields, saw 18\n'
b'Skipping line 4557757: expected 17 fields, saw 18\nSkipping line 4581724: expected 17 fields, saw 18\nS
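
The warnings above show rows with 18 fields where 17 were expected. As a rough sketch of the same "skip malformed rows" behaviour on newer pandas versions, the helper below tries `on_bad_lines="skip"` first and falls back to `error_bad_lines=False` on older releases. The helper name `read_csv_skipping_bad_lines` and the tiny in-memory CSV are made up for illustration; they are not part of the original notebook.

```python
import io

import pandas as pd


def read_csv_skipping_bad_lines(source, **kwargs):
    """Read a CSV, dropping rows whose field count exceeds the header's."""
    try:
        # pandas >= 1.3 spells the option this way.
        return pd.read_csv(source, on_bad_lines="skip", **kwargs)
    except TypeError:
        # Older pandas does not know on_bad_lines; use the legacy flag.
        return pd.read_csv(source, error_bad_lines=False, **kwargs)


# Hypothetical three-column file whose second data row has an extra field.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")
df = read_csv_skipping_bad_lines(sample)
print(df)
```

With the real file, the call would presumably look like `read_csv_skipping_bad_lines('mta_1708.csv')`. Skipping rows silently discards data, so it is worth keeping an eye on how many rows are dropped.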

In [3]:
mta.head()

| | RecordedAtTime | DirectionRef | PublishedLineName | OriginName | OriginLat | OriginLong | DestinationName | DestinationLat | DestinationLong | VehicleRef | VehicleLocation.Latitude | VehicleLocation.Longitude | NextStopPointName | ArrivalProximityText | DistanceFromStop | ExpectedArrivalTime | ScheduledArrivalTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-01 00:01:03 | 0 | Q32 | W 32 ST/7 AV | 40.749405 | -73.99102 | JACKSON HTS NORTHERN - 81 via ROOSVLT | 40.755322 | -73.886139 | NYCT_7424 | 40.749403 | -73.990841 | W 32 ST/AV OF THE AMERICAS | < 1 stop away | 220.0 | 2017-08-01 00:01:37 | 24:01:11 |
| 1 | 2017-08-01 00:00:52 | 0 | B35 | 39 ST/1 AV | 40.656456 | -74.012245 | BROWNSVILLE M GASTON BL via CHURCH | 40.656345 | -73.907188 | NYCT_406 | 40.65133 | -73.93896 | CHURCH AV/E 42 ST | approaching | 107.0 | 2017-08-01 00:02:00 | 23:56:12 |
| 2 | 2017-08-01 00:01:18 | 1 | Q83 | 227 ST/113 DR | 40.702263 | -73.730339 | JAMAICA HILLSIDE - 153 via LIBERTY | 40.706795 | -73.8041 | NYCT_6449 | 40.706532 | -73.804177 | 153 ST/HILLSIDE AV | at stop | 25.0 | 2017-08-01 00:01:27 | 24:00:00 |
| 3 | 2017-08-01 00:01:05 | 0 | M60-SBS | BROADWAY/W 106 ST | 40.801819 | -73.967644 | SELECT BUS SERVICE LA GUARDIA AIRPORT | 40.768074 | -73.862091 | NYCT_5846 | 40.770403 | -73.917687 | HOYT AV/31 ST | 4.1 miles away | 6519.0 | 2017-08-01 00:06:47 | 23:39:14 |
| 4 | 2017-08-01 00:01:05 | 0 | M60-SBS | BROADWAY/W 106 ST | 40.801819 | -73.967644 | SELECT BUS SERVICE LA GUARDIA AIRPORT | 40.768074 | -73.862091 | NYCT_5846 | 40.770403 | -73.917687 | HOYT AV/31 ST | 4.1 miles away | 6519.0 | 2017-08-01 00:06:47 | 23:44:32 |


In [4]:
mta.dtypes

RecordedAtTime                object
DirectionRef                   int64
PublishedLineName             object
OriginName                    object
OriginLat                    float64
OriginLong                   float64
DestinationName               object
DestinationLat               float64
DestinationLong              float64
VehicleRef                    object
VehicleLocation.Latitude     float64
VehicleLocation.Longitude    float64
NextStopPointName             object
ArrivalProximityText          object
DistanceFromStop             float64
ExpectedArrivalTime           object
ScheduledArrivalTime          object
dtype: object
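
Most of the text columns above load as `object` (plain strings). One possible follow-up, sketched below, is to parse `RecordedAtTime` into a real datetime and store repetitive text columns as categories, which usually reduces memory use on a file this large. The helper name `tidy_dtypes` and the choice of which columns to treat as categories are assumptions for illustration; the sample values are copied from the `mta.head()` output.

```python
import pandas as pd


def tidy_dtypes(df,
                datetime_cols=("RecordedAtTime",),
                category_cols=("PublishedLineName", "ArrivalProximityText")):
    """Return a copy with timestamps parsed and repetitive text as categories."""
    out = df.copy()
    for col in datetime_cols:
        out[col] = pd.to_datetime(out[col])
    for col in category_cols:
        out[col] = out[col].astype("category")
    return out


# A few values taken from the head of the dataset shown earlier.
sample = pd.DataFrame({
    "RecordedAtTime": ["2017-08-01 00:01:03", "2017-08-01 00:00:52",
                       "2017-08-01 00:01:18"],
    "PublishedLineName": ["Q32", "B35", "Q83"],
    "ArrivalProximityText": ["< 1 stop away", "approaching", "at stop"],
})
tidy = tidy_dtypes(sample)
print(tidy.dtypes)
```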

### Cleaning the Data
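Looking at the head samples, we see that `ExpectedArrivalTime` and `ScheduledArrivalTime` are both stored as strings (dtype `object`) but in different formats: `ExpectedArrivalTime` is a full timestamp such as `2017-08-01 00:01:37`, while `ScheduledArrivalTime` is only a time of day, and it can run past midnight, as in `24:01:11`. We need to bring both into a common representation before comparing them.

In this case, we want to compare the difference between the expected and scheduled times when the distance from the stop is <= 50 ft.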

In [None]:
# TODO
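
One way the comparison described above could be approached is sketched here; it is not the authors' implementation. It rests on two assumptions. First, `ScheduledArrivalTime` carries no date, so the sketch compares times of day and wraps the difference into a window of plus or minus 12 hours, meaning a bus more than 12 hours off schedule would be misread. Second, the sample rows pair `6519.0` with "4.1 miles away", which suggests `DistanceFromStop` is in meters, so 50 ft is taken as roughly 15.24 m. The function name `arrival_delay_seconds` and the constant `MAX_DISTANCE` are hypothetical.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60
MAX_DISTANCE = 50 * 0.3048  # 50 ft in meters (assumes the column is in meters)


def arrival_delay_seconds(df):
    """Expected minus scheduled arrival, in seconds (positive means late)."""
    expected = pd.to_datetime(df["ExpectedArrivalTime"])
    # Seconds since midnight for the expected arrival.
    expected_tod = (expected - expected.dt.normalize()).dt.total_seconds()
    # to_timedelta accepts hour values of 24 and above, e.g. "24:01:11".
    scheduled_tod = pd.to_timedelta(df["ScheduledArrivalTime"]).dt.total_seconds()
    raw = expected_tod - scheduled_tod
    # Wrap into [-12 h, +12 h) so that times on either side of midnight compare sensibly.
    half_day = SECONDS_PER_DAY / 2
    return (raw + half_day) % SECONDS_PER_DAY - half_day


# First three rows of the head output shown earlier.
sample = pd.DataFrame({
    "DistanceFromStop": [220.0, 107.0, 25.0],
    "ExpectedArrivalTime": ["2017-08-01 00:01:37", "2017-08-01 00:02:00",
                            "2017-08-01 00:01:27"],
    "ScheduledArrivalTime": ["24:01:11", "23:56:12", "24:00:00"],
})
sample["DelaySeconds"] = arrival_delay_seconds(sample)
near_stop = sample[sample["DistanceFromStop"] <= MAX_DISTANCE]
print(sample[["DistanceFromStop", "DelaySeconds"]])
```

None of these three sample rows is within the assumed 15.24 m threshold, so `near_stop` is empty here; on the full dataset the same filter would keep only buses at or very near their next stop.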

Looking at the MTA API:

In [None]:
# Key for Excel
mta_key = "51b681ab-bb14-4f29-9104-db15a7a41d41"
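
As a hedged sketch of a possible next step, the snippet below builds a request URL for the Bus Time API without sending it. The endpoint path, the `LineRef` parameter, and the `MTA NYCT_` line prefix are assumptions based on the MTA developer wiki linked at the top and should be checked there. The environment variable name `MTA_BUSTIME_KEY` and the function name are hypothetical; reading the key from the environment is simply one way to keep it out of a shared notebook.

```python
import os
from urllib.parse import urlencode

# Assumed SIRI VehicleMonitoring endpoint; verify against the MTA developer wiki.
BASE_URL = "http://bustime.mta.info/api/siri/vehicle-monitoring.json"


def build_vehicle_monitoring_url(api_key, line_ref=None):
    """Build (but do not send) a vehicle-monitoring request URL."""
    params = {"key": api_key}
    if line_ref is not None:
        params["LineRef"] = line_ref  # assumed parameter name
    return BASE_URL + "?" + urlencode(params)


# Hypothetical variable name; falls back to a placeholder if it is unset.
api_key = os.environ.get("MTA_BUSTIME_KEY", "YOUR-KEY-HERE")
url = build_vehicle_monitoring_url(api_key, line_ref="MTA NYCT_Q32")
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the JSON response inspected to see which of its fields correspond to the Kaggle columns.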