# Exploring the Datasets
*Authors: Angelika Shastapalava, Excel Espina, David Hadaller, Sam Mundle*  

### What are we using:  
1) The "Discovery" API, MTA's official developer resource for real-time data from its NYC Bus Time service. More information is available <a href="http://bustime.mta.info/wiki/Developers/Index">here</a>.  
2) Kaggle's NYC Bus Data, available <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a>.

### How are we using it:
Using regression and classification techniques learned in class, we want to explore the following:  
> 1. Based on ~10 stops/lines, how closely do the actual stop times reflect the posted bus
schedules, and what is the distribution of actual bus arrivals around the scheduled time?
> 2. What environmental factors impact a bus's schedule? What impact do time of day,
temperature, and weather have?
> 3. What socioeconomic factors play into a bus's schedule? Do we see better or worse
availability in neighborhoods with different average incomes?
> 4. Can we predict, with a defined degree of certainty, whether a bus is coming within a given time frame?

### Sections:
1) [Loading the Datasets](#Loading-the-Datasets)  
2) [Cleaning the Data](#Cleaning-the-Data)  
3) [Visualizing the Data](#Visualizing-the-Data)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime

## Loading the Datasets
We want to work with the Kaggle dataset, so head over <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a> and download the zip file. (A word of caution: the dataset is approximately **5 GB** when extracted!)

After you extract the data, load one of the CSV files into the notebook.  

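The `on_bad_lines='skip'` parameter tells pandas to skip malformed rows (for example, rows with too many fields) instead of raising an error when we load the dataset. Older pandas versions (before 1.3) used `error_bad_lines=False` for the same purpose.

In [None]:
%%capture
# Skip malformed rows rather than failing; on pandas < 1.3 use error_bad_lines=False instead
mta = pd.read_csv('mta_1708.csv', on_bad_lines='skip')
mta.set_index('PublishedLineName', inplace=True)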

In [None]:
mta.head()

In [None]:
mta.dtypes

## Cleaning the Data
Looking at the sample rows from `head()`, we see that `ExpectedArrivalTime` and `ScheduledArrivalTime` are both stored as objects, but in different formats: the first is a full timestamp and the second is a time of day. We should fix that.

We want to compare the expected and scheduled arrival times for observations where the bus is within 30 ft of the stop.

First, reduce ```ExpectedArrivalTime``` from a full timestamp to just the time of day.

Then parse ```ScheduledArrivalTime``` so that it fits Python's time range of 0-23 hours. Passing the ```errors='coerce'``` parameter converts unparseable or out-of-range times to NaT (Not a Time).

In [None]:
# Strip the date from ExpectedArrivalTime, keeping only the time of day.
# Note: .dt.time yields Python datetime.time objects, so both new columns have dtype object.
mta['expected_time'] = pd.to_datetime(mta['ExpectedArrivalTime']).dt.time
# Parse ScheduledArrivalTime as HH:MM:SS; unparseable or out-of-range values become NaT
mta['scheduled_time'] = pd.to_datetime(mta['ScheduledArrivalTime'], format='%H:%M:%S', errors='coerce').dt.time
mta.infer_objects().dtypes
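
To make the effect of `errors='coerce'` concrete, here is a small sketch on made-up values (not rows from the dataset). It also shows one possible alternative that this notebook does not use: parsing the strings with `pd.to_timedelta`, which accepts hours of 24 and above instead of discarding them.

```python
import pandas as pd

# Made-up examples: two valid times, one past-midnight value, one garbage string
times = pd.Series(['07:15:00', '23:59:59', '24:06:14', 'bad'])

# Strict HH:MM:SS parsing: hour 24 is out of range, so it is coerced to NaT
parsed = pd.to_datetime(times, format='%H:%M:%S', errors='coerce')
print(parsed.dt.time)

# Alternative (an assumption about what might suit the data, not the authors' method):
# timedeltas keep '24:06:14' as 1 day, 6 minutes, 14 seconds
as_delta = pd.to_timedelta(times, errors='coerce')
print(as_delta)
```

With the `to_datetime` approach, any row whose scheduled time is past 23:59:59 ends up as NaT and is dropped later along with genuinely missing values.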

In [None]:
mta.head()

Now let's drop observations whose ```DistanceFromStop``` is greater than 30 ft.

In [None]:
# Keep rows within 30 ft of the stop; copy so later column assignments don't act on a view
mta = mta.loc[mta['DistanceFromStop'] <= 30].copy()
print(mta.shape)

Now let's drop rows with a missing `ExpectedArrivalTime`, `expected_time`, or `scheduled_time`, since we can't impute them at the moment.

In [None]:
# Drop rows where any of the three time columns is missing (NaN or NaT)
mta = mta.dropna(subset=['ExpectedArrivalTime', 'expected_time', 'scheduled_time'])
print(mta.shape)

We'll create a new column holding the time difference between the expected and scheduled arrival times.

In [None]:
mta['time_diff'] = (pd.to_timedelta(mta['expected_time'].astype(str)) - 
                   pd.to_timedelta(mta['scheduled_time'].astype(str)))

In [None]:
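# Convert to whole minutes. The +1440 offset (one day) keeps every value positive;
# subtract 1440 to recover the signed difference.
mta['time_diff_mins'] = ((mta['time_diff'] / np.timedelta64(1, 'm')) + 1440).astype(int)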
mta.dtypes
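
Since both columns hold only a time of day, a pair that straddles midnight (scheduled at 23:57, expected at 00:02) produces a raw difference of almost a full day. The sketch below shows one possible way to get a signed difference instead, using made-up times. It is a variation on the approach above rather than the notebook's own method, and it assumes no bus is more than 12 hours early or late.

```python
import pandas as pd

# Made-up (expected, scheduled) pairs; the last two straddle midnight
expected = pd.to_timedelta(pd.Series(['07:18:30', '00:02:00', '23:58:00']))
scheduled = pd.to_timedelta(pd.Series(['07:15:00', '23:57:00', '00:01:00']))

# Raw difference in minutes: 3.5, -1435.0, 1437.0
raw_mins = (expected - scheduled) / pd.Timedelta(minutes=1)

# Wrap into [-720, 720) so the midnight cases become small signed values
signed_mins = (raw_mins + 720) % 1440 - 720
print(signed_mins.tolist())  # [3.5, 5.0, -3.0]
```

Positive values mean the bus is expected later than scheduled, and negative values mean it is expected earlier.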

In [None]:
mta.head(40)

## Visualizing the Data

Now we can visualize the time difference between expected and scheduled arrival times.

In [None]:
# TODO
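
As a possible starting point for this plot, the sketch below draws a histogram of arrival-time differences. The data here is synthetic (random numbers, not MTA observations) so the block runs on its own. With the real data, `delays` would be replaced by a column of signed minute differences from `mta`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the real differences: NOT actual MTA data
rng = np.random.default_rng(0)
delays = rng.normal(loc=2.0, scale=4.0, size=500)

fig, ax = plt.subplots(figsize=(8, 4))
counts, bin_edges, _ = ax.hist(delays, bins=30, edgecolor='black')

# Reference line at zero, i.e. arriving exactly on schedule
ax.axvline(0, color='red', linestyle='--', label='on schedule')
ax.set_xlabel('Expected minus scheduled arrival (minutes)')
ax.set_ylabel('Number of observations')
ax.set_title('Distribution of arrival-time differences (synthetic data)')
ax.legend()
```

In a notebook the figure renders inline at the end of the cell. A histogram like this speaks directly to the first question above, the distribution of actual arrivals around the scheduled time.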

Looking at the MTA API:

In [None]:
# Key for Excel
mta_key = "51b681ab-bb14-4f29-9104-db15a7a41d41"
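
A request URL for the real-time API could be assembled along these lines. This is a hedged sketch: the endpoint path and the `key` and `LineRef` parameter names are recalled from the Bus Time developer documentation linked at the top and should be checked there before use. The environment variable `MTA_API_KEY`, the helper name, and the example line reference are all made up for illustration.

```python
import os
from urllib.parse import urlencode

# Assumed SIRI vehicle-monitoring endpoint; verify against the developer wiki
BASE_URL = "http://bustime.mta.info/api/siri/vehicle-monitoring.json"

def build_vehicle_monitoring_url(api_key, line_ref):
    """Return a request URL for one bus line (hypothetical helper)."""
    params = {"key": api_key, "LineRef": line_ref}
    return f"{BASE_URL}?{urlencode(params)}"

# Read the key from a hypothetical environment variable instead of hard-coding it
api_key = os.environ.get("MTA_API_KEY", "YOUR-KEY-HERE")
url = build_vehicle_monitoring_url(api_key, "MTA NYCT_B63")
print(url)
```

The block only builds the URL. Fetching and parsing the JSON response would be the next step once the parameters are confirmed.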