# Exploring the Datasets
*Authors: Angelika Shastapalava, Excel Espina, David Hadaller, Sam Mundle*  

### What are we using:  
1) The "Discovery" API is MTA's official developer resource to get real-time data from their NYC Bus Time service. You can get more information <a href="http://bustime.mta.info/wiki/Developers/Index">here</a>  
2) Kaggle's NYC Bus Data, available <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a>

### How are we using it:
Using regression and classification techniques learned in class, we want to explore the following:  
> 1. Based on ~10 stops/lines, how closely do the actual stop times reflect the posted bus
schedules, and what is the distribution of actual arrival times around the scheduled time?
> 2. What environmental factors impact a bus's schedule? What impact do time of day,
temperature, and weather have?
> 3. What socioeconomic factors play into a bus's schedule? Do we see better or worse
availability in neighborhoods with different average incomes?
> 4. Can we predict, with a defined degree of certainty, whether a bus is coming within a given time frame?

### Sections:
1) [Loading the Datasets](#Loading-the-Datasets)  
2) [Cleaning the Data](#Cleaning-the-Data)  
3) [Visualizing the Data](#Visualizing-the-Data)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime, math

### Loading the Datasets 
We want to work with the Kaggle dataset, so head over <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a> and download the zip file. (A word of caution: the dataset is approximately **5GB** when extracted!)

After you extract the data, load one of the CSV files into the notebook.  

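The `error_bad_lines=False` parameter tells pandas to skip rows that fail to parse (for example, rows with too many fields) instead of raising an error. Note that this parameter was deprecated in pandas 1.3 in favor of `on_bad_lines='skip'` and removed in pandas 2.0.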

In [None]:
%%capture
mta = pd.read_csv('mta_1708.csv', error_bad_lines=False)
# mta.set_index('PublishedLineName', inplace=True)
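
On newer pandas versions, the same row-skipping behavior can be sketched with `on_bad_lines='skip'`. The in-memory CSV below is a hypothetical sample (the `X1` row and its destination are made up) rather than rows from the Kaggle file, and the sketch assumes pandas 1.3 or later.

```python
import io
import pandas as pd

# Hypothetical sample: the second data row has one field too many
sample = io.StringIO(
    "PublishedLineName,DestinationName\n"
    "M100,INWOOD 220 ST via AMSTERDAM via BWAY\n"
    "M100,INWOOD 220 ST via AMSTERDAM via BWAY,extra\n"
    "X1,HYPOTHETICAL DESTINATION\n"
)

# on_bad_lines='skip' drops rows that have too many fields instead of raising
sample_df = pd.read_csv(sample, on_bad_lines='skip')
print(len(sample_df))
```

Skipped rows are dropped silently, so it can be worth comparing the row count against the raw file if every record matters.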

In [None]:
mta.head()

In [None]:
mta.dtypes

### Cleaning the Data
For this EDA, we're only going to be looking at the **M100** bus on the route going to **Inwood 220 St Via Amsterdam Via Bway**. We are interested in the stop data at **W 125 St/St Nicholas Av**.

In [None]:
m100 = mta.loc[(mta['PublishedLineName']== 'M100') & (mta['DestinationName'] == 'INWOOD 220 ST via AMSTERDAM via BWAY'),]

We want to look at M100 buses that have reported ```at stop``` in ```ArrivalProximityText``` and whose ```NextStopPointName``` is ```W 125 ST/FRED DOUGLASS BL```, the stop before **W 125 St/St Nicholas Av**.

In [None]:
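# Buses reported at W 125 ST/FRED DOUGLASS BL, the stop before St Nicholas Av
M100_FD = m100.loc[(m100['ArrivalProximityText'] == 'at stop') & (m100['NextStopPointName'] == 'W 125 ST/FRED DOUGLASS BL')].copy()
# Buses reported at W 125 ST/ST NICHOLAS AV itself (needed for the concat in the next cell)
M100_NICK = m100.loc[(m100['ArrivalProximityText'] == 'at stop') & (m100['NextStopPointName'] == 'W 125 ST/ST NICHOLAS AV')].copy()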


In [None]:
M100_CONCAT = pd.concat([M100_FD, M100_NICK])

In [None]:
M100_CONCAT.head()

Now we have a dataframe of M100 buses that have stopped at or just before **W 125 St/St Nicholas Av**.

Our next step is to compare the expected arrival time with the scheduled arrival time.

We also limit the scope to the 7 days from Aug 5 to Aug 11, 2017.

In [None]:
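# Convert RecordedAtTime from object (string) to datetime
M100_FD['RecordedAtTime'] = pd.to_datetime(M100_FD['RecordedAtTime'])

# Keep only records from Aug 5 to Aug 11, 2017 (assign the result so the filter takes effect)
M100_FD = M100_FD[(M100_FD['RecordedAtTime'] > '2017-08-05 05:00:00') &
                  (M100_FD['RecordedAtTime'] < '2017-08-11 21:00:00')].copy()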
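
Filtering with a boolean mask returns a new dataframe rather than modifying the original in place, so the result has to be assigned to a name to take effect. A minimal sketch of the date-window filter on hypothetical timestamps (the `toy` dataframe is illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical timestamps standing in for RecordedAtTime
toy = pd.DataFrame({'RecordedAtTime': ['2017-08-04 12:00:00',
                                       '2017-08-05 06:30:00',
                                       '2017-08-11 20:59:00',
                                       '2017-08-12 01:00:00']})
toy['RecordedAtTime'] = pd.to_datetime(toy['RecordedAtTime'])

# Boolean mask for the window; both bounds are exclusive
in_window = ((toy['RecordedAtTime'] > '2017-08-05 05:00:00') &
             (toy['RecordedAtTime'] < '2017-08-11 21:00:00'))

# .copy() gives an independent dataframe, which avoids
# SettingWithCopyWarning when new columns are added later
toy_week = toy[in_window].copy()
print(len(toy_week))
```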

Next, we convert the timestamps to `datetime.time` objects so that ```ExpectedArrivalTime``` can be compared with ```ScheduledArrivalTime```.

In [None]:
M100_FD['scheduled_time'] = pd.to_datetime(M100_FD['ScheduledArrivalTime'],format='%H:%M:%S', errors='coerce').dt.time
M100_FD['expected_time'] = pd.to_datetime(M100_FD['ExpectedArrivalTime']).dt.time


Dropping malformed values:

In [None]:
M100_FD = M100_FD.loc[(M100_FD['expected_time'].notnull()),]
M100_FD = M100_FD.loc[(M100_FD['scheduled_time'].notnull()),]
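
To illustrate why the null check catches malformed values: with `errors='coerce'`, any string that does not match the given format becomes `NaT`, which `notnull()` then filters out. The values below are hypothetical; an hour of `24` is shown only as an example of a string that `%H` rejects.

```python
import pandas as pd

# Hypothetical raw values: one valid time and two that do not match %H:%M:%S
raw = pd.Series(['23:58:00', '24:05:00', 'not a time'])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%H:%M:%S', errors='coerce')

# Keep only the rows that parsed, then extract the time of day
clean = parsed[parsed.notnull()].dt.time
print(len(clean))
```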

In [None]:
M100_FD.shape

Now we calculate the time difference between the expected and scheduled arrival times.

In [None]:
M100_FD['time_diff'] = (pd.to_timedelta(M100_FD['expected_time'].astype(str)) - 
                   pd.to_timedelta(M100_FD['scheduled_time'].astype(str)))

In [None]:
M100_FD['time_diff_mins'] = ((M100_FD['time_diff'] / np.timedelta64(1, 'm'))).apply(math.ceil).astype(int)
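
A small worked example of the same arithmetic on hypothetical times shows two properties worth keeping in mind. First, `math.ceil` rounds toward positive infinity, so a bus that is 1.5 minutes early is recorded as `-1`. Second, because only the time of day is compared, a pair that straddles midnight produces a large negative value. Whether such rows occur in the M100 data is not checked here.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical schedule/expectation pairs (not taken from the dataset)
toy = pd.DataFrame({
    'scheduled_time': ['08:00:00', '08:00:00', '23:58:00'],
    'expected_time':  ['08:03:30', '07:58:30', '00:02:00'],
})

# Same calculation as above: expected minus scheduled, as a timedelta
toy['time_diff'] = (pd.to_timedelta(toy['expected_time']) -
                    pd.to_timedelta(toy['scheduled_time']))

# Convert to minutes and round up to a whole number
toy['time_diff_mins'] = ((toy['time_diff'] / np.timedelta64(1, 'm'))
                         .apply(math.ceil).astype(int))
print(toy['time_diff_mins'].tolist())
```

The last row is only 4 minutes late in reality, so rows near midnight may need the date included in the comparison.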

Now we save our progress to a CSV file.

In [None]:
M100_FD.to_csv('M100_7days_W125_st.csv', encoding='utf-8', index=False)