# Exploring the Datasets
*Authors: Angelika Shastapalava, Excel Espina, David Hadaller, Sam Mundle*  

### What are we using:  
1) The "Discovery" API, MTA's official developer resource for real-time data from their NYC Bus Time service. More information is available <a href="http://bustime.mta.info/wiki/Developers/Index">here</a>.  
2) Kaggle's NYC Bus Data, available <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a>.

### How are we using it:
Using regression and classification techniques learned in class, we want to explore the following:  
> 1. Based on ~10 stops/lines, how closely do the actual stop times reflect the posted bus
schedules, and what is the distribution of actual bus arrivals around the scheduled
time?
> 2. What environmental factors impact a bus's schedule? What impact do time of day,
temperature, and weather have?
> 3. What socioeconomic factors play into a bus's schedule? Do we see better or worse
availability in neighborhoods with different average incomes?
> 4. Can we predict, with a defined degree of certainty, whether a bus is coming within a given time frame?

### Sections:
1) [Loading the Datasets](#Loading-the-Datasets)  
2) [Cleaning the Data](#Cleaning-the-Data)  
3) [Visualizing the Data](#Visualizing-the-Data)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime, math

### Loading the Datasets 
We want to work with the Kaggle dataset, so head over <a href="https://www.kaggle.com/stoney71/new-york-city-transport-statistics">here</a> and download the zip file. (A word of caution: the dataset is approximately **5GB** when extracted!)

After you extract the data, load one of the CSV files into the notebook.  

The `error_bad_lines=False` parameter tells pandas to skip rows it cannot parse (for example, rows with too many fields) instead of raising an error. Note that this parameter was deprecated in pandas 1.3 and removed in pandas 2.0 in favor of `on_bad_lines='skip'`.

In [2]:
%%capture
mta = pd.read_csv('mta_1708.csv', error_bad_lines=False)
# mta.set_index('PublishedLineName', inplace=True)
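Because the extracted file is large, it can help to read it in chunks and keep only the rows of interest. The sketch below illustrates the idea on a tiny in-memory CSV rather than the real file. It assumes pandas 1.3 or later (for `on_bad_lines`); the sample rows and the name `m100_sample` are invented for the example.

```python
import io
import pandas as pd

# Tiny stand-in for mta_1708.csv; the rows are invented for illustration.
raw = (
    "RecordedAtTime,PublishedLineName,ArrivalProximityText\n"
    "2017-08-01 00:01:03,M100,at stop\n"
    "2017-08-01 00:00:52,B35,approaching,EXTRA,FIELD\n"  # malformed: too many fields
    "2017-08-01 00:01:18,Q83,at stop\n"
    "2017-08-01 00:02:40,M100,approaching\n"
)

# Read a few rows at a time and keep only one bus line, so the whole file
# never has to sit in memory at once. on_bad_lines="skip" drops the
# malformed row instead of raising a parser error.
chunks = pd.read_csv(io.StringIO(raw), on_bad_lines="skip", chunksize=2)
m100_sample = pd.concat(
    [chunk[chunk["PublishedLineName"] == "M100"] for chunk in chunks]
)
print(m100_sample)
```

On the real file, a much larger `chunksize` would be more practical; the value of 2 here only makes the chunking visible on four rows.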

In [3]:
mta.head()

Unnamed: 0,RecordedAtTime,DirectionRef,PublishedLineName,OriginName,OriginLat,OriginLong,DestinationName,DestinationLat,DestinationLong,VehicleRef,VehicleLocation.Latitude,VehicleLocation.Longitude,NextStopPointName,ArrivalProximityText,DistanceFromStop,ExpectedArrivalTime,ScheduledArrivalTime
0,2017-08-01 00:01:03,0,Q32,W 32 ST/7 AV,40.749405,-73.99102,JACKSON HTS NORTHERN - 81 via ROOSVLT,40.755322,-73.886139,NYCT_7424,40.749403,-73.990841,W 32 ST/AV OF THE AMERICAS,< 1 stop away,220.0,2017-08-01 00:01:37,24:01:11
1,2017-08-01 00:00:52,0,B35,39 ST/1 AV,40.656456,-74.012245,BROWNSVILLE M GASTON BL via CHURCH,40.656345,-73.907188,NYCT_406,40.65133,-73.93896,CHURCH AV/E 42 ST,approaching,107.0,2017-08-01 00:02:00,23:56:12
2,2017-08-01 00:01:18,1,Q83,227 ST/113 DR,40.702263,-73.730339,JAMAICA HILLSIDE - 153 via LIBERTY,40.706795,-73.8041,NYCT_6449,40.706532,-73.804177,153 ST/HILLSIDE AV,at stop,25.0,2017-08-01 00:01:27,24:00:00
3,2017-08-01 00:01:05,0,M60-SBS,BROADWAY/W 106 ST,40.801819,-73.967644,SELECT BUS SERVICE LA GUARDIA AIRPORT,40.768074,-73.862091,NYCT_5846,40.770403,-73.917687,HOYT AV/31 ST,4.1 miles away,6519.0,2017-08-01 00:06:47,23:39:14
4,2017-08-01 00:01:05,0,M60-SBS,BROADWAY/W 106 ST,40.801819,-73.967644,SELECT BUS SERVICE LA GUARDIA AIRPORT,40.768074,-73.862091,NYCT_5846,40.770403,-73.917687,HOYT AV/31 ST,4.1 miles away,6519.0,2017-08-01 00:06:47,23:44:32


In [4]:
mta.dtypes

RecordedAtTime                object
DirectionRef                   int64
PublishedLineName             object
OriginName                    object
OriginLat                    float64
OriginLong                   float64
DestinationName               object
DestinationLat               float64
DestinationLong              float64
VehicleRef                    object
VehicleLocation.Latitude     float64
VehicleLocation.Longitude    float64
NextStopPointName             object
ArrivalProximityText          object
DistanceFromStop             float64
ExpectedArrivalTime           object
ScheduledArrivalTime          object
dtype: object

## Cleaning the Data
For this EDA, we're only going to look at the **M100** bus on the route going to **Inwood 220 St Via Amsterdam Via Bway**. We are interested in the data for the stop at **W 125 St/St Nicholas Av**.

In [5]:
m100 = mta.loc[(mta['PublishedLineName'] == 'M100') & (mta['DestinationName'] == 'INWOOD 220 ST via AMSTERDAM via BWAY')]

We want M100 buses that reported ```at stop``` in ```ArrivalProximityText``` and whose ```NextStopPointName``` is ```W 125 ST/ST NICHOLAS AV```. The commented-out filter does the same for the preceding stop, ```W 125 ST/FRED DOUGLASS BL```.

In [6]:
# M100_FD = m100.loc[(m100['ArrivalProximityText'] == 'at stop') & (m100['NextStopPointName'] == 'W 125 ST/FRED DOUGLASS BL'),]
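# .copy() makes M100_NICK an independent frame, so later column assignments do not raise SettingWithCopyWarning
M100_NICK = m100.loc[(m100['ArrivalProximityText'] == 'at stop') & (m100['NextStopPointName'] == 'W 125 ST/ST NICHOLAS AV')].copy()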
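Assigning new columns to a frame produced by a boolean filter can trigger pandas' `SettingWithCopyWarning`. Taking an explicit `.copy()` of the filtered result is one common way to make it an independent frame. A minimal sketch on an invented three-row frame (`toy`, `mask`, and `at_stop` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Invented rows that mimic the columns used in the filter above.
toy = pd.DataFrame({
    "ArrivalProximityText": ["at stop", "approaching", "at stop"],
    "NextStopPointName": [
        "W 125 ST/ST NICHOLAS AV",
        "W 125 ST/ST NICHOLAS AV",
        "W 125 ST/FRED DOUGLASS BL",
    ],
    "RecordedAtTime": [
        "2017-08-01 00:01:03",
        "2017-08-01 00:03:10",
        "2017-08-01 00:05:00",
    ],
})

mask = (
    (toy["ArrivalProximityText"] == "at stop")
    & (toy["NextStopPointName"] == "W 125 ST/ST NICHOLAS AV")
)

# .copy() detaches the result from `toy`, so converting a column below
# changes only `at_stop` and leaves the original frame untouched.
at_stop = toy.loc[mask].copy()
at_stop["RecordedAtTime"] = pd.to_datetime(at_stop["RecordedAtTime"])
print(at_stop.dtypes)
```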


In [7]:
# M100_CONCAT = pd.concat([M100_FD, M100_NICK])

In [8]:
# M100_CONCAT.head()

Now we have a dataframe of M100 buses that have stopped at W 125 St/St Nicholas Av. 

Our next step is to compare the expected arrival time with the scheduled arrival time.

~~Also, how about we limit the scope to 7 days from Aug 5 to Aug 11.~~

In [9]:
# Convert RecordedAtTime from object (string) to datetime
M100_NICK['RecordedAtTime'] = pd.to_datetime(M100_NICK['RecordedAtTime'])

# Disabled: restrict the data to Aug 5 through Aug 11
# M100_FD[(M100_FD['RecordedAtTime'] > '2017-08-05 05:00:00') & 
#             (M100_FD['RecordedAtTime'] < '2017-08-11 21:00:00')]



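Next we convert both arrival columns to `datetime.time` objects so that ```ExpectedArrivalTime``` (a full timestamp) can be compared with ```ScheduledArrivalTime``` (a time of day only). Scheduled values that do not match `%H:%M:%S`, such as those with an hour of 24 or more, are coerced to `NaT`.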

In [10]:
M100_NICK['scheduled_time'] = pd.to_datetime(M100_NICK['ScheduledArrivalTime'],format='%H:%M:%S', errors='coerce').dt.time
M100_NICK['expected_time'] = pd.to_datetime(M100_NICK['ExpectedArrivalTime']).dt.time




Dropping rows whose expected or scheduled time could not be parsed:

In [11]:
# Keep only rows where both parsed times are present (NaT counts as missing)
M100_NICK = M100_NICK.dropna(subset=['expected_time', 'scheduled_time'])

In [12]:
M100_NICK.shape

(86, 19)
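The `%H:%M:%S` format only accepts hours from 00 to 23, so ```ScheduledArrivalTime``` values such as `24:01:11` in the preview above are coerced to `NaT` and dropped. If those rows matter, one option is to parse the scheduled time as a duration since the start of the day instead. The sketch below uses only the standard library; `parse_service_time` is a hypothetical helper, and it assumes that hours of 24 or more mean "after midnight on the same service day", which the source does not confirm.

```python
from datetime import timedelta


def parse_service_time(text):
    """Turn 'HH:MM:SS' into a timedelta, allowing hours of 24 or more."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Scheduled values taken from the preview rows above.
for raw_value in ["23:56:12", "24:01:11", "24:00:00"]:
    print(raw_value, "->", parse_service_time(raw_value))
```

Durations parsed this way can be subtracted from each other directly, which avoids the round trip through `datetime.time` objects.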

Now we calculate the difference between the expected and scheduled arrival times (expected minus scheduled, so a positive value means the bus is late).

In [13]:
M100_NICK['time_diff'] = (pd.to_timedelta(M100_NICK['expected_time'].astype(str)) - 
                   pd.to_timedelta(M100_NICK['scheduled_time'].astype(str)))

In [14]:
M100_NICK['time_diff_mins'] = ((M100_NICK['time_diff'] / np.timedelta64(1, 'm'))).apply(math.ceil).astype(int)
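Subtracting times of day directly gives a misleading result when the two times fall on opposite sides of midnight. In the second preview row, for example, the expected time is 00:02:00 and the scheduled time is 23:56:12, so the plain difference is close to minus 24 hours rather than a few minutes late. A minimal sketch of one way to handle this, assuming the true delay is always smaller than 12 hours (`wrapped_delay` and `as_delta` are hypothetical helpers):

```python
import math
from datetime import time, timedelta

DAY = timedelta(days=1)


def as_delta(t):
    """Express a time of day as a duration since midnight."""
    return timedelta(hours=t.hour, minutes=t.minute, seconds=t.second)


def wrapped_delay(expected, scheduled):
    """Return expected - scheduled, folded into the range [-12 h, +12 h)."""
    diff = as_delta(expected) - as_delta(scheduled)
    return (diff + DAY / 2) % DAY - DAY / 2


# Values from the second preview row: expected 00:02:00, scheduled 23:56:12.
delay = wrapped_delay(time(0, 2, 0), time(23, 56, 12))
delay_mins = math.ceil(delay / timedelta(minutes=1))
print(delay, delay_mins)
```

Note that `math.ceil` rounds toward positive infinity, so a bus that is 30 seconds early (-0.5 minutes) is recorded as 0 minutes.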

Now we save our progress to a CSV file.

In [15]:
M100_NICK.to_csv('M100_month_W125_st.csv', encoding='utf-8', index=False)