# Taxi Trips and Traffic

Most arrival-time models rely on real-time data from users to predict arrival times at a given moment. We believe they could be improved by including a predictive element. Our intent is to use the NYC Taxi and Limousine Commission's yellow and green cab data set to estimate the density of pickups and dropoffs at any given place and time. We will then use that density as a proxy for traffic to estimate the time it takes to arrive at a destination.

In [1]:
%matplotlib inline
import edward as ed
import pandas as pd
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from setup import set_random_seeds
import data
# set_random_seeds(42)
plt.style.use("seaborn-talk")
sns.set_context("talk")

ModuleNotFoundError: No module named 'data'

## Data

We use the
[2016 NYC Yellow Cab Dataset](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml),
which consists of pickup and dropoff coordinates for trips, along with metadata like cost, distance, and number of passengers. For the time being, we are only interested in trips that occur entirely within the range from 1st Street to 59th Street and from 12th Avenue to 1st Avenue during the month of April. This data can be easily visualized as a 2D plot.

```SQL
SELECT
  *
FROM
  [bigquery-public-data:new_york.tlc_yellow_trips_2016]
WHERE
  (pickup_longitude > -74.0124053955
    AND pickup_longitude < -73.9673309326)
  AND (pickup_latitude > 40.7186431885
    AND pickup_latitude < 40.7735137939)
  AND (dropoff_longitude > -74.0124053955
    AND dropoff_longitude < -73.9673309326)
  AND (dropoff_latitude > 40.7186431885
    AND dropoff_latitude < 40.7735137939)
  AND TIMESTAMP_TO_MSEC(pickup_datetime) > 1459483200000
  AND TIMESTAMP_TO_MSEC(dropoff_datetime) < 1462075200000
ORDER BY
  pickup_datetime DESC
LIMIT
  10000
```
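For readers working from a local CSV rather than BigQuery, roughly the same bounding-box and date filter can be sketched in pandas. This is a minimal sketch, not the query that produced the data: the helper name `filter_trips` and the two-row sample frame are hypothetical, the column names are assumed to match the query above, and the April boundaries are written as naive timestamps rather than the epoch-millisecond values in the SQL.

```python
import pandas as pd

# Bounding box copied from the SQL query above
LON_MIN, LON_MAX = -74.0124053955, -73.9673309326
LAT_MIN, LAT_MAX = 40.7186431885, 40.7735137939


def inside(series, low, high):
    """Strict low < value < high test, matching the > and < in the query."""
    return (series > low) & (series < high)


def filter_trips(df, start="2016-04-01", end="2016-05-01"):
    """Keep trips that start and end inside the box and the date window.

    Hypothetical helper; assumes the column names used in the query above.
    """
    pickup = pd.to_datetime(df["pickup_datetime"])
    dropoff = pd.to_datetime(df["dropoff_datetime"])
    mask = (
        inside(df["pickup_longitude"], LON_MIN, LON_MAX)
        & inside(df["pickup_latitude"], LAT_MIN, LAT_MAX)
        & inside(df["dropoff_longitude"], LON_MIN, LON_MAX)
        & inside(df["dropoff_latitude"], LAT_MIN, LAT_MAX)
        & (pickup > pd.Timestamp(start))
        & (dropoff < pd.Timestamp(end))
    )
    return df[mask]


# Illustrative rows: the first is inside the box in April, the second is not
sample = pd.DataFrame({
    "pickup_datetime": ["2016-04-06 19:32:31", "2016-03-14 17:24:55"],
    "dropoff_datetime": ["2016-04-06 19:39:40", "2016-03-14 17:32:30"],
    "pickup_longitude": [-73.99, -73.982155],
    "pickup_latitude": [40.74, 40.767937],
    "dropoff_longitude": [-73.98, -73.96463],
    "dropoff_latitude": [40.75, 40.765602],
})
filtered = filter_trips(sample)
print(len(filtered))
```

Filtering on both pickup and dropoff coordinates is what restricts the data to trips that stay entirely inside the study area.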

In [3]:
data_raw = pd.read_csv('datasets/train.csv')

In [4]:
""" Function drops unnecessary metadata in raw dataset"""
def getCleanTaxiData(data_raw):
    # Drop unnecessary metadata; copy so the drops below do not touch data_raw
    dataset = data_raw.loc[:, 'pickup_datetime':'trip_duration'].copy()
    dataset.drop(['passenger_count', 'store_and_fwd_flag'], axis=1, inplace=True)

    # Rename columns (a flat list; a nested list would build a MultiIndex)
    dataset.columns = ['pickup_datetime', 'dropoff_datetime', 'pickup_longitude',
                       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
                       'trip_time']

    # Split each timestamp into separate date and time columns
    dataset['pickup_datetime'] = pd.to_datetime(dataset['pickup_datetime'])
    dataset['pickup_date'] = dataset['pickup_datetime'].dt.date
    dataset['pickup_time'] = dataset['pickup_datetime'].dt.time

    dataset['dropoff_datetime'] = pd.to_datetime(dataset['dropoff_datetime'])
    dataset['dropoff_date'] = dataset['dropoff_datetime'].dt.date
    dataset['dropoff_time'] = dataset['dropoff_datetime'].dt.time

    dataset.drop(['pickup_datetime', 'dropoff_datetime'], axis=1, inplace=True)

    return dataset

# This could take 1-2 minutes
dataset = getCleanTaxiData(data_raw)
# View first 5 rows of dataset using pandas df.head()
dataset.head()




| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_time | pickup_date | pickup_time | dropoff_date | dropoff_time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | 455 | 2016-03-14 | 17:24:55 | 2016-03-14 | 17:32:30 |
| 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | 663 | 2016-06-12 | 00:43:35 | 2016-06-12 | 00:54:38 |
| 2 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | 2124 | 2016-01-19 | 11:35:24 | 2016-01-19 | 12:10:48 |
| 3 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | 429 | 2016-04-06 | 19:32:31 | 2016-04-06 | 19:39:40 |
| 4 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | 435 | 2016-03-26 | 13:30:55 | 2016-03-26 | 13:38:10 |


In [None]:
import json

with open('nyc_neighborhoods.json') as file:
    json_data = json.load(file)

In [None]:
import shapely.geometry
from shapely.geometry import Point, shape

"""
Helpers that take a GeoJSON file of NYC census tracts and return
the neighborhood containing a point. They work in general.
TODO: return the pickup/dropoff neighborhoods explicitly and append
them to the main dataframe.
"""

def getRoundedLonLat(dataset):
    decimals = 6
    cols = ['pickup_rounded_lon', 'pickup_rounded_lat',
            'dropoff_rounded_lon', 'dropoff_rounded_lat']

    # Work on a copy so the caller's dataframe (or slice) is not modified
    dataset = dataset.copy()
    dataset['pickup_rounded_lon'] = dataset['pickup_longitude'].round(decimals)
    dataset['pickup_rounded_lat'] = dataset['pickup_latitude'].round(decimals)
    dataset['dropoff_rounded_lon'] = dataset['dropoff_longitude'].round(decimals)
    dataset['dropoff_rounded_lat'] = dataset['dropoff_latitude'].round(decimals)

    # Sort by the rounded coordinates and keep only those columns
    return dataset.sort_values(cols).reset_index(drop=True)[cols]


def checkPointBounds(point):
    # Check whether a (lat, lon) point is inside the Manhattan bounding box.
    # Defaults from mike: fairly stringent, and they lead to a bunch of NaNs.
    point_lat = point[0]
    point_lon = point[1]
    min_lat = 40.7186431885
    max_lat = 40.7735137939
    min_lon = -74.0124053955
    max_lon = -73.9673309326

    return (min_lat < point_lat < max_lat) and (min_lon < point_lon < max_lon)

       
def getNeighborhood(point, json_data):
    # point is (lat, lon); shapely expects (x, y) = (lon, lat)
    temp_point = shapely.geometry.Point(point[1], point[0])

    # The checkPointBounds(point) filter is disabled for now, so every point is tested
    for feature in json_data['features']:
        if feature['properties']['borough'] == 'Manhattan':
            # shape() replaces asShape(), which was removed in Shapely 2.0
            polygon = shapely.geometry.shape(feature['geometry'])
            if polygon.contains(temp_point):
                return feature['properties']['neighborhood']

    # No Manhattan neighborhood contains the point
    return 'NaN'
                        
    
def addNeighborhoods(dataset, json_data):

    temp_data = getRoundedLonLat(dataset)
    temp_data['pickup_neighborhood'] = ""
    temp_data['dropoff_neighborhood'] = ""

    for idx, row in temp_data.iterrows():
        pickup_point = (row['pickup_rounded_lat'], row['pickup_rounded_lon'])
        # .loc[idx, column] writes to the frame itself and avoids chained assignment
        temp_data.loc[idx, 'pickup_neighborhood'] = getNeighborhood(pickup_point, json_data)

        dropoff_point = (row['dropoff_rounded_lat'], row['dropoff_rounded_lon'])
        temp_data.loc[idx, 'dropoff_neighborhood'] = getNeighborhood(dropoff_point, json_data)

    print(temp_data.head())
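
The neighborhood lookup above comes down to a point-in-polygon test. As a rough illustration of what `polygon.contains` decides, here is a minimal ray-casting sketch in plain Python. It is not Shapely's implementation: the function name `point_in_polygon` is hypothetical, the rectangular "neighborhood" is built from approximate bounding-box values for illustration only, and holes and boundary points are ignored.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "neighborhood" (approximate bounding-box corners)
box = [(-74.0124, 40.7186), (-73.9673, 40.7186),
       (-73.9673, 40.7735), (-74.0124, 40.7735)]

in_box = point_in_polygon(-73.993324, 40.740322, box)    # sample Manhattan point
out_of_box = point_in_polygon(-73.956581, 40.7713579, box)  # east of the box
print(in_box, out_of_box)
```

Each edge crossing flips the inside/outside state, so an odd number of crossings means the point is inside.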

        

In [None]:

addNeighborhoods(dataset.loc[0:100], json_data)

In [None]:
""" Get neighborhoods.
TODO: clean this up, then add a histogram and hourly data.
"""


#point = (40.7713579,-73.956581)
point = (40.740321999999999, -73.993324000000001)

print( checkPointBounds(point) )
print( getNeighborhood(point,json_data) )


In [10]:
from haversine import haversine
# hours = dataset['d_time'].loc[0:3]
# print(dataset['d_time'].loc[0].hour < 20)

# for h in hours:
#     if h.hour < 20:
#         print(str(h.hour) + ':' + str(h.minute))
#         print(True)

#print(len(dataset))

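def cleanTime(data):
    # Keep trips whose duration is between 0 and 9800 seconds.
    # The duration column is named trip_time after getCleanTaxiData.
    return data[(0 < data.trip_time) & (data.trip_time < 9800)]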


def calcDistances(df):
    ''' Function returns dataset with haversine trip distance column
        Note DF pruned to btwn 0 miles and less than 50 miles
    '''

    def haversineDistance(row):
        '''Internal function computes haversine distance (miles) for one row'''
        pickup_point = (row['pickup_latitude'], row['pickup_longitude'])
        dropoff_point = (row['dropoff_latitude'], row['dropoff_longitude'])
        # Return distance in miles rounded to 2 decimals.
        # Note: newer haversine releases take unit=Unit.MILES instead of miles=True.
        return round(haversine(pickup_point, dropoff_point, miles=True), 2)

    df['trip_distance'] = df.apply(haversineDistance, axis=1)

    # Keep trips longer than 0 miles and shorter than 50 miles
    return df[(0 < df.trip_distance) & (df.trip_distance < 50)]



def calcSpeeds(df):
    ''' Function returns dataset with trip speed (MPH) column
        Note: Trips with speeds of 0 mph or less, or 150 mph or more, are removed
    '''

    def getMPH(row):
        '''Internal function computes trip speed (mph)
            usage: time (secs), distance (miles)
        '''
        hours = row['trip_time'] / 3600
        mph = row['trip_distance'] / hours
        # Return speed in mph rounded to 2 decimals
        return round(mph, 2)

    df['trip_speed'] = df[['trip_time', 'trip_distance']].apply(getMPH, axis=1)

    return df[(0 < df.trip_speed) & (df.trip_speed < 150)]
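
To make the distance and speed columns concrete, here is a minimal standard-library sketch of the haversine formula and the mph conversion. It does not use the `haversine` package: the function names and the Earth radius constant are assumptions chosen for illustration, so results can differ from the package in the last decimals.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius in miles


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def speed_mph(distance_miles, seconds):
    """Average speed in miles per hour for a trip of the given duration."""
    return distance_miles / (seconds / 3600)


# First trip in the dataset.head() output above: 455 seconds long
d = haversine_miles(40.767937, -73.982155, 40.765602, -73.96463)
v = speed_mph(d, 455)
print(round(d, 2), round(v, 2))
```

Because this is a straight-line distance, the resulting speed understates how fast the cab actually moved along the street grid.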



In [6]:
dataset = calcDistances(dataset)
dataset.head()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_time | pickup_date | pickup_time | dropoff_date | dropoff_time | trip_distance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | 455 | 2016-03-14 | 17:24:55 | 2016-03-14 | 17:32:30 | 0.93 |
| 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | 663 | 2016-06-12 | 00:43:35 | 2016-06-12 | 00:54:38 | 1.12 |
| 2 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | 2124 | 2016-01-19 | 11:35:24 | 2016-01-19 | 12:10:48 | 3.97 |
| 3 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | 429 | 2016-04-06 | 19:32:31 | 2016-04-06 | 19:39:40 | 0.92 |
| 4 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | 435 | 2016-03-26 | 13:30:55 | 2016-03-26 | 13:38:10 | 0.74 |


In [12]:
dataset = calcSpeeds(dataset)
dataset[0:50]

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_time | pickup_date | pickup_time | dropoff_date | dropoff_time | trip_distance | trip_speed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | 455 | 2016-03-14 | 17:24:55 | 2016-03-14 | 17:32:30 | 0.93 | 7.36 |
| 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | 663 | 2016-06-12 | 00:43:35 | 2016-06-12 | 00:54:38 | 1.12 | 6.08 |
| 2 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | 2124 | 2016-01-19 | 11:35:24 | 2016-01-19 | 12:10:48 | 3.97 | 6.73 |
| 3 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | 429 | 2016-04-06 | 19:32:31 | 2016-04-06 | 19:39:40 | 0.92 | 7.72 |
| 4 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | 435 | 2016-03-26 | 13:30:55 | 2016-03-26 | 13:38:10 | 0.74 | 6.12 |
| 5 | -73.982857 | 40.742195 | -73.992081 | 40.749184 | 443 | 2016-01-30 | 22:01:40 | 2016-01-30 | 22:09:03 | 0.68 | 5.53 |
| 6 | -73.969017 | 40.757839 | -73.957405 | 40.765896 | 341 | 2016-06-17 | 22:34:59 | 2016-06-17 | 22:40:40 | 0.82 | 8.66 |
| 7 | -73.969276 | 40.797779 | -73.92247 | 40.760559 | 1551 | 2016-05-21 | 07:54:58 | 2016-05-21 | 08:20:49 | 3.55 | 8.24 |
| 8 | -73.999481 | 40.7384 | -73.985786 | 40.732815 | 255 | 2016-05-27 | 23:12:23 | 2016-05-27 | 23:16:38 | 0.81 | 11.44 |
| 9 | -73.981049 | 40.744339 | -73.973 | 40.789989 | 1225 | 2016-03-10 | 21:45:01 | 2016-03-10 | 22:05:26 | 3.18 | 9.35 |


In [None]:
dataset.describe()
# Bin trip durations (seconds) into 2-minute intervals and count trips per bin.
# The duration column is named trip_time after getCleanTaxiData.
bins = list(range(0, 10000, 120))
dataset['counts'] = pd.cut(dataset['trip_time'], bins)
x = dataset['counts'].value_counts()
x

In [None]:
dataset = data.get_data('datasets/results-20171101-101319.csv', name="raw_coordinates")
with tf.Session() as sess:
    rotated_data = sess.run(data.rotate(dataset, np.pi * 0.0))
H = data.bin_2d(rotated_data.T, [60, 12], pad_longitude=0.0, pad_latitude=0.0)
ax = sns.heatmap(H[0])
ax.invert_yaxis()
# sns.jointplot(rotated_data.T[:, 0], rotated_data.T[:, 1])

Now let's look at a single bin over the course of several hours of the day. Specifically, let's look at the busy point around 30th Street and Ninth Avenue.
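
As a minimal sketch of how a busiest bin can be located, the snippet below bins synthetic coordinates with NumPy. It assumes `np.histogram2d` as a stand-in for the notebook's own `data.bin_2d` helper, whose implementation is not shown here, and the uniformly random points are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic (latitude, longitude) points standing in for pickups and dropoffs
lat = rng.uniform(40.7186, 40.7735, size=5000)
lon = rng.uniform(-74.0124, -73.9673, size=5000)

# 60 latitude bins ("streets") by 12 longitude bins ("avenues")
counts, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=[60, 12])

# Row and column of the fullest bin
street, ave = np.unravel_index(counts.argmax(), counts.shape)
print("busiest bin:", street, ave, "with", int(counts[street, ave]), "points")
print("latitude range:", lat_edges[street], lat_edges[street + 1])
print("longitude range:", lon_edges[ave], lon_edges[ave + 1])
```

`np.unravel_index` converts the flat `argmax` position into a (row, column) pair, which avoids doing the integer division by hand.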

In [None]:
max_cars_in_bin = H[0].max()
print(np.argwhere(H[0] == max_cars_in_bin))
# Row (street) and column (avenue) of the busiest bin, assuming H[0] has
# one row per street and num_avenues columns
num_avenues = 12
max_cars = H[0].max()
street = H[0].argmax() // num_avenues
ave = H[0][street].argmax()
print("The busiest intersection is", num_avenues - ave, "ave and", street, "street with", max_cars, "pickups or dropoffs")

In [None]:
print("The coordinates surrounding the busiest intersection are")
ave_min_long = H[2][ave]
ave_max_long = H[2][ave + 1]
street_min_lat = H[1][street]
street_max_lat = H[1][street + 1]
print(f"[{ave_min_long}, {ave_max_long}], [{street_min_lat}, {street_max_lat}]")
trips_in_busy_point = []
for i in rotated_data.T:
    longitude = i[0]
    latitude = i[1]
    # Keep points strictly inside the busiest bin
    if ave_min_long < longitude < ave_max_long and street_min_lat < latitude < street_max_lat:
        trips_in_busy_point.append(i)
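
To follow a single bin through the day, the trips inside it can be grouped by pickup hour. The sketch below is one way to do that with pandas. The helper `hourly_counts`, the three sample trips, and the box limits are all hypothetical, and it assumes a frame that still has a `pickup_datetime` column.

```python
import pandas as pd


def hourly_counts(df, lon_range, lat_range):
    """Count pickups per hour of day inside a longitude/latitude box."""
    in_box = (
        (df["pickup_longitude"] > lon_range[0])
        & (df["pickup_longitude"] < lon_range[1])
        & (df["pickup_latitude"] > lat_range[0])
        & (df["pickup_latitude"] < lat_range[1])
    )
    hours = pd.to_datetime(df.loc[in_box, "pickup_datetime"]).dt.hour
    # Reindex so hours with no trips appear with a count of zero
    return hours.value_counts().reindex(list(range(24)), fill_value=0)


# Illustrative trips: two inside the box around 17:00, one outside it
trips = pd.DataFrame({
    "pickup_datetime": ["2016-04-06 17:05:00", "2016-04-06 17:45:00",
                        "2016-04-06 08:10:00"],
    "pickup_longitude": [-73.995, -73.996, -73.950],
    "pickup_latitude": [40.750, 40.751, 40.800],
})
per_hour = hourly_counts(trips, lon_range=(-74.00, -73.99), lat_range=(40.745, 40.755))
print(per_hour[per_hour > 0])
```

Reindexing to all 24 hours keeps the series the same length for every bin, which makes bins easy to compare or stack.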

## Model

Here, we define a placeholder `X`. During inference, we pass in
the value for this placeholder according to data.

## Inference

Perform variational inference.
Define the variational model to be a fully factorized normal.

In [None]:
from edward.models import Normal

# Assumes N (the number of latent variables) is already defined
qf = Normal(loc=tf.Variable(tf.random_normal([N])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([N]))))
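
The variational family above keeps every scale positive by passing an unconstrained variable through softplus. A minimal NumPy sketch of that parameterization follows. It is not Edward code, and `N = 5` and the random initial values are illustrative assumptions.

```python
import numpy as np


def softplus(x):
    """log(1 + exp(x)), computed stably; maps any real number to a positive one."""
    return np.logaddexp(0.0, x)


rng = np.random.default_rng(0)
N = 5  # illustrative number of latent variables

loc = rng.normal(size=N)         # unconstrained means
raw_scale = rng.normal(size=N)   # unconstrained scale parameters
scale = softplus(raw_scale)      # strictly positive standard deviations

# Fully factorized: each coordinate is drawn independently of the others
sample = loc + scale * rng.normal(size=N)
print(scale)
```

Optimizing `raw_scale` instead of `scale` lets a gradient-based optimizer move freely without ever producing a negative standard deviation.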

Run variational inference for `5000` iterations.

In [None]:
inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})
inference.run(n_iter=5000)

In this case, `KLqp` defaults to minimizing the $\text{KL}(q\|p)$ divergence measure using the reparameterization gradient.
For more details on inference, see the [$\text{KL}(q\|p)$ tutorial](/tutorials/klqp).
(This example is slow because evaluating and inverting full covariance matrices in Gaussian processes is expensive.)
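
As a worked illustration of the objective, $\text{KL}(q\|p)$ between two univariate normals has a closed form, and the same quantity can be estimated from reparameterized samples $z = \mu_q + \sigma_q \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. The sketch below compares the two using only the standard library. The parameter values are arbitrary, and this is not how Edward computes its gradients internally.

```python
import math
import random


def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for two univariate normals."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)


def log_normal_pdf(z, mu, sigma):
    """Log density of a normal distribution at z."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((z - mu) / sigma) ** 2)


def kl_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, n_samples=100_000, seed=0):
    """Estimate KL(q || p) as the mean of log q(z) - log p(z) over z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on the parameters
        z = mu_q + sigma_q * eps    # reparameterized sample from q
        total += log_normal_pdf(z, mu_q, sigma_q) - log_normal_pdf(z, mu_p, sigma_p)
    return total / n_samples


exact = kl_normal(0.5, 0.8, 0.0, 1.0)
estimate = kl_monte_carlo(0.5, 0.8, 0.0, 1.0)
print(exact, estimate)
```

Writing the sample as a deterministic function of the parameters and independent noise is what allows gradients to flow through the sampling step.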