# NYC Taxi Data, Hotel Pairing Experiment

This notebook is a proof of concept for pairing public-record New York City taxi data (pick-ups and drop-offs) with a few hotels in the city.

In [1]:
# imports...
import csv, numpy as np, pandas as pd
from geopy.geocoders import Nominatim
from geopy.distance import vincenty

### Importing and Cleaning Data

We first read the .csv file into memory and keep the columns of the data that we want to use. (Open question: do we need any columns besides latitude and longitude?)

For a first pass, I use the Green taxicab data from January 2016 from the NYC Taxi & Limousine Commission website (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). The monthly Yellow taxicab datasets are ~1.65 GB each, so they take a while to load and even longer to process (the distance calculations).

In [2]:
# change this if you want to try a different dataset
taxi_file = '../data/green_tripdata_2016-01.csv'

# let's load a single .csv file of taxi cab records (say, January 2016)
taxi_data = pd.read_csv(taxi_file)

# let's take a look at the loaded .csv file (for a sanity check)
taxi_data

Unnamed: 0,VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,...,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
0,2,2016-01-01 00:29:24,2016-01-01 00:39:36,N,1,-73.928642,40.680611,-73.924278,40.698044,1,...,8.0,0.5,0.5,1.86,0.00,,0.3,11.16,1,1.0
1,2,2016-01-01 00:19:39,2016-01-01 00:39:18,N,1,-73.952675,40.723175,-73.923920,40.761379,1,...,15.5,0.5,0.5,0.00,0.00,,0.3,16.80,2,1.0
2,2,2016-01-01 00:19:33,2016-01-01 00:39:48,N,1,-73.971611,40.676105,-74.013161,40.646072,1,...,16.5,0.5,0.5,4.45,0.00,,0.3,22.25,1,1.0
3,2,2016-01-01 00:22:12,2016-01-01 00:38:32,N,1,-73.989502,40.669579,-74.000648,40.689034,1,...,13.5,0.5,0.5,0.00,0.00,,0.3,14.80,2,1.0
4,2,2016-01-01 00:24:01,2016-01-01 00:39:22,N,1,-73.964729,40.682854,-73.940720,40.663013,1,...,12.0,0.5,0.5,0.00,0.00,,0.3,13.30,2,1.0
5,2,2016-01-01 00:32:59,2016-01-01 00:39:35,N,1,-73.891144,40.746456,-73.867744,40.742111,1,...,7.0,0.5,0.5,0.00,0.00,,0.3,8.30,2,1.0
6,2,2016-01-01 00:34:42,2016-01-01 00:39:21,N,1,-73.896675,40.746197,-73.886192,40.745689,1,...,5.0,0.5,0.5,0.00,0.00,,0.3,6.30,2,1.0
7,2,2016-01-01 00:31:23,2016-01-01 00:39:36,N,1,-73.953354,40.803558,-73.949150,40.794121,1,...,7.0,0.5,0.5,0.00,0.00,,0.3,8.30,2,1.0
8,2,2016-01-01 00:24:40,2016-01-01 00:39:52,N,1,-73.994064,40.702816,-73.971573,40.679726,1,...,12.0,0.5,0.5,2.00,0.00,,0.3,15.30,1,1.0
9,2,2016-01-01 00:28:59,2016-01-01 00:39:23,N,1,-73.914131,40.756641,-73.917549,40.739658,1,...,9.0,0.5,0.5,1.60,0.00,,0.3,11.90,1,1.0
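On the question of which columns are needed: the distance experiments below only touch the four coordinate columns. With pandas, `read_csv` accepts `usecols` and `nrows` arguments that limit what gets loaded. The sketch below shows the same idea with only the standard library, streaming the file row by row. The function name `read_coordinates` and its `limit` parameter are illustrative assumptions, and the in-memory sample reuses the first two rows shown above.

```python
import csv
import io


def read_coordinates(handle, limit=None):
    """Stream a trip-record CSV and return (pickups, dropoffs).

    Each list holds (latitude, longitude) float tuples. At most `limit`
    rows are read when `limit` is given. Hypothetical helper, not part
    of the original notebook.
    """
    pickups, dropoffs = [], []
    for i, row in enumerate(csv.DictReader(handle)):
        if limit is not None and i >= limit:
            break
        pickups.append((float(row["Pickup_latitude"]),
                        float(row["Pickup_longitude"])))
        dropoffs.append((float(row["Dropoff_latitude"]),
                         float(row["Dropoff_longitude"])))
    return pickups, dropoffs


# Tiny in-memory sample: the coordinate columns of the first two rows above,
# plus VendorID to show that other columns are simply ignored.
sample = io.StringIO(
    "VendorID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude\n"
    "2,-73.928642,40.680611,-73.924278,40.698044\n"
    "2,-73.952675,40.723175,-73.923920,40.761379\n"
)
pickups, dropoffs = read_coordinates(sample)
print(pickups)
print(dropoffs)
```

Streaming keeps memory use roughly constant, which may matter more for the larger Yellow taxicab files than for the Green one.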


In [3]:
# get relevant columns of the data and store them as numpy arrays
pickup_longs, pickup_lats = np.array(taxi_data['Pickup_longitude']), np.array(taxi_data['Pickup_latitude'])
dropoff_longs, dropoff_lats = np.array(taxi_data['Dropoff_longitude']), np.array(taxi_data['Dropoff_latitude'])

# for brevity, let's just experiment with the first 10^6 datapoints.
pickup_longs, pickup_lats = pickup_longs[:1000000], pickup_lats[:1000000]
dropoff_longs, dropoff_lats = dropoff_longs[:1000000], dropoff_lats[:1000000]

### Geolocating Hotels

We experiment with geopy, a Python client for several popular geocoding services. We try the OpenStreetMap Nominatim service (https://nominatim.openstreetmap.org/), since it seems fairly accurate (this still needs to be verified) and doesn't require authentication. I might move to Google's geolocation service (https://developers.google.com/maps/documentation/geolocation/intro) after obtaining an API key.

In [4]:
# change this variable to change the address to geolocate
hotel_address = '66 Charlton Street, New York City, NY 10014' # Four Points by Sheraton Manhattan SoHo Village

# setting up geolocator object
geolocator = Nominatim()

# storing the geocode of the above address
location = geolocator.geocode(hotel_address)
print 'address:', location.address
print 'longitude, latitude:', location.longitude, ',', location.latitude

address: 66, Charlton Street, Greenwich Village, Manhattan, New York County, NYC, New York, 10014, United States of America
longitude, latitude: -74.0061074667 , 40.7271283333


### Finding Nearby Pickups and Dropoffs

Now we look through our pickup / dropoff dataset and find nearby points. We'll try a couple of definitions of "nearby", but they all rely on the Vincenty distance (geopy's default distance measure in the version used here), which calculates the distance between two points on the surface of a spheroid (https://en.wikipedia.org/wiki/Vincenty's_formulae). The result can be read in meters, miles, etc.
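To make the distance step concrete, here is a minimal sketch of a great-circle (haversine) distance in plain Python. It is not Vincenty's method: it treats the Earth as a sphere with the mean radius, so results can differ from the ellipsoidal distance by a fraction of a percent. The function name `haversine_m` is an illustrative assumption. Points are given as (latitude, longitude), which is also the order geopy expects.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical model)


def haversine_m(point_a, point_b):
    """Great-circle distance in meters between two (latitude, longitude)
    points given in decimal degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


# Hotel coordinates from the geocoding output above, and the pickup point
# of the first record in the table, both as (latitude, longitude).
hotel = (40.7271283333, -74.0061074667)
first_pickup = (40.680611, -73.928642)

distance_m = haversine_m(hotel, first_pickup)
print("distance in meters:", round(distance_m))
print("within 1 mile:", distance_m <= 1609.344)
```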

In [5]:
%%time

# let's try pickup and dropoff locations that are within 1 mile of our destination!

# setting up variables to store results
nearby_pickups, nearby_dropoffs = [], []
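# store the hotel's coordinates
# NOTE: geopy's distance functions expect (latitude, longitude) tuples; the hotel and the
# taxi points in this notebook are all passed as (longitude, latitude), which distorts the distances
hotel_coords = (location.longitude, location.latitude)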
# get number of datapoints in dataset N
N = pickup_longs.shape[0]

print '...getting nearby pickup locations\n'

# loop through each pickup long, lat pair
for i, pickup in enumerate(zip(pickup_longs, pickup_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'pickups progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, pickup).miles <= 1.0:
        # add to list if it meets the 1 mile criterion
        nearby_pickups.append(pickup)
        
print 'pickups progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print '...getting nearby dropoff locations\n'

# loop through each dropoff long, lat pair
for i, dropoff in enumerate(zip(dropoff_longs, dropoff_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'dropoffs progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, dropoff).miles <= 1.0:
        # add to list if it meets the 1 mile criterion
        nearby_dropoffs.append(dropoff)
        
print 'dropoffs progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print 'number of nearby pickups (<= 1 mile)', len(nearby_pickups), '\n'
print 'number of nearby dropoffs (<= 1 mile):', len(nearby_dropoffs), '\n'

...getting nearby pickup locations

pickups progress: (0 / 1000000)
pickups progress: (100000 / 1000000)
pickups progress: (200000 / 1000000)
pickups progress: (300000 / 1000000)
pickups progress: (400000 / 1000000)
pickups progress: (500000 / 1000000)
pickups progress: (600000 / 1000000)
pickups progress: (700000 / 1000000)
pickups progress: (800000 / 1000000)
pickups progress: (900000 / 1000000)
pickups progress: (1000000 / 1000000)

...getting nearby dropoff locations

dropoffs progress: (0 / 1000000)
dropoffs progress: (100000 / 1000000)
dropoffs progress: (200000 / 1000000)
dropoffs progress: (300000 / 1000000)
dropoffs progress: (400000 / 1000000)
dropoffs progress: (500000 / 1000000)
dropoffs progress: (600000 / 1000000)
dropoffs progress: (700000 / 1000000)
dropoffs progress: (800000 / 1000000)
dropoffs progress: (900000 / 1000000)
dropoffs progress: (1000000 / 1000000)

number of nearby pickups (<= 1 mile) 8363 

number of nearby dropoffs (<= 1 mile): 46366 

CPU times: user 1
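Most of the 10^6 points are far from the hotel, so one possible speed-up is a cheap bounding-box test that discards distant points before the precise distance is computed. The sketch below is one way to do that under a spherical-Earth assumption. The names `bounding_box` and `in_box` and the 1% padding (meant to absorb the difference between the sphere and the spheroid) are assumptions for illustration, not part of the original notebook. Points are (latitude, longitude).

```python
from math import asin, cos, degrees, radians, sin

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical model)


def bounding_box(lat, lon, radius_m, pad=1.01):
    """Return (min_lat, max_lat, min_lon, max_lon) in degrees for a box that
    encloses the circle of `radius_m` meters around (lat, lon).

    `pad` slightly enlarges the radius. Not valid near the poles or across
    the antimeridian, which is fine for New York City.
    """
    angular = radius_m * pad / EARTH_RADIUS_M            # radius in radians
    dlat = degrees(angular)
    dlon = degrees(asin(sin(angular) / cos(radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def in_box(point, box):
    """True if the (latitude, longitude) point lies inside the box."""
    lat, lon = point
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


hotel = (40.7271283333, -74.0061074667)         # from the geocoding output
box = bounding_box(hotel[0], hotel[1], 1609.344)  # 1 mile in meters

candidates = [
    (40.7321283333, -74.0061074667),  # 0.005 degrees north of the hotel
    (40.680611, -73.928642),          # pickup point of the first record
]
survivors = [p for p in candidates if in_box(p, box)]
print(survivors)
```

Only the survivors would then need the precise distance check, since a point inside the box can still lie outside the circle.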

In [6]:
%%time

# let's try pickup and dropoff locations that are within 500 meters of our destination!

# setting up variables to store results
nearby_pickups, nearby_dropoffs = [], []
# other necessary variables are already stored

print '...getting nearby pickup locations\n'

# loop through each pickup long, lat pair
for i, pickup in enumerate(zip(pickup_longs, pickup_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'pickups progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, pickup).meters <= 500.0:
        # add to list if it meets the 500 meter criterion
        nearby_pickups.append(pickup)
        
print 'pickups progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print '...getting nearby dropoff locations\n'

# loop through each dropoff long, lat pair
for i, dropoff in enumerate(zip(dropoff_longs, dropoff_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'dropoffs progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, dropoff).meters <= 500.0:
        # add to list if it meets the 500 meter criterion
        nearby_dropoffs.append(dropoff)
        
print 'dropoffs progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print 'number of nearby pickups (<= 500 meters)', len(nearby_pickups), '\n'
print 'number of nearby dropoffs (<= 500 meters):', len(nearby_dropoffs), '\n'

...getting nearby pickup locations

pickups progress: (0 / 1000000)
pickups progress: (100000 / 1000000)
pickups progress: (200000 / 1000000)
pickups progress: (300000 / 1000000)
pickups progress: (400000 / 1000000)
pickups progress: (500000 / 1000000)
pickups progress: (600000 / 1000000)
pickups progress: (700000 / 1000000)
pickups progress: (800000 / 1000000)
pickups progress: (900000 / 1000000)
pickups progress: (1000000 / 1000000)

...getting nearby dropoff locations

dropoffs progress: (0 / 1000000)
dropoffs progress: (100000 / 1000000)
dropoffs progress: (200000 / 1000000)
dropoffs progress: (300000 / 1000000)
dropoffs progress: (400000 / 1000000)
dropoffs progress: (500000 / 1000000)
dropoffs progress: (600000 / 1000000)
dropoffs progress: (700000 / 1000000)
dropoffs progress: (800000 / 1000000)
dropoffs progress: (900000 / 1000000)
dropoffs progress: (1000000 / 1000000)

number of nearby pickups (<= 500 meters) 0 

number of nearby dropoffs (<= 500 meters): 8126 

CPU times: us

##### Let's do the same experiments, but with a different hotel.

In [7]:
# change this variable to change the address to geolocate
hotel_address = '123 Nassau St, New York City, NY 10038' # The Beekman, A Thompson Hotel ("Best stay in New York!")

# setting up geolocator object
geolocator = Nominatim()

# storing the geocode of the above address
location = geolocator.geocode(hotel_address)
print 'address:', location.address
print 'longitude, latitude:', location.longitude, ',', location.latitude

address: 123, Nassau Street, Southbridge Towers, Manhattan, New York County, NYC, New York, 10038, United States of America
longitude, latitude: -74.0069278333 , 40.7110179444


In [8]:
%%time

# let's try pickup and dropoff locations that are within 1 mile of our destination!

# setting up variables to store results
nearby_pickups, nearby_dropoffs = [], []
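# store the hotel's coordinates
# NOTE: geopy's distance functions expect (latitude, longitude) tuples; the hotel and the
# taxi points in this notebook are all passed as (longitude, latitude), which distorts the distances
hotel_coords = (location.longitude, location.latitude)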
# get number of datapoints in dataset N
N = pickup_longs.shape[0]

print '...getting nearby pickup locations\n'

# loop through each pickup long, lat pair
for i, pickup in enumerate(zip(pickup_longs, pickup_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'pickups progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, pickup).miles <= 1.0:
        # add to list if it meets the 1 mile criterion
        nearby_pickups.append(pickup)
        
print 'pickups progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print '...getting nearby dropoff locations\n'

# loop through each dropoff long, lat pair
for i, dropoff in enumerate(zip(dropoff_longs, dropoff_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'dropoffs progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, dropoff).miles <= 1.0:
        # add to list if it meets the 1 mile criterion
        nearby_dropoffs.append(dropoff)
        
print 'dropoffs progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print 'number of nearby pickups (<= 1 mile)', len(nearby_pickups), '\n'
print 'number of nearby dropoffs (<= 1 mile):', len(nearby_dropoffs), '\n'

...getting nearby pickup locations

pickups progress: (0 / 1000000)
pickups progress: (100000 / 1000000)
pickups progress: (200000 / 1000000)
pickups progress: (300000 / 1000000)
pickups progress: (400000 / 1000000)
pickups progress: (500000 / 1000000)
pickups progress: (600000 / 1000000)
pickups progress: (700000 / 1000000)
pickups progress: (800000 / 1000000)
pickups progress: (900000 / 1000000)
pickups progress: (1000000 / 1000000)

...getting nearby dropoff locations

dropoffs progress: (0 / 1000000)
dropoffs progress: (100000 / 1000000)
dropoffs progress: (200000 / 1000000)
dropoffs progress: (300000 / 1000000)
dropoffs progress: (400000 / 1000000)
dropoffs progress: (500000 / 1000000)
dropoffs progress: (600000 / 1000000)
dropoffs progress: (700000 / 1000000)
dropoffs progress: (800000 / 1000000)
dropoffs progress: (900000 / 1000000)
dropoffs progress: (1000000 / 1000000)

number of nearby pickups (<= 1 mile) 23582 

number of nearby dropoffs (<= 1 mile): 50800 

CPU times: user 

In [9]:
%%time

# let's try pickup and dropoff locations that are within 500 meters of our destination!

# setting up variables to store results
nearby_pickups, nearby_dropoffs = [], []
# other necessary variables are already stored

print '...getting nearby pickup locations\n'

# loop through each pickup long, lat pair
for i, pickup in enumerate(zip(pickup_longs, pickup_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'pickups progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, pickup).meters <= 500.0:
        # add to list if it meets the 500 meter criterion
        nearby_pickups.append(pickup)
        
print 'pickups progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print '...getting nearby dropoff locations\n'

# loop through each dropoff long, lat pair
for i, dropoff in enumerate(zip(dropoff_longs, dropoff_lats)):
    # print progress to console periodically
    if i % 100000 == 0:
        print 'dropoffs progress: (' + str(i) + ' / ' + str(N) + ')'
    # calculate vincenty distance and check for criterion
    if vincenty(hotel_coords, dropoff).meters <= 500.0:
        # add to list if it meets the 500 meter criterion
        nearby_dropoffs.append(dropoff)
        
print 'dropoffs progress: (' + str(N) + ' / ' + str(N) + ')\n'
        
print 'number of nearby pickups (<= 500 meters)', len(nearby_pickups), '\n'
print 'number of nearby dropoffs (<= 500 meters):', len(nearby_dropoffs), '\n'

...getting nearby pickup locations

pickups progress: (0 / 1000000)
pickups progress: (100000 / 1000000)
pickups progress: (200000 / 1000000)
pickups progress: (300000 / 1000000)
pickups progress: (400000 / 1000000)
pickups progress: (500000 / 1000000)
pickups progress: (600000 / 1000000)
pickups progress: (700000 / 1000000)
pickups progress: (800000 / 1000000)
pickups progress: (900000 / 1000000)
pickups progress: (1000000 / 1000000)

...getting nearby dropoff locations

dropoffs progress: (0 / 1000000)
dropoffs progress: (100000 / 1000000)
dropoffs progress: (200000 / 1000000)
dropoffs progress: (300000 / 1000000)
dropoffs progress: (400000 / 1000000)
dropoffs progress: (500000 / 1000000)
dropoffs progress: (600000 / 1000000)
dropoffs progress: (700000 / 1000000)
dropoffs progress: (800000 / 1000000)
dropoffs progress: (900000 / 1000000)
dropoffs progress: (1000000 / 1000000)

number of nearby pickups (<= 500 meters) 0 

number of nearby dropoffs (<= 500 meters): 7465 

CPU times: us

## Conclusions

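By the 1-mile counts, The Beekman looks like the more popular hotel (23582 pickups and 50800 dropoffs, versus 8363 and 46366 for the Four Points), at least for Green taxicab traffic in January 2016. The 500-meter dropoff counts point the other way (7465 versus 8126), so the ranking depends on the radius chosen.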

As for usefulness: even on my laptop, processing 10^6 records in this fashion takes about 1 minute per hotel. I am confident that parallelizing this code and deploying it on a cloud computing resource (e.g., Amazon Elastic Compute Cloud or something similar) would make the number crunching very doable.
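Before reaching for parallel hardware, it may be worth vectorizing the distance calculation with numpy, which the notebook already imports. The sketch below computes a haversine (spherical) distance for whole arrays at once. It is an approximation standing in for Vincenty, so counts right at the radius boundary may differ slightly. The names `haversine_many_m` and `count_within` are illustrative assumptions, and the center is given as (latitude, longitude).

```python
import numpy as np

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical model)


def haversine_many_m(center, lats, lons):
    """Great-circle distances in meters from `center` (latitude, longitude)
    to every point described by the `lats` and `lons` arrays (degrees)."""
    lat0, lon0 = np.radians(center[0]), np.radians(center[1])
    lats = np.radians(np.asarray(lats, dtype=float))
    lons = np.radians(np.asarray(lons, dtype=float))
    h = (np.sin((lats - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lats) * np.sin((lons - lon0) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(h))


def count_within(center, lats, lons, radius_m):
    """Number of points no farther than `radius_m` meters from `center`."""
    return int(np.count_nonzero(haversine_many_m(center, lats, lons) <= radius_m))


# Three sample points: the first hotel itself, the second hotel, and the
# pickup point of the first record in the table.
hotel = (40.7271283333, -74.0061074667)
lats = np.array([40.7271283333, 40.7110179444, 40.680611])
lons = np.array([-74.0061074667, -74.0069278333, -73.928642])

print(count_within(hotel, lats, lons, 500.0))
print(count_within(hotel, lats, lons, 2000.0))
```

In the notebook, `lats` and `lons` would be the existing `pickup_lats` / `pickup_longs` (or dropoff) arrays, replacing the per-point Python loop with a single array expression.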

What we need to move forward: I think that having a list of hotels and their addresses would be the next logical step. From there, it is easy to find each hotel's latitude and longitude, and then use these to calculate pick-up and drop-off distances.
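As a rough sketch of that next step, the helper below turns a list of addresses into coordinates. The geocoder is passed in as a callable, so in the notebook it could be `geolocator.geocode`, which returns an object with `latitude` and `longitude` attributes, or `None` when nothing is found. The helper name `geocode_hotels`, the `pause_s` parameter (a pause between requests, since public geocoding services generally ask for throttled use), and the stub geocoder are assumptions for illustration. The stub returns the coordinates recorded in this notebook's output rather than calling a live service.

```python
import time
from collections import namedtuple


def geocode_hotels(addresses, geocode, pause_s=1.0):
    """Map each address to a (latitude, longitude) tuple, or None if the
    geocoder finds nothing.

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` and `.longitude`, or None.
    """
    results = {}
    for n, address in enumerate(addresses):
        if n > 0 and pause_s > 0:
            time.sleep(pause_s)  # throttle requests to the service
        location = geocode(address)
        if location is None:
            results[address] = None
        else:
            results[address] = (location.latitude, location.longitude)
    return results


# Stub geocoder built from the coordinates printed earlier in this notebook,
# used here so the example runs without network access.
Location = namedtuple("Location", "latitude longitude")
known = {
    "66 Charlton Street, New York City, NY 10014":
        Location(40.7271283333, -74.0061074667),
    "123 Nassau St, New York City, NY 10038":
        Location(40.7110179444, -74.0069278333),
}

hotels = geocode_hotels(
    list(known) + ["an address the stub does not know"],
    known.get,
    pause_s=0,
)
for address, coords in hotels.items():
    print(address, "->", coords)
```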