# Pickups and Dropoffs for All NYC Hotels

In this notebook I extend the taxicab and hotel investigation by using more data (Yellow and Green taxicab and FHV records) and all listed NYC hotels. The goals are to find the most frequented tourist attractions and business destinations, and to draw clusters and heatmaps by time of day.

In [1]:
# imports...
import csv, imp, os, numpy as np, pandas as pd, matplotlib.pyplot as plt
from geopy.geocoders import GoogleV3
import gmplot
from IPython.display import Image, display
from IPython.core.display import HTML
import webbrowser

# importing helper methods
from util import *

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# matplotlib setup
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

### Importing and Cleaning Data

Let's try reading in the Yellow and Green taxicab data files from January 2016 to experiment with. Together these files are approximately 2 GB, so they take some time to read in and process, but once that is done the rest of the analysis is less memory-intensive.
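
If memory becomes a problem with files of this size, one option is to read them in chunks and keep only the coordinate columns. The following is a minimal sketch rather than the code used in this notebook: `read_coords_in_chunks` is a hypothetical helper, the in-memory sample stands in for a real taxi file, and the demo assumes Python 3 and a reasonably recent pandas.

```python
import io

import numpy as np
import pandas as pd


def read_coords_in_chunks(csv_source, lat_col, lon_col, chunksize=100000):
    """Read two coordinate columns from a CSV in chunks; return a (2, N) array."""
    chunks = []
    reader = pd.read_csv(csv_source, usecols=[lat_col, lon_col], chunksize=chunksize)
    for chunk in reader:
        # keep only the numeric values, in (latitude, longitude) column order
        chunks.append(chunk[[lat_col, lon_col]].values)
    if not chunks:
        return np.empty((2, 0))
    # stack the per-chunk (n, 2) arrays, then transpose to rows of lats and longs
    return np.vstack(chunks).T


# tiny in-memory stand-in for a taxi file (made-up values)
sample = io.StringIO(
    "pickup_latitude,pickup_longitude,fare\n"
    "40.75,-73.99,10\n"
    "40.76,-73.98,12\n"
    "40.70,-74.01,8\n"
)
coords = read_coords_in_chunks(sample, "pickup_latitude", "pickup_longitude", chunksize=2)
```

Only one chunk of the full table is held in memory at a time, at the cost of a final `np.vstack` over the retained coordinate arrays.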

In [2]:
# change this if you want to try different dataset(s)
taxi_files = ['../data/yellow_tripdata_2016-01.csv', '../data/green_tripdata_2016-01.csv']

# variables to store pick-up and drop-off coordinates
pickup_coords, dropoff_coords = [], []

for taxi_file in taxi_files:
        
    if 'green' in taxi_file:
        # load only the coordinate and pick-up time columns of the Green taxicab file
        taxi_data = pd.read_csv(taxi_file, usecols=['Pickup_latitude', 'Pickup_longitude', 'Dropoff_latitude', 'Dropoff_longitude', 'lpep_pickup_datetime'])
        
        # get the relevant columns of the data and store them as numpy arrays
        pickup_lats, pickup_longs = np.array(taxi_data['Pickup_latitude']), np.array(taxi_data['Pickup_longitude'])
        dropoff_lats, dropoff_longs = np.array(taxi_data['Dropoff_latitude']), np.array(taxi_data['Dropoff_longitude'])
        pickup_datetimes = np.array(taxi_data['lpep_pickup_datetime'])
    elif 'yellow' in taxi_file:
        # load only the coordinate and pick-up time columns of the Yellow taxicab file
        taxi_data = pd.read_csv(taxi_file, usecols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'tpep_pickup_datetime'])
        
        # get the relevant columns of the data and store them as numpy arrays
        pickup_lats, pickup_longs = np.array(taxi_data['pickup_latitude']), np.array(taxi_data['pickup_longitude'])
        dropoff_lats, dropoff_longs = np.array(taxi_data['dropoff_latitude']), np.array(taxi_data['dropoff_longitude'])
        pickup_datetimes = np.array(taxi_data['tpep_pickup_datetime'])
    else:
        # this shouldn't happen
        raise NotImplementedError

    # remove the taxicab data from memory
    del taxi_data

    # zip together lats, longs for coordinates and append them to the lists
    pickup_coords.extend(zip(pickup_lats, pickup_longs))
    dropoff_coords.extend(zip(dropoff_lats, dropoff_longs))
    
pickup_coords = np.array(pickup_coords).T
dropoff_coords = np.array(dropoff_coords).T

### Geolocating Hotels

We use geopy, a Python client for several popular geocoding web services. Of these, we use Google's geolocation service (https://developers.google.com/maps/documentation/geolocation/intro), since it appears to be the most accurate and user-friendly option, and it costs little to operate even with many requests.
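
Since each lookup is a billable web request, it can help to cache results per address and to keep hotel names in step with the coordinates that were actually found. The sketch below is illustrative only: `geocode_hotels` and the fake geocoder are hypothetical, and the only assumption about the real service is that the `geocode` callable returns either `None` or an object with `latitude` and `longitude` attributes, as geopy's geocoders do.

```python
from collections import namedtuple


def geocode_hotels(names, addresses, geocode):
    """Geocode each address; return names and (lat, lon) pairs kept in step."""
    cache = {}
    kept_names, coords = [], []
    for name, address in zip(names, addresses):
        if address not in cache:
            # one request per distinct address
            cache[address] = geocode(address)
        location = cache[address]
        if location is None:
            # skip addresses the service could not resolve
            continue
        kept_names.append(name)
        coords.append((location.latitude, location.longitude))
    return kept_names, coords


# stand-in for a real geocoder, with made-up data, so the sketch runs offline
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])
calls = []


def fake_geocode(address):
    calls.append(address)
    known = {"1 Main St": FakeLocation(40.75, -73.99)}
    return known.get(address)


names, coords = geocode_hotels(
    ["Hotel A", "Hotel B", "Hotel C"],
    ["1 Main St", "nowhere", "1 Main St"],
    fake_geocode,
)
```

With a real geolocator, the bound method `geolocator.geocode` could be passed as the `geocode` argument.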

In [3]:
# get file containing hotel names and addresses
hotel_file = pd.read_excel('../data/Name_Address_ID.xlsx')

# split the file into lists of names and addresses
hotel_names = hotel_file['Name']
hotel_addresses = hotel_file['Address']

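# setting up geolocator object; the API key is read from an environment variable
# rather than stored in the notebook (the variable name GOOGLE_API_KEY is an assumption)
geolocator = GoogleV3(api_key=os.environ['GOOGLE_API_KEY'], timeout=10)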

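# storing the geocodes of the above addresses, along with the names of the hotels that were found
hotel_coords, geocoded_names = [], []

print '...getting hotel coordinates'

# get and store hotel coordinates
for idx, hotel_address in enumerate(hotel_addresses):
    # print progress to console
    if idx % 10 == 0:
        print 'progress:', idx, '/', len(hotel_addresses)
    
    # get the hotel's geolocation; skip addresses the geocoder cannot resolve
    location = geolocator.geocode(hotel_address)
    if location is None:
        continue
    
    # get the coordinates of the hotel from the geolocation
    hotel_coord = (location.latitude, location.longitude)
    
    # add it to our list, keeping the hotel's name at the same position
    hotel_coords.append(hotel_coord)
    geocoded_names.append(hotel_names[idx])
    
print 'progress:', len(hotel_addresses), '/', len(hotel_addresses)

# keep names aligned with coordinates, since later cells index both by position
hotel_names = geocoded_names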

...getting hotel coordinates
progress: 0 / 178
progress: 10 / 178
progress: 20 / 178
progress: 30 / 178
progress: 40 / 178
progress: 50 / 178
progress: 60 / 178
progress: 70 / 178
progress: 80 / 178
progress: 90 / 178
progress: 100 / 178
progress: 110 / 178
progress: 120 / 178
progress: 130 / 178
progress: 140 / 178
progress: 150 / 178
progress: 160 / 178
progress: 170 / 178
progress: 178 / 178


In [32]:
# distance (in feet) criterion
distance = 300

# create and open spreadsheet for nearby pick-ups and drop-offs for each hotel
writer = pd.ExcelWriter('../data/Nearby Pickups and Dropoffs.xlsx')

## Finding Nearby Taxicab Pick-ups and Corresponding Drop-offs

For each hotel, we want to find all taxicab rides which begin within a certain distance of the hotel (here 300 feet, roughly 91 meters, as set by `distance` above).
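
The distance test can be sketched with the haversine formula. This is an illustration under stated assumptions, not the `get_destinations` helper from `util.py`, whose implementation is not shown here: `haversine_feet` and `nearby_pickups` are hypothetical names, the coordinates are made up, and the (3, M) return shape is chosen only so that its transpose lines up with the three columns written out below.

```python
import numpy as np

EARTH_RADIUS_METERS = 6371008.8  # mean Earth radius
FEET_PER_METER = 1.0 / 0.3048


def haversine_feet(lat1, lon1, lat2, lon2):
    """Great-circle distance in feet between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding just outside [0, 1]
    return 2.0 * EARTH_RADIUS_METERS * FEET_PER_METER * np.arcsin(np.sqrt(a))


def nearby_pickups(pickup_coords, dropoff_coords, hotel_coord, max_feet):
    """Return a (3, M) array: distance from hotel, drop-off lat, drop-off lon,
    for every trip whose pick-up lies within max_feet of the hotel."""
    pickups = np.asarray(pickup_coords, dtype=float)    # shape (N, 2)
    dropoffs = np.asarray(dropoff_coords, dtype=float)  # shape (N, 2)
    dists = haversine_feet(pickups[:, 0], pickups[:, 1], hotel_coord[0], hotel_coord[1])
    mask = dists <= max_feet
    return np.vstack((dists[mask], dropoffs[mask, 0], dropoffs[mask, 1]))


# made-up example: three trips, the last one starting too far from the hotel
hotel = (40.7580, -73.9855)
pickups = [(40.7580, -73.9855), (40.7585, -73.9855), (40.7680, -73.9855)]
dropoffs = [(40.70, -74.00), (40.71, -74.01), (40.72, -74.02)]
result = nearby_pickups(pickups, dropoffs, hotel, max_feet=300)
```

Because the filter is a single vectorized mask over all trips, it avoids a Python-level loop per ride. The drop-off case is symmetric, with the roles of the pick-up and drop-off arrays swapped.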

In [None]:
print '...finding distance criterion-satisfying taxicab pick-ups', '\n'

# loop through each hotel and find all satisfying taxicab rides
for idx, hotel_coord in enumerate(hotel_coords):
    # print progress to console
    print 'hotels progress:', idx, '/', len(hotel_coords)
    print '...finding satisfying taxicab rides for', hotel_names[idx], '\n'
    
    # call the 'get_destinations' function from the 'util.py' script on all trips stored
    destinations = get_destinations(pickup_coords.T, dropoff_coords.T, hotel_coord, distance, unit='feet').T
    
    # create pandas DataFrame from output from destinations (distance from hotel, latitude, longitude)
    index = [ i for i in range(1, destinations.shape[0] + 1)]
    destinations = pd.DataFrame(destinations, index=index, columns=['Distance From Hotel', 'Latitude', 'Longitude'])
    
    # add column for hotel name
    name_frame = pd.DataFrame([hotel_names[idx]] * destinations.shape[0], index=destinations.index, columns=['Hotel Name'])
    to_write = pd.concat([name_frame, destinations], axis=1)
    
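    # write sheet to Excel file, stacking each hotel's rows below the previous ones
    if idx == 0:
        to_write.to_excel(writer, 'Nearby Pick-ups', index=False)
        next_row = len(to_write) + 1  # the header occupies the first row
    else:
        to_write.to_excel(writer, 'Nearby Pick-ups', startrow=next_row, header=False, index=False)
        next_row += len(to_write)
    
# the ExcelWriter object stays open here: the drop-offs cell below adds a second sheet and then closes it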

...finding distance criterion-satisfying taxicab pick-ups 

hotels progress: 0 / 178
...finding satisfying taxicab rides for Homewood Suites New York Midtown Manhattan Times Square South 



## Finding Nearby Taxicab Drop-offs and Corresponding Pick-ups

Now, for each hotel, we want to find all taxicab rides which end within a certain distance of the hotel (again 300 feet, roughly 91 meters).

In [None]:
print '...finding distance criterion-satisfying taxicab drop-offs', '\n'

# loop through each hotel and find all satisfying taxicab rides
for idx, hotel_coord in enumerate(hotel_coords):
    # print progress to console
    print 'hotels progress:', idx, '/', len(hotel_coords)
    print '...finding satisfying taxicab rides for', hotel_names[idx], '\n'
    
    # call the 'get_starting_points' function from the 'util.py' script on all trips stored
    starting_points = get_starting_points(pickup_coords.T, dropoff_coords.T, hotel_coord, distance, unit='feet').T
    
    # create pandas DataFrame from the starting points output (distance from hotel, latitude, longitude)
    index = [ i for i in range(1, starting_points.shape[0] + 1)]
    starting_points = pd.DataFrame(starting_points, index=index, columns=['Distance From Hotel', 'Latitude', 'Longitude'])
    
    # add column for hotel name
    name_frame = pd.DataFrame([hotel_names[idx]] * starting_points.shape[0], index=starting_points.index, columns=['Hotel Name'])
    to_write = pd.concat([name_frame, starting_points], axis=1)
    
    # write sheet to Excel file, stacking each hotel's rows below the previous ones
    if idx == 0:
        to_write.to_excel(writer, 'Nearby Drop-offs', index=False)
        next_row = len(to_write) + 1  # the header occupies the first row
    else:
        to_write.to_excel(writer, 'Nearby Drop-offs', startrow=next_row, header=False, index=False)
        next_row += len(to_write)
    
# close the ExcelWriter object    
writer.close()