# NYC Taxi Data, Hotel Pairing Experiment (Part 2)

In this part I extend the taxicab and hotel investigation by using more data (Yellow and Green taxicab records) and more hotels, finding the most frequented tourist attractions and business destinations, and drawing clusters and heatmaps by time of day.

In [1]:
# imports...
import csv, imp, os, numpy as np, pandas as pd, matplotlib.pyplot as plt
from geopy.geocoders import Nominatim
import gmplot
from IPython.display import HTML, Image, display
import webbrowser

# importing helper methods
from util import *

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# matplotlib setup
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

### Importing and Cleaning Data

Let's read in a Yellow taxicab data file from January 2016 to experiment with. The file is approximately 1.7 GB, so reading and processing it takes some time. Once the coordinate and timestamp columns have been extracted and the full table dropped from memory, the rest of the analysis is less memory-intensive.

In [8]:
# change this if you want to try different dataset(s)
taxi_files = ['../data/yellow_tripdata_2016-01.csv']

# variables to store pick-up and drop-off coordinates
pickup_coords, dropoff_coords = [], []

for taxi_file in taxi_files:
    # let's load a single .csv file of taxicab records (say, January 2016)
    taxi_data = pd.read_csv(taxi_file)
        
    # get relevant columns of the data and store them as numpy arrays
    # (Green and Yellow files name their columns differently)
    if 'green' in taxi_file:
        pickup_lats, pickup_longs = np.array(taxi_data['Pickup_latitude']), np.array(taxi_data['Pickup_longitude'])
        dropoff_lats, dropoff_longs = np.array(taxi_data['Dropoff_latitude']), np.array(taxi_data['Dropoff_longitude'])
        pickup_datetimes = np.array(taxi_data['lpep_pickup_datetime'])
    elif 'yellow' in taxi_file:
        pickup_lats, pickup_longs = np.array(taxi_data['pickup_latitude']), np.array(taxi_data['pickup_longitude'])
        dropoff_lats, dropoff_longs = np.array(taxi_data['dropoff_latitude']), np.array(taxi_data['dropoff_longitude'])
        pickup_datetimes = np.array(taxi_data['tpep_pickup_datetime'])
    else:
        # neither a Green nor a Yellow file, so the column names are unknown
        raise NotImplementedError('unrecognized taxi file: ' + taxi_file)

    # remove the taxicab data from memory
    del taxi_data

    # zip together lats, longs for coordinates and append them to the lists
    # (list() materializes the pairs, since zip returns a one-shot iterator in Python 3)
    pickup_coords.append(list(zip(pickup_lats, pickup_longs)))
    dropoff_coords.append(list(zip(dropoff_lats, dropoff_longs)))
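
Since only five columns of each file are used, one way to reduce the load time and peak memory might be to ask pandas to parse just those columns. The following is a minimal sketch, not what the cell above does: `load_trip_columns` and `YELLOW_COLUMNS` are hypothetical names introduced for illustration, and the column names are the Yellow ones used above.

```python
import pandas as pd

# Column names as they appear in the Yellow files read in the cell above.
YELLOW_COLUMNS = [
    'tpep_pickup_datetime',
    'pickup_latitude', 'pickup_longitude',
    'dropoff_latitude', 'dropoff_longitude',
]

def load_trip_columns(path, columns=YELLOW_COLUMNS):
    """Read only the listed columns of a taxi .csv file (hypothetical helper)."""
    # usecols tells read_csv to skip every column that is not listed
    return pd.read_csv(path, usecols=columns)
```

With this helper, `taxi_data = load_trip_columns(taxi_file)` could stand in for the `pd.read_csv(taxi_file)` call for Yellow files. A Green file would need its own column list (`Pickup_latitude`, `lpep_pickup_datetime`, and so on).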

### Geolocating Hotels

We use geopy, a Python client for several popular geocoding web services. We try the OpenStreetMap Nominatim service (https://nominatim.openstreetmap.org/), since it seems accurate and doesn't require authentication. I might move to Google's geolocation service (https://developers.google.com/maps/documentation/geolocation/intro), but I'm not sure this is necessary. It might be something to discuss with Prof. Rojas.
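
As a rough sketch of what the Nominatim lookup could look like (an illustration under stated assumptions, not necessarily the code used later in this notebook): `make_nominatim_geocoder` and `geocode_hotels` are hypothetical helper names, and appending `', New York, NY'` to each query is an assumption made to narrow the search. The pause between requests is there because the public Nominatim server asks clients to send at most one request per second, and recent geopy versions require a custom `user_agent` string.

```python
import time

def make_nominatim_geocoder(user_agent):
    """Return a function mapping a query string to a geopy Location (or None)."""
    # geopy is imported here so geocode_hotels can be used without it installed
    from geopy.geocoders import Nominatim
    return Nominatim(user_agent=user_agent).geocode

def geocode_hotels(names, geocode, delay_seconds=1.0):
    """Look up each hotel name; return {name: (latitude, longitude)}."""
    coords = {}
    for i, name in enumerate(names):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        location = geocode(name + ', New York, NY')
        if location is not None:  # geocode returns None when nothing is found
            coords[name] = (location.latitude, location.longitude)
    return coords
```

A call might look like `geocode_hotels(hotel_names, make_nominatim_geocoder('nyc-taxi-hotel-notebook'))`, where `hotel_names` is a list of hotel names and the user agent string is a placeholder. Hotels that Nominatim cannot find are left out of the result, so they can be checked by hand.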