## Collaborative Notebook for EDA and Feature Engineering of Taxi Data

A good place to start might be how taxis are regulated. I've never taken a taxi in NY, but I believe they're metered and that fares follow a set formula, even if that formula is complex (e.g. different rates depending on whether the taxi is moving or stuck in traffic).

I found [this website](http://www.nyc.gov/html/tlc/html/passenger/taxicab_rate.shtml) that has a rundown of some of the dimensions involved in calculating fares.

There are some data quality issues that we'll need to deal with, too. It looks like there are at least a few zero values in the lat and lon fields. Also, the timestamp is in UTC. Pandas has some timezone conversion functions, so that shouldn't be too difficult to correct. It would be worth manually checking a few dozen rows to make sure that they have converted correctly. I think those built-in functions take into account all of the special cases like daylight saving time. We can double-check, though.
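As a rough sketch of the two fixes described above, the snippet below drops rows with a zero coordinate and converts the UTC timestamp to New York local time. The small `sample` frame is hand-made for illustration (the zeros in the second row are planted on purpose, not taken from the real data), and the column names `pickup_utc` and `pickup_local` are my own suggestions.

```python
import pandas as pd

# Hand-made sample in the same shape as the taxi data.
# ASSUMPTION: the zero coordinates in the second row are planted for illustration.
sample = pd.DataFrame({
    "pickup_datetime": [
        "2009-06-15 17:26:21 UTC",
        "2010-01-05 16:52:16 UTC",
        "2011-08-18 00:35:00 UTC",
    ],
    "pickup_longitude": [-73.844311, 0.0, -73.982738],
    "pickup_latitude": [40.721319, 0.0, 40.76127],
    "dropoff_longitude": [-73.84161, -73.979268, -73.991242],
    "dropoff_latitude": [40.712278, 40.782004, 40.750562],
})

coord_cols = [
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
]

# Flag any row where at least one coordinate is exactly zero, then drop it.
has_zero = (sample[coord_cols] == 0).any(axis=1)
clean = sample.loc[~has_zero].copy()

# Parse the timestamp as timezone-aware UTC, then convert to New York time.
# tz_convert follows the tz database rules, including daylight saving shifts.
clean["pickup_utc"] = pd.to_datetime(
    clean["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
)
clean["pickup_local"] = clean["pickup_utc"].dt.tz_convert("America/New_York")

print(clean[["pickup_utc", "pickup_local"]])
```

Printing the UTC and local columns side by side like this is one way to do the manual spot check: a June pickup should shift by four hours, a January one by five.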

![map](outsideny.png)

Here are some possible features that occurred to me and that we might explore:

* distance as the crow flies
* actual driving distance
* time of day
* day of the week
* crosses bridge (tolls) (binary variable)
* inter-borough/intra-borough
* holiday (binary variable)
* estimated trip time (data doesn't have start and stop time, so this could take some work)
* proximity to landmarks for pickup or dropoff
* season/month
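A few of the features in the list above can be sketched with just pandas and numpy. This is a minimal example, assuming the haversine (great-circle) formula is an acceptable stand-in for "as the crow flies" and that time features should be based on New York local time. The function names `haversine_km` and `add_basic_features` and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_basic_features(df):
    """Return a copy of df with distance and local-time features added."""
    out = df.copy()
    out["crow_km"] = haversine_km(
        out["pickup_latitude"], out["pickup_longitude"],
        out["dropoff_latitude"], out["dropoff_longitude"],
    )
    # Convert to local time first so hour and weekday reflect New York clocks.
    local = pd.to_datetime(
        out["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
    ).dt.tz_convert("America/New_York")
    out["hour"] = local.dt.hour
    out["day_of_week"] = local.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = local.dt.month
    return out


# One row copied from the head() output, used here only as a smoke test.
demo = pd.DataFrame({
    "pickup_datetime": ["2009-06-15 17:26:21 UTC"],
    "pickup_longitude": [-73.844311],
    "pickup_latitude": [40.721319],
    "dropoff_longitude": [-73.84161],
    "dropoff_latitude": [40.712278],
})
features = add_basic_features(demo)
print(features[["crow_km", "hour", "day_of_week", "month"]])
```

Actual driving distance, bridge crossings, and borough membership would need external data (a routing service or borough shapefiles), so they are left out of this sketch.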



If you're thinking about doing the Coursera specialization, the first course is mostly the philosophy of ML according to Google. If you don't need all that, I recommend you start at Python notebooks in the cloud about 2/3 of the way through the course.

In [14]:
import pandas as pd
import numpy as np

In [16]:
taxi_data = pd.read_csv("train_subset.csv")
taxi_data.head()

| Unnamed: 0.1 | Unnamed: 0 | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2009-06-15 17:26:21.0000001 | 4.5 | 2009-06-15 17:26:21 UTC | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 1 |
| 1 | 2 | 2010-01-05 16:52:16.0000002 | 16.9 | 2010-01-05 16:52:16 UTC | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 3 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 |
| 3 | 4 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 5 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
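
The two `Unnamed` columns in the output above are probably leftover index columns from earlier saves of the subset file (pandas writes the index by default in `to_csv` and reads it back as `Unnamed: 0`). Assuming that's right and they carry no information, here is a small sketch of dropping them. The inline CSV text just reuses the header and first two rows shown above so the example runs without `train_subset.csv`.

```python
import io

import pandas as pd

# Header and first two rows copied from the head() output above.
csv_text = """Unnamed: 0.1,Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,1,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
# ASSUMPTION: those columns are stale indexes, not real features.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]

print(df.columns.tolist())
```

Passing `index=False` to `to_csv` whenever we re-save a subset should stop new ones from appearing.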
