# Crazy Taxi Data Exploration and Acquisition
Here we explore the data sets obtained from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. This data is collected by the New York City Taxi & Limousine Commission (TLC) on New York City taxis.

There are three types of taxis in NYC:

- Yellow: the usual yellow taxis, which are not restricted in where they may pick up customers.
- Green: also called Boro Taxis, these primarily serve the areas outside of downtown Manhattan and are not allowed to pick up in the Manhattan exclusionary zone.
- FHV (For-Hire Vehicle): prearranged rides throughout New York, booked through approved third-party dispatchers.

Here is a map of the zones:
![Map of zones](map_service_area_map_thumbnail.jpg)
*The yellow area is the Manhattan exclusionary zone; the green area is where both yellow and green taxis can pick up and drop off. (http://www.nyc.gov/html/tlc/html/passenger/shl_passenger.shtml)*

## Loading the Data into pandas and Cleaning

In [1]:
import pandas as pd

yellow_data = pd.read_csv('TaxiData/yellow_tripdata_2016-06.csv')
green_data = pd.read_csv('TaxiData/green_tripdata_2016-06.csv')

In [2]:
yellow_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2016-06-09 21:06:36,2016-06-09 21:13:08,2,0.79,-73.98336,40.760937,1,N,-73.977463,40.753979,2,6.0,0.5,0.5,0.0,0.0,0.3,7.3
1,2,2016-06-09 21:06:36,2016-06-09 21:35:11,1,5.22,-73.98172,40.736668,1,N,-73.981636,40.670242,1,22.0,0.5,0.5,4.0,0.0,0.3,27.3
2,2,2016-06-09 21:06:36,2016-06-09 21:13:10,1,1.26,-73.994316,40.751072,1,N,-74.004234,40.742168,1,6.5,0.5,0.5,1.56,0.0,0.3,9.36
3,2,2016-06-09 21:06:36,2016-06-09 21:36:10,1,7.39,-73.982361,40.773891,1,N,-73.929466,40.85154,1,26.0,0.5,0.5,1.0,0.0,0.3,28.3
4,2,2016-06-09 21:06:36,2016-06-09 21:23:23,1,3.1,-73.987106,40.733173,1,N,-73.985909,40.766445,1,13.5,0.5,0.5,2.96,0.0,0.3,17.76


In [3]:
green_data.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,...,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
0,2,2016-06-01 02:46:38,2016-06-01 03:06:40,N,1,-73.93058,40.695179,-74.000053,40.729046,1,...,19.5,0.5,0.5,6.24,0.0,,0.3,27.04,1,1.0
1,2,2016-06-01 02:55:26,2016-06-01 03:06:52,N,1,-73.94693,40.792553,-73.951569,40.825161,1,...,11.5,0.5,0.5,2.56,0.0,,0.3,15.36,1,1.0
2,2,2016-06-01 02:50:36,2016-06-01 03:08:39,N,1,-73.944534,40.823956,-73.994659,40.750423,1,...,23.5,0.5,0.5,2.0,0.0,,0.3,26.8,1,1.0
3,2,2016-06-01 02:57:04,2016-06-01 03:07:52,N,1,-73.952209,40.823872,-73.91436,40.814697,1,...,10.5,0.5,0.5,0.0,0.0,,0.3,11.8,2,1.0
4,2,2016-06-01 02:52:03,2016-06-01 03:08:12,N,1,-73.957977,40.717827,-73.954018,40.655121,3,...,16.5,0.5,0.5,0.0,0.0,,0.3,17.8,1,1.0


Now that the data is loaded, any rows with null pickup or dropoff coordinates must be removed.

In [None]:
# Drop rows that are missing any pickup or dropoff coordinate.
yellow_data = yellow_data.dropna(subset=[
    'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude',
])

# The green taxi data uses capitalized column names.
green_data = green_data.dropna(subset=[
    'Pickup_longitude', 'Pickup_latitude',
    'Dropoff_longitude', 'Dropoff_latitude',
])

The cleaned data is then explored for basic summary statistics.

In [4]:
yellow_data.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0
mean,1.529817,1.657273,3.044006,-73.05081,40.24282,1.04388,-73.12388,40.28391,1.349718,13.50708,0.3407188,0.4973046,1.842121,0.3402089,0.2996818,16.83016
std,0.4991102,1.302489,21.83019,8.208047,4.521673,0.566061,7.880313,4.341196,0.4944984,275.5358,0.5339716,0.04451916,2.713585,1.71971,0.01358086,275.8608
min,1.0,0.0,0.0,-118.1863,0.0,1.0,-118.1863,0.0,1.0,-450.0,-41.23,-2.7,-67.7,-12.5,-0.3,-450.8
25%,1.0,1.0,1.0,-73.99178,40.73653,1.0,-73.99123,40.73492,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.8
50%,2.0,1.0,1.71,-73.98135,40.75358,1.0,-73.97935,40.75412,1.0,10.0,0.0,0.5,1.35,0.0,0.3,12.3
75%,2.0,2.0,3.23,-73.96617,40.76831,1.0,-73.96202,40.76954,2.0,15.5,0.5,0.5,2.46,0.0,0.3,18.36
max,2.0,9.0,71732.7,0.0,64.09648,99.0,106.2469,60.04071,5.0,628544.7,597.92,60.35,854.85,970.0,11.64,629033.8


In [5]:
green_data.describe()

Unnamed: 0,VendorID,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
count,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,1404726.0,0.0,1404726.0,1404726.0,1404726.0,1404724.0
mean,1.79514,1.09179,-73.82591,40.68546,-73.85015,40.69769,1.358612,2.879364,12.50718,0.3502088,0.4868255,1.307438,0.1190993,,0.2921007,15.08798,1.515029,1.02131
std,0.4035993,0.5990534,2.863031,1.578683,2.509123,1.383819,1.026833,2.990728,10.69113,0.3857585,0.08534602,2.909235,0.9104712,,0.05122782,12.23701,0.5241456,0.1444165
min,1.0,1.0,-75.91609,0.0,-75.9155,0.0,0.0,0.0,-499.0,-4.5,-0.5,-13.2,-5.54,,-0.3,-499.0,1.0,1.0
25%,2.0,1.0,-73.96138,40.69402,-73.96924,40.69503,1.0,1.07,6.5,0.0,0.5,0.0,0.0,,0.3,8.19,1.0,1.0
50%,2.0,1.0,-73.94639,40.74594,-73.94553,40.7461,1.0,1.9,9.5,0.5,0.5,0.0,0.0,,0.3,11.76,2.0,1.0
75%,2.0,1.0,-73.91862,40.80157,-73.91144,40.7891,1.0,3.6,15.0,0.5,0.5,2.0,0.0,,0.3,18.3,2.0,1.0
max,2.0,99.0,0.0,42.32437,0.0,42.3243,9.0,268.19,3347.5,4.5,0.5,300.08,98.0,,0.3,3349.3,5.0,2.0
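
The summary statistics above include latitudes and longitudes of 0.0, and longitudes as far west as -118.1863, none of which null filtering removes. One possible follow-up, sketched here on a toy frame, is to keep only trips whose coordinates fall inside a rough bounding box around New York City. The `NYC_BOUNDS` limits and the `within_bounds` helper are hypothetical names with assumed values, chosen for illustration; they are not part of the original analysis.

```python
import pandas as pd

# Hypothetical, deliberately loose bounding box around New York City (degrees).
# These limits are an assumption for illustration, not values published by the TLC.
NYC_BOUNDS = {"lon_min": -74.3, "lon_max": -73.6, "lat_min": 40.4, "lat_max": 41.0}


def within_bounds(df, lon_col, lat_col, bounds=NYC_BOUNDS):
    """Return a boolean mask of rows whose (lon, lat) pair lies inside the box."""
    return (
        df[lon_col].between(bounds["lon_min"], bounds["lon_max"])
        & df[lat_col].between(bounds["lat_min"], bounds["lat_max"])
    )


# Toy frame with the yellow taxi column names. The second row mimics a 0.0
# placeholder pickup; the third has a dropoff longitude far outside New York.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98336, 0.0, -73.994316],
    "pickup_latitude": [40.760937, 0.0, 40.751072],
    "dropoff_longitude": [-73.977463, -73.981636, -118.1863],
    "dropoff_latitude": [40.753979, 40.670242, 40.742168],
})

# Keep a trip only if both its pickup and its dropoff are inside the box.
mask = (
    within_bounds(toy, "pickup_longitude", "pickup_latitude")
    & within_bounds(toy, "dropoff_longitude", "dropoff_latitude")
)
toy_clean = toy[mask]
```

The same mask could be applied to `green_data` by passing its capitalized column names. A box this coarse may also discard legitimate long-distance trips, so the limits would need to be checked against the intended analysis.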


For this project, the data was chosen from June 2016. This is the most recent data set in which the New York City Taxi & Limousine Commission recorded pickup and dropoff locations in degrees. More recent data uses UTM, which requires conversion using GIS software. To keep the focus on data science, the June 2016 data was used. In addition, the conversion introduces calculation errors, so the degree data is also more accurate. Lastly, the FHV data does not include any pickup or dropoff location, so it is not usable.