# NYC Taxi Fare Prediction

### **Table of Contents**
1. Download data
2. Read data using Pandas
3. Exploratory data analysis and data cleaning
4. Feature engineering 
5. Understand temporal feature conversions 
6. Todo

## Import packages

In [3]:
import numpy as np
import seaborn as sb
import pandas as pd
import tensorflow as tf

## 1. Download data
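Download the data with ```wget```. Many months of trip records are available at https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, but we are going to use only ```yellow_tripdata_2020-06.csv```, which has about 550,000 rows (the ```describe``` output in Section 3 counts 549,760 rows, of which 499,043 have fields such as ```VendorID``` filled in).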


In [1]:
# !wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-02.csv
# !wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-03.csv
# !wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-04.csv
# !wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-05.csv
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-06.csv

--2020-11-16 22:49:25--  https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-06.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.106.29
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.106.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50277193 (48M) [text/csv]
Saving to: ‘yellow_tripdata_2020-06.csv.1’
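The log above ends with ```wget``` saving to ```yellow_tripdata_2020-06.csv.1```, which is what ```wget``` does when a file with the target name already exists in the directory. A minimal sketch of one way to avoid piling up duplicate copies when the cell is re-run, using only the Python standard library; ```download_if_missing``` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless `dest` already exists; return the path."""
    path = Path(dest)
    if not path.exists():
        # urlretrieve writes the response body straight to the given file.
        urlretrieve(url, str(path))
    return path


# Example call (commented out so the cell does not hit the network by default):
# download_if_missing(
#     "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-06.csv",
#     "yellow_tripdata_2020-06.csv",
# )
```

Because the helper returns early when the file is present, re-running the cell leaves the existing copy untouched, so delete the file first if a fresh download is needed.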



## 2. Read data using Pandas

Let's peek at the data using the ```head``` method. The definition of each column can be found in the [data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf). Also note that ```PULocationID``` and ```DOLocationID``` are the IDs of the pickup and drop-off taxi zones; the [zone lookup table](https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv) maps each ID to a zone.


In [5]:
data = pd.read_csv('./yellow_tripdata_2020-06.csv')
data.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-06-01 00:31:23 | 2020-06-01 00:49:58 | 1.0 | 3.6 | 1.0 | N | 140 | 68 | 1.0 | 15.5 | 3.0 | 0.5 | 4.0 | 0.0 | 0.3 | 23.3 | 2.5 |
| 1 | 1.0 | 2020-06-01 00:42:50 | 2020-06-01 01:04:33 | 1.0 | 5.6 | 1.0 | N | 79 | 226 | 1.0 | 19.5 | 3.0 | 0.5 | 2.0 | 0.0 | 0.3 | 25.3 | 2.5 |
| 2 | 1.0 | 2020-06-01 00:39:51 | 2020-06-01 00:49:09 | 1.0 | 2.3 | 1.0 | N | 238 | 116 | 2.0 | 10.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 11.3 | 0.0 |
| 3 | 1.0 | 2020-06-01 00:56:13 | 2020-06-01 01:11:38 | 1.0 | 5.3 | 1.0 | N | 141 | 116 | 2.0 | 17.5 | 3.0 | 0.5 | 0.0 | 0.0 | 0.3 | 21.3 | 2.5 |
| 4 | 1.0 | 2020-06-01 00:16:41 | 2020-06-01 00:29:30 | 1.0 | 4.4 | 1.0 | N | 186 | 75 | 1.0 | 14.5 | 3.0 | 0.5 | 3.65 | 0.0 | 0.3 | 21.95 | 2.5 |
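The location IDs are easier to read once they are translated into zone names. The sketch below shows one way to do that with ```Series.map```. It runs on a tiny stand-in lookup table: the column names ```LocationID``` and ```Zone``` are assumptions about the lookup CSV, the zone labels are placeholders rather than real zone names, and ```add_zone_names``` is a hypothetical helper.

```python
import pandas as pd

# The first three trips from the head() output (only the ID columns).
trips = pd.DataFrame({
    "PULocationID": [140, 79, 238],
    "DOLocationID": [68, 226, 116],
})

# Stand-in for the zone lookup table. Column names are assumed and the
# zone labels are placeholders; check both against the real lookup file.
zones = pd.DataFrame({
    "LocationID": [68, 79, 116, 140, 226, 238],
    "Zone": ["zone_68", "zone_79", "zone_116", "zone_140", "zone_226", "zone_238"],
})


def add_zone_names(trips: pd.DataFrame, zones: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `trips` with pickup and drop-off zone name columns."""
    id_to_zone = zones.set_index("LocationID")["Zone"]
    out = trips.copy()
    # IDs that are absent from the lookup become NaN instead of raising.
    out["PUZone"] = out["PULocationID"].map(id_to_zone)
    out["DOZone"] = out["DOLocationID"].map(id_to_zone)
    return out


labeled = add_zone_names(trips, zones)
print(labeled)
```

With the real files, ```zones``` would presumably come from ```pd.read_csv``` on the lookup CSV and ```trips``` would be the ```data``` frame loaded above.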


## 3. Exploratory data analysis

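```describe``` is a very useful method for getting summary statistics of the numeric columns. The values in the first row (```count```) differ across columns because ```count``` only tallies non-null entries: ```VendorID```, ```passenger_count```, ```RatecodeID```, and ```payment_type``` each have 499,043 values, while the remaining columns have 549,760, so each of those four columns is missing 50,717 values.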


In [6]:
data.describe()

| | VendorID | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 499043.0 | 499043.0 | 549760.0 | 499043.0 | 549760.0 | 549760.0 | 499043.0 | 549760.0 | 549760.0 | 549760.0 | 549760.0 | 549760.0 | 549760.0 | 549760.0 | 549760.0 |
| mean | 1.598351 | 1.356148 | 4.104275 | 1.047214 | 157.636474 | 153.473989 | 1.373327 | 13.606734 | 1.023772 | 0.491298 | 1.762904 | 0.367066 | 0.29698 | 18.768912 | 1.967681 |
| std | 0.490232 | 1.016665 | 336.02428 | 1.09579 | 69.756787 | 73.842217 | 0.531616 | 13.521364 | 1.263818 | 0.080832 | 2.631495 | 1.751982 | 0.041437 | 15.178965 | 1.044792 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | -216.0 | -4.5 | -0.5 | -36.3 | -28.75 | -0.3 | -216.3 | -2.5 |
| 25% | 1.0 | 1.0 | 1.01 | 1.0 | 107.0 | 87.0 | 1.0 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.7 | 2.5 |
| 50% | 2.0 | 1.0 | 1.86 | 1.0 | 151.0 | 151.0 | 1.0 | 9.0 | 0.5 | 0.5 | 1.5 | 0.0 | 0.3 | 14.16 | 2.5 |
| 75% | 2.0 | 1.0 | 3.66 | 1.0 | 234.0 | 233.0 | 2.0 | 15.5 | 2.5 | 0.5 | 2.75 | 0.0 | 0.3 | 20.8 | 2.5 |
| max | 2.0 | 9.0 | 220386.23 | 99.0 | 265.0 | 265.0 | 5.0 | 941.5 | 87.56 | 3.3 | 422.68 | 114.75 | 0.3 | 1141.1 | 2.5 |
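The ```count``` row above is not the same for every column. A minimal sketch of where such differences come from, on a toy frame (the values are illustrative, not taken from the full dataset): ```describe``` counts only non-null entries, and ```isna().sum()``` reports the complementary number of missing values.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern in the real data: some rows lack
# VendorID and passenger_count but still have a trip_distance.
toy = pd.DataFrame({
    "VendorID": [1.0, 2.0, np.nan, np.nan],
    "passenger_count": [1.0, 1.0, np.nan, np.nan],
    "trip_distance": [3.6, 5.6, 2.3, 5.3],
})

# "count" in describe() is the number of non-null values per column.
counts = toy.describe().loc["count"]

# isna().sum() is the number of missing values per column.
missing = toy.isna().sum()

print(counts)
print(missing)
```

On the real data, ```data.isna().sum()``` should list the per-column gaps directly, which is a useful starting point for deciding how to handle the incomplete rows during cleaning.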
