# NYC Taxi Trips: Data Cleaning

This notebook contains all the code needed to clean the raw data in `taxi_tripdata.csv` and write the cleaned dataset to `taxi_clean.csv`. The zone lookup in `taxi_zones.csv` is not used yet; incorporating it is listed as a next step near the end.

## Imports and Grabbing Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
dtypes = {'VendorID': 'Int64', 'store_and_fwd_flag': 'str', 'RatecodeID': 'Int64', 'passenger_count': 'Int64', 'payment_type': 'Int64', 'trip_type': 'Int64'}
parse_dates = ['lpep_pickup_datetime', 'lpep_dropoff_datetime']
df = pd.read_csv('data/taxi_tripdata.csv', dtype=dtypes, parse_dates=parse_dates)
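
The `Int64` (capital I) dtype used in the cell above is pandas' nullable integer type. A plausible reason for choosing it, though the notebook does not say so, is that columns such as `passenger_count` may contain missing values, which the default NumPy `int64` cannot hold. A minimal sketch on a made-up two-row CSV (the sample data are hypothetical, not taken from `taxi_tripdata.csv`):

```python
import io

import pandas as pd

# Hypothetical sample: the second row has no passenger_count.
sample_csv = io.StringIO(
    "VendorID,passenger_count,trip_distance\n"
    "1,2,1.5\n"
    "2,,3.0\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={'VendorID': 'Int64', 'passenger_count': 'Int64'},
)

# The column keeps an integer dtype; the gap is stored as <NA>.
print(sample.dtypes)
print(sample['passenger_count'].isna().sum())
```

Without the explicit dtype, pandas would fall back to `float64` for a numeric column with gaps, so IDs and counts would display as `1.0`, `2.0`, and so on.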

In [3]:
# First few rows of the raw data:
df.head(10)

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,1,2021-07-01 00:30:52,2021-07-01 00:35:36,N,1,74,168,1,1.2,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2,1,0.0
1,2,2021-07-01 00:25:36,2021-07-01 01:01:31,N,1,116,265,2,13.69,42.0,0.5,0.5,0.0,0.0,,0.3,43.3,2,1,0.0
2,2,2021-07-01 00:05:58,2021-07-01 00:12:00,N,1,97,33,1,0.95,6.5,0.5,0.5,2.34,0.0,,0.3,10.14,1,1,0.0
3,2,2021-07-01 00:41:40,2021-07-01 00:47:23,N,1,74,42,1,1.24,6.5,0.5,0.5,0.0,0.0,,0.3,7.8,2,1,0.0
4,2,2021-07-01 00:51:32,2021-07-01 00:58:46,N,1,42,244,1,1.1,7.0,0.5,0.5,0.0,0.0,,0.3,8.3,2,1,0.0
5,1,2021-07-01 00:05:00,2021-07-01 00:11:50,N,1,24,239,1,1.9,8.0,3.25,0.5,3.0,0.0,,0.3,15.05,1,1,2.75
6,2,2021-07-01 00:57:14,2021-07-01 01:27:43,N,1,75,243,1,0.0,17.5,0.5,0.5,0.0,0.0,,0.3,18.8,2,1,0.0
7,2,2021-07-01 00:27:36,2021-07-01 00:32:35,N,1,82,82,1,0.66,5.0,0.5,0.5,0.0,0.0,,0.3,6.3,2,1,0.0
8,2,2021-07-01 00:29:09,2021-07-01 00:34:18,N,1,74,42,1,1.72,7.0,0.5,0.5,2.08,0.0,,0.3,10.38,1,1,0.0
9,2,2021-07-01 00:41:33,2021-07-01 00:49:24,N,1,41,42,1,1.37,7.5,0.5,0.5,0.0,0.0,,0.3,8.8,2,1,0.0


## Dropping Unnecessary Columns
The `ehail_fee` column is empty in the rows shown above and will not be used in our analysis, so we drop it.

In [4]:
df = df.drop('ehail_fee', axis=1)
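
Before dropping a column like this, it can be worth confirming that it really carries no information. The sketch below is one possible check, run on a small made-up frame (the values and the `empty_cols` name are illustrative assumptions, not part of the notebook): it lists every column whose values are all missing and drops those.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which ehail_fee is entirely missing.
toy = pd.DataFrame({
    'fare_amount': [6.0, 42.0, 6.5],
    'ehail_fee': [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop them by name.
toy = toy.drop(columns=empty_cols)
print(toy.columns.tolist())
```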

## Remove Invalid Rows

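Remove rows where `trip_distance` is 0. The filter keeps only rows with a positive distance, so any rows with a negative or missing distance are dropped as well.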

In [5]:
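# Count zero-distance rows, then keep only positive distances (.copy() avoids SettingWithCopyWarning later).
n_zero = (df['trip_distance'] == 0).sum()
print(f'Remove {n_zero} rows with a trip_distance of 0')
df = df[df['trip_distance'] > 0].copy()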

Remove 3455 rows with a trip_distance of 0
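
Note that the printed count covers only distances equal to 0, while the `> 0` filter can remove more than that. The following sketch uses made-up distances (not values from the dataset) to show how the two numbers can differ when negative or missing values are present:

```python
import numpy as np
import pandas as pd

# Hypothetical distances: one zero, one negative, one missing.
toy = pd.DataFrame({'trip_distance': [1.2, 0.0, -0.5, np.nan, 3.4]})

n_zero = int((toy['trip_distance'] == 0).sum())  # rows the message counts
kept = toy[toy['trip_distance'] > 0]             # rows the filter keeps
n_dropped = len(toy) - len(kept)                 # rows actually removed

print(f'{n_zero} zero-distance rows, {n_dropped} rows dropped in total')
```

Comparing the frame length before and after the filter, as done here, is one way to report the true number of removed rows.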


## Adding New Columns

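We will begin by adding the following columns, all derived from the pickup and dropoff timestamps, to make the trip times easier to analyze:
- day of the week for the trip (0 = Monday...6 = Sunday)
- day of the month for the trip
- hour of the day for the trip (0 to 23)
- trip duration in seconds (dropoff time minus pickup time)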

In [6]:
df['day_of_week'] = df['lpep_pickup_datetime'].dt.dayofweek
df['day_of_month'] = df['lpep_pickup_datetime'].dt.day
df['hour_of_day'] = df['lpep_pickup_datetime'].dt.hour
df['trip_duration'] = round((df['lpep_dropoff_datetime'] - df['lpep_pickup_datetime']).dt.total_seconds())
df[['lpep_pickup_datetime', 'lpep_dropoff_datetime', 'day_of_week', 'day_of_month', 'hour_of_day', 'trip_duration']]

Unnamed: 0,lpep_pickup_datetime,lpep_dropoff_datetime,day_of_week,day_of_month,hour_of_day,trip_duration
0,2021-07-01 00:30:52,2021-07-01 00:35:36,3,1,0,284.0
1,2021-07-01 00:25:36,2021-07-01 01:01:31,3,1,0,2155.0
2,2021-07-01 00:05:58,2021-07-01 00:12:00,3,1,0,362.0
3,2021-07-01 00:41:40,2021-07-01 00:47:23,3,1,0,343.0
4,2021-07-01 00:51:32,2021-07-01 00:58:46,3,1,0,434.0
...,...,...,...,...,...,...
83686,2021-07-02 07:59:00,2021-07-02 08:33:00,4,2,7,2040.0
83687,2021-07-02 07:02:00,2021-07-02 07:18:00,4,2,7,960.0
83688,2021-07-02 07:53:00,2021-07-02 08:15:00,4,2,7,1320.0
83689,2021-07-02 07:58:00,2021-07-02 08:30:00,4,2,7,1920.0
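
As a small self-contained check of the derived columns, the sketch below rebuilds them for two of the pickup/dropoff pairs shown in the output above (rows 0 and 83686). The frame name `trips` is only illustrative.

```python
import pandas as pd

# Two pickup/dropoff pairs copied from the output above.
trips = pd.DataFrame({
    'lpep_pickup_datetime': pd.to_datetime(
        ['2021-07-01 00:30:52', '2021-07-02 07:59:00']),
    'lpep_dropoff_datetime': pd.to_datetime(
        ['2021-07-01 00:35:36', '2021-07-02 08:33:00']),
})

pickup = trips['lpep_pickup_datetime']
trips['day_of_week'] = pickup.dt.dayofweek   # 0 = Monday ... 6 = Sunday
trips['day_of_month'] = pickup.dt.day
trips['hour_of_day'] = pickup.dt.hour
# Duration in seconds: dropoff minus pickup.
trips['trip_duration'] = (
    trips['lpep_dropoff_datetime'] - pickup
).dt.total_seconds()

print(trips[['day_of_week', 'day_of_month', 'hour_of_day', 'trip_duration']])
```

1 July 2021 was a Thursday, which is why the first row has `day_of_week` equal to 3.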


Then, we'll add a column for the total fare without the tip (`total_amount` - `tip_amount`) and a column for the fare per mile (`total_without_tip` / `trip_distance`, in dollars per mile, excluding the tip).

In [7]:
df['total_without_tip'] = round(df['total_amount'] - df['tip_amount'], 2)
df['fare_per_mile'] = round(df['total_without_tip'] / df['trip_distance'], 2)
df[['tip_amount', 'total_amount', 'total_without_tip', 'trip_distance', 'fare_per_mile']]

Unnamed: 0,tip_amount,total_amount,total_without_tip,trip_distance,fare_per_mile
0,0.00,7.30,7.30,1.20,6.08
1,0.00,43.30,43.30,13.69,3.16
2,2.34,10.14,7.80,0.95,8.21
3,0.00,7.80,7.80,1.24,6.29
4,0.00,8.30,8.30,1.10,7.55
...,...,...,...,...,...
83686,0.00,59.84,59.84,18.04,3.32
83687,3.66,25.87,22.21,5.56,3.99
83688,0.00,22.75,22.75,5.13,4.43
83689,0.00,54.12,54.12,12.58,4.30
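
To make the arithmetic concrete, the sketch below recomputes both columns for rows 0 and 2 of the output above. For row 2, the total without tip is 10.14 - 2.34 = 7.80, and 7.80 / 0.95 is about 8.21 dollars per mile. The frame name `rows` is only illustrative.

```python
import pandas as pd

# Amounts and distances copied from rows 0 and 2 of the output above.
rows = pd.DataFrame({
    'total_amount': [7.30, 10.14],
    'tip_amount': [0.00, 2.34],
    'trip_distance': [1.20, 0.95],
})

# Total charged excluding the tip, rounded to cents.
rows['total_without_tip'] = (rows['total_amount'] - rows['tip_amount']).round(2)
# Dollars per mile, excluding the tip.
rows['fare_per_mile'] = (rows['total_without_tip'] / rows['trip_distance']).round(2)

print(rows)
```

Because very short trips have a small denominator, `fare_per_mile` can become very large for them; it may be worth inspecting the upper tail of this column before using it in an analysis.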


As a potential next step, we would like to incorporate the taxi zone data and perform an analysis by location.

## Save Data

In [8]:
df.to_csv('data/taxi_clean.csv', index=False)