# Taxi Curated Data Pre-processing

**This notebook pre-processes the taxi data from the raw folder and saves it to the curated folder.**

**Read in the raw data**


In [92]:
import pandas as pd

df = pd.read_csv("../data/raw/yellowtaxi_raw_data.csv")
df.head()

Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,1,2020-01-01 00:28:15,2020-01-01 00:33:03,1.2,238,239,1,6.0,3.0,0.5,1.47,0.0,0.3,11.27
1,1,2020-01-01 00:35:39,2020-01-01 00:43:04,1.2,239,238,1,7.0,3.0,0.5,1.5,0.0,0.3,12.3
2,1,2020-01-01 00:47:41,2020-01-01 00:53:52,0.6,238,238,1,6.0,3.0,0.5,1.0,0.0,0.3,10.8
3,1,2020-01-01 00:55:23,2020-01-01 01:00:14,0.8,238,151,1,5.5,0.5,0.5,1.36,0.0,0.3,8.16
4,2,2020-01-01 00:01:58,2020-01-01 00:04:16,0.0,193,193,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8


**Remove rows with invalid values (for example, negatives where they should not occur)**

In [93]:
# Keep only rows with valid values in the specified columns
# Based on research, the initial fare is $3.00, so a fare or total below $3 would not make sense for a trip
# Pick-up and drop-off zones run from 1 to 263 according to the taxi zone lookup table
# Vendor id is only 1 or 2 according to the data dictionary

filtered_df = df[
    ((df['vendorid'] == 1) | (df['vendorid'] == 2)) &
    (df['trip_distance'] >= 0) &
    (df['fare_amount'] >= 3) &
    (df['extra'] >= 0) &
    (df['mta_tax'] >= 0) &
    (df['tip_amount'] >= 0) &
    (df['tolls_amount'] >= 0) &
    (df['improvement_surcharge'] >= 0) &
    (df['total_amount'] >= 3) &
    (df['pulocationid'] >= 1) &
    (df['pulocationid'] <= 263) &
    (df['dolocationid'] >= 1) &
    (df['dolocationid'] <= 263)
]

columns_to_check = ['trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']

negative_values = filtered_df[columns_to_check].lt(0).any().any()

if negative_values:
    print("There are still negative values present.")
else:
    print("No negative values found.")

# Drop rows with missing values; the explicit copy avoids SettingWithCopyWarning in later cells
filtered_df = filtered_df.dropna().copy()

No negative values found.
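The validity filter above can also be written as a small reusable function. The sketch below is only an illustration: the toy rows are made up, `keep_valid_trips` is a hypothetical helper name, and only a subset of the notebook's checks is shown. It uses `isin` and `between` (inclusive on both ends) and returns an explicit copy so that later column assignments work on an independent DataFrame.

```python
import pandas as pd

# Hypothetical toy frame: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "vendorid": [1, 2, 3, 1],
    "trip_distance": [1.2, -0.5, 2.0, 0.8],
    "fare_amount": [6.0, 7.0, 5.0, 2.5],
    "pulocationid": [238, 239, 100, 264],
    "dolocationid": [239, 238, 100, 10],
})


def keep_valid_trips(frame: pd.DataFrame) -> pd.DataFrame:
    """Return an independent copy holding only rows that pass every check."""
    mask = (
        frame["vendorid"].isin([1, 2])
        & (frame["trip_distance"] >= 0)
        & (frame["fare_amount"] >= 3)
        & frame["pulocationid"].between(1, 263)  # inclusive on both ends
        & frame["dolocationid"].between(1, 263)
    )
    # .copy() detaches the result from the original frame.
    return frame.loc[mask].copy()


valid = keep_valid_trips(toy)
print(valid)
```

Only the first toy row passes: the second has a negative distance, the third an unknown vendor id, and the fourth a fare below 3 and a pick-up zone above 263.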


**Feature Engineering**

In [94]:
# Ensure the pickup and drop-off timestamps are in datetime format
filtered_df['tpep_pickup_datetime'] = pd.to_datetime(filtered_df['tpep_pickup_datetime'])
filtered_df['tpep_dropoff_datetime'] = pd.to_datetime(filtered_df['tpep_dropoff_datetime'])

# Extract the hour, date, month
filtered_df.insert(3, 'pickup_month', filtered_df['tpep_pickup_datetime'].dt.month)
filtered_df.insert(4, 'pickup_date', filtered_df['tpep_pickup_datetime'].dt.date)
filtered_df.insert(5, 'pickup_hour', filtered_df['tpep_pickup_datetime'].dt.hour)

# Add in trip duration
filtered_df['trip_duration_minutes'] = (filtered_df['tpep_dropoff_datetime'] - filtered_df['tpep_pickup_datetime']).dt.total_seconds() / 60
filtered_df.insert(6, 'trip_duration', filtered_df['trip_duration_minutes'])

# Extract day of the week for pickup
pickup_dayofweek = filtered_df['tpep_pickup_datetime'].dt.dayofweek

# Calculate weekend and weekday values
weekend_values = (pickup_dayofweek >= 5).astype(int)  # 1 for Saturday and Sunday, 0 otherwise
weekday_values = (pickup_dayofweek < 5).astype(int)   # 1 for Monday to Friday, 0 otherwise

# Insert the values into the 8th and 9th positions
filtered_df.insert(7, 'weekend', weekend_values)
filtered_df.insert(8, 'weekday', weekday_values)

# Calculate morning and evening rush values
morning_rush_values = ((filtered_df['pickup_hour'] >= 7) & (filtered_df['pickup_hour'] <= 9)).astype(int)
evening_rush_values = ((filtered_df['pickup_hour'] >= 17) & (filtered_df['pickup_hour'] <= 19)).astype(int)

# Insert the values into the 10th and 11th positions
filtered_df.insert(9, 'morning_rush', morning_rush_values)
filtered_df.insert(10, 'evening_rush', evening_rush_values)

filtered_df.dropna()


Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,pickup_month,pickup_date,pickup_hour,trip_duration,weekend,weekday,morning_rush,...,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_duration_minutes
0,1,2020-01-01 00:28:15,2020-01-01 00:33:03,1,2020-01-01,0,4.800000,0,1,0,...,239,1,6.00,3.0,0.5,1.47,0.0,0.3,11.27,4.800000
1,1,2020-01-01 00:35:39,2020-01-01 00:43:04,1,2020-01-01,0,7.416667,0,1,0,...,238,1,7.00,3.0,0.5,1.50,0.0,0.3,12.30,7.416667
2,1,2020-01-01 00:47:41,2020-01-01 00:53:52,1,2020-01-01,0,6.183333,0,1,0,...,238,1,6.00,3.0,0.5,1.00,0.0,0.3,10.80,6.183333
3,1,2020-01-01 00:55:23,2020-01-01 01:00:14,1,2020-01-01,0,4.850000,0,1,0,...,151,1,5.50,0.5,0.5,1.36,0.0,0.3,8.16,4.850000
4,2,2020-01-01 00:01:58,2020-01-01 00:04:16,1,2020-01-01,0,2.300000,0,1,0,...,193,2,3.50,0.5,0.5,0.00,0.0,0.3,4.80,2.300000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15712057,2,2020-03-31 23:21:00,2020-03-31 23:33:00,3,2020-03-31,23,12.000000,0,1,0,...,87,0,30.47,0.0,0.5,0.00,0.0,0.3,33.77,12.000000
15712058,2,2020-03-31 23:57:00,2020-04-01 00:26:00,3,2020-03-31,23,29.000000,0,1,0,...,71,0,37.97,0.0,0.5,0.00,0.0,0.3,41.27,29.000000
15712059,2,2020-03-31 23:22:01,2020-03-31 23:43:52,3,2020-03-31,23,21.850000,0,1,0,...,32,0,37.10,0.0,0.0,0.00,0.0,0.3,39.90,21.850000
15712060,2,2020-03-31 23:18:53,2020-03-31 23:32:21,3,2020-03-31,23,13.466667,0,1,0,...,159,0,20.07,0.0,0.0,0.00,0.0,0.3,22.87,13.466667
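As a compact illustration of the time-based features built above, here is a minimal sketch on two made-up trips (one on a Saturday morning, one on a Monday evening). It uses `DataFrame.assign` rather than positional `insert`, which is a stylistic assumption and not the notebook's approach. The rush-hour windows (7 to 9 and 17 to 19, inclusive) match the cell above.

```python
import pandas as pd

# Two made-up trips: 2020-01-04 is a Saturday, 2020-01-06 is a Monday.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2020-01-04 08:15:00", "2020-01-06 18:30:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2020-01-04 08:27:00", "2020-01-06 19:00:00"]
    ),
})

pickup = trips["tpep_pickup_datetime"]
hour = pickup.dt.hour

features = trips.assign(
    pickup_hour=hour,
    # Duration in minutes from the timestamp difference.
    trip_duration=(trips["tpep_dropoff_datetime"] - pickup).dt.total_seconds() / 60,
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    weekend=(pickup.dt.dayofweek >= 5).astype(int),
    morning_rush=hour.between(7, 9).astype(int),
    evening_rush=hour.between(17, 19).astype(int),
)
print(features[["pickup_hour", "trip_duration", "weekend", "morning_rush", "evening_rush"]])
```

Because `weekday` is always `1 - weekend`, one of the two flags carries all the information; whether to keep both depends on the downstream model.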


In [95]:
# Drop tpep_pickup_datetime and tpep_dropoff_datetime
filtered_df.drop(columns=['tpep_pickup_datetime', 'tpep_dropoff_datetime'], inplace=True)



**Add total surcharge**

In [96]:
filtered_df.head()

Unnamed: 0,vendorid,pickup_month,pickup_date,pickup_hour,trip_duration,weekend,weekday,morning_rush,evening_rush,trip_distance,...,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_duration_minutes
0,1,1,2020-01-01,0,4.8,0,1,0,0,1.2,...,239,1,6.0,3.0,0.5,1.47,0.0,0.3,11.27,4.8
1,1,1,2020-01-01,0,7.416667,0,1,0,0,1.2,...,238,1,7.0,3.0,0.5,1.5,0.0,0.3,12.3,7.416667
2,1,1,2020-01-01,0,6.183333,0,1,0,0,0.6,...,238,1,6.0,3.0,0.5,1.0,0.0,0.3,10.8,6.183333
3,1,1,2020-01-01,0,4.85,0,1,0,0,0.8,...,151,1,5.5,0.5,0.5,1.36,0.0,0.3,8.16,4.85
4,2,1,2020-01-01,0,2.3,0,1,0,0,0.0,...,193,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8,2.3


In [97]:
# Calculate the "total surcharge"
total_surcharge_values = filtered_df['extra'] + filtered_df['mta_tax'] + filtered_df['tolls_amount'] + filtered_df['improvement_surcharge']

# Insert "total surcharge" at column index 13 (zero-based, i.e. the 14th column)
filtered_df.insert(13, 'total_surcharge', total_surcharge_values)

# Drop the constituent columns
filtered_df.drop(columns=['extra', 'mta_tax', 'tolls_amount', 'improvement_surcharge'], inplace=True)




In [98]:
filtered_df.head()

Unnamed: 0,vendorid,pickup_month,pickup_date,pickup_hour,trip_duration,weekend,weekday,morning_rush,evening_rush,trip_distance,pulocationid,dolocationid,payment_type,total_surcharge,fare_amount,tip_amount,total_amount,trip_duration_minutes
0,1,1,2020-01-01,0,4.8,0,1,0,0,1.2,238,239,1,3.8,6.0,1.47,11.27,4.8
1,1,1,2020-01-01,0,7.416667,0,1,0,0,1.2,239,238,1,3.8,7.0,1.5,12.3,7.416667
2,1,1,2020-01-01,0,6.183333,0,1,0,0,0.6,238,238,1,3.8,6.0,1.0,10.8,6.183333
3,1,1,2020-01-01,0,4.85,0,1,0,0,0.8,238,151,1,1.3,5.5,1.36,8.16,4.85
4,2,1,2020-01-01,0,2.3,0,1,0,0,0.0,193,193,2,1.3,3.5,0.0,4.8,2.3
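The surcharge total can be checked by hand on two of the rows shown earlier (3.0 + 0.5 + 0.0 + 0.3 = 3.8 and 0.5 + 0.5 + 0.0 + 0.3 = 1.3). The sketch below reproduces that sum with `DataFrame.sum(axis=1)`, which is an alternative to the chained `+` used in the notebook; the variable names are illustrative.

```python
import pandas as pd

# Surcharge components copied from the first and fourth rows of the raw preview.
fees = pd.DataFrame({
    "extra": [3.0, 0.5],
    "mta_tax": [0.5, 0.5],
    "tolls_amount": [0.0, 0.0],
    "improvement_surcharge": [0.3, 0.3],
})

surcharge_cols = ["extra", "mta_tax", "tolls_amount", "improvement_surcharge"]

# Row-wise sum across the four surcharge columns.
total_surcharge = fees[surcharge_cols].sum(axis=1)
print(total_surcharge.round(2).tolist())
```

One difference to keep in mind: `sum(axis=1)` skips missing values by default, whereas `+` propagates `NaN`. Since rows with missing values were already dropped earlier, both give the same result here.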


**Filter the data to include only credit card payments**

* Tips are only counted for credit card payments

In [99]:
# Keep credit card trips only; the explicit copy avoids SettingWithCopyWarning on the edits below
final_df = filtered_df[filtered_df['payment_type'] == 1].copy()

# Add tip ratio feature: tip as a percentage of the total amount
final_df['tip_ratio'] = (final_df['tip_amount'] / final_df['total_amount']) * 100

# Drop vendor id, total amount, and the duplicate trip duration column
final_df.drop(columns=['vendorid', 'total_amount', 'trip_duration_minutes'], inplace=True)

final_df.head()

Unnamed: 0,pickup_month,pickup_date,pickup_hour,trip_duration,weekend,weekday,morning_rush,evening_rush,trip_distance,pulocationid,dolocationid,payment_type,total_surcharge,fare_amount,tip_amount,tip_ratio
0,1,2020-01-01,0,4.8,0,1,0,0,1.2,238,239,1,3.8,6.0,1.47,13.043478
1,1,2020-01-01,0,7.416667,0,1,0,0,1.2,239,238,1,3.8,7.0,1.5,12.195122
2,1,2020-01-01,0,6.183333,0,1,0,0,0.6,238,238,1,3.8,6.0,1.0,9.259259
3,1,2020-01-01,0,4.85,0,1,0,0,0.8,238,151,1,1.3,5.5,1.36,16.666667
9,1,2020-01-01,0,11.45,0,1,0,0,0.7,246,48,1,3.8,8.0,2.35,16.607774
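To make the tip ratio concrete, the sketch below recomputes it for two rows taken from the raw preview: one credit card trip (`payment_type` 1) and one trip with `payment_type` 2. The frame and variable names are illustrative only.

```python
import pandas as pd

# Two rows from the raw preview: payment_type 1 (credit card) and payment_type 2.
payments = pd.DataFrame({
    "payment_type": [1, 2],
    "tip_amount": [1.47, 0.0],
    "total_amount": [11.27, 4.8],
})

# Filter first, then copy, so the new column is added to an independent frame.
card_only = payments[payments["payment_type"] == 1].copy()
card_only["tip_ratio"] = card_only["tip_amount"] / card_only["total_amount"] * 100

print(round(card_only["tip_ratio"].iloc[0], 6))
```

The result matches the 13.043478 shown in the first output row. Note that this ratio uses `total_amount`, which already includes the tip, so it is the tip's share of the full bill rather than a percentage of the fare.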


**Save to CSV**

In [100]:
final_df.to_csv('../data/curated/yellowtaxi_curated_data.csv', index = False)
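A quick round trip can confirm that the saved file reloads as expected. The sketch below writes a one-row sample to a temporary directory (the file name and values are hypothetical) and reads it back. With `index=False` the row index is not written, so no extra `Unnamed: 0` column appears on reload, and `parse_dates` restores `pickup_date` as a datetime column, since CSV stores it as plain text.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row sample with two of the curated columns.
sample = pd.DataFrame({"pickup_date": ["2020-01-01"], "tip_ratio": [13.04]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "curated_sample.csv")
    # index=False keeps the row index out of the file.
    sample.to_csv(path, index=False)
    # parse_dates converts the text dates back to datetime64 on load.
    reloaded = pd.read_csv(path, parse_dates=["pickup_date"])

print(reloaded.dtypes)
```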