The NYC taxi data is stored as Parquet files, and the full set takes up quite a bit of space. To save space and keep the data consistent, we first use this notebook to build a usable .csv file from the taxi .parquet files.

In [1]:
import glob
import pandas as pd
import numpy as np


First we collect all of the file paths with glob for easy iteration. During testing, we discovered that the 2009 files, the 2010 files, and the 2011-2022 files each use different column names. We pull one file from each of these groups to determine the column names to use in the rest of our code.

In [2]:
# Raw strings keep the Windows backslash from being read as an escape sequence
taxi_data = sorted(glob.glob(r"Taxi Data\*.parquet"))
# df = pd.read_parquet(r"Taxi Data\yellow_tripdata_2010-01.parquet")
# df.columns
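
As a rough sketch of how the column sets could be compared without loading any rows, the snippet below groups file paths by their column names. The helper names `parquet_columns` and `group_files_by_columns` are hypothetical, and the code assumes `pyarrow` is installed (pandas commonly uses it as its Parquet engine).

```python
from collections import defaultdict


def parquet_columns(path):
    """Return the column names of a Parquet file by reading only its schema."""
    import pyarrow.parquet as pq  # assumption: pyarrow is available

    return tuple(pq.read_schema(path).names)


def group_files_by_columns(paths, get_columns=parquet_columns):
    """Group file paths by their exact tuple of column names."""
    groups = defaultdict(list)
    for path in paths:
        groups[get_columns(path)].append(path)
    return dict(groups)
```

Calling `group_files_by_columns(taxi_data)` should give one entry per distinct set of column names, so a single file from each group can then be inspected with `df.columns`.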

Originally, we needed three DataFrames because of these differences: one for 2009 data, one for 2010 data, and a third for 2011-2022. We can build each one by looping over the entire folder with a try/except block that catches any file whose columns do not match. We needed to run this loop three times to extract all of the data. The project has since shifted to needing only 2011-2022, so a single pass is enough.

In [3]:
# temp_list_10 = []
# col_list_10 = ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'passenger_count',
#        'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code',
#        'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
#        'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount',
#        'tolls_amount', 'total_amount']

# for file in taxi_data:
#     try:
#         df = pd.read_parquet(file, columns=col_list_10).sample(n=100000)
#     except Exception as e:
#         print(f"{file} Does not include correct cols.")
#     else:
#         temp_list_10.append(df)

In [4]:
temp_list_11_22 = []
col_list_11_22 = ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance',
       'PULocationID', 'DOLocationID','RatecodeID','store_and_fwd_flag', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount']

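for file in taxi_data:
    try:
        # Reading fails when the file lacks any of the requested columns;
        # on success, keep a random sample of 100,000 rows from the file.
        df = pd.read_parquet(file, columns=col_list_11_22).sample(n=100000)
    except Exception:
        # Note: any other error (e.g. a file with fewer than 100,000 rows) also lands here.
        print(f"{file} Does not include correct cols.")
    else:
        temp_list_11_22.append(df)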

Taxi Data\yellow_tripdata_2009-01.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-02.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-03.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-04.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-05.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-06.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-07.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-08.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-09.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-10.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-11.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-12.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2010-01.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2010-02.parquet Does not 
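
One caveat with the loop above: `DataFrame.sample(n=100000)` raises a `ValueError` when a file holds fewer than 100,000 rows, and the broad `except` would then report that file as having the wrong columns. A minimal sketch of a more forgiving sampler follows. The helper name `sample_up_to` and the `seed` parameter are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def sample_up_to(df, n, seed=0):
    """Return a reproducible random sample of at most n rows from df."""
    # Cap n at the number of rows so small frames do not raise ValueError.
    return df.sample(n=min(n, len(df)), random_state=seed)
```

Inside the loop, `sample_up_to(pd.read_parquet(file, columns=col_list_11_22), 100000)` could stand in for the direct `.sample` call. Fixing the seed also makes the exported CSV repeatable between runs.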

Next, we combine the data from each file into one overall DataFrame.

In [18]:
overall_data = pd.concat(temp_list_11_22)

With one DataFrame holding all the data we need, we can do some initial cleaning that will carry through to the CSV file.

We will:
    -Rename the columns to more user-friendly names
    -Create a new column holding just the year of each transaction for quicker access

In [19]:
cols_rename = ['VendorID', 'pickup_datetime', 'dropoff_datetime', 'passenger_count', 'trip_distance', 'PULocationID', 'DOLocationID', 'RateCodeID', 'store_and_fwd_flag', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']
overall_data.columns = cols_rename
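
Assigning a list to `overall_data.columns` relies on the new names being in exactly the same order as the existing columns. A hedged alternative is to rename by name with `DataFrame.rename`, which only touches the columns listed in the mapping. The mapping name `rename_map` and the toy frame below are illustrative only.

```python
import pandas as pd

# Only the columns whose names actually change need to appear here.
rename_map = {
    "tpep_pickup_datetime": "pickup_datetime",
    "tpep_dropoff_datetime": "dropoff_datetime",
    "RatecodeID": "RateCodeID",
}

# Toy one-row frame standing in for overall_data.
toy = pd.DataFrame(
    {
        "VendorID": [1],
        "tpep_pickup_datetime": pd.to_datetime(["2012-01-18 15:29:43"]),
        "tpep_dropoff_datetime": pd.to_datetime(["2012-01-18 15:42:25"]),
        "RatecodeID": [1.0],
    }
)

# Columns not named in the mapping (VendorID here) keep their names.
renamed = toy.rename(columns=rename_map)
print(list(renamed.columns))
```

With this approach a reordered or extra column cannot silently end up under the wrong name.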

In [20]:
overall_data['year'] = overall_data["pickup_datetime"].dt.year

In [21]:
# The comparison already yields booleans, so np.where is not needed
overall_data['is_covid_rel'] = overall_data["year"] > 2019

In [22]:
# Copy the filtered rows so later column assignments do not act on a view
overall_data = overall_data[overall_data.year > 2011].copy()
overall_data.sort_values(by='year')

Unnamed: 0,VendorID,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,RateCodeID,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,year,is_covid_rel
7172866,1,2012-01-18 15:29:43,2012-01-18 15:42:25,1.0,1.50,262,75,1.0,N,2,8.1,0.00,0.5,0.00,0.0,0.0,8.60,2012,False
8504454,2,2012-09-18 18:29:00,2012-09-18 18:50:00,5.0,6.11,161,87,1.0,,1,20.5,1.00,0.5,2.00,0.0,0.0,24.00,2012,False
4267138,2,2012-09-09 23:19:00,2012-09-09 23:26:00,1.0,2.38,264,264,1.0,,2,9.0,0.50,0.5,0.00,0.0,0.0,10.00,2012,False
12226022,2,2012-09-26 12:16:00,2012-09-26 12:34:00,2.0,3.92,162,166,1.0,,2,16.0,0.00,0.5,0.00,0.0,0.0,16.50,2012,False
9314982,1,2012-09-20 10:09:19,2012-09-20 10:13:20,1.0,0.60,142,48,1.0,N,1,5.0,0.00,0.5,1.10,0.0,0.0,6.60,2012,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1228384,2,2022-04-11 15:01:36,2022-04-11 15:17:13,1.0,1.48,161,234,1.0,N,1,11.0,0.00,0.5,4.29,0.0,0.3,18.59,2022,True
1755887,2,2022-04-15 23:44:58,2022-04-15 23:58:25,1.0,2.30,148,68,1.0,N,4,-10.5,-0.50,-0.5,0.00,0.0,-0.3,-14.30,2022,True
1523986,2,2022-04-14 07:28:09,2022-04-14 07:41:41,1.0,2.43,68,161,1.0,N,1,11.0,0.00,0.5,4.29,0.0,0.3,18.59,2022,True
2135517,2,2022-04-19 17:43:59,2022-04-19 17:53:32,1.0,1.19,237,142,1.0,N,1,8.0,1.00,0.5,3.08,0.0,0.3,15.38,2022,True


In [23]:
overall_data['payment_type'] = overall_data['payment_type'].values.astype(str).astype(int)

In [24]:
# Map the numeric payment codes to descriptive labels in one pass
payment_labels = {
    1: 'credit_card',
    2: 'cash',
    3: 'no_charge',
    4: 'dispute',
    5: 'unknown',
    6: 'voided_trip',
}
overall_data['payment_type'] = overall_data['payment_type'].replace(payment_labels)

Now we export the data to a CSV file for use in our main analysis.

In [26]:
overall_data.to_csv("overall_data.csv", index=False)
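
Note that CSV stores every value as text, so the datetime columns come back as plain strings unless they are parsed on load. Below is a small sketch of reading the file back in the main analysis, assuming the renamed column names used in this notebook. The helper name `load_overall_data` is hypothetical.

```python
import pandas as pd


def load_overall_data(path_or_buffer):
    """Read the exported CSV and restore the datetime columns."""
    return pd.read_csv(
        path_or_buffer,
        parse_dates=["pickup_datetime", "dropoff_datetime"],
    )
```

For example, `load_overall_data("overall_data.csv")` should return a frame whose `pickup_datetime` column again supports the `.dt` accessor.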