The NYC Taxi data is stored in Parquet files that take up quite a bit of space. To save space and keep things consistent, we will first use this file to create a single usable .csv file from the Taxi data .parquet files.

In [1]:
import glob
import pandas as pd
import numpy as np

First we will use glob to build a sorted list of all the files for easy iteration. During testing, we found that the 2009 files, the 2010 files, and the 2011-2022 files each use different column names. We will pull one file from each group to determine the column names to use in the rest of our code.

In [2]:
# Raw strings keep the Windows-style backslash literal and avoid invalid-escape warnings.
taxi_data = sorted(glob.glob(r"Taxi Data\*.parquet"))
# df = pd.read_parquet(r"Taxi Data\yellow_tripdata_2010-01.parquet")
# df.columns
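
One way to confirm which files share a schema is to group file names by their column set. The sketch below assumes the column lists have already been collected (for example from `pd.read_parquet(path).columns`, as in the commented lines above). The helper `group_files_by_columns` is hypothetical, and the file and column names in `columns_by_file` are placeholders, not the real schemas.

```python
from collections import defaultdict


def group_files_by_columns(columns_by_file):
    """Group file names by the exact set of column names they contain."""
    groups = defaultdict(list)
    for name, cols in columns_by_file.items():
        # frozenset ignores column order, so reordered columns still match.
        groups[frozenset(cols)].append(name)
    # Sort each group's file names so the result is deterministic.
    return {cols: sorted(names) for cols, names in groups.items()}


# Placeholder data: hypothetical file names and column names.
columns_by_file = {
    "a_2009-01.parquet": ["pickup_time", "fare"],
    "a_2009-02.parquet": ["fare", "pickup_time"],
    "a_2011-01.parquet": ["tpep_pickup_datetime", "fare_amount"],
}

schema_groups = group_files_by_columns(columns_by_file)
for cols, names in schema_groups.items():
    print(sorted(cols), names)
```

Each distinct column set then corresponds to one group of files that can be read with the same column list.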

Originally, these differences meant we needed three DataFrames: one for the 2009 data, one for the 2010 data, and a third for 2011-2022. We built them by looping over the entire folder with a try/except block in the loop to catch any files with different column names, and we ran that loop three times to extract all the data. The project has since shifted to only needing 2011-2022, so a single pass accomplishes what we need.

In [3]:
temp_list_11_22 = []
col_list_11_22 = ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance',
       'PULocationID', 'DOLocationID','RatecodeID','store_and_fwd_flag', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount']

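for file in taxi_data:
    try:
        # Read only the 2011-2022 columns and keep a random sample of 100,000 rows per file.
        df = pd.read_parquet(file, columns=col_list_11_22).sample(n=100000)
    except Exception:
        # Any failure lands here; files with other column names (2009, 2010) are skipped.
        print(f"{file} Does not include correct cols.")
    else:
        temp_list_11_22.append(df)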

Taxi Data\yellow_tripdata_2009-01.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-02.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-03.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-04.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-05.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-06.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-07.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-08.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-09.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-10.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-11.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2009-12.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2010-01.parquet Does not include correct cols.
Taxi Data\yellow_tripdata_2010-02.parquet Does not 
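
Note that `sample(n=100000)` raises a `ValueError` when a file holds fewer than 100,000 rows, and the broad `except` would then report that file as having the wrong columns. A minimal sketch of a more forgiving sampler follows. The helper `sample_up_to` and its `seed` argument are hypothetical additions, not part of the original loop, and the small frame is made-up data.

```python
import pandas as pd


def sample_up_to(df, n, seed=None):
    """Return a random sample of at most n rows from df."""
    k = min(n, len(df))  # never ask for more rows than the frame holds
    return df.sample(n=k, random_state=seed)


# Made-up frame with only three rows, far fewer than the requested sample size.
small = pd.DataFrame({"fare_amount": [5.0, 7.5, 9.0]})
picked = sample_up_to(small, n=100000, seed=0)
print(len(picked))
```

Passing a fixed `seed` also makes the sample repeatable between runs, which the unseeded call does not guarantee.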

Next, we need to combine the data from each file into one overall DataFrame.

In [4]:
overall_data = pd.concat(temp_list_11_22)

With all the data we need in one DataFrame, we can do some initial cleaning that will carry through to the CSV file.

We will:
    - Rename the columns to more user-friendly names
    - Create a new column holding just the year of each transaction for quicker access
In [5]:
cols_rename = ['VendorID', 'pickup_datetime', 'dropoff_datetime', 'passenger_count', 'trip_distance', 'PULocationID', 'DOLocationID', 'RateCodeID', 'store_and_fwd_flag', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']
overall_data.columns = cols_rename
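
Assigning a full list to `.columns` relies on the list order matching the DataFrame exactly. As an alternative sketch, `DataFrame.rename` with a mapping changes only the named columns and leaves the rest alone. The one-row `demo` frame below is illustrative only.

```python
import pandas as pd

# Only the columns whose names actually change need to be listed.
rename_map = {
    "tpep_pickup_datetime": "pickup_datetime",
    "tpep_dropoff_datetime": "dropoff_datetime",
    "RatecodeID": "RateCodeID",
}

# Illustrative one-row frame with a subset of the 2011-2022 columns.
demo = pd.DataFrame({
    "VendorID": [1],
    "tpep_pickup_datetime": pd.to_datetime(["2011-01-13 22:21:40"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2011-01-13 22:30:55"]),
    "RatecodeID": [1.0],
})

demo = demo.rename(columns=rename_map)
print(list(demo.columns))
```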

In [6]:
overall_data['year'] = overall_data["pickup_datetime"].dt.year

In [7]:
# The comparison already yields a boolean column, so no np.where is needed.
overall_data['is_covid_rel'] = overall_data["year"] > 2019

In [8]:
overall_data = overall_data[overall_data.year >= 2011]
overall_data = overall_data[overall_data.year <= 2022]
overall_data.sort_values(by='year')

Unnamed: 0,VendorID,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,RateCodeID,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,year,is_covid_rel
5195892,1,2011-01-13 22:21:40,2011-01-13 22:30:55,2.0,1.90,142,236,1.0,N,2,7.3,0.5,0.5,0.00,0.0,0.0,8.30,2011,False
323393,1,2011-09-01 18:39:20,2011-09-01 18:47:42,1.0,0.40,48,48,1.0,N,1,5.7,1.0,0.5,1.25,0.0,0.0,8.45,2011,False
658147,1,2011-09-02 11:41:32,2011-09-02 11:53:05,1.0,2.00,79,161,1.0,N,1,8.5,0.0,0.5,1.80,0.0,0.0,10.80,2011,False
6292798,2,2011-09-14 10:04:00,2011-09-14 10:21:00,3.0,2.64,230,211,1.0,,1,10.9,0.0,0.5,3.00,0.0,0.0,14.40,2011,False
1118207,1,2011-09-03 10:47:06,2011-09-03 10:56:33,0.0,1.30,68,211,1.0,N,2,6.5,0.0,0.5,0.00,0.0,0.0,7.00,2011,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1496506,2,2022-04-13 20:16:13,2022-04-13 20:26:17,1.0,1.41,158,234,1.0,N,1,8.5,0.5,0.5,1.00,0.0,0.3,13.30,2022,True
2284561,1,2022-04-20 21:11:57,2022-04-20 21:17:55,1.0,1.30,68,163,1.0,N,1,6.5,3.0,0.5,2.55,0.0,0.3,12.85,2022,True
366288,2,2022-04-04 08:04:24,2022-04-04 08:14:36,1.0,0.72,236,75,1.0,N,1,7.5,0.0,0.5,1.00,0.0,0.3,11.80,2022,True
2306356,1,2022-04-21 07:56:53,2022-04-21 08:08:02,1.0,2.00,113,230,1.0,Y,1,10.0,2.5,0.5,2.00,0.0,0.3,15.30,2022,True
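
`sort_values` returns a new DataFrame, so the call above only affects what the cell displays; `overall_data` itself keeps its original order. If a filtered and sorted frame is wanted downstream, one possible sketch (on made-up rows) uses `Series.between`, which includes both endpoints by default, and assigns the result.

```python
import pandas as pd

# Made-up rows, including years outside the 2011-2022 window.
demo = pd.DataFrame({
    "year": [2022, 2010, 2011, 2023, 2015],
    "fare_amount": [8.5, 6.0, 7.3, 9.9, 12.0],
})

# Keep 2011 through 2022 inclusive, then store the sorted result.
kept = demo[demo["year"].between(2011, 2022)]
kept = kept.sort_values(by="year").reset_index(drop=True)
print(kept["year"].tolist())
```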


In [9]:
overall_data['payment_type'] = overall_data['payment_type'].values.astype(str).astype(int)

In [10]:
# Map the numeric payment codes to readable labels:
# 1 = credit card, 2 = cash, 3 = no charge, 4 = dispute, 5 = unknown, 6 = voided trip
payment_labels = {1: 'credit_card', 2: 'cash', 3: 'no_charge',
                  4: 'dispute', 5: 'unknown', 6: 'voided_trip'}
overall_data['payment_type'] = overall_data['payment_type'].replace(payment_labels)
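
The same code-to-label lookup can also be applied with `Series.map`, which makes unmapped codes easy to spot because they come back as NaN. The sketch below runs on made-up codes and falls back to the original code as a string when no label exists (here a code of 0). That fallback is an assumption for illustration, not something the cell above does.

```python
import pandas as pd

payment_labels = {
    1: "credit_card", 2: "cash", 3: "no_charge",
    4: "dispute", 5: "unknown", 6: "voided_trip",
}

# Made-up codes; 0 has no entry in the lookup table.
codes = pd.Series([1, 2, 0, 6])

# map() yields NaN for codes missing from the dict; fill those with the code itself.
labels = codes.map(payment_labels).fillna(codes.astype(str))
print(labels.tolist())
```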

Now we export the data to a CSV file to be used in our main analysis.

In [11]:
# overall_data.to_csv("overall_data.csv", index=False)

overall_data.to_csv("sample.csv", index=False)
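
CSV does not store column types, so the datetime columns come back as plain strings unless they are parsed on load. A minimal round-trip sketch follows, using an in-memory buffer in place of `sample.csv` and a single illustrative row.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2011-01-13 22:21:40"]),
    "total_amount": [8.30],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot record.
restored = pd.read_csv(buffer, parse_dates=["pickup_datetime"])
print(restored.dtypes)
```

The same `parse_dates` argument would apply when the main analysis loads the exported file.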