# Data Preparation

I wrote a bash script that downloads the data and saves it locally. I have already run it for yellow taxis in 2020 and 2021 and for green taxis in 2020. Let's finally run it for green taxis in 2021, which only goes through July (there is no data for August and beyond):

In [1]:
!./download_data.sh green 2021

Downloading https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2021-01.csv.gz and saving to data/raw/green/2021/01/green_tripdata_2021_01.csv.gz...
--2023-10-25 17:31:41--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2021-01.csv.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea387a15-484c-469b-860d-3382ee7659be?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231025%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231025T173141Z&X-Amz-Expires=300&X-Amz-Signature=46a51e0ff84a427066967ffda0078a950c339df6ddcd9077cba6187daac8e53e&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=513814948&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2021-01.cs

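Let's take a look with `zcat` or `gzcat`. Ignore the `Broken pipe` error at the bottom: `head` exits after five lines while `gzcat` is still writing to the pipe. The error does not appear when I run the same command in a regular shell, so it seems to be a Jupyter quirk.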

> Note: `zcat` is like `cat` for compressed files, and `gzcat` is GNU `zcat`. The stock `zcat` on macOS appears to append `.Z` to the file name, so it fails on `.gz` files; I used `gzcat` instead, which seems to work fine. If you don't have GNU utilities, install them. Or better yet, check out [linuxify](https://github.com/darksonic37/linuxify).

In [3]:
!gzcat data/raw/green/2021/01/green_tripdata_2021_01.csv.gz | head -n 5

VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
2,2021-01-01 00:15:56,2021-01-01 00:19:52,N,1,43,151,1,1.01,5.5,0.5,0.5,0,0,,0.3,6.8,2,1,0
2,2021-01-01 00:25:59,2021-01-01 00:34:44,N,1,166,239,1,2.53,10,0.5,0.5,2.81,0,,0.3,16.86,1,1,2.75
2,2021-01-01 00:45:57,2021-01-01 00:51:55,N,1,41,42,1,1.12,6,0.5,0.5,1,0,,0.3,8.3,1,1,0
2,2020-12-31 23:57:51,2021-01-01 00:04:56,N,1,168,75,1,1.99,8,0.5,0.5,0,0,,0.3,9.3,2,1,0

gzip: stdout: Broken pipe
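
If neither `zcat` nor `gzcat` behaves on your machine, the same peek can be done from Python with the standard-library `gzip` module. No pipe is involved, so the `Broken pipe` message does not come up. This is a minimal sketch: the helper name `peek_gzip` is hypothetical, and UTF-8 is an assumption about the file encoding.

```python
import gzip
from itertools import islice


def peek_gzip(path, n=5):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the archive in text mode, so lines come back as str.
    # UTF-8 is an assumption about the files, not something the source states.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops reading after n lines instead of scanning the whole file.
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `peek_gzip("data/raw/green/2021/01/green_tripdata_2021_01.csv.gz")` should return the header line plus the first four trips.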


In [4]:
!ls -FGhl data/raw/green/2021/08

total 0
-rw-rw-r-- 1 freddie 0 Oct 25 17:31 green_tripdata_2021_08.csv.gz


The file is 0 bytes. In fact, the August 2021 data files for both yellow and green cabs are empty, so let's remove them.

In [5]:
! rm -r data/raw/green/2021/08 data/raw/yellow/2021/08
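
Empty downloads like these can also be found programmatically instead of by inspecting each directory. This is a minimal sketch using `pathlib`. The helper names `find_empty_files` and `remove_empty_downloads` are hypothetical, and it assumes that a zero-byte `.csv.gz` file always means there was no data to download.

```python
from pathlib import Path


def find_empty_files(root, pattern="*.csv.gz"):
    """Return every zero-byte file under root whose name matches pattern."""
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


def remove_empty_downloads(root, pattern="*.csv.gz"):
    """Delete zero-byte files, plus their directory if that leaves it empty."""
    removed = []
    for path in find_empty_files(root, pattern):
        path.unlink()
        # Only drop the month directory when nothing else lives in it.
        if not any(path.parent.iterdir()):
            path.parent.rmdir()
        removed.append(path)
    return removed
```

Calling `find_empty_files("data/raw")` first gives a dry run; `remove_empty_downloads("data/raw")` then does the actual cleanup.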

In [8]:
!tree data/raw

data/raw
├── green
│   ├── 2020
│   │   ├── 01
│   │   │   └── green_tripdata_2020_01.csv.gz
│   │   ├── 02
│   │   │   └── green_tripdata_2020_02.csv.gz
│   │   ├── 03
│   │   │   └── green_tripdata_2020_03.csv.gz
│   │   ├── 04
│   │   │   └── green_tripdata_2020_04.csv.gz
│   │   ├── 05
│   │   │   └── green_tripdata_2020_05.csv.gz
│   │   ├── 06
│   │   │   └── green_tripdata_2020_06.csv.gz
│   │   ├── 07
│   │   │   └── green_tripdata_2020_07.csv.gz
│   │   ├── 08
│   │   │   └── green_tripdata_2020_08.csv.gz
│   │   ├── 09
│   │   │   └── green_tripdata_2020_09.csv.gz
│   │   ├── 10
│   │   │   └── green_tripdata_2020_10.csv.gz
│   │   ├── 11
│   │   │   └── green_t

In [41]:
import os
import pandas as pd
import pyspark
from pyspark.sql import SparkSession, types

In [30]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("NYTaxi") \
    .getOrCreate()

Green taxis first...

In [19]:
df_green_pandas = pd.read_csv("data/raw/green/2021/01/green_tripdata_2021_01.csv.gz", nrows=1000)

In [20]:
spark.createDataFrame(df_green_pandas).schema

StructType([StructField('VendorID', LongType(), True), StructField('lpep_pickup_datetime', StringType(), True), StructField('lpep_dropoff_datetime', StringType(), True), StructField('store_and_fwd_flag', StringType(), True), StructField('RatecodeID', LongType(), True), StructField('PULocationID', LongType(), True), StructField('DOLocationID', LongType(), True), StructField('passenger_count', LongType(), True), StructField('trip_distance', DoubleType(), True), StructField('fare_amount', DoubleType(), True), StructField('extra', DoubleType(), True), StructField('mta_tax', DoubleType(), True), StructField('tip_amount', DoubleType(), True), StructField('tolls_amount', DoubleType(), True), StructField('ehail_fee', DoubleType(), True), StructField('improvement_surcharge', DoubleType(), True), StructField('total_amount', DoubleType(), True), StructField('payment_type', LongType(), True), StructField('trip_type', LongType(), True), StructField('congestion_surcharge', DoubleType(), True)])

In [29]:
green_schema = types.StructType([
    types.StructField("VendorID", types.IntegerType(), True), 
    types.StructField("lpep_pickup_datetime", types.TimestampType(), True), 
    types.StructField("lpep_dropoff_datetime", types.TimestampType(), True), 
    types.StructField("store_and_fwd_flag", types.StringType(), True), 
    types.StructField("RatecodeID", types.IntegerType(), True), 
    types.StructField("PULocationID", types.IntegerType(), True), 
    types.StructField("DOLocationID", types.IntegerType(), True), 
    types.StructField("passenger_count", types.IntegerType(), True), 
    types.StructField("trip_distance", types.DoubleType(), True), 
    types.StructField("fare_amount", types.DoubleType(), True), 
    types.StructField("extra", types.DoubleType(), True), 
    types.StructField("mta_tax", types.DoubleType(), True), 
    types.StructField("tip_amount", types.DoubleType(), True), 
    types.StructField("tolls_amount", types.DoubleType(), True), 
    types.StructField("ehail_fee", types.DoubleType(), True), 
    types.StructField("improvement_surcharge", types.DoubleType(), True), 
    types.StructField("total_amount", types.DoubleType(), True), 
    types.StructField("payment_type", types.IntegerType(), True), 
    types.StructField("trip_type", types.IntegerType(), True), 
    types.StructField("congestion_surcharge", types.DoubleType(), True)
])

In [45]:
df_green = spark.read \
    .option("header", "true") \
    .schema(green_schema) \
    .csv("data/raw/green/2021/01")

In [32]:
df_green.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- lpep_dropoff_datetime: timestamp (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- ehail_fee: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- trip_type: integer (nullable = true)
 |-- congestion_surcharge: double (nullable = true)



Looks good. Now the same thing for yellow taxis...

In [34]:
df_yellow_pandas = pd.read_csv("data/raw/yellow/2021/01/yellow_tripdata_2021_01.csv.gz", nrows=1000)

In [35]:
spark.createDataFrame(df_yellow_pandas).schema

StructType([StructField('VendorID', LongType(), True), StructField('tpep_pickup_datetime', StringType(), True), StructField('tpep_dropoff_datetime', StringType(), True), StructField('passenger_count', LongType(), True), StructField('trip_distance', DoubleType(), True), StructField('RatecodeID', LongType(), True), StructField('store_and_fwd_flag', StringType(), True), StructField('PULocationID', LongType(), True), StructField('DOLocationID', LongType(), True), StructField('payment_type', LongType(), True), StructField('fare_amount', DoubleType(), True), StructField('extra', DoubleType(), True), StructField('mta_tax', DoubleType(), True), StructField('tip_amount', DoubleType(), True), StructField('tolls_amount', DoubleType(), True), StructField('improvement_surcharge', DoubleType(), True), StructField('total_amount', DoubleType(), True), StructField('congestion_surcharge', DoubleType(), True)])

In [46]:
yellow_schema = types.StructType([
    types.StructField("VendorID", types.IntegerType(), True), 
    types.StructField("tpep_pickup_datetime", types.TimestampType(), True), 
    types.StructField("tpep_dropoff_datetime", types.TimestampType(), True), 
    types.StructField("passenger_count", types.IntegerType(), True), 
    types.StructField("trip_distance", types.DoubleType(), True), 
    types.StructField("RatecodeID", types.IntegerType(), True), 
    types.StructField("store_and_fwd_flag", types.StringType(), True), 
    types.StructField("PULocationID", types.IntegerType(), True), 
    types.StructField("DOLocationID", types.IntegerType(), True), 
    types.StructField("payment_type", types.IntegerType(), True), 
    types.StructField("fare_amount", types.DoubleType(), True), 
    types.StructField("extra", types.DoubleType(), True), 
    types.StructField("mta_tax", types.DoubleType(), True), 
    types.StructField("tip_amount", types.DoubleType(), True), 
    types.StructField("tolls_amount", types.DoubleType(), True), 
    types.StructField("improvement_surcharge", types.DoubleType(), True), 
    types.StructField("total_amount", types.DoubleType(), True), 
    types.StructField("congestion_surcharge", types.DoubleType(), True)
])

In [47]:
df_yellow = spark.read \
    .option("header", "true") \
    .schema(yellow_schema) \
    .csv("data/raw/yellow/2021/01")

In [48]:
df_yellow.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)



Now to process all the data programmatically...

In [49]:
taxi_types = ["green", "yellow"]
schemas_dict = {"green": green_schema, "yellow": yellow_schema}
years = [2020, 2021]

for taxi_type in taxi_types:
    for year in years:
        for month in range(1, 13):
            input_path = f"data/raw/{taxi_type}/{year}/{month:02d}/"
            output_path = f"data/pq/{taxi_type}/{year}/{month:02d}/"

            # Skip months that have no raw data (e.g. August 2021 onward).
            if os.path.exists(input_path):
                print(f"Processing {taxi_type} taxi data for {year}-{month:02d}...")

                schema = schemas_dict[taxi_type]

                df = spark.read \
                    .option("header", "true") \
                    .schema(schema) \
                    .csv(input_path)

                # Write each month as 4 Parquet files.
                # With the default save mode this fails if output_path already exists.
                df \
                    .repartition(4) \
                    .write.parquet(output_path)

print("Done!") # I added this later which is why it doesn't show up in the output below

Processing green taxi data for 2020-01...
Processing green taxi data for 2020-02...
Processing green taxi data for 2020-03...
Processing green taxi data for 2020-04...
Processing green taxi data for 2020-05...
Processing green taxi data for 2020-06...
Processing green taxi data for 2020-07...
Processing green taxi data for 2020-08...
Processing green taxi data for 2020-09...
Processing green taxi data for 2020-10...
Processing green taxi data for 2020-11...
Processing green taxi data for 2020-12...
Processing green taxi data for 2021-01...
Processing green taxi data for 2021-02...
Processing green taxi data for 2021-03...
Processing green taxi data for 2021-04...
Processing green taxi data for 2021-05...
Processing green taxi data for 2021-06...
Processing green taxi data for 2021-07...
Processing yellow taxi data for 2020-01...
Processing yellow taxi data for 2020-02...
Processing yellow taxi data for 2020-03...
Processing yellow taxi data for 2020-04...
Processing yellow taxi data for 2020-05...
Processing yellow taxi data for 2020-06...
Processing yellow taxi data for 2020-07...
Processing yellow taxi data for 2020-08...
Processing yellow taxi data for 2020-09...
Processing yellow taxi data for 2020-10...
Processing yellow taxi data for 2020-11...
Processing yellow taxi data for 2020-12...
Processing yellow taxi data for 2021-01...
Processing yellow taxi data for 2021-02...
Processing yellow taxi data for 2021-03...
Processing yellow taxi data for 2021-04...
Processing yellow taxi data for 2021-05...
Processing yellow taxi data for 2021-06...
Processing yellow taxi data for 2021-07...

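**Note:** I don't know why we did it in such a roundabout way. Spark can infer schemas from CSV files directly if you set the `inferSchema` option to `true` while reading them. The trade-off is that inference needs an extra pass over the data, and the inferred types can differ from one file to the next, whereas an explicit schema keeps every month consistent.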
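
For comparison, the `inferSchema` route could look like the sketch below. The helper name `read_csv_inferred` is hypothetical, and `spark` is assumed to be the `SparkSession` created earlier in this notebook.

```python
def read_csv_inferred(spark, path):
    """Read the CSV files under path and let Spark infer the column types."""
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # costs an extra pass over the data
        .csv(path)
    )
```

Calling `read_csv_inferred(spark, "data/raw/green/2021/01").printSchema()` shows what Spark infers, which can then be compared against the hand-written `green_schema`.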

In [50]:
!tree data/pq

data/pq
├── green
│   ├── 2020
│   │   ├── 01
│   │   │   ├── _SUCCESS
│   │   │   ├── part-00000-3e005098-9a5f-4976-80c6-91a55b177bd9-c000.snappy.parquet
│   │   │   ├── part-00001-3e005098-9a5f-4976-80c6-91a55b177bd9-c000.snappy.parquet
│   │   │   ├── part-00002-3e005098-9a5f-4976-80c6-91a55b177bd9-c000.snappy.parquet
│   │   │   └── part-00003-3e005098-9a5f-4976-80c6-91a55b177bd9-c000.snappy.parquet
│   │   ├── 02
│   │   │   ├── _SUCCESS
│   │   │   ├── part-00000-488b8998-0a62-4e88-a33b-b16df36e6da1-c000.snappy.parquet
│   │   │   ├── part-00001-488b8998-0a62-4e88-a33b-b16df36e6da1-c000.snappy.parquet
│   │   │   ├── part-00002-488b8998-0a62-4e88-a33b-b16df36e6da1-c000.snappy.parquet
│   │   │   └── part-00003-488b8998-0a62-4e88-a33b-b16df36e6da1-c000.snappy.parquet
│   │   ├── 03
│   │   │   ├── _SUCCESS
│   │   │   ├── part-00000-dc498d10-8ed7-409a-8ab5-2a53f628b0c5-c000.snappy.parquet
│   │   │   ├──