I wrote a Bash script that downloads the data and saves it locally. I have already run it for yellow taxis in 2020 and 2021 and for green taxis in 2020. Finally, let's run it for green taxis in 2021 (the data only runs through July; there is nothing for August and beyond):

In [1]:
!./download_data.sh green 2021


Downloading https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2021-01.csv.gz and saving to data/raw/green/2021/01/green_tripdata_2021_01.csv.gz...
--2023-10-25 09:26:45--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2021-01.csv.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea387a15-484c-469b-860d-3382ee7659be?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231025%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231025T162646Z&X-Amz-Expires=300&X-Amz-Signature=8ed0f2bf92c9b625cc3e6598cdb764f4a4a28c0f3a1a6627e60993cfe3423f6b&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=513814948&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2021-01.csv.gz
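
For readers without the script, here is a minimal sketch of what a downloader like this might look like. It is not the actual `download_data.sh`: the function names (`build_url`, `build_local_path`, `download_year`) and the `URL_PREFIX` variable are hypothetical, and the URL and local path patterns are inferred from the log output above. It assumes `wget` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a downloader; NOT the actual download_data.sh.
set -euo pipefail

# Release URL prefix, as seen in the log output above.
URL_PREFIX="https://github.com/DataTalksClub/nyc-tlc-data/releases/download"

# Build the remote URL for a taxi type, year, and zero-padded month.
build_url() {
  local taxi_type=$1 year=$2 month=$3
  echo "${URL_PREFIX}/${taxi_type}/${taxi_type}_tripdata_${year}-${month}.csv.gz"
}

# Build the local path (note the underscore between year and month).
build_local_path() {
  local taxi_type=$1 year=$2 month=$3
  echo "data/raw/${taxi_type}/${year}/${month}/${taxi_type}_tripdata_${year}_${month}.csv.gz"
}

# Download all twelve months of one year for one taxi type.
download_year() {
  local taxi_type=$1 year=$2 m month url local_path
  for m in {1..12}; do
    month=$(printf "%02d" "$m")
    url=$(build_url "$taxi_type" "$year" "$month")
    local_path=$(build_local_path "$taxi_type" "$year" "$month")
    echo "Downloading ${url} and saving to ${local_path}..."
    mkdir -p "$(dirname "$local_path")"
    # wget -O creates the output file before the transfer starts,
    # so a failed download can leave a zero-byte file behind.
    wget "$url" -O "$local_path" || echo "Download failed: ${url}" >&2
  done
}

# Only download when called with two arguments, e.g. `green 2021`.
if [[ $# -eq 2 ]]; then
  download_year "$1" "$2"
fi
```

Saved as a script and run with two arguments such as `green 2021`, this would try every month from 01 to 12. Months with no published data would fail and could leave empty files on disk.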

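Let's take a look with `gzcat`. Ignore the `Broken pipe` error at the bottom: `head` closes the pipe once it has read five lines, while `gzcat` is still writing. The error does not appear when I run the command in the shell, so it seems to be a Jupyter or VSCode quirk.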

> Note: `gzcat` is the gzip version of `zcat`, and `zcat` is like `cat` for compressed files. The stock `zcat` on macOS appears to append `.Z` to the file name, so I used `gzcat` instead, which seems to work fine. If you don't have the GNU utilities, install them. Or better yet, check out [linuxify](https://github.com/darksonic37/linuxify).

In [3]:
!gzcat data/raw/green/2021/01/green_tripdata_2021_01.csv.gz | head -n 5


VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
2,2021-01-01 00:15:56,2021-01-01 00:19:52,N,1,43,151,1,1.01,5.5,0.5,0.5,0,0,,0.3,6.8,2,1,0
2,2021-01-01 00:25:59,2021-01-01 00:34:44,N,1,166,239,1,2.53,10,0.5,0.5,2.81,0,,0.3,16.86,1,1,2.75
2,2021-01-01 00:45:57,2021-01-01 00:51:55,N,1,41,42,1,1.12,6,0.5,0.5,1,0,,0.3,8.3,1,1,0
2,2020-12-31 23:57:51,2021-01-01 00:04:56,N,1,168,75,1,1.99,8,0.5,0.5,0,0,,0.3,9.3,2,1,0
gzcat: error writing to output: Broken pipe
gzcat: data/raw/green/2021/01/green_tripdata_2021_01.csv.gz: uncompress failed
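
If the broken-pipe noise is distracting, one option is a small helper that peeks at a compressed file and hides the decompressor's error output. This is only a sketch: `peek` is a hypothetical name, and it uses `gzip -dc`, which should behave like `gzcat` on both macOS and Linux.

```shell
# Hypothetical helper: print the first N lines (default 5) of a gzip file.
# Redirecting stderr hides the "Broken pipe" message that appears when
# head exits early, but it also hides genuine decompression errors.
peek() {
  local file=$1 n=${2:-5}
  gzip -dc "$file" 2>/dev/null | head -n "$n"
}
```

For example, `peek data/raw/green/2021/01/green_tripdata_2021_01.csv.gz` would print the header and the first four rows. The trade-off is that a corrupt file would fail silently, so this is best kept for quick inspection.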


In [8]:
!ls -FGhl data/raw/green/2021/08


total 0
-rw-r--r--  1 fafa  staff     0B 25 Oct 09:26 green_tripdata_2021_08.csv.gz


Actually, the August 2021 data files for both yellow and green cabs are empty, so let's remove them.

In [9]:
! rm -r data/raw/green/2021/08 data/raw/yellow/2021/08
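
Rather than checking each month by hand, empty downloads can also be found and removed in one pass. The following is a sketch under a few assumptions: the function names are hypothetical, the files live under `data/raw` and end in `.csv.gz`, and `find` supports `-empty` and `-delete` (both GNU and BSD `find` do).

```shell
# Hypothetical helper: list zero-byte .csv.gz files under a root directory.
list_empty_downloads() {
  local root=${1:-data/raw}
  find "$root" -type f -name '*.csv.gz' -empty
}

# Hypothetical helper: delete zero-byte .csv.gz files, then remove any
# directories left empty (the root directory itself is kept).
remove_empty_downloads() {
  local root=${1:-data/raw}
  find "$root" -type f -name '*.csv.gz' -empty -delete
  find "$root" -mindepth 1 -type d -empty -delete
}
```

Running `list_empty_downloads` first shows what would be deleted, which is a sensible check before calling `remove_empty_downloads`.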


In [10]:
!tree data


data
└── raw
    ├── green
    │   ├── 2020
    │   │   ├── 01
    │   │   │   └── green_tripdata_2020_01.csv.gz
    │   │   ├── 02
    │   │   │   └── green_tripdata_2020_02.csv.gz
    │   │   ├── 03
    │   │   │   └── green_tripdata_2020_03.csv.gz
    │   │   ├── 04
    │   │   │   └── green_tripdata_2020_04.csv.gz
    │   │   ├── 05
    │   │   │   └── green_tripdata_2020_05.csv.gz
    │   │   ├── 06
    │   │   │   └── green_tripdata_2020_06.csv.gz
    │   │   ├── 07
    │   │   │   └── green_tripdata_2020_07.csv.gz
    │   │   ├── 08
    │   │   │   └── green_tripdata_2020_08.csv.gz
    │   │   ├── 09
    │   │   │   └── green_tripdata_2020_09.csv.gz
    │   │   ├── 10
    │   │   │   └── green_tripdata_2020_10.csv.gz
    │   │   ├── 11
    │   │   │   └── green_tripdata_2020_11.csv.gz
    │   │   └── 12
  