Hey, I'm Jobert Gutierrez. Below you'll find the logic and code I used to answer the third assignment of the Data Engineering Zoomcamp, offered by DataTalks.Club.

# __Module 3 Homework: Data Warehousing__

ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.

__Important Note:__

For this homework we will be using the Yellow Taxi Trip Records for January 2024 - June 2024 (NOT the entire year of data). The Parquet files come from the New York City Taxi Data found here:

`https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page`

If you are using an orchestrator such as Kestra, Mage, Airflow, or Prefect, do not use it to load the data into BigQuery.
Stop after loading the files into a bucket.

Load Script: You can manually download the Parquet files and upload them to your GCS Bucket, or you can use the linked script [here](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2025/03-data-warehouse/load_yellow_taxi_data.py).

You will simply need to generate a Service Account with GCS Admin privileges, or be authenticated with the Google SDK, and update the bucket name in the script to the name of your bucket.
Nothing is foolproof, so make sure that all 6 files show in your GCS Bucket before beginning.


NOTE: You will need to use the PARQUET format option when creating an External Table.

__BIG QUERY SETUP:__

Create an external table using the Yellow Taxi Trip Records.
Create a (regular/materialized) table in BQ using the Yellow Taxi Trip Records (do not partition or cluster this table).

### __Question 1.__

What is the count of records for the 2024 Yellow Taxi Data?

- 65,623 <br>
- 840,402 <br>
- 20,332,093 <br>
- 85,431,289


### Answer: 
I created the tables in BigQuery using the following commands:

In [None]:
-- External table that reads the Parquet files directly from the GCS bucket
create or replace external table `taxi-data-24-dtc.ny_taxi_25.external_yellow_taxi_25`
options (
  format = 'parquet',
  uris = ['gs://data-taxi-yellow/yellow_tripdata_2024-*.parquet']
);

-- Regular (materialized) table, neither partitioned nor clustered
create or replace table `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition` as
select * from `taxi-data-24-dtc.ny_taxi_25.external_yellow_taxi_25`;

Once the tables were created, I counted the records using this query:

In [None]:
select count(1) from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`

The number of rows in the yellow taxi dataset for 2024 is __20,332,093__.

### __Question 2.__
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.
What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?

- 18.82 MB for the External Table and 47.60 MB for the Materialized Table <br>
- 0 MB for the External Table and 155.12 MB for the Materialized Table <br>
- 2.14 GB for the External Table and 0 MB for the Materialized Table <br>
- 0 MB for the External Table and 0 MB for the Materialized Table


### Answer: 
To answer this question I used this query for the materialized table:

In [None]:
select distinct(PULocationID) from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`

![Q2](Q2.png "Materialized table")

And this query for the external table:

In [None]:
select distinct(PULocationID) from `taxi-data-24-dtc.ny_taxi_25.external_yellow_taxi_25`

![Q2-b](Q2-b.png "External Table")

So the correct answer is: __0 MB for the External Table and 155.12 MB for the Materialized Table__
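
The question asks for a count, while the queries above list the distinct values. Both forms scan the same single column, so the estimate should not change. A minimal sketch of the counting variant is shown here; the alias `distinct_pickup_locations` is only an illustrative name. As far as I understand, the external table shows 0 MB because BigQuery cannot know the size of the files in the bucket before the query actually runs.

```sql
-- Materialized table: the estimate should match the 155.12 MB reported above,
-- because only the PULocationID column is scanned.
select count(distinct PULocationID) as distinct_pickup_locations
from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`;

-- External table: the estimate is expected to stay at 0 MB before execution.
select count(distinct PULocationID) as distinct_pickup_locations
from `taxi-data-24-dtc.ny_taxi_25.external_yellow_taxi_25`;
```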

### __Question 3.__
Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table. Why is the estimated number of bytes different?

- BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed. <br>
- BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, doubling the estimated bytes processed. <br>
- BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned. <br>
- When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed <br>


### Answer:
The two queries are:

In [None]:
# One column, which processes 155.12 MB
select PULocationID
from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`;

# Two columns, which processes 310.24 MB
select
  PULocationID, DOLocationID
from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`;

The right answer is __BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.__
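
The two estimates can be reproduced with simple arithmetic. The sketch below assumes that both ID columns are stored as INT64 (8 bytes per value) and that BigQuery reports MB as 1024 × 1024 bytes. Neither assumption is stated in the homework, but together with the 20,332,093 rows from Question 1 they give exactly the reported numbers.

```sql
-- Assumption: PULocationID and DOLocationID are INT64, i.e. 8 bytes per value.
-- 20,332,093 is the row count obtained in Question 1.
select
  round(20332093 * 8 / pow(1024, 2), 2)     as one_column_mb,   -- 155.12
  round(20332093 * 8 * 2 / pow(1024, 2), 2) as two_columns_mb;  -- 310.24
```

Doubling the number of scanned columns doubles the bytes, which is what a columnar store would be expected to do.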

### __Question 4.__
How many records have a fare_amount of 0?

- 128,210 <br>
- 546,578 <br>
- 20,188,016 <br>
- 8,333

### Answer:
I ran this query:

In [None]:
select count(1)
from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`
where fare_amount = 0 

It returns __8,333 rows__ with a fare_amount of 0.

### __Question 5.__
What is the best strategy to make an optimized table in BigQuery if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID? (Create a new table with this strategy.)

- Partition by tpep_dropoff_datetime and Cluster on VendorID <br>
- Cluster on tpep_dropoff_datetime and Cluster on VendorID <br>
- Cluster on tpep_dropoff_datetime Partition by VendorID <br>
- Partition by tpep_dropoff_datetime and Partition by VendorID


### Answer:
The best strategy is to __Partition by tpep_dropoff_datetime and Cluster on VendorID__. A table can only be partitioned on one column, and partitioning on the datetime column that every query filters on lets BigQuery skip the partitions outside the requested range, which reduces the data scanned far more effectively than partitioning on VendorID would. Clustering on VendorID then keeps the rows inside each partition sorted by the column used for ordering. The code used to create the table is:

In [None]:
create or replace table `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_partition_and_cluster`
partition by date(tpep_dropoff_datetime)
cluster by VendorID as 
select * from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`
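
To check that the table was really split into daily partitions, the dataset's `INFORMATION_SCHEMA.PARTITIONS` view can be queried. This is only a sketch for inspection and was not part of the homework; it assumes the table name created above.

```sql
-- One row per daily partition, with the number of rows stored in it.
select
  partition_id,
  total_rows
from `taxi-data-24-dtc.ny_taxi_25.INFORMATION_SCHEMA.PARTITIONS`
where table_name = 'yellow_taxi_partition_and_cluster'
order by partition_id;
```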

### __Question 6.__
Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime 2024-03-01 and 2024-03-15 (inclusive)

Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values?

Choose the answer which most closely matches.

- 12.47 MB for non-partitioned table and 326.42 MB for the partitioned table <br>
- 310.24 MB for non-partitioned table and 26.84 MB for the partitioned table <br>
- 5.87 MB for non-partitioned table and 0 MB for the partitioned table <br>
- 310.31 MB for non-partitioned table and 285.64 MB for the partitioned table


### Answer:
The queries used are:

In [None]:
# For the non-partitioned table:
select distinct(VendorID) from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`
where date(tpep_dropoff_datetime) between '2024-03-01' and '2024-03-15'

![Q6-a](Q6-a.png "Non partitioned query")

In [None]:
# For the partitioned table:
select distinct(VendorID) from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_partition_and_cluster`
where date(tpep_dropoff_datetime) between '2024-03-01' and '2024-03-15'

![Q6-b](Q6-b.png "Partitioned table")

The option that properly reflects the amount of data is __310.24 MB for non-partitioned table and 26.84 MB for the partitioned table__.
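
The ratio between the two estimates (26.84 / 310.24, about 8.7%) should roughly match the share of rows that fall in the 15 matching daily partitions, because the partitioned table reads the same two columns (VendorID and tpep_dropoff_datetime) but only for those partitions. This is a hedged sanity check, assuming the bytes scanned scale with the number of rows; the column aliases are illustrative.

```sql
-- Share of rows whose drop-off date falls inside the filtered range.
select
  countif(date(tpep_dropoff_datetime) between '2024-03-01' and '2024-03-15') as rows_in_range,
  count(*) as total_rows,
  round(
    countif(date(tpep_dropoff_datetime) between '2024-03-01' and '2024-03-15') / count(*),
    4
  ) as share_in_range
from `taxi-data-24-dtc.ny_taxi_25.yellow_taxi_25_non_partition`;
```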

### __Question 7.__
Where is the data stored in the External Table you created?

- Big Query <br>
- Container Registry <br>
- GCP Bucket <br>
- Big Table


### Answer:
For the external table, the data stays in a __GCP Bucket__; BigQuery reads the Parquet files from there at query time.

### __Question 8.__
It is best practice in Big Query to always cluster your data:

- True <br>
- False

### Answer:
It's false, because it depends on the amount of data in the table. When the table is small, the compute resources spent building the clusters outweigh the benefits at query time.
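
A quick way to see how much data each table holds before deciding whether clustering is worth it is the dataset's `__TABLES__` metadata view. The query below is a sketch, and the `size_mb` alias is only illustrative. Note that the external table is expected to report no stored bytes, since its data lives in the bucket.

```sql
-- Row count and stored size of every table in the dataset.
select
  table_id,
  row_count,
  round(size_bytes / pow(1024, 2), 2) as size_mb
from `taxi-data-24-dtc.ny_taxi_25.__TABLES__`
order by size_mb desc;
```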