Hey, I'm Jobert Gutierrez. Below you'll find the logic and code I used to answer the third assignment of the Data Engineering Zoomcamp program offered by Data Talks Club.

# __Module 3 Homework__

__Important Note:__

For this homework we will be using the 2022 Green Taxi Trip Record Parquet Files from the New York City Taxi Data found here:

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

If you are using orchestration such as Mage, Airflow or Prefect do not load the data into Big Query using the orchestrator.
Stop with loading the files into a bucket.

NOTE: You will need to use the PARQUET option files when creating an External Table

SETUP:

Create an external table using the Green Taxi Trip Records Data for 2022. <br>
Create a table in BQ using the Green Taxi Trip Records for 2022 (do not partition or cluster this table).

### Preparation.

The external table was created with the following code:

In [None]:
# First snippet of code
create or replace external table `data-taxi-1.bq_homework.green_taxi_trips`
options (
  format = 'parquet',
  uris = ['gs://dtc_data_lake_data-taxi-1/data/green/green_tripdata_2022-*.parquet']
)

The non-partitioned table in BQ is created using the following code:

In [None]:
create or replace table `data-taxi-1.bq_homework.green_taxi_non_partitioned` as
select * from `data-taxi-1.bq_homework.green_taxi_trips`;

## Questions
### Question 1: 
What is the count of records for the 2022 Green Taxi Data?

- 65,623,481
- 840,402
- 1,936,423
- 253,647



### Answer: 
Using the code:

In [None]:
# count(*) counts every row; count(VendorID) would skip rows where VendorID is NULL
select count(*) from `data-taxi-1.bq_homework.green_taxi_non_partitioned`;

We obtain a result of __840,402 records__.

## Question 2:
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.
What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?

- 0 MB for the External Table and 6.41MB for the Materialized Table
- 18.82 MB for the External Table and 47.60 MB for the Materialized Table
- 0 MB for the External Table and 0MB for the Materialized Table
- 2.14 MB for the External Table and 0MB for the Materialized Table

## Answer:

Using the code:

In [None]:
# To query the external table:
select count(distinct PULocationID)
from `data-taxi-1.bq_homework.green_taxi_trips`;

# To query the non-partitioned table:
select count(distinct PULocationID)
from `data-taxi-1.bq_homework.green_taxi_non_partitioned`;

Before running the queries, BigQuery estimates the amount of data to be processed. For the external table, BigQuery shows __'This query will process 0 B when run.'__; for the materialized table, it shows __'This query will process 6.41 MB when run.'__. The size of the data behind an external table cannot be estimated in advance because the data lies outside BigQuery.

## Question 3. 
How many records have a fare_amount of 0?

- 12,488
- 128,219
- 112
- 1,622

## Answer:

Using the code:

In [None]:
select count(VendorID) 
from `data-taxi-1.bq_homework.green_taxi_non_partitioned` as t1
where t1.fare_amount = 0

For the code snippet above, BigQuery estimates __'This query will process 12.82 MB when run.'__. After the query runs, __1,622 records are found with a fare amount of 0.__
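
The two estimates reported so far (6.41 MB in Question 2 and 12.82 MB here) are consistent with simple arithmetic on column widths. The following is a rough sketch, assuming that each referenced column is stored as an 8-byte value (INT64 or FLOAT64) and that "MB" in the console message means 2^20 bytes. Both are assumptions, and `estimated_mib` is a hypothetical helper, not a BigQuery API.

```python
ROWS = 840_402          # record count from Question 1
BYTES_PER_VALUE = 8     # assumed width of an INT64 / FLOAT64 value


def estimated_mib(rows: int, columns_read: int,
                  bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Approximate MiB scanned when a query reads `columns_read` fixed-width columns."""
    return rows * columns_read * bytes_per_value / 2**20


# Question 2 reads one column (PULocationID).
print(f"{estimated_mib(ROWS, 1):.2f} MiB")   # prints 6.41 MiB
# Question 3 reads two columns (VendorID and fare_amount).
print(f"{estimated_mib(ROWS, 2):.2f} MiB")   # prints 12.82 MiB
```

Under these assumptions the estimate depends only on which columns a query touches, not on how many rows pass the `where` filter.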

## Question 4. 
What is the best strategy to make an optimized table in Big Query if your query will always order the results by PUlocationID and filter based on lpep_pickup_datetime? (Create a new table with this strategy)

- Cluster on lpep_pickup_datetime Partition by PUlocationID
- Partition by lpep_pickup_datetime Cluster on PUlocationID
- Partition by lpep_pickup_datetime and Partition by PUlocationID
- Cluster on by lpep_pickup_datetime and Cluster on PUlocationID

## Answer:

From theory we know that the best strategy for a table whose queries always filter on lpep_pickup_datetime and order the results by PUlocationID is `Partition by lpep_pickup_datetime Cluster on PUlocationID`. Partitioning is allowed on one column only, so partitioning on the date of _lpep_pickup_datetime_ lets BigQuery skip every partition outside the filtered date range, which reduces the number of rows to be analyzed in future queries. Clustering on PULocationID then keeps the rows inside each partition sorted by that column, which suits queries that order by it. Partitioning on PULocationID, by contrast, would only narrow a query to a limited number of IDs instead of keeping all of them in a specific order, which is the desired outcome here.

The code used to create the new partitioned and clustered table described above is:

In [None]:
create or replace table `data-taxi-1.bq_homework.green_taxi_partitioned_clustered` 
partition by date(lpep_pickup_datetime)
cluster by PULocationID as
select * from `data-taxi-1.bq_homework.green_taxi_trips`
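
To make the idea concrete, here is a toy Python model of what partitioning and clustering buy us. It is only an illustration: the sample rows are made up, and the dictionary-of-sorted-lists layout is an assumption chosen for clarity, not a description of how BigQuery stores data internally.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (pickup_date, PULocationID).
rows = [
    (date(2022, 5, 31), 74),
    (date(2022, 6, 1), 166),
    (date(2022, 6, 1), 41),
    (date(2022, 6, 15), 74),
    (date(2022, 7, 1), 41),
]

# "Partition by date": one bucket per pickup date.
partitions = defaultdict(list)
for pickup_date, location_id in rows:
    partitions[pickup_date].append(location_id)

# "Cluster by PULocationID": keep each bucket sorted by location ID.
for bucket in partitions.values():
    bucket.sort()


def scan(start: date, end: date):
    """Return (distinct location IDs, rows scanned) for a date-range filter."""
    found, scanned = [], 0
    for pickup_date, bucket in partitions.items():
        if start <= pickup_date <= end:  # buckets outside the range are never read
            found.extend(bucket)
            scanned += len(bucket)
    return sorted(set(found)), scanned


ids, scanned = scan(date(2022, 6, 1), date(2022, 6, 30))
print(ids, f"scanned {scanned} of {len(rows)} rows")
# prints [41, 74, 166] scanned 3 of 5 rows
```

The date filter decides which buckets are read at all, while the sort order inside each bucket is what makes ordering by location ID cheap.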

## Question 5. Data Transformation
Write a query to retrieve the distinct PULocationID between lpep_pickup_datetime 06/01/2022 and 06/30/2022 (inclusive)

Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values?

Choose the answer which most closely matches.

- 22.82 MB for non-partitioned table and 647.87 MB for the partitioned table
- 12.82 MB for non-partitioned table and 1.12 MB for the partitioned table
- 5.63 MB for non-partitioned table and 0 MB for the partitioned table
- 10.31 MB for non-partitioned table and 10.31 MB for the partitioned table

## Answer:

Both queries are shown in the snippet below. For the non-partitioned table, BigQuery shows the message __'This query will process 12.82 MB when run.'__:

In [None]:
# To query the non-partitioned table:
select distinct PULocationID
from `data-taxi-1.bq_homework.green_taxi_non_partitioned`
where date(lpep_pickup_datetime) between '2022-06-01' and '2022-06-30';

# To query the partitioned and clustered table:
select distinct PULocationID
from `data-taxi-1.bq_homework.green_taxi_partitioned_clustered`
where date(lpep_pickup_datetime) between '2022-06-01' and '2022-06-30';

For the partitioned and clustered table, the estimate drops to roughly 9% of that for the non-partitioned table __('This query will process 1.12 MB when run.')__.
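
As a quick sanity check on that reduction, we can compare the observed ratio with the share of the year that June represents. This sketch assumes trips are spread roughly evenly across the 365 days of 2022, which is an assumption and not something measured here.

```python
non_partitioned_mb = 12.82   # estimate reported for the non-partitioned table
partitioned_mb = 1.12        # estimate reported for the partitioned table

observed_fraction = partitioned_mb / non_partitioned_mb
# Assumption: trips are spread roughly evenly across the 365 days of 2022.
calendar_fraction = 30 / 365

print(f"observed: {observed_fraction:.1%}, June share of the year: {calendar_fraction:.1%}")
# prints observed: 8.7%, June share of the year: 8.2%
```

The two figures are close, which is what we would expect if only the 30 daily partitions for June are being read.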

## Question 6. 
Where is the data stored in the External Table you created?

- Big Query
- GCP Bucket
- Big Table
- Container Registry

## Answer:

It can be seen in `First snippet of code` that the data behind the external table is stored in my __'GCP Bucket called dtc_data_lake_data-taxi-1'__.

## Question 7. 
It is best practice in Big Query to always cluster your data:

- True
- False

## Answer:

__'False'__. Clustering is recommended for data of 1 GB or more, because smaller data sizes don't show meaningful improvements in performance.

## Question 8. (Bonus: Not worth points)
No Points: Write a SELECT count(*) query FROM the materialized table you created. How many bytes does it estimate will be read? Why?

## Answer:

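The code snippet written is `select * from data-taxi-1.bq_homework.green_taxi_non_partitioned`, for which BigQuery shows __This query will process 114.11 MB when run__, because it reads the entire dataset: all the rows in every column. Note that this is a `select *`, not the `select count(*)` the question asks for. For a `select count(*)` on a materialized table, BigQuery estimates 0 B, because the row count is available from the table metadata without scanning any column.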
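
To put the 114.11 MB figure in perspective, the short calculation below compares it with the single-column estimate from Question 2. It assumes that "MB" in the console message means 2^20 bytes.

```python
ROWS = 840_402               # record count from Question 1
full_scan_mib = 114.11       # estimate reported for select *
single_column_mib = 6.41     # estimate reported in Question 2 for one column

# Assumption: the console's "MB" is 2**20 bytes.
bytes_per_row = full_scan_mib * 2**20 / ROWS
ratio = full_scan_mib / single_column_mib

print(f"about {bytes_per_row:.0f} bytes per row, {ratio:.1f}x a single-column scan")
# prints about 142 bytes per row, 17.8x a single-column scan
```

This is why selecting only the columns a query needs matters in a columnar store.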