First, we create the external table
CREATE OR REPLACE EXTERNAL TABLE `data-warehouse-486723.nytaxi.external_yellow_tripdata`
OPTIONS (
format = 'PARQUET',
uris = ['gs://diego-warehouse/yellow_tripdata_2024-*.parquet']
);Then we use this query
-- Check yellow trip data
SELECT
COUNT(1)
FROM
nytaxi.external_yellow_tripdata;20332093
First, we create the table
CREATE OR REPLACE TABLE `data-warehouse-486723.nytaxi.yellow_tripdata` AS
SELECT * FROM `data-warehouse-486723.nytaxi.external_yellow_tripdata`;And then we use the respectives queries
SELECT DISTINCT
COUNT(eyt.PULocationID)
FROM
nytaxi.external_yellow_tripdata AS eyt;
SELECT DISTINCT
COUNT(yt.PULocationID)
FROM
nytaxi.yellow_tripdata AS yt;Reading 0 MB for the External Table and 155.12 MB for the Materialized Table
When writting these two queries
SELECT
yt.PULocationID
FROM
nytaxi.yellow_tripdata AS yt;
SELECT
yt.PULocationID,
yt.DOLocationID
FROM
nytaxi.yellow_tripdata AS yt;The second one, represents the double of the first one, because of BigQuery being a colunar database
This query
SELECT
COUNT(1)
FROM
nytaxi.yellow_tripdata AS yt
WHERE
yt.fare_amount = 0;Returns the an amount of 8333
Question 5 - Best strategy to make an optimized table in Big Query if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID
Partition by tpep_dropoff_datetime and Cluster on VendorID
This query creates the mentioned table
CREATE OR REPLACE TABLE `data-warehouse-486723.nytaxi.yellow_tripdata_partioned_clustered`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `data-warehouse-486723.nytaxi.external_yellow_tripdata`;Then we compare the amount of data processed with these queries
SELECT DISTINCT
COUNT(yt.VendorID)
FROM
`data-warehouse-486723.nytaxi.yellow_tripdata` yt
WHERE
tpep_dropoff_datetime between '2024-03-01' and '2024-03-15';
SELECT DISTINCT
COUNT(ytpc.VendorID)
FROM
`data-warehouse-486723.nytaxi.yellow_tripdata_partioned_clustered` ytpc
WHERE
tpep_dropoff_datetime between '2024-03-01' and '2024-03-15';310.24 MB for non-partitioned table and 26.84 MB for the partitioned table
Container Registry
No. One basic example, is that it does not gain benefit in smaller tables