Skip to content

Latest commit

 

History

History
678 lines (500 loc) · 23.9 KB

week3.md

File metadata and controls

678 lines (500 loc) · 23.9 KB

Week 3: Data Warehouse and BigQuery

See week_3_data_warehouse on GitHub.

Youtube videos:

Basic SQL:

Concepts

OLAP vs OLTP

OLTP systems are designed to handle large volumes of transactional data involving multiple users.

OLAP system is designed to process large amounts of data quickly, allowing users to analyze multiple data dimensions in tandem. Teams can use this data for decision-making and problem-solving.

  OLTP OLAP
Purpose Control and run essential business operations in real time Plan, solve problems, support decisions, discover hidden insights
Data updates Short, fast updates initiated by user Data periodically refreshed with scheduled, long-running batch jobs
Database design Normalized databases for efficiency Denormalized databases for analysis
Space requirements Generally small if historical data is archived Generally large due to aggregating large datasets
Backup and recovery Regular backups required to ensure business continuity and meet legal and governance requirements Lost data can be reloaded from OLTP database as needed in lieu of regular backups
Productivity Increases productivity of end users Increases productivity of business managers, data analysts, and executives
Data view Lists day-to-day business transactions Multi-dimensional view of enterprise data
User examples Customer-facing personnel, clerks, online shoppers Knowledge workers such as data analysts, business analysts, and executives

What is a data wahehouse

  • OLAP solution
  • Used for reporting and data analysis

See also The Data Warehouse Toolkit 3rd Edition by Ralph Kimball and Margy Ross.

BigQuery

Overwiew

BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.

  • Serverless data warehouse
    • There are no servers to manage or database software to install
  • Software as well as infrastructure including
    • scalability and high-availability
  • Built-in features like
    • machine learning
    • geospatial analysis
    • business intelligence
  • BigQuery maximizes flexibility by separating the compute engine that analyzes your data from your storage
  • On demand pricing
    • 1 TB of data processed is $5
  • Flat rate pricing
    • Based on number of pre requested slots
    • 100 slots → $2,000/month = 400 TB data processed on demand pricing

See What is BigQuery? for more information.

Query Settings

5:59/23:04 (3.1.1)

w3s03

Click on MORE button, select Query Settings. In the Resource management section, we should disable Use cached results.

9:41/23:04 (3.1.1)

w3s04

File big_query.sql

-- Query public available table
SELECT station_id, name FROM
  bigquery-public-data.new_york_citibike.citibike_stations
LIMIT 100;

-- Creating external table referring to gcs path
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']
);

11:45/23:04 (3.1.1)

File big_query.sql

-- Check yello trip data
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata limit 10;

BigQuery Partition

A partitioned table is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You partition tables by specifying a partition column which is used to segment the table.

0:17/7:43 (3.1.2)

  • Time-unit column
  • Ingestion time (_PARTITIONTIME)
  • Integer range partitioning
  • When using Time unit or ingestion time
    • Daily (Default)
    • Hourly
    • Monthly or yearly
  • Number of partitions limit is 4000

See Introduction to partitioned tables for more information.

Partition in BigQuery

w3s05

16:10/23:04 (3.1.1)

File big_query.sql

-- Create a non partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_non_partitoned AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Create a partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitoned
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

16:59/23:04 (3.1.1)

File big_query.sql

-- Impact of partition
-- Scanning 1.6GB of data
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_non_partitoned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Scanning ~106 MB of DATA
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitoned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

18:01/23:04 (3.1.1)

File big_query.sql

-- Let's look into the partitons
SELECT table_name, partition_id, total_rows
FROM `nytaxi.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_tripdata_partitoned'
ORDER BY total_rows DESC;

BigQuery Clustering

Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.

In BigQuery, a clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns. The storage blocks are adaptively sized based on the size of the table. A clustered table maintains the sort properties in the context of each operation that modifies it. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition. As a result, BigQuery might not be able to accurately estimate the bytes to be processed by the query or the query costs, but it attempts to reduce the total bytes at execution.

See Introduction to clustered tables.

1:58/7:43 (3.1.2)

  • Columns you specify are used to colocate related data
  • Order of the column is important
  • The order of the specified columns determines the sort order of the data.
  • Clustering improves
    • Filter queries
    • Aggregate queries
  • Table with data size < 1 GB, don’t show significant improvement with partitioning and clustering
  • You can specify up to four clustering columns

3.01/7:43 (3.1.2)

Clustering columns must be top-level, non-repeated columns:

  • DATE
  • BOOL
  • GEOGRAPHY
  • INT64
  • NUMERIC
  • BIGNUMERIC
  • STRING
  • TIMESTAMP
  • DATETIME

19:38/23:04 (3.1.1)

Clustering in BigQuery

w3s06

File big_query.sql

-- Creating a partition and cluster table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitoned_clustered
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Query scans 1.1 GB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitoned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;

-- Query scans 864.5 MB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitoned_clustered
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;

Partitioning vs Clustering

3:18/7:43 (3.1.2)

Clustering Partitoning
Cost benefit unknown. Cost known upfront.
You need more granularity than partitioning alone allows. You need partition-level management.
Your queries commonly use filters or aggregation against multiple particular columns. Filter or aggregate on single column.
The cardinality of the number of values in a column or group of columns is large.

Clustering over paritioning

Like clustering, partitioning uses user-defined partition columns to specify how data is partitioned and what data is stored in each partition. Unlike clustering, partitioning provides granular query cost estimates before you run a query.

See Introduction to clustered tables.

5:03/7:43 (3.1.2)

  • Partitioning results in a small amount of data per partition (approximately less than 1 GB).
  • Partitioning results in a large number of partitions beyond the limits on partitioned tables.
  • Partitioning results in your mutation operations modifying the majority of partitions in the table frequently (for example, every few minutes).

Automatic reclustering

6:04/7:43 (3.1.2)

As data is added to a clustered table

  • The newly inserted data can be written to blocks that contain key ranges that overlap with the key ranges in previously written blocks
  • These overlapping keys weaken the sort property of the table

To maintain the performance characteristics of a clustered table

  • BigQuery performs automatic re-clustering in the background to restore the sort property of the table
  • For partitioned tables, clustering is maintained for data within the scope of each partition.

BigQuery Best Practice

0:00/3:44 (3.2.1)

  • Cost reduction
    • Avoid SELECT *
    • Price your queries before running them
    • Use clustered or partitioned tables
    • Use streaming inserts with caution
    • Materialize query results in stages

In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever possible reads only delta changes from the base tables to compute up-to-date results. See Introduction to materialized views

1:56/3:44 (3.2.1)

  • Query performance
    • Filter on partitioned columns
    • Denormalizing data
    • Use nested or repeated columns
    • Use external data sources appropriately
    • Don’t use it, in case u want a high query performance
    • Reduce data before using a JOIN
    • Do not treat WITH clauses as prepared statements
    • Avoid oversharding tables

2:43/3:44 (3.2.1)

  • Query performance
    • Avoid JavaScript user-defined functions
    • Use approximate aggregation functions (HyperLogLog++)
    • Order Last, for query operations to maximize performance
    • Optimize your join patterns
    • As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.

Internals of BigQuery

0:00/6:05 (3.2.2)

A high-level architecture for BigQuery service

BigQuery Architecture

See also:

2:49/6:05 (3.2.2)

Record-oriented vs column-oriented

w3s07

BigQuery stores table data in columnar format, meaning it stores each column separately. Column-oriented databases are particularly efficient at scanning individual columns over an entire dataset.

Column-oriented databases are optimized for analytic workloads that aggregate data over a very large number of records. Often, an analytic query only needs to read a few columns from a table. See Overview of BigQuery storage and Storage layout.

3:58/6:05 (3.2.2)

An example of Dremel serving tree

w3s08

See A Deep Dive Into Google BigQuery Architecture.

BigQuery References

5:53/6:05 (3.2.2)

BigQuery ML

Overwiew

BigQuery ML lets you create and execute machine learning models in BigQuery using standard SQL queries. BigQuery ML democratizes machine learning by letting SQL practitioners build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.

See What is BigQuery ML?

0:00/15:51 (3.3.1)

  • Target audience Data analysts, managers
  • No need for Python or Java knowledge
  • No need to export data into a different system

1:17/15:51 (3.3.1)

  • Free
    • 10 GB per month of data storage
    • 1 TB per month of queries processed
    • ML Create model step: First 10 GB per month is free
  • BigQuery ML pricing

2:20/15:51 (3.3.1)

w3s01

Supervised (labeled) machine learning model study design overview. Steps for the deployment of a supervised machine learning model. From left to right, the figure shows the initial team of multidisciplinary experts defining a study design to address a need. Data are then collected, processed, trained tested, validated, and ultimately deployed. See Rashidi, Hooman & Tran, Nam & Betts, Elham & Howell, Lydia & Green, Ralph. (2019). Artificial Intelligence and Machine Learning in Pathology: The Present Landscape of Supervised Methods.

Model selection guide

3:37/15:51 (3.3.1)

Model selection guide

w3s02

See Model selection guide.

Create dataset

4:48/15:51 (3.3.1)

✅ This query will process 6.5 GB when run.

File big_query_ml.sql

-- SELECT THE COLUMNS INTERESTED FOR YOU
SELECT passenger_count, trip_distance, PULocationID, DOLocationID, payment_type, fare_amount, tolls_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitoned` WHERE fare_amount != 0;

Feature preprocessing

6:49/15:51 (3.3.1)

Feature preprocessing is one of the most important steps in developing a machine learning model. It consists of the creation of features as well as the cleaning of the data. Sometimes, the creation of features is also referred as "feature engineering". See Feature preprocessing overview.

Here, we cast as string some columns for automatic feature reprocessing by BigQuery ML.

File big_query_ml.sql

-- CREATE A ML TABLE WITH APPROPRIATE TYPE
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_ml` (
  `passenger_count` INTEGER,
  `trip_distance` FLOAT64,
  `PULocationID` STRING,
  `DOLocationID` STRING,
  `payment_type` STRING,
  `fare_amount` FLOAT64,
  `tolls_amount` FLOAT64,
  `tip_amount` FLOAT64
) AS (
  SELECT passenger_count, trip_distance, cast(PULocationID AS STRING), CAST(DOLocationID AS STRING),
  CAST(payment_type AS STRING), fare_amount, tolls_amount, tip_amount
  FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitoned` WHERE fare_amount != 0
);

Create a model

10:27/15:51 (3.3.1)

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. See The CREATE MODEL statement.

This query below may take 5 minutes to run…​

File big_query_ml.sql

-- CREATE MODEL WITH DEFAULT SETTING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`
OPTIONS
  (model_type='linear_reg',
  input_label_cols=['tip_amount'],
  DATA_SPLIT_METHOD='AUTO_SPLIT') AS
SELECT
  *
FROM
  `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
  tip_amount IS NOT NULL;

12:04/15:51 (3.3.1)

The ML.FEATURE_INFO function allows you to see information about the input features used to train a model. See The ML.FEATURE_INFO function.

File big_query_ml.sql

-- CHECK FEATURES
SELECT * FROM ML.FEATURE_INFO(MODEL `taxi-rides-ny.nytaxi.tip_model`);

Evaluate the model

12:50/15:51 (3.3.1)

Use the ML.EVALUATE function to evaluate model metrics. See The ML.EVALUATE function.

File big_query_ml.sql

-- EVALUATE THE MODEL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
  SELECT
  *
  FROM
  `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
  WHERE
  tip_amount IS NOT NULL
  ));

Predict the model

13:27/15:51 (3.3.1)

The ML.PREDICT function is used to predict outcomes using the model. See The ML.PREDICT function.

File big_query_ml.sql

-- PREDICT THE MODEL
SELECT
  *
FROM
  ML.PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
  SELECT
  *
  FROM
  `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
  WHERE
  tip_amount IS NOT NULL
  ));

13:27/15:51 (3.3.1)

The ML.EXPLAIN_PREDICT function generates a predicted value and a set of feature attributions per instance of the input data. Feature attributions indicate how much each feature in your model contributed to the final prediction for each given instance. See The ML.EXPLAIN_PREDICT function.

File big_query_ml.sql

-- PREDICT AND EXPLAIN
SELECT
  *
FROM
  ML.EXPLAIN_PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
  SELECT
  *
  FROM
  `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
  WHERE
  tip_amount IS NOT NULL
  ), STRUCT(3 as top_k_features));

Hyperparameter tuning

14:43/15:51 (3.3.1)

In machine learning, hyperparameter tuning identifies a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a model argument whose value is set before the learning process begins. See BigQuery ML Hyperparameter Tuning Overview.

See also The CREATE MODEL statement for generalized linear models to know the hyperparameters we can tune for linear and logistic regression.

File big_query_ml.sql

-- HYPER PARAM TUNNING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_hyperparam_model`
OPTIONS
  (model_type='linear_reg',
  input_label_cols=['tip_amount'],
  DATA_SPLIT_METHOD='AUTO_SPLIT',
  num_trials=5,
  max_parallel_trials=2,
  l1_reg=hparam_range(0, 20),
  l2_reg=hparam_candidates([0, 0.1, 1, 10])) AS
SELECT
  *
FROM
  `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
  tip_amount IS NOT NULL;

BigQuery Machine Learning Deployment

0:00/4:26 (3.3.2)

This tutorial shows how to export a BigQuery ML model and then deploy the model either on AI Platform or on a local machine. See Exporting a BigQuery ML model for online prediction.

Steps for model deployment

See extract_model.md.

File extract_model.md

gcloud auth login
bq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model
mkdir /tmp/model
gsutil cp -r gs://taxi_ml_model/tip_model /tmp/model
mkdir -p serving_dir/tip_model/1
cp -r /tmp/model/tip_model/* serving_dir/tip_model/1
docker pull tensorflow/serving
docker run -p 8501:8501 \\
  --mount type=bind,source=pwd/serving_dir/tip_model,target=/models/tip_model \\
  -e MODEL_NAME=tip_model -t tensorflow/serving &
docker ps # Check if TensorFlow Serving is running.
curl -d '{"instances": [{"passenger_count":1, "trip_distance":12.2, "PULocationID":"193", \\
  "DOLocationID":"264", "payment_type":"2","fare_amount":20.4,"tolls_amount":0.0}]}' \\
  -X POST http://localhost:8501/v1/models/tip_model:predict

2:57/4:26 (3.3.2)

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments.

3:04/4:26 (3.3.2)

We use Postman to make some HTTP requests

Finally, open this url http://localhost:8501/v1/models/tip_model.

BigQuery ML References

See also

Links not directly related to Data Warehouse and BigQuery.