- BigQuery is used as the example
Content:
- OLAP vs OLTP
- What is a data warehouse
- BigQuery
- Costs
- Partitions and Clustering
- Best Practices
- Internals
- ML in BQ
- OLTP: Online Transaction Processing
- OLAP: Online Analytical Processing
- A data warehouse is an OLAP solution
- It is used for reporting and data analysis
- In general consists of raw data, metadata and summary data
- In general has different data sources
- Can be converted into Data Marts, which can be directly accessed by users
- Serverless data warehouse
  - There are no servers to manage or database software to install
- Software as well as infrastructure is managed by Google, including
  - Scalability and high availability
- Built-in features like
  - Machine Learning
  - Geospatial Analysis
  - Business Intelligence
- BigQuery maximizes flexibility by separating the compute engine that analyzes your data from your storage
- BigQuery provides a lot of open public datasets
  - You can search for this data in the web UI
- We can use SQL to query the data and save the results as CSV
- On-demand pricing:
  - $5 per 1 TB of data processed
- Flat-rate pricing:
  - Based on the number of pre-requested slots
  - 100 slots -> $2,000/month, which equals 400 TB of data processed at on-demand pricing
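The flat-rate break-even point follows directly from those numbers; as a quick sketch (list prices as quoted above, which may change):

```sql
-- Flat rate: $2,000/month for 100 slots; on-demand: $5 per TB scanned.
-- Above this monthly scan volume, flat-rate pricing becomes cheaper:
SELECT 2000 / 5 AS break_even_tb_per_month;  -- 400
```

Below roughly 400 TB scanned per month, on-demand pricing is the cheaper option.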
- Partitioning can improve BigQuery performance
- E.g. partitioning on date
- If we want to read the data for a specific date, the other dates that are not in that partition will not be read
- Helps to reduce costs
- Partitioning on date using SQL:
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
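To see the pruning effect, the same query can be run against the external and the partitioned table; the date range below is illustrative:

```sql
-- Full scan (no partitions to prune):
SELECT DISTINCT VendorID
FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Only the ~30 matching daily partitions are read, so the estimated
-- bytes processed (shown in the UI before running) drop sharply:
SELECT DISTINCT VendorID
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';
```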
- In addition to partitioning, we can cluster by other columns
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
- When creating a partition in BigQuery, you can choose
  - A time-unit column
  - Ingestion time (_PARTITIONTIME)
  - Integer range partitioning
- When using time-unit or ingestion-time partitioning, the granularity can be
  - Daily (default)
  - Hourly
  - Monthly or yearly
- The number of partitions is limited to 4,000
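Integer range partitioning needs an explicit bucket specification via RANGE_BUCKET; a minimal sketch (project, dataset and table names are hypothetical):

```sql
-- Buckets of width 1 over VendorID values 0-9; rows outside the
-- range land in the __UNPARTITIONED__ partition.
CREATE OR REPLACE TABLE `my-project.mydataset.trips_by_vendor`
PARTITION BY RANGE_BUCKET(VendorID, GENERATE_ARRAY(0, 10, 1)) AS
SELECT * FROM `my-project.mydataset.trips`;
```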
- BigQuery Clustering
  - Columns you specify are used to colocate related data
  - The order of the specified columns is important, as it determines the sort order of the data
- Clustering improves
  - Filter queries
  - Aggregate queries
- Tables smaller than ~1 GB don't show significant improvement with partitioning and clustering
- You can specify up to four clustering columns
- Clustering columns must be top-level, non-repeated columns of one of these types:
- DATE
- BOOL
- GEOGRAPHY
- INT64
- NUMERIC
- BIGNUMERIC
- STRING
- TIMESTAMP
- DATETIME
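Several clustering columns can be combined, listed in order of importance; a sketch extending the earlier example (the target table name here is made up):

```sql
-- Within each partition, data is sorted by VendorID first,
-- then by payment_type; filters on VendorID benefit most.
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered2`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID, payment_type AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```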
- When to prefer clustering over partitioning
  - Partitioning results in a small amount of data per partition (approximately less than 1 GB)
  - Partitioning results in a large number of partitions, beyond the limits on partitioned tables
  - Partitioning results in your mutation operations modifying a majority of the partitions in the table frequently (for example, every few minutes)
- Automated reclustering
  - As data is added to a clustered table, the newly inserted data can be written to blocks that contain key ranges that overlap with the key ranges in previously written blocks
    - These overlapping keys weaken the sort property of the table
  - To maintain the performance characteristics of a clustered table, BigQuery performs automatic re-clustering in the background to restore the sort property of the table
  - For partitioned tables, clustering is maintained for data within the scope of each partition
- Cost reduction
  - Avoid SELECT *
  - Price your queries before running them
  - Use clustered or partitioned tables
  - Use streaming inserts with caution
  - Materialize query results in stages
- Query performance
  - Filter on partitioned columns
  - Denormalize data
  - Use nested or repeated columns
  - Use external data sources appropriately
    - Don't use them if you want high query performance
  - Reduce data before using a JOIN
  - Do not treat WITH clauses as prepared statements
  - Avoid oversharding tables
  - Avoid JavaScript user-defined functions
  - Use approximate aggregation functions (HyperLogLog++)
  - Apply ORDER BY last, in the outermost query, to maximize performance
  - Optimize your join patterns
    - As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size
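A sketch of that ordering (table names hypothetical); BigQuery can then broadcast the small table to each slot instead of shuffling the big one:

```sql
SELECT t.*, z.zone
FROM `my-project.mydataset.trips_huge` AS t    -- largest table first
JOIN `my-project.mydataset.zones_small` AS z   -- smallest second
  ON t.PULocationID = z.LocationID;
```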
- ML in BigQuery
  - Target audience: Data Analysts, managers
  - No need for Python or Java knowledge
  - No need to export data into a different system
- Usually, when training a ML model, you
  - export the data from a data warehouse
  - build a model, train & deploy it
- BigQuery allows you to build the model in the data warehouse itself
- Costs:
  - Free tier:
    - 10 GB per month of data storage
    - 1 TB per month of queries processed
    - ML CREATE MODEL step: the first 10 GB of data processed per month are free
  - After the free tier: $250 per TB of data processed for creating logistic regression, linear regression, k-means and time-series models; extra costs for other model types (US)
Steps in ML development:
- BigQuery helps in all these steps
- Choose columns of interest (data cleaning)
-- SELECT THE COLUMNS OF INTEREST
SELECT passenger_count, trip_distance, PULocationID, DOLocationID, payment_type, fare_amount, tolls_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned` WHERE fare_amount != 0;
- Feature Engineering
- Documentation: cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-auto-preprocessing
- Change type of some features
-- CREATE A ML TABLE WITH APPROPRIATE TYPE
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_ml` (
`passenger_count` INTEGER,
`trip_distance` FLOAT64,
`PULocationID` STRING,
`DOLocationID` STRING,
`payment_type` STRING,
`fare_amount` FLOAT64,
`tolls_amount` FLOAT64,
`tip_amount` FLOAT64
) AS (
SELECT passenger_count, trip_distance, cast(PULocationID AS STRING), CAST(DOLocationID AS STRING),
CAST(payment_type AS STRING), fare_amount, tolls_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned` WHERE fare_amount != 0
);
- Create model
-- CREATE MODEL WITH DEFAULT SETTING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`
OPTIONS
(model_type='linear_reg',
input_label_cols=['tip_amount'],
DATA_SPLIT_METHOD='AUTO_SPLIT') AS
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL;
- Have a look at the features
-- CHECK FEATURES
SELECT * FROM ML.FEATURE_INFO(MODEL `taxi-rides-ny.nytaxi.tip_model`);
- This shows some statistics of the features
- Evaluate the model
-- EVALUATE THE MODEL
SELECT
*
FROM
ML.EVALUATE(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
));
- Make predictions
-- PREDICT THE MODEL
SELECT
*
FROM
ML.PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
));
- ML.EXPLAIN_PREDICT additionally shows the top three features that contributed to each prediction
-- PREDICT AND EXPLAIN
SELECT
*
FROM
ML.EXPLAIN_PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
), STRUCT(3 as top_k_features));
- Hyperparameter tuning
  - Documentation for hyperparameters: cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm
-- HYPERPARAMETER TUNING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_hyperparam_model`
OPTIONS
(model_type='linear_reg',
input_label_cols=['tip_amount'],
DATA_SPLIT_METHOD='AUTO_SPLIT',
num_trials=5,
max_parallel_trials=2,
l1_reg=hparam_range(0, 20),
l2_reg=hparam_candidates([0, 0.1, 1, 10])) AS
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL;
- Export the model and check predictions by deploying it with a Docker image
- Steps:
- gcloud auth login
- Export the model to Google Cloud Storage
bq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model
- Make a directory
mkdir /tmp/model
- Copy the model into the new directory
gsutil cp -r gs://taxi_ml_model/tip_model /tmp/model
- Create a serving directory
mkdir -p serving_dir/tip_model/1
- Copy all the data into this directory
cp -r /tmp/model/tip_model/* serving_dir/tip_model/1
- Pull the TensorFlow Serving Docker image
docker pull tensorflow/serving
- Run the image
docker run -p 8501:8501 --mount type=bind,source=$(pwd)/serving_dir/tip_model,target=/models/tip_model -e MODEL_NAME=tip_model -t tensorflow/serving &
- Make an HTTP request (alternatively, use Postman for this)
curl -d '{"instances": [{"passenger_count":1, "trip_distance":12.2, "PULocationID":"193", "DOLocationID":"264", "payment_type":"2","fare_amount":20.4,"tolls_amount":0.0}]}' -X POST http://localhost:8501/v1/models/tip_model:predict
- Check the model status at http://localhost:8501/v1/models/tip_model