The artifacts in this repository support the published blog: Optimized bulk loading in Amazon RDS for PostgreSQL. Refer to the blog for detailed instructions on setup and configuration for benchmarking.
To replicate the results presented in the blog you will need to complete the following steps:
- Create an EC2 instance to serve as a bastion host from which to execute psql commands.
- Create a multi-AZ Amazon RDS for PostgreSQL database instance with IAM authentication enabled.
- Follow this guide to create an IAM policy, an IAM role, and attach the role to Amazon RDS for PostgreSQL instance.
- Follow this guide to install the
aws_s3
extension, create an IAM role and policy that provides Amazon S3 permissions to execute the s3import function. - Setup a pgpass file with the proper credentials to allow the bash scripts to execute against the database instance.
- Create the
imdb
database,staging
schema, and imdb tables using the provided DDL scripts. - Download and stage the IMDb public dataset in an Amazon S3 bucket.
- Git clone this repo to the bastion host.
The folloiwng cli
commands and json
are provided in the cli
folder.
First create a database parameter group and set parameters that are optimized for bulk loading.
aws rds create-db-parameter-group \
--db-parameter-group-name rds-postgres14-bulkload \
--db-parameter-group-family postgres14 \
--description "Optimized database parameters for bulk loading into Amazon RDS for PostgreSQL"
Create json file rds-postgresql14-bulkload.json
with optimized parameters.
{
"DBParameterGroupName": "rds-postgresql14-bulkload",
"Parameters": [
{
"ParameterName": "maintenance_work_mem",
"ParameterValue": "1048576",
"ApplyMethod": "immediate"
},
{
"ParameterName": "max_wal_size",
"ParameterValue": "4096",
"ApplyMethod": "immediate"
},
{
"ParameterName": "checkpoint_timeout",
"ParameterValue": "1800",
"ApplyMethod": "immediate"
}
]
}
Modify the json file with parameter values to the optimized database parameter group.
aws rds modify-db-parameter-group \
--db-parameter-group-name rds-postgres14-bulkload \
--cli-input-json file://rds-postgresql14-bulkload.json
Apply database parameter group optimized for bulk loading
aws rds modify-db-instance \
--db-instance-identifier [DB_INSTANCE_IDENTIFIER] \
--db-parameter-group rds-postgres14-bulkload
Apply normal database parameter group
aws rds modify-db-instance \
--db-instance-identifier [DB_INSTANCE_IDENTIFIER] \
--db-parameter-group default.postgres14
The bash scripts provided execute psql commands against the Amazon RDS for PostgreSQL database. The script named copy_bulk_sequential.sh
executes each psql command after the previous one finishes. The script named copy_bulk_parallel.sh
executes all commands in parallel without waiting for the previous commands to complete. There are also a "pre" and "post" bulk loading scripts. The pre_bulkloading.sh
script disables autovacuum at the table level. The post_bulkloading.sh
script executes a checkpoint command, runs vacuum analyze on each of the target tables, then reenables autovacuum on each table.
Make scripts executable
cd psql
chmod +x pre_bulkloading.sh
chmod +x post_bulkloading.sh
chmod +x copy_bulk_sequential.sh
chmod +x copy_bulk_parallel.sh
Executing bulk load sequentially and report loading time
./pre_bulkloading.sh
time ./copy_bulk_sequential.sh
./post_bulkloading.sh
Executing bulk load in parallel and report loading time
./pre_bulkloading.sh
time ./copy_bulk_parallel.sh
./post_bulkloading.sh
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.