This repo provides code samples to re-partition AWS IoT SiteWise cold storage tier data and store it in a destination Amazon S3 bucket. Re-partitioning improves Athena query performance for query patterns that span multiple time series.
The samples download time series objects from the cold tier S3 bucket, merge them into daily partitions, and upload the result to a destination S3 bucket. After the initial run, only new data additions are re-partitioned. The following scenarios are supported:
| Change in Asset Modeling | Supported? |
|---|---|
| No change, all existing assets and properties | Y |
| New properties added to an existing asset | Y |
| New assets added | Y |
| New datastream mapped to an existing property | Y |
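To illustrate the re-partitioning idea, the sketch below maps a per-time-series cold tier object key onto a single daily partition. The destination object name `data.avro` and the exact key layout are assumptions for illustration, not the repo's actual naming; the source layout matches the `raw/startYear=.../startMonth=.../startDay=...` prefixes shown in the sample output later in this README.

```python
def daily_partition_key(source_key: str, dest_prefix: str = "consolidated/") -> str:
    """Map a cold tier object key to its daily destination partition.

    Assumes the cold tier layout raw/startYear=YYYY/startMonth=M/startDay=D/...;
    the destination object name is illustrative only.
    """
    parts = source_key.split("/")
    # Keep only the date partition components (startYear/startMonth/startDay).
    date_parts = [p for p in parts
                  if p.startswith(("startYear=", "startMonth=", "startDay="))]
    return dest_prefix + "/".join(date_parts) + "/data.avro"

key = daily_partition_key("raw/startYear=2022/startMonth=5/startDay=5/series-1/obj-1.avro")
# 'consolidated/startYear=2022/startMonth=5/startDay=5/data.avro'
```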
- Configure AWS credentials using either the config file or the shared credentials file. Ensure your region is configured in the config file.
- You need permissions to access Amazon S3, AWS IAM, and AWS Glue as shown below. Replace the placeholders with property values as explained in the Configure section.

  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "<ARN of s3.glue_assets.bucket_name>/*"
      },
      {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "<glue_role_arn>"
      },
      {
        "Effect": "Allow",
        "Action": [
          "glue:CreateJob",
          "glue:StartJob*",
          "glue:DeleteJob",
          "glue:GetJob*"
        ],
        "Resource": "*"
      }
    ]
  }
  ```

- Ensure Python 3 is installed on your system. You can verify by running `python3 --version` or `python --version` (on Windows).
- Clone this Git repository and install the required Python packages by running `pip3 install -r requirements.txt`.
- Ensure the `make` utility is installed on your system. You can verify by running `make --version`. You can manually download the source code from the GNU Make website or Make for Windows (for Windows OS).
Review and update the configuration in the `config.yml` and `Makefile` files.
| Property | Description | Default Value |
|---|---|---|
| `timeseries_type` | The type of time series to include, ASSOCIATED or DISASSOCIATED | ASSOCIATED |
| `s3.cold_tier.bucket_name` | Name of the S3 bucket configured in AWS IoT SiteWise cold tier settings | |
| `s3.cold_tier.data_prefix` | Root prefix of all date partitions for data objects | raw/ |
| `s3.repartitioned.bucket_name` | Name of the S3 bucket to store the re-partitioned data | |
| `s3.repartitioned.data_prefix` | Root prefix of all date partitions for data objects | consolidated/ |
| `s3.repartitioned.index_prefix` | Root prefix of all date partitions for index objects | index/ |
| `s3.glue_assets.bucket_name` | Name of the S3 bucket to store the assets required by Glue ETL jobs | |
| `s3.glue_assets.scripts_prefix` | Prefix of all script artifacts required by Glue ETL jobs | scripts/ |
| `profile` | Profile used for AWS credentials; change if using a non-default profile | default |
| `python_alias` | Alias for running Python commands; change to `python` for Windows OS | python3 |
| `job_name_prefix` | Prefix of the job name | sitewise-cold-tier-repartitioning |
| `glue_role_arn` | ARN of the IAM role associated with the job | |
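The dotted property names in the table resolve against the nested structure of `config.yml`. A minimal sketch of that lookup (the helper name and the inline config dict are illustrative; the real project parses `config.yml`, e.g. with a YAML library):

```python
def get_property(config: dict, dotted_path: str, default=None):
    """Resolve a dotted property name such as 's3.cold_tier.data_prefix'
    against a nested dict parsed from config.yml."""
    node = config
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# Hypothetical parsed config mirroring the table above.
config = {
    "timeseries_type": "ASSOCIATED",
    "s3": {"cold_tier": {"bucket_name": "my-cold-tier", "data_prefix": "raw/"}},
}
prefix = get_property(config, "s3.cold_tier.data_prefix", default="raw/")
```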
**Note:** It is recommended to use server-side encryption for S3 buckets to protect data at rest.
The IAM role mapped to `glue_role_arn` has the following requirements:
- The name of the role must start with the string `AWSGlueServiceRole`, e.g., `AWSGlueServiceRole-SiteWiseRepartitioning`.
- Ensure the role has the necessary permissions to access Amazon S3, AWS KMS, AWS IoT SiteWise, and Amazon CloudWatch.
- The `glue_role_policy.json` sample policy document provides the list of required permissions.
- If server-side encryption with SSE-S3 is used, remove the statement for KMS permissions.
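A quick sanity check for the role-name requirement can be sketched as below; the parsing assumes a standard IAM role ARN of the form `arn:aws:iam::<account-id>:role/<name>`, and the helper names are hypothetical:

```python
def role_name_from_arn(role_arn: str) -> str:
    """Extract the role name from an IAM role ARN like
    arn:aws:iam::123456789012:role/AWSGlueServiceRole-SiteWiseRepartitioning."""
    return role_arn.rsplit("/", 1)[-1]

def is_valid_glue_role(role_arn: str) -> bool:
    # This project requires the role name to start with AWSGlueServiceRole
    # (see the requirements list above).
    return role_name_from_arn(role_arn).startswith("AWSGlueServiceRole")
```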
Run `make build` to build and package the scripts required by AWS Glue ETL jobs. The following files will be uploaded to the S3 bucket configured at `s3.glue_assets.bucket_name`:
- `job_helpers.zip` - helper scripts that will be provided to AWS Glue as additional Python modules
- `job_script.py` - script that runs an ETL job
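The upload step could look roughly like the sketch below. The key layout simply joins `s3.glue_assets.scripts_prefix` with the file name, which matches the defaults in the table above but is otherwise an assumption:

```python
from pathlib import Path

def asset_key(scripts_prefix: str, filename: str) -> str:
    """Build the destination S3 key for a Glue asset,
    e.g. scripts/job_script.py for the default prefix."""
    return scripts_prefix.rstrip("/") + "/" + filename

def upload_assets(bucket: str, scripts_prefix: str,
                  files=("job_helpers.zip", "job_script.py")):
    import boto3  # imported lazily so the key helper is usable without AWS deps
    s3 = boto3.client("s3")
    for name in files:
        s3.upload_file(str(Path(name)), bucket, asset_key(scripts_prefix, name))
```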
Run `make execute` to create ETL jobs for processing data within a given date range.

| Option | Description |
|---|---|
| `from` (string) | Start date in '%Y-%m-%d' format |
| `to` (string) | End date in '%Y-%m-%d' format |
| `days_per_job` (integer) | Maximum number of days to be processed by a single job |

Example: `make execute from=2022-05-01 to=2022-05-15 days_per_job=5`
Several ETL jobs will be created to horizontally scale the work, depending on the parallelization defined by `days_per_job`. The table below provides a few examples:

| from | to | # of days | days_per_job | # of jobs |
|---|---|---|---|---|
| 2022-01-01 | 2022-01-31 | 31 | 1 | 31 |
| 2022-01-01 | 2022-01-31 | 31 | 5 | 7 |
| 2022-01-01 | 2022-12-31 | 365 | 10 | 37 |
| 2022-01-01 | 2022-12-31 | 365 | 30 | 13 |
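The job counts above are a ceiling division over the inclusive date range. A sketch of how the range might be split (the function name is illustrative, not the repo's actual helper):

```python
import math
from datetime import date, timedelta

def job_ranges(start: date, end: date, days_per_job: int):
    """Split an inclusive date range into chunks of at most days_per_job
    days, one chunk per Glue job."""
    total_days = (end - start).days + 1
    ranges = []
    for i in range(math.ceil(total_days / days_per_job)):
        chunk_start = start + timedelta(days=i * days_per_job)
        chunk_end = min(end, chunk_start + timedelta(days=days_per_job - 1))
        ranges.append((chunk_start, chunk_end))
    return ranges

jobs = job_ranges(date(2022, 1, 1), date(2022, 1, 31), 5)  # 7 ranges
```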
**Warning:** The following service quotas for AWS IoT SiteWise and AWS Glue may cause the jobs to fail. In this case, you may need to request AWS to increase the quota for your account.
- Request rate for ListTimeSeries
- Max concurrent job runs per account
- Max jobs per account

**Warning:** Exclude days that have only partial data for properties at a day level. Each day provided is expected to have data that is either all or nothing for each property.
Here is a sample output showing the jobs created:

```
Creating 3 Glue ETL jobs to process data between 2022-05-01 and 2022-05-15..
Created job sitewise-cold-tier-repartitioning-1686433173-2022-05-01-2022-05-05
Created job sitewise-cold-tier-repartitioning-1686433173-2022-05-06-2022-05-10
Created job sitewise-cold-tier-repartitioning-1686433173-2022-05-11-2022-05-15
```
You can track the status of jobs at AWS Glue Studio → View monitoring.

**Note:** Each job is created with the tag `source: sitewise-repartitioning` for tracking purposes. This tag will be used to clean up the jobs when needed.
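Job creation could be sketched as below. The name format is inferred from the sample output above (`<prefix>-<epoch>-<from>-<to>`), and the `create_job` arguments other than `Tags` are illustrative defaults, not necessarily what the repo's scripts use:

```python
import time

def job_name(prefix, start, end, epoch=None):
    """Build a job name like
    sitewise-cold-tier-repartitioning-1686433173-2022-05-01-2022-05-05."""
    epoch = int(time.time()) if epoch is None else epoch
    return f"{prefix}-{epoch}-{start}-{end}"

def create_repartitioning_job(name, role_arn, script_location):
    import boto3  # lazy import so the name helper works without AWS deps
    glue = boto3.client("glue")
    glue.create_job(
        Name=name,
        Role=role_arn,
        Command={"Name": "glueetl", "ScriptLocation": script_location,
                 "PythonVersion": "3"},
        GlueVersion="4.0",  # illustrative; the repo may pin a different version
        Tags={"source": "sitewise-repartitioning"},  # used later by make cleanup
    )
```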
Run `make cleanup` to delete the ended jobs. Only ended jobs with the tag `source: sitewise-repartitioning` will be deleted, so any jobs created outside the context of this project are not affected.
```
Removing all ended jobs..
REMOVED - Name: sitewise-cold-tier-repartitioning-1687563866-2022-02-10-2022-03-01
SKIPPED - Name: sitewise-cold-tier-repartitioning-1687563866-2022-03-02-2022-03-21
REMOVED - Name: sitewise-cold-tier-repartitioning-1687563866-2022-01-01-2022-01-20
SKIPPED - Name: sitewise-cold-tier-repartitioning-1687563866-2022-03-22-2022-04-10
REMOVED - Name: sitewise-cold-tier-repartitioning-1687563960-2022-04-06-2022-04-10
REMOVED - Name: sitewise-cold-tier-repartitioning-1687563866-2022-01-21-2022-02-09
Removed 5 ended jobs
Skipped 4 jobs that are not ended
```
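The cleanup decision can be sketched as a filter over job metadata. The dict shape below is hypothetical; the real script would query AWS Glue via boto3 (`get_jobs`, `get_job_runs`, `get_tags`):

```python
# Terminal ("ended") Glue job-run states; a job with any run still in a
# non-terminal state is skipped.
ENDED_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def is_ended(run_states):
    """True if every recorded run of the job has reached a terminal state."""
    return all(state in ENDED_STATES for state in run_states)

def removable_jobs(jobs):
    """Return names of jobs that carry the repartitioning tag and have ended.

    `jobs` is a simplified, hypothetical list of dicts like
    {"name": ..., "tags": {...}, "run_states": [...]}."""
    return [
        j["name"]
        for j in jobs
        if j.get("tags", {}).get("source") == "sitewise-repartitioning"
        and is_ended(j.get("run_states", []))
    ]
```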
Each job consists of three main stages, as outlined below. You can monitor and troubleshoot these stages using the logs at Amazon CloudWatch → Logs → Log groups → /aws-glue/jobs/output.
In this stage, AVRO data files for the given date range are downloaded from the IoT SiteWise cold tier S3 bucket. If the data for a given day has already been processed previously (tracked in `timeseries.txt`), the script skips downloading the data for that day.

`timeseries.txt` - a newline-delimited plain text file that stores the list of all time series IDs processed in previous runs.
If there is new data, the corresponding S3 objects are downloaded to day-wise local directories:

```
Reviewing --> year: 2022, month: 5, day: 5
Retrieving all keys with prefix: raw/startYear=2022/startMonth=5/startDay=5/
# of new timeseries detected: 1200
Found new data to process, starting to download
Downloading S3 objects..
** Download time: 43 secs **
```

If no new data is found, no further processing happens for the day:

```
Reviewing --> year: 2022, month: 5, day: 11
Retrieving all keys with prefix: raw/startYear=2022/startMonth=5/startDay=11/
# of timeseries previously processed: 1200
Skip, no new data
```
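The new-data check amounts to a set difference against `timeseries.txt`. A sketch, with the caveat that the real job derives the current IDs from the S3 keys listed under the day's prefix:

```python
def detect_new_timeseries(index_path, current_ids):
    """Compare a day's time series IDs against timeseries.txt
    (newline-delimited IDs from previous runs) and return the new ones."""
    try:
        with open(index_path) as f:
            processed = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        processed = set()  # first run for this day
    return set(current_ids) - processed
```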
Once the AVRO files are downloaded from the IoT SiteWise cold tier S3 bucket, they are merged into a single AVRO file per day.

| Before | After |
|---|---|
| Multiple `.AVRO` files per day | Single `.AVRO` file per day |
| `timeseries.txt` and optionally `previous_timeseries.txt` per day | Single `timeseries.txt` file per day |
Here is a sample output:

```
Started merging AVRO data files and index files for each day
2022-5-5: ** Merge Time: 247 secs **
2022-5-4: ** Merge Time: 246 secs **
```
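Conceptually the merge concatenates same-schema records from many files into one. The sketch below separates the pure record concatenation from the AVRO I/O; the use of `fastavro` is an assumption for illustration, since the repo's actual merge lives in the Glue job script and may differ:

```python
def merge_records(per_file_records):
    """Flatten per-file record lists into one list, preserving file order."""
    merged = []
    for records in per_file_records:
        merged.extend(records)
    return merged

def merge_avro_files(paths, out_path):
    """Concatenate the records of several same-schema AVRO files into one file."""
    from fastavro import reader, writer  # lazy import; fastavro is an assumption

    schema = None
    per_file = []
    for path in paths:
        with open(path, "rb") as f:
            avro = reader(f)
            schema = schema or avro.writer_schema  # take schema from first file
            per_file.append(list(avro))
    with open(out_path, "wb") as f:
        writer(f, schema, merge_records(per_file))
```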
The merged AVRO files are then uploaded to the destination S3 bucket configured at `s3.repartitioned.bucket_name`.

Here is a sample output:

```
Started uploading re-partitioned AVRO data files and index file for each day
2022-5-5: ** Upload Time: 0 secs **
2022-5-4: ** Upload Time: 0 secs **
```
Consider automating the workflow by leveraging Amazon S3 Event Notifications, Amazon Simple Queue Service, and AWS Lambda.
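One possible shape for that automation: S3 notifications for new cold tier objects are delivered to an SQS queue, which triggers a Lambda function. Everything below (handler name, event parsing, return value) is a hypothetical sketch, not part of this repo:

```python
import json

def object_keys_from_sqs_event(event):
    """Extract S3 object keys from an SQS-triggered Lambda event whose
    message bodies are S3 event notifications."""
    keys = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])  # the S3 notification payload
        for s3_record in body.get("Records", []):
            keys.append(s3_record["s3"]["object"]["key"])
    return keys

def handler(event, context):
    # Hypothetical entry point: kick off re-partitioning for the days
    # covered by the newly arrived cold tier objects.
    new_keys = object_keys_from_sqs_event(event)
    return {"new_object_count": len(new_keys)}
```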