# Developer Guide

## S3 Write Path Validation

When using S3 storage as an intermediate layer, we generate an S3 bucket path for the intermediate data. The generated path is checked to be empty before any data is written.

The path layout:

userProvidedS3Bucket/
└── <UUID>-<SparkApplicationId>/
    └── <SparkQueryId>/

The generated intermediate write path `<UUID>-<SparkApplicationId>/<SparkQueryId>/` is validated to be empty before writing, and it is cleaned up after the write query finishes.
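
A minimal sketch of this path generation and emptiness check, assuming the AWS SDK for Java v2; the class and method names here (`WritePathValidator`, `intermediatePath`, `isEmpty`) are illustrative, not the connector's actual API:

```java
import java.util.UUID;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;

public final class WritePathValidator {

    // Builds the intermediate path: <UUID>-<SparkApplicationId>/<SparkQueryId>/
    static String intermediatePath(final String appId, final String queryId) {
        return UUID.randomUUID() + "-" + appId + "/" + queryId + "/";
    }

    // Returns true if no object exists under the given prefix in the bucket.
    static boolean isEmpty(final S3Client s3, final String bucketName, final String prefix) {
        final ListObjectsV2Response response = s3.listObjectsV2(
                ListObjectsV2Request.builder()
                        .bucket(bucketName)
                        .prefix(prefix)
                        .maxKeys(1) // a single key is enough to prove the path is not empty
                        .build());
        return response.contents().isEmpty();
    }
}
```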

## S3 Staging Commit Process

The Spark job that writes data to Exasol uses an AWS S3 bucket as intermediate storage. In this process, the ExasolS3Table API implementation uses Spark's CSVTable writer to create files in S3.

The write process proceeds as follows (a sketch of this commit chain follows the list):

  1. We ask Spark's CSVTable to commit the data into the S3 bucket
  2. We import this data into the Exasol database using Exasol's CSV loader
  3. Finally, we ask our ExasolS3Table API implementation to commit the write process
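
A simplified sketch of this commit chain, assuming Spark's DataSource V2 `BatchWrite` interface; the wrapped CSV `BatchWrite`, the JDBC connection, and the `IMPORT` statement shown here are illustrative stand-ins, not the connector's actual classes:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

// Wraps the CSV BatchWrite that produces files in S3 and adds the Exasol import step.
public final class ExasolS3BatchWrite implements BatchWrite {

    private final BatchWrite csvDelegate;       // Spark's CSV writer committing files to S3
    private final Connection exasolConnection;  // JDBC connection to the Exasol database
    private final String importStatement;       // e.g. an IMPORT ... FROM CSV statement

    public ExasolS3BatchWrite(final BatchWrite csvDelegate, final Connection exasolConnection,
            final String importStatement) {
        this.csvDelegate = csvDelegate;
        this.exasolConnection = exasolConnection;
        this.importStatement = importStatement;
    }

    @Override
    public DataWriterFactory createBatchWriterFactory(final PhysicalWriteInfo info) {
        return csvDelegate.createBatchWriterFactory(info);
    }

    @Override
    public void commit(final WriterCommitMessage[] messages) {
        csvDelegate.commit(messages); // 1. commit the CSV files into the S3 bucket
        try (Statement statement = exasolConnection.createStatement()) {
            statement.execute(importStatement); // 2. import the staged CSV data into Exasol
        } catch (final SQLException exception) {
            throw new IllegalStateException("Failed to import staged CSV data into Exasol", exception);
        }
        // 3. returning normally completes the write on the connector side
    }

    @Override
    public void abort(final WriterCommitMessage[] messages) {
        csvDelegate.abort(messages); // cleans up any files the CSV writer already created
    }
}
```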

If any failure occurs, each step triggers its abort method and the S3 bucket locations are cleaned up. If the job finishes successfully, the Spark job end listener triggers the cleanup process.
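
For the success path, the cleanup can be hooked into Spark's listener mechanism. The sketch below uses Spark's `SparkListener` together with an assumed helper `S3Cleanup.deleteObjectsUnderPrefix`; the real connector's listener and cleanup code may be organized differently:

```java
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobEnd;

// Deletes the intermediate S3 data once the Spark job has finished.
public final class S3CleanupJobEndListener extends SparkListener {

    private final String bucketName;
    private final String intermediatePrefix; // "<UUID>-<SparkApplicationId>/<SparkQueryId>/"

    public S3CleanupJobEndListener(final String bucketName, final String intermediatePrefix) {
        this.bucketName = bucketName;
        this.intermediatePrefix = intermediatePrefix;
    }

    @Override
    public void onJobEnd(final SparkListenerJobEnd jobEnd) {
        // Hypothetical helper that lists and deletes all objects under the prefix.
        S3Cleanup.deleteObjectsUnderPrefix(bucketName, intermediatePrefix);
    }
}
```

Such a listener would be registered on the active `SparkContext`, for example via `sparkContext.addSparkListener(...)`.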

## S3 Maximum Number of Files

For write Spark jobs, we allow a maximum of 1000 CSV files to be written as intermediate data into the S3 bucket. The main reason for this is that the S3 SDK `listObjects` command returns up to 1000 objects from a bucket path per request.

Even though we could list more objects from the S3 bucket with multiple requests, we keep this threshold for now.
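
To illustrate what lifting the limit would involve, here is a sketch (again assuming the AWS SDK for Java v2) of listing across pages with continuation tokens; the connector currently sidesteps this by capping the number of intermediate files at 1000:

```java
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;
import software.amazon.awssdk.services.s3.model.S3Object;

public final class S3ObjectLister {

    // A single listObjectsV2 call returns at most 1000 keys; listing more requires
    // following continuation tokens across multiple requests.
    static List<S3Object> listAllObjects(final S3Client s3, final String bucket, final String prefix) {
        final List<S3Object> objects = new ArrayList<>();
        String continuationToken = null;
        do {
            ListObjectsV2Request.Builder request = ListObjectsV2Request.builder()
                    .bucket(bucket)
                    .prefix(prefix);
            if (continuationToken != null) {
                request = request.continuationToken(continuationToken);
            }
            final ListObjectsV2Response response = s3.listObjectsV2(request.build());
            objects.addAll(response.contents());
            continuationToken = response.nextContinuationToken(); // null when no more pages
        } while (continuationToken != null);
        return objects;
    }
}
```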

## Integration Tests

The integration tests are run using Docker and `exasol-testcontainers`.
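
A minimal sketch of such a test, assuming JUnit 5, the Testcontainers JUnit extension, and the `ExasolContainer` class from `exasol-testcontainers`; the test class and query are illustrative only:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.junit.jupiter.api.Test;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import com.exasol.containers.ExasolContainer;

@Testcontainers
class ExasolConnectionIT {

    // Starts an Exasol database in a Docker container, shared across the tests in this class.
    @Container
    private static final ExasolContainer<? extends ExasolContainer<?>> EXASOL = new ExasolContainer<>();

    @Test
    void connectsToExasol() throws Exception {
        try (Connection connection = EXASOL.createConnection("");
                Statement statement = connection.createStatement();
                ResultSet result = statement.executeQuery("SELECT 1")) {
            assertTrue(result.next());
        }
    }
}
```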