
[Spark] Add an integration test for DynamoDB Commit Coordinator #3158

Merged

Conversation

dhruvarya-db
Collaborator

@dhruvarya-db commented May 25, 2024

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Adds an integration test for the DynamoDB Commit Coordinator. Tests the following scenarios:

  1. Automated dynamodb table creation
  2. Concurrent reads and writes
  3. Table upgrade and downgrade

The first half of the test is heavily borrowed from dynamodb_logstore.py.

How was this patch tested?

The test runs successfully against real DynamoDB and S3.
Set the following environment variables (after setting up the credentials in ~/.aws/credentials):

export S3_BUCKET=<bucket_name>
export AWS_PROFILE=<profile_name>
export RUN_ID=<random_run_id>
export AWS_DEFAULT_REGION=<region_that_matches_configured_ddb_region>

Ran the test:

./run-integration-tests.py --use-local --run-dynamodb-commit-coordinator-integration-tests \
    --dbb-conf io.delta.storage.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider \
               spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider \
    --dbb-packages org.apache.hadoop:hadoop-aws:3.4.0,com.amazonaws:aws-java-sdk-bundle:1.12.262
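
For reference, the test script presumably picks this configuration up from the environment roughly along the lines of the following sketch (the variable names match the exports above, but the table path layout here is an assumption, not taken from the script; AWS_PROFILE is consumed by the AWS SDK credentials provider rather than by the script itself):

import os

# Test configuration from the environment (names match the exports above).
s3_bucket = os.environ["S3_BUCKET"]              # bucket hosting the test Delta table
run_id = os.environ["RUN_ID"]                    # random suffix that isolates this run
aws_region = os.environ["AWS_DEFAULT_REGION"]    # must match the configured DynamoDB region

# Illustrative table path only; the script's actual naming scheme may differ.
delta_table_path = f"s3a://{s3_bucket}/dynamodb-commit-coordinator-test-{run_id}"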

Does this PR introduce any user-facing changes?

@dhruvarya-db force-pushed the dynamodb-commitowner-integration-test branch from 5c1a71f to 5aa0826 on May 25, 2024 00:29
@junlee-db (Contributor) left a comment


Logic looks good; a few mechanical comments.

python_root_dir = path.join(root_dir, "python")
extra_class_path = path.join(python_root_dir, path.join("delta", "testing"))
packages = "io.delta:delta-%s_2.12:%s" % (get_artifact_name(version), version)
packages += "," + "org.apache.hadoop:hadoop-aws:3.3.4"
Contributor

Please help me understand: why do we need this while the test above doesn't?

In the logstore integration test we use 3.3.1; how is this version decided? https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/integration_tests/dynamodb_logstore.py#L64

@dhruvarya-db (Collaborator, Author) May 30, 2024

That is a good question. I think that, just like in dynamodb_logstore.py, we should let the user specify the version of this library.

Hi @scottsand-db, this integration test is very similar to https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/integration_tests/dynamodb_logstore.py that you worked on. One difference is that dynamodb-commitowner is built as part of delta-spark. However, I couldn't get the package to work correctly without adding org.apache.hadoop:hadoop-aws:3.3.4 as an extra dependency in this script. Should we add org.apache.hadoop:hadoop-aws:3.3.4 as a direct dependency of io.delta:delta-* in build.sbt?
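
One way to keep the version configurable rather than pinned (a sketch only, not what this PR does; the HADOOP_AWS_VERSION environment variable and the delta_packages helper are hypothetical):

import os


def delta_packages(version, artifact_name, hadoop_aws_version=None):
    """Build the --packages string; the hadoop-aws version falls back to an env override or 3.3.4.

    In run-integration-tests.py the artifact name would come from get_artifact_name(version).
    """
    hadoop_aws_version = hadoop_aws_version or os.environ.get("HADOOP_AWS_VERSION", "3.3.4")
    packages = "io.delta:delta-%s_2.12:%s" % (artifact_name, version)
    packages += ",org.apache.hadoop:hadoop-aws:%s" % hadoop_aws_version
    return packages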

Comment on lines 99 to 135
spark = SparkSession \
    .builder \
    .appName("utilities") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config(f"spark.databricks.delta.properties.defaults.{commit_owner_property_key}{property_key_suffix}", "dynamodb") \
    .config(f"spark.databricks.delta.properties.defaults.managedCommits.commitOwnerConf{property_key_suffix}", dynamodb_commit_owner_conf) \
    .config(f"spark.databricks.delta.managedCommits.commitOwner.ddb.awsCredentialsProviderName", "com.amazonaws.auth.profile.ProfileCredentialsProvider") \
    .getOrCreate()

# spark.sparkContext.setLogLevel("INFO")

print("Creating table at path ", delta_table_path)
spark.sql(f"CREATE table delta.`{delta_table_path}` (id int, a int) USING DELTA")  # commit 0


def write_tx(n):
    print("writing:", [n, n])
    spark.sql(f"INSERT INTO delta.`{delta_table_path}` VALUES ({n}, {n})")


stop_reading = threading.Event()


def read_data():
    while not stop_reading.is_set():
        print("Reading {:d} rows ...".format(
            spark.read.format("delta").load(delta_table_path).distinct().count())
        )
        time.sleep(1)


def start_read_thread():
    thread = threading.Thread(target=read_data)
    thread.start()
    return thread
Contributor

These seem to be fairly common utilities in the other test as well; maybe we should create a dynamo_test_util.py and let both tests share it?
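
For illustration, such a shared module might look roughly like this (a sketch only; the module name dynamo_test_util.py and the helper signatures are hypothetical, not part of this PR):

# dynamo_test_util.py (hypothetical shared helpers for the two DynamoDB integration tests)
import threading
import time

from pyspark.sql import SparkSession


def create_spark_session(extra_confs=None):
    """Build a local SparkSession with the Delta extensions plus test-specific confs."""
    builder = SparkSession.builder \
        .appName("dynamodb-integration-test") \
        .master("local[*]") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    for key, value in (extra_confs or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def start_read_thread(spark, delta_table_path, stop_reading):
    """Poll the table's distinct row count once per second until stop_reading is set."""
    def read_data():
        while not stop_reading.is_set():
            count = spark.read.format("delta").load(delta_table_path).distinct().count()
            print("Reading {:d} rows ...".format(count))
            time.sleep(1)

    thread = threading.Thread(target=read_data)
    thread.start()
    return thread

Each test would then pass only its own spark.databricks.delta.* defaults via extra_confs.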

.config(f"spark.databricks.delta.managedCommits.commitOwner.ddb.awsCredentialsProviderName", "com.amazonaws.auth.profile.ProfileCredentialsProvider") \
.getOrCreate()

# spark.sparkContext.setLogLevel("INFO")
Contributor

redundant line?

    res = spark.sql(f"SELECT 1 FROM delta.`{delta_table_path}` WHERE id = {insert_value} AND a = {insert_value}").collect()
    assert(len(res) == 1)

def check_for_delta_file_existence(version, is_backfilled, should_exist):
Contributor

Maybe call it check_for_delta_file, since existence is decided by the should_exist param.

Collaborator Author

Renamed to check_for_delta_file_in_filesystem to make it explicit that we are querying the filesystem, not DynamoDB.
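
For context, a minimal sketch of what the renamed helper could look like (assumptions: a local-filesystem path and a _delta_log/_commits/ layout for unbackfilled commits; the real test runs against S3, where it would go through the Hadoop FileSystem API instead of glob):

import glob
import os


def check_for_delta_file_in_filesystem(table_path, version, is_backfilled, should_exist):
    """Assert whether a commit file for `version` is visible on the filesystem."""
    log_dir = os.path.join(table_path, "_delta_log")
    if is_backfilled:
        # Backfilled commits sit directly in _delta_log/ as zero-padded <version>.json files.
        pattern = os.path.join(log_dir, "%020d.json" % version)
    else:
        # Unbackfilled commits are assumed to carry a UUID under _delta_log/_commits/.
        pattern = os.path.join(log_dir, "_commits", "%020d.*.json" % version)
    matches = glob.glob(pattern)
    assert (len(matches) > 0) == should_exist, \
        "Expected should_exist=%s for %s, found %d match(es)" % (should_exist, pattern, len(matches))

A call such as check_for_delta_file_in_filesystem(delta_table_path, delta_table_version, is_backfilled=False, should_exist=True) after each insert is the kind of verification requested in the comments below.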

delta_table_version += 1

perform_insert_and_validate(9991)
delta_table_version += 1
Contributor

Run another check_for_delta_file_existence... check here?

# Upgrade to managed commits should work
print("===================== Evaluating upgrade to managed commits =====================")
spark.sql(f"ALTER TABLE delta.`{delta_table_path}` SET TBLPROPERTIES ('delta.{commit_owner_property_key}{property_key_suffix}' = 'dynamodb')")
delta_table_version += 1
Contributor

verify here too?

@dhruvarya-db force-pushed the dynamodb-commitowner-integration-test branch from d0d51c6 to af88158 on May 31, 2024 18:50
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config(f"spark.databricks.delta.properties.defaults.{commit_owner_property_key}{property_key_suffix}", "dynamodb") \
Collaborator

Could we also run the same test on top of a new table which is not a managed-commit table at the time of creation?
So: CREATE -> INSERT+SELECT -> UPGRADE -> INSERT+SELECT -> DOWNGRADE -> INSERT+SELECT -> UPGRADE.
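
A rough sketch of that sequence, reusing names from the snippets above (spark, delta_table_path, commit_owner_property_key, property_key_suffix, perform_insert_and_validate); the downgrade via UNSET TBLPROPERTIES and the assumption that no session-level default pre-enables the coordinator are illustrative only:

prop = f"delta.{commit_owner_property_key}{property_key_suffix}"

# CREATE a table that does not use managed commits at creation time, then INSERT+SELECT.
spark.sql(f"CREATE TABLE delta.`{delta_table_path}` (id int, a int) USING DELTA")
perform_insert_and_validate(1)

# UPGRADE to the DynamoDB commit coordinator, then INSERT+SELECT.
spark.sql(f"ALTER TABLE delta.`{delta_table_path}` SET TBLPROPERTIES ('{prop}' = 'dynamodb')")
perform_insert_and_validate(2)

# DOWNGRADE back to filesystem-based commits (sketched here as unsetting the property), then INSERT+SELECT.
spark.sql(f"ALTER TABLE delta.`{delta_table_path}` UNSET TBLPROPERTIES ('{prop}')")
perform_insert_and_validate(3)

# UPGRADE once more.
spark.sql(f"ALTER TABLE delta.`{delta_table_path}` SET TBLPROPERTIES ('{prop}' = 'dynamodb')")
perform_insert_and_validate(4)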

Collaborator Author

Done

@dhruvarya-db force-pushed the dynamodb-commitowner-integration-test branch from b652dc7 to 6631ffe on June 12, 2024 01:53
@dhruvarya-db changed the title from "[Spark] Add an integration test for DynamoDB Commit Owner" to "[Spark] Add an integration test for DynamoDB Commit Coordinator" on Jun 12, 2024
@tdas merged commit 9be04ba into delta-io:master on Aug 5, 2024
9 of 10 checks passed