
HADOOP-15563 S3Guard to create on-demand DDB tables #879

Closed

Conversation

steveloughran
Contributor

HADOOP-15563 Full S3Guard Support for on-demand DDB tables

  • If you create a table with read capacity == write capacity == 0, you get an on-demand table.
  • Change the defaults in Constants and core-default.xml to zero, so you get an on-demand table by default.
  • S3Guard docs reworked to cover the topic, including the updated defaults and why you should move to on-demand; also mentioned in the troubleshooting section.

It'd be nice if set-capacity could switch a table to on-demand, but the latest AWS SDK doesn't let you do that, even though the low level REST API does.
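The zero-capacity convention can be sketched as a tiny shell function. This is a hypothetical illustration, not code from the patch; the mode names are DynamoDB's real billing modes (`PAY_PER_REQUEST` and `PROVISIONED`):

```shell
# Hypothetical sketch of the convention this patch adopts: read == 0 and
# write == 0 selects an on-demand (PAY_PER_REQUEST) table; any nonzero
# capacity keeps classic PROVISIONED billing.
billing_mode() {
  local read=$1 write=$2
  if [ "$read" -eq 0 ] && [ "$write" -eq 0 ]; then
    echo "PAY_PER_REQUEST"
  else
    echo "PROVISIONED"
  fi
}

billing_mode 0 0    # prints PAY_PER_REQUEST
billing_mode 10 5   # prints PROVISIONED
```

With the new defaults of zero for both capacities, the on-demand branch is what a fresh install gets.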

ITestS3GuardToolDynamoDB:

  • The lifecycle test creates an on-demand table and skips changing its capacity.
  • Fix HADOOP-16187 ITestS3GuardToolDynamoDB test failures: the test wasn't clearing enough per-bucket options.

ITestDynamoDBMetadataStore: reviewed every test case so that

  • Capacity is always set to on-demand before table creation
  • Every test always cleans up its tables, even on assert failures.

Those tests have tended to leak tables in the past; these changes should eliminate leaks on uninterrupted test runs, and with on-demand capacity you don't get billed much until the next time you audit and delete tables.

Also:

  • Removed the @Unstable attribute from those Constants which I consider to have been shipping too long to be changed.
  • Moved the test-only S3GUARD_DDB_TEST_TABLE_NAME_KEY key to S3ATestConstants. It was tagged Unstable, and should be in the right file.

Change-Id: Ia44559354666db8027e574eab97983167ed930bf

Note: this patch includes HADOOP-16117 as a precursor; that's a separate PR (#818); it's needed to be able to create on-demand tables.

@steveloughran
Contributor Author

Test: S3 Ireland w/ s3guard + auth

<description>
Provisioned throughput requirements for write operations in terms of
capacity units for the DynamoDB table. Refer to related config
fs.s3a.s3guard.ddb.table.capacity.read before usage.
capacity units for the DynamoDB table.


whitespace:end of line

Creates a table "ireland-team" with tags "first" and "second".
Creates a table "ireland-team" with tags "first" and "second". The read and
write capacity will be those of the site configuration's values of
`fs.s3a.s3guard.ddb.table.capacity.read` and `fs.s3a.s3guard.ddb.table.capacity.write`;


whitespace:end of line

when creating or updating a bucket.
```bash
hadoop s3guard init -meta dynamodb://ireland-team -write 0 -read 0 s3a://ireland-1
```


whitespace:end of line

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 33 Docker mode activated.
_ Prechecks _
+1 dupname 0 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 5 new or modified test files.
_ trunk Compile Tests _
0 mvndep 152 Maven dependency ordering for branch
+1 mvninstall 1281 trunk passed
+1 compile 1027 trunk passed
+1 checkstyle 134 trunk passed
+1 mvnsite 879 trunk passed
+1 shadedclient 1648 branch has no errors when building and testing our client artifacts.
+1 javadoc 341 trunk passed
0 spotbugs 1733 Used deprecated FindBugs config; considering switching to SpotBugs.
0 findbugs 18 branch/hadoop-project no findbugs output file (findbugsXml.xml)
-1 findbugs 1727 root in trunk has 6 extant findbugs warnings.
_ Patch Compile Tests _
0 mvndep 31 Maven dependency ordering for patch
+1 mvninstall 1119 the patch passed
+1 compile 976 the patch passed
+1 javac 976 the patch passed
-0 checkstyle 143 root: The patch generated 7 new + 29 unchanged - 4 fixed = 36 total (was 33)
+1 mvnsite 916 the patch passed
-1 whitespace 0 The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 630 patch has no errors when building and testing our client artifacts.
+1 javadoc 350 the patch passed
0 findbugs 15 hadoop-project has no data from findbugs
_ Other Tests _
-1 unit 945 root in the patch failed.
+1 asflicense 50 The patch does not generate ASF License warnings.
13746
Reason Tests
Failed junit tests hadoop.util.TestDiskCheckerWithDiskIo
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-879/1/artifact/out/Dockerfile
GITHUB PR #879
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle
uname Linux 427b5f3dbd2a 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 1ae062c
Default Java 1.8.0_212
findbugs https://builds.apache.org/job/hadoop-multibranch/job/PR-879/1/artifact/out/branch-findbugs-root-warnings.html
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-879/1/artifact/out/diff-checkstyle-root.txt
whitespace https://builds.apache.org/job/hadoop-multibranch/job/PR-879/1/artifact/out/whitespace-eol.txt
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-879/1/artifact/out/patch-unit-root.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-879/1/testReport/
Max. process+thread count 1420 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws . U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-879/1/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.


@bgaborg bgaborg left a comment


I think removing @InterfaceStability.Unstable and moving the constant should be done under a separate issue - but I also think those are good ideas. The reason is that somebody may want that change without adding support for on-demand DDB tables. These are not connected, so they should be separated.

public static final String S3GUARD_DDB_TABLE_TAG =
"fs.s3a.s3guard.ddb.table.tag.";

/**

@bgaborg bgaborg Jun 3, 2019


+1 for moving this to S3ATestConstants. I've put this here; next time I will use the TestConstants instead for test-only constants.

Contributor Author


+1. I'm thinking we also need to split internal from public

@@ -434,7 +434,8 @@ public abstract int run(String[] args, PrintStream out) throws Exception,
"\n" +
" URLs for Amazon DynamoDB are of the form dynamodb://TABLE_NAME.\n" +
" Specifying both the -" + REGION_FLAG + " option and an S3A path\n" +
" is not supported.";
" is not supported.\n"
+ "To create a table with per-request billing, set the read and write capaciies to 0";


typo: capaciies -> capacities

Contributor Author


fixed; also splitting the line in both the IDE and in the printed text

@@ -82,7 +79,7 @@ public String toString() {
}

/**
* Is the the capacity that of a pay-on-demand table?
* Is the the capacity that of a pay-per-request table?


I don't understand this change. What's the diff between on-demand and per-request?

Contributor Author


The AWS docs say "on demand" or "pay per request"; I was trying to be consistent but gave up. This is a leftover from an attempt (since reverted) to actually change the method name. Changing to "on demand" (and not pay-on-demand).

@@ -274,7 +300,9 @@ public void testInitialize() throws IOException {
getTestTableName("testInitialize");
final Configuration conf = s3afs.getConf();
conf.set(S3GUARD_DDB_TABLE_NAME_KEY, tableName);
try (DynamoDBMetadataStore ddbms = new DynamoDBMetadataStore()) {
enableOnDemand(conf);

@bgaborg bgaborg Jun 3, 2019


we could add this to @Before, right?

Contributor Author


it's done in the beforeClass for the static table; for the others it's done to the config which comes from the FS.
I'm going to change it to set it in the FS config before we create the FS. If someone has per-bucket options to set the capacity then things will be different, but I'd be surprised if anyone ever does that.

@@ -274,7 +300,9 @@ public void testInitialize() throws IOException {
getTestTableName("testInitialize");
final Configuration conf = s3afs.getConf();
conf.set(S3GUARD_DDB_TABLE_NAME_KEY, tableName);
try (DynamoDBMetadataStore ddbms = new DynamoDBMetadataStore()) {
enableOnDemand(conf);


You add this in @BeforeClass as well. Is there a point in adding it to all tests too? Maybe add it to @Before only?

Contributor Author


the beforeClass is for the static ddbms; the others were for the new ones

Contributor Author


I'll be doing it in the FS configuration and the one for the static metastore; they are different configurations.

@steveloughran steveloughran changed the title HADOOP-15563 Full S3Guard Support for on-demand DDB tables HADOOP-15563 S3Guard to create on-demand DDB tables Jun 5, 2019
@steveloughran
Contributor Author

BTW, HADOOP-15183 will fix the ITestDynamoDBMetadataStoreScale tests, moving them from verifying that DDB throttling is recovered from to being tests of the scalability of how our client uses DDB. That will remove the current assume(!ondemand) change of HADOOP-16118. Not relevant for this patch, except to explain why I'm not doing anything with that test here.

@steveloughran
Contributor Author

Update: we do need the per-test setting of the capacity, else the created tables end up with provisioned capacity numbers. Assumption: FS caching may complicate things: setting the capacity on the new FS doesn't get picked up if another FS is used.
Factoring out the getFS().getConf() + patch routine, and also calling clone() so that each test's changes are isolated from the others. Without that (as we have today) one test could interfere with its successor.

Change-Id: Ia44559354666db8027e574eab97983167ed930bf
(cherry picked from commit d695ba4e9935928c211e4fd52d635c3f03f8c5d7)
Change-Id: I62422be23b8a30d2222e3c4d6354cb20d765260c
(cherry picked from commit 9904c9a8a885a4c4aa715058b0a1294105b6b0d9)
DynamoDBMetadataStore provisionTable() has the r=0 w=0 capacity check pushed down, as it is visible for testing.

Change-Id: Ide149515f8b212381087e84b842c5582c5cdab02
(cherry picked from commit 476d9c72c0f04de37d85893b5efdba24f9d232cf)
@steveloughran steveloughran force-pushed the s3/HADOOP-15563-full-on-demand branch from 476d9c7 to a17aaef Compare June 6, 2019 11:43
@steveloughran
Contributor Author

Latest PR is on trunk and the updated SDK; removed the SDK part of this patch

Tested: S3 Ireland with -Dparallel-tests -DtestsThreadCount=8 -Ds3guard -Ddynamo

No tests failed. Not even a little one :)

expect to need for your cluster. Setting higher values will cost you more
money. *Note* that these settings only affect table creation when
Before AWS offered pay-per-request billing, the sole billing mechanism,
was "provisioned capacity". This mechanism requires you to choose


whitespace:end of line


You do, however, pay more per DynamoDB operation.
Even so, the ability to cope with sudden bursts of read or write requests, combined
with the elimination of charges for idle tables, suit the use patterns made of


whitespace:end of line

@@ -512,7 +546,13 @@ hadoop s3guard init -meta URI ( -region REGION | s3a://BUCKET )
Creates and initializes an empty metadata store.

A DynamoDB metadata store can be initialized with additional parameters
pertaining to [Provisioned Throughput](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ProvisionedThroughput.html):
pertaining to capacity.


whitespace:end of line

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 47 Docker mode activated.
_ Prechecks _
+1 dupname 0 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 6 new or modified test files.
_ trunk Compile Tests _
0 mvndep 69 Maven dependency ordering for branch
+1 mvninstall 1148 trunk passed
+1 compile 1042 trunk passed
+1 checkstyle 145 trunk passed
+1 mvnsite 121 trunk passed
+1 shadedclient 1046 branch has no errors when building and testing our client artifacts.
+1 javadoc 93 trunk passed
0 spotbugs 67 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 191 trunk passed
_ Patch Compile Tests _
0 mvndep 23 Maven dependency ordering for patch
+1 mvninstall 80 the patch passed
+1 compile 1008 the patch passed
+1 javac 1008 the patch passed
-0 checkstyle 150 root: The patch generated 6 new + 31 unchanged - 5 fixed = 37 total (was 36)
+1 mvnsite 124 the patch passed
-1 whitespace 0 The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 xml 1 The patch has no ill-formed XML file.
+1 shadedclient 736 patch has no errors when building and testing our client artifacts.
+1 javadoc 94 the patch passed
+1 findbugs 206 the patch passed
_ Other Tests _
+1 unit 563 hadoop-common in the patch passed.
+1 unit 292 hadoop-aws in the patch passed.
+1 asflicense 46 The patch does not generate ASF License warnings.
7230
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-879/4/artifact/out/Dockerfile
GITHUB PR #879
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle
uname Linux 4d76f68c8ed7 4.4.0-144-generic #170~14.04.1-Ubuntu SMP Mon Mar 18 15:02:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / f7c77b3
Default Java 1.8.0_212
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-879/4/artifact/out/diff-checkstyle-root.txt
whitespace https://builds.apache.org/job/hadoop-multibranch/job/PR-879/4/artifact/out/whitespace-eol.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-879/4/testReport/
Max. process+thread count 1714 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-879/4/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

Change-Id: I6a7a104b6fe2bb041109c6055d2245636ca42067

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 31 Docker mode activated.
_ Prechecks _
+1 dupname 0 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 6 new or modified test files.
_ trunk Compile Tests _
0 mvndep 41 Maven dependency ordering for branch
+1 mvninstall 1114 trunk passed
+1 compile 1121 trunk passed
+1 checkstyle 145 trunk passed
+1 mvnsite 122 trunk passed
+1 shadedclient 1045 branch has no errors when building and testing our client artifacts.
+1 javadoc 95 trunk passed
0 spotbugs 67 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 194 trunk passed
_ Patch Compile Tests _
0 mvndep 23 Maven dependency ordering for patch
+1 mvninstall 82 the patch passed
+1 compile 1087 the patch passed
+1 javac 1087 the patch passed
+1 checkstyle 170 root: The patch generated 0 new + 31 unchanged - 5 fixed = 31 total (was 36)
+1 mvnsite 138 the patch passed
-1 whitespace 0 The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 xml 1 The patch has no ill-formed XML file.
+1 shadedclient 747 patch has no errors when building and testing our client artifacts.
+1 javadoc 105 the patch passed
+1 findbugs 244 the patch passed
_ Other Tests _
+1 unit 616 hadoop-common in the patch passed.
+1 unit 299 hadoop-aws in the patch passed.
+1 asflicense 50 The patch does not generate ASF License warnings.
7476
Subsystem Report/Notes
Docker Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-879/5/artifact/out/Dockerfile
GITHUB PR #879
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle
uname Linux 8f02a7b4c84e 4.4.0-143-generic #169~14.04.2-Ubuntu SMP Wed Feb 13 15:00:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 1f244b4
Default Java 1.8.0_212
whitespace https://builds.apache.org/job/hadoop-multibranch/job/PR-879/5/artifact/out/whitespace-eol.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-879/5/testReport/
Max. process+thread count 1322 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-879/5/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@mackrorysd
Contributor

+1. I can fix the whitespace as I commit.

@steveloughran
Contributor Author

Thanks, I can do the merge; I'll fix it when I do it on my desktop (that is, when I don't also break the build, pull in the wrong artifacts, etc, etc).

@steveloughran
Contributor Author

Applied to trunk. Thanks for the review. This is a nice feature. Pity we can't yet change the billing mode of an existing table: I've had to go through the console and manually switch every single DDB table over in our test account.
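For reference, the low-level API does support the switch even though the SDK used here didn't. A hedged sketch of what that looks like with the AWS CLI follows; the table name is illustrative, and the command is only printed rather than executed, since running it for real needs AWS credentials and an existing table:

```shell
# Sketch only: the AWS CLI (and REST API) can switch an existing table to
# on-demand billing, which the Java SDK used by this patch could not yet do.
# The table name is illustrative; the command is echoed, not run, because
# executing it requires AWS credentials and a real table.
switch_to_on_demand() {
  local table=$1
  echo "aws dynamodb update-table --table-name $table --billing-mode PAY_PER_REQUEST"
}

switch_to_on_demand ireland-team
```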

@steveloughran steveloughran deleted the s3/HADOOP-15563-full-on-demand branch October 15, 2021 19:45