[FLINK-24229][connectors/dynamodb] Added DynamoDB connector #18518
Conversation
As an alternative, we could let the user construct the client themselves and pass it via a provider:

interface DynamoDbClientProvider { DynamoDbAsyncClient getClient(); }

Then we would not have to create the wrapper util class, and the user would have full flexibility over which configuration options to set. The downside is that we leak DynamoDB SDK interfaces. In the end I took the same approach as we use in other places, but maybe we should consider this option. What do you think @CrynetLogistics?
I like your current approach more; like you said, we'd not leak any ddb sdk interfaces. Plus, I'm not sure there is a way to pass DynamoDbClientProvider to the DynamoDbSinkWriter without first making it a field in DynamoDbSink. If I'm not wrong, I believe the DynamoDbSink will complain that the DynamoDbAsyncClient in your DynamoDbClientProvider is not serialisable at runtime (it compiles ok). I think all fields of the sink must be serialisable... so even if we wanted to accept a customer's ready-made DynamoDbAsyncClient, I'm not sure there's a good way to do it...
I tested with a provider in the previous sink implementation: I passed the provider to the constructor and retrieved the client only when constructing the writer itself, and then the serialization issue does not occur at runtime.
But I will stick with the current approach for consistency and the other reasons you mentioned.
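A minimal sketch of the pattern described in the comment above, under stated assumptions: all class and method names here are illustrative and are not the actual PR code, and a String stands in for the SDK's DynamoDbAsyncClient. The point is that the sink stores only a serializable provider, and the non-serializable client is materialized lazily when the writer is built.

```java
import java.io.Serializable;

// Hypothetical provider interface; in the real SDK getClient() would
// return a DynamoDbAsyncClient, which is not serializable.
interface DynamoDbClientProvider extends Serializable {
    String getClient();
}

class DynamoDbSinkSketch implements Serializable {
    // Only the serializable provider is stored as a field, so the sink
    // itself can be serialized and shipped to task managers.
    private final DynamoDbClientProvider clientProvider;

    DynamoDbSinkSketch(DynamoDbClientProvider clientProvider) {
        this.clientProvider = clientProvider;
    }

    // The client is only created when the writer is constructed at
    // runtime, sidestepping the serialization failure discussed above.
    String createWriter() {
        return clientProvider.getClient();
    }
}
```

Because DynamoDbClientProvider is a functional interface, a user could supply it as a lambda, keeping full control over client configuration.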
Ah good to know. Thanks, agreed.
Thanks Yuri, really appreciate your contribution.
- I was wondering if you were planning to add a new page to the documentation in a separate Jira?
- FYI, I have a pending PR to make some minor modifications to the elementConverter that would allow sink implementers (like us!) to hide the elementConverter from the user completely.
    return new DynamoDbSinkBuilder<>();
}

@Experimental
I think @Internal might be more appropriate here, since we're not expecting to remove this method... flink-architecture-tests are cool with @Internal here too.
Thank you @CrynetLogistics for the review. I wasn't sure about the documentation that is published on the website. I'll read up on the process and add it in this PR if possible.
Can you please give me a reference to the changes for the elements converter?
Can you please give me a reference to the changes for the elements converter?
Sure, here is the link to the commit on the master branch.
I can see how it is useful for the KinesisSink, as we do not leak the Kinesis Put request to the user, but I think it would be more complicated to implement for the DynamoDB sink.
We would have to implement something that looks exactly like the DynamoDB WriteRequest with all its attributes, types, request types, etc. and translate it internally into the WriteRequest object (in the converter). I cannot see any other solution at the moment. Do you have any ideas?
It would mean a lot of functionality we would have to "repeat" that DynamoDB already provides, and it would limit the user if the DynamoDB client evolves in the future.
Do you think it is worth doing, instead of letting the user define the converter themselves? @CrynetLogistics
FYI added https://issues.apache.org/jira/browse/FLINK-25859 for documentation
@CrynetLogistics @dannycranmer apologies for the ping. We have prepared a sample PR in case hiding the ElementConverter is a requirement.
Here we don't let the user interact with the ElementConverter or any class from the AWS SDK.
        dynamoDbClientProperties);
}

@Experimental
Same here, @Internal might be better.
fixed
 *
 * <ul>
 *   <li>{@code maxBatchSize} will be 25
 *   <li>{@code maxInFlightRequests} will be 10
Shall we set the maxInFlightRequests to 50? Since that's the default for the AWS Async Client.
fixed
import java.util.function.Consumer;

/**
 * TODO.
A short description :-)
oh missed that :( thanks a lot! :)
package org.apache.flink.streaming.connectors.dynamodb.sink;

/** Exception is thrown when DynamoDb sink failed to write data. */
public class DynamoDbSinkException extends RuntimeException {
Missing compatibility annotation
This might need to be @PublicEvolving, at least it is here: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-aws-kinesis-firehose/src/main/java/org/apache/flink/connector/firehose/sink/KinesisFirehoseException.java#L26
thanks, fixed
 * Represents a single DynamoDb {@link WriteRequest}. Contains the name of the DynamoDb table
 * to write to as well as the {@link WriteRequest}.
 */
public class DynamoDbWriteRequest implements Serializable {
Missing compatibility annotation, might be worth having a look across the whole PR
added for all classes
}

public static PrimaryKey build(DynamoDbTablesConfig.TableConfig config, WriteRequest request) {
    if (config != null) {
Are we expecting a nullable parameter? Might be worth adding a @Nullable annotation.
added
        requestItems.get(config.getPartitionKeyName());
AttributeValue sortKeyAttributeValue = requestItems.get(config.getSortKeyName());

if (config.getPartitionKeyName() != null && partitionKeyAttributeValue == null) {
Can config.getPartitionKeyName() ever be null?
no, you are right, it is mandatory in the config class. Will remove it.
class TableRequestsContainer {

    private final DynamoDbTablesConfig tablesConfig;
    private final LinkedHashMap<String, Map<PrimaryKey, WriteRequest>> container;
nit: Do we need container to be a LinkedHashMap?
good point, thanks, it is fixed now
Does not look fixed in this PR
Yes, strange, it probably got lost in a previous rebase. It is fixed now.
 * @param <InputT> Type of the elements handled by this sink
 */
@PublicEvolving
public class DynamoDbSink<InputT> extends AsyncSinkBase<InputT, DynamoDbWriteRequest> {
Why didn't you name it DynamoDbAsyncSink? Shouldn't users be aware of this?
Thanks for the comment, makes sense. I was looking at other sinks in the project that already inherit from AsyncSinkBase, and they are not using the Async naming convention.
@CrynetLogistics @dannycranmer it makes sense to me; is that something we want to do moving forward?
Let's keep it as DynamoDbSink for consistency with the code base. I do not think the underlying implementation of sync vs async is necessarily a concern to the user, and therefore adding it to the class name does not add much value.
@dannycranmer So how would you expect a user to know if a connector is sync/async without looking at the underlying code? It should at least be added to the docs. Somewhere it needs to be specified.
@nirtsruya what other async connectors aren't specified with async in their names? Can you give some examples?
@almogtavor for example KinesisStreamsSink
@almogtavor I would not expect a user to care whether the connector is sync/async. This is an implementation detail of the underlying connector. What is the reason to expose async to the user in the class name?
I did not get a chance to go through all of the tests yet, but have added a bunch of comments.
<modelVersion>4.0.0</modelVersion>

<artifactId>flink-connector-dynamodb</artifactId>
Please rename in line with the other new aws connectors > flink-connector-aws-dynamodb
done
<name>Flink : Connectors : DynamoDB</name>
<properties>
    <aws.sdk.version>2.17.116</aws.sdk.version>
    <commons-lang3.version>3.11</commons-lang3.version>
Can you use the managed version from parent instead? https://github.com/apache/flink/blob/master/pom.xml#L478
thanks, fixed now
<properties>
    <aws.sdk.version>2.17.116</aws.sdk.version>
    <commons-lang3.version>3.11</commons-lang3.version>
    <testcontainers.version>1.16.2</testcontainers.version>
Can you use the managed version from parent instead? https://github.com/apache/flink/blob/master/pom.xml#L139
thanks, fixed now
<executions>
    <execution>
        <goals>
            <goal>test-jar</goal>
Why are we building a test jar? I cannot see any consumers of this
removed test jar configuration
TableRequestsContainer container = new TableRequestsContainer(tablesConfig);
requestEntries.forEach(container::put);
nit: Since you are deduplicating here, the actual batch size will be less than the configured batch size. I cannot see a better way to do it though
Yes, good point; this is the same way boto3 behaves. Maybe we should rename the batch size property to MAX batch size.
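To illustrate the deduplication behaviour discussed in this thread, here is a deliberately simplified sketch, not the PR's actual TableRequestsContainer: the real container keys WriteRequest objects by a PrimaryKey, while plain strings stand in for both here. Keying the batch by primary key means the last write for a key wins, so the submitted batch can end up smaller than the configured maximum batch size.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of per-key deduplication within one batch.
class DedupSketch {
    static Map<String, String> deduplicate(List<Map.Entry<String, String>> entries) {
        Map<String, String> byKey = new HashMap<>();
        for (Map.Entry<String, String> e : entries) {
            // A later write for the same primary key replaces the earlier one,
            // which is why the effective batch size can shrink.
            byKey.put(e.getKey(), e.getValue());
        }
        return byKey;
    }
}
```

This is also why "maxBatchSize" reads as an upper bound rather than a guaranteed batch size.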
@VisibleForTesting
static RetryPolicy getRetryPolicy(Properties properties) {
    if (hasRetryConfiguration(properties)) {
        RetryPolicy.Builder builder = RetryPolicy.builder();

        if (properties.containsKey(AWSDynamoDbConfigConstants.NUMBER_RETRIES)) {
            builder.numRetries(
                    Integer.parseInt(
                            properties.getProperty(AWSDynamoDbConfigConstants.NUMBER_RETRIES)));
        }

        if (properties.containsKey(AWSDynamoDbConfigConstants.BACKOFF_STRATEGY)) {
            builder.backoffStrategy(
                    getBackoffStrategy(properties, AWSDynamoDbConfigConstants.BACKOFF_STRATEGY));
        }

        if (properties.containsKey(AWSDynamoDbConfigConstants.THROTTLING_BACKOFF_STRATEGY)) {
            builder.throttlingBackoffStrategy(
                    getBackoffStrategy(
                            properties,
                            AWSDynamoDbConfigConstants.THROTTLING_BACKOFF_STRATEGY));
        }
        return builder.build();
    }
    return RetryPolicy.defaultRetryPolicy();
}
I believe these configs can be made generic and pulled into AWSGeneralUtil?
@VisibleForTesting
static BackoffStrategy getBackoffStrategy(Properties properties, String strategy) {
    AWSDynamoDbConfigConstants.BackoffStrategy backoffStrategy =
            AWSDynamoDbConfigConstants.BackoffStrategy.valueOf(
                    properties.getProperty(strategy));

    switch (backoffStrategy) {
        case FULL_JITTER:
            return FullJitterBackoffStrategy.builder()
                    .baseDelay(
                            getDuration(
                                    properties.getProperty(
                                            AWSDynamoDbConfigConstants
                                                    .FULL_JITTER_BASE_DELAY_MS)))
                    .maxBackoffTime(
                            getDuration(
                                    properties.getProperty(
                                            AWSDynamoDbConfigConstants
                                                    .FULL_JITTER_MAX_BACKOFF_TIME_MS)))
                    .build();
        case EQUAL_JITTER:
            return EqualJitterBackoffStrategy.builder()
                    .baseDelay(
                            getDuration(
                                    properties.getProperty(
                                            AWSDynamoDbConfigConstants
                                                    .EQUAL_JITTER_BASE_DELAY_MS)))
                    .maxBackoffTime(
                            getDuration(
                                    properties.getProperty(
                                            AWSDynamoDbConfigConstants
                                                    .EQUAL_JITTER_MAX_BACKOFF_TIME_MS)))
                    .build();
        case FIXED_DELAY:
            return FixedDelayBackoffStrategy.create(
                    getDuration(
                            properties.getProperty(
                                    AWSDynamoDbConfigConstants.FIXED_DELAY_BACKOFF_MS)));
        default:
            return BackoffStrategy.defaultStrategy();
    }
}
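The two helpers above follow a property-driven configuration pattern: a string property selects an enum constant and millisecond values are parsed into durations. Here is a stripped-down sketch of just that parsing pattern; the property keys and the behaviour of getDuration are assumptions for illustration, not the PR's actual AWSDynamoDbConfigConstants values.

```java
import java.time.Duration;
import java.util.Properties;

// Illustrative only: mirrors the enum-plus-duration parsing used by
// getRetryPolicy/getBackoffStrategy above, without the AWS SDK types.
class RetryConfigSketch {
    enum BackoffStrategy { FULL_JITTER, EQUAL_JITTER, FIXED_DELAY }

    // Assumed equivalent of the PR's getDuration: millisecond values as strings.
    static Duration durationMs(Properties props, String key) {
        return Duration.ofMillis(Long.parseLong(props.getProperty(key)));
    }

    // valueOf throws IllegalArgumentException for unknown strategy names,
    // which surfaces configuration typos early.
    static BackoffStrategy strategy(Properties props, String key) {
        return BackoffStrategy.valueOf(props.getProperty(key));
    }
}
```

One design consequence of this pattern is that adding a new strategy only requires a new enum constant and a new switch branch, not a new configuration mechanism.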
I believe these configs can be made generic and pulled into AWSGeneralUtil?
/** A collection of utility functions to simplify work with DynamoDB service exceptions. */
@Internal
public class DynamoDbExceptionUtils {
As discussed previously, we have implemented some error handling classes in AsyncSink to help with the common things, for example https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-aws-base/src/main/java/org/apache/flink/connector/aws/util/AWSCredentialFatalExceptionClassifiers.java
Force-pushed from 63f13be to f92d953.
…e new AsyncSinkBase
… TestMapper and DynamoDbElementConverter, do not retry items that are bigger that dynamodb maximum allowed record size
…anged the Scenario implementation, to create the test table with name of the variable testTableName from DynamoDbSinkITCase, but create the DynamoDbSink with table name passed by the tableName property.
…ded an example usage of the DynamoDb sink
deleted NOTICE file
…ecessary checks, add missing javadoc
…ge default value for maxInFlightRequests
… flink-connector-aws-dynamodb, similar to other modules
…versions for testcontainers and apache commons
Add support for DynamoDb write request. Currently WriteRequest is not serializable (although it implements the Serializable interface). The reason is that the AttributeValue class is not serializable, despite also implementing Serializable: its Set fields are initialized as a software.amazon.awssdk.core.util.DefaultSdkAutoConstructList, which is not Serializable. There is an open issue: aws/aws-sdk-java-v2#3143
…le requests container
Force-pushed from 38caf77 to c770f76.
…sync-writer-element-converter Flink 24229 dynamodb connector async writer element converter
Closing this pull request as it was moved to a separate flink-connector-dynamodb repository: apache/flink-connector-aws#1
What is the purpose of the change
User stories:
As a Flink user, I’d like to use DynamoDB as sink for my data pipeline.
Scope:
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation