
[FLINK-24229][Connectors][DynamoDB] - Add AWS DynamoDB connector #1

Merged

Conversation

nirtsruya
Contributor

Move flink-connector-aws-connector from the flink repository

* {@link DynamoDbWriteRequest} that may be persisted.
*/
@Internal
public class DynamoDbSinkElementConverter<InputT>
Contributor

It looks like we have created wrappers around the AWS SDK DynamoDB model within the sink. Can you please explain your reasoning behind this approach? The idea behind hiding the ElementConverter is to 1/ hide the internal implementation and 2/ enable the sink to manage de/serialisation. However, I am struggling to see the benefits here since the two models are very similar. If we are going to copy the underlying model, it might be best to use that model directly, since users might already be creating these objects.

Did you consider using a more generic ElementConverter, like the DynamoDBMapper? I am not familiar with it, so I am not sure whether there are performance/feature limitations.

Contributor
@YuriGusev YuriGusev Sep 8, 2022

We had a discussion earlier on that matter here

If we go for hiding the converter we do not leak underlying dynamodb sdk classes to the user, but we have to create a wrapper around their model. This means if the dynamodb client changes over time we will have to propagate model changes too.

I think it may be better to allow the user to define the element converter and not do any translation ourselves (like we had it before this PR: YuriGusev/flink@de83346#diff-1c18ee5b3c5b4d827778ffdf4e5d2219dbcb02fc45be0d3665357633a134995c).

Contributor

If we revert the change, we must ensure that if the dynamo client classes change over time we update our de/serialiser. Otherwise we might get in the situation where user ElementConverter generates classes that are not supported by our serialiser. The impact would be that we lose/break attributes when restoring from checkpoints. We can mitigate this by calling it out in the docs/PR template as something to check when updating the SDK.

I think it may be better to allow the user to define the element converter and not do any translation ourselves (like we had it before this PR

Given we cannot escape these wrappers, I agree with this.


Agree that having a wrapper provides limited benefits in this case and that it makes more sense to expose the DynamoDB SDK classes directly.

To mitigate the risk of not updating the de/serializers when doing an SDK update, it would be great if we could find a way to programmatically verify at compile time that the SDK is still compatible with the de/serializers. But I'm not sure how much work that would require and whether it's worth the effort.

Contributor Author

Thanks @sthm, good comment.
If I understand correctly, we are concerned that the sink uses version x of the DynamoDB SDK while the user is using version y, and the versions are incompatible for some reason? I think this would be hard to detect at compile time, as it is somewhat a dependency management issue - unless we shade the AWS SDK and only allow passing these shaded objects as input?
Otherwise, do we think it makes sense to fail during sink startup if the SDK classes used are incompatible with the ones used in the sink due to serialization issues? Similar to how Flink fails if an object passed to the operators is not serializable.

Contributor

@hlteoh37 / @YuriGusev / @nirtsruya there has been discussion regarding support for Conditional Writes and non-batch support. I note that the batch API uses PutRequest and the non-batch API uses PutItemRequest. They both internally use the same Map<String, AttributeValue>, but PutItemRequest adds additional fields.

My concern is that we would not be able to toggle batch mode without implementing a new connector, since the generic RequestItem type would change. Therefore, maybe we need some middle ground here. We could use a POJO that contains a Map<String, AttributeValue> that we can transform into either PutRequest or PutItemRequest. Then we can evolve this element converter to transparently support either API.

To summarise, I am considering whether we actually need RequestEntryT to be something like:

interface DynamoDbWriteRequest<T> {

    Map<String, AttributeValue> getItems(T record);

    // And some other method to decide if it is put or delete
}

Then later we can add support for getConditionalExpression().
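For illustration, a minimal sketch of what such a request type could look like once it becomes a concrete class (as the thread below eventually settles on); the constructor and accessor names here are assumptions for the example, not the final connector API:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

// Sketch only: names are illustrative assumptions, not the final API.
enum DynamoDbWriteRequestType {
    PUT,
    DELETE
}

public class DynamoDbWriteRequest {

    private final DynamoDbWriteRequestType type;
    private final Map<String, AttributeValue> item;

    public DynamoDbWriteRequest(DynamoDbWriteRequestType type, Map<String, AttributeValue> item) {
        this.type = type;
        this.item = item;
    }

    public DynamoDbWriteRequestType getType() {
        return type;
    }

    public Map<String, AttributeValue> getItem() {
        return item;
    }

    // Could later grow e.g. a getConditionExpression() for non-batch mode.
}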

Contributor Author

@dannycranmer makes sense, I can go ahead with implementing that if you agree @YuriGusev.
Regarding @sthm's comment on serialization, do we want to shade the DynamoDB SDK? Is it something the community has been considering?

Contributor

I would rather avoid shading the SDK. It means that end users cannot update the SDK without building from source; this has caused issues before. For example, there is no way to add support for new AWS regions for older versions of the connector without building from source. On top of this, if you include multiple connectors, you end up with multiple versions of the SDK bloating the size of the jar.

We can think about a build-time check as a follow-up: https://issues.apache.org/jira/browse/FLINK-29688

@dannycranmer
Contributor

Apologies for the delay, I have left some comments. As commented in the main pom I will open a PR to contribute the general repo files so we can decouple from this contribution.

@dannycranmer
Contributor

Please rebase on main after #2 is merged.

@YuriGusev
Contributor

YuriGusev commented Sep 13, 2022

Thanks for the review @dannycranmer, I'm currently away from home, but I hope to be back and start working on the fixes next week

Contributor

@hlteoh37 hlteoh37 left a comment

Some general callouts on docs and tests

}
}

LOG.warn("DynamoDB Sink failed to persist {} entries", unprocessed.size());
Contributor

Nit: Might help to also mention that these entries will be retried (same below for handleFullyFailedRequest()).

.witItem("tableOne", ImmutableMap.of("pk", "2", "sk", "1"))
.witItem("tableTwo", ImmutableMap.of("pk", "1", "sk", "1"))
.withDescription(
"requests should not be deduplecated if belong to different tables")
Contributor

Suggested change
"requests should not be deduplecated if belong to different tables")
"requests should not be deduplicated if belong to different tables")

}

@Test
public void testPartitionKeysOfTwoDifferentRequestsEqual() {
Contributor

should we also test the case where config is null?

getAdvancedOption(configuration, SdkAdvancedClientOption.USER_AGENT_PREFIX))
.isEqualTo(AWSDynamoDbUtil.getFlinkUserAgentPrefix(properties));

Assertions.assertThat(configuration.retryPolicy().isPresent()).isTrue();
Contributor

nit: For a better message when tests fail

Suggested change
Assertions.assertThat(configuration.retryPolicy().isPresent()).isTrue();
Assertions.assertThat(configuration.retryPolicy()).isEmpty();

extends AsyncSinkBaseBuilder<InputT, DynamoDbWriteRequest, DynamoDbSinkBuilder<InputT>> {

private static final int DEFAULT_MAX_BATCH_SIZE = 25;
private static final int DEFAULT_MAX_IN_FLIGHT_REQUESTS = 50;
Contributor

I am thinking about how this will work for users if there are multiple in-flight requests containing records that have the same PK/SK.

  • Seems like it will be a race condition where the first request to complete will take precedence
  • If there are failures/partial failures in a batch, the "latest" record in the stream might no longer take precedence. That said, this situation is more likely to occur if there are more records with the same PK/SK, which helps us, because that means eventual consistency will be attained quickly.

In the scenario above, it seems setting MAX_IN_FLIGHT_REQUESTS to 1 would be the main mitigation. I wonder if we should set this as default, or at least call this out to users?

Contributor Author

Very good point, I was thinking about it. I think we should definitely add a comment about this.
We need to be specific here: if we are using parallelism, MAX_IN_FLIGHT_REQUESTS alone will not help, and we also need to ensure keying and ordering from the source.
But in general, I think we should document that the sink has this "problem".
Maybe in the future we can allow using single PutItem requests and conditional expressions, as you mentioned in another comment.

Contributor

Yes, you're right. Users will also need to ensure that a keyBy is done before the sink so that all records with the same PK/SK go to the same subtask.

Yep, agreed that we can set the default, and create a JIRA for using single PutItem requests with conditional expressions in the future.
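As a concrete illustration of the keyBy-before-sink pattern, a minimal sketch (MyRecord and its key accessors are hypothetical, the connector package in the import is assumed, and the sink is assumed to be built elsewhere, e.g. with maxInFlightRequests set to 1):

import org.apache.flink.connector.dynamodb.sink.DynamoDbSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSink;

// Sketch only: routes all records with the same PK/SK to the same subtask so that
// their relative order is preserved on the way into the sink.
public static DataStreamSink<MyRecord> writeKeyed(
        DataStream<MyRecord> records, DynamoDbSink<MyRecord> sink) {
    return records
            .keyBy(record -> record.getPartitionKey() + "|" + record.getSortKey())
            .sinkTo(sink);
}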

Contributor Author

I am also thinking that if we really want to support that, we probably have to remove the deduplication we are doing when we find a duplicate key, and instead "split" the batch, to make sure we do not lose records with the same primary key.

Contributor Author

@YuriGusev I think it makes sense, what do you think? Should we split batches instead of deduplicating?

Contributor

@YuriGusev YuriGusev Oct 18, 2022

Hi,

Duplicate keys only matter when they end up in the same batch request; the DynamoDB service will then fail the request. I took a simple approach of having an accumulator deduplicate entries based on PK/SK within the same request. E.g. if we had 25 entries and 1 entry had a duplicated PK/SK, we would write 24 entries in the request.

But if two entries with the same PK/SK end up in two different in-flight batch requests, this is not a problem, as deduplication then happens on the DynamoDB side.

If the order of the duplicate/update entries matters, I think the batch sink cannot be used anyway, as the order is mixed up by partial write failures/full retries and parallelism in the sink. Then it is up to the user to dedup before the sink.

I may have misunderstood the question. :)

Contributor

@hlteoh37 hlteoh37 Oct 18, 2022

I am also thinking that if we really want to support that, we probably have to remove the deduplication we are doing when we find a duplicate key, and instead "split" the batch, to make sure we do not lose records with the same primary key.

IMO deduplication within a batch on the Flink DDB sink side is a useful feature, because it reduces writes to DDB. The current deduplication ensures the newest record takes precedence, which is the behaviour we see on consecutive BatchWriteItem requests.

The difference I can see here is that the ChangeDataCapture stream off the DDB table will be different depending on whether deduping happens in the Flink DDB Sink. This might matter if the user wants to write to DDB but also track the number of changes in the stream. This sounds like a relatively niche use case though, since that use case is more suited to a streaming destination like Kinesis.

So, I think it makes sense to keep the deduping. @nirtsruya do you have a particular use case where splitting into batches would be preferable?

But if two entries with the same PK/SK end up in two different in-flight batch requests, this is not a problem, as deduplication then happens on the DynamoDB side.
If the order of the duplicate/update entries matters, I think the batch sink cannot be used anyway, as the order is mixed up by partial write failures/full retries and parallelism in the sink. Then it is up to the user to dedup before the sink.

As far as I can tell, the second request to DDB will take precedence and overwrite the first request. I can see batching being very useful for batch jobs where deduping has already occurred, and we just want to dump a large amount of data to DDB.

For the streaming use case, this situation is mitigated if the user sets MAX_IN_FLIGHT_REQUESTS to 1, because the AsyncSinkWriter will put any failed items from requests at the front of the queue, so the "older" items will be retried first.

In the future, I was thinking we could implement an improvement to use the PutItem API, and the DDB Sink could add an additional key-value pair with the eventTime as the value. Whenever an item with an older event time attempts to be written, we can just drop the older record.
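To make that idea concrete, a rough sketch of an event-time-guarded write using the plain PutItem API (not part of this PR; the eventTime attribute, table name and expression names are made up for the example):

import java.util.HashMap;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class ConditionalPutSketch {
    public static void main(String[] args) {
        // Sketch only: "eventTime" is a hypothetical attribute the sink would add.
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("pk", AttributeValue.builder().s("user-1").build());
        item.put("eventTime", AttributeValue.builder().n("1667587200000").build());

        // The write succeeds only if no item exists yet or the stored item is older;
        // otherwise DynamoDB throws ConditionalCheckFailedException and the incoming
        // (older) record is effectively dropped.
        PutItemRequest request =
                PutItemRequest.builder()
                        .tableName("your-table-name")
                        .item(item)
                        .conditionExpression(
                                "attribute_not_exists(eventTime) OR eventTime < :newEventTime")
                        .expressionAttributeValues(
                                Map.of(":newEventTime", AttributeValue.builder().n("1667587200000").build()))
                        .build();

        try (DynamoDbClient client = DynamoDbClient.create()) {
            client.putItem(request);
        }
    }
}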

Contributor

@YuriGusev YuriGusev Oct 18, 2022

Maybe we could provide a setting to turn off deduplication on the sink side. That would be on par with the boto3 client for DynamoDB (the overwrite_by_pkeys=['partition_key', 'sort_key'] config). Wdyt?

In the future, I was thinking we could implement an improvement to use the PutItem API, and the DDB Sink could add an additional key-value pair with the eventTime as the value. Whenever an item with an older event time attempts to be written, we can just drop the older record.

Yes, I agree that would be a nice feature to add. We actually had it planned for the first sink version (not based on AsyncSinkBase). There was an interface that allowed different types of producers (batching/non-batching) to extend it for this use case in the future. We can introduce something similar here and plan for adding this feature. For the non-batching sink we could also support conditional expressions for individual PutItem/DeleteItem requests (as you mentioned below).

Contributor

@hlteoh37 hlteoh37 Oct 18, 2022

Maybe we could provide a setting to turn off deduplication on the sink side

Currently we can do this via tableConfig, can't we? Or do you think a separate config would be better? But I agree, providing a setting to toggle deduplication is a good idea.

Contributor

@YuriGusev YuriGusev Oct 18, 2022

Yes, that's right, if there is no configuration it won't deduplicate anything. It's been a while, I forgot I added a test for this :)

@hlteoh37
Contributor

Question for general discussion:
Do we have any thoughts on using conditional expressions? I note that the BatchWriteItem API we are using doesn't support this. Maybe we can create a followup JIRA to track this feature.

@YuriGusev YuriGusev force-pushed the FLINK-24229-connector-aws-dynamodb branch from fd28524 to b38aa65 on October 17, 2022 18:44
Comment on lines 47 to 72
public enum BackoffStrategy {

/**
* Backoff strategy that uses a full jitter strategy for computing the next backoff delay. A
* full jitter strategy will always compute a new random delay between 0 and the computed
* exponential backoff for each subsequent request. See example: {@link
* FullJitterBackoffStrategy}
*/
FULL_JITTER,

/**
* Backoff strategy that uses equal jitter for computing the delay before the next retry. An
* equal jitter backoff strategy will first compute an exponential delay based on the
* current number of retries, base delay and max delay. The final computed delay before the
* next retry will keep half of this computed delay plus a random delay computed as a random
* number between 0 and half of the exponential delay plus one. See example: {@link
* EqualJitterBackoffStrategy}
*/
EQUAL_JITTER,

/**
* Simple backoff strategy that always uses a fixed delay for the delay before the next
* retry attempt. See example: {@link FixedDelayBackoffStrategy}
*/
FIXED_DELAY
}
Contributor

As a follow-up, we should generalise and pull this config into aws-base. Just a callout, nothing to do now.
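For context, a rough sketch of how these enum values could be mapped onto the SDK v2 backoff strategy classes referenced in the Javadoc above (the enum is repeated locally and the mapping helper is an assumption, not the connector's actual code):

import java.time.Duration;
import software.amazon.awssdk.core.retry.backoff.BackoffStrategy;
import software.amazon.awssdk.core.retry.backoff.EqualJitterBackoffStrategy;
import software.amazon.awssdk.core.retry.backoff.FixedDelayBackoffStrategy;
import software.amazon.awssdk.core.retry.backoff.FullJitterBackoffStrategy;

public final class BackoffStrategySketch {

    // Mirrors the connector enum above; kept local so the sketch is self-contained.
    public enum ConnectorBackoffStrategy {
        FULL_JITTER,
        EQUAL_JITTER,
        FIXED_DELAY
    }

    // Sketch only: a hypothetical mapping helper, not part of the connector.
    public static BackoffStrategy toSdkBackoffStrategy(
            ConnectorBackoffStrategy strategy, Duration baseDelay, Duration maxDelay) {
        switch (strategy) {
            case FULL_JITTER:
                return FullJitterBackoffStrategy.builder()
                        .baseDelay(baseDelay)
                        .maxBackoffTime(maxDelay)
                        .build();
            case EQUAL_JITTER:
                return EqualJitterBackoffStrategy.builder()
                        .baseDelay(baseDelay)
                        .maxBackoffTime(maxDelay)
                        .build();
            case FIXED_DELAY:
            default:
                return FixedDelayBackoffStrategy.create(baseDelay);
        }
    }
}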

import static org.apache.flink.util.Preconditions.checkNotNull;

/**
* A DynamoDB Sink that performs async requests against a destination stream using the buffering
Contributor

Suggested change
* A DynamoDB Sink that performs async requests against a destination stream using the buffering
* A DynamoDB Sink that performs async requests against a destination table using the buffering

@Override
public StatefulSinkWriter<InputT, BufferedRequestState<DynamoDbWriteRequest>> createWriter(
InitContext context) throws IOException {
return new DynamoDbSinkWriter<>(
Contributor

nit: Can simplify with:

return restoreWriter(context, Collections.emptyList());

Contributor

That's neat. Thanks, fixed.

Comment on lines 92 to 95
/**
* @param elementConverter the {@link ElementConverter} to be used for the sink
* @return {@link DynamoDbSinkBuilder} itself
*/
Contributor

nit: Since this comment does not add anything above the method signature, consider removing, as per the coding guidelines

@dannycranmer
Contributor

Question for general discussion: Do we have any thoughts on using conditional expressions? I note that the BatchWriteItem API we are using doesn't support this. Maybe we can create a followup JIRA to track this feature.

I had considered the same. I see that conditional writes are supported on PutItem, but not the Batch API. Based on this, and the discussion above about supporting non-batch mode, I would vote to proceed without it and add it later.

Contributor

@hlteoh37 hlteoh37 left a comment

Generally looks good.

Comment on lines 113 to 116
* @param overwriteByPKeys provide partition key and (optionally) sort key name if you want to
* bypass no duplication limitation of single batch write request. Batching DynamoDB sink
* will drop request items in the buffer if their primary keys(composite) values are the
* same as newly added one.
Contributor

and (optionally) sort key name

It is not clear from this comment or the method signature how I provide a sort key. If it is elements 0 and 1, could we split it into individual fields instead?

Contributor

@YuriGusev YuriGusev Oct 20, 2022

We had it as a separate DTO, I changed it here. The order/number of keys doesn't matter for the deduplication logic. I can fix the comment to make it clearer, or return to the previous approach (or have two individual fields).

I changed this to match the similar approach in the boto3 DynamoDB batch client.

Otherwise I'm fine with having two fields, however we would need to keep a nullable field in the builder for the sort key and pass both fields down the tree.

Contributor

@dannycranmer do you think it is ok to keep it this way, or would you like to change the approach?

Contributor

It's fine to use the list approach. It's more flexible, and it's similar to boto.

Comment on lines 47 to 98
@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
DynamoDbWriteRequest that = (DynamoDbWriteRequest) o;
return Objects.equals(writeRequest, that.writeRequest);
}

@Override
public int hashCode() {
return Objects.hash(writeRequest);
}
Contributor

Why do we need these?

Contributor

I left this class to be removed later when we migrate to the interface approach (@nirtsruya is working on that).

@YuriGusev YuriGusev force-pushed the FLINK-24229-connector-aws-dynamodb branch from e3318ce to ce7f45d on October 20, 2022 17:10
* AWS_SECRET_ACCESS_KEY} through environment variables etc.
*/
@Internal
class DynamoDbSinkWriter<InputT> extends AsyncSinkWriter<InputT, WriteRequest> {
Contributor

@nirtsruya sorry I might not have explained my idea well. I actually meant that here we would use your new type:

  • class DynamoDbSinkWriter<InputT> extends AsyncSinkWriter<InputT, DynamoDbWriteRequest> {

Now we can modify DynamoDbWriteRequest later to support the conditional fields for non-batch mode. Inside submitRequestEntries we would convert DynamoDbWriteRequest into WriteRequest for batch or PutItemRequest for non-batch. We would also need to convert WriteRequest back into DynamoDbWriteRequest for failed records (we need to check we have all the required info).

There will be a slight performance hit when we retry, as records are converted multiple times. But this allows us to evolve the connector to support batch/non-batch without having to duplicate all the classes, as you explained.
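A rough sketch of that conversion step (the convertToWriteRequest name matches what appears elsewhere in this review; the exact getters on DynamoDbWriteRequest are assumptions):

import software.amazon.awssdk.services.dynamodb.model.DeleteRequest;
import software.amazon.awssdk.services.dynamodb.model.PutRequest;
import software.amazon.awssdk.services.dynamodb.model.WriteRequest;

// Sketch only: converts the connector-level request into the SDK batch type
// inside submitRequestEntries.
private static WriteRequest convertToWriteRequest(DynamoDbWriteRequest request) {
    if (request.getType() == DynamoDbWriteRequestType.PUT) {
        return WriteRequest.builder()
                .putRequest(PutRequest.builder().item(request.getItem()).build())
                .build();
    } else if (request.getType() == DynamoDbWriteRequestType.DELETE) {
        return WriteRequest.builder()
                .deleteRequest(DeleteRequest.builder().key(request.getItem()).build())
                .build();
    } else {
        throw new IllegalArgumentException(
                "Unsupported request type " + request.getType()
                        + "; convertToWriteRequest needs updating.");
    }
}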

Contributor Author

Ok, makes sense.
I have one comment on the DynamoDbWriteRequest - it seems like a DTO, do we really need to make it an interface?
It would help, for example, to have the same concrete class for de/serialization.

Contributor

I have one comment on the DynamoDbWriteRequest - it seems like a DTO, do we really need to make it an interface?

Yes I agree that an interface does not make sense here. Class is fine.

Comment on lines 37 to 55
* private static class DummyDynamoDbElementConverter implements ElementConverter<String, DynamoDbWriteRequest> {
*
* @Override
* public DynamoDbWriteRequest apply(String s) {
* final Map<String, AttributeValue> item = new HashMap<>();
* item.put("your-key", AttributeValue.builder().s(s).build());
* return new DynamoDbWriteRequest(
* WriteRequest.builder()
* .putRequest(PutRequest.builder()
* .item(item)
* .build())
* .build()
* );
* }
* }
* DynamoDbSink<String> dynamoDbSink = DynamoDbSink.<String>builder()
* .setElementConverter(new DummyDynamoDbElementConverter())
* .setDestinationTableName("your-table-name")
* .build();
Contributor

This comment is now outdated
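For reference, an updated version of this example might look roughly like the sketch below once DynamoDbWriteRequest is a builder-based class; the setType/setItem/setOverwriteByPartitionKeys methods and the two-argument apply signature are assumptions based on the rest of this diff and the AsyncSink ElementConverter interface, not the final documented API:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.connector.base.sink.writer.ElementConverter;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

// Sketch only: builder/setter names are assumptions based on the rest of this diff.
private static class DummyDynamoDbElementConverter
        implements ElementConverter<String, DynamoDbWriteRequest> {

    @Override
    public DynamoDbWriteRequest apply(String s, SinkWriter.Context context) {
        final Map<String, AttributeValue> item = new HashMap<>();
        item.put("your-key", AttributeValue.builder().s(s).build());
        return DynamoDbWriteRequest.builder()
                .setType(DynamoDbWriteRequestType.PUT)
                .setItem(item)
                .build();
    }
}

DynamoDbSink<String> dynamoDbSink =
        DynamoDbSink.<String>builder()
                .setElementConverter(new DummyDynamoDbElementConverter())
                .setDestinationTableName("your-table-name")
                // optional: deduplicate within a batch on these key attributes
                .setOverwriteByPartitionKeys(Arrays.asList("your-partition-key", "your-sort-key"))
                .build();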

* <li>{@code maxRecordSizeInBytes} will be 400 KB i.e. {@code 400 * 1000}
* <li>{@code failOnError} will be false
* <li>{@code destinationTableName} destination table for the sink
* <li>{@code overwriteByPKeys} will be empty meaning no records deduplication will be performed
Contributor

This comment is outdated

private final boolean failOnError;
private final String tableName;

private List<String> overwriteByPKeys;
Contributor

Can we update the name to overwriteByPartitionKeys to match the other classes?

Contributor

@YuriGusev YuriGusev Oct 24, 2022

Sorry, missed this.

DeleteRequest.builder().key(dynamoDbWriteRequest.getItem()).build())
.build();
} else {
throw new IllegalArgumentException("");
Contributor

Can we add a message to the exception? Given this should not happen, we can quite explicitly say that the convertToWriteRequest method needs updating.

.setType(DynamoDbWriteRequestType.DELETE)
.build();
} else {
throw new IllegalArgumentException("");
Contributor

Same as above


private final Map<String, WriteRequest> container;

private List<String> overwriteByPKeys;
Contributor

Please update the field name to match the other one: overwriteByPartitionKeys

Comment on lines 114 to 217
TableRequestsContainer container = new TableRequestsContainer(overwriteByPKeys);
for (DynamoDbWriteRequest requestEntry : requestEntries) {
container.put(convertToWriteRequest(requestEntry));
}

CompletableFuture<BatchWriteItemResponse> future =
client.batchWriteItem(
BatchWriteItemRequest.builder()
.requestItems(
ImmutableMap.of(tableName, container.getRequestItems()))
.build());
Contributor

As an optimization, we could completely skip PrimaryKeyBuilder.build when CollectionUtil.isNullOrEmpty(overwriteByPKeys). This would remove the UUID generation and map overhead. Something like this:

        final List<WriteRequest> writeRequests;
        
        if (CollectionUtil.isNullOrEmpty(overwriteByPKeys)) {
            writeRequests = new ArrayList<>();
            for (DynamoDbWriteRequest requestEntry : requestEntries) {
                writeRequests.add(convertToWriteRequest(requestEntry));
            }
        } else {
            TableRequestsContainer container = new TableRequestsContainer(overwriteByPKeys);
            for (DynamoDbWriteRequest requestEntry : requestEntries) {
                container.put(convertToWriteRequest(requestEntry));
            }

            writeRequests = container.getRequestItems();
        }

        CompletableFuture<BatchWriteItemResponse> future =
                client.batchWriteItem(
                        BatchWriteItemRequest.builder()
                                .requestItems(ImmutableMap.of(tableName, writeRequests))
                                .build());

Comment on lines 32 to 74
public enum DynamoDbWriteRequestType {
PUT,
DELETE,
}
Contributor

Is it worth trying to use a more general Flink type to represent this, something like https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/types/RowKind.java?

Maybe @MartijnVisser would be able to help?

Contributor Author

@dannycranmer I like using the byte representation for de/serialization. Will also apply it for DynamoDbType.

@YuriGusev YuriGusev force-pushed the FLINK-24229-connector-aws-dynamodb branch 2 times, most recently from 7ba4e4f to 2d1c15a on October 29, 2022 23:36
Comment on lines 64 to 69
* <li>{@code dynamoDbTablesConfig}: if provided for the table, the DynamoDb sink will attempt to
* deduplicate records with the same primary and/or secondary keys in the same batch request.
* Only the latest record with the same combination of key attributes is preserved in the
* request.
* </ul>
Contributor

This comment is out of sync with the recent changes to the class

Contributor

Fixed and other javadocs also updated

Comment on lines 160 to 166
Set<String> stringSet = new LinkedHashSet<>(stringSetSize);
for (int i = 0; i < stringSetSize; i++) {
stringSet.add(in.readUTF());
}
return AttributeValue.builder().ss(stringSet).build();
Contributor

Why use a Set here? Does this mean this is a non-symmetric transform? We could hit a situation where:
!originalObject.equals(deserialize(serialize(originalObject)))

If so, I think we should not be deduplicating here and the transform should be symmetrical

protected long getSizeInBytes(DynamoDbWriteRequest requestEntry) {
// dynamodb calculates item size as a sum of all attributes and all values, but doing so on
// every operation may be too expensive, so this is just an estimate
return requestEntry.getItem().toString().getBytes(StandardCharsets.UTF_8).length;
Contributor

As discussed offline, it is not practical to compute the sizeInBytes given that:

  • We do not know how the client serializes the request payload
  • This will be an expensive operation and we would potentially be serializing each record twice

We agree to:

  • Return a dummy constant value here
  • Follow up with an improvement to make this flush strategy and validation optional (FLINK-29854)

The failure modes are already covered:

  • Total batch size too large: this will be handled organically by AIMD
  • Single DynamoDbWriteRequest too large: if we try to send a single request that is too large it will result in a fatal validation exception

Contributor

This was fixed. I set it to 0 and updated the builder class to raise exceptions if the corresponding parameters are used.

Comment on lines +133 to +143
@Override
public DynamoDbSinkBuilder<InputT> setMaxBatchSizeInBytes(long maxBatchSizeInBytes) {
throw new InvalidConfigurationException(
"Max batch size in bytes is not supported by the DynamoDB sink implementation.");
}

@Override
public DynamoDbSinkBuilder<InputT> setMaxRecordSizeInBytes(long maxRecordSizeInBytes) {
throw new InvalidConfigurationException(
"Max record size in bytes is not supported by the DynamoDB sink implementation.");
}
Contributor

Thanks for adding this

Comment on lines 21 to 25
/** Exception is thrown when DynamoDb sink failed to write data. */
public class DynamoDbSinkException extends RuntimeException {
Contributor

Missing @PublicEvolving

Contributor

I see the annotations were recently removed, why was this? 598373a

Comment on lines 21 to 22
/** Exception is thrown if a DynamoDB configuration was invalid. */
public class InvalidConfigurationException extends RuntimeException {
Contributor

Missing @PublicEvolving

Contributor

Why is this not a subtype of DynamoDbSinkException?

Contributor

I see the annotations were recently removed, why was this? 598373a

Contributor

please see my answer below. Fixed this now

Comment on lines 21 to 22
/** Exception is thrown if a DynamoDB request was invalid. */
public class InvalidRequestException extends RuntimeException {
Contributor

Missing @PublicEvolving

Contributor

Why is this not a sub type of DynamoDbSinkException?

Contributor

I see the annotations were recently removed, why was this? 598373a

Contributor

I first started adding them for the other exceptions that we now expose, and then I saw there are no PublicEvolving annotations for exceptions in the Kinesis sink. Wasn't sure if that's right, but decided to follow the existing code. I'll add them back now.

Contributor

fixed this now

@YuriGusev YuriGusev force-pushed the FLINK-24229-connector-aws-dynamodb branch 3 times, most recently from c67f439 to 7a76f7b on November 4, 2022 20:01
Co-authored-by: Yuri Gusev <yura.gusev@gmail.com>
@dannycranmer
Contributor

As discussed offline, the build takes ~9 minutes to run due to excessive use of integration tests. We will follow up to migrate these tests to unit tests, and generally improve coverage ahead of the release.

@YuriGusev YuriGusev force-pushed the FLINK-24229-connector-aws-dynamodb branch from 7a76f7b to ec7f1e8 on November 4, 2022 20:14
@dannycranmer
Contributor

dannycranmer commented Nov 4, 2022

Seems like the checks are broken, raised a Jira to fix:

LGTM, running tests locally and waiting for @hlteoh37 to approve before merging

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Flink : Connectors : Amazon DynamoDB Parent 1.0.0-SNAPSHOT:
[INFO]
[INFO] Flink : Connectors : Amazon DynamoDB Parent ........ SUCCESS [  2.091 s]
[INFO] Flink : Connectors : Amazon DynamoDB ............... SUCCESS [09:13 min]
[INFO] Flink : Connectors : SQL : Amazon DynamoDB ......... SUCCESS [  7.273 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

@hlteoh37
Contributor

hlteoh37 commented Nov 4, 2022

LGTM

Contributor

@hlteoh37 hlteoh37 left a comment

Approved

@dannycranmer dannycranmer merged commit 11bcbf9 into apache:main Nov 4, 2022