Add Kinesis Indexing Service to core Druid #6431
Conversation
```diff
-public interface KafkaIndexTaskRunner extends ChatHandler
+public interface SeekableStreamIndexTaskRunner<T1, T2> extends ChatHandler
```
Neither the methods nor the interface have anything related to the word "seekable", so I am not sure why it is called this way; probably StreamingIndexTaskRunner is enough.
This interface is meant to be used for seekable stream index tasks like the Kafka and Kinesis index tasks, both of which require the ability to seek.
```java
{
  Appenderator getAppenderator();

  /**
   * Run the task
```
Please avoid repeating the code itself as a comment. In general, if a comment does not add any information for the reader, then there is no need for it. Reading the method signature, I can tell that it is running something since it is called run... same for stop.
```java
@JsonCreator
public SeekableStreamPartitions(
    @JsonProperty("stream") final String stream,
```
I am not sure I am getting this. When am I supposed to provide a stream or a topic? What does it mean to provide both? Can you explain what this class is doing?
I considered using separate `KafkaPartitions` and `KinesisPartitions` classes initially, but later decided to remove them and just use this generic class instead. The problem is that when deserializing datasource metadata, Kafka uses the field `topic` and Kinesis uses `stream`, so I needed to map both `topic` and `stream` to the name of the stream. That's why both `stream` and `topic` are in the constructor; it's meant so that either `stream` or `topic` is provided.
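To make the described mapping concrete, here is a hedged sketch (the class and field names are illustrative, not the actual Druid implementation) of a constructor that accepts either JSON field and normalizes both to one internal name:

```java
// Illustrative sketch, not the actual Druid class: a constructor that accepts
// either "stream" (Kinesis metadata) or "topic" (Kafka metadata) and stores
// a single canonical stream name.
public class StreamPartitionsSketch {
    private final String stream;

    public StreamPartitionsSketch(String stream, String topic) {
        // Exactly one of the two is expected to be non-null after deserialization.
        this.stream = stream != null ? stream : topic;
        if (this.stream == null) {
            throw new IllegalArgumentException("either stream or topic must be provided");
        }
    }

    public String getStream() {
        return stream;
    }
}
```

Under this scheme, two instances deserialized from Kafka-style and Kinesis-style metadata with the same underlying name end up holding the same canonical value.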
```java
@JsonProperty
public String getTopic()
{
  return stream;
```
This shows that the definition of the class itself is somehow unclear. The method is getTopic but it returns stream?
```java
Preconditions.checkArgument(this.stream != null);
Preconditions.checkArgument(map != null);
// Validate map
for (Map.Entry<T1, T2> entry : map.entrySet()) {
```
I am not sure what this is doing. If you are checking for null values, then I think the ImmutableMap copy already does that. Also, if the goal is a null check, the error message should be clear, e.g. "value is null, this is invalid".
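For illustration, a hedged sketch of the kind of validation being suggested, with a descriptive error message that names the offending entry (this is a hypothetical helper, not the code under review):

```java
import java.util.Map;

// Hypothetical helper illustrating the review suggestion: fail fast with a
// message that identifies the offending partition instead of a bare null check.
class PartitionMapValidator {
    static <K, V> void validate(Map<K, V> map) {
        if (map == null) {
            throw new IllegalArgumentException("partition-to-sequence-number map is null");
        }
        for (Map.Entry<K, V> entry : map.entrySet()) {
            if (entry.getValue() == null) {
                throw new IllegalArgumentException(
                    "null sequence number for partition [" + entry.getKey() + "] is invalid"
                );
            }
        }
    }
}
```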
```java
@JsonProperty
public Map<T1, T2> getPartitionSequenceNumberMap()
{
  return map;
```
same thing, not sure why we need both methods
see above
```java
if (o == null || getClass() != o.getClass()) {
  return false;
}
SeekableStreamPartitions that = (SeekableStreamPartitions) o;
```
This implies that if I create 2 instances with the same stream string and different topic strings, they are both equal?
Only one of stream/topic should be a non-null value. They both map to the name of the stream. There are both topic and stream in this class because KinesisDataSourceMetadata uses `stream` while KafkaDataSourceMetadata uses `topic`.
```java
public Record(String streamName, T1 partitionId, T2 sequenceNumber, List<byte[]> data)
{
  this.streamName = streamName;
```
Earlier in the code I see `topic` and `stream`, and now this class is using yet another name, `streamName`. Can we use one term?
```java
*/
public class Record<T1, T2>
{
  public static final String END_OF_SHARD_MARKER = "EOS";
```
Not sure how `END_OF_SHARD_MARKER` maps to `EOS`.
```java
private final T2 sequenceNumber;
private final List<byte[]> data;

public Record(String streamName, T1 partitionId, T2 sequenceNumber, List<byte[]> data)
```
I am not very familiar with Kinesis, but I'm wondering why the data has to be ordered. Can't we just use a Collection, or an Iterable?
see my comment below
```java
public interface RecordSupplier<T1, T2> extends Closeable
{
  /**
   * assigns a set of partitions to this RecordSupplier
```
This repeats the method signature word for word. Such comments are not really needed; in fact, they introduce more issues. Imagine that tomorrow someone refactors the code and changes the name of the method or of the RecordSupplier class: this comment will be inaccurate.
I have looked at the abstractions/API added by this PR and am not convinced that this is the best abstraction for a streaming ingest task. My biggest concern is that this abstraction will only work with Kafka/Kinesis. For instance, the proposed abstraction of Record as `Record(String streamName, T1 partitionId, T2 sequenceNumber, List<byte[]> data)` will not fit both Pulsar and Pravega out of the box; as far as I can tell, there is no notion of a partition in either system. Another question is why the data is an ordered list of bytes; does order really matter? Also, this PR models both the record id and a position in the stream with the pair of (partitionId, sequenceNumber). Now, I understand that this currently works nicely with Kinesis and Kafka, but I am afraid that adding support for another system will lead to an entire code refactor, i.e. a PR with yet another 16K-line refactor. Thus, I really recommend having a proposal as a design review and then going from there.
I haven't read the patch yet, but I think it is useful to have an abstraction for ordered-partitionable streams (like Kafka and Kinesis have). It's because we have a lot of logic specifically around dealing with those kinds of systems when it comes to partition discovery, task creation, work assignment, and metadata handling. A lot of that shared code won't apply if the streams do not have an ordered-partition concept. It seems like Pulsar topics are ordered-partitionable (they can have partitions, and those partitions have messages with ordered sequence IDs) so it should be able to fall under the same abstraction. I am not familiar with Pravega so I don't know about that one.
@gianm I am not an expert on Pulsar, but if I am not wrong, the Pulsar record id is per topic and not per TopicPartition, and I think the consumer doesn't have a choice over the partition... https://pulsar.apache.org/docs/latest/getting-started/ConceptsAndArchitecture/#subscription-modes
@b-slim thanks for the comments. The motivation for creating the SeekableStream abstraction is for seekable, order-partitionable streams like Kafka and Kinesis; I should have been more specific in the PR description and in naming the classes.

It seems like Pulsar supports partitions and offers strict ordering within each partition if the failover subscription mode is used, so it should fit into the current abstraction. As for the data, in the Kafka implementation:

```java
final byte[] valueBytes = record.value();
final List<InputRow> rows = valueBytes == null
    ? Utils.nullableListOf((InputRow) null)
    : parser.parseBatch(ByteBuffer.wrap(valueBytes));
```

and in the Kinesis implementation:

```java
if (deaggregate) {
  data = new ArrayList<>();
  final List<UserRecord> userRecords = UserRecord.deaggregate(Collections.singletonList(kinesisRecord));
  for (UserRecord userRecord : userRecords) {
    data.add(toByteArray(userRecord.getData()));
  }
} else {
  data = Collections.singletonList(toByteArray(kinesisRecord.getData()));
}
```

I agree; I will update the PR description to include more details, and address the code reviews individually.
Thanks @jsun98 for being willing to add more detail to the PR description. I'd suggest including a few specific points; I think it will help people understand the approach you took and make it easier to review it.
@b-slim Also as an FYI - this is a module that our customers have been running in production for some time now.
@fjy please read my review; the questions I raised are mostly about the abstraction. I am not sure what this has to do with it running on your customers' clusters... thanks.
```xml
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>amazon-kinesis-client</artifactId>
```
The KCL is licensed under the Amazon Software License.
I know this has had release implications for other Apache projects, I wonder if this should be considered by Druid as well?
That's a good point. The Amazon Software License is forbidden for Apache projects based on https://www.apache.org/legal/resolved.html. In particular it has a field of use restriction, which is incompatible with Apache philosophy and license:
3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use with the web services, computing platforms or applications provided by Amazon.com, Inc. or its affiliates, including Amazon Web Services, Inc.
The aws-java-sdk-kinesis library does not have any such limitation (it is Apache licensed). Is it possible to use this instead of amazon-kinesis-client? If so, we should do that. If not, I think we can still ship the kinesis extension, but we cannot include amazon-kinesis-client. It should be something users download separately and drop in the jar for on their own: https://www.apache.org/legal/resolved.html#optional
FWIW this is the same basic situation as the mysql extension is in (the mysql driver is under a category X license) so we should probably package it the same way.
@gianm it seems that aws-java-sdk-kinesis would work, except that it doesn't have `UserRecord.deaggregate`, which is needed in the KinesisRecordSupplier implementation. Do you know of any workarounds?
Assuming that the KCL was used here in the "normal" manner, i.e. handling leases (including balancing and checkpointing), resharding, retries, backoff, etc., there will be quite a large amount of work to remove the KCL and still have a stable Kinesis reader.
@dyanarose ah, we don't actually use the KCL in the "normal" manner here; the library is included mainly for `IKinesisProxy`, which is the Java wrapper around the HTTP API. We handle the balancing, checkpointing, and resharding in the indexing service logic (out of necessity, to be able to support exactly-once read semantics).
@gianm included the points you mentioned in the PR description |
Thanks!! |
```java
  toolbox
);

stillReading = !assignment.isEmpty();
```
Would you add a comment about how `assignment` can be changed in `getRecords`?
```java
  continue;
}

if (log.isTraceEnabled()) {
```
Please remove this.
```java
}

@Override
public void onFailure(@ParametersAreNonnullByDefault Throwable t)
```
This annotation is for a class or package, not for an individual parameter. We're already using it that way elsewhere, and you can safely remove it from this class.
```java
*/
public class SeekableStreamPartitions<PartitionIdType, SequenceOffsetType>
{
  public static final String NO_END_SEQUENCE_NUMBER = "NO_END_SEQUENCE_NUMBER";
```
Would you please add a comment about what this is?
```java
private boolean checkpointed;

/**
 * Lock for
```
These two variables can be package-private.
```java
private boolean checkpointed;

/**
 * Lock for
```
This can be private.
```java
  @Json ObjectMapper mapper
)
{
  super(
```
Put these on one line
```java
  return false;
}
KinesisIndexTaskTuningConfig that = (KinesisIndexTaskTuningConfig) o;
return getMaxRowsInMemory() == that.getMaxRowsInMemory() &&
```
Missing `maxTotalRows`, `maxRecordsPerPoll`, and `intermediateHandoffPeriod` in equals, hashCode, and toString.
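As a quick illustration of the pattern being requested (with hypothetical fields, not the actual KinesisIndexTaskTuningConfig), every configurable field should participate in equals and hashCode so that two configs differing in any field compare as unequal:

```java
import java.util.Objects;

// Hypothetical config sketch: all fields participate in equals/hashCode.
class TuningConfigSketch {
    private final Integer maxTotalRows;
    private final int maxRecordsPerPoll;

    TuningConfigSketch(Integer maxTotalRows, int maxRecordsPerPoll) {
        this.maxTotalRows = maxTotalRows;
        this.maxRecordsPerPoll = maxRecordsPerPoll;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        TuningConfigSketch that = (TuningConfigSketch) o;
        return maxRecordsPerPoll == that.maxRecordsPerPoll
               && Objects.equals(maxTotalRows, that.maxTotalRows);
    }

    @Override
    public int hashCode() {
        return Objects.hash(maxTotalRows, maxRecordsPerPoll);
    }
}
```

Omitting a field from equals/hashCode is exactly what makes two differently-configured instances look identical, which is the bug the reviewer is pointing at.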
```java
@Override
public String getPosition(StreamPartition<String> partition)
{
  throw new UnsupportedOperationException("getPosition is not supported in Kinesiss");
```
typo: Kinesis
```java
try {
  recordsResult = kinesis.getRecords(new GetRecordsRequest().withShardIterator(shardIterator).withLimit(1000));
}
catch (ProvisionedThroughputExceededException e) {
```
We should let the user know when we get this exception so that they understand what's happening (and maybe can take corrective action by scaling up).
```java
List<Record> records = recordsResult.getRecords();

if (!records.isEmpty()) {
  return records.get(0).getSequenceNumber();
```
If you're only ever expecting one record here, can you check and throw an exception if you get more?
It is fine if we get more than 1 record; `kinesis.getRecords(new GetRecordsRequest().withShardIterator(shardIterator).withLimit(1000))` fetches up to 1000 records to make sure we can find the first record in this shard (in the case where the shard is constantly removing records due to records being past the retention period).
Got it, I remember this conversation now.
Last set of review comments from me. I am good with this after these (+ existing comments) are addressed. Thank you for the hard work on this PR!
```java
@JsonValue
public String toString()
{
  return StringUtils.toLowerCase(name()).replace('_', '-');
```
Can you add tests to KinesisRegionTest to test `getEndpoint()`, and in particular make sure that the China regions return `kinesis.cn-north-1.amazonaws.com.cn` (with the .cn at the end)?
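For context on what the snippet above serializes, here is a hedged sketch of the enum toString pattern (a hypothetical enum, not the actual KinesisRegion; the `getEndpoint()` logic that would append the `.amazonaws.com.cn` suffix is not shown):

```java
import java.util.Locale;

// Hypothetical enum illustrating the toString pattern in the snippet above:
// constant names are lower-cased and underscores become hyphens.
enum RegionSketch {
    US_EAST_1,
    CN_NORTH_1;

    @Override
    public String toString() {
        return name().toLowerCase(Locale.ROOT).replace('_', '-');
    }
}
```

A test along the lines the reviewer suggests would then assert on both the region string and the full endpoint for the China regions.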
```java
import org.apache.druid.metadata.DefaultPasswordProvider;
import org.apache.druid.metadata.PasswordProvider;

public class ConstructibleAWSCredentialsConfig extends AWSCredentialsConfig
```
This class isn't needed anymore; you can just pass the `AWSCredentialsConfig` that you get from configuration to `KinesisRecordSupplier.getAmazonKinesisClient()`, instead of calling the getters on the access key and secret key and then constructing a new instance of ConstructibleAWSCredentialsConfig.
```java
new SeekableStreamPartitions<>(ioConfig.getStream(), startPartitions),
new SeekableStreamPartitions<>(ioConfig.getStream(), endPartitions),
true,
true, // should pause after reading otherwise the task may complete early which will confuse the supervisor
```
`pauseAfterRead` is never used (and not needed; it was removed from the Kafka implementation).
```java
@Override
public String toString()
{
  return "KinesisSupervisorSpec{" +
```
Can you add `context` here?
```java
@Override
public String toString()
{
  return "KinesisSupervisorTuningConfig{" +
```
Add `maxBytesInMemory`, `maxTotalRows`, `logParseExceptions`, `maxParseExceptions`, `maxSavedParseExceptions`, `maxRecordsPerPoll`, and `intermediateHandoffPeriod`. I would also just remove `buildV9Directly` from the toString if the parent class is doing nothing with it.
```java
  return dataSource;
}

public String getStream()
```
Missing @JsonProperty ?
```java
}

@Override
public abstract String toString();
```
I don't think this is strictly necessary to enforce
```java
}

@Override
public abstract String toString();
```
I don't think this is strictly necessary to enforce
```java
Duration getShutdownTimeout();

@Override
String toString();
```
I don't think this is strictly necessary to enforce
```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class RandomId
```
I think by convention we suffix a helper class like this with Utils
```java
    ShardIteratorType.AT_SEQUENCE_NUMBER.toString(),
    currRecord.getSequenceNumber()
).getShardIterator();
rescheduleRunnable(EXCEPTION_RETRY_DELAY_MS);
```
Thanks for updating this. This looks good except that we should return immediately on InterruptedException.
```java
}
catch (SdkClientException e) {
  log.warn(e, "encounted unknown AWS exception, retrying in [%,dms]", EXCEPTION_RETRY_DELAY_MS);
  rescheduleRunnable(EXCEPTION_RETRY_DELAY_MS);
```
Probably we can also check whether this exception is transient using `S3Utils.isServiceExceptionRecoverable`.
```java
private class PartitionResource
{
  private final StreamPartition<String> streamPartition;
  private final Object startLock = new Object();
```
Hmm, I'm not sure what this lock is for. What threads can call `startBackgroundFetch()` at the same time?
```java
}

@Override
public void onFailure(@ParametersAreNonnullByDefault Throwable t)
```
Please remove this annotation too.
```
-Ddruid.kinesis.accessKey=123 -Ddruid.kinesis.secretKey=456
```
The AWS access key ID ad secret access key are used for Kinesis API requests. If this is not provided, the service will look for credentials set in environment variables, in the default profile configuration file, and from the EC2 instance profile provider (in this order).
typo: The AWS access key ID ad secret access key.. -> 'and'
The latest change looks good to me. Thank you for your hard work @jsun98!!
```java
catch (ProvisionedThroughputExceededException e) {
  log.warn(
      e,
      "encounted ProvisionedThroughputExceededException while fetching records, this means "
```
typo: encounted -> encountered. Also, you can recommend that they increase the number of shards to get more throughput.
```java
}

private static String makeTaskId(String dataSource, int randomBits, String type)
```
`randomBits` is no longer used.
LGTM, thank you @jsun98! |
```diff
@@ -312,7 +312,7 @@ compatible because they have a different ingestion spec or partition allocation,
 supervisor will create a new set of tasks. In this way, the supervisors are persistent across overlord restarts and
 fail-overs.
 
-A supervisor is stopped via the `POST /druid/indexer/v1/supervisor/<supervisorId>/shutdown` endpoint. This places a
+A supervisor is stopped via the `POST /druid/indexer/v1/supervisor/<supervisorId>/terminate` endpoint. This places a
```
Why do we need this backward-incompatible change, or is this just fixing the doc?
`/shutdown` and `/terminate` are equivalent since #6272, although `/shutdown` is deprecated (on the grounds that the difference between 'suspend' and 'terminate' is clearer than between 'suspend' and 'shutdown'). So this was just fixing the doc.
```diff
-@Override
-public String toString()
+protected SeekableStreamDataSourceMetadata<Integer, Long> createConcreteDataSourceMetaData(
```
What does "concrete" add to the meaning here? Is there a non-concrete DataSourceMetadata?
Yes, the non-concrete one is called `SeekableStreamDataSourceMetadata`.
```java
}

@Override
public void seek(StreamPartition<Integer> partition, Long sequenceNumber)
```
I really do not see the need to call a partition a StreamPartition...
As opposed to `Partition`, you mean? I guess it's because StreamPartition is clearer?
```java
@Override
public Long getLatestSequenceNumber(StreamPartition<Integer> partition)
{
  Long currPos = consumer.position(new TopicPartition(partition.getStream(), partition.getPartitionId()));
```
why not bump the kafka version to 0.11 and use https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#beginningOffsets(java.util.Collection) and https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#endOffsets(java.util.Collection)
There's another PR in the works -- #6496 -- to upgrade to Kafka 2.x, from @surekhasaharan, that we were planning to do this in after this current PR has been merged (to avoid conflicts). There were other changes needed too in order to make the upgrade work smoothly (check out that PR). So IMO it makes sense to leave this as is, & we can make the new Kafka API changes over in #6496.
Consider this a third design review, at the level of the abstractions. They look good to me, judging by the fact that the Kafka and Kinesis implementations are relatively minimal in what they need to override (except LegacyKafkaIndexTaskRunner, which is huge, but we should be removing that soonish anyway).
Thanks @jsun98 for your work on this patch!!
I don't think I can get to re-reviewing the entire thing before next year, so if there is enough consensus that this is good to go, then good it is.
Major changes to review

- `SeekableStreamSupervisor` and Kafka/Kinesis implementation classes
- `SeekableStreamIndexTask` and Kafka/Kinesis implementation classes
- `SeekableStreamIndexTaskRunner` and Kafka/Kinesis implementation classes
- `RecordSupplier` and Kafka/Kinesis implementation classes

Motivation

The motivation for this PR is to add `kinesis-indexing-service` to core Druid extensions and provide an abstraction, `seekablestream`, for streaming systems with the following characteristics:

Major changes

- ~~The logic in `KinesisIndexTask` and `KafkaIndexTask` has not been merged yet, due to Kinesis not yet supporting incremental handoff, but this will be implemented in a follow-up PR. The code in the two index tasks is left as is for now.~~ Incremental handoff is now implemented for the Kinesis index task.
- ~~The unit tests for Kinesis depend on LocalStack, a local AWS testing stack, which requires a running Docker instance.~~ The Docker dependency is no longer required, as mocks are used instead.

Definition of Abstractions

- `SeekableStreamSupervisor`: merged the logic from KinesisSupervisor and KafkaSupervisor, most notably the logic in `runInternal()` and the helper methods.
- `RecordSupplier`: abstraction for fetching records from a Kafka server/Kinesis stream. The implementation for Kafka is trivial, as Kafka provides a `KafkaConsumer` which basically encapsulates all of the logic. KinesisRecordSupplier, on the other hand, is a full-fledged buffer implementation.
- `SeekableStreamIndexTaskRunner`: merged logic from KinesisIndexTask (which didn't have the TaskRunner abstraction) and IncrementalPublishingKafkaTaskRunner. There is a slight discrepancy in fetching the next sequence number/offset: the Kafka task will read from the starting offset (inclusive) to the end offset (exclusive), while the Kinesis task reads from the start sequence number (inclusive for the 1st task, exclusive for all subsequent tasks) to the end sequence number (inclusive). This is because Kinesis sequence numbers cannot be calculated simply like Kafka's, so instead of making an expensive API call to get the next sequence number, we store the current already-read-to sequence number instead.
- `OrderedPartitionableRecord`, `SeekableStreamPartition`, `SeekableStreamPartitions`: abstractions over the Kafka/Kinesis partition/shard and offset/sequence number concepts. These classes help to avoid working with Kafka/Kinesis-specific types directly.

Compatibility with other streaming systems

This abstract design is specifically aimed at partition-able, partition-ordered stream systems such as Kafka and Kinesis. Systems that do not have partitions, or do not offer ordering within partitions, are not good fits under this abstraction.
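To make the RecordSupplier abstraction described above more tangible, here is a hedged sketch with a toy in-memory implementation. The interface shape (assign/seek/poll) follows the discussion in this thread, but the exact names, signatures, and the in-memory backing are illustrative assumptions, not the actual Druid interface:

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hedged sketch of the RecordSupplier shape discussed above; names and
// signatures are illustrative assumptions, not the exact Druid API.
interface RecordSupplierSketch<PartitionIdType, SequenceOffsetType> extends Closeable {
    void assign(Set<PartitionIdType> partitions);
    void seek(PartitionIdType partition, SequenceOffsetType sequenceNumber);
    List<byte[]> poll(long timeoutMillis);
}

// Toy in-memory implementation: each partition is an ordered list of records,
// and seek() positions the read cursor by offset within that partition.
class InMemoryRecordSupplier implements RecordSupplierSketch<Integer, Integer> {
    private final Map<Integer, List<byte[]>> partitions = new HashMap<>();
    private final Map<Integer, Integer> cursors = new HashMap<>();

    void addRecord(int partition, byte[] data) {
        partitions.computeIfAbsent(partition, p -> new ArrayList<>()).add(data);
    }

    @Override
    public void assign(Set<Integer> assigned) {
        for (Integer p : assigned) {
            cursors.putIfAbsent(p, 0);
        }
    }

    @Override
    public void seek(Integer partition, Integer sequenceNumber) {
        cursors.put(partition, sequenceNumber);
    }

    @Override
    public List<byte[]> poll(long timeoutMillis) {
        // Drain everything from each assigned partition's cursor onward,
        // preserving per-partition order (the property this abstraction relies on).
        List<byte[]> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> cursor : cursors.entrySet()) {
            List<byte[]> records = partitions.getOrDefault(cursor.getKey(), List.of());
            for (int i = cursor.getValue(); i < records.size(); i++) {
                out.add(records.get(i));
            }
            cursor.setValue(records.size());
        }
        return out;
    }

    @Override
    public void close() {
    }
}
```

The point of the sketch is the contract, not the storage: a Kafka implementation can delegate all three operations to `KafkaConsumer`, while a Kinesis implementation has to maintain its own buffer and cursors, which matches the description of KinesisRecordSupplier above.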