Elasticsearch connector #221
Conversation
```scala
final case class ElasticsearchSinkSettings(bufferSize: Int = 10)

final case class IncomingMessage[T](id: Option[String], source: T)
```
This class might not be necessary. Can't we just provide it as a special case of the `Typed` variant, giving it the `DefaultJsonProtocol.RootJsObjectFormat` (or, if that doesn't fit type-wise, a simple identity `JsonWriter`)?
Hmm... In my opinion, holding the id and the object (which is converted to the document JSON) separately is natural when using Elasticsearch. It would be possible to replace it with a tuple like `(Option[String], T)`.
Oh, I actually meant the class below. Do we need both `ElasticsearchSinkStage` and `ElasticsearchSinkStageTyped`? Isn't `ElasticsearchSinkStage` just `ElasticsearchSinkStageTyped[JsObject]`?
Got it! That's right.
Thanks a lot @takezoe. Sorry for keeping this open for a while. This looks quite good.
The main things that need to be fixed are these:
- No blocking calls in `GraphStage`s; blocking will interfere with the whole ecosystem, so either run these blocking calls on a dedicated dispatcher or use an asynchronous HTTP client like akka-http.
- Documentation is needed.
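The first point can be sketched without Akka: run blocking I/O on its own `ExecutionContext` so the shared default dispatcher is never blocked. A minimal plain-Scala sketch; `blockingRequest` and the pool size are hypothetical stand-ins for the real client call:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for a blocking call such as client.performRequest
def blockingRequest(payload: String): String = {
  Thread.sleep(10) // simulates blocking I/O
  s"indexed: $payload"
}

// A dedicated thread pool so blocking work never runs on the default dispatcher
val blockingEc = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

def runOnBlockingDispatcher(payload: String): Future[String] =
  Future(blockingRequest(payload))(blockingEc)

val result = Await.result(runOnBlockingDispatcher("doc-1"), 1.second)
blockingEc.shutdown()
```

The same idea applies inside Akka Streams via a dispatcher attribute; the sketch only illustrates the isolation principle.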
```scala
    |${convert(message.source).toString}""".stripMargin
}.mkString("", "\n", "\n")

client.performRequest(
```
Hmm, if this is a blocking call, it needs to be executed on a dedicated thread pool. If the `client` is just a simple HTTP client, it could make sense to use akka-http instead, which is fully asynchronous.
It is essentially Apache HttpComponents, but it adds handling of multiple nodes when accessing an Elasticsearch cluster via HTTP:
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_initialization.html
In addition, the `client` is supplied from outside the stage, so configuring it is the user's responsibility.
However, it seems possible to use `CloseableHttpAsyncClient`. Should I use that instead?
Yes, using an asynchronous client would be best. Make sure to deliver callbacks through `getAsyncCallback` as explained here: http://doc.akka.io/docs/akka/2.4.14/scala/stream/stream-customize.html#Using_asynchronous_side-channels
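As a rough plain-Scala analogy of what `getAsyncCallback` provides (no Akka dependency; all names below are hypothetical): external threads never mutate stage state directly, they only enqueue invocations, which the stage drains on its own single thread:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Toy single-threaded "stage". Only the stage's own thread touches `received`;
// other threads hand results over through a thread-safe mailbox.
final class ToyStage {
  private var received = List.empty[String]               // stage-local state
  private val mailbox = new ConcurrentLinkedQueue[String] // thread-safe hand-off

  // Safe to call from any thread, e.g. an HTTP client's response listener
  def asyncInvoke(msg: String): Unit = mailbox.add(msg)

  // Called only from the stage's own thread
  def drain(): Unit = {
    var msg = mailbox.poll()
    while (msg != null) {
      received = msg :: received
      msg = mailbox.poll()
    }
  }

  def state: List[String] = received
}

val stage = new ToyStage
val worker = new Thread(() => stage.asyncInvoke("response-1"))
worker.start()
worker.join()
stage.drain()
```

In a real `GraphStage`, `getAsyncCallback` plays the role of both mailbox and drain loop, so the stage logic itself needs no extra synchronization.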
```scala
override def onPush(): Unit = {
  val message = grab(in)
  buffer.addLast(message)
  if (buffer.size >= settings.bufferSize) {
```
This is an uncommon pattern: what happens if no new message comes in for a long time? Then the other requests will sit around in the buffer for an indefinite amount of time.
A better approach could be this:
- have two states:
  - `Idle`: no request in flight
  - `Buffering`: a request is in flight
- in `Idle` state, an incoming message is dispatched instantly, changing the state to `Buffering`
- in `Buffering` state, only as many messages are pulled in as fit into the internal buffer; otherwise the inlet is backpressured (i.e. not pulled)
- when the previous request completes, either the next batch is dispatched instantly if messages were buffered, or we go back to the `Idle` state

Does that make sense?
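The proposed protocol can be sketched as a plain state machine (a simulation under assumed names; `dispatch` stands in for the real HTTP request, and the `Boolean` from `onPush` models whether the inlet may be pulled again):

```scala
import scala.collection.mutable

sealed trait State
case object Idle extends State
case object Buffering extends State

// Two-state batching logic: dispatch instantly when idle, buffer (and
// backpressure once full) while a request is in flight.
final class BatchingLogic(bufferSize: Int, dispatch: Seq[String] => Unit) {
  private var state: State = Idle
  private val buffer = mutable.Queue.empty[String]

  // Returns true if the inlet may be pulled again; false means backpressure.
  def onPush(msg: String): Boolean = state match {
    case Idle =>
      dispatch(Seq(msg)); state = Buffering; true
    case Buffering =>
      buffer.enqueue(msg); buffer.size < bufferSize
  }

  // Called when the in-flight request completes.
  def onRequestComplete(): Unit =
    if (buffer.nonEmpty) {
      val batch = buffer.toList
      buffer.clear()
      dispatch(batch) // next batch goes out instantly; state stays Buffering
    } else state = Idle
}

val sent = mutable.Buffer.empty[Seq[String]]
val logic = new BatchingLogic(2, batch => sent += batch)
logic.onPush("a")         // Idle: dispatched immediately
logic.onPush("b")         // Buffering: buffered
logic.onPush("c")         // buffer now full -> backpressure
logic.onRequestComplete() // buffered batch dispatched
```

This ignores stream completion and failure, which the real stage would also have to handle.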
That's right. I will try to implement that pattern.
```scala
protected def convert(jsObj: JsObject): T

def receiveMessages(): Unit =
```
Similar to the Sink case, this call needs to be asynchronous. So either use a dedicated dispatcher for the `RestClient` or use akka-http directly.
```scala
setHandler(out,
  new OutHandler {
    override def onPull(): Unit = {
      if (buffer.isEmpty) {
```
This can be simplified using `emit` to push the complete list of elements received by `receiveMessages`.
Thanks! I will try that.
```scala
final case class OutgoingMessage[T](id: String, source: T)

final class ElasticsearchSourceStage(indexName: String,
```
Same as above, this can probably be provided by creating a `JsonReader[JsObject]` and then using the typed variant.
```scala
import spray.json._
import DefaultJsonProtocol._

class ElasticsearchSpec extends WordSpec with Matchers with BeforeAndAfterAll {
```
Nice tests.
```scala
    sentHandler.invoke(())
  }
}
```
Although I tried to move this entire part into an async callback, the callback wasn't invoked the second time. It works when only `tryPull()` is called in an async callback, like this. I suspect something called from that async callback might be blocking the graph, but I can't find the correct way. Hmm...
Hmm, that's still quite dangerous because you can run into all kinds of race conditions. The one I can see is the `state.set(Idle)` case: when the stage fills the buffer after the `val messages` in line 150 but before line 155 is executed, the state will be set to `Idle` but `tryPull` won't do anything, and the stream is locked because it will never be woken up again. Not very likely, but still possible.
Once you do all the handling in the async callback, you can remove all the other synchronization such as the `AtomicReference`s and the `ConcurrentLinkedHashmap`.
So, can you try to execute everything in the async callback? If it still blocks the graph as you say, can you post the code and maybe a stack trace gathered with `jstack <pid>` on the shell (or by any other means of gathering stack traces), so we can figure out what's going on?
```scala
s"/_search/scroll",
Map[String, String]().asJava,
new StringEntity(Map("scroll" -> "5m", "scroll_id" -> scrollId).toJson.toString),
new ResponseListener {
```
The `GraphStageLogic` could implement the `ResponseListener` directly. That would also remove the code duplication with the branch above.
```scala
setHandler(out,
  new OutHandler {
    override def onPull(): Unit =
      if (started == false) {
```
```scala
if (!started)
```
Looks good! Only small things to fix.
@jrudolph Finished fixing. In addition, I added
Sorry for the delay. The source part is looking very good now. It needs the backpressure fix I outlined.
I wonder if we could move the Flow/Sink part (which I haven't reviewed in depth yet) to another PR to unblock this one. WDYT?
```scala
jsObj.fields.get("error") match {
  case None => {
    val hits = jsObj.fields("hits").asJsObject.fields("hits").asInstanceOf[JsArray]
```
It might make sense here to model the result wrapper using spray-json as well?
Do you mean creating case classes to map the response from Elasticsearch?
Yes, exactly. Would that be feasible?
It would be possible. But Elasticsearch's REST client is improving rapidly. I expect it will eventually return Java models instead of JSON strings, just like the transport client (another kind of Elasticsearch client). So creating our own models now might be an overinvestment.
It's not really a big investment, and it would document the current view of how the API is structured, which fields are expected, etc.
Would it really be more than this:

```scala
case class Hit[T](_id: String, _source: T)
case class Response[T](hits: Seq[Hit[T]], _scroll_id: String)

object Protocol {
  import DefaultJsonProtocol._
  implicit def hitFormat[T: JsonFormat] = jsonFormat2(Hit.apply _)
  implicit def responseFormat[T: JsonFormat] = jsonFormat2(Response.apply _)
}
```

And then it's just:

```scala
val response = jsObj.convertTo[Response[T]]
scrollId = response._scroll_id
val messages = response.hits.map(h => OutgoingMessage(h._id, h._source))
```

But maybe I'm missing something ;)
This demands a `JsonFormat` for the Source even though essentially only a `JsonReader` is necessary. And hand-written readers for `Hit` and `Response` are not far from the current code. I'm not sure whether I should do it.
```scala
  OutgoingMessage(id, source.convertTo[T])
}
emitMultiple(out, messages)
sendScrollScanRequest()
```
AFAICS doing it like this would disable backpressure, because there's basically a loop `sendScrollScanRequest` -> `handleResponse` -> `sendScrollScanRequest` that immediately slurps the whole data source instead of waiting until the first results are drained by the stream and a new pull comes in.
Fortunately, it should be very simple to fix. Couldn't you remove the `sendScrollScanRequest` here and also remove the condition in `onPull` below? What should then happen is that when you use `emitMultiple`, your `OutHandler` below gets exchanged with an internal `EmitMultiple` handler that sends out the current results. Once all of those results are drained, your original `OutHandler` is swapped in again, which will then use `sendScrollScanRequest` on the next `onPull`.
That's reasonable. I will do so after fixing the source part.
@jrudolph Separating this pull request into source and flow/sink parts is reasonable, but separating the test cases might be wasteful... Since I'm not in a hurry, I can wait for your review!
I agree that those tests are easier to write with both the sink and the source available.
I reviewed the sink and added a few comments.
```scala
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
  new GraphStageLogic(shape) with ResponseListener {

    private val state = new AtomicReference[State](Idle)
```
No need for an `AtomicReference` inside a `GraphStage` if `asyncCallback` is used correctly.
```scala
new GraphStageLogic(shape) with ResponseListener {

  private val state = new AtomicReference[State](Idle)
  private val buffer = new util.concurrent.ConcurrentLinkedQueue[IncomingMessage[T]]()
```
It doesn't need to be concurrent if it lives inside a `GraphStage`.
```scala
}

private sealed trait State
```
Maybe put these in the companion object to keep the scope clean.
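A minimal sketch of that suggestion (the stage name below is a hypothetical stand-in): the state ADT lives in the companion object, keeping the logic's own namespace clean, and the class just imports it:

```scala
// State ADT scoped to the companion object rather than the logic body
object MyFlowStage {
  sealed trait State
  case object Idle extends State
  case object Buffering extends State
}

final class MyFlowStage {
  import MyFlowStage._

  private var state: State = Idle

  def startRequest(): Unit = state = Buffering
  def currentState: State = state
}

val stage = new MyFlowStage
```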
```scala
.fromGraph(
  new ElasticsearchFlowStage(indexName, typeName, client, settings)(DefaultJsonProtocol.RootJsObjectFormat)
)
.mapAsync(settings.parallelism)(identity)
```
The parallelism setting has no effect here, as it doesn't matter how many times you call `identity` concurrently. Just use `1` instead.
```scala
  case ex: Exception => failStage(ex)
}

setHandler(out, new OutHandler {
```
Just extend the `GraphStageLogic` with `InHandler with OutHandler`, put the handlers directly on the logic, and then use `setHandlers(in, out, this)`.
```scala
override val shape = FlowShape(in, out)

override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
  new GraphStageLogic(shape) with ResponseListener {
```
Could you add a comment that explains the basic workflow of this stage?
```scala
object ElasticsearchFlow {

  /**
```
It's unclear what the exact semantics of the flow are. How many responses do you get, one per source element or one per batch?
Reading the source code of the GraphStage, I see that it would be one `Response` per batch. Would there be anything you could do when a batch fails? Could it make sense to simplify and just make the GraphStage a sink that fails the flow when the request fails? Or maybe configure it with an exception handler (similar to a supervisor) that knows what to do when a single batch request fails?
The response of a bulk request contains results per command, so it's possible to handle errors if some commands in the bulk request failed. I fixed `ElasticsearchFlowStage` to fail when at least one command fails, but I'm not sure this is the best solution.
```scala
try {
  val json = messages
    .map { message =>
      s"""{"index": {"_index": "${indexName}", "_type": "${typeName}"${message.id
```
It might make sense to model these using spray-json as well; otherwise you will have to add escaping for all of those fields.
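To illustrate why the escaping matters: a raw interpolation of `indexName` into the bulk action line breaks as soon as the value contains a quote or backslash. A minimal hand-rolled escaper (spray-json's `JsString` rendering would handle this automatically); `indexAction` is a hypothetical helper, not code from this PR:

```scala
// Escape the characters that would corrupt a JSON string literal
def escapeJson(s: String): String =
  s.flatMap {
    case '"'  => "\\\""
    case '\\' => "\\\\"
    case '\n' => "\\n"
    case '\r' => "\\r"
    case '\t' => "\\t"
    case c    => c.toString
  }

// Bulk API action line with escaped field values
def indexAction(indexName: String, typeName: String): String =
  s"""{"index": {"_index": "${escapeJson(indexName)}", "_type": "${escapeJson(typeName)}"}}"""
```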
```scala
final case class IncomingMessage[T](id: Option[String], source: T)

class ElasticsearchFlowStage[T](
```
I think the basic approach here is sound.
Another thing that will be needed in the end is documentation, see https://github.com/akka/alpakka/blob/master/CONTRIBUTING.md#documentation.
Fixed the code commented on by @jrudolph. In addition, I updated the following as well:
- Support durable retry and recovery in flow and sink
- Flow passes failed messages to the following stage
- Make the Java API more useful
LGTM. There would probably be a few cosmetic changes possible but let's not delay this any further.
Thanks a lot @takezoe, great stuff. Sorry that it took such a long while to get it reviewed another time.
```scala
private var state: State = Idle
private val queue = new mutable.Queue[IncomingMessage[T]]()
private val failureHandler = getAsyncCallback[(Seq[IncomingMessage[T]], Throwable)](handleFailure)
```
I guess you could have used a single handler of type `getAsyncCallback[(Seq[IncomingMessage[T]], Try[Response])]` instead, but that's fine for now as well.
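A sketch of what that single-handler shape would look like, with a stand-in `Response` type and no Akka dependency (the real callback would run inside the stage via `getAsyncCallback`):

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for the HTTP client's response type
final case class Response(status: Int)

// One handler receives the batch plus a Try of the response, replacing
// separate success and failure callbacks.
def handleResult(batch: Seq[String], result: Try[Response]): String =
  result match {
    case Success(r)  => s"batch of ${batch.size} ok, status ${r.status}"
    case Failure(ex) => s"batch of ${batch.size} failed: ${ex.getMessage}"
  }
```

This keeps the success and failure paths in one place, which also makes it harder for one of them to forget to restart the pull cycle.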
```scala
}

sealed class ElasticsearchSourceLogic[T](indexName: String,
```
`final` instead of `sealed`?
fixes #99