
HDFS: Add sources and flows #965

Merged
merged 35 commits into from Jun 27, 2018

Conversation

@burakkose burakkose (Contributor) commented May 21, 2018

Ref: #557

In this pull request, you will find the initial work for HDFS support. The PR is a work in progress; there are still some TODOs:

  • Documentation (done)
  • Java Tests (done)
  • Support HDFS version 2.x (done)
  • pass-through (done)

While I am working on these, please review the code, suggest new functionality, and help with testing.

@burakkose burakkose changed the title HDFS/WIP: Add writer connector #557 HDFS/WIP: Add writer connector May 21, 2018
@burakkose burakkose force-pushed the hdfs-writer branch 2 times, most recently from 60fe76e to 3bbd57c on May 22, 2018 08:42
@ennru ennru left a comment (Member)

I had a first look. Very good stuff.
How should we think about the HDFS version? Would it work with 2.x?

* Copyright (C) 2016-2018 Lightbend Inc. <http://www.lightbend.com>
*/

package akka.stream.alpakka.hdfs
Member

It would be great to move internal stuff into an impl package. That would improve Java Module readiness.

* @param compressionCodec a class that encapsulates a streaming compression/decompression pair.
* @param settings Hdfs writing settings
*/
def compressed(
Member

To reduce the API, you could just have HdfsSink.data and let the users connect to Sink.ignore for the other cases.

Contributor Author

Sorry, I could not understand. HdfsSink.data, HdfsSink.compressed, and HdfsSink.sequence are completely different. What exactly do you mean by letting users connect to Sink.ignore?

Member

I think users needing these cases might as well use the HdfsFlow and connect it to Sink.ignore themselves.
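For illustration, a minimal sketch of that wiring (assuming the HdfsFlow.data factory from this PR, with fs, syncStrategy, rotationStrategy and settings already in scope in user code):

// Hypothetical user code: write with the flow and discard the rotation messages.
HdfsFlow
  .data(fs, syncStrategy, rotationStrategy, settings)
  .to(Sink.ignore)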

Contributor Author

Oh, then can we remove the Sink implementations?

Member

Yes, just keep the most basic ones. Most users discover they'll need to do something after sending/writing to the destination.
Speaking of that, what we most often need is some kind of pass-through: a value untouched by the flow, but available afterwards (e.g. Kafka offsets for committing). Have you thought about that? It would require a wrapper for Writable, AFAICS.

Contributor Author

Actually, I was thinking about a pass-through like the Solr connector's. I was just not sure how to implement it. It will be added soon.

@burakkose burakkose (Contributor Author) May 23, 2018

I would like to consult you about this. I have been thinking about the best strategy for it. Currently, the flow expects ByteString as input. Whenever it rotates the output, it pushes a WriteLog downstream.

If we want to implement pass-through, how should we design it? The simplest way is

final case class WriteLog[T](path: String, rotation: Int, passThroughs: Seq[T])

However, if we have millions of inputs, keeping this sequence in memory until the flow rotates is very inefficient.

The second idea is

sealed trait OutgoingMessage
final case class RotationMessage(path: String, rotation: Int) extends OutgoingMessage
final case class PassThrough[T](pass: T)  extends OutgoingMessage

So the flow can push a RotationMessage when it rotates, and for everything else it pushes a PassThrough. However, this has a drawback. Take the Kafka example: we write the input to the output and send a PassThrough message with an offset. If something goes wrong in the flow and it does not sync the output, we will fail, but the downstream may already have committed this offset.

I do not really like this idea either. Do you have any suggestions?

Seq(
libraryDependencies ++= Seq(
"org.apache.hadoop" % "hadoop-client" % hadoopVersion, // ApacheV2
"org.typelevel" %% "cats-core" % catsVersion, // MIT,
Member

You might upgrade to Cats 1.1.0.

@ennru ennru added the p:new label May 22, 2018
@burakkose (Contributor Author)

I am not sure about the HDFS version. How can we also provide support for 2.x? Is there an example of this in other Alpakka connectors?

@ennru ennru (Member) commented May 22, 2018

No, there is no example right now. One way of doing it would be to make the HDFS dependency optional and let users add their version explicitly.
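A rough sketch of that approach in the connector's build.sbt (hypothetical, not what this PR currently does): mark the Hadoop client as Provided so users must add their own version.

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion % Provided // ApacheV2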

@burakkose burakkose force-pushed the hdfs-writer branch 2 times, most recently from 3515734 to b687ad5 on May 22, 2018 21:35
@burakkose (Contributor Author)

I have rearranged the project structure with an impl package. Shall we move FilePathGenerator, RotationStrategy, and SyncStrategy to model.scala?

Can you also check commit b687ad5? I realized that we need to use japi.Pair for Java.

@ennru ennru (Member) commented May 23, 2018

The build error is an extra comma in HdfsWriterSpec.scala:463, which is not supported in Scala 2.11.

@ennru ennru (Member) commented May 23, 2018

We build on Scala 2.11 and 2.12. Since 2.12, you are allowed to have a trailing comma, as in

Seq(
  1,
  2,
)

and that happens to be present in HdfsWriterSpec.scala:463.

@burakkose (Contributor Author)

Yes, my fault for committing late at night :)

@burakkose (Contributor Author)

I had a comment in the now-outdated discussion, so I am posting it again here as a new discussion.

I would like to consult you about this. I have been thinking about the best strategy for it. Currently, the flow expects ByteString as input. Whenever it rotates the output, it pushes a WriteLog downstream.

If we want to implement pass-through, how should we design it? The simplest way is

final case class WriteLog[T](path: String, rotation: Int, passThroughs: Seq[T])

However, if we have millions of inputs, keeping this sequence in memory until the flow rotates is very inefficient.

The second idea is

sealed trait OutgoingMessage
final case class RotationMessage(path: String, rotation: Int) extends OutgoingMessage
final case class PassThrough[T](pass: T)  extends OutgoingMessage

So the flow can push a RotationMessage when it rotates, and for everything else it pushes a PassThrough. However, this has a drawback. Take the Kafka example: we write the input to the output and send a PassThrough message with an offset. If something goes wrong in the flow and it does not sync the output, we will fail, but the downstream may already have committed this offset.

I do not really like this idea either. Do you have any suggestions?

@burakkose (Contributor Author)

Java tests have been added, and some of the API has been simplified for Java usage.

@ennru ennru (Member) commented May 29, 2018

Emitting a message for every incoming element is the only reasonable way.
The user may accumulate pass-throughs if needed. With Kafka the offsets would be committed on a Rotation/Write message.
An alternative could be to have a type

case class HdfsWritten[T](passThrough: T, status: Option[RotationMessage])
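To make the accumulate-and-commit idea concrete, here is a hedged sketch against the OutgoingMessage / PassThrough / RotationMessage hierarchy proposed above; the commit callback is a hypothetical user function, not part of the connector:

import akka.NotUsed
import akka.stream.scaladsl.Flow

// Hypothetical downstream user code: buffer pass-throughs (e.g. Kafka offsets)
// and hand them to a commit callback only once a RotationMessage signals that
// the corresponding output has been rotated and synced.
def commitOnRotation[T](commit: Seq[T] => Unit): Flow[OutgoingMessage, Unit, NotUsed] =
  Flow[OutgoingMessage].statefulMapConcat { () =>
    var pending = Vector.empty[T]
    (msg: OutgoingMessage) =>
      msg match {
        case p: PassThrough[T @unchecked] =>
          pending :+= p.pass
          Nil
        case _: RotationMessage =>
          val batch = pending
          pending = Vector.empty
          List(commit(batch))
      }
  }

With Kafka offsets, the commit step would typically be asynchronous (mapAsync), but the shape of the idea is the same.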

@burakkose burakkose changed the title HDFS/WIP: Add writer connector HDFS: Add writer connector Jun 2, 2018
@burakkose (Contributor Author)

@ennru, did you have time to take a first look at the pass-through?

@burakkose (Contributor Author)

Here is an update on the different versions. I have tested from 2.x to 3.x, and the tests passed successfully. Moreover, I published the library locally and overrode the Hadoop version with 2.6 (because we use Hadoop 2.6); data ingestion was smooth. It looks like we do not have any problems with different versions. I added text to the documentation that mentions the default version and how to override it.
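For reference, overriding the Hadoop version from a user's build then only requires adding the desired hadoop-client next to the connector dependency; a sketch (artifact name and versions illustrative):

libraryDependencies ++= Seq(
  "com.lightbend.akka" %% "akka-stream-alpakka-hdfs" % "0.20",
  "org.apache.hadoop"  %  "hadoop-client"            % "2.6.5" // overrides the connector's default Hadoop version
)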

@ennru ennru left a comment (Member)

This looks great!
You asked earlier about it: classes that belong to the API may not be hidden in the impl package.
I wonder if making the rotation strategy extendable would be important; some might want to implement a combined time/size rotation strategy.


import scala.concurrent.duration.FiniteDuration

sealed trait RotationStrategy extends Strategy {
Member

At least the RotationStrategy interface should be in the public part of the module, as it is part of the API.
It might make sense to have it extendable.

@burakkose burakkose changed the title HDFS: Add writer connector HDFS: Add sources and flows Jun 9, 2018
@burakkose (Contributor Author)

I have added Sources and made RotationStrategy and SyncStrategy extendable.

@ennru ennru left a comment (Member)

I'm really impressed, this is great work.


### Compressed Data Writer

First, create `CompressionCodec`.
Member

By adding

"javadoc.org.apache.hadoop.base_url" -> s"https://hadoop.apache.org/docs/r${hadoopVersion}/api/",

in build.sbt you can create links to Hadoop's API via @javadoc.


`FilePathGenerator` provides functionality to generate rotation paths in HDFS.
@scala[@scaladoc[FilePathGenerator](akka.stream.alpakka.hdfs.FilePathGenerator$).]
@java[@scaladoc[FilePathGenerator](akka.stream.alpakka.hdfs.FilePathGenerator$).]
Member

No need to use @scala/@java when linking to the same class.

/**
* Internal API
*/
private[hdfs] final class HdfsFlowStage[W, I, C](
Member

Please also add @akka.annotation.InternalApi, as private[hdfs] doesn't protect against Java users using it.
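Applied to the class quoted above, that would look roughly like this (constructor parameters elided as in the diff):

/**
 * Internal API
 */
@akka.annotation.InternalApi
private[hdfs] final class HdfsFlowStage[W, I, C](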

final case class IncomingMessage[T, C](source: T, passThrough: C)

object IncomingMessage {
// Apply method to use when not using passThrough
Member

Make these doc comments, please.

HdfsWritingSettings()
}

final case class IncomingMessage[T, C](source: T, passThrough: C)
Member

I'm not super happy with the IncomingMessage name used in several connectors. It can easily become messy when using several connectors in the same codebase. And "incoming" is tied to the point of view of the stage; from the user's point of view, the data leaves...
Maybe HdfsWriteMessage or HdfsData?

* @param rotationStrategy rotation strategy
* @param settings hdfs writing settings
*/
def dataWithPassThrough[C](
Member

Would P make an easier type parameter name?


sealed abstract class OutgoingMessage[+T]
final case class RotationMessage(path: String, rotation: Int) extends OutgoingMessage[Nothing]
final case class WrittenMessage[T](passThrough: T, inRotation: Int) extends OutgoingMessage[T]
Member

Document the inRotation value.


private[writer] object HdfsWriter {

val NewLineByteArray: Array[Byte] = ByteString(System.getProperty("line.separator")).toArray
Member

Is it useful to use the system's separator? It could be provided via the settings instead.


override def preStart(): Unit = {
// Schedule timer to rotate output file
initialRotationStrategy match {
Member

This should be expressed by something in RotationStrategy instead of matching on the type, so it becomes extendable.

Contributor Author

Do you have any hints about this? How can I have a strategy in RotationStrategy for preStart? I would consider passing stateLogic as a parameter and calling schedule there, but these methods are not visible. How can I trigger schedule in RotationStrategy?

@ennru ennru (Member) Jun 18, 2018

You could add an interval: Option[FiniteDuration] to it and use foreach in preStart to schedule the poll.
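A minimal sketch of that suggestion (names and shapes hypothetical): the strategy exposes an optional interval, and preStart schedules the timer only when one is defined, without matching on the concrete strategy type.

sealed trait RotationStrategy {
  // None for count/size-based strategies, Some(interval) for time-based rotation
  def interval: Option[FiniteDuration]
}

// inside the HdfsFlowStage logic (a TimerGraphStageLogic):
override def preStart(): Unit =
  initialRotationStrategy.interval.foreach(d => schedulePeriodically("rotation", d))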

Contributor Author

I did not like the idea of having an optional field in the rotation strategy and adding logic around it. Instead, I shared the scheduling API with the implementation package and use it in a new preStart method of RotationStrategy. Please check e6135a7, and if you do not like it, I will figure out something else.

* Java API: creates a Flow with [[akka.stream.alpakka.hdfs.impl.HdfsFlowStage]]
* for [[org.apache.hadoop.fs.FSDataOutputStream]]
*
* @param fs file system
Member

Hadoop file system

@2m 2m left a comment (Member)

Looking very good! The only larger change I would like to see is to add the IODispatcher attributes where necessary.

private static MiniDFSCluster hdfsCluster = null;
private static ActorSystem system;
private static ActorMaterializer materializer;
private static String destionation = JavaTestUtils.destination();
Member

s/destionation/destination

class HdfsReaderSpec extends WordSpecLike with Matchers with BeforeAndAfterAll with BeforeAndAfterEach {

private var hdfsCluster: MiniDFSCluster = _
private val destionation = "/tmp/alpakka/"
Member

s/destionation/destination

private static MiniDFSCluster hdfsCluster = null;
private static ActorSystem system;
private static ActorMaterializer materializer;
private static String destionation = JavaTestUtils.destination();
Member

s/destionation/destination

class HdfsWriterSpec extends WordSpecLike with Matchers with BeforeAndAfterAll with BeforeAndAfterEach {

private var hdfsCluster: MiniDFSCluster = _
private val destionation = "/tmp/alpakka/"
Member

s/destionation/destination

private val out = Outlet[OutgoingMessage[C]](Logging.simpleName(this) + ".out")

override val shape: FlowShape[HdfsWriteMessage[I, C], OutgoingMessage[C]] = FlowShape(in, out)

Member

As the underlying writing abstraction is java.io.OutputStream, which is blocking, we have to signal to the Akka Streams materializer that it should run HdfsFlowStage on a separate dispatcher. This allows other parts of the stream to continue unimpacted while this stage blocks a thread during write operations.

Therefore add the following here:

override def initialAttributes: Attributes =
  super.initialAttributes and ActorAttributes.IODispatcher

Contributor Author

Thank you for this comment. While I was running it, I was also profiling the app. I guess this is a nice improvement.

}
.takeWhile(_._1)
.map(_._2)
Source.fromIterator(() => it)
Member

I assume that SequenceFile.Reader's next operation is blocking as well. We will then have to put this source on a separate dispatcher as well: .addAttributes(Attributes(ActorAttributes.IODispatcher)).

codec: CompressionCodec,
chunkSize: Int = 8192
): Source[ByteString, Future[IOResult]] =
StreamConverters.fromInputStream(() => codec.createInputStream(fs.open(path)), chunkSize)
Member

A source created by fromInputStream already runs on the IO dispatcher, therefore it is fine here.

@burakkose (Contributor Author)

@2m, @ennru, did you have a chance to review the latest changes?

Source.fromIterator(() => it)
Source
.fromIterator(() => it)
.addAttributes(Attributes(IODispatcher))
Member

Use ActorAttributes.IODispatcher here as well since the DefaultAttributes.IODispatcher is in the impl package.

@2m 2m (Member) commented Jun 26, 2018

Thank you for the ping. Looking very good. Just one last nitpick and we are good to merge.

@burakkose (Contributor Author)

Done

@2m 2m left a comment (Member)

Awesome work @burakkose!

@2m 2m merged commit 224a180 into akka:master Jun 27, 2018
@2m 2m added this to the 0.20 milestone Jun 27, 2018
@2m 2m mentioned this pull request Jun 27, 2018
@2m 2m added p:hdfs and removed p:new labels Jun 27, 2018