Sink and Source for Apache Parquet #1131

Merged: 23 commits, Aug 31, 2018

Conversation

dannylesnik (Contributor)

This commit contains Scala and Java DSLs for a Sink and a Source. I don't think there is a need to create a Flow stage as well, since I can't find a use case for it.

It contains unit tests for both DSLs.

@juanjoDiaz (Contributor)

I haven't gone through this in depth, but this PR seems very similar to #720 and I can see some of the same issues that I highlighted there:

I don't think that we should create connectors for specific file formats.
This should be a flow to format the content and then use the File sink to write to disk.

Along the same lines, I don't think that we should have a dependency on Hadoop at all. We should only have a dependency on Avro-Parquet.
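
A hedged sketch of the design described above, not actual Alpakka API: a formatting flow turns records into ByteString and the generic file sink writes them to disk. Record, the CSV formatting and the output path are illustrative placeholders.

import java.nio.file.Paths

import akka.NotUsed
import akka.stream.scaladsl.{FileIO, Flow, Sink}
import akka.util.ByteString

final case class Record(key: String, value: String)

// Format each record as a line of text...
val format: Flow[Record, ByteString, NotUsed] =
  Flow[Record].map(r => ByteString(s"${r.key},${r.value}\n"))

// ...and terminate the stream with the generic file sink instead of a format-specific one.
val toDisk: Sink[Record, NotUsed] =
  format.to(FileIO.toPath(Paths.get("records.csv")))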

Adding missing configuration to build.sbt
@dannylesnik (Contributor, Author)

@juanjoDiaz
@juanjoDiaz
Thank you for the feedback. Some points:

  • In my commit I also provided a Sink and a Source, so I believe I extended WIP - Parquet sink for Alpakka #720 with a Source implementation as well.

  • You mention that there should be a formatting flow with the File sink. In my opinion that is a fair approach, but don't forget that ParquetReader and ParquetWriter can read from and write to HDFS directly (we are using it this way on our production systems), so my Sink and Source can work directly with files located in HDFS.
    This is how we are using it in production (see the sketch after this list):
    1) The Source is a query to Elasticsearch with elastic4s; the Sink is a Parquet file located directly on HDFS.
    2) Run a Spark job on this data, which generates Parquet output files.
    3) Use the Parquet output files from HDFS as the Source, and elastic4s as the Sink to stream the job results from HDFS back into Elasticsearch.
    Using the Parquet Source and Sink, we can implement this functionality in a few lines of code.

  • And lastly, I don't depend on Hadoop: in my commit the Hadoop artifacts are in Test scope only. I use the Avro-Parquet library as the only dependency.
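
A hedged sketch of the production pipeline described in the second point (steps 1 and 3). AvroParquetSink and AvroParquetSource are the stages proposed in this PR; their imports are omitted because the final package was still under review, and the HDFS URIs, the schema and the stand-in source/sink are illustrative.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroParquetWriter}

implicit val system: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()

val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}"""
)

// Step 1: stream query results (already mapped to GenericRecord) into a Parquet file on HDFS.
val fromElastic: Source[GenericRecord, _] = Source.empty // stand-in for the elastic4s query
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("hdfs://namenode/events/events.parquet")) // illustrative URI
  .withSchema(schema)
  .build()
fromElastic.runWith(AvroParquetSink(writer)) // Sink proposed in this PR

// Step 3: read the Spark job's Parquet output back and stream it towards Elasticsearch.
val reader = AvroParquetReader
  .builder[GenericRecord](new Path("hdfs://namenode/events/output.parquet")) // illustrative URI
  .build()
AvroParquetSource(reader).runWith(Sink.foreach(println)) // stand-in for the elastic4s sink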

@ennru (Member) commented Aug 17, 2018

I see you have trouble with Paradox; we recently changed to absolute paths. You can remove the $alpakka$ prefix.

@ennru ennru added the p:new label Aug 17, 2018
@dannylesnik (Contributor, Author)

@ennru Thank you for your help. This issue is fixed. Now I have some weird FTP unit test failure.

@ennru (Member) commented Aug 20, 2018

Looks very promising!

Without looking into the details, I'd like to ask you to change a few things in the structure:

  • move internal classes into an impl package
  • move the example code to docs.scaladsl and docs.javadsl (to surface any visibility problems)
  • add the module to .travis.yml
  • akka-stream, junit and akka-stream-testkit dependencies come in from Common
  • please annotate other dependencies with their license

@dannylesnik (Contributor, Author)

Hi @ennru

I've just made the changes you asked for.

@ennru (Member) left a comment

Looking good.
Would there be a way to create a writer which emits ByteString? Creating a Flow to emit data to be written or sent via other technologies would be useful, as @juanjoDiaz pointed out.

Even if the Parquet API doesn't support that case, I'd like to see a Flow complementing the Sink, maybe just emitting Done when the record is written.
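
A minimal sketch of such a flow, assuming a ParquetWriter is passed in; this is not the PR's implementation, just the shape of a pass-through stage that emits each record only after it has been written:

import akka.NotUsed
import akka.stream.scaladsl.Flow
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetWriter

def writeAndPass(writer: ParquetWriter[GenericRecord]): Flow[GenericRecord, GenericRecord, NotUsed] =
  Flow[GenericRecord].map { record =>
    writer.write(record) // blocking; a real stage should run on the IO dispatcher
    record
  }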


override def onUpstreamFailure(ex: Throwable): Unit = {
  super.onUpstreamFailure(ex)
  writer.close()
Member:

Please free resources in postStop. See #277
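
A minimal sketch of the suggested change, assuming writer is the stage's ParquetWriter; postStop runs whether the stage completes, fails or is cancelled, so the resource is released in one place:

override def postStop(): Unit = writer.close()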

Contributor Author:

Done.

import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetWriter

private[avroparquet] class AvroParquetFlow(writer: ParquetWriter[GenericRecord])
Member:

Add @InternalApi and a ScalaDoc comment stating INTERNAL API to all internal classes.
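
A sketch of the requested marker, using Akka's akka.annotation.InternalApi annotation on the class shown above:

import akka.annotation.InternalApi

/**
 * INTERNAL API
 */
@InternalApi
private[avroparquet] class AvroParquetFlow(writer: ParquetWriter[GenericRecord])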

Contributor Author:

Done.

writer.close()
}

@scala.throws[Exception](classOf[Exception])
Member:

Does the annotation add any value?

Contributor Author:

Done


override def onDownstreamFinish(): Unit = {
  super.onDownstreamFinish()
  reader.close()
Member:

Use postStop here, as well.

Contributor Author:

Done

@dannylesnik (Contributor, Author)

Hi @ennru.

I committed all the changes you requested in your code review.

Regarding your suggestion: I don't think that ByteString is the correct abstraction, since this is an Avro column record and I need the schema to manipulate it.
Working heavily with Akka Streams and the Hadoop ecosystem, I believe that 99 percent of the usage of this code will be storing events (which are typically case classes or raw JSONs fetched from a DB, received from an HTTP layer, or even from an actor publisher) on HDFS (or any other distributed storage), running a Map-Reduce job on them that stores its result in Parquet, and moving the output back into the system. I don't believe we need any other format on the way from a schema-based Parquet record to a case class.

What I suggest might be useful is to make AvroParquetFlow a public API and use it as .via(FlowStage[GenericRecord, GenericRecord]) in cases where storing to Parquet is not the last stage of the stream.
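
A hedged sketch of that usage; AvroParquetFlow is the flow stage proposed here, writer is assumed to be a ParquetWriter[GenericRecord], and the surrounding source and sink are stand-ins:

import akka.stream.scaladsl.{Sink, Source}
import org.apache.avro.generic.GenericRecord

val records: Source[GenericRecord, _] = Source.empty // stand-in for the real upstream
records
  .via(AvroParquetFlow(writer))   // write every record to Parquet...
  .runWith(Sink.foreach(println)) // ...and keep processing it downstream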

@ennru (Member) commented Aug 27, 2018

Ok, makes sense. Having access to a flow is the most important bit.

My advice for the documentation snippets was not clear enough. Please move the snippets to /avroparquet/src/test/java/docs/javadsl and /avroparquet/src/test/scala/docs/scaladsl.

You need to add Parquet to connectors.md to get it listed.

@dannylesnik (Contributor, Author)

  • Created public Java and Scala DSLs for the AvroParquetFlow API
  • Added a Spec for AvroParquetFlow
  • Added a documentation section for AvroParquetFlow to the Paradox docs
  • Added Parquet to connectors.md
  • Moved Examples.java to /avroparquet/src/test/java/docs/javadsl

@ennru (Member) left a comment

Please configure logging to be sent to a file instead (see other modules).

Have you tried the generated documentation? To get the current Paradox plugin, you'd need to re-base your branch.

You may move all API tests to docs.javadsl and docs.scaladsl. This makes sure all the API is accessible.

Please explain what the code snippets do in the docs.

Make sure the imports are part of the snippets (too many classes are called Path). You may use the same snippet tag multiple times in a source file to get everything into one snippet.
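
A sketch of the snippet-tag convention referred to here, assuming a tag named init-writer and placeholders file and schema: the same tag can open and close several regions of a test source file, and Paradox stitches them into one snippet, so the imports travel with the code.

// #init-writer
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter
// #init-writer

// ... other test code ...

// #init-writer
val writer: ParquetWriter[GenericRecord] =
  AvroParquetWriter.builder[GenericRecord](new Path(file)).withSchema(schema).build()
// #init-writer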


override def onPush(): Unit = {
  val obtainedValue = grab(in)
  writer.write(obtainedValue)
Member:

As the write operation in most cases will be blocking, the Parquet stages should use Akka's IODispatcher.

Please add this to both flow and source stages:

  import akka.stream.{ActorAttributes, Attributes}

  override protected def initialAttributes: Attributes =
    super.initialAttributes and ActorAttributes.IODispatcher

@danelkotev (Contributor), Aug 28, 2018:

@ennru Hi, can you give a brief explanation of what the IODispatcher is and why it's necessary here? I believe it provides a separate dispatcher for I/O, configured through akka.stream.blocking-io-dispatcher. If this is the case, it should be added to both the Sink and the Source.

Member:

Yes, the IODispatcher can be configured centrally in Akka. Its size may need to be adapted if there are many blocking things happening in your Actor system. And yes, it should be selected for all stages executing blocking IO.
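
A hedged sketch of overriding that choice for a single stage if the shared blocking-IO pool is not enough; writer is assumed to be in scope, and the dispatcher name is illustrative and must exist in the configuration:

import akka.stream.ActorAttributes
import akka.stream.scaladsl.Sink
import org.apache.avro.generic.GenericRecord

// Run the blocking writes on a dedicated dispatcher instead of the default blocking-IO one.
val tunedSink = Sink.foreach[GenericRecord](record => writer.write(record))
  .withAttributes(ActorAttributes.dispatcher("custom-blocking-dispatcher"))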

@dannylesnik (Contributor, Author)

  • Logging configured

  • All API unit tests moved to docs.javadsl and docs.scaladsl

  • Added more explanation and imports in documentation

  • Added IO Dispatcher to Flow and Sink.

  • generated avroparquet.html using code/paradox. Looks OK.

@ennru (Member) left a comment

We're closing in...

if (entries != null) {
  for (String s : entries) {
    File currentFile = new File(index.getPath(), s);
    currentFile.delete();
Member:

Did you consider using File.createTempFile and File.deleteOnExit?

Contributor Author:

File.createTempFile can't work here because I'm creating the Hadoop Path directly from a path String; however, File.deleteOnExit works perfectly here. Thanks.
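
A minimal sketch of the approach described, with an illustrative folder name: the JVM File handle is registered for cleanup on exit, while the Hadoop Path the Parquet writer needs is built from the same path string.

import java.io.File
import org.apache.hadoop.fs.Path

val folder = new File("./parquet-test-output") // illustrative test output folder
folder.mkdirs()
folder.deleteOnExit()
val outputPath = new Path(folder.getPath + "/test.parquet")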

@@ -0,0 +1,13 @@
<configuration>
<appender name="FILE" class="ch.qos.logback.core.FileAppender">
<file>target/files.log</file>
Member:

The log file should be named after the module, i.e. avroparquet.log.

@@ -8,6 +8,7 @@
* [Apache Geode](geode.md)
* [Apache Kafka](kafka.md)
* [Apache Kudu](kudu.md)
* [Apache Parquet](avroparquet.md)
Member:

Was it intentional to use "Apache" instead of "Avro"? Both are Apache projects, but it would be better to follow the module name.

@dannylesnik (Contributor, Author)

Changes committed.

@dannylesnik (Contributor, Author)

@ennru Changes committed.

@ennru (Member) left a comment

LGTM.

@ennru ennru merged commit 84db4a5 into akka:master Aug 31, 2018
@ennru (Member) commented Aug 31, 2018

Thank you for your contribution! Keep them coming.

@ennru ennru added this to the 0.21 milestone Aug 31, 2018
@dannylesnik dannylesnik deleted the avroparquet branch August 31, 2018 14:30
sebastianharko pushed a commit to sebastianharko/alpakka that referenced this pull request Sep 5, 2018
dannylesnik added a commit to dannylesnik/alpakka that referenced this pull request Sep 8, 2018
@dannylesnik dannylesnik restored the avroparquet branch September 10, 2018 14:50