Sink and Source for Apache Parquet #1131
Conversation
Adding missing configuration to build.sbt
I haven't gone through this in depth, but this PR seems very similar to #720 and I can see some of the same issues that I highlighted there. Along the same lines, I don't think that we should have a dependency on Hadoop at all; we should only have a dependency on Avro-Parquet.
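For illustration, a dependency along those lines might look like this in `build.sbt` (the artifact coordinates are real, but the version shown is an assumption, not taken from this PR):

```scala
// Sketch: depend on parquet-avro only, rather than pulling in full Hadoop.
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.10.0"
```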
@juanjoDiaz
Adding missing configuration to build.sbt
I see you have trouble with Paradox; we recently changed to absolute paths. You can remove the […]
@ennru Thank you for your help. This issue is fixed. Now I have some weird FTP unit test failure.
Looks very promising! Without looking into the details, I'd like to ask you to change a few things in the structure.
Hi @ennru, I've just made the changes as you asked.
Looking good.
Would there be a way to create a writer which emits ByteString? Creating a Flow to emit data to be written or sent via other technologies would be useful, as @juanjoDiaz pointed out. Even if the Parquet API doesn't support that case, I'd like to see a Flow complementing the Sink, maybe just emitting Done when the record is written.
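One possible shape for such a flow, purely as a sketch (the element type mirrors this PR, but the signature and the choice to re-emit the record are assumptions):

```scala
import akka.NotUsed
import akka.stream.scaladsl.Flow
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetWriter

// Sketch: write each record and emit it downstream, so the write can be
// composed into a larger stream instead of terminating it in a Sink.
def flow(writer: ParquetWriter[GenericRecord]): Flow[GenericRecord, GenericRecord, NotUsed] =
  Flow[GenericRecord].map { record =>
    writer.write(record) // blocking write; see the dispatcher discussion below
    record
  }
```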
```scala
override def onUpstreamFailure(ex: Throwable): Unit = {
  super.onUpstreamFailure(ex)
  writer.close()
}
```
Please free resources in postStop. See #277.
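A minimal sketch of what that could look like inside the stage logic (assuming the logic holds the writer reference, as in this PR):

```scala
// Sketch: postStop runs when the stage stops for any reason (completion,
// failure, or cancellation), making it a safe place to release the writer.
override def postStop(): Unit =
  try writer.close()
  finally super.postStop()
```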
Done.
```scala
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetWriter

private[avroparquet] class AvroParquetFlow(writer: ParquetWriter[GenericRecord])
```
Add @InternalApi and a ScalaDoc comment stating INTERNAL API to all internal classes.
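Applied to the class in this diff, that convention would look like the following (akka.annotation.InternalApi is Akka's real annotation; the class is from this PR):

```scala
import akka.annotation.InternalApi
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetWriter

/**
 * INTERNAL API
 */
@InternalApi
private[avroparquet] class AvroParquetFlow(writer: ParquetWriter[GenericRecord])
```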
Done.
```scala
  writer.close()
}

@scala.throws[Exception](classOf[Exception])
```
Does the annotation add any value?
Done
```scala
override def onDownstreamFinish(): Unit = {
  super.onDownstreamFinish()
  reader.close()
}
```
Use postStop here, as well.
Done
Hi @ennru. I've committed all the changes you requested in your code review. Regarding your suggestion: I don't think that ByteString is the correct abstraction, since this is an Avro columnar record and I need the schema to manipulate it. What I suggest might be useful is to make the flow stage available as well.
Ok, makes sense. Having access to a flow is the most important bit. My advice for the documentation snippets was not clear enough. Please move the snippets to […]. You need to add Parquet in […].
* Please configure logging to be sent to a file instead (see other modules).
* Have you tried the generated documentation? To get the current Paradox plugin, you'd need to re-base your branch.
* You may move all API tests to docs.javadoc and docs.scaladoc. This makes sure all API is accessible.
* Please explain what the code snippets do in the docs.
* Make sure the imports are part of the snippets (too many classes are called Path). You may use the same snippet tag multiple times in a source file to get everything into one snippet; a sketch of the mechanism follows below.
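For reference, this is roughly how Paradox snippets are wired up (the file path, tag name, and helper are hypothetical): markers in the source delimit what gets extracted, and the docs page pulls the tagged region in with a `@@snip` directive.

```scala
// In a test source file, e.g. AvroParquetSinkSpec.scala (hypothetical):

// #sink-example
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetWriter

// Keep the imports inside the markers so they appear in the docs.
def parquetSink(writer: ParquetWriter[GenericRecord]) =
  AvroParquetSink(writer) // sink introduced in this PR
// #sink-example

// The docs page then embeds the tagged region with (hypothetical path):
// @@snip [AvroParquetSinkSpec.scala](/avroparquet/src/test/scala/docs/scaladsl/AvroParquetSinkSpec.scala) { #sink-example }
```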
```scala
override def onPush(): Unit = {
  val obtainedValue = grab(in)
  writer.write(obtainedValue)
}
```
As the write operation in most cases will be blocking, the Parquet stages should use Akka's IODispatcher. Please add this to both flow and source stages:

```scala
override protected def initialAttributes: Attributes =
  super.initialAttributes and ActorAttributes.IODispatcher
```
@ennru Hi, can you give a brief explanation of what the IODispatcher is and why it's necessary here? I believe it provides a separate dispatcher for I/O, configured through akka.stream.blocking-io-dispatcher. If this is the case, it should be added to both Sink & Source.
Yes, the IODispatcher can be configured centrally in Akka. Its size may need to be adapted if there are many blocking things happening in your actor system. And yes, it should be selected for all stages executing blocking IO.
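For illustration, a user can also route an individual blocking stage onto a different dispatcher; a sketch (the dispatcher name is hypothetical and would have to be defined in application.conf):

```scala
import akka.stream.{ActorAttributes, Materializer}
import akka.stream.scaladsl.{Sink, Source}
import org.apache.avro.generic.GenericRecord

// Sketch: override the dispatcher for one blocking stage only; the name
// "custom-blocking-dispatcher" is assumed to exist in the configuration.
def writeOnCustomDispatcher(records: Source[GenericRecord, _],
                            sink: Sink[GenericRecord, _])(implicit mat: Materializer) =
  records.runWith(
    sink.withAttributes(ActorAttributes.dispatcher("custom-blocking-dispatcher")))
```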
We're closing in...
```java
if (entries != null) {
  for (String s : entries) {
    File currentFile = new File(index.getPath(), s);
    currentFile.delete();
  }
}
```
Did you consider using File.createTempFile and File.deleteOnExit?
File.createTempFile can't work here, because I'm creating the Hadoop Path directly from a path String; however, File.deleteOnExit works perfectly here. Thanks.
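A sketch of that combination (the path and names are made up): create a plain java.io.File so deleteOnExit can be registered, and derive the Hadoop Path from the same path string.

```scala
import java.io.File
import org.apache.hadoop.fs.Path

// Sketch: register the test output for deletion on JVM exit, then build
// the Hadoop Path from the same path string.
val dataFile = new File("target/test-output.parquet") // hypothetical location
dataFile.deleteOnExit()
val hadoopPath = new Path(dataFile.getPath)
```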
```xml
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>target/files.log</file>
```
The log file should be named after the module: avroparquet.log.
docs/src/main/paradox/connectors.md

```
* [Apache Geode](geode.md)
* [Apache Kafka](kafka.md)
* [Apache Kudu](kudu.md)
* [Apache Parquet](avroparquet.md)
```
Was it intentional to use "Apache" instead of "Avro"? Both are Apache projects, but it would be better to follow the module name.
Changes committed.
@ennru.
LGTM.
Thank you for your contribution! Keep them coming.
This commit contains Scala and Java DSLs for Sink and Source. I don't think there is a need to create a Flow stage as well, since I can't find any use case for it. It contains unit tests for both DSLs.