
[SPARK-3228][Streaming] #2132

Closed
Leolh wants to merge 2 commits from Leolh:spark-streaming

Conversation

5 participants
@Leolh commented Aug 26, 2014

When I use DStream to save files to HDFS, it creates a directory and an empty file named "_SUCCESS" for each job generated in the batch duration. But if no data arrives from the source for a long time and the batch duration is very short (e.g. 10s), this creates a large number of empty directories and files in HDFS. I don't think that is necessary, so I want to modify DStream's saveAsObjectFiles and saveAsTextFiles methods so that they create the directory and files only when the RDD's partition count is greater than 0.
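
For context, a minimal sketch of what the proposed guard could look like inside DStream.saveAsTextFiles. This is my reading of the proposal, not the actual patch (the PR was closed unmerged); rddToFileName is the helper the Spark Streaming source of that era used to build the per-batch path.

def saveAsTextFiles(prefix: String, suffix: String = ""): Unit = {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    // Proposed guard: skip the save when the batch produced no partitions,
    // so no per-batch directory or _SUCCESS marker is written to HDFS.
    if (rdd.partitions.size > 0) {
      val file = rddToFileName(prefix, suffix, time)
      rdd.saveAsTextFile(file)
    }
  }
  this.foreachRDD(saveFunc)
}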

leo added some commits Aug 26, 2014

When DStream saves an RDD to HDFS, don't create a directory and an empty file if no data was received from the source in the batch duration.
@AmplabJenkins commented Aug 26, 2014

Can one of the admins verify this patch?

@mateiz (Contributor) commented Aug 27, 2014

When would the RDD not have any partitions? If you use reduce, updateStateByKey, or anything like that, the RDD will always have partitions even when the batch carried no data, so this check won't save much hassle in most jobs. It would be better to implement a cleanup process in your application to get rid of these files.
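
For example, such a cleanup could be a small periodic pass over the output root using the Hadoop FileSystem API, deleting batch directories that contain nothing but the _SUCCESS marker. This is a sketch only; removeEmptyBatchDirs and the flat one-directory-per-batch layout are assumptions, not anything from this PR.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def removeEmptyBatchDirs(outputRoot: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  for (dir <- fs.listStatus(new Path(outputRoot)) if dir.isDirectory) {
    val contents = fs.listStatus(dir.getPath)
    // A batch directory holding nothing but the _SUCCESS marker had no data.
    if (contents.forall(_.getPath.getName == "_SUCCESS")) {
      fs.delete(dir.getPath, true) // recursive delete of the empty batch dir
    }
  }
}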

@tdas (Contributor) commented Aug 30, 2014

Can you please add a title to the PR? Also, this is a tricky change, as it actually changes the user-perceived behavior of saveAsXXXFile: if someone has set up a system that expects a new file every batch, irrespective of whether the batch contains data, then this change will break that system.

This functionality can be very easily replicated in user code, by doing:

dstream.foreachRDD((rdd: RDD[XXX], time: Time) => {
  // Save only when the batch actually produced partitions.
  if (rdd.partitions.size > 0) {
    val fileName = prefix + time.milliseconds + suffix
    rdd.saveAsXXXFile(fileName)
  }
})

So I am not convinced that this is a good change, especially because it breaks existing behavior.
Any thoughts?

@tdas (Contributor) commented Sep 3, 2014

@Leolh Any thoughts on @mateiz's comments and mine?

@SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@tdas (Contributor) commented Nov 11, 2014

If you are unable to update this patch, would you mind closing it?

Leolh closed this Nov 18, 2014

Leolh deleted the Leolh:spark-streaming branch Nov 18, 2014
