Conversation

@Parth-Brahmbhatt
Contributor

@Parth-Brahmbhatt commented Aug 25, 2016

What changes were proposed in this pull request?

When calculating the size of a relation from HDFS, the size calculation should be aborted once the size is established to be greater than the broadcast threshold. This is really helpful when the config is enabled and one of the tables in the query is partitioned and fairly large.
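The early-exit idea described above can be sketched as follows. This is a hedged illustration using a toy in-memory file tree rather than Hadoop's FileSystem API; the type and method names (`Node`, `sizeWithEarlyExit`, `threshold`) are invented for the example and are not the actual Spark internals. Here `threshold` plays the role of `spark.sql.autoBroadcastJoinThreshold`.

```scala
// Toy model of a file tree, standing in for Hadoop's FileSystem listing.
sealed trait Node
case class FileNode(size: Long) extends Node
case class DirNode(children: Seq[Node]) extends Node

// Sums file sizes, but stops descending once the running total is already
// known to exceed `threshold` -- the exact total no longer matters, because
// the relation cannot be broadcast either way.
def sizeWithEarlyExit(root: Node, threshold: Long): Long = {
  def go(n: Node, acc: Long): Long =
    if (acc > threshold) acc
    else n match {
      case FileNode(s) => acc + s
      case DirNode(children) =>
        children.foldLeft(acc)((sum, child) =>
          if (sum > threshold) sum else go(child, sum))
    }
  go(root, 0L)
}
```

With a threshold of 10, a directory of four 6-byte files stops after the second file (returning 12) instead of listing all four, which is the perf win being proposed for large partitioned tables.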

How was this patch tested?

Unit test modified.

…e size calculation should be aborted once size is established to be > broadcast threshold.
@hvanhovell
Contributor

@Parth-Brahmbhatt we are currently working on Cost Based Optimization (CBO) in Spark. An important input will be the actual size of the table. Having partial statistics (what you are suggesting) will not make this as good as it can be.

The other issue is that the size of the table is currently used to determine whether we are doing a hash join or a sort merge join. In order to do a hash join, the build side must be at least 3x smaller than the stream side, and the build side should not occupy more memory per worker than the broadcast threshold. This will break that too.
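The join-selection rule described above can be sketched as a predicate. This is a hedged illustration, not Spark's actual planner code; the function and parameter names are invented for the example. It shows why a partial (understated) size is risky: a table reported smaller than it really is could satisfy both conditions and be picked as the build side incorrectly.

```scala
// Hedged sketch of the selection rule: a hash join is considered only when
// the build side fits under the broadcast threshold per worker AND is at
// least 3x smaller than the stream side. A capped or partial size estimate
// for a large table could make this predicate fire when it should not.
def canUseHashJoin(buildBytes: Long, streamBytes: Long,
                   broadcastThreshold: Long): Boolean =
  buildBytes <= broadcastThreshold && buildBytes * 3 <= streamBytes
```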

@Parth-Brahmbhatt
Contributor Author

Parth-Brahmbhatt commented Aug 25, 2016

@hvanhovell The behavior when fallbackToHdfs is not enabled (and by default it is not enabled, for performance reasons) is to return the value specified via spark.sql.defaultSizeInBytes (default Long.MaxValue), which is also wrong, and I am fine with returning that value instead of returning the partial value.

This patch is trying to reduce the perf penalty that we have to pay when the fallback is enabled; in that case, trying to get an accurate size is just too expensive when one of the two tables being joined is huge. I could also add a config to enable/disable this behavior, but not having the option at all makes this fallback useless in most cases.

}
)

private def getSize(f: Path): Long = {
Member


I see what you're doing but this method no longer really does what it says. I think this is going to be error-prone.

Contributor Author


Renamed the method, added documentation, and also now return the default bytes when size > broadcast threshold.
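The revised behavior described in this reply can be sketched as follows. This is an illustrative fragment with invented names, not the patch itself: once the computed size is known to exceed the broadcast threshold, the configured default (`spark.sql.defaultSizeInBytes`, `Long.MaxValue` by default) is reported instead of a partial sum, so callers never see a misleading intermediate value.

```scala
// Hedged sketch: report the configured default size rather than a partial
// count once the broadcast threshold has been crossed.
def reportedSize(computed: Long, threshold: Long,
                 defaultSizeInBytes: Long = Long.MaxValue): Long =
  if (computed > threshold) defaultSizeInBytes else computed
```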

…nstead of returning any value > broadcast threshold.
@Parth-Brahmbhatt
Contributor Author

@hvanhovell can you also point me at the design doc/discussion thread for the CBO work? Thanks.

@Parth-Brahmbhatt
Contributor Author

Can one of the committers take a look at this PR?

@hvanhovell
Contributor

@Parth-Brahmbhatt here is the CBO ticket: https://issues.apache.org/jira/browse/SPARK-16026

Could you explain why this is so slow? Is it because of listing the files, or because of the number of files? We might be able to speed up the former; see AlterTableRecoverPartitionsCommand for an example of this.

@Parth-Brahmbhatt
Contributor Author

@hvanhovell it's because of listing, and it gets worse as the number of files increases.

@hvanhovell
Contributor

@Parth-Brahmbhatt would the approach taken in AlterTableRecoverPartitionsCommand help?

@Parth-Brahmbhatt
Contributor Author

@hvanhovell I will take a look at it and update this PR.

@Parth-Brahmbhatt
Contributor Author

@hvanhovell I looked at AlterTableRecoverPartitionsCommand, and while the parallelism in listing could help, it will still cause a huge perf penalty. We have tables with millions of partitions, and we use S3 for storage, where listing is more expensive. I think it is much better to just stop listing once we know the stat, which is used only for join optimization, won't meet the threshold, and I don't see the downside compared to what we currently offer.

@Parth-Brahmbhatt
Contributor Author

Can someone please review this PR? Thanks.

@Parth-Brahmbhatt
Contributor Author

Requesting review one more time.

@hvanhovell
Contributor

@Parth-Brahmbhatt I am very curious why you have millions of partitions. What is the use case? You will be in a world of hurt as soon as you do any listing.

I am not going to merge this PR as-is, as it clashes with the work on CBO. I would support an analyze table command that would allow for partial scanning, e.g. ANALYZE TABLE xyz NOSCAN BROADCAST ONLY. Let's discuss a design first before pressing ahead.

@AmplabJenkins

Can one of the admins verify this patch?

@Parth-Brahmbhatt
Contributor Author

@hvanhovell We have tables with 5-6 partition columns and data going back 4-5 years, and given our data is stored in S3, the listing is paginated.

If you want to wait until the CBO work is done, that is fine. We can resume reviewing after it lands.

@gatorsmile
Member

Thanks for reporting it!

After CBO, the relation size is not only used for deciding whether a table can be broadcasted. Maybe we can close this PR now?
