-
Notifications
You must be signed in to change notification settings - Fork 28.2k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-17549][sql] Coalesce cached relation stats in driver.
Currently there's a scalability problem with cached relations, in that stats for every column, for each partition, are captured in the driver. For large tables that leads to lots and lots of memory usage. This change modifies the accumulator used to capture stats in the driver to summarize the data as it arrives, instead of collecting everything and then summarizing it. Previously, for each column, the driver needed: (64 + 2 * sizeof(type)) * number of partitions With the change, the driver requires a fixed 8 bytes per column. On top of that, the change fixes a second problem dealing with how statistics of cached relations that share stats with another one (e.g. a cache projection of a cached relation) are calculated; previously, the data would be wrong since the accumulator data would be summarized based on the child output (while the data reflected the parent's output). Now the calculation is done based on how the child's output maps to the parent's output, yielding the correct size.
- Loading branch information
Marcelo Vanzin
committed
Sep 21, 2016
1 parent
7e418e9
commit 5b3a65a
Showing
2 changed files
with
105 additions
and
16 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters