ORC-128. Add getStatistics to Writer API#78

Closed
omalley wants to merge 1 commit into apache:master from omalley:orc-128

@omalley omalley commented Jan 6, 2017

Allow users to call getStatistics while a file is being written.

…ics as the file is written.

Signed-off-by: Owen O'Malley <omalley@apache.org>

dain commented Jan 6, 2017

Just curious, how do you plan on using this?

@asfgit asfgit closed this in 1e8b598 Jan 10, 2017
@prasanthj

@dain Hive already uses the stats API (on both the reader side and the writer side) to get basic statistics such as numRows and rawDataSize from the footer, which avoids row-by-row stats gathering. This new API extends the same approach to column statistics (although ORC is missing NDV, the number of distinct values, at this point).
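
To make the column-statistics idea concrete, here is a minimal sketch, under stated assumptions, of the kind of running per-column statistics a writer can maintain as values pass through it, so that a caller can read them back instead of recomputing them row by row. `LongColumnStats` and its methods are hypothetical names chosen for illustration; they are not the ORC classes. Like the statistics discussed here, the sketch tracks value count, null presence, minimum, and maximum, but no NDV.

```java
// Hypothetical stand-in, NOT an ORC class: running statistics for one long column.
final class LongColumnStats {
    private long count;                  // number of non-null values seen
    private boolean hasNull;             // true once any null has been seen
    private long min = Long.MAX_VALUE;   // meaningful only when count > 0
    private long max = Long.MIN_VALUE;   // meaningful only when count > 0

    // Called once per value as the (hypothetical) writer encodes it.
    void update(Long value) {
        if (value == null) {
            hasNull = true;
            return;
        }
        count++;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    long getCount() { return count; }
    boolean hasNull() { return hasNull; }
    long getMin() { return min; }
    long getMax() { return max; }
}
```

Because each update is constant-time work the writer does anyway while encoding a value, exposing the result to the caller adds no extra pass over the data.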

dain commented Jan 19, 2017

I'm curious how the writer statistics are used since it is the "end of the line". Is it just to display to a user (e.g., logging), or is the engine making decisions based on the information?

@prasanthj

Hive auto-gathers statistics during writes (INSERT, CTAS, ...). Just before closing the file, the file sink operator gets the statistics from the Writer and publishes them for aggregation by the client. This simply moves the stats collection from processOp (row-by-row or vector batch processing) to closeOp.

Similarly, the Reader-side interface is used by ANALYZE queries to compute statistics just by reading the footer.
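
As a rough illustration of that shift from processOp to closeOp, the sketch below uses plain Java stand-ins rather than Hive or ORC code. `SketchWriter`, `SketchFileSink`, and the string-based "publish" step are all hypothetical; the row-size estimate is deliberately crude. The point is only that the sink does no statistics bookkeeping per row and asks the writer once at close.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a file writer that tracks basic statistics as rows are added.
final class SketchWriter {
    private long numRows;
    private long rawDataSize;

    void addRow(String row) {
        numRows++;
        rawDataSize += row.length();   // crude size estimate, for illustration only
    }

    long getNumberOfRows() { return numRows; }
    long getRawDataSize() { return rawDataSize; }
}

// Hypothetical sink operator: no stats work per row, one query to the writer at close.
final class SketchFileSink {
    private final SketchWriter writer = new SketchWriter();
    private final List<String> published = new ArrayList<>();

    // Per-row path: only hands the row to the writer.
    void processOp(String row) {
        writer.addRow(row);
    }

    // Close path: reads the writer's statistics once and "publishes" them.
    void closeOp() {
        published.add("numRows=" + writer.getNumberOfRows());
        published.add("rawDataSize=" + writer.getRawDataSize());
    }

    List<String> getPublished() { return published; }
}
```

In this arrangement the per-row path stays minimal, and anything the writer already tracks becomes available to the engine at the moment the file is finished.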

@prasanthj

> is the engine making decisions based on the information?

Not at this point.
