ORC-128. Add getStatistics to Writer API #78
…ics as the file is written. Signed-off-by: Owen O'Malley <omalley@apache.org>
|
Just curious, how do you plan on using this? |
|
@dain Hive already uses the stats API (on both the reader and the writer side) to get basic statistics such as numRows and rawDataSize from the footer, which avoids row-by-row stats gathering. This new API extends the same approach to column statistics (although ORC is missing NDV, the number of distinct values, at this point). |
|
I'm curious how the writer statistics are used since it is the "end of the line". Is it just to display to a user (e.g., logging), or is the engine making decisions based on the information? |
|
Hive auto-gathers statistics during writes (INSERT, CTAS, etc.). Just before closing the file, the file sink operator gets the statistics from the Writer and publishes them for aggregation by the client. This simply moves the stats collection from processOp (row-by-row or vector batch processing) to closeOp. Similarly, the Reader-side interface is used by ANALYZE queries to compute statistics just by reading the footer.
Not at this point. |
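To make the flow above concrete, here is a minimal, self-contained sketch of the pattern: the writer keeps per-column statistics up to date as rows arrive, and the sink reads them once in its close step instead of computing them row by row. Everything in it (`LongColumnStats`, `SketchWriter`, `SketchFileSink`, and their methods) is a hypothetical stand-in written for illustration. It is not the ORC `Writer` interface or Hive's file sink operator, and it only models long-valued columns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-column statistics for long values (not ORC's ColumnStatistics).
final class LongColumnStats {
    private long count;
    private long min = Long.MAX_VALUE; // stays at MAX_VALUE until a value is seen
    private long max = Long.MIN_VALUE; // stays at MIN_VALUE until a value is seen

    void update(long value) {
        count++;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    // Snapshot copy so callers cannot observe later updates.
    LongColumnStats copy() {
        LongColumnStats c = new LongColumnStats();
        c.count = count;
        c.min = min;
        c.max = max;
        return c;
    }

    long getCount() { return count; }
    long getMin() { return min; }
    long getMax() { return max; }
}

// Hypothetical writer that maintains statistics while the "file" is being written.
final class SketchWriter {
    private final LongColumnStats[] columns;
    private long numRows;
    private boolean closed;

    SketchWriter(int numColumns) {
        columns = new LongColumnStats[numColumns];
        for (int i = 0; i < numColumns; i++) {
            columns[i] = new LongColumnStats();
        }
    }

    void addRow(long... values) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        if (values.length != columns.length) {
            throw new IllegalArgumentException("expected " + columns.length + " values");
        }
        for (int i = 0; i < values.length; i++) {
            columns[i].update(values[i]);
        }
        numRows++;
    }

    long getNumberOfRows() { return numRows; }

    // Can be called at any point before close(); returns a snapshot.
    LongColumnStats[] getStatistics() {
        LongColumnStats[] snapshot = new LongColumnStats[columns.length];
        for (int i = 0; i < columns.length; i++) {
            snapshot[i] = columns[i].copy();
        }
        return snapshot;
    }

    void close() { closed = true; }
}

// Hypothetical sink: process() does no stats work, close() publishes them once.
final class SketchFileSink {
    private final SketchWriter writer;
    private final List<String> published = new ArrayList<>();

    SketchFileSink(int numColumns) {
        writer = new SketchWriter(numColumns);
    }

    void process(long... row) {
        writer.addRow(row);
    }

    void close() {
        LongColumnStats[] stats = writer.getStatistics();
        published.add("numRows=" + writer.getNumberOfRows());
        for (int i = 0; i < stats.length; i++) {
            published.add("col" + i + ": count=" + stats[i].getCount()
                    + " min=" + stats[i].getMin() + " max=" + stats[i].getMax());
        }
        writer.close();
    }

    List<String> getPublished() { return published; }
}

final class WriterStatsSketch {
    public static void main(String[] args) {
        SketchFileSink sink = new SketchFileSink(2);
        sink.process(3, 10);
        sink.process(-1, 20);
        sink.close();
        sink.getPublished().forEach(System.out::println);
    }
}
```

The point of the design is that the per-row path only forwards rows to the writer, which has to inspect the values anyway, so the statistics come at no extra pass over the data.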
Allow users to call getStatistics while writing files.