PARQUET-2204: [parquet-cpp] TypedColumnReaderImpl::Skip should reuse scratch space #14509
@emkornfield could you take a look at this pull request? Thanks!
I am planning to check in the microbenchmark as a separate pull request.
Looks like this change is the same as in the new PR with the microbenchmark. Can we close this in favor of that one?
I removed these commits from the benchmark pull request.
Here are benchmark results before and after the change proposed in this pull request. I am only including the Skip benchmarks, since this change does not affect read performance. We get up to a 15x reduction in Skip time when the batch size (the last parameter) is 1, which is the case where we repeatedly re-allocate the scratch space.
We need to consider the memory implications of this change. This means that if at least one Skip is requested, the scratch space will be allocated on the heap and kept until the column reader is destroyed. The scratch space can be as big as 12 KB. If we have 1024 column readers open at one time, that means a 12 MB overhead. Is it common to have this many readers open at the same time? If yes, is this overhead acceptable?
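The overhead estimate in the comment above can be checked with a few lines of compile-time arithmetic. This is only a sketch: the 12 KB and 1024 figures come from the comment, while the constant and function names are hypothetical and do not exist in parquet-cpp.

```cpp
#include <cstdint>

// Figures taken from the review comment; the names are illustrative only.
constexpr int64_t kMaxScratchBytesPerReader = 12 * 1024;  // "as big as 12 KB"
constexpr int64_t kOpenReaders = 1024;

// Worst-case bytes retained if every open reader has served at least one Skip.
constexpr int64_t RetainedScratchBytes(int64_t bytes_per_reader, int64_t readers) {
  return bytes_per_reader * readers;
}

static_assert(RetainedScratchBytes(kMaxScratchBytesPerReader, kOpenReaders) ==
                  12 * 1024 * 1024,
              "1024 readers x 12 KB each = 12 MB");
```

The bound is a worst case: a reader that never receives a Skip call retains nothing under the lazy-allocation scheme described here.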
cpp/src/parquet/column_reader.cc
// value type for batch_size.
void InitScratchForSkip(int64_t batch_size);

// Scrtach space for reading and throwing away rep/def levels and values when
Suggested change:
- // Scrtach space for reading and throwing away rep/def levels and values when
+ // Scratch space for reading and throwing away rep/def levels and values when
Done.
That seems OK to me. Another option would be to read the feature flag. This breaks only after a second call to skip?
@emkornfield Could you give some context about when we normally use the feature flags? If we want to control this using a flag, that means we will support both cases: 1) allocating within the Skip function and 2) allocating in the column reader. I am wondering if that is doing more than what we need. Also, I don't fully understand your question here: "This breaks only after a second call to skip?"
cpp/src/parquet/column_reader.cc
@@ -1151,6 +1159,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
  return total_values;
}

template <typename DType>
void TypedColumnReaderImpl<DType>::InitScratchForSkip(int64_t batch_size) {
What if the batch size is not the same between calls?
I removed this argument and clarified this in the code. The batch size is constant and will not change.
Can you merge the latest changes from git master?
Sorry, I think this answers the question. I think I meant "helps" instead of "breaks".
The batch size for throwing away values is constant.
@pitrou, @emkornfield I have pulled in the latest changes and updated the code so that the RecordReader uses the same scratch space. Please take a look.
I get the following benchmark numbers:
Nice improvement!
+1, thank you @fatemehp!
TypedColumnReaderImpl::Skip allocates scratch space on every call. The scratch space is used to read rep/def levels and values and then throw them away. Microbenchmarks show that this repeated allocation slows down Skip. The scratch space can instead be allocated once and reused.
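For readers who want the idea in isolation, here is a minimal sketch of the allocate-once pattern. It is not the parquet-cpp implementation: `SketchReader`, `kSkipBatchSize`, and `allocation_count()` are hypothetical names, the batch size of 1024 is an assumed value, and the "decode" step is replaced by a placeholder write into the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed constant batch size for throw-away reads (illustrative value).
constexpr int64_t kSkipBatchSize = 1024;

// Hypothetical stand-in for a column reader; not the parquet-cpp class.
class SketchReader {
 public:
  explicit SketchReader(int64_t num_values) : remaining_(num_values) {}

  // Skips up to num_to_skip values and returns how many were skipped.
  int64_t Skip(int64_t num_to_skip) {
    int64_t skipped = 0;
    while (skipped < num_to_skip && remaining_ > 0) {
      InitScratchForSkip();  // allocates only on the first use
      const int64_t batch =
          std::min({kSkipBatchSize, num_to_skip - skipped, remaining_});
      assert(batch > 0);
      // Placeholder for decoding `batch` values into the scratch buffer;
      // the contents are never read back.
      std::fill_n(scratch_.begin(), batch, int32_t{0});
      remaining_ -= batch;
      skipped += batch;
    }
    return skipped;
  }

  // Number of times the scratch buffer was allocated (for illustration).
  int allocation_count() const { return allocation_count_; }

 private:
  // Lazily allocates the scratch buffer; later calls are no-ops.
  void InitScratchForSkip() {
    if (!scratch_.empty()) return;
    scratch_.resize(kSkipBatchSize);
    ++allocation_count_;
  }

  int64_t remaining_;
  std::vector<int32_t> scratch_;  // kept until the reader is destroyed
  int allocation_count_ = 0;
};
```

Because the buffer is created lazily, a reader that never skips pays nothing, while a reader that skips many times with small requests allocates only once. The trade-off is the one raised in review: the buffer stays alive for the lifetime of the reader.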