
PARQUET-2204: [parquet-cpp] TypedColumnReaderImpl::Skip should reuse scratch space #14509

Merged
merged 7 commits into apache:master on Dec 8, 2022

Conversation

fatemehp
Contributor

TypedColumnReaderImpl::Skip allocates scratch space on every call. The scratch space is used to read repetition/definition levels and values that are then thrown away. Microbenchmarks show that this repeated allocation slows Skip down. The scratch space can instead be allocated once and reused.
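
A minimal sketch of the idea in C++ (illustrative names and sizes, not the actual parquet-cpp code): the reader owns the scratch buffers and allocates them lazily on the first Skip call, then reuses them on later calls.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for TypedColumnReaderImpl<DType>.
class SkippingReader {
 public:
  // Skip decodes rep/def levels and values into scratch buffers that are
  // immediately discarded; the buffers are allocated once and reused.
  void Skip(int64_t num_values) {
    if (scratch_levels_.empty()) {
      // First Skip call: allocate scratch once and keep it for the lifetime
      // of the reader instead of re-allocating on every call.
      scratch_levels_.resize(kSkipBatchSize);
      scratch_values_.resize(kSkipBatchSize * value_width_);
    }
    while (num_values > 0) {
      const int64_t batch = std::min<int64_t>(num_values, kSkipBatchSize);
      // ... decode `batch` levels/values into the scratch buffers and
      //     throw them away ...
      num_values -= batch;
    }
  }

 private:
  static constexpr int64_t kSkipBatchSize = 1024;  // hypothetical fixed size
  int64_t value_width_ = sizeof(int32_t);          // example fixed-width type
  std::vector<int16_t> scratch_levels_;            // rep/def level scratch
  std::vector<uint8_t> scratch_values_;            // value scratch
};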

@fatemehp
Contributor Author

@emkornfield could you take a look at this pull request? Thanks!

@github-actions

⚠️ Ticket has no components in JIRA, make sure you assign one.

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@fatemehp
Contributor Author

I am planning to check in the microbenchmark as a separate pull request.

@emkornfield
Contributor

Looks like this change is the same as the one in the new PR with the microbenchmark; can we close this one in favor of that?

@fatemehp
Contributor Author

I removed these commits from the benchmark pull request.

@fatemehp
Contributor Author

fatemehp commented Oct 26, 2022

Here are benchmark results before and after the change proposed in this pull request. Only the Skip numbers are included, since this change does not affect read performance. We see up to a 15x reduction in time when the batch size (the last parameter) is 1, i.e. when the scratch space is repeatedly re-allocated. The parameter key is below; a rough sketch of how such a parameterized benchmark might be registered follows the list.

PARAMETERS:
1) definition level
2) repetition level
3) 1 for Skip, 0 for ReadBatch (here we only have Skip)
4) batch size
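
For readers unfamiliar with the BM_Skip/... naming, here is a rough sketch of how a benchmark parameterized this way might be registered with Google Benchmark (hypothetical code; the actual microbenchmark is checked in via a separate pull request):

#include <benchmark/benchmark.h>

// Hypothetical benchmark body; only the argument plumbing is shown.
static void BM_Skip(benchmark::State& state) {
  const int64_t def_level = state.range(0);   // 1) definition level
  const int64_t rep_level = state.range(1);   // 2) repetition level
  const bool do_skip = state.range(2) == 1;   // 3) 1 for Skip, 0 for ReadBatch
  const int64_t batch_size = state.range(3);  // 4) batch size
  for (auto _ : state) {
    // ... build a column of data and Skip (or ReadBatch) `batch_size`
    //     values at a time until the column is exhausted ...
  }
}
// Produces the BM_Skip/<def>/<rep>/<skip>/<batch_size> names shown below.
BENCHMARK(BM_Skip)->Args({0, 0, 1, 1})->Args({0, 0, 1, 1000})->Args({1, 1, 1, 100000});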

BEFORE
-------------------------------------------------------------------------------
Benchmark                Time             CPU              Iterations
-------------------------------------------------------------------------------
BM_Skip/0/0/1/1       150319842 ns    149567587 ns         1000
BM_Skip/0/0/1/1000       244565 ns       244931 ns         1000
BM_Skip/0/0/1/10000      115395 ns       115924 ns         1000
BM_Skip/0/0/1/100000     115241 ns       115916 ns         1000
BM_Skip/1/0/1/1       149224507 ns    148683644 ns         1000
BM_Skip/1/0/1/1000       805812 ns       805417 ns         1000
BM_Skip/1/0/1/10000      702999 ns       700108 ns         1000
BM_Skip/1/0/1/100000     654163 ns       651947 ns         1000
BM_Skip/1/1/1/1       165600118 ns    164864530 ns         1000
BM_Skip/1/1/1/1000      1130975 ns      1130252 ns         1000
BM_Skip/1/1/1/10000     1009628 ns      1009589 ns         1000
BM_Skip/1/1/1/100000    1029064 ns      1028726 ns         1000

AFTER
-------------------------------------------------------------------------------
Benchmark                 Time             CPU             Iterations
-------------------------------------------------------------------------------
BM_Skip/0/0/1/1        10280337 ns     10234495 ns         1000
BM_Skip/0/0/1/1000       101228 ns       101436 ns         1000
BM_Skip/0/0/1/10000       96565 ns        96648 ns         1000
BM_Skip/0/0/1/100000      96598 ns        96814 ns         1000
BM_Skip/1/0/1/1        25771605 ns     25718891 ns         1000
BM_Skip/1/0/1/1000       651609 ns       650940 ns         1000
BM_Skip/1/0/1/10000      660890 ns       654217 ns         1000
BM_Skip/1/0/1/100000     640417 ns       636855 ns         1000
BM_Skip/1/1/1/1        36639124 ns     36433537 ns         1000
BM_Skip/1/1/1/1000       978058 ns       976403 ns         1000
BM_Skip/1/1/1/10000      997193 ns       996529 ns         1000
BM_Skip/1/1/1/100000     999080 ns       993296 ns         1000

@fatemehp
Contributor Author

We need to consider the memory implications of this change. This means that if at least one Skip is requested, the scratch space will be allocated on the heap and kept until the column reader is destroyed. The scratch space can be as big as 12 KB. If we have 1024 column readers open at one time, that means a 12 MB overhead. Is it common to have this many readers open at the same time? If yes, is this overhead acceptable?

// value type for batch_size.
void InitScratchForSkip(int64_t batch_size);

// Scrtach space for reading and throwing away rep/def levels and values when
Contributor

Suggested change
// Scrtach space for reading and throwing away rep/def levels and values when
// Scratch space for reading and throwing away rep/def levels and values when

Contributor Author

Done.

@emkornfield
Contributor

We need to consider the memory implications of this change. This means that if at least one Skip is requested, the scratch space will be allocated on the heap and kept until the column reader is destroyed. The scratch space can be as big as 12 KB. If we have 1024 column readers open at one time, that means a 12 MB overhead. Is it common to have this many readers open at the same time? If yes, is this overhead acceptable?

That seems OK to me; another option would be to put this behind a feature flag. This breaks only after a second call to skip?

@fatemehp
Contributor Author

fatemehp commented Nov 1, 2022

@emkornfield Could you give some context about when we normally use the feature flags? If we want to control this using a flag, that means we will support both cases of 1) allocating within the Skip function and 2) in the column reader. I am wondering if that is doing more than what we need.

Also, I don't fully understand your question here: "This breaks only after a second call to skip?"
The scratch space is allocated on the "first" call to skip and will be retained until the column reader is destroyed.

@@ -1151,6 +1159,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
return total_values;
}

template <typename DType>
void TypedColumnReaderImpl<DType>::InitScratchForSkip(int64_t batch_size) {
Member

What if the batch size is not the same between calls?

Contributor Author

I removed this argument and clarified this in the code. The batch size is constant and will not change.
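
For illustration, the shape of the change being described might be roughly this (hypothetical names and value, not the exact patch):

#include <cstdint>

class ColumnReaderSketch {  // hypothetical stand-in for TypedColumnReaderImpl
 public:
  // Before: the caller passed a batch size, which could in principle change
  // between calls and leave a previously sized scratch buffer too small:
  //   void InitScratchForSkip(int64_t batch_size);
  //
  // After: the scratch is sized from one fixed constant, so repeated calls
  // are harmless and the function is a no-op once the buffer exists.
  void InitScratchForSkip();

 private:
  static constexpr int64_t kSkipScratchBatchSize = 1024;  // hypothetical value
};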

@pitrou
Member

pitrou commented Nov 15, 2022

Can you merge the latest changes from git master?

@emkornfield
Contributor

@emkornfield Could you give some context about when we normally use the feature flags? If we want to control this using a flag, that means we will support both cases of 1) allocating within the Skip function and 2) in the column reader. I am wondering if that is doing more than what we need.

Also, I don't fully understand your question here: "This breaks only after a second call to skip?"
The scratch space is allocated on the "first" call to skip and will be retained until the column reader is destroyed.

Sorry, I think this answers the question; I think I meant 'help' instead of 'breaks'.

@fatemehp
Contributor Author

fatemehp commented Dec 7, 2022

@pitrou, @emkornfield I have pulled in the latest changes, and updated the code so that the RecordReader uses the same scratch space. Please take a look.

@pitrou
Member

pitrou commented Dec 8, 2022

I get the following benchmark numbers:

  • before:
-------------------------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------
ColumnReaderSkipInt32/Repetition:0/BatchSize:100            2256949 ns      2256574 ns          310 bytes_per_second=2.1131G/s
ColumnReaderSkipInt32/Repetition:0/BatchSize:1000            322274 ns       322302 ns         2147 bytes_per_second=14.7947G/s
ColumnReaderSkipInt32/Repetition:0/BatchSize:10000           123075 ns       123135 ns         5695 bytes_per_second=38.7247G/s
ColumnReaderSkipInt32/Repetition:0/BatchSize:100000           34654 ns        34707 ns        19814 bytes_per_second=137.388G/s

ColumnReaderSkipInt32/Repetition:1/BatchSize:100            4483059 ns      4482271 ns          159 bytes_per_second=579.976M/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:1000            974701 ns       974606 ns          719 bytes_per_second=2.60483G/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:10000           630965 ns       630913 ns         1111 bytes_per_second=4.02382G/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:100000          191333 ns       191349 ns         3553 bytes_per_second=13.2672G/s

ColumnReaderSkipInt32/Repetition:2/BatchSize:100            5919031 ns      5917853 ns          119 bytes_per_second=465.768M/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:1000           1422720 ns      1422501 ns          488 bytes_per_second=1.89226G/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:10000          1008342 ns      1008178 ns          698 bytes_per_second=2.66991G/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:100000          306069 ns       306087 ns         2281 bytes_per_second=8.79405G/s
  • after:
ColumnReaderSkipInt32/Repetition:0/BatchSize:100             246589 ns       246512 ns         2869 bytes_per_second=19.3434G/s
ColumnReaderSkipInt32/Repetition:0/BatchSize:1000            114571 ns       114636 ns         6122 bytes_per_second=41.5957G/s
ColumnReaderSkipInt32/Repetition:0/BatchSize:10000            98790 ns        98844 ns         7101 bytes_per_second=48.2414G/s
ColumnReaderSkipInt32/Repetition:0/BatchSize:100000           32585 ns        32662 ns        22074 bytes_per_second=145.992G/s

ColumnReaderSkipInt32/Repetition:1/BatchSize:100            1716748 ns      1716514 ns          408 bytes_per_second=1.47897G/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:1000            644526 ns       644500 ns         1091 bytes_per_second=3.93899G/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:10000           567605 ns       567578 ns         1241 bytes_per_second=4.47283G/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:100000          180075 ns       180121 ns         3862 bytes_per_second=14.0943G/s

ColumnReaderSkipInt32/Repetition:2/BatchSize:100            2880737 ns      2880209 ns          243 bytes_per_second=956.995M/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:1000           1040618 ns      1040539 ns          676 bytes_per_second=2.58687G/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:10000           916319 ns       916171 ns          767 bytes_per_second=2.93803G/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:100000          288829 ns       288867 ns         2418 bytes_per_second=9.31826G/s

Nice improvement!

Member

@pitrou pitrou left a comment

+1, thank you @fatemehp !

@pitrou pitrou merged commit 4bc9355 into apache:master Dec 8, 2022