[C++][Parquet] Performance reading S3 based files won't match localfilesystem even with large prebuffering. #39899
Comments
One add-on comment: I still find the local filesystem performance to be pretty slow for such a small amount of data. I see other tickets referencing rates of 150 MB/s to 1 GB/s (#38389) and I'm orders of magnitude from that :(
I found that I could add a call to
Firstly, I think 3s is quite slow; would you mind telling us the IO pattern you're using? The best pattern is to send all IO requests up front (if memory is sufficient), wait for them to finish, and then read the file (or split the requests by row group).
I made the assumption that if values_read == 0 then I've processed a null value for that batch, but I will look into those rep-level and def-level concepts you mention. I've not really tested with nulls yet, and I'm not dealing with any complex types like struct/list/map in my parser, mostly the simple primitive types.
I got the best performance (same as a local file) when I prebuffered all the row groups and columns I wanted to read and then called WhenBuffered. We have a good amount of memory available to us. Splitting the requests by row group would certainly help control memory, provided the writer of the file did not make them too large. In my use case I have many processes, each processing its own files, so I do not want to parallelize reading each column with an individual thread; I want one CPU thread to handle the parsing of that one file (I know the prebuffering happens on background threads, but ideally this would be done serially as well).
I'm going to close this, as I was able to get equivalent parsing performance once I called WhenBuffered.
Describe the usage question you have. Please include as many useful details as possible.
I'm writing a simple program which uses the low-level Parquet reader APIs
parquet::ParquetFileReader
This parser calls PreBuffer with the row groups and columns I want to read (along with CacheOptions::Defaults()). I get a parquet::ColumnReader
for each column, then loop through them and copy into a buffer so that the data ends up in a row-based format (as opposed to Parquet's columnar format). In this example I'm only parsing booleans, using
bool_reader->ReadBatch(1, nullptr, nullptr, (bool*)buf, &values_read);
to write directly to buf.
My total test data is 284K and consists of Parquet files which contain 3 boolean columns, so it is very simple and should be fast.
I find that when I benchmark my code against files on my local filesystem, it takes about 1.7s to parse the data this way, but when I give it an S3 file handle (created from Arrow's S3 filesystem class) it takes significantly longer, even with prebuffering. I'm not including the prebuffer itself in my timings; here is what I'm seeing:
local filesystem: 1.7s
S3 without prebuffer: 8.3s
S3 with prebuffer: 3.9s (not benchmarking the prebuffer time)
I don't understand why parsing from the local filesystem would be faster than from S3 if I've prebuffered the data into memory (again, the prebuffer is not included in these timings). I've tried changing the
parquet::ReaderProperties
buffer size to 20 MB (which should fit the whole file in memory), but I can't seem to match the local filesystem performance. Looking for some guidance: I'd really like to be as close to the local filesystem performance as possible. I want to avoid downloading the whole file, but still be able to read these prebuffered sections of the file efficiently.
Component(s)
C++, Parquet