[DRAFT] PR to show Vectored IO integration, compilation fails now. #999
Conversation
import java.util.List;
import java.util.function.IntFunction;

import org.apache.hadoop.fs.FileRange;
I feel like this might be an issue. We probably don't want to introduce a Hadoop dependency here because it breaks the separation from Hadoop in the IO path.
Yeah, I agree with this. One solution would be to introduce a custom ParquetFileRange class in the Parquet io module, use it in the interface, and convert ParquetFileRange to the Hadoop FileRange in the implementations in H1SeekableInputStream and H2SeekableInputStream.
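For illustration only, a rough sketch of what such an abstraction might look like; the field, method, and converter names here are assumptions, not the final API:

// Hypothetical Hadoop-free range type for the parquet io module.
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

public class ParquetFileRange {
  private final long offset;  // byte offset of the range within the file
  private final int length;   // number of bytes to read
  // Completed by the stream implementation once the bytes are available.
  private CompletableFuture<ByteBuffer> dataReadFuture;

  public ParquetFileRange(long offset, int length) {
    this.offset = offset;
    this.length = length;
  }

  public long getOffset() { return offset; }
  public int getLength() { return length; }
  public CompletableFuture<ByteBuffer> getDataReadFuture() { return dataReadFuture; }
  public void setDataReadFuture(CompletableFuture<ByteBuffer> f) { this.dataReadFuture = f; }
}

// Inside H1SeekableInputStream / H2SeekableInputStream the conversion to
// Hadoop's FileRange would then be a simple mapping, keeping Hadoop types
// out of the interface:
//   List<FileRange> hadoopRanges = parquetRanges.stream()
//       .map(r -> FileRange.createFileRange(r.getOffset(), r.getLength()))
//       .collect(Collectors.toList());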
@danielcweeks Do you agree with my suggested approach? Thanks for reviewing the code.
Thanks for this PR! Can't wait to try it out.
fileRanges.add(FileRange.createFileRange(currentOffset, lastAllocationSize));
}
LOG.warn("Doing vectored IO for ranges {}", fileRanges);
f.readVectored(fileRanges, ByteBuffer::allocate);
Use the allocator (options.getAllocator)? Keep in mind the allocated buffer might be a direct byte buffer.
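For example, something along these lines, assuming the reader's options object is in scope (ByteBufferAllocator#allocate(int) matches the IntFunction<ByteBuffer> that readVectored expects):

// Sketch: route allocation through the configured allocator instead of
// ByteBuffer::allocate. Note the allocator may return direct byte buffers
// (e.g. DirectByteBufferAllocator), so downstream code must not assume
// heap buffers with accessible backing arrays.
ByteBufferAllocator allocator = options.getAllocator();
f.readVectored(fileRanges, allocator::allocate);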
for (ConsecutivePartList consecutiveChunks : allParts) {
  consecutiveChunks.readAll(f, builder);
  ranges.add(FileRange.createFileRange(consecutiveChunks.offset, (int) consecutiveChunks.length));
I would do this the way you were planning to do it initially (or so it appears): move this into a readAllVectored method.
And make it configurable to choose between vectored I/O and non-vectored I/O (see HadoopReadOptions and ParquetReadOptions).
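As a sketch of such a toggle, with a hypothetical property key and accessor (neither exists in ParquetReadOptions today; the readVectored helper stands in for a vectored read path, one sketch of which appears later in this thread):

// Hypothetical configuration switch; the key name and methods are
// illustrative assumptions, not existing Parquet options.
private static final String VECTORED_IO_ENABLED = "parquet.hadoop.vectored.io.enabled";

if (options.useVectoredIo()) {             // assumed accessor on ParquetReadOptions
  readVectored(allParts, builder);         // single vectored call over all parts
} else {
  for (ConsecutivePartList consecutiveChunks : allParts) {
    consecutiveChunks.readAll(f, builder); // existing sequential path
  }
}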
I have changed both of the places where I thought the integration could be done. I am not really sure which will give better performance results, which is why I left the other portion commented out.
- One option is to make the change in readAllVectored, as you suggested and as I did before.
- The other change (the current one) is at the top layer.
The reason I moved from the 1st to the 2nd is the name ConsecutivePartList. The name suggests it is a consecutive part, essentially meaning just a single range, for which we won't get the real vectored IO benefits like parallel IO and range coalescing.
Right. I was thinking that readAllVectored will take all ConsecutiveParts as input.
Or at least move this block into a new function. You will need to do the same thing in readNextRowGroup as well.
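A rough shape for that helper, assuming it sits in ParquetFileReader next to the existing read loop and that the stream exposes the readVectored call this PR adds (the method and parameter names are guesses):

// Sketch: one FileRange per ConsecutivePartList, issued as a single
// vectored read so the filesystem can coalesce and parallelize.
// Assumes the surrounding ParquetFileReader fields f, options, and LOG.
private void readVectored(List<ConsecutivePartList> allParts, ChunkListBuilder builder)
    throws IOException {
  List<FileRange> ranges = new ArrayList<>(allParts.size());
  for (ConsecutivePartList part : allParts) {
    ranges.add(FileRange.createFileRange(part.offset, (int) part.length));
  }
  LOG.debug("Doing vectored IO for ranges {}", ranges);
  f.readVectored(ranges, options.getAllocator()::allocate);
  // Each FileRange#getData() future completes as its bytes arrive; the
  // chunks for each part can then be populated, mirroring readAll().
}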
Sure, moving this to a new method makes sense. Will do.
ranges.add(FileRange.createFileRange(consecutiveChunks.offset, (int) consecutiveChunks.length));
}
LOG.warn("Doing vectored IO for ranges {}", ranges);
f.readVectored(ranges, ByteBuffer::allocate);
See the comment below about using options.getAllocator.
ranges.add(FileRange.createFileRange(consecutiveChunks.offset, (int) consecutiveChunks.length));
}
LOG.warn("Doing vectored IO for ranges {}", ranges);
f.readVectored(ranges, ByteBuffer::allocate);
Does readVectored allocate a single buffer per range? Or does it split each range into bite-sized pieces? If all the columns are being read, a single range can be the entire row group, potentially more than a GB.
Yes, vectored read allocates a single buffer per range.
If a range can grow to more than a GB, then I guess we will run into memory problems.
Then it may make sense to do what the readAllVectored method is doing: split each ConsecutiveParts into contiguous sub-ranges of 8 MB (configurable). Presumably when you merge ranges you have a limit, so you will get a large consecutive read but not hit the memory allocation issues.
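Roughly, that splitting could look like the following; maxAllocationSize mirrors the existing Parquet setting, while the helper itself is illustrative:

// Sketch: cap each FileRange at maxAllocationSize so a single buffer
// allocation can never reach row-group scale (potentially > 1 GB).
private static List<FileRange> splitRange(long offset, long length, int maxAllocationSize) {
  List<FileRange> ranges = new ArrayList<>();
  long current = offset;
  long remaining = length;
  while (remaining > 0) {
    int size = (int) Math.min(remaining, maxAllocationSize);
    ranges.add(FileRange.createFileRange(current, size));
    current += size;
    remaining -= size;
  }
  return ranges;
}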
Well, I just went through the code of ConsecutivePartList#readAll() again. Yes, it breaks the big range into smaller buffers, but it allocates all of them in one go, so won't the memory issue still persist?
Also, if I make the change in readAll() the way I have already done in the commented-out readAllVectored(), we really won't be reducing the number of seek operations and thus won't get the real benefits of vectored IO. It will just mean that there is a big range to be fetched, we break it into smaller ranges, and fetch them in parallel. (This is similar to PARQUET-2149, which you proposed and for which you have already uploaded the PR :) ).
So if I have something like this -
range1, delta1, range2
I will give you two ranges, range1 and range2, separated by a gap of delta1 bytes. If delta1 is small enough, you will merge range1 and range2 and do a single scan.
If range1 and range2 are very large, so that the resulting range is even larger, do you still do a single scan, or do you split the large range into smaller ranges and trade off some seek cost for increased parallelism?
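For reference, the merge decision being described boils down to something like this (a simplified sketch of the idea, not Hadoop's actual coalescing code; the threshold names are made up):

// Sketch: merge two adjacent ranges when the gap between them is small
// enough to read through, while keeping the merged span bounded.
static boolean shouldMerge(long end1, long start2, long mergedLength,
                           int minSeekGap, int maxMergedSize) {
  long gap = start2 - end1;             // the delta1 in the example above
  return gap >= 0 && gap <= minSeekGap  // cheaper to read the gap than to seek
      && mergedLength <= maxMergedSize; // don't create oversized buffers
}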
Don't you think it is similar to what I have done in the commented-out readVectored() implementation?
Yes, it is. However, the parallel reading will split a large range, as opposed to what the vectored read will do.
I guess we need to run some benchmarks. First, I need to see what type of ranges Parquet generates for real-world/TPC-DS queries. Do you have that by any chance?
No, I don't have the ranges. TPC-DS is a large set of queries. Pick a scale factor and a couple of queries and I'll see what I can get for you.
You couldn't have! :) My PR for async IO is rather different and a lot more complicated.
Oh yeah, just revisited. You are right. :)
2022-10-18 21:51:28,308 [WARN] [TezChild] |hadoop.ParquetFileReader|: Doing vectored IO for ranges [range[1581775,2178915), range[6700883,9475957), range[10023390,10426141), range[12211215,15766053), range[24603672,25148984)]
I got these ranges while running TPC-DS query22 in Hive on the Parquet files in S3 with a scale factor of 1000. We can try the same in Spark. Thanks.
Let me try to get the corresponding ranges for Spark.
That would be great. Thanks.
Also, I reran the same with the changes in readAllVectored, and I can see that it breaks the big ranges into smaller ones if the size is greater than maxAllocationSize (default 8 MB). For example, see the split in range[24928095,34504532). Note: this example is for query26.
So I guess making the change in the layer above makes more sense.
2022-10-20 21:36:02,758 [WARN] [TezChild] |hadoop.ParquetFileReader|: Doing vectored IO for ranges [range[5326394,7673015), range[24928095,34504532), range[36603729,37991198), range[44105694,56402697), range[86874390,88752497)]
2022-10-20 21:36:03,146 [WARN] [TezChild] |hadoop.ParquetFileReader|: Reading through the vectored API.[readAllVectored]
2022-10-20 21:36:03,147 [WARN] [TezChild] |hadoop.ParquetFileReader|: Doing vectored IO for ranges [range[5326394,7673015)]
2022-10-20 21:36:03,225 [WARN] [TezChild] |hadoop.ParquetFileReader|: Reading through the vectored API.[readAllVectored]
2022-10-20 21:36:03,225 [WARN] [TezChild] |hadoop.ParquetFileReader|: Doing vectored IO for ranges [range[24928095,33316703), range[33316703,34504532)]
2022-10-20 21:36:03,352 [WARN] [TezChild] |hadoop.ParquetFileReader|: Reading through the vectored API.[readAllVectored]
2022-10-20 21:36:03,352 [WARN] [TezChild] |hadoop.ParquetFileReader|: Doing vectored IO for ranges [range[36603729,37991198)]
2022-10-20 21:36:03,439 [WARN] [TezChild] |hadoop.ParquetFileReader|: Reading through the vectored API.[readAllVectored]
2022-10-20 21:36:03,439 [WARN] [TezChild] |hadoop.ParquetFileReader|: Doing vectored IO for ranges [range[44105694,52494302), range[52494302,56402697)]
2022-10-20 21:36:03,652 [WARN] [TezChild] |hadoop.ParquetFileReader|: Reading through the vectored API.[readAllVectored]
2022-10-20 21:36:03,652 [WARN] [TezChild] |hadoop.ParquetFileReader|: Doing vectored IO for ranges [range[86874390,88752497)]
Hi @ggershinsky @shangxinli
Hi @mukund-thakur Thanks for reaching out! I will have a look once I have time. It is on my radar. Meanwhile, can you fix the build errors? BTW, this is a great feature and we'd love to have it in the next release.
Closing this. The follow-up is here: #1103
Make sure you have checked all steps below.
- Jira
- Tests
- Commits
- Documentation