Arrow CSV reader peak memory is very large #5766
Comments
cc @jinchengchenghh @zhztheplayer, thanks.
I remember Arrow caches all record batches before it streams to Spark. In Gazelle we initially had the same issue, and had to customize some logic to do real streaming. @zhztheplayer do you remember?
I can't recall that. But it doesn't make sense to buffer all the data in a reader. I suppose @jinchengchenghh is looking into this.
I could not reproduce this issue. I tested TPC-H Q6 with 600 GB of data and printed the peak every time Arrow reserves memory.
This is the test result.
After I changed --master from local[18] to local[2], the peak memory was the same.
@jinchengchenghh I will test the latest code.
@jinchengchenghh can you print in the record batch constructor and destructor to confirm? There should be only one record batch alive, no more than three.
@jinchengchenghh Have you checked the size of a single CSV file?
I assume you used an intermediate commit of the CSV reader; there is redundant
The printed information is from each time we request memory from the Arrow memory pool, not per record batch.
@jinchengchenghh I will test the latest code today.
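Tracking the peak at reservation time, as described above, can be mimicked with a tiny accounting pool (a hypothetical class for illustration, not Arrow's actual memory pool): the peak must be updated on every request, because releases in between can hide the true high-water mark.

```java
// Minimal sketch of peak/current accounting, updated on every reservation.
// TrackingPool is a hypothetical illustration, not an Arrow class.
public class TrackingPool {
    private long current;
    private long peak;

    public void reserve(long bytes) {
        current += bytes;
        peak = Math.max(peak, current); // update high-water mark on each request
    }

    public void release(long bytes) {
        current -= bytes;
    }

    public long peak() { return peak; }
    public long current() { return current; }

    public static void main(String[] args) {
        TrackingPool p = new TrackingPool();
        p.reserve(100);
        p.reserve(50);
        p.release(100);
        p.reserve(30);
        System.out.println("peak=" + p.peak() + ", current=" + p.current()); // peak=150, current=80
    }
}
```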
@jinchengchenghh I tested the latest code, and the peak memory usage is still relatively high. I did not add logs inside Arrow; I check the peak with the following code:

```java
public long peak() {
  return sharedUsage.peak();
}

public long current() {
  return sharedUsage.current();
}

@Override
public void release() throws Exception {
  System.out.println("peak=" + listener.peak() + ", current=" + listener.current());
  if (arrowPool.getBytesAllocated() != 0) {
    LOGGER.warn(
        String.format(
            "Arrow pool still reserved non-zero bytes, "
                + "which may cause memory leak, size: %s. ",
            Utils.bytesToString(arrowPool.getBytesAllocated())));
  }
  arrowPool.close();
}
```

I created a Parquet table and used
I continued testing the catalog_sales table, where each CSV file is 1.15 GB in size. The log output is as follows; the peak memory is about 1064 MB:
I constructed a larger catalog_sales table with a single 30 GB CSV file. The log output is as follows; the peak memory is about 6 GB:
The peak memory logs that I printed should reflect only the CSV reader's usage. But this issue is not that urgent for me at the moment; after splitting the large CSV file into smaller files, everything works normally.
I think it is because Arrow does not support passing a file start and length to split a file, so its peak memory is high for a very big CSV file.
Do you mean the Arrow CSV reader doesn't support splits? Each partition must then have one or more whole CSV files, instead of part of a large CSV file.
Yes. |
It is easy for Arrow to support a file offset and length; we just need to use
https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/file_base.cc#L110. I can help implement it on demand.
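For context, the usual byte-range contract for splitting a text/CSV file by (start, length), as line-based readers in Spark do, can be sketched in plain Java (the class and method names here are illustrative, not Arrow or Spark API): a non-leading split skips forward to the first record boundary after its start, and every split reads past start + length to complete its last record, so each record is owned by exactly one split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of byte-range CSV split semantics (not Arrow/Spark code).
public class ByteRangeCsv {
    /** Returns the lines whose starting byte falls in [start, start + length). */
    public static List<String> readSplit(byte[] file, long start, long length) {
        List<String> out = new ArrayList<>();
        int pos = (int) start;
        // A non-leading split skips the (possibly partial) line it lands in;
        // that line belongs to the previous split, which reads past its end.
        if (start > 0) {
            while (pos < file.length && file[pos - 1] != '\n') pos++;
        }
        long end = start + length;
        // Read whole lines; the last line may extend past `end`.
        while (pos < file.length && pos < end) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') pos++;
            out.add(new String(file, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // skip the '\n'
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] csv = "a,1\nb,2\nc,3\nd,4\n".getBytes(StandardCharsets.UTF_8);
        // Two 8-byte splits cover all rows exactly once.
        System.out.println(readSplit(csv, 0, 8)); // [a,1, b,2]
        System.out.println(readSplit(csv, 8, 8)); // [c,3, d,4]
    }
}
```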
Thank you, Chengcheng. Let's hold on until we get requests.
@jinchengchenghh Spark will split a single CSV file into multiple partitions for reading, so we need to pass start and length to Arrow. I have currently worked around this issue with some hacks; otherwise, the same CSV file would be read multiple times.
@jinchengchenghh Do we pass the CSV file to Arrow multiple times if it is split by Spark?
I mark this format as splittable = false, so it should not split.
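The effect of marking a format non-splittable can be sketched in plain Java (the planner below and its `maxSplitBytes` parameter are illustrative, not Spark's actual API): a splittable file is carved into fixed-size (start, length) ranges, while a non-splittable one always becomes a single whole-file partition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Spark-style file split planning (not Spark code).
public class SplitPlanner {
    /** Returns {start, length} ranges that would be handed to the reader. */
    public static List<long[]> plan(long fileSize, long maxSplitBytes, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // Non-splittable format: the whole file goes to one partition.
            splits.add(new long[] {0, fileSize});
            return splits;
        }
        for (long off = 0; off < fileSize; off += maxSplitBytes) {
            splits.add(new long[] {off, Math.min(maxSplitBytes, fileSize - off)});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300 MB file with a 128 MB target split size:
        System.out.println(plan(300L << 20, 128L << 20, true).size());  // 3
        System.out.println(plan(300L << 20, 128L << 20, false).size()); // 1
    }
}
```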
Backend
VL (Velox)
Bug description
When reading large CSV files, for example when a single CSV file in a table is 300 MB, the peak memory usage of the Arrow memory pool during single-threaded reading can reach 500 MB. If the CSV is 2 GB, the peak can reach 1.7 GB. There appears to be no memory leak, but the peak memory usage is very high.
From the code of Arrow Dataset, it seems that we are using the streaming reader, so in theory memory consumption should not increase proportionally with the size of the CSV file.
I have added some code in the release method of ArrowNativeMemoryPool to check the peak memory.
I also added some logs in the Arrow code to check the peak memory.
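The streaming expectation above can be illustrated with a plain-Java sketch (no Arrow involved; the chunk size is an arbitrary assumption): a streaming reader keeps only one fixed-size chunk live at a time, so its buffer footprint is constant regardless of file size, whereas a buffer-everything reader's footprint grows linearly with the input.

```java
import java.util.Arrays;

// Illustrative comparison of live-buffer peaks; not Arrow reader code.
public class StreamingPeak {
    /** Buffer-everything: peak live bytes equals the whole input size. */
    public static long bufferedPeak(byte[] data) {
        byte[] all = data.clone(); // the entire file is resident at once
        return all.length;
    }

    /** Streaming: peak live bytes equals one chunk, independent of input size. */
    public static long streamingPeak(byte[] data, int chunkSize) {
        long peak = 0;
        for (int off = 0; off < data.length; off += chunkSize) {
            int n = Math.min(chunkSize, data.length - off);
            byte[] chunk = Arrays.copyOfRange(data, off, off + n); // only this chunk is live
            peak = Math.max(peak, chunk.length);
        }
        return peak;
    }

    public static void main(String[] args) {
        byte[] big = new byte[1 << 20]; // pretend this is a 1 MiB CSV
        System.out.println(bufferedPeak(big));            // 1048576
        System.out.println(streamingPeak(big, 64 * 1024)); // 65536
    }
}
```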
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response