
segmentation fault from opening large single-column csv with small block_size in pyarrow.csv.open_csv() #38878

Open
jiale0402 opened this issue Nov 24, 2023 · 8 comments
Labels
Component: Python Type: usage Issue is a user question

jiale0402 commented Nov 24, 2023

Describe the usage question you have. Please include as many useful details as possible.

platform: NAME="Ubuntu" VERSION="23.04 (Lunar Lobster)"
pyarrow version: pyarrow 14.0.1, pyarrow-hotfix 0.5
python version: Python 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0] on linux

I have a very large single-column csv file (about 63 million rows). I was hoping to create a lazy file streamer that reads one entry from the csv file at a time. Since I know each entry in my file has a length of 12 chars, I tried setting block_size to 13 (+1 for the \n) with the pyarrow.csv.open_csv function.
import pyarrow as pa
import pyarrow.csv as csv

c_options = csv.ConvertOptions(column_types={'dne': pa.float32()})
r_options = csv.ReadOptions(skip_rows_after_names=8200, use_threads=True,
                            column_names=["dne"], block_size=13)
stream = csv.open_csv(file, convert_options=c_options, read_options=r_options)
This code functions as expected, but when I change the skip_rows_after_names param of the read options to 8300 I start to get segmentation faults inside the open_csv function. How do I fix this (or am I using it wrong)? I want to be able to use only a portion of the file (like from row 98885 to 111200).

I was able to reproduce this error on another computer with the exact same platform and versions. The file was created with:

import random

# FILE_LEN is about 63 million rows
with open(f"feature_{i}.csv", "w+") as f:
    for i in range(FILE_LEN):
        n = random.uniform(-0.5, 0.5)
        nn = str(n)[:12]
        f.write(f"{nn}\n")

Component(s)

Python

jiale0402 added the Type: usage label Nov 24, 2023
mapleFU (Member) commented Nov 25, 2023

Would you like to try the master branch? I think my patch (#38466) may have fixed this.

jiale0402 (Author) commented Nov 26, 2023

> Would you like to try the master branch? I think my patch (#38466) may have fixed this.

Thank you for your response! I tried building pyarrow from the main branch, and now it's saying

pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

But once again, changing skip_rows_after_names to a smaller number suppresses the problem. However, I need a relatively large skip_rows_after_names for my use case while maintaining the small block size, so that I am reading one row at a time. Is there any suggested solution to this issue (or, if I want to read one row at a time from a very large single-column csv that doesn't fit into memory, is there another suggested way of doing it)? A sketch of the kind of reader I mean is below.

Thanks in advance for your time.
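
For concreteness, a minimal sketch of what I mean (placeholder path and row window; using a normal block_size and filtering the row window in Python instead of forcing one row per block):

import pyarrow.csv as csv

# Sketch of a possible workaround: stream with the default block_size and
# pick out rows 98885..111200 one value at a time on the Python side.
start, stop = 98885, 111200
stream = csv.open_csv(
    "path/to/csv",
    read_options=csv.ReadOptions(column_names=["dne"]),
)
seen = 0
for batch in stream:                # the reader yields one RecordBatch per block
    for value in batch.column(0):   # iterate the single column entry by entry
        if start <= seen < stop:
            pass                    # process one entry here
        seen += 1
    if seen >= stop:
        break

This would keep memory bounded by the block size while still handing me one entry at a time.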

mapleFU (Member) commented Nov 26, 2023

> pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

This error is probably raised from the code below. How can I reproduce the problem using C++ or Python code?

Status Chunker::ProcessWithPartial(std::shared_ptr<Buffer> partial,
                                   std::shared_ptr<Buffer> block,
                                   std::shared_ptr<Buffer>* completion,
                                   std::shared_ptr<Buffer>* rest) {
  if (partial->size() == 0) {
    // If partial is empty, don't bother looking for completion
    *completion = SliceBuffer(block, 0, 0);
    *rest = block;
    return Status::OK();
  }
  int64_t first_pos = -1;
  RETURN_NOT_OK(boundary_finder_->FindFirst(std::string_view(*partial),
                                            std::string_view(*block), &first_pos));
  if (first_pos == BoundaryFinder::kNoDelimiterFound) {
    // No delimiter in block => the current object is too large for block size
    return StraddlingTooLarge();
  } else {
    *completion = SliceBuffer(block, 0, first_pos);
    *rest = SliceBuffer(block, first_pos);
    return Status::OK();
  }
}

jiale0402 (Author) commented Nov 26, 2023

Here's a Python sample that I created to replicate this behavior. It reproduces the error on my end. Please let me know whether it produces pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?) for you as well. Thanks!

This sample takes a csv file and erases its content, then writes 30000 random numbers of length 12 to the file (to simulate a standard single-column csv file with constant row length).

import random
import pyarrow.csv as csv

path = "path/to/csv"
row_len = 12

# NOTE: this erases the existing content
with open(path, "w+") as f:
    f.write("name\n")
    for i in range(30000):
        n = random.uniform(-0.5, 0.5)
        n = str(n)[:row_len]
        f.write(f"{n}\n")

stream = csv.open_csv(
    path,
    read_options=csv.ReadOptions(
        skip_rows_after_names=20000,
        use_threads=False,
        block_size=row_len + 1,  # +1 for \n
    ),
)

By the way, I notice that in the code you quoted arrow tries to find a delimiter in the block and raises an error otherwise. But by the standard csv format, a single-column csv file would not contain any , delimiter. However, if I do not set the skip_rows_after_names param on pyarrow.csv.ReadOptions in the sample above, the code functions properly, despite there being no delimiter anywhere in the blocks. Is this intended behavior? For reference, the passing variant is sketched below.
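
A minimal sketch of that passing variant (same path and row_len as in the sample above, just without skip_rows_after_names):

# Same file and tiny block_size as above, but without skip_rows_after_names;
# this streams without raising on my end.
stream = csv.open_csv(
    path,
    read_options=csv.ReadOptions(
        use_threads=False,
        block_size=row_len + 1,  # +1 for \n
    ),
)
for batch in stream:
    pass  # consume the one-row batches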

jiale0402 (Author) commented
Update: it turns out that if I increase the block_size a little (in my case switching from 13 to 130), I get [1] 47744 bus error python3 instead. But increasing it from 130 to 1300 suppresses the problem.

jorisvandenbossche (Member) commented
cc @pitrou

mapleFU (Member) commented Nov 29, 2023

I'll try to reproduce this on the main branch tomorrow; in the meantime you can just use a larger block size as a workaround.

pitrou (Member) commented Nov 29, 2023

Ok, this is because the handling of skip_rows_after_names recurses once per block, which simply hits the maximum C stack size:

return batch_gen().Then([self, batch_gen, max_readahead,
                         prev_bytes_processed](const DecodedBlock& next_block) {
  return self->InitFromBlock(next_block, std::move(batch_gen), max_readahead,
                             prev_bytes_processed);
});

This should not be a problem in normal usage, but you are asking for a large skip_rows_after_names together with a tiny block_size.
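
This also matches the numbers reported above: with block_size=13 each skipped row is one block, so skip_rows_after_names=8300 chains roughly 8300 recursive continuations. A toy Python analogue (not the Arrow code) of one stack frame per skipped block:

import sys

# Toy analogue: skipping N rows one block at a time via a recursive
# continuation costs one stack frame per skipped block.
def skip_blocks(remaining):
    if remaining == 0:
        return "done"
    return skip_blocks(remaining - 1)  # one frame per block, like the Then() chain

sys.setrecursionlimit(9_000)
print(skip_blocks(8_000))   # completes, like skip_rows_after_names=8200
# skip_blocks(20_000)       # RecursionError -- the Python analogue of the segfault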
