
segmentation fault from opening large single-column csv with small block_size in pyarrow.csv.open_csv() #38878

Open
jiale0402 opened this issue Nov 24, 2023 · 8 comments
Labels
Component: Python Type: usage Issue is a user question

jiale0402 commented Nov 24, 2023

Describe the usage question you have. Please include as many useful details as possible.

platform: NAME="Ubuntu" VERSION="23.04 (Lunar Lobster)"
pyarrow version: pyarrow 14.0.1, pyarrow-hotfix 0.5
python version: Python 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0] on linux

I have a very large single-column csv file (about 63 million rows). I was hoping to create a lazy file streamer that reads one entry from the csv file at a time. Since I know each entry in my file has a length of 12 chars, I tried setting block_size to 13 (+1 for the \n) with the pyarrow.csv.open_csv function.
import pyarrow as pa
import pyarrow.csv as csv

c_options = csv.ConvertOptions(column_types={'dne': pa.float32()})
r_options = csv.ReadOptions(skip_rows_after_names=8200, use_threads=True,
                            column_names=["dne"], block_size=13)
stream = csv.open_csv(file, convert_options=c_options, read_options=r_options)
This code functions as expected, but when I change the skip_rows_after_names param of the read options to 8300 I start to get segmentation faults inside the open_csv function. How do I fix this (or am I using it wrong)? I want to be able to use only a portion of the file (like from row 98885 to 111200).

I was able to reproduce this error on another computer with the exact same platform and versions. The file was created with:

import random

# FILE_LEN is about 63 million rows
with open(f"feature_{i}.csv", "w+") as f:
    for i in range(FILE_LEN):
        n = random.uniform(-0.5, 0.5)
        nn = str(n)[:12]
        f.write(f"{nn}\n")

Component(s)

Python

jiale0402 added the Type: usage label Nov 24, 2023
mapleFU (Member) commented Nov 25, 2023

Would you like to try the master branch? I think my patch (#38466) may have fixed this.

jiale0402 (Author) commented Nov 26, 2023

> Would you like to try the master branch? I think my patch (#38466) may have fixed this.

Thank you for your response! I tried building pyarrow from the main branch, and now it's saying

pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

But once again, changing skip_rows_after_names to a smaller number suppresses the problem. However, I need a relatively large skip_rows_after_names for my use case while maintaining the small block size, so that I am reading one row at a time. Is there any suggested solution to this issue (or, if I want to read one row at a time from a very large single-column csv that doesn't fit into memory, is there another suggested way of doing it)? A sketch of the kind of reader I mean is below.

Thanks in advance for your time.
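
For concreteness, a minimal sketch of what I mean (placeholder path and row window; using a normal block_size and filtering the row window in Python instead of forcing one row per block):

import pyarrow.csv as csv

# Sketch of a possible workaround: stream with the default block_size and
# pick out rows 98885..111200 one value at a time on the Python side.
start, stop = 98885, 111200
stream = csv.open_csv(
    "path/to/csv",
    read_options=csv.ReadOptions(column_names=["dne"]),
)
seen = 0
for batch in stream:                # the reader yields one RecordBatch per block
    for value in batch.column(0):   # iterate the single column entry by entry
        if start <= seen < stop:
            pass                    # process one entry here
        seen += 1
    if seen >= stop:
        break

This would keep memory bounded by the block size while still handing me one entry at a time.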

mapleFU (Member) commented Nov 26, 2023

> pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

This error is probably raised from the code below. How can I reproduce the problem using C++ or Python code?

Status Chunker::ProcessWithPartial(std::shared_ptr<Buffer> partial,
                                   std::shared_ptr<Buffer> block,
                                   std::shared_ptr<Buffer>* completion,
                                   std::shared_ptr<Buffer>* rest) {
  if (partial->size() == 0) {
    // If partial is empty, don't bother looking for completion
    *completion = SliceBuffer(block, 0, 0);
    *rest = block;
    return Status::OK();
  }
  int64_t first_pos = -1;
  RETURN_NOT_OK(boundary_finder_->FindFirst(std::string_view(*partial),
                                            std::string_view(*block), &first_pos));
  if (first_pos == BoundaryFinder::kNoDelimiterFound) {
    // No delimiter in block => the current object is too large for block size
    return StraddlingTooLarge();
  } else {
    *completion = SliceBuffer(block, 0, first_pos);
    *rest = SliceBuffer(block, first_pos);
    return Status::OK();
  }
}

jiale0402 (Author) commented Nov 26, 2023

Here's a Python sample that I created to replicate this behavior. It reproduces the error on my end. Please let me know whether it produces pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?) for you as well. Thanks!

This sample takes a csv file and erases its content, then writes 30000 random numbers of length 12 to the file (to simulate a standard single-column csv file with constant row length).

import random
import pyarrow.csv as csv

path = "path/to/csv"
row_len = 12

# NOTE: this erases the existing content
with open(path, "w+") as f:
    f.write("name\n")
    for i in range(30000):
        n = random.uniform(-0.5, 0.5)
        n = str(n)[:row_len]
        f.write(f"{n}\n")

stream = csv.open_csv(
    path,
    read_options=csv.ReadOptions(
        skip_rows_after_names=20000,
        use_threads=False,
        block_size=row_len + 1,  # +1 for \n
    ),
)

By the way, I notice that in the code you quoted arrow tries to find a delimiter in the block and raises an error otherwise. But by the standard csv format, a single-column csv file would not contain any , delimiter. However, if I do not set the skip_rows_after_names param on pyarrow.csv.ReadOptions in the sample above, the code functions properly, despite there being no delimiter anywhere in the blocks. Is this intended behavior? For reference, the passing variant is sketched below.
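
A minimal sketch of that passing variant (same path and row_len as in the sample above, just without skip_rows_after_names):

# Same file and tiny block_size as above, but without skip_rows_after_names;
# this streams without raising on my end.
stream = csv.open_csv(
    path,
    read_options=csv.ReadOptions(
        use_threads=False,
        block_size=row_len + 1,  # +1 for \n
    ),
)
for batch in stream:
    pass  # consume the one-row batches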

jiale0402 (Author) commented
Update: it turns out that if I increase the block_size a little (in my case switching from 13 to 130), I get [1] 47744 bus error python3 instead. But increasing it from 130 to 1300 suppresses the problem.

jorisvandenbossche (Member) commented
cc @pitrou

mapleFU (Member) commented Nov 29, 2023

I'll try to reproduce this on the main branch tomorrow; in the meantime you can just use a larger block size as a workaround.

pitrou (Member) commented Nov 29, 2023

Ok, this is because the handling of skip_rows_after_names recurses once per block, which simply hits the maximum C stack size:

return batch_gen().Then([self, batch_gen, max_readahead,
                         prev_bytes_processed](const DecodedBlock& next_block) {
  return self->InitFromBlock(next_block, std::move(batch_gen), max_readahead,
                             prev_bytes_processed);
});

This should not be a problem in normal usage, but you are asking for a large skip_rows_after_names together with a tiny block_size.
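
This also matches the numbers reported above: with block_size=13 each skipped row is one block, so skip_rows_after_names=8300 chains roughly 8300 recursive continuations. A toy Python analogue (not the Arrow code) of one stack frame per skipped block:

import sys

# Toy analogue: skipping N rows one block at a time via a recursive
# continuation costs one stack frame per skipped block.
def skip_blocks(remaining):
    if remaining == 0:
        return "done"
    return skip_blocks(remaining - 1)  # one frame per block, like the Then() chain

sys.setrecursionlimit(9_000)
print(skip_blocks(8_000))   # completes, like skip_rows_after_names=8200
# skip_blocks(20_000)       # RecursionError -- the Python analogue of the segfault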
