Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Create InputStream that is an isolated reader of a segment of a RandomAccessFile #22573

Closed
asfimport opened this issue Aug 8, 2019 · 10 comments

Comments

@asfimport
Copy link

If different threads wants to do buffered reads over different portions of a file (and they are unable to create their own separate file handles), they may clobber each other. I would propose creating an object that keeps the RandomAccessFile internally and implements the InputStream API in a way that is safe from other threads changing the file position

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-6180. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
The way I would do this is as follows:

  • Add RandomAccessFile::GetStream(int64_t offset, int64_t length, std::shared_ptr<InputStream>* out) (or similarly named – might need to be a top-level function so that it can retain a shared_ptr to the parent file)
  • Implement the object returned by this method privately. So no need to add any new class to public headers

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
I don't really understand what this does concretely. Is it some kind of BufferReader over some area of memory maintained by the RandomAccessFile itself?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
@pitrou the problem is that multiple threads are creating their own BufferedInputStream that references different segments of a RandomAccessFile that was passed into say ParquetFileReader

Suppose you have a file that's 100MB and contains 10 10MB chunks. And you have different threads that do buffered processing of each chunk. If you naively pass RandomAccessFile into BufferedInputStream in different threads then the threads will clobber each other

@asfimport
Copy link
Author

Zherui Cao / @czxrrr:
@wesm If I don't change BufferedInputStream, I need to make every BufferedInputStream has their own copy of raw_ (RadomAccessFile) rather than just have a shared pointer, it this what you mean?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
The idea is to have an private implementation of InputStream like this:

/// \brief Reads a segment of a RandomAccessFile in a thread-safe manner
class FileSegmentReader : public InputStream {
 public:
   FileSegmentReader(std::shared_ptr<RandomAccessFile> file, int64_t offset, int64_t length) ...

   ...

 private:
  std::shared_ptr<RandomAccessFile> file_;
  int64_t position_;
  int64_t file_offset_;
  int64_t total_length_;
}

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
I see. That sounds reasonable to me.

@asfimport
Copy link
Author

Deepak Majeti / @majetideepak:
@wesm, @pitrou  looks like the issue can also happen in a single-threaded application when you read multiple column chunks in a row-by-row fashion from each chunk. You end up creating a BufferedInputStream for each chunk and all these InputStreams share the same RandomAccessFile and thereby getting clobbered.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
You can let me take care of this so we aren't going back and forth over the details.

@asfimport
Copy link
Author

Zherui Cao / @czxrrr:
I saw your patch, and Thanks for that.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Issue resolved by pull request 5085
#5085

@asfimport asfimport added this to the 0.15.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants