Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] IO: BufferedInputStream would trigger small IO when issue small IO in edge of buffer #37434

Closed
mapleFU opened this issue Aug 29, 2023 · 7 comments · Fixed by #37460
Closed

Comments

@mapleFU
Copy link
Member

mapleFU commented Aug 29, 2023

Describe the enhancement requested

  Result<int64_t> Read(int64_t nbytes, void* out) {
    if (ARROW_PREDICT_FALSE(nbytes < 0)) {
      return Status::Invalid("Bytes to read must be positive. Received:", nbytes);
    }

    if (nbytes < buffer_size_) {
      // Pre-buffer for small reads
      RETURN_NOT_OK(BufferIfNeeded());
    }

    if (nbytes > bytes_buffered_) {
      // Copy buffered bytes into out, then read rest
      memcpy(out, buffer_data_ + buffer_pos_, bytes_buffered_);

      int64_t bytes_to_read = nbytes - bytes_buffered_;
      if (raw_read_bound_ >= 0) {
        bytes_to_read = std::min(bytes_to_read, raw_read_bound_ - raw_read_total_);
      }
      ARROW_ASSIGN_OR_RAISE(
          int64_t bytes_read,
          raw_->Read(bytes_to_read, reinterpret_cast<uint8_t*>(out) + bytes_buffered_));
      raw_read_total_ += bytes_read;

      // Do not make assumptions about the raw stream position
      raw_pos_ = -1;
      bytes_read += bytes_buffered_;
      RewindBuffer();
      return bytes_read;
    } else {
      memcpy(out, buffer_data_ + buffer_pos_, nbytes);
      ConsumeBuffer(nbytes);
      return nbytes;
    }
  }

If we Set BufferSize == 100k, and read 3k bytes per IO. When we read the 34 times, the IO would be (99k, 102k]

In Read, it will read buffered (99k, 100k], issue IO for (100k, 102k]. Rather than (100k, 200k]. Is this expected?

Component(s)

C++

@mapleFU
Copy link
Member Author

mapleFU commented Aug 29, 2023

@lidavidm @pitrou Is this expected? Or did I miss something?

@mapleFU
Copy link
Member Author

mapleFU commented Aug 29, 2023

cc @jp0317 as a Buffered-Read user in parquet :-)

@mapleFU mapleFU changed the title IO: BufferedInputStream would trigger small IO when issue small IO in edge of buffer [C++] IO: BufferedInputStream would trigger small IO when issue small IO in edge of buffer Aug 29, 2023
@lidavidm
Copy link
Member

I would probably have expected it to read a full chunk, yes. I suppose here we should copy the part of the existing buffer that we need, then discard the rest and read a new chunk.

@mapleFU
Copy link
Member Author

mapleFU commented Aug 30, 2023

Okay, I'll submit a version of read new large chunk tonight

@pitrou
Copy link
Member

pitrou commented Aug 30, 2023

Yes, this could be improved. But for large reads (larger than the buffer size), we don't want to write first to the buffer and then to out.

@mapleFU
Copy link
Member Author

mapleFU commented Aug 30, 2023

I've draft a patch here: #37460

@mapleFU
Copy link
Member Author

mapleFU commented Aug 30, 2023

I found that BufferedInputStream::Impl::Peek has the same problem. Should I separate a patch for it? Or just implement Peek and Read refactor in same patch?

@pitrou pitrou added this to the 14.0.0 milestone Aug 31, 2023
pitrou added a commit that referenced this issue Aug 31, 2023
#37460)

### Rationale for this change

If we Set BufferSize == 100k, and read 3k bytes per IO. When we read the 34 times, the IO would be (99k, 102k]

In Read, it will read buffered (99k, 100k], issue IO for (100k, 102k]. Rather than (100k, 200k]. 

### What changes are included in this PR?

Refactor `BufferedInputStream::Read` to optimize small IO.

### Are these changes tested?

Already has tests?

### Are there any user-facing changes?

User might get io-pattern changed. It can be optimization or downgrade.

* Closes: #37434

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…l input (apache#37460)

### Rationale for this change

If we Set BufferSize == 100k, and read 3k bytes per IO. When we read the 34 times, the IO would be (99k, 102k]

In Read, it will read buffered (99k, 100k], issue IO for (100k, 102k]. Rather than (100k, 200k]. 

### What changes are included in this PR?

Refactor `BufferedInputStream::Read` to optimize small IO.

### Are these changes tested?

Already has tests?

### Are there any user-facing changes?

User might get io-pattern changed. It can be optimization or downgrade.

* Closes: apache#37434

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…l input (apache#37460)

### Rationale for this change

If we Set BufferSize == 100k, and read 3k bytes per IO. When we read the 34 times, the IO would be (99k, 102k]

In Read, it will read buffered (99k, 100k], issue IO for (100k, 102k]. Rather than (100k, 200k]. 

### What changes are included in this PR?

Refactor `BufferedInputStream::Read` to optimize small IO.

### Are these changes tested?

Already has tests?

### Are there any user-facing changes?

User might get io-pattern changed. It can be optimization or downgrade.

* Closes: apache#37434

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants