httpfs: Avoid corruptions with servers sending more data than asked for #8940
Conversation
This is an improvement on the current behaviour, so it should be OK, but maybe we can do better by returning a more informative IOException. Proper handling for these cases is still to be done, since currently the query will fail with: …
Note that we can't just reallocate to a newer buffer in general, or at least that would require some deeper changes.
Ideally, we would allocate buffers there, but that would require a significant code rewrite. As a workaround, I think the best solution would be to store the excess buffer in a cache at the …
Do we need to make a similar fix in the fsspec wrapper as well? I think I used memcpy there too.
```cpp
	// To avoid corruption of memory, we bail out.
	return false;
}
memcpy(buffer_out + out_offset, data, data_length);
```
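For context, here is a self-contained sketch of the guarded copy around that `memcpy`. The function name and callback signature are hypothetical (the real httpfs code differs); only the variable names follow the diff above:

```cpp
#include <cstdint>
#include <cstring> // memcpy

using idx_t = uint64_t; // stand-in for duckdb's idx_t

// Hypothetical callback shape: each response chunk (`data`, `data_length`)
// is appended into the caller-provided `buffer_out` at `out_offset`.
static bool WriteResponseChunk(char *buffer_out, idx_t buffer_out_len, idx_t &out_offset,
                               const char *data, idx_t data_length) {
	if (out_offset + data_length > buffer_out_len) {
		// The server sent more bytes than the range requested; copying them
		// would run past the end of buffer_out, so we bail out.
		return false;
	}
	memcpy(buffer_out + out_offset, data, data_length);
	out_offset += data_length;
	return true;
}
```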
What if instead of bailing out we just constrain the amount of data buffered to the buffer length, i.e. do something like:
```cpp
data_length = MinValue<idx_t>(data_length, buffer_out_len - out_offset);
```
That way we would just silently discard any data that is sent past the end of the buffer.
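A self-contained sketch of that clamping idea, for comparison with the bailout above (the function name is hypothetical, and `std::min` stands in for duckdb's `MinValue`):

```cpp
#include <algorithm> // std::min
#include <cstdint>
#include <cstring>   // memcpy

using idx_t = uint64_t; // stand-in for duckdb's idx_t

// Clamping variant: copy at most the space left in buffer_out and silently
// drop anything the server sent past the end of the buffer.
void CopyClamped(char *buffer_out, idx_t buffer_out_len, idx_t &out_offset,
                 const char *data, idx_t data_length) {
	idx_t copy_length = std::min<idx_t>(data_length, buffer_out_len - out_offset);
	memcpy(buffer_out + out_offset, data, copy_length);
	out_offset += copy_length;
}
```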
So, the server behaves strangely. I've tried the bailout strategy, but the next range request would be ignored and the file would continue downloading from where it stopped. I've also tried it with the …
Thanks @pdet for the suggestions and for taking a deep dive on this. I will change this to throw a relevant error, pointing to the …
Thanks!
Thanks, all. I learned a bit from the discussion as well.
Fixes #8930 and fixes #8908 (avoiding the fatal crash).
HTTP servers are allowed to send back more data than asked for, and can be somewhat unreliable. In this case a GET request with a Range header returns more data than asked for, and with very big files we ended up overwriting the current buffer.
This is an uncommon but legal behaviour; to handle it properly we will have to improve the logic elsewhere.
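To make the failure mode concrete, here is a minimal sketch of a ranged read that validates the response size before copying into a fixed-size buffer. `RangedGet` and `ReadRange` are hypothetical names for illustration, not the httpfs API:

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

using idx_t = uint64_t; // stand-in for duckdb's idx_t

// Hypothetical helper: performs a GET with "Range: bytes=offset-(offset+len-1)"
// and returns the raw response body. A server may ignore the Range header and
// legally return more bytes than requested.
std::string RangedGet(const std::string &url, idx_t offset, idx_t len);

void ReadRange(const std::string &url, idx_t offset, char *buffer_out, idx_t buffer_out_len) {
	auto body = RangedGet(url, offset, buffer_out_len);
	if (body.size() > buffer_out_len) {
		// Without this check, copying body.size() bytes would overrun
		// buffer_out, which is exactly the corruption this PR guards against.
		throw std::runtime_error("HTTP GET error: server returned " + std::to_string(body.size()) +
		                         " bytes for a " + std::to_string(buffer_out_len) +
		                         "-byte range request");
	}
	memcpy(buffer_out, body.data(), body.size());
}
```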