
httpfs: Avoid corruptions with servers sending more data than asked for #8940

Merged
merged 1 commit into from
Sep 19, 2023

Conversation

carlopi
Contributor

@carlopi carlopi commented Sep 15, 2023

Fixes #8930 and fixes #8908 (avoiding the fatal crash).

HTTP servers are allowed to send back more data than was asked for, and can behave somewhat unreliably. In this case a GET request with a Range header returned more data than requested, and with very big files we ended up overrunning the current buffer.

This is an uncommon but legal behaviour; to handle this situation properly we will have to improve the logic elsewhere.
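The guard described above can be sketched as a bounds check before the copy. This is a minimal, self-contained illustration with hypothetical names (`WriteResponseChunk` and its parameters are not DuckDB's actual API), not the PR's exact code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch: before copying a chunk of HTTP response data into the caller's
// buffer, verify it still fits; bail out (fail the request) instead of
// writing past the end when the server sent more than the Range asked for.
static bool WriteResponseChunk(char *buffer_out, size_t buffer_out_len,
                               size_t &out_offset,
                               const char *data, size_t data_length) {
	if (out_offset + data_length > buffer_out_len) {
		// To avoid corruption of memory, we bail out.
		return false;
	}
	memcpy(buffer_out + out_offset, data, data_length);
	out_offset += data_length;
	return true;
}
```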

@carlopi carlopi requested a review from pdet September 15, 2023 08:06
@carlopi
Contributor Author

carlopi commented Sep 15, 2023

This is an improvement on the current behaviour, so it should be OK, but maybe we can do better, for example by returning a more informative IOException.

Proper handling for this case is still to be done, since currently the query will fail with:

Error: IO Error: Canceled error for HTTP GET Range to 'https://www.ebi.ac.uk/gwas/api/search/downloads/alternative'

Note that we can't just reallocate to a larger buffer in general, or at least that would require some deeper changes.

@pdet
Contributor

pdet commented Sep 15, 2023

Ideally, we allocate buffers there, but that would require a significant code rewrite.

As a workaround, I think the best solution would be to store the excess buffer in a cache at the HTTPState and properly manage it when calling the GetRangeRequest.
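The cache idea above could look something like the following sketch. `RangeCache` is a hypothetical stand-in for state hung off HTTPState, not DuckDB's actual implementation: excess bytes from one response are kept and served to the next range request before a new GET is issued.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: cache the overflow bytes from a previous over-long response and
// satisfy (part of) the next range read from the cache.
struct RangeCache {
	uint64_t start = 0;       // file offset of the first cached byte
	std::vector<char> bytes;  // excess data from the previous response

	// Copies into `out` as many bytes of [offset, offset + len) as the
	// cache holds; returns the number of bytes served.
	size_t TryRead(uint64_t offset, char *out, size_t len) {
		if (bytes.empty() || offset < start || offset >= start + bytes.size()) {
			return 0;
		}
		size_t skip = static_cast<size_t>(offset - start);
		size_t n = std::min(len, bytes.size() - skip);
		memcpy(out, bytes.data() + skip, n);
		return n;
	}
};
```

Only the remainder of the range would then need an actual GetRangeRequest.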

@Mause
Member

Mause commented Sep 15, 2023

Do we need to make a similar fix in the fsspec wrapper as well? I think I used memcpy there too.


// To avoid corruption of memory, we bail out.
return false;
}
memcpy(buffer_out + out_offset, data, data_length);
Collaborator


What if instead of bailing out we just constrain the amount of data buffered to the buffer length, i.e. do something like:

data_length = MinValue<idx_t>(data_length, buffer_out_len - data_length - out_offset);

That way we would just silently discard any data that is sent past the end of the buffer.
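The clamping alternative can be sketched as below. Note this sketch assumes the intent is to clamp to the *remaining* space, `buffer_out_len - out_offset` (a reading of the suggestion above); the helper name `CopyClamped` is hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch: instead of bailing out, copy at most the bytes that still fit in
// the output buffer and silently drop anything the server sent past its end.
static size_t CopyClamped(char *buffer_out, size_t buffer_out_len,
                          size_t out_offset,
                          const char *data, size_t data_length) {
	size_t remaining =
	    buffer_out_len > out_offset ? buffer_out_len - out_offset : 0;
	size_t n = std::min(data_length, remaining);
	memcpy(buffer_out + out_offset, data, n);
	return n;  // bytes actually copied
}
```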

@pdet
Contributor

pdet commented Sep 15, 2023

So, the server behaves strangely: I've tried the bail-out strategy, but the next range request would be ignored and the file would continue downloading from where it stopped.

I've also tried it with the SET force_download=true; flag, which worked fine. I think that for now we should adjust the error message so that, when we detect that range requests are not fully obeyed, it tells the user that the SET force_download=true; flag must be set.

@carlopi carlopi marked this pull request as draft September 15, 2023 15:02
@carlopi
Contributor Author

carlopi commented Sep 15, 2023

Thanks @pdet for the suggestions and for taking a deep dive on this.

I will change this to throw a relevant error, pointing to the SET force_download=true; option
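Such an error could be built roughly as follows. This is a hedged sketch only: the function name `RangeMismatchMessage` and the exact wording are hypothetical, not the message the PR ends up throwing:

```cpp
#include <cassert>
#include <string>

// Sketch: when the server returns more bytes than the Range header requested,
// produce an error message that points the user at the force_download
// workaround instead of crashing or failing opaquely.
static std::string RangeMismatchMessage(size_t requested, size_t received) {
	return "HTTP GET Range request returned " + std::to_string(received) +
	       " bytes, but only " + std::to_string(requested) +
	       " were requested; the server does not fully obey range requests. "
	       "Try running SET force_download=true; and retrying the query.";
}
```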

Fixes duckdb#8930 and duckdb#8908 (avoiding the fatal crash).

HTTP servers are allowed to send back more data than was asked for, and can behave
somewhat unreliably. In this case a GET request with a Range header returned more
data than requested, and with very big files we ended up overrunning the current buffer.

This is an uncommon but legal behaviour; to handle this situation properly we will
have to improve the logic elsewhere.
@carlopi carlopi marked this pull request as ready for review September 16, 2023 17:30
@Mytherin Mytherin merged commit 7d65444 into duckdb:main Sep 19, 2023
50 checks passed
@Mytherin
Collaborator

Thanks!

@seandavi

Thanks, all. I learned a bit with the discussion as well.

@carlopi carlopi deleted the httpfs_range branch October 16, 2023 07:23
Successfully merging this pull request may close these issues.

Selecting from a parquet file over https often crashes duckdb Segmentation fault
5 participants