Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added fixes for handling fetch_range extending beyond length of the file #247

Merged
merged 15 commits into from
Jun 14, 2021

Conversation

hayesgb
Copy link
Collaborator

@hayesgb hayesgb commented Jun 9, 2021

Fix for #241

Copy link
Contributor

@anders-kiaer anders-kiaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a great package 🎉 I'm new to both adlfs and fsspec, but we had a use case which apparently adlfs solves brilliantly.

Context: We have been facing the following challenge: Downloading a very small subset of columns from a large .parquet file has very poor performance on adlfs>=0.3. For adlfs<0.3 the performance is good.

At least in my case, I've pinned it down to the same thing as pointed out in #241: More bytes than necessary are downloaded from blob storage.

I have confirmed that changing the input to Azure's download_blob from length=end to length=end-start gives same performance also on adlfs>=0.3 in my case, as for adlfs<0.3. I.e. it might be that this PR also fix #57 at the same time (not 100% sure if the issue over there is similar, but it involves the same >=< 0.3 version border at least) .

adlfs/spec.py Outdated
@@ -1769,20 +1772,23 @@ def connect_client(self):
f"Unable to fetch container_client with provided params for {e}!!"
)

async def _async_fetch_range(self, start: int, end: int, **kwargs):
async def _async_fetch_range(self, start: int, length: int = None, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is ultimately overriding _fetch_range in fsspec.spec.AbstractBufferedFile right? If so, I guess the function signature needs to be kept start and end here, since that is what fsspec assumes?

async with self.container_client:
stream = await self.container_client.download_blob(
blob=self.blob, offset=start, length=end
blob=self.blob, offset=start, length=length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and then this line instead simply becomes

Suggested change
blob=self.blob, offset=start, length=length
blob=self.blob, offset=start, length=end-start

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anders-kiaer. I appreciate the feedback, and the confirmation that this issue resolves the challenge cited in #57.

I'll align the params in fetch_range to fsspec.AbstractBufferedFile, but I also want to account for the situation where end > self.size, and also for the eventuality that length is None, which is valid for the Azure SDK. Can you take a look at this branch and provide feedback on performance?

Copy link
Contributor

@anders-kiaer anders-kiaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick benchmark using a 2.2 GB .parquet file in Azure blob storage (106145 rows x 6837 columns),
and timed how much time it took to extract one column.

Results:

  • adlfs==0.2.4 + azure-storage-blob==2.1.0: 0.68 ± 0.11 seconds
  • adlfs==0.7.6 + azure-storage-blob==12.8.1: 47.1 ± 1.22 seconds
  • adlfs==(this branch) + azure-storage-blob==12.8.1: 0.64 ± 0.09 seconds

So that is a decent 99% reduction in execution time compared to 0.7.6, and also slightly faster (however well within 1σ though) than 0.2.4 in my unofficial benchmark + with my specific .parquet file. The performance improvement would obviously depend on where in the blob/file you want to extract from (extracting things from the middle of the blob/file would see a huge performance benefit after this PR, extracting from the beginning or end of the file will see the same performance as before the PR).

adlfs/spec.py Outdated Show resolved Hide resolved
adlfs/spec.py Outdated Show resolved Hide resolved
@hayesgb hayesgb merged commit 6cfde2d into master Jun 14, 2021
@hayesgb hayesgb deleted the fix_fetchrange branch June 14, 2021 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants