
_select_member optimization #8

Open
mobiusklein opened this issue Jun 9, 2018 · 4 comments
Comments

@mobiusklein (Contributor)

Profiling of randomly accessing a large number of positions in a large file shows that more time is spent linearly searching for _Member objects than actually performing complex computation on them. I propose optimizing IdzipReader._select_member to detect when the requested position falls within the set of already-parsed members, and to use a binary search to select the correct member in O(log n) time rather than O(n) time.

Is it at any point reasonable to just load all _Members in a single pass?
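The proposed lookup could be sketched with `bisect` over the members' start offsets. This is a minimal sketch, not idzip's actual code: `_Member`, `start_offset`, and `length` are stand-in names for illustration.

```python
import bisect

class _Member:
    """Hypothetical stand-in for idzip's _Member: one decompressed span."""
    def __init__(self, start_offset, length):
        self.start_offset = start_offset
        self.length = length

def select_member(members, starts, position):
    """Binary-search for the member covering `position` in O(log n).

    `starts` is a parallel, sorted list of each member's start_offset,
    so bisect_right finds the rightmost member starting at or before
    `position` without a linear scan.
    """
    i = bisect.bisect_right(starts, position) - 1
    if i < 0:
        raise ValueError("position before first member")
    member = members[i]
    if position >= member.start_offset + member.length:
        raise ValueError("position beyond parsed members")
    return member

members = [_Member(0, 100), _Member(100, 100), _Member(200, 50)]
starts = [m.start_offset for m in members]
assert select_member(members, starts, 150) is members[1]
```

The `starts` list only needs to be rebuilt when a new member is parsed, so the per-seek cost drops to a single bisect.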

@mobiusklein (Contributor, Author)

This appeared to be related to unfilled members caused by flushing the file.

@bauman (Owner)

bauman commented Jun 10, 2018

reopening because this might be easy.

the self._members variable is parsed and loaded when idzipfile.open is performed. The content of the members doesn't need to be parsed yet.

@bauman bauman reopened this Jun 10, 2018
@bauman (Owner)

bauman commented Jun 10, 2018

whoops, I read that as chunks, not members. D'oh.

one key aspect I abuse about this implementation is that it does very little actual disk I/O until read is called. Hesitant to load all members up front because we might need to do I/O through the whole file to snag all the headers.

thoughts on perhaps a special (non-default?) precache flag on open?
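The precache idea floated above might look something like the following sketch. Everything here is hypothetical: `MemberIndex`, `header_source`, and the `precache` keyword are illustrative names, not idzip's real API. The point is that header I/O is paid either all at once up front or one member at a time on demand.

```python
class MemberIndex:
    """Sketch: lazy member-header loading with an opt-in precache flag.

    `header_source` is any iterator yielding parsed member headers; in
    a real reader each next() would cost a seek plus a read, which is
    why precaching would be non-default.
    """
    def __init__(self, header_source, precache=False):
        self._source = iter(header_source)
        self.members = []
        if precache:
            # One up-front pass over the file: all header I/O now, so
            # later seeks can search a complete member list.
            self.members.extend(self._source)

    def load_next(self):
        """Lazily pull one more member header; returns False at EOF."""
        try:
            self.members.append(next(self._source))
            return True
        except StopIteration:
            return False
```

With `precache=False` the reader keeps its current "no disk I/O until read" behavior; with `precache=True` the member list is complete from the start.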

@mobiusklein (Contributor, Author)

I had been using a workaround in one utility script I wrote to get around the problem fixed in PR #7. It defensively flushed the writer before a member was full, so I had 100x the number of members I should have had. The natural transition point between linear and binary search should have come much later.

Lazily loading members is the right way to go. Now that I've removed the workaround, the file is packed into just three members, which will still be searched faster with a linear scan. If I really had a file with that many full members, I'd be better off loading them lazily and using a binary search over the members that are fully known, plus some special-case logic for the last member in the list.
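The hybrid approach described above could be sketched as follows. This is illustrative only: `find_member` and `load_next_member` are hypothetical names. Fully-known members are bisected; when the position lies past the last known member, more members are loaded lazily until one covers it (the special case for the tail).

```python
import bisect

def find_member(starts, lengths, position, load_next_member):
    """Locate the index of the member covering `position`.

    `starts`/`lengths` describe members parsed so far (sorted by
    start). If `position` falls past the last known member, keep
    pulling headers via load_next_member() until one covers it.
    """
    while True:
        i = bisect.bisect_right(starts, position) - 1
        if i >= 0 and position < starts[i] + lengths[i]:
            return i  # covered by an already-known member
        nxt = load_next_member()  # returns (start, length) or None
        if nxt is None:
            raise EOFError("position past end of file")
        start, length = nxt
        starts.append(start)
        lengths.append(length)

# Demo: headers are "loaded" from an iterator standing in for disk I/O.
it = iter([(0, 100), (100, 100), (200, 100)])
starts, lengths = [], []
assert find_member(starts, lengths, 50, lambda: next(it, None)) == 0
assert find_member(starts, lengths, 250, lambda: next(it, None)) == 2
```

For a three-member file the loop degenerates to almost nothing, which matches the observation that a linear scan is fine at small n.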
