-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
_select_member optimization #8
Comments
This appeared to be related to unfilled members caused by flushing the file. |
reopening because this might be easy. the self._members variable is parsed and loaded when you perform idzipfile.open is performed. The content of the member doesn't need to be parsed yet. |
whoops, I read that as chunks, not members. D'oh. one key aspect I abuse about this implementation is that it does very little actual disk IO until read is called. Hesitant to load a members up front because we might need to io through the file to snag all the headers. thoughts on perhaps a special (non-default?) precache flag on open? |
I had been using a workaround in one utility script I wrote to get around the problem fixed in PR #7, which involved defensively flushing the writer prior to filling up a full member, so I had 100x the number of members I should have had. The natural transition point between linear and binary search should have come much later. Lazily loading members is the right way to go. Now that I've removed the workaround, the file is packed into three members, which will be searched faster using a linear method still. If I really had a file with that many full members, I'd be better off loading them lazily and using a binary search to find the members who are absolutely known, and add some special case logic for the last member in the list. |
Profiling of randomly accessing a large number of positions from a large file shows that more time is spent linearly searching for
_Member
objects than actually performing complex computation on them. I propose to optimizeIdzipReader._select_member
to detect when the position requested is within the set of parsed members and use a binary search to select the correct member inO(logn)
time rather thanO(n)
time.Is it at any point reasonable to just load all
_Member
s in a single pass?The text was updated successfully, but these errors were encountered: