
_select_member optimization #8

Open
mobiusklein opened this issue Jun 9, 2018 · 4 comments
Comments

@mobiusklein (Contributor)

Profiling of randomly accessing a large number of positions in a large file shows that more time is spent linearly searching for _Member objects than actually performing complex computation on them. I propose optimizing IdzipReader._select_member to detect when the requested position falls within the set of already-parsed members, and to use a binary search to select the correct member in O(log n) time rather than O(n) time.

Is it at any point reasonable to just load all _Members in a single pass?
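The proposed lookup could be sketched with `bisect` over the members' start offsets. This is a minimal sketch, not idzip's actual code: `_Member`, `start_offset`, and `length` are stand-in names for illustration.

```python
import bisect

class _Member:
    """Hypothetical stand-in for idzip's _Member: one decompressed span."""
    def __init__(self, start_offset, length):
        self.start_offset = start_offset
        self.length = length

def select_member(members, starts, position):
    """Binary-search for the member covering `position` in O(log n).

    `starts` is a parallel, sorted list of each member's start_offset,
    so bisect_right finds the rightmost member starting at or before
    `position` without a linear scan.
    """
    i = bisect.bisect_right(starts, position) - 1
    if i < 0:
        raise ValueError("position before first member")
    member = members[i]
    if position >= member.start_offset + member.length:
        raise ValueError("position beyond parsed members")
    return member

members = [_Member(0, 100), _Member(100, 100), _Member(200, 50)]
starts = [m.start_offset for m in members]
assert select_member(members, starts, 150) is members[1]
```

The `starts` list only needs to be rebuilt when a new member is parsed, so the per-seek cost drops to a single bisect.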

@mobiusklein (Contributor, Author)

This appeared to be related to unfilled members caused by flushing the file.

@bauman (Owner)

bauman commented Jun 10, 2018

reopening because this might be easy.

the self._members variable is parsed and loaded when idzipfile.open is performed. The content of the members doesn't need to be parsed yet.

@bauman bauman reopened this Jun 10, 2018
@bauman (Owner)

bauman commented Jun 10, 2018

whoops, I read that as chunks, not members. D'oh.

one key aspect I abuse about this implementation is that it does very little actual disk I/O until read is called. Hesitant to load all members up front because we might need to do I/O through the whole file to snag all the headers.

thoughts on perhaps a special (non-default?) precache flag on open?
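The precache idea floated above might look something like the following sketch. Everything here is hypothetical: `MemberIndex`, `header_source`, and the `precache` keyword are illustrative names, not idzip's real API. The point is that header I/O is paid either all at once up front or one member at a time on demand.

```python
class MemberIndex:
    """Sketch: lazy member-header loading with an opt-in precache flag.

    `header_source` is any iterator yielding parsed member headers; in
    a real reader each next() would cost a seek plus a read, which is
    why precaching would be non-default.
    """
    def __init__(self, header_source, precache=False):
        self._source = iter(header_source)
        self.members = []
        if precache:
            # One up-front pass over the file: all header I/O now, so
            # later seeks can search a complete member list.
            self.members.extend(self._source)

    def load_next(self):
        """Lazily pull one more member header; returns False at EOF."""
        try:
            self.members.append(next(self._source))
            return True
        except StopIteration:
            return False
```

With `precache=False` the reader keeps its current "no disk I/O until read" behavior; with `precache=True` the member list is complete from the start.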

@mobiusklein (Contributor, Author)

I had been using a workaround in one utility script I wrote to get around the problem fixed in PR #7. It defensively flushed the writer before a member was full, so I had 100x the number of members I should have had. The natural transition point between linear and binary search should have come much later.

Lazily loading members is the right way to go. Now that I've removed the workaround, the file is packed into just three members, which will still be searched faster with a linear scan. If I really had a file with that many full members, I'd be better off loading them lazily and using a binary search over the members that are fully known, plus some special-case logic for the last member in the list.
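The hybrid approach described above could be sketched as follows. This is illustrative only: `find_member` and `load_next_member` are hypothetical names. Fully-known members are bisected; when the position lies past the last known member, more members are loaded lazily until one covers it (the special case for the tail).

```python
import bisect

def find_member(starts, lengths, position, load_next_member):
    """Locate the index of the member covering `position`.

    `starts`/`lengths` describe members parsed so far (sorted by
    start). If `position` falls past the last known member, keep
    pulling headers via load_next_member() until one covers it.
    """
    while True:
        i = bisect.bisect_right(starts, position) - 1
        if i >= 0 and position < starts[i] + lengths[i]:
            return i  # covered by an already-known member
        nxt = load_next_member()  # returns (start, length) or None
        if nxt is None:
            raise EOFError("position past end of file")
        start, length = nxt
        starts.append(start)
        lengths.append(length)

# Demo: headers are "loaded" from an iterator standing in for disk I/O.
it = iter([(0, 100), (100, 100), (200, 100)])
starts, lengths = [], []
assert find_member(starts, lengths, 50, lambda: next(it, None)) == 0
assert find_member(starts, lengths, 250, lambda: next(it, None)) == 2
```

For a three-member file the loop degenerates to almost nothing, which matches the observation that a linear scan is fine at small n.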
