Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix error for 32bit systems #54

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

mr-c
Copy link
Contributor

@mr-c mr-c commented Dec 18, 2019

Fixes #48

@rsharris
Copy link
Contributor

rsharris commented Jan 7, 2020

Sorry, this fell off my radar over the holidays. I will try to look at it in the next 7 days.

@rsharris
Copy link
Contributor

Well, it looks good to the extent that I can tell. But ...

The long comment that attempts to describe the file format is still out of date with the code. I recommend the comment just be deleted, as lack of information would be better than incorrect information.

I'm still concerned that BIN_OFFSETS_MAX[0] will (apparently) allow intervals larger than can be storing in the file format. To get into the file, a value is packed with struct.pack(">I") or equivalent. The "Format Characters" table at https://docs.python.org/3/library/struct.html indicates this will be 4 bytes regardless of system sizing. That's great - files written on one system can be read on another. But what happens when a program tries to write to file an interval that exceeds the 4 byte limit? They won't get an exception when creating the interval (at least I don't think they will), but they'll get it later on when they try to write to the file (I think it will be a struct.error exception).

Base automatically changed from master to main March 5, 2021 13:53
@mr-c
Copy link
Contributor Author

mr-c commented Feb 27, 2024

@rsharris and others, your advice is appreciated. We've been carrying this patch in Debian since December 2019, so if it introduces a problem into the file formats, then please let us know. I guess we could drop support for 32-bit architectures for this package on our side, as that is quite uncommon for bioinformatics these days..

@rsharris
Copy link
Contributor

@mr-c I will try to take a look at this this week. Essentially, due to the history of this module, it will mostly be a reverse-engineering task for me.

As I recall the module's history, James wrote the code. At some point after that I was writing a C version. As part of that effort I documented the previously-undocumented file structure. This was probably circa 2005, well before this was in github (probably before even the earlier repo at bitbucket). When I reviewed this mod in 2019, I (apparently) discovered that the file format documentation had not been updated when changes to the code had been made.

For example, line 8 says "This implementation writes version 1 file format, and reads versions 0 and 1." Yet line 112 indicates its version 2. So I don't know whether the format description in lines 10-81 relates to version 2 (I doubt it). And if it doesn't, then I have no spec for version 2, and I'll have to infer it from the code. Which seems something like a catch-22.

I don't know at what point that discrepancy took place, and whether I would find it in the github history. I suspect it may have happened earlier, before github. I also suspect that its even less likely that I would find some description of the motivation of moving from version 1 to version 2 — i.e. what were the shortcomings of version 1 and how was the format modified to address them.

I have a vague recollection from 2019 that I was unable to locate, among my old backups, versions of this module that predate that discrepancy. I'm pretty sure I can find the C version that reads/writes the format.

Also understand that I haven't been a user of this package since probably 2009 (perhaps earlier).

Anyway, I'll try to look at it this week. But I think it will be difficult for me to give a definitive thumbs up/down as far as correctness.

@rsharris
Copy link
Contributor

(1) file history

I was able to locate several versions of bx-python, back to 2007. Unable to locate the C code I mentioned, but I now think I don't need it.

The change to support version 2 of the interval index file format was already in the code as of the earliest I've found.

(2) primary change in this fix

The primary change appears to be to change from "BIN_OFFSETS_MAX[0] = sys.maxsize" to "BIN_OFFSETS_MAX[0] = 2**63"

That change, I believe is correct for files to be work on 32-bit and 64-bit platforms. The caveat is whether this should be 263 or (263)-1. I believe 2**63 is correct, and will discuss this more in my next item.

In the 2007 code, this was "BIN_OFFSETS_MAX[0] = max". That's clearly wrong. The variable max wasn't defined, and the assignment must have been (erroneously) sticking a ref to the function max() in hat location. Most likely the intent of the code was MAX instead of max. MAX wasn't define until a few lines later, at "MAX = 2**31"

(3) unjustified part(s) of fix

The fix refactors the loop inside of offsets_for_max_size(). The re-factoring is clearer (I for one didn't realize for loops could have an else clause). But it also changes the loop exit condition. "max_size <= off_max" uses less-or-equal while the existing code uses strictly-less-than.

It's difficult to tell whether this is a justified correction, or a typo. The comment on line 15 says "All intervals are origin-zero, inclusive start, exclusive end." This would seem to apply to the meaning of the interval given in the description of offsets_for_max_size(). But who knows. I think I'd like to see some rational for why that change was made.

(4) other changes

In bin_for_range(), the assignment to end_bin has been changed, presumably to tolerate an invocation with end<=start. If I've interpreted that correctly, it might be good to add a comment indicating that such an interval is empty, and what the expected return value should be. I think this is really a dont-care state, but something does get returned so it's be nice if the comments told future readers why whatever is turned is appropriate for this case.

@rsharris
Copy link
Contributor

(5) file format description

I also discovered the difference between format version 1 and 2. This is indicated by the description of OLD_MAX and also in Index.open(), and relates to the offsets/bins hierachy. In version 1 this hierarchy was static, supporting a maximum of 2^29.

That information ought to be reflected in the comments describing the file header. I recommend changing line 8 to "This implementation writes version 2 file format, and reads versions 0, 1, and 2." I guess that's the only change. The specifics of the binning hierarchy aren't described anywhere except as can be gleaned from the code, and it's probably not worth the effort to properly describe it now (given the high probability of getting it wrong).

@mr-c
Copy link
Contributor Author

mr-c commented Jul 3, 2024

@rsharris thank you for your deep analysis! I'm not a regular contributor to this repo, just the maintainer of the Debian package.

Can you suggest any specific code changes?

@rsharris
Copy link
Contributor

rsharris commented Jul 3, 2024

Can you suggest any specific code changes?

@mr-c No, not really. I was a part-time contributor to this package 15-20 years ago. I haven't used it since ≈2009. Without a clear spec for the interval index file format, especially with respect to edge cases, it's impossible to know if the code is correct. AFAIK no spec was ever written.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exception: 2147483648 is larger than the maximum possible size (2147483647)
2 participants