fix: Correct projection onto BED schemas #148
Merged
This PR fixes bugs in field selection when reading BED files with specific BEDn+m schemas via the low-level scanner and the high-level data source. #125
Now, a requested projection via the `fields` argument is properly defined against the scanner's own BED schema.

Examples:
- `bed6`: only the first six (named) standard fields are available for projection.
- `bed6+3`: the first six standard fields are available, as well as the extended fields, named `BED6+1`, `BED6+2`, `BED6+3`.
- `bed6+`: the first six standard fields are available, as well as the special field `rest` that lumps the remainder of each textual record past the sixth field, including tabs.

Caveat on field order
One perhaps surprising behavior of passing `fields=` to a scanner or data source is that extended and `rest` fields are always moved to the end of the projection, after the standard fields (though both groups remain in the order requested). This mirrors the behavior of tag fields in the sam/bam/gxf scanners, where the tag struct column always follows the fixed fields, and both groups of fields are returned in the order provided.

Luckily, it seems that consumer libraries that can push projections down to oxbow (like a polars lazy frame) don't expect the results to have the exact schema requested and automatically normalize the column order to the original requested order. However, some consumers may not work this way, and we may wish to explore making all scanners honor a specific column order when projecting a mix of "fixed" and "variable" fields.
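To make the rules above concrete, here is a minimal pure-Python sketch of how a BED schema specifier maps to projectable fields and how a mixed projection ends up ordered. It does not call oxbow. The helper names `available_fields` and `projected_order` are hypothetical, and the standard column names are taken from the UCSC BED specification, so the names the scanner actually exposes may differ.

```python
import re

# Standard BED column names per the UCSC BED specification. Whether the
# scanner exposes exactly these names is an assumption of this sketch.
STANDARD_FIELDS = [
    "chrom", "start", "end", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]

# Matches "bedN", "bedN+M" and "bedN+".
_SCHEMA_RE = re.compile(r"bed(\d+)(?:\+(\d*))?")


def available_fields(schema):
    """Return (standard, variable) field names for a BED schema specifier."""
    match = _SCHEMA_RE.fullmatch(schema.lower())
    if match is None:
        raise ValueError(f"not a BED schema specifier: {schema!r}")
    n = int(match.group(1))
    if not 3 <= n <= len(STANDARD_FIELDS):
        raise ValueError("BED defines between 3 and 12 standard fields")
    standard = STANDARD_FIELDS[:n]
    m = match.group(2)
    if m is None:
        variable = []            # "bed6": standard fields only
    elif m == "":
        variable = ["rest"]      # "bed6+": remainder lumped into one field
    else:
        # "bed6+3": extended fields named BED6+1, BED6+2, BED6+3
        variable = [f"BED{n}+{i}" for i in range(1, int(m) + 1)]
    return standard, variable


def projected_order(schema, fields):
    """Order a projection as described: standard fields first, then
    extended/rest fields, each group kept in the order requested."""
    standard, variable = available_fields(schema)
    unknown = [f for f in fields if f not in standard and f not in variable]
    if unknown:
        raise ValueError(f"fields not in {schema} schema: {unknown}")
    fixed = [f for f in fields if f in standard]
    trailing = [f for f in fields if f in variable]
    return fixed + trailing


requested = ["BED6+2", "name", "chrom", "BED6+1"]
print(projected_order("bed6+3", requested))
# ['name', 'chrom', 'BED6+2', 'BED6+1']
```

A consumer that needs the original order can re-select the returned columns using the `requested` list, which is effectively the normalization described above.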