@nvictus nvictus commented Nov 26, 2025

This PR fixes bugs in field selection when reading BED files with specific BEDn+m schemas via the low-level scanner and the high-level data source. #125

Now, a requested projection via the fields argument is properly defined against the scanner's own BED schema.

Examples:

  • bed6 - only the first six (named) standard fields are available for projection.
  • bed6+3 - the first six standard fields are available, as well as the extended fields, named BED6+1, BED6+2, BED6+3.
  • bed6+ - the first six standard fields are available, as well as the special field rest that lumps the remainder of each textual record past the sixth field, including tabs.
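
As a rough illustration of the rules above, here is a minimal pure-Python sketch that lists the fields available for projection under a given schema string. It does not call oxbow. The helper name `available_fields` is hypothetical, and the standard column names are taken from the UCSC BED convention, so their exact spelling in oxbow is an assumption. Only the extended names (`BED6+1`, ...) and the `rest` field come from this PR description.

```python
import re

# Standard BED column names following the UCSC convention.
# Assumption: oxbow's exact spellings may differ.
STANDARD_FIELDS = [
    "chrom", "start", "end", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]


def available_fields(schema: str) -> list[str]:
    """Return the projectable field names for 'bedN', 'bedN+M' or 'bedN+'."""
    match = re.fullmatch(r"bed(\d+)(?:\+(\d*))?", schema.lower())
    if match is None:
        raise ValueError(f"unrecognized BED schema: {schema!r}")

    n = int(match.group(1))
    if not 3 <= n <= 12:
        raise ValueError("a BED schema needs between 3 and 12 standard fields")

    fields = STANDARD_FIELDS[:n]
    extra = match.group(2)
    if extra is None:
        # 'bedN': only the standard fields.
        return fields
    if extra == "":
        # 'bedN+': one lumped field holding the remainder of each record.
        return fields + ["rest"]
    # 'bedN+M': M extended fields named BEDN+1 ... BEDN+M.
    return fields + [f"BED{n}+{i}" for i in range(1, int(extra) + 1)]


print(available_fields("bed6+3"))
```

Under these assumptions, requesting a field outside the returned list (for example `rest` with a plain `bed6` schema) would not be a valid projection.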

Caveat on field order

One perhaps surprising behavior of passing fields= to a scanner or data source is that extended and rest fields are always moved to the end of the projection, after the standard fields, though each group keeps the order in which its fields were requested. This mirrors the behavior of tag fields in the sam/bam/gxf scanners, where the tag struct column always follows the fixed fields, and both groups of fields are returned in the order provided.
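
The ordering rule can be sketched in a few lines of plain Python. This models the described behavior only and is not oxbow's implementation. The function name `projected_order` is hypothetical, and the standard column names are assumed from the UCSC BED convention.

```python
# Assumed names for the six standard fields of a bed6+3 schema.
STANDARD = {"chrom", "start", "end", "name", "score", "strand"}


def projected_order(requested: list[str]) -> list[str]:
    """Standard fields first, then extended/rest fields, each group in request order."""
    fixed = [f for f in requested if f in STANDARD]
    variable = [f for f in requested if f not in STANDARD]
    return fixed + variable


requested = ["BED6+2", "chrom", "BED6+1", "start"]
print(projected_order(requested))
# → ['chrom', 'start', 'BED6+2', 'BED6+1']
```

A consumer that needs the exact requested order can reselect the columns by name after reading.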

Luckily, it seems that consumer libraries that can push projections down to oxbow (like a polars lazy frame) don't expect the results to have the exact schema requested and automatically restore the originally requested column order. However, some consumers may not work this way, so we may wish to explore making all scanners honor a specific column order when projecting a mix of "fixed" and "variable" fields.

@nvictus nvictus merged commit be36159 into abdenlab:main Nov 27, 2025
7 checks passed