joshuak94
Contributor

@joshuak94 joshuak94 commented Apr 4, 2022

This is a very first draft/WIP for the annotation IO. This will cover annotation file types (BED, bedGraph, wiggle, etc.).

At the moment I've only implemented a very basic BED format (three columns, chrom, chromStart, chromEnd) and the BED header.
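For illustration only (this is not the code in this PR): a minimal sketch of what reading one BED3 line could look like. The names `bed3_record`, `parse_position` and `parse_bed3_line` are hypothetical, and the sketch assumes tab-separated columns, ignores any optional columns after the third, and does not handle header lines.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical record type for illustration; not the type used in this PR.
struct bed3_record
{
    std::string  chrom;       // name of the chromosome or scaffold
    std::int64_t chrom_start; // 0-based start position (inclusive)
    std::int64_t chrom_end;   // end position (exclusive)
};

// Parses a non-negative integer that must span the whole view.
inline bool parse_position(std::string_view text, std::int64_t & out)
{
    char const * const first = text.data();
    char const * const last  = text.data() + text.size();
    auto const [ptr, ec] = std::from_chars(first, last, out);
    return ec == std::errc{} && ptr == last && out >= 0;
}

// Returns the parsed record, or std::nullopt if the line is not valid BED3.
// Assumption: columns are separated by single tab characters.
inline std::optional<bed3_record> parse_bed3_line(std::string_view line)
{
    std::size_t const tab1 = line.find('\t');
    if (tab1 == std::string_view::npos || tab1 == 0) // missing or empty chrom
        return std::nullopt;

    std::size_t const tab2 = line.find('\t', tab1 + 1);
    if (tab2 == std::string_view::npos)
        return std::nullopt;

    // A third tab marks the start of optional columns, which are ignored here.
    std::size_t const tab3 = line.find('\t', tab2 + 1);

    std::string_view const start_text = line.substr(tab1 + 1, tab2 - tab1 - 1);
    std::string_view const end_text =
        line.substr(tab2 + 1, tab3 == std::string_view::npos ? tab3 : tab3 - tab2 - 1);

    bed3_record record{std::string{line.substr(0, tab1)}, 0, 0};
    if (!parse_position(start_text, record.chrom_start) ||
        !parse_position(end_text, record.chrom_end) ||
        record.chrom_start > record.chrom_end)
        return std::nullopt;

    return record;
}
```

Calling `parse_bed3_line("chr1\t100\t200")` would yield a record with `chrom == "chr1"`, `chrom_start == 100` and `chrom_end == 200`; a line with fewer than three columns yields `std::nullopt`.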

TODO

  • Extend BED3 format to full BED format w/ optional fields.
  • Implement the writer.
  • Add support for, at the very least, bigBed (binary & indexed BED), WIG, and bigWig files.
  • Allow reading BAM files as annotation files??

Member

@h-2 h-2 left a comment

Thanks for working on this first of all! The structure seems fine to me!

Regarding the name: I still think that "ann_io" sounds like the name of a secret agent 🕵🏻‍♀️ but I also don't have a great replacement, so we can stick with this for now. We can also discuss with @smehringer to see what she thinks.

Note that I have changed a small thing about reader_base in #47, so the inheritance works a little differently now.

itemRgb, //!< An RGB value to determine the color of the displayed track in the browser.
blockCount, //!< The number of blocks (exons) in the BED file.
blockSizes, //!< A list of the block sizes, corresponding to blockCount.
blockStarts, //!< A list of block starts, relative to offset.
Member

I know these use camelCase in the specification, but it looks very strange to have that mixed with the other formatting styles in this library.
Can we change this to snake_case, or do you think that will confuse users?

Contributor Author

I could change it! I just did it this way because I wasn't sure if I should be consistent with the specs or with our code haha.

Member

Yeah, I think we should stick with the code style for now. I would love to have a table like this in the documentation at some point:

| bio::field:: | bio::fasta | bio::fastq | bio::vcf | bio::bcf | bio::sam | bio::bed |
|---|---|---|---|---|---|---|
| ::id == ::qname | description line | description line | ID | ID | QNAME | name |
| ::seq | sequence data | sequence data | | | SEQ | |
| ::chrom == ::rname | -- | | CHROM | CHROM | RNAME | chrom |
| ::qual == ::mapq | | quality data | QUAL | QUAL | MAPQ | |
| ::pos == ::chrom_start | | | POS | POS | POS | chromStart |

...

So, instead of having individual documentation for all the fields, one big table with the format-specific terminology would be more helpful I think.
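As a rough sketch of the idea (not the library's actual definitions): the `==` aliases in the table could be expressed as enumerators that share a value, and the BED column of the table as a small lookup function. The enum `field` and the function `bed_name` below are hypothetical, and the sketch assumes C++17.

```cpp
#include <string_view>

// Hypothetical sketch; the real bio::field enum may look different.
enum class field
{
    id,
    seq,
    chrom,
    qual,
    pos,
    // Format-specific aliases in snake_case that share a value with the generic name.
    qname       = id,
    rname       = chrom,
    mapq        = qual,
    chrom_start = pos,
};

// Aliases compare equal to the generic enumerator they refer to.
static_assert(field::qname == field::id);
static_assert(field::chrom_start == field::pos);

// Returns the name the BED specification uses for a field (the bio::bed column
// of the table), or an empty view if the table lists no BED counterpart.
constexpr std::string_view bed_name(field f)
{
    switch (f)
    {
        case field::id:    return "name";
        case field::chrom: return "chrom";
        case field::pos:   return "chromStart";
        default:           return "";
    }
}
```

With this layout, users could write `field::chrom_start` in snake_case while the documentation table maps it back to `chromStart` from the specification.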

@joshuak94 joshuak94 marked this pull request as draft April 7, 2022 07:16
@smehringer
Collaborator

For the naming I opened a discussion thread: #51

@joshuak94 joshuak94 force-pushed the ann_io branch 2 times, most recently from 9331634 to a9c4b3e July 4, 2022 14:15

3 participants