Add fasta2 package #337

folago · 2023-08-23T17:30:43Z

This is exploratory work and does not have to be merged in this state.

There are for sure some functionalities missing as well as docs and examples, but this more or less the direction I would go about a fasta package.

I called it fasta2 just to have it side by side and maybe try both API, but in case some of this package is liked we can decide how much of it to keep and integrate or discard.

Here is a list of things in no particular order I do not like about the status quo that I'm trying to fix in the fasta2 package:

fasta.Fasta stutters so I would rename it to fasta.Record also the Name field should be called Identifier as in the fastq package or Header as mentioned in the BioStar handbook.

I dot understand the gzipReaderFn, openFn, and buildFn functions.
Build() does not sound vary idiomatic for serialization in bytes I used Bytes() as a method of Record.
Test data should be in the testdata directory.
ParseConcurrent() is IMHO the nicest parser implementation, so I extracted the state in a struct and added some methods to get it to be the only parsing code to maintain.
As for the rest of the concurrent functions it is frowned upon to have channels in the API, better to provide a clean API that is easy to use with channels.

Some other thoughts:

Does it make sense to have a Parser interface in the io package and have implementation of it in each sub package?
Is it necessary to have the gzip functions? I understand that it can be convenient but then should we add some other compression algorithms?

Please let me know what you think.

Koeng101 · 2023-08-24T00:16:03Z

io/fasta2/fasta.go

+// Writes the Record []byte representation to the passed io.Writer.
+func (r Record) Write(w io.Writer) error {
+	recBytes := r.Bytes()
+	_, err := w.Write(recBytes)
+	if err != nil {
+		return fmt.Errorf("error writing record to io.Writer: %w", err)
+	}
+
+	return nil
+}


Write transforms the bytes into memory (in the bytes buffer) before actually writing them. Why not just directly write to the io Writer?

Yeah and also call it WriteTo and implement the io.WriterTo interface.
I mostly tried to avoid too many allocations since io.Writer takes a []byte as arguments so it need to be serialized fist in a []byte, and there was already a way to get the record a []byte.

I was actually just looking at it right now, i might send a fixup later.

Also writing to byte.Buffer returns almost always nil errors so it's just simpler.

I sent a small fixup, I think at this point the Record.Write() method can be removed, as the user has a way of getting a string and []byte representation of the Record.
For bulk writing the Write() function is now reusing the same buffer (and buffering the I/O is usually a goodthing™).

It would be cool to have a generic Record.Write() method that takes either a bytes.Buffer or a bufio.Writer so we can just decide to pass the buffer or wrap the writer. I unfortunately am traveling tomorrow until next week so no time this weekend 🙁

Koeng101 · 2023-08-24T00:56:13Z

Generally, I really like the idea of changing the name of fasta.Fasta to fasta.Record. I think we should implement this in fasta.

I don't really understand those "gzipReaderFn, openFn, and buildFn functions" either - if there are two of us, probably others wouldn't either, and this would decrease readability. I think we should remove them.

Does it make sense to have a Parser interface in the io package and have implementation of it in each sub package?

This is something I'd really like to do! Though I think it would probably be a generic of some kind? Basically, I think for every io (genbank, fasta, fastq, slow5, sam, pileup) we should implement the following standard functions:

type Parser[T genbank | fasta | fastq | slow5 | sam | pileup, TH any] struct { // TH should have their respective headers
   Header() (TH, error)
   Next() (T, error)
   MaxLineCount() int64
   Reset() error
}

or maybe something like that - I am not sure what is most readable. Of course, the records / alignments / reads themselves would implement the Write interface, which honestly I think we should just standardize to get into a io.Writer. Generics should at least allow us to build on top of that io.Writer in a way that we aren't rebuilding the same function six times.

Thoughts? happy to help prototype this this weekend.

Is it necessary to have the gzip functions? I understand that it can be convenient but then should we add some other compression algorithms?

GZIP is so widely used by fastq at least, a reasonable default would be to read gz files, not raw text, if it wasn't for the fact that it wouldn't make as much sense to programmers (but make perfect sense to end users - you pretty much only push around sequencing as fastq).

Still, it think this should probably be implemented on top of or somewhere else. It's kinda ridiculous that we have to implement functions like reading gz 6 times. If we had this in one place, it'd be a lot easier to build functions that could compress in may different ways.

For the io.Writer stuff, I have a small PR doing that here - https://github.com/TimothyStiles/poly/pull/337/files . There is one important bug fix in there too - https://github.com/TimothyStiles/poly/blob/5cde1b2232331be07625786b6201c7c0dce1717e/io/fastq/fastq.go#L179-L185 - which I have no idea why it helped but it did, but only for my very large files. I think it has something to do with memory, so it was hard to reproduce in an individual test.

carreter · 2023-09-16T00:35:41Z

@Koeng101 Is there anything in this PR that wasn't included in #339 ?

Koeng101 · 2023-09-16T02:32:32Z

Everything here is implemented in #339 I think, but the start of #339 was because of this PR. I'd like to give some credit somehow, even if there isn't significant code overlap

TimothyStiles · 2023-09-16T15:41:57Z

We should be able to do that @Koeng101. We just need to add @folago's GitHub email in a certain way to the bottom of the PR when we merge it.

Koeng101 · 2023-09-17T01:59:17Z

Aight, we will include that then. Hope that works @folago ! I appreciate your contributions!

folago · 2023-09-18T07:57:08Z

Hey,
sure all good on my side.
Thanks a lot 🙂

add fasta2 package

a40c6c1

Koeng101 reviewed Aug 24, 2023

View reviewed changes

folago added 2 commits August 24, 2023 14:18

fixup! add fasta2 package

fc746c5

fixup! add fasta2 package

b8f54fd

carreter added the duplicate This issue or pull request already exists label Sep 16, 2023

Koeng101 closed this Sep 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fasta2 package #337

Add fasta2 package #337

folago commented Aug 23, 2023 •

edited

Koeng101 Aug 24, 2023

folago Aug 24, 2023

folago Aug 24, 2023

folago Aug 24, 2023 •

edited

Koeng101 commented Aug 24, 2023 •

edited

carreter commented Sep 16, 2023

Koeng101 commented Sep 16, 2023

TimothyStiles commented Sep 16, 2023 •

edited

Koeng101 commented Sep 17, 2023

folago commented Sep 18, 2023

Add fasta2 package #337

Add fasta2 package #337

Conversation

folago commented Aug 23, 2023 • edited

Koeng101 Aug 24, 2023

Choose a reason for hiding this comment

folago Aug 24, 2023

Choose a reason for hiding this comment

folago Aug 24, 2023

Choose a reason for hiding this comment

folago Aug 24, 2023 • edited

Choose a reason for hiding this comment

Koeng101 commented Aug 24, 2023 • edited

carreter commented Sep 16, 2023

Koeng101 commented Sep 16, 2023

TimothyStiles commented Sep 16, 2023 • edited

Koeng101 commented Sep 17, 2023

folago commented Sep 18, 2023

folago commented Aug 23, 2023 •

edited

folago Aug 24, 2023 •

edited

Koeng101 commented Aug 24, 2023 •

edited

TimothyStiles commented Sep 16, 2023 •

edited