Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test and fix bug (?) in 10x signature computation #1158

Closed
ctb opened this issue Aug 9, 2020 · 8 comments
Closed

test and fix bug (?) in 10x signature computation #1158

ctb opened this issue Aug 9, 2020 · 8 comments

Comments

@ctb
Copy link
Contributor

ctb commented Aug 9, 2020

we have the following code in command_compute.py, and I think it has to be a bug.

            for fasta in fastas:
                for n, record in enumerate(screed.open(fasta)):
                    # make minhashes for each sequence
                    minhashes = make_minhashes(args.ksizes, args.seed, args.protein,
                                           args.dayhoff, args.hp, args.dna,
                                           args.num_hashes,
                                           args.track_abundance, args.scaled)
                    add_seq(minhashes, record.sequence,
                            args.input_is_protein, args.check_sequence)

                siglist += build_siglist(minhashes, fasta, name=record.name)

(direct link to full context in a specific commit here)

it seems that minhashes is being built multiple times and discarded, and only the one with the last record is being kept. I suspect that the make_minhashes call should be within the outer loop, not within the inner loop.

@olgabot your thoughts appreciated!

assuming this is a bug, we should make sure we add a test that triggers this before we fix it!

@olgabot
Copy link
Collaborator

olgabot commented Aug 11, 2020

@pranathivemuri do you know what may be happening?

Honestly, I'm okay with dropping/deprecating the 10x/bam support. Beyond toy bam files of a few megabytes, it's not realistic/useful to use on the "real" 22GB bam files that one gets from 10X. We've stopped using sourmash compute directly on the bam files and are instead doing a bunch of preprocessing before we get there. Here are some of the processing steps we do: #1159 (comment)

Do you know if the feature is used by anyone else?

@olgabot
Copy link
Collaborator

olgabot commented Aug 11, 2020

A big part of the problem is that the Python API for samtools, pysam creates a gigantic string of all the reads when you do the Python equivalent of samtools fastq and thus one can't stream over or iterate over the fastq entries. With samtools fastq on the command line, it can output a file which can then be read record by record.

@olgabot
Copy link
Collaborator

olgabot commented Aug 11, 2020

Honestly, I'm okay with dropping/deprecating the 10x/bam support.

To clarify -- drop the support in sourmash directly and point people to bam2fasta to create the fastas/fastqs and then compute sketches from there.

@ctb
Copy link
Contributor Author

ctb commented Aug 11, 2020

hah! I'm glad you broached the subject, because my reading of #1159 (comment) was "this is not something we would support in sourmash directly" 😂

what about leaving it in sourmash compute and sourmash 3.x (perhaps fixing this bug... just 'cause?), and then omit 10x-specific computation from sourmash sketch moving forward?

@ctb
Copy link
Contributor Author

ctb commented Aug 11, 2020

(until and unless there is 10x specific computation that we can usefully support!)

@olgabot
Copy link
Collaborator

olgabot commented Aug 11, 2020

hah! I'm glad you broached the subject, because my reading of #1159 (comment) was "this is not something we would support in sourmash directly" 😂

Yeah it's ... a lot.

what about leaving it in sourmash compute and sourmash 3.x (perhaps fixing this bug... just 'cause?), and then omit 10x-specific computation from sourmash sketch moving forward?

This makes sense to me! Gonna defer to @pranathivemuri for the final call

@pranathivemuri
Copy link
Contributor

That sounds good- fixing the bug and eventually removing the 10x support

@ctb
Copy link
Contributor Author

ctb commented Jan 21, 2021

Turns out this was done in #1229! Created new issue re deprecation in 3.5.1: #1288

@ctb ctb closed this as completed Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants