Benchmarking statistics for divergence simulation tests #27
In the current workflow, we have unaligned RNA-seq reads in each library. This counter should read all header entries in cov0r.fa. This tool sits after alignment and before compression, so it should read the SAM stream.
I've been trying various methods to tackle this. Since speed is a concern, I've been building SuffixTrees for sub-sequence searching. That said, I don't seem to have enough computing power on my personal MacBook to handle this, mainly the SuffixTrees. I've tried using [...]
I think you may have a few critically important points here.

I may be misunderstanding, but you shouldn't have to do any sub-sequence searching; the aligner will already do that, and its output is a SAM stream containing alignment data in the form of a CIGAR string. The only thing that needs to be 'counted' is the number of aligned reads and which reference sequence each read aligned to.
If you're talking about creating an index for the reference sequence NAMES (Accession ID), then I agree this is the correct approach and will be the fastest way to identify exact matches. In this case there are only 33,000 sequence names, which I don't think should be difficult to build an index for.
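For exact name matches, a plain hash-based index over the ~33,000 names would be enough; a minimal sketch (the accession IDs below are made-up placeholders, not taken from cov0r.fa):

```python
# Build an O(1) exact-match index over reference sequence names.
# In practice the names would come from the cov0r.fa headers.
names = ["MN908947.3", "AY278741.1", "NC_004718.3"]  # placeholder accessions
index = {name: i for i, name in enumerate(names)}

def lookup(accession):
    """Return the position of an accession in the pan-genome, or None."""
    return index.get(accession)
```

A Python dict gives constant-time lookups, so even tens of thousands of names are cheap to index and query.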
How are you defining 'unique sequences'? Are there duplications of identical sequences in cov0r? This is very important, as we can update the pan-genome processing step to reduce the search space for the aligner and make things run faster. Great catch!
This is a good point and something we can investigate as a follow-up. If a read is a true (PCR) duplicate and has mapped to the pan-genome, we can heuristically check/remove PCR duplicates in the final aligned data using a dedup tool.

For computing resources: if needed, I can give you permissions to spool up some bigger CPU on AWS, just message me in Slack. As a rule of thumb, every instance we're running in this pipeline is the same instance type.

As one additional note, we've bumped up to cov1r.
Wow, seems like I misunderstood the task. So all the SAM reads carry references to sequences from the pan-genome. Yes, there are quite a few duplicate sequences in cov1r.
NP, good to clarify things :) I'd consider a pysam-independent method too.
How are you measuring duplicates in cov1r? For the python script, this falls under [...]
re: duplicates, I'm looking at the full sequence entries in the FASTA. Here's some sample code:

```python
from Bio import SeqIO

seqs_cov1r_fwd = [str(seq_record.seq)
                  for seq_record in SeqIO.parse("cov1r.fa", "fasta")
                  if 'REVERSE' not in seq_record.id]
print(f'{len(seqs_cov1r_fwd)} forward sequences')
print(f'{len(set(seqs_cov1r_fwd))} unique sequences')
print(f'{len(seqs_cov1r_fwd) - len(set(seqs_cov1r_fwd))} duplicate sequences')
```
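If these really are exact copies, the pan-genome could be shrunk before alignment by keeping only the first occurrence of each sequence. A sketch with a pure-Python core (the Biopython parse, same as above, is left commented so the snippet stands alone):

```python
def dedupe(seqs):
    """Keep the first occurrence of each distinct sequence, preserving order."""
    seen = set()
    out = []
    for s in seqs:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

# Applied to the pan-genome with Biopython:
# from Bio import SeqIO
# records = SeqIO.parse("cov1r.fa", "fasta")
# unique = dedupe(str(r.seq) for r in records)
```

Note this drops later duplicate records entirely; if their accession IDs need to be retained for counting, a mapping from kept sequence to all its IDs would be needed instead.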
Is it necessary to maintain the order of accession IDs from the reference fasta? If it's acceptable to have results unsorted and excluding 0-counts, then it can be achieved with an awk one-liner:

```shell
awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}'
```

The counts can be written to a file in the same pass that produces the BAM:

```shell
cat SRR11454613.sam \
  | tee >(awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}' > SRR11454613_counts.txt) \
  | samtools view -b - > SRR11454613.bam
```
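For reference, the same counting the awk one-liner does can be written as a small pysam-free Python filter over a SAM stream; column 3 is RNAME, and `@`-prefixed header lines are skipped (the accession IDs in the test data are placeholders):

```python
from collections import Counter

def count_rnames(lines):
    """Count aligned reads per reference name (SAM column 3, RNAME),
    skipping @-prefixed header lines, like the awk filter."""
    counts = Counter()
    for line in lines:
        if line and not line.startswith("@"):
            counts[line.rstrip("\n").split("\t")[2]] += 1
    return counts

# Streaming usage, equivalent to the awk one-liner:
# import sys
# for name, n in count_rnames(sys.stdin).items():
#     print(name, n)
```

Unlike the awk version, this also counts unmapped records whose RNAME is `*`, so a `fields[2] != "*"` guard may be wanted depending on how unmapped reads should be treated.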
That looks good. It will for sure be faster if we exclude zero counts (most will be zero), so yes. How is the performance?
Quickly testing 3 runs of each command. Here is a full summary:

```
$ (time cat SRR11454613.bt2.sam | samtools view -b - > test.output.bam)
#1
real 0m9.461s
user 0m9.328s
sys  0m0.238s
#2
real 0m9.412s
user 0m9.301s
sys  0m0.236s
#3
real 0m9.786s
user 0m9.588s
sys  0m0.304s

$ (time cat SRR11454613.bt2.sam | tee >(awk '/^[^@]/ {count[$3]++} END {for (id in count) {print id, count[id]}}' > SRR11454613_counts.txt) | samtools view -b - > test.output.bam)
#1
real 0m12.432s
user 0m9.474s
sys  0m0.420s
#2
real 0m12.914s
user 0m9.866s
sys  0m0.535s
#3
real 0m12.886s
user 0m9.851s
sys  0m0.532s
```
I think that should be OK performance. Now we just need to calculate TP/FP/FN/TN values using an extended [...]. Take a look at this simulated data-set: CoV simulations, which you can download from s3://serratus-public/notebook/200411/. You can make a similar simulated data-set for testing ROC, and make it a bit deeper too. Brainstorming, I can think of 2 ways to do this; maybe you can think of a better experiment.

edit: performance is much less of an issue here, as this is only for benchmarking.
I feel like I'm missing an important concept here. Where do the counts from earlier play into the TP/FP/FN/TN calculation? It seems that all the measurement is done on simulated positive/negative sequences?
When we're aligning sequences from a given accession, we have no idea if it's CoV+ or CoV-. The count will be used to report how much alignment there is in the library to CoV forward or REVERSE sequences. Libraries with a high forward count and a high forward:reverse ratio will be classified as 'potentially CoV+ libraries' and re-analyzed in more detail using transcriptome assembly. The TP/FP/FN/TN rates with simulated reads are purely to establish benchmarks, so we know how well the experimental system works in theory.
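That classification rule can be sketched in a few lines; the threshold values below are hypothetical placeholders, not numbers decided in this thread:

```python
def classify(fwd_count, rev_count, min_fwd=100, min_ratio=10.0):
    """Flag a library as potentially CoV+ when the forward count is high
    and the forward:reverse ratio is high. Thresholds are placeholders."""
    ratio = fwd_count / rev_count if rev_count else float("inf")
    return fwd_count >= min_fwd and ratio >= min_ratio
```

Keeping the thresholds as parameters is deliberate, since the whole point of the simulated-read benchmarks is to find out where these cutoffs should actually sit.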
I've tried some preliminary simulations based on the notebook referenced above. To me, it seems that all TP/FP/FN/TN values can be derived from experiment 1, outlined below.

Definitions: [...]

Statistics: [...]

Simulation: here are [...]

Results would be: [...]

The calculation for these results was trivial. However, if the alignment of [...]
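The confusion-matrix arithmetic behind experiment 1 can be sketched as a small helper; this assumes each simulated read carries a known truth label (CoV+ or CoV-) and an observed aligned/unaligned outcome, which is the setup described above:

```python
def confusion(reads):
    """Tally TP/FP/FN/TN from (is_cov_positive, aligned) pairs,
    one pair per simulated read."""
    tp = fp = fn = tn = 0
    for is_pos, aligned in reads:
        if is_pos and aligned:
            tp += 1        # CoV+ read that mapped to the pan-genome
        elif is_pos:
            fn += 1        # CoV+ read that failed to map
        elif aligned:
            fp += 1        # CoV- read that mapped anyway
        else:
            tn += 1        # CoV- read correctly unmapped
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

def sensitivity_specificity(c):
    """Derive sensitivity and specificity, guarding against empty classes."""
    sens = c["TP"] / (c["TP"] + c["FN"]) if (c["TP"] + c["FN"]) else 0.0
    spec = c["TN"] / (c["TN"] + c["FP"]) if (c["TN"] + c["FP"]) else 0.0
    return sens, spec
```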
This is very clear-headed, good job. A few notes:

I think these are switched; there are no FP, only 175 FN.

The point is not just to do this for this one data-set, but to use this to benchmark/test a wide variety of [...]
Thanks for the clarification regarding FP/FN. I'll have to look into a way to track the outcome of both the positive and negative read sets.
I'm guessing [...]

For the negative, use [...]
Ohh, perfect. I think I have what's needed to get started then!
I have a preliminary python script for this. Here's some sample output: [...]

I'm thinking it might be better to normalize the counts as proportions relative to the number of reads.
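That normalization is just a division by the size of the simulated read set; a minimal sketch:

```python
def normalize(counts, total_reads):
    """Convert raw per-category read counts into proportions of the read set."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {name: n / total_reads for name, n in counts.items()}
```

Reporting proportions makes runs with different read depths directly comparable, which matters once the simulated sets are made "a bit deeper" as suggested above.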
"ROC" implies plotting as a function of a confidence score, which in this case would be the MAPQ value in the SAM/BAM file. For example, if the mapper reports MAPQ=0 (i.e., P_error=1), do we consider the hit or not? This is tricky because MAPQ doesn't mean what we want it to mean. We want "error probability that the read is derived from this genome", but MAPQ actually means "assuming the read does come from this genome, probability the location is wrong".
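For concreteness, MAPQ is Phred-scaled per the SAM specification, so converting it back to the error probability it encodes (probability the reported mapping position is wrong) is a one-liner:

```python
def mapq_to_perror(mapq):
    """MAPQ = -10 * log10(P_error), so P_error = 10 ** (-mapq / 10)."""
    return 10 ** (-mapq / 10)
```

So MAPQ=0 means P_error=1, MAPQ=10 means P_error=0.1, and MAPQ=30 means P_error=0.001; sweeping a MAPQ threshold over this scale is what a ROC built on MAPQ would actually be varying, with the caveat above about what that probability refers to.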
This is a little off-topic: for actual data, we likely won't use MAPQ as a confidence function, but at the end of the day we do need some function to classify a library as potentially CoV+ or CoV-. See #41 for details of the false-positives encountered so far. What they have in common is that even when they had a very high number/percentage of mapped reads, the mapping was localized to a very small seed region with clipping on either side. We can 'mask' these regions out of the pan-genome, but I feel like it might be whack-a-mole with false-positives this way, and instead we should consider if a [...]

On topic: this looks great, Victor. If we have a quick py-script to calculate and report these values, then we can apply it to larger optimizations, i.e. get it into the hands of @charlescongxu for his [...]
@ababaian @charlescongxu I've added the script + docs in a new directory named [...]
Hey @victorlin, this looks like a good backbone for benchmarking. Can you break apart some of the functionality in this? We'll want it to be a bit more flexible than what you have.

Then, in the README, add the definitions of TP/FP/TN/FN you made above, so we can close this issue and not lose important information : )

Edit: [...]
Sure, more flexibility is always nice :) Currently the script works on the entire pan-genome ([...]).

Option 2 would be easier to make the changes, but either is doable. Ultimately it's up to the use case and how these additional parameters are determined. Other questions: [...]
Thought I'd hash out the high-level CLI first and get your opinion on it.

Example usages:

1. All default behavior. What this does: simulate positive/negative read sets using 5% divergence from [...]
2. Configure positive/negative alignment sequences
3. Configure sequences for simulating positive/negative read sets
4. Configure pre-simulated reads

This can result in lengthy commands. All the [...]
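To make the shape concrete, here is a minimal argparse sketch of an interface like that; every flag name and default below is hypothetical, just mirroring the four usage modes above:

```python
import argparse

def build_parser():
    # Hypothetical flag names and defaults; the real CLI may differ.
    p = argparse.ArgumentParser(
        description="Benchmark TP/FP/FN/TN on simulated read sets")
    p.add_argument("--pos-seqs", default="cov1r.fa",
                   help="fasta to simulate positive reads from (placeholder default)")
    p.add_argument("--neg-seqs",
                   help="fasta to simulate negative reads from")
    p.add_argument("--pos-reads",
                   help="pre-simulated positive reads (skips simulation)")
    p.add_argument("--neg-reads",
                   help="pre-simulated negative reads (skips simulation)")
    p.add_argument("--divergence", type=float, default=0.05,
                   help="simulated divergence rate (default 5%%)")
    return p
```

With everything defaulted, a bare invocation would cover usage 1, while the `--*-seqs` and `--*-reads` flags cover usages 2 through 4.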
This looks like it's what we need; the only thing which needs to be included is to allow for multiple fasta files in addition to regular fasta files. We can then put a wrapper around this bad boy and generate all the statistics we need in all the different 'experiments' we can think of.
@ababaian so for all the parameters taking a fasta file, should a multiple fasta input be split so that each sequence is processed separately, with its own output?
No, a multiple fasta file will be treated as a single entity for generating reads from or aligning against. Each run of the command should yield a single output.
Given cov0r.fa and an incoming stream of aligned reads (BAM format), count the total number of reads that match per sequence. Counting is to be done on the fly until EOF. cov0r.fa consists of sequences along with non-complement reverse entries. The counts for each sequence+reverse pair can be used to calculate TP/FP/TN/FN.