Add non-coding regions #589

donkirkby · 2020-07-21T17:31:05Z

Add some coordinate sequences that are not coding for proteins. The sequences can be marked as nucleotide sequences.

Add missing regions for SARS-CoV-2.
Include them in nuc.csv, but not amino.csv.
Align amino sequences.
Only count overlapping minimap matches once.
Report partial codons at boundary of minimap match.
Moved to separate issue: Add coverage plots and scores for non-coding regions #725
Add missing regions for HIV.
Stop trimming stop codons from count files.


donkirkby · 2020-08-05T18:12:50Z

The plan is to first use minimap2 to align the entire consensus sequence with the whole-genome nucleotide reference, then cut out the sections that align to each gene region, and use the current Gotoh technique to align each of the three reading frames and choose the best match. That lets us keep the benefit of aligning in amino space, but gain the benefit of minimap2 handling the huge deletions we see in the assembled proviral samples. The sections that minimap2 can't align will appear in the genome coverage plots, but not in any gene coverage plots. Coordinate regions that are nucleotide references will just use the minimap2 alignment, without the extra Gotoh step.
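As a rough illustration of the "align each of the three reading frames and choose the best match" step, here is a minimal sketch. It is not the MiCall code: a plain `difflib` similarity ratio stands in for the Gotoh alignment score, and the names `translate` and `best_frame` are made up for this example.

```python
from difflib import SequenceMatcher
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino
               for codon, amino in zip(product(BASES, repeat=3), AMINOS)}


def translate(nuc_seq, frame):
    """Translate nuc_seq, skipping the first `frame` nucleotides.

    Codons with unexpected characters are reported as X, and any
    incomplete codon at the end is dropped.
    """
    codons = (nuc_seq[i:i + 3] for i in range(frame, len(nuc_seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def best_frame(nuc_seq, amino_ref):
    """Return the reading frame (0, 1, or 2) most similar to amino_ref.

    Assumption: a similarity ratio replaces the real alignment score.
    """
    def score(frame):
        return SequenceMatcher(None, translate(nuc_seq, frame), amino_ref).ratio()

    return max(range(3), key=score)


# One extra nucleotide in front of the coding sequence shifts it to frame 1.
section = "G" + "ATGAAAGTTCTGGCA"
print(best_frame(section, "MKVLA"))  # 1
```

A real implementation would score each frame with the aligner itself, so that insertions and deletions are penalized consistently with the final alignment.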
Question: what do we do with minimap2 alignments that overlap at the ends? For the genome coverage plot, I just let the alignment closer to the 5' end override the other one. The overlaps are usually small enough that it's not noticeable on the plot. In the amino.csv and nuc.csv reports, though, you can see the details.
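The "closer to the 5' end wins" rule could be sketched like this. The field names mirror the `r_st`, `r_en`, `q_st`, and `q_en` attributes on mappy alignments, but the `Match` tuple and `trim_overlaps` are hypothetical. The sketch also assumes that "5' end" refers to reference coordinates, and that the trimmed stretch contains no indels, so reference and query shift by the same amount.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One alignment, with zero-based, half-open coordinates."""
    r_st: int  # start in the reference
    r_en: int  # end in the reference
    q_st: int  # start in the query (consensus)
    q_en: int  # end in the query (consensus)


def trim_overlaps(matches):
    """Let the match closer to the 5' end keep any overlapping positions."""
    trimmed = []
    covered_end = 0
    for match in sorted(matches, key=lambda m: m.r_st):
        overlap = covered_end - match.r_st
        if overlap > 0:
            if match.r_en <= covered_end:
                continue  # Completely covered by earlier matches.
            # Simplification: assumes no indels inside the overlap.
            match = match._replace(r_st=match.r_st + overlap,
                                   q_st=match.q_st + overlap)
        trimmed.append(match)
        covered_end = max(covered_end, match.r_en)
    return trimmed


matches = [Match(190, 300, 95, 205), Match(100, 200, 0, 100)]
for match in trim_overlaps(matches):
    print(match)
```

In the example, the second match loses its first ten positions, because the match that starts at reference position 100 already covers them.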
I'm moving this issue to the next release, because it's going to take a while to finish.

donkirkby · 2020-10-22T20:44:51Z

I discussed the problem of overlaps and gaps between minimap2 matches with @cbrumme, and we agreed to use the same arbitrary rule that I currently use in the plots. Any gaps between matches will be labelled as yellow in the coverage plots, and won't be reported in any gene region.
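Finding the gaps between matches is a small interval sweep. This is only a sketch with a hypothetical `find_gaps` helper, using zero-based, half-open reference ranges; it reports gaps between matches, not the uncovered ends of the reference.

```python
def find_gaps(match_ranges):
    """Return reference ranges that fall between matches.

    match_ranges is a list of (start, end) pairs, zero-based and
    half-open. Overlapping matches are merged before looking for gaps.
    """
    ranges = sorted(match_ranges)
    if not ranges:
        return []
    gaps = []
    covered_end = ranges[0][1]
    for start, end in ranges[1:]:
        if start > covered_end:
            gaps.append((covered_end, start))
        covered_end = max(covered_end, end)
    return gaps


print(find_gaps([(10, 50), (40, 80), (100, 120)]))  # [(80, 100)]
```

Those gap ranges are the positions that would be coloured yellow in the coverage plot and left out of every gene region.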


donkirkby · 2021-06-02T17:40:10Z

Updated plan, after discussing with @cbrumme:

1. Use minimap2 to align the full consensus against the full-genome reference in nucleotide space.
2. Use gene region landmarks to clip out each region from the full sample counts.
3. Align the sample amino consensus sequence for each region to the reference amino sequence for that region, if there is one. Use the start position from the region landmarks to choose which of the three reading frames to use.
4. Work backwards from amino codons to nucleotide counts for regions that have an amino reference. That keeps nuc.csv in sync with amino.csv.
5. Use the original minimap2 alignment for regions that don't have an amino reference (TRS and UTR).
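The clipping and reading-frame steps in the plan above might look roughly like this. The landmark entries are invented for the example (they are not real coordinates), and `clip_region` and `codon_offset` are hypothetical helpers. Landmarks are assumed to be 1-based and inclusive.

```python
# Hypothetical landmarks: names and coordinates are made up.
LANDMARKS = [
    {"name": "5'UTR", "start": 1, "end": 9},
    {"name": "geneA", "start": 10, "end": 21},
]


def clip_region(consensus, landmark):
    """Cut one region out of a consensus laid out in reference coordinates.

    Landmark positions are 1-based and inclusive.
    """
    return consensus[landmark["start"] - 1:landmark["end"]]


def codon_offset(region_start, first_covered_pos):
    """Count nucleotides to skip so codons line up with the region start.

    Both positions are 1-based reference coordinates, and the covered
    position is assumed to fall at or after the region start.
    """
    return (region_start - first_covered_pos) % 3


consensus = "ACGTACGTA" + "ATGAAAGTTCTG" + "TTT"
gene_a = LANDMARKS[1]
print(clip_region(consensus, gene_a))  # ATGAAAGTTCTG
print(codon_offset(gene_a["start"], 11))  # 2
```

Because the frame comes from the landmark start, a match that begins partway through a codon doesn't need a separate search over all three frames.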


donkirkby · 2021-06-25T20:51:47Z

When minimap2 matches don't align to codon boundaries, report the exact nucleotides in nuc.csv that are included in the minimap2 match. ~~In amino.csv, report partial deletions for codons that cross over the end of a minimap2 match.~~
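To show which nucleotides belong to partial codons at the edges of a match, here is a small sketch. The helpers `group_by_codon` and `partial_codons` are hypothetical, positions are 1-based and inclusive, and the match is assumed to start at or after the region start.

```python
def group_by_codon(region_start, match_start, match_end):
    """Group the reference positions in a match by codon number.

    All positions are 1-based and inclusive, and codon 1 starts at
    region_start. Returns {codon_number: [positions in the match]}.
    """
    codons = {}
    for pos in range(match_start, match_end + 1):
        codon_num = (pos - region_start) // 3 + 1
        codons.setdefault(codon_num, []).append(pos)
    return codons


def partial_codons(codons):
    """List the codon numbers that are only partly inside the match."""
    return [num for num, positions in codons.items() if len(positions) < 3]


codons = group_by_codon(region_start=1, match_start=3, match_end=8)
print(codons)  # {1: [3], 2: [4, 5, 6], 3: [7, 8]}
print(partial_codons(codons))  # [1, 3]
```

In this example, positions 3, 7, and 8 would still get rows in nuc.csv, even though codons 1 and 3 are incomplete.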

donkirkby added the enhancement label Jul 21, 2020

donkirkby added this to the 7.13 milestone Jul 21, 2020

donkirkby added a commit that referenced this issue Jul 30, 2020

Add missing regions to SARS-CoV-2, for #589.

2115dc8

donkirkby added a commit that referenced this issue Jul 30, 2020

Add a seed group to SARS-CoV-2, for #589.

3fb724a

donkirkby modified the milestones: 7.13, near future Aug 5, 2020

donkirkby modified the milestones: near future, 7.14 Aug 18, 2020

donkirkby modified the milestones: 7.14 - Primers by Project, 7.15 Oct 22, 2020

donkirkby added a commit that referenced this issue May 11, 2021

Add non-coding regions to SARS-CoV-2, for #589.

0a938ea

Also stop reporting rows without data in nuc.csv and amino.csv. Report query position when a row comes from a single contig.

donkirkby added a commit that referenced this issue May 11, 2021

Add non-coding regions to SARS-CoV-2, for #589.

2da6390

Also stop reporting rows without data in nuc.csv and amino.csv. Report query position when a row comes from a single contig.

donkirkby added a commit that referenced this issue May 14, 2021

Add project scoring for noncoding regions in SARS-CoV-2, part of #589.

1d0bd1a

Stop highlighting partial-match contigs as unaligned.

donkirkby added a commit that referenced this issue May 15, 2021

Start adding ConsensusAligner, as part of #589.

48be290

donkirkby added a commit that referenced this issue May 20, 2021

Move genome alignment to SequenceReport.read(), for #589.

06c287d

Add sorting and filtering to genome alignment.

donkirkby added a commit that referenced this issue May 28, 2021

Start redesigning consensus aligner, for #589.

83079b8

donkirkby added a commit that referenced this issue May 31, 2021

Fix all tests in test_aln2counts_report.py, as part of #589.

2d831db

Still have broken tests in test_aln2counts.py.

donkirkby added a commit that referenced this issue Jun 1, 2021

Fix a bunch of tests in test_aln2counts.py, as part of #589.

d89f544

Still have more broken tests in test_aln2counts.py.

donkirkby added a commit that referenced this issue Jun 10, 2021

Add alignment step in amino acid space, as part of #589.

7afa861

donkirkby added a commit that referenced this issue Jun 10, 2021

Switch back to amino acid alignment from master branch, as part of #589.

1bf65dc

donkirkby added a commit that referenced this issue Jun 11, 2021

Populate consensus and insert positions, as part of #589.

fb26d11

donkirkby added a commit that referenced this issue Jun 11, 2021

Fix remaining tests of insert positions, as part of #589.

f7ea48f

donkirkby added a commit that referenced this issue Jun 17, 2021

Fix landmark definitions for HIV to be 1-based, as part of #589.

d786b28

donkirkby added a commit that referenced this issue Jun 18, 2021

Fix landmark definitions for HLA to be 1-based, as part of #589.

e73b058

donkirkby added a commit that referenced this issue Jun 18, 2021

Fix microtest checks, as part of #589.

32b0944

donkirkby added a commit that referenced this issue Jun 22, 2021

Fix broken tests after mappy upgrade, as part of #589.

45663bc

New version allows larger deletions within a single alignment.

donkirkby added a commit that referenced this issue Jun 22, 2021

Fall back to Gotoh alignment in genome coverage, as part of #589.

52f159c

Also upgrade to Ubuntu 20.04 in GitHub Actions, because 16.04 is losing support.

donkirkby added a commit that referenced this issue Jun 23, 2021

Install samtools from source instead of apt, as part of #589.

8d647de

Needed after upgrading Ubuntu to 20.04.

donkirkby added a commit that referenced this issue Jun 23, 2021

Upgrade Singularity to 3.7.1, as part of #589.

b2f6b1d

Old Singularity was incompatible with Ubuntu 20.04.

donkirkby added a commit that referenced this issue Jun 25, 2021

Eliminate overlaps from minimap2 alignments, as part of #589.

b38c020

donkirkby pinned this issue Jun 29, 2021

donkirkby added a commit that referenced this issue Jun 30, 2021

Trim boundary codons, as part of #589.

6f6f098

donkirkby added a commit that referenced this issue Jul 2, 2021

Add LTR regions to HIV, as part of #589.

30749b0

donkirkby mentioned this issue Jul 5, 2021

Add coverage plots and scores for non-coding regions #725

Open

2 tasks

donkirkby added a commit that referenced this issue Jul 8, 2021

Update test to expect stop codon, part of #589.

9748462

donkirkby closed this as completed in 41ea502 Jul 9, 2021

donkirkby unpinned this issue Jul 9, 2021

donkirkby added a commit that referenced this issue Jul 9, 2021

Fix project scoring config and display stop codons, following #589.

7292ffe

donkirkby added a commit that referenced this issue Jul 9, 2021

Fix broken test to handle stop codon, following #589.

d3dcde8

donkirkby added a commit that referenced this issue Jul 9, 2021

Make wild type overrides match original lengths, following #589.

9063e6d

donkirkby mentioned this issue Jul 9, 2021

Make coverage maps consistent with contigs coverage plot #479

Closed

3 tasks

donkirkby added a commit that referenced this issue Jul 10, 2021

Fix broken tests to match new wild type overrides, following #589.

3eaa6a7
