Add non-coding regions #589

Closed
8 tasks done
donkirkby opened this issue Jul 21, 2020 · 4 comments
Comments

@donkirkby
Member

donkirkby commented Jul 21, 2020

Add some coordinate sequences that are not coding for proteins. The sequences can be marked as nucleotide sequences.

  • Add missing regions for SARS-CoV-2.
  • Include them in nuc.csv, but not amino.csv.
  • Align amino sequences.
  • Only count overlapping minimap matches once.
  • Report partial codons at boundary of minimap match.
  • Moved to separate issue: Add coverage plots and scores for non-coding regions #725
  • Add missing regions for HIV.
  • Stop trimming stop codons from count files.
@donkirkby donkirkby added this to the 7.13 milestone Jul 21, 2020
@donkirkby
Member Author

The plan is to first use minimap2 to align the entire consensus sequence with the whole-genome nucleotide reference. Then cut out the sections that align to each gene region, use the current Gotoh technique to align each of the three reading frames, and choose the best match. That keeps the benefit of aligning in amino space, and adds the benefit of minimap2 handling the huge deletions we see in the assembled proviral samples. The sections that minimap2 can't align will appear in the genome coverage plots, but not in any gene coverage plots. Coordinate regions that are nucleotide references will just use the minimap2 alignment, without the extra Gotoh step.
Question: what do we do with minimap2 alignments that overlap at the ends? For the genome coverage plot, I just let the alignment closer to the 5' end override the other one. The overlaps are usually small enough that it's not noticeable on the plot. In the amino.csv and nuc.csv reports, though, you can see the details.
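A minimal sketch of that override rule, not the project's actual code: it assumes each minimap2 match is reduced to a hypothetical `(ref_start, ref_end)` pair in 0-based, end-exclusive reference coordinates, and the name `resolve_overlaps` is made up for illustration.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # hypothetical (ref_start, ref_end), end-exclusive


def resolve_overlaps(matches: List[Match]) -> List[Match]:
    """Trim overlapping reference ranges so the match nearer the 5' end wins."""
    resolved = []
    covered_end = 0  # reference positions before this are already claimed
    for start, end in sorted(matches):
        # Drop the part of this match that an earlier match already covers.
        start = max(start, covered_end)
        if start < end:
            resolved.append((start, end))
            covered_end = end
    return resolved


print(resolve_overlaps([(100, 200), (0, 110)]))
```

A match that lies entirely inside an earlier one is dropped, because nothing of it remains after trimming.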
I'm moving this issue to the next release, because it's going to take a while to finish.

@donkirkby donkirkby modified the milestones: 7.13, near future Aug 5, 2020
@donkirkby donkirkby modified the milestones: near future, 7.14 Aug 18, 2020
@donkirkby
Member Author

I discussed the problem of overlaps and gaps between minimap2 matches with @cbrumme, and we agreed to use the same arbitrary rule that I currently use in the plots. Any gaps between matches will be labelled as yellow in the coverage plots, and won't be reported in any gene region.
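As a rough illustration of the gap rule, the sketch below lists the reference ranges that fall between consecutive matches. The `(ref_start, ref_end)` tuples and the function name `find_gaps` are assumptions for this example, not names from the codebase.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # hypothetical (ref_start, ref_end), end-exclusive


def find_gaps(matches: List[Match]) -> List[Match]:
    """Return reference ranges that lie between matches and are covered by none."""
    gaps = []
    previous_end = None
    for start, end in sorted(matches):
        if previous_end is not None and start > previous_end:
            gaps.append((previous_end, start))
        previous_end = end if previous_end is None else max(previous_end, end)
    return gaps


print(find_gaps([(0, 100), (150, 300), (290, 400)]))
```

Only the ranges returned here would be drawn in yellow and left out of the gene region reports; overlapping matches produce no gap.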

donkirkby added a commit that referenced this issue May 11, 2021
Also stop reporting rows without data in nuc.csv and amino.csv. Report query position when a row comes from a single contig.
donkirkby added a commit that referenced this issue May 11, 2021
Also stop reporting rows without data in nuc.csv and amino.csv. Report query position when a row comes from a single contig.
donkirkby added a commit that referenced this issue May 14, 2021
Stop highlighting partial-match contigs as unaligned.
donkirkby added a commit that referenced this issue May 20, 2021
Add sorting and filtering to genome alignment.
donkirkby added a commit that referenced this issue May 31, 2021
Still have broken tests in test_aln2counts.py.
donkirkby added a commit that referenced this issue Jun 1, 2021
Still have more broken tests in test_aln2counts.py.
@donkirkby
Member Author

Updated plan, after discussing with @cbrumme:

  1. Use minimap2 to align full consensus against full-genome reference in nucleotide space.
  2. Use gene region landmarks to clip out each region from the full sample counts.
  3. Align sample amino consensus sequence for each region to the reference amino sequence for that region, if there is one. Use the start position from the region landmarks to choose which of the three reading frames to use.
  4. Work backwards from amino codons to nucleotide counts for regions that have an amino reference. That keeps nuc.csv in sync with amino.csv.
  5. Use original minimap2 alignment for regions that don't have an amino reference (TRS and UTR).
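Steps 2 and 3 can be sketched for the simplest case. This assumes an ungapped match, 0-based reference coordinates, and a hypothetical helper name `split_codons`; real minimap2 alignments contain insertions and deletions that this sketch ignores.

```python
from typing import List


def split_codons(query_seq: str, match_ref_start: int, region_start: int) -> List[str]:
    """Clip an ungapped match to a region and split it in the region's reading frame.

    query_seq is the matched sample sequence, whose first base sits at
    reference position match_ref_start. region_start comes from the landmarks.
    """
    first = max(match_ref_start, region_start)
    skip = first - match_ref_start       # bases that fall before the region
    offset = (region_start - first) % 3  # bases before the first complete codon
    seq = query_seq[skip + offset:]
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


print(split_codons("CCATGGCCTAA", match_ref_start=8, region_start=10))
print(split_codons("TGGCCTAA", match_ref_start=11, region_start=10))
```

Because the frame is fixed by the landmark start position, there is no need to try all three reading frames and pick the best one.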

donkirkby added a commit that referenced this issue Jun 22, 2021
New version allows larger deletions within a single alignment.
donkirkby added a commit that referenced this issue Jun 22, 2021
Also upgrade to Ubuntu 20.04 in GitHub Actions, because 16.04 is losing support.
donkirkby added a commit that referenced this issue Jun 23, 2021
Needed after upgrading Ubuntu to 20.04.
donkirkby added a commit that referenced this issue Jun 23, 2021
Old Singularity was incompatible with Ubuntu 20.04.
@donkirkby
Member Author

donkirkby commented Jun 25, 2021

When minimap2 matches don't align to codon boundaries, report the exact nucleotides in nuc.csv that are included in the minimap2 match. In amino.csv, report partial deletions for codons that cross over the end of a minimap2 match.
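One way to picture the boundary case is to count how many nucleotides of each codon fall inside a match. The sketch below assumes 0-based, end-exclusive coordinates and a hypothetical function name `codon_coverage`; a count below 3 marks a codon that crosses the end of the match.

```python
from typing import List, Tuple


def codon_coverage(region_start: int,
                   region_end: int,
                   match_start: int,
                   match_end: int) -> List[Tuple[int, int]]:
    """Return (codon_index, covered_nucleotides) for each codon the match touches."""
    start = max(region_start, match_start)
    end = min(region_end, match_end)
    result = []
    if start >= end:
        return result  # match does not touch the region
    first_codon = (start - region_start) // 3
    last_codon = (end - 1 - region_start) // 3
    for codon in range(first_codon, last_codon + 1):
        codon_start = region_start + codon * 3
        codon_end = codon_start + 3
        covered = min(end, codon_end) - max(start, codon_start)
        result.append((codon, covered))
    return result


print(codon_coverage(10, 22, 11, 17))
```

In this example the first and last codons are only partly covered, so they would be the ones reported as partial in amino.csv, while nuc.csv would list just the covered nucleotides.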
