portions of repeats annotated as spacers #36

acvill · 2022-10-31T15:36:05Z

Preface: Thanks for creating and maintaining minCED! I use your software often and rely on its output for my own tools. I understand this issue is likely a problem with CRISPR Recognition Tool and not minCED itself, though minCED users should be aware of the implications for biological interpretation.

I'm using minCED 0.4.2. Generally, I will run minCED on metagenomic contigs with relaxed parameters, allowing for a greater range of spacer and repeat sizes than the default:

MIN=/path/to/minced
${MIN}/minced \
    -gffFull \
    -minNR 3 \        
    -minRL 20 \      # default: 23
    -maxRL 50 \      # default: 47
    -minSL 22 \      # default: 26
    -maxSL 55 \      # default: 50
    contigs.fasta \
    example.txt \
    example.gff

Recently, I noticed something peculiar about one of the CRISPRs associated with one sample:

CRISPR 1   Range: 125549 - 126048
POSITION	REPEAT				SPACER
--------	--------------------------------	----------------------------------------------
125549		GTTGTCATTAGCTTCCAGATTCCGTACCTTCA	CACTTGCTAATACAGCTGTGGTTGAGCCAAACAATGAGATGGTAAT	[ 32, 46 ]
125627		GTTGTGATTAGCTTTCAGATTCCGTACCTTCA	TACTTGCTAATACAGCGCACGCGAGACCTTCACGCGACTAGGACGG	[ 32, 46 ]
125705		GTTGTGATTAGCTTTCAGATTCCGTACCTTCA	TACTTGCTAATACAGCCACGAGCCTCATCACGCGAACTCTCATCAC	[ 32, 46 ]
125783		GTTGTGATTAGCTTTCAGATTCCGCACCTTCA	TACTTGCTAATCCAGCCGAATTATTGCAACGCTTATCCTCGCCTCG	[ 32, 46 ]
125861		GTTGTGATTAGCTTTCGAATTCCGTACCTTCA	CACTTGCTAACACAGCATAAAAACGACGACGACACGACCGACAGGT	[ 32, 46 ]
125939		GTTGTGATTAGCTTTCAGATTCCGTACCTTCA	CACTTGCTAATACAGCTCGGAGGAGTGAAGAATAGCCAGCACCTCG	[ 32, 46 ]
126017		GTTGTGATTAGCTTTCAGATTCCGTACCTCCA	
--------	--------------------------------	----------------------------------------------
Repeats: 7	Average Length: 32		Average Length: 46

Focusing on the spacers, you'll note that the first 16 nt are more similar than expected by chance.

1 [CACTTGCTAATACAGC] TGTGGTTGAGCCAAACAATGAGATGGTAAT
2 [TACTTGCTAATACAGC] GCACGCGAGACCTTCACGCGACTAGGACGG
3 [TACTTGCTAATACAGC] CACGAGCCTCATCACGCGAACTCTCATCAC
4 [TACTTGCTAATCCAGC] CGAATTATTGCAACGCTTATCCTCGCCTCG
5 [CACTTGCTAACACAGC] ATAAAAACGACGACGACACGACCGACAGGT
6 [CACTTGCTAATACAGC] TCGGAGGAGTGAAGAATAGCCAGCACCTCG

It seems that the 3' ends of the repeats have been mistakenly annotated as part of the spacers. My first thought was that my parameterization did not allow for the combination of repeat and/or spacer lengths needed for the "correct" annotation. However, 48 nt repeats (32 + 16) and 30 nt spacers (46 - 16) are within the ranges given in my parameters above. So, it seems like minCED is expecting near-perfect conservation of the repeat sequences, and therefore draws the boundary between the repeats and spacers at the base where the sequence is maximally ambiguous (equal utilization of C and T). This example was easy enough to catch by eye, but similar misannotations are problematic if the spacers are used to build a BLAST database or CRISPR system phylogenies are constructed from repeat consensus sequences.

The text was updated successfully, but these errors were encountered:

ctSkennerton · 2022-10-31T18:38:44Z

It is a limitation in the core CRT algorithm where it expects the repeats to be identical. In repeats where there is a slight wiggle the algorithm messes things up, as you have observed. There is a magic number in the core repeat extension algorithm where 75% of the nucleotides need to be the same to get to extend to the next base (in your example it looks to be a 50-50 split):

https://github.com/ctSkennerton/minced/blob/master/CRISPRUtil.java#L35-L81

Really this should be adaptive to look at the bases after the next to see if they return to being identical. Unfortunately I don't have time to dig into the core of CRT to fix this issue.

Thanks for pointing it out though and I would welcome anyone from the community who has the time to attempt a fix.

acvill · 2022-11-09T18:15:44Z

This is a post-hoc fix that doesn't address the problem with CRT, but I've implemented a fix_repeats option in my R-based minCED parser that appends spacer subsequences to repeats based on a rolling average conservation score across the spacer consensus.

install and load package

devtools::install_github("acvill/CRISPRviewR")
library(CRISPRviewR)
baseurl <- "https://raw.githubusercontent.com/acvill/CRISPRviewR/master/example_data_minced/"

Conserved sequence on left end of spacers ...

CRISPRviewR::read_minced(txt = url(paste0(baseurl, "s1.txt")),
                         gff = url(paste0(baseurl, "s1.gff"))) |>
  dplyr::filter(array == "CRISPR1") |>
  dplyr::select(rep, spacer)

# A tibble: 7 × 2
  rep                              spacer                                        
  <chr>                            <chr>                                         
1 GTTGTCATTAGCTTCCAGATTCCGTACCTTCA CACTTGCTAATACAGCTGTGGTTGAGCCAAACAATGAGATGGTAAT
2 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCGCACGCGAGACCTTCACGCGACTAGGACGG
3 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCCACGAGCCTCATCACGCGAACTCTCATCAC
4 GTTGTGATTAGCTTTCAGATTCCGCACCTTCA TACTTGCTAATCCAGCCGAATTATTGCAACGCTTATCCTCGCCTCG
5 GTTGTGATTAGCTTTCGAATTCCGTACCTTCA CACTTGCTAACACAGCATAAAAACGACGACGACACGACCGACAGGT
6 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA CACTTGCTAATACAGCTCGGAGGAGTGAAGAATAGCCAGCACCTCG
7 GTTGTGATTAGCTTTCAGATTCCGTACCTCCA NA

... fixed

CRISPRviewR::read_minced(txt = url(paste0(baseurl, "s1.txt")),
                         gff = url(paste0(baseurl, "s1.gff")),
                         fix_repeats = TRUE) |>
  dplyr::filter(array == "CRISPR1") |>
  dplyr::select(rep, spacer)

# A tibble: 7 × 2
  rep                                              spacer                        
  <chr>                                            <chr>                         
1 GTTGTCATTAGCTTCCAGATTCCGTACCTTCACACTTGCTAATACAGC TGTGGTTGAGCCAAACAATGAGATGGTAAT
2 GTTGTGATTAGCTTTCAGATTCCGTACCTTCATACTTGCTAATACAGC GCACGCGAGACCTTCACGCGACTAGGACGG
3 GTTGTGATTAGCTTTCAGATTCCGTACCTTCATACTTGCTAATACAGC CACGAGCCTCATCACGCGAACTCTCATCAC
4 GTTGTGATTAGCTTTCAGATTCCGCACCTTCATACTTGCTAATCCAGC CGAATTATTGCAACGCTTATCCTCGCCTCG
5 GTTGTGATTAGCTTTCGAATTCCGTACCTTCACACTTGCTAACACAGC ATAAAAACGACGACGACACGACCGACAGGT
6 GTTGTGATTAGCTTTCAGATTCCGTACCTTCACACTTGCTAATACAGC TCGGAGGAGTGAAGAATAGCCAGCACCTCG
7 GTTGTGATTAGCTTTCAGATTCCGTACCTCCA                 NA

Conserved sequence on right end of spacers ...

CRISPRviewR::read_minced(txt = url(paste0(baseurl, "s1.txt")),
                         gff = url(paste0(baseurl, "s1.gff"))) |>
  dplyr::filter(array == "CRISPR10") |>
  dplyr::select(rep, spacer)

# A tibble: 4 × 2
  rep                      spacer                                    
  <chr>                    <chr>                                     
1 TCGTGAAAAGCAGCAAGTGTGTAC CAGCATCGAGCGCAGGTGTTACCGACCTAAATTGACTATATG
2 TCGTGAAAAGCAGCAGGTGTGTAC CTTACAGCCATCACGCCTATCGATGGCCGTATTGACTATATA
3 TCGTGAAAAGCAGCAAGTGTGTAC GCATCGTCACTTGCCAAGCCTGCCTTTCGATTGACTATATA 
4 TCGTGAAAAGCAGCGAGTGTGTGC NA

... fixed

CRISPRviewR::read_minced(txt = url(paste0(baseurl, "s1.txt")),
                         gff = url(paste0(baseurl, "s1.gff")),
                         fix_repeats = TRUE) |>
  dplyr::filter(array == "CRISPR10") |>
  dplyr::select(rep, spacer)

# A tibble: 4 × 2
  rep                                   spacer                       
  <chr>                                 <chr>                        
1 TCGTGAAAAGCAGCAAGTGTGTAC              CAGCATCGAGCGCAGGTGTTACCGACCTA
2 AATTGACTATATGTCGTGAAAAGCAGCAGGTGTGTAC CTTACAGCCATCACGCCTATCGATGGCCG
3 TATTGACTATATATCGTGAAAAGCAGCAAGTGTGTAC GCATCGTCACTTGCCAAGCCTGCCTTTC 
4 GATTGACTATATATCGTGAAAAGCAGCGAGTGTGTGC NA

You can read the vignette for details about implementation. One problem with this approach is that one repeat on the end of each "fixed" array will be shorter than the others, since the sequence context that would be appended to that repeat exists outside of the original array.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

portions of repeats annotated as spacers #36

portions of repeats annotated as spacers #36

acvill commented Oct 31, 2022

ctSkennerton commented Oct 31, 2022

acvill commented Nov 9, 2022 •

edited

Loading

portions of repeats annotated as spacers #36

portions of repeats annotated as spacers #36

Comments

acvill commented Oct 31, 2022

ctSkennerton commented Oct 31, 2022

acvill commented Nov 9, 2022 • edited Loading

install and load package

Conserved sequence on left end of spacers ...

... fixed

Conserved sequence on right end of spacers ...

... fixed

acvill commented Nov 9, 2022 •

edited

Loading