Preface: Thanks for creating and maintaining minCED! I use your software often and rely on its output for my own tools. I understand this issue likely originates in the underlying CRISPR Recognition Tool (CRT) rather than in minCED itself, but minCED users should be aware of the implications for biological interpretation.
I'm using minCED 0.4.2. Generally, I will run minCED on metagenomic contigs with relaxed parameters, allowing for a greater range of spacer and repeat sizes than the default:
MIN=/path/to/minced
# Defaults for the relaxed options: -minRL 23, -maxRL 47, -minSL 26, -maxSL 50
${MIN}/minced \
-gffFull \
-minNR 3 \
-minRL 20 \
-maxRL 50 \
-minSL 22 \
-maxSL 55 \
contigs.fasta \
example.txt \
example.gff
Recently, I noticed something peculiar about one of the CRISPRs associated with one sample:
CRISPR 1 Range: 125549 - 126048
POSITION REPEAT SPACER
-------- -------------------------------- ----------------------------------------------
125549 GTTGTCATTAGCTTCCAGATTCCGTACCTTCA CACTTGCTAATACAGCTGTGGTTGAGCCAAACAATGAGATGGTAAT [ 32, 46 ]
125627 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCGCACGCGAGACCTTCACGCGACTAGGACGG [ 32, 46 ]
125705 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCCACGAGCCTCATCACGCGAACTCTCATCAC [ 32, 46 ]
125783 GTTGTGATTAGCTTTCAGATTCCGCACCTTCA TACTTGCTAATCCAGCCGAATTATTGCAACGCTTATCCTCGCCTCG [ 32, 46 ]
125861 GTTGTGATTAGCTTTCGAATTCCGTACCTTCA CACTTGCTAACACAGCATAAAAACGACGACGACACGACCGACAGGT [ 32, 46 ]
125939 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA CACTTGCTAATACAGCTCGGAGGAGTGAAGAATAGCCAGCACCTCG [ 32, 46 ]
126017 GTTGTGATTAGCTTTCAGATTCCGTACCTCCA
-------- -------------------------------- ----------------------------------------------
Repeats: 7 Average Length: 32 Average Length: 46
Focusing on the spacers, you'll note that the first 16 nt are more similar than expected by chance.
1 [CACTTGCTAATACAGC] TGTGGTTGAGCCAAACAATGAGATGGTAAT
2 [TACTTGCTAATACAGC] GCACGCGAGACCTTCACGCGACTAGGACGG
3 [TACTTGCTAATACAGC] CACGAGCCTCATCACGCGAACTCTCATCAC
4 [TACTTGCTAATCCAGC] CGAATTATTGCAACGCTTATCCTCGCCTCG
5 [CACTTGCTAACACAGC] ATAAAAACGACGACGACACGACCGACAGGT
6 [CACTTGCTAATACAGC] TCGGAGGAGTGAAGAATAGCCAGCACCTCG
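One rough way to flag this pattern automatically is to measure per-column conservation at the 5' end of the spacers. The sketch below is not part of minCED or CRT; the function names, the 0.8 threshold, and the allowance of one poorly conserved column are all assumptions chosen for illustration.

```python
from collections import Counter

# The six spacers reported for CRISPR 1 above.
SPACERS = [
    "CACTTGCTAATACAGCTGTGGTTGAGCCAAACAATGAGATGGTAAT",
    "TACTTGCTAATACAGCGCACGCGAGACCTTCACGCGACTAGGACGG",
    "TACTTGCTAATACAGCCACGAGCCTCATCACGCGAACTCTCATCAC",
    "TACTTGCTAATCCAGCCGAATTATTGCAACGCTTATCCTCGCCTCG",
    "CACTTGCTAACACAGCATAAAAACGACGACGACACGACCGACAGGT",
    "CACTTGCTAATACAGCTCGGAGGAGTGAAGAATAGCCAGCACCTCG",
]


def majority_fraction(column):
    """Fraction of sequences that carry the most common base in one column."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def conserved_prefix_length(spacers, min_fraction=0.8, max_skipped=1):
    """Length of the conserved 5' prefix shared by the spacers.

    Columns are scanned from the 5' end. A column counts as conserved when
    its majority base reaches min_fraction. Up to max_skipped poorly
    conserved columns are tolerated (hypothetical heuristic), so a single
    ambiguous base does not end the scan.
    """
    shortest = min(len(s) for s in spacers)
    skipped = 0
    prefix_len = 0
    for i in range(shortest):
        column = [s[i] for s in spacers]
        if majority_fraction(column) >= min_fraction:
            prefix_len = i + 1  # prefix extends through this column
        else:
            skipped += 1
            if skipped > max_skipped:
                break
    return prefix_len


print(conserved_prefix_length(SPACERS))
```

A result well above zero for genuinely independent spacers would suggest that part of the repeat has been assigned to the spacers. The threshold would need tuning for arrays with few spacers, where chance agreement is more likely.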
It seems that the 3' ends of the repeats have been mistakenly annotated as part of the spacers. My first thought was that my parameterization did not allow the combination of repeat and spacer lengths needed for the "correct" annotation. However, 48 nt repeats (32 + 16) and 30 nt spacers (46 - 16) are both within the ranges given in my parameters above. So it seems that minCED expects near-perfect conservation of the repeat sequences, and therefore draws the boundary between repeats and spacers at the base where the sequence is maximally ambiguous (C and T are used equally often). This example was easy enough to catch by eye, but similar misannotations are problematic if the spacers are used to build a BLAST database or if CRISPR system phylogenies are constructed from repeat consensus sequences.
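To make the length arithmetic concrete, here is a minimal sketch that moves the first 16 nt of a spacer onto the preceding repeat and checks the resulting lengths against the limits from my command. The helper names are hypothetical and this is only a post-processing illustration, not how minCED or CRT places boundaries.

```python
# Length limits copied from the minced command in this report.
LIMITS = {"minRL": 20, "maxRL": 50, "minSL": 22, "maxSL": 55}


def shift_boundary(repeat, spacer, shift):
    """Move the first `shift` bases of the spacer onto the 3' end of the repeat."""
    return repeat + spacer[:shift], spacer[shift:]


def within_limits(repeat, spacer, limits=LIMITS):
    """True if both lengths fall inside the configured repeat/spacer ranges."""
    return (
        limits["minRL"] <= len(repeat) <= limits["maxRL"]
        and limits["minSL"] <= len(spacer) <= limits["maxSL"]
    )


# First repeat/spacer pair from the CRISPR 1 output.
repeat = "GTTGTCATTAGCTTCCAGATTCCGTACCTTCA"
spacer = "CACTTGCTAATACAGCTGTGGTTGAGCCAAACAATGAGATGGTAAT"

new_repeat, new_spacer = shift_boundary(repeat, spacer, 16)
print(len(new_repeat), len(new_spacer), within_limits(new_repeat, new_spacer))
```

Note that the final repeat in the array has no annotated spacer after it, so extending it would require the flanking contig sequence rather than the minCED output alone.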