repeat catalog v0.9
Draft catalog built by combining the following 4 catalogs, in order:
- Known disease-associated loci
- A catalog of all perfect repeats in hg38 that span at least 9bp in the reference and consist of at least 3 repeats of a motif between 2bp (dinucleotide) and 1000bp long. This catalog was computed using ColabRepeatFinder.
- Illumina catalog of 174k polymorphic repeats
- Catalog of polymorphic loci in 51 HPRC samples computed using the methods described in [Weisburd 2023]
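
The inclusion thresholds for the perfect-repeat catalog (2nd item above) can be illustrated with a minimal sketch. It assumes a locus is made of whole copies of its motif only; the function name `is_perfect_repeat` and its parameters are hypothetical, and this is not how ColabRepeatFinder itself is implemented.

```python
def is_perfect_repeat(sequence, motif, min_span=9, min_repeats=3,
                      min_motif=2, max_motif=1000):
    """Return True if `sequence` is whole, uninterrupted copies of `motif`
    and meets the span, repeat-count, and motif-size thresholds.

    Assumption: only whole copies count; a trailing partial copy is rejected.
    """
    if not (min_motif <= len(motif) <= max_motif):
        return False
    if len(sequence) < min_span or len(sequence) % len(motif) != 0:
        return False
    repeats = len(sequence) // len(motif)
    return repeats >= min_repeats and sequence == motif * repeats
```

Under these assumptions a dinucleotide locus needs at least 5 copies (10bp) to reach the 9bp minimum span, while a trinucleotide locus qualifies with exactly 3 copies (9bp).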
The merging procedure took all loci from the 1st catalog, then added each locus from the 2nd catalog unless it both
A) overlapped a previously-added locus by 66% or more, and B) had the same motif as that locus after cyclic shift. The 3rd and 4th catalogs were then added in turn using the same rule.
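
The merge rule can be sketched as follows. This is an illustrative sketch, not the code used to build the catalog. It assumes loci are `(chrom, start, end, motif)` tuples with 0-based half-open coordinates, and that "overlap by 66%" is measured relative to the shorter of the two loci; the text does not specify either. All function names are hypothetical.

```python
def canonical_motif(motif):
    """Return the alphabetically smallest rotation, so that cyclic shifts
    of the same motif (CAG, AGC, GCA) compare equal."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter interval.
    Assumption: the source does not say which locus the 66% refers to."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / min(a[1] - a[0], b[1] - b[0])


def merge_catalogs(catalogs, min_overlap=0.66):
    """Add catalogs in order, skipping a locus only when it both overlaps an
    already-added locus by >= min_overlap AND shares its motif up to rotation."""
    merged = []
    for catalog in catalogs:
        for chrom, start, end, motif in catalog:
            duplicate = any(
                chrom == c
                and overlap_fraction((start, end), (s, e)) >= min_overlap
                and canonical_motif(motif) == canonical_motif(m)
                for c, s, e, m in merged
            )
            if not duplicate:
                merged.append((chrom, start, end, motif))
    return merged


known = [("chr4", 100, 160, "CAG")]
perfect = [
    ("chr4", 103, 160, "AGC"),  # same motif after cyclic shift: skipped
    ("chr4", 100, 160, "CCG"),  # same interval, different motif: kept
    ("chr4", 500, 512, "AT"),   # no overlap: kept
]
result = merge_catalogs([known, perfect])
```

The linear scan over `merged` keeps the sketch short; for a catalog of millions of loci, a sorted list or interval index would be needed.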
The numbers (and %) of loci in the combined catalog that were added from each of the source catalogs were as follows:
       82 out of 3,289,806 ( 0.0%) from 1. known disease-associated loci
3,220,632 out of 3,289,806 (97.9%) from 2. perfect repeats in hg38
   10,645 out of 3,289,806 ( 0.3%) from 3. Illumina catalog of 174k polymorphic loci
   58,447 out of 3,289,806 ( 1.8%) from 4. polymorphic loci in 51 HPRC samples
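
As a quick consistency check, the per-source counts above can be summed and converted to percentages. The counts are copied from the table; the variable names are illustrative only.

```python
counts = {
    "1. known disease-associated loci": 82,
    "2. perfect repeats in hg38": 3_220_632,
    "3. Illumina catalog of 174k polymorphic loci": 10_645,
    "4. polymorphic loci in 51 HPRC samples": 58_447,
}
total = sum(counts.values())  # should equal the combined catalog size

# Percentage of the combined catalog contributed by each source, to 1 decimal place
percentages = {source: round(100 * n / total, 1) for source, n in counts.items()}

for source, n in counts.items():
    print(f"{n:>9,} out of {total:,} ({percentages[source]:4.1f}%) from {source}")
```

The 82 known disease-associated loci are about 0.002% of the total, which is why they appear as 0.0% at one decimal place.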