Create repeatMasker.md
friedue committed Nov 3, 2021
1 parent 919ef6c commit 15ff240
Showing 1 changed file with 40 additions and 0 deletions.
40 changes: 40 additions & 0 deletions repeatMasker.md
@@ -0,0 +1,40 @@
Understanding RepeatMasker and repeat annotation
================================================

>RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked.

From: <http://www.repeatmasker.org/>

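Repeat families are identified de novo with `RepeatModeler`; `RepeatMasker` then uses the resulting consensus library (or a curated one such as Dfam) to annotate the individual copies in the genome.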

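The full RepeatMasker track for mm10 can be downloaded from UCSC, e.g. via `wget "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.txt.gz"`.

The file is tab-separated and has no header line; the column order of the dump is defined in the accompanying `rmsk.sql` and may differ from the order below. The fields are described as follows: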

| Field | Meaning |
|-------|---------|
| chrom | "Genomic sequence name" |
| chromStart | "Start in genomic sequence" |
| chromEnd | "End in genomic sequence" |
| name | "Name of repeat" |
| score | "always 0 place holder" |
| strand | "Relative orientation + or -" |
| swScore | "Smith Waterman alignment score" |
| milliDiv | "Base mismatches in parts per thousand" |
| milliDel | "Bases deleted in parts per thousand" |
| milliIns | "Bases inserted in parts per thousand" |
| genoLeft | "-#bases after match in genomic sequence" |
| repClass | "Class of repeat" |
| repFamily | "Family of repeat" |
| repStart | "Start (if strand is +) or -#bases after match (if strand is -) in repeat sequence" |
| repEnd | "End in repeat sequence" |
| repLeft | "-#bases after match (if strand is +) or start (if strand is -) in repeat sequence" |

Based on info from <http://genomewiki.ucsc.edu/index.php/RepeatMasker>
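
To illustrate how the fields in the table above fit together, here is a minimal Python sketch that parses one tab-separated line, converts the parts-per-thousand values to percent, and resolves the strand-dependent meaning of `repStart` and `repLeft`. The sample line is made up, and `TABLE_FIELDS`, `parse_rows`, `percent` and `consensus_span` are hypothetical helper names. The sketch assumes the columns arrive in the order of the table; for a real dump the order should be taken from `rmsk.sql` and passed in as `columns`.

```python
import csv
import io

# Field names in the order of the table above.
# ASSUMPTION: a real dump may order (and name) its columns differently;
# take the actual order from the rmsk.sql file that accompanies the dump.
TABLE_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "swScore", "milliDiv", "milliDel", "milliIns", "genoLeft",
    "repClass", "repFamily", "repStart", "repEnd", "repLeft",
]

INT_FIELDS = {
    "chromStart", "chromEnd", "score", "swScore", "milliDiv", "milliDel",
    "milliIns", "genoLeft", "repStart", "repEnd", "repLeft",
}


def parse_rows(handle, columns):
    """Yield one dict per tab-separated line, with numeric fields as ints."""
    for row in csv.reader(handle, delimiter="\t"):
        rec = dict(zip(columns, row))
        for key in rec.keys() & INT_FIELDS:
            rec[key] = int(rec[key])
        yield rec


def percent(milli):
    """Convert a parts-per-thousand value (milliDiv, milliDel, milliIns) to percent."""
    return milli / 10.0


def consensus_span(rec):
    """Return (start, end, bases_left) within the repeat sequence.

    On '+', repStart is the start and repLeft is -#bases after the match.
    On '-', the two fields swap roles, as described in the table.
    """
    if rec["strand"] == "+":
        return rec["repStart"], rec["repEnd"], -rec["repLeft"]
    return rec["repLeft"], rec["repEnd"], -rec["repStart"]


# Made-up example line, for illustration only.
line = "\t".join([
    "chr1", "3000000", "3000097", "L1_Mus3", "0", "-",
    "12955", "182", "23", "11", "-192471971",
    "LINE", "L1", "-3055", "3592", "3466",
]) + "\n"

rec = next(parse_rows(io.StringIO(line), TABLE_FIELDS))
print(rec["name"], rec["repClass"], rec["repFamily"])
print(percent(rec["milliDiv"]), "% mismatches")
print(consensus_span(rec))
```

For the downloaded file, the same function could be fed with `gzip.open("rmsk.txt.gz", "rt")` instead of the `io.StringIO` object, once the column order has been confirmed.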

## Families, classes and so on

>The most elementary level of classification of TEs is the family, which designates interspersed genomic copies derived from the amplification of an ancestral progenitor sequence (10). Each TE family can be represented by a consensus sequence approximating that of the ancestral progenitor.

From [Flynn et al. (2020)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7196820/)
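
The idea of a consensus sequence can be made concrete with a toy example. The sketch below takes a few already-aligned copies of a family and picks the most frequent base in each column. This is only a simple majority-rule illustration, not the procedure RepeatModeler uses; `majority_consensus` and the sequences are made up for the example.

```python
from collections import Counter


def majority_consensus(aligned_copies):
    """Column-wise majority base over equal-length, gap-padded sequences."""
    if len({len(seq) for seq in aligned_copies}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_copies):
        # Ignore gap characters when counting bases in this column.
        counts = Counter(base for base in column if base != "-")
        if counts:  # columns consisting only of gaps are skipped
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Four made-up copies, each carrying one private substitution.
copies = ["ACGTTACA", "ACGATACA", "ACGTTGCA", "TCGTTACA"]
print(majority_consensus(copies))
```

Each copy differs from the others at one position, but the column-wise majority recovers a single sequence that approximates their common ancestor.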

>RepeatModeler contains a basic homology-based classification module (RepeatClassifier) which compares the TE families generated by the various de novo tools to both the RepeatMasker Repeat Protein Database (DB) and to the RepeatMasker libraries (e.g., Dfam and/or RepBase). The Repeat Protein DB is a set of TE-derived coding sequences that covers a wide range of TE classes and organisms. As is often the case with a search against all known TE consensus sequences, there will be a high number of false positive or partial matches. RepeatClassifier uses a combination of score and overlap filters to produce a reduced set of high-confidence results. If there is a concordance in classification among the filtered results, RepeatClassifier will label the family using the RepeatMasker/Dfam classification system and adjust the orientation (if necessary). Remaining families are labeled “Unknown” if a call cannot be made. Classification is the only step that requires a database, and can be completed with only open-source Dfam if Repbase is not available.
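
The filter-then-concordance logic described in the quote can be sketched roughly as follows. This is not RepeatClassifier's code: the `Hit` structure, the `classify` function and both threshold values are hypothetical placeholders chosen for illustration, and the orientation adjustment is left out.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    label: str      # classification of the matched library entry, e.g. "LINE/L1"
    score: float    # alignment score of the hit
    overlap: float  # fraction of the query family covered by the hit (0..1)


def classify(hits, min_score=200.0, min_overlap=0.5):
    """Return a label if all hits that pass the filters agree, else "Unknown".

    ASSUMPTION: min_score and min_overlap are invented values, not the
    thresholds used by RepeatClassifier.
    """
    kept = [h for h in hits if h.score >= min_score and h.overlap >= min_overlap]
    labels = {h.label for h in kept}
    if len(labels) == 1:
        return labels.pop()
    # No surviving hits, or surviving hits that disagree: no call is made.
    return "Unknown"


concordant = [Hit("LINE/L1", 950.0, 0.8), Hit("LINE/L1", 400.0, 0.6), Hit("DNA/hAT", 90.0, 0.1)]
conflicting = [Hit("LINE/L1", 950.0, 0.8), Hit("LTR/ERVK", 700.0, 0.7)]

print(classify(concordant))
print(classify(conflicting))
print(classify([]))
```

In the first case the weak `DNA/hAT` hit is filtered out and the remaining hits agree; in the other two cases no call can be made.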
