Understanding RepeatMasker and repeat annotation
==================================================
>RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked

From: <http://www.repeatmasker.org/>

New repeat families are identified de novo with `RepeatModeler`; RepeatMasker then annotates copies of known families in a genome.

The full RepeatMasker track for mm10 can be downloaded, e.g., via `wget "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.txt.gz"`.
The fields can be labelled as follows. Note that this is the BED-style layout described on the genomewiki page; the column order in `rmsk.txt.gz` itself is defined by the accompanying `rmsk.sql` and may differ from this list.

| Field | Meaning |
|-------|---------|
| chrom | "Genomic sequence name" |
| chromStart | "Start in genomic sequence" |
| chromEnd | "End in genomic sequence" |
| name | "Name of repeat" |
| score | "always 0 place holder" |
| strand | "Relative orientation + or -" |
| swScore | "Smith Waterman alignment score" |
| milliDiv | "Base mismatches in parts per thousand" |
| milliDel | "Bases deleted in parts per thousand" |
| milliIns | "Bases inserted in parts per thousand" |
| genoLeft | "-#bases after match in genomic sequence" |
| repClass | "Class of repeat" |
| repFamily | "Family of repeat" |
| repStart | "Start (if strand is +) or -#bases after match (if strand is -) in repeat sequence" |
| repEnd | "End in repeat sequence" |
| repLeft | "-#bases after match (if strand is +) or start (if strand is -) in repeat sequence" |

Based on info from <http://genomewiki.ucsc.edu/index.php/RepeatMasker>
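
A minimal sketch of reading the downloaded track into named columns, using only the Python standard library. It assumes the 17-column order of UCSC's `rmsk.sql` schema, which starts with a `bin` column and uses `genoName`, `genoStart`, `genoEnd` and `repName` where the list above has `chrom`, `chromStart`, `chromEnd` and `name`; check `rmsk.sql` before relying on it. The names `RMSK_COLUMNS`, `read_rmsk` and `open_rmsk` are hypothetical, and the example row is made up purely to show the shape of the data.

```python
import csv
import gzip
import io

# Assumed column order, following UCSC's rmsk.sql schema. Verify it against
# the rmsk.sql file that sits next to rmsk.txt.gz on the download server.
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd",
    "repLeft", "id",
]


def read_rmsk(handle):
    """Yield one dict per row from a tab-separated rmsk text handle."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(RMSK_COLUMNS):
            raise ValueError(
                f"expected {len(RMSK_COLUMNS)} fields, got {len(row)}"
            )
        yield dict(zip(RMSK_COLUMNS, row))


def open_rmsk(path):
    """Open a gzipped rmsk file (any local path) in text mode."""
    return gzip.open(path, "rt", newline="")


# Made-up row, for illustration only (not taken from the real track).
example = (
    "585\t1892\t83\t59\t14\tchr1\t3000000\t3002128\t-192469843\t-\t"
    "L1_Mus3\tLINE\tL1\t-3055\t3592\t1466\t1\n"
)
records = list(read_rmsk(io.StringIO(example)))
print(records[0]["repName"], records[0]["repClass"], records[0]["repFamily"])
```

For the real file, `read_rmsk(open_rmsk("rmsk.txt.gz"))` would stream the rows without loading the whole track into memory. All values are kept as strings here; convert coordinates with `int()` as needed.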
## Families, classes and so on
>The most elementary level of classification of TEs is the family, which designates interspersed genomic copies derived from the amplification of an ancestral progenitor sequence (10). Each TE family can be represented by a consensus sequence approximating that of the ancestral progenitor.

From [Flynn et al. (2020)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7196820/)

>RepeatModeler contains a basic homology-based classification module (RepeatClassifier) which compares the TE families generated by the various de novo tools to both the RepeatMasker Repeat Protein Database (DB) and to the RepeatMasker libraries (e.g., Dfam and/or RepBase). The Repeat Protein DB is a set of TE-derived coding sequences that covers a wide range of TE classes and organisms. As is often the case with a search against all known TE consensus sequences, there will be a high number of false positive or partial matches. RepeatClassifier uses a combination of score and overlap filters to produce a reduced set of high-confidence results. If there is a concordance in classification among the filtered results, RepeatClassifier will label the family using the RepeatMasker/Dfam classification system and adjust the orientation (if necessary). Remaining families are labeled “Unknown” if a call cannot be made. Classification is the only step that requires a database, and can be completed with only open-source Dfam if Repbase is not available.
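
The filter-then-agree logic described in the quote can be illustrated with a toy sketch. This is not RepeatClassifier's actual code: the `Hit` record, the thresholds `min_score` and `min_overlap`, and the function `classify_family` are all hypothetical, and the orientation adjustment is left out.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database match for a candidate TE family (hypothetical record)."""
    classification: str  # e.g. "LINE/L1"
    score: float         # alignment score of the match
    overlap: float       # fraction of the family covered by the match, 0..1


def classify_family(hits, min_score=200.0, min_overlap=0.5):
    """Return the label shared by all filtered hits, else "Unknown".

    The thresholds are arbitrary placeholders, not RepeatClassifier's values.
    """
    # Step 1: keep only hits that pass both the score and the overlap filter.
    kept = [
        h for h in hits
        if h.score >= min_score and h.overlap >= min_overlap
    ]
    # Step 2: make a call only if the surviving hits agree on one label.
    labels = {h.classification for h in kept}
    if len(labels) == 1:
        return labels.pop()
    return "Unknown"


hits = [
    Hit("LINE/L1", 950.0, 0.8),
    Hit("LINE/L1", 400.0, 0.6),
    Hit("DNA/hAT", 120.0, 0.1),  # weak partial match, removed by the filters
]
print(classify_family(hits))
```

In this sketch, a family with no surviving hits, or with surviving hits that disagree, falls through to "Unknown", mirroring the behaviour the quote describes.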