# Identifying Repetitive DNA

During our previous tutorial on Genome Assembly, we briefly discussed the problem of repetitive DNA. In the context of
genome assembly, repetitive DNA makes it difficult to piece together separate "fragments" to form an entire assembled
genome. In this tutorial, we present repetitive DNA as both a problem and an interesting biological question.

## Repetitive DNA

Repetitive DNA comes in many forms. The simplest are long stretches of identical sequence:

    ATGAGATGACAAAAAAAAAAAAAAAGAGCTGCAGTCCATG

Here, the "repeat" is a long stretch of adenine nucleotides. However, repetitive DNA can also be more complex:

    ATGAGATGACAGAGAGAGAGAGAGAGAGCTGCAGTCCATG

In this case, the repeat is a two-base-pair "unit" of adenine and guanine (AG). We can further increase the complexity:

    TTTAGGGTTCAGGGTTTAGGGTTTAGGGTTCAGGGCTGCAGTCCATG

In this case, the repetitive unit is `TT(T/C)AGGG`, where `(T/C)` means that the 3rd base in the repeat can be *either*
thymine or cytosine. This particular repeat is very important biologically: it is the repeat found at the telomeres, the
ends of all _P. falciparum_ chromosomes. You have already seen, in the previous section of this tutorial, how telomeres
can affect the ability of researchers to properly assemble entire genomes.
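
The `(T/C)` notation maps directly onto a regular-expression character class. As a minimal sketch (the variable names `seq` and `telomere_hits` are ours, chosen only for illustration), the example sequence above can be searched for this unit with `grep`:

```shell
# Example sequence copied from the text above.
seq="TTTAGGGTTCAGGGTTTAGGGTTTAGGGTTCAGGGCTGCAGTCCATG"

# [TC] matches either thymine or cytosine at the 3rd position of the unit.
# -E enables extended regular expressions; -o prints each match on its own line.
telomere_hits=$(printf '%s\n' "$seq" | grep -oE 'TT[TC]AGGG' | wc -l | tr -d ' ')

echo "Telomere repeat units found: $telomere_hits"
```

Because `grep -o` reports non-overlapping matches from left to right, each copy of the unit is counted once.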

Additionally, some repeats are formed of much larger units and are known as transposons. These are pieces of DNA that
can "jump" around the genome and make new copies of themselves. Transposons can be many thousands of base pairs long. In _Homo sapiens_,
there are three types of transposons which can still "jump": Alu, LINE-1, and SVA. _P. falciparum_, and in fact all other
Apicomplexans, mostly lack transposons, so we do not have to worry about them as part of this practical. If you want to
read more on transposons, this paper is a great place to start:

Henry L. Levin and John V. Moran. _**Dynamic interactions between transposable elements and their hosts**_. Nature Reviews Genetics (2011).

Please note that this section is not an exhaustive list of all possible DNA repeats. For further reading on
how repetitive DNA can affect genome assembly and annotation, a good place to start is:

Todd J. Treangen and Steven L. Salzberg. _**Repetitive DNA and next-generation sequencing: computational challenges and solutions**_. Nature Reviews Genetics (2011).

## Exercise

Here is a piece of DNA from _P. falciparum_. Can you identify repetitive DNA sequences that are longer than 5 base pairs in this sequence?

    TATAAATACAATATAATATAACGACGAACAGATATGAAAGTGTTAGAACTAGACATACCA
    TTTTTCTGTGAAAAATACTTCAAGCTGTAGTATTATTATTATTGCGCTGCTTAGATGTAGT

**Question:** Can you think of any problems that a malaria genome may cause when searching for repeats?

> _hint: look at part 5 from the "Genome Assembly" tutorial_

## Masking a Genome

The most widely used tool for identifying repetitive DNA is "RepeatMasker". Arian Smit _et al._ developed
RepeatMasker in the 1990s to identify repeats in the large amount of sequencing data that was being produced
at the time. RepeatMasker compares your sequence against a library of known repeats and, depending on the search engine
and library it is configured with, can use a statistical method known as a "Hidden Markov Model" or HMM to identify
repetitive DNA. The methodology behind HMMs is beyond the scope of this course, but if you want to learn more,
[this website](https://towardsdatascience.com/hidden-markov-model-hmm-simple-explanation-in-high-level-b8722fa1a0d5) is
a good, simple introduction.

Before we can perform annotation, we need to identify repetitive DNA that may confuse the tools we will use later
today. To do this, we will use the RepeatMasker tool discussed above. RepeatMasker is preinstalled on your VM and is
also available [online](https://www.repeatmasker.org/).

Here, we are going to run "RepeatMasker" to "mask" our assembled genome. In the case of genome annotation, to "mask"
simply means to identify repetitive DNA. In this tutorial, we are going to change any letter in our genome fasta file from
`UPPER CASE` to `lower case` where we think there is a repeat. This allows programs like "Augustus" (which we will use
later) to identify DNA repeats and avoid them if necessary.
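
This style of masking is often called "soft masking". As a rough sketch of how one might measure it, the snippet below writes a small toy FASTA file (hypothetical data, not taken from our assembly) and counts its lower-case bases with standard shell tools. The names `toy`, `masked` and `total` are ours, for illustration only.

```shell
# Write a toy soft-masked FASTA file (hypothetical data) to a temporary location.
toy=$(mktemp)
cat > "$toy" <<'EOF'
>toy_contig
ATGAGATGACaaaaaaaaaaGAGCTGCAGT
EOF

# Drop header lines, then keep only lower-case bases and count them.
masked=$(grep -v '^>' "$toy" | tr -cd 'acgtn' | wc -c | tr -d ' ')

# Drop header lines and newlines, then count every remaining base.
total=$(grep -v '^>' "$toy" | tr -d '\n' | wc -c | tr -d ' ')

echo "Masked $masked of $total bases"
rm -f "$toy"
```

The same two counting commands could be pointed at a real soft-masked FASTA file in place of `$toy` to estimate how much of it is masked.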

To run RepeatMasker, just do:

    RepeatMasker -species plasmodium -xsmall PB.contigs.polished.reheader.fasta

* `-species plasmodium` tells RepeatMasker to search with the repeat library entries for _Plasmodium_ species.
* `-xsmall` tells RepeatMasker to write masked sequences in lower-case letters rather than replacing them with `N`s.

This command should take a few minutes and generate four new files. You can see them with the following command:

    ls PB.contigs.polished.reheader.fasta.*

which should return something like:

    PB.contigs.polished.reheader.fasta.masked
    PB.contigs.polished.reheader.fasta.out
    PB.contigs.polished.reheader.fasta.tbl
    PB.contigs.polished.reheader.fasta.cat
    PB.contigs.polished.reheader.fasta.fai

We already know what `PB.contigs.polished.reheader.fasta.fai` is. The new files are:

* `PB.contigs.polished.reheader.fasta.out` contains an extensive list of all the repeats "masked" in our assembled
genome.
* `PB.contigs.polished.reheader.fasta.masked` is identical to `PB.contigs.polished.reheader.fasta` except
that all of those repeats are in lower-case letters.
* `PB.contigs.polished.reheader.fasta.tbl` contains a summary, with numbers, of what was masked.
* `PB.contigs.polished.reheader.fasta.cat` contains the detailed reports for _all_ repeat sequences found by RepeatMasker.
Look at the contents of `PB.contigs.polished.reheader.fasta.tbl` to answer the following questions (use `more`, `less`
or `cat`):

**Questions:**

1. Why do the sections "Retroelements" and "DNA transposons" both have zeros?
2. Approximately what proportion of our genome assembly was masked?

Now let's rename `PB.contigs.polished.reheader.fasta.masked` to a shorter name to make it easier to use later:

    mv PB.contigs.polished.reheader.fasta.masked PB.masked.fasta

and then index it:

    samtools faidx PB.masked.fasta

If RepeatMasker did not work properly or took too long to run, we have placed a backup masked genome at:

`annotation_backups/PB.masked.fasta`

Once you have masked your genome, move on to [Gene Discovery](gene_discovery.ipynb).