## Reference Based Alignment

**Goal**: Find the position in the reference genome from which the read was sampled

**Issues**:
- the human genome is large and highly repetitive
- huge amounts of data are produced from NGS
- SNPs, indels, and structural variations cause differences between the reference genome and the sequenced reads

### Short Read Mapping

**Input:** 
- a reference genome
- a collection of many 25-100 bp reads
- user-specified parameters

**Output:**
- one or more genomic coordinates for each read

Repeats, ambiguous sequences, and similar sequences can give a single read more than one possible coordinate.

In practice, we expect only 70-75 percent of reads to map successfully to the reference genome, because of SNPs, indels, structural variations, and similar differences. The sample may also contain viral genomes or other contaminants.

### Hash Table Mapping

One of the earliest sequence mapping algorithms (2009). Utilizes k-mers.

The number of possible k-mers is 4<sup>k</sup>, so a hash table with one slot per possible k-mer requires 4<sup>k</sup> slots.
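As a rough illustration of the 4<sup>k</sup> count, the sketch below reads a k-mer as a base-4 number, which gives it a slot in a direct-address table. The helper `kmer_to_slot` and the `BASE_CODE` mapping are hypothetical names for this example, not part of any mapping tool.

```python
# Hypothetical helper: map a k-mer over {A, C, G, T} to a slot in [0, 4**k).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_slot(kmer):
    """Read the k-mer as a base-4 number (two bits per base)."""
    slot = 0
    for base in kmer:
        slot = slot * 4 + BASE_CODE[base]
    return slot


k = 3
print(4 ** k)               # 64 possible 3-mers
print(kmer_to_slot("AAA"))  # 0, the first slot
print(kmer_to_slot("TTT"))  # 63, the last slot
print(4 ** 25)              # slots a direct table would need for 25-mers
```

A Python `dict` sidesteps the full 4<sup>k</sup> allocation because it only stores the k-mers that actually occur in the reference.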

The hash table is made *only once* when the reference genome is first processed.

Every candidate region that the hash table returns for a read's k-mer is then compared against the read. A read counts as aligned only if both the k-mer and the entire remainder of the read match the reference at that position.
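A minimal sketch of this look-up-then-verify idea, assuming exact matching only and a made-up toy reference. The function names `build_index` and `map_read` are illustrative, and each k-mer keeps a list of positions so that repeats are not lost.

```python
from collections import defaultdict


def build_index(ref, k):
    """Map every k-mer in ref to the list of 1-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i + 1)
    return index


def map_read(read, ref, index, k):
    """Return the 1-based positions where the whole read matches exactly."""
    hits = []
    # The first k bases of the read select the candidate regions.
    for pos in index.get(read[:k], []):
        start = pos - 1
        # Verify the k-mer and the entire remainder of the read.
        if ref[start:start + len(read)] == read:
            hits.append(pos)
    return hits


ref = "ACGTACGTTAGCACGTAC"  # toy reference, made up for illustration
k = 4
index = build_index(ref, k)  # built only once
print(map_read("ACGTAC", ref, index, k))  # [1, 13]
print(map_read("ACGTTA", ref, index, k))  # [5]
print(map_read("ACGTGG", ref, index, k))  # []
```

The seed `ACGT` occurs three times in the toy reference, but only the candidates whose remainder also matches are reported.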

**Limitations:**
- needs a very large hash table
- the longer the k-mer, the fewer reads map, because a longer k-mer is more likely to contain a mismatch
- the shorter the k-mer, the more places each read maps to
- **SLOW**

#### Yeast Example

```python
#path = r"/home/BIOINFO/public/yeast/chr01.fsa"
#path2 = r"/home/BIOINFO/public/yeast/reads.txt"
path = r"C:\Users\morri\OneDrive\Documents\School\UTHealth\2023 Spring\Practical Computational Genetics and Bioinformatics - 2782 - Jun Goo\Notes\chr01.txt"
path2 = r"C:\Users\morri\OneDrive\Documents\School\UTHealth\2023 Spring\Practical Computational Genetics and Bioinformatics - 2782 - Jun Goo\Notes\reads.txt"
```

```python
with open(path, "r") as file:
    # first line is the header
    header_text = file.readline()

    # read the remaining lines, removing the trailing `\n`;
    # after the join, ref contains the entire sequence from the file
    ref = "".join(line.rstrip() for line in file)

# create a hash table of 25-mers
k = 25
htable = {}
# the +1 makes sure the final k-mer of the reference is included
for i in range(len(ref) - k + 1):
    # remember that DNA is indexed starting at 1, and Python
    # indexes from 0
    htable[ref[i:i + k]] = i + 1
    # this is a shorthand that keeps only the last position seen,
    # so it does not allow multiple maps; in actual practice,
    # this would append to a list of positions

# now use the hash table to get the "mapped position" of each read
with open(path2, "r") as file2:
    for line in file2:
        read = line.rstrip()
        # read should always be at least as long as the k-mer length
        query = read[0:k]
        # this is only the beginning of the read; in practice
        # you would take k-mers from various pieces of the read
        # to be thorough
        if query in htable:
            print(htable[query], query)
        else:
            print("not mapped", query)
```

```text
140821 ACATCGAAAGTACCACCACCCAAGT
82621 TTAGCGCTCTTGAAAAATATTTTAC
121861 TTTGGGCAAAAAGGAAAAACGGCTT
136501 ATTAGTCCTCAACCCCAGAGCATTT
192781 ATGCTCCACTATGCCGTGCAAGTCG
60661 AACTTCTTCAGCATACACAACTGAT
166261 TGAGTTGGGCACATGGCGCAGTTGG
30601 CGATCATTGTTACTACTAAACACTT
69601 ATTTCATTTCTCTTAGTATTCGCAT
133801 TTTGAAGCCGATTGCCCTACCCCTC
167761 CATATCCATTCAGTGCAGGTGTCGG
179281 TCATATTGCAGGTATGCCCACATCT
199501 AATAAAATAAAAAATTACTTAGGAA
98461 TCCATGCAGTATTTCTTAAAGTTGC
129901 AGCTGCAGATGTGTCTGGAAGGGTC
7141 AACTTCATTTACGTCATAAGAAAAT
70801 GAGATTAATCTCAGTACAAAACAAT
218401 CTTCAGGAACTTTTAAATGTGCTCA
116641 TACCAACAAACGGGTATAAACTGGT
47941 ACTCTTTGATGATGAGACTAGTTCA
45061 TTTGGTTTTGCTTATAGATAGAAAA
218821 GCTTCCAATTAGCGTGGAATTGCCT
140401 TGGACCTTTGGAATTCTGGTAGAAC
216121 TATCGTCTTAGGAACTCAAGATTTT
86581 AAATCTTCCATTGTAGAGAATTATT
196801 ACCTCTAGTTTTGGTGATGAAGCAA
139021 TTTTAAAAACTTGGTGAGATAGTGC
112861 ACACTGCAGCCAGATGCAATTGAAG
1201 CGTATACTAAAACGGACGTTACGAT
176521 AAATTTATATTTATATACATGCGCC
42901 TTAGATGTCTAACAATG
```


### Burrows-Wheeler Transformation Mapping

**Strengths:**
- Very compact, approximately 1/2 byte per bp
- The index fits on a standard computer with 2 GB of memory, since it is the original text plus a few "extras"
- Exact-match search is O(n), where n is the length of the query

Originally developed for compression, the Burrows-Wheeler Transform (BWT) is a reversible permutation of the characters of a string. A unique end marker is appended to the string, all possible **cyclic shifts** of the marked string are sorted into the Burrows-Wheeler matrix (BWM), and the last column of that matrix is the transform. The indexing algorithm for Burrows-Wheeler can efficiently search the index and **reverse** the transformation.

It helps compression because identical characters tend to "cluster" together in the output, so runs of them can be "compressed" together.
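A minimal sketch of the transformation, assuming `$` as the end marker and plain Python string ordering (where `$` sorts before letters). It sorts all cyclic shifts directly, which is fine for toy strings but far too slow and memory-hungry for a genome.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler transform via sorted cyclic shifts (toy inputs only)."""
    s = text + end_marker
    # Every cyclic shift of the marked string, sorted into the matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


print(bwt("abaaba"))  # abba$aa
print(bwt("banana"))  # annb$aa
```

In both outputs, equal characters end up next to each other more often than in the input, which is the clustering that compression exploits.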

#### LF Mapping

The Burrows-Wheeler matrix has the **LF Mapping** property: *L* is the "Last" column of the matrix and *F* is the "First" column.

Each character is given a rank, equal to the number of times the character occurred previously in the string. This forms the T-ranking:

a<sub>0</sub>b<sub>0</sub>a<sub>1</sub>a<sub>2</sub>b<sub>1</sub>a<sub>3</sub>

In a Burrows-Wheeler matrix, the occurrences of each character appear in the same relative order (by rank) in *F* and in *L*.

**LF Mapping** states that the ith occurrence of character `c` in *L* and the ith occurrence of `c` in *F* correspond to the *same* occurrence in the original string `T`.

This allows easy reversal of the transformation.
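The reversal can be sketched as follows, assuming the end marker `$` sorts before every other character and ranking each character by its order of appearance in *L*. The helper names are illustrative only.

```python
def rank_bwt(bw):
    """Rank each character of L by how many times it appeared earlier in L."""
    totals = {}
    ranks = []
    for c in bw:
        ranks.append(totals.get(c, 0))
        totals[c] = ranks[-1] + 1
    return ranks, totals


def first_col(totals):
    """Map each character to the first (0-based) row of F that starts with it."""
    first = {}
    row = 0
    for c in sorted(totals):
        first[c] = row
        row += totals[c]
    return first


def reverse_bwt(bw, end_marker="$"):
    """Rebuild the original string (with its end marker) from the transform."""
    ranks, totals = rank_bwt(bw)
    first = first_col(totals)
    row = 0  # row 0 of the matrix starts with the end marker
    text = end_marker
    while bw[row] != end_marker:
        c = bw[row]
        text = c + text  # L[row] precedes F[row] in the original string
        row = first[c] + ranks[row]  # LF mapping: same occurrence, now in F
    return text


print(reverse_bwt("abba$aa"))  # abaaba$
```

Each loop iteration recovers one character, moving from the end of the string toward its start.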

*F* always starts with the end marker (`$` in our examples), so *L* always starts with the last character of the string, which is the character just before `$`.

If `T` has 300 `A`s, 400 `C`s, 250 `G`s, and 700 `T`s, and `$` < `A` < `C` < `G` < `T`:

The row (0-based) that begins with `G`<sub>100</sub> is found by counting the rows that sort before it:
1. Skip the row starting with `$` (1 row)
2. Skip the rows starting with `A` (300 rows)
3. Skip the rows starting with `C` (400 rows)
4. Skip the first 100 rows starting with `G`, that is `G`<sub>0</sub> through `G`<sub>99</sub> (100 rows)
5. Answer: 1 + 300 + 400 + 100 = 801, so row 801
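The same skip-counting can be written as a small function. This is only a sketch: `row_of` is a hypothetical helper, and it assumes the character counts and the alphabet order are already known.

```python
def row_of(char, rank, counts, order="$ACGT"):
    """0-based row of F that starts with the rank-th occurrence of char."""
    skipped = 0
    for c in order:
        if c == char:
            # Skip the earlier occurrences of char itself as well.
            return skipped + rank
        skipped += counts[c]  # skip every row starting with a smaller character
    raise ValueError(f"{char!r} is not in the alphabet {order!r}")


counts = {"$": 1, "A": 300, "C": 400, "G": 250, "T": 700}
print(row_of("G", 100, counts))  # 801
```

Because only the per-character totals are needed, *F* never has to be stored explicitly.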

Essentially, this computes each *bounce* between *L* and *F* in the matrix that reverses the transformation.

#### FM Index

**FM** (Full-text index in Minute space) **Index**: an index combining the BWT with a few auxiliary data structures.

The core is *F* and *L* from the Burrows-Wheeler matrix (BWM):
- *F* can be represented very simply, with 1 integer per alphabet character
- *L* is compressible
- overall, highly space economical

We use this to **query**:

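Given these *F* and *L* columns for `abaaba$`, with each character ranked by its order of appearance in *L*: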

|   | F | L |
|---|---|---|
| 1 | $ | a<sub>0</sub> |
| 2 | a<sub>0</sub> | b<sub>0</sub> |
| 3 | a<sub>1</sub> | b<sub>1</sub> |
| 4 | a<sub>2</sub> | a<sub>1</sub> |
| 5 | a<sub>3</sub> | $ |
| 6 | b<sub>0</sub> | a<sub>2</sub> |
| 7 | b<sub>1</sub> | a<sub>3</sub> |

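To query `aba`, match the query from its last character backward:
1. Start with the rows that have `a` in *F* and keep those that have `b` in *L*, since that `b` precedes the `a` in the text. (Rows 2, 3)
2. Follow those `b`s to the rows that have them in *F*, and keep the rows that have `a` in *L*, narrowing the results from step 1 (would be relative slicing). (Rows 6, 7)

Both rows survive, so `aba` occurs twice in the text.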
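The query steps above can be sketched as a backward search over a range of rows. This is a simplified illustration with 0-based rows: `fm_count` is a hypothetical name, *F* is represented only by character totals, and ranks are found by a slow scan of *L*.

```python
def fm_count(bw, pattern):
    """Count occurrences of pattern using only F (as counts) and L (the BWT)."""
    # F is implied by character totals: first[c] is the first row starting with c.
    totals = {}
    for c in bw:
        totals[c] = totals.get(c, 0) + 1
    first, row = {}, 0
    for c in sorted(totals):
        first[c] = row
        row += totals[c]

    def rank(c, i):
        """Number of occurrences of c in bw[:i] (slow scan, no checkpoints)."""
        return bw[:i].count(c)

    lo, hi = 0, len(bw)  # half-open range of candidate rows
    for c in reversed(pattern):  # match from the last character backward
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


bw = "abba$aa"  # the L column of the table above, read top to bottom
print(fm_count(bw, "aba"))  # 2
print(fm_count(bw, "baa"))  # 1
print(fm_count(bw, "bb"))   # 0
```

The range of rows only ever shrinks, and the work done depends on the query length rather than on how many rows the matrix has (apart from the naive `rank` scan used here).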

**Improvements**:

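- The "lookup" of ranks in *L* can be slow; improve it by storing 'partial' checkpoints of the ranks at regular intervals, so that a rank is a checkpoint value plus a short local count between step 1 and step 2
- "Finding the position" in the original sequence can be done by storing the 'partial' index (the text offset) for some rows (not all), so that LF mapping can walk from any row to a stored one and recover the relative position of the subsequence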
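The first improvement can be sketched like this, assuming a fixed checkpoint spacing `step`. Both function names are illustrative, and real indexes store the checkpoints far more compactly than a list of dictionaries.

```python
def build_checkpoints(bw, step):
    """Store the running character counts at every step-th row of L."""
    checkpoints = []
    counts = {}
    for i, c in enumerate(bw):
        if i % step == 0:
            checkpoints.append(dict(counts))  # counts of bw[:i]
        counts[c] = counts.get(c, 0) + 1
    return checkpoints


def rank(bw, checkpoints, step, c, i):
    """Occurrences of c in bw[:i]: nearest earlier checkpoint plus a short scan."""
    block = min(i // step, len(checkpoints) - 1)
    start = block * step
    return checkpoints[block].get(c, 0) + bw[start:i].count(c)


bw = "abba$aa"
step = 3
checkpoints = build_checkpoints(bw, step)
print(rank(bw, checkpoints, step, "a", 5))  # 2
print(rank(bw, checkpoints, step, "b", 7))  # 2
```

A larger `step` saves memory but lengthens the local scan, which is the trade-off behind storing only partial checkpoints.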

## Sequence Assembly

### Assembly Algorithms

**Overlap Based**: Finds reads that are overlapping with each other
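As a small illustration of the overlap idea, the sketch below finds the longest suffix of one read that is also a prefix of another. The function name, the `min_length` threshold, and the toy reads are assumptions for this example, and real assemblers also tolerate sequencing errors.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    # Try the longest possible overlap first and shrink until one fits.
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


print(overlap("TTACGT", "CGTACC"))  # 3
print(overlap("AAAA", "CCCC"))      # 0
```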

**De Bruijn Graph**: Decompose the reads into K-mers, and build a K-mer graph for assembly.

Only a single base is needed to indicate the 'next node' because *any two adjacent nodes share K-1 bases*.
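A minimal sketch of such a K-mer graph, assuming error-free toy reads. Each node is a K-mer, and only the single next base is stored for each edge; `kmer_graph` and `walk` are hypothetical names for this example.

```python
from collections import defaultdict


def kmer_graph(reads, k):
    """Map each k-mer (node) to the set of single bases that lead to a next node."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            node = read[i:i + k]
            graph[node].add(read[i + k])  # one base is enough to name the next node
    return graph


def walk(graph, start):
    """Extend a contig from start while exactly one next base is possible."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (base,) = graph[node]
        node = node[1:] + base  # adjacent nodes share k-1 bases
        if node in seen:  # stop instead of looping forever on a cycle
            break
        seen.add(node)
        contig += base
    return contig


reads = ["ACGTC", "CGTCA"]  # toy reads, made up for illustration
graph = kmer_graph(reads, 3)
print(walk(graph, "ACG"))  # ACGTCA
```

The dictionary used here is itself a hash table keyed by K-mer. The final K-mer of a read has no outgoing edge unless another read extends it.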

**Hash tables** are used to store and look up the K-mers.