<p style="text-align:center;">
<img src="figures/expert-databases.png" alt="bioinformatics_databases" width="500" class="center"/>
</p>

## FASTA File
FASTA is a text-based format for representing nucleotide sequences such as DNA, in which each base is represented by a single-letter code (A, C, G, T).

A sequence in FASTA format begins with a single-line identifier and description, followed by one or more lines of DNA sequence data. The identifier line is distinguished by a '>' symbol in the first column.
The sequence data starts on the line immediately after the identifier line.

### Example
<p style="text-align:center;">
<a href="https://doi.org/10.3390/info7040056">
    <img src="figures/fasta.webp" alt="FASTA" width="600" class="center">
    </a></p>
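
To make the layout concrete, here is a minimal pure-Python sketch of a FASTA reader. It assumes well-formed input; the name `parse_fasta` and the toy records are illustrative only and are not part of any library. For real files, the cells below use Biopython's `SeqIO.parse`.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Toy input (hypothetical records, not taken from a real file)
example = io.StringIO(">seq1 toy record\nACGT\nNNAC\n>seq2\nGGCC\n")
records = list(parse_fasta(example))
print(records)
```

The sketch joins multi-line sequence data into one string per record, which is also what `record.seq` holds in the Biopython cells.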

In [1]:
import requests
from tqdm import tqdm

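def download(url: str, fname: str):
    """Stream `url` to the local file `fname`, showing a progress bar."""
    resp = requests.get(url, stream=True)
    resp.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    # Fall back to 0 if the server does not report a content length
    total = int(resp.headers.get('content-length', 0))
    with open(fname, 'wb') as file, tqdm(
        desc=fname,
        total=total,
        unit='iB',
        unit_scale=True,
        unit_divisor=1024,
    ) as bar:
        for data in resp.iter_content(chunk_size=1024):
            size = file.write(data)
            bar.update(size)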

In [3]:
download("http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz", "../data/chr1.fa.gz")

../data/chr1.fa.gz: 100%|██████████| 71.3M/71.3M [00:19<00:00, 3.83MiB/s]


In [2]:
import gzip
from Bio import SeqIO

with gzip.open("../data/chr1_500l.fa.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.seq)

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The International Union of Pure and Applied Chemistry (IUPAC) has defined a standard representation of DNA bases by single characters that specify either a single base (e.g. G for guanine, A for adenine) or a set of bases (e.g. R for either G or A).

|Symbol|Bases|Origin of designation|
|--- |--- |--- |
|G|G|Guanine|
|A|A|Adenine|
|T|T|Thymine|
|C|C|Cytosine|
|R|G or A|puRine|
|Y|T or C|pYrimidine|
|M|A or C|aMino|
|K|G or T|Keto|
|S|G or C|Strong interaction (3 H bonds)|
|W|A or T|Weak interaction (2 H bonds)|
|H|A or C or T|not-G, H follows G in the alphabet|
|B|G or T or C|not-A, B follows A|
|V|G or C or A|not-T (not-U), V follows U|
|D|G or A or T|not-C, D follows C|
|N|G or A or T or C|aNy|
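
One common use of these codes is to describe a motif that tolerates several bases at a position. The sketch below turns an IUPAC pattern into a regular expression, using only the symbols in the table above. The helper name `iupac_to_regex`, the motif, and the toy sequence are hypothetical and chosen for illustration.

```python
import re

# IUPAC nucleotide codes and the bases they stand for (from the table above)
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression over A, C, G, T."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_BASES[code]  # raises KeyError for symbols outside the table
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

motif = iupac_to_regex("GAYW")  # hypothetical motif
# re.finditer reports non-overlapping matches only
hits = [m.start() for m in re.finditer(motif, "TTGATAGACTGAGA")]
print(motif, hits)
```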

In [4]:
import numpy as np

ALPHABET_MAP = {'A': 0, 'G': 1, 'C': 2, 'T': 3}
DNA_BASES = ['A', 'G', 'C', 'T']
    
AMBIGUITY_CODES = {
    'K': ['G', 'T'],
    'M': ['A', 'C'],
    'R': ['A', 'G'],
    'Y': ['C', 'T'],
    'S': ['C', 'G'],
    'W': ['A', 'T'],
    'B': ['C', 'G', 'T'],
    'V': ['A', 'C', 'G'],
    'H': ['A', 'C', 'T'],
    'D': ['A', 'G', 'T'],
    'X': ['A', 'C', 'G', 'T'],
    'N': ['A', 'C', 'G', 'T']
}

all_characters  = list(DNA_BASES) + sorted(AMBIGUITY_CODES.keys())

def remove_ambiguity(read):
    """Replace each ambiguity code in `read` with a randomly chosen compatible base."""
    return "".join(n if n in DNA_BASES else np.random.choice(AMBIGUITY_CODES[n])
                   for n in read)

def base_to_distribution(char):
    """Map char to a uniform probability distribution over the DNA bases it can
    represent, ordered like DNA_BASES (A, G, C, T).
    """
    if char not in all_characters:
        raise ValueError(
            'Base distribution requested for unrecognized char %s.' % char)
    possible_bases = AMBIGUITY_CODES[char] if char in AMBIGUITY_CODES else [char]
    base_indices = [DNA_BASES.index(base) for base in possible_bases]
    probability_w = 1.0 / len(possible_bases)
    distribution = np.zeros(len(DNA_BASES))
    distribution[base_indices] = probability_w

    return distribution

In [6]:
remove_ambiguity("ACCTVTW")

'ACCTCTT'

In [7]:
base_to_distribution('W')

array([0.5, 0. , 0. , 0.5])

In [8]:
from Bio.Seq import Seq
from Bio.Data import IUPACData
from itertools import product

def extend_ambiguous_dna(seq):
    """Return a list of all possible sequences given an ambiguous DNA input."""
    d = IUPACData.ambiguous_dna_values
    # One choice per position; product enumerates every combination
    return list(map("".join, product(*map(d.get, seq))))

In [14]:
extend_ambiguous_dna("AV")

['AA', 'AC', 'AG']

In [9]:
from Bio import Align

aligner = Align.PairwiseAligner()
alignments = aligner.align(Seq("TACCG"), Seq("AVG"))
for alignment in sorted(alignments):
    print("Score = %.1f:" % alignment.score)
    print(alignment)

Score = 2.0:
TA-CCG
-|---|
-AV--G

Score = 2.0:
TAC-CG
-|---|
-A-V-G

Score = 2.0:
TACCG
-|-.|
-A-VG

Score = 2.0:
TACC-G
-|---|
-A--VG

Score = 2.0:
TACCG
-|.-|
-AV-G



### FASTQ
![fastq](figures/fastqPic.png)
__Line 4__ Each character encodes a quality score for the corresponding base in the sequence line. Although other definitions of quality exist, the *de facto* standard in the field is the Phred quality score, which represents the probability that a base was called incorrectly. Formally, $Q_{\text{phred}} = -10\log_{10} e$, where $e$ is the probability that the base call is wrong. Because the score is on a negative log scale, a higher score means a lower probability of error; for example, $Q = 20$ corresponds to $e = 0.01$.
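
The relationship between $Q_{\text{phred}}$ and $e$ can be checked with a few lines of Python. This is a sketch only: the function names are made up for illustration, the quality string is a toy example, and the ASCII offset of 33 (the Sanger / Illumina 1.8+ convention) is an assumption about how the file was encoded.

```python
import math

def phred_to_error_prob(q):
    """Error probability e for a Phred score Q, from Q = -10 * log10(e)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(e):
    """Phred score Q for an error probability e."""
    return -10 * math.log10(e)

def decode_quality_string(quality, offset=33):
    """Decode a FASTQ quality line into Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality]

scores = decode_quality_string("FF:F,")  # hypothetical quality string
print(scores)
print([round(phred_to_error_prob(q), 5) for q in scores])
```

Under this encoding the characters `F`, `:` and `,` decode to 37, 25 and 11, so a score of 37 corresponds to an error probability of roughly 0.0002 and a score of 11 to roughly 0.08.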

In [14]:
with gzip.open("../data/b2b_dmso_s1.fastq.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fastq"):
        print(record.seq)

GGGTGAAGTCGAGTTTGTCGTGAGATTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGAGTAAAAACCGGTAATTATTTAGGGGTTTAGGGAAAGGATGAAAATGGGGGATATGTGTATTAAAATTAGGGTGTTTTATAGTGTGAATG
CTCAGGGCACCTTCCCAAACTTCAGGGAGCCGTATAATTCTTTACCCTCCCCGAAGAGAAGGGACACGGAACTGAGTTTCTTTTCCAAACTGATTCTGCTTTAGATGCGTTCATAGGAGTCAGTTTATGAAGGGCAAAGCATTCCCTTAC
TTCCTCTCATGACGAGACGCCCTTTAGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAATTTTTTTTTATAATATTTGTTTTTGGTTTTGGGTTAAAATGGGAAATAAAAAAAAAAAAAAAAAAAAAAAGAAATAT
TCGGGCAGTTCGTACAATCAAGGACTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTGATAAAAAAACTTTATTTACATTTGGTACAAGGCGTGTTTGGATATTTAATAAAACAGAAAATTGGGTGGTGGGAGAGGGTTATGCAAAAAACC
GTTCCGTAGCGTGAGTTGCTAGGGCTCATTTTTTTTTTTTTTTTTTTTTTTTTAGTAATTATTTTAATTATAACTGAAGAAAAAAAACAAAAGCAAATTATGTGTTTGCTTAAATGAGACCAAAAAGTTAAAATTAAAAAAAGATAATGC
GACTATGTCCCGTTGTCCCGCTGGTAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAACCAGGTGGGTTTTTTTTTAAATAATAAAAAAAAAAAATTTAAATAAAAAACCCGGGCAAAACAAAAAAATTTGAAAAACCAATTTGG
AGGTCATCAGTGCGCTAATAATACGAGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATAAAAAATGGGAAAAGTAACCCCCGCATAAATTGA

In [13]:
print(record.letter_annotations['phred_quality'])

[37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 25, 11, 25, 25, 11, 37, 11, 11, 11, 25, 11, 37, 37, 11, 25, 25, 37, 11, 11, 11, 11, 37, 37, 11, 11, 11, 37, 11, 37, 37, 11, 25, 11, 11, 11, 37, 25, 25, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 11, 25, 11, 37, 11, 37, 11, 37, 11, 11, 11, 25, 11, 11, 11, 11, 11, 11, 11, 37, 11, 37, 37, 37, 11, 25, 11, 11, 37, 11, 11, 11, 25, 11, 11]


## Genome Annotation

Genome annotation is the process of attaching biological information to sequences. It consists of three main steps:

1. Identifying portions of the genome that do not code for proteins
2. Identifying elements on the genome, a process called gene prediction, and
3. Attaching biological information to these elements.

### GFF format

The general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences. GFF lines have nine required fields that must be tab-separated.

In [9]:
import numpy as np
import pandas as pd

gencode_gff = pd.read_table("../data/gencode.v39.annotation.gff3.gz", comment="#",
                        sep = "\t", names = ['seqname', 'source', 'feature', 'start' , 'end', 'score', 'strand', 'frame', 'attribute'])
gencode_gff.head()

| |seqname|source|feature|start|end|score|strand|frame|attribute|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|0|chr1|HAVANA|gene|11869|14409|.|+|.|ID=ENSG00000223972.5;gene_id=ENSG00000223972.5...|
|1|chr1|HAVANA|transcript|11869|14409|.|+|.|ID=ENST00000456328.2;Parent=ENSG00000223972.5;...|
|2|chr1|HAVANA|exon|11869|12227|.|+|.|ID=exon:ENST00000456328.2:1;Parent=ENST0000045...|
|3|chr1|HAVANA|exon|12613|12721|.|+|.|ID=exon:ENST00000456328.2:2;Parent=ENST0000045...|
|4|chr1|HAVANA|exon|13221|14409|.|+|.|ID=exon:ENST00000456328.2:3;Parent=ENST0000045...|
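
The ninth field packs several `key=value` pairs, separated by semicolons, into one string. The sketch below splits such a string into a dictionary. The function name `parse_gff3_attributes` is hypothetical, and the example string uses only the part of the transcript row that is visible above; the real attribute column is longer.

```python
from urllib.parse import unquote

def parse_gff3_attributes(attribute):
    """Split a GFF3 attribute string (key=value pairs joined by ';') into a dict."""
    result = {}
    for field in attribute.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters, so decode both sides
        result[unquote(key)] = unquote(value)
    return result

attrs = parse_gff3_attributes("ID=ENST00000456328.2;Parent=ENSG00000223972.5")
print(attrs)
```

The `Parent` value is what links a transcript to its gene and an exon to its transcript. Applying the function to the whole column, for example with `gencode_gff["attribute"].apply(parse_gff3_attributes)`, should work but may be slow on a table with millions of rows.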


In [15]:
gencode_gff[(gencode_gff["feature"] == 'gene') & (gencode_gff["seqname"] == 'chrM')]

| |seqname|source|feature|start|end|score|strand|frame|attribute|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|3238708|chrM|ENSEMBL|gene|577|647|.|+|.|ID=ENSG00000210049.1;gene_id=ENSG00000210049.1...|
|3238711|chrM|ENSEMBL|gene|648|1601|.|+|.|ID=ENSG00000211459.2;gene_id=ENSG00000211459.2...|
|3238714|chrM|ENSEMBL|gene|1602|1670|.|+|.|ID=ENSG00000210077.1;gene_id=ENSG00000210077.1...|
|3238717|chrM|ENSEMBL|gene|1671|3229|.|+|.|ID=ENSG00000210082.2;gene_id=ENSG00000210082.2...|
|3238720|chrM|ENSEMBL|gene|3230|3304|.|+|.|ID=ENSG00000209082.1;gene_id=ENSG00000209082.1...|
|3238723|chrM|ENSEMBL|gene|3307|4262|.|+|.|ID=ENSG00000198888.2;gene_id=ENSG00000198888.2...|
|3238727|chrM|ENSEMBL|gene|4263|4331|.|+|.|ID=ENSG00000210100.1;gene_id=ENSG00000210100.1...|
|3238730|chrM|ENSEMBL|gene|4329|4400|.|-|.|ID=ENSG00000210107.1;gene_id=ENSG00000210107.1...|
|3238733|chrM|ENSEMBL|gene|4402|4469|.|+|.|ID=ENSG00000210112.1;gene_id=ENSG00000210112.1...|
|3238736|chrM|ENSEMBL|gene|4470|5511|.|+|.|ID=ENSG00000198763.3;gene_id=ENSG00000198763.3...|


### BED File
BED (Browser Extensible Data) format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields.

In [22]:
gencode_bed = pd.read_table("../data/bedfile.bed", comment="#", skiprows=1, 
                        sep = "\t", names = ['chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb'])
gencode_bed.head()

| |chrom|chromStart|chromEnd|name|score|strand|thickStart|thickEnd|itemRgb|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|0|chr7|127471196|127472363|Pos1|0|+|127471196|127472363|255,0,0|
|1|chr7|127472363|127473530|Pos2|0|+|127472363|127473530|255,0,0|
|2|chr7|127473530|127474697|Pos3|0|+|127473530|127474697|255,0,0|
|3|chr7|127474697|127475864|Pos4|0|+|127474697|127475864|255,0,0|
|4|chr7|127475864|127477031|Neg1|0|-|127475864|127477031|0,0,255|
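
When combining BED and GFF data, keep in mind that the two formats count positions differently: BED intervals are 0-based and half-open (`chromEnd` is not part of the feature), whereas GFF intervals are 1-based and include both ends. The helper names below are hypothetical; the sketch only illustrates the arithmetic, using the `Pos1` row from the BED table and the first gene row from the GFF table.

```python
def bed_length(chrom_start, chrom_end):
    """Length of a BED interval (0-based start, exclusive end)."""
    return chrom_end - chrom_start

def bed_to_gff_coords(chrom_start, chrom_end):
    """Convert a BED interval to 1-based, inclusive GFF coordinates."""
    return chrom_start + 1, chrom_end

def gff_to_bed_coords(start, end):
    """Convert 1-based, inclusive GFF coordinates to a BED interval."""
    return start - 1, end

pos1 = (127471196, 127472363)  # Pos1 from the BED table
print(bed_length(*pos1))
print(bed_to_gff_coords(*pos1))
print(gff_to_bed_coords(11869, 14409))  # first gene from the GFF table
```

Because the end is exclusive, adjacent BED features such as `Pos1` and `Pos2` share a boundary value (127472363) without overlapping.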


## Central Dogma of Molecular Biology
### Transcription

In [33]:
# Instantiate sequence object
coding_dna = Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
print(coding_dna)
reverse_dna = coding_dna.reverse_complement()
print(reverse_dna)

ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT


In [34]:
m_rna = coding_dna.transcribe()
m_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

In [35]:
m_rna.back_transcribe()

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
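
The three Biopython calls above can be reproduced with plain string operations, which makes explicit what each one does. Note that `transcribe()` treats its input as the coding strand, so it simply replaces T with U rather than taking a complement. The function names below mirror the Biopython methods but are a standalone sketch, assuming upper-case input limited to A, C, G, T (or U for RNA).

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe(coding_dna):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_dna.replace("T", "U")

def back_transcribe(rna):
    """Recover the coding strand from an mRNA sequence."""
    return rna.replace("U", "T")

coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(reverse_complement(coding))
print(transcribe(coding))
print(back_transcribe(transcribe(coding)) == coding)
```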

### Translation

In [36]:
m_rna.translate()

Seq('MAIVMGR*KGAR*')

In [37]:
coding_dna.translate()

Seq('MAIVMGR*KGAR*')

In [38]:
coding_dna.translate(table="Vertebrate Mitochondrial")

Seq('MAIVMGRWKGAR*')
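
The two results differ at the eighth codon, TGA, which is a stop codon in the standard genetic code but encodes tryptophan (W) in the vertebrate mitochondrial code. The following sketch rebuilds both codon tables by hand to show where that difference comes from. It is an illustration only, not how Biopython implements `translate`, and it handles DNA input restricted to A, C, G and T.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, codons enumerated in TCAG order
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code: the codons that differ from the standard code
MITO_TABLE = dict(STANDARD_TABLE, TGA="W", ATA="M", AGA="*", AGG="*")

def translate(dna, table=STANDARD_TABLE):
    """Translate DNA codon by codon; trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))

coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding))
print(translate(coding, MITO_TABLE))
```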