#### Over the last decade, genomics has become the backbone of __drug discovery, targeted therapeutics, disease diagnosis, and precision medicine__, improving the chances of successful clinical trials. For example, in 2021, over 33% of the new drugs approved by the FDA were personalized medicines, a trend that has been sustained for the past five years.

#### There has also been a drastic decrease in the cost and turnaround time of DNA sequencing. For instance, while sequencing the first human genome was reported to cost around $3 billion and took 13 years to complete, today you can get your genome sequenced in a day for less than $200.

### [Python Fundamentals for Data Science](https://www.freecodecamp.org/news/python-fundamentals-for-data-science/)

In [None]:
import Bio
print(Bio.__version__)

#### Genetics 101
- A cell is the fundamental structural and functional unit of life.
- DNA contains the instructions needed to carry out the different activities of the cell. DNA is the basis of genetic studies and consists of four building blocks called nucleotides, which store information about life:
    * adenine (A)
    * guanine (G)
    * cytosine (C)
    * thymine (T)

#### DNA has a double-helix structure in which two complementary polymer strands wind around each other. Across the two strands, A pairs with T and G pairs with C to form base pairs.
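
The pairing rule can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration: the names `PAIRING`, `complement`, and `reverse_complement` are chosen here for the example and do not come from any library, and the sketch assumes uppercase input containing only A, C, G, and T.

```python
# Watson-Crick pairing: A<->T and G<->C (illustrative helper, not a library API)
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Return the base-by-base complement of an uppercase DNA string."""
    return dna.translate(PAIRING)


def reverse_complement(dna: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


strand = "AGTAGGACAGAT"
print(complement(strand))          # TCATCCTGTCTA
print(reverse_complement(strand))  # ATCTGTCCTACT
```

Because the two strands of the helix run in opposite directions, the reverse complement is the sequence of the opposite strand read in its own 5' to 3' direction.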

#### A genome is the full DNA sequence of a cell and contains all of its hereditary information.

##### Genome size differs from species to species. For example, the human genome is made up of about 3 billion base pairs per haploid set, and human cells carry 46 chromosomes (23 pairs), whereas the bread wheat genome consists of 42 chromosomes and ~17 gigabases.

### A region of a genome that is transcribed into a functional RNA molecule, or is transcribed into an RNA that then encodes a functional protein, is called a gene.

### A gene is the fundamental unit of heredity of a living organism. By analogy, you can think of the four nucleotides (A, G, C, and T) as letters, genes as sentences built from those letters, and the genome as the whole book, consisting of tens of thousands of sentences.

#### Sanger sequencing, also known as the “chain termination method”, is the first-generation sequencing method and was developed by Frederick Sanger and his colleagues in 1977 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC431765/). It was used by the Human Genome Project to sequence the first draft of the human genome. The method relies on the natural process of DNA replication: while DNA polymerase copies the template in vitro, it randomly incorporates chain-terminating dideoxynucleotides (ddNTPs) in place of the normal deoxyribonucleoside triphosphates (dNTPs), which stops DNA synthesis and produces fragments of different lengths. The base at the end of each fragment is then read out from the dye molecule attached to the terminating nucleotide. Although NGS technologies have largely replaced it for high-throughput work, Sanger sequencing remains the gold standard for accuracy and is routinely used by research labs throughout the world for quick verification of short sequences generated by polymerase chain reaction (PCR) and other methods.

#### In contrast to Sanger sequencing, Illumina uses sequencing by synthesis (SBS) technology, which tracks labeled nucleotides as the DNA is copied in a massively parallel fashion, generating output ranging from 300 kilobases up to several terabases in a single run. Illumina reads are typically 50-300 bases long. Illumina dominates the NGS market.

#### Pacific Biosciences and Oxford Nanopore Technologies dominate the long-read (third-generation) sequencing sector with their systems, called single-molecule real-time (SMRT) sequencing and nanopore sequencing, respectively. Each of these technologies can rapidly generate very long reads, of 15,000 bases or more, from single molecules of DNA and RNA.

### The main aim of genomics data analysis is to derive a biological interpretation from large volumes of raw genomics data. It is similar to other kinds of data analysis, but it often requires domain-specific knowledge and tools.

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781804615447/files/image/B18958_02_001.jpg)

In [None]:
from Bio.Seq import Seq

In [None]:
my_seq = Seq('AGTAGGACAGAT')

In [None]:
my_seq.complement()

In [None]:
my_seq.reverse_complement()

### Each Seq object has two important attributes:

* Data – The actual sequence string ('ATCTGTCCTACT').
* Alphabet – The type of sequence, for example, DNA, RNA, or protein. By default, no sequence type is assigned (and in Biopython 1.78 and later, the Bio.Alphabet module was removed, so Seq objects no longer carry an alphabet at all). That means Biopython doesn’t necessarily know whether the input to the Seq object is a nucleotide sequence (A, G, C, or T) or a protein sequence consisting of the amino acids alanine (A), cysteine (C), glycine (G), and threonine (T). So, keep this in mind when you call methods such as complement and reverse_complement: they are only meaningful when the input is a nucleotide sequence.
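
The ambiguity described above can be made concrete with a small plain-Python check. This is only a sketch: `DNA_LETTERS`, `PROTEIN_LETTERS`, and `possible_types` are hypothetical names introduced for this illustration, and the check looks at letters alone, which is exactly why it cannot decide between DNA and protein for a string such as 'AGTAGGACAGAT'.

```python
DNA_LETTERS = set("ACGT")
# The 20 standard amino acids in one-letter code
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def possible_types(sequence: str) -> list[str]:
    """List every sequence type whose letters cover the input."""
    letters = set(sequence.upper())
    types = []
    if letters <= DNA_LETTERS:
        types.append("DNA")
    if letters <= PROTEIN_LETTERS:
        types.append("protein")
    return types


print(possible_types("AGTAGGACAGAT"))  # ['DNA', 'protein']
print(possible_types("MKVLAT"))        # ['protein']
```

Since the letters alone cannot settle the question, it is up to you to know what kind of sequence you are holding before calling nucleotide-specific methods.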

The Seq object supports two types of methods – general methods (find, count, and so on) and nucleotide methods (complement, reverse_complement, transcribe, back_transcribe, translate, and so on). We will use a few of these methods later in the chapter.
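
To make the two groups of methods less abstract, here is a plain-Python stand-in for what counting, finding, transcribing, and translating do to a sequence. It is a sketch under stated assumptions, not Biopython's implementation: the input is taken to be uppercase coding-strand DNA, the standard genetic code is used, and `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, `transcribe`, and `translate` are names defined here for illustration only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.replace("T", "U")


def translate(dna: str) -> str:
    """Translate the complete codons of a DNA string; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna = "AGTAGGACAGAT"
print(dna.count("AG"))  # 3 (general method: non-overlapping occurrences)
print(dna.find("GGA"))  # 4 (general method: index of the first match)
print(transcribe(dna))  # AGUAGGACAGAU
print(translate(dna))   # SRTD
```

The Seq methods of the same names cover more cases (for example, alternative codon tables), so treat this only as a picture of the underlying idea.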

### SeqRecord object

After the Seq object, the next most important object in Biopython is SeqRecord, or sequence record. It differs from the Seq object in that it holds a sequence (as a Seq object) together with additional information such as an identifier, name, and description. The Bio.SeqIO module reads sequence files into SeqRecord objects and writes them back out.

In [None]:
from Bio import SeqIO

In [None]:
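# SeqIO.parse yields one SeqRecord per entry in the FASTA file
with open('ls_orchid.fasta') as file_in:
    for record in SeqIO.parse(file_in, format='fasta'):
        print(record.id)  # identifier taken from the record's header line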
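
To see what such a parser has to do with a FASTA file, the following dependency-free sketch reads records from FASTA-formatted text. It is an illustration, not Biopython's parser: `SimpleRecord` and `read_fasta` are hypothetical names, the input text is made up, and only the basic layout (a `>` header line followed by sequence lines) is handled.

```python
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for a sequence record."""
    id: str           # first word of the header line
    description: str  # the full header line without '>'
    seq: str          # sequence lines joined together


def _build(header: str, chunks: List[str]) -> SimpleRecord:
    words = header.split()
    return SimpleRecord(words[0] if words else "", header, "".join(chunks))


def read_fasta(handle: TextIO) -> Iterator[SimpleRecord]:
    """Yield one SimpleRecord per '>' header found in the handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _build(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _build(header, chunks)


# Made-up two-record FASTA text used only for this illustration
example = io.StringIO(
    ">seq1 first toy record\nAGTAGG\nACAGAT\n>seq2 second toy record\nGGCC\n"
)
for record in read_fasta(example):
    print(record.id, len(record.seq))
```

The generator design means records are produced one at a time, so a large file never has to be held in memory all at once.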

In [None]:
with open(file='/home/dulunche/MLDL-Compunomics/biopython/Doc/examples/ls_orchid.gbk') as file_in:
    for record in SeqIO.parse(file_in, format='genbank'):
        print(record.annotations)

#### Examples of sequence analysis include inferring sequence composition, such as calculating GC content or the percentage of T and A. More complicated tasks, such as motif searching, are also part of sequence analysis. The results serve as features derived from sequences for training models, which we will see in the next chapter, and they have a direct impact on model predictions. So, it is important to understand how to extract these features from sequence data, and for that, we will use Biopython. For this example, we will use the genome of SARS-CoV-2, the causative agent of COVID-19, which has caused millions of deaths worldwide:

In [18]:
with open('covid19.fasta') as file_in:
    for record in SeqIO.parse(file_in, format='fasta'):
        print(f'sequence information {record}')
        print(f'sequence length {len(record)}')

sequence information ID: NC_045512.2
Name: NC_045512.2
Description: NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
Number of features: 0
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA')
sequence length 29903
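
The composition features mentioned earlier (GC content, % A, % T, and so on) reduce to simple counting. The sketch below uses plain Python on a short toy string rather than the SARS-CoV-2 genome; `base_percentages` and `gc_content` are hypothetical helper names written for this example, and no values for the real genome are implied.

```python
from collections import Counter
from typing import Dict


def base_percentages(sequence: str) -> Dict[str, float]:
    """Percentage of each symbol in the sequence (empty input gives {})."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {base: 100 * n / total for base, n in sorted(counts.items())}


def gc_content(sequence: str) -> float:
    """Percentage of G and C among all symbols; 0.0 for an empty sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100 * (sequence.count("G") + sequence.count("C")) / len(sequence)


toy = "AGTAGGACAGAT"  # toy sequence, not the SARS-CoV-2 genome
print(f"GC content: {gc_content(toy):.2f}%")  # GC content: 41.67%
print(base_percentages(toy))
```

With a record parsed as in the cell above, the same helpers could be applied to `str(record.seq)`.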
