# Biopython

Biopython is a set of Python tools for bioinformatics.

More information: http://biopython.org/wiki/Main_Page

Documentation: http://biopython.org/wiki/Documentation

Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html or http://biopython.org/wiki/Getting_Started

Biopython can be installed in two ways:

1.  `!sudo pip3 install biopython`

2.  `!conda install biopython`

The package can then be imported as follows:

In [1]:
import Bio

## Working with sequences

In [7]:
from Bio import Seq
# Create a Seq object
my_seq = Seq.Seq('CATGTAGACTAG')
print(my_seq)

CATGTAGACTAG


In [8]:
type(my_seq)

Bio.Seq.Seq

In [13]:
# Print some information about the sequence
print('Sequence {0} has {1} bases'.format(my_seq, len(my_seq)))
print('The reverse complement of {0} is {1}'.format(my_seq, my_seq.reverse_complement()))
print('Sequence {0} translates to the amino acid sequence: {1}'.format(my_seq, my_seq.translate()))

Sequence CATGTAGACTAG has 12 bases
The reverse complement of CATGTAGACTAG is CTAGTCTACATG
Sequence CATGTAGACTAG translates to the amino acid sequence: HVD*
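
To make the reverse complement above less of a black box, here is a minimal plain-Python sketch of the same operation. It is an illustration only, not Biopython's implementation, and the helper name `reverse_complement` and the `COMPLEMENT` table are assumptions introduced for this example. It handles only the four unambiguous DNA bases.

```python
# Plain-Python sketch of a reverse complement (illustrative, not Biopython's code).
# Translation table that pairs each base with its complement: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("CATGTAGACTAG"))  # CTAGTCTACATG
```

The result matches the value printed by `my_seq.reverse_complement()` in the cell above.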


## Translation tables

In [28]:
# Load the tables from the Data module
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

In [29]:
# Print the standard codon table
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [30]:
# Print the vertebrate mitochondrial codon table
print(mito_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

### Example: computing GC%

* Without `Biopython`

In [31]:
my_seq = 'GATCGATGGGCCTATATAGGATCGAAAATCGC'
100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)

46.875

* With `Biopython`

In [32]:
from Bio.Seq import Seq
# gc_fraction (Biopython 1.80+) returns a fraction between 0 and 1,
# so it is scaled to a percentage here
from Bio.SeqUtils import gc_fraction
my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')
100 * gc_fraction(my_seq)

46.875

### Example: transcription

In [38]:
# Transcribe DNA starting from the template strand
from Bio.Seq import Seq
template_dna = Seq("CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT")
print(template_dna)
messenger_rna_template = template_dna.reverse_complement().transcribe()
print(messenger_rna_template)

CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
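
The same result can be reached by base-pairing directly against the template strand. The following plain-Python sketch is an illustration under that view, not Biopython's implementation. The helper name `transcribe_from_template` and the `TEMPLATE_TO_RNA` table are assumptions made for this example.

```python
# Plain-Python sketch: build the mRNA by pairing RNA bases with the template
# strand (A->U, C->G, G->C, T->A) and then reversing, so the result reads 5'->3'.
TEMPLATE_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe_from_template(template_dna: str) -> str:
    """Return the mRNA (5'->3') produced from a template strand given 5'->3'."""
    paired = template_dna.upper().translate(TEMPLATE_TO_RNA)
    return paired[::-1]


template = "CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT"
print(transcribe_from_template(template))
# AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
```

This is equivalent to `reverse_complement().transcribe()`: taking the reverse complement recovers the coding strand, and transcription then replaces every T with U.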


### Example: translation

In [40]:
from Bio.Seq import Seq
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
# Translate the RNA using different translation tables
print(messenger_rna.translate())
print(messenger_rna.translate(table="Vertebrate Mitochondrial"))
print(messenger_rna.translate(table="Bacterial"))

MAIVMGR*KGAR*
MAIVMGRWKGAR*
MAIVMGR*KGAR*
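
The only difference between the outputs is the eighth codon, UGA, which is a stop codon in the standard table but codes for tryptophan (W) in the vertebrate mitochondrial table. The plain-Python sketch below reproduces this with a deliberately partial codon dictionary. The names `STANDARD_SUBSET`, `VERTEBRATE_MITO_SUBSET`, and `translate` are assumptions for this illustration, and only the codons that occur in the example mRNA are listed.

```python
# Partial codon tables (RNA alphabet): only the codons present in the example
# mRNA are included; a complete table has 64 entries.
STANDARD_SUBSET = {
    "AUG": "M", "GCC": "A", "AUU": "I", "GUA": "V", "GGC": "G",
    "CGC": "R", "AAG": "K", "GGU": "G", "CGA": "R",
    "UGA": "*", "UAG": "*",
}
# Among these codons, the vertebrate mitochondrial code differs only for UGA.
VERTEBRATE_MITO_SUBSET = {**STANDARD_SUBSET, "UGA": "W"}


def translate(rna: str, table: dict) -> str:
    """Translate codon by codon; a trailing partial codon is ignored.

    Raises KeyError for any codon missing from the (partial) table.
    """
    usable_length = len(rna) - len(rna) % 3
    codons = (rna[i:i + 3] for i in range(0, usable_length, 3))
    return "".join(table[codon] for codon in codons)


mrna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
print(translate(mrna, STANDARD_SUBSET))         # MAIVMGR*KGAR*
print(translate(mrna, VERTEBRATE_MITO_SUBSET))  # MAIVMGRWKGAR*
```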


## Working with FASTA files

Analysis of the cytochrome c oxidase subunit III gene.
[NCBI Reference Sequence](https://www.ncbi.nlm.nih.gov/nuccore/NC_015812.1?report=fasta&log$=seqview&from=9501&to=10284).

![](https://s-media-cache-ak0.pinimg.com/236x/dd/37/6c/dd376cbe805b9d7ed6b44c5b99b9353c.jpg)

In [16]:
from Bio import SeqIO
# For each sequence, print: 1. its ID, 2. its length
for seq_record in SeqIO.parse("python.fasta", "fasta"):
    print(seq_record.id)
    print(len(seq_record))

lcl|NC_015812.1_cds_YP_004733491.1_1
784
lcl|NC_015812.1_cds_YP_004733495.1_1
1794
lcl|NC_015812.1_cds_YP_004733496.1_2
513
lcl|NC_015812.1_cds_YP_004733485.1_1
964
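
For readers who want to see what a FASTA parser does, here is a minimal plain-Python sketch. It is not how `SeqIO.parse` is implemented; the function name `read_fasta` and the two toy records are hypothetical and chosen only for illustration. It treats every line starting with `>` as a header and joins the following lines into one sequence.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


# Hypothetical in-memory input standing in for a file such as python.fasta
example = io.StringIO(
    ">seq1 first toy record\nATGC\nGGTA\n>seq2 second toy record\nTTAACC\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    record_id = header.split()[0]  # first word of the header serves as the ID
    print(record_id, len(sequence))
```

For the toy input this prints `seq1 8` and `seq2 6`, mirroring the ID and length pairs printed above.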


In [18]:
# Read the sequences into a list
with open("python.fasta", "r") as handle:
    records = list(SeqIO.parse(handle, "fasta"))
print(type(records))
print(type(records[0]))

<class 'list'>
<class 'Bio.SeqRecord.SeqRecord'>


In [20]:
print(records[0].id)  # ID of the first sequence
print(records[-1].id) # ID of the last sequence

lcl|NC_015812.1_cds_YP_004733491.1_1
lcl|NC_015812.1_cds_YP_004733485.1_1


## Working with GenBank files

In [21]:
from Bio import SeqIO
with open("python.gbk", "r") as input_handle:
    for record in SeqIO.parse(input_handle, "genbank"):
        print(record)

ID: NC_015812.1
Name: NC_015812
Description: Python molurus molurus mitochondrion, complete genome.
Database cross-references: Project:70065, BioProject:PRJNA70065
Number of features: 3
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Lepidosauria', 'Squamata', 'Bifurcata', 'Unidentata', 'Episquamata', 'Toxicofera', 'Serpentes', 'Henophidia', 'Pythonidae', 'Python']
/keywords=['RefSeq']
/accessions=['NC_015812', 'REGION:', '2580..3543']
/references=[Reference(title='Complete mitochondrial genome of Python molurus', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/sequence_version=1
/topology=linear
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence is identical to HM581978.
COMPLETENESS: full length.
/date=21-JUL-2011
/data_file_division=VRT
/organism=Python molurus molurus
/source=mitochondrion Python molurus molurus (Indian python)
Seq('ATAATAACCAACATCCTACTCA

In [22]:
# Convert the GenBank file to FASTA
count = SeqIO.convert("python.gbk", "genbank", "python_converted.fasta", "fasta")
print("Converted {0} records".format(count))

Converted 3 records
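
The output side of such a conversion can be sketched in plain Python as well. The helper below is an assumption-laden illustration, not Biopython's writer: the name `write_fasta`, the `(header, sequence)` input shape, and the default wrap width of 60 characters are choices made for this example. Like `SeqIO.convert`, it returns the number of records written.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA and return the record count."""
    count = 0
    for header, sequence in records:
        handle.write(f">{header}\n")
        # Wrap the sequence into fixed-width lines
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count


# Hypothetical toy records written to an in-memory buffer instead of a file
buffer = io.StringIO()
written = write_fasta([("toy1", "A" * 70), ("toy2", "ACGT")], buffer)
print("Converted {0} records".format(written))
print(buffer.getvalue(), end="")
```

Here the 70-base toy sequence is split into a 60-base line followed by a 10-base line.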


In [25]:
with open("python.gbk", "r") as input_handle:
    for record in SeqIO.parse(input_handle, "genbank"):
        print(record.id)
        print(record.description)
        print(record.seq)

NC_015812.1
Python molurus molurus mitochondrion, complete genome.
ATAATAACCAACATCCTACTCAACATTATCAACCCACTACTCTACATCATCTCCATCCTAATCGCAGTAGCCTTCTTAACCTTCCTGGAACGAAAACTACTAGGGTATATACAACTACGAAAAGGCCCAAACCTAGTAGGCCCAATGGGCCTATTACAACCAATCGCAGACGGCATTAAACTCATCCTAAAAGAACCAACAAAACCCACACTCTCATCCCCAATCCTATTTACCCTATCCCCAATTCTAGCACTCACCTTAGCATTAACCACCTGAGCCCCAATACCCATACCATTCCCACTAACTAACATGAACCTAGCCTTACTATTCATCATAGCCATATCAGGTATATTCACATACACAATCCTCTGATCAGGCTGATCATCAAACTCAAAATATCCACTAATAGGCGCCATACGAGCCGTCGCACAAATCATTTCATACGAAGTCACACTAGGCCTAATCATTATATCTATAGCCACAATCACAGGAGGTTACTCACTACAATCTTTCACAACAACACAAGAACCCATATGACTTCTACTCCCATCATGACCATTAGCCATGATGTGATTCACATCCACCCTAGCAGAAACTAACCGATCACCATTCGATCTGACAGAAGGAGAATCCGAACTAGTCTCCGGATTCAATGTAGAATTCTCAGCTGGCCCATTTGCCCTACTCTTCCTAGCAGAATACACAAACATCCTTATAATAAACACCCTCTCCACCATAATATTCCTAAACCCAGGAACTCAAAACTCACCATTATTCACAATTAACCTTATAACAAAATCCATCCTCCTTACAACTATCTTCCTATGAGTCCGAGCATCCTACCCACGATTCCGCTATGACCAACTAATACACCTACTATGGAAACAATATTTACCCCTAACCCTAGGCATATGCATATTAAATATCTCAACC