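### Sequence manipulation with Biopython
* Uses the Mus musculus TNFA gene.
* Gene data source: https://www.ncbi.nlm.nih.gov/nuccore/D84199.2
* Covers classifying the sequence type, computing the complement and reverse complement, calculating GC content, and walking through transcription and translation.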

---

* Specifying the sequence format and classifying its type

In [1]:
from Bio.Seq import Seq  # import the Seq class

In [2]:
# Define a new sequence
tnfa = '''ACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGG
GCTTCCAGAACTCCAGGCGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCAC
GCTCTTCTGTCTACTGAACTTCGGGGTGATCGGTCCCCAAAGGGATGAGGTGAGTNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNCTCCTTCTTTTCCTACACAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTAT
GGCCCAGACCCTCACACTCAGTAAGTGTTCCCACACCTCTCTCTTAATTTAAGATGGAGGAAGGGCAGTT
AGGCATGGGATGAGATGGGGTGGGGGGAGAACTTAAAGCTTTGGTTTGGGAGGAAAGGGGTCTAAGTGCA
TAGATGCTTGCTGGGAAGCCTAAAAGGCTCATCCTTGCCTTTGTCTCTTCCCCTCCAGGATCATCTTCTC
AAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGGTAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCCCCCCCCTCAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTGAGCCAGCGCGCCAACGCCCTC
CTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTACCTTGTCTACT
CCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTGC
TATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAG
GGGGCTGAGCTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACC
AACTCAGCGCTGAGGTCAATCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGT
CATTGCTCTGTGAAGA'''

# Wrap the string in a Seq object. Most string methods still work on it
tnfa = Seq(''.join(tnfa.split('\n')))
tnfa

# A Seq can also be converted back to a plain string
# str(tnfa)

Seq('ACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTC...AGA', Alphabet())

In [3]:
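# Specify an IUPAC alphabet to mark the sequence type (DNA/RNA/protein)
# Operations are only allowed between sequences of the same type (DNA-DNA, RNA-RNA, protein-protein)
# The N placeholders are stripped below so that the sequence fits the unambiguous DNA alphabet (A, C, G, T)
# Note: Bio.Alphabet was removed in Biopython 1.78, so this cell needs an older release
from Bio.Alphabet import IUPAC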

tnfa = '''ACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGG
GCTTCCAGAACTCCAGGCGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCAC
GCTCTTCTGTCTACTGAACTTCGGGGTGATCGGTCCCCAAAGGGATGAGGTGAGTNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNCTCCTTCTTTTCCTACACAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTAT
GGCCCAGACCCTCACACTCAGTAAGTGTTCCCACACCTCTCTCTTAATTTAAGATGGAGGAAGGGCAGTT
AGGCATGGGATGAGATGGGGTGGGGGGAGAACTTAAAGCTTTGGTTTGGGAGGAAAGGGGTCTAAGTGCA
TAGATGCTTGCTGGGAAGCCTAAAAGGCTCATCCTTGCCTTTGTCTCTTCCCCTCCAGGATCATCTTCTC
AAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGGTAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCCCCCCCCTCAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTGAGCCAGCGCGCCAACGCCCTC
CTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTACCTTGTCTACT
CCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTGC
TATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAG
GGGGCTGAGCTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACC
AACTCAGCGCTGAGGTCAATCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGT
CATTGCTCTGTGAAGA'''.replace("N","")

tnfa = Seq(''.join(tnfa.split('\n')),IUPAC.unambiguous_dna)
tnfa

Seq('ACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTC...AGA', IUPACUnambiguousDNA())
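
The cleanup done in cells 2 and 3 (joining the wrapped lines, dropping the `N` placeholders, and making sure only A, C, G, T remain) can be sketched in plain Python without Biopython. The helper names `clean_sequence` and `is_unambiguous_dna` are hypothetical and only illustrate the idea; they are not part of Biopython.

```python
def clean_sequence(raw, drop="N"):
    """Join a multi-line sequence into one string and remove placeholder characters.

    `drop` lists the characters to strip (assumption: gaps are written as N).
    """
    joined = "".join(raw.split())  # split() removes newlines and any stray spaces
    return "".join(base for base in joined.upper() if base not in drop)


def is_unambiguous_dna(seq):
    """Return True when the sequence contains only A, C, G and T."""
    return set(seq) <= set("ACGT")


raw = """ACCATGNNNN
NNNNAGCACA"""
cleaned = clean_sequence(raw)
print(cleaned, is_unambiguous_dna(cleaned))
```

Checking the alphabet explicitly is useful because the sequence is only labelled as unambiguous DNA when the `Seq` object is created; the label by itself does not tell you whether stray characters are still present.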

---
#### Computing the complement and reverse complement

In [4]:
# Complement of the sequence
tnfa.complement()

Seq('TGGTACTCGTGTCTTTCGTACTAGGCGCTGCACCTTGACCGTCTTCTCCGTGAG...TCT', IUPACUnambiguousDNA())

In [5]:
# Reverse complement of the sequence
# Gives the same result as tnfa.complement()[::-1]
tnfa.reverse_complement()

Seq('TCTTCACAGAGCAATGACTCCAAAGTAGACCTGCCCGGACTCCGCAAAGTCTAA...GGT', IUPACUnambiguousDNA())
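
To make the relationship between the two outputs above concrete, here is a minimal plain-Python sketch of both operations. It assumes an uppercase sequence made only of A, C, G and T, and the function names are illustrative rather than Biopython's own implementation.

```python
# Translation table mapping each base to its Watson-Crick partner
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Replace every base with its pairing partner, keeping the order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence and then read it in the opposite direction."""
    return complement(seq)[::-1]


head = "ACCATG"  # first six bases of the TNFA sequence used above
print(complement(head))
print(reverse_complement(head))
```

The first six bases of the complement and the last three bases of the reverse complement shown in the notebook outputs can be checked by hand against this sketch.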

---
#### Calculating GC content (the percentage of G+C among all bases)
* print(float(tnfa.count("G")+tnfa.count("C"))/len(tnfa)*100)
* Writing the calculation out like this is verbose, so the GC function from Bio.SeqUtils is used instead.

In [6]:
from Bio.SeqUtils import GC
GC(tnfa)

55.34188034188034
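
The manual formula from the bullet above can be wrapped in a small function. This is a plain-Python sketch with a hypothetical name, `gc_percent`, and it only counts G and C (ambiguity codes such as S are ignored, which is an assumption of this sketch). Recent Biopython releases provide `Bio.SeqUtils.gc_fraction`, which returns a fraction rather than a percentage, in place of `GC`.

```python
def gc_percent(seq):
    """Return the percentage of G and C bases in the sequence (0.0 if empty)."""
    seq = str(seq).upper()
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq) * 100


print(gc_percent("ACCATG"))
```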

---
#### Transcription and translation
* template = reverse complement
* mRNA
* translate

In [7]:
tnfa_template = tnfa.reverse_complement()  # call the method; without () only the bound method is stored
tnfa_template

Seq('TCTTCACAGAGCAATGACTCCAAAGTAGACCTGCCCGGACTCCGCAAAGTCTAA...GGT', IUPACUnambiguousDNA())

In [8]:
tnfa_mRNA = tnfa.transcribe()
tnfa_mRNA

Seq('ACCAUGAGCACAGAAAGCAUGAUCCGCGACGUGGAACUGGCAGAAGAGGCACUC...AGA', IUPACUnambiguousRNA())

In [9]:
tnfa_protein = tnfa_mRNA.translate()
tnfa_protein

Seq('TMSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNF...L*R', HasStopCodon(IUPACProtein(), '*'))
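
The two steps above can be reproduced in plain Python to show what `transcribe()` and `translate()` do conceptually. This is a sketch under stated assumptions: the input is the coding strand, only the standard genetic code is used, stop codons are written as `*`, and any trailing partial codon is ignored. The names `CODON_TABLE`, `transcribe` and `translate` here are local to the sketch, not Biopython's API.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0; '*' marks a stop codon."""
    dna = mrna.upper().replace("U", "T")  # the lookup table is keyed by DNA codons
    usable = len(dna) - len(dna) % 3      # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


head = "ACCATGAGCACAGAAAGCATGATCCGC"  # first 27 bases of the TNFA sequence
mrna = transcribe(head)
print(mrna)
print(translate(mrna))
```

Like the notebook cell, this sketch reads straight through stop codons instead of halting at the first one, which is why `*` can appear in the middle of the protein string.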

---

In [10]:
# For species that use a different codon table, the table has to be changed accordingly
from Bio.Data import CodonTable

standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mycoplasma_table = CodonTable.unambiguous_dna_by_name["Mycoplasma"]

In [11]:
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [12]:
print(mycoplasma_table)

Table 4 Mold Mitochondrial, Protozoan Mitochondrial, Coelenterate Mitochondrial, Mycoplasma, Spiroplasma, SGC3

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L(s)| TCA S   | TAA Stop| TGA W   | A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA I(s)| ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
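G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--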

In [13]:
# The start codons of a table can also be looked up
mycoplasma_table.start_codons

['TTA', 'TTG', 'CTG', 'ATT', 'ATC', 'ATA', 'ATG', 'GTG']
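
The practical effect of switching tables can be shown with a small plain-Python sketch. Comparing the two printed tables, the amino-acid assignments differ in one place: TGA is a stop codon in the standard table but codes for W in the Mycoplasma table. The sketch below builds the standard table, overrides that single codon, and translates the same short sequence with both. The names are illustrative, and the test sequence is made up for the example.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Mycoplasma (table 4) differs from the standard table in one assignment: TGA -> W
MYCOPLASMA = dict(STANDARD)
MYCOPLASMA["TGA"] = "W"

# Start codons exactly as reported by mycoplasma_table.start_codons above
MYCOPLASMA_STARTS = ["TTA", "TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"]


def translate(dna, table):
    """Translate a DNA coding sequence with the given codon table ('*' = stop)."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


example = "ATGTGATAA"  # made-up sequence: ATG, TGA, TAA
print(translate(example, STANDARD))
print(translate(example, MYCOPLASMA))
```

With Biopython itself, the equivalent choice is normally made by passing a table name or NCBI table id to `translate`, for example `tnfa_mRNA.translate(table="Mycoplasma")`.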