## 1. Download files

We first download files containing protein sequences, which serve as the initial data for building a profile HMM. Later, we use this profile HMM as the basis for a multiple sequence alignment (MSA) that includes new protein sequences.

Using the NCBI database, we download the protein sequences of three hepatitis C virus variants.

In [1]:
import os
from Bio import SeqIO
from Bio import Entrez

In [2]:
sequences = []
Entrez.email = "desthalia@mail.ugm.ac.id"
accessions = ["NP_671491.1", "BAB32872.1", "AEB71618.2"]

# Download FASTA files to the local folder, skipping any that already exist
for x in accessions:
    filename = x + ".fasta"
    if not os.path.isfile(filename):
        # These are protein accessions, so query the protein database
        handle = Entrez.efetch(db="protein", id=x, rettype="fasta", retmode="text")
        with open(filename, "w") as out_handle:
            out_handle.write(handle.read())
        handle.close()
        print("Saved", filename)

    # Each downloaded file holds exactly one record
    record = SeqIO.read(filename, "fasta")
    sequences.append(record)

In [3]:
#Combine all the downloaded files into one file
SeqIO.write(sequences, "Hepatitis_variants.fasta", "fasta")

3
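The value returned by `SeqIO.write` is the number of records written. As a quick sanity check, the record count and sequence lengths of a FASTA file can also be inspected with the standard library alone. The following is a minimal sketch; the helper name `fasta_lengths` is hypothetical, and the demo text is made-up toy data, not the real hepatitis sequences.

```python
def fasta_lengths(text):
    """Return {record_id: sequence_length} for FASTA-formatted text."""
    lengths = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may be wrapped, so accumulate their lengths
            lengths[current] += len(line)
    return lengths


# Made-up toy data standing in for a real FASTA file
demo = ">seq1 toy protein\nMSTNPKPQ\nRKTK\n>seq2 another toy\nMSTN\n"
demo_lengths = fasta_lengths(demo)
print(len(demo_lengths), "records:", demo_lengths)
```

Applied to the combined file, for example `fasta_lengths(open("Hepatitis_variants.fasta").read())`, the result should contain three entries, one per accession.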

## 2. Profile HMM

Technically, a profile HMM can be inferred from unaligned sequences. For convenience, however, we use Clustal to build an alignment first and then 'convert' it into a profile HMM.
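To give a feel for what 'converting' an alignment into a profile means, here is a minimal sketch under simplifying assumptions: columns in which fewer than half of the sequences have a gap are treated as match states, and the emission probabilities of each match state are estimated from residue counts with an add-one pseudocount. This is only an illustration, not HMMER's actual procedure (hmmbuild also applies sequence weighting, more elaborate priors, and estimates transition probabilities). The function names and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_columns(alignment, max_gap_fraction=0.5):
    """Return indices of columns whose gap fraction is below the threshold."""
    n_seqs = len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[j] == "-")
        if gaps / n_seqs < max_gap_fraction:
            cols.append(j)
    return cols


def emission_probabilities(alignment, column, pseudocount=1.0):
    """Estimate residue probabilities for one column using pseudocounts."""
    # Count only the 20 standard amino acids; gaps and other symbols are ignored
    counts = Counter(seq[column] for seq in alignment if seq[column] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Made-up toy alignment: the third column is mostly gaps
toy_alignment = [
    "MK-LV",
    "MR-LV",
    "MKALI",
]

cols = match_columns(toy_alignment)
first_state = emission_probabilities(toy_alignment, cols[0])
print("match columns:", cols)
print("P(M) in first match state:", round(first_state["M"], 3))
```

In this toy case the mostly-gapped third column would be modelled as an insertion rather than a match state, which is the basic idea behind a profile's match/insert/delete structure.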

We use the HMMER program (http://hmmer.org/) to build the profile HMM and then to perform the MSA with it. Install it with Anaconda: conda install -c biocore hmmer

In [4]:
from Bio.Align.Applications import ClustalwCommandline
import subprocess

In [6]:
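# Use Clustal to build the initial MSA
# Note: ClustalwCommandline generates ClustalW-style arguments, so this path
# must point to an executable that accepts them
clustalw = r"/home/desthalia/clustalo/clustalo"
# Write the alignment to clustal.aln, the file that hmmbuild reads in the next cell
cline = ClustalwCommandline(clustalw, infile="Hepatitis_variants.fasta", outfile="clustal.aln")
stdout, stderr = cline()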

In [7]:
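# Use HMMER's hmmbuild to convert the alignment into a profile HMM
# check=True raises CalledProcessError if hmmbuild exits with an error
profile = subprocess.run(['hmmbuild', 'Hepatitis_variants.hmm', 'clustal.aln'], check=True)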
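Because the HMMER tools are called as external programs, a missing installation only shows up as an error when the cell runs. A small pre-flight check can make this more explicit. This is a minimal sketch using the standard library; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the subset of program names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# The two HMMER programs used in this notebook
missing = missing_tools(["hmmbuild", "hmmalign"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All HMMER tools found")
```

If anything is reported as missing, activating the conda environment in which HMMER was installed before starting Jupyter is usually the first thing to try.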

We now use new sequences, in this case from <i>Escherichia coli</i>, to perform another MSA against the profile. The results are printed to the Linux shell.

In [None]:
hmm = subprocess.run(['hmmalign', 'Hepatitis_variants.hmm', 'GCF_000005845.2_ASM584v2_protein.faa'])
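If the alignment should be kept rather than only printed, hmmalign can write it to a file through its `-o` option (the default output format is Stockholm). The sketch below only builds the command; the helper names and the output filename `ecoli_vs_hepatitis.sto` are hypothetical, and actually running it assumes HMMER is installed and both input files exist.

```python
import subprocess


def hmmalign_command(hmm_file, seq_file, out_file):
    """Build the argument list for: hmmalign -o <out_file> <hmm_file> <seq_file>."""
    return ["hmmalign", "-o", out_file, hmm_file, seq_file]


def run_hmmalign(hmm_file, seq_file, out_file):
    """Run hmmalign and raise CalledProcessError if it fails."""
    return subprocess.run(hmmalign_command(hmm_file, seq_file, out_file), check=True)


cmd = hmmalign_command(
    "Hepatitis_variants.hmm",
    "GCF_000005845.2_ASM584v2_protein.faa",
    "ecoli_vs_hepatitis.sto",  # hypothetical output filename
)
print(" ".join(cmd))
```

Calling `run_hmmalign(...)` with the same three arguments would execute the command and leave the alignment in the output file for later parsing.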