## [Example: Constant + MuE (Profile HMM)](http://pyro.ai/examples/mue_profile.html#example-constant-mue-profile-hmm)

This notebook fits a standard profile HMM [1], which corresponds to a constant (delta function) distribution with a MuE observation [2].

The profile HMM is a standard generative model of variable-length biological sequences (e.g. proteins). It does not require preprocessing the data by building a multiple sequence alignment (MSA). It can be compared with a more complex MuE model in the same package, the FactorMuE.

### References:
[1] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison (1998). "Biological sequence analysis: probabilistic models of proteins and nucleic acids". Cambridge University Press.

[2] E. N. Weinstein and D. S. Marks (2021). "A structured observation distribution for generative biological sequence prediction and forecasting". https://www.biorxiv.org/content/10.1101/2020.07.31.231381v2.full.pdf

In [None]:
%%sh
curl -O https://raw.githubusercontent.com/debbiemarkslab/MuE/master/models/examples/ve6_full.fasta

[Data](https://github.com/debbiemarkslab/MuE/blob/master/models/examples/ve6_full.fasta)

In [1]:
import json

In [3]:
import numpy as np
import torch
from torch.optim import Adam

import pyro
from mue.dataloaders import BiosequenceDataset
from mue.models import ProfileHMM
from pyro.optim import MultiStepLR



[MuE](https://github.com/pyro-ppl/pyro/tree/dev/pyro/contrib/mue)

In [4]:
file = './ve6_full.fasta'

In [5]:
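seqs = []
seq = ""
with open(file, 'r') as fr:
    for line in fr:
        if line[0] == '>':
            # A header line closes the previous record, if there is one.
            if seq != "":
                seqs.append(seq + "*")  # "*" marks the end of the sequence
                seq = ""
        else:
            seq += line.strip()
# The last record has no following header, so append it here.
if seq != "":
    seqs.append(seq + "*")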
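The parsing loop above can also be packaged as a small reusable function. The following is a minimal sketch, not part of the MuE package: the name `read_fasta` and its `stop_symbol` parameter are hypothetical, and the two-record input is made up so the example runs without the downloaded file.

```python
from io import StringIO


def read_fasta(handle, stop_symbol="*"):
    """Return the sequences in a FASTA stream, each with a stop symbol appended."""
    seqs, chunks = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header closes the previous record, if any.
            if chunks:
                seqs.append("".join(chunks) + stop_symbol)
                chunks = []
        else:
            chunks.append(line)
    # The final record is not followed by a header, so flush it here.
    if chunks:
        seqs.append("".join(chunks) + stop_symbol)
    return seqs


# Made-up two-record input, used only for illustration.
example = StringIO(">a\nMKV\nLLT\n>b\nMRR\n")
parsed = read_fasta(example)
```

Collecting line fragments in a list and joining them once per record avoids repeated string concatenation on long sequences.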
 

In [6]:
dataset = BiosequenceDataset(
    file,
    'fasta',
    alphabet='amino-acid',
    include_stop=False,
    device='cpu',
)

In [7]:
batch_size = 2
split = 0.2

In [8]:
holdout_num = int(np.ceil(split*len(dataset)))

In [9]:
data_lengths = [len(dataset)-holdout_num, holdout_num]

In [10]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [11]:
indices = torch.randperm(sum(data_lengths), device=device).tolist()

In [12]:
from itertools import accumulate  # public stand-in for the private torch._utils._accumulate
for t in accumulate(data_lengths):
    print(t)

1287
1609


In [13]:
from itertools import accumulate

# Slice the shuffled indices into a training part and a holdout part.
dataset_train, dataset_test = [
    torch.utils.data.Subset(dataset, indices[(offset - length):offset])
    for offset, length in zip(accumulate(data_lengths), data_lengths)
]
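The same kind of shuffled train/holdout split can be obtained with PyTorch's public `torch.utils.data.random_split`. The sketch below is an alternative, not the notebook's own method; it uses a hypothetical ten-element `TensorDataset` as a stand-in for the `BiosequenceDataset` so that it runs on its own.

```python
import math

import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in dataset with ten items.
toy = TensorDataset(torch.arange(10))

split = 0.2
holdout_num = math.ceil(split * len(toy))

# A seeded generator makes the shuffle reproducible.
generator = torch.Generator().manual_seed(0)
toy_train, toy_test = random_split(
    toy, [len(toy) - holdout_num, holdout_num], generator=generator
)
```

`random_split` returns `Subset` objects, the same type the list comprehension above builds by hand.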

In [14]:
latent_seq_length = int(dataset.max_length * 1.1)
latent_seq_length

173

In [15]:
model = ProfileHMM(
    latent_seq_length=latent_seq_length,
    alphabet_length=dataset.alphabet_length,
    prior_scale=1.0,
    cuda=False,
    indel_prior_bias=10.0,
    pin_memory=False,
)

In [16]:
scheduler = MultiStepLR(
    {
        'optimizer': Adam,
        'optim_args': {'lr': 0.001},
        'milestones': [],  # no milestones, so the learning rate stays at 0.001
        'gamma': 0.5,
    }
)
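With an empty milestone list the scheduler never changes the learning rate. To see what non-empty milestones would do, here is a minimal sketch that uses PyTorch's own `torch.optim.lr_scheduler.MultiStepLR` (the scheduler that the Pyro class wraps) on a single hypothetical parameter. The milestones `[2, 4]` are illustrative and are not values from this notebook.

```python
import torch
from torch.optim import Adam
from torch.optim import lr_scheduler

# Hypothetical single parameter, used only to drive the optimizer.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = Adam([param], lr=0.001)

# Multiply the learning rate by gamma each time a milestone epoch is reached.
torch_scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[2, 4], gamma=0.5)

lrs = []
for epoch in range(5):
    lrs.append(torch_scheduler.get_last_lr()[0])  # rate used during this epoch
    optimizer.zero_grad()
    loss = ((param - 1.0) ** 2).sum()
    loss.backward()
    optimizer.step()
    torch_scheduler.step()
```

In this sketch `lrs` records one learning rate per epoch, and the rate is halved once epoch 2 is reached and again once epoch 4 is reached.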

In [17]:
n_epochs = 10

In [18]:
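# Use double precision for newly created floating-point tensors.
# This should run before BiosequenceDataset and ProfileHMM are constructed:
# tensors created earlier keep the dtype they were created with.
# The former follow-up call torch.set_default_tensor_type(torch.FloatTensor)
# is dropped because it resets the default back to float32.
torch.set_default_dtype(torch.float64)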

In [20]:
losses = model.fit_svi(
    dataset_train,  # fit on the training split so the holdout stays unseen
    n_epochs,
    batch_size,
    scheduler,
    jit=False,
)

RuntimeError: expected scalar type Double but found Float
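The traceback is truncated, so the exact origin of this error cannot be determined from the notebook alone. One plausible cause, given that the default dtype is changed after the dataset and model have been built, is that some tensors were created as float32 and others as float64. The following minimal sketch is independent of MuE and uses made-up tensors. It shows how two tensors created under different default dtypes fail to combine in a matrix product, and how casting one of them resolves it.

```python
import torch

torch.set_default_dtype(torch.float32)
early = torch.zeros(2, 3)  # created under the float32 default

torch.set_default_dtype(torch.float64)
late = torch.ones(3, 4)  # created under the float64 default

# Changing the default does not convert tensors that already exist.
try:
    early @ late
    mismatch = False
except RuntimeError as err:
    mismatch = True
    message = str(err)

# Casting to a common dtype makes the product well defined.
fixed = early.to(late.dtype) @ late

torch.set_default_dtype(torch.float32)  # restore the usual default
```

If this is indeed the cause, setting the default dtype once at the top of the notebook, before any dataset or model is created, keeps all tensors consistent.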