Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: how to extend a HMM to use String-like types as its state space? #12

Open
camilogarciabotero opened this issue Jun 14, 2023 · 22 comments · Fixed by #24
Open
Labels
enhancement New feature or request question Further information is requested

Comments

@camilogarciabotero
Copy link

Hey @gdalle

Thanks for working on this package. I'm very interested in applying this package to work with the BioSequences.jl package, which allows working with DNA sequences specialized types. I'm interested in the application of HMMs to the gene-finding problem. There are several examples, a simple one consists on classify a DNA sequence of an alphabet $S = {A, C, G, T}$ into coding or non-coding states. So, normally what one has is a string of characters (ACTATCTATCT...) whereby one wants to locate the coordinates of the coding and non-coding regions. Generally, the coding sequence (CDS) has an encoding characteristic such that it could each triplet (Codon) is translated into an amino acid.

Here is a simple representation of HMMs of the problem.
image

Where the sequence of characters corresponds to the emissions and the coding (C) and non-coding (N) hidden states are the ones we want to unveil.

My question now is, what would be the best way to implement or extend the HiddenMarkovModels.jl types and methods such that we can have BioSequences.jl types as state space?

I am imagining something like this:

using BioSequences, HiddenMarkovModels, Distributions

seq = randdnaseq(10^3)

1000nt DNA Sequence:
GCATCGTTGGGGGGCATGCCTTATTCGAGAGTCTTTGATCCCAAGACACGTAACCTATGCTATTACGCGCTGGGAAAT

Now, assuming that in this sequence there exist two states coding (C) and non-coding (N):

Sts: NNNNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCNNNNNNNNNNNNNN
Seq: GCATCGTTGGGGGGCATGCCATGTTCGAGAGTCTTTGACCCAAGACACGTAACCTATGCTTGAACGCGCTGGGAAAT

I then want to create a hmm model so that it takes the new sequence, the transitions of the emissions C -> G, C -> T, ... and predicts the locations of the hidden states.


Now, nucleotides in each state (C, and N) display transition frequencies that are characteristics of each state. Normally one can represent the transition between the nucleotides as a Markov chain where each transition $a_{ij}$ where $i,j \in S$ in a DNA sequence of length $T$ can be obtained as:

$$ a_{ij} = P(X_{t = j} = j | X_{t-1} = i) = \frac{P(X_{t-1}=i, X_{t}=j)}{P(X_{t-1}=i)} $$

The initial probability distributions are denoted then as $\pi = {\pi_{A}, \pi_{C}, \pi_{G}, \pi_{T}}$, estimated from the transition between characters $c$ from the sequence:

$$ \pi_{i} = \frac{c_{i}}{\sum_{k}{c_k}} $$

nucleotide-markov-chain

Where probabilities of a new sequence are given by:

$$ P(X_{1} =i_{1}, ... X_{T}=i_{t}) = \pi_{i1} \prod_{t =1}^{T -1}{a_{it},i_{i+1}} $$

In the classification problem they are used in a $likelihood-ratio test$ to classify based on a decision rule:

$$ S(X) = log \frac{P_{C}(X_{1} = i_{1}, ..., X_{T} = i_{T})}{P_{N}(X_{1} = i_{1}, ..., X_{T} = i_{T})} \begin{cases} > \eta \rightarrow &\textit{coding}, \\ < \eta \rightarrow &\textit{ non-coding} \end{cases} $$

Where $\eta$ is a threshold value for a significance level.

This led my to the idea that it might necessary to use an extra package (or another implementation) to encode the Markov chain in order to get the transitions from nucleotide to nucleotide. Hope this was clear, and rings any bells on what could be a nice strategy.

Best.

@gdalle
Copy link
Owner

gdalle commented Jun 15, 2023

Hu @camilogarciabotero, glad that you might be using my package!
Currently, the emissions of an HMM can be arbitrary (characters like 'A' are allowed), as long as you define custom distributions. On the other hand, states have to be integers from $1$ to $N$, so you need to encode C = 1 and N = 2.
The more complicated aspect of your question is the dependencies between emissions. In the standard HMM, emissions $Y_t$ and $Y_{t-1}$ are independent conditionally on the current state $X_t$. If I understand correctly, in your model this dependency is no longer satisfied, and you have a transition of the form
$$P(Y_t \mid X_t, Y_{t-1})$$
This is not currently implemented in my package, but it could be. I'm not sure how difficult is is mathematically, would have to look it up.

@gdalle gdalle added question Further information is requested enhancement New feature or request labels Jun 15, 2023
@camilogarciabotero
Copy link
Author

Thank you for the quick response.

The encoding part sounds nice and achievable, I'll give it a try. On the other hand, as I understood the emissions display distributions that are independent between states (although the characters' probability is dependent on the immediately previous character). So that, for instance, the C state got a $\hat{A_{1}}$ distribution of the frequencies of the dinucleotide transitions, and the state N has a $\hat{A_{2}}$ distribution of their dinucleotide transitions. Then the transition between the hidden states $Y_{t}$ could be represented as:

$$ P(Y_{t} | \hat{A_{n}}) $$

Sorry if I'm not very clear with the descriptions. I am following Ch. 2 of the textbook of Axelson-Fisk (2015) for reference.

Axelson-Fisk, M., & Axelson-Fisk, M. (2015). Single Species Gene Finding. Comparative Gene Finding: Models, Algorithms and Implementation, 29-105.

@gdalle
Copy link
Owner

gdalle commented Jun 15, 2023

I think I understand your problem but just to be sure, check out the images below, where $z_t \in {C, N}$ and $x_t \in {A, T, G, C}$.
My package currently does this:
image
And you want this:
image
Because the next emission is influenced both by the current hidden state (coding or not) and by the past nucleotide.
Did I get that right?

@camilogarciabotero
Copy link
Author

Ok, I see the differences and yes, it makes sense that the emissions are ultimately influenced by both... Now I see your point. Then, it is still not achievable the second representation with the HiddenMarkovModels?

@gdalle
Copy link
Owner

gdalle commented Jun 15, 2023

Not in the current state of the package. This specific variant is called an autoregressive HMM (AR-HMM), and it requires a slightly different implementation. I'm not sure I will include it in the package, at least in the near future.

There is a trick however, which would allow you to estimate an AR-HMM using a standard HMM. The trick is to define a new, aggregate state $z_t' = (z_t, x_{t-1})$.
But that induces a lot of complications in the Baum-Welch re-estimation step. And more importantly, it requires something called "parameter-sharing" between the state process and the observation process, which is currently impossible. I wanted to make it possible in v0.2, so please track issue #15

@gdalle
Copy link
Owner

gdalle commented Jun 15, 2023

Wait, on second thought I guess it is indeed possible with v0.1, but I need to wrap my head around it. I will try to spit out an example

@camilogarciabotero
Copy link
Author

camilogarciabotero commented Jun 15, 2023

Cool, never heard of AR-HMMs. However, they are not referenced in the gene-finding theory I was following. It seems that is a more general framework where the observations depend on several previous steps, but in the Markov process of the nucleotides, they seem to be only of the first order, that way only depending on its immediately previous nucleotide.
So in an example with a S = "CCTCCCGGACCCTGGGCTCGGGAC" the dinucleotide transition probabilities $\hat{A}$ will be something like this:

$$ \begin{bmatrix} & \text{A} & \text{C} & \text{G} & \text{T} \\ \text{A} & 0.00 & 1.00 & 0.00 & 0.00 \\ \text{C} & 0.00 & 0.56 & 0.22 & 0.22 \\ \text{G} & 0.25 & 0.12 & 0.62 & 0.00 \\ \text{T} & 0.00 & 0.67 & 0.33 & 0.00 \\ \end{bmatrix} $$

So that the probability of a new sequence CCTG is:

$$ \begin{align*} p(CCTG) &= P(X_1 = C)P(X_2 = C|X_1 = C)P(X_3 = T|X_2 = C)P(X_4 = G|X_3 = T) \\ &= \pi_C \cdot a_{CC} \cdot a_{CT} \cdot a_{TG} = 0.0167. \end{align*} $$

Now in the HMMs the C and N will display different dinucleotide transition probabilities $\hat{A}$ which could then be used to classify a new observed sequence (nucleotide by nucleotide I believe) with a threshold $\eta$ of the subsequence that is being calculated. So maybe I can define a HMM model like this:

function dnahmm(N)
    π = [π₁, π₂, π₃, π₄]
    A = zeros(Int, 4, 4)  # initial distribution
    dists(A₁, A₂) # Coding and Non-coding transition probabilities
end

I will need to define a custom function to calculate the $\hat{A}$ s matrices, but maybe that will make the job. And the states will be encoded as a Dict (?)

N = Dict('N' => 0,'C' => 1)

And then I'm not pretty sure how to make the predictions... but I will expect to do something like:

state_seq, obs_seq = rand(hmm, "ATCGTTGGGGGGCATGCCATGTTCGAGAGTCTTTGACCCAAGACACGTAACCTATGCTTGAACGCGCTGGGAAAT") 
Sts: NNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCNNNNNNNNNNNNNN
Seq: ATCGTTGGGGGGCATGCCATGTTCGAGAGTCTTTGACCCAAGACACGTAACCTATGCTTGAACGCGCTGGGAAAT

I will be looking forward to your example!

@gdalle
Copy link
Owner

gdalle commented Jul 11, 2023

Hey @camilogarciabotero!
I finally found some time to tinker with your problem using the latest version of my package (soon to be registered and documented here in the meantime).
I think this does what you want, but I'm not 100% sure of the math. I would love for you to test it on some real data that you are used to, and tell me if it gives you coherent results!
You can find the code in the file test/dna.jl

@gdalle
Copy link
Owner

gdalle commented Jul 11, 2023

The reason I'm doubting is because I can't seem to estimate the exact parameters even with very long sequences. But it might just be due to the hill-climbing behavior of the Baum-Welch algorithm, which is vulnerable to local optima.
I suggest initializing both nucleotide transition matrices with the same value, obtained by modeling the sequence of nucleotides as a plain Markov chain. A better initialization goes a long way.

@camilogarciabotero
Copy link
Author

This is great! I can't wait to try it and let you know how it goes for me (how do you suggest I try it out? Making a fork? What's the best way). Regarding:

I suggest initializing both nucleotide transition matrices with the same value, obtained by modeling the sequence of nucleotides as a plain Markov chain

I was actually working on a simple package to model DNA (LongNucOrView{4} from BioSequences.jl package) as a Markov chain. I just submitted the package BioMarkovChains.jl into the register.

@gdalle gdalle linked a pull request Jul 12, 2023 that will close this issue
@gdalle
Copy link
Owner

gdalle commented Jul 12, 2023

how do you suggest I try it out?

You can fork and play around with the code in the pull request #24 :)

@gdalle
Copy link
Owner

gdalle commented Jul 13, 2023

Closed by mistake. The file with the code is now on main and included in the package tests, but not yet documented cause I'm waiting to see if it does what you want

@gdalle gdalle reopened this Jul 13, 2023
@camilogarciabotero
Copy link
Author

camilogarciabotero commented Jul 13, 2023

Ok, thanks again for following this question and FR, I could try it out and got some questions:

  1. Regarding the DNACodingHMM struct
struct DNACodingHMM <: AbstractHMM
     cod_init::Vector{Float64}
     nuc_init::Vector{Float64}
     cod_trans::Matrix{Float64}
     nuc_trans::Array{Float64,3}
     function DNACodingHMM(; cod_init, nuc_init, cod_trans, nuc_trans)
         @assert length(cod_init) == 2
         @assert length(nuc_init) == 4
         @assert size(cod_trans) == (2, 2)
         @assert size(nuc_trans) == (2, 4, 4)
         return new(cod_init, nuc_init, cod_trans, nuc_trans)
     end
 end

Why is nuc_trans::Array{Float64,3} this dimensions? in the latter created object of this type dchmm:

dchmm = DNACodingHMM(;
    cod_init=rand(2),
    nuc_init=rand(4),
    cod_trans=rand_trans_mat(2),
    nuc_trans=stack([rand_trans_mat(4), rand_trans_mat(4)]; dims=1),
);

It looks like:

julia> dchmm.nuc_trans
2×4×4 Array{Float64, 3}:
[:, :, 1] =
 0.300879   0.227052  0.084543  0.0768361
 0.0478425  0.319101  0.432183  0.0618401

[:, :, 2] =
 0.332676  0.332078  0.18038    0.293399
 0.511775  0.204785  0.0600831  0.466781

[:, :, 3] =
 0.0560964   0.265743  0.470459  0.183983
 0.00122564  0.221522  0.330401  0.2772

[:, :, 4] =
 0.310349  0.175127  0.264617  0.445783
 0.439157  0.254592  0.177333  0.194179

I think this might be a little bit simpler to have a 4x4 (which is the case of two transitions) matrix since the nucleotide transitions $\mathscr{A}$ has the property of keeping the dimensions when the Markov chain increases the order $\mathscr{A}^{n} = (\mathscr{a}_{ij}(n))$.

However, as I could spot in the documentation, we can now easily create different HMMs with the interface, so my guess is that it could be easily modified...

  1. I followed the code example, and I keep having the following error when generating the random observations and states:
(; state_seq, obs_seq) = rand(dchmm, 10_000);
ERROR: ArgumentError: Invalid probability distribution.
Stacktrace:
 [1] check_prob_vec
   @ ~/.julia/packages/HiddenMarkovModels/hpDUT/src/utils/probvec.jl:7 [inlined]
 [2] check_mc(mc::MarkovChain{Vector{Float64}, Matrix{Float64}})
   @ HiddenMarkovModels ~/.julia/packages/HiddenMarkovModels/hpDUT/src/utils/check.jl:27
 [3] MarkovChain
   @ ~/.julia/packages/HiddenMarkovModels/hpDUT/src/mc.jl:7 [inlined]
 [4] MarkovChain
   @ ~/.julia/packages/HiddenMarkovModels/hpDUT/src/hmm.jl:66 [inlined]
 [5] rand(rng::Random.TaskLocalRNG, hmm::DNACodingHMM, T::Int64)
   @ HiddenMarkovModels ~/.julia/packages/HiddenMarkovModels/hpDUT/src/abstract_hmm.jl:56
 [6] rand(model::DNACodingHMM, T::Int64)
   @ HiddenMarkovModels ~/.julia/packages/HiddenMarkovModels/hpDUT/src/interface.jl:34
 [7] top-level scope
   @ Untitled-1:145

@gdalle
Copy link
Owner

gdalle commented Jul 13, 2023

As for the error, check out the new file test/dna.jl, it's just that I forgot to initialize the probability vectors with rand_prob_vec. It's fixed on the main branch.

As for the storage format, I could have separated the two matrices A but I put them together. Maybe not my best move but it's easy to change

@camilogarciabotero
Copy link
Author

camilogarciabotero commented Jul 13, 2023

Yeah, I tried the update and it runs now, my bad. Regarding the states, for a simply DNA alphabet ${A,C,G,T}$ (in this alphabetical order A -> 1, C -> 2 ...) wouldn't be then 16 states ($4^2)$, so $(AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT)$? I just see the explanation in the doscstring of the struct, but the alphabetical order could be handle I guess.

I am now getting this error for the following:

most_likely_coding_seq = get_coding.(viterbi(dchmm, obs_seq));

ERROR: NotImplementedError: The called method is part of a fallback definition for the `AbstractHiddenMarkovModel` interface.
Please implement `Base.length(::AbstractHMM)` for your type `T <: AbstractHiddenMarkovModel`.
Stacktrace:
 [1] length(#unused#::DNACodingHMM)
   @ HiddenMarkovModels ./none:0
 [2] viterbi(hmm::DNACodingHMM, obs_seq::Vector{Int64})
   @ HiddenMarkovModels ~/.julia/packages/HiddenMarkovModels/hpDUT/src/inference/viterbi.jl:32
 [3] top-level scope
   @ Untitled-1:124

@gdalle
Copy link
Owner

gdalle commented Jul 13, 2023

The order I gave in the docstring is an arbitrary example, it doesn't matter how you map letters to numbers as long as you do it consistently.

This error is weird cause I test against it in my code, did you just run my file or make some changes?

@camilogarciabotero
Copy link
Author

Ok, sorry for the bad attempts. I could finally run it completely. It looks like it is actually finding the hidden states. The thing is that since the dchmm is actually random, then that might explain why we are getting such a combination of 1 and 2 when most_likely_coding_seq (?). I put two transition models ECOLICDS and ECOLINOCDS in my BioMarkovChains.jl package but I am using the following struct to store a Markov chain model of a DNA sequence:

ECOLICDS
TransitionModel:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
    0.31	0.224	0.199	0.268
    0.251	0.215	0.313	0.221
    0.236	0.308	0.249	0.207
    0.178	0.217	0.338	0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
    0.245
    0.243
    0.273
    0.239
  - Markov Chain Order:1
TransitionModel:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
    0.321	0.204	0.2	0.275
    0.282	0.233	0.269	0.215
    0.236	0.305	0.235	0.225
    0.207	0.219	0.259	0.314
  - Initial Probabilities -> Vector{Float64}(4 × 1):
    0.262
    0.239
    0.24
    0.259
  - Markov Chain Order:1

Maybe we can use those values to instantiate the dchmm and then try it out with an actual DNA sequence, so that we can see some real scenario?

@gdalle
Copy link
Owner

gdalle commented Jul 13, 2023

Yeah, definitely use realistic values, that's what I meant when I said you should test on actual field data. My initializations are all random so they don't mean anything.
I assume ECOLICDS is coding and ECOLINOCDS is non coding? Do you need help plugging these values into the model or will you manage?

If you already know the parameters $A_1$ and $A_2$ and just want to perform decoding, the viterbi algorithm should work fine as long as you supply the right parameters. But you also want to be careful about the transition matrix between coding and noncoding: if you expect long constant segments, the probabilities to switch states should be very low.

If you want to estimate the parameters, we need to provide a good initial guess to the baum_welch algorithm, so I'm not sure what that should be depending on the scenario at hand.

@camilogarciabotero
Copy link
Author

camilogarciabotero commented Jul 13, 2023

I assume ECOLICDS is coding and ECOLINOCDS is non coding?

yes

Do you need help plugging these values into the model or will you manage?

I will try them out, thank you very much for all the help.

I was trying to understand the functions:

get_coding(state) = 1 + (state - 1) ÷ 4
get_nucleotide(state) = 1 + (state - 1) % 4
get_state(coding, nucleotide) = 4(coding - 1) + nucleotide

Are they custom for the DNACodingHMM struct? If I modified to split the nuc_trans field will those be affected?

@gdalle
Copy link
Owner

gdalle commented Jul 14, 2023

No these functions are not related to how nuc_trans is stored (or how you encode the letters into numbers). They are just used to convert between a couple of integers (representing the coding <=2 and nucleotide <= 4) and a single integer state <= 8. The former representation is the biological one, but the latter is used by my package

@camilogarciabotero
Copy link
Author

Hi @gdalle,

I've been thinking and trying this feature a bit, but still haven't figured out how to plug in the two variables cod_init and cod_trans since haven't found real data set values for them. I will search a little bit more or will try to generate them using BioMarkovChains.jl I will ping again later.

Best.

@gdalle
Copy link
Owner

gdalle commented Jul 19, 2023

It all depends if you want to do decoding or learning. For deciding, these values are quite important, for learning they don't matter much cause they are just initial guesses. Why don't you try it with uniform distributions and see what happens?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants