# Markov Chains

## Introduction 

Markov chains are a widely used tool in natural language processing (NLP) for modeling and generating sequences of words. A Markov chain models the probability of a word sequence under the assumption that the probability of each word depends only on the preceding word(s).

One of the most common applications of Markov chains in NLP is language modeling, the task of assigning a probability to a sequence of words. Markov chains make this task tractable by assuming that the probability of each word depends only on the preceding word(s). This assumption is known as the Markov assumption.

For example, suppose we have a sentence "The cat sat on the mat." A first-order Markov model would assume that the probability of the word "cat" depends only on the preceding word "The," and the probability of the word "sat" depends only on the preceding word "cat," and so on. Using this model, we can compute the probability of any given sentence by multiplying together the probabilities of each word given its preceding word(s).
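
The multiplication described above can be written as P(w1, ..., wn) ≈ P(w1 | start) · P(w2 | w1) · ... · P(wn | wn-1). The following is a minimal sketch of that computation, assuming relative-frequency estimates with no smoothing. The two-sentence corpus, the `<s>` start marker, and the function names are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence marker
        words = ["<s>"] + sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return {
        prev: {w: c / sum(nexts.values()) for w, c in nexts.items()}
        for prev, nexts in counts.items()
    }

def sentence_probability(model, sentence):
    """Multiply the conditional probability of each word given its predecessor."""
    words = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        # A bigram never seen in training gets probability 0
        prob *= model.get(prev, {}).get(word, 0.0)
    return prob

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigram_model(corpus)
print(sentence_probability(model, "the cat sat on the mat"))  # 0.0625
```

In this toy corpus "the" is followed by four different words, so P(cat | the) and P(mat | the) are each 0.25 and every other step has probability 1. A sentence containing an unseen bigram gets probability 0, which is why practical language models add smoothing.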

Markov chains can also be used for text generation, where the goal is to generate new text that follows the same statistical patterns as a given corpus of text. This is typically done by sampling from the conditional distribution of each word given its preceding word(s), using the Markov model. By iteratively sampling from this distribution, we can generate a sequence of words that has a similar statistical structure to the original text.

![Markov chain](https://upload.wikimedia.org/wikipedia/commons/thumb/2/2b/Markovkate_01.svg/520px-Markovkate_01.svg.png)

## History of Markov Chains

Markov chains are named after the Russian mathematician Andrey Markov. 

![Markov](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/AAMarkov.jpg/440px-AAMarkov.jpg)

Note: His son also became a mathematician and also used the initials A. A. Markov ...

Markov was interested in the behavior of sequences of random variables, and he developed a mathematical framework for studying sequences where the probability of each variable depends only on the preceding variable.

Before Markov, several other mathematicians had studied sequences of random variables, including the Bernoulli brothers, the French mathematician Pierre-Simon Laplace, and the German mathematician Carl Friedrich Gauss.


However, Markov was the first to develop a general theory of sequences of dependent random variables in which the probability of each variable depends only on the preceding variable. Such sequences are now called Markov chains in his honor.

Markov developed his theory of Markov chains in the early 1900s; his first paper on the subject appeared in 1906. The theory arose from his interest in extending the law of large numbers to sequences of dependent random variables.

The theory of Markov chains was further developed in the 1920s and 1930s by a number of mathematicians, including the Soviet mathematicians Andrey Kolmogorov and Aleksandr Khinchin. In the mid-20th century, Markov chains became a central topic in probability theory and were used to study a wide range of problems in physics, biology, economics, and other fields.

In natural language processing, Markov chains were used in the 1950s and 1960s for language modeling and text generation. Early applications included automatic language translation systems and machine-generated poetry.

## Core Idea behind Markov Chains

The main idea behind Markov chains is to model the behavior of a sequence of random variables (such as a sequence of words in a sentence) by assuming that the probability of each variable depends only on the preceding variable (or variables).

The Markov chain is defined by a set of states and a transition matrix, which specifies the probability of transitioning from one state to another. Each state represents a possible value of the random variable, and the transition matrix specifies the probabilities of moving from one state to another.
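
As a rough illustration of states and a transition matrix, here is a small sketch using a hypothetical two-state weather chain. The state names, the probabilities, and the helper functions are assumptions made for this example only.

```python
import random

states = ["sunny", "rainy"]  # hypothetical two-state chain
# transition[i][j] = P(next state is states[j] | current state is states[i])
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]

def is_row_stochastic(matrix, tol=1e-9):
    """Each row must be a probability distribution over the next states."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) < tol
        for row in matrix
    )

def simulate(start, steps, rng=random):
    """Walk the chain; each step looks only at the current state."""
    path = [start]
    current = states.index(start)
    for _ in range(steps):
        current = rng.choices(range(len(states)), weights=transition[current])[0]
        path.append(states[current])
    return path

print(is_row_stochastic(transition))
print(simulate("sunny", 5))
```

Each row of the matrix sums to 1 because, from any state, the chain must move to some state (possibly the same one) at the next step.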

The Markov property, which is central to the theory of Markov chains, states that the probability of transitioning to a new state depends only on the current state and not on any previous states. This means that the future behavior of the system is dependent only on its current state and not on any earlier history.

The behavior of a Markov chain can be analyzed by studying its stationary distribution, a probability distribution over states that is left unchanged by one step of the chain. Under certain conditions (for a finite chain, irreducibility and aperiodicity), the chain converges to a unique stationary distribution regardless of its starting state, and this distribution describes the long-term fraction of time the chain spends in each state.
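
One simple way to approximate a stationary distribution is to apply the transition matrix repeatedly until the distribution stops changing (power iteration). The sketch below assumes a small, irreducible, aperiodic chain; the matrix `P` is a made-up example and the function names are illustrative.

```python
def step(dist, matrix):
    """One step of the chain: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def stationary_distribution(matrix, tol=1e-12, max_iter=10_000):
    """Power iteration starting from the uniform distribution."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        new_dist = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(dist, new_dist)) < tol:
            return new_dist
        dist = new_dist
    return dist  # may not have converged within max_iter

P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
print(pi)  # approximately [0.8333, 0.1667]
```

For this matrix the result can be checked by hand: the balance condition 0.1 · pi[0] = 0.5 · pi[1] together with pi[0] + pi[1] = 1 gives pi = (5/6, 1/6).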

## Python program for Markov Chains

In [1]:
import random

class MarkovChain:

    def __init__(self, order=1):
        self.order = order
        self.transitions = {}  # maps each tuple of `order` words to the list of words that followed it
        self.starts = []  # opening word sequences, one per added text
    
    def add_text(self, text):
        # Split the text into words
        words = text.split()  # very simple tokenizer: split on any whitespace

        # Add the start of the text to the list of possible starts
        self.starts.append(words[:self.order])

        # Add each sequence of words of length self.order to the transitions dictionary
        for i in range(len(words) - self.order):
            key = tuple(words[i:i+self.order])
            value = words[i+self.order]
            if key in self.transitions:
                self.transitions[key].append(value)
            else:
                self.transitions[key] = [value]
    
    def generate_text(self, length):
        # Choose a random start sequence from the list of possible starts
        current_sequence = random.choice(self.starts)
        
        # Generate each next word by choosing uniformly from the list of recorded followers.
        # Repeated followers appear several times in the list, so frequent words are
        # proportionally more likely to be picked.
        generated_text = list(current_sequence)
        for i in range(length):
            key = tuple(current_sequence)
            if key not in self.transitions:
                # Dead end (for example, the final words of the corpus): restart from a
                # random start sequence and emit its words so the output stays contiguous
                current_sequence = list(random.choice(self.starts))
                generated_text.extend(current_sequence)
                continue
            next_word = random.choice(self.transitions[key])
            generated_text.append(next_word)
            current_sequence = current_sequence[1:] + [next_word]

        return ' '.join(generated_text)






In [2]:
# Example usage
text = """The quick brown fox jumps over the lazy dog on the porch. 
The lazy dog then wakes up and chases the fox around the yard.
The lazy dog also likes to sleep on the couch.
The fox likes to sleep on the porch.
The quick brown fox steals the dog's bone and runs away.
"""

mc = MarkovChain(order=2)
mc.add_text(text)

generated_text = mc.generate_text(length=16)
print(generated_text)

The quick brown fox jumps over the lazy dog also likes to sleep on the couch. The fox


In [3]:
## Let's see the transitions

mc.transitions

{('The', 'quick'): ['brown', 'brown'],
 ('quick', 'brown'): ['fox', 'fox'],
 ('brown', 'fox'): ['jumps', 'steals'],
 ('fox', 'jumps'): ['over'],
 ('jumps', 'over'): ['the'],
 ('over', 'the'): ['lazy'],
 ('the', 'lazy'): ['dog'],
 ('lazy', 'dog'): ['on', 'then', 'also'],
 ('dog', 'on'): ['the'],
 ('on', 'the'): ['porch.', 'couch.', 'porch.'],
 ('the', 'porch.'): ['The', 'The'],
 ('porch.', 'The'): ['lazy', 'quick'],
 ('The', 'lazy'): ['dog', 'dog'],
 ('dog', 'then'): ['wakes'],
 ('then', 'wakes'): ['up'],
 ('wakes', 'up'): ['and'],
 ('up', 'and'): ['chases'],
 ('and', 'chases'): ['the'],
 ('chases', 'the'): ['fox'],
 ('the', 'fox'): ['around'],
 ('fox', 'around'): ['the'],
 ('around', 'the'): ['yard.'],
 ('the', 'yard.'): ['The'],
 ('yard.', 'The'): ['lazy'],
 ('dog', 'also'): ['likes'],
 ('also', 'likes'): ['to'],
 ('likes', 'to'): ['sleep', 'sleep'],
 ('to', 'sleep'): ['on', 'on'],
 ('sleep', 'on'): ['the', 'the'],
 ('the', 'couch.'): ['The'],
 ('couch.', 'The'): ['fox'],
 ('The', 'fo

In [None]:
## An improvement would be to store a count for each next word instead of repeating words in a list, which is inefficient for large corpora
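
A possible shape for that improvement is sketched below, assuming `collections.Counter` for the counts and `random.choices` for weighted sampling. The function names `count_transitions` and `sample_next` and the sample sentence are hypothetical and are not part of the `MarkovChain` class above.

```python
import random
from collections import Counter, defaultdict

def count_transitions(text, order=1):
    """Map each context (tuple of `order` words) to a Counter of next words."""
    words = text.split()
    counts = defaultdict(Counter)
    for i in range(len(words) - order):
        counts[tuple(words[i:i + order])][words[i + order]] += 1
    return counts

def sample_next(counts, context, rng=random):
    """Draw the next word with probability proportional to its count.

    Assumes `context` was seen during counting.
    """
    options = counts[context]
    return rng.choices(list(options), weights=list(options.values()))[0]

counts = count_transitions("the cat sat on the mat and the cat slept", order=1)
print(counts[("the",)])
print(sample_next(counts, ("the",)))
```

Here "the" is followed by "cat" twice and "mat" once, so the counter stores two entries instead of three list items, and "cat" is sampled with probability 2/3. The memory saving grows with the corpus, because each distinct follower is stored once no matter how often it occurs.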



In [4]:
mc3 = MarkovChain(order=3)
mc3.add_text(text)
mc3.transitions

{('The', 'quick', 'brown'): ['fox', 'fox'],
 ('quick', 'brown', 'fox'): ['jumps', 'steals'],
 ('brown', 'fox', 'jumps'): ['over'],
 ('fox', 'jumps', 'over'): ['the'],
 ('jumps', 'over', 'the'): ['lazy'],
 ('over', 'the', 'lazy'): ['dog'],
 ('the', 'lazy', 'dog'): ['on'],
 ('lazy', 'dog', 'on'): ['the'],
 ('dog', 'on', 'the'): ['porch.'],
 ('on', 'the', 'porch.'): ['The', 'The'],
 ('the', 'porch.', 'The'): ['lazy', 'quick'],
 ('porch.', 'The', 'lazy'): ['dog'],
 ('The', 'lazy', 'dog'): ['then', 'also'],
 ('lazy', 'dog', 'then'): ['wakes'],
 ('dog', 'then', 'wakes'): ['up'],
 ('then', 'wakes', 'up'): ['and'],
 ('wakes', 'up', 'and'): ['chases'],
 ('up', 'and', 'chases'): ['the'],
 ('and', 'chases', 'the'): ['fox'],
 ('chases', 'the', 'fox'): ['around'],
 ('the', 'fox', 'around'): ['the'],
 ('fox', 'around', 'the'): ['yard.'],
 ('around', 'the', 'yard.'): ['The'],
 ('the', 'yard.', 'The'): ['lazy'],
 ('yard.', 'The', 'lazy'): ['dog'],
 ('lazy', 'dog', 'also'): ['likes'],
 ('dog', 'also', 

In [5]:
mc_big = MarkovChain(order=5)
mc_big.add_text(text)
mc_big.transitions

{('The', 'quick', 'brown', 'fox', 'jumps'): ['over'],
 ('quick', 'brown', 'fox', 'jumps', 'over'): ['the'],
 ('brown', 'fox', 'jumps', 'over', 'the'): ['lazy'],
 ('fox', 'jumps', 'over', 'the', 'lazy'): ['dog'],
 ('jumps', 'over', 'the', 'lazy', 'dog'): ['on'],
 ('over', 'the', 'lazy', 'dog', 'on'): ['the'],
 ('the', 'lazy', 'dog', 'on', 'the'): ['porch.'],
 ('lazy', 'dog', 'on', 'the', 'porch.'): ['The'],
 ('dog', 'on', 'the', 'porch.', 'The'): ['lazy'],
 ('on', 'the', 'porch.', 'The', 'lazy'): ['dog'],
 ('the', 'porch.', 'The', 'lazy', 'dog'): ['then'],
 ('porch.', 'The', 'lazy', 'dog', 'then'): ['wakes'],
 ('The', 'lazy', 'dog', 'then', 'wakes'): ['up'],
 ('lazy', 'dog', 'then', 'wakes', 'up'): ['and'],
 ('dog', 'then', 'wakes', 'up', 'and'): ['chases'],
 ('then', 'wakes', 'up', 'and', 'chases'): ['the'],
 ('wakes', 'up', 'and', 'chases', 'the'): ['fox'],
 ('up', 'and', 'chases', 'the', 'fox'): ['around'],
 ('and', 'chases', 'the', 'fox', 'around'): ['the'],
 ('chases', 'the', 'fox'

## Big Order Markov Chains

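You can see how deterministic even a 5th-order Markov chain becomes on a small corpus: almost every context is followed by only one possible word, so the model mostly reproduces the training text.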
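
One way to quantify this effect is to measure the fraction of contexts that have more than one distinct follower. The sketch below is an illustrative helper, not part of the notebook's `MarkovChain` class, and the sample sentence is made up.

```python
from collections import defaultdict

def branching_fraction(words, order):
    """Fraction of contexts that are followed by more than one distinct word."""
    successors = defaultdict(set)
    for i in range(len(words) - order):
        successors[tuple(words[i:i + order])].add(words[i + order])
    if not successors:
        return 0.0
    branching = sum(1 for nexts in successors.values() if len(nexts) > 1)
    return branching / len(successors)

sample = "the cat sat on the mat and the cat slept on the rug".split()
for order in (1, 2, 3):
    print(order, round(branching_fraction(sample, order), 3))
```

A value near 0 means the chain has almost no real choices at that order. On a small corpus the fraction typically falls quickly as the order grows, because longer contexts are rarely repeated.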

## Limitations of simple Markov Chains

The Markov property is a strong assumption, and it is often violated in real-world applications. In natural language, the probability of a word often depends on more than just the preceding word. For example, in the sentence "The keys to the cabinet are on the table," the verb "are" must agree with "keys," which appears four words earlier; a first-order model sees only "cabinet" and cannot capture this dependency. Raising the order helps, but as shown above, high-order chains trained on limited text become nearly deterministic.

## Hidden Markov Chains

A hidden Markov chain, also known as a hidden Markov model (HMM), is a type of Markov chain in which the state of the system is not directly observable, but instead generates a sequence of observable outputs.

In a hidden Markov chain, there are two types of variables: the hidden states, which are not directly observable, and the observed outputs, which are generated by the hidden states. The hidden states represent the internal state of the system, while the observed outputs provide information about the hidden states.

For example, in speech recognition, a hidden Markov chain can be used to model the speech signal as a sequence of hidden states representing different phonemes, with each hidden state generating a sequence of observable acoustic features such as spectral coefficients or mel-frequency cepstral coefficients (MFCCs).

The behavior of a hidden Markov chain is defined by a transition matrix that specifies the probability of transitioning from one hidden state to another, and an emission matrix that specifies the probability of generating each observable output given the hidden state. The initial distribution over the hidden states is also specified.

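The key challenge in working with hidden Markov chains is to infer the hidden states given a sequence of observed outputs. This problem, known as the decoding problem, is typically solved with the Viterbi algorithm, which finds the single most likely sequence of hidden states. The related forward-backward algorithm instead computes the probability of each hidden state at each time step given the whole observation sequence.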
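
To make the decoding step concrete, here is a minimal sketch of the Viterbi algorithm on a toy HMM. The two hidden states, three observation symbols, and all probabilities are invented for illustration, and the code assumes a non-empty observation sequence and plain (non-log) probabilities, which is only suitable for short sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence and its joint probability."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        best.append({})
        backpointer.append({})
        for s in states:
            prev, prob = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            backpointer[t][s] = prev
    # Trace back from the most probable final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy model: hidden health state, observed symptoms
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)  # ['Healthy', 'Healthy', 'Fever'] with probability about 0.01512
```

The dictionaries `start_p`, `trans_p`, and `emit_p` correspond to the initial distribution, transition matrix, and emission matrix described earlier. Real implementations usually work with log probabilities to avoid numerical underflow on long sequences.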

Hidden Markov chains are used in a wide range of applications, including speech recognition, natural language processing, bioinformatics, and finance. They are a powerful tool for modeling complex systems where the state of the system is not directly observable but can be inferred from a sequence of observable outputs.

## References

* https://www.americanscientist.org/article/first-links-in-the-markov-chain
* https://www.jstor.org/stable/1403785