# Text Generation With Markov Chains
The main idea of today's workshop is to write code that generates text based on a collection of existing text (we call this input text a **corpus**). The corpus could be Wikipedia articles, the text of novels or a library of tweets: basically any sort of textual data.

The goal is to generate **original** text in the same style as the corpus.

There are a number of different techniques that could be used to generate text; for this workshop we have chosen Markov chains.

Before we get started we'll need to understand a little about Markov chains.


## Markov Chains
Markov chains are used to model stochastic **(random)** processes that have a fixed set of states and transitions between those states, each with a probability attached. There's a lot going on in that sentence, so let's have an example!

Let's say I have two choices for how to spend my afternoons: I could go to the gym or play video games. If I go to the gym one day, I have a 50% probability of going to the gym the following day (to try and keep up the good habits). If I play video games on any afternoon I am highly likely (80% probability) to play video games the following day (yes, I *am* addicted).

The probabilities of the transitions out of each state must sum to **1**, so we can infer that the chance of me playing video games after going to the gym is 50% and the chance of me going to the gym after playing video games is 20%.
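
As a quick check on that arithmetic, the sketch below infers each missing probability as one minus the stated one. The names `stay_probability` and `switch_probability` are illustrative only and are not used elsewhere in the workshop.

```python
# Probability of repeating today's activity tomorrow (values from the text above).
stay_probability = {"Gym": 0.5, "Games": 0.8}

# The transitions out of a state must sum to 1, so the chance of switching
# is whatever is left over. round() hides floating point noise in 1 - 0.8.
switch_probability = {
    state: round(1 - p, 2) for state, p in stay_probability.items()
}

for state in stay_probability:
    print("{}: stay {}, switch {}".format(
        state, stay_probability[state], switch_probability[state]))
```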

The diagram below shows the relationship between the states and transitions. 

![Two state Markov chain](./images/markov_chain.png)


To use the model to generate a sequence of events, we need to pick a start state, and then generate a random value that will decide which state we move to. 

We have two choices in each state: remain in the current state or move to the other state. We will say that the interval from 0 up to the probability of staying represents remaining in the current state. So for the gym state, a random number less than 0.5 means staying in the gym state, and a number greater than or equal to 0.5 transitions us to the video games state. For the video games state, anything less than 0.8 will keep us playing games!

Let's see this example in action. Highlight the cell below and run it to see what is generated.

In [None]:
import random #import the random library

DAYS_TO_GENERATE = 5

GYM_GYM = 0.5 #Probability of being in the gym state and staying there
GYM_GAMES = 0.5 #Probability of being in the gym state and moving to the games state

GAMES_GAMES = 0.8 #We stay playing games if we played today
GAMES_GYM = 0.2 #We go to the gym if we played today.

current_state = "Gym" #Let's start with good intentions!

print("Starting state {}".format(current_state))

for i in range(DAYS_TO_GENERATE):
    
    random_number = random.random() #Generate a random number
    
    if current_state == "Gym": 
        if random_number < GYM_GYM: #use the gym probabilities 
            current_state = "Gym"
        else:
            current_state = "Games"
    else: 
        if random_number < GAMES_GAMES: #use the games probabilities 
            current_state = "Games"
        else:
            current_state = "Gym"
    
    print("Chosen number: {} \t New state: {}".format(random_number, current_state))

## Activity 1
Experiment with the code above. Here are some suggestions: 
1. Change the number of days generated
2. Change the starting state
3. Modify the transition probabilities

## Data representation 
In the example above we defined a variable to store the probability of each transition. With a large number of states this clearly won't be practical: the maximum number of transitions for $n$ states is $n^{2}$, so even with just 10 states there could be up to 100 transitions.

Though not the optimal representation, one way to visualise the state transitions is to place them in an array (or list in Python), with each next state appearing a number of times in proportion to its probability.

For our initial example we could have some code as below:

```python
gym_transitions = ["Gym", "Games"] #Represents a one half probability of each transition
games_transitions = ["Games", "Games", "Games", "Games", "Gym"] #Represents 4/5 or 80% probability of games
```
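
To see why repeating entries works, here is a minimal sketch that generates a sequence from lists like these. Grouping the lists in a `transitions` dictionary and the helper name `generate_days` are assumptions made for this illustration; they are not part of the workshop code.

```python
import random

# Each next state appears as many times as its probability requires.
transitions = {
    "Gym": ["Gym", "Games"],                               # 1/2 each
    "Games": ["Games", "Games", "Games", "Games", "Gym"],  # 4/5 Games, 1/5 Gym
}

def generate_days(start_state, days):
    """Return the list of states visited, beginning with start_state."""
    states = [start_state]
    for _ in range(days):
        # random.choice picks uniformly, so repeated entries are more likely.
        states.append(random.choice(transitions[states[-1]]))
    return states

print(generate_days("Gym", 5))
```

Because `random.choice` selects each list element with equal probability, a state that fills four of the five slots is chosen about 80% of the time.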

A more compact approach is a matrix, as in the image below, which stores each transition probability exactly once.

![Transition Matrix](./images/transition_matrix.png)

We could easily represent this as:

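```python
# Column j holds the probabilities of leaving state j (0 = Gym, 1 = Games),
# so each column sums to 1.
transition_matrix = [[0.5, 0.2],
                     [0.5, 0.8]]
```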
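
As a rough sketch of how such a matrix could drive the simulation, the code below reads the column for the current state and uses it as weights for `random.choices`. It assumes the columns correspond to the current state in the order Gym, Games (inferred from the fact that the columns, not the rows, sum to 1); `STATES` and `next_state` are hypothetical names introduced for this example.

```python
import random

STATES = ["Gym", "Games"]

# Column j holds the probabilities of leaving STATES[j];
# row i is the probability of arriving in STATES[i].
transition_matrix = [[0.5, 0.2],
                     [0.5, 0.8]]

def next_state(current_state):
    """Pick tomorrow's state using the column for today's state."""
    column = STATES.index(current_state)
    weights = [row[column] for row in transition_matrix]
    # random.choices returns a list, so take its single element.
    return random.choices(STATES, weights=weights)[0]

current_state = "Gym"
for _ in range(5):
    current_state = next_state(current_state)
    print(current_state)
```

Adding a third state under this layout would only mean extending `STATES` and adding a row and a column to the matrix; `next_state` would stay the same.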