# Text Generation With Markov Chains
The main idea of today's workshop is to write some code that can generate text based on some collection of text (we call this text input a **corpus**). The corpus could be Wikipedia articles, the words from novels, or a library of tweets: basically any sort of textual data.

The goal of the process is to generate **original** text that is in the same *style* as the input.

There are a number of different techniques that could be used to generate text; for this workshop we have chosen to use Markov chains.

Before we get started we'll need to understand a little about Markov chains.

## Markov Chains
Markov chains are used to model stochastic **(random)** processes that have fixed states and transitions between them. The transitions have probabilities associated with them, which depend only on the current state. There's a lot going on in those two sentences, so let's have an example!

I have two choices for how to spend my afternoons: I could go to the gym or play video games. If I go to the gym one day, I have a 0.5 probability of going to the gym the following day (to try and keep up the good habits). If I play video games on any afternoon, I am highly likely (0.8 probability) to play video games the following day (yes, I *am* addicted).

The probabilities of the transitions out of each state must sum to **1**, so we can infer that the chance of me playing video games after going to the gym is 50%, and the chance of me going to the gym after playing video games is 20%.
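The sum-to-one rule is easy to check in code. The sketch below assumes the probabilities are stored in a nested dictionary; the names `transition_probabilities` and `rows_sum_to_one` are made up for this illustration and are not used elsewhere in the workshop.

```python
import math

# Hypothetical layout: outer key = today's state, inner key = tomorrow's state.
transition_probabilities = {
    "Gym": {"Gym": 0.5, "Games": 0.5},
    "Games": {"Games": 0.8, "Gym": 0.2},
}

def rows_sum_to_one(table):
    """Return True if the probabilities leaving every state sum to 1."""
    return all(math.isclose(sum(row.values()), 1.0) for row in table.values())

print(rows_sum_to_one(transition_probabilities))
```

`math.isclose` is used rather than `==` because adding floating point numbers can introduce tiny rounding errors.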

The diagram below shows the relationship between the states and transitions. 

![Two state Markov chain](./images/markov_chain.png)


To use the model to generate a sequence of events, we need to pick a start state, and then generate a random value that will decide which state we move to. 

We have two choices in each state: remain in the current state or move to the other. We will say that the interval from 0 to X represents the chance of staying in the current state, where X is the probability of staying. For the gym state, a number less than 0.5 means staying in the gym state, and a number greater than or equal to 0.5 will transition us to the video games state. For the video games state, anything less than 0.8 will keep us playing games!
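The same interval idea extends to any number of choices: lay the probabilities end to end along the interval from 0 to 1 and see which slice the random number falls into. Here is a minimal sketch of that idea; the function name `next_state` and the list-of-pairs layout are assumptions made for this illustration.

```python
import random

def next_state(options, random_number):
    """Pick a state by walking along the interval from 0 to 1.

    options is a list of (state, probability) pairs whose probabilities sum to 1.
    """
    upper_bound = 0.0
    for state, probability in options:
        upper_bound += probability
        if random_number < upper_bound:
            return state
    return options[-1][0]  # Guard against floating point rounding

gym_options = [("Gym", 0.5), ("Games", 0.5)]
games_options = [("Games", 0.8), ("Gym", 0.2)]

print(next_state(gym_options, 0.49))  # Below 0.5, so we stay at the gym
print(next_state(gym_options, 0.5))   # 0.5 or above, so we move to games
print(next_state(games_options, random.random()))
```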

### Examples In This Notebook
This notebook contains sections of code that can be edited and then run straight from the browser. To run the code, select the block and then click on the run button in the toolbar above. 

Let's see this example in action. Highlight the cell below and run it to see what is generated.

In [None]:
import random #import the random library to generate random numbers.

DAYS_TO_GENERATE = 5

GYM_GYM = 0.5 #Probability of being in gym state and staying there
GYM_GAMES = 0.5 #Probability of being in gym state moving to the games state

GAMES_GAMES = 0.8 #We stay playing games if we played today
GAMES_GYM = 0.2 #We go to the gym if we played today.

current_state = "Gym" #Let's start with good intentions!

print("Starting state: {}".format(current_state))

for i in range(DAYS_TO_GENERATE):
    
    random_number = random.random() #Generate a random number
    
    if current_state == "Gym": 
        if random_number < GYM_GYM: #use the gym probabilities 
            current_state = "Gym"
        else:
            current_state = "Games"
    else: 
        if random_number < GAMES_GAMES: #use the games probabilities 
            current_state = "Games"
        else:
            current_state = "Gym"

    print("Day: {} \t Chosen number: {} \t New state: {}".format(i+1, random_number, current_state))

## Activity 1 (5 minutes)
Experiment with the code above. Here are some suggestions: 
1. Change the number of days generated
2. Change the starting state
3. Modify the transition probabilities

## Data representation 
In the example above we defined a variable for the probability of each transition. With a large number of states this clearly won't be practical: the maximum number of transitions for $n$ states is $n^{2}$ (every state linked to every state, including itself), so even with just 10 states there could be up to 100 transitions.

Though not an optimal representation, one useful way to visualise the state transitions is to place them in an array (or list in Python), with each state appearing a number of times in proportion to its probability.

For our initial example we could have some code as below:

`gym_transitions = ["Gym","Games"] #Represents one half probability of each transition
games_transitions = ["Games", "Games", "Games", "Games", "Gym"] #Represents 4/5 or 80% probability of games
`

If we do this, then we can make use of the `choice` function in Python's `random` module to select from the options randomly.

`current_state = random.choice(games_transitions)`
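To convince ourselves that the repeated entries really do encode the probabilities, we can count them. The sketch below uses `collections.Counter` from the standard library; the sample size of 10,000 is an arbitrary choice for illustration.

```python
import random
from collections import Counter

games_transitions = ["Games", "Games", "Games", "Games", "Gym"]

# The share of each state in the list is its transition probability.
probabilities = {state: count / len(games_transitions)
                 for state, count in Counter(games_transitions).items()}
print(probabilities)

# Sampling many times should give roughly the same proportions.
samples = Counter(random.choice(games_transitions) for _ in range(10_000))
print(samples["Games"] / 10_000)
```

The second number will vary from run to run, but it should stay close to 0.8.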

Let's take a look at the example using the array-based scheme.

In [None]:
gym_transitions = ["Gym","Games"] #Represents one half probability of each transition
games_transitions = ["Games", "Games", "Games", "Games", "Gym"] #Represents 4/5 or 80% probability of games

current_state = "Gym"

for i in range(10):
    if current_state == "Gym":
        current_state = random.choice(gym_transitions)
    else:
        current_state = random.choice(games_transitions)
        
    print("Day: {} \t {}".format(i+1, current_state))    

This example looks much neater, and might even be a little easier to follow. There is still one drawback to this approach: we need to declare and store a variable for each state.

This is fine for a small number of states, however later on we'll want to be able to give our text generation system any text and have it learn the states as it goes. 

A hashmap (or dictionary in Python) allows us to set an identifying key for a value and then look it up later on. For example, we could use the state as the key, and the list of possible next states as the value stored under it. It would look a little like this:

`
{
    "Gym" : ["Gym", "Games"],
    "Games" : ["Gym", "Games", "Games", "Games", "Games"]
}
`

If we had a dictionary called `states`, we could write `states['Gym']` and the dictionary would return the list of states that can follow the gym.

We can create keys at any time and associate them with a value, so this is the approach we will use today.
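With a dictionary, the simulation no longer needs an `if` branch per state: the current state is simply used as the lookup key. A minimal sketch, assuming the same two states as before (the helper name `simulate` is made up for this illustration):

```python
import random

# Each state maps to the list of states that can follow it.
states = {
    "Gym": ["Gym", "Games"],
    "Games": ["Games", "Games", "Games", "Games", "Gym"],
}

def simulate(states, start, days):
    """Return the sequence of states visited, beginning with the start state."""
    history = [start]
    for _ in range(days):
        history.append(random.choice(states[history[-1]]))
    return history

print(simulate(states, "Gym", 10))
```

Adding a third state now only means adding another key to the dictionary; the loop itself does not change.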


## Activity 2 (5 minutes)
There are two things to think about here: 
1. What does changing the representation do for us? 
2. There is a more memory-efficient representation of the data. Can you figure out what it might be?
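Once you have had a go at the second question, here is one candidate answer to compare against; it is not the only one. Instead of repeating each state in a list, we could store it once alongside a count and let `random.choices` do the weighting. The names `transition_counts` and `choose_next` are assumptions made for this sketch.

```python
import random

# Store each follower once, with a count, instead of repeating it in a list.
transition_counts = {
    "Gym": {"Gym": 1, "Games": 1},
    "Games": {"Games": 4, "Gym": 1},
}

def choose_next(current_state):
    """Pick the next state, weighting each option by its count."""
    followers = transition_counts[current_state]
    return random.choices(list(followers), weights=list(followers.values()), k=1)[0]

print(choose_next("Games"))
```

The saving is small here, but in a large corpus a common word might follow the same phrase thousands of times, and a single count is much cheaper than thousands of list entries.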

## Getting Started with Text Generation
Now we have a basic understanding of Markov chains, let's take a look at how they might apply in a text scenario.
We'll use the following statement as a tiny corpus:

**"Democracy is the best system of government. Dictatorship is the worst system of government"**

Our first Markov chain considered only a single state when choosing what to do next. If we did that for text generation, we'd choose each word based only on the word before it. With so little context, the output tends to wander and read as nonsense. To generate more convincing text we'll consider more of the previous text: in this workshop, the previous two words. The trade-off is that the more context we use, the more closely the output follows the corpus.

We also need to consider what will actually constitute a sentence. For our example we will: 
1. Expect a sentence to start with a capital letter
2. End a sentence with a full stop

The first thing to do is break down our example into two-word segments, which we will use as the keys of the dictionary that represents the transition matrix.

0. Democracy is
1. is the
2. the best
3. best system
4. system of
5. of government.

(The pairs continue in the same way through the second sentence.)
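The full list of two-word segments can be produced with a couple of lines of Python. This is only a sketch of the idea, assuming we split the corpus on whitespace and leave punctuation attached to the words; the variable names are made up for illustration.

```python
corpus = "Democracy is the best system of government. Dictatorship is the worst system of government."

words = corpus.split()

# Pair every word with the word that follows it.
pairs = ["{} {}".format(first, second) for first, second in zip(words, words[1:])]

for index, pair in enumerate(pairs):
    print(index, pair)
```

Notice that one of the pairs, "government. Dictatorship", spans the boundary between the two sentences. That is fine: our sentence rules will stop generation at the full stop before that pair is ever needed.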

We can then add the words that follow each word pair into an array, where the probabilities are set by the number of occurrences of each following word, as we discussed above. An excerpt from the created dictionary is below.

`
{
    "Democracy is": ["the"],
    "is the" : ["best", "worst"], #<-- Here is the magic, we have multiple options. 
    "the best" : ["system"],
}
`

From our small example, there are two possibilities for the word that follows "is the". The program will choose randomly between them to give us our output.


If we are able to build the hash up from the corpus we can apply the rules for our sentences to generate new sentences. The four valid outputs from this are: 

1. Democracy is the **best** system of government. 
2. Democracy is the **worst** system of government.
3. Dictatorship is the **best** system of government.
4. Dictatorship is the **worst** system of government. 
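We can confirm that these are the only four outputs by exploring every path through the model. In the sketch below the model for the tiny corpus is written out by hand, and the helper `all_sentences` is made up for this illustration. It relies on this tiny model having no loops before a full stop; on a real corpus the same exploration might never finish.

```python
# The model for the tiny corpus, written out by hand for this illustration.
model = {
    "Democracy is": ["the"],
    "is the": ["best", "worst"],
    "the best": ["system"],
    "the worst": ["system"],
    "best system": ["of"],
    "worst system": ["of"],
    "system of": ["government.", "government."],
    "of government.": ["Dictatorship"],
    "government. Dictatorship": ["is"],
    "Dictatorship is": ["the"],
}

def all_sentences(model, sentence):
    """Return every sentence that can grow from the given opening words."""
    if sentence.endswith("."):  # Rule 2: end with a full stop
        return {sentence}
    key = " ".join(sentence.split()[-2:])
    found = set()
    for next_word in model.get(key, []):
        found |= all_sentences(model, sentence + " " + next_word)
    return found

sentences = set()
for key in model:
    if key[0].isupper():  # Rule 1: start with a capital letter
        sentences |= all_sentences(model, key)

for sentence in sorted(sentences):
    print(sentence)
```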

This is a very small example, but demonstrates the main concepts pretty well. 

## Building a model
To solve the problem of text generation, we'll break the problem up into two steps: 
1. Building the model from the corpus
2. Generating new text from the model

## Activity 3 (10 minutes)
For this activity, we're going to code a simple function that builds our model. The function below takes a corpus of text as an input parameter and builds the hash of data for the model. There are some errors which need fixing before it works successfully.

In [None]:
import pprint
def simple_model(corpus):
    model = {} #An empty hash to store the values as they are added
    
    #We'll use a variable called words to store the individual words
    words = "" #Split up the corpus into the individual words (see https://docs.python.org/3/library/stdtypes.html#str.split) for a hint
    
    #As we iterate through the list of words, we need to think about where to stop.
    for i in range(len(words)): 
        phrase = "{} {}".format(words[i], words[i+1])
        next_word = words[i+2]
        
        #If the phrase does not exist as a key in the model, add it and create an empty list to store the possible options.
        if phrase not in model:
            model[phrase] = []
        
        #Add in the possible transition into the list.
        model[phrase].append(next_word)
        
    return model

Let's test that the code works on our simple example.

In [None]:
sample_corpus = "Democracy is the best system of government. Dictatorship is the worst system of government."
model = simple_model(sample_corpus)
pprint.pprint(model)

#Add in some test inputs of your own here. 
test_input = "Add your own test input here."
model = simple_model(test_input)
pprint.pprint(model)

## Activity 3a (optional)

I said we'll use the previous two words to generate the next one, but it would be nice to have a more generic version of the model that we could tune to suit our application. Experimenting with this might produce better output later on. 

If you have time, complete the function below which allows the number of states to be set. 

In [None]:
def tuneable_model(corpus, states=2):
    model = {} #An empty hash to store the values as they are added
    
    words = #Complete this as above, splitting the supplied corpus into individual words.
    
    #Just a reminder that you might want to use python's list slicing to make this more generic
    
    for i in range(len(words)-states): 
        phrase = " ".join() #Code here please :) 
        next_word = words[] #Which index is the next word? 
        
    
        #The next lines work as they did in the simpler function before. 
        if phrase not in model:
            model[phrase] = []
        
        model[phrase].append(next_word)
        
    return model

And again let's test the model in two ways on the original tiny corpus, looking at two and three words. If we've written it correctly, `two_state_model` should be the same as the model produced by the `simple_model` function. You should be able to review the output for the three-state model and see if it is correct. This is one of the advantages of having a small test set.

In [None]:
two_state_model = tuneable_model(sample_corpus,2)
print("Two state model")
print("================")
pprint.pprint(two_state_model)

print()

three_state_model = tuneable_model(sample_corpus,3)
print("Three state model")
print("================")
pprint.pprint(three_state_model)
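
If you get stuck, here is one possible way the tuneable model could be written, to compare against after you have tried it yourself. It is a sketch rather than the reference solution: it assumes the corpus is split on whitespace, and it uses a different name, `build_model`, so that it does not overwrite your own function.

```python
def build_model(corpus, states=2):
    """Map every run of `states` consecutive words to the words that follow it."""
    model = {}
    words = corpus.split()

    # Stop early enough that words[i + states] is always a valid index.
    for i in range(len(words) - states):
        phrase = " ".join(words[i:i + states])
        next_word = words[i + states]
        model.setdefault(phrase, []).append(next_word)

    return model

sample = "Democracy is the best system of government. Dictatorship is the worst system of government."
print(build_model(sample, 2)["is the"])
print(build_model(sample, 3)["Democracy is the"])
```

With three words of context, "Democracy is the" can only be followed by "best", so the three-state model has less freedom than the two-state one.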

## Generating text from the model
Now that you have a model in place you are ready to write a function that can process the model and generate new text from a starting point. Remember the two simple rules that we put in place around sentences: 

1. Sentences start with a capital letter
2. Sentences end with a full stop. 

We are going to need to expand on them slightly to deal with more real world text:

1. Sentences start with a capital letter **or an @ symbol** (to allow us to work with Twitter data).
2. Sentences end with a full stop **or an exclamation mark or question mark** (again, Twitter is a fiery place).

## Activity 4 (10 minutes) 

It's your turn to implement some code that works on this now. To make this a little easier, I've broken it down into a number of functions that we can combine into a single `generate_text` function at the end. Specifically, you will have to implement the three helper functions below:

1. `is_valid_start` should return True if the str parameter is a valid start for a sentence. 
2. `is_valid_end` should return True if the str parameter is a valid end of a sentence.
3. `words_in_string` should return the number of words in the string provided. 

I have provided some tests to help you know when your code is ready. 

In [None]:
#This function reads the str parameter and decides if it is a valid starting point (i.e starts with a capital letter)
def is_valid_start(str):
    return False

#This function reads a string parameter and decides if it is a valid end point
def is_valid_end(str):
    return False

#This function should produce the number of words in the string provided
def words_in_string(str):
    return 0

print("Testing is_valid_start")
print("\t Testing 'Hello' as valid start should be True: {}".format(is_valid_start("Hello")))
print("\t Testing 'hello' as valid start should be False: {}".format(is_valid_start("hello")))
print("\t Testing '@adam' as valid start should be True: {}".format(is_valid_start("@adam")))
print("\t Testing '!adam' as valid start should be False: {}".format(is_valid_start("!adam")))

print("Testing is_valid_end")
print("\t Testing 'end.' as valid end should be True: {}".format(is_valid_end("end.")))
print("\t Testing 'and,' as valid end should be False: {}".format(is_valid_end("and,")))
print("\t Testing 'fired!' as valid end should be True: {}".format(is_valid_end("fired!")))
print("\t Testing 'fired?' as valid end should be True: {}".format(is_valid_end("fired?")))

print("Testing words_in_string")
print("\t 'hello beautiful world!' should have length 3: {}".format(words_in_string("hello beautiful world!")))
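
If your tests are not passing, the sketch below shows one way the three helpers might be written. It is only a suggestion: it assumes words are separated by whitespace, and it names the parameter `text` rather than `str` to avoid hiding Python's built-in `str` type.

```python
def is_valid_start(text):
    """A sentence may start with a capital letter or an @ symbol."""
    return len(text) > 0 and (text[0].isupper() or text[0] == "@")

def is_valid_end(text):
    """A sentence may end with a full stop, exclamation mark or question mark."""
    return text.endswith((".", "!", "?"))

def words_in_string(text):
    """Count the whitespace-separated words in the text."""
    return len(text.split())

print(is_valid_start("@adam"), is_valid_end("fired?"), words_in_string("hello beautiful world!"))
```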

In [None]:
def generate_text(model, sentences_to_generate=5, input_words=2, minimum_sentence_length=3):
    
    #Only keys that obey the sentence starting rule can begin a sentence. Remember the rules!
    starting_points = [key for key in model if is_valid_start(key)]
    if not starting_points:
        raise ValueError("The model has no valid sentence starting points.")
    
    text = [] #A list of generated sentences
    for i in range(sentences_to_generate):
        #Choose a sentence starting location from the valid keys of the hash.
        sentence = random.choice(starting_points)
        
        #Loop until we have reached the valid end of a sentence 
        while True: 
            key = " ".join(sentence.split()[-input_words:])
            
            #The final words of the corpus may have nothing recorded after them, so stop here.
            if key not in model:
                text.append(sentence)
                break
            
            next_generated_word = random.choice(model[key])
            
            sentence += (" " + next_generated_word)
            
            if words_in_string(sentence) >= minimum_sentence_length and is_valid_end(sentence):
                text.append(sentence)
                break
        
        
    return '\n'.join(text)

Now test the generation, using the original two-sentence sample corpus.

In [None]:
model = simple_model(sample_corpus)
new_sentences = generate_text(model)

print(new_sentences)


Hopefully, you have now generated some sentences that were not in the original corpus. Given the limited size of the input, there really is not much that the system can do. Let's move on to experimenting with the system as a whole.
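One quick way to see which sentences are genuinely new is to check whether each one appears word for word in the corpus. The helper below, `original_sentences`, is made up for this sketch, and the `generated` string is a hand-written stand-in for the output of `generate_text`, which returns one sentence per line.

```python
def original_sentences(generated_text, corpus):
    """Return the generated sentences that do not appear word for word in the corpus."""
    return [sentence for sentence in generated_text.split("\n")
            if sentence and sentence not in corpus]

corpus = "Democracy is the best system of government. Dictatorship is the worst system of government."
generated = ("Democracy is the worst system of government.\n"
             "Dictatorship is the worst system of government.")

print(original_sentences(generated, corpus))
```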




## The system in action

You've worked really hard to build the system above. A standard part of any machine learning project is the collection and cleansing of a dataset to train on. This can be a time-consuming process, so we have put together some pre-prepared data for this session. We have developed the following datasets:
1. Donald Trump tweets 
2. Taylor Swift lyrics
3. A collection of scientific wikipedia articles
4. Some BBC articles on business and the economy.

We have wrapped these up into a single function called `fetch_corpus` that you can use to access the data sets as:
1. `fetch_corpus('trump')`
2. `fetch_corpus('taylor')`
3. `fetch_corpus('wikipedia')`
4. `fetch_corpus('bbc')`

These all return a string that you can feed into your models. 

Right then, time to experiment!

## Experimentation (20 mins)
If your code has worked so far with all of the tests, then you are good to go! If you've gone off-piste and haven't quite got it working then you can uncomment the import lines in the code below to bring in some pre-written versions of the functions for experimentation. 

When you have generated some interesting text, copy and paste it into the ideas tab of the Slido.

In [None]:
#Uncomment the line below to include the prewritten functions if you need them
# from markov_helper import simple_model, tuneable_model, is_valid_start, is_valid_end, words_in_string, generate_text

from markov_helper import fetch_corpus

corpus = fetch_corpus('trump') #Choose a corpus 

#Setup the model 
model = simple_model(corpus)

#Generate text
new_sentences = generate_text(model, 10)

# #Try a more advanced model
# previous_states = 3
# minimum_length = 2
# model = tuneable_model(corpus, previous_states)
# #Remember the generate_text parameters are the model, sentences to generate, input words and the minimum number of words in a sentence
# new_sentences = generate_text(model, 10, previous_states, minimum_length)

print(new_sentences)

## Conclusions
In this lab session we have used a relatively simple technique, Markov chains, to write code that generates text in the style of the original corpus.

Though simple, some of the results look plausible and generating huge amounts of content is trivial. This ease of content generation makes it hard to establish whether content is genuine, especially when it is shared and reshared many times on social media platforms. 


## Going Further

In this session we have explored a small part of a field of research called Natural Language Processing, which combines elements of linguistics, computer science and artificial intelligence. Work in this area supports generating automatic summaries of documents, machine translation (converting from one language to another) and question answering (e.g. What is the capital of France?).

You are welcome to extend this work for your own purposes. If you are interested in the field of natural language processing you might like to explore other tools and techniques. Here are a few that you might find interesting.

### [Natural Language Toolkit](https://www.nltk.org)
The Natural Language Toolkit (NLTK) is a fantastic Python library that includes a large number of functions to help with building NLP programs. The documentation is excellent. 

### Word Vectors 
If you explore the space further you will come across word vectors (also called word embeddings), which represent words as vectors of numbers learned from text. The vectors typically capture some meaning, e.g. King - Man + Woman = Queen: the relationship between man and king mirrors the one between woman and queen. A good article can be found at [The Amazing Power of Word Vectors](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/).
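To make the arithmetic concrete, here is a toy sketch. The two-dimensional vectors are invented by hand for this illustration (one axis loosely standing for "royalty", the other for "gender"); real word vectors are learned from large amounts of text and have hundreds of dimensions.

```python
import math

# Invented 2-D vectors: (royalty, gender). Real embeddings are learned and much larger.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}

def cosine_similarity(a, b):
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed one dimension at a time
target = tuple(k - m + w for k, m, w in
               zip(vectors["king"], vectors["man"], vectors["woman"]))

closest = max(vectors, key=lambda word: cosine_similarity(vectors[word], target))
print(closest)
```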

### [Text Generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation)
This tutorial covers character-based text generation using a recurrent neural network. This is a more computationally intensive method than Markov chains, but it produces some interesting results.

