# Lab: Writing Haiku with Markov Chains

In [None]:
%%html
<link rel="stylesheet" type="text/css" href="../styles/styles.css">

This lab is inspired by [*'Impractical Python Projects: playful programming activities to make you smarter'*](https://nostarch.com/impracticalpythonprojects) by Lee Vaughan. First edition. San Francisco: No Starch Press, Inc. 2019. ISBN 9781593278915

> "The complexity of poetic interaction, the tricky dance among poet and text and reader, causes a game of hesitation. In this game, a properly programmed computer has a chance to slip in some interesting moves" (Charles Hartman, 1996 *Virtual Muse: Experiments in Computer Poetry*)

According to [Wikipedia](https://en.wikipedia.org/wiki/Haiku), **haiku** is a short form of Japanese poetry that traditionally consists of three lines composed of 17 syllables in a 5, 7, and 5 pattern. Traditionally, nature - mainly the seasons - is taken as a subject.

<center>
Worker bees can leave. <br>
Even drones can fly away. <br>
The Queen is their slave. <br>

(Chuck Palahniuk, *Fight Club*)
</center>

## Problem Statement

<div class="alert alert-problem">
Given a haiku corpus, write a program that generates haiku using a Markov chain model.
</div>

## Load and Preprocess Haiku Corpus

Let's first download our [Haiku corpus](https://github.com/rlvaugh/Impractical_Python_Projects/blob/master/Chapter_8/train.txt).

In [None]:
# haiku file for training
corpus_file = "https://raw.githubusercontent.com/rlvaugh/Impractical_Python_Projects/master/Chapter_8/train.txt"

This corpus from the aforementioned book contains almost 300 ancient and modern haiku, more than 200 of which were written by the masters. The haiku in the initial corpus were duplicated 18 times and randomly distributed across the file to increase the number of values per key.

In [None]:
import requests # URL usage

In [None]:
def load_corpus(url: str) -> str:
    """Loads and returns training corpus of haiku.

    Args:
        url (str): URL to the file containing the corpus

    Returns:
        corpus (str): loaded corpus
    """
    response = requests.get(url)
    response.raise_for_status()
    
    data_raw = response.text
        
    return data_raw

In [None]:
def get_word_set(data_raw: str) -> set:
    """Gets a word set from a string.

    Args:
        data_raw (str): corpus string

    Returns:
        set: a set of words
    """
    data_cleaned = data_raw.replace("-", " ")
    word_set = set(data_cleaned.split())
    
    return word_set
    

In [None]:
# load raw text as a string
raw = load_corpus(corpus_file)
print(raw)

# load a set of words from the raw text
word_set = get_word_set(raw)
print(word_set)


Let's also apply some very basic preprocessing that splits our corpus into words:

In [None]:
# to be able to remove punctuation
from string import punctuation

def text_preprocessing(corpus: str) -> list: 
    """Performs basic preprocessing of the corpus text: lowercases it, replaces new lines with spaces, removes punctuation and splits it into words on whitespace. 

    Args:
        corpus (str): Text corpus

    Returns:
        list: list of words given in the order as they appear in the corpus
    """
    # lowercase the text
    corpus = corpus.lower()
    # remove new line
    corpus = corpus.replace("\n", " ")
    # remove punctuation
    corpus = ''.join(char for char in corpus if char not in punctuation)
    # split into words on spaces
    corpus_list = corpus.split()
    
    return corpus_list

In [None]:
# split the raw text into a list of words
corpus_list = text_preprocessing(raw)

## Preliminaries: Counting Syllables

To respect the syllabic structure of haiku, we need to be able to count the number of syllables in words and phrases.

<div class="alert alert-problem">
<div class="problem objectives">
Given a text input, count the number of syllables in each word, and return the total syllable count.
</div>
</div>

To do so, we are going to use a pronunciation dictionary from which syllable counts can be derived, namely the [*Carnegie Mellon University Pronouncing Dictionary* (CMUdict)](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) available via [NLTK (the Natural Language Toolkit)](https://www.nltk.org/). As some words may be missing from this dictionary, we are also going to use the [`pyphen`](https://pypi.org/project/pyphen/) library, which hyphenates text using existing Hunspell hyphenation dictionaries. We are going to focus on English.

In [None]:
import nltk # NLP toolkit
from nltk.corpus import cmudict # Carnegie Mellon University Pronouncing Dictionary
import pyphen # library to count syllables using hyphenation dictionaries

nltk.download('cmudict') # download CMUdict <- takes a bit of time

In [None]:
# CMUdict
cmudict = cmudict.dict()
# hyphenation
hyph = pyphen.Pyphen(lang='en')

Let's examine how the word *minds* is represented in the CMU dictionary:

In [None]:
cmudict['minds']

The CMUdict breaks every word into a sequence of phonemes and marks vowels with a digit for stress: 0=no stress, 1=primary, and 2=secondary.

The vowels that are not pronounced are not included:

In [None]:
cmudict['care']

When several consecutive written vowels form a single phoneme, only that phoneme is listed:

In [None]:
cmudict['mouse']

Finally, a word may have multiple distinct pronunciations:

In [None]:
cmudict['read']

We can use numbers to identify the vowels and count the number of syllables:

In [None]:
w = "mistake"
print(f"Transcription of the word '{w}': {cmudict[w]}")

vow_list = [char for char in cmudict[w][0] if char[-1].isdigit()]
print(f"List of vowels: {vow_list}")

print(f"Number of syllables: {len(vow_list)}")

Even though CMUdict is a very accurate tool, some words might be missing:

In [None]:
cmudict['covid']

To deal with missing words, we can use the `pyphen` library:

In [None]:
hyph.inserted('covid')

Note that the `pyphen` representation is different from the CMUdict one. In this case, in order to count the number of syllables, we can split the result on the character `-` and count the number of elements in the list:

In [None]:
hyph_word = hyph.inserted(w)
print(hyph_word)

list_syl = hyph_word.split('-')
print(f"List of syllables of the word '{w}': {list_syl}")

print(f"Number of syllables: {len(list_syl)}")

Our main strategy is the following: for a given word, check if it is present in the CMUdict. If it is, then count the number of syllables from its phonemes. If not, use `pyphen` to count the number of syllables. 
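The lookup-then-fallback pattern can be tried on its own, without downloading anything. In the minimal sketch below, `toy_pron` is a hand-written stand-in for two CMUdict entries, and `naive_fallback()` is a hypothetical vowel-group heuristic that stands in for `pyphen`; neither name is part of the lab's code, and the heuristic is much cruder than a real hyphenation dictionary.

```python
import re

# Hand-written stand-in for CMUdict (assumption: same format, i.e. a list of
# pronunciations, each of them a list of phonemes with stress digits on vowels).
toy_pron = {
    "mistake": [["M", "IH0", "S", "T", "EY1", "K"]],
    "mouse": [["M", "AW1", "S"]],
}

def naive_fallback(word: str) -> int:
    """Hypothetical stand-in for pyphen: counts groups of written vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def syllables_in_word(word: str) -> int:
    """Dictionary lookup first, fallback heuristic second."""
    if word in toy_pron:
        # vowels are the phonemes that end with a stress digit
        return sum(1 for phoneme in toy_pron[word][0] if phoneme[-1].isdigit())
    return naive_fallback(word)

print(syllables_in_word("mistake"))  # found in the toy dictionary
print(syllables_in_word("covid"))    # not found, so the fallback is used
```

Keeping the fallback in its own function makes it easy to swap the heuristic for `pyphen` later without touching the lookup logic.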

We might also want to handle some special cases:

1. the words ending with "*'s*":

In [None]:
w = "sister's"
print(f"Original text: {w}")
if w.endswith("'s") or w.endswith("\u2019s"):
    w = w[:-2]
print(f"Updated text: {w}")

# curly apostrophe
w = "sister\u2019s"
print(f"Original text: {w}")
if w.endswith("'s") or w.endswith("\u2019s"):
    w = w[:-2]
print(f"Updated text: {w}")

2. the words with multiple possible pronunciations, for which we want to force one, like the word *sake*, which may refer to a Japanese drink and therefore have 2 syllables, in contrast to the pronunciation from the CMUdict:

In [None]:
cmudict['sake']

To deal with that, we can introduce a dictionary mapping such words to their number of syllables. If a word is present in this dictionary, then the corresponding value will be returned. Here's the code that asks a user to manually insert such exceptions.

In [None]:
import sys
import pprint
import json

def make_change_exceptions_dict(missing_words: dict) -> dict:
    """Allows to make a manual change to missing_words and returns updated dict.

    Args:
        missing_words (dict): dictionary of word - syllables count pairs

    Returns:
        dict: updated missing_words dictionary
    """
    
    print("Make change to dictionary before saving?")
    print("""
          0 - Exit & Save 
          1 - Add a word or Change a syllable count
          2 - Remove a word
          """)
    
    while True:
        choice = input("Enter your choice: ")
        if choice == '0':
            break 
        elif choice == '1':
            w = input("\nWord to add or change: ")
            num_sylls = input(f"Enter the number of syllables in {w}: ")
            if num_sylls.isdigit():
                # insert a word-num syllables pair into the dictionary
                missing_words[w] = int(num_sylls)
            else:
                print("   Not a valid answer!", file=sys.stderr)
            
        elif choice == '2':
            w = input("Enter word to delete: ")
            missing_words.pop(w, None)
        
    return missing_words

def make_exceptions_dict(exception_set: set) -> dict:
    """Returns a dictionary of words and corresponding number of syllables from a set of words.

    Args:
        exception_set (set): a set of words for which the number of syllables should be inserted.

    Returns:
        dict: word - syllable count dictionary
    """
    missing_words = {} # resulting dictionary
    
    print("Insert the number of syllables in word.")
    
    for w in exception_set:
        while True:
            num_sylls = input(f"Enter the number of syllables in {w}")
            if num_sylls.isdigit():
                break
            else:
                print("   Not a valid answer!", file=sys.stderr)
        # insert a word-num sylls pair into the dictionary
        missing_words[w] = int(num_sylls)
    
    print()    
    pprint.pprint(missing_words, width=1)
    
    missing_words = make_change_exceptions_dict(missing_words)
    print("New words or syllable changes")
    pprint.pprint(missing_words, width=1)
    
    return missing_words

def save_exceptions(missing_words: dict) -> None:
    """Saves word-syllables count dictionary to a JSON file

    Args:
        missing_words (dict): word-syllables count dictionary
    """
    json_str = json.dumps(missing_words)
    
    # the context manager closes the file even if writing fails
    with open('data/missing_words.json', 'w') as f:
        f.write(json_str)
    
    print("The dictionary has been saved to the file 'data/missing_words.json'")
    
def load_exceptions() -> dict: 
    """Loads word-syllables count dictionary from json file and returns it as a dictionary.

    Returns:
        dict: word-syllables count dictionary
    """
    missing_words = {}
    
    with open('data/missing_words.json') as f:
        missing_words = json.load(f)
    
    return missing_words    

In [None]:
# run the code if you want to update the exception list
except_set = set(['sake'])
mis_words = make_exceptions_dict(except_set)
save_exceptions(mis_words)

In [None]:
# load the dictionary
missing_words = load_exceptions()
print(missing_words)


Now, to be able to deal with a word or phrase, we need to: 
- lowercase the text, because the CMUdict keys are lowercase (`words.lower()`);
- split a phrase into words (`words.split()`);
- remove punctuation (`word.strip(punctuation)`);
- apply the count procedure:
   * handle words ending with "'s";
   * check if a word is an exception from the dictionary and, if so, use the corresponding number of syllables;
   * if a word is in CMUdict, get its number of syllables (let's force the first pronunciation for simplicity);
   * else use `pyphen`.
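The first three bullets (everything before the count itself) can be sketched separately. The helper name `normalise_phrase()` below is hypothetical and only prepares the words; the counting step is deliberately left out, since it is the subject of the exercise that follows.

```python
from string import punctuation

def normalise_phrase(phrase: str) -> list:
    """Lowercases a phrase, splits it into words, strips surrounding
    punctuation and drops a trailing possessive 's (straight or curly)."""
    words = []
    for word in phrase.lower().split():
        word = word.strip(punctuation)
        if word.endswith("'s") or word.endswith("\u2019s"):
            word = word[:-2]
        if word:  # skip tokens that were punctuation only
            words.append(word)
    return words

print(normalise_phrase("The Queen's bees, they fly!"))
```

Note that `str.strip()` only removes characters at both ends of a word, so an apostrophe inside a word survives until the possessive check.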

<div class="alert-exercise">
<h5> QUESTION 1:</h5> Write the `count_syllables()` function:

```
def count_syllables(words: str) -> int:
    """Counts syllables in English word or phrases.

    Args:
        words (str): a text to be analysed

    Returns:
        int: syllables count
    """
```

</div>

In [None]:
# ANSWER
def count_syllables(words: str) -> int:
    """Counts syllables in English word or phrases.

    Args:
        words (str): a text to be analysed

    Returns:
        int: syllables count
    """
    
    return 0

Let's test the function:

In [None]:
while True: 
    print("Test syllables counter")
    txt = input("Enter a word or a phrase (leave empty to stop): ")
    if txt == '':
        # leave the loop without terminating the notebook kernel
        break
    try:
        num_sylls = count_syllables(txt)
        print(f"Number of syllables in '{txt}' : {num_sylls}")
    except KeyError:
        print("Word not found. Try again.", file=sys.stderr)

## Markov Model for Next Word Generation

<div class="alert alert-idea">
<h4> 💡 Idea: put the best words in the best order</h4>

<div class="idea-steps">
<h5>Questions to ask:</h5>
<ul>
<li> How do we know what "the best words" are?

> We need good examples, i.e. a training corpus of haiku
</li>
<li> How do we determine their "best order"?

> Markov chains let us predict the next (subsequent) state from the properties of the current state. In our case, state = word.
</li>
</ul>
</div>

</div>

To illustrate the idea, think about the following examples. When someone says, "*May the Force be...*", we automatically think about "*with you*", or when we hear "*Houston, we have a...*", we end the phrase with "*problem*". So, certain word sequences strongly predict what comes next.

Here are some **key considerations**:

1. **Context creates probability distributions**: "*Elementary, my dear ...*" $\rightarrow$ "*Watson*" has much higher probability than random words
2. **Multiple valid completions exist**: "*Once upon a...*" $\rightarrow  \left\{\begin{array}{l} time \\ December \\ midnight\end{array}\right.$ with different probabilities
3. **Cultural knowledge influences predictions**: "*Luke, I am your ...*" strongly suggests "*father*" due to pop culture 
4. **Transition probabilities aren't uniform**: some word pairs are far more likely than others based on training data

<div class="alert alert-success">
<p>The <b>order</b> <strong>n</strong> (or <b>memory</b>) of a Markov chain determines how many previous states influence the prediction of the next state.</p>
<p>In our case, a model of order <strong>n</strong> means that the next word depends on the previous <strong>n</strong> words. </p>
<ul>
<li>Higher order = more context = potentially better predictions</li>
<li>Higher order = more parameters = more training data needed</li>
</ul>
</div>


Let's see how different orders work with the phrase: "*The quick brown fox jumps*"

|Order|Name|Context Used|Next Word Prediction|Example|
|---|---|---|---|---|
|n=0|Zero-order (Unigram)|No previous words|Based only on word frequency in corpus|$P(word) = count(word) / \text{total words}$|
|n=1|First-order (Bigram)|Previous 1 word|$P(next \mid current)$|$P(next \mid \text{"jumps"})$|
|n=2|Second-order (Trigram)|Previous 2 words|$P(next \mid prev2, prev1)$|$P(next \mid \text{"fox"}, \text{"jumps"})$|

Thus, given a corpus, it is possible to keep track of which words follow a given word ($n=1$) or a given pair of words ($n=2$). 
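As a small worked example of these probabilities, the sketch below estimates $P(word)$ and $P(next \mid current)$ by counting. The word list is a made-up toy text (not the lab corpus), and all variable names are chosen for illustration only.

```python
from collections import Counter

words = "the old oak the old pond the frog".split()  # toy text, an assumption

# Order 0: P(word) = count(word) / total number of words
unigram_counts = Counter(words)
p_the = unigram_counts["the"] / len(words)

# Order 1: P(next | current) = count(current, next) / count(current, anything)
pair_counts = Counter(zip(words, words[1:]))
pairs_starting_with_the = sum(
    count for (current, _), count in pair_counts.items() if current == "the"
)
p_old_given_the = pair_counts[("the", "old")] / pairs_starting_with_the

print(f"P(the) = {p_the:.3f}")
print(f"P(old | the) = {p_old_given_the:.3f}")
```

Here "the" occurs 3 times among 8 words, and 2 of its 3 followers are "old", which is where the two estimates come from.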

Let's consider a corpus of 2 haiku:

|Haiku 1 | Haiku 2|
|---|---|
|Beneath the old oak<br/>Dark stones line the babbling brook<br/>Sunlight through green leaves | Waves crash on dark stones<br/> Salt spray kisses the moonlight<br/>Ocean's ancient song|

Let $n=1$. Now, let's look at what a Python dictionary mapping each word to its subsequent words looks like:

```
 'ancient': ['song'],
 'babbling': ['brook'],
 'beneath': ['the'],
 'brook': ['sunlight'],
 'crash': ['on'],
 'dark': ['stones', 'stones'],
 'green': ['leaves'],
 'kisses': ['the'],
 'leaves': ['waves'],
 'line': ['the'],
 'moonlight': ["ocean's"],
 'oak': ['dark'],
 "ocean's": ['ancient'],
 'old': ['oak'],
 'on': ['dark'],
 'salt': ['spray'],
 'song': [],
 'spray': ['kisses'],
 'stones': ['line', 'salt'],
 'sunlight': ['through'],
 'the': ['old', 'babbling', 'moonlight'],
 'through': ['green'],
 'waves': ['crash']
```

Note that most words have only a single-word list in the corresponding value. This is due to our corpus size. However, if you check the results for `'the'`, you can see `['old', 'babbling', 'moonlight']`.

Note as well that for the key `'dark'`, we obtain the word `'stones'` twice. This is because we store every occurrence of a word as a separate, duplicate value. **Such repetitions let us capture the statistical frequency of words.** In our case, it means that the word `'stones'` is more likely to occur after the word `'dark'` than other potential words.
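To see why keeping duplicates is enough to encode frequencies, consider drawing uniformly from a value list. The list below is hypothetical (the word `'moon'` does not occur in the toy corpus); it is only there to give `'stones'` a competitor.

```python
import random
from collections import Counter

followers = ["stones", "stones", "moon"]  # hypothetical value list for one key

rng = random.Random(0)  # seeded so that the experiment is repeatable
draws = Counter(rng.choice(followers) for _ in range(3000))

# 'stones' fills two of the three slots, so it is drawn about twice as often
print(draws)
```

A uniform `choice()` over a list with repetitions therefore behaves like a weighted draw, without any explicit probability table.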

You can also note that the mapping continues from the first haiku to the second one, so the dictionary contains items such as `'leaves': ['waves']`.

Now, let $n=2$. In this case, the keys of the dictionary will be composed of bigrams. Let's see what dictionary we can obtain:

```
 'ancient song': [],
 'babbling brook': ['sunlight'],
 'beneath the': ['old'],
 'brook sunlight': ['through'],
 'crash on': ['dark'],
 'dark stones': ['line', 'salt'],
 'green leaves': ['waves'],
 'kisses the': ['moonlight'],
 'leaves waves': ['crash'],
 'line the': ['babbling'],
 "moonlight ocean's": ['ancient'],
 'oak dark': ['stones'],
 "ocean's ancient": ['song'],
 'old oak': ['dark'],
 'on dark': ['stones'],
 'salt spray': ['kisses'],
 'spray kisses': ['the'],
 'stones line': ['the'],
 'stones salt': ['spray'],
 'sunlight through': ['green'],
 'the babbling': ['brook'],
 'the moonlight': ["ocean's"],
 'the old': ['oak'],
 'through green': ['leaves'],
 'waves crash': ['on']
```

As haiku are short and the available training corpus is relatively small, we limit the order to $n \le 2$: this is enough context to enforce some word order while leaving room for creativity.

Now let's write Python functions that create such dictionaries of order 1 and 2, and apply them to the corpus containing the two aforementioned haiku for testing.

To do so, we first need to prepare our corpus using the `text_preprocessing()` function.

In [None]:
# test example
txt_1 = "Beneath the old oak\nDark stones line the babbling brook\nSunlight through green leaves."
txt_2 = "Waves crash on dark stones\nSalt spray kisses the moonlight\nOcean's ancient song."
# concatenate the text
txt = txt_1 + ' ' + txt_2
print(txt)

In [None]:
# apply text processing to our toy example 
# and split the text into a list of words
corp_list = text_preprocessing(txt)
print(corp_list)

<div class="alert-exercise">
<h5> QUESTION 2:</h5> Write the function `map_word_to_word()` that maps each word from a corpus to its subsequent word and returns a dictionary.

```
def map_word_to_word(corpus_list: list) -> dict[str, list]:   
    """Maps each word from a corpus to its subsequent word and returns a dictionary.

    Args:
        corpus_list (list): a list of words representing texts to be analysed

    Returns:
        dict: dictionary mapping each word to a list of its subsequent words
    """
```
</div>

In [None]:
# ANSWER
def map_word_to_word(corpus_list: list) -> dict[str, list]:   
    """Maps each word from a corpus to its subsequent word and returns a dictionary.

    Args:
        corpus_list (list): a list of words representing texts to be analysed

    Returns:
        dict: dictionary mapping each word to a list of its subsequent words
    """
        
    return {}

<div class="alert-exercise">
<h5>QUESTION 3:</h5> Test it first on our toy example. Then, apply the function to our `corpus_list` and save the result to a variable.
</div>

In [None]:
# ANSWER


In [None]:
# ANSWER


<div class="alert-exercise">
<h5>QUESTION 4:</h5> Write the function `map_bigram_to_word()` that maps each pair of words from a corpus to its subsequent word and returns a dictionary.

```
def map_bigram_to_word(corpus_list: list) -> dict[str, list]:   
    """Maps each pair of words (bigram) from a corpus to its subsequent word and return a dictionary.

    Args:
        corpus_list (list): a list of words representing texts to be analysed

    Returns:
        dict: dictionary mapping each pair of words (bigram) to a list of its subsequent words
    """
```
</div>

In [None]:
# ANSWER
def map_bigram_to_word(corpus_list: list) -> dict[str, list]:   
    """Maps each pair of words (bigram) from a corpus to its subsequent word and return a dictionary.

    Args:
        corpus_list (list): a list of words representing texts to be analysed

    Returns:
        dict: dictionary mapping each pair of words (bigram) to a list of its subsequent words
    """
    
    
    return {}

<div class="alert-exercise">
<h5> QUESTION 5: </h5> Test it first on our toy corpus. Then, apply it to `corpus_list` and save the result to a variable.
</div>

In [None]:
# ANSWER


In [None]:
# ANSWER


Now, let's write a function that applies our Markov models. Given a prefix (a single word / a pair of words), a mapping dictionary for the order-1 or order-2 model, the current number of syllables and the target budget, it returns the list of all candidate words stored in the dictionary under that prefix (key) that still fit within the syllable budget.

<div class="alert-exercise">
<h5> QUESTION 6: </h5> Write a function `word_after_prefix()` that applies our Markov models. Try it with order-1 and order-2 models.

```
def word_after_prefix(prefix: str, map_dict: dict, current_syls: int, target_syls: int) -> list:
    """Given a prefix (a single word / a pair of words), a mapping dictionary for order-1 or order-2 model, 
    the current number of syllables and the target budget, get a list of all candidate words from the dictionary.

    Args:
        prefix (str): a single word / a pair of words used as a prefix
        map_dict (dict): a mapping dictionary for order-1 or order-2 model
        current_syls (int): current number of syllables in a line
        target_syls (int): target number of syllables in a line

    Returns:
        list: a list of candidate words that fit the budget
    """
```
</div>

In [None]:
# ANSWER
def word_after_prefix(prefix: str, map_dict: dict, current_syls: int, target_syls: int) -> list:
    """Given a prefix (a single word / a pair of words), a mapping dictionary for order-1 or order-2 model, 
    the current number of syllables and the target budget, get a list of all candidate words from the dictionary.

    Args:
        prefix (str): a single word / a pair of words used as a prefix
        map_dict (dict): a mapping dictionary for order-1 or order-2 model
        current_syls (int): current number of syllables in a line
        target_syls (int): target number of syllables in a line

    Returns:
        list: a list of candidate words that fit the budget
    """
    
    
    return []

In [None]:
# ANSWER


In [None]:
# ANSWER


## Haiku Generator

<div class="alert alert-idea">
<h4> 🎯 Strategy: line by line generation using Markov models</h4>

Generate haiku (5-7-5 syllable pattern) using order-1 and order-2 Markov models, processing each line independently with inter-line connections.

<div class="idea-steps">
<h5>Line generation process:</h5>
<ul>
<li>Line 1: Start with random seed word (≤4 syllables)</li>
<li>Lines 2-3: Start using last pair from previous line as Markov key</li>
<li>Word Selection Order:
    <ul>
    <li>Word 1: Random seed</li>
    <li>Word 2: Order-1 model (bigram)</li>
    <li>Word 3+: Order-2 model (trigram)</li>
    </ul>
</li>
<li>Syllable Control: Only accept words that fit remaining syllable budget</li>
<li>Fallback: If no valid words from current context, try random prefix from dictionary</li>
</ul>
</div>

</div>


So our algorithm can be summarised as follows:

```
For each line (target: 5/7/5 syllables):
  1. Initialize line with starting word(s)
  2. While syllable budget not reached:
     a. Try appropriate model (order-1 or order-2)
     b. Get candidate words from dictionary
     c. Check each candidate's syllables
     d. If fits and completes budget → finish line
     e. If fits but incomplete → add word, continue
     f. If no candidates fit → fallback to "ghost prefix"
  3. Use last 2 words as bridge to next line
```

<center>
<img src="img/haiku_flowchart_svg.svg" alt="Flowchart of haiku generation" width="500" height="800">
</center>

> What is a "ghost prefix"? 

A **ghost prefix** is a fallback mechanism used when the current Markov chain context cannot generate any words that fit the remaining syllable budget. The prefix is called "ghost" because it's not actually part of the current line being generated - it's a temporary, borrowed context used only to find suitable words. 

The mechanism works as follows:
- Keep the current partial line unchanged
- Randomly select a different word/word-pair from the Markov dictionaries
- Use this "ghost" context to find candidate words
- If a candidate fits the syllable budget, add it to the current line
- The ghost prefix is then discarded - it never appears in the final haiku

Let's illustrate the functioning with examples:

<center>
<img src="img/markov_behavior_diagram.svg" alt="Normal vs Ghost prefix behaviour">
</center>
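The ghost-prefix mechanism can also be sketched in code on toy data. Everything below is an assumption made for illustration: the two small dictionaries, the helper names `fitting_words()` and `ghost_candidates()`, and the `max_tries` cap that keeps the search from looping forever when nothing fits.

```python
import random

# Toy data, assumptions for illustration only (not the lab's dictionaries)
toy_map = {
    "dark": ["stones", "stones"],
    "the": ["old", "babbling", "moonlight"],
    "salt": ["spray"],
}
toy_syllables = {"stones": 1, "old": 1, "babbling": 2, "moonlight": 2, "spray": 1}

def fitting_words(prefix: str, current: int, target: int) -> list:
    """Followers of `prefix` that do not exceed the syllable budget."""
    return [w for w in toy_map.get(prefix, [])
            if current + toy_syllables[w] <= target]

def ghost_candidates(current: int, target: int, rng: random.Random,
                     max_tries: int = 100) -> list:
    """Tries random 'ghost' prefixes until one yields fitting followers."""
    keys = list(toy_map)
    for _ in range(max_tries):
        ghost = rng.choice(keys)  # borrowed context, never added to the line
        candidates = fitting_words(ghost, current, target)
        if candidates:
            return candidates
    return []

rng = random.Random(1)        # seeded for repeatability
line = ["ancient", "song"]    # "song" is not a key of toy_map
current, target = 3, 5        # 3 syllables used so far, 5 allowed

candidates = fitting_words(line[-1], current, target)
if not candidates:            # normal lookup failed: borrow a ghost prefix
    candidates = ghost_candidates(current, target, rng)
line.append(rng.choice(candidates))
print(line)
```

Only the chosen follower is appended to `line`; the ghost prefix itself is thrown away once it has produced candidates.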

Let's start by picking a word at random, so that we can get a seed word to start our haiku.

<div class="alert-exercise">
<h5> QUESTION 7: </h5> Write a function that picks a random word from the corpus list and returns it together with its syllable count. If the word contains more than 4 syllables, retry in order to avoid a one-word line or exceeding the budget. Test the function.


```
def random_word(corpus_list: list) -> tuple[str, int]:
    """Selects a random word from the corpus given as a list. If the word contains more than 4 syllables then retry. 
    Returns this word and its syllable count.

    Args:
        corpus_list (list): corpus given as a list

    Returns:
        str: a word picked randomly
        int: the number of syllables in this word
    """
```
</div>


In [None]:
# pseudo-random generator
import random

In [None]:
# ANSWER
def random_word(corpus_list: list) -> tuple[str, int]:
    """Selects a random word from the corpus given as a list. If the word contains more than 4 syllables then retry. 
    Returns this word and its syllable count.

    Args:
        corpus_list (list): corpus given as a list

    Returns:
        str: a word picked randomly
        int: the number of syllables in this word
    """
    pass

In [None]:
# ANSWER
# test the function


<div class="alert-exercise">
<h5> QUESTION 8: </h5> Write a function that implements ghost prefix behaviour: creates a random ghost prefix, applies a Markov model for a given ghost prefix and returns a list of acceptable candidates.


```
def use_ghost_prefix(corpus_list: list, map_dict: dict, current_syls: int, target_syls: int) -> list:
    """Creates a ghost prefix and finds valid candidates

    Args:
        corpus_list (list): corpus split into a list of words
        map_dict (dict): dictionary mapping prefixes with consecutive words
        current_syls (int): current number of syllables 
        target_syls (int): target number of syllables

    Returns:
        list: list of valid candidate words, or empty list if none found
    """
```
</div>

In [None]:
# ANSWER
def use_ghost_prefix(corpus_list: list, map_dict: dict, current_syls: int, target_syls: int) -> list:
    """Creates a ghost prefix and finds valid candidates

    Args:
        corpus_list (list): corpus split into a list of words
        map_dict (dict): dictionary mapping prefixes with consecutive words
        current_syls (int): current number of syllables 
        target_syls (int): target number of syllables

    Returns:
        list: list of valid candidate words, or empty list if none found
    """
        
    return []

In [None]:
# ANSWER


<div class="alert-exercise">
<h5> QUESTION 9: </h5> Write a function that gets a list of acceptable candidates for a given prefix, applies the ghost prefix procedure if the initial list is empty, 
    then randomly picks a word from the resulting list and counts its number of syllables.  


```
def get_next_word(prefix: str, corpus_list: list, map_dict: dict, current_syls: int, target_syls: int) -> tuple[str, int]:
    """Gets a list of acceptable candidates for a given prefix. Applies ghost prefix procedure if the initial list is empty. 
    Randomly picks a word from the resulting list and counts its number of syllables.

    Args:
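        prefix (str): a single word / a pair of words used as a prefix
        corpus_list (list): corpus split into a list of words
        map_dict (dict): a mapping dictionary for order-1 or order-2 model
        current_syls (int): current number of syllables in a line
        target_syls (int): target number of syllables in a line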

    Returns:
        str: word, randomly picked from the candidate list
        int: number of syllables in the resulting word
    """
```
</div>

In [None]:
# ANSWER
def get_next_word(prefix: str, corpus_list: list, map_dict: dict, current_syls: int, target_syls: int) -> tuple[str, int]:
    """Gets a list of acceptable candidates for a given prefix. Applies ghost prefix procedure if the initial list is empty. 
    Randomly picks a word from the resulting list and counts its number of syllables.

    Args:
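        prefix (str): a single word / a pair of words used as a prefix
        corpus_list (list): corpus split into a list of words
        map_dict (dict): a mapping dictionary for order-1 or order-2 model
        current_syls (int): current number of syllables in a line
        target_syls (int): target number of syllables in a line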

    Returns:
        str: word, randomly picked from the candidate list
        int: number of syllables in the resulting word
    """
    
    pass

In [None]:
# ANSWER


<div class="alert-exercise">
<h5> QUESTION 10: </h5> Write a function that generates the first two words of the first line (target number of syllables = 5). The first (seed) word is picked randomly from `corpus_list`. The second word is selected using the order-1 model from `map_dict_1`, with the seed word as prefix. 


```
def get_first_two_words(corpus_list, map_dict_1, target_syls: int=5) -> tuple[list, int, list]:
    """Generates the first two words: the first (seed) word is picked randomly from corpus_list. 
    The second word is selected using order-1 model from map_dict_1 using the seed word as prefix. 

    Args:
        corpus_list (list): corpus split into a list of words
        map_dict_1 (dict): dictionary mapping a single word prefix with the next words
        target_syls (int, optional): target number of syllables. Defaults to 5.

    Returns:
        list: current line containing 2 words
        int: number of syllables in the current line
        list: two last words of the line. Empty if the target number of syllables hasn't been reached
    """
```
</div>

In [None]:
# ANSWER
def get_first_two_words(corpus_list, map_dict_1, target_syls: int=5) -> tuple[list, int, list]:
    """Generates the first two words: the first (seed) word is picked randomly from corpus_list. 
    The second word is selected using order-1 model from map_dict_1 using the seed word as prefix. 

    Args:
        corpus_list (list): corpus split into a list of words
        map_dict_1 (dict): dictionary mapping a single word prefix with the next words
        target_syls (int, optional): target number of syllables. Defaults to 5.

    Returns:
        list: current line containing 2 words
        int: number of syllables in the current line
        list: two last words of the line. Empty if the target number of syllables hasn't been reached
    """
    pass

<div class="alert-exercise">
<h5> QUESTION 11: </h5> Write a function that continues the current line until the target number of syllables is reached. 


```
def end_current_line(current_line: list, corpus_list, map_dict, current_syls, target_syls) -> list:
    """Continues the current line till reaching the limit.

    Args:
        current_line (list): initial state of the current line
        corpus_list (list): corpus split into a list of words
        map_dict (dict): dictionary mapping a prefix with the next words
        current_syls (int): initial state of the current number of syllables
        target_syls (int): target number of syllables

    Returns:
        list: finalised version of the current line
    """
```
</div>

In [None]:
# ANSWER
def end_current_line(current_line: list, corpus_list, map_dict, current_syls, target_syls) -> list:
    """Continues the current line till reaching the limit.

    Args:
        current_line (list): initial state of the current line
        corpus_list (list): corpus split into a list of words
        map_dict (dict): dictionary mapping a prefix with the next words
        current_syls (int): initial state of the current number of syllables
        target_syls (int): target number of syllables

    Returns:
        list: finalised version of the current line
    """
    pass

<div class="alert-exercise">
<h5> QUESTION 12: </h5> Write a function that creates a haiku line. To ensure semantic continuity between the lines, the last two words from the previous line are kept and used for next word generation. 

<i>Hint:</i> Use the functions `get_first_two_words` and `end_current_line()` for simplicity.

<i>Hint:</i> You can temporarily add the last word pair to the current line to ensure its use as a prefix. 

```
def get_haiku_line(corpus_list: list, map_dict_1: dict, map_dict_2: dict, end_prev_line: list, target_syls: int) -> tuple[list, list]:
    """Creates a haiku line. The first line starts with a random seed word followed by a word obtained using order-1 Markov model. 
    For the rest, order-2 Markov model is used.

    Args:
        corpus_list (list): corpus split into a list of words
        map_dict_1 (dict): dictionary mapping a single word prefix to the next words
        map_dict_2 (dict): dictionary mapping a word-pair prefix to the next words
        end_prev_line (list): last pair of words from the previous line to ensure the continuity
        target_syls (int): target number of syllables. For haiku: 5, 7, 5.

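    Returns:
        list: words of the generated line
        list: last pair of words of the generated line, to be used as end_prev_line for the next line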
    """
```
</div>

In [None]:
# ANSWER
def get_haiku_line(corpus_list: list, map_dict_1: dict, map_dict_2: dict, end_prev_line: list, target_syls: int) -> tuple[list, list]:
    """Creates a haiku line. The first line starts with a random seed word followed by a word obtained using order-1 Markov model. 
    For the rest, order-2 Markov model is used.

    Args:
        corpus_list (list): corpus split into a list of words
        map_dict_1 (dict): dictionary mapping a single word prefix to the next words
        map_dict_2 (dict): dictionary mapping a word-pair prefix to the next words
        end_prev_line (list): last pair of words from the previous line to ensure the continuity
        target_syls (int): target number of syllables. For haiku: 5, 7, 5.

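    Returns:
        list: words of the generated line
        list: last pair of words of the generated line, to be used as end_prev_line for the next line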
    """
    pass

<div class="alert-exercise">
<h5> QUESTION 13: </h5> Test your program by generating several haiku.
</div>

In [None]:
# ANSWER


**CONGRATULATIONS!!!** You have generated your first haiku.

Evaluating the quality of haiku is not a trivial task. Here are some ideas:

- Generate a large number of haiku
- "Plagiarism" detection: detect duplicates (exact and near) from the training set (e.g. using N-gram level detection)
- Rubbish haiku detection ("rubbish haiku" = clearly a random sequence of words): repetition penalties, function word overload (e.g. "the", "and", "or"), lack of content words (e.g. using POS tags like "NN", "VB", "JJ", "RB")
- Automatically evaluating "good" haiku is the most challenging.
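As one possible starting point for the "plagiarism" idea, the sketch below measures which share of a candidate's word trigrams already occurs in the training text. The function names, the choice of $n=3$ and the toy strings are assumptions for illustration; a real check would run against the full corpus and would need a threshold chosen by experiment.

```python
def ngrams(words: list, n: int) -> set:
    """Set of all n-word tuples found in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(candidate: str, training_text: str, n: int = 3) -> float:
    """Share of the candidate's n-grams that also occur in the training text."""
    cand = ngrams(candidate.lower().split(), n)
    train = ngrams(training_text.lower().split(), n)
    return len(cand & train) / len(cand) if cand else 0.0

# toy strings, not taken from the lab corpus file
training = "beneath the old oak dark stones line the babbling brook"
copied = "beneath the old oak"
fresh = "old stones beneath the brook"

print(overlap_ratio(copied, training))
print(overlap_ratio(fresh, training))
```

A ratio close to 1 suggests that the generated text mostly repeats the training set, while a ratio close to 0 means its word sequences are new.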

## Useful Links

1. [Introducing Markov Chains by HarvardX](https://www.youtube.com/watch?v=JHwyHIz6a8A)
2. [Markov Chain Stationary Distribution : Data Science Concepts](https://www.youtube.com/watch?v=4sXiCxZDrTU)