### N-Gram and Markov Chain
To better understand n-grams, let's build them manually in Python. As an example, we will use [a pangram](https://en.wikipedia.org/wiki/Pangram), a sentence that contains every letter of the alphabet.

In [317]:
pangram = "The quick brown fox jumps over the lazy dog"
pangram_german = "Franz jagt im komplett verwahrlosten Taxi quer durch Bayern"

def generate_ngram(text, n):
    # n is the number of words per n-gram and must be at least 1 (2 = bigrams, 3 = trigrams)
    ngrams = []
    words = text.split(" ")
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        ngrams.append(gram)
    return ngrams

unigrams = pangram.split(" ")
print("Unigrams:", unigrams)

bigrams = generate_ngram(pangram, 2)
print("Bigrams:", bigrams)

trigrams = generate_ngram(pangram, 3)
print("Trigrams:", trigrams)

Unigrams: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Bigrams: [['The', 'quick'], ['quick', 'brown'], ['brown', 'fox'], ['fox', 'jumps'], ['jumps', 'over'], ['over', 'the'], ['the', 'lazy'], ['lazy', 'dog']]
Trigrams: [['The', 'quick', 'brown'], ['quick', 'brown', 'fox'], ['brown', 'fox', 'jumps'], ['fox', 'jumps', 'over'], ['jumps', 'over', 'the'], ['over', 'the', 'lazy'], ['the', 'lazy', 'dog']]
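Counting which word follows which is the step that connects n-grams to Markov chains. The cell below is a minimal sketch of that idea using only the standard library; the name `followers` and the choice to lowercase the text are assumptions made for illustration, not part of the original notebook.

```python
from collections import Counter, defaultdict

pangram = "The quick brown fox jumps over the lazy dog"
# Lowercase so that "The" and "the" count as the same word (an illustrative choice)
words = pangram.lower().split(" ")

# For every word, count the words that directly follow it (word-level bigrams)
followers = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    followers[current][following] += 1

print(dict(followers["the"]))
```

Here "the" is the only word with more than one possible successor ("quick" and "lazy"), which is exactly the kind of branching a Markov chain samples from.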


### Exercise 1
Modify the code above to generate bigrams and trigrams based on letters instead of words.

In [524]:
def generate_ngram_character(text, n):
    # n is the number of letters per n-gram and must be at least 1
    pass

unigrams = list(pangram)
print("Unigram:", unigrams)



Unigram: ['T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's', ' ', 'o', 'v', 'e', 'r', ' ', 't', 'h', 'e', ' ', 'l', 'a', 'z', 'y', ' ', 'd', 'o', 'g']
Bigram: ['Th', 'he', 'e ', ' q', 'qu', 'ui', 'ic', 'ck', 'k ', ' b', 'br', 'ro', 'ow', 'wn', 'n ', ' f', 'fo', 'ox', 'x ', ' j', 'ju', 'um', 'mp', 'ps', 's ', ' o', 'ov', 've', 'er', 'r ', ' t', 'th', 'he', 'e ', ' l', 'la', 'az', 'zy', 'y ', ' d', 'do', 'og']
Trigram: ['The', 'he ', 'e q', ' qu', 'qui', 'uic', 'ick', 'ck ', 'k b', ' br', 'bro', 'row', 'own', 'wn ', 'n f', ' fo', 'fox', 'ox ', 'x j', ' ju', 'jum', 'ump', 'mps', 'ps ', 's o', ' ov', 'ove', 'ver', 'er ', 'r t', ' th', 'the', 'he ', 'e l', ' la', 'laz', 'azy', 'zy ', 'y d', ' do', 'dog']


You can also use the NLTK library to generate n-grams. To install it, run the cell below:

In [30]:
import sys
!{sys.executable} -m pip install nltk


In [34]:
from nltk import ngrams

# words
bigrams = list(ngrams(pangram.split(" "), 2))
trigrams = list(ngrams(pangram.split(" "), 3))

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

# letters
bigrams = list(ngrams(pangram, 2))
trigrams = list(ngrams(pangram, 3))

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)


Bigrams: [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog')]
Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Bigrams: [('T', 'h'), ('h', 'e'), ('e', ' '), (' ', 'q'), ('q', 'u'), ('u', 'i'), ('i', 'c'), ('c', 'k'), ('k', ' '), (' ', 'b'), ('b', 'r'), ('r', 'o'), ('o', 'w'), ('w', 'n'), ('n', ' '), (' ', 'f'), ('f', 'o'), ('o', 'x'), ('x', ' '), (' ', 'j'), ('j', 'u'), ('u', 'm'), ('m', 'p'), ('p', 's'), ('s', ' '), (' ', 'o'), ('o', 'v'), ('v', 'e'), ('e', 'r'), ('r', ' '), (' ', 't'), ('t', 'h'), ('h', 'e'), ('e', ' '), (' ', 'l'), ('l', 'a'), ('a', 'z'), ('z', 'y'), ('y', ' '), (' ', 'd'), ('d', 'o'), ('o', 'g')]
Trigrams: [('T', 'h', 'e'), ('h', 'e', ' '), ('e', ' ', 'q'), (' ', 'q', 'u'), ('q', 'u', 'i'), ('u', 'i', 'c'), ('i', 'c', 'k'), ('c', 'k'

---
### Text Generation with Markov Chain

We will use the Markovify library to experiment with Markov chains in Python.

In [525]:
import sys
!{sys.executable} -m pip install markovify


After installation, import the library into the notebook.

In [526]:
import markovify

Let's create a Markov chain from the example sentence we used in the previous session, "A rose is a rose is a rose". We set the `state_size` parameter to 1 here, because its default value is 2. The state size is the number of preceding words the model looks at when choosing the next word. Try changing the value to 2 and observe how the printed result changes.

In [538]:
import json
text = "A rose is a rose is a rose ."
# Build the model.
text_model = markovify.Text(text, state_size=1)
model_str = text_model.chain.to_json()
model_json = json.loads(model_str)
print(model_str)

# Note: by default, markovify rejects generated sentences that overlap too much
# with the original text, so make_sentence() can return None.
for i in range(2):
    print(text_model.make_sentence())


[[["___BEGIN__"], {"A": 1}], [["A"], {"rose": 1}], [["rose"], {"is": 2, ".": 1}], [["is"], {"a": 2}], [["a"], {"rose": 2}], [["."], {"___END__": 1}]]
A rose .
A rose .
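To see where the numbers in the printed model come from, the transition counts can be rebuilt by hand. The cell below is a sketch using only the standard library, not Markovify's actual implementation; the helper name `build_chain` is hypothetical, and the `___BEGIN__` / `___END__` markers are copied from the output above.

```python
from collections import defaultdict

BEGIN, END = "___BEGIN__", "___END__"  # marker names as they appear in the printed model

def build_chain(words, state_size=1):
    """Count, for every state (a tuple of state_size words), which word follows it."""
    chain = defaultdict(lambda: defaultdict(int))
    # Pad the sentence so the first word has a state before it and the last word has a follower
    padded = [BEGIN] * state_size + words + [END]
    for i in range(len(words) + 1):
        state = tuple(padded[i:i + state_size])
        follower = padded[i + state_size]
        chain[state][follower] += 1
    return {state: dict(followers) for state, followers in chain.items()}

chain = build_chain("A rose is a rose is a rose .".split(" "), state_size=1)
for state, followers in chain.items():
    print(state, followers)
```

Calling `build_chain(..., state_size=2)` makes every state a pair of words, which is why a larger state size leaves the model fewer choices at each step.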


You can use the following code to print the model in a more structured form. The output shown below comes from a run with `state_size=2`, so each state is a pair of words.

In [534]:
def custom_print(obj, indent=0):
    space = " " * indent

    if isinstance(obj, dict):
        print(space + "{")
        for k, v in obj.items():
            print(space + "  " + f"{k}: ", end="")
            custom_print(v, indent + 2)
        print(space + "}")

    elif isinstance(obj, list):
        print("[", end="")
        for i, v in enumerate(obj):
            if i > 0:
                print(", ", end="")
            custom_print(v, 0)
        print("]")

    else:
        # print primitives without quotes
        print(obj, end="")


custom_print(model_json, indent=4)


[[[___BEGIN__, ___BEGIN__]
, {
  A: 1}
]
, [[___BEGIN__, A]
, {
  rose: 1}
]
, [[A, rose]
, {
  is: 1}
]
, [[rose, is]
, {
  a: 2}
]
, [[is, a]
, {
  rose: 2}
]
, [[a, rose]
, {
  is: 1  .: 1}
]
, [[rose, .]
, {
  ___END__: 1}
]
]
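Generating a sentence from such a model amounts to a weighted random walk: start at the begin marker, pick the next word with probability proportional to its count, and stop at the end marker. The cell below sketches this for the `state_size=1` model printed earlier. The dictionary is typed in by hand from that output, and the function name `walk` and the `max_words` safety limit are assumptions for illustration; Markovify's own generator does more (for example, the overlap check).

```python
import random

# Transition counts copied from the state_size=1 model printed earlier
chain = {
    "___BEGIN__": {"A": 1},
    "A": {"rose": 1},
    "rose": {"is": 2, ".": 1},
    "is": {"a": 2},
    "a": {"rose": 2},
    ".": {"___END__": 1},
}

def walk(chain, rng, max_words=50):
    """Random walk through the chain; counts act as sampling weights."""
    state, words = "___BEGIN__", []
    while len(words) < max_words:  # safety limit, because the chain contains a loop
        options = chain[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == "___END__":
            break
        words.append(state)
    return " ".join(words)

rng = random.Random(0)  # seeded only to make the run repeatable
for _ in range(3):
    print(walk(chain, rng))
```

After "rose" the walk continues with "is" two times out of three on average, so sentences of different lengths appear across runs.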


Now that we have a good understanding of how Markovify works, let's feed the model more data. In the week3 folder, you can find a text file containing <i>The Picture of Dorian Gray</i> by Oscar Wilde, downloaded from Project Gutenberg.

In [540]:
# Get raw text as string.
with open("pg26740.txt") as f:
    text = f.read()

meta, sep, content = text.partition("THE PREFACE") # using the .partition() method to get rid of text about the book
print(meta)

﻿The Project Gutenberg eBook of The Picture of Dorian Gray
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Picture of Dorian Gray

Author: Oscar Wilde

Release date: October 1, 2008 [eBook #26740]
                Most recently updated: March 10, 2010

Language: English

Credits: Produced by David Clarke, Chuck Greif and the Online
        Distributed Proofreading Team at http://www.pgdp.net


*** START OF THE PROJECT GUTENBERG EBOOK THE PICTURE OF DORIAN GRAY ***




Produced by David Clarke, Chuck Greif and the Online
Distributed Proofreading Team at http://www.pgdp.net









T

In [541]:
# Build the model.
text_model = markovify.Text(content)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

If you received the news of Sibyl Vane.
He had been our ambassador at Madrid when Isabella was young, and catching the meaning of his exquisite youth and beauty.
The poor chap was killed in a number of other people.
He was not so abstruse as I was dominated, soul, brain, and the piano and let his fingers upon the latch.
She was conscious of sharing with the idea, and grew louder.


We can also specify the maximum character length for the generated sentence with `make_short_sentence()`.


In [543]:
# Print three randomly-generated sentences of no more than 20 characters
for i in range(3):
    print(text_model.make_short_sentence(20, tries=100)) # more tries lower the chance of getting None for very short sentences

Don't look so good.
I must go and look.
As he was safe.


By alternating short and long sentences, we can give the generated text some rhythm.

In [545]:
# Helper that returns a sentence longer than min_chars and no longer than max_chars
def make_sentence_range(min_chars, max_chars, tries=10):
    # Retry up to `tries` times, because make_short_sentence() can return None
    # or a sentence that is too short
    for _ in range(tries):
        sentence = text_model.make_short_sentence(max_chars)
        if sentence is not None and len(sentence) > min_chars:
            return sentence
    return None  # no sentence in the requested range was found

print(text_model.make_short_sentence(30, tries=100))
print(make_sentence_range(80, 100))
print(text_model.make_short_sentence(20, tries=100))

A sense of humanity.
Everything could be destroyed by the midnight train, and I have a spy in one's house.
Credit is the type.


By default, Markovify expects text with normal sentence punctuation. If your text uses line breaks as separators instead (poems, lists of words, headlines...), use `markovify.NewlineText()`.

In [550]:
with open("../week2/list.txt") as f:
    text = f.read()
model = markovify.NewlineText(text, state_size=1)

for i in range(3):
    print(model.make_sentence(tries=100))


There's a fire starting in my heart inside of your hand
What is your view of the party and the morning seems so grey
There's a fire starting in my heart inside of your hand
But I don't know what


Markovify uses word-level Markov chains by default, but we can modify its behavior to build character-level models instead. Character-level Markov models are especially useful for generating new words or names. In this example, we’ll use a list of color names to train a character-level model and generate new, Markovify-based color names.

In [552]:
# We are defining our own class here based on the NewlineText class from markovify
class CharacterText(markovify.NewlineText):
    # Split text into characters
    def word_split(self, sentence):
        return list(sentence)

    # Join characters back into a string
    def word_join(self, words):
        return "".join(words)

with open("100_color_names.txt") as f:
    text = f.read()

# Build the model.
model = CharacterText(text, state_size=2)

print("Generated color names:")
# Generate text
for i in range(5):
    print(model.make_sentence(tries=100))

Generated color names:
Peria
Amberry
Bricotta
Rusty
Zaffronze
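To make the character-level idea concrete, here is a small standard-library sketch of what such a model does: each name is treated as one sequence, each state is the previous two characters, and a new name is produced by a weighted random walk. The helper names and the six sample names are hypothetical stand-ins (the contents of `100_color_names.txt` are not shown here), and this is not Markovify's own code.

```python
import random
from collections import Counter, defaultdict

BEGIN, END = "___BEGIN__", "___END__"

def build_char_chain(names, state_size=2):
    """Count which character follows each run of state_size characters."""
    chain = defaultdict(Counter)
    for name in names:
        padded = [BEGIN] * state_size + list(name) + [END]
        for i in range(len(name) + 1):
            state = tuple(padded[i:i + state_size])
            chain[state][padded[i + state_size]] += 1
    return chain

def generate_name(chain, rng, state_size=2, max_len=20):
    """Walk the chain character by character until the end marker is drawn."""
    state, chars = (BEGIN,) * state_size, []
    while len(chars) < max_len:
        options = chain[state]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        chars.append(nxt)
        state = state[1:] + (nxt,)  # slide the window forward by one character
    return "".join(chars)

# Hypothetical sample data, standing in for the color name file
names = ["Amber", "Apricot", "Burgundy", "Bronze", "Crimson", "Saffron"]
chain = build_char_chain(names)
rng = random.Random(42)
for _ in range(5):
    print(generate_name(chain, rng))
```

With so few training names, many generated names will simply repeat an input; a larger list gives the walk more places to branch.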


#### Combining two Markov models


You can combine two different Markov models and mix them in any ratio you want.

In [453]:
# Model A loads Frankenstein
with open("pg84.txt") as f:
    text_a = f.read()

meta_a, sep_a, content_a = text_a.partition("Letter 1")
#print(meta_a)
model_a = markovify.Text(content_a, state_size=3)

# Model B loads Thus Spake Zarathustra
with open("pg1998.txt") as f:
    text_b = f.read()

meta_b, sep_b, content_b = text_b.partition("\n\nTHUS SPAKE ZARATHUSTRA.\n\n")
# print(meta_b)
model_b = markovify.Text(content_b, state_size=3)

model_combined = markovify.combine([ model_a, model_b ], [ 1, 1 ])  # merging them with the ratio of 1:1
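Conceptually, combining two models with weights means scaling each model's transition counts by its weight and adding them up. The cell below illustrates that idea on two toy dictionaries. It is a sketch under that assumption rather than the library's code, and the function name `combine_chains` and the sample counts are made up for the example.

```python
from collections import defaultdict

def combine_chains(chains, weights):
    """Scale each chain's counts by its weight and sum them per state and follower."""
    combined = defaultdict(lambda: defaultdict(float))
    for chain, weight in zip(chains, weights):
        for state, followers in chain.items():
            for word, count in followers.items():
                combined[state][word] += count * weight
    return {state: dict(followers) for state, followers in combined.items()}

# Toy transition counts (invented for illustration)
chain_a = {("the",): {"monster": 3, "ice": 1}}
chain_b = {("the",): {"eagle": 2, "monster": 1}}

# Give the second model twice the influence of the first
combined = combine_chains([chain_a, chain_b], [1, 2])
print(combined[("the",)])
```

In this toy case "monster" ends up with a weight of 3 × 1 + 1 × 2 = 5, while "eagle" rises to 4 because the second model counts double.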


We can generate a more coherent paragraph by fixing the first word of each sentence with `make_sentence_with_start()`.

In [553]:
first = model_combined.make_sentence_with_start("I", tries=100)
second = model_combined.make_sentence_with_start("Zarathustra", tries=100)
third = model_combined.make_sentence_with_start("He", tries=100)
# Skip any sentence that came back as None so that joining does not fail
paragraph = " ".join(s for s in (first, second, third) if s is not None)

print(paragraph)

I instantly wrote to Geneva; nearly two months after the death of my adversary. Zarathustra recognises another higher man in the moon than in the woman. He endeavoured to soothe me as a wretch doomed to ignominy and perdition.


For more detailed documentation, check out [the official GitHub Repository](https://github.com/jsvine/markovify).

### Exercise 2

Find two text files with distinct styles and merge them with Markovify.