# Generating text with character n-gram models

First, we need some training data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

We will use the `gutenberg` corpus that ships with NLTK as training data. The cell below downloads it and lists the available books.

In [2]:
import nltk
nltk.download('gutenberg')

from nltk.corpus import gutenberg
print("Available books:", gutenberg.fileids())

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
Available books: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [3]:
fileids = gutenberg.fileids()[:3]
print("Using:", fileids)

raw_text = gutenberg.raw(fileids)
raw_text = raw_text.replace('\n', ' ')
data = list(raw_text)
print("Training data consists of %i characters" % len(data))

Using: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']
Training data consists of 2026385 characters


The model is the same as for word n-grams. The only difference is the training data: the sequence passed to the model is a list of individual characters instead of a list of words.
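To illustrate why the same model serves both cases, here is a minimal sketch of a count-based n-gram sampler. It is not the `NGramModel` class from `ngram.py`, whose internals are not shown here; the function names `train_ngram_counts` and `sample_sequence` are hypothetical, and the restart at unseen contexts is a simplification chosen for this example.

```python
import random
from collections import Counter, defaultdict


def train_ngram_counts(tokens, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def sample_sequence(counts, n, length, rng):
    """Sample `length` tokens by repeatedly drawing a follower of the last n-1 tokens."""
    contexts = list(counts)
    out = list(rng.choice(contexts))  # start from a randomly chosen observed context
    while len(out) < length:
        context = tuple(out[-(n - 1):]) if n > 1 else ()
        followers = counts.get(context)
        if not followers:
            # Context never seen with a follower: fall back to a random observed
            # context (a simplification, not a proper back-off scheme).
            followers = counts[rng.choice(contexts)]
        candidates, weights = zip(*followers.items())
        out.append(rng.choices(candidates, weights=weights)[0])
    return out[:length]


text = "the cat sat on the mat and the cat ate the rat"
rng = random.Random(0)

# Same functions, different tokenisation: characters versus words.
char_counts = train_ngram_counts(list(text), 3)
word_counts = train_ngram_counts(text.split(), 2)

print("characters:", "".join(sample_sequence(char_counts, 3, 40, rng)))
print("words:", " ".join(sample_sequence(word_counts, 2, 8, rng)))
```

The functions never inspect the tokens themselves, so switching from `text.split()` to `list(text)` is all it takes to go from a word model to a character model.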

In [4]:
!wget -N https://raw.githubusercontent.com/fredrikwahlberg/5LN445/master/ngram.py

from ngram import NGramModel

--2021-09-06 14:34:10--  https://raw.githubusercontent.com/fredrikwahlberg/5LN445/master/ngram.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8687 (8.5K) [text/plain]
Saving to: ‘ngram.py’


Last-modified header missing -- time-stamps turned off.
2021-09-06 14:34:11 (65.0 MB/s) - ‘ngram.py’ saved [8687/8687]



In [5]:
model1 = NGramModel(data, 1)
print(model1)

model2 = NGramModel(data, 2)
print(model2)

model3 = NGramModel(data, 3)
print(model3)

model4 = NGramModel(data, 4)
print(model4)

1-gram model with 81 unique keys
2-gram model with 1542 unique keys
3-gram model with 10212 unique keys
4-gram model with 40030 unique keys
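The key counts grow quickly with n, yet they cover a shrinking share of the combinations that are possible in principle. The sketch below puts numbers on this, under the assumption that a "key" is a distinct n-gram of characters, which would make the 81 keys of the 1-gram model the size of the character set. The helper `count_distinct_ngrams` is hypothetical and only shows how such a count could be obtained; it is not taken from `ngram.py`.

```python
def count_distinct_ngrams(tokens, n):
    """Number of different n-grams occurring in a token sequence."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})


# Key counts copied from the printed models above.
reported_keys = {1: 81, 2: 1542, 3: 10212, 4: 40030}
alphabet_size = reported_keys[1]  # assumption: 1-gram keys == distinct characters

for n, keys in reported_keys.items():
    possible = alphabet_size ** n
    print(f"{n}-gram: {keys} of {possible} possible ({keys / possible:.4%})")

# The helper applied to a toy sequence.
toy = list("abracadabra")
print(count_distinct_ngrams(toy, 1), count_distinct_ngrams(toy, 2))
```

Under that assumption, most character combinations never occur in the training text, which is why longer contexts produce output that looks increasingly like real English.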


In [6]:
print(model1.predict_sequence(20))

['v', ' ', ' ', 'e', 'a', 't', 'a', 'e', 'n', 'r', ' ', 's', 'y', 'a', 's', 't', ' ', 'h', 's', ' ']


In [7]:
print("unigram:", "".join(model1.predict_sequence(200)))

unigram: ddoanut otaate?gtaayweh  im   ea trrtaorh  ef nidnc nhccwet rtwbxocnaiAsgihtn rn,ooa rlen g,ar.hera ntere oe,timeaio n  .v onnet  nosnltsp.,eihcnaiayeea myeiwm Morl  e TlHbet u i  "odaooh eeytat sM -g


In [8]:
print("bigram:", "".join(model2.predict_sequence(200)))

bigram: e oben man bin con d wn fr me trichear t, nold he wherreeystincond m Mre, d tomast iawe or. ise hishatoned ortiod t s dittticest wknckn, wireximpesnil ar. Win f s- tes tourdsero me avareshale whin ted


In [9]:
print("trigram:", "".join(model3.predict_sequence(200)))

trigram:  he send that give he guance to post donind much washe by to-daught a boaccought, whery uncomanningand nortabsed nothing whind ther so but thave wrompostaill ch theanch as hild to Emmost befordly re t


In [10]:
print("quadgram:", "".join(model4.predict_sequence(200)))

quadgram:  a made excuse's right was the Elinor's regants, but or Sir fromorning than howed rathe mome the spothem as to the poings ing, and may the Mary.  Trouse would not_ is it sits of cotted fortain fillevo
