# N-gram

## Preparing the data
We start by downloading some books from Project Gutenberg through NLTK. We then build three word lists from well-known works: three Jane Austen novels, three Shakespeare plays, and the King James Bible.

In [1]:
import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')

print("Downloaded books:", gutenberg.fileids())

[nltk_data] Error loading gutenberg: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>
Downloaded books: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [2]:
words_austen = gutenberg.words(['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt'])
words_shakespeare = gutenberg.words(['shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt'])
words_bible = gutenberg.words(['bible-kjv.txt'])
print(gutenberg.raw(['austen-emma.txt'])[:500])
print(words_austen[:100])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died t
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]


## Creating a model

In [3]:
import ngram

model = ngram.NGramModel(words_austen, 1)
print(model)

1-gram model with 11490 unique keys
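
The `ngram` module imported above is not shown in this notebook. To illustrate the idea, here is a minimal sketch of how such a model might work. It assumes the model counts which word follows each run of `n - 1` words and samples from those counts. The class name `SimpleNGramModel` and its `seed` parameter are hypothetical and are not part of the `ngram` module; the real `NGramModel` may store its keys differently.

```python
import random
from collections import Counter, defaultdict


class SimpleNGramModel:
    """Hypothetical stand-in for illustration; not the ngram module used above."""

    def __init__(self, words, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        # Map each context (the n-1 preceding words) to counts of the next word.
        self.followers = defaultdict(Counter)
        words = list(words)
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            self.followers[context][words[i + n - 1]] += 1

    def predict_sequence(self, length):
        # Start from a randomly chosen context seen in the training data.
        context = self.rng.choice(list(self.followers))
        result = []
        for _ in range(length):
            counts = self.followers.get(context)
            if not counts:
                # Dead end (context only seen at the end of the corpus): restart.
                context = self.rng.choice(list(self.followers))
                counts = self.followers[context]
            # Sample the next word in proportion to how often it followed the context.
            word = self.rng.choices(list(counts), weights=list(counts.values()))[0]
            result.append(word)
            # Slide the context window forward by one word.
            context = (context + (word,))[1:] if self.n > 1 else ()
        return result


toy_model = SimpleNGramModel("a b a b a b".split(), 2, seed=0)
print(toy_model.predict_sequence(6))
```

With `n = 1` the context is empty, so every word is drawn independently according to its overall frequency. Larger values of `n` condition each word on more of the preceding text.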


In [4]:
pred = model.predict_sequence(100)
print(pred)
print(" ".join(pred))


['others', 'encourager', 'clear', 'the', 'a', ':', 'answers', 'he', 'My', 'and', 'her', ',', 'time', 'the', '.', 'at', 'the', 'might', 'for', 'Harriet', 'thoughtful', 'his', 'of', 'clear', 'with', 'puppyism', 'the', '.', ',', 'him', 'I', 'know', 'from', 'make', 'the', 'approbation', 'said', 'an', 'looked', 'understand', 'of', '.', 'was', 'will', 'interest', '.', '.', 'like', 'began', ',', 'is', 'visits', '.', 'a', '.', ',', 'being', ',', 'his', 'condition', 'a', 'case', 'again', '--', 'have', 'hear', 'will', 'how', 'criticism', 'similarity', ',', 'it', 'take', ',', 'you', ',', 'over', 'And', 'the', 'watch', ',', 'this', 'meant', 'a', 'there', ';', 'It', ',', '!', 'Barton', 'near', 'two', 'necessity', '.', 'be', 'the', ',', 'was', 'Ma', 'other']
others encourager clear the a : answers he My and her , time the . at the might for Harriet thoughtful his of clear with puppyism the . , him I know from make the approbation said an looked understand of . was will interest . . like began , is v

In [5]:
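def nice_join(predicted_words):
    """Join predicted tokens into readable, roughly line-wrapped text."""
    ret = ""
    i = 0  # characters emitted on the current line (spaces are not counted)
    lastword = None
    for word in predicted_words:
        # Capitalize the first word after sentence-ending punctuation; lowercase the rest.
        if lastword in [".", "!", "?"]:
            ret += word.capitalize()
        else:
            ret += word.lower()
        i += len(word)
        # Break the line once more than 80 word characters have accumulated.
        if i > 80:
            ret += '\n'
            i = 0
        else:
            ret += ' '
        lastword = word
    # Remove the space that was placed before punctuation marks.
    for s in ["!", "?", ".", ",", ";", ":"]:
        ret = ret.replace(" " + s, s)
    # Rejoin words that were split around an apostrophe.
    ret = ret.replace(" ' ", "'")
    return ret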

In [6]:
# Austen data
model = ngram.NGramModel(words_austen, 4)
print("Created a", model)

Created a 4-gram model with 389276 unique keys
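
The jump from 11490 keys for the 1-gram model to 389276 keys here is consistent with the keys being distinct word tuples, although the internals of the `ngram` module are not shown. The sketch below counts distinct n-word tuples in a toy token list to show how that number tends to grow with `n`. The helper `count_unique_ngrams` is a hypothetical name introduced only for this illustration.

```python
def count_unique_ngrams(words, n):
    """Return the number of distinct n-word tuples in a token sequence."""
    words = list(words)
    return len({tuple(words[i:i + n]) for i in range(len(words) - n + 1)})


tokens = "the cat sat on the mat and the cat ran".split()
for n in (1, 2, 3):
    print(n, count_unique_ngrams(tokens, n))
```

Longer tuples repeat less often, so the number of keys moves toward the number of positions in the text as `n` increases.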


In [7]:
print("Predicting using a", model)
print(nice_join(model.predict_sequence(200)))

Predicting using a 4-gram model with 389276 unique keys
. It might be weeks, it might not be too nice, or too observant if elizabeth were his object; and that
as to a home, indeed! You, miss woodhouse, i wish with all my heart, and made her his secret standard
of perfection in woman;-- and many a long october and november evening must be struggled through
at hartfield, she was at taunton with the admiral, and had even changed her seat, or altering her attitude
, lost in her own estimation, meant nothing. One says those sort of things, occur to remind her of what
anxiety was -- but had her marriage been happy, so cheerful, so affectionate? And now, poor girl!
She was as well satisfied with what she did not think he could have heard it so lately. Penelope, my
dear -- but it does not make me unhappy, i assure you i never was more surprized -- but it won't ask me
. I am persuaded that mrs. Elton, in all mrs. Jennings, " we shall see 


In [8]:
# Shakespeare data
model = ngram.NGramModel(words_shakespeare, 1)
print(model)

1-gram model with 8960 unique keys


In [9]:
print("Predicting using a", model)
print(nice_join(model.predict_sequence(200)))

Predicting using a 1-gram model with 8960 unique keys
his a. Heere why marry on memory'is yet.:. His the tertia written court in. Thee. Euen view mortals
liue. A too yonger seene the stage any pastime, heare his'haue? Halfe, of this tongue deed himselfe
all'and me done. Monstrous spend after. Did truly why - she dropping pillowes is. Greefes helpfull
, come take. Of. He:, husband wisest detecting will. Gho what'buriall such plague, summons keepe
vnholy this hush that of and, can appeare,,'there, ',: the. There be one come for neither.:,, lord
seemes it seene i come yong receyue in to giue well tyrant, bent thee; old royalty no this lord you in
away'doe meanes he, way worme groomes does therefore macduffe. Malc wrong of edge my & as and to, liu
. Of onely d whips on enter'this his knockes secret blood augurers faire. You he in all shall this the
how st drinke most this:, 


In [10]:
# Bible data
model = ngram.NGramModel(words_bible, 1)
print(model)

1-gram model with 13769 unique keys


In [11]:
print("Predicting using a", model)
print(nice_join(model.predict_sequence(300)))

Predicting using a 1-gram model with 13769 unique keys
not am will with oppressor hand the king as speech: taken name man thee and of should of his turn heave
twenty hunger loosed handmaid cannot died 21 them of. Horses mouths water about which closed was
lord go law: heaven; and that; unto that no for: man and the coals priest and began the the cloud them
legs for him beam east smitten with in and of 14 closest it and fire nakedness 12 mind of name afterward
of for which of children we the 119 long the there i even golden the as sons 29 in smote the unto up 17 house
out way an, shall. The had raddai eye side, that of for thou she and unto crucify truth that:.. To brethren
your wicked see. In he in: became: 11 heart whosoever all my: to commandment for day, against lord
fruit thee they said that reward wisdom might and the son shall that had your, he a as he let river the
dwelt father: have 23 in yieldeth and 7 sword temptation of, meremoth 22 their art behold in favour
of we understoo

In [12]:
#bad_words_url = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"