# N-gram

## Preparing the data
Let's start by downloading the Project Gutenberg sample corpus that ships with NLTK. Then we create three word lists from famous works: three Jane Austen novels, three Shakespeare plays, and the King James Bible.

In [1]:
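import nltk
from nltk.corpus import gutenberg

# Fetch the Gutenberg sample corpus (a no-op if it is already installed).
nltk.download('gutenberg')

# List the texts available in the corpus.
print("Downloaded books:", gutenberg.fileids())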

[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/fredrik/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
Downloaded books: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [2]:
# Each call returns the tokens of the listed files as one concatenated sequence.
words_austen = gutenberg.words(['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt'])
words_shakespeare = gutenberg.words(['shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt'])
words_bible = gutenberg.words(['bible-kjv.txt'])
# Peek at the raw text of Emma and at the start of the Austen token list.
print(gutenberg.raw(['austen-emma.txt'])[:500])
print(words_austen[:100])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died t
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]


## Creating a model

In [3]:
import ngram

model = ngram.NGramModel(words_austen, 1)
print(model)

1-gram model with 11490 unique keys
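
The summary above reports a number of unique keys. The `ngram` module is not shown in this notebook, so how it defines a key is unknown. As a rough sketch, one plausible reading is that every distinct run of n consecutive tokens counts once. The helper `count_ngrams` and the toy sentence below are illustrative assumptions, not the notebook's own code.

```python
from collections import Counter


def count_ngrams(words, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    words = list(words)
    # Slide a window of width n over the token list.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


tokens = "the cat sat on the mat and the cat slept".split()
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)

print(len(unigrams), "unique 1-grams")  # 7 unique 1-grams
print(len(bigrams), "unique 2-grams")   # 8 unique 2-grams
print(bigrams[("the", "cat")])          # 2
```

Under this reading, the number of keys can only grow or stay close to the token count as n increases, because longer runs repeat less often.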


In [4]:
pred = model.predict_sequence(100)
print(pred)
print(" ".join(pred))


['considered', 'admitted', 'might', 'as', 'desired', 'and', '."', 'amusement', 'believe', 'You', 'he', 'them', 'change', 'all', ',', 'could', ',', 'in', 'explanation', 'them', 'politely', 'where', 'I', 'been', 'no', 'that', 'still', 'on', 'felt', 'a', 'Sir', 'gentleman', ',', 'the', 'conversations', 'their', 'man', 'No', 'it', 'could', 'the', 'and', 'especially', ';', 'do', 'appeal', 'woman', 'lightly', 'evening', 'sit', 'discerned', 'body', 'scrupulous', 'his', 'sending', 'no', 'venture', 'and', 'necessity', 'in', 'had', 'the', '"', 's', 'a', 'surrounded', 'smart', 'into', 'man', '."', 'of', 'aye', 'so', 'was', 'the', ',', ',', 'sufficiently', 'them', 'to', 'so', 'however', 'was', 'It', 'a', 'composure', 'to', 'many', 'now', 'is', 'her', 'sat', 'table', 'excuse', 'enigmas', 'was', 'Lady', 'there', 'he', 'she']
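considered admitted might as desired and ." amusement believe You he them change all , could , in explanation them politely where I been no that still on felt a Sir gentleman , the conversations their man No it could the and especially ; do appeal woman lightly evening sit discerned body scrupulous his sending no venture and necessity in had the " s a surrounded smart into man ." of aye so was the , , sufficiently them to so however was It a composure to many now is her sat table excuse enigmas was Lady there he she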
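
The implementation of `predict_sequence` is not shown either. Here is a minimal sketch of how such a generator could work, assuming a simple Markov-chain approach: record which words followed each context of n-1 words, then repeatedly sample a follower and slide the context forward. The class `SimpleNGramModel`, its `seed` parameter, and the restart on a dead end are hypothetical choices made for illustration, and its key counts need not match the numbers printed by `ngram.NGramModel`.

```python
import random
from collections import defaultdict


class SimpleNGramModel:
    """Hypothetical stand-in for ngram.NGramModel, for illustration only."""

    def __init__(self, words, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        # Map each context (a tuple of n-1 words) to the words that followed it.
        self.followers = defaultdict(list)
        words = list(words)
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            self.followers[context].append(words[i + n - 1])
        if not self.followers:
            raise ValueError("need at least n words to build the model")

    def predict_sequence(self, length):
        """Generate `length` words by sampling followers of the current context."""
        context = self.rng.choice(list(self.followers))
        out = []
        for _ in range(length):
            candidates = self.followers.get(context)
            if not candidates:
                # Dead end (context never seen with a follower): restart.
                context = self.rng.choice(list(self.followers))
                candidates = self.followers[context]
            # Repeated followers are kept in the list, so frequent ones
            # are proportionally more likely to be drawn.
            word = self.rng.choice(candidates)
            out.append(word)
            if self.n > 1:
                # Slide the context window forward by one word.
                context = context[1:] + (word,)
        return out


toy_words = "the cat sat on the mat and the cat slept on the mat".split()
toy_model = SimpleNGramModel(toy_words, 2, seed=0)
print(" ".join(toy_model.predict_sequence(8)))
```

With n = 1 the context is always the empty tuple, so words are drawn independently according to their frequency, which is consistent with the unordered look of the 1-gram output above.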

In [None]:
# Austen data
model = ngram.NGramModel(words_austen, 6)
print(model)

In [11]:
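print("Predicting using a", model)

def nice_join(predicted_words):
    """Join tokens into readable text, wrapping lines at roughly 80 characters."""
    ret = ""
    line_len = 0  # characters of word text on the current line (spaces not counted)
    lastword = None
    for word in predicted_words:
        # Capitalize the word that follows a sentence end; lowercase all others.
        if lastword in [".", "!", "?"]:
            ret += word.capitalize()
        else:
            ret += word.lower()
        line_len += len(word)
        # Break the line once it holds more than 80 characters of word text.
        if line_len > 80:
            ret += '\n'
            line_len = 0
        else:
            ret += ' '
        lastword = word
    # Remove the space that tokenization leaves before punctuation.
    for s in ["!", "?", ".", ",", ";", ":"]:
        ret = ret.replace(" " + s, s)
    # Rejoin apostrophes that were split into separate tokens.
    ret = ret.replace(" ' ", "'")
    return ret

pred_text = nice_join(model.predict_sequence(300))
print(pred_text)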

Predicting using a 6-gram model with 429414 unique keys
to papa and mamma's farther pressing invitations to come and dine with them whenever they asked him
! But that would be all over now.-- poor fellow!-- no more exploring parties to donwell made for _her_
. Oh! No; there would be a mrs. Knightley to throw cold water on every thing.-- extremely disagreeable
! But she was not at all in want of any thing." when emma afterwards heard that jane fairfax had been
seen wandering about the meadows, at some distance from highbury, on the afternoon of the very day
on which she had, under the plea of being unequal to any exercise, so peremptorily refused to go out
with her in the carriage, leave her at the abbey mill, while she drove a little farther, and call for
her again so soon, as to allow no time for insidious applications or dangerous recurrences to the
past, and give the most decided proof of what degree of intimacy was chosen for the future. She could
think of nothing better: and thoug

In [6]:
# Shakespeare data
model = ngram.NGramModel(words_shakespeare, 3)
print("Predicting using a", model)
print(nice_join(model.predict_sequence(300)))


Predicting using a 3-gram model with 74295 unique keys
is heyre too? Rosse. Ile tell thee she is dead bap. Nor did you nothing heare? Qu. Come now, in which
there are not liuing, to stay the grinding of the state of man, that she should locke her selfe. For
me, dar'd in the secret parts of him. This i made a good end; for euen now was heauie on me. I know that
we are gouern'd me by claudio, sirs: awake var. Cals my lord. Lenox. Heere was a caesar: you haue forgot
the taste of death, and dasht the braines were out, and the rule. That way the noise is this impon'd
for his issue, whose murther yet is but scratcht withall: ile be heere againe: i doubt not of standing
. Publius good cheere, there's done: i thinke we are, and volumnius. Thou com'st vnto a happy byrth
, but no more beleeu'd, belou'd, shall expell this something setled matter in his triumph mur. May
' t; he grew into his sight, not i'th'time, to thinke that caesar were dead, and gestures yeeld them
, as who goes farthest cassi

In [7]:
# Bible data
model = ngram.NGramModel(words_bible, 3)
print("Predicting using a", model)
print(nice_join(model.predict_sequence(300)))

Predicting using a 3-gram model with 444686 unique keys
shall be taken, and in the congregation of the candlestick of pure gold shall he be exalted. 89: 12
there is neither new moon be gone; 2: 21 tell ye your souls. 1: 14 declare ye among yourselves, because
in his hand upon the altar, and praised the lord, which shall be desolate because of the lord, be restored
to the servants of ishbosheth unto david, to pluck them out of the old prophet came to present themselves
before him, which was in favour with all men have kept the feast had tasted the good knowledge of god
. 29: 6 and abram hearkened to the grave; and there is no secret that they are given out of the word of
god, the love he had put them not: 115: 5 then pharaoh also called the circumcision in the morning shalt
thou make a battlement for thy servant, who is with thee? What is the life of sarah abraham's doing
, and the transgression of those that were in dan. 48: 10 because, even i, having obtained eternal
redemption for th

In [8]:
bad_words_url = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"