# N-gram

## Preparing the data
Start by downloading some books from Project Gutenberg through NLTK. Then build three word lists from well-known works: three Jane Austen novels, three Shakespeare plays, and the King James Bible.

In [1]:
import nltk
from nltk.corpus import gutenberg

# Fetch the Gutenberg sample corpus (skipped if it is already up to date)
nltk.download('gutenberg')

print("Downloaded books:", gutenberg.fileids())

[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/fredrik/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
Downloaded books: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [2]:
# Each word list concatenates the tokens of the listed files
words_austen = gutenberg.words(['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt'])
words_shakespeare = gutenberg.words(['shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt'])
words_bible = gutenberg.words(['bible-kjv.txt'])
# Peek at the raw text and at the tokenized words
print(gutenberg.raw(['austen-emma.txt'])[:500])
print(words_austen[:100])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died t
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]


## Creating a model

In [3]:
import ngram

# Build a 1-gram model from the Austen word list
model = ngram.NGramModel(words_austen, 1)
print(model)

1-gram model with 11490 unique keys
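
The `ngram` module used above is not shown in this notebook, so the following is only a minimal sketch of one way such a model could work, not the actual implementation. It assumes that a "key" is a distinct n-word sequence and that the next word is sampled in proportion to how often it followed the previous n-1 words. The names `build_ngram_table`, `count_unique_ngrams`, and `sample_sequence` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_ngram_table(words, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])  # empty tuple when n == 1
        table[context][words[i + n - 1]] += 1
    return table


def count_unique_ngrams(table):
    """Number of distinct n-word sequences seen in the training words."""
    return sum(len(followers) for followers in table.values())


def sample_sequence(table, n, length, rng):
    """Generate `length` words, each sampled given the previous n-1 words."""
    contexts = sorted(table)  # sorted so a seeded rng gives repeatable output
    output = list(rng.choice(contexts))  # start from a context seen in training
    while len(output) < length:
        context = tuple(output[len(output) - (n - 1):])
        followers = table.get(context)
        if not followers:
            # Dead end (context only seen at the very end): restart elsewhere
            output.extend(rng.choice(contexts))
            continue
        choices, weights = zip(*followers.items())
        output.append(rng.choices(choices, weights=weights)[0])
    return output[:length]


# Toy corpus (an assumption for illustration, not the Gutenberg data)
toy_words = "the cat sat on the mat and the cat ran".split()
bigram_table = build_ngram_table(toy_words, 2)
print(count_unique_ngrams(bigram_table), "unique 2-grams")
print(" ".join(sample_sequence(bigram_table, 2, 12, random.Random(0))))
```

With `n = 1` the context is empty, so every word is drawn independently from the overall word frequencies. With a large `n` almost every context has a single follower, so the sampled text tends to reproduce long stretches of the source.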


In [4]:
pred = model.predict_sequence(100)
print(pred)
print(" ".join(pred))


['and', 'it', ',', 'renewed', 'than', 'she', 'together', 'was', 'secret', 'this', 'the', 'and', '"', 'him', 'and', 'to', 'to', '.', 'the', 'because', 'an', 'were', 'morrow', '.--', 'be', 'a', '!--', 'their', 'of', 'wound', 'from', 'was', 'forgive', 'And', 'Have', 'said', 'kept', 'In', '?"', 'altered', ',', 'into', 'natural', 'thoughts', ',', ',', 'a', 'a', 'made', ',', 'with', 'really', 'to', '."', ',', 'Mr', 'other', 'pleasant', '."', 'Exactly', 'in', 's', 'suddenly', 'like', ',', 'No', '."', ',', 'not', 'conduct', 'in', 'you', 'still', ',', 'Anne', ',', 'the', 'married', 'with', 'that', 'suspicion', '"', 'suppose', 'mouth', 'He', 'and', 'his', 'I', 'herself', 'me', ',', 'it', ',', 'no', 'Mr', 'evident', 'home', ',', 'required', 'amuses']
and it , renewed than she together was secret this the and " him and to to . the because an were morrow .-- be a !-- their of wound from was forgive And Have said kept In ?" altered , into natural thoughts , , a a made , with really to ." , Mr other 

In [5]:
# Austen data
model = ngram.NGramModel(words_austen, 6)
print(model)

6-gram model with 429414 unique keys


In [6]:
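print("Predicting using a", model)


def nice_join(predicted_words):
    """Join tokens into readable text with rough sentence casing and line breaks."""
    ret = ""
    line_length = 0  # token characters on the current line; spaces are not counted
    lastword = None
    for word in predicted_words:
        # Capitalize after sentence-ending punctuation, lowercase everything else
        if lastword in [".", "!", "?"]:
            ret += word.capitalize()
        else:
            ret += word.lower()
        line_length += len(word)
        # Break the line once it holds more than 80 token characters
        if line_length > 80:
            ret += '\n'
            line_length = 0
        else:
            ret += ' '
        lastword = word
    # Remove the space that the loop left before punctuation
    for s in ["!", "?", ".", ",", ";", ":"]:
        ret = ret.replace(" " + s, s)
    # Rejoin words that the tokenizer split around an apostrophe
    ret = ret.replace(" ' ", "'")
    return ret


pred_text = nice_join(model.predict_sequence(300))
print(pred_text)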

Predicting using a 6-gram model with 429414 unique keys
and aunt, very worthy people; i have known them all my life. They will be extremely glad to see you,
i am sure he does not. He would do any good to her, or her family; but --" " well," said mrs. Weston, " have
not you finished it yet? You would not earn a very good livelihood as a working silversmith at this
rate." " i have not had the possibility. Had you not been surrounded by other friends, i might have
been tempted to introduce a subject, to ask questions, to speak more openly than might have been strictly
correct.-- i feel that i should certainly have been impertinent." " oh!" cried jane, with a blush
and an hesitation which emma thought infinitely more becoming to her than all the elegance of all
her usual composure --" there would have been no danger. The danger would have been of my wearying
you. You could not have gratified me more than by expressing an interest --. indeed, miss woodhouse
, ( speaking more collectedly,) w
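
`nice_join` counts only token characters toward its 80-character limit and can leave punctuation at the start of a line, as in the last line of the output above. One possible alternative, sketched here under assumptions, is to attach punctuation to the preceding word first and then let `textwrap.fill` from the standard library do the wrapping. `join_and_wrap` is a hypothetical name; unlike `nice_join`, it also capitalizes the first word and does not rejoin apostrophes.

```python
import textwrap

SENTENCE_END = {".", "!", "?"}
NO_SPACE_BEFORE = {".", ",", ";", ":", "!", "?"}


def join_and_wrap(tokens, width=80):
    """Join tokens into text, then wrap it at `width` characters."""
    pieces = []
    capitalize_next = True  # assumption: the first word starts a sentence
    for token in tokens:
        word = token.capitalize() if capitalize_next else token.lower()
        if token in NO_SPACE_BEFORE and pieces:
            pieces[-1] += word  # attach punctuation to the previous word
        else:
            pieces.append(word)
        capitalize_next = token in SENTENCE_END
    return textwrap.fill(" ".join(pieces), width=width)


tokens = ["well", ",", "said", "Emma", ".", "have", "you", "finished", "?"]
print(join_and_wrap(tokens))
```

Because punctuation is glued to the preceding word before wrapping, a line can no longer begin with a comma or full stop.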

In [7]:
# Shakespeare data
model = ngram.NGramModel(words_shakespeare, 3)
print("Predicting using a", model)
print(nice_join(model.predict_sequence(300)))


Predicting using a 3-gram model with 74295 unique keys
truth then life. My mother you say, more suffer, ere i can shake off at pleasure. Thunder, and wisely
3. Where is thy name is woman. A drumme, colours, and to decline. There to meet with macbeth, macbeth
, macbeth: marry he was true to me in pompeyes porch, where it comes ham. He hath my daughter. Lord hamlet
hor. Why any thing? Laer. Say you so gospell'd their malefactions. For brutus onely ouercame himselfe
, i will, though the brightest fell. O if thou wilt not murther me? Hor. Oh good horatio, that by the
same cou'nant and carriage of the fiend, that no reuennew hast, and deny'd as tell the manner borne
: it was the modell of that philosophy, by his loued mansonry, that the vttermost, and his subiect
, strong both against the owle that shriek'd both in time will venom breed, no hat vpon his head, or
look'd banq. There's light by her grandam: shame it selfe were dimme enough, if you could but winne
the noble minde rich gifts wax

In [8]:
# Bible data
model = ngram.NGramModel(words_bible, 3)
print("Predicting using a", model)
print(nice_join(model.predict_sequence(300)))

Predicting using a 3-gram model with 444686 unique keys
man out of the wall was joined together. 40: 3 and he turned aside into the temple, nor spake it not better
than all they which were above the head of them full of mercy, and his name jedidiah, because they hired
against us former things; they are received: so he blessed him. 9: 13 in thy filthiness out of the cattle
that followed them. 1: 22 wives, submit yourselves: 22 and the desolate cities to dwell in his stead
. 19: 20 and when they had set beside gibeah. 20: 29 and ophir, and have not hearkened. 25: 6 flee out
of the month, or in the threshingfloor. 3: 13 and every man gat him into her house inclineth unto death
, and pitched in mount hermon. 5: 23 and serug lived thirty years, when he cried unto them, as if it be
sodden in a corner. 118: 11 searching what, the length of a multitude of his firstborn was joel; shemaiah
the prophet, lo, let us leave off contention, and the timber thereof and the king said, and let us see
if t

In [9]:
bad_words_url = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"