# ngrams

Although NLTK is not necessary for computing ngrams, it does make the job easier. This notebook demos some parts of the NLTK API for ngrams.
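
An ngram is a sliding window of `n` consecutive items, so it can be computed in plain Python. Here is a minimal sketch with no NLTK involved; the helper name `simple_ngrams` is hypothetical and not part of any library.

```python
def simple_ngrams(tokens, n):
    """Return an iterator of n-tuples (hypothetical helper, not an NLTK function)."""
    tokens = list(tokens)
    # zip n staggered slices of the sequence; zip stops at the shortest slice
    return zip(*(tokens[i:] for i in range(n)))

print(list(simple_ngrams("the cat sat on the mat".split(), 2)))
```

This copies the input into a list, which is fine for a demo but less memory-friendly than a generator-based approach on very large inputs.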

## References:

**API:** https://www.nltk.org/api/nltk.html#nltk.util.ngrams

Exploring the novel _Dracula_, available on Project Gutenberg:

In [1]:
import nltk

Download a text file version of _Dracula_ into our VM:

In [4]:
import requests

In [10]:
book = requests.get('http://www.gutenberg.org/cache/epub/345/pg345.txt')

In [13]:
book.text[:1000]

'\ufeffThe Project Gutenberg EBook of Dracula, by Bram Stoker\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org/license\r\n\r\n\r\nTitle: Dracula\r\n\r\nAuthor: Bram Stoker\r\n\r\nRelease Date: August 16, 2013 [EBook #345]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\r\n\r\n\r\n\r\n\r\nProduced by Chuck Greif and the Online Distributed\r\nProofreading Team at http://www.pgdp.net (This file was\r\nproduced from images generously made available by The\r\nInternet Archive)\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n                                DRACULA\r\n\r\n\r\n\r\n\r\n\r\n                                DRACULA\r\n\r\n                                  _by_\r\n\r\n                              Bram Stoker\r\n\r\n                        [Illu
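
The leading `\ufeff` in the output above is a Unicode byte order mark (BOM) that survived decoding. A minimal sketch of two ways to drop it, shown on a short stand-in byte string rather than the downloaded file (the names `raw`, `text`, and `text_alt` are illustrative):

```python
# Stand-in for the first bytes of a UTF-8 file that starts with a BOM
raw = b"\xef\xbb\xbfThe Project Gutenberg EBook of Dracula"

# Option 1: decode with 'utf-8-sig', which strips a leading BOM if present
text = raw.decode("utf-8-sig")

# Option 2: remove the BOM character from an already-decoded string
text_alt = raw.decode("utf-8").lstrip("\ufeff")

print(repr(text[:3]))
```

With `requests`, a similar effect could presumably be had by decoding `book.content` this way, or by stripping the character from `book.text`. The rest of this notebook leaves the BOM in place, which is why it shows up in the first token.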

Built-in support for bigrams and trigrams. A string is a sequence of characters, so passing the raw text yields character bigrams:

In [16]:
nltk.bigrams(book.text)

<generator object bigrams at 0x1165b2f50>

In [20]:
gen = nltk.bigrams(book.text)
for i in range(10):
    print(next(gen))

('\ufeff', 'T')
('T', 'h')
('h', 'e')
('e', ' ')
(' ', 'P')
('P', 'r')
('r', 'o')
('o', 'j')
('j', 'e')
('e', 'c')


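Cleaning up: tokenize the text into words first, so that the bigrams are pairs of words rather than pairs of characters (sentence segmentation comes later):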

In [21]:
words = nltk.word_tokenize(book.text)

In [22]:
len(words)

193789

In [24]:
gen = nltk.bigrams(words)
for i in range(10):
    print(next(gen))

('\ufeffThe', 'Project')
('Project', 'Gutenberg')
('Gutenberg', 'EBook')
('EBook', 'of')
('of', 'Dracula')
('Dracula', ',')
(',', 'by')
('by', 'Bram')
('Bram', 'Stoker')
('Stoker', 'This')


In [42]:
gen = nltk.trigrams(words)
for i in range(10):
    print(next(gen))

('\ufeffThe', 'Project', 'Gutenberg')
('Project', 'Gutenberg', 'EBook')
('Gutenberg', 'EBook', 'of')
('EBook', 'of', 'Dracula')
('of', 'Dracula', ',')
('Dracula', ',', 'by')
(',', 'by', 'Bram')
('by', 'Bram', 'Stoker')
('Bram', 'Stoker', 'This')
('Stoker', 'This', 'eBook')


Some quick frequency calculations:

In [43]:
# Use a fresh generator: `gen` above has already yielded its first 10 trigrams
fdist = nltk.FreqDist(nltk.trigrams(words))

In [44]:
fdist.most_common(15)

[((':', '--', "''"), 445),
 (('*', '*', '*'), 393),
 ((',', 'and', 'I'), 209),
 ((',', 'and', 'the'), 197),
 (('.', "''", '``'), 153),
 (('.', '*', '*'), 110),
 (('said', ':', '--'), 106),
 ((',', 'and', 'we'), 104),
 ((',', 'for', 'I'), 97),
 (('me', ',', 'and'), 96),
 ((',', 'and', 'that'), 92),
 ((',', 'and', 'he'), 90),
 (('?', "''", '``'), 90),
 (('.', 'It', 'is'), 90),
 (('.', 'It', 'was'), 86)]
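
Most of the top trigrams above consist largely of punctuation tokens. One possible refinement, sketched here on a toy token list rather than the full novel, is to lowercase the tokens and keep only alphabetic ones before counting. The filter rule is an assumption about what counts as a word, not something this notebook does.

```python
import nltk

# Toy stand-in for the tokenized novel
tokens = ["I", "said", ":", "--", "''", "I", "said", "it", ",",
          "and", "I", "said", "it", "."]

# Keep alphabetic tokens only and lowercase them (a simple, lossy filter)
clean = [t.lower() for t in tokens if t.isalpha()]

# Pass a freshly created trigram generator straight to FreqDist
fdist = nltk.FreqDist(nltk.trigrams(clean))
print(fdist.most_common(1))
```

One trade-off: once punctuation is removed, trigrams can span what used to be a clause or sentence boundary, so the counts describe the filtered token stream rather than the original text.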

For 4-grams and higher, use the general `ngrams` function from `nltk.util`, which takes `n` as an argument:

In [28]:
from nltk.util import ngrams

In [30]:
gen = ngrams(words,4)
for i in range(10):
    print(next(gen))

('\ufeffThe', 'Project', 'Gutenberg', 'EBook')
('Project', 'Gutenberg', 'EBook', 'of')
('Gutenberg', 'EBook', 'of', 'Dracula')
('EBook', 'of', 'Dracula', ',')
('of', 'Dracula', ',', 'by')
('Dracula', ',', 'by', 'Bram')
(',', 'by', 'Bram', 'Stoker')
('by', 'Bram', 'Stoker', 'This')
('Bram', 'Stoker', 'This', 'eBook')
('Stoker', 'This', 'eBook', 'is')


In [31]:
gen = ngrams(words,5)
for i in range(10):
    print(next(gen))

('\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of')
('Project', 'Gutenberg', 'EBook', 'of', 'Dracula')
('Gutenberg', 'EBook', 'of', 'Dracula', ',')
('EBook', 'of', 'Dracula', ',', 'by')
('of', 'Dracula', ',', 'by', 'Bram')
('Dracula', ',', 'by', 'Bram', 'Stoker')
(',', 'by', 'Bram', 'Stoker', 'This')
('by', 'Bram', 'Stoker', 'This', 'eBook')
('Bram', 'Stoker', 'This', 'eBook', 'is')
('Stoker', 'This', 'eBook', 'is', 'for')
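
`ngrams` is the general form: `bigrams` and `trigrams` give the same tuples as `ngrams` with `n` set to 2 and 3. Without padding, a sequence of N tokens yields N - n + 1 ngrams, so the 193789 tokens above give 193785 5-grams. A small check on a toy token list:

```python
from nltk.util import bigrams, ngrams, trigrams

tokens = "the count is not at home tonight".split()  # 7 toy tokens

# bigrams/trigrams should match ngrams with n=2 and n=3
same_bigrams = list(bigrams(tokens)) == list(ngrams(tokens, 2))
same_trigrams = list(trigrams(tokens)) == list(ngrams(tokens, 3))

# Without padding, N tokens produce N - n + 1 ngrams
counts = {n: len(list(ngrams(tokens, n))) for n in range(1, 6)}
print(same_bigrams, same_trigrams, counts)
```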


By examining the NLTK API, we see optional parameters `pad_left`, `pad_right`, `left_pad_symbol`, and `right_pad_symbol` for padding the start and end of the sequence.

In [40]:
gen = ngrams(words, 3, pad_left=True, left_pad_symbol='<s>')
for i in range(10):
    print(next(gen))

('<s>', '<s>', '\ufeffThe')
('<s>', '\ufeffThe', 'Project')
('\ufeffThe', 'Project', 'Gutenberg')
('Project', 'Gutenberg', 'EBook')
('Gutenberg', 'EBook', 'of')
('EBook', 'of', 'Dracula')
('of', 'Dracula', ',')
('Dracula', ',', 'by')
(',', 'by', 'Bram')
('by', 'Bram', 'Stoker')


Next, add sentence segmentation so that `<s>` and `</s>` markers can be placed at sentence boundaries:

In [32]:
sents = nltk.sent_tokenize(book.text)

In [33]:
len(sents)

8569

In [37]:
sents[:10]

['\ufeffThe Project Gutenberg EBook of Dracula, by Bram Stoker\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.',
 'You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org/license\r\n\r\n\r\nTitle: Dracula\r\n\r\nAuthor: Bram Stoker\r\n\r\nRelease Date: August 16, 2013 [EBook #345]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\r\n\r\n\r\n\r\n\r\nProduced by Chuck Greif and the Online Distributed\r\nProofreading Team at http://www.pgdp.net (This file was\r\nproduced from images generously made available by The\r\nInternet Archive)\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n                                DRACULA\r\n\r\n\r\n\r\n\r\n\r\n                                DRACULA\r\n\r\n                                  _by_\r\n\r\n                              Bram Stoker\r\n\r\n                        [

In [36]:
sents[100:110]

['With some difficulty I got a\r\nfellow-passenger to tell me what they meant; he would not answer at\r\nfirst, but on learning that I was English, he explained that it was a\r\ncharm or guard against the evil eye.',
 'This was not very pleasant for me,\r\njust starting for an unknown place to meet an unknown man; but every one\r\nseemed so kind-hearted, and so sorrowful, and so sympathetic that I\r\ncould not but be touched.',
 'I shall never forget the last glimpse which I\r\nhad of the inn-yard and its crowd of picturesque figures, all crossing\r\nthemselves, as they stood round the wide archway, with its background of\r\nrich foliage of oleander and orange trees in green tubs clustered in the\r\ncentre of the yard.',
 'Then our driver, whose wide linen drawers covered\r\nthe whole front of the box-seat--"gotza" they call them--cracked his big\r\nwhip over his four small horses, which ran abreast, and we set off on\r\nour journey.',
 'I soon lost sight and recollection of ghostly fe
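
One way to combine sentence segmentation with boundary markers is to tokenize and pad each sentence separately, so that no ngram crosses a sentence boundary. This is a sketch under stated assumptions: the toy sentences stand in for the output of `sent_tokenize`, `str.split` stands in for `word_tokenize`, and the name `padded_bigrams` is illustrative.

```python
from nltk.util import ngrams

# Toy stand-ins for segmented sentences (already separated into tokens by spaces)
toy_sents = ["I am Dracula .", "Enter freely ."]

padded_bigrams = []
for sent in toy_sents:
    tokens = sent.split()
    # Pad each sentence on its own, marking its start and end
    padded_bigrams.extend(
        ngrams(tokens, 2,
               pad_left=True, left_pad_symbol="<s>",
               pad_right=True, right_pad_symbol="</s>"))

print(padded_bigrams)
```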

In [38]:
gen = ngrams(sents,3)
for i in range(10):
    print(next(gen))

('\ufeffThe Project Gutenberg EBook of Dracula, by Bram Stoker\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.', 'You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org/license\r\n\r\n\r\nTitle: Dracula\r\n\r\nAuthor: Bram Stoker\r\n\r\nRelease Date: August 16, 2013 [EBook #345]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\r\n\r\n\r\n\r\n\r\nProduced by Chuck Greif and the Online Distributed\r\nProofreading Team at http://www.pgdp.net (This file was\r\nproduced from images generously made available by The\r\nInternet Archive)\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n                                DRACULA\r\n\r\n\r\n\r\n\r\n\r\n                                DRACULA\r\n\r\n                                  _by_\r\n\r\n                              Bram Stoker\r\n\r\n                        [I