## Working with the Brown Corpus (using NLTK)

To get the Brown corpus, do **either** this:

In [4]:
import nltk
from nltk.corpus import brown

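or, if that doesn't work because the corpus data has not been downloaded yet (NLTK raises a `LookupError` when you first use `brown`), **this**:

In [None]:
import nltk
nltk.download("brown")  # fetch the Brown corpus data once
from nltk.corpus import brown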

The following code cell assigns the list-like sequence of all roughly 1.2 million word tokens in `brown` to the name `bw`:

In [5]:
bw = brown.words()

The value of the name `bw`:

In [6]:
bw

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

To get a data structure with all the word counts, do:

In [7]:
fd = nltk.FreqDist(bw)

`fd` is an NLTK `FreqDist` (frequency distribution): essentially a Python dictionary that maps each word type in Brown to its count.

In [8]:
fd["computer"]

13

That means there are 13 *tokens* of the word type 'computer' in the Brown corpus.  Note that case matters:

In [9]:
fd["Computer"]

0

In [13]:
fd['the'],fd['The']

(62713, 7258)

The `fd` FreqDist also has some features customized for word frequencies, like the `most_common` method, which returns the top n most frequent words and their counts. For example:

In [7]:
fd.most_common(3)

[('the', 62713), (',', 58334), ('.', 49346)]

To find the number of word tokens in Brown, find the length of `bw`.

In [8]:
len(bw)

1161192

So, about 1.16 million word tokens. You can also get this number from the `FreqDist` itself, using another of its customized methods:

In [9]:
fd.N()

1161192

To find out the number of **word types** in Brown, just get the length of the `fd` dictionary:

In [10]:
len(fd)

56057
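To make the token/type distinction concrete, here is a minimal sketch on a tiny made-up word list. It relies only on the fact that `FreqDist` is built on the standard-library `collections.Counter`, so the same operations can be tried without downloading anything. The names `toy_tokens` and `toy_fd` are hypothetical and the sentence is not taken from Brown.

```python
from collections import Counter

# Hypothetical toy "corpus" (not from Brown): 9 word tokens.
toy_tokens = ["The", "cat", "saw", "the", "dog", "and", "the", "dog", "ran"]

# Counter plays the role of nltk.FreqDist here.
toy_fd = Counter(toy_tokens)

n_tokens = sum(toy_fd.values())  # total occurrences, what fd.N() reports
n_types = len(toy_fd)            # distinct words, what len(fd) reports

print(n_tokens, n_types)
print(toy_fd["the"], toy_fd["The"], toy_fd["computer"])
```

Every occurrence adds to the token count, while the type count only grows when a new spelling appears. As with `fd["Computer"]` above, looking up a word that never occurs gives 0 rather than an error.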

It is handy to ignore case for purposes of this exercise. To do that, just lowercase all the words in Brown before making the frequency distribution. This changes the number of word types, but not the number of word tokens:

In [14]:
# This uses what Pythonistas call a list comprehension.
# It's just a compact way of writing a loop and collecting results in  a list.
bw = brown.words()
bw = [w.lower() for w in bw]
len(bw)

1161192

The length of the **corpus** (the body of data) hasn't changed. 

In [15]:
fd = nltk.FreqDist(bw)
len(fd)
#49815

49815

The dictionary has shrunk from 56,057 to 49,815 word types because we now ignore case distinctions such as the one between 'The' and 'the'.

In [16]:
fd['the']

69971

In [17]:
fd['The']

0
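The effect of lowercasing can be checked on a small scale. The sketch below uses a made-up token list and `collections.Counter` as a stand-in for `FreqDist`; `toy_tokens`, `cased`, and `lowered` are hypothetical names. It illustrates the same pattern as the Brown counts above, where 62,713 ('the') plus 7,258 ('The') gives the 69,971 reported after lowercasing.

```python
from collections import Counter

# Hypothetical toy tokens (not from Brown) with mixed case.
toy_tokens = ["The", "dog", "saw", "the", "Dog", "and", "the", "cat"]

cased = Counter(toy_tokens)                        # case-sensitive counts
lowered = Counter(w.lower() for w in toy_tokens)   # case-insensitive counts

# The number of tokens is unchanged by lowercasing...
print(sum(cased.values()), sum(lowered.values()))
# ...but the number of types can only stay the same or shrink.
print(len(cased), len(lowered))
# Counts of case variants are merged under the lowercase form.
print(lowered["the"], cased["the"] + cased["The"])
```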

You may find it useful to take a look at [Chapter One of the NLTK book](http://www.nltk.org/book/ch01.html).

### Constructing a bigram model

In [18]:
# The Brown Corpus
from nltk.corpus import brown
from nltk import FreqDist
from nltk import bigrams

Here is some helpful code for computing the counts in the **Brown Corpus** needed for a bigram model.

In [19]:
# Lowercase all the words, then count unigrams (fd) and bigrams (fd2)
words = [w.lower() for w in brown.words()]
fd = FreqDist(words)
fd2 = FreqDist(bigrams(words))

Number of times the word *bald* appeared in the Brown corpus:

In [20]:
fd["bald"]

5

Number of times the bigram *stormy weather* appeared in  the Brown corpus:

In [21]:
fd2["stormy","weather"]

1

Total count of the word tokens in the Brown corpus is unchanged:

In [22]:
fd.N()

1161192

Total count of all the bigram tokens in the Brown corpus:

In [23]:
fd2.N()

1161191
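There is one fewer bigram token than word tokens (1,161,191 versus 1,161,192) because each bigram pairs a word with its successor, and the last word has none. The sketch below reproduces this with plain Python on a made-up sentence. The helper `adjacent_pairs` is a hypothetical stand-in for `nltk.bigrams` called without padding, not NLTK's own implementation.

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Return the list of (word, next_word) pairs in a token list."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical toy tokens (not from Brown).
toy_tokens = ["the", "stormy", "weather", "hit", "the", "coast"]

pairs = adjacent_pairs(toy_tokens)
pair_counts = Counter(pairs)  # plays the role of fd2

print(len(toy_tokens), len(pairs))          # n tokens give n - 1 bigrams
print(pair_counts["stormy", "weather"])     # tuple key, as in fd2[...]
```

Writing `pair_counts["stormy", "weather"]` works because Python treats the comma-separated index as the tuple `("stormy", "weather")`, which is the same convention used for `fd2` above.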

Twenty most common words in the Brown corpus together with their counts:

In [24]:
fd.most_common(20)

[('the', 69971),
 (',', 58334),
 ('.', 49346),
 ('of', 36412),
 ('and', 28853),
 ('to', 26158),
 ('a', 23195),
 ('in', 21337),
 ('that', 10594),
 ('is', 10109),
 ('was', 9815),
 ('he', 9548),
 ('for', 9489),
 ('``', 8837),
 ("''", 8789),
 ('it', 8760),
 ('with', 7289),
 ('as', 7253),
 ('his', 6996),
 ('on', 6741)]

Twenty most common bigrams in the corpus together with their counts:

In [25]:
fd2.most_common(20)

[(('of', 'the'), 9717),
 ((',', 'and'), 6302),
 (('.', 'the'), 6081),
 (('in', 'the'), 6025),
 ((',', 'the'), 3787),
 (('.', '``'), 3515),
 (('to', 'the'), 3484),
 (("''", '.'), 3332),
 ((';', ';'), 2784),
 (('.', 'he'), 2660),
 (('on', 'the'), 2466),
 (('?', '?'), 2346),
 (('and', 'the'), 2246),
 (("''", ','), 2032),
 ((',', 'but'), 1856),
 (('for', 'the'), 1852),
 (('.', 'it'), 1836),
 (('to', 'be'), 1718),
 (('at', 'the'), 1655),
 (('.', 'in'), 1619)]
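One common way to turn unigram and bigram counts like these into a bigram model is the maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1). The sketch below is a minimal illustration under that assumption, on a made-up token list with `collections.Counter` standing in for `FreqDist`. The function `bigram_prob` and the toy data are hypothetical, and no smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter

# Hypothetical toy tokens (not from Brown).
toy_tokens = ["the", "cat", "sat", "on", "the", "mat",
              "and", "the", "cat", "ran"]

unigram_counts = Counter(toy_tokens)                       # like fd
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))   # like fd2

def bigram_prob(w1, w2):
    """Unsmoothed estimate of P(w2 | w1) from the toy counts."""
    if unigram_counts[w1] == 0:
        return 0.0  # w1 never seen: no estimate, report 0 by convention
    return bigram_counts[w1, w2] / unigram_counts[w1]

print(bigram_prob("the", "cat"))   # 2 of the 3 occurrences of "the"
print(bigram_prob("the", "mat"))   # 1 of the 3 occurrences of "the"
print(bigram_prob("the", "dog"))   # bigram never observed
```

Using the unigram count as the denominator slightly undercounts for the very last token of the corpus, which starts no bigram. On a corpus the size of Brown that difference is negligible.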

In [26]:
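import numpy as np

def restrict_most_common(fd, f_rest):
    """
    Return the items x of fd that satisfy the predicate f_rest, ordered
    as in fd.most_common(), together with an array of their counts.
    Raises ValueError if no item satisfies f_rest.
    """
    # Keep the (item, count) pairs that pass the filter, then unzip them.
    bs, cts = zip(*[(x, ct) for (x, ct) in fd.most_common() if f_rest(x)])
    return bs, np.array(cts)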

Example: the words in Brown that start with *grou*, most frequent first, with their counts:

In [27]:
bs, cts = restrict_most_common(fd, lambda x: x.startswith("grou"))

In [28]:
bs

('group',
 'ground',
 'groups',
 'grounds',
 'groupings',
 'grounded',
 'groundwave',
 'grouped',
 'grouping',
 'groundwork',
 'grounding',
 'grounder',
 "group's",
 'groundless',
 'ground-glass',
 'ground-swell',
 'ground-level',
 'ground-truck')

In [29]:
cts

array([390, 186, 125,  58,   9,   6,   6,   5,   4,   3,   2,   1,   1,
         1,   1,   1,   1,   1])
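Note that `restrict_most_common` unpacks the result of `zip(*...)` into two names, so it raises a `ValueError` when no item passes the filter. If that case matters, a variant along the following lines may help. This is only a sketch: the name `restrict_most_common_safe` is hypothetical, it returns a plain list of counts instead of a NumPy array, and it is demonstrated on a made-up `Counter` instead of the Brown `FreqDist`.

```python
from collections import Counter

def restrict_most_common_safe(fd, f_rest):
    """
    Like restrict_most_common, but returns an empty tuple and an empty
    list when nothing satisfies f_rest instead of raising ValueError.
    """
    matches = [(x, ct) for (x, ct) in fd.most_common() if f_rest(x)]
    if not matches:
        return (), []
    items, counts = zip(*matches)
    return items, list(counts)

# Hypothetical toy counts (not from Brown).
toy_fd = Counter(["grape"] * 4 + ["group"] * 3 + ["ground"] * 2 + ["groups"])

items, counts = restrict_most_common_safe(toy_fd, lambda x: x.startswith("grou"))
print(items, counts)

none_items, none_counts = restrict_most_common_safe(toy_fd, lambda x: x.startswith("zzz"))
print(none_items, none_counts)
```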