# Language Model

In [1]:
import nltk

Download the Brown corpus if needed:

In [2]:
from nltk.corpus import brown
try:
    brown.words()  # corpora are loaded lazily, so access the data to check that it is installed
except LookupError:
    nltk.download('brown')

In [3]:
' '.join(brown.words()[:20])

"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that"

In [4]:
len(brown.words())

1161192

## Count the occurrences
We use a dictionary that gives the number of occurrences for each word:

In [5]:
brown_counts = {}

In [6]:
for w in brown.words():
    if w in brown_counts:
        brown_counts[w] += 1
    else:
        brown_counts[w] = 1

In [7]:
brown_counts['be']

6344

In [8]:
brown_counts['The']

7258
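
The counting loop above can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on a toy token list; the list `tokens` is an illustrative assumption and not Brown data.

```python
from collections import Counter

# Toy token list standing in for brown.words() (illustrative assumption).
tokens = "the cat sat on the mat and the dog sat too".split()

# Manual counting, equivalent to the if/else loop above.
manual_counts = {}
for w in tokens:
    manual_counts[w] = manual_counts.get(w, 0) + 1

# Counter does the same bookkeeping in one call.
counter_counts = Counter(tokens)

print(manual_counts["the"], counter_counts["the"])
print(counter_counts["unicorn"])  # a Counter reports 0 for missing keys instead of raising KeyError
```

Unlike the plain dictionary, a `Counter` does not raise `KeyError` for a word that never occurred, which is convenient when looking up arbitrary words.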

## Frequency distribution using NLTK

In [9]:
freq_brown = nltk.FreqDist(brown.words())

In [10]:
freq_brown['The']

7258

In [11]:
freq_brown['Fulton']

17

Show 20 words from the frequency distribution.
Notice that we select the first 20 keys of the dictionary, which are not sorted by frequency; here they appear in the order in which the words first occur in the corpus.

In [12]:
[word for i, word in enumerate(freq_brown) if i < 20]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that']

In [13]:
freq_brown.most_common(20)

[('the', 62713),
 (',', 58334),
 ('.', 49346),
 ('of', 36080),
 ('and', 27915),
 ('to', 25732),
 ('a', 21881),
 ('in', 19536),
 ('that', 10237),
 ('is', 10011),
 ('was', 9777),
 ('for', 8841),
 ('``', 8837),
 ("''", 8789),
 ('The', 7258),
 ('with', 7012),
 ('it', 6723),
 ('as', 6706),
 ('he', 6566),
 ('his', 6466)]

We can do the same using the brown_counts:

In [14]:
di = brown_counts.items()

We sort the items of `brown_counts` by their count (the second element of each item, i.e. `item[1]`) in reverse order, i.e. highest frequency first.
We display the first 20 items:

In [15]:
sorted(di, key=lambda item: item[1], reverse=True)[:20]

[('the', 62713),
 (',', 58334),
 ('.', 49346),
 ('of', 36080),
 ('and', 27915),
 ('to', 25732),
 ('a', 21881),
 ('in', 19536),
 ('that', 10237),
 ('is', 10011),
 ('was', 9777),
 ('for', 8841),
 ('``', 8837),
 ("''", 8789),
 ('The', 7258),
 ('with', 7012),
 ('it', 6723),
 ('as', 6706),
 ('he', 6566),
 ('his', 6466)]

## MLE Probability from counts
Compute a probability estimate from the counts:

In [16]:
word_count = len(brown.words())

Divide the frequency of the word by the total number of words in the corpus:

In [17]:
def unigram_prob(word):
    return freq_brown[word] / float(word_count)

Probability estimate of word "my":

In [18]:
unigram_prob('my')

0.0009998346526672592
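
To make the estimate concrete, here is a small sketch of the same relative-frequency computation on a toy corpus. The names `tokens` and `toy_unigram_prob` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Toy corpus (illustrative assumption, not Brown data).
tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)
total = len(tokens)

def toy_unigram_prob(word):
    """MLE estimate: count(word) / total number of tokens."""
    return counts[word] / total

print(toy_unigram_prob("the"))      # 3 occurrences out of 11 tokens
print(toy_unigram_prob("unicorn"))  # an unseen word gets probability 0.0
print(sum(toy_unigram_prob(w) for w in counts))  # sums to 1 up to rounding
```

The sketch also shows a known limitation of MLE: any word that does not occur in the training data receives probability zero.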

## Bigrams
Collect all bigrams.

In [19]:
brown_bigrams = nltk.bigrams(brown.words())

Show the first 10 bigrams

In [20]:
[next(brown_bigrams) for _ in range(10)]

[('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's")]
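
Because `brown_bigrams` is a generator, every `next()` call above consumes a bigram that later code will not see again. The sketch below shows this behaviour with the standard library only; `toy_bigrams` is a hypothetical stand-in for `nltk.bigrams`, and the token list is a toy example.

```python
def toy_bigrams(tokens):
    """Return a lazy iterator over adjacent pairs, similar in spirit to nltk.bigrams."""
    return zip(tokens, tokens[1:])

tokens = "the cat sat on the mat".split()

pairs = toy_bigrams(tokens)
first_two = [next(pairs) for _ in range(2)]  # consumes two pairs
remaining = list(pairs)                      # only what is left in the iterator

print(first_two)
print(remaining)

# Building a fresh iterator gives the full sequence again.
all_pairs = list(toy_bigrams(tokens))
print(len(all_pairs))
```

When the complete set of bigrams is needed after peeking at a few, it is safer to create a new iterator than to reuse a partly consumed one.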

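Function `nltk.ConditionalFreqDist()` counts frequencies of pairs.
When given a sequence of bigrams, for each word it gives the `FreqDist` of the following word.
We build the bigrams afresh here, because the `brown_bigrams` generator has already been partly consumed by the `next()` calls.

In [21]:
cfreq_bigram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))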

The frequencies of words occurring after word "my":

In [22]:
cfreq_bigram["my"]

FreqDist({'own': 52, 'hand': 19, 'life': 19, 'mind': 19, 'first': 15, 'wife': 14, 'hands': 14, 'eyes': 13, 'father': 13, 'mother': 12, ...})

The 20 most frequent words to come after "my", with their frequencies:

In [24]:
cfreq_bigram["my"].most_common(20)

[('own', 52),
 ('hand', 19),
 ('life', 19),
 ('mind', 19),
 ('first', 15),
 ('wife', 14),
 ('hands', 14),
 ('eyes', 13),
 ('father', 13),
 ('mother', 12),
 ('husband', 12),
 ('way', 12),
 ('head', 11),
 ('left', 8),
 ('heart', 7),
 ('point', 7),
 ('body', 7),
 ('Uncle', 7),
 ('best', 6),
 ('family', 6)]

Function `nltk.ConditionalProbDist()` turns the frequency distribution of each condition into a probability distribution.
One way to estimate these probabilities is Maximum Likelihood Estimation (MLE):

In [23]:
cprob_bigram = nltk.ConditionalProbDist(cfreq_bigram, nltk.MLEProbDist)

For "my" we get an MLE-based probability distribution, represented as an `MLEProbDist` object:

In [24]:
cprob_my = cprob_bigram["my"]

We can find the probability of a word following "my" with the method `prob()`:

In [25]:
cprob_my.prob('own')

0.04478897502153316
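
Under MLE, the conditional probability of a word given the previous word is the count of the pair divided by the count of all pairs that start with the previous word. A minimal sketch of that computation on a toy token list follows; `followers` and `toy_cond_prob` are hypothetical names, not NLTK functions.

```python
from collections import Counter, defaultdict

# Toy token list (illustrative assumption, not Brown data).
tokens = "my own hand my own life my hand".split()

# For each word, count the words that follow it.
followers = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    followers[w1][w2] += 1

def toy_cond_prob(w1, w2):
    """MLE estimate of P(w2 | w1) = count(w1, w2) / count(w1, *)."""
    context = followers.get(w1)
    if not context:
        return 0.0  # w1 was never seen as the first element of a bigram
    return context[w2] / sum(context.values())

print(toy_cond_prob("my", "own"))   # "own" follows "my" in 2 of 3 cases
print(toy_cond_prob("my", "hand"))  # "hand" follows "my" in 1 of 3 cases
```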

The probabilities in `cprob_bigram` now form a trained bigram language model.
A typical use of a language model is to ask it for the probability of a word sequence:

$P(\text{how do you do}) = P(\text{how}) \cdot P(\text{do} \mid \text{how}) \cdot P(\text{you} \mid \text{do}) \cdot P(\text{do} \mid \text{you})$

In [26]:
unigram_prob("how") * \
    cprob_bigram["how"].prob("do") * \
    cprob_bigram["do"].prob("you") * \
    cprob_bigram["you"].prob("do")

1.5639033871961e-09
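
Products of many small probabilities quickly become tiny, so sequence probabilities are often computed as sums of logarithms instead. The sketch below illustrates the idea; the probability values in `p_first` and `p_next` are hypothetical numbers chosen for the example, not estimates from Brown.

```python
import math

# Hypothetical probabilities for illustration only (not estimated from Brown).
p_first = {"how": 0.001}
p_next = {("how", "do"): 0.05, ("do", "you"): 0.2, ("you", "do"): 0.01}

def sequence_logprob(words):
    """Sum of log-probabilities under a bigram model; -inf if any factor is 0."""
    probs = [p_first.get(words[0], 0.0)]
    probs += [p_next.get(pair, 0.0) for pair in zip(words, words[1:])]
    if any(p == 0.0 for p in probs):
        return float("-inf")
    return sum(math.log(p) for p in probs)

logp = sequence_logprob(["how", "do", "you", "do"])
print(logp, math.exp(logp))
```

Exponentiating the result recovers the plain product, while the log form stays numerically stable for long sequences.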

## Language generation
We can also use a language model in another way:
we can let it generate text at random.
This is of limited practical use, but it gives insight into what the language model has learned.

Given a probability distribution, the method `generate()` returns a randomly chosen sample:

In [42]:
help(nltk.probability.ProbDistI.generate)

Help on function generate in module nltk.probability:

generate(self)
    Return a randomly selected sample from this probability distribution.
    The probability of returning each sample ``samp`` is equal to
    ``self.prob(samp)``.



In [28]:
cprob_bigram["my"].generate()

'studio'
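
Sampling in proportion to the probabilities can be sketched with `random.choices` from the standard library. This is an illustration of the idea under stated assumptions, not NLTK's implementation; the distribution `dist` and the helper `toy_generate` are hypothetical.

```python
import random

# Hypothetical conditional distribution P(next | "my") for illustration.
dist = {"own": 0.5, "hand": 0.3, "life": 0.2}

def toy_generate(dist, rng=random):
    """Draw one sample; each key is returned with probability dist[key]."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so that the run is repeatable
samples = [toy_generate(dist, rng) for _ in range(1000)]

# Empirical frequencies should be close to the probabilities in dist.
print({w: samples.count(w) / len(samples) for w in dist})
```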

We can use this to generate text at random:

In [27]:
def gentext(cpd, word, length=20):
    # Print the start word, then repeatedly sample the next word given the current one.
    print(word, end=' ')
    for _ in range(length):
        word = cpd[word].generate()
        print(word, end=' ')
    print()

In [29]:
gentext(cprob_bigram, "my", 20)

my camera tours , Society , who is the Wizard of '' -- rebuilding of diplomacy met by J. Wexler had 
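
The same random walk can be sketched without NLTK. The variant below returns the text instead of printing it, takes a seed for repeatable output, and stops early when the current word has no recorded follower. The name `toy_gentext`, the toy corpus, and these design choices are assumptions made for the illustration.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (illustrative assumption, not Brown data).
tokens = "the cat sat on the mat and the dog sat on the cat".split()

# For each word, count the words that follow it.
followers = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    followers[w1][w2] += 1

def toy_gentext(start, length=10, seed=0):
    """Random walk over bigram counts; returns the generated text as a string."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        options = followers.get(words[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        # Weighting by raw counts is equivalent to sampling from the MLE probabilities.
        nxt = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        words.append(nxt)
    return " ".join(words)

print(toy_gentext("the", length=8))
```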


Which genres are there in the Brown corpus?

In [30]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

Let's try Science Fiction.

In [31]:
cfreq_scifi = nltk.ConditionalFreqDist(nltk.bigrams(brown.words(categories="science_fiction")))
cprob_scifi = nltk.ConditionalProbDist(cfreq_scifi, nltk.MLEProbDist)

In [32]:
gentext(cprob_scifi, "in", 30)

in an agreement , gender was the shocks and , since he can't see , my property was North America , adjectives , read . Thus events occurred , plus a 


## Trigrams

In [33]:
brown_trigrams = nltk.trigrams(brown.words())

In [34]:
next(brown_trigrams)

('The', 'Fulton', 'County')
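
A trigram model conditions each word on the two preceding words rather than one. The following is a minimal sketch of the MLE estimate using the standard library only; `tri_followers` and `toy_trigram_prob` are hypothetical names, and the token list is a toy example rather than Brown data.

```python
from collections import Counter, defaultdict

# Toy token list (illustrative assumption, not Brown data).
tokens = "the cat sat on the mat the cat sat down".split()

# Condition on the previous two words: (w1, w2) -> counts of w3.
tri_followers = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    tri_followers[(w1, w2)][w3] += 1

def toy_trigram_prob(w1, w2, w3):
    """MLE estimate of P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2, *)."""
    context = tri_followers.get((w1, w2))
    if not context:
        return 0.0  # the context (w1, w2) was never observed
    return context[w3] / sum(context.values())

print(toy_trigram_prob("the", "cat", "sat"))  # "the cat" is always followed by "sat"
print(toy_trigram_prob("cat", "sat", "on"))   # "cat sat" is followed by "on" or "down"
```

Longer contexts make the estimates more specific but also sparser, since many word pairs are followed by only a handful of distinct words in the training data.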

In [39]:
# NgramModel is not available in NLTK 3, so these lines are left commented out:
#from nltk.model import NgramModel
#lm = NgramModel(3, brown.words(), nltk.MLEProbDist)

Here is how to generate text from bigram models trained on the texts of the NLTK book:

In [35]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [36]:
def bigram_cpd(text):
    bigrams = list(nltk.ngrams(text, 2))
    return nltk.ConditionalProbDist(nltk.ConditionalFreqDist(bigrams),
                                    nltk.MLEProbDist)

**Monty Python and the Holy Grail**

In [37]:
text6_cpd = bigram_cpd(text6)

In [38]:
gentext(text6_cpd, "I", 100)

I knew that looks like . VILLAGER # 1 : The Prince ? WOMAN : Ni ! You must have suffered much rejoicing . Beyond the -- I ' d better than some watery tart threw a question of them seldom live to be nice . ROBIN : And there was saved Sir Knight of full fifty men lie strewn about that ? GALAHAD : Who ' m on . He ' s not a vicious streak a bit scared ! rewr !] [ horn ] [ thump ] [ boom ] ARTHUR : No , Sir Galahad . GUARD : 


**Sense and Sensibility**

In [39]:
text2_cpd = bigram_cpd(text2)

In [40]:
gentext(text2_cpd, "I", 100)

I am going there , fancied that Willoughby ' s entreaties that the matter , turning again . Jennings won ' they hardly left her spirits . " Thank Heaven knows how droll , by that , in the uncertainty ; and a valued . Use your assertion . Lodging as could even doubted not reflect on him open and my affection , had not patience till a reverie , as she must be suffered for it be impertinent remarks , scorning , I certainly are not do not justify such a favour . But , their breakfast the general tremour 
