# Language Model

In [49]:
import nltk

Download Brown corpus if needed

In [50]:
from nltk.corpus import brown
try:
    brown.words()  # corpus loading is lazy, so a missing corpus only fails on first access
except LookupError:
    nltk.download('brown')

In [4]:
' '.join(brown.words()[:20])

"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that"

In [5]:
len(brown.words())

1161192

## Count the occurrences
We use a dictionary to store the number of occurrences for each word:

In [6]:
brown_counts = {}

In [7]:
for w in brown.words():
    if w in brown_counts:
        brown_counts[w] += 1
    else:
        brown_counts[w] = 1

In [8]:
brown_counts['be']

6344

In [9]:
brown_counts['the']

62713
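
The counting loop above can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on a small hand-made token list (hypothetical data standing in for `brown.words()`); it only checks that the two approaches give the same tallies.

```python
from collections import Counter

# Toy token list standing in for brown.words() (hypothetical data)
tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "bird"]

# Explicit dictionary loop, as in the cells above
manual_counts = {}
for w in tokens:
    if w in manual_counts:
        manual_counts[w] += 1
    else:
        manual_counts[w] = 1

# Counter performs the same tally in a single call
counter_counts = Counter(tokens)

print(manual_counts["the"])
print(dict(counter_counts) == manual_counts)
```

Unlike a plain dictionary, a `Counter` returns 0 for a word it has never seen instead of raising `KeyError`.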

## Frequency distribution using NLTK

The function `nltk.FreqDist()` builds a frequency distribution, which associates each word with the count of its occurrences.

In [10]:
freq_brown = nltk.FreqDist(brown.words())

Show the frequency of a given word

In [11]:
freq_brown['The']

7258

In [12]:
freq_brown['Fulton']

17

Show 20 words from the distribution.
Here we simply take the first 20 items produced by iterating over it. In the output below they come out in decreasing order of frequency, but it is safer not to rely on iteration order.

In [13]:
[word for i, word in enumerate(freq_brown) if i < 20]

['the',
 ',',
 '.',
 'of',
 'and',
 'to',
 'a',
 'in',
 'that',
 'is',
 'was',
 'for',
 '``',
 "''",
 'The',
 'with',
 'it',
 'as',
 'he',
 'his']

Show the 20 most frequent words together with their counts, using `most_common()`, which explicitly orders them by frequency:

In [14]:
freq_brown.most_common(20)

[('the', 62713),
 (',', 58334),
 ('.', 49346),
 ('of', 36080),
 ('and', 27915),
 ('to', 25732),
 ('a', 21881),
 ('in', 19536),
 ('that', 10237),
 ('is', 10011),
 ('was', 9777),
 ('for', 8841),
 ('``', 8837),
 ("''", 8789),
 ('The', 7258),
 ('with', 7012),
 ('it', 6723),
 ('as', 6706),
 ('he', 6566),
 ('his', 6466)]

We can do the same using the brown_counts:

In [15]:
di = brown_counts.items()

We sort the items of `brown_counts` by the value of the count (the second element of each item, i.e. `item[1]`), in reverse order, i.e. higher frequency first.
We display the first 20 items:

In [16]:
sorted(di, key=lambda item: item[1], reverse=True)[:20]

[('the', 62713),
 (',', 58334),
 ('.', 49346),
 ('of', 36080),
 ('and', 27915),
 ('to', 25732),
 ('a', 21881),
 ('in', 19536),
 ('that', 10237),
 ('is', 10011),
 ('was', 9777),
 ('for', 8841),
 ('``', 8837),
 ("''", 8789),
 ('The', 7258),
 ('with', 7012),
 ('it', 6723),
 ('as', 6706),
 ('he', 6566),
 ('his', 6466)]
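
As a cross-check of the sorting step, here is a small sketch on toy data (hypothetical tokens, not the Brown corpus) comparing the explicit `sorted(...)` call with `Counter.most_common()` from the standard library.

```python
from collections import Counter

# Toy tokens with distinct counts, so the ranking has no ties
tokens = ["a", "b", "a", "c", "a", "b"]
counts = Counter(tokens)

# Sort (word, count) pairs by count, highest first, and keep the top 2
by_sorted = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:2]

# The built-in shortcut for the same ranking
by_counter = counts.most_common(2)

print(by_sorted)
print(by_counter)
```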

## MLE Probability from counts
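The maximum likelihood estimate (MLE) of a word's probability is its relative frequency in the corpus. First we need the total number of tokens: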

In [17]:
word_count = len(brown.words())

Divide the count of the word by the total number of tokens:

In [18]:
def unigram_prob(word):
    return freq_brown[word] / float(word_count)

Probability estimate of word "my":

In [19]:
unigram_prob('my')

0.0009998346526672592
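
The estimate is simply count divided by corpus size; the value shown above for "my" is consistent with 1161 occurrences out of 1161192 tokens. The sketch below repeats the computation on a toy corpus (hypothetical data) and checks two properties of the MLE: the probabilities of all seen words sum to 1, and an unseen word gets probability 0.

```python
from collections import Counter

# Toy corpus standing in for brown.words() (hypothetical data)
tokens = ["my", "dog", "and", "my", "cat", "like", "my", "garden"]
counts = Counter(tokens)
total = len(tokens)

def unigram_prob(word):
    """Relative frequency of `word` in the toy corpus."""
    return counts[word] / total

print(unigram_prob("my"))                      # 3 occurrences out of 8 tokens
print(sum(unigram_prob(w) for w in counts))    # probabilities of seen words
print(unigram_prob("unseen"))                  # a word that never occurs
```

The zero probability for unseen words is the main weakness of plain MLE: any sequence containing such a word also gets probability 0.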

## Bigrams
Collect all bigrams.

In [20]:
brown_bigrams = nltk.bigrams(brown.words())

Show the first 10 bigrams

In [21]:
[next(brown_bigrams) for _ in range(10)]

[('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's")]
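
Note that `brown_bigrams` is consumed with `next()`, so it behaves as an iterator: pairs that have been read are gone. The sketch below illustrates this with a hypothetical stand-in for `nltk.bigrams()` built on `zip`, using a few tokens from the sentence above.

```python
def bigrams(seq):
    """Return an iterator over consecutive pairs (a stand-in for nltk.bigrams)."""
    return zip(seq, seq[1:])

tokens = ["The", "Fulton", "County", "Grand", "Jury"]

pairs = bigrams(tokens)
first_two = [next(pairs) for _ in range(2)]   # consumes two pairs
remaining = list(pairs)                        # only what is left

print(first_two)
print(remaining)
```

Any later computation that should see all bigrams therefore needs a fresh iterator rather than the partially consumed one.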

The function `nltk.ConditionalFreqDist()` counts frequencies of pairs.
Given a sequence of bigrams, it maps each word to the `FreqDist` of the words that follow it.

In [22]:
# Build a fresh generator: the previous cell already consumed the first 10 bigrams
cfreq_bigram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

The frequencies of words occurring after word "my":

In [23]:
cfreq_bigram["my"]

FreqDist({'own': 52, 'hand': 19, 'life': 19, 'mind': 19, 'first': 15, 'wife': 14, 'hands': 14, 'eyes': 13, 'father': 13, 'mother': 12, ...})

The 20 most frequent words to come after "my", with their frequencies:

In [24]:
cfreq_bigram["my"].most_common(20)

[('own', 52),
 ('hand', 19),
 ('life', 19),
 ('mind', 19),
 ('first', 15),
 ('wife', 14),
 ('hands', 14),
 ('eyes', 13),
 ('father', 13),
 ('mother', 12),
 ('husband', 12),
 ('way', 12),
 ('head', 11),
 ('left', 8),
 ('heart', 7),
 ('point', 7),
 ('body', 7),
 ('Uncle', 7),
 ('best', 6),
 ('family', 6)]
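
To make the data structure concrete, here is a rough pure-Python analogue of a conditional frequency distribution: a dictionary that maps each word to a `Counter` of the words following it. The token list is hypothetical, and the sketch does not reproduce the NLTK class itself.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical data)
tokens = "my own hand and my own life and my mind".split()

# Map each word to a Counter of the words that follow it
cfd = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    cfd[prev][nxt] += 1

print(cfd["my"].most_common(2))
```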

The function `nltk.ConditionalProbDist()` turns these conditional counts into conditional probabilities.
One way to do this is Maximum Likelihood Estimation (MLE):

In [25]:
cprob_bigram = nltk.ConditionalProbDist(cfreq_bigram, nltk.MLEProbDist)

Here is what we find for "my": a probability distribution based on Maximum Likelihood Estimation, as an `MLEProbDist` object.

In [26]:
cprob_my = cprob_bigram["my"]

We can find the probability of words that can come after "my" by using the method `prob()`

In [27]:
cprob_my.prob('own')

0.04478897502153316
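
Under MLE, the conditional probability is the count of the pair divided by the total count of pairs that start with the first word. The value above is consistent with 52 / 1161, given the count of 52 for "own" shown earlier. The sketch below computes the same kind of estimate on a toy corpus (hypothetical data); `bigram_mle` is an illustrative helper, not an NLTK function.

```python
from collections import Counter, defaultdict

def bigram_mle(tokens):
    """Return {prev: {next: P(next | prev)}} estimated by relative frequency."""
    cfd = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        cfd[prev][nxt] += 1
    probs = {}
    for prev, followers in cfd.items():
        total = sum(followers.values())
        probs[prev] = {w: c / total for w, c in followers.items()}
    return probs

tokens = "my own hand and my own life and my mind".split()
probs = bigram_mle(tokens)

print(probs["my"])   # "own" follows "my" twice, "mind" once
```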

The probabilities in `cprob_bigram` now form a trained bigram language model.
The typical use for a language model is to ask it for the probability of a word sequence:

$P(\text{how do you do}) = P(\text{how}) \cdot P(\text{do} \mid \text{how}) \cdot P(\text{you} \mid \text{do}) \cdot P(\text{do} \mid \text{you})$

In [28]:
unigram_prob("how") * \
    cprob_bigram["how"].prob("do") * \
    cprob_bigram["do"].prob("you") * \
    cprob_bigram["you"].prob("do")

1.5639033871961e-09
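
Products of many small probabilities quickly become tiny, so a common variation is to sum log probabilities instead. The sketch below applies the same chain of unigram and bigram estimates to a toy corpus (hypothetical data) in log space; `sentence_logprob` is an illustrative helper and returns negative infinity when the sequence contains an unseen word or bigram.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical data)
tokens = "how do you do . how do you know . you do".split()

unigram = Counter(tokens)
cfd = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    cfd[prev][nxt] += 1

def sentence_logprob(words):
    """log P(w1) + sum of log P(w_i | w_{i-1}), using MLE estimates."""
    if unigram[words[0]] == 0:
        return float("-inf")
    logp = math.log(unigram[words[0]] / len(tokens))
    for prev, nxt in zip(words, words[1:]):
        count = cfd[prev][nxt]
        if count == 0:
            return float("-inf")   # unseen bigram: probability 0
        logp += math.log(count / sum(cfd[prev].values()))
    return logp

logp = sentence_logprob(["how", "do", "you", "do"])
print(logp, math.exp(logp))
```

In this toy corpus the four factors are 2/12, 1, 2/3 and 2/3, so the probability is 2/27.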

## Language generation
We can also use a language model in another way:
we can let it generate text at random.
The output is rarely useful in itself, but it gives insight into what the language model has learned.

Given a probability distribution, such as the one that `cprob_bigram` holds for each word, the method `generate()` returns a randomly chosen sample:

In [29]:
help(nltk.probability.ProbDistI.generate)

Help on function generate in module nltk.probability:

generate(self)
    Return a randomly selected sample from this probability distribution.
    The probability of returning each sample ``samp`` is equal to
    ``self.prob(samp)``.



In [51]:
cprob_bigram["my"].generate()

'mare'

We can use this to generate text at random:

In [31]:
def gentext(cpd, word, length=20):
    # `length` rather than `len`, to avoid shadowing the built-in len()
    print(word, end=' ')
    for _ in range(length):
        word = cpd[word].generate()
        print(word, end=' ')
    print()

In [52]:
gentext(cprob_bigram, "my", 20)

my hand and these folds . The great value of bottled or childhood . He smiled . Whatever you utilizing vending 
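
The generation loop can be sketched without NLTK by sampling the next word in proportion to its bigram count. The version below is an assumption-laden illustration: `generate_words` is a hypothetical helper, the corpus is toy data, it returns a list instead of printing, it accepts a seed for reproducibility, and it stops early when a word has no recorded followers.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (hypothetical data)
tokens = "my own hand and my own life and my mind and my hand".split()

cfd = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    cfd[prev][nxt] += 1

def generate_words(cfd, word, length=20, seed=None):
    """Random walk over bigram counts, starting from `word`."""
    rng = random.Random(seed)
    out = [word]
    for _ in range(length):
        followers = cfd.get(word)
        if not followers:
            break  # dead end: nothing ever followed this word
        candidates = list(followers)
        weights = list(followers.values())
        # Sampling weighted by count is equivalent to sampling by MLE probability
        word = rng.choices(candidates, weights=weights)[0]
        out.append(word)
    return out

sample = generate_words(cfd, "my", length=10, seed=0)
print(' '.join(sample))
```

Every adjacent pair in the output is a bigram that occurs in the training data, which is why such text looks locally plausible but drifts globally.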


Which genres are there in the Brown corpus?

In [33]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

Let's try Science Fiction.

In [34]:
cfreq_scifi = nltk.ConditionalFreqDist(nltk.bigrams(brown.words(categories = "science_fiction")))
cprob_scifi = nltk.ConditionalProbDist(cfreq_scifi, nltk.MLEProbDist)

In [35]:
gentext(cprob_scifi, "in", 30)

in temperature and practicality , said Macneff stopped pacing to study the proper living there was a wave of light up to follow the contrary-to-reality thoughts ! ! ! It would 


The same steps can be wrapped in a helper that works on any token sequence, such as the texts in NLTK's Gutenberg corpus:

In [36]:
def bigram_cpd(text):
    bigrams = nltk.bigrams(text) # nltk.ngrams(text, 2)
    return nltk.ConditionalProbDist(nltk.ConditionalFreqDist(bigrams),
                                    nltk.MLEProbDist)

## Austen

### Emma

In [37]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

In [38]:
emma_cpd = bigram_cpd(emma)

In [39]:
gentext(emma_cpd, "She")

She was every thing two , my mother is a job of his usual time ." " I was plain as 


### Sense and Sensibility

In [40]:
sense_cpd = bigram_cpd(nltk.corpus.gutenberg.words('austen-sense.txt'))

In [41]:
gentext(sense_cpd, "I", 100)

I am rather gave her papers , that grew impatient for Marianne had to own discretion . Design could not it , and their apprehensions as she treated with which Marianne , and to a while . Your sincere affection for as Mrs . " Dearest Marianne considered what passed on the kingdom ; and there was not attempt at that in it , and looked up stairs . Mrs . It was not the general drift of nothing for thinking it . She thanked him with what SHE could doubt . Mrs . " Is this act of her opinion 


## Trigrams

In [42]:
brown_trigrams = nltk.trigrams(brown.words())

In [43]:
next(brown_trigrams)

('The', 'Fulton', 'County')

In [44]:
help(nltk.lm.MLE)

Help on class MLE in module nltk.lm.models:

class MLE(nltk.lm.api.LanguageModel)
 |  MLE(order, vocabulary=None, counter=None)
 |  
 |  Class for providing MLE ngram model scores.
 |  
 |  Inherits initialization from BaseNgramModel.
 |  
 |  Method resolution order:
 |      MLE
 |      nltk.lm.api.LanguageModel
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  unmasked_score(self, word, context=None)
 |      Returns the MLE score for a word given a context.
 |      
 |      Args:
 |      - word is expected to be a string
 |      - context is expected to be something reasonably convertible to a tuple
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.lm.api.LanguageModel:
 |  
 |  __init__(self, order, vocabulary=None, counter=None)
 |      Create

In [45]:
from nltk.lm import MLE
lm = MLE(3)

`padded_everygram_pipeline` pads each sentence at both ends with \<s\> and \</s\>, extracts all n-grams up to order 3 for training, and chains the padded sentences into one token stream from which the vocabulary is built.

In [46]:
from nltk.lm.preprocessing import padded_everygram_pipeline
train, vocab = padded_everygram_pipeline(3, brown.sents())

In [47]:
lm.fit(train, vocab)

In [48]:
' '.join(lm.generate(20))

'art and poetry , Homeric critics are often ignorant as well as ethical conflict between two and the way from'
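
To show what a trigram MLE model estimates, here is a small pure-Python sketch on hypothetical sentences. It does not reproduce the internals of `nltk.lm`; the padding with two boundary symbols per side and the helper names `padded_trigrams` and `trigram_prob` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def padded_trigrams(sentence):
    """Yield trigrams of a sentence padded with boundary symbols."""
    padded = ["<s>", "<s>"] + list(sentence) + ["</s>", "</s>"]
    return zip(padded, padded[1:], padded[2:])

# Toy training sentences (hypothetical data)
sentences = [
    ["I", "like", "tea"],
    ["I", "like", "coffee"],
    ["you", "like", "tea"],
]

# Map each two-word context to a Counter of the words that follow it
counts = defaultdict(Counter)
for sent in sentences:
    for w1, w2, w3 in padded_trigrams(sent):
        counts[(w1, w2)][w3] += 1

def trigram_prob(word, context):
    """MLE estimate of P(word | context), where context is two words."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

print(trigram_prob("tea", ("I", "like")))
print(trigram_prob("I", ("<s>", "<s>")))
```

The padding lets the model learn which words tend to start and end a sentence: here two of the three sentences start with "I".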