*ANLP 2020/21; Uni Potsdam; D. Schlangen, B. Aktas*

# Work Sheet for Week 02: Language Models

Galina Ryazanskaya

## Background

For this worksheet, you again need to work with text corpora. NLTK provides easy access to several commonly used (English) corpora, as described [in Chapter 2 of the NLTK book](https://www.nltk.org/book/ch02.html).

## Questions / Exercises

### [E1] Write out the equation for trigram probability estimation (by modifying Eq. 3.11 in JM3, Chapter 3)

$$P(w_n|w_{n-2}w_{n-1}) = \frac{C(w_{n-2}w_{n-1}w_n)}{C(w_{n-2}w_{n-1})}$$
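
As a minimal sketch of how this estimate could be computed from raw counts: the toy token list and the helper name `trigram_mle` are illustrative assumptions, not part of the worksheet. The context count here only includes bigrams that are followed by another token, which differs from the plain bigram count at the very end of the corpus.

```python
from collections import Counter

def trigram_mle(tokens):
    """Return a dict mapping (w1, w2, w3) to the MLE estimate P(w3 | w1 w2)."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count each bigram only where it is followed by another token, so that
    # the estimates for every context sum to one.
    context_counts = Counter(zip(tokens, tokens[1:-1]))
    return {
        tri: count / context_counts[tri[:2]]
        for tri, count in trigram_counts.items()
    }

toy = "the cat sat on the mat and the cat ran".split()
probs = trigram_mle(toy)
print(probs[("the", "cat", "sat")])  # C(the cat sat) / C(the cat) = 1 / 2
```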

In [82]:
import math
import nltk
import numpy as np
import pandas as pd
import random

from collections import Counter
from itertools import tee, islice
from tqdm.notebook import tqdm

---

### [E2] Write a program to compute unsmoothed unigrams and bigrams from a given corpus.


In [7]:
def get_ngrams(corpus, n=2):
  """
  yields the n-grams of the words in a given corpus

  :param corpus: list of str
  :param n: int, length of n-gram,
            optional, default = 2
  :yield: tuple of str, the next n-gram
  """
  tlst = corpus
  while True:
    a, b = tee(tlst)  # make two identical iterators over the remaining words (from the pointer position)
    l = tuple(islice(a, n))  # take the first n words from the remaining words
    if len(l) == n:  # check that enough words were left for a slice of length n
      yield l  # yield the current n-gram
      next(b)  # move the pointer
      tlst = b  # continue from the advanced iterator
    else:  # fewer than n words are left
      break  # terminate

In [8]:
li = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.'
list(get_ngrams(li.split(), 4))[:3]

[('Lorem', 'ipsum', 'dolor', 'sit'),
 ('ipsum', 'dolor', 'sit', 'amet,'),
 ('dolor', 'sit', 'amet,', 'consectetur')]
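
For comparison, the same sliding window can be written by zipping `n` staggered views of the token list. This is only a sketch of an alternative formulation; the name `ngrams_zip` is a hypothetical helper, and NLTK also ships its own `nltk.util.ngrams`.

```python
from itertools import islice

def ngrams_zip(tokens, n=2):
    """Return an iterator over n-grams (tuples) of the given tokens."""
    tokens = list(tokens)
    # View i starts at position i; zip stops as soon as the shortest view ends.
    return zip(*(islice(tokens, i, None) for i in range(n)))

words = "Lorem ipsum dolor sit amet, consectetur".split()
print(list(ngrams_zip(words, 4)))
```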

---

### [E3] Run your N-gram program on two different small corpora of your choice (you might use email text or newsgroups). Now compare the statistics of the two corpora. What are the differences in the most common unigrams between the two? How about interesting differences in bigrams?


In [9]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [10]:
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
bible = nltk.corpus.gutenberg.words('bible-kjv.txt')

In [11]:
len(hamlet)

37360

In [12]:
len(bible)

1010654

#### unigrams


In [13]:
hamlet_top_50 = Counter(hamlet).most_common(50)
bible_top_50 = Counter(bible).most_common(50)

In [14]:
keys = lambda mc: [t[0] for t in mc]
mc_dict = lambda mc: {t[0]: t[1] for t in mc}

hamlet_50_keys = set(keys(hamlet_top_50)) 
bible_50_keys = set(keys(bible_top_50))

hamlet_diff = hamlet_50_keys - bible_50_keys
bible_diff = bible_50_keys - hamlet_50_keys
union = hamlet_50_keys.union(bible_50_keys)
intersection = hamlet_50_keys.intersection(bible_50_keys)

In [15]:
hamlet_diff

{"'",
 '-',
 'But',
 'Ham',
 'Hamlet',
 'King',
 'Lord',
 'That',
 'The',
 'To',
 'but',
 'd',
 'haue',
 'on',
 'our',
 's',
 'so',
 'this',
 'what',
 'you',
 'your'}

Obviously, the names from Shakespeare do not appear in the Bible, but capitalisation also seems to work differently in the two corpora.
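
One way to take capitalisation out of the comparison would be to fold case before counting. The sketch below assumes that lower-casing is acceptable for this purpose; `normalised_counts` and the sample tokens are illustrative, not part of the analysis above.

```python
from collections import Counter

def normalised_counts(tokens):
    """Count lower-cased tokens, so that 'The' and 'the' are merged."""
    return Counter(tok.lower() for tok in tokens)

sample = ["The", "King", ",", "the", "Queen", ".", "And", "and"]
counts = normalised_counts(sample)
print(counts["the"], counts["and"])
```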

In [16]:
bible_diff

{'1',
 '2',
 'God',
 'LORD',
 'all',
 'from',
 'have',
 'out',
 'said',
 'thee',
 'their',
 'them',
 'they',
 'thou',
 'thy',
 'unto',
 'upon',
 'was',
 'were',
 'which',
 'ye'}

In [17]:
unigrams_df = pd.DataFrame(index=union)
unigrams_df['hamlet'] = pd.Series(mc_dict(hamlet_top_50))
unigrams_df['bible'] = pd.Series(mc_dict(bible_top_50))

In [18]:
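# avoid shadowing the built-in `sorted`; recent pandas versions also reject a set as indexer
sorted_unigrams = unigrams_df.loc[list(intersection)].sort_values('bible', ascending=False)
sorted_unigrams.style.background_gradient()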

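| token | hamlet | bible |
|---|---|---|
| `,` | 2892.0 | 70509.0 |
| `the` | 860.0 | 62103.0 |
| `:` | 565.0 | 43766.0 |
| `and` | 606.0 | 38847.0 |
| `of` | 576.0 | 34480.0 |
| `.` | 1886.0 | 26160.0 |
| `to` | 576.0 | 13396.0 |
| `And` | 257.0 | 12846.0 |
| `that` | 257.0 | 12576.0 |
| `in` | 359.0 | 12331.0 |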


The tokens that are most frequent in both corpora are punctuation marks and the so-called stopwords (function words such as "the", "and", "of").
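
To look past these tokens, punctuation and stopwords could be filtered out before counting. The sketch below uses a tiny hand-picked stop list, which is an assumption made for illustration; a fuller English list is available through `nltk.corpus.stopwords` once the `stopwords` package has been downloaded.

```python
from collections import Counter

# Hand-picked stop list for illustration only (an assumption, not a standard list).
STOPWORDS = {"the", "and", "of", "to", "that", "in", "a", "is"}

def content_word_counts(tokens, stopwords=STOPWORDS):
    """Count tokens that are alphabetic and not in the stop list."""
    return Counter(
        tok for tok in tokens
        if tok.isalpha() and tok.lower() not in stopwords
    )

sample = "To be , or not to be , that is the Question :".split()
filtered = content_word_counts(sample)
print(filtered.most_common(3))
```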

#### bigrams

In [19]:
bigrams_hamlet = Counter(get_ngrams(hamlet))
bigrams_bible = Counter(get_ngrams(bible))

In [20]:
hamlet_top_200 = bigrams_hamlet.most_common(200)
bible_top_200 = bigrams_bible.most_common(200)

hamlet_200_keys = set(keys(hamlet_top_200)) 
bible_200_keys = set(keys(bible_top_200))

union = hamlet_200_keys.union(bible_200_keys)
intersection = hamlet_200_keys.intersection(bible_200_keys)
hamlet_diff = hamlet_200_keys - bible_200_keys
bible_diff = bible_200_keys - hamlet_200_keys

In [21]:
hamlet_diff  # spelling differences

{("'", 'Tis'),
 ("'", 'd'),
 ("'", 'l'),
 ("'", 're'),
 ("'", 'st'),
 ("'", 't'),
 ("'", 'th'),
 ("'", 'tis'),
 (',', "'"),
 (',', 'A'),
 (',', 'And'),
 (',', 'As'),
 (',', 'But'),
 (',', 'For'),
 (',', 'Ile'),
 (',', 'Or'),
 (',', 'That'),
 (',', 'The'),
 (',', 'This'),
 (',', 'To'),
 (',', 'When'),
 (',', 'With'),
 (',', 'a'),
 (',', 'by'),
 (',', 'for'),
 (',', 'his'),
 (',', 'if'),
 (',', 'is'),
 (',', 'it'),
 (',', 'let'),
 (',', 'my'),
 (',', 'what'),
 (',', 'with'),
 (',', 'you'),
 ('.', "'"),
 ('.', 'A'),
 ('.', 'But'),
 ('.', 'Come'),
 ('.', 'Do'),
 ('.', 'Enter'),
 ('.', 'Exeunt'),
 ('.', 'Exit'),
 ('.', 'For'),
 ('.', 'Giue'),
 ('.', 'Good'),
 ('.', 'Ham'),
 ('.', 'He'),
 ('.', 'How'),
 ('.', 'I'),
 ('.', 'If'),
 ('.', 'In'),
 ('.', 'It'),
 ('.', 'King'),
 ('.', 'Let'),
 ('.', 'My'),
 ('.', 'Nay'),
 ('.', 'No'),
 ('.', 'O'),
 ('.', 'Oh'),
 ('.', 'So'),
 ('.', 'That'),
 ('.', 'The'),
 ('.', 'There'),
 ('.', 'This'),
 ('.', 'To'),
 ('.', 'We'),
 ('.', 'What'),
 ('.', 'Why'),
 

In [22]:
bible_diff  # digits, God, Israel

{(',', 'O'),
 (',', 'because'),
 (',', 'even'),
 (',', 'saying'),
 (',', 'they'),
 (',', 'which'),
 ('.', '1'),
 ('.', '10'),
 ('.', '11'),
 ('.', '12'),
 ('.', '13'),
 ('.', '14'),
 ('.', '15'),
 ('.', '16'),
 ('.', '18'),
 ('.', '19'),
 ('.', '2'),
 ('.', '20'),
 ('.', '21'),
 ('.', '22'),
 ('.', '3'),
 ('.', '4'),
 ('.', '5'),
 ('.', '6'),
 ('.', '7'),
 ('.', '8'),
 ('.', '9'),
 ('1', ':'),
 ('10', ':'),
 ('11', ':'),
 ('12', ':'),
 ('13', ':'),
 ('14', ':'),
 ('15', ':'),
 ('16', ':'),
 ('17', ':'),
 ('18', ':'),
 ('19', ':'),
 ('2', ':'),
 ('20', ':'),
 ('21', ':'),
 ('22', ':'),
 ('23', ':'),
 ('24', ':'),
 ('3', ':'),
 ('4', ':'),
 ('5', ':'),
 ('6', ':'),
 ('7', ':'),
 ('8', ':'),
 ('9', ':'),
 (':', '1'),
 (':', '10'),
 (':', '11'),
 (':', '12'),
 (':', '13'),
 (':', '14'),
 (':', '15'),
 (':', '16'),
 (':', '17'),
 (':', '18'),
 (':', '19'),
 (':', '2'),
 (':', '20'),
 (':', '21'),
 (':', '22'),
 (':', '23'),
 (':', '24'),
 (':', '25'),
 (':', '26'),
 (':', '27'),
 (':', '3')

In [23]:
bigrams_df = pd.DataFrame(index=union)
bigrams_df['hamlet'] = pd.Series(mc_dict(hamlet_top_200))
bigrams_df['bible'] = pd.Series(mc_dict(bible_top_200))

In [24]:
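# avoid shadowing the built-in `sorted`; recent pandas versions also reject a set as indexer
sorted_bigrams = bigrams_df.loc[list(intersection)].sort_values('bible', ascending=False)
sorted_bigrams.style.background_gradient()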

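| first | second | hamlet | bible |
|---|---|---|---|
| `,` | `and` | 305.0 | 24921.0 |
| `of` | `the` | 59.0 | 11442.0 |
| `in` | `the` | 65.0 | 4879.0 |
| `and` | `the` | 26.0 | 4044.0 |
| `;` | `and` | 20.0 | 3214.0 |
| `:` | `and` | 22.0 | 3027.0 |
| `,` | `that` | 63.0 | 2924.0 |
| `to` | `the` | 56.0 | 2135.0 |
| `,` | `the` | 48.0 | 2117.0 |
| `him` | `,` | 28.0 | 2033.0 |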


Again, the most common bigrams are not very informative, as they consist of stopwords and punctuation.
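
Because the Bible corpus is roughly 27 times larger than Hamlet, raw counts are hard to compare directly. A sketch of a relative-frequency comparison, using the corpus lengths and the count of the bigram `(",", "and")` reported above; the helper `per_thousand` is a hypothetical name.

```python
def per_thousand(count, n_tokens, n=2):
    """Rate of an n-gram per 1,000 n-gram positions in a corpus of n_tokens."""
    positions = n_tokens - n + 1  # number of n-grams in the corpus
    return 1000 * count / positions

# Counts of the bigram (",", "and") and corpus sizes as reported above.
hamlet_rate = per_thousand(305, 37360)
bible_rate = per_thousand(24921, 1010654)
print(round(hamlet_rate, 2), round(bible_rate, 2))
```

On these figures the bigram is relatively about three times as frequent in the Bible as in Hamlet, which the raw counts alone do not show.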

---

### [E4] Add an option to your program to generate random sentences.

Since I cannot easily add "\<s>" and "\</s>", I will begin with manually inserted start tokens and terminate on ".", "!", or "?".
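
If sentence-segmented input is available (NLTK corpus readers offer a `sents()` method next to `words()`), boundary markers could be added with a small helper. This is a sketch under that assumption; `pad_sentences` and the toy sentences are illustrative.

```python
from collections import Counter

def pad_sentences(sentences, start="<s>", end="</s>"):
    """Flatten tokenised sentences into one stream with boundary markers."""
    padded = []
    for sent in sentences:
        padded.extend([start, *sent, end])
    return padded

sents = [["Who", "'s", "there", "?"], ["Nay", "answer", "me", "."]]
tokens = pad_sentences(sents)
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams[("<s>", "Who")], bigrams[("?", "</s>")])
```

Counting bigrams over the flattened stream also produces the pair `("</s>", "<s>")` at every sentence boundary, which should be ignored or removed when estimating probabilities.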


In [84]:
def get_MLE(word_counts, bigram_counts):
  """
  gets conditional bigram probabilities

  :param word_counts: collections.Counter, word counts
  :param bigram_counts: collections.Counter, bigram counts
  :return: pd.DataFrame, conditional probabilities of bigrams;
           column = first word, row = second word, so MX[w1][w2] = P(w2 | w1)
  """
  keys = list(word_counts.keys())
  MX = pd.DataFrame(0.0, index=keys, columns=keys)  # unseen bigrams get probability 0
  for bigram in tqdm(bigram_counts.keys(), total=len(bigram_counts), desc="search over bigrams: ", leave=True):
    # .at[row, column] avoids chained assignment, which may silently write to a copy
    MX.at[bigram[1], bigram[0]] = bigram_counts[bigram] / word_counts[bigram[0]]
  return MX
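
The dense matrix above has one cell per pair of vocabulary items, almost all of them zero. A sparser alternative, sketched here as an assumption rather than as the worksheet's method, stores only the observed followers of each word in nested dictionaries; `sparse_mle` is a hypothetical name. Note that it normalises by the number of observed followers, which differs from the unigram count only for the last token of the corpus.

```python
from collections import Counter, defaultdict

def sparse_mle(tokens):
    """Map each context word to a dict {next_word: P(next_word | context)}."""
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return {
        first: {w: c / sum(nexts.values()) for w, c in nexts.items()}
        for first, nexts in followers.items()
    }

toy = "I am Sam . Sam I am . I do not like green eggs".split()
model = sparse_mle(toy)
print(model["I"])  # "am" follows "I" twice, "do" once
```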

In [85]:
hamlet_probs = get_MLE(Counter(hamlet), Counter(get_ngrams(hamlet)))
hamlet_probs.head()

Unnamed: 0,[,The,Tragedie,of,Hamlet,by,William,Shakespeare,1599,],Actus,Primus,.,Scoena,Prima,Enter,Barnardo,and,Francisco,two,Centinels,Who,',s,there,?,Fran,Nay,answer,me,:,Stand,&,vnfold,your,selfe,Bar,Long,liue,the,...,arriued,stage,placed,vnknowing,carnall,bloudie,acts,accidentall,casuall,slaughters,vpshot,mistooke,Falne,Inuentors,Truly,Noblest,claime,Inuite,perform,whiles,mindes,Lest,errors,happen,Captaines,Beare,royally,Souldiours,Warre,lowdly,Becomes,Field,amis,Souldiers,Marching,Peale,Ordenance,FINIS,tragedie,HAMLET
[,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006536,0.0,0.0,0.0,0.0,0.015929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Tragedie,0.0,0.007519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
of,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00495,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.008811,0.00177,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
Hamlet,0.0,0.0,0.0,0.006944,0.010101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003712,0.0,0.0,0.121951,0.0,0.00165,0.0,0.0,0.0,0.0,0.0,0.008197,0.0,0.0,0.0,0.0,0.125,0.0,0.00354,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
def generate_sentence(start, probs, max_len=10):
  """
  generates a sentence from 'start' based on a bigram probability model,
  terminates on ".", "!", "?" or when max_len tokens are reached

  :param start: str, start token
  :param probs: pd.DataFrame of float, conditional probabilities matrix
  :param max_len: int, maximal length, optional, default = 10
  :return: list of str, sentence
  """
  tok = start
  sent = [tok]
  while tok not in ['.', '!', '?'] and len(sent) < max_len:
    # sample the next token from P(. | tok); named next_tok to avoid shadowing the built-in `next`
    next_tok = random.choices(probs.index, weights=probs[tok])[0]
    sent.append(next_tok)
    tok = next_tok
  return sent
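
For reproducible samples, the generator could take its own seeded random number generator. The sketch below works on a nested-dictionary model (`{context: {next_word: probability}}`), which is an assumed representation and not the DataFrame used above; it also stops at a context that has no recorded followers, a case in which `random.choices` would otherwise fail because all weights are zero.

```python
import random

def sample_sentence(start, model, max_len=10, rng=None):
    """Sample tokens from a {context: {next: prob}} model until an end mark."""
    rng = rng or random.Random()
    sent = [start]
    while sent[-1] not in {".", "!", "?"} and len(sent) < max_len:
        nexts = model.get(sent[-1])
        if not nexts:  # dead end: the context was never followed by anything
            break
        words, weights = zip(*nexts.items())
        sent.append(rng.choices(words, weights=weights)[0])
    return sent

toy_model = {
    "The": {"King": 0.5, "Queen": 0.5},
    "King": {".": 1.0},
    "Queen": {".": 1.0},
}
print(sample_sentence("The", toy_model, rng=random.Random(0)))
```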

In [28]:
# begin with a token sampled after ".", which stands in for </s> / <s>
for x in range(10):
  first = random.choices(hamlet_probs.index, weights=hamlet_probs['.'])[0]
  print(' '.join(generate_sentence(first, hamlet_probs, 20)))

Now Mother lookes so did the croaking Rauen doth besmerch The Trumpet to come , and our neglected loue to
In my Lord , whose loue and nickname Gods , come to Heauen , smiling damned Incest .
My most carefully vpon ' d he this world , and held his death was I heare him Ham .
Thy louing to heare the Starre , though I see my Watch , the Murther .
How came about that followes .
I to a Pit of you , oh , but skin and admiration Ham .
That it my Lord , casuall slaughters Of those foresaid Lands So bloodily hast thou should with plaist ' l
I prythee , or three liberall conceited Carriages infaith are fire Qu .
I take for our withers are hot loue passing well you this deed , pretty Lady , do ?
I thinke of the World one .


In [29]:
# begin with "The"
for x in range(10):
  print(' '.join(generate_sentence('The', hamlet_probs, 20)))

The King .
The body .
The Queen .
The insolence of gain - word .
The Cannons to Hercules himselfe , to euery one with bloud of it will more Choller Guild .
The Treacherous Instrument you auoid it they are , I can it lacks of all his iaw , you at
The Ratifiers and Volume of his brow , ' tis a Friend Ham .
The King , That might they shall do you Ophe .
The rugged Pyrrhus stood , and houer o ' s an Vnction to do , And ( Which was doubtfull
The Harlots Cheeke beautied with him with him Polon .


In [30]:
# begin with "I"
for x in range(10):
  print(' '.join(generate_sentence('I', hamlet_probs, 20)))

I do beseech you Madam , And makes A Sword Gho .
I can it said .
I vse you Sir ?
I dare not quoted him bare fac ' t ' d , And for King .
I my Lord with you haue heard Of his occulted guilt , Receiues rebuke from his fauours , Though nothing
I pray you all the primall eldest curse vpon ' th ' t , and Passion , and so fare
I meane time Ile obserue his day do well ; and Impotence Was he lay your Ambition in the matter
I hold Ambition of his vowes ; Ile husband them the flame With all within ' re his Vouchers ,
I am forbid my Heart - hast in the most deiect and vnschool ' t againe .
I dare not thinke you Goodman Deluer Clown .


---

### [E5] Add an option to compute the probability (according to the model) of an arbitrary input sentence. Create an input sentence that causes a problem. (I.e., where the output is different from your intuition about what the output should be.)

In [31]:
def compute_prob(sent, probs):
  """
  computes sentence probability according to the model

  :param sent: list of str, sentence to compute the probability of
  :param probs: pd.DataFrame of float, conditional probabilities matrix
  :return: float, probability of the sentence according to the model
           (the first token is not scored, so a one-token sentence gets 1.0)
  """
  log_prob = 0.0  # sum log-probabilities instead of multiplying probabilities
  all_seen = True  # becomes False once a zero-probability bigram is found
  for first_word, second_word in zip(sent, sent[1:]):
    if first_word in probs.index and second_word in probs.index:
      current_prob = probs[first_word][second_word]
      if current_prob:
        log_prob += np.log(current_prob)
      else:
        # report every unseen bigram; taking log(0) is skipped to avoid a warning
        print(f'bigram "{first_word} {second_word}" is not encountered in the corpus.')
        all_seen = False
    else:
      print('\nOOV')
      print(first_word, '-', first_word in probs.index)
      print(second_word, '-', second_word in probs.index)
      return 0.0
  return np.exp(log_prob) if all_seen else 0.0
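
The reason for summing logarithms rather than multiplying probabilities is numerical underflow on long inputs. A self-contained sketch on a toy nested-dictionary model (an assumed representation; `sentence_logprob` is a hypothetical helper) that returns the log-probability directly:

```python
import math

def sentence_logprob(sent, model):
    """Sum log P(w_i | w_{i-1}); returns -inf if any bigram is unseen."""
    total = 0.0
    for first, second in zip(sent, sent[1:]):
        p = model.get(first, {}).get(second, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

toy_model = {"a": {"b": 0.5, "c": 0.5}, "b": {"a": 1.0}}
print(sentence_logprob(["a", "b", "a"], toy_model))  # log(0.5) + log(1.0)
print(sentence_logprob(["a", "a"], toy_model))       # unseen bigram

# Multiplying many small probabilities underflows to 0.0; summing logs does not.
print(1e-5 ** 80, 80 * math.log(1e-5))
```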

In [32]:
compute_prob('Oh , Lord'.split(), hamlet_probs)  # this works

7.279609812914022e-05

In [33]:
compute_prob('Sleep is all I want .'.split(), hamlet_probs)  # This does not work because of the OOV


OOV
Sleep - False
is - True


0.0

In [34]:
compute_prob('My King , I want to die for you . '.split(), hamlet_probs)  # This does not work because the Hamlet corpus is sparse: several bigrams never occur in it

bigram "My King" is not encountered in the corpus.
bigram "I want" is not encountered in the corpus.
bigram "want to" is not encountered in the corpus.
bigram "to die" is not encountered in the corpus.
bigram "die for" is not encountered in the corpus.




0.0

In [51]:
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [52]:
news = nltk.corpus.brown.words(categories='news')

In [86]:
news_probs = get_MLE(Counter(news), Counter(get_ngrams(news)))
news_probs.head()

Unnamed: 0,The,Fulton,County,Grand,Jury,said,Friday,an,investigation,of,Atlanta's,recent,primary,election,produced,``,no,evidence,'',that,any,irregularities,took,place,.,jury,further,in,term-end,presentments,the,City,Executive,Committee,",",which,had,over-all,charge,deserves,...,displays,booklists,brochures,publishes,Sum,Substance,newsletter,governed,geographically,bear,Librarians,borne,multiplying,newcomers,maligned,teen-agers,co-ops,citizen,advances,educated,discriminating,slave,spiritual,restraint,falls,grows,translate,inadequacy,vigorously,"25,000,000","50,000,000",124,"1,509",66,pupils,render,vitally,16-22,richer,fuller
The,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000351,0.0,0.0,0.0,0.0,0.0,0.04918,0.0,0.0,0.004274,0.0,0.0,0.0,0.0,0.0,0.163524,0.0,0.0,0.001585,0.0,0.0,0.0,0.0,0.0,0.0,0.000386,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fulton,0.001241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000351,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001247,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000528,0.0,0.0,0.000896,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
County,0.0,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000351,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000248,0.0,0.0,0.000528,0.0,0.0,0.000179,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grand,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000528,0.0,0.0,0.000179,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jury,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
compute_prob('The United States'.split(), news_probs)  # OK

0.0031659108411054996

In [54]:
compute_prob('I am very sleepy .'.split(), news_probs)  # This does not work because of the OOV


OOV
very - True
sleepy - False


0.0

In [58]:
compute_prob('The Jury did not say .'.split(), news_probs)  #  This does not work because of the sparsity of the corpus

bigram "The Jury" is not encountered in the corpus.
bigram "Jury did" is not encountered in the corpus.
bigram "say ." is not encountered in the corpus.




0.0

---

### [E6] Add Laplace smoothing, and UNK. Does that fix the problem from the previous exercise? What does this intervention do to the probability mass? Compare the smoothed and unsmoothed probability of P("the" | "end") and P("end" | "the").



#### add UNK

In [44]:
def add_unk(corpus, freq=0.005):
  """
  replaces the least frequent word types in a corpus with <UNK>

  :param corpus: list of str, corpus
  :param freq: float, fraction of the vocabulary (word types) to replace, counted from the least frequent, optional, default = 0.005
  :return: list of str, corpus with the least frequent types replaced with <UNK>
  """
  counts = Counter(corpus)
  N = len(counts)  # vocabulary size (number of word types)
  f = math.ceil(freq * N)  # number of word types to replace
  to_unks = {t[0] for t in counts.most_common()[-f:]}  # a set makes the membership test below fast
  unked = ['<UNK>' if word in to_unks else word for word in corpus]
  return unked
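
Note that `add_unk` replaces a fixed fraction of the vocabulary taken from the tail of `most_common()`, so among many equally rare words only some are replaced. An alternative, sketched here as an assumption and not as the approach used in this worksheet, is to replace every token below a count threshold; `unk_by_count` is a hypothetical name.

```python
from collections import Counter

def unk_by_count(corpus, min_count=2, unk="<UNK>"):
    """Replace every token seen fewer than min_count times with unk."""
    counts = Counter(corpus)
    return [tok if counts[tok] >= min_count else unk for tok in corpus]

toy = "a b a c a b d".split()
print(unk_by_count(toy))  # c and d occur once and are replaced
```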

In [55]:
'<UNK>' in add_unk(news)

True

In [88]:
unk_news = add_unk(news)
unk_hamlet = add_unk(hamlet)  # needed for the smoothed Hamlet model below

#### add smoothing

In [91]:
def smooth_Laplace(word_counts, bigram_counts):
  """
  computes Laplace smoothed conditional bigram probabilities

  :param word_counts: collections.Counter, word counts
  :param bigram_counts: collections.Counter, bigram counts
  :return: pd.DataFrame, Laplace smoothed conditional probabilities of bigrams;
           column = first word, row = second word
  """
  keys = list(word_counts.keys())
  V = len(keys)  # vocabulary size
  MX = pd.DataFrame(index=keys, columns=keys, dtype=float)  # NaN marks bigrams not seen yet
  for bigram in tqdm(bigram_counts.keys(), total=len(bigram_counts), desc="search over bigrams: ", leave=True):
    # .at[row, column] avoids chained assignment, which may silently write to a copy
    MX.at[bigram[1], bigram[0]] = (bigram_counts[bigram] + 1) / (word_counts[bigram[0]] + V)
  for word in tqdm(keys, total=V, desc="fill NAs: ", leave=True):
    # every unseen bigram starting with `word` gets the add-one probability 1 / (C(word) + V)
    word_NA_filler = 1.0 / (word_counts[word] + V)
    MX[word] = MX[word].fillna(word_NA_filler)
  return MX


# Naive version that loops over all V * V word pairs; it is far too slow in practice
# def smooth_Laplace(word_counts, bigram_counts):
#   """
#   computes Laplace smoothed conditional bigram probabilities 

#   :param word_counts: collections.Counter, word counts
#   :param bigram_counts: collections.Counter, bigram counts
#   :return: pd.DataFrame, Laplace smoothed conditional probabilities of bigrams
#   """
#   keys = word_counts.keys()
#   MX = pd.DataFrame(index=keys, columns=keys)
#   for first_word in keys:
#     for second_word in keys:
#       MX[first_word][second_word] =  (bigram_counts[(first_word, second_word)] + 1 )/ (word_counts[first_word] + len(keys))
#   return MX

In [105]:
smooth_unk_hamlet_probs = smooth_Laplace(Counter(unk_hamlet), Counter(get_ngrams(unk_hamlet)))
smooth_unk_hamlet_probs

Unnamed: 0,[,The,Tragedie,of,Hamlet,by,William,Shakespeare,1599,],Actus,Primus,.,Scoena,Prima,Enter,Barnardo,and,Francisco,two,Centinels,Who,',s,there,?,Fran,Nay,answer,me,:,Stand,&,vnfold,your,selfe,Bar,Long,liue,the,...,dying,occurrents,solicited,cracke,flights,Drumme,English,Ambassador,Colours,Fortin,search,quarry,hauocke,proud,feast,Cell,Princes,shoote,bloodily,Amb,affaires,fulfill,abilitie,iumpe,bloodie,Polake,warres,arriued,stage,placed,vnknowing,carnall,bloudie,acts,accidentall,casuall,slaughters,vpshot,mistooke,<UNK>
[,0.000184,0.00018,0.000184,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000137,0.000184,0.000184,0.000182,0.000184,0.000166,0.000184,0.000184,0.000184,0.000184,0.000163,0.000180,0.000182,0.00017,0.000184,0.000184,0.000184,0.000177,0.000167,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184
The,0.000369,0.00018,0.000184,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.006022,0.000184,0.000184,0.000182,0.000184,0.000166,0.000184,0.000184,0.000184,0.000184,0.000163,0.000180,0.000182,0.00068,0.000184,0.000184,0.000184,0.000177,0.001671,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184
Tragedie,0.000184,0.00036,0.000184,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000137,0.000184,0.000184,0.000182,0.000184,0.000166,0.000184,0.000184,0.000184,0.000184,0.000163,0.000180,0.000182,0.00017,0.000184,0.000184,0.000184,0.000177,0.000167,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184
of,0.000184,0.00018,0.000369,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000137,0.000184,0.000184,0.000182,0.000184,0.000664,0.000184,0.000368,0.000184,0.000184,0.000163,0.000180,0.000182,0.00017,0.000184,0.000184,0.000368,0.000531,0.000334,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000369,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000369,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000551
Hamlet,0.000184,0.00018,0.000184,0.000834,0.000362,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.001095,0.000184,0.000184,0.001999,0.000184,0.000332,0.000184,0.000184,0.000184,0.000184,0.000163,0.000361,0.000182,0.00017,0.000184,0.000184,0.000368,0.000177,0.000501,0.000184,0.000367,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
casuall,0.000184,0.00018,0.000184,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000137,0.000184,0.000184,0.000182,0.000184,0.000166,0.000184,0.000184,0.000184,0.000184,0.000163,0.000180,0.000182,0.00017,0.000184,0.000184,0.000184,0.000177,0.000167,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184
slaughters,0.000184,0.00018,0.000184,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000137,0.000184,0.000184,0.000182,0.000184,0.000166,0.000184,0.000184,0.000184,0.000184,0.000163,0.000180,0.000182,0.00017,0.000184,0.000184,0.000184,0.000177,0.000167,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000369,0.000184,0.000184,0.000184,0.000184
vpshot,0.000184,0.00018,0.000184,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000137,0.000184,0.000184,0.000182,0.000184,0.000166,0.000184,0.000184,0.000184,0.000184,0.000163,0.000180,0.000182,0.00017,0.000184,0.000184,0.000184,0.000177,0.000167,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184
mistooke,0.000184,0.00018,0.000184,0.000167,0.000181,0.000181,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000137,0.000184,0.000184,0.000182,0.000184,0.000166,0.000184,0.000184,0.000184,0.000184,0.000163,0.000180,0.000182,0.00017,0.000184,0.000184,0.000184,0.000177,0.000167,0.000184,0.000184,0.000184,0.000177,0.000182,0.000184,0.000184,0.000184,0.000159,...,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184,0.000184


#### problems from the previous exercise

In [116]:
unk_prepr = lambda sent, vocab: ['<UNK>' if word not in vocab else word for word in sent]

In [118]:
unk_prepr('Sleep is all I want .'.split(), smooth_unk_hamlet_probs.index)

['<UNK>', 'is', 'all', 'I', 'want', '.']

In [106]:
compute_prob('Oh , Lord'.split(), smooth_unk_hamlet_probs)

5.910326943525643e-07

In [119]:
compute_prob(unk_prepr('Sleep is all I want .'.split(), smooth_unk_hamlet_probs.index), smooth_unk_hamlet_probs)

5.386255939005639e-19

In [108]:
compute_prob('My King , I want to die for you . '.split(), smooth_unk_hamlet_probs)

1.875013048116561e-29

#### What does this intervention do to the probability mass?
The overall probability mass does not change: each conditional distribution still sums to 1. It is redistributed, however, so that unseen and improbable bigrams receive a small share of the mass, which is taken from the more probable bigrams.

In [113]:
compute_prob(['<UNK>'], smooth_unk_hamlet_probs) 

1.0
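
A one-token input contains no bigrams, so `compute_prob` returns `exp(0) = 1.0` for it regardless of the model. To check the probability mass more directly, one could sum the smoothed distribution of a single context over the whole vocabulary. A sketch on a toy corpus; `laplace_distribution` is a hypothetical helper.

```python
from collections import Counter

def laplace_distribution(context, word_counts, bigram_counts):
    """Return {w: P_Laplace(w | context)} over the whole vocabulary."""
    V = len(word_counts)
    denom = word_counts[context] + V
    return {w: (bigram_counts[(context, w)] + 1) / denom for w in word_counts}

toy = "the end of the play is the end".split()
word_counts = Counter(toy)
bigram_counts = Counter(zip(toy, toy[1:]))
dist = laplace_distribution("the", word_counts, bigram_counts)
print(dist["end"], sum(dist.values()))
```

The sum is 1 whenever every occurrence of the context word has a successor; for the very last token of a corpus the unigram count is one larger than the number of observed followers, so its distribution sums to slightly less than 1.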

#### compare the smoothed and unsmoothed probability of P("the" | "end") and P("end" | "the")

In [109]:
compute_prob('the end'.split(), hamlet_probs)  # unsmoothed P("end" | "the")

0.005813953488372095

In [110]:
compute_prob('end The'.split(), hamlet_probs)  # unsmoothed P("The" | "end"); note the capitalised "The", which is a different token from "the"

0.0588235294117647

In [111]:
compute_prob('the end'.split(), smooth_unk_hamlet_probs)  # smoothed P("end" | "the")

0.0009554140127388538

In [112]:
compute_prob('end The'.split(), smooth_unk_hamlet_probs)  # smoothed P("The" | "end"); again with the capitalised "The"

0.0003678499172337687
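
The four values can be reproduced by hand from the add-one formula $(C(w_{n-1}w_n) + 1) / (C(w_{n-1}) + V)$. In the sketch below, `V` is taken from the "fill NAs" progress bar and `C("the")` from the unigram table; the remaining counts are not shown in the outputs and are inferred from the unsmoothed probabilities, so they should be read as assumptions.

```python
V = 5420        # vocabulary size after <UNK> replacement (from the progress bar)
c_the = 860     # C("the"), from the unigram table
c_the_end = 5   # inferred: 5 / 860 matches the unsmoothed value
c_end = 17      # inferred: 1 / 17 matches the unsmoothed value
c_end_The = 1   # inferred together with c_end

mle_end_given_the = c_the_end / c_the
laplace_end_given_the = (c_the_end + 1) / (c_the + V)
mle_The_given_end = c_end_The / c_end
laplace_The_given_end = (c_end_The + 1) / (c_end + V)

print(mle_end_given_the, laplace_end_given_the)
print(mle_The_given_end, laplace_The_given_end)
```

Because `V` is much larger than either context count, smoothing shrinks both estimates considerably: by a factor of about 6 for the frequent context "the" and about 160 for the rare context "end". As a result, the ordering of the two probabilities is reversed after smoothing.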