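CountVectorizer from scikit-learn will be used to count every unigram and bigram (ngram_range=(1,2)) in our Shakespeare corpus, so the vocabulary size reported below covers both single words and adjacent word pairs.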

In [6]:
import nltk
import requests
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

nltk.download('punkt_tab')

url = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
response = requests.get(url).text[785:-18093].lower()

sentences = nltk.sent_tokenize(response)

vectorizer = CountVectorizer(ngram_range=(1,2))
counts = vectorizer.fit_transform(sentences)
word_list = vectorizer.get_feature_names_out().tolist()
total_word_count = len(word_list)
all_word_counts = np.sum(counts, axis=0).flatten()

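from typing import Optional

def ngram_count(ngram: str) -> Optional[int]:
  """Return the corpus count of a unigram or bigram, or None if it was never seen."""
  # vocabulary_ maps each n-gram to its column index, so the lookup is a
  # dictionary access instead of a linear scan over word_list.
  idx = vectorizer.vocabulary_.get(ngram)
  if idx is None:
    return None
  return int(all_word_counts[0, idx])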

print(f"Unique words in corpus: {total_word_count}")
ngram = "when"
print(f"Count of '{ngram}': {ngram_count(ngram)}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unique words in corpus: 349202
Count of 'when': 2225
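
To make the counting step concrete, here is a minimal pure-Python sketch of what the vectorizer is doing on a toy corpus. It assumes the same token shape as CountVectorizer's default token_pattern (two or more word characters), which is why one-letter words such as "i" and "a" never reach the vocabulary. The names tokenize, count_ngrams, and toy_sentences are illustrative and do not appear in the notebook.

```python
import re
from collections import Counter

# Assumption: same shape as CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b".
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(sentence: str) -> list[str]:
  """Lowercase and keep only tokens of two or more word characters."""
  return TOKEN_RE.findall(sentence.lower())

def count_ngrams(sentences: list[str]) -> Counter:
  """Count unigrams and bigrams, never pairing words across sentences."""
  counts = Counter()
  for sentence in sentences:
    tokens = tokenize(sentence)
    counts.update(tokens)  # unigrams
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))  # bigrams
  return counts

# Hypothetical toy corpus, not taken from the Shakespeare text.
toy_sentences = ["When I was young.", "I was a king when young."]
toy_counts = count_ngrams(toy_sentences)

print(tokenize("When I was young."))
print(toy_counts["when"], toy_counts["when was"], toy_counts["when i"])
```

Because "i" is dropped before bigrams are formed, "when was" is counted as a bigram while "when i" has a count of zero.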


The unigram language model ignores its input and always returns the most common word in the corpus.

In [7]:
def unigram_text_gen(in_string: str) -> str:
  final_word = word_list[np.argmax(all_word_counts)]
  return final_word

print(f"My dog is very {unigram_text_gen('My dog is very')}")

My dog is very the


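The bigram language model collects every word that follows the last word of its input somewhere in the corpus. As implemented here, it then returns the candidate with the highest overall (unigram) count, which is not necessarily the word that most often follows the input. If the last word was never seen, it falls back to the most common word.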

In [8]:
def bigram_text_gen(in_string: str) -> str:
  # The corpus was lowercased, so lowercase the input before matching.
  last_word = in_string.lower().split(' ')[-1]
  candidate_list = []
  candidate_count = []
  for ngram in word_list:
    ngram_list = ngram.split(' ')
    if len(ngram_list) != 2:
      continue
    if ngram_list[0] == last_word:
      candidate_list.append(ngram_list[1])
      candidate_count.append(ngram_count(ngram_list[1]))
  # Handle empty candidate_count
  if not candidate_count:
    # Return the most common unigram as a fallback
    return word_list[np.argmax(all_word_counts)]
  else:
    return candidate_list[np.argmax(candidate_count)]


print(f"My dog is very {bigram_text_gen('My dog is very')}")

My dog is very her
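
The cell above ranks candidates with ngram_count(ngram_list[1]), the overall frequency of the candidate word. The sketch below contrasts that with ranking by the count of the bigram itself, using small made-up counts. Everything here (the toy counts and the function names) is hypothetical and only meant to show how the two rankings can disagree.

```python
from collections import Counter

# Hypothetical toy counts standing in for the Shakespeare statistics.
toy_bigram_counts = Counter({"very well": 5, "very good": 3, "very the": 1, "the king": 9})
toy_unigram_counts = Counter({"the": 100, "very": 9, "king": 9, "well": 8, "good": 6})

def followers(last_word: str) -> dict[str, int]:
  """Map each word seen after last_word to the count of that bigram."""
  result = {}
  for bigram, count in toy_bigram_counts.items():
    first, second = bigram.split(" ")
    if first == last_word:
      result[second] = count
  return result

def next_by_bigram_count(last_word: str) -> str:
  """Pick the follower whose bigram with last_word is most frequent."""
  candidates = followers(last_word)
  return max(candidates, key=candidates.get)

def next_by_unigram_count(last_word: str) -> str:
  """Pick the follower that is most frequent overall, as the cell above does."""
  candidates = followers(last_word)
  return max(candidates, key=lambda word: toy_unigram_counts[word])

print(next_by_bigram_count("very"))   # well
print(next_by_unigram_count("very"))  # the
```

Both helpers raise ValueError when the last word has no followers; the notebook's fallback to the most common word would cover that case.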


Complete the implementation of the perplexity measure to compare these two models.

In [10]:
def perplexity(sentence: str) -> float:
  words = nltk.word_tokenize(sentence.lower())
  # If no words, define perplexity as 1.0 (arbitrary choice for empty input)
  if not words:
    return 1.0

  sum_log_prob = 0.0
  N = len(words)

  for i in range(N):
    # First word uses unigram probability
    if i == 0:
      count_unigram = ngram_count(words[i])
      if count_unigram is None or count_unigram == 0:
        return float('inf')
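      # all_word_counts holds unigram and bigram counts, so this denominator
      # is the total over both, not just the number of word tokens.
      prob = count_unigram / np.sum(all_word_counts)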

    # Subsequent words use bigram probability
    else:
      bigram = words[i-1] + " " + words[i]
      count_bigram = ngram_count(bigram)
      count_prev_unigram = ngram_count(words[i-1])

      if (count_bigram is None or count_bigram == 0 or
          count_prev_unigram is None or count_prev_unigram == 0):
        return float('inf')

      prob = count_bigram / count_prev_unigram

    # Accumulate log probabilities
    sum_log_prob += np.log(prob)

  # Perplexity = exp(- (1/N) * sum(log(probabilities)))
  return float(np.exp(-sum_log_prob / N))
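
The formula in the last comment can be checked in isolation. The following sketch applies it to hand-picked per-word probabilities instead of corpus counts; perplexity_from_probs is a hypothetical helper, and the probabilities are invented for illustration.

```python
import math

def perplexity_from_probs(probs: list[float]) -> float:
  """Return exp(-(1/N) * sum(log p)) for a list of per-word probabilities."""
  if not probs:
    return 1.0  # same convention as the cell above for empty input
  if any(p == 0.0 for p in probs):
    return math.inf  # one impossible word makes the whole sentence impossible
  log_sum = sum(math.log(p) for p in probs)
  return math.exp(-log_sum / len(probs))

# Hypothetical per-word probabilities, not taken from the corpus.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))  # about 4: four equally likely choices
print(perplexity_from_probs([0.5, 0.125]))              # also about 4: geometric mean is 0.25
print(perplexity_from_probs([0.5, 0.0]))                # inf
```

Perplexity is the reciprocal of the geometric mean of the per-word probabilities, which is why a single zero drives it to infinity no matter how likely the other words are.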

In [11]:
test_sentences = [
    "my dog is very",
    "when i was young",
    "the quick brown fox jumps",
    "it was the best of times",
    "this is nonsense dzzz"
]

for sentence in test_sentences:
  a_output = unigram_text_gen(sentence)
  print(f"Model A says: {sentence} {a_output}")
  print(f"Perplexity of model A: {perplexity(f'{sentence} {a_output}')}")

  b_output = bigram_text_gen(sentence)
  print(f"Model B says: {sentence} {b_output}")
  print(f"Perplexity of model B: {perplexity(f'{sentence} {b_output}')}")

Model A says: my dog is very the
Perplexity of model A: inf
Model B says: my dog is very her
Perplexity of model B: 245.38653309406996
Model A says: when i was young the
Perplexity of model A: inf
Model B says: when i was young and
Perplexity of model B: inf
Model A says: the quick brown fox jumps the
Perplexity of model A: inf
Model B says: the quick brown fox jumps with
Perplexity of model B: inf
Model A says: it was the best of times the
Perplexity of model A: 95.98762248951924
Model B says: it was the best of times the
Perplexity of model B: 95.98762248951924
Model A says: this is nonsense dzzz the
Perplexity of model A: inf
Model B says: this is nonsense dzzz the
Perplexity of model B: inf


**Use this text block to explain the perplexity of the two models. What does
this number tell us about the entropy or enthalpy of the models?**

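Perplexity measures how well a language model predicts a piece of text: it is the exponential of the average negative log-probability the model assigns to each word, so a lower value means the text was less surprising to the model and a higher value means it was more surprising. If any word or bigram in the sentence has zero probability, the perplexity is infinite. That is why most of the sentences above score inf: the probabilities are raw counts with no smoothing, so a single unseen bigram is enough. For example, CountVectorizer's default tokenizer drops one-letter tokens such as "i", so the bigram "when i" produced by nltk.word_tokenize is never found in the vocabulary. Note also that perplexity() scores every sentence with the same bigram counts, so the numbers compare how plausible each model's completion is under those counts rather than each model's own predictive distribution.

Perplexity is directly tied to entropy in information theory: it is the exponential of the cross-entropy between the model and the text, so it can be read as the effective number of equally likely choices the model faces at each word. It has nothing to do with enthalpy, a thermodynamic quantity that plays no role in language modeling.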
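
The link to entropy can be written out explicitly. The notation below is an assumption for illustration ($w_i$ is the $i$-th word, $N$ the sentence length, and the first word is scored with its unigram probability, matching the perplexity function above, which uses natural logarithms):

$$
H = -\frac{1}{N}\left[\ln p(w_1) + \sum_{i=2}^{N} \ln p(w_i \mid w_{i-1})\right],
\qquad
\mathrm{PP} = e^{H}
$$

If every word were drawn uniformly from $V$ equally likely options, each probability would be $1/V$, giving $H = \ln V$ and $\mathrm{PP} = V$.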