
# CSI 5900: Lectures 01-03 Code Examples

Prof. Steven Wilson, Oakland University

# Setup / installs

In [None]:
! pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# ELIZA Example

From https://www.nltk.org/_modules/nltk/chat/eliza.html

In [None]:
from nltk.chat.util import Chat, reflections

## Initialize Patterns

In [None]:
pairs = (
    (
        r"I need (.*)",
        (
            "Why do you need %1?",
            "Would it really help you to get %1?",
            "Are you sure you need %1?",
        ),
    ),
    (
        r"Why don\'t you (.*)",
        (
            "Do you really think I don't %1?",
            "Perhaps eventually I will %1.",
            "Do you really want me to %1?",
        ),
    ),
    (
        r"Why can\'t I (.*)",
        (
            "Do you think you should be able to %1?",
            "If you could %1, what would you do?",
            "I don't know -- why can't you %1?",
            "Have you really tried?",
        ),
    ),
    (
        r"I can\'t (.*)",
        (
            "How do you know you can't %1?",
            "Perhaps you could %1 if you tried.",
            "What would it take for you to %1?",
        ),
    ),
    (
        r"I am (.*)",
        (
            "Did you come to me because you are %1?",
            "How long have you been %1?",
            "How do you feel about being %1?",
        ),
    ),
    (
        r"I\'m (.*)",
        (
            "How does being %1 make you feel?",
            "Do you enjoy being %1?",
            "Why do you tell me you're %1?",
            "Why do you think you're %1?",
        ),
    ),
    (
        r"Are you (.*)",
        (
            "Why does it matter whether I am %1?",
            "Would you prefer it if I were not %1?",
            "Perhaps you believe I am %1.",
            "I may be %1 -- what do you think?",
        ),
    ),
    (
        r"What (.*)",
        (
            "Why do you ask?",
            "How would an answer to that help you?",
            "What do you think?",
        ),
    ),
    (
        r"How (.*)",
        (
            "How do you suppose?",
            "Perhaps you can answer your own question.",
            "What is it you're really asking?",
        ),
    ),
    (
        r"Because (.*)",
        (
            "Is that the real reason?",
            "What other reasons come to mind?",
            "Does that reason apply to anything else?",
            "If %1, what else must be true?",
        ),
    ),
    (
        r"(.*) sorry (.*)",
        (
            "There are many times when no apology is needed.",
            "What feelings do you have when you apologize?",
        ),
    ),
    (
        r"Hello(.*)",
        (
            "Hello... I'm glad you could drop by today.",
            "Hi there... how are you today?",
            "Hello, how are you feeling today?",
        ),
    ),
    (
        r"I think (.*)",
        ("Do you doubt %1?", "Do you really think so?", "But you're not sure %1?"),
    ),
    (
        r"(.*) friend (.*)",
        (
            "Tell me more about your friends.",
            "When you think of a friend, what comes to mind?",
            "Why don't you tell me about a childhood friend?",
        ),
    ),
    (r"Yes", ("You seem quite sure.", "OK, but can you elaborate a bit?")),
    (
        r"(.*) computer(.*)",
        (
            "Are you really talking about me?",
            "Does it seem strange to talk to a computer?",
            "How do computers make you feel?",
            "Do you feel threatened by computers?",
        ),
    ),
    (
        r"Is it (.*)",
        (
            "Do you think it is %1?",
            "Perhaps it's %1 -- what do you think?",
            "If it were %1, what would you do?",
            "It could well be that %1.",
        ),
    ),
    (
        r"It is (.*)",
        (
            "You seem very certain.",
            "If I told you that it probably isn't %1, what would you feel?",
        ),
    ),
    (
        r"Can you (.*)",
        (
            "What makes you think I can't %1?",
            "If I could %1, then what?",
            "Why do you ask if I can %1?",
        ),
    ),
    (
        r"Can I (.*)",
        (
            "Perhaps you don't want to %1.",
            "Do you want to be able to %1?",
            "If you could %1, would you?",
        ),
    ),
    (
        r"You are (.*)",
        (
            "Why do you think I am %1?",
            "Does it please you to think that I'm %1?",
            "Perhaps you would like me to be %1.",
            "Perhaps you're really talking about yourself?",
        ),
    ),
    (
        r"You\'re (.*)",
        (
            "Why do you say I am %1?",
            "Why do you think I am %1?",
            "Are we talking about you, or me?",
        ),
    ),
    (
        r"I don\'t (.*)",
        ("Don't you really %1?", "Why don't you %1?", "Do you want to %1?"),
    ),
    (
        r"I feel (.*)",
        (
            "Good, tell me more about these feelings.",
            "Do you often feel %1?",
            "When do you usually feel %1?",
            "When you feel %1, what do you do?",
        ),
    ),
    (
        r"I have (.*)",
        (
            "Why do you tell me that you've %1?",
            "Have you really %1?",
            "Now that you have %1, what will you do next?",
        ),
    ),
    (
        r"I would (.*)",
        (
            "Could you explain why you would %1?",
            "Why would you %1?",
            "Who else knows that you would %1?",
        ),
    ),
    (
        r"Is there (.*)",
        (
            "Do you think there is %1?",
            "It's likely that there is %1.",
            "Would you like there to be %1?",
        ),
    ),
    (
        r"My (.*)",
        (
            "I see, your %1.",
            "Why do you say that your %1?",
            "When your %1, how do you feel?",
        ),
    ),
    (
        r"You (.*)",
        (
            "We should be discussing you, not me.",
            "Why do you say that about me?",
            "Why do you care whether I %1?",
        ),
    ),
    (r"Why (.*)", ("Why don't you tell me the reason why %1?", "Why do you think %1?")),
    (
        r"I want (.*)",
        (
            "What would it mean to you if you got %1?",
            "Why do you want %1?",
            "What would you do if you got %1?",
            "If you got %1, then what would you do?",
        ),
    ),
    (
        r"(.*) mother(.*)",
        (
            "Tell me more about your mother.",
            "What was your relationship with your mother like?",
            "How do you feel about your mother?",
            "How does this relate to your feelings today?",
            "Good family relations are important.",
        ),
    ),
    (
        r"(.*) father(.*)",
        (
            "Tell me more about your father.",
            "How did your father make you feel?",
            "How do you feel about your father?",
            "Does your relationship with your father relate to your feelings today?",
            "Do you have trouble showing affection with your family?",
        ),
    ),
    (
        r"(.*) child(.*)",
        (
            "Did you have close friends as a child?",
            "What is your favorite childhood memory?",
            "Do you remember any dreams or nightmares from childhood?",
            "Did the other children sometimes tease you?",
            "How do you think your childhood experiences relate to your feelings today?",
        ),
    ),
    (
        r"(.*)\?",
        (
            "Why do you ask that?",
            "Please consider whether you can answer your own question.",
            "Perhaps the answer lies within yourself?",
            "Why don't you tell me?",
        ),
    ),
    (
        r"quit",
        (
            "Thank you for talking with me.",
            "Good-bye.",
            "Thank you, that will be $150.  Have a good day!",
        ),
    ),
    (
        r"(.*)",
        (
            "Please tell me more.",
            "Let's change focus a bit... Tell me about your family.",
            "Can you elaborate on that?",
            "Why do you say that %1?",
            "I see.",
            "Very interesting.",
            "%1.",
            "I see.  And what does that tell you?",
            "How does that make you feel?",
            "How do you feel when you say that?",
        ),
    ),
)

eliza_chatbot = Chat(pairs, reflections)
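To make the pattern/response mechanism concrete, here is a minimal sketch that uses only the standard `re` module. It is not NLTK's implementation: `mini_reflections`, `reflect`, and `respond` are hypothetical names, and the reflection table is a tiny stand-in for `nltk.chat.util.reflections`. It shows the two steps a pair goes through: the regex captures part of the input, and the captured text is "reflected" (first person swapped for second person) before it replaces `%1`.

```python
import re

# Hypothetical, tiny stand-in for nltk.chat.util.reflections
mini_reflections = {"i": "you", "my": "your", "am": "are", "you": "me"}

def reflect(fragment):
    # lowercase, then swap first- and second-person words token by token
    return " ".join(mini_reflections.get(tok, tok) for tok in fragment.lower().split())

def respond(pattern, template, user_input):
    # try to match the pattern at the start of the input, ignoring case
    match = re.match(pattern, user_input, re.IGNORECASE)
    if match is None:
        return None
    # substitute the reflected capture group for the %1 placeholder
    return template.replace("%1", reflect(match.group(1)))

reply = respond(r"I need (.*)", "Why do you need %1?", "I need my coffee")
print(reply)
```

A real chatbot would try each pair in order and pick one of the candidate responses at random; this sketch handles a single pattern and a single template.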

## Chat with ELIZA

In [None]:
def eliza_chat():
    print("Therapist\n---------")
    print("Talk to the program by typing in plain English, using normal upper-")
    print('and lower-case letters and punctuation.  Enter "quit" when done.')
    print("=" * 72)
    print("Hello.  How are you feeling today?")

    eliza_chatbot.converse()


eliza_chat()

Therapist
---------
Talk to the program by typing in plain English, using normal upper-
and lower-case letters and punctuation.  Enter "quit" when done.
Hello.  How are you feeling today?
>Good, just teaching class
Please tell me more.
>we are in a lecture right now
Can you elaborate on that?
>no
I see.  And what does that tell you?
>I wish I had dinner right now
you wish you had dinner right now.
>yes
OK, but can you elaborate a bit?
>I'm hungry but I want to learn about NLP
How does being hungry but you want to learn about nlp make you feel?
>conflicted
conflicted.
>quit
Good-bye.


# Regular expressions

In [None]:
import re

## Example 1

In [None]:
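# you can pre-compile patterns with re.compile() and reuse them
def run_regex(pattern, string):
    return re.findall(pattern, string)

# raw string (r"...") so that \w reaches the regex engine unchanged
pattern = re.compile(r"\w+[aeiou]+\w+")
string = "Find all wrds wth some vwls in the middle of them"

run_regex(pattern, string)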

['Find', 'some', 'middle', 'them']

## Example 2

In [None]:
# name and extract parts, attempt 1
string = "some other stuff Instructor: Wilson    Course: NLP"
pattern1 = "Instructor: .* Course: .*"
regex_match1 = re.search(pattern1,string)
print(regex_match1)
print(regex_match1.groups())

<re.Match object; span=(17, 50), match='Instructor: Wilson    Course: NLP'>
()


In [None]:
# attempt 2: using groups
string = "some other stuff Instructor: Wilson    Course: NLP"
pattern2 = "Instructor: (.*) Course: (.*)"
regex_match2 = re.search(pattern2,string)
print(regex_match2)
print(regex_match2.groups())

<re.Match object; span=(17, 50), match='Instructor: Wilson    Course: NLP'>
('Wilson   ', 'NLP')


In [None]:
# attempt 3: naming groups and output as dictionary
string = "some other stuff Instructor: Wilson    Course: NLP"
pattern3 = r"Instructor:\s*(?P<instructor>\w*)\s*Course:\s*(?P<course>.*)"
regex_match3 = re.search(pattern3,string)
print(regex_match3)
print(regex_match3.groupdict())

<re.Match object; span=(17, 50), match='Instructor: Wilson    Course: NLP'>
{'instructor': 'Wilson', 'course': 'NLP'}
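The same named-group idea extends to several records at once. The sketch below assumes a hypothetical multi-line string `records` in the same "Instructor: ... Course: ..." layout (the second record is made up for illustration) and uses `re.finditer` to collect one dictionary per match.

```python
import re

# Hypothetical records in the same layout as the example string above
records = """Instructor: Wilson    Course: NLP
Instructor: Smith  Course: Databases"""

# \w+ stops at whitespace, so trailing spaces are not captured
record_pattern = re.compile(
    r"Instructor:\s*(?P<instructor>\w+)\s+Course:\s*(?P<course>\w+)"
)

# finditer yields one match object per record; groupdict() names the parts
parsed = [m.groupdict() for m in record_pattern.finditer(records)]
print(parsed)
```

Using `\w+` instead of `.*` for the course keeps each match on its own line, at the cost of not handling multi-word course names.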


# Text Normalization

In [None]:
# lowercasing

"MyText".lower()

'mytext'

In [None]:
# simple tokenization

"The cat sat on the mat".split()

['The', 'cat', 'sat', 'on', 'the', 'mat']

In [None]:
# substitution

"The cat sat on the mat".replace("cat","rat")

'The rat sat on the mat'

In [None]:
# lowercase, tokenize, remove stopwords
mystops = ["a","an","the","for","but","so","on"]
mystops = set(mystops)

tokens = [tok for tok in "The cat sat on the mat".lower().split() if tok not in mystops]
' '.join(tokens)

'cat sat mat'

In [None]:
# first attempt at a preprocessing function

def preprocess(text, lowercase=False, stopwords=None):

    # lowercase
    if lowercase:
        text = text.lower()
    # tokenize and remove stopwords
    tokens = text.split()
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

# test
preprocess("The cat sat on the mat", True, mystops)

['cat', 'sat', 'mat']

In [None]:
tokens = preprocess("Time flies like an arrow, fruit flies like a banana.", True, mystops)
'banana.' in tokens

True

### What else to consider?
- Punctuation
    - Should the . in "Dr." be the same as in "This is the end."?
    - Should can't be split into "can" and "t"? What about "I said 'can' earlier"?
- Stemming/lemmatization
    - should "Running", "ran", and "run" be treated the same?
- Emojis
    - What to do about them? 🤔
- Stopword list
    - Should "I" be a stopword?
        - The Secret Life of Pronouns: https://www.youtube.com/watch?v=PGsQwAu3PzU
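As one possible answer to the punctuation question, here is a minimal regex-based tokenizer sketch. The function name `simple_tokenize` and the pattern are illustrative assumptions, not a standard: the pattern keeps contractions such as "can't" together and splits every other punctuation mark into its own token.

```python
import re

def simple_tokenize(text):
    # either a word, optionally followed by an apostrophe and more word
    # characters (e.g. "can't"), or any single character that is neither
    # a word character nor whitespace (i.e. punctuation)
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize("Time flies like an arrow, fruit flies like a banana.")
print(tokens)
print(simple_tokenize("I can't stop."))
print(simple_tokenize("I said 'can' earlier"))
```

This still splits "Dr." into "Dr" and ".", so abbreviations would need extra handling.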

In [None]:
# Stemming with the Porter stemmer
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization', 'sensational', 'traditional', 'reference', 'colonizer', 'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot


Does it matter if the results are "real words"?

In [None]:
# Basic word tokenization
sentence = "Dr. A is going run some experiments."
sentence.split()

['Dr.', 'A', 'is', 'going', 'run', 'some', 'experiments.']

In [None]:
# basic sentence tokenization
sentence.split('.')

['Dr', ' A is going run some experiments', '']

In [None]:
# NLTK version
import nltk
nltk.download('punkt')
nltk.word_tokenize(sentence)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Dr.', 'A', 'is', 'going', 'run', 'some', 'experiments', '.']

In [None]:
# NLTK version
nltk.sent_tokenize(sentence)

['Dr. A is going run some experiments.']

# Edit Distance

Let's write a function to compute the edit distance between two strings, where an insertion or deletion costs 1 and a substitution costs 2.

In [None]:
def edit_distance(word1, word2):
    
    if word1 == "" and word2 == "":
        ed = 0

    elif word1 == "":
        ed = len(word2)

    elif word2 == "":
        ed = len(word1)

    else:
        not_same = word1[-1] != word2[-1]
        repl = edit_distance(word1[:-1],word2[:-1]) + (2 * int(not_same))
        ins = edit_distance(word1, word2[:-1]) + 1
        delt = edit_distance(word1[:-1], word2) + 1

        ed = min(repl,ins,delt)

    return ed

# test
edit_distance("intention", "execution")

8
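The recursive version above recomputes the same sub-problems many times, so it slows down quickly on longer words. A minimal sketch of one way around that, assuming the same costs (insert/delete 1, substitute 2), is to cache results with `functools.lru_cache`. The name `edit_distance_memo` and the index-based inner helper `dist` are choices made for this illustration.

```python
from functools import lru_cache

def edit_distance_memo(word1, word2):
    @lru_cache(maxsize=None)
    def dist(i, j):
        # distance between the prefixes word1[:i] and word2[:j]
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        sub_cost = 0 if word1[i - 1] == word2[j - 1] else 2
        return min(
            dist(i - 1, j - 1) + sub_cost,  # substitute (or keep)
            dist(i, j - 1) + 1,             # insert
            dist(i - 1, j) + 1,             # delete
        )

    return dist(len(word1), len(word2))

result = edit_distance_memo("intention", "execution")
print(result)
```

Each `(i, j)` pair is now computed once, which is the same idea the table-based solution implements explicitly.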

### Dr. Wilson's solution

In [None]:
import numpy as np
def my_edit_distance(w1,w2):
    n = len(w1)
    m = len(w2)
    D = np.zeros((n+1,m+1))

    # going from w1 to empty string requires
    # number of deletes equal to current length of w1
    for i in range(n+1):
        D[i,0] = i

    # going from empty string to w2 requires
    # number of inserts equal to current length of w2
    for j in range(m+1):
        D[0,j] = j

    # all other operations are based on these initial values
    for i in range(1,n+1):
        for j in range(1,m+1):
            del_cost = D[i-1,j] + 1
            ins_cost = D[i,j-1] + 1
            sub_cost = D[i-1,j-1] + (0 if w1[i-1]==w2[j-1] else 2)
            D[i,j] = min(del_cost, sub_cost, ins_cost)

    return int(D[n,m])
    
my_edit_distance("intention", "execution")

8

## Spell checker

If there is time left, let's write a program that takes a word and returns "Correct" if it's in a dictionary, and returns the "nearest" candidates (by edit distance) otherwise.

In [None]:
import nltk
nltk.download('words')
from nltk.corpus import words
all_words = set(words.words())
'antelope' in all_words

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


False

In [None]:
def spellcheck(word):
    return "Correct"

### Dr. Wilson's solution

In [None]:
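w = 'antlope'
if w in all_words:
    print("Correct")
else:
    suggestions = []
    # starting threshold: candidates farther away than this are ignored
    shortest = len(w) * 2
    for ow in all_words:
        ed = my_edit_distance(w,ow)
        if ed == shortest:
            # tie with the best distance so far: keep both
            suggestions.append(ow)
        elif ed < shortest:
            # strictly closer word: restart the suggestion list
            suggestions = [ow]
            shortest = ed
    print(shortest)
    print(suggestions)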

1
['antelope']
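The same logic can be wrapped into the `spellcheck` function from the exercise. The sketch below is self-contained so it can run without NLTK or NumPy: `edit_distance_dp` is a two-row variant of the table-based distance (same costs), and `toy_dictionary` is a hypothetical four-word stand-in for `all_words`.

```python
def edit_distance_dp(w1, w2):
    # same costs as above: insert/delete = 1, substitute = 2
    prev = list(range(len(w2) + 1))  # distances from "" to each prefix of w2
    for i in range(1, len(w1) + 1):
        curr = [i]  # distance from w1[:i] to ""
        for j in range(1, len(w2) + 1):
            sub = prev[j - 1] + (0 if w1[i - 1] == w2[j - 1] else 2)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, sub))
        prev = curr
    return prev[-1]

def spellcheck(word, dictionary):
    if word in dictionary:
        return "Correct"
    # score every dictionary word, then keep all words tied for the minimum
    distances = {cand: edit_distance_dp(word, cand) for cand in dictionary}
    best = min(distances.values())
    return sorted(cand for cand, d in distances.items() if d == best)

# Hypothetical toy dictionary standing in for the NLTK word list
toy_dictionary = {"antelope", "antler", "envelope", "cat"}
print(spellcheck("antelope", toy_dictionary))
print(spellcheck("antlope", toy_dictionary))
```

Scoring every dictionary word is simple but slow on a full word list; filtering candidates by length first is one possible speed-up.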


# First Language Model

Let's build an ngram language model and use it to generate some text.

In [None]:
# Load a sample corpus to work with
! wget https://sherlock-holm.es/stories/plain-text/cano.txt

--2023-01-19 22:36:25--  https://sherlock-holm.es/stories/plain-text/cano.txt
Resolving sherlock-holm.es (sherlock-holm.es)... 49.12.76.210, 2a01:4f8:c17:3ff5::2
Connecting to sherlock-holm.es (sherlock-holm.es)|49.12.76.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3868223 (3.7M) [text/plain]
Saving to: ‘cano.txt’


2023-01-19 22:36:28 (2.16 MB/s) - ‘cano.txt’ saved [3868223/3868223]



In [None]:
import collections
import string
import re
# Compute all n-gram counts
# and all (m<n)-gram counts (for all positive integer m)

# but let's just say N is 2 for now (updated to 3 for fun)
N = 3

# make a dictionary mapping from the value of N to all the counts
# Update 1: we actually only need N and N-1 grams and won't get 0 grams if N=1
LM = {n:collections.defaultdict(int) for n in range( max(1,N-1) ,N+1)} 
vocab = set([])
# Could also have just done
# LM = {N:collections.defaultdict(int), N-1:collections.defaultdict(int)}

# open file
with open('cano.txt','r') as infile:
    data = infile.read()
    # simple sentence segmentation: split on .
    sentences = data.split('.')
    for sentence in sentences:
        # simple tokenization: split on whitespace
        words = sentence.lower().split()
        # Update 2: remove punctuation characters from all words then remove empty words
        words = [word.translate(str.maketrans('', '', string.punctuation)) for word in words]
        words = [word for word in words if word]
        # Update 5: add words to vocab set
        for word in words:
            vocab.add(word)
        # Update 3: use N-1 of the special tokens at the start and end of the sentence
        # not strictly necessary but makes life easier since we no longer need
        # special cases to handle the smaller n-grams
        words = ['<s>']* (N-1) + words + ['</s>']*(N-1)
        for i in range(len(words)):
            # Update 4: update this range() to only cover N and N-1
            for j in range(max(1,N-1),N+1):
                if j+i < len(words):
                    LM[j][ tuple(words[i:j+i]) ] += 1
                

In [None]:
LM[N-1]

defaultdict(int,
            {('<s>', '<s>'): 41102,
             ('<s>', 'the'): 2383,
             ('the', 'complete'): 6,
             ('complete', 'sherlock'): 1,
             ('sherlock', 'holmes'): 385,
             ('holmes', 'arthur'): 1,
             ('arthur', 'conan'): 2,
             ('conan', 'doyle'): 2,
             ('doyle', 'table'): 1,
             ('table', 'of'): 11,
             ('of', 'contents'): 8,
             ('contents', 'a'): 1,
             ('a', 'study'): 11,
             ('study', 'in'): 13,
             ('in', 'scarlet'): 13,
             ('scarlet', 'the'): 1,
             ('the', 'sign'): 20,
             ('sign', 'of'): 76,
             ('of', 'the'): 4274,
             ('the', 'four'): 19,
             ('four', 'the'): 1,
             ('the', 'adventures'): 6,
             ('adventures', 'of'): 5,
             ('of', 'sherlock'): 21,
             ('holmes', 'a'): 6,
             ('a', 'scandal'): 10,
             ('scandal', 'in'): 6,
             ('

In [None]:
# P(said | the man) = count(the man said) / count(the man)
LM[3][('the','man','said')]/LM[2][('the','man')]

0.0024330900243309003
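Written out, the ratio computed in this cell is the maximum-likelihood estimate of a trigram probability, where $C(\cdot)$ stands for the counts stored in `LM` (the notation is an assumption of this note, not taken from the notebook):

```latex
P(w_i \mid w_{i-2}\, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
\qquad\text{e.g.}\qquad
P(\text{said} \mid \text{the man}) = \frac{C(\text{the man said})}{C(\text{the man})}
```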

In [None]:
sorted_list = sorted(LM[3].items(),key=lambda x:x[1], reverse=True)
for s in sorted_list[:100]:
    print(s)

(('<s>', '<s>', 'i'), 4746)
(('<s>', '<s>', 'the'), 2383)
(('<s>', '<s>', 'it'), 2259)
(('<s>', '<s>', 'he'), 2140)
(('<s>', '<s>', 'you'), 1427)
(('<s>', '<s>', 'but'), 1231)
(('<s>', '<s>', 'holmes'), 1097)
(('<s>', '<s>', 'there'), 986)
(('<s>', 'it', 'was'), 895)
(('<s>', '<s>', 'we'), 872)
(('<s>', '<s>', 'and'), 777)
(('<s>', '<s>', 'what'), 707)
(('<s>', '<s>', 'then'), 663)
(('<s>', 'it', 'is'), 661)
(('<s>', '<s>', 'well'), 646)
(('<s>', '<s>', 'a'), 621)
(('<s>', '<s>', 'if'), 528)
(('<s>', '<s>', 'that'), 528)
(('<s>', '<s>', 'this'), 514)
(('<s>', 'i', 'have'), 484)
(('<s>', '<s>', 'she'), 467)
(('<s>', '<s>', 'in'), 454)
(('<s>', '<s>', 'my'), 447)
(('said', 'he', '</s>'), 436)
(('<s>', '<s>', 'his'), 413)
(('<s>', '<s>', 'no'), 387)
(('<s>', '<s>', 'as'), 386)
(('<s>', '<s>', 'now'), 385)
(('<s>', '<s>', 'when'), 377)
(('<s>', 'he', 'was'), 373)
(('<s>', '<s>', 'they'), 360)
(('<s>', 'i', 'am'), 320)
(('<s>', '<s>', 'at'), 317)
(('<s>', 'there', 'was'), 312)
(('<s>', 'i',

In [None]:
# Write a function to return the probability of a token sequence by
# multiplying the probability of each n-gram given its (n-1)-gram prefix
# (add-one smoothing keeps unseen n-grams from getting probability 0)

# hint: when multiplying many probabilities together
# do exp(log(x1) + log(x2) + log(x3) ...) instead of
# x1 * x2 * x3 ...

# tokens is a list of words
def compute_probability(LM, tokens):
    total_prob = 1
    tokens = ['<s>']* (N-1) + tokens + ['</s>']*(N-1)
    for i in range(N-1, len(tokens)):
        ngram = tokens[i-N+1:i+1]
        numerator = LM[N][tuple(ngram)] + 1
        denom = LM[N-1][tuple(ngram[:-1])] + len(vocab)
        prob = numerator/denom
        print(ngram,numerator,denom,prob)
        total_prob *= prob
    return total_prob

compute_probability(LM, ['the','man','read','the','book'])

['<s>', '<s>', 'the'] 2384 63854 0.037335170858521
['<s>', 'the', 'man'] 89 25135 0.0035408792520389893
['the', 'man', 'read'] 1 23163 4.317230065190174e-05
['man', 'read', 'the'] 1 22752 4.3952180028129396e-05
['read', 'the', 'book'] 2 22797 8.773084177742686e-05
['the', 'book', '</s>'] 3 22777 0.0001317118145497651
['book', '</s>', '</s>'] 1 22766 4.392515154177282e-05


1.2732250421223216e-25
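The numerator and denominator printed in each row follow add-one (Laplace) smoothing, with $|V|$ equal to `len(vocab)`. As a check against the first row, using the counts shown in the earlier outputs:

```latex
P_{\text{add-1}}(w_i \mid w_{i-2}\, w_{i-1})
  = \frac{C(w_{i-2}\, w_{i-1}\, w_i) + 1}{C(w_{i-2}\, w_{i-1}) + |V|}
\qquad
P_{\text{add-1}}(\text{the} \mid \langle s\rangle\, \langle s\rangle)
  = \frac{2383 + 1}{41102 + 22752} = \frac{2384}{63854} \approx 0.0373
```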

In [None]:
compute_probability(LM, ['man','the','read','read','book'])

['<s>', '<s>', 'man'] 13 63854 0.00020358943840636452
['<s>', 'man', 'the'] 1 22764 4.3929010718678616e-05
['man', 'the', 'read'] 1 22765 4.392708104546453e-05
['the', 'read', 'read'] 1 22752 4.3952180028129396e-05
['read', 'read', 'book'] 1 22752 4.3952180028129396e-05
['read', 'book', '</s>'] 1 22752 4.3952180028129396e-05
['book', '</s>', '</s>'] 1 22766 4.392515154177282e-05


1.465188644520509e-30

In [None]:
tokens = ['<s>','a','b','c']
(tokens[1-2+1:1])

['<s>']

In [None]:
import math
result = 0.0001 * 0.0002 * 0.0003
log_result = math.log(0.0001) + math.log(0.0002) + math.log(0.0003)
print(result)
print(log_result)
print(math.e**log_result)

5.9999999999999995e-12
-25.839261646700493
6.000000000000011e-12
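The reason for working in log space becomes visible with longer sequences. The sketch below uses a hypothetical list of 100 identical small probabilities: the direct product underflows to zero in floating point, while the sum of logs stays a perfectly usable number.

```python
import math

# Hypothetical per-token probabilities for a 100-token sequence
probs = [1e-5] * 100

# direct product: the true value (1e-500) is too small for a float
product = 1.0
for p in probs:
    product *= p

# log space: add the logs instead of multiplying the probabilities
log_total = sum(math.log(p) for p in probs)

print(product)
print(log_total)
```

For comparing two sequences it is usually enough to compare their log probabilities directly, without exponentiating back.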


In [None]:
# Write a function to generate a full sentence
# given any start words
# </s> should be treated as the end of a sentence

# bonus: modify to choose out of the top K possibilities
#        weighted by their probabilities

# update: stop generating as soon as </s> is sampled
# note: relies on the `probabilities` table built in a later cell

import random

def generate(probs, prompt=[("<s>","<s>")], K=1000):
    prev_ngram = prompt[-1]
    output_tokens = []
    next_word = ""
    while next_word != "</s>":
        candidates = sorted(probs[prev_ngram],key=lambda x:x[1],reverse=True)[:K]
        words, weights = zip(*candidates)
        next_word = random.choices(words, weights,k=1)[0]
        if next_word == "</s>":
            output_tokens.append('.')
        else:
            output_tokens.append(next_word)
            prev_ngram = tuple(list(prev_ngram[1:]) + [next_word])
    return " ".join(output_tokens)

generate(probabilities, prompt = [('sherlock','is')])

'always nature watsonnature and josiah amberleyyou can be no doubt have had some of them she screamed out a revolver from the window the blind .'

In [None]:
sorted(probabilities[('are','the')],key=lambda x:x[1],reverse=True)

[('one', 0.05511811023622047),
 ('very', 0.047244094488188976),
 ('only', 0.03937007874015748),
 ('main', 0.023622047244094488),
 ('facts', 0.023622047244094488),
 ('second', 0.015748031496062992),
 ('principal', 0.015748031496062992),
 ('mormons', 0.015748031496062992),
 ('more', 0.015748031496062992),
 ('special', 0.015748031496062992),
 ('two', 0.015748031496062992),
 ('lights', 0.015748031496062992),
 ('same', 0.015748031496062992),
 ('missing', 0.015748031496062992),
 ('last', 0.015748031496062992),
 ('scowrers', 0.015748031496062992),
 ('devil', 0.015748031496062992),
 ('pick', 0.007874015748031496),
 ('gentleman', 0.007874015748031496),
 ('sole', 0.007874015748031496),
 ('persecuted', 0.007874015748031496),
 ('daughter', 0.007874015748031496),
 ('man', 0.007874015748031496),
 ('others', 0.007874015748031496),
 ('traces', 0.007874015748031496),
 ('accredited', 0.007874015748031496),
 ('regulars', 0.007874015748031496),
 ('right', 0.007874015748031496),
 ('es', 0.00787401574803149

In [None]:
probabilities[('<s>','<s>')]

[('the', 0.057977713979854996),
 ('sherlock', 0.004622646099946475),
 ('d', 0.00029195659578609316),
 ('watson', 0.00347914943311761),
 ('late', 0.00017030801420855432),
 ('chapter', 0.0012408155320908958),
 ('having', 0.0009488589363048027),
 ('on', 0.006423045107294049),
 ('i', 0.11546883363339984),
 ('there', 0.023989100287090653),
 ('worn', 4.865943263101552e-05),
 ('here', 0.004136051773636319),
 ('for', 0.005303878156780692),
 ('under', 0.00026762687947058536),
 ('so', 0.00652036397255608),
 ('choosing', 4.865943263101552e-05),
 ('in', 0.011045691207240523),
 ('whatever', 0.00046226460999464745),
 ('you', 0.034718505182229575),
 ('poor', 0.0005352537589411707),
 ('what', 0.017201109435063987),
 ('trying', 2.432971631550776e-05),
 ('thats', 0.002870906525229916),
 ('and', 0.01890418957714953),
 ('a', 0.015108753831930319),
 ('he', 0.05206559291518661),
 ('by', 0.003746776312588195),
 ('young', 0.00026762687947058536),
 ('why', 0.005668823901513308),
 ('as', 0.009391270497785997),


In [None]:
a = ('the',)
b = tuple('the',)
print(a)
print(b)

('the',)
('t', 'h', 'e')


Do we really want to compute all of the probabilities for possible next words every time we see a new word?

With a model this size, can we just store the probability of any N-gram that could follow each (N-1)-gram?

In [None]:
# map from each (N-1)-gram tuple to a list of (next word, probability)
# pairs for every word that follows it in the corpus
probabilities = collections.defaultdict(list)

for ngram, ngram_count in LM[N].items():
    n1gram = ngram[:-1]
    n1gram_count = LM[N-1][n1gram]
    if ngram_count > 0:
        probabilities[n1gram].append( tuple([ngram[-1], ngram_count/n1gram_count]) )

In [None]:
probabilities

defaultdict(list,
            {('<s>',): [('the', 0.057977713979854996),
              ('sherlock', 0.004622646099946475),
              ('d', 0.00029195659578609316),
              ('watson', 0.00347914943311761),
              ('late', 0.00017030801420855432),
              ('chapter', 0.0012408155320908958),
              ('having', 0.0009488589363048027),
              ('on', 0.006423045107294049),
              ('i', 0.11546883363339984),
              ('there', 0.023989100287090653),
              ('worn', 4.865943263101552e-05),
              ('here', 0.004136051773636319),
              ('for', 0.005303878156780692),
              ('under', 0.00026762687947058536),
              ('so', 0.00652036397255608),
              ('choosing', 4.865943263101552e-05),
              ('in', 0.011045691207240523),
              ('whatever', 0.00046226460999464745),
              ('you', 0.034718505182229575),
              ('poor', 0.0005352537589411707),
              ('what', 0.01720110943

Now go back and do the `generate()` function based on these probabilities instead of the LM itself.
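As a starting point, here is a minimal end-to-end sketch on a toy corpus, so the whole pipeline can be run without downloading `cano.txt`. The names `build_probabilities` and `generate_toy`, the two-sentence corpus, and the choice of a bigram model with a single `</s>` per sentence are all simplifying assumptions for illustration; it is not the notebook's solution.

```python
import collections
import random

def build_probabilities(sentences, n=2):
    """Count n-grams and their (n-1)-gram contexts, then store next-word probabilities."""
    ngram_counts = collections.defaultdict(int)
    context_counts = collections.defaultdict(int)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    probabilities = collections.defaultdict(list)
    for ngram, count in ngram_counts.items():
        context = ngram[:-1]
        probabilities[context].append((ngram[-1], count / context_counts[context]))
    return probabilities

def generate_toy(probabilities, context, rng, max_tokens=20):
    output = []
    for _ in range(max_tokens):
        # sample the next word in proportion to its stored probability
        words, weights = zip(*probabilities[context])
        next_word = rng.choices(words, weights, k=1)[0]
        if next_word == "</s>":
            break
        output.append(next_word)
        # slide the context window forward by one word
        context = tuple(list(context[1:]) + [next_word])
    return " ".join(output)

toy_probs = build_probabilities(["the cat sat", "the cat slept"])
print(toy_probs[("the",)])
generated = generate_toy(toy_probs, ("<s>",), random.Random(0))
print(generated)
```

Because every context reached during generation was seen in training, the lookup never comes back empty; on a real corpus with an arbitrary prompt, an unseen context would need a fallback.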