
Problem: We will implement a program that imitates William Shakespeare's writing. To do this, we will train an n-gram language model on his works and sample a sentence from it.

Requirements: Write a Python program that counts the occurrences of word sequences (n-grams) in THE COMPLETE WORKS OF WILLIAM SHAKESPEARE. Tokenize the text with NLTK. The program should support at least a trigram model (i.e., two words of context followed by a third), and ideally should be parameterized to accept any value of n. The output should be one sentence generated by the trained model.
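
As a rough illustration of the trigram requirement, the sketch below counts which word follows each two-word context in a toy token list, then turns the counts for one context into relative frequencies. The token list and the names `tokens`, `counts`, and `probs` are chosen for this example only, and `collections.Counter` is used just for brevity.

```python
from collections import Counter, defaultdict

# Toy token list (punctuation dropped) used only for illustration.
tokens = "to be or not to be that is the question".split()

n = 3
counts = defaultdict(Counter)
for i in range(len(tokens) - n + 1):
    context = tuple(tokens[i:i + n - 1])  # the first n-1 words
    following = tokens[i + n - 1]         # the word that follows them
    counts[context][following] += 1

# Relative frequency of each word observed after the context ("to", "be").
context = ("to", "be")
total = sum(counts[context].values())
probs = {word: c / total for word, c in counts[context].items()}
print(dict(counts[context]))
print(probs)
```

Here the context `("to", "be")` occurs twice, once followed by `or` and once by `that`, so each continuation gets an estimated probability of 0.5.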

In [8]:
import nltk
import requests
import random
import sys

In [9]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [10]:
# Build an n-gram model: a dictionary mapping each (n-1)-token prefix tuple
# to a dictionary of the words that follow it and their counts.
def build_ngram_model(tokens, n):
    model = {}
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i+n-1])
        next_word = tokens[i+n-1]
        if prefix not in model:
            model[prefix] = {}
        model[prefix][next_word] = model[prefix].get(next_word, 0) + 1
    return model


In [11]:
# Given a dictionary mapping items to weights, choose an item at random
# with probability proportional to its weight.
def weighted_choice(choices):
    total = sum(choices.values())
    r = random.uniform(0, total)
    cumulative = 0
    for word, weight in choices.items():
        cumulative += weight
        if r <= cumulative:
            return word
    return None
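
For comparison, the standard library's `random.choices` accepts a `weights` argument and performs the same kind of proportional sampling. The sketch below is an alternative, not the notebook's method; `weighted_choice_stdlib` and the toy counts are hypothetical names and data chosen for illustration.

```python
import random
from collections import Counter

def weighted_choice_stdlib(choices):
    """Pick a key of `choices` with probability proportional to its count."""
    if not choices:
        return None  # mirror the None fallback for an empty dictionary
    words = list(choices)
    weights = list(choices.values())
    # random.choices returns a list of k samples; k=1 yields a single word.
    return random.choices(words, weights=weights, k=1)[0]

# Toy counts (made up): "thee" should come up about three times as often as "you".
toy_counts = {"thee": 3, "you": 1}
random.seed(0)  # fixed seed only so repeated runs give the same tally
tally = Counter(weighted_choice_stdlib(toy_counts) for _ in range(10_000))
print(tally)
```

Either approach samples in proportion to the counts; the hand-written loop simply makes the cumulative-weight logic explicit.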

In [12]:
# Generate a sentence using the n-gram model. Starts with a random (n-1)-gram
# (preferring one that starts with a capital letter) and continues until
# terminal punctuation is reached or max_words words have been appended.
def generate_sentence(model, n, max_words=50):
    # Find starting prefixes that begin with a capital letter.
    # For n=1 the only prefix is the empty tuple, so guard against indexing into it.
    starters = [prefix for prefix in model if prefix and prefix[0][0].isupper()]
    if not starters:
        starters = list(model.keys())
    current_prefix = random.choice(starters)
    sentence = list(current_prefix)

    for _ in range(max_words):
        if current_prefix not in model:
            break
        next_word = weighted_choice(model[current_prefix])
        if not next_word:
            break
        sentence.append(next_word)
        # Slide the context window. For n=1 the context stays the empty tuple,
        # because sentence[-0:] would return the whole sentence instead.
        current_prefix = tuple(sentence[-(n - 1):]) if n > 1 else ()
        # Stop if the new word ends with typical sentence-ending punctuation.
        if next_word.endswith(('.', '!', '?')):
            break
    return " ".join(sentence)

In [13]:
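def clean_gutenberg_text(text):
    # Keep only the body between the Project Gutenberg start and end markers.
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"
    start_idx = text.find(start_marker)
    if start_idx != -1:
        # Skip the rest of the marker line so the marker itself is not tokenized.
        newline_idx = text.find("\n", start_idx)
        text = text[newline_idx + 1:] if newline_idx != -1 else ""
    end_idx = text.find(end_marker)
    if end_idx != -1:
        text = text[:end_idx]
    return text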

In [14]:
def main():
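    n = 3
    # In a notebook, sys.argv[1] is typically a kernel launch flag (e.g. "-f"),
    # not a number, so the fallback message below is printed and n stays at 3.
    if len(sys.argv) > 1:
        try:
            n = int(sys.argv[1])
            if n < 1:
                raise ValueError
        except ValueError:
            n = 3
            print("Invalid value for n; using default n=3.")

    url = "https://www.gutenberg.org/files/100/100-0.txt"

    print("Downloading Shakespeare text...")
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early on an HTTP error instead of tokenizing an error page
    text = response.text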
    text = clean_gutenberg_text(text)

    print("Tokenizing text using NLTK...")
    tokens = nltk.word_tokenize(text)

    print(f"Building a {n}-gram model...")
    model = build_ngram_model(tokens, n)

    print("\nGenerating a sentence using the trained model:")
    sentence = generate_sentence(model, n)
    print(sentence)

if __name__ == "__main__":
    main()

Invalid value for n; using default n=3.
Downloading Shakespeare text...
Tokenizing text using NLTK...
Building a 3-gram model...

Generating a sentence using the trained model:
Morton ; Tell him my most sovereign reason , reason and sanity could not stir from this reproach .
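
The generated sentence joins tokens with single spaces, so punctuation ends up detached (for example `Morton ;`). One way to tidy this for display is a small regex pass that removes the space before common punctuation marks. This is a rough sketch: `tidy_spacing` is a hypothetical helper, and it does not handle quotes, brackets, or contractions split by the tokenizer.

```python
import re

def tidy_spacing(sentence):
    """Remove whitespace that precedes common punctuation marks."""
    return re.sub(r"\s+([.,;:!?])", r"\1", sentence)

raw = ("Morton ; Tell him my most sovereign reason , reason and sanity "
       "could not stir from this reproach .")
print(tidy_spacing(raw))
```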
