<a href="https://colab.research.google.com/github/alex-smith-uwec/NLP_Spring2025/blob/main/Basic%20Text%20Normalization%20and%20Counting-Part%203.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Import corpus from Project Gutenberg

Bigram analysis

[ngram language model](https://en.wikipedia.org/wiki/Word_n-gram_language_model)


# Import a corpus from [Project Gutenberg](https://www.gutenberg.org/)

In [1]:
import requests

# Fetch the book (Plain Text UTF-8)
url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"  # Jane Austen, Pride and Prejudice
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
text = response.text

print(text[:500])


﻿The Project Gutenberg eBook of Pride and Prejudice
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this


In [5]:
# Strip headers and footers
start_index = text.find("Chapter I.]")
end_index = text.find("*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***")
corpus = text[start_index:end_index]

# Display the first 500 and the last 400 characters
print(f"first words:\n {corpus[:500]}\n\n")

print(f"last words:\n\n\n {corpus[-400:]}")

first words:
 Chapter I.]


It is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered as the rightful
property of some one or other of their daughters.

“My dear Mr. Bennet,” said his lady to him one day, “have you heard that
Netherfield Park is l


last words:


 gratitude towards the persons who, by bringing
her into Derbyshire, had been the means of uniting them.

                            [Illustration:

                                  THE
                                  END
                                   ]




             CHISWICK PRESS:--CHARLES WHITTINGHAM AND CO.
                  TOOKS COURT, CHANCERY LANE, LONDON.
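
`str.find` returns -1 when a marker is absent, in which case a slice like the one above would quietly return the wrong span instead of failing. A minimal sketch of a guarded version follows; the helper name `slice_between` and the toy `sample` string are hypothetical and not part of the notebook.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Hypothetical helper: raises instead of silently mis-slicing when a
    marker is missing (str.find returns -1 in that case).
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


# Toy stand-in for the downloaded book text.
sample = "header\nChapter I.]\nbody text\n*** END ***\nfooter"
print(slice_between(sample, "Chapter I.]", "*** END ***"))
```

With the real text, the same call would take the two marker strings used in the cell above.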







In [6]:
import nltk
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords


from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

from nltk import bigrams
from nltk.probability import ConditionalFreqDist

import random


In [7]:
# Download NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
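# Tokenize the lowercased corpus into words
words = word_tokenize(corpus.lower())

# Keep only alphanumeric tokens (this drops punctuation), then drop stopwords
filtered_words = [word for word in words if word.isalnum()]
stop_words = set(stopwords.words('english'))  # build the set once; membership tests are fast
no_stop_filtered_words = [word for word in filtered_words if word not in stop_words]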

In [9]:
finder = BigramCollocationFinder.from_words(filtered_words)
finder_no_stops = BigramCollocationFinder.from_words(no_stop_filtered_words)

In [10]:
N = 3  # minimum bigram frequency
finder.apply_freq_filter(N)  # Only keep bigrams that occur at least N times
frequent_bigrams = finder.nbest(BigramAssocMeasures.raw_freq, 20)

frequent_bigrams

[('of', 'the'),
 ('to', 'be'),
 ('in', 'the'),
 ('i', 'am'),
 ('of', 'her'),
 ('it', 'was'),
 ('to', 'the'),
 ('of', 'his'),
 ('she', 'was'),
 ('she', 'had'),
 ('had', 'been'),
 ('it', 'is'),
 ('i', 'have'),
 ('that', 'he'),
 ('to', 'her'),
 ('could', 'not'),
 ('for', 'the'),
 ('he', 'had'),
 ('and', 'the'),
 ('he', 'was')]
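
Ranking by raw frequency is the same as ranking by count, since every bigram count is divided by the same total. The filter-then-rank step can therefore be sketched with the standard library alone. This is an illustration under assumptions, not NLTK's implementation: `top_bigrams` is a hypothetical name, and ties are broken alphabetically here, which may differ from NLTK's ordering.

```python
from collections import Counter


def top_bigrams(tokens, min_count=3, k=20):
    """Return up to k bigrams seen at least min_count times, most frequent first."""
    counts = Counter(zip(tokens, tokens[1:]))
    # Frequency filter: discard bigrams seen fewer than min_count times.
    frequent = {bg: c for bg, c in counts.items() if c >= min_count}
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(frequent, key=lambda bg: (-frequent[bg], bg))[:k]


toy_tokens = "a b a b a b c d c d x y".split()
print(top_bigrams(toy_tokens, min_count=2))
# [('a', 'b'), ('b', 'a'), ('c', 'd')]
```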

In [12]:
N = 3  # minimum bigram frequency
finder_no_stops.apply_freq_filter(N)  # Only keep bigrams that occur at least N times
frequent_bigrams_no_stops = finder_no_stops.nbest(BigramAssocMeasures.raw_freq, 20)

frequent_bigrams_no_stops

[('lady', 'catherine'),
 ('miss', 'bingley'),
 ('miss', 'bennet'),
 ('said', 'elizabeth'),
 ('sir', 'william'),
 ('de', 'bourgh'),
 ('miss', 'darcy'),
 ('young', 'man'),
 ('1894', 'george'),
 ('colonel', 'fitzwilliam'),
 ('colonel', 'forster'),
 ('dare', 'say'),
 ('elizabeth', 'could'),
 ('young', 'ladies'),
 ('miss', 'lucas'),
 ('illustration', 'chapter'),
 ('cried', 'elizabeth'),
 ('said', 'bennet'),
 ('uncle', 'aunt'),
 ('great', 'deal')]

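Note the bigrams ('1894', 'george') and ('illustration', 'chapter'). These appear to come from the edition's illustration captions and copyright notices rather than from the novel itself, so they should be removed from the corpus before counting.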
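
One way to do that cleanup is with regular expressions. The sketch below assumes that the residue sits inside `[Illustration ...]` blocks and that the copyright note reads `[_Copyright 1894 by George Allen._]`. Both patterns are guesses based on the n-grams above and should be checked against the downloaded text. The name `strip_illustrations` and the `sample` string are hypothetical.

```python
import re

# Assumed patterns; verify them against the actual Gutenberg text.
COPYRIGHT_NOTE = re.compile(r"\[_Copyright 1894 by George Allen\._\]")
ILLUSTRATION_BLOCK = re.compile(r"\[Illustration[^\[\]]*\]")


def strip_illustrations(text):
    """Remove bracketed illustration captions and the copyright note nested in them."""
    text = COPYRIGHT_NOTE.sub(" ", text)       # inner brackets first
    return ILLUSTRATION_BLOCK.sub(" ", text)   # then the enclosing blocks


# Toy stand-in for a stretch of the corpus.
sample = (
    "the means of uniting them.\n"
    "[Illustration: A caption\n\n[_Copyright 1894 by George Allen._]]\n"
    "It is a truth universally acknowledged"
)
cleaned = strip_illustrations(sample)
print(cleaned)
```

In the notebook this would be applied to `corpus` before tokenizing. Any stray numeric tokens that remain could be dropped by filtering with `word.isalpha()` instead of `word.isalnum()`.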

In [15]:
from collections import Counter, defaultdict # Import Counter and defaultdict

`defaultdict` is a subclass of Python's built-in `dict`, provided by the `collections` module. When a missing key is accessed, it calls a factory function to create a default value, stores that value under the key, and returns it. This avoids having to check whether a key exists before reading or updating its value.
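
A small illustration of that behavior, using a toy word list rather than the corpus:

```python
from collections import defaultdict

counts = defaultdict(int)  # missing keys start at int() == 0
for word in ["to", "be", "or", "not", "to", "be"]:
    counts[word] += 1      # no need to test `word in counts` first

print(dict(counts))        # {'to': 2, 'be': 2, 'or': 1, 'not': 1}

# Caveat: merely reading a missing key inserts it with the default value.
print("question" in counts)  # False
_ = counts["question"]
print("question" in counts)  # True
```

The caveat matters whenever a `defaultdict` is later iterated: every key that was looked up, even once, will be part of the iteration.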

In [13]:
tokens = filtered_words  # non-alphanumeric tokens removed; stopwords remain

In [16]:
# Count unigrams and bigrams
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

In [17]:
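# Calculate bigram probabilities with Laplace (add-one) smoothing
vocab_size = len(unigram_counts)
# Fallback for unseen bigrams. This is an approximation: the exact add-one
# estimate is 1 / (unigram_counts[w1] + vocab_size), which depends on w1.
bigram_probs = defaultdict(lambda: 1 / vocab_size)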

for (w1, w2), count in bigram_counts.items():
    bigram_probs[(w1, w2)] = (count + 1) / (unigram_counts[w1] + vocab_size)

# Test the bigram model: probability of "is" given "this"
print(f"P('is' | 'this') = {bigram_probs[('this', 'is')]}")

P('is' | 'this') = 0.004526935264825713
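
The add-one estimate is P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V), where V is the vocabulary size. For an unseen bigram this gives 1 / (C(w1) + V), which varies with w1. A minimal sketch that computes the estimate on demand, for seen and unseen pairs alike, is shown below on toy data. The function name `laplace_bigram_prob` and the `toy_*` variables are hypothetical.

```python
from collections import Counter


def laplace_bigram_prob(w1, w2, unigram_counts, bigram_counts):
    """Add-one smoothed P(w2 | w1); Counter returns 0 for unseen keys."""
    vocab_size = len(unigram_counts)
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)


toy_tokens = "the cat sat on the mat the cat ran".split()
toy_unigrams = Counter(toy_tokens)
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))

# Seen bigram: (2 + 1) / (3 + 6)
print(laplace_bigram_prob("the", "cat", toy_unigrams, toy_bigrams))
# Unseen bigram: (0 + 1) / (3 + 6)
print(laplace_bigram_prob("the", "ran", toy_unigrams, toy_bigrams))

# Sanity check: the probabilities of all words following "the" sum to 1.
total = sum(laplace_bigram_prob("the", w, toy_unigrams, toy_bigrams) for w in toy_unigrams)
print(total)
```

Computing probabilities from the counts this way trades a dictionary lookup for a short calculation, and it keeps the distribution over next words properly normalized for any word that is always followed by another token.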


In [18]:
# Generate text using the model

def generate_bigram_text(start_word, length=20):
    text = [start_word]
    for _ in range(length - 1):
        # Collect every stored successor of the last word with its probability
        # (bigrams never stored in bigram_probs cannot be sampled)
        next_word_probs = [(pair[1], prob) for pair, prob in bigram_probs.items()
                          if pair[0] == text[-1]]
        if not next_word_probs:
            break
        words, probs = zip(*next_word_probs)
        next_word = random.choices(words, weights=probs)[0]
        text.append(next_word)
    return ' '.join(text)




In [19]:
generate_bigram_text('miss')

'miss bennet one could provoke her perverseness he determined not watch his coming with composure everything she bore his sisters'
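
`generate_bigram_text` rescans every entry of `bigram_probs` at each step. A sketch of an alternative that groups successors by their first word once and then samples from the precomputed lists is shown below. The names `build_successor_index` and `generate_from_index` and the toy probabilities are hypothetical. `random.seed` is used only to make the toy run repeatable.

```python
import random
from collections import defaultdict


def build_successor_index(bigram_probs):
    """Map each first word to parallel lists of (next words, weights)."""
    index = defaultdict(lambda: ([], []))
    for (w1, w2), prob in bigram_probs.items():
        words, weights = index[w1]
        words.append(w2)
        weights.append(prob)
    return index


def generate_from_index(index, start_word, length=20):
    """Sample a word sequence, stopping early at a word with no successors."""
    text = [start_word]
    for _ in range(length - 1):
        if text[-1] not in index:  # membership test does not insert a key
            break
        words, weights = index[text[-1]]
        text.append(random.choices(words, weights=weights)[0])
    return " ".join(text)


# Toy probabilities standing in for the notebook's bigram_probs.
toy_probs = {("a", "b"): 0.5, ("b", "a"): 0.5, ("b", "c"): 0.5}
toy_index = build_successor_index(toy_probs)

random.seed(0)
generated = generate_from_index(toy_index, "a", length=5)
print(generated)
```

The index is built in one pass over the probabilities, so each generation step only touches the successors of the current word.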

In [25]:
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

In [24]:
# Tokenized text: non-alphanumeric tokens removed; stopwords remain
tokens = filtered_words

# Create a TrigramCollocationFinder
finder = TrigramCollocationFinder.from_words(tokens)

# Apply frequency filter: only keep trigrams that occur at least N times
N = 3  # Minimum frequency
finder.apply_freq_filter(N)

# Get the top 20 frequent trigrams based on raw frequency
frequent_trigrams = finder.nbest(TrigramAssocMeasures.raw_freq, 20)

frequent_trigrams

[('i', 'am', 'sure'),
 ('i', 'do', 'not'),
 ('as', 'soon', 'as'),
 ('she', 'could', 'not'),
 ('i', 'can', 'not'),
 ('that', 'he', 'had'),
 ('1894', 'by', 'george'),
 ('in', 'the', 'world'),
 ('it', 'would', 'be'),
 ('it', 'was', 'not'),
 ('that', 'he', 'was'),
 ('could', 'not', 'be'),
 ('i', 'am', 'not'),
 ('that', 'it', 'was'),
 ('as', 'well', 'as'),
 ('i', 'dare', 'say'),
 ('would', 'have', 'been'),
 ('by', 'no', 'means'),
 ('can', 'not', 'be'),
 ('that', 'she', 'had')]

In [20]:
# Count unigrams, bigrams, and trigrams
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))
trigram_counts = Counter(zip(tokens[:-2], tokens[1:-1], tokens[2:]))



In [26]:
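# Calculate trigram probabilities with Laplace (add-one) smoothing
vocab_size = len(unigram_counts)
# Fallback for unseen trigrams. This is an approximation: the exact add-one
# estimate is 1 / (bigram_counts[(w1, w2)] + vocab_size), which depends on the context.
trigram_probs = defaultdict(lambda: 1 / vocab_size)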

for (w1, w2, w3), count in trigram_counts.items():
    trigram_probs[(w1, w2, w3)] = (count + 1) / (bigram_counts[(w1, w2)] + vocab_size)

# Test the trigram model: probability of "sure" given "i am"
print(f"P('sure' | 'i am') = {trigram_probs[('i', 'am', 'sure')]}")



P('sure' | 'i am') = 0.009545804464973056


In [27]:
# Generate text using the trigram model
def generate_trigram_text(start_words, length=20):
    # `start_words` should contain two words to start the trigram model
    if len(start_words) != 2:
        raise ValueError("start_words must contain exactly two words.")

    text = list(start_words)
    for _ in range(length - 2):
        # Collect every stored successor of the last two words with its probability
        next_word_probs = [
            (triple[2], prob) for triple, prob in trigram_probs.items()
            if triple[0] == text[-2] and triple[1] == text[-1]
        ]
        if not next_word_probs:
            break
        words, probs = zip(*next_word_probs)
        next_word = random.choices(words, weights=probs)[0]
        text.append(next_word)
    return ' '.join(text)

In [36]:
generate_trigram_text(start_words=('i','am'),length=20)

'i am her particular friend you see by jane had read the note aloud if your mother of inviting as'