EXERCISE 3 BALOGO-GUMACAL

Step 1: Install & Import Required Libraries

In [1]:
# Install required libraries
%pip install wikipedia-api nltk

# Import necessary libraries
import wikipediaapi
import nltk
import re
from nltk.util import ngrams
from collections import Counter
import math

# Download necessary NLTK data
nltk.download('punkt')


Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kengu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**(10 points) Use the Wikipedia python module and access any topic, as you will use that as your corpus, with a word limit of 1000 words.**

Step 2: Fetch Wikipedia Text

In [2]:
# Initialize Wikipedia API
wiki_wiki = wikipediaapi.Wikipedia(user_agent="Colab-Perplexity", language='en')

# Fetch Wikipedia content for "Artificial Intelligence"
page = wiki_wiki.page("Artificial Intelligence")
text = page.text[:10000]  # Keep the first 10,000 characters (a character cap, not a word cap)

# Print first 500 characters for verification
print(text[:500])


Artificial intelligence (AI) refers to the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called A
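The exercise statement asks for a 1000-word limit, while `page.text[:10000]` cuts by characters. A minimal sketch of a word-based cut follows, assuming whitespace-separated chunks are an acceptable definition of a word. `limit_words` is a hypothetical helper, not part of the original notebook.

```python
def limit_words(text, max_words=1000):
    """Return the first max_words whitespace-separated words of text."""
    words = text.split()
    # Rejoining with single spaces also collapses newlines and repeated spaces.
    return " ".join(words[:max_words])


sample = "one two three four five"
print(limit_words(sample, max_words=3))  # one two three
```

In the notebook this could be applied as `text = limit_words(page.text)`; doing so would change the corpus and therefore every count and score printed later.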


**(60 points) Train 2 models: a Bigram and Trigram Language Model, use the shared code as reference for bigram modeling, and update it to support trigrams.**

Step 3: Preprocess Text (Remove Punctuation & Lowercase)

Removing punctuation and special characters with a regular expression.
Converting to lowercase.
Tokenizing the cleaned text into words using NLTK.


In [3]:
# Remove punctuation using regex and convert to lowercase
text = re.sub(r'[^\w\s]', '', text).lower()

# Print cleaned text preview
print(text[:500])

# Tokenize the cleaned text
tokens = nltk.word_tokenize(text)

# Print first 20 tokens
print(tokens[:20])



artificial intelligence ai refers to the capability of computational systems to perform tasks typically associated with human intelligence such as learning reasoning problemsolving perception and decisionmaking it is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals such machines may be called ais
highprof
['artificial', 'intelligence', 'ai', 'refers', 'to', 'the', 'capability', 'of', 'computational', 'systems', 'to', 'perform', 'tasks', 'typically', 'associated', 'with', 'human', 'intelligence', 'such', 'as']
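The preview above contains "problemsolving" and "decisionmaking": deleting the hyphen joins the two halves of a hyphenated word into one token. A possible alternative, sketched here under the assumption that splitting such words is preferred, replaces punctuation with a space instead. `clean_text` is an illustrative name, not from the notebook.

```python
import re


def clean_text(raw):
    """Lowercase raw and replace punctuation with spaces."""
    # Substituting a space (rather than an empty string) keeps the parts of
    # hyphenated words separate: "problem-solving" -> "problem solving".
    no_punct = re.sub(r"[^\w\s]", " ", raw)
    # Collapse the runs of whitespace that the substitution can leave behind.
    return re.sub(r"\s+", " ", no_punct).strip().lower()


print(clean_text("Problem-solving, perception, and decision-making."))
# problem solving perception and decision making
```

The trade-off is that possessives such as "AI's" would split into "ai" and "s", so which variant is better depends on the corpus.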


Step 4: Generate N-grams (Bigrams & Trigrams)

In [4]:
# Create bigrams and trigrams
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

# Print sample bigram and trigram
print("Sample bigram:", bigrams[:5])
print("Sample trigram:", trigrams[:5])


Sample bigram: [('artificial', 'intelligence'), ('intelligence', 'ai'), ('ai', 'refers'), ('refers', 'to'), ('to', 'the')]
Sample trigram: [('artificial', 'intelligence', 'ai'), ('intelligence', 'ai', 'refers'), ('ai', 'refers', 'to'), ('refers', 'to', 'the'), ('to', 'the', 'capability')]
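To make the sliding-window idea behind these n-grams concrete, here is a small pure-Python sketch that produces the same kind of tuples for a toy token list. `sliding_ngrams` is a hypothetical helper for illustration and is not how `nltk.util.ngrams` is implemented.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so a list of length L yields L - n + 1 n-grams (none if L < n).
    return list(zip(*(tokens[i:] for i in range(n))))


words = ["to", "the", "capability", "of"]
print(sliding_ngrams(words, 2))
# [('to', 'the'), ('the', 'capability'), ('capability', 'of')]
print(sliding_ngrams(words, 3))
# [('to', 'the', 'capability'), ('the', 'capability', 'of')]
```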


Step 5: Compute N-gram Frequencies

In [5]:
# Count unigrams, bigrams, and trigrams
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)
trigram_counts = Counter(trigrams)

# Print sample counts
print("Unigram count (sample):", list(unigram_counts.items())[:5])
print("Bigram count (sample):", list(bigram_counts.items())[:5])
print("Trigram count (sample):", list(trigram_counts.items())[:5])


Unigram count (sample): [('artificial', 4), ('intelligence', 5), ('ai', 25), ('refers', 1), ('to', 27)]
Bigram count (sample): [(('artificial', 'intelligence'), 2), (('intelligence', 'ai'), 1), (('ai', 'refers'), 1), (('refers', 'to'), 1), (('to', 'the'), 2)]
Trigram count (sample): [(('artificial', 'intelligence', 'ai'), 1), (('intelligence', 'ai', 'refers'), 1), (('ai', 'refers', 'to'), 1), (('refers', 'to', 'the'), 1), (('to', 'the', 'capability'), 1)]
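The samples above list counts in first-seen order. To inspect the most frequent n-grams instead, `Counter.most_common` can be used; the sketch below runs on a toy token list rather than the Wikipedia corpus, so the names and numbers are illustrative only.

```python
from collections import Counter

toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))

unigram_toy = Counter(toy_tokens)
bigram_toy = Counter(toy_bigrams)

print(unigram_toy.most_common(2))  # [('the', 3), ('cat', 2)]
print(bigram_toy.most_common(1))   # [(('the', 'cat'), 2)]
```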


Step 6: Implement Laplace Smoothing Function
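For reference, the add-k estimate that the function in the cell below computes can be written as follows. The notation is chosen here for illustration: C(·) is a corpus count, V is the vocabulary size, and k = 1 gives Laplace smoothing.

```latex
P(w_n \mid w_1 \dots w_{n-1}) = \frac{C(w_1 \dots w_n) + k}{C(w_1 \dots w_{n-1}) + kV}
```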

In [6]:
# Vocabulary size
V = len(set(tokens))

# Function to compute probability with Laplace Smoothing
def laplace_smoothing(ngram, ngram_counts, lower_order_counts, V, k=1):
    """
    Computes the add-k (Laplace for k = 1) smoothed probability of an n-gram.
    ngram: The n-gram tuple.
    ngram_counts: Counter mapping n-gram tuples to their frequencies.
    lower_order_counts: Counter mapping (n-1)-gram context tuples to their frequencies.
    V: Vocabulary size.
    k: Smoothing parameter (default = 1).
    """
    ngram_count = ngram_counts[ngram] + k  # Add k to the n-gram count
    # The context is looked up as a tuple (ngram[:-1]); a Counter keyed by plain
    # strings, such as Counter(tokens), returns 0 for it.
    lower_order_count = lower_order_counts[ngram[:-1]] + (V * k)  # Add k for each vocabulary word
    return ngram_count / lower_order_count

# Test probability calculation
sample_bigram = ('artificial', 'intelligence')
sample_trigram = ('the', 'artificial', 'intelligence')

print("Bigram probability (Laplace smoothed):", laplace_smoothing(sample_bigram, bigram_counts, unigram_counts, V))
print("Trigram probability (Laplace smoothed):", laplace_smoothing(sample_trigram, trigram_counts, bigram_counts, V))


Bigram probability (Laplace smoothed): 0.004942339373970346
Trigram probability (Laplace smoothed): 0.0016474464579901153
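One thing to check in the bigram call: `unigram_counts` is built with `Counter(tokens)`, so its keys are plain strings, while the function looks up the context as the tuple `ngram[:-1]`, for example `('artificial',)`. A Counter returns 0 for a missing key, so the bigram denominator reduces to `V * k`. The printed values are consistent with this: the bigram probability matches 3/607 and the trigram probability matches 1/607, which suggests V = 607 and a context count of 0 in the bigram case, even though 'artificial' occurs 4 times in the unigram sample. A minimal sketch of one way around it keys the unigram counts by 1-tuples; the toy corpus and the name `add_k_probability` are illustrative, not from the notebook.

```python
from collections import Counter


def add_k_probability(ngram, ngram_counts, context_counts, vocab_size, k=1):
    """Add-k estimate of P(last word | preceding words) from tuple-keyed Counters."""
    context = ngram[:-1]  # always a tuple, even for bigrams
    return (ngram_counts[ngram] + k) / (context_counts[context] + k * vocab_size)


toy_tokens = ["a", "b", "a", "b", "a", "c"]
vocab_size = len(set(toy_tokens))  # 3

# Key unigrams by 1-tuples so that ngram[:-1] finds them.
unigram_tuples = Counter((tok,) for tok in toy_tokens)
string_keyed = Counter(toy_tokens)
bigram_toy = Counter(zip(toy_tokens, toy_tokens[1:]))

print(string_keyed[("a",)])    # 0: the tuple key is missing from string-keyed counts
print(unigram_tuples[("a",)])  # 3
print(add_k_probability(("a", "b"), bigram_toy, unigram_tuples, vocab_size))
# (2 + 1) / (3 + 3) = 0.5
```

In the notebook, building the counts as `unigram_counts = Counter(ngrams(tokens, 1))` would give 1-tuple keys. That change would alter the printed bigram probability and the bigram perplexity, so the recorded outputs would need to be regenerated.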


**(30 points) Using a test sentence “The quick brown fox jumps over the lazy dog near the bank of the river.” OR generate your own test sentence, create a function that will determine the perplexity score for each trained model.**

Step 7: Implement Perplexity Calculation

In [7]:
def perplexity(test_sentence, ngram_counts, lower_order_counts, n, V):
    """
    Computes the perplexity of a test sentence.
    test_sentence: Input sentence to evaluate.
    ngram_counts: Frequency count of the n-grams.
    lower_order_counts: Frequency count of lower n-grams.
    n: Order of the n-gram model.
    V: Vocabulary size.
    """
    # Lowercase and tokenize the test sentence
    test_tokens = nltk.word_tokenize(test_sentence.lower())
    test_tokens = [word for word in test_tokens if word.isalnum()]  # Drop tokens containing non-alphanumeric characters (e.g. punctuation)
    test_ngrams = list(ngrams(test_tokens, n))
    if not test_ngrams:
        raise ValueError("test_sentence has fewer than n word tokens")

    # Sum the log probabilities of the n-grams
    log_prob_sum = 0
    for ngram in test_ngrams:
        prob = laplace_smoothing(ngram, ngram_counts, lower_order_counts, V)
        log_prob_sum += math.log(prob)

    # Perplexity is the exponential of the negative mean log probability
    return math.exp(-log_prob_sum / len(test_ngrams))

# Test perplexity function with a simple sentence
test_sentence = "The quick brown fox jumps over the lazy dog near the bank of the river."

bigram_perplexity = perplexity(test_sentence, bigram_counts, unigram_counts, 2, V)
trigram_perplexity = perplexity(test_sentence, trigram_counts, bigram_counts, 3, V)

print(f"Bigram Perplexity: {bigram_perplexity}")
print(f"Trigram Perplexity: {trigram_perplexity}")


Bigram Perplexity: 549.7742642081902
Trigram Perplexity: 607.2302444894901
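A quick sanity check helps when reading these scores. Under add-one smoothing, an n-gram that is unseen and whose context is also unseen gets probability 1/V, so a test sentence made only of such n-grams has perplexity equal to V. The sketch below reproduces that baseline with empty counters. `toy_perplexity` is an illustrative re-implementation, and the value 607 is the vocabulary size inferred from the probabilities printed in Step 6, not a figure stated in the notebook.

```python
import math
from collections import Counter


def toy_perplexity(test_ngrams, ngram_counts, context_counts, vocab_size, k=1):
    """Perplexity of a list of n-grams under add-k smoothing."""
    log_prob_sum = 0.0
    for ngram in test_ngrams:
        prob = (ngram_counts[ngram] + k) / (context_counts[ngram[:-1]] + k * vocab_size)
        log_prob_sum += math.log(prob)
    return math.exp(-log_prob_sum / len(test_ngrams))


# Empty Counters: every n-gram is unseen, so each probability is 1 / vocab_size.
unseen = [("quick", "brown"), ("brown", "fox"), ("fox", "jumps")]
result = toy_perplexity(unseen, Counter(), Counter(), vocab_size=607)
print(round(result, 6))  # 607.0
```

Read against that baseline, the trigram score of about 607.2 suggests that few, if any, of the test trigrams occur in the corpus. A seen context with an unseen continuation pushes the score slightly above V, while seen n-grams pull it below, as in the bigram score.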
