# Sustainable Prompt Optimization - Extended N-gram Language Modeling

This notebook builds on the lab’s unigram and bigram work to add:

- Trigram model (3-word sequences)
- General N-gram counting
- Interpolated sentence fluency scoring combining unigram, bigram, and trigram probabilities
- Selecting the best candidate prompt based on interpolated scoring


In [23]:
# 📦 Setup and Imports (reuse from lab)
import nltk
import re
import string
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mathe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mathe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Load the dataset from the folder

In [24]:
def load_corpus_from_file(filepath: str) -> str:
    """
    Loads and returns the content of a text corpus from the given file path.

    Parameters:
    - filepath (str): Path to the text file containing the corpus.

    Returns:
    - str: The entire corpus as a single string.
    """
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            corpus_text = file.read()
        print(f"✅ Successfully loaded corpus from: {filepath}")
        return corpus_text
    except FileNotFoundError:
        print(f"❌ File not found: {filepath}")
        return ""
    except Exception as e:
        print(f"❌ Error loading corpus: {e}")
        return ""


### 🧪 Tokenization & Normalization (reuse from lab)

Lowercase, regex tokenization, stopword removal, and stemming.


In [25]:
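# Build these once; rebuilding the stopword set on every call is wasteful
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def simple_tokenizer(text):
    # Lowercase, then keep runs of word characters (letters, digits, underscore)
    return re.findall(r'\b\w+\b', text.lower())

def normalize(tokens):
    # Drop stopwords and stray punctuation tokens, then stem what remains
    return [STEMMER.stem(word) for word in tokens
            if word not in STOP_WORDS and word not in string.punctuation]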


### 📚 Load & Preprocess Corpus

Point `load_corpus_from_file` at your own large corpus file, or assign the text directly to `sample_corpus`.


In [26]:
# Example placeholder corpus — replace with your large corpus text
# sample_corpus = """
# Your large corpus text goes here. This should be a substantial text dataset.
# It could be combined documents, articles, books, or any sizable textual material.
# """
sample_corpus = load_corpus_from_file("./data/corpus.txt")
tokens = normalize(simple_tokenizer(sample_corpus))


✅ Successfully loaded corpus from: ./data/corpus.txt


### 🔢 Build N-Gram Counts: Unigram, Bigram, Trigram

Precompute counts from the normalized token list for efficient probability estimates.


In [27]:
# Unigram counts
unigram_counts = Counter(tokens)
total_words = len(tokens)

# Bigram counts
bigram_counts = defaultdict(int)
for i in range(len(tokens) - 1):
    bigram = (tokens[i], tokens[i + 1])
    bigram_counts[bigram] += 1

# Trigram counts
trigram_counts = defaultdict(int)
for i in range(len(tokens) - 2):
    trigram = (tokens[i], tokens[i + 1], tokens[i + 2])
    trigram_counts[trigram] += 1
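
The bigram and trigram loops above follow the same sliding-window pattern, which extends to any order. As a minimal sketch (the helper name `count_ngrams` and the toy token list are assumptions for illustration, not part of the lab code), one function can count N-grams of any size:

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted views of the list yields each window as a tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))

toy_tokens = ["fast", "server", "setup", "fast", "server", "deploy"]
bigrams = count_ngrams(toy_tokens, 2)
trigrams = count_ngrams(toy_tokens, 3)
print(bigrams[("fast", "server")])   # 2
print(len(trigrams))                 # 4 distinct trigrams
```

Note that keys are always tuples here, so a unigram is stored as `("fast",)` rather than as the bare string used by `unigram_counts`.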


### 🧮 Maximum Likelihood Estimation (MLE) for N-Gram Probabilities

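Each estimate is a relative frequency: the count of the N-gram divided by the count of its context (the preceding N-1 words). Unseen unigrams receive a small floor probability of 1e-7, while unseen bigrams and trigrams get probability 0.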


In [28]:
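def unigram_prob(w):
    # Relative frequency; unseen words get a small floor probability (1e-7) rather than 0
    return unigram_counts[w] / total_words if w in unigram_counts else 1e-7

def bigram_prob(w1, w2):
    # P(w2 | w1); .get() avoids inserting zero-count keys into the defaultdict
    context = unigram_counts[w1]
    return bigram_counts.get((w1, w2), 0) / context if context > 0 else 0.0

def trigram_prob(w1, w2, w3):
    # P(w3 | w1, w2); returns 0 when the bigram context was never seen
    context = bigram_counts.get((w1, w2), 0)
    return trigram_counts.get((w1, w2, w3), 0) / context if context > 0 else 0.0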


### 🧩 Sentence Probability Functions for Each N-Gram Model

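Each function multiplies the conditional probabilities of successive words, following the chain rule with each word's history truncated to the previous N-1 words.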
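
Written out, the three scoring functions in the next cell compute the following products. The notation is added here for illustration: $w_1, \dots, w_n$ are the normalized tokens of the sentence and $C(\cdot)$ is a corpus count.

$$
\begin{aligned}
P_{\text{uni}}(w_1 \dots w_n) &= \prod_{i=1}^{n} P(w_i) \\
P_{\text{bi}}(w_1 \dots w_n) &= P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \\
P_{\text{tri}}(w_1 \dots w_n) &= P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1})
\end{aligned}
$$

with the relative-frequency estimates

$$
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}, \qquad
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}.
$$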


In [29]:
def sentence_prob_unigram(sentence):
    words = normalize(simple_tokenizer(sentence))
    if not words:
        return 0.0  # match the bigram/trigram functions: an empty sentence scores 0
    prob = 1.0
    for word in words:
        prob *= unigram_prob(word)
    return prob

def sentence_prob_bigram(sentence):
    words = normalize(simple_tokenizer(sentence))
    if not words:
        return 0.0
    prob = unigram_prob(words[0])
    for i in range(len(words) - 1):
        prob *= bigram_prob(words[i], words[i + 1])
    return prob

def sentence_prob_trigram(sentence):
    words = normalize(simple_tokenizer(sentence))
    if len(words) < 3:
        return sentence_prob_bigram(sentence)
    prob = unigram_prob(words[0]) * bigram_prob(words[0], words[1])
    for i in range(2, len(words)):
        prob *= trigram_prob(words[i - 2], words[i - 1], words[i])
    return prob
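
Multiplying many small probabilities can underflow to 0.0 for long prompts. A common workaround, sketched here under the assumption that the per-word probabilities are already available as a list, is to sum logarithms instead (`log_sentence_score` is a hypothetical helper, not part of the lab code):

```python
import math

def log_sentence_score(word_probs):
    """Sum of log-probabilities; -inf if any word has probability 0."""
    total = 0.0
    for p in word_probs:
        if p <= 0.0:
            return float("-inf")  # a zero factor makes the whole product zero
        total += math.log(p)
    return total

tiny = [1e-7] * 50                    # fifty unseen-word floor probabilities
product = 1.0
for p in tiny:
    product *= p
print(product)                        # 0.0: 1e-350 is below the float range
print(log_sentence_score(tiny))       # roughly -805.9, still usable for ranking
```

Since the logarithm is monotonic, ranking candidates by the log score gives the same order as ranking by the raw product whenever the product does not underflow.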


### 🔗 Interpolated Sentence Probability (Unigram + Bigram + Trigram)

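The weights control the contribution of each model and should sum to 1 (the defaults are 0.1, 0.3, and 0.6). For the second word, where no trigram context exists yet, the function applies only `lambda_bi` and `lambda_uni` without rescaling, so that term's weights sum to 0.4 under the defaults. Every prompt with two or more tokens picks up the same factor, so the ranking among such prompts is unaffected.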


In [30]:
def interpolated_sentence_prob(sentence, lambda_uni=0.1, lambda_bi=0.3, lambda_tri=0.6):
    words = normalize(simple_tokenizer(sentence))
    if not words:
        return 0.0

    if len(words) == 1:
        return unigram_prob(words[0])

    prob = unigram_prob(words[0])
    prob *= lambda_bi * bigram_prob(words[0], words[1]) + lambda_uni * unigram_prob(words[1])

    for i in range(2, len(words)):
        tri = trigram_prob(words[i - 2], words[i - 1], words[i])
        bi = bigram_prob(words[i - 1], words[i])
        uni = unigram_prob(words[i])

        combined = lambda_tri * tri + lambda_bi * bi + lambda_uni * uni
        prob *= combined

    return prob
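
The second-word term in the function above mixes only the bigram and unigram estimates. One possible adjustment, shown as a sketch rather than the lab's method, rescales whichever weights are in play so that they sum to 1 at every position (`interpolate` is a hypothetical helper and the probabilities below are made up):

```python
def interpolate(uni, bi=None, tri=None, lambdas=(0.1, 0.3, 0.6)):
    """Weighted mix of the available estimates, with weights rescaled to sum to 1.

    uni, bi, tri are probabilities for the same word from each model;
    pass None for a model that has no context yet (e.g. tri for word 2).
    """
    pairs = [(lam, p) for lam, p in zip(lambdas, (uni, bi, tri)) if p is not None]
    weight_sum = sum(lam for lam, _ in pairs)
    return sum(lam * p for lam, p in pairs) / weight_sum

# Word 2 of a sentence: only unigram and bigram estimates exist.
print(interpolate(uni=0.02, bi=0.5))            # (0.1*0.02 + 0.3*0.5) / 0.4
# Word 3 onward: all three estimates; an unseen trigram (0.0) does not zero the term.
print(interpolate(uni=0.02, bi=0.5, tri=0.0))   # 0.1*0.02 + 0.3*0.5
```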


### 🧠 Select Best Prompt From Candidates Using Interpolated N-Gram Scoring


In [31]:
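def select_best_prompt_ngram(prompts, lambda_uni=0.1, lambda_bi=0.3, lambda_tri=0.6):
    # Guard against an empty candidate list, which would otherwise raise IndexError
    if not prompts:
        return {"best_prompt": None, "scores": []}
    # Score every candidate, then sort from most to least probable
    scored = [(prompt, interpolated_sentence_prob(prompt, lambda_uni, lambda_bi, lambda_tri))
              for prompt in prompts]
    scored.sort(key=lambda x: x[1], reverse=True)
    return {
        "best_prompt": scored[0][0],
        "scores": scored
    }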


### ✅ Example Usage with Prompt Candidates


In [32]:
candidates = [
    "How to configure system quickly.",
    "Please provide configuration steps in a fast manner.",
    "Assist with server setup guidance to deploy with speed.",
    "Fast deployment via configuration help."
]

results = select_best_prompt_ngram(candidates)

print("✅ Best Prompt:", results["best_prompt"])
print("\n📊 Fluency Scores:")
for prompt, score in results["scores"]:
    print(f'"{prompt}" → Score: {score:.2e}')


✅ Best Prompt: How to configure system quickly.

📊 Fluency Scores:
"How to configure system quickly." → Score: 4.27e-19
"Fast deployment via configuration help." → Score: 4.06e-31
"Please provide configuration steps in a fast manner." → Score: 5.78e-35
"Assist with server setup guidance to deploy with speed." → Score: 1.00e-47

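Because each score is a product of per-word probabilities below 1, prompts with more tokens tend to score lower regardless of how fluent they are. One way to compare prompts of different lengths, offered as a sketch and not part of the lab code, is the geometric mean per token (`per_token_score` is a hypothetical helper, and the numbers below are made up for illustration):

```python
def per_token_score(sentence_prob, num_tokens):
    """Geometric mean of the per-word probabilities (length-normalized score)."""
    if num_tokens <= 0 or sentence_prob <= 0.0:
        return 0.0
    return sentence_prob ** (1.0 / num_tokens)

# Illustrative values only: a short prompt and a longer one.
short_raw, short_len = 1e-6, 3     # each word averages 1e-2
long_raw, long_len = 1e-9, 6       # each word averages about 3.2e-2

# Raw product favors the short prompt.
print(short_raw > long_raw)                      # True
# Per token, the longer prompt is the more probable one.
print(per_token_score(short_raw, short_len)
      < per_token_score(long_raw, long_len))     # True
```

Whether to normalize depends on the goal: the raw product rewards brevity, which may be desirable when shorter prompts are the target, while the per-token score isolates fluency.
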

### 📝 Summary

- Added trigram and general N-gram counting to capture more context in language modeling.
- Implemented interpolation of unigram, bigram, and trigram scores for robust sentence fluency evaluation.
- Provided a reusable function to rank candidate prompts based on this combined model.
- This approach works best with a **large corpus**, since reliable N-gram probability estimates need many observations of each word sequence.


## 🗣 Student Talking Point: Detailed Explanation of the Approach

### How Probabilistic N-gram Language Models Support Our Final Project
In our final project, Sustainable AI: Transparency and Dynamic Data Center Optimization, one key goal of the Optimization & Recommendation Engine is to suggest semantically equivalent but more energy-efficient prompt alternatives for large language model (LLM) inference.

Using probabilistic language models—specifically unigram, bigram, trigram, and generalized N-gram models adapted from the lab—we provide a principled mechanism to evaluate the linguistic fluency of candidate prompts generated by the system.

### Role of N-gram Models in Prompt Optimization

#### Fluency scoring:
The N-gram models estimate the probability that a given sequence of words (a prompt) appears naturally in real language usage.

- Unigram models consider individual word frequencies.
- Bigram models consider pairs of consecutive words.
- Trigram and higher N-gram models incorporate longer sequences (3 or more consecutive words), modeling more context.

#### Filtering candidate prompts:
When the engine generates multiple prompt alternatives (candidate rewrites), each candidate is scored using these language models.
Prompts with higher combined N-gram probabilities tend to be more natural and fluent, reducing the risk of awkward or unclear phrasing.

#### Interpolation and robustness:
By combining unigram, bigram, and trigram probabilities with weighted interpolation, the model balances the generality of shorter N-grams with the contextual richness of longer N-grams. With a nonzero unigram weight, a prompt still receives a usable score when a particular bigram or trigram is absent from the corpus, which keeps scoring reliable and discriminative even when higher-order counts are sparse.

#### Why Multiple Prompts?
The engine does not blindly produce a single optimized prompt; it generates several paraphrases to explore different ways to reduce prompt length and computational cost.

The probabilistic N-gram models provide a quantitative metric to select the best candidates that maintain linguistic clarity while potentially lowering inference energy.

Other modules (semantic similarity, energy cost estimation) work in concert with these language models to ensure meaning preservation and energy efficiency.

### Integration Summary
- Tokenization and normalization prepare prompts for model evaluation.
- Unigram, bigram, trigram, and N-gram counts are computed on a large training corpus to estimate probabilities.
- Sentence probability or fluency scores are computed by applying the chain rule using the N-gram models and smoothing by interpolation.
- The engine ranks prompt candidates by these fluency scores to retain the most natural ones; energy efficiency itself is judged by the separate energy cost estimation module.
- Lower-probability candidates are filtered out, preventing degraded user experience due to awkward language.
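
The ranking and filtering steps can be sketched as follows, assuming scores arrive as `(prompt, score)` pairs like those in `results["scores"]`. The helper name `keep_top_candidates`, its cutoff parameters, and the example scores are all hypothetical:

```python
def keep_top_candidates(scored, top_k=2, min_score=0.0):
    """Return up to top_k (prompt, score) pairs whose score exceeds min_score."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(prompt, score) for prompt, score in ranked if score > min_score][:top_k]

# Made-up scores for illustration.
example_scores = [("prompt A", 3e-12), ("prompt B", 0.0), ("prompt C", 8e-10)]
print(keep_top_candidates(example_scores))
# [('prompt C', 8e-10), ('prompt A', 3e-12)]
```

The surviving candidates would then be passed on to the semantic similarity and energy cost modules mentioned above.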

### Final Benefit
This N-gram based probabilistic fluency scoring component makes our prompt optimization system:

- More linguistically aware and reliable than naïve length-based heuristics.
- Able to handle a large variety of prompts, including rare or domain-specific phrases, when combined with a large corpus.
- A powerful tool in the feedback loop that aligns language generation with environmental sustainability by recommending prompts that save energy without sacrificing clarity or intent.
