# Generating a style profile

Another strategy for controlling model style is to build a prompt from a stylometric profile of the dataset.

## Data

We will use the same data as [before](./01-generating-a-dataset.ipynb).
All the manifests should already be cached by `loam-iiif`.

In [1]:
from loam_iiif import iiif

client = iiif.IIIFClient()

# Berkeley Folk Music Festival
collection_url = "https://api.dc.library.northwestern.edu/api/v2/collections/18ec4c6b-192a-4ab8-9903-ea0f393c35f7?as=iiif"
max_manifests = 5000

manifest_ids, _collection_ids = client.get_manifests_and_collections_ids(collection_url, max_manifests)

print(f"🔎 Found {len(manifest_ids)} manifests")

data: list[str] = []

for id in manifest_ids:
    manifest = client.fetch_json(id)

    if "summary" not in manifest:
        continue

    summary: dict = manifest["summary"]
    keys = summary.keys()

    if len(keys) == 0:
        continue

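    # Prefer the language-neutral "none" key; otherwise fall back to the first language present.
    summary_text: str = ""
    if "none" in keys:
        summary_text = "\n".join(summary["none"])
    else:
        # dict_keys is not subscriptable, so take the first key via an iterator.
        summary_text = "\n".join(summary[next(iter(keys))])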

    data.append(summary_text)

print(f"✅ Processed {len(data)} manifests")

🔎 Found 5000 manifests
✅ Processed 4906 manifests
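
The loop above relies on the shape of the IIIF `summary` property, which in the Presentation API 3.0 is a language map: language codes (or `"none"`) mapped to lists of strings. A minimal sketch of that lookup on hand-written input follows. The sample manifests and the helper name `summary_text_from` are made up for illustration and are not part of `loam-iiif`.

```python
from typing import Optional


def summary_text_from(manifest: dict) -> Optional[str]:
    """Return the summary text of a manifest-like dict, or None if it has none."""
    summary = manifest.get("summary") or {}
    if not summary:
        return None
    # Prefer the language-neutral "none" key, else the first language present.
    key = "none" if "none" in summary else next(iter(summary))
    return "\n".join(summary[key])


# Hypothetical manifests, not taken from the collection.
print(summary_text_from({"summary": {"none": ["Bearded man."]}}))
print(summary_text_from({"summary": {"en": ["A flyer.", "Two pages."]}}))
print(summary_text_from({"label": {"en": ["No summary here"]}}))
```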


## Stylometric profile

Next, we build a profile of the summaries using a few basic measures from [stylometry](https://en.wikipedia.org/wiki/Stylometry).
The analysis is rudimentary, but it is good enough for an LLM prompt.

We will not dive into the theory.
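
To give a feel for the kind of features the next cell computes, here is a minimal sketch that uses only the standard library. It splits on whitespace instead of using NLTK's tokenizers, so its numbers will differ slightly from the real profile. The name `toy_features` is invented for this example.

```python
def toy_features(text: str) -> dict:
    """Compute a few simple style features for one text."""
    words = text.split()
    num_words = len(words)
    return {
        "num_words": num_words,
        # Mean number of characters per whitespace-separated word.
        "avg_word_length": sum(len(w) for w in words) / num_words if num_words else 0,
        # Commas per character, not per word.
        "comma_frequency": text.count(",") / len(text) if text else 0,
        # Distinct words divided by total words (lexical variety).
        "type_token_ratio": len({w.lower() for w in words}) / num_words if num_words else 0,
    }


print(toy_features("An unidentified man sits between them."))
```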

In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd
import numpy as np
from collections import Counter
import random

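# Download required NLTK data (if not already downloaded)
# Note: recent NLTK releases load the sentence tokenizer from 'punkt_tab' instead of 'punkt';
# if sent_tokenize raises a LookupError, also run nltk.download('punkt_tab').
nltk.download('punkt')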
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

def basic_stylometric_profile(texts: list[str]):
    """
    Generates a basic stylometric profile for a list of texts.

    Args:
        texts: A list of strings (each string is a document).

    Returns:
        A Pandas DataFrame containing the stylometric features for each text.
    """

    num_texts = len(texts)
    results = {}

    results['num_words'] = [len(text.split()) for text in texts]

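    # Guard against empty texts, where np.mean([]) would return nan with a warning.
    results['avg_word_length'] = [np.mean([len(word) for word in text.split()]) if text.split() else 0 for text in texts]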

    results['num_sentences'] = [len(sent_tokenize(text)) for text in texts] # Use nltk.sent_tokenize

    results['avg_sentence_length'] = [results['num_words'][i] / results['num_sentences'][i] if results['num_sentences'][i] > 0 else 0 for i in range(num_texts)]

    stop_words = set(nltk.corpus.stopwords.words('english'))

    results['stopword_frequency'] = [sum([1 for word in texts[i].lower().split() if word in stop_words]) / results['num_words'][i] if results['num_words'][i] > 0 else 0 for i in range(num_texts)]

    results['comma_frequency'] = [text.count(',') / len(text) if len(text) > 0 else 0 for text in texts]

    def type_token_ratio(text):
        tokens = word_tokenize(text.lower())
        types = set(tokens)
        return len(types) / len(tokens) if len(tokens) > 0 else 0

    results['type_token_ratio'] = [type_token_ratio(text) for text in texts]

    # Punctuation Frequency
    def punctuation_frequency(text):
        punctuation = "!\"#$%&'()*+,-./:;<=>?@[]^_`{|}~"
        counts = Counter(c for c in text if c in punctuation)
        total_punctuation = sum(counts.values())
        return total_punctuation / len(text) if len(text) > 0 else 0

    results['punctuation_frequency'] = [punctuation_frequency(text) for text in texts]

    # Passive Voice Ratio (approximate, requires more sophisticated parsing for perfect accuracy)
    def passive_voice_ratio(text):
        auxiliaries = ['is', 'are', 'was', 'were', 'be', 'been', 'being']
        sentences = sent_tokenize(text)
        passive_count = 0
        for sentence in sentences:
            tokens = word_tokenize(sentence)
            tagged = nltk.pos_tag(tokens)
            for i in range(len(tagged)):
                if tagged[i][1] != 'VBN':
                    continue
                # Basic heuristic: a past participle preceded by a form of "be" or followed by "by".
                # Both neighbours are bounds-checked so that i - 1 never wraps around to the last token.
                followed_by_by = i + 1 < len(tagged) and tagged[i+1][0].lower() == 'by'
                preceded_by_be = i > 0 and tagged[i-1][0].lower() in auxiliaries
                if followed_by_by or preceded_by_be:
                    passive_count += 1
                    break # Count only one passive per sentence.
        return passive_count / len(sentences) if sentences else 0


    results['passive_voice_ratio'] = [passive_voice_ratio(text) for text in texts]


    df = pd.DataFrame(results)
    return df

# Generate the stylometric profile
profile_df = basic_stylometric_profile(data)

# Calculate statistics for sentence length
mean_sentence_length = profile_df['avg_sentence_length'].mean()
std_sentence_length = profile_df['avg_sentence_length'].std()
lower_bound = int(round(mean_sentence_length - std_sentence_length))
upper_bound = int(round(mean_sentence_length + std_sentence_length))
lower_bound = max(1, lower_bound)  # Ensure the lower bound is at least one word
upper_bound = max(lower_bound + 1, upper_bound) # Ensure upper bound > lower bound

# Calculate stop word frequency
mean_stopword_frequency = profile_df['stopword_frequency'].mean()

# Calculate comma frequency
mean_comma_frequency = profile_df['comma_frequency'].mean()
std_comma_frequency = profile_df['comma_frequency'].std()
comma_lower_bound = max(0, mean_comma_frequency - std_comma_frequency)
comma_upper_bound = mean_comma_frequency + std_comma_frequency

# Calculate passive voice ratio
mean_passive_voice_ratio = profile_df['passive_voice_ratio'].mean()
std_passive_voice_ratio = profile_df['passive_voice_ratio'].std()
passive_voice_lower_bound = max(0, mean_passive_voice_ratio - std_passive_voice_ratio)
passive_voice_upper_bound = mean_passive_voice_ratio + std_passive_voice_ratio


def check_sentence_length(sentence, lower_bound, upper_bound):
    num_words = len(sentence.split())
    return lower_bound <= num_words <= upper_bound


def check_stopword_frequency(sentence, mean_stopword_frequency):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    num_words = len(sentence.split())
    stopword_count = sum(1 for word in sentence.lower().split() if word in stop_words)
    stopword_frequency = stopword_count / num_words if num_words > 0 else 0
    return abs(stopword_frequency - mean_stopword_frequency) <= 0.1 # Tolerance


def check_comma_frequency(sentence, comma_lower_bound, comma_upper_bound):
    comma_frequency = sentence.count(',') / len(sentence) if len(sentence) > 0 else 0
    return comma_lower_bound <= comma_frequency <= comma_upper_bound


def check_passive_voice_ratio(sentence, mean_passive_voice_ratio):
    auxiliaries = ['is', 'are', 'was', 'were', 'be', 'been', 'being']
    tokens = word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    passive_count = 0
    for i in range(len(tagged)):
        if tagged[i][1] != 'VBN':
            continue
        # Basic heuristic, with bounds checks so that i - 1 never wraps around to the last token.
        followed_by_by = i + 1 < len(tagged) and tagged[i+1][0].lower() == 'by'
        preceded_by_be = i > 0 and tagged[i-1][0].lower() in auxiliaries
        if followed_by_by or preceded_by_be:
            passive_count += 1
            break # Count only one passive per sentence.
    num_sentences = len(sent_tokenize(sentence))
    ratio = passive_count / num_sentences if num_sentences > 0 else 0

    return abs(ratio - mean_passive_voice_ratio) <= 0.1 # Tolerance


def find_example_sentences(texts, num_examples=2,
                           lower_bound=None, upper_bound=None,
                           mean_stopword_frequency=None,
                           comma_lower_bound=None, comma_upper_bound=None,
                           mean_passive_voice_ratio = None):
    """Finds sentences that match specified stylistic properties.  Properties that aren't specified are ignored"""

    examples = []
    possible_examples = []

    for text in texts:
        sentences = sent_tokenize(text)
        for sentence in sentences:
            # Check all relevant criteria
            length_ok = True if lower_bound is None else check_sentence_length(sentence, lower_bound, upper_bound)
            stopword_ok = True if mean_stopword_frequency is None else check_stopword_frequency(sentence, mean_stopword_frequency)
            comma_ok = True if comma_lower_bound is None else check_comma_frequency(sentence, comma_lower_bound, comma_upper_bound)
            passive_voice_ok = True if mean_passive_voice_ratio is None else check_passive_voice_ratio(sentence, mean_passive_voice_ratio)

            if length_ok and stopword_ok and comma_ok and passive_voice_ok:
                possible_examples.append(sentence)

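    if len(possible_examples) >= num_examples:
        examples = random.sample(possible_examples, num_examples)
    elif possible_examples:
        examples = random.choices(possible_examples, k=num_examples) # Allow duplicates when the number of examples is low
    # If nothing matched, leave `examples` empty; random.choices raises IndexError on an empty list.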


    return examples


example_sentence_lengths = find_example_sentences(data, num_examples=2, lower_bound=lower_bound, upper_bound=upper_bound)
example_stopword_frequencies = find_example_sentences(data, num_examples=2, mean_stopword_frequency=mean_stopword_frequency)
example_comma_frequencies = find_example_sentences(data, num_examples=2, comma_lower_bound=comma_lower_bound, comma_upper_bound=comma_upper_bound)
example_passive_voice_ratios = find_example_sentences(data, num_examples=2, mean_passive_voice_ratio = mean_passive_voice_ratio)

[nltk_data] Downloading package punkt to /Users/charles/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/charles/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/charles/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
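
The sentence-length bounds used for the prompt are the mean plus or minus one standard deviation, clamped so that they stay meaningful. A minimal sketch on made-up numbers follows. It uses `statistics.stdev`, which is the sample standard deviation and so matches the default of pandas `Series.std()`. The function name `one_sigma_word_range` is invented for this example.

```python
import statistics


def one_sigma_word_range(avg_sentence_lengths):
    """Return (lower, upper) word counts spanning mean +/- one standard deviation."""
    mean = statistics.mean(avg_sentence_lengths)
    std = statistics.stdev(avg_sentence_lengths)  # needs at least two values
    lower = max(1, int(round(mean - std)))          # at least one word
    upper = max(lower + 1, int(round(mean + std)))  # strictly above the lower bound
    return lower, upper


# Made-up per-document average sentence lengths, not values from the collection.
print(one_sigma_word_range([6.0, 8.0, 10.0, 14.0, 22.0]))
```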


After doing the calculations, we'll create a prompt template.

In [3]:
# Create the LLM prompt
llm_prompt = f"""
Follow these guidelines when describing an image:

1.  Use short sentences that are around {lower_bound}-{upper_bound} words long. Examples:
    -   {example_sentence_lengths[0] if len(example_sentence_lengths) > 0 else "No example available"}
    -   {example_sentence_lengths[1] if len(example_sentence_lengths) > 1 else "No example available"}

2.  Use a {mean_stopword_frequency:.2f} proportion of stop words in your sentences. Examples:
    -   {example_stopword_frequencies[0] if len(example_stopword_frequencies) > 0 else "No example available"}
    -   {example_stopword_frequencies[1] if len(example_stopword_frequencies) > 1 else "No example available"}

3. Use a comma frequency between {comma_lower_bound:.2f} and {comma_upper_bound:.2f}. Examples:
    -   {example_comma_frequencies[0] if len(example_comma_frequencies) > 0 else "No example available"}
    -   {example_comma_frequencies[1] if len(example_comma_frequencies) > 1 else "No example available"}

4. Aim for a passive voice ratio close to {mean_passive_voice_ratio:.2f}. Examples:
    -   {example_passive_voice_ratios[0] if len(example_passive_voice_ratios) > 0 else "No example available"}
    -   {example_passive_voice_ratios[1] if len(example_passive_voice_ratios) > 1 else "No example available"}
"""

print("\nLLM Prompt:")
print(llm_prompt)


LLM Prompt:

Follow these guidelines when describing an image:

1.  Use short sentences that are around 6-21 words long. Examples:
    -   An unidentified man sits between them.
    -   Digital image scanned from black and white negative.

2.  Use a 0.27 proportion of stop words in your sentences. Examples:
    -   Clipping from Folknik's newsletter about Teatro Campesino.
    -   Flyer advertising double feature movie screenings of "Tarzan the Ape Man" and the Little Rascals' "Fly My Kite," with free bananas for the first 10 guests at each show

3. Use a comma frequency between 0.00 and 0.02. Examples:
    -   Mississippi John Hurt (playing guitar) and Bess Lomax Hawes at a Berkeley Folk Music Festival workshop.
    -   Bearded man.

4. Aim for a passive voice ratio close to 0.03. Examples:
    -   Template for form letters to Berkeley Folk Music Festival artists and folklorists
    -   Digital image scanned from black and white negative.
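
Note that the comma frequency in guideline 3 is measured in commas per character (`text.count(',') / len(text)`), not commas per word, which is why the bounds are so small. A quick sketch on two of the example sentences printed above; `comma_frequency` is a standalone helper written for this illustration.

```python
def comma_frequency(text: str) -> float:
    """Commas per character, mirroring the formula used in the profile."""
    return text.count(",") / len(text) if text else 0.0


examples = [
    "Bearded man.",
    'Flyer advertising double feature movie screenings of "Tarzan the Ape Man" '
    'and the Little Rascals\' "Fly My Kite," with free bananas for the first 10 '
    "guests at each show",
]

for sentence in examples:
    # Show the frequency next to a short prefix of the sentence.
    print(f"{comma_frequency(sentence):.3f}  {sentence[:40]}")
```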



Finally, save the prompt to a file.

In [4]:
file_path = "llm_prompt.txt"
with open(file_path, "w", encoding="utf-8") as file:  # Summaries may contain non-ASCII characters
    file.write(llm_prompt)

print(f"LLM prompt has been saved to {file_path}")

LLM prompt has been saved to llm_prompt.txt
