# Formulaic Sentence Detection in Tang Dynasty Edicts

This notebook identifies **formulaic sentences** (standardized phrases that appear across multiple edicts) versus **unique sentences** (edict-specific content) in Tang Dynasty imperial edicts using SIKU-BERT embeddings and cosine similarity.

## Overview

Imperial edicts often follow conventional templates with formulaic language interspersed with context-specific content. This analysis helps distinguish between:

- **Formulaic sentences**: Phrases that appear repeatedly across different edicts with high semantic similarity (≥ threshold)
- **Unique sentences**: Edict-specific content with no close counterparts in other documents

## Methodology

1. **Segmentation**: Split each edict into sentences based on Chinese punctuation marks (。；！？)
2. **Embedding**: Generate SIKU-BERT embeddings for all sentences
3. **Similarity Analysis**: Compute cosine similarity between sentences across different edicts
4. **Classification**: Mark a sentence as formulaic if its highest similarity to any sentence in another edict is ≥ the threshold (default 0.85)
5. **Visualization**: Export formatted texts with bold highlighting for formulaic sentences
6. **Interactive Exploration**: Compare edicts dynamically with selective highlighting
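
Steps 3 and 4 can be illustrated on toy data. The sketch below is not the notebook's implementation: the three-dimensional vectors are invented stand-ins for SIKU-BERT embeddings, and `cosine` and `classify` are hypothetical helper names used only for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (invented values). Each entry: (edict_id, vector).
sentences = [
    ("A", [1.0, 0.0, 0.0]),
    ("A", [0.0, 1.0, 0.0]),
    ("B", [0.9, 0.1, 0.0]),
    ("B", [0.0, 0.0, 1.0]),
]
THRESHOLD = 0.85

def classify(sentences, threshold):
    """Flag a sentence when its best match in another edict meets the threshold."""
    flags = []
    for edict_i, vec_i in sentences:
        # Compare only against sentences that belong to other edicts.
        scores = [cosine(vec_i, vec_j)
                  for edict_j, vec_j in sentences if edict_j != edict_i]
        flags.append(bool(scores) and max(scores) >= threshold)
    return flags

flags = classify(sentences, THRESHOLD)
print(flags)
```

Only the first sentence of each toy edict has a close counterpart in the other edict, so only those two are flagged.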

## Configuration

Key parameters you can adjust:
- `EDICT_TYPE`: The document type to analyze (e.g., '册文', '即位赦', '改元赦')
- `SIMILARITY_THRESHOLD`: Minimum cosine similarity to consider sentences formulaic (0.0-1.0)
- `MIN_SENTENCE_LENGTH`: Minimum characters for a valid sentence

## Outputs

The notebook generates:
1. **Markdown file**: Formatted texts with formulaic sentences in **bold**
2. **CSV file**: Detailed sentence-level data with similarity scores
3. **Interactive widget**: Dynamic edict comparison with highlighting

Let's begin the analysis!

## Installation

Install all required Python packages for the analysis. This includes:
- **pandas**: Data manipulation and CSV handling
- **numpy**: Numerical computations
- **torch**: PyTorch for running SIKU-BERT model
- **transformers**: HuggingFace library for BERT models
- **scikit-learn**: Cosine similarity computation
- **tqdm**: Progress bars for long operations
- **ipywidgets**: Interactive widgets for edict exploration
- **matplotlib**: Bar-chart visualization of formulaic content (imported in Section 12)

In [None]:
%pip install pandas numpy torch transformers scikit-learn tqdm ipywidgets matplotlib

## 1. Import Libraries

Import all necessary libraries and configure display settings. We suppress warnings to keep output clean and set pandas to display full column information for better visibility of results.

In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import ipywidgets as widgets
from IPython.display import display, HTML, Markdown
import re
import warnings
warnings.filterwarnings('ignore')

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 2. Configuration

Set analysis parameters. **EDICT_TYPE** determines which document type to analyze. **SIMILARITY_THRESHOLD** is the minimum cosine similarity at which a sentence counts as formulaic; the default of 0.85 is a cosine score, not a percentage of shared text. Lower thresholds classify more sentences as formulaic; higher thresholds are more conservative. The notebook automatically detects GPU availability for faster processing.

In [None]:
# Configuration
EDICT_TYPE = '册文'  # Change to analyze different edict types
SIMILARITY_THRESHOLD = 0.85  # Sentences with similarity ≥ this are considered formulaic
MIN_SENTENCE_LENGTH = 5  # Minimum characters for a valid sentence
MODEL_PATH = './sikubert'  # Path to local SIKU-BERT model
OUTPUT_FILE = f'formulaic_analysis_{EDICT_TYPE}.md'  # Output Markdown file

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"Configuration:")
print(f"  Edict type: {EDICT_TYPE}")
print(f"  Similarity threshold: {SIMILARITY_THRESHOLD}")
print(f"  Min sentence length: {MIN_SENTENCE_LENGTH}")
print(f"  Model path: {MODEL_PATH}")
print(f"  Output file: {OUTPUT_FILE}")
print(f"  Device: {device}")
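
Because the threshold directly controls how many sentences are flagged, it can help to see how the formulaic share changes across several candidate values before settling on one. The sketch below is an assumption-laden illustration: `formulaic_share` is a hypothetical helper and `toy_scores` are invented numbers standing in for the per-sentence maximum similarities that the notebook computes later.

```python
def formulaic_share(max_similarities, thresholds):
    """Map each threshold to the fraction of scores at or above it."""
    total = len(max_similarities)
    return {
        t: sum(score >= t for score in max_similarities) / total
        for t in thresholds
    }

# Invented scores; in the notebook these would come from the similarity step.
toy_scores = [0.99, 0.93, 0.88, 0.84, 0.79, 0.70, 0.65, 0.52]
shares = formulaic_share(toy_scores, thresholds=[0.80, 0.85, 0.90])

for threshold, share in shares.items():
    print(f"threshold {threshold:.2f}: {share:.1%} formulaic")
```

A sharp jump in the share between two neighboring thresholds suggests that the classification is sensitive in that range and that the flagged sentences deserve a manual check.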

## 3. Load SIKU-BERT Model

Load the SIKU-BERT model from the local path. SIKU-BERT is a BERT model trained on classical Chinese texts from the Siku Quanshu (四庫全書), which makes it well suited to Tang Dynasty documents. The model is set to evaluation mode (no training) and moved to the GPU if available.

In [None]:
print(f"Loading SIKU-BERT model from {MODEL_PATH}...")

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)
model.to(device)
model.eval()

print(f"✅ Model loaded successfully!")
print(f"   Model parameters: {sum(p.numel() for p in model.parameters()):,}")

## 4. Define Embedding Function

Define the embedding generation function. This function converts text strings into dense vector representations (embeddings) using SIKU-BERT. We use the final-layer [CLS] token embedding as the sentence representation, treating it as a summary of the whole sentence. Inputs longer than 512 tokens are truncated. Batch processing improves efficiency when encoding many sentences.

In [None]:
def get_embeddings(texts, batch_size=16):
    """
    Generate SIKU-BERT embeddings for a list of texts.
    
    Args:
        texts: List of text strings
        batch_size: Number of texts to process at once
    
    Returns:
        numpy array of embeddings (num_texts, embedding_dim)
    """
    all_embeddings = []
    
    with torch.no_grad():
        for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
            batch_texts = texts[i:i+batch_size]
            
            # Tokenize
            inputs = tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            # Get embeddings
            outputs = model(**inputs)
            
            # Use [CLS] token embedding
            embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            all_embeddings.append(embeddings)
    
    return np.vstack(all_embeddings)

print("✅ Embedding function defined")
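
The function above takes the [CLS] vector as the sentence embedding. A commonly used alternative is masked mean pooling, which averages all token vectors while ignoring padding. The sketch below shows the idea on plain NumPy arrays shaped like `last_hidden_state` and `attention_mask`; `masked_mean_pool` is a hypothetical helper, the numbers are invented, and whether this pooling gives better results on this corpus is an open question that the notebook does not test.

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    last_hidden_state: array of shape (batch, seq_len, hidden)
    attention_mask:    array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first toy sentence carries large values on purpose: they do not affect the pooled vector because the mask zeroes them out.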

## 5. Load and Segment Edicts

Load the edict dataset from CSV and filter by document type. The CSV file contains extracted Tang Dynasty edicts with punctuated text. We select only edicts matching the specified type (e.g., '册文') and ensure they have valid text content. The notebook displays all available edicts for transparency.

In [None]:
print(f"Loading edicts from extracted_edicts_punc.csv...")

df_all = pd.read_csv('extracted_edicts_punc.csv', encoding='utf-8-sig')

# Filter by edict type
df_edicts = df_all[
    (df_all['document_type'] == EDICT_TYPE) & 
    (df_all['text_contents_punctuated'].notna())
].copy()

df_edicts.reset_index(drop=True, inplace=True)

print(f"\nFound {len(df_edicts)} edicts of type '{EDICT_TYPE}'")

if len(df_edicts) < 2:
    print("\n⚠️  WARNING: Need at least 2 edicts for comparison!")
    print("   Please choose a different EDICT_TYPE with more examples.")
else:
    print("\nEdicts:")
    for idx, row in df_edicts.iterrows():
        print(f"  {idx+1}. {row['text_title']}")

Segment each edict into individual sentences. Chinese sentences are identified by major punctuation marks: 。(period), ；(semicolon), ！(exclamation), and ？(question mark). We also track the position of each sentence within the original text for later reconstruction with formatting. Short fragments below MIN_SENTENCE_LENGTH are filtered out to avoid noise.

In [None]:
print(f"\nSegmenting edicts into sentences...")

def segment_sentences(text):
    """
    Segment text into sentences based on Chinese punctuation.
    
    Returns:
        List of (sentence_text, start_pos, end_pos) tuples
    """
    # Split by major delimiters
    parts = re.split(r'([。；！？])', text)
    
    # Reconstruct sentences with delimiters
    sentences = []
    current_pos = 0
    
    for i in range(0, len(parts)-1, 2):
        if i+1 < len(parts):
            sent = parts[i] + parts[i+1]
            sent = sent.strip()
            
            if len(sent) >= MIN_SENTENCE_LENGTH:
                # Find position in original text
                start_pos = text.find(sent, current_pos)
                if start_pos == -1:
                    start_pos = current_pos
                end_pos = start_pos + len(sent)
                
                sentences.append((sent, start_pos, end_pos))
                current_pos = end_pos
    
    # Handle last sentence without delimiter
    if len(parts) % 2 == 1:
        last_sent = parts[-1].strip()
        if len(last_sent) >= MIN_SENTENCE_LENGTH:
            start_pos = text.find(last_sent, current_pos)
            if start_pos == -1:
                start_pos = current_pos
            end_pos = start_pos + len(last_sent)
            sentences.append((last_sent, start_pos, end_pos))
    
    return sentences

# Segment all edicts
edict_sentences = []  # List of dicts with metadata

for idx, row in df_edicts.iterrows():
    edict_title = row['text_title']
    full_text = row['text_contents_punctuated']
    
    sentences = segment_sentences(full_text)
    
    for sent_idx, (sent_text, start_pos, end_pos) in enumerate(sentences):
        edict_sentences.append({
            'edict_idx': idx,
            'edict_title': edict_title,
            'sentence_idx': sent_idx,
            'sentence_text': sent_text,
            'start_pos': start_pos,
            'end_pos': end_pos,
            'full_text': full_text
        })

df_sentences = pd.DataFrame(edict_sentences)

print(f"✅ Segmentation complete!")
print(f"   Total sentences: {len(df_sentences)}")
print(f"   Sentences per edict:")
for idx, row in df_edicts.iterrows():
    count = len(df_sentences[df_sentences['edict_idx'] == idx])
    print(f"     {row['text_title']}: {count} sentences")
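
The segmentation above splits the text and then recovers each sentence's position with `str.find`. As an alternative sketch, `re.finditer` can return the offsets directly, which avoids the search step. `SENTENCE_RE` and `segment_with_offsets` are hypothetical names, and the sample string is an invented illustration rather than a quotation from the dataset.

```python
import re

# Zero or more non-terminator characters followed by one terminator,
# or a trailing run of text that has no terminator at the end.
SENTENCE_RE = re.compile(r'[^。；！？]*[。；！？]|[^。；！？]+$')

def segment_with_offsets(text, min_length=5):
    """Return (sentence, start, end) tuples with offsets into `text`."""
    results = []
    for match in SENTENCE_RE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if len(stripped) < min_length:
            continue  # drop short fragments, as the notebook does
        # Shift the start past any leading whitespace removed by strip().
        start = match.start() + (len(raw) - len(raw.lstrip()))
        results.append((stripped, start, start + len(stripped)))
    return results

sample = "維某年月日，皇帝若曰。咨爾某氏；可。往欽哉"
segments = segment_with_offsets(sample, min_length=3)

for sentence, start, end in segments:
    print(start, end, sentence)
```

Every returned span satisfies `sample[start:end] == sentence`, so later steps can slice the original text without searching for the sentence again.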

## 6. Generate Embeddings

Generate SIKU-BERT embeddings for all sentences in the dataset. This converts each sentence from text into a numerical vector that captures its semantic meaning. The embedding process may take several minutes depending on the number of sentences and whether GPU acceleration is available. Progress is shown with a progress bar.

In [None]:
print(f"Generating SIKU-BERT embeddings for {len(df_sentences)} sentences...\n")

# Get all sentence texts
sentence_texts = df_sentences['sentence_text'].tolist()

# Generate embeddings
embeddings = get_embeddings(sentence_texts, batch_size=16)

print(f"\n✅ Embeddings generated!")
print(f"   Shape: {embeddings.shape}")
print(f"   Memory: {embeddings.nbytes / 1024 / 1024:.2f} MB")

## 7. Compute Similarity and Identify Formulaic Sentences

Compute pairwise cosine similarity between all sentences and identify formulaic patterns. For each sentence, we find its most similar counterpart from **other edicts** (not from the same edict). If that maximum similarity is ≥ the threshold, the sentence is classified as formulaic. This approach detects cross-document patterns rather than repetition within a single edict.

In [None]:
print(f"Computing sentence similarities...\n")

# Compute full similarity matrix
similarity_matrix = cosine_similarity(embeddings)

print(f"✅ Similarity matrix computed: {similarity_matrix.shape}\n")

# For each sentence, find max similarity with sentences from OTHER edicts
formulaic_flags = []
max_similarities = []
best_matches = []

for i in tqdm(range(len(df_sentences)), desc="Identifying formulaic sentences"):
    current_edict_idx = df_sentences.iloc[i]['edict_idx']
    
    # Find indices of sentences from other edicts
    other_edict_mask = df_sentences['edict_idx'] != current_edict_idx
    other_edict_indices = df_sentences[other_edict_mask].index.tolist()
    
    if len(other_edict_indices) == 0:
        # Only one edict - cannot compare
        formulaic_flags.append(False)
        max_similarities.append(0.0)
        best_matches.append(None)
        continue
    
    # Get similarities to other edicts
    similarities_to_others = similarity_matrix[i, other_edict_indices]
    
    # Find maximum similarity
    max_sim = similarities_to_others.max()
    max_sim_idx_in_others = similarities_to_others.argmax()
    best_match_idx = other_edict_indices[max_sim_idx_in_others]
    
    # Determine if formulaic
    is_formulaic = max_sim >= SIMILARITY_THRESHOLD
    
    formulaic_flags.append(is_formulaic)
    max_similarities.append(max_sim)
    best_matches.append(best_match_idx)

# Add to dataframe
df_sentences['is_formulaic'] = formulaic_flags
df_sentences['max_similarity'] = max_similarities
df_sentences['best_match_idx'] = best_matches

# Statistics
num_formulaic = df_sentences['is_formulaic'].sum()
num_unique = len(df_sentences) - num_formulaic

print(f"\n✅ Formulaic identification complete!")
print(f"\nResults:")
print(f"  Total sentences: {len(df_sentences)}")
print(f"  Formulaic sentences (≥{SIMILARITY_THRESHOLD} similarity): {num_formulaic} ({num_formulaic/len(df_sentences)*100:.1f}%)")
print(f"  Unique sentences: {num_unique} ({num_unique/len(df_sentences)*100:.1f}%)")

print(f"\nFormulaic sentences by edict:")
for idx, row in df_edicts.iterrows():
    edict_sents = df_sentences[df_sentences['edict_idx'] == idx]
    num_form = edict_sents['is_formulaic'].sum()
    total = len(edict_sents)
    print(f"  {row['text_title']}: {num_form}/{total} ({num_form/total*100:.1f}%)")
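
The loop above rebuilds a mask over the DataFrame for every sentence, which can become slow on larger corpora. One possible vectorized variant masks same-edict pairs in the similarity matrix once and then takes a row-wise maximum. This is a sketch under assumptions: `cross_edict_best_match` is a hypothetical helper, and the 4x4 matrix holds invented values.

```python
import numpy as np

def cross_edict_best_match(similarity, edict_ids):
    """For each row, return (best_score, best_index) over other-edict columns."""
    edict_ids = np.asarray(edict_ids)
    # True where row and column sentences belong to the same edict.
    same_edict = edict_ids[:, None] == edict_ids[None, :]
    # Same-edict pairs (including the diagonal) can never win the argmax.
    masked = np.where(same_edict, -np.inf, similarity)
    best_index = masked.argmax(axis=1)
    best_score = masked[np.arange(len(edict_ids)), best_index]
    return best_score, best_index

# Toy similarity matrix for 4 sentences (invented values).
sim = np.array([
    [1.00, 0.95, 0.90, 0.20],
    [0.95, 1.00, 0.30, 0.40],
    [0.90, 0.30, 1.00, 0.99],
    [0.20, 0.40, 0.99, 1.00],
])
ids = [0, 0, 1, 1]  # edict index of each sentence

scores, matches = cross_edict_best_match(sim, ids)
is_formulaic = scores >= 0.85
print(matches.tolist(), is_formulaic.tolist())
```

If the corpus contains a single edict, every score comes out as `-inf`, so no sentence is flagged; in that case the returned indices are meaningless and should be ignored.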

## 8. Display Sample Results

Display representative examples of formulaic and unique sentences. Formulaic examples are sorted by similarity score to show the strongest patterns. For each formulaic sentence, we show both the original sentence and its best match from another edict. Unique examples demonstrate sentences with no close counterparts across the corpus.

In [None]:
print("=" * 100)
print("SAMPLE FORMULAIC SENTENCES")
print("=" * 100)

if num_formulaic > 0:
    formulaic_sents = df_sentences[df_sentences['is_formulaic']].nlargest(5, 'max_similarity')
    
    for i, (idx, row) in enumerate(formulaic_sents.iterrows(), 1):
        print(f"\n{'-' * 100}")
        print(f"Example #{i}")
        print(f"{'-' * 100}")
        print(f"Edict: {row['edict_title']}")
        print(f"Similarity: {row['max_similarity']:.3f}")
        print(f"\nSentence:")
        print(f"  {row['sentence_text']}")
        
        if pd.notna(row['best_match_idx']):
            match_row = df_sentences.iloc[int(row['best_match_idx'])]
            print(f"\nBest match (from {match_row['edict_title']}):")
            print(f"  {match_row['sentence_text']}")
    
    print(f"\n{'=' * 100}")
else:
    print("\n⚠️  No formulaic sentences found with current threshold.")
    print(f"   Consider lowering SIMILARITY_THRESHOLD (current: {SIMILARITY_THRESHOLD})")

print("\n" + "=" * 100)
print("SAMPLE UNIQUE SENTENCES")
print("=" * 100)

if num_unique > 0:
    unique_sents = df_sentences[~df_sentences['is_formulaic']].nsmallest(5, 'max_similarity')
    
    for i, (idx, row) in enumerate(unique_sents.iterrows(), 1):
        print(f"\n{'-' * 100}")
        print(f"Example #{i}")
        print(f"{'-' * 100}")
        print(f"Edict: {row['edict_title']}")
        print(f"Max similarity: {row['max_similarity']:.3f}")
        print(f"\nSentence:")
        print(f"  {row['sentence_text']}")
    
    print(f"\n{'=' * 100}")
else:
    print("\n⚠️  No unique sentences found - all sentences are formulaic!")

## 9. Export to Markdown

Export results to a formatted Markdown file for human reading. Each edict is reconstructed with **formulaic sentences in bold** and unique sentences in regular text. The file includes statistics for each edict and a summary at the top. This provides an easy way to read and analyze the complete texts with visual distinction between formulaic and unique content.

In [None]:
print(f"Exporting results to {OUTPUT_FILE}...\n")

with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    # Write header
    f.write(f"# Formulaic Sentence Analysis: {EDICT_TYPE}\n\n")
    f.write(f"**Analysis Parameters:**\n")
    f.write(f"- Edict Type: {EDICT_TYPE}\n")
    f.write(f"- Number of Edicts: {len(df_edicts)}\n")
    f.write(f"- Similarity Threshold: {SIMILARITY_THRESHOLD}\n")
    f.write(f"- Total Sentences: {len(df_sentences)}\n")
    f.write(f"- Formulaic Sentences: {num_formulaic} ({num_formulaic/len(df_sentences)*100:.1f}%)\n")
    f.write(f"- Unique Sentences: {num_unique} ({num_unique/len(df_sentences)*100:.1f}%)\n")
    f.write(f"\n**Legend:**\n")
    f.write(f"- **Bold text** = Formulaic sentence (similar to sentences in other edicts)\n")
    f.write(f"- Regular text = Unique sentence (no close counterparts in other edicts)\n")
    f.write(f"\n---\n\n")
    
    # Process each edict
    for edict_idx, edict_row in df_edicts.iterrows():
        edict_title = edict_row['text_title']
        full_text = edict_row['text_contents_punctuated']
        
        # Get sentences for this edict
        edict_sents = df_sentences[df_sentences['edict_idx'] == edict_idx].sort_values('sentence_idx')
        
        # Write edict header
        f.write(f"## {edict_title}\n\n")
        
        # Statistics for this edict
        num_form_edict = edict_sents['is_formulaic'].sum()
        total_sents_edict = len(edict_sents)
        f.write(f"*Formulaic: {num_form_edict}/{total_sents_edict} sentences ({num_form_edict/total_sents_edict*100:.1f}%)*\n\n")
        
        # Reconstruct text with formatting
        # We'll process the full text and apply bold formatting
        formatted_text = full_text
        
        # Sort sentences by position (reverse order for string replacement)
        sorted_sents = edict_sents.sort_values('start_pos', ascending=False)
        
        for _, sent_row in sorted_sents.iterrows():
            sent_text = sent_row['sentence_text']
            start_pos = sent_row['start_pos']
            end_pos = sent_row['end_pos']
            is_formulaic = sent_row['is_formulaic']
            
            # Extract original sentence from full text
            original_sent = full_text[start_pos:end_pos]
            
            # Apply formatting
            if is_formulaic:
                # Bold for formulaic
                formatted_sent = f"**{original_sent}**"
            else:
                # Keep as-is for unique
                formatted_sent = original_sent
            
            # Replace in formatted text
            formatted_text = formatted_text[:start_pos] + formatted_sent + formatted_text[end_pos:]
        
        # Write formatted text
        f.write(formatted_text)
        f.write("\n\n---\n\n")

print(f"✅ Export complete!")
print(f"\nOutput file: {OUTPUT_FILE}")
print(f"\nThe file contains:")
print(f"  - Complete text of all {len(df_edicts)} edicts")
print(f"  - Formulaic sentences in **bold**")
print(f"  - Unique sentences in regular text")
print(f"  - Statistics for each edict")
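
When two formulaic sentences are adjacent, wrapping each one separately produces back-to-back markers (`**...****...**`), which some Markdown renderers may not display as intended. A possible refinement, sketched below, merges adjacent formulaic spans before adding the markers. `bold_spans` is a hypothetical helper and the sample text is invented.

```python
def bold_spans(text, spans):
    """Wrap flagged spans of `text` in ** markers, merging adjacent flagged spans.

    spans: iterable of (start, end, is_formulaic) with non-overlapping offsets.
    """
    merged = []
    for start, end, flagged in sorted(spans):
        if not flagged:
            continue
        if merged and merged[-1][1] == start:
            merged[-1][1] = end  # extend the previous bold run
        else:
            merged.append([start, end])

    # Walk forward through the text, copying plain and bold pieces in order.
    pieces = []
    cursor = 0
    for start, end in merged:
        pieces.append(text[cursor:start])
        pieces.append(f"**{text[start:end]}**")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "甲甲甲。乙乙乙。丙丙丙。"
spans = [(0, 4, True), (4, 8, True), (8, 12, False)]
result = bold_spans(text, spans)
print(result)
```

Building the output in a single forward pass also removes the need to process sentences in reverse order to keep the offsets valid.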

## 10. Create Detailed CSV Report

Create a detailed CSV report with sentence-level analysis data. This machine-readable file contains every sentence with its classification (formulaic/unique), similarity score, and best matching sentence from other edicts. Useful for further statistical analysis, data processing, or integration with other tools.

In [None]:
# Export detailed sentence-level data
csv_file = f'formulaic_sentences_{EDICT_TYPE}.csv'

# Prepare export data
export_df = df_sentences[[
    'edict_title', 'sentence_idx', 'sentence_text', 
    'is_formulaic', 'max_similarity'
]].copy()

# Add best match information
best_match_titles = []
best_match_texts = []

for idx, row in df_sentences.iterrows():
    if row['best_match_idx'] is not None and pd.notna(row['best_match_idx']):
        match_row = df_sentences.iloc[int(row['best_match_idx'])]
        best_match_titles.append(match_row['edict_title'])
        best_match_texts.append(match_row['sentence_text'])
    else:
        best_match_titles.append('')
        best_match_texts.append('')

export_df['best_match_edict'] = best_match_titles
export_df['best_match_sentence'] = best_match_texts

export_df.to_csv(csv_file, index=False, encoding='utf-8-sig')

print(f"✅ Detailed CSV report saved: {csv_file}")
print(f"\nCSV contains sentence-level data:")
print(f"  - Edict title")
print(f"  - Sentence index")
print(f"  - Sentence text")
print(f"  - Formulaic flag")
print(f"  - Maximum similarity score")
print(f"  - Best matching edict and sentence")

## 11. Summary Statistics

Display comprehensive summary statistics for the entire analysis. This includes overall formulaic/unique ratios, similarity score distributions, and per-edict breakdowns. The interpretation section helps contextualize the results by explaining what different levels of formulaic content might indicate about the document collection.

In [None]:
print("\n" + "=" * 100)
print("FORMULAIC SENTENCE ANALYSIS - SUMMARY")
print("=" * 100)

print(f"\n📊 Dataset:")
print(f"   Edict type: {EDICT_TYPE}")
print(f"   Total edicts: {len(df_edicts)}")
print(f"   Total sentences: {len(df_sentences)}")

print(f"\n🎯 Detection Results:")
print(f"   Similarity threshold: {SIMILARITY_THRESHOLD}")
print(f"   Formulaic sentences: {num_formulaic} ({num_formulaic/len(df_sentences)*100:.1f}%)")
print(f"   Unique sentences: {num_unique} ({num_unique/len(df_sentences)*100:.1f}%)")

if num_formulaic > 0:
    formulaic_df = df_sentences[df_sentences['is_formulaic']]
    print(f"\n📈 Similarity Metrics (Formulaic Sentences):")
    print(f"   Mean similarity: {formulaic_df['max_similarity'].mean():.3f}")
    print(f"   Median similarity: {formulaic_df['max_similarity'].median():.3f}")
    print(f"   Min similarity: {formulaic_df['max_similarity'].min():.3f}")
    print(f"   Max similarity: {formulaic_df['max_similarity'].max():.3f}")

print(f"\n📝 Per-Edict Breakdown:")
for idx, row in df_edicts.iterrows():
    edict_sents = df_sentences[df_sentences['edict_idx'] == idx]
    num_form = edict_sents['is_formulaic'].sum()
    total = len(edict_sents)
    avg_sim = edict_sents[edict_sents['is_formulaic']]['max_similarity'].mean()
    
    print(f"\n   {row['text_title']}:")
    print(f"     Total sentences: {total}")
    print(f"     Formulaic: {num_form} ({num_form/total*100:.1f}%)")
    if num_form > 0:
        print(f"     Avg similarity: {avg_sim:.3f}")

print(f"\n📁 Output Files:")
print(f"   1. {OUTPUT_FILE} - Formatted Markdown with bold highlighting")
print(f"   2. {csv_file} - Detailed sentence-level CSV data")

print(f"\n💡 Interpretation:")
if num_formulaic > len(df_sentences) * 0.5:
    print(f"   High formulaic content ({num_formulaic/len(df_sentences)*100:.0f}%) suggests strong")
    print(f"   adherence to template conventions in {EDICT_TYPE} edicts.")
elif num_formulaic > len(df_sentences) * 0.3:
    print(f"   Moderate formulaic content ({num_formulaic/len(df_sentences)*100:.0f}%) indicates a mix")
    print(f"   of standard phrasing and edict-specific content.")
else:
    print(f"   Low formulaic content ({num_formulaic/len(df_sentences)*100:.0f}%) suggests high")
    print(f"   variability and context-specific language in these edicts.")

print("\n" + "=" * 100)
print("✅ Analysis complete!")
print("=" * 100)

## 12. Visualize Edict Lengths and Formulaic Content

Create a visual comparison of edict lengths showing the distribution of formulaic vs. unique content. Each edict is represented as a horizontal bar in which red segments mark formulaic sentences and the light gray background marks everything else (unique sentences and fragments shorter than MIN_SENTENCE_LENGTH). This provides an at-a-glance view of how standardized language is distributed across different documents.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.font_manager as fm

# Configure matplotlib to display Chinese characters properly
# Try to find available Chinese fonts on the system
def get_chinese_font():
    """Find an available Chinese font on the system."""
    chinese_fonts = [
        'WenQuanYi Micro Hei',  # Common on Linux
        'WenQuanYi Zen Hei',
        'Noto Sans CJK SC',
        'Noto Sans CJK TC',
        'Droid Sans Fallback',
        'SimHei',  # Windows
        'Arial Unicode MS',  # macOS
        'Microsoft YaHei',
        'STHeiti',
    ]
    
    available_fonts = [f.name for f in fm.fontManager.ttflist]
    
    for font in chinese_fonts:
        if font in available_fonts:
            print(f"Using Chinese font: {font}")
            return font
    
    # If no Chinese font found, try the first CJK font available
    for font_name in available_fonts:
        if any(keyword in font_name.lower() for keyword in ['cjk', 'chinese', 'han', 'hei', 'song', 'kai']):
            print(f"Using Chinese font: {font_name}")
            return font_name
    
    print("⚠️  Warning: No Chinese font detected. Chinese characters may not display correctly.")
    print("   On Linux, install fonts with: sudo apt-get install fonts-wqy-microhei fonts-wqy-zenhei")
    return 'DejaVu Sans'

chinese_font = get_chinese_font()
plt.rcParams['font.sans-serif'] = [chinese_font, 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display

print("Creating edict length visualization with formulaic content...\n")

# Prepare data for visualization
edict_viz_data = []

for edict_idx, edict_row in df_edicts.iterrows():
    # Get text without punctuation from text_contents column
    text_no_punc = edict_row['text_contents'] if 'text_contents' in edict_row else edict_row['text_contents_punctuated']
    # Remove common Chinese punctuation marks
    for punct in ['。', '；', '！', '？', '，', '、', '：', '「', '」', '『', '』', '（', '）', '《', '》']:
        text_no_punc = text_no_punc.replace(punct, '')
    
    total_length = len(text_no_punc)
    
    # Get sentences for this edict
    edict_sents = df_sentences[df_sentences['edict_idx'] == edict_idx].sort_values('start_pos')
    
    # Build segments for visualization
    # Each segment is (start_char, end_char, is_formulaic)
    segments = []
    
    for _, sent_row in edict_sents.iterrows():
        # Calculate positions without punctuation
        full_text_with_punc = edict_row['text_contents_punctuated']
        start_pos_punc = sent_row['start_pos']
        end_pos_punc = sent_row['end_pos']
        
        # Count characters without punctuation up to this point
        text_before = full_text_with_punc[:start_pos_punc]
        start_pos_no_punc = len(text_before)
        for punct in ['。', '；', '！', '？', '，', '、', '：', '「', '」', '『', '』', '（', '）', '《', '》']:
            start_pos_no_punc -= text_before.count(punct)
        
        sent_text = full_text_with_punc[start_pos_punc:end_pos_punc]
        sent_length_no_punc = len(sent_text)
        for punct in ['。', '；', '！', '？', '，', '、', '：', '「', '」', '『', '』', '（', '）', '《', '》']:
            sent_length_no_punc -= sent_text.count(punct)
        
        end_pos_no_punc = start_pos_no_punc + sent_length_no_punc
        
        segments.append({
            'start': start_pos_no_punc,
            'end': end_pos_no_punc,
            'is_formulaic': sent_row['is_formulaic']
        })
    
    edict_viz_data.append({
        'idx': edict_idx,
        'title': edict_row['text_title'],
        'length': total_length,
        'segments': segments
    })

# Sort by length for better visualization
edict_viz_data.sort(key=lambda x: x['length'], reverse=True)

# Create the visualization
fig, ax = plt.subplots(figsize=(14, max(6, len(edict_viz_data) * 0.6)))

y_positions = range(len(edict_viz_data))
bar_height = 0.7

# Draw bars for each edict
for i, edict_data in enumerate(edict_viz_data):
    y_pos = len(edict_viz_data) - 1 - i  # Reverse order for top-to-bottom
    
    # Draw background (full length in light gray)
    ax.barh(y_pos, edict_data['length'], height=bar_height, 
            color='#f0f0f0', edgecolor='gray', linewidth=0.5)
    
    # Draw formulaic segments in color
    for segment in edict_data['segments']:
        if segment['is_formulaic']:
            seg_start = segment['start']
            seg_width = segment['end'] - segment['start']
            ax.barh(y_pos, seg_width, left=seg_start, height=bar_height,
                   color='#ff6b6b', edgecolor='none')

# Customize the plot
ax.set_yticks(range(len(edict_viz_data)))
ax.set_yticklabels([ed['title'] for ed in reversed(edict_viz_data)], fontsize=10, fontproperties=fm.FontProperties(family=chinese_font))
ax.set_xlabel('Length (characters, excluding punctuation)', fontsize=12)
ax.set_title(f'Edict Lengths and Formulaic Content Distribution - {EDICT_TYPE}', 
             fontsize=14, fontweight='bold', pad=20, fontproperties=fm.FontProperties(family=chinese_font))

# Add grid for readability
ax.grid(axis='x', alpha=0.3, linestyle='--')
ax.set_axisbelow(True)

# Add legend
formulaic_patch = mpatches.Patch(color='#ff6b6b', label='Formulaic sentences')
unique_patch = mpatches.Patch(color='#f0f0f0', label='Unique sentences')
ax.legend(handles=[formulaic_patch, unique_patch], loc='lower right', fontsize=10, prop=fm.FontProperties(family=chinese_font))

# Add text annotations showing exact lengths
for i, edict_data in enumerate(edict_viz_data):
    y_pos = len(edict_viz_data) - 1 - i
    # Calculate formulaic character count
    formulaic_chars = sum(seg['end'] - seg['start'] 
                          for seg in edict_data['segments'] 
                          if seg['is_formulaic'])
    unique_chars = edict_data['length'] - formulaic_chars
    formulaic_pct = (formulaic_chars / edict_data['length'] * 100) if edict_data['length'] > 0 else 0
    
    # Add text at the end of the bar
    ax.text(edict_data['length'] + 50, y_pos, 
            f"{edict_data['length']} chars ({formulaic_pct:.0f}% formulaic)",
            va='center', fontsize=9, color='#555')

plt.tight_layout()
plt.show()

# Print summary statistics
print("\n" + "="*100)
print("EDICT LENGTH SUMMARY")
print("="*100)
print(f"\n{'Edict Title':<50} {'Total':<10} {'Formulaic':<12} {'Unique':<10} {'% Formulaic':<12}")
print("-"*100)

for edict_data in edict_viz_data:
    formulaic_chars = sum(seg['end'] - seg['start'] 
                          for seg in edict_data['segments'] 
                          if seg['is_formulaic'])
    unique_chars = edict_data['length'] - formulaic_chars
    formulaic_pct = (formulaic_chars / edict_data['length'] * 100) if edict_data['length'] > 0 else 0
    
    title_short = edict_data['title'][:47] + '...' if len(edict_data['title']) > 50 else edict_data['title']
    print(f"{title_short:<50} {edict_data['length']:<10} {formulaic_chars:<12} {unique_chars:<10} {formulaic_pct:<12.1f}")

print("\n" + "="*100)
print(f"Average edict length: {sum(ed['length'] for ed in edict_viz_data) / len(edict_viz_data):.0f} characters")
print(f"Shortest edict: {min(ed['length'] for ed in edict_viz_data)} characters")
print(f"Longest edict: {max(ed['length'] for ed in edict_viz_data)} characters")
print("="*100)
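
The visualization cell repeats the same punctuation list three times to convert punctuated offsets into punctuation-free ones. One way to centralize that logic, sketched here, is a single `str.translate` table. `PUNCTUATION`, `STRIP_TABLE`, `length_without_punct`, and `span_without_punct` are hypothetical names, and the sample sentence is invented.

```python
# The same 15 marks that the visualization cell removes.
PUNCTUATION = "。；！？，、：「」『』（）《》"
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)

def length_without_punct(text):
    """Number of characters left after deleting the listed punctuation."""
    return len(text.translate(STRIP_TABLE))

def span_without_punct(full_text, start, end):
    """Map a [start, end) span in punctuated text to punctuation-free offsets."""
    new_start = length_without_punct(full_text[:start])
    new_end = new_start + length_without_punct(full_text[start:end])
    return new_start, new_end

text = "皇帝若曰：咨爾某氏，欽哉。"
total = length_without_punct(text)
span = span_without_punct(text, 5, 13)
print(total, span)
```

Keeping the punctuation set in one constant means a change to the list applies to the total length and to every segment boundary at once.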

## 13. Interactive Edict Explorer

Compare edicts interactively, with formulaic sentences highlighted.

Parse the generated Markdown file to extract formatted edict texts. This reads back the formatted texts that were previously exported, organizing them into a data structure that can be used by the interactive widget. Each edict is stored with its title, formatted text (with ** bold markers), and associated sentence data.

In [None]:
# Read the generated Markdown file to extract formatted texts
print(f"📖 Reading formatted texts from {OUTPUT_FILE}...")

with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
    markdown_content = f.read()

# Parse the markdown to extract individual edicts with formatting
# Split by edict headers (## Title format)
edict_sections = re.split(r'\n## ', markdown_content)

# Skip the first section (header/metadata)
edict_sections = edict_sections[1:] if len(edict_sections) > 1 else []

# Store edict data with formatted text
formatted_edicts = {}

for section in edict_sections:
    lines = section.strip().split('\n')
    if not lines:
        continue
    
    # First line is the title
    title = lines[0].strip()
    
    # Find the metadata line (starts with *)
    text_start = None
    for i, line in enumerate(lines):
        if line.strip().startswith('*'):
            text_start = i + 1
            break
    
    if text_start is None or text_start >= len(lines):
        continue
    
    # Collect text lines until we hit the separator (---)
    text_lines = []
    for i in range(text_start, len(lines)):
        if lines[i].strip() == '---':
            break
        text_lines.append(lines[i])
    
    formatted_text = '\n'.join(text_lines).strip()
    
    # Find the edict in our dataframe by matching title
    matching_edicts = df_edicts[df_edicts['text_title'] == title]
    if len(matching_edicts) > 0:
        edict_idx = matching_edicts.index[0]
        formatted_edicts[edict_idx] = {
            'title': title,
            'formatted_text': formatted_text,
            'sentences': df_sentences[df_sentences['edict_idx'] == edict_idx]
        }

print(f"✅ Loaded {len(formatted_edicts)} formatted edicts")
print("   Formulaic sentences will be shown in **bold**")
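
The parser above relies on a particular layout in the exported file: a `## ` title line, a metadata line beginning with `*`, the edict text, and a `---` separator. A sketch of the same steps on a made-up string, assuming that layout; the sample titles and metadata, and the helper name `parse_sections`, are hypothetical:

```python
import re

# Invented sample that mimics the assumed export layout.
sample_md = (
    "# Formulaic sentences\n\n"
    "## Edict A\n\n"
    "*Sentences: 2 | Formulaic: 1*\n\n"
    "**First sentence.** Second sentence.\n\n"
    "---\n\n"
    "## Edict B\n\n"
    "*Sentences: 1 | Formulaic: 0*\n\n"
    "Only sentence.\n\n"
    "---\n"
)

def parse_sections(markdown):
    """Map each '## ' title to the text between its metadata line and '---'."""
    parsed = {}
    # Drop the first chunk, which holds the file header rather than an edict.
    for section in re.split(r"\n## ", markdown)[1:]:
        lines = section.strip().split("\n")
        title = lines[0].strip()
        # The first line starting with '*' is treated as the metadata line.
        meta = next(
            (i for i, ln in enumerate(lines) if ln.strip().startswith("*")), None
        )
        if meta is None:
            continue
        body = []
        for ln in lines[meta + 1:]:
            if ln.strip() == "---":
                break
            body.append(ln)
        parsed[title] = "\n".join(body).strip()
    return parsed

print(parse_sections(sample_md))
```

One consequence of this approach: if an edict had no metadata line, a text line that opens with a bold sentence (`**...`) would also start with `*` and be mistaken for the metadata.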

Define a function to find and compare similar edicts. For a selected edict, it identifies which other edicts contain the best matches of its formulaic sentences. The similarity score for each other edict is the proportion of the selected edict's formulaic sentences whose best match lies in that edict, and the function also records exactly which sentences match. This data powers the interactive comparison feature.

In [None]:
def find_similar_edicts_detailed(edict_idx, top_k=3):
    """
    Find most similar edicts with detailed comparison.
    Returns similarity scores, shared formulaic sentences, and matching sentence indices.
    """
    edict_sentences = df_sentences[df_sentences['edict_idx'] == edict_idx]
    formulaic_in_edict = edict_sentences[edict_sentences['is_formulaic']]
    
    # Find which other edicts share formulaic sentences
    similarity_scores = {}
    shared_formulas = {}
    matching_sentence_indices = {}  # Track which sentences in other edicts match
    
    for other_idx in formatted_edicts.keys():
        if other_idx == edict_idx:
            continue
        
        # Count shared formulaic sentences between the two edicts
        shared_count = 0
        shared_texts = []
        matching_indices = set()
        
        for _, sent in formulaic_in_edict.iterrows():
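            # Check if this sentence's best match is in the other edict.
            # .iloc is positional, so this assumes df_sentences keeps a default
            # RangeIndex (the highlighting code later selects the same values by label).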
            if pd.notna(sent['best_match_idx']):
                match_idx = int(sent['best_match_idx'])
                match_edict_idx = df_sentences.iloc[match_idx]['edict_idx']
                if match_edict_idx == other_idx:
                    shared_count += 1
                    shared_texts.append(sent['sentence_text'])
                    matching_indices.add(match_idx)
        
        if shared_count > 0:
            # Calculate overall similarity as proportion of shared formulas
            similarity = shared_count / max(len(formulaic_in_edict), 1)
            similarity_scores[other_idx] = similarity
            shared_formulas[other_idx] = shared_texts
            matching_sentence_indices[other_idx] = matching_indices
    
    # Sort by similarity
    sorted_edicts = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    
    return sorted_edicts, shared_formulas, matching_sentence_indices

print("✅ Similarity comparison function defined")

Create an interactive widget for dynamic edict exploration. This widget allows you to:
- Select any edict from a dropdown menu
- View the full text with formulaic sentences highlighted in bold
- See statistics about formulaic vs. unique content
- Compare with the most similar edicts
- See only the matching sentences (not all formulaic ones) highlighted in the similar edicts

The widget uses HTML rendering for proper bold formatting and provides an intuitive interface for exploring the formulaic patterns across the document collection.

In [None]:
# Create interactive widget
edict_options = [(data['title'], idx) for idx, data in formatted_edicts.items()]

dropdown = widgets.Dropdown(
    options=edict_options,
    description='Select Edict:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='80%')
)

show_comparison = widgets.Checkbox(
    value=True,
    description='Show similar edicts',
    style={'description_width': 'initial'}
)

output = widgets.Output()

def format_text_with_bold(full_text, sentence_indices_to_bold):
    """
    Format text with only specified sentences in bold.
    
    Args:
        full_text: Original full text
        sentence_indices_to_bold: Set of df_sentences indices to highlight
        
    Returns:
        HTML formatted text
    """
    if not sentence_indices_to_bold:
        # No highlighting needed
        return full_text.replace('\n', '<br>')
    
    # Get all sentences for this text, sorted by position
    sentences_to_format = df_sentences[df_sentences.index.isin(sentence_indices_to_bold)].sort_values('start_pos')
    
    # Build HTML with bold tags
    result_html = ""
    last_pos = 0
    
    for _, sent_row in sentences_to_format.iterrows():
        start_pos = sent_row['start_pos']
        end_pos = sent_row['end_pos']
        
        # Add text before this sentence
        if start_pos > last_pos:
            result_html += full_text[last_pos:start_pos]
        
        # Add this sentence in bold
        result_html += '<strong>' + full_text[start_pos:end_pos] + '</strong>'
        last_pos = end_pos
    
    # Add remaining text
    if last_pos < len(full_text):
        result_html += full_text[last_pos:]
    
    # Convert newlines to HTML breaks
    result_html = result_html.replace('\n', '<br>')
    
    return result_html

def on_edict_select(change):
    with output:
        output.clear_output(wait=True)  # Wait to clear until new output is ready
        edict_idx = change['new']
        edict_data = formatted_edicts[edict_idx]
        
        # Display selected edict with formatting
        print("="*100)
        print(f"📜 Selected Edict: {edict_data['title']}")
        print("="*100)
        
        # Get statistics
        edict_sents = edict_data['sentences']
        num_formulaic = len(edict_sents[edict_sents['is_formulaic']])
        num_total = len(edict_sents)
        
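        # Guard against edicts with no segmented sentences (avoids ZeroDivisionError)
        pct_formulaic = num_formulaic / num_total * 100 if num_total else 0.0
        pct_unique = (num_total - num_formulaic) / num_total * 100 if num_total else 0.0
        
        print("\n📊 Statistics:")
        print(f"   Total sentences: {num_total}")
        print(f"   Formulaic sentences: {num_formulaic} ({pct_formulaic:.1f}%)")
        print(f"   Unique sentences: {num_total - num_formulaic} ({pct_unique:.1f}%)")
        
        print("\n📖 Text (formulaic sentences in **bold**):")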
        print("-"*100)
        
        # Get the original full text and highlight formulaic sentences
        full_text = df_edicts.loc[edict_idx, 'text_contents_punctuated']
        formulaic_indices = set(edict_sents[edict_sents['is_formulaic']].index)
        
        html_output = format_text_with_bold(full_text, formulaic_indices)
        display(HTML(f'<div style="white-space: pre-wrap; font-family: monospace; line-height: 1.8;">{html_output}</div>'))
        
        # Show comparison with similar edicts
        if show_comparison.value:
            print("\n" + "="*100)
            print("🔍 Most Similar Edicts (by shared formulaic content)")
            print("="*100)
            
            similar, shared, matching_indices = find_similar_edicts_detailed(edict_idx, top_k=3)
            
            if not similar:
                print("\n   No similar edicts found with shared formulaic content.")
            else:
                for i, (other_idx, score) in enumerate(similar, 1):
                    other_data = formatted_edicts[other_idx]
                    other_sents = other_data['sentences']
                    other_formulaic = len(other_sents[other_sents['is_formulaic']])
                    other_total = len(other_sents)
                    
                    print(f"\n{i}. {other_data['title']}")
                    print(f"   Similarity: {score:.2%}")
                    print(f"   Shared formulaic sentences: {len(shared[other_idx])}")
                    print(f"   Total sentences: {other_total} (formulaic: {other_formulaic}/{other_total})")
                    print(f"\n   Full Text (matching phrases in **bold**):")
                    print("-"*80)
                    
                    # Display the similar edict's full text with ONLY matching sentences in bold
                    other_full_text = df_edicts.loc[other_idx, 'text_contents_punctuated']
                    other_html_output = format_text_with_bold(other_full_text, matching_indices[other_idx])
                    display(HTML(f'<div style="white-space: pre-wrap; font-family: monospace; line-height: 1.8; margin-left: 20px;">{other_html_output}</div>'))
                    
                    print("\n   Shared formulaic phrases (examples):")
                    for shared_text in shared[other_idx][:5]:
                        preview = shared_text[:100] + ('...' if len(shared_text) > 100 else '')
                        print(f"      • {preview}")

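dropdown.observe(on_edict_select, names='value')
# Re-render when the comparison checkbox is toggled, not only on a new selection
show_comparison.observe(lambda change: on_edict_select({'new': dropdown.value}), names='value')

# The dropdown starts on its first option, which never fires a change event,
# so render that edict once up front
if dropdown.value is not None:
    on_edict_select({'new': dropdown.value})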

print("💡 Interactive Edict Explorer")
print("="*100)
print("\nSelect an edict from the dropdown to:")
print("  • View the full text with formulaic sentences highlighted in bold")
print("  • See statistics about formulaic vs. unique content")
print("  • Compare with other similar edicts (full text shown)")
print("  • Only matching phrases are highlighted in similar edicts")
print("\n⚠️  Note: First selection may take a moment to render the formatted text.\n")

display(widgets.VBox([
    widgets.HBox([dropdown, show_comparison]),
    output
]))
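
The highlighting in `format_text_with_bold` comes down to wrapping character spans in `<strong>` tags while copying the text between them unchanged. A standalone sketch of that idea, assuming non-overlapping `(start, end)` offsets into the text; the helper name `bold_spans` is hypothetical, and the call to the standard-library `html.escape` is an addition that the cell above does not make:

```python
import html

def bold_spans(text, spans):
    """Wrap each (start, end) span of text in <strong> tags and return HTML.

    Assumes the spans do not overlap. The text is HTML-escaped so that
    characters such as '<' or '&' in the source are shown literally.
    """
    parts = []
    last = 0
    for start, end in sorted(spans):
        parts.append(html.escape(text[last:start]))  # text before the span
        parts.append("<strong>" + html.escape(text[start:end]) + "</strong>")
        last = end
    parts.append(html.escape(text[last:]))  # remaining text after the last span
    return "".join(parts).replace("\n", "<br>")

print(bold_spans("AB<CD\nEF", [(3, 5)]))
# prints: AB&lt;<strong>CD</strong><br>EF
```

Escaping each piece before adding the tags keeps the inserted `<strong>` markup intact while preventing stray markup in the edict text from being interpreted by the browser.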