# Edict Similarity Analysis with SIKU-BERT

This notebook analyzes edicts of a specified type from the `extracted_edicts_punc.csv` file by creating embeddings with SIKU-BERT and comparing their content similarity.

## 0. Setup

Install the required packages.

In [None]:
!pip install pandas numpy torch transformers scikit-learn matplotlib seaborn tqdm scipy accelerate

## 1. Import Required Libraries

Import the necessary libraries, including pandas for data manipulation, transformers for SIKU-BERT, and matplotlib/seaborn for visualization.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import seaborn as sns
from tqdm import tqdm

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
plt.rcParams['figure.figsize'] = (12, 10)

# Configure Chinese font support for all platforms
print("Configuring Chinese font support...")
available_fonts = [f.name for f in fm.fontManager.ttflist]

# Comprehensive list of Chinese fonts for different platforms
# Windows: SimHei, Microsoft YaHei, SimSun
# Mac: PingFang SC, Heiti SC, STHeiti
# Linux: WenQuanYi, Noto Sans CJK, Droid Sans Fallback
# Colab/Cloud: Noto Sans CJK SC (pre-installed)
chinese_fonts = [
    'Noto Sans CJK SC', 'Noto Sans CJK TC', 'Noto Sans SC',  # Cloud/Linux priority
    'PingFang SC', 'Heiti SC', 'STHeiti',  # Mac
    'Microsoft YaHei', 'SimHei', 'SimSun',  # Windows
    'WenQuanYi Micro Hei', 'WenQuanYi Zen Hei',  # Linux
    'Droid Sans Fallback', 'AR PL UMing CN'
]

selected_font = None
for font in chinese_fonts:
    if font in available_fonts:
        selected_font = font
        print(f"✓ Found Chinese font: {selected_font}")
        break

if selected_font is None:
    print("⚠ No Chinese font found. Chinese characters may not display correctly.")
    print("  On Colab, run: !apt-get install -y fonts-noto-cjk")
    print("  On Linux, run: sudo apt-get install fonts-noto-cjk")
    print("  On Mac/Windows, Chinese fonts should be pre-installed.")
    selected_font = 'DejaVu Sans'  # Fallback
else:
    # Configure matplotlib to use the selected font globally
    plt.rcParams['font.sans-serif'] = [selected_font, 'DejaVu Sans']
    plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
    print(f"✓ Matplotlib configured to use: {selected_font}")

print("Libraries imported successfully!")

## 2. Load SIKU-BERT Model

Load the SIKU-BERT model and tokenizer from a local directory.

In [None]:
# Model configuration - using local model directory
model_id = "sikubert"

print("Loading SIKU-BERT tokenizer and model from local directory...")
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    local_files_only=True
)

print("Loading SIKU-BERT model...")
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    output_attentions=True,  # Enable attention output for token weighting
    local_files_only=True
)

model.eval()  # Set to evaluation mode
print("SIKU-BERT model loaded successfully!")
print(f"Model device: {model.device}")

## 3. Define Embedding Function

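Create a function that generates an embedding for a text with SIKU-BERT. Mean pooling over the token vectors is the default; max pooling and `[CLS]` pooling are also available through the `pooling` argument.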
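
As a rough illustration of what masked mean pooling computes, the sketch below reproduces the arithmetic on a toy array with NumPy instead of PyTorch. The values, the `hidden` array, and the zero-division guard are assumptions made for the example; they are not taken from the model or from `get_embedding`.

```python
import numpy as np

# Toy stand-in for last_hidden_state: 1 text, 4 token positions, hidden size 3.
hidden = np.array([[[1.0, 2.0, 3.0],
                    [3.0, 4.0, 5.0],
                    [9.0, 9.0, 9.0],    # padding position
                    [9.0, 9.0, 9.0]]])  # padding position
# 1 marks a real token, 0 marks padding.
attention_mask = np.array([[1, 1, 0, 0]])

mask = attention_mask[..., None].astype(hidden.dtype)  # shape (1, 4, 1)
summed = (hidden * mask).sum(axis=1)                   # sum over real tokens only
counts = np.clip(mask.sum(axis=1), 1e-9, None)         # guard against division by zero
mean_pooled = summed / counts

print(mean_pooled)  # [[2. 3. 4.]]
```

The padding rows do not influence the result, which is the point of multiplying by the attention mask before averaging.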

In [None]:
def get_embedding(text, pooling='mean'):
    """
    Generate embedding for text using SIKU-BERT.
    
    Args:
        text: Input text to embed
        pooling: Pooling strategy ('mean', 'max', 'cls')
    
    Returns:
        Numpy array of embedding
    """
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt", padding=True, 
                      truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
    
    # Apply pooling
    if pooling == 'mean':
        # Mean pooling
        attention_mask = inputs['attention_mask'].unsqueeze(-1)
        masked_embeddings = embeddings * attention_mask
        sum_embeddings = masked_embeddings.sum(dim=1)
        sum_mask = attention_mask.sum(dim=1)
        pooled = sum_embeddings / sum_mask
    elif pooling == 'max':
        # Max pooling
        pooled = torch.max(embeddings, dim=1)[0]
    else:  # 'cls'
        # Use [CLS] token
        pooled = embeddings[:, 0, :]
    
    return pooled.cpu().numpy()[0]

print("Embedding function defined.")

## 4. Load and Filter the Data

Load the CSV file `extracted_edicts_punc.csv` using pandas.

In [None]:
# Load the CSV file
df = pd.read_csv('extracted_edicts_punc.csv', encoding='utf-8-sig')

# Display basic information about the dataset
print(f"Total rows in dataset: {len(df)}")
print(f"\nColumn names: {df.columns.tolist()}")

# Display unique document types
if 'document_type' in df.columns:
    unique_types = df['document_type'].unique()
    print(f"\nUnique document types ({len(unique_types)}):")
    for doc_type in sorted(unique_types):
        count = len(df[df['document_type'] == doc_type])
        print(f"  - {doc_type}: {count} documents")
else:
    print("\nNo document_type column found")

## 4b. Configure Document Type

Select which document type to analyze. You can change this to any of the types listed above.

In [None]:
# Configure the document type to analyze
DOCUMENT_TYPE = '恩宥'  # Change this to analyze a different document type

# Filter for the selected document type
df_d_type = df[df['document_type'] == DOCUMENT_TYPE].copy()
print(f"\nFiltered for document type: '{DOCUMENT_TYPE}'")
print(f"Number of documents: {len(df_d_type)}")

# Display first few rows
df_d_type.head()

## 5. Preprocess the Text Data

Clean and preprocess the text_contents_punctuated column by removing null values and performing any necessary text cleaning.

In [None]:
# Check for null values
print(f"Null values in text_contents_punctuated: {df_d_type['text_contents_punctuated'].isnull().sum()}")

# Remove rows with null text content
df_d_type = df_d_type[df_d_type['text_contents_punctuated'].notna()].copy()

# Remove rows with empty strings
df_d_type = df_d_type[df_d_type['text_contents_punctuated'].str.strip() != ''].copy()

# Reset index
df_d_type.reset_index(drop=True, inplace=True)

print(f"\nNumber of edicts after cleaning: {len(df_d_type)}")

# Display text length statistics
df_d_type['text_length'] = df_d_type['text_contents_punctuated'].str.len()
print(f"\nText length statistics:")
print(df_d_type['text_length'].describe())

# Display sample texts
print("\nSample edict texts:")
for idx, text in enumerate(df_d_type['text_contents_punctuated'].head(3)):
    print(f"\nEdict {idx + 1}: {text[:200]}...")

## 6. Generate Text Embeddings with SIKU-BERT

Use SIKU-BERT to convert the text contents into numerical embeddings/vectors.
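
Because `get_embedding` truncates its input at 512 tokens, the tail of a longer edict does not contribute to its embedding. One possible workaround, sketched here under assumptions, is to split a long text into chunks, embed each chunk, and take a length-weighted average. The helpers `chunk_text`, `embed_long_text`, and `fake_embed` are hypothetical and not part of this notebook; the 510-character chunk size assumes roughly one token per character plus two special tokens, which is typical for BERT-style Chinese tokenizers but has not been verified for SIKU-BERT. The sketch also assumes non-empty input.

```python
import numpy as np

def chunk_text(text, max_chars=510):
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed_long_text(text, embed_fn, max_chars=510):
    """Length-weighted average of per-chunk embeddings (text must be non-empty)."""
    chunks = chunk_text(text, max_chars)
    vectors = np.stack([embed_fn(chunk) for chunk in chunks])
    weights = np.array([len(chunk) for chunk in chunks], dtype=float)
    return np.average(vectors, axis=0, weights=weights)

# Hypothetical stand-in for get_embedding so the sketch runs without the model.
def fake_embed(chunk):
    return np.array([float(len(chunk)), 1.0])

long_text = "詔" * 1000  # splits into chunks of 510 and 490 characters
long_embedding = embed_long_text(long_text, fake_embed)
print(long_embedding)
```

In the notebook, `embed_fn` would be `get_embedding`. Whether averaging chunk embeddings suits these edicts is an open question that would need checking against the data.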

In [None]:
# Generate embeddings for all edicts
print("Generating embeddings for all edicts using SIKU-BERT...")
print("This may take a few minutes depending on the number of edicts.\n")

embeddings_list = []

for idx, text in enumerate(tqdm(df_d_type['text_contents_punctuated'], desc="Encoding texts")):
    embedding = get_embedding(text, pooling='mean')
    embeddings_list.append(embedding)

# Convert to numpy array
edict_embeddings = np.array(embeddings_list)

print(f"\nEmbedding shape: {edict_embeddings.shape}")
print(f"Number of edicts: {edict_embeddings.shape[0]}")
print(f"Embedding dimension: {edict_embeddings.shape[1]}")

# Save embeddings for later use
np.save(f'{DOCUMENT_TYPE}_embeddings_sikubert.npy', edict_embeddings)
print(f"\nEmbeddings saved to '{DOCUMENT_TYPE}_embeddings_sikubert.npy'")

## 7. Calculate Similarity Between Edicts

Compute pairwise cosine similarity between all edict embeddings to measure content similarity.
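
For intuition, cosine similarity is the dot product of two vectors after each has been scaled to unit length. The sketch below computes it by hand with NumPy on three toy vectors; `cosine_similarity_matrix` is a name made up for this example and is meant as an illustration, not as a replacement for the scikit-learn function the notebook uses.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid dividing by zero
    return unit @ unit.T

toy_embeddings = np.array([[1.0, 0.0],
                           [2.0, 0.0],   # same direction as row 0
                           [0.0, 3.0]])  # orthogonal to rows 0 and 1
toy_sim = cosine_similarity_matrix(toy_embeddings)

# Each unordered pair appears once above the diagonal (k=1 skips self-similarity).
pair_scores = toy_sim[np.triu_indices_from(toy_sim, k=1)]
print(pair_scores)  # [1. 0. 0.]
```

Rows 0 and 1 differ only in length, so their similarity is 1. This is why cosine similarity compares the direction of embeddings rather than their magnitude.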

In [None]:
# Calculate cosine similarity matrix
print("Calculating pairwise cosine similarity...")
similarity_matrix = cosine_similarity(edict_embeddings)

print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"\nSimilarity statistics:")
upper_triangle = similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]
print(f"Min similarity: {upper_triangle.min():.4f}")
print(f"Max similarity (excluding diagonal): {upper_triangle.max():.4f}")
print(f"Mean similarity: {upper_triangle.mean():.4f}")
print(f"Median similarity: {np.median(upper_triangle):.4f}")
print(f"Std deviation: {upper_triangle.std():.4f}")

# Display a sample of the similarity matrix
print("\nSample of similarity matrix (first 5x5):")
print(pd.DataFrame(similarity_matrix[:5, :5]).round(3))

# Save similarity matrix
np.save(f'{DOCUMENT_TYPE}_similarity_matrix_sikubert.npy', similarity_matrix)
print(f"\nSimilarity matrix saved to '{DOCUMENT_TYPE}_similarity_matrix_sikubert.npy'")

## 8. Visualize Similarity Matrix

Create a heatmap visualization of the similarity matrix to identify patterns and clusters of similar edicts.

In [None]:
# Create heatmap of similarity matrix with edict titles
# Prepare labels - truncate long titles for readability
labels = []
for i in range(len(df_d_type)):
    title = df_d_type.iloc[i]['text_title'] if 'text_title' in df_d_type.columns else f"Edict {i}"
    # Truncate title if too long
    if len(title) > 20:
        title = title[:20] + '...'
    # Add index for clarity
    labels.append(f"{i}: {title}")

# Determine appropriate font size based on number of edicts
n_edicts = len(df_d_type)
if n_edicts <= 20:
    label_font_size = 8
    fig_size = (16, 14)
elif n_edicts <= 50:
    label_font_size = 6
    fig_size = (20, 18)
else:
    label_font_size = 4
    fig_size = (24, 22)

# Ensure Chinese font is being used for this plot
try:
    current_font = plt.rcParams['font.sans-serif'][0]
    print(f"Using font for heatmap: {current_font}")
except (KeyError, IndexError):
    # Only a missing or empty font list is expected here
    print("⚠ Font not configured, Chinese characters may not display correctly")

plt.figure(figsize=fig_size)
sns.heatmap(
    similarity_matrix,
    cmap='YlOrRd',
    vmin=0,
    vmax=1,
    square=True,
    cbar_kws={'label': 'Cosine Similarity'},
    xticklabels=labels,
    yticklabels=labels
)
plt.title(f'Cosine Similarity Matrix of {DOCUMENT_TYPE} Edicts (SIKU-BERT)', fontsize=16, pad=20)
plt.xlabel('Edict Index and Title', fontsize=12)
plt.ylabel('Edict Index and Title', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontsize=label_font_size)
plt.yticks(rotation=0, fontsize=label_font_size)

plt.tight_layout()
plt.savefig(f'similarity_heatmap_{DOCUMENT_TYPE}_sikubert.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"Heatmap saved as 'similarity_heatmap_{DOCUMENT_TYPE}_sikubert.png'")

In [None]:
# Create a distribution plot of similarity scores
plt.figure(figsize=(10, 6))
similarity_scores = similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]
plt.hist(similarity_scores, bins=50, edgecolor='black', alpha=0.7, color='steelblue')
plt.axvline(similarity_scores.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {similarity_scores.mean():.3f}')
plt.axvline(np.median(similarity_scores), color='green', linestyle='--', linewidth=2, label=f'Median: {np.median(similarity_scores):.3f}')
plt.xlabel('Cosine Similarity Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title(f'Distribution of Pairwise Similarity Scores (SIKU-BERT) - {DOCUMENT_TYPE}', fontsize=14)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig(f'similarity_distribution_{DOCUMENT_TYPE}_sikubert.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"Distribution plot saved as 'similarity_distribution_{DOCUMENT_TYPE}_sikubert.png'")

## 9. Find Most Similar Edicts

For each edict, identify and display the top N most similar edicts based on the similarity scores.
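
A minimal sketch of the ranking step on a toy matrix follows. `top_n_neighbours` is a hypothetical helper written for this example. It masks the query row's own entry explicitly rather than assuming the query always sorts first, which matters if two edicts have identical embeddings.

```python
import numpy as np

def top_n_neighbours(sim, idx, top_n=2):
    """Return (index, score) pairs for the top_n rows most similar to row idx."""
    scores = sim[idx].astype(float)   # astype returns a copy, so sim is untouched
    scores[idx] = -np.inf             # never return the query itself
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(j), float(scores[j])) for j in order]

toy_sim = np.array([[1.0, 0.9, 0.2, 0.5],
                    [0.9, 1.0, 0.3, 0.4],
                    [0.2, 0.3, 1.0, 0.8],
                    [0.5, 0.4, 0.8, 1.0]])
print(top_n_neighbours(toy_sim, 0))  # [(1, 0.9), (3, 0.5)]
```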

In [None]:
# Function to find top N most similar edicts
def find_most_similar(edict_idx, similarity_matrix, df, top_n=5):
    """
    Find the top N most similar edicts to a given edict.
    
    Parameters:
    - edict_idx: Index of the edict to compare
    - similarity_matrix: Pairwise similarity matrix
    - df: DataFrame containing edict information
    - top_n: Number of similar edicts to return
    
    Returns:
    - DataFrame with most similar edicts
    """
    # Get similarity scores for the given edict
    similarities = similarity_matrix[edict_idx]
    
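    # Get indices of top N similar edicts, excluding the edict itself explicitly
    # (an edict with an identical embedding could otherwise sort ahead of it)
    order = np.argsort(similarities)[::-1]
    similar_indices = order[order != edict_idx][:top_n]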
    
    # Create results DataFrame
    results = []
    for rank, idx in enumerate(similar_indices, 1):
        results.append({
            'Rank': rank,
            'Index': idx,
            'Similarity': similarities[idx],
            'Title': df.iloc[idx]['text_title'] if 'text_title' in df.columns else f"Edict {idx}",
            'Text_Preview': df.iloc[idx]['text_contents_punctuated'][:100] + '...'
        })
    
    return pd.DataFrame(results)

# Example: Find most similar edicts for the first 3 edicts
print("=" * 100)
for i in range(min(3, len(df_d_type))):
    print(f"\n{'='*100}")
    title = df_d_type.iloc[i]['text_title'] if 'text_title' in df_d_type.columns else f"Edict {i}"
    print(f"EDICT #{i} - {title}")
    print(f"Original Text:")
    print(f"{df_d_type.iloc[i]['text_contents_punctuated'][:200]}...")
    print(f"\n{'-'*100}")
    print(f"Top 5 Most Similar Edicts:")
    print(f"{'-'*100}")
    
    similar_edicts = find_most_similar(i, similarity_matrix, df_d_type, top_n=5)
    
    for _, row in similar_edicts.iterrows():
        print(f"\nRank {row['Rank']} - {row['Title']} (Similarity: {row['Similarity']:.4f}):")
        print(f"  {row['Text_Preview']}")
    
    print(f"\n{'='*100}")

## 10. Edict Similarity Summary Statistics

Calculate and display summary statistics for each edict's similarity to all other edicts.
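
The per-edict statistics can also be computed without a Python loop. The sketch below, on a toy 3x3 matrix, hides the diagonal behind NaN so that NumPy's NaN-aware reductions ignore self-similarity. The variable names are illustrative only.

```python
import numpy as np

toy_sim = np.array([[1.0, 0.8, 0.2],
                    [0.8, 1.0, 0.4],
                    [0.2, 0.4, 1.0]])

# Mask the diagonal with NaN so the nan-aware reductions skip self-similarity.
off_diag = toy_sim.copy()
np.fill_diagonal(off_diag, np.nan)

avg_similarity = np.nanmean(off_diag, axis=1)
max_similarity = np.nanmax(off_diag, axis=1)
min_similarity = np.nanmin(off_diag, axis=1)
print(avg_similarity)  # [0.5 0.6 0.3]
```

An edict with a high average similarity sits close to the bulk of the collection; one with a low average is a candidate outlier.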

In [None]:
# Create a summary DataFrame with similarity statistics for each edict
summary_data = []

for i in range(len(df_d_type)):
    similarities = similarity_matrix[i]
    # Exclude self-similarity (diagonal)
    other_similarities = np.concatenate([similarities[:i], similarities[i+1:]])
    
    title = df_d_type.iloc[i]['text_title'] if 'text_title' in df_d_type.columns else f"Edict {i}"
    
    summary_data.append({
        'Edict_Index': i,
        'Title': title,
        'Text_Length': df_d_type.iloc[i]['text_length'],
        'Text_Preview': df_d_type.iloc[i]['text_contents_punctuated'][:80] + '...',
        'Avg_Similarity': other_similarities.mean(),
        'Max_Similarity': other_similarities.max(),
        'Min_Similarity': other_similarities.min(),
        'Std_Similarity': other_similarities.std()
    })

summary_df = pd.DataFrame(summary_data)
# Set Edict_Index as the index to avoid duplicate numbering
summary_df.set_index('Edict_Index', inplace=True)

print("\nEdicts Ranked by Average Similarity to Other Edicts:")
print("="*100)
summary_df.sort_values('Avg_Similarity', ascending=False).head(10)

## 10b. Identify Highly Similar Pairs

Find pairs of edicts with similarity scores above a specified threshold.
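
For larger collections, the pair search can be vectorized. The following sketch uses the strict upper triangle of a toy matrix so that each unordered pair is tested once; the threshold of 0.95 and the matrix values are arbitrary choices for the example.

```python
import numpy as np

toy_sim = np.array([[1.00, 0.99, 0.50],
                    [0.99, 1.00, 0.97],
                    [0.50, 0.97, 1.00]])
threshold = 0.95

# Indices of the strict upper triangle: each unordered pair (i < j) once.
rows, cols = np.triu_indices_from(toy_sim, k=1)
keep = toy_sim[rows, cols] >= threshold
pairs = sorted(
    ((int(i), int(j), float(toy_sim[i, j])) for i, j in zip(rows[keep], cols[keep])),
    key=lambda pair: pair[2],
    reverse=True,
)
print(pairs)  # [(0, 1, 0.99), (1, 2, 0.97)]
```

Counting pairs at several thresholds then reduces to `int((toy_sim[rows, cols] >= t).sum())` for each candidate `t`.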

In [None]:
# Identify pairs of highly similar edicts
# Start with a high threshold and provide statistics to guide adjustment
threshold = 0.98  # Adjust this threshold as needed (try 0.9, 0.95, or 0.98)

print(f"Pairs of edicts with similarity >= {threshold}:")
print("="*100)

high_similarity_pairs = []
for i in range(len(similarity_matrix)):
    for j in range(i+1, len(similarity_matrix)):
        if similarity_matrix[i, j] >= threshold:
            title_1 = df_d_type.iloc[i]['text_title'] if 'text_title' in df_d_type.columns else f"Edict {i}"
            title_2 = df_d_type.iloc[j]['text_title'] if 'text_title' in df_d_type.columns else f"Edict {j}"
            
            high_similarity_pairs.append({
                'Edict_1_Index': i,
                'Edict_1_Title': title_1,
                'Edict_2_Index': j,
                'Edict_2_Title': title_2,
                'Similarity': similarity_matrix[i, j],
                'Text_1_Preview': df_d_type.iloc[i]['text_contents_punctuated'][:80] + '...',
                'Text_2_Preview': df_d_type.iloc[j]['text_contents_punctuated'][:80] + '...'
            })

if high_similarity_pairs:
    pairs_df = pd.DataFrame(high_similarity_pairs)
    pairs_df = pairs_df.sort_values('Similarity', ascending=False)
    # Reset index to show Pair_Number starting from 1
    pairs_df.reset_index(drop=True, inplace=True)
    pairs_df.index = pairs_df.index + 1
    pairs_df.index.name = 'Pair_Number'
    
    print(f"\nFound {len(pairs_df)} pairs with similarity >= {threshold}")
    
    # Show threshold guidance
    print("\n" + "="*100)
    print("Threshold Guidance:")
    print("="*100)
    for test_threshold in [0.99, 0.95, 0.90, 0.85, 0.80]:
        count = len([1 for i in range(len(similarity_matrix)) 
                     for j in range(i+1, len(similarity_matrix)) 
                     if similarity_matrix[i, j] >= test_threshold])
        print(f"  Threshold {test_threshold}: {count} pairs")
    print("\nRecommendation: Use a threshold that gives 10-100 pairs for meaningful analysis")
    print("="*100 + "\n")
    
    # Save to CSV for further analysis
    pairs_df.to_csv(f'similar_pairs_{DOCUMENT_TYPE}_threshold_{threshold}.csv', 
                    encoding='utf-8-sig')
    print(f"Pairs saved to 'similar_pairs_{DOCUMENT_TYPE}_threshold_{threshold}.csv'\n")
    
    # Display the results
    print(f"Displaying top 20 most similar pairs:")
    display(pairs_df.head(20))
    
    # The full DataFrame stays available as pairs_df for further inspection
else:
    print(f"\nNo pairs found with similarity >= {threshold}")
    print("Consider lowering the threshold to find similar pairs.")
    print(f"\nTry thresholds between {upper_triangle.mean():.3f} (mean) and {upper_triangle.max():.3f} (max)")

## 11. Cluster Analysis 

Perform hierarchical clustering to identify groups of similar edicts.
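
A dendrogram shows the merge order but does not assign edicts to groups. One way to obtain explicit group labels, sketched here on a toy similarity matrix, is to cut the tree at a chosen distance with SciPy's `fcluster`. The cut height of 0.5 is an arbitrary value picked for this toy data, not a recommendation for the edicts.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

toy_sim = np.array([[1.0, 0.9, 0.1, 0.2],
                    [0.9, 1.0, 0.2, 0.1],
                    [0.1, 0.2, 1.0, 0.8],
                    [0.2, 0.1, 0.8, 1.0]])

# Convert similarity to distance and put it in the condensed form linkage expects.
toy_dist = 1.0 - toy_sim
np.fill_diagonal(toy_dist, 0.0)
condensed = squareform(toy_dist, checks=False)
toy_linkage = linkage(condensed, method='average')

# Cut the tree at distance 0.5 (arbitrary for this toy data).
cluster_ids = fcluster(toy_linkage, t=0.5, criterion='distance')
print(cluster_ids)
```

Items 0 and 1 end up in one cluster and items 2 and 3 in another, because the two groups only merge at an average distance of 0.85, above the cut.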

In [None]:
# pandas, numpy, matplotlib.pyplot (plt), and font_manager (fm) were imported in section 1
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Font was already configured in section 1, just verify it's available
print("Using configured Chinese font for dendrogram...")
try:
    current_font = plt.rcParams['font.sans-serif'][0]
    print(f"✓ Current font: {current_font}")
    selected_font = current_font
except (KeyError, IndexError):
    # Fallback if the font list is missing or empty
    selected_font = 'DejaVu Sans'
    print("⚠ Using fallback font: DejaVu Sans")

# Convert similarity to distance
distance_matrix = 1 - similarity_matrix

# Ensure the matrix is symmetric and diagonal is zero
distance_matrix = (distance_matrix + distance_matrix.T) / 2
np.fill_diagonal(distance_matrix, 0)

# Convert the square distance matrix to condensed form (upper triangle)
# scipy's linkage expects a condensed distance matrix
condensed_distance = squareform(distance_matrix, checks=False)

# Perform hierarchical clustering
print("Performing hierarchical clustering...")
linkage_matrix = linkage(condensed_distance, method='average')

# Create labels from edict titles and social categories
# Truncate long titles and add index for clarity
labels = []
for i in range(len(df_d_type)):
    title = df_d_type.iloc[i]['text_title'] if 'text_title' in df_d_type.columns else f"Edict {i}"
    social_cat = df_d_type.iloc[i]['social_category'] if 'social_category' in df_d_type.columns else ''
    
    # Truncate title if too long
    if len(title) > 15:
        title = title[:15] + '...'
    
    # Combine index, social category, and title
    if social_cat:
        labels.append(f"{i}: [{social_cat}] {title}")
    else:
        labels.append(f"{i}: {title}")

# Plot dendrogram with Chinese character support
plt.figure(figsize=(24, 12))
dendrogram(
    linkage_matrix, 
    labels=labels, 
    leaf_font_size=8,
    leaf_rotation=90,  # Rotate labels vertically for better readability
    color_threshold=0.7 * max(linkage_matrix[:, 2])  # Color threshold for visual grouping
)
plt.title(f'{DOCUMENT_TYPE}诏令层次聚类分析 (SIKU-BERT)', fontsize=18, pad=20, 
         fontproperties=fm.FontProperties(family=selected_font))
plt.xlabel('诏令编号、类别与标题', fontsize=14, 
          fontproperties=fm.FontProperties(family=selected_font))
plt.ylabel('距离 (1 - 相似度)', fontsize=14, 
          fontproperties=fm.FontProperties(family=selected_font))
plt.tight_layout()
plt.savefig(f'dendrogram_{DOCUMENT_TYPE}_sikubert.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"Dendrogram saved as 'dendrogram_{DOCUMENT_TYPE}_sikubert.png'")

# Save the mapping table to a CSV file for easy reference
mapping_df = pd.DataFrame({
    'Index': range(len(df_d_type)),
    'Social_Category': [df_d_type.iloc[i]['social_category'] if 'social_category' in df_d_type.columns 
                        else '' for i in range(len(df_d_type))],
    'Title': [df_d_type.iloc[i]['text_title'] if 'text_title' in df_d_type.columns 
              else f"Edict {i}" for i in range(len(df_d_type))],
    'Text_Preview': [df_d_type.iloc[i]['text_contents_punctuated'][:100] + '...' 
                     for i in range(len(df_d_type))]
})

# Use utf-8-sig, matching the other CSV reads and writes in this notebook
mapping_df.to_csv(f'edict_index_mapping_{DOCUMENT_TYPE}.csv', index=False, encoding='utf-8-sig')
print(f"\nEdict index mapping saved to 'edict_index_mapping_{DOCUMENT_TYPE}.csv'")

# Also print the mapping table
print("\n" + "="*100)
print("Edict Index to Title Mapping:")
print("="*100)
for i in range(len(df_d_type)):
    title = df_d_type.iloc[i]['text_title'] if 'text_title' in df_d_type.columns else f"Edict {i}"
    social_cat = df_d_type.iloc[i]['social_category'] if 'social_category' in df_d_type.columns else ''
    if social_cat:
        print(f"{i}: [{social_cat}] {title}")
    else:
        print(f"{i}: {title}")

## Summary

This notebook has successfully:
1. Loaded and filtered edicts of the selected document type from the dataset
2. Generated semantic embeddings using SIKU-BERT (specialized for Classical Chinese)
3. Calculated pairwise cosine similarity between all edicts
4. Visualized the similarity patterns through heatmaps and distributions
5. Identified the most similar edicts for content comparison
6. Performed clustering analysis to identify groups of related documents

The use of SIKU-BERT provides embeddings specifically trained on Classical Chinese texts, which should better capture the semantic relationships in the edicts compared to general-purpose multilingual models.

All embeddings and similarity matrices have been saved for future analysis. To analyze a different document type, simply change the `DOCUMENT_TYPE` variable in section 4b and re-run the subsequent cells.