# English Moral Scoring: Aesop's Fables

This notebook scores each of Aesop's Fables (English children's literature) against the moral foundation vectors, using cosine similarity between sentence embeddings.

## Process:
1. Load English Master Moral Vectors (from validation notebook)
2. Load cleaned Aesop's Fables
3. Generate embeddings using IndicSBERT (the same model used for Tamil, for cross-language consistency)
4. Calculate cosine similarity to moral foundations
5. Visualize and compare with Tamil results

In [None]:
import os
import pickle
from tqdm import tqdm
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

# Create output directory (in parent directory)
os.makedirs('../phase3_outputs', exist_ok=True)

sns.set_style("whitegrid")

## Step 1: Load English Master Moral Vectors

In [None]:
# Load master vectors from validation notebook (in parent directory)
with open('../master_vectors_all_languages.pkl', 'rb') as f:
    all_vectors = pickle.load(f)

# Extract English vectors
if 'english' in all_vectors:
    master_vectors_english = all_vectors['english']
else:
    # The file exists at this point; a missing key is a KeyError, not a missing file
    raise KeyError("English master vectors not found in pickle file")

# Ensure numpy arrays
for k, v in master_vectors_english.items():
    master_vectors_english[k] = np.array(v, dtype=np.float32)

print(f"Loaded {len(master_vectors_english)} English master moral vectors:")
for foundation in master_vectors_english.keys():
    print(f"  - {foundation}")
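
Before scoring, it can help to confirm that the loaded vectors are usable. The sketch below is an assumption-laden illustration, not part of the original pipeline: `validate_master_vectors` is a hypothetical helper, and the toy dictionary stands in for `master_vectors_english`.

```python
import numpy as np

def validate_master_vectors(vectors, expected_dim=None):
    """Hypothetical helper: check every vector is 1-D, finite, non-zero,
    and that all vectors share one dimension. Returns that dimension."""
    dims = set()
    for label, vec in vectors.items():
        arr = np.asarray(vec, dtype=np.float32)
        if arr.ndim != 1:
            raise ValueError(f"{label}: expected a 1-D vector, got shape {arr.shape}")
        if not np.all(np.isfinite(arr)):
            raise ValueError(f"{label}: contains NaN or inf")
        if np.linalg.norm(arr) == 0:
            raise ValueError(f"{label}: zero vector")
        dims.add(arr.shape[0])
    if len(dims) != 1:
        raise ValueError(f"inconsistent or missing dimensions: {sorted(dims)}")
    dim = dims.pop()
    if expected_dim is not None and dim != expected_dim:
        raise ValueError(f"expected dimension {expected_dim}, got {dim}")
    return dim

# Toy stand-in for master_vectors_english (illustrative values only)
toy_vectors = {"care.virtue": [0.1, 0.2, 0.3], "care.vice": [0.3, 0.1, 0.0]}
toy_dim = validate_master_vectors(toy_vectors)
print(f"All vectors share dimension {toy_dim}")
```

In this notebook the call would presumably be `validate_master_vectors(master_vectors_english, expected_dim=768)`, since the embeddings generated later have 768 dimensions.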

## Step 2: Load Cleaned Aesop's Fables

In [3]:
# Load the preprocessed fables
df = pd.read_csv('processedDataEnglish/aesops_fables_cleaned.csv', encoding='utf-8')

print(f"Loaded {len(df)} fables")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst fable:")
print(df.iloc[0])

Loaded 284 fables

Columns: ['fable_number', 'title', 'text', 'moral']

First fable:
fable_number                                                    1
title                                      THE FOX AND THE GRAPES
text            A hungry Fox saw some fine bunches of Grapes h...
moral                                                         NaN
Name: 0, dtype: object
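
The first row already shows a missing `moral`. A missing `text` would be more serious: depending on the pandas version, `astype(str)` either turns it into the literal string "nan" or leaves it missing, and neither is a fable worth embedding. The sketch below counts such rows; `summarize_missing` is a hypothetical helper and the toy frame is illustrative, not taken from the dataset.

```python
import pandas as pd

def summarize_missing(frame, text_col="text", moral_col="moral"):
    """Hypothetical helper: count rows with empty text or a missing moral."""
    # Treat missing values and whitespace-only strings as empty text
    empty_text = frame[text_col].fillna("").astype(str).str.strip() == ""
    return {
        "n_rows": len(frame),
        "n_empty_text": int(empty_text.sum()),
        "n_missing_moral": int(frame[moral_col].isna().sum()),
    }

# Toy frame with the same column names as the fables CSV (made-up content)
toy_df = pd.DataFrame({
    "text": ["first story", None, "   "],
    "moral": [None, "some moral", None],
})
toy_report = summarize_missing(toy_df)
print(toy_report)
```

Running `summarize_missing(df)` on the real frame before Step 4 would show whether any rows should be dropped first.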


## Step 3: Load Embedding Model

Using the **same model** as Tamil (IndicSBERT) for cross-language consistency.

In [4]:
MODEL_NAME = 'l3cube-pune/indic-sentence-similarity-sbert'
print(f"Loading model: {MODEL_NAME}")
model = SentenceTransformer(MODEL_NAME)
print("✓ Model loaded")

Loading model: l3cube-pune/indic-sentence-similarity-sbert
✓ Model loaded


## Step 4: Generate Embeddings for Fables

In [5]:
# Extract texts
texts = df['text'].astype(str).tolist()

print(f"Generating embeddings for {len(texts)} fables...")
embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)
print(f"✓ Embeddings generated: {embeddings.shape}")

Generating embeddings for 284 fables...


Batches: 100%|██████████| 9/9 [00:06<00:00,  1.30it/s]

✓ Embeddings generated: (284, 768)





## Step 5: Calculate Cosine Similarity to Moral Foundations

In [6]:
def cosine_similarity_matrix(matA, vecsB):
    """
    matA: (n, d) numpy array of embeddings
    vecsB: dict of {label: (d,) }
    returns: DataFrame of shape (n, len(vecsB)) with cosine similarities
    """
    labels = list(vecsB.keys())
    B = np.vstack([vecsB[l] for l in labels])  # (m, d)
    
    # Normalize
    A_norm = matA / np.linalg.norm(matA, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    
    # Compute similarity
    sims = A_norm.dot(B_norm.T)  # (n, m)
    df = pd.DataFrame(sims, columns=labels)
    return df

print("Computing similarity scores...")
scores_df = cosine_similarity_matrix(embeddings, master_vectors_english)
print(f"✓ Similarity scores computed: {scores_df.shape}")
print(f"\nFirst few scores:")
print(scores_df.head())

Computing similarity scores...
✓ Similarity scores computed: (284, 10)

First few scores:
   care.virtue  care.vice  fairness.virtue  fairness.vice  loyalty.virtue  \
0     0.226500   0.253206         0.231159       0.298035        0.270405   
1     0.251561   0.200681         0.227459       0.261651        0.266288   
2     0.320856   0.387398         0.284212       0.340054        0.341983   
3     0.377295   0.430981         0.358485       0.403133        0.390082   
4     0.418126   0.336647         0.400727       0.354216        0.436591   

   loyalty.vice  authority.virtue  authority.vice  sanctity.virtue  \
0      0.263454          0.245616        0.266740         0.269665   
1      0.222719          0.245587        0.188370         0.261332   
2      0.340156          0.347480        0.299541         0.363154   
3      0.386400          0.455086        0.391337         0.415869   
4      0.331591          0.410644        0.321784         0.408357   

   sanctity.vice  
0      
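
Cosine similarity divides each vector by its norm, so an all-zero embedding would produce NaN scores. The following is a minimal sketch of the same computation with a small floor on the norms, run on toy 2-D vectors where the expected similarities are easy to check by hand. `safe_cosine_scores` and the `eps` parameter are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

def safe_cosine_scores(embeddings, label_vectors, eps=1e-12):
    """Cosine similarity of each row in `embeddings` to each labelled vector.
    Norms are floored at `eps`, so zero vectors score 0 instead of NaN."""
    labels = list(label_vectors)
    A = np.asarray(embeddings, dtype=np.float64)                      # (n, d)
    B = np.vstack([np.asarray(label_vectors[k], dtype=np.float64)
                   for k in labels])                                  # (m, d)
    A_norm = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)
    B_norm = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), eps)
    return pd.DataFrame(A_norm @ B_norm.T, columns=labels)

# Toy data: one vector on the x axis, one on the diagonal, one all-zero
toy_embeddings = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
toy_labels = {"x_axis": [2.0, 0.0], "y_axis": [0.0, 3.0]}
toy_scores = safe_cosine_scores(toy_embeddings, toy_labels)
print(toy_scores)
```

The diagonal vector scores cos(45°) ≈ 0.707 against both axes, and the zero row scores 0 against both rather than NaN.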

## Step 6: Identify Dominant Moral for Each Fable

In [7]:
# Combine with original dataframe
result_df = pd.concat([df.reset_index(drop=True), scores_df.reset_index(drop=True)], axis=1)

# Add dominant moral
moral_cols = scores_df.columns.tolist()
result_df['dominant_moral'] = scores_df.idxmax(axis=1)
result_df['dominant_score'] = scores_df.max(axis=1)

print(f"Results DataFrame: {result_df.shape}")
print(f"\nDominant moral distribution:")
print(result_df['dominant_moral'].value_counts())

Results DataFrame: (284, 16)

Dominant moral distribution:
dominant_moral
care.virtue         77
loyalty.virtue      49
sanctity.vice       41
authority.virtue    35
sanctity.virtue     35
care.vice           23
loyalty.vice        10
authority.vice       7
fairness.vice        5
fairness.virtue      2
Name: count, dtype: int64
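
`idxmax` reports a dominant moral even when the top two scores are nearly tied, and the per-fable scores shown in Step 5 often differ by only a few hundredths. One possible check, sketched here on toy data, is to report the margin between the best and second-best score. `top_two_margin` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_two_margin(scores):
    """For each row, return the top label, the runner-up, and their score gap."""
    values = scores.to_numpy()
    columns = scores.columns.to_numpy()
    order = np.argsort(-values, axis=1)        # column indices, best first
    rows = np.arange(len(scores))
    top, second = order[:, 0], order[:, 1]
    return pd.DataFrame({
        "top": columns[top],
        "runner_up": columns[second],
        "margin": values[rows, top] - values[rows, second],
    }, index=scores.index)

# Toy scores (illustrative values): row 0 is a near tie, row 1 is clear-cut
toy_scores = pd.DataFrame({"a": [0.30, 0.10], "b": [0.29, 0.40], "c": [0.10, 0.20]})
toy_margins = top_two_margin(toy_scores)
print(toy_margins)
```

Applied to `scores_df`, a small `margin` would flag fables whose dominant label is sensitive to minor changes in the embeddings.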


## Step 7: Summary Statistics

In [8]:
# Average scores per moral foundation
summary = result_df[moral_cols].mean().sort_values(ascending=False)

print("="*60)
print("AESOP'S FABLES - MORAL FOUNDATION SCORES")
print("="*60)
print(f"\nAverage similarity scores:")
for moral, score in summary.items():
    print(f"  {moral:20s}: {score:.4f}")

print(f"\nTop 3 moral foundations:")
for i, (moral, score) in enumerate(summary.head(3).items(), 1):
    count = (result_df['dominant_moral'] == moral).sum()
    print(f"  {i}. {moral:20s}: avg={score:.3f}, dominant in {count} fables")

AESOP'S FABLES - MORAL FOUNDATION SCORES

Average similarity scores:
  authority.virtue    : 0.3076
  sanctity.virtue     : 0.3066
  care.virtue         : 0.3061
  loyalty.virtue      : 0.2956
  sanctity.vice       : 0.2913
  care.vice           : 0.2787
  fairness.virtue     : 0.2775
  loyalty.vice        : 0.2764
  fairness.vice       : 0.2622
  authority.vice      : 0.2597

Top 3 moral foundations:
  1. authority.virtue    : avg=0.308, dominant in 35 fables
  2. sanctity.virtue     : avg=0.307, dominant in 35 fables
  3. care.virtue         : avg=0.306, dominant in 77 fables
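
The ranking by average score and the ranking by dominance count disagree: `care.virtue` is dominant in the most fables but only third by mean, and all ten means lie within 0.05 of each other. The sketch below places both rankings side by side, using the values printed above. The table layout is an illustrative choice, not part of the original notebook.

```python
import pandas as pd

# Values copied from the outputs of Steps 6 and 7
mean_scores = pd.Series({
    "authority.virtue": 0.3076, "sanctity.virtue": 0.3066, "care.virtue": 0.3061,
    "loyalty.virtue": 0.2956, "sanctity.vice": 0.2913, "care.vice": 0.2787,
    "fairness.virtue": 0.2775, "loyalty.vice": 0.2764, "fairness.vice": 0.2622,
    "authority.vice": 0.2597,
})
dominant_counts = pd.Series({
    "care.virtue": 77, "loyalty.virtue": 49, "sanctity.vice": 41,
    "authority.virtue": 35, "sanctity.virtue": 35, "care.vice": 23,
    "loyalty.vice": 10, "authority.vice": 7, "fairness.vice": 5,
    "fairness.virtue": 2,
})

ranking = pd.DataFrame({"mean_score": mean_scores, "dominant_count": dominant_counts})
# method="min" gives tied entries the same (best) rank
ranking["rank_by_mean"] = ranking["mean_score"].rank(ascending=False, method="min").astype(int)
ranking["rank_by_count"] = ranking["dominant_count"].rank(ascending=False, method="min").astype(int)
ranking = ranking.sort_values("rank_by_mean")
print(ranking)
```

In the notebook itself, the same table could be built from `summary` and `result_df['dominant_moral'].value_counts()` rather than from hard-coded values.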


## Step 8: Visualizations

### 8.1: Bar Chart of Average Scores

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

summary.plot(kind='bar', ax=ax, color='steelblue', alpha=0.8)
ax.set_xlabel('Moral Foundation', fontsize=12)
ax.set_ylabel('Average Similarity Score', fontsize=12)
ax.set_title("Aesop's Fables: Average Moral Foundation Scores", fontsize=14, fontweight='bold')
ax.tick_params(axis='x', rotation=45)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../phase3_outputs/aesops_moral_scores_bar.png', dpi=300)
plt.show()

print("✓ Saved bar chart to ../phase3_outputs/aesops_moral_scores_bar.png")

### 8.2: Dominant Moral Distribution

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

dominant_counts = result_df['dominant_moral'].value_counts()
dominant_counts.plot(kind='bar', ax=ax, color='coral', alpha=0.8)
ax.set_xlabel('Moral Foundation', fontsize=12)
ax.set_ylabel('Number of Fables', fontsize=12)
ax.set_title("Aesop's Fables: Dominant Moral Distribution", fontsize=14, fontweight='bold')
ax.tick_params(axis='x', rotation=45)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../phase3_outputs/aesops_dominant_moral_dist.png', dpi=300)
plt.show()

print("✓ Saved distribution chart to ../phase3_outputs/aesops_dominant_moral_dist.png")

### 8.3: UMAP Visualization (Optional - requires umap-learn)

In [None]:
try:
    import umap
    
    print("Generating UMAP visualization...")
    reducer = umap.UMAP(n_components=2, random_state=42)
    emb2d = reducer.fit_transform(embeddings)
    
    plt.figure(figsize=(12, 8))
    
    # Color by dominant moral
    labels = result_df['dominant_moral'].astype(str)
    unique_labels = labels.unique()
    label2int = {l:i for i,l in enumerate(sorted(unique_labels))}
    colors = [label2int[l] for l in labels]
    
    sc = plt.scatter(emb2d[:,0], emb2d[:,1], c=colors, s=20, cmap='tab20', alpha=0.7)
    plt.title("UMAP 2D: Aesop's Fables by Moral Foundation", fontsize=14, fontweight='bold')
    
    # Legend (top 10 to avoid crowding). Reuse the scatter's colormap and
    # normalization so the legend patches match the plotted point colors.
    import matplotlib.patches as mpatches
    handles = []
    for l in sorted(unique_labels)[:10]:
        handles.append(mpatches.Patch(color=sc.cmap(sc.norm(label2int[l])), label=l))
    plt.legend(handles=handles, bbox_to_anchor=(1.05,1), loc="upper left")
    
    plt.tight_layout()
    plt.savefig('../phase3_outputs/aesops_umap.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("✓ Saved UMAP to ../phase3_outputs/aesops_umap.png")
    
except ImportError:
    print("⚠ UMAP not installed. Skipping UMAP visualization.")
    print("  To enable: pip install umap-learn")
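
When `umap-learn` is unavailable, a linear projection can still give a rough 2-D view. The setup cell imports `PCA` from scikit-learn without using it. The sketch below does the equivalent projection with NumPy's SVD so that it has no extra dependency. `pca_2d` is a hypothetical helper and the random matrix is a stand-in for `embeddings`.

```python
import numpy as np

def pca_2d(matrix):
    """Project rows onto their first two principal components via SVD."""
    X = np.asarray(matrix, dtype=np.float64)
    X_centered = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Random stand-in for the (284, 768) embedding matrix
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(50, 8))
toy_2d = pca_2d(toy_embeddings)
print(toy_2d.shape)
```

The result can be passed to the same scatter-plot code in place of `emb2d`. PCA preserves only linear structure, so clusters may look less separated than in a UMAP plot.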

## Step 9: Compare with Tamil Results

In [None]:
# Load Tamil results for comparison (from parent directory)
try:
    thiru_summary = pd.read_csv('../phase3_outputs/thirukkural_moral_summary.csv', index_col=0)
    aathi_summary = pd.read_csv('../phase3_outputs/aathichudi_moral_summary.csv', index_col=0)
    
    # Combine for comparison
    comparison_df = pd.DataFrame({
        'English_Aesop': summary,
        'Tamil_Thirukkural': thiru_summary['0'],
        'Tamil_Aathichudi': aathi_summary['0']
    })
    
    print("\n" + "="*60)
    print("CROSS-LANGUAGE COMPARISON")
    print("="*60)
    print(comparison_df.round(3))
    
    # Visualize comparison
    fig, ax = plt.subplots(figsize=(14, 6))
    
    x = np.arange(len(comparison_df))
    width = 0.25
    
    bars1 = ax.bar(x - width, comparison_df['English_Aesop'], width, 
                   label='English (Aesop)', alpha=0.8, color='steelblue')
    bars2 = ax.bar(x, comparison_df['Tamil_Thirukkural'], width, 
                   label='Tamil (Thirukkural)', alpha=0.8, color='coral')
    bars3 = ax.bar(x + width, comparison_df['Tamil_Aathichudi'], width, 
                   label='Tamil (Aathichudi)', alpha=0.8, color='lightgreen')
    
    ax.set_xlabel('Moral Foundation', fontsize=12)
    ax.set_ylabel('Average Similarity Score', fontsize=12)
    ax.set_title('Cross-Language Moral Foundation Comparison', fontsize=14, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels(comparison_df.index, rotation=45, ha='right')
    ax.legend()
    ax.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('../phase3_outputs/cross_language_comparison.png', dpi=300)
    plt.show()
    
    print("\n✓ Saved comparison chart to ../phase3_outputs/cross_language_comparison.png")
    
except FileNotFoundError:
    print("⚠ Tamil results not found. Run Step3.ipynb first to generate Tamil scores.")
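
The column name `'0'` used above follows from how the summaries are saved: writing an unnamed `Series` with `to_csv` produces a header of `0`, which `read_csv` returns as the string `'0'`. The round trip below illustrates this with an in-memory buffer and made-up values. It assumes the Tamil summaries were written the same way as the English one in Step 10.

```python
import io
import pandas as pd

# Unnamed Series, like `summary` in this notebook (illustrative values)
toy_summary = pd.Series({"care.virtue": 0.31, "care.vice": 0.28})

buffer = io.StringIO()
toy_summary.to_csv(buffer)          # header row is ",0" for an unnamed Series
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)
print(loaded.columns.tolist())

# Naming the Series first gives a self-describing column instead
named_buffer = io.StringIO()
toy_summary.rename("mean_score").to_csv(named_buffer)
named_buffer.seek(0)
loaded_named = pd.read_csv(named_buffer, index_col=0)
print(loaded_named.columns.tolist())
```

If the summaries were renamed before saving, the comparison cell would need to read that column name instead of `'0'`.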

## Step 10: Save Results

In [None]:
# Save detailed scores (to parent directory)
result_df.to_csv('../phase3_outputs/aesops_moral_scores.csv', index=False, encoding='utf-8')
print("✓ Saved detailed scores to ../phase3_outputs/aesops_moral_scores.csv")

# Save summary
summary.to_csv('../phase3_outputs/aesops_moral_summary.csv')
print("✓ Saved summary to ../phase3_outputs/aesops_moral_summary.csv")

# Save embeddings
np.save('../phase3_outputs/aesops_embeddings.npy', embeddings)
print("✓ Saved embeddings to ../phase3_outputs/aesops_embeddings.npy")

print("\n" + "="*60)
print("ALL DONE! English moral scoring complete.")
print("="*60)

## Summary

### What We Did:
1. ✅ Loaded English master moral vectors
2. ✅ Loaded preprocessed Aesop's Fables
3. ✅ Generated embeddings using IndicSBERT
4. ✅ Calculated moral foundation scores
5. ✅ Identified dominant morals
6. ✅ Created visualizations
7. ✅ Compared with Tamil results

### Next Steps:
- Run cross-cultural statistical analysis (Spearman, JSD)
- Compare English vs Tamil moral distributions
- Analyze cultural differences in moral values

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr, pearsonr
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
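
The next steps mention Spearman correlation and Jensen-Shannon divergence (JSD). As a rough sketch of what those two statistics compute, the NumPy/pandas versions below avoid any extra dependency. The function names are hypothetical and the inputs are toy values. In practice `scipy.stats.spearmanr` and `scipy.spatial.distance.jensenshannon` (imported above) would be the usual choice. Note that `jensenshannon` returns the Jensen-Shannon distance, which is the square root of the divergence.

```python
import numpy as np
import pandas as pd

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the ranks (ties averaged)."""
    ranks_a = pd.Series(a).rank().to_numpy()
    ranks_b = pd.Series(b).rank().to_numpy()
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    if p.sum() <= 0 or q.sum() <= 0:
        raise ValueError("inputs must have positive total mass")
    p, q = p / p.sum(), q / q.sum()      # normalize to probability vectors
    m = 0.5 * (p + q)

    def kl(x, y):
        mask = x > 0                     # terms with x == 0 contribute 0
        return float(np.sum(x[mask] * np.log2(x[mask] / y[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy inputs: counts of dominant morals in two hypothetical corpora
toy_counts_a = [77, 49, 41]
toy_counts_b = [30, 60, 10]
print(f"Spearman rho: {spearman_rho(toy_counts_a, toy_counts_b):.3f}")
print(f"JS divergence: {js_divergence(toy_counts_a, toy_counts_b):.3f}")
```

JSD compares probability distributions, so similarity scores would first need to be turned into one, for example by using each corpus's dominant-moral proportions. Which normalization suits this study is left open here.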