# English Text Preprocessing: Aesop's Fables

This notebook parses the raw text of Aesop's Fables into a structured table, mirroring the format used for the Tamil texts.

## Input:
- `dataEnglish/AesopsFables.txt` - Raw text file

## Output:
- `processedDataEnglish/aesops_fables_cleaned.csv` - Structured CSV with columns:
  - fable_number
  - title
  - text (the story)
  - moral (if present)
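
To make the target schema concrete, here is a minimal sketch of one row serialized with the four columns listed above. It uses the standard-library `csv` module rather than pandas so that it runs on its own; `COLUMNS`, `rows_to_csv`, and the example row are illustrative assumptions, not part of the notebook.

```python
import csv
import io

# Column order assumed from the output description above
COLUMNS = ["fable_number", "title", "text", "moral"]

def rows_to_csv(rows):
    """Serialize fable dicts to CSV text; a missing moral becomes an empty field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({col: "" if row.get(col) is None else row[col]
                         for col in COLUMNS})
    return buffer.getvalue()

# Hypothetical row, for illustration only
example = rows_to_csv([
    {"fable_number": 1,
     "title": "THE FOX AND THE GRAPES",
     "text": "A hungry Fox saw some fine bunches of Grapes...",
     "moral": None},
])
print(example)
```

A fable without a moral ends up with an empty last field, which is also how pandas' `to_csv` writes missing values by default.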

In [1]:
import re
import pandas as pd
import os

# Create output directory
os.makedirs('processedDataEnglish', exist_ok=True)

## Step 1: Load the Raw Text

In [2]:
# Read the file
with open('dataEnglish/AesopsFables.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

print(f"File loaded: {len(raw_text)} characters")
print(f"First 500 characters:")
print(raw_text[:500])

File loaded: 198696 characters
First 500 characters:
AESOP'S FABLES




GRAPES
THE FOX AND THE GRAPES

THE FOX AND THE GRAPES
A hungry Fox saw some fine bunches of Grapes hanging from a vine that was trained along a high trellis, and did his best to reach them by jumping as high as he could into the air. But it was all in vain, for they were just out of reach: so he gave up trying, and walked away with an air of dignity and unconcern, remarking, "I thought those Grapes were ripe, but I see now they are quite sour."





THE GOOSE THAT LAID THE GOL


## Step 2: Parse into Individual Fables

### Strategy:
- Titles are in ALL CAPS (multiple words)
- Stories follow titles
- Morals are often short sentences at the end
- Multiple blank lines separate fables
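
The title rule in this strategy can be checked in isolation. The sketch below applies the same test the parser uses (ALL CAPS, at least two words) to a few sample lines; `looks_like_title` and `samples` are hypothetical names introduced for illustration.

```python
def looks_like_title(line):
    """Return True for an ALL-CAPS line of at least two words."""
    stripped = line.strip()
    return stripped.isupper() and len(stripped.split()) >= 2

samples = [
    "THE FOX AND THE GRAPES",                        # fable title
    "GRAPES",                                        # single-word caption
    "A hungry Fox saw some fine bunches of Grapes",  # story text
    "",                                              # blank separator
    "AESOP'S FABLES",                                # book header
]
flags = [looks_like_title(s) for s in samples]
print(flags)  # [True, False, False, False, True]
```

Two consequences are worth keeping in mind. The book header `AESOP'S FABLES` passes the test, which is why the parser skips past it before looking for titles. A single-word ALL-CAPS line such as `GRAPES` fails the test, so if such lines occur after the first title they are treated as ordinary story text.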

In [3]:
def parse_aesops_fables(text):
    """
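    Parse Aesop's Fables text into structured format.
    
    A title is an ALL-CAPS line with at least two words; every non-empty
    line after it, up to the next title, is joined into the story text.
    
    Returns: list of dicts with {title, text, moral}. The moral is always
    None at this stage; it is filled in by the extraction step.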
    """
    lines = text.split('\n')
    
    fables = []
    current_title = None
    current_text = []
    current_moral = None
    
    # Skip header
    start_idx = 0
    for i, line in enumerate(lines):
        if "AESOP'S FABLES" in line:
            start_idx = i + 1
            break
    
    # Walk the lines that follow the header
    for raw_line in lines[start_idx:]:
        line = raw_line.strip()
        
        # Skip empty lines
        if not line:
            continue
        
        # Check if this is a title (ALL CAPS and contains multiple words)
        if line.isupper() and len(line.split()) >= 2:
            # Save previous fable if it has text; a repeated title line
            # has collected no text yet, so nothing is saved for it
            if current_title and current_text:
                fables.append({
                    'title': current_title,
                    'text': ' '.join(current_text),
                    'moral': current_moral
                })
            
            # Start new fable
            current_title = line
            current_text = []
            current_moral = None
        
        # Regular text line (part of story); single-word ALL-CAPS lines land here too
        elif current_title:
            current_text.append(line)
    
    # Save last fable
    if current_title and current_text:
        fables.append({
            'title': current_title,
            'text': ' '.join(current_text),
            'moral': current_moral
        })
    
    return fables

# Parse the fables
fables = parse_aesops_fables(raw_text)
print(f"\nParsed {len(fables)} fables")
print(f"\nFirst fable:")
print(f"Title: {fables[0]['title']}")
print(f"Text: {fables[0]['text'][:200]}...")
print(f"Moral: {fables[0]['moral']}")


Parsed 284 fables

First fable:
Title: THE FOX AND THE GRAPES
Text: A hungry Fox saw some fine bunches of Grapes hanging from a vine that was trained along a high trellis, and did his best to reach them by jumping as high as he could into the air. But it was all in va...
Moral: None


## Step 3: Extract Morals

Many fables end with a moral. A moral typically:
- Is a short sentence (the code below uses a limit of 150 characters)
- Comes after the main story
- Is sometimes italicized or set apart

We'll use heuristics to extract them.
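
One thing to watch with keyword heuristics is substring matching: a check such as `'wise' in sentence` also fires on `likewise`. The sketch below shows a whole-word variant built with `re`. It is an alternative for comparison, not what this notebook runs; `MORAL_KEYWORDS` copies the notebook's keyword list, while `KEYWORD_PATTERN` and `has_moral_keyword` are hypothetical names.

```python
import re

# Same keyword list as the notebook's heuristic
MORAL_KEYWORDS = ['always', 'never', 'often', 'wise', 'better', 'beware',
                  'fool', 'should', 'must', 'learns', 'loses', 'gains']

# \b anchors restrict matches to whole words
KEYWORD_PATTERN = re.compile(
    r'\b(?:' + '|'.join(MORAL_KEYWORDS) + r')\b', re.IGNORECASE
)

def has_moral_keyword(sentence):
    """Return True if the sentence contains a keyword as a whole word."""
    return KEYWORD_PATTERN.search(sentence) is not None

print(has_moral_keyword("Much wants more and loses all."))  # True
print(has_moral_keyword("He went home likewise."))          # False
print('wise' in "He went home likewise.".lower())           # True (substring match)
```

The trade-off is that whole-word matching no longer catches inflected forms such as `foolish`, so switching would change how many morals are detected.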

In [4]:
def extract_moral(text):
    """
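    Try to extract the moral from the end of the fable.
    
    Heuristic: Last sentence if it's short and sounds like a moral.
    
    Returns: (moral, story_text); moral is None when nothing is detected.
    """
    # Naive sentence split: only '. ' counts as a boundary, so sentences
    # ending in '!', '?' or a closing quote are not separated
    sentences = text.split('. ')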
    if len(sentences) < 2:
        return None, text
    
    last_sentence = sentences[-1].strip()
    
    # Treat the last sentence as a moral if it:
    # - Is relatively short (< 150 chars)
    # - Contains at least one wisdom keyword (substring match, so 'fool'
    #   also matches 'foolish')
    
    moral_keywords = ['always', 'never', 'often', 'wise', 'better', 'beware', 
                     'fool', 'should', 'must', 'learns', 'loses', 'gains']
    
    if (len(last_sentence) < 150 and 
        any(keyword in last_sentence.lower() for keyword in moral_keywords)):
        # This is likely a moral
        story_text = '. '.join(sentences[:-1])
        return last_sentence, story_text
    
    # No moral detected
    return None, text

# Extract morals from all fables
for fable in fables:
    moral, story = extract_moral(fable['text'])
    fable['moral'] = moral
    fable['text'] = story

# Count fables with morals
fables_with_morals = sum(1 for f in fables if f['moral'])
print(f"Fables with extracted morals: {fables_with_morals}/{len(fables)}")

# Show examples
print("\nExamples with morals:")
for i, fable in enumerate(fables[:5]):
    if fable['moral']:
        print(f"\n{i+1}. {fable['title']}")
        print(f"   Moral: {fable['moral']}")

Fables with extracted morals: 25/284

Examples with morals:

2. THE GOOSE THAT LAID THE GOLDEN EGGS
   Moral: Much wants more and loses all.


## Step 4: Clean and Normalize Text

Similar to the Tamil preprocessing:
- Collapse extra whitespace
- Normalize spacing around punctuation
- Remove any remaining artifacts
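
As a quick illustration of these steps, the sketch below applies the same three regular expressions used in the cleaning cell to a deliberately messy string. `normalize_spacing` and `messy` are hypothetical names and sample data for this example only.

```python
import re

def normalize_spacing(text):
    """Collapse whitespace and fix spacing around punctuation."""
    text = re.sub(r'\s+', ' ', text)                        # runs of whitespace -> one space
    text = re.sub(r'\s+([.,!?;:])', r'\1', text)            # no space before punctuation
    text = re.sub(r'([.,!?;:])([A-Za-z])', r'\1 \2', text)  # space after, before a letter
    return text.strip()

messy = "  A hungry  Fox saw\nsome Grapes ,and walked away .But it was all in vain  "
cleaned = normalize_spacing(messy)
print(cleaned)
# A hungry Fox saw some Grapes, and walked away. But it was all in vain
```

One side effect to be aware of: the third pattern also splits abbreviations, so `e.g.` becomes `e. g.`. Digits are unaffected, so a time such as `3:15` stays intact.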

In [5]:
def clean_text(text):
    """
    Clean English text.
    """
    if not text:
        return text
    
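    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    
    # Fix punctuation spacing: no space before a mark, and one space
    # after it when a letter follows
    text = re.sub(r'\s+([.,!?;:])', r'\1', text)
    text = re.sub(r'([.,!?;:])([A-Za-z])', r'\1 \2', text)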
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    return text

# Clean all fables
for fable in fables:
    fable['title'] = clean_text(fable['title'])
    fable['text'] = clean_text(fable['text'])
    if fable['moral']:
        fable['moral'] = clean_text(fable['moral'])

print("✓ Text cleaned")

✓ Text cleaned


## Step 5: Create Structured DataFrame

In [6]:
# Create DataFrame
df = pd.DataFrame(fables)
df.insert(0, 'fable_number', range(1, len(df) + 1))

print(f"DataFrame created: {len(df)} fables")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
print(df.head())

DataFrame created: 284 fables

Columns: ['fable_number', 'title', 'text', 'moral']

First few rows:
   fable_number                                title  \
0             1               THE FOX AND THE GRAPES   
1             2  THE GOOSE THAT LAID THE GOLDEN EGGS   
2             3                 THE CAT AND THE MICE   
3             4                  THE MISCHIEVOUS DOG   
4             5   THE CHARCOAL-BURNER AND THE FULLER   

                                                text  \
0  A hungry Fox saw some fine bunches of Grapes h...   
1  A Man and his Wife had the good fortune to pos...   
2  There was once a house that was overrun with M...   
3  There was once a Dog who used to snap at peopl...   
4  There was once a Charcoal-burner who lived and...   

                            moral  
0                            None  
1  Much wants more and loses all.  
2                            None  
3                            None  
4                            None  


## Step 6: Statistics and Quality Check

In [7]:
print("="*60)
print("AESOP'S FABLES - PREPROCESSING STATISTICS")
print("="*60)

print(f"\nTotal fables: {len(df)}")
print(f"Fables with morals: {df['moral'].notna().sum()} ({df['moral'].notna().sum()/len(df)*100:.1f}%)")
print(f"Fables without morals: {df['moral'].isna().sum()}")

print(f"\nText length statistics:")
df['text_length'] = df['text'].str.len()
print(f"  Mean: {df['text_length'].mean():.0f} characters")
print(f"  Median: {df['text_length'].median():.0f} characters")
print(f"  Min: {df['text_length'].min():.0f} characters")
print(f"  Max: {df['text_length'].max():.0f} characters")

print(f"\nSample fables:")
for i in [0, len(df)//2, len(df)-1]:
    print(f"\n{df.iloc[i]['fable_number']}. {df.iloc[i]['title']}")
    print(f"   Length: {df.iloc[i]['text_length']} chars")
    moral = df.iloc[i]['moral']
    if pd.notna(moral):
        # Truncate long morals to 80 characters for display
        shown = f"{moral[:80]}..." if len(moral) > 80 else moral
        print(f"   Moral: {shown}")
    else:
        print("   Moral: (none extracted)")

AESOP'S FABLES - PREPROCESSING STATISTICS

Total fables: 284
Fables with morals: 25 (8.8%)
Fables without morals: 259

Text length statistics:
  Mean: 655 characters
  Median: 592 characters
  Min: 180 characters
  Max: 2520 characters

Sample fables:

1. THE FOX AND THE GRAPES
   Length: 394 chars
   Moral: (none extracted)

143. THE WOLF, THE FOX, AND THE APE
   Length: 407 chars
   Moral: (none extracted)

284. THE TRAVELLER AND FORTUNE
   Length: 417 chars
   Moral: (none extracted)


## Step 7: Save to CSV

In [8]:
# Drop the text_length column (just for statistics)
df_clean = df.drop(columns=['text_length'])

# Save to CSV
output_path = 'processedDataEnglish/aesops_fables_cleaned.csv'
df_clean.to_csv(output_path, index=False, encoding='utf-8')

print(f"✓ Saved to {output_path}")
print(f"\nFile info:")
print(f"  Rows: {len(df_clean)}")
print(f"  Columns: {df_clean.columns.tolist()}")
print(f"\nReady for moral scoring!")

✓ Saved to processedDataEnglish/aesops_fables_cleaned.csv

File info:
  Rows: 284
  Columns: ['fable_number', 'title', 'text', 'moral']

Ready for moral scoring!


## Summary

### What We Did:
1. ✅ Loaded raw Aesop's Fables text
2. ✅ Parsed into individual fables (title + text)
3. ✅ Extracted morals heuristically (detected in 25 of 284 fables)
4. ✅ Cleaned and normalized text
5. ✅ Created structured CSV

### Next Step:
Run `English_Moral_Scoring.ipynb` to:
- Load English MFD master vectors
- Generate embeddings for each fable
- Score moral foundations
- Compare with Tamil results