# Religious Text Classification Model Training

This notebook trains a model to classify Indonesian religious texts into:
- **Islam** (Muslim)
- **Catholic**
- **Protestant**

## ⚠️ Large Dataset Optimizations
This notebook is tuned for **large datasets (2M+ sentences)** with:
- ✅ **In-memory dataset loading with a size cap** (the full CSV is read with pandas, then undersampled to at most 2M samples; no streaming)
- ✅ **Regular checkpointing** (every 500 steps to the local `./results` directory; the final model is saved to Google Drive)
- ✅ **Memory optimizations** (gradient checkpointing, mixed precision)
- ✅ **A100/H100 optimizations** (bf16 for numerical stability)
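
To get a feel for how often a 500-step checkpoint interval fires on a dataset of this size, the arithmetic can be sketched as below. The sample count of 1,800,000 is an illustrative assumption (roughly 90% of a 2M-sample cap), and the helper names are hypothetical; the sketch assumes a single GPU and no gradient accumulation, and counts only the scheduled saves.

```python
import math


def training_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


def checkpoint_count(total_steps: int, save_steps: int) -> int:
    """Number of scheduled checkpoints when saving every `save_steps` steps."""
    return total_steps // save_steps


# Illustrative numbers only: ~1.8M training sentences, batch size 64, 3 epochs
total = training_steps(num_samples=1_800_000, batch_size=64, epochs=3)
print(f"Total steps: {total:,}")
print(f"Scheduled checkpoints: {checkpoint_count(total, save_steps=500)}")
```

With a limit on retained checkpoints, only the most recent ones stay on disk, so the count above describes write frequency rather than disk usage.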

## Setup
1. **Select Runtime**: Go to Runtime → Change runtime type → Select **A100 GPU** (recommended for 2M+ sentences)
2. Upload your `FINAL_SEGMENTED_CORPUS.csv` file to Google Drive
3. Mount Google Drive
4. Update the file path below if needed
5. Run all cells!
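
Before running everything, it can help to confirm that the uploaded CSV is where the notebook expects it and has the columns the later cells read (`Sentence_Unit` and `Label`). The following is a minimal sketch using only the standard library; the function name `missing_columns` is hypothetical, and the UTF-8 encoding is an assumption about the file.

```python
import csv
from pathlib import Path

# Columns read by the later cells of this notebook
REQUIRED_COLUMNS = {"Sentence_Unit", "Label"}


def missing_columns(csv_path: str) -> set:
    """Return required columns absent from the CSV header (reads the header row only)."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"CSV not found: {path}")
    # utf-8-sig tolerates an optional byte-order mark (encoding is an assumption)
    with path.open(newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    return REQUIRED_COLUMNS - {name.strip() for name in header}
```

Calling `missing_columns(FILE_PATH)` after mounting Drive should return an empty set; anything else names the columns that need fixing before the multi-hour run starts.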

## GPU Recommendations for Large Datasets
- 🥇 **A100 or H100**: **Strongly recommended** for 2M+ sentences (6-10 hours, handles large batches)
- 🥈 **L4**: May work but will be slow (12+ hours, risk of timeout)
- 🥉 **T4**: Not recommended for 2M+ sentences (too slow, will likely time out)
- ❌ **CPU**: Not feasible (days of training)

**For 2 million sentences, use an A100 GPU.** The notebook detects the GPU type and sets batch sizes and precision (bf16 or fp16) accordingly.

## Model Note
We use **IndoBERT** (`indolem/indobert-base-uncased`), a BERT model pretrained on Indonesian text that is well suited to fine-tuning for classification. It is far cheaper to train and run than large LLMs such as Sahabat-AI (70B), which need 140GB+ of VRAM and are designed for text generation rather than classification.
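
The 140GB figure follows from simple arithmetic: 16-bit weights take 2 bytes per parameter. The sketch below works this out for a 70B model and for a BERT-base-sized model; the ~110M parameter count is the usual BERT-base size and is used here as an approximation, not a measured value for IndoBERT. It covers weights only, not activations, gradients, or optimizer state.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 70B parameters stored in 16-bit precision
print(f"70B model:       {weight_memory_gb(70e9):.1f} GB")
# BERT-base-sized model (~110M parameters, approximate)
print(f"BERT-base model: {weight_memory_gb(110e6):.2f} GB")
```

Training needs several times the weight memory, but the gap of roughly three orders of magnitude between the two models remains.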


In [None]:
%pip install transformers datasets scikit-learn accelerate seaborn matplotlib -U -q

# Install additional packages for large-scale training
%pip install psutil -q  # For memory monitoring


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import Dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
    TrainingArguments, 
    Trainer
)
import torch
import os
from datetime import datetime
import json

# Disable WandB
os.environ["WANDB_DISABLED"] = "true"

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


## Configuration

**Update these paths:**
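
When changing the step settings in the next cell, note that the Trainer expects the save interval to be a round multiple of the evaluation interval whenever `load_best_model_at_end=True` is used with step-based strategies. A small sanity check along these lines may catch a bad combination before training starts; the function name is hypothetical and the checks are a sketch, not an exhaustive validation.

```python
def check_training_config(save_steps: int, eval_steps: int, test_size: float) -> list:
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    if save_steps <= 0 or eval_steps <= 0:
        problems.append("SAVE_STEPS and EVAL_STEPS must be positive")
    elif save_steps % eval_steps != 0:
        problems.append(
            "SAVE_STEPS should be a multiple of EVAL_STEPS "
            "when load_best_model_at_end=True"
        )
    if not 0.0 < test_size < 1.0:
        problems.append("TEST_SIZE must be a fraction strictly between 0 and 1")
    return problems


# The notebook's defaults pass the check
print(check_training_config(save_steps=500, eval_steps=500, test_size=0.1))
```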


In [None]:
# ==========================================
# CONFIGURATION
# ==========================================

# Path to your CSV file in Google Drive
FILE_PATH = "/content/drive/MyDrive/Indo_Religiolect/final_corpus/FINAL_SEGMENTED_CORPUS.csv"

# Where to save the trained model in Google Drive
SAVE_PATH = "/content/drive/MyDrive/Indo_Religiolect/model_final"

# Model selection - IndoBERT is optimized for Indonesian classification
MODEL_NAME = "indolem/indobert-base-uncased"  # Recommended for Indonesian text classification

# Training hyperparameters
NUM_EPOCHS = 3
MAX_LENGTH = 128  # Maximum sequence length
TEST_SIZE = 0.1  # 10% for testing

# Checkpoint settings (for efficient training)
SAVE_STEPS = 500  # Save checkpoint every N steps
EVAL_STEPS = 500  # Evaluate every N steps
LOGGING_STEPS = 100  # Log metrics every N steps

# ==========================================
# AUTO-CONFIGURE BATCH SIZE BASED ON GPU
# ==========================================
import torch

# Detect GPU type and set optimal batch sizes
USE_BF16 = False  # Will be set based on GPU
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9  # GB
    
    print(f"🖥️  Detected GPU: {gpu_name}")
    print(f"   Memory: {gpu_memory:.1f} GB")
    
    # Set batch sizes based on GPU memory
    if "H100" in gpu_name:
        # H100: 80GB - can handle large batches, use bf16
        BATCH_SIZE = 64
        EVAL_BATCH_SIZE = 128
        USE_BF16 = True
        print("   ✅ Using H100-optimized settings (64/128 batch, bf16)")
    elif "A100" in gpu_name:
        # A100: 40-80GB - can handle large batches, use bf16 for numerical stability
        BATCH_SIZE = 64
        EVAL_BATCH_SIZE = 128
        USE_BF16 = True  # A100 supports bf16, prevents numerical explosions
        print("   ✅ Using A100-optimized settings (64/128 batch, bf16)")
    elif "L4" in gpu_name:
        # L4: 24GB - comfortable batch size, fp16
        BATCH_SIZE = 48
        EVAL_BATCH_SIZE = 96
        USE_BF16 = False
        print("   ✅ Using L4-optimized settings (48/96 batch, fp16)")
    elif "T4" in gpu_name:
        # T4: 16GB - standard free tier, moderate batch size
        BATCH_SIZE = 32
        EVAL_BATCH_SIZE = 64
        USE_BF16 = False
        print("   ⚠️  Using T4 settings (32/64 batch, fp16)")
        print("   ⚠️  T4 may be too slow for 2M+ sentences. Consider A100.")
    else:
        # Unknown GPU - use conservative defaults
        BATCH_SIZE = 16
        EVAL_BATCH_SIZE = 32
        USE_BF16 = False
        print(f"   ⚠️  Unknown GPU, using conservative settings (16/32 batch, fp16)")
else:
    # CPU - very small batches
    BATCH_SIZE = 4
    EVAL_BATCH_SIZE = 8
    USE_BF16 = False
    print("   ❌ No GPU detected! CPU training not recommended for large datasets.")
    print("   💡 Please select A100 GPU runtime for 2M+ sentences.")

print(f"\n📂 Dataset: {FILE_PATH}")
print(f"🤖 Model: {MODEL_NAME}")
print(f"💾 Save path: {SAVE_PATH}")
print(f"⚙️  Epochs: {NUM_EPOCHS}, Batch size: {BATCH_SIZE}/{EVAL_BATCH_SIZE}")
# fp16 is only enabled on CUDA devices; CPU runs fall back to fp32
print(f"🔢 Precision: {'bf16 (A100/H100)' if USE_BF16 else 'fp16' if torch.cuda.is_available() else 'fp32'}")
print(f"\n💡 For 2M+ sentences:")
print(f"   🥇 Recommended: A100 or H100 (6-10 hours)")
print(f"   🥈 Risky: L4 (12+ hours, may time out)")
print(f"   ❌ Not recommended: T4 or CPU (too slow)")


In [None]:
print("\n📊 Loading dataset...")
print("   ⚠️  For large datasets (2M+), this may take a few minutes...")

# Load dataset efficiently
df = pd.read_csv(FILE_PATH)
print(f"✅ Loaded {len(df):,} rows")

# Map text labels to integers (Islam = Muslim)
label_map = {'Islam': 0, 'Catholic': 1, 'Protestant': 2}
df['label'] = df['Label'].map(label_map)
df = df.dropna(subset=['label'])
df['label'] = df['label'].astype(int)

print("\n📈 Original class distribution:")
class_counts = df['Label'].value_counts()
print(class_counts)

# --- UNDERSAMPLING STRATEGY ---
# Find the size of the smallest class
min_class_size = df['label'].value_counts().min()
print(f"\nSmallest class size: {min_class_size:,}")

# For very large datasets, we might want to limit the total size
# to avoid extremely long training times
MAX_TOTAL_SAMPLES = 2_000_000  # Cap at 2M samples total (3 classes = ~666k per class)
if min_class_size * 3 > MAX_TOTAL_SAMPLES:
    min_class_size = MAX_TOTAL_SAMPLES // 3
    print(f"   ⚠️  Capping at {min_class_size:,} per class (total: {min_class_size * 3:,} samples)")
    print(f"   💡 This prevents training from taking too long")

# Sample that amount from each group
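print(f"\n🔄 Sampling {min_class_size:,} samples per class...")
# GroupBy.sample keeps the 'label' column intact; groupby().apply() over the
# grouping column is deprecated in recent pandas releases.
# min_class_size never exceeds the smallest class, so n is valid for every group.
df_balanced = df.groupby('label').sample(
    n=min_class_size, random_state=42
).reset_index(drop=True)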

print("\n--- Balanced Data Counts ---")
print(df_balanced['Label'].value_counts())
print(f"   Total: {len(df_balanced):,} samples")
# ------------------------------

# Split Data (90% Train, 10% Test)
print(f"\n🔄 Splitting data ({int((1-TEST_SIZE)*100)}% train, {int(TEST_SIZE*100)}% test)...")
train_df, test_df = train_test_split(
    df_balanced, 
    test_size=TEST_SIZE, 
    stratify=df_balanced['label'], 
    random_state=42
)

print(f"\n📊 Data split:")
print(f"   Train: {len(train_df):,} samples")
print(f"   Test: {len(test_df):,} samples")

# Convert to Hugging Face Dataset
# For very large datasets, we could use streaming, but for classification
# we need the full dataset for proper stratification
# Reset index to avoid '__index_level_0__' column issue
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)

# Estimate training time
estimated_tokens = len(train_df) * 20  # ~20 tokens per sentence
if torch.cuda.is_available() and "A100" in torch.cuda.get_device_name(0):
    tokens_per_sec = 2000  # Conservative estimate for A100
    estimated_hours = (estimated_tokens / tokens_per_sec) * NUM_EPOCHS / 3600
    print(f"\n⏱️  Estimated training time: ~{estimated_hours:.1f} hours on A100")
    print(f"   (Based on {estimated_tokens:,} tokens, {NUM_EPOCHS} epochs)")


In [None]:
print(f"\n🔤 Loading tokenizer: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    return tokenizer(
        examples["Sentence_Unit"], 
        padding="max_length", 
        truncation=True, 
        max_length=MAX_LENGTH
    )

print(f"\n🔄 Tokenizing data (max_length={MAX_LENGTH})...")
print("   This may take a few minutes...")
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)
print("✅ Tokenization complete!")


In [None]:
# Metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='weighted'
    )
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc, 
        'f1': f1, 
        'precision': precision, 
        'recall': recall
    }

# Load model
print(f"\n🤖 Loading model: {MODEL_NAME}")
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, 
    num_labels=3
)

# Check GPU and move model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️  Using device: {device}")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"   GPU: {gpu_name}")
    print(f"   Memory: {gpu_memory:.1f} GB")
    
    # Show memory usage
    torch.cuda.empty_cache()  # Clear cache
    allocated = torch.cuda.memory_allocated(0) / 1e9
    print(f"   Allocated: {allocated:.2f} GB")

model.to(device)

# Enable gradient checkpointing for memory efficiency (critical for large datasets)
if hasattr(model, 'gradient_checkpointing_enable'):
    model.gradient_checkpointing_enable()
    print("   ✅ Gradient checkpointing enabled (saves memory)")

print("✅ Model loaded and moved to device")


In [None]:
# Create output directory
os.makedirs(SAVE_PATH, exist_ok=True)

# Training arguments with regular checkpointing for large datasets
# NOTE: checkpoints are written to the local ./results directory every SAVE_STEPS steps;
# only the final model is written to Google Drive (see the save cell below)
training_args = TrainingArguments(
    output_dir="./results",  # Local checkpoint directory (temporary)
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=2e-5,  # Standard learning rate for BERT fine-tuning
    logging_dir='./logs',
    logging_steps=LOGGING_STEPS,
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_strategy="steps",
    save_steps=SAVE_STEPS,  # Save checkpoint every 500 steps
    save_total_limit=2,  # Keep at most 2 checkpoints to save disk space; the best one is always retained
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # Use F1 score to select best model
    greater_is_better=True,
    # Mixed precision: bf16 for A100/H100, fp16 for others
    bf16=USE_BF16,  # bfloat16 for A100/H100 (prevents numerical explosions)
    fp16=not USE_BF16 and torch.cuda.is_available(),  # fp16 for other GPUs
    dataloader_num_workers=2 if torch.cuda.is_available() else 0,
    report_to="none",  # Disable external logging
    remove_unused_columns=True,  # Remove index columns and other unused columns
    # Memory optimizations for large datasets
    gradient_accumulation_steps=1,
    dataloader_pin_memory=torch.cuda.is_available(),
    gradient_checkpointing=True,  # Trade compute for memory (critical for large datasets)
    # Optimizations
    optim="adamw_torch",  # PyTorch's AdamW implementation
    max_grad_norm=1.0,  # Gradient clipping for stability
)

print(f"\n📋 Training configuration:")
print(f"   Epochs: {NUM_EPOCHS}")
print(f"   Batch size: {BATCH_SIZE} (train) / {EVAL_BATCH_SIZE} (eval)")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Mixed precision: {'bf16 (A100/H100)' if USE_BF16 else 'fp16' if torch.cuda.is_available() else 'none'}")
print(f"   Gradient checkpointing: ✅ (saves memory)")
print(f"   Save steps: {SAVE_STEPS} (checkpoints every {SAVE_STEPS} steps)")
print(f"   Eval steps: {EVAL_STEPS} (evaluate every {EVAL_STEPS} steps)")
print(f"   Save path: {SAVE_PATH}")
print(f"\n⚠️  IMPORTANT: Checkpoints save to './results' during training.")
print(f"   Final model will be saved to Google Drive: {SAVE_PATH}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)


In [None]:
print("\n🚀 Starting training...")
print(f"   Model: {MODEL_NAME}")
print(f"   Training samples: {len(train_df):,}")
print(f"   Test samples: {len(test_df):,}")
print(f"   GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")
print("\n⏱️  This will take a while...")
print(f"   💡 Checkpoints save every {SAVE_STEPS} steps to '{training_args.output_dir}'")
print(f"   💡 That directory is local to the Colab VM: you can resume from it only while the runtime is still alive")
print(f"   💡 Final model will be saved to Google Drive when complete")

# Train with progress tracking
trainer.train()

print("\n✅ Training complete!")
print(f"   💾 Now saving final model to Google Drive: {SAVE_PATH}")


In [None]:
print(f"\n💾 Saving final model to Google Drive: {SAVE_PATH}")

# With load_best_model_at_end=True in TrainingArguments, the Trainer has already
# reloaded the best checkpoint (by F1) into `model`, so this saves the best weights
model.save_pretrained(SAVE_PATH)
tokenizer.save_pretrained(SAVE_PATH)

# Save label mapping
with open(f"{SAVE_PATH}/label_map.json", "w") as f:
    json.dump(label_map, f, indent=2)

# Save training info
training_info = {
    "model_name": MODEL_NAME,
    "num_epochs": NUM_EPOCHS,
    "batch_size": BATCH_SIZE,
    "eval_batch_size": EVAL_BATCH_SIZE,
    "max_length": MAX_LENGTH,
    "train_samples": len(train_df),
    "test_samples": len(test_df),
    "precision": "bf16" if USE_BF16 else "fp16" if torch.cuda.is_available() else "fp32",
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU",
    "label_map": label_map
}
with open(f"{SAVE_PATH}/training_info.json", "w") as f:
    json.dump(training_info, f, indent=2)

print("✅ Model saved successfully!")
print(f"   📁 Model: {SAVE_PATH}")
print(f"   📁 Tokenizer: {SAVE_PATH}")
print(f"   📁 Label map: {SAVE_PATH}/label_map.json")
print(f"   📁 Training info: {SAVE_PATH}/training_info.json")
print(f"\n💡 Your model is now safely stored in Google Drive!")


In [None]:
print("\n🔍 Generating Evaluation & Confusion Matrix...")

# Predict on the test set
predictions = trainer.predict(tokenized_test)
preds = np.argmax(predictions.predictions, axis=-1)
labels = predictions.label_ids

# Calculate metrics
accuracy = accuracy_score(labels, preds)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, preds, average='weighted'
)

print(f"\n📈 Final Metrics:")
print(f"   Accuracy:  {accuracy:.4f}")
print(f"   F1 Score:  {f1:.4f}")
print(f"   Precision: {precision:.4f}")
print(f"   Recall:    {recall:.4f}")

# Per-class metrics
# Accuracy restricted to the samples of one true class is that class's recall
print(f"\n📊 Per-Class Recall:")
for label_name, i in label_map.items():
    class_mask = labels == i
    if class_mask.sum() > 0:
        class_recall = accuracy_score(labels[class_mask], preds[class_mask])
        print(f"   {label_name}: Recall = {class_recall:.4f}")

# Confusion Matrix
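print("\n📊 Generating confusion matrix...")
# Pass the label ids explicitly so the matrix is always 3x3 and matches the tick labels
cm = confusion_matrix(labels, preds, labels=list(label_map.values()))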
label_names = list(label_map.keys())

plt.figure(figsize=(10, 8))
sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=label_names,
    yticklabels=label_names
)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix: Religious Style Detection\n(Islam/Muslim, Catholic, Protestant)')
plt.tight_layout()

# Save before plt.show(): once the figure has been shown, savefig would write a blank image
cm_path = f"{SAVE_PATH}/confusion_matrix.png"
plt.savefig(cm_path, dpi=300, bbox_inches='tight')
plt.show()
print(f"✅ Confusion matrix saved: {cm_path}")

print("\n" + "="*60)
print("✅ TRAINING COMPLETE!")
print("="*60)
print(f"\n📁 Model saved to: {SAVE_PATH}")
print(f"📊 Confusion matrix: {cm_path}")
print(f"\n💡 To use this model:")
print(f"   from transformers import AutoTokenizer, AutoModelForSequenceClassification")
print(f"   tokenizer = AutoTokenizer.from_pretrained('{SAVE_PATH}')")
print(f"   model = AutoModelForSequenceClassification.from_pretrained('{SAVE_PATH}')")
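
The saved `label_map.json` maps label names to ids, while inference needs the reverse direction. The sketch below shows one way to turn a row of model logits into a label name and probability using only the standard library; the `decode` helper is hypothetical, and the logits are made-up numbers rather than real model output.

```python
import math

# Same mapping the notebook writes to label_map.json
label_map = {"Islam": 0, "Catholic": 1, "Protestant": 2}
id2label = {idx: name for name, idx in label_map.items()}


def decode(logits):
    """Apply softmax to one row of logits and return (label name, probability)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for illustration only
name, prob = decode([0.1, 2.0, -1.0])
print(f"{name} ({prob:.2%})")
```

In practice the logits would come from `model(**tokenizer(text, return_tensors="pt")).logits` after loading the model and tokenizer from `SAVE_PATH`.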
