# Vietnamese Text Summarization: mT5/ViT5 + LLM (LoRA) Pipeline

## Architecture

```
Input Document
      ↓
┌─────────────────┐
│  Stage 1: ViT5  │  Fast summarization
│                 │  Extract key info
└────────┬────────┘
         ↓
   Initial Summary
         ↓
┌─────────────────┐
│  Stage 2: LLM   │  Rewrite with:
│   + LoRA        │  - Better fluency
│                 │  - Natural Vietnamese
└────────┬────────┘
         ↓
   Final Summary
```

## Benefits

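- ✅ **Stage 1 (ViT5)**: Fast first-pass summary that extracts the key information
- ✅ **Stage 2 (LLM + LoRA)**: Rewrites the draft for better fluency and more natural Vietnamese
- ✅ **Memory efficient**: 4-bit quantization; the notebook targets GPUs with 8 GB+ of VRAM, though actual usage depends on batch size and sequence length
- ✅ **Fast training**: LoRA trains <1% of the parameters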

## 1. Setup & Installation

In [None]:
# Install required packages
!pip install -q transformers datasets peft bitsandbytes accelerate evaluate tqdm sentencepiece

In [None]:
import torch
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from tqdm import tqdm
import evaluate

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 2. Configuration

In [None]:
# Configuration
CONFIG = {
    # Stage 1: ViT5 Model
    'stage1_model': 'VietAI/vit5-base',  # or your trained checkpoint path
    'stage1_checkpoint': './vit5_vi_sum/checkpoint-best',  # Path to your trained Stage 1 model
    
    # Stage 2: LLM used for Vietnamese rewriting
    'stage2_model': 'Qwen/Qwen2.5-7B-Instruct',  # Multilingual instruct model with Vietnamese support
    
    # Data paths
    'train_data': 'data/train.csv',
    'val_data': 'data/validation.csv',
    'test_data': 'data/test.csv',
    
    # LoRA Training
    'output_dir': './lora_rewriter',
    'epochs': 3,
    'batch_size': 4,  # Adjust based on your GPU (RTX 3090: 8, RTX 4070: 4, RTX 3060: 2)
    'learning_rate': 2e-4,
    
    # Optional: Limit samples for quick testing
    'max_train_samples': None,  # Set to 1000 for quick test
    'max_val_samples': None,    # Set to 100 for quick test
    'max_test_samples': 100,
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

## 3. Load Data

In [None]:
# Load datasets
print("Loading datasets...")
train_df = pd.read_csv(CONFIG['train_data'])
val_df = pd.read_csv(CONFIG['val_data'])
test_df = pd.read_csv(CONFIG['test_data'])

# Limit samples if specified
if CONFIG['max_train_samples']:
    train_df = train_df.head(CONFIG['max_train_samples'])
if CONFIG['max_val_samples']:
    val_df = val_df.head(CONFIG['max_val_samples'])
if CONFIG['max_test_samples']:
    test_df = test_df.head(CONFIG['max_test_samples'])

print(f"\nDataset sizes:")
print(f"  Train: {len(train_df):,} samples")
print(f"  Val: {len(val_df):,} samples")
print(f"  Test: {len(test_df):,} samples")

# Show sample
print("\nSample data:")
print(f"Document: {train_df.iloc[0]['document'][:200]}...")
print(f"Summary: {train_df.iloc[0]['summary']}")

## 4. Stage 1: Generate mT5/ViT5 Summaries

Use your trained ViT5 model to generate initial summaries for the training, validation, and test documents.

In [None]:
def generate_mt5_summaries(documents, model_path, batch_size=8, device='cuda'):
    """
    Generate summaries using trained mT5/ViT5 model
    """
    print(f"\n🔄 Generating {len(documents)} summaries...")
    
    # Load model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
    model = model.to(device)
    model.eval()
    
    summaries = []
    
    with torch.no_grad():
        for i in tqdm(range(0, len(documents), batch_size), desc="Generating"):
            batch_docs = documents[i:i+batch_size]
            
            # Add the task prefix ("tóm tắt" = "summarize") and tokenize
            inputs = tokenizer(
                ["tóm tắt: " + doc for doc in batch_docs],
                max_length=512,
                truncation=True,
                padding=True,
                return_tensors="pt"
            ).to(device)
            
            # Generate
            outputs = model.generate(
                **inputs,
                max_length=128,
                num_beams=4,
                early_stopping=True,
                no_repeat_ngram_size=3
            )
            
            # Decode
            batch_summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            summaries.extend(batch_summaries)
    
    # Clean up
    del model
    del tokenizer
    torch.cuda.empty_cache()
    
    print(f"✅ Generated {len(summaries)} summaries")
    return summaries

In [None]:
# Generate mT5 summaries for all datasets
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Generating Stage 1 summaries...")
train_mt5_summaries = generate_mt5_summaries(
    train_df['document'].tolist(),
    CONFIG['stage1_checkpoint'],
    batch_size=8,
    device=device
)

val_mt5_summaries = generate_mt5_summaries(
    val_df['document'].tolist(),
    CONFIG['stage1_checkpoint'],
    batch_size=8,
    device=device
)

test_mt5_summaries = generate_mt5_summaries(
    test_df['document'].tolist(),
    CONFIG['stage1_checkpoint'],
    batch_size=8,
    device=device
)

# Show samples
print("\n" + "="*80)
print("Sample Stage 1 Outputs:")
print("="*80)
for i in range(min(3, len(train_df))):
    print(f"\nExample {i+1}:")
    print(f"Document: {train_df.iloc[i]['document'][:200]}...")
    print(f"Stage 1 (mT5): {train_mt5_summaries[i]}")
    print(f"Human: {train_df.iloc[i]['summary']}")

## 5. Create Training Dataset for LoRA

Create prompt-based training examples:
- **Input**: Original document + mT5 summary
- **Target**: Human-written summary (rewriting goal)

In [None]:
def create_prompt(original_doc, mt5_summary, target_summary=None):
    """
    Create prompt for training/inference
    """
    # Truncate document for context
    doc_preview = original_doc[:500] + "..." if len(original_doc) > 500 else original_doc
    
    # Vietnamese instruction, roughly: "You are an expert at rewriting Vietnamese text.
    # Task: improve the summary below. Keep the information and meaning, improve
    # naturalness and coherence, use idiomatic Vietnamese, be concise."
    # Section markers: ORIGINAL TEXT / SUMMARY TO REWRITE / IMPROVED SUMMARY.
    prompt = f"""Bạn là chuyên gia viết lại văn bản tiếng Việt. Nhiệm vụ: cải thiện bản tóm tắt sau.

Yêu cầu:
- Giữ nguyên thông tin và ý nghĩa
- Cải thiện sự tự nhiên và mạch lạc
- Sử dụng từ ngữ phù hợp tiếng Việt
- Ngắn gọn, súc tích

VĂN BẢN GỐC:
{doc_preview}

TÓM TẮT CẦN VIẾT LẠI:
{mt5_summary}

TÓM TẮT ĐÃ CẢI THIỆN:
"""
    if target_summary:
        prompt += target_summary
    
    return prompt

def create_training_dataset(documents, mt5_summaries, target_summaries):
    """
    Create training dataset with prompt format
    """
    examples = []
    
    for doc, mt5_sum, target_sum in zip(documents, mt5_summaries, target_summaries):
        prompt = create_prompt(doc, mt5_sum, target_sum)
        examples.append({"text": prompt})
    
    return Dataset.from_list(examples)
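
The same prompt serves as training text and as inference input: during training the human summary follows the final marker, and at inference the text stops at the marker so the model continues from there. A minimal sketch of that round trip, assuming a shortened prompt body; `build_text` and `extract_completion` are illustrative helpers, not functions defined in this notebook:

```python
from typing import Optional

MARKER = "TÓM TẮT ĐÃ CẢI THIỆN:"  # "IMPROVED SUMMARY:", the final line of the prompt


def build_text(draft_summary: str, target: Optional[str] = None) -> str:
    """Shortened prompt: the draft, then the marker, then (for training) the target."""
    text = f"TÓM TẮT CẦN VIẾT LẠI:\n{draft_summary}\n\n{MARKER}\n"
    if target:
        text += target
    return text


def extract_completion(generated_text: str) -> str:
    """Keep only what follows the last occurrence of the marker."""
    return generated_text.split(MARKER)[-1].strip()


draft = "Hà Nội GRDP 2024 tăng 7,5%."
target = "GRDP của Hà Nội năm 2024 ước tăng 7,5%."

training_text = build_text(draft, target)  # prompt + target, used for fine-tuning
inference_text = build_text(draft)         # prompt only, used at generation time

# At inference the decoded output is the prompt followed by the continuation;
# the continuation is simulated here with the target string.
simulated_output = inference_text + target
print(extract_completion(simulated_output))
```

Because the training text is the inference prompt plus the target, the model sees exactly the prefix it will be asked to continue later.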

In [None]:
# Create training datasets
print("Creating training datasets...")

train_dataset = create_training_dataset(
    train_df['document'].tolist(),
    train_mt5_summaries,
    train_df['summary'].tolist()
)

val_dataset = create_training_dataset(
    val_df['document'].tolist(),
    val_mt5_summaries,
    val_df['summary'].tolist()
)

print(f"\nDataset created:")
print(f"  Train examples: {len(train_dataset)}")
print(f"  Val examples: {len(val_dataset)}")

# Show sample prompt
print("\n" + "="*80)
print("Sample Training Prompt:")
print("="*80)
print(train_dataset[0]['text'])

## 6. Load the Stage 2 LLM with 4-bit Quantization

In [None]:
print(f"Loading LLM: {CONFIG['stage2_model']}")
print("Using 4-bit quantization to save memory...")

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    CONFIG['stage2_model'],
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(CONFIG['stage2_model'])
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("✅ Model loaded successfully")
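
To see why 4-bit loading matters, it helps to estimate the memory taken by the weights alone. The sketch below is back-of-the-envelope arithmetic, assuming roughly 7.6 billion parameters for a 7B-class model (an assumption, not a figure from this notebook). It ignores activations, the KV cache, quantization constants, and any layers left unquantized, so real usage will be higher.

```python
NUM_PARAMS = 7.6e9  # assumption: approximate parameter count of a 7B-class model


def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision baseline
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
print(f"Reduction:     {fp16_gib / nf4_gib:.1f}x")
```

The remaining VRAM has to cover activations, gradients for the LoRA weights, and optimizer state, which is why batch size and sequence length still decide whether training fits.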

## 7. Apply LoRA Configuration

In [None]:
print("Applying LoRA configuration...")

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,  # LoRA rank (higher = more capacity, slower)
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Which modules to apply LoRA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

print("\n📊 Trainable parameters:")
model.print_trainable_parameters()
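
The trainable-parameter count can be sanity-checked by hand. For a frozen weight of shape `out x in`, LoRA adds two matrices with `r * (in + out)` parameters in total. The sketch below applies that formula to the four targeted projections. The layer dimensions are assumptions about Qwen2.5-7B (hidden size 3584, key/value width 512, 28 layers) and should be checked against the model's `config.json`; the figure reported by `print_trainable_parameters()` is the authoritative one.

```python
# Assumed Qwen2.5-7B dimensions (not stated in this notebook; verify against
# the model's config.json before relying on these numbers).
HIDDEN = 3584         # hidden_size
KV_DIM = 512          # num_key_value_heads (4) * head_dim (128)
NUM_LAYERS = 28       # num_hidden_layers
TOTAL_PARAMS = 7.6e9  # approximate size of the base model

RANK = 16  # matches r=16 in lora_config

# (in_features, out_features) of each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen weight."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS
fraction = trainable / TOTAL_PARAMS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total:     {trainable:,}")
print(f"Share of base model:       {fraction:.2%}")
```

Under these assumptions the adapter is on the order of ten million parameters, well below 1% of the base model. Doubling `r` doubles that count.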

## 8. Tokenize Datasets

In [None]:
print("Tokenizing datasets...")

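def tokenize_function(examples):
    # Append EOS so the model learns where the rewritten summary ends
    texts = [text + tokenizer.eos_token for text in examples["text"]]
    return tokenizer(
        texts,
        truncation=True,
        max_length=1024,
        padding="max_length"
    )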

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)

print("✅ Tokenization complete")
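
The padded examples are later batched by `DataCollatorForLanguageModeling(tokenizer, mlm=False)`. The sketch below illustrates how causal-LM labels are derived, assuming the collator copies `input_ids` and replaces every pad position with `-100` so the loss skips it. The token ids are made up for illustration, and `make_causal_lm_labels` is a hypothetical helper, not part of the library.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking every position that holds the pad id."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]


# Hypothetical token ids, chosen only for illustration.
PAD_ID, EOS_ID = 0, 2

# Case 1: a dedicated pad token. The EOS label survives, so the model is trained to stop.
separate = make_causal_lm_labels([11, 12, 13, EOS_ID, PAD_ID, PAD_ID], PAD_ID)

# Case 2: pad falls back to EOS. Every EOS position is masked, including the real one.
shared = make_causal_lm_labels([11, 12, 13, EOS_ID, EOS_ID, EOS_ID], EOS_ID)

print(separate)
print(shared)
```

This matters for the `tokenizer.pad_token = tokenizer.eos_token` fallback in section 6: if that branch is taken, the end-of-sequence token contributes nothing to the loss, and generation may run until `max_new_tokens`.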

## 9. Train LoRA Adapter

In [None]:
print("Setting up training...")

training_args = TrainingArguments(
    output_dir=CONFIG['output_dir'],
    num_train_epochs=CONFIG['epochs'],
    per_device_train_batch_size=CONFIG['batch_size'],
    per_device_eval_batch_size=CONFIG['batch_size'],
    gradient_accumulation_steps=4,
    learning_rate=CONFIG['learning_rate'],
    warmup_steps=100,
    logging_steps=50,
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    load_best_model_at_end=True,
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

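print("\n🚀 Starting training...")
steps_per_epoch = len(tokenized_train) // (CONFIG['batch_size'] * training_args.gradient_accumulation_steps)
print(f"   Approx. total steps: {steps_per_epoch * CONFIG['epochs']} (single GPU)")
print(f"   Eval every: {training_args.eval_steps} steps")
print(f"   Expected time: roughly 2-3 hours, depending on GPU and dataset size")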
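
The step count interacts with `warmup_steps=100` and `eval_steps=200`, so it is worth computing before launching a run. The sketch below is an approximation, assuming one optimizer step per `gradient_accumulation_steps` batches on a single device; the exact rounding varies between `transformers` versions. The dataset sizes are hypothetical.

```python
import math


def estimate_optimizer_steps(num_examples, batch_size, grad_accum_steps, epochs, num_devices=1):
    """Approximate number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


# Hypothetical full run: 10,000 examples with this notebook's settings.
full_run = estimate_optimizer_steps(10_000, batch_size=4, grad_accum_steps=4, epochs=3)

# Quick test: max_train_samples=1000, as suggested in the configuration comments.
quick_run = estimate_optimizer_steps(1_000, batch_size=4, grad_accum_steps=4, epochs=3)

for name, steps in [("full", full_run), ("quick", quick_run)]:
    evaluations = steps // 200  # eval_steps=200
    print(f"{name}: {steps} steps, {evaluations} evaluations, warmup is {100 / steps:.0%} of training")
```

With only 1,000 training samples the run finishes before the first scheduled evaluation and spends more than half of its steps in warmup, so `eval_steps`, `save_steps`, and `warmup_steps` should be scaled down for quick tests.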

In [None]:
# Train!
trainer.train()

In [None]:
# Save final model
print("\n💾 Saving LoRA adapter...")
model.save_pretrained(CONFIG['output_dir'])
tokenizer.save_pretrained(CONFIG['output_dir'])

print(f"✅ Training complete!")
print(f"   Saved to: {CONFIG['output_dir']}")

## 10. Evaluation: Compare Stage 1 vs Stage 2

In [None]:
def generate_rewritten_summaries(documents, mt5_summaries, lora_checkpoint, batch_size=4):
    """
    Rewrite Stage 1 summaries with the trained LoRA adapter.
    Examples are processed one at a time; batch_size is currently unused.
    """
    print(f"\n🔄 Rewriting {len(mt5_summaries)} summaries with LoRA...")
    
    # Load model with LoRA
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    
    base_model = AutoModelForCausalLM.from_pretrained(
        CONFIG['stage2_model'],
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
    
    model = PeftModel.from_pretrained(base_model, lora_checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(lora_checkpoint)
    
    rewritten = []
    
    for i in tqdm(range(len(documents)), desc="Rewriting"):
        # Create prompt (without target)
        prompt = create_prompt(documents[i], mt5_summaries[i], target_summary=None)
        
        # Generate
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.3,
                top_p=0.9,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode
        full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract rewritten part
        marker = "TÓM TẮT ĐÃ CẢI THIỆN:"
        if marker in full_response:
            rewritten_summary = full_response.split(marker)[-1].strip()
        else:
            # Fallback: decode only the tokens generated after the prompt
            prompt_length = inputs["input_ids"].shape[1]
            rewritten_summary = tokenizer.decode(
                outputs[0][prompt_length:], skip_special_tokens=True
            ).strip()
        
        rewritten.append(rewritten_summary)
    
    del model
    del base_model
    torch.cuda.empty_cache()
    
    print(f"✅ Rewriting complete")
    return rewritten

In [None]:
# Generate rewritten summaries for test set
test_rewritten_summaries = generate_rewritten_summaries(
    test_df['document'].tolist(),
    test_mt5_summaries,
    CONFIG['output_dir']
)

In [None]:
# Compute ROUGE scores
print("\n" + "="*80)
print("📊 EVALUATION RESULTS")
print("="*80)

rouge = evaluate.load("rouge")

# The default ROUGE tokenizer keeps only ASCII letters and digits, which splits
# Vietnamese words at every diacritic, so tokenize on whitespace instead.
def vi_tokenize(text):
    return text.lower().split()

references = test_df['summary'].tolist()

# Stage 1 (mT5 only) vs Human
mt5_results = rouge.compute(
    predictions=test_mt5_summaries,
    references=references,
    tokenizer=vi_tokenize
)

print("\n📊 Stage 1 (mT5 only):")
print(f"   ROUGE-1: {mt5_results['rouge1']:.4f}")
print(f"   ROUGE-2: {mt5_results['rouge2']:.4f}")
print(f"   ROUGE-L: {mt5_results['rougeL']:.4f}")

# Stage 2 (mT5 + LoRA rewrite) vs Human
rewritten_results = rouge.compute(
    predictions=test_rewritten_summaries,
    references=references,
    tokenizer=vi_tokenize
)

print("\n📊 Stage 2 (mT5 + LoRA rewrite):")
print(f"   ROUGE-1: {rewritten_results['rouge1']:.4f} ({rewritten_results['rouge1'] - mt5_results['rouge1']:+.4f})")
print(f"   ROUGE-2: {rewritten_results['rouge2']:.4f} ({rewritten_results['rouge2'] - mt5_results['rouge2']:+.4f})")
print(f"   ROUGE-L: {rewritten_results['rougeL']:.4f} ({rewritten_results['rougeL'] - mt5_results['rougeL']:+.4f})")

# Relative change in ROUGE-L (guard against a zero baseline)
improvement = (
    (rewritten_results['rougeL'] - mt5_results['rougeL']) / mt5_results['rougeL'] * 100
    if mt5_results['rougeL'] > 0 else float('nan')
)
print(f"\n✨ Relative ROUGE-L improvement: {improvement:+.1f}%")
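
ROUGE scores for Vietnamese depend heavily on tokenization. The sketch below is a simplified unigram ROUGE-1 in pure Python, not the `evaluate` implementation. It compares a whitespace tokenizer with an ASCII-only tokenizer of the kind the `rouge_score` package is assumed to use by default (lowercase, keep only `[a-z0-9]`). The two example sentences are invented for illustration.

```python
import re
from collections import Counter


def ascii_tokenize(text):
    """Lowercase and keep only runs of [a-z0-9], as an ASCII-only tokenizer would."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def whitespace_tokenize(text):
    """Lowercase and split on whitespace, keeping diacritics intact."""
    return text.lower().split()


def rouge1_f1(prediction, reference, tokenize):
    """Unigram-overlap F1 between a prediction and a single reference."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "ông bán nhà"   # roughly "he sells the house"
prediction = "ông bạn nhà"  # one word differs: "bạn" (friend) instead of "bán" (sell)

ascii_score = rouge1_f1(prediction, reference, ascii_tokenize)
whitespace_score = rouge1_f1(prediction, reference, whitespace_tokenize)

print(f"ASCII-only tokens: {ascii_tokenize(reference)} -> F1 = {ascii_score:.3f}")
print(f"Whitespace tokens: {whitespace_tokenize(reference)} -> F1 = {whitespace_score:.3f}")
```

The ASCII-only tokenizer reduces both sentences to the same fragments and reports a perfect match, while whitespace tokenization registers the wrong word. Scores computed with different tokenizers are therefore not comparable.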

## 11. Sample Comparisons

In [None]:
# Show sample comparisons
print("\n" + "="*80)
print("📝 SAMPLE COMPARISONS")
print("="*80)

for i in range(min(5, len(test_df))):
    print(f"\n{'='*80}")
    print(f"Example {i+1}")
    print(f"{'='*80}")
    
    print(f"\n📄 Original ({len(test_df.iloc[i]['document'])} chars):")
    print(test_df.iloc[i]['document'][:300] + "...")
    
    print(f"\n📝 Stage 1 (mT5) - {len(test_mt5_summaries[i])} chars:")
    print(test_mt5_summaries[i])
    
    print(f"\n✨ Stage 2 (Rewritten) - {len(test_rewritten_summaries[i])} chars:")
    print(test_rewritten_summaries[i])
    
    print(f"\n👤 Human Reference - {len(test_df.iloc[i]['summary'])} chars:")
    print(test_df.iloc[i]['summary'])

## 12. Save Results

In [None]:
# Save evaluation results
results_df = pd.DataFrame({
    'document': test_df['document'].tolist(),
    'human_summary': test_df['summary'].tolist(),
    'stage1_mt5': test_mt5_summaries,
    'stage2_rewritten': test_rewritten_summaries,
})

output_file = 'evaluation_results.csv'
results_df.to_csv(output_file, index=False)
print(f"\n💾 Results saved to: {output_file}")

# Save metrics
metrics = {
    'stage1_rouge1': mt5_results['rouge1'],
    'stage1_rouge2': mt5_results['rouge2'],
    'stage1_rougeL': mt5_results['rougeL'],
    'stage2_rouge1': rewritten_results['rouge1'],
    'stage2_rouge2': rewritten_results['rouge2'],
    'stage2_rougeL': rewritten_results['rougeL'],
    'improvement_pct': improvement,
}

import json
with open('evaluation_metrics.json', 'w', encoding='utf-8') as f:
    json.dump(metrics, f, indent=2, ensure_ascii=False)

print(f"💾 Metrics saved to: evaluation_metrics.json")

## 13. Production Usage Example

In [None]:
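# Example: Use the trained pipeline for new documents
# Requires the companion module mt5_llm_lora_pipeline.py on the Python path (not defined in this notebook)
from mt5_llm_lora_pipeline import MT5_LLM_Summarizer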

# Initialize with trained models
summarizer = MT5_LLM_Summarizer(
    stage1_model=CONFIG['stage1_checkpoint'],  # Your trained ViT5
    stage2_model=CONFIG['stage2_model'],       # Base LLM
    lora_checkpoint=CONFIG['output_dir'],      # Trained LoRA
    use_4bit=True
)

# Test document (Vietnamese news snippet on Hanoi's 2024 socio-economic results)
test_text = """
Chiều 26/1, UBND TP Hà Nội tổ chức họp báo công bố kết quả thực hiện
nhiệm vụ phát triển kinh tế - xã hội năm 2024. Theo đó, tổng sản phẩm
trên địa bàn (GRDP) của Hà Nội năm 2024 ước tăng 7,5% so với năm 2023,
cao hơn mức tăng trưởng chung của cả nước (7,09%). Trong đó, khu vực
nông nghiệp tăng 3,2%, công nghiệp - xây dựng tăng 7,8%, dịch vụ tăng 7,4%.
Thu ngân sách nhà nước trên địa bàn đạt 478.000 tỷ đồng, vượt 10,9% dự toán.
"""

# Generate summary
result = summarizer.summarize(test_text, use_stage2=True, verbose=True)

print("\n" + "="*80)
print("PRODUCTION TEST")
print("="*80)
print(f"\nStage 1: {result['stage1']}")
print(f"\nStage 2: {result['stage2']}")
print(f"\nFinal: {result['final']}")
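
`MT5_LLM_Summarizer` lives in a separate module that this notebook does not show. The sketch below is a hypothetical stand-in, not the real class: it only mirrors the `stage1` / `stage2` / `final` result keys used above, with both stages injected as plain callables so the control flow can be exercised without loading any model. All names in it are illustrative.

```python
from typing import Callable, Dict, Optional


class TwoStageSummarizer:
    """Hypothetical stand-in that mirrors the result keys used above.

    This is not the real MT5_LLM_Summarizer; both stages are injected as
    plain callables so the control flow can be tested without any model.
    """

    def __init__(
        self,
        stage1: Callable[[str], str],
        stage2: Callable[[str, str], str],
    ) -> None:
        self.stage1 = stage1
        self.stage2 = stage2

    def summarize(self, text: str, use_stage2: bool = True) -> Dict[str, Optional[str]]:
        draft = self.stage1(text)
        rewritten = self.stage2(text, draft) if use_stage2 else None
        return {
            "stage1": draft,
            "stage2": rewritten,
            # Fall back to the Stage 1 draft when Stage 2 is skipped
            "final": rewritten if rewritten is not None else draft,
        }


# Stub stages: take the first sentence, then tidy whitespace and punctuation.
def first_sentence(text: str) -> str:
    return text.strip().split(".")[0]


def tidy(document: str, draft: str) -> str:
    return " ".join(draft.split()) + "."


demo = TwoStageSummarizer(first_sentence, tidy)
demo_result = demo.summarize("GRDP  grew  strongly. Revenue beat the estimate.")
draft_only = demo.summarize("GRDP  grew  strongly. Revenue beat the estimate.", use_stage2=False)

print(demo_result)
print(draft_only)
```

Keeping the Stage 1 draft in the result makes it easy to fall back to it, or to log both outputs for later comparison.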

## ✅ Training Complete!

### What you achieved:

1. ✅ Trained LoRA adapter for Vietnamese summary rewriting
2. ✅ Evaluated Stage 1 (mT5) vs Stage 2 (mT5 + LoRA)
3. ✅ Saved trained model and evaluation results

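### Expected Results:

- **Stage 1 (mT5)**: Fast and accurate, but the output may read as choppy
- **Stage 2 (LoRA)**: The target is a 5-10% ROUGE improvement and noticeably better fluency; confirm this against the metrics measured on your own test set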

### Next Steps:

1. Review sample comparisons above
2. Check `evaluation_results.csv` for full results
3. Use trained pipeline in production (see example above)
4. Fine-tune prompts if needed
5. Try different LLMs for Stage 2 (Vistral, Gemma-2-Vi, etc.)

### Saved Files:

- `./lora_rewriter/` - Trained LoRA weights
- `evaluation_results.csv` - Full evaluation results
- `evaluation_metrics.json` - ROUGE scores

---

**Notebook created:** 2025-01-06  
**Status:** ✅ Ready to use