# 🏛️ Latin→Spanish Translator Training - Phase 1

## Objective
Fine-tune an mT5-small model to translate Classical Latin into Spanish.

**Phase 1**: Base model trained on ~30,000 sentence pairs (expected BLEU: ~30-35)

---

## ⚙️ Initial Setup

In [None]:
# ============================================
# SECTION 1: CONNECT TO GOOGLE DRIVE
# ============================================
# Why? So checkpoints persist on Drive and progress survives a disconnect

from google.colab import drive
drive.mount('/content/drive')

# Create the working directory
!mkdir -p /content/drive/MyDrive/latin_translator_phase1
%cd /content/drive/MyDrive/latin_translator_phase1

print("✅ Google Drive connected")
print("📁 Directory: /content/drive/MyDrive/latin_translator_phase1")

In [None]:
# ============================================
# SECTION 2: INSTALL DEPENDENCIES
# ============================================

!pip install -q transformers datasets sentencepiece sacrebleu accelerate

print("✅ Dependencies installed")

In [None]:
# ============================================
# SECTION 3: GPU CHECK
# ============================================

import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"✅ GPU available: {gpu_name}")
    print(f"💾 GPU memory: {gpu_memory:.1f} GB")
else:
    print("❌ GPU not available")
    print("⚠️ Go to: Runtime > Change runtime type > GPU (T4)")

---

## 📊 Data Preparation

The Phase 1 corpus is built from multiple sources.
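
Combining several sources usually means removing duplicates and holding out a validation slice before training. The sketch below is one possible way to do that, not the notebook's own pipeline. It assumes each source is a list of `{"latin": ..., "spanish": ...}` dictionaries, the same layout the loading cell expects in `data/train.json` and `data/validation.json`. The function names, the 10% validation fraction, and the seed are illustrative assumptions.

```python
import json
import random
from pathlib import Path


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def merge_sources(sources):
    """Merge lists of {"latin", "spanish"} dicts, dropping empty and duplicate pairs."""
    seen = set()
    merged = []
    for pairs in sources:
        for pair in pairs:
            latin = normalize(pair.get("latin", ""))
            spanish = normalize(pair.get("spanish", ""))
            if not latin or not spanish:
                continue  # skip incomplete pairs
            key = (latin.lower(), spanish.lower())
            if key in seen:
                continue  # skip duplicates across sources
            seen.add(key)
            merged.append({"latin": latin, "spanish": spanish})
    return merged


def split_pairs(pairs, val_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a validation slice (assumed 10%)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_split(pairs, out_dir="data", val_fraction=0.1, seed=42):
    """Write train.json and validation.json in the layout the loading cell reads."""
    train, val = split_pairs(pairs, val_fraction, seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in (("train.json", train), ("validation.json", val)):
        with open(out / name, "w", encoding="utf-8") as f:
            json.dump(rows, f, ensure_ascii=False, indent=2)
    return len(train), len(val)
```

Calling `write_split(merge_sources([source_a, source_b]))` with your own lists would produce the two files the notebook looks for. Because the split happens after deduplication, no pair lands in both files.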

In [None]:
# ============================================
# SECTION 4: CORPUS SETUP
# ============================================
# The intended source is the OPUS corpus, which has Latin-Spanish pairs.
# This cell does not download it: upload your own files, or fall back
# to the tiny example set created below.

import json

# Create the data directory
!mkdir -p data

# Option 1: upload your own data
print("📤 OPTION 1: Upload your train.json and validation.json files to data/")
print("   (If you have already prepared them locally)")
print()
print("📥 OPTION 2: Use the example data")
print("   (A minimal set, only for testing the pipeline)")

# Example data (classical samples)
example_data = [
    {"latin": "Gallia est omnis divisa in partes tres.",
     "spanish": "Toda la Galia está dividida en tres partes."},
    {"latin": "Veni, vidi, vici.",
     "spanish": "Vine, vi, vencí."},
    {"latin": "Alea iacta est.",
     "spanish": "La suerte está echada."},
    {"latin": "Carpe diem.",
     "spanish": "Aprovecha el día."},
    {"latin": "Errare humanum est.",
     "spanish": "Errar es humano."},
]

# Save the example data
with open('data/example_train.json', 'w', encoding='utf-8') as f:
    json.dump(example_data * 100, f, ensure_ascii=False, indent=2)  # Repeated to get more rows; adds no new information

# The validation examples also appear in training, so scores on them only test the pipeline
with open('data/example_validation.json', 'w', encoding='utf-8') as f:
    json.dump(example_data[:2], f, ensure_ascii=False, indent=2)

print("✅ Example data created")
print("⚠️ IMPORTANT: For real training, you need to upload a larger corpus")

In [None]:
# ============================================
# SECTION 5: LOAD DATA
# ============================================

import json
from datasets import Dataset, DatasetDict

def load_data(train_path='data/train.json', val_path='data/validation.json'):
    """
    Load the training and validation data.
    Fall back to the example files if the real ones do not exist.
    """
    # Try to load the real data
    try:
        with open(train_path, 'r', encoding='utf-8') as f:
            train_data = json.load(f)
        with open(val_path, 'r', encoding='utf-8') as f:
            val_data = json.load(f)
        print("✅ Custom data loaded")
    except FileNotFoundError:
        print("⚠️ Using example data")
        with open('data/example_train.json', 'r', encoding='utf-8') as f:
            train_data = json.load(f)
        with open('data/example_validation.json', 'r', encoding='utf-8') as f:
            val_data = json.load(f)

    # Convert to Dataset
    train_dataset = Dataset.from_dict({
        'latin': [item['latin'] for item in train_data],
        'spanish': [item['spanish'] for item in train_data]
    })

    val_dataset = Dataset.from_dict({
        'latin': [item['latin'] for item in val_data],
        'spanish': [item['spanish'] for item in val_data]
    })

    return DatasetDict({
        'train': train_dataset,
        'validation': val_dataset
    })

# Load the data
dataset = load_data()

print("\n📊 Statistics:")
print(f"   - Training: {len(dataset['train'])} pairs")
print(f"   - Validation: {len(dataset['validation'])} pairs")
print("\n📝 Example:")
print(f"   Latin: {dataset['train'][0]['latin']}")
print(f"   Spanish: {dataset['train'][0]['spanish']}")
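
The example validation file is a slice of the example training sentences, so any score measured on it reflects memorization rather than translation quality. A quick overlap check, sketched below, can flag the same problem in a real corpus. `find_overlap` is a hypothetical helper, and matching on the lowercased, whitespace-normalized Latin side is an assumption about what counts as a duplicate.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing sentences."""
    return " ".join(text.lower().split())


def find_overlap(train_pairs, val_pairs):
    """Return validation pairs whose Latin side also occurs in the training set."""
    train_latin = {normalize(p["latin"]) for p in train_pairs}
    return [p for p in val_pairs if normalize(p["latin"]) in train_latin]


# Mirror the example files: 5 sentences repeated 100 times, validation = first 2
example = [
    {"latin": "Gallia est omnis divisa in partes tres.",
     "spanish": "Toda la Galia está dividida en tres partes."},
    {"latin": "Veni, vidi, vici.", "spanish": "Vine, vi, vencí."},
    {"latin": "Alea iacta est.", "spanish": "La suerte está echada."},
    {"latin": "Carpe diem.", "spanish": "Aprovecha el día."},
    {"latin": "Errare humanum est.", "spanish": "Errar es humano."},
]
train_pairs = example * 100
val_pairs = example[:2]

leaked = find_overlap(train_pairs, val_pairs)
print(f"{len(leaked)} of {len(val_pairs)} validation pairs also appear in training")
# prints: 2 of 2 validation pairs also appear in training
```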

---

## 🤖 Model Setup

In [None]:
# ============================================
# SECTION 6: LOAD THE mT5-small MODEL
# ============================================

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/mt5-small"

print(f"📥 Loading model: {MODEL_NAME}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

print("✅ Model loaded")
print(f"📊 Parameters: {model.num_parameters():,}")
# Size estimate assumes fp32 weights (4 bytes per parameter)
print(f"💾 Approximate size: ~{model.num_parameters() * 4 / 1e9:.1f} GB in fp32")

In [None]:
# ============================================
# SECTION 7: PREPROCESSING
# ============================================

def preprocess_function(examples):
    """
    Preprocess the data for mT5.
    Add the task prefix and tokenize.
    """
    # Task prefix
    inputs = ["translate Latin to Spanish: " + text for text in examples['latin']]
    targets = examples['spanish']

    # Tokenize
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding='max_length'
    )

    labels = tokenizer(
        targets,
        max_length=128,
        truncation=True,
        padding='max_length'
    )

    # Replace padding in the labels with -100 so the loss ignores those positions
    model_inputs['labels'] = [
        [token if token != tokenizer.pad_token_id else -100 for token in seq]
        for seq in labels['input_ids']
    ]

    return model_inputs

# Apply the preprocessing
print("🔄 Preprocessing data...")
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset['train'].column_names
)

print("✅ Data preprocessed")

---

## 🎯 Training
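
The training cell uses a per-device batch size of 8, 2 gradient accumulation steps, 3 epochs, and a checkpoint every 500 steps. The sketch below works out what those numbers imply for the run length. It is an estimate under stated assumptions (a single GPU and simple ceiling rounding), so the Trainer's own step count may differ slightly. `training_schedule` is a hypothetical helper, not part of any library.

```python
import math


def training_schedule(num_examples, batch_size=8, grad_accum_steps=2,
                      num_epochs=3, save_steps=500):
    """Estimate optimizer steps and checkpoint count for a single-GPU run."""
    effective_batch = batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "checkpoints": total_steps // save_steps,
    }


for label, size in (("full corpus", 30_000), ("example data", 500)):
    print(label, training_schedule(size))
# full corpus {'effective_batch': 16, 'steps_per_epoch': 1875, 'total_steps': 5625, 'checkpoints': 11}
# example data {'effective_batch': 16, 'steps_per_epoch': 32, 'total_steps': 96, 'checkpoints': 0}
```

Under these assumptions a run on the 500-row example data ends after roughly 96 steps, before the first evaluation or checkpoint at step 500, so it may finish without any intermediate evaluation.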

In [None]:
# ============================================
# SECTION 8: TRAINING CONFIGURATION
# ============================================

import numpy as np
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    # Output directory (on Google Drive)
    output_dir="./checkpoints",

    # Checkpoint saving
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,  # Keep only the 3 most recent

    # Evaluation
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=500,
    predict_with_generate=True,  # decode with generate() so BLEU scores real translations
    generation_max_length=128,

    # Hyperparameters
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,

    # Optimizations
    fp16=False,  # mT5 is prone to overflow (NaN loss) in fp16; bf16=True is an option on Ampere or newer GPUs
    gradient_accumulation_steps=2,

    # Logging
    logging_dir="./logs",
    logging_steps=100,

    # Other
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="none"
)

print("✅ Training configuration ready")
print("\n📊 Parameters:")
print(f"   - Epochs: {training_args.num_train_epochs}")
print(f"   - Learning rate: {training_args.learning_rate}")
print(f"   - Batch size: {training_args.per_device_train_batch_size}")
print(f"   - Checkpoint every: {training_args.save_steps} steps")

In [None]:
# ============================================
# SECTION 9: METRICS
# ============================================

# datasets.load_metric is no longer available in current releases,
# so call sacrebleu (installed in Section 2) directly.
import sacrebleu

def compute_metrics(eval_preds):
    """
    Compute the corpus-level BLEU score.
    """
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Restore padding wherever -100 was used as the ignore index, then decode
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # BLEU: a single reference stream, aligned with the predictions
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]
    result = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels])

    return {"bleu": result.score}

print("✅ Metrics configured")

In [None]:
# ============================================
# SECTION 10: START TRAINING
# ============================================

import os

# Look for existing checkpoints
checkpoints = []
if os.path.isdir("./checkpoints"):
    checkpoints = [
        d for d in os.listdir("./checkpoints")
        if d.startswith("checkpoint-") and d.split("-")[1].isdigit()
    ]

if checkpoints:
    # Pick the checkpoint with the highest step number
    latest = max(checkpoints, key=lambda name: int(name.split("-")[1]))
    resume_from = f"./checkpoints/{latest}"
    print(f"🔄 Resuming from: {resume_from}")
else:
    print("🆕 Starting training from scratch")
    resume_from = None

# Create the Trainer (the tokenizer is saved separately in Section 11)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics
)

print("\n" + "="*60)
print("🚀 STARTING TRAINING")
print("="*60)
print("\n⏱️ Estimated time: 8-12 hours (for the full corpus)")
print("💾 Checkpoints are saved to Google Drive every 500 steps")
print("🔄 If the session disconnects, rerun the notebook to resume from the latest checkpoint")
print("\n" + "="*60 + "\n")

# TRAIN
trainer.train(resume_from_checkpoint=resume_from)

print("\n" + "="*60)
print("✅ TRAINING COMPLETE")
print("="*60)

In [None]:
# ============================================
# SECTION 11: SAVE THE FINAL MODEL
# ============================================

trainer.save_model("./final_model")
tokenizer.save_pretrained("./final_model")

print("✅ Final model saved to: ./final_model")
print("\n📥 To download it:")
print("   1. Go to Google Drive")
print("   2. Navigate to: MyDrive/latin_translator_phase1/final_model")
print("   3. Download the whole folder")
print("   4. Place it in your project: models/latin_translator_v1.0")

---

## 🧪 Model Tests

In [None]:
# ============================================
# SECTION 12: TRY SOME TRANSLATIONS
# ============================================

def translate(latin_text):
    """
    Translate Latin text into Spanish using the model currently in memory.
    """
    input_text = f"translate Latin to Spanish: {latin_text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True).to(model.device)

    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        early_stopping=True
    )

    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

# Try a few examples
test_sentences = [
    "Gallia est omnis divisa in partes tres.",
    "Veni, vidi, vici.",
    "Alea iacta est.",
    "Carpe diem."
]

print("🧪 Testing translations:\n")
for latin in test_sentences:
    spanish = translate(latin)
    print(f"Latin:   {latin}")
    print(f"Spanish: {spanish}")
    print()

---

## 📊 Final Evaluation

In [None]:
# ============================================
# SECTION 13: FINAL EVALUATION
# ============================================

# Evaluate on the validation set
eval_results = trainer.evaluate()
bleu = eval_results['eval_bleu']

print("📊 Final Results:")
print(f"   - Loss: {eval_results['eval_loss']:.4f}")
print(f"   - BLEU: {bleu:.2f}")
print()
print("🎯 Interpretation:")
if bleu > 40:
    print("   ✅ Excellent quality")
elif bleu > 30:
    print("   ✅ Good quality")
elif bleu > 20:
    print("   ⚠️ Acceptable quality - consider Phase 2")
else:
    print("   ❌ Low quality - you need more data")
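
To make the thresholds above easier to read, it helps to see what BLEU rewards. The sketch below computes clipped n-gram precision, the core ingredient of BLEU, on one sentence pair. It is a simplified illustration and not the sacreBLEU score: the real metric combines 1- to 4-gram precisions, applies a brevity penalty, and uses its own tokenization. The function names and the hypothetical model output are assumptions for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """List every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams that also occur in the reference (counts clipped)."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "Toda la Galia está dividida en tres partes."
hypothesis = "Toda la Galia se divide en tres partes."  # hypothetical model output

for n in (1, 2):
    print(f"{n}-gram precision: {clipped_precision(hypothesis, reference, n):.2f}")
# 1-gram precision: 0.75
# 2-gram precision: 0.57
```

A reasonable paraphrase already loses a quarter of its unigram matches here, which is one reason scores in the 30s can still correspond to usable translations.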

---

## 🎉 Training Complete!

### Next Steps:

1. **Download the model**: Download the `final_model` folder from Google Drive
2. **Integrate it into your application**: Place the model in your local project
3. **Assess quality**: Test it on real texts
4. **Phase 2** (optional): If you need better quality, continue training with more data
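
For step 2, a local application needs to load the downloaded folder and apply the same task prefix used during training. The following is a minimal sketch, assuming the folder was placed at `models/latin_translator_v1.0` as the save cell suggests and that `transformers` and `torch` are installed locally. `build_prompt` and `load_translator` are hypothetical helper names.

```python
TASK_PREFIX = "translate Latin to Spanish: "  # same prefix as in the preprocessing cell


def build_prompt(latin_text):
    """Prepend the task prefix the model was fine-tuned with."""
    return TASK_PREFIX + latin_text.strip()


def load_translator(model_dir="models/latin_translator_v1.0"):
    """Load the saved model and return a translate(text) function."""
    # Imported lazily so build_prompt works even without transformers installed
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    model.eval()

    def translate(latin_text, num_beams=4, max_length=128):
        inputs = tokenizer(
            build_prompt(latin_text),
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
        )
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=num_beams,
            early_stopping=True,
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translate


# Usage (requires the downloaded model folder):
# translate = load_translator()
# print(translate("Carpe diem."))
```

Loading the model once and returning a closure avoids re-reading the weights on every call, which matters if the application translates many sentences.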

### Model Information:

- **Name**: `latin_translator_v1.0`
- **Architecture**: mT5-small
- **Size**: ~300M parameters (roughly 1.2 GB in fp32)
- **Training data**: Phase 1
- **BLEU score**: (see above)

---

**Questions?** See the guide in `docs/ai_training_guide.md`