<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">

# Natural Language Processing
## Challenge 3: Character-Level Tokenized Language Model

**Author:** Carlos Espinola  
**Date:** December 2025

### Version: TensorFlow/Keras

---
## Challenge Objectives

### Assignment
1. **Select a text corpus** on which to train the language model
2. **Pre-processing**: tokenize the corpus, structure the dataset, and split it into training and validation data
3. **Propose RNN architectures**: implement models based on recurrent units (SimpleRNN, LSTM, GRU)
4. **Sequence generation** with different strategies:
   - Greedy search
   - Deterministic beam search
   - Stochastic beam search (analyzing the effect of temperature)

### Suggestions
- Use the decrease in validation **perplexity** to decide when to stop training
- Explore: SimpleRNN (Elman cell), LSTM, and GRU
- `RMSprop` is the recommended optimizer for good convergence
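
Perplexity is the exponential of the mean per-character cross-entropy (in nats), which is why it can be read off a Keras loss as `exp(loss)`. The sketch below illustrates the definition with made-up probabilities; the `perplexity` helper and the numbers are illustrative assumptions, not part of the assignment.

```python
import math

def perplexity(target_probs):
    """Perplexity from the probabilities a model assigned to the true characters."""
    # Mean negative log-likelihood (cross-entropy in nats)
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)

# A model that spreads its mass uniformly over 4 characters
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
# A more confident model that assigns higher probability to the true characters
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7])

print(f"uniform: {uniform_ppl:.2f}, confident: {confident_ppl:.2f}")
```

A uniform guess over K characters gives a perplexity of K, so a character-level model is only useful once its validation perplexity falls well below the vocabulary size.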

In [None]:
# =============================================================================
# 1. LIBRARY IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import urllib.request
import bs4 as bs

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, Callback

# Plot style configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print(f"🔧 TensorFlow version: {tf.__version__}")
print(f"🔧 Keras version: {keras.__version__}")

In [None]:
# GPU configuration
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"\n🎮 GPU detected: {len(gpus)} device(s)")
    for gpu in gpus:
        print(f"   • {gpu.name}")

    # Enable dynamic memory growth
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)

    # Enable mixed precision to speed up training
    tf.keras.mixed_precision.set_global_policy('mixed_float16')
    print("\n⚡ Mixed precision (float16) ENABLED")
    print("✅ GPU optimizations enabled")
else:
    print("\n⚠️ No GPU detected. Training will be slower on CPU.")
    print("   If you have a GPU, check your CUDA and TensorFlow installation.")

---
## 2. Corpus Selection and Download

In [None]:
# Download the book from textos.info
url = 'https://www.textos.info/homero/odisea/ebook'
with urllib.request.urlopen(url) as response:
    raw_html = response.read()

# Parse the HTML with BeautifulSoup
article_html = bs.BeautifulSoup(raw_html, 'lxml')

# Extract all paragraphs
article_paragraphs = article_html.find_all('p')

# Concatenate the text of all paragraphs, separated by spaces
corpus = ''.join(para.text + ' ' for para in article_paragraphs)

# Convert to lowercase to normalize
corpus = corpus.lower()

print(f"📚 Total corpus length: {len(corpus):,} characters")
print("\n📖 First 500 characters of the corpus:")
print("-" * 50)
print(corpus[:500])

In [None]:
# Analyze the character distribution in the corpus
char_counts = Counter(corpus)
most_common = char_counts.most_common(20)

fig, ax = plt.subplots(figsize=(14, 5))
chars, counts = zip(*most_common)
# Show whitespace characters via repr() so they are visible on the axis
chars_display = [repr(c) if c in [' ', '\n', '\t'] else c for c in chars]
bars = ax.bar(chars_display, counts, color='steelblue', edgecolor='navy', alpha=0.8)
ax.set_xlabel('Character', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Distribution of the 20 most frequent characters', fontsize=14)
plt.xticks(rotation=45, fontsize=11)
plt.tight_layout()
plt.show()

---
## 3. Character-Level Tokenization

In [None]:
# Build the vocabulary of unique characters
chars_vocab = sorted(set(corpus))
vocab_size = len(chars_vocab)

print(f"📝 Vocabulary size: {vocab_size} unique characters")
print("\n🔤 Characters in the vocabulary:")
print(chars_vocab)

In [None]:
# Build the mapping dictionaries
char2idx = {ch: idx for idx, ch in enumerate(chars_vocab)}
idx2char = {idx: ch for ch, idx in char2idx.items()}

print("🔗 Example char2idx mappings:")
for ch in ['a', 'e', 'i', 'o', 'u', ' ', '.']:
    if ch in char2idx:
        print(f"  '{ch}' -> {char2idx[ch]}")

In [None]:
# Tokenize the full corpus
tokenized_corpus = np.array([char2idx[ch] for ch in corpus], dtype=np.int32)

print(f"📊 Tokenized corpus - shape: {tokenized_corpus.shape}")
print("\n🔢 First 50 tokens:")
print(tokenized_corpus[:50])
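
As a quick sanity check of the mapping idea, the following sketch runs the same vocabulary construction on a tiny made-up string and verifies that encoding followed by decoding recovers the original text. The `toy_*` names and the `encode`/`decode` helpers are hypothetical and exist only for this illustration.

```python
toy_corpus = "canta, oh musa"

# Vocabulary: sorted unique characters, mirroring the cells above
toy_vocab = sorted(set(toy_corpus))
toy_char2idx = {ch: idx for idx, ch in enumerate(toy_vocab)}
toy_idx2char = {idx: ch for ch, idx in toy_char2idx.items()}

def encode(text, mapping):
    """Map each character to its integer id."""
    return [mapping[ch] for ch in text]

def decode(ids, mapping):
    """Map integer ids back to a string."""
    return ''.join(mapping[i] for i in ids)

toy_ids = encode(toy_corpus, toy_char2idx)
print(toy_vocab)
print(toy_ids)
print(decode(toy_ids, toy_idx2char))
```

Because the vocabulary is built from the corpus itself, every corpus character has an id; text typed later (for example a generation seed) may contain characters that do not.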

---
## 4. Dataset Structuring

In [None]:
# Define the context size
MAX_CONTEXT_SIZE = 100

print(f"⚙️ Context size: {MAX_CONTEXT_SIZE} characters")

In [None]:
# Split into training and validation (contiguous split, last 10% held out)
p_val = 0.1
split_idx = int(len(tokenized_corpus) * (1 - p_val))

train_corpus = tokenized_corpus[:split_idx]
val_corpus = tokenized_corpus[split_idx:]

print("📊 Corpus split:")
print(f"  • Training: {len(train_corpus):,} characters ({len(train_corpus)/len(tokenized_corpus)*100:.1f}%)")
print(f"  • Validation: {len(val_corpus):,} characters ({len(val_corpus)/len(tokenized_corpus)*100:.1f}%)")

In [None]:
def create_sequences(corpus_data, seq_length):
    """Create input and target sequences for many-to-many training."""
    n_sequences = len(corpus_data) - seq_length
    X = np.zeros((n_sequences, seq_length), dtype=np.int32)
    y = np.zeros((n_sequences, seq_length), dtype=np.int32)
    for i in range(n_sequences):
        # The target is the input window shifted one character to the right
        X[i] = corpus_data[i:i + seq_length]
        y[i] = corpus_data[i + 1:i + seq_length + 1]
    return X, y

X_train, y_train = create_sequences(train_corpus, MAX_CONTEXT_SIZE)
X_val, y_val = create_sequences(val_corpus, MAX_CONTEXT_SIZE)

print(f"📦 Training sequences: X={X_train.shape}, y={y_train.shape}")
print(f"📦 Validation sequences: X={X_val.shape}, y={y_val.shape}")
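
To make the many-to-many layout concrete, here is a minimal pure-Python sketch of the same sliding-window idea on a six-character string. The `toy_windows` helper is hypothetical; it mirrors the indexing of `create_sequences` but returns readable lists instead of NumPy arrays.

```python
def toy_windows(tokens, seq_length):
    """Return (input, target) pairs where the target is the input shifted by one."""
    pairs = []
    for i in range(len(tokens) - seq_length):
        x = tokens[i:i + seq_length]
        y = tokens[i + 1:i + seq_length + 1]
        pairs.append((x, y))
    return pairs

text = "odisea"
pairs = toy_windows(list(text), seq_length=3)
for x, y in pairs:
    print(''.join(x), '->', ''.join(y))
```

Each position of the target is the character that follows the corresponding input position, so a single window supervises `seq_length` predictions at once.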

In [None]:
# Create the tf.data.Dataset pipelines
BATCH_SIZE = 1024 if gpus else 128
BUFFER_SIZE = 10000

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

val_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_dataset = val_dataset.batch(BATCH_SIZE)
val_dataset = val_dataset.prefetch(tf.data.AUTOTUNE)

print("📦 Datasets created:")
print(f"  • Training batches: {len(train_dataset)}")
print(f"  • Validation batches: {len(val_dataset)}")
print(f"  • Batch size: {BATCH_SIZE}")

---
## 5. RNN Architecture Definitions

In [None]:
def build_char_language_model(vocab_size, hidden_size=256, num_layers=2,
                               rnn_type='lstm', dropout=0.5, embedding_dim=128,
                               embed_dropout=0.2):
    """Build a character-level language model."""
    rnn_classes = {
        'rnn': layers.SimpleRNN,
        'lstm': layers.LSTM,
        'gru': layers.GRU
    }

    if rnn_type.lower() not in rnn_classes:
        raise ValueError(f"rnn_type must be 'rnn', 'lstm' or 'gru', got {rnn_type!r}")

    RNNLayer = rnn_classes[rnn_type.lower()]

    # Variable-length integer sequences -> embeddings
    inputs = layers.Input(shape=(None,), dtype=tf.int32)
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    x = layers.Dropout(embed_dropout)(x)

    # Stacked recurrent layers; dropout is applied to all but the last layer.
    # Note: recurrent_dropout > 0 prevents LSTM/GRU from using the fast cuDNN kernel.
    for i in range(num_layers):
        x = RNNLayer(
            hidden_size,
            return_sequences=True,
            dropout=dropout if i < num_layers - 1 else 0,
            recurrent_dropout=dropout if i < num_layers - 1 else 0
        )(x)

    x = layers.LayerNormalization()(x)
    x = layers.Dropout(dropout)(x)
    # Keep the logits in float32 for numerical stability under mixed precision
    outputs = layers.Dense(vocab_size, dtype='float32')(x)

    return Model(inputs=inputs, outputs=outputs)

print("✅ build_char_language_model function defined")

In [None]:
# Compare architectures
print("📊 Architecture comparison:")
print("=" * 50)
for rnn_type in ['rnn', 'lstm', 'gru']:
    model_temp = build_char_language_model(vocab_size, rnn_type=rnn_type)
    print(f"  {rnn_type.upper():>5}: {model_temp.count_params():>10,} parameters")
    del model_temp
print("=" * 50)
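
The gap in parameter counts between the three cell types follows from their gate structure. The sketch below computes the count for a single recurrent layer, assuming Keras defaults (biases enabled, and `reset_after=True` for GRU, which stores two bias vectors per gate). The `rnn_layer_params` helper is hypothetical and covers only one layer, not the whole model.

```python
def rnn_layer_params(cell, input_dim, units):
    """Trainable parameters of one recurrent layer (Keras defaults assumed)."""
    kernel = input_dim * units        # input-to-hidden weights per gate
    recurrent = units * units         # hidden-to-hidden weights per gate
    if cell == 'rnn':
        return kernel + recurrent + units
    if cell == 'lstm':
        # Four weight sets: input, forget, and output gates plus the cell candidate
        return 4 * (kernel + recurrent + units)
    if cell == 'gru':
        # Three weight sets; reset_after=True keeps two bias vectors per set
        return 3 * (kernel + recurrent + 2 * units)
    raise ValueError(f"unknown cell type: {cell!r}")

for cell in ('rnn', 'lstm', 'gru'):
    print(f"{cell.upper():>5}: {rnn_layer_params(cell, input_dim=128, units=256):>10,}")
```

With the notebook's first-layer sizes (128-dimensional embeddings, 256 hidden units), an LSTM layer has four times the weights of a SimpleRNN layer and a GRU layer roughly three times.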

---
## 6. Model Training

In [None]:
class PerplexityCallback(Callback):
    """Callback that prints perplexity (exp of the loss) after each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Use NaN when a value is missing so a fake perplexity of 1.0 is never shown
        train_ppl = np.exp(logs.get('loss', np.nan))
        val_ppl = np.exp(logs.get('val_loss', np.nan))
        print(f" | Train PPL: {train_ppl:7.2f} | Val PPL: {val_ppl:7.2f}")

print("✅ PerplexityCallback defined")

In [None]:
def train_model_keras(rnn_type, vocab_size, train_dataset, val_dataset,
                      hidden_size=256, num_layers=2, embedding_dim=128,
                      dropout=0.5, embed_dropout=0.2,
                      learning_rate=0.001, weight_decay=1e-5,
                      label_smoothing=0.1, num_epochs=30, patience=5):
    """Train a language model with early stopping."""
    print(f"\n🚀 Starting training - {rnn_type.upper()}")
    print("=" * 70)

    model = build_char_language_model(
        vocab_size, hidden_size, num_layers, rnn_type,
        dropout, embedding_dim, embed_dropout
    )

    optimizer = keras.optimizers.RMSprop(learning_rate=learning_rate, weight_decay=weight_decay)

    # SparseCategoricalCrossentropy has no label_smoothing argument, so the
    # integer targets are one-hot encoded and passed to categorical_crossentropy.
    # With label_smoothing > 0, exp(loss) is no longer the exact perplexity.
    def smoothed_sparse_crossentropy(y_true, y_pred):
        y_true_one_hot = tf.one_hot(tf.cast(y_true, tf.int32), depth=vocab_size)
        return keras.losses.categorical_crossentropy(
            y_true_one_hot, y_pred, from_logits=True, label_smoothing=label_smoothing
        )

    model.compile(optimizer=optimizer, loss=smoothed_sparse_crossentropy, metrics=['accuracy'])

    callbacks = [
        EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True, verbose=1),
        PerplexityCallback()
    ]

    history = model.fit(train_dataset, validation_data=val_dataset,
                        epochs=num_epochs, callbacks=callbacks, verbose=1)

    best_ppl = np.exp(min(history.history['val_loss']))
    print("=" * 70)
    print(f"✅ Best validation perplexity: {best_ppl:.2f}")

    return model, history

print("✅ train_model_keras function defined")
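
Since training stops on `exp(val_loss)`, it helps to see how label smoothing changes that quantity. The sketch below assumes the common formulation in which the one-hot target is mixed with a uniform distribution over the K classes; the `cross_entropy` helper and the probabilities are made up for illustration.

```python
import math

def cross_entropy(probs, target_idx, label_smoothing=0.0):
    """Cross-entropy (nats) of one prediction against a smoothed one-hot target."""
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        one_hot = 1.0 if i == target_idx else 0.0
        # Smoothed target: (1 - eps) * one_hot + eps / K
        q = (1.0 - label_smoothing) * one_hot + label_smoothing / k
        loss -= q * math.log(p)
    return loss

probs = [0.90, 0.05, 0.03, 0.02]   # hypothetical model output over 4 characters
plain = cross_entropy(probs, target_idx=0)
smoothed = cross_entropy(probs, target_idx=0, label_smoothing=0.1)

print(f"exp(plain loss)    = {math.exp(plain):.3f}")
print(f"exp(smoothed loss) = {math.exp(smoothed):.3f}")
```

For a confident, correct prediction like this one, the smoothed loss is higher than the plain loss, so perplexities reported from a smoothed loss are best compared only against other runs trained with the same smoothing value.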

In [None]:
# Hyperparameters
HIDDEN_SIZE = 256
NUM_LAYERS = 2
EMBEDDING_DIM = 128
DROPOUT = 0.5
EMBED_DROPOUT = 0.2
LEARNING_RATE = 0.001
NUM_EPOCHS = 30
PATIENCE = 5
WEIGHT_DECAY = 1e-5
LABEL_SMOOTHING = 0.1

models = {}
histories = {}

print("⚙️ Hyperparameters configured")

In [None]:
# Train SimpleRNN
print("\n" + "="*70)
print("📦 TRAINING MODEL: SimpleRNN")
print("="*70)

models['rnn'], histories['rnn'] = train_model_keras(
    'rnn', vocab_size, train_dataset, val_dataset,
    HIDDEN_SIZE, NUM_LAYERS, EMBEDDING_DIM, DROPOUT, EMBED_DROPOUT,
    LEARNING_RATE, WEIGHT_DECAY, LABEL_SMOOTHING, NUM_EPOCHS, PATIENCE
)

In [None]:
# Train LSTM
print("\n" + "="*70)
print("📦 TRAINING MODEL: LSTM")
print("="*70)

models['lstm'], histories['lstm'] = train_model_keras(
    'lstm', vocab_size, train_dataset, val_dataset,
    HIDDEN_SIZE, NUM_LAYERS, EMBEDDING_DIM, DROPOUT, EMBED_DROPOUT,
    LEARNING_RATE, WEIGHT_DECAY, LABEL_SMOOTHING, NUM_EPOCHS, PATIENCE
)

In [None]:
# Train GRU
print("\n" + "="*70)
print("📦 TRAINING MODEL: GRU")
print("="*70)

models['gru'], histories['gru'] = train_model_keras(
    'gru', vocab_size, train_dataset, val_dataset,
    HIDDEN_SIZE, NUM_LAYERS, EMBEDDING_DIM, DROPOUT, EMBED_DROPOUT,
    LEARNING_RATE, WEIGHT_DECAY, LABEL_SMOOTHING, NUM_EPOCHS, PATIENCE
)

In [None]:
# Plot training curves
colors = {'rnn': '#e74c3c', 'lstm': '#3498db', 'gru': '#2ecc71'}
labels_map = {'rnn': 'SimpleRNN', 'lstm': 'LSTM', 'gru': 'GRU'}

if histories:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    for model_type, history in histories.items():
        val_ppl = [np.exp(loss) for loss in history.history['val_loss']]
        epochs = range(1, len(val_ppl) + 1)
        axes[0].plot(epochs, val_ppl, color=colors[model_type],
                     label=labels_map[model_type], linewidth=2, marker='o', markersize=4)
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Perplexity')
    axes[0].set_title('Validation Perplexity')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    for model_type, history in histories.items():
        epochs = range(1, len(history.history['loss']) + 1)
        axes[1].plot(epochs, history.history['loss'], color=colors[model_type],
                     label=f"{labels_map[model_type]} (train)", linestyle='-')
        axes[1].plot(epochs, history.history['val_loss'], color=colors[model_type],
                     label=f"{labels_map[model_type]} (val)", linestyle='--')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Loss')
    axes[1].set_title('Learning Curves')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Model summary:")
    print("=" * 55)
    for model_type in histories.keys():
        best_ppl = np.exp(min(histories[model_type].history['val_loss']))
        print(f"  {labels_map[model_type]:<10}: PPL = {best_ppl:.2f}")
    print("=" * 55)

---
## 7. Sequence Generation

In [None]:
# Select the best model (lowest validation loss)
best_model_type = min(histories, key=lambda x: min(histories[x].history['val_loss']))
model = models[best_model_type]
best_ppl = np.exp(min(histories[best_model_type].history['val_loss']))
print(f"🏆 Using model: {best_model_type.upper()} (PPL: {best_ppl:.2f})")

In [None]:
def greedy_search(model, seed_text, max_length, num_chars):
    """Generate text with greedy search (always pick the most likely character)."""
    generated_text = seed_text.lower()
    for _ in range(num_chars):
        # Keep the last max_length characters. No padding is added: the model accepts
        # variable-length input, and index 0 is a real vocabulary character.
        # Fall back to [0] only if the seed is empty.
        tokens = [char2idx.get(ch, 0) for ch in generated_text[-max_length:]] or [0]
        x = np.array([tokens], dtype=np.int32)
        # Direct call avoids predict() overhead and retracing on changing lengths
        logits = model(x, training=False).numpy()
        next_char_idx = int(np.argmax(logits[0, -1, :]))
        generated_text += idx2char[next_char_idx]
    return generated_text

print("✅ greedy_search function defined")

In [None]:
def sample_with_temperature(model, seed_text, max_length, num_chars, temperature=1.0):
    """Generate text by sampling from the temperature-scaled distribution."""
    generated_text = seed_text.lower()
    for _ in range(num_chars):
        # No padding: index 0 is a real vocabulary character (see greedy_search)
        tokens = [char2idx.get(ch, 0) for ch in generated_text[-max_length:]] or [0]
        x = np.array([tokens], dtype=np.int32)
        logits = model(x, training=False).numpy()
        # Stable softmax in float64 so the probabilities sum to 1 within the
        # tolerance that np.random.choice enforces (float32 can fail that check)
        logits_scaled = logits[0, -1, :].astype(np.float64) / temperature
        probs = np.exp(logits_scaled - logits_scaled.max())
        probs /= probs.sum()
        next_char_idx = int(np.random.choice(len(probs), p=probs))
        generated_text += idx2char[next_char_idx]
    return generated_text

print("✅ sample_with_temperature function defined")
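
The effect of temperature is easiest to see on a fixed set of logits. The following sketch uses made-up scores for four characters and a hypothetical `softmax_with_temperature` helper written in pure Python; it is an illustration of the scaling step, not the notebook's sampling code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably in pure Python."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for 4 characters
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Temperatures below 1 concentrate probability on the top character (approaching greedy search), while temperatures above 1 flatten the distribution and make unlikely characters more probable.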

In [None]:
def beam_search_deterministic(model, seed_text, max_length, num_chars, beam_width=5):
    """Generate text with deterministic beam search."""
    seed_text = seed_text.lower()
    beams = [(seed_text, 0.0)]  # (text, cumulative log-probability)
    for _ in range(num_chars):
        all_candidates = []
        for text, score in beams:
            # No padding: index 0 is a real vocabulary character (see greedy_search)
            tokens = [char2idx.get(ch, 0) for ch in text[-max_length:]] or [0]
            x = np.array([tokens], dtype=np.int32)
            logits = model(x, training=False).numpy()
            log_probs = tf.nn.log_softmax(logits[0, -1, :]).numpy()
            # Expand each beam with its beam_width most likely next characters
            top_indices = np.argsort(log_probs)[-beam_width:]
            for idx in top_indices:
                all_candidates.append((text + idx2char[int(idx)], score + float(log_probs[idx])))
        # Keep only the beam_width best candidates overall
        all_candidates.sort(key=lambda x: x[1], reverse=True)
        beams = all_candidates[:beam_width]
    # Length-normalize the scores (all beams have the same length here)
    final_sequences = [(text, score / len(text)) for text, score in beams]
    final_sequences.sort(key=lambda x: x[1], reverse=True)
    return final_sequences[0][0], final_sequences

print("✅ beam_search_deterministic function defined")
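
Beam search can find a sequence with higher total probability than greedy search because it keeps alternatives alive after a locally worse first step. The toy example below shows this on a hand-written table of next-character probabilities; `TOY_MODEL` and `toy_beam_search` are hypothetical stand-ins for the trained network, and a beam width of 1 is equivalent to greedy search.

```python
import math

# Hypothetical next-character distributions conditioned on the previous character
TOY_MODEL = {
    '^': {'a': 0.6, 'b': 0.4},
    'a': {'a': 0.55, 'b': 0.45},
    'b': {'a': 0.1, 'b': 0.9},
}

def toy_beam_search(start, steps, beam_width):
    """Keep the beam_width highest-scoring sequences at every step."""
    beams = [(start, 0.0)]  # (text, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for text, score in beams:
            for ch, p in TOY_MODEL[text[-1]].items():
                candidates.append((text + ch, score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy_text, greedy_score = toy_beam_search('^', steps=2, beam_width=1)[0]
beam_text, beam_score = toy_beam_search('^', steps=2, beam_width=2)[0]

print(f"greedy: {greedy_text} (p={math.exp(greedy_score):.2f})")
print(f"beam:   {beam_text} (p={math.exp(beam_score):.2f})")
```

Greedy search commits to `a` at the first step because it is locally best, while the wider beam keeps `b` and discovers that the continuation `bb` has a higher joint probability.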

In [None]:
def beam_search_stochastic(model, seed_text, max_length, num_chars, beam_width=5, temperature=1.0):
    """Generate text with stochastic beam search (candidates are sampled)."""
    seed_text = seed_text.lower()
    beams = [(seed_text, 0.0)]  # (text, cumulative log-probability)
    for _ in range(num_chars):
        all_candidates = []
        for text, score in beams:
            # No padding: index 0 is a real vocabulary character (see greedy_search)
            tokens = [char2idx.get(ch, 0) for ch in text[-max_length:]] or [0]
            x = np.array([tokens], dtype=np.int32)
            logits = model(x, training=False).numpy()
            # Stable temperature softmax in float64 (required by np.random.choice)
            logits_scaled = logits[0, -1, :].astype(np.float64) / temperature
            probs = np.exp(logits_scaled - logits_scaled.max())
            probs /= probs.sum()
            log_probs = np.log(probs + 1e-10)
            # Sampling without replacement needs at least `size` non-zero probabilities
            n_samples = min(beam_width, int(np.count_nonzero(probs)))
            sampled_indices = np.random.choice(len(probs), size=n_samples,
                                               replace=False, p=probs)
            for idx in sampled_indices:
                all_candidates.append((text + idx2char[int(idx)], score + float(log_probs[idx])))
        all_candidates.sort(key=lambda x: x[1], reverse=True)
        beams = all_candidates[:beam_width]
    final_sequences = [(text, score / len(text)) for text, score in beams]
    final_sequences.sort(key=lambda x: x[1], reverse=True)
    return final_sequences[0][0], final_sequences

print("✅ beam_search_stochastic function defined")

In [None]:
# Generation examples
seed = "ulises dijo"
print("="*70)
print(f"🔬 METHOD COMPARISON - Seed: '{seed}'")
print("="*70)

print("\n📍 GREEDY:")
print(greedy_search(model, seed, MAX_CONTEXT_SIZE, 100))

print("\n🎲 SAMPLING (T=0.7):")
print(sample_with_temperature(model, seed, MAX_CONTEXT_SIZE, 100, 0.7))

print("\n📊 BEAM SEARCH (width=5):")
result, _ = beam_search_deterministic(model, seed, MAX_CONTEXT_SIZE, 100, 5)
print(result)

print("\n🎯 STOCHASTIC BEAM SEARCH (width=5, T=0.7):")
result, _ = beam_search_stochastic(model, seed, MAX_CONTEXT_SIZE, 100, 5, 0.7)
print(result)

In [None]:
# Effect of temperature
seed = "el héroe regresó"
print("="*70)
print(f"🌡️ EFFECT OF TEMPERATURE - Seed: '{seed}'")
print("="*70)

for temp in [0.2, 0.5, 0.8, 1.0, 1.5]:
    print(f"\n🌡️ T = {temp}:")
    print(sample_with_temperature(model, seed, MAX_CONTEXT_SIZE, 80, temp))

In [None]:
# Save the model
model.save('best_char_lm_keras.keras')
print("✅ Model saved to 'best_char_lm_keras.keras'")

---
## End of Challenge 3 (Keras Version)

✅ **Objectives achieved:**
1. Corpus selected and preprocessed
2. Character-level tokenization implemented
3. Three RNN architectures evaluated (SimpleRNN, LSTM, GRU)
4. Four generation methods implemented