# Verse Generation with Sequential Models
In this section we work with sequential models such as *Recurrent Neural Networks* (`RNNs`) and *Long Short-Term Memory* networks (`LSTMs`). Although these models are not as powerful as some of the generative models we will see later, we found them worth developing because they reflect how generative models actually evolved over the years.

Because we initially considered the number of instances in the dataset we used for the **Quran** to be too small, we decided to work with a second dataset that offers almost ten times as many training instances as the first. [The dataset, available on Kaggle](https://www.kaggle.com/datasets/fahd09/hadith-dataset), collects several `Hadith`-s: records of the actions or sayings of the **Prophet Mohammed**. However, this file has a completely different structure from the usual one, so it requires different cleaning and processing.

Throughout this notebook we create all the classes and functions needed to build and use the generative sequential models that preceded **Transformers**. The full analysis is covered in more depth in the main documentation.

### Required Libraries

In [11]:
# Dependencies
import os
from argparse import Namespace
import json
import re
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.nn import functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import fasttext
import unicodedata
from tqdm import tqdm_notebook
import math

### Code for the Required Classes and Functions

The general `Vocabulary` class defines the core methods that the task-specific vocabulary class inherits:

In [12]:
class Vocabulary:
    def __init__(self, token_to_idx=None):
        # initialize attributes
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = dict(token_to_idx)
        self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}

    def to_serializable(self):
        # serialize the token (label) -> idx (int) dictionary
        return {"token_to_idx": self._token_to_idx}

    @classmethod
    def from_serializable(cls, contents):
        return cls(token_to_idx=contents["token_to_idx"])

    def add_token(self, token):
        # add a (new) token to the dictionary; existing tokens keep their index
        if token in self._token_to_idx:
            return self._token_to_idx[token]

        index = len(self._token_to_idx)
        self._token_to_idx[token] = index
        self._idx_to_token[index] = token
        return index

    def add_many_tokens(self, tokens):
        # add several tokens to the dictionary
        return [self.add_token(t) for t in tokens]

    def lookup_token(self, token):
        # return the idx of the given token
        return self._token_to_idx[token]

    def lookup_index(self, index):
        # return the token for the given idx
        if index not in self._idx_to_token:
            return "<UNK>"
        return self._idx_to_token[index]

    def __len__(self):
        # return the size of the dictionary
        return len(self._token_to_idx)

    def __str__(self):
        # return a readable summary that includes the vocabulary size
        return f"<Vocabulary(size={len(self)})>"

Quran-specific vocabulary with the special tokens `<MASK>`, `<UNK>`, `<BEGIN>` and `<END>`. It inherits from the base `Vocabulary` class.

In [13]:
class VocabularyCoran(Vocabulary):
    def __init__(self, token_to_idx=None, unk_token="<UNK>",
                 mask_token="<MASK>", begin_seq_token="<BEGIN>",
                 end_seq_token="<END>"):

        super().__init__(token_to_idx)
        # register the special tokens in our dictionary
        self._mask_token = mask_token
        self._unk_token = unk_token
        self._begin_seq_token = begin_seq_token
        self._end_seq_token = end_seq_token

        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = self.add_token(self._unk_token)
        self.begin_seq_index = self.add_token(self._begin_seq_token)
        self.end_seq_index = self.add_token(self._end_seq_token)

    def to_serializable(self):
        # serialize the token (label) -> idx (int) dictionary together with the special tokens
        contents = super().to_serializable()
        contents.update({
            "unk_token": self._unk_token,
            "mask_token": self._mask_token,
            "begin_seq_token": self._begin_seq_token,
            "end_seq_token": self._end_seq_token
        })
        return contents

    @classmethod
    def from_serializable(cls, contents):
        vocab = cls(
            token_to_idx=contents["token_to_idx"],
            unk_token=contents["unk_token"],
            mask_token=contents["mask_token"],
            begin_seq_token=contents["begin_seq_token"],
            end_seq_token=contents["end_seq_token"],
        )
        return vocab

    def lookup_token(self, token):
        # return the idx of the given token, or the <UNK> idx if the token is unknown
        return self._token_to_idx.get(token, self.unk_index)

Vectorizer: the class responsible for converting text into index vectors:


In [14]:
class CoranVectorizer:
    def __init__(self, char_vocab: VocabularyCoran):
        # store the character vocabulary
        self.char_vocab = char_vocab

    def vectorize(self, text: str, vector_length: int):
        # vectorize a text into shifted input/target index vectors

        indices = [self.char_vocab.begin_seq_index] # prepend <BEGIN>
        indices.extend(self.char_vocab.lookup_token(ch) for ch in text) # then the index of each character in the text
        indices.append(self.char_vocab.end_seq_index) # append <END>

        from_indices = indices[:-1]
        to_indices = indices[1:]

        # from_vector holds <BEGIN> followed by the sequence tokens (without <END>)
        from_vector = np.full(vector_length, fill_value=self.char_vocab.mask_index, dtype=np.int64)
        # to_vector holds the sequence tokens followed by <END>
        to_vector = np.full(vector_length, fill_value=self.char_vocab.mask_index, dtype=np.int64)

        n = min(vector_length, len(from_indices))
        from_vector[:n] = from_indices[:n]

        n = min(vector_length, len(to_indices))
        to_vector[:n] = to_indices[:n]

        return from_vector, to_vector

    @classmethod
    def from_dataframe(cls, df: pd.DataFrame, text_col="text"):
        char_vocab = VocabularyCoran()
        for text in df[text_col].astype(str):
            for ch in text:
                char_vocab.add_token(ch)
        return cls(char_vocab)

    def to_serializable(self):
        return {"char_vocab": self.char_vocab.to_serializable()}

    @classmethod
    def from_serializable(cls, contents):
        char_vocab = VocabularyCoran.from_serializable(contents["char_vocab"])
        return cls(char_vocab)
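
The shifted input/target pairing that `vectorize` produces is easier to see on a tiny example. The following is a minimal, framework-free sketch rather than the notebook's class: `build_vocab` and the standalone `vectorize` are hypothetical helpers, and the special-token indices (0 to 3) are assumed to follow the order in which `VocabularyCoran` registers them when it starts from an empty dictionary.

```python
# Assumed indices of the special tokens (registration order: mask, unk, begin, end).
MASK, UNK, BEGIN, END = 0, 1, 2, 3

def build_vocab(texts):
    """Map each distinct character to an index placed after the four special tokens."""
    vocab = {"<MASK>": MASK, "<UNK>": UNK, "<BEGIN>": BEGIN, "<END>": END}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def vectorize(text, vocab, vector_length):
    """Return (input, target) index lists, shifted by one position and padded with MASK."""
    indices = [BEGIN] + [vocab.get(ch, UNK) for ch in text] + [END]
    from_indices, to_indices = indices[:-1], indices[1:]

    def pad(seq):
        # truncate or right-pad to exactly vector_length entries
        return (seq + [MASK] * vector_length)[:vector_length]

    return pad(from_indices), pad(to_indices)

vocab = build_vocab(["abc"])          # a -> 4, b -> 5, c -> 6
x, y = vectorize("abc", vocab, vector_length=6)
print(x)  # [2, 4, 5, 6, 0, 0]
print(y)  # [4, 5, 6, 3, 0, 0]
```

At every position the target is the character that follows the input, so the model learns next-character prediction; the trailing zeros are padding that the loss later ignores.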

Training helpers (evaluation metrics, training state, ...):

In [15]:
def generate_batches(dataset, batch_size, device, shuffle=True, drop_last=True):
    # yield batches moved to the target device (the GPU if CUDA is available, otherwise the CPU)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    for batch in dataloader:
        yield {k: v.to(device) for k, v in batch.items()}

def sequence_loss(y_pred, y_true, mask_index):
    # loss function: cross-entropy, which compares the predicted distribution with the ground truth
    # B, T and V are the dimensions of the predicted tensor (batch, time steps, vocabulary size)
    B, T, V = y_pred.shape
    y_pred = y_pred.reshape(B * T, V)
    y_true = y_true.reshape(B * T)
    # compare predicted and true distributions, ignoring padding positions
    loss_fn = nn.CrossEntropyLoss(ignore_index=mask_index)
    return loss_fn(y_pred, y_true)

def compute_accuracy(y_pred, y_true, mask_index):
    # accuracy: compares each predicted character with the ground truth, skipping padding
    y_hat = y_pred.argmax(dim=-1)  
    valid = (y_true != mask_index)
    correct = (y_hat == y_true) & valid
    denom = valid.sum().item()
    if denom == 0:
        return 0.0
    return correct.sum().item() / denom

def make_train(args):
    # taken from the ALUD notebook: initial training state
    return {"stop_early": False,
            "early_stopping_step": 0,
            "early_stopping_best_val": 1e8,
            "epoch_index": 0,
            "train_loss": [],
            "train_acc": [],
            "val_loss": [],
            "val_acc": [],
            "model_filename": args.model_state_file}

def update_training_state(args, model, train_state):
    # track improvement/degradation of the validation loss -> early stopping
    if train_state["epoch_index"] == 0:
        torch.save(model.state_dict(), train_state["model_filename"])
        train_state["stop_early"] = False
        return train_state

    # early-stopping logic
    loss_t = train_state["val_loss"][-1]
    if loss_t < train_state["early_stopping_best_val"]:
        torch.save(model.state_dict(), train_state["model_filename"])
        train_state["early_stopping_best_val"] = loss_t
        train_state["early_stopping_step"] = 0
    else:
        train_state["early_stopping_step"] += 1

    train_state["stop_early"] = train_state["early_stopping_step"] >= args.early_stopping_criteria
    return train_state
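
To make the early-stopping rule in `update_training_state` concrete, here is a minimal sketch that replays the same logic over a list of validation losses. `simulate_early_stopping` is a hypothetical helper and the loss values are made up for illustration; only the update rule mirrors the function above.

```python
def simulate_early_stopping(val_losses, patience):
    """Replay the update rule over a list of per-epoch validation losses.

    Returns (stop_epoch, best_loss); stop_epoch is None if training never stops early.
    """
    best = 1e8       # same starting value as early_stopping_best_val
    bad_epochs = 0   # plays the role of early_stopping_step
    for epoch, loss in enumerate(val_losses):
        if epoch == 0:
            continue  # the first epoch only saves a checkpoint
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return epoch, best
    return None, best

# Hypothetical losses: the last improvement happens at epoch index 2.
print(simulate_early_stopping([3.0, 2.5, 2.4, 2.45, 2.41, 2.42], patience=3))  # (5, 2.4)
```

One consequence of skipping epoch 0 is that the first validation loss is never stored as the best value, so the loss at epoch 1 always counts as an improvement.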

Functions for sampling and displaying new verses; we use them once the models have been trained.

In [16]:
def sample_from_model(model, vectorizer, num_samples=10, max_length=300, temperature=0.8, top_k=None):
    # sample new verses from the model, returned as lists of indices
    # by default, 10 samples
    model.eval()
    vocab = vectorizer.char_vocab
    device = next(model.parameters()).device
    samples = []

    for _ in range(num_samples):
        indices = [vocab.begin_seq_index]

        for _ in range(max_length):
            x = torch.tensor(indices, dtype=torch.long, device=device).unsqueeze(0)

            with torch.no_grad():
                logits = model(x, apply_softmax=False)         
                next_logits = logits[0, -1] / max(temperature, 1e-8)

                if top_k is not None and top_k > 0:
                    v, ix = torch.topk(next_logits, k=top_k)
                    filtered = torch.full_like(next_logits, float("-inf"))
                    filtered[ix] = v
                    next_logits = filtered

                probs = torch.softmax(next_logits, dim=0)
                next_index = torch.multinomial(probs, 1).item()

            if next_index == vocab.end_seq_index:
                break
            indices.append(next_index)

        samples.append(indices)

    return samples

def decode_samples(sampled_indices, vectorizer):
    # map the indices sampled by the previous function back to characters
    char_vocab = vectorizer.char_vocab
    decoded = []

    for indices in sampled_indices:
        chars = [
            char_vocab.lookup_index(idx)
            for idx in indices
            if idx not in (
                char_vocab.begin_seq_index,
                char_vocab.end_seq_index,
                char_vocab.mask_index
            )
        ]
        decoded.append("".join(chars))

    return decoded
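
The effect of `temperature` and `top_k` in `sample_from_model` can be reproduced without PyTorch. The sketch below is an approximation written with the standard library only: `sampling_probs` and `sample_index` are hypothetical names, and, unlike `torch.topk`, this version keeps every logit tied with the k-th largest and clamps `top_k` to the number of logits.

```python
import math
import random

def sampling_probs(logits, temperature=0.8, top_k=None):
    """Turn raw logits into sampling probabilities using temperature and optional top-k."""
    scaled = [value / max(temperature, 1e-8) for value in logits]
    if top_k is not None and top_k > 0:
        # keep the k largest logits; everything below the threshold gets probability 0
        threshold = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= threshold else float("-inf") for s in scaled]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=0.8, top_k=None, rng=random):
    """Draw one index according to the probabilities above."""
    probs = sampling_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]                        # made-up scores for three characters
print(sampling_probs(logits, temperature=1.0))
print(sampling_probs(logits, temperature=0.5))  # sharper: more mass on index 0
print(sampling_probs(logits, temperature=1.0, top_k=2))  # index 2 can no longer be drawn
```

Lower temperatures concentrate probability on the most likely character, which tends to give more conservative text, while `top_k` removes the low-probability tail entirely.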

Since we reuse the weights of the embedding model used earlier (`fastText`), we load them here:

In [17]:
def obtener_pesos(vectorizer, modelo_ft):
    vocab = vectorizer.char_vocab
    token_to_idx = vocab._token_to_idx
    tamaño_vocab = len(token_to_idx)
    embedding_dim = modelo_ft.get_dimension()
    pesos = np.zeros((tamaño_vocab, embedding_dim))

    for token, idx in token_to_idx.items():
        pesos[idx] = modelo_ft.get_word_vector(token)
    
    return torch.FloatTensor(pesos)

Function to generate and print the results:

In [18]:
def nuevos_versos(nombre_modelo, nombre_vectorizer):
    num_names = 10

    model = nombre_modelo.cpu()

    sampled_verses = decode_samples(
        sample_from_model(
            model,
            nombre_vectorizer,
            num_samples=num_names,
            max_length=300,
            temperature=0.8
        ),
        nombre_vectorizer
    )

    print("-" * 30)
    for i in range(num_names):
        print(f"\n Verse {i+1}:\n{sampled_verses[i]}")

### Training Functions: RNN and LSTM

The Quran `Dataset` class, adapted to the format in which the **Quran** file is written.

In [19]:
class CoranDataset(Dataset):
    def __init__(self, df: pd.DataFrame, vectorizer: CoranVectorizer, text_col="text"):
        self.df = df.reset_index(drop=True)
        self._vectorizer = vectorizer
        self._text_col = text_col

        self._max_seq_length = int(self.df[text_col].astype(str).map(len).max()) + 2 # +2 accounts for the <BEGIN> and <END> tokens

        n = len(self.df)
        train_end = int(n * 0.70) # 70% of the instances go to the training set
        val_end = int(n * .85) # 15% to the validation set and the remaining 15% to the test set

        self.train_df = self.df.iloc[:train_end]
        self.val_df = self.df.iloc[train_end:val_end]
        self.test_df = self.df.iloc[val_end:]

        self._lookup_dict = {
            "train": (self.train_df, len(self.train_df)),
            "val": (self.val_df, len(self.val_df)),
            "test": (self.test_df, len(self.test_df)),
        }

        self.set_split("train")

    # from here on, the methods needed to load, save and index our specific dataset
    @classmethod
    def load_dataset_and_make_vectorizer(cls, coran_txt, sep="|"):
        df = pd.read_csv(coran_txt, sep=sep, names=["sura", "ayah", "text"])
        df["text"] = df["text"].astype(str).str.lower()
        vectorizer = CoranVectorizer.from_dataframe(df, text_col="text")
        return cls(df, vectorizer, text_col="text")

    @classmethod
    def load_dataset_and_load_vectorizer(cls, coran_txt, vectorizer_filepath, sep="|"):
        df = pd.read_csv(coran_txt, sep=sep, names=["sura", "ayah", "text"])
        df["text"] = df["text"].astype(str).str.lower()
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(df, vectorizer, text_col="text")

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath, "r", encoding="utf-8") as f:
            contents = json.load(f)
        return CoranVectorizer.from_serializable(contents)

    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, "w", encoding="utf-8") as f:
            json.dump(self._vectorizer.to_serializable(), f, ensure_ascii=False)

    def get_vectorizer(self):
        return self._vectorizer

    def set_split(self, split="train"):
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        row = self._target_df.iloc[index]
        text = str(row[self._text_col])
        x, y = self._vectorizer.vectorize(text, vector_length=self._max_seq_length)
        return {"x_data": torch.tensor(x, dtype=torch.long),
                "y_target": torch.tensor(y, dtype=torch.long)}

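The 70/15/15 split in `CoranDataset.__init__` slices the rows in file order, without shuffling, so the test slice consists of the last rows of the file. A small sketch of the resulting sizes follows; `split_bounds` and `split_sizes` are hypothetical helpers, and 6236 is only an illustrative row count (the commonly cited number of verses in the Quran), since the actual file may contain a different number of rows.

```python
def split_bounds(n, train_end_frac=0.70, val_end_frac=0.85):
    """Return the row offsets where the train and validation slices end."""
    return int(n * train_end_frac), int(n * val_end_frac)

def split_sizes(n):
    """Return the (train, val, test) sizes produced by the sequential split."""
    train_end, val_end = split_bounds(n)
    return train_end, val_end - train_end, n - val_end

print(split_sizes(6236))  # (4365, 935, 936)
```

Because `int` truncates, the three sizes always add up to the number of rows, with any remainder falling into the last slices.
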
Training function for the RNN, the first architecture we use for the network.

In [20]:
def train_RNN(coran_path, output_path, ruta_ft):
    args = Namespace(
        coran_txt=coran_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # 300 because the fastText embeddings have 300 dimensions; the sizes must match
        rnn_hidden_size=128, # size of the RNN hidden state

        seed=1337,
        learning_rate=1e-3, # learning rate
        batch_size=256,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    # resolve the paths used to save/load files
    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = CoranDataset.load_dataset_and_load_vectorizer(args.coran_txt, args.vectorizer_file)
    else:
        dataset = CoranDataset.load_dataset_and_make_vectorizer(args.coran_txt)
        dataset.save_vectorizer(args.vectorizer_file)

    dataset.set_split("train")
    train_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    dataset.set_split("val")
    val_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    dataset.set_split("test")
    test_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    # define and call the helper that builds the embedding matrix from the fastText weights
    def obtener_pesos(vectorizer, modelo_ft):
        vocab = vectorizer.char_vocab
        token_to_idx = vocab._token_to_idx
        tamaño_vocab = len(token_to_idx)
        embedding_dim = modelo_ft.get_dimension()
        pesos = np.zeros((tamaño_vocab, embedding_dim))

        for token, idx in token_to_idx.items():
            pesos[idx] = modelo_ft.get_word_vector(token)
    
        return torch.FloatTensor(pesos)

    ft_model = fasttext.load_model(ruta_ft)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_model)

    # create the model
    model = CoranRNN(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        rnn_hidden_size=args.rnn_hidden_size,
        padding_idx=mask_index,
        pretrained_embeddings_ft=pretrained_ft_pesos
    ).to(args.device)

    # from here on, a standard training loop
    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Validation
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        val_ppl = math.exp(vloss)
        train_state.setdefault("val_ppl", []).append(val_ppl)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
                f"| val_loss={vloss:.4f} | val_ppl={val_ppl:.2f} | val_acc={vacc:.4f}")

        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping triggered.")
            break

    return args, dataset, vectorizer, model
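
Two pieces of arithmetic in the loop above are worth spelling out: the incremental mean used for `running_loss` and `vloss`, and the perplexity computed as `math.exp(vloss)`. The sketch below is a standalone illustration; `running_mean` and `perplexity` are hypothetical helper names and the batch losses are made up.

```python
import math

def running_mean(values):
    """Incremental mean, updated the same way as running_loss in the training loop."""
    mean = 0.0
    for i, value in enumerate(values):
        mean += (value - mean) / (i + 1)
    return mean

def perplexity(mean_cross_entropy):
    """Exponential of the mean cross-entropy (measured in nats)."""
    return math.exp(mean_cross_entropy)

batch_losses = [3.0, 2.0, 1.0]          # made-up per-batch losses
print(running_mean(batch_losses))       # 2.0, the plain average
print(round(perplexity(2.9685), 2))     # 19.46
```

The second value matches the `val_loss=2.9685 | val_ppl=19.46` pair printed for the first epoch of the Arabic RNN run. Note that the loop averages per-batch means, so the result is only approximately a per-character average when batches contain different numbers of non-padding positions.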

Training function for the LSTM, almost identical to the RNN one.

In [21]:
def train_LSTM(coran_path, output_path, ruta_ft):
    args = Namespace(
        coran_txt=coran_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # same as the fastText dimension
        lstm_hidden_size=256,

        seed=1337,
        learning_rate=1e-3,
        batch_size=64,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = CoranDataset.load_dataset_and_load_vectorizer(args.coran_txt, args.vectorizer_file)
    else:
        dataset = CoranDataset.load_dataset_and_make_vectorizer(args.coran_txt)
        dataset.save_vectorizer(args.vectorizer_file)

    dataset.set_split("train")
    train_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    dataset.set_split("val")
    val_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    dataset.set_split("test")
    test_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    def obtener_pesos(vectorizer, modelo_ft):
        vocab = vectorizer.char_vocab
        token_to_idx = vocab._token_to_idx
        tamaño_vocab = len(token_to_idx)
        embedding_dim = modelo_ft.get_dimension()
        pesos = np.zeros((tamaño_vocab, embedding_dim))

        for token, idx in token_to_idx.items():
            pesos[idx] = modelo_ft.get_word_vector(token)
    
        return torch.FloatTensor(pesos)

    ft_model = fasttext.load_model(ruta_ft)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_model)

    model = CoranLSTM(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        lstm_hidden_size=args.lstm_hidden_size,
        padding_idx=mask_index,
        pretrained_embeddings_ft=pretrained_ft_pesos
    ).to(args.device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Val
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        val_ppl = math.exp(vloss)
        train_state.setdefault("val_ppl", []).append(val_ppl)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
                f"| val_loss={vloss:.4f} | val_ppl={val_ppl:.2f} | val_acc={vacc:.4f}")
        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping triggered.")
            break

    return args, dataset, vectorizer, model
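
Both training loops score predictions with `sequence_loss` and `compute_accuracy`, which skip every position whose target is the mask index. The following framework-free sketch shows what that masking amounts to; `masked_metrics` is a hypothetical helper that takes probabilities instead of logits, and the numbers are invented for the example.

```python
import math

def masked_metrics(probs, targets, mask_index):
    """Mean negative log-likelihood and accuracy over non-padding positions.

    probs holds one probability list per position; targets holds one index per position.
    """
    nll, correct, valid = 0.0, 0, 0
    for dist, target in zip(probs, targets):
        if target == mask_index:
            continue  # padding positions contribute to neither metric
        nll -= math.log(dist[target])
        predicted = max(range(len(dist)), key=dist.__getitem__)  # argmax
        correct += int(predicted == target)
        valid += 1
    if valid == 0:
        return 0.0, 0.0
    return nll / valid, correct / valid

probs = [[0.1, 0.7, 0.2],    # target 1, predicted 1 -> correct
         [0.1, 0.3, 0.6],    # target 1, predicted 2 -> wrong
         [0.9, 0.05, 0.05]]  # target 0 is the mask index -> ignored
loss, acc = masked_metrics(probs, targets=[1, 1, 0], mask_index=0)
print(loss, acc)
```

Averaging only over valid positions is also how `nn.CrossEntropyLoss` behaves with `ignore_index` and its default mean reduction, so long padded tails do not dilute the loss.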


## Quran Dataset

### RNN - Quran


RNN model for the Quran: the network's internal architecture with its usual *\__init__* and *forward* methods:

In [22]:
class CoranRNN(nn.Module):
    # our nn.Module for the RNN
    def __init__(self, vocab_size, embedding_size, rnn_hidden_size, padding_idx, dropout_p=0.5,
                 pretrained_embeddings_ft = None):
        super().__init__()
        # architecture of our RNN

        self.char_emb = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx) # embedding layer with one row per vocabulary entry
        # load the fastText embeddings (weights) here
        if pretrained_embeddings_ft is not None:
            self.char_emb.weight.data.copy_(pretrained_embeddings_ft)

        self.rnn = nn.RNN(embedding_size, rnn_hidden_size, batch_first=True, nonlinearity="tanh") # rnn
        self.fc = nn.Linear(rnn_hidden_size, vocab_size) # fully connected
        self.dropout_p = dropout_p # dropout probability

    def forward(self, x_in, apply_softmax=False):
        x_emb = self.char_emb(x_in)             
        y_out, _ = self.rnn(x_emb)               
        y_out = F.dropout(y_out, p=self.dropout_p, training=self.training)
        logits = self.fc(y_out)                  
        if apply_softmax:
            return F.softmax(logits, dim=-1)
        return logits
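
For reference, the computation that `forward` performs at each time step can be written out as follows. This is a sketch based on the standard definition of a single-layer `nn.RNN` with `tanh` and a zero initial hidden state; the symbols ($e_t$, $h_t$, $z_t$, $W_{ih}$, $W_{hh}$, $W_{fc}$ and the biases) are notation introduced here and do not appear in the notebook.

```latex
\begin{aligned}
e_t &= \mathrm{Embedding}(x_t) \\
h_t &= \tanh\left(W_{ih}\, e_t + b_{ih} + W_{hh}\, h_{t-1} + b_{hh}\right), \qquad h_0 = 0 \\
z_t &= W_{fc}\, \mathrm{Dropout}_p(h_t) + b_{fc} \\
P(x_{t+1} = k \mid x_{\le t}) &= \mathrm{softmax}(z_t)_k
\end{aligned}
```

With the settings used in `train_RNN`, $e_t$ has 300 components and $h_t$ has 128, while $z_t$ has one logit per vocabulary entry. The softmax is only applied inside `forward` when `apply_softmax=True`; during training the raw logits $z_t$ go straight to the cross-entropy loss.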

Training the RNN on the Arabic Quran:

In [23]:
args, dataset, vectorizer, model_rnn = train_RNN(coran_path="../data/cleaned_data/cleaned_arab_quran.txt",
                                                 output_path="Unai/Models/RNN/arab/coran/coran_rnn_v1",
                                                 ruta_ft="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=3.3638 | val_loss=2.9685 | val_ppl=19.46 | val_acc=0.2478
Epoch 002 | train_loss=2.9649 | val_loss=2.8756 | val_ppl=17.74 | val_acc=0.2209
Epoch 003 | train_loss=2.8369 | val_loss=2.7578 | val_ppl=15.76 | val_acc=0.2529
Epoch 004 | train_loss=2.6940 | val_loss=2.6387 | val_ppl=14.00 | val_acc=0.3005
Epoch 005 | train_loss=2.5849 | val_loss=2.5473 | val_ppl=12.77 | val_acc=0.3085
Epoch 006 | train_loss=2.5143 | val_loss=2.4867 | val_ppl=12.02 | val_acc=0.3226
Epoch 007 | train_loss=2.4624 | val_loss=2.4416 | val_ppl=11.49 | val_acc=0.3298
Epoch 008 | train_loss=2.4258 | val_loss=2.4055 | val_ppl=11.08 | val_acc=0.3394
Epoch 009 | train_loss=2.3955 | val_loss=2.3772 | val_ppl=10.78 | val_acc=0.3416
Epoch 010 | train_loss=2.3705 | val_loss=2.3520 | val_ppl=10.51 | val_acc=0.3457
Epoch 011 | train_loss=2.3471 | val_loss=2.3302 | val_ppl=10.28 | val_acc=0.3510
Epoch 012 | train_loss=2.3269 | val_loss=2.3095 | val_ppl=10.07 | val_acc=0.3571
Epoch 013 | train_loss=2.307

In [24]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
اقال الله ۚ ولذي الالنهم ۚ واستقزوه الصهم ولي الددكم فاسرب الذين كانوا يقولون الا تصابا انتي اوري الي اليه لا يعلمون

 Verse 2:
ان الال لنيع لي تريك عليهم بخليه ۗ وهم فض خلقا ۗ ان تجب والمرترون الله في الرحون غشار الا عن يحمئت ولي الي في ال شرك ۖ فاتلمنوا شمي بعاء الا ملهم من ايله الا تد لهم من عليه ما يخرجون

 Verse 3:
يل ذي المنوا ييدونا من يحم ۚ واللما جوت الله له ان ارزين الله ۚ انهم كنفتم تعضد الاكيب

 Verse 4:
قال لفك المستكان عليكم ان الاعاء لا خشي اللم

 Verse 5:
من لم فسول الي الملائ ۚ قوله ساء الذين تمبرون

 Verse 6:
والله السجات ۚ قاله قالوا اعلم يلينا لهم من تلكم من الذراه علي المرقين

 Verse 7:
واليه وقال الله الا فاطين لا تمتماء فيسمع ان يشرسكم من الذين افحيم

 Verse 8:
يب سف اوترا لهم الي في الانت عليم

 Verse 9:
والي انام ذل الله علي الله واتنكا ۖ والهم من الستيب

 Verse 10:
فان هذ كذله لجمثرين


Training the RNN on the English Quran:

In [25]:
args, dataset, vectorizer, model_rnn = train_RNN(coran_path="../data/cleaned_data/cleaned_english_quran.txt",
                                                 output_path="Unai/Models/RNN/english/coran/coran_rnn_v1",
                                                 ruta_ft="../src/modelos/fasttext_english_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=3.1022 | val_loss=2.8182 | val_ppl=16.75 | val_acc=0.1842
Epoch 002 | train_loss=2.7729 | val_loss=2.6313 | val_ppl=13.89 | val_acc=0.2449
Epoch 003 | train_loss=2.5656 | val_loss=2.4176 | val_ppl=11.22 | val_acc=0.3382
Epoch 004 | train_loss=2.3795 | val_loss=2.2630 | val_ppl=9.61 | val_acc=0.3666
Epoch 005 | train_loss=2.2622 | val_loss=2.1633 | val_ppl=8.70 | val_acc=0.3845
Epoch 006 | train_loss=2.1857 | val_loss=2.0922 | val_ppl=8.10 | val_acc=0.3939
Epoch 007 | train_loss=2.1291 | val_loss=2.0349 | val_ppl=7.65 | val_acc=0.4090
Epoch 008 | train_loss=2.0815 | val_loss=1.9840 | val_ppl=7.27 | val_acc=0.4243
Epoch 009 | train_loss=2.0393 | val_loss=1.9408 | val_ppl=6.96 | val_acc=0.4332
Epoch 010 | train_loss=2.0027 | val_loss=1.9006 | val_ppl=6.69 | val_acc=0.4423
Epoch 011 | train_loss=1.9676 | val_loss=1.8637 | val_ppl=6.45 | val_acc=0.4493
Epoch 012 | train_loss=1.9364 | val_loss=1.8311 | val_ppl=6.24 | val_acc=0.4612
Epoch 013 | train_loss=1.9081 | val_l

We generate the new verses:

In [26]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
and have sereld

 Verse 2:
and about you of efilper for the dos fade for them a there allah becisely to the sild at for he who de the pars and the remant that whime sais he allah parase and is that way for the ealled o murist on is in the exaltion if allah have mestedert of the sene to the enister than thatish and exallt will

 Verse 3:
indeed the withe say the mereis or goud ferreas indeed in you and whore and seek not but they it enter dourmenn and messengers wead not believe in the herother dose ho dinngned then we proty who son in the ranseon that whe the magot aid in mall and their lord and except the poiginh not and it an one

 Verse 4:
indeat intrees in chater and distared and thith allah he will be and darthin so you is ahmease gord the do thise and that allah seide phorlond to lich it and what is merusrecslise of it in thome mund in forit and war and in the sore made which our people they trey from you allah for when allah who w

 Vers

### LSTM - Quran

LSTM model for the Quran:

In [27]:
class CoranLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_size, lstm_hidden_size, padding_idx, dropout_p=0.5, pretrained_embeddings_ft=None):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx)
        # Optionally initialize the embedding table from pretrained fastText vectors
        if pretrained_embeddings_ft is not None:
            self.char_emb.weight.data.copy_(pretrained_embeddings_ft)
        self.lstm = nn.LSTM(embedding_size, lstm_hidden_size, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_size, vocab_size)
        self.dropout_p = dropout_p

    def forward(self, x_in, apply_softmax=False):
        x_emb = self.char_emb(x_in)              # (batch, seq_len, embedding_size)
        y_out, _ = self.lstm(x_emb)              # (batch, seq_len, lstm_hidden_size)
        # Dropout is only active in training mode
        y_out = F.dropout(y_out, p=self.dropout_p, training=self.training)
        logits = self.fc(y_out)                  # (batch, seq_len, vocab_size)
        return F.softmax(logits, dim=-1) if apply_softmax else logits

Training the LSTM on the Arabic Quran:

In [28]:
args, dataset, vectorizer, model_lstm = train_LSTM(coran_path="../data/cleaned_data/cleaned_arab_quran.txt",
                                                 output_path="Unai/Models/LSTM/arab/coran/coran_lstm_v1",
                                                 ruta_ft="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.9997 | val_loss=2.7963 | val_ppl=16.38 | val_acc=0.2471
Epoch 002 | train_loss=2.5932 | val_loss=2.5047 | val_ppl=12.24 | val_acc=0.3136
Epoch 003 | train_loss=2.4149 | val_loss=2.3790 | val_ppl=10.79 | val_acc=0.3430
Epoch 004 | train_loss=2.3033 | val_loss=2.2856 | val_ppl=9.83 | val_acc=0.3595
Epoch 005 | train_loss=2.2160 | val_loss=2.2091 | val_ppl=9.11 | val_acc=0.3796
Epoch 006 | train_loss=2.1418 | val_loss=2.1396 | val_ppl=8.50 | val_acc=0.3980
Epoch 007 | train_loss=2.0757 | val_loss=2.0836 | val_ppl=8.03 | val_acc=0.4191
Epoch 008 | train_loss=2.0219 | val_loss=2.0364 | val_ppl=7.66 | val_acc=0.4321
Epoch 009 | train_loss=1.9744 | val_loss=1.9981 | val_ppl=7.37 | val_acc=0.4425
Epoch 010 | train_loss=1.9343 | val_loss=1.9641 | val_ppl=7.13 | val_acc=0.4531
Epoch 011 | train_loss=1.8980 | val_loss=1.9401 | val_ppl=6.96 | val_acc=0.4624
Epoch 012 | train_loss=1.8659 | val_loss=1.9140 | val_ppl=6.78 | val_acc=0.4681
Epoch 013 | train_loss=1.8372 | val_l

In [29]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
وضلغ عليهم من الذين امنوا وعملوا الصالحات وما انزل الي الرحمن فلينات حجاله الا ما شاء الله ۖ ومن المؤمنين ۖ ان الله كان بينهما ۖ ومن يعد ولا اذا ما لم كنتم من دان ولين ۚ كذلك هذا لفذكر الله وتلكا من الكافرين

 Verse 2:
وان لك مواي الا ان اختتك فيها لا يهدي من يشاء من بان الله اشركم مما يضتهم الله وهم لا يحسنون عما قبل من الناس برسول الله ۖ فرب السمات والارض ولا تعذبهم بالوماه وسمع الله ما كنتم تعلمون

 Verse 3:
قال رب اني وانا وما تخذ الذين كفروا الله ۖ فان انزل الله للذين كذبوا ۚ ان كل ملك في الارض الموت ۚ ولا اسما الولا الكافرين

 Verse 4:
اويجيمون الله غير الله الا ان يفيروا انه لا يسالهم ما ما اعلم بناء ان يتقون ۖ فمن اكفر الا جاء الله بما يشعرون

 Verse 5:
وان الذين امنوا وعلموا ان هذا الذي انا كنا واحده ما كانوا يعملون

 Verse 6:
يالم لم تنتظا انهم يستهبظون والذين امنوا والموادوا وهم يتقون ۚ انه الموت ولا تحشرون

 Verse 7:
قال امنكم من ان يعبر الله اجلا ولكن اجر الا ما انزل اليك لمن دونك يدعل فان الله ازيد يحي عيد الموسي ۖ ولما اخرج ان اح

We launch the English LSTM training:

In [30]:
args, dataset, vectorizer, model_lstm = train_LSTM(coran_path="../data/cleaned_data/cleaned_english_quran.txt",
                                                 output_path="Unai/Models/LSTM/english/coran/coran_lstm_v1",
                                                 ruta_ft="../src/modelos/fasttext_english_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.8106 | val_loss=2.4503 | val_ppl=11.59 | val_acc=0.2774
Epoch 002 | train_loss=2.2447 | val_loss=2.0349 | val_ppl=7.65 | val_acc=0.4028
Epoch 003 | train_loss=1.9601 | val_loss=1.8020 | val_ppl=6.06 | val_acc=0.4726
Epoch 004 | train_loss=1.7781 | val_loss=1.6467 | val_ppl=5.19 | val_acc=0.5090
Epoch 005 | train_loss=1.6492 | val_loss=1.5333 | val_ppl=4.63 | val_acc=0.5450
Epoch 006 | train_loss=1.5499 | val_loss=1.4529 | val_ppl=4.28 | val_acc=0.5650
Epoch 007 | train_loss=1.4753 | val_loss=1.3881 | val_ppl=4.01 | val_acc=0.5818
Epoch 008 | train_loss=1.4173 | val_loss=1.3384 | val_ppl=3.81 | val_acc=0.5954
Epoch 009 | train_loss=1.3705 | val_loss=1.2983 | val_ppl=3.66 | val_acc=0.6106
Epoch 010 | train_loss=1.3298 | val_loss=1.2679 | val_ppl=3.55 | val_acc=0.6186
Epoch 011 | train_loss=1.2970 | val_loss=1.2387 | val_ppl=3.45 | val_acc=0.6247
Epoch 012 | train_loss=1.2700 | val_loss=1.2191 | val_ppl=3.38 | val_acc=0.6298
Epoch 013 | train_loss=1.2450 | val_los

In [31]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
and indeed they will be recompensed except for a man or harms it in the believers will and have a virmonation and then for them has sin indeed he is the knowing and markmeds and ever is allah to him belongs the peace of earth so we submitted to him and they will say the unly of the dosire to a strai

 Verse 2:
and one a him the sea and upon moses and a prophet from the great punishment a good provision he could not give him to made for them will pass while you instinat the sin to paradise whom you were a greater partners of them inspresting to allah

 Verse 3:
and indeed thele except that you may have among them is a corrupter from them and allah is with the righteous

 Verse 4:
and when theas the light and become worthless allah ssidus so benefit things and recovends to fire but their lord benefit them chosen indeed allah is fortiving in one who are the established sinclest our lord in like the signs of allah

 Verse 5:
and moses seek the trut

## Hadith Dataset (Kaggle)

We inspect the structure of the DataFrame and keep the two columns of interest, `text_ar` and `text_en`, which hold the hadith texts. Because the file is laid out in an unusual way, we will clean it thoroughly.

In [32]:
hadith_df = pd.read_csv("../data/hadith_dataset/all_hadiths_clean.csv")
hadith_df.head(1)

Unnamed: 0,id,hadith_id,source,chapter_no,hadith_no,chapter,chain_indx,text_ar,text_en
0,0,1,Sahih Bukhari,1,1,Revelation - كتاب بدء الوحى,"30418, 20005, 11062, 11213, 11042, 3",حدثنا الحميدي عبد الله بن الزبير، قال حدثنا سف...,Narrated 'Umar bin Al-Khattab: ...


Arabic:

In [33]:
hadith_ar = hadith_df["text_ar"]
hadith_ar = pd.DataFrame(hadith_ar).dropna()
hadith_ar.to_csv("../data/hadith_dataset/hadith_ar/hadith_ar.csv", index=False, encoding="utf-8")
print(hadith_ar.count())
pd.set_option('display.max_colwidth', None)
hadith_ar.head(1)

text_ar    34433
dtype: int64


Unnamed: 0,text_ar
0,"حدثنا الحميدي عبد الله بن الزبير، قال حدثنا سفيان، قال حدثنا يحيى بن سعيد الأنصاري، قال أخبرني محمد بن إبراهيم التيمي، أنه سمع علقمة بن وقاص الليثي، يقول سمعت عمر بن الخطاب رضى الله عنه على المنبر قال سمعت رسول الله صلى الله عليه وسلم يقول ‏""‏ إنما الأعمال بالنيات، وإنما لكل امرئ ما نوى، فمن كانت هجرته إلى دنيا يصيبها أو إلى امرأة ينكحها فهجرته إلى ما هاجر إليه ‏""‏‏.‏"


Arabic cleaning:

In [34]:
QUOTE_CHARS = r"\"'“”„«»‹›`´"

# Arabic diacritics (harakat) + common Quranic marks
ARABIC_DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED]")

# Typical Unicode ranges for Arabic (basic + extended)
ARABIC_RANGES = r"\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF"

def _strip_wrapping_quotes(text: str, max_loops: int = 5) -> str:
    if not text:
        return text
    t = text.strip()
    for _ in range(max_loops):
        new_t = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', '', t)
        new_t = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', '', new_t)
        new_t = new_t.strip()
        if new_t == t:
            break
        t = new_t
    return t

def normalize_arabic(text: str, remove_diacritics: bool = True) -> str:
    # Unicode normalization (unifies equivalent forms)
    text = unicodedata.normalize("NFKC", text)

    # Remove tatweel (kashida)
    text = text.replace("\u0640", "")

    # Unify some common variants (optional, useful for many corpora)
    text = text.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")
    text = text.replace("ى", "ي")
    text = text.replace("ة", "ه")  # comment this line out to keep taa marbuta

    if remove_diacritics:
        text = re.sub(ARABIC_DIACRITICS, "", text)

    return text

def clean_hadith_text_ar(text, remove_diacritics: bool = True):
    if not isinstance(text, str):
        return ""

    # 1) Whitespace / line breaks
    text = text.replace("\n", " ").replace("\r", " ").strip()

    # 2) Strip wrapping quotes
    text = _strip_wrapping_quotes(text)

    # 3) Arabic normalization (no lowercasing)
    text = normalize_arabic(text, remove_diacritics=remove_diacritics)

    # 4) Remove the narrator header; the pattern expects a lowercase English "narrated ..." prefix, so Arabic-only text passes through unchanged
    palabras_clave = (
        r"(said|asked|the|i\s+heard|i\s+was\s+told|i\s+informed|while|informed|abu|allah|"
        r"if|when|once|some|whenever|it|sometimes|thereupon|then|and|but)"
    )
    patron_narrador = r'^\s*narrated\s+.*?[:\-]?\s*(?=\b' + palabras_clave + r'\b)'
    text = re.sub(patron_narrador, "", text).strip()

    # 5) Strip leftover quotes
    text = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', "", text)
    text = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', "", text)

    # 6) Keep Arabic letters, digits, whitespace and basic punctuation,
    # including Arabic punctuation: ، ؛ ؟ (comma, semicolon, question mark)
    allowed = rf"[^0-9\s{ARABIC_RANGES}\.,!?'\-\(\)«»\"،؛؟]"
    text = re.sub(allowed, " ", text)

    # 7) Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text

In [35]:
hadith_ar = hadith_df[["text_ar"]].copy()

# Drop rows with no Arabic text
hadith_ar = hadith_ar.dropna(subset=["text_ar"])

# Skip the first row
hadith_ar = hadith_ar.iloc[1:].reset_index(drop=True)

hadith_ar["text_ar"] = hadith_ar["text_ar"].apply(clean_hadith_text_ar)
# Note: this drops one more row, so the first two rows are excluded in total
hadith_ar = hadith_ar.iloc[1:].reset_index(drop=True)

# Discard rows that became empty after cleaning
hadith_ar = hadith_ar[hadith_ar["text_ar"] != ""].reset_index(drop=True)

output_path = "../data/hadith_dataset/hadith_ar/hadith_ar_cleaned.csv"

hadith_ar.to_csv(output_path, index=False, encoding="utf-8")

English, together with the English cleaning function:

In [36]:
QUOTE_CHARS = r"\"'“”„«»‹›`´"

def _strip_wrapping_quotes(text: str, max_loops: int = 5) -> str:
    if not text:
        return text

    t = text.strip()
    for _ in range(max_loops):
        # Strip quote characters at the start (^\s*[...]+) and at the end ([...]+\s*$)
        new_t = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', '', t)
        new_t = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', '', new_t)
        new_t = new_t.strip()
        if new_t == t:
            break
        t = new_t
    return t

def clean_hadith_text(text):
    if not isinstance(text, str):
        return ""

    text = text.replace('\n', ' ').replace('\r', ' ').strip()

    text = _strip_wrapping_quotes(text)

    text = text.replace('""', '"').lower()

    # Clean up the original .csv format: "narrated <narrator name>" followed by the text we want
    palabras_clave = (
        r"(said|asked|the|i\s+heard|i\s+was\s+told|i\s+informed|while|informed|abu|allah|"
        r"if|when|once|some|whenever|it|sometimes|thereupon|then|and|but)"
    )
    patron_narrador = r'^\s*narrated\s+.*?[:\-]?\s*(?=\b' + palabras_clave + r'\b)'
    text = re.sub(patron_narrador, '', text).strip()

    text = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', '', text)
    text = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', '', text)
    text = re.sub(r"[^a-z0-9\s.,!?'\-\(\)]", " ", text)

    text = re.sub(r"\s+", " ", text).strip()

    return text
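
The narrator-removal pattern is easier to reason about on concrete inputs. The sketch below rebuilds the same regular expression in isolation and applies it to three made-up, already lowercased strings; `strip_narrator` and the sample sentences are assumptions introduced for illustration only.

```python
import re

# Same keyword alternation as in clean_hadith_text
KEYWORDS = (
    r"(said|asked|the|i\s+heard|i\s+was\s+told|i\s+informed|while|informed|abu|allah|"
    r"if|when|once|some|whenever|it|sometimes|thereupon|then|and|but)"
)
NARRATOR = re.compile(r"^\s*narrated\s+.*?[:\-]?\s*(?=\b" + KEYWORDS + r"\b)")

def strip_narrator(text: str) -> str:
    """Hypothetical helper: drop the leading 'narrated <name>:' header."""
    return NARRATOR.sub("", text).strip()

# The lazy `.*?` stops at the first keyword, here "i heard"
a = strip_narrator("narrated 'umar bin al-khattab: i heard the prophet saying ...")

# "abu" is itself a keyword, so the match ends right after "narrated "
b = strip_narrator("narrated abu huraira: the prophet said ...")

# Text without the header is left untouched
c = strip_narrator("the prophet said ...")

print(a)
print(b)
print(c)
```

The second case shows a side effect of including `abu` among the keywords: when the narrator's name starts with it, only the word "narrated" is removed and the name stays in the text. This may be one reason some generated samples open with a name such as "abu ...".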

In [37]:
hadith_en = hadith_df[["text_en"]].copy()

# Drop rows with no English text
hadith_en = hadith_en.dropna(subset=["text_en"])

# Skip the first row
hadith_en = hadith_en.iloc[1:].reset_index(drop=True)

hadith_en["text_en"] = hadith_en["text_en"].apply(clean_hadith_text)
# Note: this drops one more row, so the first two rows are excluded in total
hadith_en = hadith_en.iloc[1:].reset_index(drop=True)

# Discard rows that became empty after cleaning
hadith_en = hadith_en[hadith_en["text_en"] != ""].reset_index(drop=True)

output_path = "../data/hadith_dataset/hadith_en/hadith_en_cleaned.csv"

hadith_en.to_csv(output_path, index=False, encoding="utf-8")

Dataset class for the Hadith data, adapted to the format of the *Kaggle* dataset:

In [38]:
class HadithDataset(Dataset):
    def __init__(self, df: pd.DataFrame, vectorizer: CoranVectorizer, text_col):
        # text_col: "text_en" (hadith_en) or "text_ar" (hadith_ar)
        self.df = df.reset_index(drop=True)
        self._vectorizer = vectorizer
        self._text_col = text_col
        self._max_seq_length = min(int(self.df[text_col].astype(str).map(len).max()) + 2, 500)        
        n = len(self.df)
        train_end = int(n * 0.70) # first 70% of the rows go to the train set
        val_end = int(n * .85) # next 15% to the validation set, remaining 15% to the test set

        self.train_df = self.df.iloc[:train_end]
        self.val_df = self.df.iloc[train_end:val_end]
        self.test_df = self.df.iloc[val_end:]

        self._lookup_dict = {
            "train": (self.train_df, len(self.train_df)),
            "val": (self.val_df, len(self.val_df)),
            "test": (self.test_df, len(self.test_df)),
        }

        self.set_split("train")

    @classmethod
    def load_dataset_and_make_vectorizer(cls, hadith_csv, text_col):
        df = pd.read_csv(hadith_csv)
        # Lowercase the column selected by text_col
        df[text_col] = df[text_col].astype(str).str.lower()
        vectorizer = CoranVectorizer.from_dataframe(df, text_col)
        return cls(df, vectorizer, text_col)

    @classmethod
    def load_dataset_and_load_vectorizer(cls, hadith_csv, vectorizer_filepath, text_col):
        df = pd.read_csv(hadith_csv)
        # Lowercase the column selected by text_col
        df[text_col] = df[text_col].astype(str).str.lower()
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(df, vectorizer, text_col)
        
    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath, "r", encoding="utf-8") as f:
            contents = json.load(f)
        return CoranVectorizer.from_serializable(contents)

    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, "w", encoding="utf-8") as f:
            json.dump(self._vectorizer.to_serializable(), f, ensure_ascii=False)

    def get_vectorizer(self):
        return self._vectorizer

    def set_split(self, split="train"):
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        row = self._target_df.iloc[index]
        text = str(row[self._text_col])
        x, y = self._vectorizer.vectorize(text, vector_length=self._max_seq_length)
        return {"x_data": torch.tensor(x, dtype=torch.long), "y_target": torch.tensor(y, dtype=torch.long)}
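
The split in `HadithDataset.__init__` is positional: rows are assigned to train, validation and test in file order, with no shuffling. A minimal sketch of that slicing logic on plain indices follows; the function name `sequential_split` and the row count are assumptions for illustration.

```python
def sequential_split(n_rows: int):
    """Hypothetical helper mirroring the 70/15/15 slicing by position."""
    train_end = int(n_rows * 0.70)
    val_end = int(n_rows * 0.85)
    indices = list(range(n_rows))
    # Contiguous blocks: no row can land in two splits
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = sequential_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
print(train_idx[-1], val_idx[0], val_idx[-1], test_idx[0])
```

Because the blocks are contiguous, the validation and test sets come from the end of the file. If the CSV happens to be ordered by collection (an assumption the first row, from Sahih Bukhari, hints at but does not prove), the later splits may differ in style from the training rows; shuffling the DataFrame with a fixed seed before slicing would be one way to avoid that.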

### RNN - Hadith

In [39]:
def train_RNN(hadith_path, output_path, text_col, ft_ruta):
    args = Namespace(
        hadith_csv=hadith_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # must match the 300-dimensional fastText embeddings
        rnn_hidden_size=128, # with 256 it crashes

        seed=1337,
        learning_rate=1e-3,
        batch_size=256,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    print(args.batch_size)

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = HadithDataset.load_dataset_and_load_vectorizer(args.hadith_csv, args.vectorizer_file, text_col)
    else:
        dataset = HadithDataset.load_dataset_and_make_vectorizer(args.hadith_csv, text_col)
        dataset.save_vectorizer(args.vectorizer_file)

    dataset.set_split("train")
    train_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    dataset.set_split("val")
    val_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    dataset.set_split("test")
    test_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    def obtener_pesos(vectorizer, modelo_ft):
        vocab = vectorizer.char_vocab
        token_to_idx = vocab._token_to_idx
        tamaño_vocab = len(token_to_idx)
        embedding_dim = modelo_ft.get_dimension()
        pesos = np.zeros((tamaño_vocab, embedding_dim))

        for token, idx in token_to_idx.items():
            pesos[idx] = modelo_ft.get_word_vector(token)
    
        return torch.FloatTensor(pesos)

    ft_model = fasttext.load_model(ft_ruta)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_model)

    model = CoranRNN(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        rnn_hidden_size=args.rnn_hidden_size,
        padding_idx=mask_index,
        pretrained_embeddings_ft=pretrained_ft_pesos
    ).to(args.device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Val
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        val_ppl = math.exp(vloss)
        train_state.setdefault("val_ppl", []).append(val_ppl)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
                f"| val_loss={vloss:.4f} | val_ppl={val_ppl:.2f} | val_acc={vacc:.4f}")
        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping triggered.")
            break

    return args, dataset, vectorizer, model
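
The training loop tracks the loss and accuracy with the update `running += (value - running) / (bi + 1)`. This is the incremental form of the arithmetic mean, so no list of per-batch values has to be stored. A small sketch with made-up batch losses (the helper name `running_mean` is an assumption) shows that both forms agree:

```python
def running_mean(values):
    """Incremental mean, same update rule as in the training loop."""
    mean = 0.0
    for i, value in enumerate(values):
        mean += (value - mean) / (i + 1)
    return mean

batch_losses = [2.0, 1.0, 3.0, 2.0]   # illustrative numbers, not real logs
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
print(incremental, direct)
```

Every batch carries the same weight in this average. If the last batch of an epoch is smaller than the rest, which depends on how `generate_batches` is written, its samples count slightly more than the others.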

Training the RNN on the Arabic Hadith dataset:

In [40]:
args, dataset, vectorizer, model_rnn = train_RNN(hadith_path="../data/hadith_dataset/hadith_ar/hadith_ar_cleaned.csv",
                                                 output_path="Unai/Models/RNN/arab/hadith/coran_rnn_v1",
                                                 text_col="text_ar",
                                                 ft_ruta="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

256


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.8336 | val_loss=2.2752 | val_ppl=9.73 | val_acc=0.3948
Epoch 002 | train_loss=2.1672 | val_loss=1.9568 | val_ppl=7.08 | val_acc=0.4791
Epoch 003 | train_loss=1.9640 | val_loss=1.7776 | val_ppl=5.92 | val_acc=0.5121
Epoch 004 | train_loss=1.8447 | val_loss=1.6680 | val_ppl=5.30 | val_acc=0.5482
Epoch 005 | train_loss=1.7721 | val_loss=1.6024 | val_ppl=4.96 | val_acc=0.5599
Epoch 006 | train_loss=1.7270 | val_loss=1.5635 | val_ppl=4.78 | val_acc=0.5708
Epoch 007 | train_loss=1.6968 | val_loss=1.5355 | val_ppl=4.64 | val_acc=0.5764
Epoch 008 | train_loss=1.6735 | val_loss=1.5135 | val_ppl=4.54 | val_acc=0.5809
Epoch 009 | train_loss=1.6548 | val_loss=1.4963 | val_ppl=4.47 | val_acc=0.5847
Epoch 010 | train_loss=1.6403 | val_loss=1.4829 | val_ppl=4.41 | val_acc=0.5898
Epoch 011 | train_loss=1.6278 | val_loss=1.4721 | val_ppl=4.36 | val_acc=0.5923
Epoch 012 | train_loss=1.6177 | val_loss=1.4629 | val_ppl=4.32 | val_acc=0.5949
Epoch 013 | train_loss=1.6079 | val_loss
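
The `val_ppl` column in these logs is computed as `math.exp(vloss)` inside the training function. The short check below reproduces three of the logged values from their `val_loss`; the helper name `perplexity` is an assumption for illustration.

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# val_loss values copied from epochs 1, 2 and 12 of the log above
for val_loss in (2.2752, 1.9568, 1.4629):
    print(f"val_loss={val_loss:.4f} -> val_ppl={perplexity(val_loss):.2f}")
```

Assuming `sequence_loss` returns the mean cross-entropy per non-padded character, a perplexity of about 4.3 can be read as the model being roughly as uncertain as if it were choosing uniformly among 4.3 characters at each step.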

In [41]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
حدثناه سمعه وابي بعد حنصد في البير لامر حوله من الافزيع الله لبي ما التما الكيت بصله من غفر، فقال له عن يول الماله فيقلت يعمر والحجا . قال الوهل - عن حرار وابي رؤي المباره، وما مك اما قال " ما اليران . وان الله الناس فربراعبي الي لا الصبي من صال الذا يقال وال شيده وجل، فانس لا يكنا يشاكل في الولس .

 Verse 2:
حدثنا ابو الزبر بن حاجن، عت النهير بن الوحيم، عن ابي الزهري، ان ابن عباع بن سمل، عن اسااق، عن عبد الله بن اليمان، - ضه بن الاسلم ابن عشر وابو برضل اوار شوك عن الاسراه . قال يبكن ملاتي في اول الخرج، صماه بي اسحما عن ابي هريره، قال حدثنا جعبرا ذي ان ابو مسمعه والدار شعبه عن الرهم المزهد وفا حديثه حدثن

 Verse 3:
وحدثني محدي، حدثنا ابو عمر بن عبد الله بن جعيري، حدثنا ابو بزري عن عبر الله بن عمرو بن شعيب، حدثنا ابي حيو بكن النبير بو حدثنا الصببد، عن اليسن، قال قال ابن ابي هيي به " . قال ابو اقال الحكر النبي صلي الله عليه وسلم، فان اسلم ينت بالله في الصبره اله الا الا قال " ناس شعط واعثوا معه القاسم ورسيوه فما

 Verse 4:
حدثنا عبد الرحرد ومحوك 

Training the Hadith RNN in English:

In [42]:
args, dataset, vectorizer, model_rnn = train_RNN(hadith_path="../data/hadith_dataset/hadith_en/hadith_en_cleaned.csv",
                                                 output_path="Unai/Models/RNN/english/hadith/coran_rnn_v1",
                                                 text_col="text_en",
                                                 ft_ruta="../src/modelos/fasttext_english_busqueda_semantica.bin")

256


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.8153 | val_loss=2.2922 | val_ppl=9.90 | val_acc=0.3475
Epoch 002 | train_loss=2.2519 | val_loss=2.0392 | val_ppl=7.68 | val_acc=0.4174
Epoch 003 | train_loss=2.0799 | val_loss=1.8865 | val_ppl=6.60 | val_acc=0.4689
Epoch 004 | train_loss=1.9620 | val_loss=1.7737 | val_ppl=5.89 | val_acc=0.4984
Epoch 005 | train_loss=1.8814 | val_loss=1.7020 | val_ppl=5.49 | val_acc=0.5145
Epoch 006 | train_loss=1.8288 | val_loss=1.6528 | val_ppl=5.22 | val_acc=0.5238
Epoch 007 | train_loss=1.7896 | val_loss=1.6185 | val_ppl=5.05 | val_acc=0.5323
Epoch 008 | train_loss=1.7613 | val_loss=1.5938 | val_ppl=4.92 | val_acc=0.5435
Epoch 009 | train_loss=1.7382 | val_loss=1.5725 | val_ppl=4.82 | val_acc=0.5469
Epoch 010 | train_loss=1.7199 | val_loss=1.5556 | val_ppl=4.74 | val_acc=0.5510
Epoch 011 | train_loss=1.7044 | val_loss=1.5424 | val_ppl=4.68 | val_acc=0.5547
Epoch 012 | train_loss=1.6916 | val_loss=1.5321 | val_ppl=4.63 | val_acc=0.5555
Epoch 013 | train_loss=1.6809 | val_loss

In [43]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
ighilger said to is a copp to when ou tone of the kimah he revines norn that allah's mease of his hands wan oother, in the verions and the prophet that allah's apostla ghe hakim. when he restorr b. masima) him and said bet in the prayer the prepss of the chould lis the paso folldiad the toret for th

 Verse 2:
in while it and canes it in the promertidite the anashir the people of was a bong the salithter had may allah said whirched has allah with entromen whot komen whent him, and one lowshing the prophet sayeay the upreation and the saces on the authorinnanna (theand has and plating and whos oolst and an

 Verse 3:
abu muraishah in her and the proy of muse shaor the himpina, whore spait he said modet of the eravat in the last of the sowure of the messenger of al-asatahan is there (sayan a and will both and thet and the prophet ( ) come clive that he react of be tasit of allah's apostle sheash bettern corpave n

 Verse 4:
the messenger of allah

### LSTM - Hadith

In [44]:
def train_LSTM(hadith_path, output_path, text_col, ft_ruta):
    args = Namespace(
        hadith_csv=hadith_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # same as the fastText dimension
        lstm_hidden_size=256,

        seed=1337,
        learning_rate=1e-3,
        batch_size=64,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = HadithDataset.load_dataset_and_load_vectorizer(args.hadith_csv, args.vectorizer_file, text_col)
    else:
        dataset = HadithDataset.load_dataset_and_make_vectorizer(args.hadith_csv, text_col)
        dataset.save_vectorizer(args.vectorizer_file)

    dataset.set_split("train")
    train_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    dataset.set_split("val")
    val_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    dataset.set_split("test")
    test_dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    def obtener_pesos(vectorizer, modelo_ft):
        vocab = vectorizer.char_vocab
        token_to_idx = vocab._token_to_idx
        tamaño_vocab = len(token_to_idx)
        embedding_dim = modelo_ft.get_dimension()
        pesos = np.zeros((tamaño_vocab, embedding_dim))

        for token, idx in token_to_idx.items():
            pesos[idx] = modelo_ft.get_word_vector(token)
    
        return torch.FloatTensor(pesos)

    ft_modelo = fasttext.load_model(ft_ruta)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_modelo)

    model = CoranLSTM(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        lstm_hidden_size=args.lstm_hidden_size,
        padding_idx=mask_index,
        pretrained_embeddings_ft=pretrained_ft_pesos
    ).to(args.device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Val
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        val_ppl = math.exp(vloss)
        train_state.setdefault("val_ppl", []).append(val_ppl)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
                f"| val_loss={vloss:.4f} | val_ppl={val_ppl:.2f} | val_acc={vacc:.4f}")
        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping triggered.")
            break

    return args, dataset, vectorizer, model


Training the Hadith LSTM in Arabic:

In [45]:
args, dataset, vectorizer, model_lstm = train_LSTM(hadith_path="../data/hadith_dataset/hadith_ar/hadith_ar_cleaned.csv",
                                                 output_path="Unai/Models/LSTM/arab/hadith/hadith_lstm_v1",
                                                 text_col="text_ar",
                                                 ft_ruta="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.1909 | val_loss=1.5693 | val_ppl=4.80 | val_acc=0.5630
Epoch 002 | train_loss=1.5308 | val_loss=1.3352 | val_ppl=3.80 | val_acc=0.6204
Epoch 003 | train_loss=1.3720 | val_loss=1.2388 | val_ppl=3.45 | val_acc=0.6445
Epoch 004 | train_loss=1.2929 | val_loss=1.1925 | val_ppl=3.30 | val_acc=0.6551
Epoch 005 | train_loss=1.2431 | val_loss=1.1514 | val_ppl=3.16 | val_acc=0.6694
Epoch 006 | train_loss=1.2089 | val_loss=1.1331 | val_ppl=3.11 | val_acc=0.6729
Epoch 007 | train_loss=1.1834 | val_loss=1.1127 | val_ppl=3.04 | val_acc=0.6800
Epoch 008 | train_loss=1.1633 | val_loss=1.0906 | val_ppl=2.98 | val_acc=0.6863
Epoch 009 | train_loss=1.1472 | val_loss=1.0771 | val_ppl=2.94 | val_acc=0.6899
Epoch 010 | train_loss=1.1333 | val_loss=1.0720 | val_ppl=2.92 | val_acc=0.6914
Epoch 011 | train_loss=1.1218 | val_loss=1.0573 | val_ppl=2.88 | val_acc=0.6960
Epoch 012 | train_loss=1.1115 | val_loss=1.0520 | val_ppl=2.86 | val_acc=0.6972
Epoch 013 | train_loss=1.1024 | val_loss

In [46]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
حدثنا محمد بن بشار، حدثنا يحيي، - وهو ابن سلمه - حدثنا ابي، عن عائشه الكني، قال زيد علي ابن عباس عن النبي صلي الله عليه وسلم وهو محرم وابو بكر السناد في شراد صاع الامت ليلت وتعالي سكره فانا مبتاع اليم جاريه وهو في صاحب من السام .

 Verse 2:
حدثنا ابو بكر بن ابي شيبه، حدثنا وكيع، حدثنا الاعمش، عن ابي صالح، عن ابي هريره، ان رسول الله صلي الله عليه وسلم قال " اللهم اني اعرج، عن ابوه اخبر، انما من اكثره باصلي المسجد " . قال ابو عيسي هذا حديث حسن صحيح .

 Verse 3:
حدثنا ابو كريب، حدثنا ابو الزناد، عن الاعمش، عن ابي وائل، عن سمعه، ان ابن عمر، قال دخل علي النبي صلي الله عليه وسلم ان اقبل هن كلما اعتق في الناس ليس لموت ما هو الوداك، ولا تدخل موسي الي الذين يضحك عن المسجد المسجد والصبح سوس فروع دنيا ووحل في الفتح بالمدينه، والمسطج العزود وراح ندي يا ابا الاسود ان

 Verse 4:
حدثنا ابو بكر بن ابي شيبه، حدثنا ابو اسامه، عن الحسان المخزوم، عن ذراع، عن ابي الدرداء، عن ابي العاليه، عن ابيه، ان النبي صلي الله عليه وسلم قال " بين بعلها او لا القاعه فكلا تمنا " 

Training the LSTM on the English Hadith:

In [47]:
args, dataset, vectorizer, model_lstm = train_LSTM(
    hadith_path="../data/hadith_dataset/hadith_en/hadith_en_cleaned.csv",
    output_path="Unai/Models/LSTM/english/hadith/hadith_lstm_v1",
    text_col="text_en",
    ft_ruta="../src/modelos/fasttext_english_busqueda_semantica.bin",
)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


Epoch 001 | train_loss=2.2849 | val_loss=1.6955 | val_ppl=5.45 | val_acc=0.5145
Epoch 002 | train_loss=1.6425 | val_loss=1.4027 | val_ppl=4.07 | val_acc=0.5843
Epoch 003 | train_loss=1.4422 | val_loss=1.2827 | val_ppl=3.61 | val_acc=0.6168
Epoch 004 | train_loss=1.3442 | val_loss=1.2171 | val_ppl=3.38 | val_acc=0.6368
Epoch 005 | train_loss=1.2833 | val_loss=1.1733 | val_ppl=3.23 | val_acc=0.6477
Epoch 006 | train_loss=1.2417 | val_loss=1.1424 | val_ppl=3.13 | val_acc=0.6551
Epoch 007 | train_loss=1.2113 | val_loss=1.1198 | val_ppl=3.06 | val_acc=0.6613
Epoch 008 | train_loss=1.1876 | val_loss=1.1052 | val_ppl=3.02 | val_acc=0.6654
Epoch 009 | train_loss=1.1690 | val_loss=1.0875 | val_ppl=2.97 | val_acc=0.6693
Epoch 010 | train_loss=1.1533 | val_loss=1.0795 | val_ppl=2.94 | val_acc=0.6726
Epoch 011 | train_loss=1.1402 | val_loss=1.0682 | val_ppl=2.91 | val_acc=0.6751
Epoch 012 | train_loss=1.1294 | val_loss=1.0599 | val_ppl=2.89 | val_acc=0.6776
Epoch 013 | train_loss=1.1200 | val_loss

In [48]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
abu sa'id 'abdullah deporited to the prophet and said, o allah's apostle! do you think having in a seven person should say from his head from blanken, and the prayers eyedittery.

 Verse 2:
abu huraira said that the heavens were with allah's messenger (may peace be upon him), then i was the committed the traditions of the prophet saying, in the name of assufe is a return of the house) where. he got up on the sallah i were until he was it desching by the signs of allah.' umar said 'abu 

 Verse 3:
allah's apostle said, do you go and be in it and he has gold collected by saying the prayer to the reventioneror. in the same day and granted the same san musalab b. abu umama and it was sitting with allah's messenger (may peace be upon him), so he called the prophet by a prevented away the prevente

 Verse 4:
abu huraira allah's apostle said, indeed with the doubt.

 Verse 5:
allah's apostle was leading the actratcessous prayers in charity. abu bakr c

We will carry out the comparisons and set out our conclusions in the documentation.
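
As a starting point for that comparison, the per-epoch lines printed during training can be parsed back into numbers. The sketch below is only one possible approach, not the method used in the documentation; `LOG_PATTERN` and `parse_epoch_line` are hypothetical helpers, and the two sample lines are the epoch-12 entries copied from the training logs above.

```python
import re

# Matches the line format printed by the training loop
LOG_PATTERN = re.compile(
    r"Epoch (\d+) \| train_loss=([\d.]+) \| val_loss=([\d.]+) "
    r"\| val_ppl=([\d.]+) \| val_acc=([\d.]+)"
)


def parse_epoch_line(line):
    """Return one log line's metrics as a dict, or None if the line is incomplete."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, train_loss, val_loss, val_ppl, val_acc = match.groups()
    return {
        "epoch": int(epoch),
        "train_loss": float(train_loss),
        "val_loss": float(val_loss),
        "val_ppl": float(val_ppl),
        "val_acc": float(val_acc),
    }


# Epoch-12 lines copied from the Arabic and English runs
runs = {
    "arabic": "Epoch 012 | train_loss=1.1115 | val_loss=1.0520 | val_ppl=2.86 | val_acc=0.6972",
    "english": "Epoch 012 | train_loss=1.1294 | val_loss=1.0599 | val_ppl=2.89 | val_acc=0.6776",
}

for name, line in runs.items():
    metrics = parse_epoch_line(line)
    print(f"{name:8s} val_loss={metrics['val_loss']:.4f} val_acc={metrics['val_acc']:.4f}")
```

Truncated lines, such as the epoch-13 entries in both logs, do not match the pattern and are returned as `None`, so they can be skipped rather than guessed.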