# Verse Generation with Sequential Models
In this section we work with sequential models such as *Recurrent Neural Networks* (`RNNs`) and *Long Short-Term Memory* networks (`LSTMs`). Although these models are less powerful than some of the generative models covered later, we found them worth developing because they reflect how generative models actually evolved over the years.

Because we initially considered the number of instances in the dataset we used for the **Quran** to be insufficient, we decided to work with a second dataset that offers almost ten times as many trainable instances as the first. [The dataset, available on Kaggle](https://www.kaggle.com/datasets/fahd09/hadith-dataset), collects a number of `Hadith`s, accounts of the actions or sayings of the **Prophet Mohammed**. However, this file has a completely different structure from the usual one, so it requires different cleaning and handling.

Throughout this notebook we build all the classes and functions needed to create and use the sequential generative models that preceded **Transformers**. The full analysis is covered in more depth in the main documentation.

### Required Libraries

In [1]:
# Dependencies
import os
from argparse import Namespace
import json
import re
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.nn import functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import fasttext
import unicodedata
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated; same progress bar, same name

### Class and Function Code

General `Vocabulary` class, which defines the main methods that the dataset-specific vocabulary class inherits:

In [2]:
class Vocabulary:
    def __init__(self, token_to_idx=None):
        # initialise attributes
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = dict(token_to_idx)
        self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}

    def to_serializable(self):
        # serialise the token (label) -> idx (int) dictionary
        return {"token_to_idx": self._token_to_idx}

    @classmethod
    def from_serializable(cls, contents):
        return cls(token_to_idx=contents["token_to_idx"])

    def add_token(self, token):
        # add a (new) token to the dictionary and return its index
        if token in self._token_to_idx:
            return self._token_to_idx[token]

        index = len(self._token_to_idx)
        self._token_to_idx[token] = index
        self._idx_to_token[index] = token
        return index

    def add_many_tokens(self, tokens):
        # add several tokens to the dictionary at once
        return [self.add_token(t) for t in tokens]

    def lookup_token(self, token):
        # return the idx of the given token
        return self._token_to_idx[token]

    def lookup_index(self, index):
        # return the token for the given idx
        if index not in self._idx_to_token:
            return "<UNK>"
        return self._idx_to_token[index]

    def __len__(self):
        # return the size of the dictionary
        return len(self._token_to_idx)

    def __str__(self):
        # return a printable summary with the vocabulary size
        return f"<Vocabulary(size={len(self)})>"

Quran-specific vocabulary with the special tokens `<MASK>`, `<UNK>`, `<BEGIN>` and `<END>`. It inherits from the general `Vocabulary` class.

In [3]:
class VocabularyCoran(Vocabulary):
    def __init__(self, token_to_idx=None, unk_token="<UNK>",
                 mask_token="<MASK>", begin_seq_token="<BEGIN>",
                 end_seq_token="<END>"):

        super().__init__(token_to_idx)
        # add the special tokens to our dictionary
        self._mask_token = mask_token
        self._unk_token = unk_token
        self._begin_seq_token = begin_seq_token
        self._end_seq_token = end_seq_token

        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = self.add_token(self._unk_token)
        self.begin_seq_index = self.add_token(self._begin_seq_token)
        self.end_seq_index = self.add_token(self._end_seq_token)

    def to_serializable(self):
        # serialise the token (label) -> idx (int) dictionary together with the special tokens
        contents = super().to_serializable()
        contents.update({
            "unk_token": self._unk_token,
            "mask_token": self._mask_token,
            "begin_seq_token": self._begin_seq_token,
            "end_seq_token": self._end_seq_token
        })
        return contents

    @classmethod
    def from_serializable(cls, contents):
        vocab = cls(
            token_to_idx=contents["token_to_idx"],
            unk_token=contents["unk_token"],
            mask_token=contents["mask_token"],
            begin_seq_token=contents["begin_seq_token"],
            end_seq_token=contents["end_seq_token"],
        )
        return vocab

    def lookup_token(self, token):
        # return the idx of the given token, falling back to <UNK> for unseen tokens
        return self._token_to_idx.get(token, self.unk_index)
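
As a rough illustration of how the special tokens behave, the following self-contained sketch uses a hypothetical `TinyVocab` class (a simplified stand-in written for this example, not the `VocabularyCoran` class itself). It assumes the same insertion order of the special tokens and the same `<UNK>` fallback in `lookup_token`.

```python
class TinyVocab:
    """Hypothetical, simplified stand-in for VocabularyCoran (illustration only)."""

    def __init__(self):
        self.token_to_idx = {}
        # Same insertion order as VocabularyCoran.__init__: mask, unk, begin, end.
        for tok in ("<MASK>", "<UNK>", "<BEGIN>", "<END>"):
            self.add_token(tok)
        self.unk_index = self.token_to_idx["<UNK>"]

    def add_token(self, token):
        # Reuse the existing index, otherwise assign the next free one.
        if token not in self.token_to_idx:
            self.token_to_idx[token] = len(self.token_to_idx)
        return self.token_to_idx[token]

    def lookup_token(self, token):
        # Unseen characters fall back to the <UNK> index instead of raising KeyError.
        return self.token_to_idx.get(token, self.unk_index)


vocab = TinyVocab()
vocab.add_token("a")
known = vocab.lookup_token("a")     # first regular token, placed after the 4 special ones
unknown = vocab.lookup_token("z")   # never added, so it maps to <UNK>
print(known, unknown)
```

Because the special tokens are added to an empty vocabulary first, `<MASK>` ends up at index 0 in this sketch, and any character that was never added maps to the `<UNK>` index.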

Vectorizer: the component responsible for converting text into index vectors:


In [4]:
class CoranVectorizer:
    def __init__(self, char_vocab: VocabularyCoran):
        # store the character vocabulary
        self.char_vocab = char_vocab

    def vectorize(self, text: str, vector_length: int):
        # vectorise a text into input/target index vectors

        indices = [self.char_vocab.begin_seq_index] # prepend <BEGIN>
        indices.extend(self.char_vocab.lookup_token(ch) for ch in text) # then one index per character of the text
        indices.append(self.char_vocab.end_seq_index) # append <END>

        from_indices = indices[:-1]
        to_indices = indices[1:]

        # from_vector holds <BEGIN> followed by the sequence tokens (without <END>)
        from_vector = np.full(vector_length, fill_value=self.char_vocab.mask_index, dtype=np.int64)
        # to_vector holds the sequence tokens followed by <END>
        to_vector = np.full(vector_length, fill_value=self.char_vocab.mask_index, dtype=np.int64)

        n = min(vector_length, len(from_indices))
        from_vector[:n] = from_indices[:n]

        n = min(vector_length, len(to_indices))
        to_vector[:n] = to_indices[:n]

        return from_vector, to_vector

    @classmethod
    def from_dataframe(cls, df: pd.DataFrame, text_col="text"):
        char_vocab = VocabularyCoran()
        for text in df[text_col].astype(str):
            for ch in text:
                char_vocab.add_token(ch)
        return cls(char_vocab)

    def to_serializable(self):
        return {"char_vocab": self.char_vocab.to_serializable()}

    @classmethod
    def from_serializable(cls, contents):
        char_vocab = VocabularyCoran.from_serializable(contents["char_vocab"])
        return cls(char_vocab)
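
The shift-by-one construction in `vectorize` is easier to see on a tiny example. The sketch below is a pure-Python approximation using plain lists instead of NumPy arrays; the index values for `MASK`, `BEGIN` and `END` are assumptions that match the insertion order in `VocabularyCoran`, and the token ids `4, 5, 6` are made up.

```python
MASK, BEGIN, END = 0, 2, 3  # assumed indices of <MASK>, <BEGIN>, <END>


def vectorize(token_ids, vector_length):
    """Pure-Python sketch of the input/target construction with padding."""
    indices = [BEGIN] + list(token_ids) + [END]
    from_indices = indices[:-1]   # <BEGIN> + tokens (model input)
    to_indices = indices[1:]      # tokens + <END> (prediction target)

    from_vector = [MASK] * vector_length
    to_vector = [MASK] * vector_length
    n = min(vector_length, len(from_indices))
    from_vector[:n] = from_indices[:n]
    to_vector[:n] = to_indices[:n]
    return from_vector, to_vector


x, y = vectorize([4, 5, 6], vector_length=6)
print(x)  # [2, 4, 5, 6, 0, 0]
print(y)  # [4, 5, 6, 3, 0, 0]
```

At every position the target is the token that follows the input token, and the trailing `MASK` entries are the padding that the loss later ignores.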

Training helpers (evaluation metrics, training state, early stopping, ...):

In [5]:
def generate_batches(dataset, batch_size, device, shuffle=True, drop_last=True):
    # yield batches moved to the target device (CPU, or GPU when CUDA is available)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    for batch in dataloader:
        yield {k: v.to(device) for k, v in batch.items()}

def sequence_loss(y_pred, y_true, mask_index):
    # loss function: cross-entropy, since we compare the predicted distribution with the ground truth
    # B, T and V are the dimensions of the prediction tensor (batch, time steps, vocabulary size)
    B, T, V = y_pred.shape
    y_pred = y_pred.reshape(B * T, V)
    y_true = y_true.reshape(B * T)
    # compare predicted and true distributions, skipping padded positions via ignore_index
    loss_fn = nn.CrossEntropyLoss(ignore_index=mask_index)
    return loss_fn(y_pred, y_true)

def compute_accuracy(y_pred, y_true, mask_index):
    # accuracy: compare each predicted character with the ground truth, ignoring padded positions
    y_hat = y_pred.argmax(dim=-1)  
    valid = (y_true != mask_index)
    correct = (y_hat == y_true) & valid
    denom = valid.sum().item()
    if denom == 0:
        return 0.0
    return correct.sum().item() / denom

def make_train(args):
    # taken from the ALUD notebook: initial training state
    return {"stop_early": False,
            "early_stopping_step": 0,
            "early_stopping_best_val": 1e8,
            "epoch_index": 0,
            "train_loss": [],
            "train_acc": [],
            "val_loss": [],
            "val_acc": [],
            "model_filename": args.model_state_file}

def update_training_state(args, model, train_state):
    # track improvement/degradation of the validation loss -> early stopping
    if train_state["epoch_index"] == 0:
        torch.save(model.state_dict(), train_state["model_filename"])
        train_state["stop_early"] = False
        return train_state

    # early-stopping logic
    loss_t = train_state["val_loss"][-1]
    if loss_t < train_state["early_stopping_best_val"]:
        torch.save(model.state_dict(), train_state["model_filename"])
        train_state["early_stopping_best_val"] = loss_t
        train_state["early_stopping_step"] = 0
    else:
        train_state["early_stopping_step"] += 1

    train_state["stop_early"] = train_state["early_stopping_step"] >= args.early_stopping_criteria
    return train_state
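
The patience counter in `update_training_state` can be sketched without PyTorch. The helper `simulate_early_stopping` and the loss values below are hypothetical and only illustrate the counter. Unlike the function above, this sketch also compares the very first epoch against the best value and does not save any checkpoint.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, best validation loss seen)."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0   # improvement: remember it and reset the counter
        else:
            bad_epochs += 1              # no improvement this epoch
        if bad_epochs >= patience:
            return epoch, best
    return len(val_losses) - 1, best


# Hypothetical validation losses, chosen only to trigger the stop.
stop_epoch, best_loss = simulate_early_stopping(
    [2.7, 2.5, 2.4, 2.45, 2.41, 2.43], patience=3
)
print(stop_epoch, best_loss)
```

With a patience of 3, training stops after three consecutive epochs that fail to beat the best loss of 2.4. This corresponds to `early_stopping_criteria` in the training arguments, which the notebook sets to 5.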

Functions to sample and display new verses. We use them once the models have been trained.

In [6]:
def sample_from_model(model, vectorizer, num_samples=10, max_length=300, temperature=0.8, top_k=None):
    # Sample new verses from the trained model
    # (10 samples by default)
    model.eval()
    vocab = vectorizer.char_vocab
    device = next(model.parameters()).device
    samples = []

    for _ in range(num_samples):
        indices = [vocab.begin_seq_index]

        for _ in range(max_length):
            x = torch.tensor(indices, dtype=torch.long, device=device).unsqueeze(0)

            with torch.no_grad():
                logits = model(x, apply_softmax=False)         
                next_logits = logits[0, -1] / max(temperature, 1e-8)

                if top_k is not None and top_k > 0:
                    v, ix = torch.topk(next_logits, k=min(top_k, next_logits.size(-1)))  # clamp k to the vocabulary size
                    filtered = torch.full_like(next_logits, float("-inf"))
                    filtered[ix] = v
                    next_logits = filtered

                probs = torch.softmax(next_logits, dim=0)
                next_index = torch.multinomial(probs, 1).item()

            if next_index == vocab.end_seq_index:
                break
            indices.append(next_index)

        samples.append(indices)

    return samples

def decode_samples(sampled_indices, vectorizer):
    # Map the indices sampled by the previous function back to characters
    char_vocab = vectorizer.char_vocab
    decoded = []

    for indices in sampled_indices:
        chars = [
            char_vocab.lookup_index(idx)
            for idx in indices
            if idx not in (
                char_vocab.begin_seq_index,
                char_vocab.end_seq_index,
                char_vocab.mask_index
            )
        ]
        decoded.append("".join(chars))

    return decoded
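
To show what the `temperature` and `top_k` arguments of `sample_from_model` do to the next-character distribution, here is a pure-Python sketch using only the standard library. The function `sample_next` and the logit values are hypothetical. It approximates the PyTorch code rather than reproducing it: on ties at the cut-off it may keep more than `k` candidates, whereas `torch.topk` returns exactly `k`.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=None, rng=random):
    """Sample one index from temperature-scaled, optionally top-k filtered logits."""
    scaled = [v / max(temperature, 1e-8) for v in logits]

    if top_k is not None and top_k > 0:
        # Keep only the k largest logits; everything else gets probability zero.
        threshold = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [v if v >= threshold else float("-inf") for v in scaled]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


rng = random.Random(0)  # fixed seed so the sketch is repeatable
index, probs = sample_next([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=2, rng=rng)
print(index, [round(p, 3) for p in probs])
```

Lower temperatures sharpen the distribution towards the most likely character, while `top_k` removes the unlikely tail entirely before sampling.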

Since we use the weights of the embedding model from earlier (`fastText`), we load them here:

In [7]:
def obtener_pesos(vectorizer, modelo_ft):
    vocab = vectorizer.char_vocab
    token_to_idx = vocab._token_to_idx
    tamaño_vocab = len(token_to_idx)
    embedding_dim = modelo_ft.get_dimension()
    pesos = np.zeros((tamaño_vocab, embedding_dim))

    for token, idx in token_to_idx.items():
        pesos[idx] = modelo_ft.get_word_vector(token)
    
    return torch.FloatTensor(pesos)

Function to generate and print the results:

In [8]:
def nuevos_versos(nombre_modelo, nombre_vectorizer):
    num_names = 10

    model = nombre_modelo.cpu()

    sampled_verses = decode_samples(
        sample_from_model(
            model,
            nombre_vectorizer,
            num_samples=num_names,
            max_length=300,
            temperature=0.8
        ),
        nombre_vectorizer
    )

    print("-" * 30)
    for i in range(num_names):
        print(f"\n Verse {i+1}:\n{sampled_verses[i]}")

### Training Functions: RNN and LSTM

Quran `Dataset` class, adapted to the format in which the **Quran** file is written.

In [9]:
class CoranDataset(Dataset):
    def __init__(self, df: pd.DataFrame, vectorizer: CoranVectorizer, text_col="text"):
        self.df = df.reset_index(drop=True)
        self._vectorizer = vectorizer
        self._text_col = text_col

        self._max_seq_length = int(self.df[text_col].astype(str).map(len).max()) + 2 # +2 leaves room for <BEGIN> and <END>

        n = len(self.df)
        train_end = int(n * 0.70) # the first 70% of the instances go to the train set
        val_end = int(n * .85) # the next 15% go to the validation set and the final 15% to the test set

        self.train_df = self.df.iloc[:train_end]
        self.val_df = self.df.iloc[train_end:val_end]
        self.test_df = self.df.iloc[val_end:]

        self._lookup_dict = {
            "train": (self.train_df, len(self.train_df)),
            "val": (self.val_df, len(self.val_df)),
            "test": (self.test_df, len(self.test_df)),
        }

        self.set_split("train")

    # the methods below handle loading, saving and indexing for our specific dataset
    @classmethod
    def load_dataset_and_make_vectorizer(cls, coran_txt, sep="|"):
        df = pd.read_csv(coran_txt, sep=sep, names=["sura", "ayah", "text"])
        df["text"] = df["text"].astype(str).str.lower()
        vectorizer = CoranVectorizer.from_dataframe(df, text_col="text")
        return cls(df, vectorizer, text_col="text")

    @classmethod
    def load_dataset_and_load_vectorizer(cls, coran_txt, vectorizer_filepath, sep="|"):
        df = pd.read_csv(coran_txt, sep=sep, names=["sura", "ayah", "text"])
        df["text"] = df["text"].astype(str).str.lower()
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(df, vectorizer, text_col="text")

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath, "r", encoding="utf-8") as f:
            contents = json.load(f)
        return CoranVectorizer.from_serializable(contents)

    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, "w", encoding="utf-8") as f:
            json.dump(self._vectorizer.to_serializable(), f, ensure_ascii=False)

    def get_vectorizer(self):
        return self._vectorizer

    def set_split(self, split="train"):
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        row = self._target_df.iloc[index]
        text = str(row[self._text_col])
        x, y = self._vectorizer.vectorize(text, vector_length=self._max_seq_length)
        return {"x_data": torch.tensor(x, dtype=torch.long),
                "y_target": torch.tensor(y, dtype=torch.long)}
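
The split boundaries computed in `__init__` can be checked with a small sketch. The helper `split_bounds` and the row count of 1000 are hypothetical; the fractions mirror the ones used in the class.

```python
def split_bounds(n, train_frac=0.70, val_end_frac=0.85):
    """Return the row positions where the train and validation slices end."""
    train_end = int(n * train_frac)
    val_end = int(n * val_end_frac)
    return train_end, val_end


rows = list(range(1000))  # hypothetical dataset with 1000 rows
train_end, val_end = split_bounds(len(rows))
train, val, test = rows[:train_end], rows[train_end:val_end], rows[val_end:]
sizes = (len(train), len(val), len(test))
print(sizes)
```

Note that the slices are taken in file order without shuffling, so with this scheme the validation and test sets consist of the last rows of the file.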

Training function for the RNN, the architecture we will use for the neural network.

In [10]:
def train_RNN(coran_path, output_path, ruta_ft):
    args = Namespace(
        coran_txt=coran_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # 300 because the fastText embeddings are 300-dimensional; the sizes must match
        rnn_hidden_size=128, # size of the RNN hidden state

        seed=1337,
        learning_rate=1e-3, # learning rate
        batch_size=256,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    # resolve the paths used to save/load files
    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = CoranDataset.load_dataset_and_load_vectorizer(args.coran_txt, args.vectorizer_file)
    else:
        dataset = CoranDataset.load_dataset_and_make_vectorizer(args.coran_txt)
        dataset.save_vectorizer(args.vectorizer_file)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    # load the fastText model and build the embedding weights with obtener_pesos (defined above)
    ft_model = fasttext.load_model(ruta_ft)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_model)

    # build the model
    model = CoranRNN(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        rnn_hidden_size=args.rnn_hidden_size,
        padding_idx=mask_index,
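        pretrained_embeddings_ft=None  # note: pretrained_ft_pesos is computed above but not passed here, so the embeddings start from random initialisation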
    ).to(args.device)

    # from here on, a standard training loop
    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Validation
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
              f"| val_loss={vloss:.4f} | val_acc={vacc:.4f}")
        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping activado.")
            break

    return args, dataset, vectorizer, model
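
Both `sequence_loss` and `compute_accuracy`, used in the loop above, skip positions whose target is the padding index. The following pure-Python sketch shows the idea on plain lists. The functions `masked_cross_entropy` and `masked_accuracy` and all numbers are hypothetical, written only for this illustration.

```python
import math


def masked_cross_entropy(logits, targets, mask_index):
    """Mean negative log-likelihood over positions whose target is not padding."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == mask_index:
            continue  # padded position: excluded from both the sum and the count
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


def masked_accuracy(predicted, targets, mask_index):
    """Fraction of non-padded positions where the prediction equals the target."""
    valid = [(p, t) for p, t in zip(predicted, targets) if t != mask_index]
    if not valid:
        return 0.0
    return sum(p == t for p, t in valid) / len(valid)


logits = [[0.0, 0.0, 0.0, 0.0]] * 3   # uniform scores over a 4-token vocabulary
targets = [1, 2, 0]                    # the last position is padding (index 0)
loss = masked_cross_entropy(logits, targets, mask_index=0)
accuracy = masked_accuracy([1, 3, 0], targets, mask_index=0)
print(round(loss, 4), accuracy)
```

With uniform scores over four tokens the loss equals `log(4)`, which gives a useful reference point: a character model that has learned nothing scores `log(V)` for a vocabulary of size `V`.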

Training function for the LSTM, almost identical to the RNN one.

In [11]:
def train_LSTM(coran_path, output_path, ruta_ft):
    args = Namespace(
        coran_txt=coran_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # same as the fastText embedding dimension
        lstm_hidden_size=256,

        seed=1337,
        learning_rate=1e-3,
        batch_size=64,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = CoranDataset.load_dataset_and_load_vectorizer(args.coran_txt, args.vectorizer_file)
    else:
        dataset = CoranDataset.load_dataset_and_make_vectorizer(args.coran_txt)
        dataset.save_vectorizer(args.vectorizer_file)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    # load the fastText model and build the embedding weights with obtener_pesos (defined above)
    ft_model = fasttext.load_model(ruta_ft)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_model)

    model = CoranLSTM(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        lstm_hidden_size=args.lstm_hidden_size,
        padding_idx=mask_index,
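        pretrained_embeddings_ft=None  # note: pretrained_ft_pesos is computed above but not passed here, so the embeddings start from random initialisation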
    ).to(args.device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Val
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
              f"| val_loss={vloss:.4f} | val_acc={vacc:.4f}")
        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping activado.") 
            break

    return args, dataset, vectorizer, model
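
Both training loops track `running_loss` and `running_acc` with the update `value += (new - value) / (bi + 1)`. A minimal sketch, with a hypothetical helper `running_mean` and made-up loss values, shows that this is the incremental form of the arithmetic mean:

```python
def running_mean(values):
    """Incremental mean, using the same update rule as the training loops."""
    mean = 0.0
    for i, v in enumerate(values):
        mean += (v - mean) / (i + 1)
    return mean


losses = [3.0, 2.0, 1.0, 2.0]          # hypothetical per-batch losses
incremental = running_mean(losses)
direct = sum(losses) / len(losses)
print(incremental, direct)
```

The reported epoch loss is therefore the mean of the per-batch losses. Since each batch can contain a different number of non-padded characters, this is close to, but not necessarily identical to, the mean loss per character.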


## Quran Dataset

### RNN - Quran


RNN model for the Quran: the internal architecture of the network, with its usual `__init__` and `forward` methods:

In [12]:
class CoranRNN(nn.Module):
    # our nn.Module for the RNN
    def __init__(self, vocab_size, embedding_size, rnn_hidden_size, padding_idx, dropout_p=0.5,
                 pretrained_embeddings_ft=None):
        super().__init__()
        # architecture of our RNN

        self.char_emb = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx) # embedding layer with one row per vocabulary entry
        # optionally load the fastText embeddings (weights) here
        if pretrained_embeddings_ft is not None:
            self.char_emb.weight.data.copy_(pretrained_embeddings_ft)

        self.rnn = nn.RNN(embedding_size, rnn_hidden_size, batch_first=True, nonlinearity="tanh") # recurrent layer
        self.fc = nn.Linear(rnn_hidden_size, vocab_size) # fully connected output layer
        self.dropout_p = dropout_p # dropout probability applied to the RNN outputs

    def forward(self, x_in, apply_softmax=False):
        x_emb = self.char_emb(x_in)             
        y_out, _ = self.rnn(x_emb)               
        y_out = F.dropout(y_out, p=self.dropout_p, training=self.training)
        logits = self.fc(y_out)                  
        if apply_softmax:
            return F.softmax(logits, dim=-1)
        return logits
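
For intuition about what the recurrent layer computes, a `tanh` RNN updates its hidden state as `h_t = tanh(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh)`. The sketch below is a scalar, pure-Python version of that recurrence with a single merged bias. The functions `rnn_step` and `run_rnn` and the weight values are made up for illustration and are not taken from the trained model.

```python
import math


def rnn_step(x_t, h_prev, w_ih, w_hh, b):
    """One step of a scalar Elman RNN with tanh nonlinearity."""
    return math.tanh(w_ih * x_t + w_hh * h_prev + b)


def run_rnn(xs, w_ih=0.5, w_hh=0.8, b=0.0):
    """Run the recurrence over a sequence, starting from a zero hidden state."""
    h = 0.0
    hidden_states = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_ih, w_hh, b)
        hidden_states.append(h)
    return hidden_states


# A single input pulse followed by zeros: the hidden state carries it forward
# but it fades at every step.
states = run_rnn([1.0, 0.0, 0.0])
print([round(h, 4) for h in states])
```

In this toy setting the trace of the first input shrinks at each step, which hints at why plain RNNs struggle to keep information over long character sequences.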

Training the RNN on the Arabic Quran:

In [13]:
args, dataset, vectorizer, model_rnn = train_RNN(coran_path="../data/cleaned_data/cleaned_arab_quran.txt",
                                                 output_path="Unai/Models/RNN/arab/coran/coran_rnn_v1",
                                                 ruta_ft="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

Epoch 001 | train_loss=3.1985 | val_loss=2.7052 | val_acc=0.2970
Epoch 002 | train_loss=2.6372 | val_loss=2.5042 | val_acc=0.3285
Epoch 003 | train_loss=2.4997 | val_loss=2.4174 | val_acc=0.3423
Epoch 004 | train_loss=2.4226 | val_loss=2.3597 | val_acc=0.3512
Epoch 005 | train_loss=2.3700 | val_loss=2.3174 | val_acc=0.3588
Epoch 006 | train_loss=2.3306 | val_loss=2.2840 | val_acc=0.3661
Epoch 007 | train_loss=2.2970 | val_loss=2.2542 | val_acc=0.3722
Epoch 008 | train_loss=2.2708 | val_loss=2.2276 | val_acc=0.3773
Epoch 009 | train_loss=2.2473 | val_loss=2.2050 | val_acc=0.3790
Epoch 010 | train_loss=2.2271 | val_loss=2.1864 | val_acc=0.3832
Epoch 011 | train_loss=2.2093 | val_loss=2.1680 | val_acc=0.3900
Epoch 012 | train_loss=2.1937 | val_loss=2.1518 | val_acc=0.3942
Epoch 013 | train_loss=2.1798 | val_loss=2.1378 | val_acc=0.3974
Epoch 014 | train_loss=2.1669 | val_loss=2.1244 | val_acc=0.4013
Epoch 015 | train_loss=2.1550 | val_loss=2.1116 | val_acc=0.4049
Epoch 016 | train_loss=2.

In [14]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
اقول الله ۚ ولذي شكرون

 Verse 2:
وبالله وازوهه منهم بشر الددكم فاسر والذين كان الذيو ينشاب

 Verse 3:
وليا اكاتاتم ورجدوك من يبير مهمنما

 Verse 4:
وان الان لن عذاب ۚ ولك الي طبعن الصادهم فضل الا الصول الي الله روجعا في الالغروا الشار فالذين يحسم اولي المعينع م شرك به الله ويوم يابع ني بانا هم تقولي وما انتم لهم من عليه الي يرجون علي الوالنوا يصدونا من يحم ۚ واللما توتياخر لله ان ارجامه للمنهم بهم كنفتم معضد الاكيب

 Verse 5:
قال لفكلا ممتكم علي الين تملانا عليم

 Verse 6:
ان اللم

 Verse 7:
من تجتم ولا تكبرين

 Verse 8:
فالله رسله اا الذين تنبرون عبيلها ۖ فاتان تبال قالوا اعلم ۚ ونال الناق تلكم من الذروه علي البراء وكنوا ويقبم الاتين الذاطين لا تمتمون

 Verse 9:
فاسمعوا الشهسك ۚ وان اسماوا يعاب لوا ايتربا

 Verse 10:
فلا يس فيما انت عليم


Training the RNN on the English Quran:

In [15]:
args, dataset, vectorizer, model_rnn = train_RNN(coran_path="../data/cleaned_data/cleaned_english_quran.txt",
                                                 output_path="Unai/Models/RNN/english/coran/coran_rnn_v1",
                                                 ruta_ft="../src/modelos/fasttext_english_busqueda_semantica.bin")

Epoch 001 | train_loss=2.8400 | val_loss=2.3406 | val_acc=0.3593
Epoch 002 | train_loss=2.3090 | val_loss=2.1265 | val_acc=0.3883
Epoch 003 | train_loss=2.1560 | val_loss=2.0118 | val_acc=0.4208
Epoch 004 | train_loss=2.0622 | val_loss=1.9314 | val_acc=0.4351
Epoch 005 | train_loss=1.9947 | val_loss=1.8686 | val_acc=0.4501
Epoch 006 | train_loss=1.9403 | val_loss=1.8171 | val_acc=0.4638
Epoch 007 | train_loss=1.8990 | val_loss=1.7746 | val_acc=0.4781
Epoch 008 | train_loss=1.8628 | val_loss=1.7388 | val_acc=0.4881
Epoch 009 | train_loss=1.8335 | val_loss=1.7066 | val_acc=0.4986
Epoch 010 | train_loss=1.8062 | val_loss=1.6793 | val_acc=0.5024
Epoch 011 | train_loss=1.7831 | val_loss=1.6552 | val_acc=0.5120
Epoch 012 | train_loss=1.7643 | val_loss=1.6347 | val_acc=0.5170
Epoch 013 | train_loss=1.7457 | val_loss=1.6155 | val_acc=0.5212
Epoch 014 | train_loss=1.7298 | val_loss=1.5978 | val_acc=0.5268
Epoch 015 | train_loss=1.7161 | val_loss=1.5837 | val_acc=0.5301
Epoch 016 | train_loss=1.

We generate the new verses:

In [16]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
and have sery re and you o creaty forper forced

 Verse 2:
but fead when they not grow and comp ones in the silde to you crelinge the parade umong them with him do not send and while you indeed has but feat to evelely believe what they will be indous and he were but they will have forming sees were you wealishe pavessed to you and it is they peace from alla

 Verse 3:
there to you and will not had seople in the seak and will to the encingder me not and has been the sack so we who are allah are not has nirnent it hat and allah for then we are suongis of people accoterith wemall and their lord and except the songer to have prome and indean intaht air ahthersand you

 Verse 4:
there are this way rithen servealy with the people dis ahment igrilt diddented you astahe sorovess on the lord to ligh inlent om thise weurre slise whine thet frith of if their relued hak indeed allah made afy you and them allis and erseng hig the comselves what wh is it it


### LSTM - Quran

LSTM model for the Quran:

In [17]:
class CoranLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_size, lstm_hidden_size, padding_idx, dropout_p=0.5, pretrained_embeddings_ft=None):
        super().__init__()
        # Character embeddings; the padding_idx row receives no gradient updates
        self.char_emb = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx)
        if pretrained_embeddings_ft is not None:
            # Initialize from a (vocab_size, embedding_size) tensor of pretrained vectors
            self.char_emb.weight.data.copy_(pretrained_embeddings_ft)
        self.lstm = nn.LSTM(embedding_size, lstm_hidden_size, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_size, vocab_size)
        self.dropout_p = dropout_p

    def forward(self, x_in, apply_softmax=False):
        x_emb = self.char_emb(x_in)  # (batch, seq_len, embedding_size)
        y_out, _ = self.lstm(x_emb)  # (batch, seq_len, lstm_hidden_size)
        y_out = F.dropout(y_out, p=self.dropout_p, training=self.training)
        logits = self.fc(y_out)      # (batch, seq_len, vocab_size)
        return F.softmax(logits, dim=-1) if apply_softmax else logits

Training the LSTM on the Arabic Quran:

In [18]:
args, dataset, vectorizer, model_lstm = train_LSTM(coran_path="../data/cleaned_data/cleaned_arab_quran.txt",
                                                 output_path="Unai/Models/LSTM/arab/coran/coran_lstm_v1",
                                                 ruta_ft="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.5568 | val_loss=2.2842 | val_acc=0.3650
Epoch 002 | train_loss=2.1703 | val_loss=2.1156 | val_acc=0.4034
Epoch 003 | train_loss=2.0399 | val_loss=2.0243 | val_acc=0.4309
Epoch 004 | train_loss=1.9612 | val_loss=1.9674 | val_acc=0.4482
Epoch 005 | train_loss=1.9054 | val_loss=1.9163 | val_acc=0.4618
Epoch 006 | train_loss=1.8642 | val_loss=1.8829 | val_acc=0.4705
Epoch 007 | train_loss=1.8284 | val_loss=1.8591 | val_acc=0.4769
Epoch 008 | train_loss=1.7998 | val_loss=1.8333 | val_acc=0.4840
Epoch 009 | train_loss=1.7745 | val_loss=1.8199 | val_acc=0.4902
Epoch 010 | train_loss=1.7542 | val_loss=1.8036 | val_acc=0.4917
Epoch 011 | train_loss=1.7324 | val_loss=1.7955 | val_acc=0.4943
Epoch 012 | train_loss=1.7134 | val_loss=1.7818 | val_acc=0.5011
Epoch 013 | train_loss=1.6980 | val_loss=1.7750 | val_acc=0.4994
Epoch 014 | train_loss=1.6874 | val_loss=1.7642 | val_acc=0.5060
Epoch 015 | train_loss=1.6705 | val_loss=1.7570 | val_acc=0.5062
Epoch 016 | train_loss=1.

In [19]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
وضلغ عليه فرعون

 Verse 2:
واذ قالوا الذين يعلمون ان الذين كفروا طايام عميل ما جاءوا شهرا فاتوا بالجحمن منه الي انفسهم ۚ لهم هي يوم يقوم القيم العرف لاكثر ال نعمل مهلكم والله كربا وهم اله الرحمه ۖ ان الله الحكم وانا لتقول به ۚ وكان الله لا يتبعون ما تسجه الاتينا ۚ والذين يستكونوا عن ما اضتدوا به والجنه لا تيتهاعم الاخره ۖ وان

 Verse 3:
وان الله لا يستكرون بالبتغير

 Verse 4:
لا يستطعون انقلبوا من السماء الا ما كنتم تعلمون

 Verse 5:
قال رب اني انزل اليس لعلمه ياتينا الي الظالمين

 Verse 6:
ان يقول الله ونتكم اخره ۖ وكانوا قالوا الا الموت ۚ ان الله يعلم الموعاهم ان ايتهم صابرين غير البرسه وفي الملائكه وهو وحق الله واخذهم ولا يؤمنون

 Verse 7:
نستاف فيه من المسراه مجرمين

 Verse 8:
فلما اتخذ من السماء او تفرقوا في الحسانه ۗ والله الام بعلي شهيدا وقلوبهم او كانوا يعلمون

 Verse 9:
والذين اخذلون ظلام ۗ والله لا يكبن بالحق ۚ واتخذوا المؤمنات والنار ثم يريون واتيناه من المؤمنين

 Verse 10:
ربهم جنات عليه ۗ ۚ الا ان الله في الارض وما يدعون الي الله ز

We launch the LSTM training on the English text:

In [20]:
args, dataset, vectorizer, model_lstm = train_LSTM(coran_path="../data/cleaned_data/cleaned_english_quran.txt",
                                                 output_path="Unai/Models/LSTM/english/coran/coran_lstm_v1",
                                                 ruta_ft="../src/modelos/fasttext_english_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.2428 | val_loss=1.8082 | val_acc=0.4615
Epoch 002 | train_loss=1.7094 | val_loss=1.5443 | val_acc=0.5356
Epoch 003 | train_loss=1.5275 | val_loss=1.4156 | val_acc=0.5723
Epoch 004 | train_loss=1.4246 | val_loss=1.3365 | val_acc=0.5938
Epoch 005 | train_loss=1.3559 | val_loss=1.2767 | val_acc=0.6129
Epoch 006 | train_loss=1.3056 | val_loss=1.2412 | val_acc=0.6248
Epoch 007 | train_loss=1.2663 | val_loss=1.2099 | val_acc=0.6325
Epoch 008 | train_loss=1.2352 | val_loss=1.1853 | val_acc=0.6394
Epoch 009 | train_loss=1.2091 | val_loss=1.1635 | val_acc=0.6445
Epoch 010 | train_loss=1.1881 | val_loss=1.1513 | val_acc=0.6501
Epoch 011 | train_loss=1.1698 | val_loss=1.1328 | val_acc=0.6528
Epoch 012 | train_loss=1.1544 | val_loss=1.1248 | val_acc=0.6570
Epoch 013 | train_loss=1.1396 | val_loss=1.1145 | val_acc=0.6597
Epoch 014 | train_loss=1.1275 | val_loss=1.1108 | val_acc=0.6620
Epoch 015 | train_loss=1.1143 | val_loss=1.0992 | val_acc=0.6658
Epoch 016 | train_loss=1.

In [21]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
and indeed we to allah he will surely be of the obliging therein and you have done thereof but fear the wrongdoing people

 Verse 2:
the women side and remember the most compensation how many a party of them they never come and we saved you who does not know the home of your lord but those who disbelieve should be in the promise and thus have care are the in pertain that you meat one who denied the hearts of need of you a wrong t

 Verse 3:
allah this worldly life and it is allah who will be among the hearts are disbelievers

 Verse 4:
say if only may give them from his signs and we will say with they say o my people will him is a servant in the hereafter has come to you one who disbelieved as the light not but as a messenger and the prostration within the hereafter and the day he will be reconcerned and he associate anything afte

 Verse 5:
and they signs before you so allah do indeed they were in life a great corruption

 Verse 6:
and we were

## Hadith Dataset (Kaggle)

We inspect the structure of the DataFrame and keep the columns (hadiths) we need: `text_ar` and `text_en`. Because the file is structured in an unusual way, we perform a thorough cleanup.

In [22]:
hadith_df = pd.read_csv("../data/hadith_dataset/all_hadiths_clean.csv")
hadith_df.head(1)

Unnamed: 0,id,hadith_id,source,chapter_no,hadith_no,chapter,chain_indx,text_ar,text_en
0,0,1,Sahih Bukhari,1,1,Revelation - كتاب بدء الوحى,"30418, 20005, 11062, 11213, 11042, 3",حدثنا الحميدي عبد الله بن الزبير، قال حدثنا سف...,Narrated 'Umar bin Al-Khattab: ...


Arabic:

In [23]:
hadith_ar = hadith_df["text_ar"]
hadith_ar = pd.DataFrame(hadith_ar).dropna()
hadith_ar.to_csv("../data/hadith_dataset/hadith_ar/hadith_ar.csv", index=False, encoding="utf-8")
print(hadith_ar.count())
pd.set_option('display.max_colwidth', None)
hadith_ar.head(1)

text_ar    34433
dtype: int64


Unnamed: 0,text_ar
0,"حدثنا الحميدي عبد الله بن الزبير، قال حدثنا سفيان، قال حدثنا يحيى بن سعيد الأنصاري، قال أخبرني محمد بن إبراهيم التيمي، أنه سمع علقمة بن وقاص الليثي، يقول سمعت عمر بن الخطاب رضى الله عنه على المنبر قال سمعت رسول الله صلى الله عليه وسلم يقول ‏""‏ إنما الأعمال بالنيات، وإنما لكل امرئ ما نوى، فمن كانت هجرته إلى دنيا يصيبها أو إلى امرأة ينكحها فهجرته إلى ما هاجر إليه ‏""‏‏.‏"


Arabic cleaning:

In [24]:
QUOTE_CHARS = r"\"'“”„«»‹›`´"

# Arabic diacritics (harakat) + common Quranic marks
ARABIC_DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED]")

# Typical Unicode ranges for Arabic (basic + extended)
ARABIC_RANGES = r"\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF"

def _strip_wrapping_quotes(text: str, max_loops: int = 5) -> str:
    if not text:
        return text
    t = text.strip()
    for _ in range(max_loops):
        new_t = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', '', t)
        new_t = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', '', new_t)
        new_t = new_t.strip()
        if new_t == t:
            break
        t = new_t
    return t

def normalize_arabic(text: str, remove_diacritics: bool = True) -> str:
    # Unicode normalization (unifies equivalent forms)
    text = unicodedata.normalize("NFKC", text)

    # Remove tatweel (kashida)
    text = text.replace("\u0640", "")

    # Unify some common variants (optional, useful in many corpora)
    text = text.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")
    text = text.replace("ى", "ي")
    text = text.replace("ة", "ه")  # comment this line out to keep ta marbuta

    if remove_diacritics:
        text = re.sub(ARABIC_DIACRITICS, "", text)

    return text

def clean_hadith_text_ar(text, remove_diacritics: bool = True):
    if not isinstance(text, str):
        return ""

    # 1) Whitespace and line breaks
    text = text.replace("\n", " ").replace("\r", " ").strip()

    # 2) Remove wrapping quotes
    text = _strip_wrapping_quotes(text)

    # 3) Arabic normalization (no lowercasing)
    text = normalize_arabic(text, remove_diacritics=remove_diacritics)

    # 4) Remove the narrator prefix. The pattern is case-sensitive and only
    #    matches a lowercase English "narrated ..." header.
    palabras_clave = (
        r"(said|asked|the|i\s+heard|i\s+was\s+told|i\s+informed|while|informed|abu|allah|"
        r"if|when|once|some|whenever|it|sometimes|thereupon|then|and|but)"
    )
    patron_narrador = r'^\s*narrated\s+.*?[:\-]?\s*(?=\b' + palabras_clave + r'\b)'
    text = re.sub(patron_narrador, "", text).strip()

    # 5) Remove leftover quotes
    text = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', "", text)
    text = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', "", text)

    # 6) Keep Arabic letters, digits, whitespace and basic punctuation,
    #    including Arabic punctuation: ، ؛ ؟ (comma, semicolon, question mark).
    #    The character class is negated, so it matches everything to be replaced.
    disallowed = rf"[^0-9\s{ARABIC_RANGES}\.,!?'\-\(\)«»\"،؛؟]"
    text = re.sub(disallowed, " ", text)

    # 7) Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text
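
To see what the normalization steps do to a single word, the following minimal sketch applies them to a sample string. It is a condensed illustration, not the notebook's function: `normalize_arabic_sketch` and `sample` are hypothetical names, and the characters are written as Unicode escapes only to make each code point explicit.

```python
import re
import unicodedata

# Same ranges as the cleaning code above: harakat and common Quranic marks.
DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED]")


def normalize_arabic_sketch(text: str) -> str:
    """Condensed, illustrative version of the normalization steps."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u0640", "")               # drop tatweel (kashida)
    for variant in ("\u0623", "\u0625", "\u0622"):  # alef with hamza or madda
        text = text.replace(variant, "\u0627")      # -> bare alef
    text = text.replace("\u0649", "\u064A")         # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")         # ta marbuta -> ha
    return DIACRITICS.sub("", text)


# A word written with an alef-hamza, three diacritics and one tatweel.
sample = "\u0627\u0644\u0623\u064E\u0639\u0652\u0645\u0640\u064E\u0627\u0644\u064F"
normalized = normalize_arabic_sketch(sample)
print(len(sample), "->", len(normalized), "code points")
print(normalized)
```

Collapsing these variants shrinks the character vocabulary the model has to learn, at the cost of losing distinctions such as ta marbuta versus ha in the generated text.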

In [25]:
hadith_ar = hadith_df[["text_ar"]].copy()

hadith_ar = hadith_ar.dropna(subset=["text_ar"])

# Note: the first row is dropped here and again after cleaning,
# so the first two non-null rows are discarded in total.
hadith_ar = hadith_ar.iloc[1:].reset_index(drop=True)

hadith_ar["text_ar"] = hadith_ar["text_ar"].apply(clean_hadith_text_ar)
hadith_ar = hadith_ar.iloc[1:].reset_index(drop=True)

# Discard rows left empty by the cleaning step
hadith_ar = hadith_ar[hadith_ar["text_ar"] != ""].reset_index(drop=True)

output_path = "../data/hadith_dataset/hadith_ar/hadith_ar_cleaned.csv"

hadith_ar.to_csv(output_path, index=False, encoding="utf-8")

English, with its own cleaning function:

In [26]:
QUOTE_CHARS = r"\"'“”„«»‹›`´"

def _strip_wrapping_quotes(text: str, max_loops: int = 5) -> str:
    if not text:
        return text

    t = text.strip()
    for _ in range(max_loops):
        # ^\s*["'“”...]+ captures quotes at the start; ["'“”...]+\s*$ captures them at the end
        new_t = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', '', t)
        new_t = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', '', new_t)
        new_t = new_t.strip()
        if new_t == t:
            break
        t = new_t
    return t

def clean_hadith_text(text):
    if not isinstance(text, str):
        return ""

    text = text.replace('\n', ' ').replace('\r', ' ').strip()

    text = _strip_wrapping_quotes(text)

    text = text.replace('""', '"').lower()

    # Strip the original .csv format: "narrated <narrator name>" followed by the text we want
    palabras_clave = (
        r"(said|asked|the|i\s+heard|i\s+was\s+told|i\s+informed|while|informed|abu|allah|"
        r"if|when|once|some|whenever|it|sometimes|thereupon|then|and|but)"
    )
    patron_narrador = r'^\s*narrated\s+.*?[:\-]?\s*(?=\b' + palabras_clave + r'\b)'
    text = re.sub(patron_narrador, '', text).strip()

    text = re.sub(rf'^\s*[{QUOTE_CHARS}]+\s*', '', text)
    text = re.sub(rf'\s*[{QUOTE_CHARS}]+\s*$', '', text)
    text = re.sub(r"[^a-z0-9\s.,!?'\-\(\)]", " ", text)

    text = re.sub(r"\s+", " ", text).strip()

    return text
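
The narrator pattern is easier to reason about on concrete inputs. The sketch below reuses the same regular expression on two made-up strings; `strip_narrator` and the sample sentences are hypothetical and only illustrate the matching behavior.

```python
import re

KEYWORDS = (
    r"(said|asked|the|i\s+heard|i\s+was\s+told|i\s+informed|while|informed|abu|allah|"
    r"if|when|once|some|whenever|it|sometimes|thereupon|then|and|but)"
)
# Lazy match from "narrated" up to the first keyword that starts a word.
NARRATOR = re.compile(r"^\s*narrated\s+.*?[:\-]?\s*(?=\b" + KEYWORDS + r"\b)")


def strip_narrator(text: str) -> str:
    """Remove a leading 'narrated <name>:' prefix, if the pattern matches."""
    return NARRATOR.sub("", text).strip()


examples = [
    "narrated 'umar bin al-khattab: i heard allah's apostle saying",
    "narrated abu huraira: the prophet said",
]
for example in examples:
    print(repr(strip_narrator(example)))
```

Because the match is lazy and `abu` is itself a keyword, the second example only loses the word `narrated`: the narrator's name stays in the text. This is consistent with generated samples that open with a narrator name.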

In [27]:
hadith_en = hadith_df[["text_en"]].copy()

hadith_en = hadith_en.dropna(subset=["text_en"])

# Note: the first row is dropped here and again after cleaning,
# so the first two non-null rows are discarded in total.
hadith_en = hadith_en.iloc[1:].reset_index(drop=True)

hadith_en["text_en"] = hadith_en["text_en"].apply(clean_hadith_text)
hadith_en = hadith_en.iloc[1:].reset_index(drop=True)

# Discard rows left empty by the cleaning step
hadith_en = hadith_en[hadith_en["text_en"] != ""].reset_index(drop=True)

output_path = "../data/hadith_dataset/hadith_en/hadith_en_cleaned.csv"

hadith_en.to_csv(output_path, index=False, encoding="utf-8")

Dataset class for the Hadith data, adapted to the format of the *Kaggle* dataset:

In [28]:
class HadithDataset(Dataset):
    def __init__(self, df: pd.DataFrame, vectorizer: CoranVectorizer, text_col):
        # text_col: text_en (hadith_en) or text_ar (hadith_ar)
        self.df = df.reset_index(drop=True)
        self._vectorizer = vectorizer
        self._text_col = text_col
        # Longest text plus 2, capped at 500 positions
        self._max_seq_length = min(int(self.df[text_col].astype(str).map(len).max()) + 2, 500)
        n = len(self.df)
        train_end = int(n * 0.70)  # first 70% of the rows go to the train set
        val_end = int(n * .85)  # next 15% to the validation set, the final 15% to the test set

        self.train_df = self.df.iloc[:train_end]
        self.val_df = self.df.iloc[train_end:val_end]
        self.test_df = self.df.iloc[val_end:]

        self._lookup_dict = {
            "train": (self.train_df, len(self.train_df)),
            "val": (self.val_df, len(self.val_df)),
            "test": (self.test_df, len(self.test_df)),
        }

        self.set_split("train")

    @classmethod
    def load_dataset_and_make_vectorizer(cls, hadith_csv, text_col):
        df = pd.read_csv(hadith_csv)
        # Lowercase the configured text column (has no effect on Arabic script)
        df[text_col] = df[text_col].astype(str).str.lower()
        vectorizer = CoranVectorizer.from_dataframe(df, text_col)
        return cls(df, vectorizer, text_col)

    @classmethod
    def load_dataset_and_load_vectorizer(cls, hadith_csv, vectorizer_filepath, text_col):
        df = pd.read_csv(hadith_csv)
        # Lowercase the configured text column (has no effect on Arabic script)
        df[text_col] = df[text_col].astype(str).str.lower()
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(df, vectorizer, text_col)
        
    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath, "r", encoding="utf-8") as f:
            contents = json.load(f)
        return CoranVectorizer.from_serializable(contents)

    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, "w", encoding="utf-8") as f:
            json.dump(self._vectorizer.to_serializable(), f, ensure_ascii=False)

    def get_vectorizer(self):
        return self._vectorizer

    def set_split(self, split="train"):
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        row = self._target_df.iloc[index]
        text = str(row[self._text_col])
        x, y = self._vectorizer.vectorize(text, vector_length=self._max_seq_length)
        return {"x_data": torch.tensor(x, dtype=torch.long), "y_target": torch.tensor(y, dtype=torch.long)}
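
`HadithDataset` slices the DataFrame in file order, so the train, validation and test sets are three contiguous blocks. If the CSV happens to be grouped by collection (an assumption; the `source` column suggests it might be), validation and test would come from different collections than training. As a hedged alternative, not what the notebook does, a seeded shuffle of the row indices could be sketched as follows; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=1337):
    """Return shuffled (train, val, test) index lists; hypothetical helper."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With pandas, the three lists can be passed to `df.iloc[...]` to build the per-split DataFrames.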

### RNN - Hadith

In [29]:
def train_RNN(hadith_path, output_path, text_col, ft_ruta):
    args = Namespace(
        hadith_csv=hadith_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # 300 because the fastText embeddings have 300 dimensions; the sizes must match
        rnn_hidden_size=128, # training crashed with 256

        seed=1337,
        learning_rate=1e-3,
        batch_size=256,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    print(args.batch_size)

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = HadithDataset.load_dataset_and_load_vectorizer(args.hadith_csv, args.vectorizer_file, text_col)
    else:
        dataset = HadithDataset.load_dataset_and_make_vectorizer(args.hadith_csv, text_col)
        dataset.save_vectorizer(args.vectorizer_file)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    def obtener_pesos(vectorizer, modelo_ft):
        vocab = vectorizer.char_vocab
        token_to_idx = vocab._token_to_idx
        tamaño_vocab = len(token_to_idx)
        embedding_dim = modelo_ft.get_dimension()
        pesos = np.zeros((tamaño_vocab, embedding_dim))

        for token, idx in token_to_idx.items():
            pesos[idx] = modelo_ft.get_word_vector(token)
    
        return torch.FloatTensor(pesos)

    ft_model = fasttext.load_model(ft_ruta)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_model)

    model = CoranRNN(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        rnn_hidden_size=args.rnn_hidden_size,
        padding_idx=mask_index,
        pretrained_embeddings_ft=None  # pretrained_ft_pesos is computed above but not passed; embeddings train from scratch
    ).to(args.device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Val
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
              f"| val_loss={vloss:.4f} | val_acc={vacc:.4f}")
        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping triggered.")
            break

    return args, dataset, vectorizer, model
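
The training loop tracks loss and accuracy with the update `running += (value - running) / (bi + 1)`. This is an incremental mean: after the last batch it equals the plain average of the per-batch values, without storing them. A minimal sketch with made-up numbers (`running_mean` and `batch_losses` are illustrative names):

```python
def running_mean(values):
    """Incremental mean, updated one value at a time as in the training loop."""
    mean = 0.0
    for i, value in enumerate(values):
        mean += (value - mean) / (i + 1)
    return mean


batch_losses = [2.5, 2.1, 1.9, 1.8]  # made-up per-batch losses
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
print(incremental, direct)
```

Note that this is an average over batches. If the final batch is smaller than the others, it still counts with the same weight, so the reported value can differ slightly from a per-example average.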

RNN training on the Arabic Hadith dataset:

In [30]:
args, dataset, vectorizer, model_rnn = train_RNN(hadith_path="../data/hadith_dataset/hadith_ar/hadith_ar_cleaned.csv",
                                                 output_path="Unai/Models/RNN/arab/hadith/coran_rnn_v1",
                                                 text_col="text_ar",
                                                 ft_ruta="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

256


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.4272 | val_loss=1.9430 | val_acc=0.4766
Epoch 002 | train_loss=1.9415 | val_loss=1.7300 | val_acc=0.5292
Epoch 003 | train_loss=1.8107 | val_loss=1.6277 | val_acc=0.5493
Epoch 004 | train_loss=1.7418 | val_loss=1.5645 | val_acc=0.5646
Epoch 005 | train_loss=1.6981 | val_loss=1.5274 | val_acc=0.5728
Epoch 006 | train_loss=1.6696 | val_loss=1.5026 | val_acc=0.5824
Epoch 007 | train_loss=1.6496 | val_loss=1.4840 | val_acc=0.5888
Epoch 008 | train_loss=1.6339 | val_loss=1.4710 | val_acc=0.5917
Epoch 009 | train_loss=1.6217 | val_loss=1.4605 | val_acc=0.5948
Epoch 010 | train_loss=1.6108 | val_loss=1.4547 | val_acc=0.6012
Epoch 011 | train_loss=1.6021 | val_loss=1.4449 | val_acc=0.6002
Epoch 012 | train_loss=1.5952 | val_loss=1.4351 | val_acc=0.6051
Epoch 013 | train_loss=1.5879 | val_loss=1.4306 | val_acc=0.6057
Epoch 014 | train_loss=1.5830 | val_loss=1.4259 | val_acc=0.6075
Epoch 015 | train_loss=1.5773 | val_loss=1.4225 | val_acc=0.6068
Epoch 016 | train_loss=1.

In [31]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
حدثناه العياوب وحبنا عنرادين بن يحيي بن حبير بن عائشه، ان رسول الله صلي الله عليه وسلم قتله، فقال له عن يليمه، انه سود اغيره الالحا فقد النيل فلاالك رحيت او يقتل الليت الله ومن امعاما قال حفيده عن ابن عبيد الله الجابي ثمره بيد ما لا يسيك بن عصر بالسمع فخرف عن ابي هذا هذي الملاج بهل يشاكل بيله ولس حم

 Verse 2:
حدثنا ابو العبا بن حدثني، تحار واما منا ويلم ابن صالع اله خفيان في الناعمه وسمع بيوه فقال قال عنيي كما يقرا يم ما انضر بعضاه في وابا علي ابن شعبه من ابي اسله مر اللكه .

 Verse 3:
حدثني ابو التمول الولا والملرص فيه في سلمه علي السهاء يا رسول الله لا اسمذي ان اليصيت عن رسول الله صلي الله عليه وسلم فانتا واولي ابن عمر وجدي البجراع شهس الاسلعم " .

 Verse 4:
حدثنا محمد بن ابي مالي، عن سعيد بن ابي علي، قال قال رسول الله صلي الله علي الله والوخرا والعلي بن النبي صلي الله عليه وسلم " انه تعتلها علي الاساق ورسول الله صلي الله عليه وسلم يحدث ان يسطر ينك بالله في الصلاه الهما ان رسول الله صلي الله عليه وسلم قال " لي كوا فما انا جعلم الهارد وما كلن

RNN training on the English Hadith dataset:

In [32]:
args, dataset, vectorizer, model_rnn = train_RNN(hadith_path="../data/hadith_dataset/hadith_en/hadith_en_cleaned.csv",
                                                 output_path="Unai/Models/RNN/english/hadith/coran_rnn_v1",
                                                 text_col="text_en",
                                                 ft_ruta="../src/modelos/fasttext_english_busqueda_semantica.bin")

256


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=2.4778 | val_loss=2.0084 | val_acc=0.4385
Epoch 002 | train_loss=2.0395 | val_loss=1.8146 | val_acc=0.4940
Epoch 003 | train_loss=1.9119 | val_loss=1.7096 | val_acc=0.5117
Epoch 004 | train_loss=1.8377 | val_loss=1.6460 | val_acc=0.5303
Epoch 005 | train_loss=1.7901 | val_loss=1.6046 | val_acc=0.5391
Epoch 006 | train_loss=1.7570 | val_loss=1.5759 | val_acc=0.5439
Epoch 007 | train_loss=1.7324 | val_loss=1.5540 | val_acc=0.5498
Epoch 008 | train_loss=1.7141 | val_loss=1.5397 | val_acc=0.5537
Epoch 009 | train_loss=1.6990 | val_loss=1.5253 | val_acc=0.5564
Epoch 010 | train_loss=1.6866 | val_loss=1.5149 | val_acc=0.5592
Epoch 011 | train_loss=1.6768 | val_loss=1.5058 | val_acc=0.5617
Epoch 012 | train_loss=1.6682 | val_loss=1.4992 | val_acc=0.5633
Epoch 013 | train_loss=1.6607 | val_loss=1.4932 | val_acc=0.5675
Epoch 014 | train_loss=1.6544 | val_loss=1.4858 | val_acc=0.5699
Epoch 015 | train_loss=1.6488 | val_loss=1.4833 | val_acc=0.5681
Epoch 016 | train_loss=1.

In [33]:
nuevos_versos(model_rnn, vectorizer)

------------------------------

 Verse 1:
igh in the anoungilen to the bra' gour bin ald before his wame not nor mand to remater and in his hands wan for his camen came of all hammed buts properte whan with the messenger of upo thes sar and the sun) the prophet who said and and the prophet ( ) said when he who was would narration. him the p

 Verse 2:
in while it and canep it in the promertidia's for that antwe of make the for it and and then are the ritting the messerch that when i with entrong to we more leasted and they thel would, who we the butren in the prophet ( ) upon him) har the som and abna le taman abdullah be for hand of ook.

 Verse 3:
said and is with them to said a man from them and he rashed of in perood as allah be parmodeven the perpoat in the last in the sow unon the sayevers ham and she serno sacepll that he wat and the the are that had allah if he of a torning has dightrea the prophet houbadass with the apostle sood he not

 Verse 4:
the runt of his allahs of ibn 

### LSTM - Hadith

In [34]:
def train_LSTM(hadith_path, output_path, text_col, ft_ruta):
    args = Namespace(
        hadith_csv=hadith_path,
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir=output_path,

        char_embedding_size=300, # same as the fastText embedding dimension
        lstm_hidden_size=256,

        seed=1337,
        learning_rate=1e-3,
        batch_size=64,
        num_epochs=50,
        early_stopping_criteria=5,

        cuda=True,
        reload_from_files=False
    )

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
        args.device = torch.device("cuda")
    else:
        args.device = torch.device("cpu")

    os.makedirs(args.save_dir, exist_ok=True)
    if args.vectorizer_file and not os.path.isabs(args.vectorizer_file):
        args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
    if args.model_state_file and not os.path.isabs(args.model_state_file):
        args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    if args.reload_from_files and os.path.exists(args.vectorizer_file):
        dataset = HadithDataset.load_dataset_and_load_vectorizer(args.hadith_csv, args.vectorizer_file, text_col)
    else:
        dataset = HadithDataset.load_dataset_and_make_vectorizer(args.hadith_csv, text_col)
        dataset.save_vectorizer(args.vectorizer_file)

    vectorizer = dataset.get_vectorizer()
    mask_index = vectorizer.char_vocab.mask_index

    def obtener_pesos(vectorizer, modelo_ft):
        vocab = vectorizer.char_vocab
        token_to_idx = vocab._token_to_idx
        tamaño_vocab = len(token_to_idx)
        embedding_dim = modelo_ft.get_dimension()
        pesos = np.zeros((tamaño_vocab, embedding_dim))

        for token, idx in token_to_idx.items():
            pesos[idx] = modelo_ft.get_word_vector(token)
    
        return torch.FloatTensor(pesos)

    ft_modelo = fasttext.load_model(ft_ruta)
    pretrained_ft_pesos = obtener_pesos(vectorizer, ft_modelo)

    model = CoranLSTM(
        vocab_size=len(vectorizer.char_vocab),
        embedding_size=args.char_embedding_size,
        lstm_hidden_size=args.lstm_hidden_size,
        padding_idx=mask_index,
        pretrained_embeddings_ft=None  # pretrained_ft_pesos is computed above but not passed; embeddings train from scratch
    ).to(args.device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

    train_state = make_train(args)

    for epoch in tqdm_notebook(range(args.num_epochs)):
        train_state["epoch_index"] = epoch

        # Train
        dataset.set_split("train")
        model.train()
        running_loss, running_acc = 0.0, 0.0
        for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=True)):
            optimizer.zero_grad()
            y_pred = model(batch["x_data"])
            loss = sequence_loss(y_pred, batch["y_target"], mask_index)
            loss.backward()
            optimizer.step()

            running_loss += (loss.item() - running_loss) / (bi + 1)
            acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
            running_acc += (acc - running_acc) / (bi + 1)

        train_state["train_loss"].append(running_loss)
        train_state["train_acc"].append(running_acc)

        # Val
        dataset.set_split("val")
        model.eval()
        vloss, vacc = 0.0, 0.0
        with torch.no_grad():
            for bi, batch in enumerate(generate_batches(dataset, args.batch_size, args.device, shuffle=False)):
                y_pred = model(batch["x_data"])
                loss = sequence_loss(y_pred, batch["y_target"], mask_index)

                vloss += (loss.item() - vloss) / (bi + 1)
                acc = compute_accuracy(y_pred, batch["y_target"], mask_index)
                vacc += (acc - vacc) / (bi + 1)

        train_state["val_loss"].append(vloss)
        train_state["val_acc"].append(vacc)

        train_state = update_training_state(args, model, train_state)
        scheduler.step(vloss)

        print(f"Epoch {epoch+1:03d} | train_loss={running_loss:.4f} "
              f"| val_loss={vloss:.4f} | val_acc={vacc:.4f}")
        
        dataset.save_vectorizer(args.vectorizer_file)
        torch.save(model.state_dict(), args.model_state_file)

        if train_state["stop_early"]:
            print("Early stopping triggered.")
            break

    return args, dataset, vectorizer, model


LSTM training on the Arabic Hadith dataset:

In [35]:
args, dataset, vectorizer, model_lstm = train_LSTM(hadith_path="../data/hadith_dataset/hadith_ar/hadith_ar_cleaned.csv",
                                                 output_path="Unai/Models/LSTM/arab/hadith/hadith_lstm_v1",
                                                 text_col="text_ar",
                                                 ft_ruta="../src/modelos/fasttext_arabic_busqueda_semantica.bin")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for epoch in tqdm_notebook(range(args.num_epochs)):


  0%|          | 0/50 [00:00<?, ?it/s]

Epoch 001 | train_loss=1.7276 | val_loss=1.3263 | val_acc=0.6259
Epoch 002 | train_loss=1.3555 | val_loss=1.2133 | val_acc=0.6521
Epoch 003 | train_loss=1.2678 | val_loss=1.1617 | val_acc=0.6646
Epoch 004 | train_loss=1.2200 | val_loss=1.1318 | val_acc=0.6727
Epoch 005 | train_loss=1.1885 | val_loss=1.1050 | val_acc=0.6813
Epoch 006 | train_loss=1.1653 | val_loss=1.0935 | val_acc=0.6826
Epoch 007 | train_loss=1.1478 | val_loss=1.0778 | val_acc=0.6885
Epoch 008 | train_loss=1.1332 | val_loss=1.0643 | val_acc=0.6916
Epoch 009 | train_loss=1.1221 | val_loss=1.0554 | val_acc=0.6941
Epoch 010 | train_loss=1.1112 | val_loss=1.0470 | val_acc=0.6966
Epoch 011 | train_loss=1.1030 | val_loss=1.0426 | val_acc=0.6972
Epoch 012 | train_loss=1.0953 | val_loss=1.0392 | val_acc=0.6986
Epoch 013 | train_loss=1.0887 | val_loss=1.0299 | val_acc=0.7010
Epoch 014 | train_loss=1.0828 | val_loss=1.0252 | val_acc=0.7020
Epoch 015 | train_loss=1.0779 | val_loss=1.0237 | val_acc=0.7022
Epoch 016 | train_loss=1.

In [36]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
حدثنا محمد بن بشار، حدثنا يحيي، - يعني ابن زيد - حدثنا معمر، عن الزهري، عن سالم، عن ابن عمر، ان رسول الله صلي الله عليه وسلم قال " اذا اصبح ان يسترض بلته فقد صاع الناس لعله وتقام الكلب القالي " .

 Verse 2:
حدثنا ابو نعيم، حدثنا ابن ابي مريم، حدثنا عمرو بن الحارث، عن ابي الزبير، عن جابر، قال قال رسول الله صلي الله عليه وسلم " باسل جعله عصره بني السمع الا الاعلي والخشيه الي الصبي وابو بهم من من اكثره بالفرق له " . قال يحيي في قوله والله لا اغتسل والا امرت الا المهدي لربع اني الزناء او ان يستحبها اللقيه 

 Verse 3:
حدثنا اسماعيل، قال حدثني مالك، عن ابي الزبير، عن ابي الاعمش، عن ابي هريره، قال قال رسول الله صلي الله عليه وسلم " لا يلبس المشركين . قال فاعيني ان يسال عن امر الخير واخذها في عرده ابو حمزه الخفين " . قال مالك وهي التراويه . فقال لي رسول الله صلي الله عليه وسلم " ان شما وان يعقد الاباري العبد المحاخي

 Verse 4:
حدثنا سليمان بن حرب، حدثنا شعبه، عن يزيد بن ابي ابي، عبد الله عن عائشه رضي الله عنها انها قالت انبات الشمس في الاعام كلامته . ف

LSTM training on the English Hadith dataset:

In [37]:
args, dataset, vectorizer, model_lstm = train_LSTM(
    hadith_path="../data/hadith_dataset/hadith_en/hadith_en_cleaned.csv",
    output_path="Unai/Models/LSTM/english/hadith/hadith_lstm_v1",
    text_col="text_en",
    ft_ruta="../src/modelos/fasttext_english_busqueda_semantica.bin",
)

Epoch 001 | train_loss=1.8283 | val_loss=1.3833 | val_acc=0.5927
Epoch 002 | train_loss=1.3998 | val_loss=1.2415 | val_acc=0.6284
Epoch 003 | train_loss=1.2932 | val_loss=1.1730 | val_acc=0.6460
Epoch 004 | train_loss=1.2379 | val_loss=1.1359 | val_acc=0.6558
Epoch 005 | train_loss=1.2025 | val_loss=1.1139 | val_acc=0.6595
Epoch 006 | train_loss=1.1758 | val_loss=1.0901 | val_acc=0.6675
Epoch 007 | train_loss=1.1563 | val_loss=1.0759 | val_acc=0.6709
Epoch 008 | train_loss=1.1409 | val_loss=1.0704 | val_acc=0.6722
Epoch 009 | train_loss=1.1278 | val_loss=1.0549 | val_acc=0.6766
Epoch 010 | train_loss=1.1175 | val_loss=1.0462 | val_acc=0.6812
Epoch 011 | train_loss=1.1080 | val_loss=1.0432 | val_acc=0.6793
Epoch 012 | train_loss=1.1005 | val_loss=1.0356 | val_acc=0.6832
Epoch 013 | train_loss=1.0935 | val_loss=1.0314 | val_acc=0.6838
Epoch 014 | train_loss=1.0875 | val_loss=1.0249 | val_acc=0.6859
Epoch 015 | train_loss=1.0823 | val_loss=1.0229 | val_acc=0.6867
Epoch 016 | train_loss=1.

In [38]:
nuevos_versos(model_lstm, vectorizer)

------------------------------

 Verse 1:
abu sa'id al-khudri reported that allah's messenger (may peace be upon him) was slaid that the prophet clothes his (umra) and then said, the messenger of allah (may peace be upon him) said he came to allah's messenger (may peace be upon him) as some of the fire shall be in the truth of it is the all

 Verse 2:
the messenger of allah ( ) said indeed i profess more than three and the shoulder of the night of his wife. i were under his latter and he about the side committing water.

 Verse 3:
allah's apostle asked about the sun as a dream in the messenger of allah ( ) on the day of nahr until the day of judgment who blessed 'abdullah said to him who then appeared the camels ((i.e. the people) replied yes, he would say this is a narration stones the most response by a prevented (with his 

 Verse 4:
the prophet ( ) said if they used to say indeed will the day of resurrection.

 Verse 5:
abu hurairah said a man came to mu'adh is the second person wh

We will carry out the comparisons and explain our conclusions in the documentation.
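As a starting point for that comparison, the training logs above can be read back into a structured form. The following is only a minimal sketch, assuming the log lines keep the `Epoch NNN | train_loss=... | val_loss=... | val_acc=...` format printed by the training loop. The helpers `parse_log` and `best_epoch` are hypothetical names introduced here for illustration and are not part of the notebook. The sample strings copy the last visible epochs of the Arabic and English runs, including the truncated final line, which is skipped.

```python
import re

# Format of the lines printed during training, e.g.
# "Epoch 015 | train_loss=1.0779 | val_loss=1.0237 | val_acc=0.7022"
LOG_LINE = re.compile(
    r"Epoch (\d+) \| train_loss=([\d.]+) \| val_loss=([\d.]+) \| val_acc=([\d.]+)"
)


def parse_log(text):
    """Return one dict per complete epoch line; incomplete lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = LOG_LINE.fullmatch(line.strip())
        if match is None:
            continue  # e.g. the truncated "Epoch 016 | train_loss=1." line
        epoch, train_loss, val_loss, val_acc = match.groups()
        rows.append(
            {
                "epoch": int(epoch),
                "train_loss": float(train_loss),
                "val_loss": float(val_loss),
                "val_acc": float(val_acc),
            }
        )
    return rows


def best_epoch(rows):
    """Pick the epoch with the lowest validation loss (hypothetical criterion)."""
    return min(rows, key=lambda row: row["val_loss"])


# Last visible epochs of each run, copied from the outputs above.
arabic_log = """Epoch 014 | train_loss=1.0828 | val_loss=1.0252 | val_acc=0.7020
Epoch 015 | train_loss=1.0779 | val_loss=1.0237 | val_acc=0.7022
Epoch 016 | train_loss=1."""

english_log = """Epoch 014 | train_loss=1.0875 | val_loss=1.0249 | val_acc=0.6859
Epoch 015 | train_loss=1.0823 | val_loss=1.0229 | val_acc=0.6867
Epoch 016 | train_loss=1."""

for name, log in (("arabic", arabic_log), ("english", english_log)):
    best = best_epoch(parse_log(log))
    print(f"{name}: epoch={best['epoch']} val_loss={best['val_loss']} val_acc={best['val_acc']}")
```

Selecting by validation loss is just one possible criterion. The same rows could equally be loaded into a `pandas.DataFrame` to plot both runs side by side.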