### **Señas Chapinas: LENSEGUA Translator**
#### *Natural Language Processing Module*

Stefano Alberto Aragoni Maldonado

-------------------

#### **Data Augmentation**

In the initial phase of the project, Spanish sentences written with the grammar of LENSEGUA (Lengua de Señas de Guatemala) were collected. These sentences were stored in a CSV file alongside their grammatically correct Spanish counterparts.

These sentence pairs will be used to fine-tune a pre-trained LLM so that the model learns LENSEGUA grammar. The goal is for it to interpret sentences that follow this grammar and rewrite them as correct Spanish.

Fine-tuning requires many sentences, so data augmentation is applied. Sentences are modified along several variables, such as verb tense, places, subjects, and other grammatical elements. This produces more training data, which should improve the pre-trained model's ability to understand text written with LENSEGUA grammar and generate the corresponding Spanish.

The data augmentation process includes several specific techniques. One is replacing words with synonyms, which preserves the meaning of a sentence while introducing lexical variation. Another is changing the verb tense, transforming sentences from present to past or future and vice versa. In addition, the places and subjects of a sentence can be altered to generate different contexts.

___

#### *Import libraries*
As a first step, the libraries needed for the notebook are imported.

In [1]:
# Import libraries and modules
import numpy as np
import pandas as pd
import csv
import re
from collections import Counter
import nltk
from nltk import ngrams, CFG
from nltk.corpus import wordnet
from copy import deepcopy
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from prettytable import PrettyTable

from regex import *

___

#### *Load the dataset*

Next, the dataset containing the Spanish sentences and their LENSEGUA counterparts is loaded.

In [3]:
data = pd.read_csv("../../dataset/raw/dataset.csv")                               # Load the dataset

data["LENSEGUA"] = data["LENSEGUA"].apply(lambda x: x.lower())  # Convert to lowercase
data["ESPAÑOL"] = data["ESPAÑOL"].apply(lambda x: x.lower())    # Convert to lowercase

data["LENSEGUA"] = data["LENSEGUA"].apply(lambda x: re.sub(r'"', '', x))    # Remove double quotes
data["ESPAÑOL"] = data["ESPAÑOL"].apply(lambda x: re.sub(r'"', '', x))      # Remove double quotes

data.head(20)                                                   # Display the first 20 rows of the dataset         

| | LENSEGUA | ESPAÑOL |
|---|---|---|
| 0 | abuela 70 años tener | mi abuela tiene 70 años. |
| 1 | abuelo 70 años tener | mi abuelo tiene 70 años. |
| 2 | abuelo enfermo mucho | mi abuelo está muy enfermo. |
| 3 | aeropuerto dónde pregunta | ¿dónde está el aeropuerto? |
| 4 | aeropuerto yo ir cuál pregunta | ¿a cuál aeropuerto tengo que ir? |
| 5 | ahora tu libro leer cuál pregunta | ¿cuál libro estás leyendo ahora? |
| 6 | alegre señas aprender y hacer | es alegre aprender y hacer señas. |
| 7 | amiga desaparecer tu información compartir | mi amiga desapareció. comparte la información. |
| 8 | anillo precio cuánto pregunta | ¿cuánto cuesta el anillo? |
| 9 | año pasado ella carro aprender | ella aprendió a manejar el año pasado. |
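
The same normalization (lowercasing and removing double quotes) can also be written with pandas' vectorized string methods instead of `apply` with a lambda. The following is a minimal sketch on a small in-memory frame; the two sample rows and the helper name `normalize_columns` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def normalize_columns(df, columns=("LENSEGUA", "ESPAÑOL")):
    """Return a copy of df with the given text columns lowercased and without double quotes."""
    out = df.copy()
    for col in columns:
        # regex=False treats the quote as a literal character
        out[col] = out[col].str.lower().str.replace('"', "", regex=False)
    return out

# Hypothetical sample rows, shaped like the dataset above
sample = pd.DataFrame({
    "LENSEGUA": ['"Abuela 70 años tener"', "Aeropuerto dónde pregunta"],
    "ESPAÑOL": ['"Mi abuela tiene 70 años."', "¿Dónde está el aeropuerto?"],
})
clean = normalize_columns(sample)
print(clean)
```

Working on a copy leaves the raw frame untouched, which makes it easier to compare the data before and after cleaning.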


Next, the dataset is split into 80% for training and 20% for validation. The data augmentation techniques are applied only to the training set.

In [75]:
data = data.sample(frac=1).reset_index(drop=True)               # Shuffle data

train_size = int(0.8 * len(data))                               # Calculate size of training set

train_data = data[:train_size].copy()                           # Training set: first 80% of the shuffled data (copy avoids SettingWithCopyWarning later)
validation_data = data[train_size:].copy()                      # Validation set: remaining 20% of the shuffled data


# ------------------------ 
data = train_data                                               # Set data to training data
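
The shuffle above does not fix a random seed, so every run produces a different split. If a reproducible split is wanted, one option is sketched below. The function name `split_train_validation`, the parameters `train_frac` and `seed`, and the toy frame are assumptions made for illustration only.

```python
import pandas as pd

def split_train_validation(df, train_frac=0.8, seed=42):
    """Shuffle df with a fixed seed and return (train, validation) as independent copies."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    train_size = int(train_frac * len(shuffled))
    train = shuffled.iloc[:train_size].copy()
    validation = shuffled.iloc[train_size:].copy()
    return train, validation

# Hypothetical toy data with 10 sentence pairs
toy = pd.DataFrame({
    "LENSEGUA": [f"frase {i}" for i in range(10)],
    "ESPAÑOL": [f"oración {i}" for i in range(10)],
})
train_toy, validation_toy = split_train_validation(toy)
print(len(train_toy), len(validation_toy))
```

Because augmentation is applied only after the split, no augmented variant of a validation sentence can leak into the training set.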

___

#### *Word equivalences*

Next, the word equivalences used for word substitution are defined. Each list groups interchangeable words by category and grammatical gender.

In [76]:
# ------ Places ------

masc_home_places = ["cuarto", "baño", "comedor", "garage", "sótano", "balcón", "pasillo", "estudio", "vestíbulo", "closet"]
fem_home_places = ["cocina", "sala", "habitación", "terraza", "bodega", "lavandería"]

masc_outdoor_places = ["jardín", "patio", "parque", "río", "lago", "bosque", "desierto", "volcán", "cerro", "cementerio"]
fem_outdoor_places = ["playa", "montaña", "cascada", "cueva", "selva", "pradera", "isla", "laguna", "granja", "colina", "finca", "reserva"]

masc_work_places = ["estudio", "consultorio", "cubículo", "salón", "laboratorio", "colegio", "taller", "instituto"]
fem_work_places = ["oficina", "escuela", "universidad", "biblioteca", "fábrica"]

masc_sleep_places = ["hotel", "motel", "hostal", "albergue", "apartamento", "bungaló", "chalé"]
fem_sleep_places = ["casa", "residencia", "mansión", "cabaña", "choza", "villa"]

masc_tourist_places = ["museo", "zoológico", "monumento", "templo", "cine", "estadio", "teatro", "acuario", "mirador", "puente"]
fem_tourist_places = ["plaza", "catedral", "iglesia", "capilla", "pirámide", "ruina", "fuente", "feria"]

masc_eating_places = ["restaurante", "bistro", "bar", "comedor"]
fem_eating_places = ["cafetería", "taquería", "cevichería", "pizzería", "pastelería", "sandwichería", "heladería", "cantina"]

masc_shopping_food_places = ["almacén", "mercado", "supermercado"]
fem_shopping_food_places = ["tienda", "panadería", "carnicería", "pescadería", "frutería", "verdulería"]

masc_health_places = ["hospital", "consultorio", "laboratorio", "sanatorio", "quirófano"]
fem_health_places = ["clínica", "farmacia", "enfermería", "ambulancia"]

masc_shopping_places = ["bazar", "outlet"]
fem_shopping_places = ["boutique", "joyería", "librería", "florería", "ferretería", "papelería", "perfumería"]

masc_transport_places = ["aeropuerto", "puerto", "parqueo", "estacionamiento", "muelle", "hangar", "helipuerto"]
fem_transport_places = ["estación", "parada", "terminal", "autopista", "carretera", "calle", "avenida"]

# ------ People ------
masc_family_older = ["abuelo", "tío", "papá", "padrino", "padre"]
fem_family_older = ["abuela", "tía", "mamá", "madrina", "madre"]

masc_family_younger = ["hermano", "primo", "sobrino", "hijo"]
fem_family_younger = ["hermana", "prima", "sobrina", "hija"]

masc_teachers = ["maestro", "profesor", "director", "rector", "licenciado", "ingeniero", "arquitecto", "abogado"]
fem_teachers = ["maestra", "profesora", "directora", "rectora", "licenciada", "ingeniera", "arquitecta", "abogada"]

masc_friends = ["amigo", "compañero", "colega", "vecino", "conocido", "amante", "novio", "esposo"]
fem_friends = ["amiga", "compañera", "colega", "vecina", "conocida", "amante", "novia", "esposa"]

masc_medical_emergency = ["doctor", "médico", "enfermero", "cirujano", "odontólogo", "psicólogo", "cardiólogo", "neurólogo", "pediatra", "internista"]
fem_medical_emergency = ["doctora", "médica", "enfermera", "cirujana", "odontóloga", "psicóloga", "cardióloga", "neuróloga", "pediatra", "internista"]

masc_first_responders = ["policía", "bombero", "paramédico", "guardia", "rescatista"]
fem_first_responders = ["bombera", "paramédica", "guardia", "rescatista"]

masc_first_responders_plural = ["policías", "bomberos", "paramédicos", "guardias", "rescatistas"]
fem_first_responders_plural = ["bomberas", "paramédicas", "guardias", "rescatistas"]

masc_professions = ["taxista", "mesero", "camarero", "electricista", "plomero", "pintor", "jardinero", "chef", "piloto", "actor", "cantante", "escritor", "bailarín", "atleta", "científico"]
fem_professions = ["camarera", "electricista", "fontanera", "pintora", "jardinera", "pilota", "actriz", "cantante", "escritora", "bailarina", "atleta", "científica"]

# ------ Objects ------

masc_id_objects = ["DPI", "pasaporte", "carné"]
fem_id_objects = ["licencia", "tarjeta", "credencial", "cédula"]

masc_readable_objects = ["libro", "periódico", "cómic", "diccionario", "cuaderno"]
fem_readable_objects = ["revista", "novela", "biografía", "autobiografía"]

masc_readable_objects_plural = ["libros", "periódicos", "cómics", "diccionarios", "cuadernos"]
fem_readable_objects_plural = ["revistas", "novelas", "biografías", "autobiografías"]

masc_personal_objects = ["teléfono", "reloj", "mapa", "paraguas", "cargador", "llavero"]
fem_personal_objects = ["laptop", "computadora", "tablet", "cartera", "llave", "mochila", "cámara", "billetera", "maleta", "bolsa", "mascarilla"]

masc_dinero_objects = ["dinero", "billetes", "monedas"]

masc_toys = ["balón", "rompecabezas", "carrito", "lego", "peluche", "robot", "dinosaurio"]
fem_toys = ["muñeca", "cocinita", "pelota", "barbie"]


masc_objects_transport = ["avión", "barco", "tren", "helicóptero", "camión", "bus", "metro", "tractor", "scooter"]
fem_objects_transport = ["bicicleta", "motocicleta", "patineta", "lancha", "canoa", "balsa"]

masc_homework_work_tools = ["lápiz", "borrador", "cuaderno", "papel", "pegamento", "sacapuntas", "marcador", "corrector", "clip", "pincel"]
fem_homework_work_tools = ["pluma", "libreta", "regla", "calculadora", "impresora", "engrapadora", "perforadora", "agenda"]

masc_work = ["informe", "reporte", "ensayo", "proyecto", "examen", "documento", "resumen", "artículo"]
fem_work = ["presentación", "exposición", "tarea", "investigación", "tesis", "disertación"]

masc_house_pets = ["perro", "gato", "pez", "pájaro", "conejo", "hámster", "cuyo", "hurón", "erizo", "gecko"]
fem_house_pets = ["iguana", "serpiente", "tortuga", "tarántula", "araña", "rana", "salamandra", "cacatúa", "guacamaya"]

masc_wild_animals = ["león", "tigre", "oso", "elefante", "mono", "lobo", "zorro", "hipopótamo", "rinoceronte", "antílope"]
fem_wild_animals = ["jirafa", "cebra", "pantera", "cabra", "oveja", "vaca", "gacela", "leona"]

masc_insects = ["mosquito", "avispón", "escarabajo", "grillo", "saltamontes", "escorpión", "ciempiés"]
fem_insects = ["mosca", "abeja", "avispa", "hormiga", "cucaracha", "mariposa", "libélula", "polilla", "oruga"]

# ------ Events ------
masc_events = ["concierto", "festival", "carnaval", "bautizo", "desfile", "partido"]
fem_events = ["fiesta", "celebración"]


# ------ Food ------
masc_fruits_vegetables_singular = ["tomate", "aguacate", "plátano", "mango", "coco", "limón", "durazno", "melón", "kiwi", "elote", "maíz"]
fem_fruits_vegetables_singular = ["manzana", "pera", "uva", "sandía", "papaya", "piña", "fresa", "zanahoria", "papa", "cebolla", "lechuga"]

masc_fruits_vegetables_plural = ["tomates", "aguacates", "plátanos", "mangos", "cocos", "limones", "duraznos", "melones", "kiwis", "elotes", "maíces"]
fem_fruits_vegetables_plural = ["manzanas", "peras", "uvas", "sandías", "papayas", "piñas", "fresas", "zanahorias", "papas", "cebollas", "lechugas"]

masc_meat = ["pollo", "cerdo", "pescado", "jamón", "chorizo", "salami", "pepperoni"]
fem_meat = ["carne", "pechuga", "costilla", "chuleta", "salchicha", "longaniza"]

masc_meal = ["estofado", "caldo", "hotdog", "sushi", "taco", "muffin", "pan"]
fem_meal = ["ensalada", "sopa", "pizza", "hamburguesa", "torta", "tostada", "tortilla"]

meals_masc = ["desayuno", "almuerzo"]
meals_fem = ["cena", "merienda"]

masc_drinks = ["refresco", "jugo", "té", "café", "agua", "vino", "whisky", "ron", "vodka", "tequila", "mezcal", "pulque", "licor", "brandy"]
fem_drinks = ["leche", "cerveza", "champagne", "sidra", "limonada", "horchata", "michelada", "margarita"]

# ------ Feelings ------
neutral_feelings = ["feliz", "triste"]

masc_feelings = ["enojado", "cansado", "sorprendido", "asustado", "preocupado", "aburrido", "contento", "tranquilo", "nervioso", "emocionado"]
fem_feelings = ["enojada", "cansada", "sorprendida", "asustada", "preocupada", "aburrida", "contenta", "tranquila", "nerviosa", "emocionada"]

# ------ Seasons ------
masc_estaciones = ["verano", "otoño", "invierno"]
fem_estaciones = ["primavera"]

# ------ Months ------
meses = ["enero", "febrero", "marzo", "abril", "mayo", "junio", "julio", "agosto", "septiembre", "octubre", "noviembre", "diciembre"]

# ------ Days of the week ------
dias_semana = ["lunes", "martes", "miércoles", "jueves", "viernes", "sábado", "domingo"]


words = [masc_home_places, fem_home_places, masc_outdoor_places, fem_outdoor_places, masc_work_places, fem_work_places, masc_sleep_places, fem_sleep_places, masc_tourist_places, fem_tourist_places, masc_eating_places, fem_eating_places, masc_shopping_food_places, fem_shopping_food_places, masc_health_places, fem_health_places, masc_shopping_places, fem_shopping_places, masc_transport_places, fem_transport_places, masc_family_older, fem_family_older, masc_family_younger, fem_family_younger, masc_teachers, fem_teachers, masc_friends, fem_friends, masc_medical_emergency, fem_medical_emergency, masc_first_responders, fem_first_responders, masc_first_responders_plural, fem_first_responders_plural, masc_professions, fem_professions, masc_id_objects, fem_id_objects, masc_readable_objects, fem_readable_objects, masc_readable_objects_plural, fem_readable_objects_plural, masc_personal_objects, fem_personal_objects, masc_dinero_objects, masc_toys, fem_toys, masc_objects_transport, fem_objects_transport, masc_homework_work_tools, fem_homework_work_tools, masc_work, fem_work, masc_house_pets, fem_house_pets, masc_wild_animals, fem_wild_animals, masc_insects, fem_insects, masc_events, fem_events, masc_fruits_vegetables_singular, fem_fruits_vegetables_singular, masc_fruits_vegetables_plural, fem_fruits_vegetables_plural, masc_meat, fem_meat, masc_meal, fem_meal, meals_masc, meals_fem, masc_drinks, fem_drinks, neutral_feelings, masc_feelings, fem_feelings, masc_estaciones, fem_estaciones, meses, dias_semana]
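
Several words appear in more than one list (for example `"estudio"`, `"comedor"`, and `"colega"`). The lookup in `generate_sentences` stops at the first list that contains a word, so such a word is always expanded with the members of that first list. A quick way to spot these overlaps is sketched below. It uses a hypothetical, shortened subset of the lists, and the names `categories`, `build_index`, and `ambiguous` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical subset of the lists above, just to keep the sketch short
categories = {
    "masc_home_places": ["cuarto", "comedor", "estudio"],
    "masc_work_places": ["estudio", "consultorio", "laboratorio"],
    "masc_eating_places": ["restaurante", "bar", "comedor"],
    "masc_friends": ["amigo", "colega"],
    "fem_friends": ["amiga", "colega"],
}

def build_index(categories):
    """Map every word to the names of all lists that contain it."""
    index = defaultdict(list)
    for name, members in categories.items():
        # dict.fromkeys drops repeated entries inside one list while keeping order
        for word in dict.fromkeys(members):
            index[word].append(name)
    return dict(index)

word_index = build_index(categories)

# Words that belong to more than one list
ambiguous = {word: names for word, names in word_index.items() if len(names) > 1}
print(ambiguous)
```

Whether an overlap is a problem depends on the sentence: a gender-neutral noun such as `"colega"` legitimately belongs to both the masculine and the feminine list, but it will only ever be swapped for masculine alternatives under a first-match lookup.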

___

#### *Functions - Data Augmentation*

With the vocabulary defined, the functions that perform the data augmentation are developed next.

In [77]:
# ----------------- Generate Sentences ----------------- #
# Function: generate_sentences
# Description: Recursive function to generate all possible combinations of a sentence
# Parameters:
#       - original: list of words of the original sentence
#       - current_words: list of words of the current sentence
#       - index: index of the current word
#       - sentences: DataFrame with the sentences generated
# Return:
#       - current_words: string with the current sentence
#       - sentences: DataFrame with the sentences generated
# ----------------------------------------------- #
def generate_sentences(original, current_words, index, sentences):

    if sentences.empty:                                 # If the DataFrame is empty add the original sentence
        sentences = pd.concat([sentences, pd.DataFrame([" ".join(original)], columns=["ORACIÓN"])], ignore_index=True)

    index_temp = index                                  # Copy the index to a temporary variable
    found_alternative = False                           # Flag to indicate if an alternative word was found
    
    for word in original[index:]:                       # Iterate over the words of the original sentence

        if found_alternative:                           # If an alternative word was found and changed, return the current sentence and the DataFrame with the sentences generated
            return " ".join(current_words), sentences   # Return the current sentence and the DataFrame with the sentences generated
        
        index_sublist = -1                                  
        for i, sublist in enumerate(words):             # Iterate over the list of words
            if word in sublist:                         # If the word is in the list of words
                index_sublist = i                       # Save the index of the list of words
                break

        if index_sublist == -1:                         # If the word is not in the list of words, continue with the next word
            current_words.append(word)                  # Add the word to the current sentence
            index_temp += 1                             # Increment the index

        else:                                           # If the word is in the list of words
            found_alternative = True                    # Set the flag to True

            for new_word in words[index_sublist]:       # Iterate over the words of the list of words

                new_words = deepcopy(current_words)     # Copy the current words to a new variable
                new_words.append(new_word)              # Add the new word to the current words

                current_string, updated_sentences = generate_sentences(original, new_words, index_temp + 1, sentences)      # Recursive call to generate_sentences

                if not updated_sentences.empty:         # If the DataFrame with the sentences generated is not empty
                    sentences = pd.concat([sentences, updated_sentences], ignore_index=True)                                # Concatenate the DataFrame with the sentences generated

                    sentences = sentences.drop_duplicates()                                                                 # Drop duplicates

    if index_temp == len(original):                     # If the index is equal to the length of the original sentence
        sentences = pd.concat([sentences, pd.DataFrame([" ".join(current_words)], columns=["ORACIÓN"])], ignore_index=True) # Concatenate the DataFrame with the current sentence

        sentences = sentences.drop_duplicates()         # Drop duplicates

    return " ".join(current_words), sentences           # Return the current sentence and the DataFrame with the sentences generated
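
For intuition, the enumeration performed by `generate_sentences` can be viewed as a Cartesian product: every word that belongs to a list is replaced by each member of that list, and all other words stay fixed. The sketch below expresses that idea with `itertools.product`. It is an illustrative alternative, not the notebook's implementation, and `expand_sentence` and `toy_lists` are hypothetical names.

```python
from itertools import product

def expand_sentence(tokens, word_lists):
    """Return every sentence obtained by swapping each listed word for the members of its list."""
    options = []
    for token in tokens:
        # First list containing the token, mirroring the first-match lookup in generate_sentences
        match = next((lst for lst in word_lists if token in lst), None)
        # Words without a list keep a single option: themselves
        options.append(list(dict.fromkeys(match)) if match else [token])
    return [" ".join(combo) for combo in product(*options)]

# Hypothetical, shortened word lists
toy_lists = [["cabaña", "casa", "villa"], ["lunes", "martes"]]
variants = expand_sentence("lunes tu cabaña limpiar".split(), toy_lists)
print(len(variants))
for sentence in variants:
    print(sentence)
```

The number of variants is the product of the sizes of the matched lists (here 2 × 3), so it grows quickly when a sentence contains several replaceable words. That growth is the reason the augmentation step later keeps at most 20 variants per sentence.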

In [78]:
# ----------------- Split Words ----------------- #
# Function: split_words
# Description: Split the words of a sentence, taking into account the punctuation
# Parameters:
#       - sentence: string with the sentence
# Return:
#       - sentence: list with the words of the sentence
# ----------------------------------------------- #
def split_words(sentence):

    new_sentence = []                                       # List to store the words of the sentence
    punctuation = [",", ".", ";", "!", "?", "¿", "¡"]       # List of punctuation marks

    for i, char in enumerate(sentence):                     # Iterate over the characters of the sentence
        if char in punctuation:                             # If the character is a punctuation mark
            if new_sentence and new_sentence[-1] != " ":    # If something was already added and the previous character is not a space
                new_sentence.append(" ")                    # Add a space to the list of words
            new_sentence.append(char)                       # Add the punctuation mark to the list of words
            new_sentence.append(" ")                        # Add a space to the list of words
        else:                                               # If the character is not a punctuation mark
            new_sentence.append(char)                       # Add the character to the list of words

    sentence = "".join(new_sentence)                        # Join the list of words to form the sentence

    sentence = sentence.split()                             # Split the sentence into words

    return sentence                                         # Return the list of words


# ----------------- Correct Case ----------------- #
# Function: correct_case
# Description: Correct the case of the words of a sentence
# Parameters:
#       - sentence: string with the sentence
# Return:
#       - sentence: string with the sentence with the correct case
# ----------------------------------------------- #
def correct_case(sentence):

    new_sentence = []                                       # List to store the characters of the corrected sentence
    punctuation = [",", ".", ";", "!", "?", "¿", "¡"]       # List of punctuation marks

    punctuation_bool = False                                # Flag: the next letter must be capitalized (set after any punctuation mark except a comma)
    remove_space = False                                    # Flag: skip the space that follows an opening "¿" or "¡"

    sentence = sentence.strip()                             # Remove leading and trailing whitespaces

    for i, char in enumerate(sentence):                     # Iterate over the characters of the sentence
        if i == 0 and char.isalpha():                       # If the character is the first character and is a letter
            new_sentence.append(char.upper())               # Add the character in uppercase
        elif char in punctuation:                           # If the character is a punctuation mark
            if char == "¿" or char == "¡":                  # If the character is an inverted question or exclamation mark
                remove_space = True                         # Set the flag to True
            else:                                           # If the character is any other punctuation mark
                if new_sentence and new_sentence[-1] == " ":    # If the last character added is a space
                    new_sentence.pop()                      # Remove the last character

            if char != ",":                                 # If the character is not a comma
                punctuation_bool = True                     # Set the flag to True
            new_sentence.append(char)                       # Add the punctuation mark

        elif punctuation_bool:                              # If the character is not a punctuation mark and the flag is True
            if char.isalpha():                              # If the character is a letter
                new_sentence.append(char.upper())           # Add the character in uppercase
                punctuation_bool = False                    # Set the flag to False
                remove_space = False                        # Set the flag to False
            else:                                           # If the character is not a letter
                if not remove_space:                        # If the flag is False
                    new_sentence.append(char)               # Add the character
                else:                                       # If the flag is True
                    remove_space = False                    # Set the flag to False
        else:                                               # If the character is not a punctuation mark
            new_sentence.append(char)                       # Add the character

    return "".join(new_sentence)                            # Return the corrected sentence as a string
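
The two helpers above are meant to be inverses of each other: `split_words` separates punctuation into its own tokens, and `correct_case` glues the tokens back together with Spanish spacing and capitalization. A compact regex-based sketch of the same round trip is shown below. It is an illustrative alternative rather than the notebook's code: the names `tokenize` and `restore_case` are hypothetical, and the sketch only capitalizes a letter that directly follows a punctuation mark (optionally separated by whitespace).

```python
import re

PUNCTUATION = ",.;!?¿¡"

def tokenize(sentence):
    """Split a sentence into words and standalone punctuation marks."""
    return re.findall(rf"[{PUNCTUATION}]|[^\s{PUNCTUATION}]+", sentence)

def restore_case(tokens):
    """Join tokens, fix spacing around punctuation, and capitalize sentence starts."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([,.;!?])", r"\1", text)   # no space before closing punctuation
    text = re.sub(r"([¿¡])\s+", r"\1", text)      # no space after opening marks
    # Uppercase the first letter of the text and the first letter after . ! ? ; ¿ ¡ (not after a comma)
    return re.sub(
        r"(^|[.!?;¿¡]\s*)([^\W\d_])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

question = tokenize("¿dónde está el aeropuerto?")
print(question)
print(restore_case(question))
print(restore_case(tokenize("mi amiga desapareció. comparte la información.")))
```

Keeping punctuation as separate tokens is what allows a word such as `"cabaña"` to be matched and replaced even when it is followed by a period in the Spanish sentence.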

___

#### *Data Augmentation*

With the functions defined, data augmentation is applied to the training set. New sentences are generated from the existing ones by substituting words (places, subjects, objects, and other vocabulary) with other words from the same list, and each substitution is carried over to the corresponding Spanish sentence.

In [79]:
augmented_data = pd.DataFrame(columns=["LENSEGUA", "ESPAÑOL"])                  # DataFrame to store the augmented dataset
changeable_words = {item for sublist in words for item in sublist}              # Set of every word that has alternatives (built once for fast lookup)

for index, row in data.iterrows():                                              # Iterate over the rows of the dataset

    words_lensegua = split_words(row["LENSEGUA"])                               # Split the words of the Lensegua sentence
    words_espanol = split_words(row["ESPAÑOL"])                                 # Split the words of the Spanish sentence

    words_found = sum(word in changeable_words for word in words_espanol)       # Number of "changeable" words found in the Spanish sentence

    if words_found == 0:                                                        # If no "changeable" words were found
        new_row = pd.DataFrame({'LENSEGUA': [row["LENSEGUA"]], 'ESPAÑOL': [row["ESPAÑOL"]]})                # Create a new row with the original sentences
        augmented_data = pd.concat([augmented_data, new_row], ignore_index=True)                            # Concatenate the new row to the DataFrame    

    else:                                                                       # If "changeable" words were found
        _, lensegua_aug = generate_sentences(words_lensegua, [], 0, pd.DataFrame(columns=["ORACIÓN"]))      # Generate all possible combinations of the Lensegua sentence
        spanish_aug = []                                                        # List to store the augmented Spanish sentences             

        lensegua_aug = lensegua_aug.sort_values(by=["ORACIÓN"])                 # Sort the lensegua augmented sentences
        if len(lensegua_aug) > 20:                                              # If the length of the lensegua augmented sentences is greater than 20
            lensegua_aug = lensegua_aug.sample(n=20)                            # Sample 20 random sentences

        for index1 in lensegua_aug.index:                                       # Iterate over the indexes of the lensegua augmented sentences

            aug_sentence = lensegua_aug["ORACIÓN"][index1]                      # Get the augmented sentence
            aug_sentence_split = split_words(aug_sentence)                      # Split the words of the augmented sentence
            spanish_aug_temp = deepcopy(words_espanol)                          # Copy the words of the Spanish sentence

            if len(aug_sentence_split) == len(words_lensegua):                  # If the length of the augmented sentence is equal to the length of the Lensegua sentence
                
                changed_id = []                                                 # List to store the indexes of the words that were changed

                for word1, word2 in zip(words_lensegua, aug_sentence_split):    # Iterate over the words of the Lensegua sentence and the augmented sentence
                    if word1 != word2:                                          # If the words are different
                        
                        for i, word_temp in enumerate(spanish_aug_temp):        # Iterate over the words of the Spanish sentence
                            if word_temp == word1:                              # If the word is equal to the word of the Lensegua sentence

                                if i not in changed_id:                         # Only change positions that have not been changed already
                                    spanish_aug_temp[i] = word2                 # Change the word of the Spanish sentence
                                    changed_id.append(i)                        # Add the index to the list
                                    break

                spanish_aug.append(correct_case(" ".join(spanish_aug_temp)))    # Add the augmented Spanish sentence to the list

        spanish_aug = pd.DataFrame(spanish_aug, columns=["ORACIÓN"])            # Create a DataFrame with the augmented Spanish sentences

        for index1, index2 in zip(spanish_aug.index, lensegua_aug.index):       # Iterate over the indexes of the augmented Spanish sentences and the augmented Lensegua sentences
            if len(spanish_aug) != len(lensegua_aug):                           # If the length of the augmented Spanish sentences is different from the length of the augmented Lensegua sentences
                print(spanish_aug)  
                print(lensegua_aug)

            new_row = pd.DataFrame({'LENSEGUA': [lensegua_aug["ORACIÓN"][index2]], 'ESPAÑOL': [spanish_aug["ORACIÓN"][index1]]})        # Create a new row with the augmented sentences
            augmented_data = pd.concat([augmented_data, new_row], ignore_index=True)                                                    # Concatenate the new row to the DataFrame          
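
The core of the loop above is the step that carries a substitution from the LENSEGUA sentence over to the Spanish sentence: for every position where the variant differs from the original, the first not-yet-changed Spanish token equal to the original word is replaced. The sketch below isolates that step. The function name `transfer_substitutions` is hypothetical, and the example pair is taken from the dataset output.

```python
def transfer_substitutions(gloss_tokens, variant_tokens, spanish_tokens):
    """Apply to the Spanish tokens every word swap that turns gloss_tokens into variant_tokens."""
    result = list(spanish_tokens)          # work on a copy
    changed = set()                        # Spanish positions already rewritten
    for old, new in zip(gloss_tokens, variant_tokens):
        if old == new:
            continue                       # this position was not substituted
        for i, token in enumerate(result):
            if token == old and i not in changed:
                result[i] = new
                changed.add(i)
                break                      # one Spanish token per swapped word
    return result

gloss = "ayer tu cabaña limpiar".split()
variant = "ayer tu casa limpiar".split()
spanish = ["ayer", "limpiaste", "la", "cabaña", "."]
print(transfer_substitutions(gloss, variant, spanish))
```

The match is an exact string comparison, so if the Spanish sentence uses a different form of the word (a plural, for instance), no token is replaced and the Spanish side keeps the original word. Pairs produced that way may be worth checking before training.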

In [80]:
augmented_data.head(20)                                                     # Display the first 20 rows of the augmented dataset

| | LENSEGUA | ESPAÑOL |
|---|---|---|
| 0 | él trabajar mucho pero ganar poco | él trabaja mucho pero gana poco. |
| 1 | después artículo nosotros descansar | Después del artículo descansamos. |
| 2 | después documento nosotros descansar | Después del documento descansamos. |
| 3 | después ensayo nosotros descansar | Después del ensayo descansamos. |
| 4 | después examen nosotros descansar | Después del examen descansamos. |
| 5 | después informe nosotros descansar | Después del informe descansamos. |
| 6 | después proyecto nosotros descansar | Después del proyecto descansamos. |
| 7 | después reporte nosotros descansar | Después del reporte descansamos. |
| 8 | después resumen nosotros descansar | Después del resumen descansamos. |
| 9 | ayer tu cabaña limpiar | Ayer limpiaste la cabaña. |


In [81]:
# Strip sentences
augmented_data["LENSEGUA"] = augmented_data["LENSEGUA"].apply(lambda x: x.strip())                  # Strip the Lensegua sentences
augmented_data["ESPAÑOL"] = augmented_data["ESPAÑOL"].apply(lambda x: x.strip())                    # Strip the Spanish sentences

validation_data["LENSEGUA"] = validation_data["LENSEGUA"].apply(lambda x: x.strip())                # Strip the Lensegua sentences
validation_data["ESPAÑOL"] = validation_data["ESPAÑOL"].apply(lambda x: x.strip())                  # Strip the Spanish sentences

# Remove '"' characters
augmented_data["LENSEGUA"] = augmented_data["LENSEGUA"].apply(lambda x: re.sub(r'"', '', x))        # Remove double quotes
augmented_data["ESPAÑOL"] = augmented_data["ESPAÑOL"].apply(lambda x: re.sub(r'"', '', x))          # Remove double quotes

validation_data["LENSEGUA"] = validation_data["LENSEGUA"].apply(lambda x: re.sub(r'"', '', x))      # Remove double quotes
validation_data["ESPAÑOL"] = validation_data["ESPAÑOL"].apply(lambda x: re.sub(r'"', '', x))        # Remove double quotes

# Correct case
augmented_data["LENSEGUA"] = augmented_data["LENSEGUA"].apply(lambda x: correct_case(x))            # Correct the case of the Lensegua sentences
augmented_data["ESPAÑOL"] = augmented_data["ESPAÑOL"].apply(lambda x: correct_case(x))              # Correct the case of the Spanish sentences

validation_data["LENSEGUA"] = validation_data["LENSEGUA"].apply(lambda x: correct_case(x))          # Correct the case of the Lensegua sentences
validation_data["ESPAÑOL"] = validation_data["ESPAÑOL"].apply(lambda x: correct_case(x))            # Correct the case of the Spanish sentences

# Randomize the order of the rows
augmented_data = augmented_data.sample(frac=1).reset_index(drop=True)                               # Randomize the order of the rows


# Make sure the datasets are divisible by 8
if len(augmented_data) % 8 != 0:                                                                    # If the length of the augmented dataset is not divisible by 8
    augmented_data = augmented_data.iloc[:-(len(augmented_data) % 8)]                               # Remove the remaining rows

if len(validation_data) % 8 != 0:                                                                    # If the length of the validation dataset is not divisible by 8
    validation_data = validation_data.iloc[:-(len(validation_data) % 8)]                             # Remove the remaining rows


augmented_data.to_csv("../../dataset/processed/train_data.csv", index=False, quoting=csv.QUOTE_ALL, quotechar='"')          # Save the augmented dataset to a CSV file
validation_data.to_csv("../../dataset/processed/validation_data.csv", index=False, quoting=csv.QUOTE_ALL, quotechar='"')    # Save the validation dataset to a CSV file
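
The truncation to a multiple of 8 can be factored into a small helper, sketched below. The notebook does not state why 8 is used; a plausible reading is that it matches the batch size used during fine-tuning, but that is an assumption. The names `trim_to_multiple` and `rows21` are hypothetical.

```python
import pandas as pd

def trim_to_multiple(df, multiple=8):
    """Drop trailing rows so that len(df) is divisible by `multiple`."""
    remainder = len(df) % multiple
    # When remainder is 0 the slice covers the whole frame
    return df.iloc[:len(df) - remainder]

# Hypothetical frame with 21 rows
rows21 = pd.DataFrame({"LENSEGUA": [f"frase {i}" for i in range(21)]})
trimmed = trim_to_multiple(rows21)
print(len(trimmed))
```

Slicing with `len(df) - remainder` avoids the special case of `iloc[:-0]`, which would return an empty frame, so the helper needs no `if` guard.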