### **Señas Chapinas: LENSEGUA Translator**
#### *Natural Language Processing Module*

Stefano Alberto Aragoni Maldonado

-------------------

#### **Levenshtein Distance Calculation**

This notebook evaluates the interpretations produced by two models: the base GPT-3.5-Turbo model and a GPT-3.5-Turbo model fine-tuned for the LENSEGUA interpretation task.

The evaluation computes the Levenshtein distance between each generated interpretation and the expected reference interpretation.
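
As a reference for what the metric measures, the sketch below implements the Levenshtein distance with the standard dynamic-programming recurrence in plain Python. The function name `levenshtein` is illustrative and not part of this notebook, which relies on the `Levenshtein` package instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A distance of 0 means the two strings are identical, and each additional unit corresponds to one more character edit.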

Note that five candidate interpretations were generated per model for each sentence, because five different prompts (V0 through V4) were used to generate them.

___

#### *Import libraries*
As a first step, the libraries required for this task are imported.

In [1]:
import numpy as np
import pandas as pd
import csv
import re
import Levenshtein as lev

___

#### *Load the datasets*

Next, the datasets containing the interpretations generated by both models are loaded.

In [2]:
original_pv0 = pd.read_csv('../../dataset/interpretations/original_pv0.csv')
original_pv1 = pd.read_csv('../../dataset/interpretations/original_pv1.csv')
original_pv2 = pd.read_csv('../../dataset/interpretations/original_pv2.csv')
original_pv3 = pd.read_csv('../../dataset/interpretations/original_pv3.csv')
original_pv4 = pd.read_csv('../../dataset/interpretations/original_pv4.csv')

finetuned_pv0 = pd.read_csv('../../dataset/interpretations/finetuned_pv0.csv')
finetuned_pv1 = pd.read_csv('../../dataset/interpretations/finetuned_pv1.csv')
finetuned_pv2 = pd.read_csv('../../dataset/interpretations/finetuned_pv2.csv')
finetuned_pv3 = pd.read_csv('../../dataset/interpretations/finetuned_pv3.csv')
finetuned_pv4 = pd.read_csv('../../dataset/interpretations/finetuned_pv4.csv')

___

#### *Function to compute the Levenshtein distance*

Next, a function is defined to compute the Levenshtein distance between the generated interpretations and the expected ones. It takes the path of the output file and a dataset with the columns ESPAÑOL (the correct Spanish text), LENSEGUA (the sentence to interpret), and RESPUESTA (the interpretation generated by the model).

The function writes a CSV file containing the expected and generated texts, the length of the longer of the two, the Levenshtein distance, and the distance as a percentage of that length. It returns the averages of the last three columns.
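
The normalization applied by the function can be illustrated in isolation. The sketch below assumes the edit distance is already known. The helper name `percentage_difference` is hypothetical, and the zero-length guard is an addition made for this example.

```python
def percentage_difference(distance: int, expected: str, generated: str) -> float:
    """Edit distance as a percentage of the longer string's length."""
    max_length = max(len(expected), len(generated))
    if max_length == 0:
        # Both strings are empty: treat them as identical (assumption for this sketch).
        return 0.0
    return distance / max_length * 100


# "kitten" -> "sitting" needs 3 edits and the longer string has 7 characters.
print(round(percentage_difference(3, "kitten", "sitting"), 2))  # 42.86
```

Dividing by the longer length keeps the result between 0 and 100, so sentences of different lengths can be compared on the same scale.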

In [241]:
def leveshtein_distance(data, file_name):
    """Compute per-row Levenshtein distances and save them to `file_name`.

    Returns the mean max length, mean distance, and mean percentage difference.
    """

    # Rows for new dataset
    rows = []

    for _, row in data.iterrows():

        # Get the expected and generated texts from the current row
        spanish = row["ESPAÑOL"]
        interpreted = row["RESPUESTA"]

        # Calculate the levenshtein distance between spanish and interpreted
        distance = lev.distance(spanish, interpreted)

        # Calculate the distance as a percentage of the longer text's length
        # (0 when both texts are empty, to avoid dividing by zero)
        max_length = max(len(spanish), len(interpreted))
        percentage_diff = (distance / max_length) * 100 if max_length else 0.0

        # Add the row to the new dataset
        rows.append([spanish, interpreted, max_length, distance, percentage_diff])

    # Create the new dataset
    results_df = pd.DataFrame(rows, columns=["ESPAÑOL", "RESPUESTA", "MAX_LENGTH", "LEVENSHTEIN_DISTANCE", "PERCENTAGE_DIFFERENCE"])

    # Save the new dataset
    results_df.to_csv(file_name, index=False)

    # Return the mean max length, mean levenshtein distance, and mean percentage difference
    avg_max_length = results_df["MAX_LENGTH"].mean()
    avg_distance = results_df["LEVENSHTEIN_DISTANCE"].mean()
    avg_percentage_diff = results_df["PERCENTAGE_DIFFERENCE"].mean()

    return avg_max_length, avg_distance, avg_percentage_diff


In [242]:
print("Levenshtein Distance between Spanish and Interpreted Sentences")

original_prompt_v0 = leveshtein_distance(original_pv0, "../../dataset/distances/distances_original_pv0.csv")
print("\nOriginal Model - Prompt V0: ")
print("\tAverage Sentence Length: ", original_prompt_v0[0])
print("\tAverage Levenshtein Distance: ", original_prompt_v0[1])
print("\tAverage Percentage Difference: ", original_prompt_v0[2], "%")

original_prompt_v1 = leveshtein_distance(original_pv1, "../../dataset/distances/distances_original_pv1.csv")
print("\nOriginal Model - Prompt V1: ")
print("\tAverage Sentence Length: ", original_prompt_v1[0])
print("\tAverage Levenshtein Distance: ", original_prompt_v1[1])
print("\tAverage Percentage Difference: ", original_prompt_v1[2], "%")

original_prompt_v2 = leveshtein_distance(original_pv2, "../../dataset/distances/distances_original_pv2.csv")
print("\nOriginal Model - Prompt V2: ")
print("\tAverage Sentence Length: ", original_prompt_v2[0])
print("\tAverage Levenshtein Distance: ", original_prompt_v2[1])
print("\tAverage Percentage Difference: ", original_prompt_v2[2], "%")

original_prompt_v3 = leveshtein_distance(original_pv3, "../../dataset/distances/distances_original_pv3.csv")
print("\nOriginal Model - Prompt V3: ")
print("\tAverage Sentence Length: ", original_prompt_v3[0])
print("\tAverage Levenshtein Distance: ", original_prompt_v3[1])
print("\tAverage Percentage Difference: ", original_prompt_v3[2], "%")

original_prompt_v4 = leveshtein_distance(original_pv4, "../../dataset/distances/distances_original_pv4.csv")
print("\nOriginal Model - Prompt V4: ")
print("\tAverage Sentence Length: ", original_prompt_v4[0])
print("\tAverage Levenshtein Distance: ", original_prompt_v4[1])
print("\tAverage Percentage Difference: ", original_prompt_v4[2], "%")

Levenshtein Distance between Spanish and Interpreted Sentences

Original Model - Prompt V0: 
	Average Sentence Length:  31.905
	Average Levenshtein Distance:  10.065
	Average Percentage Difference:  27.932895900120712 %

Original Model - Prompt V1: 
	Average Sentence Length:  32.825
	Average Levenshtein Distance:  10.755
	Average Percentage Difference:  28.1327943786462 %

Original Model - Prompt V2: 
	Average Sentence Length:  29.625
	Average Levenshtein Distance:  8.66
	Average Percentage Difference:  27.78596639438847 %

Original Model - Prompt V3: 
	Average Sentence Length:  29.215
	Average Levenshtein Distance:  8.635
	Average Percentage Difference:  28.59262308454481 %

Original Model - Prompt V4: 
	Average Sentence Length:  29.235
	Average Levenshtein Distance:  8.68
	Average Percentage Difference:  28.548691520694337 %


In [243]:
print("Levenshtein Distance between Spanish and Interpreted Sentences")

finetuned_prompt_v0 = leveshtein_distance(finetuned_pv0, "../../dataset/distances/distances_finetuned_pv0.csv")
print("\nFine-tuned Model - Prompt V0: ")
print("\tAverage Sentence Length: ", finetuned_prompt_v0[0])
print("\tAverage Levenshtein Distance: ", finetuned_prompt_v0[1])
print("\tAverage Percentage Difference: ", finetuned_prompt_v0[2], "%")

finetuned_prompt_v1 = leveshtein_distance(finetuned_pv1, "../../dataset/distances/distances_finetuned_pv1.csv")
print("\nFine-tuned Model - Prompt V1: ")
print("\tAverage Sentence Length: ", finetuned_prompt_v1[0])
print("\tAverage Levenshtein Distance: ", finetuned_prompt_v1[1])
print("\tAverage Percentage Difference: ", finetuned_prompt_v1[2], "%")

finetuned_prompt_v2 = leveshtein_distance(finetuned_pv2, "../../dataset/distances/distances_finetuned_pv2.csv")
print("\nFine-tuned Model - Prompt V2: ")
print("\tAverage Sentence Length: ", finetuned_prompt_v2[0])
print("\tAverage Levenshtein Distance: ", finetuned_prompt_v2[1])
print("\tAverage Percentage Difference: ", finetuned_prompt_v2[2], "%")

finetuned_prompt_v3 = leveshtein_distance(finetuned_pv3, "../../dataset/distances/distances_finetuned_pv3.csv")
print("\nFine-tuned Model - Prompt V3: ")
print("\tAverage Sentence Length: ", finetuned_prompt_v3[0])
print("\tAverage Levenshtein Distance: ", finetuned_prompt_v3[1])
print("\tAverage Percentage Difference: ", finetuned_prompt_v3[2], "%")

finetuned_prompt_v4 = leveshtein_distance(finetuned_pv4, "../../dataset/distances/distances_finetuned_pv4.csv")
print("\nFine-tuned Model - Prompt V4: ")
print("\tAverage Sentence Length: ", finetuned_prompt_v4[0])
print("\tAverage Levenshtein Distance: ", finetuned_prompt_v4[1])
print("\tAverage Percentage Difference: ", finetuned_prompt_v4[2], "%")

Levenshtein Distance between Spanish and Interpreted Sentences

Fine-tuned Model - Prompt V0: 
	Average Sentence Length:  27.655
	Average Levenshtein Distance:  5.815
	Average Percentage Difference:  20.12713768277029 %

Fine-tuned Model - Prompt V1: 
	Average Sentence Length:  27.925
	Average Levenshtein Distance:  5.84
	Average Percentage Difference:  19.098171768746294 %

Fine-tuned Model - Prompt V2: 
	Average Sentence Length:  27.23
	Average Levenshtein Distance:  5.37
	Average Percentage Difference:  19.00521992929269 %

Fine-tuned Model - Prompt V3: 
	Average Sentence Length:  27.05
	Average Levenshtein Distance:  4.3
	Average Percentage Difference:  15.276810080558636 %

Fine-tuned Model - Prompt V4: 
	Average Sentence Length:  26.635
	Average Levenshtein Distance:  3.375
	Average Percentage Difference:  11.979230778799952 %
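
To compare both models prompt by prompt, the averages printed above can be contrasted directly. The sketch below copies the reported average percentage differences (rounded to two decimals) into dictionaries and computes the absolute and relative reduction. The variable names are illustrative and not part of this notebook.

```python
# Average percentage differences reported above, rounded to two decimals.
base = {"V0": 27.93, "V1": 28.13, "V2": 27.79, "V3": 28.59, "V4": 28.55}
finetuned = {"V0": 20.13, "V1": 19.10, "V2": 19.01, "V3": 15.28, "V4": 11.98}

reductions = {}
for prompt in base:
    absolute = base[prompt] - finetuned[prompt]   # percentage points
    relative = absolute / base[prompt] * 100      # percent of the base value
    reductions[prompt] = (absolute, relative)
    print(f"Prompt {prompt}: -{absolute:.2f} points ({relative:.1f}% lower)")

# Prompt with the largest absolute reduction
best_prompt = max(reductions, key=lambda p: reductions[p][0])
print("Largest reduction:", best_prompt)
```

With these figures, the fine-tuned model has a lower average percentage difference for every prompt, and the gap is largest for prompt V4.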
