<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Procesamiento de lenguaje natural
## LSTM Bot QA

## Ejercicio a resolver

Construir QA Bot basado en el ejemplo del traductor pero con un dataset QA.

Recomendaciones:
- MAX_VOCAB_SIZE = 8000
- max_length ~ 10
- Embeddings 300 Fasttext
- n_units = 128
- LSTM Dropout 0.2
- Epochs 30~50

Preguntas interesantes:
- Do you read?
- Do you have any pet?
- Where are you from?

__IMPORTANTE__: Recuerde para la entrega del ejercicio debe quedar registrado en el colab las preguntas y las respuestas del BOT para que podamos evaluar el desempeño final.



### Datos
El objecto es utilizar datos disponibles del challenge ConvAI2 (Conversational Intelligence Challenge 2) de conversaciones en inglés. Se construirá un BOT para responder a preguntas del usuario (QA).\
[LINK](http://convai.io/data/)

In [1]:
!pip install --upgrade --no-cache-dir gdown --quiet

In [7]:
# ✅ Instalar TensorFlow y Keras (en Colab ya suelen venir, pero se fuerza versión estable)
!pip install -U tensorflow keras scikit-learn


Collecting tensorflow
  Downloading tensorflow-2.19.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting keras
  Downloading keras-3.10.0-py3-none-any.whl.metadata (6.0 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (17 kB)
Collecting tensorboard~=2.19.0 (from tensorflow)
  Downloading tensorboard-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Collecting ml-dtypes<1.0.0,>=0.5.1 (from tensorflow)
  Downloading ml_dtypes-0.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Downloading tensorflow-2.19.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (644.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m644.9/644.9 MB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading keras-3.10.0-py3-none-any.whl (1.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.4/1.4 MB[0m [31m34.8 MB/s[0m eta [36

In [1]:
import tensorflow as tf
import keras
import sklearn

print("TensorFlow version:", tf.__version__)
print("Keras version:", keras.__version__)
print("scikit-learn version:", sklearn.__version__)


TensorFlow version: 2.19.0
Keras version: 3.10.0
scikit-learn version: 1.7.0


In [3]:
import re

import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Dense
from tensorflow.keras.layers import Flatten, LSTM, SimpleRNN
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Input

In [None]:
# Descargar la carpeta de dataset
import os
import gdown
if os.access('data_volunteers.json', os.F_OK) is False:
    url = 'https://drive.google.com/uc?id=1awUxYwImF84MIT5-jCaYAPe2QwSgS1hN&export=download'
    output = 'data_volunteers.json'
    gdown.download(url, output, quiet=False)
else:
    print("El dataset ya se encuentra descargado")

In [73]:
# dataset_file
import json

text_file = "data_volunteers.json"
with open(text_file) as f:
    data = json.load(f) # la variable data será un diccionario



In [74]:
# Observar los campos disponibles en cada linea del dataset
data[0].keys()

dict_keys(['dialog', 'start_time', 'end_time', 'bot_profile', 'user_profile', 'eval_score', 'profile_match', 'participant1_id', 'participant2_id'])

In [75]:
chat_in = []
chat_out = []

input_sentences = []
output_sentences = []
output_sentences_inputs = []
max_len = 30

def clean_text(txt):
    txt = txt.lower()
    txt.replace("\'d", " had")
    txt.replace("\'s", " is")
    txt.replace("\'m", " am")
    txt.replace("don't", "do not")
    txt = re.sub(r'\W+', ' ', txt)

    return txt

for line in data:
    for i in range(len(line['dialog'])-1):
        # vamos separando el texto en "preguntas" (chat_in)
        # y "respuestas" (chat_out)
        chat_in = clean_text(line['dialog'][i]['text'])
        chat_out = clean_text(line['dialog'][i+1]['text'])

        if len(chat_in) >= max_len or len(chat_out) >= max_len:
            continue

        input_sentence, output = chat_in, chat_out

        # output sentence (decoder_output) tiene <eos>
        output_sentence = output + ' <eos>'
        # output sentence input (decoder_input) tiene <sos>
        output_sentence_input = '<sos> ' + output

        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)

print("Cantidad de rows utilizadas:", len(input_sentences))

Cantidad de rows utilizadas: 6033


In [76]:
input_sentences[1], output_sentences[1], output_sentences_inputs[1]

('hi how are you ', 'not bad and you  <eos>', '<sos> not bad and you ')

### 2 - Preprocesamiento
Realizar el preprocesamiento necesario para obtener:
- word2idx_inputs, max_input_len
- word2idx_outputs, max_out_len, num_words_output
- encoder_input_sequences, decoder_output_sequences, decoder_targets

In [77]:
# Parámetros recomendados
MAX_VOCAB_SIZE = 8000
MAX_SEQUENCE_LENGTH = 10

# Inicializar tokenizador
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>", filters='' )  # ¡No filtramos símbolos!)
tokenizer.fit_on_texts(input_sentences + output_sentences + output_sentences_inputs)

# Vocabulario real
word2idx = tokenizer.word_index
print(f"Tamaño real del vocabulario: {len(word2idx)}")

# Convertir a secuencias numéricas
encoder_input_sequences = tokenizer.texts_to_sequences(input_sentences)
decoder_input_sequences = tokenizer.texts_to_sequences(output_sentences_inputs)
decoder_target_sequences = tokenizer.texts_to_sequences(output_sentences)

# Padding para normalizar longitud
encoder_input_sequences = pad_sequences(encoder_input_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
decoder_input_sequences = pad_sequences(decoder_input_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
decoder_target_sequences = pad_sequences(decoder_target_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

# Verificación de ejemplos
print("\nEjemplo encoder_input_sequences[0]:", encoder_input_sequences[0])
print("Ejemplo decoder_input_sequences[0]:", decoder_input_sequences[0])
print("Ejemplo decoder_target_sequences[0]:", decoder_target_sequences[0])

# Shapes útiles
num_samples = len(encoder_input_sequences)
vocab_size = min(MAX_VOCAB_SIZE, len(word2idx) + 1)
print(f"\nTotal de pares: {num_samples}")
print(f"Vocab size para Embedding: {vocab_size}")

Tamaño real del vocabulario: 2163

Ejemplo encoder_input_sequences[0]: [22  0  0  0  0  0  0  0  0  0]
Ejemplo decoder_input_sequences[0]: [ 3 20 12  9  4  0  0  0  0  0]
Ejemplo decoder_target_sequences[0]: [20 12  9  4  2  0  0  0  0  0]

Total de pares: 6033
Vocab size para Embedding: 2164


### 3 - Preparar los embeddings
Utilizar los embeddings de Glove o FastText para transformar los tokens de entrada en vectores

In [78]:
#import tensorflow as tf
#from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout
#from tensorflow.keras.models import Model

# Parámetros clave
EMBEDDING_DIM = 300   # Recomendado: FastText ~300 dimensiones
LATENT_DIM = 256      # Unidades LSTM (encoder y decoder)
DROPOUT_RATE = 0.2

# Capa Embedding compartida
embedding_layer = Embedding(
    input_dim=vocab_size,      # Tamaño del vocabulario
    output_dim=EMBEDDING_DIM,  # Dimensión de embedding
    input_length=MAX_SEQUENCE_LENGTH,
    mask_zero=True             # Muy importante para ignorar padding en LSTM
)

# Encoder
encoder_inputs = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedding_layer(encoder_inputs)
encoder_lstm = LSTM(LATENT_DIM, return_state=True, dropout=DROPOUT_RATE)
encoder_outputs, state_h, state_c = encoder_lstm(x)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedding_layer(decoder_inputs)  # Se puede compartir el mismo embedding
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True, dropout=DROPOUT_RATE)
decoder_outputs, _, _ = decoder_lstm(x, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Modelo final Seq2Seq
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compilación
model.compile(
    optimizer='rmsprop',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Resumen del modelo
model.summary()


### 4 - Entrenar el modelo
Entrenar un modelo basado en el esquema encoder-decoder utilizando los datos generados en los puntos anteriores. Utilce como referencias los ejemplos vistos en clase.

In [79]:

#import numpy as np

# La salida debe tener 3 dimensiones: (samples, timesteps, 1)
decoder_target_data = np.expand_dims(decoder_target_sequences, -1)

print(f"Shape encoder_input_sequences: {encoder_input_sequences.shape}")
print(f"Shape decoder_input_sequences: {decoder_input_sequences.shape}")
print(f"Shape decoder_target_data: {decoder_target_data.shape}")

# Entrenamos el modelo
history = model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_target_data,
    batch_size=64,
    epochs=50,           # O 50, según sugerencia
    validation_split=0.2
)


Shape encoder_input_sequences: (6033, 10)
Shape decoder_input_sequences: (6033, 10)
Shape decoder_target_data: (6033, 10, 1)
Epoch 1/50
[1m76/76[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m42s[0m 488ms/step - accuracy: 0.0987 - loss: 6.2327 - val_accuracy: 0.1304 - val_loss: 4.4825
Epoch 2/50
[1m76/76[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m33s[0m 434ms/step - accuracy: 0.1402 - loss: 4.1117 - val_accuracy: 0.1413 - val_loss: 4.2212
Epoch 3/50
[1m76/76[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 463ms/step - accuracy: 0.1644 - loss: 3.7330 - val_accuracy: 0.1708 - val_loss: 4.0089
Epoch 4/50
[1m76/76[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 463ms/step - accuracy: 0.2019 - loss: 3.4802 - val_accuracy: 0.1787 - val_loss: 3.8224
Epoch 5/50
[1m76/76[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m34s[0m 450ms/step - accuracy: 0.2128 - loss: 3.2479 - val_accuracy: 0.1836 - val_loss: 3.7406
Epoch 6/50
[1m76/76[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m

### 5 - Inferencia
Experimentar el funcionamiento de su modelo. Recuerde que debe realizar la inferencia de los modelos por separado de encoder y decoder.

In [80]:
# Modelo encoder (igual que el de entrenamiento)
encoder_model = Model(encoder_inputs, encoder_states)

# Modelo decoder para inferencia paso a paso
# Nuevos inputs de estados
decoder_state_input_h = Input(shape=(LATENT_DIM,))
decoder_state_input_c = Input(shape=(LATENT_DIM,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Reusar embedding y LSTM del decoder
decoder_outputs, state_h, state_c = decoder_lstm(
    embedding_layer(decoder_inputs), initial_state=decoder_states_inputs)

decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)


In [81]:
def decode_sequence(input_seq):
    # Codificar input y obtener estados iniciales
    states_value = encoder_model.predict(input_seq)

    # Inicializar target sequence con solo el token <sos>
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer.word_index['<sos>']

    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Obtener índice más probable
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = tokenizer.index_word.get(sampled_token_index, '')

        if sampled_word == '<eos>' or len(decoded_sentence.split()) > MAX_SEQUENCE_LENGTH:
            stop_condition = True
        else:
            decoded_sentence += ' ' + sampled_word

        # Actualizar target sequence (solo con el índice predicho)
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Actualizar estados
        states_value = [h, c]

    return decoded_sentence.strip()


## **Ensayos**

### **Ensayo 3**

In [82]:
# Hiperparametros utilizados

# MAX_VOCAB_SIZE = 8000
# MAX_SEQUENCE_LENGTH = 10
# EMBEDDING_DIM = 300
# LATENT_DIM = 256      # Unidades LSTM (encoder y decoder)
# DROPOUT_RATE = 0.2
# EPOCHS = 50

# Parámetros 2.3 M

def respond(user_input):
    seq = tokenizer.texts_to_sequences([user_input])
    seq = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
    response = decode_sequence(seq)
    return response

def ask_question(question):
  print("Pregunta: ", question)
  print("Respuesta:", respond(question))
  print("-------------------------------------------\n")

# Ejemplos de prueba


for question in [
    "Do you read?",
    "Do you have any pet?",
    "Where are you from?",
    "What do you do for a living?",
    "Do you have a favourite singer?",
    "would you like talking about music?",
    "Hey, how are you?",
    "How old are you?",
    "I enjoy music, do you?",
    "Do you like hamburguers?",
    "Do you enjoy basketball?"

]:
  ask_question(question)

Pregunta:  Do you read?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 355ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 65ms/step
Respuesta: yes
-------------------------------------------

Pregunta:  Do you have any pet?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 66ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 69ms/step
Respuesta: no
-------------------------------------------

Pregunta:  Where are you from?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 66ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 69ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 69ms/step
[1

### **Ensayo 2**

In [72]:
# Hiperparametros utilizados

# MAX_VOCAB_SIZE = 8000
# MAX_SEQUENCE_LENGTH = 10
# EMBEDDING_DIM = 300
# LATENT_DIM = 128      # Unidades LSTM (encoder y decoder)
# DROPOUT_RATE = 0.2
# EPOCHS = 20

# Parámetros: 1.3 M

def respond(user_input):
    seq = tokenizer.texts_to_sequences([user_input])
    seq = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
    response = decode_sequence(seq)
    return response

def ask_question(question):
  print("Pregunta: ", question)
  print("Respuesta:", respond(question))
  print("-------------------------------------------\n")

# Ejemplos de prueba


for question in [
    "Do you read?",
    "Do you have any pet?",
    "Where are you from?",
    "What do you do for a living?",
    "Do you have a favourite singer?",
    "would you like talking about music?",
    "Hey, how are you?",
    "How old are you?",
    "I enjoy music, do you?",
    "Do you like hamburguers?",
    "Do you enjoy basketball?"

]:
  ask_question(question)

Pregunta:  Do you read?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 402ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 341ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 80ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 72ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 79ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 79ms/step
Respuesta: i am a vegan
-------------------------------------------

Pregunta:  Do you have any pet?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 76ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 75ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 79ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 75ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step
Respuesta

### **Ensayo 1**

In [62]:
# Hiperparametros utilizados

# MAX_VOCAB_SIZE = 8000
# MAX_SEQUENCE_LENGTH = 10
# EMBEDDING_DIM = 300
# LATENT_DIM = 128      # Unidades LSTM (encoder y decoder)
# DROPOUT_RATE = 0.2
# EPOCHS = 50

# Parámetros: 1.3 M

def respond(user_input):
    seq = tokenizer.texts_to_sequences([user_input])
    seq = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
    response = decode_sequence(seq)
    return response

def ask_question(question):
  print("Pregunta: ", question)
  print("Respuesta:", respond(question))
  print("-------------------------------------------\n")

# Ejemplos de prueba


for question in [
    "Do you read?",
    "Do you have any pet?",
    "Where are you from?",
    "What do you do for a living?",
    "Do you have a favourite singer?",
    "would you like talking about music?",
    "Hey, how are you?",
    "How old are you?",
    "I enjoy music, do you?",
    "Do you like hamburguers?",
    "Do you enjoy basketball?"

]:
  ask_question(question)

Pregunta:  Do you read?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 40ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 59ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 57ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
Respuesta: i do not like it
-------------------------------------------

Pregunta:  Do you have any pet?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step
Respuesta: no
-------------------------------------------

Pregunta:  Where are you from?
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m

## **Conclusiones**

Se explorararon distintos ajustes de hiperparámetros (principalmente número de épocas y capacidad de la red (LATENT_DIM) ) y se observó cómo impactan en la calidad y coherencia de las respuestas generadas por el bot.

#### **Principales observaciones**

**Impacto de aumentar LATENT_DIM (Ensayo 3)** *El mejor desempeño*

>Incrementar de 128 a 256 unidades LSTM permitió una ligera mejora en la calidad de las respuestas, con menos respuestas evasivas ("I am not sure what you mean") y más frases completas. También fue el que consumió mayor tiempo de entrenamiento.

**Impacto de las épocas de entrenamiento**

>Entrenar sólo 20 épocas (Ensayo 2) produjo un modelo subentrenado lo que dió como rsultado a respuestas genéricas y fuera del contexto de la pregunta.

**Persistencia de respuestas genéricas**

>Incluso con 50 épocas y mayor capacidad (Ensayo 3), el modelo tiende a usar respuestas de evasión o frases seguras (“I do not know”, “I like to read”), y aún así evidenciar cierta limitación.

### **Resumen de los ensayos**

| Ensayo | Épocas | LATENT_DIM | Respuestas destacadas |
|--------|--------|-------------|-----------------------|
| **Ensayo 1** | 50 | 128 | Respuestas correctas para saludos y preguntas simples (*"How are you?" → "I am doing well how are you"*), pero evasivas en preguntas abiertas (*"Where are you from?" → "I am not sure what you mean"*). |
| **Ensayo 2** | 20 | 128 | Respuestas más cortas y genéricas, incoherentes en algunos casos (*"Do you read?" → "I am a vegan"*), con frases repetidas como *"I like to read"*. |
| **Ensayo 3** | 50 | 256 | Ligera mejora: respuestas más directas (*"Do you read?" → "yes"*), frases completas y algo más coherentes; persisten repeticiones y evasivas en temas más específicos. |
