# Building a dataset to fine-tuning an embedding model

Creating a dataset to customize the embedding model (BAAI/bge-m3) for a specific domain (Legal) and language (Spanish).

Based on:

- https://www.philschmid.de/sagemaker-train-deploy-embedding-models
- https://github.com/virattt/financial-datasets/
- https://github.com/virattt/financial-datasets/blob/main/financial_datasets/prompts.py

In [1]:
import os
from groq import Groq
import json
import time

from datasets import Dataset

In [2]:
import sys
sys.path.append(os.path.abspath('../../'))

from src.etls.boe.scrapper import BOEScrapper

[2024-07-11 14:32:05,451] [16647] [INFO] [root] Initialized logging
[2024-07-11 14:32:05,453] [16647] [INFO] [root] Initializing logging
[2024-07-11 14:32:05,453] [16647] [INFO] [root] Initialized logging


# Config

In [3]:
QUESTIONS_PER_DOCUMENT = 100

In [4]:
OUTPUT_NAME_DATASET = "dariolopez/justicio-rag-embedding-qa"

In [5]:
URLS = list(set([
    # 
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-12203",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-10565",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-10566",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11719",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-1985-5392",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11719",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2013-12887",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2018-16673",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2006-21990",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-657",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2007-6115",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2004-21760",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-9347",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-21007",
    "https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2017-90529",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2013-12632",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2000-544",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11072",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-1978-31229",
    # 
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-1995-24292",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2008-2493",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2008-2492",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2014-7534",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-9347",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-1999-19448",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2013-12632",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-13605",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-12203",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-5366",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2022-11589",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11724",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2011-15623",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-1986-10499",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2020-17264",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-16066",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2007-5825",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2022-14630",
    "https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2024-90030",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2001-12716",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2009-5491",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2003-771",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-12902",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2003-20254",
    "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2004-4214"
]))

In [6]:
URLS.remove("https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-12203")
URLS.insert(0, "https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-12203")

In [7]:
URLS[0]

'https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-12203'

In [8]:
len(URLS)

40

In [9]:
SYSTEM_PROMPT = """
Como un experto en derecho y leyes españolas, tu tarea es crear preguntas y respuestas sobre el Boletín Oficial del Estado (BOE) de España.

Tu tarea consiste en crear pares de preguntas y respuestas independientes, sin referencia a ningún documento concreto. 

Estas preguntas y respuestas se utilizarán de forma independiente en futuras aplicaciones como la evaluación y el ajuste de LLM, en las que no se dispondrá de ningún documento de referencia.

Sigue las siguientes reglas:

1. Derivación directa: Las respuestas deben derivarse directamente del contenido proporcionado.
2. Preguntas autónomas: Asegúrate de que las preguntas se pueden responder completamente a partir de la información proporcionada y no implican la existencia de un documento más amplio.
3. Claridad y precisión: Las preguntas deben ser claras, precisas y no ambiguas.
4. Referencias prohibidas: Evita explícitamente frases como "según el documento", “según el capítulo”, "en el texto", "como se menciona en el artículo", o cualquier implicación de textos externos. No construyas preguntas que requieran conocer la estructura del documento o la ubicación de la información en el mismo.
5. Inclusión del contexto: Incluya la información específica del contenido que apoya la respuesta. El contexto debe permitir que la respuesta sea independiente de cualquier texto externo.
6. Suficiencia de la información: Si el contenido carece de información suficiente para formar una pareja completa de pregunta-respuesta, no fuerces una.
7. Respuestas originales: Crea las respuestas con tus propias palabras; no se permite la copia directa del contenido.
8. Asegúrate de responder siempre en español.
9. El contexto debería tener una longitud aproximada de 1000 caracteres. 

Salida:

La salida tendrá el siguiente formato JSON:

```
[
    {
        "question": "texto para la pregunta",
        “context”: "texto para el contexto",
        "answer": "texto para la respuesta"
    },
    {
        "question": "texto para la pregunta",
        “context”: "texto para el contexto",
        "answer": "texto para la respuesta"
    }
]
```

En la salida NO incluyas ningún texto extra, CONTESTA exclusivamente en formato JSON.

Ejemplo de salida bien generada:

```
[
    {
        "question": "¿Cuál es el papel de la Unión Europea en la política de vivienda?",
        "context": "El artículo 19 del Pilar Europeo de derechos sociales, incorpora la vivienda entre los principios y los derechos esenciales para el funcionamiento de los sistemas de bienestar europeo y, por último, la Carta de los Derechos Fundamentales de la Unión Europea aprobada por el Parlamento, el Consejo y la Comisión Europea el 7 de diciembre de 2000 establece en su artículo 34.3 que «con el fin de combatir la exclusión social y la pobreza, la Unión reconoce y respeta el derecho a una ayuda social y a una ayuda de vivienda para garantizar una existencia digna a todos aquellos que no dispongan de recursos suficientes».",
        "answer": "La Unión Europea ha avanzado en el reconocimiento del derecho a la vivienda de toda persona, y ha establecido principios y derechos esenciales para el funcionamiento de los sistemas de bienestar europeo."
    },
    {
        "question": "¿Qué sucede si el acreedor hipotecario es un gran tenedor de vivienda?",
        "context": "En el caso de que el acreedor hipotecario sea un gran tenedor de vivienda, el inmueble objeto de demanda sea la vivienda habitual del deudor hipotecario y se tenga constancia, conforme a los apartados anteriores, que éste se encuentra en situación de vulnerabilidad económica, no se admitirán las demandas de ejecución hipotecaria en las que no se acredite que la parte actora se ha sometido al procedimiento de conciliación o intermediación que a tal efecto establezcan las Administraciones Públicas competentes, en base al análisis de las circunstancias de ambas partes y de las posibles ayudas y subvenciones existentes conforme a la legislación y normativa autonómica en materia de vivienda.",
        "answer": "No se admitirán las demandas de ejecución hipotecaria si no se acredita que la parte actora se ha sometido al procedimiento de conciliación o intermediación."
    },
    {
        "question": "¿Cuál es el requisito para acreditar la concurrencia o no de vulnerabilidad económica de la parte ejecutada?",
        "context": "Para acreditar la concurrencia o no de vulnerabilidad económica de la parte ejecutada se deberá aportar documento acreditativo, de vigencia no superior a tres meses, emitido, previo consentimiento de éste, por los servicios de las Administraciones autonómicas y locales competentes en materia de vivienda, asistencia social, evaluación e información de situaciones de necesidad social y atención inmediata a personas en situación o riesgo de exclusión social que hayan sido específicamente designados conforme a la legislación y normativa autonómica en materia de vivienda.",
        "answer": "Un documento acreditativo emitido por los servicios competentes, con una vigencia no superior a tres meses."
    }
]
```

Ejemplo de salida mal generada:
```
Aquí te dejo 9 tripletas de pregunta/respuesta/contexto sobre el bloque de texto proporcionado:
[
    {
        "Pregunta": "...",
        "Respuesta": "",
        "Contexto": "..."
    },
    {
        'question': '¿Qué tipo de viviendas se promueven en zonas de mercado residencial tensionado?',
        'context': 'Artículo 17. Vivienda asequible incentivada.',
        'answer': 'Viviendas asequibles incentivadas.'
    },
    {
        "question": "¿Cuál es el derecho constitucional que se reconoce en el artículo 47 de la Constitución Española?",
        "context": "La Constitución española (CE) reconoce, en su artículo 47, el derecho al disfrute de una vivienda digna y adecuada e impone seguidamente a los poderes públicos el deber de promover las condiciones necesarias que garanticen la igualdad en el ejercicio de los derechos y el cumplimiento de los deberes constitucionales.",
        "answer": "El derecho al disfrute de una vivienda digna y adecuada."
    },
]
```

NUNCA menciones las palabras "documento", "texto", "presentación", "archivo", "tabla", “artículo”, "ley", “capítulo”, “preámbulo”, “título preliminar”, “disposición” o “disposiciones generales” en sus preguntas o respuestas.

Asegúrate SIEMPRE de que todas las preguntas y respuestas sean precisas, autónomas y pertinentes, sin basarse en ningún documento o texto original ni insinuar su existencia, evitando estrictamente cualquier invención o especulación.

Asegúrate SIEMPRE de que la respuesta esté contenida en el contexto.
"""

In [11]:
client = Groq(
    api_key=os.environ.get('GROQ_API_KEY'),
)

In [12]:
# groq limits: https://console.groq.com/settings/limits

def split_text_into_parts(text, part_size):
    parts = []
    start = 0
    while start < len(text):
        end = start + part_size
        parts.append(text[start:end])
        start = end
    return parts

In [13]:
data = []
errors = []
for url in URLS[::-1]:

    document_id = url.split("?id=")[1]
    
    boe_scrapper = BOEScrapper()
    # text = boe_scrapper.download_eli_document(url)
    text = boe_scrapper.download_document(url, output_text=True)

    # Split original text because of the Groq API limits: https://console.groq.com/settings/limits
    splitted = split_text_into_parts(text, part_size=15_500)
    print(f"The document {url} was splitted on {len(splitted)} texts")
    for text_splitted in splitted:
        time.sleep(66)  # groq limits: https://console.groq.com/settings/limits

        questions = QUESTIONS_PER_DOCUMENT // len(splitted)
        limit_questions = 16
        if questions > limit_questions:  # groq limits generation
            questions = limit_questions
        elif questions == 0:
            questions = 1
        print(f"Generating {questions} questions")

        # Generate questions
        query = f"Genera {questions} tripletas de pregunta/respuesta/contexto para el siguiente bloque de texto: {text_splitted}. Retorna en formato JSON."
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": SYSTEM_PROMPT,
                },
                {
                    "role": "user",
                    "content": query,
                },
            ],
            model="llama3-70b-8192",
            temperature=0,
            stream=False,
            # max_tokens=1024
        )

        try:
            content = chat_completion.choices[0].message.content

            json_start = content.find('[')
            cleaned_content = content[json_start:].replace('\n', '').replace('\r', '').replace(',]', ']').replace('```', '')

            if cleaned_content.strip():
                responses = json.loads(cleaned_content)
                for response in responses:
                    if len(response['context']) > 120:
                        row = {}
                        row['document'] = document_id
                        row['question'] = response['question']
                        row['context'] = response['context']
                        row['answer'] = response['answer']
                        data.append(response)
            
        except Exception as e:
            print(f"Se ha producido una excepción: {e}")
            print(f"Wrong generated on url {url}")
            print(chat_completion.choices[0].message.content)

[2024-07-11 14:32:05,614] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2003-771
[2024-07-11 14:32:05,906] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-an/l/2002/12/16/5
[2024-07-11 14:32:06,478] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-an/l/2002/12/16/5
[2024-07-11 14:32:06,480] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-an/l/2002/12/16/5


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2003-771 was splitted on 2 texts
Generating 16 questions
Generating 16 questions


[2024-07-11 14:34:35,175] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2020-17264
[2024-07-11 14:34:35,780] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/lo/2020/12/29/3
[2024-07-11 14:34:36,693] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/lo/2020/12/29/3
[2024-07-11 14:34:36,695] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/lo/2020/12/29/3


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2020-17264 was splitted on 20 texts
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions


[2024-07-11 14:58:07,806] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11724
[2024-07-11 14:58:09,012] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/rdlg/2015/10/30/8
[2024-07-11 14:58:10,387] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/rdlg/2015/10/30/8
[2024-07-11 14:58:10,389] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/rdlg/2015/10/30/8


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11724 was splitted on 52 texts
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Se ha producido una excepción: Extra data: line 1 column 708 (char 707)
Wrong generated on url http

[2024-07-11 15:58:59,838] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11072
[2024-07-11 15:59:00,302] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2015/10/14/45
[2024-07-11 15:59:01,133] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2015/10/14/45
[2024-07-11 15:59:01,134] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2015/10/14/45


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11072 was splitted on 5 texts
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions


[2024-07-11 16:05:12,963] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2008-2492
[2024-07-11 16:05:13,289] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-an/l/2007/11/26/12
[2024-07-11 16:05:14,085] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-an/l/2007/11/26/12
[2024-07-11 16:05:14,087] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-an/l/2007/11/26/12


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2008-2492 was splitted on 5 texts
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions


[2024-07-11 16:11:27,543] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-1985-5392
[2024-07-11 16:11:28,178] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/1985/04/02/7
[2024-07-11 16:11:29,598] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/1985/04/02/7
[2024-07-11 16:11:29,599] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/1985/04/02/7


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-1985-5392 was splitted on 11 texts
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions
Generating 9 questions


[2024-07-11 16:24:40,809] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-9347
[2024-07-11 16:24:41,408] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/lo/2021/06/04/8
[2024-07-11 16:24:42,344] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/lo/2021/06/04/8
[2024-07-11 16:24:42,345] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/lo/2021/06/04/8


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-9347 was splitted on 18 texts
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions


[2024-07-11 16:46:01,229] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-657
[2024-07-11 16:46:01,864] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-an/l/2016/12/27/9
[2024-07-11 16:46:02,944] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-an/l/2016/12/27/9
[2024-07-11 16:46:02,946] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-an/l/2016/12/27/9


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-657 was splitted on 17 texts
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions


[2024-07-11 17:06:06,458] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2001-12716
[2024-07-11 17:06:06,788] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-ga/l/2001/05/31/4
[2024-07-11 17:06:07,450] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-ga/l/2001/05/31/4
[2024-07-11 17:06:07,451] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-ga/l/2001/05/31/4


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2001-12716 was splitted on 2 texts
Generating 16 questions
Generating 16 questions


[2024-07-11 17:08:34,587] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11719
[2024-07-11 17:08:35,192] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/rdlg/2015/10/30/5
[2024-07-11 17:08:36,209] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/rdlg/2015/10/30/5
[2024-07-11 17:08:36,211] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/rdlg/2015/10/30/5


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-11719 was splitted on 12 texts
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions
Generating 8 questions


[2024-07-11 17:22:55,314] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-5366
[2024-07-11 17:22:55,921] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2023/02/28/4
[2024-07-11 17:22:56,702] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2023/02/28/4
[2024-07-11 17:22:56,704] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2023/02/28/4


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-5366 was splitted on 14 texts
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions


[2024-07-11 17:39:42,033] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2014-7534
[2024-07-11 17:39:42,568] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-an/l/2014/06/24/1
[2024-07-11 17:39:43,142] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-an/l/2014/06/24/1
[2024-07-11 17:39:43,143] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-an/l/2014/06/24/1


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2014-7534 was splitted on 7 texts
Generating 14 questions
Generating 14 questions
Generating 14 questions
Generating 14 questions
Generating 14 questions
Generating 14 questions
Generating 14 questions


[2024-07-11 17:48:17,634] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2007-5825
[2024-07-11 17:48:18,263] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/lo/2007/03/19/2
[2024-07-11 17:48:19,204] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/lo/2007/03/19/2
[2024-07-11 17:48:19,206] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/lo/2007/03/19/2


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2007-5825 was splitted on 15 texts
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions


[2024-07-11 18:06:00,349] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2013-12632
[2024-07-11 18:06:00,834] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/rdlg/2013/11/29/1
[2024-07-11 18:06:01,750] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/rdlg/2013/11/29/1
[2024-07-11 18:06:01,751] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/rdlg/2013/11/29/1


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2013-12632 was splitted on 9 texts
Generating 11 questions
Generating 11 questions
Generating 11 questions


[2024-07-11 18:10:35,442] [16647] [INFO] [groq._base_client] Retrying request to /openai/v1/chat/completions in 0.964836 seconds


Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions


[2024-07-11 18:17:55,315] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-16066
[2024-07-11 18:17:55,968] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-an/l/2023/06/07/5
[2024-07-11 18:17:57,219] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-an/l/2023/06/07/5
[2024-07-11 18:17:57,221] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-an/l/2023/06/07/5


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2023-16066 was splitted on 31 texts
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions


[2024-07-11 18:53:57,264] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-1986-10499
[2024-07-11 18:53:58,135] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/1986/04/25/14
[2024-07-11 18:53:59,003] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/1986/04/25/14
[2024-07-11 18:53:59,005] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/1986/04/25/14


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-1986-10499 was splitted on 9 texts
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions


[2024-07-11 19:04:47,086] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2000-544
[2024-07-11 19:04:47,698] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/lo/2000/01/11/4
[2024-07-11 19:04:48,578] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/lo/2000/01/11/4
[2024-07-11 19:04:48,579] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/lo/2000/01/11/4


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2000-544 was splitted on 4 texts
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions


[2024-07-11 19:09:47,089] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2007-6115
[2024-07-11 19:09:47,644] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/lo/2007/03/22/3
[2024-07-11 19:09:48,301] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/lo/2007/03/22/3
[2024-07-11 19:09:48,302] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/lo/2007/03/22/3


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2007-6115 was splitted on 14 texts
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions
Generating 7 questions


[2024-07-11 19:26:37,545] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-13605
[2024-07-11 19:26:38,614] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-an/l/2021/07/27/4
[2024-07-11 19:26:39,957] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-an/l/2021/07/27/4
[2024-07-11 19:26:39,960] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-an/l/2021/07/27/4


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2021-13605 was splitted on 15 texts
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions


[2024-07-11 19:44:32,103] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2009-5491
[2024-07-11 19:44:32,489] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es-an/l/2009/02/27/1
[2024-07-11 19:44:33,136] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es-an/l/2009/02/27/1
[2024-07-11 19:44:33,137] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es-an/l/2009/02/27/1


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2009-5491 was splitted on 3 texts
Generating 16 questions
Generating 16 questions
Generating 16 questions


[2024-07-11 19:48:17,984] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-10565
[2024-07-11 19:48:18,596] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2015/10/01/39
[2024-07-11 19:48:19,930] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2015/10/01/39
[2024-07-11 19:48:19,931] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2015/10/01/39


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-10565 was splitted on 16 texts
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions


[2024-07-11 20:07:10,726] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2024-90030
[2024-07-11 20:07:12,547] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2024-90030


The document https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2024-90030 was splitted on 118 texts
Generating 0 questions
Se ha producido una excepción: Expecting value: line 1 column 1 (char 0)
Wrong generated on url https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2024-90030
No hay tripletas de pregunta/respuesta/contexto que se puedan generar a partir del índice proporcionado, ya que no contiene información sustancial que permita crear preguntas y respuestas coherentes. El índice solo enumera los títulos y artículos de una ley o reglamento, pero no proporciona contenido que permita generar preguntas y respuestas.
Generating 0 questions
Generating 0 questions
Generating 0 questions
Se ha producido una excepción: Extra data: line 1 column 3 (char 2)
Wrong generated on url https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2024-90030
[]
(Note: Since the provided text does not contain any specific information about the Boletín Oficial del Estado (BOE) of Spain, I couldn't generate any triple

[2024-07-11 22:24:10,100] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2022-11589


Se ha producido una excepción: Extra data: line 1 column 3 (char 2)
Wrong generated on url https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2024-90030
[]
(Note: Since there is no text provided, I couldn't generate any tripletas of pregunta/respuesta/contexto. The output is an empty JSON array.)


[2024-07-11 22:24:10,910] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2022/07/12/15
[2024-07-11 22:24:12,007] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2022/07/12/15
[2024-07-11 22:24:12,008] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2022/07/12/15


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2022-11589 was splitted on 10 texts
Generating 10 questions
Generating 10 questions
Generating 10 questions
Generating 10 questions
Generating 10 questions
Generating 10 questions
Generating 10 questions
Generating 10 questions
Generating 10 questions
Generating 10 questions


[2024-07-11 22:36:48,828] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-12902
[2024-07-11 22:36:50,405] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2017/11/08/9
[2024-07-11 22:36:52,089] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2017/11/08/9
[2024-07-11 22:36:52,092] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2017/11/08/9


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-12902 was splitted on 70 texts
Generating 1 questions
Generating 1 questions
Generating 1 questions
Se ha producido una excepción: Extra data: line 1 column 606 (char 605)
Wrong generated on url https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-12902
[
    {
        "question": "¿Cuál es el objetivo principal de la legislación de contratos públicos?",
        "context": "La legislación de contratos públicos, de marcado carácter nacional, encuentra, no obstante, el fundamento de muchas de sus instituciones más allá de nuestras fronteras, en concreto, dentro de la actividad normativa de instituciones de carácter internacional, como es el caso de la OCDE, de UNCITRAL –en el ámbito de la ONU–, o, especialmente, de la Unión Europea.",
        "answer": "Lograr una mayor transparencia en la contratación pública y una mejor relación calidad-precio."
    }
]
Note: I generated only one triplet as per your request. If you need mo

[2024-07-11 22:54:32,690] [16647] [INFO] [groq._base_client] Retrying request to /openai/v1/chat/completions in 0.785088 seconds


Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Generating 1 questions
Se ha producido una excepción: Extra data: line 1 column 604 (char 603)
Wrong generated on url https://www.boe.es/diario_boe/xml.php?id=BOE-A-2017-12902
[
    {
        "question": "¿Cuál es el método que se utilizará para determinar los costes de ciclo de vida?",
        "context": "El método utilizado para la evaluación de los costes imputados a externalidades medioambientales cumplirá todas las condiciones siguientes: a) estar basado en criterios verificables objetivamente y no discriminatorios; en particular, si no se ha establecido para una aplicación repetida o continuada, no favorecerá o perjudi

[2024-07-12 00:01:52,061] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2011-15623
[2024-07-12 00:01:52,528] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2011/10/04/33
[2024-07-12 00:01:53,381] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2011/10/04/33
[2024-07-12 00:01:53,382] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2011/10/04/33


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2011-15623 was splitted on 8 texts
Generating 12 questions
Generating 12 questions
Generating 12 questions
Generating 12 questions
Generating 12 questions
Generating 12 questions
Generating 12 questions
Generating 12 questions


[2024-07-12 00:11:45,488] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2017-90529
[2024-07-12 00:11:46,554] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2017-90529


The document https://www.boe.es/diario_boe/xml.php?id=BOJA-b-2017-90529 was splitted on 9 texts
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions
Generating 11 questions


[2024-07-12 00:22:49,546] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2006-21990
[2024-07-12 00:22:50,002] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2006/12/14/39
[2024-07-12 00:22:50,724] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2006/12/14/39
[2024-07-12 00:22:50,726] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2006/12/14/39


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2006-21990 was splitted on 6 texts
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions
Generating 16 questions


[2024-07-12 00:30:26,270] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2003-20254
[2024-07-12 00:30:27,210] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2003/11/03/33
[2024-07-12 00:30:28,176] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2003/11/03/33
[2024-07-12 00:30:28,178] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2003/11/03/33


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2003-20254 was splitted on 18 texts
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions
Generating 5 questions


[2024-07-12 00:52:09,864] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-10566
[2024-07-12 00:52:10,588] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/l/2015/10/01/40
[2024-07-12 00:52:11,748] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/l/2015/10/01/40
[2024-07-12 00:52:11,750] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/l/2015/10/01/40


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2015-10566 was splitted on 28 texts
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions
Generating 3 questions


[2024-07-12 01:26:06,319] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2018-16673
[2024-07-12 01:26:07,258] [16647] [INFO] [download_eli_document] Scrapping consolidated text (eli) from document: https://www.boe.es/eli/es/lo/2018/12/05/3
[2024-07-12 01:26:08,070] [16647] [INFO] [download_eli_document] Scrapped consolidated text (eli) successfully from document: https://www.boe.es/eli/es/lo/2018/12/05/3
[2024-07-12 01:26:08,072] [16647] [INFO] [download_document] Scrapped document successfully https://www.boe.es/eli/es/lo/2018/12/05/3


The document https://www.boe.es/diario_boe/xml.php?id=BOE-A-2018-16673 was splitted on 16 texts
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions


[2024-07-12 01:41:26,660] [16647] [INFO] [groq._base_client] Retrying request to /openai/v1/chat/completions in 0.976010 seconds


Generating 6 questions
Generating 6 questions
Generating 6 questions
Generating 6 questions


[2024-07-12 01:46:24,600] [16647] [INFO] [download_document] Scrapping document: https://www.boe.es/diario_boe/xml.php?id=BOE-A-2022-14630


ConnectTimeout: HTTPSConnectionPool(host='www.boe.es', port=443): Max retries exceeded with url: /diario_boe/xml.php?id=BOE-A-2022-14630 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1d1df553c0>, 'Connection to www.boe.es timed out. (connect timeout=10)'))

In [14]:
len(data)

2460

In [15]:
# Calculating the mean of the contexts length
contexts = []
for d in data:
    contexts.append(len(d['context']))
print(sum(contexts) / len(contexts))

301.3670731707317


In [16]:
# Obtener las posiciones de los 10 mayores elementos
posiciones = sorted(range(len(contexts)), key=lambda i: contexts[i], reverse=True)[:10]

# Imprimir las posiciones y los valores correspondientes
for pos in posiciones:
    print(f"Posición: {pos}, Valor: {contexts[pos]}")

Posición: 957, Valor: 1138
Posición: 1672, Valor: 1050
Posición: 1921, Valor: 1041
Posición: 1143, Valor: 973
Posición: 211, Valor: 937
Posición: 1670, Valor: 910
Posición: 953, Valor: 902
Posición: 125, Valor: 892
Posición: 1798, Valor: 892
Posición: 1823, Valor: 881


In [17]:
# Obtener las posiciones de los 10 mayores elementos
posiciones = sorted(range(len(contexts)), key=lambda i: contexts[i], reverse=False)[:10]

# Imprimir las posiciones y los valores correspondientes
for pos in posiciones:
    print(f"Posición: {pos}, Valor: {contexts[pos]}")

Posición: 977, Valor: 121
Posición: 1181, Valor: 121
Posición: 1276, Valor: 121
Posición: 1300, Valor: 121
Posición: 1597, Valor: 121
Posición: 263, Valor: 122
Posición: 279, Valor: 122
Posición: 757, Valor: 122
Posición: 951, Valor: 122
Posición: 1211, Valor: 122


In [18]:
data[415]

{'question': '¿Qué tipo de Entidades locales pueden crear las Comunidades Autónomas?',
 'context': 'Las Comunidades Autónomas, de acuerdo con lo dispuesto en sus respectivos Estatutos, podrán crear en su territorio comarcas u otras Entidades que agrupen varios Municipios, cuyas características determinen intereses comunes precisados de una gestión propia o demanden la prestación de servicios de dicho ámbito.',
 'answer': 'Comarcas u otras Entidades que agrupen varios Municipios.'}

In [19]:
# Convertir la lista de diccionarios a un diccionario de listas
data_dict = {key: [dic[key] for dic in data] for key in data[0]}

In [20]:
dataset = Dataset.from_dict(data_dict)

In [21]:
dataset

Dataset({
    features: ['question', 'context', 'answer'],
    num_rows: 2460
})

In [22]:
import huggingface_hub

huggingface_hub.login()

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/vant/.cache/huggingface/token
Login successful


In [23]:
dataset.push_to_hub(OUTPUT_NAME_DATASET + '-tmp-2')

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/dariolopez/justicio-rag-embedding-qa-tmp/commit/cb68b8bf970cb46f335c393d52c06c62c9a215de', commit_message='Upload dataset', commit_description='', oid='cb68b8bf970cb46f335c393d52c06c62c9a215de', pr_url=None, pr_revision=None, pr_num=None)