In [9]:
import os
from openai import OpenAI
from IPython.display import display, Markdown

# Initialize the OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

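# Generate multiple-choice study questions from a documentation file and a section outline
def generar_preguntas_sobre_documentacion(document_path, outline_path, num_preguntas=10):
    """Read both files, ask the model for `num_preguntas` questions, and display them as Markdown."""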
    # Read the documentation from the file
    with open(document_path, "r", encoding="utf-8") as file:
        documentacion = file.read()

    # Read the outline from the file
    with open(outline_path, "r", encoding="utf-8") as file:
        outline = file.read()
    
    prompt = (
        f"Using the following documentation and section outline, generate {num_preguntas} theoretical questions. "
        f"Each question should test the understanding of key concepts, functions, or features described "
        f"in the documentation, and should cover the topics listed in the section outline. The questions should include "
        f"multiple-choice options formatted as a Markdown list. Additionally, include a hidden section or collapsible element "
        f"that reveals the correct answer along with a detailed explanation.\n\n"
        f"Documentation:\n\n{documentacion}\n\n"
        f"Section Outline:\n\n{outline}\n\n"
        "Format:\n"
        "1. Theoretical problem statement based on the documentation and outline.\n"
        "2. Multiple-choice options formatted in Markdown as a list (a,b,c,d).\n"
        "3. A hidden section or collapsible element that reveals the correct answer, provides a detailed explanation, "
        "   and references the documentation and outline where applicable to support the correct answer and explain why the other "
        "   options are incorrect. The word 'Answer' and the correct answer should be formatted in red using HTML "
        "   (e.g., `<span style='color:red'>Answer</span>` and `<span style='color:red'>correct answer</span>`).\n\n"
    )

    # Create the chat completion request with streaming enabled
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    # Collect the streamed response; a chunk may carry no choices or no text
    parts = []
    for chunk in stream:
        if chunk.choices:
            parts.append(chunk.choices[0].delta.content or "")
    response_content = "".join(parts)

    # Display the output as Markdown in Jupyter Notebook
    display(Markdown(response_content))

# Example usage
document_path = "/Users/bruno/dbks-gpt/dbks-gpt/documentation/section-II-data_managment_documentation.md"
outline_path = "/Users/bruno/dbks-gpt/dbks-gpt/outline/section-II-data_managment_outline.txt"

# Generate and display study questions in Markdown format
generar_preguntas_sobre_documentacion(document_path, outline_path)
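
The chunk-collection step in the cell above can be checked without calling the API. The sketch below is an illustration only: `collect_stream` and `fake_chunk` are hypothetical helpers, and the stand-in objects mimic just the attributes the loop reads (`choices[0].delta.content`), not the full response type returned by the client.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Join the text deltas of streamed chat-completion chunks into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks without choices; treat a missing delta text as empty
        if not chunk.choices:
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)

def fake_chunk(text):
    """Build a stand-in object with the attributes the loop reads (hypothetical helper)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

fake_stream = [
    fake_chunk("1. What is "),
    fake_chunk(None),             # a chunk whose delta has no text
    SimpleNamespace(choices=[]),  # a chunk with no choices at all
    fake_chunk("Delta Lake?"),
]
result = collect_stream(fake_stream)
print(result)
```

Keeping the accumulation in its own function makes it possible to test the handling of empty chunks separately from the network call and the Markdown display.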

1. What is Delta Lake in Databricks?
    a. A tool to handle data lakes.
    b. A tool to handle data files. 
    c. The default format for all operations in Databricks.
    d. Both b and c.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: d. Both b and c.</span>

Explanation: According to the documentation, Delta Lake is the optimized storage layer that provides the foundation for tables in a lakehouse on Databricks. It is a tool that extends Parquet data files for ACID transactions and metadata handling. Also, Delta Lake is the default format for all operations on Databricks. So both options b and c are correct.
</details>

2. Which of the following is not a feature of Delta Lake?
    a. Compatible with Apache Spark APIs.
    b. Integrated with Structured Streaming.
    c. Provides incremental processing at scale.
    d. It requires manual updates to the table schema.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: d. It requires manual updates to the table schema.</span>

Explanation: According to the documentation, Delta Lake actually supports automatic updates to the table schema. So, option d is not a feature of Delta Lake. Options a, b, and c are features of Delta Lake as confirmed in the documentation.
</details>

3. What's the purpose of using managed tables in Databricks?
    a. To give direct access to the data.
    b. To offer transactional guarantees and optimized performance.
    c. To allow data redundancy.
    d. To create tables using formats other than Delta Lake.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: b. To offer transactional guarantees and optimized performance.</span>

Explanation: As per the documentation, managed tables on Databricks are recommended for all tabular data managed in Databricks because they offer transactional guarantees and optimized performance. Options a, c, and d are not accurate according to the mentioned documentation.
</details>

4. Which statement about external tables in Databricks is false?
    a. They decouple the management of underlying data files from the metastore registration.
    b. They can store data files using common formats readable by external systems.
    c. When an external table is dropped, the underlying data in cloud storage is also deleted.
    d. External tables are recommended when you require direct access to the data.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: c. When an external table is dropped, the underlying data in cloud storage is also deleted.</span>

Explanation: According to the documentation, when an external table is dropped, the underlying data in cloud storage is not deleted. This makes option c false.
</details>

5. What does a view in Databricks refer to?
    a. An object endorsed by Apache.
    b. A data visualization tool.
    c. A read-only object composed from one or more tables and views in a metastore.
    d. A relationship pattern between tables.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: c. A read-only object composed from one or more tables and views in a metastore.</span>

Explanation: As per the documentation, a view in Databricks is a read-only object composed from one or more tables and views in a metastore.
</details>

6. How long is the Delta Lake table history retained by default?
    a. 7 days.
    b. 30 days.
    c. 7 months.
    d. 1 year.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: b. 30 days.</span>

Explanation: According to the official documentation, the Delta Lake table history is retained for 30 days by default.
</details>

7. What setting must be altered to retain Delta Lake table history longer than the default duration?
    a. spark.sql.files.ignoreMissingFiles
    b. delta.logRetentionDuration
    c. hive.metastore.warehouse.dir
    d. spark.sql.legacy.allowUntypedScalaUDF

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: b. delta.logRetentionDuration.</span>

Explanation: According to the documentation, in order to change the retention duration for the Delta Lake table history, the delta.logRetentionDuration property must be configured. So option b is correct.
</details>
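
As a rough illustration of the property named in question 7, the sketch below builds the SQL statement that sets `delta.logRetentionDuration` on a table. The helper name `build_log_retention_sql` and the table name `main.default.events` are hypothetical. On Databricks the resulting string would typically be passed to `spark.sql(...)`; that call is left out here so the snippet runs without a cluster.

```python
def build_log_retention_sql(table_name: str, days: int) -> str:
    """Return an ALTER TABLE statement that sets the Delta log retention interval."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval {days} days')"
    )

# Hypothetical table name, used only for illustration
statement = build_log_retention_sql("main.default.events", 60)
print(statement)
```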

8. In Databricks, is it possible to query a version of a table that has been removed by a VACUUM operation?
    a. Yes, always.
    b. Yes, but only if the version was removed within the last 7 days.
    c. No, the VACUUM operation permanently deletes the versions.
    d. It depends on the delta.logRetentionDuration setting.

<details>
<summary>Answer</summary>
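<span style='color:red'>Answer: c. No, the VACUUM operation permanently deletes the versions.</span>

Explanation: VACUUM permanently removes data files that are no longer referenced by the table and are older than the retention threshold, so a table version whose files have been removed can no longer be queried. According to the documentation, if you run VACUUM daily with the default values, 7 days of data is generally available for time travel. That window describes versions VACUUM has not yet cleaned up, not versions it has already removed, so option b is incorrect. Option d is incorrect because delta.logRetentionDuration controls how long the table history is kept, not whether deleted data files can still be read.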
</details>

9. When a user creates a table using SQL commands, Spark, or other tools in Databricks, by default, what type of table is it?
    a. An External table.
    b. A Basic Table.
    c. A Managed Table.
    d. A Delta table.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: c. A Managed Table.</span>

Explanation: As per the documentation, by default, any time a user creates a table using SQL commands, Spark, or other tools in Databricks, the table is a managed table.
</details>

10. What happens to the data files of a managed table in Databricks when the table is dropped?
    a. The data files are not deleted.
    b. The data files are deleted immediately.
    c. The data files are deleted within 30 days.
    d. The data files are archived for future access.

<details>
<summary>Answer</summary>
<span style='color:red'>Answer: c. The data files are deleted within 30 days.</span>

Explanation: According to the documentation, when a managed table is dropped