<img src="https://github.com/chdb-io/chdb/raw/pybind/docs/_static/snake-chdb.png" width=320 >

# chdb-GPT
Generate **chDB** and **ClickHouse** queries from natural language using the **OpenAI APIs**.

In [None]:
#@title Install Requirements { display-mode: "form" }
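# The code below uses the legacy `openai.ChatCompletion` interface, which was removed in openai 1.0.
!pip install "openai<1" chdb --quiet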

In [None]:
#@title Provide OpenAI API { display-mode: "form" }
openai_api_key = "" #@param {type:"string"}
import openai

openai.api_key = openai_api_key

# Set this to `True` to use GPT-4. Otherwise the code uses GPT-3.5.
GPT4 = False

In [None]:
#@title Prepare ClickHouse GPT Agent { display-mode: "form" }
class Conversation:
    """
    This class helps me keep the context of a conversation. It's not
    sophisticated at all and it simply regulates the number of messages in the
    context window.

    You could try something much more involved, like counting the number of
    tokens and limiting. Even better: you could use the API to summarize the
    context and reduce its length.

    But this is simple enough and works well for what I need.
    """

    messages = None

    def __init__(self):
        # Here is where you can add some personality to your assistant, or
        # play with different prompting techniques to improve your results.
        Conversation.messages = [
            {
                "role": "system",
                "content": (
                    "You are a ClickHouse expert specializing in OLAP databases, SQL format, and functions. You can produce SQL queries using knowledge of ClickHouse's architecture, data modeling, performance optimization, query execution, and advanced analytical functions."
                ),
            }
        ]


    def answer(self, prompt):
        """
        This is the function I use to ask questions.
        """
        self._update("user", prompt)

        response = openai.ChatCompletion.create(
            model="gpt-4-0613" if GPT4 else "gpt-3.5-turbo-0613",
            messages=Conversation.messages,
            temperature=0,
        )

        self._update("assistant", response.choices[0].message.content)

        return response.choices[0].message.content

    def _update(self, role, content):
        Conversation.messages.append({
            "role": role,
            "content": content,
        })

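        # This is a rough way to keep the context size manageable. Drop the
        # oldest message after the system prompt (index 0) so the assistant
        # keeps its instructions.
        if len(Conversation.messages) > 20:
            Conversation.messages.pop(1)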


    @staticmethod
    def build_query_prompt(query):

        input_str=f"""
        You are a ClickHouse expert specializing in OLAP databases, SQL format, and functions. You can produce SQL queries using knowledge of ClickHouse's architecture, data modeling, performance optimization, query execution, and advanced analytical functions.
        I would like you to generate an accurate ClickHouse SQL query for the question:
        {query}

        - Make sure the query is ClickHouse compatible
        - Make sure ClickHouse SQL and ClickHouse functions are used
        - Assume there are no tables in memory, data is always remote
        - Load data from files using the file() ClickHouse function, for instance: file('data.csv')
        - Load data from urls containing http using the url() ClickHouse function, for instance url('http://domain.com/file.csv')
        - Make sure any file hosted on s3 is loaded using the s3() ClickHouse function
        - Ensure case sensitivity
        - Ensure NULL check
        - Do not add any special information or comment, just return the query

        The expected output is code only. Always use table name in column reference to avoid ambiguity.
        """

        return input_str
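
The `Conversation` docstring mentions limiting the context by token count instead of message count. Here is a minimal sketch of that idea. It assumes a rough heuristic of about four characters per token rather than a real tokenizer, and the names `estimate_tokens` and `trim_to_budget` as well as the 3000-token default are hypothetical, not part of this notebook or the OpenAI library.

```python
def estimate_tokens(message):
    """Rough size estimate: assumes ~4 characters per token (not a real tokenizer)."""
    return len(message["content"]) // 4 + 1


def trim_to_budget(messages, max_tokens=3000):
    """Return a trimmed copy of `messages`, always keeping messages[0] (the system prompt)."""
    system, history = messages[0], list(messages[1:])
    budget = max_tokens - estimate_tokens(system)
    # Drop the oldest non-system messages until the remainder fits the budget.
    while history and sum(estimate_tokens(m) for m in history) > budget:
        history.pop(0)
    return [system] + history
```

One way to try it would be to pass `trim_to_budget(Conversation.messages)` as the `messages` argument of the API call instead of the raw list.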


Let's input our query and form a prompt:


In [None]:
#@title Prompt using Natural Language { display-mode: "form" }
query = "show the top 10 towns from url https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet"  #@param {type:"string"}

prompt = Conversation.build_query_prompt(query)

conversation = Conversation()

answer = conversation.answer(prompt)
print(answer)

Create a new instance of `Conversation` whenever you want to clear the context.

We can now extend our query and the API will remember what we did before.
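
The API itself is stateless: it "remembers" only because the whole message list is sent again on every call. The sketch below mimics that over two turns without calling OpenAI. `fake_reply` and `ask` are hypothetical stand-ins for illustration, not part of the notebook.

```python
messages = [{"role": "system", "content": "You are a ClickHouse expert."}]


def fake_reply(history):
    # Hypothetical stand-in for the chat completion call:
    # it only reports how many messages it received.
    return f"(reply based on {len(history)} messages)"


def ask(prompt):
    # Append the user turn, "call the model" with the full history,
    # then store the reply so the next call sees it too.
    messages.append({"role": "user", "content": prompt})
    reply = fake_reply(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


first = ask("show the top 10 towns")
second = ask("add round(avg(price)) AS price to the query")
print([m["role"] for m in messages])
```

The second call receives four messages (system, user, assistant, user), which is why a short refinement like "add a column" is enough.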

In [None]:
#@title Refine SQL using Natural Language
refine_query = "add round(avg(price)) AS price to the query" #@param {type:"string"}
query = conversation.answer(refine_query)
print(query)

In [None]:
#@title Execute Query using chDB { display-mode: "form" }
import chdb
# Run the generated SQL with chDB and print the result in Pretty format.
res = chdb.query(query, 'Pretty')
print(res.data())
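
Even though the prompt asks for code only, the model may still wrap the SQL in a Markdown code fence, which would make the query fail. A small sketch of a cleanup step is shown here. `extract_sql` is a hypothetical helper, and it assumes the fence, if present, has its optional language tag followed by a newline.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, spelled this way to keep the example readable


def extract_sql(answer):
    """Return the text inside the first fenced block, or the whole answer if there is none."""
    pattern = FENCE + r"[a-zA-Z]*[ \t]*\n(.*?)" + FENCE
    match = re.search(pattern, answer, flags=re.DOTALL)
    sql = match.group(1) if match else answer
    return sql.strip()
```

With this helper the execution step could become `chdb.query(extract_sql(query), 'Pretty')`.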