# Customizing Llama 3x models to industry datasets using in-context learning techniques like [GTL](https://arxiv.org/pdf/2310.07338)

---

## Introduction

This notebook introduces an approach that uses Llama 3 models on Amazon Bedrock or Amazon SageMaker JumpStart, together with advanced prompt engineering, to convert natural language questions into executable SQL queries. The prompting technique used here helps the Llama models understand industry-specific tabular datasets better and provide more accurate and relevant analysis of structured tabular data. The generated SQL queries retrieve information from industry-specific database structures. This matters for organizations that want to use Llama models without the complexity of fine-tuning: many do not have the large labeled datasets that fine-tuning requires, and they may end up training a separate model for each dataset. The techniques shown here avoid both problems.

Moreover, our approach scales well because it dynamically selects and retrieves the most relevant table schemas for a given natural language question. This scalability comes from in-context learning: the Llama models are given context specific to the dataset being analyzed.

Our solution can be applied in practical scenarios where companies manage numerous databases with industry-specific table and column names, such as in the finance or healthcare industries.

Advantages of this method:
- It avoids fine-tuning, and the labeled data that fine-tuning requires, both of which can be hard to obtain.
- Correct analysis of tabular data using industry-specific language becomes easier.
- The consistency and accuracy of the models' data analysis improve compared with few-shot prompting.
- Generic Llama models from Amazon SageMaker or Amazon Bedrock can be used without maintaining a set of use-case-specific fine-tuned models, which may require mature MLOps processes.

---
## Llama 3 Model Selection

Today, there are two Llama 3 models available on Amazon Bedrock:

### 1. Llama 3 8B

- **Description:** Ideal for limited computational power and resources, faster training times, and edge devices.
- **Max Tokens:** 2,048
- **Context Window:** 8,192
- **Languages:** English
- **Supported Use Cases:** Synthetic Text Generation, Text Classification, and Sentiment Analysis.

### 2. Llama 3 70B

- **Description:** Ideal for content creation, conversational AI, language understanding, research development, and enterprise applications. 
- **Max Tokens:** 2,048
- **Context Window:** 8,192
- **Languages:** English
- **Supported Use Cases:** Synthetic Text Generation and Accuracy, Text Classification and Nuance, Sentiment Analysis and Nuance Reasoning, Language Modeling, Dialogue Systems, and Code Generation.

### Performance and Cost Trade-offs

The table below compares the model performance on the Massive Multitask Language Understanding (MMLU) benchmark and their on-demand pricing on Amazon Bedrock.

| Model           | MMLU Score | Price per 1,000 Input Tokens | Price per 1,000 Output Tokens |
|-----------------|------------|------------------------------|-------------------------------|
| Llama 3 8B | 68.4%      | \$0.0004                   | \$0.0006                    |
| Llama 3 70B | 82.0%      | \$0.00265                   | \$0.0035                     |
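
To make the trade-off concrete, the prices in the table can be turned into a per-request estimate. The sketch below is only an illustration: the function name `estimate_cost` and the token counts in the example are assumptions, and the prices are the ones listed above, which may change over time.

```python
# On-demand prices in USD per 1,000 tokens, copied from the table above.
PRICES_PER_1K = {
    "Llama 3 8B": {"input": 0.0004, "output": 0.0006},
    "Llama 3 70B": {"input": 0.00265, "output": 0.0035},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated on-demand cost in USD for a single request."""
    price = PRICES_PER_1K[model]
    input_cost = (input_tokens / 1000) * price["input"]
    output_cost = (output_tokens / 1000) * price["output"]
    return input_cost + output_cost


# Hypothetical request: a 3,000-token prompt (schema plus question)
# and a 500-token response.
for model_name in PRICES_PER_1K:
    cost = estimate_cost(model_name, input_tokens=3000, output_tokens=500)
    print(f"{model_name}: {cost:.5f} USD per request")
```

Because text-to-SQL prompts carry the table schema on every call, input tokens usually dominate the cost, so the input price is the figure to watch when choosing between the two models.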

For more information, refer to the following links:

1. [Llama 3 8B Model Cards and Prompt Formats](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3)
2. [Amazon Bedrock Pricing Page](https://aws.amazon.com/bedrock/pricing/)


## The Approach to the Text-to-SQL Problem
This notebook covers the following approaches

### Few-shot GTL Prompt for text-to-SQL 
Few-shot text-to-SQL is an approach for querying databases by translating natural language questions into SQL queries, using only a few examples for in-context learning of industry topics by the Large Language Model.

The following information is provided to the model as part of the prompt:

- Model persona
- Sample dataset
- Selected features and their descriptions
- Label column for the model to focus on during analysis

For more complex questions whose queries span multiple columns, add more features to the prompt. This helps the model stay consistently accurate and produce output in business-specific language.
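
As a rough illustration of how these parts could be combined, the sketch below assembles them into one prompt string. It is not the exact template from the GTL paper or from this notebook: the function `build_gtl_prompt`, the section wording, and the sample row are all assumptions made for the example. Only the column names `name` and `totalexpenseratio` come from the ETF table used later.

```python
from typing import Dict, List


def build_gtl_prompt(
    persona: str,
    sample_rows: List[Dict[str, object]],
    feature_descriptions: Dict[str, str],
    label_column: str,
    question: str,
) -> str:
    """Assemble persona, sample data, feature descriptions and label column into one prompt."""
    if label_column not in feature_descriptions:
        raise ValueError(f"Label column '{label_column}' has no description.")

    # Render each sample row as "column is value" pairs on one line
    rows = "\n".join(
        "; ".join(f"{col} is {val}" for col, val in row.items()) for row in sample_rows
    )
    # Render one bullet per described feature
    features = "\n".join(
        f"- {col}: {desc}" for col, desc in feature_descriptions.items()
    )

    return (
        f"{persona}\n\n"
        f"Sample data:\n{rows}\n\n"
        f"Feature descriptions:\n{features}\n\n"
        f"Label column to focus on: {label_column}\n\n"
        f"Question: {question}"
    )


prompt = build_gtl_prompt(
    persona="You are a financial analyst who specializes in ETFs.",
    sample_rows=[{"name": "Example Fund A", "totalexpenseratio": 0.002}],  # made-up row
    feature_descriptions={
        "name": "Name of the fund",
        "totalexpenseratio": "Annual cost of holding the fund, as a fraction of assets",
    },
    label_column="totalexpenseratio",
    question="Which funds are the 5 least expensive?",
)
print(prompt)
```

Adding a feature for a more complex question then amounts to adding one entry to `feature_descriptions` and, optionally, the matching column to the sample rows.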

Reference (Generative Tabular Learning): https://arxiv.org/abs/2305.12586

---

##### **Structure of Prompt**

![GTL Prompt Structure](llama-gtl.drawio.png)

## Objectives

This notebook provides code snippets that implement two different approaches to converting a natural language question into a SQL query that answers it:
- Few-shot prompting
- Few-shot prompting with GTL

---

### Tools

+ The AWS SDK for Python, [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html), is used to submit API calls to [Amazon Bedrock](https://aws.amazon.com/bedrock/).

+ [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction/) is a framework that provides off-the-shelf components to make it easier to build applications with large language models. Official LangChain libraries are available for Python and JavaScript/TypeScript. In this notebook, LangChain is used to build a prompt template.

+ [Amazon Athena](https://aws.amazon.com/athena/) is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena is built on open-source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.

+ [ETFs and Mutual Funds composition & yield metrics](https://www.kaggle.com/datasets/andrezaza/large-etfs-with-size-allocations-and-exposures?resource=download) dataset from Kaggle.

---

## Pre-requisites:

1. Use one of the following kernels: `conda_python3`, `conda_pytorch_p310`, or `conda_tensorflow2_p310`.
2. Install the required packages.
3. Make sure you have access to the LLM API.

### Amazon Bedrock Deployment

In this notebook, Llama 3 70B is used. When the notebook is deployed through our AWS CloudFormation template, it is granted the IAM permissions required to send API requests to Amazon Bedrock.

Refer [here](https://aws.amazon.com/blogs/aws/metas-llama-3-models-are-now-available-in-amazon-bedrock/) for details on how Amazon Bedrock provides access to Meta’s Llama 3.

### SageMaker Deployment

#### Changing instance type
---
Models are supported on the following instance types:

- Llama3 8B Text Generation: `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`, `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, and `ml.p4d.24xlarge`
- Llama3 70B Text Generation: `ml.g5.48xlarge` and `ml.p4d.24xlarge`

By default, the JumpStartModel class selects a default instance type available in your region. If you would like to use a different instance type, you can do so by specifying instance type in the JumpStartModel class.

`my_model = JumpStartModel(model_id=model_id, instance_type="ml.g5.12xlarge")`
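
Before deploying, it can help to check a requested instance type against the supported list so that a mismatch is caught early. The following is a minimal sketch that runs without the SageMaker SDK. The helper `is_supported` is hypothetical, and mapping the instruct model IDs used later in this notebook to the lists above is an assumption.

```python
from typing import Dict, List

# Supported instance types, copied from the list above.
# Assumption: the instruct model IDs used later in this notebook share these lists.
SUPPORTED_INSTANCE_TYPES: Dict[str, List[str]] = {
    "meta-textgeneration-llama-3-8b-instruct": [
        "ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge", "ml.g5.12xlarge",
        "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge",
    ],
    "meta-textgeneration-llama-3-70b-instruct": [
        "ml.g5.48xlarge", "ml.p4d.24xlarge",
    ],
}


def is_supported(model_id: str, instance_type: str) -> bool:
    """Return True if the instance type is listed for the given model ID."""
    return instance_type in SUPPORTED_INSTANCE_TYPES.get(model_id, [])


# The 70B model is not listed for ml.g5.12xlarge, so this prints False
print(is_supported("meta-textgeneration-llama-3-70b-instruct", "ml.g5.12xlarge"))
# The 8B model is listed for ml.g5.12xlarge, so this prints True
print(is_supported("meta-textgeneration-llama-3-8b-instruct", "ml.g5.12xlarge"))
```

If the check passes, the instance type can be handed to `JumpStartModel` as shown in the line above.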

---

## Getting Started

### Step 0: Select Model Hosting Service

Here, you can select to run this notebook using SageMaker JumpStart or Amazon Bedrock.

In [3]:
def ask_for_service():
    service = input("Do you want to run the LLM for this notebook using Amazon Bedrock (B) or Amazon SageMaker JumpStart (S)? (default: B) ").strip().upper()
    if service in ['S', 'SAGEMAKER']:
        return 'Amazon SageMaker'
    elif service in ['B', 'BEDROCK', '']:
        return 'Amazon Bedrock'
    else:
        print("Invalid input. Using Amazon Bedrock by default.")
        return 'Amazon Bedrock'

# Call the function and get the selected service
llm_selected_service = ask_for_service()

# Print the selected service
print(f"You have chosen to run the LLM for this notebook using {llm_selected_service}.")

Do you want to run the LLM for this notebook using Amazon Bedrock (B) or Amazon SageMaker JumpStart (S)? (default: B)  B


You have chosen to run the LLM for this notebook using Amazon Bedrock.


In [4]:
def ask_for_service():
    service = input("Do you want to run the Embedding for this notebook using Amazon Bedrock (B) or Amazon SageMaker JumpStart (S)? (default: B) ").strip().upper()
    if service in ['S', 'SAGEMAKER']:
        return 'Amazon SageMaker'
    elif service in ['B', 'BEDROCK', '']:
        return 'Amazon Bedrock'
    else:
        print("Invalid input. Using Amazon Bedrock by default.")
        return 'Amazon Bedrock'

# Call the function and get the selected service
embedding_selected_service = ask_for_service()

# Print the selected service
print(f"You have chosen to run the Embedding for this notebook using {embedding_selected_service}.")

Do you want to run the Embedding for this notebook using Amazon Bedrock (B) or Amazon SageMaker JumpStart (S)? (default: B)  B


You have chosen to run the Embedding for this notebook using Amazon Bedrock.


In [5]:
if llm_selected_service == 'Amazon SageMaker':
    # Import the JumpStartModel class from the SageMaker JumpStart library
    from sagemaker.jumpstart.model import JumpStartModel

    # Specify the model IDs for the Meta Llama 3 Instruct models on SageMaker JumpStart
    llama3_8b_id = "meta-textgeneration-llama-3-8b-instruct"
    llama3_70b_id = "meta-textgeneration-llama-3-70b-instruct"
    DEFULT_LLM_MODEL_ID = llama3_70b_id
    if DEFULT_LLM_MODEL_ID == llama3_70b_id:
        instance_type = "ml.g5.48xlarge"
    else:
        instance_type = "ml.g5.12xlarge"
    model = JumpStartModel(model_id=DEFULT_LLM_MODEL_ID, instance_type=instance_type)
    llm_predictor = model.deploy(accept_eula=True)
    print(f"LLM SageMaker Endpoint Name: {llm_predictor.endpoint_name}")
else:
    llm_predictor = None
    llama3_8b_id = "meta.llama3-8b-instruct-v1:0"
    llama3_70b_id = "meta.llama3-70b-instruct-v1:0"
    DEFULT_LLM_MODEL_ID = llama3_70b_id
    DEFAULT_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

if embedding_selected_service == 'Amazon SageMaker':
    # Import the JumpStartModel class from the SageMaker JumpStart library
    from sagemaker.jumpstart.model import JumpStartModel

    # Deploy BGE Large En embedding model on Amazon SageMaker JumpStart:
    # Specify the model ID for the HuggingFace BGE Large EN Embedding model
    DEFAULT_EMBEDDING_MODEL_ID = "huggingface-sentencesimilarity-bge-large-en"
    text_embedding_model = JumpStartModel(model_id=DEFAULT_EMBEDDING_MODEL_ID)
    embedding_predictor = text_embedding_model.deploy()
    print(f"Embedding SageMaker Endpoint Name: {embedding_predictor.endpoint_name}")
else:
    embedding_predictor = None
    DEFAULT_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

### Step 1: Install Dependencies

Here, we will install all the required dependencies to run this notebook.

In [6]:
!pip install boto3==1.34.127 -qU --force --quiet --no-warn-conflicts
!pip install botocore==1.34.144 -qU --force --quiet --no-warn-conflicts
!pip install langchain==0.2.5 -qU --force --quiet --no-warn-conflicts
!pip install numpy==1.26.4 -qU --force --quiet --no-warn-conflicts

**Note:** *When installing libraries with pip, you may encounter errors or warnings during the installation process. These are generally not critical and can be safely ignored. However, after installing the libraries, it is recommended to restart the kernel or computing environment you are working in. Restarting the kernel ensures that the newly installed libraries are loaded properly and available for use in your code or workflow.*

#### Now let's import the modules required to run the notebook

In [7]:
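import json
import re
import time
from typing import Any, Dict, List

import boto3
import yaml
# PromptTemplate lives in langchain_core; importing it from the langchain root is deprecated
from langchain_core.prompts import PromptTemplate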

In [8]:
# Setup Bedrock Client
bedrock_client = boto3.client(
    service_name='bedrock-runtime'
)

### Step 2: Database Connection

Here we connect to Amazon Athena and query an Athena table.

First, establish the database connection.

In [9]:
# Configuration values (replace with your own Region, S3 bucket, and workgroup)
import boto3
awsRegion = 'us-east-1'
s3bucketPrefix = 'textract-console-us-east-1-659785f3-fe94-4ad8-90cf-51c6b962df69'
athenaworkgroup = 'primary'

# Create Athena Boto3 client
athenaClient = boto3.client('athena', region_name=awsRegion)

In [10]:
DBName = 'default'
athenaS3Staging = 's3://' + s3bucketPrefix + '/'
athenaWorkgroup = athenaworkgroup
RESULT_OUTPUT_LOCATION = athenaS3Staging

#### Load table schema settings

In [11]:
def load_settings(file_path):
    """
    Reads a YAML file and returns its contents as a Python object.

    Args:
        file_path (str): The path to the YAML file.

    Returns:
        obj: The contents of the YAML file as a Python object.
    """
    try:
        with open(file_path, 'r') as file:
            data = yaml.safe_load(file)
        return data
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' does not exist.")
    except yaml.YAMLError as exc:
        print(f"Error: Failed to parse the YAML file '{file_path}': {exc}")

### Step 3: Provide schema information for Amazon Athena Table

In [12]:
# Load table settings
settings_etf = load_settings('etftable.yml')
table_etf = settings_etf['table_name']
table_schema_etf = settings_etf['table_schema']
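
Because `load_settings` prints a message and returns `None` when the file is missing or malformed, the lookups above would then fail with a less helpful `TypeError`. A small validation step, sketched below, surfaces the problem earlier. The helper `validate_settings` and the example values are hypothetical; only the key names `table_name` and `table_schema` come from the cell above, and the schema text shown is a made-up placeholder rather than the real content of `etftable.yml`.

```python
from typing import Any, Dict

# Keys that the notebook reads from the settings file
REQUIRED_KEYS = ("table_name", "table_schema")


def validate_settings(settings: Any) -> Dict[str, str]:
    """Check that loaded settings are a mapping containing the required keys."""
    if not isinstance(settings, dict):
        raise ValueError("Settings are missing, empty, or not a mapping.")
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError(f"Settings are missing keys: {', '.join(missing)}")
    return settings


# Hypothetical content, shaped like what yaml.safe_load could return for etftable.yml
example_settings = {
    "table_name": "etftable",
    "table_schema": "isin string, name string, totalexpenseratio double",
}

checked = validate_settings(example_settings)
print(checked["table_name"])
```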

### Step 4: Create helper functions

To improve the usability and readability of the SQL query analysis produced by Llama 3, we have developed a suite of helper functions tailored to various use cases.

The `format_instructions` function formats the input for Llama 3 models, turning a list of messages with roles such as `system` and `user` into a single prompt string. For more details about Llama 3 prompt formats, click [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).

In [13]:
def format_instructions(instructions: List[Dict[str, str]]) -> str:
    """Format a list of role/content messages into a single Llama 3 prompt string.

    Only the 'system' and 'user' roles are accepted. The prompt ends with an open
    assistant header so that the model generates the assistant turn.
    """
    prompt: List[str] = []
    for instruction in instructions:
        if instruction["role"] == "system":
            prompt.extend(["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n", (instruction["content"]).strip(), "<|eot_id|>"])
        elif instruction["role"] == "user":
            prompt.extend(["<|start_header_id|>user<|end_header_id|>\n", (instruction["content"]).strip(), "<|eot_id|>"])
        else:
            raise ValueError(f"Invalid role: {instruction['role']}. Role must be either 'user' or 'system'.")
    prompt.extend(["<|start_header_id|>assistant<|end_header_id|>\n"])
    return "".join(prompt)
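
Few-shot prompting needs worked examples, which in the Llama 3 chat format are usually expressed as earlier `user`/`assistant` turns. The function above rejects the `assistant` role, so the sketch below shows one possible variant that accepts it. The name `format_few_shot` is hypothetical, and always emitting `<|begin_of_text|>` first (even without a system message) is an assumption that differs from `format_instructions`.

```python
from typing import Dict, List

# Header template in the same style as format_instructions
HEADER = "<|start_header_id|>{role}<|end_header_id|>\n"
ALLOWED_ROLES = {"system", "user", "assistant"}


def format_few_shot(instructions: List[Dict[str, str]]) -> str:
    """Format system/user/assistant turns, treating assistant turns as worked examples."""
    parts: List[str] = ["<|begin_of_text|>"]
    for turn in instructions:
        if turn["role"] not in ALLOWED_ROLES:
            raise ValueError(f"Invalid role: {turn['role']}")
        parts.extend(
            [HEADER.format(role=turn["role"]), turn["content"].strip(), "<|eot_id|>"]
        )
    # Leave the prompt open on an assistant header so the model writes the next turn
    parts.append(HEADER.format(role="assistant"))
    return "".join(parts)


few_shot_prompt = format_few_shot([
    {"role": "system", "content": "You are an ANSI SQL expert. Return only SQL inside <sql></sql>."},
    # One worked example: a question and the SQL we would like to see
    {"role": "user", "content": "Which funds have exposure to Russian markets?"},
    {"role": "assistant", "content": "<sql>SELECT name FROM etftable WHERE exposurecountry_russia > 0;</sql>"},
    # The real question is filled in later, as with the other templates
    {"role": "user", "content": "{question}"},
])
print(few_shot_prompt)
```

The `{question}` placeholder is left in place so that the resulting string can be used as a template in the same way as the prompts built later in this notebook.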

The `execute_query` function will execute SQL queries, typically for retrieving data from a database, and format the results as a string for further processing or display. 

In [14]:
def execute_query(query: str) -> str:
    """Execute an SQL query on the database connection and return the results as a string.

    Args:
        query (str): SQL query to execute

    Returns:
        str: A formatted string containing the SQL results.
    """
    # Start the query in Athena
    responseExec = athenaClient.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DBName},
        WorkGroup=athenaWorkgroup,
        ResultConfiguration={"OutputLocation": RESULT_OUTPUT_LOCATION}
    )
    execID = responseExec["QueryExecutionId"]

    # Poll every 2 seconds (for up to 150 seconds) until the query reaches a terminal state
    state = "QUEUED"
    max_execution = 75
    while max_execution > 0:
        max_execution -= 1
        responseCheck = athenaClient.get_query_execution(QueryExecutionId=execID)
        state = responseCheck["QueryExecution"]["Status"]["State"]
        if state not in ["RUNNING", "QUEUED"]:
            break
        time.sleep(2)

    # Stop with a clear error if the query failed, was cancelled, or did not finish in time
    if state != "SUCCEEDED":
        reason = responseCheck["QueryExecution"]["Status"].get("StateChangeReason", "")
        raise RuntimeError(f"Athena query ended in state {state}. {reason}".strip())

    # Get Query Results
    responseResults = athenaClient.get_query_results(
        QueryExecutionId=execID
    )
    result_rows = responseResults['ResultSet']['Rows']

    # Convert result to string with newline between rows
    output_text = '\n'.join([str(x) for x in result_rows])
    return output_text
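
`execute_query` returns each Athena row as the text of a nested dictionary, which is verbose and costs tokens when passed back to the model. One option is to convert the rows into plain records first. The sketch below assumes the row shape visible in the sample output further down (a `Data` list of `VarCharValue` cells, with the column names in the first row). The helper `rows_to_records` and the sample values are made up for illustration.

```python
from typing import Dict, List, Optional


def rows_to_records(result_rows: List[dict]) -> List[Dict[str, Optional[str]]]:
    """Convert Athena result rows into a list of {column: value} dictionaries."""
    if not result_rows:
        return []
    # The first row of a SELECT result holds the column names
    header = [cell.get("VarCharValue") for cell in result_rows[0]["Data"]]
    records = []
    for row in result_rows[1:]:
        # Cells without a VarCharValue key are treated as NULL (None)
        values = [cell.get("VarCharValue") for cell in row["Data"]]
        records.append(dict(zip(header, values)))
    return records


# Made-up rows in the same shape as responseResults['ResultSet']['Rows']
sample_rows = [
    {"Data": [{"VarCharValue": "name"}, {"VarCharValue": "totalexpenseratio"}]},
    {"Data": [{"VarCharValue": "Example Fund A"}, {"VarCharValue": "0.002"}]},
    {"Data": [{"VarCharValue": "Example Fund B"}, {}]},
]

for record in rows_to_records(sample_rows):
    print(record)
```

Note that Athena returns every value as a string, so numeric columns would still need converting before any arithmetic.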

Let's check the data in the Amazon Athena table. Here we are using a sample ETF dataset.

In [15]:

sql = """SELECT *
FROM etftable
LIMIT 1;"""

# Execute SQL query
sql_results = execute_query(sql)

# SQL output
print("ETF Table data Sample:\n")

print(sql_results)

ETF Table data Sample:

{'Data': [{'VarCharValue': 'isin'}, {'VarCharValue': 'wkn'}, {'VarCharValue': 'name'}, {'VarCharValue': 'fundprovider'}, {'VarCharValue': 'legalstructure'}, {'VarCharValue': 'quote'}, {'VarCharValue': 'quote52low'}, {'VarCharValue': 'quote52high'}, {'VarCharValue': 'ytdreturncur'}, {'VarCharValue': 'totalexpenseratio'}, {'VarCharValue': 'fiveyearvolatilitycur'}, {'VarCharValue': 'fiveyearreturnperriskcur'}, {'VarCharValue': 'yearvolatilitycur'}, {'VarCharValue': 'distributionpolicy'}, {'VarCharValue': 'fundcurrency'}, {'VarCharValue': 'yeardividendyield'}, {'VarCharValue': 'threemonthreturncur'}, {'VarCharValue': 'monthreturncur'}, {'VarCharValue': 'sixmonthreturncur'}, {'VarCharValue': 'inceptiondate'}, {'VarCharValue': 'threeyearvolatilitycur'}, {'VarCharValue': 'yearreturnperriskcur'}, {'VarCharValue': 'yearreturn2cur'}, {'VarCharValue': 'yearreturn4cur'}, {'VarCharValue': 'replicationmethod'}, {'VarCharValue': 'hassecuritieslending'}, {'VarCharValue': 'ticke

The `sagemaker_chat_completion` function uses the SageMaker Endpoint to invoke the LLMs. The response from the LLM is extracted and returned as text.

In [16]:
def sagemaker_chat_completion(
    prompt: str,
    max_gen_len: int = 512,
    temperature: float = 0.5,
    top_p: float = 0.999
) -> str:
    """
    Generates a chat completion from a prompt using the llama3 model via Amazon SageMaker JumpStart.

    Args:
        prompt (str): The prompt text to generate completions for.
        max_gen_len (int, optional): The maximum length of the completion.
        temperature (float, optional): Sampling temperature for the model.
        top_p (float, optional): Top p sampling ratio for the model.

    Returns:
        str: The generated text completion.
    """
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_gen_len,
            "temperature": temperature,
            "top_p": top_p,
            "stop": ["<|eot_id|>"]
        }
    }

    # Call the model API to generate the completion
    response = llm_predictor.predict(body)
    completion = response.get('generated_text', '')

    return completion.strip()

The `bedrock_chat_completion` function uses the Bedrock client to invoke the LLMs. The response from the LLM is extracted and returned as text.

In [17]:
def bedrock_chat_completion(
    model_id: str,
    prompt: str,
    max_gen_len: int = 1024,
    temperature: float = 0.5,
    top_p: float = 0.999
) -> str:
    """
    Generates a chat completion from a prompt using the llama3 model via Amazon Bedrock.

    Args:
        model_id (str): The ID of the llama3 model to use for completion.
        prompt (str): The prompt text to generate completions for.
        max_gen_len (int, optional): The maximum length of the completion.
        temperature (float, optional): Sampling temperature for the model.
        top_p (float, optional): Top p sampling ratio for the model.

    Returns:
        str: The generated text completion.
    """
    body = {
        "prompt": prompt,
        "max_gen_len": max_gen_len,
        "temperature": temperature,
        "top_p": top_p,
    }

    accept = "application/json"
    contentType = "application/json"

    # Convert the body dictionary to JSON string and encode it as bytes
    body_json = json.dumps(body)
    body_bytes = body_json.encode('utf-8')

    # Call the model API to generate the completion
    response = bedrock_client.invoke_model(
        body=body_bytes, modelId=model_id, accept=accept, contentType=contentType
    )
    response_body = response["body"].read()
    response_body = json.loads(response_body)
    completion = response_body.get("generation", "")

    return completion.strip()
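
Besides `generation`, the response body for Meta Llama models on Amazon Bedrock also carries token counts and a `stop_reason`, which can reveal when an answer was cut off by `max_gen_len`. The sketch below parses a response body offline. The helper `parse_llama_response` is hypothetical and the sample payload is made up; treat the field names as an assumption to verify against the Bedrock documentation for your model version.

```python
import json
from typing import Tuple


def parse_llama_response(raw_body: bytes) -> Tuple[str, bool]:
    """Return (generated text, truncated flag) from a Llama response body."""
    payload = json.loads(raw_body)
    text = payload.get("generation", "").strip()
    # A stop_reason of "length" means generation hit the max_gen_len limit
    truncated = payload.get("stop_reason") == "length"
    return text, truncated


# Made-up response body for illustration; no call to Bedrock is made here
sample_body = json.dumps({
    "generation": " <sql>SELECT name FROM etftable LIMIT 5;</sql>",
    "prompt_token_count": 42,
    "generation_token_count": 16,
    "stop_reason": "length",
}).encode("utf-8")

text, truncated = parse_llama_response(sample_body)
print(text)
if truncated:
    print("Warning: the response reached max_gen_len and may be incomplete.")
```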

The `get_llm_sql_analysis` function generates and executes a SQL query for a given question, and returns a comprehensive analysis based on the SQL query results.

In [18]:
def get_llm_sql_analysis(question: str, sql_sys_prompt: str, qna_sys_prompt: str) -> str:
    """
    Generates an SQL query based on the given question, executes it, and returns an analysis of the results using Llama 3.

    Args:
        question (str): The input question for which an SQL query needs to be generated.
        sql_sys_prompt (str): The prompt to be used for generating the SQL query using Llama 3.
        qna_sys_prompt (str): The prompt to be used for analyzing the SQL query results using Llama 3.

    Returns:
        str: The analysis of the SQL query results provided by the language model.
    """
    if llm_selected_service == 'Amazon SageMaker':
        # Generates SQL query
        completion = sagemaker_chat_completion(
            prompt=sql_sys_prompt
        )
    else:
        # Generates SQL query
        completion = bedrock_chat_completion(
            model_id=DEFULT_LLM_MODEL_ID,
            prompt=sql_sys_prompt
        )

    try:
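        # Extract the SQL query (non-greedy, so only the first <sql> block is captured)
        pattern = r"<sql>(.*?)</sql>"
        match = re.search(pattern, completion, re.DOTALL)
        if match is None:
            raise ValueError("The model response did not contain <sql></sql> tags.")
        llm_sql_query = match.group(1).strip()
        print(f"LLM SQL Query: \n{llm_sql_query}")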

        # Execute SQL query
        sql_results = execute_query(llm_sql_query)

        if llm_selected_service == 'Amazon SageMaker':
            # Generates SQL analysis
            llm_sql_analysis = sagemaker_chat_completion(
                prompt=qna_sys_prompt.format(query_results=sql_results, question=question)
            )
        else:
            # Generates SQL analysis
            llm_sql_analysis = bedrock_chat_completion(
                model_id=DEFULT_LLM_MODEL_ID,
                prompt=qna_sys_prompt.format(query_results=sql_results, question=question)
            )

        print(f"LLM SQL Analysis: \n{llm_sql_analysis}")
        return llm_sql_analysis
    except Exception as e:
        # Return the error text so that the function always returns a string
        print(e)
        return str(e)
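
Since the generated SQL is executed directly against Athena, it may be worth checking it before running it. The sketch below extracts the query from the `<sql>` tags and accepts it only if it looks like a read-only statement. The helper `extract_sql` and the "starts with SELECT or WITH" rule are assumptions made for illustration; this is a simple guard, not a full SQL validator.

```python
import re
from typing import Optional

# Non-greedy match so that only the first <sql>...</sql> block is captured
SQL_TAG_PATTERN = re.compile(r"<sql>(.*?)</sql>", re.DOTALL | re.IGNORECASE)
# Statements that only read data start with SELECT or WITH
READ_ONLY_PATTERN = re.compile(r"(select|with)\b", re.IGNORECASE)


def extract_sql(completion: str) -> Optional[str]:
    """Return the first read-only query found between <sql> tags, or None."""
    match = SQL_TAG_PATTERN.search(completion)
    if match is None:
        return None
    query = match.group(1).strip()
    if not READ_ONLY_PATTERN.match(query):
        return None
    return query


print(extract_sql("Here you go: <sql>\nSELECT name FROM etftable LIMIT 5;\n</sql>"))
print(extract_sql("<sql>DROP TABLE etftable;</sql>"))
print(extract_sql("I could not produce a query."))
```

Returning `None` for both a missing block and a rejected statement keeps the calling code simple: it can skip execution and report that no usable query was generated.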

### Create a one-shot Prompt
Here, we design a prompt template that accounts for our question and answer and is formatted correctly for use with Llama 3 models.

First, we create a `system prompt` containing two parts:

1. `table_schema`. This is a description of the structure of the database table, including the name of the table, the names of its columns, and the data type of each column. This information helps Llama 3 understand the organization and contents of the table.

2. `question`. This is the specific request or information that the user wants to obtain from the table.

By including both the table schema and the user's question in the prompt, we give the Llama 3 model a complete understanding of the table structure and the user's desired output.

In [19]:
instructions = [
    {
        "role": "system",
        "content":
        """You are an ANSI SQL query expert whose output is a valid SQL query.

Only use the table described by the following schema:
<table_schema>
{table_schema}
</table_schema>

Please construct a valid ANSI SQL statement to answer the following question. Return only the SQL query between <sql></sql> tags.
"""
    },
    {
        "role": "user",
        "content": "{question}"
    }
]
tmp_sql_sys_prompt = format_instructions(instructions)

Next, we create a second prompt, this time a `user` message, containing two parts:

1. `query_results`. These are the results of executing the SQL query that Llama 3 generated from `tmp_sql_sys_prompt`. This is the raw data that the Llama 3 model will use to generate its analysis.

2. `question`. This specifies the type of analysis or insight that the user wants the Llama 3 model to provide based on the SQL query results.

By combining the SQL query results and the user's question into a single prompt, we give the Llama 3 model all the information it needs to understand the context and provide a comprehensive analysis tailored to the user's request.

In [20]:
instructions = [
    {
        "role": "user",
        "content": """Given the following SQL query results:
{query_results}

And the original question:
{question}

Please provide an analysis and interpretation of the results to answer the original question.
"""
    }
]
QNA_SYS_PROMPT = format_instructions(instructions)

### Execute One-Shot Prompts
The following cells demonstrate different questions asked in natural language and the SQL generated by the LLM. In the model response, the SQL is contained between the `<sql>` and `</sql>` tags.

In [21]:
# Business question
question = "Please provide a list of about 100 ETFs or ETNs names with exposure to US markets"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT name
FROM etftable
WHERE exposurecountry_unitedstates > 0
LIMIT 100;

LLM SQL Analysis: 
After analyzing the provided SQL query results, I can conclude that the list of ETFs/ETNs does not primarily focus on US markets. Instead, it appears to be a comprehensive list of bond ETFs/ETNs with a global scope, covering various regions, currencies, and bond types.

Here are some key observations:

1. **Global coverage**: The list includes ETFs/ETNs tracking bond markets in Europe (e.g., Eurozone, UK), the US, and globally diversified indices.
2. **Bond types**: The list covers a range of bond types, including corporate bonds, government bonds, high-yield bonds, and green bonds.
3. **Currency exposure**: ETFs/ETNs are denominated in various currencies, such as EUR, USD, and GBP, with some offering hedged exposure to mitigate currency risks.
4. **ESG and SRI focus**: A significant portion of the list consists of ETFs/ETNs with an Environmental, Social, and Governance (ESG

In [22]:
# Business question
question = "Which funds have exposure to Russian markets?"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT *
FROM etftable
WHERE exposurecountry_russia > 0;

LLM SQL Analysis: 
Based on the provided SQL query results, we can analyze and interpret the data to answer the original question: "Which funds have exposure to Russian markets?"

**Observations:**

1. The results contain two datasets, each with a list of dictionaries, where each dictionary represents a fund with its corresponding attributes.
2. The attributes include various fund characteristics, such as name, provider, structure, currency, and exposure to different markets, among others.
3. The exposure to Russian markets is indicated by the presence of "Russia" or "Russian" in the `exposuresector` or `exposurecountry` attributes.

**Analysis:**

After examining the data, we can identify the funds with exposure to Russian markets:

1. **HSBC MSCI Russia Capped UCITS ETF USD**: This fund has a clear exposure to Russian markets, as indicated by its name and the presence of "Russia" in the `exposurecountry` attri

###### **The Invocation Latency for the above call is about 8.4 seconds**
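
Latency figures like the one above can be measured by wrapping the call in a timer. The sketch below uses `time.perf_counter` and a stand-in function so that it runs without AWS access. The names `timed_call` and `fake_analysis` are hypothetical; in the notebook, `get_llm_sql_analysis` would take the place of the stand-in.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_analysis(question: str) -> str:
    """Stand-in for get_llm_sql_analysis so that the sketch runs offline."""
    time.sleep(0.05)
    return f"analysis for: {question}"


result, seconds = timed_call(fake_analysis, "Which funds have exposure to Russian markets?")
print(f"Invocation latency: {seconds:.2f} seconds")
```

Note that a figure measured this way covers both model invocations and the Athena query, so it is an end-to-end latency rather than the latency of the model alone.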

In [23]:
# Business question
question = "As a financial analyst Which 5 funds have the most exposure to Indian markets"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
  name, 
  exposurecountry_india
FROM 
  etftable
ORDER BY 
  exposurecountry_india DESC
LIMIT 5;

LLM SQL Analysis: 
Based on the provided SQL query results, we can analyze and interpret the data to answer the original question: "Which 5 funds have the most exposure to Indian markets?"

The results appear to be a list of funds with their corresponding exposure to Indian markets. The exposure values are represented as decimal numbers, likely indicating the percentage of the fund's portfolio invested in Indian markets.

Here's a summary of the results:

1. **Franklin FTSE India UCITS ETF**: 0.9822 (approximately 98.22% exposure to Indian markets)
2. **iShares MSCI India UCITS ETF USD (Acc)**: 0.9801 (approximately 98.01% exposure to Indian markets)
3. **L&G India INR Government Bond UCITS ETF Dist**: 0.6537 (approximately 65.37% exposure to Indian markets)
4. **iShares MSCI Emerging Markets Small Cap UCITS ETF**: 0.2126 (approximately 21.26% exposure to Indian m

###### **The Invocation Latency for the above call is about 5.8 seconds**

### More Complex business question

The following questions require the large language model to understand industry terms and concepts (for example, exchange-traded funds, or ETFs).

In [24]:
# Business question
question = "Which funds are the 5 least expensive"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT *
FROM etftable
ORDER BY totalexpenseratio ASC
LIMIT 5;

LLM SQL Analysis: 
Based on the provided data, I will analyze and interpret the results to answer the original question: "Which funds are the 5 least expensive?"

**Methodology**

To determine the 5 least expensive funds, I will examine the `totalexpenseratio` column, which represents the total expense ratio of each fund. A lower total expense ratio indicates a less expensive fund.

**Results**

After analyzing the data, I have identified the 5 least expensive funds based on their total expense ratios:

1. **CoinShares Physical Staked Algorand** (GB00BNRRF659) - Total Expense Ratio: 0.34
2. **CoinShares Physical Staked Cosmos** (GB00BNRRF980) - Total Expense Ratio: 0.3528
3. **boerse.de Gold ETC** (DE000TMG0LD6) - Total Expense Ratio: 0.0704
4. **CoinShares Physical Ethereum** (GB00BLD4ZM24) - Total Expense Ratio: 0.4726
5. **CoinShares Physical Staked Cardano** (GB00BNRRF659) - Total Expense Ratio: 0.4348

#### **Although the generated SQL uses the correct column (`totalexpenseratio`), the analysis of the SQL results is incorrect: the expense ratio figures are made up and do not appear in the query results.**

![ five least expensive funds using standard text-to-sql ](five-least-exp-standard.png)
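
A problem like this can sometimes be caught automatically by checking whether the numbers quoted in the analysis occur anywhere in the query results. The sketch below is a deliberately simple heuristic with made-up sample data. The helper `ungrounded_numbers` is hypothetical, and it compares decimal numbers as exact text, so a value that the model reformats (for example, 0.9822 written as 98.22%) would be flagged even though it is correct.

```python
import re
from typing import List

# Matches decimal numbers such as 0.34 or 30.44
NUMBER_PATTERN = re.compile(r"\d+\.\d+")


def ungrounded_numbers(analysis: str, query_results: str) -> List[str]:
    """Return decimal numbers quoted in the analysis that never occur in the query results."""
    known = set(NUMBER_PATTERN.findall(query_results))
    quoted = NUMBER_PATTERN.findall(analysis)
    return [value for value in quoted if value not in known]


# Made-up data for illustration
sample_results = "Example Fund A 0.05\nExample Fund B 0.07"
sample_analysis = "Example Fund A charges 0.05, while Example Fund B charges 0.34."

suspicious = ungrounded_numbers(sample_analysis, sample_results)
print(suspicious)
```

An empty list does not prove that the analysis is right, but a non-empty one is a cheap signal that the answer deserves a closer look.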

In [25]:
# Business question
question = "Show the top 5 most expensive funds"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT *
FROM etftable
ORDER BY totalexpenseratio DESC
LIMIT 5;

LLM SQL Analysis: 
After analyzing the provided data, I found that the data contains information about various funds, including their names, ISIN codes, prices, and other characteristics.

To answer the original question, "Show the top 5 most expensive funds," I will extract the relevant information from the data and provide an analysis and interpretation of the results.

**Extracting relevant information**

From the data, I extracted the fund names and their corresponding prices. Here are the top 5 most expensive funds based on their prices:

**Top 5 most expensive funds**

1. **21Shares Binance BNB ETP** - Price: 30.44
2. **21Shares Avalanche ETP** - Price: 18.05
3. **21Shares Algorand ETP** - Price: 9.68
4. **21Shares Cardano ETP** - Price: 22.36
5. **21Shares Bitcoin Cash ETP** - Price: 15.67

**Analysis and interpretation**

The top 5 most expensive funds are all cryptocurrency-related funds offered 

##### **The above analysis uses the relevant expense ratio column from the database. However, the analysis output from the model can be improved (the model is overly focused on the expense ratio column values being the same, as this is already filtered output from the SQL query).**

Let's try additional prompting.

In [26]:
# Business question
question = "Show the top 5 most expensive funds using total expense ratio"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
    name, 
    fundprovider, 
    totalexpenseratio
FROM 
    etftable
ORDER BY 
    totalexpenseratio DESC
LIMIT 5;

LLM SQL Analysis: 
Based on the provided SQL query results, we can analyze and interpret the data to answer the original question: "Show the top 5 most expensive funds using total expense ratio".

From the results, we can see that each row represents a fund, and the columns represent the fund's name, fund provider, and total expense ratio, respectively.

After examining the data, we can conclude that all the funds have the same total expense ratio, which is 0.025. This means that there is no variation in the total expense ratio among the funds, and therefore, we cannot identify a top 5 most expensive funds based on this criterion.

In other words, all the funds have the same expense ratio, which is 0.025, making them equally "expensive" or "inexpensive" in terms of total expense ratio.

If we were to answer the original question, we would have t

##### **Same results as the previous output: the above analysis uses the relevant expense ratio column from the database. However, the analysis output from the model can be improved (the model is overly focused on the expense ratio column values being the same, as this is already filtered output from the SQL query).**

#### Let's try some [Generative Tabular Learning (GTL)](https://arxiv.org/pdf/2310.07338) techniques, which help the model pick up industry knowledge

In [27]:
instructions = [
    {
        "role": "user",
        "content": """Given the following SQL query results:
{query_results}

And the original question:
{question}

You are an expert in Exchange-Traded Funds or ETFs and Exchange-Traded Notes or ETNs .
Based on the features of the funds or notes, please predict how expensive the funds are for investors.
I will supply multiple instances with features and the corresponding label for reference.
Please refer to the table below for detailed descriptions of the features and label:
— feature description —
Features:
isin: International Securities Identification Number
wkn: Wertpapierkennnummer or German securities identification number
name: ETF Name
fundprovider: Financial Company providing the ETF
legalstructure: Exchange Traded Fund (ETF) or Exchange Traded Notes (ETN)
totalexpenseratio: An expense ratio is the cost of owning an ETF or ETN, the management fee paid to the fund company for the benefit of owning the fund, 
paid annually and measured as a percent of your investment in the fund. 0.30 percent means you’ll pay $30 per year for every $10,000 you have invested in the fund.
— label description —
Expensive: Whether the fund is expensive for investors or not. 0 means not expensive, 1 means expensive.
— data —
|isin|wkn|name|fundprovider|legalstructure|totalexpenseratio|Expensive|
|GB00BNRRF659|A3GVCX|CoinShares Physical Staked Cardano|CoinShares|ETN|0.0|0|
|BGPLWIG04173|A2JAHA|Expat Poland WIG20 UCITS ETF|expatcapital|ETF|0.0138|0|
|CH0445689208|A2TT3D|21Shares Crypto Basket Index ETP|21Shares|ETN|0.025|1|
|CH1114873776|A3GSS0|21Shares Solana ETP|21Shares|ETN|0.025|1|
|GB00BNRRF105|A3GY74|CoinShares Physical Staked Algorand|CoinShares|ETN|0.0|<MASK>|
Please use the supplied data to predict the <MASK>. Fund is expensive[1] or not[0]?
Answer: 0
Please provide an analysis and interpretation of the results to answer the original {question}.
"""
    }
]
QNA_SYS_PROMPT = format_instructions(instructions)
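The feature, label, and data sections of the prompt above are written by hand. When the in-context examples change often, the same block could be assembled from Python data structures. The following is a minimal sketch under that assumption: `build_gtl_block` and its parameters are hypothetical names, not part of the notebook, and the example uses only two of the notebook's features to stay short.

```python
DASH = "\u2014"  # the notebook's section markers use this character


def build_gtl_block(feature_docs, label_name, label_doc, examples, masked_row, ask):
    """Assemble the feature/label/data part of a GTL-style prompt.

    feature_docs: dict of column name -> description (order defines the column order)
    examples:     list of dicts holding every feature plus the label
    masked_row:   dict holding the features of the instance to predict
    ask:          closing question, e.g. "Fund is expensive[1] or not[0]?"
    """
    columns = list(feature_docs)
    lines = [f"{DASH} feature description {DASH}", "Features:"]
    lines += [f"{name}: {doc}" for name, doc in feature_docs.items()]
    lines += [f"{DASH} label description {DASH}", f"{label_name}: {label_doc}", f"{DASH} data {DASH}"]
    lines.append("|" + "|".join(columns + [label_name]) + "|")
    for row in examples:
        cells = [str(row[c]) for c in columns] + [str(row[label_name])]
        lines.append("|" + "|".join(cells) + "|")
    lines.append("|" + "|".join([str(masked_row[c]) for c in columns] + ["<MASK>"]) + "|")
    lines.append(f"Please use the supplied data to predict the <MASK>. {ask}")
    return "\n".join(lines)


block = build_gtl_block(
    feature_docs={"name": "ETF Name", "totalexpenseratio": "Annual cost of owning the fund"},
    label_name="Expensive",
    label_doc="0 means not expensive, 1 means expensive.",
    examples=[
        {"name": "21Shares Solana ETP", "totalexpenseratio": 0.025, "Expensive": 1},
        {"name": "CoinShares Physical Staked Cardano", "totalexpenseratio": 0.0, "Expensive": 0},
    ],
    masked_row={"name": "CoinShares Physical Staked Algorand", "totalexpenseratio": 0.0},
    ask="Fund is expensive[1] or not[0]?",
)
print(block)
```

The returned text contains no curly braces, so it could be concatenated into a template that is later filled with `{query_results}` and `{question}`, provided the feature descriptions themselves contain no braces.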

Let's try the same prompt from above.

In [28]:
# Business question
question = "Show the top 5 most expensive funds"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT * 
FROM etftable 
ORDER BY totalexpenseratio DESC 
LIMIT 5;

LLM SQL Analysis: 
Based on the provided data, I analyzed the features of the funds and predicted the label for the masked instance. 

Here's the analysis:

The most important feature in determining the expensiveness of a fund is the `totalexpenseratio`. A higher expense ratio indicates that the fund is more expensive for investors.

After analyzing the data, I found that the funds with higher `totalexpenseratio` values are more likely to be labeled as "Expensive" (1).

Here are the top 5 most expensive funds based on the `totalexpenseratio`:

1. `CH1135202088` - 21Shares Algorand ETP - `totalexpenseratio`: 0.3049
2. `CH1146882316` - 21Shares Cardano ETP - `totalexpenseratio`: 0.3896
3. `CH0475552201` - 21Shares Bitcoin Cash ETP - `totalexpenseratio`: 0.3703
4. `CH1114873776` - 21Shares Solana ETP - `totalexpenseratio`: 0.025
5. `CH0445689208` - 21Shares Crypto Basket Index ETP - `totalexpenseratio`: 0

##### As you can see, prompt engineering that lets the model learn more about the dataset in context **generates a much better analysis** (including the logic of the analysis, drawn from the data provided in the prompt).

### Now let's try a complex business question that requires **two columns** from the tabular data

In [29]:
# Business question
question = "Name the least expensive 5 funds which also yield dividends"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
    name, 
    totalexpenseratio, 
    yeardividendyield
FROM 
    etftable
WHERE 
    yeardividendyield > 0
ORDER BY 
    totalexpenseratio ASC
LIMIT 5;

LLM SQL Analysis: 
Based on the provided data, I will analyze and interpret the results to answer the original question.

First, let's identify the funds that yield dividends. From the original query results, we can see that the funds with a non-zero `yeardividendyield` value are:

1. Lyxor Core Morningstar UK NT (DR) UCITS ETF - Dist
2. Lyxor Core Morningstar US Equity (DR) UCITS ETF - Dist
3. Invesco EURO STOXX 50 UCITS ETF Dist
4. Lyxor Core FTSE Actuaries UK Gilts (DR) UCITS ETF - Dist
5. HSBC EURO STOXX 50 UCITS ETF EUR

Now, let's analyze the `totalexpenseratio` values for these funds to determine the least expensive ones:

1. Lyxor Core Morningstar UK NT (DR) UCITS ETF - Dist: 4.0E-4 = 0.04%
2. Lyxor Core Morningstar US Equity (DR) UCITS ETF - Dist: 4.0E-4 = 0.04%
3. Invesco EURO STOXX 50 UCITS ETF Dis

##### As you can see, prompt engineering that lets the model learn more about the dataset in context **helps analysis with two columns** (and we don't have to provide more features as long as one of the provided features is still in use).

### Now let's try a complex business question that requires **three columns** from the tabular data

In [30]:
# Business question
question = "Name the least expensive 5 funds which yield dividends and have the best 5 year return"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
    name, 
    yeardividendyield, 
    fiveyearreturncur, 
    totalexpenseratio
FROM 
    etftable
WHERE 
    yeardividendyield > 0
ORDER BY 
    totalexpenseratio ASC, 
    fiveyearreturncur DESC
LIMIT 5;

LLM SQL Analysis: 
Based on the provided data, I will analyze the features of the funds and predict the expense ratio for each fund. Then, I will answer the original question.

**Analysis of the Data**

From the provided data, we have the following features:

1. `name`: The name of the ETF or ETN.
2. `totalexpenseratio`: The expense ratio of the fund, which represents the cost of owning the fund as a percentage of the investment.

From the original question, we are interested in funds that yield dividends and have a good 5-year return. We can use the `yeardividendyield` and `fiveyearreturncur` features to filter the funds.

**Prediction of Expense Ratio**

Based on the `totalexpenseratio` feature, we can predict whether a fund is expensive or not. A low exp

##### **Incorrect analysis**. The model drifts into predicting the expense label instead of answering the question. We appear to have hit the limit of the existing features and label in the in-context learning prompt, and we need to provide more features for the model to focus on.

In [31]:
instructions = [
    {
        "role": "user",
        "content": """Given the following SQL query results:
{query_results}

And the original question:
{question}

You are an expert in Exchange-Traded Funds or ETFs and Exchange-Traded Notes or ETNs .
Based on the features of the funds or notes, please predict best funds for investors to invest in.
I will supply multiple instances with features and the corresponding label for reference.
Please refer to the table below for detailed descriptions of the features and label:
— feature description —
Features:
isin: International Securities Identification Number
wkn: Wertpapierkennnummer or German securities identification number
name: ETF Name
fundprovider: Financial Company providing the ETF
legalstructure: Exchange Traded Fund (ETF) or Exchange Traded Notes (ETN)
yeardividendyield: Yearly Dividend yield as a percentage of total investment
fiveyearreturncur: Returns over past 5 year period as a percentage of investment
totalexpenseratio: An expense ratio is the cost of owning an ETF or ETN, the management fee paid to the fund company for the benefit of owning the fund, 
paid annually and measured as a percent of your investment in the fund. 0.30 percent means you’ll pay $30 per year for every $10,000 you have invested in the fund.
— label description —
GoodInvestment: The fund has lowest totalexpenseratio, high fiveyearreturncur and non-zero yeardividendyield. 0 means not GoodInvestment, 1 means GoodInvestment.
— data —
|isin|name|wkn|fundprovider|legalstructure|yeardividendyield|fiveyearreturncur|totalexpenseratio|GoodInvestment|
|LU1781541096|Lyxor Core Morningstar UK NT (DR) UCITS ETF - Dist|LYX0YA|Lyxor ETF|ETF|0.0379|0.2296|4.0E-4|1|
|LU1781540957|Lyxor Core Morningstar US Equity (DR) UCITS ETF - Dist|LYX0YB|Lyxor ETF|ETF|0.0196|0.7337|4.0E-4|1|
|IE00B5B5TG76|Invesco EURO STOXX 50 UCITS ETF Dist|A0YESX|Invesco|ETF|0.0301|0.3838|5.0E-4|0|
|LU1407892592|Lyxor Core FTSE Actuaries UK Gilts (DR) UCITS ETF - Dist|LYX0VW|Lyxor ETF|ETF|0.0187|-0.1288|5.0E-4|1|
|IE00B4K6B022|HSBC EURO STOXX 50 UCITS ETF EUR|A0YF4H|HSBC ETF|ETF|0.0297|0.3927|5.0E-4|1|
|LU1781541096|Lyxor Core Morningstar UK NT (DR) UCITS ETF - Dist|LYX0YA|Lyxor ETF|ETF|0.0379|0.2296|4.0E-4|<MASK>|
Please use the supplied data to predict the <MASK>. Fund is a GoodInvestment[1] or not[0]?
Answer: 1
Please provide an analysis and interpretation of the results to answer the original {question}.
"""
    }
]
QNA_SYS_PROMPT = format_instructions(instructions)

### Now let's try the complex business question that requires **three columns** from the tabular data again

In [32]:
# Business question
question = "Name the least expensive 5 funds which yield dividends and have the best 5 year return"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
  name, 
  yeardividendyield, 
  fiveyearreturncur, 
  totalexpenseratio
FROM 
  etftable
WHERE 
  yeardividendyield > 0
ORDER BY 
  totalexpenseratio ASC, 
  fiveyearreturncur DESC
LIMIT 5;

LLM SQL Analysis: 
Based on the provided data, I will analyze and interpret the results to answer the original question.

**Filtering Criteria:**

1. Least expensive funds ( lowest `totalexpenseratio` )
2. Yield dividends ( non-zero `yeardividendyield` )
3. Best 5-year return ( highest `fiveyearreturncur` )

**Sorted Data:**

Here is the sorted data based on the filtering criteria:

| name | yeardividendyield | fiveyearreturncur | totalexpenseratio |
| --- | --- | --- | --- |
| Lyxor Core Morningstar US Equity (DR) UCITS ETF - Dist | 0.0196 | 0.7337 | 4.0E-4 |
| Invesco S&P 500 UCITS ETF Dist | 0.0154 | 0.8126 | 5.0E-4 |
| HSBC EURO STOXX 50 UCITS ETF EUR | 0.0297 | 0.3927 | 5.0E-4 |
| Lyxor Core Morningstar UK NT (DR) UCITS ETF - Dist | 0.0379 | 0.2296 | 4.0E-4 |
| Invesc

##### **Correct SQL and correct analysis.**

### Now let's try a different business question that requires **three columns** from the tabular data, using the same in-context features and labels
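Judging a generated query by eye can be backed up with a cheap mechanical check before it is sent to the database. The sketch below asks SQLite to compile the query against an empty table that has the expected column names; a misspelled or invented column makes compilation fail. This is an illustration only: `sql_compiles` is a hypothetical helper, SQLite stands in for whatever engine the notebook actually queries, and dialect differences mean a query that compiles here may still fail elsewhere (and vice versa).

```python
import sqlite3


def sql_compiles(sql: str, table: str, columns: list[str]) -> bool:
    """Return True if `sql` compiles against an empty table with the given columns."""
    conn = sqlite3.connect(":memory:")
    try:
        # Typeless columns are enough: only the names matter for this check
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        # EXPLAIN QUERY PLAN compiles the statement without reading any data
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


columns = ["name", "yeardividendyield", "fiveyearreturncur", "totalexpenseratio"]
generated_sql = """
SELECT name, yeardividendyield, fiveyearreturncur, totalexpenseratio
FROM etftable
WHERE yeardividendyield > 0
ORDER BY totalexpenseratio ASC, fiveyearreturncur DESC
LIMIT 5;
"""
print(sql_compiles(generated_sql, "etftable", columns))
```

A passing check only shows that the table and column names resolve. Whether the query answers the business question still needs a human or a result-level test.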

In [33]:
# Business question
question = "Name the least expensive 5 funds which yield dividends and have the lowest 5 year return"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
  name, 
  yeardividendyield, 
  fiveyearreturncur, 
  totalexpenseratio
FROM 
  etftable
WHERE 
  yeardividendyield > 0 
  AND fiveyearreturncur IS NOT NULL 
  AND totalexpenseratio IS NOT NULL
ORDER BY 
  totalexpenseratio, 
  fiveyearreturncur
LIMIT 5;

LLM SQL Analysis: 
Based on the provided data, I will analyze and interpret the results to answer the original question.

First, let's identify the funds that yield dividends and have a non-zero `yeardividendyield`. All the funds in the provided data have a non-zero `yeardividendyield`, so we don't need to filter them out.

Next, let's sort the funds by their `totalexpenseratio` in ascending order to find the least expensive funds. The `totalexpenseratio` represents the annual management fee paid to the fund company, so a lower value indicates a less expensive fund.

Here are the sorted funds by their `totalexpenseratio`:

1. Lyxor Core Morningstar UK NT (DR) UCITS ETF - Dist: 4.0E-4
2. Lyxor Core Morningstar

##### **Great SQL and analysis of the data output.**

### More business questions that use multiple data columns and complex analysis of data

In [34]:
# Business question
question = "Name the least risk funds that yields higher dividends and isn't volatile"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
    name, 
    yeardividendyield, 
    yearvolatilitycur, 
    fiveyearreturncur 
FROM 
    etftable 
WHERE 
    yeardividendyield > 3 
    AND yearvolatilitycur < 10 
    AND fiveyearreturncur > 5 
ORDER BY 
    yearvolatilitycur ASC 
LIMIT 10;

LLM SQL Analysis: 
Based on the provided data, I will analyze the features and predict the GoodInvestment label for each fund. Then, I will identify the least risk funds that yield higher dividends and aren't volatile.

**Analysis**

From the data, I observe the following:

1. **Low total expense ratio**: Funds with lower totalexpenseratio are more desirable, as they charge lower management fees. A lower expense ratio indicates that the fund is cheaper to own.
2. **High five-year return**: Funds with higher fiveyearreturncur indicate better performance over the past 5 years.
3. **Non-zero yearly dividend yield**: Funds with non-zero yeardividendyield provide a regular income stream to investors.

**Prediction of GoodIn

##### Although the **SQL** looks at the one-year volatility column, it ignores the five-year and three-year columns. Also, volatility data (five-year, three-year, or one-year) is not available for all funds.
##### **Let's create a prompt whose analysis considers all the volatility columns as well as ETFs with no data.**

In [38]:
instructions = [
    {
        "role": "user",
        "content": """Given the following SQL query results:
{query_results}

And the original question:
{question}

You are an expert in Exchange-Traded Funds or ETFs and Exchange-Traded Notes or ETNs .
Based on the features of the funds or notes, please predict best funds for investors to invest in.
I will supply multiple instances with features and the corresponding label for reference.
Please refer to the table below for detailed descriptions of the features and label:
— feature description —
Features:
isin: International Securities Identification Number
wkn: Wertpapierkennnummer or German securities identification number
name: ETF Name
fundprovider: Financial Company providing the ETF
legalstructure: Exchange Traded Fund (ETF) or Exchange Traded Notes (ETN)
yeardividendyield: Yearly Dividend yield as a percentage of total investment
fiveyearreturncur: Returns over past 5 year period as a percentage of investment
totalexpenseratio: An expense ratio is the cost of owning an ETF or ETN, the management fee paid to the fund company for the benefit of owning the fund, 
paid annually and measured as a percent of your investment in the fund. 0.30 percent means you’ll pay $30 per year for every $10,000 you have invested in the fund.
— label description —
volatile: The fund has low fiveyearvolatilitycur, threeyearvolatilitycur, yearvolatilitycur. 0 means not volatile, 1 means volatile, 2 means cannot be determined.
— data —
|isin|name|fiveyearvolatilitycur|threeyearvolatilitycur|yearvolatilitycur|volatile|
|LU0335044896|Xtrackers II EUR Overnight Rate Swap UCITS ETF 1D|8.0E-4|9.0E-4|0.0011|0|
|FR0010510800|Lyxor Euro Overnight Return UCITS ETF - Acc|8.0E-4|9.0E-4|0.0011|0|
|LU0290358497|Xtrackers II EUR Overnight Rate Swap UCITS ETF 1C|9.0E-4|0.001|0.0013|0|
|BGCROEX03189|Expat Croatia CROBEX UCITS ETF|0.675|0.8606|1.4787|1|
|IE000RN036E0|First Trust Alerian Disruptive Technology Real Estate UCITS ETF Acc||||2|
|GB00BNRRF105|CoinShares Physical Staked Algorand||||2|
|FR0010754200|Amundi ETF Govies 0-6 Months Euro Investment Grade UCITS ETF EUR (C)|0.0014|0.0016|0.0021|<MASK>|
Please use the supplied data to predict the <MASK>. Fund is volatile[1] or not[0] or cannot-be-determined[2]?
Answer: 0
Please provide an analysis and interpretation of the results to answer the original {question}.
"""
    }
]
QNA_SYS_PROMPT = format_instructions(instructions)
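The three-valued label used in the prompt above (0 = not volatile, 1 = volatile, 2 = cannot be determined) could also be derived programmatically when preparing in-context examples. The following sketch is one possible rule, not the notebook's method: the function name, the `threshold` of 0.2, and the choice to return 2 only when all three windows are missing are assumptions.

```python
from typing import Optional


def volatility_label(
    five: Optional[float],
    three: Optional[float],
    one: Optional[float],
    threshold: float = 0.2,  # hypothetical cut-off, tune for the dataset
) -> int:
    """0 = not volatile, 1 = volatile, 2 = cannot be determined (no data in any window)."""
    observed = [v for v in (five, three, one) if v is not None]
    if not observed:
        return 2
    # Treat the fund as volatile if any available window exceeds the cut-off
    return 1 if max(observed) > threshold else 0


# Values taken from the example rows in the prompt
print(volatility_label(8.0e-4, 9.0e-4, 0.0011))   # Xtrackers II EUR Overnight Rate Swap 1D
print(volatility_label(0.675, 0.8606, 1.4787))    # Expat Croatia CROBEX
print(volatility_label(None, None, None))         # CoinShares Physical Staked Algorand
```

With this rule, the three calls reproduce the labels the prompt assigns to those rows (0, 1, and 2 respectively).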

In [39]:
# Business question
question = "Name the least risk funds that yields higher dividends and isn't volatile"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
  name, 
  yeardividendyield, 
  yearvolatilitycur, 
  fiveyearreturnperriskcur
FROM 
  etftable
WHERE 
  yeardividendyield > 0 
  AND yearvolatilitycur < 0.5 
  AND fiveyearreturnperriskcur > 1
ORDER BY 
  yearvolatilitycur ASC, 
  yeardividendyield DESC
LIMIT 10;

LLM SQL Analysis: 
Based on the provided data, I will analyze and interpret the results to answer the original question.

First, let's identify the key features that are relevant to the question:

1. `yeardividendyield`: Yearly Dividend yield as a percentage of total investment
2. `fiveyearreturncur`: Returns over past 5 year period as a percentage of investment
3. `yearvolatilitycur`: Yearly volatility of the fund
4. `volatile`: The fund has low `fiveyearvolatilitycur`, `threeyearvolatilitycur`, `yearvolatilitycur`. 0 means not volatile, 1 means volatile, 2 means cannot be determined.

To answer the question, we need to find the least risk funds that yield higher dividends and aren't volatile.

Fro

##### The SQL did not consider the **three-year** volatility column. Let's try some additional prompting in the question.

In [40]:
# Business question
question = "Name the least risk funds that yields higher dividends and isn't volatile based on five year, three year and one year volatility data"

# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
    question=question,
    table_schema=table_schema_etf
)

results = get_llm_sql_analysis(
    question=question,
    sql_sys_prompt=SQL_SYS_PROMPT,
    qna_sys_prompt=QNA_SYS_PROMPT
)

LLM SQL Query: 

SELECT 
  name, 
  fundprovider, 
  yeardividendyield, 
  fiveyearvolatilitycur, 
  threeyearvolatilitycur, 
  yearvolatilitycur
FROM 
  etftable
WHERE 
  yeardividendyield > 0 
  AND fiveyearvolatilitycur < (SELECT AVG(fiveyearvolatilitycur) FROM etftable) 
  AND threeyearvolatilitycur < (SELECT AVG(threeyearvolatilitycur) FROM etftable) 
  AND yearvolatilitycur < (SELECT AVG(yearvolatilitycur) FROM etftable)
ORDER BY 
  yeardividendyield DESC, 
  fiveyearvolatilitycur, 
  threeyearvolatilitycur, 
  yearvolatilitycur
LIMIT 10;

LLM SQL Analysis: 
Based on the provided data, I will analyze and interpret the results to answer the original question.

**Least Risk Funds with Higher Dividend Yields and Low Volatility**

To identify the least risk funds, I will consider the following criteria:

1. **Low Volatility**: Funds with low five-year, three-year, and one-year volatility (less than 0.1).
2. **Higher Dividend Yields**: Funds with a higher yearly dividend yield (greate

##### **Good SQL** generation and great analysis. We can improve the SQL and analysis further by adding the total expense ratio to the in-context features and label.
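One behavior of the generated query is worth checking: comparisons against `NULL` evaluate to unknown in SQL, so funds with no volatility data silently drop out of a `<` filter instead of being reported as "cannot be determined". The sketch below demonstrates this with an in-memory SQLite table. The three funds and their values are invented for illustration, and SQLite is an assumed stand-in for the notebook's actual query engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etftable (name TEXT, yeardividendyield REAL, "
    "fiveyearvolatilitycur REAL, threeyearvolatilitycur REAL, yearvolatilitycur REAL)"
)
# Hypothetical funds: one calm, one volatile, one without volatility history
conn.executemany(
    "INSERT INTO etftable VALUES (?, ?, ?, ?, ?)",
    [
        ("Calm Fund", 0.03, 0.05, 0.06, 0.07),
        ("Wild Fund", 0.04, 0.60, 0.80, 1.40),
        ("No History Fund", 0.05, None, None, None),
    ],
)

# Same filter shape as the generated query: below-average volatility in every window
below_average = conn.execute(
    """
    SELECT name FROM etftable
    WHERE yeardividendyield > 0
      AND fiveyearvolatilitycur < (SELECT AVG(fiveyearvolatilitycur) FROM etftable)
      AND threeyearvolatilitycur < (SELECT AVG(threeyearvolatilitycur) FROM etftable)
      AND yearvolatilitycur < (SELECT AVG(yearvolatilitycur) FROM etftable)
    """
).fetchall()

# Funds whose volatility cannot be determined need an explicit IS NULL query
no_data = conn.execute(
    """
    SELECT name FROM etftable
    WHERE fiveyearvolatilitycur IS NULL
      AND threeyearvolatilitycur IS NULL
      AND yearvolatilitycur IS NULL
    """
).fetchall()
conn.close()

print(below_average, no_data)
```

Here the fund without history appears only in the second result, even though it has the highest dividend yield. If such funds should be surfaced to the model, the question or the SQL prompt would need to ask for them explicitly.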