## Using an LLM to Answer Questions Based on Provided Sources

Supplying the source text in the prompt grounds the model: it is steered to answer from the provided sources rather than from its own pre-trained knowledge. The workflow has five steps:

1. **Ingesting the Document**: Preprocess the text from the provided sources and divide it into manageable chunks if necessary. This ensures the entire document can be used without exceeding token limits. (The maximum number of tokens an LLM can process in a single input is called the **context length** or **maximum context length**.)

2. **Embedding the Context**: For each question, construct a prompt that embeds the relevant sections of the source document directly within the prompt.

3. **Crafting the Prompt**:
   - Start with an instruction that clearly states the priority of the source: "Based on the provided document, answer the following question:"
   - Insert the relevant chunk of the source text.
   - Follow up with the specific question.

4. **Controlling the Model’s Output**:
   - Use parameters like temperature = 0 to make outputs more deterministic and reduce creative drift. This does not by itself guarantee factual answers.
   - Optionally, add explicit instructions to the prompt to disregard any conflicting information from its pre-trained knowledge base.

5. **Example Prompt Construction**:
   ```markdown
   Based on the provided document, answer the following question:

   Document Excerpt: "The ISCED11_34_44 code refers to 'Upper secondary and post-secondary non-tertiary general programmes'. This classification is part of the International Standard Classification of Education framework."

   Question: What does the code ISCED11_34_44 mean?
   ```

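The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's implementation: `chunk_text` and `build_grounded_prompt` are hypothetical helper names, and chunking by character count is a crude stand-in for chunking by tokens.

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (assumed stand-in for token-based chunking)."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += max_chars - overlap
    return chunks


def build_grounded_prompt(excerpt: str, question: str) -> str:
    """Assemble instruction, source excerpt, and question, in that order."""
    return (
        "Based on the provided document, answer the following question:\n\n"
        f'Document Excerpt: "{excerpt}"\n\n'
        f"Question: {question}"
    )


excerpt = (
    "The ISCED11_34_44 code refers to 'Upper secondary and post-secondary "
    "non-tertiary general programmes'."
)
prompt = build_grounded_prompt(excerpt, "What does the code ISCED11_34_44 mean?")
print(prompt)
```

The overlap between consecutive chunks is a design choice: it lowers the chance that a sentence needed for the answer is cut in half at a chunk boundary.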
In [3]:
# When importing files from a different directory, you need to add the directory to the path
import sys
import os

# Get the current working directory
current_dir = os.getcwd()

# Construct the path to the chat_bot directory
chat_bot_path = os.path.join(current_dir, '..', 'chat_bot')

# Add the chat_bot directory to the Python path
sys.path.append(chat_bot_path)

In [4]:
# Import LLM and SDMX_DataFlow from the chat_bot folder, which is one level above
# the current directory. To do this, the parent directory must be on the path.

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from chat_bot import LLM
from chat_bot import SDMX_DataFlow

## Accessing SDMX information for a chosen DataFlow

In [5]:
fin_perstud = {"agency": "OECD.EDU.IMEP",
"id": "DSD_EAG_UOE_FIN@DF_UOE_INDIC_FIN_PERSTUD",
"version": "1.0",
"name": "Expenditure on educational institutions per full-time equivalent student",
"description": "This dataset contains data on expenditure per full-time equivalent student and per full-time equivalent student as a percentage of GPD per capita. The default table displays data for 2020 in current USD PPP and as a percentage of GDP per capita, from all expenditure sources, and unfiltered by type of expenditure. The selection can be changed to display data: by year, by source of expenditure, by destination of expenditure and by type of expenditure. Please note that some categories are mutually exclusive. </p><p>For more information, please consult <a href=\"https://www.oecd-ilibrary.org/sites/e13bef63-en/1/3/4/index.html?itemId=%20/content/publication/e13bef63-en&_csp_=a4f4b3d408c9dd70d167f10de61b8717&itemIGO=oecd&itemContentType=book\"><i>Education at a Glance 2023</i></a>. Additional details regarding the methodology used, references to the sources, and specific notes for each country can be found in <a href=\"https://www.oecd-ilibrary.org/sites/301fa18e-en/index.html?itemId=/content/component/301fa18e-en\"><i>Education at a Glance 2023 Sources, Methodologies and Technical Notes</i></a>."
}

dataflow_details_url = f'https://sdmx.oecd.org/public/rest/dataflow/{fin_perstud["agency"]}/{fin_perstud["id"]}/{fin_perstud["version"]}?references=all'
dataflow_details_url

'https://sdmx.oecd.org/public/rest/dataflow/OECD.EDU.IMEP/DSD_EAG_UOE_FIN@DF_UOE_INDIC_FIN_PERSTUD/1.0?references=all'

In [6]:
df_info = SDMX_DataFlow.Dataflow(dataflow_details_url)

In [7]:
df_info.populate_variables()

In [8]:
# https://openai.com/api/pricing/
MODEL_PRICING_PER_M_TOKENS = {
    'gpt-4o': {'prompt_tokens': 5.00, 'completion_tokens': 15.00},
    'gpt-4o-2024-08-06': {'prompt_tokens': 2.50, 'completion_tokens': 10.00},
    'gpt-4o-mini': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'gpt-4o-mini-2024-07-18': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'o1-preview': {'prompt_tokens': 15.00, 'completion_tokens': 60.00}
}

Counting tokens with the `tiktoken` library before calling an LLM is useful for several reasons:

1. **Cost Estimation**: LLM providers such as OpenAI charge by the number of tokens processed. `tiktoken` counts the tokens your input will produce, so you can estimate the cost of a call before making it.
2. **Cost Management**: Knowing the token count in advance lets you trim and optimize prompts to stay within budget and avoid unexpected charges.
3. **Staying Within Limits**: You can check that an input fits within the model's context length, preventing truncation or errors during processing.
4. **Transparency**: It shows how your inputs are tokenized, which helps explain why two texts of similar length can differ in cost.


In [9]:
# Print the token count of each instance variable
import tiktoken

# Load the encoding for the GPT model you're using.
# "gpt-4" resolves to the `cl100k_base` encoding (also used by GPT-3.5).
# The gpt-4o models priced above use `o200k_base`, so these counts are approximate for them.
encoding = tiktoken.encoding_for_model("gpt-4")

for var in vars(df_info):
    content = getattr(df_info, var)
    text = str(content)
    # Tokenize the string
    tokens = encoding.encode(text)
    print(f"{var:30} contains #{len(tokens)} tokens")
    

url                            contains #44 tokens
df_details_json                contains #223153 tokens
df_name                        contains #11 tokens
df_description                 contains #274 tokens
df_dimension_names             contains #147 tokens
df_code_names                  contains #1005 tokens


Assuming each answer is always 100 tokens, independent of input size, we can calculate the cost for 1000 calls as follows:

**Model Pricing:**
- **gpt-4o**: $5.00 / 1M input tokens, $15.00 / 1M output tokens
- **gpt-4o-mini**: $0.150 / 1M input tokens, $0.600 / 1M output tokens
- **o1-preview**: $15.00 / 1M input tokens, $60.00 / 1M output tokens

```python
1000 * len(tokens) * MODEL_PRICING_PER_M_TOKENS[model_name]['prompt_tokens'] / 1000000  # Cost of prompt
1000 * 100 * MODEL_PRICING_PER_M_TOKENS[model_name]['completion_tokens'] / 1000000      # Cost of completion
```

In [10]:
# json to text to tokens:
raw_json_text = str(df_info.df_details_json)
raw_json_tokens = encoding.encode(raw_json_text)
parsed_metadata_text = str(df_info.df_code_names)
parsed_metadata_tokens = encoding.encode(parsed_metadata_text)
completion_tokens = 100

for model_name in ("gpt-4o-mini", "gpt-4o", "o1-preview"):
    print(f"Model: {model_name}:")
    raw_prompt_cost = 1000 * len(raw_json_tokens) * MODEL_PRICING_PER_M_TOKENS[model_name]['prompt_tokens'] / 1000000
    parsed_prompt_cost = 1000 * len(parsed_metadata_tokens) * MODEL_PRICING_PER_M_TOKENS[model_name]['prompt_tokens'] / 1000000
    completion_cost = 1000 * completion_tokens * MODEL_PRICING_PER_M_TOKENS[model_name]['completion_tokens'] / 1000000
    print(f"Total cost for 1000 calls on the raw SDMX metadata: ${raw_prompt_cost + completion_cost}")
    print(f"Total cost for 1000 calls on the parsed code names: ${parsed_prompt_cost + completion_cost}")
    print("==========================")

Model: gpt-4o-mini:
Total cost for 1000 calls on the raw SDMX metadata: $33.53295
Total cost for 1000 calls on the parsed code names: $0.21075
Model: gpt-4o:
Total cost for 1000 calls on the raw SDMX metadata: $1117.265
Total cost for 1000 calls on the parsed code names: $6.525
Model: o1-preview:
Total cost for 1000 calls on the raw SDMX metadata: $3353.295
Total cost for 1000 calls on the parsed code names: $21.075


<font color=red>We can spend roughly 160 to 170 times more for the same answer by passing everything to the LLM instead of just selecting relevant data.</font>
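The size of this gap can be checked directly from the token counts and prices listed above. The sketch below hard-codes those numbers so it runs without `tiktoken`; the helper name `cost_for_calls` is an assumption introduced for illustration.

```python
# Prices in USD per 1M tokens, copied from the pricing table above
PRICES = {
    "gpt-4o": {"prompt_tokens": 5.00, "completion_tokens": 15.00},
    "gpt-4o-mini": {"prompt_tokens": 0.150, "completion_tokens": 0.600},
    "o1-preview": {"prompt_tokens": 15.00, "completion_tokens": 60.00},
}

RAW_JSON_TOKENS = 223_153    # df_details_json, as counted above
PARSED_TOKENS = 1_005        # df_code_names, as counted above
COMPLETION_TOKENS = 100      # assumed fixed answer length


def cost_for_calls(prompt_tokens: int, completion_tokens: int, model: str, n_calls: int = 1000) -> float:
    """Total cost in USD for n_calls identical calls (hypothetical helper)."""
    price = PRICES[model]
    per_call = (
        prompt_tokens * price["prompt_tokens"]
        + completion_tokens * price["completion_tokens"]
    ) / 1_000_000
    return n_calls * per_call


for model in PRICES:
    raw = cost_for_calls(RAW_JSON_TOKENS, COMPLETION_TOKENS, model)
    parsed = cost_for_calls(PARSED_TOKENS, COMPLETION_TOKENS, model)
    print(f"{model:12} raw / parsed cost ratio: {raw / parsed:.0f}x")
```

The ratio is lower than the raw token ratio (about 222x) because the fixed completion cost is added to both sides; the cheaper the prompt, the more the completion cost dominates.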

### The Challenge of Too Much Data: Finding What’s Relevant for the LLM

When working with SDMX metadata, there is often **a lot more data than we actually need** to answer specific questions. For example:

- The URL takes up only 44 tokens.
- The name of the dataflow takes up just 11 tokens.
- But the full details of the dataflow (in JSON format) use **223,153 tokens**!

#### The Problem:
If the user only asks for something simple, like the name of the dataflow, we don’t need to send all 223,153 tokens; the 11 tokens of the name are enough. Sending all the data to the LLM creates two big challenges:
1. **Efficiency**: It’s like trying to find a needle in a haystack. The LLM would have to search through tons of unnecessary data to find what matters.
2. **Cost**: Sending large amounts of data to the LLM is expensive. The more tokens we use, the higher the cost.

#### The Solution:
The fix is to **prepare the data before sending it to the LLM**, which is part of what’s called an "agentic approach." This means the LLM receives only the information relevant to the question, rather than everything we have.


In [11]:
import json

interesting_session_variables = ['df_name', 'df_description', 'df_dimension_names', 'df_code_names']

# Create a subset dictionary directly from the dataclass instance
interesting_content = {var_name: getattr(df_info, var_name) for var_name in interesting_session_variables}

# Serialize the subset dictionary to a JSON string
json_str = json.dumps(interesting_content, indent=4)

# Print the JSON string
print(json_str)


{
    "df_name": "Expenditure on educational institutions per full-time equivalent student",
    "df_description": "This dataset contains data on expenditure per full-time equivalent student and per full-time equivalent student as a percentage of GPD per capita. The default table displays data for 2020 in current USD PPP and as a percentage of GDP per capita, from all expenditure sources, and unfiltered by type of expenditure. The selection can be changed to display data: by year, by source of expenditure, by destination of expenditure and by type of expenditure. Please note that some categories are mutually exclusive. </p><p>For more information, please consult <a href=\"https://www.oecd-ilibrary.org/sites/e13bef63-en/1/3/4/index.html?itemId=%20/content/publication/e13bef63-en&_csp_=a4f4b3d408c9dd70d167f10de61b8717&itemIGO=oecd&itemContentType=book\"><i>Education at a Glance 2023</i></a>. Additional details regarding the methodology used, references to the sources, and specific notes 

In [12]:
interesting_content.keys()

dict_keys(['df_name', 'df_description', 'df_dimension_names', 'df_code_names'])

## Agentic approach

In [13]:
#persona = """You are a data analyst working for a government agency. You have been tasked with analyzing the expenditure on educational institutions per full-time equivalent student. You need to understand the dataset's structure, including the dimension names and code names, to effectively filter and analyze the data. Your goal is to provide insights and recommendations based on the dataset to inform policy decisions and resource allocation in the education sector."""
    
def select_info(user_question):
    persona = """
    You are a coordinator helping data analysts at a government agency. 
    You have been asked to help them find the best data source to answer a specific user question."""
    prompt = f"""
    Please select the best information from the sources listed below to answer the user's question. 
    You are only allowed to choose one source. 
    Please respect the formatting! For example: your answer should look like "df_code_names".
    If you are not able to find the answer in the sources listed below, please select "None".

    ### Sources:
    1. [df_name]: Name of the dataset. Usually 5-10 words.
    2. [df_description]: Detailed description of the dataset, including what it contains and how it can be filtered.
    3. [df_dimension_names]: The list of all dimension codes of the DataFlow as keys and their corresponding names in English as values.
    - Example:
        - 'EDUCATION_LEV': 'Education level'
        - 'EXP_SOURCE': 'Financing source'
    4. [df_code_names]: The list of all codes used in the DataFlow as keys and their corresponding names in English as values.
        - Example:
            - 'EXP_SOURCE':
                - 'S13': 'General government'
                - 'S1D_NON_EDU': 'Private sector'

    ### User Question:
    {user_question}
    """
    return LLM.model(prompt, persona)

In [14]:
user_question = "What is the name of the dataset?"
key, cost = select_info(user_question)
print(f"Selected source: {key}, Cost: {cost}")
interesting_content[key]

Selected source: df_name, Cost: 4.65e-05


'Expenditure on educational institutions per full-time equivalent student'

In [15]:
def answer_question(user_question, interesting_content):
    key, cost = select_info(user_question)
    print(f"Selected source: {key}, Cost: {cost}")

    persona = """You are a helpful data analyst working for OECD."""
    prompt = f"""
    Please provide the answer to the following user question: 
    {user_question}
    Please use the information from the source that was selected as the best source to answer the question:
    {interesting_content[key]}
    """

    return LLM.model(prompt, persona)

# The idea works, but not in every scenario

In [16]:
user_question = "What is the name of the dataset?"
answer_question(user_question, interesting_content)

Selected source: df_name, Cost: 4.65e-05


('The name of the dataset is "Expenditure on educational institutions per full-time equivalent student."',
 2.22e-05)

In [17]:
# This call should fail: the LLM cannot decide where to look for the information,
# so it returns "None", which is not a key of interesting_content (KeyError).
user_question = "What is the definition of the education level 'ISCED11_34_44'?"
try:
    answer_question(user_question, interesting_content)
except Exception as e:
    print(f"The call failed with the following error: {e}")

Selected source: None, Cost: 4.74e-05
The call failed with the following error: 'None'
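One way to make this failure explicit, rather than letting it surface as a `KeyError`, is to validate the selector's reply before using it as a dictionary key. The sketch below is an assumption, not part of the notebook: `resolve_source` is a hypothetical helper, and the sample `sources` dictionary only mimics the shape of `interesting_content`.

```python
from typing import Optional


def resolve_source(raw_key: str, sources: dict) -> Optional[str]:
    """Normalize the selector's reply and return it only if it names a known source."""
    # Models sometimes wrap the key in quotes, brackets, or backticks
    cleaned = raw_key.strip().strip("\"'[]`")
    return cleaned if cleaned in sources else None


# Stand-in for interesting_content (values shortened for illustration)
sources = {
    "df_name": "Expenditure on educational institutions per full-time equivalent student",
    "df_code_names": {"EXP_SOURCE": {"S13": "General government"}},
}

for reply in ['"df_name"', "[df_code_names]", "None"]:
    key = resolve_source(reply, sources)
    if key is None:
        print(f"{reply!r}: no suitable source selected, skipping the answer step")
    else:
        print(f"{reply!r}: using source {key}")
```

With a check like this, the caller can decide what to do when no source is selected, for example asking the user to rephrase or falling back to a search over all code names.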


### Two Approaches for Working with Data and Large Language Models (LLMs)

When using LLMs to work with data, there are two main approaches we can take:

1. **Send All the Data to the LLM**:
   - We extract everything important from the data (in this case, from SDMX, which is a standard for sharing statistical data).
   - Then, we convert this information into a compact form (expressed in tokens, which are chunks of text that the LLM understands).
   - Finally, we send all the relevant information to the LLM at once and ask it to generate an answer.

2. **Use a Search-First Approach (RAG)**:
   - Instead of sending everything to the LLM, we can use a system called RAG (Retrieval-Augmented Generation).
   - First, we search our stored information (like a database) to find the most relevant entries for the task.
   - We then send only the top few results to the LLM and ask it to generate an answer based on those results.
   
The second approach limits the amount of information the LLM needs to process, which reduces the token count, and therefore the cost, of each call.
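The search-first idea can be illustrated with a toy retriever. This is a sketch under stated assumptions: production RAG systems typically rank entries by embedding similarity, whereas this example swaps in plain keyword overlap so that it runs with the standard library only. The `DIMENSION.CODE` key format and the `retrieve` function are hypothetical; the code names themselves come from the examples earlier in this notebook.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into word-like tokens (underscores kept)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question: str, entries: dict[str, str], top_k: int = 2) -> list[str]:
    """Return up to top_k entry keys that share at least one token with the question."""
    question_tokens = tokenize(question)
    scored = [
        (len(question_tokens & tokenize(f"{key} {value}")), key)
        for key, value in entries.items()
    ]
    # Highest overlap first; ties broken alphabetically for stable output
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [key for score, key in ranked[:top_k] if score > 0]


entries = {
    "EXP_SOURCE.S13": "General government",
    "EXP_SOURCE.S1D_NON_EDU": "Private sector",
    "EDUCATION_LEV.ISCED11_34_44": "Upper secondary and post-secondary non-tertiary general programmes",
}

question = "What is the definition of the education level 'ISCED11_34_44'?"
selected = retrieve(question, entries, top_k=1)
context = "\n".join(f"{key}: {entries[key]}" for key in selected)
print(context)
```

Only the retrieved lines would then be placed in the prompt, so the question that failed with the single-source selector could be answered from a single short entry instead of the full code list.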
