1. LLM vs compound ai system => RAG
2. Compound ai system control logic can be:
    - rule-based: coded by humans
    - [AGENT] ai-based: dynamically planned by the ai every time (LLM reasoning capability improvement made this possible)
3. [AGENT]
    - reason (decompose the task)
    - act, by calling tools
        - external apis
        - search the web
        - vector database
        - calculator
        - some other specialist LLMs
        - ...
    - access memory
4. Different Agent types:
    - ReAct (reason + act): user query => LLM.plan / think => execute (maybe using external tools) => observe tool return {ko: iterate / ok: answer}
5. designing compound systems
    - this is a sliding scale of LLM autonomy from full rule-based to full ai-based
    - when the scope is narrow and well defined, it is more efficient to design a rule-based system (e.g. the user is not expected to ask about the weather)
    - agentic approach is useful when the deterministic control flow would get too complicated (so we accept the time and cost of each individual execution, because developing the programmatic approach would take even more time and cost)
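
The ReAct loop from point 4 can be sketched as follows. This is only a minimal illustration: `scripted_llm` is a hypothetical stand-in that returns canned decisions instead of calling a real model, and `calculator` and `TOOLS` are made-up names for a toy tool registry.

```python
from typing import Callable, Dict, List, Tuple


def calculator(expression: str) -> str:
    """Toy tool (hypothetical): evaluates 'a op b' with spaces around the operator."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    results = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else float("nan")}
    return str(results[op])


# Hypothetical tool registry: tool name -> function taking one string argument
TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}


def scripted_llm(history: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: returns (action, argument).
    A real agent would send `history` to a model and parse its reply."""
    if not any(line.startswith("Observation:") for line in history):
        return "calculator", "223203 * 5"
    return "final_answer", history[-1][len("Observation: "):]


def react(question: str, llm=scripted_llm, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        action, argument = llm(history)                 # reason: decide the next step
        if action == "final_answer":                    # ok: answer
            return argument
        observation = TOOLS[action](argument)           # act: call the chosen tool
        history.append(f"Observation: {observation}")   # observe, then iterate
    return "No answer within the step budget."


print(react("What is 223203 * 5?"))
```

The `max_steps` budget is one simple way to keep the loop from iterating forever when the model never reaches a final answer.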

In [1]:
# When importing files from a different directory, you need to add the directory to the path
import sys
import os

# Get the current working directory
current_dir = os.getcwd()

# Construct the path to the chat_bot directory
chat_bot_path = os.path.join(current_dir, '..', 'chat_bot')

# Add the chat_bot directory to the Python path
sys.path.append(chat_bot_path)

In [5]:
import LLM
import SDMX_DataFlow

# Agent demo:
1. get the sdmx descriptor for a dataflow
2. tokenize the whole of it to estimate cost (and conclude that parsing is king)
3. dict / parsed json based solution

In [3]:
fin_perstud = {"agency": "OECD.EDU.IMEP",
"id": "DSD_EAG_UOE_FIN@DF_UOE_INDIC_FIN_PERSTUD",
"version": "1.0",
"name": "Expenditure on educational institutions per full-time equivalent student",
"description": "This dataset contains data on expenditure per full-time equivalent student and per full-time equivalent student as a percentage of GPD per capita. The default table displays data for 2020 in current USD PPP and as a percentage of GDP per capita, from all expenditure sources, and unfiltered by type of expenditure. The selection can be changed to display data: by year, by source of expenditure, by destination of expenditure and by type of expenditure. Please note that some categories are mutually exclusive. </p><p>For more information, please consult <a href=\"https://www.oecd-ilibrary.org/sites/e13bef63-en/1/3/4/index.html?itemId=%20/content/publication/e13bef63-en&_csp_=a4f4b3d408c9dd70d167f10de61b8717&itemIGO=oecd&itemContentType=book\"><i>Education at a Glance 2023</i></a>. Additional details regarding the methodology used, references to the sources, and specific notes for each country can be found in <a href=\"https://www.oecd-ilibrary.org/sites/301fa18e-en/index.html?itemId=/content/component/301fa18e-en\"><i>Education at a Glance 2023 Sources, Methodologies and Technical Notes</i></a>."
}

dataflow_details_url = f'https://sdmx.oecd.org/public/rest/dataflow/{fin_perstud["agency"]}/{fin_perstud["id"]}/{fin_perstud["version"]}?references=all'
dataflow_details_url

'https://sdmx.oecd.org/public/rest/dataflow/OECD.EDU.IMEP/DSD_EAG_UOE_FIN@DF_UOE_INDIC_FIN_PERSTUD/1.0?references=all'

In [6]:
df_info = SDMX_DataFlow.Dataflow(dataflow_details_url)

In [7]:
df_info.populate_variables()

No codelist found for TIME_PERIOD


In [8]:
# https://openai.com/api/pricing/
MODEL_PRICING_PER_M_TOKENS = {
    'gpt-4o': {'prompt_tokens': 5.00, 'completion_tokens': 15.00},
    'gpt-4o-2024-08-06': {'prompt_tokens': 2.50, 'completion_tokens': 10.00},
    'gpt-4o-mini': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'gpt-4o-mini-2024-07-18': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'o1-preview': {'prompt_tokens': 15.00, 'completion_tokens': 60.00}
}

In [13]:
# Print the token count of each instance variable
import tiktoken

# Load the encoding for the GPT model you're using
# "gpt-4" resolves to the `cl100k_base` encoding; the gpt-4o models use `o200k_base`,
# so the counts below are only an approximation for the gpt-4o prices above
encoding = tiktoken.encoding_for_model("gpt-4")

for var in vars(df_info):
    content = getattr(df_info, var)
    text = str(content)
    # Tokenize the string
    tokens = encoding.encode(text)
    print(f"{var} contains #{len(tokens)} tokens")
    

url contains #44 tokens
df_details_json contains #223203 tokens
df_name contains #11 tokens
df_description contains #274 tokens
df_dimension_names contains #147 tokens
df_code_names contains #1029 tokens


Assuming each answer is always 100 tokens, independent of input size, we can calculate the cost for 1000 calls as follows:

**Model Pricing:**
- **gpt-4o**: $5.00 / 1M input tokens, $15.00 / 1M output tokens
- **gpt-4o-mini**: $0.150 / 1M input tokens, $0.600 / 1M output tokens
- **o1-preview**: $15.00 / 1M input tokens, $60.00 / 1M output tokens

```python
1000 * len(tokens) * MODEL_PRICING_PER_M_TOKENS[model_name]['prompt_tokens'] / 1000000  # Cost of prompt
1000 * 100 * MODEL_PRICING_PER_M_TOKENS[model_name]['completion_tokens'] / 1000000      # Cost of completion
```
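
The two formulas can be wrapped in a small helper. This is a sketch only: `estimate_cost` and `PRICING` are hypothetical names, the prices are copied from the list above, and the token counts in the usage lines are the ones printed earlier for `df_details_json` and `df_code_names`.

```python
# Prices in USD per 1M tokens, copied from the pricing list above
PRICING = {
    "gpt-4o": {"prompt_tokens": 5.00, "completion_tokens": 15.00},
    "gpt-4o-mini": {"prompt_tokens": 0.150, "completion_tokens": 0.600},
    "o1-preview": {"prompt_tokens": 15.00, "completion_tokens": 60.00},
}


def estimate_cost(model_name, prompt_tokens, completion_tokens=100, calls=1000):
    """Estimated total cost in USD for `calls` identical requests."""
    price = PRICING[model_name]
    prompt_cost = calls * prompt_tokens * price["prompt_tokens"] / 1_000_000
    completion_cost = calls * completion_tokens * price["completion_tokens"] / 1_000_000
    return prompt_cost + completion_cost


# Raw SDMX JSON (223203 tokens) versus the parsed code names (1029 tokens)
print(estimate_cost("gpt-4o", 223_203))
print(estimate_cost("gpt-4o", 1_029))
```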

In [None]:
df_info.df_details_json



text = str(df_info.df_details_json)
# Tokenize the string
tokens = encoding.encode(text)
print(f"Cost of prompt: {len(tokens) * MODEL_PRICING_PER_M_TOKENS['gpt-4o']['prompt_tokens'] / 1000000}")
print(f"Cost of completion: {100 * MODEL_PRICING_PER_M_TOKENS['gpt-4o']['completion_tokens'] / 1000000}")

Cost of prompt: 1.116015
Cost of completion: 0.0015


In [None]:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
# json to text to tokens:
raw_json_text = str(df_info.df_details_json)
raw_json_tokens = encoding.encode(raw_json_text)
parsed_metadata_text = str(df_info.df_code_names)
parsed_metadata_tokens = encoding.encode(parsed_metadata_text)
completion_tokens = 100

for model_name in ("gpt-4o","gpt-4o-mini", "o1-preview"):
    print(f"Model: {model_name}:")
    raw_prompt_cost = 1000 * len(raw_json_tokens) * MODEL_PRICING_PER_M_TOKENS[model_name]['prompt_tokens'] / 1000000
    parsed_prompt_cost = 1000 * len(parsed_metadata_tokens) * MODEL_PRICING_PER_M_TOKENS[model_name]['prompt_tokens'] / 1000000
    completion_cost = 1000 * completion_tokens * MODEL_PRICING_PER_M_TOKENS[model_name]['completion_tokens'] / 1000000
    print(f"Total cost for 1000 calls on the raw SDMX metadata: ${raw_prompt_cost + completion_cost}")
    print(f"Total cost for 1000 calls on the parsed code names: ${parsed_prompt_cost + completion_cost}")
    print("==========================")



Model: gpt-4o:
Total cost for 1000 calls on the raw SDMX metadata: $1117.515
Total cost for 1000 calls on the parsed code names: $6.645
Model: gpt-4o-mini:
Total cost for 1000 calls on the raw SDMX metadata: $33.54045
Total cost for 1000 calls on the parsed code names: $0.21434999999999998
Model: o1-preview:
Total cost for 1000 calls on the raw SDMX metadata: $3354.045
Total cost for 1000 calls on the parsed code names: $21.435000000000002


### The Challenge of Too Much Data: Finding What’s Relevant for the LLM

When working with large metadata payloads such as an SDMX dataflow descriptor, there is often **a lot more data than we actually need** to answer specific questions. For example:

- A URL might only take up 44 tokens.
- The name of the dataflow takes up just 11 tokens.
- But the full details of the dataflow (in JSON format) use **223,203 tokens**!

#### The Problem:
If the user only asks for something simple, like the name of the dataflow, we don’t need to send all 223,203 tokens. The 11 tokens of the name are enough to answer. However, if we send all the data to the LLM, it creates two big challenges:
1. **Efficiency**: It’s like trying to find a needle in a haystack. The LLM would have to search through tons of unnecessary data to find what matters.
2. **Cost**: Sending large amounts of data to the LLM is expensive. The more tokens we use, the higher the cost.

#### The Solution:
A good idea is to **prepare the data before sending it to the LLM**, which is part of what’s called an "agentic approach." This means we make sure the LLM only gets the important and relevant information it needs to answer the question, rather than flooding it with everything.


In [14]:
interesting_session_variables = ['df_name', 'df_description', 'df_dimension_names', 'df_code_names']

# Create a subset dictionary directly from the dataclass instance
interesting_content = {var_name: getattr(df_info, var_name) for var_name in interesting_session_variables}

# Serialize the subset dictionary to a JSON string
import json
json_str = json.dumps(interesting_content, indent=4)

# Print the JSON string
print(json_str)


{
    "df_name": "Expenditure on educational institutions per full-time equivalent student",
    "df_description": "This dataset contains data on expenditure per full-time equivalent student and per full-time equivalent student as a percentage of GPD per capita. The default table displays data for 2020 in current USD PPP and as a percentage of GDP per capita, from all expenditure sources, and unfiltered by type of expenditure. The selection can be changed to display data: by year, by source of expenditure, by destination of expenditure and by type of expenditure. Please note that some categories are mutually exclusive. </p><p>For more information, please consult <a href=\"https://www.oecd-ilibrary.org/sites/e13bef63-en/1/3/4/index.html?itemId=%20/content/publication/e13bef63-en&_csp_=a4f4b3d408c9dd70d167f10de61b8717&itemIGO=oecd&itemContentType=book\"><i>Education at a Glance 2023</i></a>. Additional details regarding the methodology used, references to the sources, and specific notes 

In [15]:
interesting_content.keys()

dict_keys(['df_name', 'df_description', 'df_dimension_names', 'df_code_names'])

# Agentic approach

In [18]:
#persona = """You are a data analyst working for a government agency. You have been tasked with analyzing the expenditure on educational institutions per full-time equivalent student. You need to understand the dataset's structure, including the dimension names and code names, to effectively filter and analyze the data. Your goal is to provide insights and recommendations based on the dataset to inform policy decisions and resource allocation in the education sector."""
    
def select_info(user_question):
    persona = """
    You are a coordinator helping data analysts at a government agency. 
    You have been asked to help them find the best data source to answer a specific user question."""
    prompt = f"""
    Please select the best information from the sources listed below to answer the user's question. 
    You are only allowed to choose one source. 
    Please respect the formatting! For example: your answer should look like "df_code_names".
    If you are not able to find the answer in the sources listed below, please select "None".

    ### Sources:
    1. [df_name]: Name of the dataset. Usually 5-10 words.
    2. [df_description]: Detailed description of the dataset, including what it contains and how it can be filtered.
    3. [df_dimension_names]: The list of all dimension codes of the DataFlow as keys and their corresponding names in English as values.
    - Example:
        - 'EDUCATION_LEV': 'Education level'
        - 'EXP_SOURCE': 'Financing source'
    4. [df_code_names]: The list of all codes used in the DataFlow as keys and their corresponding names in English as values.
        - Example:
            - 'EXP_SOURCE':
                - 'S13': 'General government'
                - 'S1D_NON_EDU': 'Private sector'

    ### User Question:
    {user_question}
    """
    return LLM.model(prompt, persona)

In [19]:
user_question = "What is the name of the dataset?"
key, cost = select_info(user_question)
print(f"Selected source: {key}, Cost: {cost}")
interesting_content[key]

Selected source: df_name, Cost: 4.65e-05


'Expenditure on educational institutions per full-time equivalent student'

In [20]:
def answer_question(user_question, interesting_content):
    key, cost = select_info(user_question)
    print(f"Selected source: {key}, Cost: {cost}")

    persona = """You are a helpful data analyst working for OECD."""
    prompt = f"""
    Please provide the answer to the following user question: 
    {user_question}
    Please use the information from the source that was selected as the best source to answer the question:
    {interesting_content[key]}
    """

    return LLM.model(prompt, persona)

# the idea works, but not in every scenario

In [21]:
user_question = "What is the name of the dataset?"
answer_question(user_question, interesting_content)

Selected source: df_name, Cost: 4.65e-05


('The name of the dataset is "Expenditure on educational institutions per full-time equivalent student."',
 2.22e-05)

In [22]:
user_question = "What is the definition of the education level 'ISCED11_34_44'?"
answer_question(user_question, interesting_content)

Selected source: None, Cost: 4.74e-05


KeyError: 'None'
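
The `KeyError` arises because `select_info` is told to answer "None" when no source fits, while `answer_question` indexes `interesting_content[key]` directly. One possible guard is sketched below. `resolve_source` and `VALID_SOURCES` are hypothetical names, and the sample dictionary only stands in for the real `interesting_content`.

```python
VALID_SOURCES = ("df_name", "df_description", "df_dimension_names", "df_code_names")


def resolve_source(raw_key, content):
    """Return the content of the selected source, or None if the selection is unusable."""
    # Tolerate quotes, brackets and whitespace around the model's reply
    key = raw_key.strip().strip("\"'[]")
    if key not in VALID_SOURCES or key not in content:
        return None
    return content[key]


# Stand-in for interesting_content
sample_content = {"df_name": "Expenditure on educational institutions per full-time equivalent student"}

for reply in ("df_name", '"df_name"', "None"):
    source = resolve_source(reply, sample_content)
    if source is None:
        print(f"{reply!r}: no suitable source, skip the second LLM call")
    else:
        print(f"{reply!r}: {source}")
```

With such a check, `answer_question` could return a fallback message instead of raising when the selector finds no matching source.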

### Two Approaches for Working with Data and Large Language Models (LLMs)

When using LLMs to work with data, there are two main approaches we can take:

1. **Send All the Data to the LLM**:
   - We extract everything important from the data (in this case, from SDMX, which is a standard for sharing statistical data).
   - Then, we convert this information into a compact form (expressed in tokens, which are chunks of text that the LLM understands).
   - Finally, we send all the relevant information to the LLM at once and ask it to generate an answer.

2. **Use a Search-First Approach (RAG)**:
   - Instead of sending everything to the LLM, we can use a system called RAG (Retrieval-Augmented Generation).
   - First, we search our stored information (like a database) to find the most relevant entries for the task.
   - We then send only the top few results to the LLM and ask it to generate an answer based on those results.
   
The second approach helps limit the amount of information the LLM needs to process, making it more efficient.
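
The search-first flow can be sketched without any external service. In this illustration a bag-of-words cosine similarity is swapped in for real embeddings. `bag_of_words`, `cosine` and `retrieve` are hypothetical helpers, and the three documents are built from the example codes used in the selection prompt earlier.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Very rough stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=2):
    """Return the top_k documents most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(documents, key=lambda doc: cosine(query, bag_of_words(doc)), reverse=True)
    return ranked[:top_k]


documents = [
    "The English name of the code 'S13' within the code list ID 'EXP_SOURCE' is 'General government'.",
    "The name that corresponds to the dimension code: 'EDUCATION_LEV' is Education level.",
    "The name that corresponds to the dimension code: 'EXP_SOURCE' is Financing source.",
]

question = "What does the code S13 mean?"
context = retrieve(question, documents, top_k=1)

# Only the retrieved entries go into the prompt, not the whole metadata
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```

A real setup would replace `bag_of_words` with an embedding model and the sorted list with a vector database lookup. The shape of the flow stays the same: search, keep the top results, then prompt.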


# Embed the metadata for better search
- going through items individually to prepare adequate phrasing

In [23]:

def flatten_name_and_description(info):
    # Each fact is stored in both directions: (label, value) and (value, label)
    return [
          ("The dataflow's name", info.df_name)
        , (info.df_name, "The dataflow's name")
        , ("The dataflow's description", info.df_description)
        , (info.df_description, "The dataflow's description")
    ]

def flatten_dimensions(info):
    ans = []
    for code, name in info.df_dimension_names.items():
        meta_statement = f"The name that corresponds to the dimension code: '{code}' is {name}."
        ans.append((code, meta_statement))
        ans.append((name, meta_statement))
        ans.append((f"What is the name that corresponds to the dimension code: '{code}'?", meta_statement))
        ans.append((f"What is the dimension code for '{name}'?", meta_statement))
    return ans

def flatten_codes(info):
    ans = []
    for code_list_id in info.df_code_names:
        for code, name in info.df_code_names[code_list_id].items():
            meta_statement = f"The English name of the code '{code}' within the code list ID '{code_list_id}' is '{name}'."
            ans.append((code, meta_statement))
            ans.append((name, meta_statement))
            ans.append((f"What is the English name of the code '{code}' within the code list ID '{code_list_id}'?", meta_statement))
            ans.append((f"What is the code for '{name}' within the code list ID '{code_list_id}'?", meta_statement))
    return ans


In [24]:
# List of (search_key, statement) tuples
flat_info_for_embedding = []
flat_info_for_embedding.extend(flatten_name_and_description(df_info))
flat_info_for_embedding.extend(flatten_dimensions(df_info))
flat_info_for_embedding.extend(flatten_codes(df_info))

In [25]:
# Export the flattened information to a JSON file
import json
with open('flat_info_for_embedding.json', 'w') as f:
    json.dump(flat_info_for_embedding, f, indent=4)

In [26]:
# Define a mapping of old keys to new keys
key_mapping = {
    'df_name': 'full_name_of_the_dataflow',
    'df_description': 'description_of_the_dataflow',
    'df_dimension_names': 'dimension_codes_and_their_english_names',
    'df_code_names': 'code_list_codes_and_their_english_names'
}

# Create a new dictionary with updated key names
new_interesting_content = {key_mapping.get(k, k): v for k, v in interesting_content.items()}

# Print the new dictionary
new_interesting_content.keys()

dict_keys(['full_name_of_the_dataflow', 'description_of_the_dataflow', 'dimension_codes_and_their_english_names', 'code_list_codes_and_their_english_names'])

The `flatten_dictionary` function below does the following:

1. It uses a recursive approach to traverse the dictionary.
2. For each leaf node (where the value is not a dictionary or list), it creates two entries in the result list:
    - One with the key as the main key and (position, value) as the value.
    - Another with the value as the main key and (position, key) as the value.
3. For the dataflow name and description entries ('df_name' and 'df_description'), it prefixes 'Code: ' to the key and 'Name: ' to the value in the resulting structures.
4. The position is represented as a dot-separated string of the keys leading to the leaf node.

In [66]:
def flatten_dictionary(d):
    result = []
    
    def flatten(data, path=None):
        if path is None:
            path = []
        
        if isinstance(data, dict):
            for key, value in data.items():
                new_path = path + [str(key)]
                if isinstance(value, (dict, list)):
                    flatten(value, new_path)
                else:
                    position = '.'.join(map(str, new_path))
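                    # Check new_path, because path is still empty for top-level keys;
                    # accept both the original and the renamed (key_mapping) key names
                    if new_path[0] in ('df_name', 'df_description', 'full_name_of_the_dataflow', 'description_of_the_dataflow'):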
                        result.append({f"Code: {key}": (position, value)})
                        result.append({f"Name: {value}": (position, key)})
                    else:
                        result.append({key: (position, value)})
                        result.append({value: (position, key)})
        elif isinstance(data, list):
            for i, item in enumerate(data):
                flatten(item, path + [str(i)])
    
    flatten(d)
    return result

In [None]:
flattened = flatten_dictionary(new_interesting_content)
for item in flattened:
    print(item)

In [65]:
# export the flattened dictionary to a JSON file
import json
with open('flattened_interesting_content.json', 'w') as f:
    json.dump(flattened, f, indent=4)