<a href="https://colab.research.google.com/github/datakind/hxl-metadata-prediction/blob/main/openai-hxl-prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

A data standard used on platforms such as the [Humanitarian Data Exchange (HDX)](https://data.humdata.org/) is the [Humanitarian Exchange Language (HXL)](https://hxlstandard.org/), a column-level set of tags and attributes that improve data interoperability and discovery. These tags and attributes are typically set by hand by data owners, and this manual process can result in poor dataset coverage. Improving coverage through ML and AI techniques is desirable for faster and more efficient use of data in responding to humanitarian disasters.
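
To make the structure concrete: an HXL hashtag such as `#affected+ind+returnees` (one of the expected values in the training data used later in this notebook) consists of a tag followed by zero or more `+attribute` suffixes. The following is a minimal sketch of splitting one apart. `split_hxl` is a hypothetical helper for illustration, not part of this notebook or of any HXL library.

```python
def split_hxl(hxl: str) -> tuple[str, list[str]]:
    """Split an HXL hashtag string into its tag and its list of attributes."""
    # Everything before the first "+" is the tag; the rest are attributes
    tag, *attributes = hxl.strip().split("+")
    return tag, attributes


print(split_hxl("#affected+ind+returnees"))  # ('#affected', ['ind', 'returnees'])
print(split_hxl("#country"))                 # ('#country', [])
```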

Previous work has focused on fine-tuning LLMs to complete tags and attributes, starting with the study [Predicting Metadata on Humanitarian Datasets with GPT 3](https://medium.com/towards-data-science/predicting-metadata-for-humanitarian-datasets-using-gpt-3-b104be17716d). This has yielded promising results, but is constrained by the quality of the training data. The HDX team have confirmed that basic tags related to location and dates are popular, whereas more esoteric tags defined in [the standard](https://hxlstandard.org/standard/1-1final/tagging/) are not well represented.

This notebook fine-tunes an OpenAI model to test performance.

# Setup

1. Run notebook [generate-test-train-data.ipynb](generate-test-train-data.ipynb) to generate test and train data files for use in fine-tuning
2. Set `OPENAI_API_KEY` in file `.env` or in Colab secrets


If using Google Colab ...

3. Create a folder on Google Drive, and update the file paths below accordingly, noting that the Google Drive mount cell creates the mount at `/content/drive`
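
Before fine-tuning, it may help to confirm that the files generated in step 1 follow the chat layout seen in the prompts shown later in this notebook: one JSON object per line with a `messages` list of system, user and assistant turns. The following is a minimal sketch of such a check. `check_chat_jsonl` and `EXPECTED_ROLES` are hypothetical names introduced here, and the three-role layout is an assumption based on the example records.

```python
import json

# Assumed layout of each record's "messages" list (hypothetical constant)
EXPECTED_ROLES = ["system", "user", "assistant"]


def check_chat_jsonl(path):
    """Return (line_number, problem) pairs for records not in the expected chat layout."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "not valid JSON"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            messages = record.get("messages", [])
            roles = [m.get("role") for m in messages if isinstance(m, dict)]
            if roles != EXPECTED_ROLES:
                problems.append((line_number, f"unexpected roles {roles}"))
    return problems
```

Once the file paths are set in the cell below, calling `check_chat_jsonl(TRAINING_FILE)` should return an empty list if every record matches this layout.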

In [None]:
!pip install pandas==2.2.2
!pip install openai==1.35.3
!pip install python-dotenv==1.0.1

In [26]:
import openai
import os
import time
from openai import OpenAI
import pandas as pd
import json
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import sys
import requests
import pprint

from dotenv import load_dotenv
load_dotenv()

if os.getenv("OPENAI_API_KEY") is None:
  from google.colab import userdata
  OPENAI_API_KEY =  userdata.get('OPENAI_API_KEY')
else:
  OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = OpenAI(
    api_key=OPENAI_API_KEY
)

# If using Colab, this is where Google drive gets mounted. Otherwise leave blank
GOOGLE_BASE_DIR = "/content/drive/MyDrive/Colab"

# This is the HXL schema sheet, search HDX to get this link
HXL_SCHEMA_RESOURCE_URL = "https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/export?format=xlsx"

# Where to save local data files
LOCAL_DATA_DIR = f"{GOOGLE_BASE_DIR}/hxl-metadata-prediction/data"

# As generated by generate-test-train-data.ipynb
TRAINING_FILE = f"{LOCAL_DATA_DIR}/hxl_chat_prompts_train.jsonl"
TEST_FILE = f"{LOCAL_DATA_DIR}/hxl_chat_prompts_test.jsonl"

# Base model to fine-tune
MODEL = "gpt-4o-mini-2024-07-18"

pd.set_option('display.max_colwidth', 900)
pd.set_option('display.max_rows', 50)


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Download HXL Schema

In [8]:
local_data_file = os.path.join(LOCAL_DATA_DIR, "hxl-core-schema.xlsx")

# Download the schema workbook, stopping early on an HTTP error
response = requests.get(HXL_SCHEMA_RESOURCE_URL)
response.raise_for_status()
with open(local_data_file, 'wb') as f:
    f.write(response.content)

# [1:] skips the first data row of each sheet
df = pd.read_excel(local_data_file, sheet_name='Core hashtags')
hashtags_list = df['Hashtag'][1:].tolist()

df = pd.read_excel(local_data_file, sheet_name='Core attributes')
attributes_list = df['Attribute'][1:].tolist()

# Combined list of approved tags and attributes
APPROVED_HXL_SCHEMA = hashtags_list + attributes_list

print("Approved HXL schema ...")
print(APPROVED_HXL_SCHEMA)

Approved HXL schema ...
['#access', '#activity', '#adm1', '#adm2', '#adm3', '#adm4', '#adm5', '#affected', '#beneficiary', '#capacity', '#cause', '#channel', '#contact', '#country', '#crisis', '#currency', '#date', '#delivery', '#description', '#event', '#frequency', '#geo', '#group', '#impact', '#indicator', '#inneed', '#item', '#loc', '#meta', '#modality', '#need', '#operations', '#org', '#output', '#population', '#reached', '#region', '#respondee', '#sector', '#service', '#severity', '#status', '#subsector', '#targeted', '#value', '+abducted', '+acronym', '+activity', '+adolescents', '+adults', '+approved', '+ar', '+bounds', '+budget', '+canceled', '+children', '+cluster', '+code', '+converted', '+coord', '+dest', '+displaced', '+elderly', '+elevation', '+email', '+en', '+end', '+es', '+f', '+fa', '+fr', '+funder', '+hh', '+i', '+id', '+idps', '+impl', '+incamp', '+ind', '+infants', '+infected', '+injured', '+killed', '+label', '+lat', '+lon', '+m', '+ms', '+name', '+noncamp', '+num
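
The approved list can be used to check whether a predicted string contains only known tags and attributes. The following is a minimal sketch, assuming attributes are stored with a leading `+` as in the printed list. `is_approved` and `approved_sample` are hypothetical names for illustration, and the sample is a small subset of the entries printed above.

```python
def is_approved(hxl, approved):
    """Return True if the tag and every attribute of an HXL string are in the approved list."""
    tag, *attributes = hxl.strip().split("+")
    # Attributes appear in the approved list with their "+" prefix
    return tag in approved and all(f"+{a}" in approved for a in attributes)


# Small illustrative subset of the approved schema (hypothetical variable)
approved_sample = ["#affected", "#geo", "+hh", "+lat", "+lon"]

print(is_approved("#geo+lat", approved_sample))       # True
print(is_approved("#geo+latitude", approved_sample))  # False
```

In the notebook itself the second argument would be `APPROVED_HXL_SCHEMA`.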

#### Generate a prompt using HXL standard


# Analysis

## Check test/train

Let's do a sanity check to ensure the test set doesn't include data from organizations in the training set.

In [15]:
def read_prompts_file(filename):
  results = []
  with open(filename) as f:
    prompts = [json.loads(line) for line in f]
    for p in prompts:
      p["prompt"] = p["messages"][0:2]
      p["expected"] = p["messages"][-1]["content"]
      results.append(p)
    results = pd.DataFrame(results)

    print(f"\nFound {len(results)} prompts")
    print(f"\nData providers {results['Data provider'].unique()}")

    results['tag'] = results['expected'].apply(lambda x: x.split('+')[0])
    tag_counts = results['tag'].value_counts()
    print("\n",tag_counts)

  return results

print("\n\n======= TRAIN =======")
X_train = read_prompts_file(TRAINING_FILE)
print("\n\n======= TEST =======")
X_test = read_prompts_file(TEST_FILE)

# Print data providers in X_test that are in X_train
common_providers = list(set(X_train["Data provider"]).intersection(set(X_test["Data provider"])))
if len(common_providers) == 0:
  print("No common Data providers")
else:
  print(f"Common providers: {common_providers} found in both Train and test sets!!!")
  sys.exit()

# Print any hashes in test which are in train
common_hashes = list(set(X_train["expected"]).intersection(set(X_test["expected"])))
if len(common_hashes) == 0:
  print("No common hashes")

display(X_train)




Found 2919 prompts

Data providers ['international-organization-for-migration'
 'eth-zurich-weather-and-climate-risks' 'ifrc' 'ocha-fts' 'cerf' 'awsd'
 'insecurity-insight' 'ocha-sudan' 'ocha-niger' 'wfp' 'ocha-car' 'cred'
 'fao' 'water-point-data-exchange' 'ipc' 'interaction' 'ocha-somalia'
 'hdx' 'ocha-yemen' 'ocha-afghanistan' 'ourairports' 'hxl'
 'world-bank-group' 'unrwa-for-palestine-refugees-in-the-near-east'
 'ocha-fiss' 'ocha-ukraine' 'unhcr' 'ocha-ethiopia' 'ocha-haiti'
 'ocha-colombia' 'ocha-chad' 'ocha-nigeria' 'ocha-myanmar'
 'ocha-south-sudan' 'ocha-mali' 'ocha-dr-congo'
 'blavatnik-school-of-government-university-of-oxford' 'ocha-burkina'
 'un-ocha' 'ocha-ds' 'reliefweb' 'ocha-rosc' 'ocha-cameroon' 'unicef-rdc'
 'ocha-rosea' 'ocha-rolac' 'ocha-burundi' 'world-health-organization'
 'jcc' 'international-displacement-monitoring-centre-idmc' 'ocha-iraq'
 'ocha-opt' 'qcri' 'health-cluster' 'ocha-mozambique-hat' 'unicef-data'
 'unesco' 'ocha-libya' 'ocha-rowca' 'iati' 'clea

Unnamed: 0,messages,Data description,HDX resource id,HDX dataset id,Data provider,Date created,Locations,URL,Text header,Data excerpt,prompt,expected,tag
0,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx'; dataset_description='The dataset from the file ""DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx"" contains information on the total number of internally displaced persons (IDPs) and returnees in the Democratic Republic of Congo. The data includes the total number of IDP households, IDP individuals, male and female IDPs, and returnees. Specifically, there are 319,283 IDP households, 1,548,732 IDP individuals, with 646,805 males and 901,927 females, and 587,705 returnees.'; column_name:'Total IDP HH'; examples: [319283]'}, {'rol...","The dataset from the file ""DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx"" contains information on the total number of internally displaced persons (IDPs) and returnees in the Democratic Republic of Congo. The data includes the total number of IDP households, IDP individuals, male and female IDPs, and returnees. Specifically, there are 319,283 IDP households, 1,548,732 IDP individuals, with 646,805 males and 901,927 females, and 587,705 returnees.",26ecc26f-74e7-46af-b450-8872dca0b63b,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,international-organization-for-migration,2023-10-16,COD,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx,Total IDP HH,[319283],"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? 
resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx'; dataset_description='The dataset from the file ""DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx"" contains information on the total number of internally displaced persons (IDPs) and returnees in the Democratic Republic of Congo. The data includes the total number of IDP households, IDP individuals, male and female IDPs, and returnees. Specifically, there are 319,283 IDP households, 1,548,732 IDP individuals, with 646,805 males and 901,927 females, and 587,705 returnees.'; column_name:'Total IDP HH'; examples: [319283]'}]",#affected+hh,#affected
1,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx'; dataset_description='The dataset from the file ""DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx"" contains information on the total number of internally displaced persons (IDPs) and returnees in the Democratic Republic of Congo. The data includes the total number of IDP households, IDP individuals, male and female IDPs, and returnees. Specifically, there are 319,283 IDP households, 1,548,732 IDP individuals, with 646,805 males and 901,927 females, and 587,705 returnees.'; column_name:'Total Returnees'; examples: [587705]'}, {'...","The dataset from the file ""DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx"" contains information on the total number of internally displaced persons (IDPs) and returnees in the Democratic Republic of Congo. The data includes the total number of IDP households, IDP individuals, male and female IDPs, and returnees. Specifically, there are 319,283 IDP households, 1,548,732 IDP individuals, with 646,805 males and 901,927 females, and 587,705 returnees.",26ecc26f-74e7-46af-b450-8872dca0b63b,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,international-organization-for-migration,2023-10-16,COD,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx,Total Returnees,[587705],"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? 
resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx'; dataset_description='The dataset from the file ""DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx"" contains information on the total number of internally displaced persons (IDPs) and returnees in the Democratic Republic of Congo. The data includes the total number of IDP households, IDP individuals, male and female IDPs, and returnees. Specifically, there are 319,283 IDP households, 1,548,732 IDP individuals, with 646,805 males and 901,927 females, and 587,705 returnees.'; column_name:'Total Returnees'; examples: [587705]'}]",#affected+ind+returnees,#affected
2,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/admin1-summaries-earthquake.csv'; dataset_description='The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.'; column_name:'country_name'; examples: ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Af...","The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. 
The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.",dbf9b4bd-1321-4846-b6f0-4654509d3626,climada-earthquake-dataset,eth-zurich-weather-and-climate-risks,2024-02-23,AFG BFA BDI CMR CAF TCD COL COD ETH HTI MLI MOZ MMR NER NGA SOM SSD PSE SDN SYR UKR VEN YEM,https://data.humdata.org/dataset/744f4f0b-3172-4397-9609-5ec0b9d34fcb/resource/dbf9b4bd-1321-4846-b6f0-4654509d3626/download/admin1-summaries-earthquake.csv,country_name,"['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/admin1-summaries-earthquake.csv'; dataset_description='The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.'; column_name:'country_name'; examples: ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Af...",#country,#country
3,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/admin1-summaries-earthquake.csv'; dataset_description='The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.'; column_name:'latitude'; examples: ['34.5527', '34.9568', '34.9619', '34.3033', '34.0121', '34.2743', '34.7693', '35.4...","The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. 
The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.",dbf9b4bd-1321-4846-b6f0-4654509d3626,climada-earthquake-dataset,eth-zurich-weather-and-climate-risks,2024-02-23,AFG BFA BDI CMR CAF TCD COL COD ETH HTI MLI MOZ MMR NER NGA SOM SSD PSE SDN SYR UKR VEN YEM,https://data.humdata.org/dataset/744f4f0b-3172-4397-9609-5ec0b9d34fcb/resource/dbf9b4bd-1321-4846-b6f0-4654509d3626/download/admin1-summaries-earthquake.csv,latitude,"['34.5527', '34.9568', '34.9619', '34.3033', '34.0121', '34.2743', '34.7693', '35.4474', '35.8025', '34.8046', '33.3211']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/admin1-summaries-earthquake.csv'; dataset_description='The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.'; column_name:'latitude'; examples: ['34.5527', '34.9568', '34.9619', '34.3033', '34.0121', '34.2743', '34.7693', '35.4...",#geo+lat,#geo
4,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/admin1-summaries-earthquake.csv'; dataset_description='The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.'; column_name:'longitude'; examples: ['69.3376', '69.6258', '68.887', '68.2174', '69.1631', '70.4529', '70.1638', '69.7...","The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. 
The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.",dbf9b4bd-1321-4846-b6f0-4654509d3626,climada-earthquake-dataset,eth-zurich-weather-and-climate-risks,2024-02-23,AFG BFA BDI CMR CAF TCD COL COD ETH HTI MLI MOZ MMR NER NGA SOM SSD PSE SDN SYR UKR VEN YEM,https://data.humdata.org/dataset/744f4f0b-3172-4397-9609-5ec0b9d34fcb/resource/dbf9b4bd-1321-4846-b6f0-4654509d3626/download/admin1-summaries-earthquake.csv,longitude,"['69.3376', '69.6258', '68.887', '68.2174', '69.1631', '70.4529', '70.1638', '69.798', '68.9114', '67.2373', '67.812']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/admin1-summaries-earthquake.csv'; dataset_description='The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.'; column_name:'longitude'; examples: ['69.3376', '69.6258', '68.887', '68.2174', '69.1631', '70.4529', '70.1638', '69.7...",#geo+lon,#geo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2914,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/1501 Sierra Leone Health Centers.xlsx'; dataset_description='The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.'; column_nam...","The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. 
The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.",f78dc606-04e2-4fb6-a7eb-9eb995c33f76,141121-sierra-leone-health-facilities,standby-task-force,2014-11-01,SLE,https://data.humdata.org/dataset/7453fb80-752b-4078-a892-d936f9846dab/resource/f78dc606-04e2-4fb6-a7eb-9eb995c33f76/download/1501-sierra-leone-health-centers.xlsx,Province,"['Eastern', 'Eastern', 'Eastern', 'Eastern', 'Eastern', 'Eastern', 'Eastern', 'Eastern', 'Eastern', 'Eastern', 'Eastern']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/1501 Sierra Leone Health Centers.xlsx'; dataset_description='The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.'; column_nam...",#adm1,#adm1
2915,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/1501 Sierra Leone Health Centers.xlsx'; dataset_description='The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.'; column_nam...","The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. 
The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.",f78dc606-04e2-4fb6-a7eb-9eb995c33f76,141121-sierra-leone-health-facilities,standby-task-force,2014-11-01,SLE,https://data.humdata.org/dataset/7453fb80-752b-4078-a892-d936f9846dab/resource/f78dc606-04e2-4fb6-a7eb-9eb995c33f76/download/1501-sierra-leone-health-centers.xlsx,District,"['Kenema', 'Kenema', 'Kenema', 'Kenema', 'Kenema', 'Kenema', 'Kenema', 'Kenema', 'Kenema', 'Kenema', 'Kenema']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/1501 Sierra Leone Health Centers.xlsx'; dataset_description='The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.'; column_nam...",#adm2,#adm2
2916,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/1501 Sierra Leone Health Centers.xlsx'; dataset_description='The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.'; column_nam...","The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. 
The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.",f78dc606-04e2-4fb6-a7eb-9eb995c33f76,141121-sierra-leone-health-facilities,standby-task-force,2014-11-01,SLE,https://data.humdata.org/dataset/7453fb80-752b-4078-a892-d936f9846dab/resource/f78dc606-04e2-4fb6-a7eb-9eb995c33f76/download/1501-sierra-leone-health-centers.xlsx,Chiefdom,"['Dama', 'Dama', 'Dama', 'Dama', 'Dama', 'Dama', 'Dama', 'Dama', 'Dama', 'Dama', 'Dama']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/1501 Sierra Leone Health Centers.xlsx'; dataset_description='The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capacities, equipment, and verification status. The dataset seems to focus on health posts and centers in the Eastern region of Sierra Leone, providing information on their locations and operational status.'; column_nam...",#adm3,#adm3
2917,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Guinea health-facility master data.google sheet'; dataset_description='None'; column_name:'Nom de la région'; examples: ['Boke', 'Conakry', 'Faranah', 'Kankan', 'Kindia', 'Labe', 'Mamou', 'Nzerekore']'}, {'role': 'assistant', 'content': '#adm1+name'}]",,5d2531d6-c03a-449b-afdd-52c07d687679,guinea-healthcare-master-data,ipc-cluster-guinea,2015-09-03,GIN,https://docs.google.com/spreadsheets/d/1x0MgLKLG3fxWBJ200VV5Fr67GqgSvYISefO-EYEp2wg/edit#gid=0,Nom de la région,"['Boke', 'Conakry', 'Faranah', 'Kankan', 'Kindia', 'Labe', 'Mamou', 'Nzerekore']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Guinea health-facility master data.google sheet'; dataset_description='None'; column_name:'Nom de la région'; examples: ['Boke', 'Conakry', 'Faranah', 'Kankan', 'Kindia', 'Labe', 'Mamou', 'Nzerekore']'}]",#adm1+name,#adm1


## Fine-tune

In [99]:
def fine_tune_model(train_file, model_name="gpt-4o-mini"):
    """
    Fine-tune an OpenAI model using training data.

    Args:
        train_file (str): Path to the JSONL file containing the prompts to use for fine-tuning.
        model_name (str): The name of the model to fine-tune. Default is "gpt-4o-mini".

    Returns:
        str: The ID of the fine-tuned model.
    """

    # Create a version of the train_file jsonl which only has "messages"
    train_file_short = train_file.replace(".jsonl", "_short.jsonl")
    with open(train_file) as f:
        rows = [{"messages": json.loads(line)["messages"]} for line in f]
    with open(train_file_short, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Upload file to OpenAI for fine-tuning, closing the file handle afterwards
    with open(train_file_short, "rb") as f:
        file = client.files.create(
            file=f,
            purpose="fine-tune"
        )
    file_id = file.id
    print(f"Uploaded training file with ID: {file_id}")

    # Start the fine-tuning job
    ft = client.fine_tuning.jobs.create(
        training_file=file_id,
        model=model_name
    )
    ft_id = ft.id
    print(f"Fine-tuning job started with ID: {ft_id}")

    # Monitor the status of the fine-tuning job
    ft_result = client.fine_tuning.jobs.retrieve(ft_id)
    while ft_result.status != 'succeeded':
        print(f"Current status: {ft_result.status}")
        # Stop on a terminal failure state rather than polling forever
        if ft_result.status in ('failed', 'cancelled'):
            sys.exit(f"Fine-tuning job {ft_id} ended with status: {ft_result.status}")
        time.sleep(120)  # Wait for 120 seconds before checking again
        ft_result = client.fine_tuning.jobs.retrieve(ft_id)

    print(f"Fine-tuning job {ft_id} succeeded!")

    # Retrieve the fine-tuned model
    fine_tuned_model = ft_result.fine_tuned_model
    print(f"Fine-tuned model: {fine_tuned_model}")

    return fine_tuned_model
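
The status polling inside `fine_tune_model` can also be factored into a small helper that is testable without calling the API. The following is a minimal sketch, not the notebook's implementation: `wait_for_job`, `get_status` and `max_polls` are hypothetical names, and the set of terminal statuses (`succeeded`, `failed`, `cancelled`) is an assumption about the fine-tuning job lifecycle.

```python
import time

# Assumed terminal states for a fine-tuning job
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=120, max_polls=None, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return that status."""
    polls = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        polls += 1
        # Optional safety limit so a stuck job cannot block the notebook forever
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Job still '{status}' after {polls} polls")
        print(f"Current status: {status}")
        sleep(poll_seconds)

# Demo with a fake status source instead of a real API call
fake_statuses = iter(["validating_files", "running", "succeeded"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0))
```

In the notebook, `get_status` would presumably be something like `lambda: client.fine_tuning.jobs.retrieve(ft_id).status`.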

In [101]:
model = fine_tune_model(TRAINING_FILE, model_name=MODEL)

Uploaded training file with ID: file-Uzsels8NXSGHecUo2672vxG5
Fine-tuning job started with ID: ftjob-TxNTfk1vI83R7dNor0rqlVkU
Current status: validating_files
Current status: validating_files
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Fine-tuning job ftjob-TxNTfk1vI83R7dNor0rqlVkU succeeded!
Fine-tuned model: ft:gpt-4o-mini-2024-07-18:datakind::9p1xodpF


In [112]:
model = "ft:gpt-4o-mini-2024-07-18:datakind::9oJXzcfa" # No data summaries
#model = "ft:gpt-4o-mini-2024-07-18:datakind::9p1xodpF" # With data description
print(f"Fine-tuned model: {model}")

Fine-tuned model: ft:gpt-4o-mini-2024-07-18:datakind::9oJXzcfa


## Prediction Test

In [113]:
def make_chat_predictions(prompts, model, temperature=0.1, max_tokens=13):
  results = []
  for p in prompts:
    # The last message is the labelled assistant reply; keep only system + user as the prompt
    actual = p["messages"][-1]["content"]
    p["messages"] = p["messages"][0:2]
    completion = client.chat.completions.create(
      model=model,
      messages=p["messages"],
      temperature=temperature,
      max_tokens=max_tokens
    )
    predicted = completion.choices[0].message.content
    predicted = filter_for_schema(predicted)

    res = {
        "prompt": p["messages"],
        "actual": actual,
        "predicted": predicted
    }

    print(f"Predicted: {predicted}; Actual: {actual}")

    results.append(res)

  results = pd.DataFrame(results)

  return results

def filter_for_schema(text):
    #print(f"Tokens before: {text}")
    if " " in text:
        text = text.replace(" ","")

    tokens_raw = text.split("+")
    tokens = [tokens_raw[0]]
    for t in tokens_raw[1:]:
        tokens.append(f"+{t}")

    filtered = []
    for t in tokens:
        if t in APPROVED_HXL_SCHEMA:
            if t not in filtered:
                filtered.append(t)
    filtered = "".join(filtered)

    if len(filtered) > 0 and filtered[0] != '#':
        filtered = ""

    # Add spaces back in
    # filtered = filtered.replace("+", " +")

    #print(f"        After: {filtered}")
    return filtered

def output_prediction_metrics(results, prediction_field="predicted", actual_field="actual"):
    """
    Prints out model performance report.

    Parameters
    ----------
    results : dataframe
        Dataframe of results
    prediction_field : str
        Field name of element with prediction. Handy for comparing raw and post-processed predictions.
    actual_field: str
        Field name of the actual result for comparison with prediction
    """
    y_test = []
    y_pred = []
    y_justtag_test = []
    y_justtag_pred = []
    for index, r in results.iterrows():
        if actual_field not in r or prediction_field not in r:
            raise ValueError("Provided results do not contain expected values.")
        y_pred.append(r[prediction_field])
        y_test.append(r[actual_field])
        actual_tag = r[actual_field].split("+")[0]
        predicted_tag = r[prediction_field].split("+")[0]
        y_justtag_test.append(actual_tag)
        y_justtag_pred.append(predicted_tag)

    print(f"LLM results for {prediction_field}, {len(results)} predictions ...")
    print("\nJust HXL tags ...\n")
    print(f"Accuracy: {round(accuracy_score(y_justtag_test, y_justtag_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )

    print(f"\nTags and attributes with {prediction_field} ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )

    return
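
To make the post-processing in `filter_for_schema` concrete, here is a condensed sketch that mirrors its logic on a few inputs. `filter_for_schema_sketch` and `EXAMPLE_SCHEMA` are hypothetical names for illustration; the real `APPROVED_HXL_SCHEMA` is defined elsewhere in the notebook and is much larger.

```python
# Hypothetical subset for illustration; the notebook's APPROVED_HXL_SCHEMA is defined elsewhere
EXAMPLE_SCHEMA = {"#adm1", "#country", "+name", "+code"}

def filter_for_schema_sketch(text, approved=EXAMPLE_SCHEMA):
    """Keep only approved, de-duplicated tokens; reject output that lacks a leading tag."""
    parts = text.replace(" ", "").split("+")
    # First part is the tag; every later part is an attribute and gets its "+" back
    tokens = [parts[0]] + [f"+{p}" for p in parts[1:]]
    kept = []
    for t in tokens:
        if t in approved and t not in kept:
            kept.append(t)
    filtered = "".join(kept)
    # An attribute with no valid tag in front of it is not a usable prediction
    return filtered if filtered.startswith("#") else ""

for raw in ["#adm1 +name", "#adm1+name+name", "#adm1+bogus", "#bogus+name"]:
    print(repr(raw), "->", repr(filter_for_schema_sketch(raw)))
```

With this example schema, stray spaces and duplicate attributes are removed, unknown attributes are dropped, and a prediction whose tag is not approved collapses to an empty string.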

In [114]:
with open(TEST_FILE) as f:
    X_test = [json.loads(line) for line in f]

# Subsample
#size = 10
#X_test = X_test[-size:]

results = make_chat_predictions(X_test, model)

results.to_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prediction-results.xlsx", index=False)

display(results)

output_prediction_metrics(results)

print("Done")

Predicted: #country+code; Actual: #country+code
Predicted: #adm1+name; Actual: #loc+name
Predicted: #indicator+code; Actual: #meta+id
Predicted: #indicator+name; Actual: #indicator+name
Predicted: #country+name; Actual: #country+name
Predicted: #indicator+code; Actual: #indicator+code
Predicted: #indicator+code; Actual: #indicator+code+label
Predicted: #country+code; Actual: #country+code
Predicted: #meta+id; Actual: #meta+id
Predicted: #indicator+name; Actual: #indicator+name
Predicted: #country+name; Actual: #country+name
Predicted: #indicator+code; Actual: #indicator+code
Predicted: #indicator+code; Actual: #indicator+code+label
Predicted: #indicator+name; Actual: #indicator+label
Predicted: #country+code; Actual: #country+code
Predicted: #country+name; Actual: #country+name
Predicted: #indicator+code; Actual: #indicator+id
Predicted: #indicator+name; Actual: #indicator+name
Predicted: #country+code; Actual: #country+code
Predicted: #country+name; Actual: #country+name
Predicted: #i

Unnamed: 0,prompt,actual,predicted
0,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#country+code,#country+code
1,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#loc+name,#adm1+name
2,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#meta+id,#indicator+code
3,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#indicator+name,#indicator+name
4,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#country+name,#country+name
...,...,...,...
432,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#adm1+code,#adm1+code
433,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#adm2,#adm2+name
434,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#adm2+code,#adm2+code
435,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#loc+type,#loc+type


Done


### Prediction analysis

Next we look at the cases where the prediction did not match the label, to identify patterns we can address.
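
One quick way to find such patterns is to count which (actual, predicted) pairs occur most often among the mismatches. The sketch below uses only the standard library; `top_confusions` is a hypothetical helper and the sample rows are illustrative, not the notebook's full results.

```python
from collections import Counter

def top_confusions(rows, n=5):
    """Count (actual, predicted) pairs among rows where the two differ."""
    pairs = Counter(
        (r["actual"], r["predicted"]) for r in rows if r["actual"] != r["predicted"]
    )
    return pairs.most_common(n)

# Illustrative rows only, modelled on the kinds of mismatches seen in this notebook
sample = [
    {"actual": "#region+name", "predicted": "#adm1+name"},
    {"actual": "#region+name", "predicted": "#adm1+name"},
    {"actual": "#indicator+id", "predicted": "#indicator+code"},
    {"actual": "#country+code", "predicted": "#country+code"},
]
print(top_confusions(sample))
```

For the `results` DataFrame, the rows could be obtained with `results.to_dict("records")`.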

In [105]:
results["match"] = results['predicted'] == results['actual']
display(results[results["match"]==False])

Unnamed: 0,prompt,actual,predicted,match
1,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#loc+name,#adm1+name,False
6,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#indicator+code+label,#indicator+code,False
12,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe1.csv'; dataset_description='The dataset contains various indicators related to demographic and health statistics for Sao Tome and Principe in 2008, sourced from the DHS Quickstats Data file. The indicators include total fertility rate, contraceptive use among married women, unmet need for family planning, median age at first marriage and first sexual intercourse for women, and infant mortality rates. Each data entry includes information such as the indicator name, value, precision, country code, survey details, and characteristic categories. The dataset provi...",#indicator+code+label,#indicator+code,False
13,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe1.csv'; dataset_description='The dataset contains various indicators related to demographic and health statistics for Sao Tome and Principe in 2008, sourced from the DHS Quickstats Data file. The indicators include total fertility rate, contraceptive use among married women, unmet need for family planning, median age at first marriage and first sexual intercourse for women, and infant mortality rates. Each data entry includes information such as the indicator name, value, precision, country code, survey details, and characteristic categories. The dataset provi...",#indicator+label,#indicator+type,False
16,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Human Development Indicators for Zimbabwe.csv'; dataset_description='The dataset contains Human Development Indicators for Zimbabwe, with information on indicators such as Adolescent Birth Rate (births per 1,000 women ages 15-19) for the years 1990 to 1998. The data includes columns for country code, country name, indicator ID, indicator name, index ID, index name, indicator value, and year. The values in the dataset show a trend of decreasing Adolescent Birth Rate over the specified years.'; column_name:'indicator_id'; examples: ['abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr...",#indicator+id,#indicator+code,False
20,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Human Development Indicators for Somalia.csv'; dataset_description='The dataset contains Human Development Indicators for Somalia, with information on indicators such as Adolescent Birth Rate (births per 1,000 women ages 15-19) for the years 1990 to 1998. The data includes columns for country code, country name, indicator ID, indicator name, index ID, index name, indicator value, and year. The values in the dataset show a trend of Adolescent Birth Rate fluctuating over the years, ranging from 115.367 to 126.275.'; column_name:'indicator_id'; examples: ['abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'ab...",#indicator+id,#indicator+code,False
23,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Organisation Name'; examples: ['A.B.D.D.M', 'A.B.D.D.M', 'A.B.D.D.M', 'A.B.D.D.M', 'ACT Alliance', 'ACT Alliance', 'ACT Alliance', 'ACT Alliance', 'Action Aid', 'Action Aid', 'Action Aid']'}]",#org+impl,#org+name,False
25,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Country'; examples: ['Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi']'}]",#adm1+name,#country+name,False
26,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Region'; examples: ['Muyinga', 'Muyinga', 'Muyinga', 'Muyinga', 'Cankuzo', 'Rumonge', 'Ruyigi', 'Ruyigi', 'Makamba', 'Makamba', 'Rutana']'}]",#region+name,#adm1+name,False
27,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Location'; examples: ['Giteranyi', 'Giteranyi', 'Muyinga', 'Muyinga', 'Gisigara', 'Rumonge', 'Gisuru', 'Kiniyia', 'Kayogoro', 'Nyanza-Lac', 'Giharo']'}]",#adm2+name,#loc+name,False


#### Scenario 1 - Predicting 'adm1' when it should be 'region'

Better prompting could fix this by telling the model explicitly that a region is not the same as adm1.
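
As a rough sketch of what that prompting could look like, the system message can carry an extra guidance sentence. The wording of `REGION_HINT` and the helper `build_messages` are assumptions for illustration and have not been tested against the model; since the model was fine-tuned with a fixed system prompt, the training data would likely need the same change.

```python
# Base text follows the system prompt used in this notebook's training data
BASE_SYSTEM_PROMPT = "You are an assistant that replies with HXL tags and attributes"

# Hypothetical guidance sentence; the exact wording is an untested assumption
REGION_HINT = "A column named or described as a region should be tagged #region, not #adm1."

def build_messages(user_content, extra_guidance=REGION_HINT):
    """Return chat messages with the extra guidance appended to the system prompt."""
    system = f"{BASE_SYSTEM_PROMPT}. {extra_guidance}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content},
    ]

for m in build_messages("column_name:'Region'; examples: ['Muyinga', 'Cankuzo']"):
    print(m["role"], ":", m["content"])
```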

In [106]:
# Find mismatches where the actual tag contains 'region' but the prediction contains 'adm1'
breaks = results[results["match"]==False]
print(breaks.shape)
scenario1 = breaks[(breaks["match"]==False) & (breaks["actual"].str.contains("region")) & (breaks["predicted"].str.contains("adm1"))]
display(scenario1)
print(scenario1.shape)

(157, 4)


Unnamed: 0,prompt,actual,predicted,match
26,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Region'; examples: ['Muyinga', 'Muyinga', 'Muyinga', 'Muyinga', 'Cankuzo', 'Rumonge', 'Ruyigi', 'Ruyigi', 'Makamba', 'Makamba', 'Rutana']'}]",#region+name,#adm1+name,False
60,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/NGA_Subnational_Covid19_HXL_HERA.csv'; dataset_description='The dataset contains information on COVID-19 cases in different regions of Nigeria, with columns including ID, date, ISO code, country name, region name, number of confirmed cases, deaths, recoveries, and gender-specific case counts. The data is sourced from the Nigeria Centre for Disease Control and covers various regions such as Abia, Adamawa, Akwa Ibom, Anambra, Bauchi, Bayelsa, Benue, Borno, and Cross River. The dataset appears to be structured with semicolons as delimiters and includes metadata headers at the beginning.'; column_name:'REGIO...",#region+name,#adm1+name,False
66,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/BFA_Subnational_Covid19_HXL_HERA.csv'; dataset_description='The dataset contains information on COVID-19 cases in Burkina Faso at a subnational level, with columns including ID, DATE, ISO_3, PAYS, ID_PAYS, REGION, ID_REGION, CONTAMINES, DECES, GUERIS, CONTAMINES_FEMME, CONTAMINES_HOMME, CONTAMINES_GENRE_NON_SPECIFIE, NOUVEAUX_INDIVIDUS_VACCINES_1DOSE, TOTAL_INDIVIDUS_VACCINES_1DOSE, NOUVEAUX_AGENTS_SANTE_VACCINES_1DOSE, TOTAL_AGENTS_SANTE_VACCINES_1DOSE, NOUVEAUX_INDIVIDUS_VACCINES_2DOSES, TOTAL_INDIVIDUS_VACCINES_2DOSES, NOUVEAUX_AGENTS_SANTE_VACCINES_2DOSES, TOTAL_AGENTS_SANTE_VACCINES_2DOSES, and SOUR...",#region+name,#adm1+name,False
72,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/MRT_Subnational_Covid19_HXL_HERA.csv'; dataset_description='The dataset contains information on COVID-19 cases in Mauritania at a subnational level, with columns including ID, date, ISO code, country name, region, number of confirmed cases, deaths, recoveries, and gender breakdown of cases. The data is sourced from the Ministry of Health. The dataset provides details for different regions within Mauritania, with each row representing a specific region and its corresponding COVID-19 statistics on a particular date.'; column_name:'REGION'; examples: ['Adrar', 'Assaba', 'Brakna', 'Dakhlet Nouadhibou', 'Gorg...",#region+name,#adm1+name,False
78,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/MLI_Subnational_Covid19_HXL_HERA.csv'; dataset_description='The dataset contains information on COVID-19 cases in Mali at the subnational level, with columns including ID, date, ISO code, country name, region, number of confirmed cases, deaths, recoveries, and gender breakdown of cases. The data is sourced from the Ministry of Health and includes details such as the region name, affected individuals by gender, and specific counts for infected, deceased, and recovered cases. The dataset appears to be structured with semicolons as delimiters and includes data for various regions within Mali, providing a sn...",#region+name,#adm1+name,False
129,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/2010-2019-consolidado-sivicap (1).xlsx'; dataset_description='The dataset contains information on IRCA (Índice de Riesgo de Calidad del Agua) values for different municipalities in Antioquia from the years 2010 to 2019. The data includes columns for the year, department, municipality, IRCA value, and category of risk associated with the water quality. The dataset provides a breakdown of IRCA values for each municipality, categorizing them as ""Sin riesgo"" (No risk), ""Medio"" (Medium risk), or ""Bajo"" (Low risk). The data is structured in a tabular format with rows representing different municipalities and t...",#region+name,#adm1+name,False
130,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/2010-2019-consolidado-sivicap (1).xlsx'; dataset_description='The dataset contains information on IRCA (Índice de Riesgo de Calidad del Agua) values for different municipalities in Antioquia from the years 2010 to 2019. The data includes columns for the year, department, municipality, IRCA value, and category of risk associated with the water quality. The dataset provides a breakdown of IRCA values for each municipality, categorizing them as ""Sin riesgo"" (No risk), ""Medio"" (Medium risk), or ""Bajo"" (Low risk). The data is structured in a tabular format with rows representing different municipalities and t...",#region+code,#adm1+code,False
269,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/BFA_Covid19_Citylevel_HXL_HERA.csv'; dataset_description='The dataset contains information on COVID-19 cases in Burkina Faso at the city level. It includes columns such as ID, DATE, ISO_3, PAYS, REGION, VILLES, COMMUNES_TYPE, Contaminés, Décès, Guéris, Femme, Homme, Genre_non spécifié, and Source. Each row represents data for a specific date and location within Burkina Faso, detailing the number of confirmed cases, deaths, recoveries, and gender distribution. The data is sourced from the Ministry of Health in Burkina Faso.'; column_name:'REGION'; examples: ['Centre', 'Non spécifié', 'Non spécifié', 'Non ...",#region+name,#adm1+name,False
276,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/MLI_Covid19_Citylevel_HXL_HERA.csv'; dataset_description='The dataset contains COVID-19 related data for Mali, with information such as date, region, city, number of confirmed cases, deaths, recoveries, gender distribution, and data sources. The data is structured in a CSV format with columns representing different attributes. The dataset includes details on affected individuals in various regions and cities within Mali, along with gender-specific information. Additionally, it provides sources for the data entries.'; column_name:'REGION'; examples: ['Kayes', 'Non spécifié', 'Bamako', 'Non spécifié', 'Bam...",#region+name,#adm1+name,False
305,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC_Subnational_Covid19_HXL_HERA.csv'; dataset_description='The dataset contains information on COVID-19 cases in different regions of the Democratic Republic of Congo (DRC). It includes data such as the date, region, number of confirmed cases, deaths, recoveries, gender breakdown of cases, and vaccination statistics. Each row represents a specific region within the DRC and provides details on the COVID-19 situation in that area. The dataset seems to be sourced from the World Health Organization (OMS RDC).'; column_name:'REGION'; examples: ['Bas Uele', 'Equateur', 'Haut Katanga', 'Haut Lomami', 'Haut Uel...",#region+name,#adm1+name,False


(17, 4)


#### Scenario 2 - Where the labeled data says admin 1 but the column is really country (admin 0)

In [107]:
breaks = breaks[~breaks.index.isin(scenario1.index)]
print(breaks.shape)

scenario2 = breaks[(breaks["predicted"].str.contains("country")) & (breaks["actual"].str.contains("adm1"))]
display(scenario2)
print(scenario2.shape)


(140, 4)


Unnamed: 0,prompt,actual,predicted,match
25,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Country'; examples: ['Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi']'}]",#adm1+name,#country+name,False
38,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/FieldsData_3W_PHL_Aklan.xlsx'; dataset_description='None'; column_name:'Country'; examples: ['Philippines', 'Philippines', 'Philippines', 'Philippines', 'Philippines', 'Philippines', 'Philippines', 'Philippines', 'Philippines', 'Philippines', 'Philippines']'}]",#adm1+name,#country+name,False
226,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Kerela.xlsx'; dataset_description='The dataset from the file ""Kerela.xlsx"" contains information about various organizations in South India, specifically in Wayanad, India. The data includes details such as the timestamp, source, name of the organization, type of organization, region, country, province, and sector. The organizations mentioned are involved in areas such as education, health & nutrition, livelihoods, environment protection, and food security. The dataset seems to capture activities and initiatives undertaken by these organizations in the specified region.'; column_name:'Country'; examples: ...",#adm1+name,#country+name,False
327,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi Cankuzo_FD.xlsx'; dataset_description='The dataset from the file ""Burundi Cankuzo_FD.xlsx"" contains information on various organizations operating in the Cankuzo region of Burundi. The data includes details such as the timestamp, data source, organization name, region, country, province, and sector of each organization. Organizations mentioned in the dataset are involved in sectors like Food Security, Protection, SGBV, Health, Environment Protection, and Nutrition. The dataset provides a snapshot of the organizations working in the specified region and their focus areas.'; column_name:'Country'; ...",#adm1+name,#country+name,False
350,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/4W_BU_Kayanza.csv'; dataset_description='The dataset contains information about various organizations operating in Kayanza, Burundi, including CARE International, World Food Program, Action Aid, Fédération Nationale des Associations engagées dans le Domaine de l'Enfance au Burundi (FENADEB), Initiative Pastorale pour la Réinsertion des Enfants en Difficulté (IPRED), Observatoire INEZA des droits de l'enfant au Burundi (OIDEB), and Red Cross - Burundi. The organizations are involved in sectors such as education, food security, and child protection, with activities ranging from development to humanitarian ...",#adm1+name,#country+name,False
359,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/4W_BU_Kirundo.csv'; dataset_description='The dataset contains information about various organizations operating in Kirundo, Burundi, including their names, types, sectors, and activities. The data includes details such as timestamps, data sources, organization names, organization types, region, country, province, and sector types. Some organizations are national NGOs, and there are links to organization websites provided for some entries. The dataset also includes information on the type of activities each organization is involved in, such as food security, protection, development, and humanitarian effor...",#adm1+name,#country+name,False
367,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/fieldsdata_4w_UG_BujumburaMarie.csv'; dataset_description='The dataset contains information on various organizations operating in Bujumbura Mairie, Burundi, with details such as organization names, types, sectors, and activities. The data includes both international and national NGOs involved in areas like child protection, development, and humanitarian work. The dataset also includes timestamps indicating when the information was reported.'; column_name:'Country'; examples: ['Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi', 'Burundi']'}]",#adm1+name,#country+name,False


(7, 4)


The model is correct here: the column names are typically things like 'Country', and the example values are country names.
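One way to check this claim without reading every row is to pull the column name out of each prompt. The sketch below is a minimal example and assumes the prompt is available as a string containing the `column_name:'...'; examples:` pattern seen in the rows above. The helper name `extract_column_name` is hypothetical and not part of the notebook.

```python
import re
from typing import Optional

# Assumes the prompt text follows the pattern shown in the rows above:
#   ...; column_name:'Country'; examples: [...]
# Anchoring on "'; examples:" lets column names that contain apostrophes through.
COLUMN_NAME_RE = re.compile(r"column_name:'(.*?)'; examples:")


def extract_column_name(prompt_text: str) -> Optional[str]:
    """Return the column name embedded in a prompt string, or None if absent."""
    match = COLUMN_NAME_RE.search(prompt_text)
    return match.group(1) if match else None


sample_prompt = (
    "What are the HXL tags and attributes for a column with these details? "
    "resource_name='FieldsData_3W_PHL_Aklan.xlsx'; dataset_description='None'; "
    "column_name:'Country'; examples: ['Philippines', 'Philippines']"
)
print(extract_column_name(sample_prompt))
```

Applied to the `prompt` column of `scenario2` (for example via `scenario2["prompt"].astype(str).apply(extract_column_name)`), this would list the column names behind each disagreement, which makes it easier to confirm that they all refer to countries.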

#### Scenario 3 - Where the prediction adds 'name' but the labeled data doesn't have it

In [108]:
breaks = breaks[~breaks.index.isin(scenario2.index)]
print(breaks.shape)

scenario3 = breaks[(breaks["predicted"].str.len() > breaks["actual"].str.len()) & (breaks["predicted"].str.contains("name"))]
display(scenario3)
print(scenario3.shape)

(133, 4)


Unnamed: 0,prompt,actual,predicted,match
1,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#loc+name,#adm1+name,False
51,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/victimas_hecho_dpto_2017_2021.xlsx'; dataset_description='The dataset ""victimas_hecho_dpto_2017_2021.xlsx"" contains information on victimizing events in Colombia from 2017 to 2021. It includes columns such as country, department code and name, year, type of victimizing event, gender, ethnicity, disability status, age range of victims, and number of individuals affected per occurrence. The data provides details on the location, characteristics, and impact of victimizing events, including information on the victims' demographics and vulnerabilities.'; column_name:'PAIS'; examples: ['COLOMBIA', 'COLOMBIA', ...",#country,#country+name,False
88,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/victimas_explotacion_sexual_comercial.xlsx'; dataset_description='The dataset contains information on victims of commercial sexual exploitation, including details such as stage of legal process, department, municipality, charges, arrests, victim age group, country of birth, and total number of victims. The data includes entries for different stages of legal proceedings, with varying details for each victim such as age group and location. The dataset appears to have a mix of categorical and numerical data, with some missing values for certain variables like country of birth.'; column_name:'PAIS_NACIMIENTO...",#country,#country+name+origin,False
99,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/datos_brutos_personas_alcanzadas_vbg.xlsx'; dataset_description='The dataset contains information related to projects aimed at providing access to safe and confidential services for survivors of Gender Based Violence (GBV) in vulnerable populations during the COVID-19 pandemic. The data includes details such as project IDs, validation status, organization names, project descriptions, activity descriptions, resource types, estimated costs, start and end dates, status, country information, location details, number of people reached, demographics of beneficiaries, and indicators related to COVID-19, armed v...",#country,#country+name,False
136,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Base convalidaciones - HDX.xlsx'; dataset_description='The dataset titled ""Base convalidaciones - HDX.xlsx"" contains information on individuals who have validated their academic titles in Colombia, originally obtained in Venezuela. The data includes details on the academic titles, institutions, and the approval or rejection status of the validations. It provides a breakdown by geographical location (national, department, municipality) and population segments. The dataset spans from 2015 to 2021 and is updated on an as-needed basis.'; column_name:'ciudad residencia'; examples: ['Santa Marta', 'Soledad', '...",#adm3,#adm2+name,False
152,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/instituciones_de_salud_en_colombia.xlsx'; dataset_description='The dataset contains information about healthcare institutions in Colombia, with columns including the department, municipality, provider code, provider name, habilitation code, address, neighborhood, nature of the institution, level, and other relevant details. Each row represents a different healthcare institution, specifying whether it is urban or rural, its opening date, and its classification as a public institution. The dataset also includes information on the main headquarters, habilitation status, and the date of the last update.'; co...",#org,#org+name,False
157,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/instituciones_de_salud_en_colombia.xlsx'; dataset_description='The dataset contains information about healthcare institutions in Colombia, with columns including the department, municipality, provider code, provider name, habilitation code, address, neighborhood, nature of the institution, level, and other relevant details. Each row represents a different healthcare institution, specifying whether it is urban or rural, its opening date, and its classification as a public institution. The dataset also includes information on the main headquarters, habilitation status, and the date of the last update.'; co...",#loc,#adm3+name,False
178,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/delito-sexual-ven-col.xlsx'; dataset_description='The dataset contains information on sexual crimes in Colombia, with details such as the year, date, department, municipality, day, time, location, type of site, weapons used, age, gender, marital status, country of birth, employment class, profession, education level, DANE code, and quantity of incidents. The data includes entries from various departments and municipalities in Colombia, providing insights into the characteristics of these crimes.'; column_name:'País de nacimiento'; examples: ['COLOMBIA', 'COLOMBIA', 'COLOMBIA', 'COLOMBIA', 'COLOMBIA', 'CO...",#country,#country+name,False
180,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/apoyos-de-la-cooperacion-con-insumos-para-salud-ante-covid-19.xlsx'; dataset_description='The dataset contains information on support provided by organizations in response to COVID-19, including details such as the department, municipality, organization, institution, quantity of supplies, availability, and delivery date. The data includes entries for different locations, with varying quantities of supplies delivered to different types of recipients (e.g., communities, hospitals). The dataset appears to track the distribution of supplies like insumos to different entities in various regions, with informat...",#loc,#loc+name,False
184,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/vih_sida-vf.xlsx'; dataset_description='The dataset contains information on the density of HIV/AIDS incidence per 1000 cases by sex and municipality in Colombia. It includes data such as the number of cases, affected population, and density of incidence for different municipalities in Antioquia. The dataset also provides details on the source of the data and specifies that only live cases for the year 2017 were considered. The data is structured in columns with various unnamed headers, and it seems to be sourced from the SISPRO 2017 database.'; column_name:'Pais'; examples: [' Colombia', ' Colombia', ' C...",#country,#country+name,False


(18, 4)


Apart from two real breaks, the predictions above appear to be correct: the model is adding 'name' to columns that do contain names.
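The filter used for this scenario (longer prediction that contains "name") also catches rows where the tag itself changed. A stricter check, sketched below, accepts only predictions that equal the label plus a single extra `name` attribute. The function `only_adds_name` is a hypothetical helper, and it assumes hashtags are written as a tag followed by `+`-separated attributes.

```python
def only_adds_name(actual: str, predicted: str) -> bool:
    """True when `predicted` is `actual` plus exactly one extra '+name' attribute."""
    tag_actual, *attrs_actual = actual.strip().split("+")
    tag_predicted, *attrs_predicted = predicted.strip().split("+")
    extra = set(attrs_predicted) - set(attrs_actual)
    missing = set(attrs_actual) - set(attrs_predicted)
    return tag_actual == tag_predicted and extra == {"name"} and not missing


# Pairs taken from the rows displayed above (actual, predicted)
pairs = [
    ("#country", "#country+name"),
    ("#org", "#org+name"),
    ("#country", "#country+name+origin"),
    ("#adm3", "#adm2+name"),
    ("#loc+name", "#adm1+name"),
]
for actual, predicted in pairs:
    print(actual, predicted, only_adds_name(actual, predicted))
```

Rows where this returns `False` are the ones most worth a manual look, since the difference goes beyond the added attribute.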

#### The rest

In [109]:
breaks = breaks[~breaks.index.isin(scenario3.index)]
print(breaks.shape)
display(breaks)
print(breaks.shape)

(115, 4)


Unnamed: 0,prompt,actual,predicted,match
6,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#indicator+code+label,#indicator+code,False
12,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe1.csv'; dataset_description='The dataset contains various indicators related to demographic and health statistics for Sao Tome and Principe in 2008, sourced from the DHS Quickstats Data file. The indicators include total fertility rate, contraceptive use among married women, unmet need for family planning, median age at first marriage and first sexual intercourse for women, and infant mortality rates. Each data entry includes information such as the indicator name, value, precision, country code, survey details, and characteristic categories. The dataset provi...",#indicator+code+label,#indicator+code,False
13,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe1.csv'; dataset_description='The dataset contains various indicators related to demographic and health statistics for Sao Tome and Principe in 2008, sourced from the DHS Quickstats Data file. The indicators include total fertility rate, contraceptive use among married women, unmet need for family planning, median age at first marriage and first sexual intercourse for women, and infant mortality rates. Each data entry includes information such as the indicator name, value, precision, country code, survey details, and characteristic categories. The dataset provi...",#indicator+label,#indicator+type,False
16,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Human Development Indicators for Zimbabwe.csv'; dataset_description='The dataset contains Human Development Indicators for Zimbabwe, with information on indicators such as Adolescent Birth Rate (births per 1,000 women ages 15-19) for the years 1990 to 1998. The data includes columns for country code, country name, indicator ID, indicator name, index ID, index name, indicator value, and year. The values in the dataset show a trend of decreasing Adolescent Birth Rate over the specified years.'; column_name:'indicator_id'; examples: ['abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr...",#indicator+id,#indicator+code,False
20,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Human Development Indicators for Somalia.csv'; dataset_description='The dataset contains Human Development Indicators for Somalia, with information on indicators such as Adolescent Birth Rate (births per 1,000 women ages 15-19) for the years 1990 to 1998. The data includes columns for country code, country name, indicator ID, indicator name, index ID, index name, indicator value, and year. The values in the dataset show a trend of Adolescent Birth Rate fluctuating over the years, ranging from 115.367 to 126.275.'; column_name:'indicator_id'; examples: ['abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'abr', 'ab...",#indicator+id,#indicator+code,False
23,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Organisation Name'; examples: ['A.B.D.D.M', 'A.B.D.D.M', 'A.B.D.D.M', 'A.B.D.D.M', 'ACT Alliance', 'ACT Alliance', 'ACT Alliance', 'ACT Alliance', 'Action Aid', 'Action Aid', 'Action Aid']'}]",#org+impl,#org+name,False
27,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Location'; examples: ['Giteranyi', 'Giteranyi', 'Muyinga', 'Muyinga', 'Gisigara', 'Rumonge', 'Gisuru', 'Kiniyia', 'Kayogoro', 'Nyanza-Lac', 'Giharo']'}]",#adm2+name,#loc+name,False
28,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Burundi- Muyinga, Cankuzo, Makamba, Ruyigi, Rutana, Rumonge: Operational Presence.xlsx'; dataset_description='None'; column_name:'Sector'; examples: ['Protection', 'Health and nutrition', 'Protection', 'Health and nutrition', 'Protection', 'Education', 'Protection', 'Protection', 'Health and nutrition', 'Health and nutrition', 'Education']'}]",#sector+name,#sector,False
33,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/PiN_VBG_2023_hdx.xlsx'; dataset_description='The dataset from the file ""PiN_VBG_2023_hdx.xlsx"" contains information on different municipalities within the Amazonas region, including their respective DIVIPOLA codes, population in 2022, severity level based on the Joint Intersectoral Analysis Framework (JIAF) scale, and the number of people in need (PiN). The severity levels range from 2 to 4, with corresponding PiN values provided. The dataset provides a snapshot of the population in need across various municipalities in the Amazonas region for the year 2022.'; column_name:'PiN - People in Need'; examples...",#inneed+f,#inneed,False
35,"[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/FieldsData_3W_PHL_Aklan.xlsx'; dataset_description='None'; column_name:'Organisation Name'; examples: ['Disaster Risk Reduction Network Philippines (DRRNet)', 'Disaster Risk Reduction Network Philippines (DRRNet)', 'Akbayanihan Foundation', 'Western Visayas Network of NGOs (WEVNet)', 'Western Visayas Network of NGOs (WEVNet)', 'RealLIFE Foundation, Inc.', 'Uswag Development Foundation, Inc.', 'Uswag Development Foundation, Inc.', 'Uswag Development Foundation, Inc.', 'GRF Hublag Foundation, Inc.', ""Consortium for People's Development - Disaster Response""]'}]",#org+impl,#org+name,False


(115, 4)


The above is a bit of a mixed bag:

- Some cases where the model is incorrect, typically because it did not have enough context in the prompt. For example, the rows where actual is "#org" and the prediction varies: the Excel sheet has more surrounding context than was provided to the model.

- Cases where the model's prediction seems reasonable (e.g. #beneficiary+type for column "Beneficiary type", #activity+type for column "Type of Activity").

- Administrative level disagreements, e.g. a 'Province' column where actual is #adm2+name but the model predicts #adm1+name.

- Various other mismatches around tag order, differing granularity of tags, and more.

The takeaway for the exceptions is that in most cases, it's the human-labeled data that could be improved.
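To make the categories above easier to count, the remaining mismatches could be bucketed automatically. The sketch below is one possible approach, not the notebook's method. It assumes that attribute order in an HXL hashtag is not significant, and the names `parse_hxl` and `classify_mismatch` are hypothetical.

```python
from typing import FrozenSet, Tuple


def parse_hxl(hashtag: str) -> Tuple[str, FrozenSet[str]]:
    """Split '#tag+attr1+attr2' into ('#tag', {'attr1', 'attr2'})."""
    tag, *attrs = hashtag.strip().lower().split("+")
    return tag, frozenset(a for a in attrs if a)


def classify_mismatch(actual: str, predicted: str) -> str:
    """Bucket a (label, prediction) pair by how the two hashtags differ."""
    tag_a, attrs_a = parse_hxl(actual)
    tag_p, attrs_p = parse_hxl(predicted)
    if tag_a != tag_p:
        return "different tag"
    if attrs_a == attrs_p:
        # Same tag and attribute set: only order or case can differ
        return "equivalent"
    if attrs_a < attrs_p:
        return "prediction more specific"
    if attrs_p < attrs_a:
        return "prediction less specific"
    return "same tag, different attributes"


# Pairs taken from the rows displayed above, plus one illustrative reordering
examples = [
    ("#indicator+code+label", "#indicator+code"),
    ("#org+impl", "#org+name"),
    ("#adm2+name", "#loc+name"),
    ("#affected+f+children", "#affected+children+f"),  # illustrative only
]
for actual, predicted in examples:
    print(f"{actual} vs {predicted}: {classify_mismatch(actual, predicted)}")
```

Counting the buckets over `breaks` would give a rough sense of how many mismatches are granularity differences rather than outright wrong tags, though the judgment of which side is right still needs a human reviewer.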

### Creating a better test set

Let's create a spreadsheet for human review of the predictions, with a `match` column flagging the failed ones.

In [110]:
X_test = read_prompts_file(TEST_FILE)
X_test = pd.DataFrame(X_test)
X_test["actual"] = results["actual"]
X_test["predicted"] = results["predicted"]
X_test = X_test.drop(columns=["messages", "expected"])
X_test["match"] = X_test['predicted'] == X_test['actual']

display(X_test)
print(results.shape)
X_test.to_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prediction-test-modified.xlsx", index=False)


Found 437 prompts

Data providers ['dhs' 'undp-human-development-reports-office' 'fieldsdata' 'immap'
 'hera-humanitarian-emergency-response-africa' 'cimp' 'meers' 'rca'
 'global-shelter-cluster']

 tag
#adm1           58
#adm2           55
#country        54
#affected       43
#org            43
#date           30
#indicator      27
#region         24
#inneed         18
#sector         17
#meta           13
#loc             9
#adm3            8
#activity        8
#population      7
#beneficiary     6
#status          6
#reached         2
#targeted        2
#value           1
#event           1
#contact         1
#service         1
#delivery        1
#frequency       1
#output          1
Name: count, dtype: int64


Unnamed: 0,Data description,HDX resource id,HDX dataset id,Data provider,Date created,Locations,URL,Text header,Data excerpt,prompt,tag,actual,predicted,match
0,"The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators for analysis and comparison across different regions of Sao Tome and Principe.",ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5,dhs-subnational-data-for-sao-tome-and-principe,dhs,2020-01-28,STP,https://data.humdata.org/dataset/760a1cb4-f0ee-4057-8865-fa9faba71ae1/resource/ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5/download/dhs-quickstats_subnational_stp.csv,ISO3,"['STP', 'STP', 'STP', 'STP', 'STP', 'STP', 'STP', 'STP', 'STP', 'STP', 'STP']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#country,#country+code,#country+code,True
1,"The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators for analysis and comparison across different regions of Sao Tome and Principe.",ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5,dhs-subnational-data-for-sao-tome-and-principe,dhs,2020-01-28,STP,https://data.humdata.org/dataset/760a1cb4-f0ee-4057-8865-fa9faba71ae1/resource/ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5/download/dhs-quickstats_subnational_stp.csv,Location,"['Região Centro', 'Região Sul', 'Região Norte', 'Região do Principe', 'Região Centro', 'Região Sul', 'Região Norte', 'Região do Principe', 'Região Centro', 'Região Sul', 'Região Norte']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#loc,#loc+name,#adm1+name,False
2,"The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators for analysis and comparison across different regions of Sao Tome and Principe.",ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5,dhs-subnational-data-for-sao-tome-and-principe,dhs,2020-01-28,STP,https://data.humdata.org/dataset/760a1cb4-f0ee-4057-8865-fa9faba71ae1/resource/ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5/download/dhs-quickstats_subnational_stp.csv,DataId,"['3517584', '2078280', '2086067', '2078289', '4220627', '5306972', '1182398', '4673618', '4220628', '5306973', '1182399']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#meta,#meta+id,#meta+id,True
3,"The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators for analysis and comparison across different regions of Sao Tome and Principe.",ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5,dhs-subnational-data-for-sao-tome-and-principe,dhs,2020-01-28,STP,https://data.humdata.org/dataset/760a1cb4-f0ee-4057-8865-fa9faba71ae1/resource/ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5/download/dhs-quickstats_subnational_stp.csv,Indicator,"['Total fertility rate 15-49', 'Total fertility rate 15-49', 'Total fertility rate 15-49', 'Total fertility rate 15-49', 'Married women currently using any method of contraception', 'Married women currently using any method of contraception', 'Married women currently using any method of contraception', 'Married women currently using any method of contraception', 'Married women currently using any modern method of contraception', 'Married women currently using any modern method of contraception', 'Married women currently using any modern method of contraception']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? 
resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators fo...",#indicator,#indicator+name,#indicator+name,True
4,"The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. The dataset provides insights into key health and demographic indicators for analysis and comparison across different regions of Sao Tome and Principe.",ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5,dhs-subnational-data-for-sao-tome-and-principe,dhs,2020-01-28,STP,https://data.humdata.org/dataset/760a1cb4-f0ee-4057-8865-fa9faba71ae1/resource/ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5/download/dhs-quickstats_subnational_stp.csv,CountryName,"['Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe', 'Sao Tome and Principe']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DHS Quickstats Data for Sao Tome and Principe.csv'; dataset_description='The dataset contains information on various indicators related to health and demographics in Sao Tome and Principe, with data points such as total fertility rate and contraceptive use among married women across different regions. The data includes details like survey year, survey ID, indicator ID, indicator order, and precision. Each data point is associated with specific regions within the country and includes values, characteristic labels, and survey type. 
The dataset provides insights into key health and demographic indicators fo...",#country,#country+name,#country+name,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
432,"The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number of families reached, and total number of people assisted.",e6d5a229-92d2-4f76-9a0b-2f26854e1012,iraq-4w-data-2015-2016,global-shelter-cluster,2017-06-12,IRQ,https://data.humdata.org/dataset/79165904-0ed0-417b-a743-6e7605b36d9a/resource/e6d5a229-92d2-4f76-9a0b-2f26854e1012/download/iraq-2015.csv,Gov'te Pcode_??? ????????,"['', '', '', '', '', '', '', 'IQ-G05', 'IQ-G08', 'IQ-G08', '']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#adm1,#adm1+code,#adm1+code,True
433,"The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number of families reached, and total number of people assisted.",e6d5a229-92d2-4f76-9a0b-2f26854e1012,iraq-4w-data-2015-2016,global-shelter-cluster,2017-06-12,IRQ,https://data.humdata.org/dataset/79165904-0ed0-417b-a743-6e7605b36d9a/resource/e6d5a229-92d2-4f76-9a0b-2f26854e1012/download/iraq-2015.csv,District_??????,"['Khanaqin_??????', '', '', 'Khanaqin_??????', '', 'Khanaqin_??????', 'Khanaqin_??????', 'Sulaymaniya_??????????', 'Amedi_????????', 'Zakho_????', 'Sumel']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#adm2,#adm2,#adm2+name,False
434,"The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number of families reached, and total number of people assisted.",e6d5a229-92d2-4f76-9a0b-2f26854e1012,iraq-4w-data-2015-2016,global-shelter-cluster,2017-06-12,IRQ,https://data.humdata.org/dataset/79165904-0ed0-417b-a743-6e7605b36d9a/resource/e6d5a229-92d2-4f76-9a0b-2f26854e1012/download/iraq-2015.csv,Dis't Pcode_??? ???????,"['', '', '', '', '', '', '', 'IQ-D033', 'IQ-D048', 'IQ-D051', '']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#adm2,#adm2+code,#adm2+code,True
435,"The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number of families reached, and total number of people assisted.",e6d5a229-92d2-4f76-9a0b-2f26854e1012,iraq-4w-data-2015-2016,global-shelter-cluster,2017-06-12,IRQ,https://data.humdata.org/dataset/79165904-0ed0-417b-a743-6e7605b36d9a/resource/e6d5a229-92d2-4f76-9a0b-2f26854e1012/download/iraq-2015.csv,Shelter arrangement\nSelect shelter arrangement of IDPs,"['', 'Camp', '', '', '', '', '', '', '', '', '']","[{'role': 'system', 'content': '  You are an assistant that replies with HXL tags and attributes""  '}, {'role': 'user', 'content': 'What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Iraq 2015.csv'; dataset_description='The dataset contains information related to humanitarian aid activities in Iraq in 2015, including details such as the date of activity, organizations involved, types of aid provided (shelter, non-food items), activity status (planned or delivered), modality of aid delivery (in-kind, cash, voucher), and beneficiary descriptions. The data includes specific quantities of items distributed, such as polyethylene floor insulation, tents, mattresses, blankets, hygiene kits, and various other items. The dataset also includes information on the average cost per family, number...",#loc,#loc+type,#loc+type,True


(437, 4)


In the [human-reviewed](https://docs.google.com/spreadsheets/d/1YwlubOYTVyipR26Drr8bqJae46dr7BuWgqmPbtEytRI/edit?usp=sharing) sheet, the mismatches ("breaks") fall into a few categories:


```
Fail reason,Count
#org+url not set correctly,3
Actual is incorrect,12
Admin level mismatch,20
Admin1 vs. Location ambiguity,1
"Both prediction and actual have reasonable, but different, attributes",3
Not detecting #region,1
Not detecting region,20
"Prediction agrees, attribute order differs",1
Prediction has correct granular attributes,21
Prediction has extraneous incorrect attributes,4
Prediction incorrect,11
Prediction missing attribute granularity,30
Prompt lacks sufficient context of all columns,42
```

Several of these categories are likely due to issues in the human-labeled data; for those, we should either correct the 'actual' value in the test set or remove the affected records.
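
As a rough illustration of that clean-up, the sketch below corrects or drops test records that a reviewer flagged as label problems. The field names (`prompt`, `fail_reason`, `corrected_actual`) and the choice of which fail reasons count as label issues are assumptions made for this example, not the layout of the actual review sheet.

```python
# Assumption: fail reasons that point at a labeling problem rather than a model error.
LABEL_ISSUES = {"Actual is incorrect"}

def clean_test_set(test_rows, review_rows, label_issues=LABEL_ISSUES):
    """Return test rows with flagged labels corrected or removed."""
    # Map prompt -> reviewer's corrected label (None when no correction was given)
    flagged = {
        r["prompt"]: r.get("corrected_actual")
        for r in review_rows
        if r["fail_reason"] in label_issues
    }
    cleaned = []
    for row in test_rows:
        if row["prompt"] not in flagged:
            cleaned.append(dict(row))  # untouched record
        elif flagged[row["prompt"]]:
            # Reviewer supplied a corrected label, so keep the record with it
            cleaned.append({**row, "actual": flagged[row["prompt"]]})
        # Otherwise the record was flagged without a correction and is dropped
    return cleaned

# Hypothetical data, for illustration only
test_rows = [
    {"prompt": "col A", "actual": "#loc+name"},
    {"prompt": "col B", "actual": "#date"},
    {"prompt": "col C", "actual": "#meta"},
]
review_rows = [
    {"prompt": "col A", "fail_reason": "Actual is incorrect", "corrected_actual": "#adm1+name"},
    {"prompt": "col B", "fail_reason": "Prediction incorrect", "corrected_actual": None},
    {"prompt": "col C", "fail_reason": "Actual is incorrect", "corrected_actual": None},
]
cleaned = clean_test_set(test_rows, review_rows)
```

Records whose failure was a genuine model error (here `col B`) are left in place, so the test set still measures those cases.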

Two categories stem from misidentified region-related tags and administrative levels. For these, we will improve the fine-tuning prompt.
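
To separate tag-level errors such as admin-level mismatches from attribute-only differences, predictions can be compared with the expected value part by part. The following is a minimal sketch: the function names and the returned labels are illustrative and do not correspond exactly to the categories used in the review sheet, and lower-casing plus whitespace removal is assumed to be an acceptable normalization.

```python
def parse_hxl(value):
    """Split an HXL string such as '#adm2+name' into (tag, set_of_attributes)."""
    parts = value.replace(" ", "").lower().split("+")
    tag = parts[0]
    attributes = {p for p in parts[1:] if p}
    return tag, attributes

def compare_hxl(predicted, actual):
    """Classify how a predicted HXL string differs from the expected one."""
    p_tag, p_attrs = parse_hxl(predicted)
    a_tag, a_attrs = parse_hxl(actual)
    if p_tag != a_tag:
        if p_tag.startswith("#adm") and a_tag.startswith("#adm"):
            return "admin level mismatch"
        return "tag mismatch"
    if p_attrs == a_attrs:
        return "match"  # attribute order is ignored
    if p_attrs < a_attrs:
        return "prediction missing attributes"
    if p_attrs > a_attrs:
        return "prediction has extra attributes"
    return "attributes differ"

print(compare_hxl("#adm2", "#adm2+name"))
print(compare_hxl("#adm1+name", "#adm2+name"))
```

Because attributes are compared as sets, a prediction that only differs in attribute order is counted as a match.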

Finally, many mismatches appear to occur where the data in the column alone is insufficient to determine the tag. It might help to add a general description of the data to the prompt.
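
One way to give the model more context is to include the names of the other columns in the same table alongside the column being tagged. The sketch below is only an illustration: `build_column_prompt` and the `column_name`, `examples` and `other_columns` fields are hypothetical names, not the prompt format produced by the data-generation notebook.

```python
def build_column_prompt(resource_name, dataset_description, column_name,
                        examples, all_columns, max_examples=5):
    """Assemble a user prompt for one column, including its sibling column names."""
    # Sibling columns give the model context about the table as a whole
    other_columns = [c for c in all_columns if c != column_name]
    return (
        "What are the HXL tags and attributes for a column with these details? "
        f"resource_name={resource_name!r}; "
        f"dataset_description={dataset_description!r}; "
        f"column_name={column_name!r}; "
        f"examples={list(examples)[:max_examples]!r}; "
        f"other_columns={other_columns!r}"
    )

# Hypothetical inputs, for illustration only
prompt = build_column_prompt(
    resource_name="Iraq 2015.csv",
    dataset_description="Humanitarian aid activities in Iraq in 2015",
    column_name="District",
    examples=["Khanaqin", "Zakho", "Sumel"],
    all_columns=["Governorate", "District", "District Pcode"],
)
print(prompt)
```

Listing sibling columns lets the model tell, for example, a district name column from a district code column even when many of the sample values are empty.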



### Predict using prompting only

In [31]:
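import json
import pprint

def generate_hxl_standard_prompt(local_data_file):
  """Build a system prompt from the released core hashtags and attributes in the HXL standard spreadsheet."""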

  core_hashtags = pd.read_excel(local_data_file, sheet_name='Core hashtags')
  core_hashtags = core_hashtags.loc[core_hashtags["Release status"] == "Released"]
  core_hashtags = core_hashtags[["Hashtag", "Hashtag long description", "Sample HXL"]]

  core_attributes = pd.read_excel(local_data_file, sheet_name='Core attributes')
  core_attributes = core_attributes.loc[core_attributes["Status"] == "Released"]
  core_attributes = core_attributes[["Attribute", "Attribute long description", "Suggested hashtags (selected)"]]

  print(core_hashtags.shape)
  print(core_attributes.shape)

  core_hashtags = core_hashtags.to_dict(orient='records')
  core_attributes = core_attributes.to_dict(orient='records')

  hxl_prompt= f"""
  You are an AI assistant that predicts Humanitarian Exchange Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

  CORE HASHTAGS:

  {json.dumps(core_hashtags,indent=4)}

  CORE ATTRIBUTES:

  {json.dumps(core_attributes, indent=4)}

  Key points:

  - ALWAYS predict hash tags
  - NEVER predict a tag which is not a valid core hashtag
  - NEVER start with an attribute, you must always start with a core hashtag
  - Always try and predict an attribute if possible

  You must return your result as a JSON record with the fields 'predicted' and 'reasoning', each is of type string.

  """

  print(len(hxl_prompt.split(" ")))
  print(hxl_prompt)
  return hxl_prompt

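def call_gpt(prompt, system_prompt, model, temperature, top_p):
    """Call the chat completions API and parse the JSON reply; returns a dict, or None if parsing fails."""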
    response = client.chat.completions.create(
        model=model,
        messages= [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=2000,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False
    )

    result = response.choices[0].message.content
    result = result.replace("```json","").replace("```","")
    try:
        result = json.loads(result)
        result["predicted"] = result["predicted"].replace(" ","")
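    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Treat unparsable or incomplete responses as "no result"
        result = None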
    return result

hxl_prompt = generate_hxl_standard_prompt(local_data_file)

results = []
model = "gpt-4o"
for index, p in X_test.iterrows():

    prompt = p["messages"][1]["content"]
    actual = p["messages"][2]["content"]

    result = call_gpt(prompt, hxl_prompt, model, 0.0, 0.1)

    if result is None:
        print("    !!!!! No LLM result")
        predicted = ""
        reasoning = ""
    else:
        predicted = result["predicted"]
        reasoning = result.get("reasoning", "")

    # Store the parsed prediction so the metrics below compare real values
    results.append({
        "prompt": prompt,
        "actual": actual,
        "predicted": predicted,
        "reasoning": reasoning
    })

    pprint.pp(prompt)
    print(f'   Actual    : {actual}')
    print(f'   Predicted : {predicted}')
    pprint.pp(f'   Reasoning : {reasoning}')
    print()

results = pd.DataFrame(results)
results.to_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-results.xlsx", index=False)

display(results)

output_prediction_metrics(results)




(44, 3)
(54, 3)
5240

  You are an AI assistant that predicts Humanitarian Markup Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

  CORE HASHTAGS:

  [
    {
        "Hashtag": "#access",
        "Hashtag long description": "Accessiblity and constraints on access to a market, distribution point, facility, etc.",
        "Sample HXL": "#access +type"
    },
    {
        "Hashtag": "#activity",
        "Hashtag long description": "A programme, project, or other activity. This hashtag applies to all levels; use the attributes +activity, +project, or +programme to distinguish different hierarchical levels.",
        "Sample HXL": "#activity +project"
    },
    {
        "Hashtag": "#adm1",
        "Hashtag long description": "Top-level subnational administrative area (e.g. a governorate in Syria).",
        "Sample HXL": "#adm1 +code"
    },
    {
        "Hashtag": "#adm2",
        "Hashtag long description": "Second-level subnationa

KeyboardInterrupt: 