# Introduction

This notebook explores using GPT-4 and GPT-3.5-Turbo for custom data quality tests, using the criteria used to identify [Data Grid datasets](https://centre.humdata.org/introducing-the-hdx-data-grid-a-way-to-find-and-fill-data-gaps/) on the fantastic [Humanitarian Data Exchange](https://data.humdata.org/) (HDX) Platform. 
See [this blog post]() for more details.

## Setup

1. For running OpenAI you will need to create a file called `api_key.txt` in the current directory and put your [OpenAI API key](https://beta.openai.com/account/api-keys) in there (just the API string, nothing else)
2. At time of writing submit an early access request for GPT-4 [here](https://openai.com/waitlist/gpt-4-api). You can also test with 'gpt-3.5-turbo' by setting the `model` variable below
3. Though the code below includes 

In [25]:
import numpy as np

import pandas as pd
import json
import os
import chardet
import shutil
import sys
import re
import traceback
import time

import numpy as np
from IPython.display import display, Markdown, Latex

import openai as ai
from openai import cli

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

pd.set_option("display.max_colwidth", None)

# File with HDX categories, as extracted from this document: https://data.humdata.org/dataset/2048a947-5714-4220-905b-e662cbcd14c8/resource/9d4121c6-b32b-4eb8-a707-209c79241970/download/state-of-open-humanitarian-data-2023.pdf
hdx_data_categories_file = "./data/HDX Data Grid Categories.csv"

# From HDX: https://data.humdata.org/dataset/kenya-production-of-rice-in-irrigation-schemes
irrigation_sheet="./data/number-of-acreage-under-irrigation.xlsx"

# From HDX: https://data.humdata.org/dataset/wfp-food-prices-for-chad
wfp_food_prices="./data/wfp_food_prices_tcd.csv"

# Sample of 200 Data Grid HDX datasets, with table excerpts from all tabular resources in the datase
# Code to produce this file can be happily provided on request to Medium author, but left out here to avoid 
# too much traffic on HDX
hdx_excerpts_file = "./data/datasets_excerpts.pkl"

# Open AI API key should be put into this file
ai.api_key_path = "./api_key.txt"

output_folder = "./data/"


## Analysis

First, let's identify datasets which were approved for DataGrid by checking the [data grid recipes](https://github.com/OCHA-DAP/data-grid-recipes) repo.

Read in Data Grid categories file ...

In [10]:
dg_categories = pd.read_csv(hdx_data_categories_file)
dg_categories = dg_categories[
    ["Category", "Subcategory", "Definition", "Datagrid recipe category"]
]
display(dg_categories)

Unnamed: 0,Category,Subcategory,Definition,Datagrid recipe category
0,Affected People,Internally Displaced Persons,Tabular data of the number of displaced people by location. Locations can be administrative divisions or other locations (such as camps) if an additional dataset defining those locations is also available.,IDPs
1,Affected People,Refugees and Persons of Concern,Tabular data of the number of refugees and persons of concern either in the country or originating from the country disaggregated by their current location. Locations can be administrative divisions or other locations (such as camps) if an additional dataset defining those locations is also available or if the locations' coordinates are defined in the tabular data.,REFUGEES POCs
2,Affected People,Returnees,Tabular data of the number of displaced people who have returned.,RETURNEES
3,Affected People,Humanitarian Needs,Tabular data of the number of people in need of humanitarian assistance by location and humanitarian cluster/sector.,HNO
4,Coordination & Context,3W - Who is doing what where,"List of organisations working on humanitarian issues, by humanitarian cluster/sector and disaggregated by administrative division.\n\nNote: An exception for the subnational rule is made for the IATI dataset which, if available, should always be included as an incomplete d",WHO WHAT WHERE
5,Coordination & Context,Funding,Tabular data listing the amount of funding provided by humanitarian cluster/sector.,FUNDING
6,Coordination & Context,Conflict Events,"Vector data or tabular data with coordinates describing the location, date, and type of conflict event.",CONFLICT EVENTS
7,Coordination & Context,Humanitarian Access,"Tabular or vector data describing the location of natural hazards, permissions, active fighting, or other access constraints that impact the delivery of humanitarian interventions.",HUMANITARIAN ACCESS
8,Coordination & Context,Climate Impact,"Tabular or vector data containing current and historical impacts of climate events relating to floods, droughts and storms. The data should specify the location of the event, date of the event, and contain at least one indicator of impact such as spatial extent of event, disruption to affected populations, destroyed infrastructure, and/or affected vegetation.",CLIMATE IMPACT
9,Food Security & Nutrition,Food Security,Vector data representing the IPC/CH acute food insecurity phase classification or tabular data representing population or percentage of population by IPC/CH phase and administrative division.,FOOD SECURITY


## Prompting GPT

Let's make it easier on the model by converting our categories information into text we can put in a prompt. GPT-4 will parse tables pretty well, but anything to reduce ambiguity is good ...

In [11]:
dg_categories["prompt_text"] = dg_categories.apply(
    lambda x: f"- Category '{x['Category']} : {x['Subcategory']}' is defined as: {x['Definition']}",
    axis=1,
)

category_prompt_text = dg_categories["prompt_text"].to_string(index=False, header=False)
display(category_prompt_text)

'                                                                                                                                                                                                                   - Category \'Affected People : Internally Displaced Persons\' is defined as: Tabular data of the number of displaced people by location. Locations can be administrative divisions or other locations (such as camps) if an additional dataset defining those locations is also available.\n                                              - Category \'Affected People : Refugees and Persons of Concern\' is defined as: Tabular data of the number of refugees and persons of concern either in the country or originating from the country disaggregated by their current location. Locations can be administrative divisions or other locations (such as camps) if an additional dataset defining those locations is also available or if the locations\' coordinates are defined in the tabular data.\n        

In [12]:
df = pd.read_excel(irrigation_sheet, sheet_name="Sheet1")
df = df.fillna("")
display(df)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,Table 3: Number of acreage under irrigation,,,,,,,,,,,
1,,,OVERALL,,Sub county,,,,,,,
2,,,,,Chepalungu,,,,Bomet Central,,,
3,,,,,Male,,Female,,Male,,Female,
4,,,N,%,N,%,N,%,N,%,N,%
5,What is the average size of land you own that is currently under irrigation?,0 - 2 acres,22,2.8%,4,2.2%,10,3.8%,3,1.7%,5,2.9%
6,,2 - 5 acres,6,.8%,2,1.1%,2,.8%,0,0.0%,2,1.2%
7,,5 - 10 acres,1,.1%,0,0.0%,0,0.0%,0,0.0%,1,.6%
8,,More than 10 acres,0,0.0%,0,0.0%,0,0.0%,0,0.0%,0,0.0%
9,,,760,96.3%,176,96.7%,251,95.4%,170,98.3%,163,95.3%


Typical table, with hierarchical columns and note in the first row. Let's convert it to a CSV string to be added in a prompt ...

In [13]:
csv_as_str = df[0:20].to_csv(index=False)
print(csv_as_str)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
Table 3: Number of acreage under irrigation,,,,,,,,,,,
,,OVERALL,,Sub county,,,,,,,
,,,,Chepalungu,,,,Bomet Central,,,
,,,,Male,,Female,,Male,,Female,
,,N,%,N,%,N,%,N,%,N,%
What is the average size of land you own that is currently under irrigation?,0 - 2 acres,22,2.8%,4,2.2%,10,3.8%,3,1.7%,5,2.9%
,2 - 5 acres,6,.8%,2,1.1%,2,.8%,0,0.0%,2,1.2%
,5 - 10 acres,1,.1%,0,0.0%,0,0.0%,0,0.0%,1,.6%
,More than 10 acres,0,0.0%,0,0.0%,0,0.0%,0,0.0%,0,0.0%
,None,760,96.3%,176,96.7%,251,95.4%,170,98.3%,163,95.3%
,Total,789,100.0%,182,100.0%,263,100.0%,173,100.0%,171,100.0%



In [None]:
def prompt_model(prompts, temperature=0.0, model="gpt-4"):
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for prompt in prompts:
        messages.append({"role": "user", "content": prompt})
        response = ai.ChatCompletion.create(
            model=model, temperature=temperature, messages=messages
        )
    return response["choices"][0]["message"]["content"]


prompts = []
prompts.append(
    f"Here is a list of HDX data categories with their definition: \n\n {category_prompt_text} \n\n"
)
prompts.append(
    f"Does the following table from file {irrigation_sheet} fall into one of the categories provided, if not say no. "
    f"If it does, which category and explain why? \n\n {csv_as_str} \n\n"
)

for model in ["gpt-3.5-turbo", "gpt-4"]:
    response = prompt_model(prompts, temperature=0.0, model=model)
    print(f"\n{model} Model response: \n\n{response}")

Good! Both correct! And importantly they predicted a negative case. This is important for our use-case.

Let's try with a table which is in Data Grid, Chad food prices.

In [7]:
df = pd.read_csv(wfp_food_prices)
df = df.fillna("")
csv_as_str = df[0:20].to_csv(index=False)

In [17]:
print(csv_as_str)

prompts = []
prompts.append(
    f"Here is a list of HDX data categories with their definition: \n\n {category_prompt_text} \n\n"
)
prompts.append(
    f"Does the following table from file {wfp_food_prices} fall into one of the categories provided, if not say no. "
    f"If it does, which category and explain why? \n\n {csv_as_str} \n\n"
)

for model in ["gpt-3.5-turbo", "gpt-4"]:
    response = prompt_model(prompts, temperature=0.0, model=model)
    print(f"\n{model} Model response: \n\n{response}")

date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,currency,price,usdprice
#date,#adm1+name,#adm2+name,#loc+market+name,#geo+lat,#geo+lon,#item+type,#item+name,#item+unit,#item+price+flag,#item+price+type,#currency,#value,#value+usd
2003-10-15,Barh El Gazal,Barh El Gazel Sud,Moussoro,13.640841,16.490069,cereals and tubers,Maize,KG,actual,Retail,XAF,134.0,0.2377
2003-10-15,Barh El Gazal,Barh El Gazel Sud,Moussoro,13.640841,16.490069,cereals and tubers,Millet,KG,actual,Retail,XAF,147.0,0.2608
2003-10-15,Lac,Mamdi,Bol,13.5,14.683333,cereals and tubers,Maize,KG,actual,Retail,XAF,81.0,0.1437
2003-10-15,Lac,Mamdi,Bol,13.5,14.683333,cereals and tubers,Maize (white),KG,actual,Retail,XAF,81.0,0.1437
2003-10-15,Logone Occidental,Lac Wey,Moundou,8.5666667,16.0833333,cereals and tubers,Millet,KG,actual,Retail,XAF,95.0,0.1685
2003-10-15,Logone Occidental,Lac Wey,Moundou,8.5666667,16.0833333,cereals and tubers,Sorghum,KG,actual,Retail,XAF,62.0,0.11
2003-10-15,Lo

Again, GPT-3.5-turbo was wrong as 'Food Prices' is not an accepted category, GPT-4 correct. Let's try with a few more files, including some which are in DataGrid, to see how well we do.

# Using multiple table excerpts and running for more datasets
OK, now let's loop through the Data Grid datasets, which should all be in an approved category.


In [86]:
def predict(data_excerpts, main_prompt, temperature=0.0):
    results = []
    for index, row in data_excerpts.iterrows():
        dataset = row["name"]
        title = row["title"]
        print(
            f"\n========================================= {dataset} =============================================\n"
        )

        prompts = []

        # Start the prompt by defining the categories we want to assign
        prompts.append(
            f"Here is a list of HDX data categories with their definition: \n\n {category_prompt_text} \n\n"
        )
        prompts.append(
            f"Here are excerpts from all the tables in this dataset: {title} ...\n\n"
        )

        # Build multiple prompts for each table excerpt for this dataset
        tables = row["table_excerpts"]
        for table in tables:
            table = json.loads(table)
            csv_as_str = table["table_excerpt"]
            sheet = table["sheet"]
            type = table["type"]
            filename = table["filename"]
            print(f"DATA > {filename} / {sheet}")
            prompts.append(
                f"Type {type} sheet {sheet} from file {filename} Table excerpt: \n\n {csv_as_str} \n\n"
            )

        # Finish up with our request
        prompts.append(main_prompt)

        actual_category = row["datagrid_category"]
        d = {
            "dataset_name": dataset,
            "filename": filename,
            "prompts": prompts,
            "actual_category": actual_category,
        }

        # Send our prompt array to two models
        for model in ["gpt-3.5-turbo", "gpt-4"]:
            # for model in ['gpt-3.5-turbo']:
            # GPT-4 is in test and can fail sometimes
            try:
                print(f"\nCalling model {model}")
                response = prompt_model(prompts, temperature=temperature, model=model)
                if "|" in response:
                    predicted_category = response.split("|")[1].strip()
                else:
                    predicted_category = response
                print(f"\n{model} Model response: \n\n{response}")
                match = actual_category == predicted_category
                d[f"{model}_response"] = response
                d[f"{model}_predicted_category"] = predicted_category
                d[f"{model}_match"] = match
                print(
                    f"******* RESULT: || {match} || prediced {predicted_category}, actual {actual_category} *******"
                )
            except Exception as e:
                print(e)
        results.append(d)

    results = pd.DataFrame(results)
    return results


data_excerpts = pd.read_pickle(hdx_excerpts_file)

data_excerpts = data_excerpts[data_excerpts["is_datagrid"] == True]

data_excerpts = data_excerpts.sample(min(150, data_excerpts.shape[0]), random_state=42)

main_prompt = (
            "Does the dataset fall into exactly one of the categories mentioned above, if not say no. "
            "If it does, add a pipe charatcter '|' before and after the top category and sub-category category and explain why it was chosen step-by-step.\n\n"
            "What is the second most likely category if you had to pick one (adding a ^ character either side)? \n\n"
)

results = predict(data_excerpts, main_prompt, temperature=0.0)
results.to_excel(f"{output_folder}/results.xlsx")

print("Done")



DATA > ./data/prompts/mozambique-attacks-on-aid-operations-education-health-and-protection/2020 SHCC Health Care Mozambique Data.xlsx / 2020 SHCC Mozambique Data

Calling model gpt-3.5-turbo

gpt-3.5-turbo Model response: 

|Affected People : Health Care| - The dataset falls under this category as it provides information on attacks on health care workers and facilities in Mozambique. The table excerpt shows data on the number of health workers killed, kidnapped, arrested, threatened, injured, and sexually assaulted, as well as the number of attacks on facilities that reported destruction, damage, and armed entry. 

^|Coordination & Context : Conflict Events| - The dataset could also fall under this category as it provides information on attacks on aid operations and health care in the context of conflict in Mozambique. However, the primary focus of the dataset is on the impact of conflict on health care, making it more suitable for the 'Affected People' category.
******* RESULT: || F

[32m2023-03-26 20:38:16.402[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4152 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4152 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above. The most suitable category for this dataset is:

|Population & Socio-economy : Poverty Rate|

The dataset contains information about poverty rates in Colombia at a subnational level, including the Multidimensional Poverty Index (MPI), headcount ratios, and other related indicators. This category is chosen because it specifically focuses on poverty rates and their disaggregation by administrative divisions.

The second most likely category would be:

^Affected People : Humanitarian Needs^

This category could be considered because the dataset provides information on the needs of the population in terms of poverty and deprivation, which can be relevant for humanitarian assistance planning and interventions. However, the primary f

[32m2023-03-26 20:46:17.367[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4185 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4185 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Population & Socio-economy : Poverty Rate|

This category was chosen because the dataset provides information on the Multidimensional Poverty Index (MPI) for subnational regions in the Democratic Republic of the Congo. The MPI is a measure of poverty that takes into account multiple dimensions, such as health, education, and living standards. The dataset includes information on the proportion of people who are MPI poor and experience deprivations in each of the indicators by subnational regions, as well as the contribution of deprivations to the MPI.

^Coordination & Context : Humanitarian Needs^

The second most likely category would be Coordination & Context : Humanitarian Needs, as the dataset provides information on the p

[32m2023-03-26 21:01:22.071[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4119 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4119 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above. The most suitable category for this dataset is:

|Population & Socio-economy : Poverty Rate|

This category is chosen because the dataset provides information on the Multidimensional Poverty Index (MPI) for subnational regions in Cameroon, which is a measure of poverty. The dataset includes data on various indicators related to health, education, and living standards, as well as the percentage of the population living in poverty in each region.

The second most likely category would be:

^Food Security & Nutrition : Food Security^

This category is chosen as the second most likely because the dataset includes information on indicators related to nutrition, which can be linked to food security. However, it is not the primary foc

[32m2023-03-26 21:12:21.135[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4304 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4304 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Coordination & Context : 3W - Who is doing what where|

This category was chosen because the dataset contains information about organizations working on humanitarian issues in Yemen, their activities, and their locations. The tables in the dataset provide details about the organizations, their acronyms, types, and the sectors or clusters they are working in, as well as the administrative divisions where they are operating.

^Geography & Infrastructure : Administrative Divisions^

The second most likely category would be "Geography & Infrastructure : Administrative Divisions" because the dataset also includes information about the administrative divisions in Yemen, such as admin1 and admin2 names and codes. However, this categ

[32m2023-03-26 21:30:54.948[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 10183 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 10183 tokens. Please reduce the length of the messages.

Calling model gpt-4


[32m2023-03-26 21:32:08.476[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 8192 tokens. However, your messages resulted in 10179 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 8192 tokens. However, your messages resulted in 10179 tokens. Please reduce the length of the messages.


DATA > ./data/prompts/health-facilities-in-sub-saharan-africa/Sub-Saharan_health_facilities.xlsx / Subsaharan_health_facilities

Calling model gpt-3.5-turbo

gpt-3.5-turbo Model response: 

|Health & Education : Health Facilities| - The dataset falls under the sub-category of 'Health Facilities' in the 'Health & Education' category. This is because the dataset provides tabular data with coordinates representing health facilities with some indication of the type of facility (clinic, hospital, etc.). The columns include information such as country, administrative division, facility name, facility type, ownership, latitude, longitude, and source.

^|Geography & Infrastructure : Populated Places|^ - The dataset could also fall under the sub-category of 'Populated Places' in the 'Geography & Infrastructure' category. This is because the dataset pro

[32m2023-03-26 21:37:20.178[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4403 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4403 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above. The most suitable category for this dataset is:

|Population & Socio-economy : Poverty Rate|

The dataset contains information about poverty rates in Libya at a subnational level, including the Multidimensional Poverty Index (MPI), headcount ratios, and other related indicators. This category is chosen because it specifically focuses on poverty rates and their distribution across different regions within a country.

The second most likely category would be:

^Affected People : Humanitarian Needs^

This category could be considered as the dataset provides information on the poverty rates and deprivation levels in different regions, which can be used to identify areas with higher humanitarian needs. However, the primary focus of 

[32m2023-03-26 21:58:20.210[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 7610 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 7610 tokens. Please reduce the length of the messages.

Calling model gpt-4


[32m2023-03-26 21:59:59.039[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 8192 tokens. However, your messages resulted in 11827 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 8192 tokens. However, your messages resulted in 11827 tokens. Please reduce the length of the messages.


DATA > ./data/prompts/libya-populated-places/libya-common-operational-dataset-final-2017-ocha.xlsx / ADMIN 1 - geodivision
DATA > ./data/prompts/libya-populated-places/libya-common-operational-dataset-final-2017-ocha.xlsx / ADMIN 2 - Mintiqua
DATA > ./data/prompts/libya-populated-places/libya-common-operational-dataset-final-2017-ocha.xlsx / Sheet1
DATA > ./data/prompts/libya-populated-places/libya-common-operational-dataset-final-2017-ocha.xlsx / ADMIN 3 - Baladiya
DATA > ./data/prompts/libya-populated-places/libya-common-operational-dataset-final-2017-ocha.xlsx / ADMIN 4 - Muhalla
DATA > ./data/prompts/libya-populated-places/libya-common-operational-dataset-final-2017-ocha.xlsx / Capitals

Calling model gpt-3.5-turbo


[32m2023-03-26 22:01:42.181[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4131 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4131 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into one of the categories mentioned above.

|Geography & Infrastructure : Populated Places|

This category is chosen because the dataset contains vector data or tabular data with coordinates representing the location of populated places (cities, towns, villages) in Libya. The dataset includes information on administrative divisions, populated places, and their coordinates.

^Geography & Infrastructure : Administrative Divisions^

The second most likely category would be Administrative Divisions, as the dataset also contains information about the administrative divisions of Libya, including names and unique identifiers. However, the primary focus of the dataset is on populated places, which is why the Populated Places category is more appropriate.
******* RESULT: || True || pred

[32m2023-03-26 22:19:15.700[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4116 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4116 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above. The most suitable category for this dataset is:

|Population & Socio-economy : Poverty Rate|

The dataset contains information on poverty rates in Mozambique, specifically the Multidimensional Poverty Index (MPI) and its components, such as health, education, and living standards. It also provides data on the population size and share by region, which is relevant to the sub-category of poverty rate.

The second most likely category would be:

^Food Security & Nutrition : Food Security^

Although the dataset does not directly provide data on food security, the Multidimensional Poverty Index (MPI) and its components, such as health, education, and living standards, can be related to food security and nutrition.
******* RESULT: ||

[32m2023-03-26 22:29:14.826[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4493 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4493 tokens. Please reduce the length of the messages.

Calling model gpt-4
Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)


DATA > ./data/prompts/ourairports-col/List of airports in Colombia (HXL tags).csv / 
DATA > ./data/prompts/ourairports-col/List of airports in Colombia (no HXL tags).csv / 

Calling model gpt-3.5-turbo

gpt-3.5-turbo Model response: 

|Affected People : Refugees and Persons of Concern| - The dataset does not fall into exactly one of the categories mentioned above. However, the closest category is 'Affected People: Refugees and Persons of Concern' as it provides information on airports in Colombia, which can be used to track the movement of refugees and persons of concern. 

The second most likely category would be '^Geography & Infrastructure: Airports^' as the dataset specifically provides information on airports in 

[32m2023-03-26 22:53:57.621[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4817 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4817 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Affected People : Internally Displaced Persons|

This category was chosen because the dataset contains tabular data of the number of displaced people (IDPs) by location, which includes information on their settlement type, coordinates, and other relevant details.

^Affected People : Returnees^

The second most likely category would be "Affected People : Returnees" because the dataset also includes information on returnees from internal displacement and returnees from abroad, along with their respective numbers and locations.
******* RESULT: || False || prediced Affected People : Internally Displaced Persons, actual Affected People : Returnees *******


DATA > ./data/prompts/ethiopian-airdromes/ETH_Airdromes.csv / 

Calling mo

[32m2023-03-26 23:35:04.302[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4218 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4218 tokens. Please reduce the length of the messages.

Calling model gpt-4
Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))


DATA > ./data/prompts/ukraine-border-crossings/ukr_border_crossings_090622.xlsx / Border Crossings

Calling model gpt-3.5-turbo

gpt-3.5-turbo Model response: 

|Geography & Infrastructure: Border Crossings| - The dataset falls under the sub-category of "Geography & Infrastructure" as it provides information on the location of border crossings between Ukraine and neighboring countries. The dataset includes the name of the border crossing in English and Ukrainian, the country it connects to, and the latitude and longitude coordinates of each crossing. 

^|Coordination & Context: Migration| - The second most likely category would be "Coordination & Context: Migration" as the dataset provides information on the moveme

[32m2023-03-26 23:49:17.140[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4117 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4117 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Population & Socio-economy : Poverty Rate|

This category was chosen because the dataset provides information on the Multidimensional Poverty Index (MPI) for subnational regions in Pakistan, which is a measure of poverty. The dataset includes data on various indicators related to health, education, and living standards, which contribute to the overall poverty rate in each region.

^Food Security & Nutrition : Food Security^

The second most likely category would be Food Security & Nutrition : Food Security because some of the indicators in the dataset, such as nutrition and child mortality, are related to food security and can have an impact on the overall food security situation in the regions. However, this category is not 

[32m2023-03-26 23:55:28.124[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 70993 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 70993 tokens. Please reduce the length of the messages.

Calling model gpt-4


[32m2023-03-26 23:58:09.223[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 8192 tokens. However, your messages resulted in 70988 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 8192 tokens. However, your messages resulted in 70988 tokens. Please reduce the length of the messages.


DATA > ./data/prompts/northeast-nigeria-displacement-for-borno-adamawa-and-yobe-states-bay-state/Round_42_IDP_Dataset_September_2022.xlsx / Sheet2
DATA > ./data/prompts/northeast-nigeria-displacement-for-borno-adamawa-and-yobe-states-bay-state/Round_42_IDP_Returnees_September_2022.xlsx / Sheet2
DATA > ./data/prompts/northeast-nigeria-displacement-for-borno-adamawa-and-yobe-states-bay-state/Round_41_IDP_Dataset_June_2022.xlsx / Sheet2
DATA > ./data/prompts/northeast-nigeria-displacement-for-borno-adamawa-and-yobe-states-bay-state/Round_41_IDP_Returnees_June_2022.xlsx / Sheet2

Calling model gpt-3.5-turbo

gpt-3.5-turbo Model response: 

|Affected People : Internally Displaced Persons| - The dataset contains tabular data of the number of displaced people by location, specifically for Borno, Adamawa, and Yobe States in Northeast Nigeria. The data

[32m2023-03-27 00:11:37.719[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4735 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4735 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into one of the categories mentioned above.

|Affected People : Humanitarian Needs| is the top category and sub-category for this dataset. This category was chosen because the dataset contains tabular data of the number of people in need of humanitarian assistance by location and humanitarian cluster/sector, which aligns with the definition of this category.

The second most likely category would be ^Coordination & Context : 3W - Who is doing what where^. This category is relevant because the dataset also provides information on the locations and administrative divisions where humanitarian assistance is needed, although it does not specifically list the organizations working on these issues.
******* RESULT: || True || prediced Affected People : Humanitarian Needs, actual Affecte

[32m2023-03-27 00:47:08.020[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4381 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4381 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Population & Socio-economy : Baseline Population|

I chose this category because the dataset contains information about the population of the State of Palestine, disaggregated by administrative divisions (Admin0, Admin1, and Admin2) and years (2020, 2021, and 2022). The data includes total population, as well as population by age and sex categories.

^Coordination & Context : Administrative Divisions^

The second most likely category would be Administrative Divisions, as the dataset also provides information about the administrative divisions of the State of Palestine, including their names and unique identifiers (p-codes). However, the primary focus of the dataset is on population statistics, which is why Baseline Population

[32m2023-03-27 01:07:29.132[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 30021 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 30021 tokens. Please reduce the length of the messages.

Calling model gpt-4
Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)


DATA > ./data/prompts/somalia-acute-malnutrition-burden-and-prevalence/Somalia 2022 Post Gu Total Acute Malnutrition Burden and Prevalence for Aug 2022 to Jul 2023 by District .xlsx / Sheet2
DATA > ./data/prompts/somalia-acute-malnutrition-burden-and-prevalence/2021 Post Gu AMN Burden and Prevalence - 9 Sep 2021.xlsx / Burden
DATA > ./data/prompts/somalia-acute-malnutrition-burden-and-prevalence/2021 Post Gu AMN Burden and Prevalence - 9 Sep 2021.xlsx / Prevalence
DATA > ./data/prompts/somalia-acute-malnutrition-burden-and-prevalence/FSNAU Nutrition Surveys data-Gu and Deyr 2020.xlsx / Deyr 2020
DATA > ./data/prompts/somalia-acute-malnutrition-burden-and-prevalence/FSNAU Nutrition Surveys data-Gu and Deyr 2020.xlsx /

[32m2023-03-27 01:19:32.519[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4235 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4235 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Affected People : Acute Malnutrition| is the top category and sub-category for this dataset. This category was chosen because the dataset contains tabular data specifying the global acute malnutrition (GAM) or severe acute malnutrition (SAM) rate by administrative division, which aligns with the definition of the "Affected People : Acute Malnutrition" category.

The second most likely category would be ^Food Security & Nutrition : Food Security^, as the dataset also relates to food security and nutrition issues, but it does not specifically provide information on the IPC/CH acute food insecurity phase classification.
******* RESULT: || False || prediced Affected People : Acute Malnutrition, actual Food Security & Nutrition : 

[32m2023-03-27 01:45:06.250[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4400 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4400 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Population & Socio-economy : Poverty Rate|

This category was chosen because the dataset provides information on poverty rates in Yemen at a subnational level, including the Multidimensional Poverty Index (MPI), headcount ratios, and other related indicators.

^Food Security & Nutrition : Food Security^

The second most likely category would be Food Security & Nutrition : Food Security, as poverty rates and food security are often closely related, and the dataset may provide insights into the food security situation in Yemen. However, the dataset does not specifically focus on food security indicators.
******* RESULT: || True || prediced Population & Socio-economy : Poverty Rate, actual Population & Socio-economy : Poverty Ra

[32m2023-03-27 01:52:29.177[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4220 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4220 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Coordination & Context : 3W - Who is doing what where|

This category was chosen because the dataset provides information on organizations working on humanitarian issues in Afghanistan, their operational presence, and capacity in various sectors such as Education in Emergencies, Emergency Shelter & Non-Food Items, etc. The dataset includes information on the location (administrative divisions) where these organizations are working, making it a perfect fit for the "3W - Who is doing what where" category.

^Affected People : Humanitarian Needs^

The second most likely category would be "Affected People : Humanitarian Needs" because the dataset indirectly provides information on the number of people in need of humanitarian assis

[32m2023-03-27 02:21:32.123[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4357 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4357 tokens. Please reduce the length of the messages.

Calling model gpt-4
Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))


DATA > ./data/prompts/somalia-roads/Roads Status.xlsx / Road access

Calling model gpt-3.5-turbo

gpt-3.5-turbo Model response: 

|Geography & Infrastructure| - Roads | 

This dataset falls under the category of "Geography & Infrastructure" and specifically the sub-category of "Roads". The table excerpt provided contains information on the status of roads in Somalia, including the names of the routes and their current status (open or closed). This information is important for humanitarian organizations to plan and execute aid delivery operations in the country.

The second most likely category for this dataset would be |Coordination & Context| - Humanitarian Access^. While the primary focus of the dataset is on the

[32m2023-03-27 02:57:50.539[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4192 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4192 tokens. Please reduce the length of the messages.

Calling model gpt-4
Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))


DATA > ./data/prompts/sind-safeguarding-healthcare-monthly-news-briefs-dataset/2016-2023 Attacks on Health Care Incident Data.xlsx / 2016-23 Attacks on Health Care 
DATA > ./data/prompts/sind-safeguarding-healthcare-monthly-news-briefs-dataset/2022 Attacks on Health Care Incident Data.xlsx / 2022 Attacks on Health Care 
DATA > ./data/prompts/sind-safeguarding-healthcare-monthly-news-briefs-dataset/2000-2022 Attacked and Threatened Health Care at Risk.xlsx / HealthCare @ Risk InteractivMap

Calling model gpt-3.5-turbo


[32m2023-03-27 03:08:10.143[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 5361 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 5361 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above.

|Coordination & Context : Conflict Events| - This category is chosen because the dataset contains information about attacks on healthcare facilities, health workers, and related incidents, which are considered conflict events. The data includes details about the date, location, perpetrators, weapons used, and the impact on health facilities and workers.

^Affected People : Humanitarian Needs^ - This could be considered as the second most likely category because the dataset provides information about the impact of these conflict events on health workers and facilities, which can be related to the humanitarian needs of the affected population. However, it does not directly provide data on the number of people in need of humanita

[32m2023-03-27 03:34:17.427[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4146 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4146 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above. The most suitable category for this dataset is:

|Population & Socio-economy : Poverty Rate|

The dataset contains information about poverty rates in Burkina Faso at the subnational level, including the Multidimensional Poverty Index (MPI), headcount ratios, and other related indicators. This category is chosen because it specifically focuses on poverty rates and their distribution across different regions.

The second most likely category would be:

^Affected People : Humanitarian Needs^

This category could be considered because the dataset provides information on poverty and deprivation, which can be related to humanitarian needs. However, it is not the primary focus of the dataset, so it is not the top choice.
******* RESUL

[32m2023-03-27 03:49:31.901[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4410 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 4410 tokens. Please reduce the length of the messages.

Calling model gpt-4

gpt-4 Model response: 

Yes, the dataset falls into exactly one of the categories mentioned above. The most suitable category for this dataset is:

|Population & Socio-economy : Poverty Rate|

The dataset contains information on poverty rates in Bangladesh, specifically the Multidimensional Poverty Index (MPI) for subnational regions. It provides data on the proportion of people who are MPI poor and experience deprivations in various indicators such as health, education, and living standards. The dataset also includes information on the total population, population share by region, and the number of MPI poor by region.

The second most likely category, if I had to pick one, would be:

^Food Security & Nutrition : Food Security^

Although the dataset does not directly provide data on food security, the Multidimensional Povert

[32m2023-03-27 04:03:58.719[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 7032 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 4097 tokens. However, your messages resulted in 7032 tokens. Please reduce the length of the messages.

Calling model gpt-4


[32m2023-03-27 04:07:27.440[0m | [1mINFO    [0m | [36mopenai.util[0m:[36mlog_info[0m:[36m67[0m - [1merror_code=context_length_exceeded error_message="This model's maximum context length is 8192 tokens. However, your messages resulted in 16748 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False[0m


This model's maximum context length is 8192 tokens. However, your messages resulted in 16748 tokens. Please reduce the length of the messages.
Done


In [87]:
def output_prediction_metrics(
    results, prediction_field="predicted_post_processed", actual_field="actual_category"
):
    """
    Prints out model performance report if provided results in the format:

    [
        {
            'prompt': ' \'ISO3\' | "[\'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\']"',
            'predicted': ' #country+code+iso3+v_iso3+',
            'predicted_post_processed': '#country+code',
            'expected': '#country+code'
        },
        ... etc ...
    ]

    Parameters
    ----------
    results : list
        See above for format
    prediction_field : str
        Field name of element with prediction. Handy for comparing raw and post-processed predictions.
    """
    y_test = []
    y_pred = []
    for index, r in results.iterrows():
        if actual_field not in r:
            print("Provided results do not contain expected values.")
            sys.exit()
        y_pred.append(r[prediction_field])
        y_test.append(r[actual_field])

    print(f"Results for {prediction_field}, {len(results)} predictions ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )

    return


results.fillna("", inplace=True)
results = results.drop_duplicates(subset=["dataset_name"])


def split_subcategory(category):
    if ":" in category:
        return category.split(":")[0].strip()
    else:
        return category


for c in [
    "gpt-3.5-turbo_predicted_category",
    "gpt-4_predicted_category",
    "actual_category",
]:
    results[f"{c}_1"] = results[c].apply(lambda x: split_subcategory(x))

for prediction_field in [
    "gpt-3.5-turbo_predicted_category_1",
    "gpt-4_predicted_category_1",
    "gpt-3.5-turbo_predicted_category",
    "gpt-4_predicted_category",
]:

    print(f"======> \n{prediction_field} ...")
    actual_category = "actual_category"
    if "_1" in prediction_field:
        actual_category += "_1"
    output_prediction_metrics(
        results.loc[
            (results["gpt-3.5-turbo_predicted_category"] != "")
            & (results["gpt-4_predicted_category"] != "")
        ],
        prediction_field=prediction_field,
        actual_field=actual_category,
    )

results.to_excel(f"{output_folder}/results_all.xlsx")

gpt-3.5-turbo_predicted_category_1 ...
Results for gpt-3.5-turbo_predicted_category_1, 53 predictions ...

Accuracy: 0.66
Precision: 0.86
Recall: 0.66
F1: 0.68
gpt-4_predicted_category_1 ...
Results for gpt-4_predicted_category_1, 53 predictions ...

Accuracy: 0.96
Precision: 0.97
Recall: 0.96
F1: 0.96
gpt-3.5-turbo_predicted_category ...
Results for gpt-3.5-turbo_predicted_category, 53 predictions ...

Accuracy: 0.57
Precision: 0.73
Recall: 0.57
F1: 0.6
gpt-4_predicted_category ...
Results for gpt-4_predicted_category, 53 predictions ...

Accuracy: 0.89
Precision: 0.92
Recall: 0.89
F1: 0.89


Let's look at times when GPT-4 didn't predict the category correctly ...

In [92]:
df = results.loc[results["gpt-4_match"] == False]

for index, row in df.iterrows():
    response = row["gpt-4_response"]
    predicted_second_category = response.split("^")[1].strip()
    print(f"Dataset: {row['dataset_name']}")
    # print(f"Dataset: {row['filename']}")
    print("")
    print(f"Actual: {row['actual_category']}")
    print(f"Predicted category: {row['gpt-4_predicted_category']}")
    print(f"Predicted second category: {predicted_second_category}\n")
    print(
        f"Secondary category matched: {predicted_second_category == row['actual_category']}"
    )
    print("=====================================================")

Dataset: mozambique-attacks-on-aid-operations-education-health-and-protection

Actual: Coordination & Context : Humanitarian Access
Predicted category: Coordination & Context : Conflict Events
Predicted second category: Health & Education : Health Facilities

Secondary category matched: False
Dataset: iraq-violence-against-civilians-and-vital-civilian-facilities

Actual: Coordination & Context : Humanitarian Access
Predicted category: Coordination & Context : Conflict Events
Predicted second category: Affected People : Humanitarian Needs

Secondary category matched: False
Dataset: south-sudan-access-incidents

Actual: Coordination & Context : Conflict Events
Predicted category: Coordination & Context : Humanitarian Access
Predicted second category: Coordination & Context : Conflict Events

Secondary category matched: True
Dataset: somalia-displacement-idps-returnees-baseline-assessment-iom-dtm

Actual: Affected People : Returnees
Predicted category: Affected People : Internally Displac

We can see in the above that in many cases GPT-4's second choice was the correct one. More than that, some datasets seem to contain two categories of information, in fact on the HDX they are used in multiple places at times. The problem is a bit more nuanced than presented here, for a future study we might instead take categories and predict datasets rather than the other way around.
