<a href="https://colab.research.google.com/github/datakind/hxl-metadata-prediction/blob/main/generate-test-train-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook downloads data provided by the HDX team from a google drive folder. The data was captured using an [HXL crawl process](https://github.com/HXLStandard/hdx-hashtag-crawler)

# Setup

In [None]:
!pip install gdown==5.2.0
!pip install pandas==2.2.2
!pip install hdx-python-api==6.3.1
!pip install openai==1.35.3

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [77]:
import sys
import os
import requests
import gdown
import tarfile
import pandas as pd
import re
from sklearn.model_selection import train_test_split

from hdx.utilities.easy_logging import setup_logging
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset
import hxl
import json
import time
from openai import OpenAI
import numpy as np

if os.getenv("OPENAI_API_KEY") is None:
  from google.colab import userdata
  OPENAI_API_KEY =  userdata.get('OPENAI_API_KEY')
else:
  OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = OpenAI(
    api_key=OPENAI_API_KEY
)

input_options = hxl.input.InputOptions(http_headers={'User-Agent': "DK HXL Analaysis"})

# Google drive location of HDXHashtag crawler data. Shared with HDX team
DATA_GDRIVE="https://drive.google.com/file/d/1BDCuh0WVJWK1-1RMC-77cvh4H2Hep_ry/export?format=xlsx"
DATA_FILE="hdx-hxl-output.tgz"

# If using Colab, this is where Google drive gets mounted. Otherwise leave blank
GOOGLE_BASE_DIR = "/content/drive/MyDrive/Colab"

# Where to save local data files
LOCAL_DATA_DIR = f"{GOOGLE_BASE_DIR}/hxl-metadata-prediction/data/"

# This is the HXL schema sheet, search HDX to get this link
HXL_SCHEMA_RESOURCE_URL = "https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/export?format=xlsx"

# Number of records in data excerpts
DATA_EXCERPT_SIZE = 10

# Data Summary LLM
#DATA_SUMMARY_LLM = "gpt-4o-mini"
DATA_SUMMARY_LLM = "gpt-3.5-turbo"

local_data_file = LOCAL_DATA_DIR + "/" + DATA_FILE

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_rows', 200)

## Get HDX connection

In [2]:
agent_count = 0

def setup_hdx_connection(agent_name):
    try:
        Configuration.create(hdx_site="prod", user_agent=agent_name, hdx_read_only=True)
    except:
        print("Configuration already created, continuing ...")

# Note, if you run this twice you will get a 'Configuration already exists!' error, but it can be ignored
setup_hdx_connection(f"DK_UserAgent{agent_count}")

## Download HXL Core schema

In [5]:
local_data_file = LOCAL_DATA_DIR + "/hxl-core-schema.xlsx"

response = requests.get(HXL_SCHEMA_RESOURCE_URL)
with open(local_data_file, 'wb') as f:
    f.write(response.content)

df= pd.read_excel(local_data_file, sheet_name='Core hashtags')
hashtags_list = df['Hashtag'][1:].tolist()

df= pd.read_excel(local_data_file, sheet_name='Core attributes')
attributes_list = df['Attribute'][1:].tolist()

# Remove rows with disallowed tags or attributes
APPROVED_HXL_SCHEMA = hashtags_list + attributes_list

print("Approved HXL schema ...")
print(APPROVED_HXL_SCHEMA)

Approved HXL schema ...
['#access', '#activity', '#adm1', '#adm2', '#adm3', '#adm4', '#adm5', '#affected', '#beneficiary', '#capacity', '#cause', '#channel', '#contact', '#country', '#crisis', '#currency', '#date', '#delivery', '#description', '#event', '#frequency', '#geo', '#group', '#impact', '#indicator', '#inneed', '#item', '#loc', '#meta', '#modality', '#need', '#operations', '#org', '#output', '#population', '#reached', '#region', '#respondee', '#sector', '#service', '#severity', '#status', '#subsector', '#targeted', '#value', '+abducted', '+acronym', '+activity', '+adolescents', '+adults', '+approved', '+ar', '+bounds', '+budget', '+canceled', '+children', '+cluster', '+code', '+converted', '+coord', '+dest', '+displaced', '+elderly', '+elevation', '+email', '+en', '+end', '+es', '+f', '+fa', '+fr', '+funder', '+hh', '+i', '+id', '+idps', '+impl', '+incamp', '+ind', '+infants', '+infected', '+injured', '+killed', '+label', '+lat', '+lon', '+m', '+ms', '+name', '+noncamp', '+num

# Analysis

## Download HXL crawler data

This data was generated using the [HDX Hashtag Crawler](https://github.com/dividor/hdx-hashtag-crawler) over several days and saved to Google drive.

In [13]:
# Use gdown to download the file
gdown.download(DATA_GDRIVE, local_data_file, quiet=False, fuzzy=True)

tar = tarfile.open(local_data_file)
tar.extractall(LOCAL_DATA_DIR)
tar.close()

Downloading...
From: https://drive.google.com/uc?id=1BDCuh0WVJWK1-1RMC-77cvh4H2Hep_ry
To: /content/drive/MyDrive/Colab/hxl-metadata-prediction/data/hdx-hxl-output.tgz
100%|██████████| 19.3M/19.3M [00:00<00:00, 201MB/s]


## Identify Unique combinations of HXL tags we want for training data

In [14]:
# Open hdx-expanded-hashed-stats.csv in data/hdx-hxl

df = pd.read_csv(LOCAL_DATA_DIR + "/output/hdx-expanded-hashed-stats.csv")

# We'll keep one row per column, Hashtag with Attributes has what we require
df.drop(columns=['Attribute', 'Hashtag'], inplace=True)
df.drop_duplicates(inplace=True)

# Remove HXL tags row in the metadata (we keep them for actual data)
df = df[1:]

display(df.head())
print(df.shape)

print("Unique data providers ...")
print(len(df["Data provider"].unique()))

print("Unique HDX resource ids ...")
print(len(df["HDX resource id"].unique()))

  df = pd.read_csv(LOCAL_DATA_DIR + "/output/hdx-expanded-hashed-stats.csv")


Unnamed: 0,Hashtag with Attributes,Text header,Locations,Data provider,HDX dataset id,HDX resource id,Date created,Unnamed: 9,Hash
1,#affected+hh,Total IDP HH,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
2,#affected+idp+ind,Total IDP IND,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
4,#affected+idp+male,Total IDP Male Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
6,#affected+female+idp,Total IDP Female Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
8,#affected+ind+returnees,Total Returnees,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c


(487297, 9)
Unique data providers ...
120
Unique HDX resource ids ...
43074


Let's use the column hash created by the crawler to find unique combinations of tags

In [15]:
hash_count = df.groupby('Hash').size()
display(hash_count)


Hash
0x100556db35012c6b       101
0x102125f4dd16c64     190286
0x10309a2e5e2722ba       509
0x105b36aac3c9192f       693
0x105c6ee3379af31c       595
                       ...  
0xf0a7e4d9104f069         12
0xf93f0051e52a4d          28
0xfa11ad9f842a37d         17
0xfe0e278de9d0a33          1
0xfe8777dcd878424         20
Length: 644, dtype: int64

In [18]:
hash_resources = df.groupby('Hash')['HDX resource id'].nunique().sort_values(ascending=False)

for col in ['HDX resource id', 'HDX dataset id', 'Data provider']:
    hash_resources  = hash_resources .reset_index()
    hash_resources [f"Unique {col}"] = hash_resources ['Hash'].map(df.groupby('Hash')[col].unique())

display(hash_resources )

hash_resources.to_excel(f"{LOCAL_DATA_DIR}/hxl_hash_resources.xlsx", index=False)

Unnamed: 0,level_0,index,Hash,HDX resource id,Unique HDX resource id,Unique HDX dataset id,Unique Data provider
0,0,0,0x102125f4dd16c64,13858,"[51b2e4ec-aca5-4b97-bbb7-c005175b682e, a6ef8040-3b15-47ae-9973-1dbc113673cf, f579cf0e-5535-4414-897f-2f8c05105180, 66c62464-017b-4aa6-845f-9ec2487acb82, 91e1cb98-353b-487e-a14d-b0eea783da6f, 65f38...","[who-data-for-south-sudan, who-data-for-montenegro, who-data-for-zimbabwe, who-data-for-zambia, who-data-for-yemen, who-data-for-viet-nam, who-data-for-venezuela-bolivarian-republic-of, who-data-f...",[world-health-organization]
1,1,1,0x428a8e37940223d8,9328,"[2e130cdf-c850-4533-b2f3-e961adbec48a, 9e160d82-691d-49a6-979b-0ff0dbb6b7a8, 002501bc-7efb-4335-b672-d045cd76bc5b, 0d0e0fc4-e4f1-49cd-ad30-42cc8dc08b74, f2c1ea93-d241-413b-8140-c009da88d912, 23e36...","[world-bank-combined-indicators-for-zimbabwe, world-bank-trade-indicators-for-zimbabwe, world-bank-external-debt-indicators-for-zimbabwe, world-bank-climate-change-indicators-for-zimbabwe, world-b...",[world-bank-group]
2,2,2,0x31fda1ef985b4a59,3425,"[9c751883-698a-4a2c-9475-ff828e9c11db, 791b69af-df57-4157-96c0-c0d8d308315e, b055f1f7-8cdc-4ff8-a412-9f54d6e56c41, 6ae8568f-449b-4b59-9ad2-79d7df94cd9c, 4f4c7462-6b13-4125-a88a-f8e9ac0c837b, abb79...","[dhs-data-for-sao-tome-and-principe, dhs-data-for-rwanda, dhs-data-for-philippines, dhs-data-for-peru, dhs-data-for-paraguay, dhs-data-for-papua-new-guinea, dhs-data-for-pakistan, dhs-data-for-nig...",[dhs]
3,3,3,0x19598575d3397e19,3232,"[ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5, 8f239e93-76c0-4287-a414-3d17a5e55344, 040efd19-3d71-4e0b-8939-f6b46c465868, 8db41d1c-af8f-4f26-9d40-5b80b1bf72e8, 6482dc75-0e79-40d9-aacf-cdcde3c368a6, c9981...","[dhs-subnational-data-for-sao-tome-and-principe, dhs-subnational-data-for-rwanda, dhs-subnational-data-for-philippines, dhs-subnational-data-for-peru, dhs-subnational-data-for-paraguay, dhs-subnat...",[dhs]
4,4,4,0x16d2b679132fea10,2147,"[295cd9e4-8464-43ee-ad17-47196991a1f7, 34337d16-017d-4d69-834c-a5e0fc21a549, b51b8c0e-494d-488b-98d5-a70fd9451b90, 75076d6a-8d3f-49e3-b4f4-7c889bc82806, 7e0b2b37-73b8-4c69-bb11-6ba954fa0cd9, 08ea0...","[unhcr-population-data-for-world, unhcr-population-data-for-zwe, unhcr-population-data-for-zmb, unhcr-population-data-for-zaf, unhcr-population-data-for-yem, unhcr-population-data-for-wsm, unhcr-p...",[unhcr]
...,...,...,...,...,...,...,...
639,639,639,0x22def8e1b7b0c742,1,[68328f42-9276-423e-80d0-fe89630804ff],[3w-december-2017],[ocha-ethiopia]
640,640,640,0x4b7012601f2de402,1,[63143067-46e0-4fb5-b131-d83ee45122ab],[ethiopia-settlements],[ocha-ethiopia]
641,641,641,0x4aefd35864aaa20a,1,[6ddcfbc9-fa06-4b14-b9a4-ce96d3fae65e],[base-acceso-internet-personas-entre-los-5-y-19-anos-2018],[immap]
642,642,642,0x4acf4e36d67d877,1,[903326f2-b372-4786-9973-87226cb15e41],[people-in-need-2008-2019],[ocha-fts]


In [19]:
# Extract a single resource_id for each hash
resource_ids = hash_resources['Unique HDX resource id'].apply(lambda x: x[0])
print(resource_ids)

0      51b2e4ec-aca5-4b97-bbb7-c005175b682e
1      2e130cdf-c850-4533-b2f3-e961adbec48a
2      9c751883-698a-4a2c-9475-ff828e9c11db
3      ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5
4      295cd9e4-8464-43ee-ad17-47196991a1f7
                       ...                 
639    68328f42-9276-423e-80d0-fe89630804ff
640    63143067-46e0-4fb5-b131-d83ee45122ab
641    6ddcfbc9-fa06-4b14-b9a4-ce96d3fae65e
642    903326f2-b372-4786-9973-87226cb15e41
643    8b0feea6-20e4-45dc-aaae-c8b6fbd5a9f4
Name: Unique HDX resource id, Length: 644, dtype: object


In [20]:
df_subset = df[df['HDX resource id'].isin(resource_ids)]

print("Column data subset to one resource ID per hash ...")
print(df_subset.shape)

print("Unique data providers ...")
print(len(df_subset["Data provider"].unique()))

print("Unique HDX resource ids ...")
print(len(df_subset["HDX resource id"].unique()))

Column data subset to one resource ID per hash ...
(7834, 9)
Unique data providers ...
119
Unique HDX resource ids ...
644


### Download data excerpts

Using our subset of resource_ids for each hash, extract column data excerpts.

In [None]:
df2 = df_subset.copy()

df2['Data excerpt'] = ''

datasets_resources = df2[['HDX dataset id', 'HDX resource id']].drop_duplicates()
datasets_resources.reset_index(drop=True, inplace=True)

print("For each resource, extract a data excerpt for each column ...")

num_rows = datasets_resources.shape[0]
for index, row in datasets_resources.iterrows():

    if index % 10 == 0:
        print(f"Processing {index} of {num_rows} ({(index/num_rows)*100:.2f}%)")

    dataset_id = row['HDX dataset id']
    resource_id = row['HDX resource id']
    dataset = Dataset.read_from_hdx(dataset_id)
    if dataset is None:
        print(f"Dataset {dataset_id} not found!")
        continue
    resources = dataset.get_resources()
    for resource in resources:
        if resource['id'] == resource_id:
            print(f"    Accessing data for resource {resource_id}, {resource['name']}")
            try:
                url, path = resource.download(LOCAL_DATA_DIR)
                df2.loc[df2['HDX resource id'] == resource_id, 'File'] = path
                df2.loc[df2['HDX resource id'] == resource_id, 'URL'] = url

                with hxl.data(resource['url'], input_options) as source:
                    columns = [column.header for column in source.columns]
                    tags = [column.get_display_tag(sort_attributes=True) for column in source.columns]
                    data = {}
                    rowcount = 0
                    for row in source:
                        if rowcount > DATA_EXCERPT_SIZE:
                            break
                        i = 0
                        for colvalue in row:
                            colname = columns[i]
                            if colname not in data:
                                data[colname] = [colvalue]
                            else:
                                data[colname].append(colvalue)
                            i += 1
                        rowcount += 1

                    for col in columns:
                        if col in data:
                            #print(f"       Setting data excerpt for column {col} >> {data[col]} ...")
                            df2.loc[(df2['HDX resource id'] == resource_id) & (df2['Text header'] == col), 'Data excerpt'] = str(data[col])

            except Exception as e:
                print(f"Error accessing data for resource {resource_id}, {resource['name']} ... {e}")

display(df2)
print(df2.shape)
df2.to_csv(f"{LOCAL_DATA_DIR}/hxl_hash_resources_data.csv", index=False)



### Train and test data

In this section we will create train and test datasets for fine tuning.

In [6]:
data = pd.read_csv(f"{LOCAL_DATA_DIR}/hxl_hash_resources_data.csv")
print(data.shape)
display(data)

(7834, 12)


Unnamed: 0,Hashtag with Attributes,Text header,Locations,Data provider,HDX dataset id,HDX resource id,Date created,Unnamed: 9,Hash,Data excerpt,File,URL
0,#affected+hh,Total IDP HH,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[319283],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
1,#affected+idp+ind,Total IDP IND,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[1548732],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
2,#affected+idp+male,Total IDP Male Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[646805],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
3,#affected+female+idp,Total IDP Female Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[901927],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
4,#affected+ind+returnees,Total Returnees,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[587705],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
...,...,...,...,...,...,...,...,...,...,...,...,...
7829,#lat_deg,prevlat,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['-8.5', '-8.5', '-8.4', '-9.8', '-10.6', '-11.1', '-11', '-11.2', '-11.5', '-11.9', '-12.6']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing
7830,#lon_deg,prevlon,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['169.8', '169.8', '170.3', '170.5', '170.3', '170.1', '169.6', '169.7', '169.7', '170.1', '170.2']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing
7831,#period_date,datelabel,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['09 Mar', '09 Mar', '10 Mar', '10 Mar', '10 Mar', '11 Mar', '11 Mar', '11 Mar', '11 Mar', '12 Mar', '12 Mar']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing
7832,#x_time,hours,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['0', '6', '18', '30', '36', '42', '48', '54', '60', '66', '72']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing


#### Data Cleaning

In [7]:
data = data[data['Data excerpt'].notnull()]
data = data[data['Data excerpt'].str.contains(r'[A-Za-z0-9]')]

print(data.shape)

(6725, 12)


#### Remove disallowed HXL

In [8]:
def filter_for_schema(text):
    #print(f"Tokens before: {text}")
    if " " in text:
        text = text.replace(" ","")

    tokens_raw = text.split("+")
    tokens = [tokens_raw[0]]
    for t in tokens_raw[1:]:
        tokens.append(f"+{t}")

    filtered = []
    for t in tokens:
        if t in APPROVED_HXL_SCHEMA:
            if t not in filtered:
                filtered.append(t)
    filtered = "".join(filtered)

    if len(filtered) > 0 and filtered[0] != '#':
        filtered = ""

    # Add spaces back in
    # filtered = filtered.replace("+", " +")

    #print(f"        After: {filtered}")
    return filtered

def filter_disallowed_hxl(column_data, hxl_col = 'Hashtag with Attributes'):
    print("Before",column_data.shape)
    allowed = []
    disallowed = []
    for index, row in column_data.iterrows():
        if row[hxl_col] == filter_for_schema(row[hxl_col]):
            allowed.append(row)
        else:
            disallowed.append(row)
    allowed = pd.DataFrame(allowed)
    disallowed = pd.DataFrame(disallowed)
    print("After", allowed.shape)
    return allowed, disallowed

data, disallowed = filter_disallowed_hxl(data)
print(data.shape)

display(disallowed)

Before (6725, 12)
After (3356, 12)
(3356, 12)


Unnamed: 0,Hashtag with Attributes,Text header,Locations,Data provider,HDX dataset id,HDX resource id,Date created,Unnamed: 9,Hash,Data excerpt,File,URL
1,#affected+idp+ind,Total IDP IND,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[1548732],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
2,#affected+idp+male,Total IDP Male Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[646805],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
3,#affected+female+idp,Total IDP Female Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c,[901927],/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx,https://data.humdata.org/dataset/3554c498-660a-45cb-ada5-86a1fbcd6056/resource/26ecc26f-74e7-46af-b450-8872dca0b63b/download/adc_27jan-12_feb_update_public_v2.xlsx
15,#meta+appeal+type,atype,SSD,ifrc,ifrc-appeals-data-for-south-sudan,4110b824-3338-453f-ae5d-89ca80f5b147,2023-03-13,True,0x1d1434ee319a1be,"['0', '0', '0', '1', '0', '0', '2', '0', '0', '1', '0']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC Appeals Data for South Sudan.csv,https://data.humdata.org/dataset/db2f8c45-7992-4f23-99f5-e8b4d853b53d/resource/4110b824-3338-453f-ae5d-89ca80f5b147/download/appeals_data_ssd.csv
18,#meta+appeal+id,code,SSD,ifrc,ifrc-appeals-data-for-south-sudan,4110b824-3338-453f-ae5d-89ca80f5b147,2023-03-13,True,0x1d1434ee319a1be,"['MDRSS013', 'MDRSS012', 'MDRSS011', 'MDRSS009', 'MDRSS008', 'MDRSS007', 'MDRSS006', 'MDRSS005', 'MDRSS004', 'MDRSS003', 'MDRSS002']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC Appeals Data for South Sudan.csv,https://data.humdata.org/dataset/db2f8c45-7992-4f23-99f5-e8b4d853b53d/resource/4110b824-3338-453f-ae5d-89ca80f5b147/download/appeals_data_ssd.csv
...,...,...,...,...,...,...,...,...,...,...,...,...
7829,#lat_deg,prevlat,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['-8.5', '-8.5', '-8.4', '-9.8', '-10.6', '-11.1', '-11', '-11.2', '-11.5', '-11.9', '-12.6']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing
7830,#lon_deg,prevlon,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['169.8', '169.8', '170.3', '170.5', '170.3', '170.1', '169.6', '169.7', '169.7', '170.1', '170.2']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing
7831,#period_date,datelabel,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['09 Mar', '09 Mar', '10 Mar', '10 Mar', '10 Mar', '11 Mar', '11 Mar', '11 Mar', '11 Mar', '12 Mar', '12 Mar']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing
7832,#x_time,hours,VUT,brcmapsteam,cyclone-pam-path,a8ccd9d2-8328-487a-b04b-ca3f3f2e0ea3,2015-03-16,True,0x1d4a8deeb40f76ce,"['0', '6', '18', '30', '36', '42', '48', '54', '60', '66', '72']",/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Cyclone Pam Path.google sheet,https://docs.google.com/spreadsheets/d/1xFOPVLCKeVpLtM27loV3_zicG-xswOZk7SD_nAQ217Q/edit?usp=sharing


#### Plot tags

In [9]:
data['tag'] = data['Hashtag with Attributes'].apply(lambda x: x.split('+')[0])
tag_counts_train = data['tag'].value_counts()

print(tag_counts_train)

tag
#adm1           527
#adm2           477
#affected       380
#country        285
#date           242
#org            231
#adm3           188
#inneed         141
#sector         110
#geo            106
#targeted        77
#loc             67
#activity        65
#status          60
#population      57
#indicator       55
#region          49
#meta            42
#reached         36
#adm4            32
#event           14
#subsector       12
#beneficiary     12
#cause           11
#item            10
#value           10
#severity         9
#output           8
#crisis           6
#service          4
#currency         4
#adm5             4
#contact          4
#access           4
#description      3
#capacity         3
#impact           3
#frequency        2
#group            2
#modality         2
#delivery         1
#operations       1
Name: count, dtype: int64


#### Generate training prompts

####  Split by data provider organization

On HDX, the hierarchy is ...

Organization > datasets > resources > tables

A random train/test split will result in data from files in a dataset being in both train and test, which would pollute the test set with very similar data to training. So we will split by organization

In [10]:
def split_data(column_data, provider_col, test_size=0.2, random_state=42):
    """
    Perform train-test split on datasets, print information, and return X_train and X_test.

    The split is done by organizations, to try and avoid the situation where an org provides
    similar data files. Also, we exclude orgs which are subsidiaries from the test set, eg ocha-*
    as presumably each subsid will provide similar data. The aim is that the test set is new.

    Parameters:
    - column_data (pd.DataFrame): DataFrame containing column data.
    - provider_col (string): Name of column holding data providers.
    - test_size (float): The proportion of the dataset to include in the test split.
    - random_state (int): Seed for random number generation.

    Returns:
    - pd.DataFrame, pd.DataFrame: X_train, X_test
    """

    orgs_df = column_data.groupby(provider_col)[provider_col].count().sort_values(ascending=False)
    orgs_df = column_data.groupby(provider_col)[provider_col].count().sort_values(ascending=False).reset_index(name='count')
    all_orgs = orgs_df[provider_col].unique()

    # Split orgs to get 'Parent', eg 'ocha-*' -> 'ocha'
    orgs_df['org_parent'] = orgs_df[provider_col].str.split('-').str[0]

    # Count occurrences of each 'org_parent'
    org_parent_counts = orgs_df['org_parent'].value_counts().reset_index(name='count')

    # Filter to keep only those occurring once
    org_parents_single_occurrence = org_parent_counts[org_parent_counts['count'] == 1]

    # Get the 'org_parent' values that occur only once
    single_occurrence_org_parents = org_parents_single_occurrence['org_parent'].tolist()

    # Filter the original DataFrame to keep rows where 'org_parent' occurs only once
    org_parents_unique = orgs_df[orgs_df['org_parent'].isin(single_occurrence_org_parents)]

    print("\nOrgs which don't seem to be subsidiaries ...\n")
    display(org_parents_unique)

    single_entities = list(org_parents_unique[provider_col].unique())

    # Remove 'hdx' from single_entities, not good for testing as it's the folks that made HXL! Also some monolithic orgs with very similar data
    single_entities = [x for x in single_entities if not x in ['hdx','ourairports', 'un-ocha']]

    single_entities.sort()

    # Sample single-subsid orgs
    sample_size = int(len(single_entities)*test_size) - 1
    np.random.seed(42)
    X_test_orgs = np.random.choice(single_entities, sample_size)
    X_train_orgs = list(set(all_orgs)-set(X_test_orgs))

    print(f"Train orgs: {X_train_orgs}")
    print(f"Test orgs: {X_test_orgs}")

    # Extract column rows for datasets in X_train_datasets
    X_train = column_data[column_data[provider_col].isin(X_train_orgs)]

    # Extract column rows for datasets in X_test_datasets
    X_test = column_data[~column_data[provider_col].isin(X_train_orgs)]

    train_orgs = X_train[provider_col].unique()
    train_orgs.sort()
    test_orgs = X_test[provider_col].unique()
    test_orgs.sort()

    print(f"\nTrain orgs: {train_orgs}")
    print(f"\nTest orgs: {test_orgs}")

    print(f"\nTrain column data: {X_train.shape}")
    print(f"Test column data: {X_test.shape}")

    return X_train, X_test


X_train, X_test = split_data(data, 'Data provider', test_size=0.2, random_state=42)


Orgs which don't seem to be subsidiaries ...



Unnamed: 0,Data provider,count,org_parent
1,immap,221,immap
2,hdx,139,hdx
5,hera-humanitarian-emergency-response-africa,100,hera
11,redhum,69,redhum
13,fieldsdata,53,fieldsdata
17,ifrc,49,ifrc
20,wfp,41,wfp
25,brcmapsteam,31,brcmapsteam
28,dhs,28,dhs
29,standby-task-force,28,standby


Train orgs: ['cbes', 'unicef-esaro', 'unicef-data', 'iati', 'ocha-philippines', 'gec', 'ocha-pakistan', 'ocha-rosea', 'ocha-fts', 'world-health-organization', 'sadc_rvaa', 'cirrolytix', 'ocha-colombia', 'cesvi', 'blavatnik-school-of-government-university-of-oxford', 'lacso', 'soswcaf', 'wfp', 'ocha-niger', 'jhucsse', 'cpaor', 'water-point-data-exchange', 'ocha-myanmar', 'unesco', 'cfp-rco-nepal', 'unicef-rdc', 'hdx', 'rcpwca', 'clear', 'crs-waro', 'srsgcc', 'ocha-mali', 'brcmapsteam', 'ocha-burkina', 'ipc-cluster-guinea', 'ocha-haiti', 'ourairports', 'moving-energy-initiative', 'ocha-eritrea', 'ocha-ethiopia', 'cerf', 'ocha-opt', 'ocha-sudan', 'unodc', 'qcri', 'ipc', 'ocha-rosc', 'ewipa', 'unhcr', 'ocha-south-sudan', 'libya-ingo-forum', 'hxl', 'kenya-national-bureau-of-statistics', 'ocha-burundi', 'inter-sector-coordination-group', 'dalberg', 'ocha-libya', 'ocha-ukraine', 'ocha-rowca', 'ocha-yemen', 'redhum', 'infoculture', 'ocha-indonesia', 'cred', 'ocha-car', 'ocha-afghanistan', 'som

### Create prompt files

In [78]:
def create_prompt_file(X_train, prompt_col, filename):
    """
    Create a prompt file from a DataFrame.

    Args:
        X_train (pd.DataFrame): The DataFrame containing the prompts.
        prompt_col (str): The name of the column containing the prompts.
        filename (str): The name of the file to write the prompts to.
    """

    with open(filename, 'w') as f:
        for index, row in X_train.iterrows():
            f.write(row[prompt_col] + "\n")

    print(f"Prompts written to {filename}")


def generate_chat_prompt(dataset_name, resource_name, column_name, excerpt, \
                         hxl_tag=None, dataset_description="", add_response=True):
    """
    Generate a chat (eg for GPT-3.5-Turbo) fine tuning prompt for HXL tags given dataset, resource, column information.

    Parameters:
    - dataset_name (str): Name of the dataset.
    - resource_name (str): Name of the resource.
    - column_name (str): Name of the column.
    - excerpt (str): Examples or excerpt of the column.
    - hxl_tag (str, optional): HXL tags for the column. Default is None.
    - dataset_description (str, optional): Description of the dataset. Default is an empty string.
    - add_response (bool, optional): Whether to include the response in the prompt. Default is True.

    Returns:
    - dict: A dictionary containing the prompt and optional completion/response.
    """

    system_message = """
        You are an assistant that replies with HXL tags and attributes"
    """

    resource_name = resource_name.replace(f"{LOCAL_DATA_DIR}/",'')

    column_details = f"resource_name='{resource_name}'; " + \
                     f"dataset_description='{dataset_description}'; " + \
                     f"column_name:'{column_name}'; examples: {excerpt}"

    user_prompt = f"What are the HXL tags and attributes for a column with these details? {column_details}"

    prompt = {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_prompt},
        ]
    }

    if add_response:
        prompt["messages"].append({"role": "assistant", "content": hxl_tag})

    #prompt = json.dumps(prompt, index=4)

    return prompt

def generate_description(data, file_name):
    """
    Generate a short description of a dataset based on a summary of its content.

    Args:
        data (DataFrame): The input dataset for which a description needs to be generated.
        file_name (str): The name of the file.

    Returns:
        str: A short description of the dataset.
    """

    prompt = f"""
      This data file ...

      {file_name}

      Has data that looks like this ...

      {data.iloc[0:DATA_EXCERPT_SIZE].to_string()}

      Summarize this dataset
    """

    # Define conversation messages
    messages = [
        {"role": "system", "content": "You are a helpful assistant Summarizing data into one paragraph"},
        {"role": "user", "content": prompt}
    ]

    # Request a completion (description) from the OpenAI API
    try:
      response = client.chat.completions.create(
          model=DATA_SUMMARY_LLM,
          messages=messages,
          temperature=0,
          max_tokens=300,
          stop=["\n\n"]
      )
    except Exception as e:
      print(f"Error generating description for {file_name} ... {e}")
      return ""

    return response.choices[0].message.content

def generate_data_descriptions(resource_id, resource_name):
    """
    Generate a short description of a dataset based on a summary of its content.

    Args:
        data (DataFrame): The input dataset for which a description needs to be generated.

    Returns:
        str: A short description of the dataset.
    """

    # Read data in from resource_name

    try:

      if ".xlsx" in resource_name:
          df = pd.read_excel(resource_name)
      elif ".csv" in resource_name:
          df = pd.read_csv(resource_name)
      else:
          print(f"Unknown file type for {resource_name}")
          return

    except Exception as e:
        print(f"Error reading {resource_name} ... {e}")
        return

    dataset_description = generate_description(df, resource_name)
    #dataset_description = f"{dataset_description} Columns in the table: {df.columns.to_list()}"

    print(f"Description: {dataset_description}")

    return dataset_description


def generate_prompts(df,
                     heading_col='Text header',
                     resource_name_col='File', \
                     tag_col='Hashtag with Attributes', \
                     excerpt_col='Data excerpt', \
                     hxl_tag=False):
    """
    Generate a set of prompts for HXL tags from a DataFrame.

    Parameters:
    - df (DataFrame): Input DataFrame containing dataset, resource, column information.
    - hxl_tag (bool, optional): Whether to include HXL tags in the prompts. Default is False.
    - heading_col (str, optional): Name of the column containing column headers. Default is 'Text header'.
    - resource_name_col (str, optional): Name of the column containing resource names. Default is 'File'.
    - tag_col (str, optional): Name of the column containing HXL tags. Default is 'Hashtag with Attributes'.
    - excerpt_col (str, optional): Name of the column containing column data excerpts. Default is 'Data excerpt'.

    Returns:
    - str: A string containing JSON-formatted prompts.
    """

    unique_resources = df[resource_name_col].unique().shape[0]
    print(f"\n\nUnique resources: {unique_resources}\n")
    count = 0
    dataset_descriptions = {}
    for index, row in df.iterrows():
        if row['HDX resource id'] not in dataset_descriptions:
            dataset_descriptions[row['HDX resource id']] = \
            generate_data_descriptions(row['HDX resource id'], row[resource_name_col])
            count += 1
            if count % 10 == 0:
                print(f"Processed {round(count/unique_resources,2)*100}% resources ...")

    prompts = []
    for index, row in df.iterrows():
        if row['HDX resource id'] in dataset_descriptions:
            dataset_description = dataset_descriptions[row['HDX resource id']]
        else:
            "No dataset description, skipping ..."
            continue

        prompt = generate_chat_prompt('',  # Dataset name
                    row[resource_name_col], \
                    row[heading_col], \
                    row[excerpt_col], \
                    hxl_tag=row[tag_col], \
                    dataset_description=dataset_description, \
                    add_response=True)

        prompt["Data description"] = dataset_description

        for field in ['HDX resource id', 'HDX dataset id', 'Data provider', \
                      'Date created', 'Locations', 'URL', 'Text header',\
                      'Data excerpt']:
            prompt[field] = row[field]

        prompts.append(prompt)

    return prompts

def save_prompts(prompts, filename):

    with open(filename, 'w') as f:
        for prompt in prompts:
            f.write(json.dumps(prompt) + "\n")

    print(f"Prompts written to {filename}")


def save_all_prompts():
    for dataset_type in ["train","test"]:
        if dataset_type == "train":
            data = X_train
        else:
            data = X_test

        prompt_file = f"{LOCAL_DATA_DIR}/hxl_chat_prompts_{dataset_type}.jsonl"
        prompts = generate_prompts(data, hxl_tag=True)

        print(f"\n\nSaving {len(prompts)} prompts to {prompt_file} ...")

        save_prompts(prompts, prompt_file)


save_all_prompts()




Unique resources: 482

Description: The dataset from the file "DRC - Baseline Assessment - M23 Crisis 13 - February 2024.xlsx" contains information on the total number of internally displaced persons (IDPs) and returnees in the Democratic Republic of Congo. The data includes the total number of IDP households, IDP individuals, male and female IDPs, and returnees. Specifically, there are 319,283 IDP households, 1,548,732 IDP individuals, with 646,805 males and 901,927 females, and 587,705 returnees.
Description: The dataset contains earthquake data for various administrative regions in Afghanistan, including country name, admin1 name, latitude, longitude, aggregation type, indicator name, and indicator value. The data includes maximum earthquake values recorded in different regions, with corresponding latitude and longitude coordinates. The dataset provides insights into the seismic activity in different administrative areas of Afghanistan.
Description: The dataset contains informatio

  df = pd.read_csv(resource_name)


Description: The dataset contains information on demographics and locations of forcibly displaced and stateless persons globally. It includes data on the year, country of origin and asylum, population type, location, urban/rural classification, accommodation type, and demographic breakdown by age and gender. The data provides details such as the number of females and males in different age groups, as well as the total population count for each entry. The dataset covers various countries and years, offering insights into the displacement and statelessness of individuals across different regions.
Description: The dataset contains information on projects in Ethiopia for December 2023, including details such as project status, donors, implementing agencies, locations at regional, zonal, and woreda levels, sector clusters, activities, and geographical coordinates. The data includes multiple entries for different projects, each specifying the status, donors, implementing agencies, and locati

  df = pd.read_csv(resource_name)


Description: The dataset stored in the file global_pcodes.csv contains information about locations, administrative levels, P-Codes, names, parent P-Codes, and valid from dates. The data includes details such as country codes, admin levels, specific P-Codes for different regions, names of locations, parent P-Codes, and effective dates. The dataset appears to focus on administrative divisions within a country, with each entry representing a different region or area within Afghanistan.
Description: The dataset titled "3W_All_Clusters_March_2022.xlsx" contains information on different clusters, organizations, organization types, regions, and districts. The data includes columns for Cluster, Organization, Org Type, Region, and District, with examples of entries such as CCCM, Agency for Technical Cooperation and Development (ACTED), INGO, and various regions and districts. The dataset appears to provide details on organizations operating within different clusters and regions, along with thei

  df = pd.read_csv(resource_name)


Description: The dataset contains information on various measures and indicators related to COVID-19 response for different countries and regions. It includes data such as government responses, containment measures, economic support, vaccination policies, confirmed cases and deaths, vaccination status, and stringency indices. The dataset is structured with columns representing different aspects of the response efforts, with rows corresponding to specific dates and locations.
Description: The dataset contains information on health institutions in Haiti, including details such as administrative divisions (adm1_fr, adm1_ht, adm2code, adm2_en, adm2_fr, adm3code, adm3_en, adm3_fr), institution names, categories, types, codes, latitude (LatDD), and longitude (LongDD). The institutions vary in ownership (public, private for-profit, private non-profit, and mixed) and include health centers, hospitals, and dispensaries. The dataset provides a comprehensive overview of healthcare facilities in H

  df = pd.read_csv(resource_name)


Description: The dataset contains transaction data from the file "transactions.csv" with columns including Month, Reporting org id, Reporting org name, Reporting org type, Sector, Recipient country, Humanitarian indicator, Strict indicator, Transaction type, Activity id, Net money, and Total money. The data includes information on transactions made by the AECID Spanish Agency for International Development Cooperation in various sectors and countries during January 2020. The transactions involve commitments with different values for net money and total money.
Description: The dataset from the file "GHO-mid-year-update-2023.xlsx" contains columns labeled "Unnamed: 0" and "Unnamed: 1" with various entries, including information such as page titles, export details, dates, and data sources. The dataset appears to have some missing values denoted by "NaN" entries. The data seems to pertain to Humanitarian Action 2023, with details on plans, dates, and sources.
Description: The dataset "Droug

  warn("""Cannot parse header or footer so it will be ignored""")


Description: The dataset contains information on populated places in Iraq, including administrative divisions such as country, governorate, district, and sub-district levels. Each entry includes details such as place names in English and Arabic, corresponding administrative codes, longitude, latitude, and estimated population. The dataset provides a comprehensive overview of various populated places in Iraq and their respective demographic information.
Description: The dataset contains information on events in South Sudan from January to December 2022, including details such as event dates, locations at various administrative levels, population demographics, displacement information, movement triggers, arrival details, needs assessments, and shelter conditions. The data includes assessments of affected populations, including IDPs and returnees, as well as information on household composition, age groups, gender breakdowns, and specific needs such as food, shelter, water, sanitation, he

  df = pd.read_csv(resource_name)


Description: The dataset contains information on COVID-19 vaccinations, including data such as location, ISO code, date, total vaccinations, people vaccinated, people fully vaccinated, total boosters, daily vaccinations, and various vaccination rates per hundred and per million. The data includes details on different countries and dates, with some entries showing specific vaccination numbers while others have missing values. The dataset provides a comprehensive overview of vaccination progress across different locations and time periods.
Description: The dataset contains information on various communes in Madagascar, specifically focusing on the impact of drought in the Gand Sud region in September 2022. The data includes details such as the evaluation date, commune type, region, district, household and individual statistics, reasons for displacement, destinations of displaced individuals, and returnee information. Each row represents a different commune, with data on the impact of dro

  df = pd.read_csv(resource_name)


Description: The dataset contains information about various schools, including their names, locations, populations of pupils in 2012 and 2015, ISCED levels, addresses, operators, coordinates, and other details. The data includes details such as school names, dates started, pupil populations, geographical information, and operator information. The dataset seems to focus on schools in the Tawi-tawi region of the Autonomous Region in Muslim Mindanao.
Description: The dataset contains information related to IDPs (Internally Displaced Persons) and returnees in Uganda, specifically focusing on different administrative levels such as Admin 0, Admin 1, Admin 2, Admin 3, and Admin 4. The data includes details like the snapshot date, survey date, administrative codes and names, total number of IDPs and returnees in households and individuals, area of origin of IDPs, and the type of displacement (e.g., natural disasters like floods). The dataset provides insights into the displacement situation i

  df = pd.read_csv(resource_name)


Description: The dataset titled "Excess mortality during COVID-19 pandemic" contains information on various countries, regions, periods, years, months, weeks, dates, deaths, expected deaths, excess deaths, and total excess deaths percentage. The data includes details such as the country, region, period, year, month, week, date, number of deaths, expected deaths, and excess deaths for each entry. The dataset appears to track excess mortality during the COVID-19 pandemic, with a focus on the percentage of total excess deaths.
Processed 56.00000000000001% resources ...
Description: The dataset contains daily updates on COVID-19 cases in Indonesia, with columns for Date, Cumulative_cases, Recovered_cases, Total_death, Patient_under_treatment, New_case_perDay, Recovered-cases_perDay, Death_cases_perDay, and Treatment_cases_perDay. The data starts from March 2, 2020, and includes information on the number of cumulative cases, recoveries, deaths, patients under treatment, and daily changes in

  warn(msg)


Description: The dataset provided is stored in an Excel file and contains information related to various fields such as reporting week, status of response, start and end dates, organizations involved, locations, sector/cluster activities, monitoring indicators, quantities, and demographic data. The data includes details on different activities conducted by the International Organization for Migration (IOM) in Tigray, including health-related services like mobile health and nutrition teams, mental health support, and treatment for severe acute malnutrition. The dataset also includes information on the number of consultations, individuals reached, and services provided.
Description: The dataset contains information on language data for Ecuador, with columns including the names of administrative regions, language codes, number of named languages, main language, main language share, population totals, gender breakdown, literacy rates, and metadata details. The data shows the distribution o

  warn(msg)


Description: The dataset contains information on various organizations, partners, clusters, sub-clusters, regions, provinces, cities/municipalities, barangays, evacuation sites, activities, and their statuses (ongoing, completed, planned). It includes details such as start and finish dates, remarks, region codes, province codes, municipal city codes, organization acronyms, and partner organization acronyms. The data showcases different activities undertaken by organizations like construction, distribution of kits, repair of facilities, and hygiene promotion sessions in response to Typhoon Goni (Rolly) and Vamco (Ulysses) in the Bicol Region.
Description: The dataset contains information on various sectors such as Nutrition, ES NFI, WaSH, Education, Child Protection, FSL, Health, GBV, Protection, RCF, and Total for different states or regions. The data includes details like targets, reached values, and specific sector information for each region. The dataset provides a comprehensive ove

  warn(msg)


Description: The dataset contains information on various programs and projects funded by different agencies, such as UNHCR and UNICEF, in Eritrea. The data includes details like program titles, outcomes, sectors, funding requirements, funding received in different quarters, and donor information. The programs cover a range of areas including health and nutrition, water and sanitation, education, environment, capacity development, food security, gender empowerment, and social protection. Each entry also includes information on pillars, government representatives, and categories. The dataset provides a comprehensive overview of the funding and activities related to refugee and other persons of concern in Eritrea.
Description: The dataset from the file "200304_3W on NCDDS_EQ.xlsx" contains information on the status of projects, with a total of 1653 projects. Of these, 1241 projects are completed, 281 are ongoing, and 131 are planned. The data is structured in two columns, with the first c

  warn(msg)


Description: The dataset contains information on various organizations, partners, clusters, sub-clusters, regions, provinces, cities/municipalities, barangays, evacuation sites, activities, and statuses related to food security, agriculture, and livelihood initiatives. Each entry includes details such as the start and finish dates, the number of families served, and the contents of food packs distributed. The data covers different regions in the Philippines, including Ilocos, Cordillera Administrative Region, and Cagayan Valley. The activities mentioned are primarily focused on distributing food packs containing rice, monggo, dried fish, biscuits, sardines, cooking oil, sugar, and salt, with additional fortified rice packs included in some cases.
Processed 79.0% resources ...
Description: The dataset contains information on various organizations, their types, partners, clusters, sub-clusters, regions, provinces, cities/municipalities, barangays, evacuation centers, activities, statuses

  warn(msg)


Description: The dataset contains information on sexual violence incidents reported in various countries in 2015, including the number of staff affected, incident type, survivor gender, and date. The data includes countries such as Afghanistan, Belgium, Colombia, the Democratic Republic of the Congo (DRC), and Ethiopia. The incident types range from unknown to aggressive sexual behavior, unwanted sexual comments, sexual assault, and attempted sexual assault, with female survivors being the predominant gender represented in the dataset.
Description: The dataset contains information about various camps set up in the aftermath of the Sulawesi Earthquake in Indonesia. It includes details such as camp names, locations, duration, ownership, facilities available, demographics of residents, health and hygiene conditions, food security, access to education and healthcare, livelihood impact, security measures, and access to basic necessities. The camps vary in terms of size, conditions, and serv

  warn(msg)


Description: The dataset from the file "160704_5W_HDX.xlsx" contains information about different organizations, their sectors, activities, locations (province, canton, and parish), and status of their projects. The data includes details such as organization name, sector, activity, province, canton, parish, and project status. The dataset seems to focus on organizations involved in activities related to water, sanitation, hygiene, and health in the Esmeraldas province of Ecuador. The status of the projects varies between "En Ejecución" (In Execution) and "Finalizado" (Finished).
Description: The dataset contains information on the number of internally displaced persons (IDPs) in various countries. The data includes countries such as Uganda, Ethiopia, Kenya, Tanzania, South Sudan, Rwanda, Burundi, Angola, and Zambia, with corresponding IDP numbers. Some countries have missing data for IDPs. Ethiopia has the highest number of IDPs with 2,800,000, followed by South Sudan with 1,900,000, an

  warn(msg)


Description: The dataset from the file "160516_5W_ForHDX.xlsx" contains information about various organizations, sectors, provinces, and beneficiaries in different regions. The data includes columns such as ID, Gob, Cod.2, Organización, Sector, Provincia, Cantón, and Total beneficiaries. It provides details about organizations like Agencia Adventista de Desarrollo y Recursos Asistenciales and Aldeas Infantiles SOS operating in provinces like Manabí and Esmeraldas. The dataset also includes information on the number of beneficiaries in specific areas.
Error generating description for /content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Redhum-Ec 5w 2.0 Ronda 12 Versión 2061012-HDX.xlsx ... Error code: 400 - {'error': {'message': "Invalid 'messages[1].content': string too long. Expected a string with maximum length 1048576, but got a string with length 1983173 instead.", 'type': 'invalid_request_error', 'param': 'messages[1].content', 'code': 'string_above_max_length'}}
Description:

  warn(msg)


Description: The dataset contains information on various activities implemented in different districts and VDC wards, including details such as partner organizations, funding sources, activity types, sub-types, names, details, units, funding and activity statuses, planned and reached totals, start and end dates, and comments. The data includes activities like technical assistance, training, and construction demonstrations, with details on participants and trainees involved. The dataset covers multiple activities across different locations, with completion statuses ranging from ongoing to completed.
Description: The dataset contains information on various regions in Afghanistan, including province codes, district codes, operational and organizational presence in different sectors such as ESNFI, FSAC, health, nutrition, protection, and wash. Each region has data on the operational presence and capacity of different organizations in these sectors. The dataset provides details on the numbe

  warn(msg)


Description: The dataset contains information on regions, districts, communes, and fokontany in Madagascar, along with data on the number of affected individuals in terms of deaths, injuries, damaged houses, flooded houses, roofless houses, displaced persons, and displaced households. The data includes details on different demographic groups such as children under 5 years, pregnant women, persons with disabilities, and individuals over 60 years old. The dataset provides a breakdown of the impact of various events on different locations within the regions, including the number of affected individuals and households.
Description: The dataset contains information on conflict-induced displacements in Afghanistan in 2016, with data compiled by OCHA sub offices based on inter-agency assessments. The data includes details such as the date of displacement, province code and name of origin and displacement, as well as district code and name of origin. The dataset provides a snapshot of newly di

  warn(msg)


Description: The dataset contains information on various organizations, partners, clusters, sub-clusters, activity statuses, activity types, regions, provinces, cities/municipalities, barangays, evacuation sites, start and finish dates of activities in the Philippines as of December 2016. The data includes details such as organization names, partner organizations, types of activities (humanitarian or development), and the status of activities (ongoing, completed, planned). The dataset covers a range of activities including protection, mine action, DRRM projects, food security, agriculture, livelihood, and education, among others, implemented in different regions and provinces within the country.
Error reading /content/drive/MyDrive/Colab/hxl-metadata-prediction/data/DC_OP4_DANA.xlsx ... Excel file format cannot be determined, you must specify an engine manually.
Description: The dataset contains information on nutrition assistance in various countries in the Sahel region for the year 2

  warn(msg)


Description: The dataset contains information on organizations involved in activities related to water, sanitation, and hygiene (WASH) in the Esmeraldas province of Ecuador. The data includes details such as organization names, sectors, activities, quantities, units, locations (province, canton, parish), and status of the projects (e.g., in execution or completed). Organizations like OPS/OMS, Cruz Roja Ecuatoriana, and World Vision are mentioned, along with their specific WASH-related projects such as water treatment, distribution of hygiene kits, and water chlorination. The dataset provides a snapshot of ongoing and completed initiatives in the region.


  warn(msg)


Description: The dataset contains information on various organizations, sectors, types of activities, codes, quantities, units, locations (including provinces, cantons, and parishes), status of activities, and total number of beneficiaries. The data includes details such as organization names, sectors like Water, Sanitation, and Hygiene, specific activities undertaken by each organization, quantities of items distributed or actions taken, and the status of each activity (e.g., in execution or finalized). The dataset covers activities related to water purification, distribution of hygiene kits, construction of latrines, and other water, sanitation, and hygiene initiatives in provinces like Manabí and Esmeraldas in Ecuador.
Unknown file type for /content/drive/MyDrive/Colab/hxl-metadata-prediction/data/Ecuador Earthquake - April 2016 - Severity index.google sheet
Description: The dataset contains information on various projects in Burundi, including details such as reference numbers, dis

  warn(msg)


Description: The dataset titled "141121 LR Health Care Facilities.xlsx" contains information related to the deployment of the DHN/Standby Task Force for data collection of health facilities in Guinea, Liberia, and Sierra Leone. The data was collected, collated, and cleaned by Standby Task Force volunteers during September and October 2014. The dataset includes links to updated maps and is free for use for nonprofit humanitarian projects. Users are advised to check back regularly for new updates as the links may change. The dataset also includes a link to the Standby Task Force Maps portal for further information.
Processed 100.0% resources ...
Description: The dataset contains information on health centers in Sierra Leone, including details such as center ID, status, date opened, type of center, activity, location coordinates, address, and source of information. The data includes various health centers across different districts and chiefdoms in Sierra Leone, with details on their capa

  warn(msg)


Description: The dataset contains information on the density of HIV/AIDS incidence per 1000 cases by sex and municipality in Colombia. It includes data such as the number of cases, affected population, and density of incidence for different municipalities in Antioquia. The dataset also provides details on the source of the data and specifies that only live cases for the year 2017 were considered. The data is structured in columns with various unnamed headers, and it seems to be sourced from the SISPRO 2017 database.
Description: The dataset SECOP_HDX.xlsx contains information on public procurement contracts related to the Venezuelan refugee and migrant population. It includes variables describing entities involved in public contracts, contracts targeting the refugee and migrant population, and a description of the dataset. The geographical breakdown includes national, departmental, and in some cases municipal levels. The dataset covers the period from 2005 to 2021 and is updated as nee