# Introduction

This notebook downloads data provided by the HDX team from a google drive folder. The data was captured using an [HXL crawl process](https://github.com/HXLStandard/hdx-hashtag-crawler)

# Setup

1. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html) by selecting the installer that fits your OS version. Once it is installed you may have to restart your terminal (closing your terminal and opening again)
2. In this directory, open terminal
3. `conda env create -f environment.yml`
4. `conda activate hxl-prediction

In [123]:
import sys
import os
import gdown
import tarfile
import pandas as pd
import re

from hdx.utilities.easy_logging import setup_logging
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset
import hxl

input_options = hxl.input.InputOptions(http_headers={'User-Agent': "DK HXL Analaysis"})

DATA_GDRIVE="https://drive.google.com/file/d/1BDCuh0WVJWK1-1RMC-77cvh4H2Hep_ry/view?usp=drive_link"
DATA_FILE="hdx-hxl-output.tgz"
LOCAL_DATA_DIR = "./data/hdx-hxl"

# Number of records in data excerpts
DATA_EXCERPT_SIZE = 10

local_data_file = LOCAL_DATA_DIR + "/" + DATA_FILE

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_rows', 200)

In [90]:
agent_count = 0

def setup_hdx_connection(agent_name):
    try:
        Configuration.create(hdx_site="prod", user_agent=agent_name, hdx_read_only=True)
    except:
        print("Configuration already created, continuing ...")

# Note, if you run this twice you will get a 'Configuration already exists!' error, but it can be ignored 
setup_hdx_connection(f"DK_UserAgent{agent_count}")

# Analysis

## Download HXL crawler data

This data was generated using the [HDX Hashtag Crawler](https://github.com/dividor/hdx-hashtag-crawler) over several days and saved to Google drive.

In [14]:
# Use gdown to download the file
gdown.download(DATA_GDRIVE, local_data_file, quiet=False, fuzzy=True)

tar = tarfile.open(local_data_file)
tar.extractall(LOCAL_DATA_DIR)
tar.close()

LOCAL_DATA_DIR += "/output/"

Downloading...
From: https://drive.google.com/uc?id=1BDCuh0WVJWK1-1RMC-77cvh4H2Hep_ry
To: /Users/matthewharris/Desktop/git/hxl_metadata_prediction/data/hdx-hxl/hdx-hxl-output.tgz
100%|██████████| 19.3M/19.3M [00:01<00:00, 10.0MB/s]


In [15]:
print("Files downloaded:")
for file in os.listdir(LOCAL_DATA_DIR):
    print(file)

Files downloaded:
hdx-expanded-quickcharts.csv
hdx-hxl-dataset-list.csv
report-resource-patterns-by-ckan-tag.csv
report-ckan-tags-by-attribute.csv
hxl-quickcharts.csv.SAFE
report-resource-patterns-by-attribute.csv
hxl-quickcharts.csv
report-orgs-by-hashtag.csv
report-resource-patterns-by-hashtag.csv
report-quick-charts-by-org.csv
datasets-quick-charts.csv
report-resource-patterns-by-org.csv
report-orgs-by-attribute.csv
resource-patterns.csv
report-ckan-tags-by-hashtag.csv
hxl-ckan-tags.csv
report-resource-pattern-ratio-by-org.csv
report-resource-patterns-by-hashtag-attribute-pair.csv
report-resource-patterns-by-tagspec.csv
report-orgs-by-tagspec.csv
report-ckan-tags-by-org.csv
report-ckan-tags-by-tagspec.csv
report-orgs-by-hashtag-attribute-pair.csv
hdx-hashtag-stats.csv
report-ckan-tags-by-hashtag-attribute-pair.csv
hdx-expanded-hashed-stats.csv
hdx-expanded-stats.csv
report-datasets-by-org.csv
report-resources-by-org.csv


## Identify Unique combinations of HXL tags we want for training data

In [135]:
# Open hdx-expanded-hashed-stats.csv in data/hdx-hxl

df = pd.read_csv(LOCAL_DATA_DIR + "/output/hdx-expanded-hashed-stats.csv")

# We'll keep one row per column, Hashtag with Attributes has what we require
df.drop(columns=['Attribute', 'Hashtag'], inplace=True)
df.drop_duplicates(inplace=True)  

# Remove HXL tags row in the metadata (we keep them for actual data)
df = df[1:]

display(df.head())
print(df.shape)

print("Unique data providers ...")
print(len(df["Data provider"].unique()))

print("Unique HDX resource ids ...")
print(len(df["HDX resource id"].unique()))

  df = pd.read_csv(LOCAL_DATA_DIR + "/output/hdx-expanded-hashed-stats.csv")


Unnamed: 0,Hashtag with Attributes,Text header,Locations,Data provider,HDX dataset id,HDX resource id,Date created,Unnamed: 9,Hash
1,#affected+hh,Total IDP HH,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
2,#affected+idp+ind,Total IDP IND,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
4,#affected+idp+male,Total IDP Male Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
6,#affected+female+idp,Total IDP Female Ind,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c
8,#affected+ind+returnees,Total Returnees,COD,international-organization-for-migration,drc-displacement-idps-returnees-m23-crisis-north-kivu-province-baseline-assessment-iom-dtm,26ecc26f-74e7-46af-b450-8872dca0b63b,2023-10-16,True,0x2cc7fd3129c0d18c


(487297, 9)
Unique data providers ...
120
Unique HDX resource ids ...
43074


Let's use the column hash to find unique combinations of tags

In [136]:
hash_count = df.groupby('Hash').size()
display(hash_count)


Hash
0x100556db35012c6b       101
0x102125f4dd16c64     190286
0x10309a2e5e2722ba       509
0x105b36aac3c9192f       693
0x105c6ee3379af31c       595
                       ...  
0xf0a7e4d9104f069         12
0xf93f0051e52a4d          28
0xfa11ad9f842a37d         17
0xfe0e278de9d0a33          1
0xfe8777dcd878424         20
Length: 644, dtype: int64

In [137]:
hash_resources = df.groupby('Hash')['HDX resource id'].nunique().sort_values(ascending=False)

for col in ['HDX resource id', 'HDX dataset id', 'Data provider']:
    hash_resources  = hash_resources .reset_index()
    hash_resources [f"Unique {col}"] = hash_resources ['Hash'].map(df.groupby('Hash')[col].unique())

display(hash_resources )

hash_resources.to_excel("./data/hxl_hash_resources.xlsx", index=False)

Unnamed: 0,level_0,index,Hash,HDX resource id,Unique HDX resource id,Unique HDX dataset id,Unique Data provider
0,0,0,0x102125f4dd16c64,13858,"[51b2e4ec-aca5-4b97-bbb7-c005175b682e, a6ef8040-3b15-47ae-9973-1dbc113673cf, f579cf0e-5535-4414-897f-2f8c05105180, 66c62464-017b-4aa6-845f-9ec2487acb82, 91e1cb98-353b-487e-a14d-b0eea783da6f, 65f38...","[who-data-for-south-sudan, who-data-for-montenegro, who-data-for-zimbabwe, who-data-for-zambia, who-data-for-yemen, who-data-for-viet-nam, who-data-for-venezuela-bolivarian-republic-of, who-data-f...",[world-health-organization]
1,1,1,0x428a8e37940223d8,9328,"[2e130cdf-c850-4533-b2f3-e961adbec48a, 9e160d82-691d-49a6-979b-0ff0dbb6b7a8, 002501bc-7efb-4335-b672-d045cd76bc5b, 0d0e0fc4-e4f1-49cd-ad30-42cc8dc08b74, f2c1ea93-d241-413b-8140-c009da88d912, 23e36...","[world-bank-combined-indicators-for-zimbabwe, world-bank-trade-indicators-for-zimbabwe, world-bank-external-debt-indicators-for-zimbabwe, world-bank-climate-change-indicators-for-zimbabwe, world-b...",[world-bank-group]
2,2,2,0x31fda1ef985b4a59,3425,"[9c751883-698a-4a2c-9475-ff828e9c11db, 791b69af-df57-4157-96c0-c0d8d308315e, b055f1f7-8cdc-4ff8-a412-9f54d6e56c41, 6ae8568f-449b-4b59-9ad2-79d7df94cd9c, 4f4c7462-6b13-4125-a88a-f8e9ac0c837b, abb79...","[dhs-data-for-sao-tome-and-principe, dhs-data-for-rwanda, dhs-data-for-philippines, dhs-data-for-peru, dhs-data-for-paraguay, dhs-data-for-papua-new-guinea, dhs-data-for-pakistan, dhs-data-for-nig...",[dhs]
3,3,3,0x19598575d3397e19,3232,"[ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5, 8f239e93-76c0-4287-a414-3d17a5e55344, 040efd19-3d71-4e0b-8939-f6b46c465868, 8db41d1c-af8f-4f26-9d40-5b80b1bf72e8, 6482dc75-0e79-40d9-aacf-cdcde3c368a6, c9981...","[dhs-subnational-data-for-sao-tome-and-principe, dhs-subnational-data-for-rwanda, dhs-subnational-data-for-philippines, dhs-subnational-data-for-peru, dhs-subnational-data-for-paraguay, dhs-subnat...",[dhs]
4,4,4,0x16d2b679132fea10,2147,"[295cd9e4-8464-43ee-ad17-47196991a1f7, 34337d16-017d-4d69-834c-a5e0fc21a549, b51b8c0e-494d-488b-98d5-a70fd9451b90, 75076d6a-8d3f-49e3-b4f4-7c889bc82806, 7e0b2b37-73b8-4c69-bb11-6ba954fa0cd9, 08ea0...","[unhcr-population-data-for-world, unhcr-population-data-for-zwe, unhcr-population-data-for-zmb, unhcr-population-data-for-zaf, unhcr-population-data-for-yem, unhcr-population-data-for-wsm, unhcr-p...",[unhcr]
...,...,...,...,...,...,...,...
639,639,639,0x22def8e1b7b0c742,1,[68328f42-9276-423e-80d0-fe89630804ff],[3w-december-2017],[ocha-ethiopia]
640,640,640,0x4b7012601f2de402,1,[63143067-46e0-4fb5-b131-d83ee45122ab],[ethiopia-settlements],[ocha-ethiopia]
641,641,641,0x4aefd35864aaa20a,1,[6ddcfbc9-fa06-4b14-b9a4-ce96d3fae65e],[base-acceso-internet-personas-entre-los-5-y-19-anos-2018],[immap]
642,642,642,0x4acf4e36d67d877,1,[903326f2-b372-4786-9973-87226cb15e41],[people-in-need-2008-2019],[ocha-fts]


In [138]:
# Extract a single resource_id for each hash
resource_ids = hash_resources['Unique HDX resource id'].apply(lambda x: x[0])
print(resource_ids)

0      51b2e4ec-aca5-4b97-bbb7-c005175b682e
1      2e130cdf-c850-4533-b2f3-e961adbec48a
2      9c751883-698a-4a2c-9475-ff828e9c11db
3      ed7b5bd2-7818-4d7a-9ff0-8ba0d97bf7d5
4      295cd9e4-8464-43ee-ad17-47196991a1f7
                       ...                 
639    68328f42-9276-423e-80d0-fe89630804ff
640    63143067-46e0-4fb5-b131-d83ee45122ab
641    6ddcfbc9-fa06-4b14-b9a4-ce96d3fae65e
642    903326f2-b372-4786-9973-87226cb15e41
643    8b0feea6-20e4-45dc-aaae-c8b6fbd5a9f4
Name: Unique HDX resource id, Length: 644, dtype: object


In [140]:
df_subset = df[df['HDX resource id'].isin(resource_ids)]

print("Column data subset to one resource ID per hash ...")
print(df_subset.shape)

print("Unique data providers ...")
print(len(df_subset["Data provider"].unique()))

print("Unique HDX resource ids ...")
print(len(df_subset["HDX resource id"].unique()))

Column data subset to one resource ID per hash ...
(7834, 9)
Unique data providers ...
119
Unique HDX resource ids ...
644


### Download data excerpts

Uzing our subset of resource_ids for each hash, extract column data excerpts

In [157]:
df2 = df_subset.copy()

df2['Data excerpt'] = ''

datasets_resources = df2[['HDX dataset id', 'HDX resource id']].drop_duplicates()

print("For each resource, extract a data excerpt for each column ...")
for index, row in datasets_resources.iterrows():
    dataset_id = row['HDX dataset id']
    resource_id = row['HDX resource id']
    dataset = Dataset.read_from_hdx(dataset_id)
    resources = dataset.get_resources()
    for resource in resources:
        if resource['id'] == resource_id:
            print(f"    Accessing data for resource {resource_id}, {resource['name']}")
            url, path = resource.download(LOCAL_DATA_DIR)
            with hxl.data(resource['url'], input_options) as source:
                columns = [column.header for column in source.columns]
                tags = [column.get_display_tag(sort_attributes=True) for column in source.columns]
                data = {}
                rowcount = 0
                for row in source:
                    if rowcount > DATA_EXCERPT_SIZE:
                        break
                    i = 0
                    for colvalue in row:
                        colname = columns[i]
                        if colname not in data:
                            data[colname] = [colvalue]
                        else:
                            data[colname].append(colvalue)     
                        i += 1
                    rowcount += 1                            

                for col in columns:
                    print(f"       Setting data excerpt for column {col} >> {data[col]} ...")
                    df2.loc[(df2['HDX resource id'] == resource_id) & (df2['Text header'] == col), 'Data excerpt'] = str(data[col])

display(df2)
df2.to_excel("./data/hxl_hash_resources_data.xlsx", index=False)



For each resource, extract a data excerpt for each column ...
    Accessing data for resource 26ecc26f-74e7-46af-b450-8872dca0b63b, DRC - Baseline Assessment - M23 Crisis 13 - February 2024
       Setting data excerpt for column Total IDP HH >> [319283] ...
       Setting data excerpt for column Total IDP IND >> [1548732] ...
       Setting data excerpt for column Total IDP Male Ind >> [646805] ...
       Setting data excerpt for column Total IDP Female Ind >> [901927] ...
       Setting data excerpt for column Total Returnees >> [587705] ...
    Accessing data for resource dbf9b4bd-1321-4846-b6f0-4654509d3626, admin1-summaries-earthquake.csv
       Setting data excerpt for column country_name >> ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan'] ...
       Setting data excerpt for column admin1_name >> ['Kabul', 'Kapisa', 'Parwan', 'Maidan Wardak', 'Logar', 'Nangarhar', '

KeyError: 'id'