# Aardvark Metadata for HDX

This script harvests metadata in the Aardvark metadata schema from datasets on the [HDX site](https://data.humdata.org).

> Originally created by Gene Cheng [@Ziiiiing]() on Oct 24, 2021

In [1]:
import csv 
import time
import uuid
import geocoder
import urllib.request
from bs4 import BeautifulSoup

Before executing the script, manually change the following cell to your own **GeoNames** username. If you don't have an account, you can register a free one [here](http://www.geonames.org/login).

*Note: according to the GeoNames Terms and Conditions, the hourly limit for a personal account is 1000 credits, and 1 credit is 1 hit for a web service request.*

In [2]:
# Manual changes here
geonames_acc = '<your GeoNames username >'
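
The note above implies a budget of roughly 1000 requests per hour, and `find_bbox` below issues two GeoNames requests per place name (a name search, then a details lookup). A minimal sketch of a throttle that spaces calls evenly to stay under such a budget; `Throttle`, `max_per_hour`, and the injectable `clock`/`sleep` arguments are hypothetical names for illustration and are not part of the original script or of `geocoder`.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_hour` happen per hour."""

    def __init__(self, max_per_hour=1000, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # seconds between calls
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

One way to use it would be to create a single `Throttle()` and call its `wait()` method before each GeoNames request. Even spacing is stricter than the hourly limit requires, but it keeps the logic simple.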

### STEP 1: Find All Data Links from Search Page

There appear to be fewer than 200 geodata datasets published on this site, so we set `ext_page_size=200` in `home_url` to get all search results on one page.

Next, we use **Beautiful Soup** to find and store all data links in a list.

In [3]:
home_url = "https://data.humdata.org/search?ext_geodata=1&q=&ext_page_size=200"
home_page = urllib.request.urlopen(home_url).read()
soup = BeautifulSoup(home_page, "html.parser")

# find geodata links
data_urls = []
linkFields = soup.find_all('div', {'class': 'dataset-heading'})
for tag in linkFields:
    url = 'https://data.humdata.org' + tag.find('a')['href']
    data_urls.append(url)
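
The search URL and the dataset links above are built by string concatenation. As a hedged alternative, the same values can be assembled with the standard library's `urllib.parse`; `build_search_url` and `absolute_link` are hypothetical helper names introduced only for this sketch.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://data.humdata.org"


def build_search_url(page_size=200):
    """Return the HDX search URL restricted to geodata, with one page of results."""
    params = {"ext_geodata": 1, "q": "", "ext_page_size": page_size}
    return urljoin(BASE_URL, "/search") + "?" + urlencode(params)


def absolute_link(href):
    """Turn a site-relative href such as '/dataset/foo' into a full URL."""
    return urljoin(BASE_URL, href)


print(build_search_url())
print(absolute_link("/dataset/nepal-openstreetmap-extracts"))
```

`urljoin` also leaves an already absolute `href` unchanged, which plain concatenation would not.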

### STEP 2: Extract Metadata from Each Data Page

In [30]:
def find_download(files):
    datasets = []
    for file in files:
        download = 'https://data.humdata.org' + file.find('a', href=True)['href']
        ftype = ''
        fsize = ''
        spans = file.find('a', {'class': 'heading'}).find_all('span')
        for span in spans:
            if span['class'] == ['format-label']:
                ftype = span['data-format']
            if span['class'] == ['format-filesize-label']:
                fsize = span.text.strip()[1:-1]
        datasets.append([download, ftype, fsize])
    
    # consider the first shapefile as download file
    for data in datasets:
        if data[1] == 'shp':
            ftype_full = 'Shapefile'
            return data[0], ftype_full, data[2]
    
    # if no shapefile exists, try to find ARC/INFO Grid, GeoTIFF, Geodatabase, Geopackage files instead
    for data in datasets:
        if data[1] in ['arc/info grid', 'geotiff', 'geodatabase', 'geopackage']:
            ftype_full = data[1].capitalize()
            return data[0], ftype_full, data[2]
        
    # else, return nothing
    return '','',''
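
The selection rule in `find_download` (prefer a shapefile, otherwise fall back to the first resource in another accepted format) can be tried without any HTML. The sketch below works on plain `(url, format, size)` tuples. `pick_download` and `FORMAT_LABELS` are hypothetical names, and the display spellings in the mapping are assumptions; note that `str.capitalize()` lowercases everything after the first letter, so it turns `'geotiff'` into `'Geotiff'`.

```python
# Accepted formats mapped to display names (spellings are assumptions).
FORMAT_LABELS = {
    "shp": "Shapefile",
    "arc/info grid": "ARC/INFO Grid",
    "geotiff": "GeoTIFF",
    "geodatabase": "Geodatabase",
    "geopackage": "GeoPackage",
}


def pick_download(resources):
    """Return (url, label, size) for the preferred resource, or three empty strings.

    `resources` is a list of (url, format, size) tuples. A shapefile always
    wins; otherwise the first resource in any other accepted format is used.
    """
    normalized = [(url, fmt.strip().lower(), size) for url, fmt, size in resources]
    # First pass: a shapefile anywhere in the list takes priority.
    for url, fmt, size in normalized:
        if fmt == "shp":
            return url, FORMAT_LABELS[fmt], size
    # Second pass: first resource in any other accepted format.
    for url, fmt, size in normalized:
        if fmt in FORMAT_LABELS:
            return url, FORMAT_LABELS[fmt], size
    return "", "", ""


# Made-up resources for illustration.
sample = [("u1", "CSV", "1K"), ("u2", "GeoTIFF", "2M"), ("u3", "SHP", "3M")]
print(pick_download(sample))  # ('u3', 'Shapefile', '3M')
```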
        

In [67]:
# get the bounding box (west,north,east,south) for the given location
def find_bbox(location):
    if location == 'World':
        return '-180,90,180,-90'

    # start from an inverted box so the first resolved place replaces every edge;
    # a single place is just the one-element case of the '|'-separated list
    w_all, n_all, e_all, s_all = 180, -90, -180, 90
    resolved = False
    for place in location.split('|'):
        try:
            # two requests per place: a name search, then a details lookup by id
            g1 = geocoder.geonames(place, key=geonames_acc)
            gid = g1.geonames_id
            g2 = geocoder.geonames(gid, method='details', key=geonames_acc)
            bbox = g2.bbox

            # corners are indexed as [lat, lng]
            w = round(bbox['southwest'][1], 4)
            n = round(bbox['northeast'][0], 4)
            e = round(bbox['northeast'][1], 4)
            s = round(bbox['southwest'][0], 4)
        except Exception:
            # skip places GeoNames cannot resolve (no id or no bounding box)
            continue

        w_all = min(w_all, w)
        n_all = max(n_all, n)
        e_all = max(e_all, e)
        s_all = min(s_all, s)
        resolved = True

    # no place resolved: return an empty value rather than the inverted box
    if not resolved:
        return ''
    return ','.join(str(v) for v in (w_all, n_all, e_all, s_all))
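
For several locations, `find_bbox` widens one box until it covers every place. The merging step on its own, separated from the GeoNames lookups, might look like the sketch below. `merge_bboxes` and `format_bbox` are hypothetical helpers, and the coordinates are made up for illustration rather than taken from GeoNames.

```python
def merge_bboxes(boxes):
    """Return the smallest (west, north, east, south) box covering all boxes.

    Each box is a (west, north, east, south) tuple in decimal degrees.
    Returns None for an empty input. Boxes that cross the antimeridian
    (west > east) are not handled by this simple min/max approach.
    """
    boxes = list(boxes)
    if not boxes:
        return None
    west = min(b[0] for b in boxes)
    north = max(b[1] for b in boxes)
    east = max(b[2] for b in boxes)
    south = min(b[3] for b in boxes)
    return west, north, east, south


def format_bbox(box):
    """Join a (west, north, east, south) tuple into a comma-separated string."""
    return ",".join(str(round(v, 4)) for v in box)


# Illustrative coordinates only, not real GeoNames output.
a = (10.0, 20.0, 15.0, 12.5)
b = (12.0, 25.5, 30.0, 14.0)
print(format_bbox(merge_bboxes([a, b])))  # 10.0,25.5,30.0,12.5
```

Returning `None` for an empty input makes the "nothing resolved" case explicit instead of leaving the inverted starting box in the result.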


In [68]:
def collect_metadata(url):
    data_page = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(data_page, "html.parser")
    metadata = []
    
    alternativeTitle = soup.find('h1', {'class': 'itemTitle dataset-title'}).text.strip()
    title = alternativeTitle
    
    descriptionField = soup.find('div', {'class': 'notes embedded-content'}).find_all('p')
    description = ''.join(x.text.strip() for x in descriptionField)

    creator = soup.find('th', text = 'Contributor').findNext('td').text.strip()

    keywordField = soup.find('th', text = 'Tags').findNext('td').find_all('a')
    keyword = '|'.join(x.text.strip() for x in keywordField)
    
    try:
        updatedField = soup.find('th', text='Updated').findNext('td').text.strip()
        dd = updatedField.split()[0].zfill(2)
        yyyy = updatedField.split()[2]
        mm = str(time.strptime(updatedField.split()[1], '%B').tm_mon).zfill(2)
        dateIssued = '-'.join((yyyy,mm,dd))
    except Exception:
        dateIssued = ''


    temporalCoverage = soup.find('th', text='Date of Dataset').findNext('td').text.strip()
    fromY = temporalCoverage.split('-')[0].split()[-1]
    toY = temporalCoverage.split('-')[1].split()[-1]
    dateRange = '-'.join((fromY, toY)) 
    
    updateFrequency = soup.find('th', text='Expected Update Frequency').findNext('td').text.strip()

    locationField = soup.find('th', text='Location').findNext('td').find_all('a')
    spatialCoverage = '|'.join(x.text.strip() for x in locationField)
    bbox = find_bbox(spatialCoverage) 
    
    license = soup.find('th', text='License').findNext('td').text.replace('\n', '').replace('\t', '').strip()
    

    files = soup.find_all('li', {'class': 'resource-item'})
    download, formatElement, fileSize = find_download(files)
    
    resourceType = 'Vector data'
    resourceClass = 'Datasets'
    information = url
    identifier = url
    idElement = str(uuid.uuid4())
    isoTopCat = ''
    language = 'eng'
    provider = 'University of Minnesota'
    code = '99-1400'
    memberOf = '99-1400'
    status = 'Active'
    accrualMethod = 'HTML'
    dateAccessioned = time.strftime("%Y-%m-%d")
    rights = ''
    accessRights = 'Public'
    suppressed = 'FALSE'
    childRecord = 'FALSE'
    
    metadata = [title, alternativeTitle, description, language, creator, 
                resourceClass, isoTopCat, keyword, dateIssued,
                temporalCoverage, dateRange, updateFrequency, spatialCoverage, bbox, resourceType,
                formatElement, information, download, idElement, identifier, 
                provider, code, memberOf, status, accrualMethod, dateAccessioned, 
                rights, license, accessRights,
                suppressed, childRecord, fileSize]
    
    return metadata
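
The date handling in `collect_metadata` splits the page text by hand. A hedged sketch of the same two conversions with explicit fallbacks follows. It assumes the "Updated" text looks like `24 October 2021` and the "Date of Dataset" text looks like `July 12, 2020-July 21, 2020`; both formats are inferred from how the code above indexes the strings, not confirmed against the live site. `to_iso_date` and `year_range` are hypothetical names.

```python
from datetime import datetime


def to_iso_date(text):
    """Convert 'D Month YYYY' (e.g. '24 October 2021') to 'YYYY-MM-DD'.

    Returns '' when the text does not match, mirroring the empty
    fallback used for `dateIssued`.
    """
    try:
        return datetime.strptime(text.strip(), "%d %B %Y").strftime("%Y-%m-%d")
    except ValueError:
        return ""


def year_range(text):
    """Extract 'YYYY-YYYY' from a 'start-end' string; '' if it is malformed."""
    parts = text.split("-")
    if len(parts) != 2:
        return ""
    # The year is taken as the last whitespace-separated token of each half.
    years = [p.split()[-1] if p.split() else "" for p in parts]
    if not all(len(y) == 4 and y.isdigit() for y in years):
        return ""
    return "-".join(years)


print(to_iso_date("24 October 2021"))              # 2021-10-24
print(year_range("July 12, 2020-July 21, 2020"))   # 2020-2020
```

Unlike the indexing in `collect_metadata`, `year_range` returns an empty string instead of raising when the text contains no hyphen.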

In [69]:
# iterate over each data url to extract metadata
all_metadata = []
for count, url in enumerate(data_urls, start=1):
    print('>>> [{}/{}] harvesting dataset:\n{}'.format(count, len(data_urls), url))
    all_metadata.append(collect_metadata(url))

# after the loop, remove datasets without available download files in Shapefile,
# ARC/INFO Grid, GeoTIFF, Geodatabase, Geopackage format
# (index 17 is the download link in the list built by collect_metadata)
all_metadata = [x for x in all_metadata if x[17]]

>>> [1/197] harvesting dataset:
https://data.humdata.org/dataset/beirut-port-explosion-operational-zones
>>> [2/197] harvesting dataset:
https://data.humdata.org/dataset/population-potentially-exposed-to-floods-between-12-21-july-2020-in-bangladesh
>>> [3/197] harvesting dataset:
https://data.humdata.org/dataset/satellite-detected-water-extent-as-of-21-july-2020-over-northwestern-region-of-bangladesh
>>> [4/197] harvesting dataset:
https://data.humdata.org/dataset/water-extent-as-of-20-july-2020-over-the-northeastern-region-of-bangladesh
>>> [5/197] harvesting dataset:
https://data.humdata.org/dataset/satellite-detected-water-extent-as-of-19-july-2020-of-bangladesh
>>> [6/197] harvesting dataset:
https://data.humdata.org/dataset/water-extent-as-of-18-july-2020-over-eastern-part-of-sylhet-division-bangladesh
>>> [7/197] harvesting dataset:
https://data.humdata.org/dataset/satellite-detected-water-extent-as-of-14-july-2020-over-province-2-of-nepal
>>> [8/197] harvesting dataset:
https://

>>> [62/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_roads
>>> [63/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_health_facilities
>>> [64/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_sea_ports
>>> [65/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_points_of_interest
>>> [66/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_waterways
>>> [67/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_airports
>>> [68/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_railways
>>> [69/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_education_facilities
>>> [70/197] harvesting dataset:
https://data.humdata.org/dataset/hotosm_irl_north_financial_services
>>> [71/197] harvesting dataset:
https://data.humdata.org/dataset/caracterizacion-wash-2019
>>> [72/197] harvesting dataset:
https:/

>>> [131/197] harvesting dataset:
https://data.humdata.org/dataset/wildfires-south-of-beirut-in-aley-and-chouf-districts-lebanon
>>> [132/197] harvesting dataset:
https://data.humdata.org/dataset/wildfires-east-of-tyre-in-south-and-el-nabatieh-governorates-lebanon
>>> [133/197] harvesting dataset:
https://data.humdata.org/dataset/waters-extents-as-of-11-october-2019-over-logone-et-chari-department-far-north-region-of-c
>>> [134/197] harvesting dataset:
https://data.humdata.org/dataset/damage-assessment-in-the-southeastern-part-of-new-mirpur-azad-jammu-and-kashmir-pakistan-a
>>> [135/197] harvesting dataset:
https://data.humdata.org/dataset/damage-assessment-of-tulehu-area-eastern-part-of-salahutu-district-maluku-tengah-regency-m
>>> [136/197] harvesting dataset:
https://data.humdata.org/dataset/damage-assessment-of-waai-area-eastern-part-of-salahutu-district-maluku-tengah-regency-mal
>>> [137/197] harvesting dataset:
https://data.humdata.org/dataset/damage-assessment-in-the-southern-pa

>>> [196/197] harvesting dataset:
https://data.humdata.org/dataset/nigeria-elevation-model-cod
>>> [197/197] harvesting dataset:
https://data.humdata.org/dataset/nepal-openstreetmap-extracts


Error missing parameter geonameId from JSON {'status': {'message': 'missing parameter geonameId', 'value': 14}}


### STEP 3: Write a CSV Report

In [70]:
fieldnames = ['Title', 'Alternative Title', 'Description', 'Language', 'Creator', 'Resource Class',
              'ISO Topic Categories', 'Keyword', 'Date Issued', 'Temporal Coverage', 'Date Range', 'Update Frequency', 'Spatial Coverage',
              'Bounding Box', 'Resource Type', 'Format', 'Information', 'Download', 'ID', 'Identifier', 'Provider', 'Code', 'Member Of', 'Status',
              'Accrual Method', 'Date Accessioned', 'Rights', 'License', 'Access Rights', 'Suppressed', 'Child Record', 'File Size']
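
The harvesting loop drops rows by the hard-coded position `x[17]`, which corresponds to the 'Download' column in the header above. A small sketch of deriving that position from the header instead, so the filter keeps working if columns are reordered; the shortened header, the sample rows, and the `keep_downloadable` helper are assumptions made for illustration.

```python
# Shortened header for illustration; the real list has 32 columns.
FIELDNAMES = ["Title", "Format", "Download", "File Size"]
DOWNLOAD_IDX = FIELDNAMES.index("Download")


def keep_downloadable(rows, download_idx=DOWNLOAD_IDX):
    """Keep only rows whose download column is a non-empty string."""
    return [row for row in rows if row[download_idx]]


# Made-up rows: the second one has no usable download link.
rows = [
    ["Dataset A", "Shapefile", "https://example.org/a.zip", "3M"],
    ["Dataset B", "", "", ""],
]
print(keep_downloadable(rows))
```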

In [71]:
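# newline='' lets the csv module control line endings (avoids blank rows on Windows);
# utf-8 keeps non-ASCII characters in titles and descriptions intact
with open('All_Metadata.csv', 'w', newline='', encoding='utf-8') as fw: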
    writer = csv.writer(fw)
    writer.writerow(fieldnames)
    writer.writerows(all_metadata)
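
Several harvested values contain commas (the bounding box) or `|` separators (keywords, locations). The sketch below writes a made-up row to an in-memory buffer and reads it back, to show that `csv.writer` quotes the bounding box so it survives as a single field. The three-column header and the row are illustrative assumptions, not harvested data.

```python
import csv
import io

fieldnames = ["Title", "Keyword", "Bounding Box"]
rows = [["Sample dataset", "roads|health", "10.0,25.5,30.0,12.5"]]  # made-up row

# newline="" leaves line endings to the csv module, as with a real file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(fieldnames)
writer.writerows(rows)

text = buffer.getvalue()
print(text)

# Reading the buffer back recovers the bounding box as one field.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1][2])  # 10.0,25.5,30.0,12.5
```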