HDX crawler

The purpose of this project is to present a proof of concept for the Education Meta-dataset on the HDX Platform. The project implements a crawler that takes education datasets as input and outputs a populated meta-dataset. The crawler generates its input by scraping HTML/CSS elements from the HDX search results page for 'Education', then goes through each search result, dataset page, and dataset file to collect the data. The concept behind the meta-dataset and the details of the scraping algorithm are explained in the Concept and Algorithm sections. The caveats and assumptions of the crawler are listed in the Assumptions section.

Table of Contents

  1. How to Run
  2. Concept
  3. Algorithm
  4. Assumptions
  5. Scalability
  6. Output
  7. Use cases
  8. Future work

How to Run

Python libraries

Assuming you are running Python 3, the following libraries need to be installed (for example with pip) before running the crawler:

  1. Scrapy
  2. Pandas
  3. Numpy
  4. xlrd
  5. Pillow

Command line

  1. cd into the /tutorial/spiders folder
  2. Run scrapy crawl hdx
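
If you prefer to launch the spider from Python rather than the command line, a minimal sketch using Scrapy's CrawlerProcess is shown below. It assumes it is run from inside the Scrapy project so that the project settings, and the spider registered as hdx, can be found.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings (must be run from inside the Scrapy project)
process = CrawlerProcess(get_project_settings())

# 'hdx' is the spider name used by `scrapy crawl hdx`
process.crawl("hdx")

# Blocks until the crawl is finished
process.start()
```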

Concept

The meta-dataset takes the architecture of a star schema:

"An approach to answering multi-dimensional analytical queries swiftly in computing"

metadataset diagram

A populated row in the meta-dataset would look as follows:

{
    # dataset dimension
    dataset_id: ...
    who: ...
    what: ...
    when: ...
    where: ...
    on_web: ...
    on_hdx: ...
    open_data: ...
    link: ...

    # tag dimension
    tag_id: ...
    tag: ...

    # hxl dimension
    hxl_id: ...
    hxl: ...

    # indicator dimension
    indicator_id: ...
    indicator: ...

    # quality dimension
    quality_id: ...
    quality: ...
}
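
To make the star-schema idea concrete, the sketch below shows how hypothetical dimension tables could be joined onto a fact table of IDs with pandas to produce flat rows like the one above. The table and column names and the values are illustrative, not the crawler's actual internals.

```python
import pandas as pd

# Hypothetical dimension tables (illustrative values only)
dataset_dim = pd.DataFrame(
    [{"dataset_id": 1, "who": "UNICEF", "what": "enrolment", "where": "Yemen"}]
)
tag_dim = pd.DataFrame([{"tag_id": 10, "tag": "education"}])
hxl_dim = pd.DataFrame([{"hxl_id": 20, "hxl": "#indicator"}])

# Fact table: one row per (dataset, tag, hxl) combination
fact = pd.DataFrame([{"dataset_id": 1, "tag_id": 10, "hxl_id": 20}])

# Joining every dimension onto the fact table yields the flat meta-dataset rows
meta = (
    fact.merge(dataset_dim, on="dataset_id")
        .merge(tag_dim, on="tag_id")
        .merge(hxl_dim, on="hxl_id")
)
print(meta.to_dict(orient="records"))
```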

Populating the metadataset

The figure below shows the workflow the crawler follows to populate a single row in the meta-dataset.

Crawler diagram

Once the code has finished, two output files will be created: /tutorial/spiders/meta_data_test.json and /tutorial/spiders/meta_data_test.csv; use whichever is more convenient. Remember to delete both files before running the code again, otherwise the crawler will append to them.
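
Stale outputs can be removed before a run with a few lines of Python; this is just a sketch, assuming the default output file names above.

```python
import os

# Default output files written by the spider
for path in ("meta_data_test.json", "meta_data_test.csv"):
    if os.path.exists(path):
        os.remove(path)  # avoid appending to results from a previous run
```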

Algorithm

This section states the algorithm steps for populating the meta-dataset.

  1. HDX Crawler starts at http
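
A heavily simplified sketch of that crawl chain is shown below. The spider name matches the scrapy crawl hdx command, but the start URL, CSS selectors, and field names are illustrative placeholders rather than the selectors the real spider uses.

```python
import scrapy


class HdxEducationSpider(scrapy.Spider):
    """Illustrative skeleton only: search results -> dataset page -> dataset file."""

    name = "hdx"
    # Placeholder start URL for the 'Education' search results
    start_urls = ["https://data.humdata.org/search?q=education"]

    def parse(self, response):
        # Step 1: follow each search result to its dataset page
        for href in response.css("a.search-result::attr(href)").getall():
            yield response.follow(href, callback=self.parse_dataset)

    def parse_dataset(self, response):
        # Step 2: collect dataset-level metadata, then follow each resource file
        dataset_meta = {"link": response.url, "what": response.css("h1::text").get()}
        for href in response.css("a.resource::attr(href)").getall():
            yield response.follow(
                href,
                callback=self.parse_resource,
                cb_kwargs={"dataset_meta": dataset_meta},
            )

    def parse_resource(self, response, dataset_meta):
        # Step 3: inspect the downloaded file and emit one meta-dataset row
        yield {**dataset_meta, "file_url": response.url}
```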

Assumptions

Three assumptions are described below:

  1. CSS/HTML
  2. Mappings
  3. File type

CSS/HTML

The crawler assumes that the HDX platform https://data.humdata.org/ has the same HTML and CSS structure as it did on 23 July 2018.

The crawler uses education indicators retrieved by performing a Secondary Data Review of the official Indicator Registry Education Indicators. The reason is that most datasets and reports do not go to the level of detail of the official education indicators. The Secondary Data Review reduces the indicators from 50 to 25. For more details, see FILE for a description of each indicator and its link to the registry code.

The education indicators are synchronized with the files the crawler opens, meaning that any indicator found within a file should exist in education_indcators_description.csv.
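
A minimal sketch of that synchronization check is given below, assuming the description file has an indicator column; the column name and matching logic are assumptions, not the crawler's actual code.

```python
import pandas as pd

# Load the reference list of education indicators (column name assumed)
reference = pd.read_csv("education_indcators_description.csv")
known_indicators = set(reference["indicator"].str.strip().str.lower())


def is_known_indicator(value: str) -> bool:
    """Return True if an indicator found in a scraped file exists in the reference list."""
    return value.strip().lower() in known_indicators
```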

Mappings

One-to-many: a single dataset can map to several tags, HXL tags, and indicators, so one dataset can produce multiple rows in the meta-dataset.

File type

The crawler only considers files of type CSV, XLS, XLSX, or ZIP.
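
For illustration, such a filter could look like the snippet below; the exact extension check used by the crawler may differ.

```python
ALLOWED_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip")


def is_supported(url: str) -> bool:
    """Only resources with a supported file extension are downloaded and parsed."""
    return url.lower().endswith(ALLOWED_EXTENSIONS)
```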

Scalability

To scale the crawler's coverage, edit the supporting CSV list files to include more HXL tags, for example when a new education HXL tag is introduced.

Output

The crawler writes the meta-dataset in both .json and .csv formats.

Use cases

For examples of leveraging the meta-dataset, see the accompanying .ipynb notebook.
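
As a small illustration of the kind of query the notebook could contain, the sketch below loads the CSV output and counts datasets per tag; the column names come from the row example in the Concept section.

```python
import pandas as pd

# Load the crawler's CSV output
meta = pd.read_csv("meta_data_test.csv")

# Example query: number of distinct datasets per tag
datasets_per_tag = (
    meta.groupby("tag")["dataset_id"]
        .nunique()
        .sort_values(ascending=False)
)
print(datasets_per_tag.head(10))
```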

Future work

  1. Switch from scraping to using the HDX API to get the metadata
  2. Use the Data Freshness database instead of the updated date to populate the time dimensions
  3. Schedule Scrapy to run frequently, either with a crontab entry or by deploying the spider to Scrapyd (https://github.com/scrapy/scrapyd)
  4. Scrape data outside HDX such as Humanitarian Response Planning (HRP) and Humanitarian Needs Overview (HNO) PDFs

Other possible additions to the meta-dataset include the value of each indicator, a sector dimension, and a contact dimension.

An ideal way to handle the size of the meta-dataset output would be to store it in a database program such as the open-source, cross-platform MongoDB.