This project presents a proof of concept for the Education Meta-dataset on the HDX Platform. It implements a crawler that takes education datasets as input and outputs a populated meta-dataset. The crawler generates its input by scraping HTML/CSS elements from the HDX search results page for 'Education'. It then visits each search result, its dataset page, and the dataset itself to collect the data. The concept behind the meta-dataset and the details of the scraping algorithm are explained in the Concept and Algorithm sections. The caveats and assumptions of the crawler are listed in the Assumptions section.
The project assumes you are running Python 3 and requires the following libraries to be installed before running:
- Scrapy
- Pandas
- Numpy
- xlrd
- Pillow
- cd into the /tutorial/spiders folder
- run scrapy crawl hdx
The meta-dataset takes the architecture of a star schema:
"An approach to answering multi-dimensional analytical queries swiftly in computing"
A populated row in the meta-dataset would look as follows:
{
# dataset dimension
dataset_id: ...
who: ...
what: ...
when: ...
where: ...
on_web: ...
on_hdx: ...
open_data: ...
link: ...
# tag dimension
tag_id: ...
tag: ...
# hxl dimension
hxl_id: ...
hxl: ...
# indicator dimension
indicator_id: ...
indicator: ...
# quality dimension
quality_id: ...
quality: ...
}
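In star-schema terms, the row above joins one fact record to its dimension tables on the `*_id` keys. A small pandas sketch of that join, with invented field values for illustration:

```python
# Sketch of a star-schema join for one meta-dataset row.
# All values below are invented for illustration.
import pandas as pd

# Dataset dimension
datasets = pd.DataFrame([{
    "dataset_id": 1, "who": "UNICEF", "what": "enrollment",
    "when": "2018", "where": "Yemen", "on_web": True,
    "on_hdx": True, "open_data": True,
    "link": "https://data.humdata.org/dataset/example",
}])
# Tag dimension
tags = pd.DataFrame([{"tag_id": 10, "tag": "education"}])
# Fact table holding the foreign keys
facts = pd.DataFrame([{"dataset_id": 1, "tag_id": 10}])

# Joining the fact record to its dimensions yields the populated row
row = facts.merge(datasets, on="dataset_id").merge(tags, on="tag_id")
print(row.iloc[0]["tag"])  # education
```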
The figure below shows the workflow of the crawler to populate a single row in the meta-dataset.
Once the crawler has finished, two output files are created: /tutorial/spiders/meta_data_test.json and /tutorial/spiders/meta_data_test.csv; use whichever is more convenient. Remember to delete both files before re-running the code, otherwise the crawler will append to them.
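Because the crawler appends to its outputs, stale files should be removed before each run. A small cleanup snippet (paths relative to the repo root):

```python
# Remove previous outputs so the crawler starts from empty files.
import os

for path in ("tutorial/spiders/meta_data_test.json",
             "tutorial/spiders/meta_data_test.csv"):
    if os.path.exists(path):
        os.remove(path)
```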
- HDX Crawler starts at http
This section lists the algorithm steps for populating the meta-dataset.
The crawler makes three assumptions:
- CSS/HTML
- Mappings
- File type
The crawler assumes that the HDX platform https://data.humdata.org/ has the same HTML and CSS structure as it did on 23 July 2018.
The crawler uses education indicators retrieved by performing a Secondary Data Review of the official Indicator Registry's education indicators. The reason is that most datasets and reports do not go down to the level of detail of the official education indicators. The Secondary Data Review reduces the indicators from 50 to 25. For more details, see FILE for a description of each indicator and its link to the registry code.
The education indicators are synchronized with the files opened, meaning that any indicator found within a file should exist in education_indcators_description.csv.
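This synchronization invariant can be checked with a small helper. The column name "indicator" is an assumption about the layout of education_indcators_description.csv:

```python
# Sketch of the synchronization check: any indicator detected in a file
# should already be listed in the description CSV.
# The "indicator" column name is an assumption about that CSV's layout.
import pandas as pd


def unknown_indicators(found, description_csv="education_indcators_description.csv"):
    """Return indicators in `found` that are missing from the description CSV."""
    known = set(pd.read_csv(description_csv)["indicator"])
    return [ind for ind in found if ind not in known]
```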
The mappings are assumed to be one-to-many.
The crawler only considers csv/xls/xlsx/zip files.
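The file-type filter reduces to an extension check (reading "css" in the original list as a typo for "csv"). A minimal sketch:

```python
# Sketch of the file-type filter: only csv/xls/xlsx/zip resources are opened.
ALLOWED_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip")


def is_supported(url):
    """Return True if the resource URL ends with a supported extension."""
    return url.lower().endswith(ALLOWED_EXTENSIONS)


print(is_supported("data/enrollment.xlsx"))  # True
print(is_supported("report.pdf"))            # False
```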
To include more HXL tags (for example, when a new education HXL tag is introduced), edit the CSV files.
The outputs are produced in .json and .csv formats.
Leveraging the meta-dataset: see the .ipynb notebook.
- Switch from scraping to using the HDX API to get the metadata
- Use the Data Freshness database instead of the updated date to populate the time dimensions
- Schedule Scrapy to run frequently, either via a crontab entry or by deploying the spider to Scrapyd https://github.com/scrapy/scrapyd
- Scrape data outside HDX, such as Humanitarian Response Plan (HRP) and Humanitarian Needs Overview (HNO) PDFs
- Add new dimensions, such as the value of an indicator, a sector dimension, and a contact dimension
- Store the meta-dataset in a database program such as the open-source, cross-platform MongoDB