# COVID-19 Data
Note that this data is not used for the production visual. The scraper and data for the production visual can be found here: https://github.com/OCHA-DAP/hdx-scraper-iati-viz
That scraper extracts data from the IATI Datastore nightly and reprocesses it:
- selects certain fields and exports them in a nice clean JSON format
- converts financial data to USD
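For illustration, converting a transaction value to USD typically means dividing by an exchange rate for the transaction's currency and period. The function name, rate-table shape, and sample rates below are assumptions for this sketch, not the scraper's actual code.

```python
# Hypothetical sketch of converting a transaction value to USD using a
# rates table keyed by (currency, month). The names and data layout are
# assumptions, not the repository's actual implementation.
def convert_to_usd(value, currency, month, rates):
    """Convert `value` from `currency` to USD using the rate for `month`.

    `rates` maps (currency, "YYYY-MM") to units of that currency per 1 USD.
    """
    if currency == "USD":
        return value
    rate = rates[(currency, month)]
    return round(value / rate, 2)

rates = {("EUR", "2021-03"): 0.85, ("GBP", "2021-03"): 0.73}
print(convert_to_usd(850.0, "EUR", "2021-03", rates))  # 1000.0
```

Keying the table by month (rather than exact day) keeps the lookup simple when only monthly average rates are available.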
The scripts in this repository automatically generate fresh data every day (using Github Actions), which can be seen in (and downloaded from) the gh-pages branch.
For more detail on how the data was processed, see the data notes.
## Installing
```shell
git clone git@github.com:OCHA-DAP/covid19-data.git
virtualenv ./pyenv
source ./pyenv/bin/activate
pip install -r requirements.txt
```
## Running
Download and reprocess data using the following script. Add `--help` to see optional arguments.

```shell
python run.py
```
Running with cached rates (saves downloading a new file):

```shell
python run.py --cached-rates
```
Running and deploying to gh-pages:

```shell
python run.py --deploy
```
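A minimal sketch of how a command-line interface with these flags could be wired up using `argparse`; the flag names match the commands above, but the parser details are an assumption, not the repository's actual `run.py`.

```python
import argparse

# Minimal argparse sketch for the flags shown above; the real run.py
# may define its CLI differently.
def build_parser():
    parser = argparse.ArgumentParser(
        description="Download and reprocess COVID-19 IATI data")
    parser.add_argument("--cached-rates", action="store_true",
                        help="use a previously downloaded exchange-rates file")
    parser.add_argument("--deploy", action="store_true",
                        help="push the generated files to the gh-pages branch")
    return parser

args = build_parser().parse_args(["--cached-rates"])
print(args.cached_rates, args.deploy)  # True False
```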
## Overview
The code in this repository runs at 1500 UTC every day, using Github Actions. Files are pushed to the gh-pages
branch and made available through Github Pages. The data is then visualised using software stored in the OCHA-DAP/viz-covid19-visualisation repository, and also served from Github Pages.
## Data sources
Data is downloaded from a few places:
- IATI data: D-Portal
- FTS data: UNOCHA FTS
- Codelists: CodeforIATI
- Exchange Rates: CodeforIATI
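As an illustration, a downloaded codelist such as `TransactionType` can be turned into a simple code-to-name lookup. The JSON shape below mirrors a common IATI codelist layout but is an assumption; the actual files may differ.

```python
import json

# Hypothetical codelist JSON in a shape commonly used for IATI
# codelists; the real Codeforiati files may be structured differently.
raw = json.loads("""
{"data": [{"code": "1", "name": "Incoming Funds"},
          {"code": "2", "name": "Outgoing Commitment"},
          {"code": "3", "name": "Disbursement"}]}
""")

# Build a code -> name lookup for fast labelling of transactions.
transaction_types = {entry["code"]: entry["name"] for entry in raw["data"]}
print(transaction_types["3"])  # Disbursement
```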
These downloads are now reasonably stable, though there are a few things to be aware of:
- IATI data: D-Portal fairly frequently fails to respond with relevant data. It has become more reliable since we began requesting fewer activities at once and running at 1500 rather than early in the morning (when D-Portal is itself collecting and updating source data). One option would be to switch to the new IATI Datastore (though see the discussion below).
- FTS data: FTS now seems to be pretty stable, though occasionally the FTS API is unavailable.
- Codelists: these endpoints are very stable now as flat files are hosted on Github Pages. These files are generally much faster to download than the official IATI codelists, and they are also often more up to date.
- Exchange rates: this file is also now very stable, again because a single compiled flat file is hosted on Github Pages. Previously this data was hosted only on morph.io, which had a lot of stability issues. There don't appear to be any significant problems here any more.
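Flaky upstream endpoints like the ones described above can be handled with a retry-and-backoff wrapper. This is a generic sketch, not the repository's actual download code; the function names and delays are illustrative only.

```python
import time

# Generic retry-with-backoff sketch for occasionally unavailable
# endpoints; fetch is any zero-argument callable that raises on failure.
def fetch_with_retries(fetch, attempts=3, base_delay=1.0):
    """Call fetch() until it succeeds, sleeping base_delay * 2**n between tries."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

calls = []
def flaky():
    # Simulated endpoint that fails twice before succeeding.
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("service unavailable")
    return "ok"

print(fetch_with_retries(flaky, attempts=5, base_delay=0))  # ok
```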
## Process
The basic process is as follows:

`run.py`:

- either download or load in a list of exchange rates
- download data from D-Portal (`get_activities_from_urls()`)
- filter out activities that have certain problems (`activities_filter()`)
- filter out activities that don't conform to the IATI COVID-19 Publishing Guidance
- extract relevant data from each activity (`process_activity()`)
- write XML data for all activities (`write_xml_files()`), up to 3000 activities per file, labelled `activities-N.xml` (where N is the page)
- write XML data for each reporting organisation
- write out the list of sectors and countries that are used in the data (so that in the user interface we don't display countries or sectors with no activities)
- download and process FTS data
- run `traceability.py` (see below)
- remove `activities.xml` (it is used by `traceability.py`, but it is a very large file and exceeds Github usage limits)
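The pagination step (at most 3000 activities per file) can be sketched as below. The helper and filenames follow the description above, but the code is illustrative rather than the repository's actual `write_xml_files()`.

```python
# Illustrative sketch of paging activities into files of at most 3000,
# named activities-N.xml as described above; not the actual
# write_xml_files() implementation.
def page_activities(activities, page_size=3000):
    """Yield (filename, chunk) pairs, 1-indexed like activities-1.xml."""
    for page, start in enumerate(range(0, len(activities), page_size), start=1):
        yield f"activities-{page}.xml", activities[start:start + page_size]

pages = list(page_activities(list(range(6500))))
print([(name, len(chunk)) for name, chunk in pages])
# [('activities-1.xml', 3000), ('activities-2.xml', 3000), ('activities-3.xml', 500)]
```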
`traceability.py`:

- read in list of exchange rates
- download `TransactionType` codelist
- read in the activities XML (from `activities.xml`)
- identify which activities contain explicit COVID-19 transactions
- extract relevant data from each transaction (`make_transaction()`)
- export transactions to Excel
- disaggregate transactions by sector and country (`make_sector_country_transactions_data()`)
- export disaggregated data to JSON and Excel
- make grouped traceability data for Sankey diagram
- export grouped traceability data to JSON and Excel
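Disaggregation by sector and country typically means splitting each transaction's value proportionally across the activity's sector and country percentages. The sketch below is in the spirit of that step, but the function, field names, and sample codes are assumptions, not the actual `make_sector_country_transactions_data()`.

```python
# Hypothetical sketch of splitting a transaction's USD value across
# sector and country percentages; field names and codes are illustrative.
def disaggregate(value_usd, sectors, countries):
    """Return one row per (sector, country) with a proportional share of value.

    `sectors` and `countries` each map code -> percentage (summing to 100).
    """
    rows = []
    for sector, s_pct in sectors.items():
        for country, c_pct in countries.items():
            share = value_usd * (s_pct / 100) * (c_pct / 100)
            rows.append({"sector": sector, "country": country,
                         "value_usd": round(share, 2)})
    return rows

rows = disaggregate(1000.0, {"12220": 60, "72010": 40}, {"KE": 50, "UG": 50})
print(sum(r["value_usd"] for r in rows))  # 1000.0
```

Because the percentage splits multiply, the disaggregated rows always sum back to the original transaction value (up to rounding), which makes the output easy to sanity-check.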