Collecting EU subsidy data for the 2014-2020 programming period.
The goal of this project is to create a database of the operations funded under the cohesion policy programmes of the 2014-2020 funding cycle across the EU-27 member states plus the United Kingdom, including Interreg programmes. The project is a software solution that self-documents the collection of the named datasets, and makes all operations and procedures (accessing, transforming, cleaning, converting, etc.) programmatically reproducible, automated and extensible.
The initial list of data sources was provided by the European Commission. Researching and collecting all transactions within the 2014-2020 funding cycle was not within the scope of the project: by agreement, only file-based, programmatically accessible, machine-readable sources were included, so data available only in HTML format, for example, was out of scope. As the framework is extensible, it allows a technical user to include additional data sources that were not originally part of the project.
$ git clone https://github.com/balkey/eu1420.git
The project was tested on PostgreSQL 12.2. You can either install and run a PostgreSQL server locally or use a hosted database. If you want to run it locally, install PostgreSQL following this tutorial.
Create a user with a role that allows creating schemas and tables, then create the database you will use.
Add your database connection details to an environment file, following the example here.
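The exact variable names depend on the example env file in the config folder; as a minimal sketch, assuming hypothetical `PG_HOST`, `PG_PORT`, `PG_DB` and `PG_USER` keys, the file can be parsed and turned into a libpq-style connection string like this:

```python
# Minimal sketch: parse a KEY=VALUE environment file and build a
# libpq-style connection string. The key names (PG_HOST, PG_PORT,
# PG_DB, PG_USER) are assumptions -- follow the example env file in
# the config folder for the real ones.
from pathlib import Path

def parse_env(path):
    """Read KEY=VALUE lines, ignoring blank lines and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def connection_string(env):
    """Assemble a libpq-style DSN; the password comes from ~/.pgpass."""
    return (f"host={env['PG_HOST']} port={env['PG_PORT']} "
            f"dbname={env['PG_DB']} user={env['PG_USER']}")
```

With psycopg2 installed, `psycopg2.connect(connection_string(env))` would then open the connection, with the password looked up from `~/.pgpass`.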
Read here about the formatting convention. Don't forget to restrict the file's permissions so that only your user can read and write it:
$ chmod 0600 ~/.pgpass
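For reference, each `~/.pgpass` line has the form `hostname:port:database:username:password`, and PostgreSQL ignores the file when its permissions are wider than 0600. A hypothetical helper (not part of this project) that writes an entry and applies the chmod from Python:

```python
import os
from pathlib import Path

def write_pgpass(path, host, port, database, user, password):
    """Append a ~/.pgpass entry and restrict the file to owner read/write.

    The format is host:port:database:username:password, one entry per line.
    PostgreSQL skips the file entirely if its permissions are wider than 0600.
    """
    entry = f"{host}:{port}:{database}:{user}:{password}\n"
    with open(Path(path), "a") as f:
        f.write(entry)
    os.chmod(path, 0o600)  # equivalent to: chmod 0600 ~/.pgpass
    return entry
```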
It is strongly recommended to deploy the project in a virtual environment; we recommend virtualenv. Follow this tutorial to install Python 3!
$ sudo apt install python3-pip
Make sure that you can run csvkit commands directly from the command line as well as invoke them from Python. We therefore recommend installing it system-wide.
On Ubuntu:
$ sudo apt-get install csvkit
On OSX:
$ brew install csvkit
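To verify that csvkit is reachable both from the shell and when invoked from Python, you can probe the PATH for one of its tools and run it via subprocess; this is an illustrative sketch, not project code:

```python
import shutil
import subprocess

def run_csvkit_tool(tool, args):
    """Run a csvkit command (e.g. 'csvstat', 'csvsql') if it is installed.

    Returns the command's stdout, or None when the tool is not on the
    PATH -- which usually means csvkit was not installed system-wide.
    """
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, *args],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_csvkit_tool("csvstat", ["some_file.csv"])` (file name illustrative) returns `None` when csvkit is missing, which signals that the system-wide install did not work.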
Once you have set up Python and PostgreSQL, it is time to install the project's dependencies. Simply run, in the project's root folder:
$ pip3 install -r requirements.txt
Make sure you have included all necessary configuration files. You can see examples in the config folder.
That's it, you should be ready to go!
There are two ways to run the entire project: either fall back to previously downloaded files, or download them from scratch. We strongly recommend using the previously downloaded files, as there is no guarantee that newly downloaded files will match the data structures defined in the project, so downloading from scratch is very likely to produce errors.
You can also add new files to the existing ones: simply copy them to the data/source folder. Make sure you keep to the data/source/[COUNTRY_CODE]/[PROGRAMME_CODE]/file.xlsx convention when adding new files!
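A small check can guard that convention before new files are added; the character classes below are assumptions (two-letter country codes, no fixed rule for programme codes), not the project's formal spec:

```python
import re

# Illustrative check of the data/source/[COUNTRY_CODE]/[PROGRAMME_CODE]/file
# layout. The exact allowed characters are an assumption, not project spec.
SOURCE_PATH = re.compile(
    r"^data/source/"
    r"(?P<country>[A-Z]{2})/"   # e.g. HU, DE, UK (assumed two-letter codes)
    r"(?P<programme>[^/]+)/"    # programme code directory
    r"[^/]+$"                   # the source file itself
)

def follows_convention(path):
    """Return True when a path matches the data/source layout sketched above."""
    return SOURCE_PATH.match(path) is not None
```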
If you have already placed the files in the data/source folder, simply run, in the project's root folder:
$ make
If you want to download the source files anew, you first need to download the operations list from a Google Sheet, using the Google Sheets API. See an example sheet here.
Then see how the sheet is downloaded here. Alternatively, you can copy your source file directly to data/source/operations_list.csv - in this case, make sure to comment out the command that downloads the sheet via the Google Sheets API.
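As an aside, for sheets shared with "anyone with the link", the CSV export endpoint is a commonly used alternative to the Google Sheets API; the sketch below only builds the export URL and is not the project's download code:

```python
def sheet_csv_export_url(sheet_id, gid=0):
    """Build the CSV export URL for a publicly shared Google Sheet.

    This is the public export endpoint, an alternative to the Google
    Sheets API used by the project; it only works for sheets shared
    with 'anyone with the link'. gid selects the tab within the sheet.
    """
    return (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
            f"/export?format=csv&gid={gid}")
```

The resulting URL can be fetched with any HTTP client and the response saved as data/source/operations_list.csv.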
The minimal data requirement for your operations_list.csv file is the following:
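The required column names themselves are documented in the repository; as a sketch with placeholder names only, a pre-flight check that the header of operations_list.csv contains whatever columns are required could look like this:

```python
import csv
import io

def missing_columns(csv_text, required):
    """Return the required column names absent from a CSV's header row.

    `required` should be the project's actual minimal column set; the
    names used in any example here are placeholders, not the real spec.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return sorted(set(required) - set(header))
```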
Once this file is in place, simply run:
$ make FORCE_DOWNLOAD=1
There is one important human supervision step in the code: an interactive shell prompt asks you to assign the correct detected header to each source file. Please review the code here.
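Conceptually, that supervision step shows the detected candidate header rows and waits for you to pick one. A simplified sketch (the injectable `ask` parameter is an assumption added here so the prompt can be scripted; see the linked code for the real implementation):

```python
def choose_header(candidates, ask=input):
    """Let the user pick the correct header row from detected candidates.

    `candidates` is a list of header rows (each a list of column names).
    `ask` is injectable so the prompt can be scripted; the project's
    actual prompt lives in the code linked above.
    """
    for i, row in enumerate(candidates):
        print(f"[{i}] {row}")
    while True:
        answer = ask("Index of the correct header: ").strip()
        if answer.isdigit() and int(answer) < len(candidates):
            return candidates[int(answer)]
        print("Invalid choice, try again.")
```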