This repository contains everything required to create a local version of the CHEFS database, which is further discussed in the accompanying article: "Food safety monitoring trends across Europe: insights from the 392-million-entry CompreHensive European Food Safety (CHEFS) database".
Disclaimer: Although we made every effort to ensure the validity of the CHEFS database, we are not responsible for any conclusions that third parties might draw.
The repository is structured as follows:
- Scripts : collection of scripts for database setup, data download, and data processing
- Prevalence Data : folder to store microbiological prevalence data (empty for now)
- AMR data : folder to store AMR data (empty for now)
- Exports : folder for data exports (empty for now)
- Zoonose data : folder to store zoonoses data (empty for now)
- Meta data : folder that stores some meta information in Excel files (needed for some of the scripts in 01. Scripts)
- Data Files : destination folder for downloaded data files
- EFSA Catalogues/version v9 : all EFSA catalogues (needed for some of the scripts in 01. Scripts)
- Query & Visualization Examples : collection of Python notebooks that include example SQL queries and visualizations from the article. See the README.md inside for more information
- README.md
- requirements.txt
If you find this useful and use our code or database, please cite our paper:
Nehir Kizililsoley*, Floor van Meer, PhD*, Osman Mutlu, Wouter Hoenderdaal, Rosan Hobé, Wenjuan Mu, PhD, Arjen Gerssen, PhD, HJ van der Fels-Klerx, PhD, Ákos Jóźwiak, PhD, Ioannis Manikas, PhD, Ali Hurriyetoglu, PhD, Bas van der Velden, PhD. Food safety monitoring trends across Europe: insights from the 392-million-entry CompreHensive European Food Safety (CHEFS) database, arXiv.
The CHEFS database processing pipeline has been developed on a computer with a Windows operating system. The following supplementary software was used:
- python v3.12
- postgresql v14
- git
The data processing scripts have been developed and run on a computer with an Intel Xeon CPU (2400 MHz), 32 GB RAM, and 100+ GB of storage space.
To set up the CHEFS database processing pipeline, follow these steps:
- Download and install Python
- Download and install PostgreSQL
- Download and install git
- Create a directory on your computer where you want to store all files related to the processing pipeline
- Open up the terminal
- cd into the directory you created above
- Clone content of this repository into the directory:
git clone https://git.wur.nl/wouter.hoenderdaal/chefs.git
- Create a Python virtual environment by typing the following commands in the terminal, in this order:
pip install virtualenv (only needed if you don't already have virtualenv installed)
virtualenv venv
.\venv\Scripts\activate
pip install -r requirements.txt
Note that the last step installs all Python packages required for the CHEFS database processing pipeline. The packages will be installed inside the newly created virtual environment "venv".
- Setting up the database: Download and install pgAdmin and create an empty database. Inside this database, run the following commands:
CREATE SCHEMA efsa;
CREATE SCHEMA ontologies_efsa;
These schemas will store the EFSA data and the EFSA catalogues.
- Configure the database connection: in the cloned repository, go to the directory "01. Scripts" and open the file "DBcredentials.py". In this file, either read the database host, name, username, and password from environment variables or (not recommended) hardcode them. A sketch of the environment-variable approach is shown below.
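As an illustration, the environment-variable approach could look something like the following sketch. The variable and environment-variable names used here are assumptions for illustration only; check "DBcredentials.py" itself for the names the scripts actually expect.

import os

# Illustrative sketch only; the real DBcredentials.py may use different names
DB_HOST = os.environ.get("CHEFS_DB_HOST", "localhost")
DB_NAME = os.environ.get("CHEFS_DB_NAME", "chefs")
DB_USER = os.environ.get("CHEFS_DB_USER", "postgres")
DB_PASSWORD = os.environ.get("CHEFS_DB_PASSWORD", "")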
You have now successfully set up the Python environment, the database with two empty schemas, and the database connection.
Below, the data processing steps are summarized; the next two sections present more details. Please read everything, otherwise you will run into problems.
Note that there are .py (Python) and .sql (PostgreSQL) scripts in the 01. Scripts folder. The .py files should be run on the command line from the virtual environment. The .sql scripts, however, should be run from the psql Postgres terminal, which you can enter via the pgAdmin tool or directly from the command line. For how to start and use psql, please see the PostgreSQL documentation and the pgAdmin documentation on psql.
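For example, assuming you are already connected with psql to the database you created earlier, a .sql script can be run with the \i meta-command:
\i '0c. efsa - TABLES.sql'
or, from the regular terminal (replace the database name and user with your own):
psql -d chefs -U postgres -f "0c. efsa - TABLES.sql"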
- Run the "0." scripts to create tables in the database
- Choose a type of data, e.g. start with "contaminants"
- Run scripts 1 to 13
- Run some queries in the database for testing (optional)
- Delete all folders and files in the "07. Data Files" folder
- Repeat steps starting from step 2.
To run the processing scripts, go to the terminal and type the following to reach the 01. Scripts folder:
cd "01. Scripts"
In the "01 Scripts" folder, you find a series of python and SQL scripts numbered from 0 to 13.
Important: The scripts starting with "0" should only be executed once! These are:
- 0a. EFSA CATALOGUE.py
- 0b. efsa - TABLES CATALOGUE.sql
- 0c. efsa - TABLES.sql
- 0d. meta-info in DB.sql
These scripts create the tables and content of the EFSA catalogues, and create the tables and views where the EFSA Zenodo data will be stored.
You can run a script by typing the following in the terminal (after having activated your Python virtual environment): python '.\0a. EFSA CATALOGUE.py'
Tip: type in "python 0" and then press the TAB key on your keyword. This will auto-complete the filename and give you python '.\0a. EFSA CATALOGUE.py'
After initialising the database in Step 1, you can proceed with steps 2-5. These steps download the data files from Zenodo, process these files, create intermediary CSV files, and import those into the database. Note that there are three types of data that can be imported into the database, namely:
- Contaminants
- Pesticides
- Veterinary
If you wish to build the entire database, all scripts (from "1. download_zonodo_files.py" to "13. CLEAN TABLES"), with the exception of the "0" scripts listed above, should be run for each data type.
You select the data type by changing the TYPE variable in the scripts 1. download_zonodo_files.py (lines 26-28) and 2. convert 7z to zip.py (lines 16-18) to the desired type. For example, if you start with "contaminants" (which is recommended because its files are the smallest of the three data types), you first need to change the variable to TYPE = "contaminants".
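For illustration, after editing, the relevant line in both scripts would read along these lines (the exact surrounding code may differ between versions of the scripts):

TYPE = "contaminants"  # or "pesticides" or "veterinary"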
Then you start running the scripts 1 to 13 in the order that they are numbered. In the scripts, especially the .sql scripts, additional detailed information is provided to explain what you need to do.
EXCEPTION: Note that for pesticides, the script "2. convert 7z to zip.py" requires you to first manually unpack all the .7z files in all the country folders in "07. Data Files". Unfortunately, these pesticide .7z files could not be unpacked automatically using Python, which is why you need to do this manually.
Some of the data files are very large, and processing is extensive, so do not be surprised if running a script like "4. create_country_sample_files.py" takes many hours (up to 6 hours on my computer for veterinary data). The scripts do give you feedback that shows on which file or for which country they are currently working.
After all data processing is done, you can rely on the following views (and their underlying tables) to query the data:
- efsa.vw_sample_core
- efsa.vw_measurement_core
- efsa.vw_sample_measurement_core
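As an illustration, a minimal Python sketch for querying one of these views is shown below. It assumes the psycopg2 driver and a locally running database; the connection details are placeholders that you should replace with your own.

import psycopg2

# Placeholder connection details; adjust to your own setup
conn = psycopg2.connect(host="localhost", dbname="chefs", user="postgres", password="your_password")
with conn.cursor() as cur:
    # Fetch a small sample of combined sample/measurement records
    cur.execute("SELECT * FROM efsa.vw_sample_measurement_core LIMIT 10;")
    for row in cur.fetchall():
        print(row)
conn.close()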
Note that the database is not yet optimized for fast performance. Some steps, e.g. partitioning, have been implemented to facilitate faster queries, but feel free to implement additional indexes and other optimizations to boost performance. Because these choices depend heavily on the type of queries you run, we leave performance optimization up to you.
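For example, an index on a column that you filter on frequently could look like the statement below. The table and column names here are purely illustrative; inspect the actual tables underlying the views for the real names before creating any indexes.
CREATE INDEX idx_measurement_param ON efsa.measurement_core (param_code);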