kontrata

A project to download, process and index all contracts published at the Basque Public Administration Public Contracts Platform (REVASCON).

This project has been inspired by @JaimeObregon and contratoscantabria.es

Install

You need a recent python version, any version >= 3.8 will do.

Clone this project and create a virtualenv inside it:

python3 -m venv .

Then install the required dependencies:

./bin/pip install -r requirements.txt

Run each of the scripts of the pipeline.

./bin/python step_00_cache_contracts_files.py
./bin/python step_01_get_contracts.py --year 2021
./bin/python step_02_process_contracts.py --year 2021
./bin/python step_03_build_data_dicts.py
./bin/python step_04_fix_authority_and_company_data_async.py --year 2021
./bin/python step_05_index_contracts.py --year 2021

The first script is optional, it just downloads the main files from the opendata portal and caches them locally.
If you don't want to run them, just skip that part and run the step_01 script, it will download the requested year's file. I think
that pre-caching files can speed up the download and processing time, because you can run several parallel downloads.

## Elastic

All the data produced in the 2nd step will be indexed in an [Elastic](https://www.elastic.co/es/) service. You will need such a service to run the indexing part.

Have a look at the [sample elastic docker-compose installation](https://github.com/erral/kontrata-docker) that I am using on development.

By default, the 5th step script will try to index the files in an Elastic service running in localhost:9200. You may change this setting an environment var:

```shell
export ELASTIC_HOST=10.0.0.1 && python step_05_index_contracts.py

You have the following variables to customize this:

ELASTIC_HOST
ELASTIC_PORT

Pipeline

step_00_cache_contracts_files.py (optional)

Download and cache the original JSON files

step_01_get_contracts.py

It has an optional parameter --year, to download contracts just from that year. It has an optional parameter --update, to signal if you want to update already downloaded files.

Download the original contracts JSONP file
Cache the file
Convert to JSON
Cache the file
For each contract:
- Create a dir
- Extract the id
- Download the data XML file
- Download the metadata XML file
- Create the contract JSON file

step_02_process_contracts.py

It has an optional parameter --year, to download contracts just from that year.

Read the existing XML files for each contract and build a json file with the relevant data

step_03_build_data_dicts.py

Build authority and company lists, to use them in the fixing process

step_04_fix_authority_and_company_data_async.py

It has an optional parameter --year, to download contracts just from that year.

Try to fix, unify and calculate cifs and slugs for authority and companies

step_05_index_contracts.py

It has an optional parameter --year, to download contracts just from that year.

Index all contracts in elastic

Work in progress

This is a work in progress. The JSON file generated in the 2nd step (and then indexed in the 3rd step) is subject to change.

Demo content

I have downloaded and processed all the data available on the 10th of october and make it available here, in case you do not want to wait to download all of them, I make them available here for anyone here:

Full download of the contract files: 2 XML files (one for data and another one for metadata) and a JSON file with the summary, one per language, a total of 6 files per contract. 665 MB in total.

I have also processed all those files with the second script, and make it available here, in case you do not want to wait to process all of them (the process of the files takes usualy minutes).

Full download of the processed contract files: 2 JSON files, one with the XML in raw JSON (unprocessed, just converted), and another one created after processing the previous one, one per language, a total of 4 files per contract. 286 MB in total.

In the demo folder you can find a extract of some contracts with the corresponding files.

License

GPLv3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kontrata

Install

Pipeline

Work in progress

Demo content

License

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
.github/workflows		.github/workflows
cache		cache
cifs		cifs
contracts		contracts
demo		demo
processed		processed
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
step_00_cache_contracts_files.py		step_00_cache_contracts_files.py
step_01_get_contracts.py		step_01_get_contracts.py
step_02_process_contracts.py		step_02_process_contracts.py
step_03_build_data_dicts.py		step_03_build_data_dicts.py
step_04_fix_authority_and_company_data_async.py		step_04_fix_authority_and_company_data_async.py
step_05_index_contracts.py		step_05_index_contracts.py
test_step_02_process_contracts.py		test_step_02_process_contracts.py
utils.py		utils.py

License

erral/kontrata

Folders and files

Latest commit

History

Repository files navigation

kontrata

Install

Pipeline

Work in progress

Demo content

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages