sebra-scrape

Scraper and parser to obtain SEBRA data from the Ministry of Finance transparency page: https://www.minfin.bg/bg/transparency/

The project does the following:

  1. Crawls the ministry website for reports published since the date of the last downloaded report
  2. Downloads and renames the reports
  3. Downloads the previously parsed CSV from Google Drive
  4. Parses the newly downloaded reports into a single large data frame
  5. Uploads the raw reports and the combined data frame with all historical data to Google Drive
  6. Appends the newly downloaded and parsed reports to the DfG PostgreSQL database. The database holds only the parsed data; the raw reports are stored in Google Drive
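The six steps compose roughly as in the sketch below. This is only an illustration of the flow: every function name here (crawl_and_download, parse_report, run_pipeline) is a hypothetical placeholder, not the actual API of sebra_pipeline.py.

```python
from datetime import date
import pandas as pd

# Illustrative sketch of the six steps above; all names are hypothetical
# placeholders, not the actual API of sebra_pipeline.py.

def crawl_and_download(since: date) -> list[str]:
    """Steps 1-2: find, download, and rename reports newer than `since`."""
    raise NotImplementedError

def parse_report(path: str) -> pd.DataFrame:
    """Step 4: parse one raw report into a data frame."""
    raise NotImplementedError

def run_pipeline(last_date: date, historic: pd.DataFrame) -> pd.DataFrame:
    paths = crawl_and_download(since=last_date)
    new_rows = pd.concat([parse_report(p) for p in paths], ignore_index=True)
    # Step 5 uploads `paths` and the combined frame to Google Drive;
    # step 6 appends only `new_rows` to the PostgreSQL database.
    return pd.concat([historic, new_rows], ignore_index=True)
```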

Installation

  1. Clone the repo
  2. Fill in the USR and PASS variables in the sample.env file with your DB credentials and save it as .env (see the sketch after this list). If you don't have these credentials, contact a member of the administration to obtain them.
  3. Download the service_acct.json file from https://drive.google.com/file/d/1GwnCsSQLM6XaPGhFGCb4jRLltWOJLYd-/view?usp=sharing. If you rename the file or save it to another location, update the corresponding path in sebra_pipeline.py.
  4. Install the Python packages listed in requirements.txt, e.g. with pip install -r requirements.txt
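As a rough illustration of how steps 2 and 3 are typically consumed, the sketch below loads the .env credentials with python-dotenv, opens a PostgreSQL connection with psycopg2, and builds a Drive client from the service-account key with google-auth and google-api-python-client. The host and database name are placeholders, and the actual pipeline may use a different Drive library; this is one standard way to do it, not necessarily what sebra_pipeline.py does.

```python
import os
from dotenv import load_dotenv          # python-dotenv
import psycopg2
from google.oauth2 import service_account
from googleapiclient.discovery import build

load_dotenv()  # reads USR and PASS from .env in the working directory

# DB connection -- host and database name are placeholders; ask the
# administration for the real values along with your credentials.
conn = psycopg2.connect(
    host="<db-host>",
    dbname="<db-name>",
    user=os.environ["USR"],
    password=os.environ["PASS"],
)

# Google Drive client from the service-account key. The pipeline may use
# a different Drive library; this is just one common approach.
creds = service_account.Credentials.from_service_account_file(
    "service_acct.json",
    scopes=["https://www.googleapis.com/auth/drive"],
)
drive = build("drive", "v3", credentials=creds)
```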

Notes and Known Issues

  1. All paths in the project are RELATIVE. If your development environment does not automatically set the working directory to the root of this repo, switch to it before running the pipeline (see the sketch after this list).
  2. Sometimes the parser fails its code-equivalence assertion, in which case the data for that day is not parsed. Because collection always resumes from the latest collected date, such days remain missing from the history. TODO: rather than starting from the latest date, crawl all missing dates from the beginning.
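One simple guard against the relative-path issue is to change into the repo root programmatically before the pipeline runs. The snippet below assumes it is placed in a script at the root of this repo; it is a convenience sketch, not part of sebra_pipeline.py.

```python
import os
from pathlib import Path

# Assumes this script lives in the repo root; change into that directory
# so the pipeline's relative paths resolve correctly.
os.chdir(Path(__file__).resolve().parent)
print("Working directory:", os.getcwd())
```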

Contact

For details, please contact elvan.aydemir@data-for-good.bg