This repository contains code and scripts for gathering data from the Stony Brook University ECHO database and other web sources, and for filtering that data for use in generating R Markdown report cards.
- Python and SQLite are required.
- The main branch of the ECHO_modules repository must be checked out into the home directory of this project.
- The latest leg_info.db SQLite database of legislator information, from the EEW-Report-Making repository, needs to be placed into the home directory of this project.
This diagram shows the overall process of getting the data for the report cards. It looks complicated, but many of the steps only need to be run when congressional districts change, or when legislators change.
Most of the data gathering shown in the diagram is done infrequently. The leg_info.py script that builds the leg_info.db SQLite database of legislator data only needs to be run when legislators change. The same is true of the get_leg_image.py script that gets the images for the current legislators. The RegionMap.py program that creates maps of the congressional districts and states only needs to be run when districts are re-drawn.
Each month we extract the data needed for all Congressional District and State report cards from the Stony Brook database and store it in a local SQLite database, region.db. The extraction is performed by AllPrograms.py. Because AllPrograms.py must gather a large volume of data before filtering it for the report cards, it cannot process all congressional districts at once; each run must be limited to around 60 CDs.
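The batching constraint can be illustrated with a minimal sketch. The function name `batch_districts` and the example district list are hypothetical and only show the idea of splitting the full list into runs of at most about 60 CDs; this is not the repository's own code.

```python
from typing import List, Sequence, Tuple

# (state abbreviation, CD number), e.g. ("AL", 1)
District = Tuple[str, int]


def batch_districts(districts: Sequence[District],
                    batch_size: int = 60) -> List[List[District]]:
    """Split the district list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [list(districts[i:i + batch_size])
            for i in range(0, len(districts), batch_size)]


# Illustrative input only: 130 made-up districts, not the real list.
example = [("XX", n) for n in range(1, 131)]
print([len(batch) for batch in batch_districts(example)])  # [60, 60, 10]
```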
The -c argument to AllPrograms.py is a list of comma-separated state, CD number pairs (e.g. AL,1). The -f option specifies the focus year of the data, which is generally the last full year of reliable data. (In 2021 we are specifying 2020 as the focus year.)
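As an illustration of the state, CD number pair format (e.g. AL,1), the sketch below parses such pairs from text. The helper `parse_state_cd_pairs` is hypothetical, and the one-pair-per-line layout is an assumption; how AllPrograms.py itself reads its -c argument is not shown here.

```python
import csv
import io
from typing import List, Tuple


def parse_state_cd_pairs(text: str) -> List[Tuple[str, int]]:
    """Parse lines such as 'AL,1' into (state, cd_number) tuples."""
    pairs = []
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected 'STATE,CD' but got {row!r}")
        pairs.append((row[0].strip().upper(), int(row[1])))
    return pairs


print(parse_state_cd_pairs("AL,1\nAL,2\nTX,34\n"))
# [('AL', 1), ('AL', 2), ('TX', 34)]
```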
A script run_AllPrograms.sh runs AllPrograms.py with batches of congressional districts. The districts are listed in 9 files state_cd-X.csv, where X is 1 to 9. The script is run as follows:
- run_AllPrograms.sh
run_AllPrograms.sh first backs up the current region.db database, appending the current date to the file name. Then it cleans out the tables that will be re-populated by running AllPrograms.py.
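The backup-then-clean step could look roughly like the following Python sketch. The backup file name format (ISO date appended after a dot) and the list of tables passed to `clear_tables` are assumptions; the actual shell script may do this differently.

```python
import shutil
import sqlite3
from datetime import date
from pathlib import Path
from typing import Iterable


def backup_database(db_path: Path, today: date) -> Path:
    """Copy the database next to itself with the date appended to the name."""
    backup = db_path.with_name(f"{db_path.name}.{today.isoformat()}")
    shutil.copy2(db_path, backup)
    return backup


def clear_tables(db_path: Path, tables: Iterable[str]) -> None:
    """Delete all rows from the tables that the next run will re-populate."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            for table in tables:
                # Table names come from a fixed list, never from user input.
                conn.execute(f'DELETE FROM "{table}"')
    finally:
        conn.close()
```

Taking the backup before clearing anything means a failed run can be recovered by copying the dated file back over region.db.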
A log file, AllPrograms.log, can be viewed to determine whether any problems were encountered while running AllPrograms.py on any of the state_cd-X.csv files.
The AllPrograms.py program writes into tables in the local region.db SQLite database. The schema for this database is in region_db.schema.
- The region.db local SQLite database is backed up.
- AllPrograms.py is run. It uses AllPrograms_util.py and AllPrograms_db.py to collect the needed data from the SBU database and populate region.db.
- state_cd-*.csv files contain the CDs to be run in each batch by AllPrograms.py.
- Standard output and standard error messages are collected in log files.
The schema of this small, local database is in region_db.schema. The database collects summarized data from the SBU ECHO database organized by regions. (Currently Congressional Districts and States are the only regions supported.) The tables in the database are:
- regions - This identifies all of the regions (congressional districts and states) for which data exists. All other tables link via the regions table's rowid index.
- active_facilities - the count of facilities for each program--CAA, CWA, RCRA, GHG
- per_fac - counts of violations, etc. (type) by program (CAA, etc.) by year, per facility
- violations - counts of violations by program by year
- enforcements - counts of enforcements and penalty amounts by program by year
- ghg_emissions - amounts by year
- non_compliants - facilities, quarters of non-compliance, formal actions, URL, latitude and longitude by program
- violations_by_facilities - number of facilities and non-compliant quarters by program
- enf_per_fac - number of facilities, count of enforcements, amount of enforcements by year and program
- inflation - yearly inflation factors
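To show how the other tables link back to regions through its rowid, here is a small in-memory sketch. The column names (`region_type`, `state`, `region`, `region_id`, `program`, `facility_count`) and the data values are invented for illustration; the real definitions are in region_db.schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified versions of two of the tables.
conn.executescript("""
CREATE TABLE regions (region_type TEXT, state TEXT, region TEXT);
CREATE TABLE active_facilities (region_id INTEGER, program TEXT,
                                facility_count INTEGER);
""")
cur = conn.execute(
    "INSERT INTO regions VALUES ('Congressional District', 'TX', '34')")
region_id = cur.lastrowid  # the rowid that the other tables refer to
conn.executemany(
    "INSERT INTO active_facilities VALUES (?, ?, ?)",
    [(region_id, "CAA", 12), (region_id, "CWA", 30)],  # made-up counts
)


def facilities_for(conn: sqlite3.Connection, state: str, region: str) -> dict:
    """Return {program: facility_count} for one region via the rowid link."""
    rows = conn.execute(
        """
        SELECT af.program, af.facility_count
        FROM active_facilities AS af
        JOIN regions AS r ON af.region_id = r.rowid
        WHERE r.state = ? AND r.region = ?
        ORDER BY af.program
        """,
        (state, region),
    ).fetchall()
    return dict(rows)


print(facilities_for(conn, "TX", "34"))  # {'CAA': 12, 'CWA': 30}
```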
In the .Rmd templates, the reticulate R package is employed to allow the use of the Python Region.py code in the R Markdown file.
The Region module is imported, and the Region constructor is called with the type parameter set to 'State' or 'Congressional District', value set to the CD number (omitted for states), and state set to the state abbreviation (e.g. 'TX'). The resulting region variable can then be used to request data through the functions provided.
- Functions available through Region.py include:
u <- import( 'Region' )
# region <- u$Region(type='State', state='TX')
region <- u$Region(type='Congressional District', value='34', state='TX')
USAinspectionsper1000_All <- region$get_per_1000( 'inspections', 'USA', 2020 )
inspectionsper1000_state <- region$get_per_1000( 'inspections', 'State', 2020 )
inspectionsper1000_cd <- region$get_per_1000( 'inspections', 'CD', 2020 )
USAviolationsper1000_All <- region$get_per_1000( 'violations', 'USA', 2020 )
violationsper1000_state <- region$get_per_1000( 'violations', 'State', 2020 )
violationsper1000_cd <- region$get_per_1000( 'violations', 'CD', 2020 )
inflation <- region$get_inflation( 2020 )
CWAper1000 <- region$get_cwa_per_1000( 2020 )
violations <- region$get_events( 'violations', 'All', 2020 )
CAAviolations <- region$get_events( 'violations', 'CAA', 2020 )
CWAviolations <- region$get_events( 'violations', 'CWA', 2020 )
RCRAviolations <- region$get_events( 'violations', 'RCRA', 2020 )
enforcement <- region$get_events( 'enforcements', 'All', 2020 )
CAAenforcement <- region$get_events( 'enforcements', 'CAA', 2020 )
CWAenforcement <- region$get_events( 'enforcements', 'CWA', 2020 )
RCRAenforcement <- region$get_events( 'enforcements', 'RCRA', 2020 )
CAArecurring <- region$get_recurring_violations( 'CAA' )
CWArecurring <- region$get_recurring_violations( 'CWA' )
RCRArecurring <- region$get_recurring_violations( 'RCRA' )
inspections <- region$get_events( 'inspections', 'All', 2020 )
CAAinspections <- region$get_events( 'inspections', 'CAA', 2020 )
CWAinspections <- region$get_events( 'inspections', 'CWA', 2020 )
RCRAinspections <- region$get_events( 'inspections', 'RCRA', 2020 )
CAA_active_facilities <- region$get_active_facilities('CAA')
CWA_active_facilities <- region$get_active_facilities('CWA')
RCRA_active_facilities <- region$get_active_facilities('RCRA')
This database is a collection of information on legislators gathered from a few web resources. Images, committee information and other data are retrieved from these sources:
- the @unitedstates project https://theunitedstates.io/
- Govtrack https://www.govtrack.us/congress/members/
- Wikipedia https://en.wikipedia.org/wiki/
- Open Secrets https://www.opensecrets.org/members-of-congress/
The leg_info database contains the following tables:
- legislators - This identifies each legislator with links to their official URL and online data sources such as Govtrack, Wikipedia, Open Secrets and others.
- committees - This is a collection of all Senate and House of Representatives committees.
- sub_committees - These are the subcommittees, linked to their committees.
- committee_members - This links legislators to the committees and subcommittees they serve on.
The Linux cron utility is used to run several of our processes on an automated schedule. The commands to be run are managed with the 'crontab -e' command.
# m h dom mon dow command
0 19 15 * * /home/edgi/EEW-ReportCard-Data/backup_active_facilities.sh
0 19 16 * * /home/edgi/EEW-ReportCard-Data/run_AllPrograms.sh
0 23 16 * * /home/edgi/EEW-ReportCard-Data/run_leg_info.sh
0 5 20 * * /home/edgi/EEW-ReportCard-Data/run_reportcards.sh
0 5 20 * * /home/edgi/EEW-ReportCard-Data/send_to_eew_web.sh
The account's .profile is not read when commands are run by cron, so the EEW_HOME environment variable that .profile normally provides must be set explicitly in the bash shell scripts.
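The same concern applies to any Python program started from cron. The sketch below is a hypothetical guard, not part of the repository: it fails loudly when EEW_HOME is missing instead of silently using a wrong directory. The example path is only illustrative.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_eew_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return EEW_HOME as a Path, or raise if cron did not provide it."""
    env = os.environ if env is None else env
    value = env.get("EEW_HOME", "").strip()
    if not value:
        raise RuntimeError(
            "EEW_HOME is not set; cron does not read .profile, "
            "so the calling script must export it explicitly")
    return Path(value)


# Illustrative value only; the real EEW_HOME may point elsewhere.
print(resolve_eew_home({"EEW_HOME": "/home/edgi/EEW-ReportCard-Data"}))
```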
This script copies the current active_facilities table in region.db into active_facilities_previous. The copy is used later to test whether run_AllPrograms.sh succeeded, by comparing all entries in active_facilities with their previous values. The number of facilities for a program and region can be expected to change somewhat between runs of AllPrograms against ECHO and the SBU database, but a large difference likely signals a problem with one of the batches of CDs processed by AllPrograms, in which case that batch is processed again. (A second failure just results in a logged error.)
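The comparison against previous values can be sketched as follows. The 25% threshold, the `(region rowid, program)` key shape, and the function name are assumptions made for illustration; the source does not say what counts as a "large difference".

```python
from typing import Dict, List, Tuple

Key = Tuple[int, str]  # (region rowid, program), a hypothetical key shape


def find_suspect_changes(previous: Dict[Key, int],
                         current: Dict[Key, int],
                         max_relative_change: float = 0.25) -> List[Key]:
    """Return keys whose facility count changed by more than the threshold."""
    suspects = []
    for key in sorted(set(previous) | set(current)):
        old = previous.get(key, 0)
        new = current.get(key, 0)
        baseline = max(old, 1)  # avoid dividing by zero for new entries
        if abs(new - old) / baseline > max_relative_change:
            suspects.append(key)
    return suspects


# Made-up counts for illustration.
previous = {(1, "CAA"): 100, (1, "CWA"): 200, (2, "CAA"): 50}
current = {(1, "CAA"): 104, (1, "CWA"): 90, (2, "CAA"): 50}
print(find_suspect_changes(previous, current))  # [(1, 'CWA')]
```

Any batch containing a flagged region would then be a candidate for the single re-run described above.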
This script retrieves the current data for all regions (CDs) from the SBU ECHO database using AllPrograms.py. Because of the large number of congressional districts that must be processed, they are grouped into 9 CSV files (state_cd-X.csv, where X is 1 to 9). Each CSV file is given to AllPrograms.py, which populates the SQLite region.db database.
This script runs the leg_info.py program to gather legislator information into the leg_info.db SQLite database.
This script calls the run_state_reportcards.R and run_CD_reportcards.R scripts, which use the State_template.rmd and CD_template.rmd markdown templates to generate report cards for every region. States and CDs are each batched into three groups according to their state names. Generated HTML and PDF report cards are written to the Output directory.
Rscript run_state_reportcards.R -s '^[A-I]'
Rscript run_state_reportcards.R -s '^[J-R]'
Rscript run_state_reportcards.R -s '^[S-Z]'
Rscript run_CD_reportcards.R -s '^[A-I]'
Rscript run_CD_reportcards.R -s '^[J-R]'
Rscript run_CD_reportcards.R -s '^[S-Z]'
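The three patterns passed with -s split the alphabet into non-overlapping ranges. The Python sketch below shows how such patterns partition a list of states; it assumes the pattern is matched against the start of a state identifier (abbreviations are used here), which may differ from what the R scripts actually match against.

```python
import re
from typing import Dict, Iterable, List, Sequence

BATCH_PATTERNS = ["^[A-I]", "^[J-R]", "^[S-Z]"]


def group_by_pattern(states: Iterable[str],
                     patterns: Sequence[str] = BATCH_PATTERNS
                     ) -> Dict[str, List[str]]:
    """Assign each state to the first pattern that matches it."""
    groups: Dict[str, List[str]] = {pattern: [] for pattern in patterns}
    for state in states:
        for pattern in patterns:
            if re.search(pattern, state):
                groups[pattern].append(state)
                break
    return groups


print(group_by_pattern(["AL", "NY", "TX", "IL", "RI"]))
# {'^[A-I]': ['AL', 'IL'], '^[J-R]': ['NY', 'RI'], '^[S-Z]': ['TX']}
```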
This script uses FTP to send all of the report cards found in the Output directory to the EEW hosted web server.
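For reference, an FTP upload of the Output directory can be sketched in Python with the standard ftplib module. This is not the shell script's actual implementation: the helper names, the choice of .html and .pdf suffixes, and the flat remote layout are assumptions, and the host and credentials are deliberately left out.

```python
from ftplib import FTP
from pathlib import Path
from typing import Iterable, List, Sequence


def report_files(output_dir: Path,
                 suffixes: Sequence[str] = (".html", ".pdf")) -> List[Path]:
    """List the report card files in output_dir, sorted by name."""
    return sorted(p for p in output_dir.iterdir()
                  if p.is_file() and p.suffix.lower() in suffixes)


def upload_reports(ftp: FTP, files: Iterable[Path]) -> int:
    """Send each file over an already logged-in FTP connection."""
    sent = 0
    for path in files:
        with path.open("rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)
        sent += 1
    return sent


# Usage (host, user, and password are placeholders to be supplied):
#   with FTP(host) as ftp:
#       ftp.login(user, password)
#       upload_reports(ftp, report_files(Path("Output")))
```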
Copyright (C) Environmental Data and Governance Initiative (EDGI)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the LICENSE file for details.