Skip to content

Submission Loader

Keith Hickey edited this page May 26, 2020 · 26 revisions

Purpose

This page explains how the "submission loader" ingests DABS data from Broker on a daily basis.

Code

Bulk of the code is in usaspending_api.etl.management.commands.load_submission

Executive Summary

The submission loader runs nightly, extracting published (used to be "certified") DABS from Broker using SQL and loading that data into USAspending tables by saving Django ORM objects.

Submission data are for the most part copied over verbatim to USAspending database tables from Broker database tables, which store valid File A/B/C data. Submissions are loaded into USAspending nightly, after dependent data changes are loaded, including the Broker award data as well as reference data that the incoming data may refer to. Submissions that are found to already be loaded are deleted in USAspending before they are loaded.

ETL Source -> Destination

Data Broker Source USAspending Destination
Submission Metadata certify_history submission_attributes
Submission submission submission_attributes
File A certified_appropriation appropriation_account_balances (FYTD TAS balances for budget/obligations/outlays)
File B certified_object_class_program_activity financial_accounts_by_program_activity_object_class (FYTD TAS+OC+PA balances for obligations/outlays)
File C certified_award_financial financial_accounts_by_awards (TAS+PA+OC+Award obligations/outlays)

Execution Model

The submission loader uses Python code, structured in a custom Django management command using manage.py that effectively makes it an executable python script. The loader is run as a step in the USAspending "DATA Act Production Nightly Load" data pipeline, executed by Jenkins. Its step is executed after all other data from Broker is loaded, such as award data, subaward data, and various other reference data sets are updated.

Jenkins directly calls the load_multiple_submissions Django management command with HARDCODED fiscal years and quarters that must be updated annually. load_multiple_submissions is then responsible for calling load_submission once per new/updated submission.

Data Access

The major steps to extract data from Broker are performed in SQL, using the Python psycopg2 library and a connection to the Broker Postgres databases. This data is turned into python dictionaries which are loaded row-by-row into the USAspending database by instantiating Django model objects from the Broker data dicts, and performing save operations using the Django ORM.

The major step to load the data from Broker to USAspending is run under a @transaction.atomic decorator, which should allow rollbacks of the entire submission load if any errors occur.

Dependent Data

File A, File B, and File C make reference to reference data which should exist, and in some cases must exist (for DB foreign keys), in USAspending before their records can be saved. This helps in referential integrity and data consistency across Broker and USAspending.

  1. Award data:
    • Referenced By: File C
    • During the nightly pipeline execution, before submissions are loaded, transactions are loaded from Broker's detached_award_procurement table and published_award_financial_assistance table. From them a unique award identifier is derived and then an award is created if one does not yet exist for this key. Creation of the award record allows for establishing a linkage of File C data to an award record.
  2. TAS
    • Referenced By: File A, File B, File C
    • records should be loaded into USAspending to allow FK linkages from File A, File B, and File C records. At this moment those records are skipped if such a linkage cannot be made. These are refreshed using load_tas from Broker in a nightly pipeline step that runs before the submission loader.
  3. Object Class
    • Referenced By: File B, File C
    • records must be loaded into USAspending so FK linkages from File B and File C records to them can be made, otherwise the submission load will error-out There are only 100-200 of them and change rarely. See Data Exchanges docs for more detail on how they're loaded.
  4. Program Activity
    • Referenced By: File B, File C
    • records are loaded into USAspending on-the-fly as they are seen in File B or File C.
  5. DEF Codes
    • Referenced By: File B, File C
    • records are new, and their load is TBD.

Derived or Summarized Data

This section covers any data in USAspending that is (re)calculated from the raw Broker data loaded, or otherwise summarized (rolled-up sum, average, etc.).

NOTE: No numeric values are pre-aggregated during the load. This is left to query-time of the API.

final_of_fy

This is a boolean field in the File A (appropriation_account_balances) and File B (financial_accounts_by_program_activity_object_class) loaded USAspending tables that exists on each row with a default value of false.

  • It is set to true in appropriation_account_balances when the record is the last seen reporting of a TAS value for a Fiscal Year.
  • It is set to true in financial_accounts_by_program_activity_object_class when the record is the last seen reporting of a TAS+OC+PA (and soon to be +DEFC) value for a Fiscal Year.

This value is reset on all records per table per loaded submission.

  • It is reset on all File A records after all File A records in the single submission are loaded
  • It is reset on all File B records after all File B records in the single submission are loaded

This helps with queries that need to aggregate totals across dimensions (TAS / TAS+OC+PA) for a full fiscal year, or multiple fiscal years, but need to account for some of those records not being reported each quarter of the FY.

⚠️ There are some suspected inconsistencies in the granularity of how the final_of_fy is queried, and it is being investigated

C-to-D Linkage

During the ingestion of File C, the submission loader attempts to link DABS File C records to an Award record in the USAspending award data, according to the C-to-D Linkage. This allows displays or queries of information about account data or award data to benefit from further information provided in its linked domain area.

⚠️ NOTE: This above docs reference SQL update statements queries that are run to handle this, but the linkage also happens in the load_submission.py Python code of the submission loader by setting the created financial_accounts_by_awards.award_id field in the load_file_c(...) function to be the result of the find_matching_award(...) function.

Deriving Discrete Quarterly Activity

⚠️ NOTE: This is deprecated. Code that does this will soon be removed, as it and resultant tables is not used. Effectively it was attempting to calculate discrete per-single-quarter activity budget/obligations/outlays activity by finding the difference from a previous quarter. Also known as "submission chaining" which is also deprecated.

Disconnected Steps

Anything “disconnected” from the process as a post-step that must be remembered to be called separately, or occasionally (every month/quarter … every time X data load runs or Y file is ingested… etc.)?

Reference Data Loads

  • Object Classes should be loaded at least once before running this loader, and ideally every time there is a major change to them - see Dependent Data above
  • TAS records should be loaded from Broker every time before running this loader - see Dependent Data above and note that these are loaded as a prior step in the Jenkins pipeline

Command Parameterization

  • FY+Period parameterized runs need to be added to the Jenkins execution script each new FY. The calls to load_multiple_submission are generated by a loop to auto-generate command line invocations.
    • The range of this loop needs to be expanded periodically

Data Changes

  • If transactions or award records are modified as part of a production "data change", the C-to-D linkage management command and/or SQL code may need to be re-run.
    • If C-to-D linkage is altered, certain materialized views that flatten into denormalized tables the relationship between award records and TAS records may need to be refreshed
    • Similarly, Elasticsearch indices that store this same relationship may need to be updated or re-indexed

Error Conditions

  1. General Python errors (KeyError, AttributeError on None objects, etc.)
  2. ObjectClass.DoesNotExist (subclass of django.core.exceptions.ObjectDoesNotExist) if a referenced Object Class is not found
  3. django.core.management.base.CommandError if submission with given ID not found, or multiple found in Broker

SEE ALSO

See docs on the following management commands that do the work of the submission loader to read through more detailed steps and execution options/args

Clone this wiki locally