This repository has been archived by the owner on Sep 23, 2021. It is now read-only.

department-of-veterans-affairs/health-apis-bulk-fhir

health-apis-bulk-fhir

This application is the Bulk FHIR layer that sits on top of Data Query to provide anonymized data.

Supported Resources

  • Patient

Caveats

Limitations of the current MVP include:

  • Data is currently DSTU2 format and not R4 as dictated by the specification.
  • Security is implemented with access tokens and not SMART Authorization. system/*.read scopes are not currently available.
  • Patient/$export does not support the optional _type and _since parameters.

Concept: Publication

The VA houses the largest medical history database in the US. To support a data set this large, some deviations from the specification have been made. The kick off request (/Patient/$export) does not actually initiate the bulk packaging. Instead, bulk data is prepared in advance on a periodic basis, e.g. monthly. A Publication is the periodic collection of data. The Bulk FHIR application makes one Publication available to all consumers.

The Bulk FHIR endpoints still function as specified.

  • The /Patient/$export endpoint will return the location of the Status endpoint.
  • The Status endpoint will always return a Complete response. It is never In-Progress.

Publications are very large. Depending on the resource type, the number of records can range from tens of millions to billions. Publications are made of many files, which are identified in the Complete status response. A Publication can have thousands of files, each file containing tens of thousands of records.
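
As a rough client-side sketch of the flow above: the kick-off response points at the Status endpoint, and the Complete status response lists the files. The sketch assumes the status location arrives in a Content-Location header and that the Complete body carries an "output" array of {"type", "url"} entries, as the Bulk Data specification describes. This README does not confirm either detail for this implementation, and the URLs shown are hypothetical.

```python
from typing import Mapping


def status_url_from_kickoff(headers: Mapping[str, str]) -> str:
    """Return the Status endpoint location from a kick-off response.

    Assumption: the location is carried in the Content-Location header.
    """
    for name, value in headers.items():
        if name.lower() == "content-location":
            return value
    raise ValueError("kick-off response did not include a status location")


def file_urls(complete_status: dict, resource_type: str = "Patient") -> list[str]:
    """List the file URLs for one resource type from a Complete status body."""
    return [
        entry["url"]
        for entry in complete_status.get("output", [])
        if entry.get("type") == resource_type
    ]


# Hypothetical response data, for illustration only.
kickoff_headers = {"Content-Location": "https://example.invalid/status/123"}
status_body = {
    "output": [
        {"type": "Patient", "url": "https://example.invalid/files/Patient-0001.ndjson"},
        {"type": "Patient", "url": "https://example.invalid/files/Patient-0002.ndjson"},
    ]
}
status_url = status_url_from_kickoff(kickoff_headers)
patient_files = file_urls(status_body)
```

Because a Publication can contain thousands of files, a real client would likely download the listed URLs in batches rather than all at once.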

Publications are created in a rolling wave. For example, the January publication is made available in February. The February publication will be built automatically in the background over the month and made available in March.
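
The rolling wave can be expressed as simple month arithmetic. This is only an illustration of the example above, assuming a monthly cadence; the function name is hypothetical.

```python
def available_in(year: int, month: int) -> tuple[int, int]:
    """Return the (year, month) in which a publication becomes available.

    Follows the example in the text: the publication for month M is made
    available in month M + 1.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return (year + 1, 1) if month == 12 else (year, month + 1)
```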

Concept: Anonymization

Personally identifiable information (PII) data is removed or synthesized. The following generalizations apply:

  • Optional data that is considered PII is removed
  • Dates are truncated to the year, e.g. 2005-01-01T12:34:56Z

Patient

  • Remove .address, .contact[], .id, .identifier[], .photo, .telecom.
  • Remove .multipleBirthInteger and populate .multipleBirthBoolean if applicable.
  • Synthesize .name using generated values. Only .name.given, .name.family, and .name.text will be populated.
  • Synthesize .birthDate. Patients older than 90 will have their birth date adjusted so that they appear to be 90. For example, if the current year is 2019 and the patient is 92, their birth date will be 1929-01-01T12:34:56Z.
  • Synthesize .deceasedDateTime
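
The Patient rules above might look roughly like the following sketch, which treats a Patient resource as a plain dictionary. Several details are assumptions rather than the application's actual behavior: any multipleBirthInteger is taken to mean a multiple birth, the synthesized name is a fixed placeholder, the birth date is written as a plain date, and .deceasedDateTime synthesis is not sketched because the text does not describe it.

```python
import copy

REMOVED_FIELDS = ("address", "contact", "id", "identifier", "photo", "telecom")
MAX_APPARENT_AGE = 90


def capped_birth_year(birth_year: int, current_year: int) -> int:
    """Move the birth year forward so the patient appears no older than 90."""
    return max(birth_year, current_year - MAX_APPARENT_AGE)


def anonymize_patient(patient: dict, current_year: int) -> dict:
    """Return an anonymized copy of a Patient resource (a plain dict here)."""
    result = copy.deepcopy(patient)
    for field_name in REMOVED_FIELDS:
        result.pop(field_name, None)

    # Assumption: any multipleBirthInteger implies a multiple birth.
    if "multipleBirthInteger" in result:
        del result["multipleBirthInteger"]
        result["multipleBirthBoolean"] = True

    # Placeholder values; the real name generator is not described in the text.
    result["name"] = [
        {"given": ["Given"], "family": ["Family"], "text": "Given Family"}
    ]

    if "birthDate" in result:
        year = capped_birth_year(int(result["birthDate"][:4]), current_year)
        # Month and day are reset; the time-of-day part shown in the text's
        # example is omitted in this sketch.
        result["birthDate"] = f"{year:04d}-01-01"

    # Note: .deceasedDateTime synthesis is intentionally not sketched here.
    return result
```

With a current year of 2019, a patient born in 1927 comes out with a birth year of 1929, matching the example above.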

Architecture

[Figure: architecture diagram]

Data Flow

[Figure: data flow diagram]

Notes
  • Data Query is responsible for enabling access to bulk FHIR compliant records through VA internal APIs that are protected from general access.
  • The Incredible Bulk communicates with Data Query through internal, protected APIs.
    • Internal calls to Data Query require the DATA_QUERY_INTERNAL_ACCESS_KEY found in the deployment unit.
  • The Incredible Bulk is responsible for Publication management and anonymization.
    • Internal calls to the publication endpoint require the KONG_INTERNAL_PROTECTED_OP_TOKENS found in the deployment unit.
  • Publication files are created by The Incredible Bulk but served to consumers directly from S3 (via Kong).
    • Consumer access through Kong requires sharing the KONG_PUBLIC_PROTECTED_OP_TOKENS found in the deployment unit.
  • Timers are implemented using Kubernetes batch CronJob containers that periodically poke Publication endpoints.

When building files, The Incredible Bulk gathers data from Data Query, anonymizes it, and writes it to S3.

Publication Lifecycle

  • A Publication is created using POST /internal/publication
    • Data Query will be interrogated to determine records that are available.
    • The number of files required will be determined and groups of records will be associated to each file.
    • The status of each file will be NOT_STARTED
  • A timer will trigger file building using POST /internal/publication/any/file/next
    • The first file that has a status of NOT_STARTED for the oldest Publication will be chosen.
    • Records will be extracted from Data Query, anonymized, and written to S3 for storage.
  • Once all files are created (status is COMPLETE) for the Publication, the entire Publication will be considered COMPLETE and made immediately available to consumers on future status calls. (The status endpoint is returned as part of the /Patient/$export call.)
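
The lifecycle above can be modeled in memory as follows. This is a minimal sketch, not the application's implementation: the class and function names, the `created` ordering field, and the choice to skip Publications that have no NOT_STARTED files are all assumptions, and `build` merely stands in for extracting, anonymizing, and writing to S3.

```python
from dataclasses import dataclass, field
from typing import Optional

NOT_STARTED, IN_PROGRESS, COMPLETE = "NOT_STARTED", "IN_PROGRESS", "COMPLETE"


@dataclass
class PublicationFile:
    file_id: int
    record_ids: list
    status: str = NOT_STARTED


@dataclass
class Publication:
    publication_id: str
    created: int  # hypothetical creation order; lower means older
    files: list = field(default_factory=list)

    @property
    def is_complete(self) -> bool:
        """A Publication is complete once every one of its files is COMPLETE."""
        return bool(self.files) and all(f.status == COMPLETE for f in self.files)


def create_publication(publication_id, created, record_ids, records_per_file):
    """Group the available records into files, each starting as NOT_STARTED."""
    files = [
        PublicationFile(i // records_per_file, record_ids[i:i + records_per_file])
        for i in range(0, len(record_ids), records_per_file)
    ]
    return Publication(publication_id, created, files)


def next_file(publications) -> Optional[PublicationFile]:
    """Pick the first NOT_STARTED file of the oldest Publication that has one."""
    for publication in sorted(publications, key=lambda p: p.created):
        for f in publication.files:
            if f.status == NOT_STARTED:
                return f
    return None


def build(f: PublicationFile) -> None:
    """Stand-in for extracting, anonymizing, and writing one file to S3."""
    f.status = IN_PROGRESS
    # ... the real work would happen here ...
    f.status = COMPLETE
```

Each timer tick would then amount to calling `build(next_file(publications))` when a file is available, until `is_complete` holds for the Publication.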

Notes

  • A second timer periodically checks for stalled Publication files. For example, if an instance of The Incredible Bulk crashes while building a file, the file remains marked as IN_PROGRESS but can never complete. This timer looks for such files and resets their status to NOT_STARTED so that they can be re-attempted.
  • Specific files can be built using POST /internal/publication/{id}/file/{fileId}
  • Publications can be listed using GET /internal/publication
  • Status can be queried using GET /internal/publication/{id}
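
The recovery timer described in the first note could work along these lines. The sketch assumes each file records when its build started and that a file is considered stalled after a configurable number of seconds; neither detail is stated in this README, and the field names are hypothetical.

```python
def reset_stalled_files(files, now, max_build_seconds):
    """Mark IN_PROGRESS files that have run too long as NOT_STARTED again.

    `files` is a list of dicts with "status" and "started_at" (epoch seconds).
    Returns the number of files that were reset.
    """
    reset = 0
    for f in files:
        stalled = (
            f["status"] == "IN_PROGRESS"
            and now - f["started_at"] > max_build_seconds
        )
        if stalled:
            f["status"] = "NOT_STARTED"
            f["started_at"] = None
            reset += 1
    return reset
```

Once reset, a stalled file would be picked up again by the regular file-building timer.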

Shortcomings, Gotchas, and Potential Problems

  • The Implementation Guide (IG) has been updated since this PoC.
  • OAuth is not currently supported. A simple API key authentication method is used instead. The API key is an "all or nothing" approach: there is no mechanism for granting different users access to different resources.
  • The IG assumes bulk data files are generated on demand. Our data set is very large and not well suited to on-demand creation. Instead, data sets are created monthly, with Patient alone taking many days. Files are built in batches to avoid overloading the servers and database.
  • Data sets are very large, e.g. Observation has billions of records. Transferring this data to clients will be time consuming. Per the specification, records are spread across multiple files. We must find a balance between very large files and a very large number of files. Even with large files, there will still be a great number of them. This
  • The Bulk FHIR specification defines STU3 structures; this PoC returns DSTU2-flavored structures.
  • Only Patient resource is implemented. Support for Observation, Condition, Procedure, etc. is absent.
  • The current solution builds comprehensive publications monthly. There is a significant cost (in time) to produce the data set. There is no support for incremental updates, which could be problematic for users who wish to stay current.
  • We do not support optional endpoints or parameters for the following
    • groups or group level data export
    • system level export, e.g. services/fhir/v0/stu3$export
    • query parameters:
      • _outputFormat (we only support output application/fhir+ndjson)
      • _since (time based filtering)
      • _type (We only support Patient)
      • experimental parameters, e.g. type filters
    • delete operations
    • the new optional Expires header (not supported, but should be)
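
Since application/fhir+ndjson is the only supported output format, a consumer reads each file as newline-delimited JSON with one resource per line. A minimal reader might look like this; the sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def read_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed resource per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample content standing in for one downloaded file.
sample = (
    '{"resourceType": "Patient", "gender": "male"}\n'
    '{"resourceType": "Patient", "gender": "female"}\n'
)
patients = list(read_ndjson(sample.splitlines()))
```

Because the reader is a generator, it can also be fed an open file object line by line, which avoids loading a file with tens of thousands of records into memory at once.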