health-apis-bulk-fhir

This application is the Bulk FHIR layer that sits on top of Data Query to provide anonymized data.

Supported Resources

Patient

Caveats

Limitations of the current MVP include:

Data is currently in DSTU2 format, not R4 as dictated by the specification.
Security is implemented with access tokens and not SMART Authorization. system/*.read scopes are not currently available.
Patient/$export does not support the optional _type and _since parameters.

Concept: Publication

The VA houses the largest medical history database in the US. To support a data set this large, some deviations from the specification have been made. The kick-off request (/Patient/$export) does not actually initiate the bulk packaging. Instead, bulk data is prepared in advance on a periodic basis, e.g. monthly. A Publication is one such periodic collection of data. The Bulk FHIR application makes one Publication available to all consumers.

The Bulk FHIR endpoints still function as specified.

The /Patient/$export endpoint will return the location of the Status endpoint.
The Status endpoint will always return a Complete response. It is never In-Progress.

Publications are very large. Depending on the resource type, the number of records can range from tens of millions to billions. Publications are made of many files, which are identified in the Complete status response. A Publication can have thousands of files, each file containing tens of thousands of records.
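
The consumer-side flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's client: the base URL, the bearer-token header, and the injected http_get callable are assumptions made for the example, and the 202 plus Content-Location kick-off response and the "output" array in the Complete response follow the Bulk Data specification rather than anything stated in this README.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical base URL; the real host and path are not stated in this README.
BASE_URL = "https://example.invalid/services/fhir/v0"

# (status code, response headers, response body)
Response = Tuple[int, Dict[str, str], str]
HttpGet = Callable[[str, Dict[str, str]], Response]


def list_publication_files(http_get: HttpGet, token: str) -> List[str]:
    """Kick off an export, follow the status link, and return the file URLs."""
    headers = {
        # Assumption: the access token is sent as a bearer token.
        "Authorization": f"Bearer {token}",
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",
    }
    status, response_headers, _ = http_get(f"{BASE_URL}/Patient/$export", headers)
    if status != 202:
        raise RuntimeError(f"kick-off failed with HTTP {status}")
    status_url = response_headers["Content-Location"]

    # The Publication is prepared in advance, so no polling loop is needed here.
    status, _, body = http_get(status_url, headers)
    if status != 200:
        raise RuntimeError(f"status request failed with HTTP {status}")
    manifest = json.loads(body)
    return [entry["url"] for entry in manifest.get("output", [])]
```

Passing the HTTP call in as a function keeps the sketch independent of any particular HTTP library. Because a Publication can list thousands of files, a real client would likely download them in batches rather than all at once.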

Publications are created in a rolling wave. For example, the January publication is made available in February. The February publication will be built automatically in the background over the month and made available in March.

Concept: Anonymization

Personally identifiable information (PII) is removed or synthesized. The following generalizations apply:

Optional data that is considered PII is removed
Dates are truncated to the year, e.g. 2005-01-01T12:34:56Z

Patient

Remove .address, .contact[], .id, .identifier[], .photo, .telecom.
Remove .multipleBirthInteger and populate .multipleBirthBoolean if applicable.
Synthesize .name using generated values. Only .name.given, .name.family, and .name.text will be populated.
Synthesize .birthDate. Patients who are older than 90 will have their birth date adjusted so that they appear to be 90. For example, if the current year is 2019 and the patient is 92, their birth date will be 1929-01-01T12:34:56Z.
Synthesize .deceasedDateTime
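
The Patient rules above might look something like the sketch below when applied to a DSTU2 Patient represented as a plain dictionary. It is an illustration under stated assumptions, not the anonymizer module's code: the name pools are made up, any multipleBirthInteger is assumed to imply multipleBirthBoolean = true, .deceasedDateTime is assumed to be truncated to the year like other dates, and the fixed time component shown in the examples above is not reproduced.

```python
import copy
import random
from typing import Any, Dict

REMOVED_FIELDS = ("address", "contact", "id", "identifier", "photo", "telecom")
MAX_APPARENT_AGE = 90

# Hypothetical name pools; how names are really generated is not documented here.
GIVEN_NAMES = ("Alex", "Jordan", "Taylor")
FAMILY_NAMES = ("Smith", "Jones", "Brown")


def truncate_to_year(value: str) -> str:
    """Keep only the year of an ISO 8601 date or dateTime string."""
    return f"{value[:4]}-01-01"


def synthesize_birth_date(birth_date: str, current_year: int) -> str:
    """Truncate to the year and cap the apparent age at MAX_APPARENT_AGE."""
    year = max(int(birth_date[:4]), current_year - MAX_APPARENT_AGE)
    return f"{year}-01-01"


def anonymize_patient(patient: Dict[str, Any], current_year: int,
                      rng: random.Random) -> Dict[str, Any]:
    """Return an anonymized copy; the input dictionary is left untouched."""
    result = copy.deepcopy(patient)
    for field_name in REMOVED_FIELDS:
        result.pop(field_name, None)

    # Assumption: a birth-order integer implies the patient is a multiple birth.
    if result.pop("multipleBirthInteger", None) is not None:
        result["multipleBirthBoolean"] = True

    given, family = rng.choice(GIVEN_NAMES), rng.choice(FAMILY_NAMES)
    # In DSTU2, HumanName.given and HumanName.family are both lists.
    result["name"] = [{"given": [given], "family": [family],
                       "text": f"{given} {family}"}]

    if "birthDate" in result:
        result["birthDate"] = synthesize_birth_date(result["birthDate"], current_year)
    if "deceasedDateTime" in result:
        # Assumption: synthesized by truncating to the year.
        result["deceasedDateTime"] = truncate_to_year(result["deceasedDateTime"])
    return result
```

With current_year set to 2019, a patient born in 1927 comes out with a birth year of 1929, matching the example above.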

Architecture

Data Flow

Notes

Data Query is responsible for enabling access to bulk FHIR compliant records through VA internal APIs that are protected from general access.
The Incredible Bulk communicates with Data Query through internal, protected APIs.
- Internal calls to Data Query require the DATA_QUERY_INTERNAL_ACCESS_KEY found in the deployment unit.
The Incredible Bulk is responsible for Publication management and anonymization.
- Internal calls to the publication endpoint require the KONG_INTERNAL_PROTECTED_OP_TOKENS found in the deployment unit.
Publication files are created by The Incredible Bulk but served to consumers directly from S3 (via Kong).
- Consumer access through Kong requires sharing the KONG_PUBLIC_PROTECTED_OP_TOKENS found in the deployment unit.
Timers are implemented using Kubernetes batch CronJob containers that periodically poke Publication endpoints.

When building files, The Incredible Bulk will gather data from Data Query, anonymize it, and write it to S3.

Publication Lifecycle

A Publication is created using POST /internal/publication
- Data Query will be interrogated to determine records that are available.
- The number of files required will be determined and groups of records will be associated to each file.
- The status of each file will be NOT_STARTED.
A timer will trigger file building using POST /internal/publication/any/file/next
- The first file that has a status of NOT_STARTED for the oldest Publication will be chosen.
- Records will be extracted from Data Query, anonymized, and written to S3 for storage.
Once all files are created (status is COMPLETE) for the Publication, the entire Publication will be considered COMPLETE and made immediately available to consumers on future status calls. (The status endpoint is returned as part of the /Patient/$export call.)
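
The lifecycle above can be modeled in memory along these lines. This is a simplified sketch, not the application's implementation: the class and function names, the records_per_file parameter, and the integer creation order are all hypothetical, and the extract, anonymize, and write-to-S3 step is left as a comment.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

NOT_STARTED, IN_PROGRESS, COMPLETE = "NOT_STARTED", "IN_PROGRESS", "COMPLETE"


@dataclass
class PublicationFile:
    file_id: str
    first_record: int
    record_count: int
    status: str = NOT_STARTED


@dataclass
class Publication:
    publication_id: str
    created: int  # creation order; a smaller value means an older Publication
    files: List[PublicationFile] = field(default_factory=list)

    @property
    def is_complete(self) -> bool:
        return bool(self.files) and all(f.status == COMPLETE for f in self.files)


def create_publication(publication_id: str, created: int, total_records: int,
                       records_per_file: int) -> Publication:
    """Split the available records into files, each starting as NOT_STARTED."""
    file_count = math.ceil(total_records / records_per_file)
    files = []
    for index in range(file_count):
        first = index * records_per_file
        count = min(records_per_file, total_records - first)
        files.append(PublicationFile(f"{publication_id}-{index + 1}", first, count))
    return Publication(publication_id, created, files)


def next_file(publications: List[Publication]
              ) -> Optional[Tuple[Publication, PublicationFile]]:
    """Pick the first NOT_STARTED file of the oldest Publication that has one."""
    for publication in sorted(publications, key=lambda p: p.created):
        for candidate in publication.files:
            if candidate.status == NOT_STARTED:
                return publication, candidate
    return None


def build_next_file(publications: List[Publication]) -> Optional[PublicationFile]:
    """Build one file, as a timer-triggered call would."""
    picked = next_file(publications)
    if picked is None:
        return None
    _, target = picked
    target.status = IN_PROGRESS
    # Here the records would be extracted, anonymized, and written to S3.
    target.status = COMPLETE
    return target
```

For instance, 25 records at 10 records per file yield three files holding 10, 10, and 5 records. The Publication counts as complete only after three calls to build_next_file.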

Notes

A second timer will periodically check for incomplete Publication files. For example, if an instance of The Incredible Bulk crashes while building a file, the file will remain marked as IN_PROGRESS but can never complete. This timer looks for such files and resets their status to NOT_STARTED so that they can be re-attempted.
Specific files can be built using POST /internal/publication/{id}/file/{fileId}
Publications can be listed using GET /internal/publication
Status can be queried using GET /internal/publication/{id}
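
The recovery timer from the first note could be approximated as shown below. How the application actually decides that an IN_PROGRESS file has been abandoned is not documented here, so the started_at field and the one-hour threshold are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold: how long a build may run before it is presumed dead.
STALE_AFTER_SECONDS = 3600.0


@dataclass
class FileRecord:
    file_id: str
    status: str
    started_at: float  # epoch seconds when building began (hypothetical field)


def reset_stale_files(files: List[FileRecord], now: float,
                      stale_after: float = STALE_AFTER_SECONDS) -> List[str]:
    """Reset long-running IN_PROGRESS files to NOT_STARTED and return their IDs."""
    reset_ids = []
    for record in files:
        if record.status == "IN_PROGRESS" and now - record.started_at > stale_after:
            record.status = "NOT_STARTED"
            reset_ids.append(record.file_id)
    return reset_ids
```

Files reset this way become eligible again for the next POST /internal/publication/any/file/next call.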

Shortcomings, Gotchas, and Potential Problems

The Implementation Guide (IG) has been updated since this PoC.
OAuth is not currently supported. A simple API key authentication method is used. The API key provides an "all or nothing" approach. We have no mechanism for allowing access to different resources for different users.
The IG assumes bulk data files are generated on demand. Our data set is very large and not well suited for on-demand creation. Instead, data sets are created monthly, taking many days for Patient alone. Files are built in batches to avoid overloading the servers and database.
Data sets are very large, e.g. Observation has billions of records. Transferring this data to clients will be time consuming. Per the specification, records are spread across multiple files. We must find a balance between having very large files and having a very large number of files. Even with large files, there will still be a great number of them. This
The Bulk FHIR specification defines STU3 structures, but this PoC returns DSTU2-flavored structures.
Only Patient resource is implemented. Support for Observation, Condition, Procedure, etc. is absent.
The current solution builds comprehensive publications monthly. There is significant cost (in time) to produce the data set. There is no support for incremental updates, which could be problematic for users who wish to stay current.
We do not support the following optional endpoints or parameters:
- groups or group level data export
- system level export, e.g. services/fhir/v0/stu3$export
- query parameters:
  - _outputFormat (we only support application/fhir+ndjson output)
  - _since (time based filtering)
  - _type (we only support Patient)
  - experimental parameters, e.g. type filters
- delete operations
- the new optional Expires header (not supported, but should be)
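
A consumer might guard against the unsupported query parameters listed above with a small pre-flight check like this one. It is a client-side sketch only: the function name is hypothetical, how the server actually reacts to these parameters is not described here, and treating _type=Patient as acceptable is an assumption (a stricter client could simply omit _type).

```python
from typing import Dict, List

SUPPORTED_OUTPUT_FORMAT = "application/fhir+ndjson"
SUPPORTED_TYPE = "Patient"


def unsupported_export_parameters(params: Dict[str, str]) -> List[str]:
    """Return one message per query parameter this service is not expected to honor."""
    problems = []
    for name, value in params.items():
        if name == "_outputFormat":
            if value != SUPPORTED_OUTPUT_FORMAT:
                problems.append(f"_outputFormat must be {SUPPORTED_OUTPUT_FORMAT}")
        elif name == "_type":
            # Assumption: naming the only supported resource type is harmless.
            if value != SUPPORTED_TYPE:
                problems.append(f"_type must be {SUPPORTED_TYPE}")
        else:
            # Covers _since and any experimental parameters.
            problems.append(f"{name} is not supported")
    return problems
```

An empty result means the request uses only what the list above allows.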
