Review multimodal linkage #30

Merged
merged 12 commits into from
Jul 19, 2024

Conversation

jshoughtaling
Collaborator

@jshoughtaling jshoughtaling commented Jun 3, 2024

Feedback on PR from @evan-phelps (thanks!!):

  1. Under "What this SOP does not do," there's reference to data linking after sites have submitted data. Out of curiosity, is there specific, planned linking after submission? Is this referring to potential post-submission Privacy Preserving Record Linking?

Great question. We have currently done central linking of waveform files to OMOP data, but we expect sites to do more linkage locally as the project progresses. That statement is not referring to PPRL; that topic will likely need to be covered in a different SOP, as it would relate to matching patients across sites in addition to linking their data modes together.

  2. For what it's worth, I disagree with the recommendation to conflate file_id and procedure_occurrence_id values, which introduces an unnecessary and potentially misleading coupling of different concepts. If they are conflated, then at a minimum, I would hope that tools and code are not developed in a way that exploits this value-equivalence across two differently purposed variables/columns. Unfortunately, if it's recommended, then many people will probably write code that requires the value-equivalence between file and procedure occurrence identifiers.

Agreed. I'll soften the wording and emphasize that it's perfectly fine to decouple those identifiers.

  3. Is the separation of blocks of values for image file identifiers vs. waveform file identifiers necessary? Or is that an artifact of assuming that file id assignments of one might not be "aware" of the other? Put differently, is it required that all file ids draw from a common range of global file identifiers?

The main concern is identifier values clashing once they reach the PROCEDURE_OCCURRENCE table. If, as you suggest in (2), we decouple the file_id values from the procedure_occurrence_id values, this range allocation is moot and the file_id values can be arbitrary as long as they're unique. But we do expect the data engineers to ensure that procedure_occurrence_id remains a proper primary key with no duplicate values after inserting data from the registry tables.
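A minimal sketch of the decoupled approach, assuming a site keeps its own file registry and assigns fresh procedure_occurrence_id values at insert time. The `RegistryFile` class and both function names are hypothetical and not part of the SOP; the only behaviour taken from the discussion above is that file_id values need only be unique and that procedure_occurrence_id must stay a valid primary key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryFile:
    """One row of a site's file registry (hypothetical shape)."""
    file_id: int
    person_id: int


def assign_procedure_occurrence_ids(existing_ids, registry_files):
    """Give each registry file a fresh procedure_occurrence_id.

    The new ids continue after the largest existing id, so they never
    depend on the file_id values. Returns a list of
    (procedure_occurrence_id, file_id, person_id) tuples.
    """
    file_ids = [f.file_id for f in registry_files]
    if len(file_ids) != len(set(file_ids)):
        raise ValueError("file_id values must be unique within the registry")

    next_id = max(existing_ids, default=0) + 1
    rows = []
    for f in registry_files:
        rows.append((next_id, f.file_id, f.person_id))
        next_id += 1
    return rows


def assert_primary_key(ids):
    """Raise if any procedure_occurrence_id appears more than once."""
    seen = set()
    for value in ids:
        if value in seen:
            raise ValueError(f"duplicate procedure_occurrence_id: {value}")
        seen.add(value)


# Example: arbitrary file_id values, sequential procedure_occurrence_id values.
existing = [1, 2, 3]
files = [RegistryFile(file_id=9001, person_id=10),
         RegistryFile(file_id=17, person_id=11)]
new_rows = assign_procedure_occurrence_ids(existing, files)
assert_primary_key(existing + [row[0] for row in new_rows])
```

Because the mapping from file_id to procedure_occurrence_id is stored explicitly in each row, downstream code can join on it rather than assuming the two values are equal.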

  4. In our databases, including our main OMOP instances, we have needed to convert many IDs to bigint and were concerned about the potential of some existing programs or OHDSI tools truncating them. I'd suggest assuming bigint for new initiatives to avoid the accidental development of tools that use narrower integer representations, which will break when the scale of data inevitably grows and requires bigint. As I mentioned, we're already seeing it.

Many sites are facing similar issues, and we have modified the OMOP CDM DDL on the central cloud accordingly to handle bigints. The 2B+ selection was arbitrary, and mostly stems from the OHDSI convention for custom concept id assignments. If you're already using bigints you can go wild with your ID selections :) It would just be useful to know what ranges you end up using so we can sort them out centrally.
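To make the sizing concern concrete, here is a small sketch of how little room a 2B+ starting point leaves inside a signed 32-bit integer, plus a check for ids that would not survive a narrower integer column. The function names are hypothetical; the only figures used are the signed 32-bit maximum and the 2,000,000,000 starting point mentioned above.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1               # 2,147,483,647
CUSTOM_RANGE_START = 2_000_000_000  # the "2B+" starting point discussed above


def fits_int32(value: int) -> bool:
    """True if the value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def int32_headroom(start: int = CUSTOM_RANGE_START) -> int:
    """Count of ids from `start` up to the int32 maximum, inclusive."""
    return max(0, INT32_MAX - start + 1)


def find_oversized_ids(ids):
    """Return the ids that need bigint (64-bit) storage."""
    return [value for value in ids if not fits_int32(value)]


print(int32_headroom())  # 147483648
print(find_oversized_ids([1, 2_147_483_647, 2_147_483_648]))  # [2147483648]
```

Running a check like `find_oversized_ids` against a site's identifiers before handing them to a tool of unknown integer width is one way to catch truncation risk early.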

  5. I don't think "Be an integer" should be a sub-item of "IF you are using file_id value as procedure_occurrence_id." If it's a procedure_occurrence_id value, then it has to be an integer anyways. Since all other OMOP ids are integers, and since you're thinking of creating an OMOP extension specification, I'd suggest requiring it to be an integer generally.

Agreed. Will update wording accordingly.

  6. Will the intended "real-world" idea of how to optimally group files be specified? Providing guidance on how to group files optimally would be beneficial. It ensures uniformity and helps sites understand the best practices for data organization, which is crucial for downstream processing.

This is somewhat dependent on the file format chosen, so it was intentionally vague. Now that it seems like WFDB will be the winning format I can provide guidance/examples.

  7. Regarding procedure_concept_ids for Imaging Procedure and Monitoring Procedure, if sites aren't already mapping to more granular concepts under those two broad ancestor concepts, I recommend assuming more granular mapping with respect to how code is written and new tools are developed -- i.e., even if most sites are mapping to those two general concepts, they should be approached through the concept hierarchy, guaranteeing appropriate rollup of more granular mappings, from the beginning.

Absolutely. I will add this important caveat. We need to establish a solid feedback loop here, though, in order to design cohort definitions dependent on multimodal data that apply across all sites.
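A rough illustration of querying through the concept hierarchy rather than matching a single concept id, using an in-memory SQLite database so it runs anywhere. The table and column names follow the OMOP CDM (`concept_ancestor`, `procedure_occurrence`), but the concept ids are hypothetical placeholders, not real vocabulary ids, and `procedures_under` is an illustrative helper rather than an existing tool. The sketch relies on `concept_ancestor` listing each concept as its own ancestor, which is how the standard vocabulary tables are normally populated.

```python
import sqlite3

# Hypothetical placeholder ids; substitute the real concept ids from your vocabulary.
IMAGING_ANCESTOR = 9000001   # stands in for the broad imaging concept
GRANULAR_IMAGING = 9000002   # stands in for a more specific descendant
UNRELATED = 9000003          # a concept outside the imaging hierarchy

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept_ancestor (
        ancestor_concept_id   INTEGER,
        descendant_concept_id INTEGER
    );
    CREATE TABLE procedure_occurrence (
        procedure_occurrence_id INTEGER PRIMARY KEY,
        person_id               INTEGER,
        procedure_concept_id    INTEGER
    );
""")
conn.executemany("INSERT INTO concept_ancestor VALUES (?, ?)", [
    (IMAGING_ANCESTOR, IMAGING_ANCESTOR),  # self-referencing row
    (IMAGING_ANCESTOR, GRANULAR_IMAGING),
    (GRANULAR_IMAGING, GRANULAR_IMAGING),
    (UNRELATED, UNRELATED),
])
conn.executemany("INSERT INTO procedure_occurrence VALUES (?, ?, ?)", [
    (1, 10, IMAGING_ANCESTOR),   # site mapped to the broad concept
    (2, 11, GRANULAR_IMAGING),   # site mapped to a granular concept
    (3, 12, UNRELATED),
])


def procedures_under(connection, ancestor_concept_id):
    """Return procedure_occurrence_ids whose concept rolls up to the ancestor."""
    rows = connection.execute("""
        SELECT po.procedure_occurrence_id
        FROM procedure_occurrence AS po
        JOIN concept_ancestor AS ca
          ON ca.descendant_concept_id = po.procedure_concept_id
        WHERE ca.ancestor_concept_id = ?
        ORDER BY po.procedure_occurrence_id
    """, (ancestor_concept_id,)).fetchall()
    return [row[0] for row in rows]


imaging_ids = procedures_under(conn, IMAGING_ANCESTOR)
print(imaging_ids)  # [1, 2]
```

A filter written as `procedure_concept_id = IMAGING_ANCESTOR` would have returned only row 1, silently dropping the site that mapped to the more granular concept.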

  8. Regarding procedure_source_value, I'd suggest specifying a concatenation pattern or, alternatively, a metadata standard of specifying the concatenation pattern so that generic code can be written. Standardizing the concatenation pattern can enhance consistency and facilitate the development of generic code, making the system more robust and interoperable.

Agreed. Will update accordingly.
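As one possible shape for such a pattern, the sketch below declares the delimiter and field order once and derives both the builder and the parser from that declaration, so generic code can round-trip the value. Everything here is an assumption for illustration: the `|` delimiter, the `modality` and `file_id` fields, and the 50-character limit (check the column width in your own CDM DDL) are not taken from the SOP.

```python
DELIMITER = "|"                   # hypothetical delimiter
FIELDS = ("modality", "file_id")  # hypothetical field names and order
MAX_LENGTH = 50                   # assumed column width; verify against your DDL


def build_source_value(**parts):
    """Concatenate the declared fields, in order, into one source value."""
    values = []
    for name in FIELDS:
        value = str(parts[name])
        if DELIMITER in value:
            raise ValueError(f"{name} must not contain {DELIMITER!r}")
        values.append(value)
    result = DELIMITER.join(values)
    if len(result) > MAX_LENGTH:
        raise ValueError(f"source value exceeds {MAX_LENGTH} characters")
    return result


def parse_source_value(source_value):
    """Split a source value back into a dict keyed by the declared fields."""
    values = source_value.split(DELIMITER)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


encoded = build_source_value(modality="waveform", file_id=9001)
decoded = parse_source_value(encoded)
print(encoded)  # waveform|9001
print(decoded)  # {'modality': 'waveform', 'file_id': '9001'}
```

Publishing the delimiter and field order as metadata alongside the data would let each site choose its own pattern while still allowing shared parsing code.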

@jshoughtaling jshoughtaling linked an issue Jun 3, 2024 that may be closed by this pull request
@jshoughtaling jshoughtaling self-assigned this Jul 12, 2024
@jshoughtaling jshoughtaling merged commit 42ab26b into main Jul 19, 2024
Development

Successfully merging this pull request may close these issues.

[SOP Document] Multimodal Data Linkage
3 participants