person_id Generation in OMOP data #15
Comments
Hi, seeing your edit: indeed, this would enable joining separate datasets that use the same patient identifiers. I will address this promptly. Does this seem reasonable to you?
For patient IDs, this might be overkill; I will try to find a smaller yet robust hash. I should probably apply that to other identifiers such as
Yes, this is exactly what I mean. There should be a robust and traceable method for mapping IDs during OMOP conversion. As of now, I am not aware of any other possibilities for linking on IDs besides using person_id. However, it wouldn't hurt to support this for potential future use cases.
Hi, I just ran the OMOP conversion (v0.2.2) and the type of person_id in person.csv is string. However, in OMOP CDM, it is supposed to be an integer. Could you take a look at it?
Hi, I should be able to fix this within a week, along with Issue #24.
Issue Description
While working with MIMIC-III and MIMIC-IV datasets, I noticed that the transformation of subject_id to person_id in preprocessed_labels.parquet could lead to inconsistencies. The current method, as shown below, generates person_id based on the order of unique values in person_source_value:
Concerns
Context-Specific Conversion: This approach seems highly context-specific and potentially unstable across different tables or datasets. The unique generation of person_ids may not be preserved if tables are processed separately or at different times.
Order Dependency: The person_id mapping relies on the order of unique_pid, which can vary with different processing instances, leading to inconsistent person_id assignments.
Data Linking Challenges: For studies requiring linkage between different datasets (e.g., MIMIC-IV and MIMIC-CXR), this inconsistency in person_id generation could hinder the ability to effectively join datasets, especially when converted into the OMOP CDM format.
Suggested Alternative
To ensure consistency, I propose using a hash function or another traceable and consistent method for generating person_ids. This approach would guarantee that the same person_source_value is always mapped to the same person_id, irrespective of the dataset or processing instance.
Example Scenario
In my current project, I am attempting to integrate the MIMIC-CXR dataset (Chest X-ray dataset) into an OMOP dataset containing MIMIC-IV data (using your work). The inconsistency in person_id generation poses a significant challenge for this integration.
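To make the order-dependency concern and the suggested alternative concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual conversion code: `order_based_ids` and `hashed_person_id` are hypothetical helpers, the source values are made up, and the choice of SHA-256 truncated to 63 bits is an assumption made so that the result fits a signed 64-bit integer column.

```python
import hashlib


def order_based_ids(source_values):
    """Assign IDs by first-appearance order (the order-dependent scheme)."""
    mapping = {}
    for value in source_values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def hashed_person_id(person_source_value, bits=63):
    """Derive a stable, non-negative integer ID from a source identifier.

    Hypothetical helper: SHA-256 and the 63-bit width are illustrative
    assumptions, not the project's actual method.
    """
    digest = hashlib.sha256(str(person_source_value).encode("utf-8")).digest()
    # Keep the first 8 bytes and mask down to `bits` bits so the value
    # fits in a signed 64-bit integer column.
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


# Two tables containing the same (made-up) patients in a different order.
table_a = ["10001", "10002", "10003"]
table_b = ["10003", "10001", "10002"]

# Order-based assignment: the same patient gets a different ID per table.
ids_a = order_based_ids(table_a)
ids_b = order_based_ids(table_b)
print(ids_a["10003"], ids_b["10003"])  # 2 0

# Hash-based assignment: the ID depends only on the source value.
hash_a = {value: hashed_person_id(value) for value in table_a}
hash_b = {value: hashed_person_id(value) for value in table_b}
print(hash_a == hash_b)  # True
```

Because the hash is truncated, collisions are possible in principle, so a conversion that adopts something like this would probably want to check that the generated IDs are unique within each dataset.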