person_id Generation in OMOP data #15

Closed
xinyuejohn opened this issue Jan 11, 2024 · 5 comments
Comments

@xinyuejohn

xinyuejohn commented Jan 11, 2024

Issue Description

While working with MIMIC-III and MIMIC-IV datasets, I noticed that the transformation of subject_id to person_id in preprocessed_labels.parquet could lead to inconsistencies. The current method, as shown below, generates person_id based on the order of unique values in person_source_value:

unique_pid = self.person['person_source_value'].unique()
start_idx = self.start_index['person']
self.person_id_mapper = pd.Series(range(start_idx, start_idx + len(unique_pid)), index=unique_pid)
self.person['person_id'] = self.person['person_source_value'].map(self.person_id_mapper)

Concerns

  1. Context-Specific Conversion: This approach seems highly context-specific and potentially unstable across different tables or datasets. The unique generation of person_ids may not be preserved if tables are processed separately or at different times.

  2. Order Dependency: The person_id mapping relies on the order of unique_pid, which can vary with different processing instances, leading to inconsistent person_id assignments.

  3. Data Linking Challenges: For studies requiring linkage between different datasets (e.g., MIMIC-IV and MIMIC-CXR), this inconsistency in person_id generation could hinder the ability to effectively join datasets, especially when converted into the OMOP CDM format.
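The order dependency in concern 2 can be reproduced with a small pandas-free sketch. The helper name `sequential_ids` is hypothetical; it only mimics the first-appearance numbering of the snippet quoted above and is not the project's code.

```python
def sequential_ids(source_values, start_idx=0):
    """Assign IDs by order of first appearance, like the quoted mapper."""
    # dict.fromkeys keeps first-appearance order, as Series.unique() does
    unique_vals = list(dict.fromkeys(source_values))
    return {value: start_idx + i for i, value in enumerate(unique_vals)}


# Same patient, two processing runs that see the rows in a different order
run_a = sequential_ids(["mimic3-7630", "mimic3-13083", "mimic3-42728"])
run_b = sequential_ids(["mimic3-42728", "mimic3-7630", "mimic3-13083"])

print(run_a["mimic3-7630"], run_b["mimic3-7630"])  # 0 1
```

The same person_source_value receives a different person_id in each run, which is what breaks joins between separately processed tables.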

Suggested Alternative

To ensure consistency, I propose using a hash function or another traceable and consistent method for generating person_ids. This approach would guarantee that the same person_source_value is always mapped to the same person_id, irrespective of the dataset or processing instance.
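As a minimal sketch of this proposal, the ID can be derived from the source value alone. The function name `stable_person_id`, the choice of SHA-256, and the 63-bit truncation are all assumptions made for illustration, not what the project implements.

```python
import hashlib


def stable_person_id(person_source_value: str) -> int:
    """Hypothetical helper: integer ID that depends only on the source value."""
    digest = hashlib.sha256(person_source_value.encode("utf-8")).digest()
    # Keep the first 8 bytes and drop one bit so the result fits a signed 64-bit integer
    return int.from_bytes(digest[:8], "big") >> 1


# Two runs over different subsets, in different orders
first_run = {v: stable_person_id(v) for v in ["mimic3-7630", "mimic3-13083"]}
second_run = {v: stable_person_id(v) for v in ["mimic3-13083", "mimic3-42728", "mimic3-7630"]}

print(first_run["mimic3-7630"] == second_run["mimic3-7630"])  # True
```

Because the hash is truncated, two different source values could in principle collide, so a uniqueness check on the resulting IDs remains advisable.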

Example Scenario

In my current project, I am attempting to integrate the MIMIC-CXR dataset (Chest X-ray dataset) into an OMOP dataset containing MIMIC-IV data (using your work). The inconsistency in person_id generation poses a significant challenge for this integration.

xinyuejohn changed the title from "Discussion about person_id conversion" to "Inconsistency in person_id Generation Across Datasets" on Jan 11, 2024
@USM-CHU-FGuyon
Owner

USM-CHU-FGuyon commented Jan 11, 2024

Hi,
So the issue lies in the creation of person_id_mapper: I should ensure that person_source_value always maps to the same person_id, no matter the input data, right?

Seeing your edit: indeed, this would enable joining separate datasets that use the same patient identifiers. I will address this promptly.

Does this seem reasonable to you?

import hashlib
unique_pid = self.person.person_source_value.drop_duplicates()
pd.Series(unique_pid.apply(lambda x: hashlib.md5(x.encode('utf-8')).hexdigest()).values, index=unique_pid)


>> person_source_value
amsterdam-0      e46f49e41743e3c18985a118dbba56d5
amsterdam-1      237dc18c7c627bc48106b715f4f5b325
amsterdam-10     5f7fbe31642622ce46d50995aabe1188
amsterdam-89     f14143f8e1b2da1630fbe8de0ae7f373
amsterdam-869    e67912545d443db07189cfad5a01d020
              
mimic3-42728     1e48fc9f5e4d1e37b0fe872fb0ef213e
mimic3-17532     6be0abc49aadc01b54082eb654eff25e
mimic3-13083     55c6bc6b0ee202022a96554b32f9b731
mimic3-13620     09021015cf2f99d27b271affaf8a8b3e
mimic3-7630      982e8d12ac476c9d2715632c64566d19
Length: 282526, dtype: object

For patient IDs this might be overkill; I will try to find a shorter yet robust hash.

I should probably apply this to other identifiers as well, such as measurement_id, so that the user can link an OMOP entry to the actual measurement in the original dataset. What do you think?
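One possible way to get a shorter digest, sketched here as an assumption rather than as the approach eventually taken, is `hashlib.blake2b` with a reduced `digest_size`. The helper names `short_hash` and `has_collisions` are hypothetical.

```python
import hashlib


def short_hash(source_value: str, n_bytes: int = 8) -> str:
    """Hex digest of n_bytes bytes (2 * n_bytes hex characters)."""
    return hashlib.blake2b(source_value.encode("utf-8"), digest_size=n_bytes).hexdigest()


def has_collisions(source_values, n_bytes: int = 8) -> bool:
    """Return True if two distinct source values share the same short hash."""
    unique_values = set(source_values)
    hashed = {short_hash(v, n_bytes) for v in unique_values}
    return len(hashed) < len(unique_values)


values = [f"amsterdam-{i}" for i in range(1000)]
print(len(short_hash("amsterdam-0")))  # 16
print(has_collisions(values))
```

The shorter the digest, the more likely collisions become, so running a check like `has_collisions` on the full set of identifiers is cheap insurance.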

@xinyuejohn
Author

Yes, this is exactly what I mean. There should be a robust and traceable method for mapping IDs during OMOP conversion. As of now, I am not aware of any other possibilities for linking on IDs besides using person_id. However, it wouldn't hurt to support this for potential future use cases.

USM-CHU-FGuyon added a commit that referenced this issue Jan 19, 2024
@xinyuejohn
Author

Hi, I just ran the OMOP conversion (v0.2.2) and the type of person_id in person.csv is string. However, in OMOP CDM, it is supposed to be integer. Could you take a look at it?
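A quick way to see the type problem is to test candidate IDs against an integer range. This is a sketch only: the helper `is_valid_integer_id` is hypothetical, and the signed 64-bit bound is an assumption, since the actual limit depends on the column type of the target database.

```python
INT64_MAX = 2**63 - 1  # assumed upper bound; the target column may be narrower


def is_valid_integer_id(value) -> bool:
    """Hypothetical check: a positive Python int that fits a signed 64-bit column."""
    # bool is a subclass of int in Python, so it is excluded explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 0 < value <= INT64_MAX


hex_id = "e46f49e41743e3c18985a118dbba56d5"  # string ID from the md5 output above

print(is_valid_integer_id(hex_id))            # False: it is a string
print(is_valid_integer_id(int("f" * 32, 16))) # False: 128 bits do not fit in 64
print(is_valid_integer_id(1367845152469))     # True
```

A full md5 digest is 128 bits wide, so converting the hex string to an integer is not enough on its own; the value also has to be reduced to fit the column.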

@USM-CHU-FGuyon
Owner

Hi, I should be able to fix this within a week, along with Issue #24.

USM-CHU-FGuyon changed the title from "Inconsistency in person_id Generation Across Datasets" to "person_id Generation in OMOP data" on Feb 29, 2024
@USM-CHU-FGuyon
Owner

person_id now starts with 1 and visit_occurrence_id starts with 2; I then used 11 digits to encode the original identifier.

visit_occurrence_id  person_id
2367845152469        1367845152469
2446728184613        1446728184613
2646177120648        1646177120648
2469987407475        1887476626291
2106057239409        1566595215392
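The exact encoding is not shown in this thread. One way such a scheme could look, purely as a sketch, is a table-specific leading digit followed by a fixed number of digits derived from a hash of the original identifier. The function `prefixed_id`, the use of md5, and the modulo reduction are assumptions; only the prefixes 1 and 2 come from the comment above.

```python
import hashlib

TABLE_PREFIX = {"person": 1, "visit_occurrence": 2}  # prefixes stated in the comment above


def prefixed_id(table: str, source_value: str, n_digits: int = 11) -> int:
    """Hypothetical scheme: leading table digit + n_digits digits from a hash."""
    digest = hashlib.md5(source_value.encode("utf-8")).hexdigest()
    body = int(digest, 16) % 10**n_digits  # reduce the 128-bit hash to n_digits digits
    return TABLE_PREFIX[table] * 10**n_digits + body


pid = prefixed_id("person", "mimic3-7630")
vid = prefixed_id("visit_occurrence", "mimic3-7630")

print(str(pid)[0], str(vid)[0])  # 1 2
print(len(str(pid)))             # 12
```

The leading digit keeps IDs from different tables apart, and the result is a plain integer. Reducing a hash to a fixed number of decimal digits does make collisions possible, so the generated IDs should be checked for uniqueness.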

