person_id Generation in OMOP data #15

xinyuejohn · 2024-01-11T10:09:17Z

Issue Description

While working with MIMIC-III and MIMIC-IV datasets, I noticed that the transformation of subject_id to person_id in preprocessed_labels.parquet could lead to inconsistencies. The current method, as shown below, generates person_id based on the order of unique values in person_source_value:

unique_pid = self.person['person_source_value'].unique()
start_idx = self.start_index['person']
self.person_id_mapper = pd.Series(range(start_idx, start_idx + len(unique_pid)), index=unique_pid)
self.person['person_id'] = self.person['person_source_value'].map(self.person_id_mapper)

Concerns

Context-Specific Conversion: This approach seems highly context-specific and potentially unstable across different tables or datasets. The unique generation of person_ids may not be preserved if tables are processed separately or at different times.
Order Dependency: The person_id mapping relies on the order of unique_pid, which can vary with different processing instances, leading to inconsistent person_id assignments.
Data Linking Challenges: For studies requiring linkage between different datasets (e.g., MIMIC-IV and MIMIC-CXR), this inconsistency in person_id generation could hinder the ability to effectively join datasets, especially when converted into the OMOP CDM format.
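
The order dependency can be reproduced with a small sketch. It uses plain Python rather than pandas, and `sequential_mapper` is a hypothetical stand-in for the snippet above: `dict.fromkeys` keeps first-appearance order, as `Series.unique()` does.

```python
def sequential_mapper(source_values, start_idx=0):
    # De-duplicate while keeping first-appearance order (mimics Series.unique()).
    unique_pid = list(dict.fromkeys(source_values))
    # Assign consecutive integers in that order, as in the snippet above.
    return {pid: start_idx + i for i, pid in enumerate(unique_pid)}

# The same three patients, read in a different row order.
run_a = sequential_mapper(["mimic3-1", "mimic3-2", "mimic3-3"])
run_b = sequential_mapper(["mimic3-3", "mimic3-1", "mimic3-2"])

print(run_a["mimic3-3"], run_b["mimic3-3"])  # 2 0
```

The same person_source_value receives a different person_id as soon as the input order changes.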

Suggested Alternative

To ensure consistency, I propose using a hash function or another traceable and consistent method for generating person_ids. This approach would guarantee that the same person_source_value is always mapped to the same person_id, irrespective of the dataset or processing instance.
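
A minimal sketch of the idea, assuming a cryptographic digest truncated to an integer is acceptable. `stable_person_id` is a hypothetical helper, not part of the existing code. Python's built-in `hash()` would not work here, because string hashes are salted per process by default.

```python
import hashlib

def stable_person_id(person_source_value: str) -> int:
    """Hypothetical helper: derive an integer ID from the source value alone."""
    digest = hashlib.sha256(person_source_value.encode("utf-8")).digest()
    # Keep 8 bytes and clear the top bit so the value fits a signed 64-bit integer.
    return int.from_bytes(digest[:8], "big") & (2**63 - 1)

# Two tables processed separately, listing the same patients in a different order.
table_a = {pid: stable_person_id(pid) for pid in ["mimic3-42728", "mimic3-17532"]}
table_b = {pid: stable_person_id(pid) for pid in ["mimic3-17532", "mimic3-42728"]}

print(table_a == table_b)  # True
```

Truncating a digest makes collisions possible in principle, so a real conversion would still need a uniqueness check on the generated IDs.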

Example Scenario

In my current project, I am attempting to integrate the MIMIC-CXR dataset (Chest X-ray dataset) into an OMOP dataset containing MIMIC-IV data (using your work). The inconsistency in person_id generation poses a significant challenge for this integration.


USM-CHU-FGuyon · 2024-01-11T10:47:55Z

Hi,
So the issue lies in the creation of person_id_mapper: I should ensure that person_source_value always maps to the same person_id, no matter the input data... right?

Seeing your edit: indeed, this would enable joining separate datasets that use the same patient identifiers. I will address this promptly.

Does this seem reasonable to you?

import hashlib
unique_pid = self.person.person_source_value.drop_duplicates()
pd.Series(unique_pid.apply(lambda x: hashlib.md5(x.encode('utf-8')).hexdigest()).values, index=unique_pid)


>> person_source_value
amsterdam-0      e46f49e41743e3c18985a118dbba56d5
amsterdam-1      237dc18c7c627bc48106b715f4f5b325
amsterdam-10     5f7fbe31642622ce46d50995aabe1188
amsterdam-89     f14143f8e1b2da1630fbe8de0ae7f373
amsterdam-869    e67912545d443db07189cfad5a01d020
              
mimic3-42728     1e48fc9f5e4d1e37b0fe872fb0ef213e
mimic3-17532     6be0abc49aadc01b54082eb654eff25e
mimic3-13083     55c6bc6b0ee202022a96554b32f9b731
mimic3-13620     09021015cf2f99d27b271affaf8a8b3e
mimic3-7630      982e8d12ac476c9d2715632c64566d19
Length: 282526, dtype: object

For patient IDs this might be overkill; I will try to find a smaller yet robust hash.
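
One way to judge how small the hash can get is a birthday-bound estimate. The sketch below is only an illustration: `short_hash` and `collision_probability` are hypothetical helpers, and BLAKE2b is one possible choice because `hashlib.blake2b` accepts a `digest_size` directly.

```python
import hashlib
import math

def short_hash(source_value: str, n_bytes: int = 8) -> str:
    # blake2b natively supports digest sizes from 1 to 64 bytes.
    return hashlib.blake2b(source_value.encode("utf-8"), digest_size=n_bytes).hexdigest()

def collision_probability(n_ids: int, bits: int) -> float:
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    return -math.expm1(-n_ids * (n_ids - 1) / 2 ** (bits + 1))

n_patients = 282526  # length of the Series printed above

print(len(short_hash("mimic3-42728")))        # 16 hex characters instead of 32
print(collision_probability(n_patients, 32))  # close to 1: 32 bits is too small
print(collision_probability(n_patients, 64))  # on the order of 1e-9
```

Under this approximation, a 64-bit digest is comfortably large for a few hundred thousand patients, while a 32-bit one is almost certain to collide.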

I should probably apply that to other identifiers such as measurement_id, so the user can link an OMOP entry to the actual measurement in the original dataset. What do you think?

xinyuejohn · 2024-01-12T13:45:58Z

Yes, this is exactly what I mean. There should be a robust and traceable method for mapping IDs during OMOP conversion. As of now, I am not aware of any other possibilities for linking on IDs besides using person_id. However, it wouldn't hurt to support this for potential future use cases.

xinyuejohn · 2024-02-19T10:29:54Z

Hi, I just ran the OMOP conversion (v0.2.2), and person_id in person.csv is a string. In the OMOP CDM, however, it is supposed to be an integer. Could you take a look?

USM-CHU-FGuyon · 2024-02-27T09:58:50Z

Hi, I should be able to fix this within a week, along with Issue #24.

USM-CHU-FGuyon · 2024-03-07T08:45:29Z

person_id now starts with 1 and visit_occurrence_id starts with 2; I then used 11 digits to encode the original identifier.

visit_occurrence_id	person_id
2367845152469	1367845152469
2446728184613	1446728184613
2646177120648	1646177120648
2469987407475	1887476626291
2106057239409	1566595215392
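
The hashing used in the commit is not shown in this thread, so the following is only a sketch of how such a prefixed scheme could work. `TABLE_PREFIX` and `prefixed_id` are hypothetical names, and SHA-256 is an assumed choice of hash. The sample values above show 12 digits after the prefix, so the width is left as a parameter.

```python
import hashlib

# Leading digit per table, taken from the comment above.
TABLE_PREFIX = {"person": 1, "visit_occurrence": 2}

def prefixed_id(table: str, source_value: str, n_digits: int = 11) -> int:
    """Hypothetical helper: table prefix followed by a fixed-width hashed suffix."""
    digest = hashlib.sha256(source_value.encode("utf-8")).digest()
    # Reduce the digest to at most n_digits decimal digits.
    suffix = int.from_bytes(digest[:8], "big") % 10**n_digits
    # The prefix keeps the width constant even when the suffix has leading zeros.
    return TABLE_PREFIX[table] * 10**n_digits + suffix

person_id = prefixed_id("person", "mimic3-42728")
visit_id = prefixed_id("visit_occurrence", "mimic3-42728")

print(len(str(person_id)), str(person_id)[0], str(visit_id)[0])  # 12 1 2
```

IDs of this size exceed the 32-bit integer range, so they need a 64-bit integer column wherever they are stored.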

xinyuejohn changed the title ~~Discussion about person_id conversion~~ Inconsistency in person_id Generation Across Datasets Jan 11, 2024

USM-CHU-FGuyon added a commit that referenced this issue Jan 12, 2024

Fixes in the OMOP data, see Issues #14 and #15

1204c75

USM-CHU-FGuyon closed this as completed Jan 18, 2024

USM-CHU-FGuyon added a commit that referenced this issue Jan 19, 2024

quickfix Issue #15

e7a0668

USM-CHU-FGuyon reopened this Feb 20, 2024

USM-CHU-FGuyon changed the title ~~Inconsistency in person_id Generation Across Datasets~~ person_id Generation in OMOP data Feb 29, 2024

USM-CHU-FGuyon closed this as completed Mar 7, 2024
