duplications in measurement and drug_exposure #26

Closed
mostafaalishahi opened this issue Mar 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@mostafaalishahi

There are duplicated rows in the measurement and drug_exposure tables in OMOP: we have around 300 million duplicated rows in the measurement table and about 3 million duplicated rows in the drug_exposure table.

@mostafaalishahi changed the title from "There are duplications in measurement and drug_exposure" to "duplications in measurement and drug_exposure" on Mar 6, 2024
@USM-CHU-FGuyon
Owner

Hi, thanks for pointing this out. I think the issue comes from the downcasting of measurement_datetime.

There is no duplicated data when using measurement_date and measurement_time as the time component:

import pandas as pd

# Load two of the measurement parquet files and stack them into one frame.
measurement_0 = pd.read_parquet(r'D:/BLENDED_ICU/blended_data/OMOP-CDM/measurement/MEASUREMENT_0.parquet')
measurement_1 = pd.read_parquet(r'D:/BLENDED_ICU/blended_data/OMOP-CDM/measurement/MEASUREMENT_1.parquet')

df = pd.concat([measurement_0, measurement_1], axis=0)

# Candidate key built from the separate date and time columns.
primarykey = df[['measurement_date',
                 'measurement_time',
                 'visit_occurrence_id',
                 'measurement_concept_id']]

# Count the rows whose key repeats an earlier row.
dupli = primarykey.duplicated()
dupli.sum()
>> 0

But there are duplicates when using measurement_datetime:

primarykey = df[['measurement_datetime',
                 'visit_occurrence_id',
                 'measurement_concept_id']]

dupli = primarykey.duplicated()
dupli.sum()
>> 73168825

In fact, once saved to parquet, the measurement_datetime column is equal to measurement_date, so the time of day is no longer available in that column. I'm fixing this very soon.
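As a rough illustration of why this produces duplicates, here is a minimal sketch on toy data. The values and the concept id `1001` are made up, and the truncation is simulated by building `measurement_datetime` from the date alone; this is not the actual export code of the project.

```python
import pandas as pd

# Toy data (hypothetical values): two measurements of the same concept,
# during the same visit, on the same day, at different times.
df = pd.DataFrame({
    "visit_occurrence_id": [1, 1],
    "measurement_concept_id": [1001, 1001],
    "measurement_date": ["2024-03-06", "2024-03-06"],
    "measurement_time": ["08:00:00", "09:00:00"],
})

# Simulate the bug: measurement_datetime only carries the date.
df["measurement_datetime"] = pd.to_datetime(df["measurement_date"])

key_date_time = ["measurement_date", "measurement_time",
                 "visit_occurrence_id", "measurement_concept_id"]
key_datetime = ["measurement_datetime",
                "visit_occurrence_id", "measurement_concept_id"]

# duplicated() flags every row whose key repeats an earlier row.
n_dup_date_time = int(df.duplicated(subset=key_date_time).sum())
n_dup_datetime = int(df.duplicated(subset=key_datetime).sum())

print(n_dup_date_time, n_dup_datetime)  # prints "0 1"
```

The two rows are distinct measurements, but they collapse onto the same key as soon as the time of day is dropped.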

The data seems to be fine. A quick fix would be to omit measurement_datetime and use measurement_date and measurement_time as the time component.
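If a single timestamp column is still needed downstream, one possible workaround is to rebuild it from the two columns. The sketch below assumes that `measurement_date` parses with `pd.to_datetime` and that `measurement_time` holds `HH:MM:SS` strings; the actual dtypes in the parquet files may differ, and the column name `measurement_datetime_rebuilt` is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the real table.
df = pd.DataFrame({
    "measurement_date": ["2024-03-06", "2024-03-06"],
    "measurement_time": ["08:00:00", "09:30:00"],
})

# Assumption: dates are parseable and times are "HH:MM:SS" strings.
# The date becomes a midnight timestamp, the time becomes an offset.
df["measurement_datetime_rebuilt"] = (
    pd.to_datetime(df["measurement_date"])
    + pd.to_timedelta(df["measurement_time"])
)

print(df["measurement_datetime_rebuilt"])
```

The rebuilt column can then replace measurement_datetime in the key used for the duplicate check.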

Please tell me if this fixes the duplications for you.

@USM-CHU-FGuyon added the bug label on Mar 7, 2024
@USM-CHU-FGuyon
Owner

This should be fixed in v0.3.1.
