Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Link events to same day visits (option 2) and add parameter to creating visits (option 3) #26

Merged
merged 5 commits into from
May 15, 2023

Conversation

egillax
Copy link
Contributor

@egillax egillax commented Apr 3, 2023

In this PR I've added

  • Link same day events without a visit_occurrence_id to same day visit if possible (fixes Dependency on visit_occurrence_id? #22)
  • Added the option to both try to link events to visits, and then create visits from the leftover orphan events (if any)
  • Added a gap parameter to specify how far away events need to be before they are considered the same visit
  • Moved the create visit functionality to an earlier stage in the pipeline and refactored into a function

@schuemie @ChaoPang

@@ -759,3 +740,64 @@ def get_standard_concept_ids(spark, concept_ids):
WHERE ca.concept_id_1 IN ({concept_ids})
""".format(concept_ids=','.join([str(c) for c in concept_ids])))
return standard_concept_ids


def link_events_with_visits(patient_event):
Copy link
Contributor

@ChaoPang ChaoPang Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If two visits occur on the same day with different visit types (ED + Inpatient), the two last functions may not consistently take the information from the same visit and potentially create a discrepancy between the visit_occurrence_id and the corresponding visit_concept_id. An alternative way would be to join to visit_occurrence directly to get the visit_occurrence_id and visit_concept_id for those orphan events

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is a case I hadn't considered. If we have two visits on the same day and an orphan event, should we try to link it to a visit? We don't have enough information I think to find out which would be the correct visit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@egillax This is probably a rare event, but this is more likely to happen if we specify a large gap period other than the same day. My point is that it doesn't matter to which visit we link the orphan events, however, it's important to link them consistently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChaoPang I agree about consistency.

I'll look into joining directly to the visit_occurrence

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChaoPang I've been struggling a bit with your suggestion to join directly to the visit_occurrence to get the ids. I'm very new to pyspark. Could you explain it a bit maybe?

utils/spark_utils.py Outdated Show resolved Hide resolved
# first row per person_id gives nan for date_difference, replace with value 2
patient_event = patient_event.na.fill(2, "date_difference")

if link_events_visits:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if block (when link_events_visits is True) already covers the cases captured by else, my understanding is that we only want to create visits when visit_occurrence_id is missing. However, it looks like the else block creates visits and ignores the existing visit_occurrence_id, is that the behavior you expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In case of very few events with visit info I wanted to have the option to overwrite the visit info with the created one. So the grouping into "visits" is consistent. But for combination of linking and creating visits the existing info would be preserved.

But maybe it would be better to add some kind of overwrite option. So you would have:

  • Link visits
  • Link visits and create visits from orphans (preserving visit info that is there)
  • Create visits from orphans and preserve existing visit info.
  • Create visits for all events and overwrite existing visit info

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could automatically apply the following logic for each event:

  1. If visit_occurrence_id is present, use that to link the event to the visit.
  2. If visit_occurrence_id is not present, but there is a visit on the same date, link to that.
  3. If visit_occurrence_id is not present, and there is no visit on the same date, create a visit.

That should work for most databases, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schuemie that is what you get with this PR with by using both options --link_events_to_visits and --create_visits_from_dates.

Additionally I wanted to be able to create visits for every event to impose my own structure on the sequences. By using gap of 1 I would get sequences of events on consecutive days and still have the time aware tokens to represent the gaps between those sequences.

Maybe it's cleanest if I use a separate option for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@egillax got it. To check my understanding, when the overwrite is enabled, we essentially ignore all the visits and impose our own structure, and create some visits. Sometimes we might want to use a larger gap to define what a visit means since it's just a way to group events. I can imagine there are cases, where we want to link the events that occurred within 10 days of each other, which would create some meta visits. Is my understanding correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChaoPang yes you are correct. You could also look at it as a way to control for how small gaps you'll insert the time aware tokens.

utils/spark_utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ChaoPang ChaoPang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@egillax thanks a lot for this nice feature! I have left my comments, please let know what you think.

egillax and others added 2 commits April 7, 2023 15:00
Co-authored-by: Chao Pang <ChaoPang229@gmail.com>
Co-authored-by: Chao Pang <ChaoPang229@gmail.com>
Copy link
Contributor

@ChaoPang ChaoPang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @egillax it's been a long time, I approved the PR.

Copy link
Contributor

@ChaoPang ChaoPang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @egillax it's been a long time, I approved the PR.

@ChaoPang ChaoPang merged commit 68c14de into cumc-dbmi:master May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dependency on visit_occurrence_id?
3 participants