# New find_visit_occurence_id

It turns out that the code of the `find_visit_occurence_id` function has fallen short. In some cases the generated tables are too large to fit in memory.
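
To get a feel for where the memory goes, the number of rows that a per-person merge of events and visits produces can be counted before actually building it. The sketch below is only illustrative: `count_merged_rows` is a hypothetical helper that is not part of the original code, and it assumes a left merge on a `person_id` column without missing ids.

```python
import pandas as pd


def count_merged_rows(events_df, visits_df, key="person_id"):
    """Rows that events_df.merge(visits_df, on=key, how="left") would have."""
    events_per_person = events_df.groupby(key).size()
    visits_per_person = visits_df.groupby(key).size()
    # In a left merge, people without visits still keep one row per event
    visits_aligned = visits_per_person.reindex(events_per_person.index).fillna(1)
    return int((events_per_person * visits_aligned).sum())


events = pd.DataFrame({"person_id": [1, 1, 2], "event_id": [10, 11, 12]})
visits = pd.DataFrame(
    {"person_id": [1, 1, 1, 3], "visit_occurrence_id": [1, 2, 3, 4]}
)
n_rows = count_merged_rows(events, visits)
```

Since each person contributes (number of events) x (number of visits) rows, the intermediate table can be far larger than either input.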

In this notebook we will try to:

1. Allow processing dataframes that do not fit in memory
2. As a secondary goal, it would be nice to parallelize it in some way


## Creating the test dataset

We are going to create a dataset that covers every data variant we might encounter.

In [None]:
import pandas as pd

input = []
visit = []

# 1 -> Person that has one event and one visit that match
input_rows = [
    {"person_id": 1, "event_id": 1, "start_date": "2020-01-05", "expected_visit_id": 1}
]
visit_rows = [
    {
        "person_id": 1,
        "visit_id": 1,
        "start_date": "2020-01-01",
        "end_date": "2020-01-10",
    }
]
input += input_rows
visit += visit_rows
# 2 -> Person that has one event and one visit that do not match
input_rows = [
    {
        "person_id": 2,
        "event_id": 2,
        "start_date": "2020-02-05",
        "expected_visit_id": None,
    }
]
visit_rows = [
    {
        "person_id": 2,
        "visit_id": 2,
        "start_date": "2020-01-01",
        "end_date": "2020-01-10",
    }
]
input += input_rows
visit += visit_rows
# 3 -> Person that has one event and no visit
input_rows = [
    {
        "person_id": 3,
        "event_id": 3,
        "start_date": "2020-02-05",
        "expected_visit_id": None,
    }
]
input += input_rows
# 4 -> Person that has no event and one visit
visit_rows = [
    {
        "person_id": 4,
        "visit_id": 3,
        "start_date": "2020-01-01",
        "end_date": "2020-01-10",
    }
]
visit += visit_rows
# 5 -> Person that has two events and two visits. One event matches a visit; the rest do not
input_rows = [
    {"person_id": 5, "event_id": 4, "start_date": "2020-01-05", "expected_visit_id": 4}
]
visit_rows = [
    {
        "person_id": 5,
        "visit_id": 4,
        "start_date": "2020-01-01",
        "end_date": "2020-01-10",
    }
]
input += input_rows
visit += visit_rows
input_rows = [
    {
        "person_id": 5,
        "event_id": 5,
        "start_date": "2020-02-05",
        "expected_visit_id": None,
    }
]
visit_rows = [
    {
        "person_id": 5,
        "visit_id": 5,
        "start_date": "2020-03-01",
        "end_date": "2020-03-10",
    }
]
input += input_rows
visit += visit_rows
# 6 -> Person that has two events that match to a single visit
input_rows = [
    {"person_id": 6, "event_id": 6, "start_date": "2020-01-04", "expected_visit_id": 6},
    {"person_id": 6, "event_id": 7, "start_date": "2020-01-05", "expected_visit_id": 6},
]
visit_rows = [
    {
        "person_id": 6,
        "visit_id": 6,
        "start_date": "2020-01-01",
        "end_date": "2020-01-10",
    }
]
input += input_rows
visit += visit_rows
# 7 -> Person that has one event that fits the end of one period and the beginning of the next
input_rows = [
    {"person_id": 7, "event_id": 8, "start_date": "2020-01-05", "expected_visit_id": 7}
]
visit_rows = [
    {
        "person_id": 7,
        "visit_id": 7,
        "start_date": "2020-01-01",
        "end_date": "2020-01-05",
    },
    {
        "person_id": 7,
        "visit_id": 8,
        "start_date": "2020-01-05",
        "end_date": "2020-01-10",
    },
]
input += input_rows
visit += visit_rows

In [None]:
input_df = pd.DataFrame.from_records(input)
visit_df = pd.DataFrame.from_records(visit)
visit_df = visit_df.rename(
    {
        "visit_id": "visit_occurrence_id",
        "start_date": "visit_start_datetime",
        "end_date": "visit_end_datetime",
    },
    axis=1,
)

In [None]:
input_df

In [None]:
visit_df

## Finding the visit_occurrence_id
The goal is to link each patient measurement to a visit. To do so we load the `visit_df` table, which should already have been built in a previous section, and for each measurement date in `input_df` (the `start_date` column in this test dataset) we look for a visit date interval that contains it. If one exists, we assign the corresponding `visit_occurrence_id`.
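
As a minimal illustration of that idea, the sketch below matches two toy events against two toy visits. It is not the implemented function, only the bare merge-and-filter logic, and the toy values are made up for the example.

```python
import pandas as pd

events = pd.DataFrame(
    {
        "person_id": [1, 2],
        "event_id": [1, 2],
        "start_date": pd.to_datetime(["2020-01-05", "2020-02-05"]),
    }
)
visits = pd.DataFrame(
    {
        "person_id": [1, 2],
        "visit_occurrence_id": [1, 2],
        "visit_start_datetime": pd.to_datetime(["2020-01-01", "2020-01-01"]),
        "visit_end_datetime": pd.to_datetime(["2020-01-10", "2020-01-10"]),
    }
)

# Pair every event with every visit of the same person
pairs = events.merge(visits, on="person_id", how="left")
# Keep the pairs whose event date lies inside the visit interval (inclusive)
inside = pairs["start_date"].between(
    pairs["visit_start_datetime"], pairs["visit_end_datetime"]
)
matches = pairs.loc[inside, ["event_id", "visit_occurrence_id"]]
# Events without a containing visit keep a missing visit_occurrence_id
toy_result = events.merge(matches, on="event_id", how="left")
```

Here event 1 falls inside visit 1, while event 2 happens after the only visit of person 2 and is left unassigned.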

### Currently implemented method (18/06/2025)

In [None]:
def find_visit_occurence_id(
    events_df: pd.DataFrame,
    event_columns: list,
    visits_df: pd.DataFrame,
    verbose: int = 0,
) -> pd.DataFrame:
    """
    Find valid date ranges by merging condition and visit occurrence data.

    This function merges input_df and visit occurrence dataframes,
    then filters for input_df start dates that fall within visit date ranges.

    Parameters
    ----------
    events_df : pandas.DataFrame
        Input dataframe for which to assign visit_occurrence_id's
    event_columns : list
        Column names that contain, in this order:
        - 'person_id'   Identifier for each person in the dataframe
        - 'start_date'  Date to fit between 'visit_start_datetime' and
                        'visit_end_datetime'.
        - 'event_id'    Unique identifier for each record of events_df.
    visits_df : pandas.DataFrame
        DataFrame containing visit occurrence data.
        Must include columns: 'person_id', 'visit_start_datetime',
        'visit_end_datetime', 'visit_occurrence_id'.
        Column names need to be the same. This is to ensure
        the correct table (VISIT_OCCURRENCE) is being used.
    verbose : int, optional, default 0
        Verbosity level for function output.
        0: No output
        1: Additionally, print state of the processing
        2+: Additionally, print all debug information

    Returns
    -------
    pandas.DataFrame
        A DataFrame containing the original table events_df plus the value for
        the visit_occurrence_id, visit_start_datetime and visit_end_datetime,
        if found.

    Raises
    ------
    ValueError
        If required columns are missing in input DataFrames.
    """
    pd.options.mode.copy_on_write = True
    # == Initial message ==============================================
    if verbose > 0:
        print("Looking for visit_occurrence_id matches:")

    # == Initial Checks ===============================================
    # Check for required columns in events_df
    if verbose > 0:
        print(" Checking input...")
    required_input_columns = event_columns
    missing_input_columns = set(required_input_columns) - set(events_df.columns)
    if missing_input_columns:
        raise ValueError(
            f"Missing required columns in events_df: {missing_input_columns}"
        )
    # Check for required columns in visits_df
    required_visit_columns = [
        "person_id",
        "visit_start_datetime",
        "visit_end_datetime",
        "visit_occurrence_id",
    ]
    missing_visit_columns = set(required_visit_columns) - set(visits_df.columns)
    if missing_visit_columns:
        raise ValueError(
            f"Missing required columns in visits_df: {missing_visit_columns}"
        )
    visits_df = visits_df[required_visit_columns]

    # == Force dtypes and sort ========================================
    # Ensure start_date and visit dates are datetime
    if verbose > 0:
        print(" Sorting dataframes...")
    events_df[event_columns[1]] = events_df[event_columns[1]].astype("datetime64[ms]")
    visits_df["visit_start_datetime"] = visits_df["visit_start_datetime"].astype(
        "datetime64[ms]"
    )
    visits_df["visit_end_datetime"] = visits_df["visit_end_datetime"].astype(
        "datetime64[ms]"
    )

    # Drop all duplicates, if visits are not unique we cannot assign them
    visits_df = visits_df.drop_duplicates(
        subset=["person_id", "visit_start_datetime", "visit_end_datetime"], keep=False
    )

    # Sort the necessary columns of dataframes
    events_df = events_df.sort_values([event_columns[0], event_columns[1]])
    visits_df = visits_df.sort_values(
        [event_columns[0], "visit_start_datetime", "visit_end_datetime"]
    )

    # == Merging ======================================================
    if verbose > 0:
        print(" Combining results...")
    merged_df = pd.merge(
        events_df.reset_index(drop=True),
        visits_df.reset_index(drop=True),
        on=event_columns[0],
        how="left",
    )

    # Check if merge resulted in any matches
    if merged_df["visit_occurrence_id"].isna().all():
        raise ValueError(
            "No matching records found after merging. "
            "Check if person_id values align between dataframes."
        )

    # == Filter for valid ranges ======================================
    if verbose > 0:
        print(" Filtering valid ranges...")
    # Create mask for dates within range
    date_range_mask = (
        merged_df[event_columns[1]] >= merged_df["visit_start_datetime"]
    ) & (merged_df[event_columns[1]] <= merged_df["visit_end_datetime"])
    # Filter only valid ranges
    valid_ranges = merged_df[date_range_mask]

    # Merge with original to retrieve events without visit_occurrence_id
    final_df = pd.merge(
        events_df,
        valid_ranges[
            [
                event_columns[0],
                event_columns[2],
                "visit_occurrence_id",
                "visit_start_datetime",
                "visit_end_datetime",
            ]
        ],
        on=[event_columns[0], event_columns[2]],
        how="left",
    )
    # Sometimes, there might be events that land in visits that share a day.
    # Those would be duplicated on the event_id. Let's drop those duplicates
    # Since they're ordered, we will only lose the second visit, the one
    # that starts with the event
    final_df = final_df.drop_duplicates([event_columns[0], event_columns[2]])

    if verbose > 1:
        if valid_ranges.empty:
            print(
                " Warning: No valid date ranges found. "
                "All condition start dates are outside visit date ranges."
            )
        print(f"  Shape of events_df: {events_df.shape}")
        print(f"  Shape of visits_df: {visits_df.shape}")
        print(f"  Shape of merged_df: {merged_df.shape}")
        print(f"  Shape of valid_ranges: {valid_ranges.shape}")
        print(f"  Shape of final_df: {final_df.shape}")

    if verbose > 0:
        print(" Done.")

    return final_df

In [None]:
df = find_visit_occurence_id(
    input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=2
)
df
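
The test dataset carries an `expected_visit_id` column, so the output can also be checked automatically instead of by eye. The following is a small sketch of such a check; `mismatched_events` is a hypothetical helper introduced here for illustration, and the sample frame only mimics the shape of the real output.

```python
import pandas as pd


def mismatched_events(
    result_df, expected_col="expected_visit_id", found_col="visit_occurrence_id"
):
    """Return the rows where the assigned visit differs from the expected one."""
    expected = pd.to_numeric(result_df[expected_col])
    found = pd.to_numeric(result_df[found_col])
    # Two missing values count as agreement (no visit expected, none found)
    both_missing = expected.isna() & found.isna()
    same_value = expected.eq(found)
    return result_df[~(both_missing | same_value)]


sample = pd.DataFrame(
    {
        "event_id": [1, 2, 3],
        "expected_visit_id": [1, None, 3],
        "visit_occurrence_id": [1.0, None, 4.0],
    }
)
bad = mismatched_events(sample)
```

In the notebook, calling `mismatched_events(df)` on the output of the cell above and getting an empty frame would mean that every case of the test dataset behaves as expected.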

In [None]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)    
    print('find_visit_occurence_id')
    %timeit -n 5 -r 5 find_visit_occurence_id(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=0)

### Using fireducks

In [None]:
import fireducks.pandas as pdfd

def find_visit_occurence_id_fireducks(
    events_df: pdfd.DataFrame,
    event_columns: list,
    visits_df: pdfd.DataFrame,
    verbose: int = 0,
) -> pdfd.DataFrame:
    """
    Find valid date ranges by merging condition and visit occurrence data.

    This function merges input_df and visit occurrence dataframes,
    then filters for input_df start dates that fall within visit date ranges.

    Parameters
    ----------
    events_df : fireducks.pandas.DataFrame
        Input dataframe for which to assign visit_occurrence_id's
    event_columns : list
        Column names that contain, in this order:
        - 'person_id'   Identifier for each person in the dataframe
        - 'start_date'  Date to fit between 'visit_start_datetime' and
                        'visit_end_datetime'.
        - 'event_id'    Unique identifier for each record of events_df.
    visits_df : fireducks.pandas.DataFrame
        DataFrame containing visit occurrence data.
        Must include columns: 'person_id', 'visit_start_datetime',
        'visit_end_datetime', 'visit_occurrence_id'.
        Column names need to be the same. This is to ensure
        the correct table (VISIT_OCCURRENCE) is being used.
    verbose : int, optional, default 0
        Verbosity level for function output.
        0: No output
        1: Additionally, print state of the processing
        2+: Additionally, print all debug information

    Returns
    -------
    fireducks.pandas.DataFrame
        A DataFrame containing the original table events_df plus the value for
        the visit_occurrence_id, visit_start_datetime and visit_end_datetime,
        if found.

    Raises
    ------
    ValueError
        If required columns are missing in input DataFrames.
    """
    pdfd.options.mode.copy_on_write = True
    # == Initial message ==============================================
    if verbose > 0:
        print("Looking for visit_occurrence_id matches:")

    # == Initial Checks ===============================================
    # Check for required columns in events_df
    if verbose > 0:
        print(" Checking input...")
    required_input_columns = event_columns
    missing_input_columns = set(required_input_columns) - set(events_df.columns)
    if missing_input_columns:
        raise ValueError(
            f"Missing required columns in events_df: {missing_input_columns}"
        )
    # Check for required columns in visits_df
    required_visit_columns = [
        "person_id",
        "visit_start_datetime",
        "visit_end_datetime",
        "visit_occurrence_id",
    ]
    missing_visit_columns = set(required_visit_columns) - set(visits_df.columns)
    if missing_visit_columns:
        raise ValueError(
            f"Missing required columns in visits_df: {missing_visit_columns}"
        )
    visits_df = visits_df[required_visit_columns]

    # == Force dtypes and sort ========================================
    # Ensure start_date and visit dates are datetime
    if verbose > 0:
        print(" Sorting dataframes...")
    events_df[event_columns[1]] = events_df[event_columns[1]].astype("datetime64[ms]")
    visits_df["visit_start_datetime"] = visits_df["visit_start_datetime"].astype(
        "datetime64[ms]"
    )
    visits_df["visit_end_datetime"] = visits_df["visit_end_datetime"].astype(
        "datetime64[ms]"
    )

    # Drop all duplicates, if visits are not unique we cannot assign them
    visits_df = visits_df.drop_duplicates(
        subset=["person_id", "visit_start_datetime", "visit_end_datetime"], keep=False
    )

    # Sort the necessary columns of dataframes
    events_df = events_df.sort_values([event_columns[0], event_columns[1]])
    visits_df = visits_df.sort_values(
        [event_columns[0], "visit_start_datetime", "visit_end_datetime"]
    )

    # == Merging ======================================================
    if verbose > 0:
        print(" Combining results...")
    merged_df = pdfd.merge(
        events_df.reset_index(drop=True),
        visits_df.reset_index(drop=True),
        on=event_columns[0],
        how="left",
    )

    # Check if merge resulted in any matches
    if merged_df["visit_occurrence_id"].isna().all():
        raise ValueError(
            "No matching records found after merging. "
            "Check if person_id values align between dataframes."
        )

    # == Filter for valid ranges ======================================
    if verbose > 0:
        print(" Filtering valid ranges...")
    # Create mask for dates within range
    date_range_mask = (
        merged_df[event_columns[1]] >= merged_df["visit_start_datetime"]
    ) & (merged_df[event_columns[1]] <= merged_df["visit_end_datetime"])
    # Filter only valid ranges
    valid_ranges = merged_df[date_range_mask]

    # Merge with original to retrieve events without visit_occurrence_id
    final_df = pdfd.merge(
        events_df,
        valid_ranges[
            [
                event_columns[0],
                event_columns[2],
                "visit_occurrence_id",
                "visit_start_datetime",
                "visit_end_datetime",
            ]
        ],
        on=[event_columns[0], event_columns[2]],
        how="left",
    )
    # Sometimes, there might be events that land in visits that share a day.
    # Those would be duplicated on the event_id. Let's drop those duplicates
    # Since they're ordered, we will only lose the second visit, the one
    # that starts with the event
    final_df = final_df.drop_duplicates([event_columns[0], event_columns[2]])

    if verbose > 1:
        if valid_ranges.empty:
            print(
                " Warning: No valid date ranges found. "
                "All condition start dates are outside visit date ranges."
            )
        print(f"  Shape of events_df: {events_df.shape}")
        print(f"  Shape of visits_df: {visits_df.shape}")
        print(f"  Shape of merged_df: {merged_df.shape}")
        print(f"  Shape of valid_ranges: {valid_ranges.shape}")
        print(f"  Shape of final_df: {final_df.shape}")

    if verbose > 0:
        print(" Done.")

    return final_df

In [None]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)    
    print('find_visit_occurence_id_fireducks')
    %timeit -n 5 -r 5 find_visit_occurence_id_fireducks(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=0)

### Using polars

In [None]:
import polars as pl


def find_visit_occurence_id_polars(
    events_df: pd.DataFrame,
    event_columns: list,
    visits_df: pd.DataFrame,
    verbose: int = 0,
) -> pd.DataFrame:
    """
    Find valid date ranges by merging condition and visit occurrence data.

    This function merges input_df and visit occurrence dataframes,
    then filters for input_df start dates that fall within visit date ranges.

    Parameters
    ----------
    events_df : pandas.DataFrame
        Input dataframe for which to assign visit_occurrence_id's
    event_columns : list
        Column names that contain, in this order:
        - 'person_id'   Identifier for each person in the dataframe
        - 'start_date'  Date to fit between 'visit_start_datetime' and
                        'visit_end_datetime'.
        - 'event_id'    Unique identifier for each record of events_df.
    visits_df : pandas.DataFrame
        DataFrame containing visit occurrence data.
        Must include columns: 'person_id', 'visit_start_datetime',
        'visit_end_datetime', 'visit_occurrence_id'.
        Column names need to be the same. This is to ensure
        the correct table (VISIT_OCCURRENCE) is being used.
    verbose : int, optional, default 0
        Verbosity level for function output.
        0: No output
        1: Additionally, print state of the processing
        2+: Additionally, print all debug information

    Returns
    -------
    pandas.DataFrame
        A DataFrame containing the original table events_df plus the value for
        the visit_occurrence_id, visit_start_datetime and visit_end_datetime,
        if found.

    Raises
    ------
    ValueError
        If required columns are missing in input DataFrames.
    """
    # Transform to polars dataframes
    events_df = pl.from_pandas(events_df)
    visits_df = pl.from_pandas(visits_df)

    # == Initial message ==============================================
    if verbose > 0:
        print("Looking for visit_occurrence_id matches:")

    # == Initial Checks ===============================================
    # Check for required columns in events_df
    if verbose > 0:
        print(" Checking input...")
    required_input_columns = event_columns
    missing_input_columns = set(required_input_columns) - set(events_df.columns)
    if missing_input_columns:
        raise ValueError(
            f"Missing required columns in events_df: {missing_input_columns}"
        )
    # Check for required columns in visits_df
    required_visit_columns = [
        "person_id",
        "visit_start_datetime",
        "visit_end_datetime",
        "visit_occurrence_id",
    ]
    missing_visit_columns = set(required_visit_columns) - set(visits_df.columns)
    if missing_visit_columns:
        raise ValueError(
            f"Missing required columns in visits_df: {missing_visit_columns}"
        )
    visits_df = visits_df.select(required_visit_columns)

    # == Force dtypes and sort ========================================
    # Ensure start_date and visit dates are datetime
    if verbose > 0:
        print(" Sorting dataframes...")
    events_df = events_df.with_columns(pl.col(event_columns[1]).cast(pl.Datetime("ms")))
    visits_df = visits_df.with_columns(
        pl.col("visit_start_datetime").cast(pl.Datetime("ms")),
        pl.col("visit_end_datetime").cast(pl.Datetime("ms")),
    )

    # Drop all duplicates, if visits are not unique we cannot assign them
    visits_df = visits_df.unique(
        subset=["person_id", "visit_start_datetime", "visit_end_datetime"], keep="none"
    )

    # Sort the necessary columns of dataframes
    events_df = events_df.sort([event_columns[0], event_columns[1]])
    visits_df = visits_df.sort(
        [event_columns[0], "visit_start_datetime", "visit_end_datetime"]
    )

    # == Merging ======================================================
    if verbose > 0:
        print(" Combining results...")
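    # Events go on the left, as in the pandas version. With visits on the
    # left, visit_occurrence_id is never null and the check below is a no-op.
    merged_df = events_df.join(visits_df, on=event_columns[0], how="left")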

    # Check if merge resulted in any matches
    if merged_df["visit_occurrence_id"].is_null().all():
        raise ValueError(
            "No matching records found after merging. "
            "Check if person_id values align between dataframes."
        )

    # == Filter for valid ranges ======================================
    if verbose > 0:
        print(" Filtering valid ranges...")
    # Filter only valid ranges within range
    valid_ranges = merged_df.filter(
        pl.col(event_columns[1]).is_between(
            pl.col("visit_start_datetime"), pl.col("visit_end_datetime"), closed="both"
        )
    )
    valid_ranges = valid_ranges[
        [
            event_columns[0],
            event_columns[2],
            "visit_occurrence_id",
            "visit_start_datetime",
            "visit_end_datetime",
        ]
    ]

    # Merge with original to retrieve events without visit_occurrence_id
    final_df = events_df.join(valid_ranges,
        on=[event_columns[0], event_columns[2]],
        how="left",
    )

    # Sometimes, there might be events that land in visits that share a day.
    # Those would be duplicated on the event_id. Let's drop those duplicates.
    # Since they're ordered, we want to lose only the second visit, the one
    # that starts with the event. The default keep="any" may keep either
    # row, so ask explicitly for the first one in frame order.
    final_df = final_df.unique(
        subset=[event_columns[0], event_columns[2]],
        keep="first",
        maintain_order=True,
    )

    if verbose > 1:
        if valid_ranges.is_empty():
            print(
                " Warning: No valid date ranges found. "
                "All condition start dates are outside visit date ranges."
            )
        print(f"  Shape of events_df: {events_df.shape}")
        print(f"  Shape of visits_df: {visits_df.shape}")
        print(f"  Shape of merged_df: {merged_df.shape}")
        print(f"  Shape of valid_ranges: {valid_ranges.shape}")
        print(f"  Shape of final_df: {final_df.shape}")

    if verbose > 0:
        print(" Done.")

    return final_df.to_pandas()

visit_df["visit_start_datetime"] = pd.to_datetime(visit_df["visit_start_datetime"])
visit_df["visit_end_datetime"] = pd.to_datetime(visit_df["visit_end_datetime"])
df = find_visit_occurence_id_polars(
    input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=0
)
df


## Test with large datasets
We are going to compare the speed of the different methods with large datasets.

We create a function to generate datasets

In [None]:
import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow import parquet

import sys

sys.path.append("../../")
import bps_to_omop.general as gen


def create_sample_df(
    n_people: int = 1000,
    n_dates: int = 50,
    first_date: str = "2020-01-01",
    last_date: str = "2023-01-01",
    mean_duration_days: int = 60,
    std_duration_days: int = 180,
    people_pool=None,
) -> pd.DataFrame:
    """
    Creates a dataframe of 'n_people' people with 'n_dates' events each
    on average (events are assigned to people at random).

    Start dates are drawn uniformly from 'first_date' (inclusive) up to
    'last_date' (exclusive).

    Event durations are modelled with a gaussian with
    mean = 'mean_duration_days' and std = 'std_duration_days'. If the
    resulting end date falls before the start date, the start date is
    used as end date.

    The user can provide a list of person_id values using 'people_pool'.
    If provided, no new ids are created; instead, 'n_dates' events in
    total are drawn, each assigned to a random 'person_id' from the pool.
    """
    # == Parameters ==
    np.random.seed(42)
    pd.options.mode.string_storage = "pyarrow"
    # Start date from which to start the dates
    first_date = pd.to_datetime(first_date)
    last_date = pd.to_datetime(last_date)
    max_days = (last_date - first_date).days

    # == Generate IDs randomly ==
    # -- Generate the Ids
    if people_pool is None:
        size = n_people * n_dates
        people = np.random.randint(10000000, 99999999 + 1, size=n_people)
        person_id = np.random.choice(people, size)
    else:
        size = n_dates
        person_id = np.random.choice(people_pool, size)

    # == Generate random dates ==
    # Generate random integers for days and convert to timedelta
    random_days = np.random.randint(0, max_days, size=size)
    # Create the columns
    observation_start_date = first_date + pd.to_timedelta(random_days, unit="D")
    # Generate a gaussian sample of dates
    random_days = np.random.normal(mean_duration_days, std_duration_days, size=size)
    random_days = np.int32(random_days)
    observation_end_date = observation_start_date + pd.to_timedelta(
        random_days, unit="D"
    )
    # Correct end_dates
    # => If they are smaller than start_date, take start_date
    observation_end_date = np.where(
        observation_end_date < observation_start_date,
        observation_start_date,
        observation_end_date,
    )

    # == Generate the code ==
    event_id = np.arange(len(person_id))

    # == Generate the dataframe ==
    df_raw = {
        "event_id": event_id,
        "person_id": person_id,
        "start_date": observation_start_date,
        "end_date": observation_end_date,
    }
    df_raw = pd.DataFrame(df_raw)
    df_raw["start_date"] = pd.to_datetime(df_raw["start_date"]).astype("datetime64[ms]")
    df_raw["end_date"] = pd.to_datetime(df_raw["end_date"]).astype("datetime64[ms]")

    return df_raw

We start with data small enough to be manageable

In [None]:
# Define parameters
n_people = 100000
n_dates_visit = 100
n_dates_input = 100
last_date = "2020-07-31"


In [None]:
# Create input dataset
input_df = create_sample_df(
    n_people=n_people, n_dates=n_dates_input, last_date=last_date
)
input_df = input_df.drop("end_date", axis=1).sort_values(["person_id", "start_date"])
input_df.info()

In [None]:
# Create visit dataset
visit_df = create_sample_df(
    n_dates=n_people*n_dates_visit,
    last_date=last_date,
    people_pool=input_df["person_id"].unique(),
)
visit_df = visit_df.sort_values(["person_id", "start_date", "end_date"])
visit_df = visit_df.rename(
    {
        "event_id": "visit_occurrence_id",
        "start_date": "visit_start_datetime",
        "end_date": "visit_end_datetime",
    },
    axis=1,
)
visit_df = visit_df[
    ["person_id", "visit_start_datetime", "visit_end_datetime", "visit_occurrence_id"]
]

# Remove overlap
visit_df = gen.remove_overlap(
    visit_df,
    ["person_id", "visit_start_datetime", "visit_end_datetime", "visit_occurrence_id"],
    [True, True, False, True],
    verbose=0,
)
visit_df.info()
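
Since the matching relies on visits of the same person not overlapping, it can be reassuring to verify that after the overlap removal. The sketch below is one possible check; `has_overlaps` is a hypothetical helper, not part of `bps_to_omop`, and it assumes that a visit starting on the same date the previous one ends does not count as an overlap.

```python
import pandas as pd


def has_overlaps(
    visits_df,
    key="person_id",
    start="visit_start_datetime",
    end="visit_end_datetime",
):
    """True if any visit starts before the previous visit of that person ends."""
    ordered = visits_df.sort_values([key, start, end])
    # End date of the previous visit of the same person (NaT for the first one)
    previous_end = ordered.groupby(key)[end].shift()
    return bool((ordered[start] < previous_end).any())


touching = pd.DataFrame(
    {
        "person_id": [7, 7],
        "visit_start_datetime": pd.to_datetime(["2020-01-01", "2020-01-05"]),
        "visit_end_datetime": pd.to_datetime(["2020-01-05", "2020-01-10"]),
    }
)
overlapping = touching.assign(
    visit_end_datetime=pd.to_datetime(["2020-01-07", "2020-01-10"])
)
```

With visits sorted by start date, checking consecutive pairs is enough to detect whether any overlap exists, as long as no visit ends before it starts.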

First we check that the results make sense: the three implementations should return the same output

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    df_0 = find_visit_occurence_id(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=2)
    df_1 = find_visit_occurence_id_fireducks(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=2)
    df_2 = find_visit_occurence_id_polars(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=2)

df_0 = df_0.sort_values(["person_id", "start_date", "event_id"]).reset_index(drop=True)
df_1 = df_1.sort_values(["person_id", "start_date", "event_id"]).reset_index(drop=True)
df_2 = df_2.sort_values(["person_id", "start_date", "event_id"]).reset_index(drop=True)

pd.testing.assert_frame_equal(df_0, df_1)
pd.testing.assert_frame_equal(df_0, df_2)

Then we measure the speed

In [None]:
with warnings.catch_warnings():
    print(f"- {n_people = }\n- {n_dates_visit = }\n- {n_dates_input = }\n- {input_df.shape = }\n- {visit_df.shape = }")
    warnings.simplefilter(action='ignore', category=FutureWarning)  
    print('\nfind_visit_occurence_id:')
    %timeit -n 5 -r 5 find_visit_occurence_id(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=0)
    print('find_visit_occurence_id_fireducks')
    %timeit -n 5 -r 5 find_visit_occurence_id_fireducks(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=0)
    print('find_visit_occurence_id_polars')
    %timeit -n 5 -r 5 find_visit_occurence_id_polars(input_df, ["person_id", "start_date", "event_id"], visit_df, verbose=0)

(20/06/2025) Interestingly, when there are lots of visits, polars is slightly slower than pandas. However, both solutions are fast because there are not many people in the first place:

```
- n_people = 10000
- n_dates_visit = 100
- n_dates_input = 10000

find_visit_occurence_id:
24.5 ms ± 1.56 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
find_visit_occurence_id_fireducks
23.8 ms ± 222 μs per loop (mean ± std. dev. of 5 runs, 5 loops each)
find_visit_occurence_id_polars
33.1 ms ± 1.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```

However, when the number of records increases, polars starts to be faster:

- 10,000 people
```
- n_people = 10000
- n_dates_visit = 100
- n_dates_input = 100
- input_df.shape = (1000000, 3)
- visit_df.shape = (65055, 4)

find_visit_occurence_id:
1.36 s ± 11.1 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
find_visit_occurence_id_fireducks
1.34 s ± 2.22 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
find_visit_occurence_id_polars
539 ms ± 38.9 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```

- 100,000 people
```
- n_people = 100000
- n_dates_visit = 100
- n_dates_input = 100
- input_df.shape = (10000000, 3)
- visit_df.shape = (649197, 4)

find_visit_occurence_id:
22.1 s ± 80.2 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
find_visit_occurence_id_fireducks
22.1 s ± 70.1 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
find_visit_occurence_id_polars
3.82 s ± 105 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```

### Test with batch processing

Sometimes datasets are too big to be processed at once. Let's try to wrap the function `find_visit_occurence_id` so that it processes people in batches, as a first step towards making it work in parallel.

In [None]:
# First we take the dataframe and use only a certain number of people
ppl_batch = 1000

list_ppl = input_df["person_id"].unique()



In [None]:
def serial_procesing(input_df, visit_df, ppl_batch, func):
    """Apply func to batches of ppl_batch people and concatenate the results."""
    # Get list of unique people
    list_ppl = input_df["person_id"].unique()

    # Iterate over the unique people in batches
    output = []
    for i_init in range(0, len(list_ppl), ppl_batch):
        # Retrieve at most ppl_batch people. Slicing past the end of the
        # array just returns a shorter batch, so no special case is needed.
        list_ppl_tmp = list_ppl[i_init : i_init + ppl_batch]

        # Restrict dataframes to those people
        input_df_tmp = input_df[input_df["person_id"].isin(list_ppl_tmp)]
        visit_df_tmp = visit_df[visit_df["person_id"].isin(list_ppl_tmp)]
        output_tmp = func(
            input_df_tmp,
            ["person_id", "start_date", "event_id"],
            visit_df_tmp,
            verbose=0,
        )
        output.append(output_tmp)

    # Concatenate and return
    return pd.concat(output)
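
Because each batch of people is independent from the rest, the same loop could in principle be handed to a pool of workers. The following is a sketch under several assumptions: `iter_person_batches`, `threaded_processing` and `count_visits` are hypothetical names, the batch function takes just `(events_batch, visits_batch)`, and a thread pool is used only because it is the simplest to set up. Whether threads give any speedup depends on how much of the underlying work releases the GIL, so this would need benchmarking.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def iter_person_batches(events_df, visits_df, ppl_batch, key="person_id"):
    """Yield (events, visits) pairs restricted to ppl_batch people at a time."""
    people = events_df[key].unique()
    for i_init in range(0, len(people), ppl_batch):
        batch = people[i_init : i_init + ppl_batch]
        yield (
            events_df[events_df[key].isin(batch)],
            visits_df[visits_df[key].isin(batch)],
        )


def threaded_processing(events_df, visits_df, ppl_batch, func, max_workers=4):
    """Run func(events_batch, visits_batch) on every batch in a thread pool."""
    batches = iter_person_batches(events_df, visits_df, ppl_batch)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the batches
        results = list(pool.map(lambda pair: func(*pair), batches))
    return pd.concat(results, ignore_index=True)


# Toy batch function: annotate each event with the number of visits in its batch
def count_visits(events_batch, visits_batch):
    return events_batch.assign(n_visits=len(visits_batch))


toy_events = pd.DataFrame({"person_id": [1, 1, 2, 3], "event_id": [1, 2, 3, 4]})
toy_visits = pd.DataFrame({"person_id": [1, 2, 2], "visit_occurrence_id": [1, 2, 3]})
toy_out = threaded_processing(toy_events, toy_visits, 1, count_visits)
```

To try it with the functions of this notebook, `func` could be something like `lambda e, v: find_visit_occurence_id(e, ["person_id", "start_date", "event_id"], v)`. Note that `Executor.map` consumes all batches up front, so this sketch does not by itself lower peak memory.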


In [None]:
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    df_0 = serial_procesing(input_df, visit_df, 1000, find_visit_occurence_id)
    df_1 = serial_procesing(input_df, visit_df, 1000, find_visit_occurence_id_fireducks)
    df_2 = serial_procesing(input_df, visit_df, 1000, find_visit_occurence_id_polars)

df_0 = df_0.sort_values(["person_id", "start_date", "event_id"]).reset_index(drop=True)
df_1 = df_1.sort_values(["person_id", "start_date", "event_id"]).reset_index(drop=True)
df_2 = df_2.sort_values(["person_id", "start_date", "event_id"]).reset_index(drop=True)

pd.testing.assert_frame_equal(df_0, df_1)
pd.testing.assert_frame_equal(df_0, df_2)

In [None]:
with warnings.catch_warnings():
    print(f"- {n_people = }\n- {n_dates_visit = }\n- {n_dates_input = }\n- {input_df.shape = }\n- {visit_df.shape = }")
    warnings.simplefilter(action='ignore', category=FutureWarning)  
    print('\nSerial processing with find_visit_occurence_id:')
    %timeit -n 5 -r 5 serial_procesing(input_df, visit_df, 10000, find_visit_occurence_id)
    print('Serial processing with find_visit_occurence_id_fireducks')
    %timeit -n 5 -r 5 serial_procesing(input_df, visit_df, 10000, find_visit_occurence_id_fireducks)
    print('Serial processing with find_visit_occurence_id_polars')
    %timeit -n 5 -r 5 serial_procesing(input_df, visit_df, 10000, find_visit_occurence_id_polars)

(20/06/2025)

- 10,000 people with batches of 1,000 people
```
- n_people = 10000
- n_dates_visit = 100
- n_dates_input = 100
- input_df.shape = (1000000, 3)
- visit_df.shape = (65055, 4)

Serial processing with find_visit_occurence_id:
1.18 s ± 3.9 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
Serial processing with find_visit_occurence_id_fireducks
1.18 s ± 3.61 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
Serial processing with find_visit_occurence_id_polars
1.01 s ± 15.6 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```

- 100,000 people with batches of 1,000 people
```
- n_people = 100000
- n_dates_visit = 100
- n_dates_input = 100
- input_df.shape = (10000000, 3)
- visit_df.shape = (649197, 4)

Serial processing with find_visit_occurence_id:
15.5 s ± 205 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
Serial processing with find_visit_occurence_id_fireducks
15.3 s ± 8.8 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
Serial processing with find_visit_occurence_id_polars
18.4 s ± 44.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```

- 100,000 people with batches of 10,000 people
```
- n_people = 100000
- n_dates_visit = 100
- n_dates_input = 100
- input_df.shape = (10000000, 3)
- visit_df.shape = (649197, 4)

Serial processing with find_visit_occurence_id:
14.7 s ± 46.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
Serial processing with find_visit_occurence_id_fireducks
14.7 s ± 13.1 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
Serial processing with find_visit_occurence_id_polars
6.94 s ± 80.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
```
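
All the variants timed above still build, for each person, every event and visit pair before filtering. One alternative that might be worth benchmarking is `pandas.merge_asof`, which looks up for each event only the latest visit starting on or before it. This is a different technique from the one implemented in this notebook, and the sketch assumes that visits of the same person do not overlap (as after `gen.remove_overlap`); `match_visits_asof` is a hypothetical name. Note also that an event falling on a day shared by two visits is assigned here to the visit that starts that day, whereas the functions above keep the one that ends that day.

```python
import pandas as pd


def match_visits_asof(events_df, visits_df):
    """Assign to each event the visit containing it, without a full merge."""
    # merge_asof requires both frames to be sorted by the date used as key
    events_sorted = events_df.sort_values("start_date")
    visits_sorted = visits_df.sort_values("visit_start_datetime")
    out = pd.merge_asof(
        events_sorted,
        visits_sorted,
        left_on="start_date",
        right_on="visit_start_datetime",
        by="person_id",
        direction="backward",
    )
    # Discard candidates whose visit had already ended before the event
    inside = out["start_date"] <= out["visit_end_datetime"]
    for col in ["visit_occurrence_id", "visit_start_datetime", "visit_end_datetime"]:
        out[col] = out[col].where(inside)
    return out


asof_events = pd.DataFrame(
    {
        "person_id": [1, 2, 3],
        "event_id": [1, 2, 3],
        "start_date": pd.to_datetime(["2020-01-05", "2020-02-05", "2020-02-05"]),
    }
)
asof_visits = pd.DataFrame(
    {
        "person_id": [1, 2],
        "visit_occurrence_id": [1, 2],
        "visit_start_datetime": pd.to_datetime(["2020-01-01", "2020-01-01"]),
        "visit_end_datetime": pd.to_datetime(["2020-01-10", "2020-01-10"]),
    }
)
asof_result = match_visits_asof(asof_events, asof_visits)
```

The output has one row per event, so the intermediate table never grows beyond the size of the events dataframe.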