# New observation_period_grouping

It turns out that the code that groups the dates in OBSERVATION_PERIOD is not entirely correct: it does not account for appointments contained one inside another. We already solved this for VISIT_OCCURRENCE. Since we have also found that everything runs much (**MUCH**) faster when we do not rely on concatenating dataframes, we will take the opportunity to rewrite it.

First we need to remove the appointments that are contained within a previous one. This was already done for VISIT_OCCURRENCE, but we will try to rewrite it as a recursive function.

Then, with the dataframe cleaned of contained appointments, we will write another function that computes the distances between appointments and merges those that are close together.

**12/09/2024** - The functions created and described in this document have been moved to `ETL1_transform.general`. They replace the ones written for the OBSERVATION_PERIOD and VISIT_OCCURRENCE tables, so those have been deleted from their respective files.

# 1. Remove overlap

The problem is to identify whether, for a given patient, the system holds overlapping appointments. To check this, the rows must be sorted within each person.

We will sort first by person_id in ascending order, so that all of a person's interactions are together. Second, by start_date in ascending order, so that the start dates are ordered in time. **Third, by end_date, but in descending order**. This ensures that, among appointments starting on the same day, the first row is the longest one and is therefore the most likely to contain the others.

If both the start_date and the end_date of a row fall within the interval defined by the start_date and end_date of the previous row, that row is contained in the previous one and can be removed.

The next key point is the hierarchy of appointments: if two appointments overlap and one has a hospital visit code (code 8756) and the other a pharmacy prescription code (code 581458), the more general event is the hospital visit, so it is the one that must prevail when we have to choose which one to drop.

## 1.1 Creating the test dataset
We will build a dataframe by hand that contains every problem we might run into:
- Later dates completely contained within earlier dates
    * (2020-01-01, 2020-02-01) contains (2020-01-02, 2020-01-02) and (2020-01-04, 2020-01-04)
        - Here I want to delete the contained rows.
- Later dates that partially overlap earlier dates
    * (2020-03-01, 2020-04-01) partially contains (2020-03-15, 2020-04-15)
        - Here I want to combine the older one with the newer one => (2020-03-01, 2020-04-15).
- Make sure data from two different people is never mixed.
    * Check that rows with different person_id are never combined.

In [2]:
import pandas as pd

nombre_columnas = ['person_id', 'start_date',
                   'end_date', 'type_concept', 'should_remain']
filas = [
    # Date problems
    (1, '2020-01-01', '2020-02-01', 1, True),
    (1, '2020-01-02', '2020-01-02', 2, False),
    (1, '2020-01-04', '2020-01-04', 2, False),
    (1, '2020-01-06', '2020-01-06', 2, False),
    (1, '2020-01-08', '2020-01-08', 2, False),
    (1, '2020-01-04', '2020-01-04', 2, False),
    (1, '2020-03-01', '2020-04-01', 1, True),
    (1, '2020-03-15', '2020-04-15', 2, True),
    # person_id problems
    (1, '2021-01-01', '2021-02-01', 1, True),
    (2, '2021-01-02', '2021-02-02', 2, True),
    (3, '2021-01-04', '2021-02-04', 2, True),
    (1, '2021-03-01', '2021-04-01', 1, True),
    (2, '2021-03-15', '2021-04-15', 2, True),
    # type_concept problems
    (4, '2022-03-01', '2022-04-01', 2, False),
    (4, '2022-03-01', '2022-04-01', 1, True),
]
df_raw = pd.DataFrame.from_records(filas, columns=nombre_columnas)
df_raw['start_date'] = pd.to_datetime(df_raw['start_date'])
df_raw['end_date'] = pd.to_datetime(df_raw['end_date'])
(print(df_raw.info()))
df_raw

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   person_id      15 non-null     int64         
 1   start_date     15 non-null     datetime64[ns]
 2   end_date       15 non-null     datetime64[ns]
 3   type_concept   15 non-null     int64         
 4   should_remain  15 non-null     bool          
dtypes: bool(1), datetime64[ns](2), int64(2)
memory usage: 627.0 bytes
None


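|   | person_id | start_date | end_date | type_concept | should_remain |
|---|-----------|------------|------------|--------------|---------------|
| 0 | 1 | 2020-01-01 | 2020-02-01 | 1 | True |
| 1 | 1 | 2020-01-02 | 2020-01-02 | 2 | False |
| 2 | 1 | 2020-01-04 | 2020-01-04 | 2 | False |
| 3 | 1 | 2020-01-06 | 2020-01-06 | 2 | False |
| 4 | 1 | 2020-01-08 | 2020-01-08 | 2 | False |
| 5 | 1 | 2020-01-04 | 2020-01-04 | 2 | False |
| 6 | 1 | 2020-03-01 | 2020-04-01 | 1 | True |
| 7 | 1 | 2020-03-15 | 2020-04-15 | 2 | True |
| 8 | 1 | 2021-01-01 | 2021-02-01 | 1 | True |
| 9 | 2 | 2021-01-02 | 2021-02-02 | 2 | True |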


In [3]:
df = df_raw.copy()
df.groupby('person_id')['type_concept'].first()

person_id
1    1
2    2
3    2
4    2
Name: type_concept, dtype: int64

## 1.2 Remove overlap
The idea behind this code is simple. We start from a dataframe with the columns `person_id`, `start_date`, `end_date`, `type_concept`, and sort it as follows:
1. person_id, ascending
2. start_date, ascending
3. end_date, descending
4. type_concept, ascending

* This ensures that, for each `start_date`, the first row has the latest `end_date`, which is the one that can contain the other rows sharing the same `start_date`.
* The type_concept column must first be converted to a categorical type with a predefined order so that it can be sorted. That order represents the priority of the code. When everything else is equal, the row that stays is the one that sorts first.
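
A minimal sketch of this ordering, assuming a hypothetical priority list in which the hospital visit (8756) outranks the pharmacy prescription (581458). The real priority order is not defined in this notebook, so `priority` is only an illustration:

```python
import pandas as pd

# Hypothetical priority order: earlier in the list means higher priority
priority = [8756, 581458]

df_demo = pd.DataFrame({
    'person_id': [4, 4],
    'start_date': pd.to_datetime(['2022-03-01', '2022-03-01']),
    'end_date': pd.to_datetime(['2022-04-01', '2022-04-01']),
    'type_concept': [581458, 8756],
})

# An ordered categorical sorts by category order, not by numeric value
df_demo['type_concept'] = pd.Categorical(
    df_demo['type_concept'], categories=priority, ordered=True)

df_sorted = df_demo.sort_values(
    ['person_id', 'start_date', 'end_date', 'type_concept'],
    ascending=[True, True, False, True])
print(df_sorted)
```

With identical dates, the row with code 8756 sorts first, so it is the one that would survive the overlap removal.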

In [None]:
def find_overlap_index(df: pd.DataFrame) -> pd.Series:
    """Finds all rows that are contained within the previous
    row, making sure they belong to the same person_id.

    Parameters
    ----------
    df : pd.DataFrame
        pandas Dataframe with at least three columns.
        Assumes first column is person_id, second column is
        start_date and third column is end_date

    Returns
    -------
    pd.Series
        pandas Series with bools. True if row is contained
        within the previous row, False otherwise.
    """
    # 1. Check that current and previous patient are the same
    idx_person = df.iloc[:, 0] == df.iloc[:, 0].shift(1)
    # 2. Check that current start_date is not earlier than previous start_date
    idx_start = df.iloc[:, 1] >= df.iloc[:, 1].shift(1)
    # 3. Check that current end_date is not later than previous end_date
    idx_end = df.iloc[:, 2] <= df.iloc[:, 2].shift(1)
    # 4. If everything past is true, I can drop the row
    return idx_start & idx_end & idx_person


def remove_all_overlap_original(df: pd.DataFrame,
                                counter_lim: int = 1000,
                                verbose: bool = False) -> pd.DataFrame:

    cols_to_show = ['person_id', 'start_date', 'end_date', 'type_concept']
    # Prepare the while loop
    idx_to_remove_sum = 1
    counter = 0
    if verbose:
        print('Cleaning...')
    # Start the loop
    while (idx_to_remove_sum > 0) and (counter <= counter_lim):
        # Get the rows
        idx_to_remove = find_overlap_index(df)
        # Prepare next loop
        idx_to_remove_sum = idx_to_remove.sum()
        counter += 1
        # Print the statements
        if verbose:
            print(f"{counter} => {idx_to_remove_sum} rows removed. Example:")
        # Show info of first case as an example
        if verbose and (idx_to_remove_sum > 0):
            idx_first_true = idx_to_remove.idxmax()
            print(df.loc[[idx_first_true-1, idx_first_true], cols_to_show])
        # Remove the overlapping rows
        df = df.loc[~idx_to_remove].reset_index(drop=True)

    return df


def remove_overlap(
        df: pd.DataFrame,
        verbose: int = 0,
        _counter: int = 0,
        _counter_lim: int = 1000) -> pd.DataFrame:
    """Removes all rows that are completely contained within 
    another row. It will not remove rows that are only partially
    contained within the previous one.

    Parameters
    ----------
    df : pd.DataFrame
        pandas dataframe with at least four columns:
        ['person_id', 'start_date', 'end_date', 'type_concept'].
        Column names do not need to be the same but, the order 
        must be the same as here. 
        This allows its use for different tables with columns
        that have the same purpose but different names.
    verbose : int, optional
        Information output, by default 0
        - 0 No info
        - 1 Show number of iterations
        - 2 Show an example of the first row being removed and
            the row that contains it.
    _counter : int
        Iteration control param. Number of iterations. 
        0 will be used to begin and function will take over.
    _counter_lim : int, optional
        Iteration control param. Limit of iterations, by default 1000

    Returns
    -------
    pd.DataFrame
        Copy of input dataframe with contained rows removed.
    """
    # == Preparation =================================================
    # Sort the dataframe if first iteration
    if _counter == 0:
        df = df.sort_values(
            [df.columns[0], df.columns[1], df.columns[2], df.columns[3]],
            ascending=[True, True, False, True])

    # == Find indexes ================================================
    # Get the rows
    idx_to_remove = find_overlap_index(df)

    # == Main "loop" =================================================
    # Prepare next loop
    idx_to_remove_sum = idx_to_remove.sum()
    _counter += 1
    # If there's still room to go, go
    if (idx_to_remove_sum != 0) and (_counter < _counter_lim):
        if verbose > 0:
            # Show iteration and number of rows removed
            print(f"Iter {_counter} => {idx_to_remove_sum} rows removed.")
        if verbose > 1:
            # Get first removed row and show container and contained row
            idx_max = df.index.get_loc(idx_to_remove.idxmax())
            print(f"{df.iloc[(idx_max-1):idx_max+1, :4]}")

        # Pass the limit along so a custom _counter_lim is not lost
        return remove_overlap(df.loc[~idx_to_remove], verbose,
                              _counter, _counter_lim)
    else:
        return df

In [None]:
import sys
# Make the parent folder importable before loading the local package
sys.path.append('..')

import pyarrow as pa
import bps_to_omop.general as gen

table_raw = pa.Table.from_pandas(df_raw)
table_raw = table_raw.cast(
    pa.schema([
        ('person_id', pa.int64()),
        ('start_date', pa.date64()),
        ('end_date', pa.date64()),
        ('type_concept', pa.int64()),
        ('should_remain', pa.int64())
    ])
)
df_rare = table_raw.to_pandas()
df_done = gen.remove_overlap(df_rare, 1)
# df_done = remove_all_overlap_original(df_rare, 10, verbose=True)
df_done.sort_index()

In [None]:
table_raw = pa.Table.from_pandas(df_raw)
table_raw = table_raw.cast(
    pa.schema([
        ('person_id', pa.int64()),
        ('start_date', pa.timestamp('us')),
        ('end_date', pa.timestamp('us')),
        ('type_concept', pa.int64()),
        ('should_remain', pa.int64())
    ])
)
df_rare = table_raw.to_pandas()
df_done = gen.remove_overlap(df_rare, 1)
# df_done = remove_all_overlap_original(df_rare, 10, verbose=True)
df_done.sort_index()

In [None]:
df_done = remove_overlap(df_raw, verbose=1)
# df_done = remove_all_overlap_original(df_rare, 10, verbose=True)
df_done.sort_index()

In [None]:
%timeit -n 10 -r 5 remove_overlap(df_raw, verbose=0)

# 2. Remove nearby rows
Once the rows contained in other rows have been removed, the goal is to merge those that are separated by fewer days than a threshold we choose.

## 2.1 Creating the test dataset
We will assume that dates less than 1 year (exactly 365 days) apart are grouped together.
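
As a quick illustration of that threshold, here is a small sketch that measures two gaps taken from the test data of this section. It assumes the same `>=` comparison against `pd.Timedelta` that the cells further down use to flag a break:

```python
import pandas as pd

n_days = 365
limit = pd.Timedelta(n_days, unit='D')

# Gap between the end of one row and the start of the next one
gap_close = pd.Timestamp('2020-03-01') - pd.Timestamp('2020-02-01')
gap_far = pd.Timestamp('2022-01-01') - pd.Timestamp('2020-12-01')

# A gap at or above the limit is treated as a break between periods
is_break_close = gap_close >= limit
is_break_far = gap_far >= limit

print(gap_close.days, is_break_close)
print(gap_far.days, is_break_far)
```

Only the second gap reaches the limit, so only that pair of rows would end up in separate periods.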

In [None]:
import numpy as np
import pandas as pd

nombre_columnas = ['person_id', 'start_date',
                   'end_date', 'type_concept']
filas = [
    # These dates should be grouped: they are less than 365 days apart
    # type_concept should be 2
    (1, '2020-01-01', '2020-02-01', 1),
    (1, '2020-03-01', '2020-04-01', 2),
    (1, '2020-05-01', '2020-12-01', 2),
    # This last one, from the same person, should not
    (1, '2022-01-01', '2022-01-01', 2),
    # These dates should be grouped because they overlap
    # type_concept should be 1
    (2, '2020-01-01', '2020-06-01', 1),
    (2, '2020-03-01', '2020-09-01', 1),
    (2, '2020-06-01', '2020-12-01', 2),
    # These three dates should NOT be grouped,
    # each one is its own period
    (3, '2021-01-01', '2021-01-01', 1),
    (3, '2023-02-01', '2023-02-01', 2),
    (3, '2024-03-01', '2024-04-01', 3),
    # Close enough to be grouped, but they belong to different people
    (4, '2024-01-01', '2024-02-01', 1),
    (5, '2025-01-01', '2025-02-01', 2),
    # These should be grouped because consecutive rows are close,
    # but if you drop one of them in a single pass the remaining
    # ones end up too far apart and are not grouped.
    # type_concept should be 2
    (6, '2020-01-01', '2020-12-01', 1),
    (6, '2021-01-01', '2021-12-01', 2),
    (6, '2022-01-01', '2022-12-01', 2),
    (6, '2023-01-01', '2023-12-01', 2),
]
df_raw = pd.DataFrame.from_records(filas, columns=nombre_columnas)
df_raw['start_date'] = pd.to_datetime(df_raw['start_date'])
df_raw['end_date'] = pd.to_datetime(df_raw['end_date'])
df_raw

Assuming that **we group periods separated by less than 365 days** and that **we use the mode to compute the final `type_concept`**, the result of grouping the data above should be the following:

In [None]:
nombre_columnas = ['person_id', 'start_date',
                   'end_date', 'type_concept']
filas = [
    # These dates should be grouped: they are less than 365 days apart
    # type_concept should be 2
    (1, '2020-01-01', '2020-12-01', 2),
    # This last one, from the same person, should not
    (1, '2022-01-01', '2022-01-01', 2),
    # These dates should be grouped because they overlap
    # type_concept should be 1
    (2, '2020-01-01', '2020-12-01', 1),
    # These three dates should NOT be grouped,
    # each one is its own period
    (3, '2021-01-01', '2021-01-01', 1),
    (3, '2023-02-01', '2023-02-01', 2),
    (3, '2024-03-01', '2024-04-01', 3),
    # Close enough to be grouped, but they belong to different people
    (4, '2024-01-01', '2024-02-01', 1),
    (5, '2025-01-01', '2025-02-01', 2),
    # These should be grouped because consecutive rows are close,
    # but if you drop one of them in a single pass the remaining
    # ones end up too far apart and are not grouped.
    # type_concept should be 2
    (6, '2020-01-01', '2023-12-01', 2),
]
df_result = pd.DataFrame.from_records(filas, columns=nombre_columnas)
df_result['start_date'] = pd.to_datetime(df_result['start_date'])
df_result['end_date'] = pd.to_datetime(df_result['end_date'])
df_result

## 2.2 Removing nearby rows

There are several problems:
- If several consecutive rows meet the condition, you can lose information when the first and last of those rows are far apart.
- No effective way of doing this without iterating, as before, has been found.
    - Either you iterate person by person, and do not need to worry about mixing people,
    - or you do it all at once, but then it is very hard to keep track of the `type_concept` and `person_id` values you have removed.
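
For comparison only, one possible non-iterative formulation is the "gaps and islands" pattern: flag the rows that start a new period, take a cumulative sum to label the periods, and aggregate. This is a sketch under assumptions, not the method developed in this notebook. The name `group_dates_cumsum` is hypothetical, the column names are fixed for brevity, and ties in the mode are resolved by taking the smallest value:

```python
import pandas as pd


def group_dates_cumsum(df: pd.DataFrame, n_days: int) -> pd.DataFrame:
    """Hypothetical sketch: label periods with a cumulative sum."""
    out = df.sort_values(
        ['person_id', 'start_date', 'end_date'],
        ascending=[True, True, False]).reset_index(drop=True)
    # Latest end_date seen so far for each person, shifted one row down
    prev_end = (out.groupby('person_id')['end_date'].cummax()
                .groupby(out['person_id']).shift(1))
    gap = out['start_date'] - prev_end
    # A period starts on the first row of a person (gap is NaT)
    # or when the gap reaches the limit
    new_period = gap.isna() | (gap >= pd.Timedelta(n_days, unit='D'))
    out['period'] = new_period.cumsum()
    return (out.groupby(['person_id', 'period'], as_index=False)
            .agg(start_date=('start_date', 'min'),
                 end_date=('end_date', 'max'),
                 type_concept=('type_concept',
                               lambda s: s.mode().iloc[0]))
            .drop(columns='period'))


df_demo = pd.DataFrame.from_records([
    (1, '2020-01-01', '2020-02-01', 1),
    (1, '2020-03-01', '2020-04-01', 2),
    (1, '2020-05-01', '2020-12-01', 2),
    (1, '2022-01-01', '2022-01-01', 2),
    (6, '2020-01-01', '2020-12-01', 1),
    (6, '2021-01-01', '2021-12-01', 2),
    (6, '2022-01-01', '2022-12-01', 2),
    (6, '2023-01-01', '2023-12-01', 2),
], columns=['person_id', 'start_date', 'end_date', 'type_concept'])
df_demo['start_date'] = pd.to_datetime(df_demo['start_date'])
df_demo['end_date'] = pd.to_datetime(df_demo['end_date'])

result = group_dates_cumsum(df_demo, 365)
print(result)
```

Whether this is faster than the index-based approach has not been measured here; it is only meant as a cross-check of the expected grouping.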


### 2.2.1 Using only indexes

We need to find the first and last row of each person, as well as the cases where a person has a single row.

Then we also need to find the cases where the next row is far away, which means we have found a gap in the observation period.

In [None]:
# Get the raw files
df_rare = df_raw.sort_values(
    ['person_id', 'start_date', 'end_date'],
    ascending=[True, True, False])
# It is VERY important to reset the index to make sure we can
# retrieve the rows reliably after sorting them.
df_rare = df_rare.reset_index(drop=True)

# Flag the first, last or only row of each person
df_rare['idx_person_first'] = (
    (df_rare['person_id'] == df_rare['person_id'].shift(-1)) &
    (df_rare['person_id'] != df_rare['person_id'].shift(1))
)
df_rare['idx_person_last'] = (
    (df_rare['person_id'] != df_rare['person_id'].shift(-1)) &
    (df_rare['person_id'] == df_rare['person_id'].shift(1))
)
df_rare['idx_person_only'] = (
    (df_rare['person_id'] != df_rare['person_id'].shift(-1)) &
    (df_rare['person_id'] != df_rare['person_id'].shift(1))
)
# Create index if the break is too big and needs to be kept
n_days = 365
df_rare['next_interval'] = (
    df_rare['start_date'].shift(-1) - df_rare['end_date']
)
df_rare['idx_interval'] = (
    df_rare['next_interval'] >= pd.Timedelta(n_days, unit='D')
)
# Combine all to see which rows remain
df_rare['to_remain'] = (
    df_rare['idx_person_first'] |
    df_rare['idx_person_last'] |
    df_rare['idx_person_only'] |
    df_rare['idx_interval']
)

df_rare

These rows must be kept for sure:
- `idx_person_first == True`
- `idx_person_last == True`
- `idx_person_only == True`

1. If a person has only one `idx_person_only == True` row, I keep it and that is all. Nothing else needs to be done: that row holds both the first and the last date of the patient.

2. If a person has one `idx_person_first == True` row and one `idx_person_last == True` row, then I have the beginning and the end.
    1. If there is no `idx_interval == True`, I join the `start_date` of the `idx_person_first == True` row with the `end_date` of the `idx_person_last == True` row.
    2. If there is any `idx_interval == True`, those rows mark gaps in the observation period. A row with `idx_interval == True` is the last one of a period, and the next row is the start of another.

Basically, we need to record the `start_date` values with their respective `person_id` on one side, and the new `end_date` values on the other. Let's write the code that does this.

In [None]:
# We will create an initial dataframe with only person_id and
# start_date. The end_date rows and type_concept will be added
# later as new columns.

# == start_date and person_id ==========================================
# To retrieve the start_date we need the indexes of:
# - people with a single row (idx_person_only == True)
# - first rows of each person (idx_person_first == True)
# - rows just after a period break (idx_interval shifted down by one)

# Rows that start a period. fill_value keeps the shifted column
# boolean instead of introducing a NaN in the first position.
idx_start = df_rare.index[
    df_rare['idx_person_only']
    | df_rare['idx_person_first']
    | df_rare['idx_interval'].shift(1, fill_value=False)
]
# Keep person_id and start_date of those rows
df_done = df_rare.loc[idx_start, ['person_id', 'start_date']]
df_done

# == end_date ==========================================================
# Get the indexes
idx_end = df_rare.index[
    df_rare['idx_person_only']
    | df_rare['idx_person_last']
    | df_rare['idx_interval']
]
# Append values found to final dataframe (1-D array, one per period)
df_done['end_date'] = df_rare.loc[idx_end, 'end_date'].values

df_done

To find the `type_concept`, we can zip the start and end indexes; since both are in order, this yields a list of (start, end) pairs, one per period.

If I collect all the `type_concept` values within each of those periods, I can take the mode and assign the most common `type_concept`.
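
A minimal sketch of that idea on toy data, using `Series.mode()` from pandas instead of scipy. The values and index pairs here are made up for illustration, and ties are resolved by taking the smallest value:

```python
import pandas as pd

# Toy data: two periods given as (start, end) index pairs
type_concept = pd.Series([1, 2, 2, 2, 1, 1])
idx_start = [0, 3]
idx_end = [2, 5]

mode_values = [
    # .loc slicing by label includes both ends of the range
    type_concept.loc[start:end].mode().iloc[0]
    for start, end in zip(idx_start, idx_end)
]
print(mode_values)
```

The first period holds the values (1, 2, 2) and the second (2, 1, 1), so each gets its most frequent code.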

In [None]:
import scipy.stats as st
# I can iterate over idx_start and idx_end to get the
# periods
mode_values = []
for i in np.arange(len(idx_start)):
    df_tmp = df_rare.loc[idx_start[i]:idx_end[i]]
    print(f"{i=}")
    print(df_tmp[['person_id', 'start_date', 'end_date', 'type_concept']])
    mode = st.mode(df_tmp['type_concept'].values)
    print(f"mode is {mode}", '\n')
    mode_values.append(mode[0])

In [None]:
# Add to dataframe
df_done['type_concept'] = mode_values
df_done

And that's it: the dataframe now holds only the starts and ends of the periods, together with the most common type_concept, computed with the mode. We check that it matches the expected results:

In [None]:
check_person = df_done['person_id'].values == df_result['person_id'].values
print(f"{'Person_id col is correct:':<28} {check_person.all()}")
check_start_date = df_done['start_date'].values == df_result['start_date'].values
print(f"{'start_date col is correct:':<28} {check_start_date.all()}")
check_end_date = df_done['end_date'].values == df_result['end_date'].values
print(f"{'end_date col is correct:':<28} {check_end_date.all()}")
check_type_concept = df_done['type_concept'].values == df_result['type_concept'].values
print(f"{'type_concept col is correct:':<28} {check_type_concept.all()}")

### 2.2.2 Test with a recursive function (NOT FINISHED)

THIS IS KEPT HERE FOR FUTURE REFERENCE. THE CODE IS NOT FINISHED BECAUSE THE INDEX-BASED METHOD WORKS WELL ENOUGH AND THERE IS NO GUARANTEE THAT THIS WOULD IMPROVE ON IT.

It seems that the best option (cough) will be to repeat the previous strategy and iterate recursively. That way we can also keep track of the type_concept values and keep the most representative one.

In [None]:
# def find_neighbors_index(df: pd.DataFrame,
#                          n_days: int) -> pd.Series:

#     # 1. Check that current and next patient are the same
#     idx_person = df.iloc[:, 0] == df.iloc[:, 0].shift(-1)
#     # 2. Check that current end_date and next start_date
#     # are closer than n_days
#     idx_interval = (
#         (df.iloc[:, 2] - df.iloc[:, 1].shift(-1)) <=
#         pd.Timedelta(n_days, unit='D')
#     )
#     # 4. If everything past is true, I can drop the row
#     return idx_person & idx_interval


# def remove_all_neighbors_recursive_v1(
#         df: pd.DataFrame,
#         n_days: int,
#         verbose: int = 0,
#         _counter: int = 0,
#         _counter_lim: int = 1000) -> pd.DataFrame:

#     # Get the rows
#     idx_to_remove = find_neighbors_index(df, n_days)
#     # Prepare next loop
#     idx_to_remove_sum = idx_to_remove.sum()
#     _counter += 1
#     # If there's still room to go, go
#     if (idx_to_remove_sum != 0) and (_counter < _counter_lim):
#         if verbose >= 1:
#             print(f"Iter {_counter} => {idx_to_remove_sum} rows removed.")
#         if verbose >= 2:
#             print(df[idx_to_remove].head(10))

#         # Modify end_dates
#         df.iloc[:,2] = np.where(idx_to_remove,
#                                 df.iloc[:,2].shift(-1),
#                                 df.iloc[:,2])
#         return remove_all_neighbors_recursive_v1(
#             df[idx_to_remove], verbose, _counter)
#     else:
#         return df

# n_days = 365
# df_rare = df_raw.sort_values(
#     ['person_id', 'start_date', 'end_date'],
#     ascending=[True, True, True])
# df_rare = remove_all_neighbors_recursive_v1(df_rare, n_days, verbose=2)
# df_done = df_rare.sort_index()
# df_done

In [None]:
# n_days = 365
# df = df_raw.sort_values(
#     ['person_id', 'start_date', 'end_date'],
#     ascending=[True, True, True])

# # >>> Iter 1
# idx_to_remove = find_neighbors_index(df, n_days)
# print(f"Iter {1} => {idx_to_remove.sum()} rows removed.")
# print(df[idx_to_remove])
# # <<<

# # Record changes
# df['to_join'] = idx_to_remove
# df['new_end_date'] = np.where(df['to_join'],
#                               df.iloc[:, 2].shift(-1),
#                               df.iloc[:, 2])
# df

In [None]:
# # >>> # Iter 2
# df.iloc[:,2] = np.where(idx_to_remove,
#                         df.iloc[:,2].shift(-1),
#                         df.iloc[:,2])
# idx_to_remove = find_neighbors_index(df, n_days)
# # <<<

# # Record changes
# df['to_join'] = idx_to_remove
# df['new_end_date'] = np.where(df['to_join'],
#                               df.iloc[:, 2].shift(-1),
#                               df.iloc[:, 2])
# df

In [None]:
# # >>> # Iter 3
# df.iloc[:,2] = np.where(idx_to_remove,
#                         df.iloc[:,2].shift(-1),
#                         df.iloc[:,2])
# idx_to_remove = find_neighbors_index(df, n_days)
# # <<<

# # Record changes
# df['to_join'] = idx_to_remove
# df['new_end_date'] = np.where(df['to_join'],
#                               df.iloc[:, 2].shift(-1),
#                               df.iloc[:, 2])
# df

## 2.3 All together

Now we put everything into a single function so that we can time it.

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as st


def find_person_index(
        df: pd.DataFrame) -> tuple[pd.Series, pd.Series, pd.Series]:
    """Flags the first, last and only row of each person_id.
    Assumes the rows of each person are contiguous (sorted).

    Parameters
    ----------
    df : pd.DataFrame
        pandas Dataframe with at least three columns.
        Assumes first column is person_id, second column is
        start_date and third column is end_date

    Returns
    -------
    tuple[pd.Series]
        Tuple with three pandas Series with bools:
        - idx_person_first, True if first row of the person
        - idx_person_last, True if last row of the person
        - idx_person_only, True if only row of the person
        False otherwise.
    """

    # Create index for first, last or only person in dataset
    idx_person_first = (
        (df.iloc[:, 0] == df.iloc[:, 0].shift(-1)) &
        (df.iloc[:, 0] != df.iloc[:, 0].shift(1))
    )
    idx_person_last = (
        (df.iloc[:, 0] != df.iloc[:, 0].shift(-1)) &
        (df.iloc[:, 0] == df.iloc[:, 0].shift(1))
    )
    idx_person_only = (
        (df.iloc[:, 0] != df.iloc[:, 0].shift(-1)) &
        (df.iloc[:, 0] != df.iloc[:, 0].shift(1))
    )
    return (idx_person_first, idx_person_last, idx_person_only)


def group_dates(df: pd.DataFrame, n_days: int) -> pd.DataFrame:
    """Groups rows of dates from the same person that are less
    than n_days apart, keeping only the first start_date and
    the last end_date, respectively. 

    It will remove rows that are partially contained within 
    the previous one.

    Parameters
    ----------
    df : pd.DataFrame
        pandas dataframe with at least four columns: 
        ['person_id', 'start_date', 'end_date', 'type_concept'].
        Column names do not need to be the same but, the order 
        must be the same as here. 
        This allows its use for different tables with columns
        that have the same purpose but different names.
    n_days : int
        Gap, in days, from which two consecutive rows of the same
        person are treated as separate periods. Rows closer than
        this are grouped together.

    Returns
    -------
    pd.DataFrame
        Copy of input dataframe with grouped rows.
    """

    # == Preparation ==============================================
    # Sort so we know for sure the order is right
    df_rare = df.copy().sort_values(
        [df.columns[0], df.columns[1], df.columns[2]],
        ascending=[True, True, False])
    # It is VERY important to reset the index to make sure we can
    # retrieve the rows reliably after sorting them.
    df_rare = df_rare.reset_index(drop=True)

    # == Index look-up ============================================
    (idx_person_first, idx_person_last, idx_person_only) = find_person_index(df_rare)
    # Create index if the break is too big and needs to be kept
    next_interval = df_rare.iloc[:, 1].shift(-1) - df_rare.iloc[:, 2]
    idx_interval = next_interval >= pd.Timedelta(n_days, unit='D')

    # == Retrieve relevant rows ===================================
    # -- start_date and person_id ---------------------------------
    # To retrieve the start_date we need the indexes of:
    # - people with a single row (idx_person_only == True)
    # - first rows of each person (idx_person_first == True)
    # - rows just after a period break (idx_interval shifted down)

    # Rows that start a period. fill_value keeps the shifted
    # series boolean instead of introducing a NaN at the top.
    idx_start = df_rare.index[
        idx_person_only | idx_person_first
        | idx_interval.shift(1, fill_value=False)
    ]

    # -- end_date -------------------------------------------------
    # Get the indexes
    idx_end = df_rare.index[
        idx_person_only | idx_person_last | idx_interval
    ]

    # == Compute type_concept =====================================
    # Iterate over idx_start and idx_end to get the periods
    mode_values = []
    for i in np.arange(len(idx_start)):
        df_tmp = df_rare.loc[idx_start[i]:idx_end[i]]
        mode_values.append(st.mode(df_tmp.iloc[:, 3].values)[0])

    # == Build final dataframe ====================================
    # Create a copy (.loc) with the first two columns
    df_done = df_rare.loc[idx_start, [df.columns[0], df.columns[1]]]
    # Append values found to final dataframe
    df_done[df.columns[2]] = df_rare.loc[idx_end, df.columns[2]].values
    # Add to dataframe
    df_done[df.columns[3]] = mode_values

    return df_done

In [None]:
# == Parameters ==
n_days = 365

# == Group the data ==
df_done = group_dates(df_raw, n_days)
df_done

We check again that the result is correct:

In [None]:
check_person = df_done['person_id'].values == df_result['person_id'].values
print(f"{'Person_id col is correct:':<28} {check_person.all()}")
check_start_date = df_done['start_date'].values == df_result['start_date'].values
print(f"{'start_date col is correct:':<28} {check_start_date.all()}")
check_end_date = df_done['end_date'].values == df_result['end_date'].values
print(f"{'end_date col is correct:':<28} {check_end_date.all()}")
check_type_concept = df_done['type_concept'].values == df_result['type_concept'].values
print(f"{'type_concept col is correct:':<28} {check_type_concept.all()}")

In [None]:
%timeit -n 10 -r 10 group_dates(df_raw, n_days)

# 3. Test with large datasets
Let's check whether the pyarrow method is still faster with large datasets.

We bring in the function that generates datasets.

In [None]:
import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow import parquet


def create_sample_df(n: int = 1000, n_dates: int = 50,
                     first_date: str = '2020-01-01',
                     last_date: str = '2023-01-01',
                     mean_duration_days: int = 60,
                     std_duration_days: int = 180) -> pd.DataFrame:

    # == Parameters ==
    np.random.seed(42)
    pd.options.mode.string_storage = "pyarrow"
    # Start date from which to start the dates
    first_date = pd.to_datetime(first_date)
    last_date = pd.to_datetime(last_date)
    max_days = (last_date-first_date).days

    # == Generate IDs randomly ==
    # -- Generate the Ids
    people = np.random.randint(10000000, 99999999 + 1, size=n)
    person_id = np.random.choice(people, n*n_dates)

    # == Generate random dates ==
    # Generate random integers for days and convert to timedelta
    random_days = np.random.randint(0, max_days, size=n*n_dates)
    # Create the columns
    observation_start_date = first_date + \
        pd.to_timedelta(random_days, unit='D')
    # Generate a gaussian sample of dates
    random_days = np.random.normal(
        mean_duration_days, std_duration_days, size=n*n_dates)
    random_days = np.int32(random_days)
    observation_end_date = observation_start_date + \
        pd.to_timedelta(random_days, unit='D')
    # Correct end_dates
    # => If they are smaller than start_date, take start_date
    observation_end_date = np.where(observation_end_date < observation_start_date,
                                    observation_start_date, observation_end_date)

    # == Generate the code ==
    period_type_concept_id = np.random.randint(1, 11, size=n*n_dates)

    # == Generate the dataframe ==
    df_raw = {'person_id': person_id, 'observation_period_start_date': observation_start_date,
              'observation_period_end_date': observation_end_date, 'period_type_concept_id': period_type_concept_id}
    return pd.DataFrame(df_raw)

We bring in the pyarrow function as it was on 12/09/2024.

In [None]:
import os
import sys

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Add the parent directory to the Python path so that the
# functions in the ETL* folders can be imported
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

import bps_to_omop.general as gen


def group_dates_original_pyarrow(table_done, n_days):
    # -- Thirdly, group up dates -----------------------------------------------
    # Group the dates with group_person_dates(). It computes the time gap
    # between adjacent rows of each person and merges them when the gap
    # is below the limit set by n_days.
    table_OBSERVATION_PERIOD = []

    person_list = pc.unique(table_done['person_id'])
    # Process each person separately
    for person in person_list:
        # --Group person
        table_person = group_person_dates(table_done, person, n_days)
        # Append table
        table_OBSERVATION_PERIOD.append(table_person)
    # Concatenate
    table_OBSERVATION_PERIOD = pa.concat_tables(table_OBSERVATION_PERIOD)
    return table_OBSERVATION_PERIOD


def group_person_dates(
        table_rare: pa.Table,
        person: str | int,
        n_days: int) -> pa.Table:
    """Filters original table for a specific person and reduces
    the amount of date records grouping all records that are separated
    by n_days or less.

    Parameters
    ----------
    table_rare : pa.Table
        Table as prepared by 'prepare_table_raw_to_rare()'.
    person : str | int
        person id, can be an int (the usual) or a string.
    n_days : int
        number of maximum days between subsequent records.

    Returns
    -------
    pa.Table
        Table identical to table_rare but with less date records.
    """

    # Filter for the current person_id
    filt = pc.is_in(table_rare['person_id'],  # pylint: disable=E1101
                    pa.array([person]))
    table_person = table_rare.filter(filt)
    # Retrieve corresponding dates
    start_dates = table_person['start_date']
    end_dates = table_person['end_date']
    # Group dates closer
    start_dates, end_dates, _ = group_observation_dates(
        start_dates, end_dates, n_days, verbose=False)
    # Create person
    person_id = gen.create_uniform_int_array(len(start_dates),
                                             value=person)
    # Retrieve most common period type
    period_type_concept_id = pc.mode(  # pylint: disable=E1101
        table_person['period_type_concept_id'])[0][0]
    period_type_concept_id = gen.create_uniform_int_array(len(start_dates),
                                                          value=period_type_concept_id)
    # return table
    return pa.Table.from_arrays(
        [person_id, start_dates, end_dates, period_type_concept_id],
        names=['person_id', 'start_date', 'end_date', 'period_type_concept_id'])


def group_observation_dates(
        start_dates: pa.Array,
        end_dates: pa.Array,
        n_days: int,
        verbose: bool = False) -> tuple[pa.Array, pa.Array, None | pa.Table]:
    """Given a pair of 'start_dates' and 'end_dates', it will
    compute the days between each 'end_date' and the next
    'start_date' and merge records that are separated by fewer
    than a given number of days ('n_days').

    The new dates will only contain start and end dates that have
    at least 'n_days' of difference between them.

    If dates contain nans/nulls, they will be ignored and grouped 
    with the closest dates.

    Parameters
    ----------
    start_dates : pa.Array
        Array of start dates
    end_dates : pa.Array
        Array of end dates
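    n_days : int
        Minimum number of days between an end date and the next
        start date for the two records to be kept as separate periods.
    verbose : bool, optional
        If True, also build and return a table with the dates and
        the intervals between them, by default False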

    Returns
    -------
    tuple[pa.Array, pa.Array, None | pa.Table]
        Always returns a 3-item tuple.
        First item is reduced start dates.
        Second item is reduced end dates.
        Third item is None if verbose=False;
        if verbose=True, it is a table with start_dates,
        end_dates and the days between them. Useful when
        verifying dates.

    Raises
    ------
    AssertionError
        The resulting start dates should always come before
        their corresponding end dates. An AssertionError is
        raised otherwise.
    """
    # Get an array of end_dates, taking away the last one
    # (last date cannot be compared to the next start date)
    from_dates = end_dates[:-1]
    # Get an array of start_dates, taking away the first one
    # (first date cannot be compared to the previous end date)
    to_dates = start_dates[1:]

    # -- Compute days between
    intervals = pc.days_between(  # pylint: disable=E1101
        from_dates, to_dates).to_numpy(zero_copy_only=False)
    # Create an inner table for the calculations if verbose
    inner_table = None
    if verbose:
        inner_table = pa.Table.from_arrays(
            [start_dates, end_dates, pa.array(np.append(intervals, np.nan))],
            names=['start', 'end', 'intervals'])

    # Filter intervals under some assumption
    filt = intervals >= n_days
    # => When this filt is 'true', it means that for that index,
    # let's call it 'idx', between the end date of 'idx' and the start
    # date of 'idx+1' there are at least 'n_days' days.
    # i.e.:
    # (start_date[idx+1] - end_date[idx]).days >= n_days

    # if no interval is greater, take the first and last rows
    if np.nansum(filt) == 0:
        # The last row gives the only end date
        idx_end_dates = np.array([len(intervals)])
        # The first row gives the only start date
        idx_start_dates = np.array([0])

    # If some filters exist take those
    else:
        # Get indexes of end_dates
        idx_end_dates = filt.nonzero()[0]
        # Sum 1 to get corresponding start dates
        idx_start_dates = idx_end_dates+1
        # Append last entry as last end_date
        idx_end_dates = np.append(idx_end_dates, len(intervals))
        # Append first entry as first start_date
        idx_start_dates = np.append(0, idx_start_dates)

    # if verbose:
    #     print(f'{idx_start_dates=}')
    #     print(f'{idx_end_dates=}')

    # Make sure all end values are after start values
    new_start = start_dates.take(idx_start_dates)
    new_end = end_dates.take(idx_end_dates)
    if pc.any(pc.less(new_end, new_start)).as_py():  # pylint: disable=E1101
        if verbose:
            print(f"{start_dates=}", f"{end_dates=}")
            print(f"{new_start=}", f"{new_end=}")
        raise AssertionError(
            'Some end dates happen before start dates. Try sorting the original data.')

    return (new_start, new_end, inner_table)
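
To make the index bookkeeping in `group_observation_dates()` easier to follow, here is a minimal NumPy-only sketch of the same idea on three hand-written records. It is not the project's function: the arrays `starts` and `ends` and the toy dates are made up for illustration, and plain `datetime64` arrays stand in for the pyarrow arrays.

```python
import numpy as np

n_days = 365
# Toy records for one person, already sorted by start date (hypothetical data)
starts = np.array(["2020-01-01", "2020-03-01", "2022-01-01"], dtype="datetime64[D]")
ends = np.array(["2020-02-01", "2020-04-01", "2022-02-01"], dtype="datetime64[D]")

# Days between each end date and the next start date
intervals = (starts[1:] - ends[:-1]).astype(int)

# True where the gap is large enough to keep the two records apart
filt = intervals >= n_days

# A period ends right before each large gap, and at the last row
idx_end = np.append(filt.nonzero()[0], len(intervals))
# A period starts at the first row, and right after each large gap
idx_start = np.append(0, filt.nonzero()[0] + 1)

new_starts = starts[idx_start]
new_ends = ends[idx_end]
print(intervals)
print(new_starts, new_ends)
```

In this toy case the first two records are 29 days apart and merge into one period, while the third record is 640 days after the second and opens a new period.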

## 3.1. Sanity check
### DATA CREATION

First we check that both functions give the same results.

In [None]:
n_days = 365

# Load the data
df_raw = create_sample_df(n=100, n_dates=10)
df_raw.columns = ['person_id', 'start_date', 'end_date', 'type_concept']

df_raw = df_raw.sort_values(
    ['person_id', 'start_date', 'end_date', 'type_concept'],
    ascending=[True, True, False, True])

Let's check how quick `remove_overlap()` is:

In [None]:
%timeit -n 10 -r 10 remove_overlap(df_raw,0,False)

Quite quick!

In [None]:
# Remove contained dates
df_raw = remove_overlap(df_raw, 0, False)

In [None]:
df_raw[df_raw['person_id'] == 10271836]

In [None]:
df_raw[df_raw['person_id'] == 23315092]

### PYARROW

In [None]:
# == pyarrow method ==
df_rare = df_raw.copy()
df_rare.columns = ['person_id', 'start_date',
                   'end_date', 'period_type_concept_id']
table_rare = pa.Table.from_pandas(df_rare, preserve_index=False)
table_done = group_dates_original_pyarrow(table_rare, n_days)
df_done_pyarrow = table_done.to_pandas()
df_done_pyarrow = df_done_pyarrow.sort_values(
    ['person_id', 'start_date', 'end_date', 'period_type_concept_id'],
    ascending=[True, True, False, True])

In [None]:
df_done_pyarrow[df_done_pyarrow['person_id'] == 10271836]

In [None]:
df_done_pyarrow[df_done_pyarrow['person_id'] == 23315092]

Pyarrow handles the first person (10271836) correctly: all the dates are less than 365 days apart, so they are merged into a single period. Each of the following people has a single date range, up to 23315092, who has two. That one is also handled correctly.


### SHIFT

In [None]:
# == Shift method ==
df_rare = df_raw.copy()
df_done_shift = group_dates(df_rare, n_days)

In [None]:
df_done_shift[df_done_shift['person_id'] == 10271836]

In [None]:
df_done_shift[df_done_shift['person_id'] == 23315092]

The shift method now handles correctly both the first person (10271836) and the one with two periods (23315092). The type_concept differs between the pyarrow method and the shift method, but for now I trust the shift one more.
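
Beyond inspecting a couple of patients by eye, the period boundaries can be compared programmatically. The sketch below uses two small hand-made frames in place of `df_done_pyarrow` and `df_done_shift`. The helper `same_periods()` is hypothetical, and it deliberately ignores the type concept column, since that one is already known to differ.

```python
import pandas as pd


def same_periods(df_a: pd.DataFrame, df_b: pd.DataFrame) -> bool:
    """Return True if both frames hold the same (person, start, end) rows."""
    cols = ["person_id", "start_date", "end_date"]
    a = df_a[cols].sort_values(cols).reset_index(drop=True)
    b = df_b[cols].sort_values(cols).reset_index(drop=True)
    return a.equals(b)


# Toy stand-ins for the two results (hypothetical data)
df_a = pd.DataFrame({
    "person_id": [1, 2, 2],
    "start_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2022-01-01"]),
    "end_date": pd.to_datetime(["2020-04-01", "2020-02-01", "2022-03-01"]),
    "period_type_concept_id": [1, 2, 3],
})
# Same periods, but in reverse row order and with another type concept column
df_b = df_a.iloc[::-1].rename(columns={"period_type_concept_id": "type_concept"})
df_b = df_b.assign(type_concept=[9, 9, 9])

# A frame where one end date has been changed
df_c = df_a.copy()
df_c.loc[0, "end_date"] = pd.Timestamp("2020-05-01")

print(same_periods(df_a, df_b))
print(same_periods(df_a, df_c))
```

Sorting and resetting the index before comparing keeps the check independent of row order, which differs between the two methods.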

## 3.2 Time measurement
Now we measure how long each one takes:

In [None]:
n_days = 365

# Load the data
df_raw = create_sample_df(n=100)
df_raw.columns = ['person_id', 'start_date', 'end_date', 'type_concept']

df_raw = df_raw.sort_values(
    ['person_id', 'start_date', 'end_date', 'type_concept'],
    ascending=[True, True, False, True])
df_rare = df_raw.reset_index(drop=True).copy()

print('\nshift:')
%timeit -n 1 -r 1 group_dates(df_rare,n_days)

df_rare = df_raw.reset_index(drop=True).copy()
df_rare.columns = ['person_id', 'start_date', 'end_date', 'period_type_concept_id']
table_rare = pa.Table.from_pandas(df_rare,preserve_index=False)
print('\npyarrow:')
%timeit -n 1 -r 1 group_dates_original_pyarrow(table_rare,n_days)

For n = 1000

    shift:
    375 ms ± 884 μs per loop (mean ± std. dev. of 10 runs, 10 loops each)
    pyarrow:
    454 ms ± 664 μs per loop (mean ± std. dev. of 10 runs, 10 loops each)

For n = 10000

    shift:
    3.69 s ± 4.92 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
    pyarrow:
    23.6 s ± 565 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

For n = 30000

    shift:
    10.2 s ± 18.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
    pyarrow:
    3min 10s ± 5.35 s per loop (mean ± std. dev. of 2 runs, 2 loops each)
 

It is now fairly clear that the shift method is much faster when there are many people. The original method filters the whole table once per person and then stacks the resulting tables on top of each other, which takes a lot of time. And that is even though the shift method sorts inside the function itself, whereas for the pyarrow one the sorting is left outside.

Perhaps a pyarrow implementation that followed the shift pattern could do even better. Even so, 10 s for 30000 patients with 50 dates per patient already seems like a good result.
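
For reference, the shift pattern can be sketched in pandas as follows. This is a simplified illustration and not the `group_dates()` implementation from `bps_to_omop.general`: it assumes the frame is already sorted by person and start date, that overlaps have already been removed, and it ignores the type concept column. The function name `group_dates_sketch` and the toy data are made up.

```python
import pandas as pd


def group_dates_sketch(df: pd.DataFrame, n_days: int) -> pd.DataFrame:
    """Merge consecutive rows of a person separated by fewer than n_days."""
    df = df.reset_index(drop=True)
    same_person_next = df["person_id"].eq(df["person_id"].shift(-1))
    # Gap between this row's end date and the next row's start date
    gap_next = df["start_date"].shift(-1) - df["end_date"]
    # A row closes a period if the next row is another person or too far away
    is_end = ~same_person_next | (gap_next >= pd.Timedelta(n_days, unit="D"))
    # A row opens a period if it is the first row or follows a closing row
    is_start = is_end.shift(1, fill_value=True)
    return pd.DataFrame({
        "person_id": df.loc[is_start, "person_id"].to_numpy(),
        "start_date": df.loc[is_start, "start_date"].to_numpy(),
        "end_date": df.loc[is_end, "end_date"].to_numpy(),
    })


# Toy data (hypothetical): person 1 merges, person 2 keeps two periods
df_toy = pd.DataFrame({
    "person_id": [1, 1, 2, 2],
    "start_date": pd.to_datetime(
        ["2020-01-01", "2020-03-01", "2020-01-01", "2022-01-01"]),
    "end_date": pd.to_datetime(
        ["2020-02-01", "2020-04-01", "2020-02-01", "2022-02-01"]),
})
df_out = group_dates_sketch(df_toy, n_days=365)
print(df_out)
```

Every step is a column-wise operation on the whole frame, so there is no per-person loop and nothing to concatenate at the end.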

## Timestamp vs datetime test

We have found that `remove_overlap` runs much faster when the dates are stored as timestamps (`pa.timestamp('us')`) than as dates (`pa.date64()`).

The problem is that the final data in the `sarscov` project do not match depending on which format is used.

We run the code here to check.

In [None]:
df_raw = create_sample_df(n=1000)
df_raw.columns = ['person_id', 'start_date', 'end_date', 'type_concept']

In [None]:
n_days = 365
df_rare = df_raw.loc[:,['person_id','start_date','end_date','type_concept']]
table_raw = pa.Table.from_pandas(df_rare, preserve_index=False)
table_raw = table_raw.cast(
    pa.schema([
        ('person_id', pa.int64()),
        ('start_date', pa.timestamp('us')),
        ('end_date', pa.timestamp('us')),
        ('type_concept', pa.int64()),
    ])
)
df_rare = table_raw.to_pandas()
remove_overlap(df_rare,2).info()

%timeit -n 10 -r 10 group_dates_original_pyarrow(table_rare,n_days)

In [None]:
n_days = 365
df_rare = df_raw.loc[:,['person_id','start_date','end_date','type_concept']]
table_raw = pa.Table.from_pandas(df_rare, preserve_index=False)
table_raw = table_raw.cast(
    pa.schema([
        ('person_id', pa.int64()),
        ('start_date', pa.date64()),
        ('end_date', pa.date64()),
        ('type_concept', pa.int64()),
    ])
)
df_rare = table_raw.to_pandas()
remove_overlap(df_rare,2).info()

%timeit -n 10 -r 10 group_dates_original_pyarrow(table_rare,n_days)

Both formats work fine and leave the same number of rows.

The problem may come from the fact that some dates in the project data include a time of day, for example all the pharmacy dispensing records. Converting those records to date64 discards the time information, so the resulting order may differ.
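
The effect of dropping the time of day can be reproduced on a tiny example. The sketch below is plain pandas and uses `dt.normalize()` as a stand-in for the conversion to `pa.date64()`, which is an assumption about how the project truncates the values. The two records are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "record": ["X", "Y"],
    "start_date": pd.to_datetime(["2020-01-01 09:00", "2020-01-01 15:00"]),
    "end_date": pd.to_datetime(["2020-01-05 10:00", "2020-01-07 00:00"]),
})


def sorted_records(frame: pd.DataFrame) -> list:
    # Same ordering used throughout: start ascending, end descending
    out = frame.sort_values(["start_date", "end_date"], ascending=[True, False])
    return out["record"].tolist()


# With the time of day, X starts first
order_with_time = sorted_records(df)

# Dropping the time makes both starts tie, so the longer record goes first
df_days = df.assign(
    start_date=df["start_date"].dt.normalize(),
    end_date=df["end_date"].dt.normalize(),
)
order_days_only = sorted_records(df_days)

print(order_with_time, order_days_only)
```

Since the overlap removal depends on which row comes first, a change in ordering like this one is a plausible source of the mismatch, although it has not been confirmed on the project data.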

# 4. New method to compute type_concept

We are going to compare the current method in group_dates with using groupby.

In [21]:
import sys
import pandas as pd
import numpy as np
import scipy.stats as st

sys.path.append("../../")
from bps_to_omop.general import group_dates, find_person_index

def create_sample_data():
    nombre_columnas = ["person_id", "start_date", "end_date", "type_concept"]
    n_days = 365
    df_in = [
        # A single date range
        (1, "2020-01-01", "2020-02-01", 1),
        # Two date ranges that merge, with equal type_concept
        (2, "2020-01-01", "2020-02-01", 1),
        (2, "2020-03-01", "2020-04-01", 1),
        # Two date ranges that merge, with different type_concept
        (3, "2020-01-01", "2020-02-01", 1),
        (3, "2020-03-01", "2020-04-01", 2),
        # Three date ranges that merge
        (4, "2020-01-01", "2020-02-01", 1),
        (4, "2020-03-01", "2020-04-01", 1),
        (4, "2020-05-01", "2020-12-01", 2),
        # One person with two separate groups
        (5, "2020-01-01", "2020-02-01", 1),
        (5, "2020-03-01", "2020-04-01", 1),
        (5, "2020-05-01", "2020-12-01", 2),
        (5, "2022-01-01", "2022-02-01", 3),
        (5, "2022-03-01", "2022-04-01", 3),
        (5, "2022-05-01", "2022-12-01", 2),
    ]
    df_in = pd.DataFrame.from_records(df_in, columns=nombre_columnas).assign(
        start_date=lambda x: pd.to_datetime(x["start_date"]),
        end_date=lambda x: pd.to_datetime(x["end_date"]),
    )
    return df_in

def group_dates_v2(df: pd.DataFrame, n_days: int, verbose: int = 0) -> pd.DataFrame:
    # == Preparation ==============================================
    if verbose > 0:
        print("Grouping dates:")
        print("- Sorting and preparing data...")
    # Sort so we know for sure the order is right
    df_rare = df.copy().sort_values(
        [df.columns[0], df.columns[1], df.columns[2]], ascending=[True, True, False]
    )
    # It is VERY important to reset the index to make sure we can
    # retrieve the rows reliably after sorting them.
    df_rare = df_rare.reset_index(drop=True)

    # == Index look-up ============================================
    if verbose > 0:
        print("- Looking up indexes...")
    (idx_person_first, idx_person_last, idx_person_only) = find_person_index(df_rare)
    # Create index if the break is too big and needs to be kept
    next_interval = df_rare.iloc[:, 1].shift(-1) - df_rare.iloc[:, 2]
    idx_interval = next_interval >= pd.Timedelta(n_days, unit="D")

    # == Retrieve relevant rows ===================================
    if verbose > 0:
        print("- Retrieving rows...")
    # -- start_date and person_id ---------------------------------
    # To retrieve the start_date we need the indexes of:
    # - single day periods (idx_person_only == True)
    # - first dates (idx_person_first == True)
    # - Rows just after period breaks, (idx_interval.index + 1)

    # Get the person condition indexes
    idx_start = df_rare.index[
        idx_person_only | idx_person_first | idx_interval.shift(1)
    ]

    # -- end_date -------------------------------------------------
    # Get the indexes
    idx_end = df_rare.index[idx_person_only | idx_person_last | idx_interval]

    # == Compute type_concept =====================================
    if verbose > 0:
        print("- Computing type_concept...")
    # Iterate over idx_start and idx_end to get the periods
    mode_values = []
    for i in np.arange(len(idx_start)):
        df_tmp = df_rare.loc[idx_start[i] : idx_end[i]]
        mode_values.append(st.mode(df_tmp.iloc[:, 3].values)[0])

        # max(1, ...) avoids a modulo by zero when there are fewer than 4 periods
        if (verbose > 1) and (i % max(1, len(idx_start) // 4) == 0):
            print(f"  - ({(i+1)/len(idx_start)*100:.1f} %) {(i+1)}/{len(idx_start)}")
    if verbose > 1:
        print(f"  - (100.0 %) {len(idx_start)}/{len(idx_start)}")

    # == Build final dataframe ====================================
    if verbose > 0:
        print("- Closing up...")
    # Create a copy (.loc) with the first two columns
    df_done = df_rare.loc[idx_start, [df.columns[0], df.columns[1]]]
    # Append values found to final dataframe
    df_done[df.columns[2]] = df_rare.loc[idx_end, [df.columns[2]]].values
    # Add to dataframe
    df_done[df.columns[3]] = mode_values

    if verbose > 0:
        print("- Done!")
    return df_done

In [22]:
verbose = 1
n_days = 365

df = create_sample_data()
df

| | person_id | start_date | end_date | type_concept |
|---|---|---|---|---|
| 0 | 1 | 2020-01-01 | 2020-02-01 | 1 |
| 1 | 2 | 2020-01-01 | 2020-02-01 | 1 |
| 2 | 2 | 2020-03-01 | 2020-04-01 | 1 |
| 3 | 3 | 2020-01-01 | 2020-02-01 | 1 |
| 4 | 3 | 2020-03-01 | 2020-04-01 | 2 |
| 5 | 4 | 2020-01-01 | 2020-02-01 | 1 |
| 6 | 4 | 2020-03-01 | 2020-04-01 | 1 |
| 7 | 4 | 2020-05-01 | 2020-12-01 | 2 |
| 8 | 5 | 2020-01-01 | 2020-02-01 | 1 |
| 9 | 5 | 2020-03-01 | 2020-04-01 | 1 |


**(!!)**

The idea here is that we look up the indexes and manually build a dataframe with the start and end dates.

We can NOT use the groupby trick for the type concept directly, because we do not know the final intervals in advance.

That is, we can group by person_id, but we would also need to group by the dates in order to get, for each person and each observation_period, the most frequent type_concept.
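
One possible way around this, sketched below, is to label every row with a period identifier first and then group by that label. A new period starts whenever the person changes or the gap to the previous row reaches `n_days`, so a cumulative sum over those break flags gives the label. This is only an illustration on made-up data, not the method used in `group_dates()`. The column name `period_id` is hypothetical, and ties in the mode are resolved by taking the smallest value.

```python
import pandas as pd

n_days = 365
df_toy = pd.DataFrame({
    "person_id": [3, 3, 5, 5, 5, 5],
    "start_date": pd.to_datetime([
        "2020-01-01", "2020-03-01",
        "2020-01-01", "2020-03-01", "2022-01-01", "2022-03-01"]),
    "end_date": pd.to_datetime([
        "2020-02-01", "2020-04-01",
        "2020-02-01", "2020-04-01", "2022-02-01", "2022-04-01"]),
    "type_concept": [1, 2, 1, 1, 3, 3],
}).sort_values(["person_id", "start_date"]).reset_index(drop=True)

# End date of the previous row of the same person (NaT on each first row)
prev_end = df_toy.groupby("person_id")["end_date"].shift(1)
gap = df_toy["start_date"] - prev_end
# A row opens a period if it is the person's first row or the gap is large
new_period = prev_end.isna() | (gap >= pd.Timedelta(n_days, unit="D"))
df_toy["period_id"] = new_period.cumsum()

# With the label in place, a plain groupby yields one row per period
df_periods = (
    df_toy.groupby("period_id")
    .agg(
        person_id=("person_id", "first"),
        start_date=("start_date", "min"),
        end_date=("end_date", "max"),
        type_concept=("type_concept", lambda s: s.mode().iloc[0]),
    )
    .reset_index(drop=True)
)
print(df_periods)
```

Taking `max` of the end dates matches "the end date of the last row" only when contained periods have been removed beforehand, which is the assumption here.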

In [24]:

# == Preparation ==============================================
if verbose > 0:
    print("Grouping dates:")
    print("- Sorting and preparing data...")
# Sort so we know for sure the order is right
df_rare = df.copy().sort_values(
    [df.columns[0], df.columns[1], df.columns[2]], ascending=[True, True, False]
)
# It is VERY important to reset the index to make sure we can
# retrieve the rows reliably after sorting them.
df_rare = df_rare.reset_index(drop=True)

# == Index look-up ============================================
if verbose > 0:
    print("- Looking up indexes...")
(idx_person_first, idx_person_last, idx_person_only) = find_person_index(df_rare)
# Create index if the break is too big and needs to be kept
next_interval = df_rare.iloc[:, 1].shift(-1) - df_rare.iloc[:, 2]
idx_interval = next_interval >= pd.Timedelta(n_days, unit="D")

# == Retrieve relevant rows ===================================
if verbose > 0:
    print("- Retrieving rows...")
# -- start_date and person_id ---------------------------------
# To retrieve the start_date we need the indexes of:
# - single day periods (idx_person_only == True)
# - first dates (idx_person_first == True)
# - Rows just after period breaks, (idx_interval.index + 1)

# Get the person condition indexes
idx_start = df_rare.index[
    idx_person_only | idx_person_first | idx_interval.shift(1)
]

# -- end_date -------------------------------------------------
# Get the indexes
idx_end = df_rare.index[idx_person_only | idx_person_last | idx_interval]

# == Compute type_concept =====================================
if verbose > 0:
    print("- Computing type_concept...")
# Iterate over idx_start and idx_end to get the periods
mode_values = []
for i in np.arange(len(idx_start)):
    df_tmp = df_rare.loc[idx_start[i] : idx_end[i]]
    mode_values.append(st.mode(df_tmp.iloc[:, 3].values)[0])

    # max(1, ...) avoids a modulo by zero when there are fewer than 4 periods
    if (verbose > 1) and (i % max(1, len(idx_start) // 4) == 0):
        print(f"  - ({(i+1)/len(idx_start)*100:.1f} %) {(i+1)}/{len(idx_start)}")
if verbose > 1:
    print(f"  - (100.0 %) {len(idx_start)}/{len(idx_start)}")

# == Build final dataframe ====================================
if verbose > 0:
    print("- Closing up...")
# Create a copy (.loc) with the first two columns
df_done = df_rare.loc[idx_start, [df.columns[0], df.columns[1]]]
# Append values found to final dataframe
df_done[df.columns[2]] = df_rare.loc[idx_end, [df.columns[2]]].values
# Add to dataframe
df_done[df.columns[3]] = mode_values

if verbose > 0:
    print("- Done!")
df_done


Grouping dates:
- Sorting and preparing data...
- Looking up indexes...
- Retrieving rows...
- Computing type_concept...
- Closing up...
- Done!


| | person_id | start_date | end_date | type_concept |
|---|---|---|---|---|
| 0 | 1 | 2020-01-01 | 2020-02-01 | 1 |
| 1 | 2 | 2020-01-01 | 2020-04-01 | 1 |
| 3 | 3 | 2020-01-01 | 2020-04-01 | 1 |
| 5 | 4 | 2020-01-01 | 2020-12-01 | 1 |
| 8 | 5 | 2020-01-01 | 2020-12-01 | 1 |
| 11 | 5 | 2022-01-01 | 2022-12-01 | 3 |


# A. Unfinished testing code

In [1]:
import pandas as pd
import numpy as np
import time
from typing import List, Tuple

def assign_groups_masking(indices: np.ndarray, starts: np.ndarray, ends: np.ndarray) -> np.ndarray:
    """Group assignment using boolean masking"""
    group_ids = np.zeros(len(indices), dtype=int)
    for group_num, (start, end) in enumerate(zip(starts, ends), 1):
        mask = (indices >= start) & (indices <= end)
        group_ids[mask] = group_num
    return group_ids

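def assign_groups_searchsorted(indices: np.ndarray, starts: np.ndarray, ends: np.ndarray) -> np.ndarray:
    """Group assignment using searchsorted (groups must be sorted and non-overlapping)"""
    boundaries = np.sort(np.concatenate([starts, ends + 1]))
    pos = np.searchsorted(boundaries, indices, side='right')
    # An odd position means the index falls inside a group; numbering
    # starts at 1 to match assign_groups_masking (0 = no group)
    return np.where(pos % 2 == 1, (pos + 1) // 2, 0)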

def generate_test_case(n_rows: int, n_groups: int) -> Tuple[np.ndarray, np.ndarray]:
    """Generate test data with given size and number of groups"""
    # Create roughly equal-sized groups
    group_size = n_rows // n_groups
    starts = np.arange(0, n_rows, group_size)
    ends = starts + group_size - 1
    ends[-1] = n_rows - 1  # Adjust last group
    return starts, ends

def run_benchmark():
    # Test configurations
    row_sizes = [10_000, 100_000, 1_000_000]
    group_configs = [
        ('Few Large Groups', lambda x: max(5, x // 1_000_000)),
        ('Medium Groups', lambda x: max(50, x // 100_000)),
        ('Many Small Groups', lambda x: max(500, x // 10_000))
    ]
    
    results = []
    
    for n_rows in row_sizes:
        indices = np.arange(n_rows)
        
        for group_desc, group_func in group_configs:
            n_groups = group_func(n_rows)
            starts, ends = generate_test_case(n_rows, n_groups)
            
            # Warm-up run
            _ = assign_groups_masking(indices, starts, ends)
            _ = assign_groups_searchsorted(indices, starts, ends)
            
            # Timing masking approach
            start_time = time.perf_counter()
            for _ in range(5):  # Multiple runs for more stable results
                _ = assign_groups_masking(indices, starts, ends)
            masking_time = (time.perf_counter() - start_time) / 5
            
            # Timing searchsorted approach
            start_time = time.perf_counter()
            for _ in range(5):
                _ = assign_groups_searchsorted(indices, starts, ends)
            searchsorted_time = (time.perf_counter() - start_time) / 5
            
            results.append({
                'Rows': n_rows,
                'Groups': n_groups,
                'Configuration': group_desc,
                'Masking Time': masking_time,
                'Searchsorted Time': searchsorted_time
            })
    
    return pd.DataFrame(results)

# Run benchmark
results_df = run_benchmark()

# Print detailed results
print("\nDetailed Benchmark Results:")
print("=" * 80)
for _, row in results_df.iterrows():
    print(f"\nConfiguration: {row['Configuration']}")
    print(f"Data Size: {row['Rows']:,} rows, {row['Groups']:,} groups")
    print(f"Masking Time: {row['Masking Time']*1000:.2f}ms")
    print(f"Searchsorted Time: {row['Searchsorted Time']*1000:.2f}ms")
    speedup = row['Masking Time'] / row['Searchsorted Time']
    faster_method = "searchsorted" if speedup > 1 else "masking"
    # Always report the ratio from the winner's point of view
    ratio = speedup if speedup > 1 else 1 / speedup
    print(f"Winner: {faster_method} ({ratio:,.2f}x faster)")

# Calculate and print summary statistics
print("\nSummary Statistics:")
print("=" * 80)
for config in results_df['Configuration'].unique():
    config_results = results_df[results_df['Configuration'] == config]
    print(f"\n{config}:")
    avg_speedup = (config_results['Masking Time'] / config_results['Searchsorted Time']).mean()
    print(f"Average speedup using searchsorted: {avg_speedup:.2f}x")

# Validation of correctness
print("\nValidating correctness of implementations...")
test_indices = np.arange(1000)
test_starts = np.array([0, 200, 400, 600, 800])
test_ends = np.array([199, 399, 599, 799, 999])

masking_results = assign_groups_masking(test_indices, test_starts, test_ends)
searchsorted_results = assign_groups_searchsorted(test_indices, test_starts, test_ends)

if np.array_equal(masking_results, searchsorted_results):
    print("✓ Both implementations produce identical results")
else:
    print("⚠ WARNING: Implementations produce different results!")


Detailed Benchmark Results:

Configuration: Few Large Groups
Data Size: 10,000 rows, 5 groups
Masking Time: 0.11ms
Searchsorted Time: 0.09ms
Winner: searchsorted (1.21x faster)

Configuration: Medium Groups
Data Size: 10,000 rows, 50 groups
Masking Time: 0.85ms
Searchsorted Time: 0.11ms
Winner: searchsorted (7.82x faster)

Configuration: Many Small Groups
Data Size: 10,000 rows, 500 groups
Masking Time: 8.12ms
Searchsorted Time: 0.16ms
Winner: searchsorted (51.13x faster)

Configuration: Few Large Groups
Data Size: 100,000 rows, 5 groups
Masking Time: 0.78ms
Searchsorted Time: 0.51ms
Winner: searchsorted (1.54x faster)

Configuration: Medium Groups
Data Size: 100,000 rows, 50 groups
Masking Time: 7.14ms
Searchsorted Time: 1.07ms
Winner: searchsorted (6.70x faster)

Configuration: Many Small Groups
Data Size: 100,000 rows, 500 groups
Masking Time: 69.66ms
Searchsorted Time: 1.21ms
Winner: searchsorted (57.56x faster)

Configuration: Few Large Groups
Data Size: 1,000,000 rows, 5 groups
