
test_filter_for_freshest_data occasionally fails #2983

@zaneselvans

Description

After merging #2948 into dev, some of us started getting sporadic failures from the Hypothesis-based test_filter_for_freshest_data test:

 pytest test/unit/io_managers_test.py::test_filter_for_freshest_data

On some machines (e.g. Zane's laptop) it fails every time. On others (like the CI on GitHub) it fails only rarely (see the error output below). It doesn't seem to be a material problem, and @jdangerx is out this week, so for the moment the test has been marked XFAIL until he can take a look at it upon his return.
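For reference, the temporary XFAIL marking might look something like the sketch below. The exact reason string, marker placement, and test body are assumptions, not copied from the repo:

```python
import hypothesis
import pytest
from hypothesis import strategies as st


# Hypothetical sketch: temporarily mark a flaky property-based test as an
# expected failure so sporadic errors don't break CI while it awaits a fix.
@pytest.mark.xfail(reason="sporadic failures after #2948; see issue #2983")
@hypothesis.given(st.integers())
def test_example_property(x):
    # Placeholder property; the real test exercises filter_for_freshest_data.
    assert isinstance(x, int)
```

Because xfail is applied at collection time, the test still runs on every invocation; a sporadic failure is reported as XFAIL and an accidental pass as XPASS, so the flakiness stays visible without failing the suite.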

________________________ test_filter_for_freshest_data _________________________

    @hypothesis.given(example_schema.strategy(size=3))
>   def test_filter_for_freshest_data(df):

test/unit/io_managers_test.py:372:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

df =   entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
0           ...0
2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0

    @hypothesis.given(example_schema.strategy(size=3))
    def test_filter_for_freshest_data(df):
        # XBRL context is the identifying metadata for reported values
        xbrl_context_cols = ["entity_id", "date", "utility_type"]
        filing_metadata_cols = ["publication_time", "filing_name"]
        primary_key = xbrl_context_cols + filing_metadata_cols
        deduped = FercXBRLSQLiteIOManager.filter_for_freshest_data(
            df, primary_key=primary_key
        )
        example_schema.validate(deduped)

        # every post-deduplication row exists in the original rows
        assert (deduped.merge(df, how="left", indicator=True)._merge != "left_only").all()
        # for every [entity_id, utility_type, date] there is only one row
        assert (~deduped.duplicated(subset=xbrl_context_cols)).all()
        # for every *context* in the input there is a corresponding row in the output
        original_contexts = df.groupby(xbrl_context_cols, as_index=False).last()
        paired_by_context = original_contexts.merge(
            deduped,
            on=xbrl_context_cols,
            how="outer",
            suffixes=["_in", "_out"],
            indicator=True,
        ).set_index(xbrl_context_cols)
        hypothesis.note(f"Found these contexts in input data:\n{original_contexts}")
        hypothesis.note(f"The freshest data:\n{deduped}")
        hypothesis.note(f"Paired by context:\n{paired_by_context}")
>       assert (paired_by_context._merge == "both").all()
E       AssertionError: assert False
E        +  where False = <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool>()
E        +    where <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool> = entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] == 'both'.all
E        +      where entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] =                                   publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge\nentity_id date       utility_type                                                                                                                                                                   \n          1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only\n                     electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only._merge
E       Falsifying example: test_filter_for_freshest_data(
E           df=
E                 entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E               0           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E               1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E               2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E           ,
E       )
E       Found these contexts in input data:
E         entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
E       0           1970-01-01     electric       1970-01-01            0            0.0
E       The freshest data:
E         entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E       1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E       Paired by context:
E                                         publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge
E       entity_id date       utility_type
E                 1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only
E                            electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only
E       Explanation:
E           These lines were always and only run by failing examples:
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/_pytest/assertion/util.py:134
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/numpy/core/_dtype.py:336
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:107
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:112
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:138
E               (and 66 more with settings.verbosity >= verbose)

test/unit/io_managers_test.py:398: AssertionError
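The assertion that fails is the pairing check at the end of the test: input contexts and deduplicated output are combined with an outer merge using indicator=True, and any row flagged left_only or right_only means a context survived on only one side. A minimal sketch with made-up data (not the falsifying example from the issue) shows how that flag behaves:

```python
import pandas as pd

# Minimal sketch of the pairing check that fails above: an outer merge with
# indicator=True flags every context that appears on only one side.
context_cols = ["entity_id", "date", "utility_type"]

original_contexts = pd.DataFrame(
    {
        "entity_id": ["1", "2"],
        "date": ["2020-01-01", "2020-01-01"],
        "utility_type": ["electric", "electric"],
    }
)
deduped = pd.DataFrame(
    {
        "entity_id": ["1", "3"],
        "date": ["2020-01-01", "2020-01-01"],
        "utility_type": ["electric", "electric"],
    }
)

paired = original_contexts.merge(deduped, on=context_cols, how="outer", indicator=True)
# entity_id "1" is in both frames, "2" only in the input, "3" only in the
# output, so an assertion like (paired._merge == "both").all() fails here.
print(sorted(paired["_merge"].tolist()))  # ['both', 'left_only', 'right_only']
```

In the falsifying example above, the two entity_id values differ only by an empty string versus "0", which is why the paired frame ends up with one left_only and one right_only row instead of a single matched context.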

Labels

dagster: Issues related to our use of the Dagster orchestrator
ferc1: Anything having to do with FERC Form 1
testing: Writing tests, creating test data, automating testing, etc.
xbrl: Related to the FERC XBRL transition


Status

Done
