
test_filter_for_freshest_data occasionally fails #2983

@zaneselvans

Description

After merging #2948 into dev, some of us started getting sporadic failures from the Hypothesis-based test_filter_for_freshest_data test:

 pytest test/unit/io_managers_test.py::test_filter_for_freshest_data

On some machines (e.g. Zane's laptop) it fails every time. On others (like the CI on GitHub) it fails only rarely (see the error output below). It doesn't seem to be a material problem, and @jdangerx is out this week, so for the moment the test has been marked XFAIL until he can take a look at it upon his return.
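For reference, the temporary XFAIL marking might look something like the sketch below. The exact reason string, marker placement, and test body are assumptions, not copied from the repo:

```python
import hypothesis
import pytest
from hypothesis import strategies as st


# Hypothetical sketch: temporarily mark a flaky property-based test as an
# expected failure so sporadic errors don't break CI while it awaits a fix.
@pytest.mark.xfail(reason="sporadic failures after #2948; see issue #2983")
@hypothesis.given(st.integers())
def test_example_property(x):
    # Placeholder property; the real test exercises filter_for_freshest_data.
    assert isinstance(x, int)
```

Because xfail is applied at collection time, the test still runs on every invocation; a sporadic failure is reported as XFAIL and an accidental pass as XPASS, so the flakiness stays visible without failing the suite.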

________________________ test_filter_for_freshest_data _________________________

    @hypothesis.given(example_schema.strategy(size=3))
>   def test_filter_for_freshest_data(df):

test/unit/io_managers_test.py:372:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

df =   entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
0           ...0
2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0

    @hypothesis.given(example_schema.strategy(size=3))
    def test_filter_for_freshest_data(df):
        # XBRL context is the identifying metadata for reported values
        xbrl_context_cols = ["entity_id", "date", "utility_type"]
        filing_metadata_cols = ["publication_time", "filing_name"]
        primary_key = xbrl_context_cols + filing_metadata_cols
        deduped = FercXBRLSQLiteIOManager.filter_for_freshest_data(
            df, primary_key=primary_key
        )
        example_schema.validate(deduped)

        # every post-deduplication row exists in the original rows
        assert (deduped.merge(df, how="left", indicator=True)._merge != "left_only").all()
        # for every [entity_id, utility_type, date] there is only one row
        assert (~deduped.duplicated(subset=xbrl_context_cols)).all()
        # for every *context* in the input there is a corresponding row in the output
        original_contexts = df.groupby(xbrl_context_cols, as_index=False).last()
        paired_by_context = original_contexts.merge(
            deduped,
            on=xbrl_context_cols,
            how="outer",
            suffixes=["_in", "_out"],
            indicator=True,
        ).set_index(xbrl_context_cols)
        hypothesis.note(f"Found these contexts in input data:\n{original_contexts}")
        hypothesis.note(f"The freshest data:\n{deduped}")
        hypothesis.note(f"Paired by context:\n{paired_by_context}")
>       assert (paired_by_context._merge == "both").all()
E       AssertionError: assert False
E        +  where False = <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool>()
E        +    where <bound method Series.all of entity_id  date        utility_type\n           1970-01-01  electric        False\n                       electric        False\nName: _merge, dtype: bool> = entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] == 'both'.all
E        +      where entity_id  date        utility_type\n           1970-01-01  electric         left_only\n                       electric        right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] =                                   publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge\nentity_id date       utility_type                                                                                                                                                                   \n          1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only\n                     electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only._merge
E       Falsifying example: test_filter_for_freshest_data(
E           df=
E                 entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E               0           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E               1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E               2           1970-01-01     electric 1970-01-01 00:00:00.000000000            0            0.0
E           ,
E       )
E       Found these contexts in input data:
E         entity_id       date utility_type publication_time  int_factoid  float_factoid str_factoid
E       0           1970-01-01     electric       1970-01-01            0            0.0
E       The freshest data:
E         entity_id       date utility_type              publication_time  int_factoid  float_factoid str_factoid
E       1        0 1970-01-01     electric 1970-01-01 00:00:00.000000001            0            0.0
E       Paired by context:
E                                         publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out      _merge
E       entity_id date       utility_type
E                 1970-01-01 electric              1970-01-01             0.0               0.0                                          NaT              NaN                NaN             NaN   left_only
E                            electric                     NaT             NaN               NaN            NaN 1970-01-01 00:00:00.000000001              0.0                0.0                  right_only
E       Explanation:
E           These lines were always and only run by failing examples:
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/_pytest/assertion/util.py:134
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/numpy/core/_dtype.py:336
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:107
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:112
E               /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:138
E               (and 66 more with settings.verbosity >= verbose)

test/unit/io_managers_test.py:398: AssertionError
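The assertion that fails is the pairing check at the end of the test: input contexts and deduplicated output are combined with an outer merge using indicator=True, and any row flagged left_only or right_only means a context survived on only one side. A minimal sketch with made-up data (not the falsifying example from the issue) shows how that flag behaves:

```python
import pandas as pd

# Minimal sketch of the pairing check that fails above: an outer merge with
# indicator=True flags every context that appears on only one side.
context_cols = ["entity_id", "date", "utility_type"]

original_contexts = pd.DataFrame(
    {
        "entity_id": ["1", "2"],
        "date": ["2020-01-01", "2020-01-01"],
        "utility_type": ["electric", "electric"],
    }
)
deduped = pd.DataFrame(
    {
        "entity_id": ["1", "3"],
        "date": ["2020-01-01", "2020-01-01"],
        "utility_type": ["electric", "electric"],
    }
)

paired = original_contexts.merge(deduped, on=context_cols, how="outer", indicator=True)
# entity_id "1" is in both frames, "2" only in the input, "3" only in the
# output, so an assertion like (paired._merge == "both").all() fails here.
print(sorted(paired["_merge"].tolist()))  # ['both', 'left_only', 'right_only']
```

In the falsifying example above, the two entity_id values differ only by an empty string versus "0", which is why the paired frame ends up with one left_only and one right_only row instead of a single matched context.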

Labels

dagster: Issues related to our use of the Dagster orchestrator
ferc1: Anything having to do with FERC Form 1
testing: Writing tests, creating test data, automating testing, etc.
xbrl: Related to the FERC XBRL transition


Status

Done
