Labels: dagster (Issues related to our use of the Dagster orchestrator), ferc1 (Anything having to do with FERC Form 1), testing (Writing tests, creating test data, automating testing, etc.), xbrl (Related to the FERC XBRL transition)
Description
After merging #2948 into dev, some of us started getting sporadic failures from the Hypothesis-based test_filter_for_freshest_data test:
pytest test/unit/io_managers_test.py::test_filter_for_freshest_data
On some machines (e.g. Zane's laptop) it fails every time. On others (like the GitHub CI) it fails only rarely (see error output below). It doesn't seem to be a material problem, and @jdangerx is out this week, so for the moment the test has been marked XFAIL until he can take a look at it on his return.
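For context, a minimal sketch of the deduplication behavior the test exercises: for each XBRL context (entity, date, utility type), keep only the row with the latest publication_time. This is an illustrative re-implementation, not the actual body of FercXBRLSQLiteIOManager.filter_for_freshest_data; the function name and column names below follow the test code.

```python
import pandas as pd


def filter_for_freshest_data_sketch(
    df: pd.DataFrame, context_cols: list[str]
) -> pd.DataFrame:
    """Keep only the most recently published row for each XBRL context.

    Sketch only: sort by publication_time, then keep the last (freshest)
    row within each context.
    """
    return df.sort_values("publication_time").drop_duplicates(
        subset=context_cols, keep="last"
    )


df = pd.DataFrame(
    {
        "entity_id": ["1", "1"],
        "date": ["2020-01-01", "2020-01-01"],
        "utility_type": ["electric", "electric"],
        "publication_time": pd.to_datetime(["2021-01-01", "2021-06-01"]),
        "int_factoid": [1, 2],
    }
)
# Both rows share a context, so only the later filing should survive.
fresh = filter_for_freshest_data_sketch(
    df, ["entity_id", "date", "utility_type"]
)
```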
________________________ test_filter_for_freshest_data _________________________
@hypothesis.given(example_schema.strategy(size=3))
> def test_filter_for_freshest_data(df):
test/unit/io_managers_test.py:372:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
df = entity_id date utility_type publication_time int_factoid float_factoid str_factoid
0 ...0
2 1970-01-01 electric 1970-01-01 00:00:00.000000000 0 0.0
@hypothesis.given(example_schema.strategy(size=3))
def test_filter_for_freshest_data(df):
# XBRL context is the identifying metadata for reported values
xbrl_context_cols = ["entity_id", "date", "utility_type"]
filing_metadata_cols = ["publication_time", "filing_name"]
primary_key = xbrl_context_cols + filing_metadata_cols
deduped = FercXBRLSQLiteIOManager.filter_for_freshest_data(
df, primary_key=primary_key
)
example_schema.validate(deduped)
# every post-deduplication row exists in the original rows
assert (deduped.merge(df, how="left", indicator=True)._merge != "left_only").all()
# for every [entity_id, utility_type, date] - there is only one row
assert (~deduped.duplicated(subset=xbrl_context_cols)).all()
# for every *context* in the input there is a corresponding row in the output
original_contexts = df.groupby(xbrl_context_cols, as_index=False).last()
paired_by_context = original_contexts.merge(
deduped,
on=xbrl_context_cols,
how="outer",
suffixes=["_in", "_out"],
indicator=True,
).set_index(xbrl_context_cols)
hypothesis.note(f"Found these contexts in input data:\n{original_contexts}")
hypothesis.note(f"The freshest data:\n{deduped}")
hypothesis.note(f"Paired by context:\n{paired_by_context}")
> assert (paired_by_context._merge == "both").all()
E AssertionError: assert False
E + where False = <bound method Series.all of entity_id date utility_type\n 1970-01-01 electric False\n electric False\nName: _merge, dtype: bool>()
E + where <bound method Series.all of entity_id date utility_type\n 1970-01-01 electric False\n electric False\nName: _merge, dtype: bool> = entity_id date utility_type\n 1970-01-01 electric left_only\n electric right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] == 'both'.all
E + where entity_id date utility_type\n 1970-01-01 electric left_only\n electric right_only\nName: _merge, dtype: category\nCategories (3, object): ['left_only', 'right_only', 'both'] = publication_time_in int_factoid_in float_factoid_in str_factoid_in publication_time_out int_factoid_out float_factoid_out str_factoid_out _merge\nentity_id date utility_type \n 1970-01-01 electric 1970-01-01 0.0 0.0 NaT NaN NaN NaN left_only\n electric NaT NaN NaN NaN 1970-01-01 00:00:00.000000001 0.0 0.0 right_only._merge
E Falsifying example: test_filter_for_freshest_data(
E df=
E entity_id date utility_type publication_time int_factoid float_factoid str_factoid
E 0 1970-01-01 electric 1970-01-01 00:00:00.000000000 0 0.0
E 1 0 1970-01-01 electric 1970-01-01 00:00:00.000000001 0 0.0
E 2 1970-01-01 electric 1970-01-01 00:00:00.000000000 0 0.0
E ,
E )
E Found these contexts in input data:
E entity_id date utility_type publication_time int_factoid float_factoid str_factoid
E 0 1970-01-01 electric 1970-01-01 0 0.0
E The freshest data:
E entity_id date utility_type publication_time int_factoid float_factoid str_factoid
E 1 0 1970-01-01 electric 1970-01-01 00:00:00.000000001 0 0.0
E Paired by context:
E publication_time_in int_factoid_in float_factoid_in str_factoid_in publication_time_out int_factoid_out float_factoid_out str_factoid_out _merge
E entity_id date utility_type
E 1970-01-01 electric 1970-01-01 0.0 0.0 NaT NaN NaN NaN left_only
E electric NaT NaN NaN NaN 1970-01-01 00:00:00.000000001 0.0 0.0 right_only
E Explanation:
E These lines were always and only run by failing examples:
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/_pytest/assertion/util.py:134
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/numpy/core/_dtype.py:336
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:107
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:112
E /Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/putmask.py:138
E (and 66 more with settings.verbosity >= verbose)
test/unit/io_managers_test.py:398: AssertionError
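One plausible mechanism for the left_only/right_only mismatch (an assumption, not confirmed from the traceback): the test builds original_contexts with groupby, which by default drops rows whose group keys contain NaN (dropna=True), while the deduplicated frame can still contain those rows. If Hypothesis generates a null-ish context value, the outer merge then has rows on only one side:

```python
import numpy as np
import pandas as pd

# Two rows with distinct "contexts": one valid key, one NaN key.
df = pd.DataFrame({"entity_id": ["1", np.nan], "value": [10, 20]})

# groupby silently drops the NaN-keyed row (dropna=True is the default)...
contexts = df.groupby("entity_id", as_index=False).last()

# ...but deduplication keeps it, so the two frames disagree on which
# contexts exist.
deduped = df.drop_duplicates(subset=["entity_id"])

# An outer merge with indicator=True then reports right_only for the
# NaN-keyed row instead of "both" everywhere.
paired = contexts.merge(deduped, on="entity_id", how="outer", indicator=True)
```

Passing dropna=False to the groupby (or constraining the Hypothesis strategy to non-null context values) would be one way to make the two sides agree, if this is indeed the cause.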
Status: Done