Fix indexing of melted normalized input #563

g12-al · 2019-08-02T22:57:37Z

When using tsfresh with batched df input without column_value or column_kind specified, tsfresh will automatically convert it to the required melted format in _normalize_input_to_internal_representation.

However, the pd.melt() function works differently than the code assumes. If the column_sort contains many various types of indices that do not overlap, this results in wrong input data, resulting in incorrect input data for feature generation!

Reasoning: pd.melt() does not interleave per-ID series, it stacks them. Therefore, when assigning back the column_sort, the column_sort needs to be tiled, not repeated.

When using tsfresh with batched df input without column_value or column_kind specified, tsfresh will automatically convert it to the required melted format in _normalize_input_to_internal_representation. However, the pd.melt() function works differently than the code assumes. If the column_sort contains many various types of indices that do not overlap, this results in wrong input data, resulting in incorrect input data for feature generation! Reasoning: pd.melt() does not interleave per-ID series, it stacks them. Therefore, when assigning back the column_sort, the column_sort needs to be tiled, not repeated.

coveralls · 2019-08-02T23:02:03Z

Coverage remained the same at ?% when pulling f6b6587 on g12-al:master into f7b037e on blue-yonder:master.

g12-al · 2019-08-08T05:25:43Z

Just some words summarizing this in case it's not clear.

If you perform batching with tsfresh on inputs with mulitple columns containing values, the output for individual inputs will be different depending on the inputs used in the batch. This means that if you generated features using batches of 10, each with 2+ columns, the output for feature generation on just an individual input will not match the output when feature generation was applied on the input in the batch.

Features used in production models will likely have different distributions than when the models are being created and validated. I fear that this has far-reaching consequences for model accuracy.

MaxBenChrist · 2019-08-16T12:07:01Z

I understand the grave implications if this bug holds true. First, we need unit tests that verify that this is an issue, so that the current implementation is faulty. Then we should push the fix on top of the failing unit tests.

I started to work on those tests already but I am quite busy lately. If you could go and add an unit test that demonstrates the error first, and then apply the patch on top of it, that would be great.

g12-al · 2019-08-16T18:12:03Z

Thank you @MaxBenChrist. I have some time today, so I'll try to write a unit test that demonstrates the issue.

In my own dataset, I found the minimal set of IDs which exhibited reordering with ToT tsfresh. I wasn't able to find a repro by manually constructing inputs. Then, I discovered the repro would still occur even when downsampling extremely. I downsampled my 250k row dataset down to 30 rows, and then manually clipped trimmed it until it was as small as possible. This test is the minimum base-case. With ToT, using np.repeat, test_df0 will have the order of column "b" flipped after the call to the normalization code.

g12-al · 2019-08-16T23:44:10Z

I've managed to distill the issue down into a simple unit test and pushed it. Please take a look and verify.

MaxBenChrist · 2019-08-18T21:35:58Z

I verified that this bug can result in wrong ordered time series under the following conditions

column_kind is None
column_value is None
the time series have different sizes

This bug will not change the actual values of the time series (so, there can't occur any leaks from train to test set) but change the order of the time series.

Can you please replace your unit test by the following two:

    def test_wide_dataframe_order_preserved_with_sort_column(self):
        """ verifies that the order of the sort column from a wide time series container is preserved
        """

        test_df = pd.DataFrame({'id': ["a", "a", "b"],
                                'v1': [3, 2, 1],
                                'v2': [13, 12, 11],
                                'sort': [103, 102, 101]})

        melt_df, _, _, _ = \
            dataframe_functions._normalize_input_to_internal_representation(
                test_df, column_id="id", column_sort="sort", column_kind=None, column_value=None)

        assert (test_df.sort_values("sort").query("id=='a'")["v1"].values ==
                melt_df.query("id=='a'").query("_variables=='v1'")["_values"].values).all()
        assert (test_df.sort_values("sort").query("id=='a'")["v2"].values ==
                melt_df.query("id=='a'").query("_variables=='v2'")["_values"].values).all()


    def test_wide_dataframe_order_preserved(self):
        """ verifies that the order of the time series inside a wide time series container are preserved
        (columns_sort=None)
        """
        test_df = pd.DataFrame({'id': ["a", "a", "a", "b"],
                                'v1': [4, 3, 2, 1],
                                'v2': [14, 13, 12, 11]})

        melt_df, _, _, _ = \
            dataframe_functions._normalize_input_to_internal_representation(
                test_df, column_id="id", column_sort=None, column_kind=None, column_value=None)

        assert (test_df.query("id=='a'")["v1"].values ==
                melt_df.query("id=='a'").query("_variables=='v1'")["_values"].values).all()
        assert (test_df.query("id=='a'")["v2"].values ==
                melt_df.query("id=='a'").query("_variables=='v2'")["_values"].values).all()

Afterwards we can merge this. Thanks again for reporting this! 👍

This was requested by Max in Pull Request blue-yonder#563.

g12-al · 2019-08-19T18:21:42Z

Great, thank you for your help with this Max! I've replaced my unit test with the ones you provided.

Would you still like to do the order which you specified above? I can move the unit tests to a separate pull request.
"First, we need unit tests that verify that this is an issue, so that the current implementation is faulty. Then we should push the fix on top of the failing unit tests."

MaxBenChrist · 2019-08-19T20:54:36Z

"First, we need unit tests that verify that this is an issue, so that the current implementation is faulty. Then we should push the fix on top of the failing unit tests."

Yes, I verified the bug and that your fix solves the issue locally, so it's fine to merge this pr. Thanks for reporting the issue again!

Replace my unit test for order preservation with Max's versions

3f58503

This was requested by Max in Pull Request blue-yonder#563.

MaxBenChrist merged commit ee4d5bb into blue-yonder:master Aug 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix indexing of melted normalized input #563

Fix indexing of melted normalized input #563

g12-al commented Aug 2, 2019

coveralls commented Aug 2, 2019

g12-al commented Aug 8, 2019

MaxBenChrist commented Aug 16, 2019 •

edited

Loading

g12-al commented Aug 16, 2019

g12-al commented Aug 16, 2019

MaxBenChrist commented Aug 18, 2019

g12-al commented Aug 19, 2019

MaxBenChrist commented Aug 19, 2019

Fix indexing of melted normalized input #563

Fix indexing of melted normalized input #563

Conversation

g12-al commented Aug 2, 2019

coveralls commented Aug 2, 2019

g12-al commented Aug 8, 2019

MaxBenChrist commented Aug 16, 2019 • edited Loading

g12-al commented Aug 16, 2019

g12-al commented Aug 16, 2019

MaxBenChrist commented Aug 18, 2019

g12-al commented Aug 19, 2019

MaxBenChrist commented Aug 19, 2019

MaxBenChrist commented Aug 16, 2019 •

edited

Loading