
Fix MRMRFeatureSelectionTransform: change less_is_better logic, add drop_zero mode #314

Merged
merged 17 commits into from
May 22, 2024

Conversation

yellowssnake
Collaborator

@yellowssnake yellowssnake commented May 14, 2024

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Fix MRMRFeatureSelectionTransform, change score formula and add drop_zero mode

Proposed Changes

Closing issues

Closes #308.

@yellowssnake yellowssnake requested a review from Ama16 May 15, 2024 07:36

github-actions bot commented May 15, 2024

🚀 Deployed on https://deploy-preview-314--etna-docs.netlify.app

@github-actions github-actions bot temporarily deployed to pull request May 15, 2024 07:38 Inactive
@yellowssnake yellowssnake changed the title Issue308/ChangeMRMRFeatureSelectionTransform ChangeMRMRFeatureSelectionTransform May 15, 2024
CHANGELOG.md (resolved)
@@ -59,6 +71,10 @@ def mrmr(
fast_redundancy:
* True: compute redundancy only inside the segments, time complexity :math:`O(top\_k * n\_segments * n\_features * history\_len)`
* False: compute redundancy for all the pairs of segments, time complexity :math:`O(top\_k * n\_segments^2 * n\_features * history\_len)`
drop_zero:
If True, drop features with zero relevance before MRMR. If top_k is greater number of features
Collaborator

It is not clear to me. If top_k is greater than the number of features with relevance > 0, select all non-zero-relevance features together with randomly selected zero-relevance features.

Collaborator Author

I select all non-zero-relevance features and then add some zero-relevance features in order to select exactly top_k features.
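For illustration, the selection rule the author describes could be sketched as a standalone function (hypothetical code, not the actual etna implementation; names are made up, and the zero-relevance padding here is taken in index order rather than randomly):

```python
import pandas as pd

def select_with_drop_zero(relevance: pd.Series, top_k: int) -> list:
    # Keep every feature with non-zero relevance; if there are fewer than
    # top_k of them, pad with zero-relevance features so that exactly
    # top_k features are returned.
    non_zero = [f for f in relevance.index if relevance[f] != 0]
    zero = [f for f in relevance.index if relevance[f] == 0]
    if len(non_zero) >= top_k:
        return non_zero[:top_k]
    return non_zero + zero[: top_k - len(non_zero)]

relevance = pd.Series({"a": 0.9, "b": 0.0, "c": 0.4, "d": 0.0})
print(select_with_drop_zero(relevance, top_k=3))  # ['a', 'c', 'b']
```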

Collaborator

What is written is very unclear and complicated. Could you rewrite it more clearly?

filter(lambda feature: is_not_relevant(relevance, feature, atol), not_selected_features)
)
not_selected_features = list(
filter(lambda feature: is_relevant(relevance, feature, atol), not_selected_features)
Collaborator

Here you compute the same logic twice. Could you do it in one pass?
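The single-pass partition the reviewer asks for could look like this (an illustrative sketch; `relevance` and `atol` are toy stand-ins for the variables in the PR):

```python
import math

relevance = {"f1": 0.8, "f2": 0.0, "f3": 0.3}  # toy relevance scores
atol = 1e-10

# Partition features in one pass instead of running the same filter twice.
relevant, not_relevant = [], []
for feature, score in relevance.items():
    if math.isclose(score, 0.0, abs_tol=atol):
        not_relevant.append(feature)
    else:
        relevant.append(feature)

print(relevant, not_relevant)  # ['f1', 'f3'] ['f2']
```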

top_k = min(top_k, len(all_features))
if drop_zero is True:
not_relevant_features = list(
filter(lambda feature: is_not_relevant(relevance, feature, atol), not_selected_features)
Collaborator

Why do we compare with atol? Why not strict equality to 0?
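For context on the atol question: relevance scores computed with floating-point arithmetic may be zero only up to rounding error, which is the usual argument for a tolerance. A quick illustration (assuming NumPy, which etna already depends on):

```python
import numpy as np

score = 0.1 + 0.2 - 0.3  # mathematically zero, but a tiny float residue remains
print(score == 0)                        # False
print(np.isclose(score, 0, atol=1e-12))  # True
```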

@@ -235,8 +235,14 @@ def _fit(self, df: pd.DataFrame) -> "MRMRFeatureSelectionTransform":
df_features = df_features.loc[df_features.first_valid_index() :]
relevance_table = self.relevance_table(df_target, df_features, **self.relevance_params)

if relevance_table.values.min() < 0:
Collaborator

You should add it into relevance_table, not here.
Here you should have guarantees that all values are >= 0.

Collaborator Author

Why? We can use the relevance table for other things as well; it is only in MRMR that we need non-negative relevance.

Collaborator

For any relevance table, the importance must be non-negative simply because it is a relevance.
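One way to provide the guarantee discussed here, sketched on a toy table (illustrative only; the actual PR may handle negative relevance differently, e.g. by raising or rescaling):

```python
import pandas as pd

relevance_table = pd.DataFrame({"f1": [0.5, -0.2], "f2": [-0.1, 0.9]})

# Clip negative relevances to zero before MRMR so every importance is >= 0.
if relevance_table.values.min() < 0:
    relevance_table = relevance_table.clip(lower=0)

print(relevance_table.values.min())  # 0.0
```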


@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_drop_zero_mode_sanity_check(relevance_table, ts_with_regressors, fast_redundancy):
Collaborator

Where are the drop_zero mode tests here?

Collaborator Author

I added a test where top_k > the number of regressors, a test where the number of regressors with non-zero relevance < top_k < the number of regressors, and a sanity-check test. I think this covers all the main cases. Maybe I should rename some tests.


@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_select_top_k_regressors_in_drop_zero_mode(relevance_table, ts_with_regressors, fast_redundancy):
Collaborator

Where are the drop_zero mode tests here? (x2)

Collaborator Author

Answered in the next comment.


@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_drop_zero_mode_sanity_check(relevance_table, ts_with_regressors, fast_redundancy):
Collaborator

You already have a test with this name. And with the same meaning, too.

Collaborator Author

In this test I check that the correct regressors are selected; in the previous test I check that even if top_k >= the number of features with non-zero relevance, it still selects top_k features.


codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.38%. Comparing base (ba63d88) to head (872d0cd).
Report is 5 commits behind head on master.

Current head 872d0cd differs from pull request most recent head fba8ec1

Please upload reports for the commit fba8ec1 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #314      +/-   ##
==========================================
- Coverage   88.84%   86.38%   -2.47%     
==========================================
  Files         203      224      +21     
  Lines       14328    15258     +930     
==========================================
+ Hits        12730    13180     +450     
- Misses       1598     2078     +480     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yellowssnake yellowssnake requested a review from Ama16 May 15, 2024 18:51
@@ -30,11 +30,22 @@ class AggregationMode(str, Enum):
}


def is_relevant(relevance, feature, atol):
Collaborator

Atol?

return not relevance.loc[feature] == 0


def is_not_relevant(relevance, feature, atol):
Collaborator

Do you really need these functions? You could just add this logic into the filter explicitly, without helper functions.



@@ -83,25 +83,6 @@ def test_mrmr_right_regressors(df_with_regressors, relevance_method, expected_re
assert set(selected_regressors) == set(expected_regressors)


@pytest.mark.parametrize("fast_redundancy", [True, False])
Collaborator

why?

for column in df_selected.columns.get_level_values("feature"):
if column.startswith("regressor"):
selected_regressors.add(column)
assert len(selected_regressors) == 15
Collaborator

15 is a magic constant. Take it from ts_with_regressors

@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_select_top_k_regressors_in_drop_zero_mode(relevance_table, ts_with_regressors, fast_redundancy):
"""Check that transform selects right top_k regressors."""
Collaborator

I think this description is wrong.

@yellowssnake yellowssnake requested a review from Ama16 May 16, 2024 13:00
Ama16
Ama16 previously requested changes May 16, 2024
-------
selected_features: List[str]
list of ``top_k`` selected regressors, sorted by their importance
Maximum Relevance and Minimum Redundancy feature selection method.
Collaborator

Why did you add spaces here?

Collaborator

It is still relevant.

* True: compute redundancy only inside the segments, time complexity :math:`O(top\_k * n\_segments * n\_features * history\_len)`
* False: compute redundancy for all the pairs of segments, time complexity :math:`O(top\_k * n\_segments^2 * n\_features * history\_len)`
drop_zero:
* True: use only features with relevance > 0 in calculations, if their number is less than zero
Collaborator

less than top_k maybe?

for column in df_selected.columns.get_level_values("feature"):
if column.startswith("regressor"):
selected_regressors.add(column)
assert len(selected_regressors) == 10
Collaborator

If you know all the relevant features, add a check for their presence.

Collaborator Author

The purpose of this test is to check that in drop_zero mode top_k features are selected even when there are fewer than top_k features with relevance > 0; a specific list of regressors is not needed here.

@yellowssnake yellowssnake requested a review from Ama16 May 16, 2024 14:33
@d-a-bunin d-a-bunin changed the title ChangeMRMRFeatureSelectionTransform Fix MRMRFeatureSelectionTransform: change less_is_better logic, add drop_zero mode May 21, 2024
CHANGELOG.md Outdated
@@ -873,4 +874,4 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Distribution plot
- Anomalies (Outliers) plot
- Backtest (CrossValidation) plot
- Forecast plot
- Forecast plot
Collaborator

You probably broke the trailing newline here; let's fix that.

Collaborator

Aren't you going to fix that?

CHANGELOG.md (outdated, resolved)

etna/analysis/feature_selection/mrmr_selection.py (outdated, resolved)
etna/analysis/feature_selection/mrmr_selection.py (outdated, resolved)
etna/transforms/feature_selection/feature_importance.py (outdated, resolved)
@yellowssnake yellowssnake dismissed Ama16’s stale review May 22, 2024 07:18

already reviewed by d.a.bunin

@yellowssnake yellowssnake merged commit b011ec8 into master May 22, 2024
14 checks passed
Development

Successfully merging this pull request may close these issues.

Fix MRMR: change less_is_better logic, add drop_zero mode
3 participants