
Fix MRMRFeatureSelectionTransform: change less_is_better logic, add drop_zero mode #314

Merged
merged 17 commits into from
May 22, 2024

Conversation

yellowssnake
Collaborator

@yellowssnake yellowssnake commented May 14, 2024

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Fix MRMRFeatureSelectionTransform, change score formula and add drop_zero mode

Proposed Changes

Closing issues

Closes #308.

@yellowssnake yellowssnake requested a review from Ama16 May 15, 2024 07:36

github-actions bot commented May 15, 2024

🚀 Deployed on https://deploy-preview-314--etna-docs.netlify.app

@github-actions github-actions bot temporarily deployed to pull request May 15, 2024 07:38 Inactive
@yellowssnake yellowssnake changed the title Issue308/ChangeMRMRFeatureSelectionTransform ChangeMRMRFeatureSelectionTransform May 15, 2024
CHANGELOG.md (resolved)
@@ -59,6 +71,10 @@ def mrmr(
fast_redundancy:
* True: compute redundancy only inside the segments, time complexity :math:`O(top\_k * n\_segments * n\_features * history\_len)`
* False: compute redundancy for all the pairs of segments, time complexity :math:`O(top\_k * n\_segments^2 * n\_features * history\_len)`
drop_zero:
If True, drop features with zero relevance before MRMR. If top_k is greater number of features
Collaborator

It is not clear to me. If top_k is greater than the number of features with relevance > 0, select all non-zero-relevance features together with randomly selected zero-relevance features.

Collaborator Author

I select all non-zero-relevance features and then add some zero-relevance features in order to select exactly top_k features.
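For illustration, the selection rule the author describes could be sketched as a standalone function (hypothetical code, not the actual etna implementation; names are made up, and the zero-relevance padding here is taken in index order rather than randomly):

```python
import pandas as pd

def select_with_drop_zero(relevance: pd.Series, top_k: int) -> list:
    # Keep every feature with non-zero relevance; if there are fewer than
    # top_k of them, pad with zero-relevance features so that exactly
    # top_k features are returned.
    non_zero = [f for f in relevance.index if relevance[f] != 0]
    zero = [f for f in relevance.index if relevance[f] == 0]
    if len(non_zero) >= top_k:
        return non_zero[:top_k]
    return non_zero + zero[: top_k - len(non_zero)]

relevance = pd.Series({"a": 0.9, "b": 0.0, "c": 0.4, "d": 0.0})
print(select_with_drop_zero(relevance, top_k=3))  # ['a', 'c', 'b']
```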

Collaborator

What is written is very unclear and complicated. Could you rewrite it more clearly?

filter(lambda feature: is_not_relevant(relevance, feature, atol), not_selected_features)
)
not_selected_features = list(
filter(lambda feature: is_relevant(relevance, feature, atol), not_selected_features)
Collaborator

Here you compute the same logic twice. Could you do it in one pass?
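The single-pass partition the reviewer asks for could look like this (an illustrative sketch; `relevance` and `atol` are toy stand-ins for the variables in the PR):

```python
import math

relevance = {"f1": 0.8, "f2": 0.0, "f3": 0.3}  # toy relevance scores
atol = 1e-10

# Partition features in one pass instead of running the same filter twice.
relevant, not_relevant = [], []
for feature, score in relevance.items():
    if math.isclose(score, 0.0, abs_tol=atol):
        not_relevant.append(feature)
    else:
        relevant.append(feature)

print(relevant, not_relevant)  # ['f1', 'f3'] ['f2']
```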

top_k = min(top_k, len(all_features))
if drop_zero is True:
not_relevant_features = list(
filter(lambda feature: is_not_relevant(relevance, feature, atol), not_selected_features)
Collaborator

Why do we compare with atol? Why not strict equality to 0?
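For context on the atol question: relevance scores computed with floating-point arithmetic may be zero only up to rounding error, which is the usual argument for a tolerance. A quick illustration (assuming NumPy, which etna already depends on):

```python
import numpy as np

score = 0.1 + 0.2 - 0.3  # mathematically zero, but a tiny float residue remains
print(score == 0)                        # False
print(np.isclose(score, 0, atol=1e-12))  # True
```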

@@ -235,8 +235,14 @@ def _fit(self, df: pd.DataFrame) -> "MRMRFeatureSelectionTransform":
df_features = df_features.loc[df_features.first_valid_index() :]
relevance_table = self.relevance_table(df_target, df_features, **self.relevance_params)

if relevance_table.values.min() < 0:
Collaborator

You should add it into relevance_table, not here.
Here you should have guarantees that all values are >= 0.

Collaborator Author

Why? We can use the relevance table for other things as well; it is only in MRMR that we need non-negative relevance.

Collaborator

For any relevance table, the importance must be non-negative simply because it is a relevance.
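One way to provide the guarantee discussed here, sketched on a toy table (illustrative only; the actual PR may handle negative relevance differently, e.g. by raising or rescaling):

```python
import pandas as pd

relevance_table = pd.DataFrame({"f1": [0.5, -0.2], "f2": [-0.1, 0.9]})

# Clip negative relevances to zero before MRMR so every importance is >= 0.
if relevance_table.values.min() < 0:
    relevance_table = relevance_table.clip(lower=0)

print(relevance_table.values.min())  # 0.0
```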


@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_drop_zero_mode_sanity_check(relevance_table, ts_with_regressors, fast_redundancy):
Collaborator

Where are the drop_zero mode tests here?

Collaborator Author

I added a test where top_k > the number of regressors, a test where the number of regressors with non-zero relevance < top_k < the number of regressors, and a sanity-check test. I think this covers all the main cases. Maybe I should rename some tests.


@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_select_top_k_regressors_in_drop_zero_mode(relevance_table, ts_with_regressors, fast_redundancy):
Collaborator

Where are the drop_zero mode tests here? (x2)

Collaborator Author

Answered in the next comment.


@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_drop_zero_mode_sanity_check(relevance_table, ts_with_regressors, fast_redundancy):
Collaborator

You already have a test with this name. And with the same meaning, too.

Collaborator Author

In this test I check that the correct regressors are selected; in the previous test I check that even if top_k >= the number of features with non-zero relevance, it still selects top_k features.


codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.38%. Comparing base (ba63d88) to head (872d0cd).
Report is 5 commits behind head on master.

Current head 872d0cd differs from pull request most recent head fba8ec1

Please upload reports for the commit fba8ec1 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #314      +/-   ##
==========================================
- Coverage   88.84%   86.38%   -2.47%     
==========================================
  Files         203      224      +21     
  Lines       14328    15258     +930     
==========================================
+ Hits        12730    13180     +450     
- Misses       1598     2078     +480     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yellowssnake yellowssnake requested a review from Ama16 May 15, 2024 18:51
@@ -30,11 +30,22 @@ class AggregationMode(str, Enum):
}


def is_relevant(relevance, feature, atol):
Collaborator

Atol?

return not relevance.loc[feature] == 0


def is_not_relevant(relevance, feature, atol):
Collaborator

Do you really need these functions? You could just add this logic into the filter explicitly, without helper functions.



@@ -83,25 +83,6 @@ def test_mrmr_right_regressors(df_with_regressors, relevance_method, expected_re
assert set(selected_regressors) == set(expected_regressors)


@pytest.mark.parametrize("fast_redundancy", [True, False])
Collaborator

why?

for column in df_selected.columns.get_level_values("feature"):
if column.startswith("regressor"):
selected_regressors.add(column)
assert len(selected_regressors) == 15
Collaborator

15 is a magic constant. Take it from ts_with_regressors

@pytest.mark.parametrize("fast_redundancy", ([True, False]))
@pytest.mark.parametrize("relevance_table", ([ModelRelevanceTable()]))
def test_mrmr_select_top_k_regressors_in_drop_zero_mode(relevance_table, ts_with_regressors, fast_redundancy):
"""Check that transform selects right top_k regressors."""
Collaborator

I think this description is wrong.

@yellowssnake yellowssnake requested a review from Ama16 May 16, 2024 13:00
Ama16
Ama16 previously requested changes May 16, 2024
-------
selected_features: List[str]
list of ``top_k`` selected regressors, sorted by their importance
Maximum Relevance and Minimum Redundancy feature selection method.
Collaborator

Why did you add spaces here?

Collaborator

It is still relevant.

* True: compute redundancy only inside the segments, time complexity :math:`O(top\_k * n\_segments * n\_features * history\_len)`
* False: compute redundancy for all the pairs of segments, time complexity :math:`O(top\_k * n\_segments^2 * n\_features * history\_len)`
drop_zero:
* True: use only features with relevance > 0 in calculations, if their number is less than zero
Collaborator

less than top_k maybe?

for column in df_selected.columns.get_level_values("feature"):
if column.startswith("regressor"):
selected_regressors.add(column)
assert len(selected_regressors) == 10
Collaborator

If you know all the relevant features, add a check for their presence.

Collaborator Author

The purpose of this test is to check that in drop_zero mode top_k features are selected even when there are fewer than top_k features with relevance > 0; a specific list of regressors is not needed here.

@yellowssnake yellowssnake requested a review from Ama16 May 16, 2024 14:33
@d-a-bunin d-a-bunin changed the title ChangeMRMRFeatureSelectionTransform Fix MRMRFeatureSelectionTransform: change less_is_better logic, add drop_zero mode May 21, 2024
CHANGELOG.md Outdated
@@ -873,4 +874,4 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Distribution plot
- Anomalies (Outliers) plot
- Backtest (CrossValidation) plot
- Forecast plot
- Forecast plot
Collaborator

You probably broke the trailing newline here; let's fix that.

Collaborator

Aren't you going to fix that?

CHANGELOG.md (outdated, resolved)

etna/analysis/feature_selection/mrmr_selection.py (outdated, resolved)
etna/analysis/feature_selection/mrmr_selection.py (outdated, resolved)
etna/transforms/feature_selection/feature_importance.py (outdated, resolved)
@yellowssnake yellowssnake dismissed Ama16’s stale review May 22, 2024 07:18

already reviewed by d.a.bunin

@yellowssnake yellowssnake merged commit b011ec8 into master May 22, 2024
14 checks passed
Development

Successfully merging this pull request may close these issues.

Fix MRMR: change less_is_better logic, add drop_zero mode
3 participants