Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where Imputer changes index of X #1590

Merged
merged 3 commits into from
Jan 4, 2021

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Dec 21, 2020

Pull Request Description

Fix #1442

This has caused two confusing bugs already when the input data have custom indices:

  1. TargetEncoder not being able to fit: Add Target Encoder to Components #1401 (comment)
  2. DelayedFeatureTransformer adds nans when X and y have different indices (ran into this during the blitz for time series performance tests)

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@freddyaboulton freddyaboulton changed the title [BLITZ] Fix bug where Imputer changes index of X [Blitz] Fix bug where Imputer changes index of X Dec 21, 2020
@codecov
Copy link

codecov bot commented Dec 21, 2020

Codecov Report

Merging #1590 (2c4bccb) into main (51d7403) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1590     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       18207    18210      +3     
=========================================
+ Hits        18199    18202      +3     
  Misses          8        8             
Impacted Files Coverage Δ
...elines/components/transformers/imputers/imputer.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_imputer.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51d7403...2c4bccb. Read the comment docs.

@freddyaboulton freddyaboulton changed the title [Blitz] Fix bug where Imputer changes index of X Fix bug where Imputer changes index of X Jan 4, 2021
@freddyaboulton freddyaboulton marked this pull request as ready for review January 4, 2021 17:14
Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good! Just had a quick question for my own understanding about updating test code :d

@@ -199,32 +199,30 @@ def test_imputer_empty_data(data_type):
imputer = Imputer()
imputer.fit(X, y)
transformed = imputer.transform(X, y)
assert_frame_equal(transformed, expected, check_dtype=False)
assert_frame_equal(transformed, expected, check_dtype=False, check_index_type=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, does this mean what we're doing will change the index type but not the values? Ex: RangeIndex --> Index or something of that sort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index type shouldn't change! When we create the empty dataframes in the test, the index type is Index. What happened is that before we would always call reset_index so the type would change to Int64Index. Since we don't reset the index, the index type should be preserved as Index.

I'm creating expected with the correct index type as opposed to setting check_index_type=False. I think that's clearer so thanks for the question!

@freddyaboulton freddyaboulton merged commit b257752 into main Jan 4, 2021
@freddyaboulton freddyaboulton deleted the 1422-imputer-changes-index-on-X branch January 4, 2021 20:32
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Good call. I wonder if other components need this treatment too.

index=list(range(0, 9))))
pd.DataFrame({'input_val': [1.0, 2, 3, 4, 5, 6, 7, 8, 9],
'input_cat': pd.Categorical(['a'] * 6 + ['b'] * 3)},
index=list(range(1, 10))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

if X_null_dropped.empty:
return X_null_dropped

if self._numeric_cols is not None and len(self._numeric_cols) > 0:
X_numeric = X_null_dropped[self._numeric_cols]
X_null_dropped[X_numeric.columns] = self._numeric_imputer.transform(X_numeric)
imputed = self._numeric_imputer.transform(X_numeric)
imputed.index = X_null_dropped.index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there ever a case where imputed.shape != X_null_dropped.shape? I.e., could self._numeric_imputer.transform(X_numeric) ever result in dropped rows?

I believe the answer is, to the best of our knowledge, no. Correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you're right!

@bchen1116 bchen1116 mentioned this pull request Jan 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Imputer changes index on X
3 participants