
Add normalization option for confusion matrix #484

Merged
16 commits merged into master on Mar 12, 2020

Conversation

@angela97lin (Contributor) commented Mar 11, 2020:

Closes #432.

Without normalization:

[screenshot: confusion matrix plot]

With normalization:

[screenshot: normalized confusion matrix plot]

===

Since the coloration is the same there, I input some hard-coded data to show the difference between the normalized and non-normalized colors:

Without normalization:
[screenshot: confusion matrix plot, hard-coded data]

With normalization:

[screenshot: normalized confusion matrix plot, hard-coded data]

@angela97lin (Contributor, Author) commented Mar 11, 2020:

Note: sklearn adds a normalization parameter in versions >0.22(.2?), but our current dependencies allow for anything greater than 0.21.3. It was also easier to add our own normalization calculation than to change how score() works for the confusion matrix objective and then potentially carry both a normalized and a non-normalized version of the matrix around, so this is what I went with. If we later want to support the other sklearn normalization parameter choices ({'true', 'pred', 'all'}), switching to sklearn's parameter could be the better approach. :)
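The approach described above can be sketched roughly as follows. This is a minimal reconstruction, not the actual evalml code; the helper name `normalize_confusion_matrix` and the DataFrame branch are assumptions for illustration:

```python
import numpy as np
import pandas as pd


def normalize_confusion_matrix(conf_mat):
    """Normalize a confusion matrix so each row sums to 1.

    Hypothetical helper sketching the PR's approach. Accepts either a
    numpy array or a pandas DataFrame; rows that sum to zero become
    all zeros instead of NaN.
    """
    conf_mat = conf_mat.astype('float')
    if isinstance(conf_mat, pd.DataFrame):
        # df.div with axis=0 divides each row by its row sum.
        return conf_mat.div(conf_mat.sum(axis=1), axis=0).fillna(0)
    # [:, np.newaxis] reshapes the row sums to a column vector so the
    # division broadcasts row-wise for plain numpy arrays.
    return np.nan_to_num(conf_mat / conf_mat.sum(axis=1)[:, np.newaxis])
```

With this, each row of the output is the original row divided by its total, so the cell values can be read as per-class proportions.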

@angela97lin angela97lin requested review from kmax12 and dsherry Mar 11, 2020
codecov bot commented Mar 11, 2020:

Codecov Report

Merging #484 into master will increase coverage by <.01%.
The diff coverage is 98.52%.


@@            Coverage Diff             @@
##           master     #484      +/-   ##
==========================================
+ Coverage   98.26%   98.26%   +<.01%     
==========================================
  Files         104      104              
  Lines        3292     3352      +60     
==========================================
+ Hits         3235     3294      +59     
- Misses         57       58       +1
Impacted Files Coverage Δ
evalml/utils/__init__.py 100% <100%> (ø) ⬆️
...l/tests/automl_tests/test_pipeline_search_plots.py 97.36% <100%> (+0.12%) ⬆️
evalml/tests/utils_tests/test_gen_utils.py 100% <100%> (ø) ⬆️
evalml/automl/pipeline_search_plots.py 97.91% <100%> (+0.26%) ⬆️
evalml/utils/gen_utils.py 97.36% <93.33%> (-2.64%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c23e7b8...e04e501.

@kmax12 (Contributor) left a comment:

Overall, the output looks good to me. Some comments on the implementation.

evalml/automl/pipeline_search_plots.py (outdated review thread, resolved)
A normalized version of the input confusion matrix.

"""
return conf_mat.astype('float') / conf_mat.sum(axis=1)[:, np.newaxis]
@kmax12 (Contributor), Mar 11, 2020:

What's the `[:, np.newaxis]` for? In my testing it looks like it doesn't do anything.

[screenshot of the test, 2020-03-11 5:48 PM]

@angela97lin (Contributor, Author), Mar 12, 2020:

My understanding is it changes broadcasting for the division. Here's an example where it makes a difference:

[screenshot: example output]
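A small sketch (my reconstruction, not the screenshot above) of where `[:, np.newaxis]` changes the result for numpy arrays:

```python
import numpy as np

conf_mat = np.array([[1.0, 3.0],
                     [2.0, 6.0]])
row_sums = conf_mat.sum(axis=1)  # shape (2,): [4., 8.]

# Without np.newaxis, row_sums broadcasts along the LAST axis:
# each COLUMN j is divided by row_sums[j], which is wrong here.
by_column = conf_mat / row_sums

# With np.newaxis, row_sums becomes shape (2, 1), so each ROW i
# is divided by row_sums[i]: the row-wise normalization we want.
by_row = conf_mat / row_sums[:, np.newaxis]
```

The two results only coincide when all row sums happen to be equal, which is why a quick test can make the `np.newaxis` look like a no-op.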

@kmax12 (Contributor), Mar 12, 2020:

Hmm, I see. Looks like that's needed for supporting numpy arrays, but perhaps not needed for DataFrames (my example).

I'm fine leaving it in to be safe, but perhaps leave a comment? This Stack Overflow resource was helpful for me: https://stackoverflow.com/questions/8904694/how-to-normalize-a-2-dimensional-numpy-array-in-python-less-verbose

@angela97lin (Contributor, Author), Mar 12, 2020:

Oo, nice resource! I just took a closer look at the example you posted, but it seems like it makes a difference even with DataFrames? The columns are flipped, but the numbers output are also different 🤔
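For what it's worth, a small sketch (my own example, not from the thread) of why plain division also misbehaves for DataFrames: dividing a DataFrame by a Series aligns the Series index against the DataFrame's column labels, not its rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 3.0], [2.0, 6.0]], columns=['a', 'b'])
row_sums = df.sum(axis=1)  # Series indexed 0, 1

# Plain division aligns row_sums' index (0, 1) against df's columns
# ('a', 'b'); nothing matches, so every resulting cell is NaN.
misaligned = df / row_sums

# df.div with axis=0 aligns on the row index instead, giving the
# intended row-wise normalization.
normalized = df.div(row_sums, axis=0)
```

So for DataFrames the fix is explicit axis alignment rather than `np.newaxis`, which is consistent with the flipped/different numbers observed above.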

evalml/automl/pipeline_search_plots.py (3 outdated review threads, resolved)
evalml/tests/automl_tests/test_autobase.py (outdated review thread, resolved)
@angela97lin angela97lin requested a review from kmax12 Mar 12, 2020
@kmax12 (Contributor) left a comment:

Looking good, just some minor comments.


"""
column_names = None
if isinstance(conf_mat, pd.DataFrame):
@kmax12 (Contributor), Mar 12, 2020:

Do you do this because of the call to np.nan_to_num(conf_mat) later? If so, I'd just end the function with

if isinstance(conf_mat, pd.DataFrame):
    conf_mat = conf_mat.fillna(0)
else:
    conf_mat = np.nan_to_num(conf_mat)

This would be cleaner than storing the column names.

@angela97lin (Contributor, Author), Mar 12, 2020:

Ah, you're right. Thanks for the suggestion, updated!

evalml/automl/pipeline_search_plots.py (outdated review thread, resolved)
@angela97lin angela97lin requested a review from kmax12 Mar 12, 2020
@kmax12 approved these changes Mar 12, 2020:

Looks great!

@angela97lin angela97lin merged commit 9edad38 into master Mar 12, 2020
2 checks passed
@angela97lin angela97lin deleted the 432_conf_matrix_normalization branch Mar 12, 2020
Merging this pull request may close issue: Add normalization option to confusion matrix
2 participants