Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not copy mapping from dependent variable to prediction field in regression analysis #51227

Merged
merged 7 commits into from
Jan 22, 2020

Conversation

przemekwitek
Copy link
Contributor

@przemekwitek przemekwitek commented Jan 20, 2020

Currently, in case of regression analysis, the mapping is copied from dependent variable to prediction field.
When the dependent variable is of a discrete type (i.e. integer, long, etc.) the prediction field is indexed as a discrete type as well, increasing total prediction error (MSE, R^2).
This PR addresses that by making prediction field mapped dynamically (as float).

Closes https://github.com/elastic/machine-learning-qa/issues/661

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@przemekwitek przemekwitek removed the WIP label Jan 20, 2020
@przemekwitek przemekwitek marked this pull request as ready for review January 20, 2020 15:06
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me :D.

I wonder if we should force the mapping for regression to always be double instead of it being float? I am not sure if the precision loss is a concern or not.

@droberts195
Copy link
Contributor

It's a good point about using double instead of float. float only stores 7 digits of accuracy, so if the dependent variable contained integers greater than 10 million then using float will lose more accuracy than sticking with integer. (integer will lose the fractions from the predictions, but for a number between 10 and 100 million float would make the units and fraction random.) double would be a reasonable compromise as it has 15 digits of accuracy. long has 19, but we'd only suffer the problem at the (unusual) extreme sizes, whereas integers over 10 million are not that uncommon.

@przemekwitek
Copy link
Contributor Author

przemekwitek commented Jan 21, 2020

I wonder if we should force the mapping for regression to always be double instead of it being float? I am not sure if the precision loss is a concern or not.

This change required bigger changes in the logic that calculates mappings as now Regression imposes constant mapping while Classification copies the mapping from dependent variable.
You might want to take another look @benwtrent and @droberts195.

@benwtrent benwtrent self-assigned this Jan 21, 2020
@benwtrent benwtrent self-requested a review January 21, 2020 18:00
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good.

There are a bunch of failing tests due to the change :).

*
* @param resultsFieldName name of the results field under which all the results are stored
* @return {@link Map} containing fields for which the mappings should be copied from source index to destination index
* @return {@link List} containing fields for which the mappings should be handled explicitly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @return {@link List} containing fields for which the mappings should be handled explicitly
* @return {@link Map} containing fields for which the mappings should be handled explicitly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants