-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prediction Metrics: New module #42
Conversation
JIRA: MADLIB-907 A collection of summary statistics to gauge model accuracy based on predicted values vs. ground-truth values.
@@ -117,10 +117,11 @@ complete matrix stored as a distributed table. | |||
@ingroup grp_datatrans | |||
|
|||
@defgroup grp_mdl Model Evaluation | |||
@{Contains the cross-validation module, a collection of routines useful for | |||
<a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation</a>. @} | |||
@{Contains functions for ensuring accuracy and validation of predictive methods. @} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use "evaluating" instead of "ensuring"
General comments:
|
I have made some changes and added validation and online help functions (in my private fork branch). I have not implemented the SRF versions as discussed earlier. We can make those functions as a separate work. @orhankislal could you please pull in the 3 commits on that branch and update your branch to update this PR. Once we've discussed the changes here, I can merge the commits to master. |
FROM ( | ||
SELECT {grp_str} | ||
{pred_col} AS threshold, | ||
sum({obs_col}) AS t, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a ::int
inside the sum. That way we can also support obs_col
as a boolean (which would be nice considering this is binary classification).
Along with casting the columns to int in binary classification, we also need to change docs/online-help/tests to reflect that boolean columns allowed for observation columns. |
- Adds validation for input columns - Adds support for binary values on binary classification and AUC
avg({obs_col}) OVER ({partition_str}) as mean | ||
FROM {table_in} | ||
) x {grp_by_str} | ||
) y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not
SELECT
{grp_out_str}
1 - avg(({pred_col} - {obs_col})^2)/var_pop({obs_col}) AS r2_score
FROM {table_in} {grp_by_str}
It is simpler, faster (2-3 times in my quick experiments) and numerically stable (avoid large sum)
Various fixes for documentation, performance and code clarity.
JIRA: MADLIB-907
A collection of summary statistics to gauge model accuracy
based on predicted values vs. ground-truth values.