Prediction Metrics: New module #42

orhankislal · 2016-05-04T18:44:37Z

JIRA: MADLIB-907

A collection of summary statistics to gauge model accuracy
based on predicted values vs. ground-truth values.


iyerr3 · 2016-05-10T21:22:09Z

doc/mainpage.dox.in

@@ -117,10 +117,11 @@ complete matrix stored as a distributed table.
        @ingroup grp_datatrans

 @defgroup grp_mdl Model Evaluation
-@{Contains the cross-validation module, a collection of routines useful for
-<a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation</a>. @}
+@{Contains functions for ensuring accuracy and validation of predictive methods. @}


I would use "evaluating" instead of "ensuring".

iyerr3 · 2016-05-10T21:43:52Z

General comments:

- The distance functions (mean_*_error) all share the same structure and differ only in the distance metric. I suggest refactoring the table creation and the grouping statements into a single function that takes the distance as a parameter.
- Avoid using __ within Python. It leads to name mangling, which doesn't really help in this situation.
- There seems to be considerable overlap between the grouping and non-grouping SQL in the r2, adjusted_r2 and auc functions. Maybe we can combine them?
- I'll bump the suggestion by @decibel in the previous PR: it might be helpful to have two versions, one that creates the output table and one that returns the rows directly.
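As a rough illustration of the first point, the shared structure could be factored out along these lines. This is only a sketch under stated assumptions: the helper `build_mean_error_query`, the `DISTANCE_EXPRESSIONS` table, the two metric names and the `test_set` table are hypothetical, and SQLite stands in for the target database so the snippet runs on its own. Identifiers are interpolated without validation, which real code would need to handle.

```python
import sqlite3

# Hypothetical mapping from metric name to the SQL expression for its distance.
# Only this expression differs between the mean_*_error functions.
DISTANCE_EXPRESSIONS = {
    "mean_abs_error": "abs({pred} - {obs})",
    "mean_squared_error": "({pred} - {obs}) * ({pred} - {obs})",
}


def build_mean_error_query(table_in, pred_col, obs_col, metric, grouping_cols=None):
    """Build one SELECT that averages the chosen distance, optionally per group."""
    distance = DISTANCE_EXPRESSIONS[metric].format(pred=pred_col, obs=obs_col)
    # The grouping pieces are empty strings when no grouping is requested,
    # so the grouped and ungrouped cases share a single template.
    grp_out_str = grouping_cols + ", " if grouping_cols else ""
    grp_by_str = " GROUP BY " + grouping_cols if grouping_cols else ""
    return "SELECT {grp_out}avg({dist}) AS {metric} FROM {table}{grp_by}".format(
        grp_out=grp_out_str, dist=distance, metric=metric,
        table=table_in, grp_by=grp_by_str)


def demo():
    """Run both an ungrouped and a grouped metric against a tiny in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_set (grp TEXT, pred REAL, obs REAL)")
    conn.executemany(
        "INSERT INTO test_set VALUES (?, ?, ?)",
        [("a", 1.0, 2.0), ("a", 3.0, 1.0), ("b", 5.0, 5.0)])
    overall = conn.execute(
        build_mean_error_query("test_set", "pred", "obs", "mean_abs_error")
    ).fetchall()
    grouped = conn.execute(
        build_mean_error_query("test_set", "pred", "obs", "mean_squared_error",
                               grouping_cols="grp") + " ORDER BY grp"
    ).fetchall()
    conn.close()
    return overall, grouped
```

Adding another metric would then mean adding one entry to the mapping rather than another copy of the table-creation and grouping code.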

iyerr3 · 2016-05-27T18:43:58Z

I have made some changes and added validation and online help functions (in my private fork branch).

I have not implemented the SRF (set-returning function) versions discussed earlier. We can add those functions as separate work.

@orhankislal could you please pull in the 3 commits on that branch and update your branch so that this PR is updated? Once we've discussed the changes here, I can merge the commits to master.

iyerr3 · 2016-05-31T20:44:48Z

src/ports/postgres/modules/stats/pred_metrics.py_in

+        FROM (
+          SELECT {grp_str}
+                 {pred_col} AS threshold,
+                 sum({obs_col}) AS t,


Let's add a ::int inside the sum. That way we can also support obs_col as a boolean (which would be nice considering this is binary classification).
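A minimal sketch of the suggested change, written as a Python template in the style of the fragment above. Only the lines shown in the diff are reproduced and the rest of the query is omitted; `render_select_fragment` and the sample column names are hypothetical. PostgreSQL has no sum() over boolean, while an explicit boolean-to-int cast maps true to 1 and false to 0, so the cast lets either column type through.

```python
# Fragment of the SELECT list from the diff, with the cast applied inside sum().
# Columns that follow `t` in the real query are intentionally left out.
SELECT_FRAGMENT = """SELECT {grp_str}
       {pred_col} AS threshold,
       sum({obs_col}::int) AS t"""


def render_select_fragment(pred_col, obs_col, grp_str=""):
    """Fill in the template; grp_str is empty when no grouping is used."""
    return SELECT_FRAGMENT.format(
        grp_str=grp_str, pred_col=pred_col, obs_col=obs_col)


# Hypothetical column names, for illustration only.
print(render_select_fragment("score", "is_positive"))
```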

iyerr3 · 2016-05-31T20:48:32Z

Along with casting the columns to int in binary classification, we also need to change the docs, online help and tests to reflect that boolean columns are allowed for observation columns.


mktal · 2016-06-08T15:50:43Z

src/ports/postgres/modules/stats/pred_metrics.py_in

+                           avg({obs_col}) OVER ({partition_str}) as mean
+                    FROM {table_in}
+                ) x {grp_by_str}
+            ) y


Why not use the following instead?

SELECT {grp_out_str} 1 - avg(({pred_col} - {obs_col})^2)/var_pop({obs_col}) AS r2_score FROM {table_in} {grp_by_str}

It is simpler, faster (2-3 times in my quick experiments) and more numerically stable (it avoids a large sum).
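The two forms agree because dividing both the residual sum of squares and the total sum of squares by the row count n turns them into the mean squared error and the population variance, so 1 - SS_res/SS_tot = 1 - avg((pred - obs)^2)/var_pop(obs). The plain-Python sketch below checks this on a small sample. The function names are hypothetical, and the "sum form" is the textbook definition, assumed here to match what the original query computes.

```python
from statistics import fmean, pvariance


def r2_sum_form(pred, obs):
    """Textbook R^2: one minus residual sum of squares over total sum of squares."""
    mean_obs = fmean(obs)
    ss_res = sum((p - o) ** 2 for p, o in zip(pred, obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def r2_avg_form(pred, obs):
    """Same quantity as 1 - avg(squared error) / population variance of obs."""
    mse = fmean([(p - o) ** 2 for p, o in zip(pred, obs)])
    return 1.0 - mse / pvariance(obs)


obs = [3.0, -0.5, 2.0, 7.0]
pred = [2.5, 0.0, 2.0, 8.0]
# Both calls print the same value up to floating-point rounding.
print(r2_sum_form(pred, obs), r2_avg_form(pred, obs))
```

Note that the equivalence relies on the population variance (`var_pop`); the sample variance divides by n - 1 and would give a different result.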


orhankislal · 2016-06-09T19:24:56Z

Thank you very much for your comments @iyerr3 and @mktal. I have pushed 2 new commits to incorporate your suggestions. Please let me know if you have any other suggestions.

Prediction Metrics: New module

0a2900a

JIRA: MADLIB-907 A collection of summary statistics to gauge model accuracy based on predicted values vs. ground-truth values.

iyerr3 reviewed May 10, 2016

iyerr3 added 3 commits May 24, 2016 16:54

Move location + update functions

998296a

Add validation + other minor changes

b75db13

Add help messages for all functions

045591e

Minor fix for a help message

dc641df

iyerr3 reviewed May 31, 2016

Prediction Metrics

28f6c04

- Adds validation for input columns
- Adds support for binary values on binary classification and AUC

mktal reviewed Jun 8, 2016

Prediction Metrics

7eba633

Various fixes for documentation, performance and code clarity.

asfgit closed this in b916568 Jun 14, 2016

orhankislal deleted the feature/pred_metrics_take2 branch March 23, 2017 00:13
