more on cross-validation leakage
JohnMount committed Jul 6, 2019
1 parent af2e3b9 commit 124d865
Showing 5 changed files with 197 additions and 75 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: vtreat
Type: Package
Title: A Statistically Sound 'data.frame' Processor/Conditioner
-Version: 1.4.2
-Date: 2019-07-01
+Version: 1.4.3
+Date: 2019-07-06
Authors@R: c(
person("John", "Mount", email = "jmount@win-vector.com", role = c("aut", "cre")),
person("Nina", "Zumel", email = "nzumel@win-vector.com", role = c("aut")),
4 changes: 4 additions & 0 deletions NEWS.md
@@ -1,4 +1,8 @@

+# vtreat 1.4.3 2019/07/06
+
+* More tests.

# vtreat 1.4.2 2019/07/01

* Fix erroneous Cohen reference in documentation.
5 changes: 3 additions & 2 deletions extras/ConstantLeak.Rmd
@@ -9,7 +9,7 @@ We will show how in some situations using "more data in cross-validation" can be
harmful.

Our example: an outcome (`y`) that is independent of a low-complexity
-categorical variable (`x`). We will combine this with a varaible that is a noisy
+categorical variable (`x`). We will combine this with a variable that is a noisy
constant and leave-one-out cross-validation (which is a deterministic procedure) to
get a bad result (failing to notice over-fit).
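For intuition, here is a small numeric sketch (in Python, illustrative only and not part of this commit) of why a leave-one-out mean encoding of a constant variable is a disguised copy of `y`: the encoding for row `i` is `(n*mean(y) - y_i)/(n - 1)`, an affine function of `y_i`, so it is perfectly (negatively) correlated with the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.normal(size=n)   # outcome: pure noise
x = np.zeros(n)          # the "bad coder" input: a constant

# leave-one-out conditional mean: the mean of y over all *other* rows;
# for a constant x this is the only level, so every row gets this value
loo_mean = (y.sum() - y) / (n - 1)

# loo_mean_i = (n*ybar - y_i)/(n-1) is affine in y_i with negative slope,
# hence perfectly anti-correlated with y
corr = np.corrcoef(y, loo_mean)[0, 1]
print(corr)  # ≈ -1.0
```

No randomness in the cross-validation plan means nothing breaks this deterministic relationship between the encoding and the outcome.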

@@ -128,6 +128,7 @@ summary(lm(y ~ x_badCoderN, data= cfeFX$crossFrame))
What happened is:

* The deterministic structure of leave-one-out cross-validation introduces an information leak that copies a transform of the value of `y` into the bad coder. Essentially the leave-one-out cross-validation consumes a number of degrees of freedom equal to the number of distinct data sets it presents (one per data row).
* The bad coder had a design flaw: it returned a conditional mean, instead of a conditional difference from the overall mean. The actual `vtreat` impact/effects coders are careful to return the difference from the cross-validation segment mean (which would be zero for all constant values).
* The bad coder being a near constant means this leak is nearly the entirety of the bad coder signal.
* On any data set other than the one-way-holdout cross-validation frame the bad coder is in fact a noisy constant (and not useful). Thus the bad coder is pure over-fit, and any model that uses it is at risk of over-fitting.
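To see that this leak alone is enough to make pure noise look like a perfect predictor (mirroring the `summary(lm(...))` call above), here is a minimal least-squares sketch in Python (an illustration of the mechanism, not the `vtreat` code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
y = rng.normal(size=n)                 # outcome: pure noise
bad_code = (y.sum() - y) / (n - 1)     # leave-one-out mean "encoding"

# ordinary least squares of y on the leaked encoding (plus intercept)
X = np.column_stack([np.ones(n), bad_code])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()
print(r2)  # ≈ 1.0: an apparently "perfect" fit to pure noise
```

Because the encoding is an exact affine transform of `y`, the in-sample fit is essentially perfect even though the underlying variable carries no information about the outcome.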

@@ -143,5 +144,5 @@ This failing is because the common cross validation procedures are not [fully ne

*Fully* nested cross-simulation (where even the last step is under the cross-control and enumerating excluded sets of training rows) is likely too cumbersome (requiring more code coordination) and expensive (upping the size of the sets of rows we have to exclude) to force on implementers who are also unlikely to see any benefit in non-degenerate cases. The partially nested cross-simulation used in `vtreat` is likely a good practical compromise (though we may explore full-nesting for the score frame estimates, as that is a step completely under `vtreat` control).

-The current `vtreat` procedures are very strong and fully up to the case of assisting in construction of best possible machine learning models. However in certain degenerate cases (near-constant encoding combined completely deterministic cross-validation; neither of which is a default behavior of `vtreat`) the cross validation system itself can introduce an information leak that promote over-fit.
+The current `vtreat` procedures are very strong and fully up to the task of assisting in the construction of the best possible machine learning models. However, in certain degenerate cases (near-constant encoding combined with completely deterministic cross-validation; neither of which is a default behavior of `vtreat`) the cross-validation system itself can introduce an information leak that promotes over-fit.
