This repository has been archived by the owner on Mar 3, 2020. It is now read-only.

The AUC-meter evaluates differently from classical statistics #42

Closed
gforge opened this issue Aug 4, 2016 · 6 comments

Comments

@gforge
Contributor

gforge commented Aug 4, 2016

I've finished writing a basic test suite for the meters and, apart from issue #41, I've encountered an unexpected problem with tnt.AUCMeter. The following test case should implement the classical AUC calculation based on this paper:

function test.AUCMeter()
   local mtr = tnt.AUCMeter()

   -- From http://stats.stackexchange.com/questions/145566/how-to-calculate-area-under-the-curve-auc-or-the-c-statistic-by-hand
   local samples = torch.Tensor{
      {33,6,6,11,2}, --normal
      {3,2,2,11,33} -- abnormal
   }
   for i=1,samples:size(2) do
      local target = torch.Tensor():resize(samples:narrow(2,i,1):sum()):zero()
      target:narrow(1,1,samples[2][i]):fill(1)
      local output = torch.Tensor(target:size(1)):fill(i)
      mtr:add(output, target)
   end

   local error, tpr, fpr = mtr:value()

   tester:assert(math.abs(error - 0.8931711) < 10^-3,
      ("The AUC error does not match: %.3f is not equal to 0.893"):format(error))
end

Unfortunately the computed AUC (0.704) is lower than the expected 0.893. I'm not familiar enough with ML to know whether the ML definition of AUC differs in some significant way, but the value 0.704 seems intuitively low (my apologies if I missed something in the code). Looking at how the AUC is calculated, there is a zero appended that could be pulling the value down.
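For reference, the expected 0.893 is the classical (Mann-Whitney) AUC for this table: the probability that a randomly chosen abnormal sample outscores a randomly chosen normal one, with ties counting half. A minimal standalone Python sketch of that hand calculation (not torchnet code):

```python
# Hand computation of the classical AUC for the table above
# (Mann-Whitney form: concordant pairs plus half the tied pairs).
normal   = [33, 6, 6, 11, 2]   # normal counts per score 1..5
abnormal = [3, 2, 2, 11, 33]   # abnormal counts per score 1..5

concordant = 0.0
for j, a in enumerate(abnormal):
    concordant += a * sum(normal[:j])   # abnormal strictly outscores normal
    concordant += 0.5 * a * normal[j]   # tied scores count half

auc = concordant / (sum(normal) * sum(abnormal))
print(round(auc, 7))  # 0.8931711
```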

@lvdmaaten
Contributor

The method of computing the AUC described in the article you linked to is a bit different from what is implemented here. Note how the article does linear interpolation between the actual observations. This is not entirely accurate: the linear interpolation is just an approximation of the AUC at the unobserved points. We use a more conservative, constant approximation: in our case, the plot would look like a step function that never lies above the linear-interpolation version (and is equal only at observed points).

As a result, the AUC we measure will always be lower than what the article's method computes. It is still possible that there is a bug in AUCMeter, but you'd need a different test case to check.
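The two rules can be compared directly on the ROC points of the example above. In this standalone Python sketch (an illustration, not the actual AUCMeter code), the conservative rule integrates the TPR at the left edge of each FPR step, which effectively drops the half-credit for tied scores and so lands below the trapezoidal (linear-interpolation) value:

```python
# ROC points for the example above, thresholding from the highest score down.
normal   = [33, 6, 6, 11, 2]   # negatives per score 1..5
abnormal = [3, 2, 2, 11, 33]   # positives per score 1..5
N, P = sum(normal), sum(abnormal)

fpr, tpr = [0.0], [0.0]
fp = tp = 0
for s in reversed(range(5)):          # scores 5, 4, 3, 2, 1
    fp += normal[s]; tp += abnormal[s]
    fpr.append(fp / N); tpr.append(tp / P)

# Conservative step rule: TPR at the left edge of each FPR increment.
step = sum(tpr[i] * (fpr[i + 1] - fpr[i]) for i in range(len(fpr) - 1))
# Trapezoidal rule: linear interpolation between observed points.
trap = sum(0.5 * (tpr[i] + tpr[i + 1]) * (fpr[i + 1] - fpr[i])
           for i in range(len(fpr) - 1))
print(round(step, 4), round(trap, 4))  # step < trap
```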

@gforge
Contributor Author

gforge commented Aug 5, 2016

Thanks for the explanation. I think I grasp the idea, but since I'm not entirely comfortable with the calculation I switched approaches: now I simply check that random guesses give an AUC close to 0.5 and a perfect classifier gives an AUC of 1. Unfortunately I still seem to be missing something, as I can't get the "perfect guess" case to work; my current test code is:

function test.AUCMeter()
   local mtr = tnt.AUCMeter()

   local test_size = 10^3
   mtr:add(torch.rand(test_size), torch.zeros(test_size))
   mtr:add(torch.rand(test_size), torch.Tensor(test_size):fill(1))
   local err = mtr:value()
   tester:eq(err, 0.5, "Random guesses should provide an AUC close to 0.5", 10^-1)

   mtr:add(torch.Tensor(test_size):fill(0), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.1), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.2), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.3), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.4), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(1), torch.Tensor(test_size):fill(1))
   err = mtr:value()
   tester:eq(err, 1, "Only correct guesses should provide an AUC close to 1", 10^-1)

   -- Simulate a random situation where all the guesses are correct
   mtr:reset()
   local output = torch.abs(torch.rand(test_size)-.5)*2/3
   mtr:add(output, torch.zeros(test_size))
   output = torch.min(
      torch.cat(torch.rand(test_size) + .75,
                torch.Tensor(test_size):fill(1),
                2),
      2)
   mtr:add(output:fill(1), torch.Tensor(test_size):fill(1))
   err = mtr:value()
   tester:eq(err, 1, "Simulated random correct guesses should provide an AUC close to 1", 10^-1)
end

I've tried several versions of this, with the estimate ending up around 0.75. I guess it's related to the step behavior, since it evaluates to 3/4, but in my mind the random attempt should smooth out the steps.
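For what it's worth, the intent of the simulated case can be checked outside the meter: the negative outputs are constructed to lie in [0, 1/3] and the positive ones in [0.75, 1], so the classes are perfectly separable and the pairwise AUC should be exactly 1. A standalone Python sketch of that sanity check (not using tnt.AUCMeter):

```python
import random
random.seed(123)

n = 200
# Negatives in [0, 1/3]; positives in [0.75, 1] -- non-overlapping ranges.
negatives = [abs(random.random() - 0.5) * 2 / 3 for _ in range(n)]
positives = [min(random.random() + 0.75, 1.0) for _ in range(n)]

# Every positive strictly outscores every negative, so the pairwise AUC is 1.
auc = sum(p > q for p in positives for q in negatives) / (n * n)
print(auc)  # 1.0
```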

@lvdmaaten
Contributor

The first unit test is a bit flaky because it contains randomness (maybe use an example for which you know the correct answer instead?). Also, note that you're missing a mtr:reset() between test 1 and test 2. (And are you sure you want the output:fill(1) in the last mtr:add(...)?)
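A tiny hand-checkable case works well for this. Sketching the pairwise AUC definition in Python with hypothetical data (an illustration, not the meter's code): with outputs [0.4, 0.3, 0.2, 0.1] and targets [0, 1, 0, 1], only one of the four positive/negative pairs is correctly ordered, so the AUC is exactly 0.25:

```python
# Pairwise (Mann-Whitney) AUC for tiny hand-checkable cases.
def auc_pairs(outputs, targets):
    pos = [o for o, t in zip(outputs, targets) if t == 1]
    neg = [o for o, t in zip(outputs, targets) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_pairs([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1]))  # 1.0 (perfectly separated)
print(auc_pairs([0.4, 0.3, 0.2, 0.1], [0, 1, 0, 1]))  # 0.25
```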

This bug is now fixed. Thanks for spotting this!

@gforge gforge mentioned this issue Aug 11, 2016
@gforge
Contributor Author

gforge commented Aug 11, 2016

Thanks. The output:fill(1) was a leftover from debugging. If you dislike the randomness, you can add a torch.manualSeed(123) to make sure the test never randomly fails, although the chances should be small considering the sample size. I would love to use an example with a known correct answer, but since I'm an M.D. without formal ML training I don't have any material that I can use as a reference test. I've tried to Google it, but the AUCs I've found were classical calculations that didn't apply here.

@lvdmaaten
Contributor

Okay, yeah, let's fix the random seed for the test then. I don't think it's a good idea to have the unit tests fail with some non-zero probability, since we plan to rely increasingly on Travis to determine whether or not pull requests are okay.

Thanks for contributing these tests!

@gforge
Contributor Author

gforge commented Aug 11, 2016

Done. Thank you for the excellent package and your patience with my questions.
