This repository has been archived by the owner on Mar 3, 2020. It is now read-only.

The AUC-meter evaluates differently from classical statistics #42

Closed
gforge opened this issue Aug 4, 2016 · 6 comments

Comments

@gforge
Contributor

gforge commented Aug 4, 2016

I've finished writing a basic test suite for the meters and, apart from issue #41, I've encountered an unexpected problem with tnt.AUCMeter. The following test case should implement the classical AUC calculation based on this paper:

function test.AUCMeter()
   local mtr = tnt.AUCMeter()

   -- From http://stats.stackexchange.com/questions/145566/how-to-calculate-area-under-the-curve-auc-or-the-c-statistic-by-hand
   local samples = torch.Tensor{
      {33,6,6,11,2}, --normal
      {3,2,2,11,33} -- abnormal
   }
   for i=1,samples:size(2) do
      local target = torch.Tensor():resize(samples:narrow(2,i,1):sum()):zero()
      target:narrow(1,1,samples[2][i]):fill(1)
      local output = torch.Tensor(target:size(1)):fill(i)
      mtr:add(output, target)
   end

   local error, tpr, fpr = mtr:value()

   tester:assert(math.abs(error - 0.8931711) < 10^-3,
      ("The AUC error does not match: %.3f is not equal to 0.893"):format(error))
end

Unfortunately the computed AUC (0.704) is lower than the expected 0.893. I'm not familiar enough with ML to know whether the ML definition of AUC differs in some significant way, but the value 0.704 seems intuitively low (my apologies if I missed something in the code). Looking at how the AUC is calculated, there is a zero appended that could be pulling the value down.
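For reference, the expected 0.893 is the classical (Mann-Whitney) AUC for this table: the probability that a randomly chosen abnormal sample outscores a randomly chosen normal one, with ties counting half. A minimal standalone Python sketch of that hand calculation (not torchnet code):

```python
# Hand computation of the classical AUC for the table above
# (Mann-Whitney form: concordant pairs plus half the tied pairs).
normal   = [33, 6, 6, 11, 2]   # normal counts per score 1..5
abnormal = [3, 2, 2, 11, 33]   # abnormal counts per score 1..5

concordant = 0.0
for j, a in enumerate(abnormal):
    concordant += a * sum(normal[:j])   # abnormal strictly outscores normal
    concordant += 0.5 * a * normal[j]   # tied scores count half

auc = concordant / (sum(normal) * sum(abnormal))
print(round(auc, 7))  # 0.8931711
```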

@lvdmaaten
Contributor

The method of computing the AUC described in the article you linked to is a bit different from what is implemented here. Note how the article does linear interpolation between the actual observations. This is not entirely accurate: the linear interpolation is just an approximation of the AUC at the unobserved points. We use a more conservative, constant approximation: in our case, the plot would look like a step function that never lies above the linear-interpolation version (and is equal only at observed points).

As a result, the AUC we measure will always be lower than what the article's method computes. It is still possible that there is a bug in AUCMeter, but you'd need a different test case to check.
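The two rules can be compared directly on the ROC points of the example above. In this standalone Python sketch (an illustration, not the actual AUCMeter code), the conservative rule integrates the TPR at the left edge of each FPR step, which effectively drops the half-credit for tied scores and so lands below the trapezoidal (linear-interpolation) value:

```python
# ROC points for the example above, thresholding from the highest score down.
normal   = [33, 6, 6, 11, 2]   # negatives per score 1..5
abnormal = [3, 2, 2, 11, 33]   # positives per score 1..5
N, P = sum(normal), sum(abnormal)

fpr, tpr = [0.0], [0.0]
fp = tp = 0
for s in reversed(range(5)):          # scores 5, 4, 3, 2, 1
    fp += normal[s]; tp += abnormal[s]
    fpr.append(fp / N); tpr.append(tp / P)

# Conservative step rule: TPR at the left edge of each FPR increment.
step = sum(tpr[i] * (fpr[i + 1] - fpr[i]) for i in range(len(fpr) - 1))
# Trapezoidal rule: linear interpolation between observed points.
trap = sum(0.5 * (tpr[i] + tpr[i + 1]) * (fpr[i + 1] - fpr[i])
           for i in range(len(fpr) - 1))
print(round(step, 4), round(trap, 4))  # step < trap
```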

@gforge
Contributor Author

gforge commented Aug 5, 2016

Thanks for the explanation. I think I grasp the idea, but since I'm not entirely comfortable with the calculation I switched approaches: now I simply check that random guesses give an AUC close to 0.5 and a perfect classifier gives an AUC of 1. Unfortunately I still seem to be missing something, as I can't get the "perfect guess" case to work; my current test code is:

function test.AUCMeter()
   local mtr = tnt.AUCMeter()

   local test_size = 10^3
   mtr:add(torch.rand(test_size), torch.zeros(test_size))
   mtr:add(torch.rand(test_size), torch.Tensor(test_size):fill(1))
   local err = mtr:value()
   tester:eq(err, 0.5, "Random guesses should provide an AUC close to 0.5", 10^-1)

   mtr:add(torch.Tensor(test_size):fill(0), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.1), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.2), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.3), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(0.4), torch.zeros(test_size))
   mtr:add(torch.Tensor(test_size):fill(1), torch.Tensor(test_size):fill(1))
   err = mtr:value()
   tester:eq(err, 1, "Only correct guesses should provide an AUC close to 1", 10^-1)

   -- Simulate a random situation where all the guesses are correct
   mtr:reset()
   local output = torch.abs(torch.rand(test_size)-.5)*2/3
   mtr:add(output, torch.zeros(test_size))
   output = torch.min(
      torch.cat(torch.rand(test_size) + .75,
                torch.Tensor(test_size):fill(1),
                2),
      2)
   mtr:add(output:fill(1), torch.Tensor(test_size):fill(1))
   err = mtr:value()
   tester:eq(err, 1, "Simulated random correct guesses should provide an AUC close to 1", 10^-1)
end

I've tried several versions of this, with the estimate ending up around 0.75. I guess it's related to the step behavior, since it evaluates to 3/4, but in my mind the random attempt should smooth out the steps.
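For what it's worth, the intent of the simulated case can be checked outside the meter: the negative outputs are constructed to lie in [0, 1/3] and the positive ones in [0.75, 1], so the classes are perfectly separable and the pairwise AUC should be exactly 1. A standalone Python sketch of that sanity check (not using tnt.AUCMeter):

```python
import random
random.seed(123)

n = 200
# Negatives in [0, 1/3]; positives in [0.75, 1] -- non-overlapping ranges.
negatives = [abs(random.random() - 0.5) * 2 / 3 for _ in range(n)]
positives = [min(random.random() + 0.75, 1.0) for _ in range(n)]

# Every positive strictly outscores every negative, so the pairwise AUC is 1.
auc = sum(p > q for p in positives for q in negatives) / (n * n)
print(auc)  # 1.0
```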

@lvdmaaten
Contributor

The first unit test is a bit flaky because it contains randomness (maybe use an example for which you know the correct answer instead?). Also, note that you're missing a mtr:reset() between test 1 and test 2. (And are you sure you want the output:fill(1) in the last mtr:add(...)?)
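A tiny hand-checkable case works well for this. Sketching the pairwise AUC definition in Python with hypothetical data (an illustration, not the meter's code): with outputs [0.4, 0.3, 0.2, 0.1] and targets [0, 1, 0, 1], only one of the four positive/negative pairs is correctly ordered, so the AUC is exactly 0.25:

```python
# Pairwise (Mann-Whitney) AUC for tiny hand-checkable cases.
def auc_pairs(outputs, targets):
    pos = [o for o, t in zip(outputs, targets) if t == 1]
    neg = [o for o, t in zip(outputs, targets) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_pairs([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1]))  # 1.0 (perfectly separated)
print(auc_pairs([0.4, 0.3, 0.2, 0.1], [0, 1, 0, 1]))  # 0.25
```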

This bug is now fixed. Thanks for spotting this!

@gforge gforge mentioned this issue Aug 11, 2016
@gforge
Contributor Author

gforge commented Aug 11, 2016

Thanks. The output:fill(1) was a leftover from debugging. If you dislike the randomness, you can add a torch.manualSeed(123) to make sure the test never randomly fails, although the chances should be small considering the sample size. I would love to use an example with a known correct answer, but since I'm an M.D. without formal ML training I don't have any material that I can use as a reference test. I've tried to Google it, but the AUCs I've found were classical calculations that didn't apply here.

@lvdmaaten
Contributor

Okay, yeah, let's fix the random seed for the test then. I don't think it's a good idea to have the unit tests fail with some non-zero probability, since we plan to rely increasingly on Travis to determine whether or not pull requests are okay.

Thanks for contributing these tests!

@gforge
Contributor Author

gforge commented Aug 11, 2016

Done. Thank you for the excellent package and your patience with my questions.
