Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSD() gives wrong value when some element == 0 #4

Closed
wkc1986 opened this issue Apr 27, 2018 · 3 comments
Closed

JSD() gives wrong value when some element == 0 #4

wkc1986 opened this issue Apr 27, 2018 · 3 comments
Labels

Comments

@wkc1986
Copy link

wkc1986 commented Apr 27, 2018

Sorry to be a bother again. I'm using the GitHub version. Comparing against a test JSD function in base R, as well as calculations using H() and gJSD(), only JSD() gives a different result:

test.pairwise.jsd <- function(x, fn = "log") {
  h <- function(p, fn = "log") {
    pilogpi <- sapply(p, function(pi, fn) {
      if (pi == 0) {
        pi
      } else {
        pi * do.call(fn, list(x = pi))
      }
    }, fn = fn)
    -1 * sum(pilogpi)
  }
  mean.p <- apply(x, 2, mean)
  # browser()
  h(mean.p, fn) - mean(apply(x, 1, h, fn = fn))
}

xx <- rbind(c(1,0), c(0.5,0.5))
d1 <- test.pairwise.jsd(xx, "log2")
library(philentropy)
d2 <- JSD(xx)
d3 <- H(apply(xx, 2, mean)) - mean(apply(xx, 1, H))
d4 <- gJSD(t(xx))
d1
[1] 0.3112781
d2
jensen-shannon 
0.06127812 
d3
[1] 0.3112781
d4
[1] 0.3112781

@HajkD
Copy link
Member

HajkD commented May 1, 2018

Hi @wkc1986,

Thank you so so much for pointing out this issue and I will immediately have a look at it.
This clearly shouldn't happen and it seems that my unit tests didn't catch this...

I will start debugging now and will come back to you soon.

Please feel free to send me any other issue you might find.

Many thanks and best wishes,
Hajk

@HajkD HajkD added the bug label May 1, 2018
@HajkD
Copy link
Member

HajkD commented May 4, 2018

Hi @wkc1986,

I found the bug and fixed it.

The problem was the following:

In the initial Rcpp function jensen_shannon() I handled 0 values in vectors as following:

if((P[i] == 0.0) || (Q[i] == 0.0)){
                                if (P[i] == 0.0)
                                        sum1 += 0.0;
                                if (Q[i] == 0.0)
                                        sum2 += 0.0;
                        } else {
                        
                                PQsum =   P[i] + Q[i];
                                
                                if (unit == "log"){
                                        sum1  +=  P[i] * log((2.0 * P[i]) / PQsum);
                                        sum2  +=  Q[i] * log((2.0 * Q[i]) / PQsum);
                                }
                                
                                else if (unit == "log2"){
                                        sum1  +=  P[i] * custom_log2((2.0 * P[i]) / PQsum);
                                        sum2  +=  Q[i] * custom_log2((2.0 * Q[i]) / PQsum);
                                }
                                
                                else if (unit == "log10"){
                                        sum1  +=  P[i] * custom_log10((2.0 * P[i]) / PQsum);
                                        sum2  +=  Q[i] * custom_log10((2.0 * Q[i]) / PQsum);
                                } else {
                                        Rcpp::stop("Please choose from units: log, log2, or log10.");
                                }
                        }
                }

The problem was that I only consider the case P[i] == 0.0) || (Q[i] == 0.0). Meaning that either P[i] == 0 OR Q[i] == 0. I didn't consider the cases where P[i] == 0.0) && (Q[i] == 0.0) or where PQsum == 0.

I now consider all cases where 0 values can mess up the computations.

PQsum =   P[i] + Q[i];
                                
                                if (unit == "log"){
                                  if (P[i] == 0.0 || PQsum == 0.0) {
                                    sum1  += 0.0;
                                  } else {
                                    sum1  +=  P[i] * log((2.0 * P[i]) / PQsum);
                                  }
                                  if (Q[i] == 0.0 || PQsum == 0.0) {
                                    sum2  += 0.0;
                                  } else {
                                    sum2  +=  Q[i] * log((2.0 * Q[i]) / PQsum);
                                  }
                                }
                                
                                else if (unit == "log2"){
                                  if (P[i] == 0.0 || PQsum == 0.0) {
                                    sum1  += 0.0;
                                  } else {
                                    sum1  +=  P[i] * custom_log2((2.0 * P[i]) / PQsum);
                                  }
                                  if (Q[i] == 0.0 || PQsum == 0.0) {
                                    sum2  += 0.0;
                                  } else {
                                    sum2  +=  Q[i] * custom_log2((2.0 * Q[i]) / PQsum);
                                  }
                                }
                                
                                else if (unit == "log10"){
                                  if (P[i] == 0.0 || PQsum == 0.0) {
                                    sum1  += 0.0;
                                  } else {
                                    sum1  +=  P[i] * custom_log10((2.0 * P[i]) / PQsum);
                                  }
                                  if (Q[i] == 0.0 || PQsum == 0.0) {
                                    sum2  += 0.0;
                                  } else {
                                    sum2  +=  Q[i] * custom_log10((2.0 * Q[i]) / PQsum);
                                  }
                                } else {
                                        Rcpp::stop("Please choose from units: log, log2, or log10.");
                                }

When running your initial example I now get all the same values:

test.pairwise.jsd <- function(x, fn = "log") {
  h <- function(p, fn = "log") {
    pilogpi <- sapply(p, function(pi, fn) {
      if (pi == 0) {
        pi
      } else {
        pi * do.call(fn, list(x = pi))
      }
    }, fn = fn)
    -1 * sum(pilogpi)
  }
  mean.p <- apply(x, 2, mean)
  # browser()
  h(mean.p, fn) - mean(apply(x, 1, h, fn = fn))
}

xx <- rbind(c(1,0), c(0.5,0.5))
d1 <- test.pairwise.jsd(xx, "log2")
library(philentropy)
d2 <- JSD(xx)
d3 <- H(apply(xx, 2, mean)) - mean(apply(xx, 1, H))
d4 <- gJSD(t(xx))
d1
[1] 0.3112781
d2
jensen-shannon 
     0.3112781 
d3
[1] 0.3112781
d4
[1] 0.3112781

Thanks to your excellent example I also unraveled that the Rcpp function jensen_difference() has the same issue and I also fixed it. Meaning that JSD(xx) and distance(x = xx, method = "jensen_difference", unit = "log2") will return the same values even when 0 values are present.

JSD(xx)
jensen-shannon 
     0.3112781
distance(x = xx, method = "jensen_difference", unit = "log2")
jensen_difference 
        0.3112781 

I will now extend the unit tests and write more extensive test for 0 values and will also go over all other distance functions in that context.

Once more many many thanks for making me aware of this issue, I truly appreciate it :))))

Best wishes,
Hajk

@wkc1986
Copy link
Author

wkc1986 commented May 4, 2018

Glad to be of help! Your package is the fastest way I know of to calculate JSD matrices and generalized JSD in R, so I'm happy to see it ticking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants