
CCDCesque critical value specification #66

Closed

ceholden opened this issue Nov 25, 2015 · 6 comments

Comments

@ceholden
Owner

The "test statistic" used in CCDC is the square root of the sum of squared scaled residuals for all bands tested. This "test statistic" isn't normalized by how many bands you're adding, so the critical value needs to depend on how many test indices you have. If you use more indices to test with, then you'll need to increase the critical value by some amount.

Zhu derives the "test statistic" critical value for p=0.01 and k=len(test_indices) using the inverse survival function of the scipy.stats.chi distribution, since the statistic is (roughly) the square root of a sum of squares of normally distributed variables (the scaled residuals).
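
For reference, a minimal sketch of that lookup (scipy's chi distribution; the helper name and the example index lists below are just for illustration, not actual CCDCesque configuration):

from scipy import stats

def critical_value(p_value, test_indices):
    # Inverse survival function of the chi distribution with
    # k = len(test_indices) degrees of freedom
    return stats.chi.isf(p_value, df=len(test_indices))

# p = 0.01 with 3 vs. 4 test indices
print(critical_value(0.01, [3, 4, 5]))     # ~3.368
print(critical_value(0.01, [2, 3, 4, 5]))  # ~3.644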

Questions of independence, normality, statistical soundness, etc. aside, my biggest concern is that we don't really care about finding change according to some formal null hypothesis testing framework. CCDC is, at best, vaguely statistical and we've never analytically or numerically explored what the distribution of the "test statistic" is under the null hypothesis of no change. However, using scipy.stats.chi.isf does convey the important message that the critical value depends on how many bands are being tested.

So, the proposed solutions include:

  1. Better documentation explaining this!
  2. Convert threshold parameter to p_value and retrieve the threshold using scipy.stats.chi.isf for a given number of test_indices
  3. Accept both threshold and p_value with threshold being the default input that overrides p_value if both are specified
    • If the user specifies p_value, then we back out threshold against the chi distribution
    • If threshold is specified, well then there's no difference versus current behavior.
    • If the user specifies both parameters, we keep threshold as being authoritative and warn the user

Thoughts, @bullocke, @valpasq, @parevalo, and @xjtang?

@bullocke
Collaborator

I've actually thought a little bit about this too. In the short term, I get the feeling that most of the thresholds users are choosing are found through trial and error. So in terms of selecting the correct threshold, I am not sure changing it to a p value will make much of a difference (although it would seem a little more statistical). That does raise the question, however, of whether trial and error will really yield better results than deriving the threshold from a chi distribution. That being said, it is something important to consider, and documentation could definitely be useful.

What I think would be nice is some sort of well thought out workflow for choosing which bands to use and what threshold to set. This would be something done for each scene, and a little more involved than just clicking around in TSTools and changing the threshold. If we could take the time to document these 'change scores' or residuals for each band across the different land cover classes and change classes, we could more easily test what the threshold should be in a robust fashion. Thoughts?

@valpasq
Collaborator

valpasq commented Nov 25, 2015

I am in favor of converting the threshold parameter to p_value and retrieving the threshold using scipy.stats.chi.isf for a given number of test_indices.

I definitely understand the concern that this could be misleading/confusing since we are not actually finding change using a formal null hypothesis testing framework, but I think the chi distribution value we are currently using is no less confusing and most of us are selecting by trial and error in a less than meaningful way. By internally selecting the critical value based on a user-defined p-value and number of test indices, we would (a) ensure comparability across model runs with variable number of test indices (something that is currently easy to overlook!) and (b) put the critical value into terms that are likely more familiar to users (i.e. I think it would be easier to explain that a p-value of 0.1 will be more sensitive to change compared to a p-value of 0.01 versus trying to explain where the threshold of "3.368" came from).

I also agree that there should be better documentation explaining the interpretation of the p-value/critical value in this context and how it is selected. In the next couple of days, I plan to write up what I learned yesterday from my discussion with Chris about all of the inputs, as a starting point.

To Eric's point about a workflow, I am also working on some test runs comparing change results for different parameter sets. I'm using a relatively small subset (~500,000 pixels) in the Broadmoor/Boston area, but it should be interesting to see if/how tweaking different parameters, fit types, and design matrices will impact the change results. I'll keep you both posted on what I find, and if there are any useful lessons learned that we could share as examples in the documentation. Once I have a good process for comparing/more standard set of model runs, might also be worth running a few other subsets for further comparison.

@ceholden
Owner Author

Some arguments against, which reflect my current opinion:

  1. Even if you pick a p value, you still have to pick it using trial and error. This just kicks the can down the road to a newly renamed parameter, and the process is basically just as ad hoc.
  2. p values have a very real meaning and it is more confusing to users to assert that we have a probabilistic interpretation for this parameter. Leaving it as threshold seems correctly "less statistical"

We could also just allow both in the CCDCesque configuration. If the user specifies p_value, then we back out threshold against the chi distribution. If threshold is specified, well then there's no difference versus current behavior. If the user specifies both parameters, we keep threshold as being authoritative and yell at the user for being an idiot \s. Only downside is this might add more confusion than currently exists.
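
A rough sketch of what that resolution logic could look like (hypothetical parameter handling only, not the actual CCDCesque code):

import warnings
from scipy import stats

def resolve_threshold(test_indices, threshold=None, p_value=None):
    # `threshold` is authoritative; `p_value` is backed out of the chi
    # distribution with k = len(test_indices) degrees of freedom
    if threshold is not None:
        if p_value is not None:
            warnings.warn('Both `threshold` and `p_value` given; using `threshold`')
        return threshold
    if p_value is not None:
        return stats.chi.isf(p_value, df=len(test_indices))
    raise ValueError('Specify either `threshold` or `p_value`')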

I'm definitely in favor of better documentation (see #35 for documenting how we go about selecting parameters). Valerie's test runs ought to also give us some actual numbers behind what's just been a speculative sense of the sensitivity of each parameter, and this would make for some great documentation material.

I'm also up for better framing the threshold parameter by either running some sort of simulation experiment or by analyzing what kinds of change processes produce what magnitudes of change. I've always very much liked Figure 4 in Jan's 2012 BFAST paper, where he shows the probability of detecting a change given d observations of the disturbance and magnitude m of said disturbance as a function of the noise in the timeseries. Doing this sort of thing in multivariate space probably complicates things, but it would be great if we could at least demonstrate some simple examples. There's also a bit of a replication analysis to be done in looking at which bands are most affected by certain types of change processes (I think Mike Wulder showed this? Or was it Warren Cohen?)
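
As a starting point, a toy Monte Carlo along those lines (purely illustrative: it assumes the scaled residuals really are N(0, 1), a disturbance that shifts every tested band by the same amount in RMSE units, and a simple "all consecutive observations exceed the critical value" rule; the parameter values are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, consec, p = 4, 6, 0.01            # bands tested, consecutive obs, nominal p
crit = stats.chi.isf(p, df=k)        # per-observation critical value
n_sim = 10000

def detection_rate(magnitude):
    # Scaled residuals: N(0, 1) noise plus a shift of `magnitude` in every band
    z = rng.normal(magnitude, 1.0, size=(n_sim, consec, k))
    stat = np.sqrt((z ** 2).sum(axis=2))       # chi-style statistic per observation
    return (stat > crit).all(axis=1).mean()    # change flagged only if all exceed

for m in (0.0, 0.5, 1.0, 2.0):
    print(m, detection_rate(m))

With magnitude 0 this gives the empirical false alarm rate of the whole consecutive-observation rule in the toy setup, which lands far below the nominal per-observation p, which is one small example of the kind of null-hypothesis behavior we've never really characterized.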

@parevalo
Contributor

I think it makes more sense to continue using the threshold and try to figure out the possible sensible ranges and their outcomes for a given group of parameter sets. I believe that using the p value only adds one more layer of complexity, gives it a "statistical look" that it really doesn't have, and doesn't solve the problem of having to pick it by trial and error. Ultimately, I think that only a sensitivity analysis similar to what Val is running will give us some guidelines on how to choose the threshold (and other parameter values). Maybe then we will be able to create some kind of table/plot/widget in which we can show, for example, how playing with certain ranges of values makes change detection more or less likely to happen.

@valpasq
Collaborator

valpasq commented Nov 25, 2015

I guess my main concern with the current threshold is the lack of insight into how to effectively change values when you change the number of test indices...

I think the p-value approach is advantageous because it handles the connection to the test_indices under the hood. This way, you can set a constant p-value, change the number of test indices, and still produce comparable results. With the threshold set the way it is now, it is up to the user to understand that the threshold value must change depending on the number of test indices (so if you want a comparable result for runs with different numbers of test indices, you must change two parameters [test_indices and threshold] instead of one [test_indices, keeping p-value constant]). And unless you do use values from the chi (not chi-squared) distribution for a particular p-value, it's hard to determine comparable thresholds, e.g. Chris helped me figure out that for p=0.01, if you have 3 test indices (k=3), threshold≈3.368, while if you have 4 test indices (k=4), threshold≈3.644. I would have never come up with those numbers through trial and error, but those are the numbers I needed to try to compare results from 3 indices to results from 4 indices. I might go back and look up values for p=0.05, but I still need them to come from the distribution so they are comparable across different numbers of test indices.

If there is a way to account for differing numbers of test indices in another way, that could work too. For example, at one point yesterday, I think we talked about dividing the L2 norm by the number of test bands (or something like that?). Anything that ensures that when I change the test indices, my comparison to the threshold is adjusted accordingly would address my concern.
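
One crude sketch of that idea (hypothetical, and only approximate: dividing the chi-style statistic by sqrt(k) makes its no-change scale roughly constant across k, but the matching critical value still drifts a bit with k, so the p-value lookup is the exact fix):

import numpy as np

def normalized_test_statistic(scaled_resids):
    # scaled_resids: one residual / RMSE value per test index
    scaled_resids = np.asarray(scaled_resids, dtype=float)
    k = len(scaled_resids)
    return np.sqrt(np.sum(scaled_resids ** 2)) / np.sqrt(k)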

Maybe another option would be to abstract away from a p-value per se. I mean, when you select a threshold, you are in some sense arbitrarily relating it to the chi (not chi-squared) distribution. Maybe we could have a range of values that correspond to p-values but aren't p-values. For example, maybe threshold simply equals p-value*100 such that a threshold of 5 (p=0.05) would be more sensitive to change than a threshold of 1 (p=0.01). Under the hood, the input could be divided by 100 to convert to a p-value for lookup of the critical value, but the user wouldn't be left with the impression that they are specifying a p-value. Something like that.
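
Something like the following (again hypothetical; the names are illustrative):

from scipy import stats

def critical_from_sensitivity(sensitivity, test_indices):
    # User supplies `sensitivity` = 100 * p, e.g. 5 -> p = 0.05, 1 -> p = 0.01;
    # internally convert back to p and look up the chi critical value
    return stats.chi.isf(sensitivity / 100.0, df=len(test_indices))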

I might also just be over-complicating this. I don't know how many users will actually want to produce and compare maps from different test indices. But after doing a couple of runs changing my inputs to see whether the original bands v. BGW produced different results, only to find that the results were not actually a fair comparison unless I adjusted the threshold accordingly, I think it is worth coming up with a better way of handling the relationship between threshold and test_indices going forward. I'm all for better documentation, but I think there should also be safeguards in case people don't read/understand the parameters the way we would hope.

@ceholden
Owner Author

ceholden commented Dec 1, 2015

OK, long post below outlining why I'm going to eliminate option 2 as a candidate for the default method of specifying the threshold. This leaves 1 or 3.

I'm in favor of 3 as an accommodation since it's so easy to implement. Any reasons for objection?


I think I understand the argument in favor of 2, but I disagree that we should be simplifying things just to make the parameter more approachable, to keep people from tripping up, or to make it more of a "set and forget" parameter. There are far too many knobs to turn in CCDC for it to be approachable anyway, and I'm afraid hiding any of them would turn the experience into, "The defaults didn't work so I'm giving up". There really aren't any defaults (except that there has to be something to show the datatype/etc.)!

The larger problem is that the chi distribution isn't suited for this purpose anyway. The chi distribution describes the square root of a sum of squares of standard normal (N(0, 1)) variables, and our scaled residuals aren't actually N(0, 1). The two criteria:

  1. 0 mean of the residuals / rmse
    • More or less met. The mean of the residuals isn't actually 0 because we're using Lasso, but we'll ignore that for now
  2. 1 standard deviation of residuals / rmse
    • Normally when you scale residuals, you scale them by the standard deviation of the residuals, not some measure of the model error (scaling by 1 / std guarantees you have std=1)
    • The calculation for RMSE and residual std are basically identical, but RMSE compares y against yhat while residual std compares resid against mean(resid)
    • I ran a few pixels and compared the two metrics for several models; the RMSE and std(resid) are somewhat close sometimes, but they're not equivalent (see below)

Simple code example in R using the mtcars dataset:

> # OLS fit: mpg as a function of hp
> fit <- lm(mpg ~ hp, data = mtcars)
> # RMSE: compares y against yhat, i.e. sqrt(mean(resid^2))
> rmse <- mean(residuals(fit) ^ 2) ^ 0.5
> # Residual std: spread of resid about mean(resid), with an n - 1 denominator
> resid_std <- var(residuals(fit)) ^ 0.5
> print(c(rmse, resid_std))
[1] 3.740297 3.800146

So if we're not actually summing N(0, 1) data, I'm going to rule against the p_value implementation on a technicality, as the self-proclaimed BDFL.

It seems quite a bit of a shame though to have to dismiss it on a technicality. Maybe this leads to some discussion about reshaping the "test statistic". We could, for example, just use the std(resid) instead of RMSE or maybe even do some sort of robust standard deviation estimate instead, but that's all for another ticket and larger group discussion.
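
For example, a rough sketch of what swapping out the scaling term could look like (illustrative only; the MAD-based estimate is one common robust choice, not necessarily what we'd pick):

import numpy as np

def scale_residuals(resid, method='rmse'):
    resid = np.asarray(resid, dtype=float)
    if method == 'rmse':
        scale = np.sqrt(np.mean(resid ** 2))       # current behavior: model RMSE
    elif method == 'std':
        scale = np.std(resid, ddof=1)              # std of residuals about their mean
    elif method == 'robust':
        # Median absolute deviation, scaled to match the std for normal data
        scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    else:
        raise ValueError(method)
    return resid / scale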

@ceholden ceholden closed this as completed Dec 7, 2015