Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimating confidence that a breakpoint occurs in a 1D array #135

Closed
spencerahill opened this issue Mar 9, 2021 · 6 comments
Closed

Estimating confidence that a breakpoint occurs in a 1D array #135

spencerahill opened this issue Mar 9, 2021 · 6 comments

Comments

@spencerahill
Copy link

I have a few hundred 1-D timeseries, each ~120 points. Each timeseries may or may not have one or more breakpoints caused by changes in instrumentation, with the timing and nature of the instrumentation change differing from timeseries to timeseries and not known in advance for any of them. I am interested in estimating some measure of confidence for each timeseries that at least one breakpoint exists for that timeseries (in addition to when that breakpoint occurs).

What would you recommend using for this? This sounds to me like a statistical test based on BinSeg with n=1 breakpoints, but I'm new to breakpoints overall and to the ruptures package and so it's not obvious to me if that's correct conceptually nor how to do that with ruptures. Apologies if I'm moving too quickly and thus missing something clear in the docs.

@spencerahill
Copy link
Author

spencerahill commented Mar 9, 2021

Edit: the fact that there are multiple timeseries is irrelevant; ultimately the question boils down to computing this for a single timeseries. (Since, c.f. #136, I can efficiently broadcast the calculation from one to all the rest of my timeseries). Thanks!

@oboulant
Copy link
Collaborator

oboulant commented Mar 10, 2021

Hi @spencerahill ,

Thanks again for your interest in ruptures !

You can indeed use the BinSeg search method. You can also use the Kernel change point detection with the Pelt algorithm search method (see here for an example).

Pelt allows you to compute an unknown number of change points.

As for the measure of confidence, not knowing anything about the underlying data, what I would do for a start is :

  1. For a several values of the Pelt penalty, compute change points using ruptures Kernel change point detection with the Pelt algorithm
  2. For each Pelt penalty, compute the returned number of change points
  3. Plot the graph (Pelt penalty value, number of returned change points)
  4. Study if there is a "discontinuity" in the plot and/or how hard it is to "constraint" the algorithm the return 0 change points with a higher penalty value.

Hope it is clear ?

Finally, since this is more of a discussion than an issue with the package, could you please move this to the discussion section ?

Olivier

@deepcharles
Copy link
Owner

I am interested in estimating some measure of confidence for each timeseries that at least one breakpoint exists for that timeseries (in addition to when that breakpoint occurs).

This is clearly something that ruptures should propose (or at least document). We'll work on it in the coming weeks.

This setting is known as AMOC (at most one change). If you have iid Gaussian data, you can have an exact p-value for the presence (or absence) of a change. See this recent article. For a simpler approach, you can use multiple hypothesis testing procedures (that control the false discovery rate or the familywise error rate). Thoses approaches can handle more complex data and provide p-values, but are approximate. You can also use AIC or BIC (or any other model selection methods) but those will not give you p-values, just a binary decision (change or no change).

I can dig up a few references for the multiple hypothesis testing procedures (which also happen to be a research interest of mine), but this will have to wait until next week.

Cheers

@deepcharles
Copy link
Owner

Note that all those procedures (except for the first one, the exact computation of a p-value) do not need much additional code.

@spencerahill
Copy link
Author

Thanks both for the detailed responses.

This is clearly something that ruptures should propose (or at least document). We'll work on it in the coming weeks.

In light of this, perhaps it's OK to leave this as an Issue rather than move to Discussion? Either way fine just let me know.

This setting is known as AMOC (at most one change)

Perhaps this is just a nomenclature confusion on my part, but I'm interested in at least one change (is "ALOC" a thing?)

As for the measure of confidence, not knowing anything about the underlying data, what I would do for a start is :

  1. For a several values of the Pelt penalty, compute change points using ruptures Kernel change point detection with the Pelt algorithm
  2. For each Pelt penalty, compute the returned number of change points
  3. Plot the graph (Pelt penalty value, number of returned change points)
  4. Study if there is a "discontinuity" in the plot and/or how hard it is to "constraint" the algorithm the return 0 change points with a higher penalty value.

That makes sense. I'll let you know once I've given a try if I have any problems with it. In terms of the data, it's a mixed bag. The ones that don't have an obvious break by eye are pretty fairly Gaussian. But others by eye you can see a fundamental shift in both mean and variance leading to weird, sort-of bimodal distributions. And still others are intermediate.

@deepcharles
Copy link
Owner

Perhaps this is just a nomenclature confusion on my part, but I'm interested in at least one change (is "ALOC" a thing?)

If the task consists in labelling certain signals as "contain one change or more" vs "do not contain any change", I guess both setting are equivalent.

In any case, I will close this issue for now. Feel free to reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants