Stochasticity #2

Closed
benfulcher opened this issue Oct 27, 2014 · 6 comments

@benfulcher
Owner

Some operations produce outputs that depend on the random seed, so running the same operation on the same time series multiple times can give different results.
A solution to this is required. One option is to give each non-deterministic function a random seed input, allowing reproducible results; if none is provided, the function could default to calling rng('default') at its start.
I should implement this as a priority going forward.
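
A minimal sketch of that proposal, written in Python purely for illustration (the library itself is Matlab). The function name `bootstrap_mean` and the default seed value of 0 are assumptions made for the example, not part of HCTSA:

```python
import random


def bootstrap_mean(y, n_resamples=200, seed=0):
    """Average of bootstrap-resampled means of y (a stochastic statistic)."""
    # Reset the global generator at the start of the call, analogous to
    # calling rng('default') at the top of a Matlab function.
    # seed=0 is an assumed default chosen for this sketch.
    random.seed(seed)
    n = len(y)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, then record the resample's mean.
        sample = [random.choice(y) for _ in range(n)]
        means.append(sum(sample) / n)
    return sum(means) / n_resamples


y = [0.3, 1.2, -0.7, 2.1, 0.0, 0.9]
# Same input and same (default) seed: the result is reproducible.
print(bootstrap_mean(y) == bootstrap_mean(y))
```

Because the seed is reset inside the function, repeated calls with the same arguments return the same value, at the cost of touching global generator state on every call.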

@benfulcher benfulcher added the bug label Oct 27, 2014
@benfulcher benfulcher self-assigned this Oct 27, 2014
@sdvillal

How are you planning to do this? It seems to me that Matlab's positional-only argument specification (that is, the lack of support for named parameters) makes this change (and, in general, any other parameter addition or removal) challenging.

Options that come to my mind:

  1. Make seed the first parameter of the function after the time series. I think this is the most sensible, as seed is a required parameter. It would also make adapting the Python bindings very easy.
  2. Make seed the last parameter of the function. I think this is not really an option.
  3. Start using inputParsers (but please, not varargins ;-))
  4. Change the signature of all HCTSA functions to accept two parameters: the time series and a struct of parameters. If no parameters are passed, a struct of default parameters can be returned. That, I think, would make the library easier to maintain.
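
Option 4 could look roughly like the following sketch, written in Python for illustration, with a dict standing in for the Matlab struct. The names `default_params` and `subsample_mean`, the parameter keys, and the reading of "no parameters" as "called with no arguments at all" are all assumptions made for the example:

```python
import random


def default_params():
    """Default parameter 'struct' (hypothetical keys for this sketch)."""
    return {"n_picks": 3, "seed": 0}


def subsample_mean(y=None, params=None):
    """Mean of n_picks randomly chosen values of y.

    Called with no arguments, returns the default parameters instead.
    """
    defaults = default_params()
    if y is None:
        return defaults
    params = params or {}
    unknown = set(params) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    merged = {**defaults, **params}  # user values override defaults
    rng = random.Random(merged["seed"])
    picks = [rng.choice(y) for _ in range(merged["n_picks"])]
    return sum(picks) / merged["n_picks"]


y = [0.3, 1.2, -0.7, 2.1, 0.0, 0.9]
print(subsample_mean())                # the default parameters
print(subsample_mean(y, {"seed": 5}))  # override only the seed
```

With this layout the seed can be changed without having to spell out every other parameter, which is the difficulty with purely positional arguments.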

@benfulcher
Owner Author

Thanks for the ideas, Santi. I'm now preparing to do this next, as a priority. Yes, this is where Matlab functions are a bit restrictive -- would be great to have named parameters in this case. Anyway, this is where we are -- in Matlab land for now.

For the purposes of how the engine works, I don't see a problem with just resetting the random seed at the beginning of every function call. This could be done: (i) in TSQ_brawn (i.e., not requiring any changes to any functions at all), (ii) inside all stochastic functions. As long as the seed is reset the same way each time, this would ensure reproducibility.

Regarding your options for (ii):

  1. This works if we reckon that seed is a required parameter, but I expect most users won't want to set it and would rather it default to a particular seed.
  2. The last parameter could work (if we view it as optional), but yeah, this could get messy if inputs to functions grow/change with time (although this will probably be minimal -- ideal is for function files not to change too much, at least in their structure).
  3./4. Yeah, could use inputParsers (thanks for the link -- have never used this previously). Could also use a standardized first input, which could be a structure, containing the raw time series, the z-scored time series (potentially a whole string of processed versions of the time series that could then be accessed by all functions), a random seed, etc. and other potential generic data inputs. But this would limit the generality of these functions for people wanting to use them outside this context (making their format quite opaque for quick and ready use). Could get around this by having a general function that parses this input data structure, but also deals with the case that a time-series vector is provided (as the current first input). This could be a good investment going forward, when we might want to start using more details about the time-series object (e.g., unevenly sampled data, or providing a custom sampling rate that could be used in some functions, etc.)

For now I'm favouring (i) as a quick-to-implement solution, provided it's not too slow (or else 1./2.), while looking towards a future in which the first input is a structure (that can also handle the case where just a vector is provided).
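
The "structure or plain vector" first input described above could be handled by a single parsing helper, roughly as in this Python sketch. The helper name `parse_input`, the field names, and the use of the sample standard deviation for z-scoring are assumptions made for the example:

```python
from statistics import mean, stdev


def parse_input(data, default_seed=0):
    """Normalise the first input into a dict of time-series fields.

    Accepts either a plain sequence of numbers or a dict that already
    contains at least a 'raw' field (hypothetical field names).
    """
    ts = dict(data) if isinstance(data, dict) else {"raw": list(data)}
    raw = [float(v) for v in ts["raw"]]
    ts["raw"] = raw
    # Fill in generic fields only if the caller did not supply them.
    ts.setdefault("seed", default_seed)
    ts.setdefault("sampling_rate", None)
    if "zscored" not in ts:
        mu = mean(raw)
        sigma = stdev(raw) if len(raw) > 1 else 0.0
        ts["zscored"] = (
            [(v - mu) / sigma for v in raw] if sigma > 0 else [0.0] * len(raw)
        )
    return ts


print(parse_input([1, 2, 3])["zscored"])
print(parse_input({"raw": [1, 2, 3], "seed": 7})["seed"])
```

Each operation would call the helper on its first argument, so existing calls that pass a bare vector keep working while richer inputs become possible.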

Of course, external C++/Fortran mex functions that use randomization will not be reproducible, unless I go into their sources and impose seed resetting...

@benfulcher
Owner Author

This is underway in the DeterministicStochasticity branch (sub-branch of OperationChanges). For now, this will happen through a uniform BF_ResetSeed function that all stochastic operations will call, with inputs specifying how the seed is reset (the default will be the Matlab default: reset to Mersenne Twister with seed 0). I am yet to specify defaults or incorporate input arguments controlling this, but hopefully I'll find time to get to this by the end of the week.
As I said, in the longer term I'm looking to implement a consistent first-input data structure that could specify something like this (also a sampling rate, or a time vector for unevenly sampled data, for some custom additional operations for specific domain applications, etc.), but for now this is a workable solution that doesn't require large-scale changes to the current architecture.

@sdvillal

If I understand correctly, touching TSQ_brawn would not help library users. So not useful to me ;-). But that is indeed the approach I would initially take in pyopy: reset the global rng state to the specified seed (in Python land) before running the computation. That is ugly in two ways: 1) touching global state is ugly per se; 2) having seed-unaware random functions makes it a tad more difficult to batch operation calls deterministically (e.g. one would need to interleave calls to set the seed between calls to the operator) and with clear provenance (as the seed is always part of the signature of the operation, even if implicit).

So for me, the only way to get this correct is to get seed-aware functions (ii).

Making the seed the last parameter of the functions is definitely less work, so probably the best way to go.

My thinking is: seed at the end means forcing users to specify all other parameters if they want to play with stochasticity (with what that means in functions where the defaults are something more elaborate than a number or a string). This is always true for any nth parameter when using vanilla Matlab function dispatch. So the question would be: what will a user want to change from the default parameters more often? That depends on the scenario, so there is no good answer. I often will want to change the seed when I find an interesting (aka discriminative) feature that is stochastic, instead of touching any of the other parameters, to check whether the appeal of the feature is just because Mars was aligned with Pluto or whether it really encodes something useful.

What I would definitely try, regardless of where the seed parameter goes, is:

  • making an absent or negative seed mean "use clock-based seed"
  • using a local random number generator initialised with whatever seed whenever possible - that is, for those operators in which you control the random sampling bit.
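
Both suggestions can be sketched in a few lines of Python. The helper `make_rng` and the example operation `shuffled_lag1_product` are hypothetical names used only for this illustration:

```python
import random
import time


def make_rng(seed=None):
    """Return (local generator, seed actually used).

    A missing or negative seed means "use a clock-based seed".
    """
    if seed is None or seed < 0:
        seed = time.time_ns()
    return random.Random(seed), seed


def shuffled_lag1_product(y, seed=None):
    """Sum of products of neighbouring values after a random shuffle."""
    rng, used_seed = make_rng(seed)
    shuffled = list(y)
    rng.shuffle(shuffled)  # uses the local generator, not global state
    value = sum(a * b for a, b in zip(shuffled, shuffled[1:]))
    # Returning the seed keeps the provenance of the result explicit.
    return value, used_seed


y = [0.3, 1.2, -0.7, 2.1, 0.0, 0.9]
value, used = shuffled_lag1_product(y)  # clock-based seed
# Re-running with the reported seed reproduces the value.
print(shuffled_lag1_product(y, seed=used)[0] == value)
```

Since the generator is local, calls can be batched without interleaving seed resets, and the global generator state is left untouched.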

@sdvillal

My tests tell me that these operators need to be tagged as stochastic:

  • DN_SimpleFit
  • MF_FitSubsegments
  • MF_arfit
  • MF_CompareAR
  • MF_hmm_fit
  • MF_hmm_CompareNStates
  • MF_GARCHcompare
  • MF_StateSpace_n4sid
  • NW_VisibilityGraph
  • PP_Compare
  • PP_Iterate
  • SB_TransitionpAlphabet
  • SP_Summaries

There may be false positives: I might have bugs, and other sources of stochasticity might be in play. I might also be missing some because of the simplicity of the series I'm using to test. I also still have some operators failing to compute.

I will take a second look, hopefully at the end of the week, but of course nothing can replace careful thinking.
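
For reference, a repeat-and-compare check of this kind can be sketched as follows. This is an illustrative Python sketch, not the actual test code used above; `is_stochastic` and its NaN handling are assumptions made for the example:

```python
import math
import random


def _same(a, b):
    """Equality that treats two NaN outputs as equal."""
    both_nan = (
        isinstance(a, float) and isinstance(b, float)
        and math.isnan(a) and math.isnan(b)
    )
    return both_nan or a == b


def is_stochastic(operation, y, n_runs=3):
    """True if repeated calls on the same input give differing outputs."""
    first = operation(y)
    return any(not _same(operation(y), first) for _ in range(n_runs - 1))


y = [0.3, 1.2, -0.7, 2.1, 0.0, 0.9]
print(is_stochastic(sum, y))                          # deterministic
print(is_stochastic(lambda ts: random.choice(ts), y)) # draws a random value
```

A check like this can only flag operations whose outputs happen to differ on the series tested, so operations that are random but give identical outputs on a simple series would be missed.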

@benfulcher
Owner Author

The only remaining (labeled) stochastic operations are the fractaldimensions operations (which rely on mexed code from TSTOOL). I don't know how to control random seeds in C++ (e.g., in gendimest).
However, everything using random numbers in Matlab now has a controlled random seed (using the function BF_ResetSeed), which is specified as an input to most functions and defaults to 'default', which resets the seed via rng(0,'twister').
