[ENH] support for dictionaries and/or namedtuples #385
Comments
I'm certainly interested and I've played around with this in the past a little bit. One issue I hit was that the computational overhead of serializing structured data was significant enough that it wasn't worth the effort. So it would be worth thinking about the best interface for this (e.g. does it make sense for this to be implemented within emcee vs. as a front-end wrapper?). In general, I'm not interested in emcee becoming a fully-fledged modeling language, so I would normally come down on the side of leaving such features to downstream libraries (for example, such a library could map from the vector-space representation to the structured representation that you're suggesting), but I'm definitely interested to hear about issues that that would present.
Understandable @dfm, and I agree completely with your bullet points in that page. The appeal of emcee is the simple API, and the fact that even a new python user can use it just by supplying a callable. I do think, though, that any user wishing to build a robust model like the one I described above is limited by the sampler's ability to take in only a list of numbers. Look at it this way -- suppose I was working on a research project and I wasn't sure how many model parameters I was going to have. With the current API, every time I come up with a new model I have to write a translation step that goes from "column index in the chain" to "argument position in my model". This is repetitive and prone to errors. How many research hours have been lost to plugging values into the wrong positional slot?

As far as serializing structured data, I don't think that should be a concern. For instance, creating a million dictionaries of a dozen entries each (roughly the size of the problems emcee is meant to handle) only takes about 1.5 seconds:

```python
values = [3.14, 2.71, 42, 0, 1, 1.1, 10000., -999, 123, 456., 789., 1e20]
keys = ["a" + str(i) for i in range(len(values))]

def foo():
    d = {k: v for k, v in zip(keys, values)}

if __name__ == "__main__":
    print(f"Timing dict creation for: {len(keys)} numbers")
    import timeit
    print(timeit.timeit("foo()", setup="from __main__ import foo"))

# >>> Timing dict creation for: 12 numbers
# >>> 1.477253458
```

Surely that is a minuscule price to pay compared to any non-trivial posterior, but perhaps this isn't what you meant by serializing. Can you clarify?
I agree with this at some level, but I also don't think it's obvious how to fix this for everyone without designing a fully-fledged modeling language and requiring users to buy in. For example, perhaps I want some of my parameters to be tensors, not scalars. There are libraries that have solved that (e.g. PyMC3 bijections), but this tends to require a pretty large code base to cover the range of features that users would expect. From my perspective, it's not clear that this needs to be part of emcee (and I'm certainly not very keen to be on the hook to maintain such a code base) when instead it could be a downstream library offering these features. Can you say more about what benefit you would expect from this being built into emcee vs. an independent library offering a shim? That would remain my suggested proposal (and I'd be happy to include a link to such a library from the emcee docs), but please let me know what use case such a structure wouldn't solve.
In the default backend, emcee stores the chains in pre-allocated numpy arrays of the right shape and writes directly to those arrays, minimizing the number of copies and resizes. That's what you'd need to compare to. I've tried several experiments to make this more flexible (using structured arrays or dicts as interfaces) and found the computational overhead and memory usage to be surprisingly large. I don't remember the specific details or numbers, but I just wanted to warn you that I've gone down this path a number of times and abandoned it :D
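To make the comparison concrete, here is a toy benchmark (not dfm's original experiments; shapes, names, and iteration counts are arbitrary) contrasting a direct write into a pre-allocated chain array with building one dict per walker for the same sample:

```python
import timeit
import numpy as np

nsteps, nwalkers, ndim = 1000, 32, 12
chain = np.empty((nsteps, nwalkers, ndim))  # pre-allocated, like the default backend
sample = np.random.rand(nwalkers, ndim)
names = ["a" + str(i) for i in range(ndim)]

def store_array(step):
    # One vectorized copy into the existing memory block.
    chain[step] = sample

def store_dicts():
    # One fresh dict per walker, as a structured interface might build each step.
    return [dict(zip(names, row)) for row in sample]

print("array write:", timeit.timeit(lambda: store_array(0), number=10_000))
print("dict build: ", timeit.timeit(store_dicts, number=10_000))
```

The dict path allocates `nwalkers` new objects per step instead of reusing memory, which is the kind of overhead that only shows up once you multiply by the number of steps and walkers.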
P.S. @tmcclintock: I'm sorry if I'm coming across as very negative! I don't mean to be. This would definitely be extremely worthwhile; I'm just hesitant to add too many features to emcee (especially major refactors like this) since my maintenance time is limited these days.
Hah, no it's ok, and understandable. After all, if it ain't broke, don't fix it :). How about I mock up a proof of concept in a PR and send it along? You can take a look and let me know if I should spend more time fleshing it out.
Sure - that sounds like a great plan! Happy to chat as you proceed. |
Proposal
The sampler API shares a lot of features with `scipy.optimize.minimize` -- given a model that can be typed as `Callable(List, ...)`, keep proposing samples to improve the model. This is fine for many problems, but one can imagine a scenario where it would be useful for model developers to handle samples with more than just a `List`. Handling parameters in a `List` means that writing a posterior requires defining parameters positionally. In contrast, in packages like pytorch one can refer to model parameters by name (e.g. a `torch.nn.Linear` has `weight` and `bias` named parameters).

This issue proposes enhancing the emcee API to allow for dictionary and/or namedtuple parameter constructs. Looking at the API for the `EnsembleSampler`, this would require an additional constructor argument, since the sampler does not inspect the state of the `log_prob_fn`.

Here is a minimum proposed implementation for supporting a dictionary:
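(The original snippet did not survive extraction. The following is a hedged sketch of what such an implementation could look like; the `wrap_log_prob` helper and its `names` argument are hypothetical and not part of emcee's API.)

```python
import numpy as np

def wrap_log_prob(log_prob_fn, names):
    """Adapt a dict-taking log-prob function to a flat-vector interface.

    `names` fixes the order in which the flat parameter vector maps onto
    dictionary keys. (Hypothetical helper, not part of emcee.)
    """
    def wrapped(theta):
        params = dict(zip(names, theta))
        return log_prob_fn(params)
    return wrapped

# A log-probability written against named parameters.
def log_prob(p):
    return -0.5 * (p["mu"] ** 2 + (p["sigma"] - 1.0) ** 2)

vector_log_prob = wrap_log_prob(log_prob, names=["mu", "sigma"])
print(vector_log_prob(np.array([0.0, 1.0])))  # both quadratic terms vanish -> 0.0
```

The wrapped callable has the `Callable(List, ...)` shape the sampler already expects, so it could be passed straight to `emcee.EnsembleSampler(nwalkers, ndim, vector_log_prob)`; the open design question is whether the `names` mapping should live in the sampler's constructor instead.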
This allows for a lot more flexibility downstream for model developers. As a simple example, consider a model composed of two parts that can each be updated in a way that doesn't affect the higher-level model:
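(The example code did not come through in the issue body; here is a hypothetical two-part model with made-up names and data illustrating the idea. Each part reads only the parameters it knows by name, so adding a parameter to one part never renumbers the other's positional indices.)

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)
y = 2.0 * x + 1.0  # synthetic data from a known line

def line_log_like(p):
    # First part: only touches "slope" and "intercept".
    model = p["slope"] * x + p["intercept"]
    return -0.5 * np.sum((y - model) ** 2)

def noise_log_prior(p):
    # Second part: only touches "log_sigma"; can change independently.
    return 0.0 if p["log_sigma"] > -10.0 else -np.inf

def log_prob(p):
    return line_log_like(p) + noise_log_prior(p)

print(log_prob({"slope": 2.0, "intercept": 1.0, "log_sigma": 0.0}))  # exact fit -> 0.0
```

With positional parameters, inserting a new parameter into `line_log_like` would shift every index that `noise_log_prior` relies on; with a dict, both parts stay decoupled.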
If this sounds like an attractive feature I would be happy to work on a PR.