
[ENH] support for dictionaries and/or namedtuples #385

Closed
tmcclintock opened this issue Apr 17, 2021 · 6 comments · Fixed by #386

@tmcclintock (Contributor)

Proposal

The sampler API shares a lot of features with scipy.optimize.minimize: given a model that can be typed as Callable[[List[float], ...], float], keep proposing samples to evaluate against the model. This is fine for many problems, but one can imagine scenarios where it would be useful for model developers to receive samples as something richer than a List. Handling parameters in a List means that writing a posterior requires defining parameters positionally. In contrast, packages like PyTorch let one refer to model parameters by name (e.g. a torch.nn.Linear has named weight and bias parameters).

This issue proposes enhancing the emcee API to allow for dictionary and/or namedtuple parameter constructs. Looking at the API for the EnsembleSampler, this would require an additional constructor argument, since the sampler does not inspect the state of the log_prob_fn.

Here is a minimum proposed implementation for supporting a dictionary:

class EnsembleSampler(object):
    def __init__(
        self,
        nwalkers: int,
        ndim: int,
        log_prob_fn: Callable[[Union[List[float], Dict[str, float]]], float],
        parameter_names: Optional[List[str]] = None,
        ...
    ):
        ...

        # Save the names
        self.parameter_names = parameter_names
        self.name_parameters: bool = parameter_names is not None
        if self.name_parameters:
            assert len(parameter_names) == ndim, "number of names must match ndim"
        ...

    def sample(...):
        ...
        with get_progress_bar(progress, total) as pbar:
            i = 0
            for _ in count() if iterations is None else range(iterations):
                for _ in range(yield_step):
                    ...

                    # If our parameters are named, return them in a named format
                    if self.name_parameters:
                        state = {key: value for key, value in zip(self.parameter_names, state)}

                    yield state

This allows for a lot more flexibility downstream for model developers. As a simple example, consider a model composed of two parts, each of which can be updated in a way that doesn't affect the higher-level model:

class PartA:
    __slots__ = ["param_a"]
    param_a: float

    def __call__(self):
        ...

class PartB:
    __slots__ = ["param_b"]
    param_b: float

    def __call__(self):
        ...

class MyModel:
    def __init__(self, a: Optional[PartA] = None, b: Optional[PartB] = None):
        # Avoid instantiating defaults in the signature: default argument
        # values are evaluated once at definition time and would be shared
        # across every MyModel instance
        self.a = a if a is not None else PartA()
        self.b = b if b is not None else PartB()

    def log_likelihood_fn(self, params: Dict[str, float]):
        # Update the composed parts of the model
        for key in self.a.__slots__:
            setattr(self.a, key, params[key])

        for key in self.b.__slots__:
            setattr(self.b, key, params[key])

        # Final likelihood uses the results of A and B, but we don't
        # care how they are implemented
        return f(self.a(), self.b())
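The `setattr` pattern above can be exercised in isolation. Here is a minimal self-contained sketch (the `Part` class and `update_from_dict` helper are hypothetical names for illustration, not part of emcee):

```python
class Part:
    # __slots__ doubles as the list of named parameters this part owns
    __slots__ = ["param"]

def update_from_dict(obj, params):
    # Assign each named parameter onto the matching slot; note that
    # setattr takes three arguments: object, attribute name, value
    for key in obj.__slots__:
        setattr(obj, key, params[key])

part = Part()
update_from_dict(part, {"param": 3.0})
print(part.param)  # 3.0
```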

If this sounds like an attractive feature I would be happy to work on a PR.

@dfm (Owner) commented Apr 18, 2021

I'm certainly interested, and I've played around with this a little in the past. One issue I hit was that the computational overhead of serializing structured data was significant enough that it wasn't worth the effort. So it would be worth thinking about the best interface for this (e.g. does it make sense for this to be implemented within emcee vs. as a front-end wrapper?). In general, I'm not interested in emcee becoming a fully-fledged modeling language, so I would normally come down on the side of leaving such features to downstream libraries (for example, such a library could map from the vector-space representation to the structured representation that you're suggesting), but I'm definitely interested to hear about any issues that approach would present.

@tmcclintock (Contributor, Author)

Understandable @dfm, and I agree completely with your bullet points on that page. The appeal of emcee is the simple API, and the fact that even a new Python user can use it just by supplying a callable. I do think, though, that any user wishing to build a robust model like the one I described above is limited by the sampler's ability to take in only a list of numbers.

Look at it this way -- suppose I was working on a research project and I wasn't sure how many model parameters I was going to have. With the current API, every time I come up with a new model I have to write this translation step that goes from "column index in the chain" to "argument position in my model". This is repetitive and prone to errors. How many research hours have been lost from plugging params[0] into the spot meant for params[1]? We've all been there, so let's fix it for everyone!
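To make the failure mode concrete, here is a small self-contained comparison (function and parameter names are illustrative, not emcee API): with positional unpacking the index-to-meaning mapping is an invisible contract, while named access makes it explicit.

```python
# Positional style: the mapping from index to meaning lives only in
# the developer's head, so swapping 0 and 1 is a silent bug
def log_prob_positional(params):
    slope, intercept = params[0], params[1]
    return -0.5 * (slope ** 2 + (intercept - 1.0) ** 2)

# Named style: order-independent and self-documenting
def log_prob_named(params):
    return -0.5 * (params["slope"] ** 2 + (params["intercept"] - 1.0) ** 2)

names = ["slope", "intercept"]
theta = [0.5, 2.0]
# Both styles agree once the names are zipped onto the vector
assert log_prob_positional(theta) == log_prob_named(dict(zip(names, theta)))
```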

As far as serializing structured data goes, I don't think that should be a concern. For instance, creating a million dictionaries of a dozen entries each (roughly the size of the problems emcee is meant to handle) only takes about 1.5 seconds:

values = [3.14, 2.71, 42, 0, 1, 1.1, 10000., -999, 123, 456., 789., 1e20]
keys = ["a" + str(i) for i in range(len(values))]

def foo():
    d = {k: v for k, v in zip(keys, values)}

if __name__ == "__main__":
    print(f"Timing dict creation for: {len(keys)} numbers")

    import timeit
    print(timeit.timeit("foo()", setup="from __main__ import foo"))
# >>> Timing dict creation for: 12 numbers
# >>> 1.477253458

Surely that is a minuscule price to pay compared to any non-trivial posterior. But perhaps this isn't what you meant by serializing; can you clarify?

@dfm (Owner) commented Apr 19, 2021

> Look at it this way -- suppose I was working on a research project and I wasn't sure how many model parameters I was going to have. With the current API, every time I come up with a new model I have to write this translation step that goes from "column index in the chain" to "argument position in my model". This is repetitive and prone to errors. How many research hours have been lost from plugging params[0] into the spot meant for params[1]? We've all been there, so let's fix it for everyone!

I agree with this at some level, but I also don't think it's obvious how to fix this for everyone without designing a fully fledged modeling language and requiring users to buy in. For example, perhaps I want some of my parameters to be tensors, not scalars. There are libraries that have solved that (e.g. PyMC3 bijections), but this tends to require a pretty large code base to cover the range of features that users would expect. From my perspective, it's not clear that this needs to be part of emcee (and I'm certainly not keen to be on the hook to maintain such a code base) when it could instead be a downstream library offering these features. Can you say more about the benefit you would expect from this being built into emcee vs. an independent library offering a shim? The shim would remain my suggested approach (and I'd be happy to include a link to such a library from the emcee docs), but please let me know what use case such a structure wouldn't solve.
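The downstream shim described here could be as small as a closure that maps the vector-space representation to a dict before calling the user's model. A hypothetical sketch (`named_wrapper` is an invented name, not a real emcee or third-party function):

```python
from typing import Callable, Dict, List, Sequence

def named_wrapper(
    log_prob_fn: Callable[[Dict[str, float]], float],
    names: Sequence[str],
) -> Callable[[List[float]], float]:
    """Adapt a dict-taking log-prob to the vector-taking interface
    that a sampler like emcee's EnsembleSampler expects."""
    def wrapped(theta):
        # Zip the fixed name order onto the incoming parameter vector
        return log_prob_fn(dict(zip(names, theta)))
    return wrapped

# The wrapped callable can be handed to the sampler unchanged
wrapped = named_wrapper(lambda p: -p["x"] ** 2, ["x"])
print(wrapped([2.0]))  # -4.0
```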

> Surely that is a minuscule price to pay compared to any non-trivial posterior, but perhaps this isn't what you meant by serializing though. Can you clarify?

In the default backend, emcee stores the chains in pre-allocated numpy arrays of the right shape and writes directly to those arrays, minimizing the number of copies and resizes. That's what you'd need to compare to. I've tried several experiments to make this more flexible (using structured arrays or dicts as interfaces) and found the computational overhead and memory usage to be surprisingly large. I don't remember the specific details or numbers, but I just wanted to warn you that I've gone down this path a number of times and abandoned it :D
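To sketch the comparison being made here (shapes and sizes are illustrative assumptions, not measurements from emcee): writing each step into a preallocated array is a single contiguous copy, while a dict-per-walker-per-step layout allocates many small Python objects.

```python
import numpy as np

nsteps, nwalkers, ndim = 100, 32, 12
names = [f"p{i}" for i in range(ndim)]

# Preallocated backend: one array, one slice assignment per step
chain = np.empty((nsteps, nwalkers, ndim))
for step in range(nsteps):
    state = np.random.randn(nwalkers, ndim)
    chain[step] = state  # contiguous copy, no per-step container growth

# Structured alternative: nsteps * nwalkers small dicts
structured = [
    [dict(zip(names, walker)) for walker in np.random.randn(nwalkers, ndim)]
    for _ in range(nsteps)
]
assert chain.shape == (nsteps, nwalkers, ndim)
assert len(structured) == nsteps and len(structured[0]) == nwalkers
```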

@dfm (Owner) commented Apr 19, 2021

P.S. @tmcclintock: I'm sorry if I'm coming across as very negative! I don't mean that. This would definitely be extremely worthwhile; I'm just hesitant to add too many features to emcee (especially major refactors like this) since my maintenance time is limited these days.

@tmcclintock (Contributor, Author)

Hah no it's ok, and understandable. After all, if it ain't broke don't fix it :).

How about I mock up a proof of concept in a PR and send it along? You can take a look and let me know if I should spend more time fleshing it out.

@dfm (Owner) commented Apr 21, 2021

Sure - that sounds like a great plan! Happy to chat as you proceed.
