Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.2.dev3] Choice simulation without capacity constraints #43

Merged
merged 15 commits into from Oct 3, 2018

Conversation

smmaurer
Copy link
Member

@smmaurer smmaurer commented Sep 13, 2018

This PR adds functionality for efficient Monte Carlo simulation of choices for a set of K scenarios, each having different probability distributions (and potentially different alternatives). Choices are independent and unconstrained, meaning that the same alternative can be chosen in multiple scenarios.

This is a component of issue #26. With this PR, we have full support in ChoiceModels for unconstrained choice simulation. The next PR will handle capacity constraints. A separate PR in UrbanSim Templates will provide access to this logic.

Discussion

This PR adds a tool called choicemodels.tools.monte_carlo_choices().

Using this is equivalent to applying np.random.choice() to each of K scenarios, but it's implemented as a single-pass matrix calculation. This is about 50x faster than using df.apply() or a loop. The algorithm is adapted from urbansim.urbanchoice.

For cases where all the choice scenarios have the same probability distribution among alternatives, you don't need this function. You can use np.random.choice() with size=K, which will be more efficient. (For example, that would work for a choice model whose expression includes only attributes of the alternatives.)

PR includes a unit test that confirms the simulated choices align with the provided probabilities.

Usage

from choicemodels.tools import monte_carlo_choices

probabilities =  # pd.Series indexed with observation id and alternative id
choices = monte_carlo_choices(probabilities)

This is implemented as a general-purpose function that can accept any list of indexed probabilities -- so it will work with output from our own MNL estimator, or PyLogit, or future model types. It can be called directly or used as the back end for a model template.

Performance

Overall the performance is excellent, especially compared to df.apply() as noted above.

Simulating choices is faster than calculating choice probabilities from the MNL utility equations. For 1 million choice scenarios with 10 alternatives each, calculating the probabilities takes 1.0 seconds and then simulating choices takes 0.5 seconds, on an old i5 MacBook.

Although this seems fine in absolute terms, it's worth noting that it's a little bit slower than the 100%-numpy implementation in the original urbansim.urbanchoice codebase. It looks like this is caused by overhead from requiring the probabilities to be formatted as an indexed pandas object.

Profiling indicates that 65% of the execution time, and the vast majority of memory usage, comes from a couple of initial pandas operations. The numpy matrix math is very efficient in comparison.

I think for now, the clean data format is worth the performance hit. But I'd like to go through and do more careful profiling of other parts of the codebase in light of this.

Other changes

  • reorders parameters in the MultinomialLogitResults() constructor and makes the estimation_engine parameter optional
  • improves the efficiency of MultinomialLogitResults.probabilities()

Versioning

  • updates ChoiceModels version number to 0.2.dev3

@coveralls
Copy link

coveralls commented Sep 13, 2018

Coverage Status

Coverage increased (+0.7%) to 66.615% when pulling 21167d7 on unconstrained-prediction into 9684d3a on master.

Copy link
Member

@mxndrwgrdnr mxndrwgrdnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems reasonably straightforward to me. I wonder though, if the pandas conversion is whats causing the slow down, if its really worth it instead of just making it pure numpy? I agree the clean API is nice to have, but do we expect many users to actually leverage it down at this level? Certainly most UrbanSim users would not. And users that want to leverage choicemodels outside of an UrbanSim context definitely might want to access these lower level functions, but in this scenario they are probably less likely to be working with pandas objects to begin with, maybe? Ultimately I agree its not a huge performance hit, and shouldn't have a big impact on overall UrbanSim runtimes. Just food for thought.

@smmaurer
Copy link
Member Author

@mxndrwgrdnr Thanks for taking a look at this. Yeah, these are good points. I guess we could also leave pandas formats as the default but provide an option for passing numpy arrays directly, which we could use in the UrbanSim templates to maximize performance. I'll look into this more as i'm building out the capacity-constrained choice simulation, which i think will have much worse performance.

@smmaurer smmaurer requested a review from janowicz October 3, 2018 18:59
@smmaurer smmaurer merged commit 36eb8ea into master Oct 3, 2018
@smmaurer smmaurer deleted the unconstrained-prediction branch October 3, 2018 19:29
@smmaurer smmaurer changed the title Choice simulation without capacity constraints [0.2.dev3] Choice simulation without capacity constraints Nov 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants