
Faster Expectations and new DiscreteDistributionXRA class #1156

Merged: 42 commits merged into econ-ark:master on Aug 8, 2022

Conversation

@alanlujan91 (Member) commented Jul 23, 2022:

This is a work in progress, but I would like some feedback. This PR tackles a few related concepts:

  1. Creates a new DiscreteDistribution.calc_expectation(func, args) method that uses numpy functional programming instead of looping; it is significantly faster than the older method.
  2. Ditto for DiscreteDistribution.distr_of_function(func, args).
  3. Creates an operator Expected(func, dist, args) which uses the new DiscreteDistribution.calc_expectation on the back end.
  4. Creates DiscreteDistributionXRA, which uses an underlying xarray.DataArray to allow labeling distributions, dimensions, and coordinates.
  5. DiscreteDistributionXRA.calc_expectations(func, args, labels=True) can use labeled, expressive equations that are easier to read as math.
  6. Adds notebooks to demonstrate the new functionality:
     • an Expectations notebook, and
     • a DiscreteDistributionXRA notebook.
  7. Uses the Expected operator in ConsIndShockModel as a proof of concept. In my tests this cuts runtime by two thirds (607 ms to 203 ms).

I'm still missing some documentation and tests, but would appreciate feedback and wanted to see if this passes remote tests.

  • Tests for new functionality/models or Tests to reproduce the bug-fix in code.
  • Updated documentation of features that add new functionality.
  • Update CHANGELOG.md with major/minor changes.
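As a minimal sketch of the speedup idea described in item 1 — evaluating the function on all atoms at once rather than looping over them — consider the following. This is illustrative only; the names pmv, atoms, and both helper functions are hypothetical, not HARK's actual API.

```python
import numpy as np

# Hypothetical sketch of the vectorization behind calc_expectation.
pmv = np.array([0.2, 0.3, 0.5])    # probability mass values
atoms = np.array([1.0, 2.0, 4.0])  # discrete support points

def expected_loop(func, pmv, atoms):
    # Old-style approach: a Python loop over each atom.
    return sum(p * func(x) for p, x in zip(pmv, atoms))

def expected_vectorized(func, pmv, atoms):
    # New-style approach: one vectorized call, then a dot product.
    return np.dot(pmv, func(atoms))

square = lambda x: x ** 2
# Both give E[x^2] = 0.2*1 + 0.3*4 + 0.5*16 = 9.4
assert np.isclose(expected_loop(square, pmv, atoms),
                  expected_vectorized(square, pmv, atoms))
```

The vectorized version avoids per-atom Python overhead, which is where the reported runtime reduction comes from.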

"""
return self._xarray.attrs

def expected_value(self, func=None, *args, labels=False):
wrapped = dict(zip(idx, x))

return func(wrapped, *args)
Contributor:

I would like to talk to you about what you are doing with the 'labels' case here before merging.
You asked if there is a simpler way to do what you're trying to do, and I suspect there is.
But the documentation doesn't make it clear to me what's happening.

@sbenthall removed the "automerge" and "Ready-To-Merge (Has been reviewed; when branch is updated and checks pass it should be merged)" labels on Aug 3, 2022
@alanlujan91 mentioned this pull request on Aug 3, 2022
coords : dict
Coordinate values/names for each dimension of the underlying array.
dims : tuple or list
Dimension names for each dimension of the underlying array.
Contributor:
I'm sorry to be getting to a thorough review of this work so late.

What I am noticing here is that the arguments to this DiscreteDistributionXRA class are exposing the configuration of the underlying DataArray data structure: dims, coords. But the documentation is not providing guidance as to the semantics of these arguments with respect to the class's purpose as a representation of a probability distribution.

Since you seem to be using these values mainly in the labels version of the expected_value operation, it leaves me with a lot of questions about the design here. I wonder if there's a clearer way to do what you're trying to do, or a way to describe this capability that doesn't rely so much on a deep understanding of the DataArray format.

It looks like the usage pattern you are intending here is: for a multivariate distribution over A and B, you will have dims = ['rv', 'x'], where 'x' refers to... the probability mass? And then coords = {'rv': ['A', 'B']}. The DataArray then stores values indexed by ((a, b), x), which represent the probability of (A = a, B = b)... but then what is the 'x' dimension?

And if this is the intended usage pattern, would you ever want dims to be anything other than ['rv', 'x']? Would that be a reasonable default?

But I'm actually now quite confused about the data model. Why wouldn't you just have the dims = ['A', 'B']? That way each index of (a,b) would map to a probability value. Am I missing something?
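To make the two layouts in question concrete, here is a small numpy sketch (illustrative only; atoms, pmv, and grid are hypothetical names, not from the PR) showing that the flat ('rv', 'x')-style layout and a dense dims = ['A', 'B'] layout encode the same probabilities:

```python
import numpy as np

# "Flat" layout: atoms[:, k] is the k-th support point (one value per
# variable), with probability mass pmv[k]. dims would be ('rv', 'x').
atoms = np.array([[0, 0, 1, 1],
                  [10, 20, 10, 20]])  # shape (N, D**N) = (2, 4)
pmv = np.array([0.1, 0.2, 0.3, 0.4])

# "Dense" layout: dims = ('A', 'B'), and grid[i, j] = P(A = a_i, B = b_j).
grid = pmv.reshape(2, 2)

# Both layouts give the same marginal P(A = 1):
flat_marginal = pmv[atoms[0] == 1].sum()
dense_marginal = grid[1, :].sum()
assert np.isclose(flat_marginal, dense_marginal)  # both 0.7
```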

Wrapper function for `func` that handles labeled indexing.
"""
dim_0 = self.dims[0]
idx = self.coords[dim_0].values
Contributor:
This code assumes a particular structure for dims and coords.
That structure should be guaranteed by the class's constructor, but currently it is wide open to user configuration.
It's easy to create a configuration that is 'valid' (in terms of what is documented) but breaks at this point in the code.
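A minimal sketch of the kind of constructor-level guarantee being suggested here — rejecting dims/coords that the labeled code path cannot handle. The class, attribute, and argument names are illustrative, not HARK's actual API.

```python
# Hypothetical constructor that enforces the structure the labeled
# expected_value path relies on, instead of leaving it to the caller.
class LabeledDiscreteDistribution:
    def __init__(self, pmv, data, dims=("rv", "x"), coords=None):
        if len(dims) != 2:
            raise ValueError("expected exactly two dimensions, e.g. ('rv', 'x')")
        if coords is None or dims[0] not in coords:
            raise ValueError(f"coords must name the variables along {dims[0]!r}")
        self.pmv, self.data = pmv, data
        self.dims, self.coords = dims, coords

ok = LabeledDiscreteDistribution([0.5, 0.5], [[1, 2]],
                                 coords={"rv": ["perm_shk"]})

failed = False
try:
    LabeledDiscreteDistribution([0.5, 0.5], [[1, 2]])  # no coords given
except ValueError:
    failed = True
assert failed
```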

@sbenthall removed the "Status: Review Needed" and "TSC: Review Needed (Item escalated to the TSC)" labels on Aug 3, 2022
@sbenthall (Contributor):

Looking this over... it looks like currently in DiscreteDistribution (and, by extension, the new DiscreteDistributionXRA), the class is initialized with a pmf (which should really be pmv) and an X/data argument, which for an N-dimensional multivariate distribution is a 2-dimensional array, where each slice data[:, x] is a particular point with probability mass pmf[x] and values for all N variables.

This takes len(X) + len(X) * N space, where len(X) is determined by the approximation value. In our case, N-VariateNormal().approx(D) expands into D^N different points. So: (1 + N) * D^N.

I would think that one of the benefits of using xarrays would be to use the dimension labels in a more concise way. I.e., storing all the data of the distribution in a single N-dimensional array, where the value of data[a,b] is the point mass of realizing (a,b).

This takes D ^ N space, which is less.

The only downside I can think of is that, when approximating a distribution, some methods could lead to a very uneven discretization in which many (a, b, ...) values were null. But that could be solved with xarray sparsity, which seems to be supported.

I was wondering if you'd thought about this while working on this @alanlujan91 . It's quite possible I'm forgetting some important rationale for why things are the way they are, but it's probably more likely that I was just trying to extract the functionality from the older HARK code without breaking things, and a new design would be better.
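A quick worked check of the space comparison above, for N = 3 variables and D = 10 points per dimension (the variable names here are illustrative, not from HARK):

```python
# Rough size accounting for the two layouts discussed above.
N, D = 3, 10

# Flat layout: D**N probabilities plus an N x D**N array of values.
flat_size = D**N + N * D**N  # (1 + N) * D**N = 4000 stored numbers

# Dense layout: a single N-dimensional array of point masses.
dense_size = D**N            # 1000 stored numbers

assert flat_size == (1 + N) * D**N
assert dense_size < flat_size
```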

@sbenthall (Contributor):

It's interesting to me that the draw() method of the new XRA-based class has not been updated -- so the xarray doesn't seem to be used for operations per se, but rather as a way of storing information about how the multivariate probability data is structured?
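For context, a sketch (not HARK's actual implementation) of how draw() can operate directly on the flat representation: sample column indices according to the probability masses, then read off the corresponding atoms.

```python
import numpy as np

rng = np.random.default_rng(0)
pmv = np.array([0.25, 0.25, 0.5])        # probability mass values
atoms = np.array([[1.0, 2.0, 3.0],
                  [10.0, 20.0, 30.0]])   # 2 variables, 3 atoms

def draw(n):
    # Sample atom indices with probability pmv, then gather the columns.
    idx = rng.choice(len(pmv), size=n, p=pmv)
    return atoms[:, idx]                 # shape (2, n)

sample = draw(1000)
assert sample.shape == (2, 1000)
```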

@sbenthall (Contributor):

About the 'labels' way of calling a function...

The current syntax for this call is (using the alias):

ExpectedValue(
    lambda x: 1 / x["perm_shk"] + x["tran_shk"],
    dist=x_dist,
    labels=True,
)

Especially since we want to move away from this rather odd construct of x standing for the state of nature (don't we?), I would expect a syntax more like this:

ExpectedValue(
    lambda perm_shk, tran_shk: 1 / perm_shk + tran_shk,
    dist=x_dist,
    labels=True,
)

... and likewise for when this is called via dist.expected_value
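One way the proposed syntax could be supported is by inspecting the function's parameter names and looking each one up in the labeled state, instead of passing a single dict-like x. This is a hypothetical sketch, not code from the PR; call_with_labels and state are illustrative names.

```python
import inspect

def call_with_labels(func, state):
    # Match each parameter name of func to a label in the state mapping.
    names = inspect.signature(func).parameters
    return func(*(state[name] for name in names))

state = {"perm_shk": 2.0, "tran_shk": 3.0}
value = call_with_labels(lambda perm_shk, tran_shk: 1 / perm_shk + tran_shk,
                         state)
assert value == 3.5  # 1/2.0 + 3.0
```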

@llorracc (Collaborator) commented Aug 4, 2022:

@sbenthall,

@alanlujan91 and I talked about this at some length today. You're right that it is closely connected with the issues around the whole steps/stages/frames direction.

Also closely connected to some of the things I worked on last year in my pre-2.0-alpha.

A key goal there was to have a structure from which it would be possible to extract the key equations of a model in a predictable and mathematical-looking way, and where the same exact equations could be applied in the solution and simulation stages.

In the end, the only way I found to do that was to encapsulate the relevant Python code in a string. The strings associated with transitions could then be printed, and the same string could be evaluated as a Python expression in both the solution and the simulation stages.
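A minimal sketch of that string-based approach (illustrative names only, and note the security caveat raised later in this thread about evaluating strings as code): the transition is stored as a readable string, which can be printed as documentation and evaluated against a namespace.

```python
# Hypothetical: a transition equation stored as a printable string.
transition = "m = a * R / perm_shk + tran_shk"

def apply_transition(eq, namespace):
    # Split into left-hand name and right-hand expression, then evaluate
    # the expression against the given namespace.
    lhs, rhs = (s.strip() for s in eq.split("="))
    namespace[lhs] = eval(rhs, {}, dict(namespace))
    return namespace

state = apply_transition(transition,
                         {"a": 1.0, "R": 1.03,
                          "perm_shk": 1.0, "tran_shk": 0.5})
assert abs(state["m"] - 1.53) < 1e-12
```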

If we have a call on Fri, maybe the two of us could briefly discuss what Alan and I discussed today.

The upshot, though, was a conclusion that maybe we should hold off on merging the xarray and expectation stuff until we are more confident about how we want to handle these things.

@sbenthall (Contributor):

Ok -- it makes sense to hold off on merging the xarray and 'local' function evaluation parts of this PR.

The performance improvements to calc_expectations and some of the redesign to the interface of DiscreteDistribution (see #1159 ) could be split off into their own PRs and be most welcome as standalone changes.

There are well-known security issues with evaluating strings as Python code. @MridulS has commented on this before.
Let's talk some more soon about this.

@alanlujan91 mentioned this pull request on Aug 4, 2022
@alanlujan91 marked this pull request as draft on August 4, 2022 at 14:47
@alanlujan91 marked this pull request as ready for review on August 7, 2022 at 23:11
@alanlujan91 (Member, Author):

@Mv77 I'm not sure why this is causing errors that weren't showing up before; I'm guessing they have to do with interpolation.py's new release? Do you have any guess?

@Mv77 (Contributor) commented Aug 8, 2022:

Yuup.

The release broke my interpolators. #1157 fixes them but let me push a quick standalone fix.

Incoming!

@Mv77 (Contributor) commented Aug 8, 2022:

@alanlujan91

#1161 is a standalone fix. It's 3 lines of code. If it passes tests, you can go ahead and merge it, which should fix the issue.

@llorracc merged commit b344850 into econ-ark:master on Aug 8, 2022
@sbenthall added this to the 0.13.0 milestone on Jan 4, 2023