Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache scope #129

Merged
merged 9 commits into from Dec 3, 2014
Merged

Cache scope #129

merged 9 commits into from Dec 3, 2014

Conversation

jiffyclub
Copy link
Member

This adds a cache_scope option to the table, column, and injectable decorators that allow results to be automatically cleared from the cache at set points. The options are 'step', 'iter', and 'forever'. Items with scope 'step' are cleared after every model is run, items with scope 'iter' are cleared at the end of every model year, and things with scope 'forever' are never automatically cleared.

This removes the need for table_source so that has been removed. The equivalent is now to use cache_scope='forever'.

I waffled a little bit on using 'forever' or 'simulation' for the long-lived scope. With a scope of 'simulation' it'd make sense to clear it at the end of the simulation, with 'forever' it'd be best to never clear it. I'm up for either one, anyone have an opinion?

@jiffyclub
Copy link
Member Author

Couple new features added this morning:

Functions wrapped with sim.table() will now be evaluated before injection, and a DataFrameWrapper will be injected, so that other functions can directly access the underlying DataFrame if necessary. And on that topic the attribute name of the underlying DataFrame has been changed from ._frame to .local so that accessing it doesn't look so sketchy.

@jiffyclub
Copy link
Member Author

(All UrbanSim tests are passing, the failure is because sanfran_urbansim uses table_source. That's fixed by UDST/sanfran_urbansim#7.)

@bridwell
Copy link
Contributor

bridwell commented Nov 5, 2014

Can the 'iter' option be extended to use specific simulation years? For example, something like:

@sim.column(iter=[2010, 2020, 2030])

All simulation years not specified would be cached. This would be nice for models that run at different time scales rather than having to put the logic into the model itself.

@jiffyclub
Copy link
Member Author

@bridwell, I don't think we can support a feature like that. It implies there's a single, easy to identify variable that the simulation is looping over so that there can be a comparison. That variable exists now (as year), but that's something we're planning to change soon to generalize the simulation framework to uses beyond UrbanSim.

To accomplish what you want I'd make a "model" like this:

@sim.table(cache=True, scope='forever')
def my_table():
    # do stuff
    return df

@sim.model()
def clear_my_table_cache(year, my_table):
    if year in [2010, 2020, 2030]:
        my_table.clear_cached()

In anticipation of fully implementing cache scope controls this
removes the table_source decorator and makes a namedtuple
container object for cached items so that items can be cached along
with the information about their cache scope.
Table, column, and injectable decorators now take a cache_scope
argument that can be 'step', 'iter', or 'forever'.
Thing with a scope of 'step' are cleared after each model,
things with a scope of 'iter' are cleared at the end of each model year,
and things with scope 'forever' are never automatically cleared.
It occurred to me that one of the benefits of table_source was that
it converted the registered thing to a DataFrameWrapper that allows
you to access the underlying DataFrame.

To keep the same benefit now that table_source is gone I've converted
the simulation framework so that the injection and get_table always
result in a wrapped DataFrame. This could result in a little more computation
while evaluating dependencies, but caching still works the same
so things that are cached will only be built once
(or until they are removed from the cache).
There are situations when you need to directly access the DataFrame
underlying a DataFrameWrapper, especially when trying to shadow that
column with a computed column. Instead of forcing people to use
the semi-private ._frame now it's .local.
@jiffyclub
Copy link
Member Author

@daradib The test_table_copy test is failing and it looks to me like the test setup is wrong, but I wanted to check with you. In particular this table is set up with copy_col=False but with the name test_funcd_copied2. Then it fails on test assert table['a'] is not table['a'] because copy_col is False. Should that function be named uncopied or have copy_col=True?

@daradib
Copy link
Contributor

daradib commented Dec 2, 2014

The test_funcd_copied2 function (should be better named, sorry about that) is supposed to test that we still have a copy if the input table (test_frame_copied) evaluates as a copy, even if the function doesn't directly copy the input table. If I'm understanding correctly, this PR evaluates the input table before injection so you'll get the same copy as input each time, and the test fails. If that's now intended behavior, maybe we can get rid of the function?

@jiffyclub
Copy link
Member Author

Yeah the idea is now that users should never come in contact with TableFuncWrapper, injection and get_table will always hand you a DataFrameWrapper so that things are more predictable. I'll see about cleaning this up in accordance.

A couple of the tests in test_table_copy were testing
the behaviour of TableFuncWrapper, but those don't enter
user space anymore, get_table always returns a DataFrameWrapper.
@jiffyclub
Copy link
Member Author

@daradib Can you confirm 5f34076 looks like the right thing to do?

get_table now returns a DataFrameWrapper, which will not be
re-evaluated with __getitem__. Instead, call get_table twice.

to_frame() always returns copies, so avoid using it when
testing other behavior.

Rename table_source since it has been replaced by cache=True.

Use a for loop to reuse code when creating tables of computed columns.
@daradib
Copy link
Contributor

daradib commented Dec 2, 2014

Looks good. I added a commit cleaning up test_table_copy to call get_table twice, avoid calling to_frame, etc.

@jiffyclub
Copy link
Member Author

@fscottfoti, any objection to merging this and making it part of the next release?

jiffyclub added a commit that referenced this pull request Dec 3, 2014
Set scope on cached data

The user can set a scope on cached data so that the simulation framework
can take care of expiring cached items on a schedule,
saving the user from having to do manual cache management.
Removes the "table source", which is replaced by cache mechanics.
@jiffyclub jiffyclub merged commit 8264b2c into master Dec 3, 2014
@jiffyclub jiffyclub deleted the cache-scope branch December 3, 2014 20:28
@fscottfoti
Copy link
Contributor

Sorry, I missed this comment. So this is not a reverse compatible change right? I have to update bayarea_urbansim when I update urbansim?

@fscottfoti
Copy link
Contributor

Yup, now this is breaking. Should be easy enough to fix.

@jiffyclub
Copy link
Member Author

I ended up making two releases today. Version 1.2 has the accounts and supply/demand stuff, but not this. Version 1.3 has this. So you can pick one of those releases to work with for now.

@fscottfoti
Copy link
Contributor

sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants