orca cache default should not be "forever"? #16

Open
fscottfoti opened this issue Nov 7, 2016 · 4 comments

@fscottfoti
Contributor

We had a couple of snafus surrounding the orca cache behavior recently and it took some real digging to both fix and understand the problems. I ended up having to write a FAQ just to sort it out which I'll copy below. The main issue here, I think, is that the default cache_scope should be iteration rather than forever.

Why did setting cache=True when defining the jobs table cause the same jobs summary numbers to be printed out each year - doesn't the cache get cleared each year?

No. As the docs clearly state, the default cache_scope is "forever". You have to specify a cache_scope of "iteration" if you want that behavior.

Really, the default is forever - shouldn't it be iteration?

Yeah probably, since the main point of Orca is to make recomputation of variables easier for each simulation iteration.

Yeah but really, all sorts of things have a cache=True (e.g. in urbansim_defaults), don't these get recomputed every year?

Yes - it turns out that is a happy accident. When you add rows to a table, you clear the cache, so since we add rows to jobs and households via the transition models and to buildings via the developer model, the cache gets cleared by adding those new rows.
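To make the difference between the two scopes concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not orca's implementation: `ScopedCache`, `run`, and the `jobs_summary` name are hypothetical and exist only for illustration. It assumes the "iteration" scope is cleared at the end of every simulated year and that nothing ever clears "forever".

```python
class ScopedCache:
    """Toy scoped cache (hypothetical; not orca's actual code)."""

    def __init__(self, default_scope="forever"):
        self.default_scope = default_scope
        self._values = {}  # name -> (scope, value)

    def get(self, name, compute, scope=None):
        # Compute only on a cache miss; otherwise return the stored value.
        if name not in self._values:
            self._values[name] = (scope or self.default_scope, compute())
        return self._values[name][1]

    def clear(self, scope):
        # Drop only the entries registered under the given scope.
        self._values = {
            name: entry for name, entry in self._values.items()
            if entry[0] != scope
        }


def run(default_scope, years=(2016, 2017, 2018)):
    cache = ScopedCache(default_scope)
    state = {"jobs": 100}
    printed = []
    for _year in years:
        state["jobs"] += 10  # the simulation changes the underlying data
        printed.append(cache.get("jobs_summary", lambda: state["jobs"]))
        cache.clear("iteration")  # end-of-iteration housekeeping
    return printed


forever_summary = run("forever")      # same number every year
iteration_summary = run("iteration")  # recomputed every year
```

With the "forever" default the summary is computed once and repeated for every year, which is the symptom described above. With an "iteration" default it is recomputed each year.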

OK, but now explain why we have 3 zone_ids in the jobs table, one that is zone_id, one that is zone_id_x, and one that is zone_id_y?

Easy. There is a zone_id defined on parcels, buildings, and jobs. We need a column from parcels and a column from jobs so we merge the three tables. The first pandas merge has two zone_ids and pd.merge appends _x and _y, then the third doesn't conflict with the first two and becomes the canonical zone_id. Turns out an odd number of merged columns will give you what you want.

Sort of, but one of those zone_ids was different from the others - it had nulls where the other two were defined.

This was the original problem that led us to look into this - our job summaries were incorrect. It happens because the ELCM runs last in order to place unplaced jobs after new buildings get built. But when the jobs get new building ids from the ELCM, the zone_id isn't updated because it is cached. Thus all the previously unplaced jobs still have a null zone_id, because unplaced jobs don't have zones. (Incidentally, zone_id_x and zone_id_y were both correct because they were merged after the ELCM ran - only zone_id was incorrect, because it was stuck in the cache.)
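The three zone_id columns, and the one stale column among them, can be reproduced with a few rows of made-up data. This is only a sketch: the table contents and the merge order are assumptions, and the null in `jobs["zone_id"]` stands in for a value cached before the ELCM assigned the job a building.

```python
import pandas as pd

# Hypothetical data; every table carries its own zone_id column.
parcels = pd.DataFrame({"parcel_id": [1, 2], "zone_id": [10, 20]})
buildings = pd.DataFrame(
    {"building_id": [100, 200], "parcel_id": [1, 2], "zone_id": [10, 20]}
)
# Job 1001 already has a building, but its zone_id is the stale cached null.
jobs = pd.DataFrame(
    {"job_id": [1000, 1001], "building_id": [100, 200], "zone_id": [10, None]}
)

# First merge: both sides have zone_id, so pandas appends _x and _y.
first = pd.merge(buildings, parcels, on="parcel_id")

# Second merge: jobs.zone_id no longer collides with zone_id_x / zone_id_y,
# so it keeps the plain name.
merged = pd.merge(jobs, first, on="building_id")

zone_columns = sorted(c for c in merged.columns if c.startswith("zone_id"))
stale_nulls = int(merged["zone_id"].isna().sum())
```

Here `zone_id_x` and `zone_id_y` come from buildings and parcels at merge time and are fully populated, while the plain `zone_id` still carries the null.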

And the first thing you tried didn't work - just clearing the orca cache - why not?!?

At first I tried clearing the "forever" cache and this doesn't work because the Pandana global memory is stored in the forever cache. Pandana can't be reinitialized and this was the first error I got. When I cleared the "iteration" cache instead, which I thought would work, it did NOT because the default is "forever" not "iteration" as I had thought, so the columns I needed to clear were defined as "forever" and still in the cache.
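Both failed attempts can be sketched with a toy cache. This is not orca's code: `cache_value`, the module-level `_cache`, and the entry names are hypothetical, and the only behavior assumed is that entries without an explicit scope fall back to "forever" and that clearing one scope leaves the others alone.

```python
DEFAULT_SCOPE = "forever"
_cache = {}  # name -> (scope, value)


def cache_value(name, value, scope=None):
    # Entries with no explicit scope silently get the default.
    _cache[name] = (scope or DEFAULT_SCOPE, value)


def clear_cache(scope=None):
    """Drop every entry, or only the entries of one scope."""
    if scope is None:
        _cache.clear()
        return
    for name in [n for n, (s, _) in _cache.items() if s == scope]:
        del _cache[name]


# Meant to live forever: expensive and not re-creatable.
cache_value("pandana_network", "expensive network object")
# cache=True with no scope given, so it also ends up in "forever".
cache_value("jobs.zone_id", [10, None])
# Only an explicit scope puts a column in "iteration".
cache_value("jobs.zone_id_explicit", [10, None], scope="iteration")

clear_cache("iteration")
after_iteration_clear = sorted(_cache)  # the stale column survives

clear_cache("forever")
after_forever_clear = sorted(_cache)    # the network is gone too
```

Clearing "iteration" leaves the stale column in place, and clearing "forever" also throws away the object that must not be rebuilt.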

OK is there a way out of this madness? Like, a simple solution?

Yeah, setting the orca default cache_scope to "iteration" instead of "forever" should do the trick. But that would require everyone to agree to a fairly major behavior change.

@fscottfoti fscottfoti changed the title orca cache default should not be forever? orca cache default should not be "forever"? Nov 7, 2016
@Eh2406
Contributor

Eh2406 commented Nov 7, 2016

Yes, no caching is a better default. With cache = "forever" orca is 'just' a set of global mutable variables. With cache=None it is at least a set of mutable generator functions.

My code base can be updated easily. However, the change is a semver major change (not that we are using semver), so I'd like to hear from a large number of users, and see clear docs in the announcement and changelog, before the change is made.

@bridwell
Contributor

bridwell commented Nov 18, 2016

I'm fine with changing the default.

Or maybe the default caching scope could be a global option?

@lmwang9527

From the description of the caching behavior above (I haven't looked into the code myself yet), the behavior may still be problematic even when the cache option is "iteration", though it may be less of a problem with "iteration" than with "forever". From Fletcher's description, it seems that once caching is turned on for a variable, its value will be read from the cache even when its dependencies have changed (an exception is when rows are added to or deleted from the table?). This can cause stale (un-updated) variable values even when cache="iteration".

As an example, consider a neighborhood household_income_3000 variable that is used by hlcm and rsh. Assuming a neighborhood_vars step runs before both model steps, when hlcm runs, the variable value from cached columns would be correct. However, when rsh runs after hlcm, the cached column value would be stale and may be incorrect because households would have moved around. cache="step" would work in this case, but it defeats most of the purpose of caching.
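A toy version of that scenario, in plain Python rather than orca, shows the stale read. Everything here is hypothetical: the two-household table, the simplified per-zone mean standing in for household_income_3000, and the `hlcm` / `rsh` functions that only mimic the order of the model steps.

```python
households = {
    1: {"zone": "A", "income": 50_000},
    2: {"zone": "B", "income": 90_000},
}
_cache = {}  # lives for the whole iteration


def mean_income_by_zone():
    incomes = {}
    for hh in households.values():
        incomes.setdefault(hh["zone"], []).append(hh["income"])
    return {zone: sum(v) / len(v) for zone, v in incomes.items()}


def cached_mean_income():
    # Iteration-scoped caching: compute once, then reuse until cleared.
    if "mean_income" not in _cache:
        _cache["mean_income"] = mean_income_by_zone()
    return _cache["mean_income"]


def hlcm():
    seen = cached_mean_income()      # fresh at this point
    households[2]["zone"] = "A"      # the model relocates household 2
    return seen


def rsh():
    return cached_mean_income()      # still the pre-move values


seen_by_hlcm = hlcm()
seen_by_rsh = rsh()
fresh = mean_income_by_zone()
```

The second step reads the same cached dictionary as the first, even though a fresh computation now gives a different answer.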

An ideal solution (one that was implemented in the opus version of UrbanSim) is to keep track of versions for columns and, whenever a computed/derived variable has a lower version than its dependencies, recompute its value. This would also require a major change for orca to start keeping track of variable dependencies.
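The versioning idea can be sketched in a few lines. This is neither the Opus implementation nor anything orca provides: `VersionedStore` and its methods are hypothetical, and dependencies are declared by hand rather than discovered automatically.

```python
class VersionedStore:
    """Toy dependency-versioned cache (hypothetical sketch)."""

    def __init__(self):
        self._base = {}     # name -> (version, value)
        self._derived = {}  # name -> (func, dependency names)
        self._cache = {}    # name -> (version when computed, value)
        self._clock = 0

    def set(self, name, value):
        # Every write to a base column bumps the global version counter.
        self._clock += 1
        self._base[name] = (self._clock, value)

    def derive(self, name, deps, func):
        self._derived[name] = (func, tuple(deps))

    def version(self, name):
        if name in self._base:
            return self._base[name][0]
        self.get(name)  # make sure the derived value is up to date
        return self._cache[name][0]

    def get(self, name):
        if name in self._base:
            return self._base[name][1]
        func, deps = self._derived[name]
        newest_dep = max(self.version(d) for d in deps)
        cached = self._cache.get(name)
        # Recompute only when a dependency is newer than the cached value.
        if cached is None or cached[0] < newest_dep:
            value = func(*(self.get(d) for d in deps))
            self._cache[name] = (newest_dep, value)
        return self._cache[name][1]


calls = []
store = VersionedStore()
store.set("building_zone", {100: 10, 200: 20})
store.set("job_building", {1: 100, 2: None})  # job 2 is unplaced


def job_zone(job_building, building_zone):
    calls.append("job_zone")
    return {j: building_zone.get(b) for j, b in job_building.items()}


store.derive("job_zone", ["job_building", "building_zone"], job_zone)
first = store.get("job_zone")                 # computed
again = store.get("job_zone")                 # served from the cache
calls_before_update = len(calls)
store.set("job_building", {1: 100, 2: 200})   # e.g. the ELCM places job 2
updated = store.get("job_zone")               # recomputed automatically
```

The derived column is cached for as long as its inputs are unchanged and is refreshed as soon as one of them is written, so no scope has to be chosen at all.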

@hanase
Contributor

hanase commented Mar 4, 2019

Thanks for the comment @lmwang9527 ! The caching has been a headache for us as well. We now set everything to "step" to avoid hidden issues, which of course slows things down.

There is also the issue that a variable should be re-usable in different contexts. For example, you might want to compute the number of households in various models where the cache needs to be "step" for the reasons you described. But in another orca run that generates annual indicators, you want the cache for this variable to be "iteration", and in yet another run it could be "forever". So I don't think the cache scope should be associated with tables or columns. It should be application dependent.
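One way to picture an application-dependent scope is to keep the variable definition free of any caching choice and let each run supply its own mapping from variable name to scope. This is a sketch of the idea only; `Run`, its methods, and the fallback to "step" for unlisted variables are all assumptions and not part of orca.

```python
def count_households(households):
    # The variable itself says nothing about caching.
    return len(households)


class Run:
    """Toy run object that owns the caching policy (hypothetical)."""

    def __init__(self, scopes):
        self.scopes = scopes  # variable name -> "step" | "iteration" | "forever"
        self._cache = {}
        self.computations = 0

    def value(self, name, func, data):
        if name not in self._cache:
            self.computations += 1
            self._cache[name] = func(data)
        return self._cache[name]

    def _expire(self, scope):
        # Unlisted variables fall back to "step", the most conservative scope.
        for name in list(self._cache):
            if self.scopes.get(name, "step") == scope:
                del self._cache[name]

    def end_step(self):
        self._expire("step")

    def end_iteration(self):
        self._expire("step")
        self._expire("iteration")


households = [1, 2, 3]
model_run = Run({"n_households": "step"})
indicator_run = Run({"n_households": "iteration"})

for run in (model_run, indicator_run):
    for _step in range(2):  # two steps inside a single iteration
        run.value("n_households", count_households, households)
        run.end_step()
    run.end_iteration()
```

The same `count_households` definition is recomputed at every step in the model run and only once per iteration in the indicator run.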

A few years back I tried to lobby for the Opus solution, which uses versioning of dependencies and, as you say, would be the ideal solution. I'm not sure, though, whether anybody is currently working on improving orca.
