Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dynamic determination of invalidated cache #15

Open
Eh2406 opened this issue Nov 7, 2016 · 9 comments
Open

dynamic determination of invalidated cache #15

Eh2406 opened this issue Nov 7, 2016 · 9 comments

Comments

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 7, 2016

can we keep track of when we generate a orca column and use that to determine when a orca table needs to be re-calculated.

@fscottfoti
Copy link
Contributor

You mean like have a dependency graph? I definitely thought that was a promising idea but Matt thought it would be a lot of work at the time...

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 7, 2016

To be honest I assumed we had a DG, or the data to build one on the fly. I assumed that is what is needed to do lazy loading. I will go back to the code. How do we do lazy loading?

When we add data to orca, store a timestamp (or monatomic number) in a _CACHE_TIME.
However we do lazy loading, rerun that code and see the largest value for the _CACHE_TIME of the dependencies. If the dependencies time > than the time for the cashe of this item, than we are in a invalidated cache state.

@fscottfoti
Copy link
Contributor

No there's no DG - lazy evaluation happens when an item gets injected into a function. If it's in the cache it doesn't get re-evaluated. The user has to invalidate the cache manually, which is usually done every iteration, and is why my other issue about the cache is so important. It's not that we didn't think DG was a good idea, just that Matt never got to it.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 9, 2016

Does the algorithm I suggested make sense? Do you think it would improve things?

As always just starting a discussion, and willing to help implement any fix.

@fscottfoti
Copy link
Contributor

I definitely agree that building a DF and invalidating the cache appropriately when something gets updated would be awesome. The problem is that your dependency graph is usually built based on what gets injected into the function, but everything that gets injected is not necessarily used. So like when you inject the parcels table which has 20 columns, you only update one column in the parcels table you want to invalidate only the things that depend on that column, but you're injecting the whole table. I see no obvious way to solve this other than requiring the user to say which columns a function depends on explicitly, but then we're almost as prone to error as we were in the first place. Thoughts?

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 9, 2016

It can also be subverted by using orca.get_table in the function to make a dependency without telling orca, or by using a different global variable / file to store some state. Generally python is permissive with privacy and mutability for better and worse. (off topic, I am excited by timely dataflow for rust. It is leveraging rust's stinginess with privacy and mutability to insure a lot of the invariants we are discussing.)

Perhaps this algorithm can be a diagnostic tool? Still thinking...

@fscottfoti
Copy link
Contributor

fscottfoti commented Nov 9, 2016

Right - that's the best I can come up with too... orca knows when it's inside that method, and it can capture any calls to get_table or __getattr__ and know which columns are a dependency. My fear is that you can have different code paths through the method - like an if/else - so you can't know you've got ALL the dependencies just by running the function once. And that's where I get stumped.

@bridwell
Copy link
Contributor

I would say that stating specific column dependencies, while cumbersome, is still preferable to having to manually clear the cache within the steps. But maybe this could be automated in some way, like looking through the contents of a yaml file or configuration dictionary rather than being declared in the decorator?

@waddell
Copy link
Member

waddell commented Nov 18, 2016

@bridwell that is my inclination as well, to be explicit about dependencies. A bit more work up front, but with good payoff in terms of being able to address challenges like identifying dynamically when to recompute based on an updated dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants