
Cache control #5136

Open
Makman2 opened this issue Feb 7, 2018 · 10 comments
Labels: area/core · difficulty/medium · status/blocked (The issue requires other referenced issues/PRs to be solved/merged before being worked on)

Comments

Makman2 (Member) commented Feb 7, 2018

Once the NextGen-Core is implemented, we have many more possibilities for different cache operating modes, which shall be available as CLI arguments in coala.

  • --cache-strategy / --cache-protocol:
    Controls how coala manages caches for the next run.

    The following modes could be implemented:

    • none: Don't use a cache at all. Additionally, a shortcut flag --no-cache could be implemented, effectively meaning --cache-protocol=none.
    • primitive: Use a cache that grows indefinitely. All cache entries are stored for all following runs and aren't removed. Effective when recurrent changes happen often in coafiles and settings. Fastest in storing.
    • lri / last-recently-used: Cached items persist only until the next run.
      Stretch issue: Implement count parameters that allow controlling when to discard items from the cache, e.g. discard a cached item after 3 runs of coala without it being used. (A sketch of this eviction policy follows after this list.)

    My recommendation is to use lri as the default, as coala is mostly executed locally.

  • --clear-cache
    Clears the cache.

  • --export-cache / --import-cache
    Maybe useful for sharing caches. For example, a CI server runs coala for a project, and you can download the cache from there as an artifact to speed up your builds / coala runs.

  • --cache-compression
    Accepts as arguments:

    • none: No cache compression. This is the default.
    • Other values that specify common compression capabilities Python provides (for example lzma or gzip).
      Cache compression should be evaluated for effectiveness beforehand: because the cache will mainly store hashes, which usually aren't very redundant, the gain might be very low. The small performance penalty when loading the cache might not be worth a possibly very low reduction in cache size.
  • --optimize-cache
    Accept a small performance penalty when storing the cache in order to make cache loading faster. In particular, this feature shall utilize pickletools.optimize. But this is not exclusive to this flag.
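
To make the lri stretch idea concrete, here is a minimal sketch of such a run-count-based eviction policy. All names (LastRecentlyUsedCache, finish_run, max_unused_runs) are hypothetical illustrations, not actual coala API:

class LastRecentlyUsedCache:
    """Sketch of the lri policy with the stretch count parameter."""

    def __init__(self, max_unused_runs=1):
        self.max_unused_runs = max_unused_runs
        self._entries = {}      # key -> cached result
        self._unused_runs = {}  # key -> full runs since the last hit

    def get(self, key, default=None):
        if key in self._entries:
            self._unused_runs[key] = 0  # used in this run
            return self._entries[key]
        return default

    def put(self, key, value):
        self._entries[key] = value
        self._unused_runs[key] = 0

    def finish_run(self):
        # Call once at the end of each coala run: age every entry and
        # discard those that exceeded the allowed number of unused runs.
        # max_unused_runs=1 yields the plain "persist only until the
        # next run" behaviour described above.
        for key in list(self._entries):
            self._unused_runs[key] += 1
            if self._unused_runs[key] > self.max_unused_runs:
                del self._entries[key]
                del self._unused_runs[key]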

Makman2 added the status/blocked (The issue requires other referenced issues/PRs to be solved/merged before being worked on) label on Feb 7, 2018
palash25 (Member) commented Mar 9, 2018

@Makman2 With reference to cache compression, what I gather from your description is that we need to know whether compression would be a good idea before implementing it (i.e. projects requiring large disk space will need to cache their results). For this we could have a small piece of code that determines the size of the repository and only initializes compression if the repo is above a minimum size threshold (which we will need to decide, but I'm guessing it might be hundreds of MBs or a few GBs for large projects). Correct me if I'm wrong.

palash25 (Member) commented Mar 9, 2018

I am also curious about import/export cache: will this be a separate module, e.g. something like CacheTransfer.py, which will make GET and POST requests to different CI servers like Travis and Circle using their respective APIs and the requests library?

palash25 (Member) commented Mar 9, 2018

I think this issue can be a part of the caching/performance optimization project.

Makman2 (Member, Author) commented Mar 10, 2018

> Cache compression should be evaluated for effectiveness beforehand: because the cache will mainly store hashes, which usually aren't very redundant, the gain might be very low. The small performance penalty when loading the cache might not be worth a possibly very low reduction in cache size.

The point is that we don't compress source files, but binary data which might not have much redundancy (actually the cache currently also contains files, but this shall change, as explained below). If the data has no redundancy, compression is too ineffective, and maintaining compression features would be useless.
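
To illustrate the kind of evaluation meant here, this is a small, self-contained experiment (assuming the cache is roughly a pickled mapping of hash digests, as described in this thread) that compares raw, pickletools-optimized, and compressed sizes:

import gzip
import hashlib
import lzma
import pickle
import pickletools

# Simulate a cache of 10000 hash -> hash entries.
cache = {hashlib.sha256(str(i).encode()).digest():
         hashlib.sha256(str(-i).encode()).digest()
         for i in range(10000)}

raw = pickle.dumps(cache, protocol=pickle.HIGHEST_PROTOCOL)
optimized = pickletools.optimize(raw)  # what --optimize-cache would apply

for name, data in (('raw', raw),
                   ('optimized', optimized),
                   ('gzip', gzip.compress(raw)),
                   ('lzma', lzma.compress(raw))):
    print(name, len(data), 'bytes')

If the compressed sizes come out close to the raw size, that would confirm the "very low gain" concern above.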

> I am also curious about import/export cache: will this be a separate module, e.g. something like CacheTransfer.py, which will make GET and POST requests to different CI servers like Travis and Circle using their respective APIs and the requests library?

No no no :) It's just that I want to be able to do coala --export-cache coacache.cache (and the same for importing) to have a reliable interface for transferring caches. It could be that for now this boils down to a simple copy command (copying the coala cache file out of some coala-specific location).
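
A minimal sketch of that "simple copy command" reading: the cache path below is an assumption for illustration, not the real coala location.

import os
import shutil

# Assumption: a stand-in for "some coala-specific location", NOT the
# actual path coala uses.
INTERNAL_CACHE_FILE = os.path.expanduser('~/.cache/coala/cache.pickle')

def export_cache(destination):
    # What "coala --export-cache <destination>" could boil down to.
    shutil.copyfile(INTERNAL_CACHE_FILE, destination)

def import_cache(source):
    # What "coala --import-cache <source>" could boil down to.
    shutil.copyfile(source, INTERNAL_CACHE_FILE)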

The idea with the CI is just a possible use case (which also has to be investigated). Consider a very large project which generates a 100MB cache (that's already quite insane), where the coala analysis has taken 2h. So that new developers can speed up their runs, they would just download this file, which the CI is configured to offer as an artifact. They do coala --import-cache ..., and their initial coala run no longer takes 2h. Or consider the CI build cache itself: instead of requiring a clean coala run each time, we would just cache coala's cache file inside our builds, and the next build restores it. Consecutive CI builds take way less time.

> I think this issue can be a part of the caching/performance optimization project.

Yes.

> (actually the cache currently also contains files, but this shall change, as explained below)

So, about caching again: the new core caches the task objects emitted by the bears. These task objects are effectively just the arguments to the analyze function packed into a tuple. As you recall, local bears (now called FileBear) have the following signature:

def analyze(self, filename, file, ...):
    ...

The argument file contains the complete file contents directly, and thus they are saved into the cache. This is something I want to avoid in the future by using a "file-factory" or "file-proxy", which is just some interface for reading files. It will implement proper cache-saving methods to reduce storage requirements by not including the file contents themselves (but just the name, timestamps, etc.).
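
A rough sketch of what such a file-proxy could look like; the class name and all method choices here are hypothetical, not the planned coala design:

import os

class FileProxy:
    """Hypothetical interface for reading a file whose pickled/cached
    form carries only metadata, never the file contents."""

    def __init__(self, filename):
        self.filename = filename
        self._content = None

    @property
    def content(self):
        # Load the contents lazily, only when a bear actually needs them.
        if self._content is None:
            with open(self.filename, encoding='utf-8') as fl:
                self._content = tuple(fl)
        return self._content

    def __getstate__(self):
        # Only the name ends up in the pickled cache, not the contents.
        return {'filename': self.filename}

    def __setstate__(self, state):
        self.filename = state['filename']
        self._content = None

    def __eq__(self, other):
        return (isinstance(other, FileProxy)
                and self.filename == other.filename)

    def __hash__(self):
        # Hash from name + timestamp + size, so task hashes change when
        # the file changes on disk and stale cache entries stop matching.
        stat = os.stat(self.filename)
        return hash((self.filename, stat.st_mtime, stat.st_size))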

palash25 (Member) commented Mar 21, 2018

@Makman2 I wanted to know more about the design of this. Could all these flags reside in a separate module (CacheModes.py) as decorator functions, so that we can then use these decorators wherever we need them in our codebase to cache data, like in Core.py? (This is the design that I am currently including in my proposal.)

Makman2 (Member, Author) commented Mar 21, 2018

I don't quite understand that^^ What do you want to cache like in Core.py?

palash25 (Member) commented Mar 21, 2018
I'm sorry that was poorly phrased.

I meant to ask whether the implementations of these flags will reside in a separate module or as functions in Core.py.

Makman2 (Member, Author) commented Mar 21, 2018

Separate module, but could be located inside core module (not the Core.py file itself).

Makman2 (Member, Author) commented Mar 21, 2018

Or even inside coalib.core.caching/coalib.core.cache or so
