New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid unnecessarily loading the hackage index #1892

Closed
mgsloan opened this Issue Mar 9, 2016 · 19 comments

Comments

Projects
None yet
4 participants
@mgsloan
Collaborator

mgsloan commented Mar 9, 2016

Loading the hackage is expensive, because the encoded index cache is 18Mb. On my fairly fast computer, it takes 0.3 seconds to decode it. Search getPackageCaches to find the spots it's used.

There are two cases where loading the hackage currently happens unnecessarily:

  1. In loadSourceMap, because the user might have specified a package target on the commandline which is not in the resolver. In this case, it uses the latest from hackage.

  2. In constructPlan, because plan construction errors get annotated with "latest applicable" versions.

For both of these, loading the hackage index is only necessary under specific and abormal circumstances. However, currently this expensive operation is being done regardless of need.

It's also used in resolveBuildPlan, but that only seems to be used by the tests.

@luigy

This comment has been minimized.

Contributor

luigy commented Mar 25, 2016

Loading the hackage is expensive, because the encoded index cache is 18Mb. On my fairly fast computer, it takes 0.3 seconds to decode it.

Hmm, eventually we got to steal hoogle decoding tricks ⚡️, but in the meantime lazy me just wants to write a separate cache like I do in #1839 since it looks like there are a few cases in which all we care about is the latest version. I tested throwing this info in the same cache and the size is around ~505Kb.

  1. In constructPlan, because plan construction errors get annotated with "latest applicable" versions.

The smaller cache mentioned above can help with this and also for show command in #1614

$ stack show latest-version base
4.8.2.0

Also as another enhancement we can fork reading the cache as soon as we hit a command that might need it + using an MVar for configPackageCaches

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Mar 25, 2016

Hmm, eventually we got to steal hoogle decoding tricks , but in the meantime lazy me just wants to write a separate cache like I do in #1839 since it looks like there are a few cases in which all we care about is the latest version. I tested throwing this info in the same cache and the size is around ~505Kb.

Currently working on a fast serialization library - this is one of the applications we have in mind! https://www.fpcomplete.com/blog/2016/03/efficient-binary-serialization

Also as another enhancement we can fork reading the cache as soon as we hit a command that might need it + using an MVar for configPackageCaches

True! In general we might be able to make things a little faster by running such blocking fetches in parallel.

@luigy

This comment has been minimized.

Contributor

luigy commented Mar 25, 2016

Currently working on a fast serialization library - this is one of the applications we have in mind!

Excellent!
so if I want this now, I have to be join the fp force... hmm 😜

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Mar 28, 2016

No promises, but should be released soon particularly if it improves performance a bunch for the stack usecase.

@sjakobi

This comment has been minimized.

Contributor

sjakobi commented Apr 24, 2016

I have realized that one of the most frequent uses of the package caches is for lookups where we want to get the existing versions of a given package. That's an operation that isn't well supported by the current representation of the package caches which uses PackageIdentifier as the key.

So I'd suggest changing that representation to Map PackageName (Map Version PackageCache) or even HashMap PackageName .... The downside of using a HashMap is that we'd apparently have to define an orphan instance for Binary.

@phadej

This comment has been minimized.

Contributor

phadej commented Apr 24, 2016

stack could depend onbinary-orphans which provides that instance.

@sjakobi

This comment has been minimized.

Contributor

sjakobi commented Apr 24, 2016

@phadej: That's good to know! Thanks for a useful library! 👍

@sjakobi

This comment has been minimized.

Contributor

sjakobi commented Apr 25, 2016

  1. In constructPlan, because plan construction errors get annotated with "latest applicable" versions.

#2062 fixes that case.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Apr 25, 2016

Note that we plan to switch away from using binary soon, which should speed things up but will invalidate everyone's caches. This will probably happen after releasing current HEAD.

@sjakobi

This comment has been minimized.

Contributor

sjakobi commented Apr 26, 2016

Note that we plan to switch away from using binary soon, which should speed things up but will invalidate everyone's caches.

Sounds like that would be a good occasion to change the type of the caches too.

Moving from Map PackageIdentifier PackageCache to [Hash]Map PackageName ([Hash]Map Version PackageCache) should bring down the size of the serialized file by quite a bit too. Unless the new serialization step involves some kind of compression, that is…

@luigy

This comment has been minimized.

Contributor

luigy commented Apr 26, 2016

Moving from Map PackageIdentifier PackageCache to [Hash]Map PackageName ([Hash]Map Version PackageCache)

Yes, and if so extend it to include preferred-versions from the index or should it be kept as a separate cache since not every package has preferred-versions? #1839 (will get back to this on the weekend)

type PreferredVersions = VersionRange
Map PackageName 
    (Map Version PackageCache
    ,PreferredVersions -- with default to `AnyVersion` if prefered versions is absent
    )                  --  from index for a package
                       -- or just use `Maybe VersionRange` depending which one serializes better

Unpack and upgrade can use this and it would be cool if when the index is updated add some parallel check that when the current version of stack has been deprecated display a warning(using #1745 (?)) and/or prompt to upgrade

@sjakobi

This comment has been minimized.

Contributor

sjakobi commented Apr 26, 2016

@luigy: I think the type you propose looks fine, except I'd strongly prefer a HashMap PackageName ....

Regenerating the hackage index cache takes only 7 seconds on my rather slow computer so I don't think we should worry about that.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Apr 27, 2016

Sounds good to me! Definitely a good thing to address after the upcoming release, as that's when usage of the new serialization library is going to be merged. Earlier I mentioned invalidating caches, but we will probably keep around some of the binary serialization code for the caches that we'd rather not invalidate. It'll re-encode them when reading an old cache file.

#2062 fixes that case.

I think there's still more to do here - only loading it from the file if need be (getPackageCaches)

@sjakobi

This comment has been minimized.

Contributor

sjakobi commented Apr 27, 2016

I think there's still more to do here - only loading it from the file if need be (getPackageCaches)

So, just to back up my impression that the call to getPackageCaches in constructPlan has no performance impact whatsoever, here's a profile for a short no-op build run with the current stack-HEAD.

The (chronologically) first call (actually the second in the file) is at line 6957 in the context of withCabalLoader and takes 18% of the total profiled time.

The second call (the one in constructPlan) is at line 6931 and has no performance footprint at all because the PackageCacheMap is already stored in an IORef.

So, I believe that if there's any way to avoid loading the package caches at all (at least for benign cases), we should look at withCabalLoader first. If we can avoid loading the package caches there, we should also try to avoid the call in constructPlan.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Apr 27, 2016

Good point! I hadn't noticed that withCabalLoader misused runOnce - on the surface it looked like that code avoided unnecessary loading of the hackage index. Now it should only load the index if it needs to.

@sjakobi

This comment has been minimized.

Contributor

sjakobi commented Apr 28, 2016

Good point! I hadn't noticed that withCabalLoader misused runOnce - on the surface it looked like that code avoided unnecessary loading of the hackage index. Now it should only load the index if it needs to.

Cool! That cleared things up for me quite a bit!

I did another profile run and now it's the getPackageCaches call in constructPlan that does all the work. Using that neat liftBaseWith/runInBase trick of yours that should be avoidable too.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Apr 28, 2016

Using that neat liftBaseWith/runInBase trick of yours that should be avoidable too.

True :) at the time I didn't want to think about it too hard.

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Apr 29, 2016

Moving over a note from the discussion in #2077 - we should make sure the plan construction code doesn't demand this in the case that extra-deps are already installed. (even if they are local packages)

@mgsloan

This comment has been minimized.

Collaborator

mgsloan commented Apr 29, 2016

@sjakobi Good work! I think we can call this issue done. I've moved some further details that got complicated to here: #2081

@mgsloan mgsloan closed this Apr 29, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment