Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use compact regions? #4457

Closed
nad opened this issue Feb 19, 2020 · 30 comments
Closed

Use compact regions? #4457

nad opened this issue Feb 19, 2020 · 30 comments
Assignees
Labels
documented-in-changelog Issues already documented in the CHANGELOG performance Slow type checking, interaction, compilation or execution of Agda programs type: enhancement Issues and pull requests about possible improvements
Milestone

Comments

@nad
Copy link
Contributor

nad commented Feb 19, 2020

Perhaps we could use "compact regions" to avoid garbage-collecting data read from interface files.

@nad nad added status: info-needed More information is needed from the bug reporter to confirm the issue. type: enhancement Issues and pull requests about possible improvements performance Slow type checking, interaction, compilation or execution of Agda programs labels Feb 19, 2020
@nad nad added this to the 2.6.2 milestone Feb 19, 2020
@nad
Copy link
Contributor Author

nad commented Feb 19, 2020

Compact regions do not support functions, and compaction is "very costly" if you want to retain sharing, so perhaps this is not a good idea.

@nad nad closed this as completed Feb 19, 2020
@nad nad added status: abandoned Work on this issue has been abandoned (not in changelog) and removed status: abandoned Work on this issue has been abandoned (not in changelog) labels Feb 19, 2020
@nad
Copy link
Contributor Author

nad commented Feb 19, 2020

Compact regions do not support functions

But we do not serialise functions…

I ran a quick test. The key change is restricted to one line in Agda.TypeChecking.Serialise:

-    Right x   -> return (Just x)
+    Right x   -> liftIO (Just . C.getCompact <$> C.compactWithSharing x)

I ran runhaskell Main.hs agda, with all dependencies of the standard library's Everything.agda already type-checked, and Main.hs defined as in #4387 (including threadDelay). Images from ThreadScope, first without compaction, and then with:
without-compaction
with-compaction

Perhaps this makes sense after all. Compact regions seem to be available from GHC 8.0, but I'm not sure how well they worked in that version of GHC. I've tested using 8.6.5.

@nad nad reopened this Feb 19, 2020
@nad
Copy link
Contributor Author

nad commented Feb 19, 2020

My plan is to push my changes, restricted to GHC 8.6 and higher. The documentation claims that the feature "is supported by GHC 8.2 and later", but I don't want to spend time benchmarking older versions of GHC. I'm depending on the lower-level library ghc-compact, rather than compact, because the latter does not support GHC 8.8 (I'm guessing that it is not maintained).

Any objections?

It might also be a good idea to always use the serialised interfaces (#4316).

@WolframKahl
Copy link
Member

Since “compacting data involves copying it”, it probably it makes sense to add a verbosity flag keeping track of individual and cumulated compactSize of compacted .agdai data.

And in view of

 compactWithSharing :: a -> IO (Compact a)

    Compact a value, retaining any internal sharing and cycles. O(size of data)

    This is typically about 10x slower than compact,
    because it works by maintaining a hash table
    mapping uncompacted objects to compacted objects.

, an entry in the Agda-internal time profiling list separate from deserialisation probably also makes sense.

@nad nad closed this as completed in f932c53 Feb 19, 2020
@nad
Copy link
Contributor Author

nad commented Feb 19, 2020

I've pushed my changes. Feel free to implement the things that you suggested.

@asr asr modified the milestones: 2.6.2, 2.6.1 Feb 19, 2020
@nad
Copy link
Contributor Author

nad commented Feb 20, 2020

Given that we are (hopefully?) about to release 2.6.1 I'll add a flag for this feature and turn it off by default. For that reason I'll also add some documentation.

@nad nad reopened this Feb 20, 2020
@nad
Copy link
Contributor Author

nad commented Feb 20, 2020

I did not include information about heap usage for the example above (as reported by ThreadScope).

  • Without compact regions:
    Maximum heap size: 2.2 GiB
    Maximum heap residency: 1.1 GiB
    Total allocated: 4.0 GiB
  • With compact regions:
    Maximum heap size: 444.0 MiB
    Maximum heap residency: 226.7 MiB
    Total allocated: 332.0 MiB

Another way to reduce GC is to set a large initial heap size. With -H2.5G and no compact regions I got the following results:
without-compaction-large-heap
Maximum heap size: 2.5 GiB
Maximum heap residency: 889.1 MiB
Total allocated: 3.5 MiB

@nad
Copy link
Contributor Author

nad commented Feb 20, 2020

an entry in the Agda-internal time profiling list separate from deserialisation probably also makes sense.

I'll add that.

@nad
Copy link
Contributor Author

nad commented Feb 20, 2020

Profiling information for the test case above:

  • Without compaction:
    Total           43,336ms 
    Miscellaneous    1,209ms 
    Deserialization 40,790ms 
    Import           1,297ms 
    Parsing             38ms 
    
           8,866,512 bytes allocated in the heap
       6,006,811,856 bytes copied during GC
       1,377,000,196 bytes maximum residency (16 sample(s))
          10,188,028 bytes maximum slop
                1313 MB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0     18950 colls,     0 par   11.377s  11.472s     0.0006s    0.0080s
      Gen  1        16 colls,     0 par    8.376s   9.373s     0.5858s    3.6147s
    
      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
    
      SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
    
      INIT    time    0.000s  (  0.001s elapsed)
      MUT     time   22.311s  ( 23.520s elapsed)
      GC      time   19.753s  ( 20.845s elapsed)
      EXIT    time    0.001s  (  0.005s elapsed)
      Total   time   42.066s  ( 44.371s elapsed)
    
      Alloc rate    397,402 bytes per MUT second
    
      Productivity  53.0% of total user, 53.0% of total elapsed
    
  • With compaction:
    Total                      40,001ms           
    Miscellaneous               1,309ms           
    Deserialization            33,496ms (37,242ms)
    Deserialization.Compaction  3,746ms           
    Import                      1,410ms           
    Parsing                        38ms           
    
         363,091,840 bytes allocated in the heap
       3,541,671,600 bytes copied during GC
         242,774,924 bytes maximum residency (37 sample(s))
           1,083,452 bytes maximum slop
                 231 MB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0     19033 colls,     0 par   11.773s  11.950s     0.0006s    0.0171s
      Gen  1        37 colls,     0 par    0.795s   0.871s     0.0235s    0.1373s
    
      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
    
      SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
    
      INIT    time    0.001s  (  0.000s elapsed)
      MUT     time   25.913s  ( 27.271s elapsed)
      GC      time   12.568s  ( 12.821s elapsed)
      EXIT    time    0.001s  (  0.008s elapsed)
      Total   time   38.482s  ( 40.100s elapsed)
    
      Alloc rate    14,012,065 bytes per MUT second
    
      Productivity  67.3% of total user, 68.0% of total elapsed
    

These numbers look suspicious to me. I'll test a bit more.

@nad
Copy link
Contributor Author

nad commented Feb 20, 2020

I got the following numbers for std-lib-test:

  • If serialised interfaces are not decoded and used (in which case compaction is not used):
    Total                        369,484ms
    Miscellaneous                  8,522ms
    Typing                        29,213ms (193,005ms)
    Typing.CheckRHS               73,014ms
    Typing.TypeSig                29,306ms
    Typing.CheckLHS               14,785ms  (27,924ms)
    Typing.CheckLHS.UnifyIndices  13,139ms
    Typing.OccursCheck            12,395ms
    Typing.With                   12,029ms
    Typing.Generalize              9,122ms
    Serialization                 55,606ms  (93,023ms)
    Serialization.BinaryEncode    27,109ms
    Serialization.BuildInterface   4,217ms
    Serialization.Compress         3,380ms
    Serialization.Sort             2,709ms
    Parsing                        3,672ms  (23,069ms)
    Parsing.OperatorsExpr         15,567ms
    Parsing.OperatorsPattern       3,829ms
    Scoping                        8,376ms  (13,124ms)
    Scoping.InverseScopeLookup     4,748ms
    Positivity                     9,309ms
    DeadCode                       7,397ms
    Highlighting                   7,177ms
    Coverage                       5,103ms
    Import                         4,454ms
    Termination                    1,472ms   (3,369ms)
    Termination.Compare            1,167ms
    Termination.RecCheck             561ms
    Termination.Graph                168ms
    Injectivity                    1,312ms
    Deserialization                  459ms
    ProjectionLikeness               145ms
    ModuleName                        10ms
    
       4,056,469,192 bytes allocated in the heap
      29,549,965,520 bytes copied during GC
         699,499,620 bytes maximum residency (65 sample(s))
           2,969,736 bytes maximum slop
                 667 MB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0     191066 colls,     0 par   68.013s  68.457s     0.0004s    0.0093s
      Gen  1        65 colls,     0 par   34.968s  35.875s     0.5519s    2.2317s
    
      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
    
      SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
    
      INIT    time    0.000s  (  0.001s elapsed)
      MUT     time  257.556s  (266.183s elapsed)
      GC      time  102.980s  (104.332s elapsed)
      EXIT    time    0.001s  (  0.004s elapsed)
      Total   time  360.537s  (370.521s elapsed)
    
      Alloc rate    15,749,840 bytes per MUT second
    
      Productivity  71.4% of total user, 71.8% of total elapsed
    
    
    real	6m10,543s
    user	6m0,543s
    sys	0m9,061s
    
  • If serialised interfaces are decoded (but not compacted) and used:
    Total                        422,862ms
    Miscellaneous                  9,800ms
    Typing                        27,303ms (201,668ms)
    Typing.CheckRHS               78,564ms
    Typing.TypeSig                31,288ms
    Typing.CheckLHS               16,525ms  (30,093ms)
    Typing.CheckLHS.UnifyIndices  13,567ms
    Typing.With                   12,837ms
    Typing.OccursCheck            12,307ms
    Typing.Generalize              9,273ms
    Serialization                 48,129ms  (80,090ms)
    Serialization.BinaryEncode    20,785ms
    Serialization.BuildInterface   5,018ms
    Serialization.Compress         3,380ms
    Serialization.Sort             2,776ms
    Deserialization               47,193ms
    Parsing                       12,456ms  (33,145ms)
    Parsing.OperatorsExpr         16,265ms
    Parsing.OperatorsPattern       4,423ms
    Scoping                        8,537ms  (12,927ms)
    Scoping.InverseScopeLookup     4,389ms
    Positivity                     9,197ms
    Highlighting                   7,588ms
    DeadCode                       6,486ms
    Coverage                       5,301ms
    Import                         4,443ms
    Termination                    1,524ms   (3,495ms)
    Termination.Compare            1,227ms
    Termination.RecCheck             564ms
    Termination.Graph                178ms
    Injectivity                    1,362ms
    ProjectionLikeness               150ms
    ModuleName                        11ms
    
       1,808,834,476 bytes allocated in the heap
      35,519,923,596 bytes copied during GC
       1,548,933,760 bytes maximum residency (53 sample(s))
          10,470,784 bytes maximum slop
                1477 MB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0     208334 colls,     0 par   85.989s  86.570s     0.0004s    0.0103s
      Gen  1        53 colls,     0 par   46.025s  47.341s     0.8932s    10.1818s
    
      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
    
      SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
    
      INIT    time    0.000s  (  0.001s elapsed)
      MUT     time  282.427s  (290.916s elapsed)
      GC      time  132.014s  (133.912s elapsed)
      EXIT    time    0.000s  (  0.002s elapsed)
      Total   time  414.442s  (424.830s elapsed)
    
      Alloc rate    6,404,615 bytes per MUT second
    
      Productivity  68.1% of total user, 68.5% of total elapsed
    
    
    real	7m4,848s
    user	6m54,445s
    sys	0m9,580s
    
  • If serialised interfaces are decoded, compacted and used:
    Total                        374,828ms
    Miscellaneous                  8,347ms
    Typing                        25,589ms (181,416ms)
    Typing.CheckRHS               67,956ms
    Typing.TypeSig                29,059ms
    Typing.CheckLHS               14,569ms  (27,774ms)
    Typing.CheckLHS.UnifyIndices  13,205ms
    Typing.OccursCheck            11,785ms
    Typing.With                   10,217ms
    Typing.Generalize              9,033ms
    Serialization                 47,600ms  (78,819ms)
    Serialization.BinaryEncode    20,608ms
    Serialization.BuildInterface   4,533ms
    Serialization.Compress         3,326ms
    Serialization.Sort             2,751ms
    Deserialization               34,902ms  (38,818ms)
    Deserialization.Compaction     3,916ms
    Parsing                        2,245ms  (21,618ms)
    Parsing.OperatorsExpr         15,484ms
    Parsing.OperatorsPattern       3,888ms
    Scoping                        8,123ms  (12,313ms)
    Scoping.InverseScopeLookup     4,189ms
    Positivity                     8,879ms
    DeadCode                       5,310ms
    Coverage                       5,093ms
    Highlighting                   5,059ms
    Import                         4,303ms
    Termination                    1,542ms   (3,404ms)
    Termination.Compare            1,159ms
    Termination.RecCheck             538ms
    Termination.Graph                164ms
    Injectivity                    1,284ms
    ProjectionLikeness               147ms
    ModuleName                        10ms
    
       2,207,774,688 bytes allocated in the heap
      24,195,161,408 bytes copied during GC
         299,007,688 bytes maximum residency (159 sample(s))
             354,736 bytes maximum slop
                 285 MB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0     208357 colls,     0 par   78.774s  79.321s     0.0004s    0.0296s
      Gen  1       159 colls,     0 par    9.292s   9.623s     0.0605s    0.3170s
    
      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
    
      SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
    
      INIT    time    0.000s  (  0.001s elapsed)
      MUT     time  278.189s  (286.614s elapsed)
      GC      time   88.066s  ( 88.943s elapsed)
      EXIT    time    0.000s  (  0.002s elapsed)
      Total   time  366.255s  (375.561s elapsed)
    
      Alloc rate    7,936,237 bytes per MUT second
    
      Productivity  76.0% of total user, 76.3% of total elapsed
    
    
    real	6m16,726s
    user	6m6,261s
    sys	0m8,618s
    

The second option does not look very good. The last option is (for these particular test runs) slightly slower than the first option, but uses much less memory. I'll activate decoding of serialised interfaces, as suggested before (#4316), when compaction is available and in use.

nad added a commit that referenced this issue Feb 20, 2020
By default compact regions are not used.

The compact region code is now enabled starting from GHC 8.2.
nad added a commit that referenced this issue Feb 20, 2020
But only if compact regions are available and in use.
@andreasabel
Copy link
Member

Nice work!

About the option name: My first association for "compact regions" was to use compact source code regions (in the sense of Range). "Region" is otherwise not a term in the Agda universe. Also, this option is about a very specific implementation detail, thus, it should not just land in the option soup.

I suggest to have a prefix to the name that distinguishes this option as something just tuning the performance of Agda, as opposed to changing type theory, changing interaction, or changing the produced executable.

My color of the bike shed: --performance-compact-memory-regions.

@nad
Copy link
Contributor Author

nad commented Feb 21, 2020

My plan was to remove this option once we decide whether this is a good idea or not. However, if we come to the conclusion that this is something that should be tunable by the user, then a better name might make sense.

If we anticipate having multiple performance tuning options, then I would prefer something like --performance=….

nad added a commit that referenced this issue Feb 21, 2020
@nad
Copy link
Contributor Author

nad commented Feb 21, 2020

As far as I can tell most of the CI tests are only run for a single version of GHC. I'd like to make sure that --compact-regions works for every supported version of GHC. Is there a simple way to accomplish this?

@nad
Copy link
Contributor Author

nad commented Feb 21, 2020

Given the performance improvements I've seen I suggest that we turn on --compact-regions by default, but only if the feature works fine with all versions of GHC that we support.

@asr
Copy link
Member

asr commented Feb 21, 2020

From .travis.yml you can run a test on all supported GHC versions by adding complete tests to the commit message.

@nad
Copy link
Contributor Author

nad commented Feb 21, 2020

Can I run the Travis tests without creating a pull request or pushing a commit? I don't want to create a pull request for something that I don't want human feedback on.

I get notifications for all pull requests, and in quite a few cases the only purpose of a pull request is to run the test suite. These pull requests also, in my opinion, pollute the GitHub history.

@UlfNorell
Copy link
Member

I believe Travis runs on all branches, regardless of PRs, so you can just push a temporary branch with a complete tests commit.

@asr
Copy link
Member

asr commented Feb 21, 2020

From .travis.yml:

    # Test all issue* branches. Saves you opening a PR.
    - /^issue.*/

So, you should name your branch issueXYZ.

@UlfNorell
Copy link
Member

Ah, blind luck that this is what I do already...

@asr
Copy link
Member

asr commented Feb 21, 2020

All that Travis magic was written by @L-TChen.

@nad
Copy link
Contributor Author

nad commented Feb 22, 2020

Thanks, that seems to work.

@nad
Copy link
Contributor Author

nad commented Feb 22, 2020

Unfortunately Agda failed to build using GHC 8.0.2 and 8.2.2, seemingly due to lack of time or space. Perhaps I can run the tests locally instead.

@nad
Copy link
Contributor Author

nad commented Feb 22, 2020

No, for GHC 8.2.2 the problem was that Agda failed to create the interface file for Agda.Builtin.Bool due to lack of space. I guess that there is some problem with compact regions in this version of GHC.

nad added a commit that referenced this issue Feb 22, 2020
If GHC 8.4 or later is used to compile Agda.

The flag --compact-regions has been removed.

Some test case outputs became worse because of this change, see #4469
and #4472.
@nad
Copy link
Contributor Author

nad commented Feb 22, 2020

Now compact regions are always active (for GHC 8.4 and later). The option has been removed, I see no reason for not using compact regions.

As an aside, it seems as if memory consumption (with or without compact regions) is substantially lower with GHC 8.6.5 than with 8.4.4, and perhaps even lower with 8.8.2.

@nad nad closed this as completed Feb 22, 2020
@nad nad added documented-in-changelog Issues already documented in the CHANGELOG and removed status: info-needed More information is needed from the bug reporter to confirm the issue. labels Feb 22, 2020
@andreasabel
Copy link
Member

@asr wrote:

All that Travis magic was written by @L-TChen.

Except the line you quoted (testing the issue branches). I added this. :)

@jespercockx jespercockx reopened this Feb 23, 2020
@jespercockx
Copy link
Member

The test suite is failing on my PC with GHC 8.0.2 because of this change.

@nad
Copy link
Contributor Author

nad commented Feb 24, 2020

See #4469 and #1346.

@nad nad closed this as completed Feb 24, 2020
@jespercockx
Copy link
Member

Could we roll back this change until those issues have been resolved? We should avoid as much as possible to be in a situation where the test suite does not run successfully.

@nad
Copy link
Contributor Author

nad commented Feb 24, 2020

I'll disable the broken test cases.

@nad
Copy link
Contributor Author

nad commented Feb 24, 2020

By the way, note that the problems you see are not caused by the use of compact regions, but by the fix of #4316 (which is only active if compact regions are in use).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documented-in-changelog Issues already documented in the CHANGELOG performance Slow type checking, interaction, compilation or execution of Agda programs type: enhancement Issues and pull requests about possible improvements
Projects
None yet
Development

No branches or pull requests

6 participants