Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate More Effective IR Caching format #5567

Closed
JaroslavTulach opened this issue Feb 7, 2023 · 11 comments · Fixed by #5918
Closed

Investigate More Effective IR Caching format #5567

JaroslavTulach opened this issue Feb 7, 2023 · 11 comments · Fixed by #5918
Assignees
Labels
--low-performance -compiler p-medium Should be completed in the next few sprints s-research-needed Status: the task will require heavy research to complete
Milestone

Comments

@JaroslavTulach
Copy link
Member

JaroslavTulach commented Feb 7, 2023

Summary

Measurements performed during the investigation of LS+engine startup (#5027) indicate that it takes more than 2.5s to load “IR caches” (described here). The current format is using standard Java serialization/deserialization and is subject to slow initialization. It uses reflection behind the scene and reflection is slow to initialize. . We have to either limit the number of loaded IR objects or change the format to a more efficient one.

Value

Modify the current format slightly to avoid deserializing things that aren't needed. For example the "resolve import exports compiler pass" forces deserialization of all the IR files of a given library. Such deserialization includes also method bodies which is unnecessary. Either store the information for resolving imports separately or make IR multi-tier.

Specification

@kustosz suggestion: let's say you have a 2-tier serialization solution, something like:

class SerializedModule {
    String name;
    SomeSummaryOfImportsAndExports importsAndExports;
    Map<TypeRep, SerializedMethod> methods;
}
class SerializedMethod {
    byte[] serializedBody;
}

Deserialization of SerializedModule would be fast, it's a small structure (except the SerializedMethods, but these are just bytes arrays – very fast to read). Then you just have a lazy root node that keeps that SerializedMethod and reads it when it is first called

Acceptance Criteria & Test Cases

Pass existing tests. Perform measurement of startup according to the methodology and confirm the deserialization time drops down.

@hubertp
Copy link
Contributor

hubertp commented Feb 22, 2023

Some experiment I did with fst: 8246ad2
Note: it uses older version because lasted fst was only compiled with JDK14 (or 17, can't remember). Preferably we would actually be able to use the latest version which still is 2 years old.

@jdunkerley jdunkerley added the s-research-needed Status: the task will require heavy research to complete label Feb 28, 2023
@JaroslavTulach
Copy link
Member Author

I have overridden ObjectInputStream.resolveClass and obtain list of 143 classes that are being stored in our IR caches stream:

 java.lang.Enum
 java.util.ArrayList
 java.util.UUID
 org.enso.compiler.core.IR$Application$Force
 org.enso.compiler.core.IR$Application$Literal$Sequence
 org.enso.compiler.core.IR$Application$Operator$Binary
 org.enso.compiler.core.IR$Application$Prefix
 org.enso.compiler.core.IR$CallArgument$Specified
 org.enso.compiler.core.IR$Case$Branch
 org.enso.compiler.core.IR$Case$Expr
 org.enso.compiler.core.IR$DefinitionArgument$Specified
 org.enso.compiler.core.IR$Expression$Binding
 org.enso.compiler.core.IR$Expression$Block
 org.enso.compiler.core.IR$Foreign$Definition
 org.enso.compiler.core.IR$Function$Lambda
 org.enso.compiler.core.IR$IdentifiedLocation
 org.enso.compiler.core.IR$Literal$Number
 org.enso.compiler.core.IR$Literal$Text
 org.enso.compiler.core.IR$Module
 org.enso.compiler.core.IR$Module$Scope$Definition$Data
 org.enso.compiler.core.IR$Module$Scope$Definition$Method$Conversion
 org.enso.compiler.core.IR$Module$Scope$Definition$Method$Explicit
 org.enso.compiler.core.IR$Module$Scope$Definition$Type
 org.enso.compiler.core.IR$Module$Scope$Export$Module
 org.enso.compiler.core.IR$Module$Scope$Import$Module
 org.enso.compiler.core.IR$Module$Scope$Import$Polyglot
 org.enso.compiler.core.IR$Module$Scope$Import$Polyglot$Java
 org.enso.compiler.core.IR$Name$BuiltinAnnotation
 org.enso.compiler.core.IR$Name$GenericAnnotation
 org.enso.compiler.core.IR$Name$Literal
 org.enso.compiler.core.IR$Name$MethodReference
 org.enso.compiler.core.IR$Name$Qualified
 org.enso.compiler.core.IR$Name$Self
 org.enso.compiler.core.IR$Pattern$Constructor
 org.enso.compiler.core.IR$Pattern$Literal
 org.enso.compiler.core.IR$Pattern$Name
 org.enso.compiler.core.IR$Pattern$Type
 org.enso.compiler.core.IR$Type$Error
 org.enso.compiler.core.IR$Type$Function
 org.enso.compiler.core.IR$Type$Set$Union
 org.enso.compiler.core.ir.DiagnosticStorage
 org.enso.compiler.core.ir.MetadataStorage
 org.enso.compiler.data.BindingsMap
 org.enso.compiler.data.BindingsMap$Cons
 org.enso.compiler.data.BindingsMap$ExportedModule
 org.enso.compiler.data.BindingsMap$ModuleMethod
 org.enso.compiler.data.BindingsMap$ModuleReference$Abstract
 org.enso.compiler.data.BindingsMap$PolyglotSymbol
 org.enso.compiler.data.BindingsMap$Resolution
 org.enso.compiler.data.BindingsMap$ResolvedConstructor
 org.enso.compiler.data.BindingsMap$ResolvedImport
 org.enso.compiler.data.BindingsMap$ResolvedMethod
 org.enso.compiler.data.BindingsMap$ResolvedModule
 org.enso.compiler.data.BindingsMap$ResolvedPolyglotSymbol
 org.enso.compiler.data.BindingsMap$ResolvedType
 org.enso.compiler.data.BindingsMap$SymbolRestriction$AllowedResolution
 org.enso.compiler.data.BindingsMap$SymbolRestriction$Hiding
 org.enso.compiler.data.BindingsMap$SymbolRestriction$Intersect
 org.enso.compiler.data.BindingsMap$SymbolRestriction$Only
 org.enso.compiler.data.BindingsMap$SymbolRestriction$Union
 org.enso.compiler.data.BindingsMap$Type
 org.enso.compiler.pass.analyse.AliasAnalysis$
 org.enso.compiler.pass.analyse.AliasAnalysis$Graph
 org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Link
 org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Occurrence$Def
 org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Occurrence$Use
 org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Scope
 org.enso.compiler.pass.analyse.AliasAnalysis$Info$Occurrence
 org.enso.compiler.pass.analyse.AliasAnalysis$Info$Scope$Child
 org.enso.compiler.pass.analyse.AliasAnalysis$Info$Scope$Root
 org.enso.compiler.pass.analyse.BindingAnalysis$
 org.enso.compiler.pass.analyse.CachePreferenceAnalysis$
 org.enso.compiler.pass.analyse.CachePreferenceAnalysis$WeightInfo
 org.enso.compiler.pass.analyse.DataflowAnalysis$
 org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyInfo
 org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyInfo$Type$Dynamic
 org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyInfo$Type$Static
 org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyMapping
 org.enso.compiler.pass.analyse.GatherDiagnostics$
 org.enso.compiler.pass.analyse.GatherDiagnostics$DiagnosticsMeta
 org.enso.compiler.pass.analyse.TailCall$
 org.enso.compiler.pass.analyse.TailCall$TailPosition$NotTail$
 org.enso.compiler.pass.analyse.TailCall$TailPosition$Tail$
 org.enso.compiler.pass.resolve.DocumentationComments$
 org.enso.compiler.pass.resolve.DocumentationComments$Doc
 org.enso.compiler.pass.resolve.ExpressionAnnotations$
 org.enso.compiler.pass.resolve.GenericAnnotations$
 org.enso.compiler.pass.resolve.GlobalNames$
 org.enso.compiler.pass.resolve.IgnoredBindings$
 org.enso.compiler.pass.resolve.IgnoredBindings$State$Ignored$
 org.enso.compiler.pass.resolve.IgnoredBindings$State$NotIgnored$
 org.enso.compiler.pass.resolve.MethodCalls$
 org.enso.compiler.pass.resolve.MethodDefinitions$
 org.enso.compiler.pass.resolve.ModuleAnnotations$
 org.enso.compiler.pass.resolve.ModuleAnnotations$Annotations
 org.enso.compiler.pass.resolve.Patterns$
 org.enso.compiler.pass.resolve.TypeNames$
 org.enso.compiler.pass.resolve.TypeSignatures$
 org.enso.compiler.pass.resolve.TypeSignatures$Signature
 org.enso.compiler.SuggestionsCache$Suggestions
 org.enso.interpreter.epb.EpbParser$ForeignLanguage
 org.enso.pkg.QualifiedName
 org.enso.polyglot.DocSection$Keyed
 org.enso.polyglot.DocSection$Mark$Example
 org.enso.polyglot.DocSection$Mark$Important
 org.enso.polyglot.DocSection$Mark$Info
 org.enso.polyglot.DocSection$Marked
 org.enso.polyglot.DocSection$Paragraph
 org.enso.polyglot.DocSection$Tag
 org.enso.polyglot.Suggestion$Argument
 org.enso.polyglot.Suggestion$Constructor
 org.enso.polyglot.Suggestion$Function
 org.enso.polyglot.Suggestion$Local
 org.enso.polyglot.Suggestion$Method
 org.enso.polyglot.Suggestion$Module
 org.enso.polyglot.Suggestion$Position
 org.enso.polyglot.Suggestion$Scope
 org.enso.polyglot.Suggestion$Type
 org.enso.syntax.text.Location
 scala.collection.generic.DefaultSerializationProxy
 scala.collection.generic.SerializeEnd$
 scala.collection.immutable.HashMap$
 scala.collection.immutable.HashSet$
 scala.collection.immutable.List$
 scala.collection.immutable.Map$EmptyMap$
 scala.collection.immutable.Map$Map1
 scala.collection.immutable.Map$Map2
 scala.collection.immutable.Map$Map3
 scala.collection.immutable.Map$Map4
 scala.collection.immutable.Set$EmptySet$
 scala.collection.immutable.Set$Set1
 scala.collection.immutable.Set$Set2
 scala.collection.immutable.Set$Set3
 scala.collection.immutable.Set$Set4
 scala.collection.immutable.Vector$
 scala.collection.IterableFactory$ToFactory
 scala.collection.MapFactory$ToFactory
 scala.collection.mutable.HashMap$DeserializationFactory
 scala.None$
 scala.Option
 scala.runtime.ModuleSerializationProxy
 scala.Some
 scala.Tuple2

@JaroslavTulach
Copy link
Member Author

I tried to skip reading function bodies - not done yet. Need to investigate ObjectStreamFilter.

Another attempt was to separate the ImportsExports data - a05dfa5 and 25826dd - but that is already done by "Imports/Exports" caches - as soon as #5890 gets fixed.

@JaroslavTulach
Copy link
Member Author

Further investigation of those 3s spends during deserialization suggests they may be appropriate - we are deserializing 48MB of .ir files as simple patch demonstrates:

--- a/engine/runtime/src/main/java/org/enso/compiler/Cache.java
+++ b/engine/runtime/src/main/java/org/enso/compiler/Cache.java
@@ -230,7 +239,7 @@
 
       if (sourceDigestValid && blobDigestValid) {
         Object readObject;
-        System.err.println("reading from " + dataPath);
+        System.err.println("CacheRead " + getClass().getSimpleName() + " from " + dataPath + " size: " + dataPath.size());
         try (ObjectInputStream stream = new ObjectInputStream(new ByteArrayInputStream(blobBytes)) {
           {
             enableResolveObject(true);

this is the histogram of occurrences of various types inside of the biggest 3MB! large Table.ir:

2       org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyInfo
4       org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyMapping
5       scala.collection.mutable.HashMap$DeserializationFactory
5       scala.collection.mutable.HashMap
6       org.enso.compiler.core.IR$Name$GenericAnnotation
9       org.enso.compiler.data.BindingsMap$PolyglotSymbol
9       org.enso.compiler.core.IR$Module$Scope$Import$Polyglot
9       org.enso.compiler.core.IR$Module$Scope$Import$Polyglot$Java
12      org.enso.compiler.data.BindingsMap$ResolvedPolyglotSymbol
21      org.enso.compiler.core.IR$Type$Error
35      org.enso.compiler.data.BindingsMap$ModuleMethod
37      org.enso.compiler.core.IR$Type$Function
38      org.enso.compiler.core.IR$Pattern$Name
44      org.enso.compiler.core.IR$Pattern$Type
44      org.enso.compiler.data.BindingsMap$ResolvedImport
51      org.enso.compiler.pass.resolve.TypeSignatures$Signature
54      org.enso.compiler.core.IR$Case$Expr
57      org.enso.compiler.core.IR$Type$Set$Union
59      org.enso.compiler.core.IR$Pattern$Constructor
64      org.enso.compiler.pass.resolve.DocumentationComments$Doc
68      org.enso.pkg.QualifiedName
72      org.enso.compiler.core.IR$Literal$Text
82      org.enso.compiler.core.IR$Application$Literal$Sequence
84      org.enso.compiler.data.BindingsMap$Type
87      org.enso.compiler.data.BindingsMap$ResolvedMethod
88      org.enso.compiler.core.IR$Module$Scope$Import$Module
93      org.enso.compiler.core.IR$Literal$Number
105     org.enso.compiler.core.IR$Name$MethodReference
105     org.enso.compiler.core.IR$Module$Scope$Definition$Method$Explicit
113     org.enso.compiler.pass.analyse.AliasAnalysis$Graph
113     org.enso.compiler.pass.analyse.AliasAnalysis$Info$Scope$Root
128     org.enso.compiler.data.BindingsMap$ResolvedModule
132     org.enso.compiler.core.IR$Case$Branch
137     org.enso.compiler.core.IR$Application$Force
163     org.enso.compiler.data.BindingsMap$Cons
174     org.enso.compiler.core.IR$Name$Qualified
213     org.enso.compiler.pass.resolve.ModuleAnnotations$Annotations
213     org.enso.compiler.data.BindingsMap$ResolvedConstructor
278     org.enso.compiler.core.IR$Expression$Block
302     org.enso.compiler.core.IR$Function$Lambda
314     scala.collection.immutable.Set$Set4
336     scala.collection.immutable.HashSet
343     org.enso.compiler.core.IR$Name$Self
352     org.enso.compiler.core.IR$Expression$Binding
456     scala.collection.immutable.Map$Map2
464     scala.collection.immutable.Map$Map1
474     org.enso.compiler.core.IR$DefinitionArgument$Specified
773     scala.collection.immutable.Set$Set3
908     org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Occurrence$Def
1027    org.enso.compiler.data.BindingsMap$ResolvedType
1126    scala.collection.immutable.HashMap
1126    scala.collection.MapFactory$ToFactory
1201    org.enso.compiler.data.BindingsMap$Resolution
1227    org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Link
1255    org.enso.compiler.data.BindingsMap$ModuleReference$Abstract
1320    org.enso.compiler.core.IR$Application$Prefix
1594    scala.collection.immutable.Set$Set2
1957    org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyInfo$Type$Dynamic
2498    org.enso.compiler.core.IR$CallArgument$Specified
2690    org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Scope
3085    org.enso.compiler.pass.analyse.AliasAnalysis$Info$Scope$Child
3228    org.enso.compiler.pass.analyse.AliasAnalysis$Graph$Occurrence$Use
3471    scala.collection.immutable.Map$Map4
3641    scala.collection.immutable.$colon$colon
3738    java.lang.String
3978    scala.collection.IterableFactory$ToFactory
4138    org.enso.compiler.pass.analyse.AliasAnalysis$Info$Occurrence
4469    scala.collection.immutable.Map$Map3
4632    org.enso.compiler.core.IR$Name$Literal
5841    org.enso.syntax.text.Location
5924    org.enso.compiler.core.IR$IdentifiedLocation
6131    org.enso.compiler.pass.analyse.CachePreferenceAnalysis$WeightInfo
6419    scala.Some
11410   org.enso.compiler.core.ir.DiagnosticStorage
11410   org.enso.compiler.core.ir.MetadataStorage
11512   java.util.UUID
18056   scala.collection.immutable.Set$Set1
18496   org.enso.compiler.pass.analyse.DataflowAnalysis$DependencyInfo$Type$Static
42558   scala.Tuple2

the histogram can be printed by following program:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class CountClasses {
  public static void main(String... args) throws Exception {
    Map<Class<?>,int[]> counter = new HashMap<>();
    
    try (ObjectInputStream is = new ObjectInputStream(new FileInputStream(args[0])) {
      {
        enableResolveObject(true);
      }
      @Override
      protected Object resolveObject(Object obj) throws IOException {
        counter.putIfAbsent(obj.getClass(), new int[1]);
        counter.get(obj.getClass())[0]++;
        return obj;
      }
    }) {
      var obj = is.readObject();
      System.out.println("read " + obj.getClass());
      
      var entries = new ArrayList<>(counter.entrySet());
      entries.sort((o1, o2) -> o1.getValue()[0] - o2.getValue()[0]);
      for (var e : entries) {
        System.out.printf("%d\t%s\n", e.getValue()[0], e.getKey().getName());
      }
    }
  }
}

The next question: what can we do about this?

@JaroslavTulach JaroslavTulach linked a pull request Mar 14, 2023 that will close this issue
2 tasks
@mergify mergify bot closed this as completed in #5918 Mar 17, 2023
@JaroslavTulach
Copy link
Member Author

This is not yet done. The new attempt, including ExecSimpleTest is available at IR_5567 branch

@enso-bot
Copy link

enso-bot bot commented Mar 20, 2023

Jaroslav Tulach reports a new STANDUP for today (2023-03-20):

Progress: - test generating ImportExports caches: de60b35
- trouble using loggers from system threads: https://app.slack.com/client/TN37RDLPK/CNQSB2DHD/thread/CNQSB2DHD-1679299491.782429
- no parsing in ExportsResolution: a509bd9
- cache importExportBindings: 483207d
- review of: https://github.com/enso-org/enso/pull/5998/files#r1141647711
- demo meetings It should be finished by 2023-03-27.

Next Day: Vacation - travel to Bulgaria.

GitHub
Pull Request Description Vector.sort handles incomparable values. Now, it should sort basically any vector, potentially attaching a warning if sort encountered incomparable values. Important Notes ...

@JaroslavTulach
Copy link
Member Author

I now have a plan to move forward. The plan is inspired by our work on the new parser. In #3723 we were running the old and new parser next to each other and we were comparing the results.

Let's create a testing infrastructure to run the IR builder without caches and the IR builder with caches next to each other and compare their results. When they differ - we'd have an easy to debug test case to fix. The skeleton of such test is already available in IR_5567 branch.

@enso-bot
Copy link

enso-bot bot commented Mar 23, 2023

Jaroslav Tulach reports a new STANDUP for yesterday (2023-03-22):

Progress: - catching up, building latest bits

Next Day: Back at full speed.

@enso-bot
Copy link

enso-bot bot commented Mar 24, 2023

Jaroslav Tulach reports a new STANDUP for yesterday (2023-03-23):

Progress: - looking at Hubert's: #6045 (comment)

Next Day: Vacation: Parallel Giant Slalom at WCJ

Discord
Discord is the easiest way to communicate over voice, video, and text. Chat, hang out, and stay close with your friends and communities.

@enso-bot
Copy link

enso-bot bot commented Mar 28, 2023

Jaroslav Tulach reports a new STANDUP for yesterday (2023-03-27):

Progress: - processing emails & reviews

  • standup meeting
  • meeting Dmitry: work on simplifying suggestion.DB generation
  • analyzing SerdeCompilerTest failures: Serde testing infrastructure for the Enso compiler #6062 (comment)
  • found the problem: 0d983fe
  • but still CI fighting
  • meeting GregO.
  • standup with libraries guys
  • meeting Wojciech
  • few fixes for PR-6062
  • demo meeting for "GUI Team Meeting"
  • demo meeting for "Libs & Engine" Update It should be finished by 2023-03-27.

Next Day: Integrate SerdeCompilerTest

@JaroslavTulach JaroslavTulach changed the title More Effective IR Caching format Investigate More Effective IR Caching format Mar 28, 2023
mergify bot pushed a commit that referenced this issue Mar 28, 2023
The primary delivery of this PR is a design of `SerdeCompilerTest` - a testing suite that allows us to write sample projects, parse them with and without caches and verify they still produce the same `IR`. This is a similar idea to #3723 which compared the old and new parser `IR`s.

With infrastructure like this we can start addressing #5567 without any (significant) fear of breaking something essential.
@JaroslavTulach
Copy link
Member Author

The testing infrastructure

has been integrated. Let's close this (too) long open issue and schedule smaller ones for subsequent work. Like:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
--low-performance -compiler p-medium Should be completed in the next few sprints s-research-needed Status: the task will require heavy research to complete
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants