-
Notifications
You must be signed in to change notification settings - Fork 675
JENA-1122 Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene per directory #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
object per directory. Similary so the Lucene assembler only creates one RAMDirectory per node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this whole method might just be eventHandlers.computeIfAbsent(event, e -> new ArrayList<>()).add(handler)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is.
|
Design Protecting the text index this way sort of works for TDB specifically because of an internal feature of TDB (it manages storage to stop duplication) which is not a guaranteed feature. Other dataset implementations will not work out so nicely. It will be like two separate datasets and one index will and probably lead to corruption or inconsistent reading (c.f. email "transactions and docProducers"). On JENA-1122 I summarized discussions up to here as two options suggested:
The first one looks hard because of choosing the key to include the dataset in the general case. The second one is easier to do because there is a natural key of the resource (URI, bnode) for the dataset. Bonus would a similar per-text index assembler check on reuse JENA-1104. There is one minor point - Fuseki can have multiple assembler files and badly chosen, clashing dataset URIs (solution - keep a list of all URIs acorss assembler configs - useful check anyway) The ideal for JENA-1122 is this PR (simplified?) to protect text indexes and (2) above to allow complex configurations. |
|
The main problem I had was figuring out how to manage the static state so it was consistent with the lifecyle implemented by TextIndexLucene and avoid memory leaks on long running systems. Hence the call backs. That is an issue that is common to both API and assembler construction. The assembler mechanism is not designed to handle graph (or even lattice) structured composition of components. According to Chris, it is by design that It does a recursive decent over the structure and assembles a new thing each time a node is encountered. Perhaps there is a need for an assembly mechanism designed to handle graph structures. Would also be useful to have some notion of an assember 'session', that is the process of building a complete structure, during which session state could be built up and thrown away at the end. Maybe there is one and I didn;t find it. |
|
Just as a sidenote, isn't |
|
e.g. TDB internally manages sharing because only TDB knows what must be shared. But a TDB dataset assembler is a subclass of DatasetRDF assembler. |
|
As far as I can see, only an assembler solution will address JENA-1122. @bwmcbride - do you believe this PR alone will address JENA-1122? Has this PR been run under load? By protecting the text index alone, I see a new class of concurrency issues arising. A shared text index between two different TDB datasets will go wrong on update. At the moment, you can't create that setup because of JENA-1104. Other setups will have concurrency problems. If this works in a "one dataset, must be TDB, only in Fuseki" setup, it only protects against JENA-1104. If then use of TDB happens to work at the moment, it is based on internal implementation handling in Fuseki. That is very fragile. This PR does no harm though it is a bit more complicated than needed because a solution to the root cause JENA-1122 may well remove the need for some of the machinery. |
|
Thanks for the clarification re: |
|
There has been some discussion about the commonalities between |
|
Thanks for all the comments. I've updated the pull request implementing all the proposed changes and made a few other minor amendments. However, looking at @afs comments again, I realise I am heading off down the wrong path. What @afs had suggested was not memoizing the text index objects in TextDataset factory, which is what this PR currently does, but memoizing dataset objects. My bad. I thought the suggestion was to memoize the TextDataSetLucene objects in TextDatasetFactory. That does I think solve JENA-1122 (as I understand it) but, as Andy says, it opens up other nasty problems. So I propose to switch to memoizing datasets in the assembler. I'm sorry the rest of this is so long. The short version is, what is the best way to go about managing state for memoizing datasets in assemblers? The simple option is to just move the code in this pull request for memoizing TextIndexLucene datasets from TextDatasetFactory to TextIndexLuceneAssembler. That addresses my immediate problem, but a solution that worked for datasets in general would be better. A problem is if/when/how to clear out the memoized state. In use cases where an application or a service starts up, assembles its components and then does what it does until it terminates, there may not be much of an issue leaving state around in the assembler maps used to relate nodes to reusable objects. In use cases where an application is repeatedly building and tearing down assemblies during its lifetime then not clearing out the memoizing state can lead to failures when a component is reused between 'builds'. The current TextIndexLucene test cases do this and they fail if the memoizing map is not cleared out between tests. Maybe the tests are broken. The current code in this pull request clears a TextIndexLucene object out of the memoizingmap when the TextIndexLucene object is closed. But that is basically a kludge. I don't think it a good idea to go round changing all existing dataset implementations to support event handling callbacks on close. The way I would naturally expect things to work is for assemblers to have some notion of a build. A build defines a context in which state like the memoizing map can be built up. The build context can be thrown away at the end of the build and a new one created for the next build. This would prevent reuse across builds. So if you do: foo = assembler.build(R); you will get two different assemblies, typically with no sharing between them. (You would still get a lock failure if R had a TextIndexLucene component that was not closed between creating the two assemblies.) As far as I can see assemblers are not designed to work like this. Nor can I see how to add this notion of a build context without affecting many existing assemblers. I may be missing something. I have raised the question in the hope that someone will suggest an approach to a more general solution. |
|
Is there any understanding about the sharing of components between the results of calls to the various forms of |
|
This PR protects against multiple creation of a text index (JENA-1104), not against two calls to create the same dataset for two services in Fuseki. By chance, TDB is less prone to problems if that happens but that is luck. General datasets e.g. with inference graphs, SDB or plain in-memory datasets are likely exposed to problems. Let's solve the immediate issue described in JENA-1122, then see if JENA-1104 needs addressing or whether the situations where it can still happen are uninteresting or have other problems in which case the application must be responsible for creating the index only once. For the record, there are some specific items with the current PR that I would like clarified or refuted before this code is used to address JENA-1107, if that is still needed. 1:
2: Using WeakReferences and managing I am not clear that the WeakReference to the Lucene index helps because there are no finalizers, so GC fnalization does not tidy up lucene. A freed WeakReference would cause a new attempt to create the index but it will hit the state lock. 3: Creation by 4: Need to be clear on the contract for "same |
|
Starting from the JENA-1122 description:
Fuseki only calls assemblers once. No other system is (legitimately) calling Fuseki service building. The configuration file processing puts service access points into the server-wide state. There is no service assembler (it could be done but it isn't, it serves no purpose); it is done by custom processing during walking the datastructure which is happening anyway. In the Fuseki case, we want shared datasets descriptions, that is, same name, to yield the same dataset. Processing dataset descriptions is driven by assemblers and they have names for keys using the root resource. A general "dataset sharing" outside assemblers is hard because of the lack of key. In other cases, I can imagine that a shared description alone does not always imply a shared object - in-memory datasets for example. The more general area is not clearly defined. The solution I see is that Fuseki handles the process step for the link: This happens in It looks to me that if sharing is provided here, the problem statement of JENA-1122 is addressed. One matter arising: Service descriptions can be in multiple files (it is the preferred pattern to use If a user manually writes two configuration files, but uses the same absolute URI and they meant it to be different, we have a problem and this could be made to cause an error (safe choice to force shared datasets to be in the server
For now, just solving this for two services in the server configuration file, with entries in the |
|
Thanks for the steer. So fix in Builder.buildDataService. The simple thing to do to just solve the problem for two services in the server configuration file is to create a static memoizing map in the Builder class. One of the calls to Builder.buildDataAccessPoint, which calls Builder.buildDataService is from ActionDatasets.execPostContainer. That seems to be taking an HTTP POST, interpreting it as a configuration and then creating a service corresponding to that configuration. There appears also to be a DELETE operation. So, in the case of the sequence:
the dataset created for the first POST will get reused. Is it ok to use a simple static map in the Builder class as 'a start'? |
|
See also JENA-848. |
|
It needs to be a singleton of some kind; a static map is one way to do singletons. There is only one Fuseki server per JVM. c.f. Referencing counting the service to dataset linkage would be good because it builds for any future deleting of datasets from a running server. |
if the same URI is used for |
|
|
||
| public static DescriptionToDatasetMap getSingleton() { return singleton ; } | ||
|
|
||
| private RefCountingMap<Resource, Dataset> map = new RefCountingMap<Resource,Dataset>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small thing, but you can use new RefCountingMap<>() here.
|
Would it be possible to use one of the Guava |
|
I wasn't aware of Guava - I'll have a look. |
|
Jena has a shaded copy in |
|
I've put in this code (integrated via applying the diff and some minor changes after that). There is an outstanding issue: If two different files (e.g One possibility is to clear the Resource->dataset mapping for each file. If the latter, it would be good to check for mistakes. Any good suggestions how to check that? |
|
The immediate issue that this PR addresses is completed. Discussion of the details of the behaviour in all cases on JENA-1122. |
JENA-1122
These changes memoize LuceneTextIndexes so that there is one per directory, and Lucene RAMDirectories created by the Lucene assembler so there is one RAMDirectory per node in the configuration graph.
One issue is when to forget a memoized object. The policy implemented in this code is to forget the object when it is closed.