JENA-1122 Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene per directory #123

bwmcbride · 2016-01-22T13:06:56Z

JENA-1122

These changes memoize LuceneTextIndexes so that there is one per directory, and Lucene RAMDirectories created by the Lucene assembler so there is one RAMDirectory per node in the configuration graph.

One issue is when to forget a memoized object. The policy implemented in this code is to forget the object when it is closed.

object per directory. Similarly, the Lucene assembler creates only one RAMDirectory per node.

ajs6f · 2016-01-22T13:49:54Z

jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java

I think this whole method might just be eventHandlers.computeIfAbsent(event, e -> new ArrayList<>()).add(handler)?
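
A minimal sketch of what that one-liner could look like in context. The class name `EventRegistry`, the `String` event key and the `Runnable` handler type are assumptions made for this example; the actual field and handler types in TextIndexLucene may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the handler bookkeeping; not the actual TextIndexLucene code.
final class EventRegistry {
    private final Map<String, List<Runnable>> eventHandlers = new HashMap<>();

    // Creates the handler list on first use of an event, then appends to it.
    void addEventHandler(String event, Runnable handler) {
        eventHandlers.computeIfAbsent(event, e -> new ArrayList<>()).add(handler);
    }

    // Number of handlers registered for an event (0 if none).
    int handlerCount(String event) {
        return eventHandlers.getOrDefault(event, Collections.emptyList()).size();
    }

    // Runs every handler registered for the event, in registration order.
    void fire(String event) {
        for (Runnable h : eventHandlers.getOrDefault(event, Collections.emptyList())) {
            h.run();
        }
    }
}
```

Note that a plain `HashMap` is not safe for concurrent registration; if handlers can be added from several threads, the map and lists would need to be concurrent collections or guarded by a lock.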

afs · 2016-01-22T21:59:02Z

Design

Protecting the text index this way sort of works for TDB specifically because of an internal feature of TDB (it manages storage to stop duplication), which is not a guaranteed feature. Other dataset implementations will not work out so nicely. It will be like two separate datasets sharing one index, which will probably lead to corruption or inconsistent reads (cf. the email "transactions and docProducers").

On JENA-1122 I summarized discussions up to here as two options suggested:

1: Internal static state in TextDatasetFactory so that the same dataset object is returned each time (cf. TDB's StoreConnection). This extends sharing of text datasets to java/API uses but not to "any dataset" in Fuseki configurations.
2: Fuseki (or maybe DatasetAssembler), when assembling datasets, deals with sharing using the graph structure. This copes with any dataset but not with API use.

The first one looks hard because of choosing the key to include the dataset in the general case.

The second one is easier to do because there is a natural key for the dataset: its resource (URI or bnode). A bonus would be a similar per-text-index assembler check on reuse (JENA-1104).

There is one minor point: Fuseki can have multiple assembler files with badly chosen, clashing dataset URIs. (Solution: keep a list of all URIs across assembler configs, which is a useful check anyway.)

The ideal for JENA-1122 is this PR (simplified?) to protect text indexes and (2) above to allow complex configurations.

bwmcbride · 2016-01-24T13:32:33Z

The main problem I had was figuring out how to manage the static state so that it was consistent with the lifecycle implemented by TextIndexLucene and avoided memory leaks on long-running systems. Hence the callbacks. That issue is common to both API and assembler construction.

The assembler mechanism is not designed to handle graph (or even lattice) structured composition of components. According to Chris, it is by design that it does a recursive descent over the structure and assembles a new thing each time a node is encountered.

Perhaps there is a need for an assembly mechanism designed to handle graph structures. It would also be useful to have some notion of an assembler 'session', that is, the process of building a complete structure, during which session state could be built up and then thrown away at the end. Maybe there is one and I didn't find it.

ajs6f · 2016-01-24T13:39:48Z

Just as a sidenote, isn't org.apache.jena.assembler.Mode intended to control whether new things are created or old things are reused during the assembly?

afs · 2016-01-24T15:03:16Z

org.apache.jena.assembler.Mode does not help unfortunately. It assists with passing down a requirement to share for certain setups that work as tightly coordinated groups but that isn't good enough here (it's a hint of unclear meaning really, not an instruction). It should not be in the primary interface method but it got there a long time ago.

How do you decide when to set it in the first place?
When set, why would assembler know another should be shared? It assumes knowledge about implementation but the target may be a subtype or specialization.

e.g. TDB internally manages sharing because only TDB knows what must be shared. But a TDB dataset assembler is a subclass of DatasetRDF assembler.

afs · 2016-01-24T15:23:46Z

As far as I can see, only an assembler solution will address JENA-1122. @bwmcbride - do you believe this PR alone will address JENA-1122? Has this PR been run under load?

By protecting the text index alone, I see a new class of concurrency issues arising. A shared text index between two different TDB datasets will go wrong on update. At the moment, you can't create that setup because of JENA-1104. Other setups will have concurrency problems.

If this works in a "one dataset, must be TDB, only in Fuseki" setup, it only protects against JENA-1104. If the use of TDB then happens to work at the moment, that relies on internal implementation handling in Fuseki. That is very fragile.

This PR does no harm though it is a bit more complicated than needed because a solution to the root cause JENA-1122 may well remove the need for some of the machinery.

ajs6f · 2016-01-24T15:59:54Z

Thanks for the clarification re: Mode. Should the Javadocs for Mode say something like: it "advises implementations of user intention, but does not constrain implementation behavior"?

ajs6f · 2016-01-24T16:04:29Z

There has been some discussion about the commonalities between jena-text and jena-spatial. Is the same problem likely to come up with jena-spatial? Maybe it's worth doing the assembler-based fix because of that? (Sorry if I'm really off-base here, just trying to avoid any potential future problems.)

bwmcbride · 2016-01-24T23:24:49Z

Thanks for all the comments. I've updated the pull request implementing all the proposed changes and made a few other minor amendments.

However, looking at @afs's comments again, I realise I am heading off down the wrong path. What @afs had suggested was not memoizing the text index objects in TextDatasetFactory, which is what this PR currently does, but memoizing dataset objects.

My bad. I thought the suggestion was to memoize the TextDataSetLucene objects in TextDatasetFactory. That does I think solve JENA-1122 (as I understand it) but, as Andy says, it opens up other nasty problems.

So I propose to switch to memoizing datasets in the assembler.

I'm sorry the rest of this is so long. The short version is, what is the best way to go about managing state for memoizing datasets in assemblers?

The simple option is to just move the code in this pull request for memoizing TextIndexLucene datasets from TextDatasetFactory to TextIndexLuceneAssembler. That addresses my immediate problem, but a solution that worked for datasets in general would be better.

A problem is if/when/how to clear out the memoized state.

In use cases where an application or a service starts up, assembles its components and then does what it does until it terminates, there may not be much of an issue leaving state around in the assembler maps used to relate nodes to reusable objects.

In use cases where an application is repeatedly building and tearing down assemblies during its lifetime then not clearing out the memoizing state can lead to failures when a component is reused between 'builds'. The current TextIndexLucene test cases do this and they fail if the memoizing map is not cleared out between tests. Maybe the tests are broken.

The current code in this pull request clears a TextIndexLucene object out of the memoizing map when the TextIndexLucene object is closed. But that is basically a kludge. I don't think it is a good idea to go around changing all existing dataset implementations to support event-handling callbacks on close.

The way I would naturally expect things to work is for assemblers to have some notion of a build. A build defines a context in which state like the memoizing map can be built up. The build context can be thrown away at the end of the build and a new one created for the next build. This would prevent reuse across builds. So if you do:

foo = assembler.build(R);
bar = assembler.build(R);

you will get two different assemblies, typically with no sharing between them. (You would still get a lock failure if R had a TextIndexLucene component that was not closed between creating the two assemblies.)
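
A minimal sketch of that idea, assuming a hypothetical `BuildContext` class that is not part of the Jena assembler API. Each build owns its own memo map, so a node is assembled once within a build but never shared across builds. Nodes are plain strings here purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-build context; the Jena assembler API has no such class.
final class BuildContext {
    private final Map<String, Object> memo = new HashMap<>();

    // Returns the object already assembled for this node in this build,
    // or assembles it once and remembers it for the rest of the build.
    // Explicit get/put (rather than computeIfAbsent) so that 'assemble'
    // may itself call open() for child nodes without modifying the map mid-computation.
    Object open(String node, Function<String, Object> assemble) {
        Object existing = memo.get(node);
        if (existing != null) {
            return existing;
        }
        Object created = assemble.apply(node);
        memo.put(node, created);
        return created;
    }
}

final class BuildDemo {
    // One call is one build: the context is created here and discarded on return.
    static Object[] build(String node) {
        BuildContext ctx = new BuildContext();
        Object first = ctx.open(node, n -> new Object());
        Object second = ctx.open(node, n -> new Object()); // same node reached by a second path
        return new Object[] { first, second };
    }
}
```

Within one call to `BuildDemo.build("R")` both paths yield the same object, while two separate calls yield different objects, which is the "no sharing across builds" behaviour described above.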

As far as I can see assemblers are not designed to work like this. Nor can I see how to add this notion of a build context without affecting many existing assemblers. I may be missing something.

I have raised the question in the hope that someone will suggest an approach to a more general solution.

ajs6f · 2016-01-25T20:51:15Z

Is there any understanding about the sharing of components between the results of calls to the various forms of Assembler::open?

afs · 2016-01-30T16:54:54Z

This PR protects against multiple creation of a text index (JENA-1104), not against two calls to create the same dataset for two services in Fuseki. By chance, TDB is less prone to problems if that happens but that is luck. General datasets e.g. with inference graphs, SDB or plain in-memory datasets are likely exposed to problems.

Let's solve the immediate issue described in JENA-1122, then see if JENA-1104 needs addressing or whether the situations where it can still happen are uninteresting or have other problems in which case the application must be responsible for creating the index only once.

For the record, there are some specific items with the current PR that I would like clarified or refuted before this code is used to address JENA-1107, if that is still needed.

1: TextIndexLucene.close is not reference counted.

Create text index -> T1
Create text index -> T2 (which is T1, shared)
Close T2.
Any T1 code will now crash - the index is closed.

2: Using WeakReferences and managing close() seems to be duplicating lifecycle management.

I am not clear that the WeakReference to the Lucene index helps because there are no finalizers, so GC finalization does not tidy up Lucene. A freed WeakReference would cause a new attempt to create the index but it will hit the state lock.

3: Creation by createLuceneIndex is not thread safe. It has a get-create-put timing hole.

4: Need to be clear on the contract for "same Directory, different Lucene configuration (TextIndexConfig)".
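
To illustrate points 1 and 3, here is a minimal sketch of a reference-counted registry with atomic get-or-create. `SharedIndexes`, its constructor arguments and the use of a directory string as key are assumptions for this example, not code from the PR or from jena-text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical registry: one shared object per directory key, created under a lock
// and closed only when the last user releases it.
final class SharedIndexes<T> {
    private static final class Entry<T> {
        final T value;
        int users;
        Entry(T value) { this.value = value; }
    }

    private final Map<String, Entry<T>> entries = new HashMap<>();
    private final Function<String, T> create;
    private final Consumer<T> close;

    SharedIndexes(Function<String, T> create, Consumer<T> close) {
        this.create = create;
        this.close = close;
    }

    // Lookup, creation and count increment happen in one synchronized step,
    // so there is no get-create-put window between threads.
    synchronized T acquire(String directory) {
        Entry<T> e = entries.get(directory);
        if (e == null) {
            e = new Entry<>(create.apply(directory));
            entries.put(directory, e);
        }
        e.users++;
        return e.value;
    }

    // Closes the underlying object only when the last user lets go.
    synchronized void release(String directory) {
        Entry<T> e = entries.get(directory);
        if (e == null) {
            return; // unknown key: nothing to release
        }
        if (--e.users == 0) {
            entries.remove(directory);
            close.accept(e.value);
        }
    }
}
```

With this shape, closing T2 in the sequence under point 1 would only decrement the count, and T1 would keep working until it is released too. The sketch deliberately ignores configuration: a second `acquire` for the same directory gets the first object whatever settings the caller wanted, which is exactly the contract question in point 4.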

afs · 2016-01-30T17:04:53Z

Starting from the JENA-1122 description:

Two Fuseki services, linking to the same dataset description.

Fuseki only calls assemblers once. No other system is (legitimately) calling Fuseki service building. The configuration file processing puts service access points into the server-wide state. There is no service assembler (it could be done but it isn't, as it would serve no purpose); services are built by custom processing while walking the data structure, which is happening anyway.

In the Fuseki case, we want shared dataset descriptions, that is, descriptions with the same name, to yield the same dataset. Processing dataset descriptions is driven by assemblers and they have names for keys using the root resource. A general "dataset sharing" outside assemblers is hard because of the lack of a key. In other cases, I can imagine that a shared description alone does not always imply a shared object: in-memory datasets, for example. The more general area is not clearly defined.

The solution I see is that Fuseki handles the process step for the link:

    fuseki:dataset   <#dataset>

<#dataset> rdf:type ja:RDFDataset(OrSubType)

This happens in Builder.buildDataService as it calls Assembler.general.open(datasetDesc).

It looks to me that if sharing is provided here, the problem statement of JENA-1122 is addressed.

One matter arising:

Service descriptions can be in multiple files (it is the preferred pattern to use configuration/). The template system behind the UI uses relative URIs so names of descriptions are unique across the server.

If a user manually writes two configuration files, but uses the same absolute URI and they meant it to be different, we have a problem and this could be made to cause an error (safe choice to force shared datasets to be in the server config.ttl).

FusekiConfig.initializeDataAccessPoints is the driver; it calls readConfigurationDirectory and the other places where service descriptions can be found, and so needs checking.

For now, just solving this for two services in the server configuration file, linked from entries in the fuseki:services list, is a good start.

bwmcbride · 2016-01-31T19:41:04Z

Thanks for the steer. So fix in Builder.buildDataService.

The simple thing to do to just solve the problem for two services in the server configuration file is to create a static memoizing map in the Builder class.

One of the calls to Builder.buildDataAccessPoint, which calls Builder.buildDataService is from ActionDatasets.execPostContainer. That seems to be taking an HTTP POST, interpreting it as a configuration and then creating a service corresponding to that configuration. There appears also to be a DELETE operation.

So, in the case of the sequence:

POST config for service A
DELETE service A
POST different config for service A

the dataset created for the first POST will get reused.

Is it ok to use a simple static map in the Builder class as 'a start'?
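
To make the concern concrete, here is a small model of that sequence, assuming both POSTs use the same dataset description URI. `NaiveBuilder` and its methods are hypothetical names for this sketch and do not correspond to Fuseki classes; a dataset is represented simply by the configuration text it was built from.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the scenario; not Fuseki code.
final class NaiveBuilder {
    // Dataset description URI -> dataset built from it (here just the config text).
    private static final Map<String, String> DATASETS = new HashMap<>();
    // Service name -> dataset it is linked to.
    private static final Map<String, String> SERVICES = new HashMap<>();

    // "POST": build the service, reusing any dataset already memoized for the URI.
    static String post(String service, String datasetUri, String config) {
        String dataset = DATASETS.computeIfAbsent(datasetUri, uri -> config);
        SERVICES.put(service, dataset);
        return dataset;
    }

    // "DELETE": removes the service but never touches the memoized dataset.
    static void delete(String service) {
        SERVICES.remove(service);
    }
}
```

Calling `post("A", "#dataset", "config-1")`, then `delete("A")`, then `post("A", "#dataset", "config-2")` returns the dataset built from the first configuration both times, because nothing ever evicts the memoized entry.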

bwmcbride · 2016-01-31T20:07:44Z

See also JENA-848.

afs · 2016-02-01T12:06:30Z

It needs to be a singleton of some kind; a static map is one way to do singletons. There is only one Fuseki server per JVM.

c.f. DataAccessPointRegistry and Registry<K,T>.

Reference counting the service-to-dataset linkage would be good because it prepares for any future deletion of datasets from a running server.

afs · 2016-02-01T12:07:28Z

POST config for service A

DELETE service A

POST different config for service A

the dataset created for the first POST will get reused.

if the same URI is used for fuseki:dataset

ajs6f · 2016-02-08T14:16:14Z

...ki2/jena-fuseki-core/src/main/java/org/apache/jena/fuseki/build/DescriptionToDatasetMap.java

+
+	public static DescriptionToDatasetMap getSingleton() { return singleton ; }
+
+	private RefCountingMap<Resource, Dataset> map = new RefCountingMap<Resource,Dataset>();


A small thing, but you can use new RefCountingMap<>() here.

afs · 2016-02-09T12:54:50Z

Would it be possible to use one of the Guava Multi* classes? Maybe a Map<K, Multiset<V>> (a map to a counting set, ensuring the set holds one unique individual). There are other Multi* classes, one of which might even do all of this directly, but I can't find one ATM.

bwmcbride · 2016-02-09T13:44:01Z

I wasn't aware of Guava - I'll have a look.

afs · 2016-02-09T14:10:05Z

Jena has a shaded copy in org.apache.jena.ext.com.google.... (module jena-ext - one of the modules before jena-core).

afs · 2016-02-19T20:40:34Z

I've put in this code (integrated via applying the diff and some minor changes after that).

There is an outstanding issue:

If two different files (e.g. run/configuration/service1.ttl and run/configuration/service2.ttl) each define their dataset with the same URI (a bad choice, but it can happen), then the second is lost.

One possibility is to clear the Resource->dataset mapping for each file.
Another is to live with it.

If the latter, it would be good to check for mistakes. Any good suggestions how to check that?

afs · 2016-02-20T09:59:05Z

The immediate issue that this PR addresses is completed.

The details of the behaviour in all cases are discussed on JENA-1122.

Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene object per directory. Similarly, the Lucene assembler creates only one RAMDirectory per node.

d0451f1

bwmcbride changed the title ~~Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene~~ Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene per directory Jan 22, 2016

ajs6f reviewed Jan 22, 2016

bwmcbride added 3 commits January 24, 2016 13:47

responding to feedback from first pull request

2396ff6

Adding missing commits

72ae068

add newline to end of file

a208eaa

Added tests that a new index is allocated after an index close

7e53695

ajs6f mentioned this pull request Jan 25, 2016

Add Javadoc for Assembler indicating that Model values are advisory #124

Merged

bwmcbride changed the title ~~Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene per directory~~ JENA-1122 Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene per directory Jan 30, 2016

bwmcbride added 3 commits February 3, 2016 17:38

Revert changes in JenaText

ac8baf1

Memoize datasets when building Fuseki

db68003

Added explanatory comments

f77e779

ajs6f reviewed Feb 8, 2016

asfgit closed this in 0602895 Feb 19, 2016

kinow mentioned this pull request Jun 20, 2018

JENA-1556 implementation #436

Merged


