/*******************************************************************************
* Copyright (c) 2015 Eclipse RDF4J contributors, Aduna, and others.
* All rights reserved. This program and the accompanying materials
* are made available under the terms of the Eclipse Distribution License v1.0
* which accompanies this distribution, and is available at
* http://www.eclipse.org/org/documents/edl-v10.php.
*******************************************************************************/
package org.eclipse.rdf4j.sail.lucene;
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Resource;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.model.Value;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.repository.sail.SailRepositoryConnection;
import org.eclipse.rdf4j.sail.NotifyingSailConnection;
import org.eclipse.rdf4j.sail.SailException;
import org.eclipse.rdf4j.sail.helpers.NotifyingSailWrapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* A LuceneSail wraps an arbitrary existing Sail and extends it with support for full-text search on all
* Literals.
 * <h2>Setting up a LuceneSail</h2> LuceneSail works in two modes: storing its data in a directory on the
 * hard disk, or in a RAMDirectory in RAM (which is discarded when the program ends). Example with storage in
 * a folder:
*
* <pre>
* // create a sesame memory sail
* MemoryStore memoryStore = new MemoryStore();
*
* // create a lucenesail to wrap the memorystore
* LuceneSail lucenesail = new LuceneSail();
* // set this parameter to store the lucene index on disk
* lucenesail.setParameter(LuceneSail.LUCENE_DIR_KEY, "./data/mydirectory");
*
* // wrap memorystore in a lucenesail
* lucenesail.setBaseSail(memoryStore);
*
* // create a Repository to access the sails
* SailRepository repository = new SailRepository(lucenesail);
* repository.initialize();
* </pre>
*
* Example with storage in a RAM directory:
*
* <pre>
* // create a sesame memory sail
* MemoryStore memoryStore = new MemoryStore();
*
* // create a lucenesail to wrap the memorystore
* LuceneSail lucenesail = new LuceneSail();
* // set this parameter to let the lucene index store its data in ram
* lucenesail.setParameter(LuceneSail.LUCENE_RAMDIR_KEY, "true");
*
* // wrap memorystore in a lucenesail
* lucenesail.setBaseSail(memoryStore);
*
* // create a Repository to access the sails
* SailRepository repository = new SailRepository(lucenesail);
* repository.initialize();
* </pre>
*
* <h2>Asking full-text queries</h2> Text queries are expressed using the virtual properties of the
* LuceneSail. An example query looks like this (SERQL): <code>
* SELECT Subject, Score, Snippet
* FROM {Subject} <http://www.openrdf.org/contrib/lucenesail#matches> {}
* <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> {<http://www.openrdf.org/contrib/lucenesail#LuceneQuery>};
* <http://www.openrdf.org/contrib/lucenesail#query> {"my Lucene query"};
* <http://www.openrdf.org/contrib/lucenesail#score> {Score};
* <http://www.openrdf.org/contrib/lucenesail#snippet> {Snippet}
 * </code> When defining queries, the <b>type and query properties are mandatory</b>, as is the <b>matches
 * relation</b>. When one of these is missing, the query will not be executed as expected. The failure
 * behavior can be configured: setting the Sail property "incompletequeryfail" to true will throw a
 * SailException when such patterns are found; this is the default behavior and helps to find inaccurate
 * queries. Set it to false to have warnings logged instead. <b>Multiple queries</b> can be issued to the
 * sail, and the results of the queries will be integrated. Note that you cannot use the same variable for
 * multiple text queries; if you want to combine text searches, use Lucene's query syntax.
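 * <p>
 * SERQL has since been deprecated; the same kind of query is nowadays usually written in SPARQL. A sketch
 * (the "search" prefix below abbreviates the LuceneSail namespace used above):
 *
 * <pre>
 * PREFIX search: &lt;http://www.openrdf.org/contrib/lucenesail#&gt;
 * SELECT ?subject ?score ?snippet WHERE {
 *   ?subject search:matches [
 *     search:query "my Lucene query" ;
 *     search:score ?score ;
 *     search:snippet ?snippet ] .
 * }
 * </pre>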
 * <h2 id="storedindexed">Fields are stored/indexed</h2> All fields are stored and indexed. The "text" fields
 * (gathering all literals) have to be stored because, when a new literal is added to a document, the previous
 * texts need to be copied from the existing document to the new document; this does not work when they are
 * only "indexed". Fields that are not stored cannot be retrieved using full-text querying.
* <h2>Deleting a Lucene index</h2> At the moment, deleting the lucene index can be done in two ways:
* <ul>
* <li>Delete the folder where the data is stored while the application is not running</li>
* <li>Call the repository's
* <code>{@link org.eclipse.rdf4j.repository.RepositoryConnection#clear(org.eclipse.rdf4j.model.Resource[])}</code>
 * method with no arguments: <code>clear()</code>. This will delete the index.</li>
* </ul>
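 * For the second option, a sketch (assuming an already initialized <code>repository</code>):
 *
 * <pre>
 * try (RepositoryConnection conn = repository.getConnection()) {
 *   // no context arguments: clears the whole repository, including the Lucene index
 *   conn.clear();
 * }
 * </pre>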
 * <h2>Handling of Contexts</h2> Each Lucene document contains a field for every context ID that contributed
 * to the document. <b>NULL</b> contexts are marked using the String {@link LuceneIndex#CONTEXT_NULL} ("null")
 * and stored in the Lucene field {@link LuceneIndex#CONTEXT_FIELD_NAME} ("context"). This means that when
 * adding/appending to a document, all additional context URIs are added to the document. When deleting
 * individual triples, the context is ignored. In clear(Resource ...) we query for all Lucene documents that
 * were possibly created by these contexts. Given a document D that contexts C(1-n) contributed to, let D' be
 * the new document after clear():
 * <ul>
 * <li>if there is only one C, then D can be safely removed and there is no D' (hopefully the standard case,
 * as in ontologies, where all triples about a resource are in one document);</li>
 * <li>if there are multiple C, remember the URI of D, delete D, and query (s, p, o, ?) from the underlying
 * store after committing the operation; this returns the literals of D', which is added as a new
 * document.</li>
 * </ul>
 * This will probably be both fast in the common case and capable enough in the multiple-C case.
 * <h2 id="indexedfieldssyntax">Defining the indexed fields</h2> The property {@link #INDEXEDFIELDS} is used
 * to configure which fields to index and to project one property to another. Syntax:
*
* <pre>
* # only index label and comment
* index.1=http://www.w3.org/2000/01/rdf-schema#label
* index.2=http://www.w3.org/2000/01/rdf-schema#comment
* # project http://xmlns.com/foaf/0.1/name to rdfs:label
* http\://xmlns.com/foaf/0.1/name=http\://www.w3.org/2000/01/rdf-schema#label
* </pre>
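 * <p>
 * These settings are passed as a single property string. A sketch of wiring this up (hypothetical values;
 * note that colons in keys must be backslash-escaped for {@link java.util.Properties}, which in a Java
 * string literal becomes <code>\\:</code>):
 *
 * <pre>
 * lucenesail.setParameter(LuceneSail.INDEXEDFIELDS,
 *     "index.1=http://www.w3.org/2000/01/rdf-schema#label\n"
 *         + "http\\://xmlns.com/foaf/0.1/name=http\\://www.w3.org/2000/01/rdf-schema#label\n");
 * </pre>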
*
* <h2>Datatypes</h2> Datatypes are ignored in the LuceneSail.
*/
public class LuceneSail extends NotifyingSailWrapper {
/*
* FIXME: Add a proper reference to the ISWC paper in the Javadoc. Gunnar: only when/if the paper is
* accepted Enrico: paper was rejected Leo: We need to resubmit it. FIXME: Add settings that instruct a
* LuceneSailConnection or LuceneIndex which properties are to be handled in which way. This is
* conceptually similar to Lucene's Field types: should properties be stored in the wrapped Sail (enabling
* retrieval through RDF queries), indexed in the LuceneIndex (enabling full-text search using Lucene
* queries embedded in RDF graph queries) or both? Gunnar and Leo: we had this in the old version, we
* might add later. Enrico: in beagle we set the default setting to index AND store a field, so that when
* you extend the ontology you can be sure it is indexed and stored by the lucenesail without touching it.
* For certain (very rare) predicates (like the full text of the resource) we then explicitly turned off
* the store option. That would be a desired behaviour. In the old version an RDF file was used, but it
* should be done differently, that is too hard-coded! can't that information be stored in the wrapped
* sail itself? Annotate a predicate with the proper lucene values (store / index / storeAndIndex), if
* nothing is given, take the default, and read this on starting the lucenesail. Leo: ok, default = index
* and store, agreed. Leo: about configuration: RDF config is agreed, if passed as file, inside the
* wrapped sail, or in an extra sail should all be possible.
*/
/*
* FIXME: This code can only handle RDF queries containing a single "Lucene expression" (i.e. a
* combination of matches, query and optionally other predicates from the LuceneSail's namespace), the
* other expressions are ignored. Extending this to support an arbitrary number of search expressions is
 * theoretically possible but easier said than done, especially because of the number of different cases
* that need to be handled: variable subject vs. specified subject, expressions operating on the same
* subject vs. expressions operating on different subjects, etc. Gunnar: I would we restrict this to one.
* Enrico might have other requirements? Enrico: we need 1) an arbitrary number of lucene expressions and
* 2) an arbitrary combination with ordinary structured queries (see lucenesail paper, fig. 1 on page 6)
* Leo: combining lucene query with normal query is required, having multiple lucene queries in one SPARQL
* query is a good idea, which should be doable. Lower priority. FIXME: We should escape those chars in
* predicates/field names that have a special meaning in Lucene's query syntax, using ":" in a field name
* might lead to problems (it will when you start to query on these fields). Enrico: yes, we escaped those
 * : successfully with a simple \, the only difficulty was to figure out how many \ are needed (how often
 * they get unescaped until they arrive at Lucene). Leo noticed this. Gunnar asks: Does Lucene not have an
 * escape syntax? FIXME: The getScore method is a convenient and efficient way of testing whether a given
* document matches a query, as it adds the document URI to the Lucene query instead of firing the query
* and looping over the result set. The problem with this method is that I am not sure whether adding the
* URI to the Lucene query will lead to a different score for that document. For most applications this is
 * probably not a problem as you will either use the search method with the scores reported to its
* listener, or the getScore method, but not both. The order of matching documents will probably be the
* same when sorting on score (field is indexed without normalization + only unique values). Still, it is
* counterintuitive when a particular document is returned with a given score and a getScore for that same
* URI gives a different score. FIXME: the code is very much NOT thread-safe, especially when you are
* changing the index and querying it with LuceneSailConnection at the same time: the
* IndexReaders/Searchers are closed after each statement addition or removal but they must also remain
* open while we are looping over search results. Also, internal document numbers are used in the
* communication between LuceneIndex and LuceneSailConnection, which is not a good idea. Some mechanism
* has to be introduced to support external querying while the index is being modified (basically: make
* sure that a single search process keeps using the same IndexSearcher). Gunnar and Leo: we are not sure
* if the original lucenesail was 100% threadsafe, but at least it had "synchronized" everywhere :)
* http://gnowsis.opendfki.de/repos/gnowsis/trunk/lucenesail/src/java/org/openrdf/sesame/sailimpl/
* lucenesail/LuceneIndex.java This might be a big issue in Nepomuk... Enrico: do we have multiple
* threads? do we need separate threads? Leo: we have separate threads, but we don't care much for now.
*/
private final Logger logger = LoggerFactory.getLogger(this.getClass());
/**
* Set the parameter "reindexQuery=" to configure the statements to index over. Default value is
 * "SELECT ?s ?p ?o ?c WHERE {{?s ?p ?o} UNION {GRAPH ?c {?s ?p ?o.}}} ORDER BY ?s". NB: the query must
* contain the bindings ?s, ?p, ?o and ?c and must be ordered by ?s.
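 * Example (sketch): index only statements that appear in named graphs, while keeping the required
 * bindings and ordering:
 *
 * <pre>
 * sail.setParameter(LuceneSail.REINDEX_QUERY_KEY,
 *     "SELECT ?s ?p ?o ?c WHERE { GRAPH ?c { ?s ?p ?o } } ORDER BY ?s");
 * </pre>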
*/
public static final String REINDEX_QUERY_KEY = "reindexQuery";
/**
* Set the parameter "indexedfields=..." to configure a selection of fields to index, and projections of
* properties. Only the configured fields will be indexed. A property P projected to Q will cause the
* index to contain Q instead of P, when triples with P were indexed. Syntax of indexedfields - see
* <a href="#indexedfieldssyntax">above</a>
*/
public static final String INDEXEDFIELDS = "indexedfields";
/**
 * Set the key "lucenedir=<path>" as sail parameter to configure the directory on the filesystem where the
 * Lucene index is stored.
*/
public static final String LUCENE_DIR_KEY = "lucenedir";
/**
* Set the key "useramdir=true" as sail parameter to let the LuceneSail store its Lucene index in RAM.
* This is not intended for production environments.
*/
public static final String LUCENE_RAMDIR_KEY = "useramdir";
/**
* Set the key "maxDocuments=<n>" as sail parameter to limit the maximum number of documents to
* return from a search query. The default is to return all documents. NB: this may involve extra cost for
* some SearchIndex implementations as they may have to determine this number.
*/
public static final String MAX_DOCUMENTS_KEY = "maxDocuments";
/**
* Set this key to configure which fields contain WKT and should be spatially indexed. The value should be
* a space-separated list of URIs. Default is http://www.opengis.net/ont/geosparql#asWKT.
*/
public static final String WKT_FIELDS = "wktFields";
/**
* Set this key to configure the SearchIndex class implementation. Default is
* org.eclipse.rdf4j.sail.lucene.LuceneIndex.
*/
public static final String INDEX_CLASS_KEY = "index";
public static final String DEFAULT_INDEX_CLASS = "org.eclipse.rdf4j.sail.lucene.LuceneIndex";
/**
* Set this key as sail parameter to configure the Lucene analyzer class implementation to use for text
* analysis.
*/
public static final String ANALYZER_CLASS_KEY = "analyzer";
/**
* Set this key as sail parameter to influence whether incomplete queries are treated as failure
 * (malformed queries) or whether they are ignored. Set to either "true" or "false". When omitted in the
 * properties, true is the default (failure on incomplete queries). See {@link #isIncompleteQueryFails()}.
*/
public static final String INCOMPLETE_QUERY_FAIL_KEY = "incompletequeryfail";
/**
* The LuceneIndex holding the indexed literals.
*/
private volatile SearchIndex luceneIndex;
protected final Properties parameters = new Properties();
private volatile String reindexQuery = "SELECT ?s ?p ?o ?c WHERE {{?s ?p ?o} UNION {GRAPH ?c {?s ?p ?o.}}} ORDER BY ?s";
private volatile boolean incompleteQueryFails = true;
private Set<IRI> indexedFields;
private Map<IRI, IRI> indexedFieldsMapping;
private IndexableStatementFilter filter = null;
private final AtomicBoolean closed = new AtomicBoolean(false);
public void setLuceneIndex(SearchIndex luceneIndex) {
this.luceneIndex = luceneIndex;
}
public SearchIndex getLuceneIndex() {
return luceneIndex;
}
@Override
public NotifyingSailConnection getConnection()
throws SailException
{
return new LuceneSailConnection(super.getConnection(), luceneIndex, this);
}
@Override
public void shutDown()
throws SailException
{
if (closed.compareAndSet(false, true)) {
try {
SearchIndex toShutDownLuceneIndex = luceneIndex;
luceneIndex = null;
if (toShutDownLuceneIndex != null) {
toShutDownLuceneIndex.shutDown();
}
}
catch (IOException e) {
throw new SailException(e);
}
finally {
// ensure that super is also invoked when the LuceneIndex causes an
// IOException
super.shutDown();
}
}
}
@Override
public void setDataDir(File dataDir) {
this.setParameter(LuceneSail.LUCENE_DIR_KEY, dataDir.getAbsolutePath() + ".index");
this.getBaseSail().setDataDir(dataDir);
}
@Override
public void initialize()
throws SailException
{
super.initialize();
if (parameters.containsKey(INDEXEDFIELDS)) {
String indexedfieldsString = parameters.getProperty(INDEXEDFIELDS);
Properties prop = new Properties();
try (Reader reader = new StringReader(indexedfieldsString)) {
prop.load(reader);
}
catch (IOException e) {
throw new SailException("Could not read " + INDEXEDFIELDS + ": " + indexedfieldsString, e);
}
ValueFactory vf = getValueFactory();
indexedFields = new HashSet<>();
indexedFieldsMapping = new HashMap<>();
for (Object key : prop.keySet()) {
String keyStr = key.toString();
if (keyStr.startsWith("index.")) {
indexedFields.add(vf.createIRI(prop.getProperty(keyStr)));
}
else {
indexedFieldsMapping.put(vf.createIRI(keyStr), vf.createIRI(prop.getProperty(keyStr)));
}
}
}
try {
if (parameters.containsKey(REINDEX_QUERY_KEY))
setReindexQuery(parameters.getProperty(REINDEX_QUERY_KEY));
if (parameters.containsKey(INCOMPLETE_QUERY_FAIL_KEY))
setIncompleteQueryFails(
Boolean.parseBoolean(parameters.getProperty(INCOMPLETE_QUERY_FAIL_KEY)));
if (luceneIndex == null) {
initializeLuceneIndex();
}
}
catch (Exception e) {
throw new SailException("Could not initialize LuceneSail: " + e.getMessage(), e);
}
}
protected void initializeLuceneIndex()
throws Exception
{
String indexClassName = parameters.getProperty(INDEX_CLASS_KEY, DEFAULT_INDEX_CLASS);
SearchIndex index = (SearchIndex) Class.forName(indexClassName).getDeclaredConstructor().newInstance();
index.initialize(parameters);
setLuceneIndex(index);
}
public void setParameter(String key, String value) {
parameters.setProperty(key, value);
}
public String getParameter(String key) {
return parameters.getProperty(key);
}
public Set<String> getParameterNames() {
return parameters.stringPropertyNames();
}
/**
* See REINDEX_QUERY_KEY parameter.
*/
public String getReindexQuery() {
return reindexQuery;
}
/**
* See REINDEX_QUERY_KEY parameter.
*/
public void setReindexQuery(String query) {
this.setParameter(REINDEX_QUERY_KEY, query);
this.reindexQuery = query;
}
/**
* When this is true, incomplete queries will trigger a SailException. You can set this value either using
* {@link #setIncompleteQueryFails(boolean)} or using the parameter "incompletequeryfail"
*
* @return Returns the incompleteQueryFails.
*/
public boolean isIncompleteQueryFails() {
return incompleteQueryFails;
}
/**
* Set this to true, so that incomplete queries will trigger a SailException. Otherwise, incomplete
* queries will be logged with level WARN. Default is true. You can set this value also using the
* parameter "incompletequeryfail".
*
* @param incompleteQueryFails
* true or false
*/
public void setIncompleteQueryFails(boolean incompleteQueryFails) {
this.setParameter(INCOMPLETE_QUERY_FAIL_KEY, Boolean.toString(incompleteQueryFails));
this.incompleteQueryFails = incompleteQueryFails;
}
/**
 * Starts reindexing the whole sail: all indexed data is deleted and then added again, which can be a
 * long-running process.
 *
 * @throws Exception
*/
public void reindex()
throws Exception
{
// clear
logger.info("Reindexing sail: clearing...");
luceneIndex.clear();
logger.info("Reindexing sail: adding...");
luceneIndex.begin();
try {
// iterate
SailRepository repo = new SailRepository(new NotifyingSailWrapper(getBaseSail()) {
@Override
public void shutDown() {
// don't shutdown the underlying sail
// when we shutdown the repo.
}
});
// repo.initialize(); we don't need to initialize, that should be done
// already by others
SailRepositoryConnection connection = repo.getConnection();
try {
TupleQuery query = connection.prepareTupleQuery(QueryLanguage.SPARQL, reindexQuery);
TupleQueryResult res = query.evaluate();
Resource current = null;
ValueFactory vf = getValueFactory();
List<Statement> statements = new ArrayList<>();
while (res.hasNext()) {
BindingSet set = res.next();
Resource r = (Resource)set.getValue("s");
IRI p = (IRI)set.getValue("p");
Value o = set.getValue("o");
Resource c = (Resource)set.getValue("c");
if (current == null) {
current = r;
}
else if (!current.equals(r)) {
if (logger.isDebugEnabled())
logger.debug("reindexing resource " + current);
// commit
luceneIndex.addDocuments(current, statements);
// re-init
current = r;
statements.clear();
}
statements.add(vf.createStatement(r, p, o, c));
}
}
finally {
connection.close();
repo.shutDown();
}
// commit the changes
luceneIndex.commit();
logger.info("Reindexing sail: done.");
}
catch (Exception e) {
logger.error("Rolling back", e);
luceneIndex.rollback();
throw e;
}
}
/**
* Sets a filter which determines whether a statement should be considered for indexing when performing
* complete reindexing.
*/
public void registerStatementFilter(IndexableStatementFilter filter) {
this.filter = filter;
}
protected boolean acceptStatementToIndex(Statement s) {
IndexableStatementFilter nextFilter = filter;
return (nextFilter != null) ? nextFilter.accept(s) : true;
}
public Statement mapStatement(Statement statement) {
IRI p = statement.getPredicate();
boolean predicateChanged = false;
Map<IRI, IRI> nextIndexedFieldsMapping = indexedFieldsMapping;
if (nextIndexedFieldsMapping != null) {
IRI res = nextIndexedFieldsMapping.get(p);
if (res != null) {
p = res;
predicateChanged = true;
}
}
Set<IRI> nextIndexedFields = indexedFields;
if (nextIndexedFields != null && !nextIndexedFields.contains(p)) {
return null;
}
if (predicateChanged) {
return getValueFactory().createStatement(statement.getSubject(), p, statement.getObject(),
statement.getContext());
}
else {
return statement;
}
}
protected Collection<SearchQueryInterpreter> getSearchQueryInterpreters() {
return Arrays.<SearchQueryInterpreter> asList(new QuerySpecBuilder(incompleteQueryFails),
new DistanceQuerySpecBuilder(luceneIndex), new GeoRelationQuerySpecBuilder(luceneIndex));
}
}
/*
* ********************************************************************* BELOW FIXMES are assumed to be fixed
* or an agreement was reached. They can be removed in Oct 2007.
*/
/*
* FIXME: The LuceneSail does not alter the datadir (i.e., passes it as-is to the wrapped Sail) and requires
* you to specify a LuceneIndex. This means more work on the side of the integrator but allows for
* fine-grained control over the type of storage used by the LuceneIndex: file-based, memory-based, db-based,
* etc. An alternative method is to give the wrapped Sail a subdir in the datadir and let the LuceneSail take
* care of creating the LuceneIndex and associated index dir. This gives the LuceneSail/Index more freedom in
* how it organizes data, e.g. when one wants to store non-committed information in a temporary index without
* having to use the system's tmp dir. Which method is to be preferred or whether both approaches can be
 * combined has yet to be determined. Gunnar and Leo: Added a sail-parameter, the initialize method will create
* the luceneindex with sensible defaults if not set. Enrico: sounds good! FIXME: In light of all the issues
* mentioned in LuceneIndex and given the fact that in most applications, integrators are able to provide
 * statements in a more structured manner than randomly sorted triples, it may be a good idea to provide some
* extension points that allow integrators to "do their own thing". In a way this is already possible, as they
* are able to set the LuceneIndex. More sophisticated ways are e.g. an API for updating all statements with
 * the same subject at once. Gunnar and Leo: Proper transaction handling in LuceneIndex should be all we need,
* or? FIXME: The SailConnectionListener wraps IOExceptions in RuntimeException so that they can be rethrown.
* This is a temporary fix until we have decided on the design of the SailConnectionListener API; it may even
* be extended to allow throwing of SailExceptions. FIXME: Investigate whether LuceneSailConnection.clear
* should address the LuceneIndex directly with a clear command, whether removed statements are reported
* already through the SailConnectionListener, or whether the latter API will be extended with a separate
* clear event. FIXME: Gunnar and Leo: Why isn't this implemented as a simple connectionwrapper? The
* connection-wrapper already forwards all calls, we can just override methods where lucene interaction is
* needed, or? Do we gain anything by doing it as a listener? Chris: it's been a while but I think this has to
* do with the SailConnection.clear accepting a number of contexts. As context info is not stored in the
* Lucene index, we have no idea which info to remove. *If* removed statements are reported to
* SailConnectionListeners (talk to Arjohn about this), we can use this event to update the index. On the
* other hand, if we go with Leo's approach of storing multiple context IDs in a single Document (see
* LuceneIndex), this may become a non-issue. Leo: Then I would implement LuceneSailConnection and do it with
* the multiple contexts. FIXME: should we use the wrapped Sail's ValueFactory when creating Literals and
* URIs? Gunnar and Leo: sure, no other solution. Enrico: yes! FIXME: Lucene's query parsing may result in a
* TooManyClauses Exception, e.g. when a wildcard query matches more than 1024 query terms in the index. This
* default threshold of max. 1024 terms is configurable through BooleanQuery.setMaxClauseCount but this may
* lead to very large memory usage (potentially OutOfMemoryErrors) and is also global for all Lucene indices
* running in the same JVM. Perhaps a modified QueryParser is a solution, e.g. by skipping term 1025 and
* beyond in order to approximate the query result? Leo: This only applies when we have no "all" field. FIXME:
* All Literal properties of a Resource are both stored separately as separate Fields, as well as concatenated
* and indexed as a single field. By *indexing* the former fields as well, we would be able to easily support
* searching for specific predicates, besides only for entire Resources. We may even need this to support
* returning snippets, or else we have no idea which property the query matched with. Cons: indexing these
* fields will increase index size and decrease upload performance. Also, this way of searching for a specific
* predicate is a bit strange for RDF, as the predicate restriction is part of the Lucene query string instead
* of the RDF graph query. Gunnar and Leo: index all fields! For proper individual ranking indexing each
* fields is important. Enrico: yes, index all fields (not only THE ALL field), we need it! Agreement: we
* index all fields, later make it configurable FIXME: It may seem logical at first to set IndexWriter's
* auto-commit (available in Lucene 2.2) to false when adding triples, as this could be useful for
* implementing Sesame's transactions: just commit the IndexWriter whenever the SailConnection is committed.
* The main problem with this approach is that you are not able to search for Documents that have not been
* committed yet, which is needed in order to update them with new properties for that subject. Consequently,
* LuceneIndex' operation is very slow: each change on the IndexWriter is immediately flushed (resulting in
* disk I/O when using a FSDirectory) and a new IndexReader is created for every added triple, which does some
* non-trivial initialization. Alternative strategies: (1) don't write Documents right away to the IndexWriter
* but cache them in main memory and only add them when a commit on the LuceneIndex is issued by the
* LuceneSailConnection. Potential risk for out-of-memory errors because you have no idea how much memory this
* is using. (2) Different mechanism but conceptually similar: buffer statements to add and process them in
* order of subject when a commit is issued or the cache overflows, so that you only need to fetch the
* Document for that subject once. The size of the cache can be approximated fairly well by looking at the
* sizes of the strings in their statements. Gunnar and Leo: We had (1) in Gnowsis, and we never ran out of
 * memory :) at least not for this reason ... (2) is harder to implement; we suggest doing (1) and replacing
 * it with (2) when it becomes a problem. Gunnar and Leo will do (1) in the next few days. Enrico: we also suggest to
* use (1), just keep the lucene doc until the transaction is committed so you can continue filling the doc
* and don't need to get it back from the index. Chris: (1) works for applications like Gnowsis and AutoFocus
* which probably do a commit after processing every crawled resource, the amount of statements in a
* transaction is then very small. Note however that uploading a large RDF file to a Repository (also a common
* Sesame use case) is a single transaction, that's where I expect you can easily get into trouble. Leo: ok,
* with bigger transactions there is trouble, which we leave to fix once the trouble arises. Chris (in
* skype-chat): (1) is ok for now, go for it. When statements arrive more or less in order of subject and we
* tune the caching a bit (e.g. by each time only processing half of the cache and selecting those statements
* whose subjects we haven't seen in a while), this delayed processing strategy may in some scenarios even
* lead to the most optimal case where Documents are retrieved and/or written at most once. Changing the index
* because of cache overflow still breaks SailConnection's contract though: the index should only be altered
* in a permanent way when the SailConnection gets a commit. At first I thought that a triple-centric Document
* setup (each triple has its own Document) would solve all this, as opposed to the current Resource-centric
* setup (all properties with the same subject in a single Document). However, (1) you still need to check the
* index in order to prevent adding duplicates, which cannot be done on uncommitted Documents - perhaps
* SailConnectionListener can tell us when a really new triple is added? But even then: probably works for
* quads, not for triples). Also, (2) when you *are* storing quads (assuming this leads to a context field in
* the Document), the deletion of a statement no longer simply maps on an IndexWriter.deleteDocuments(Term)
* invocation, so you need to query again to see which Documents need to be deleted. FIXME: Right now, all
* literals are stored and indexed, datatypes are ignored. Should we process some datatypes differently? Does
* it make sense to index booleans, numbers, etc.? Enrico: we don't use data type and language for querying
* anyways, so does not affect us Agreement: Datatypes are ignored. FIXME: The context of triples is
* completely ignored at this moment. Perhaps this can simply be solved by giving each Document a context ID
* besides the Resource ID? Leo (#1): yes, and multiple contextIDs, to state all contexts that contributed to
* the doc (see below, #2) FIXME: The clear(Resource...) is not implemented as we do not deal with contexts in
* this LuceneSail implementation and thus do not know which triples to remove. This is problematic when
* people do a clear with a specific context on a LuceneSail, as the LuceneIndex will then still keep legacy
* triples around. Only a global clear can be implemented, but not a clear on a specific context. To me this
* strongly suggests that we add a separate Document for each (Resource, context) pair, even though the
* objections raised in the paper (troubles with creating scores) are reasonable, because else we are not able
 * to create a proper Sail implementation. This only adds to the issue we realized before with ignoring
* context, namely that full-text queries cannot be restricted to properties in a certain context. Leo: #2 An
* optimized approach would be to add multiple contextIDs, to state all contexts that contributed to the doc
* (see above #1) This means that when adding/appending to a document, all additional context-uris are added
* to the document. When deleting individual triples, the context is ignored. In clear(Resource ...) we make a
* query on all Lucene-Documents that were possibly created by this context(s). Given a document D that
* context C(1-n) contributed to. D' is the new document after clear(). - if there is only one C then D can be
* safely removed. There is no D' (I hope this is the standard case: like in ontologies, where all triples
* about a resource are in one document) - if there are multiple C, remember the uri of D, delete D, and query
* (s,p,o, ?) from the underlying store after committing the operation- this returns the literals of D', add
* D' as new document This will probably be both fast in the common case and capable enough in the multiple-C
 * case. Any objections? Gunnar? Enrico? Enrico: we don't query contexts at all, so the score is better this way
 * than having (resource, context) paired documents. So this looks like a working solution that keeps the
* lucene index valid.
*/