Lucene Change Log
For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions
======================= Lucene 5.0.0 =======================
======================= Lucene 4.0.0-BETA =======================
New features
* LUCENE-4249: Changed the explanation of the PayloadTermWeight to use the
underlying PayloadFunction's explanation as the explanation
for the payload score. (Scott Smerchek via Robert Muir)
* LUCENE-4201: Added JapaneseIterationMarkCharFilter to normalize Japanese
iteration marks. (Robert Muir, Christian Moen)
* LUCENE-3832: Added BasicAutomata.makeStringUnion method to efficiently
create automata from a fixed collection of UTF-8 encoded BytesRef
(Dawid Weiss, Robert Muir)
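  For example, a minimal sketch (the input list is assumed to be sorted, and the terms are illustrative):
    List<BytesRef> terms = Arrays.asList(new BytesRef("bar"), new BytesRef("baz"), new BytesRef("foo"));
    Automaton a = BasicAutomata.makeStringUnion(terms); // deterministic, minimal automaton accepting exactly these terms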
* LUCENE-4153: Added option to fast vector highlighting via BaseFragmentsBuilder to
respect field boundaries in the case of highlighting for multivalued fields.
(Martijn van Groningen)
* LUCENE-4227: Added DirectPostingsFormat, to hold all postings in
memory as uncompressed simple arrays. This uses a tremendous amount
of RAM but gives good search performance gains. (Mike McCandless)
API Changes
* LUCENE-4138: update of morfologik (Polish morphological analyzer) to 1.5.3.
The tag attribute class has been renamed to MorphosyntacticTagsAttribute and
has a different API (carries a list of tags instead of a compound tag). Upgrade
of embedded morfologik dictionaries to version 1.9. (Dawid Weiss)
* LUCENE-4178: set 'tokenized' to true on FieldType by default, so that if you
make a custom FieldType and set indexed = true, it's analyzed by the analyzer.
(Robert Muir)
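  For example, a minimal sketch of a custom FieldType relying on the new default (field name and value are illustrative):
    FieldType ft = new FieldType();
    ft.setIndexed(true);   // tokenized already defaults to true, so the value is analyzed
    ft.setStored(true);
    Field f = new Field("body", "some text", ft);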
* LUCENE-4220: Removed the buggy JavaCC-based HTML parser in the benchmark
module and replaced it with NekoHTML. The HTMLParser interface was cleaned up while
changing method signatures. (Uwe Schindler, Robert Muir)
* LUCENE-2191: Rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader).
The purpose of this method was always to set a new Reader on the Tokenizer,
reusing the object. But the name was often confused with TokenStream.reset().
(Robert Muir)
* LUCENE-4228: Refactored CharFilter to extend java.io.FilterReader. CharFilters
filter another reader and you override correct() for offset correction.
(Robert Muir)
* LUCENE-4240: The Analyzer API now takes just the fieldName for getOffsetGap. If the
field is not analyzed (e.g. StringField), then the analyzer is not invoked
at all. If you want to tweak things like positionIncrementGap and offsetGap,
analyze the field with KeywordTokenizer instead. (Grant Ingersoll, Robert Muir)
Optimizations
* LUCENE-4171: Performance improvements to Packed64.
(Toke Eskildsen via Adrien Grand)
* LUCENE-4184: Performance improvements to the aligned packed bits impl.
(Toke Eskildsen, Adrien Grand)
* LUCENE-4235: Remove enforcing of Filter rewrite for NRQ queries.
(Uwe Schindler)
Bug Fixes
* LUCENE-4176: Fix AnalyzingQueryParser to analyze range endpoints as bytes,
so that it works correctly with Analyzers that produce binary non-UTF-8 terms
such as CollationAnalyzer. (Nattapong Sirilappanich via Robert Muir)
* LUCENE-4209: Fix FSTCompletionLookup to close its sorter, so that it won't
leave temp files behind in /tmp. Fix SortedTermFreqIteratorWrapper to not
leave temp files behind in /tmp on Windows. Fix Sort to not leave
temp files behind when /tmp is a separate volume. (Uwe Schindler, Robert Muir)
* LUCENE-4221: Fix overeager CheckIndex validation for term vector offsets.
(Robert Muir)
* LUCENE-4222: TieredMergePolicy.getFloorSegmentMB was returning the
size in bytes not MB (Chris Fuller via Mike McCandless)
* LUCENE-3505: Fix bug (Lucene 4.0alpha only) where boolean conjunctions
were sometimes scored incorrectly. Conjunctions of only termqueries where
at least one term omitted term frequencies (IndexOptions.DOCS_ONLY) would
be scored as if all terms omitted term frequencies. (Robert Muir)
* LUCENE-2686, LUCENE-3505: Fixed BooleanQuery scorers to return correct
freq(). Added support for scorer navigation API (Scorer.getChildren) to
all queries. Made Scorer.freq() abstract.
(Koji Sekiguchi, Mike McCandless, Robert Muir)
* LUCENE-4234: Exception when FacetsCollector is used with ScoreFacetRequest,
and the number of matching documents is too large. (Gilad Barkai via Shai Erera)
* LUCENE-4245: Make IndexWriter#close() and MergeScheduler#close()
non-interruptible. (Mark Miller, Uwe Schindler)
Build
* LUCENE-4094: Support overriding file.encoding on forked test JVMs
(force via -Drandomized.file.encoding=XXX). (Dawid Weiss)
* LUCENE-4189: Test output should include timestamps (start/end for each
test/suite). Added -Dtests.timestamps=[off by default]. (Dawid Weiss)
* LUCENE-4110: Report long periods of forked jvm inactivity (hung tests/suites).
Added -Dtests.heartbeat=[seconds] with the default of 60 seconds.
(Dawid Weiss)
* LUCENE-4160: Added a property to quit the tests after a given
number of failures has occurred. This is useful in combination
with -Dtests.iters=N (you can start N iterations and wait for M
failures, in particular M = 1). -Dtests.maxfailures=M. Alternatively,
specify -Dtests.failfast=true to skip all tests after the first failure.
(Dawid Weiss)
* LUCENE-4115: JAR resolution/cleanup should be done automatically for ant
clean/eclipse/resolve. (Dawid Weiss)
* LUCENE-4199, LUCENE-4202, LUCENE-4206: Add a new target "check-forbidden-apis"
that parses all generated .class files for use of APIs that use the default
charset, default locale, or default timezone, and fails the build if violations are
found. This ensures that Lucene / Solr is independent of local configuration
options. (Uwe Schindler, Robert Muir, Dawid Weiss)
* LUCENE-4217: Add the possibility to run tests with Atlassian Clover
loaded from IVY. A development License solely for Apache code was added in
the tools/ folder, but is not included in releases. (Uwe Schindler)
Documentation
* LUCENE-4195: Added package documentation and examples for
org.apache.lucene.codecs (Alan Woodward via Robert Muir)
======================= Lucene 4.0.0-ALPHA =======================
More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
https://wiki.apache.org/lucene-java/Lucene4.0
For "contrib" changes prior to 4.0, please see:
http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_6_0/lucene/contrib/CHANGES.txt
Changes in backwards compatibility policy
* LUCENE-1458, LUCENE-2111, LUCENE-2354: Changes from flexible indexing:
- On upgrading to 4.0, if you do not fully reindex your documents,
Lucene will emulate the new flex API on top of the old index,
incurring some performance cost (up to ~10% slowdown, typically).
To prevent this slowdown, use oal.index.IndexUpgrader
to upgrade your indexes to latest file format (LUCENE-3082).
Mixed flex/pre-flex indexes are perfectly fine -- the two
emulation layers (flex API on pre-flex index, and pre-flex API on
flex index) will remap the access as required. So on upgrading to
4.0 you can start indexing new documents into an existing index.
To get optimal performance, use oal.index.IndexUpgrader
to upgrade your indexes to latest file format (LUCENE-3082).
- The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum)
have been removed in favor of the new flexible
indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum,
DocsEnum, DocsAndPositionsEnum). One big difference is that field
and terms are now enumerated separately: a TermsEnum provides a
BytesRef (wraps a byte[]) per term within a single field, not a
Term. Another is that when asking for a Docs/AndPositionsEnum, you
now specify the skipDocs explicitly (typically this will be the
deleted docs, but in general you can provide any Bits).
- The term vectors APIs (TermFreqVector, TermPositionVector,
TermVectorMapper) have been removed in favor of the above
flexible indexing APIs, presenting a single-document inverted
index of the document from the term vectors.
- MultiReader ctor now throws IOException
- Directory.copy/Directory.copyTo now copies all files (not just
index files), since what is and isn't an index file is now
dependent on the codecs used.
- UnicodeUtil now uses BytesRef for UTF-8 output, and some method
signatures have changed to CharSequence. These are internal APIs
and subject to change suddenly.
- Positional queries (PhraseQuery, *SpanQuery) will now throw an
exception if you use them on a field that omits positions during
indexing (previously they silently returned no results).
- FieldCache.{Byte,Short,Int,Long,Float,Double}Parser's API has
changed -- each parse method now takes a BytesRef instead of a
String. If you have an existing Parser, a simple way to fix it is
invoke BytesRef.utf8ToString, and pass that String to your
existing parser. This will work, but performance would be better
if you could fix your parser to instead operate directly on the
byte[] in the BytesRef.
- The internal (experimental) API of NumericUtils changed completely
from String to BytesRef. Client code should never use this class,
so the change would normally not affect you. If you used some of
the methods to inspect terms or create TermQueries out of
prefix encoded terms, change to use BytesRef. Please note:
Do not use TermQueries to search for single numeric terms.
The recommended way is to create a corresponding NumericRangeQuery
with upper and lower bound equal and included (see the sketch after
this entry). TermQueries do not score correctly, so the constant-score mode
of NRQ is the only correct way to handle single-value queries.
- NumericTokenStream now works directly on byte[] terms. If you
plug a TokenFilter on top of this stream, you will likely get
an IllegalArgumentException, because the NTS does not support
TermAttribute/CharTermAttribute. If you want to further filter
or attach Payloads to NTS, use the new NumericTermAttribute.
(Mike McCandless, Robert Muir, Uwe Schindler, Mark Miller, Michael Busch)
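  As referenced above, a minimal sketch of searching for a single numeric value with NumericRangeQuery (field name and value are illustrative):
    Query q = NumericRangeQuery.newIntRange("price", 42, 42, true, true); // lower == upper, both bounds inclusive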
* LUCENE-2858, LUCENE-3733: IndexReader was refactored into abstract
AtomicReader, CompositeReader, and DirectoryReader. To open Directory-
based indexes use DirectoryReader.open(), the corresponding method in
IndexReader is now deprecated for easier migration. Only DirectoryReader
supports commits, versions, and reopening with openIfChanged(). Terms,
postings, docvalues, and norms can from now on only be retrieved using
AtomicReader; DirectoryReader and MultiReader extend CompositeReader,
only offering stored fields and access to the sub-readers (which may be
composite or atomic). SlowCompositeReaderWrapper (LUCENE-2597) can be
used to emulate atomic readers on top of composites.
Please review MIGRATE.txt for information on how to migrate old code.
(Uwe Schindler, Robert Muir, Mike McCandless)
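  A minimal sketch of the new reader hierarchy ('dir' is an already-open Directory; the field name is illustrative):
    DirectoryReader reader = DirectoryReader.open(dir);
    for (AtomicReaderContext ctx : reader.getContext().leaves()) {
      AtomicReader leaf = ctx.reader();   // terms, postings, norms and docvalues live on the leaves
      Terms terms = leaf.terms("body");
    }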
* LUCENE-2265: FuzzyQuery and WildcardQuery now operate on Unicode codepoints,
not Unicode code units. For example, a wildcard "?" represents any Unicode
character. Furthermore, the rest of the automaton package and RegexpQuery use
true Unicode codepoint representation. (Robert Muir, Mike McCandless)
* LUCENE-2380: The String-based FieldCache methods (getStrings,
getStringIndex) have been replaced with BytesRef-based equivalents
(getTerms, getTermsIndex). Also, the sort values (returned in
FieldDoc.fields) when sorting by SortField.STRING or
SortField.STRING_VAL are now BytesRef instances. See MIGRATE.txt
for more details. (yonik, Mike McCandless)
* LUCENE-2480: Though not a change in backwards compatibility policy, pre-3.0
indexes are no longer supported. You should upgrade to 3.x first, then run
optimize(), or reindex. (Shai Erera, Earwin Burrfoot)
* LUCENE-2484: Removed deprecated TermAttribute. Use CharTermAttribute
and TermToBytesRefAttribute instead. (Uwe Schindler)
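  A minimal sketch of consuming a TokenStream with CharTermAttribute ('analyzer' is assumed; the field name and text are illustrative):
    TokenStream ts = analyzer.tokenStream("body", new StringReader("fast brown fox"));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      String term = termAtt.toString();
    }
    ts.end();
    ts.close();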
* LUCENE-2600: Remove IndexReader.isDeleted in favor of
AtomicReader.getDeletedDocs(). (Mike McCandless)
* LUCENE-2667: FuzzyQuery's defaults have changed for more performant
behavior: the minimum similarity is 2 edit distances from the word,
and the priority queue size is 50. To support this, FuzzyQuery now allows
specifying unscaled edit distances (foobar~2). If your application depends
upon the old defaults of 0.5 (scaled) minimum similarity and Integer.MAX_VALUE
priority queue size, you can use FuzzyQuery(Term, float, int, int) to specify
those explicitly.
* LUCENE-2674: MultiTermQuery.TermCollector.collect now accepts the
TermsEnum as well. (Robert Muir, Mike McCandless)
* LUCENE-588: WildcardQuery and QueryParser now allow escaping with
the '\' character. Previously this was impossible (you could not escape */?,
for example). If your code somehow depends on the old behavior, you will
need to change it (e.g. using "\\" to escape '\' itself).
(Sunil Kamath, Terry Yang via Robert Muir)
* LUCENE-2837: Collapsed Searcher, Searchable into IndexSearcher;
removed contrib/remote and MultiSearcher (Mike McCandless); absorbed
ParallelMultiSearcher into IndexSearcher as an optional
ExecutorServiced passed to its ctor. (Mike McCandless)
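  A minimal sketch of the ExecutorService-based replacement for ParallelMultiSearcher ('reader' is assumed):
    ExecutorService pool = Executors.newFixedThreadPool(4);
    IndexSearcher searcher = new IndexSearcher(reader, pool); // sub-readers are searched in parallel
    // the caller owns the pool and must eventually call pool.shutdown()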
* LUCENE-2908, LUCENE-4037: Removed serialization code from lucene classes.
It is recommended that you serialize user search needs at a higher level
in your application.
(Robert Muir, Benson Margulies)
* LUCENE-2831: Changed Weight#scorer, Weight#explain & Filter#getDocIdSet to
operate on an AtomicReaderContext instead of directly on IndexReader to enable
searches to be aware of IndexSearcher's context. (Simon Willnauer)
* LUCENE-2839: Scorer#score(Collector,int,int) is now public because it is
called from other classes and part of public API. (Uwe Schindler)
* LUCENE-2865: Weight#scorer(AtomicReaderContext, boolean, boolean) now accepts
a ScorerContext struct instead of booleans. (Simon Willnauer)
* LUCENE-2882: Cut over SpanQuery#getSpans to AtomicReaderContext to enforce
per segment semantics on SpanQuery & Spans. (Simon Willnauer)
* LUCENE-2236: Similarity can now be configured on a per-field basis. See the
migration notes in MIGRATE.txt for more details. (Robert Muir, Doron Cohen)
* LUCENE-2315: AttributeSource's methods for accessing attributes are now final,
otherwise it is easy to corrupt the internal state. (Uwe Schindler)
* LUCENE-2814: The IndexWriter.flush method no longer takes "boolean
flushDocStores" argument, as we now always flush doc stores (index
files holding stored fields and term vectors) while flushing a
segment. (Mike McCandless)
* LUCENE-2548: Field names (eg in Term, FieldInfo) are no longer
interned. (Mike McCandless)
* LUCENE-2883: The contents of o.a.l.search.function have been consolidated into
the queries module and can be found at o.a.l.queries.function. See
MIGRATE.txt for more information (Chris Male)
* LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
Query/Weight/Scorer. If you extended Similarity directly before, you should
extend TFIDFSimilarity instead. Similarity is now a lower-level API to
implement other scoring algorithms. See MIGRATE.txt for more details.
(David Nemeskey, Simon Willnauer, Mike Mccandless, Robert Muir)
* LUCENE-3330: The expert visitor API in Scorer has been simplified and
extended to support arbitrary relationships. To navigate to a scorer's
children, call Scorer.getChildren(). (Robert Muir)
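  A minimal sketch of navigating a scorer's children ('scorer' is an already-constructed Scorer; the comments are illustrative):
    for (Scorer.ChildScorer child : scorer.getChildren()) {
      Scorer sub = child.child;            // the sub-scorer
      String rel = child.relationship;     // describes how the child relates to its parent
    }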
* LUCENE-2308: Field is now instantiated with an instance of IndexableFieldType,
of which there is a core implementation FieldType. Most properties
describing a Field have been moved to IndexableFieldType. See MIGRATE.txt
for more details. (Nikola Tankovic, Mike McCandless, Chris Male)
* LUCENE-3396: ReusableAnalyzerBase.TokenStreamComponents.reset(Reader) now
returns void instead of boolean. If a Component cannot be reset, it should
throw an Exception. (Chris Male)
* LUCENE-3396: ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer
implementations must now use Analyzer.TokenStreamComponents, rather than
overriding .tokenStream() and .reusableTokenStream() (which are now final).
(Chris Male)
* LUCENE-3346: Analyzer.reusableTokenStream() has been renamed to tokenStream()
with the old tokenStream() method removed. Consequently it is now mandatory
for all Analyzers to support reusability. (Chris Male)
* LUCENE-3473: AtomicReader.getUniqueTermCount() no longer throws UOE when
it cannot be easily determined. Instead, it returns -1 to be consistent with
this behavior across other index statistics.
(Robert Muir)
* LUCENE-1536: The abstract FilteredDocIdSet.match() method is no longer
allowed to throw IOException. This change was required to make it conform
to the Bits interface. This method should never do I/O for performance reasons.
(Mike McCandless, Uwe Schindler, Robert Muir, Chris Male, Yonik Seeley,
Jason Rutherglen, Paul Elschot)
* LUCENE-3559: The methods "docFreq" and "maxDoc" on IndexSearcher were removed,
as these are no longer used by the scoring system. See MIGRATE.txt for more
details. (Robert Muir)
* LUCENE-3533: Removed SpanFilters, they created large lists of objects and
did not scale. (Robert Muir)
* LUCENE-3606: IndexReader and subclasses were made read-only. It is no longer
possible to delete or undelete documents using IndexReader; you have to use
IndexWriter now. As deleting by internal Lucene docID is no longer possible,
this requires adding a unique identifier field to your index. Deleting/
relying upon Lucene docIDs is not recommended anyway, because they can
change. Consequently commit() was removed and DirectoryReader.open(),
openIfChanged() no longer take readOnly booleans or IndexDeletionPolicy
instances. Furthermore, IndexReader.setNorm() was removed. If you need
customized norm values, the recommended way to do this is by modifying
Similarity to use an external byte[] or one of the new DocValues
fields (LUCENE-3108). Alternatively, to dynamically change norms (boost
*and* length norm) at query time, wrap your AtomicReader using
FilterAtomicReader, overriding FilterAtomicReader.norms(). To persist the
changes on disk, copy the FilteredIndexReader to a new index using
IndexWriter.addIndexes(). (Uwe Schindler, Robert Muir)
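  A minimal sketch of deleting through IndexWriter using your own unique key field ('writer' is assumed; field name and value are illustrative):
    writer.deleteDocuments(new Term("id", "doc-42"));
    writer.commit();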
* LUCENE-3640: Removed IndexSearcher.close(), because IndexSearcher no longer
takes a Directory and no longer "manages" IndexReaders, it is a no-op.
(Robert Muir)
* LUCENE-3684: Add offsets into DocsAndPositionsEnum, and a few
FieldInfo.IndexOption: DOCS_AND_POSITIONS_AND_OFFSETS. (Robert
Muir, Mike McCandless)
* LUCENE-2858, LUCENE-3770: FilterIndexReader was renamed to
FilterAtomicReader and now extends AtomicReader. If you want to filter
composite readers like DirectoryReader or MultiReader, filter their
atomic leaves and build a new CompositeReader (e.g. MultiReader) around
them. (Uwe Schindler, Robert Muir)
* LUCENE-3736: ParallelReader was split into ParallelAtomicReader
and ParallelCompositeReader. Lucene 3.x's ParallelReader is now
ParallelAtomicReader; but the new composite variant has improved performance
as it works on the atomic subreaders. It requires that all parallel
composite readers have the same subreader structure. If you cannot provide this,
you can use SlowCompositeReaderWrapper to make all parallel readers atomic
and use ParallelAtomicReader. (Uwe Schindler, Mike McCandless, Robert Muir)
* LUCENE-2000: clone() now returns covariant types where possible. (ryan)
* LUCENE-3970: Rename Fields.getUniqueFieldCount -> .size() and
Terms.getUniqueTermCount -> .size(). (Iulius Curt via Mike McCandless)
* LUCENE-3514: IndexSearcher.setDefaultFieldSortScoring was removed
and replaced with per-search control via new expert search methods
that take two booleans indicating whether hit scores and max
score should be computed. (Mike McCandless)
* LUCENE-4055: You can't put foreign files into the index dir anymore.
* LUCENE-3866: CompositeReader.getSequentialSubReaders() now returns
unmodifiable List<? extends IndexReader>. ReaderUtil.Gather was
removed, as IndexReaderContext.leaves() is now the preferred way
to access sub-readers. (Uwe Schindler)
* LUCENE-4155: oal.util.ReaderUtil, TwoPhaseCommit, TwoPhaseCommitTool
classes were moved to oal.index package. oal.util.CodecUtil class was moved
to oal.codecs package. oal.util.DummyConcurrentLock was removed
(no longer used in Lucene 4.0). (Uwe Schindler)
Changes in Runtime Behavior
* LUCENE-2846: omitNorms now behaves like omitTermFrequencyAndPositions: if you
omitNorms(true) for field "a" for 1000 documents, but then add a document with
omitNorms(false) for field "a", all documents for field "a" will have no
norms. Previously, Lucene would fill the first 1000 documents with
"fake norms" from Similarity.getDefault(). (Robert Muir, Mike Mccandless)
* LUCENE-2846: When some documents contain field "a", and others do not, the
documents that don't have the field get a norm byte value of 0. Previously,
Lucene would populate "fake norms" with Similarity.getDefault() for these
documents. (Robert Muir, Mike Mccandless)
* LUCENE-2720: IndexWriter throws IndexFormatTooOldException on open, rather
than later when e.g. a merge starts.
(Shai Erera, Mike McCandless, Uwe Schindler)
* LUCENE-2881: FieldInfos is now tracked per segment. Before it was tracked
per IndexWriter session, which resulted in FieldInfos that had the FieldInfo
properties from all previous segments combined. Field numbers are now tracked
globally across IndexWriter sessions and persisted into a _X.fnx file on
successful commit. The corresponding file format changes are backwards-
compatible. (Michael Busch, Simon Willnauer)
* LUCENE-2956, LUCENE-2573, LUCENE-2324, LUCENE-2555: Changes from
DocumentsWriterPerThread:
- IndexWriter now uses a DocumentsWriter per thread when indexing documents.
Each DocumentsWriterPerThread indexes documents in its own private segment,
and the in memory segments are no longer merged on flush. Instead, each
segment is separately flushed to disk and subsequently merged with normal
segment merging.
- DocumentsWriterPerThread (DWPT) is now flushed concurrently based on a
FlushPolicy. When a DWPT is flushed, a fresh DWPT is swapped in so that
indexing may continue concurrently with flushing. The selected
DWPT flushes all its RAM resident documents to disk. Note: Segment flushes
don't flush all RAM resident documents but only the documents private to
the DWPT selected for flushing.
- Flushing is now controlled by FlushPolicy that is called for every add,
update or delete on IndexWriter. By default DWPTs are flushed either on
maxBufferedDocs per DWPT or the global active used memory. Once the active
memory exceeds ramBufferSizeMB only the largest DWPT is selected for
flushing and the memory used by this DWPT is subtracted from the active
memory and added to a flushing memory pool, which can lead to temporarily
higher memory usage due to ongoing indexing.
- IndexWriter now can utilize ramBufferSize > 2048 MB. Each DWPT can address
up to 2048 MB memory such that the ramBufferSize is now bounded by the max
number of DWPTs available in the used DocumentsWriterPerThreadPool.
IndexWriter's net memory consumption can grow far beyond the 2048 MB limit if
the application can use all available DWPTs. To prevent a DWPT from
exhausting its address space IndexWriter will forcefully flush a DWPT if its
hard memory limit is exceeded. The RAMPerThreadHardLimitMB can be controlled
via IndexWriterConfig and defaults to 1945 MB.
Since IndexWriter flushes DWPT concurrently not all memory is released
immediately. Applications should still use a ramBufferSize significantly
lower than the JVM's available heap memory since under high load multiple
flushing DWPT can consume substantial transient memory when IO performance
is slow relative to indexing rate.
- IndexWriter#commit now doesn't block concurrent indexing while flushing all
'currently' RAM resident documents to disk. Yet, flushes that occur while a
full flush is running are queued and will happen after all DWPTs involved
in the full flush are done flushing. Applications using multiple threads
during indexing and trigger a full flush (e.g. call commit() or open a new
NRT reader) can use significantly more transient memory.
- IndexWriter#addDocument and IndexWriter.updateDocument can block indexing
threads if the number of active + number of flushing DWPT exceed a
safety limit. By default this happens if 2 * the maximum number of available thread
states (DWPTPool) is exceeded. This safety limit prevents applications from
exhausting their available memory if flushing can't keep up with
concurrently indexing threads.
- IndexWriter only applies and flushes deletes if the maxBufferedDelTerms
limit is reached during indexing. No segment flushes will be triggered
due to this setting.
- IndexWriter#flush(boolean, boolean) no longer synchronizes on IndexWriter.
A dedicated flushLock has been introduced to prevent multiple full-
flushes happening concurrently.
- DocumentsWriter doesn't write shared doc stores anymore.
(Mike McCandless, Michael Busch, Simon Willnauer)
* LUCENE-3309: Stored fields no longer record whether they were
tokenized or not. In general you should not rely on stored fields
to record any "metadata" from indexing (tokenized, omitNorms,
IndexOptions, boost, etc.) (Mike McCandless)
* LUCENE-3309: Fast vector highlighter now inserts the
MultiValuedSeparator for NOT_ANALYZED fields (in addition to
ANALYZED fields). To ensure your offsets are correct you should
provide an analyzer that returns 1 from the offsetGap method.
(Mike McCandless)
* LUCENE-2621: Removed contrib/instantiated. (Robert Muir)
* LUCENE-1768: StandardQueryTreeBuilder no longer uses RangeQueryNodeBuilder
for RangeQueryNodes, since these two classes were removed;
TermRangeQueryNodeProcessor now creates TermRangeQueryNode,
instead of RangeQueryNode; the same applies for numeric nodes;
(Vinicius Barros via Uwe Schindler)
* LUCENE-3455: QueryParserBase.newFieldQuery() will throw a ParseException if
any of the calls to the Analyzer throw an IOException. QueryParserBase.analyzeRangePart()
will throw a RuntimeException if an IOException is thrown by the Analyzer.
* LUCENE-4127: IndexWriter will now throw IllegalArgumentException if
the first token of an indexed field has 0 positionIncrement
(previously it silently corrected it to 1, possibly masking bugs).
OffsetAttributeImpl will throw IllegalArgumentException if endOffset
is less than startOffset, or if offsets are negative.
(Robert Muir, Mike McCandless)
API Changes
* LUCENE-2302, LUCENE-1458, LUCENE-2111, LUCENE-2514: Terms are no longer
required to be character based. Lucene views a term as an arbitrary byte[]:
during analysis, character-based terms are converted to UTF8 byte[],
but analyzers are free to directly create terms as byte[]
(NumericField does this, for example). The term data is buffered as
byte[] during indexing, written as byte[] into the terms dictionary,
and iterated as byte[] (wrapped in a BytesRef) by IndexReader for
searching.
* LUCENE-1458, LUCENE-2111: AtomicReader now directly exposes its
deleted docs (getDeletedDocs), providing a new Bits interface to
directly query by doc ID.
* LUCENE-2691: IndexWriter.getReader() has been made package local and is now
exposed via open and reopen methods on DirectoryReader. The semantics of the
call is the same as it was prior to the API change.
(Grant Ingersoll, Mike McCandless)
* LUCENE-2566: QueryParser: Unary operators +,-,! will not be treated as
operators if they are followed by whitespace. (yonik)
* LUCENE-2831: Weight#scorer, Weight#explain, Filter#getDocIdSet,
Collector#setNextReader & FieldComparator#setNextReader now expect an
AtomicReaderContext instead of an IndexReader. (Simon Willnauer)
* LUCENE-2892: Add QueryParser.newFieldQuery (called by getFieldQuery by
default) which takes Analyzer as a parameter, for easier customization by
subclasses. (Robert Muir)
* LUCENE-2953: In addition to changes in 3.x, PriorityQueue#initialize(int)
function was moved into the ctor. (Uwe Schindler, Yonik Seeley)
* LUCENE-3219: SortField type properties have been moved to an enum
SortField.Type. To be consistent, CachedArrayCreator.getSortTypeID() has
been changed to CachedArrayCreator.getSortType(). (Chris Male)
* LUCENE-3225: Add TermsEnum.seekExact for faster seeking when you
don't need the ceiling term; renamed existing seek methods to either
seekCeil or seekExact; changed seekExact(ord) to return no value.
Fixed MemoryCodec and SimpleTextCodec to optimize the seekExact
case, and fixed places in Lucene to use seekExact when possible.
(Mike McCandless)
* LUCENE-1536: Filter.getDocIdSet() now takes an acceptDocs Bits interface (like
Scorer) limiting the documents that can appear in the returned DocIdSet.
Filters are now required to respect these acceptDocs, otherwise deleted documents
may get returned by searches. Most filters will pass these Bits down to DocsEnum,
but those, e.g. working on FieldCache, may need to use BitsFilteredDocIdSet.wrap()
to exclude them.
(Mike McCandless, Uwe Schindler, Robert Muir, Chris Male, Yonik Seeley,
Jason Rutherglen, Paul Elschot)
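  A minimal sketch inside a custom Filter that cannot respect acceptDocs itself ('computeMatches' is a hypothetical helper):
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
      DocIdSet matches = computeMatches(context);              // hypothetical, e.g. built from FieldCache
      return BitsFilteredDocIdSet.wrap(matches, acceptDocs);   // excludes deleted/filtered docs
    }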
* LUCENE-3722: Similarity methods and collection/term statistics now take
long instead of int (to enable distributed scoring of > 2B docs).
(Yonik Seeley, Andrzej Bialecki, Robert Muir)
* LUCENE-3761: Generalize SearcherManager into an abstract ReferenceManager.
SearcherManager remains a concrete class, but due to the refactoring, the
method maybeReopen has been deprecated in favor of maybeRefresh().
(Shai Erera, Mike McCandless, Simon Willnauer)
* LUCENE-3859: AtomicReader.hasNorms(field) is deprecated, instead you
can inspect the FieldInfo yourself to see if norms are present, which
also allows you to get the type. (Robert Muir)
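  A minimal sketch of inspecting the FieldInfo directly ('reader' is an AtomicReader; the field name is illustrative):
    FieldInfo fi = reader.getFieldInfos().fieldInfo("body");
    boolean hasNorms = fi != null && fi.hasNorms();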
* LUCENE-2606: Changed RegexCapabilities interface to fix thread
safety, serialization, and performance problems. If you have
written a custom RegexCapabilities it will need to be updated
to the new API. (Robert Muir, Uwe Schindler)
* LUCENE-2638: Made HighFreqTerms.TermStats public to make it more useful
for API use. (Andrzej Bialecki)
* LUCENE-2912: The field-specific hashmaps in SweetSpotSimilarity were removed.
Instead, use PerFieldSimilarityWrapper to return different SweetSpotSimilaritys
for different fields, this way all parameters (such as TF factors) can be
customized on a per-field basis. (Robert Muir)
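  A minimal sketch, assuming get(String) is the only method that must be implemented and the field name is illustrative:
    Similarity sim = new PerFieldSimilarityWrapper() {
      private final Similarity sweetSpotSim = new SweetSpotSimilarity();
      private final Similarity defaultSim = new DefaultSimilarity();
      @Override
      public Similarity get(String field) {
        return "body".equals(field) ? sweetSpotSim : defaultSim;
      }
    };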
* LUCENE-3308: DuplicateFilter keepMode and processingMode have been converted to
enums DuplicateFilter.KeepMode and DuplicateFilter.ProcessingMode respectively.
* LUCENE-3483: Move Function grouping collectors from Solr to grouping module.
(Martijn van Groningen)
* LUCENE-3606: FieldNormModifier was deprecated, because IndexReader's
setNorm() was deprecated. Furthermore, this class is broken, as it does
not take position overlaps into account while recalculating norms.
(Uwe Schindler, Robert Muir)
* LUCENE-3936: Renamed StringIndexDocValues to DocTermsIndexDocValues.
(Martijn van Groningen)
* LUCENE-1768: Deprecated Parametric(Range)QueryNode, RangeQueryNode(Builder),
ParametricRangeQueryNodeProcessor were removed. (Vinicius Barros via Uwe Schindler)
* LUCENE-3820: Deprecated constructors accepting pattern matching bounds. The input
is buffered and matched in one pass. (Dawid Weiss)
* LUCENE-2413: Deprecated PatternAnalyzer in common/miscellaneous, in favor
of the pattern package (CharFilter, Tokenizer, TokenFilter). (Robert Muir)
* LUCENE-2413: Removed the AnalyzerUtil in common/miscellaneous. (Robert Muir)
* LUCENE-1370: Added ShingleFilter option to output unigrams if no shingles
can be generated. (Chris Harris via Steven Rowe)
* LUCENE-2514, LUCENE-2551: JDK and ICU CollationKeyAnalyzers were changed to
use pure byte keys when Version >= 4.0. This cuts sort key size approximately
in half. (Robert Muir)
* LUCENE-3400: Removed DutchAnalyzer.setStemDictionary (Chris Male)
* LUCENE-3431: Removed QueryAutoStopWordAnalyzer.addStopWords* deprecated methods
since they prevented reuse. Stopwords are now generated at instantiation through
the Analyzer's constructors. (Chris Male)
* LUCENE-3434: Removed ShingleAnalyzerWrapper.set* and PerFieldAnalyzerWrapper.addAnalyzer
since they prevent reuse. Both Analyzers should be configured at instantiation.
(Chris Male)
* LUCENE-3765: Stopset ctors that previously took Set<?> or Map<?,String> now take
CharArraySet and CharArrayMap respectively. Previously the behavior was confusing,
and sometimes different depending on the type of set, and ultimately a CharArraySet
or CharArrayMap was always used anyway. (Robert Muir)
* LUCENE-3830: Switched to NormalizeCharMap.Builder to create
immutable instances of NormalizeCharMap. (Dawid Weiss, Mike
McCandless)
* LUCENE-4063: FrenchLightStemmer no longer deletes repeated digits.
(Tanguy Moal via Steve Rowe)
* LUCENE-4122: Replace Payload with BytesRef. (Andrzej Bialecki)
* LUCENE-4132: IndexWriter.getConfig() now returns a LiveIndexWriterConfig object
which can be used to change the IndexWriter's live settings. IndexWriterConfig
is used only for initializing the IndexWriter. (Shai Erera)
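  A minimal sketch of changing a live setting through the new getConfig() ('writer' is assumed; the setting and value are illustrative):
    writer.getConfig().setRAMBufferSizeMB(64.0);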
* LUCENE-3866: IndexReaderContext.leaves() is now the preferred way to access
atomic sub-readers of any kind of IndexReader (for AtomicReaders it returns
itself as the only leaf with docBase=0). (Uwe Schindler)
New features
* LUCENE-2604: Added RegexpQuery support to QueryParser. Regular expressions
are directly supported by the standard queryparser via
fieldName:/expression/ OR /expression against default field/
Users who wish to search for literal "/" characters are advised to
backslash-escape or quote those characters as needed.
(Simon Willnauer, Robert Muir)
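  A minimal sketch of the new syntax through the classic QueryParser ('analyzer' is assumed; the pattern is illustrative):
    QueryParser parser = new QueryParser(Version.LUCENE_40, "body", analyzer);
    Query q = parser.parse("body:/[fb]oo[0-9]*/");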
* LUCENE-1606, LUCENE-2089: Adds AutomatonQuery, a MultiTermQuery that
matches terms against a finite-state machine. Implement WildcardQuery
and FuzzyQuery with finite-state methods. Adds RegexpQuery.
(Robert Muir, Mike McCandless, Uwe Schindler, Mark Miller)
* LUCENE-3662: Add support for levenshtein distance with transpositions
to LevenshteinAutomata, FuzzyTermsEnum, and DirectSpellChecker.
(Jean-Philippe Barrette-LaPierre, Robert Muir)
* LUCENE-2321: Cutover to a more RAM efficient packed-ints based
representation for the in-memory terms dict index. (Mike
McCandless)
* LUCENE-2126: Add new classes for data (de)serialization: DataInput
and DataOutput. IndexInput and IndexOutput extend these new classes.
(Michael Busch)
* LUCENE-1458, LUCENE-2111: With flexible indexing it is now possible
for an application to create its own postings codec, to alter how
fields, terms, docs and positions are encoded into the index. The
standard codec is the default codec. IndexWriter accepts a Codec
class to obtain codecs for newly written segments.
* LUCENE-1458, LUCENE-2111: Some experimental codecs have been added
for flexible indexing, including pulsing codec (inlines
low-frequency terms directly into the terms dict, avoiding seeking
for some queries), sep codec (stores docs, freqs, positions, skip
data and payloads in 5 separate files instead of the 2 used by
standard codec), and int block (really a "base" for using
block-based compressors like PForDelta for storing postings data).
* LUCENE-1458, LUCENE-2111: The in-memory terms index used by standard
codec is more RAM efficient: terms data is stored as block byte
arrays and packed integers. Net RAM reduction for indexes that have
many unique terms should be substantial, and initial open time for
IndexReaders should be faster. These gains only apply for newly
written segments after upgrading.
* LUCENE-1458, LUCENE-2111: Terms data are now buffered directly as
byte[] during indexing, which uses half the RAM for ascii terms (and
also numeric fields). This can improve indexing throughput for
applications that have many unique terms, since it reduces how often
a new segment must be flushed given a fixed RAM buffer size.
* LUCENE-2489: Added PerFieldCodecWrapper (in oal.index.codecs) which
lets you set the Codec per field (Mike McCandless)
* LUCENE-2373: Extend Codec to use SegmentInfosWriter and
SegmentInfosReader to allow customization of SegmentInfos data.
(Andrzej Bialecki)
* LUCENE-2504: FieldComparator.setNextReader now returns a
FieldComparator instance. You can "return this", to just reuse the
same instance, or you can return a comparator optimized to the new
segment. (yonik, Mike McCandless)
* LUCENE-2648: PackedInts.Iterator now supports to advance by more than a
single ordinal. (Simon Willnauer)
* LUCENE-2649: Objects in the FieldCache can optionally store Bits
that mark which docs have real values in the native[] (ryan)
* LUCENE-2664: Add SimpleText codec, which stores all terms/postings
data in a single text file for transparency (at the expense of poor
performance). (Sahin Buyrukbilen via Mike McCandless)
* LUCENE-2589: Add a VariableSizedIntIndexInput, which, when used w/
Sep*, makes it simple to take any variable sized int block coders
(like Simple9/16) and use them in a codec. (Mike McCandless)
* LUCENE-2597: Add oal.index.SlowCompositeReaderWrapper, to wrap a
composite reader (eg MultiReader or DirectoryReader), making it
pretend it's an atomic reader. This is a convenience class (you can
use MultiFields static methods directly, instead) if you need to use
the flex APIs directly on a composite reader. (Mike McCandless)
* LUCENE-2690: MultiTermQuery boolean rewrites per segment.
(Uwe Schindler, Robert Muir, Mike McCandless, Simon Willnauer)
* LUCENE-996: The QueryParser now accepts mixed inclusive and exclusive
bounds for range queries. Example: "{3 TO 5]"
QueryParser subclasses that overrode getRangeQuery will need to be changed
to use the new getRangeQuery method. (Andrew Schurman, Mark Miller, yonik)
* LUCENE-2742: Add native per-field postings format support. Codec now lets you
register a postings format for each field, which is in turn recorded
into the index. Postings formats are maintained on a per-segment basis and can be
resolved without knowing the actual postings format used for writing the segment.
(Simon Willnauer)
* LUCENE-2741: Add support for multiple codecs that use the same file
extensions within the same segment. Codecs now use their per-segment codec
ID in the file names. (Simon Willnauer)
* LUCENE-2843: Added a new terms index impl,
VariableGapTermsIndexWriter/Reader, that accepts a pluggable
IndexTermSelector for picking which terms should be indexed in the
terms dict. This impl stores the indexed terms in an FST, which is
much more RAM efficient than FixedGapTermsIndex. (Mike McCandless)
* LUCENE-2862: Added TermsEnum.totalTermFreq() and
Terms.getSumTotalTermFreq(). (Mike McCandless, Robert Muir)
* LUCENE-3290: Added Terms.getSumDocFreq() (Mike McCandless, Robert Muir)
* LUCENE-3003: Added new expert class oal.index.DocTermsOrd,
refactored from Solr's UnInvertedField, for accessing term ords for
multi-valued fields, per document. This is similar to FieldCache in
that it inverts the index to compute the ords, but differs in that
it's able to handle multi-valued fields and does not hold the term
bytes in RAM. (Mike McCandless)
* LUCENE-3108, LUCENE-2935, LUCENE-2168, LUCENE-1231: Changes from
DocValues (ColumnStrideFields):
- IndexWriter now supports typesafe dense per-document values stored in
a column like storage. DocValues are stored on a per-document
basis where each document's field can hold exactly one value of a given
type. DocValues are provided via Fieldable and can be used in
conjunction with stored and indexed values.
- DocValues provides an entirely RAM resident document id to value
mapping per field as well as a DocIdSetIterator based disk-resident
sequential access API relying on filesystem-caches.
- Both APIs are exposed via IndexReader and the Codec / Flex API allowing
expert users to integrate customized DocValues reader and writer
implementations by extending existing Codecs.
- DocValues provides implementations for primitive datatypes like int,
long, float, double and arrays of byte. Byte based implementations further
provide storage variants like straight or dereferenced stored bytes, fixed
and variable length bytes as well as index time sorted based on
user-provided comparators.
(Mike McCandless, Simon Willnauer)
* LUCENE-3209: Added MemoryCodec, which stores all terms & postings in
RAM as an FST; this is good for primary-key fields if you frequently
need to lookup by that field or perform deletions against it, for
example in a near-real-time setting. (Mike McCandless)
* SOLR-2533: Added support for rewriting Sort and SortFields using an
IndexSearcher. SortFields can have SortField.REWRITEABLE type which
requires they are rewritten before they are used. (Chris Male)
* LUCENE-3203: FSDirectory can now limit the max allowed write rate
(MB/sec) of all running merges, to reduce impact ongoing merging has
on searching, NRT reopen time, etc. (Mike McCandless)
* LUCENE-2793: Directory#createOutput & Directory#openInput now accept an
IOContext instead of a buffer size to allow low level optimizations for
different usecases like merging, flushing and reading.
(Simon Willnauer, Mike McCandless, Varun Thacker)
* LUCENE-3354: FieldCache can cache DocTermOrds. (Martijn van Groningen)
* LUCENE-3376: ReusableAnalyzerBase has been moved from modules/analysis/common
into lucene/src/java/org/apache/lucene/analysis (Chris Male)
* LUCENE-3423: add Terms.getDocCount(), which returns the number of documents
that have at least one term for a field. (Yonik Seeley, Robert Muir)
* LUCENE-2959: Added a variety of different relevance ranking systems to Lucene.
- Added Okapi BM25, Language Models, Divergence from Randomness, and
Information-Based Models. The models are pluggable, support all of lucene's
features (boosts, slops, explanations, etc) and queries (spans, etc).
- All models default to the same index-time norm encoding as
DefaultSimilarity, so you can easily try these out/switch back and
forth/run experiments and comparisons without reindexing. Note: most of
the models do rely upon index statistics that are new in Lucene 4.0, so
for existing 3.x indexes it's a good idea to upgrade your index to the
new format with IndexUpgrader first.
- Added a new subclass SimilarityBase which provides a simplified API
for plugging in new ranking algorithms without dealing with all of the
nuances and implementation details of Lucene.
- For example, to use BM25 for all fields:
searcher.setSimilarity(new BM25Similarity());
If you instead want to apply different similarities (e.g. ones with
different parameter values or different algorithms entirely) to different
fields, implement PerFieldSimilarityWrapper with your per-field logic.
(David Mark Nemeskey via Robert Muir)
* LUCENE-3396: ReusableAnalyzerBase now provides a ReuseStrategy abstraction
which controls how TokenStreamComponents are reused per request. Two
implementations are provided - GlobalReuseStrategy which implements the
current behavior of sharing components between all fields, and
PerFieldReuseStrategy which shares per field. (Chris Male)
* LUCENE-2309: Added IndexableField.tokenStream(Analyzer) which is now
responsible for creating the TokenStreams for Fields when they are to
be indexed. (Chris Male)
* LUCENE-3433: Added random access for non RAM resident IndexDocValues. RAM
resident and disk resident IndexDocValues are now exposed via the Source
interface. ValuesEnum has been removed in favour of Source. (Simon Willnauer)
* LUCENE-1536: Filters can now be applied down-low, if their DocIdSet implements
a new bits() method, returning all documents in a random access way. If the
DocIdSet is not too sparse, it will be passed as acceptDocs down to the Scorer
as replacement for AtomicReader's live docs.
In addition, FilteredQuery now backs IndexSearcher's filtering search methods.
Using FilteredQuery you can chain Filters in a very performant way
[new FilteredQuery(new FilteredQuery(query, filter1), filter2)], which was not
possible with IndexSearcher's methods. FilteredQuery also allows overriding
the heuristics used to decide if filtering should be done random access or
using a conjunction on DocIdSet's iterator().
(Mike McCandless, Uwe Schindler, Robert Muir, Chris Male, Yonik Seeley,
Jason Rutherglen, Paul Elschot)
* LUCENE-3638: Added sugar methods to IndexReader and IndexSearcher to
load only certain fields when loading a document. (Peter Chang via
Mike McCandless)
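  A minimal sketch of loading a single stored field ('reader' and 'docID' are assumed; the field name is illustrative):
    Document doc = reader.document(docID, Collections.singleton("title"));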
* LUCENE-3628: Norms are represented as DocValues. AtomicReader exposes
a #normValues(String) method to obtain norms per field. (Simon Willnauer)
* LUCENE-3687: Similarity#computeNorm(FieldInvertState, Norm) allows computing
norm values of arbitrary precision. Instead of returning a fixed single-byte
value, custom similarities can now set an integer, float or byte value on the
given Norm object. (Simon Willnauer)
* LUCENE-2604, LUCENE-4103: Added RegexpQuery support to contrib/queryparser.
(Simon Willnauer, Robert Muir, Daniel Truemper)
* LUCENE-2373: Added a Codec implementation that works with append-only
filesystems (such as e.g. Hadoop DFS). SegmentInfos writing/reading
code is refactored to support append-only FS, and to allow for future
customization of per-segment information. (Andrzej Bialecki)
* LUCENE-2479: Added ability to provide a sort comparator for spelling suggestions along
with two implementations. The existing comparator (score, then frequency) is the default (Grant Ingersoll)
* LUCENE-2608: Added the ability to specify the accuracy at method time in the SpellChecker. The per class
method is also still available. (Grant Ingersoll)
* LUCENE-2507: Added DirectSpellChecker, which retrieves correction candidates directly
from the term dictionary using levenshtein automata. (Robert Muir)
* LUCENE-3527: Add LuceneLevenshteinDistance, which computes string distance in a way
compatible with DirectSpellChecker. This can be used to merge top-N results from more than one
SpellChecker. (James Dyer via Robert Muir)
* LUCENE-3496: Support grouping by DocValues. (Martijn van Groningen)
* LUCENE-2795: Generified DirectIOLinuxDirectory to work across any
unix supporting the O_DIRECT flag when opening a file (tested on
Linux and OS X but likely other Unixes will work), and improved it
so it can be used for indexing and searching. The directory uses
direct IO when doing large merges to avoid unnecessarily evicting
cached IO pages due to large merges. (Varun Thacker, Mike
McCandless)
* LUCENE-3827: DocsAndPositionsEnum from MemoryIndex implements
start/endOffset, if offsets are indexed. (Alan Woodward via Mike
McCandless)
* LUCENE-3802, LUCENE-3856: Support for grouped faceting. (Martijn van Groningen)
* LUCENE-3444: Added a second pass grouping collector that keeps track of distinct
values for a specified field for the top N group. (Martijn van Groningen)
* LUCENE-3778: Added a grouping utility class that makes it easier to use result
grouping for pure Lucene apps. (Martijn van Groningen)
* LUCENE-2341: A new analysis/ filter: Morfologik - a dictionary-driven lemmatizer
(accurate stemmer) for Polish (includes morphosyntactic annotations).
(Michał Dybizbański, Dawid Weiss)
* LUCENE-2413: Consolidated Lucene/Solr analysis components into analysis/common.
New features from Solr now available to Lucene users include:
- o.a.l.analysis.commongrams: Constructs n-grams for frequently occurring terms
and phrases.
- o.a.l.analysis.charfilter.HTMLStripCharFilter: CharFilter that strips HTML
constructs.
- o.a.l.analysis.miscellaneous.WordDelimiterFilter: TokenFilter that splits words
into subwords and performs optional transformations on subword groups.
- o.a.l.analysis.miscellaneous.RemoveDuplicatesTokenFilter: TokenFilter which
filters out Tokens at the same position and Term text as the previous token.
- o.a.l.analysis.miscellaneous.TrimFilter: Trims leading and trailing whitespace
from Tokens in the stream.
- o.a.l.analysis.miscellaneous.KeepWordFilter: A TokenFilter that only keeps tokens
with text contained in the required words (inverse of StopFilter).
- o.a.l.analysis.miscellaneous.HyphenatedWordsFilter: A TokenFilter that puts
hyphenated words broken into two lines back together.
- o.a.l.analysis.miscellaneous.CapitalizationFilter: A TokenFilter that applies
capitalization rules to tokens.
- o.a.l.analysis.pattern: Package for pattern-based analysis, containing a
CharFilter, Tokenizer, and TokenFilter for transforming text with regexes.
- o.a.l.analysis.synonym.SynonymFilter: A synonym filter that supports multi-word
synonyms.
- o.a.l.analysis.phonetic: Package for phonetic search, containing various
phonetic encoders such as Double Metaphone.
Some existing analysis components changed packages:
- o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
- o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
- o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
- o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
- o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer
- o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
- o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
- o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
- o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer
- o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer
- o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
- o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
- o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
- o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter
- o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
- o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
- o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter
- o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
- o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter
- o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter
- o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap
- o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
- o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
- o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase
- o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase
- o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
- o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
- o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils
All analyzers in contrib/analyzers and contrib/icu were moved to the
analysis/ module. The 'smartcn' and 'stempel' components now depend on 'common'.
(Chris Male, Robert Muir)
* LUCENE-4004: Add DisjunctionMaxQuery support to the xml query parser.
(Benson Margulies via Robert Muir)
* LUCENE-4025: Add maybeRefreshBlocking to ReferenceManager, to let a caller
block until the refresh logic has been executed. (Shai Erera, Mike McCandless)
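  A minimal sketch, assuming 'searcherManager' is an existing SearcherManager:
    searcherManager.maybeRefreshBlocking();       // blocks until the refresh has completed
    IndexSearcher s = searcherManager.acquire();
    try {
      // ... search with s ...
    } finally {
      searcherManager.release(s);
    }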
* LUCENE-4039: Add AddIndexesTask to benchmark, which uses IW.addIndexes.
(Shai Erera)
* LUCENE-3514: Added IndexSearcher.searchAfter when Sort is used,
returning results after a specified FieldDoc for deep
paging. (Mike McCandless)
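  A minimal sketch of deep paging with a Sort ('searcher', 'query' and 'sort' are assumed):
    TopDocs page1 = searcher.search(query, 10, sort);
    ScoreDoc last = page1.scoreDocs[page1.scoreDocs.length - 1];
    TopDocs page2 = searcher.searchAfter(last, query, 10, sort);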
* LUCENE-4043: Added scoring support via score mode for query time joining.
(Martijn van Groningen, Mike McCandless)
* LUCENE-3523: Added oal.search.spell.WordBreakSpellChecker, which
generates suggestions by combining two or more terms and/or
breaking terms into multiple words. See Javadocs for usage. (James Dyer)
* LUCENE-4019: Added improved parsing of Hunspell dictionaries so that rules
missing the required number of parameters are either ignored or
cause a ParseException (depending on whether strict parsing is enabled).
(Luca Cavanna via Chris Male)
* LUCENE-3440: Add ordered fragments feature with IDF-weighted terms for FVH.
(Sebastian Lutze via Koji Sekiguchi)
* LUCENE-4082: Added explain to ToParentBlockJoinQuery.
(Christoph Kaser, Martijn van Groningen)
* LUCENE-4108: add replaceTaxonomy to DirectoryTaxonomyWriter, which replaces
the taxonomy in place with the given one. (Shai Erera)
Optimizations
* LUCENE-2588: Don't store unnecessary suffixes when writing the terms
index, saving RAM in IndexReader; change default terms index
interval from 128 to 32, because the terms index now requires much
less RAM. (Robert Muir, Mike McCandless)
* LUCENE-2669: Optimize NumericRangeQuery.NumericRangeTermsEnum to
not seek backwards when a sub-range has no terms. It now only seeks
when the current term is less than the next sub-range's lower end.
(Uwe Schindler, Mike McCandless)
* LUCENE-2694: Optimize MultiTermQuery to be single pass for Term lookups.
MultiTermQuery now stores TermState per leaf reader during rewrite to re-
seek the term dictionary in TermQuery / TermWeight.
(Simon Willnauer, Mike McCandless, Robert Muir)
* LUCENE-3292: IndexWriter no longer shares the same SegmentReader
instance for merging and NRT readers, which enables directory impls
to separately tune IO flags for each. (Varun Thacker, Simon
Willnauer, Mike McCandless)
* LUCENE-3328: BooleanQuery now uses a specialized ConjunctionScorer if all
boolean clauses are required and instances of TermQuery.
(Simon Willnauer, Robert Muir)
* LUCENE-3643: FilteredQuery and IndexSearcher.search(Query, Filter,...)
now optimize the special case query instanceof MatchAllDocsQuery to
execute as ConstantScoreQuery. (Uwe Schindler)
* LUCENE-3509: Added fasterButMoreRam option for docvalues. This option controls whether the space for packed ints
should be rounded up for better performance. This option only applies for docvalues types bytes fixed sorted
and bytes var sorted. (Simon Willnauer, Martijn van Groningen)
* LUCENE-3795: Replace contrib/spatial with modules/spatial. This includes
a basic spatial strategy interface. (David Smiley, Chris Male, ryan)
* LUCENE-3932: Lucene3x codec loads terms index faster, by
pre-allocating the packed ints array based on the .tii file size
(Sean Bridges via Mike McCandless)
* LUCENE-3468: Replaced last() and remove() with pollLast() in
FirstPassGroupingCollector (Martijn van Groningen)
* LUCENE-3830: Changed MappingCharFilter/NormalizeCharMap to use an
FST under the hood, which requires less RAM. NormalizeCharMap no
longer accepts empty string match (it did previously, but ignored
it). (Dawid Weiss, Mike McCandless)
* LUCENE-4061: Improved synchronization in DirectoryTaxonomyWriter.addCategory,
plus a few general improvements to DirectoryTaxonomyWriter.
(Shai Erera, Gilad Barkai)
* LUCENE-4062: Add new aligned packed bits impls for faster lookup
performance; add float acceptableOverheadRatio to getWriter and
getMutable API to give packed ints freedom to pick faster
implementations (Adrien Grand via Mike McCandless)
* LUCENE-2357: Reduce transient RAM usage when merging segments in
IndexWriter. (Adrien Grand)
* LUCENE-4098: Add bulk get/set methods to PackedInts (Adrien Grand
via Mike McCandless)
* LUCENE-4156: DirectoryTaxonomyWriter.getSize is no longer synchronized.
(Shai Erera, Sivan Yogev)
* LUCENE-4163: Improve concurrency of MMapIndexInput.clone() by using
the new WeakIdentityMap on top of a ConcurrentHashMap to manage
the cloned instances. WeakIdentityMap was extended to support
iterating over its keys. (Uwe Schindler)
Bug fixes
* LUCENE-2803: The FieldCache can miss values if an entry for a reader
with more document deletions is requested before a reader with fewer
deletions, provided they share some segments. (yonik)
* LUCENE-2645: Fix false assertion error when same token was added one
after another with 0 posIncr. (David Smiley, Kurosaka Teruhiko via Mike
McCandless)
* LUCENE-3348: Fix thread safety hazards in IndexWriter that could
rarely cause deletions to be incorrectly applied. (Yonik Seeley,
Simon Willnauer, Mike McCandless)
* LUCENE-3515: Fix terrible merge performance versus 3.x, especially
when the directory isn't MMapDirectory, due to failing to reuse
DocsAndPositionsEnum while merging (Marc Sturlese, Erick Erickson,
Robert Muir, Simon Willnauer, Mike McCandless)
* LUCENE-3589: BytesRef copy(short) didn't set length.
(Peter Chang via Robert Muir)
* LUCENE-3045: fixed QueryNodeImpl.containsTag(String key) that was
not lowercasing the key before checking for the tag (Adriano Crestani)
* LUCENE-3890: Fixed NPE for grouped faceting on multi-valued fields.
(Michael McCandless, Martijn van Groningen)
* LUCENE-2945: Fix hashCode/equals for surround query parser generated queries.
(Paul Elschot, Simon Rosenthal, gsingers via ehatcher)
* LUCENE-3971: MappingCharFilter could return invalid final token position.
(Dawid Weiss, Robert Muir)
* LUCENE-3820: PatternReplaceCharFilter could return invalid token positions.
(Dawid Weiss)
* LUCENE-3969: Throw IAE on bad arguments that could cause confusing errors in
CompoundWordTokenFilterBase, PatternTokenizer, PositionFilter,
SnowballFilter, PathHierarchyTokenizer, ReversePathHierarchyTokenizer,
WikipediaTokenizer, and KeywordTokenizer. ShingleFilter and
CommonGramsFilter now populate PositionLengthAttribute. Fixed
PathHierarchyTokenizer to reset() all state. Protect against AIOOBE in
ReversePathHierarchyTokenizer if skip is large. Fixed wrong final
offset calculation in PathHierarchyTokenizer.
(Mike McCandless, Uwe Schindler, Robert Muir)
* LUCENE-4060: Fix a synchronization bug in
DirectoryTaxonomyWriter.addTaxonomies(). Also, the method has been renamed to
addTaxonomy and now takes only one Directory and one OrdinalMap.
(Shai Erera, Gilad Barkai)
* LUCENE-3590: Fix AIOOBE in BytesRef/CharsRef copyBytes/copyChars when
offset is nonzero, fix off-by-one in CharsRef.subSequence, and fix
CharsRef's CharSequence methods to throw exceptions in boundary cases
to properly meet the specification. (Robert Muir)
* LUCENE-4084: Attempting to reuse a single IndexWriterConfig instance
across more than one IndexWriter resulted in a cryptic exception.
This is now fixed, but requires that certain members of
IndexWriterConfig (MergePolicy, FlushPolicy,
DocumentsWriterThreadPool) implement clone. (Robert Muir, Simon
Willnauer, Mike McCandless)
* LUCENE-4079: Fixed loading of Hunspell dictionaries that use aliasing (AF rules)
(Ludovic Boutros via Chris Male)
* LUCENE-4077: Expose the max score and per-group scores from
ToParentBlockJoinCollector (Christoph Kaser, Mike McCandless)
* LUCENE-4114: Fix int overflow bugs in BYTES_FIXED_STRAIGHT and
BYTES_FIXED_DEREF doc values implementations (Walt Elder via Mike McCandless).
* LUCENE-4147: Fixed thread safety issues when rollback() and commit()
are called simultaneously. (Simon Willnauer, Mike McCandless)
* LUCENE-4165: Removed closing of the Reader used to read the affix file in
HunspellDictionary. Consumers are now responsible for closing all InputStreams
once the Dictionary has been instantiated. (Torsten Krah, Uwe Schindler, Chris Male)
Documentation
* LUCENE-3958: Javadocs corrections for IndexWriter.
(Iulius Curt via Robert Muir)
Build
* LUCENE-4047: Cleanup of LuceneTestCase: moved blocks of initialization/ cleanup
code into JUnit instance and class rules. (Dawid Weiss)
* LUCENE-4016: Require ANT 1.8.2+ for the build.
* LUCENE-3808: Refactoring of testing infrastructure to use randomizedtesting
package: http://labs.carrotsearch.com/randomizedtesting.html (Dawid Weiss)
* LUCENE-3964: Added target stage-maven-artifacts, which stages
Maven release artifacts to a Maven staging repository in preparation
for release. (Steve Rowe)
* LUCENE-2845: Moved contrib/benchmark to lucene/benchmark.
* LUCENE-2995: Moved contrib/spellchecker into lucene/suggest.
* LUCENE-3285: Moved contrib/queryparser into lucene/queryparser
* LUCENE-3285: Moved contrib/xml-query-parser's demo into lucene/demo
* LUCENE-3271: Moved contrib/queries BooleanFilter, BoostingQuery,
ChainedFilter, FilterClause and TermsFilter into lucene/queries
* LUCENE-3381: Moved contrib/queries regex.*, DuplicateFilter,
FuzzyLikeThisQuery and SlowCollated* into lucene/sandbox.
Removed contrib/queries.
* LUCENE-3286: Moved remainder of contrib/xml-query-parser to lucene/queryparser.
Classes now found at org.apache.lucene.queryparser.xml.*
* LUCENE-4059: Improve ANT task prepare-webpages (used by documentation
tasks) to correctly encode build file names as URIs for later processing by
XSL. (Greg Bowyer, Uwe Schindler)
======================= Lucene 3.6.1 =======================
More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
https://wiki.apache.org/lucene-java/Lucene3.6.1
Bug Fixes
* LUCENE-3969: Throw IAE on bad arguments that could cause confusing
errors in KeywordTokenizer.
(Uwe Schindler, Mike McCandless, Robert Muir)
* LUCENE-3971: MappingCharFilter could return invalid final token position.
(Dawid Weiss, Robert Muir)
* LUCENE-4023: DisjunctionMaxScorer now implements visitSubScorers().
(Uwe Schindler)
* LUCENE-2566: + - operators allow any amount of whitespace (yonik, janhoy)
* LUCENE-3590: Fix AIOOBE in BytesRef/CharsRef copyBytes/copyChars when
offset is nonzero, fix off-by-one in CharsRef.subSequence, and fix
CharsRef's CharSequence methods to throw exceptions in boundary cases
to properly meet the specification. (Robert Muir)
* LUCENE-4222: TieredMergePolicy.getFloorSegmentMB was returning the
size in bytes not MB (Chris Fuller via Mike McCandless)
API Changes
* LUCENE-4023: Changed the visibility of Scorer#visitSubScorers() to
public, otherwise it's impossible to implement Scorers outside
the Lucene package. (Uwe Schindler)
Optimizations
* LUCENE-4163: Improve concurrency of MMapIndexInput.clone() by using
the new WeakIdentityMap on top of a ConcurrentHashMap to manage
the cloned instances. WeakIdentityMap was extended to support
iterating over its keys. (Uwe Schindler)
Tests
* LUCENE-3873: add MockGraphTokenFilter, testing analyzers with
random graph tokens. (Mike McCandless)
* LUCENE-3968: factor out LookaheadTokenFilter from
  MockGraphTokenFilter (Mike McCandless)
======================= Lucene 3.6.0 =======================
More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
https://wiki.apache.org/lucene-java/Lucene3.6
Changes in backwards compatibility policy
* LUCENE-3594: The protected inner class (never intended to be visible)
FieldCacheTermsFilter.FieldCacheTermsFilterDocIdSet was removed and
replaced by another internal implementation. (Uwe Schindler)
* LUCENE-3620: FilterIndexReader now overrides all methods of IndexReader that
it should (note that some are still not overridden, as they should be
overridden by sub-classes only). In the process, some methods of IndexReader
were made final. This is not expected to affect many apps, since these methods
already delegate to abstract methods, which you had to already override
anyway. (Shai Erera)
* LUCENE-3636: Added SearcherFactory, used by SearcherManager and NRTManager
to create new IndexSearchers. You can provide your own implementation to
warm new searchers, set an ExecutorService, set a custom Similarity, or
even return your own subclass of IndexSearcher. The SearcherWarmer and
ExecutorService parameters on these classes were removed, as they are
subsumed by SearcherFactory. (Shai Erera, Mike McCandless, Robert Muir)
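  A hedged sketch of the SearcherFactory hook described above (the newSearcher(IndexReader)
  signature is recalled from memory; the warming query, field and Similarity choice are
  placeholders, not part of this change):

      public class WarmingSearcherFactory extends SearcherFactory {
        @Override
        public IndexSearcher newSearcher(IndexReader reader) throws IOException {
          IndexSearcher searcher = new IndexSearcher(reader);
          searcher.setSimilarity(new DefaultSimilarity());   // or your custom Similarity
          // warm the new reader with a representative query before it is published
          searcher.search(new TermQuery(new Term("body", "warmup")), 10);
          return searcher;
        }
      }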
* LUCENE-3644: The expert ReaderFinishedListener api suffered problems (propagated
down to subreaders, but was not called on SegmentReaders, unless they were
the owner of the reader core, and other ambiguities). The API is revised:
You can set ReaderClosedListeners on any IndexReader, and onClose is called
when that reader is closed. SegmentReader has CoreClosedListeners that you
can register to know when a shared reader core is closed.
(Uwe Schindler, Mike McCandless, Robert Muir)
* LUCENE-3652: The package org.apache.lucene.messages was moved to
contrib/queryparser. If you have used those classes in your code
just add the lucene-queryparser.jar file to your classpath.
(Uwe Schindler)
* LUCENE-3681: FST now stores labels for BYTE2 input type as 2 bytes
instead of vInt; this can make FSTs smaller and faster, but it is a
break in the binary format so if you had built and saved any FSTs
then you need to rebuild them. (Robert Muir, Mike McCandless)
* LUCENE-3679: The expert IndexReader.getFieldNames(FieldOption) API
has been removed and replaced with the experimental getFieldInfos
API. All IndexReader subclasses must implement getFieldInfos.
(Mike McCandless)
* LUCENE-3695: Move confusing add(X) methods out of FST.Builder into
FST.Util. (Robert Muir, Mike McCandless)
* LUCENE-3701: Added an additional argument to the expert FST.Builder
ctor to take FreezeTail, which you can use to (very-expertly) customize
the FST construction process. Pass null if you want the default
behavior. Added seekExact() to FSTEnum, and added FST.save/read
from a File. (Mike McCandless, Dawid Weiss, Robert Muir)
* LUCENE-3712: Removed unused and untested ReaderUtil#subReader methods.
(Uwe Schindler)
* LUCENE-3672: Deprecate Directory.fileModified,
IndexCommit.getTimestamp and .getVersion and
IndexReader.lastModified and getCurrentVersion (Andrzej Bialecki,
Robert Muir, Mike McCandless)
* LUCENE-3760: In IndexReader/DirectoryReader, deprecate static
methods getCurrentVersion and getCommitUserData, and non-static
method getCommitUserData (use getIndexCommit().getUserData()
instead). (Ryan McKinley, Robert Muir, Mike McCandless)
* LUCENE-3867: Deprecate instance creation of RamUsageEstimator, instead
the new static method sizeOf(Object) should be used. As the algorithm
is now using Hotspot(TM) internals (reference size, header sizes,
object alignment), the abstract o.a.l.util.MemoryModel class was
completely removed (without replacement). The new static methods
no longer support String intern-ness checking, interned strings
now count to memory usage as any other Java object.
(Dawid Weiss, Uwe Schindler, Shai Erera)
* LUCENE-3738: All readXxx methods in BufferedIndexInput were made
final. Subclasses should only override protected readInternal /
seekInternal. (Uwe Schindler)
* LUCENE-2599: Deprecated the spatial contrib module, which was buggy and not
well maintained. Lucene 4 includes a new spatial module that replaces this.
(David Smiley, Ryan McKinley, Chris Male)
Changes in Runtime Behavior
* LUCENE-3796, SOLR-3241: Throw an exception if you try to set an index-time
boost on a field that omits norms. Because the index-time boost
is multiplied into the norm, previously your boost would be
silently discarded. (Tomás Fernández Löbbe, Hoss Man, Robert Muir)
* LUCENE-3848: Fix tokenstreams to not produce a stream with an initial
position increment of 0: which is out of bounds (overlapping with a
non-existent previous term). Consumers such as IndexWriter and QueryParser
still check for and silently correct this situation today, but at some point
in the future they may throw an exception. (Mike McCandless, Robert Muir)
* LUCENE-3738: DataInput/DataOutput no longer allow negative vLongs. Negative
vInts are still supported (for index backwards compatibility), but
should not be used in new code. The read method for negative vLongs
was already broken since Lucene 3.1.
(Uwe Schindler, Mike McCandless, Robert Muir)
Security fixes
* LUCENE-3588: Try harder to prevent SIGSEGV on cloned MMapIndexInputs:
Previous versions of Lucene could SIGSEGV the JVM if you try to access
the clone of an IndexInput retrieved from MMapDirectory. This security fix
prevents this as best as it can by throwing AlreadyClosedException
also on clones. (Uwe Schindler, Robert Muir)
API Changes
* LUCENE-3606: IndexReader will be made read-only in Lucene 4.0, so all
methods allowing to delete or undelete documents using IndexReader were
deprecated; you should use IndexWriter now. Consequently
IndexReader.commit() and all open(), openIfChanged(), clone() methods
taking readOnly booleans (or IndexDeletionPolicy instances) were
deprecated. IndexReader.setNorm() is superfluous and was deprecated.
If you have to change per-document boost use CustomScoreQuery.
If you want to dynamically change norms (boost *and* length norm) at
query time, wrap your IndexReader using FilterIndexReader, overriding
FilterIndexReader.norms(). To persist the changes on disk, copy the
FilteredIndexReader to a new index using IndexWriter.addIndexes().
In Lucene 4.0, SimilarityProvider will allow you to customize scoring
using external norms, too. (Uwe Schindler, Robert Muir)
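  A minimal sketch (not part of this change) of the query-time norm override suggested
  above; the field name "title" and the 2x boost factor are hypothetical:

      class BoostedNormsReader extends FilterIndexReader {
        BoostedNormsReader(IndexReader in) { super(in); }
        @Override
        public byte[] norms(String field) throws IOException {
          byte[] norms = in.norms(field);              // norms of the wrapped reader
          if (norms == null || !"title".equals(field)) return norms;
          byte[] boosted = new byte[norms.length];
          for (int doc = 0; doc < norms.length; doc++) {
            float decoded = Similarity.getDefault().decodeNormValue(norms[doc]);
            boosted[doc] = Similarity.getDefault().encodeNormValue(decoded * 2.0f);
          }
          return boosted;                              // doubled boost for "title"
        }
      }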
* LUCENE-3735: PayloadProcessorProvider was changed to return a
ReaderPayloadProcessor instead of DirPayloadProcessor. The selection
of the provider to return for the factory is now based on the IndexReader
to be merged. To mimic the old behaviour, just use IndexReader.directory()
for choosing the provider by Directory. (Uwe Schindler)
* LUCENE-3765: Deprecated StopFilter ctor that took ignoreCase, because
in some cases (if the set is a CharArraySet), the argument is ignored.
Deprecated StandardAnalyzer and ClassicAnalyzer ctors that take File,
please use the Reader ctor instead. (Robert Muir)
* LUCENE-3766: Deprecate no-arg ctors of Tokenizer. Tokenizers are
TokenStreams with Readers: tokenizers with null Readers will not be
supported in Lucene 4.0, just use a TokenStream.
(Mike McCandless, Robert Muir)
* LUCENE-3769: Simplified NRTManager by requiring applyDeletes to be
passed to ctor only; if an app needs to mix and match it's free to
create two NRTManagers (one always applying deletes and the other
never applying deletes). (MJB, Shai Erera, Mike McCandless)
* LUCENE-3761: Generalize SearcherManager into an abstract ReferenceManager.
SearcherManager remains a concrete class, but due to the refactoring, the
method maybeReopen has been deprecated in favor of maybeRefresh().
(Shai Erera, Mike McCandless, Simon Willnauer)
* LUCENE-3776: You now acquire/release the IndexSearcher directly from
NRTManager. (Mike McCandless)
New Features
* LUCENE-3593: Added a FieldValueFilter that accepts all documents that either
have at least one or no value at all in a specific field. (Simon Willnauer,
Uwe Schindler, Robert Muir)
* LUCENE-3586: CheckIndex and IndexUpgrader allow you to specify the
specific FSDirectory implementation to use (with the new -dir-impl
command-line option). (Luca Cavanna via Mike McCandless)
* LUCENE-3634: IndexReader's static main method was moved to a new
tool, CompoundFileExtractor, in contrib/misc. (Robert Muir, Mike
McCandless)
* LUCENE-995: The QueryParser now interprets * as an open end for range
  queries. Literal asterisks may be represented by quoting or escaping
  (i.e. \* or "*"). Custom QueryParser subclasses overriding getRangeQuery()
  will be passed null for any open endpoint. (Ingo Renner, Adriano
  Crestani, yonik, Mike McCandless)
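  A short sketch of the new syntax; the field name "price" and the analyzer are only
  examples, not part of this change:

      QueryParser parser = new QueryParser(Version.LUCENE_36, "price",
          new StandardAnalyzer(Version.LUCENE_36));
      Query halfOpen = parser.parse("price:[100 TO *]");   // no upper bound
      Query anyValue = parser.parse("price:[* TO *]");     // both endpoints open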
* LUCENE-3121: Add sugar reverse lookup (given an output, find the
input mapping to it) for FSTs that have strictly monotonic long
outputs (such as an ord). (Mike McCandless)
* LUCENE-3671: Add TypeTokenFilter that filters tokens based on
their TypeAttribute. (Tommaso Teofili via Uwe Schindler)
* LUCENE-3690,LUCENE-3913: Added HTMLStripCharFilter, a CharFilter that strips
HTML markup. (Steve Rowe)
* LUCENE-3725: Added optional packing to FST building; this uses extra
RAM during building but results in a smaller FST. (Mike McCandless)
* LUCENE-3714: Add top N shortest cost paths search for FST.
(Robert Muir, Dawid Weiss, Mike McCandless)
* LUCENE-3789: Expose MTQ TermsEnum via RewriteMethod for non package private
access (Simon Willnauer)
* LUCENE-3881: Added UAX29URLEmailAnalyzer: a standard analyzer that recognizes
URLs and emails. (Steve Rowe)
Bug fixes
* LUCENE-3595: Fixed FieldCacheRangeFilter and FieldCacheTermsFilter
to correctly respect deletions on reopened SegmentReaders. Factored out
FieldCacheDocIdSet to be a top-level class. (Uwe Schindler, Simon Willnauer)
* LUCENE-3627: Don't let an errant 0-byte segments_N file corrupt the index.
(Ken McCracken via Mike McCandless)
* LUCENE-3630: The internal method MultiReader.doOpenIfChanged(boolean doClone)
  was overriding IndexReader.doOpenIfChanged(boolean readOnly), thereby changing the
  contract of the overridden method. The method was renamed and made private.
  In ParallelReader the bug did not exist, but the implementation method
  was also made private. (Uwe Schindler)
* LUCENE-3641: Fixed MultiReader to correctly propagate readerFinishedListeners
to clones/reopened readers. (Uwe Schindler)
* LUCENE-3642, SOLR-2891, LUCENE-3717: Fixed bugs in CharTokenizer, n-gram tokenizers/filters,
compound token filters, thai word filter, icutokenizer, pattern analyzer,
wikipediatokenizer, and smart chinese where they would create invalid offsets in
some situations, leading to problems in highlighting.
(Max Beutel, Edwin Steiner via Robert Muir)
* LUCENE-3639: TopDocs.merge was incorrectly setting TopDocs.maxScore to
Float.MIN_VALUE when it should be Float.NaN, when there were 0
hits. Improved age calculation in SearcherLifetimeManager, to have
double precision and to compute age to be how long ago the searcher
was replaced with a new searcher (Mike McCandless)
* LUCENE-3658: Corrected potential concurrency issues with
NRTCachingDir, fixed createOutput to overwrite any previous file,
and removed invalid asserts (Robert Muir, Mike McCandless)
* LUCENE-3605: don't sleep in a retry loop when trying to locate the
segments_N file (Robert Muir, Mike McCandless)
* LUCENE-3711: SentinelIntSet with a small initial size can go into
an infinite loop when expanded. This can affect grouping using
  TermAllGroupsCollector or TermAllGroupHeadsCollector if instantiated with a
  non-default small size. (Martijn van Groningen, yonik)
* LUCENE-3727: When writing stored fields and term vectors, Lucene
checks file sizes to detect a bug in some Sun JREs (LUCENE-1282),
however, on some NFS filesystems File.length() could be stale,
resulting in false errors like "fdx size mismatch while indexing".
These checks now use getFilePointer instead to avoid this.
(Jamir Shaikh, Mike McCandless, Robert Muir)
* LUCENE-3816: Fixed problem in FilteredDocIdSet, if null was returned
from the delegate DocIdSet.iterator(), which is allowed to return
null by DocIdSet specification when no documents match.
(Shay Banon via Uwe Schindler)
* LUCENE-3821: SloppyPhraseScorer missed documents that ExactPhraseScorer finds.
  When a phrase query had repeating terms (e.g. "yes no yes"), the
  sloppy query missed documents that the exact query matched.
  Fixed, except for repeating multiterms (e.g. "yes no yes|no").
  (Robert Muir, Doron Cohen)
* LUCENE-3841: Fix CloseableThreadLocal to also purge stale entries on
get(); this fixes certain cases where we were holding onto objects
for dead threads for too long (Matthew Bellew, Mike McCandless)
* LUCENE-3872: IndexWriter.close() now throws IllegalStateException if
you call it after calling prepareCommit() without calling commit()
first. (Tim Bogaert via Mike McCandless)
* LUCENE-3874: Throw IllegalArgumentException from IndexWriter (rather
than producing a corrupt index), if a positionIncrement would cause
integer overflow. This can happen, for example when using a buggy
TokenStream that forgets to call clearAttributes() in combination
with a StopFilter. (Robert Muir)
* LUCENE-3876: Fix bug where positions for a document exceeding
Integer.MAX_VALUE/2 would produce a corrupt index.
  (Simon Willnauer, Mike McCandless, Robert Muir)
* LUCENE-3880: UAX29URLEmailTokenizer now recognizes emails when the mailto:
scheme is prepended. (Kai Gülzau, Steve Rowe)
Optimizations
* LUCENE-3653: Improve concurrency in VirtualMethod and AttributeSource by
using a WeakIdentityMap based on a ConcurrentHashMap. (Uwe Schindler,
Gerrit Jansen van Vuuren)
Documentation
* LUCENE-3597: Fixed incorrect grouping documentation. (Martijn van Groningen,
Robert Muir)
* LUCENE-3926: Improve documentation of RAMDirectory, because this
class is not intended to work with huge indexes. Everything beyond
several hundred megabytes will waste resources (GC cycles), because
it uses an internal buffer size of 1024 bytes, producing millions of
byte[1024] arrays. This class is optimized for small memory-resident
indexes. It also has bad concurrency on multithreaded environments.
It is recommended to materialize large indexes on disk and use
MMapDirectory, which is a high-performance directory implementation
working directly on the file system cache of the operating system,
so copying data to Java heap space is not useful. (Uwe Schindler,
Mike McCandless, Robert Muir)
Build
* LUCENE-3857: exceptions from other threads in beforeclass/etc do not fail
the test (Dawid Weiss)
* LUCENE-3847: LuceneTestCase will now check for modifications of System
properties before and after each test (and suite). If changes are detected,
the test will fail. A rule can be used to reset system properties to
before-scope state (and this has been used to make Solr tests pass).
(Dawid Weiss, Uwe Schindler).
* LUCENE-3228: Stop downloading external javadoc package-list files:
- Added package-list files for Oracle Java javadocs and JUnit javadocs to
Lucene/Solr subversion.
- The Oracle Java javadocs package-list file is excluded from Lucene and
Solr source release packages.
- Regardless of network connectivity, javadocs built from a subversion
checkout contain links to Oracle & JUnit javadocs.
- Building javadocs from a source release package will download the Oracle
Java package-list file if it isn't already present.
- When the Oracle Java package-list file is not present and download fails,
the javadocs targets will not fail the build, though an error will appear
in the build log. In this case, the built javadocs will not contain links
to Oracle Java javadocs.
- Links from Solr javadocs to Lucene's javadocs are enabled. When building
a X.Y.Z-SNAPSHOT version, the links are to the most recently built nightly
Jenkins javadocs. When building a release version, links are to the
Lucene release javadocs for the same version.
(Steve Rowe, hossman)
* LUCENE-3753: Restructure the Lucene build system:
- Created a new Lucene-internal module named "core" by moving the java/
and test/ directories from lucene/src/ to lucene/core/src/.
- Eliminated lucene/src/ by moving all its directories up one level.
- Each internal module (core/, test-framework/, and tools/) now has its own
build.xml, from which it is possible to run module-specific targets.
lucene/build.xml delegates all build tasks (via
<ant dir="internal-module-dir"> calls) to these modules' build.xml files.
(Steve Rowe)
* LUCENE-3774: Optimized and streamlined license and notice file validation
by refactoring the build task into an ANT task and modifying build scripts
to perform top-level checks. (Dawid Weiss, Steve Rowe, Robert Muir)
* LUCENE-3762: Upgrade JUnit to 4.10, refactor state-machine of detecting
setUp/tearDown call chaining in LuceneTestCase. (Dawid Weiss, Robert Muir)
* LUCENE-3944: Make the 'generate-maven-artifacts' target use filtered POMs
placed under lucene/build/poms/, rather than in each module's base
directory. The 'clean' target now removes them.
(Steve Rowe, Robert Muir)
* LUCENE-3930: Changed build system to use Apache Ivy for retrieval of 3rd
  party JAR files. Please review BUILD.txt for instructions.
(Robert Muir, Chris Male, Uwe Schindler, Steven Rowe, Hossman)
======================= Lucene 3.5.0 =======================
Changes in backwards compatibility policy
* LUCENE-3390: The first approach in Lucene 3.4.0 for missing values
support for sorting had a design problem that made the missing value
be populated directly into the FieldCache arrays during sorting,
leading to concurrency issues. To fix this behaviour, the method
signatures had to be changed:
- FieldCache.getUnValuedDocs() was renamed to FieldCache.getDocsWithField()
returning a Bits interface (backported from Lucene 4.0).
- FieldComparator.setMissingValue() was removed and added to
constructor
As this is expert API, most code will not be affected.
(Uwe Schindler, Doron Cohen, Mike McCandless)
* LUCENE-3541: Remove IndexInput's protected copyBuf. If you want to
keep a buffer in your IndexInput, do this yourself in your implementation,
and be sure to do the right thing on clone()! (Robert Muir)
* LUCENE-2822: TimeLimitingCollector now expects a counter clock instead of
relying on a private daemon thread. The global time limiting clock thread
has been exposed and is now lazily loaded and fully optional.
TimeLimitingCollector now supports setting clock baseline manually to include
  prelude of a search. Previous versions set the baseline at construction time;
  now the baseline is set once the first IndexReader is passed to the collector,
  unless it was set before. (Simon Willnauer)
Changes in runtime behavior
* LUCENE-3520: IndexReader.openIfChanged, when passed a near-real-time
reader, will now return null if there are no changes. The API has
always reserved the right to do this; it's just that in the past for
near-real-time readers it never did. (Mike McCandless)
Bug fixes
* LUCENE-3412: SloppyPhraseScorer was returning non-deterministic results
for queries with many repeats (Doron Cohen)
* LUCENE-3421: PayloadTermQuery's explain was wrong when includeSpanScore=false.
(Edward Drapkin via Robert Muir)
* LUCENE-3432: IndexWriter.expungeDeletes with TieredMergePolicy
should ignore the maxMergedSegmentMB setting (v.sevel via Mike
McCandless)
* LUCENE-3442: TermQuery.TermWeight.scorer() returns null for non-atomic
  IndexReaders (optimization bug, introduced by LUCENE-2829), preventing
  QueryWrapperFilter and similar classes from getting a top-level DocIdSet.
  (Dan C., Uwe Schindler)
* LUCENE-3390: Corrected handling of missing values when two parallel searches
using different missing values for sorting: the missing value was populated
directly into the FieldCache arrays during sorting, leading to concurrency
issues. (Uwe Schindler, Doron Cohen, Mike McCandless)
* LUCENE-3439: Closing an NRT reader after the writer was closed was
incorrectly invoking the DeletionPolicy and (then possibly deleting
files) on the closed IndexWriter (Robert Muir, Mike McCandless)
* LUCENE-3215: SloppyPhraseScorer sometimes computed Infinite freq
(Robert Muir, Doron Cohen)
* LUCENE-3503: DisjunctionSumScorer would give slightly different scores
for a document depending if you used nextDoc() versus advance().
(Mike McCandless, Robert Muir)
* LUCENE-3529: Properly support indexing an empty field with empty term text.
Previously, if you had assertions enabled you would receive an error during
flush, if you didn't, you would get an invalid index.
(Mike McCandless, Robert Muir)
* LUCENE-2633: PackedInts Packed32 and Packed64 did not support internal
structures larger than 256MB (Toke Eskildsen via Mike McCandless)
* LUCENE-3540: LUCENE-3255 dropped support for pre-1.9 indexes, but the
error message in IndexFormatTooOldException was incorrect. (Uwe Schindler,
Mike McCandless)
* LUCENE-3541: IndexInput's default copyBytes() implementation was not safe
across multiple threads, because all clones shared the same buffer.
(Robert Muir)
* LUCENE-3548: Fix CharsRef#append to extend length of the existing char[]
and preserve existing chars. (Simon Willnauer)
* LUCENE-3582: Normalize NaN values in NumericUtils.floatToSortableInt() /
NumericUtils.doubleToSortableLong(), so this is consistent with stored
fields. Also fix NumericRangeQuery to not falsely hit NaNs on half-open
ranges (one bound is null). Because of normalization, NumericRangeQuery
can now be used to hit NaN values by creating a query with
upper == lower == NaN (inclusive). (Dawid Weiss, Uwe Schindler)
API Changes
* LUCENE-3454: Rename IndexWriter.optimize to forceMerge to discourage
use of this method since it is horribly costly and rarely justified
anymore. MergePolicy.findMergesForOptimize was renamed to
findForcedMerges. IndexReader.isOptimized was
deprecated. IndexCommit.isOptimized was replaced with
getSegmentCount. (Robert Muir, Mike McCandless)
* LUCENE-3205: Deprecated MultiTermQuery.getTotalNumerOfTerms() [and
related methods], as the numbers returned are not useful
for multi-segment indexes. They were only needed for tests of
NumericRangeQuery. (Mike McCandless, Uwe Schindler)
* LUCENE-3574: Deprecate outdated constants in org.apache.lucene.util.Constants
and add new ones for Java 6 and Java 7. (Uwe Schindler)
* LUCENE-3571: Deprecate IndexSearcher(Directory). Use the constructors
that take IndexReader instead. (Robert Muir)
* LUCENE-3577: Rename IndexWriter.expungeDeletes to forceMergeDeletes,
and revamped the javadocs, to discourage
use of this method since it is horribly costly and rarely
justified. MergePolicy.findMergesToExpungeDeletes was renamed to
findForcedDeletesMerges. (Robert Muir, Mike McCandless)
* LUCENE-3464: IndexReader.reopen has been renamed to
IndexReader.openIfChanged (a static method), and now returns null
(instead of the old reader) if there are no changes in the index, to
prevent the common pitfall of accidentally closing the old reader.
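  A minimal sketch of the resulting reopen pattern, assuming "reader" is the currently
  open IndexReader:

      IndexReader newReader = IndexReader.openIfChanged(reader);
      if (newReader != null) {     // null means nothing changed; keep the old reader
        reader.close();
        reader = newReader;
      }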
New Features
* LUCENE-3448: Added FixedBitSet.and(other/DISI), andNot(other/DISI).
(Uwe Schindler)
* LUCENE-2215: Added IndexSearcher.searchAfter which returns results after a
specified ScoreDoc (e.g. last document on the previous page) to support deep
paging use cases. (Aaron McCurry, Grant Ingersoll, Robert Muir)
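  A hedged sketch of deep paging with searchAfter; "searcher" and "query" are assumed to
  exist in the caller's context, and the page size/count are arbitrary:

      ScoreDoc after = null;                          // null: start from the top
      for (int page = 0; page < 3; page++) {
        TopDocs hits = (after == null)
            ? searcher.search(query, 10)
            : searcher.searchAfter(after, query, 10);
        if (hits.scoreDocs.length == 0) break;
        after = hits.scoreDocs[hits.scoreDocs.length - 1];   // last hit of this page
      }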
* LUCENE-1990: Adds internal packed ints implementation, to be used
for more efficient storage of int arrays when the values are
bounded, for example for storing the terms dict index (Toke
Eskildsen via Mike McCandless)
* LUCENE-3558: Moved SearcherManager, NRTManager & SearcherLifetimeManager into
core. All classes are contained in o.a.l.search. (Simon Willnauer)
Optimizations
* LUCENE-3426: Add NGramPhraseQuery which extends PhraseQuery and tries to
reduce the number of terms of the query when rewrite(), in order to improve
performance. (Robert Muir, Koji Sekiguchi)
* LUCENE-3494: Optimize FilteredQuery to remove a multiply in score()
(Uwe Schindler, Robert Muir)
* LUCENE-3534: Remove filter logic from IndexSearcher and delegate to
FilteredQuery's Scorer. This is a partial backport of a cleanup in
FilteredQuery/IndexSearcher added by LUCENE-1536 to Lucene 4.0.
(Uwe Schindler)
* LUCENE-2205: Very substantial (3-5X) RAM reduction required to hold
the terms index on opening an IndexReader (Aaron McCurry via Mike McCandless)
* LUCENE-3443: FieldCache can now set docsWithField, and create an
array, in a single pass. This results in faster init time for apps
that need both (such as sorting by a field with a missing value).
(Mike McCandless)
Test Cases
* LUCENE-3420: Disable the finalness checks in TokenStream and Analyzer
for implementing subclasses in different packages, where assertions are not
enabled. (Uwe Schindler)
* LUCENE-3506: tests relying on assertions being enabled were no-op because
they ignored AssertionError. With this fix now entire test framework
(every test) fails if assertions are disabled, unless
-Dtests.asserts.gracious=true is specified. (Doron Cohen)
Build
* SOLR-2849: Fix dependencies in Maven POMs. (David Smiley via Steve Rowe)
* LUCENE-3561: Fix maven xxx-src.jar files that were missing resources.
(Uwe Schindler)
======================= Lucene 3.4.0 =======================
Bug fixes
* LUCENE-3251: Directory#copy failed to close target output if opening the
source stream failed. (Simon Willnauer)
* LUCENE-3255: If segments_N file is all zeros (due to file
corruption), don't read that to mean the index is empty. (Gregory
Tarr, Mark Harwood, Simon Willnauer, Mike McCandless)
* LUCENE-3254: Fixed minor bug in how deletes were written to disk,
  causing the file to sometimes be larger than it needed to be. (Mike
  McCandless)
* LUCENE-3224: Fixed a bug where CheckIndex would incorrectly report a
  corrupt index if a term with docfreq >= 16 was indexed more than once
  at the same position. (Robert Muir)
* LUCENE-3339: Fixed deadlock case when multiple threads use the new
block-add (IndexWriter.add/updateDocuments) methods. (Robert Muir,
Mike McCandless)
* LUCENE-3340: Fixed case where IndexWriter was not flushing at
exactly maxBufferedDeleteTerms (Mike McCandless)
* LUCENE-3358, LUCENE-3361: StandardTokenizer and UAX29URLEmailTokenizer
  wrongly discarded combining marks attached to Han or Hiragana characters.
  This is fixed if you supply Version >= 3.4; if you supply a previous
  Lucene version, you get the old buggy behavior for backwards compatibility.
  (Trejkaz, Robert Muir)
* LUCENE-3368: IndexWriter commits segments without applying their buffered
deletes when flushing concurrently. (Simon Willnauer, Mike McCandless)
* LUCENE-3365: Create or Append mode determined before obtaining the write lock
  could cause IndexWriter to override an existing index.
  (Geoff Cooney via Simon Willnauer)
* LUCENE-3380: Fixed a bug where FileSwitchDirectory's listAll() would wrongly
throw NoSuchDirectoryException when all files written so far have been
written to one directory, but the other still has not yet been created on the
filesystem. (Robert Muir)
* LUCENE-3409: IndexWriter.deleteAll was failing to close pooled NRT
SegmentReaders, leading to unused files accumulating in the
Directory. (tal steier via Mike McCandless)
* LUCENE-3418: Lucene was failing to fsync index files on commit,
meaning an operating system or hardware crash, or power loss, could
easily corrupt the index. (Mark Miller, Robert Muir, Mike
McCandless)
New Features
* LUCENE-3290: Added FieldInvertState.numUniqueTerms
(Mike McCandless, Robert Muir)
* LUCENE-3280: Add FixedBitSet, like OpenBitSet but is not elastic
(grow on demand if you set/get/clear too-large indices). (Mike
McCandless)
* LUCENE-2048: Added the ability to omit positions but still index
term frequencies, you can now control what is indexed into
the postings via AbstractField.setIndexOptions:
DOCS_ONLY: only documents are indexed: term frequencies and positions are omitted
DOCS_AND_FREQS: only documents and term frequencies are indexed: positions are omitted
DOCS_AND_FREQS_AND_POSITIONS: full postings: documents, frequencies, and positions
AbstractField.setOmitTermFrequenciesAndPositions is deprecated,
you should use DOCS_ONLY instead. (Robert Muir)
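  A short sketch of selecting an index option on a field; the field name, contents, and
  the enclosing Document "doc" are assumptions, and the enum location
  (FieldInfo.IndexOptions) is recalled from memory:

      Field body = new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED);
      // index only docIDs for this field: no term frequencies, no positions
      body.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
      doc.add(body);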
* LUCENE-3097: Added a new grouping collector that can be used to retrieve all of the most
  relevant documents per group. This can be useful in situations when one wants to compute
  grouping-based facets / statistics on the complete query result. (Martijn van Groningen)
* LUCENE-3334: If Java7 is detected, IOUtils.closeSafely() will log
suppressed exceptions in the original exception, so stack trace
will contain them. (Uwe Schindler)
Optimizations
* LUCENE-3201, LUCENE-3218: CompoundFileSystem code has been consolidated
into a Directory implementation. Reading is optimized for MMapDirectory,
NIOFSDirectory and SimpleFSDirectory to only map requested parts of the
CFS into an IndexInput. Writing to a CFS now tries to append to the CF
directly if possible and merges separately written files on the fly instead
of during close. (Simon Willnauer, Robert Muir)
* LUCENE-3289: When building an FST you can now tune how aggressively
the FST should try to share common suffixes. Typically you can
greatly reduce RAM required during building, and CPU consumed, at
the cost of a somewhat larger FST. (Mike McCandless)
Test Cases
* LUCENE-3327: Fix AIOOBE when TestFSTs is run with
-Dtests.verbose=true (James Dyer via Mike McCandless)
Build
* LUCENE-3406: Add ant target 'package-local-src-tgz' to Lucene and Solr
to package sources from the local working copy.
(Seung-Yeoul Yang via Steve Rowe)
======================= Lucene 3.3.0 =======================
Changes in backwards compatibility policy
* LUCENE-3140: IndexOutput.copyBytes now takes a DataInput (superclass
of IndexInput) as its first argument. (Robert Muir, Dawid Weiss,
Mike McCandless)
* LUCENE-3191: FieldComparator.value now returns an Object not
Comparable; FieldDoc.fields also changed from Comparable[] to
Object[] (Uwe Schindler, Mike McCandless)
* LUCENE-3208: Made deprecated methods Query.weight(Searcher) and
Searcher.createWeight() final to prevent override. If you have
overridden one of these methods, cut over to the non-deprecated
implementation. (Uwe Schindler, Robert Muir, Yonik Seeley)
* LUCENE-3238: Made MultiTermQuery.rewrite() final, to prevent
problems (such as not properly setting rewrite methods, or
not working correctly with things like SpanMultiTermQueryWrapper).
To rewrite to a simpler form, instead return a simpler enum
from getEnum(IndexReader). For example, to rewrite to a single term,
return a SingleTermEnum. (ludovic Boutros, Uwe Schindler, Robert Muir)
Changes in runtime behavior
* LUCENE-2834: the hash used to compute the lock file name when the
lock file is not stored in the index has changed. This means you
will see a different lucene-XXX-write.lock in your lock directory.
(Robert Muir, Uwe Schindler, Mike McCandless)
* LUCENE-3146: IndexReader.setNorm throws IllegalStateException if the field
does not store norms. (Shai Erera, Mike McCandless)
* LUCENE-3198: On Linux, if the JRE is 64 bit and supports unmapping,
FSDirectory.open now defaults to MMapDirectory instead of
NIOFSDirectory since MMapDirectory gives better performance. (Mike
McCandless)
* LUCENE-3200: MMapDirectory now uses chunk sizes that are powers of 2.
When setting the chunk size, it is rounded down to the next possible
value. The new default value for 64 bit platforms is 2^30 (1 GiB),
for 32 bit platforms it stays unchanged at 2^28 (256 MiB).
Internally, MMapDirectory now only uses one dedicated final IndexInput
implementation supporting multiple chunks, which makes Hotspot's life
easier. (Uwe Schindler, Robert Muir, Mike McCandless)
Bug fixes
* LUCENE-3147,LUCENE-3152: Fixed open file handles leaks in many places in the
code. Now MockDirectoryWrapper (in test-framework) tracks all open files,
including locks, and fails if the test fails to release all of them.
(Mike McCandless, Robert Muir, Shai Erera, Simon Willnauer)
* LUCENE-3102: CachingCollector.replay was failing to call setScorer
per-segment (Martijn van Groningen via Mike McCandless)
* LUCENE-3183: Fix rare corner case where seeking to empty term
(field="", term="") with terms index interval 1 could hit
ArrayIndexOutOfBoundsException (selckin, Robert Muir, Mike
McCandless)
* LUCENE-3208: IndexSearcher had its own private similarity field
  and corresponding getter/setter overriding Searcher's implementation. If you
  set a different Similarity instance on IndexSearcher, methods implemented
  in the superclass Searcher were not using it, leading to strange bugs.
  (Uwe Schindler, Robert Muir)
* LUCENE-3197: Fix core merge policies to not over-merge during
background optimize when documents are still being deleted
concurrently with the optimize (Mike McCandless)
* LUCENE-3222: The RAM accounting for buffered delete terms was
failing to measure the space required to hold the term's field and
text character data. (Mike McCandless)
* LUCENE-3238: Fixed bug where using WildcardQuery("prefix*") inside
of a SpanMultiTermQueryWrapper rewrote incorrectly and returned
an error instead. (ludovic Boutros, Uwe Schindler, Robert Muir)
API Changes
* LUCENE-3208: Renamed protected IndexSearcher.createWeight() to expert
public method IndexSearcher.createNormalizedWeight() as this better describes
what this method does. The old method is still there for backwards
compatibility. Query.weight() was deprecated and simply delegates to
IndexSearcher. Both deprecated methods will be removed in Lucene 4.0.
(Uwe Schindler, Robert Muir, Yonik Seeley)
* LUCENE-3197: MergePolicy.findMergesForOptimize now takes
Map<SegmentInfo,Boolean> instead of Set<SegmentInfo> as the second
argument, so the merge policy knows which segments were originally
present vs produced by an optimizing merge (Mike McCandless)
Optimizations
* LUCENE-1736: DateTools.java general improvements.
(David Smiley via Steve Rowe)
New Features
* LUCENE-3140: Added experimental FST implementation to Lucene.
(Robert Muir, Dawid Weiss, Mike McCandless)
* LUCENE-3193: A new TwoPhaseCommitTool allows running a 2-phase commit
algorithm over objects that implement the new TwoPhaseCommit interface (such
as IndexWriter). (Shai Erera)
* LUCENE-3191: Added TopDocs.merge, to facilitate merging results from
different shards (Uwe Schindler, Mike McCandless)
* LUCENE-3179: Added OpenBitSet.prevSetBit (Paul Elschot via Mike McCandless)
* LUCENE-3210: Made TieredMergePolicy more aggressive in reclaiming
segments with deletions; added new methods
set/getReclaimDeletesWeight to control this. (Mike McCandless)
Build
* LUCENE-1344: Create OSGi bundle using dev-tools/maven.
(Nicolas Lalevée, Luca Stancapiano via ryan)
* LUCENE-3204: The maven-ant-tasks jar is now included in the source tree;
users of the generate-maven-artifacts target no longer have to manually
place this jar in the Ant classpath. NOTE: when Ant looks for the
maven-ant-tasks jar, it looks first in its pre-existing classpath, so
any copies it finds will be used instead of the copy included in the
  Lucene/Solr source tree. For this reason, it is recommended to remove
  any copies of the maven-ant-tasks jar in the Ant classpath, e.g. under
  ~/.ant/lib/ or under the Ant installation's lib/ directory. (Steve Rowe)
======================= Lucene 3.2.0 =======================
Changes in backwards compatibility policy
* LUCENE-2953: PriorityQueue's internal heap was made private, as subclassing
with generics can lead to ClassCastException. For advanced use (e.g. in Solr)
a method getHeapArray() was added to retrieve the internal heap array as a
non-generic Object[]. (Uwe Schindler, Yonik Seeley)
* LUCENE-1076: IndexWriter.setInfoStream now throws IOException
(Mike McCandless, Shai Erera)
* LUCENE-3084: MergePolicy.OneMerge.segments was changed from
  SegmentInfos to a List<SegmentInfo>. SegmentInfos itself was changed
  to no longer extend Vector<SegmentInfo> (to update code that is using
  the Vector API, use the new asList() and asSet() methods returning unmodifiable
  collections; modifying SegmentInfos is now only possible through
  the explicitly declared methods). IndexWriter.segString() now takes
Iterable<SegmentInfo> instead of List<SegmentInfo>. A simple recompile
should fix this. MergePolicy and SegmentInfos are internal/experimental
APIs not covered by the strict backwards compatibility policy.
(Uwe Schindler, Mike McCandless)
Changes in runtime behavior
* LUCENE-3065: When a NumericField is retrieved from a Document loaded
from IndexReader (or IndexSearcher), it will now come back as
NumericField not as a Field with a string-ified version of the
numeric value you had indexed. Note that this only applies for
newly-indexed Documents; older indices will still return Field
with the string-ified numeric value. If you call Document.get(),
the value comes still back as String, but Document.getFieldable()
returns NumericField instances. (Uwe Schindler, Ryan McKinley,
Mike McCandless)
* LUCENE-1076: Changed the default merge policy from
LogByteSizeMergePolicy to TieredMergePolicy, as of Version.LUCENE_32
(passed to IndexWriterConfig), which is able to merge non-contiguous
segments. This means docIDs no longer necessarily stay "in order"
during indexing. If this is a problem then you can use either of
the LogMergePolicy impls. (Mike McCandless)
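  A sketch of switching back to a LogMergePolicy via IndexWriterConfig when in-order
  docIDs are required; the analyzer and the existing Directory "dir" are assumptions
  of this example:

      IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_32,
          new StandardAnalyzer(Version.LUCENE_32));
      conf.setMergePolicy(new LogByteSizeMergePolicy());   // keeps docIDs "in order"
      IndexWriter writer = new IndexWriter(dir, conf);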
New features
* LUCENE-3082: Added index upgrade tool oal.index.IndexUpgrader
  that allows upgrading all segments to the latest supported index
  format without fully optimizing. (Uwe Schindler, Mike McCandless)
* LUCENE-1076: Added TieredMergePolicy which is able to merge non-contiguous
segments, which means docIDs no longer necessarily stay "in order".
(Mike McCandless, Shai Erera)
* LUCENE-3071: Added ReversePathHierarchyTokenizer and a skip parameter to
  PathHierarchyTokenizer. (Olivier Favre via ryan)
* LUCENE-1421, LUCENE-3102: Added CachingCollector, which allows you to cache
  document IDs and scores encountered during the search, and "replay" them to
  another Collector. (Mike McCandless, Shai Erera)
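  A hedged sketch of the cache-and-replay flow; the two target collectors, "searcher",
  "query", and the RAM limit are assumptions of this example:

      CachingCollector cache = CachingCollector.create(firstPassCollector,
          true /* cache scores */, 64.0 /* max RAM in MB */);
      searcher.search(query, cache);
      if (cache.isCached()) {              // false if the RAM limit was exceeded
        cache.replay(secondPassCollector); // feed the same docIDs/scores again
      }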
* LUCENE-3112: Added experimental IndexWriter.add/updateDocuments,
enabling a block of documents to be indexed, atomically, with
guaranteed sequential docIDs. (Mike McCandless)
API Changes
* LUCENE-3061: IndexWriter's getNextMerge() and merge(OneMerge) are now public
(though @lucene.experimental), allowing for custom MergeScheduler
implementations. (Shai Erera)
* LUCENE-3065: Document.getField() was deprecated, as it throws
ClassCastException when loading lazy fields or NumericFields.
(Uwe Schindler, Ryan McKinley, Mike McCandless)
* LUCENE-2027: Directory.touchFile is deprecated and will be removed
in 4.0. (Mike McCandless)
Optimizations
* LUCENE-2990: ArrayUtil/CollectionUtil.*Sort() methods now exit early
on empty or one-element lists/arrays. (Uwe Schindler)
* LUCENE-2897: Apply deleted terms while flushing a segment. We still
buffer deleted terms to later apply to past segments. (Mike McCandless)
* LUCENE-3126: IndexWriter.addIndexes copies incoming segments into CFS if they
aren't already and MergePolicy allows that. (Shai Erera)
Bug fixes
* LUCENE-2996: addIndexes(IndexReader) did not flush before adding the new
indexes, causing existing deletions to be applied on the incoming indexes as
well. (Shai Erera, Mike McCandless)
* LUCENE-3024: Index with more than 2.1B terms was hitting AIOOBE when
seeking TermEnum (eg used by Solr's faceting) (Tom Burton-West, Mike
McCandless)
* LUCENE-3042: When a filter or consumer added Attributes to a TokenStream
chain after it was already (partly) consumed [or clearAttributes(),
captureState(), cloneAttributes(),... was called by the Tokenizer],
the Tokenizer calling clearAttributes() or capturing state after addition
may not do this on the newly added Attribute. This bug affected only
very special use cases of the TokenStream-API, most users would not
have recognized it. (Uwe Schindler, Robert Muir)
* LUCENE-3054: PhraseQuery can in some cases stack overflow in
SorterTemplate.quickSort(). This fix also adds an optimization to
PhraseQuery as term with lower doc freq will also have less positions.
(Uwe Schindler, Robert Muir, Otis Gospodnetic)
* LUCENE-3068: Sloppy phrase query failed to match valid documents when multiple
  query terms had the same position in the query. (Doron Cohen)
* LUCENE-3012: Lucene writes the header now for separate norm files (*.sNNN)
(Robert Muir)
Build
* LUCENE-3006: Building javadocs will fail on warnings by default.
Override with -Dfailonjavadocwarning=false (sarowe, gsingers)
* LUCENE-3128: "ant eclipse" creates a .project file for easier Eclipse
integration (unless one already exists). (Daniel Serodio via Shai Erera)
Test Cases
* LUCENE-3002: added 'tests.iter.min' to control 'tests.iter' by allowing to
stop iterating if at least 'tests.iter.min' ran and a failure occured.
(Shai Erera, Chris Hostetter)
======================= Lucene 3.1.0 =======================
Changes in backwards compatibility policy
* LUCENE-2719: Changed API of internal utility class
org.apache.lucene.util.SorterTemplate to support faster quickSort using
pivot values and also merge sort and insertion sort. If you have used
this class, you have to implement two more methods for handling pivots.
(Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-1923: Renamed SegmentInfo & SegmentInfos segString method to
toString. These are advanced APIs and subject to change suddenly.
(Tim Smith via Mike McCandless)
* LUCENE-2190: Removed deprecated customScore() and customExplain()
methods from experimental CustomScoreQuery. (Uwe Schindler)
* LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default.
This means that terms with a position increment gap of zero do not
affect the norms calculation by default. (Robert Muir)
* LUCENE-2320: MergePolicy.writer is now of type SetOnce, which allows setting
the IndexWriter for a MergePolicy exactly once. You can change references to
  'writer' from writer.doXYZ() to writer.get().doXYZ()
  (it is also advisable to add an 'assert writer != null;' before you
  access the wrapped IndexWriter.)
In addition, MergePolicy only exposes a default constructor, and the one that
took IndexWriter as argument has been removed from all MergePolicy extensions.
(Shai Erera via Mike McCandless)
* LUCENE-2328: SimpleFSDirectory.SimpleFSIndexInput is moved to
FSDirectory.FSIndexInput. Anyone extending this class will have to
fix their code on upgrading. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2302: The new interface for term attributes, CharTermAttribute,
now implements CharSequence. This requires the toString() methods of
CharTermAttribute, deprecated TermAttribute, and Token to return only
the term text and no other attribute contents. LUCENE-2374 implements
an attribute reflection API to no longer rely on toString() for attribute
inspection. (Uwe Schindler, Robert Muir)
* LUCENE-2372, LUCENE-2389: StandardAnalyzer, KeywordAnalyzer,
PerFieldAnalyzerWrapper, WhitespaceTokenizer are now final. Also removed
the now obsolete and deprecated Analyzer.setOverridesTokenStreamMethod().
Analyzer and TokenStream base classes now have an assertion in their ctor,
that check subclasses to be final or at least have final implementations
of incrementToken(), tokenStream(), and reusableTokenStream().
(Uwe Schindler, Robert Muir)
* LUCENE-2316: Directory.fileLength contract was clarified - it returns the
actual file's length if the file exists, and throws FileNotFoundException
otherwise. Returning length=0 for a non-existent file is no longer allowed. If
you relied on that, make sure to catch the exception. (Shai Erera)
* LUCENE-2386: IndexWriter no longer performs an empty commit upon new index
creation. Previously, if you passed an empty Directory and set OpenMode to
CREATE*, IndexWriter would make a first empty commit. If you need that
behavior you can call writer.commit()/close() immediately after you create it.
(Shai Erera, Mike McCandless)
* LUCENE-2733: Removed public constructors of utility classes with only static
methods to prevent instantiation. (Uwe Schindler)
* LUCENE-2602: The default (LogByteSizeMergePolicy) merge policy now
takes deletions into account by default. You can disable this by
calling setCalibrateSizeByDeletes(false) on the merge policy. (Mike
McCandless)
* LUCENE-2529, LUCENE-2668: The position increment gap and offset gap of empty
  values in multi-valued fields have been changed for some cases in the index.
  If you index empty fields and use positions/offsets information on those
  fields, reindexing is recommended. (David Smiley, Koji Sekiguchi)
* LUCENE-2804: Directory.setLockFactory now declares throwing an IOException.
(Shai Erera, Robert Muir)
* LUCENE-2837: Added deprecations noting that in 4.0, Searcher and
Searchable are collapsed into IndexSearcher; contrib/remote and
MultiSearcher have been removed. (Mike McCandless)
* LUCENE-2854: Deprecated SimilarityDelegator and
Similarity.lengthNorm; the latter is now final, forcing any custom
Similarity impls to cutover to the more general computeNorm (Robert
Muir, Mike McCandless)
* LUCENE-2869: Deprecated Query.getSimilarity: instead of using
"runtime" subclassing/delegation, subclass the Weight instead.
(Robert Muir)
* LUCENE-2674: A new idfExplain method was added to Similarity, that
accepts an incoming docFreq. If you subclass Similarity, make sure
you also override this method on upgrade. (Robert Muir, Mike
McCandless)
Changes in runtime behavior
* LUCENE-1923: Made IndexReader.toString() produce something
meaningful (Tim Smith via Mike McCandless)
* LUCENE-2179: CharArraySet.clear() is now functional.
(Robert Muir, Uwe Schindler)
* LUCENE-2455: IndexWriter.addIndexes no longer optimizes the target index
before it adds the new ones. Also, the existing segments are not merged and so
the index will not end up with a single segment (unless it was empty before).
In addition, addIndexesNoOptimize was renamed to addIndexes and no longer
invokes a merge on the incoming and target segments, but instead copies the
segments to the target index. You can call maybeMerge or optimize after this
method completes, if you need to.
In addition, Directory.copyTo* were removed in favor of copy which takes the
target Directory, source and target files as arguments, and copies the source
file to the target Directory under the target file name. (Shai Erera)
* LUCENE-2663: IndexWriter no longer forcefully clears any existing
locks when create=true. This was a holdover from when
SimpleFSLockFactory was the default locking implementation, and,
even then it was dangerous since it could mask bugs in IndexWriter's
usage, allowing applications to accidentally open two writers on the
same directory. (Mike McCandless)
* LUCENE-2701: maxMergeMBForOptimize and maxMergeDocs constraints set on
LogMergePolicy now affect optimize() as well (as opposed to only regular
merges). This means that you can run optimize() and too large segments won't
be merged. (Shai Erera)
* LUCENE-2753: IndexReader and DirectoryReader .listCommits() now return a List,
guaranteeing the commits are sorted from oldest to latest. (Shai Erera)
* LUCENE-2785: TopScoreDocCollector, TopFieldCollector and
the IndexSearcher search methods that take an int nDocs will now
throw IllegalArgumentException if nDocs is 0. Instead, you should
use the newly added TotalHitCountCollector. (Mike McCandless)
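  A minimal sketch of counting hits without collecting any of them; "searcher" and
  "query" are assumed to exist in the caller's context:

      TotalHitCountCollector counter = new TotalHitCountCollector();
      searcher.search(query, counter);
      int totalHits = counter.getTotalHits();   // total matches, no docs collected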
* LUCENE-2790: LogMergePolicy.useCompoundFile's logic now factors in noCFSRatio
to determine whether the passed in segment should be compound.
(Shai Erera, Earwin Burrfoot)
* LUCENE-2805: IndexWriter now increments the index version on every change to
the index instead of for every commit. Committing or closing the IndexWriter
without any changes to the index will not cause any index version increment.
(Simon Willnauer, Mike McCandless)
* LUCENE-2650, LUCENE-2825: The behavior of FSDirectory.open has changed. On 64-bit
Windows and Solaris systems that support unmapping, FSDirectory.open returns
MMapDirectory. Additionally the behavior of MMapDirectory has been
changed to enable unmapping by default if supported by the JRE.
(Mike McCandless, Uwe Schindler, Robert Muir)
* LUCENE-2829: Improve the performance of "primary key" lookup use
case (running a TermQuery that matches one document) on a
multi-segment index. (Robert Muir, Mike McCandless)
* LUCENE-2010: Segments with 100% deleted documents are now removed on
IndexReader or IndexWriter commit. (Uwe Schindler, Mike McCandless)
* LUCENE-2960: Allow some changes to IndexWriterConfig to take effect
"live" (after an IW is instantiated), via
IndexWriter.getConfig().setXXX(...) (Shay Banon, Mike McCandless)
API Changes
* LUCENE-2076: Rename FSDirectory.getFile -> getDirectory. (George
Aroush via Mike McCandless)
* LUCENE-1260: Change norm encode (float->byte) and decode
(byte->float) to be instance methods not static methods. This way a
custom Similarity can alter how norms are encoded, though they must
still be encoded as a single byte (Johan Kindgren via Mike
McCandless)
* LUCENE-2103: NoLockFactory should have a private constructor;
until Lucene 4.0 the default one will be deprecated.
(Shai Erera via Uwe Schindler)
* LUCENE-2177: Deprecate the Field ctors that take byte[] and Store.
Since the removal of compressed fields, Store can only be YES, so
it's not necessary to specify. (Erik Hatcher via Mike McCandless)
* LUCENE-2200: Several final classes had non-overriding protected
members. These were converted to private and unused protected
constructors removed. (Steven Rowe via Robert Muir)
* LUCENE-2240: SimpleAnalyzer and WhitespaceAnalyzer now have
Version ctors. (Simon Willnauer via Uwe Schindler)
* LUCENE-2259: Add IndexWriter.deleteUnusedFiles, to attempt removing
unused files. This is only useful on Windows, which prevents
deletion of open files. IndexWriter will eventually remove these
files itself; this method just lets you do so when you know the
files are no longer open by IndexReaders. (luocanrao via Mike
McCandless)
* LUCENE-2282: IndexFileNames is exposed as a public class allowing for easier
use by external code. In addition it offers a matchExtension method which
callers can use to query whether a certain file matches a certain extension.
(Shai Erera via Mike McCandless)
* LUCENE-124: Add a TopTermsBoostOnlyBooleanQueryRewrite to MultiTermQuery.
This rewrite method is similar to TopTermsScoringBooleanQueryRewrite, but
only scores terms by their boost values. For example, this can be used
with FuzzyQuery to ensure that exact matches are always scored higher,
because only the boost will be used in scoring. (Robert Muir)
* LUCENE-2015: Add a static method foldToASCII to ASCIIFoldingFilter to
expose its folding logic. (Cédrik Lime via Robert Muir)
* LUCENE-2294: IndexWriter constructors have been deprecated in favor of a
single ctor which accepts IndexWriterConfig and a Directory. You can set all
the parameters related to IndexWriter on IndexWriterConfig. The different
setter/getter methods were deprecated as well. One should call
writer.getConfig().getXYZ() to query for a parameter XYZ.
Additionally, the setter/getter related to MergePolicy were deprecated as
well. One should interact with the MergePolicy directly.
(Shai Erera via Mike McCandless)
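  A short sketch of the new configuration style; the index path, analyzer, and the
  tuned parameter are only examples:

      Directory dir = FSDirectory.open(new File("/path/to/index"));
      IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_31,
          new StandardAnalyzer(Version.LUCENE_31));
      conf.setRAMBufferSizeMB(64.0);           // formerly writer.setRAMBufferSizeMB(...)
      IndexWriter writer = new IndexWriter(dir, conf);
      double ramMB = writer.getConfig().getRAMBufferSizeMB();   // query a parameter back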
* LUCENE-2320: IndexWriter's MergePolicy configuration was moved to
IndexWriterConfig and the respective methods on IndexWriter were deprecated.
(Shai Erera via Mike McCandless)
* LUCENE-2328: Directory now keeps track itself of the files that are written
but not yet fsynced. The old Directory.sync(String file) method is deprecated
and replaced with Directory.sync(Collection<String> files). Take a look at
FSDirectory to see a sample of what such tracking might look like, should you
need it in your custom Directory implementations. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2302: Deprecated TermAttribute and replaced by a new
CharTermAttribute. The change is backwards compatible, so
mixed new/old TokenStreams all work on the same char[] buffer
independent of which interface they use. CharTermAttribute
has shorter method names and implements CharSequence and
Appendable. This allows usage like Java's StringBuilder in
addition to direct char[] access. Also terms can directly be
used in places where CharSequence is allowed (e.g. regular
expressions).
(Uwe Schindler, Robert Muir)
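A minimal consumption sketch using the new attribute (the analyzer, field name
and text variable are assumed to exist):
<code>
  TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    // CharSequence-style access instead of raw char[] handling:
    System.out.println(termAtt.toString() + " (length=" + termAtt.length() + ")");
  }
  ts.end();
  ts.close();
</code>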
* LUCENE-2402: IndexWriter.deleteUnusedFiles now deletes unreferenced commit
points too. If you use an IndexDeletionPolicy which holds onto index commits
(such as SnapshotDeletionPolicy), you can call this method to remove those
commit points when they are not needed anymore (instead of waiting for the
next commit). (Shai Erera)
* LUCENE-2481: SnapshotDeletionPolicy.snapshot() and release() were replaced
with equivalent ones that take a String (id) as argument. You can pass
whatever ID you want, as long as you use the same one when calling both.
(Shai Erera)
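A sketch of the id-based API, combined with LUCENE-2402 above (assumes a
SnapshotDeletionPolicy was set via IndexWriterConfig.setIndexDeletionPolicy;
the id string and backup step are illustrative):
<code>
  SnapshotDeletionPolicy sdp = (SnapshotDeletionPolicy)
      writer.getConfig().getIndexDeletionPolicy();
  IndexCommit commit = sdp.snapshot("nightly-backup");
  try {
    // copy the files named in commit.getFileNames() somewhere safe
  } finally {
    sdp.release("nightly-backup");
    writer.deleteUnusedFiles();  // drop the now-unreferenced commit (LUCENE-2402)
  }
</code>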
* LUCENE-2356: Add IndexWriterConfig.set/getReaderTermIndexDivisor, to
set what IndexWriter passes for termsIndexDivisor to the readers it
opens internally when applying deletions or creating a near-real-time
reader. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2167,LUCENE-2699,LUCENE-2763,LUCENE-2847: StandardTokenizer/Analyzer
in common/standard/ now implement the Word Break rules from the Unicode 6.0.0
Text Segmentation algorithm (UAX#29), covering the full range of Unicode code
points, including supplementary code points up to U+10FFFF.
ClassicTokenizer/Analyzer retains the old (pre-Lucene 3.1) StandardTokenizer/
Analyzer implementation and behavior; it only covers the Unicode Basic
Multilingual Plane (code points from U+0000 to U+FFFF).
UAX29URLEmailTokenizer tokenizes URLs and E-mail addresses according to the
relevant RFCs, in addition to implementing the UAX#29 Word Break rules.
(Steven Rowe, Robert Muir, Uwe Schindler)
* LUCENE-2778: RAMDirectory now exposes newRAMFile(), which subclasses can
override to return a different RAMFile implementation. (Shai Erera)
* LUCENE-2785: Added TotalHitCountCollector whose sole purpose is to
count the number of hits matching the query. (Mike McCandless)
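For example (searcher and query assumed to exist):
<code>
  TotalHitCountCollector collector = new TotalHitCountCollector();
  searcher.search(query, collector);
  int totalHits = collector.getTotalHits();  // no docs or scores are collected
</code>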
* LUCENE-2846: Deprecated IndexReader.setNorm(int, String, float). This method
is only syntactic sugar for setNorm(int, String, byte), but using the global
Similarity.getDefault().encodeNormValue(). Use the byte-based method instead
to ensure that the norm is encoded with your Similarity.
(Robert Muir, Mike McCandless)
* LUCENE-2374: Added Attribute reflection API: It's now possible to inspect the
contents of AttributeImpl and AttributeSource using a well-defined API.
This is e.g. used by Solr's AnalysisRequestHandlers to display all attributes
in a structured way.
There are also some backwards incompatible changes in toString() output,
as LUCENE-2302 introduced the CharSequence interface to CharTermAttribute
leading to changed toString() return values. The new API allows getting a
string representation in a well-defined way using the new method
reflectAsString(). For backwards compatibility reasons, when toString()
was implemented by implementation subclasses, the default implementation of
AttributeImpl.reflectWith() uses toString()'s output instead to report the
Attribute's properties. Otherwise, reflectWith() uses Java's reflection
(like toString() did before) to get the attribute properties.
In addition, equals() and hashCode() are no longer mandatory for
AttributeImpls, but can still be provided (if needed).
(Uwe Schindler)
* LUCENE-2691: Deprecate IndexWriter.getReader in favor of
IndexReader.open(IndexWriter) (Grant Ingersoll, Mike McCandless)
* LUCENE-2876: Deprecated Scorer.getSimilarity(). If your Scorer uses a Similarity,
it should keep it itself. Fixed Scorers to pass their parent Weight, so that
Scorer.visitSubScorers (LUCENE-2590) will work correctly.
(Robert Muir, Doron Cohen)
* LUCENE-2900: When opening a near-real-time (NRT) reader
(IndexReader.re/open(IndexWriter)) you can now specify whether
deletes should be applied. Applying deletes can be costly, and some
expert use cases can handle seeing deleted documents returned. The
deletes remain buffered so that the next time you open an NRT reader
and pass true, all deletes will be applied. (Mike McCandless)
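A sketch of this expert usage (writer assumed to exist):
<code>
  // fast open that may still return deleted documents:
  IndexReader noDeletes = IndexReader.open(writer, false);
  ...
  // later open with true: all buffered deletes are applied
  IndexReader withDeletes = IndexReader.open(writer, true);
</code>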
* LUCENE-1253: LengthFilter (and Solr's KeepWordTokenFilter) now
require up-front specification of enablePositionIncrements. Together with
StopFilter they have a common base class (FilteringTokenFilter) that handles
the position increments automatically. Implementors only need to override an
accept() method that filters tokens. (Uwe Schindler, Robert Muir)
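A minimal sketch of such a filter (the MinLengthFilter name and threshold are
illustrative, not part of Lucene):
<code>
  public final class MinLengthFilter extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final int minLength;

    public MinLengthFilter(boolean enablePositionIncrements, TokenStream in, int minLength) {
      super(enablePositionIncrements, in);
      this.minLength = minLength;
    }

    @Override
    protected boolean accept() {
      // position increments of skipped tokens are handled by the base class
      return termAtt.length() >= minLength;
    }
  }
</code>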
Bug fixes
* LUCENE-2249: ParallelMultiSearcher should shut down thread pool on
close. (Martin Traverso via Uwe Schindler)
* LUCENE-2273: FieldCacheImpl.getCacheEntries() used WeakHashMap
incorrectly and lead to ConcurrentModificationException.
(Uwe Schindler, Robert Muir)
* LUCENE-2328: Index files fsync tracking moved from
IndexWriter/IndexReader to Directory, and it no longer leaks memory.
(Earwin Burrfoot via Mike McCandless)
* LUCENE-2074: Reduce buffer size of lexer back to default on reset.
(Ruben Laguna, Shai Erera via Uwe Schindler)
* LUCENE-2496: Don't throw NPE if IndexWriter is opened with CREATE on
a prior (corrupt) index missing its segments_N file. (Mike
McCandless)
* LUCENE-2458: QueryParser no longer automatically forms phrase queries,
assuming whitespace tokenization. Previously all CJK queries, for example,
would be turned into phrase queries. The old behavior is preserved with
the matchVersion parameter for previous versions. Additionally, you can
explicitly enable the old behavior with setAutoGeneratePhraseQueries(true)
(Robert Muir)
* LUCENE-2537: FSDirectory.copy() implementation was unsafe and could result in
OOM if a large file was copied. (Shai Erera)
* LUCENE-2580: MultiPhraseQuery throws AIOOBE if number of positions
exceeds number of terms at one position (Jayendra Patil via Mike McCandless)
* LUCENE-2617: Optional clauses of a BooleanQuery were not factored
into coord if the scorer for that segment returned null. This
can cause the same document to score to differently depending on
what segment it resides in. (yonik)
* LUCENE-2272: Fix explain in PayloadNearQuery and also fix scoring issue (Peter Keegan via Grant Ingersoll)
* LUCENE-2732: Fix charset problems in XML loading in
HyphenationCompoundWordTokenFilter. (Uwe Schindler)
* LUCENE-2802: NRT DirectoryReader returned incorrect values from
getVersion, isOptimized, getCommitUserData, getIndexCommit and isCurrent due
to a mutable reference to the IndexWriters SegmentInfos.
(Simon Willnauer, Earwin Burrfoot)
* LUCENE-2852: Fixed corner case in RAMInputStream that would hit a
false EOF after seeking to EOF then seeking back to same block you
were just in and then calling readBytes (Robert Muir, Mike McCandless)
* LUCENE-2860: Fixed SegmentInfo.sizeInBytes to factor includeDocStores when it
decides whether to return the cached computed size or not. (Shai Erera)
* LUCENE-2584: SegmentInfo.files() could hit ConcurrentModificationException if
called by multiple threads. (Alexander Kanarsky via Shai Erera)
* LUCENE-2809: Fixed IndexWriter.numDocs to take into account
applied but not yet flushed deletes. (Mike McCandless)
* LUCENE-2879: MultiPhraseQuery previously calculated its phrase IDF by summing
internally, it now calls Similarity.idfExplain(Collection, IndexSearcher).
(Robert Muir)
* LUCENE-2693: RAM used by IndexWriter was slightly incorrectly computed.
(Jason Rutherglen via Shai Erera)
* LUCENE-1846: DateTools now uses the US locale everywhere, so DateTools.round()
is safe also in strange locales. (Uwe Schindler)
* LUCENE-2891: IndexWriterConfig did not accept -1 in setReaderTermIndexDivisor,
which can be used to prevent loading the terms index into memory. (Shai Erera)
* LUCENE-2937: Encoding a float into a byte (e.g. encoding field norms during
indexing) had an underflow detection bug that caused floatToByte(f)==0 where
f was greater than 0, but slightly less than byteToFloat(1). This meant that
certain very small field norms (index_boost * length_norm) could have
been rounded down to 0 instead of being rounded up to the smallest
positive number. (yonik)
* LUCENE-2936: PhraseQuery score explanations were not correctly
identifying matches vs non-matches. (hossman)
* LUCENE-2975: A HotSpot bug corrupts IndexInput#readVInt()/readVLong() if
the underlying readByte() is inlined (which happens e.g. in MMapDirectory).
The loop was unrolled, which makes the HotSpot bug disappear.
(Uwe Schindler, Robert Muir, Mike McCandless)
New features
* LUCENE-2128: Parallelized fetching document frequencies during weight
creation. (Israel Tsadok, Simon Willnauer via Uwe Schindler)
* LUCENE-2069: Added Unicode 4 support to CharArraySet. Due to the switch
to Java 5, supplementary characters are now lowercased correctly if the
set is created as case insensitive.
CharArraySet now requires a Version argument to preserve
backwards compatibility. If Version < 3.1 is passed to the constructor,
CharArraySet yields the old behavior. (Simon Willnauer)
* LUCENE-2069: Added Unicode 4 support to LowerCaseFilter. Due to the switch
to Java 5, supplementary characters are now lowercased correctly.
LowerCaseFilter now requires a Version argument to preserve
backwards compatibility. If Version < 3.1 is passed to the constructor,
LowerCaseFilter yields the old behavior. (Simon Willnauer, Robert Muir)
* LUCENE-2034: Added ReusableAnalyzerBase, an abstract subclass of Analyzer
that makes it easier to reuse TokenStreams correctly. This issue also added
StopwordAnalyzerBase, which improves consistency of all Analyzers that use
stopwords; many contrib analyzers were reimplemented with it.
(Simon Willnauer via Robert Muir)
* LUCENE-2198, LUCENE-2901: Support protected words in stemming TokenFilters using a
new KeywordAttribute. (Simon Willnauer, Drew Farris via Uwe Schindler)
* LUCENE-2183, LUCENE-2240, LUCENE-2241: Added Unicode 4 support
to CharTokenizer and its subclasses. CharTokenizer now has a new
int-based API which is conditionally preferred over the old char-based API,
depending on the provided Version; Version < 3.1 will use the char-based API.
(Simon Willnauer via Uwe Schindler)
* LUCENE-2247: Added a CharArrayMap<V> for performance improvements
in some stemmers and synonym filters. (Uwe Schindler)
* LUCENE-2320: Added SetOnce which wraps an object and allows it to be set
exactly once. (Shai Erera via Mike McCandless)
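For example, a short sketch (the wrapped MergePolicy is illustrative):
<code>
  SetOnce<MergePolicy> policy = new SetOnce<MergePolicy>();
  policy.set(new LogByteSizeMergePolicy());  // first set succeeds
  MergePolicy mp = policy.get();
  // a second policy.set(...) throws SetOnce.AlreadySetException
</code>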
* LUCENE-2314: Added AttributeSource.copyTo(AttributeSource), which
allows using cloneAttributes() together with this method as a replacement
for captureState()/restoreState(), if the state itself
needs to be inspected/modified. (Uwe Schindler)
* LUCENE-2293: Expose control over max number of threads that
IndexWriter will allow to run concurrently while indexing
documents (previously this was hardwired to 5), using
IndexWriterConfig.setMaxThreadStates. (Mike McCandless)
* LUCENE-2297: Enable turning on reader pooling inside IndexWriter
even when getReader (near-real-time reader) is not in use, through
IndexWriterConfig.enable/disableReaderPooling. (Mike McCandless)
* LUCENE-2331: Add NoMergePolicy which never returns any merges to execute. In
addition, add NoMergeScheduler which never executes any merges. These two are
convenient classes in case you want to disable segment merges by IndexWriter
without tweaking a particular MergePolicy's parameters, such as mergeFactor.
MergeScheduler's methods are now public. (Shai Erera via Mike McCandless)
* LUCENE-2339: Deprecate static method Directory.copy in favor of
Directory.copyTo, and use nio's FileChannel.transferTo when copying
files between FSDirectory instances. (Earwin Burrfoot via Mike
McCandless).
* LUCENE-2074: Make StandardTokenizer fit for Unicode 4.0, if the
matchVersion parameter is Version.LUCENE_31. (Uwe Schindler)
* LUCENE-2385: Moved NoDeletionPolicy from benchmark to core. NoDeletionPolicy
can be used to prevent commits from ever getting deleted from the index.
(Shai Erera)
* LUCENE-1585: IndexWriter now accepts a PayloadProcessorProvider which can
return a DirPayloadProcessor for a given Directory, which returns a
PayloadProcessor for a given Term. The PayloadProcessor will be used to
process the payloads of the segments as they are merged (e.g. if one wants to
rewrite payloads of external indexes as they are added, or of local ones).
(Shai Erera, Michael Busch, Mike McCandless)
* LUCENE-2440: Add support for custom ExecutorService in
ParallelMultiSearcher (Edward Drapkin via Mike McCandless)
* LUCENE-2295: Added a LimitTokenCountAnalyzer / LimitTokenCountFilter
to wrap any other Analyzer and provide the same functionality as
MaxFieldLength provided on IndexWriter. This patch also fixes a bug
in the offset calculation in CharTokenizer. (Uwe Schindler, Shai Erera)
* LUCENE-2526: Don't throw NPE from MultiPhraseQuery.toString when
it's empty. (Ross Woolf via Mike McCandless)
* LUCENE-2559: Added SegmentReader.reopen methods (John Wang via Mike
McCandless)
* LUCENE-2590: Added Scorer.visitSubScorers, and Scorer.freq. Along
with a custom Collector these experimental methods make it possible
to gather the hit-count per sub-clause and per document while a
search is running. (Simon Willnauer, Mike McCandless)
* LUCENE-2636: Added MultiCollector which allows running the search with several
Collectors. (Shai Erera)
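For example, collecting top hits and a total count in one pass (searcher and
query assumed to exist):
<code>
  TopScoreDocCollector top = TopScoreDocCollector.create(10, true);
  TotalHitCountCollector count = new TotalHitCountCollector();
  searcher.search(query, MultiCollector.wrap(top, count));
  TopDocs hits = top.topDocs();
  int total = count.getTotalHits();
</code>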
* LUCENE-2754, LUCENE-2757: Added a wrapper around MultiTermQueries
to add span support: SpanMultiTermQueryWrapper<Q extends MultiTermQuery>.
Using this wrapper it's easy to add fuzzy/wildcard queries to e.g. a SpanNearQuery.
(Robert Muir, Uwe Schindler)
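A sketch (the field name and terms are illustrative):
<code>
  SpanQuery fuzzy = new SpanMultiTermQueryWrapper<FuzzyQuery>(
      new FuzzyQuery(new Term("body", "lucene")));
  SpanQuery exact = new SpanTermQuery(new Term("body", "search"));
  Query near = new SpanNearQuery(new SpanQuery[] { fuzzy, exact }, 3, true);
</code>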
* LUCENE-2838: ConstantScoreQuery now directly supports wrapping a Query
instance for stripping off scores. The use of a QueryWrapperFilter
is no longer needed and is discouraged for that use case. Directly wrapping
Query improves performance, as out-of-order collection is now supported.
(Uwe Schindler)
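For example (query assumed to exist):
<code>
  // before: new ConstantScoreQuery(new QueryWrapperFilter(query))
  Query noScores = new ConstantScoreQuery(query);
  noScores.setBoost(1.5f);  // all matches get the same constant score
</code>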
* LUCENE-2864: Add getMaxTermFrequency (maximum within-document TF) to
FieldInvertState so that it can be used in Similarity.computeNorm.
(Robert Muir)
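A sketch of a Similarity using it (the normalization formula is illustrative
only, not a recommended ranking function):
<code>
  public class MaxTfSimilarity extends DefaultSimilarity {
    @Override
    public float computeNorm(String field, FieldInvertState state) {
      // normalize by the most frequent term rather than by field length
      return state.getBoost() * (1.0f / (float) Math.sqrt(state.getMaxTermFrequency()));
    }
  }
</code>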
* LUCENE-2720: Segments now record the code version which created them.
(Shai Erera, Mike McCandless, Uwe Schindler)
* LUCENE-2474: Added expert ReaderFinishedListener API to
IndexReader, to allow apps that maintain external per-segment caches
to evict entries when a segment is finished. (Shay Banon, Yonik
Seeley, Mike McCandless)
* LUCENE-2911: The new StandardTokenizer, UAX29URLEmailTokenizer, and
the ICUTokenizer in contrib now all tag tokens with a consistent set
of token types (defined in StandardTokenizer). Tokens in the major
CJK types are explicitly marked to allow for custom downstream handling:
<IDEOGRAPHIC>, <HANGUL>, <KATAKANA>, and <HIRAGANA>.
(Robert Muir, Steven Rowe)
* LUCENE-2913: Add missing getters to Numeric* classes. (Uwe Schindler)
* LUCENE-1810: Added FieldSelectorResult.LATENT to not cache lazy loaded fields
(Tim Smith, Grant Ingersoll)
* LUCENE-2692: Added several new SpanQuery classes for positional checking
(match is in a range, payload is a specific value) (Grant Ingersoll)
Optimizations
* LUCENE-2494: Use CompletionService in ParallelMultiSearcher instead of
simple polling for results. (Edward Drapkin, Simon Willnauer)
* LUCENE-2075: Terms dict cache is now shared across threads instead
of being stored separately in thread local storage. Also fixed
terms dict so that the cache is used when seeking the thread local
term enum, which will be important for MultiTermQuery impls that do
lots of seeking (Mike McCandless, Uwe Schindler, Robert Muir, Yonik
Seeley)
* LUCENE-2136: If the multi reader (DirectoryReader or MultiReader)
only has a single sub-reader, delegate all enum requests to it.
This avoids the overhead of using a PQ unnecessarily. (Mike
McCandless)
* LUCENE-2137: Switch to AtomicInteger for some ref counting (Earwin
Burrfoot via Mike McCandless)
* LUCENE-2123, LUCENE-2261: Move FuzzyQuery rewrite to separate RewriteMode
into MultiTermQuery. The number of fuzzy expansions can be specified with
the maxExpansions parameter to FuzzyQuery.
(Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-2164: ConcurrentMergeScheduler has more control over merge
threads. First, it gives smaller merges higher thread priority than
larger ones. Second, a new set/getMaxMergeCount setting will pause
the larger merges to allow smaller ones to finish. The defaults for
these settings are now dynamic, depending on the number of CPU cores as
reported by Runtime.getRuntime().availableProcessors() (Mike
McCandless)
* LUCENE-2169: Improved CharArraySet.copy(), if source set is
also a CharArraySet. (Simon Willnauer via Uwe Schindler)
* LUCENE-2084: Change IndexableBinaryStringTools to work on byte[] and char[]
directly, instead of Byte/CharBuffers, and modify CollationKeyFilter to
take advantage of this for faster performance.
(Steven Rowe, Uwe Schindler, Robert Muir)
* LUCENE-2188: Add a utility class for tracking deprecated overridden
methods in non-final subclasses.
(Uwe Schindler, Robert Muir)
* LUCENE-2195: Speedup CharArraySet if set is empty.
(Simon Willnauer via Robert Muir)
* LUCENE-2285: Code cleanup. (Shai Erera via Uwe Schindler)
* LUCENE-2303: Remove code duplication in Token class by subclassing
TermAttributeImpl, move DEFAULT_TYPE constant to TypeInterface, improve
null-handling for TypeAttribute. (Uwe Schindler)
* LUCENE-2329: Switch TermsHash* from using a PostingList object per unique
term to parallel arrays, indexed by termID. This reduces garbage collection
overhead significantly, which results in great indexing performance wins
when the available JVM heap space is low. This will become even more
important when the DocumentsWriter RAM buffer is searchable in the future,
because then it will make sense to make the RAM buffers as large as
possible. (Mike McCandless, Michael Busch)
* LUCENE-2380: The terms field cache methods (getTerms,
getTermsIndex), which replace the older String equivalents
(getStrings, getStringIndex), consume quite a bit less RAM in most
cases. (Mike McCandless)
* LUCENE-2410: ~20% speedup on exact (slop=0) PhraseQuery matching.
(Mike McCandless)
* LUCENE-2531: Fix issue when sorting by a String field that was
causing too many fallbacks to compare-by-value (instead of by-ord).
(Mike McCandless)
* LUCENE-2574: IndexInput exposes copyBytes(IndexOutput, long) to allow for
efficient copying by sub-classes. Optimized copy is implemented for RAM and FS
streams. (Shai Erera)
* LUCENE-2719: Improved TermsHashPerField's sorting to use a better
quick sort algorithm that dereferences the pivot element not on
every compare call. Also replaced lots of sorting code in Lucene
by the improved SorterTemplate class.
(Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-2760: Optimize SpanFirstQuery and SpanPositionRangeQuery.
(Robert Muir)
* LUCENE-2770: Make SegmentMerger always work on atomic subreaders,
even when IndexWriter.addIndexes(IndexReader...) is used with
DirectoryReaders or other MultiReaders. This saves lots of memory
during merge of norms. (Uwe Schindler, Mike McCandless)
* LUCENE-2824: Optimize BufferedIndexInput to do less bounds checks.
(Robert Muir)
* LUCENE-2010: Segments with 100% deleted documents are now removed on
IndexReader or IndexWriter commit. (Uwe Schindler, Mike McCandless)
* LUCENE-1472: Removed synchronization from static DateTools methods
by using a ThreadLocal. Also converted DateTools.Resolution to a
Java 5 enum (this should not break backwards compatibility). (Uwe Schindler)
Build
* LUCENE-2124: Moved the JDK-based collation support from contrib/collation
into core, and moved the ICU-based collation support into contrib/icu.
(Robert Muir)
* LUCENE-2326: Removed SVN checkouts for backwards tests. The backwards
branch is now included in the svn repository using "svn copy"
after release. (Uwe Schindler)
* LUCENE-2074: Regenerating StandardTokenizerImpl files now needs
JFlex 1.5 (currently only available on SVN). (Uwe Schindler)
* LUCENE-1709: Tests are now parallelized by default (except for benchmark). You
can force them to run sequentially by passing -Drunsequential=1 on the command
line. The number of threads that are spawned per CPU defaults to '1'. If you
wish to change that, you can run the tests with -DthreadsPerProcessor=[num].
(Robert Muir, Shai Erera, Peter Kofler)
* LUCENE-2516: Backwards tests are now compiled against released lucene-core.jar
from tarball of previous version. Backwards tests are now packaged together
with src distribution. (Uwe Schindler)
* LUCENE-2611: Added Ant target to install IntelliJ IDEA configuration:
"ant idea". See http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ
(Steven Rowe)
* LUCENE-2657: Switch from using Maven POM templates to full POMs when
generating Maven artifacts (Steven Rowe)
* LUCENE-2609: Added jar-test-framework Ant target which packages Lucene's
tests' framework classes. (Drew Farris, Grant Ingersoll, Shai Erera,
Steven Rowe)
Test Cases
* LUCENE-2037: Allow JUnit4 tests in our environment (Erick Erickson
via Mike McCandless)
* LUCENE-1844: Speed up the unit tests (Mark Miller, Erick Erickson,
Mike McCandless)
* LUCENE-2065: Use Java 5 generics throughout our unit tests. (Kay
Kay via Mike McCandless)
* LUCENE-2155: Fix time and zone dependent localization test failures
in queryparser tests. (Uwe Schindler, Chris Male, Robert Muir)
* LUCENE-2170: Fix thread starvation problems. (Uwe Schindler)
* LUCENE-2248, LUCENE-2251, LUCENE-2285: Refactor tests to not use
Version.LUCENE_CURRENT, but instead use a global static value
from LuceneTestCase(J4), that contains the release version.
(Uwe Schindler, Simon Willnauer, Shai Erera)
* LUCENE-2313, LUCENE-2322: Add VERBOSE to LuceneTestCase(J4) to control
verbosity of tests. If VERBOSE==false (default) tests should not print
anything other than errors to System.(out|err). The setting can be
changed with -Dtests.verbose=true on test invocation.
(Shai Erera, Paul Elschot, Uwe Schindler)
* LUCENE-2318: Remove inconsistent system property code for retrieving
temp and data directories inside test cases. It is now centralized in
LuceneTestCase(J4). Also changed lots of tests to use
getClass().getResourceAsStream() to retrieve test data. Tests needing
access to "real" files from the test folder itself, can use
LuceneTestCase(J4).getDataFile(). (Uwe Schindler)
* LUCENE-2398, LUCENE-2611: Improve tests to work better from IDEs such
as Eclipse and IntelliJ.
(Paolo Castagna, Steven Rowe via Robert Muir)
* LUCENE-2804: add newFSDirectory to LuceneTestCase to create a FSDirectory at
random. (Shai Erera, Robert Muir)
Documentation
* LUCENE-2579: Fix oal.search's package.html description of abstract
methods. (Santiago M. Mola via Mike McCandless)
* LUCENE-2625: Add a note to IndexReader.termDocs() clarifying that the
TermEnum must be seeked since it is unpositioned.
(Adriano Crestani via Robert Muir)
* LUCENE-2894: Use google-code-prettify for syntax highlighting in javadoc.
(Shinichiro Abe, Koji Sekiguchi)
================== Release 2.9.4 / 3.0.3 ====================
Changes in runtime behavior
* LUCENE-2689: NativeFSLockFactory no longer attempts to acquire a
test lock just before the real lock is acquired. (Surinder Pal
Singh Bindra via Mike McCandless)
* LUCENE-2762: Fixed bug in IndexWriter causing it to hold open file
handles against deleted files when compound-file was enabled (the
default) and readers are pooled. As a result of this the peak
worst-case free disk space required during optimize is now 3X the
index size, when compound file is enabled (else 2X). (Mike
McCandless)
* LUCENE-2773: LogMergePolicy accepts a double noCFSRatio (default =
0.1), which means any time a merged segment is greater than 10% of
the index size, it will be left in non-compound format even if
compound format is on. This change was made to reduce peak
transient disk usage during optimize which increased due to
LUCENE-2762. (Mike McCandless)
Bug fixes
* LUCENE-2142 (correct fix): FieldCacheImpl.getStringIndex no longer
throws an exception when term count exceeds doc count.
(Mike McCandless, Uwe Schindler)
* LUCENE-2513: when opening writable IndexReader on a not-current
commit, do not overwrite "future" commits. (Mike McCandless)
* LUCENE-2536: IndexWriter.rollback was failing to properly rollback
buffered deletions against segments that were flushed (Mark Harwood
via Mike McCandless)
* LUCENE-2541: Fixed NumericRangeQuery that returned incorrect results
with endpoints near Long.MIN_VALUE and Long.MAX_VALUE:
NumericUtils.splitRange() overflowed, if
- the range contained a LOWER bound
that was greater than (Long.MAX_VALUE - (1L << precisionStep))
- the range contained an UPPER bound
that was less than (Long.MIN_VALUE + (1L << precisionStep))
With standard precision steps around 4, this had no effect on
most queries, only those that met the above conditions.
Queries with large precision steps failed more easily. Queries with
precision step >=64 were not affected. Also 32 bit data types int
and float were not affected.
(Yonik Seeley, Uwe Schindler)
* LUCENE-2593: Fixed certain rare cases where a disk full could lead
to a corrupted index (Robert Muir, Mike McCandless)
* LUCENE-2620: Fixed a bug in WildcardQuery where too many asterisks
would result in unbearably slow performance. (Nick Barkas via Robert Muir)
* LUCENE-2627: Fixed bug in MMapDirectory chunking when a file is an
exact multiple of the chunk size. (Robert Muir)
* LUCENE-2634: isCurrent on an NRT reader was failing to return false
if the writer had just committed (Nikolay Zamosenchuk via Mike McCandless)
* LUCENE-2650: Added extra safety to MMapIndexInput clones to prevent accessing
an unmapped buffer if the input is closed (Mike McCandless, Uwe Schindler, Robert Muir)
* LUCENE-2384: Reset zzBuffer in StandardTokenizerImpl when lexer is reset.
(Ruben Laguna via Uwe Schindler, sub-issue of LUCENE-2074)
* LUCENE-2658: Exceptions while processing term vectors enabled for multiple
fields could lead to invalid ArrayIndexOutOfBoundsExceptions.
(Robert Muir, Mike McCandless)
* LUCENE-2235: Implement missing PerFieldAnalyzerWrapper.getOffsetGap().
(Javier Godoy via Uwe Schindler)
* LUCENE-2328: Fixed memory leak in how IndexWriter/Reader tracked
already sync'd files. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2549: Fix TimeLimitingCollector#TimeExceededException to record
the absolute docid. (Uwe Schindler)
* LUCENE-2533: fix FileSwitchDirectory.listAll to not return dups when
primary & secondary dirs share the same underlying directory.
(Michael McCandless)
* LUCENE-2365: IndexWriter.newestSegment (used normally for testing)
is fixed to return null if there are no segments. (Karthick
Sankarachary via Mike McCandless)
* LUCENE-2730: Fix two rare deadlock cases in IndexWriter (Mike McCandless)
* LUCENE-2744: CheckIndex was stating total number of fields,
not the number that have norms enabled, on the "test: field
norms..." output. (Mark Kristensson via Mike McCandless)
* LUCENE-2759: Fixed two near-real-time cases where doc store files
may be opened for read even though they are still open for write.
(Mike McCandless)
* LUCENE-2618: Fix rare thread safety issue whereby
IndexWriter.optimize could sometimes return even though the index
wasn't fully optimized (Mike McCandless)
* LUCENE-2767: Fix thread safety issue in addIndexes(IndexReader[])
that could potentially result in index corruption. (Mike
McCandless)
* LUCENE-2762: Fixed bug in IndexWriter causing it to hold open file
handles against deleted files when compound-file was enabled (the
default) and readers are pooled. As a result of this the peak
worst-case free disk space required during optimize is now 3X the
index size, when compound file is enabled (else 2X). (Mike
McCandless)
* LUCENE-2216: OpenBitSet.hashCode returned different hash codes for
sets that only differed by trailing zeros. (Dawid Weiss, yonik)
* LUCENE-2782: Fix rare potential thread hazard with
IndexWriter.commit (Mike McCandless)
API Changes
* LUCENE-2773: LogMergePolicy accepts a double noCFSRatio (default =
0.1), which means any time a merged segment is greater than 10% of
the index size, it will be left in non-compound format even if
compound format is on. This change was made to reduce peak
transient disk usage during optimize which increased due to
LUCENE-2762. (Mike McCandless)
Optimizations
* LUCENE-2556: Improve memory usage after cloning TermAttribute.
(Adriano Crestani via Uwe Schindler)
* LUCENE-2098: Improve the performance of BaseCharFilter, especially for
large documents. (Robin Wojciki, Koji Sekiguchi, Robert Muir)
New features
* LUCENE-2675 (2.9.4 only): Add support for Lucene 3.0 stored field files
also in 2.9. The file format did not change, only the version number was
upgraded to mark segments that have no compression. FieldsWriter still only
writes 2.9 segments as they could contain compressed fields. This cross-version
index format compatibility is provided here solely because Lucene 2.9 and 3.0
have the same bugfix level, features, and the same index format with this slight
compression difference. In general, Lucene does not support reading newer
indexes with older library versions. (Uwe Schindler)
Documentation
* LUCENE-2239: Documented limitations in NIOFSDirectory and MMapDirectory due to
Java NIO behavior when a Thread is interrupted while blocking on IO.
(Simon Willnauer, Robert Muir)
================== Release 2.9.3 / 3.0.2 ====================
Changes in backwards compatibility policy
* LUCENE-2135: Added FieldCache.purge(IndexReader) method to the
interface. Anyone implementing FieldCache externally will need to
fix their code to implement this, on upgrading. (Mike McCandless)
Changes in runtime behavior
* LUCENE-2421: NativeFSLockFactory does not throw LockReleaseFailedException if
it cannot delete the lock file, since obtaining the lock does not fail if the
file is there. (Shai Erera)
* LUCENE-2060 (2.9.3 only): Changed ConcurrentMergeScheduler's default for
maxNumThreads from 3 to 1, because in practice we get the most gains
from running a single merge in the background. More than one
concurrent merge causes a lot of thrashing (though it's possible on
SSD storage that there would be net gains). (Jason Rutherglen, Mike
McCandless)
Bug fixes
* LUCENE-2046 (2.9.3 only): IndexReader should not see the index as changed, after
IndexWriter.prepareCommit has been called but before
IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
* LUCENE-2119: Don't throw NegativeArraySizeException if you pass
Integer.MAX_VALUE as nDocs to IndexSearcher search methods. (Paul
Taylor via Mike McCandless)
* LUCENE-2142: FieldCacheImpl.getStringIndex no longer throws an
exception when term count exceeds doc count. (Mike McCandless)
* LUCENE-2104: NativeFSLock.release() would silently fail if the lock is held by
another thread/process. (Shai Erera via Uwe Schindler)
* LUCENE-2283: Use shared memory pool for term vector and stored
fields buffers. This memory will be reclaimed if needed according to
the configured RAM Buffer Size for the IndexWriter. This also fixes
potentially excessive memory usage when many threads are indexing a
mix of small and large documents. (Tim Smith via Mike McCandless)
* LUCENE-2300: If IndexWriter is pooling readers (because an NRT reader
has been obtained), and addIndexes* is run, do not pool the
readers from the external directory. This is harmless (NRT reader is
correct), but a waste of resources. (Mike McCandless)
* LUCENE-2422: Don't reuse byte[] in IndexInput/Output -- it gains
little performance, and ties up possibly large amounts of memory
for apps that index large docs. (Ross Woolf via Mike McCandless)
* LUCENE-2387: Don't hang onto Fieldables from the last doc indexed,
in IndexWriter, nor the Reader in Tokenizer after close is
called. (Ruben Laguna, Uwe Schindler, Mike McCandless)
* LUCENE-2417: IndexCommit did not implement hashCode() and equals()
consistently. Now they both take Directory and version into consideration. In
addition, all IndexCommit methods which threw
UnsupportedOperationException are now abstract. (Shai Erera)
* LUCENE-2467: Fixed memory leaks in IndexWriter when large documents
are indexed. (Mike McCandless)
* LUCENE-2473: Clicking on the "More Results" link in the luceneweb.war
demo resulted in ArrayIndexOutOfBoundsException.
(Sami Siren via Robert Muir)
* LUCENE-2476: If any exception is hit init'ing IW, release the write
lock (previously we only released on IOException). (Tamas Cservenak
via Mike McCandless)
* LUCENE-2478: Fix CachingWrapperFilter to not throw NPE when
Filter.getDocIdSet() returns null. (Uwe Schindler, Daniel Noll)
* LUCENE-2468: Allow specifying how new deletions should be handled in
CachingWrapperFilter and CachingSpanFilter. By default, new
deletions are ignored in CachingWrapperFilter, since typically this
filter is AND'd with a query that correctly takes new deletions into
account. This should be a performance gain (higher cache hit rate)
in apps that reopen readers, or use near-real-time reader
(IndexWriter.getReader()), but may introduce invalid search results
(allowing deleted docs to be returned) for certain cases, so a new
expert ctor was added to CachingWrapperFilter to enforce deletions
at a performance cost. CachingSpanFilter by default recaches if
there are new deletions (Shay Banon via Mike McCandless)
* LUCENE-2299: If you open an NRT reader while addIndexes* is running,
it may miss some segments (Earwin Burrfoot via Mike McCandless)
* LUCENE-2397: Don't throw NPE from SnapshotDeletionPolicy.snapshot if
there are no commits yet (Shai Erera)
* LUCENE-2424: Fix FieldDoc.toString to actually return its fields
(Stephen Green via Mike McCandless)
* LUCENE-2311: Always pass a "fully loaded" (terms index & doc stores)
SegmentsReader to IndexWriter's mergedSegmentWarmer (if set), so
that warming is free to do whatever it needs to. (Earwin Burrfoot
via Mike McCandless)
* LUCENE-3029: Fix corner case when MultiPhraseQuery is used with zero
position-increment tokens that would sometimes assign different
scores to identical docs. (Mike McCandless)
* LUCENE-2486: Fixed intermittent FileNotFoundException on doc store
files when a mergedSegmentWarmer is set on IndexWriter. (Mike
McCandless)
* LUCENE-2130: Fix performance issue when FuzzyQuery runs on a
multi-segment index (Michael McCandless)
API Changes
* LUCENE-2281: added doBeforeFlush to IndexWriter to allow extensions to perform
operations before flush starts. Also exposed doAfterFlush as protected instead
of package-private. (Shai Erera via Mike McCandless)
* LUCENE-2356: Add IndexWriter.set/getReaderTermsIndexDivisor, to set
what IndexWriter passes for termsIndexDivisor to the readers it
opens internally when applying deletions or creating a
near-real-time reader. (Earwin Burrfoot via Mike McCandless)
Optimizations
* LUCENE-2494 (3.0.2 only): Use CompletionService in ParallelMultiSearcher
instead of simple polling for results. (Edward Drapkin, Simon Willnauer)
* LUCENE-2135: On IndexReader.close, forcefully evict any entries from
the FieldCache rather than waiting for the WeakHashMap to release
the reference (Mike McCandless)
* LUCENE-2161: Improve concurrency of IndexReader, especially in the
context of near real-time readers. (Mike McCandless)
* LUCENE-2360: Small speedup to recycling of reused per-doc RAM in
IndexWriter (Robert Muir, Mike McCandless)
Build
* LUCENE-2488 (2.9.3 only): Support build with JDK 1.4 and exclude Java 1.5
contrib modules on request (pass '-Dforce.jdk14.build=true') when
compiling/testing/packaging. This marks the benchmark contrib also
as Java 1.5, as it depends on fast-vector-highlighter. (Uwe Schindler)
================== Release 2.9.2 / 3.0.1 ====================
Changes in backwards compatibility policy
* LUCENE-2123 (3.0.1 only): Removed the protected inner class ScoreTerm
from FuzzyQuery. The change was needed because the comparator of this
class had to be changed in an incompatible way. The class was never
intended to be public. (Uwe Schindler, Mike McCandless)
Bug fixes
* LUCENE-2092: BooleanQuery was ignoring disableCoord in its hashCode
and equals methods, causing bad things to happen when caching
BooleanQueries. (Chris Hostetter, Mike McCandless)
* LUCENE-2095: Fixes: when two threads call IndexWriter.commit() at
the same time, it's possible for commit to return control back to
one of the threads before all changes are actually committed.
(Sanne Grinovero via Mike McCandless)
* LUCENE-2132 (3.0.1 only): Fix the demo result.jsp to use QueryParser
with a Version argument. (Brian Li via Robert Muir)
* LUCENE-2166: Don't incorrectly keep warning about the same immense
term, when IndexWriter.infoStream is on. (Mike McCandless)
* LUCENE-2158: At high indexing rates, NRT reader could temporarily
lose deletions. (Mike McCandless)
* LUCENE-2182: DEFAULT_ATTRIBUTE_FACTORY was failing to load
implementation class when interface was loaded by a different
class loader. (Uwe Schindler, reported on java-user by Ahmed El-dawy)
* LUCENE-2257: Increase max number of unique terms in one segment to
termIndexInterval (default 128) * ~2.1 billion = ~274 billion.
(Tom Burton-West via Mike McCandless)
* LUCENE-2260: Fixed AttributeSource to not hold a strong
reference to the Attribute/AttributeImpl classes which prevents
unloading of custom attributes loaded by other classloaders
(e.g. in Solr plugins). (Uwe Schindler)
* LUCENE-1941: Fix Min/MaxPayloadFunction returning 0 when
only one payload is present. (Erik Hatcher, Mike McCandless
via Uwe Schindler)
* LUCENE-2270: Queries consisting of all zero-boost clauses
(for example, text:foo^0) sorted incorrectly and produced
invalid docids. (yonik)
API Changes
* LUCENE-1609 (3.0.1 only): Restore IndexReader.getTermInfosIndexDivisor
(it was accidentally removed in 3.0.0) (Mike McCandless)
* LUCENE-1972 (3.0.1 only): Restore SortField.getComparatorSource
(it was accidentally removed in 3.0.0) (John Wang via Uwe Schindler)
* LUCENE-2190: Added a new class CustomScoreProvider to function package
that can be subclassed to provide custom scoring to CustomScoreQuery.
The methods in CustomScoreQuery that did this before were deprecated
and replaced by a method getCustomScoreProvider(IndexReader) that
returns a custom score implementation using the above class. The change
is necessary with per-segment searching, as CustomScoreQuery is
a stateless class (like all other Queries) and does not know about
the currently searched segment. This API works similarly to Filter's
getDocIdSet(IndexReader). (Paul chez Jamespot via Mike McCandless,
Uwe Schindler)
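A minimal sketch of the new extension point (the score formula is illustrative,
and subQuery/valSrcQuery are assumed to exist):
<code>
  CustomScoreQuery q = new CustomScoreQuery(subQuery, valSrcQuery) {
    @Override
    protected CustomScoreProvider getCustomScoreProvider(IndexReader reader) {
      return new CustomScoreProvider(reader) {
        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
          return subQueryScore * (1.0f + valSrcScore);
        }
      };
    }
  };
</code>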
* LUCENE-2080: Deprecate Version.LUCENE_CURRENT, as using this constant
will cause backwards compatibility problems when upgrading Lucene. See
the Version javadocs for additional information.
(Robert Muir)
Optimizations
* LUCENE-2086: When resolving deleted terms, do so in term sort order
for better performance (Bogdan Ghidireac via Mike McCandless)
* LUCENE-2123 (partly, 3.0.1 only): Fixes a slowdown / memory issue
added by LUCENE-504. (Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-2258: Remove unneeded synchronization in FuzzyTermEnum.
(Uwe Schindler, Robert Muir)
Test Cases
* LUCENE-2114: Change TestFilteredSearch to test on multi-segment
index as well. (Simon Willnauer via Mike McCandless)
* LUCENE-2211: Improves BaseTokenStreamTestCase to use a fake attribute
that checks if clearAttributes() was called correctly.
(Uwe Schindler, Robert Muir)
* LUCENE-2207, LUCENE-2219: Improve BaseTokenStreamTestCase to check if
end() is implemented correctly. (Koji Sekiguchi, Robert Muir)
Documentation
* LUCENE-2114: Improve javadocs of Filter to call out that the
provided reader is per-segment (Simon Willnauer via Mike
McCandless)
======================= Release 3.0.0 =======================
Changes in backwards compatibility policy
* LUCENE-1979: Change return type of SnapshotDeletionPolicy#snapshot()
from IndexCommitPoint to IndexCommit. Code that uses this method
needs to be recompiled against Lucene 3.0 in order to work. The
previously deprecated IndexCommitPoint is also removed.
(Michael Busch)
* o.a.l.Lock.isLocked() is now allowed to throw an IOException.
(Mike McCandless)
* LUCENE-2030: CachingWrapperFilter and CachingSpanFilter now hide
the internal cache implementation for thread safety, before it was
declared protected. (Peter Lenahan, Uwe Schindler, Simon Willnauer)
* LUCENE-2053: If you call Thread.interrupt() on a thread inside
Lucene, Lucene will do its best to interrupt the thread. However,
instead of throwing InterruptedException (which is a checked
exception), you'll get an oal.util.ThreadInterruptedException (an
unchecked exception, subclassing RuntimeException). The interrupt
status on the thread is cleared when this exception is thrown.
(Mike McCandless)
* LUCENE-2052: Some methods in Lucene core were changed to accept
Java 5 varargs. This is not a backwards compatibility problem as
long as you do not try to override such a method. We left common
overridden methods unchanged and added varargs to constructors,
static, or final methods (MultiSearcher,...). (Uwe Schindler)
* LUCENE-1558: IndexReader.open(Directory) now opens a readOnly=true
reader, and new IndexSearcher(Directory) does the same. Note that
this is a change in the default from 2.9, when these methods were
previously deprecated. (Mike McCandless)
* LUCENE-1753: TokenStreams that were not yet final are now final, to
enforce the decorator pattern. (Uwe Schindler)
Changes in runtime behavior
* LUCENE-1677: Remove the system property to set SegmentReader class
implementation. (Uwe Schindler)
* LUCENE-1960: As a consequence of the removal of Field.Store.COMPRESS,
support for this type of fields was removed. Lucene 3.0 is still able
to read indexes with compressed fields, but as soon as merges occur
or the index is optimized, all compressed fields are decompressed
and converted to Field.Store.YES. Because of this, indexes with
compressed fields can suddenly get larger. Also, the first merge with
decompression cannot be done in raw mode and is therefore slower.
This change has no effect for code that uses such old indexes;
they behave as before (fields are automatically decompressed
during read). Indexes converted to Lucene 3.0 format can no longer
be read with previous versions.
It is recommended to optimize your indexes after upgrading to convert
to the new format and decompress all fields.
If you want compressed fields, you can use CompressionTools, which
creates a compressed byte[] to be added as a binary stored field. This
cannot be done automatically, as you also have to decompress such
fields when reading. You have to reindex to do that.
(Michael Busch, Uwe Schindler)
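A sketch of the manual approach (the field name is illustrative; doc, largeText
and storedDoc are assumed to exist):
<code>
  // at index time, store the compressed bytes yourself:
  byte[] compressed = CompressionTools.compressString(largeText);
  doc.add(new Field("body_z", compressed, Field.Store.YES));
  ...
  // at read time, decompress explicitly:
  String original = CompressionTools.decompressString(storedDoc.getBinaryValue("body_z"));
</code>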
* LUCENE-2060: Changed ConcurrentMergeScheduler's default for
maxNumThreads from 3 to 1, because in practice we get the most
gains from running a single merge in the background. More than one
concurrent merge causes a lot of thrashing (though it's possible on
SSD storage that there would be net gains). (Jason Rutherglen,
Mike McCandless)
API Changes
* LUCENE-1257, LUCENE-1984, LUCENE-1985, LUCENE-2057, LUCENE-1833, LUCENE-2012,
LUCENE-1998: Port to Java 1.5:
- Add generics to public and internal APIs (see below).
- Replace new Integer(int), new Double(double),... by static valueOf() calls.
- Replace for-loops with Iterator by foreach loops.
- Replace StringBuffer with StringBuilder.
- Replace o.a.l.util.Parameter by Java 5 enums (see below).
- Add @Override annotations.
(Uwe Schindler, Robert Muir, Karl Wettin, Paul Elschot, Kay Kay, Shai Erera,
DM Smith)
* Generify Lucene API:
- TokenStream/AttributeSource: Now addAttribute()/getAttribute() return an
instance of the requested attribute interface and no cast needed anymore
(LUCENE-1855).
- NumericRangeQuery, NumericRangeFilter, and FieldCacheRangeFilter
now have Integer, Long, Float, Double as type param (LUCENE-1857).
- Document.getFields() returns List<Fieldable>.
- Query.extractTerms(Set<Term>)
- CharArraySet and stop word sets in core/contrib
- PriorityQueue (LUCENE-1935)
- TopDocCollector
- DisjunctionMaxQuery (LUCENE-1984)
- MultiTermQueryWrapperFilter
- CloseableThreadLocal
- MapOfSets
- o.a.l.util.cache package
- lots of internal APIs of IndexWriter
(Uwe Schindler, Michael Busch, Kay Kay, Robert Muir, Adriano Crestani)
* LUCENE-1944, LUCENE-1856, LUCENE-1957, LUCENE-1960, LUCENE-1961,
LUCENE-1968, LUCENE-1970, LUCENE-1946, LUCENE-1971, LUCENE-1975,
LUCENE-1972, LUCENE-1978, LUCENE-944, LUCENE-1979, LUCENE-1973, LUCENE-2011:
Remove deprecated methods/constructors/classes:
- Remove all String/File directory paths in IndexReader /
IndexSearcher / IndexWriter.
- Remove FSDirectory.getDirectory()
- Make FSDirectory abstract.
- Remove Field.Store.COMPRESS (see above).
- Remove Filter.bits(IndexReader) method and make
Filter.getDocIdSet(IndexReader) abstract.
- Remove old DocIdSetIterator methods and make the new ones abstract.
- Remove some methods in PriorityQueue.
- Remove old TokenStream API and backwards compatibility layer.
- Remove RangeQuery, RangeFilter and ConstantScoreRangeQuery.
- Remove SpanQuery.getTerms().
- Remove ExtendedFieldCache, custom and auto caches, SortField.AUTO.
- Remove old-style custom sort.
- Remove legacy search setting in SortField.
- Remove Hits and all references from core and contrib.
- Remove HitCollector and its TopDocs support implementations.
- Remove term field and accessors in MultiTermQuery
(and fix Highlighter).
- Remove deprecated methods in BooleanQuery.
- Remove deprecated methods in Similarity.
- Remove BoostingTermQuery.
- Remove MultiValueSource.
- Remove Scorer.explain(int).
...and some other minor ones (Uwe Schindler, Michael Busch, Mark Miller)
* LUCENE-1925: Make IndexSearcher's subReaders and docStarts members
protected; add expert ctor to directly specify reader, subReaders
and docStarts. (John Wang, Tim Smith via Mike McCandless)
* LUCENE-1945: All public classes that have a close() method now
also implement java.io.Closeable (IndexReader, IndexWriter, Directory,...).
(Uwe Schindler)
* LUCENE-1998: Change all Parameter instances to Java 5 enums. This
is no backwards-break, only a change of the super class. Parameter
was deprecated and will be removed in a later version.
(DM Smith, Uwe Schindler)
Bug fixes
* LUCENE-1951: When the text provided to WildcardQuery has no wildcard
characters (ie matches a single term), don't lose the boost and
rewrite method settings. Also, rewrite to PrefixQuery if the
wildcard is of the form "foo*", for slightly faster performance. (Robert
Muir via Mike McCandless)
* LUCENE-2013: SpanRegexQuery does not work with QueryScorer.
(Benjamin Keil via Mark Miller)
* LUCENE-2088: addAttribute() should only accept interfaces that
extend Attribute. (Shai Erera, Uwe Schindler)
* LUCENE-2045: Fix silly FileNotFoundException hit if you enable
infoStream on IndexWriter and then add an empty document and commit
(Shai Erera via Mike McCandless)
* LUCENE-2046: IndexReader should not see the index as changed, after
IndexWriter.prepareCommit has been called but before
IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
New features
* LUCENE-1933: Provide a convenience AttributeFactory that creates a
Token instance for all basic attributes. (Uwe Schindler)
* LUCENE-2041: Parallelize the rest of ParallelMultiSearcher. Lots of
code refactoring and Java 5 concurrent support in MultiSearcher.
(Joey Surls, Simon Willnauer via Uwe Schindler)
* LUCENE-2051: Add CharArraySet.copy() as a simple method to copy
any Set<?> to a CharArraySet; the copy is optimized if the Set<?> is already
a CharArraySet. (Simon Willnauer)
Optimizations
* LUCENE-1183: Optimize Levenshtein Distance computation in
FuzzyQuery. (Cédrik Lime via Mike McCandless)
* LUCENE-2006: Optimization of FieldDocSortedHitQueue to always
use Comparable<?> interface. (Uwe Schindler, Mark Miller)
* LUCENE-2087: Remove recursion in NumericRangeTermEnum.
(Uwe Schindler)
Build
* LUCENE-486: Remove test->demo dependencies. (Michael Busch)
* LUCENE-2024: Raise build requirements to Java 1.5 and ANT 1.7.0
(Uwe Schindler, Mike McCandless)
======================= Release 2.9.1 =======================
Changes in backwards compatibility policy
* LUCENE-2002: Add required Version matchVersion argument when
constructing QueryParser or MultiFieldQueryParser and default (as
of 2.9) enablePositionIncrements to true, to match
StandardAnalyzer's 2.9 default (Uwe Schindler, Mike McCandless)
Bug fixes
* LUCENE-1974: Fixed nasty bug in BooleanQuery (when it used
BooleanScorer for scoring), whereby some matching documents fail to
be collected. (Fulin Tang via Mike McCandless)
* LUCENE-1124: Make sure FuzzyQuery always matches the precise term.
(stefatwork@gmail.com via Mike McCandless)
* LUCENE-1976: Fix IndexReader.isCurrent() to return the right thing
when the reader is a near real-time reader. (Jake Mannix via Mike
McCandless)
* LUCENE-1986: Fix NPE when scoring PayloadNearQuery (Peter Keegan,
Mark Miller via Mike McCandless)
* LUCENE-1992: Fix thread hazard if a merge is committing just as an
exception occurs during sync (Uwe Schindler, Mike McCandless)
* LUCENE-1995: Note in javadocs that IndexWriter.setRAMBufferSizeMB
cannot exceed 2048 MB, and throw IllegalArgumentException if it
does. (Aaron McKee, Yonik Seeley, Mike McCandless)
* LUCENE-2004: Fix Constants.LUCENE_MAIN_VERSION to not be inlined
by client code. (Uwe Schindler)
* LUCENE-2016: Replace illegal U+FFFF character with the replacement
char (U+FFFD) during indexing, to prevent silent index corruption.
(Peter Keegan, Mike McCandless)
API Changes
* Un-deprecate search(Weight weight, Filter filter, int n) from
Searchable interface (deprecated by accident). (Uwe Schindler)
* Un-deprecate o.a.l.util.Version constants. (Mike McCandless)
* LUCENE-1987: Un-deprecate some ctors of Token, as they will not
be removed in 3.0 and are still useful. Also add some missing
o.a.l.util.Version constants for enabling invalid acronym
settings in StandardAnalyzer to be compatible with the coming
Lucene 3.0. (Uwe Schindler)
* LUCENE-1973: Un-deprecate IndexSearcher.setDefaultFieldSortScoring,
to allow controlling per-IndexSearcher whether scores are computed
when sorting by field. (Uwe Schindler, Mike McCandless)
* LUCENE-2043: Make IndexReader.commit(Map<String,String>) public.
(Mike McCandless)
Documentation
* LUCENE-1955: Fix Hits deprecation notice to point users in right
direction. (Mike McCandless, Mark Miller)
* Fix javadoc about score tracking done by search methods in Searcher
and IndexSearcher. (Mike McCandless)
* LUCENE-2008: Javadoc improvements for TokenStream/Tokenizer/Token
(Luke Nezda via Mike McCandless)
======================= Release 2.9.0 =======================
Changes in backwards compatibility policy
* LUCENE-1575: Searchable.search(Weight, Filter, int, Sort) no
longer computes a document score for each hit by default. If
document score tracking is still needed, you can call
IndexSearcher.setDefaultFieldSortScoring(true, true) to enable
both per-hit and maxScore tracking; however, this is deprecated
and will be removed in 3.0.
Alternatively, use Searchable.search(Weight, Filter, Collector)
and pass in a TopFieldCollector instance, using the following code
sample:
<code>
TopFieldCollector tfc = TopFieldCollector.create(sort, numHits, fillFields,
true /* trackDocScores */,
true /* trackMaxScore */,
false /* docsInOrder */);
searcher.search(query, tfc);
TopDocs results = tfc.topDocs();
</code>
Note that your Sort object cannot use SortField.AUTO when you
directly instantiate TopFieldCollector.
Also, the method search(Weight, Filter, Collector) was added to
the Searchable interface and the Searcher abstract class to
replace the deprecated HitCollector versions. If you either
implement Searchable or extend Searcher, you should change your
code to implement this method. If you already extend
IndexSearcher, no further changes are needed to use Collector.
Finally, the values Float.NaN and Float.NEGATIVE_INFINITY are not
valid scores. Lucene uses these values internally in certain
places, so if you have hits with such scores, it will cause
problems. (Shai Erera via Mike McCandless)
* LUCENE-1687: All methods and parsers from the interface ExtendedFieldCache
have been moved into FieldCache. ExtendedFieldCache is now deprecated and
contains only a few declarations for binary backwards compatibility.
ExtendedFieldCache will be removed in version 3.0. Users of FieldCache and
ExtendedFieldCache will be able to plug in Lucene 2.9 without recompilation.
The auto cache (FieldCache.getAuto) is now deprecated. Due to the merge of
ExtendedFieldCache and FieldCache, FieldCache can now additionally return
long[] and double[] arrays in addition to int[] and float[] and StringIndex.
The interface changes are only notable for users implementing the interfaces,
which is unlikely, because there is no way to change
Lucene's FieldCache implementation. (Grant Ingersoll, Uwe Schindler)
* LUCENE-1630, LUCENE-1771: Weight, previously an interface, is now an abstract
class. Some of the method signatures have changed, but it should be fairly
easy to see what adjustments must be made to existing code to sync up
with the new API. You can find more detail in the API Changes section.
Going forward Searchable will be kept for convenience only and may
be changed between minor releases without any deprecation
process. It is not recommended that you implement it, but rather extend
Searcher.
(Shai Erera, Chris Hostetter, Martin Ruckli, Mark Miller via Mike McCandless)
* LUCENE-1422, LUCENE-1693: The new Attribute based TokenStream API (see below)
has some backwards breaks in rare cases. We did our best to make the
transition as easy as possible and you are not likely to run into any problems.
If your tokenizers still implement next(Token) or next(), the calls are
automatically wrapped. The indexer and query parser use the new API
(eg use incrementToken() calls). All core TokenStreams are implemented using
the new API. You can mix old and new API style TokenFilters/TokenStream.
Problems only occur when you have done the following:
You have overridden next(Token) or next() in one of the non-abstract core
TokenStreams/-Filters. These classes should normally be final, but some
of them are not. In this case, next(Token)/next() would never be called.
To fail early with a hard compile/runtime error, the next(Token)/next()
methods in these TokenStreams/-Filters were made final in this release.
(Michael Busch, Uwe Schindler)
* LUCENE-1763: MergePolicy now requires an IndexWriter instance to
be passed upon instantiation. As a result, IndexWriter was removed
as a method argument from all MergePolicy methods. (Shai Erera via
Mike McCandless)
* LUCENE-1748: LUCENE-1001 introduced PayloadSpans, but this was a back
compat break and caused custom SpanQuery implementations to fail at runtime
in a variety of ways. This issue attempts to remedy things by causing
a compile time break on custom SpanQuery implementations and removing
the PayloadSpans class, with its functionality now moved to Spans. To
help in alleviating future back compat pain, Spans has been changed from
an interface to an abstract class.
(Hugh Cayless, Mark Miller)
* LUCENE-1808: Query.createWeight has been changed from protected to
public. This will be a back compat break if you have overridden this
method - but you are likely already affected by the LUCENE-1693 (make Weight
abstract rather than an interface) back compat break if you have overridden
Query.createWeight, so we have taken the opportunity to make this change.
(Tim Smith, Shai Erera via Mark Miller)
* LUCENE-1708: IndexReader.document() no longer checks if the document is
deleted. You can call IndexReader.isDeleted(n) prior to calling document(n).
(Shai Erera via Mike McCandless)
Changes in runtime behavior
* LUCENE-1424: QueryParser now by default uses constant score auto
rewriting when it generates a WildcardQuery and PrefixQuery (it
already does so for TermRangeQuery, as well). Call
setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE)
to revert to slower BooleanQuery rewriting method. (Mark Miller via Mike
McCandless)
* LUCENE-1575: As of 2.9, the core collectors as well as
IndexSearcher's search methods that return top N results, no
longer filter documents with scores <= 0.0. If you rely on this
functionality you can use PositiveScoresOnlyCollector like this:
<code>
TopDocsCollector tdc = TopScoreDocCollector.create(10, true);
Collector c = new PositiveScoresOnlyCollector(tdc);
searcher.search(query, c);
TopDocs hits = tdc.topDocs();
...
</code>
* LUCENE-1604: IndexReader.norms(String field) is now allowed to
return null if the field has no norms, as long as you've
previously called IndexReader.setDisableFakeNorms(true). This
setting now defaults to false (to preserve the fake norms back
compatible behavior) but in 3.0 will be hardwired to true. (Shon
Vella via Mike McCandless).
* LUCENE-1624: If you open IndexWriter with create=true and
autoCommit=false on an existing index, IndexWriter no longer
writes an empty commit when it's created. (Paul Taylor via Mike
McCandless)
* LUCENE-1593: When you call Sort() or Sort.setSort(String field,
boolean reverse), the resulting SortField array no longer ends
with SortField.FIELD_DOC (it was unnecessary as Lucene breaks ties
internally by docID). (Shai Erera via Michael McCandless)
* LUCENE-1542: When the first token(s) have 0 position increment,
IndexWriter used to incorrectly record the position as -1, if no
payload is present, or Integer.MAX_VALUE if a payload is present.
This caused positional queries to fail to match. The bug is now
fixed, but if your app relies on the buggy behavior then you must
call IndexWriter.setAllowMinus1Position(). That API is deprecated
so you must fix your application, and rebuild your index, to not
rely on this behavior by the 3.0 release of Lucene. (Jonathan
Mamou, Mark Miller via Mike McCandless)
* LUCENE-1715: Finalizers have been removed from the 4 core classes
that still had them, since they will cause GC to take longer, thus
tying up memory for longer, and at best they mask buggy app code.
DirectoryReader (returned from IndexReader.open) & IndexWriter
previously released the write lock during finalize.
SimpleFSDirectory.FSIndexInput closed the descriptor in its
finalizer, and NativeFSLock released the lock. It's possible
applications will be affected by this, but only if the application
is failing to close reader/writers. (Brian Groose via Mike
McCandless)
* LUCENE-1717: Fixed IndexWriter to account for RAM usage of
buffered deletions. (Mike McCandless)
* LUCENE-1727: Ensure that fields are stored & retrieved in the
exact order in which they were added to the document. This was
true in all Lucene releases before 2.3, but was broken in 2.3 and
2.4, and is now fixed in 2.9. (Mike McCandless)
* LUCENE-1678: The addition of Analyzer.reusableTokenStream
accidentally broke back compatibility of external analyzers that
subclassed core analyzers that implemented tokenStream but not
reusableTokenStream. This is now fixed, such that if
reusableTokenStream is invoked on such a subclass, that method
will forcefully fall back to tokenStream. (Mike McCandless)
* LUCENE-1801: Token.clear() and Token.clearNoTermBuffer() now also clear
startOffset, endOffset and type. This is not likely to affect any
Tokenizer chains, as Tokenizers normally always set these three values.
This change was made so that Token conforms to the new AttributeImpl.clear() and
AttributeSource.clearAttributes(), which behave identically whether Token serves
as the single implementation for all attributes or the 6 separate AttributeImpls
are used. (Uwe Schindler, Michael Busch)
* LUCENE-1483: When searching over multiple segments, a new Scorer is now created
for each segment. Searching has been telescoped out a level and IndexSearcher now
operates much like MultiSearcher does. The Weight is created only once for the top
level Searcher, but each Scorer is passed a per-segment IndexReader. This will
result in doc ids in the Scorer being internal to the per-segment IndexReader. It
has always been outside of the API to count on a given IndexReader to contain every
doc id in the index - and if you have been ignoring MultiSearcher in your custom code
and counting on this fact, you will find your code no longer works correctly. If a
custom Scorer implementation uses caches/filters that are keyed on the top-level
IndexReader, it will need to be updated to use per-segment (contextless)
caches/filters, e.g. you can no longer count on the IndexReader to contain any
given doc id or all of the doc ids. (Mark Miller, Mike McCandless)
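  For instance, a Filter that caches per reader should key its cache on the
  per-segment reader it is handed, roughly in the spirit of CachingWrapperFilter.
  This is only a sketch; the cache field and the compute() helper are hypothetical:
  <code>
    private final Map cache = Collections.synchronizedMap(new WeakHashMap());

    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
      DocIdSet cached = (DocIdSet) cache.get(reader);  // keyed on the per-segment reader
      if (cached == null) {
        cached = compute(reader);                      // hypothetical: build the set for this segment
        cache.put(reader, cached);
      }
      return cached;
    }
  </code>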
* LUCENE-1846: DateTools now uses the US locale to format the numbers in its
date/time strings instead of the default locale. For most locales there will
be no change in the index format, as DateFormatSymbols is using ASCII digits.
The usage of the US locale is important to guarantee correct ordering of
generated terms. (Uwe Schindler)
* LUCENE-1860: MultiTermQuery now defaults to
CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method (previously it
was SCORING_BOOLEAN_QUERY_REWRITE). This means that PrefixQuery
and WildcardQuery will now produce constant score for all matching
docs, equal to the boost of the query. (Mike McCandless)
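  To get the old scoring behavior back on a per-query basis, the rewrite method
  can be set explicitly (a sketch; field and prefix are placeholders):
  <code>
    PrefixQuery q = new PrefixQuery(new Term("title", "luc"));
    q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
  </code>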
API Changes
* LUCENE-1419: Add expert API to set custom indexing chain. This API is
package-protected for now, so we don't have to officially support it.
Yet, it will give us the possibility to try out different consumers
in the chain. (Michael Busch)
* LUCENE-1427: DocIdSet.iterator() is now allowed to throw
IOException. (Paul Elschot, Mike McCandless)
* LUCENE-1422, LUCENE-1693: New TokenStream API that uses a new class called
AttributeSource instead of the Token class, which is now a utility class that
holds common Token attributes. All attributes that the Token class had have
been moved into separate classes: TermAttribute, OffsetAttribute,
PositionIncrementAttribute, PayloadAttribute, TypeAttribute and FlagsAttribute.
The new API is much more flexible; it allows combining Attributes
arbitrarily and defining custom Attributes. The new API has the same
performance as the old next(Token) approach. For conformance with this new
API Tee-/SinkTokenizer was deprecated and replaced by a new TeeSinkTokenFilter.
(Michael Busch, Uwe Schindler; additional contributions and bug fixes by
Daniel Shane, Doron Cohen)
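  A rough sketch of what a filter looks like against the new API; the class
  below is purely illustrative:
  <code>
    public final class LowerCaseSketchFilter extends TokenFilter {
      private final TermAttribute termAtt;

      public LowerCaseSketchFilter(TokenStream input) {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
      }

      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;   // end of stream
        termAtt.setTermBuffer(termAtt.term().toLowerCase());
        return true;
      }
    }
  </code>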
* LUCENE-1467: Add nextDoc() and next(int) methods to OpenBitSetIterator.
These methods can be used to avoid additional calls to doc().
(Michael Busch)
* LUCENE-1468: Deprecate Directory.list(), which sometimes (in
FSDirectory) filters out files that don't look like index files, in
favor of new Directory.listAll(), which does no filtering. Also,
listAll() will never return null; instead, it throws an IOException
(or subclass). Specifically, FSDirectory.listAll() will throw the
newly added NoSuchDirectoryException if the directory does not
exist. (Marcel Reutegger, Mike McCandless)
* LUCENE-1546: Add IndexReader.flush(Map commitUserData), allowing
you to record an opaque commitUserData (maps String -> String) into
the commit written by IndexReader. This matches IndexWriter's
commit methods. (Jason Rutherglen via Mike McCandless)
* LUCENE-652: Added org.apache.lucene.document.CompressionTools, to
enable compressing & decompressing binary content, external to
Lucene's indexing. Deprecated Field.Store.COMPRESS.
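  Roughly, applications that used Field.Store.COMPRESS can now do the
  compression themselves (a sketch; the field name is a placeholder):
  <code>
    byte[] compressed = CompressionTools.compressString(largeText);
    doc.add(new Field("body", compressed, Field.Store.YES));   // stored binary field
    // ... later, after loading the stored document again:
    String restored = CompressionTools.decompressString(       // throws DataFormatException
        storedDoc.getBinaryValue("body"));
  </code>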
* LUCENE-1561: Renamed Field.omitTf to Field.omitTermFreqAndPositions
(Otis Gospodnetic via Mike McCandless)
* LUCENE-1500: Added new InvalidTokenOffsetsException to Highlighter methods
to denote issues when offsets in TokenStream tokens exceed the length of the
provided text. (Mark Harwood)
* LUCENE-1575, LUCENE-1483: HitCollector is now deprecated in favor of
a new Collector abstract class. For easy migration, people can use
HitCollectorWrapper which translates (wraps) HitCollector into
Collector. Note that this class is also deprecated and will be
removed when HitCollector is removed. Also TimeLimitedCollector
is deprecated in favor of the new TimeLimitingCollector which
extends Collector. (Shai Erera, Mark Miller, Mike McCandless)
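  A bare-bones Collector, just to show the shape of the new API (the class is
  illustrative and only counts hits):
  <code>
    public class CountingCollector extends Collector {
      int count;

      public void setScorer(Scorer scorer) {}                  // scores are not needed here
      public void collect(int doc) { count++; }                // doc is relative to the current reader
      public void setNextReader(IndexReader reader, int docBase) {}
      public boolean acceptsDocsOutOfOrder() { return true; }  // order does not matter for counting
    }
  </code>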
* LUCENE-1592: The method TermEnum.skipTo() was deprecated, because
  it is used nowhere in core/contrib and there is only a very inefficient
default implementation available. If you want to position a TermEnum
to another Term, create a new one using IndexReader.terms(Term).
(Uwe Schindler)
* LUCENE-1621: MultiTermQuery.getTerm() has been deprecated as it does
not make sense for all subclasses of MultiTermQuery. Check individual
subclasses to see if they support getTerm(). (Mark Miller)
* LUCENE-1636: Make TokenFilter.input final so it's set only
once. (Wouter Heijke, Uwe Schindler via Mike McCandless).
* LUCENE-1658, LUCENE-1451: Renamed FSDirectory to SimpleFSDirectory
(but left an FSDirectory base class). Added an FSDirectory.open
static method to pick a good default FSDirectory implementation
given the OS. FSDirectories should now be instantiated using
FSDirectory.open or with public constructors rather than
FSDirectory.getDirectory(), which has been deprecated.
(Michael McCandless, Uwe Schindler, yonik)
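  Typical new-style usage (the path below is illustrative):
  <code>
    Directory dir = FSDirectory.open(new File("/path/to/index"));  // picks a suitable impl for the OS
  </code>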
* LUCENE-1665: Deprecate SortField.AUTO, to be removed in 3.0.
Instead, when sorting by field, the application should explicitly
state the type of the field. (Mike McCandless)
* LUCENE-1660: StopFilter, StandardAnalyzer, StopAnalyzer now
require up front specification of enablePositionIncrement (Mike
McCandless)
* LUCENE-1614: DocIdSetIterator's next() and skipTo() were deprecated in favor
of the new nextDoc() and advance(). The new methods return the doc Id they
landed on, saving an extra call to doc() in most cases.
For easy migration of the code, you can change the calls to next() to
nextDoc() != DocIdSetIterator.NO_MORE_DOCS and similarly for skipTo().
However it is advised that you take advantage of the returned doc ID and not
call doc() following those two.
Also, doc() was deprecated in favor of docID(). docID() should return -1 or
NO_MORE_DOCS if nextDoc/advance were not called yet, or NO_MORE_DOCS if the
iterator is exhausted. Otherwise it should return the current doc ID.
(Shai Erera via Mike McCandless)
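  The recommended loop now looks roughly like this:
  <code>
    DocIdSetIterator it = docIdSet.iterator();
    int doc;
    while ((doc = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      // use doc directly; no extra doc()/docID() call needed
    }
  </code>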
* LUCENE-1672: All ctors/opens and other methods using String/File to
specify the directory in IndexReader, IndexWriter, and IndexSearcher
were deprecated. You should instantiate the Directory manually before
and pass it to these classes (LUCENE-1451, LUCENE-1658).
(Uwe Schindler)
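  In other words, open the Directory yourself and hand it in instead of passing
  a path (a sketch; the path and analyzer are placeholders):
  <code>
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriter writer = new IndexWriter(dir, analyzer,
        true, IndexWriter.MaxFieldLength.UNLIMITED);            // create=true
    IndexReader reader = IndexReader.open(dir, true);           // readOnly=true
  </code>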
* LUCENE-1407: Move RemoteSearchable, RemoteCachingWrapperFilter out
of Lucene's core into new contrib/remote package. Searchable no
longer extends java.rmi.Remote (Simon Willnauer via Mike
McCandless)
* LUCENE-1677: The global property
org.apache.lucene.SegmentReader.class, and
ReadOnlySegmentReader.class are now deprecated, to be removed in
3.0. src/gcj/* has been removed. (Earwin Burrfoot via Mike
McCandless)
* LUCENE-1673: Deprecated NumberTools in favour of the new
NumericRangeQuery and its new indexing format for numeric or
date values. (Uwe Schindler)
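  For example, a long-valued field indexed with NumericField can be queried
  like this (the field name and bounds are placeholders):
  <code>
    Query q = NumericRangeQuery.newLongRange("timestamp",
        0L, 1000L, true, true);    // inclusive lower and upper bounds
  </code>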
* LUCENE-1630, LUCENE-1771: Weight is now an abstract class, and adds
a scorer(IndexReader, boolean /* scoreDocsInOrder */, boolean /*
topScorer */) method instead of scorer(IndexReader). IndexSearcher uses
this method to obtain a scorer matching the capabilities of the Collector
wrt ordering of docIDs. Some Scorers (like BooleanScorer) are much more
efficient if out-of-order document scoring is allowed by a Collector.
Collector must now implement acceptsDocsOutOfOrder. If you write a
Collector which does not care about doc ID ordering, it is recommended
that you return true. Weight has a scoresDocsOutOfOrder method, which by
default returns false. If you create a Weight which will score documents
out of order if requested, you should override that method to return true.
BooleanQuery's setAllowDocsOutOfOrder and getAllowDocsOutOfOrder have been
deprecated as they are not needed anymore. BooleanQuery will now score docs
out of order when used with a Collector that can accept docs out of order.
Finally, Weight#explain now takes a sub-reader and sub-docID, rather than
a top level reader and docID.
(Shai Erera, Chris Hostetter, Martin Ruckli, Mark Miller via Mike McCandless)
* LUCENE-1466, LUCENE-1906: Added CharFilter and MappingCharFilter, which allow
  chaining & mapping of characters before tokenizers run. CharStream (subclass of
  Reader) is the base class for custom java.io.Readers that support offset
correction. Tokenizers got an additional method correctOffset() that is passed
down to the underlying CharStream if input is a subclass of CharStream/-Filter.
(Koji Sekiguchi via Mike McCandless, Uwe Schindler)
* LUCENE-1703: Add IndexWriter.waitForMerges. (Tim Smith via Mike
McCandless)
* LUCENE-1625: CheckIndex's programmatic API now returns separate
classes detailing the status of each component in the index, and
includes more detailed status than previously. (Tim Smith via
Mike McCandless)
* LUCENE-1713: Deprecated RangeQuery and RangeFilter and renamed to
TermRangeQuery and TermRangeFilter. TermRangeQuery is in constant
score auto rewrite mode by default. The new classes also have new
ctors taking field and term ranges as Strings (see also
LUCENE-1424). (Uwe Schindler)
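  A sketch of the new String-based ctor (the field and bounds are placeholders):
  <code>
    Query q = new TermRangeQuery("author", "a", "m",
        true, false);    // includeLower=true, includeUpper=false
  </code>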
* LUCENE-1609: The termInfosIndexDivisor must now be specified
up-front when opening the IndexReader. Attempts to call
IndexReader.setTermInfosIndexDivisor will hit an
UnsupportedOperationException. This was done to enable removal of
all synchronization in TermInfosReader, which previously could
cause threads to pile up in certain cases. (Dan Rosher via Mike
McCandless)
* LUCENE-1688: Deprecate the static final String stop word array in
  StopAnalyzer and replace it with an immutable implementation of
CharArraySet. (Simon Willnauer via Mark Miller)
* LUCENE-1742: SegmentInfos, SegmentInfo and SegmentReader have been
made public as expert, experimental APIs. These APIs may suddenly
change from release to release (Jason Rutherglen via Mike
McCandless).
* LUCENE-1754: QueryWeight.scorer() can return null if no documents
are going to be matched by the query. Similarly,
Filter.getDocIdSet() can return null if no documents are going to
be accepted by the Filter. Note that these 'can' return null;
they don't have to, and may instead return a Scorer/DocIdSet that
matches no documents / rejects all documents. This is already the
behavior of some QueryWeight/Filter implementations, and is
documented here just for emphasis. (Shai Erera via Mike
McCandless)
* LUCENE-1705: Added IndexWriter.deleteAllDocuments. (Tim Smith via
Mike McCandless)
* LUCENE-1460: Changed TokenStreams/TokenFilters in contrib to
use the new TokenStream API. (Robert Muir, Michael Busch)
* LUCENE-1748: LUCENE-1001 introduced PayloadSpans, but this was a back
compat break and caused custom SpanQuery implementations to fail at runtime
in a variety of ways. This issue attempts to remedy things by causing
a compile time break on custom SpanQuery implementations and removing
the PayloadSpans class, with its functionality now moved to Spans. To
help in alleviating future back compat pain, Spans has been changed from
an interface to an abstract class.
(Hugh Cayless, Mark Miller)
* LUCENE-1808: Query.createWeight has been changed from protected to
public. (Tim Smith, Shai Erera via Mark Miller)
* LUCENE-1826: Add constructors that take AttributeSource and
AttributeFactory to all Tokenizer implementations.
(Michael Busch)
* LUCENE-1847: Similarity#idf for both a Term and Term Collection have
been deprecated. New versions that return an IDFExplanation have been
added. (Yasoja Seneviratne, Mike McCandless, Mark Miller)
* LUCENE-1877: Made NativeFSLockFactory the default for
the new FSDirectory API (open(), FSDirectory subclass ctors).
All FSDirectory system properties were deprecated and all lock
implementations use no lock prefix if the locks are stored inside
the index directory. Because the deprecated String/File ctors of
IndexWriter and IndexReader (LUCENE-1672) and FSDirectory.getDirectory()
still use the old SimpleFSLockFactory while the new API uses
NativeFSLockFactory, we strongly recommend not to mix the deprecated
and new APIs. (Uwe Schindler, Mike McCandless)
* LUCENE-1911: Added a new method isCacheable() to DocIdSet. This method
should return true, if the underlying implementation does not use disk
I/O and is fast enough to be directly cached by CachingWrapperFilter.
OpenBitSet, SortedVIntList, and DocIdBitSet are such candidates.
The default implementation of the abstract DocIdSet class returns false.
In this case, CachingWrapperFilter copies the DocIdSetIterator into an
OpenBitSet for caching. (Uwe Schindler, Thomas Becker)
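  A custom in-memory DocIdSet can opt in to direct caching by overriding the
  new method (a sketch; the class below is hypothetical):
  <code>
    public class MyInMemoryDocIdSet extends DocIdSet {
      private final OpenBitSet bits = new OpenBitSet();

      public DocIdSetIterator iterator() { return bits.iterator(); }

      public boolean isCacheable() { return true; }   // fully in memory, no disk I/O
    }
  </code>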
Bug fixes
* LUCENE-1415: MultiPhraseQuery has incorrect hashCode() and equals()
implementation - Leads to Solr Cache misses.
(Todd Feak, Mark Miller via yonik)
* LUCENE-1327: Fix TermSpans#skipTo() to behave as specified in javadocs
  of Spans#skipTo(). (Michael Busch)
* LUCENE-1573: Do not ignore InterruptedException (caused by
Thread.interrupt()) nor enter deadlock/spin loop. Now, an interrupt
will cause a RuntimeException to be thrown. In 3.0 we will change
public APIs to throw InterruptedException. (Jeremy Volkman via
Mike McCandless)
* LUCENE-1590: Fixed stored-only Field instances so that they do not change the
value of omitNorms, omitTermFreqAndPositions in FieldInfo; when you
retrieve such fields they will now have omitNorms=true and
omitTermFreqAndPositions=false (though these values are unused).
(Uwe Schindler via Mike McCandless)
* LUCENE-1587: RangeQuery#equals() could consider a RangeQuery
without a collator equal to one with a collator.
(Mark Platvoet via Mark Miller)
* LUCENE-1600: Don't call String.intern unnecessarily in some cases
when loading documents from the index. (P Eger via Mike
McCandless)
* LUCENE-1611: Fix case where OutOfMemoryException in IndexWriter
could cause "infinite merging" to happen. (Christiaan Fluit via
Mike McCandless)