Skip to content

JAMES-4046 Upgrade Lucene #2342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

Conversation

Arsnael
Copy link
Contributor

@Arsnael Arsnael commented Jul 9, 2024

Just tried to give it a shot to upgrade directly to latest lucene version from Quan's work : #2315

in the lucene module I have 3 tests failing left, all returning the following error:

java.lang.IllegalArgumentException: cannot change field "uid" from doc values type=NUMERIC to inconsistent doc values type=NONE

Maybe @quantranhong1999 has an idea? :)

I have conscience that there is still some tests that have been disabled and of course didn't check mpt tests yet (one step at a time)

@Arsnael Arsnael self-assigned this Jul 9, 2024
@quantranhong1999
Copy link
Member

Hi Rene,

Thanks for putting this together!

java.lang.IllegalArgumentException: cannot change field "uid" from doc values type=NUMERIC to inconsistent doc values type=NONE

It seems we declared the uid field with DocValues NUMERIC (to be able to search/sort) but somehow during update the flag (modified the FLAGS_FIELD of the document at

writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
), it seems somehow the DocValues for the uid field broke to NONE?

@quantranhong1999
Copy link
Member

I have pushed 2 commits to fix the above 3 tests and another issue: https://github.com/quantranhong1999/james-project/commits/upgrade-lucene-rene/

Now we have 4 MPT failing tests left:
image

quantranhong1999 and others added 3 commits July 11, 2024 14:53
fix 3 contact tests failure: `java.lang.IllegalArgumentException: cannot change field "uid" from doc values type=NUMERIC to inconsistent doc values type=NONE`
subject field should accept multiple values

cf opensearch implementation
@quantranhong1999
Copy link
Member

I am stuck at debugging the MPT test (again).

It seems in the end (at the failing assertion), searching by flag returns no UIDs leading to James just returning all possible UIDs in the mailboxes.

I tried:

  • add await in MPT test -> not work
  • TextField, Store.Yes for the flags field -> not work
  • partial update -> only work with DocValues field -> try DocValues for the flags field -> seems not suitable
  • writer.commit everywhere -> not work
  • apply delete / write = true when declaring the IndexReader -> not work

Do you have any other clue or idea @Arsnael?

@Arsnael
Copy link
Contributor Author

Arsnael commented Jul 12, 2024

I think something is wrong with flags update in general, and maybe the search query on flags too.

On testSearch tests, I tried to modify a bit the script. I tried to add the flag flagged to just messages 3 and 4 and do a search on it, it returns 3 and 4.

But then search unflagged => returns all messages.

I tried to debug yesterday as well, I felt like maybe (aint completely sure) we were creating more documents each time with updates...

I think we might need to review the flags part of the code. In the createFlagQuery for example you can see some comments like // lucene does not support simple NOT queries so we do some nasty hack here . Maybe lucenesupports it now for example and the hack is creating issues now? Things like that...

Or also I wondered why flags and messages are being in separated documents with lucene? Was it because of some limitations of the version 2? Maybe would make sense in latest version to have it all organized in just one document? It looks like to me we have it all in one doc in opensearch for example if I'm not wrong

@quantranhong1999
Copy link
Member

I tried to debug yesterday as well, I felt like maybe (aint completely sure) we were creating more documents each time with updates...

Yes very likely, when Lucene updates the document, it would delete and create a new document.

Somehow multiple updates on the same document break the flags field's data.

Or also I wondered why flags and messages are being in separated documents with lucene? Was it because of some limitations of the version 2? Maybe would make sense in latest version to have it all organized in just one document? It looks like to me we have it all in one doc in opensearch for example if I'm not wrong

Yes. I think we could try to refactor that. But I am not sure if we should do that in this upgrade or in another ticket.

@Arsnael
Copy link
Contributor Author

Arsnael commented Jul 16, 2024

Pushing the debugging further...

I modified the search test to:

  • append 4 messages
  • store Flagged flag on message 3 and 4
  • search messages with Flagged
  • search unflagged messages

Fails on last step where it returns all messages.

By modifying the code in the search query for flags instead of doing a not flags, I did a search on all documents with flags.

Returns 6 documents (and not 4). The original 4 plus the 2 extra that were created when adding the Flagged flag.

Where I'm confused is that according to the lucene doc, the updateDocument method is supposed to delete the current doc and then create a new one. Or here it's obvious the old one does not get deleted.

Why? Need to search more. Maybe there is an other way to do an update in recent lucene? Maybe a bug? Maybe the delete is async?...

I might try as well just deleting and adding the document

@quantranhong1999
Copy link
Member

Exactly the issue I suspect before, but I could not really prove or solve it.
I tried to commit, await (just to make sure if the delete is async), or make sure open a new reader (so the reader reflects the new document changes), but those did not help...

@Arsnael
Copy link
Contributor Author

Arsnael commented Jul 16, 2024

@quantranhong1999

When I read the javadoc of IndexWriter => https://lucene.apache.org/core/9_11_1/core/org/apache/lucene/index/IndexWriter.html#close()

I understand that we might need to review our writer usage. It seems to wait for a call to close() for example before flushing out changes (however our writer is always opened)

But I tried quickly a commit() after the updateDocument(). It's costly and shouldnt be used in prod code but just for testing... Unfortunately I still got 6 documents :(

So I'm not really sure to understand anymore what we do wrong here... Could try an other time a quick refactoring where I open and close the writer after usage everywhere but... will see

@Arsnael
Copy link
Contributor Author

Arsnael commented Jul 17, 2024

I think I'm gonna drop it for now...

I tried to refactor the code to open and close the IndexWriter after each use in the code, I tried to use commit, forceMergeDeletes methods, tried to explicitly delete the old doc and add the new one, tried to change the FSDirectory (MMapDirectory, NIOFSDirectory)... Does not change anything.

Tried changing the deletion policy, the merging policy...

I think there is probably something we are missing during that migration, with number of systems using lucene I hardly believe we have a bug here... but I'm a bit at a loss here.

@woj-tek
Copy link
Contributor

woj-tek commented Aug 1, 2024

I am stuck at debugging the MPT test (again).

Just to be sure - those are tests in james-project/mpt/impl/imap-mailbox/lucenesearch (as main mailbox/lucene module builds without issue)? Anything else to keep in mind?

I'm asking as running all tests for James project takes a long time (and some are flaky)

@chibenwa
Copy link
Contributor

chibenwa commented Aug 2, 2024

Yes those are the most relevant ones.

@Arsnael
Copy link
Contributor Author

Arsnael commented Aug 5, 2024

@woj-tek if you can clear the mpt imap tests for lucene, we are likely good yes :)

@woj-tek
Copy link
Contributor

woj-tek commented Aug 5, 2024

I'm looking at src/main/resources/org/apache/james/imap/scripts/Search.test and trying to figure it out and as IMAP is not my forte I'd like to have confirmation I got it right :) We

  • create a mailbox

    • search for flagged messages (yield none)
  • set flag for all messages (1-till-end)

    • search for flagged (yield 4 messages)
  • clear flag for messages from 3rd till end

    • search for flagged (yield 2 messages, first and second)
    • serach for not flagged messages (yield 2, third and fourth)

?


Or also I wondered why flags and messages are being in separated documents with lucene? Was it because of some limitations of the version 2? Maybe would make sense in latest version to have it all organized in just one document? It looks like to me we have it all in one doc in opensearch for example if I'm not wrong

Yes. I think we could try to refactor that. But I am not sure if we should do that in this upgrade or in another ticket.

Was there a consensus why it was done that way and if it should be changed?

@Arsnael
Copy link
Contributor Author

Arsnael commented Aug 6, 2024

@woj-tek I think you got it right about what the Search.test is supposed to do yes :)

Was there a consensus why it was done that way and if it should be changed?

No idea honestly, that Lucene implementation is very old and was done way before I join the project I believe.

But it would not change the fact that for some reason Lucene does not seem to remove the old documents when doing its updates, so it just creates new ones, and that's why the Search.test is failing.

@chibenwa
Copy link
Contributor

chibenwa commented Aug 6, 2024

Yes for the test scenari.

I believe the current mpt test has value as it allowed catching issues regular unit tests did not. I would be extremely caitious if alterations are needed.

Btw we pass this very test with opensearch meaning this is doable on a lucene based setup...

Thanks for having a look @woj-tek !

@woj-tek
Copy link
Contributor

woj-tek commented Aug 6, 2024

Thank you @Arsnael for explanation.

I believe the current mpt test has value as it allowed catching issues regular unit tests did not. I would be extremely caitious if alterations are needed.

I totally agree that MPT tests are very valuable (more akin "integration tests"). Though trying to understand the issue and what's going on so more querying for information and not even suggesting changing those :)

Btw we pass this very test with opensearch meaning this is doable on a lucene based setup...

This should be more than doable, especially considering that it's working with current implementation - correct?

Though as @Arsnael mention the issue seems to be different. What bugs me is why actual unit-tests pass.

Looking at org.apache.james.mailbox.store.search.AbstractMessageSearchIndexTest#searchShouldReturnSeenMessagesWhenFlagsGotUpdated it should be somewhat similar case.

Ideally I'd like to create a unit-test in apache-james-mailbox-lucene that shows the issue and take it from there :)

@woj-tek
Copy link
Contributor

woj-tek commented Aug 9, 2024

I think the issue boils down to the (U)IDs not being matched properly somehow as mentioned in #2342 (comment)

https://github.com/apache/james-project/pull/2342/files#diff-a7c2a3c5cdb7e4a2914c899409991e27df6b25ad54488f197bc533193e3a03d0R1275-R1281

Doing very silly and naive writer.deleteDocuments(queryBuilder.build()); just after ScoreDoc[] sDocs = docs.scoreDocs; makes the test (MTP) pass.

So it seems the queried documents are correct and they even can be deleted but then re-defining the IDs makes the update fail because it can't match the documents that should be updated?

(I have zero experience with Lucene so trying to figure out stuff as I go :) )

@woj-tek
Copy link
Contributor

woj-tek commented Aug 10, 2024

tl,dr;

It seems that re-adding ID field (akin to re-adding UID_FIELD) in org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update() fixes the issue (MTP test are all green):

final String text = doc.get(ID_FIELD);
doc.add(new StringField(ID_FIELD, text, Store.YES));

Could someone verify it? :)

EDIT: It seems to re-add the field each time it's updated still, so additional call to remove field on update would be required but otherwise it seems to work (?!). It's somewhat counterintuitive to have multiple fields with the same name possible...


OK, my previous comment was off, but I've been digging into it more and more (and learning Lucene), running simpler tests (even just in Lucene). I even wrote on Lucene mailing list :)

In general, as per information from the list re-opening the reader should be enough:

When you open an IndexReader, it is locked to the view of the Lucene
directory at the time that it's opened.

If you make changes, you'll need to open a new IndexReader before those
changes are visible. I see that you tried creating a new IndexSearcher, but
unfortunately that's not sufficient.

What I noticed when running the tests was that adding the mail+flag document works and then first update of the flag works as well. Only subsequent updates fail.

This made me thing that for some reason new Term(ID_FIELD, doc.get(ID_FIELD)) has to be failing and indeed when looking for documents manually (new IndexReader) it doesn't yield results for newly updated documents as if those IDs weren't matched (but they are visible in the debugger and string output!).

So I decide to explicitly re-apply the ID when updating as it was done with the UID field: doc.add(new StringField(doc.get(ID_FIELD), text, Store.YES)); and it seems that this address the issue (?!)

I'm still not sure why those fields are cleared…


EDIT:

Or also I wondered why flags and messages are being in separated documents with lucene? Was it because of some limitations of the version 2? Maybe would make sense in latest version to have it all organized in just one document? It looks like to me we have it all in one doc in opensearch for example if I'm not wrong

Yes. I think we could try to refactor that. But I am not sure if we should do that in this upgrade or in another ticket.

Was there a consensus why it was done that way and if it should be changed?

No idea honestly, that Lucene implementation is very old and was done way before I join the project I believe.

Pondering it - maybe it was efficiency? Considering that when updating it removes and re-adds the document, thus updating only small-ish flag document would be better than updating whole email document each time flag is changed?

@woj-tek
Copy link
Contributor

woj-tek commented Aug 10, 2024

Some more information - it looks like for some reason ID field (flags-<uid>-<mid>) is tokenized in the end and because it follows the spec it splits the value at dashes:

Lucene's StandardTokenizer for 9.11 uses the Unicode Text Segmentation
algorithm, as specified in Unicode Standard Annex #29
http://unicode.org/reports/tr29/;.
That standard contains a "-" as a word breaker.

I was looking at the createFlagsDocument() (it add ID of the mailbox and of the messages as separate field) and at update() method, which uses UID+MID in the search query to find the document, but then uses Term (which requires string hence falling back to the ID_FIELD with flags-<uid>-<mid>) but maybe instead using updateDocument() method requiring Term we could use one that uses query (which we could re-use from the initial search for UID+MID combination):

writer.updateDocuments(queryBuilder.build(), List.of(doc));

From the quick test it seems it works OK.

@woj-tek
Copy link
Contributor

woj-tek commented Aug 11, 2024

Changes in tigase@e5fe401

Not sure how to submit those to James repo/this PR (I used gh to checkout this PR, should I create yet another PR? It seems like the only option; alternatively cherry-picking the commit to this PR?). The branch itself: https://github.com/tigase/james-project/tree/lucene-flags-test.

Both james-project/mailbox/lucene and james-project/mpt/impl/imap-mailbox/lucenesearch builds correctly while running mvn clean install -Dcheckstyle.skip=true

The only thing that "doesn't work" is checking state of the Lucene repository after running the test - it should has only 8 documents but reports 10... (line 358 in LuceneMailboxMessageFlagSearchTest.java) It doesn't impact queries for particular message flag documents but still it is a discrepancy. I read that one should check if document is marked as deleted but I think it applies to older Lucene version and I havent't seen similar API in current/updated version. As I said - it's checking Lucene internals so I commented out -- I think it's not relevant here but left it so it could be checked.

The issue with StringField (for ID_FIELD) being tokenized may be related to https://issues.apache.org/jira/browse/LUCENE-7171 / apache/lucene#8226 -- from what I can see it was reported for version 5.5 so it would fall just between original lucene version and updated version in James

PS. If someone has a formatter for IDEA that would produce sources formatted to checkstyle liking I'd be very thankfull if shared :)

@Arsnael
Copy link
Contributor Author

Arsnael commented Aug 12, 2024

Hi @woj-tek !
Huge investigation work here, thank you very much :)

So the issue is with the StringField ID_FIELD tokenization correct? You drop in the end the ID_FIELD and update the document with the query instead. Not too sure to get what exactly happens there but looks good to me :)

You want me to cherry-pick your commit on this PR or you want to open your own PR with the fix? :)

Thanks for the hard work, really appreciated :)

@quantranhong1999
Copy link
Member

Thanks a lot @woj-tek for your hard work and putting this together!

PS. If someone has a formatter for IDEA that would produce sources formatted to checkstyle liking I'd be very thankfull if shared :)

There you go: https://github.com/apache/james-project/blob/master/src/site/xdoc/server/dev-build.xml#L138

@uschindler
Copy link

uschindler commented Aug 12, 2024

Hi,

The issue with StringField (for ID_FIELD) being tokenized may be related to https://issues.apache.org/jira/browse/LUCENE-7171 / apache/lucene#8226 -- from what I can see it was reported for version 5.5 so it would fall just between original lucene version and updated version in James

Uwe here from Lucene. Actually to say it "mildly", the current code (as it already exists) in James is not working correct at all. Your test cases worked with previous versions of Lucene, because they were just testing some arbitrary IDs which luckily analyze correctly.

The issue is the following: apache/lucene#8226 is not a valid bug and will never be fixed in Lucene, because theres nothing wrong. The issue here is wrong expectations and missing knowledge about how Lucene works. My recommendation would be to remove the current version of Lucene support completely and rewrite it, ideally maybe with a configurable Index schema or better - usage of Apache Solr, Elasticsearch or Opensearch (to have the indexing clearly separated and scaleable for huge installations). E.g., Dovecot IMAP server has support for Apache Solr or Elasticsearch, so indexing e-Mail is straight forwards and by adapting the schema files shipped with the repo, it is possible to customize the text analysis without changing the code.

The following is the main issue: You cannot update documents in Lucene with the following pattern: Read document's stored fields with IndexReader/IndexSearcher, modify the stored fields and then write the Document instance back. This is - unfortunately - possible API-wise - but won't work. IndexReader/IndexSearcher#getDocument returns only stored fields and has no metadata about the fields behind. Because it only reads stored fields in the format they were STORED, reindexing with the original settings used during INDEXING won't work. Lucene does not know how the document was indexed originally, the information is not stored anywhere in index. When you then reindex the document it will aply default settings for all the fields in the Document (which is analyzed/tokenized with the default analyzer). This will transform all StringField instances to TextField. In addition numeric fields and DocValues fields will all change to analyzed text. In Lucene 4.x there were several approaches to separate the API of IndexReader#getDocument from IndexWriter#updateDocument, but this was not successful. So the API trap is still there (you can read stored fields from index and reindex them, but unfortunately with default settings).

The correct way to update document in Lucene is the following: Rebuild the document using the same code which was used during indexing - don't use any information from index. E.g., read your document from the database/mail folder/EML file/.... and create a completely new document with applying the indexing schema that was configured by the user (languages). Important is also to index IDs as StringField, because any other field type is not supported for IDs. When you do this, the document is reachable easily using its IDs (case sensitive) and can be updated.

The additional documents not deleted in your tests are coming from exactly that problem: The field was indexed correctly using StringField as ID, but later updated and therefor the ID field got a TextField. On the next update, the update document wasn't able to delete a document, because it wasn't found anymore (as the ID was tokenized and no longer a StringField, suitable for an ID). This explains why you see more documents. So your tests are failing correctly!

In earlier versions of Lucene the problems caused by reindexing stored fields were not so bad, becaus ethe Analyzers are different and possinly UUIDs were not tokenized, but in Lucene 3.x there was also the possibility that a little bit more information was saved the document returned by IndexReader#getDocument so the IndexWriter code was better in restoring the original field type. Since Lucene 4 this is no longer possible. Technically, it would be better if Lucene would throw an UOE when you try to reindex a document that was retrieved via IndexReader#getDocument. Maybe we should work on this again to make this wrong use of Lucene's APIs impossible in the future.

Some more ideas:

  • If your documents never change in the index excapt some flags (I imaging the IMAP flags like read/unread), there is a new DocValues field type, which can be updated in place, without reindexing document. You still need to reopen IndexReader/NRT reader, but you can tell IndexWriter to just update a docvalues field. Keep in mind that those fields can't be indexed or have stored fields, in-place updates only work when the field is solely a docvalues field (numeric one). So this is ideal for flags or click counts in online shops. To query docvalues only fields there are special queries which so a linear scan, but work perfectly as additional filter. Like "find all coument containing text XY that are unread". The main query is the full text query and the flag "unread" is just a filter applied.
  • Make sure to not manually reopen IndexReader, but use SearcherManager which makes sure to do this automatically. This feature uses NRT (Near Realtime Search), so ideal for searches on recently indexed updates. SearcherManager is new in Lucene 4+ and is the recommended way to most performantly makes documents visible for search without even committing the IndexWriter
  • keep IndexWriter and IndexReader open all the time (by using SearcherManager, see above). Opening and closing them kills performance as it causes JVM Safepoints (which slow down in highly concurrent environments in JDK 19+ due to shared arenas). Basically, closing an Index will temporarily halt all other threads, so it is important to have a separate lifetime management of IndexWriters and IndexReaders. All classes as threadsafe and you can hammer IndexReader/IndexWriter with tons of threads.

Uwe

@woj-tek
Copy link
Contributor

woj-tek commented Aug 12, 2024

Hi Uwe and thank you so much for the very insightful comments. Now it makes much more sense :)

My recommendation would be to remove the current version of Lucene support completely and rewrite it, ideally maybe with a configurable Index schema or better - usage of Apache Solr, Elasticsearch or Opensearch (to have the indexing clearly separated and scaleable for huge installations). E.g., Dovecot IMAP server has support for Apache Solr or Elasticsearch, so indexing e-Mail is straight forwards and by adapting the schema files shipped with the repo, it is possible to customize the text analysis without changing the code.

James already supports OpenSearch and I think it's preferred way. Though Lucene implementation is handy for small deployments where having single (or better yet, limited number of services) is convenient.

As I said before - I had zero knowledge about Lucene just a couple of days ago and I was just trying to make it work by looking at various documentations / SO and so forth.

The correct way to update document in Lucene is the following: Rebuild the document using the same code which was used during indexing - don't use any information from index. E.g., read your document from the database/mail folder/EML file/.... and create a completely new document with applying the indexing schema that was configured by the user (languages). Important is also to index IDs as StringField, because any other field type is not supported for IDs. When you do this, the document is reachable easily using its IDs (case sensitive) and can be updated.

@ james team - should we do this or go further with DocValues mentioned?

@uschindler - I would be helpful if above information could somehow be included in Lucene javadocs so it would be more clear what to use and what to avoid.

woj-tek referenced this pull request in tigase/james-project Aug 12, 2024
…p using ID_FIELD for flags and query documents directly as well as use same query for update the documents instead of relying on Term with ID_FIELD;

Unit test that reproduces the failing MTP search test;
@Arsnael
Copy link
Contributor Author

Arsnael commented Aug 12, 2024

Hi @uschindler

Thanks for that long explanation on how Lucene is working and what seems to be our issues. TBH I don't think people here are familiar with Lucene, it was a very old implementation made by other people and we just tried to upgrade without having much knowledge around it.

As @woj-tek said, we have different implementation already, like with OpenSearch, but seems some people like @woj-tek in our community are still using Lucene as a more lightweight approach for hosting their own mail server.

With all those information we could likely propose something better as Lucene indexing implementation then I agree. I think we could try to do this!

@woj-tek
Copy link
Contributor

woj-tek commented Aug 12, 2024

You want me to cherry-pick your commit on this PR or you want to open your own PR with the fix? :)

@Arsnael
Created new PR: #2373 so it's easier push it forward :)

PS. If someone has a formatter for IDEA that would produce sources formatted to checkstyle liking I'd be very thankfull if shared :)

There you go: https://github.com/apache/james-project/blob/master/src/site/xdoc/server/dev-build.xml#L138

@quantranhong1999 Thanks. I guess there is no ready-made styling/formatter file?

As @woj-tek said, we have different implementation already, like with OpenSearch, but seems some people like @woj-tek in our community are still using Lucene as a more lightweight approach for hosting their own mail server.

Yes, and we try to have relatively small mail server thus having single service is beneficial for us :)

@gautamworah96
Copy link

Hi @uschindler

I didn't understand a few details

Lucene does not know how the document was indexed originally, the information is not stored anywhere in index. When you then reindex the document it will aply default settings for all the fields in the Document (which is analyzed/tokenized with the default analyzer). This will transform all StringField instances to TextField.

Let's take the code here as a reference for the purpose of this discussion. In the code, on L1293#LuceneMessageSearchIndex.java, when they tried using a TermQuery on the ID field, that too failed. That atleast should've worked right? Sure Lucene does not store indexing information, but it would've stored the untokenized ID right (without info on whether or not it was tokenized)? All the TermQuery had to do was match against the terms indexed in the ID field?

Even the updates made on L1282 used the same StringField

@chibenwa
Copy link
Contributor

Could this get closed?

@Arsnael Arsnael closed this Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants