JAMES-4046 Upgrade Lucene #2342
Conversation
- message2 should not have the same uid as message1
- update/search for message2 should target mailbox2, not mailbox1
Hi Rene, Thanks for putting this together!
It seems we declared the uid field at Line 1276 in 5ffd1bd. Did the uid field's doc values type break to NONE?
I have pushed 2 commits to fix the above 3 tests and another issue: https://github.com/quantranhong1999/james-project/commits/upgrade-lucene-rene/
- fix 3 contact test failures: `java.lang.IllegalArgumentException: cannot change field "uid" from doc values type=NUMERIC to inconsistent doc values type=NONE` (see the sketch below this list)
- subject field should accept multiple values, cf. the OpenSearch implementation
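As far as I understand, Lucene 9 enforces a consistent per-field schema across documents, so this error typically appears when one code path indexes `uid` with NUMERIC doc values and another indexes it without any. A minimal sketch of that failure mode, with purely illustrative names (not the actual James code):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class UidDocValuesSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // First document: "uid" carries NUMERIC doc values.
            Document first = new Document();
            first.add(new NumericDocValuesField("uid", 1L));
            writer.addDocument(first);

            // Second document: "uid" is present but carries no doc values (type NONE).
            // Expected to fail with: cannot change field "uid" from doc values
            // type=NUMERIC to inconsistent doc values type=NONE
            Document second = new Document();
            second.add(new StringField("uid", "2", Field.Store.YES));
            writer.addDocument(second);

            // Fix: index "uid" the same way for every document, e.g. always add a
            // NumericDocValuesField("uid", ...) alongside any stored/indexed form.
        }
    }
}
```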
I am stuck debugging the MPT test (again). It seems that in the end (at the failing assertion), searching by flag returns no UIDs, so James just returns all possible UIDs in the mailboxes. I tried:
Do you have any other clue or idea, @Arsnael?
I think something is wrong with flags updates in general, and maybe with the search query on flags too. On the testSearch tests I tried to modify the script a bit: I added the flag Flagged to just messages 3 and 4 and searched on it, and it returns 3 and 4. But then a search for unflagged returns all messages. I tried to debug yesterday as well, and I felt like maybe (not completely sure) we were creating more documents with each update... I think we might need to review the flags part of the code. In createFlagQuery, for example, you can see some comments like …

I also wondered why flags and messages are kept in separate documents with Lucene. Was it because of some limitation of version 2? Maybe it would make sense in the latest version to have it all organized in just one document? If I'm not wrong, it looks to me like we have it all in one document in OpenSearch, for example.
Yes, very likely: when Lucene updates a document, it deletes it and creates a new one. Somehow multiple updates on the same document break the flags field's data.
Yes. I think we could try to refactor that, but I am not sure whether we should do that in this upgrade or in another ticket.
Pushing the debugging further... I modified the search test to:
It fails on the last step, where it returns all messages. By modifying the code in the search query for flags, instead of doing a "not flags" query I did a search on all documents with flags. It returns 6 documents (and not 4): the original 4 plus the 2 extra that were created when adding the Flagged flag.

What confuses me is that according to the Lucene doc, the updateDocument method is supposed to delete the current doc and then create a new one. Yet here the old one obviously does not get deleted. Why? I need to dig more. Maybe there is another way to do an update in recent Lucene? Maybe a bug? Maybe the delete is async?... I might also try just deleting and then adding the document.
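For reference, a hedged sketch of what the IndexWriter javadoc describes for updateDocument (the field name and value below are made up for illustration): the update is a delete-by-exact-term followed by an add, so if the term no longer matches the previously indexed document, nothing is deleted and a duplicate accumulates.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class UpdateSemanticsSketch {
    // updateDocument(term, doc) atomically deletes every document whose field
    // contains `term` as an exact, untokenized term, then adds `doc`.
    static void update(IndexWriter writer, Document doc) throws IOException {
        writer.updateDocument(new Term("id", "some-flag-doc-id"), doc);
        // Roughly equivalent to (minus the atomicity):
        //   writer.deleteDocuments(new Term("id", "some-flag-doc-id"));
        //   writer.addDocument(doc);
        // If "id" ended up analyzed/tokenized, the exact term "some-flag-doc-id"
        // no longer exists, the delete matches nothing, and the old document stays.
    }
}
```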
Exactly the issue I suspected before, but I could not really prove or solve it.
When I read the javadoc of IndexWriter => https://lucene.apache.org/core/9_11_1/core/org/apache/lucene/index/IndexWriter.html#close() I understand that we might need to review our writer usage. It seems to wait for a call to close(), for example, before flushing out changes (however, our writer is always open). But I quickly tried a commit() after the updateDocument(). It's costly and shouldn't be used in prod code, but just for testing... Unfortunately I still got 6 documents :( So I'm not really sure I understand anymore what we are doing wrong here... I could try a quick refactoring another time where I open and close the writer after each usage everywhere, but... will see.
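For what it's worth, a sketch of the near-real-time pattern that fits a long-lived writer (assuming the reader is kept open as in the James Lucene backend): changes only become visible to searches once the reader is reopened, and a commit() alone does not refresh a reader that is already open, nor would it explain duplicate documents.

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

class NrtReaderSketch {
    // Reopen the reader after writes so searches see the latest updates,
    // without ever closing the long-lived IndexWriter.
    static DirectoryReader refresh(IndexWriter writer, DirectoryReader current) throws IOException {
        DirectoryReader newer = DirectoryReader.openIfChanged(current, writer);
        if (newer == null) {
            return current;   // nothing changed since the last reopen
        }
        current.close();
        return newer;         // near-real-time view, including uncommitted changes
    }
}
```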
I think I'm gonna drop it for now... I tried to refactor the code to open and close the IndexWriter after each use, I tried to use …, tried changing the deletion policy, the merge policy... I think there is probably something we are missing in that migration; with the number of systems using Lucene, I can hardly believe we hit a Lucene bug here... but I'm a bit at a loss.
Just to be sure - those are the tests in …? I'm asking because running all tests for the James project takes a long time (and some are flaky).
Yes, those are the most relevant ones.
@woj-tek if you can get the MPT IMAP tests for Lucene to pass, we are likely good, yes :)
I'm looking at src/main/resources/org/apache/james/imap/scripts/Search.test and trying to figure it out, and as IMAP is not my forte I'd like to have confirmation that I got it right :) We …?
Was there a consensus on why it was done that way, and should it be changed?
@woj-tek I think you got it right about what the Search.test is supposed to do yes :)
No idea honestly; that Lucene implementation is very old and was done way before I joined the project, I believe. But it would not change the fact that, for some reason, Lucene does not seem to remove the old documents when doing its updates, so it just creates new ones, and that's why the Search.test is failing.
Yes, for the test scenarios. I believe the current MPT test has value, as it allowed catching issues that regular unit tests did not. I would be extremely cautious if alterations are needed. Btw, we pass this very test with OpenSearch, meaning this is doable on a Lucene-based setup... Thanks for having a look @woj-tek!
Thank you @Arsnael for the explanation.
I totally agree that MPT tests are very valuable (more akin to integration tests). I'm just trying to understand the issue and what's going on, so this is more me querying for information, and I'm not even suggesting changing those :)
This should be more than doable, especially considering that it's working with the current implementation - correct? Though as @Arsnael mentioned, the issue seems to be different. What bugs me is why the actual unit tests pass. Looking at …, ideally I'd like to create a unit test in …
I think the issue boils down to the (U)IDs not being matched properly somehow as mentioned in #2342 (comment)
(I have zero experience with Lucene, so I'm trying to figure out stuff as I go :) )
tl;dr: it seems that re-adding the ID field (akin to re-adding …
Could someone verify it? :)

EDIT: It still seems to re-add the field each time it's updated, so an additional call to remove the field on update would be required, but otherwise it seems to work (?!). It's somewhat counterintuitive that multiple fields with the same name are possible...

OK, my previous comment was off, but I've been digging into it more and more (and learning Lucene), running simpler tests (even just against Lucene). I even wrote to the Lucene mailing list :) In general, as per the information from the list, re-opening the reader should be enough:
What I noticed when running the tests was that adding the mail+flag document works, and the first update of the flag works as well. Only subsequent updates fail. This made me think that for some reason … So I decided to explicitly re-apply the ID when updating, as it was done with the UID field (a hedged sketch of what this could look like follows after this comment): I'm still not sure why those fields are cleared…

EDIT:
Pondering it - maybe it was efficiency? Considering that an update removes and re-adds the document, updating only a small-ish flag document would be cheaper than updating the whole email document each time a flag is changed?
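To illustrate the "re-apply the ID" idea mentioned above, here is a hedged sketch; the field constants, types, and the flags layout are assumptions based on this thread, not the actual James code. The point is to re-add the ID with the same untokenized field type used at indexing time instead of copying the stored value back as a default field.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class ReapplyIdSketch {
    static final String ID_FIELD = "id";       // assumed names, for illustration only
    static final String UID_FIELD = "uid";
    static final String FLAGS_FIELD = "flags";

    static void updateFlags(IndexWriter writer, String id, long uid, String flag) throws IOException {
        Document doc = new Document();
        // Re-apply the ID with its original, untokenized field type...
        doc.add(new StringField(ID_FIELD, id, Field.Store.YES));
        // ...just as the UID is re-applied with its original numeric type.
        doc.add(new NumericDocValuesField(UID_FIELD, uid));
        doc.add(new StringField(FLAGS_FIELD, flag, Field.Store.YES));
        writer.updateDocument(new Term(ID_FIELD, id), doc);
    }
}
```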
Some more information - it looks like for some reason ID field (
I was looking at the
From the quick test it seems it works OK.
Changes in tigase@e5fe401. Not sure how to submit those to the James repo/this PR (I used …). Both …

The only thing that "doesn't work" is checking the state of the Lucene repository after running the test - it should have only 8 documents but reports 10... (line 358 in LuceneMailboxMessageFlagSearchTest.java). It doesn't impact queries for particular message flag documents, but it is still a discrepancy. I read that one should check whether a document is marked as deleted, but I think that applies to an older Lucene version and I haven't seen a similar API in the current/updated version. As I said - it's checking Lucene internals, so I commented it out -- I think it's not relevant here, but I left it so it could be checked.

The issue with StringField (for ID_FIELD) being tokenized may be related to https://issues.apache.org/jira/browse/LUCENE-7171 / apache/lucene#8226 -- from what I can see it was reported for version 5.5, so it would fall just between the original Lucene version and the updated version in James.

PS. If someone has a formatter for IDEA that would produce sources formatted to checkstyle's liking, I'd be very thankful if it were shared :)
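On the 8-vs-10 document count: if the count comes from reader-level statistics, documents removed by an update are only marked as deleted until their segments are merged, so they can still inflate some counters. A small sketch of the relevant (still current) reader API, under that assumption:

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;

class DocCountSketch {
    static void printCounts(Directory directory) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(directory)) {
            System.out.println("live docs:    " + reader.numDocs());         // excludes deleted documents
            System.out.println("max doc:      " + reader.maxDoc());          // includes docs still marked as deleted
            System.out.println("deleted docs: " + reader.numDeletedDocs());  // not yet reclaimed by a merge
        }
    }
}
```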
Hi @woj-tek! So the issue is with the StringField ID_FIELD tokenization, correct? In the end you drop the ID_FIELD and update the document with the query instead. Not too sure I get what exactly happens there, but it looks good to me :) Do you want me to cherry-pick your commit onto this PR, or do you want to open your own PR with the fix? :) Thanks for the hard work, really appreciated :)
Thanks a lot @woj-tek for your hard work and putting this together!
There you go: https://github.com/apache/james-project/blob/master/src/site/xdoc/server/dev-build.xml#L138
Hi,
Uwe here from Lucene. Actually, to put it "mildly", the current code (as it already exists) in James is not working correctly at all. Your test cases worked with previous versions of Lucene because they were just testing some arbitrary IDs which luckily analyzed correctly.

The issue is the following: apache/lucene#8226 is not a valid bug and will never be fixed in Lucene, because there's nothing wrong. The issue here is wrong expectations and missing knowledge about how Lucene works. My recommendation would be to remove the current version of Lucene support completely and rewrite it, ideally maybe with a configurable index schema or, better, usage of Apache Solr, Elasticsearch or OpenSearch (to have the indexing clearly separated and scalable for huge installations). E.g., the Dovecot IMAP server has support for Apache Solr or Elasticsearch, so indexing e-mail is straightforward, and by adapting the schema files shipped with the repo it is possible to customize the text analysis without changing the code.

The main issue is the following: you cannot update documents in Lucene with this pattern: read the document's stored fields with IndexReader/IndexSearcher, modify the stored fields, and then write the Document instance back. This is - unfortunately - possible API-wise, but it won't work. IndexReader/IndexSearcher#getDocument returns only stored fields and has no metadata about the fields behind them. Because it only reads stored fields in the format they were STORED, reindexing with the original settings used during INDEXING won't work. Lucene does not know how the document was indexed originally; that information is not stored anywhere in the index. When you then reindex the document, it will apply default settings to all the fields in the Document (which are analyzed/tokenized with the default analyzer). This will transform all StringField instances to TextField. In addition, numeric fields and DocValues fields will all change to analyzed text. In Lucene 4.x there were several approaches to separate the API of IndexReader#getDocument from IndexWriter#updateDocument, but this was not successful. So the API trap is still there (you can read stored fields from the index and reindex them, but unfortunately with default settings).

The correct way to update a document in Lucene is the following: rebuild the document using the same code which was used during indexing - don't use any information from the index. E.g., read your document from the database/mail folder/EML file/... and create a completely new document, applying the indexing schema that was configured by the user (languages). It is also important to index IDs as StringField, because any other field type is not supported for IDs. When you do this, the document is easily reachable using its IDs (case sensitive) and can be updated. (A hedged sketch of this pattern follows after this comment.)

The additional documents not deleted in your tests come from exactly that problem: the field was indexed correctly using StringField as the ID, but it was later updated, and therefore the ID field became a TextField. On the next update, updateDocument wasn't able to delete the old document, because it wasn't found anymore (as the ID was tokenized and no longer a StringField suitable for an ID). This explains why you see more documents. So your tests are failing correctly!
In earlier versions of Lucene the problems caused by reindexing stored fields were not so bad, because the analyzers were different and possibly UUIDs were not tokenized, but in Lucene 3.x there was also the possibility that a little bit more information was saved in the document returned by IndexReader#getDocument, so the IndexWriter code was better at restoring the original field type. Since Lucene 4 this is no longer possible. Technically, it would be better if Lucene threw a UOE when you try to reindex a document that was retrieved via IndexReader#getDocument. Maybe we should work on this again to make this wrong use of Lucene's APIs impossible in the future. Some more ideas:
Uwe
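A minimal sketch of the update pattern described above, with purely illustrative parameter names and ID scheme (not James's actual code): rebuild the document from the original source data with the same field types used at indexing time, keep the ID as an untokenized StringField, and let updateDocument delete the previous version by that exact term.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class ReindexFromSourceSketch {
    // Rebuild the whole document from the mail store, never from stored fields
    // read back out of the index.
    static void reindexMessage(IndexWriter writer, String mailboxId, long uid,
                               String subject, String body) throws IOException {
        String id = mailboxId + "-" + uid;                          // illustrative id scheme
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));        // untokenized: usable as the update key
        doc.add(new TextField("subject", subject, Field.Store.NO)); // analyzed fields rebuilt from source data
        doc.add(new TextField("body", body, Field.Store.NO));
        writer.updateDocument(new Term("id", id), doc);             // deletes the old version by exact term
    }
}
```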
Hi Uwe and thank you so much for the very insightful comments. Now it makes much more sense :)
James already supports OpenSearch and I think it's the preferred way. Though the Lucene implementation is handy for small deployments, where having a single service (or better yet, a limited number of services) is convenient. As I said before - I had zero knowledge about Lucene just a couple of days ago, and I was just trying to make it work by looking at various documentation / SO and so forth.
@ James team - should we do this, or go further with …?

@uschindler - it would be helpful if the above information could somehow be included in the Lucene javadocs, so it would be clearer what to use and what to avoid.
…p using ID_FIELD for flags and query documents directly, as well as use the same query for updating the documents instead of relying on Term with ID_FIELD; unit test that reproduces the failing MPT search test;
Hi @uschindler, thanks for that long explanation of how Lucene works and what our issues seem to be. TBH I don't think people here are familiar with Lucene; it was a very old implementation made by other people and we just tried to upgrade it without having much knowledge around it. As @woj-tek said, we have different implementations already, like the OpenSearch one, but it seems some people in our community, like @woj-tek, are still using Lucene as a more lightweight approach for hosting their own mail server. With all this information we could likely propose a better Lucene indexing implementation, I agree. I think we could try to do this!
@Arsnael
@quantranhong1999 Thanks. I guess there is no ready-made styling/formatter file?
Yes, and we try to have a relatively small mail server, thus having a single service is beneficial for us :)
Hi @uschindler I didn't understand a few details
Let's take the code here as a reference for the purpose of this discussion. In the code, on L1293 of LuceneMessageSearchIndex.java, when they tried using a TermQuery on the ID field, that too failed. That at least should've worked, right? Sure, Lucene does not store indexing information, but it would've stored the untokenized ID, right (without info on whether or not it was tokenized)? All the TermQuery had to do was match against the terms indexed in the ID field? Even the updates made on L1282 used the same StringField.
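If I follow the explanation above, the TermQuery and the update term fail for the same reason (the values below are made up for illustration): once the document has been re-added from its stored fields with default settings, the ID value is analyzed into several smaller terms, so the single exact term the query looks for no longer exists in the field.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class IdTermQuerySketch {
    // Indexed as StringField: the field holds one exact term, e.g. "mailbox1-42",
    // and this query (or updateDocument with the same Term) matches it.
    // Re-added as analyzed text: the field may instead hold terms like "mailbox1"
    // and "42", so the exact term "mailbox1-42" is gone and nothing matches.
    static Query byId(String id) {
        return new TermQuery(new Term("id", id));
    }
}
```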
Could this get closed?
Just tried to give it a shot at upgrading directly to the latest Lucene version from Quan's work: #2315
In the Lucene module I have 3 failing tests left, all returning the following error:
Maybe @quantranhong1999 has an idea? :)
I am aware that there are still some tests that have been disabled, and of course I haven't checked the MPT tests yet (one step at a time).