Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not yet squattered emails are always returned on any search (3.4 and later) #4692

Closed
gbulfon opened this issue Oct 17, 2023 · 39 comments · Fixed by #4698 or #4806
Closed

Not yet squattered emails are always returned on any search (3.4 and later) #4692

gbulfon opened this issue Oct 17, 2023 · 39 comments · Fixed by #4698 or #4806

Comments

@gbulfon
Copy link

gbulfon commented Oct 17, 2023

Hi, I've been digging this problem using gdb but I hardly can figure out what's happening here.

Using squatter mode in imapd.conf, when recent emails arrive and are not yet been squattered, any search will return the correct result of the squattered emails + all of these recent emails, which are not filtered considering the query.
This can be easily simulated by commenting out the rolling squatter from cyrus.conf, then sending a couple of new emails to the test account: try to do queries and you will always receive these new emails as a result, even if they don't match the query expression.

I discovered that the problem is during subquery_run_indexed() (search_query.c), after running the squatter query via bx->run() it will run the subquery_post_enginesearch callback for any folder. Here it will run index_search_evaluate(), where at the beginning runs search_expr_always_same: this returns 1 because e->op is SEOP_TRUE, causing the code to not consider the original query and going on adding the uids as a match.

This search expression is sent to index_search_evaluate from sub->expr, which is the one containing SEOP_TRUE, while I've seen previous code checking the "op" inside sub->indexed, I don't know what this means actually.
I also can't find why this expr is SEOP_TRUE when the search is a SEOP_MATCH (justs a search "FROM abcd"), and infact sub->indexed reflects it.

@rsto @elliefm you may already have checked something in the past with me about it, but maybe with these new considerations we may find where the bug is?

Here is the previous issue: #4224

Gabriele

@vladki77
Copy link

Hello, I ran into the same problem, after upgrade (Debian 3.2.6-2+deb11u2 -> 3.6.1-4+deb12u1), with search_engine: squat.
Question is if this bug is present or not while using Xapian?

@gbulfon
Copy link
Author

gbulfon commented Oct 17, 2023

We tried to move to xapian, but we found it much slower than squatter, both indexing and searching.
That's why we're stuck on latest 2.x.

I feel that from 3.2 to 3.4 a refactoring has been made to the search engine code, thus killing the original code that automatically added unindexed emails applying the same filter.
Also beware that folders without a cyrus.squat (never indexed) will never return any result.
If we're able to solve the bug above, then we may fix this last behaviour by letting code search unindexed emails when the squat file is not present.
At the moment, fixing the missing squat file code, would result in an full listing at any search.

@elliefm
Copy link
Contributor

elliefm commented Oct 17, 2023

From what (very!) little I understand of the squat backend, the squat backend is supposed to return all unindexed messages to the general search engine (as well as the indexed messages that matched), and the general search engine is supposed to apply its own (slow) filter to unindexed ones.

Perhaps the latter detail has been lost somewhere along the way, and the general search engine no longer applies its own filter to the unindexed messages. Or perhaps it now expects the unindexed messages to be flagged in some different way, and squat hasn't been updated to match, so the general engine mistakes them for indexed results. I dunno, these are some very lightly educated guesses at best!

@gbulfon
Copy link
Author

gbulfon commented Oct 19, 2023

@elliefm @rsto @vladki77 we found the solution to both problems:

  • unindexed messages are correctly filtered and merged to indexed ones
  • if the cyrus.squat is missing, code will run the generic unindexed filter

Here is the patch:

--- cyrus-imapd-3.8.1/imap/search_query.c       Thu Oct 19 00:20:08 2023
+++ cyrus-imapd-3.8.1/imap/search_query.c.new   Thu Oct 19 03:43:25 2023
@@ -486,8 +486,8 @@
     if (query->error) return;
     if (!folder->found_dirty) return;

-    if (sub->expr && query->verbose) {
-        char *s = search_expr_serialise(sub->expr);
+    if (sub->indexed && query->verbose) {
+        char *s = search_expr_serialise(sub->indexed);
         syslog(LOG_INFO, "Folder %s: applying scan expression: %s",
                 folder->mboxname, s);
         free(s);
@@ -511,7 +511,7 @@

     if (!state->exists) goto out;

-    search_expr_internalise(state, sub->expr);
+    search_expr_internalise(state, sub->indexed);

     if (query->sortcrit)
         msgno_list = (unsigned *) xmalloc(state->exists * sizeof(unsigned));
@@ -540,7 +540,7 @@
             continue;

         /* run the search program */
-        if (!index_search_evaluate(state, sub->expr, msgno))
+        if (!index_search_evaluate(state, sub->indexed, msgno))
             continue;

         /* we have a new UID that needs to be merged in */
@@ -673,6 +673,10 @@
     return 0;
 }

+static int subquery_run_one_folder(search_query_t *query,
+                                   const char *mboxname,
+                                   search_expr_t *e);
+
 static void subquery_run_indexed(const char *key __attribute__((unused)),
                                  void *data, void *rock)
 {
@@ -725,7 +729,7 @@

     bx = search_begin_search(query->state->mailbox, opts);
     if (!bx) {
-        r = IMAP_INTERNAL;
+       r = subquery_run_one_folder(query, index_mboxname(query->state), sub->indexed);
         goto out;
     }


Look to work fine to us.
I hope you can check the changes and include them on a fix release.
Or if you prefer, we can make a pull request.

Gabriele

@elliefm
Copy link
Contributor

elliefm commented Oct 20, 2023

Thanks! I've applied this patch locally and run it through cassandane, and it creates lots of errors for Xapian search. I've included the list of tests that fail or error for me below, in case it's useful to someone.

Since the patch apparently fixes Squat but breaks Xapian, and we suspected a previous change to support Xapian had broken Squat, then I guess the patch at least shows us the area of code that the backends are in disagreement about. And it looks like the disagreement is about whether to use sub->expr or sub->indexed. So perhaps this code needs to check and handle both, rather than just one or the other. @rsto what do you think?

Here's the tests that now fail with the patch:

  • Cyrus::JMAPCalendars.calendarevent_query_text
  • Cyrus::JMAPEmail.email_querychanges_fromcontactgroupid
  • Cyrus::JMAPEmail.email_query_emailaddress
  • Cyrus::JMAPEmail.email_query_fromanycontact_shared
  • Cyrus::JMAPEmail.email_query_fromcontactgroupid
  • Cyrus::JMAPEmail.email_query_guidsearch_mixedfilter2
  • Cyrus::JMAPEmail.email_query_inmailboxid_conjunction
  • Cyrus::JMAPEmail.email_query_listid
  • Cyrus::JMAPEmail.email_query_no_guidsearch_ignore_jmapuploads
  • Cyrus::JMAPEmail.email_query_not_match
  • Cyrus::JMAPEmail.email_query_toaddress
  • Cyrus::SearchFuzzy.detect_language
  • Cyrus::SearchFuzzy.fuzzyalways_annot
  • Cyrus::SearchFuzzy.mix_fuzzy_and_nonfuzzy
  • Cyrus::SearchFuzzy.search_exactmatch
  • Cyrus::SearchFuzzy.search_omit_ical
  • Cyrus::SearchFuzzy.search_subjectsnippet
  • Cyrus::SearchFuzzy.snippet_wildcard
  • Cyrus::SearchFuzzy.squatter_partials
  • Cyrus::SearchFuzzy.stem_any
  • Cyrus::SearchFuzzy.stem_verbs
  • Cyrus::SearchFuzzy.xattachmentname

@gbulfon
Copy link
Author

gbulfon commented Oct 20, 2023

@elliefm I did a step by step debug with gdb and discovered that when squatter is on, sub->expr always contains the SEOP_TRUE op, while sub->indexed contains the real expression.
I don't understand why this should be different when on xapians, squatter or default, as this part of the code should hold the same query object:

if (query->indexed_count) {

But anyway I also discovered that when no search engine is configured, the expression is found inside the query->global_sub.expr , so I suspect there is something that changes the inside of the query object depending on the search engine.

@rsto
Copy link
Member

rsto commented Oct 23, 2023

changes the inside of the query object depending on the search engine.

that is correct and there are assumptions about the search backend being Xapian present up to the JMAP layers. Also the expression rewrite logic reflects that, e.g. see

cyrus-imapd/imap/search_expr.c

Lines 1353 to 1362 in d093790

* The presence of indexable fields overrides all other
* considerations; we assume it's easier to consult the
* index and then filter out folders later. Note that this
* assumption is broken for SQUAT which is per-folder.
*
* We remove the indexed matches from the conjunction and
* build a new conjunction containing only those matches.
* Note that this assumes the search engine does not give
* false positives, which is also broken for SQUAT.
*/

For sake of getting squat backend supported again, I'd be fine if the above patch is executed on condition that the configured search backend is squat. What actually needs to get done is to sanitize the whole search code, but that is out of scope this issue.

@gbulfon
Copy link
Author

gbulfon commented Oct 23, 2023

Another possibility is to check for squat backend and set sub->expr = sub->indexed before going on with untouched code.

@rsto
Copy link
Member

rsto commented Oct 23, 2023

I'm fine with whatever gets the job done and the smallest code impact!

@gbulfon
Copy link
Author

gbulfon commented Oct 23, 2023

Ok @rsto , here is the smaller patch, verified to work our side with squatter:

--- cyrus-imapd-3.8.1/imap/search_query.c       Thu Oct 19 00:20:08 2023
+++ cyrus-imapd-3.8.1/imap/search_query.c.new   Mon Oct 23 06:21:17 2023
@@ -486,6 +486,14 @@
     if (query->error) return;
     if (!folder->found_dirty) return;

+    /*
+     * When in squatter mode, the real search expression is inside
+     * sub->indexed, while sub->expr holds a constant SEOP_TRUE.
+     * In this case let sub->expr point to sub->indexed.
+     */
+    if (config_getenum(IMAPOPT_SEARCH_ENGINE)==IMAP_ENUM_SEARCH_ENGINE_SQUAT)
+       sub->expr = sub->indexed;
+
     if (sub->expr && query->verbose) {
         char *s = search_expr_serialise(sub->expr);
         syslog(LOG_INFO, "Folder %s: applying scan expression: %s",
@@ -673,6 +681,10 @@
     return 0;
 }

+static int subquery_run_one_folder(search_query_t *query,
+                                   const char *mboxname,
+                                   search_expr_t *e);
+
 static void subquery_run_indexed(const char *key __attribute__((unused)),
                                  void *data, void *rock)
 {
@@ -725,7 +737,7 @@

     bx = search_begin_search(query->state->mailbox, opts);
     if (!bx) {
-        r = IMAP_INTERNAL;
+       r = subquery_run_one_folder(query, index_mboxname(query->state), sub->indexed);
         goto out;
     }


rsto added a commit that referenced this issue Oct 24, 2023
The squat backend returns false positives for unindexed messages,
but filtering these messages has been broken.

This patch fixes that based on a contribution of @gbulfon. Thanks!

Fixes #4692
@rsto
Copy link
Member

rsto commented Oct 24, 2023

Thanks! I've just created pull request #4698

@gbulfon
Copy link
Author

gbulfon commented Oct 24, 2023

Great! Thanks!

rsto added a commit that referenced this issue Oct 25, 2023
The squat backend returns false positives for unindexed messages,
but filtering these messages has been broken.

This patch fixes that based on a contribution of @gbulfon. Thanks!

Fixes #4692
@vladki77
Copy link

vladki77 commented Oct 27, 2023

Hello, I have applied the patch [59d41a0] to 3.6.1 and recompiled. For some users it seems to work fine (search in unindexed mails works as expected), but at least one reports that search stopped working altogether. In mail.log I see Fatal error: Internal error: assertion failed: imap/search_query.c: 514: sub->expr && sub->expr->op == SEOP_TRUE
(line 514 is line 499 from the referred patch due to being applied to different cyrus version. I tried reconstruct, delete cyrus.squat, and reindex, but the problem persists. Any ideas what's wrong?

Webmail (roundcube) search reports: Failed to send UID SEARCH command, and does not show any results.

@vladki77
Copy link

vladki77 commented Oct 27, 2023

I have replaced assert(sub->expr && sub->expr->op == SEOP_TRUE); with if (sub->expr && sub->expr->op != SEOP_TRUE) { syslog(LOG_INFO, "Mailbox %s: expected SEOP_TRUE (1) but found %d. Replacing anyway with %d.", mboxname, sub->expr->op, sub->indexed->op); } and now the search seems to work well. I get these in syslog:
Mailbox user.xy: expected SEOP_TRUE (1) but found 11 (SEOP_NOT). Replacing anyway with 7 (SEOP_MATCH).

Further testing shows that limiting search by date is ignored. On my mailbox (where search worked well when the assert was in place), i see: expected SEOP_TRUE (1) but found 6. (SEOP_GT) or 3 (SEOP_LT). On the mailbox with problems - search with date limit gives: expected SEOP_TRUE (1) but found 9 (SEOP_AND) - and does not matter if searching newer or older messages. Anyway neither of these searches was limited by date. All messages which matched the search string were returned.

As the patched version is now in production I'm collecting more mailboxes that have this problem.

@vladki77
Copy link

@rsto is there a way to reopen this issue?

@rsto rsto reopened this Oct 31, 2023
@vladki77
Copy link

vladki77 commented Oct 31, 2023

I have some more logs, and see that sub->expr->op is quite often not SEOP_TRUE, but SEOP_NOT and SEOP_OR. And a few occasions of SEOP_AND, and SEOP_MATCH. Maybe in these cases the sub->expr->op should be preserved?
Can someone with more insight in the code confirm?

@vladki77
Copy link

Any progress here? I'd like to get this fixed but I'm afraid that my patches can break things. Is there anyone who could review my patches and recommend if they are ok?

@gbulfon
Copy link
Author

gbulfon commented Jan 2, 2024

We don't experience your issue with the changes we suggested.
What version are you using?

@vladki77
Copy link

vladki77 commented Jan 2, 2024

I'm using cyrus version 3.6.1-4+deb12u1 (debian stable/bookworm), with manually applied patch [https://github.com/cyrusimap/cyrus-imapd/commit/59d41a084db3e7c8dccc24290a7a62693878f47d]

@gbulfon
Copy link
Author

gbulfon commented Jan 2, 2024

We're using 3.8 built on XStreamOS/illumos with added patches

@vladki77
Copy link

vladki77 commented Jan 2, 2024

Yeah I know that the patch is intended for 3.8. The question is if anyone is working on backporting it to 3.6, or if I make the pathch myself if anyone can review (and test) it ?

@elliefm
Copy link
Contributor

elliefm commented Jan 4, 2024

I will backport it to 3.6, but I haven't yet. Thanks for the bump.

elliefm pushed a commit that referenced this issue Jan 15, 2024
The squat backend returns false positives for unindexed messages,
but filtering these messages has been broken.

This patch fixes that based on a contribution of @gbulfon. Thanks!

Fixes #4692
@brrrrrrrt
Copy link

i noticed, that the search result also includes all messages from the trash folder, is this related?
(also using 3.6 from debian/bookworm)

@elliefm
Copy link
Contributor

elliefm commented Jan 18, 2024

@brrrrrrrt Probably, if you haven't indexed the trash folder

@brrrrrrrt
Copy link

brrrrrrrt commented Jan 18, 2024

@brrrrrrrt Probably, if you haven't indexed the trash folder

@elliefm hmm, no, the trash mailbox is indexed, just did it manually, still showing all messages from trash, no matter what i am searching for :(

elliefm pushed a commit that referenced this issue Jan 19, 2024
The squat backend returns false positives for unindexed messages,
but filtering these messages has been broken.

This patch fixes that based on a contribution of @gbulfon. Thanks!

Fixes #4692
@vladki77
Copy link

Hello I see that patches [59d41a0] [6d3b834] and [0c9e48a] are identical. As I reported that patch causes assertion failure on 3.6.x because sub->expr->op is not always SEOP_TRUE. Also when I replaced assert with just a syslog message, the search continues, but more complex searches like (text = 'something' and date > sometime) are simplified to (text = 'something'). So the patch as is presented will break cyrus version 3.6, and needs some changes. IMHO sub->expr->op should not be blindly replaced with sub->indexed->op, but only if it really is SEOP_TRUE.

@vladki77
Copy link

vladki77 commented Feb 1, 2024

So I have tried to modify the patch but failed.
cyrus-search-unindexed-and-date-failed.txt
Interestingly it logged:
Only search text: expected SEOP_TRUE (1) found. Replacing with 1. - Results OK
Search text and date older than week: expected SEOP_TRUE (1) but found 3. Not replacing with 7. - Results OK
Search text and date newer than week: expected SEOP_TRUE (1) but found 6. Not replacing with 7. - Added all unindexed mails to the results.

So I reverted back to patch:
cyrus-search-unindexed.txt
Just text search: expected SEOP_TRUE (1), found 1. Replacing with 7. - Results OK
Search text and newer than: expected SEOP_TRUE (1) but found 6. replacing with 7. - text found OK, but date ignored
Search text and older than: expected SEOP_TRUE (1) but found 3. Not replacing with 7. - text found OK, but date ignored

So to sum up - the patches published as solving this issue do not work well:

  1. assertion that sub->expr->op is always SEOP_TRUE is wrong for complex searches (using SEOP_AND, OR, NOT, GT, LT) and causes search failure
  2. blind replacing sub->expr->op with sub->indexed->op helps partially - search does not fail, but uses only SEOP_MATCH, and ignores any other search limitations (i.e. date newer or older than something)
  3. I'm not able to fix it myself, but can debug any new version of the patch.

@elliefm
Copy link
Contributor

elliefm commented Feb 2, 2024

Can you provide some examples of IMAP SEARCH commands that do and don't work on 3.6 with the patch? I could write a Cassandane test to reproduce the problem, but I'm not really familiar with the search syntax. Some examples of searches that should work and do work, and searches that should work but do not, would give me a quicker starting point than trying to work backwards from source code level details.

It seems clear that something else had changed between 3.6 and 3.8, but which has not been backported. If I can reproduce the problem from Cassandane, then it might be possible to find the missing link, and backport that too

@vladki77
Copy link

vladki77 commented Feb 5, 2024

The problem is with "search text xxx since yyy"

3.6 without patches:
"search text blabla" returns all unindexed messages, plus relevant indexed messages

3.6 with patch [https://github.com/cyrusimap/cyrus-imapd/commit/59d41a084db3e7c8dccc24290a7a62693878f47d]:
"search text blabla" returns only relevant messages, whether indexed or not
"search text blabla since 1-Feb-2024" produces assertion failure !

3.6 with my patches cyrus-search-unindexed.txt
"search text blabla" returns only relevant messages, whether indexed or not
"search text blabla since 1-Feb-2024" returns the same list as "search text blabla"

21 search text blabla
* SEARCH ... UID numbers of mails
21 OK Completed (63 msgs in 0.237 secs)

22 search text blabla since 1-Feb-2024
* SEARCH ... same UIDs
22 OK Completed (63 msgs in 0.241 secs)

3.6 with my patches cyrus-search-unindexed-and-date-failed.txt
"search text blabla" returns only relevant messages, whether indexed or not
"search text blabla since 1-Feb-2024" returns all unindexed messages, plus relevant indexed messages

@elliefm
Copy link
Contributor

elliefm commented Feb 14, 2024

Made a test that reproduces the crash with 3.6, and then ran the same test on the master branch and it also crashes. So it's not a problem with 3.6, it's a problem with the patch itself.

@elliefm
Copy link
Contributor

elliefm commented Feb 14, 2024

This comment above search_expr_split_by_folder_and_index() seems to explain the relationship between sub->indexed and sub->expr (though it calls expr 'scan'):

The 'indexed' tree, if not NULL, contains all the indexed search terms.
The 'scan' tree will never be NULL, although it may be a trivial tree
comprising a single (true) node. It contains an expression that must be
matched by every message reported by the index or the folder scan.

In subquery_post_enginesearch() where we evaluate the "expr"/"scan" expression, for squat specifically we need to re-evaluate the whole indexed expression against the folder, because the indexed search will have returned all unindexed messages and we need to filter those. If the "scan"/"expr" tree is "a trivial tree comprising a single (true) node", then it's sufficient to clone the indexed tree into its place -- this is what the original fix does. But it might not be a trivial tree! In that case, I think the correct thing to do is build a new "expr" which is the AND of the original expr and indexed parts. I think it would also be correct (but not-optimal) to build such a new expr regardless, because AND(TRUE, indexed) will have the same result as just indexed.

I'll tinker with this a little longer and see what I come up with.

@elliefm
Copy link
Contributor

elliefm commented Feb 14, 2024

Seems to be working so far, needs some tidy up and at least one more test though

@elliefm
Copy link
Contributor

elliefm commented Feb 16, 2024

"at least one more test":

C: 11 search text needle since 1-Feb-2024
S: * SEARCH 1
S: 11 OK Completed (1 msgs in 0.001 secs)

Okay so far...

C: 12 search text needle not since 1-Feb-2004
S: * SEARCH
S: 12 OK Completed (0 msgs in 0.000 secs)

Whoops! No crash at least, but it should have found message 3 here, and didn't. So it's better, but still not quite right in some way that I don't yet understand.

But since/not since both work correctly when there isn't any indexed subquery:

C: 22 search since 1-Feb-2024
S: * SEARCH 1 2 4 5
S: 22 OK Completed (4 msgs in 0.001 secs)
C: 23 search not since 1-Feb-2024
S: * SEARCH 3 6
S: 23 OK Completed (2 msgs in 0.001 secs)

So whatever the problem with "not since" is, it doesn't occur in isolation, only when combined with an indexed subquery (such as "text needle"). Hmm.

@elliefm
Copy link
Contributor

elliefm commented Feb 19, 2024

Hah, the typo was right there in my previous comment, but I still didn't see it until hours of debugging later. For the search that failed, turns out I was accidentally searching 2004 rather than 2024, doh. With that fixed, the test passes. Just a bunch of debugging mess to clean up now, I think.

@elliefm
Copy link
Contributor

elliefm commented Feb 19, 2024

@vladki77 My fix is at #4806, though most of that is tests to prove it works. Thanks heaps for your help reproducing the issue.

The one commit you need is 48f702b, which you can download as a patch here. This patch applies over the top of the previous (broken) fix that's already on the cyrus-imapd-3.6 branch, so you should revert your local changes, then apply this one.

I'll backport this to the older branches once it's been through code review.

@elliefm
Copy link
Contributor

elliefm commented Feb 19, 2024

@vladki77 Forgot to say -- I'd really appreciate if you could try it out and confirm it works for you. The tests tell me it will, but a little more confirmation never hurt :)

@vladki77
Copy link

Sorry I did not have time for tests last week. But I'll definitely do it. I promise.

@vladki77
Copy link

Compiled, deployed, search seems to work as expected. Thank you very much @elliefm

@elliefm
Copy link
Contributor

elliefm commented Feb 29, 2024

Thanks!

elliefm pushed a commit to elliefm/cyrus-imapd that referenced this issue Mar 6, 2024
The squat backend returns false positives for unindexed messages,
but filtering these messages has been broken.

This patch fixes that based on a contribution of @gbulfon. Thanks!

Fixes cyrusimap#4692
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants