compatibility synonym with other filter ? #27481

nacimgoura · 2017-11-21T15:52:01Z

Hello,
I upgraded Elasticsearch from 5.6.4 to 6.0.0, but I have a problem that I can't solve even with the documentation.

Elasticsearch version : 6.0.0
Plugins installed: [analysis-icu, ingest-attachment, analysis-phonetic]
OS version : Docker elastic in ubuntu 16.04.3

I can't update my settings: I get the "failed to build synonyms" error.

Here's my filter :

filter: {
    french_synonym: {
        type: 'synonym',
        synonyms: [
            'min, minute, minimum',
            'boulevard, rue, avenue',
            'ville, village',
            'cosmos, galaxie,univers',
            'docteur, medecin, doctor',
            'foot2rue, foot de rue,foot 2 rue',
            'animaux, betes',
            'chine, asie, asiatique, chinois, cantonais, jaune',
            'accusé, coupable',
            'sdf, sans domilcile fixe',
            'Histoire, légende',
            'tgv, train, ter, sncf, train grande vitesse',
            'canada,canadienne',
            'terrien,terre',
        ],
    },
},

Here's my analyzer :

analyzer: {
    french_heavy: {
        type: 'custom',
        tokenizer: 'icu_tokenizer',
        filter: [
            'icu_folding',
            'french_synonym',
        ],
    },
},

Here's my error :

'{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1","caused_by":{"type":"illegal_argument_exception","reason":"term: psychiatrie analyzed to a token with posinc != 1"}}},"status":400}'

I have the impression that the synonym filter is not compatible with other filters, for example with icu_folding in my case.
It really looks like a bug to me.
Has anyone had the same problem?


aslamy · 2017-11-21T20:17:43Z

I also get this error. I use Synonym Graph.

stefando · 2017-11-23T18:06:13Z

The same issue with synonym filter:
"curaçao,cw,cuw" => term: curaçao analyzed to a token with posinc != 1

jimczi · 2017-11-24T08:28:03Z

@nacimgoura I am not able to reproduce your issue with the settings that you provided. Can you share the entire definition of your synonym filter, analyzer and field?
The exception you're hitting is not new. The synonym filter analyzes the synonym rules with an analyzer, and it fails if multiple tokens end up at the same position (stacked tokens). This can happen if you use an analyzer (tokenizer or filter) that creates multiple variations of the same token.
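
As a rough illustration of the check described above, here is a toy model in plain JavaScript. It is not Elasticsearch or Lucene code: `tokenize`, `stackVariant` and `validateSynonymTerm` are hypothetical helpers, and the uppercase "variant" only stands in for any filter that emits a second form of a token at the same position.

```javascript
// Toy token stream: each token carries a term and a position increment.
// posInc === 1 means "next position"; posInc === 0 means "stacked on the previous token".

// Hypothetical whitespace tokenizer: every token advances by one position.
function tokenize(text) {
    return text
        .split(/\s+/)
        .filter(Boolean)
        .map((term) => ({ term, posInc: 1 }));
}

// Hypothetical filter that keeps each token and stacks a variant on top of it,
// similar in spirit to filters that output several forms of the same token.
function stackVariant(tokens, makeVariant) {
    const out = [];
    for (const tok of tokens) {
        out.push(tok);
        out.push({ term: makeVariant(tok.term), posInc: 0 });
    }
    return out;
}

// Mimics the validation behind the error message: every token produced for a
// synonym term must advance by exactly one position.
function validateSynonymTerm(term, analyze) {
    for (const tok of analyze(term)) {
        if (tok.posInc !== 1) {
            throw new Error(`term: ${term} analyzed to a token with posinc != 1`);
        }
    }
    return true;
}
```

With `tokenize` alone, `validateSynonymTerm('psychiatrie', tokenize)` passes. Wrapping the analyzer in `stackVariant` makes the same term fail with a message shaped like the one in the report.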

nacimgoura · 2017-11-24T10:12:41Z

Hi @jimczi, here is my entire definition:


// analyzer
exports.settings = {
    settings: {
        analysis: {
            filter: {
                // remove these words to reduce noise
                french_elision: {
                    type: 'elision',
                    articles: [
                        'l', 'm', 't', 'qu', 'n', 's',
                        'j', 'd', 'c', 'jusqu', 'quoiqu',
                        'lorsqu', 'puisqu', 'ès', 'vers', 'a', 'à', 'afin',
                        'ai', 'ainsi', 'après', 'attendu', 'au', 'aujourd',
                        'auquel', 'aussi', 'autre', 'autres', 'aux', 'auxquelles',
                        'auxquels', 'avait', 'avant', 'avec', 'avoir', 'c', 'ça',
                        'car', 'ce', 'ceci', 'cela', 'celle', 'celles', 'celui',
                        'cependant', 'certain', 'certaine', 'certaines', 'certains',
                        'ces', 'cet', 'cette', 'ceux', 'chez', 'ci', 'combien', 'comme',
                        'comment', 'concernant', 'contre', 'd', 'dans', 'de', 'debout',
                        'dedans', 'dehors', 'delà', 'depuis', 'derrière', 'des', 'dès',
                        'désormais', 'desquelles', 'desquels', 'dessous', 'dessus', 'devant',
                        'devers', 'devra', 'divers', 'diverse', 'diverses', 'doit', 'donc',
                        'dont', 'du', 'duquel', 'durant', 'elle', 'elles', 'en', 'entre', 'environ',
                        'est', 'et', 'etc', 'été', 'etre', 'être', 'eu', 'eux', 'excepté', 'hélas',
                        'hormis', 'hors', 'hui', 'il', 'ils', 'j', 'je', 'jusqu', 'l', 'la',
                        'là', 'laquelle', 'le', 'lequel', 'les', 'lesquelles', 'lesquels', 'leur', 'leurs',
                        'lorsqu', 'lui', 'ma', 'mais', 'malgré', 'me', 'même', 'mêmes', 'merci', 'mes', 'mien',
                        'mienne', 'miennes', 'miens', 'moi', 'moins', 'mon', 'ne', 'néanmoins',
                        'ni', 'non', 'nos', 'notre', 'nôtre', 'nôtres', 'nous', 'ô', 'on', 'ont', 'ou', 'où', 'outre',
                        'par', 'parmi', 'partant', 'pas', 'passé', 'pendant', 'plein', 'plus', 'plusieurs', 'pour',
                        'pourquoi', 'près', 'proche', 'puisque', 'qu', 'quand', 'que', 'quel', 'quelle', 'quelles',
                        'quels', 'qui', 'quoi', 'quoique', 'revoici', 'revoilà', 'sa', 'sauf', 'se', 'selon',
                        'seront', 'ses', 'si', 'sien', 'sienne', 'siennes', 'siens', 'sinon', 'soi', 'soit', 'son',
                        'sont', 'sous', 'suivant', 'sur', 'ta', 'te', 'tes', 'tien', 'tienne', 'tiennes', 'tiens',
                        'toi', 'ton', 'tous', 'tout', 'toute', 'toutes', 'tu', 'un', 'une', 'va', 'voici', 'voilà',
                        'vos', 'votre', 'vôtre', 'vôtres', 'vous', 'vu', 'y',
                    ],
                },
                snowball: {
                    type: 'snowball',
                    language: 'French',
                },
                // removes noise words (les, nous, pas...)
                french_stop: {
                    type: 'stop',
                    stopwords: '_french_',
                    ignore_case: true,
                },
                // synonyms
                /**
                 * Warning: reindex after adding a synonym
                 * https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
                 */
                french_synonym: {
                    type: 'synonym',
                    synonyms: [
                        'pmi, protection maternelle et infantile',
                        'psycho, psychiatrie',
                        'anap, Agence nationale d\'appui à la performance des établissements de santé et médico-sociaux',
                        'ars, agences régionales de santé',
                        'c dans l\'oxygene, c dans l\'air',
                        '+ => plus',
                        '% => pour cent',
                        '10% => 10 pour cent, 10 pourcent,dix pour cent, dix pourcent',
                        '1 => un',
                        '2 => deux',
                        '3 => trois',
                        '4 => quatre',
                        '5 => cinq',
                        '6 => six',
                        '7 => sept',
                        '8 => huit',
                        '9 => neuf',
                        '10 => dix',
                        '11 => onze',
                        '12 => douze',
                        '13 => treize',
                        '14 => quatorze',
                        '15 => quinze',
                        '16 => seize',
                        '17 => dix-sept,dix sept',
                        '18 => dix-huit,dix huit',
                        '19 => dix-neuf,dix neuf',
                        '20 => vingt',
                        'min, minute, minimum',
                        'boulevard, rue, avenue',
                        'ville, village',
                        'cosmos, galaxie,univers',
                        'docteur, medecin, doctor',
                        'foot2rue, foot de rue,foot 2 rue',
                        'animaux, betes',
                        'chine, asie, asiatique, chinois, cantonais, jaune',
                        'accusé, coupable',
                        'sdf, sans domilcile fixe',
                        'Histoire, légende',
                        'tgv, train, ter, sncf, train grande vitesse',
                        'canada,canadienne',
                        'terrien,terre',
                    ],
                },
                // word stems
                french_stemmer: {
                    type: 'stemmer',
                    language: 'light_french',
                },
                // phonetic (alternative to fuzziness for spelling mistakes)
                phonetic_filter: {
                    type: 'phonetic',
                    encoder: 'beider_morse',
                    languageset: 'french',
                },
            },
            tokenizer: {
                tokeniser_url_mail: {
                    type: 'uax_url_email',
                    max_token_length: 10,
                },
            },
            analyzer: {
                // French, heavy analysis
                french_heavy: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'phonetic_filter',
                        'french_elision',
                        'french_synonym',
                        'french_stemmer',
                        'french_stop',
                        'snowball',
                    ],
                },
                // French, light analysis
                french_light: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'french_elision',
                    ],
                },
                // analyzer for url
                url_analyzer: {
                    type: 'custom',
                    tokenizer: 'tokeniser_url_mail',
                },
                french_html: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    char_filter: ['html_strip'],
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'phonetic_filter',
                        'french_elision',
                    ],
                },
            },
        },
    },
};

jimczi · 2017-11-24T11:21:25Z

Sorry, this is a new behavior in 6.0:
https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_60_analysis_changes.html
In 6.0 the synonym and synonym_graph filters have changed. They now apply all the filters that appear before them in the chain to build their synonym rules.
For the french_heavy analyzer, the french_synonym filter appears after five filters:

french_heavy: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'phonetic_filter',
                        'french_elision',
                        'french_synonym',
                        'french_stemmer',
                        'french_stop',
                        'snowball',
                    ],
                }

This means that all your synonym rules are analyzed with an icu_tokenizer and the following filters:
'lowercase', 'asciifolding', 'icu_folding', 'phonetic_filter', 'french_elision'.
For each rule, we apply this chain and we verify that the produced synonyms are valid.
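
To see which part of a chain the rules go through, the analyzer definition can be sliced at the synonym filter. This is a small sketch in plain JavaScript over the settings object from this thread. `filtersBeforeSynonym` is a hypothetical helper, not an Elasticsearch API.

```javascript
// Copy of the french_heavy analyzer from the settings shared in this thread.
const frenchHeavy = {
    type: 'custom',
    tokenizer: 'icu_tokenizer',
    filter: [
        'lowercase',
        'asciifolding',
        'icu_folding',
        'phonetic_filter',
        'french_elision',
        'french_synonym',
        'french_stemmer',
        'french_stop',
        'snowball',
    ],
};

// Hypothetical helper: returns the filters that run before the named synonym
// filter, or null if the analyzer does not use that filter at all.
function filtersBeforeSynonym(analyzer, synonymFilterName) {
    const filters = analyzer.filter || [];
    const index = filters.indexOf(synonymFilterName);
    return index === -1 ? null : filters.slice(0, index);
}

const chain = filtersBeforeSynonym(frenchHeavy, 'french_synonym');
console.log(chain.join(', '));
```

Running the same helper over every analyzer that references a shared synonym filter shows which chains the rule set has to survive.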
With this change in mind you'll have to modify your analyzers and synonym rules:

'% => pour cent' is not accepted because % is removed by the icu_tokenizer and therefore could never be found in a text that passes through this analyzer.
The phonetic filter should be placed after the synonym filter. The synonyms should not be checked against the phonetic form, and we disallow rules that produce multiple rewritings of a term (original form + phonetic form, for instance).
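
A minimal sketch of the reordering, assuming the only change is moving phonetic_filter to just after french_synonym. The exact position after the synonym filter is a guess, and `frenchHeavyReordered` is an illustrative name. The rule '% => pour cent' would also have to be dropped from the synonym list.

```javascript
// french_heavy with phonetic_filter moved after french_synonym, so the synonym
// rules are analyzed with lowercase, asciifolding, icu_folding and
// french_elision only. The rest of the chain is unchanged from the thread.
const frenchHeavyReordered = {
    type: 'custom',
    tokenizer: 'icu_tokenizer',
    filter: [
        'lowercase',
        'asciifolding',
        'icu_folding',
        'french_elision',
        'french_synonym',
        'phonetic_filter', // assumption: placed directly after the synonym filter
        'french_stemmer',
        'french_stop',
        'snowball',
    ],
};
```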

It's important to note that some of your rules were not built correctly in 5.x with this configuration. For instance, 'c dans l\'oxygene, c dans l\'air' could never match, since the elision filter would always remove l' from the input even though the rule is defined with an elision. With the new behavior in 6.0 this rule has a chance to match.

I hope you don't mind if I close this issue. We can continue to discuss the new behavior of the synonym filters in the forum if you want. Just open a new topic and we'll be happy to help:
https://discuss.elastic.co/c/elasticsearch

jtreher · 2018-05-29T18:33:13Z

@jimczi This is a huge breaking change. A single synonyms file that represents our vocabulary is no longer valid in all custom analyzers, because it's tightly coupled to the specific filters and tokenizer of each analyzer. How will I know if we add an invalid synonym that breaks index creation? Our content editors are free to make up synonyms as they need to support user queries. For years this has worked out well, and all of a sudden it's completely broken.
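
One possible way to catch such rules before index creation is to run each synonym term through the `_analyze` API with the tokenizer and the filters that precede the synonym filter, then inspect the returned token positions. The sketch below covers only the inspection step and is fed hand-written responses. `findProblems` is a hypothetical helper, and it assumes the usual `_analyze` response shape: a `tokens` array whose entries have `token` and `position` fields.

```javascript
// Hypothetical pre-check: given one synonym term and the _analyze response for
// it, report anything that would make the term unusable in a synonym rule.
function findProblems(term, analyzeResponse) {
    const tokens = (analyzeResponse && analyzeResponse.tokens) || [];
    if (tokens.length === 0) {
        return [`${term}: no tokens left after analysis`];
    }
    const problems = [];
    let previousPosition = -1;
    for (const tok of tokens) {
        // 0 means stacked on the previous token, above 1 means a removed token left a gap.
        const increment = tok.position - previousPosition;
        if (increment !== 1) {
            problems.push(`${term}: token "${tok.token}" has position increment ${increment}`);
        }
        previousPosition = tok.position;
    }
    return problems;
}

// Hand-written responses for illustration, not real cluster output.
console.log(findProblems('%', { tokens: [] }));
console.log(findProblems('psycho', { tokens: [{ token: 'psycho', position: 0 }] }));
```

An editor-facing tool could run this over a synonyms file for every analyzer that uses it and reject a rule before it reaches the cluster.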

Why does it have to fail index creation, why not just ignore the synonym? Or at least give us an option to ignore eliminated synonyms rather than failing creation.
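
For readers on later versions: the synonym token filter documentation for later 6.x releases describes a `lenient` setting (default `false`) that ignores exceptions while parsing the synonym configuration, so that only the rules that cannot be parsed are skipped. It does not exist in 6.0.0, the version discussed in this thread. A hedged sketch of such a filter definition, worth checking against the reference for your version:

```javascript
// Sketch only: `lenient` is not available in 6.0.0. Check the synonym token
// filter documentation for your Elasticsearch version before relying on it.
const lenientFrenchSynonym = {
    type: 'synonym',
    lenient: true, // skip rules that fail to parse instead of failing the whole filter
    synonyms: [
        'psycho, psychiatrie',
        '% => pour cent', // the rule called out earlier in this thread
    ],
};
```

Skipped rules are dropped silently, so a separate pre-check of the synonym list is still useful.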

JedatKinports · 2018-06-11T12:24:54Z

I got this error because I had applied the wrong token filter to my custom analyzer (I had similar token filters for different languages). It was a copy/paste mistake. In case someone has the same problem, this is a quick thing to check first.

jimczi added the :Search Relevance/Analysis and feedback_needed labels Nov 24, 2017

nacimgoura closed this as completed Nov 24, 2017

nacimgoura reopened this Nov 24, 2017

jimczi closed this as completed Nov 24, 2017

jimczi removed the feedback_needed label Nov 24, 2017

jimczi mentioned this issue Dec 15, 2017: Error on reindex using WordNet synonyms file #27798 (closed)

jtreher mentioned this issue May 30, 2018: Add flag to ignore synonym token filter exceptions due to analyzer processing #30968 (closed)

damienalexandre mentioned this issue Jun 9, 2018: Synonym Token Filter wrongly assume that term is completely eliminated by analyzer #31224 (closed)

rbayet mentioned this issue Aug 27, 2019: Multi-words terms with accents not handled by thesaurus Smile-SA/elasticsuite#1514 (closed)

javanna added the Team:Search Relevance label Jul 16, 2024
