compatibility synonym with other filter ? #27481

Closed
nacimgoura opened this issue Nov 21, 2017 · 7 comments
Labels
:Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@nacimgoura

Hello,
I upgraded Elasticsearch from 5.6.4 to 6.0.0, but I have a problem that I can't solve even with the documentation.

  • Elasticsearch version : 6.0.0
  • Plugins installed: [analysis-icu, ingest-attachment, analysis-phonetic]
  • OS version : Docker elastic in ubuntu 16.04.3

I can't update my settings; I get the "failed to build synonyms" error.

Here's my filter:

filter: {
    french_synonym: {
        type: 'synonym',
        synonyms: [
            'min, minute, minimum',
            'boulevard, rue, avenue',
            'ville, village',
            'cosmos, galaxie,univers',
            'docteur, medecin, doctor',
            'foot2rue, foot de rue,foot 2 rue',
            'animaux, betes',
            'chine, asie, asiatique, chinois, cantonais, jaune',
            'accusé, coupable',
            'sdf, sans domilcile fixe',
            'Histoire, légende',
            'tgv, train, ter, sncf, train grande vitesse',
            'canada,canadienne',
            'terrien,terre',
        ],
    },
},

Here's my analyzer:

analyzer: {
    french_heavy: {
        type: 'custom',
        tokenizer: 'icu_tokenizer',
        filter: [
            'icu_folding',
            'french_synonym',
        ],
    },
},

Here's my error:

'{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1","caused_by":{"type":"illegal_argument_exception","reason":"term: psychiatrie analyzed to a token with posinc != 1"}}},"status":400}'
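The informative part of this response is the innermost `caused_by` entry. A minimal sketch for pulling it out, assuming the body has the shape shown above; `innermostReason` is a hypothetical helper name, not part of any Elasticsearch client:

```javascript
// Hypothetical helper: follow the nested `caused_by` chain of an
// Elasticsearch error body and return the deepest `reason` string.
function innermostReason(body) {
    let node = body.error;
    while (node && node.caused_by) {
        node = node.caused_by;
    }
    return node ? node.reason : undefined;
}

// Error body copied from the response above (root_cause omitted for brevity).
const response = {
    error: {
        type: 'illegal_argument_exception',
        reason: 'failed to build synonyms',
        caused_by: {
            type: 'parse_exception',
            reason: 'Invalid synonym rule at line 1',
            caused_by: {
                type: 'illegal_argument_exception',
                reason: 'term: psychiatrie analyzed to a token with posinc != 1',
            },
        },
    },
    status: 400,
};

console.log(innermostReason(response));
```

Here the deepest reason names the term (`psychiatrie`) that the analysis chain could not turn into a single token per position.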

I have the impression that the synonym filter is not compatible with other filters, for example icu_folding in my case.
It really looks like a bug to me.
Has anyone had the same problem?

@aslamy

aslamy commented Nov 21, 2017

I also get this error. I use Synonym Graph.

@stefando

I have the same issue with the synonym filter:
"curaçao,cw,cuw" => term: curaçao analyzed to a token with posinc != 1

@jimczi
Contributor

jimczi commented Nov 24, 2017

@nacimgoura I am not able to reproduce your issue with the settings that you provided. Can you share the entire definition of your synonym filter, analyzer and field?
The exception you're hitting is not new: the synonym filter analyzes the synonym rules with an analyzer, and it fails if multiple tokens end up at the same position (stacked tokens). This can happen if your analyzer uses a tokenizer or filter that creates multiple variations of the same token.
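One way to check whether a chain stacks tokens is to run a synonym term through the `_analyze` API and compare the `position` of each returned token. A minimal sketch, assuming the usual `_analyze` response shape (`{ tokens: [{ token, position, ... }] }`); `findStackedPositions` is a hypothetical helper, and the sample tokens are made up for illustration rather than real analyzer output:

```javascript
// Hypothetical helper: group tokens by position and report every
// position that holds more than one token (i.e. stacked tokens).
function findStackedPositions(tokens) {
    const byPosition = new Map();
    for (const { token, position } of tokens) {
        if (!byPosition.has(position)) {
            byPosition.set(position, []);
        }
        byPosition.get(position).push(token);
    }
    return [...byPosition.entries()]
        .filter(([, terms]) => terms.length > 1)
        .map(([position, terms]) => ({ position, terms }));
}

// Illustrative input only: two tokens share position 0.
const sampleTokens = [
    { token: 'foo', position: 0 },
    { token: 'foo_variant', position: 0 },
    { token: 'bar', position: 1 },
];

console.log(findStackedPositions(sampleTokens));
```

An empty result means every position holds a single token, which is what the synonym filter expects from the rules it parses.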

@jimczi jimczi added :Search Relevance/Analysis How text is split into tokens feedback_needed labels Nov 24, 2017
@nacimgoura
Author

nacimgoura commented Nov 24, 2017

Hi @jimczi, here is my entire definition:


// analyzer
exports.settings = {
    settings: {
        analysis: {
            filter: {
                // remove these words to reduce noise
                french_elision: {
                    type: 'elision',
                    articles: [
                        'l', 'm', 't', 'qu', 'n', 's',
                        'j', 'd', 'c', 'jusqu', 'quoiqu',
                        'lorsqu', 'puisqu', 'ès', 'vers', 'a', 'à', 'afin',
                        'ai', 'ainsi', 'après', 'attendu', 'au', 'aujourd',
                        'auquel', 'aussi', 'autre', 'autres', 'aux', 'auxquelles',
                        'auxquels', 'avait', 'avant', 'avec', 'avoir', 'c', 'ça',
                        'car', 'ce', 'ceci', 'cela', 'celle', 'celles', 'celui',
                        'cependant', 'certain', 'certaine', 'certaines', 'certains',
                        'ces', 'cet', 'cette', 'ceux', 'chez', 'ci', 'combien', 'comme',
                        'comment', 'concernant', 'contre', 'd', 'dans', 'de', 'debout',
                        'dedans', 'dehors', 'delà', 'depuis', 'derrière', 'des', 'dès',
                        'désormais', 'desquelles', 'desquels', 'dessous', 'dessus', 'devant',
                        'devers', 'devra', 'divers', 'diverse', 'diverses', 'doit', 'donc',
                        'dont', 'du', 'duquel', 'durant', 'elle', 'elles', 'en', 'entre', 'environ',
                        'est', 'et', 'etc', 'été', 'etre', 'être', 'eu', 'eux', 'excepté', 'hélas',
                        'hormis', 'hors', 'hui', 'il', 'ils', 'j', 'je', 'jusqu', 'l', 'la',
                        'là', 'laquelle', 'le', 'lequel', 'les', 'lesquelles', 'lesquels', 'leur', 'leurs',
                        'lorsqu', 'lui', 'ma', 'mais', 'malgré', 'me', 'même', 'mêmes', 'merci', 'mes', 'mien',
                        'mienne', 'miennes', 'miens', 'moi', 'moins', 'mon', 'ne', 'néanmoins',
                        'ni', 'non', 'nos', 'notre', 'nôtre', 'nôtres', 'nous', 'ô', 'on', 'ont', 'ou', 'où', 'outre',
                        'par', 'parmi', 'partant', 'pas', 'passé', 'pendant', 'plein', 'plus', 'plusieurs', 'pour',
                        'pourquoi', 'près', 'proche', 'puisque', 'qu', 'quand', 'que', 'quel', 'quelle', 'quelles',
                        'quels', 'qui', 'quoi', 'quoique', 'revoici', 'revoilà', 'sa', 'sauf', 'se', 'selon',
                        'seront', 'ses', 'si', 'sien', 'sienne', 'siennes', 'siens', 'sinon', 'soi', 'soit', 'son',
                        'sont', 'sous', 'suivant', 'sur', 'ta', 'te', 'tes', 'tien', 'tienne', 'tiennes', 'tiens',
                        'toi', 'ton', 'tous', 'tout', 'toute', 'toutes', 'tu', 'un', 'une', 'va', 'voici', 'voilà',
                        'vos', 'votre', 'vôtre', 'vôtres', 'vous', 'vu', 'y',
                    ],
                },
                snowball: {
                    type: 'snowball',
                    language: 'French',
                },
                // removes noise (les, nous, pas...)
                french_stop: {
                    type: 'stop',
                    stopwords: '_french_',
                    ignore_case: true,
                },
                // synonyms
                /**
                 * Note: reindex after adding a synonym
                 * https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
                 */
                french_synonym: {
                    type: 'synonym',
                    synonyms: [
                        'pmi, protection maternelle et infantile',
                        'psycho, psychiatrie',
                        'anap, Agence nationale d\'appui à la performance des établissements de santé et médico-sociaux',
                        'ars, agences régionales de santé',
                        'c dans l\'oxygene, c dans l\'air',
                        '+ => plus',
                        '% => pour cent',
                        '10% => 10 pour cent, 10 pourcent,dix pour cent, dix pourcent',
                        '1 => un',
                        '2 => deux',
                        '3 => trois',
                        '4 => quatre',
                        '5 => cinq',
                        '6 => six',
                        '7 => sept',
                        '8 => huit',
                        '9 => neuf',
                        '10 => dix',
                        '11 => onze',
                        '12 => douze',
                        '13 => treize',
                        '14 => quatorze',
                        '15 => quinze',
                        '16 => seize',
                        '17 => dix-sept,dix sept',
                        '18 => dix-huit,dix huit',
                        '19 => dix-neuf,dix neuf',
                        '20 => vingt',
                        'min, minute, minimum',
                        'boulevard, rue, avenue',
                        'ville, village',
                        'cosmos, galaxie,univers',
                        'docteur, medecin, doctor',
                        'foot2rue, foot de rue,foot 2 rue',
                        'animaux, betes',
                        'chine, asie, asiatique, chinois, cantonais, jaune',
                        'accusé, coupable',
                        'sdf, sans domilcile fixe',
                        'Histoire, légende',
                        'tgv, train, ter, sncf, train grande vitesse',
                        'canada,canadienne',
                        'terrien,terre',
                    ],
                },
                // word stems
                french_stemmer: {
                    type: 'stemmer',
                    language: 'light_french',
                },
                // phonetic (alternative to fuzziness for spelling mistakes)
                phonetic_filter: {
                    type: 'phonetic',
                    encoder: 'beider_morse',
                    languageset: 'french',
                },
            },
            tokenizer: {
                tokeniser_url_mail: {
                    type: 'uax_url_email',
                    max_token_length: 10,
                },
            },
            analyzer: {
                // heavy French
                french_heavy: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'phonetic_filter',
                        'french_elision',
                        'french_synonym',
                        'french_stemmer',
                        'french_stop',
                        'snowball',
                    ],
                },
                // light French
                french_light: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'french_elision',
                    ],
                },
                // analyzer for url
                url_analyzer: {
                    type: 'custom',
                    tokenizer: 'tokeniser_url_mail',
                },
                french_html: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    char_filter: ['html_strip'],
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'phonetic_filter',
                        'french_elision',
                    ],
                },
            },
        },
    },
};

@jimczi
Contributor

jimczi commented Nov 24, 2017

Sorry, this is new behavior in 6.0:
https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_60_analysis_changes.html
In 6.0 the synonym and synonym_graph filters have changed. They now apply all the filters that appear before them in the chain to build their synonym rules.
For the french_heavy analyzer, the french_synonym filter appears after five filters:

french_heavy: {
                    type: 'custom',
                    tokenizer: 'icu_tokenizer',
                    filter: [
                        'lowercase',
                        'asciifolding',
                        'icu_folding',
                        'phonetic_filter',
                        'french_elision',
                        'french_synonym',
                        'french_stemmer',
                        'french_stop',
                        'snowball',
                    ],
                }

This means that all your synonym rules are analyzed with an icu_tokenizer and the following filters:
'lowercase', 'asciifolding', 'icu_folding', 'phonetic_filter', 'french_elision'.
For each rule, we apply this chain and we verify that the produced synonyms are valid.
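The part of the chain that gets applied to the synonym rules can be read directly off the analyzer definition. A minimal sketch; `chainBeforeSynonym` is a hypothetical helper written only for illustration, not an Elasticsearch API:

```javascript
// Hypothetical helper: return the tokenizer and the filters that precede
// the named synonym filter, i.e. the chain used to analyze its rules in 6.0.
function chainBeforeSynonym(analyzer, synonymFilterName) {
    const index = analyzer.filter.indexOf(synonymFilterName);
    if (index === -1) {
        return null; // the analyzer does not use this synonym filter
    }
    return {
        tokenizer: analyzer.tokenizer,
        filters: analyzer.filter.slice(0, index),
    };
}

// The french_heavy analyzer as posted in this issue.
const frenchHeavyAsPosted = {
    type: 'custom',
    tokenizer: 'icu_tokenizer',
    filter: [
        'lowercase', 'asciifolding', 'icu_folding', 'phonetic_filter',
        'french_elision', 'french_synonym', 'french_stemmer',
        'french_stop', 'snowball',
    ],
};

console.log(chainBeforeSynonym(frenchHeavyAsPosted, 'french_synonym'));
```

Filters listed after `french_synonym` (the stemmer, stop and snowball filters here) play no part in parsing the rules.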
With this change in mind you'll have to modify your analyzers and synonym rules:

  • '% => pour cent' is not accepted because % is removed by the icu_tokenizer and therefore could never be found in a text that passes through this analyzer.
  • The phonetic filter should be placed after the synonym filter. The synonyms should not be checked against the phonetic form, and we disallow rules that have multiple rewritings (original form + phonetic form, for instance).
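Applied to french_heavy, the second point could look like the sketch below. The exact position of `phonetic_filter` after the synonym filter is an assumption (it is placed last here); the only requirement stated above is that it comes after `french_synonym`. The '% => pour cent' rule would also have to be removed from the `french_synonym` definition, which is not shown:

```javascript
// Sketch of a reordered french_heavy analyzer: same filters as before,
// with phonetic_filter moved after french_synonym.
const frenchHeavyReordered = {
    type: 'custom',
    tokenizer: 'icu_tokenizer',
    filter: [
        'lowercase',
        'asciifolding',
        'icu_folding',
        'french_elision',
        'french_synonym',
        'french_stemmer',
        'french_stop',
        'snowball',
        // Assumption: placed at the end of the chain; anywhere after
        // french_synonym satisfies the constraint described above.
        'phonetic_filter',
    ],
};

console.log(frenchHeavyReordered.filter.join(' -> '));
```

With this order the synonym rules are analyzed with lowercase, asciifolding, icu_folding and french_elision only, so the phonetic form is no longer stacked on top of the original term while the rules are parsed.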

It's important to note that some of your rules were not built correctly in 5.x with this configuration. For instance, 'c dans l\'oxygene, c dans l\'air' could never match, since the elision filter would always remove l' from the input even though the rule is defined with an elision. With the new behavior in 6.0, this rule has a chance to match.

I hope you don't mind if I close this issue. We can continue to discuss the new behavior of the synonym filters in the forum if you want. Just open a new topic and we'll be happy to help:
https://discuss.elastic.co/c/elasticsearch

@jtreher

jtreher commented May 29, 2018

@jimczi This is a huge breaking change. A single synonyms file that represents our vocabulary is no longer valid in all custom analyzers, because it's tightly coupled to the specific filters and tokenizers on each analyzer. How will I know if we add an invalid synonym that breaks index creation? Our content editors are free to make up synonyms as they need to support user queries. For years this has worked out well, and all of a sudden it's completely broken.

Why does it have to fail index creation, why not just ignore the synonym? Or at least give us an option to ignore eliminated synonyms rather than failing creation.

@JedatKinports

I got this error because I had applied the wrong token filter to my custom analyzer (I had similar token filters for different languages). I'm mentioning it in case someone has the same problem, since this should be a quicker fix: it was a copy/paste mistake.

@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024
7 participants