Indexing non-English parliament corpora raises error #1473

lukavdplas · 2024-02-21T18:44:17Z

What went wrong?

Not sure if I did something wrong here. I tried indexing parliament-sweden-old locally and got this error:

elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', 'Failed to parse mapping: analyzer [stemmed_en] has not been configured in mappings')

I fixed it by changing the definition of the speech field:

I-analyzer/backend/corpora/parliament/sweden-old.py

Lines 88 to 89 in d040118

    speech = field_defaults.speech()
    speech.extractor = CSV(field='text')

To:

    speech = field_defaults.speech()
    speech.extractor = CSV(field='text')
    speech.language = 'sv'
    speech.es_mapping = main_content_mapping(token_counts=True, stopword_analysis=True, stemming_analysis=True, language=speech.language)

So I think this corpus is just missing its own definition for the mapping (and language) of the speech field? This seems to be true for other parliament corpora too.
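For anyone checking other corpora for the same problem, here is a minimal sketch of a check that lists analyzers a mapping refers to but the index settings do not define. It does not use I-analyzer code: `find_undefined_analyzers`, `BUILTIN_ANALYZERS`, and the example dictionaries are hypothetical, and the example assumes the settings define a Swedish analyzer while the default speech mapping still points at the English one.

```python
# Partial list of Elasticsearch built-in analyzers; extend as needed.
BUILTIN_ANALYZERS = {'standard', 'simple', 'whitespace', 'stop', 'keyword'}


def find_undefined_analyzers(settings, mappings):
    """Return analyzer names used in `mappings` but not defined in `settings`."""
    defined = set(settings.get('analysis', {}).get('analyzer', {}))
    referenced = set()

    def walk(properties):
        for field in properties.values():
            for key in ('analyzer', 'search_analyzer'):
                if key in field:
                    referenced.add(field[key])
            walk(field.get('fields', {}))      # multifields
            walk(field.get('properties', {}))  # nested object fields

    walk(mappings.get('properties', {}))
    return sorted(referenced - defined - BUILTIN_ANALYZERS)


# Hypothetical index body: settings configured for Swedish,
# speech mapping still referring to the English stemmer.
settings = {'analysis': {'analyzer': {'stemmed_sv': {'tokenizer': 'standard'}}}}
mappings = {
    'properties': {
        'speech': {
            'type': 'text',
            'fields': {
                'stemmed': {'type': 'text', 'analyzer': 'stemmed_en'},
                'length': {'type': 'token_count', 'analyzer': 'standard'},
            },
        }
    }
}

missing = find_undefined_analyzers(settings, mappings)
```

With these example dictionaries, `missing` contains `stemmed_en`, the analyzer named in the error above.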

What did you expect to happen?

The index operation should run without exceptions.

Screenshot

No response

Where did you find the bug?

a local server

Version

develop (~5.4.0)

Steps to reproduce

1. Configure the backend settings to include the parliament-sweden-old corpus: add the corpus definition to CORPORA and add any string value for PP_SWEDEN_OLD_DATA.
2. Run yarn django index parliament-sweden-old
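
As a rough sketch, the local settings for the first step might look like the following. This assumes CORPORA maps corpus names to the path of the definition file and that BASE_DIR points at the backend directory; both are assumptions, so adjust them to your own setup.

```python
import os

# Assumption: BASE_DIR is the backend directory (placeholder value here).
BASE_DIR = os.path.abspath('.')

# Assumption: CORPORA maps a corpus name to the path of its definition file.
CORPORA = {
    'parliament-sweden-old': os.path.join(
        BASE_DIR, 'corpora', 'parliament', 'sweden-old.py'
    ),
}

# Any string value is enough to reproduce the error.
PP_SWEDEN_OLD_DATA = 'placeholder'
```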


BeritJanssen · 2024-02-28T09:56:51Z

Yes, this is indeed still a to-do that I got stuck on: I have a branch somewhere that applies the new mapping style (with a language suffix) to all corpora, but I realized that we can't deploy it unless we reindex all corpora first. I did not know the best solution at the time, and then forgot to flag the problem.

What we could do:

1. Apply the new mapping style to all non-English corpora and reindex them.
2. Overhaul the mapping style so that only corpora with multiple values in the languages array get the new mapping style.

I think the second option will be harder for outside developers to understand, but so will the language suffix on the (majority of) corpora that aren't multilingual.
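
To make the difference concrete, the naming rule for the second option could be sketched roughly like this. `analyzer_name` is a hypothetical helper, not existing I-analyzer code, and it assumes the unsuffixed style uses plain names such as `stemmed`.

```python
def analyzer_name(kind, language, corpus_languages):
    """Return the analyzer name for a field (hypothetical helper).

    Only corpora with more than one language get the language suffix;
    monolingual corpora keep the plain name.
    """
    if len(corpus_languages) > 1:
        return f'{kind}_{language}'
    return kind


monolingual = analyzer_name('stemmed', 'sv', ['sv'])
multilingual = analyzer_name('stemmed', 'sv', ['sv', 'fi'])
```

Under this rule a monolingual Swedish corpus would use `stemmed`, while a corpus listing both `sv` and `fi` would use `stemmed_sv`. The first option would instead always return the suffixed name.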

lukavdplas · 2024-02-28T10:16:48Z

Ah, I see. I don't think it's high-priority right now, but perhaps we can add a comment in the corpus definitions?

Do you think that choice would have an effect on #992?

BeritJanssen · 2024-02-28T13:21:07Z

No, I don't think so, as the analyzers are defined per corpus. The different language analyzers won't affect the query syntax, as far as I can foresee. Visualizations, however, may be affected by this. Will have to look at this again and will comment on the issue if I spot some problems.

lukavdplas · 2024-03-21T17:34:48Z

Hm, actually, I would prefer this to be fixed sooner rather than later. I index these corpora quite regularly on my local machine for testing. They're now in a weird state where the code does not work but is still supposed to be maintained.

lukavdplas added labels on Feb 21, 2024: bug (something isn't working right), backend (changes to the django backend), corpus (changes to corpus definitions or new corpora)
