Indexing non-English parliament corpora raises error #1473

Open
lukavdplas opened this issue Feb 21, 2024 · 4 comments
Labels
backend changes to the django backend bug something isn't working right corpus changes to corpus definitions or new corpora

Comments

@lukavdplas
Contributor

lukavdplas commented Feb 21, 2024

What went wrong?

Not sure if I did something wrong here. I tried indexing parliament-sweden-old locally and got this error:

    elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', 'Failed to parse mapping: analyzer [stemmed_en] has not been configured in mappings')

I fixed it by changing the definition of the speech field:

    speech = field_defaults.speech()
    speech.extractor = CSV(field='text')

To:

    speech = field_defaults.speech()
    speech.extractor = CSV(field='text')
    speech.language = 'sv'
    speech.es_mapping = main_content_mapping(token_counts=True, stopword_analysis=True, stemming_analysis=True, language=speech.language)

So I think this corpus is just missing its own definition for the mapping (and language) of the speech field? This seems to be true for other parliament corpora too.
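For context, Elasticsearch rejects a mapping when a field names an analyzer that the index settings do not define, which is what the `stemmed_en` message reports. A minimal sketch of a pre-flight check that surfaces such mismatches before indexing follows. The helper names, the example index body, and the analyzer names other than `stemmed_en` are assumptions made for illustration; none of this is taken from the I-analyzer codebase or the Elasticsearch client.

```python
# Hypothetical pre-flight check; not part of I-analyzer or the Elasticsearch client.
# Partial list of Elasticsearch built-in analyzers that need no definition.
BUILTIN_ANALYZERS = {'standard', 'simple', 'whitespace', 'stop', 'keyword'}


def referenced_analyzers(properties):
    """Yield every analyzer name used by the fields in a mapping."""
    for field in properties.values():
        for key in ('analyzer', 'search_analyzer'):
            if key in field:
                yield field[key]
        # Recurse into multi-fields ("fields") and object fields ("properties").
        for nested in ('fields', 'properties'):
            if nested in field:
                yield from referenced_analyzers(field[nested])


def missing_analyzers(mappings, settings):
    """Return analyzer names the mapping uses but the settings do not define."""
    defined = set(settings.get('analysis', {}).get('analyzer', {}))
    used = set(referenced_analyzers(mappings.get('properties', {})))
    return sorted(used - defined - BUILTIN_ANALYZERS)


# Illustrative index body (assumed): the settings define Swedish analyzers,
# but the speech field still points at the English ones.
settings = {'analysis': {'analyzer': {'clean_sv': {}, 'stemmed_sv': {}}}}
mappings = {
    'properties': {
        'speech': {
            'type': 'text',
            'fields': {
                'clean': {'type': 'text', 'analyzer': 'clean_en'},
                'stemmed': {'type': 'text', 'analyzer': 'stemmed_en'},
            },
        }
    }
}

print(missing_analyzers(mappings, settings))  # ['clean_en', 'stemmed_en']
```

Running a check like this over each corpus definition would list the affected corpora without needing an Elasticsearch server.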

What did you expect to happen?

The index operation should run without exceptions.

Screenshot

No response

Where did you find the bug?

  • a local server

Version

develop (~5.4.0)

Steps to reproduce

  • Configure the backend settings to include the parliament-sweden-old corpus. Add the corpus definition to CORPORA and add any string value for PP_SWEDEN_OLD_DATA.
  • Run yarn django index parliament-sweden-old
@BeritJanssen
Member

BeritJanssen commented Feb 28, 2024

Yes, this is indeed still a to-do that I got stuck on: I have a branch somewhere that applies the new mapping style (with a language suffix) to all corpora, but I realized that we can't deploy it unless we reindex all corpora first. I did not know the best solution at the time, and then forgot to flag the problem.

What we could do:

  • apply new mapping style to & reindex all non-English corpora
  • overhaul mapping style such that only corpora with multiple values in the languages array will get the new mapping style

The second option will be harder for outside developers to understand, I think, but so will the language suffix for the majority of corpora, which aren't multilingual.
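To make the second option concrete, here is a rough sketch of a naming rule that adds the language suffix only for multilingual corpora. It assumes the suffix is attached to analyzer names such as `stemmed_en`; the thread does not say whether the suffix applies to analyzer names, field names, or both, and the function `analyzer_name` is hypothetical.

```python
def analyzer_name(base, language, corpus_languages):
    """Hypothetical naming rule: suffix the analyzer only for multilingual corpora."""
    if len(corpus_languages) > 1:
        # Multilingual corpus: one analyzer per language, e.g. "stemmed_de".
        return f'{base}_{language}'
    # Monolingual corpus: keep the plain name, so no reindex is forced by a rename.
    return base


print(analyzer_name('stemmed', 'sv', ['sv']))        # stemmed
print(analyzer_name('stemmed', 'de', ['de', 'fr']))  # stemmed_de
```

The conditional is the part outside developers would need to know about: the same field gets a differently named analyzer depending on the length of the `languages` array.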

@lukavdplas
Contributor Author

Ah, I see. I don't think it's high-priority right now, but perhaps we can add a comment in the corpus definitions?

Do you think the choice between those two options would have an effect on #992?

@BeritJanssen
Member

BeritJanssen commented Feb 28, 2024

No, I don't think so, as the analyzers are defined per corpus. The different language analyzers won't affect the query syntax, as far as I can foresee. Visualizations, however, may be affected by this. Will have to look at this again and will comment on the issue if I spot some problems.

@lukavdplas
Contributor Author

Hm, actually, I would prefer it if this were fixed sooner rather than later. I index these corpora quite regularly on my local machine for testing. They're now in a weird state where the code does not work but is still supposed to be maintained.
