Non-deterministic results for Catalan lemmatizer across different runs #9387

Closed
BLKSerene opened this issue Oct 7, 2021 · 4 comments · Fixed by #9405
Labels
bug Bugs and behaviour differing from documentation feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / ca Catalan language data and models reproducibility Consistency, reproducibility, determinism, and randomness

Comments

@BLKSerene
Contributor

BLKSerene commented Oct 7, 2021

>>> import spacy
>>> [token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')]
['parlat']
>>> [token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')]
['parlat']
>>> [token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')]
['parlat']

================================ RESTART: Shell ================================
>>> import spacy
>>> [token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')]
['parlad']
>>> [token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')]
['parlad']
>>> [token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')]
['parlad']

May be related: #8421

  • Operating System: Windows 10 x64
  • Python Version Used: 3.8.10
  • spaCy Version Used: 3.1.1
@BLKSerene BLKSerene changed the title Non-deterministic results for Catalan lemmatizer Non-deterministic results for Catalan lemmatizer across different runs Oct 7, 2021
@polm polm added feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / ca Catalan language data and models bug Bugs and behaviour differing from documentation labels Oct 7, 2021
@polm
Contributor

polm commented Oct 7, 2021

Thanks for the report, that looks like a bug - we'll give it a look!

@polm
Contributor

polm commented Oct 7, 2021

For the record I was able to reproduce this on Linux.

import spacy
print([token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')])
print([token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')])
print([token.lemma_ for token in spacy.load('ca_core_news_sm')('parlada')])

@polm
Contributor

polm commented Oct 9, 2021

OK, I found the root of this problem. The lemmatizing function can find multiple potential lemmas for a word. When it does, they are accumulated in a list, and before the list is returned it is de-duplicated by calling list(set(forms)). However, the iteration order of a set is not consistent from run to run in Python, and normally only the first entry of the resulting list is used, which causes the inconsistency we see here.
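
The run-to-run variation described above can be illustrated without loading a model. The following is a minimal sketch, assuming the variation comes from Python's per-process string hash randomization, which can be pinned with the `PYTHONHASHSEED` environment variable. The candidate list and the `first_form` helper are hypothetical and are not taken from spaCy's code.

```python
import os
import subprocess
import sys

# Child program: de-duplicate hypothetical candidate lemmas through a set
# and print whichever form ends up first in the resulting list.
CHILD = "print(list(set(['parlat', 'parlad', 'parlat']))[0])"


def first_form(seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Each seed stands in for one "restart" of the interpreter.
    for seed in range(8):
        print(seed, first_form(seed))
```

With a fixed seed the printed form is stable; across different seeds it may switch between the two candidates, which mirrors the behaviour seen after restarting the shell.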

This looks like it also affects the lemmatizers for French, Dutch, Romanian, and Russian, or at least they have list(set(xxx)) calls.

I'll work on a fix for this. Thanks again for reporting!
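
For reference, one common order-preserving way to de-duplicate a list is sketched below. This is an illustration only and may differ from what the linked pull request actually does; `unique_in_order` and the sample forms are hypothetical names.

```python
def unique_in_order(forms):
    """Drop duplicates while keeping the first occurrence of each form."""
    # dict keys keep insertion order (a language guarantee since Python 3.7),
    # so the result does not depend on string hash values.
    return list(dict.fromkeys(forms))


if __name__ == "__main__":
    forms = ["parlat", "parlad", "parlat"]  # hypothetical candidate lemmas
    print(unique_in_order(forms))
```

Because the first candidate found stays first, taking element 0 of the result gives the same lemma on every run.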

@polm polm mentioned this issue Oct 9, 2021
3 tasks
@polm polm linked a pull request Oct 10, 2021 that will close this issue
3 tasks
@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2021
@polm polm added the reproducibility Consistency, reproducibility, determinism, and randomness label Nov 22, 2022