Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-repeatably iterable corpus for tokenizer=None #17

Merged

Conversation

Witiko
Copy link
Contributor

@Witiko Witiko commented Jan 16, 2022

Currently, rank-bm25 requires that corpus is repeatedly iterable and sized (i.e. defines __len__()).

When the corpus is not pre-tokenized (i.e. tokenizer is not None), then this makes sense: __init__() will iterate across the corpus several times, so we may as well require that the corpus is a list or some other data type that is repeatedly iterable and sized. However, when the corpus is pre-tokenized (i.e. tokenizer is None), then we only iterate over the corpus once in _initialize(). Furthermore, we don't need to know its size beforehand, because we can just count the number of iterations.

This pull request makes it possible to use a non-repeatedly iterable non-sized corpus such as a generator when tokenizer is None. This is useful if you need to generate your corpus on the fly and don't know the number of your documents beforehand.

@Witiko
Copy link
Contributor Author

Witiko commented Feb 9, 2022

@dorianbrown Please, let me know if there is anything I can do to get this merged.

@dorianbrown
Copy link
Owner

Hi Wikito,

Thanks a lot for this contribution, the case of using a generator as a corpus never occurred to me, but seems like a very useful bit of functionality to have. And thanks for you patience, it's been a bit of a busy few weeks 😄

After looking through the changes this doesn't seem to cause any issues, so I'll merge it in and check if my CD is still working correctly.

@dorianbrown dorianbrown merged commit 97a38cf into dorianbrown:master Feb 16, 2022
@dorianbrown
Copy link
Owner

The branch has been merged, and the changes have been published to pypi under the version 0.2.2 of the package.

Thanks again for helping make this package better!

@Witiko
Copy link
Contributor Author

Witiko commented Feb 16, 2022

@dorianbrown Thank you, much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants