Support non-repeatably iterable corpus for tokenizer=None #17
Currently, rank-bm25 requires that `corpus` is repeatedly iterable and sized (i.e. defines `__len__()`). When the corpus is not pre-tokenized (i.e. `tokenizer` is not None), this makes sense: `__init__()` iterates over the corpus several times, so we may as well require that the corpus be a list or another data type that is repeatedly iterable and sized. However, when the corpus is pre-tokenized (i.e. `tokenizer` is None), we iterate over the corpus only once, in `_initialize()`. Furthermore, we don't need to know its size beforehand, because we can simply count the number of iterations.

This pull request makes it possible to use a non-repeatedly iterable, non-sized corpus, such as a generator, when `tokenizer` is None. This is useful if you need to generate your corpus on the fly and don't know the number of documents beforehand.
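To illustrate the idea (this is a hedged sketch, not the library's actual `_initialize()` code, and the `initialize` function name is hypothetical): a single pass over the corpus can collect term statistics while counting documents, so neither `__len__()` nor a second iteration is needed, and a generator works fine as input.

```python
# Illustrative sketch only: single-pass initialization over a possibly
# non-repeatable, unsized corpus of pre-tokenized documents.
from collections import Counter

def initialize(corpus):
    """Consume `corpus` exactly once, counting documents during
    iteration instead of calling len(corpus)."""
    doc_freqs = []     # per-document term frequencies
    df = Counter()     # document frequency of each term
    corpus_size = 0    # counted on the fly; no __len__() required
    total_len = 0
    for document in corpus:   # single pass: generators are fine
        corpus_size += 1
        total_len += len(document)
        freqs = Counter(document)
        doc_freqs.append(freqs)
        df.update(freqs.keys())
    avgdl = total_len / corpus_size if corpus_size else 0.0
    return doc_freqs, df, corpus_size, avgdl

# A generator corpus produced on the fly:
docs = (line.split() for line in ["hello world", "hello bm25"])
doc_freqs, df, n, avgdl = initialize(docs)
```

Here `n` comes out as 2 even though the generator has no `__len__()`, which is exactly the behavior this pull request enables for the `tokenizer is None` path.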