Support non-repeatably iterable corpus for tokenizer=None #17
Currently, rank-bm25 requires that `corpus` is repeatedly iterable and sized (i.e. defines `__len__()`). When the corpus is not pre-tokenized (i.e. `tokenizer` is not None), this makes sense: `__init__()` iterates over the corpus several times, so we may as well require that the corpus be a list or another data type that is repeatedly iterable and sized. However, when the corpus is pre-tokenized (i.e. `tokenizer` is None), we iterate over the corpus only once, in `_initialize()`. Furthermore, we don't need to know its size beforehand, because we can simply count the number of iterations.

This pull request makes it possible to use a non-repeatedly iterable, non-sized corpus, such as a generator, when `tokenizer` is None. This is useful if you need to generate your corpus on the fly and don't know the number of documents beforehand.
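To illustrate the idea (this is a hedged sketch, not the library's actual `_initialize()` code, and the `initialize` function name is hypothetical): a single pass over the corpus can collect term statistics while counting documents, so neither `__len__()` nor a second iteration is needed, and a generator works fine as input.

```python
# Illustrative sketch only: single-pass initialization over a possibly
# non-repeatable, unsized corpus of pre-tokenized documents.
from collections import Counter

def initialize(corpus):
    """Consume `corpus` exactly once, counting documents during
    iteration instead of calling len(corpus)."""
    doc_freqs = []     # per-document term frequencies
    df = Counter()     # document frequency of each term
    corpus_size = 0    # counted on the fly; no __len__() required
    total_len = 0
    for document in corpus:   # single pass: generators are fine
        corpus_size += 1
        total_len += len(document)
        freqs = Counter(document)
        doc_freqs.append(freqs)
        df.update(freqs.keys())
    avgdl = total_len / corpus_size if corpus_size else 0.0
    return doc_freqs, df, corpus_size, avgdl

# A generator corpus produced on the fly:
docs = (line.split() for line in ["hello world", "hello bm25"])
doc_freqs, df, n, avgdl = initialize(docs)
```

Here `n` comes out as 2 even though the generator has no `__len__()`, which is exactly the behavior this pull request enables for the `tokenizer is None` path.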