Thanks for the nice package! Following up on the issue concerning preprocessing suggestions, I have implemented an alternative parametrization that I would like to discuss. I hope you have time to look at it.
Parametrize the form of the input text differently
The implementation has so far used the `is_split` boolean flag to determine the form in which the input arrives. As already discussed in the issue concerning preprocessing suggestions, it can be useful to have other options for how the data are passed to `Bertalign`. In my case it turns out to be better to pass src and target as lists, because I need to postprocess the data outside of `Bertalign`. Passing lists avoids some idempotency issues that I have seen. I am not going into details here, but can of course do so if needed.
So I would prefer to reparametrize `is_split` into a (ternary) `split_type` option that covers the two existing cases, `is_split=False` and `is_split=True`, plus list input.
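To make the proposal concrete, here is a minimal sketch of how a ternary `split_type` could be normalized while still honoring the old flag. It is not the code in this PR: the function name `to_sentences` and the values `"sentences"` and `"list"` are placeholders I chose for illustration (only `"lines"` appears in this description), and the regex splitter merely stands in for whatever sentence splitter the package actually uses.

```python
import re
from typing import List, Optional, Union


def to_sentences(
    text: Union[str, List[str]],
    split_type: str = "sentences",
    is_split: Optional[bool] = None,
) -> List[str]:
    """Normalize the input into a list of sentences (hypothetical helper)."""
    # Backward compatibility: map the legacy boolean onto the new option.
    if is_split is not None:
        split_type = "lines" if is_split else "sentences"

    if split_type == "list":
        # Already a list of sentences: pass it through unchanged, so that
        # postprocessing done outside the aligner is not altered.
        if isinstance(text, str):
            raise TypeError("split_type='list' expects a list of strings")
        return list(text)

    if not isinstance(text, str):
        raise TypeError(f"split_type={split_type!r} expects a single string")

    if split_type == "lines":
        # One sentence per line; blank lines are dropped.
        return [line.strip() for line in text.splitlines() if line.strip()]

    if split_type == "sentences":
        # Placeholder splitter (assumption): break after ., ! or ?
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [part for part in parts if part]

    raise ValueError(f"unknown split_type: {split_type!r}")
```

With something like this, existing callers that pass `is_split=True` keep working, while new callers can hand over lists directly.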
Parametrize src and target languages differently
Allow the src and target languages to be passed as parameters. The current implementation relies on Google Translate to detect the language id, which is an external dependency.
In order not to remove it, I have added very inelegant code that keeps the parametrization intact as much as possible.
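One way this could look, as a rough sketch rather than the code in this PR: an explicitly passed language wins, and detection is only called as a fallback. The helper name `resolve_language` and the `detect` callable are assumptions for illustration; `detect` stands in for the existing Google Translate based detection.

```python
from typing import Callable, Optional


def resolve_language(
    text: str,
    lang: Optional[str] = None,
    detect: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the language id for `text` (hypothetical helper)."""
    # An explicitly passed language always wins; no external call is made.
    if lang is not None:
        return lang
    # Otherwise fall back to the detector, if one was supplied.
    if detect is None:
        raise ValueError("pass either a language id or a detect function")
    return detect(text)
```

Called once for src and once for target, this keeps the old behavior when no language is given and avoids the external dependency when both are.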
Afaics, the language id is only used with `split_type=='lines'` (resp. `is_split=True`). So maybe there is a better alternative?
Tests
I have also added basic unit tests to show that the parametrization is OK. These can be run using `pytest -sv tests/test_results.py` once the test requirements are installed, assuming that the package itself is installed, which I have done using `pip install -e .`.