More info on configuration options #4
- `max_align` is the maximum size of the alignment types, such as 1:1, 1:2, etc. `max_align=5` means the allowed alignments are 1:0, 0:1, 1:1, 1:2, 2:1, 2:2, 2:3, and 3:2. You can set this parameter to a higher value if the corpus to be aligned contains many complex alignments.
- `top_k` sets the number k of nearest target neighbors retrieved for each source sentence in the first-step alignment.
- `win` is the search window of the dynamic programming in the second-step alignment.
- `skip` is the predefined similarity score for 1:0 and 0:1 alignments. If your corpus contains many omissions and insertions, you can set this value higher, e.g. `skip=0`.
- `margin` applies the margin-based (modified cosine) similarity proposed in https://doi.org/10.1093/llc/fqac089.
- `len_penalty` takes the length difference between source and target sentences into account when computing the similarity of sentence pairs.
- `is_split=True` means the corpus has already been split into sentences. Otherwise, bertalign uses sentence-splitter to split the bitexts into sentences.
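To make the `max_align` behavior concrete, here is a small sketch that reproduces the list of alignment types given above for `max_align=5`. This is only an illustration of the described behavior (pairs whose total size fits within `max_align` and whose sides differ by at most one sentence, plus 1:0 and 0:1); it is not necessarily bertalign's actual implementation, and the function name is hypothetical.

```python
# Illustrative sketch: which alignment types a given max_align could allow.
# Not bertalign's actual code; it reproduces the list described above.

def alignment_types(max_align):
    """Return the alignment types (src_count, tgt_count) allowed for max_align."""
    types = [(0, 1), (1, 0)]  # insertions and omissions
    for x in range(1, max_align):
        for y in range(1, max_align):
            # keep pairs whose total size fits within max_align and
            # whose two sides differ by at most one sentence
            if x + y <= max_align and abs(x - y) <= 1:
                types.append((x, y))
    return types

print(alignment_types(5))
# [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2)]
```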
Hi. Is there a way to specify `max_align` with more granularity? For my use case, I would like to limit the allowable alignments to 1:1, 1:2, ..., 1:n, and the inverse thereof (1:1, 2:1, ..., n:1). In other words, I want to exclude 1:0, 0:1, and many-to-many alignments. EDIT: never mind, modifying `get_alignment_types` or hardcoding `second_alignment_types` seems to have done the trick.
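For anyone with the same use case, the restricted set of types described in the edit above (only 1:k and k:1, no 1:0, 0:1, or many-to-many) could be generated with a small helper like the following. The function name is hypothetical; this just shows what the hardcoded list would look like, not bertalign's API.

```python
# Hypothetical helper: restrict alignments to 1:k and k:1 only,
# excluding 1:0, 0:1, and many-to-many types as described above.

def one_to_many_types(n):
    """Allowed types: 1:1, 1:2, ..., 1:n and 2:1, ..., n:1."""
    types = [(1, k) for k in range(1, n + 1)]
    types += [(k, 1) for k in range(2, n + 1)]  # start at 2 to avoid duplicating (1, 1)
    return types

print(one_to_many_types(3))
# [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1)]
```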
Hi, thanks for providing this code! Could you please give more information (e.g. a brief explanation) of the following options?
Thank you in advance!
Rachele