More info on configuration options #4
- `max_align` is the maximum size of the alignment types, such as 1:1, 1:2, etc. `max_align=5` means the allowed alignments are 1:0, 0:1, 1:1, 1:2, 2:1, 2:2, 2:3, and 3:2. You can set this parameter to a higher value if the corpus to be aligned contains many complex alignments.
- `top_k` sets the number k of nearest target neighbors retrieved for each source sentence in the first-step alignment.
- `win` is the search window of the dynamic programming in the second-step alignment.
- `skip` is the predefined similarity score for 1:0 and 0:1 alignments. If your corpus contains many omissions and insertions, you can set this value higher, e.g. `skip=0`.
- `margin` applies the margin-based (modified cosine) similarity proposed in https://doi.org/10.1093/llc/fqac089.
- `len_penalty` takes the length difference between source and target sentences into account when computing the similarity of sentence pairs.
- `is_split=True` means the corpus has already been split into sentences. Otherwise, bertalign uses sentence-splitter to split the bitexts into sentences.
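To make the `max_align` behavior concrete, here is a small sketch that reproduces the list of alignment types given above for `max_align=5`. This is only an illustration of the described behavior (pairs whose total size fits within `max_align` and whose sides differ by at most one sentence, plus 1:0 and 0:1); it is not necessarily bertalign's actual implementation, and the function name is hypothetical.

```python
# Illustrative sketch: which alignment types a given max_align could allow.
# Not bertalign's actual code; it reproduces the list described above.

def alignment_types(max_align):
    """Return the alignment types (src_count, tgt_count) allowed for max_align."""
    types = [(0, 1), (1, 0)]  # insertions and omissions
    for x in range(1, max_align):
        for y in range(1, max_align):
            # keep pairs whose total size fits within max_align and
            # whose two sides differ by at most one sentence
            if x + y <= max_align and abs(x - y) <= 1:
                types.append((x, y))
    return types

print(alignment_types(5))
# [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2)]
```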
Hi. Is there a way to specify `max_align` with more granularity? For my use case, I would like to limit the allowable alignments to 1:1, 1:2, ..., 1:n, and the inverse thereof (1:1, 2:1, ..., n:1). In other words, I want to exclude 1:0, 0:1, and many-to-many alignments. EDIT: never mind, modifying `get_alignment_types` or hardcoding `second_alignment_types` seems to have done the trick.
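For anyone with the same use case, the restricted set of types described in the edit above (only 1:k and k:1, no 1:0, 0:1, or many-to-many) could be generated with a small helper like the following. The function name is hypothetical; this just shows what the hardcoded list would look like, not bertalign's API.

```python
# Hypothetical helper: restrict alignments to 1:k and k:1 only,
# excluding 1:0, 0:1, and many-to-many types as described above.

def one_to_many_types(n):
    """Allowed types: 1:1, 1:2, ..., 1:n and 2:1, ..., n:1."""
    types = [(1, k) for k in range(1, n + 1)]
    types += [(k, 1) for k in range(2, n + 1)]  # start at 2 to avoid duplicating (1, 1)
    return types

print(one_to_many_types(3))
# [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1)]
```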
Hi, thanks for providing this code! Could you please give more information (e.g. a brief explanation) of the following options?
Thank you in advance!
Rachele