
Bug in Sentence Offset #24

Open
gunturbudi opened this issue Apr 1, 2021 · 1 comment

gunturbudi commented Apr 1, 2021

Hello, I want to report a bug. I have a sentence like this:

As a administrator, I want to refund sponsorship money that was processed via stripe, so that people get their monies back.

When I try to convert it to CoNLL, the spans are not converted correctly. I debugged the library and found that the offsets are wrong. Here is the output of the offsets:

As 0
a 3
administrator, 3
I 20
want 22
to 25
refund 30
sponsorship 37
money 49
that 55
was 60
processed 64
via 74
stripe, 78
so 86
that 89
people 94
get 101
their 103
monies 111
back. 118

As you can see, the second and third lines have the same offset (3 and 3, while it should be 3 and 5). Because of this, the span goes undetected in the conversion process.

It seems that the get_offsets function in utils.py decides the offsets by checking character-by-character equality:

from typing import List, Optional

def get_offsets(
        text: str,
        tokens: List[str],
        start: Optional[int] = 0) -> List[int]:
    """Calculate char offsets of each tokens.

    Args:
        text (str): The string before tokenized.
        tokens (List[str]): The list of the string. Each string corresponds
            token.
        start (Optional[int]): The start position.
    Returns:
        (List[str]): The list of the offset.
    """
    offsets = []
    i = 0
    for token in tokens:
        for j, char in enumerate(token):
            while char != text[i]:
                i += 1

            if j == 0:
                offsets.append(i + start)
    return offsets
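To reproduce, here is a minimal, self-contained run of the function above on the reported sentence (the function is repeated, with the typing import added, so the snippet runs on its own):

```python
from typing import List, Optional

def get_offsets(text: str, tokens: List[str],
                start: Optional[int] = 0) -> List[int]:
    offsets = []
    i = 0
    for token in tokens:
        for j, char in enumerate(token):
            # Scan forward until the current character matches.
            while char != text[i]:
                i += 1
            if j == 0:
                offsets.append(i + start)
    return offsets

text = ("As a administrator, I want to refund sponsorship money that was "
        "processed via stripe, so that people get their monies back.")
tokens = text.split()
print(get_offsets(text, tokens)[:4])  # [0, 3, 3, 20]
```

Because the cursor is not advanced past "a" after it matches, "administrator," immediately re-matches the same character and both tokens get offset 3.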

This is a problem whenever the last character of the previous token is the same as the first character of the next token. I'm still looking for a fix to this problem.
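For what it's worth, one possible alternative fix (a sketch, not taken from this thread) is to search for each whole token with str.index and then advance the cursor past it, so adjacent tokens can never reuse the same characters. This assumes every token appears verbatim in the original text (i.e., the tokenizer does not rewrite characters):

```python
from typing import List, Optional

def get_offsets_sketch(text: str, tokens: List[str],
                       start: Optional[int] = 0) -> List[int]:
    offsets = []
    i = 0
    for token in tokens:
        i = text.index(token, i)  # locate the whole token at or after the cursor
        offsets.append(i + start)
        i += len(token)           # advance past it so the next search cannot overlap
    return offsets

text = ("As a administrator, I want to refund sponsorship money that was "
        "processed via stripe, so that people get their monies back.")
print(get_offsets_sketch(text, text.split())[:4])  # [0, 3, 5, 20]
```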

Cheers

gunturbudi (author) commented:

I managed to solve the problem with this modification in utils.py. It basically checks for the case I mentioned above, where the last character of the previous token is the same as the first character of the next token.

from typing import List, Optional

def get_offsets(
        text: str,
        tokens: List[str],
        start: Optional[int] = 0) -> List[int]:

    """Calculate char offsets of each tokens.

    Args:
        text (str): The string before tokenized.
        tokens (List[str]): The list of the string. Each string corresponds
            token.
        start (Optional[int]): The start position.
    Returns:
        (List[str]): The list of the offset.
    """

    offsets = []
    i = 0

    same_char = False

    for k, token in enumerate(tokens):
        # Flag tokens whose first character matches the previous token's
        # last character (skip the check for the very first token).
        same_char = k > 0 and token[0] == tokens[k - 1][-1]
        
        for j, char in enumerate(token):
            while char != text[i] or same_char:
                i += 1
                same_char = False

            if j == 0:
                offsets.append(i + start)

    return offsets
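As a self-contained sanity check (repeating the patch so the snippet runs on its own, with a small k > 0 guard added so the first token never compares against tokens[-1]), the modification yields the expected offsets on the reported sentence:

```python
from typing import List, Optional

def get_offsets(text: str, tokens: List[str],
                start: Optional[int] = 0) -> List[int]:
    offsets = []
    i = 0
    for k, token in enumerate(tokens):
        # Force at least one cursor advance when this token starts with the
        # same character that ended the previous token.
        same_char = k > 0 and token[0] == tokens[k - 1][-1]
        for j, char in enumerate(token):
            while char != text[i] or same_char:
                i += 1
                same_char = False
            if j == 0:
                offsets.append(i + start)
    return offsets

text = ("As a administrator, I want to refund sponsorship money that was "
        "processed via stripe, so that people get their monies back.")
print(get_offsets(text, text.split())[:4])  # [0, 3, 5, 20]
```

"a" and "administrator," now get the distinct offsets 3 and 5.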

I don't know if it's the best solution, but it works for me, and luckily my NER model improved :)

Regards
