input_mask and number_indices don't match because of [cls] at the beginning #4

Open
KnightZhang625 opened this issue Feb 15, 2022 · 0 comments


def convert_single_mathqa_example(example, is_training, tokenizer, max_seq_length,
                                  max_program_length, op_list, op_list_size,
                                  const_list, const_list_size,
                                  cls_token, sep_token):
    """Converts a single MathQAExample into an InputFeature."""
    features = []
    question_tokens = example.question_tokens
    if len(question_tokens) > max_seq_length - 2:
        print("too long")
        question_tokens = question_tokens[:max_seq_length - 2]
    tokens = [cls_token] + question_tokens + [sep_token]         # 1. This line adds [cls_token] at the beginning.
    segment_ids = [0] * len(tokens)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    input_mask = [1] * len(input_ids)
    for ind, offset in enumerate(example.number_indices):          # 2. Why aren't number_indices offset by 1?
        if offset < len(input_mask):
            input_mask[offset] = 2
        else:
            if is_training == True:

                # invalid example, drop for training
                return features

            # assert is_training == False

Hello, thanks for the great work! However, I am confused by the code.

At comment 1, you add [cls_token] in front of the tokens, which shifts the index of every question token to the right by 1. At comment 2, you use example.number_indices directly to assign 2 to the positions of the numbers. This is confusing, since input_mask is built from tokens, which contains [cls] at the beginning.

For example, with tokens: [[cls], a, b, 1, c, d], example.number_indices will be [2] (because when example.number_indices is calculated there is no [cls] at the beginning, so the "2" is the index of the number "1"). The initial input_mask will be [1, 1, 1, 1, 1, 1]. After assigning 2 at the positions in example.number_indices, the input_mask will be [1, 1, 2, 1, 1, 1]. However, index 2 in tokens refers to "b", not to the number "1".

Could you please explain this? Thanks!
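The mismatch described above can be reproduced with a small toy script. This is only a sketch: the token values, the `build_input_mask` helper, and its `shift` parameter are hypothetical and not part of the repository, and it assumes that `number_indices` was computed on the question tokens before `[CLS]` was prepended. If the data reader already accounts for `[CLS]`, no shift is needed.

```python
# Toy reproduction of the suspected off-by-one. All values here are
# illustrative assumptions, not data taken from the repository.
cls_token, sep_token = "[CLS]", "[SEP]"
question_tokens = ["a", "b", "1", "c", "d"]
number_indices = [2]  # index of "1" in question_tokens (assumed: no [CLS] counted)

tokens = [cls_token] + question_tokens + [sep_token]


def build_input_mask(tokens, number_indices, shift=0):
    """Return a mask with 2 at number positions and 1 elsewhere.

    `shift` is a hypothetical parameter: 0 mimics the quoted code,
    1 compensates for a prepended [CLS] token.
    """
    input_mask = [1] * len(tokens)
    for offset in number_indices:
        if offset + shift < len(input_mask):
            input_mask[offset + shift] = 2
    return input_mask


unshifted = build_input_mask(tokens, number_indices)         # as in the quoted code
shifted = build_input_mask(tokens, number_indices, shift=1)  # compensating for [CLS]

# Which tokens end up marked as numbers in each case?
marked_unshifted = [tokens[i] for i, m in enumerate(unshifted) if m == 2]
marked_shifted = [tokens[i] for i, m in enumerate(shifted) if m == 2]

print(marked_unshifted)  # ['b']
print(marked_shifted)    # ['1']
```

Under that assumption, the unshifted mask marks "b" while the shifted one marks the number "1", which is the behavior I would expect.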
