[NER] Add support for Chinese Named Entities #2676

Merged
merged 10 commits into from
Jan 13, 2023

cheungdaven
Contributor

Description of changes:

  1. Add support for Chinese NER.
  2. Add unit test for Chinese NER.
  3. Automatically generate BIO tags when tags are not in BIO format.
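
A minimal sketch of what item 3 could look like, not the code from this PR. It assumes annotations are `((start, end), label)` pairs with character offsets and an exclusive end, and that every character is tagged on its own, which is a common choice for Chinese text. The function name `char_level_bio` is made up for illustration.

```python
from typing import List, Tuple

# ((start, end), label) with character offsets, end exclusive (assumed format).
Annotation = Tuple[Tuple[int, int], str]


def char_level_bio(text: str, annotations: List[Annotation]) -> List[str]:
    """Tag every character of `text` with a BIO label (hypothetical helper)."""
    tags = ["O"] * len(text)
    for (start, end), label in annotations:
        if start >= end:
            # Skip empty spans.
            continue
        if label.startswith(("B-", "I-")):
            # The label is already in BIO format: use it unchanged.
            for pos in range(start, end):
                tags[pos] = label
        else:
            # Plain label such as "LOC": generate the BIO prefixes.
            tags[start] = "B-" + label
            for pos in range(start + 1, end):
                tags[pos] = "I-" + label
    return tags


text = "我住在北京"  # "I live in Beijing"
print(char_level_bio(text, [((3, 5), "LOC")]))
# ['O', 'O', 'O', 'B-LOC', 'I-LOC']
```

Annotations that already carry `B-`/`I-` prefixes pass through untouched, so both input styles give the same tag sequence.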

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-2676-f61dc77 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/f61dc77/index.html

@github-actions

Job PR-2676-8759c75 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/8759c75/index.html

@@ -35,6 +36,7 @@ def fit(self, y: pd.Series, x: pd.Series):
        _, entity_groups = self.extract_ner_annotations(y)
        self.unique_entity_groups = self.ner_special_tags + entity_groups
        self.entity_map = {entity: index for index, entity in enumerate(self.unique_entity_groups)}
        self.config.entity_map = self.entity_map
Collaborator

Do we need to put entity_map under self.config, or is it enough to keep self.entity_map?

Contributor Author

The data processor needs this entity map, which is why it is put in config. Special tags such as "O" are also in config.

Collaborator

Shall we extend the NerProcessor to include the entity_map keyword then?

Collaborator

class NerProcessor:
    """
    Prepare NER data for the model specified by "prefix".
    """
    def __init__(
        self,
        model: nn.Module,
        max_len: Optional[int] = None,
        entity_map: Optional[dict] = None,
        config: Optional[DictConfig] = None,
    ):

Contributor Author

I have already added self.config to NerProcessor, so entity_map can be accessed via self.config.entity_map.
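
For readers following the thread, here is a rough sketch of the config-based approach discussed above. It is an illustration under assumptions, not the AutoGluon implementation: `SimpleNamespace` stands in for the real `DictConfig`, `ToyProcessor` is a made-up class, and the special-tag list contains only `"O"` because that is the only special tag named in this conversation.

```python
from types import SimpleNamespace

# Assumed values for illustration; the real special-tag list may hold more entries.
ner_special_tags = ["O"]
entity_groups = ["B-LOC", "I-LOC", "B-PER", "I-PER"]

# Same construction as in the diff: special tags first, then entity groups.
unique_entity_groups = ner_special_tags + entity_groups
entity_map = {entity: index for index, entity in enumerate(unique_entity_groups)}

# Stand-in for the DictConfig object that carries the map to the processor.
config = SimpleNamespace(entity_map=entity_map)


class ToyProcessor:
    """Hypothetical processor that only receives the config."""

    def __init__(self, config):
        self.config = config

    def encode(self, tags):
        # Look up each tag through the config rather than a separate argument.
        return [self.config.entity_map[tag] for tag in tags]


print(ToyProcessor(config).encode(["O", "B-LOC", "I-LOC"]))
# [0, 1, 2]
```

Passing the map through the config keeps the processor signature unchanged, at the cost of making the dependency less visible than an explicit `entity_map` keyword would.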

@github-actions

Job PR-2676-eaf21d6 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/eaf21d6/index.html

for annot in ner_annotations:
    custom_offset = annot[0]
    custom_label = annot[1]
    b_prefix = "B-"
Collaborator

Need to add "entity_map" in the docstring:

def process_ner_annotations(ner_annotations, ner_text, entity_map, tokenizer, is_eval=False):
    """
    Generate token-level/word-level labels with given text and NER annotations.

    Parameters
    ----------
    ner_annotations
        The NER annotations.
    ner_text
        The corresponding raw text.
    entity_map
        The entity map between token label to word label.
    tokenizer
        The tokenizer to be used.
    is_eval
        Whether it is for evaluation or not, default: False

    Returns
    -------
    Token-level/word-level labels and text features.
    """
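
The function body is not shown in this thread. As a hedged sketch of the general idea only, the snippet below aligns character-offset annotations to token offsets and returns label ids through an entity map. The name `align_labels`, the offset format, and the token segmentation are all assumptions; a real tokenizer would supply the offsets.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align_labels(
    token_offsets: List[Span],
    annotations: List[Tuple[Span, str]],
    entity_map: Dict[str, int],
) -> List[int]:
    """Assign one label id per token (hypothetical helper)."""
    label_ids = [entity_map["O"]] * len(token_offsets)
    for (start, end), label in annotations:
        is_first = True
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            # A non-empty token belongs to the entity when it lies fully inside the span.
            if tok_start < tok_end and start <= tok_start and tok_end <= end:
                prefix = "B-" if is_first else "I-"
                label_ids[i] = entity_map[prefix + label]
                is_first = False
    return label_ids


entity_map = {"O": 0, "B-LOC": 1, "I-LOC": 2}
# "我住在北京" split into 我 / 住在 / 北 / 京 (a made-up segmentation).
token_offsets = [(0, 1), (1, 3), (3, 4), (4, 5)]
print(align_labels(token_offsets, [((3, 5), "LOC")], entity_map))
# [0, 0, 1, 2]
```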

train_data=train_df,
tuning_data=dev_df,
hyperparameters={
    "model.ner_text.checkpoint_name": "microsoft/mdeberta-v3-base",
Collaborator

Shall we use a smaller model?

@github-actions

Job PR-2676-9350aa1 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/9350aa1/index.html

Collaborator

@sxjscience left a comment

LGTM!

@cheungdaven merged commit c89f765 into autogluon:master on Jan 13, 2023