[NER] Add support for Chinese Named Entities #2676

cheungdaven · 2023-01-11T05:14:55Z

Description of changes:

Add support for Chinese NER.
Add unit test for Chinese NER.
Automatically generate BIO tags when tags are not in BIO format.
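
To illustrate the third item, here is a minimal sketch of how plain entity labels could be turned into BIO tags. It is not the code from this PR: the `(start, end, label)` span format, the per-character tokenization, and the names `is_bio_label` and `to_bio_tags` are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with `end` exclusive.
Span = Tuple[int, int, str]


def is_bio_label(label: str) -> bool:
    """Return True if the label is 'O' or already carries a B-/I- prefix."""
    return label == "O" or label.startswith(("B-", "I-"))


def to_bio_tags(token_offsets: List[Tuple[int, int]], spans: List[Span]) -> List[str]:
    """Assign one tag per token; labels without a prefix get B-/I- added."""
    tags = ["O"] * len(token_offsets)
    for start, end, label in spans:
        first = True
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            # A token belongs to the span if it lies fully inside it.
            if tok_start >= start and tok_end <= end:
                if is_bio_label(label):
                    tags[i] = label  # already in BIO format, keep as-is
                else:
                    tags[i] = ("B-" if first else "I-") + label
                    first = False
    return tags


# Assumption: one token per character, purely for illustration.
text = "张三在北京工作"
offsets = [(i, i + 1) for i in range(len(text))]
spans = [(0, 2, "PERSON"), (3, 5, "LOCATION")]
tags = to_bio_tags(offsets, spans)
```

In this sketch the first token of each span receives the `B-` prefix and the remaining tokens receive `I-`, while labels that already start with `B-` or `I-` pass through unchanged.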

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2023-01-11T21:35:44Z

Job PR-2676-f61dc77 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/f61dc77/index.html

github-actions · 2023-01-12T02:19:40Z

Job PR-2676-8759c75 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/8759c75/index.html

sxjscience · 2023-01-12T19:15:37Z

multimodal/src/autogluon/multimodal/data/labelencoder_ner.py

@@ -35,6 +36,7 @@ def fit(self, y: pd.Series, x: pd.Series):
        _, entity_groups = self.extract_ner_annotations(y)
        self.unique_entity_groups = self.ner_special_tags + entity_groups
        self.entity_map = {entity: index for index, entity in enumerate(self.unique_entity_groups)}
+        self.config.entity_map = self.entity_map


Do we need to put entity_map under self.config, or can we just keep self.entity_map?

The data processor needs this entity map, which is why it is put in the config. Special tags such as "O" are also in the config.

Shall we extend the NerProcessor to include the entity_map keyword then?

    class NerProcessor:
        """
        Prepare NER data for the model specified by "prefix".
        """

        def __init__(
            self,
            model: nn.Module,
            max_len: Optional[int] = None,
            entity_map: Optional[dict] = None,
            config: Optional[DictConfig] = None,
        ):

self.config has already been added to NerProcessor, so entity_map can be accessed via self.config.entity_map.
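
The approach discussed in this thread, sharing the entity map through a common config object, can be sketched as follows. This is a simplified illustration under stated assumptions, not the AutoGluon implementation: `LabelEncoderSketch` and `ProcessorSketch` are hypothetical stand-ins, `types.SimpleNamespace` replaces the real config object, and the special-tag list `["O"]` is a placeholder.

```python
from types import SimpleNamespace
from typing import Dict, List


class LabelEncoderSketch:
    """Hypothetical stand-in for the label encoder touched by this diff."""

    def __init__(self, config: SimpleNamespace, special_tags: List[str]):
        self.config = config
        self.ner_special_tags = special_tags
        self.entity_map: Dict[str, int] = {}

    def fit(self, entity_groups: List[str]) -> None:
        unique_entity_groups = self.ner_special_tags + entity_groups
        self.entity_map = {entity: index for index, entity in enumerate(unique_entity_groups)}
        # Publish the map on the shared config so other components can read it.
        self.config.entity_map = self.entity_map


class ProcessorSketch:
    """Hypothetical stand-in for a processor that only receives the config."""

    def __init__(self, config: SimpleNamespace):
        self.config = config

    def encode(self, tags: List[str]) -> List[int]:
        # No entity_map argument is needed; it is looked up on the config.
        return [self.config.entity_map[tag] for tag in tags]


config = SimpleNamespace()  # assumption: a plain namespace instead of the real config
encoder = LabelEncoderSketch(config, special_tags=["O"])
encoder.fit(["B-PERSON", "I-PERSON"])

processor = ProcessorSketch(config)
ids = processor.encode(["B-PERSON", "I-PERSON", "O"])
```

The trade-off raised by the reviewer is visible here: the processor constructor stays small, but its dependency on `entity_map` becomes implicit, since it only works once the encoder has been fitted on the same config object.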

github-actions · 2023-01-12T19:48:17Z

Job PR-2676-eaf21d6 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/eaf21d6/index.html

sxjscience · 2023-01-12T19:49:43Z

multimodal/src/autogluon/multimodal/data/utils.py

-        for annot in ner_annotations:
-            custom_offset = annot[0]
-            custom_label = annot[1]
+    b_prefix = "B-"


Need to add "entity_map" in the docstring:

    def process_ner_annotations(ner_annotations, ner_text, entity_map, tokenizer, is_eval=False):
        """
        Generate token-level/word-level labels with given text and NER annotations.

        Parameters
        ----------
        ner_annotations
            The NER annotations.
        ner_text
            The corresponding raw text.
        entity_map
            The entity map between token label to word label.
        tokenizer
            The tokenizer to be used.
        is_eval
            Whether it is for evaluation or not, default: False

        Returns
        -------
        Token-level/word-level labels and text features.
        """

sxjscience · 2023-01-12T19:54:07Z

multimodal/tests/unittests/others/test_ner_chinese.py

+        train_data=train_df,
+        tuning_data=dev_df,
+        hyperparameters={
+            "model.ner_text.checkpoint_name": "microsoft/mdeberta-v3-base",


Shall we use a smaller model?

github-actions · 2023-01-13T03:51:35Z

Job PR-2676-9350aa1 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2676/9350aa1/index.html

sxjscience

LGTM!

ner for chinese

42bbbf0

cheungdaven requested a review from sxjscience January 11, 2023 05:15

Ubuntu and others added 4 commits January 11, 2023 07:19

fix ci error

9b4e2e7

fix ci error

3dc3ffd

Merge branch 'autogluon:master' into ch

d720d26

Merge branch 'autogluon:master' into ch

f61dc77

Ubuntu and others added 3 commits January 11, 2023 22:31

move entity map to config

6bc82b6

move entity map to config

a24b8d3

Merge branch 'autogluon:master' into ch

8759c75

fix issue in visualizing chinese entities

eaf21d6

sxjscience reviewed Jan 12, 2023

remove BIO in prediction

9350aa1

sxjscience approved these changes Jan 13, 2023

cheungdaven merged commit c89f765 into autogluon:master Jan 13, 2023
