Chinese Penn Treebank POS tagset mapping #19

Closed
apmoore1 opened this issue Dec 10, 2021 · 2 comments · Fixed by #22

@apmoore1
Member

The Chinese spaCy model outputs POS tags from the Chinese treebank tagset rather than the Universal POS (UPOS) tagset. To use the Chinese spaCy model's POS tagger with the USASRuleBasedTagger, and so make the most of the POS information within the Chinese USAS lexicon, we therefore need a mapping from the Chinese treebank tagset to the USAS core tagset.

One solution is to take the existing mapping from the Chinese treebank tagset to the UPOS tagset, which can be found here, and replace the UPOS tags in that mapping with USAS core tags using our current UPOS to USAS core mapping.

@apmoore1 apmoore1 self-assigned this Dec 10, 2021
@apmoore1 apmoore1 added the enhancement New feature or request label Dec 10, 2021
@apmoore1
Member Author

apmoore1 commented Jan 7, 2022

It turns out that spaCy does have a mapping between the Chinese Penn Treebank POS tagset and UPOS. I did not find it at first because I was inspecting the output of nlp.analyze_pipes, whereas the mapping is within their AttributeRuler patterns. This means that we no longer need to create a mapping of our own; however, I think adding spaCy's mapping to our code base would be a good idea. What do you think @perayson? The spaCy mapping can be found like so:

# First you will need to download the Chinese spaCy model like so: 
# python -m spacy download zh_core_web_sm
import spacy
nlp = spacy.load('zh_core_web_sm')
attribute_ruler = nlp.get_pipe('attribute_ruler')
for pattern in attribute_ruler.patterns:
    print(pattern)

This will output the following:

{'patterns': [[{'TAG': 'AS'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEG'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DER'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEV'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'ETC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'LC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'MSP'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'SP'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'BA'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'FW'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'IJ'}]], 'attrs': {'POS': 'INTJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'LB'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'ON'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'SB'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'X'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'URL'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'INF'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NN'}]], 'attrs': {'POS': 'NOUN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NR'}]], 'attrs': {'POS': 'PROPN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NT'}]], 'attrs': {'POS': 'NOUN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VA'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VC'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VE'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VV'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CD'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'M'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'OD'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DT'}]], 'attrs': {'POS': 'DET', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CC'}]], 'attrs': {'POS': 'CCONJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CS'}]], 'attrs': {'POS': 'SCONJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'AD'}]], 'attrs': {'POS': 'ADV', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'JJ'}]], 'attrs': {'POS': 'ADJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'P'}]], 'attrs': {'POS': 'ADP', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'PN'}]], 'attrs': {'POS': 'PRON', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'PU'}]], 'attrs': {'POS': 'PUNCT', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': '_SP'}]], 'attrs': {'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE'}, 'index': 0}
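
The printed patterns can be collapsed into a plain tag-to-tag dictionary. Below is a minimal sketch of that, assuming each entry has the shape shown in the output above. The names `tag_to_upos` and `sample_patterns` are hypothetical helpers for illustration, not part of spaCy or PyMUSAS, and the sample is only three entries copied from the output so that the snippet runs without the model installed.

```python
from typing import Any, Dict, List


def tag_to_upos(patterns: List[Dict[str, Any]]) -> Dict[str, str]:
    """Collapse AttributeRuler-style pattern entries into a TAG -> POS dictionary."""
    mapping: Dict[str, str] = {}
    for entry in patterns:
        pos = entry["attrs"].get("POS")
        if pos is None:
            continue
        for token_patterns in entry["patterns"]:
            # Keep only single-token patterns that match on TAG alone;
            # anything else (e.g. IS_SPACE) is not a plain tag-to-tag rule.
            if len(token_patterns) == 1 and set(token_patterns[0]) == {"TAG"}:
                mapping[token_patterns[0]["TAG"]] = pos
    return mapping


# Three entries copied from the output above, standing in for attribute_ruler.patterns.
sample_patterns = [
    {'patterns': [[{'TAG': 'AS'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0},
    {'patterns': [[{'TAG': 'NR'}]], 'attrs': {'POS': 'PROPN', 'MORPH': '_'}, 'index': 0},
    {'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE'}, 'index': 0},
]

print(tag_to_upos(sample_patterns))
# → {'AS': 'PART', 'NR': 'PROPN'}
```

With the model installed, passing `attribute_ruler.patterns` in place of `sample_patterns` should give the full mapping, provided the patterns keep the shape shown above.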

From this we can see the mapping between the Chinese Penn Treebank tagset and UPOS, e.g. AS maps to PART. This mapping is slightly different to the original UPOS mapping, which can be found here, because the original UPOS tagset has since been expanded through the Universal Dependencies project; the expanded UPOS tagset can be found here. The two mappings are similar, differing only in the following tags:

| Chinese Penn Treebank | spaCy Mapping | Original UPOS |
|-----------------------|---------------|---------------|
| IJ                    | INTJ          | X             |
| CS                    | SCONJ         | CONJ          |
| NR                    | PROPN         | NOUN          |
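
A comparison like the one above can be reproduced mechanically once both mappings are available as dictionaries. This is only a sketch: `mapping_differences` is a hypothetical helper, and the two dictionaries hold just the three differing tags plus NN (which maps to NOUN in both) rather than the full mappings.

```python
from typing import Dict, Tuple


def mapping_differences(first: Dict[str, str],
                        second: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return the tags present in both mappings whose targets differ."""
    shared_tags = sorted(first.keys() & second.keys())
    return {tag: (first[tag], second[tag])
            for tag in shared_tags
            if first[tag] != second[tag]}


# Subsets of the two mappings, taken from the comparison above.
spacy_subset = {"IJ": "INTJ", "CS": "SCONJ", "NR": "PROPN", "NN": "NOUN"}
original_subset = {"IJ": "X", "CS": "CONJ", "NR": "NOUN", "NN": "NOUN"}

for tag, (spacy_pos, original_pos) in mapping_differences(spacy_subset, original_subset).items():
    print(tag, spacy_pos, original_pos)
```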

The following tags are not in the original Chinese Penn Treebank POS tagset but are in the spaCy model, and have the following mappings to UPOS:

| spaCy Chinese Penn Treebank | Mapping |
|-----------------------------|---------|
| INF                         | X       |
| URL                         | X       |
| X                           | X       |

I think that if we do add the Chinese Penn Treebank mappings to PyMUSAS, so that we have a map from the Chinese Penn Treebank to the USAS core POS tagset, we should do it through the spaCy mapping, e.g. map from:
Chinese Penn Treebank -> spaCy UPOS mapping -> USAS core
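
The chain above amounts to composing two dictionaries. The sketch below shows one way that could look. It assumes the UPOS to USAS core mapping is a dictionary from a UPOS tag to a list of USAS core tags; the function name `compose_mappings` and the placeholder values in `upos_to_usas_core` are made up for illustration and are not PyMUSAS's actual mapping.

```python
from typing import Dict, List


def compose_mappings(cptb_to_upos: Dict[str, str],
                     upos_to_usas_core: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each Chinese Penn Treebank tag to USAS core tags via its UPOS tag."""
    composed: Dict[str, List[str]] = {}
    for cptb_tag, upos_tag in cptb_to_upos.items():
        if upos_tag not in upos_to_usas_core:
            # Fail loudly rather than silently dropping a treebank tag.
            raise KeyError(f"No USAS core mapping for UPOS tag {upos_tag!r} "
                           f"(needed by treebank tag {cptb_tag!r})")
        # Copy the list so the composed mapping does not share state with the input.
        composed[cptb_tag] = list(upos_to_usas_core[upos_tag])
    return composed


# A few treebank -> UPOS pairs taken from the spaCy output above.
cptb_to_upos = {"NN": "NOUN", "NT": "NOUN", "NR": "PROPN", "VV": "VERB"}

# Placeholder values only: NOT the real PyMUSAS UPOS -> USAS core mapping.
upos_to_usas_core = {
    "NOUN": ["placeholder_noun"],
    "PROPN": ["placeholder_proper_noun"],
    "VERB": ["placeholder_verb"],
}

print(compose_mappings(cptb_to_upos, upos_to_usas_core))
```

Raising on a missing UPOS tag is a design choice for the sketch: it makes gaps in the second mapping visible instead of producing a composed mapping that quietly omits treebank tags.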

@perayson
Member

perayson commented Jan 7, 2022

Great, this sounds good, please go ahead, and then I can ask Scott or others to sanity check the output.
