Chinese Penn Treebank POS tagset mapping #19
It turns out that spaCy does have a mapping between the Chinese Penn Treebank POS tagset and UPOS. I did not find this at first. The mapping can be listed like so:
# First you will need to download the Chinese spaCy model like so:
# python -m spacy download zh_core_web_sm
import spacy
nlp = spacy.load('zh_core_web_sm')
attribute_ruler = nlp.get_pipe('attribute_ruler')
for pattern in attribute_ruler.patterns:
    print(pattern)
This will output the following:
{'patterns': [[{'TAG': 'AS'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEG'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DER'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEV'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'ETC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'LC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'MSP'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'SP'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'BA'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'FW'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'IJ'}]], 'attrs': {'POS': 'INTJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'LB'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'ON'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'SB'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'X'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'URL'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'INF'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NN'}]], 'attrs': {'POS': 'NOUN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NR'}]], 'attrs': {'POS': 'PROPN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NT'}]], 'attrs': {'POS': 'NOUN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VA'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VC'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VE'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VV'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CD'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'M'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'OD'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DT'}]], 'attrs': {'POS': 'DET', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CC'}]], 'attrs': {'POS': 'CCONJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CS'}]], 'attrs': {'POS': 'SCONJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'AD'}]], 'attrs': {'POS': 'ADV', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'JJ'}]], 'attrs': {'POS': 'ADJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'P'}]], 'attrs': {'POS': 'ADP', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'PN'}]], 'attrs': {'POS': 'PRON', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'PU'}]], 'attrs': {'POS': 'PUNCT', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': '_SP'}]], 'attrs': {'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE'}, 'index': 0}
From this we can see the mapping between the Chinese Penn Treebank tagset and UPOS, e.g. NN maps to NOUN and VV maps to VERB.
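As a rough sketch, the rule output above could be collapsed into a plain tag-to-UPOS lookup table. The helper name `tag_to_upos` is hypothetical (it is not part of spaCy), and the sample entries are copied from the output above so that the snippet runs without downloading the model. With the model installed, passing `attribute_ruler.patterns` instead should work the same way, assuming the patterns keep the shape shown above.

```python
from typing import Any, Dict, List


def tag_to_upos(patterns: List[Dict[str, Any]]) -> Dict[str, str]:
    """Collect single-token TAG -> POS rules (hypothetical helper, not a spaCy API)."""
    mapping: Dict[str, str] = {}
    for entry in patterns:
        pos = entry["attrs"].get("POS")
        if pos is None:
            continue
        for token_patterns in entry["patterns"]:
            # Keep only rules that match exactly one token by its TAG,
            # which skips rules such as the IS_SPACE one.
            if len(token_patterns) == 1 and "TAG" in token_patterns[0]:
                mapping[token_patterns[0]["TAG"]] = pos
    return mapping


# A few entries copied from the output above.
sample_patterns = [
    {'patterns': [[{'TAG': 'NN'}]], 'attrs': {'POS': 'NOUN', 'MORPH': '_'}, 'index': 0},
    {'patterns': [[{'TAG': 'NR'}]], 'attrs': {'POS': 'PROPN', 'MORPH': '_'}, 'index': 0},
    {'patterns': [[{'TAG': 'VV'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0},
    {'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE'}, 'index': 0},
]

print(tag_to_upos(sample_patterns))
```

The `IS_SPACE` rule is left out on purpose, since it assigns a tag rather than mapping from one.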
The following tags are not in the original Chinese Penn Treebank POS tagset but are in the spaCy model, where they have the following mappings to UPOS:
I think that if we add the Chinese Penn Treebank mappings to PyMUSAS, so that we have a map from Chinese Penn Treebank to the USAS core POS tagset, we should do it through the spaCy mapping, e.g. map from Chinese Penn Treebank to UPOS, and then from UPOS to USAS core.
Great, this sounds good. Please go ahead, and then I can ask Scott or others to sanity-check the output.
The Chinese spaCy model outputs POS tags that come from the Chinese treebank tagset rather than the Universal POS tagset. To use that model's POS tagger with the USASRuleBasedTagger, and so make the most of the POS information within the Chinese USAS lexicon, we need a mapping from the Chinese treebank tagset to the USAS core tagset.
A solution to this is to take the existing mapping from the Chinese treebank tagset to the Universal POS (UPOS) tagset, which can be found here, and swap the UPOS tags in that mapping for USAS core tags using the mapping we currently have from UPOS to USAS core.
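The swap described above amounts to composing two lookup tables, which could be sketched as follows. The function name `compose_mappings` is hypothetical. The treebank-to-UPOS excerpt is taken from the spaCy output in this thread, but the UPOS-to-USAS-core values are illustrative placeholders only and have not been checked against the mapping PyMUSAS actually ships. Representing each USAS core entry as a list of tags is also an assumption.

```python
from typing import Dict, List


def compose_mappings(
    ctb_to_upos: Dict[str, str],
    upos_to_usas_core: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each treebank tag to USAS core tags by going through UPOS."""
    composed: Dict[str, List[str]] = {}
    for ctb_tag, upos_tag in ctb_to_upos.items():
        if upos_tag not in upos_to_usas_core:
            # Fail loudly rather than silently dropping a treebank tag.
            raise KeyError(f"No USAS core mapping for UPOS tag {upos_tag!r}")
        # Copy the list so later edits do not alter the source mapping.
        composed[ctb_tag] = list(upos_to_usas_core[upos_tag])
    return composed


# Excerpt of the Chinese treebank -> UPOS mapping from the spaCy output.
ctb_to_upos = {"NN": "NOUN", "NR": "PROPN", "VV": "VERB", "PU": "PUNCT"}

# Placeholder values for illustration; the real UPOS -> USAS core
# mapping should be taken from PyMUSAS itself.
upos_to_usas_core = {
    "NOUN": ["noun"],
    "PROPN": ["pnoun"],
    "VERB": ["verb"],
    "PUNCT": ["punc"],
}

ctb_to_usas_core = compose_mappings(ctb_to_upos, upos_to_usas_core)
print(ctb_to_usas_core)
```

Raising on an unmapped UPOS tag is a design choice for this sketch: it makes any gap between the two tagsets visible while the mapping is being sanity-checked.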