Part Of Speech

Panos Louridas edited this page Jul 27, 2018 · 2 revisions

Part-Of-Speech tagger

Definition from Wikipedia:

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech,[1] based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.

Greek POS tags

The POS tags for the Greek language are based on the Universal POS tags. They are defined in the tag map file.

The Universal POS tag schema is followed because a public annotated Greek Dependency Treebank is available here, and it is based on the Universal POS tags. This made it a good starting point for a spaCy POS tagging model for the Greek language.

The Universal POS tags (with their definitions in Greek and some clarifications) are the following:

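  1. ADJ: επίθετα (adjectives)
  2. ADV: επιρρήματα (adverbs)
  3. ADP: προθέσεις (prepositions)
  4. AUX: ρήματα για σχηματισμό χρόνων (verbs used to form tenses)
  5. INTJ: επιφωνήματα (interjections)
  6. PROPN: ουσιαστικά που χρησιμοποιούνται ως ονόματα (nouns used as names)
  7. VERB: ρήματα (verbs)
  8. CCONJ: παρατακτικοί σύνδεσμοι (coordinating conjunctions)
  9. SCONJ: υποτακτικοί σύνδεσμοι (subordinating conjunctions)
  10. PART: μόρια (particles)
  11. PUNCT: σημεία στίξης (punctuation marks)
  12. SYM: σύμβολα (symbols)
  13. NUM: αριθμητικά (numerals)
  14. PRON: αντωνυμίες (pronouns)
  15. SPACE: κενό (space)
  16. DET: άρθρα (articles)
  17. NOUN: ουσιαστικά (nouns)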
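
The tag map file itself is not reproduced on this page. As a rough sketch only, assuming it follows the dictionary shape used by spaCy 2.x tag maps (each tag mapped to a `pos` attribute), the 17 tags above could be expressed as follows. The names `GREEK_POS_TAGS`, `TAG_MAP`, and `coarse_pos` are hypothetical and chosen for illustration; the project's actual file may differ.

```python
# The 17 tags listed on this page, in the same order.
GREEK_POS_TAGS = [
    "ADJ", "ADV", "ADP", "AUX", "INTJ", "PROPN", "VERB", "CCONJ", "SCONJ",
    "PART", "PUNCT", "SYM", "NUM", "PRON", "SPACE", "DET", "NOUN",
]

# Assumption: a spaCy 2.x style tag map, where every fine-grained tag maps
# to a coarse "pos" value. Because the Greek tags are the Universal POS
# tags themselves, the mapping is the identity.
TAG_MAP = {tag: {"pos": tag} for tag in GREEK_POS_TAGS}


def coarse_pos(tag):
    """Return the coarse POS for a tag; raise ValueError for unknown tags."""
    try:
        return TAG_MAP[tag]["pos"]
    except KeyError:
        raise ValueError(f"unknown POS tag: {tag!r}") from None
```

Keeping the tag names in one list makes it easy to validate annotated data against the schema before training.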

Note: The Greek UD Treebank contains no annotations for the SPACE tag. Further annotation is needed before the model can learn this tag. Coming soon.
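
One way to check which tags a treebank actually contains is to count the values in the UPOS column of its CoNLL-U files. The sketch below is a minimal illustration of that idea, not the project's tooling: `count_upos` and `EXPECTED_TAGS` are hypothetical names, and `SAMPLE` is a hand-written four-token sentence, not data taken from the Greek UD Treebank.

```python
from collections import Counter

# Tags this page expects the model to know.
EXPECTED_TAGS = {
    "ADJ", "ADV", "ADP", "AUX", "INTJ", "PROPN", "VERB", "CCONJ", "SCONJ",
    "PART", "PUNCT", "SYM", "NUM", "PRON", "SPACE", "DET", "NOUN",
}


def count_upos(conllu_text):
    """Count UPOS tags (4th column) over the word lines of CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U word line
        token_id, upos = cols[0], cols[3]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        counts[upos] += 1
    return counts


# Hand-written illustrative sample (assumption: not real treebank data).
SAMPLE = "\n".join([
    "# text = Η γάτα κοιμάται.",
    "1\tΗ\tο\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tγάτα\tγάτα\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tκοιμάται\tκοιμάμαι\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

counts = count_upos(SAMPLE)
missing = sorted(EXPECTED_TAGS - set(counts))  # tags with no examples
```

Run over the full treebank files, a check like this would list SPACE among the tags with no training examples.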

Extended Greek POS tags

There is also an extended list of Greek POS tags, intended for a more sophisticated model that could be built in the future if an appropriate annotated dataset becomes available. These tags are listed here for future reference and use.
