bump vers for ON tagger

kylepjohnson committed Nov 28, 2017
1 parent d8c4891 commit 6a744498d849c08d492ac740a4f384d22ceac591
Showing with 39 additions and 22 deletions.
  1. +38 −21 docs/old_norse.rst
  2. +1 −1 setup.py
@@ -1,7 +1,8 @@
Old Norse
*********
Old Norse was a North Germanic language that was spoken by inhabitants of Scandinavia and inhabitants of their overseas settlements during about the 9th to 13th centuries. The Proto-Norse language developed into Old Norse by the 8th century, and Old Norse began to develop into the modern North Germanic languages in the mid- to late 14th century, ending the language phase known as Old Norse. These dates, however, are not absolute, since written Old Norse is found well into the 15th century. (Source: `Wikipedia <https://en.wikipedia.org/wiki/Old_Norse>`_)
Old Norse was a North Germanic language that was spoken by inhabitants of Scandinavia and inhabitants of their overseas settlements during about the 9th to 13th centuries. The Proto-Norse language developed into Old Norse by the 8th century, and Old Norse began to develop into the modern North Germanic languages in the mid- to late-14th century, ending the language phase known as Old Norse. These dates, however, are not absolute, since written Old Norse is found well into the 15th century. (Source: `Wikipedia <https://en.wikipedia.org/wiki/Old_Norse>`_)
Corpora
=======
@@ -10,12 +11,12 @@ Use ``CorpusImporter()`` or browse the `CLTK GitHub organization <https://github
.. code-block:: python
In [1]: from cltk.corpus.utils.importer import CorpusImporter
>>> from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter("old_norse")
>>> corpus_importer = CorpusImporter("old_norse")
In [3]: corpus_importer.list_corpora
Out[3]: ['old_norse_text_perseus']
>>> corpus_importer.list_corpora
['old_norse_text_perseus', 'old_norse_models_cltk']
Stopword Filtering
@@ -25,34 +26,50 @@ To use the CLTK's built-in stopwords list, We use an example from `Eiríks saga
<http://www.heimskringla.no/wiki/Eir%C3%ADks_saga_rau%C3%B0a>`_:
.. code-block:: python
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
>>> from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from cltk.stop.old_norse.stops import STOPS_LIST
>>> from cltk.stop.old_norse.stops import STOPS_LIST
In [3]: sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'
>>> sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'
In [4]: p = PunktLanguageVars()
>>> p = PunktLanguageVars()
In [5]: tokens = p.word_tokenize(sentence.lower())
>>> tokens = p.word_tokenize(sentence.lower())
In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]: ['var', 'einn', 'morgin', ',', 'karlsefni', 'rjóðrit', 'flekk', 'nökkurn', ',', 'glitraði']
>>> [w for w in tokens if not w in STOPS_LIST]
['var',
'einn',
'morgin',
',',
'karlsefni',
'rjóðrit',
'flekk',
'nökkurn',
',',
'glitraði']
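The filtering step above is a plain membership test against a stop list. A minimal sketch that runs without CLTK or NLTK installed, assuming a small hand-picked stop list (``HYPOTHETICAL_STOPS``, which is not the CLTK's actual ``STOPS_LIST``) and a simple regex tokenizer standing in for ``PunktLanguageVars``:

```python
import re

# Hypothetical stop list for illustration only; the real list lives in
# cltk.stop.old_norse.stops.STOPS_LIST and is much longer.
HYPOTHETICAL_STOPS = {'þat', 'er', 'þeir', 'sá', 'fyrir', 'ofan', 'sem', 'við', 'þeim'}


def tokenize(text):
    """Lowercase the text and split it into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def remove_stops(tokens, stops):
    """Keep only the tokens that are not in the stop list."""
    return [w for w in tokens if w not in stops]


sentence = ('Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit '
            'flekk nökkurn, sem glitraði við þeim')
filtered = remove_stops(tokenize(sentence), HYPOTHETICAL_STOPS)
print(filtered)
```

Using a ``set`` for the stop list keeps each lookup constant-time, which matters once the list and the text grow.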
POS tagging
===========
Thanks to TnT implemented in NLTK, you can get the POS tags of Old Norse texts. The model, first import the ``old_norse_models_cltk`` corpus.
Taggers are trained from an annotated corpus. You can find it at ` <http://www.linguist.is/icelandic_treebank/Download>` and it is Icelandic Parsed Historical Corpus (IcePaHC) version 0.9.
You can get the POS tags of Old Norse texts using the CLTK's wrapper around NLTK's TnT tagger. First, download the model by importing the ``old_norse_models_cltk`` corpus. This TnT tagger was trained on annotated data from the `Icelandic Parsed Historical Corpus <http://www.linguist.is/icelandic_treebank/Download>`_ (version 0.9, license: LGPL).
TnT tagger
The following sentence is extracted from the first verse of Völuspá (poem describing destiny of Agards gods).
``````````
The following sentence is from the first verse of Völuspá (a poem describing the destiny of Asgard's gods).
.. code-block:: python
In [1]: tagger.tag_tnt('Hlióðs bið ek allar.')
Out[1]:
>>> from cltk.tag.pos import POSTag
>>> tagger = POSTag('old_norse')
>>> sent = 'Hlióðs bið ek allar.'
>>> tagger.tag_tnt(sent)
[('Hlióðs', 'Unk'),
('bið', 'VBPI'),
('ek', 'PRO-N'),
('allar', 'Q-A'),
('.', '.')]
('bið', 'VBPI'),
('ek', 'PRO-N'),
('allar', 'Q-A'),
('.', '.')]
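The tagger returns a list of ``(token, tag)`` tuples, so words the model could not tag can be picked out with ordinary list handling. A small sketch that uses the output shown above as literal data; the helper name ``unknown_tokens`` is hypothetical, and reading ``'Unk'`` as the marker for unrecognised words is an assumption drawn from that output:

```python
# Tagger output copied from the example above, used here as literal data.
tagged = [('Hlióðs', 'Unk'), ('bið', 'VBPI'), ('ek', 'PRO-N'), ('allar', 'Q-A'), ('.', '.')]


def unknown_tokens(tagged_sentence, unknown_tag='Unk'):
    """Return the tokens whose tag equals the (assumed) unknown-word marker."""
    return [token for token, tag in tagged_sentence if tag == unknown_tag]


unknown = unknown_tokens(tagged)
# Share of tokens that received a real tag.
coverage = 1 - len(unknown) / len(tagged)
print(unknown)
print(round(coverage, 2))
```

A check like this gives a quick sense of how much of a text falls outside the model's training vocabulary.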
@@ -36,7 +36,7 @@
name='cltk',
packages=find_packages(),
url='https://github.com/cltk/cltk',
version='0.1.72',
version='0.1.73',
zip_safe=True,
test_suite='cltk.tests.test_cltk',
)
