Old norse stopwords #604

Merged: 4 commits, Nov 9, 2017
Empty file added cltk/stop/old_norse/__init__.py
Empty file.
17 changes: 17 additions & 0 deletions cltk/tests/test_stop.py
@@ -8,6 +8,7 @@
from cltk.stop.latin.stops import STOPS_LIST as LATIN_STOPS
from cltk.stop.french.stops import STOPS_LIST as FRENCH_STOPS
from cltk.stop.arabic.stopword_filter import stopwords_filter as arabic_stop_filter
from cltk.stop.old_norse.stops import STOPS_LIST as OLD_NORSE_STOPS
from nltk.tokenize.punkt import PunktLanguageVars
import os
import unittest
@@ -58,6 +59,7 @@ def test_latin_stopwords(self):
        target_list = ['usque', 'tandem', 'abutere', ',', 'catilina', ',',
                       'patientia', 'nostra', '?']
        self.assertEqual(no_stops, target_list)

    def test_arabic_stopwords(self):
        """Test filtering arabic stopwords."""
        sentence = 'سُئِل بعض الكُتَّاب عن الخَط، متى يَسْتحِقُ أن يُوصَف بِالجَودةِ؟'
@@ -86,5 +88,20 @@ def test_string_stop_list(self):
        stoplist = StringStoplist('latin').build_stoplist(text)
        self.assertEqual(stoplist, target_list)

    def test_old_norse_stopwords(self):
        """
        Test filtering Old Norse stopwords
        Sentence extracted from Eiríks saga rauða (http://www.heimskringla.no/wiki/Eir%C3%ADks_saga_rau%C3%B0a)
        """
        sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'
        lowered = sentence.lower()
        punkt = PunktLanguageVars()
        tokens = punkt.word_tokenize(lowered)
        no_stops = [w for w in tokens if w not in OLD_NORSE_STOPS]
        print(no_stops)
        target_list = ['var', 'einn', 'morgin', ',', 'karlsefni', 'rjóðrit', 'flekk', 'nökkurn', ',', 'glitraði']
        self.assertEqual(no_stops, target_list)


if __name__ == '__main__':
    unittest.main()
24 changes: 24 additions & 0 deletions docs/old_norse.rst
@@ -16,3 +16,27 @@ Use ``CorpusImporter()`` or browse the `CLTK GitHub organization <https://github

   In [3]: corpus_importer.list_corpora
   Out[3]: ['old_norse_text_perseus']


Stopword Filtering
==================

To use the CLTK's built-in stopwords list, we use an example from `Eiríks saga rauða
<http://www.heimskringla.no/wiki/Eir%C3%ADks_saga_rau%C3%B0a>`_:

.. code-block:: python

   In [1]: from nltk.tokenize.punkt import PunktLanguageVars

   In [2]: from cltk.stop.old_norse.stops import STOPS_LIST

   In [3]: sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'

   In [4]: p = PunktLanguageVars()

   In [5]: tokens = p.word_tokenize(sentence.lower())

   In [6]: [w for w in tokens if w not in STOPS_LIST]
   Out[6]: ['var', 'einn', 'morgin', ',', 'karlsefni', 'rjóðrit', 'flekk', 'nökkurn', ',', 'glitraði']
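
The filtering step above can also be sketched without any third-party imports, which may help when checking what the list comprehension does. This is a minimal illustration under stated assumptions, not the CLTK implementation: ``SAMPLE_STOPS``, ``tokenize`` and ``remove_stops`` are hypothetical names, the stop set contains only the words the example above shows being removed (the real ``STOPS_LIST`` is larger), and the regular-expression tokenizer is a simple stand-in for ``PunktLanguageVars().word_tokenize``.

```python
import re

# Hypothetical subset of the Old Norse stop list: only the words that the
# example above shows being removed. The real STOPS_LIST is larger.
SAMPLE_STOPS = frozenset(
    ['þat', 'er', 'þeir', 'sá', 'fyrir', 'ofan', 'sem', 'við', 'þeim']
)


def tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    A simple stand-in for a real tokenizer (assumption, not the Punkt algorithm).
    """
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stops(text, stops=SAMPLE_STOPS):
    """Lowercase the text, tokenize it, and drop every token found in stops."""
    return [token for token in tokenize(text.lower()) if token not in stops]


sentence = ('Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit '
            'flekk nökkurn, sem glitraði við þeim')
filtered = remove_stops(sentence)
print(filtered)
```

Lowercasing before the lookup matters because the stop words are stored in lowercase, so ``Þat`` at the start of the sentence would otherwise be kept. Punctuation tokens such as ``,`` survive unless they are added to the stop set.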