Old Norse stopwords #601

clemsciences · 2017-11-05T21:29:39Z

I can add Old Norse stopwords to cltk. Resources for Old Norse are not developed enough, so I would like to help with that and probably with further things.

kylepjohnson · 2017-11-06T15:39:29Z

@clemsciences This would be terrific. @diyclassics is actively working on the subject of stopwords, in general, so he should chime in, too.

To get this started, you have a few options:

Find a list of ON stopwords that has been created by scholars and add it here :)
Make your own using an ON grammar. The most similar list we now have is OE. Do not include any nouns, adjectives, or (most) verbs; but do include articles, particles, prepositions, and auxiliary verbs (e.g., "hasn't"). Adverbs are a grey area for us, if I recall correctly, but very common ones are probably stops.
Do a statistical study of high-frequency, low-information words (tf-idf is a good algorithm to start with). Patrick has more to say on this.
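
As a rough illustration of the statistical option, here is a minimal sketch that counts word forms across a few texts and returns the most frequent ones as stopword candidates for manual review. The function name `top_frequency_candidates` and the toy sentences are made up for this example; this is not CLTK code.

```python
import re
from collections import Counter


def top_frequency_candidates(texts, n=10):
    """Return the n most frequent lowercased word forms across all texts."""
    counts = Counter()
    for text in texts:
        # \w+ is Unicode-aware in Python 3, so þ, ð, æ and accented vowels are kept
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(n)]


# Hypothetical toy corpus, only for illustration
sample_texts = [
    "Maðr hét Ketill ok var son Þorsteins",
    "Hann fór til Íslands ok nam land",
    "Þat var um vár er hann kom til skips",
]
candidates = top_frequency_candidates(sample_texts, n=4)
print(sorted(candidates))
```

The output is only a candidate list: content words that happen to be frequent in a small corpus still have to be removed by hand.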

To step back and think about the bigger picture: ON is an excellent candidate for the CLTK because there are large, open source corpora and treebanks. Probably other resources, too, like dictionaries. From this, you'll be able to write lemmatizers, POS and syntax taggers, stemmers, etc.

Thanks and please do not hesitate to reach out for guidance!

clemsciences · 2017-11-07T12:46:14Z

@kylepjohnson I have not found any stopword list for Old Norse. The only significant NLP work on Old Norse that I found is a POS tagger: https://www.researchgate.net/publication/282778755_Morphological_Tagging_of_Old_Norse_Texts_and_Its_Use_in_Studying_Syntactic_Variation_and_Change However, the annotated corpora created for it do not seem to be available anywhere. So my method is the following:

Use Old Norse grammars (Altnordisches Elementarbuch by Ranke and Hofmann and A New Introduction to Old Norse by Barnes) to create a list of prepositions, particles, articles, auxiliary verbs, relative pronouns, etc.
Then analyse the Old Norse corpora provided by cltk to find the most common words or words with a low TF-IDF score, and sort them by hand to keep only the uninformative words.
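
The TF-IDF step could look roughly like the sketch below, which uses only the standard library and the plain `log(N / df)` form of IDF. The function name `mean_tfidf` and the three toy documents are assumptions for illustration; real input would be the corpus texts.

```python
import math
import re
from collections import Counter, defaultdict


def mean_tfidf(documents):
    """Map each word to its mean TF-IDF over the documents that contain it."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    totals = defaultdict(float)
    for tokens in tokenized:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])  # 0.0 if the word is in every document
            totals[word] += tf * idf
    return {word: total / doc_freq[word] for word, total in totals.items()}


# Hypothetical toy documents, only for illustration
docs = [
    "hann fór til skips ok sigldi",
    "hon gekk til búðar ok sofnaði",
    "konungr reið til borgar ok barðisk",
]
scores = mean_tfidf(docs)
lowest = sorted(scores, key=scores.get)[:2]
```

Words that occur in every document get a score of exactly zero, so they sort first. The resulting ranking would then be filtered by hand against the grammar-based list.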

I noticed two things:

The Old Norse corpora contain texts in Old English (like Beowulf), and I think they should not be there.
For now I'm using the Latin word tokenizer, but I think I'll write a simple one for Old Norse.
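
A simple tokenizer along these lines might start as the sketch below: it normalizes the text to NFC (so that a letter written with a combining ogonek becomes a single character) and then extracts runs of letters, dropping punctuation and digits. The name `tokenize_old_norse` is hypothetical and this is not the tokenizer that was later added to CLTK.

```python
import re
import unicodedata

# Runs of letters only: word characters minus digits and underscore.
# This covers þ, ð, æ, ø, ǫ and accented vowels.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize_old_norse(text):
    """Split text into word tokens, discarding punctuation and digits."""
    # NFC composes sequences such as "o" + combining ogonek into "ǫ"
    normalized = unicodedata.normalize("NFC", text)
    return WORD_RE.findall(normalized)


tokens = tokenize_old_norse("Þat mælti mín móðir, at mér skyldi kaupa.")
print(tokens)
```

Abbreviations, hyphenated forms and enclitics are not handled here; those would need language-specific rules.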

diyclassics · 2017-11-07T16:07:01Z

@clemsciences Good to hear you'll be working on extending CLTK to ON.

Re: stopwords, I'm getting close to generalizing stopword production for CLTK. Code in progress is here: https://github.com/diyclassics/cltk/blob/stops/cltk/stop/stop.py. Only a "most common words" class is currently available; the document-based class is commented out and in progress. Feel free to use whatever is helpful, pass along feedback, etc.

As far as contributing a static reference list, you can follow the Latin example for now, i.e. make an "old_norse" folder in the stop directory and add __init__.py and stops.py files for your work.
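
For orientation, a stops.py in that folder could be as small as the sketch below. The variable name `STOPS_LIST`, the helper `remove_stops`, and the handful of function words are assumptions chosen for illustration; this is not a vetted Old Norse stopword list.

```python
# Sketch of cltk/stop/old_norse/stops.py (the contents are illustrative only)

# A short sample of common Old Norse function words:
# conjunctions, prepositions, particles and a negation.
STOPS_LIST = [
    "ok", "en", "at", "er", "sem", "eigi",
    "í", "á", "til", "með", "af", "um", "ór", "við", "fyrir", "eptir",
]


def remove_stops(tokens, stops=STOPS_LIST):
    """Return the tokens whose lowercased form is not in the stop list."""
    stop_set = set(stops)
    return [token for token in tokens if token.lower() not in stop_set]


filtered = remove_stops(["Hann", "fór", "til", "Íslands", "ok", "nam", "land"])
print(filtered)
```

The accompanying __init__.py can stay empty; it only marks the folder as a package.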

diyclassics · 2017-11-07T16:12:49Z

@clemsciences Two other quick points:

Please include the Rögnvaldsson & Helgadóttir reference in the comments for your work. Because of the nature of the CLTK, I like seeing references both to the compsci/programming literature and to academic work.
If you decide to contribute an ON word tokenizer, you can make language-specific contributions to tokenize/words.py for now. That said (and hopefully @kylepjohnson agrees!), I would like to see all of the modules eventually move to the folder/file structure I noted above for the stops module.

clemsciences · 2017-11-07T23:44:58Z

I wanted to make a pull request but it failed (Can't Create Pull Request Push failed: unable to access 'https://github.com/cltk/cltk.git/': The requested URL returned error: 403) so I'll try later.

kylepjohnson · 2017-11-09T19:20:01Z

Merged PR #603. I'll close this once your docs come through.

@clemsciences Responses to the above:

> in the Old Norse corpora, there are texts in Old English (like Beowulf) and I think they should not be there;

You're the first contributor with ON knowledge, so I believe you would be the right person to clean up the corpora. Perhaps you could move beowulf from old_norse_text_perseus into a new repo, old_english_text_perseus.

> For now, I'm using the Latin word tokenizer, but I think I'll program a simple one for Old Norse.

This would be a nice, fairly easy project.

About the POS tagger: do you have any idea how to recreate it, or how to reach out to its author for the training data?

jtauber · 2017-11-09T19:31:54Z

I'm doing quite a bit of Germanic philology (unrelated to my Perseus work at the moment) so happy to occasionally help out with this too.

kylepjohnson · 2017-11-09T19:47:10Z

Thanks @jtauber, jump in any time with ideas.

Just merged #604 and bumped the version, so 0.1.72 should be on PyPI once the build server is done.

I'll close this ticket, but @clemsciences, I'll open a new one for you for a tokenizer.

clemsciences · 2017-11-09T20:43:26Z

@kylepjohnson Ok, I'll make the new repo old_english_text_perseus as soon as I figure out how to do it.

For the POS tagger, I'll send a message to the author of the Old Norse POS tagger to ask if he can help us. I've always wanted to use such a POS tagger. On my own I didn't dare to ask for it, but now, with cltk, I have more reason to do so. I have studied statistical POS taggers quite a bit (specifically hidden Markov models for POS tagging), so I can help with it. The only thing missing to get started is annotated corpora.

kylepjohnson · 2017-11-10T18:40:20Z

> For the POS tagger, I'll send a message to the author of the Old Norse POS tagger to ask if he can help us. I've always wanted to use such a POS tagger. On my own I didn't dare to ask for it, but now, with cltk, I have more reason to do so. I have studied statistical POS taggers quite a bit (specifically hidden Markov models for POS tagging), so I can help with it. The only thing missing to get started is annotated corpora.

We've used the NLTK's TnT tagger, which the authors use in that paper, for Greek and Latin. Reproducing their results, assuming we have training data, will be straightforward if you follow our patterns :)

kylepjohnson assigned clemsciences and diyclassics Nov 6, 2017

kylepjohnson closed this as completed Nov 9, 2017

kylepjohnson mentioned this issue Nov 9, 2017

Make Old Norse Tokenizer #605

Closed
