Old Norse stopwords #601

Closed
clemsciences opened this issue Nov 5, 2017 · 10 comments

@clemsciences
Member

clemsciences commented Nov 5, 2017

I can add Old Norse stopwords to cltk. Resources for Old Norse are not yet well developed, so I would like to help with that and probably with further things.

@kylepjohnson
Member

@clemsciences This would be terrific. @diyclassics is actively working on the subject of stopwords, in general, so he should chime in, too.

To get this started, you have a few options:

  • Find a list of ON stopwords that has been created by scholars; take it and add it here :)
  • Make your own using an ON grammar. The most similar list we now have is OE. Do not include any nouns, adjectives, or (most) verbs; but do include articles, particles, prepositions, and auxiliary verbs (e.g., "hasn't"). Adverbs are a grey area for us, if I recall right; however, very common ones are probably stops.
  • Do a statistical study of high-frequency, low-information words (tf-idf is a good algo to start with). Patrick has more to say on this.
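
A minimal sketch of the statistical option, assuming whitespace tokenization and a toy corpus of three made-up Old Norse sentences (neither comes from this thread). The function names `mean_tfidf` and `lowest_scoring` are hypothetical. A word that occurs in every document gets an inverse document frequency of zero, so it sorts to the top of the candidate list.

```python
import math
from collections import Counter, defaultdict


def mean_tfidf(documents):
    """Return {word: mean tf-idf over the documents that contain the word}."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    totals = defaultdict(float)
    for tokens in tokenized:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            totals[word] += tf * idf
    return {word: totals[word] / doc_freq[word] for word in totals}


def lowest_scoring(documents, k):
    """Return the k words with the lowest mean tf-idf (ties broken alphabetically)."""
    scores = mean_tfidf(documents)
    return sorted(scores, key=lambda word: (scores[word], word))[:k]


# Toy sentences invented for illustration only.
TOY_CORPUS = [
    "hann gekk til skips ok sigldi",
    "hon var í skála ok svaf",
    "konungr reið til þings ok mælti",
]

if __name__ == "__main__":
    print(lowest_scoring(TOY_CORPUS, 2))
```

On real corpora the ranking would still need a manual pass, since a low score only says a word is evenly spread, not that it is uninformative.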

To step back and think about the bigger picture: ON is an excellent candidate for the CLTK because there are large, open source corpora and treebanks. Probably other resources, too, like dictionaries. From this, you'll be able to write lemmatizers, POS and syntax taggers, stemmers, etc.

Thanks and please do not hesitate to reach out for guidance!

@clemsciences
Member Author

clemsciences commented Nov 7, 2017

@kylepjohnson I have not found any stopword list for Old Norse. The only significant NLP work on Old Norse that I found is a POS tagger: https://www.researchgate.net/publication/282778755_Morphological_Tagging_of_Old_Norse_Texts_and_Its_Use_in_Studying_Syntactic_Variation_and_Change. However, the annotated corpora created for it do not seem to be available anywhere. So my method is the following:

  • First, using Old Norse grammars (Altnordisches Elementarbuch by Ranke and Hofmann, and A New Introduction to Old Norse by Barnes) to create a list of prepositions, particles, articles, auxiliary verbs, relative pronouns, etc.;

  • Then, analysing the Old Norse corpora provided by cltk to find the most common words, or words with a low TF-IDF score, and sorting them by hand to keep only the uninformative ones.
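
The two steps above can be sketched as follows: start from a seed set built by hand from a grammar, then list the most frequent corpus words that are not yet in it, so they can be reviewed manually. The seed entries, the token sequence, and the name `review_candidates` are all illustrative assumptions, not the list or code that was contributed.

```python
from collections import Counter

# Hypothetical seed set, standing in for closed-class words taken from a grammar.
GRAMMAR_SEED = {"ok", "en", "til", "í", "á"}


def review_candidates(tokens, seed, top_n=10):
    """Return the top_n most frequent (word, count) pairs not already in seed."""
    counts = Counter(token.lower() for token in tokens)
    unseen = [(word, count) for word, count in counts.most_common() if word not in seed]
    return unseen[:top_n]


# Toy token sequence invented for illustration only.
TOY_TOKENS = "hann gekk til skips ok hann sigldi til lands en hann kom ekki aptr".split()

if __name__ == "__main__":
    for word, count in review_candidates(TOY_TOKENS, GRAMMAR_SEED, top_n=3):
        print(word, count)
```

Keeping the final decision manual matches the method described: frequency only proposes candidates, and a person decides which ones are really uninformative.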

I noticed two things:

  • In the Old Norse corpora, there are texts in Old English (like Beowulf), and I think they should not be there;

  • For now, I'm using the Latin word tokenizer, but I think I'll program a simple one for Old Norse.
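
A simple tokenizer of the kind mentioned here could look like the sketch below. It is an assumption about what "simple" might mean, not the tokenizer later added to cltk: it normalizes the text to NFC so that letters typed with combining marks (for example o plus a combining ogonek) become single characters, then splits into runs of word characters and single punctuation marks. The names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re
import unicodedata

# A token is either a run of word characters or one non-space punctuation mark.
# In Python 3, \w is Unicode-aware for str patterns, so þ, ð, æ and ǫ count as letters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    return TOKEN_PATTERN.findall(normalized)


if __name__ == "__main__":
    print(tokenize("Hann gekk til skips, ok sigldi."))
```

Normalizing first matters because a combining mark on its own is not matched by `\w`, so an unnormalized word could be split in the middle.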

@diyclassics
Collaborator

@clemsciences Good to hear you'll be working on extending CLTK to ON.

re: stopwords—I'm close and getting closer to generalizing stop word production for CLTK. Code in progress is here: https://github.com/diyclassics/cltk/blob/stops/cltk/stop/stop.py. Only a "most common words" Class is currently available; document-based Class is commented out and in progress. Feel free to use whatever is helpful, pass along feedback, etc.

As far as contributing a static reference list goes, you can follow the Latin example for now, i.e. make an "old_norse" folder in the stop directory and add __init__.py and stops.py files for your work.
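
As a rough sketch of what such a stops.py could contain, assuming it exposes a plain list named `STOPS_LIST` in the way the Latin module does: the entries below are a handful of common Old Norse conjunctions, particles, and prepositions chosen for illustration, not the list that was eventually contributed, and `remove_stops` is a hypothetical helper added only to show usage.

```python
"""Illustrative sketch of a static Old Norse stopword module (stops.py).

The entries are examples only; a real list should cite the grammars and
papers it was drawn from in comments like this one.
"""

STOPS_LIST = [
    "ok", "en", "eða",            # conjunctions
    "at", "er", "sem",            # particles and relative markers
    "í", "á", "til", "með",       # prepositions
    "við", "um", "af", "ór", "frá",
]


def remove_stops(tokens, stops=STOPS_LIST):
    """Return tokens with stopwords removed, comparing case-insensitively."""
    stop_set = {stop.lower() for stop in stops}
    return [token for token in tokens if token.lower() not in stop_set]


if __name__ == "__main__":
    print(remove_stops("hann gekk til skips ok sigldi".split()))
```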

@diyclassics
Collaborator

@clemsciences Two other quick points:

  1. Please include the Rögnvaldsson & Helgadóttir reference in the comments for your work. Because of the nature of the CLTK, I like seeing references both to the compsci/programming literature and to academic work.
  2. If you decide to contribute an ON word tokenizer, you can make language-specific contributions to tokenize/words.py for now. That said (and hopefully @kylepjohnson agrees!)—I would like to see all of the modules move eventually to the folder/file structure I note above for the stops module.

@clemsciences
Member Author

I wanted to make a pull request but it failed (Can't Create Pull Request Push failed: unable to access 'https://github.com/cltk/cltk.git/': The requested URL returned error: 403) so I'll try later.

@kylepjohnson
Member

Merged PR #603. I'll close this once your docs come through.

@clemsciences Responses to the above:

> In the Old Norse corpora, there are texts in Old English (like Beowulf), and I think they should not be there;

You're the first contrib with ON knowledge, so I believe you would be the right person to clean up the corpora. Perhaps you could move beowulf from old_norse_text_perseus into a new repo old_english_text_perseus.

> For now, I'm using the Latin word tokenizer, but I think I'll program a simple one for Old Norse.

This would be a nice, fairly easy project.

About the POS tagger -- do you have any idea how to recreate it? Or to reach out to its author for the training data?

@jtauber

jtauber commented Nov 9, 2017

I'm doing quite a bit of Germanic philology (unrelated to my Perseus work at the moment) so happy to occasionally help out with this too.

@kylepjohnson
Member

Thanks @jtauber, jump in any time with ideas.

Just merged #604 and have bumped the version, so 0.1.72 should be on PyPI once the build server is done.

I'll close this ticket, but @clemsciences, I'll open a new one for you for a tokenizer.

@clemsciences
Member Author

@kylepjohnson OK, I'll make the new repo old_english_text_perseus as soon as I work out how to do it.

For the POS tagger, I'll send a message to the author of the Old Norse POS tagger to ask if he can help us. (I've always wanted to use such a POS tagger. On my own, I didn't dare to ask for it, but now, with cltk, I have more reason to.) I have studied statistical POS taggers at length (specifically hidden Markov models for POS tagging), so I can help with it. The only thing missing to get started is annotated corpora.

@kylepjohnson
Member

> For the POS tagger, I'll send a message to the author of the Old Norse POS tagger to ask if he can help us. (I've always wanted to use such a POS tagger. On my own, I didn't dare to ask for it, but now, with cltk, I have more reason to.) I have studied statistical POS taggers at length (specifically hidden Markov models for POS tagging), so I can help with it. The only thing missing to get started is annotated corpora.

We've used the NLTK's TnT tagger, which the authors use in that paper, for Greek and Latin. Assuming we have training data, reproducing their results will be straightforward if you follow our patterns :)
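
To make the hidden Markov model idea concrete, here is a deliberately small sketch. It is not TnT and does not use the NLTK API: TnT is a trigram tagger with its own smoothing and unknown-word handling, whereas this is a bigram HMM with add-one smoothing and Viterbi decoding. The tagged sentences, the tag names, and the function names are all invented for illustration; real training data would have to come from an annotated corpus.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # pseudo-tag marking the beginning of a sentence


def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> Counter of next tags
    emissions = defaultdict(Counter)    # tag -> Counter of words
    vocabulary = set()
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            vocabulary.add(word)
            previous = tag
    return transitions, emissions, vocabulary


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key among n_outcomes outcomes."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for words under the bigram HMM."""
    transitions, emissions, vocabulary = model
    tags = list(emissions)
    n_words = len(vocabulary) + 1  # one extra outcome reserved for unseen words
    # best[tag] = (log probability, tag path) of the best path ending in tag
    best = {START: (0.0, [])}
    for word in words:
        step = {}
        for tag in tags:
            emit = log_prob(emissions[tag], word, n_words)
            step[tag] = max(
                (score + log_prob(transitions[prev], tag, len(tags)) + emit, path + [tag])
                for prev, (score, path) in best.items()
            )
        best = step
    return max(best.values())[1]


# Toy training data invented for illustration only.
TOY_TAGGED = [
    [("hann", "PRON"), ("gekk", "VERB"), ("til", "ADP"), ("skips", "NOUN")],
    [("konungr", "NOUN"), ("reið", "VERB"), ("til", "ADP"), ("þings", "NOUN")],
    [("hon", "PRON"), ("svaf", "VERB")],
]

if __name__ == "__main__":
    model = train_hmm(TOY_TAGGED)
    print(viterbi("hon gekk til þings".split(), model))
```

With annotated Old Norse data in hand, the same train-then-tag shape would apply, with NLTK's TnT tagger in place of this toy model.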
