Old Norse stopwords #601
Comments
@clemsciences This would be terrific. @diyclassics is actively working on the subject of stopwords, in general, so he should chime in, too. To get this started, you have a few options:
To step back and think about the bigger picture: ON is an excellent candidate for the CLTK because there are large, open source corpora and treebanks, and probably other resources, too, like dictionaries. From this, you'll be able to write lemmatizers, POS and syntax taggers, stemmers, etc. Thanks and please do not hesitate to reach out for guidance!
@kylepjohnson I have not found any stopword list for Old Norse. The only significant NLP work on Old Norse that I found is a POS tagger: https://www.researchgate.net/publication/282778755_Morphological_Tagging_of_Old_Norse_Texts_and_Its_Use_in_Studying_Syntactic_Variation_and_Change. However, the annotated corpora created for it do not seem to be available anywhere. That's why my method is the following:
I noticed two things:
@clemsciences Good to hear you'll be working on extending CLTK to ON. Re: stopwords, I'm getting close to generalizing stop word production for CLTK. Code in progress is here: https://github.com/diyclassics/cltk/blob/stops/cltk/stop/stop.py. Only a "most common words" class is currently available; the document-based class is commented out and in progress. Feel free to use whatever is helpful, pass along feedback, etc. As for contributing a static reference list, you can follow the Latin example for now, i.e. make an "old_norse" folder in the stop directory and add __init__.py and stops.py files for your work.
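For illustration, here is a minimal sketch of the "most common words" idea mentioned above, written with the standard library only. It is not the code in stop.py and not CLTK's API: the function names `most_common_stops` and `remove_stops` are hypothetical, and the sample strings are short placeholders rather than a real corpus. A frequency list like this only yields candidates; someone who knows the language would still need to review them before committing a static list.

```python
import re
from collections import Counter


def most_common_stops(texts, size=10):
    """Return the `size` most frequent lowercased word forms in `texts`.

    Hypothetical helper for illustration; not part of CLTK.
    """
    counts = Counter()
    for text in texts:
        # \w+ is Unicode-aware in Python 3, so letters such as þ, ð and æ
        # stay inside tokens instead of splitting them.
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(size)]


def remove_stops(tokens, stops):
    """Drop every token whose lowercased form is in `stops`."""
    stop_set = set(stops)
    return [token for token in tokens if token.lower() not in stop_set]


# Placeholder input, assumed only for this example.
sample_texts = [
    "ok hann fór ok hon kom",
    "hann sá ok hon sá",
]

candidates = most_common_stops(sample_texts, size=1)
filtered = remove_stops("hann fór ok kom".split(), candidates)
```

A reviewed list produced this way could then be stored as a plain Python list in the stops.py file described above.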
@clemsciences Two other quick points:
I tried to make a pull request, but it failed ("Can't Create Pull Request. Push failed: unable to access 'https://github.com/cltk/cltk.git/': The requested URL returned error: 403"), so I'll try again later.
Merged PR #603. I'll close this once your docs come through. @clemsciences Responses to the above:
You're the first contributor with ON knowledge, so I believe you would be the right person to clean up the corpora. Perhaps you could move Beowulf from old_norse_text_perseus into a new repo
This would be a nice, fairly easy project. About the POS tagger: do you have any idea how to recreate it? Or how to reach out to its author for the training data?
I'm doing quite a bit of Germanic philology (unrelated to my Perseus work at the moment), so I'm happy to occasionally help out with this too.
Thanks @jtauber, jump in any time with ideas. Just merged #604 and have bumped the version, so 0.1.72 should be on PyPI once the build server is done. I'll close this ticket, but @clemsciences, I'll open a new one for you, for a tokenizer.
@kylepjohnson Ok, I'll make the new repo old_english_text_perseus as soon as I learn how to do it. For the POS tagger, I'll send a message to the author of the Old Norse POS tagger to ask if he can help us. (I've always wanted to use such a POS tagger. On my own, I didn't dare ask for it, but now, with cltk, I have more reason to.) I have studied statistical POS taggers quite a bit (specifically hidden Markov models for POS tagging), so I can help with that. The only thing missing to get started is annotated corpora.
We've used NLTK's TnT tagger, which the authors use in that paper, for Greek and Latin. Reproducing their results, assuming we have training data, will be straightforward if you follow our patterns :)
I can add Old Norse stopwords to cltk. Resources for Old Norse are not well developed, so I would like to help with that and probably with further things.