Added word tokenization for Old English #706
Conversation
Thanks. I have resolved the merge conflicts and am waiting for the build server to report back. I see that there is an Old French Swadesh list in here, too. Do you mean for that to be merged as well?
@kylepjohnson yes. 😄 I added the Swadesh list for Old French in my previous PR ( #671 ), but you told me there were several discrepancies in it ( Issue #698 ). I have addressed those issues there and hence decided to re-add the list.
Great, that is what I figured. Just wanted to be sure! I'll merge once all the updating/building is done.
Thank you! 😅
@the-ethan-hunt There have been a few formatting issues in this PR.
Would you please take a look and make sure the module loads OK?
Travis CI was probably failing for this reason.
@kylepjohnson, I have resolved the indentation errors, and now the report says this:
I am not able to understand this. 😅 Is something wrong on my side?
Yes. Also, the file is now full of tabs (not spaces). @the-ethan-hunt I have fixed the file; however, to push the changes back, you need to give me write access to your cltk fork. Please do this, or replace all the tabs in word.py yourself.
@kylepjohnson I have pushed the corrections and also sent you an invite to give you write access to my cltk fork. 😅
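For reference, a minimal sketch (not CLTK code) of the tab replacement suggested above. The four-space width follows PEP 8; the file path would be `cltk/tokenize/word.py` in this case:

```python
from pathlib import Path

def expand_tabs(path, width=4):
    """Rewrite a file in place with each hard tab replaced by `width` spaces.

    Simplistic sketch: it ignores tab-stop alignment, which is fine when
    tabs are only used for indentation, as in the word.py case above.
    """
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    p.write_text(text.replace("\t", " " * width), encoding="utf-8")
```

Running `git diff` afterwards shows exactly which lines changed, which is an easy sanity check before pushing.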
CI should now run properly.
@the-ethan-hunt IMO, after these changes the tests should pass 😄
cltk/tests/test_tokenize.py
Outdated
def test_old_eng_word_tokenizer(self):
    text = "Hƿæt! ƿē Gār-Dena in ġeār-dagum, þēod-cyninga, þrym ġefrūnon,hū ðā æþelingas ellen fremedon."
    target= ['Hƿæt', '!', 'ƿē', 'Gār', '-', 'Dena', 'in', 'ġeār', '-', 'dagum', ','
             'þēod', '-', 'cyninga', ',', 'þrym', 'ġefrūnon', ',', 'hū', 'ðā', 'æþelingas', 'ellen', 'fremedon', '.'
@the-ethan-hunt I think you've missed a ] here, which caused a test failure.
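For clarity, a sketch of how the list literal reads with the missing `]` restored (and a `,` that the line wrap appears to have dropped between `','` and `'þēod'`); the tokens themselves are copied verbatim from the diff above:

```python
# Hypothetical corrected version of the snippet under review: the list
# now closes with ']' and has a comma between ',' and 'þēod' (without
# it, Python would silently join the two adjacent string literals).
target = ['Hƿæt', '!', 'ƿē', 'Gār', '-', 'Dena', 'in', 'ġeār', '-',
          'dagum', ',', 'þēod', '-', 'cyninga', ',', 'þrym', 'ġefrūnon',
          ',', 'hū', 'ðā', 'æþelingas', 'ellen', 'fremedon', '.']
```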
cltk/tokenize/word.py
Outdated
@@ -26,16 +26,15 @@ def __init__(self, language):
    'greek',
    'latin',
    'old_norse',
    'old_english'
A missing , here also caused an error in the tests.
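To illustrate why this surfaced as a test error rather than a syntax error: Python implicitly concatenates adjacent string literals, so the language list silently gains one bogus entry instead of failing to parse. A minimal sketch:

```python
# Missing comma: 'old_norse' 'old_english' concatenates into one string.
broken = ['greek', 'latin', 'old_norse' 'old_english']
# With the comma restored, all four languages are present.
fixed = ['greek', 'latin', 'old_norse', 'old_english']

print(broken)  # -> ['greek', 'latin', 'old_norseold_english']
```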
@kylepjohnson, can you have a look at this? 😅
What is the rationale behind this? As far as I know, English punctuation conventions were not established until well after the 11th century, so any apostrophes you come across will likely be editorial interventions, probably indicating letter omission (in which case splitting the word probably does more harm than good).
@Sedictious, I agree with your point. But there are Old English texts typeset with modern conventions by modern editors (you have mentioned this too). These editors sometimes use the apostrophe in place of the interpunct, a very common Old English punctuation mark represented by "·". To avoid ignoring this, I have also treated the apostrophe as a punctuation mark. 😄
@the-ethan-hunt My point was that you shouldn't split words containing an apostrophe (you yourself noted that the interpunct was also a punctuation mark). I am sorry, but I still fail to see why you included this bit.
@Sedictious, as I mentioned, some editors continue to use the apostrophe in Old English texts in place of an interpunct, so I had to include it. In cases like that, removing the bit could result in a wrong tokenization.
@the-ethan-hunt I am not entirely sure I follow you. In your comments you make it seem like you treat the apostrophe as a regular punctuation mark. However, looking at the code, it first converts instances of ’ to ' and then splits a word containing either of those characters (e.g. "CLTK's" -> "CLTK'", "s"). Nowhere in the code is an apostrophe treated as a punctuation mark. Again, I am not sure whether the above behavior is intentional, so please correct me if there is a reason behind it. That being said, you are probably right and we should wait for a moderator to check it out 😄
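For concreteness, a minimal sketch (not the actual CLTK code) of the behavior described above: normalize the right single quotation mark to a plain apostrophe, then split after the apostrophe so it stays attached to the left-hand token:

```python
import re

def split_apostrophes(text):
    """Illustrative sketch of the apostrophe handling discussed above."""
    # Normalize U+2019 (right single quotation mark) to a plain apostrophe.
    text = text.replace("\u2019", "'")
    # Split after an apostrophe, keeping it on the left-hand token,
    # so "CLTK's" becomes "CLTK'" followed by "s".
    return re.findall(r"[^\s']+'|[^\s']+", text)

print(split_apostrophes("CLTK’s tokenizer"))  # -> ["CLTK'", 's', 'tokenizer']
```

Note the sketch only models apostrophes; it does not handle the other punctuation marks the real tokenizer splits on.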
Hey @the-ethan-hunt, you can join the Germanic Team, where we can exchange ideas about the Germanic languages. However, I do not have enough knowledge of Old English to assess your work as a professor would; I only know the basics. But if there is a place where you can put forward your ideas, it is this team. @kylepjohnson can add you to the Germanic team.
Word tokenization has been added for Old English, with additions in tests/test_tokenize.py and docs/old_english.rst. I will also be adding line tokenization later.
@kylepjohnson, please tell me if I have to create an issue ticket for that. 😅