Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added word tokenization for Old English #706

Closed
wants to merge 24 commits into from

Conversation

the-ethan-hunt
Copy link
Contributor

Word tokenization has been added for Old English with addition in tests/test_tokenize.py and docs/old_english.rst.
I will also be adding line tokenization later.
@kylepjohnson , please tell me if I have to create an issue ticket for that. 😅

@kylepjohnson
Copy link
Member

Thanks. I have merged some conflicts and am waiting for the build server to report back.

I see that there is an Old French Swadesh in here, too. You mean for that to be merged, as well?

@the-ethan-hunt
Copy link
Contributor Author

I see that there is an Old French Swadesh in here, too. You mean for that to be merged, as well?

@kylepjohnson yes. 😄 I added the Swadesh list for Old French in my previous PR ( #671 ) but you told me there were several discrepancies in it ( Issue #698 ). I have talked about those issues there and hence decided to re-add the list.

@kylepjohnson
Copy link
Member

Great, that is what I figured. Just wanted to be sure! I'll merge once all the updating/building is done.

@the-ethan-hunt
Copy link
Contributor Author

Thank you! 😅

@kylepjohnson
Copy link
Member

@the-ethan-hunt There have been a few formatting issues in word.py and I think they came from your version. I tried to fix the, however the build server has failed a few times, most recently for:

ERROR: Failure: TabError (inconsistent use of tabs and spaces in indentation (word.py, line 31))

Would you please take a look and make sure the module loads ok?

Travis CI was probably failing due to this reason
@the-ethan-hunt
Copy link
Contributor Author

@kylepjohnson , I have resolved the indentation errors and now the report says this:

ERROR: Failure: NameError (name 'self' is not defined)

I am not able to understand this. 😅 Is this something wrong from my side?

@diyclassics
Copy link
Collaborator

The assert statement in line 33 of cltk/tokenize/word.py is at the wrong indentation level—it needs to be one indent in (compared to the def __init__...)

@kylepjohnson
Copy link
Member

The assert statement in line 33 of cltk/tokenize/word.py is at the wrong indentation level—it needs to be one indent in (compared to the def init...)

Yes. This and the file is now full of tabs (not spaces). @the-ethan-hunt I have fixed the file, however to push the changes back, you need to give me write access to your cltk fork. Please do this or replace all the tabs in word.py

@the-ethan-hunt
Copy link
Contributor Author

@kylepjohnson I have corrected the changes and also sent you the invite link to give you write access to my cltk fork. 😅

CI now should run properly
Copy link
Member

@inishchith inishchith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@the-ethan-hunt IMO after these changes the test shall pass 😄

def test_old_eng_word_tokenizer(self):
text = "Hƿæt! ƿē Gār-Dena in ġeār-dagum, þēod-cyninga, þrym ġefrūnon,hū ðā æþelingas ellen fremedon."
target= ['Hƿæt', '!', 'ƿē', 'Gār', '-', 'Dena', 'in', 'ġeār', '-', 'dagum', ','
'þēod', '-', 'cyninga', ',', 'þrym', 'ġefrūnon', ',', 'hū', 'ðā', 'æþelingas', 'ellen', 'fremedon', '.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@the-ethan-hunt i think you've missed a ] here which caused a test failure .

@@ -26,16 +26,15 @@ def __init__(self, language):
'greek',
'latin',
'old_norse',
'old_english'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also a missing , here caused an error in tests .

@the-ethan-hunt
Copy link
Contributor Author

@kylepjohnson can you have a look at this? 😅
P.S. I also came to know about an Old English team in CLTK. I would love to hear your review on this @clemsciences 😅

@Sedictious
Copy link
Member

Sedictious commented Mar 3, 2018

Apostrophes are considered part of the first word of the two they separate. Apostrophes are also normalized from “’” to “'“.

What is the rationale behind this? As far as I know, English punctuation rules weren't established until well beyond the 11th century, so any apostrophes you'll come across will likely be editorial intervention, probably indicating letter omission (in which case, splitting the word probably does more harm than good)

@the-ethan-hunt
Copy link
Contributor Author

@Sedictious , I agree with your point. But there have been Old English texts with a modern type convention by a modern author.( You have mentioned this too). These authors sometimes use the apostrophe in replacing the interpunct( A very popular Old English punctuation mark represented by "·". To prevent ignoring this I have also used the apostrophe as a punctuation mark. 😄

@Sedictious
Copy link
Member

@the-ethan-hunt My point was that you shouldn't separate words containing an apostrophe (you yourself noted that interpunct also was a punctuation mark). I am sorry, but I still fail to see why you included this bit.

@the-ethan-hunt
Copy link
Contributor Author

the-ethan-hunt commented Mar 3, 2018

@Sedictious , as I mentioned that some authors continue to use the apostrophe in Old English texts in place of an interpunct. I thus had to include it. In cases like that, if I remove the bit, this may result in a 'wrong tokenization'.
However, if it is causing more good than bad, let me get a final confirmation from @kylepjohnson and then I would replace it. 😄
P.S. Thanks for pointing this out. 😄

@Sedictious
Copy link
Member

@the-ethan-hunt I am not entirely sure I follow you. On your comments, you make it seem like you treat apostrophe like a regular punctuation point . However, looking at the code you have:
text = re.sub(r"’", r"'", string)
text = re.sub(r"\'", r"' ", text)
text = re.sub("(?<=.)(?=[.!?)(\";:,«»\-])", " ", text)

Which is first converting instances of ’ to ' and then splitting a word containing either of the aforementioned characters (e.g. " CLTK's " -> " CLTK' " "s"). Nowhere in the code is there an indication of an apostrophe being treated like a punctuation mark.

Again, I am not sure whether the above behavior is intentional or not, so please correct me if there is a reason behind this. That being said, you probably are right and we should wait for a moderator to check it out 😄

@clemsciences
Copy link
Member

Hey @the-ethan-hunt , you can join the Germanic Team, where we can exchange ideas about the Germanic languages. However, I do not have enough knowledge in Old English to assess your work as a professor would do. I only have the basics in Old English. But if there is a place where you can put forward your ideas, it is in this team. @kylepjohnson can add you to the Germanic team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants