Added word tokenization for Old English #706

the-ethan-hunt · 2018-02-21T18:07:22Z

Word tokenization has been added for Old English, with additions to tests/test_tokenize.py and docs/old_english.rst.
I will also add line tokenization later.
@kylepjohnson, please tell me if I need to create an issue ticket for that. 😅

kylepjohnson · 2018-02-22T02:05:10Z

Thanks. I have merged some conflicts and am waiting for the build server to report back.

I see that there is an Old French Swadesh in here, too. You mean for that to be merged, as well?

the-ethan-hunt · 2018-02-22T03:16:17Z

> I see that there is an Old French Swadesh in here, too. You mean for that to be merged, as well?

@kylepjohnson yes. 😄 I added the Swadesh list for Old French in my previous PR (#671), but you told me there were several discrepancies in it (Issue #698). I have discussed those issues there and hence decided to re-add the list.

kylepjohnson · 2018-02-22T03:17:49Z

Great, that is what I figured. Just wanted to be sure! I'll merge once all the updating/building is done.

the-ethan-hunt · 2018-02-22T03:21:01Z

Thank you! 😅

kylepjohnson · 2018-02-22T03:58:53Z

@the-ethan-hunt There have been a few formatting issues in word.py, and I think they came from your version. I tried to fix them; however, the build server has failed a few times, most recently for:

ERROR: Failure: TabError (inconsistent use of tabs and spaces in indentation (word.py, line 31))

Would you please take a look and make sure the module loads ok?
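
For reference, a minimal sketch of how this kind of failure can be reproduced locally without the build server. The snippet below is hypothetical and is not the real word.py: it indents one line with spaces and the next with a tab, then asks Python to compile it. The helper name `indentation_error` is an assumption made for illustration.

```python
# Hypothetical source text (not the real word.py): the first body line is
# indented with eight spaces, the second with a single tab.
MIXED = "def f():\n        x = 1\n\treturn x\n"

# The same source with the tab replaced by spaces.
SPACES_ONLY = MIXED.replace("\t", "        ")


def indentation_error(source):
    """Return the TabError message for `source`, or None if it compiles."""
    try:
        compile(source, "<snippet>", "exec")
    except TabError as err:
        return err.msg
    return None


print(indentation_error(MIXED))
print(indentation_error(SPACES_ONLY))
```

Compiling the module this way only checks that it parses; it does not import or run it.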

the-ethan-hunt · 2018-02-22T13:04:31Z

@kylepjohnson , I have resolved the indentation errors and now the report says this:

ERROR: Failure: NameError (name 'self' is not defined)

I am not able to understand this. 😅 Is this something wrong from my side?

diyclassics · 2018-02-22T15:40:50Z

The assert statement in line 33 of cltk/tokenize/word.py is at the wrong indentation level—it needs to be one indent in (compared to the def __init__...)
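
A minimal sketch of why a misindented assert produces this NameError. The class names and the `LANGUAGES` list are assumptions for illustration and do not reproduce the real cltk/tokenize/word.py. When the assert sits at the same level as `def __init__`, it runs while the class body is being executed, where no `self` exists yet.

```python
# Illustrative list only; the real module's list may differ.
LANGUAGES = ["greek", "latin", "old_norse", "old_english"]


def build_broken_class():
    class BrokenTokenizer:
        def __init__(self, language):
            self.language = language
        # Misindented: this line belongs to the class body, not to __init__,
        # so `self` is looked up as an ordinary name and is not found.
        assert self.language in LANGUAGES

    return BrokenTokenizer


class FixedTokenizer:
    def __init__(self, language):
        self.language = language
        # One indent in, inside __init__, where `self` is defined.
        assert self.language in LANGUAGES, "unsupported language"


message = None
try:
    build_broken_class()
except NameError as err:
    message = str(err)

print(message)
print(FixedTokenizer("old_english").language)
```

The broken class is wrapped in a function here only so the error can be caught and printed; in a module, the same assert would fail at import time.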

kylepjohnson · 2018-02-22T16:24:34Z

> The assert statement in line 33 of cltk/tokenize/word.py is at the wrong indentation level; it needs to be one indent in (compared to the def __init__...)

Yes. That, and the file is now full of tabs (not spaces). @the-ethan-hunt I have fixed the file; however, to push the changes back, I need write access to your cltk fork. Please either grant it or replace all the tabs in word.py with spaces.

the-ethan-hunt · 2018-02-22T17:41:12Z

@kylepjohnson I have corrected the changes and also sent you the invite link to give you write access to my cltk fork. 😅

inishchith

@the-ethan-hunt IMO the tests should pass after these changes 😄

inishchith · 2018-02-23T12:41:38Z

cltk/tests/test_tokenize.py

+    def test_old_eng_word_tokenizer(self):
+        text = "Hƿæt! ƿē Gār-Dena in ġeār-dagum, þēod-cyninga, þrym ġefrūnon,hū ðā æþelingas ellen fremedon."
+        target= ['Hƿæt', '!', 'ƿē', 'Gār', '-', 'Dena', 'in', 'ġeār', '-', 'dagum', ',' 
+                 'þēod', '-', 'cyninga', ',', 'þrym', 'ġefrūnon', ',', 'hū', 'ðā', 'æþelingas', 'ellen', 'fremedon', '.'


@the-ethan-hunt I think you've missed a ] here, which caused a test failure.

inishchith · 2018-02-23T12:42:18Z

cltk/tokenize/word.py

@@ -26,16 +26,15 @@ def __init__(self, language):
                                    'greek',
                                    'latin',
                                    'old_norse',
+                                    'old_english'


Also, a missing , here caused an error in the tests.
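
One way a missing comma in a list of strings can cause trouble, sketched under the assumption that another string literal follows the new entry (the diff above does not show the next line, so `'another_language'` is a hypothetical placeholder). Python joins adjacent string literals, so the list silently loses an element instead of raising an error.

```python
# 'another_language' is a placeholder, not an entry from the real list.
without_comma = [
    'old_norse',
    'old_english'       # no comma: joined with the literal on the next line
    'another_language',
]

with_comma = [
    'old_norse',
    'old_english',
    'another_language',
]

print(without_comma)
print('old_english' in without_comma)
print('old_english' in with_comma)
```

The same effect applies to the `target` list in the test above, where `','` at the end of one line is followed by `'þēod'` on the next without a separating comma.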

the-ethan-hunt · 2018-03-03T12:26:49Z

@kylepjohnson can you have a look at this? 😅
P.S. I also came to know about an Old English team in CLTK. I would love to hear your review on this @clemsciences 😅

Sedictious · 2018-03-03T13:06:19Z

> Apostrophes are considered part of the first word of the two they separate. Apostrophes are also normalized from “’” to “'”.

What is the rationale behind this? As far as I know, English punctuation rules weren't established until well beyond the 11th century, so any apostrophes you'll come across will likely be editorial intervention, probably indicating letter omission (in which case, splitting the word probably does more harm than good)

the-ethan-hunt · 2018-03-03T13:27:31Z

@Sedictious, I agree with your point. But there are Old English texts typeset by modern authors using modern conventions (you have mentioned this too). These authors sometimes use the apostrophe in place of the interpunct (a very popular Old English punctuation mark, represented by "·"). To avoid ignoring this, I have also treated the apostrophe as a punctuation mark. 😄

Sedictious · 2018-03-03T13:46:02Z

@the-ethan-hunt My point was that you shouldn't separate words containing an apostrophe (you yourself noted that interpunct also was a punctuation mark). I am sorry, but I still fail to see why you included this bit.

the-ethan-hunt · 2018-03-03T14:05:14Z

@Sedictious, as I mentioned, some authors continue to use the apostrophe in Old English texts in place of an interpunct, so I had to include it. In cases like that, removing this bit may result in a 'wrong tokenization'.
However, if it is causing more harm than good, let me get a final confirmation from @kylepjohnson and then I will replace it. 😄
P.S. Thanks for pointing this out. 😄

Sedictious · 2018-03-03T14:19:40Z

@the-ethan-hunt I am not entirely sure I follow you. In your comments, you make it seem as though you treat the apostrophe like a regular punctuation mark. However, looking at the code, you have:
text = re.sub(r"’", r"'", string)
text = re.sub(r"\'", r"' ", text)
text = re.sub("(?<=.)(?=[.!?)(\";:,«»\-])", " ", text)

This first converts instances of ’ to ' and then splits any word containing either of those characters (e.g. "CLTK's" -> "CLTK'" and "s"). Nowhere in the code is there an indication of the apostrophe being treated like a punctuation mark.

Again, I am not sure whether the above behavior is intentional or not, so please correct me if there is a reason behind this. That being said, you probably are right and we should wait for a moderator to check it out 😄
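
The behavior described above can be checked with a small sketch that wraps the three quoted substitutions in a function. The function name `normalize` and the final whitespace split are assumptions for illustration; the rest of the tokenizer is not shown in this thread. The third pattern is written as a raw string here, which matches the same characters as the quoted line.

```python
import re


def normalize(string):
    # Step 1: convert the typographic apostrophe to a plain one.
    text = re.sub(r"’", r"'", string)
    # Step 2: add a space after every apostrophe, leaving it attached
    # to the end of the preceding word.
    text = re.sub(r"\'", r"' ", text)
    # Step 3: add a space before the listed punctuation marks.
    # The apostrophe is not in this character class.
    text = re.sub(r'(?<=.)(?=[.!?)(";:,«»\-])', " ", text)
    return text


# Assumed final step: split the normalized text on whitespace.
tokens = normalize("CLTK’s").split()
print(tokens)
print(normalize("Hƿæt!"))
```

With this input the apostrophe stays on the first token and never appears as a token of its own, whereas "!" is separated from the word before it.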

clemsciences · 2018-03-05T08:18:48Z

Hey @the-ethan-hunt , you can join the Germanic Team, where we can exchange ideas about the Germanic languages. However, I do not have enough knowledge in Old English to assess your work as a professor would do. I only have the basics in Old English. But if there is a place where you can put forward your ideas, it is in this team. @kylepjohnson can add you to the Germanic team.

the-ethan-hunt and others added 12 commits February 9, 2018 10:59

Added Old English swadesh list

3227cae

Merge branch 'master' into master

2c21277

Update swadesh.py

d27e8e7

Added Old French swadesh list

b1e582d

Added Swadesh to docs

0812393

Merge branch 'master' into master

3ff4c32

Update french.rst

26d99f0

Added tokenizer for old english

a92ff12

Added old English tokenization to testing

8184303

Added word tokenization to docs

07b2f11

Merge branch 'master' into pr-for-old_eng

5c0ec86

Merge branch 'master' into pr-for-old_eng

dc1bd75

fix indentation error

d1c849d

Merge branch 'master' into pr-for-old_eng

d09e6ca

rm extra bracket

286c14c

Fixed indentation error

04e2061

Travis CI was probably failing due to this reason

the-ethan-hunt added 2 commits February 22, 2018 23:08

Indentation changes to spaces

b74c29f

Merge branch 'master' into pr-for-old_eng

967d978

Removed bugs

624332b

CI now should run properly

inishchith reviewed Feb 23, 2018

Fixed comma error

42dca74

the-ethan-hunt added 2 commits February 23, 2018 21:50

Added required bracket

503b68e

Update test_tokenize.py

88fb4ef

the-ethan-hunt mentioned this pull request Feb 27, 2018

Added swadesh list for Marathi #687

Closed

the-ethan-hunt mentioned this pull request Mar 10, 2018

Added Stemmer for Marathi #719

Closed

the-ethan-hunt added 2 commits March 11, 2018 16:31

Merge branch 'master' into pr-for-old_eng

911fc71

Removed indentation error

91273b3

kylepjohnson closed this Apr 2, 2019

kylepjohnson mentioned this pull request Apr 2, 2019

Closing miscellaneous old PRs #892

Closed
